Ahmed Emad, Mark Refaat, Fady Bassel, Mohamed Abdelhamed

Publishing Date

October 31, 2020


Descriptions and timestamps play an important role in choosing the right video to watch. The main idea of the proposed project, Video Bite, is to generate descriptions and timestamps for videos automatically. Video Bite aims to reduce wasted time: an accurate description saves users from watching unwanted or click-bait videos, and timestamps let them watch only the desired part of a video. The summarization relies on frames, emotions, and speech. The video is first summarized based on its frames, then the audio is summarized, and finally the two summaries are fused to produce a meaningful, accurate description of the video.

1.1 Background

There are millions of videos on the internet, and more are uploaded every second. This large collection makes it very hard to find the video you are looking for. Platforms introduced video descriptions and tags to address this problem, but it persists, and a further problem, click-bait, has appeared. As technology has advanced, applications addressing this problem have emerged, including some recent ones that summarize or describe videos. They mainly use models such as convolutional neural networks (CNN) [1] and long short-term memory networks (LSTM) [2], together with natural language processing [3]. Some of these applications use the video frames to generate the summary and others use the audio; both produce a video as output. To our knowledge, none of these applications generates a textual description based on both the video frames and the audio, which we expect to reflect most accurately what is really happening in the video.

1.2 Motivation

Our academic motivation builds on past work in video summarization, where two main approaches have been pursued. The first approach, taken by Quang Dieu Tran and Dosam Hwang [4] as well as Zhong Ji and Kailin Xiong [5], aims to output a shorter version of the input video. The second approach aims to output a text description of the input video. Anna Rohrbach, Marcus Rohrbach and Bernt Schiele [6] focused on identifying verbs, objects and places, and used an LSTM for sentence generation. N. Krishnamoorthy and G. Malkarnenkar [7] likewise focused on extracting subjects, objects and verbs from video frames. Some of these works listed as future work the introduction of modalities other than video frames, such as dialogue or audio. Vasili Ramanishka and Abir Das [8] used object recognition, video category, and audio features to generate their descriptions, and Chiori Hori and Takaaki Hori [9] proposed an attention model to handle the fusion of multiple modalities such as image features, motion features, and audio features.

What is interesting and challenging about our proposed idea, and where it can offer an improvement, is the fusion of two modalities, audio transcription and frame description, which could lead to more accurate descriptions and video timestamps. A somewhat similar approach was presented by Haoran Li and Junnan Zhu, who combined documents and videos related to a topic and measured the salience score of the text, including the sentences in the documents and the transcripts of speech from the recordings. According to their conclusion, however, integrating video and audio did not improve the output compared with the text-only model. This makes sense, as summarizing a well-written document may be enough to produce acceptable results.

As for our business motivation, online videos matter to a great many people, but viewers face problems when trying to find the best video to watch: inaccurate descriptions written to attract views, and long videos in which a specific part is hard to locate. There is a lack of tools that can summarize videos accurately enough to produce a good description, which would greatly help content creators and YouTubers describe their videos. There is also a widespread need for timestamps, which could save students and searchers a great deal of time in finding the exact topic or part they need to watch.

1.3 Problem Statement

The video industry is growing rapidly: 500 hours of video are uploaded to YouTube every minute [10]. Content creators find it difficult to write descriptions that make their videos easy for viewers to find on any video platform. Viewers, in turn, find it hard to get the video they really need within the first few attempts because of vague descriptions and click-bait. Some existing systems skim the video and summarize it into smaller chunks, and others transcribe the video based on its speech or its scenes. Our proposed system will use both the speech in the video and the objects in its scenes to generate a text description of what is actually happening in the video. It will also add timestamps, so that when a video covers more than one topic, it is clear what was being said at which time. This will also help with search engine optimization.
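
To make the intended output more concrete, the following is a minimal Python sketch of how frame-based captions and speech transcripts might be fused into a timestamped description. It is an illustration under stated assumptions, not the Video Bite implementation: the `Segment` type and the `fuse`, `format_timestamp`, and `describe` functions are hypothetical names, the captions and transcripts are assumed to come from upstream frame and speech models that are not shown, and simple time overlap stands in for whatever fusion method the system eventually uses.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """A piece of text tied to a time span of the video (hypothetical type)."""
    start: float  # seconds from the beginning of the video
    end: float
    text: str


def overlaps(a: Segment, b: Segment) -> bool:
    """Return True when the two time spans share any interval."""
    return a.start < b.end and b.start < a.end


def fuse(frame_captions: list[Segment], transcript: list[Segment]) -> list[Segment]:
    """Attach to each spoken segment the captions of frames shown during it.

    Assumption: time overlap is used here as a stand-in fusion rule.
    """
    fused = []
    for spoken in transcript:
        visual = [c.text for c in frame_captions if overlaps(c, spoken)]
        text = spoken.text
        if visual:
            text += " (on screen: " + "; ".join(visual) + ")"
        fused.append(Segment(spoken.start, spoken.end, text))
    return fused


def format_timestamp(seconds: float) -> str:
    """Render seconds as MM:SS, the style commonly used in video descriptions."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def describe(fused: list[Segment]) -> str:
    """Build one description line per fused segment, prefixed by its start time."""
    return "\n".join(f"{format_timestamp(s.start)} {s.text}" for s in fused)


if __name__ == "__main__":
    # Made-up example data; real values would come from the upstream models.
    captions = [
        Segment(0, 40, "a person at a whiteboard"),
        Segment(40, 95, "a laptop showing code"),
    ]
    speech = [
        Segment(0, 35, "Introduction to sorting"),
        Segment(35, 95, "Implementing quicksort"),
    ]
    print(describe(fuse(captions, speech)))
```

Each output line starts with the time at which a topic begins, so a viewer can jump directly to the part they need, and the same text can serve as the video's description.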