Ahmed Emad, Fady Bassel, Mark Refaat, Mohamed Abdelhamed

Publishing Date

December 28, 2020


When choosing the right video to watch, descriptions and timestamps play an important role. The main idea of the proposed project (Video Bite) is to generate descriptions and timestamps for videos automatically. Video Bite can greatly reduce time consumption: it saves users from watching unwanted videos and, through timestamps, lets them jump straight to the desired part of a video. The summarization relies on frames, emotions, and speech: the video is first summarized based on its frames, the audio is then summarized, and finally the two summaries are fused to produce a meaningful, accurate description of the video.

1.1 Purpose of this document

The main purpose of this SRS document is to illustrate and outline the requirements for our graduation project (Video Bite), which generates summarized descriptions for videos based on their audio, video, and emotions together, in addition to creating timestamps that describe what happens in each part of the video. Our aim is to save time for users who search for videos and to make it easier for content creators to generate their own descriptions.

1.2 Scope of this document

Our system, Video Bite, aims to help people who deal with videos. It helps them find an accurate description for any video so they do not waste time watching unrelated ones. Users receive a description and timestamps for the selected video. The system also helps content creators generate their video descriptions, keywords, and timestamps.

1.3 System Overview

The proposed system implements a new way of summarizing videos based on their audio, video, and emotions together, aiming to produce a readable and accurate description of the video. The summarized texts obtained from the audio, the video, and the emotions are then fused to produce one summarized text that serves as a meaningful description for the video.

We conducted a survey asking people about the importance of videos, whether they create videos, and how they upload them. The results showed that 73.5% of the respondents did not create or upload videos to any video platform. Of the 26.5% who do upload videos, 33.3% do not write a description for their videos; sometimes they write only a few words that do not describe the actual content. 87.2% of the respondents agreed that if they had such a tool, they would use it.

The proposed system overview is shown in Figure 1 and includes the following steps:

1. The video is separated into smaller clips every couple of seconds. This splitting is done according to scene changes, and each clip is then separated into frames and audio.

2. Each frame of the smaller clips passes through the preprocessing phase, in which blurry frames and noisy audio are removed.

3. The audio is transcribed by applying speech-to-text recognition. The output text is then passed to the LSTM and CNN models.

4. The three output text descriptions are fused together to produce a summarized, meaningful text.

5. This process is repeated for all the remaining smaller clips.

6. An abstractive summarization is then applied to the text using Natural Language Processing (NLP) techniques.

7. NLP is used to produce the video description and timestamps. Cutting the video at scene changes and labeling each scene makes that part of the video easy to find later.
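The scene-based splitting in step 1 can be sketched with a simple frame-difference heuristic. This is a minimal illustration of the idea, not the project's actual implementation; the `scene_boundaries` helper and its threshold are assumptions:

```python
import numpy as np

def scene_boundaries(frames, threshold=30.0):
    """Return the frame indices where a new scene starts, based on the
    mean absolute pixel difference between consecutive frames."""
    boundaries = [0]
    for i in range(1, len(frames)):
        diff = np.abs(frames[i].astype(float) - frames[i - 1].astype(float)).mean()
        if diff > threshold:  # large jump in pixel values -> scene change
            boundaries.append(i)
    return boundaries

# Synthetic example: two "scenes" of constant but different brightness.
scene_a = [np.full((4, 4), 10, dtype=np.uint8)] * 3
scene_b = [np.full((4, 4), 200, dtype=np.uint8)] * 3
print(scene_boundaries(scene_a + scene_b))  # [0, 3]
```

In practice the threshold would be tuned (or replaced by histogram comparison), and the boundary indices would be mapped back to timestamps using the video's frame rate.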
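The blurry-frame removal in step 2 is commonly done with a variance-of-Laplacian test: a sharp frame has many edges and therefore a high Laplacian variance. The sketch below, with a hypothetical threshold and a hand-rolled Laplacian, only illustrates the idea on grayscale frames:

```python
import numpy as np

def is_blurry(gray, threshold=100.0):
    """Heuristic blur test: low variance of the Laplacian means few edges."""
    g = gray.astype(float)
    # 4-neighbour Laplacian via shifted copies of the image.
    lap = (np.roll(g, 1, 0) + np.roll(g, -1, 0) +
           np.roll(g, 1, 1) + np.roll(g, -1, 1) - 4 * g)
    return bool(lap.var() < threshold)

flat = np.full((8, 8), 128, dtype=np.uint8)    # featureless -> treated as blurry
checker = np.indices((8, 8)).sum(0) % 2 * 255  # sharp edges everywhere
print(is_blurry(flat), is_blurry(checker))     # True False
```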
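The timestamp output of step 7 can be as simple as pairing each scene's start time with its generated label. The `format_timestamps` helper and the output format below are assumptions for illustration:

```python
def format_timestamps(scene_starts_sec, labels):
    """Render (start time, label) pairs as 'MM:SS label' lines."""
    lines = []
    for start, label in zip(scene_starts_sec, labels):
        minutes, seconds = divmod(int(start), 60)
        lines.append(f"{minutes:02d}:{seconds:02d} {label}")
    return "\n".join(lines)

print(format_timestamps([0, 95, 150], ["Intro", "Demo", "Summary"]))
# 00:00 Intro
# 01:35 Demo
# 02:30 Summary
```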

1.4 System Scope

• Summarize the video according to frames.

• Summarize the video according to audio.

• Summarize the video according to emotions.

• Fuse these three summaries.

• Time-stamp the video.

• Extract the video’s main keywords.
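As a baseline for the keyword-extraction item above, a simple frequency count over non-stopwords illustrates the idea; the stopword list and the `top_keywords` helper are illustrative only, and a real system would use a stronger method such as TF-IDF:

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "is", "to", "of", "and", "in", "for"}

def top_keywords(text, k=3):
    """Return the k most frequent non-stopword tokens in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return [word for word, _ in counts.most_common(k)]

print(top_keywords("The video summarizes the video content and the audio content", 2))
# ['video', 'content']
```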