Authors

Mayar Osama, Rawan Sherif, Alaa Yehia, Salma Mohamed

Publishing Date

April 21, 2021

Abstract

Our aim is to facilitate the lives of visually impaired people who may lack the technology to help them in their daily lives. This project focuses on the development of a mobile application that uses voice commands and text-to-voice technology to enable users to interact with it. The user captures the item they want to identify using the smartphone's camera, or browses their gallery and picks the image they want to upload for processing. The chosen image then undergoes processing depending on the feature the user selected through the given voice command; the features include food recognition, banknote recognition, text-to-voice, and color/pattern identification. The application also allows the user to register or log in using facial recognition to further simplify the experience. The application is designed to be effective and easily navigable by visually impaired users.

1.1 Purpose of this document

The purpose of this document is to explain in detail the requirements needed to develop a mobile application that assists visually impaired people with simple daily activities. The application aims to provide an easy and reliable way for visually impaired users to navigate their daily lives. This document serves as the basis that every developer on the development team will need in order to fulfill all requirements [1]. It outlines the functional and non-functional requirements, as well as other factors that collectively cover the development process of this application.

1.2 Scope of this document

This document targets the software developers who are part of the development team on this project. It contains an overview of our product and a simple explanation of the application, giving developers a clear idea without the need for future re-design [2]. It also describes the functionality the application shall fulfill.

1.3 System Overview

Input:

• The user logs in and signs up to the application using face recognition. While signing up, the user shall enter their information through speech-to-text (STT).

• When the user logs in, an audio menu is played so the user can choose a particular command.

• The user can navigate the mobile application using voice commands.

• The user shall say, using STT, whether the image contains clothes, food, money, or text.

• The user shall either open the camera, from which the best frame will be chosen, or upload an image from the gallery.

• The best frame is captured when the object is detected, and an alert is generated so the user knows that the frame was taken.

• The image then enters the pre-processing stage, in which several operations are performed on it: histogram equalization for contrast enhancement, and geometric transformations such as rotating the image.

• Image filtering is applied to the image to enhance its properties and to extract features from it. Image segmentation is also used to partition the image into several parts and regions (see the pre-processing sketch below).

• The extracted features will be saved in cloud storage so that they can enter the processing stage.

Processing:

The features are retrieved from cloud storage and then enter the model, which executes the command given by the user at the beginning in order to classify the result.
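Before classification, the captured image goes through the pre-processing steps listed above. A minimal sketch of those steps, assuming Python with OpenCV (the document does not name an image library; the function name and parameters here are illustrative):

```python
# Pre-processing sketch: histogram equalization, geometric
# transformation, filtering, and segmentation as described above.
import cv2
import numpy as np

def preprocess(image_path: str, angle: float = 0.0) -> np.ndarray:
    img = cv2.imread(image_path)

    # Histogram equalization for contrast enhancement, applied to
    # the luminance channel so the colors are preserved.
    ycrcb = cv2.cvtColor(img, cv2.COLOR_BGR2YCrCb)
    ycrcb[:, :, 0] = cv2.equalizeHist(ycrcb[:, :, 0])
    img = cv2.cvtColor(ycrcb, cv2.COLOR_YCrCb2BGR)

    # Geometric transformation: rotate the image around its center.
    h, w = img.shape[:2]
    m = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    img = cv2.warpAffine(img, m, (w, h))

    # Image filtering: Gaussian blur to suppress noise before
    # feature extraction.
    img = cv2.GaussianBlur(img, (5, 5), 0)

    # Simple segmentation: Otsu thresholding partitions the image
    # into foreground and background regions.
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    _, mask = cv2.threshold(gray, 0, 255,
                            cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return cv2.bitwise_and(img, img, mask=mask)
```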

1) Money

• If money was chosen, the system will analyze the banknote located in the image, and the result will enter the output stage.

• A CNN with the MobileNet architecture was used to recognize Egyptian banknotes.

• Transfer learning of the MobileNet CNN model was used, so the model did not have to be trained from scratch on a small dataset. The pre-trained model is frozen, and only the weights of the classifier are updated during training. Sequential layers are added to the model for training and inference, along with an output layer. MobileNet is based on depthwise separable convolutions, which filter the image.

• The first layer is a convolution layer, followed by batch normalization and ReLU non-linearity layers; the final layer is a fully connected layer without any non-linearity, which passes to a softmax for classification. The layers also include max pooling, which selects the maximum element from each region of the feature map and outputs a feature map containing only the most prominent features. This reduces the dimensions of the feature maps, the number of parameters to train, and the computation performed. The network was pre-trained on the ImageNet dataset (1.4M images and 1,000 classes).
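A minimal sketch of the transfer-learning setup described above, assuming TensorFlow/Keras (the framework is not named in this document) and an illustrative class count for the Egyptian banknote denominations:

```python
# Transfer-learning sketch for banknote recognition; the class
# count is an assumption, not taken from the document.
import tensorflow as tf

NUM_CLASSES = 7  # assumed number of Egyptian banknote denominations

# MobileNet pre-trained on ImageNet (1.4M images, 1000 classes),
# used as a frozen feature extractor.
base = tf.keras.applications.MobileNet(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")
base.trainable = False  # freeze the pre-trained weights

# Sequential layers added on top: only this classifier head is
# trained, and the output layer feeds softmax for classification.
model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=10)
```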

2) Clothes

• If clothes was chosen, the system will analyze the color and pattern of the clothing, and the result will enter the output stage.

• A CNN with the ResNet50 architecture was used to analyze the color and pattern of the clothing.

• The TensorFlow Object Detection API is used first for object detection. When clothes are detected, the image enters the ResNet50 model for classification. ResNet50 is a convolutional neural network that is 50 layers deep. Softmax and ReLU activation functions are used here: softmax in the last layer, and ReLU in the earlier layers. The model is capable of classifying around 1,000 clothing-related categories, recognizing all colors and many different patterns.
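A minimal sketch of the classification step for this branch, assuming TensorFlow/Keras. The TensorFlow Object Detection API step that locates the clothing is omitted; `crop` stands in for the detected region, and the stock ImageNet weights are used purely for illustration (a color/pattern-specific classifier head, as the document describes, would replace them):

```python
# Classification sketch for the clothes branch: a 50-layer ResNet
# with a softmax over 1000 classes, applied to a detected crop.
import numpy as np
import tensorflow as tf
from tensorflow.keras.applications.resnet50 import (
    ResNet50, preprocess_input, decode_predictions)

model = ResNet50(weights="imagenet")  # 50-layer CNN, 1000 outputs

def classify_clothing(crop: np.ndarray) -> list:
    """Classify a detected clothing crop (H x W x 3, RGB)."""
    x = tf.image.resize(crop, (224, 224)).numpy()
    x = preprocess_input(x[np.newaxis, ...])  # batch of one
    preds = model.predict(x)                  # softmax scores
    return decode_predictions(preds, top=3)[0]
```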

Output:

The classified result will be converted to speech (TTS) for the user to hear. In addition, when the input image contains clothes, the user can receive recommendations for what would match the color and pattern in the input image.
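A minimal sketch of this output step, assuming the gTTS library (the document does not name a TTS engine); the function and file names are illustrative:

```python
# Output sketch: speak the classified result back to the user.
from gtts import gTTS

def speak_result(label: str, confidence: float) -> None:
    text = f"I can see {label}, with {confidence:.0%} confidence."
    gTTS(text=text, lang="en").save("result.mp3")
    # The mobile client would then play result.mp3 to the user.
```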

1.4 System Scope

• Authenticate users using facial recognition

• Allow users to register using facial recognition

• Speech recognition is used to fill in the information spoken by the user at registration

• Users can use specific voice commands to execute the necessary action

• Convert text present in images to audio (see the sketch after this list)

• Identify the amount of the banknotes present in the image provided by the user

• Recognize food present in the image

• Recognize colors and patterns

• Offer recommendations based on the recognized colors and patterns

• Users can edit the information in their profile using speech recognition

• The admin can edit, delete, change, and update users' data

• Processed data from the images is converted to sound
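As an illustration of the text-to-audio item above, a minimal sketch assuming pytesseract for OCR and gTTS for speech, neither of which is named in this document:

```python
# OCR-to-speech sketch: extract text from an image, then speak it.
import pytesseract
from PIL import Image
from gtts import gTTS

def read_text_aloud(image_path: str) -> None:
    text = pytesseract.image_to_string(Image.open(image_path))
    if text.strip():
        gTTS(text=text, lang="en").save("spoken_text.mp3")
```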