
SIGN LANGUAGE DETECTION

MINOR PROJECT SYNOPSIS

Submitted in partial fulfilment of the requirements for the
degree of
BACHELOR OF TECHNOLOGY
in
INFORMATION TECHNOLOGY
By

Ayush Dixit Harshit Khatter Manish Sharma

01611503120 03311503120 04411503120

Guided by
Dr. Prakhar Priyadarshi
Head of Department (IT)

Department of Information Technology


BHARATI VIDYAPEETH’S COLLEGE OF ENGINEERING
PASCHIM VIHAR, NEW DELHI
October 2023
CANDIDATE'S DECLARATION

It is hereby certified that the work which is being presented in the B. Tech Minor Project Report
entitled "SIGNO-FY", in partial fulfilment of the requirements for the award of the degree of
Bachelor of Technology, submitted to the Department of Information Technology of BHARATI
VIDYAPEETH'S COLLEGE OF ENGINEERING, New Delhi (Affiliated to Guru Gobind Singh
Indraprastha University, Delhi), is an authentic record of our own work carried out under the
guidance of Dr. Prakhar Priyadarshi, HOD-IT.

The matter presented in this B. Tech Minor Project Report has not been submitted by us for the
award of any other degree of this or any other institute.

Ayush Dixit Harshit Khatter Manish Sharma

01611503120 03311503120 04411503120

This is to certify that the above statement made by the candidates is correct to the best of my knowledge. They
are permitted to appear in the External Minor Project Examination.

Dr. Prakhar Priyadarshi


Head of Department, IT
TABLE OF CONTENTS

CHAPTER 1: 1.1 ABSTRACT


1.2 INTRODUCTION
1.2.1 DATA SET GENERATION
1.2.2 HAND SEGMENTATION
1.2.3 FEATURE EXTRACTION PROCESS
1.3 AIMS AND OBJECTIVES

CHAPTER 2: 2.1 METHODOLOGY


2.1.1 GESTURE CLASSIFICATION
2.1.2 FINGER SPELLING SENTENCE FORMATION
2.2 LITERATURE SURVEY

CHAPTER 3: 3.1 CONCLUSION


3.1.1 CONCLUSION
3.1.2 CHALLENGES FACED
3.1.3 FUTURE SCOPE
3.2 APPENDIX
3.2.1 CONVOLUTIONAL NEURAL NETWORK
3.2.2 TENSORFLOW
3.3 REFERENCES
CHAPTER 1
1.1 ABSTRACT

More than 5% of the world's population is affected by hearing impairment. To overcome the
challenges faced by these individuals, various sign languages have been developed as an easy and
efficient means of communication. Sign language depends on signs and gestures that give meaning
to what is being communicated. Researchers are actively investigating methods to develop sign
language recognition systems, but they face many challenges during the implementation of such
systems, including the recognition of hand poses and gestures. Furthermore, some signs have similar
appearances, which adds to the complexity of creating recognition systems. This report focuses on a
sign language alphabet recognition system because the letters are the core of any language.
Moreover, the system presented here can be considered a starting point for developing more
complex systems.

People dealing with hearing or speech impairment make use of sign language for effective
communication. Sign language uses finger-spelling and word-level gestures for communication.
Interpreters can be sparse, and a lack of knowledge of sign language can pose a
communication barrier for signers. Sensor-based approaches developed earlier had limited
success because of the hardware components involved. To overcome this, machine learning,
image classification and object detection techniques have been employed over the years for
recognizing sign language. Combining 3D CNNs with RNNs enables sequential modelling but
requires a large number of pre-processing steps for training. This report proposes a
solution that recognizes the hand gestures used in sign language by leveraging LSTM and object
detection techniques while taking into consideration the shortcomings of previous approaches.
1.2 INTRODUCTION

People dealing with hearing or speech impairment make use of sign language for effective
communication. Most people cannot understand sign language without prior knowledge of it,
which makes it difficult for them to communicate with the hearing and speech impaired. A survey
by the National Association of the Deaf (NAD) found that approximately 18 million Indians suffer
from hearing loss. Even interpreters who do know sign language are sparsely available and charge
a hefty fee. Using pen and paper to convey messages is a time-consuming and tedious process, so
it is not feasible.

American Sign Language (ASL) is a predominant sign language. Since the only disability deaf and
dumb (D&M) people have is communication related and they cannot use spoken languages, the only
way for them to communicate is through sign language. Communication is the process of exchanging
thoughts and messages in various ways, such as speech, signals, behaviour and visuals. D&M
people use their hands to express different gestures and convey their ideas to other people.
Gestures are non-verbally exchanged messages that are understood with vision. This non-verbal
communication of D&M people is called sign language.

Sign language is a visual language and consists of three major components:

- Finger spelling: the signer signs out a word, letter by letter.
- Word-level sign vocabulary: each gesture the signer makes represents an actual word; this is
  faster and more commonly used.
- Facial expressions: additional cues to be considered while interpreting the signer.

In our project we focus on producing a model which can recognise finger-spelling-based hand
gestures in order to form a complete word by combining each gesture.

1.2.1 Data Set Generation

In this step the dataset has been generated for model training using the webcams of the team
members. Currently, the dataset includes 7 gestures: "Hello", "I Love You", "Yes", "No", "Thank
You", "Sorry" and "Please". Thirty videos have been captured for each class, and each video
consists of 30 frames showing the same gesture. Alternating hand usage and different angles and
poses were ensured while capturing these videos. Data acquisition is followed by proper labelling
of each image according to the gesture it represents.
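
To make this concrete, the following is a minimal sketch of how such a webcam capture loop could look with OpenCV. The gesture names, 30 sequences per class and 30 frames per sequence come from the description above; the directory layout (dataset/&lt;gesture&gt;/&lt;sequence&gt;/&lt;frame&gt;.jpg) and the on-screen overlay are illustrative assumptions, not the project's exact code.

```python
import os
import cv2

# Gestures and capture sizes taken from the text above; paths are illustrative.
GESTURES = ["Hello", "I Love You", "Yes", "No", "Thank You", "Sorry", "Please"]
NUM_SEQUENCES = 30    # videos per gesture
SEQUENCE_LEN = 30     # frames per video
DATA_DIR = "dataset"  # hypothetical output folder

cap = cv2.VideoCapture(0)  # default webcam
for gesture in GESTURES:
    for seq in range(NUM_SEQUENCES):
        out_dir = os.path.join(DATA_DIR, gesture, str(seq))
        os.makedirs(out_dir, exist_ok=True)
        for frame_idx in range(SEQUENCE_LEN):
            ok, frame = cap.read()
            if not ok:
                continue
            # Overlay progress so the signer knows which sequence is recording.
            cv2.putText(frame, f"{gesture} seq {seq} frame {frame_idx}",
                        (10, 30), cv2.FONT_HERSHEY_SIMPLEX, 0.7, (0, 255, 0), 2)
            cv2.imshow("Collecting", frame)
            cv2.imwrite(os.path.join(out_dir, f"{frame_idx}.jpg"), frame)
            if cv2.waitKey(1) & 0xFF == ord("q"):
                break
cap.release()
cv2.destroyAllWindows()
```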

1.2.2 Hand Segmentation in Images: This process is crucial for our proposed solution. In this step
the image is broken down into small segments, so that before it reaches the feature extraction step
it already carries more accurate image attributes. One way of doing segmentation is to separate the
background of the image from the object in the image. The more meticulous we are during
segmentation, the more accurate the recognition will be later. For annotating and labelling our
images, a Python labelling library has been used, which helps in labelling object bounding boxes in
images. After a segment is chosen, this tool generates the characteristics of the chosen segment,
such as the size of the bounding box.
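
As an illustration of how those bounding boxes might be used, the sketch below crops the hand region out of each labelled image. The CSV annotation format (filename, xmin, ymin, xmax, ymax, label) is a hypothetical export format, not necessarily what the labelling tool used in the project produces.

```python
import csv
import os
import cv2

# Hypothetical annotation format: one CSV row per labelled image with the
# bounding box produced by the labelling tool (filename, xmin, ymin, xmax, ymax, label).
def crop_hands(annotation_csv: str, out_dir: str) -> None:
    os.makedirs(out_dir, exist_ok=True)
    with open(annotation_csv, newline="") as f:
        for row in csv.DictReader(f):
            image = cv2.imread(row["filename"])
            if image is None:
                continue
            x1, y1, x2, y2 = (int(row[k]) for k in ("xmin", "ymin", "xmax", "ymax"))
            hand = image[y1:y2, x1:x2]                       # keep only the hand region
            name = os.path.basename(row["filename"])
            cv2.imwrite(os.path.join(out_dir, f"{row['label']}_{name}"), hand)
```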
1.2.3 Feature Extraction Process: We used the Open Source Computer Vision (OpenCV) library to
produce our dataset. First, we captured around 800 images of each symbol in ASL for training
purposes and around 200 images per symbol for testing.
1.3 Aims and Objectives

The primary aim of our project is to develop a real-time sign language recognition system that can
effectively interpret and translate sign language gestures into text or speech. By achieving this goal,
we intend to improve accessibility for the deaf and hard of hearing communities, thereby reducing
communication barriers and enhancing human-machine interaction. Our overarching aim includes
creating a user-friendly interface, supporting multiple sign languages, ensuring high accuracy and
low latency, exploring educational and training applications, and facilitating communication in
emergency situations. We are committed to continuous improvement, incorporating user feedback
and adhering to ethical and inclusive design principles throughout the development process.
Ultimately, our project seeks to empower individuals who use sign language as their primary mode
of communication by providing them with an effective and reliable means of expressing themselves
in various contexts.

1. Creation of a custom image dataset using a webcam.

2. Hand Segmentation for the Object Detection Model: a data pre-processing step in which
segmentation and labelling of the image dataset is carried out using proper annotations
according to American Sign Language.

3. Feature Extraction for the LSTM-based Approach: important features are extracted into a NumPy
array suitable as LSTM model input. Face, hand and pose landmarks are captured and converted
into a model-ready form (a keypoint-extraction and LSTM sketch follows this list).

4. Use of deep learning algorithms such as LSTM to train our ASL gesture recognition model.

5. Use of TensorBoard to monitor model training and perform iterations to improve prediction
accuracy for certain classes.

6. Use of the SSD MobileNet model (object detection) to train an ASL gesture recognition model.

7. Testing of both models by predicting gestures in real time and comparing the results.
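
To make objectives 3 and 4 concrete, here is a minimal sketch of extracting per-frame face, hand and pose landmarks into a NumPy array and feeding 30-frame sequences to a small Keras LSTM. The sequence length and the seven gesture classes come from section 1.2.1; the use of MediaPipe Holistic, the landmark layout and the exact layer sizes are illustrative assumptions rather than the project's actual implementation.

```python
import cv2
import numpy as np
import mediapipe as mp
import tensorflow as tf

mp_holistic = mp.solutions.holistic

def detect(frame_bgr, holistic):
    """Run MediaPipe Holistic on one BGR frame and return its results object."""
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
    return holistic.process(rgb)

def extract_keypoints(results) -> np.ndarray:
    """Flatten pose, face and both hand landmarks into a single feature vector."""
    pose = (np.array([[p.x, p.y, p.z, p.visibility] for p in results.pose_landmarks.landmark]).flatten()
            if results.pose_landmarks else np.zeros(33 * 4))
    face = (np.array([[p.x, p.y, p.z] for p in results.face_landmarks.landmark]).flatten()
            if results.face_landmarks else np.zeros(468 * 3))
    lh = (np.array([[p.x, p.y, p.z] for p in results.left_hand_landmarks.landmark]).flatten()
          if results.left_hand_landmarks else np.zeros(21 * 3))
    rh = (np.array([[p.x, p.y, p.z] for p in results.right_hand_landmarks.landmark]).flatten()
          if results.right_hand_landmarks else np.zeros(21 * 3))
    return np.concatenate([pose, face, lh, rh])

SEQUENCE_LEN = 30                              # frames per gesture video (section 1.2.1)
NUM_CLASSES = 7                                # the seven gestures listed in section 1.2.1
FEATURES = 33 * 4 + 468 * 3 + 2 * 21 * 3       # length of one keypoint vector

# Illustrative LSTM classifier over 30-frame keypoint sequences.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(SEQUENCE_LEN, FEATURES)),
    tf.keras.layers.LSTM(64, return_sequences=True, activation="relu"),
    tf.keras.layers.LSTM(128, return_sequences=False, activation="relu"),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

# Typical use: with mp_holistic.Holistic(min_detection_confidence=0.5,
#                                        min_tracking_confidence=0.5) as holistic:
#     keypoints = extract_keypoints(detect(frame, holistic))
```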
CHAPTER 2
2.1 Methodology

2.1.1 GESTURE CLASSIFICATION

Our approach uses two layers of algorithms to predict the final symbol shown by the user.

Algorithm Layer 1:

1. Apply a Gaussian blur filter and threshold to the frame captured with OpenCV to obtain the
processed image after feature extraction (a preprocessing sketch follows this list).

2. This processed image is passed to the CNN model for prediction; if a letter is detected for
more than 50 frames, the letter is printed and considered for forming the word.

3. Spaces between words are inserted using the blank symbol.
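
A minimal sketch of step 1's preprocessing with OpenCV is shown below; the Gaussian blur, thresholding and 128x128 input size come from this chapter, while the kernel size and adaptive-threshold parameters are assumptions.

```python
import cv2

def preprocess_frame(frame):
    """Gaussian blur + threshold, roughly as described in step 1 above."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    blurred = cv2.GaussianBlur(gray, (5, 5), 2)             # kernel size is an assumption
    # Adaptive thresholding separates the hand from the background before prediction.
    thresh = cv2.adaptiveThreshold(blurred, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                   cv2.THRESH_BINARY_INV, 11, 2)
    return cv2.resize(thresh, (128, 128))                   # CNN input size from this chapter
```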

Algorithm Layer 2:

1. We detect the sets of symbols which give similar results when detected.

2. We then classify between the symbols in those sets using classifiers trained for those sets only.

Layer 1:

CNN Model:

1. 1st Convolution Layer: The input picture has a resolution of 128x128 pixels. It is first processed in
the first convolutional layer using 32 filter weights (3x3 pixels each). This results in a 126x126
pixel image, one for each filter weight.

2. 1st Pooling Layer: The pictures are downsampled using max pooling of 2x2, i.e. we keep the
highest value in each 2x2 square of the array. The picture is therefore downsampled to 63x63 pixels.

3. 2nd Convolution Layer: The 63x63 output of the first pooling layer serves as input to the second
convolutional layer. It is processed using 32 filter weights (3x3 pixels each), resulting in a
61x61 pixel image.

4. 2nd Pooling Layer: The resulting images are downsampled again using max pooling of 2x2 and
reduced to a 30x30 resolution.

5. 1st Densely Connected Layer: The output of the second pooling layer is reshaped to an array of
30x30x32 = 28800 values and fed to a fully connected layer with 128 neurons. The output of this
layer is fed to the 2nd densely connected layer. A dropout layer with rate 0.5 is used to avoid
overfitting.

6. 2nd Densely Connected Layer: The output from the 1st densely connected layer is used as input
to a fully connected layer with 96 neurons.

7. Final Layer: The output of the 2nd densely connected layer serves as input for the final layer,
which has as many neurons as the number of classes we are classifying (alphabets + blank symbol).

Activation Function:

We have used ReLU (Rectified Linear Unit) in each of the layers (convolutional as well as fully
connected). ReLU computes max(x, 0) for each input value. This adds non-linearity and helps the
network learn more complicated features. It also mitigates the vanishing gradient problem and
speeds up training by reducing computation time.

Pooling Layer: We apply max pooling with a pool size of (2, 2) after the ReLU activation. This
reduces the number of parameters, which lessens the computational cost and reduces overfitting.

Dropout Layers: These address the problem of overfitting, where after training the weights of the
network are so tuned to the training examples that the network does not perform well on new
examples. A dropout layer "drops out" a random set of activations in that layer by setting them to
zero. The network should still be able to provide the right classification or output for a specific
example even if some of the activations are dropped out.

Optimizer: We have used the Adam optimizer to update the model in response to the output of the
loss function. Adam combines the advantages of two extensions of stochastic gradient descent,
namely the adaptive gradient algorithm (AdaGrad) and root mean square propagation (RMSProp).
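
Putting the layer description above together, a minimal Keras sketch of the CNN could look as follows. The 27-class output (26 letters plus the blank symbol), the dropout placement after the first dense layer and the compile settings are assumptions for illustration, not the project's exact code.

```python
import tensorflow as tf

NUM_CLASSES = 27  # assumed: 26 ASL alphabet letters + the blank symbol

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(128, 128, 1)),                 # 128x128 grayscale input
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu"),      # -> 126x126x32
    tf.keras.layers.MaxPooling2D((2, 2)),                       # -> 63x63x32
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu"),      # -> 61x61x32
    tf.keras.layers.MaxPooling2D((2, 2)),                       # -> 30x30x32
    tf.keras.layers.Flatten(),                                  # -> 28800 values
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.5),                               # reduce overfitting
    tf.keras.layers.Dense(96, activation="relu"),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),   # alphabets + blank
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```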

Layer 2

We use two layers of algorithms to verify and predict symbols that are very similar to each other,
so that we get as close as possible to detecting the symbol actually shown. In our testing we found
that the following symbols were not recognized reliably and were being confused with other symbols:

1. For D : R and U

2. For U : D and R

3. For I : T, D, K and I

4. For S : M and N

So, to handle the above cases, we made three different classifiers for classifying these sets (a routing sketch follows the list below):

1. {D,R,U}

2. {T,K,D,I}

3. {S,M,N}
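
A minimal sketch of how this two-layer routing could work is shown below, assuming the main model and each specialised subset classifier are Keras models passed in by the caller; the function and parameter names are illustrative.

```python
import numpy as np

# Confusable sets from the list above; each maps to a specialised classifier.
SUBSETS = [("D", "R", "U"), ("T", "K", "D", "I"), ("S", "M", "N")]

def predict_symbol(frame_features, main_model, subset_models, main_labels):
    """Layer 1: main CNN prediction; Layer 2: re-check confusable letters."""
    probs = main_model.predict(frame_features[np.newaxis, ...], verbose=0)[0]
    letter = main_labels[int(np.argmax(probs))]
    for subset in SUBSETS:
        if letter in subset:
            sub_model = subset_models[subset]          # classifier trained on this subset only
            sub_probs = sub_model.predict(frame_features[np.newaxis, ...], verbose=0)[0]
            return subset[int(np.argmax(sub_probs))]   # refined prediction
    return letter
```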

2.1.2 Finger Spelling Sentence Formation Implementation:

1. Whenever the count of a detected letter exceeds a specific value and no other letter is within a
threshold of it, we print the letter and add it to the current string (in our code we kept the value
as 50 and the difference threshold as 20).
2. Otherwise we clear the current dictionary, which holds the detection counts for the present
symbol, to reduce the chance of a wrong letter being predicted.

3. Whenever the count of detected blanks (plain background) exceeds a specific value and the
current buffer is empty, no space is added.

4. Otherwise, the end of a word is predicted by printing a space, and the current word gets appended
to the sentence below (a sketch of this logic follows the list).
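
Below is a rough sketch of this counting logic, using the 50-count and 20-margin thresholds quoted above; the class and attribute names are illustrative, and "blank" stands for the plain-background symbol.

```python
from collections import Counter

ACCEPT_COUNT = 50   # detections needed before a symbol is accepted (from the text)
MARGIN = 20         # required lead over the runner-up symbol (from the text)

class SentenceBuilder:
    """Accumulate per-frame symbol predictions into words and a sentence."""

    def __init__(self):
        self.counts = Counter()
        self.word = ""
        self.sentence = ""

    def update(self, symbol: str) -> None:
        """Feed one per-frame prediction ('A'..'Z' or 'blank')."""
        self.counts[symbol] += 1
        ranked = self.counts.most_common(2)
        top, top_count = ranked[0]
        runner_up = ranked[1][1] if len(ranked) > 1 else 0
        if top_count <= ACCEPT_COUNT:
            return                                   # not enough evidence yet
        if top_count - runner_up > MARGIN:
            if top == "blank":
                if self.word:                        # end of word: flush it to the sentence
                    self.sentence += self.word + " "
                    self.word = ""
            else:
                self.word += top                     # accept the stable letter
        # In either case, reset the counts so the next symbol starts fresh.
        self.counts.clear()
```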

Autocorrect Feature:

A Python library, Hunspell, is used to suggest correct alternatives for each (incorrect) input word.
We display a set of words matching the current word, from which the user can select one to append
to the current sentence. This helps reduce spelling mistakes and assists in predicting complex
words.
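
A minimal sketch of such a suggestion step, assuming the pyhunspell binding and an installed en_US dictionary (the dictionary paths below are typical Linux locations and are not specified in this report):

```python
import hunspell

# Typical Debian/Ubuntu dictionary paths; adjust for your system.
checker = hunspell.HunSpell("/usr/share/hunspell/en_US.dic",
                            "/usr/share/hunspell/en_US.aff")

def suggestions(word: str, limit: int = 5):
    """Return the word itself if spelled correctly, else up to `limit` fixes."""
    if checker.spell(word):
        return [word]
    return checker.suggest(word)[:limit]

print(suggestions("helo"))   # e.g. ['hello', 'help', ...]
```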

Training and Testing:

We convert our input images (RGB) into grayscale and apply a Gaussian blur to remove unnecessary
noise. We apply an adaptive threshold to extract the hand from the background and resize the
images to 128x128. After applying all the operations mentioned above, we feed the preprocessed
images to our model for training and testing. The prediction layer estimates how likely the image
is to fall under each of the classes. The output is therefore normalized between 0 and 1 such that
the values across the classes sum to 1; we achieve this using the softmax function. At first, the
output of the prediction layer is somewhat far from the actual value. To improve it, we train the
network using labelled data. Cross-entropy is a performance measure used in classification. It is a
continuous function which is positive for values that differ from the labelled value and is exactly
zero when it equals the labelled value. We therefore minimize the cross-entropy, driving it as
close to zero as possible by adjusting the weights of the network. TensorFlow has an inbuilt
function to calculate the cross-entropy, which we minimize using gradient descent, specifically
the Adam optimizer described above.
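
To illustrate the softmax normalisation and cross-entropy measure described above, here is a small self-contained TensorFlow example; the three-class scores are toy values, not outputs of the project's model.

```python
import tensorflow as tf

# Softmax turns raw scores into probabilities that sum to 1 across the classes,
# and categorical cross-entropy is zero exactly when the predicted distribution
# matches the one-hot label, and positive otherwise (as described above).
logits = tf.constant([[2.0, 0.5, -1.0]])                 # raw scores for 3 toy classes
probs = tf.nn.softmax(logits)                            # approx. [0.79, 0.18, 0.04]
label = tf.constant([[1.0, 0.0, 0.0]])                   # one-hot ground truth

loss = tf.keras.losses.categorical_crossentropy(label, probs)
print(float(tf.reduce_sum(probs)))                       # 1.0: normalized output
print(float(loss[0]))                                    # small positive value; 0 only for a perfect match

# Training minimizes this loss by adjusting the network weights, here with Adam,
# while a TensorBoard callback (tf.keras.callbacks.TensorBoard) can log progress.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)
```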
2.2 Literature Survey

CHAPTER 3
3.1 CONCLUSION

3.1.1 Conclusion:

In this report, a functional real-time vision-based American Sign Language recognition system for
D&M people has been developed for the ASL alphabet. We achieved a final accuracy of 98.0% on our
dataset. We were able to improve our predictions after implementing two layers of algorithms in
which we verify and predict symbols that are very similar to each other. In this way we are able to
detect almost all the symbols, provided that they are shown properly, there is no noise in the
background and the lighting is adequate.

3.1.2 Challenges Faced:

We faced many challenges during the project. The very first issue was the dataset. We wanted to
work with raw, square images, since handling only square images is much more convenient for a CNN
in Keras. We could not find any existing dataset that met this requirement, so we decided to make
our own. The second issue was selecting a filter to apply to our images so that proper features
could be obtained and the result provided as input to the CNN model. We tried various filters
including binary thresholding, Canny edge detection and Gaussian blur, and finally settled on the
Gaussian blur filter. Further issues related to the accuracy of the models trained in earlier
phases, which we eventually improved by increasing the input image size and by improving the
dataset.

3.1.3 Future Scope:

We plan to achieve higher accuracy even with complex backgrounds by trying out various background
subtraction algorithms. We are also considering improving the preprocessing to predict gestures in
low-light conditions with higher accuracy.
3.2 APPENDIX

3.2.1 Convolutional Neural Network


CNNs use a variation of multilayer perceptrons designed to require minimal preprocessing.
They are also known as shift-invariant or space-invariant artificial neural networks (SIANN),
based on their shared-weights architecture and translation-invariance characteristics.
Convolutional networks were inspired by biological processes, in that the connectivity
pattern between neurons resembles the organization of the animal visual cortex.

Individual cortical neurons respond to stimuli only in a restricted region of the visual field
known as the receptive field. The receptive fields of different neurons partially overlap such
that they cover the entire visual field. CNNs use relatively little pre-processing compared to
other image classification algorithms: the network learns the filters that in traditional
algorithms were hand-engineered. This independence from prior knowledge and human effort in
feature design is a major advantage. They have applications in image and video recognition,
recommender systems, image classification, medical image analysis, and natural language
processing.

3.2.2 TensorFlow
TensorFlow is an open-source software library for dataflow programming across a range of
tasks. It is a symbolic math library and is also used for machine learning applications such as
neural networks. It is used for both research and production at Google. TensorFlow was
developed by the Google Brain team for internal Google use and was released under the
Apache 2.0 open-source license on November 9, 2015.

TensorFlow is Google Brain's second-generation system. Version 1.0.0 was released on
February 11, 2017. While the reference implementation runs on single devices, TensorFlow
can run on multiple CPUs and GPUs (with optional CUDA and SYCL extensions for general-
purpose computing on graphics processing units). TensorFlow is available on 64-bit Linux,
macOS, Windows, and mobile computing platforms including Android and iOS. Its flexible
architecture allows for the easy deployment of computation across a variety of platforms
(CPUs, GPUs, TPUs), and from desktops to clusters of servers to mobile and edge devices.
