Enabling Object Detection Through Speech for Visually Impaired-2
DHINAHARI R (810020104023)
DHIVYA A S (810020104024)
BACHELOR OF ENGINEERING
in
COMPUTER SCIENCE AND ENGINEERING
MAY 2024
UNIVERSITY COLLEGE OF ENGINEERING
BHARATHIDASAN INSTITUTE OF TECHNOLOGY CAMPUS
ANNA UNIVERSITY, TIRUCHIRAPPALLI – 620024.
BONAFIDE CERTIFICATE
SIGNATURE SIGNATURE
DECLARATION
ACKNOWLEDGEMENT
We also express our sincere thanks to all other staff members, friends, and
our parents for their help and encouragement.
ABSTRACT
KEYWORDS:
TABLE OF CONTENTS
4.5.1 TensorFlow.js
4.5.2 DeepLearn.js
4.5.2.1 Transition to TensorFlow.js
4.5.3 tfjs-converter
4.6 Conclusion
5. SYSTEM DESIGN
5.1 Architectural Diagram
5.2 UML Diagram
5.2.1 Dataflow Diagram
6. DATASET AGGREGATION AND ACQUISITION
6.1 Introduction
6.2 Initial Dataset Collection
6.3 User Interaction for Sample Images
6.4 Real-Time Image Acquisition
6.5 Dataset Pre-processing
6.5.1 Cropping the Image
6.5.2 Image Augmentation
6.6 Dynamic Learning Approach
6.7 Annotation and Labelling
6.8 Data Storage and Management
6.9 Conclusion
7. SYSTEM IMPLEMENTATION
7.1 Modules
7.2 Module Description
7.2.1 Image Acquisition Module
7.2.2 Preprocessing Module
7.2.3 Feature Extraction Module
7.2.4 Classification Module
7.2.5 Post-Processing Module
7.2.6 Evaluation and Validation
7.2.7 Webapp Implementation
7.2.8 Implementation
8. CONCLUSION AND FUTURE ENHANCEMENTS
8.1 Conclusion
8.2 Future Enhancements
8.3 Object Detection Process
REFERENCES
LIST OF FIGURES
LIST OF ACRONYMS
CHAPTER 1
INTRODUCTION
1.1 INTRODUCTION
Moreover, by integrating audio feedback alongside visual recognition, our
system enhances accessibility and user experience across a wide range of
applications, whether assisting visually impaired individuals in navigating
their surroundings or facilitating interactive experiences in augmented
reality environments.
Feature Extraction: Learn about different audio features that can be extracted
to represent audio signals effectively. These features may include spectrogram
representations, Mel-frequency cepstral coefficients (MFCCs), pitch, energy,
zero-crossing rate, and more.
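As a concrete illustration, the sketch below (plain JavaScript; the frame layout is an assumption for illustration) computes two of the simpler features named above, short-time energy and zero-crossing rate, for a single frame of audio samples.

// Sketch: short-time energy and zero-crossing rate for one audio frame.
// `frame` is assumed to be a Float32Array of PCM samples in [-1, 1].
function shortTimeEnergy(frame) {
  let sum = 0;
  for (let i = 0; i < frame.length; i++) {
    sum += frame[i] * frame[i]; // accumulate squared amplitude
  }
  return sum / frame.length; // mean energy of the frame
}

function zeroCrossingRate(frame) {
  let crossings = 0;
  for (let i = 1; i < frame.length; i++) {
    // count sign changes between consecutive samples
    if ((frame[i - 1] >= 0) !== (frame[i] >= 0)) crossings++;
  }
  return crossings / (frame.length - 1);
}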
TensorFlow.js: Understand TensorFlow.js, a JavaScript library for training and
deploying machine learning models in web browsers and Node.js environments,
and its significance in enabling on-device inference and real-time processing
without the need for server-side computations.
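A hedged sketch of what such in-browser inference can look like follows; the model path and the 227x227 input size are illustrative assumptions, and in practice the model would be loaded once rather than per frame.

// Sketch: load a converted model and classify a video frame entirely in the browser.
import * as tf from '@tensorflow/tfjs';

async function classifyFrame(videoElement) {
  const model = await tf.loadLayersModel('model/model.json'); // placeholder path
  const frame = tf.browser.fromPixels(videoElement); // current frame as a tensor
  const input = tf.image.resizeBilinear(frame, [227, 227]) // match expected input size
    .expandDims(0) // add a batch dimension
    .div(255); // scale pixel values to [0, 1]
  const prediction = model.predict(input);
  return prediction.data(); // class probabilities, no server round-trip
}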
CHAPTER 2
LITERATURE SURVEY
2.1 INTRODUCTION
By discussing strategies such as incremental learning and online learning, the
paper offers insights into the design of our system's dynamic learning component.
7."Web-Based Object Detection Using TensorFlow.js" by Jason Mayes: This
blog post provides a practical tutorial on building web-based object detection
systems using TensorFlow.js. It covers topics such as model conversion,
inference optimization, and integration with web applications, offering practical
guidance for implementing our real-time object detection system within a
browser environment.
CHAPTER 3
SYSTEM ANALYSIS
3.1 INTRODUCTION
3.2 GOAL
3.3 OBJECTIVES
This involves designing and implementing a custom convolutional neural
network (CNN) architecture optimized for rapid inference using TensorFlow.js.
The system should be able to identify objects accurately and efficiently as new
examples are provided.
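A minimal sketch of such a network in TensorFlow.js follows; the layer sizes and class count are illustrative assumptions rather than the project's final architecture.

// Sketch: a small CNN intended for fast in-browser inference.
import * as tf from '@tensorflow/tfjs';

function buildSmallCnn(numClasses) {
  const model = tf.sequential();
  model.add(tf.layers.conv2d({
    inputShape: [227, 227, 3], filters: 16, kernelSize: 3, activation: 'relu'
  }));
  model.add(tf.layers.maxPooling2d({poolSize: 2})); // halve spatial resolution
  model.add(tf.layers.conv2d({filters: 32, kernelSize: 3, activation: 'relu'}));
  model.add(tf.layers.maxPooling2d({poolSize: 2}));
  model.add(tf.layers.flatten());
  model.add(tf.layers.dense({units: numClasses, activation: 'softmax'}));
  model.compile({optimizer: 'adam', loss: 'categoricalCrossentropy'});
  return model;
}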
3.3.2 Dynamic Learning from Examples: Another key objective is to enable the
system to dynamically learn from examples provided by the user and from live
environment data. This involves implementing algorithms for collecting and
incorporating new examples into the training process iteratively. The system
should continuously update its understanding of target objects, improving its
detection capabilities over time without requiring retraining from scratch.
The system should be capable of running smoothly on a variety of devices,
including low-power devices with limited computational resources.
3.3.8 Privacy and Data Security: Ensuring user privacy and data security is
paramount, especially when collecting and processing live environment data. The
system should implement privacy-preserving measures such as data
anonymization, encryption, and user consent mechanisms to protect sensitive
information. Additionally, the system should adhere to data protection
regulations and best practices to safeguard user privacy rights.
increasing data volumes and user interactions without compromising
performance. The system should be capable of efficiently scaling across multiple
devices and users while maintaining real-time responsiveness.
3.4 ALGORITHM
This approach is commonly used for tasks like classification, regression, and
sequence prediction. Unsupervised learning algorithms, on the other hand, are
trained on data without explicit labels, seeking to uncover hidden patterns or
structures within the data.
Clustering, dimensionality reduction, and anomaly detection are examples of
unsupervised learning tasks. Machine learning algorithms can further be
categorized based on their model complexity, ranging from simple linear models
to complex deep neural networks capable of learning intricate representations
from high-dimensional data. Random forests and support vector machines
(SVMs) are two widely used examples of such models.
Training Phase:
In the training phase of the KNN algorithm, the model simply memorizes
the entire training dataset. There's no explicit training involved, as KNN is
considered a lazy learner. For each data point in the training dataset, the algorithm
stores the feature values and their corresponding class labels (in the case of
classification) or target values (in the case of regression).
Prediction Phase (Classification):
When a new data point (with unknown class label) is presented for
prediction, KNN identifies the K nearest neighbors to the new data point based
on a distance metric. To find the nearest neighbors, the algorithm calculates the
distance between the new data point and every point in the training dataset. This
results in a distance value for each training data point. The algorithm then selects
the K data points with the smallest distances to the new data point. These K data
points are considered the "nearest neighbors."
Once the K nearest neighbors are identified, the algorithm assigns the class
label to the new data point based on majority voting among the K neighbors. That
is, the class label with the highest frequency among the K neighbors is assigned
to the new data point. For example, with K = 8, if 5 of the nearest neighbors
belong to class A and 3 belong to class B, the algorithm will classify the new
data point as belonging to class A.
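A minimal sketch of this prediction step (plain JavaScript; the training-item shape is an assumption for illustration):

// Sketch: classify a new point by majority vote among its K nearest neighbors.
// Each training item is assumed to look like {features: [...], label: "A"}.
function knnPredict(trainingSet, newPoint, k) {
  const nearest = trainingSet
    .map(item => ({
      label: item.label,
      // squared Euclidean distance (the square root is not needed for ranking)
      dist: item.features.reduce((s, v, i) => s + (v - newPoint[i]) ** 2, 0)
    }))
    .sort((a, b) => a.dist - b.dist)
    .slice(0, k); // keep the K closest training points
  const votes = {};
  for (const n of nearest) votes[n.label] = (votes[n.label] || 0) + 1;
  // the label with the highest vote count wins
  return Object.keys(votes).reduce((a, b) => (votes[a] >= votes[b] ? a : b));
}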
Choosing K:
Figure 3.4.2.1
3.5 CONCLUSION
The model was developed using the KNN algorithm and trained on a mixed
dataset from Kaggle. On validation, the model's accuracy was found to be 98%.
CHAPTER 4
REQUIREMENT ANALYSIS
4.1 INTRODUCTION
4.4 FRAMEWORKS
4.5 LIBRARIES
4.5.1 TENSORFLOW
In addition to its core functionality, TensorFlow also includes a number of tools and libraries for
data preprocessing, visualization, and model serving.
TensorFlow can be used with a variety of programming languages, including
Python, C++, and Java, and has support for distributed computing, allowing for
the training of very large models across multiple machines. TensorFlow has been
widely adopted in both academia and industry and has been used to develop state-
of-the-art models for a wide range of applications, including image classification,
natural language processing, and speech recognition.
4.5.2 DEEPLEARN.JS
Users could include the DeepLearn.js library in their projects via script tags
in HTML files or by importing it in Node.js environments using npm. Once
imported, developers could leverage the library to create and train neural
networks using a familiar JavaScript syntax. DeepLearn.js was designed to be
accessible to both beginners and advanced users, providing a user-friendly API
while still offering flexibility for customization and advanced techniques.
4.5.2.1 TRANSITION TO TENSORFLOW.JS
4.5.3 TFJS-CONVERTER
The converter will analyse the input model, extract its architecture, and
convert it into a JSON file containing the model's architecture and weights in a
format suitable for TensorFlow.js. Once the model is converted to the
TensorFlow.js format, you can load and use it in your JavaScript code using
TensorFlow.js.
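As a hedged example, the conversion step and the subsequent load might look like this (the file names are placeholder assumptions):

// Conversion is done once, offline, with the tensorflowjs_converter CLI, e.g.:
//   tensorflowjs_converter --input_format=keras model.h5 web_model/
// The resulting web_model/model.json can then be loaded in JavaScript:
import * as tf from '@tensorflow/tfjs';

async function loadConvertedModel() {
  const model = await tf.loadLayersModel('web_model/model.json');
  model.summary(); // inspect the reconstructed architecture
  return model;
}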
4.6 CONCLUSION
CHAPTER 5
SYSTEM DESIGN
5.1 ARCHITECTURAL DIAGRAM
Data Acquisition Module: Responsible for capturing real-time images from the
environment. Utilizes web-based APIs or device cameras for image acquisition.
Provides a mechanism for user interaction to supply initial sample images.
Object Detection Module: Employs a custom convolutional neural network
(CNN) architecture optimized for real-time inference. Utilizes TensorFlow.js for
in-browser execution of the CNN model. Performs object detection on the
acquired images in real-time.
Outputs bounding boxes and class probabilities for detected objects.
User Interface (UI) Module: Facilitates user interaction for providing initial
sample images and system control. Displays real-time video feed with overlaid
object detection results (bounding boxes). Provides feedback mechanisms for
system status and errors. Enables user customization of audio feedback
preferences and settings.
5.2 UML DIAGRAM
DFD LEVEL 0:
The Level 0 DFD shows the overall data flow of the dynamic real-time object
detection and audio feedback system. Data passes through a sequence of
processes, namely pre-processing, feature extraction, and prediction of the
output by the model with the help of a database.
The figure above shows the Level 0 DFD of enabling object detection through
speech for the visually impaired.
DFD LEVEL 1:
The Level 1 DFD shows a detailed view of the data flow in enabling object
detection through speech for the visually impaired, along with the techniques
used at each stage. Up to feature extraction, the details are the same as in the
Level 0 DFD: data passes through pre-processing and feature extraction, and
the model predicts the output with the help of a database.
The figure above shows the Level 1 DFD of enabling object detection through
speech for the visually impaired.
DFD LEVEL 2:
The Level 2 DFD gives a complete and detailed view of enabling object
detection through speech for the visually impaired as a web application,
divided into client and server sides. The client side obtains input from the
user and passes it to the server as a request; the server runs the input through
the various processing stages and returns the output to the client.
The figure above shows the Level 2 DFD of enabling object detection through
speech for the visually impaired.
CHAPTER 6
DATASET AGGREGATION AND ACQUISITION
6.1 INTRODUCTION
For any ML or AI-based project, the dataset is a decisive resource: the worth
of a project rests largely on the dataset used for research or model training,
and model accuracy typically improves when the model is trained on a large
amount of data. Dataset aggregation and acquisition is the process of
collecting, cleaning, and organizing datasets from various sources for the
purpose of analysis and modeling.
6.3 USER INTERACTION FOR SAMPLE IMAGES
Preprocess the acquired images to prepare them for object detection and
model training. Resize images to a consistent size suitable for the input
requirements of the object detection model. Normalize pixel values to a common
scale to improve model convergence and performance. Augment the dataset by
applying transformations such as rotation, scaling, and flipping to increase
diversity and robustness.
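A minimal TensorFlow.js sketch of these preprocessing steps (the 227x227 target size and the flip-based augmentation are illustrative assumptions):

// Sketch: resize, normalize, and horizontally flip an image for training.
import * as tf from '@tensorflow/tfjs';

function preprocess(imageElement) {
  const img = tf.browser.fromPixels(imageElement); // [height, width, 3]
  const resized = tf.image.resizeBilinear(img, [227, 227]); // consistent size
  const normalized = resized.div(255); // scale pixel values to [0, 1]
  const batched = normalized.expandDims(0); // add a batch dimension
  // Simple augmentation: a left-right mirrored copy of the image
  const flipped = tf.image.flipLeftRight(batched);
  return {original: batched, augmented: flipped};
}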
Figure 6.5.2.1
6.6 DYNAMIC LEARNING APPROACH
6.9 CONCLUSION
The dataset has been collected from Figshare and Kaggle and consists of
four classes (glioma, meningioma, pituitary, notumor). The collected dataset
has been preprocessed by cropping the images and applying image
augmentation.
CHAPTER 7
SYSTEM IMPLEMENTATION
7.1 MODULES
responsive to its surroundings. Through efficient image acquisition, the system
can adapt to changing environmental conditions and deliver accurate object
detection results in real-time, thereby enhancing its usability and effectiveness
across diverse applications.
from simple edges and textures to complex object shapes and structures. By
leveraging transfer learning techniques, the feature extraction module can utilize
pre-trained CNN models, such as VGG, ResNet, or MobileNet, to extract high-
level features efficiently. Alternatively, custom CNN architectures can be
designed and trained from scratch to suit the specific requirements of the
application domain.
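A hedged sketch of this transfer-learning idea, truncating a pre-trained MobileNet at an intermediate layer; the hosted model URL and the layer name follow common TensorFlow.js examples and should be treated as assumptions here.

// Sketch: use a pre-trained MobileNet as a fixed feature extractor.
import * as tf from '@tensorflow/tfjs';

async function loadFeatureExtractor() {
  const mobilenet = await tf.loadLayersModel(
    'https://storage.googleapis.com/tfjs-models/tfjs/mobilenet_v1_0.25_224/model.json');
  // Truncate at an intermediate activation to expose high-level features
  const layer = mobilenet.getLayer('conv_pw_13_relu');
  return tf.model({inputs: mobilenet.inputs, outputs: layer.output});
}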
Overall, the classification module forms an integral component of the dynamic
real-time object detection and audio feedback system, empowering it to
effectively recognize and classify objects in diverse real-world scenarios, thereby
enhancing accessibility, usability, and user engagement.
Upon receiving the object detection outputs from the TensorFlow.js model, the
post-processing module first processes the bounding box coordinates and class
probabilities to filter out redundant or low-confidence detections. It applies
thresholding techniques to eliminate false positives and ensure that only relevant
objects are considered for audio feedback generation.
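A minimal sketch of this filtering step (the detection shape and the 0.5 threshold are assumptions for illustration):

// Sketch: drop low-confidence detections before audio feedback is generated.
// Each detection is assumed to look like {box: [x, y, w, h], class: 'person', score: 0.91}.
function filterDetections(detections, threshold = 0.5) {
  return detections.filter(d => d.score >= threshold);
}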
The evaluation and validation module for the dynamic real-time object
detection and audio feedback system with TensorFlow.js plays a crucial role in
assessing the system's performance, accuracy, and usability. This module
encompasses a series of processes and metrics aimed at validating the
effectiveness and reliability of the system in real-world scenarios. Firstly, it
conducts comprehensive testing to evaluate the object detection capabilities,
including precision, recall, and mean average precision (mAP) metrics, to
quantify the system's ability to accurately detect and classify objects in varying
environments and conditions. Additionally, the module assesses the audio
feedback generation component, measuring its responsiveness, clarity, and
appropriateness in providing auditory cues corresponding to detected objects.
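For reference, a small sketch of how precision and recall can be computed once detections have been matched against ground truth (the matching itself, e.g. by IoU, is assumed to happen elsewhere):

// Sketch: precision and recall from counted detection outcomes.
// tp: correct detections, fp: spurious detections, fn: missed objects.
function precisionRecall(tp, fp, fn) {
  const precision = tp / (tp + fp); // fraction of detections that are correct
  const recall = tp / (tp + fn); // fraction of real objects that were found
  return {precision, recall};
}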
7.2.8 IMPLEMENTATION
Main.js
this.knn = null
this.textLine = document.getElementById("text")
this.video = document.getElementById('video');
this.addWordForm = document.getElementById("add-word")
this.statusText = document.getElementById("status-text")
this.video.addEventListener('mousedown', () => {
main.pausePredicting();
this.trainingListDiv.style.display = "block"
})
this.addWordForm.addEventListener('submit', (e) => {
e.preventDefault();
let word = document.getElementById("new
word").value.trim().toLowerCase();
let checkbox = document.getElementById("is-terminal-word")
createTrainingBtn(){
var div = document.getElementById("action-btn")
div.innerHTML = ""
const trainButton = document.createElement('button')
trainButton.innerText = "Training >>>"
div.appendChild(trainButton);
trainButton.addEventListener('mousedown', () => {
if(words.length > 3 && endWords.length == 1){
console.log('no terminal word added')
alert(`You have not added any terminal words.\nCurrently the only
query you can make is "Alexa, hello".\n\nA terminal word is a word
that will appear in the end of your query.\nIf you intend to ask
"What's the weather" & "What's the time" then add "the weather"
and "the time" as terminal words. "What's" on the other hand is not
a terminal word.`)
return
}
if(words.length == 3 && endWords.length ==1){
var proceed = confirm("You have not added any words.\n\nThe only query
you can currently make is: 'Alexa, hello'")
if(!proceed) return
}
this.startWebcam()
console.log("ready to train")
this.createButtonList(true)
this.addWordForm.innerHTML = ''
let p = document.createElement('p')
p.innerText = `Perform the appropriate sign while holding down the ADD
EXAMPLE button near each word to capture at least 30 training examples
for each word.
For OTHER, capture yourself in an idle state to act as a catchall sign, e.g.
hands down by your side`
this.addWordForm.appendChild(p)
this.loadKNN()
this.createPredictBtn()
this.textLine.innerText = "Step 2: Train"
let subtext = document.createElement('span')
subtext.innerHTML = "<br/>Time to associate signs with the words"
subtext.classList.add('subtext')
this.textLine.appendChild(subtext)
})
}
areTerminalWordsTrained(exampleCount){
var totalTerminalWordsTrained = 0
for(var i=0;i<words.length;i++){
if(endWords.includes(words[i])){
if(exampleCount[i] > 0){
totalTerminalWordsTrained+=1
}
}
}
return totalTerminalWordsTrained
}
startWebcam(){
// Setup webcam
navigator.mediaDevices.getUserMedia({video: {facingMode: 'user'}, audio:
false})
.then((stream) => {
this.video.srcObject = stream;
this.video.width = IMAGE_SIZE;
this.video.height = IMAGE_SIZE;
this.video.addEventListener('playing', ()=> this.videoPlaying = true);
this.video.addEventListener('paused', ()=> this.videoPlaying = false);
})
}
loadKNN(){
this.knn = new KNNImageClassifier(words.length, TOPK);
// Load knn model
this.knn.load()
.then(() => this.startTraining());
}
updateExampleCount(){
var p = document.getElementById('count')
p.innerText = `Training: ${words.length} words`
}
createButtonList(showBtn){
// showBtn - true: show training buttons, false: show only text
// Clear List
this.exampleListDiv.innerHTML = ""
// Create training buttons and info texts
for(let i=0;i<words.length; i++){
this.createButton(i, showBtn)
}
}
createButton(i, showBtn){
const div = document.createElement('div');
this.exampleListDiv.appendChild(div);
div.style.marginBottom = '10px';
// Create Word Text
const wordText = document.createElement('span')
if(i==0 && !showBtn){
wordText.innerText = words[i].toUpperCase()+" (wake word) "
} else if(i==words.length-1 && !showBtn){
wordText.innerText = words[i].toUpperCase()+" (catchall sign) "
} else {
wordText.innerText = words[i].toUpperCase()+" "
wordText.style.fontWeight = "bold"
}
div.appendChild(wordText);
if(showBtn){
// Create training button
const button = document.createElement('button')
button.innerText = "Add Example"//"Train " + words[i].toUpperCase()
div.appendChild(button);
button.addEventListener('mousedown', () => this.training = i);
button.addEventListener('mouseup', () => this.training = -1);
// Create clear button to remove training examples
const btn = document.createElement('button')
btn.innerText = "Clear"//`Clear ${words[i].toUpperCase()}`
div.appendChild(btn);
btn.addEventListener('mousedown', () => {
console.log("clear training data for this label")
this.knn.clearClass(i)
this.infoTexts[i].innerText = " 0 examples"
})
// Create info text
const infoText = document.createElement('span')
infoText.innerText = " 0 examples";
div.appendChild(infoText);
this.infoTexts.push(infoText);
}
}
startTraining(){
if (this.timer) {
this.stopTraining();
}
// Assumed continuation (truncated in source): run the training loop each frame
this.timer = requestAnimationFrame(this.train.bind(this));
}
}
Build.js
function _toConsumableArray(arr) { if (Array.isArray(arr)) { for (var i =
0, arr2 = Array(arr.length); i < arr.length; i++) { arr2[i] = arr[i]; } return
arr2; } else { return Array.from(arr); } }
function _classCallCheck(instance, Constructor) { if (!(instance
instanceof Constructor)) { throw new TypeError("Cannot call a class as a
function"); } } // Launch in kiosk mode
// /Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome -
-kiosk --app=http://localhost:9966
// Webcam Image size. Must be 227.
var IMAGE_SIZE = 227;
var TOPK = 10;
var predictionThreshold = 0.98;
var words = ["alexa", "hello", "other"];
var LaunchModal = function LaunchModal() {
var _this = this;
_classCallCheck(this, LaunchModal);
this.modalWindow = document.getElementById('launchModal');
this.closeBtn = document.getElementById('close-modal');
this.closeBtn.addEventListener('click', function (e) {
_this.modalWindow.style.display = "none";
});
window.addEventListener('click', function (e) {
if (e.target == _this.modalWindow) {
_this.modalWindow.style.display = "none";
}
});
this.modalWindow.style.display = "block";
this.modalWindow.style.zIndex = 500;
};
var Main = function () {
function Main() {
var _this2 = this;
_classCallCheck(this, Main);
this.infoTexts = [];
this.training = -1; // -1 when no class is being trained
this.videoPlaying = false;
this.previousPrediction = -1;
this.currentPredictedWords = [];
// variables to restrict prediction rate
this.now;
this.then = Date.now();
this.startTime = this.then;
this.fps = 5; //framerate - number of prediction per second
this.fpsInterval = 1000 / this.fps;
this.elapsed = 0;
this.trainingListDiv = document.getElementById("training-list");
this.exampleListDiv = document.getElementById("example-list");
this.knn = null;
this.textLine = document.getElementById("text");
// Get video element that will contain the webcam image
this.video = document.getElementById('video');
this.addWordForm = document.getElementById("add-word");
this.statusText = document.getElementById("status-text");
this.video.addEventListener('mousedown', function () {
main.pausePredicting();
_this2.trainingListDiv.style.display = "block";
});
}
CHAPTER 8
CONCLUSION AND FUTURE ENHANCEMENTS
8.1 CONCLUSION
8.2 FUTURE ENHANCEMENTS
Model Optimization:
Enhance the system's ability to adapt and learn dynamically from new data and
user feedback. Implement more sophisticated online learning algorithms and
active learning strategies to continuously update the object detection model and
improve its performance over time.
Figure 8.3.1: Process-1
REFERENCES
14. A. Bharti and S. K. Singh, "A Survey of Real-Time Object Detection Techniques," 2017.
15. H. Eslami, M. H. Sheidaei, and M. H. Sedaaghi, "Real-Time Object Detection and Classification for Unmanned Aerial Vehicles (UAVs)," 2018.
16. C. Zhou, M. Guo, W. Zhu, and Y. Li, "Real-Time Object Detection and Tracking Based on Deep Learning," 2019.
17. M. A. Rahman, M. S. Uddin, and K. S. Islam, "Real-Time Object Detection and Tracking Using Deep Learning," 2018.
18. S. Liu, H. Yu, S. Li, and W. Li, "Real-Time Object Detection and Tracking Based on Deep Learning," 2018.
19. Y. Wang, H. Wu, Y. Zhou, and J. Cai, "Real-Time Object Detection and Tracking for Intelligent Transportation Systems," 2019.
20. X. Wang, H. Li, C. Peng, Z. Lu, and J. Wei, "Real-Time Object Detection and Recognition for Smart Cities," 2019.