Voice Based Email System For The Visually Challenged
Satya Prakash
Amity University Amity School of Engineering and Technology https://orcid.org/0009-0002-9943-7431
Kartikey Agrawal
Amity University Amity School of Engineering and Technology
Siddharth Dosaj
Amity University Amity School of Engineering and Technology
Shatakshi Singh
Amity University Amity School of Engineering and Technology
Research Article
Keywords: Voice Recognition, Text-to-Speech (TTS), Speech-to-Text (STT), Visually challenged people
DOI: https://doi.org/10.21203/rs.3.rs-2982553/v1
License: This work is licensed under a Creative Commons Attribution 4.0 International License.
Voice Based Email System for the Visually Challenged
Abstract - In today's digital world, email is a crucial form of communication, but it can be difficult for those who are blind or visually impaired to access and use it. This paper presents a voice-based email system created specifically for people with visual impairments. The objective is to provide an accessible environment that enables people who are blind or visually impaired to use computers on their own to send and receive emails.

The suggested solution has components designed with visually impaired users in mind. By integrating a screen reader, it lets users convert text into voice and interact with the email interface aloud. Users may create emails using their voice thanks to speech recognition technology that converts spoken words into written text. To continually develop the system and adapt it to individual preferences and needs, user feedback and customization options are included.

Keywords: Voice Recognition, Text-to-Speech (TTS), Speech-to-Text (STT), Visually challenged people

I. INTRODUCTION

Email has become a vital tool for communication and information sharing in today's digitally linked society. Individuals with vision impairments, however, face distinct obstacles in accessing and using this critical technology. A voice-based email system built specifically for visually impaired people attempts to bridge this accessibility gap and enable them to send and receive emails on their own computers.

To access digital material, the visually impaired population has long relied on assistive technology such as screen readers and text-to-speech synthesis. While these technologies have been quite helpful, they frequently require external assistance or additional software to fully interface with email systems. By developing a dedicated voice-based email system, we can provide visually impaired people with a more inclusive and efficient method of communication.

The voice-based email system's major goal is to enable visually impaired users to engage with email interfaces via voice commands and speech-to-text conversion. The system transcribes spoken words into written text using voice recognition technology, allowing users to create, edit, and manage emails using their natural voice.

Furthermore, the system uses natural language processing to decode spoken commands, allowing for smooth navigation across the email interface. By listening and responding to audio instructions, visually impaired users can carry out operations such as checking their inbox, creating new messages, replying to emails, and organising folders.

Overall, the voice-based email system marks a substantial advance in accessibility technology, allowing visually impaired people to access and use email independently. By using speech recognition, natural language processing, and related technology, the system provides new channels for communication and productivity for the visually impaired community.

A. Overview

The system is intended to promote a sense of community among people who are blind or visually impaired by giving them access to inclusive and accessible technologies. It encourages independence and fair participation in the digital world by making the email system simple to use for people of all ages.

This article's main goal is to provide a voice-based emailing system that caters to the needs of blind and illiterate people and enables them to use commonplace technology like email and internet access. The suggested solution eliminates the need for conventional text-based input techniques by allowing blind people to log in using voice instructions.

B. Project’s Scope

The goal of the speech-based email system project is to create a complete system that specifically addresses the needs of people who are blind or visually impaired and enables them to send and receive emails using their voice. To provide a welcoming and accessible environment for users who are visually impaired, the project considers
several important factors. The project's primary objective is to create a voice-based email system. The technology used for this project may, however, be expanded in the future to support other services such as texting and the use of voice commands to control other programmes. The ultimate objective is to provide a complete and usable solution that improves the productivity and independence of visually impaired people when communicating via email.

C. Objective

The objective of the study is to offer users a simple, effective method of using voice commands to create and send emails. Users can dictate email content instead of typing it manually, thanks to an application that converts spoken words into text using speech recognition technology. The goal is to improve productivity and accessibility by enabling hands-free email composition. By including functions such as message confirmation, login authentication, and error handling, the application also strives to offer a smooth user experience. This study compares and assesses several speech-to-text tools in various real-world settings, including the built-in tool in Android Studio, Google Cloud speech to text, and Microsoft Azure speech to text. The project aims to enhance user experience by addressing challenges like noise, accents, multilingualism, and accuracy.

II. LITERATURE SURVEY

The solution presented in the study [3] is based on a voice command mechanism, in contrast to the current email system. In essence, text to number conversion is the foundation of the entire system. Once it is started, the system prompts the user for voice commands in order to access the required services. It is crucial to note that the user must want to access the relevant services for this command to work. This programme makes use of IMAP (Internet Message Access Protocol), an Internet protocol that email clients frequently use to retrieve emails from mail servers over TCP/IP. The main activity screen is the first one to appear when the application starts. The user only needs to tap one button on this screen for the device to start responding to voice instructions. There is just one full-size button, so the user may tap anywhere on the screen. The user can then send an email and use voice commands to read it.

Saurabh Sawant et al. offer a solution in their study [21] for those who are visually impaired or illiterate to increase their involvement with email systems. Screen readers and Braille keyboards used in IVR systems are no longer required thanks to this solution. Both speech-to-text and text-to-speech conversions are used, and other operations also make use of voice commands. To register, the user provides an email address and password. Email sending is handled by PHPMailer, a PHP library, and the user's mail is retrieved from the IMAP server. For searching mail in inboxes, the Knuth-Morris-Pratt algorithm is used. In conclusion, the entire system environment is voice-driven, and each level receives the proper system response. The drawback of this approach is that other email providers, such as Yahoo, cannot be used, because it requires Gmail as a host server.

The approach in [6] places a stronger emphasis on user friendliness for all kinds of users, including those who are blind or illiterate, whereas the current system emphasises user friendliness for regular users. The core of the entire system is IVR, or interactive voice response. When using this system, the computer asks the user to complete specific actions, and the user must complete those actions in order to access the associated services. One of the biggest advantages of this method is that the keyboard is not required. All actions are based on mouse click events. How blind individuals will be able to find the mouse pointer, however, is a question that remains to be answered.

The aim of the paper [10] is the creation of a search engine that permits only voice-based interaction between human and machine. It introduced a page reader and search engine that is driven entirely by voice, enabling end-users to control and surf the web using speech to navigate. Current search engines, in contrast, respond to text requests by retrieving relevant records from the server and displaying them as text. The authors of [7] suggested a user-friendly email system for individuals who are blind or visually handicapped. The system design consists of a TTS (text-to-speech) module, an STT (speech-to-text) module, and a module for mail composition activities (composing, inbox, and sending). Its speech-to-text functionality is implemented through Google's cloud Speech-to-Text API for developers, which relies on Google-provided neural network models.

The authors of [8] employ artificial intelligence to help the blind utilise cutting-edge technology for their development and advancement. The suggested system is a desktop programme that uses artificial intelligence to reduce costs and make maintenance simple. It relies on voice detection and conversion. Since it is entirely voice based and uses neither the mouse nor the keyboard, it eliminates the drawbacks of the previous system. Because it uses voice, it offers an intuitive, interactive, and user-friendly interface that even blind users who are not computer literate can use.

The work in [19] describes a straightforward programme with text-to-speech capabilities. The application is split into two main modules. The main application module contains the fundamental GUI elements and manages the application's fundamental activities, such as parameter input for conversion through file, direct keyboard input, or web. Both DJNativeSwing and the open source SWT API would be used in this module. The second module, which is integrated into the main module, is the main conversion engine and is responsible for accepting data and converting it. This would put into practise the free TTS API.

Another study resulted in the creation of a programme that might assist users in sending and receiving mail in the English language. It was found during this investigation that the proposed architecture outperformed the current architecture. In order to make it simple for blind people to access information, text-to-speech and speech-to-text conversion techniques were employed. [11]

A. Advantage of the Surveyed Techniques
Many articles demonstrate how text-to-speech and speech-to-text conversion make email easier to use, and even interactive, for people who are blind or visually impaired. People with disabilities feel like regular users thanks to this system. Voice-based technology is also helpful for those who are illiterate or disabled. Automatic speech recognition is one of the main benefits. We can observe a decrease in the mental effort required by visually handicapped people to remember and type characters on a keyboard. The speech-driven email system is therefore a user-friendly system.

B. Limitations of the Surveyed Techniques

It is evident that mouse clicks are used for several activities in almost all the articles, which makes these systems more challenging for those who are blind. Additionally, because many of the languages spoken on the Indian subcontinent are not understood by speech recognition software, users there do not benefit from these systems, which work almost entirely in English.

The proposed system for those who are blind is described in the section that follows.

III. METHODOLOGY

There are numerous crucial elements in the development process for an Android voice-based email system.

First, in-depth requirements gathering is done in order to understand the specific requirements and difficulties experienced by visually impaired people using email on Android. User research techniques such as interviews and questionnaires are used to learn more about their preferences, ideal features, and pain points.

The planning step for the design and interface starts once the requirements have been gathered. In order to ensure ease of use for visually impaired people, a user-friendly and accessible interface is built, taking into account high contrast colours, big buttons, and intuitive navigation. The next phase is the integration of speech recognition into the Android application. To reliably translate spoken words into written text, this calls for the incorporation of a trustworthy voice recognition library or API, such as Google Cloud Speech-to-Text or PocketSphinx. The Android application incorporates accessibility features that follow the Android accessibility guidelines. These include features that improve usability and accessibility for those who are blind or visually impaired, such as high contrast mode, haptic feedback, keyboard navigation, and gesture-based controls. To enable communication with email servers, integration with an email API or protocols like IMAP or SMTP is essential. As a result, users who are blind may use the programme to send, receive, and manage emails.

User data is protected by security and privacy safeguards. To prevent unauthorised access to email accounts, encryption technologies are included for data transmission and storage. Voice-based user authentication is also available. In order to discover and fix any usability or accessibility concerns, the produced system goes through extensive testing, including usability testing with users who are visually impaired. The design and functionality are iterated upon in response to user and accessibility expert feedback, enhancing the overall user experience.

1) Android’s speech-to-text conversion: To receive the user's voice, Android supports a range of interfaces for speech recognition listeners. The first is the recognizer intent, which accepts the user's speech as input before returning to the previous activity. The Android voice recognizer can select among the ten languages available on the platform for spoken input. The verbal response is then captured while another or the ongoing task is being performed. The response is converted into text by the code, and the text is then either displayed or passed on as input to the text-to-speech converter.
2) Android’s text-to-speech conversion: This is a crucial component of the program. It examines the text, creates an audio version of it, and then plays it through the device's speaker. This text-to-speech capability is provided by Android and is especially helpful for persons who are blind or visually handicapped. To obtain the text, a class object is constructed. Android's TTS engine supports many languages. The function that converts text to speech receives the text as an argument. In the listener function, the converted text is delivered to the user as synthesized voice.
3) Mail Programming Module: Email is one of the most useful services offered online. Many internet-based services use the SMTP protocol to transfer mail from one user to another. SMTP is the protocol used to send mail, whereas POP (Post Office Protocol) or IMAP (Internet Message Access Protocol) is used to receive emails at the recipient's end.
4) Sending email: An email message consists of two components, a header and a separate body. Before the email is sent, the client and server exchange a sequence of requests and responses. The header ends at the first null (empty) line. Everything after that null line is taken as the message body, which contains the actual information for the recipient.

Password organisation could be risky. Keeping a password straight might be risky, so make sure to teach them how to establish a database table while also keeping the password straightforward. The server will be contacted when the user requests to log in so it may check the live load and save the username and password.

A. System Architecture

The architectural diagram displays the system design for the voice-based email system for visually challenged users. The architecture consists of numerous parts that interact to enable users to send emails using voice commands. Let us talk about the architecture in more detail. The User component represents the visually impaired user
who interacts with the system. The Speech Recognition component captures the user's spoken instructions. With voice recognition technology, which converts spoken words into text, the system can process the user's requests.

The Email Server component is in charge of handling the email-related functions. By connecting to the email service provider's server, it enables email sending and receiving. The Email Server component checks the user's login details with the Authentication component before beginning email operations. The Authentication component establishes a safe and regulated environment by verifying that the user has authorization to access their email account.

Figure 1: System Architecture of the proposed system

When a user tries to log in, the system prompts them for their login details. The Authentication component authenticates the credentials by comparing the provided credentials to the user data that has been saved. If the credentials are legitimate, the system moves on to the next phase. If the user's login credentials are entered incorrectly, the system prompts them to enter them again until they are properly authorised. After the user has been validated, they can proceed to the "Compose Email" component. At this stage, the user follows audio directions to input the recipient's email address, message body, and subject. The NLP component processes the user's input and extracts the relevant data.

An accessible user experience is produced via the system's Text-to-Speech (TTS) Converter component. This component converts the system's output, including email content and confirmation messages, into voice format. The TTS Converter enables the system to interact with the visually impaired user vocally, allowing them to effectively receive and understand the system's replies. Arrows on the architectural diagram depict the information flow and interactions among the parts. The arrow pointing from the User component to the Speech Recognition component, for example, depicts the flow of the user's spoken input. The arrows between the other components depict in a similar manner how information and control are passed between them.

B. Design

1) User Interface Design: The user interface for the voice-based email system has been created using the tools in Android Studio, resulting in a seamless and straightforward operation. The interface is made up of three independent screens: the logo screen, the login page, and the message page. Every screen has a distinct purpose that makes it simpler for users to interact with and utilise the programme.
2) Logo Screen: When users initially launch the programme, they are welcomed by the Logo Screen, which serves as the initial visual depiction of the voice-based email system. This page introduces the user interface and displays the application's friendly and distinctive logo or symbol.
3) Login Page: After viewing the Logo Screen, users are sent to the Login Page, which acts as the doorway to their email account. The login page has two essential fields, User ID and Password.
a) User ID Field: Users fill out this field with the unique user ID for their Gmail account. The text entry function it provides lets users enter text using a virtual keyboard or a voice command.
b) Password Field: The Password field's secure input area allows users to enter their account password. The password entry is masked in order to safeguard privacy and prevent unauthorised access.
4) Message Page: After successfully authenticating into the system, the user is sent to the Message Page, where they may create and send emails. The Message Page has several text areas where users may enter the necessary email details:
a) Recipient Email Address Field: Users can use this field to input the recipient's email address. A text entry form is available that is like the User ID field on the Login Page.
b) Message Subject Field: This allows users to provide a brief yet informative
subject line for their email. Users may speak the subject into a text entry box while utilising voice recognition.
c) Message Text Field: The Message Text box allows users to enter email content in a larger text input area. Users may enter the message's text, along with any attachments or additional information, using the virtual keyboard or voice input.

C. Implementation

• Login: A Boolean function that verifies the user's credentials and returns a Boolean value indicating whether the login process was successful or unsuccessful.
• Compose Email: A void function that allows the user to create an email by passing parameters for the subject, message content, and recipient's email address.
• Confirm Content: A Boolean function used to verify the email's content. It returns a Boolean result that indicates whether the user has approved or disapproved of the content.
• Send Email: The void function sendEmail() starts the process of sending the composed email.

The SpeechRecognition class converts spoken words into text through its public function convertSpeechToText(): string. This function returns the text that corresponds to the user's spoken input.

The NLP class provides three public methods for identifying specific information in the text: extractRecipient(text: string), extractSubject(text: string), and extractMessage(text: string). Each method extracts the recipient email address, the subject, or the message content, respectively, from the input text and returns it.

The Email Server class represents the component responsible for handling email-related operations. It has two public methods:

• Authenticate User: The authenticateUser(email: string, password: string) function takes the user's email address and password as parameters, verifies them, and returns a Boolean value indicating whether the authentication was successful.
• Send Email: This function starts the sending of an email. It sends the email using the recipient's email address, the message's subject, and its content.

The Confirmation class represents the component in charge of verifying the email's content. It has a single public method, confirmContent(email: string, subject: string, message: string): Boolean, which accepts the email address, subject, and message content as parameters and returns a Boolean value indicating whether the user has confirmed that the content is correct.

Text to speech conversion is handled by the TTSConverter class. It has a single public method, convertTextToSpeech(text: string): void, that turns the input text into audible speech.

Figure 3: The main classes involved in the voice-based email system, their attributes, and their relationships.

IV. RESULTS AND DISCUSSIONS

A. Analysis of Speech to Text Tool Used

The inbuilt speech-to-text tool provided by Android Studio was utilised in this project. Six test cases were employed in this experiment to determine how accurately the speech-to-text system performed. The spoken sentence
is represented by the original text, while the speech-to-text system's output is represented by the transcribed text. The accuracy percentage is calculated in order to evaluate how accurately the machine transcribes spoken text.

1) Word error rate formula

Word error rate, often referred to as WER, is a way to measure the performance of an automatic speech recognition (ASR) system. It is tricky to measure because the "ASR result" can have a different length than the "Voice input."

In its standard form, WER is calculated as the number of word substitutions, deletions, and insertions needed to turn the voice input into the ASR result, divided by the number of words in the voice input.

Test case | Original text | Transcribed text | Accuracy (%)
2 | "I have a meeting at 2 PM." | "I have a meeting at 2 PM." | 100
3 | "Can you please call me back?" | "Can you please call me bank?" | 75
4 | "The weather is sunny and warm." | "The weather is money and warm." | 75
5 | "What time is the train arriving?" | "What time is the train a driving?" | 50
6 | "I need to buy some groceries." | "I need to pie some groceries." | 87.5

Test cases 1 and 2 both display 100% accuracy, demonstrating that the speech-to-text system transcribed the uttered sentences accurately and without error. The transcription accuracy in test cases 3 and 4 is only 75% due to some transcription errors. Test case 5 shows a lower accuracy of 50% due to inaccurate word transcriptions in multiple instances. Test case 6 displays an 87.5% accuracy rate with some transcription issues.

B. Analysis of Comparison

In this study, the performance of three speech to text tools - Android Studio speech to text, Google Cloud speech to text, and Microsoft Azure speech to text - was compared on six parameters: accented speech, multilingual speech, noisy environment, challenging vocabulary, speed and pacing, and homophones and ambiguities. Test cases were designed for each parameter to evaluate the accuracy of transcriptions generated by the tools. The results provide insights into the tools' performance and their suitability for different speech recognition scenarios.

TABLE II: INPUT SAMPLE AND ITS TRANSCRIPTION IN NOISY ENVIRONMENT

Test Sample | Google Cloud | Android Studio | Microsoft Azure
Medical terminology | "The patient has a myocardial infarction." | "The patient has a myocardial infarction." | "The patient has a myocardial infection."
Scientific terms | "The experiment yielded significant results." | "The experiment yielded significant results." | "The experiment shielded significant results."
Legal jargon | "The defendant pleaded not guilty." | "The defendant pleaded not guilty." | "The defendant pleated not guilty."
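The word error rate introduced in subsection A can be made concrete with a short sketch. The Python code below is not taken from the paper; it is a minimal illustration that assumes the standard definition, WER = (S + D + I) / N, where S, D, and I are the word-level substitutions, deletions, and insertions and N is the number of words in the spoken reference. The function names `tokenize` and `word_error_rate`, and the choice to ignore case and punctuation, are assumptions made for this example.

```python
import string


def tokenize(sentence: str) -> list[str]:
    """Lower-case the sentence, strip ASCII punctuation, split on whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return sentence.lower().translate(table).split()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (S + D + I) / N using word-level edit distance.

    reference  -- the sentence that was actually spoken ("voice input")
    hypothesis -- the text produced by the recogniser ("ASR result")
    """
    ref, hyp = tokenize(reference), tokenize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    spoken = "Can you please call me back?"
    transcribed = "Can you please call me bank?"
    print(f"WER = {word_error_rate(spoken, transcribed):.3f}")
```

Accuracy is often reported as 1 - WER. The text does not state whether the accuracy percentages listed for the six test cases were obtained this way, so values computed with this sketch need not match them.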
Figure 5: Performance of Speech to text tools with challenging vocabulary.

The bar graph visually represents the performance of three speech to text tools - Android Studio speech to text, Google Cloud speech to text, and Microsoft Azure speech to text - across different parameters. The graph consists of vertical bars that indicate the average accuracy rate of each tool for each parameter.

The x-axis represents the six different parameters, namely accented speech, multilingual speech, noisy environment, challenging vocabulary, speed and pacing, and homophones and ambiguities. Each parameter is labelled on the x-axis.

The y-axis represents the accuracy rate in percentage, ranging from 0% to 100%. The scale on the y-axis allows for easy comparison of the accuracy rates between the tools.

process was expedited by this integration, which also made testing and debugging more effective.

Furthermore, our program could run without an active internet connection thanks to the built-in speech-to-text tool's offline functionality. In situations where users might not have continuous or dependable internet access, this was useful because it guaranteed uninterrupted voice recognition functionality.

Cost-effectiveness also played a role in our decision. Although the speech-to-text tool from Google Cloud may have provided higher performance, it frequently incurs additional expenses based on usage and API calls. We were able to eliminate these extra costs by making use of the built-in tool, making our application more user-friendly and economical.

Finally, privacy and data security concerns played a large part in our decision. By using the built-in tool, we were able to prevent sensitive user voice data from leaving the device and application, potentially lowering privacy issues related to transferring speech data to external services or APIs.

Overall, the advantages of seamless integration, offline functionality, cost effectiveness, and improved privacy and data security offered by Android Studio's built-in speech-to-text tool outweighed the potential performance gains. This is true even though Google Cloud's speech-to-text tool may have offered better performance.