Speech Recog Intro
Ample Technologies 1
CHAPTER 1
INTRODUCTION
Language is a very fast and effective way of communicating. To use language means to
express an unlimited number of ideas, thoughts and practical information by combining a
limited number of words with the help of a limited number of grammatical rules. The result
of the language production process is a structured series of words. These words are
produced, i.e. spoken or signed, in a very rapid and effective way. Any person can follow
such language production and understand what the speaker wants to express if two
preconditions are fulfilled. The recipient must:
1. know the words and grammatical rules the speaker uses, and
2. be able to receive and process the physical signal.
Most people use oral language for everyday communication, i.e. they speak to other people
and hear what other people say. People who are deaf or hard of hearing do not have equal
access to spoken language. For them, precondition 2 is not fulfilled: their ability to receive
speech is impaired. If people with severe hearing impairments want to take
part in oral communication, they need a way to compensate for their physical impairment.
Hearing aids are sufficient for many hearing-impaired people. However, if hearing aids
are insufficient,
1.1 Definitions
The hierarchical linear discriminant feature extraction method described in [4], and audio
feature enhancement [5]. Feature-fusion-based techniques lack the ability to explicitly model
the relative reliability of each feature stream. This is important as the reliability of either
stream may vary significantly even within the duration of an utterance because of constant or
instantaneous background noise or channel degradations.
1.2 Literature Survey
[5] Voice Conversion
A smart assistive environment is an area where people live and work and are helped by the
technology embedded within it. This environment needs to interact with its occupants and
assist them in everyday life. The interaction between an individual and a smart space needs to
be as intuitive and natural for the human as possible. Automatic speech recognition
(ASR) is the task of converting spoken words to text. It is clearly a necessity in smart
spaces, and particularly important in assistive environments [1]. Of course, an ASR system
needs to be robust to noise and retain high accuracy.
Despite the increased research attention that the topic has attracted, voice conversion (VC)
has remained a challenging area. One of the challenges is that the perceived quality and
the success of the identity conversion are largely subjective. Furthermore, there is no
unique correct conversion result: when a speaker utters a given sentence multiple times, each
repetition is different. For these reasons, time-consuming listening tests must be used in
the development and evaluation of voice conversion systems. The use of listening tests can
be complemented with objective quality measures that approximate the subjective rating,
such as the one proposed in (Möller, 2000). Before diving deeper into different aspects of
voice conversion, it is essential to understand the factors that determine the perceived speaker
identity. Speech conveys a variety of information that can be categorized, for example, into
linguistic and nonlinguistic information. Linguistic information has not traditionally been
considered in the existing VC systems but is of high interest for example in the field of
speech recognition. Even though some hints of speaker identity exist on the linguistic level,
nonlinguistic information is more clearly linked to speaker individuality. The nonlinguistic
factors affecting speaker individuality can be linked into sociological and physiological
dimensions that both have their effect on the acoustic speech signal. Sociological factors,
such as the social class, the region of birth or residence, and the age of the speaker, mostly
affect the speaking style that is acoustically realized predominantly in prosodic features, such
as pitch contour, duration of words, rhythm, etc. The physical attributes of the speaker (e.g.
the anatomy of the vocal tract), on the other hand, strongly affect the spectral content and
determine the individual voice quality. Perceptually,
the most important acoustic features characterizing speaker individuality include the
third and the fourth formant, the fundamental frequency and the closing phase of the glottal
wave, but the specific parameter importance varies from speaker to speaker and from
listener to listener.
Speech parameterization and modification
Most of the voice conversion approaches use segmental feature extraction to find a set of
representative features that are then converted from source to target speakers. In principle,
the features to be transformed in voice conversion can be any parameters describing the
speaker-dependent factors of speech. The parameterization of the speech and the flexibility
of the analysis/synthesis framework have a fundamental effect on the quality of converted
speech. Hence, the parameterization should allow easy modification of the perceptually
important characteristics of speech as well as provide high-quality waveform resynthesis.
The most popular speech representations are based on the source-filter model. In the source-
filter model, the glottal airflow is represented as an excitation signal that can be thought to
take the form of a pulse train for the voiced sounds and the form of a noise signal for the
unvoiced sounds. A voiced excitation is characterized by a fundamental frequency.
Figure: Block diagram of speaker adaptation in HMM-based TTS. In the training phase, HSMMs
are generated using speech data from multiple speakers. Then, model adaptation is applied to
obtain HSMMs for a given target speaker. The adapted HSMMs can be used in TTS
synthesis for producing speech with the target voice.
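The source-filter model described in this subsection can be illustrated with a minimal
Python sketch. It is not taken from any of the surveyed systems. The function names
(pulse_train, white_noise, resonator) and the formant values (700 Hz centre frequency,
130 Hz bandwidth) are assumptions chosen only for illustration, and a single two-pole
resonator stands in for the whole vocal-tract filter.

```python
import math
import random


def pulse_train(n_samples, sample_rate, f0):
    """Voiced excitation: one unit impulse per pitch period (period = sample_rate / f0)."""
    period = round(sample_rate / f0)
    return [1.0 if i % period == 0 else 0.0 for i in range(n_samples)]


def white_noise(n_samples, seed=0):
    """Unvoiced excitation: uniform noise in [-1, 1], seeded so runs are repeatable."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(n_samples)]


def resonator(signal, sample_rate, center_hz, bandwidth_hz):
    """Two-pole all-pole filter used here as a stand-in for one formant."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)  # pole radius, below 1 so the filter is stable
    a1 = 2.0 * r * math.cos(2.0 * math.pi * center_hz / sample_rate)
    a2 = -r * r
    out = []
    y1 = y2 = 0.0  # previous two output samples
    for x in signal:
        y = x + a1 * y1 + a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out


# Illustrative values only: 100 ms at 8 kHz, 100 Hz pitch, one formant near 700 Hz.
SAMPLE_RATE = 8000
voiced = resonator(pulse_train(800, SAMPLE_RATE, 100), SAMPLE_RATE, 700, 130)
unvoiced = resonator(white_noise(800), SAMPLE_RATE, 700, 130)
```

In this sketch, changing the pulse spacing alters the pitch of the voiced sound while the
filter coefficients stay fixed, which is the separation of source and filter that makes
the representation convenient for modification.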
[7] Accessible Instructional Materials (AIM)
Reading is fundamental to academic success, high school graduation, and positive transition
to employment or post-secondary education. Unfortunately, many students with disabilities
continue to have significant reading deficits when they enter high school despite the best
efforts of special education intervention during their elementary years. These students
frequently fall behind and become at risk for dropping out of school, failing to gather enough
credits to graduate, or otherwise failing to successfully transition to employment or
post-secondary education. This is especially true for students with learning and other
disabilities that impact reading proficiency, for example, traumatic brain injury, autism,
dyslexia, and others. Special education remediation has been the traditional intervention
used to address reading deficits in high school students with disabilities. However, students
with disabilities have often realized the maximum benefit from remediation by the time they
reach high school. Other interventions include providing specific academic supports and
enhancing study skills. Even these techniques are frequently ineffective in making students
independent, successful learners. These students continue to struggle with academic content,
reinforcing a sense of failure, and become less engaged and motivated at school. Rarely is
reading compensation through text-to-speech (TTS) software and accessible instructional
materials (AIM) considered a viable intervention for these students. This technology allows
print information from textbooks, worksheets, tests, or notes to be scanned or obtained in
digital format and then read aloud by a speech synthesizer. This means students can work free
of human assistance (no more oral reading accommodations). The speech output generated by the
software shifts the skill demand to listening comprehension (receptive language), and the
student is able to focus on academic content rather than struggling with reading decoding,
fluency, or comprehension problems. The software is customizable to match the needs of each
student. Highlighting of text may help visual tracking; word prediction supports written
expression; and a built-in dictionary, thesaurus, and other multi-modal tools are available
within most robust TTS software applications.
Text-to-Speech as a Compensatory Accommodation
Text-to-speech (TTS) software allows print information (once converted to a digital format) to
be read aloud to students, allowing them to work more independently and efficiently. The
software is customizable to match the needs of individual students. Highlighting of text helps
visual tracking, word prediction assists with written expression, and built-in dictionary and
thesaurus tools and a host of other multi-modal tools are accessible within TTS software. A
number of TTS software programs are commercially available, and several shareware programs
exist that can be accessed online. Major commercial products such as Read & Write
Gold, Kurzweil 3000, Read:OutLoud, and WYNN are frequently used in schools, primarily
because they offer a robust array of tools and the product companies provide high levels of
technical support. In addition, there are a handful of stand-alone readers that provide
simultaneous access to speech and print output, such as Victor Reader, Mobile Reader, and
ClassMate Reader. Additional information on products available to support audio output can
be found in Appendix C of this guide along with at-a-glance guides to support initial product
selection and use. When using TTS as a compensatory accommodation, a secondary student
must have access to a system throughout the school day and at home, to be used
with all print instructional materials to provide comprehensive access to the curriculum.
Such a system may include a computer, TTS software, a scanner, and a printer. Print
instructional materials must be made available in accessible electronic form (such as a DAISY
format file) or converted into electronic form through the use of scanning. Additional
information about resources for electronic text and copyright provisions can be found in
Section II: Implementation.
Potential Barriers/Implementing Change
Implementing TTS software as a compensatory strategy for students who are not visually
impaired generally faces two philosophical barriers that must be addressed prior to
implementation:
1) Typically there is great reluctance on the part of educational staff to compensate for
reading deficits. Educators tend to persist in focusing on remediating reading deficits and are
frequently unwilling and/or unable to implement a compensatory strategy. Some educators
view compensatory technology as an unfair advantage and are not supportive of its use.
1.3 Purpose
Speech recognition is computer technology that enables a device to recognize and understand
spoken words by digitizing the sound and matching its pattern against stored patterns.
Currently available devices are largely speaker-dependent (they recognize the speech of only
one or two persons) and recognize discrete speech (speech with pauses between words) better
than normal (continuous) speech. Their major applications are in assistive technology that
helps people work around their disabilities. Speech recognition is not to be confused with
voice recognition, which is used mainly in security devices.
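The idea of matching a digitized sound against stored patterns can be sketched with dynamic
time warping (DTW), one classical template-matching technique for discrete (isolated-word)
speech. The text above does not say which matching method the devices use, so DTW is an
assumption here. The function names, the one-dimensional "features", and the two templates
are all hypothetical; a real system would compare sequences of spectral feature vectors.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D feature sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = cheapest alignment of the first i items of a with the first j items of b
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b stays (b's sample is stretched)
                cost[i][j - 1],      # b advances, a stays (a's sample is stretched)
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


def recognize(features, templates):
    """Return the label of the stored template closest to the input under DTW."""
    return min(templates, key=lambda label: dtw_distance(features, templates[label]))


# Hypothetical stored patterns, one per vocabulary word.
templates = {"yes": [1, 3, 4, 3, 1], "no": [5, 5, 2, 2]}
# The same contour as "yes", spoken more slowly (some samples repeated).
spoken = [1, 1, 3, 4, 4, 3, 1]
best = recognize(spoken, templates)
```

Because DTW lets one sequence stretch against the other, a slower utterance of a word can
still match its stored template. Each template comes from a particular speaker, which is one
way to see why such matching tends to be speaker-dependent.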
1.4 Scope
The scope of this study was to explore the practical applications of implementing speech
recognition (SR) technology in the classroom in order to automatically convert instructors'
speech to text. In-classroom captioning and the provision of lecture transcripts can serve a
variety of educational or pedagogical purposes, including supplementing students' class notes
with accurate SR transcripts, allowing students who have difficulty taking notes by themselves
to acquire the lecture content, and permitting instructors to confirm what they said during
class. Two different methods of SR-mediated lecture acquisition (SR-mLA), real-time captioning
(RTC) and post-lecture transcription (PLT), were compared across different course subjects as
a technical feasibility and case study. Our specific objectives were to:
1. Identify the issues regarding the use of SR-mLA as a standard classroom tool in capturing
spoken lecture information.
2. Compare the technical reliabilities and word recognition accuracies of RTC and PLT.
3. Explore the effects of SR-mLA on student class grades and voluntary quiz participation and
performance in a regular STEM course.
4. Investigate the potential benefits of acquiring SR lecture transcripts for instructors and
students, particularly for those with special needs.
It was not the purpose of this study to compare the many different SR algorithms or software
currently available, but rather to examine the practices of SR usage in lecture capture. We
believe the strategies and framework of this study could be applied to include any SR engine
that others may wish to employ.
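Word recognition accuracy, mentioned in the objectives above, is commonly reported through
the word error rate (WER): the number of word substitutions, deletions and insertions needed
to turn the recognizer output into a reference transcript, divided by the number of reference
words. The sketch below is one minimal way to compute it. The function name and the example
sentences are assumptions for illustration, and the study itself may have scored accuracy
differently.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitute = prev[j - 1] + (ref_word != hyp_word)
            delete = prev[j] + 1      # reference word missing from the hypothesis
            insert = curr[j - 1] + 1  # extra word in the hypothesis
            curr.append(min(substitute, delete, insert))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substitution ("sat" -> "sit") and one deletion ("the").
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
accuracy = 1.0 - wer
```

Here the two errors against six reference words give a WER of 2/6. Note that WER can exceed
1 when the output contains many inserted words, so 1 - WER is only a rough accuracy figure.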
1.5 Motivation
The technology has a burgeoning range of applications in education, from captioning video
and television for the hearing-impaired to voice-controlled computer operation and dictation.
Most commercial SR software applications were developed for dictation with
punctuation, not for transcribing extemporaneous speech, which is structurally and
grammatically different from written prose.
Transcripts produced as a continuous, unbroken stream of text are additionally difficult to
read and interpret without punctuation or formatting.
These technologies have been applied to automatically transcribe instructors' lectures
and process the transcriptions to obtain near-verbatim lecture transcripts for students.
Producing lecture transcripts has been shown to enhance both learning and teaching.
Students can make up for missed lectures as well as corroborate the accuracy of their
own notes from the lectures they attended.
1.6 Proposed Solution
The technologies were evaluated in different classroom environments to assist
students by automatically converting oral lectures into text.
The resulting text is then provided to the students. This helps students recall
what was said in class, and because the lecture is now text, students can search it for a
particular term or topic.
The input is any audio sample, such as speech recorded with a mobile phone or other
device. The audio is uploaded to our software, and the system converts it into a text
file. Each text file is emailed to the students by the software, using JMS (Java Mailing
Services), to collect their feedback. If errors occur or technical words are transcribed
incorrectly, the student sends feedback to the admin, and the results are compared on the
basis of that feedback.
This feedback is used to improve the technology and reduce errors. The transcripts help
students review a lecturer's session on a specific subject, such as data mining,
image processing, or C programming. Students can also search the records to find
the required file.
Input: an audio file (any speech recorded using a mobile phone or other device).
Output: a text transcription of the file (the audio file is converted into a text file).
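The search step described above, in which students look up a term across the stored
transcripts, can be sketched in a few lines of Python. This is only an illustration under
stated assumptions: the function name, the file names and the transcript text are all
hypothetical, and the report does not specify how the developed software implements search.

```python
def search_transcripts(transcripts, term):
    """Return (file name, line number, line) for every transcript line containing the term."""
    needle = term.lower()
    hits = []
    for name, text in transcripts.items():
        for line_no, line in enumerate(text.splitlines(), start=1):
            if needle in line.lower():  # case-insensitive substring match
                hits.append((name, line_no, line.strip()))
    return hits


# Hypothetical transcripts keyed by file name.
transcripts = {
    "datamining_lecture1.txt": "Today we cover clustering.\nK-means is a clustering method.",
    "c_programming_lecture1.txt": "Pointers store addresses.",
}
results = search_transcripts(transcripts, "Clustering")
```

Returning the file name with each hit lets a student jump from a search term to the lecture
in which it was spoken.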
Advantage
The words should be clearly articulated by the speaker, especially word endings.
If errors occur or technical words are transcribed incorrectly, the student sends
feedback to the admin, and the results are compared on the basis of that feedback.
Each text file is emailed to the students by the software, using JMS (Java Mailing
Services), to collect their feedback.
When gesturing or using other nonverbal cues to demonstrate a point, the speaker should
describe what is being done so that it is captured in the transcription.
Product functionality
Voice recognition software, also called speech recognition software, has made great
strides in the last few years, meaning you'll no longer need to wait until the warp drive is
invented. The technology is ready now. Voice recognition software can
streamline your workflow, allowing you to work as fast as you can speak instead of as fast
as you can type and move the mouse. And even though the keyboard and the mouse have
been around long enough for everyone to become proficient at using them, there are still
some advantages to using software operated by your voice instead.
The biggest advantage is the ability to do hands-free computing. If your job requires
you to jump back and forth from your computer to other tasks, this type of software can
allow you to do both things at once. It is also a huge benefit to those with limited mobility or
some other kind of disability that makes it difficult to use a keyboard and mouse.
1.7 Organization of the Report
This report focuses on speech-to-text conversion. The main body of the report is
preceded by a detailed table of contents including lists of figures, tables, and a glossary.
The body of the report is divided into 7 chapters.
Chapter 1: Introduction
Chapter 2: Software Requirement Specification
Chapter 3: High Level Design
Chapter 4: Detailed Design
Chapter 5: Implementation
Chapter 6: Software Testing
Chapter 7: Conclusion
References