A PROJECT REPORT
on
“TRANSCRIBER AI”
Submitted in partial fulfillment of the requirement for the award of the degree
Bachelor of Engineering
in
Computer Science and Engineering
by
AKASH ROSHAN R : 1VI20CS005
DARSHAN S M : 1VI21CS401
Certificate
Certified that the project work entitled “TRANSCRIBER AI” carried out jointly by Akash
Roshan R (1VI20CS005), Gaurav Singh (1VI20CS042), Keerthi Kumar V (1VI20CS058)
and Darshan S M (1VI21CS401), who are bonafide students of Vemana Institute of Technology,
in partial fulfillment for the award of Bachelor of Engineering in Computer Science and
Engineering of the Visvesvaraya Technological University, Belagavi during the year
2023-24. It is certified that all corrections/suggestions indicated for internal assessment have
been incorporated in the report. The project report has been approved as it satisfies the
academic requirements in respect of the project work prescribed for the said degree.
External Viva
Internal Examiner External Examiner
1.
2.
DECLARATION BY THE CANDIDATES
We the undersigned solemnly declare that the project report “TRANSCRIBER AI” is based
on our own work carried out during the course of our study under the supervision of Ms. J
Brundha Elci.
We assert the statements made and conclusions drawn are an outcome of our project work.
We further certify that,
a. The work contained in the report is original and has been done by us under the general
supervision of our supervisor.
b. The work has not been submitted to any other Institution for any other
degree/diploma/certificate in this university or any other University of India or abroad.
c. We have followed the guidelines provided by the university in writing the report.
d. Whenever we have used materials (data, theoretical analysis, and text) from other
sources, we have given due credit to them in the text of the report and their details are
provided in the references.
Date:
Place:
ACKNOWLEDGEMENT
First we would like to thank our parents for their kind help, encouragement and moral
support.
We would like to place on record our regards to Dr. M. Ramakrishna, Professor and Head,
Department of Computer Science and Engineering for his continued support.
We would like to thank our project coordinators Mrs. Mary Vidya John, Assistant Professor
and Mrs. Vijayashree HP, Assistant Professor, Dept. of CSE for their support and
coordination.
We would like to thank our project guide Ms. J Brundha Elci, Designation, Dept. of CSE
for her continuous support, valuable guidance and supervision towards successful
completion of the project work.
We also thank all the Teaching and Non-teaching staff of Computer Science and Engineering
Department, who have helped us to complete the project in time.
ABSTRACT
This project focuses on the development of an application that enables real-time audio-to-text
transcription. Leveraging wireless Bluetooth technology, the application captures and
transcribes spoken words from both the device's built-in microphone and external Bluetooth
devices. "Transformative Bluetooth Communication Device for Inclusive Communication"
encapsulates the development of a groundbreaking device aimed at overcoming
communication barriers. This Bluetooth device integrates a Bluetooth transmitter and a
sophisticated microphone, facilitating real-time spoken word transmission. Coupled with a
dedicated application powered by advanced speech-to-text technology, the device offers
instantaneous transcription, making communication universally accessible. Emphasizing
hands-free convenience and portability, the project envisions a transformative bridge catering
to individuals with diverse language backgrounds and hearing impairments. Beyond
immediate functionalities, the project has wide-ranging implications for inclusivity, adaptation
to various environments, potential integration with existing platforms, and future research in
speech recognition and wearable technology. This abstract outlines a visionary venture set to
redefine communication paradigms and empower users through technology.
Key words: Wearable Technology, Speech Recognition, Natural Language Processing (NLP),
Information Retrieval, Local Transcription.
CONTENTS
Content Details Page No.
Title Page i
Bonafide Certificate ii
Declaration iii
Acknowledgement iv
Abstract v
Contents vi
List of Figures ix
List of Tables x
List of Abbreviations xi
Chapter 1 Introduction 1
1.1 Introduction 1
1.2 Scope 2
1.3 Objectives 4
2.1 Paper 1 8
2.2 Paper 2 9
2.3 Paper 3 10
2.4 Paper 4 11
2.5 Paper 5 13
2.6 Paper 6 14
2.7 Paper 7 16
2.8 Paper 8 17
2.9 Paper 9 18
2.10 Paper 10 19
3.1.1 Drawbacks 26
5.4.2 Text Translation Module 37
5.4.4 Storage 39
6.1 Introduction 42
Chapter 8 Summary 53
9.1 Conclusions 54
References 55
Appendix A 56
LIST OF FIGURES
LIST OF TABLES
LIST OF ABBREVIATIONS
Abbreviation Description
A2DP Advanced Audio Distribution Profile
STT Speech-to-Text
CHAPTER 1
INTRODUCTION
1.1 Introduction
The project unveils a Bluetooth device, seamlessly merging technology and inclusivity to
address communication barriers. Featuring a Bluetooth transmitter and integrated microphone,
it's a visionary solution for individuals with diverse language backgrounds and hearing
impairments. The Bluetooth transmitter enables real-time spoken word transmission, while the
integrated microphone captures nuanced human expression. This data integrates with a mobile
app, utilizing advanced speech-to-text tech for real-time transcription, a commitment to
inclusivity. Serving as a transformative bridge, this device prioritizes hands-free convenience
and portability, reshaping the essence of communication for a future marked by connectivity
and universal accessibility. "Transformative Bluetooth Device for Inclusive Communication,"
encapsulates the development of a groundbreaking device aimed at overcoming communication
barriers. This Bluetooth device integrates a Bluetooth transmitter and a sophisticated
microphone, facilitating real-time spoken word transmission. Coupled with a dedicated mobile
application powered by advanced speech-to-text technology, the device offers instantaneous
transcription, making communication universally accessible. Emphasizing hands-free
convenience and portability, the project envisions a transformative bridge catering to
individuals with diverse language backgrounds and hearing impairments. Beyond immediate
functionalities, the project has wide-ranging implications for inclusivity, adaptation to various
environments, potential integration with existing platforms, and future research in speech
recognition and Bluetooth technology. In essence, the project outlines a visionary venture set to redefine
communication paradigms and empower users through technology.
interaction. This aspect is particularly vital for users who rely on visual cues and facial
expressions as part of their communication.
The device doesn't stop at facilitating spoken communication; it also integrates with a dedicated
mobile application. This app utilizes state-of-the-art speech-to-text technology to provide real-
time transcription of conversations. This feature is invaluable for individuals with hearing
impairments or those who prefer written communication. By offering multiple modes of
communication within one device, the Bluetooth device ensures inclusivity and accessibility
for all users. In terms of usability, the device is designed with hands-free convenience and
portability in mind. Its compact form factor and lightweight construction make it easy to use
throughout the day, allowing users to communicate effortlessly in various environments.
Whether in a busy public space or a quiet intimate setting, the device empowers users to engage
in meaningful interactions without cumbersome equipment or logistical barriers. The
implications of this project extend far beyond its immediate functionalities. It has the potential
to shape the future of assistive technology, paving the way for advancements in speech
recognition, Bluetooth devices, and adaptive communication tools. Moreover, its seamless
integration with existing platforms opens up possibilities for enhanced connectivity and
interoperability across different communication channels.
1.2 Scope
The scope of our project is to create a Bluetooth device, incorporating a transmitter and
advanced speech-to-text technology, to address communication challenges faced by individuals
with diverse language backgrounds and hearing impairments. The project aims for real-time
transcription, offering hands-free convenience and portability. The dedicated mobile
application enhances user experience with customization options. The device's adaptability to
various environments, potential integration with existing platforms, and acknowledgment of
global linguistic diversity contribute to its broader impact. Beyond immediate functionalities,
the project envisions empowering individuals through technology and serves as a stepping
stone for future applications and research opportunities in speech recognition and Bluetooth
technology. The scope of the Bluetooth device project is extensive, encompassing various
aspects of technology development, user experience design, market research, and potential
applications. Here's a breakdown of its scope:
• Exploring new features, enhancements, and potential applications for the Bluetooth
device, as well as contributing to advancements in speech recognition, Bluetooth
technology, and inclusive design.
1.3 Objectives
• Develop a system for instant conversion of spoken words into written text with a focus
on low latency for real-time transcription.
• Seamlessly incorporate Bluetooth technology to enable wireless microphone audio
input, offering users flexibility with external audio sources.
• Optimize the transcription process by implementing local audio processing on mobile
devices, handling input from both Bluetooth microphones and built-in device
microphones.
• Design an intuitive user interface, allowing users to easily monitor transcription
progress, connect to Bluetooth devices, and access transcribed text.
• Utilize audio preprocessing techniques to enhance transcription accuracy, particularly
in challenging acoustic conditions.
• Develop robust error-handling mechanisms to gracefully manage issues such as
Bluetooth disconnections or errors in the transcription process.
• Incorporate customizable settings for users to tailor the transcription process, including
options for language selection, transcription formats, and output preferences.
These objectives are strategically designed to create a flexible, user-friendly, and highly
functional Bluetooth device that not only meets the immediate needs of its users but also lays
the groundwork for future innovations in communication and assistive technologies. A
minimal sketch of the microphone-capture path implied by these objectives is shown below.
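To make the audio-input objective concrete, the following minimal Python sketch (an illustration only, assuming the PyAudio library and a Bluetooth microphone already paired at the operating-system level so that it appears as an ordinary input device) lists the available input devices and opens a capture stream on the one the user selects; the captured PCM frames would then be handed to the transcription module.

# Minimal sketch: selecting a (possibly Bluetooth) input device and capturing audio.
# Assumes the Bluetooth microphone is already paired and exposed by the OS as an input device.
import pyaudio

RATE = 16000          # 16 kHz mono is a common choice for speech recognition
CHUNK = 4000          # frames per read (roughly 0.25 s at 16 kHz)

pa = pyaudio.PyAudio()

# List available input devices so the user can pick the Bluetooth microphone.
for i in range(pa.get_device_count()):
    info = pa.get_device_info_by_index(i)
    if info.get("maxInputChannels", 0) > 0:
        print(f"[{i}] {info['name']}")

device_index = int(input("Select input device index: "))

stream = pa.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                 input=True, input_device_index=device_index,
                 frames_per_buffer=CHUNK)

try:
    while True:
        data = stream.read(CHUNK, exception_on_overflow=False)
        # 'data' holds raw 16-bit PCM; hand it to the transcription module here.
except KeyboardInterrupt:
    pass
finally:
    stream.stop_stream()
    stream.close()
    pa.terminate()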
On November 8th, 2023, the project domain and title, "Transcriber AI," were officially
finalized. Following approval by both the project guide and the project committee, our team
initiated an extensive literature survey. This survey encompassed a thorough examination of
research papers and publications with objectives akin to ours. After analyzing papers from
reputable journals, we identified key methodologies and technologies relevant to our project.
Subsequently, we focused on identifying the necessary technologies and methodologies
essential for implementing our transcriber AI. This phase involved comprehensive research
to identify cutting-edge technologies and methodologies in the fields of speech recognition
and real-time data processing.
December 2023:
• Finalization and Approval: Finalize and obtain approval for the project title and domain.
• Module Decision and Component Selection: Decide on the specific modules to include
in the project, such as speech recognition, language processing, and translation
modules. Select and purchase necessary components.
January 2024:
• Hardware Prototype Development: Learn about and connect necessary hardware, such
as servers or specialized processors that could handle real-time data processing. Start
creating a hardware prototype suitable for handling transcription and translation.
April 2024:
• Integration and Testing: Integrate the software with the hardware, conduct thorough
testing of the real-time transcription, translation, and summarization capabilities.
• Project Completion: Finalize all components of the system, ensuring the model and
application operate seamlessly in real-time scenarios.
• Chapter 2 Literature Survey: Conducted as part of the project's literature review, this
chapter presents a comprehensive list of base and reference papers that have been
influential in shaping the project. It goes beyond listing these papers, providing an in-
depth analysis of their key findings and contributions, and includes a comparative
analysis.
• Chapter 3 System Analysis: Delves into the foundational aspects essential for
understanding and evaluating both the existing and proposed systems. This chapter
delineates the functional and non-functional requirements crucial for the project's
success. Functional requirements detail the specific functionalities and capabilities
expected from the system, such as real-time transcription and translation tasks. Non-
functional requirements encompass performance, scalability, security, and usability
considerations, ensuring the system meets quality standards and user expectations.
• Chapter 9: Conclusion and Future Work: The conclusion chapter summarizes the
project's objectives, emphasizing the creation of an efficient and user-friendly
Transcriber AI system. It discusses potential future enhancements, such as online cloud
storage, hardware controls integration, offline translation libraries, and AI-generated
summarization algorithms.
CHAPTER 2
LITERATURE SURVEY
2.1 Sadaoki Furui, Tomonori Kikuchi, Yousuke Shinnaka, and Chiori Hori, Speech-to-Text
and Speech-to-Speech Summarization of Spontaneous Speech, IEEE Transactions on
Speech and Audio Processing, vol. 12, no. 4, July 2004
The paper addresses the problem of speaker diarization, which is the process of determining
who is speaking when in a conversation or meeting. The authors are motivated by the
challenges of real-time diarization, such as segmenting and separating overlapping speech,
estimating the number and location of speakers, and tracking speaker movement and change.
To tackle these challenges, they propose a novel speaker diarization system that incorporates
spatial information from a circular differential directional microphone array (CDDMA).
The system consists of four components: audio segmentation based on beamforming and voice
activity detection, speaker diarization based on joint clustering of speaker embeddings and
source localization, a phonetically-aware coupled network to normalize the effects of speech
content, and separation of overlapping speech based on state estimation and unimodality test.
The authors evaluate their system on a corpus of real and simulated meeting conversations
recorded by the CDDMA. They compare their system with several baselines and show that
their system achieves significant improvements in terms of diarization error rate (DER) and
segmentation accuracy.
They also demonstrate the advantages of the CDDMA design over other beamforming
methods. This paper represents a significant contribution to the field of speaker diarization.
Evaluation experiments were conducted on spontaneous Japanese presentation speeches from
the Corpus of Spontaneous Japanese (CSJ).
For speech-to-text summarization, the two-stage method achieved better performance than a
one-stage approach, especially at lower summarization ratios (higher compression).
Combining multiple scoring measures (linguistic, significance, and confidence) for sentence
extraction further improved results. For speech-to-speech summarization, subjective
evaluation at a 50% summarization ratio showed that sentence units achieved the best results
in terms of ease of understanding and appropriateness as a summary. Word units performed
worst due to unnatural sound caused by concatenating many short segments, while between-
filler units showed promising results comparable to sentence units.
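The exact scoring functions and weights used in the paper are learned from data; purely as an illustration of the idea of combining linguistic, significance, and confidence measures for sentence extraction, the toy Python sketch below ranks sentences by a weighted sum of made-up scores and keeps the top fraction given by the summarization ratio.

# Toy illustration of extractive selection by combining multiple scores per sentence.
# The weights and score values are placeholders, not the paper's trained parameters.
def select_sentences(sentences, ratio=0.5, w_ling=1.0, w_sig=1.0, w_conf=1.0):
    """sentences: list of dicts with 'text', 'linguistic', 'significance', 'confidence'."""
    scored = [
        (w_ling * s["linguistic"] + w_sig * s["significance"] + w_conf * s["confidence"], i)
        for i, s in enumerate(sentences)
    ]
    keep = sorted(scored, reverse=True)[: max(1, int(len(sentences) * ratio))]
    # Re-emit the selected sentences in their original order to preserve coherence.
    return [sentences[i]["text"] for _, i in sorted(keep, key=lambda t: t[1])]

example = [
    {"text": "The model was evaluated on the CSJ corpus.", "linguistic": 0.8, "significance": 0.9, "confidence": 0.7},
    {"text": "Um, so, yeah.", "linguistic": 0.2, "significance": 0.1, "confidence": 0.9},
    {"text": "Two-stage summarization reduced error carry-over.", "linguistic": 0.7, "significance": 0.8, "confidence": 0.6},
]
print(select_sentences(example, ratio=0.5))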
Advantages:
• Addresses challenges of spontaneous speech like recognition errors and redundant
information through two-stage summarization.
• Utilizes multiple scoring measures for sentence extraction, leveraging linguistic
naturalness and acoustic reliability.
• Preserves speaker's voice and prosody in speech-to-speech summarization, conveying
additional meaning and emotion.
Limitations:
• Evaluation conducted on a small dataset may limit generalizability, requiring
assessment on larger, diverse corpora.
• Relies primarily on textual information, lacking explicit incorporation of prosodic
features which could enhance identification of important speech segments.
• Concatenating speech units may introduce acoustic discontinuities and unnatural
sounds, necessitating further improvements for natural-sounding summaries.
2.2 Sridhar Krishna Nemala and Mounya Elhilali, Multilevel speech intelligibility for
robust speaker recognition, The Johns Hopkins University, Baltimore
In the real world, natural conversational speech is an amalgam of speech segments, silences
and environmental/background and channel effects. Labelling the different regions of an
acoustic signal according to their information levels would greatly benefit all automatic speech
processing tasks. A novel segmentation approach based on a perception-based measure of
speech intelligibility is proposed. Unlike segmentation approaches based on various forms of
voice-activity detection (VAD), the proposed parsing approach exploits higher-level perceptual
information about signal intelligibility levels. This labelling information is integrated into a
novel multilevel framework for the automatic speaker recognition task. The system processes the
input acoustic signal along independent streams reflecting various levels of intelligibility and
then fuses the decision scores from the multiple streams according to their intelligibility
contribution. The results show that the proposed system achieves significant improvements
over standard baseline and VAD based approaches, and attains a performance similar to the
one obtained with oracle speech segmentation information. With the advent of E-commerce
technology, the importance of non-intrusive and highly reliable methods for personal
authentication has been growing rapidly. Voice prints being the most natural form of
communication, and being already used widely in spoken dialog systems, have significant
advantage over other biometrics such as retina scans, face, and finger prints.
Advantages:
Limitations:
2.3 Xavier Anguera Miro, Simon Bozonnet, Nicholas Evans, Corinne Fredouille, Gerald
Friedland and Oriol Vinyals, Speaker Diarization: A Review of Recent Research,
IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 2, February 2012
Speaker diarization, the process of determining “who spoke when?” in an audio or video
recording with multiple speakers, is a field of growing interest. This is particularly true for
conference meetings, which present unique challenges such as background noise,
reverberation, overlapping speech, and spontaneous speech style. The main approaches to
speaker diarization include bottom-up and top-down clustering, as well as information-
theoretic and Bayesian methods. Most systems utilize hidden Markov models with Gaussian
mixture models to represent speakers and segments. The primary algorithms involved in
speaker diarization include data preprocessing, speech activity detection, acoustic
beamforming, feature extraction, cluster initialization, distance computation, cluster merging
or splitting, stopping criterion, and resegmentation.
New research directions are emerging in the field, including handling overlapping speech,
incorporating multimodal information, system combination, and exploring nonparametric
Bayesian methods. Performance evaluation of speaker diarization systems is typically based
on the NIST Rich Transcription evaluations, which provide standard datasets and metrics. The
current state-of-the-art systems achieve around a 10% diarization error rate on conference
meeting data. This field continues to evolve, with ongoing research aimed at improving the
accuracy and efficiency of speaker diarization systems.
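As a generic illustration of the bottom-up clustering step described above (not the specific HMM/GMM systems surveyed in the paper), the following Python sketch applies average-linkage agglomerative clustering with cosine distance to stand-in speaker embeddings; SciPy and NumPy are assumed, and the distance threshold plays the role of the stopping criterion.

# Toy sketch of bottom-up (agglomerative) clustering of speaker embeddings.
# Embeddings here are random stand-ins; a real system would use x-vectors or similar.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
# Pretend we have 10 speech segments, each represented by a 128-dimensional embedding.
embeddings = rng.normal(size=(10, 128))

# Pairwise cosine distances between segment embeddings.
dists = pdist(embeddings, metric="cosine")

# Average-linkage agglomerative clustering; the threshold acts as the stopping criterion.
tree = linkage(dists, method="average")
labels = fcluster(tree, t=0.9, criterion="distance")

for seg_id, spk in enumerate(labels):
    print(f"segment {seg_id} -> speaker {spk}")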
Advantages:
• Speaker diarization enhances meeting management by automatically identifying
speakers, aiding in organizing discussions and summarizing key points.
• Provides speaker labels, improving accessibility of audio/video recordings for
individuals with hearing impairments or those who prefer transcripts.
• Automated speaker diarization reduces time and effort required for transcription tasks,
saving resources in processing recordings with multiple speakers.
Limitations:
• Speaker diarization systems may struggle with accurately identifying speakers in
challenging acoustic environments, affecting accuracy.
• Some algorithms require significant computational resources, limiting practical
applicability, especially for large datasets or real-time streams.
• Variability in speakers' voices, accents, and speech patterns can lead to errors in speaker
segmentation and identification, impacting system reliability.
2.4 Dongmahn SEO, Suhyun KIM, Gyuwon SONG, and Seung-gil HONG, Speech-to-Text-based
Life Log System for Smartphones, IEEE International Conference on Consumer
Electronics (ICCE), 2014
A life log is a digital record of a person's daily life. Many life log projects have been
proposed. In life log projects, digital data are collected using cameras, wearable devices, and a
remote controller. However, legacy projects rely on wearable or special-purpose devices that
are not in common use, or they focus only on photographs and videos. Speech recognition
technologies based on word dictionaries offer sufficient accuracy. Current speech-to-text
technologies for sentences and paragraphs are less accurate than word dictation. However, it is
assumed that searching recorded voice files remains feasible as long as sentence- or
paragraph-level dictation accuracy exceeds fifty percent, because users search with only a few
keywords, and these keywords are likely to appear several times in a voice file since the topic
of a speech or dialog is closely related to them.
A smartphone is a suitable device for life logging, because smartphones accompany people 24
hours a day, 7 days a week, and provide various sensors, a camera, a microphone, computing
power, and network connectivity. In this paper, a new life log system using user voice recording
with dictated texts is proposed. The proposed system records the user's voice and phone calls
using smartphones whenever the user wants. Recorded sound files are dictated into text files by
a speech-to-text service. Recorded sound files and dictated text files are stored in a life log
server. The server provides life log file lists and searching features for life log users. A life log
system provides a real-time voice recording with a speech-to-text feature over smartphone
environments. The proposed system records data of the user's life using the smartphone's
microphone. Recorded data are sent to a server, analysed, and stored. Recorded data are dictated
using a speech-to-text service, and saved as text files. The proposed system is implemented as
a prototype system and evaluated. Users of the system are able to search their life log sound
files using text. The life log server manages life log data files, which are pairs of a sound file
and a text file, and provides a web service for a user interface. To dictate sound files, a speech-
to-text (STT) module controls a flac encoding module and an STT request module. The flac
encoding module converts .WAV format files which are received from the life log application
into .FLAC format files. The STT request module requests dictated text files from sound files
converted by the flac encoding module to the speech-to-text server. A life log web server
provides life log data stored in the life log server to the life log web client.
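The sketch below illustrates the dictation pipeline described above, converting a recorded .WAV file to .FLAC and posting it to a speech-to-text service; the soundfile and requests libraries are assumed, and the service URL and file names are placeholders rather than the paper's actual components.

# Sketch of the life-log dictation step: convert a .WAV recording to .FLAC,
# then send it to a speech-to-text service. The endpoint URL and file names are placeholders.
import soundfile as sf
import requests

def wav_to_flac(wav_path: str, flac_path: str) -> None:
    audio, rate = sf.read(wav_path)
    sf.write(flac_path, audio, rate, format="FLAC")

def request_transcript(flac_path: str, stt_url: str = "https://example.com/stt") -> str:
    # Hypothetical REST endpoint; a real system would use its STT provider's API.
    with open(flac_path, "rb") as f:
        resp = requests.post(stt_url, files={"audio": f})
    resp.raise_for_status()
    return resp.json().get("transcript", "")

if __name__ == "__main__":
    wav_to_flac("call_recording.wav", "call_recording.flac")
    print(request_transcript("call_recording.flac"))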
Advantages:
• Enables convenient and ubiquitous life logging using widely available smartphones,
eliminating the need for dedicated wearable devices.
• Leverages smartphone's microphone and GPS capabilities for multimodal data capture,
providing a comprehensive record of user's experiences.
• Allows for efficient navigation and access to life log data by transcribing audio files
into text, enabling keyword-based search and retrieval.
Limitations:
• Relies on speech-to-text service accuracy for transcription, which may be limited by
accents, background noise, and conversational speech, affecting searchability.
• Continuous audio recording on mobile devices raises privacy concerns by capturing
user's voice, conversations, and environmental sounds, necessitating appropriate
privacy measures. Additionally, it may strain device resources, requiring efficient
encoding, compression, and scheduling to mitigate battery drain and network usage.
The paper presents a real-time meeting analyzer that recognizes “who is speaking what” and
provides assistance for meeting participants or audience. It uses a microphone array and an
omni-directional camera to capture audio and visual information, and performs speech
enhancement, speech recognition, speaker diarization, acoustic event detection, sentence
boundary detection, topic tracking, and face pose detection to analyze the meeting content and
context. The system employs three techniques to improve the speech recognition accuracy for
distant speech: dereverberation, source separation, and noise suppression.
The system also performs speaker diarization based on direction of arrival estimation and
VAD. The system uses a WFST-based decoder with one-pass decoding and error correction
language models. The system also uses diarization-based word filtering to reduce insertion
errors caused by non-target speech. The authors compare different models for extracting and
tracking the topics of an ongoing meeting based on speech transcripts. They show that their
proposed topic tracking model (TTM) outperforms the semibatch latent Dirichlet allocation
(LDA) and the dynamic mixture model (DMM) in terms of perplexity and entropy. The authors
summarize the main contributions of their system for low-latency real-time meeting analysis,
which integrates audio, speech, language, and visual processing modules.
They report the experimental results for speaker diarization, speech recognition, acoustic event
detection, and topic tracking on real meeting data. They also discuss the future work for
improving the system’s accuracy and usability. The paper contains many references to previous
works on meeting recognition, speech enhancement, speech recognition, topic modelling, and
other related topics. The ASR system incorporates a novel diarization-based word filtering
(DWF) technique to reduce insertion errors caused by residual non-target speech in the
separated channels.
The DWF method utilizes the frame-based speaker diarization results obtained during the audio
processing stage to determine the relevance of each recognized word to the target speaker.
Words with low relevance are filtered out, improving the overall recognition accuracy. In
parallel with speech recognition, the system performs acoustic event detection, particularly for
laughter, using HMM-based models. This information is used to monitor the atmosphere of the
meeting and detect casual moments. Sentence boundary detection is applied to the recognized
transcripts to improve readability, and topic tracking is performed using the Topic Tracking
Model (TTM) to extract relevant topic words from the conversation. The paper contains many
references to previous works on meeting recognition, speech enhancement, speech recognition,
topic modelling, and other related topics.
Advantages:
• Operates in real-time with minimal delay, providing transcripts and speaker activities
promptly, enabling near real-time monitoring of meetings.
• Utilizes advanced audio processing techniques such as dereverberation, source
separation, and noise suppression to enhance speech signals in challenging
environments, improving subsequent speech recognition accuracy.
• Implements robust speaker diarization and word filtering techniques to reduce insertion
errors and improve recognition accuracy and speaker attribution by leveraging frame-
based diarization results.
Limitations:
• Relies on limited training data for adapting acoustic and language models, potentially
limiting overall system performance.
• Involves significant computational complexity due to multiple complex components,
hindering deployment on resource-constrained platforms.
• May experience latency issues in certain scenarios, such as rapid topic shifts in
discussions, and face detection delays affecting synchronization with audio cues,
impacting real-time performance.
2.6 Siqi Zheng, Weilong Huang, Xianliang Wang, Hongbin Suo, Jinwei Feng and Zhijie
Yan, A real-time speaker diarization system based on spatial spectrum, ICASSP 2021
- 2021 IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP)
Speaker diarization is a process of finding the optimal segmentation based on speaker identity
and determining the identity of each segment. It is also commonly stated as a task to answer
the question “who speaks when”. It tries to match a list of audio segments to a list of different
speakers. Unlike the speaker verification task, which is a simple one-to-one matching, speaker
diarization matches M utterances to N speakers, and, in some situations, N is often unknown.
Agglomerative hierarchical clustering (AHC) over speaker embeddings extracted from neural
networks has become the commonly accepted approach for speaker diarization. Variational
Bayes (VB) Hidden Markov Model has proven to be effective in many studies. A refinement
process is usually applied after clustering.
VB re-segmentation is often selected as means for refinement. The real-time requirement poses
another challenge for speaker diarization. To be specific, at any particular moment, it is
required to determine whether a speaker change incidence occurs at the current frame within a
delay of less than 500 milliseconds. This restriction makes refinement process such as VB
resegmentation extremely difficult. Since we can only look 500 milliseconds ahead, the speaker
embedding extracted from the speech within that short window can be biased. Therefore,
finding the precise timestamp of speaker change in a real-time manner still remains quite
intractable for conventional speaker diarization technique. In this paper a speaker diarization
system is described that enables localization and identification of all speakers present in a
conversation or meeting.
A novel systematic approach is proposed to tackle several long-standing challenges in speaker
diarization tasks, to segment and separate overlapping speech from two speakers, to estimate
the number of speakers when participants may enter or leave the conversation at any time, to
provide accurate speaker identification on short text-independent utterances, to track down
speakers’ movement during the conversation, to detect speaker change incidence real-time.
First, a differential directional microphone array-based approach is exploited to capture the
target speakers’ voice in far-field adverse environment.
Second, an online speaker-location joint clustering approach is proposed to keep track of
speaker location.
Third, an instant speaker number detector is developed to trigger the mechanism that separates
overlapped speech. The results suggest that the system effectively incorporates spatial
information and achieves significant gains.
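Purely as a toy illustration of detecting a speaker change within a short look-ahead window (the actual system additionally exploits spatial information from the CDDMA array and NN-VAD), the sketch below compares embeddings of the recent past and of the look-ahead window using cosine similarity and flags a change when the similarity drops below a threshold.

# Toy illustration of frame-level speaker-change detection with a short look-ahead.
# Embeddings are random stand-ins; the paper's system also uses spatial information.
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))

def change_detected(past_embedding, lookahead_embedding, threshold=0.6):
    return cosine(past_embedding, lookahead_embedding) < threshold

rng = np.random.default_rng(1)
past = rng.normal(size=64)
same_speaker = past + 0.1 * rng.normal(size=64)   # small perturbation of the same voice
new_speaker = rng.normal(size=64)                 # unrelated embedding

print(change_detected(past, same_speaker))   # likely False
print(change_detected(past, new_speaker))    # likely True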
Advantages:
• Utilizes a differential directional microphone array and CDDMA design for effective
far-field speech capture, aiding accurate speaker localization.
• The system's real-time speaker diarization capability accurately detects speaker change
incidences within a delay of less than 500 milliseconds leveraging spatial information
and joint efforts of localization and NN-VAD.
Limitations:
• Relies heavily on the use of a microphone array, specifically the CDDMA design,
introducing a hardware dependency and potentially limiting the system's applicability
in scenarios where microphone arrays are not available or feasible to deploy.
2.7 Yu-An Chung, Wei-Hung Weng, Schrasing Tong and James Glass, Towards
Unsupervised Speech-To-Text Translation, Computer Science and Artificial
Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge
prior knowledge about the target language. Experimental results show that the unsupervised
system achieves comparable BLEU scores to supervised end-to-end models despite the lack of
supervision.
Advantages:
• Low resource requirement: No need for labelled data, making it applicable to languages
with limited resources.
• Scalability: Can potentially work for a wide range of languages without language-
specific resources.
Limitations:
• Translation quality may vary: Dependent on factors like language complexity and data
availability.
• Limited language coverage: Effectiveness reliant on the quality and availability of
monolingual corpora.
2.8 Md Ashraful Islam Talukder, Sheikh Abujar, Abu Kaisar Mohammad Masum,
Sharmin Akter and Syed Akhter Hossain, Comparative Study on Abstractive Text
Summarization, IEEE – 49239
The paper presents a comparative study on abstractive text summarization, a task that involves
generating a shorter text while preserving the main meaning of the original text. It reviews
three different methodologies: a word graph-based model, a semantic graph reduction model,
and a hybrid approach using Markov clustering. The word graph-based model, proposed for
Vietnamese abstractive text summarization, reduces and combines sentences based on
keywords, discourse rules, and syntactic constraints. This model aims to overcome some of the
drawbacks of existing word graph methods, such as producing incorrect or ungrammatical
sentences. The semantic graph reduction model uses a rich semantic graph to represent the
input text and applies heuristic rules to reduce the graph using WordNet relations. This model
is designed to capture the semantic meaning of the text and generate coherent summaries.
Lastly, the hybrid approach uses Markov clustering to group sentences based on their relations
and then applies rule-based sentence fusion and compression techniques to generate
summaries. This approach is argued to produce concise and informative summaries. The paper
asserts that these methodologies offer promising solutions to the challenges in abstractive text
summarization.
Advantages:
• Methodological Diversity: The paper examines three distinct approaches for abstractive
text summarization, offering a comprehensive overview for comparison and
understanding.
• Addressing Drawbacks: Models incorporate discourse rules, syntactic constraints, and
semantic information from WordNet, enhancing grammatical correctness, coherence,
and capturing underlying meaning.
Limitations:
• Evaluation: Lack of thorough evaluation and validation on diverse datasets hinders
assessment of effectiveness and generalizability.
• Language Specificity: Some models are tailored to Vietnamese, potentially limiting
applicability to other languages.
• Complexity: Certain methodologies introduce computational complexity, potentially
hindering scalability for real-time applications.
• Subjectivity: Inherent interpretation in abstractive summarization may pose challenges
in accurately reflecting original text across contexts.
2.9 Lili Wan, Extraction algorithm of English Text Summarization for English Teaching,
2018 International Conference on Intelligent Transportation, Big Data and Smart
City
In order to improve the sharing and scheduling capability of English teaching resources, an
improved algorithm for English text summarization is proposed based on association semantic
rules. Relative features are mined among English text phrases and sentences; semantic
relevance analysis and feature extraction of keywords in English abstracts are performed;
association-rule differentiation for English text summarization is derived based on information
theory; and related semantic-rule information in English teaching texts is mined. The text
similarity feature is taken as the maximum difference component between two semantic
association rule vectors and, combined with semantic similarity information, enables accurate
extraction of English text summaries.
The simulation results show that the method extracts text summaries accurately and has better
convergence and precision in the extraction process. With the rapid development of computer
and network technology, a variety of network media are growing rapidly and English teaching
materials are increasing; a large number of English articles and news sources contain vast
amounts of English text, making comprehensive reading difficult. It is therefore necessary to
extract English text summaries effectively and pull out the key information, so as to improve
the quality of English teaching.
Advantages:
• Association Semantic Rules: Utilizing these rules, the algorithm comprehends nuanced
relationships among English text elements, facilitating accurate summarization.
• Semantic Relevance Analysis: Integration of this analysis enhances identification of
crucial information and key concepts within English abstracts.
Limitations:
• Language Specificity: Tailored for English, limiting applicability to other languages
and necessitating significant modifications for adaptation.
• Dependency on Semantic Analysis: Accuracy of summarization hinges on precise
semantic analysis and keyword extraction, posing risks of errors.
• Scalability: Performance may suffer with large text volumes or real-time applications
due to computational complexity.
• Subjectivity: Challenges persist in capturing nuanced semantic relationships,
potentially resulting in variable summary quality across texts and domains.
2.10 Devi Krishnan, Preethi Bharathy, Anagha, and Manju Venugopalan, A Supervised
Approach for Extractive Text Summarization Using Minimal Robust Features,
Proceedings of the International Conference on Intelligent Computing and Control
Systems (ICICCS 2019)
Over the past decade or so the amount of data on the Internet has increased exponentially. Thus
arises the need for a system that processes this immense amount of data into useful information
that can be easily understood by the human brain. Text summarization is one such popular
method under research that opens the door to dealing with voluminous data. It works by
generating a compressed version of the given document by preserving the important details.
Text summarization can be classified into Abstractive and Extractive summarization.
Extractive summarization methods reduce the burden of summarization by selecting a subset
of sentences that are important from the actual document. Although there are many suggested
approaches to this methodology, researchers in the field of natural language processing are
especially attracted to extractive summarization. The model that is introduced in this paper
implements extractive text summarization using a supervised approach with the minimalistic
features extracted. The results reported by the proposed model are quite satisfactory with an
average ROUGE-1 score of 0.51 across 5 reported domains on the BBC news article dataset.
To implement these text summarization approaches, existing machine learning algorithms can
be used, which improve prediction accuracy without explicit programming.
The paper discusses the two primary approaches to text summarization: abstractive and
extractive. While abstractive summarization involves generating new text that captures the
essence of the original document, extractive summarization selects and rearranges existing
sentences to create a summary. The focus of the paper lies on extractive summarization,
particularly in the context of natural language processing, where researchers are increasingly
drawn to its simplicity and effectiveness. The proposed model adopts a supervised approach to
extractive summarization, leveraging minimalistic features extracted from the text. By
employing machine learning techniques, the model learns to identify and prioritize important
sentences for inclusion in the summary. The use of robust features ensures that the model can
effectively capture the essence of the original document while minimizing the computational
overhead. The paper reports satisfactory results from the proposed model, with an average
ROUGE-1 score of 0.51 across five domains using the BBC news article dataset. ROUGE-1 is
a metric commonly used to evaluate the quality of text summarization.
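For intuition about the reported metric, the minimal Python sketch below computes ROUGE-1 precision, recall, and F1 as simple unigram overlap between a candidate summary and a reference; published scores are produced by standard toolkits with additional preprocessing, so this is illustrative only.

# Minimal ROUGE-1 (unigram overlap) computation for intuition only.
from collections import Counter

def rouge1(candidate: str, reference: str):
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-9)
    return precision, recall, f1

candidate = "the model selects important sentences from the article"
reference = "important sentences are selected from the news article by the model"
print(rouge1(candidate, reference))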
Advantages:
• Automation of Routine Tasks: Summarization automates tasks like content curation and
report generation, freeing up human resources.
• Customization and Personalization: Summaries can be customized to cater to individual
preferences and priorities.
Disadvantages:
In Table 2.1, the diverse range of algorithms and platforms showcased underscores the breadth
of approaches adopted in research. By considering a variety of performance metrics, from
accuracy and efficiency to scalability and robustness, these papers offer valuable insights into
the strengths and weaknesses of different methodologies. This comprehensive overview
serves as a valuable resource for researchers and practitioners seeking to navigate the
landscape of audio recording and privacy concerns on mobile devices.
The literature delves into various aspects of speech processing, including automatic
summarization, speaker recognition, real-time meeting analysis, and speech-based life logging,
all contributing to the broader field of natural language understanding and human-computer
interaction. Speech summarization methods aim to distill key information from spoken content,
facilitating efficient review and retrieval, particularly in scenarios where written text is
impractical. Speaker recognition techniques leverage speech intelligibility to enhance
identification accuracy, benefiting security authentication, forensic analysis, and human-
computer interaction. Real-time meeting analysis systems employ advanced techniques for
speech enhancement, speaker diarization, and topic tracking, facilitating effective
communication and collaboration during meetings. Speech-based life logging systems capture
and store audio data from daily experiences, creating a rich repository of personal memories
and contextual information accessible through searchable text. These advancements drive
innovation across domains, enabling more natural and intuitive interactions between humans
and machines.
CHAPTER 3
SYSTEM ANALYSIS
3.1 Existing System
The existing systems, including Google Speech-to-Text and Translation API, Microsoft Azure
Speech Services, IBM Watson Speech to Text and Language Translator, and Amazon
Transcribe and Translate, are esteemed for their prowess in real-time transcription and
translation tasks. These systems offer a comprehensive array of functionalities tailored to meet
the demands of dynamic communication environments. Leveraging powerful speech
recognition models, these systems excel in swiftly converting spoken words into written text
with remarkable accuracy. Trained on vast datasets of speech and text, these models possess
the capability to identify and transcribe individual words and phrases with precision. In the
realm of translation, existing systems employ a variety of techniques, such as statistical
machine translation (SMT) or neural machine translation (NMT). SMT relies on statistical
analysis of large bilingual text corpora to discern patterns and perform translations, while NMT
utilizes deep learning algorithms to analyse source and target languages, enabling more
nuanced and context-aware translations.
3.1.1 Drawbacks
• Privacy Concerns: Real-time audio processing raises privacy concerns regarding the
security and handling of transcribed conversations, potentially affecting user trust.
The feasibility study will assess the technical viability and economic aspects of the proposed
system. It will analyze available technology, resource requirements, and potential challenges
for technical feasibility, along with cost estimates for development, deployment, and
maintenance, ensuring economic viability. Operational aspects such as user acceptance,
usability, and scalability will also be evaluated to ensure alignment with user needs and
organizational objectives.
• Cost Estimates: Economic feasibility involves evaluating the costs associated with the
development, deployment, and maintenance of the proposed system. This includes
expenses related to software development, hardware acquisition, infrastructure setup,
and ongoing maintenance and support.
• Return on Investment (ROI): An analysis of potential revenue streams and cost
savings resulting from the implementation of the system will be conducted to determine
the ROI. This may include revenue generated from product sales, subscription fees, or
service offerings, as well as cost savings resulting from improved efficiency or reduced
operational expenses.
• Cost-Benefit Analysis: A cost-benefit analysis will be conducted to weigh the
projected benefits of the system against its associated costs. This analysis will help
determine whether the benefits of implementing the system outweigh the costs, thereby
indicating its economic viability.
• Market Demand and Competition: Economic feasibility also involves assessing
market demand and competition to determine the system's potential for success in the
marketplace. This includes analyzing factors such as target market size, customer
needs, competitive landscape, and pricing strategies.
• Risk Assessment: Economic feasibility analysis will also involve identifying and
mitigating potential risks and uncertainties that could impact the financial viability of
the project. This may include risks related to technology, market dynamics, regulatory
compliance, and unforeseen expenses.
CHAPTER 4
SYSTEM SPECIFICATION
4.1 Hardware Requirements
• Memory (RAM): Minimum 2GB RAM, recommended 4GB or more for optimal
performance, especially with additional features like summarization and task
management.
• Processor: Quad-core or higher processor for efficient real-time audio processing.
• Storage: Minimum 16GB internal storage for application installation and data storage.
External storage options (e.g., microSD card) recommended for users generating large
amounts of transcribed content.
• Microphone (Built-In): Functional built-in microphone for audio input.
• Bluetooth Microphone (Optional): Bluetooth compatibility for users opting to use an
external Bluetooth microphone.
• Display: Responsive display with suitable resolution for displaying transcribed text and
interacting with the application.
• Battery: Sufficient battery capacity to support continuous audio processing and
transcription for extended periods.
• Scalability: Design the system to handle a scalable number of users and increased
transcription loads, ensuring consistent performance under varying usage conditions,
including the new features.
• Reliability: The system should have a high level of reliability, with a maximum
allowable downtime of 1 hour per month for maintenance.
• Compatibility: Ensure compatibility with Android devices running Android OS
version 6.0 (Marshmallow) and above, supporting a variety of screen sizes and
resolutions, considering the new features.
• Usability: Conduct usability testing to ensure the application is user-friendly,
considering users with different levels of technical expertise, and specifically testing
the new summarization, search, and task management features.
• Accessibility: Implement accessibility features for the new functionalities, ensuring
users with diverse needs can efficiently utilize summarization, search, and task
extraction features.
• Security: Enhance security measures to safeguard the privacy of summarized content,
search queries, and task-related information for the new features.
• Maintainability: Design the system with modular and well-documented code,
facilitating easy maintenance, updates, and bug fixes for the new features.
• Portability: Optimize the application for various Android devices, providing a
consistent experience for users with different screen sizes and resolutions, considering
the new functionalities.
• Compliance: Ensure compliance with privacy regulations for the new features,
especially when dealing with summarized content and task-related data.
CHAPTER 5
PROJECT DESCRIPTION
5.1 Problem Definition
Transcriber AI addresses the pressing need for comprehensive text processing solutions in
today's digital landscape. As information continues to inundate various channels, the ability to
efficiently manage, understand, and extract insights from textual data becomes paramount. The
project aims to tackle this challenge by offering a versatile platform that encompasses real-time
transcription, text translation, and text summarization functionalities. However, achieving
these goals requires overcoming several key hurdles.
Firstly, the accuracy and speed of real-time transcription present significant challenges.
Converting spoken words into written text in near real-time demands robust speech recognition
algorithms capable of deciphering diverse accents, languages, and audio qualities.
Additionally, ensuring minimal latency between audio input and transcribed output is crucial
for providing a seamless user experience.
Secondly, text translation across multiple languages poses another significant challenge.
Achieving accurate and contextually appropriate translations necessitates integration with
reliable translation APIs and the implementation of sophisticated natural language processing
techniques. Moreover, maintaining consistency and coherence in translated texts while
preserving the nuances of the original language is essential for effective communication.
Lastly, text summarization requires the development of algorithms capable of distilling large
volumes of textual data into concise and informative summaries. Identifying key information,
extracting essential concepts, and maintaining the coherence and relevance of the summary
represent formidable challenges. Moreover, ensuring that the summarization process is
efficient and scalable to handle diverse types of text inputs is critical for the application's utility
across different domains.
Fig 5.1 illustrates the system architecture of the project, showcasing the seamless process of
real-time transcription and speaker diarization as speech is inputted from a microphone. The
diagram visually captures the intricate workflow, highlighting the efficient transformation of
spoken words into written text while discerning and categorizing speakers in real time.
• Microphone: A small and compact device that will take the input as speech.
• Real-Time Transcription Module: This module employs advanced speech
recognition algorithms to convert audio input into text in near real-time. It utilizes
techniques such as acoustic modeling, language modeling, and deep learning to
accurately transcribe spoken words into written text.
• Text Translation Module: The translation module integrates with external translation
APIs to provide seamless and accurate translations between multiple languages. It
leverages natural language processing techniques to ensure contextually appropriate
translations while maintaining consistency and coherence across languages.
• Text Summarization Module: This module utilizes natural language processing
algorithms to extract key insights from large volumes of textual data and generate
concise summaries. It employs techniques such as keyword extraction, sentence
scoring, and semantic analysis to identify important information and present it in a
digestible format.
• User Interface: The user interface of TranscriberAI serves as the primary interaction
point for users, providing a visually appealing and intuitive platform for accessing the
application's functionalities. It includes components such as buttons, text input fields,
and scrollable views, enabling users to navigate the application effortlessly.
• Integration Layer: The integration layer facilitates seamless communication between
different modules of the system, ensuring cohesive operation and efficient data flow. It
manages interactions such as data exchange, synchronization, and error handling,
enabling the application to function smoothly across various scenarios.
• Storage: The user decides where they want to store their personal speech data.
The Real-Time Transcription Module serves as the foundation of TranscriberAI, enabling the
application to convert audio input into text in near real-time with high accuracy and minimal
latency. This module incorporates advanced speech recognition algorithms and techniques to
decipher spoken words, regardless of accents, languages, or audio qualities. At its core, the
Real-Time Transcription Module utilizes state-of-the-art speech recognition algorithms, such as
the offline Vosk recognition engine adopted in the implementation (see Chapter 6).
The Text Translation Module empowers users to translate text between multiple languages
seamlessly, facilitating cross-cultural communication and understanding. This module
integrates with external translation APIs and employs natural language processing techniques
to ensure accurate and contextually appropriate translations.
However, achieving accurate translations goes beyond mere word-for-word conversion. The
Text Translation Module employs natural language processing techniques to ensure
contextually appropriate translations that capture the intended meaning of the text. This
includes analyzing the syntactic and semantic structure of the text, identifying idiomatic
expressions and cultural references, and adapting the translation to suit the target language's
linguistic conventions.
Moreover, the module maintains consistency and coherence in translated texts to enhance
readability and comprehension. It ensures that translated texts maintain the same tone, style,
and level of formality as the original text, thereby preserving the author's voice and intent.
Additionally, the module employs techniques such as post-editing and quality assurance to
refine translations and address any discrepancies or errors.
The Text Translation Module is designed to be versatile and adaptable to various translation
scenarios and use cases. Whether it's translating documents, emails, or website content, the
module can handle diverse types of text inputs with ease. It supports batch translation for
processing large volumes of text efficiently and offers customizable translation options to meet
specific user preferences and requirements.
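A minimal sketch of how such a translation step can be wired with the Googletrans package named in Chapter 6 is shown below; the exact API has varied across googletrans releases, and the language codes and example text are placeholders.

# Sketch of the translation step using the googletrans package named in Chapter 6.
# API details can differ between googletrans releases; this follows the common
# Translator().translate(...) usage, with the target language chosen by the user.
from googletrans import Translator

def translate_text(text: str, target_lang: str = "hi") -> str:
    translator = Translator()
    result = translator.translate(text, dest=target_lang)  # source language is auto-detected
    return result.text

if __name__ == "__main__":
    print(translate_text("Real-time transcription makes meetings accessible.", "kn"))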
The Text Summarization Module plays a pivotal role in distilling large volumes of textual data
into concise and informative summaries, enabling users to quickly grasp essential information
and identify relevant insights. This module utilizes natural language processing algorithms and
the sentence-scoring techniques described below to generate these summaries.
Key to the Text Summarization Module is its ability to identify important sentences or phrases
within the text and assign them a relevance score based on their significance. The module
employs techniques such as keyword extraction, sentence scoring, and semantic analysis to
analyze the content and identify key concepts, ideas, or arguments. By prioritizing sentences
with higher relevance scores, the module ensures that the generated summaries focus on the
most important information.
Another critical aspect of the Text Summarization Module is maintaining coherence and
relevance in the generated summaries. This involves structuring the summary in a logical and
coherent manner, ensuring smooth transitions between sentences and paragraphs, and
eliminating redundant or irrelevant information. Additionally, the module adapts the
summary's length and level of detail to suit user preferences and requirements, allowing users
to customize the summary's depth and complexity.
Furthermore, the Text Summarization Module is designed to be efficient and scalable, capable
of handling diverse types of text inputs and generating summaries in real-time. Whether it's
summarizing articles, research papers, or meeting transcripts, the module can process large
volumes of textual data with minimal computational overhead. It supports integration with
external text sources and offers options for batch summarization to streamline the
summarization process and improve efficiency.
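A condensed example of this extractive approach, using the LexRank summarizer from the Sumy library employed in this project, is shown below; the input file name "transcript.txt" and the three-sentence summary length are illustrative choices, and Sumy's English tokenizer relies on the NLTK "punkt" data being available.

from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lex_rank import LexRankSummarizer

# Build a document from plain text and keep the three highest-ranked sentences.
with open("transcript.txt", encoding="utf-8") as f:
    document_text = f.read()
parser = PlaintextParser.from_string(document_text, Tokenizer("english"))
summarizer = LexRankSummarizer()
for sentence in summarizer(parser.document, 3):
    print(sentence)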
5.4.4 Storage
The user has the autonomy to decide where their personal speech data is stored, emphasizing privacy and control over personal information. Users can choose storage options based on their preferences and requirements, giving the application a flexible and customizable character. This approach aligns with the principles of user-centric design and data ownership.
Fig 5.2 is a graphical representation that illustrates how data moves through the system. Data
flows seamlessly from the microphone input, capturing spoken words, to the real-time
transcription process, where advanced algorithms convert the audio data into written text while
concurrently performing speaker diarization for accurate identification of speakers.
• User Launches Application: This module initiates the process when the user starts the application.
• Bluetooth Connection Check: This module verifies if a Bluetooth device is connected.
o If not connected, it prompts the user to connect a device.
o If connected, the process proceeds to the next module.
• Audio Input Capture: This module captures audio input from the connected Bluetooth
device.
• Local Audio Processing and Transcription: This module processes the captured
audio locally on the device. It transcribes the audio into text format.
• Real-Time Display and Summarization: This module displays the transcribed text in
real-time. It may also generate summaries or highlights of the audio content.
• End: This marks the completion of the process. A simplified control-flow sketch of these steps is given below.
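The control flow above can also be expressed compactly in Python; every helper function in this sketch (is_bluetooth_connected, capture_audio_chunk, and so on) is a hypothetical placeholder standing in for the modules just described.

# Hypothetical control-flow sketch of the pipeline in Fig 5.2; the helper
# functions are placeholders for the modules listed above.
def run_transcriber(is_bluetooth_connected, prompt_user_to_connect,
                    capture_audio_chunk, transcribe, display, summarize):
    if not is_bluetooth_connected():
        prompt_user_to_connect()           # Bluetooth connection check
        return
    transcript = []
    while True:
        chunk = capture_audio_chunk()      # audio input capture
        if chunk is None:                  # user stopped the session
            break
        text = transcribe(chunk)           # local audio processing and transcription
        transcript.append(text)
        display(text)                      # real-time display
    summarize(" ".join(transcript))        # optional end-of-session summary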
CHAPTER 6
SYSTEM IMPLEMENTATION
6.1 Introduction
The system implementation for the proposed real-time transcription project involves setting up
the development environment with required libraries such as Python, PyQt5, PyAudio, Vosk,
Googletrans, and Sumy. The user interface is designed using PyQt5 framework, comprising
screens for real-time transcription, translation, and summarization, featuring text input/output
fields and interaction buttons. PyAudio captures audio input from the microphone in real-time,
while Vosk library performs speech recognition and transcription, displaying transcribed text
on the UI dynamically. Integration of Googletrans enables text translation based on user-
selected target language, with translated text displayed alongside the original. Utilizing Sumy
library, the system summarizes transcribed text, offering users options for summary length.
Event handling mechanisms manage user interactions, including transcription control, file
selection for translation/summarization, and screen navigation. Robust error handling and
validation ensure smooth operation, while testing and debugging phase identifies and resolves
any issues. Comprehensive documentation covers system architecture, installation, usage, and
developer guidelines, facilitating seamless deployment across platforms.
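The core of this pipeline, streaming microphone audio into the Vosk recognizer, can be sketched as follows; the 16 kHz sample rate, 8000-frame buffer, and model directory "model" are typical Vosk settings assumed for this illustration rather than fixed requirements.

import json
import pyaudio
from vosk import Model, KaldiRecognizer

# Capture microphone audio with PyAudio and decode it in real time with Vosk.
model = Model("model")
recognizer = KaldiRecognizer(model, 16000)
audio = pyaudio.PyAudio()
stream = audio.open(format=pyaudio.paInt16, channels=1, rate=16000,
                    input=True, frames_per_buffer=8000)
stream.start_stream()
try:
    while True:
        data = stream.read(8000, exception_on_overflow=False)
        if recognizer.AcceptWaveform(data):        # a complete phrase was decoded
            print(json.loads(recognizer.Result()).get("text", ""))
except KeyboardInterrupt:
    pass
finally:
    stream.stop_stream()
    stream.close()
    audio.terminate()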
The integration has been engineered to enable compatibility with a broad spectrum of Bluetooth
devices, including True Wireless Stereo (TWS) earphones and similar peripherals. By adopting
a versatile Bluetooth chip/module with comprehensive protocol support, such as Advanced
Audio Distribution Profile (A2DP) and Hands-Free Profile (HFP), the system accommodates
diverse Bluetooth-enabled accessories seamlessly. This approach ensures that users can
leverage their preferred Bluetooth devices, ranging from earphones to headsets, as the primary
audio input source for the system. Thus, users can experience enhanced flexibility and
convenience in utilizing their existing Bluetooth ecosystem while engaging with the real-time
transcription functionality offered by the system.
1. User Interface:
• Libraries Used: PyQt5
• Implementation: The PyQt5 framework is utilized for designing and implementing the user interface. PyQt5 widgets and layouts such as QVBoxLayout, QHBoxLayout, QPushButton, QTextEdit, QComboBox, and QLabel are used to create a user-friendly interface with multiple screens.
• Responsive Design: The user interface is designed to be responsive, adapting
gracefully to different screen sizes and orientations, ensuring consistent user
experience across devices.
• Fig 6.3 shows the intro screen of the ‘TranscriberAI’ application.
5. File Handling:
• Libraries Used: os
• Implementation: The os library, together with Python's built-in file handling and PyQt5's QFileDialog, is used for file operations such as file selection, reading, and writing. It enables users to select files for translation or summarization and handles file loading for processing.
• File Format Support: The application supports various file formats for input and output, including plain text (.txt), documents (.docx), and portable document format (.pdf), enhancing versatility and usability; a format-aware loading sketch is given after this list.
• Fig 6.8 shows the file selection module of the ‘TranscriberAI’ application.
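As an illustration of format-aware loading, the sketch below shows how the different formats could be read; the use of python-docx and PyPDF2 for .docx and .pdf files is an assumption made for this example, since the application primarily works with plain-text transcripts.

import os

# Sketch of format-aware file loading; python-docx and PyPDF2 are assumed
# third-party dependencies for this illustration.
def load_text(file_path):
    extension = os.path.splitext(file_path)[1].lower()
    if extension == ".txt":
        with open(file_path, "r", encoding="utf-8") as f:
            return f.read()
    if extension == ".docx":
        from docx import Document                  # provided by python-docx
        return "\n".join(p.text for p in Document(file_path).paragraphs)
    if extension == ".pdf":
        from PyPDF2 import PdfReader               # provided by PyPDF2
        return "\n".join(page.extract_text() or "" for page in PdfReader(file_path).pages)
    raise ValueError(f"Unsupported file format: {extension}")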
CHAPTER 7
SYSTEM TESTING
System testing is a crucial part of any project to ensure that it meets the desired requirements
and functions correctly. For the Transcriber AI project, there are several components that need
to be tested to ensure their proper functioning.
Hardware testing for the Transcriber AI involves checking the Bluetooth components and the system as a whole to ensure that they function correctly and as intended. The testing process involves the following steps:
• Testing Bluetooth Connection: The Bluetooth module undergoes testing to verify its
ability to establish a stable connection with the system. This involves initiating
connections with various Bluetooth devices and ensuring successful pairing and
communication.
• Testing Bluetooth Range: In addition to connection testing, the Bluetooth module's
range is assessed to determine the distance over which it can maintain a reliable
connection with paired devices. This is achieved by gradually increasing the distance
between the system and the Bluetooth device while monitoring signal strength and
connectivity stability.
Software testing is an important part of the development process to ensure that the software functions as expected and meets the user requirements. In the case of the Transcriber AI, the software includes the transcription, translation, and summarization logic as well as the desktop application that the user interacts with. The software testing process typically involves the following steps:
• Unit Testing: This involves testing individual units of code in isolation to ensure they
function as expected. In the case of the Transcriber AI, this entails testing the code responsible for transcription, translation, and summarization independently; an illustrative test of this kind is sketched below.
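For example, the summarization logic can be exercised in isolation along the following lines; the summarize_text helper shown here is hypothetical and only indicates the shape such a unit test might take.

import unittest
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lex_rank import LexRankSummarizer

# Hypothetical helper wrapping Sumy's LexRank summarizer (requires NLTK 'punkt' data).
def summarize_text(text, num_sentences):
    parser = PlaintextParser.from_string(text, Tokenizer("english"))
    sentences = LexRankSummarizer()(parser.document, num_sentences)
    return " ".join(str(s) for s in sentences)

class TestSummarization(unittest.TestCase):
    def test_summary_is_extractive_and_no_longer_than_input(self):
        text = ("Transcriber AI converts speech to text. It can translate the text. "
                "It can also summarize long transcripts. Users control where data is stored.")
        summary = summarize_text(text, 2)
        self.assertTrue(summary)                       # a summary was produced
        self.assertLessEqual(len(summary), len(text))  # and it is not longer than the input

if __name__ == "__main__":
    unittest.main()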
CHAPTER 8
SUMMARY
The Transcriber AI represents a pivotal evolution in communication technologies, catering to
the increasing demand for efficient audio processing solutions. Its real-time transcription
capabilities offer unparalleled speed and accuracy, transforming the way audio content is
converted into written text. This innovation is particularly beneficial in scenarios such as live
events, interviews, and lectures, where quick and accurate transcriptions are essential. By
eliminating language barriers through seamless translation, the AI fosters inclusivity and
facilitates cross-cultural communication. Its ability to generate concise summaries of
transcribed content streamlines information processing and decision-making, saving time and
effort. Moreover, the Transcriber AI's adaptability to various domains, including business,
education, and entertainment, makes it a versatile tool for diverse applications. Its user-friendly
interface and intuitive features make it accessible to users of all levels, from individuals to large
organizations. The AI's integration of advanced security measures ensures data privacy and
confidentiality, instilling trust and confidence among users. Continuous updates and
improvements keep the Transcriber AI at the forefront of innovation, meeting the evolving
needs of the digital era. Its impact extends beyond communication, driving efficiency,
productivity, and collaboration across industries. Furthermore, the Transcriber AI's adaptability
to various accents and dialects enhances its utility in diverse linguistic environments. Its robust
performance in noisy or challenging acoustic conditions ensures reliable transcription
outcomes even in less-than-ideal settings. The application also integrates readily with existing workflows and platforms, and the cloud storage support planned in Chapter 9 will further aid scalability and accessibility. Its
ability to generate timestamps and speaker identification enhances the usability and
organization of transcribed content, particularly in multi-speaker scenarios. As the demand for
accurate and efficient transcription solutions continues to grow, the Transcriber AI remains at
the forefront, driving innovation and reshaping the landscape of audio communication.
CHAPTER 9
FUTURE ENHANCEMENTS
• Online Cloud Storage: Implementing online cloud storage capabilities will enable
users to access transcribed text and related data through an internet connection. This
feature ensures seamless accessibility to transcripts from anywhere, anytime, enhancing
user convenience and flexibility.
• Hardware Controls Integration: Integrating hardware controls, such as buttons, will
streamline user interactions for initiating and terminating transcription processes. This
enhancement simplifies the transcription workflow, making it more efficient and
intuitive for users.
• Transition to Offline Translation Libraries: Transitioning from online to offline
translation libraries will enhance user privacy and accessibility by enabling translations
directly on the device without relying on internet connectivity. This improvement
ensures that users can translate content securely and efficiently.
• AI-Generated Summarization Algorithms: Introducing AI-generated summarization
algorithms will automate the summarization process, providing users with concise
summaries of transcribed content without the need for manual intervention. This
enhancement saves time and effort for users, improving overall productivity.
REFERENCES
[1] Sadaoki Furui, Tomonori Kikuchi, Yousuke Shinnaka, and Chiori Hori, "Speech-to-Text and Speech-to-Speech Summarization of Spontaneous Speech," IEEE Transactions on Speech and Audio Processing, vol. 12, no. 4, July 2004.
[2] Sridhar Krishna Nemala and Mounya Elhilali, "Multilevel Speech Intelligibility for Robust Speaker Recognition," The Johns Hopkins University, Baltimore.
[3] Xavier Anguera Miro, Simon Bozonnet, Nicholas Evans, Corinne Fredouille, Gerald Friedland, and Oriol Vinyals, "Speaker Diarization: A Review of Recent Research," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 2, February 2012.
[4] Dongmahn Seo, Suhyun Kim, Gyuwon Song, and Seung-gil Hong, "Speech-to-Text-based Life Log System for Smartphones," 2014.
[5] Multiple Authors, "Low-Latency Real-Time Meeting Recognition and Understanding Using Distant Microphones and Omni-Directional Camera," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 2, February 2012.
[6] Siqi Zheng, Weilong Huang, Xianliang Wang, Hongbin Suo, Jinwei Feng, and Zhijie Yan, "A Real-Time Speaker Diarization System Based on Spatial Spectrum," ICASSP 2021 - 2021 IEEE.
[7] Yu-An Chung, Wei-Hung Weng, Schrasing Tong, and James Glass, "Towards Unsupervised Speech-to-Text Translation," Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge.
[8] Md Ashraful Islam Talukder, Sheikh Abujar, Abu Kaisar Mohammad Masum, Sharmin Akter, and Syed Akhter Hossain, "Comparative Study on Abstractive Text Summarization," IEEE - 49239.
[9] Lili Wan, "Extraction Algorithm of English Text Summarization for English Teaching," 2018 International Conference on Intelligent Transportation, Big Data and Smart City.
[10] Devi Krishnan, Preethi Bharathy, Anagha, and Manju Venugopalan, "A Supervised Approach for Extractive Text Summarization Using Minimal Robust Features," Proceedings of the International Conference on Intelligent Computing and Control Systems (ICICCS 2019).
APPENDIX A
A.1 Source Code
import sys  # Interact with the Python runtime environment (command-line arguments, exit).
# Classes from PyQt5.QtWidgets for building the application's graphical user interface.
from PyQt5.QtWidgets import (QApplication, QWidget, QLabel, QPushButton, QTextEdit,
                             QComboBox, QFileDialog, QHBoxLayout, QVBoxLayout)
# Classes from PyQt5.QtCore for core functionality such as threads, signals, timers and URLs.
from PyQt5.QtCore import QThread, pyqtSignal, QTimer, Qt, QUrl
# Translator class from googletrans for translating text between languages.
from googletrans import Translator
# Classes for text parsing, tokenization and summarization from sumy, a text summarization library.
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lex_rank import LexRankSummarizer
# QColor, QFont, QPixmap and QMovie from PyQt5.QtGui for colours, fonts, images and GIF animations.
from PyQt5.QtGui import QColor, QFont, QPixmap, QMovie
# Model and KaldiRecognizer from vosk for speech recognition capabilities.
from vosk import Model, KaldiRecognizer
# json for parsing the recognizer's JSON output.
import json
# pyaudio for handling audio streams, necessary for capturing audio input for real-time transcription.
import pyaudio
# QMediaPlayer and QMediaContent from PyQt5.QtMultimedia for audio playback on the logo screen.
from PyQt5.QtMultimedia import QMediaPlayer, QMediaContent
# datetime for timestamping saved transcription files.
from datetime import datetime
p = pyaudio.PyAudio()
info = p.get_host_api_info_by_index(0)
numdevices = info.get('deviceCount')
# Iterate through the number of devices and print the input-capable devices.
for i in range(numdevices):
    device = p.get_device_info_by_host_api_device_index(0, i)
    if device.get('maxInputChannels') > 0:
        print(f"Input Device id {i} - {device.get('name')}")
device_index = 2  # index of the preferred input device (e.g. the paired Bluetooth microphone)
# Screen that captures microphone audio and shows the live transcription.
class RealTimeTranscription(QWidget):
    def __init__(self):
        super().__init__()
        self.init_ui()
        self.p = pyaudio.PyAudio()
        self.stream = None
        self.model = Model("model")  # path to the downloaded Vosk model directory
        self.transcription_running = False
        self.transcription_file = None

    def init_ui(self):
        self.setWindowTitle('Real-Time Transcription')
        self.transcription_text_edit = QTextEdit()
        self.transcription_text_edit.setReadOnly(True)
        self.transcription_text_edit.setMinimumHeight(300)
        # Buttons that start and stop the live transcription.
        self.start_button = QPushButton('Start')
        self.stop_button = QPushButton('Stop')
        self.start_button.clicked.connect(self.start_transcription)
        self.stop_button.clicked.connect(self.stop_transcription)
        button_layout = QHBoxLayout()
        button_layout.addWidget(self.start_button)
        button_layout.addWidget(self.stop_button)
        main_layout = QVBoxLayout()
        main_layout.addWidget(self.transcription_text_edit)
        main_layout.addLayout(button_layout)
        self.setLayout(main_layout)
    def start_transcription(self):
        if self.transcription_running:
            return
        # Open a mono input stream on the selected device (16 kHz is a typical Vosk setting).
        self.stream = self.p.open(format=pyaudio.paInt16, channels=1, rate=16000,
                                  input=True, input_device_index=device_index,
                                  frames_per_buffer=8000)
        self.stream.start_stream()
        recognizer = KaldiRecognizer(self.model, 16000)
        self.transcription_running = True
        self.transcription_thread = TranscriptionThread(self.stream, recognizer)
        self.transcription_thread.transcription_updated.connect(self.update_transcription)
        self.transcription_thread.start()
        # Save the session to a timestamped text file.
        timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
        file_name = f"transcription_{timestamp}.txt"
        self.transcription_file = open(file_name, "w")

    def update_transcription(self, text):
        # Append each recognised phrase to the display and to the session file.
        self.transcription_text_edit.append(text)
        if self.transcription_file:
            self.transcription_file.write(text)

    def stop_transcription(self):
        if self.stream:
            self.stream.stop_stream()
            self.stream.close()
        self.transcription_running = False
        self.transcription_thread.quit()
        self.transcription_thread.wait()
        if self.transcription_file:
            self.transcription_file.close()
            self.transcription_file = None
# Define a QThread based class for handling the transcription process in a separate thread.
class TranscriptionThread(QThread):
    transcription_updated = pyqtSignal(str)

    def __init__(self, stream, recognizer):
        super().__init__()
        self.stream = stream
        self.recognizer = recognizer

    # Main run loop for the thread that handles continuous transcription.
    def run(self):
        try:
            while True:
                data = self.stream.read(8000)
                if len(data) == 0:
                    break
                if self.recognizer.AcceptWaveform(data):
                    result = json.loads(self.recognizer.Result())
                    if 'text' in result:
                        self.transcription_updated.emit(result['text'] + "\n")
        except OSError:
            # The audio stream was closed while reading; end the thread quietly.
            pass
# Screen that loads a text file and translates it into a chosen target language.
class TranslationScreen(QWidget):
    def __init__(self):
        super().__init__()
        self.init_ui()

    def init_ui(self):
        self.setWindowTitle('Translation')
        self.original_text_edit = QTextEdit()
        self.translated_text_edit = QTextEdit()
        self.translated_text_edit.setReadOnly(True)
        self.language_combo_box = QComboBox()
        self.language_combo_box.addItems(['hi', 'kn', 'ta', 'fr'])  # example target language codes
        load_button = QPushButton('Load File')
        load_button.clicked.connect(self.load_file)
        translate_button = QPushButton('Translate')
        translate_button.clicked.connect(self.translate_text)
        text_layout = QHBoxLayout()
        text_layout.addWidget(self.original_text_edit)
        text_layout.addWidget(self.translated_text_edit)
        button_layout = QHBoxLayout()
        button_layout.addWidget(load_button)
        button_layout.addWidget(self.language_combo_box)
        button_layout.addWidget(translate_button)
        main_layout = QVBoxLayout()
        main_layout.addLayout(text_layout)
        main_layout.addLayout(button_layout)
        self.setLayout(main_layout)

    def load_file(self):
        file_dialog = QFileDialog()
        file_path, _ = file_dialog.getOpenFileName(self, 'Open File', '', 'Text Files (*.txt)')
        if file_path:
            with open(file_path, 'r', encoding='utf-8') as file:
                content = file.read()
            self.original_text_edit.setPlainText(content)

    def translate_text(self):
        original_text = self.original_text_edit.toPlainText()
        target_language = self.language_combo_box.currentText()
        translator = Translator()
        # Translate via googletrans and show the result alongside the original text.
        result = translator.translate(original_text, dest=target_language)
        self.translated_text_edit.setPlainText(result.text)
# Screen that loads a text file and produces an extractive summary of it.
class SummarizationScreen(QWidget):
    def __init__(self):
        super().__init__()
        self.init_ui()
        self.num_sentences = 0

    def init_ui(self):
        self.setWindowTitle('Summarization')
        self.original_text_edit = QTextEdit()
        self.summarized_text_edit = QTextEdit()
        self.summarized_text_edit.setReadOnly(True)
        self.load_button = QPushButton('Load File')
        self.load_button.clicked.connect(self.load_file)
        self.summarize_button = QPushButton('Summarize')
        self.summarize_button.clicked.connect(self.summarize_text)
        main_layout = QVBoxLayout()
        main_layout.addWidget(self.original_text_edit)
        main_layout.addWidget(self.load_button)
        main_layout.addWidget(self.summarized_text_edit)
        main_layout.addWidget(self.summarize_button)
        self.setLayout(main_layout)

    def load_file(self):
        file_dialog = QFileDialog()
        file_path, _ = file_dialog.getOpenFileName(self, 'Open File', '', 'Text Files (*.txt)')
        if file_path:
            with open(file_path, 'r', encoding='utf-8') as file:
                lines = file.readlines()
            non_empty_lines = [line for line in lines if line.strip()]
            self.num_sentences = len(non_empty_lines)
            content = "".join(non_empty_lines)
            self.original_text_edit.setPlainText(content)

    def summarize_text(self):
        original_text = self.original_text_edit.toPlainText()
        if self.num_sentences > 0:
            # Keep roughly a third of the loaded lines (an assumed heuristic).
            num_sentences_summary = max(1, self.num_sentences // 3)
        else:
            num_sentences_summary = 2
        summarizer = LexRankSummarizer()
        parser = PlaintextParser.from_string(original_text, Tokenizer("english"))
        summary = summarizer(parser.document, num_sentences_summary)
        summarized_text = "\n".join(str(sentence) for sentence in summary)
        self.summarized_text_edit.setPlainText(summarized_text)
# Define a QWidget based class for displaying an animated logo and playing a sound.
class LogoScreen(QWidget):
    def __init__(self, main_window):
        super().__init__()
        self.main_window = main_window
        self.init_ui()
        self.setup_animation_and_music()

    def init_ui(self):
        self.setWindowTitle('Logo Screen')
        layout = QVBoxLayout(self)
        self.movie = QMovie('logo.gif')  # animated logo asset (file name is illustrative)
        self.logo_label = QLabel(self)
        self.logo_label.setMovie(self.movie)
        self.logo_label.setAlignment(Qt.AlignCenter)
        self.movie.start()
        layout.addWidget(self.logo_label)
        self.transcriber_label = QLabel('TranscriberAI', self)
        self.transcriber_label.setAlignment(Qt.AlignCenter)
        self.transcriber_label.setFont(QFont('Arial', 24))
        self.transcriber_label.setStyleSheet("color: white")
        layout.addWidget(self.transcriber_label)
        # Background color
        self.setAutoFillBackground(True)
        palette = self.palette()
        palette.setColor(self.backgroundRole(), QColor('black'))  # assumed dark background to suit the white text
        self.setPalette(palette)

    def setup_animation_and_music(self):
        self.player = QMediaPlayer(self)
        QTimer.singleShot(2000, self.playSound)
        QTimer.singleShot(8000, self.transition_to_transcription)

    def playSound(self):
        url = QUrl.fromLocalFile('intro.mp3')  # intro sound asset (file name is illustrative)
        self.player.setMedia(QMediaContent(url))
        self.player.play()

    def transition_to_transcription(self):
        # Hand control back to the main window once the intro has finished.
        self.main_window.show_transcription()
# Define the main window class which contains all other screens.
class MainWindow(QWidget):
    def __init__(self):
        super().__init__()
        self.init_ui()

    def init_ui(self):
        self.setWindowTitle('Main Window')
        self.logo_screen = LogoScreen(self)
        self.transcription_screen = RealTimeTranscription()
        self.translation_screen = TranslationScreen()
        self.summarization_screen = SummarizationScreen()
        self.transcription_button = QPushButton('Transcription')
        self.transcription_button.clicked.connect(self.show_transcription)
        self.translation_button = QPushButton('Translation')
        self.translation_button.clicked.connect(self.show_translation)
        self.summarization_button = QPushButton('Summarization')
        self.summarization_button.clicked.connect(self.show_summarization)
        self.button_layout = QHBoxLayout()
        self.button_layout.addWidget(self.transcription_button)
        self.button_layout.addWidget(self.translation_button)
        self.button_layout.addWidget(self.summarization_button)
        self.main_layout = QVBoxLayout()
        self.main_layout.addLayout(self.button_layout)
        self.main_layout.addWidget(self.logo_screen)
        self.main_layout.addWidget(self.transcription_screen)
        self.main_layout.addWidget(self.translation_screen)
        self.main_layout.addWidget(self.summarization_screen)
        self.setLayout(self.main_layout)
        self.show_logo()

    def show_transcription(self):
        self.show_buttons()
        self.logo_screen.hide()
        self.transcription_screen.show()
        self.translation_screen.hide()
        self.summarization_screen.hide()

    def show_translation(self):
        self.show_buttons()
        self.logo_screen.hide()
        self.transcription_screen.hide()
        self.translation_screen.show()
        self.summarization_screen.hide()

    def show_summarization(self):
        self.show_buttons()
        self.logo_screen.hide()
        self.transcription_screen.hide()
        self.translation_screen.hide()
        self.summarization_screen.show()

    def show_logo(self):
        # Hide the feature screens and navigation buttons while the intro logo plays.
        self.transcription_button.hide()
        self.translation_button.hide()
        self.summarization_button.hide()
        self.transcription_screen.hide()
        self.translation_screen.hide()
        self.summarization_screen.hide()
        self.logo_screen.show()

    def show_buttons(self):
        self.transcription_button.show()
        self.translation_button.show()
        self.summarization_button.show()


if __name__ == '__main__':
    app = QApplication(sys.argv)
    window = MainWindow()
    window.show()
    try:
        sys.exit(app.exec_())
    except KeyboardInterrupt:
        pass