
VISVESVARAYA TECHNOLOGICAL UNIVERSITY

Jnana Sangama, Belagavi – 590 018.

A PROJECT REPORT
on
“TRANSCRIBER AI”
Submitted in partial fulfillment of the requirement for the award of the degree

Bachelor of Engineering
in
Computer Science and Engineering
by
AKASH ROSHAN R : 1VI20CS005

GAURAV SINGH : 1VI20CS042

KEERTHI KUMAR V : 1VI20CS058

DARSHAN S M : 1VI21CS401

Under the supervision of


Ms. J Brundha Elci
Assistant Professor

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

VEMANA INSTITUTE OF TECHNOLOGY


BENGALURU – 560 034
2023 - 24
Karnataka Reddy Jana Sangha®
VEMANA INSTITUTE OF TECHNOLOGY
(Affiliated to Visvesvaraya Technological University, Belagavi)
Koramangala, Bengaluru-34.

Department of Computer Science and Engineering

Certificate
Certified that the project work entitled “TRANSCRIBER AI” carried out jointly by Akash
Roshan R (1VI20CS005), Gaurav Singh (1VI20CS042), Keerthi Kumar V (1VI20CS058)
and Darshan S M (1VI21CS401), who are bonafide students of Vemana Institute of Technology,
in partial fulfillment for the award of Bachelor of Engineering in Computer Science and
Engineering of the Visvesvaraya Technological University, Belagavi during the year
2023-24. It is certified that all corrections/suggestions indicated for internal assessment have
been incorporated in the report. The project report has been approved as it satisfies the
academic requirements in respect of the project work prescribed for the said degree.

Supervisor HOD Principal


Ms. J Brundha Elci Dr. M. Ramakrishna Dr. Vijayasimha Reddy. B. G.

Submitted for the university examination (viva-voce) held on ………….………….

External Viva
Internal Examiner External Examiner

1.

2.
DECLARATION BY THE CANDIDATES

We the undersigned solemnly declare that the project report “TRANSCRIBER AI” is based
on our own work carried out during the course of our study under the supervision of Ms. J
Brundha Elci.

We assert the statements made and conclusions drawn are an outcome of our project work.
We further certify that,

a. The work contained in the report is original and has been done by us under the general
supervision of our supervisor.

b. The work has not been submitted to any other Institution for any other
degree/diploma/certificate in this university or any other University of India or abroad.

c. We have followed the guidelines provided by the university in writing the report.

d. Whenever we have used materials (data, theoretical analysis, and text) from other
sources, we have given due credit to them in the text of the report and their details are
provided in the references.

Date:
Place:

Project Team Members:

AKASH ROSHAN R : 1VI20CS005


GAURAV SINGH : 1VI20CS042
KEERTHI KUMAR V : 1VI20CS058
DARSHAN S M : 1VI21CS401

ACKNOWLEDGEMENT

First we would like to thank our parents for their kind help, encouragement and moral
support.

We thank Dr. Vijayasimha Reddy. B. G, Principal, Vemana Institute of Technology, Bengaluru,
for providing the necessary support.

We would like to place on record our regards to Dr. M. Ramakrishna, Professor and Head,
Department of Computer Science and Engineering for his continued support.

We would like to thank our project coordinators Mrs. Mary Vidya John, Assistant Professor
and Mrs. Vijayashree HP, Assistant Professor, Dept. of CSE for their support and
coordination.

We would like to thank our project guide Ms. J Brundha Elci, Assistant Professor, Dept. of CSE
for her continuous support, valuable guidance and supervision towards successful
completion of the project work.

We also thank all the Teaching and Non-teaching staff of Computer Science and Engineering
Department, who have helped us to complete the project in time.

Project Team Members:

AKASH ROSHAN R : 1VI20CS005


GAURAV SINGH : 1VI20CS042
KEERTHI KUMAR V : 1VI20CS058
DARSHAN S M : 1VI21CS401

ABSTRACT

This project focuses on the development of an application that enables real-time audio-to-text
transcription. Leveraging wireless Bluetooth technology, the application captures and
transcribes spoken words from both the device's built-in microphone and external Bluetooth
devices. "Transformative Bluetooth Communication Device for Inclusive Communication"
encapsulates the development of a groundbreaking device aimed at overcoming
communication barriers. This Bluetooth device integrates a Bluetooth transmitter and a
sophisticated microphone, facilitating real-time spoken word transmission. Coupled with a
dedicated application powered by advanced speech-to-text technology, the device offers
instantaneous transcription, making communication universally accessible. Emphasizing
hands-free convenience and portability, the project envisions a transformative bridge catering
to individuals with diverse language backgrounds and hearing impairments. Beyond
immediate functionalities, the project has wide-ranging implications for inclusivity, adaptation
to various environments, potential integration with existing platforms, and future research in
speech recognition and wearable technology. This abstract outlines a visionary venture set to
redefine communication paradigms and empower users through technology.

Key words: Wearable Technology, Speech Recognition, Natural Language Processing (NLP),
Information Retrieval, Local Transcription.

CONTENTS
Content Details Page No.
Title Page i

Bonafide Certificate ii

Declaration iii

Acknowledgement iv

Abstract v

Contents vi

List of Figures ix

List of Tables x

List of Abbreviations xi

Chapter 1 Introduction 1

1.1 Introduction 1

1.2 Scope 2

1.3 Objectives 4

1.4 Organization of the project work 5

Chapter 2 Literature Survey 8

2.1 Paper 1 8

2.2 Paper 2 9

2.3 Paper 3 10

2.4 Paper 4 11

2.5 Paper 5 13

2.6 Paper 6 14

2.7 Paper 7 16
2.8 Paper 8 17

2.9 Paper 9 18

2.10 Paper 10 19

2.11 Comparative Analysis 22

2.12 Summary of Literature Survey 25

Chapter 3 System Analysis 26

3.1 Existing System 26

3.1.1 Drawbacks 26

3.2 Proposed System 27

3.3 Feasibility Study 28

3.3.1 Technical Feasibility 28

3.3.2 Operational Feasibility 29

3.3.3 Economical Feasibility 29

Chapter 4 System Specification 31

4.1 Hardware Requirements 31

4.2 Software Requirements 31

4.3 Functional Requirements 32

4.4 Non-Functional Requirements 32

Chapter 5 Project Description 34

5.1 Problem Definition 34

5.2 Overview of the Project 34

5.3 System Architecture 35

5.4 Module Description 36

5.4.1 Real-Time Transcription Module 36

5.4.2 Text Translation Module 37

5.4.3 Text Summarization Module 38

5.4.4 Storage 39

5.5 Data Flow Diagram 40

Chapter 6 System Implementation 42

6.1 Introduction 42

6.2 Hardware Implementation 42

6.3 Software Implementation 44

Chapter 7 System Testing 50

7.1 Tests Conducted 50

7.2 Test Cases 51

Chapter 8 Summary 53

Chapter 9 Conclusions and Future Work 54

9.1 Conclusions 54

9.2 Future Enhancements 54

References 55

Appendix A 56

A.1 Source Code 56

LIST OF FIGURES

Fig. No. Name Page No.


5.1 System architecture 35

5.2 Dataflow diagram 40

6.1 Integrated Bluetooth module 43

6.2 Bluetooth ear buds 43

6.3 Intro Design 44

6.4 Transcription Module 45

6.5 Translation Module 46

6.6 Language options 47

6.7 Summarization Module 48

6.8 File Selection Module 49

LIST OF TABLES

Table No. Name Page No.

2.1 Comparative analysis 22

7.1 Test cases for Transcriber AI 51

LIST OF ABBREVIATIONS

Abbreviation Description
A2DP Advanced Audio Distribution Profile

BLEU Bilingual Evaluation Understudy

CDDMA Circular Differential Directional Microphone Array

CSJ Corpus of Spontaneous Japanese

DER Diarization Error Rate

DMM Dynamic Mixture Model

DWF Diarization-based Word Filtering

EER Equal Error Rate

FLAC Free Lossless Audio Codec

GMM Gaussian Mixture Model

HFP Hands-Free Profile

LDA Latent Dirichlet Allocation

LSA Latent Semantic Analysis

NMT Neural Machine Translation

SMT Statistical Machine Translation

STT Speech-to-Text

UAT User Acceptance Testing

UBM Universal Background Model

VAD Voice Activity Detection



CHAPTER 1

INTRODUCTION
1.1 Introduction
The project unveils a Bluetooth device, seamlessly merging technology and inclusivity to
address communication barriers. Featuring a Bluetooth transmitter and integrated microphone,
it's a visionary solution for individuals with diverse language backgrounds and hearing
impairments. The Bluetooth transmitter enables real-time spoken word transmission, while the
integrated microphone captures nuanced human expression. This data integrates with a mobile
app that uses advanced speech-to-text technology for real-time transcription, reflecting a
commitment to inclusivity. Serving as a transformative bridge, this device prioritizes hands-free
convenience
and portability, reshaping the essence of communication for a future marked by connectivity
and universal accessibility. "Transformative Bluetooth Device for Inclusive Communication,"
encapsulates the development of a groundbreaking device aimed at overcoming communication
barriers. This Bluetooth device integrates a Bluetooth transmitter and a sophisticated
microphone, facilitating real-time spoken word transmission. Coupled with a dedicated mobile
application powered by advanced speech-to-text technology, the device offers instantaneous
transcription, making communication universally accessible. Emphasizing hands-free
convenience and portability, the project envisions a transformative bridge catering to
individuals with diverse language backgrounds and hearing impairments. Beyond immediate
functionalities, the project has wide-ranging implications for inclusivity, adaptation to various
environments, potential integration with existing platforms, and future research in speech
recognition and Bluetooth technology. The project outlines a visionary venture set to redefine
communication paradigms and empower users through technology.

The Bluetooth device described is a cutting-edge solution designed to revolutionize inclusive
communication. It incorporates several key features to cater to individuals with diverse
language backgrounds and hearing impairments. At its core is a Bluetooth transmitter, which
facilitates real-time spoken word transmission. This means that individuals using the device
can communicate verbally with others nearby, breaking down language barriers and enabling
seamless interaction. Complementing the Bluetooth transmitter is an integrated microphone of
advanced sophistication. This microphone is engineered to capture not just the words being
spoken but also the nuances of human expression. It ensures that the communication facilitated
by the device retains the emotional depth and subtleties essential for effective human
interaction. This aspect is particularly vital for users who rely on visual cues and facial
expressions as part of their communication.

The device doesn't stop at facilitating spoken communication; it also integrates with a dedicated
mobile application. This app utilizes state-of-the-art speech-to-text technology to provide real-
time transcription of conversations. This feature is invaluable for individuals with hearing
impairments or those who prefer written communication. By offering multiple modes of
communication within one device, the Bluetooth device ensures inclusivity and accessibility
for all users. In terms of usability, the device is designed with hands-free convenience and
portability in mind. Its compact form factor and lightweight construction make it easy to use
throughout the day, allowing users to communicate effortlessly in various environments.
Whether in a busy public space or a quiet intimate setting, the device empowers users to engage
in meaningful interactions without cumbersome equipment or logistical barriers. The
implications of this project extend far beyond its immediate functionalities. It has the potential
to shape the future of assistive technology, paving the way for advancements in speech
recognition, Bluetooth devices, and adaptive communication tools. Moreover, its seamless
integration with existing platforms opens up possibilities for enhanced connectivity and
interoperability across different communication channels.

1.2 Scope

The scope of our project is to create a Bluetooth device, incorporating a transmitter and
advanced speech-to-text technology, to address communication challenges faced by individuals
with diverse language backgrounds and hearing impairments. The project aims for real-time
transcription, offering hands-free convenience and portability. The dedicated mobile
application enhances user experience with customization options. The device's adaptability to
various environments, potential integration with existing platforms, and acknowledgment of
global linguistic diversity contribute to its broader impact. Beyond immediate functionalities,
the project envisions empowering individuals through technology and serves as a stepping
stone for future applications and research opportunities in speech recognition and Bluetooth
technology. The scope of the Bluetooth device project is extensive, encompassing various
aspects of technology development, user experience design, market research, and potential
applications. Here's a breakdown of its scope:

• Technology Development: The project involves the design, development, and integration of
sophisticated hardware and software components. This includes engineering the Bluetooth
transmitter, microphone, and mobile application with advanced speech-to-text technology.
Additionally, research and development efforts may focus on optimizing battery life, improving
signal quality, and ensuring compatibility with a wide range of devices.
• User Experience Design: A critical aspect of the project is to prioritize user experience
(UX) design to ensure the device is intuitive and easy to use for individuals with diverse
communication needs. This may involve conducting user testing, gathering feedback,
and iterating on the design to enhance usability and accessibility.
• Market Research: Understanding the market landscape and identifying target
demographics is essential for the success of the project. Market research efforts may
involve analyzing the needs of individuals with hearing impairments, language barriers,
and other communication challenges. Additionally, assessing competitors and
identifying potential partnerships or distribution channels can inform the project's
strategy.
• Regulatory Compliance: Compliance with regulatory standards and certifications,
such as those related to Bluetooth technology, medical devices, and accessibility
guidelines, is crucial. The project scope may include ensuring that the device meets
relevant regulatory requirements to ensure safety, efficacy, and legal compliance.
• Integration and Compatibility: Exploring opportunities for integration with existing
platforms, devices, and communication tools is another aspect of the project's scope.
This may involve collaborating with technology partners to ensure seamless
interoperability and compatibility with popular operating systems, communication
apps, and assistive devices.
• Deployment and Distribution: Planning for the deployment and distribution of the
Bluetooth device involves logistics, supply chain management, and strategic
partnerships. This may include manufacturing considerations, distribution channels,
pricing strategies, and marketing efforts to raise awareness and reach target audiences
effectively.
• Feedback and Iteration: Continuous improvement based on user feedback and market
insights is integral to the project's success. The scope may include mechanisms for
gathering feedback from users, monitoring performance metrics, and iterating on the
product to address evolving needs and preferences.
• Future Research and Development: The project's scope extends beyond the initial
launch to include ongoing research and development efforts. This may involve exploring new
features, enhancements, and potential applications for the Bluetooth
device, as well as contributing to advancements in speech recognition, Bluetooth
technology, and inclusive design.

1.3 Objectives

The project's objectives revolve around leveraging innovative technology to address
communication barriers and promote inclusivity. Its primary aim is to develop a Bluetooth
device that seamlessly integrates hardware and software innovations to revolutionize
communication for individuals with diverse language backgrounds and hearing impairments.
Through the integration of a Bluetooth transmitter and advanced microphone, the project seeks
to enable real-time spoken word transmission, facilitating seamless interaction among users.
Additionally, the project aims to ensure universal accessibility by integrating a dedicated
mobile application with speech-to-text capabilities, providing instant transcription for
individuals with hearing impairments or those preferring written communication. Emphasizing
hands-free convenience and portability, the project aims to prioritize user comfort and usability.
Furthermore, by promoting inclusivity through meticulous design and integration with existing
platforms, the project seeks to break down communication barriers and cater to individuals with
diverse communication needs. Ultimately, the project aspires to advance technological
solutions in speech recognition and Bluetooth technology, contributing to innovation in the
field and fostering a more connected and inclusive society.

• Develop a system for instant conversion of spoken words into written text with a focus
on low latency for real-time transcription.
• Seamlessly incorporate Bluetooth technology to enable wireless microphone audio
input, offering users flexibility with external audio sources.
• Optimize the transcription process by implementing local audio processing on mobile
devices, handling input from both Bluetooth microphones and built-in device
microphones.
• Design an intuitive user interface, allowing users to easily monitor transcription
progress, connect to Bluetooth devices, and access transcribed text.
• Utilize audio preprocessing techniques to enhance transcription accuracy, particularly
in challenging acoustic conditions.
• Develop robust error-handling mechanisms to gracefully manage issues such as
Bluetooth disconnections or errors in the transcription process.


• Incorporate customizable settings for users to tailor the transcription process, including
options for language selection, transcription formats, and output preferences.
These objectives are strategically designed to create a flexible, user-friendly, and highly
functional Bluetooth device that not only meets the immediate needs of its users but also lays
the groundwork for future innovations in communication and assistive technologies; a minimal
sketch of the local transcription loop implied by the first few objectives is given below.
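
As referenced above, the following is a minimal sketch of the kind of local, real-time
transcription loop these objectives describe. It assumes the PyAudio and Vosk libraries named
in Chapter 6 and a Vosk model downloaded to a local folder; the model path and audio
parameters are illustrative placeholders, not the project's actual configuration.

# Minimal sketch: capture microphone audio locally and transcribe it with Vosk.
# Assumes: pip install pyaudio vosk, and a Vosk model unpacked at MODEL_PATH.
import json
import pyaudio
from vosk import Model, KaldiRecognizer

MODEL_PATH = "model"      # placeholder path to a downloaded Vosk model
SAMPLE_RATE = 16000       # most Vosk models expect 16 kHz mono audio

model = Model(MODEL_PATH)
recognizer = KaldiRecognizer(model, SAMPLE_RATE)

audio = pyaudio.PyAudio()
# The default input device is used here; a Bluetooth microphone paired with the
# operating system normally appears as another selectable input device.
stream = audio.open(format=pyaudio.paInt16, channels=1, rate=SAMPLE_RATE,
                    input=True, frames_per_buffer=4000)

try:
    while True:
        data = stream.read(4000, exception_on_overflow=False)
        if recognizer.AcceptWaveform(data):
            result = json.loads(recognizer.Result())
            print("Final:", result.get("text", ""))
        else:
            partial = json.loads(recognizer.PartialResult())
            print("Partial:", partial.get("partial", ""))
except KeyboardInterrupt:
    stream.stop_stream()
    stream.close()
    audio.terminate()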

1.4 Organization of the project work

On November 8th, 2023, the project domain and title, "Transcriber AI," were officially
finalized. Following approval by both the project guide and the project committee, our team
initiated an extensive literature survey. This survey encompassed a thorough examination of
research papers and publications with objectives akin to ours. After analyzing papers from
reputable journals, we identified key methodologies and technologies relevant to our project.
Subsequently, we focused on identifying the necessary technologies and methodologies
essential for implementing our transcriber AI. This phase involved comprehensive research
to identify cutting-edge technologies and methodologies in the fields of speech recognition
and real-time data processing.

1.4.1 Completion Timeline


November 2023:
• Domain and Project Title Discussion: Begin discussions about the domain focus for the
transcription, translation, and summarization project.
• Literature Survey Commencement: Start gathering and reviewing relevant literature and
existing technologies that relate to the project's focus.

December 2023:
• Finalization and Approval: Finalize and obtain approval for the project title and domain.
• Module Decision and Component Selection: Decide on the specific modules to include
in the project, such as speech recognition, language processing, and translation
modules. Select and purchase necessary components.

January 2024:
• Hardware Prototype Development: Learn about and connect necessary hardware, such
as servers or specialized processors that could handle real-time data processing. Start
creating a hardware prototype suitable for handling transcription and translation.


April 2024:
• Integration and Testing: Integrate the software with the hardware, conduct thorough
testing of the real-time transcription, translation, and summarization capabilities.
• Project Completion: Finalize all components of the system, ensuring the model and
application operate seamlessly in real-time scenarios.

1.4.2 Outline of the chapters

• Chapter 1 Introduction: This chapter offers a concise introduction to the Transcriber
AI project, encompassing key subsections such as scope, project objectives, plan of
action, current project status, proposed plan for completion, and an overview of
subsequent chapters.

• Chapter 2 Literature Survey: Conducted as part of the project's literature review, this
chapter presents a comprehensive list of base and reference papers that have been
influential in shaping the project. It goes beyond listing these papers, providing an in-
depth analysis of their key findings and contributions, and includes a comparative
analysis.

• Chapter 3 System Analysis: Delves into the foundational aspects essential for
understanding and evaluating both the existing and proposed systems. This chapter
delineates the functional and non-functional requirements crucial for the project's
success. Functional requirements detail the specific functionalities and capabilities
expected from the system, such as real-time transcription and translation tasks. Non-
functional requirements encompass performance, scalability, security, and usability
considerations, ensuring the system meets quality standards and user expectations.

• Chapter 4 System Specification: This section details hardware and software
prerequisites for the system, including specific Android device specifications and
required tools like Android Studio and Bluetooth APIs. Functional requirements
encompass core functionalities such as audio capture and transcription, while non-
functional aspects address performance, reliability, and compliance, ensuring adherence
to quality standards and user expectations.

• Chapter 5 Project Description: In this chapter, the project's problem definition is
outlined, focusing on the need for efficient text processing solutions in today's digital
era. The project, Transcriber AI, aims to address this need by offering real-time
transcription, text translation, and text summarization functionalities. Key challenges
include ensuring accuracy and speed in transcription, achieving accurate translations
across languages, and developing algorithms for text summarization. The overview of
the project highlights its core functionalities and system architecture, emphasizing
scalability, efficiency, and user experience.

• Chapter 6 System Implementation: This chapter provides insights into the
implementation of the Transcriber AI system. It covers hardware and software
implementation, detailing the setup of development environments, user interface
design, and integration of various libraries for real-time transcription, translation, and
summarization. The hardware implementation focuses on Bluetooth functionality,
while software implementation involves the PyQt5 framework for the user interface and
libraries like PyAudio, Vosk, Googletrans, and Sumy for core functionalities. The
chapter emphasizes robust error handling, testing, and comprehensive documentation;
a brief illustrative sketch of how these libraries can be combined is given at the end of
this outline.

• Chapter 7 System Testing: System testing is crucial to ensure the Transcriber AI
system meets requirements and functions correctly. This chapter discusses hardware
and software testing processes, including Bluetooth connection testing and unit testing
for different modules. Integration testing evaluates how different units interact, while
system testing assesses overall system functionality under diverse conditions. User
acceptance testing involves end-users evaluating the system's features and providing
feedback. Test cases and their analysis are presented to evaluate system performance.

• Chapter 8 Summary: The summary chapter provides an overview of the Transcriber
AI project's significance and achievements. It highlights the project's response to
communication challenges by offering real-time transcription, translation, and
summarization capabilities. The integration of advanced technology enhances
productivity and accessibility. The chapter also draws parallels between the described
security system and Transcriber AI, emphasizing proactive approaches to contemporary
needs.

• Chapter 9: Conclusion and Future Work: The conclusion chapter summarizes the
project's objectives, emphasizing the creation of an efficient and user-friendly
Transcriber AI system. It discusses potential future enhancements, such as online cloud
storage, hardware controls integration, offline translation libraries, and AI-generated
summarization algorithms.
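
As mentioned in the Chapter 6 outline above, the translation and summarization modules build
on the Googletrans and Sumy libraries. The snippet below is only an illustrative sketch of how
already-transcribed text might be passed through both libraries; it is not the project's actual
implementation, the sample transcript and target language are placeholders, and the synchronous
Translator API shown corresponds to the older googletrans releases.

# Illustrative sketch: translate a transcript with googletrans and summarize it with Sumy.
# Assumes: pip install googletrans==4.0.0rc1 sumy, plus NLTK's 'punkt' tokenizer data.
from googletrans import Translator
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer

transcript = ("The meeting covered the quarterly budget, the hiring plan for the "
              "next semester, and the schedule for the laboratory upgrade.")

# Translation: 'dest' is an ISO language code chosen by the user (Kannada here as an example).
translator = Translator()
translated = translator.translate(transcript, dest="kn")
print("Translated:", translated.text)

# Summarization: LSA-based extractive summary of the original transcript.
parser = PlaintextParser.from_string(transcript, Tokenizer("english"))
summarizer = LsaSummarizer()
for sentence in summarizer(parser.document, sentences_count=1):
    print("Summary:", sentence)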


CHAPTER 2

LITERATURE SURVEY
2.1 Sadaoki Furui, Tomonori Kikuchi, Yousuke Shinnaka, and Chiori Hori, Speech-to-Text
and Speech-to-Speech Summarization of Spontaneous Speech, IEEE Transactions on
Speech and Audio Processing, vol. 12, no. 4, July 2004

The paper addresses the problem of speaker diarization, which is the process of determining
who is speaking when in a conversation or meeting. The authors are motivated by the
challenges of real-time diarization, such as segmenting and separating overlapping speech,
estimating the number and location of speakers, and tracking speaker movement and change.
To tackle these challenges, they propose a novel speaker diarization system that incorporates
spatial information from a circular differential directional microphone array (CDDMA).
The system consists of four components: audio segmentation based on beamforming and voice
activity detection, speaker diarization based on joint clustering of speaker embeddings and
source localization, a phonetically-aware coupled network to normalize the effects of speech
content, and separation of overlapping speech based on state estimation and unimodality test.
The authors evaluate their system on a corpus of real and simulated meeting conversations
recorded by the CDDMA. They compare their system with several baselines and show that
their system achieves significant improvements in terms of diarization error rate (DER) and
segmentation accuracy.
They also demonstrate the advantages of the CDDMA design over other beamforming
methods. This paper represents a significant contribution to the field of speaker diarization.
Evaluation experiments were conducted on spontaneous Japanese presentation speeches from
the Corpus of Spontaneous Japanese (CSJ).
For speech-to-text summarization, the two-stage method achieved better performance than a
one-stage approach, especially at lower summarization ratios (higher compression).
Combining multiple scoring measures (linguistic, significance, and confidence) for sentence
extraction further improved results. For speech-to-speech summarization, subjective
evaluation at a 50% summarization ratio showed that sentence units achieved the best results
in terms of ease of understanding and appropriateness as a summary. Word units performed
worst due to unnatural sound caused by concatenating many short segments, while between-
filler units showed promising results comparable to sentence units.


Advantages:
• Addresses challenges of spontaneous speech like recognition errors and redundant
information through two-stage summarization.
• Utilizes multiple scoring measures for sentence extraction, leveraging linguistic
naturalness and acoustic reliability.
• Preserves speaker's voice and prosody in speech-to-speech summarization, conveying
additional meaning and emotion.
Limitations:
• Evaluation conducted on a small dataset may limit generalizability, requiring
assessment on larger, diverse corpora.
• Relies primarily on textual information, lacking explicit incorporation of prosodic
features which could enhance identification of important speech segments.
• Concatenating speech units may introduce acoustic discontinuities and unnatural
sounds, necessitating further improvements for natural-sounding summaries.

2.2 Sridhar Krishna Nemala and Mounya Elhilali, Multilevel speech intelligibility for
robust speaker recognition, The Johns Hopkins University, Baltimore

In the real world, natural conversational speech is an amalgam of speech segments, silences
and environmental/background and channel effects. Labelling the different regions of an
acoustic signal according to their information levels would greatly benefit all automatic speech
processing tasks. A novel segmentation approach based on a perception-based measure of
speech intelligibility is proposed. Unlike segmentation approaches based on various forms of
voice-activity detection (VAD), the proposed parsing approach exploits higher-level perceptual
information about signal intelligibility levels. This labelling information is integrated into a
novel multilevel framework for automatic speaker recognition task. The system processes the
input acoustic signal along independent streams reflecting various levels of intelligibility and
then fusing the decision scores from the multiple steams according to their intelligibility
contribution. The results show that the proposed system achieves significant improvements
over standard baseline and VAD based approaches, and attains a performance similar to the
one obtained with oracle speech segmentation information. With the advent of E-commerce
technology, the importance of non-intrusive and highly reliable methods for personal
authentication has been growing rapidly. Voice prints, being the most natural form of
communication and already used widely in spoken dialog systems, have a significant advantage
over other biometrics such as retina scans, face, and fingerprints.

Advantages:

• Enhances speaker recognition robustness in adverse conditions by leveraging speech
intelligibility information.
• Incorporates perception-based measure of speech intelligibility for adaptive signal
processing, aligning closely with human judgment.
• Efficiently utilizes informative signal components, improving overall speaker
recognition performance by weighting higher intelligibility streams during fusion.

Limitations:

• Computational complexity poses challenges for real-time or resource-constrained
applications, requiring optimizations or dedicated hardware acceleration.
• Requires substantial labeled training data for intelligibility likelihood model and
speaker verification components, which can be resource-intensive.
• Sensitivity to intelligibility model performance may affect system accuracy and
robustness, potentially degrading speaker recognition performance if the model
inaccurately estimates speech intelligibility level.

2.3 Xavier Anguera Miro, Simon Bozonnet, Nicholas Evans, Corinne Fredouille, Gerald
Friedland and Oriol Vinyals, Speaker Diarization: A Review of Recent Research,
IEEE transactions on audio, speech, and language processing, vol. 20, no. 2, February
2012
Speaker diarization, the process of determining “who spoke when?” in an audio or video
recording with multiple speakers, is a field of growing interest. This is particularly true for
conference meetings, which present unique challenges such as background noise,
reverberation, overlapping speech, and spontaneous speech style. The main approaches to
speaker diarization include bottom-up and top-down clustering, as well as information-
theoretic and Bayesian methods. Most systems utilize hidden Markov models with Gaussian
mixture models to represent speakers and segments. The primary algorithms involved in
speaker diarization include data preprocessing, speech activity detection, acoustic
beamforming, feature extraction, cluster initialization, distance computation, cluster merging
or splitting, stopping criterion, and resegmentation.
New research directions are emerging in the field, including handling overlapping speech,
incorporating multimodal information, system combination, and exploring nonparametric
Bayesian methods. Performance evaluation of speaker diarization systems is typically based
on the NIST Rich Transcription evaluations, which provide standard datasets and metrics. The
current state-of-the-art systems achieve around a 10% diarization error rate on conference
meeting data. This field continues to evolve, with ongoing research aimed at improving the
accuracy and efficiency of speaker diarization systems.
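
For reference, the diarization error rate (DER) cited above is conventionally computed as the
fraction of total reference speech time that is missed, falsely detected as speech, or attributed to
the wrong speaker. A small illustrative calculation with hypothetical durations (not figures from
the surveyed paper):

# Illustrative DER calculation with hypothetical durations (in seconds).
missed_speech = 12.0        # reference speech that was not detected
false_alarm = 8.0           # non-speech that was labelled as speech
speaker_confusion = 20.0    # speech attributed to the wrong speaker
total_reference_speech = 400.0

der = (missed_speech + false_alarm + speaker_confusion) / total_reference_speech
print(f"DER = {der:.1%}")   # 10.0%, comparable to the state of the art quoted above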

Advantages:
• Speaker diarization enhances meeting management by automatically identifying
speakers, aiding in organizing discussions and summarizing key points.
• Provides speaker labels, improving accessibility of audio/video recordings for
individuals with hearing impairments or those who prefer transcripts.
• Automated speaker diarization reduces time and effort required for transcription tasks,
saving resources in processing recordings with multiple speakers.
Limitations:
• Speaker diarization systems may struggle with accurately identifying speakers in
challenging acoustic environments, affecting accuracy.
• Some algorithms require significant computational resources, limiting practical
applicability, especially for large datasets or real-time streams.
• Variability in speakers' voices, accents, and speech patterns can lead to errors in speaker
segmentation and identification, impacting system reliability.

2.4 Dongmahn SEO, Suhyun KIM, Gyuwon SONG, and Seung-gil HONG, Speech-to-Text-
based Life Log System for Smartphones, IEEE International Conference on Consumer
Electronics (ICCE), 2014

A life log is a digital record of a person's daily life. Many life log projects have been proposed;
in these projects, digital data are collected using cameras, wearable devices, and remote
controllers. However, legacy projects rely on wearable or special-purpose devices that are not
commonly used, or focus only on photographs and videos. Speech recognition technologies
based on word dictionaries already have sufficient accuracy. Current speech-to-text technologies
for sentences and paragraphs are less accurate than word dictation; however, it is assumed that
searching recorded voice files is practical if the accuracy of sentence or paragraph dictation
exceeds fifty percent, because users search recorded voice files with only a few keywords. These
keywords are likely to appear several times in a voice file because the topic of a speech or a
dialog is related to them.


A smartphone is a suitable device for life logging, because smartphones are part of human life
24 hours a day, 7 days a week, and have various sensors, a camera, a microphone, computing
power, and network connectivity. In this paper, a new life log system using user voice recording
with dictated texts is proposed. The proposed system records the user's voice and phone calls on
a smartphone whenever the user wants. Recorded sound files are dictated and stored as text files
by a speech-to-text service; both the sound files and the dictated text files are stored in a life log
server. The server provides life log file lists and search features for life log users, so the system
offers real-time voice recording with a speech-to-text feature in a smartphone environment.
Recorded data are sent to the server, analysed, and stored. The proposed system is implemented
as a prototype and evaluated, and its users are able to search their life log sound files using text.
The life log server manages life log data files, which are pairs of a sound file and a text file, and
provides a web service for the user interface. To dictate sound files, a speech-to-text (STT)
module controls a FLAC encoding module and an STT request module. The FLAC encoding
module converts .WAV format files received from the life log application into .FLAC format
files. The STT request module sends the converted sound files to the speech-to-text server and
requests the dictated text files. A life log web server provides life log data stored in the life log
server to the life log web client.
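
The WAV-to-FLAC encoding step described above can be illustrated with a short, generic
snippet; this uses the soundfile library purely as an example and is not the encoding module of
the surveyed system:

# Generic illustration of converting a recorded .WAV file to .FLAC before an STT request.
# Assumes: pip install soundfile (which bundles libsndfile); file names are placeholders.
import soundfile as sf

data, sample_rate = sf.read("recording.wav")     # hypothetical recorded input file
sf.write("recording.flac", data, sample_rate)    # output format inferred from the .flac extension
print("Encoded recording.flac at", sample_rate, "Hz")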

Advantages:
• Enables convenient and ubiquitous life logging using widely available smartphones,
eliminating the need for dedicated wearable devices.
• Leverages smartphone's microphone and GPS capabilities for multimodal data capture,
providing a comprehensive record of user's experiences.
• Allows for efficient navigation and access to life log data by transcribing audio files
into text, enabling keyword-based search and retrieval.
Limitations:
• Relies on speech-to-text service accuracy for transcription, which may be limited by
accents, background noise, and conversational speech, affecting searchability.
• Continuous audio recording on mobile devices raises privacy concerns by capturing
user's voice, conversations, and environmental sounds, necessitating appropriate
privacy measures. Additionally, it may strain device resources, requiring efficient
encoding, compression, and scheduling to mitigate battery drain and network usage.


2.5 Multiple Authors, Low-Latency Real-Time Meeting Recognition and Understanding
Using Distant Microphones and Omni-Directional Camera, IEEE Transactions on Audio,
Speech, and Language Processing, vol. 20, no. 2, February 2012

The paper presents a real-time meeting analyzer that recognizes “who is speaking what” and
provides assistance for meeting participants or audience. It uses a microphone array and an
omni-directional camera to capture audio and visual information, and performs speech
enhancement, speech recognition, speaker diarization, acoustic event detection, sentence
boundary detection, topic tracking, and face pose detection to analyze the meeting content and
context. The system employs three techniques to improve the speech recognition accuracy for
distant speech: dereverberation, source separation, and noise suppression.
The system also performs speaker diarization based on direction of arrival estimation and
VAD. The system uses a WFST-based decoder with one-pass decoding and error correction
language models. The system also uses diarization-based word filtering to reduce insertion
errors caused by non-target speech. The authors compare different models for extracting and
tracking the topics of an ongoing meeting based on speech transcripts. They show that their
proposed topic tracking model (TTM) outperforms the semibatch latent Dirichlet allocation
(LDA) and the dynamic mixture model (DMM) in terms of perplexity and entropy. The authors
summarize the main contributions of their system for low-latency real-time meeting analysis,
which integrates audio, speech, language, and visual processing modules.
They report the experimental results for speaker diarization, speech recognition, acoustic event
detection, and topic tracking on real meeting data. They also discuss the future work for
improving the system’s accuracy and usability. The paper contains many references to previous
works on meeting recognition, speech enhancement, speech recognition, topic modelling, and
other related topics. The ASR system incorporates a novel diarization-based word filtering
(DWF) technique to reduce insertion errors caused by residual non-target speech in the
separated channels.
The DWF method utilizes the frame-based speaker diarization results obtained during the audio
processing stage to determine the relevance of each recognized word to the target speaker.
Words with low relevance are filtered out, improving the overall recognition accuracy. In
parallel with speech recognition, the system performs acoustic event detection, particularly for
laughter, using HMM-based models. This information is used to monitor the atmosphere of the
meeting and detect casual moments. Sentence boundary detection is applied to the recognized
transcripts to improve readability, and topic tracking is performed using the Topic Tracking
Model (TTM) to extract relevant topic words from the conversation.

Advantages:
• Operates in real-time with minimal delay, providing transcripts and speaker activities
promptly, enabling near real-time monitoring of meetings.
• Utilizes advanced audio processing techniques such as dereverberation, source
separation, and noise suppression to enhance speech signals in challenging
environments, improving subsequent speech recognition accuracy.
• Implements robust speaker diarization and word filtering techniques to reduce insertion
errors and improve recognition accuracy and speaker attribution by leveraging frame-
based diarization results.
Limitations:
• Relies on limited training data for adapting acoustic and language models, potentially
limiting overall system performance.
• Involves significant computational complexity due to multiple complex components,
hindering deployment on resource-constrained platforms.
• May experience latency issues in certain scenarios, such as rapid topic shifts in
discussions, and face detection delays affecting synchronization with audio cues,
impacting real-time performance.

2.6 Siqi Zheng, Weilong Huang, Xianliang Wang, Hongbin Suo, Jinwei Feng and Zhijie
Yan, A real-time speaker diarization system based on spatial spectrum, ICASSP 2021
- 2021 IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP)

Speaker diarization is a process of finding the optimal segmentation based on speaker identity
and determining the identity of each segment. It is also commonly stated as a task to answer
the question “who speaks when”. It tries to match a list of audio segments to a list of different
speakers. Unlike the speaker verification task, which is a simple one-to-one matching, speaker
diarization matches M utterances to N speakers, and, in some situations, N is often unknown.
Agglomerative hierarchical clustering (AHC) over speaker embeddings extracted from neural
networks has become the commonly accepted approach for speaker diarization. Variational
Bayes (VB) Hidden Markov Model has proven to be effective in many studies. A refinement
process is usually applied after clustering.


VB re-segmentation is often selected as means for refinement. The real-time requirement poses
another challenge for speaker diarization. To be specific, at any particular moment, it is
required to determine whether a speaker change incidence occurs at the current frame within a
delay of less than 500 milliseconds. This restriction makes refinement process such as VB
resegmentation extremely difficult. Since we can only look 500 milliseconds ahead, the speaker
embedding extracted from the speech within that short window can be biased. Therefore,
finding the precise timestamp of speaker change in a real-time manner still remains quite
intractable for conventional speaker diarization technique. In this paper a speaker diarization
system is described that enables localization and identification of all speakers present in a
conversation or meeting.
A novel systematic approach is proposed to tackle several long-standing challenges in speaker
diarization tasks, to segment and separate overlapping speech from two speakers, to estimate
the number of speakers when participants may enter or leave the conversation at any time, to
provide accurate speaker identification on short text-independent utterances, to track down
speakers' movement during the conversation, and to detect speaker change incidences in real time.
First, a differential directional microphone array-based approach is exploited to capture the
target speakers' voice in far-field adverse environments.
Second, an online speaker-location joint clustering approach is proposed to keep track of
speaker location.
Third, an instant speaker number detector is developed to trigger the mechanism that separates
overlapped speech. The results suggest that the system effectively incorporates spatial
information and achieves significant gains.
Advantages:
• Utilizes a differential directional microphone array and CDDMA design for effective
far-field speech capture, aiding accurate speaker localization.
• The system's real-time speaker diarization capability accurately detects speaker change
incidences within a delay of less than 500 milliseconds leveraging spatial information
and joint efforts of localization and NN-VAD.
Limitations:
• Relies heavily on the use of a microphone array, specifically the CDDMA design,
introducing a hardware dependency and potentially limiting the system's applicability
in scenarios where microphone arrays are not available or feasible to deploy.


• Incorporates several components, including beamforming, localization algorithms,
neural networks, and clustering techniques, resulting in high computational
requirements, especially for real-time processing, which could be a limitation for
resource-constrained environments or devices.

2.7 Yu-An Chung, Wei-Hung Weng, Schrasing Tong and James Glass, Towards
Unsupervised Speech-To-Text Translation, Computer Science and Artificial
Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge

Conventional speech-to-text translation (ST) systems, which typically cascade automatic
speech recognition (ASR) and machine translation (MT), impose significant requirements on
training data. They usually require hundreds of hours of transcribed audio and millions of
words of parallel text from the source and target languages to train individual components,
which makes it difficult to use this approach on low-resource languages. Although recent works
have shown the feasibility of building end-to-end systems that directly translate source speech
to target text without using any intermediate source language transcriptions, they still require
data in the form of source audio paired with target text translations for end-to-end training. The
paper presents a framework for building speech-to-text translation (ST) systems using only
monolingual speech and text corpora, in other words, speech utterances from a source language
and independent text from a target language.
As opposed to traditional cascaded systems and end-to-end architectures, the system does not
require any labelled data (i.e., transcribed source audio or parallel source and target text
corpora) during training, making it especially applicable to language pairs with very few or
even zero bilingual resources.
The framework initializes the ST system with a cross-modal bilingual dictionary inferred from
the monolingual corpora, which maps every source speech segment corresponding to a spoken
word to its target text translation. For unseen source speech utterances, the system first
performs word-by-word translation on each speech segment in the utterance. The translation is
improved by leveraging a language model and a sequence denoising autoencoder to provide
prior knowledge about the target language. Experimental results show that the unsupervised
system achieves comparable BLEU scores to supervised end-to-end models despite the lack of
supervision.
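BLEU, the metric used in this comparison, scores a candidate translation by its n-gram overlap
with one or more reference translations. The snippet below is a generic illustration of a
sentence-level BLEU computation with NLTK; the example sentences are made up and this is
not the evaluation setup of the surveyed paper.

# Illustrative sentence-level BLEU computation with NLTK.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["the", "device", "transcribes", "speech", "in", "real", "time"]]
candidate = ["the", "device", "transcribes", "the", "speech", "in", "real", "time"]

smoother = SmoothingFunction().method1   # avoids zero scores on short sentences
score = sentence_bleu(reference, candidate, smoothing_function=smoother)
print(f"BLEU = {score:.3f}")
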
Advantages:
• Low resource requirement: No need for labelled data, making it applicable to languages
with limited resources.
• Scalability: Can potentially work for a wide range of languages without language-
specific resources.
Limitations:
• Translation quality may vary: Dependent on factors like language complexity and data
availability.
• Limited language coverage: Effectiveness reliant on the quality and availability of
monolingual corpora.

2.8 Md Ashraful Islam Talukder, Sheikh Abujar, Abu Kaisar Mohammad Masum,
Sharmin Akter and Syed Akhter Hossain, Comparative Study on Abstractive Text
Summarization, IEEE – 49239
The paper presents a comparative study on abstractive text summarization, a task that involves
generating a shorter text while preserving the main meaning of the original text. It reviews
three different methodologies: a word graph-based model, a semantic graph reduction model,
and a hybrid approach using Markov clustering. The word graph-based model, proposed for
Vietnamese abstractive text summarization, reduces and combines sentences based on
keywords, discourse rules, and syntactic constraints. This model aims to overcome some of the
drawbacks of existing word graph methods, such as producing incorrect or ungrammatical
sentences. The semantic graph reduction model uses a rich semantic graph to represent the
input text and applies heuristic rules to reduce the graph using WordNet relations. This model
is designed to capture the semantic meaning of the text and generate coherent summaries.
Lastly, the hybrid approach uses Markov clustering to group sentences based on their relations
and then applies rule-based sentence fusion and compression techniques to generate
summaries. This approach is argued to produce concise and informative summaries. The paper
asserts that these methodologies offer promising solutions to the challenges in abstractive text
summarization.
Advantages:


• Methodological Diversity: The paper examines three distinct approaches for abstractive
text summarization, offering a comprehensive overview for comparison and
understanding.
• Addressing Drawbacks: Models incorporate discourse rules, syntactic constraints, and
semantic information from WordNet, enhancing grammatical correctness, coherence,
and capturing underlying meaning.
Limitations:
• Evaluation: Lack of thorough evaluation and validation on diverse datasets hinders
assessment of effectiveness and generalizability.
• Language Specificity: Some models are tailored to Vietnamese, potentially limiting
applicability to other languages.
• Complexity: Certain methodologies introduce computational complexity, potentially
hindering scalability for real-time applications.
• Subjectivity: Inherent interpretation in abstractive summarization may pose challenges
in accurately reflecting original text across contexts.

2.9 Lili Wan, Extraction algorithm of English Text Summarization for English Teaching,
2018 International Conference on Intelligent Transportation, Big Data and Smart
City
In order to improve the sharing and scheduling capability of English teaching resources, an
improved algorithm for English text summarization based on association semantic rules is
proposed. Relative features are mined among English text phrases and sentences; semantic
relevance analysis and feature extraction of keywords in English abstracts are carried out; the
association-rule differentiation for English text summarization is obtained based on information
theory; and related semantic-rule information in English teaching texts is mined. The text
similarity feature is taken as the maximum difference component of two semantic association
rule vectors, and by combining semantic similarity information, accurate extraction of English
text abstracts is achieved.
The simulation results show that the method extracts text summaries accurately and has better
convergence and precision performance in the extraction process. With the rapid development of
computer and network technology, a variety of network media are developing rapidly. English
teaching materials have increased, and a large number of English articles and news media
contain vast amounts of English text; comprehensive reading is difficult, so English text
summaries must be extracted effectively to capture the key information and improve the quality
of English teaching.
Advantages:
• Association Semantic Rules: Utilizing these rules, the algorithm comprehends nuanced
relationships among English text elements, facilitating accurate summarization.
• Semantic Relevance Analysis: Integration of this analysis enhances identification of
crucial information and key concepts within English abstracts.
Limitations:
• Language Specificity: Tailored for English, limiting applicability to other languages
and necessitating significant modifications for adaptation.
• Dependency on Semantic Analysis: Accuracy of summarization hinges on precise
semantic analysis and keyword extraction, posing risks of errors.
• Scalability: Performance may suffer with large text volumes or real-time applications
due to computational complexity.
• Subjectivity: Challenges persist in capturing nuanced semantic relationships,
potentially resulting in variable summary quality across texts and domains.

2.10 Devi Krishnan, Preethi Bharathy, Anagha, and Manju Venugopalan, A Supervised
Approach for Extractive Text Summarization Using Minimal Robust Features,
Proceedings of the International Conference on Intelligent Computing and Control
Systems (ICICCS 2019)

Over the past decade or so the amount of data on the Internet has increased exponentially. Thus
arises the need for a system that processes this immense amount of data into useful information
that can be easily understood by the human brain. Text summarization is one such popular
method under research that opens the door to dealing with voluminous data. It works by
generating a compressed version of the given document by preserving the important details.
Text summarization can be classified into Abstractive and Extractive summarization.
Extractive summarization methods reduce the burden of summarization by selecting a subset
of sentences that are important from the actual document. Although there are many suggested
approaches to this methodology, researchers in the field of natural language processing are
especially attracted to extractive summarization. The model that is introduced in this paper
implements extractive text summarization using a supervised approach with the minimalistic
features extracted. The results reported by the proposed model are quite satisfactory with an
average ROUGE-1 score of 0.51 across 5 reported domains on the BBC news article dataset.


Existing machine learning algorithms can be used to implement these text summarization approaches, improving prediction accuracy without explicit programming.
The paper discusses the two primary approaches to text summarization: abstractive and
extractive. While abstractive summarization involves generating new text that captures the
essence of the original document, extractive summarization selects and rearranges existing
sentences to create a summary. The focus of the paper lies on extractive summarization,
particularly in the context of natural language processing, where researchers are increasingly
drawn to its simplicity and effectiveness. The proposed model adopts a supervised approach to
extractive summarization, leveraging minimalistic features extracted from the text. By
employing machine learning techniques, the model learns to identify and prioritize important
sentences for inclusion in the summary. The use of robust features ensures that the model can
effectively capture the essence of the original document while minimizing the computational
overhead. The paper reports satisfactory results from the proposed model, with an average
ROUGE-1 score of 0.51 across five domains using the BBC news article dataset. ROUGE-1 is
a metric commonly used to evaluate the quality of text summarization.
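For context, ROUGE-1 measures unigram overlap between a system summary and a human reference summary. The following minimal sketch illustrates how such a score could be computed with the open-source rouge-score package; the package choice and the example strings are assumptions for illustration and are not taken from the paper.

# pip install rouge-score
from rouge_score import rouge_scorer

reference = "the cabinet approved the new education policy on monday"
candidate = "the cabinet approved a new education policy"

# ROUGE-1 compares unigram overlap between the candidate and the reference
scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)
scores = scorer.score(reference, candidate)
print(scores["rouge1"].precision, scores["rouge1"].recall, scores["rouge1"].fmeasure)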

Advantages:

• Efficient Knowledge Management: Text summarization aids in managing vast digital information, offering users an overview of key topics and trends without reading every document fully.
• Scalability: Summarization techniques handle large-scale data, ideal for monitoring
news, academic research, and corporate documents with information overload.
• Memory Optimization: By reducing text data, summarization decreases memory and
storage requirements, facilitating faster processing and lower costs.
• Improvement in User Engagement: Summaries provide quick, digestible information,
enhancing user engagement by overcoming time constraints and text density.
• Support for Multilingual Content: Advanced tools work across languages, providing
summaries for global audiences without requiring fluency in the original text.
• Enhanced Accessibility: Summarization makes information more accessible, especially
for individuals with time constraints or attention limitations.
• Decision Support: Summaries aid decision-making processes by quickly conveying
main points of documents or datasets.


• Automation of Routine Tasks: Summarization automates tasks like content curation and
report generation, freeing up human resources.
• Customization and Personalization: Summaries can be customized to cater to individual
preferences and priorities.

Disadvantages:

• Loss of Important Nuances: Extractive summarization may omit critical nuances, leading to incomplete understanding.
• Dependency on Text Quality: Summarization effectiveness depends heavily on text
quality, impacting accuracy.
• Over-Simplification: Summarization might oversimplify complex topics, misleading
readers about depth or implications.
• Technological Limitations: Models may struggle with human language and context,
affecting summary quality.
• Maintenance and Update Needs: Models require continual updates to handle new topics
effectively, which can be resource-intensive.
• Feature Selection: Minimalistic feature usage simplifies the model but might restrict its
ability to capture complex patterns.
• Domain Dependence: Model effectiveness varies across domains, requiring domain-
specific adjustments.
• Bias Amplification: Summarization models may unintentionally amplify biases present
in the original text.
• Evaluation Challenges: Assessing summary quality remains subjective and lacks
standardized metrics.
• Language Complexity: Summarization models struggle with technical language and
linguistic nuances.
• Contextual Ambiguity: Extractive methods may fail to capture contextual ambiguity,
leading to inaccurate summaries.


2.11 Comparative Analysis

Table 2.1 summarizes the diverse range of algorithms and platforms adopted across the surveyed papers. By considering a variety of performance metrics, from accuracy and efficiency to scalability and robustness, these papers offer valuable insight into the strengths and weaknesses of different methodologies. This overview serves as a useful reference for researchers and practitioners navigating the landscape of speech transcription, diarization, and summarization.

Table 2.1: Comparative analysis

[1] Algorithm/Technique: Sentence extraction and sentence compaction; word-based sentence compaction.
Platform: MATLAB.
Performance metrics: Summarization accuracy, word error rate (WER), confidence score.
Advantages: Extracts important information from long speeches or presentations.
Limitations: Speech is a continuous phenomenon that comes without unambiguous sentence boundaries.

[2] Algorithm/Technique: Multilevel framework that segments the speech signal into streams of different intelligibility levels.
Platform: Universal Background Model (UBM) - Gaussian Mixture Model (GMM) framework for speaker verification.
Performance metrics: Equal Error Rate (EER).
Advantages: State-of-the-art Voice Activity Detection (VAD).
Limitations: Relies on unsupervised clustering to segment the signal into streams.

[3] Algorithm/Technique: Bayesian information criterion (BIC).
Platform: -
Performance metrics: Diarization error rate (DER).
Advantages: Speech recognition, speaker identification.
Limitations: Quality of the audio signal, the number of speakers.

[4] Algorithm/Technique: Real-time voice recording on smartphone, speech-to-text conversion.
Platform: Android smartphone, server for file storage and data processing.
Performance metrics: Speech-to-text accuracy, logged size.
Advantages: Searchable voice logs using speech-to-text, reduced storage needs with text filtering.
Limitations: Speech-to-text accuracy is limited; storage needs are still substantial.

[5] Algorithm/Technique: Dereverberation, blind source separation, noise suppression, low-latency speech recognition, speaker diarization.
Platform: 8-channel microphone array.
Performance metrics: Diarization error rate, word/character error rates.
Advantages: Handles reverberation and overlapping speech, low-latency speech recognition.
Limitations: Processing is computationally expensive; errors still remain in transcription.

[6] Algorithm/Technique: Circular differential microphone array (CDDMA), spatial spectrum and direction of arrival estimation.
Platform: Beamforming and localization algorithms, neural network embeddings.
Performance metrics: Diarization error rate (DER), segmentation accuracy.
Advantages: Leverages spatial information for speaker diarization, separates speech signals.
Limitations: High computational cost; still has diarization errors in challenging cases.

[7] Algorithm/Technique: Unsupervised speech segmentation, language model, sequence denoising autoencoder.
Platform: -
Performance metrics: BLEU scores.
Advantages: Does not require transcribed speech data or parallel corpora for training.
Limitations: Does not require transcribed speech data or parallel corpora for training.

[8] Algorithm/Technique: PageRank algorithm, WordNet semantic relations, support vector machines.
Platform: MATLAB.
Performance metrics: ROUGE scores (ROUGE-1, ROUGE-2), percentage of semantically/syntactically correct sentences.
Advantages: Can generate more concise and fluent summaries compared to extractive methods.
Limitations: Computationally more complex than extractive methods; risk of losing meaning or introducing factual inconsistencies.

[9] Algorithm/Technique: Association semantic rules, semantic relevance analysis and feature extraction.
Platform: ANDROID.
Performance metrics: -
Advantages: Efficient data processing, extractive summarization, supervised learning.
Limitations: Provides minimalistic features which limit the model's ability to capture more complex relationships within the text.

[10] Algorithm/Technique: Position features, sentence length, term frequency-inverse document frequency.
Platform: ANDROID.
Performance metrics: ROUGE-1, ROUGE-2, F1-score, human judgment.
Advantages: Specific benefits provided by using minimal robust features.
Limitations: Weaknesses or limitations identified in the method; any potential areas for improvement.


2.12 Summary of Literature Survey

The literature delves into various aspects of speech processing, including automatic
summarization, speaker recognition, real-time meeting analysis, and speech-based life logging,
all contributing to the broader field of natural language understanding and human-computer
interaction. Speech summarization methods aim to distill key information from spoken content,
facilitating efficient review and retrieval, particularly in scenarios where written text is
impractical. Speaker recognition techniques leverage speech intelligibility to enhance
identification accuracy, benefiting security authentication, forensic analysis, and human-
computer interaction. Real-time meeting analysis systems employ advanced techniques for
speech enhancement, speaker diarization, and topic tracking, facilitating effective
communication and collaboration during meetings. Speech-based life logging systems capture
and store audio data from daily experiences, creating a rich repository of personal memories
and contextual information accessible through searchable text. These advancements drive
innovation across domains, enabling more natural and intuitive interactions between humans
and machines.


CHAPTER 3

SYSTEM ANALYSIS
3.1 Existing System

The existing systems, including Google Speech-to-Text and Translation API, Microsoft Azure
Speech Services, IBM Watson Speech to Text and Language Translator, and Amazon
Transcribe and Translate, are esteemed for their prowess in real-time transcription and
translation tasks. These systems offer a comprehensive array of functionalities tailored to meet
the demands of dynamic communication environments. Leveraging powerful speech
recognition models, these systems excel in swiftly converting spoken words into written text
with remarkable accuracy. Trained on vast datasets of speech and text, these models possess
the capability to identify and transcribe individual words and phrases with precision. In the
realm of translation, existing systems employ a variety of techniques, such as statistical
machine translation (SMT) or neural machine translation (NMT). SMT relies on statistical
analysis of large bilingual text corpora to discern patterns and perform translations, while NMT
utilizes deep learning algorithms to analyse source and target languages, enabling more
nuanced and context-aware translations.

3.1.1 Drawbacks

• Dependency on Hardware Resources: Existing systems often require high-performance GPUs or dedicated hardware resources, which can be a barrier for users without access to such hardware or unwilling to make additional investments.
• Handling Complex Speech Patterns: Existing systems may struggle with complex
speech patterns like overlapping speech or heavy accents, leading to decreased
transcription accuracy and requiring additional preprocessing or post-processing steps.
• Impact of Input Audio Quality and Background Noise: The performance of existing
systems may degrade in noisy environments or with low-quality input audio, resulting
in inconsistent or inaccurate transcription and translation results.
• Usability Challenges: Some users, particularly those with disabilities, may face
difficulties navigating the interface or operating the device effectively.
• Language and Accent Limitations: The system's transcription accuracy may vary
depending on language proficiency and accent, potentially leading to errors in
transcription.


• Privacy Concerns: Real-time audio processing raises privacy concerns regarding the
security and handling of transcribed conversations, potentially affecting user trust.

3.2 Proposed System

The proposed system represents a significant advancement in inclusive communication technology, leveraging cutting-edge advancements in several key areas, including translation
and summarization. Firstly, the system offers dual-mode audio input functionality, seamlessly
integrating Bluetooth technology for wireless transmission. This capability ensures stability
and reliability in audio transmission, catering to diverse user preferences and environmental
conditions, while also facilitating translation and summarization tasks. By providing users with
the flexibility to switch between Bluetooth devices, the system enhances usability and
adaptability, particularly in scenarios where connectivity may be limited or prone to
interference.

Secondly, the system prioritizes performance optimization through the implementation of advanced techniques such as parallel processing and hardware acceleration for local audio
processing, including translation and summarization algorithms. These optimization strategies
enable the system to achieve efficient real-time transcription, translation, and summarization
without placing undue strain on computational resources. Whether operating on high-end
devices or lower-end hardware configurations, users can expect consistent and reliable
performance across all tasks, even under challenging conditions or during periods of high
demand.

In addition to technical improvements, the user interface undergoes a comprehensive redesign aimed at enhancing accessibility and usability, including for translation and summarization
tasks. Modern UI frameworks like PyQt5 are utilized to create an intuitive and user-friendly
interface that caters to individuals with diverse needs, including those with disabilities.
Accessibility features such as voice commands, high contrast modes, and gesture-based
interactions are integrated to ensure an inclusive user experience, facilitating seamless
interaction with translation and summarization functionalities. By prioritizing accessibility in
interface design, the system empowers users to engage in real-time communication more
effectively, regardless of their physical abilities or technological proficiency. Furthermore, the
system's translation functionality expands its reach by breaking down language barriers,
facilitating communication between individuals who speak different languages. Users can
seamlessly translate transcribed text into their preferred language, fostering inclusivity and enabling cross-cultural communication. Additionally, the summarization feature enhances productivity by providing succinct summaries of lengthy transcripts, allowing users to quickly
grasp the main points without having to read through entire documents. This capability is
particularly useful in fast-paced environments where time is of the essence. Moreover, the
system's continuous updates and improvements ensure that it remains at the forefront of
technology, incorporating the latest advancements in translation, summarization, and
accessibility. As the demand for inclusive communication solutions continues to grow, the
system stands as a testament to the power of innovation in creating a more connected and
accessible world.

3.3 Feasibility Study

The feasibility study will assess the technical viability and economic aspects of the proposed
system. It will analyze available technology, resource requirements, and potential challenges
for technical feasibility, along with cost estimates for development, deployment, and
maintenance, ensuring economic viability. Operational aspects such as user acceptance,
usability, and scalability will also be evaluated to ensure alignment with user needs and
organizational objectives.

3.3.1 Technical Feasibility

• Availability of Advanced Technologies: The system leverages advanced technologies such as Bluetooth for wireless audio transmission and local audio processing capabilities on mobile devices.
• Real-time Transcription Capabilities: The integration of Bluetooth and local audio processing enables real-time transcription with low latency, facilitating seamless communication.
• Machine Learning for Enhanced Accuracy: The use of machine learning algorithms for transcription enhances accuracy and adaptability across diverse languages and accents, ensuring reliable performance in various scenarios.
• Utilization of Modern UI Frameworks: The selection of modern UI frameworks like PyQt5 facilitates the development of an intuitive and user-friendly interface, enhancing usability and accessibility.
• Readily Available Technical Infrastructure: The required technical infrastructure and capabilities for the proposed system are readily available, making its implementation technically feasible.
Overall, the combination of these factors contributes to the technical feasibility of the proposed system, enabling effective communication solutions for individuals with diverse language backgrounds and hearing impairments.


3.3.2 Operational Feasibility


• User Acceptance: The system aims to address communication barriers for individuals
with diverse language backgrounds and hearing impairments. User acceptance will be
crucial, and initial feedback from potential users and stakeholders should be gathered
to ensure that the system meets their needs and expectations.
• Usability: The system undergoes a comprehensive redesign of the user interface to
enhance accessibility and usability. Features such as voice commands, high contrast
modes, and gesture-based interactions are integrated to ensure an intuitive and user-
friendly experience.
• Scalability: The proposed system should be scalable to accommodate potential future
growth in user base and usage demands. It should be able to handle increased
transcription requests and user interactions without significant degradation in
performance.
• Compatibility: The system should be compatible with a wide range of devices and
platforms to maximize accessibility and user reach. Compatibility testing should be
conducted to ensure seamless integration and functionality across different devices and
operating systems.
• Training and Support: Adequate training and support mechanisms should be in place
to assist users in effectively utilizing the system. This may include user documentation,
training materials, and customer support channels to address any issues or questions
that may arise during usage.

3.3.3 Economic Feasibility

• Cost Estimates: Economic feasibility involves evaluating the costs associated with the
development, deployment, and maintenance of the proposed system. This includes
expenses related to software development, hardware acquisition, infrastructure setup,
and ongoing maintenance and support.
• Return on Investment (ROI): An analysis of potential revenue streams and cost
savings resulting from the implementation of the system will be conducted to determine
the ROI. This may include revenue generated from product sales, subscription fees, or
service offerings, as well as cost savings resulting from improved efficiency or reduced
operational expenses.
• Cost-Benefit Analysis: A cost-benefit analysis will be conducted to weigh the projected benefits of the system against its associated costs. This analysis will help determine whether the benefits of implementing the system outweigh the costs, thereby indicating its economic viability.
• Market Demand and Competition: Economic feasibility also involves assessing
market demand and competition to determine the system's potential for success in the
marketplace. This includes analyzing factors such as target market size, customer
needs, competitive landscape, and pricing strategies.
• Risk Assessment: Economic feasibility analysis will also involve identifying and
mitigating potential risks and uncertainties that could impact the financial viability of
the project. This may include risks related to technology, market dynamics, regulatory
compliance, and unforeseen expenses.


CHAPTER 4

SYSTEM SPECIFICATION
4.1 Hardware Requirements
• Memory (RAM): Minimum 2GB RAM, recommended 4GB or more for optimal
performance, especially with additional features like summarization and task
management.
• Processor: Quad-core or higher processor for efficient real-time audio processing.
• Storage: Minimum 16GB internal storage for application installation and data storage.
External storage options (e.g., microSD card) recommended for users generating large
amounts of transcribed content.
• Microphone (Built-In): Functional built-in microphone for audio input.
• Bluetooth Microphone (Optional): Bluetooth compatibility for users opting to use an
external Bluetooth microphone.
• Display: Responsive display with suitable resolution for displaying transcribed text and
interacting with the application.
• Battery: Sufficient battery capacity to support continuous audio processing and
transcription for extended periods.

4.2 Software Requirements


• Android Operating System: Android OS version 6.0 (Marshmallow) or above.
• Development Environment: VS Code, PyCharm, and IDLE.
• Bluetooth API: Utilize Android's Bluetooth API for communication with external
Bluetooth devices.
• Programming Language: Python.
• User Interface Tools: PyQt5.
• Documentation Tools: Tools for creating comprehensive user and developer
documentation.
• Continuous Integration (Optional): Implementation of a continuous integration
system for automated builds and testing.


4.3 Functional Requirements


• Audio Input Capture: The system should capture audio input from both the built-in
microphone and external Bluetooth microphones simultaneously.
• Bluetooth Integration: Users must be able to seamlessly connect to and switch
between various external Bluetooth devices for flexible audio input sources.
• Optimized Local Audio Processing: The application should perform real-time local
audio processing, including noise reduction, echo cancellation, and audio
normalization, to optimize the quality of the captured audio for transcription.
• Real-Time Transcription: The system must provide real-time transcription, displaying
the transcribed text to the user as it is processed.
• User Interface: An intuitive and user-friendly interface allows users to start/stop
transcription, customize settings, and view transcribed text easily.
• Customization Options: Users should have the ability to customize transcription
settings, including language selection, transcription format, and the option to save
transcriptions.
• Error Handling: Robust error-handling mechanisms are implemented to manage
issues such as Bluetooth disconnections, low audio quality, or cloud service
unavailability.
• Summarization of Transcribed Text: The application provides a summarization
feature, condensing the transcribed text to capture key points and highlights.
• Text Search and Task Extraction: Users can search the transcribed text to extract
tasks and important information. The application identifies and categorizes tasks for
further processing.
• To-Do List Creation: Based on the extracted tasks, the application allows users to
create a to-do list, organizing and prioritizing tasks for easy reference and management.
• Documentation: Provide comprehensive user documentation explaining how to use the
application, connect Bluetooth devices, understand customization options, utilize
summarization, and manage tasks through the to-do list.

4.4 Non-Functional Requirements


• Performance: The application should transcribe audio in real-time with a latency of no
more than 2 seconds and provide efficient performance for summarization, search, and
task management functionalities.


• Scalability: Design the system to handle a scalable number of users and increased
transcription loads, ensuring consistent performance under varying usage conditions,
including the new features.
• Reliability: The system should have a high level of reliability, with a maximum
allowable downtime of 1 hour per month for maintenance.
• Compatibility: Ensure compatibility with Android devices running Android OS
version 6.0 (Marshmallow) and above, supporting a variety of screen sizes and
resolutions, considering the new features.
• Usability: Conduct usability testing to ensure the application is user-friendly,
considering users with different levels of technical expertise, and specifically testing
the new summarization, search, and task management features.
• Accessibility: Implement accessibility features for the new functionalities, ensuring
users with diverse needs can efficiently utilize summarization, search, and task
extraction features.
• Security: Enhance security measures to safeguard the privacy of summarized content,
search queries, and task-related information for the new features.
• Maintainability: Design the system with modular and well-documented code,
facilitating easy maintenance, updates, and bug fixes for the new features.
• Portability: Optimize the application for various Android devices, providing a
consistent experience for users with different screen sizes and resolutions, considering
the new functionalities.
• Compliance: Ensure compliance with privacy regulations for the new features,
especially when dealing with summarized content and task-related data.


CHAPTER 5

PROJECT DESCRIPTION
5.1 Problem Definition
Transcriber AI addresses the pressing need for comprehensive text processing solutions in
today's digital landscape. As information continues to inundate various channels, the ability to
efficiently manage, understand, and extract insights from textual data becomes paramount. The
project aims to tackle this challenge by offering a versatile platform that encompasses real-time
transcription, text translation, and text summarization functionalities. However, achieving
these goals requires overcoming several key hurdles.

Firstly, the accuracy and speed of real-time transcription present significant challenges.
Converting spoken words into written text in near real-time demands robust speech recognition
algorithms capable of deciphering diverse accents, languages, and audio qualities.
Additionally, ensuring minimal latency between audio input and transcribed output is crucial
for providing a seamless user experience.

Secondly, text translation across multiple languages poses another significant challenge.
Achieving accurate and contextually appropriate translations necessitates integration with
reliable translation APIs and the implementation of sophisticated natural language processing
techniques. Moreover, maintaining consistency and coherence in translated texts while
preserving the nuances of the original language is essential for effective communication.

Lastly, text summarization requires the development of algorithms capable of distilling large
volumes of textual data into concise and informative summaries. Identifying key information,
extracting essential concepts, and maintaining the coherence and relevance of the summary
represent formidable challenges. Moreover, ensuring that the summarization process is
efficient and scalable to handle diverse types of text inputs is critical for the application's utility
across different domains.

5.2 Overview of the Project

Transcriber AI represents a comprehensive text processing solution designed to empower users with advanced tools for managing textual data effectively. At its core, the project aims to
provide a seamless user experience through a combination of robust algorithms, intuitive
interfaces, and scalable architecture. The application's functionality revolves around three primary modules: real-time transcription, text translation, and text summarization. The real-
time transcription module enables users to convert audio input into text in near real-time,
facilitating the rapid capture and documentation of spoken content. The text translation module
allows users to seamlessly translate text between multiple languages, promoting cross-cultural
communication and understanding. The text summarization module extracts key insights from
lengthy documents or conversations, enabling users to quickly grasp essential information and
identify relevant trends or patterns.

The system architecture of Transcriber AI is designed to accommodate these diverse functionalities while ensuring scalability, efficiency, and reliability. Each module interacts
seamlessly with the others, leveraging integration layers to facilitate data flow and
communication. Moreover, the user interface provides an intuitive platform for users to access
and utilize the application's features, fostering a seamless and enjoyable user experience.

5.3 System Architecture

Fig 5.1 illustrates the system architecture of the project, showcasing the seamless process of
real-time transcription and speaker diarization as speech is inputted from a microphone. The
diagram visually captures the intricate workflow, highlighting the efficient transformation of
spoken words into written text while discerning and categorizing speakers in real time.

Fig. 5.1: System Architecture


The system architecture of TranscriberAI comprises several interconnected components, each playing a crucial role in the application's functionality:

• Microphone: A small, compact device that captures speech as input.
• Real-Time Transcription Module: This module employs advanced speech
recognition algorithms to convert audio input into text in near real-time. It utilizes
techniques such as acoustic modeling, language modeling, and deep learning to
accurately transcribe spoken words into written text.
• Text Translation Module: The translation module integrates with external translation
APIs to provide seamless and accurate translations between multiple languages. It
leverages natural language processing techniques to ensure contextually appropriate
translations while maintaining consistency and coherence across languages.
• Text Summarization Module: This module utilizes natural language processing
algorithms to extract key insights from large volumes of textual data and generate
concise summaries. It employs techniques such as keyword extraction, sentence
scoring, and semantic analysis to identify important information and present it in a
digestible format.
• User Interface: The user interface of TranscriberAI serves as the primary interaction
point for users, providing a visually appealing and intuitive platform for accessing the
application's functionalities. It includes components such as buttons, text input fields,
and scrollable views, enabling users to navigate the application effortlessly.
• Integration Layer: The integration layer facilitates seamless communication between
different modules of the system, ensuring cohesive operation and efficient data flow. It
manages interactions such as data exchange, synchronization, and error handling,
enabling the application to function smoothly across various scenarios.
• Storage: The user decides where to store their personal speech data.

5.4 Module Description

5.4.1 Real-Time Transcription Module

The Real-Time Transcription Module serves as the foundation of TranscriberAI, enabling the
application to convert audio input into text in near real-time with high accuracy and minimal
latency. This module incorporates advanced speech recognition algorithms and techniques to
decipher spoken words, regardless of accents, languages, or audio qualities. At its core, the
Real-Time Transcription Module utilizes state-of-the-art speech recognition algorithms, such as deep learning-based models, to process audio input and generate transcribed text. These
algorithms are trained on vast datasets of speech samples, allowing them to recognize patterns
in speech and accurately convert spoken words into written text. Additionally, the module
employs techniques such as acoustic modeling and language modeling to further enhance
transcription accuracy. Acoustic modeling involves analyzing the acoustic features of speech,
such as pitch and intensity, while language modeling focuses on understanding the context and
grammar of the spoken language. One of the key challenges addressed by the Real-Time
Transcription Module is handling diverse audio inputs. This includes accommodating different
accents, dialects, and languages spoken by users. To overcome this challenge, the module
employs adaptive algorithms that can adjust to variations in speech patterns and linguistic
nuances. By continuously learning from new data and feedback, the module improves its
transcription accuracy over time, ensuring reliable performance across various audio inputs.
Another critical aspect of the Real-Time Transcription Module is minimizing latency between
audio input and transcribed output. Real-time transcription requires rapid processing of audio
data to provide timely feedback to users. To achieve this, the module leverages efficient
algorithms and optimized processing pipelines that prioritize speed without compromising
accuracy. By utilizing parallel processing techniques and optimizing resource usage, the
module minimizes processing delays and delivers transcribed text in near real-time.
Additionally, the Real-Time Transcription Module is designed to be scalable and adaptable to
different environments and use cases. Whether it's transcribing live speeches, conference calls,
or dictations, the module can handle varying levels of audio complexity and volume. It supports
integration with external audio sources, such as microphones, audio files, or streaming services,
enabling seamless capture and transcription of audio data from diverse sources.

In summary, the Real-Time Transcription Module is a critical component of TranscriberAI, enabling the application to convert audio input into text with high accuracy, minimal latency,
and scalability. By leveraging advanced speech recognition algorithms and techniques, the
module provides users with a reliable and efficient solution for capturing spoken content in
real-time.

5.4.2 Text Translation Module

The Text Translation Module empowers users to translate text between multiple languages
seamlessly, facilitating cross-cultural communication and understanding. This module
integrates with external translation APIs and employs natural language processing techniques
to ensure accurate and contextually appropriate translations.

Central to the Text Translation Module is its integration with external translation APIs, such
as Google Translate or Microsoft Translator. These APIs offer extensive language support and
robust translation capabilities, enabling the module to provide accurate translations across a
wide range of languages. By leveraging pre-trained translation models and linguistic resources,
the module can translate text with high fidelity while preserving the nuances of the original
language.

However, achieving accurate translations goes beyond mere word-for-word conversion. The
Text Translation Module employs natural language processing techniques to ensure
contextually appropriate translations that capture the intended meaning of the text. This
includes analyzing the syntactic and semantic structure of the text, identifying idiomatic
expressions and cultural references, and adapting the translation to suit the target language's
linguistic conventions.

Moreover, the module maintains consistency and coherence in translated texts to enhance
readability and comprehension. It ensures that translated texts maintain the same tone, style,
and level of formality as the original text, thereby preserving the author's voice and intent.
Additionally, the module employs techniques such as post-editing and quality assurance to
refine translations and address any discrepancies or errors.

The Text Translation Module is designed to be versatile and adaptable to various translation
scenarios and use cases. Whether it's translating documents, emails, or website content, the
module can handle diverse types of text inputs with ease. It supports batch translation for
processing large volumes of text efficiently and offers customizable translation options to meet
specific user preferences and requirements.

Overall, the Text Translation Module is an essential component of TranscriberAI, enabling users to overcome language barriers and communicate effectively across different cultures and
languages. By integrating with external translation APIs and employing natural language
processing techniques, the module provides accurate, contextually appropriate translations that
facilitate global communication and collaboration.

5.4.3 Text Summarization Module

The Text Summarization Module plays a pivotal role in distilling large volumes of textual data
into concise and informative summaries, enabling users to quickly grasp essential information
and identify relevant insights. This module utilizes natural language processing algorithms and techniques to extract key information from lengthy documents or conversations and generate
summaries that maintain coherence and relevance.

Key to the Text Summarization Module is its ability to identify important sentences or phrases
within the text and assign them a relevance score based on their significance. The module
employs techniques such as keyword extraction, sentence scoring, and semantic analysis to
analyze the content and identify key concepts, ideas, or arguments. By prioritizing sentences
with higher relevance scores, the module ensures that the generated summaries focus on the
most important information.
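As a concrete illustration of the sentence-scoring idea described above, the following self-contained sketch ranks sentences by the document-level frequency of the words they contain. It is a toy example of the technique, not the module's actual implementation, and the sample text is invented for illustration.

import re
from collections import Counter

def score_sentences(text):
    """Score each sentence by the document-wide frequency of its words."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = re.findall(r"[a-z']+", text.lower())
    freq = Counter(words)
    return {s: sum(freq[w] for w in re.findall(r"[a-z']+", s.lower()))
            for s in sentences}

text = ("The meeting covered the new release schedule. "
        "The release ships next week after final testing. "
        "Snacks were also discussed briefly.")
ranked = sorted(score_sentences(text).items(), key=lambda item: item[1], reverse=True)
print(ranked[0][0])  # the highest-scoring sentence would be kept for the summary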

Another critical aspect of the Text Summarization Module is maintaining coherence and
relevance in the generated summaries. This involves structuring the summary in a logical and
coherent manner, ensuring smooth transitions between sentences and paragraphs, and
eliminating redundant or irrelevant information. Additionally, the module adapts the
summary's length and level of detail to suit user preferences and requirements, allowing users
to customize the summary's depth and complexity.

Furthermore, the Text Summarization Module is designed to be efficient and scalable, capable
of handling diverse types of text inputs and generating summaries in real-time. Whether it's
summarizing articles, research papers, or meeting transcripts, the module can process large
volumes of textual data with minimal computational overhead. It supports integration with
external text sources and offers options for batch summarization to streamline the
summarization process and improve efficiency.

Overall, the Text Summarization Module is a crucial component of TranscriberAI, enabling users to extract valuable insights and glean actionable information from textual data efficiently.
By leveraging natural language processing algorithms and techniques, the module provides
concise, informative summaries that empower users to make informed decisions and drive
meaningful outcomes.

5.4.4 Storage

The user has the autonomy to decide where they want to store their personal speech data. This emphasizes user privacy and control over personal data. Users can choose storage options based on their preferences and requirements, providing a flexible and customizable aspect to the application. This approach aligns with the principles of user-centric design and data ownership.


5.5 Dataflow Diagram

Fig 5.2 is a graphical representation that illustrates how data moves through the system. Data
flows seamlessly from the microphone input, capturing spoken words, to the real-time
transcription process, where advanced algorithms convert the audio data into written text while
concurrently performing speaker diarization for accurate identification of speakers.

Fig. 5.2: Dataflow Diagram


• User Launches Application: This module initiates the process when the user starts the application.
• Bluetooth Connection Check: This module verifies if a Bluetooth device is connected.
o If not connected, it prompts the user to connect a device.
o If connected, the process proceeds to the next module.
• Audio Input Capture: This module captures audio input from the connected Bluetooth
device.
• Local Audio Processing and Transcription: This module processes the captured
audio locally on the device. It transcribes the audio into text format.
• Real-Time Display and Summarization: This module displays the transcribed text in
real-time. It may also generate summaries or highlights of the audio content.
• End: This marks the completion of the process.


CHAPTER 6

SYSTEM IMPLEMENTATION
6.1 Introduction

The system implementation for the proposed real-time transcription project involves setting up
the development environment with required libraries such as Python, PyQt5, PyAudio, Vosk,
Googletrans, and Sumy. The user interface is designed using PyQt5 framework, comprising
screens for real-time transcription, translation, and summarization, featuring text input/output
fields and interaction buttons. PyAudio captures audio input from the microphone in real-time,
while Vosk library performs speech recognition and transcription, displaying transcribed text
on the UI dynamically. Integration of Googletrans enables text translation based on user-
selected target language, with translated text displayed alongside the original. Utilizing Sumy
library, the system summarizes transcribed text, offering users options for summary length.
Event handling mechanisms manage user interactions, including transcription control, file
selection for translation/summarization, and screen navigation. Robust error handling and
validation ensure smooth operation, while testing and debugging phase identifies and resolves
any issues. Comprehensive documentation covers system architecture, installation, usage, and
developer guidelines, facilitating seamless deployment across platforms.
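As a sketch of the environment setup described above, the dependencies could be captured in a requirements.txt such as the one below. The package names follow the libraries listed in this chapter; the googletrans version pin is an assumption (the 4.0.0rc1 release is commonly used), and Vosk additionally requires a downloaded acoustic model.

# requirements.txt (illustrative)
PyQt5
PyAudio
vosk
googletrans==4.0.0rc1
sumy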

6.2 Hardware Implementation

The hardware implementation for integrating Bluetooth functionality involves selecting a suitable Bluetooth chip/module supporting Bluetooth 4.0/5.0 and profiles like A2DP and HFP, as shown in Fig 6.1 and Fig 6.2. This chip/module will include a high-fidelity condenser
microphone and a rechargeable lithium-ion battery for power. Advanced audio processing
algorithms for noise reduction and echo cancellation may be integrated. The device will be
housed in a compact enclosure with LED status indicators. Seamless wireless connectivity will
be ensured for compatibility with various Bluetooth devices. Thorough testing and compliance
with regulatory standards will affirm functionality and reliability before production. The
device's design prioritizes user comfort and durability, ensuring ergonomic usability.
Compatibility extends to a wide range of Bluetooth peripherals, including TWS earphones.
Stringent quality assurance procedures validate functionality and regulatory compliance pre-
production.


Fig 6.1: Integrated Bluetooth module

The integration has been engineered to enable compatibility with a broad spectrum of Bluetooth
devices, including True Wireless Stereo (TWS) earphones and similar peripherals. By adopting
a versatile Bluetooth chip/module with comprehensive protocol support, such as Advanced
Audio Distribution Profile (A2DP) and Hands-Free Profile (HFP), the system accommodates
diverse Bluetooth-enabled accessories seamlessly. This approach ensures that users can
leverage their preferred Bluetooth devices, ranging from earphones to headsets, as the primary
audio input source for the system. Thus, users can experience enhanced flexibility and
convenience in utilizing their existing Bluetooth ecosystem while engaging with the real-time
transcription functionality offered by the system.

Fig 6.2: Bluetooth ear buds


6.3 Software Implementation

1. User Interface:
• Libraries Used: PyQt5
• Implementation: The PyQt5 framework is used to design and implement the user interface. PyQt5 widgets such as QVBoxLayout, QPushButton, QLineEdit, QScrollArea, and QStackedWidget are used to create a user-friendly interface with multiple screens (a minimal skeleton follows Fig 6.3).
• Responsive Design: The user interface is designed to be responsive, adapting
gracefully to different screen sizes and orientations, ensuring consistent user
experience across devices.
• Fig 6.3 shows the intro screen of the ‘TranscriberAI’ application.

Fig 6.3: Intro Design
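The following minimal PyQt5 sketch shows the kind of layout described above: a stacked set of pages with a read-only transcript area and start/stop buttons. The widget names, window title, and sizes are illustrative assumptions rather than the application's exact code.

import sys
from PyQt5.QtWidgets import (QApplication, QWidget, QStackedWidget,
                             QVBoxLayout, QPushButton, QTextEdit)

class TranscriptionScreen(QWidget):
    """One page of the stacked UI: transcript view plus start/stop controls."""
    def __init__(self):
        super().__init__()
        layout = QVBoxLayout(self)
        self.output = QTextEdit()
        self.output.setReadOnly(True)          # transcribed text is displayed here
        self.start_button = QPushButton("Start transcription")
        self.stop_button = QPushButton("Stop")
        layout.addWidget(self.output)
        layout.addWidget(self.start_button)
        layout.addWidget(self.stop_button)

if __name__ == "__main__":
    app = QApplication(sys.argv)
    screens = QStackedWidget()                 # one page per feature screen
    screens.addWidget(TranscriptionScreen())
    screens.setWindowTitle("TranscriberAI")
    screens.resize(480, 640)
    screens.show()
    sys.exit(app.exec_())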

2. Real-Time Transcription Module:

• Libraries Used: PyAudio, Vosk

• Implementation: PyAudio captures audio input from the microphone in real time. The Vosk library performs speech recognition and transcription, and the transcribed text is displayed dynamically on the user interface (a minimal sketch follows Fig 6.4).

• Error Handling: Robust error-handling mechanisms are implemented to manage issues such as microphone access errors, audio input interruptions, and recognition failures.

• Multithreading: Threading is utilized to run audio capture and transcription processes concurrently, ensuring smooth real-time operation without blocking the user interface.

• Fig 6.4 shows the transcription module of the ‘TranscriberAI’ application.

Fig 6.4: Transcription Module
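A minimal sketch of the capture-and-recognize loop described above. It assumes a Vosk acoustic model has been downloaded to a local model/ directory and a 16 kHz mono microphone stream; in the actual application this loop would run on a worker thread so the user interface stays responsive.

import json
import pyaudio
from vosk import Model, KaldiRecognizer

model = Model("model")                       # path to a downloaded Vosk model (assumption)
recognizer = KaldiRecognizer(model, 16000)   # sample rate must match the stream below

audio = pyaudio.PyAudio()
stream = audio.open(format=pyaudio.paInt16, channels=1, rate=16000,
                    input=True, frames_per_buffer=8000)
stream.start_stream()

try:
    while True:
        data = stream.read(4000, exception_on_overflow=False)
        if recognizer.AcceptWaveform(data):
            # A complete utterance was recognised; show the final text
            print(json.loads(recognizer.Result()).get("text", ""))
        else:
            # Partial hypothesis while the speaker is still talking;
            # the GUI would show this interim text instead of printing it
            partial = json.loads(recognizer.PartialResult()).get("partial", "")
except KeyboardInterrupt:
    stream.stop_stream()
    stream.close()
    audio.terminate()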

3. Translation Module:
• Libraries Used: Googletrans
• Implementation: The Googletrans library translates text into the user-selected target language. Translated text is displayed alongside the original text on the user interface (a minimal sketch follows Fig 6.6).
• Language Detection: The application incorporates language detection
functionality to automatically identify the source language of the transcribed
text before translation
• Fig 6.5 shows the translation module of the ‘TranscriberAI’ application.
• Fig 6.6 shows the different languages available to which the transcribed data
can be translated

Fig 6.5: Translation Module


Fig 6.6: Language options
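A minimal sketch of the detect-then-translate flow described above. It assumes the googletrans package (the widely used 4.0.0rc1 release); the sample sentence and the default target language are illustrative only.

from googletrans import Translator   # pip install googletrans==4.0.0rc1 (assumed version)

translator = Translator()

def translate_text(text, target_language="hi"):
    """Detect the source language, then translate into the chosen target language."""
    detected = translator.detect(text)                     # e.g. lang='en'
    result = translator.translate(text, src=detected.lang, dest=target_language)
    return result.text

if __name__ == "__main__":
    print(translate_text("Good morning, the meeting starts at ten."))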


4. Summarization Module:
• Libraries Used: Sumy
• Implementation: The Sumy library performs text summarization, condensing the transcribed text to capture key points and highlights. Summarized text is displayed on the user interface (a minimal sketch follows Fig 6.7).
• Summarization Algorithms: Different summarization algorithms provided by
Sumy, such as LexRank, LSA, and Luhn, are evaluated and compared to
determine the most effective summarization approach for the application.
• Fig 6.7 shows the summarization module of the ‘TranscriberAI’ application.


Fig 6.7: Summarization Module
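A minimal sketch of Sumy-based summarization as described above, using the LexRank algorithm. It assumes NLTK tokenizer data ("punkt") is available, and the sentence count and sample transcript are illustrative.

from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lex_rank import LexRankSummarizer

def summarize(text, sentence_count=3):
    """Return the `sentence_count` most representative sentences of the text."""
    parser = PlaintextParser.from_string(text, Tokenizer("english"))
    summarizer = LexRankSummarizer()
    sentences = summarizer(parser.document, sentence_count)
    return " ".join(str(sentence) for sentence in sentences)

if __name__ == "__main__":
    transcript = ("The team reviewed last quarter's results. Revenue grew by ten percent. "
                  "Most of the growth came from the new mobile product. "
                  "The next release is planned for June.")
    print(summarize(transcript, sentence_count=2))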

5. File Handling:
• Libraries Used: os
• Implementation: The os library is used for file operations such as file selection, reading, and writing. It enables users to select files for translation or summarization and handles file loading for processing (a minimal sketch follows Fig 6.8).
• File Format Support: The application supports various file formats for input and
output, including plain text (.txt), documents (.docx), and portable document
format (.pdf), enhancing versatility and usability.
• Fig 6.8 shows the file selection module of the ‘TranscriberAI’ application.


Fig 6.8: File Selection Module
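A minimal sketch of the file-selection step described above, combining os with PyQt5's QFileDialog. Plain-text files are read directly; handling .docx and .pdf inputs would require additional libraries (for example python-docx or PyPDF2), which are assumptions not shown here.

import os
from PyQt5.QtWidgets import QFileDialog

def pick_transcript_file(parent=None):
    """Let the user choose a saved transcript and return its text contents."""
    path, _ = QFileDialog.getOpenFileName(
        parent, "Select a file", os.path.expanduser("~"),
        "Text files (*.txt);;All files (*)")
    if not path:
        return ""                              # the user cancelled the dialog
    with open(path, "r", encoding="utf-8") as handle:
        return handle.read()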


CHAPTER 7

SYSTEM TESTING
System testing is a crucial part of any project to ensure that it meets the desired requirements
and functions correctly. For the Transcriber AI project, there are several components that need
to be tested to ensure their proper functioning.

7.1 Tests conducted

7.1.1 Hardware Testing

Hardware testing for the Transcriber AI involves testing the components and the system to
ensure that they are functioning correctly and as intended. The testing process involves the
following steps:

• Testing Bluetooth Connection: The Bluetooth module undergoes testing to verify its
ability to establish a stable connection with the system. This involves initiating
connections with various Bluetooth devices and ensuring successful pairing and
communication.
• Testing Bluetooth Range: In addition to connection testing, the Bluetooth module's
range is assessed to determine the distance over which it can maintain a reliable
connection with paired devices. This is achieved by gradually increasing the distance
between the system and the Bluetooth device while monitoring signal strength and
connectivity stability.

7.1.2 Software Testing

Software testing is an important part of the development process to ensure that the software functions as expected and meets the user requirements. In the case of the Transcriber AI, the software includes the transcription, translation, and summarization modules together with the PyQt5 application that the user interacts with. The software testing process typically involves the following steps:

• Unit Testing: This involves testing individual units of code in isolation to ensure they function as expected. For the Transcriber AI, this entails testing the code responsible for handling transcription, translation, and summarization independently (a minimal example appears after this list).

• Integration Testing: Integration Testing focuses on evaluating how different units of
code interact with each other. In the context of the Transcriber AI, this would entail
testing how the code manages the transition between different screens within the PyQt5
framework and how the output of the transcription is seamlessly fed into the translation
and summarization processes.
• System Testing: This involves testing the system to ensure that it functions as
expected. For the Transcriber AI, this would entail testing the system under diverse
conditions to verify its capability to accurately recognize speech and execute
transcription, alongside other operations like translation and summarization, reliably.
• User Acceptance Testing: UAT for the Transcriber AI involves assessing the
application with end-users to ensure they can effectively utilize its features, such as
real-time transcription, translation, and summarization, while providing feedback on
usability and functionality. During this phase, users will interact with the AI to
transcribe various audio inputs, translate them into desired languages, and summarize
the transcribed content.
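As referenced in the Unit Testing item above, the sketch below shows one way such a unit test could look. It assumes the hypothetical summarize() helper from the Chapter 6 summarization sketch is importable from a module named transcriber.summarizer; both the module path and the asserted behaviour are illustrative assumptions.

import unittest

# Hypothetical import path; 'summarize' is the Sumy-based helper sketched in Chapter 6
from transcriber.summarizer import summarize

class SummarizationTests(unittest.TestCase):
    def test_summary_is_shorter_than_input(self):
        text = ("Speech summarization distills key points from long recordings. "
                "It helps users review meetings quickly. "
                "Extractive methods pick the most informative sentences.")
        summary = summarize(text, sentence_count=1)
        self.assertTrue(0 < len(summary) < len(text))

    def test_summary_is_text(self):
        self.assertIsInstance(summarize("One short sentence.", sentence_count=1), str)

if __name__ == "__main__":
    unittest.main()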

7.2 Test Cases


Table 7.1 Test cases for Transcriber AI

1. Data Collection
   Expected output: The audio input to be transcribed by the AI system.
   Actual output: The AI system transcribes the audio input provided to it.
   Remarks: Pass

2. Transcription
   Expected output: The transcriber accurately transcribes the audio input into text in real time.
   Actual output: The transcriber accurately converts real-time audio input into text.
   Remarks: Pass

3. Translation
   Expected output: The transcribed text is to be translated into the desired language.
   Actual output: The transcribed text is translated into the desired language.
   Remarks: Pass

4. Summarization
   Expected output: The transcriber generates a concise summary of the transcribed text.
   Actual output: The transcriber produces a concise summary of the transcribed text.
   Remarks: Pass

Table 7.1 presents the analysis of the test cases, offering a comparison between the actual and
expected outputs. Based on this evaluation, it can be concluded that the Transcriber AI system
demonstrated satisfactory performance across the test cases.


CHAPTER 8

SUMMARY
The Transcriber AI represents a pivotal evolution in communication technologies, catering to
the increasing demand for efficient audio processing solutions. Its real-time transcription
capabilities offer unparalleled speed and accuracy, transforming the way audio content is
converted into written text. This innovation is particularly beneficial in scenarios such as live
events, interviews, and lectures, where quick and accurate transcriptions are essential. By
eliminating language barriers through seamless translation, the AI fosters inclusivity and
facilitates cross-cultural communication. Its ability to generate concise summaries of
transcribed content streamlines information processing and decision-making, saving time and
effort. Moreover, the Transcriber AI's adaptability to various domains, including business,
education, and entertainment, makes it a versatile tool for diverse applications. Its user-friendly
interface and intuitive features make it accessible to users of all levels, from individuals to large
organizations. The AI's integration of advanced security measures ensures data privacy and
confidentiality, instilling trust and confidence among users. Continuous updates and
improvements keep the Transcriber AI at the forefront of innovation, meeting the evolving
needs of the digital era. Its impact extends beyond communication, driving efficiency,
productivity, and collaboration across industries. Furthermore, the Transcriber AI's adaptability
to various accents and dialects enhances its utility in diverse linguistic environments. Its robust
performance in noisy or challenging acoustic conditions ensures reliable transcription
outcomes even in less-than-ideal settings. The AI's cloud-based architecture enables seamless
integration with existing workflows and platforms, facilitating scalability and accessibility. Its
ability to generate timestamps and speaker identification enhances the usability and
organization of transcribed content, particularly in multi-speaker scenarios. As the demand for
accurate and efficient transcription solutions continues to grow, the Transcriber AI remains at
the forefront, driving innovation and reshaping the landscape of audio communication.


CHAPTER 9

CONCLUSION AND FUTURE WORK


9.1 CONCLUSION
The primary focus of this project is to design and develop an efficient and user-friendly
Transcriber AI system. The aim is to create a solution that facilitates accurate real-time
transcription, translation, and summarization, addressing communication barriers and
enhancing productivity. This system is suitable for various applications such as meetings,
interviews, lectures, and personal note-taking. The goal is to design the AI in a manner that
meets the diverse requirements of users or organizations seeking transcription services. The
effectiveness and performance of the entire system can be evaluated based on the accuracy of
transcription, translation fidelity, and the coherence of the summarized content.

9.2 FUTURE WORK


In future iterations of the Transcriber AI system, several enhancements could significantly improve its functionality and user experience:

• Online Cloud Storage: Implementing online cloud storage capabilities will enable
users to access transcribed text and related data through an internet connection. This
feature ensures seamless accessibility to transcripts from anywhere, anytime, enhancing
user convenience and flexibility.
• Hardware Controls Integration: Integrating hardware controls, such as buttons, will
streamline user interactions for initiating and terminating transcription processes. This
enhancement simplifies the transcription workflow, making it more efficient and
intuitive for users.
• Transition to Offline Translation Libraries: Transitioning from online to offline translation libraries will enhance user privacy and accessibility by enabling translations directly on the device, without relying on internet connectivity. This improvement ensures that users can translate content securely and efficiently; a brief sketch of this idea follows the list.
• AI-Generated Summarization Algorithms: Introducing AI-generated summarization
algorithms will automate the summarization process, providing users with concise
summaries of transcribed content without the need for manual intervention. This
enhancement saves time and effort for users, improving overall productivity.
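
As a rough sketch of the offline-translation enhancement listed above, the googletrans-based translate_text method could later be replaced by an on-device library. The snippet below is illustrative only; it assumes the open-source Argos Translate package with an English-to-Hindi language model already downloaded and installed, none of which is part of the current implementation:

# Illustrative only: on-device translation (assumes the Argos Translate package and
# a pre-installed English-to-Hindi language model; not part of the current system).
import argostranslate.translate

def offline_translate(text, source_code="en", target_code="hi"):
    # Runs entirely on the device, so no internet connection is needed at translation time.
    return argostranslate.translate.translate(text, source_code, target_code)

print(offline_translate("Welcome to the lecture"))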


REFERENCES

[1] Sadaoki Furui, Tomonori Kikuchi, Yousuke Shinnaka, and Chiori Hori, "Speech-to-Text and Speech-to-Speech Summarization of Spontaneous Speech," IEEE Transactions on Speech and Audio Processing, vol. 12, no. 4, July 2004.

[2] Sridhar Krishna Nemala and Mounya Elhilali, "Multilevel Speech Intelligibility for Robust Speaker Recognition," The Johns Hopkins University, Baltimore.

[3] Xavier Anguera Miro, Simon Bozonnet, Nicholas Evans, Corinne Fredouille, Gerald Friedland, and Oriol Vinyals, "Speaker Diarization: A Review of Recent Research," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 2, February 2012.

[4] Dongmahn Seo, Suhyun Kim, Gyuwon Song, and Seung-gil Hong, "Speech-to-Text-based Life Log System for Smartphones," 2014.

[5] Multiple Authors, "Low-Latency Real-Time Meeting Recognition and Understanding Using Distant Microphones and Omni-Directional Camera," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 2, February 2012.

[6] Siqi Zheng, Weilong Huang, Xianliang Wang, Hongbin Suo, Jinwei Feng, and Zhijie Yan, "A Real-Time Speaker Diarization System Based on Spatial Spectrum," ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing, 2021.

[7] Yu-An Chung, Wei-Hung Weng, Schrasing Tong, and James Glass, "Towards Unsupervised Speech-to-Text Translation," Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge.

[8] Md Ashraful Islam Talukder, Sheikh Abujar, Abu Kaisar Mohammad Masum, Sharmin Akter, and Syed Akhter Hossain, "Comparative Study on Abstractive Text Summarization," IEEE - 49239.

[9] Lili Wan, "Extraction Algorithm of English Text Summarization for English Teaching," 2018 International Conference on Intelligent Transportation, Big Data and Smart City.

[10] Devi Krishnan, Preethi Bharathy, Anagha, and Manju Venugopalan, "A Supervised Approach for Extractive Text Summarization Using Minimal Robust Features," Proceedings of the International Conference on Intelligent Computing and Control Systems (ICICCS 2019).


APPENDIX A
A.1 Source Code
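
The listing below depends on the third-party packages imported at the top of the file, namely PyQt5, vosk, googletrans, sumy, and PyAudio, which are assumed to be installed (for example with pip). It also assumes an offline Vosk speech model extracted into a local folder named model, as referenced by Model("model") in the code, together with the logo animation F9.gif and the sound file sounds/s.wav used by the logo screen.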

import sys  # Import the sys module for interacting with the Python runtime environment.

# Import various classes from PyQt5.QtWidgets for creating the application's graphical user interface.
from PyQt5.QtWidgets import QApplication, QWidget, QVBoxLayout, QHBoxLayout, QTextEdit, QPushButton, QFileDialog, QComboBox, QLabel

# Import classes from PyQt5.QtCore for handling core functionality like threads, signals, and timers.
from PyQt5.QtCore import QThread, pyqtSignal, QTimer, Qt

# Import the Translator class from googletrans for translating text between languages.
from googletrans import Translator

# Import datetime to handle operations on dates and times.
from datetime import datetime

# Import classes for text parsing and tokenization from sumy, a text summarization library.
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer

# Import LexRankSummarizer for text summarization using the LexRank algorithm.
from sumy.summarizers.lex_rank import LexRankSummarizer

# Import QColor, QFont, and QPixmap from PyQt5.QtGui for handling colors, fonts, and images in the application's GUI.
from PyQt5.QtGui import QColor, QFont, QPixmap

# Import QSound from PyQt5.QtMultimedia for playing sound files.
from PyQt5.QtMultimedia import QSound

# Import Model and KaldiRecognizer from vosk for speech recognition capabilities.
from vosk import Model, KaldiRecognizer

# Import json for parsing JSON data, which is commonly used in communication with APIs and data storage.
import json

# Import pyaudio for handling audio streams, necessary for capturing audio input for real-time transcription.
import pyaudio

# Import QMovie from PyQt5.QtGui to handle GIF animations within the application's GUI.
from PyQt5.QtGui import QMovie

# Import QMediaPlayer and QMediaContent from PyQt5.QtMultimedia for handling audio and video media playback.
from PyQt5.QtMultimedia import QMediaPlayer, QMediaContent

# Import QUrl from PyQt5.QtCore to manage URLs, useful for handling media files and web links.
from PyQt5.QtCore import QUrl

# Initialize the PyAudio object.
p = pyaudio.PyAudio()

# Retrieve default host API information.
info = p.get_host_api_info_by_index(0)

# Get the number of audio devices available.
numdevices = info.get('deviceCount')

# Iterate through the number of devices and print the input-capable devices.
for i in range(0, numdevices):
    if (p.get_device_info_by_host_api_device_index(0, i).get('maxInputChannels')) > 0:
        print("Input Device id ", i, " - ", p.get_device_info_by_host_api_device_index(0, i).get('name'))

# Hardcoded device index for audio input.
device_index = 2

# Define a QWidget based class for real-time audio transcription.
class RealTimeTranscription(QWidget):

    def __init__(self):
        super().__init__()
        self.init_ui()
        self.p = pyaudio.PyAudio()
        self.stream = None
        self.model = Model("model")
        self.recognizer = KaldiRecognizer(self.model, 16000)
        self.transcription_running = False
        self.transcription_file = None

    # Initialize the user interface.
    def init_ui(self):
        self.setWindowTitle('Real-Time Transcription')
        self.setGeometry(100, 100, 600, 400)
        self.transcription_text_edit = QTextEdit()
        self.transcription_text_edit.setReadOnly(True)
        self.transcription_text_edit.setMinimumHeight(300)
        self.start_button = QPushButton('Start Transcription')
        self.start_button.clicked.connect(self.start_transcription)
        self.stop_button = QPushButton('Stop Transcription')
        self.stop_button.clicked.connect(self.stop_transcription)
        button_layout = QHBoxLayout()
        button_layout.addWidget(self.start_button)
        button_layout.addWidget(self.stop_button)
        main_layout = QVBoxLayout()
        main_layout.addWidget(self.transcription_text_edit)
        main_layout.addLayout(button_layout)
        self.setLayout(main_layout)

    # Start the transcription process.
    def start_transcription(self):
        if self.transcription_running:
            return
        self.stream = self.p.open(format=pyaudio.paInt16, channels=1, rate=16000, input=True,
                                  input_device_index=device_index, frames_per_buffer=8000)
        self.stream.start_stream()
        self.transcription_running = True
        self.transcription_thread = TranscriptionThread(self.stream, self.recognizer)
        self.transcription_thread.transcription_updated.connect(self.update_transcription)
        self.transcription_thread.start()
        timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
        file_name = f"transcription_{timestamp}.txt"
        self.transcription_file = open(file_name, "a")

    # Update the transcription text edit with new text.
    def update_transcription(self, text):
        self.transcription_text_edit.append(text)
        if self.transcription_file:
            self.transcription_file.write(text)
            self.transcription_file.flush()

    # Stop the transcription process.
    def stop_transcription(self):
        if self.stream:
            self.stream.stop_stream()
            self.stream.close()
        self.transcription_running = False
        self.transcription_thread.quit()
        self.transcription_thread.wait()
        if self.transcription_file:
            self.transcription_file.close()
            self.transcription_file = None

# Define a QThread based class for handling the transcription process in a separate thread.
class TranscriptionThread(QThread):

    transcription_updated = pyqtSignal(str)

    def __init__(self, stream, recognizer):
        super().__init__()
        self.stream = stream
        self.recognizer = recognizer

    # Main run loop for the thread that handles continuous transcription.
    def run(self):
        try:
            while True:
                data = self.stream.read(8000)
                if len(data) == 0:
                    break
                if self.recognizer.AcceptWaveform(data):
                    result = json.loads(self.recognizer.Result())
                    if 'text' in result:
                        self.transcription_updated.emit(result['text'] + "\n")
        except OSError as e:
            print(f"Error in transcription thread: {e}")

# Define a QWidget based class for handling text translation.
class TranslationScreen(QWidget):

    def __init__(self):
        super().__init__()
        self.init_ui()

    # Initialize the user interface.
    def init_ui(self):
        self.setWindowTitle('Translation')
        self.setGeometry(100, 100, 800, 600)
        self.original_text_edit = QTextEdit()
        self.translated_text_edit = QTextEdit()
        self.translated_text_edit.setReadOnly(True)
        self.language_combo_box = QComboBox()
        self.language_combo_box.addItems(['English', 'Spanish', 'French', 'Hindi', 'Kannada'])
        load_button = QPushButton('Load File')
        load_button.clicked.connect(self.load_file)
        translate_button = QPushButton('Translate')
        translate_button.clicked.connect(self.translate_text)
        text_layout = QVBoxLayout()
        text_layout.addWidget(self.original_text_edit)
        text_layout.addWidget(self.translated_text_edit)
        button_layout = QHBoxLayout()
        button_layout.addWidget(load_button)
        button_layout.addWidget(self.language_combo_box)
        button_layout.addWidget(translate_button)
        main_layout = QVBoxLayout()
        main_layout.addLayout(text_layout)
        main_layout.addLayout(button_layout)
        self.setLayout(main_layout)

    # Load text file for translation.
    def load_file(self):
        file_dialog = QFileDialog()
        file_path, _ = file_dialog.getOpenFileName(self, 'Open File', '', 'Text Files (*.txt)')
        if file_path:
            with open(file_path, 'r') as file:
                content = file.read()
            self.original_text_edit.setPlainText(content)

    # Translate text based on selected language.
    def translate_text(self):
        original_text = self.original_text_edit.toPlainText()
        target_language = self.language_combo_box.currentText()
        translator = Translator()
        translated = translator.translate(original_text, dest=target_language.lower())
        self.translated_text_edit.setPlainText(translated.text)

# Define a QWidget based class for text summarization.
class SummarizationScreen(QWidget):

    def __init__(self):
        super().__init__()
        self.init_ui()
        self.num_sentences = 0

    # Initialize the user interface.
    def init_ui(self):
        self.setWindowTitle('Summarization')
        self.setGeometry(100, 100, 600, 400)
        self.original_text_edit = QTextEdit()
        self.summarized_text_edit = QTextEdit()
        self.summarized_text_edit.setReadOnly(True)
        self.load_button = QPushButton('Load File')
        self.load_button.clicked.connect(self.load_file)
        self.summarize_button = QPushButton('Summarize')
        self.summarize_button.clicked.connect(self.summarize_text)
        main_layout = QVBoxLayout()
        main_layout.addWidget(self.original_text_edit)
        main_layout.addWidget(self.load_button)
        main_layout.addWidget(self.summarized_text_edit)
        main_layout.addWidget(self.summarize_button)
        self.setLayout(main_layout)

    # Load text file for summarization.
    def load_file(self):
        file_dialog = QFileDialog()
        file_path, _ = file_dialog.getOpenFileName(self, 'Open File', '', 'Text Files (*.txt)')
        if file_path:
            with open(file_path, 'r') as file:
                lines = file.readlines()
            non_empty_lines = [line for line in lines if line.strip()]
            self.num_sentences = len(non_empty_lines)
            content = "".join(non_empty_lines)
            self.original_text_edit.setPlainText(content)

    # Perform text summarization based on the loaded content.
    def summarize_text(self):
        original_text = self.original_text_edit.toPlainText()
        if self.num_sentences > 0:
            num_sentences_summary = int(self.num_sentences * 0.75)
        else:
            num_sentences_summary = 2
        parser = PlaintextParser.from_string(original_text, Tokenizer("english"))
        summarizer = LexRankSummarizer()
        summary = summarizer(parser.document, num_sentences_summary)
        summarized_text = " ".join(str(sentence) for sentence in summary)
        self.summarized_text_edit.setPlainText(summarized_text)

# Define a QWidget based class for displaying an animated logo and playing a sound.
class LogoScreen(QWidget):

    def __init__(self, main_window):
        super().__init__()
        self.main_window = main_window
        self.init_ui()
        self.setup_animation_and_music()  # Setup for animation and delayed sound

    # Initialize the user interface.
    def init_ui(self):
        self.setWindowTitle('Logo Screen')
        self.setGeometry(100, 100, 600, 400)
        layout = QVBoxLayout(self)

        # Set up QLabel for GIF
        self.logo_label = QLabel(self)
        self.movie = QMovie('F9.gif')  # Ensure the path is correct
        self.logo_label.setMovie(self.movie)
        self.movie.start()  # Start the GIF animation
        self.logo_label.setAlignment(Qt.AlignCenter)
        layout.addWidget(self.logo_label)

        # Additional label styling
        self.transcriber_label = QLabel('Transcriber AI', self)
        self.transcriber_label.setAlignment(Qt.AlignCenter)
        self.transcriber_label.setFont(QFont('Arial', 24))
        self.transcriber_label.setStyleSheet("color: white")
        layout.addWidget(self.transcriber_label)

        # Background color
        self.setAutoFillBackground(True)
        palette = self.palette()
        palette.setColor(self.backgroundRole(), QColor(0, 0, 0))
        self.setPalette(palette)

    # Setup for animation and delayed sound playback.
    def setup_animation_and_music(self):
        # Delay sound start by 2000 milliseconds (2 seconds)
        QTimer.singleShot(2000, self.playSound)
        # Set the timer to transition to the transcription screen after 8 seconds
        QTimer.singleShot(8000, self.transition_to_transcription)

    # Play sound effect.
    def playSound(self):
        # Setup the media player to play sound
        self.player = QMediaPlayer()  # Create a media player object
        url = QUrl.fromLocalFile("sounds/s.wav")  # Adjust the sound path as necessary
        self.player.setMedia(QMediaContent(url))
        self.player.play()  # Play the sound after the 2-second delay

    # Transition to the transcription screen.
    def transition_to_transcription(self):
        self.hide()  # Hide the current screen
        self.main_window.show_transcription()  # Transition to the transcription screen

# Define the main window class which contains all other screens.
class MainWindow(QWidget):

    def __init__(self):
        super().__init__()
        self.init_ui()

    # Initialize the user interface.
    def init_ui(self):
        self.setWindowTitle('Main Window')
        self.setGeometry(100, 100, 600, 400)
        self.logo_screen = LogoScreen(self)
        self.transcription_screen = RealTimeTranscription()
        self.translation_screen = TranslationScreen()
        self.summarization_screen = SummarizationScreen()
        self.transcription_button = QPushButton('Transcription')
        self.transcription_button.clicked.connect(self.show_transcription)
        self.translation_button = QPushButton('Translation')
        self.translation_button.clicked.connect(self.show_translation)  # Ensure method is defined
        self.summarization_button = QPushButton('Summarization')
        self.summarization_button.clicked.connect(self.show_summarization)  # Ensure method is defined
        self.button_layout = QHBoxLayout()
        self.button_layout.addWidget(self.transcription_button)
        self.button_layout.addWidget(self.translation_button)
        self.button_layout.addWidget(self.summarization_button)
        self.main_layout = QVBoxLayout()
        self.main_layout.addLayout(self.button_layout)
        self.main_layout.addWidget(self.logo_screen)
        self.main_layout.addWidget(self.transcription_screen)
        self.main_layout.addWidget(self.translation_screen)
        self.main_layout.addWidget(self.summarization_screen)
        self.setLayout(self.main_layout)
        # Initially hide everything except the logo screen
        self.show_logo()

    # Show transcription screen and hide others.
    def show_transcription(self):
        self.show_buttons()
        self.logo_screen.hide()
        self.transcription_screen.show()
        self.translation_screen.hide()
        self.summarization_screen.hide()

    # Show translation screen and hide others.
    def show_translation(self):
        self.show_buttons()
        self.logo_screen.hide()
        self.transcription_screen.hide()
        self.translation_screen.show()
        self.summarization_screen.hide()

    # Show summarization screen and hide others.
    def show_summarization(self):
        self.show_buttons()
        self.logo_screen.hide()
        self.transcription_screen.hide()
        self.translation_screen.hide()
        self.summarization_screen.show()

    # Initially display the logo screen.
    def show_logo(self):
        self.transcription_button.hide()
        self.translation_button.hide()
        self.summarization_button.hide()
        self.transcription_screen.hide()
        self.translation_screen.hide()
        self.summarization_screen.hide()
        self.logo_screen.show()

    # Show navigation buttons.
    def show_buttons(self):
        self.transcription_button.show()
        self.translation_button.show()
        self.summarization_button.show()

# Entry point of the program.
if __name__ == '__main__':
    app = QApplication(sys.argv)
    window = MainWindow()
    window.show()
    try:
        sys.exit(app.exec_())
    except KeyboardInterrupt:
        print("Application terminated by user.")
