voice sample (1)
voice sample (1)
Submitted by
SWATHI S
November 2024
SCHOOL OF COMPUTING
THANJAVUR, TAMIL NADU, INDIA – 613 401
SCHOOL OF COMPUTING
Bonafide Certificate
This is to certify that the report titled “Voxmate” submitted as a requirement for the
course, BIN522: PYTHON FOR DATA SCIENCE for M.Tech. Artificial Intelligence and
Data Science programme, is a bona fide record of the work done by Ms. SWATHI S (Reg.
No.126162016) during the academic year 2024-25, in the School of Computing, under my
supervision.
Examiner 1 Examiner 2
ii
SCHOOL OF COMPUTING
Declaration
I declare that the report titled “Voxmate” submitted by me/us is an original work done
semester of the academic year 2024-25, in the School of Computing. The work is original and
wherever I have used materials from other sources, I have given due credit and cited them in the text
of the report. This report has not formed the basis for the award of any degree, diploma, associate-
iii
Acknowledgements
This project was completed successfully with the kind support and help of many individuals.
We would like to extend my sincere thanks to all of them. First and foremost, we would like to
thank the Almighty God for helping us complete the project successfully.
We would like to express our sincere thanks to Dr. S. Vaidhyasubramaniam, Honorable Vice
Chancellor, and Dr. V.S. Shankar Sriraman, Dean, School of Computing, SASTRA
Deemed University, for providing such an opportunity to carry out our project which has
enriched our practical knowledge in research aspects.
We would like to thank our Project Guide, Dr. Ashok Palaniappan, Associate Professor, School
of Chemical and Biotechnology for his diligent guidance, valuable advice, and constant
encouragement throughout the project.
iv
DECLARATION iii
ACKNOWLEDGEMENT iv
LIST OF FIGURES vi
ABSTRACT vii
1 INTRODUCTION 1
1
1.1 INTRODUCTION
2
1.2 BACKGROUND
2 SYSTEM ANALYSIS 4
3 LITERATURE SURVEY 6
4 SYSTEM DESIGN 8
4.1 ER DIAGRAM 8
viii
CHAPTER 1
INTRODUCTION
1.1 INTRODUCTION
Artificial Intelligence when used with machines, it shows us the capability of thinking like
humans. In this, a computer system is designed in such a way that typically requires interaction
from human. Python is an emerging language so it becomes easy to write a script for Voice
Assistant in Python. The instructions for the assistant can be handled as per the requirement of
user. Speech recognition is the Alexa, Siri, etc. In Python there is an API called
SpeechRecognition which allows us to convert speech into text. It was an interesting task to make
my own assistant. It became easier to send emails without typing any word, Searching on Google
without opening the browser, and performing many other daily tasks like playing music, opening
your favorite IDE with the help of a single voice command. In the current scenario, advancement
in technologies are such that they can perform any task with same effectiveness or can say more
effectively than us. By making this project, realized that the concept of AI in every field is
decreasing human effort and saving time. As the voice assistant is using Artificial Intelligence
hence the result that it is providing are highly accurate and efficient. The assistant can help to
reduce human effort and consumes time while performing any task, they removed the concept of
typing completely and behave as another individual to whom we are talking and asking to perform
task. The assistant is no less than a human assistant but we can say that this is more effective and
efficient to perform any task. The libraries and packages used to make this assistant focuses on the
time complexities and reduces time. The functionalities include, It can send emails, It can read
PDF, It can send text on WhatsApp, It can open command prompt, your favorite IDE, notepad
etc., It can play music, It can play Video, It can do Wikipedia searches for you, It can open
websites like Google, YouTube, etc., in a web browser, It can give weather forecast, It can give
desktop reminders of your choice. It can have some basic conversation. Tools and technologies
used are PyCharm IDE for making this project, and created all py files in PyCharm. Along with
this, following modules and libraries in our project. pyttsx3, SpeechRecognition, Datetime,
Wikipedia, Smtplib, twilio, pyjokes, pyPDF2, pyautogui, pyQt5 etc..
These days, Voice Assistants are all the rage. Amazon, Apple, and Google all have thrown
seperate versions into the ring to duke it out in the Smart Home space.
1
Like all technology, it’s easy to make Voice Assistants sound complicated. But when you spend a
few minutes boiling it down, it’s not so complicated at all. In my years as an engineer at both
Amazon and Apple, I’ve found that Voice Assistants aren’t really that complicated. Having an
understanding of what Voice Assistants are and are not will go a long way, as well as learning the
overall advantages of using this technology.
1.2 BACKGROUND
SIRI from Apple
SIRI is personal assistant software that interfaces with the user thru voice interface, recognizes
commands and acts on them. It learns to adapt to user’s speech and thus improves voice
recognition over time. It also tries to converse with the user when it does not identify the user
request. It integrates with calendar, contacts and music library applications on the device and also
integrates with GPS and camera on the device. It uses location, temporal, social and task based
contexts, to personalize the agent behavior specifically to the user at a given point of time.
Supported Tasks
2
ReQall
ReQall is personal assistant software that runs on smartphones running Apple iOS or Google
Android operating system. It helps user to recall notes as well as tasks within a location and time
context. It records user inputs and converts them into commands, and monitors current stack of
user tasks to proactively suggest actions while considering any changes in the environment. It also
presents information based on the context of the user, as well as filter information to the user
based on its learned understanding of the priority of that information.
Supported Tasks
• Reminders
• Email
• Calendar, Google Calendar
• Outlook
• Evernote
• Facebook, LinkedIn
• News Feeds
Drawback
Will take some time to put all of the to-do items in – you could spend more time putting the
entries in than actually doing the revision.
3
CHAPTER -2
SYSTEM ANALYSIS
It was an interesting task to make voice assistant. It became easier to send emails without
typing any word, Searching on Google without opening the browser, and performing many
other daily tasks like playing music, opening your favorite IDE with the help of a single
voice command.
VOXMATE is different from other traditional voice assistants in terms that it is specific to
desktop and user does not need to make account to use this, it does not require any internet
connection while getting the instructions to perform any specific task.
4
The IDE used in this project is PyCharm. All the python files were created in PyCharm and all
the necessary packages were easily installable in this IDE. For this project following modules and
libraries were used i.e. pyttsx3, SpeechRecognition, Datetime, Wikipedia, Smtplib, twilio,
pyjokes, pyPDF2, pyautogui, pyQt etc
5
CHAPTER-3
LITERATURE SURVEY
[2] M. Subi, M. Rajeswari, J. J. Rajan and S. Sri Harshini, "AI-Based Desktop VIZ: A
Voice-Activated Personal Assistant-Futuristic and Sustainable Technology," 2024 10th
International Conference on Communication and Signal Processing (ICCSP),
Melmaruvathur, India, 2024,
Abstract: Modern technology integrates advanced techniques like speech recognition, machine
learning, artificial intelligence, and OpenAI, exemplified by voice-activated personal assistants.
These systems leverage OpenAI to enhance functionality and deliver precise, comprehensive user-
requested data. This AI-powered desktop uses the Python AutoGUI package to automate mouse
control and interact with program windows, improving user interface interaction. Unlike previous
models with basic interfaces, this system combines OpenAI, GUI automation, and speech
recognition for seamless and efficient performance. It serves as a prime example of advanced
technology integration.
[3] J. Vijaya, C. Swati and S. Satya, "Ikigai: Artificial Intelligence-Based Virtual Desktop
Assistant," 2024 IEEE International Conference on Interdisciplinary Approaches in
Technology and Management for Social Innovation (IATMSI), Gwalior, India, 2024.
6
Abstract: The AI Desktop Assistant project aims to create a virtual assistant inspired by cinematic
systems, enabling natural voice interactions for tasks like email, scheduling, and file organization.
To address limitations such as reliance on pre-defined commands, accent recognition, and privacy
concerns, the project will enhance NLP using models like BERT or GPT-3.5, improve adaptive
voice recognition, and prioritize user-centric design. Data from open-source platforms like Kaggle
will refine language understanding, while features like NASA Navigator provide space-related
news. Success will be measured by user satisfaction, task efficiency, and continuous feedback for
improvement.
[4] S. Kumar, S. Patel, Sonam and V. Srivastav, "Voice-Based Virtual Assistant for
Windows (Ziva - AI Companion)," 2024 IEEE International Conference on Computing,
Power and Communication Technologies (IC2PCT), Greater Noida, India, 2024
Abstract: ZIVA is a Python-based desktop assistant designed to execute voice commands,
eliminating the need for manual typing. Inspired by Siri and Cortana, ZIVA uses Natural
Language Processing (NLP) and intelligent voice recognition to interpret user input and perform
tasks like web searches, opening websites, playing music, telling time, and system shutdowns. It
stores and matches voice commands with predefined actions, streamlining everyday operations.
Machine learning techniques analyze user commands to deliver optimal responses, making ZIVA
a versatile tool for enhancing productivity and interaction with local machines.
[5] L. R. Sirisha Munduri and M. Venugopalan, "Leap Motion Based AI Mouse With
Virtual Voice Assistant Support," 2023 3rd International Conference on Mobile Networks
and Wireless Communications (ICMNWC), Tumkur, India, 2023
Abstract: This paper proposes using Leap Motion technology to control computer systems
through hand gestures, offering contactless operation for tasks like presentations and assisting
people with repetitive strain injuries. By integrating a laptop voice assistant, users can launch or
stop the AI-powered virtual mouse with voice commands. The system uses a desktop camera for
mouse functions like clicking and scrolling. Experimental results show an accuracy of 94.6% for
various gestures and multi-handed use, outperforming other state-of-the-art methods, which
achieved 78% accuracy. This approach eliminates the need for additional hardware while
enhancing device control..
7
CHAPTER-4
SYSTEM DESIGN
4.1 ER DIAGRAM
Single user can ask multiple questions. Each question will be given ID to get
recognized along with the query and its corresponding answer. User can also be
having n number of tasks. These should have their own unique id and status i.e. their
current state. A task should also have a priority value and its category whether it is a
parent task or child task of an older task.
4.1.1 ER DIAGRAM
8
4.2 DATA FLOW DIAGRAM
9
4.2.2 DFD Level 2
10
4.3 ACTIVITY DIAGRAM
Initially, the system is in idle mode. As it receives any wake up call it begins
execution. The received command is identified whether it is a questionnaire or a
task to be performed. Specific action is taken accordingly. After the Question is
being answered or the task is being performed, the system waits for another
command. This loop continues unless it receives quit command. At that moment, it
goes back to sleep.
11
4.4 COMPONENT DIAGRAM
The main component here is the Virtual Assistant. It provides two specific service,
executing Task or Answering your question.
12
4.5 USE CASE DIAGRAM
In this project there is only one user. The user queries command to the system.
System then interprets it and fetches answer. The response is sent back to the user.
13
CHAPTER-5
PROPOSED METHODS
o Enhanced NLP can be integrated to process more complex queries using libraries
like spaCy or transformer models.
o File I/O: Use Python’s open() to store tasks and alarm times in text files.
o The system can handle alarms by storing and processing time-based inputs.
o For recurring tasks, use the schedule library to trigger actions at set times
o Display results via voice feedback and console output using speak() method.
5.5. Translation
o Use speech recognition to listen for phrases like "Translate 'hello' to Spanish".
o The result will be spoken back to the user using the pyttsx3 library.
15
Method for Application Control:
The assistant can launch or close applications using system calls with pyautogui and os
modules. The assistant listens for commands like "Open Chrome" or "Close Firefox" and
triggers system-level events to open or close the applications.
Proposed Method:
o Implement keyword parsing to detect application names and send commands to the
system.
o The assistant can also trigger the camera app on the system, instructing the user to
"smile" before taking the photo.
16
5.9. Security and Privacy
o Store passwords securely in a text file (consider encrypting this for added security).
o Implement retry attempts for security, exiting the program after multiple failed
attempts.
17
CHAPTER-6
The speech recognition module, implemented using the SpeechRecognition library and Google’s
Speech-to-Text API, demonstrated the following effects:
• Recognition Performance:
o The module achieved over 90% accuracy in quiet environments and moderately
noisy settings.
o The assistant performed well with predefined commands like "open Chrome" or
"check internet speed."
• Discussion:
To improve accuracy further, integrating advanced NLP models like BERT or Whisper AI
could help in understanding context and handling accents or variations in speech patterns.
The modular approach to handling tasks (e.g., alarms, media control, and browser automation)
yielded the following results:
• System Performance:
18
o More complex operations, such as browser automation using Selenium, took
slightly longer (about 2–3 seconds) due to browser load times.
• User Satisfaction:
o Users appreciated the modular design, which allowed seamless integration of new
features without disrupting existing functionalities.
• Discussion:
Enhancements like parallel task execution and asynchronous processing could reduce
delays. Additionally, refining browser automation scripts for optimized performance is
crucial.
While the assistant performed effectively overall, the following challenges were noted:
o The local processing approach successfully maintained user privacy but limited the
assistant’s ability to leverage cloud-based computational resources for advanced
tasks.
o Features like media control and internet speed testing were executed with 100%
success, but user feedback indicated that adding context-aware responses (e.g.,
recommending improvements for slow internet) would enhance usability.
19
CHAPTER-7
7.1 CONCLUSION
The development of this desktop voice assistant demonstrates the power of modern technologies
in creating intelligent, interactive systems that can simplify daily tasks. Through the integration of
multiple modules such as speech recognition, task automation, media control, multilingual
support, and system commands, the assistant has proven capable of performing a wide array of
functions including opening applications, scheduling alarms, checking internet speed, controlling
multimedia, and more.
• Privacy-Focused Design: By processing all data locally, the assistant ensures user data is
kept private and secure, which is in line with recent research advocating for privacy-
preserving technologies.
• Modular and Scalable Architecture: The assistant is designed with a modular approach,
allowing for easy addition of new features without affecting the existing functionality. This
is in line with best practices for scalable system design.
• Multilingual Capabilities: The system can handle basic multilingual translation tasks,
enhancing its accessibility to a broader audience
Through the use of popular libraries such as PyAutoGUI, Selenium, and Pyttsx3, the
assistant is able to perform complex tasks such as controlling YouTube, taking screenshots,
and managing system applications.
20
7.2 Future Work
While the system has achieved the core functionality intended, there are several areas for
improvement and potential extensions for future work:
1. Enhanced Natural Language Understanding (NLU):
One of the primary areas of improvement lies in natural language understanding. The
current implementation processes commands based on pre-defined keywords. Future work
could integrate more advanced machine learning models (e.g., BERT, GPT) to allow the
assistant to better handle more complex, context-sensitive interactions. This would
enhance its ability to understand nuanced commands and improve user experience.
2. Contextual Awareness and Memory:
The assistant could benefit from contextual awareness—the ability to retain and recall
past conversations or actions. This can be achieved by integrating a knowledge base or
state-tracking mechanism, allowing the assistant to handle multi-turn conversations more
fluidly.
3. Cross-Platform Compatibility:
Currently, the assistant is tailored for a desktop environment. Future versions could be
made cross-platform (Windows, macOS, Linux) to broaden its accessibility. Additionally,
developing mobile versions for iOS and Android would further extend the assistant's reach.
4. Cloud Integration for Advanced Features:
While local processing ensures privacy, some advanced features such as real-time news
updates, weather, and complex calculations could benefit from cloud-based services. This
would allow the assistant to process heavy tasks that require extensive computational
resources without overloading the local machine
21
REFERENCES
[1] Roro Ayu Fasha Dewatri et al., "Potential Tools to Support Learning: OpenAI and Elevenlabs
Integration", Southeast Asian Journal on Open and Distance Learning, vol. 1, no. 02, 2023.
[2] Rui Yang et al., "Large language models in health care: Development applications and
challenges", Health Care Science, vol. 2, no. 2023, pp. 255-263.
[3] Ms G. Pydi Prasanna Kumari and Mrs P. Pavithra, "ChatGPT Integrated With Voice
Assistant", Journal of Engineering Sciences, vol. 15, no. 02, 2024.
[4] Rathore Bharati, "Future of AI & generation alpha: ChatGPT beyond boundaries", Eduzone:
International Peer Reviewed/Refereed Multidisciplinary Journal, vol. 12, no. 2023, pp. 63-68.
[5] Shashi Kant Singh, Shubham Kumar, Pawan Singh and Mehra, "Chat GPT & Google Bard
AI: A Review", 2023 International Conference on IoT Communication and Automation
Technology (ICICAT), 2023.
[6] Berşe Soner et al., "The role and potential contributions of the artificial intelligence language
model ChatGPT", Annals of Biomedical Engineering, vol. 52, no. 2024, pp. 130-133.
[7] Sai H. Vemprala et al., "Chatgpt for robotics: Design principles and model abilities", IEEE
Access, 2024.
[8] Naoki Wake et al., "Chatgpt empowered long-step robot control in various environments: A
case application", IEEE Access, 2023.
[10] T.B. Brown et al., "’Language models are few-shot learners", Proc. Adv. Neur. Inf. Process.
Sys, vol. 33, pp. 1877-1901, 2020.
22
APPENDIX- A
SAMPLE CODE
import pyttsx3
import speech_recognition
import requests
from bs4 import BeautifulSoup
import os
import datetime
import pyautogui
import random
import webbrowser
from plyer import notification
from pygame import mixer
import time
import speedtest
for i in range(3):
a = input("Enter Password to open Jarvis :- ")
pw_file = open("password.txt","r")
pw = pw_file.read()
pw_file.close()
if (a==pw):
print("WELCOME SIR ! PLZ SPEAK [WAKE UP] TO LOAD ME UP")
break
elif (i==2 and a!=pw):
exit()
elif (a!=pw):
print("Try Again")
engine = pyttsx3.init("sapi5")
voices = engine.getProperty('voices')
engine.setProperty('voice', voices[0].id)
engine.setProperty('rate', 175)
def speak(audio):
engine.say(audio)
engine.runAndWait()
def takeCommand():
r = speech_recognition.Recognizer()
with speech_recognition.Microphone() as source:
print("Listening...")
r.pause_threshold=1
r.energy_threshold=300
audio=r.listen(source,0,4)
try:
print("understanding.....")
query= r.recognize_google(audio,language='en-in')
print(f"You said: {query}")
except Exception as e:
print("Say that again please...")
return "None"
return query
def alarm(query):
timehere = open("Alaramtext.txt","a")
timehere.write(query)
timehere.close()
os.startfile("alarm.py")
if __name__ == "__main__":
while True:
query = takeCommand().lower()
if "wake up" in query:
from GreetMe import greetMe
greetMe()
while True:
query = takeCommand().lower()
if "go to sleep" in query:
speak("OK sir,you can me call anytime")
break
im = pyautogui.screenshot()
im.save("ss.jpg")
elif "click my photo" in query:
pyautogui.press("super")
pyautogui.typewrite("camera",interval=0.1)
time.sleep(2)
pyautogui.press("enter")
pyautogui.sleep(4)
speak("SMILE")
pyautogui.press("enter")
elif "translate" in query:
from Translator import translategl