Development of the Speech to Text Chatbot Interface
All content following this page was uploaded by Oleh Basystiuk on 02 May 2022.
Abstract. The paper describes the possibilities offered by open APIs and shows how to
use them to create unified interfaces, using the example of our bot based on the Google
API. In the last decade, AI technologies have become widespread and easy to implement
and use. One of the most promising technologies in the AI field is speech recognition as
part of natural language processing. New speech recognition technologies and methods
will become a central part of future life because they save a lot of communication time
by replacing ordinary texting with voice/audio. In addition, this paper explores the
advantages and disadvantages of well-known chatbots and proposes a method for improving
them. The Rabin-Karp and Knuth-Morris-Pratt algorithms are used. The time complexity of
the proposed algorithm is compared with that of existing ones.
1 Introduction
2 State of the art
After a stage of PR and experiments, the chatbot market has moved to a stage of
technology race and of measuring the effectiveness of the tool. The notion of a "chat bot"
is being replaced by a broader and more complete one: conversational artificial
intelligence (Conversational AI). Companies use it to communicate with customers on
websites and in instant messengers, mobile applications and smart devices.
In order to "understand" speech, the chatbot compares what has been said with the
phrases of a large number of other people on which the bot has been trained. It finds
similarities with the phrase patterns, determines the topic of the question, and performs
the action programmed for this type of question.
A person gets the illusion that the bot understands them: it acts logically, reacts like
a person and maintains a conversation. The famous Turing test is based on this illusion:
if the judge cannot determine whether they are communicating with a bot or a person, the
technology passes the test.
Understanding is a subjective “value” that is difficult to measure. Microsoft conducted
a study of the effectiveness of its speech recognition system and its question answering
system on a given set of tests: in 2017-2018 both were more effective than people,
passing the tests with a minimum of errors.
The main technology of modern bots is natural language understanding (NLU). It allows
the machine to understand users and to extract the parameters needed to process their
requests. All technological solutions related to the processing of natural language are
needed for this.
These include the approach to processing (rule-based, statistical, hybrid), speech
recognition and speech synthesis technologies, and technologies for deploying chatbots in
a company's business processes (cloud or local). This is a huge and very promising
market, which experts predicted would grow to $16 billion by 2021.
The most competitive machine learning technologies on the market are based on the
neural network principle, in which a chatbot is trained on response samples. It is shown
examples of customer phrases and learns to assign such phrases, and words similar to
them, to the right class. The more efficient the algorithm, the fewer examples are needed
to train a bot.
Speech recognition and speech synthesis technologies, natural language understanding
technologies and machine learning algorithms are behind the “conversational spirit”. The
company Just AI is developing Aimylogic, a builder of conversational bots. Among the best
online artificial intelligence (AI) chatbots are Mitsuku, Rose, Poncho, RightClick,
Insomno Bot, Dr. AI and Melody.
1. Mitsuku is one of the best AI bots [2]. It is also the current Loebner Prize laureate.
Unlike other bots, which are built for a specific task, it can talk about anything.
2. Rose is a chatbot that won the Loebner Prize in 2014 and 2015 [3].
3. RightClick launches a program that creates websites. It asks general questions during
the conversation, such as "What is the area of your interests?" and "Why do you want to
create a website?" Based on an analysis of the replies received, the chatbot creates
custom templates. It responds quite adequately to topics unrelated to websites [4].
4. Poncho is a weather specialist. It sends notifications at least twice a day with the
user's consent and is smart enough to answer questions such as "Should I take an umbrella
today?" [5].
5. Insomno Bot is designed for “night owls”, that is, for all people with sleep problems.
This bot can keep up a conversation on any topic [6].
6. Dr. AI asks for symptoms, body parameters and medical history, then lists the most and
least likely causes of the symptoms and sorts them in order of severity [7].
7. Baidu Melody collects medical information about people and then passes it to doctors
in a form that facilitates its use for diagnostic purposes [8].
Nowadays, speech-to-text methods and systems are widely used to solve a variety of
tasks. They are heavily used in the field of artificial assistants as a way to understand
users' wishes. Since many of these speech-to-text technologies are now available through
various APIs and frameworks, creating new software for testing or production on top of
them has become an easy task, as each well-known API provides samples
[https://cloud.google.com/speech-to-text/docs/].
Today, based on the classic speech-to-text methods, there are online solutions such as
APIs and libraries [15], to which you can push information and receive a response in real
time. A classical neural network provides more flexibility, but it requires prepared
training data. In this case it is possible to use open-source audio-to-text data [16],
but collecting one's own data helps to understand users better: what kind of requests
they prefer, long or short, on which topics, and with which words [17]. The information
received by this system based on the Google API will provide answers to all these
questions.
Natural language processing systems rely mainly on the analysis of time series (data
arrays). Time series analysis includes methods of nonlinear analysis and unsupervised
learning. Widely used in this field are RNNs (recurrent neural networks), HMMs (hidden
Markov models) and Bayesian networks. RNNs have become a sophisticated solution for all
kinds of translation tasks because they make it possible to work with inputs and outputs
of different lengths [17].
The papers [16-20] emphasize the need to protect personal data. Security considerations
are well described in [20], since the data need to be stored in a database, which helps
to understand and customize the recognition algorithm.
This paper describes how an interface for a Telegram chatbot was created (in the future
it can easily be connected to Slack, Facebook, etc.). Its main aim is to convert audio
(and, in the future, video) messages into text.
3 The system architecture
Based on the figure above, it becomes clear that the software works as an interface to
the speech-to-text recognition system, so in this case we have some specific design
requirements. There are two main approaches to how a bot for a social network can work.
The first is called polling: at short intervals, the bot sends a request to the social
network server, asking whether there are any new user requests for it. The main advantage
of this design is that a polling-based bot is easier to create: you do not need a fully
built web application that handles HTTP requests. The main disadvantage is that this
method is suitable only for a small application that will not receive a significant
number of messages.
The second is a webhook: a web application is hosted, and the social network servers
redirect all requests to its address. This type of bot is more independent, less
overloaded and handles requests faster. A visual representation of how both design
methods work is given in Fig. 2.
Fig. 2. The simplified representation of webhook and polling technologies
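The polling approach can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `fetch_updates` callable, the `handle` callback and the layout of an update (`update_id`, `message`) are hypothetical and only modelled on the general shape of messenger bot updates. The network call is injected so that the loop can be run without a server.

```python
import time
from typing import Callable, Dict, List


def poll(fetch_updates: Callable[[int], List[Dict]],
         handle: Callable[[Dict], None],
         rounds: int,
         interval: float = 1.0) -> int:
    """Ask the server for new updates `rounds` times; return the next offset.

    fetch_updates(offset) is a hypothetical function that returns every
    update whose update_id is >= offset (e.g. a wrapper around an HTTP call).
    """
    offset = 0
    for round_no in range(rounds):
        for update in fetch_updates(offset):
            if "message" in update:
                handle(update)
            # Acknowledge the update so that it is not delivered again.
            offset = max(offset, update["update_id"] + 1)
        if round_no < rounds - 1:
            time.sleep(interval)  # wait before asking the server again
    return offset


# Usage with an in-memory stand-in for the social network server.
inbox = [{"update_id": 1, "message": {"text": "hi"}},
         {"update_id": 2, "message": {"text": "bye"}}]
seen = []
next_offset = poll(lambda off: [u for u in inbox if u["update_id"] >= off],
                   seen.append, rounds=2, interval=0.0)
```

With a webhook the loop disappears: the server pushes each update to an HTTP endpoint of the bot, so no request is spent on asking when nothing has arrived.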
Fig. 3. Example of the prefix function for the string «abcdabcabcdabcdab»
Hashing [11] is the conversion of an input data array of arbitrary length (an array of
strings) into an output bit string of fixed length, performed by a certain algorithm. A
hash function, or convolution function, is a function that converts input data of any
(usually large) size into fixed-size data. The input data is called the input array, key,
or message. Hash functions are related to checksums, check digits, fingerprints,
randomization functions, error-correcting codes, and ciphers. Although these concepts
overlap to some extent, each one has its own applications and requirements and is
designed and optimized differently.
The algorithm of the chatbot “Hashbot” consists of searching for the keyword (or
keywords) of the conversation and forming the personal ending of the verb. We search for
a keyword by means of hashing. To do this, we use the formula:
$$h(S) = S[0] + S[1] \cdot P + S[2] \cdot P^2 + S[3] \cdot P^3 + \dots + S[N] \cdot P^N \tag{1}$$
The algorithm has two stages:
1. Keyword searching;
2. Verb ending finding.
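Formula (1) and the keyword search stage can be sketched as below. The base `P = 31`, the modulus `MOD` and the function names are assumptions made for illustration, since the paper does not state them. The search follows the textbook Rabin-Karp idea of comparing hashes of text windows with the hash of the keyword, and it verifies each candidate explicitly to rule out hash collisions.

```python
from typing import List

P = 31                 # assumed base; the paper does not give a value
MOD = (1 << 61) - 1    # assumed prime modulus that keeps the hashes bounded


def poly_hash(s: str) -> int:
    """h(S) = S[0] + S[1]*P + S[2]*P^2 + ... as in formula (1), taken modulo MOD."""
    h, power = 0, 1
    for ch in s:
        h = (h + ord(ch) * power) % MOD
        power = (power * P) % MOD
    return h


def find_keyword(text: str, keyword: str) -> List[int]:
    """Return the start indices of `keyword` in `text` using prefix hashes."""
    n, m = len(text), len(keyword)
    if m == 0 or m > n:
        return []
    # prefix[i] is the hash of text[:i]; powers[i] is P^i modulo MOD.
    prefix, powers = [0] * (n + 1), [1] * (n + 1)
    for i, ch in enumerate(text):
        prefix[i + 1] = (prefix[i] + ord(ch) * powers[i]) % MOD
        powers[i + 1] = (powers[i] * P) % MOD
    target = poly_hash(keyword)
    hits = []
    for i in range(n - m + 1):
        # Hash of text[i:i+m], still multiplied by P^i.
        window = (prefix[i + m] - prefix[i]) % MOD
        # Scale the keyword hash by P^i instead of dividing the window hash.
        if window == (target * powers[i]) % MOD and text[i:i + m] == keyword:
            hits.append(i)
    return hits
```

For example, `find_keyword("what is the weather, weather bot?", "weather")` returns `[12, 21]`. Building the prefix hashes takes one pass over the text, and each window is then checked in constant time apart from the final verification of a hit.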
Once the keyword has been found, the personal ending of the verb must be determined. We
look for the verb as the word located immediately before or after the keyword found. We
test the ending using the prefix function of the Knuth-Morris-Pratt algorithm, which
contains no explicit string comparisons and runs in O(n) operations.
Here is the scheme of the algorithm:
1. Compute the values of the prefix function π[i] for i ∈ [1..N-1] (π[0] = 0).
2. To compute the current value π[i], use a variable j that holds the length of the
candidate prefix currently being considered. Initially, j = π[i-1].
3. Test the candidate of length j by comparing the characters s[j] and s[i]. If they are
the same, set π[i] = j + 1 and move on to i + 1. If the characters differ, reduce the
length j by setting it to π[j-1] and repeat this step of the algorithm from the
beginning.
4. If the length j = 0 has been reached and the characters still do not match, stop the
search, set π[i] = 0 and move on to i + 1.
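The four steps above can be sketched in Python as follows. `prefix_function` follows the scheme directly. `ends_with` is a hypothetical helper added only to illustrate how a verb ending could be tested with the prefix function; it assumes that the separator character `\x00` never occurs in the input words.

```python
from typing import List


def prefix_function(s: str) -> List[int]:
    """pi[i] is the length of the longest proper prefix of s[:i+1] that is also its suffix."""
    pi = [0] * len(s)
    for i in range(1, len(s)):
        j = pi[i - 1]                  # step 2: start from the previous value
        while j > 0 and s[i] != s[j]:
            j = pi[j - 1]              # step 3: shrink the candidate on a mismatch
        if s[i] == s[j]:
            j += 1                     # step 3: the characters match, extend by one
        pi[i] = j                      # step 4: j is still 0 if nothing matched
    return pi


def ends_with(word: str, ending: str, sep: str = "\x00") -> bool:
    """Hypothetical helper: test a word ending through the prefix function."""
    if not ending:
        return True
    # The separator stops any match from growing longer than the ending itself.
    pi = prefix_function(ending + sep + word)
    return pi[-1] == len(ending)


# The string from Fig. 3.
values = prefix_function("abcdabcabcdabcdab")
```

For the string from Fig. 3, `values` is `[0, 0, 0, 0, 1, 2, 3, 1, 2, 3, 4, 5, 6, 7, 4, 5, 6]`. The inner loop only ever decreases `j`, which grows by at most one per character, so the whole computation stays linear in the length of the string.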
The general scheme of the proposed method is given in Fig. 4.
4 Results
import io

from google.cloud import speech


def get_text_from_audio(file_name):
    """Transcribes the audio file (written for google-cloud-speech >= 2.0)."""
    client = speech.SpeechClient()
    # Read the raw bytes of the voice message from disk.
    with io.open(file_name, 'rb') as audio_file:
        content = audio_file.read()
    audio = speech.RecognitionAudio(content=content)
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.OGG_OPUS,
        sample_rate_hertz=16000,
        # Assumption: the original listing omits the language code, which the API requires.
        language_code='en-US',
    )
    print("Before recognition")
    response = client.recognize(config=config, audio=audio)
    # Join the best alternative of each recognized segment (the original loop kept only the last one).
    transcripts = [r.alternatives[0].transcript for r in response.results]
    print("Ready to send result")
    return ' '.join(transcripts) if transcripts else 'Seems this file is empty'
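from flask import Flask, render_template, request

app = Flask(__name__)


@app.route('/', methods=['POST', 'GET'])
def index():
    if request.method == 'POST':
        # Webhook updates arrive as JSON in the POST body; tolerate an empty or invalid body.
        fetched_request = request.get_json(silent=True) or {}
        if "message" in fetched_request:
            # send_response_message is defined elsewhere in the project.
            send_response_message(fetched_request)
        return "ok!", 200
    return render_template('index.html')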
The last main point in organizing the interface is communication with the Google API.
There are three possible ways to communicate: client libraries, the gcloud tool, and
command-line requests. For a chatbot, the client library is the most suitable. To make
it work, we need to install the Google client library in the virtual environment, create
a project in the Google Cloud Platform dashboard, generate unique access credentials,
download the private key in JSON format, include it in the project, and set the global
environment variable GOOGLE_APPLICATION_CREDENTIALS to the file path of the private
key [19].
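As a small sanity check before starting the bot, the credential setup described above can be verified in code. This is a sketch under assumptions: the helper name `credentials_problem` is hypothetical, and the check only looks at the `GOOGLE_APPLICATION_CREDENTIALS` variable and at the `private_key` and `client_email` fields that a service-account JSON key normally contains. It does not contact Google.

```python
import json
import os
from typing import Mapping, Optional


def credentials_problem(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return a description of what is wrong with the setup, or None if it looks fine."""
    path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        return "GOOGLE_APPLICATION_CREDENTIALS is not set"
    if not os.path.isfile(path):
        return f"key file not found: {path}"
    try:
        with open(path, encoding="utf-8") as key_file:
            key = json.load(key_file)
    except (OSError, json.JSONDecodeError) as exc:
        return f"key file is not readable JSON: {exc}"
    if not isinstance(key, dict):
        return "key file does not contain a JSON object"
    # Assumed minimal set of fields of a service-account key.
    missing = [field for field in ("private_key", "client_email") if field not in key]
    if missing:
        return "key file lacks fields: " + ", ".join(missing)
    return None
```

Calling `credentials_problem()` at start-up and refusing to run when it returns a message gives a clearer error than a failed recognition request later on.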
Algorithm                 Complexity
Hashbot algorithm         O(|S| + |T|)
Linear search             O((|T| - |S| + 1) * |T|)
KMP search                from O(|S| + |T|) to O((|T| - |S| + 1) * |T|)
LSA                       O(n^2 k^3), n = |S| |T|
SVM with tf-idf scheme    O(|Q| |S| |T|)
5 Conclusions
The paper presents a scalable software solution for collecting audio information and
converting it to text. The program is implemented in Python with the Flask web framework.
Based on the results of the research and after analysis of the collected data set of
audio recordings and texts, a custom model will be developed with the Keras library,
using a recurrent neural network for training. Creating such a custom system will be the
next phase of the research.
After analyzing existing methods, we see that the complexity of the developed Hashbot
algorithm for finding keywords is lower than that of the well-known algorithms. Using the
prefix function to form a bot response makes it possible to work with Cyrillic texts. The
proposed algorithm, which determines the gender of the interlocutor, improves the work
of chatbots. This brings the bot closer to the level of human conversation.
6 References
1. Shevat, Amir. Designing bots: Creating conversational experiences (First ed.). Sebastopol,
CA: O'Reilly Media. ISBN 9781491974827 (2017).
2. Mitsuku: http://www.mitsuku.com/, last accessed 2019/01/21
3. Rose: https://www.robeco.nl/service-contact/index.jsp, last accessed 2019/01/21
4. Right click: https://rightclick.io/#/, last accessed 2019/01/21
5. Poncho: https://poncho.is/, last accessed 2019/01/21
6. Insomnobot: http://insomnobot3000.com/, last accessed 2019/01/21
7. Dr.A.I: https://www.healthtap.com/login?redirect_to=/symptoms, last accessed 2019/01/21
8. Baidu Melody’s: http://research.baidu.com/baidus-melody-ai-powered-conversational-bot-doctors-patients/,
last accessed 2019/01/21
9. Cormen, T. H., Leiserson, C. E., Rivest, R. L., & Stein, C.. Introduction to algorithms. MIT
press (2009).
10. Knuth, D. E., Morris, Jr, J. H., & Pratt, V. R.. Fast pattern matching in strings. SIAM journal
on computing, 6(2), 323-350 (1977)
11. Knuth, D. E. Sorting and Searching, 2nd edn. The Art of Computer Programming, vol. 3
(1998)
12. Shakhovska, N., & Shvorob, I. The method for detecting plagiarism in a collection of
documents. In Scientific and Technical Conference "Computer Sciences and Information
Technologies" (CSIT), 2015 Xth International (pp. 142-145). (2015).
13. Landauer, T. K., McNamara, D. S., Dennis, S., & Kintsch, W. (Eds.). Handbook of latent
semantic analysis. Psychology Press (2013).
14. Aizawa, A. An information-theoretic perspective of tf–idf measures. Information Processing
& Management, 39(1), 45-65 (2003).
15. Shakhovska, N., Medykovskyj, M., Bychkovska, L. Building a smart news annotation sys-
tem for further evaluation of news validity and reliability of their sources. Przeglad Elektro-
techniczny, 91 (7), 43-44 (2015)
16. du Preez, S. J., Lall, M., & Sinha, S. An intelligent web-based voice chatbot. In IEEE
EUROCON 2009 386-391. (2009).
17. Boyko, N., Basystiuk, O., & Shakhovska, N. Performance Evaluation and Comparison of
Software for Face Recognition, Based on Dlib and Opencv Library. In 2018 IEEE Second
International Conference on Data Stream Mining & Processing (DSMP) 478-482 (2018).
18. https://github.com/obasys/harry-bot
19. Cloud Speech-to-Text Documentation, https://cloud.google.com/speech-to-text/docs/
20. Huang, J., Zhou, M., & Yang, D. Extracting Chatbot Knowledge from Online Discussion
Forums. In IJCAI 7, 423-428 (2007).