
South Eastern Kenya University

School of Information Communication and Technology

TOPIC: INFORMATION RETRIEVAL USING DEEP LEARNING

(SUMMAQA)

By

WAINAINA SOSPETER GATHUNGU

G126/0606/2018

SUPERVISOR: CASPER SHIKALI

A Research Project Report Submitted to the School of Science and Computing in partial fulfillment
of the requirements for the award of the degree of Bachelor of Science in Computer Science

November, 2021
DECLARATION
I hereby solemnly declare that the proposal entitled "Information Retrieval Using
Transformers" is based on my own work, carried out during the course of my study under the
supervision of Mr. Casper Shikali.

The work has not been submitted for any other degree, diploma, or certificate in this
university or any other institution.
DEDICATION & ACKNOWLEDGEMENTS
This project is especially dedicated to the lecturers who have continually helped and guided me to
successfully complete this project work. I would like to give special thanks to my supervisor, Mr.
Casper Shikali, who spared his valuable time to direct me, ensured that I had the materials relevant to
my research, and encouraged me to keep building and improving the project.
I also dedicate this project to my family and friends, who have continuously supported me and
shared their ideas; working with them has been a pleasure.
TABLE OF CONTENT
DECLARATION 2
DEDICATION & ACKNOWLEDGEMENTS 3
ABSTRACT 7

CHAPTER ONE 8

INTRODUCTION 8
1.1 Background of the Study 8
1.1.1 Complexity of Information in the Current Times 8
1.2 Statement of the Problem 9
1.3 Objectives of the Study 10
1.4 Research Questions 10
1.5 Justification of the Study 10
1.6 Scope of the Study 10

CHAPTER TWO 11
LITERATURE REVIEW 11
2.1 Introduction 11
2.2 Question Answering 11
2.3 Summarization 13
2.4 BERT 14
2.5 Conceptual Framework 14
2.6 Related Work 16
2.6.1 Related methods 17
Transformers 17
2.7 Gaps in the Literature 17
2.8 Approach 18

CHAPTER THREE 19
RESEARCH DESIGN AND METHODOLOGY 19
3.1 Introduction 19
3.2 Research Design 19
3.3 Target Population 19
3.4 Sampling Design 20
3.5 Data collection Techniques 20
3.6 Data analysis methods 21
3.7 Development Methodology 21
3.8 Technology for Development 21
3.8.1 Front-end / Client-Side 21
3.8.2 React 21
3.8.3 CSS 22
3.8.4 Server-Side/ Back-end 22
3.8.5 FastAPI 22

CHAPTER 4: 24

DATA ANALYSIS, INTERPRETATION AND SYSTEM DESIGN 24


4.0 Introduction to Data Analysis 24
4.1 Presentation of Data Analysis 24
4.1.1 Dataset Description: What is SQuAD? 24
4.1.3 Data Cleaning 24
ii) Tokenization 24
4.1.4 Data Visualization and Exploration 25
4.3 Summary of Findings 25
4.3.1 Description of the existing system 25
4.3.2 Description of the proposed system 25
4.3.3 Functional Requirements 25
4.3.3.2 Non-functional Requirements 25
4.3.3.2.1 Software Requirements 26
4.4 System Design 26
4.4.1 Flow chart 26
4.5 System Implementation and Testing 26
4.5.1 Creating, Testing and Training Datasets 26
Summary of the Dataset 26
Sample of dataset structure 27
Data Fields 27
4.5.1.2 Training and Evaluating the Question Answering System 27
4.5.1.2.1 Model and Tokenizer Initialization, Pipeline and Prediction 28
EVALUATING THE FINE-TUNED MODEL 31

CHAPTER 5 35

SUMMARY OF FINDINGS, CONCLUSIONS AND RECOMMENDATIONS 35


5.1 Introduction 35
5.2 Summary of Analysis 35
5.3 QA Metrics 35
5.3.1 Exact Match 35
5.4 Model Deployment 40
REFERENCES 41
APPENDIX A: PROJECT BUDGET 42
APPENDIX D: SOFTWARE REQUIREMENTS 44

LIST OF FIGURES
Figure 1 11

Figure 2 14

Figure 3 15

Figure 4 25

Figure 5 26

Figure 6 27

Figure 7 29

Figure 8 31

Figure 9 45

Figure 10 47

LIST OF TABLES
Table 1 23
LIST OF ABBREVIATIONS

SQuAD - Stanford Question Answering Dataset


QA - Question Answering
NLP - Natural Language Processing
ABSTRACT
The amount of data available and consumed by people globally is growing day by day; the
estimated amount of data created on the internet each day stands at 1.145 trillion MB. To reduce mental
fatigue and increase the general ability to gain insight into complex texts or documents, I have developed
an application to aid in this task. Benefits of the software include saving time, increasing
productivity, and ensuring that all the important aspects of a document are covered. The application has a
user-friendly GUI which allows users to easily upload documents and ask domain-specific questions based on
the document that they have uploaded. A summarized version of each document is presented to the user,
which further helps them understand the document and guides them towards the types of
questions that are relevant to ask the model. The web application gives users flexibility in the
types of documents it can process, stores no user data, and uses state-of-the-art models
for its summaries and answers. The result is an application that yields near human-level intuition for
answering questions in certain isolated cases, such as Wikipedia and news articles, as well as some
scientific texts. The application's reliability and accuracy decrease with the complexity of the subject
(i.e. if the subject is too complex for the model) and with the number of words in the document:
accuracy drops as documents grow past 3,000 words, and grammatical inconsistency in the questions
further degrades the predictions. These are all aspects that can be improved further.
CHAPTER ONE

INTRODUCTION
1.1 Background of the Study

In this section, I detail the context and background surrounding the project. I explain why
computerized solutions to text and natural language processing are needed, how they were historically
achieved, and some of the present techniques.

1.1.1 Complexity of Information in the Current Times


In current times technology permeates everything: we are surrounded by massive amounts of
information in the form of documents, images, blogs, websites, and more. Humans naturally
avoid lengthy information; this is why we find ourselves not reading the terms and conditions
when subscribing to something online, even though they are very important, and instead look for a
direct answer rather than reading lengthy documents in their entirety.
Tools like search engines are used to scour the internet; other routes to information include links
and social media, which can give the user access to a wide range of subjects.
For complicated subjects, however, like medicine or political analysis, relevant articles on the
internet might prove challenging for a user to read. If a text contains rare, difficult words and long,
complicated sentences, its readability suffers; readability depends on the content of the
information, i.e. the complexity of its vocabulary and syntax. Readability is commonly defined as the ease with
which a reader can understand written text and can be measured by many factors; most relevant to this
project are the speed of reading, speed of perception, and fatigue in reading. Readability affects consumer
information as well. For example, there are legal documents, like the terms of service of various private
companies, that consumers are required to read and accept. These documents are typically long, so most
consumers do not see it as worth their time to read and understand what they are actually
agreeing to. You have probably encountered such a situation and agreed without even reading the
terms. This opens the question of whether technology could help present this
information in a way where fewer consumers will ignore the details of these agreements. Google has used
machine learning for natural language processing in language translation, very successfully: Google
Translate handles a total of 108 languages, and the Google Assistant can recognise speech, with new
features being added by the day. This project investigates the possibility of utilizing similar methods, but for
reducing the complexity of text to make information more easily accessible and readable.

1.2 Statement of the Problem


The amount of data produced and consumed by people all over the globe is
growing, but our ability to process it is not. Coupled with the fact that documents are large and may
also be complex, this can make finding information very laborious. It has become more complicated to find
the required information in a mass of text. Although it is said that information is power, if we cannot find
the required information it is likely to cause confusion. To help people who may or may not have a
background in the field they are researching, I have developed an application which can
summarise a corpus of text about a certain field, e.g. medicine, which in
turn helps the consumer gain a better understanding of the data and an idea of the
questions which he/she can ask.
The web application increases productivity: it first assists the user by summarizing the uploaded document,
which the user can then use to formulate questions about the content of the document that the user
has uploaded and get an answer immediately.
My application brings something new to the user: automatic text summarization embedded
in the system as a function, so that the module need only be imported and given the path to the
documents to be summarized.

1.3 Objectives of the Study


1. To create a model that will summarize a corpus of text.
2. To integrate the automatic text summarization model with the question answering system.
3. To evaluate the model and its accuracy in answering questions.
4. To design a UI that is intuitive and easy for the user to use.

1.4 Research Questions


1. How can one ensure that a user asks questions in a structured manner in order to receive expected
and reliable responses?

2. Will the system be able to retrieve the documents that the user has uploaded?
3. Will the text which has been summarized by the system retain its meaning?
4. Will the system be able to respond with the correct answers phrased in the right way?

1.5 Justification of the Study


Based on the problem description, the application needed to be divided into two parts. The application's
first component is a question-answering model, which allows the user to ask a question of a document and
receive an answer.
The abstractive summarizing model, which provides a summary of the material presented, is the second
portion of the application. By integrating a summarizing model, I can present the user with a summary
of the document and its most significant features. This can then assist the user in swiftly gaining
insight into the material and learning which questions could be useful to ask to get the best
performance from the model.
The web application guides the user on the questions which he/she can ask and answers them
directly. Combining summarization and question answering in a single web application sets it apart
from existing tools.

1.6 Scope of the Study


The SummaQA application is a two-in-one application. It summarises a corpus of text uploaded to it, which
gives the user an idea of the questions he/she can ask the question answering model. The
application comprises an import section which the user uses to import the files to be
summarized, and a QA window through which the user asks questions and gets answers.
CHAPTER TWO
LITERATURE REVIEW

2.1 Introduction
While there are vast numbers of scholarly articles on both question answering systems and automatic text
summarization, there is still a wide research area, especially involving Natural Language Understanding,
since it is an emerging field. However, there is little research that explores text summarization
combined with question answering systems, which is what this project tries to explore.

2.2 Question Answering

Question answering (QA) systems come out as powerful platforms, well known for
automatically answering questions posed in natural language. Imagine talking to a computer
just as you would to a human; chatbots are a good example. Information exists in large amounts, and
even when reading documents we often skim, looking for a direct answer, which becomes
tedious when the document is lengthy and uses complex syntax. Question answering systems are used
for this purpose: they scan through a corpus of documents and provide the relevant answer or
paragraph. All this is part of the computer science disciplines of information retrieval and NLP, which
focus on building systems that automatically extract answers to questions posed by humans or machines
in natural language.
Turning to the history of QA systems: two of the earliest question answering systems,
BASEBALL and LUNAR, were popular because of their core database or information system.
BASEBALL was built to answer questions about American League baseball over a one-year cycle.
LUNAR, on the other hand, was built to answer questions related to the geological analysis of lunar rocks,
based on data collected from the Apollo moon missions. Some of the advanced question answering systems
of the modern world are Apple Siri, Amazon Alexa, and Google Assistant, which most of us interact with.
To understand the question answering subject, we need to define associated terms. A Question Phrase is
the part of the question that says what is being searched for. Question Type refers to the
categorization of the question by its purpose. In the literature, Answer Type refers to the class of
objects sought by the question. Question Focus is the property or entity being searched for by the
question. Question Topic is the object or event that the question is about. A Candidate Passage can broadly
be defined as anything from a sentence to a document retrieved by a search engine in response to a
question. A Candidate Answer is text ranked according to its suitability as an answer. Previous studies
mostly define the architecture of question answering systems in three macro modules: Question
Processing, Document Processing, and Answer Processing, as shown below:

Figure 1

Question Processing receives the input from the user, a question in natural language, and
classifies and analyzes it. The analysis determines the type of question, meaning its focus,
which is necessary to avoid ambiguities in the answer produced.

Types of QA systems

Question answering (QA) systems are broadly divided into two categories: open-domain question
answering (ODQA) systems and closed-domain question answering (CDQA) systems.

Open domain: questions can come from any domain, such as healthcare, sports, IT, and many more;
the key concept is that the system is not confined to any field. An example is
DeepPavlov, which uses a large dataset from Wikipedia as its source of knowledge.

Closed domain: the system deals with questions in one particular domain; for example, a healthcare QA
system cannot answer IT-related questions.

2.3 Summarization

The internet is a vast source of electronic information, and retrieving information from it
is a laborious task for humans. Hence automatic summarization came into use: it automatically
condenses the content of documents and, in the process, saves precious time. Luhn was the first to
propose automatic summarization of text, in 1958, and the NLP community went on to develop
summarization as a subfield. Radev describes summarization as processing one or more documents
to produce a short summary that is smaller than the original documents. The requirements of
summarization include:

● The information should not lose relevance.

● The summary must keep the most important information that makes up the context of the original.

This brings us to the question: what is automatic text summarization? It is the process of taking a
sequence of words and reducing their number, while retaining the most essential information from
the original context. Approaches to summarization are divided into extraction-based and
abstraction-based:

• In extractive summarization, the main aim is to summarize the corpus using only words provided in the
body of the text.

• In abstractive summarization, the model's goal is to learn the inherent language representation, making
summaries more like those a human would write, i.e. using its own choice of words.
Extractive summarization has historically been the more extensively researched of the two, since it is
considered a simpler problem to solve.
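A minimal abstractive summarization sketch using the Hugging Face pipeline API (the default
checkpoint the pipeline downloads is an assumption here, not necessarily the model used in SUMMAQA):

from transformers import pipeline

# Minimal abstractive summarization sketch; the pipeline's default
# summarization checkpoint is an assumption, not SUMMAQA's own model.
summarizer = pipeline("summarization")

text = ("Automatic text summarization is the process of taking a sequence of "
        "words and reducing the number of words, while retaining the most "
        "essential information from the original context.")
summary = summarizer(text, max_length=40, min_length=10, do_sample=False)
print(summary[0]["summary_text"])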

2.4 BERT

In this project I use the language representation model BERT, which stands for Bidirectional Encoder
Representations from Transformers. BERT is designed to pretrain deep bidirectional representations from
unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained
BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a
wide range of tasks, such as question answering and language inference, without substantial task-specific
architecture modifications.

Since the introduction of this language representation model, the field of NLP has been revolutionized;
remarkably, BERT surpassed all previous models up to that point across a wide variety of tasks.
BERT builds on a model architecture called the Transformer, which allows the model to learn entire
sentences at a time instead of sequences of words. It also allows the network to learn how sentences and
languages are constructed, based on the context of the surrounding words. The Transformer model, and
subsequently BERT, is based on an encoder-decoder structure instead of a recurrent structure, and uses a
concept called attention.

Attention is a metric assigned to each word in a sentence. For each prediction it represents how important
each word is and which words should be emphasized more than others. This allows transformer models
such as BERT to learn the context of words and sentences based on the surrounding words.

2.5 Conceptual Framework

BERT extracts tokens from the question and the passage and combines them as a single input. As
mentioned earlier, the input starts with a [CLS] token that indicates the start of a sequence and uses a [SEP]
separator between the question and the passage. Along with the [SEP] token, BERT also uses segment
embeddings to differentiate the question from the passage that contains the answer: BERT creates
two segment embeddings, one for the question and one for the passage. These embeddings are then added
to a one-hot representation of the tokens to segregate question from passage, as shown.

Next, we pass the combined embedded representation of the question and passage as input to the BERT model.
The last hidden layer of BERT is then passed through a softmax to generate probability distributions
for the start and end indices over the input text; together these define the substring that is the answer, as
shown. A short code sketch of this step follows.
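This is a minimal sketch of that span-extraction step, assuming the public SQuAD-fine-tuned BERT
checkpoint named below (SUMMAQA's own fine-tuned weights would load the same way):

import torch
from transformers import AutoTokenizer, BertForQuestionAnswering

name = "bert-large-uncased-whole-word-masking-finetuned-squad"
tokenizer = AutoTokenizer.from_pretrained(name)
model = BertForQuestionAnswering.from_pretrained(name)

question = "Where was Queen formed?"
passage = "Queen are a British rock band formed in London in 1970."

# The tokenizer inserts [CLS] and [SEP] and builds the segment
# (token type) ids that separate question from passage.
inputs = tokenizer(question, passage, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Softmax over the start/end logits gives the two probability distributions;
# argmax picks the most likely start and end token indices of the answer.
start = torch.argmax(outputs.start_logits)
end = torch.argmax(outputs.end_logits) + 1
print(tokenizer.decode(inputs["input_ids"][0][start:end]))  # e.g. "london"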
QA system diagram

Figure 2

Summarization flow diagram

Figure 3
2.6 Related Work
In this section I discuss some related projects and works, and how they compare to the
application that I am developing. I also discuss some of the technologies that my application is based
on.

In my research I found many applications related to my project, a good example being
Watson, developed and built by IBM. It was developed to answer questions on the popular
TV show Jeopardy! but has grown into a general-purpose QA machine utilized in many fields, like
economics and healthcare. However, for the application to take on similar but lower-level tasks, the QA
system needed to be scaled down, and the speed, size, and complexity of IBM Watson are not achievable
under my resource limitations. I also could not use Watson directly, since it is built from proprietary
components that would not allow me to make modifications. This project could potentially serve as an
open-source alternative to Watson in the future.

2.6.1 Related methods

In this section, I go over related deep learning approaches and architectures for handling problems
like question answering and summarization, starting with the fundamental architectures and
models and then showing how they are typically used to solve NLP problems.

Transformers

In natural language processing, the Transformer is a unique design that seeks to solve
sequence-to-sequence tasks while also resolving long-range dependencies. The Transformer neural
network architecture was proposed by Ashish Vaswani et al. in the paper "Attention Is All You Need"
and brought significant improvements, including the following:
1. It allows learning entire sentences instead of sequences of words.
2. It allows models to be trained in parallel.
3. The model learns to distinguish the words in a sentence based on their context.
This has made transformers widely used, especially in the field of NLP.
Hugging Face has released a Python module with various pre-trained transformer models taken from a
number of recent research papers. These transformer models are deep learning models designed for general
language understanding that may be retrained via transfer learning to handle specific challenges. I
customized parts of Hugging Face's pre-trained NLP models, particularly a model titled BERT.
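As a sketch of how little code these pre-trained models require, the question-answering pipeline from
the module can be used as follows (the checkpoint name is an illustrative assumption):

from transformers import pipeline

# The distilbert SQuAD checkpoint here is an assumption for illustration.
qa = pipeline("question-answering",
              model="distilbert-base-cased-distilled-squad")

result = qa(question="Who maintains React?",
            context="React was developed by Facebook and is maintained by "
                    "Facebook, Instagram, and community developers.")
print(result["answer"], result["score"])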

2.7 Gaps in the Literature

1. Deviation from the question asked by the user, hence returning an answer that makes no sense
for the question posed.
2. The user may not know the appropriate question to ask the model, which decreases usability.

2.8 Approach

1. The BERT-based model is trained on the SQuAD v1 dataset and is hence conversant with natural
language.
SQuAD dataset: you might be wondering what type of challenge the SQuAD dataset poses. It is a dataset
which tests a model's ability to read a passage of text and answer questions about it.
2. The user is guided through the process: when the user imports a file of text to query,
the application first summarizes the text, and in the process the user gains a better understanding of the
context and of the type of questions which he/she can ask to get the most out of the system.
CHAPTER THREE
RESEARCH DESIGN AND METHODOLOGY

3.1 Introduction
This chapter outlines the stages of the research. It explains the research design, the target
population, the sampling procedures used to obtain a representative sample, the data collection
procedures and instruments, how the validity and reliability of the data collection instruments
were tested, and finally how the data was analyzed.

3.2 Research Design

Before anything else we must first understand what research design is. It is the overarching method you
adopt to combine the various components of the study in a logical and cohesive manner, ensuring that you
will effectively address the research problem; it is the blueprint for data collection, measurement, and
analysis. The type of design chosen should normally be determined by the research challenge. My
research on question answering and automatic text summarization requires a qualitative research design.
A qualitative research design is concerned with establishing answers to the whys and hows of the
phenomenon in question: How can we make machines understand natural language? How can we make
them reply in a way that is indistinguishable from a human? How can we make them better? Because of this,
qualitative research is often described as subjective rather than objective, and findings are gathered in
written rather than numerical form. This means that the data collected in qualitative
research cannot usually be analyzed in a quantifiable way using statistical techniques, unlike in
quantitative research.

3.3 Target Population

The target population is the group of individuals that the intervention intends to conduct research
on and draw conclusions from.

Here the target population is the SQuAD dataset on which the fine-tuned BERT model has been
trained. The SQuAD dataset is a collection of question-answer pairs derived from Wikipedia articles. In
SQuAD, the correct answer to a question can be any sequence of tokens in the given text. Because the
questions and answers are produced by humans through crowdsourcing, it is more diverse than some
other question-answering datasets.

3.4 Sampling Design

In research terms, a sample is a group of people, objects, or items taken from a larger
population for measurement. The sample should be representative of the population to ensure
that findings from the research sample can be generalized to the population as a whole.

I will use the SQuAD 2.0 dataset, which builds on SQuAD v1, to evaluate the model
and test its performance.

3.5 Data collection Techniques

In my area of research, especially the field of Natural Language Processing, there is a vast amount of
data and many datasets. Since I will not be training the deep learning model from scratch, I will be
using a model which has been pre-trained on the SQuAD v1 dataset. Since the pre-trained model may
not be enough, I will use transfer learning to train it further and attain higher accuracy.

Why a pretrained BERT model? I will be using an already available fine-tuned BERT model from the
Hugging Face Transformers library. BERT can better understand long queries and as a result surface
more appropriate results.

Transfer learning is a machine learning technique in which we reuse a previously
trained model as the basis for a model on a different task. Simply said, a model trained on one task is
repurposed on a second, similar task, as an optimization that allows for faster modeling progress on the
second task; with just a small amount of data we can achieve greater accuracy.
3.6 Data analysis methods
Data analysis is the process of working on data with the purpose of arranging it correctly,
explaining it, making it presentable, and drawing conclusions from it. It is done to find
useful information in data so as to make rational decisions.
It is the systematic application of statistical and logical techniques to describe the data's scope,
modularize its structure, condense its representation, illustrate it via images, tables, and
graphs, and evaluate statistical trends and probabilities in order to derive meaningful conclusions.
Data analysis tools are series of charts and diagrams designed to interpret and
present data for a wide range of applications. My project, however, does not entail separate data
collection or analysis, since I am using a model pretrained with the Hugging Face library.

3.7 Development Methodology


In this section, I describe the components of my application, as well as the tools, frameworks,
and languages that were used and why. A later section goes over the structure of the application
and its interactions in greater detail.
The application is built using a client-server architecture. The diagram below depicts the
frameworks and libraries used for the various sections. For the sake of brevity, not all libraries
used, such as those for handling files and paths in Python, are depicted. In each section, I go
over which frameworks and libraries are used in the application and why.

3.8 Technology for Development


3.8.1 Front-end / Client-Side
As previously stated, I decided to develop a web application that communicates with a server
(client-server model) to retrieve data, such as requesting summaries and/or answers to questions about
specific documents. Below I go over the application's client side in detail and the frameworks that were used.
3.8.2 React

React is an open-source JavaScript library that was developed by Facebook and is currently maintained by
Facebook, Instagram, and community developers. To display views, which are rendered to the screen,
React employs a component-based system. Components are specified as custom HTML tags, which
makes them easy to use because they can be reused not only in different views but also within other
components.
React is also extremely efficient when it comes to updating the HTML document with new data: the
content is re-rendered as the state changes. This enables React to display dynamic content on a web page
without requiring or changing anything on the server side. React is currently one of the most popular
front-end web frameworks. Other popular web frameworks include Angular and Vue, both of which are
open-source JavaScript frameworks; Google maintains Angular, while Vue is maintained by its creator
and a smaller team.
All three are component-based, meaning you build your front end by assembling various components
into a finished product. The frameworks share similarities but also differences, the learning curve
being the most important for me: Angular has the steepest learning curve, while Vue has
the flattest. Given this and the amount of time I had to work on this project, I decided
not to use Angular. Even though Angular is a powerful framework, the time required to get a front end up
and running was not worth it for this project; Angular would be the better choice for larger applications
with a longer time frame.
3.8.3 CSS

I considered using a CSS framework for the styling of my web application at the start of the project and
chose Bootstrap, today the most popular CSS framework, because of its popularity, element styling,
and my previous experience with it. After some careful consideration I decided to redesign my
application and found that Bootstrap was no longer suited to how I wanted the app to look.
Instead, I used plain CSS to customize the look of the web application.

3.8.4 Server-Side/ Back-end


In this section I present the frameworks that I think deliver the best performance at the least
cost.

3.8.5 FastAPI
FastAPI is described as a modern, fast web framework for creating Python APIs:
a simple framework for defining web endpoints to which requests can be made.
FastAPI is very fast compared to Node.js. Even though it is still relatively new, it is gaining popularity,
and developers are beginning to use it for projects, particularly those involving machine learning, such as
spaCy.
I also had Flask and Node.js in mind. Flask is a Python-based web framework with extensive
documentation; Node.js is an open-source server environment that allows JavaScript to be run on the
server.
When I first started this project, I considered whether to use Node.js or a Python-based framework. I
reasoned that because the NLP models were written in Python, integrating the models, server, and logic
in one language would be easier than using two. I therefore chose a Python framework and began with
Flask; after comparing speeds I switched to FastAPI, which I believed to be better in terms of
performance. A minimal endpoint sketch follows.
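Here is a minimal sketch of such an endpoint, assuming a hypothetical answer_question helper that
wraps the QA pipeline described in Chapter 4:

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Query(BaseModel):
    question: str
    document: str

def answer_question(question: str, document: str) -> str:
    # Placeholder for the BERT QA pipeline of Chapter 4.
    return "..."

@app.post("/ask")
def ask(query: Query):
    # The client POSTs a question and a document; the server returns JSON.
    return {"answer": answer_question(query.question, query.document)}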
CHAPTER 4

DATA ANALYSIS, INTERPRETATION AND


SYSTEM DESIGN
4.0. Introduction to Data Analysis

This chapter presents the descriptive statistical analysis of the data, the interpretation of the findings,
and the system design of the study. The work includes an in-depth view of the data preprocessing and
feature engineering activities. The results of mathematical procedures are used to assess the performance
of significant variables in the dataset for this machine learning model.

4.1 Presentation of Data Analysis

4.1.1 Dataset Description: What is SQuAD?

The Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset consisting of
questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a
segment of text, or span, from the corresponding reading passage, or the question might be
unanswerable. In this project I used SQuAD 2.0, which combines the 100,000+ question-answer pairs on
500+ articles from SQuAD 1.1 with over 50,000 unanswerable questions written adversarially by
crowdworkers to look similar to answerable ones. To do well on SQuAD 2.0, systems must not only
answer questions when possible, but also determine when no answer is supported by the paragraph and
abstain from answering.

4.1.3 Data Cleaning

This process involved checking for missing values, punctuation, and stopwords. We also perform
lemmatization, the process of reducing a word to its root form; the main purpose is to reduce variations
of the same word, thereby reducing the corpus of words included in the model. The difference between
stemming and lemmatizing is that stemming chops off the end of a word without taking its context
into consideration, whereas lemmatizing considers the context and shortens the word to its root
form based on the dictionary definition, as the sketch below illustrates.
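A small sketch contrasting the two, assuming NLTK with the wordnet corpus downloaded:

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('wordnet')  # required once for the lemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("studies"))                   # 'studi' -- context-free chop
print(lemmatizer.lemmatize("studies", pos="v"))  # 'study' -- dictionary-based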

ii) Tokenization

Tokenizing is the process of splitting strings into lists of words. We make use of regular
expressions (regex), which describe a search pattern, to do the splitting.
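For example, a one-line regex tokenizer sketch:

import re

def tokenize(text):
    # \w+ matches runs of word characters, splitting text into tokens
    return re.findall(r"\w+", text.lower())

print(tokenize("BERT answers questions, quite easily!"))
# ['bert', 'answers', 'questions', 'quite', 'easily']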

4.1.4 Data Visualization and Exploration

This section visualizes the words in the dataset using a word cloud and a count plot. These visual
representations enable one to explore the dataset and gain insights into it.

Figure 4

The word cloud above shows the most frequent words in the dataset. I generated a word cloud at
every iteration over each article's paragraphs, so there were 50+ word clouds in total. Here
is another word cloud.
Figure 5

4.3 Summary of Findings

4.3.1 Description of the existing system

Many question answering prediction models have been built, but deployment has often not
been possible. Previous systems were not flexible enough to accept various types of documents and let
you ask any question about them, and they did not have the added functionality of a
summarization model within the system. So this is indeed something new.

4.3.2 Description of the proposed system

The SummaQA system is a two-in-one system: a summarizer and a QA model, which is a closed-domain
QA system. I performed transfer learning on the BERT QA model and trained it with SQuAD 2.0, which
made it somewhat better than other models trained on SQuAD 1.1 alone. The system
provides the user with a UI where he/she can upload documents and ask questions with
respect to the uploaded document.

4.3.3 Functional Requirements

1. Usability: the model must be easy to use, accessible, and easy to navigate.
2. Reliability: the deployed model must consistently give accurate predictions to the user.
3. Performance: predictions must not take long.

4.3.3.2 Non-functional Requirements

1. Accuracy & performance: most ML work reports on algorithm accuracy (often precision and recall),
i.e. how "correct" the output is compared to reality.
2. Security: efforts have been made to address the privacy concerns that arise when
personal data is used for ML.
3. Reliability: the SummaQA system should be reliable when it comes to ML predictions.

4.3.3.2.1 Software Requirements

● Any of the following operating systems: Windows 7 SP1 32/64-bit, Windows 8.1 32/64-bit,
Windows 10 32/64-bit, Ubuntu 14.04 or later, or macOS Sierra or later
● Browser: Google Chrome or Mozilla Firefox
● Anaconda
● Python 3.6+
● cdQA == 1.3.9 (closed-domain question answering pipeline)
● Matplotlib, offering many numerical computation and visualization tools

4.4 System Design


4.4.1 Flow chart

Figure 6
In general the structure involves the following: the pool of articles in the diagram above is what the
user uploads. When a user uploads a document, it becomes the knowledge base for the system and is
fitted to the model. The user then queries the model; the query goes to the retriever, which outputs the
prediction, i.e. what the model thinks the answer is, together with the paragraph it believes matches
the answer it found.
Figure 7
4.5 System Implementation and Testing
4.5.1 Creating, Testing and Training Datasets

Summary of the Dataset

As described in section 4.1.1, SQuAD is a reading comprehension dataset of crowdworker questions over
Wikipedia articles, where the answer to every question is a span of the corresponding passage or the
question is unanswerable.

Sample of dataset structure

Figure 8
Data Fields

The data fields are the same among all splits.

plain_text
● id: a string feature.
● title: a string feature.
● context: a string feature.
● question: a string feature.
● answers: a dictionary feature containing:
  ○ text: a string feature.
  ○ answer_start: an int32 feature.

SQuAD 2.0 combines the 100,000 questions in SQuAD 1.1 with over 50,000 unanswerable questions
written adversarially by crowdworkers to look similar to answerable ones. For the model to
perform well, it must not only answer questions when possible, but also determine when no answer is
supported by the paragraph and abstain from answering.

4.5.1.2 Training and Evaluating the Question Answering System

At first I used a pre-trained BERT model with the following steps:

1. Model and tokenizer initialization

2. Pipeline and prediction

4.5.1.2.1 Model and Tokenizer Initialization, Pipeline and Prediction

Here I initialize the model and tokenizer using the following code:

import joblib
from cdqa.pipeline import QAPipeline

def load_model():
    # Load the serialized BERT reader and wrap it in the cdQA pipeline
    model = joblib.load('/content/drive/MyDrive/coding/models/bert_qa.joblib')
    cdqa_pipeline = QAPipeline(reader=model, max_df=1.0)
    return cdqa_pipeline

The pipeline handles everything from tokenization and text cleaning to prediction. Once I fitted the
model with a document uploaded to the system, this is how it answered questions:

Figure 9
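In code, fitting and querying the pipeline looks roughly like the sketch below; cdQA expects a pandas
DataFrame with 'title' and 'paragraphs' columns, and the document text here is a placeholder:

import pandas as pd

cdqa_pipeline = load_model()

# One row per document; 'paragraphs' holds a list of paragraph strings.
df = pd.DataFrame({
    "title": ["uploaded_document"],
    "paragraphs": [["Queen are a British rock band formed in London in 1970."]],
})

cdqa_pipeline.fit_retriever(df=df)
prediction = cdqa_pipeline.predict(query="When was Queen formed?")
print(prediction)  # (answer, title, paragraph)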
Since the pretrained model was trained on SQuAD 1.1, I took the initiative of performing transfer
learning using SQuAD 2.0, during which I faced many challenges due to limited computing power:
training took over 10 hours, and at times the runtime disconnected, which was tiresome.

The code below extracts the data from the JSON dictionary and appends it to lists to train the model
with; I do the same for the test data. With the data left as a dict, it would be hard to separate it into
questions and answers.

import json
from pathlib import Path

path = Path('squad/train-v2.0.json')

# Open the .json file
with open(path, 'rb') as f:
    squad_dict = json.load(f)

texts = []
queries = []
answers = []

# Walk each passage, its questions and their answers
for group in squad_dict['data']:
    for passage in group['paragraphs']:
        context = passage['context']
        for qa in passage['qas']:
            question = qa['question']
            for answer in qa['answers']:
                # Store every passage, query and answer in the lists
                texts.append(context)
                queries.append(question)
                answers.append(answer)

train_texts, train_queries, train_answers = texts, queries, answers

After all this, we had the following number of training examples:

print(len(train_texts))
print(len(train_queries))
print(len(train_answers))

86821
86821
86821

And the following validation set sizes:

print(len(val_texts))
print(len(val_queries))
print(len(val_answers))

20302
20302
20302

Next, we tokenize the train and validation sets:

from transformers import AutoTokenizer, AdamW, BertForQuestionAnswering

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

train_encodings = tokenizer(train_texts, train_queries, truncation=True, padding=True)
val_encodings = tokenizer(val_texts, val_queries, truncation=True, padding=True)

I created a SquadDataset class (inheriting from torch.utils.data.Dataset) that helped me train and
validate the data more easily by converting the encodings to datasets.

import torch

class SquadDataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        self.encodings = encodings

    def __getitem__(self, idx):
        # Convert each encoding field for this example into a tensor
        return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}

    def __len__(self):
        return len(self.encodings.input_ids)

For this project I used a GPU, a free resource provided by Google Colab.

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

With the following parameters:

model = BertForQuestionAnswering.from_pretrained('bert-base-uncased').to(device)
optim = AdamW(model.parameters(), lr=5e-5)
epochs = 3
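A condensed sketch of the fine-tuning loop itself, assuming start_positions and end_positions have
already been added to train_encodings (the step mapping each answer's character span to token
indices is not shown here):

from torch.utils.data import DataLoader

train_dataset = SquadDataset(train_encodings)
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)

model.train()
for epoch in range(epochs):
    for batch in train_loader:
        optim.zero_grad()
        # Move every tensor in the batch to the GPU (or CPU fallback)
        batch = {k: v.to(device) for k, v in batch.items()}
        # BertForQuestionAnswering returns a loss when start/end
        # positions are included in the batch
        outputs = model(**batch)
        outputs.loss.backward()
        optim.step()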
EVALUATING THE FINE-TUNED MODEL

Here I give some examples to my model to see how well I trained it. I started with easier examples
and then gave more complex ones.

For extractive textual QA tasks, we usually adopt two evaluation metrics, measuring exact match and
partially overlapping scores respectively:

● Exact Match: measures whether the predicted answer exactly matches the ground-truth answer; if it
does, the score is 1.0, otherwise 0.0.

● F1 Score: computes the average word overlap between predicted and ground-truth answers, ensuring
both precision and recall are optimized at the same time.
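The give_an_answer helper used in the examples below is not listed in full in this report; here is a
minimal sketch consistent with the two definitions above, assuming the fine-tuned model and tokenizer
are wrapped in a transformers QA pipeline. Note that tokenization details (e.g. punctuation handling)
affect the exact scores, so this sketch may not reproduce the reported numbers exactly.

from transformers import pipeline

qa_pipeline = pipeline("question-answering", model=model, tokenizer=tokenizer)

def give_an_answer(context, query, true_answer):
    prediction = qa_pipeline(question=query, context=context)["answer"]

    pred_tokens = prediction.lower().split()
    true_tokens = true_answer.lower().split()

    # Exact match: all-or-nothing string comparison
    em = int(prediction.lower() == true_answer.lower())

    # F1: harmonic mean of precision and recall over shared words
    shared = sum(min(pred_tokens.count(t), true_tokens.count(t))
                 for t in set(pred_tokens))
    precision = shared / len(pred_tokens) if pred_tokens else 0
    recall = shared / len(true_tokens) if true_tokens else 0
    f1 = 2 * precision * recall / (precision + recall) if shared else 0

    print(f"Question: {query}")
    print(f"Prediction: {prediction}")
    print(f"True Answer: {true_answer}")
    print(f"EM: {em}")
    print(f"F1: {f1}")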

Here is an example. I took some content from Wikipedia pages to test my model. I observed that for
questions requiring an answer with more than one entity, separated in the context by commas, the model
returns only the first one (as in the question about the members of the band). Moreover, when I asked
what kind of band they are, the model answered "British rock", although I did not ask about the origin
of the band.

In [7]:

context = """Queen are a British rock band formed in London in 1970. Their
classic line-up was Freddie Mercury (lead vocals, piano), Brian May (guitar,
vocals), Roger Taylor (drums, vocals) and John Deacon (bass). Their earliest
works were influenced by progressive rock, hard rock and heavy metal, but the
band gradually ventured into more conventional and radio-friendly works by
incorporating further styles, such as arena rock and pop rock."""

queries = ["When did Queen found?",
           "Who were the basic members of Queen band?",
           "What kind of band they are?"]

answers = ["1970",
           "Freddie Mercury, Brian May, Roger Taylor and John Deacon",
           "rock"]

for q, a in zip(queries, answers):
    give_an_answer(context, q, a)

Question: When did Queen found?

Prediction: 1970

True Answer: 1970

EM: 1
F1: 1.0

Question: Who were the basic members of Queen band?

Prediction: freddie mercury ( lead vocals, piano ), brian may ( guitar, vocals ),

roger taylor ( drums, vocals ) and john deacon ( bass )

True Answer: Freddie Mercury, Brian May, Roger Taylor and John Deacon

EM: 0

F1: 0.6923076923076924

Question: What kind of band they are?

Prediction: british rock

True Answer: rock

EM: 0

F1: 0.6666666666666666
CHAPTER 5

SUMMARY OF FINDINGS, CONCLUSIONS AND RECOMMENDATIONS

5.1 Introduction
This chapter covers the summary of the findings of the processes undertaken during the system
implementation and testing.

5.2 Summary of Analysis

As mentioned earlier, the BERT model for this research project addresses an NLP and Natural
Language Understanding challenge. The key classification metrics were accuracy, recall, precision, and F1 score.

5.3 QA Metrics

Accuracy, recall, and F1 score were computed for each question answered by the model that it had never
seen, as shown in the screenshots below.
There are two dominant metrics used by many question answering datasets, including SQuAD: exact match
(EM) and F1 score. These scores are computed on individual question-answer pairs. When multiple
correct answers are possible for a given question, the maximum score over all possible correct answers is
taken. Overall EM and F1 scores for a model are computed by averaging over the individual example
scores.

5.3.1 Exact Match


This metric is as simple as it sounds. For each question+answer pair, if the characters of the model's
prediction exactly match the characters of (one of) the True Answer(s), EM = 1, otherwise EM = 0. This is
a strict all-or-nothing metric; being off by a single character results in a score of 0. When assessing against
a negative example, if the model predicts any text at all, it automatically receives a 0 for that example.
5.3.2 F1 Score
F1 score is a common metric for classification problems, and widely used in QA. It is appropriate when we
care equally about precision and recall. In this case, it's computed over the individual words in the
prediction against those in the True Answer. The number of shared words between the prediction and the
truth is the basis of the F1 score: precision is the ratio of the number of shared words to the total number of
words in the prediction, and recall is the ratio of the number of shared words to the total number of words in the
ground truth.
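As a worked example, take the earlier Queen question "What kind of band they are?": the prediction
"british rock" shares one word ("rock") with the ground truth "rock", so precision = 1/2 and
recall = 1/1, giving F1 = 2 × (0.5 × 1.0)/(0.5 + 1.0) ≈ 0.67, exactly the score reported in Chapter 4.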

Here is an example I crafted to test my model:


context = "Hi! My name is Alexa and I am 21 years old. I used to live in Peristeri
of Athens, but now I moved on in Kaisariani of Athens."

queries = ["How old is Alexa?",


"Where does Alexa live now?",
"Where Alexa used to live?"
]
answers = ["21",
"Kaisariani of Athens",
"Peristeri of Athens"
]

for q,a in zip(queries,answers):


give_an_answer(context,q,a)

Results

Question: How old is Alexa?

Prediction: 21

True Answer: 21

EM: 1

F1: 1.0

Question: Where does Alexa live now?

Prediction: kaisariani of athens

True Answer: Kaisariani of Athens


EM: 1

F1: 1.0

Question: Where Alexa used to live?

Prediction: peristeri of athens

True Answer: Peristeri of Athens

EM: 1

F1: 1.0

More complex example:

context = """ Harry Potter is a series of seven fantasy novels written by British

author, J. K. Rowling. The novels chronicle the lives of a young wizard,

Harry Potter, and his friends Hermione Granger and Ron Weasley, all

of whom are students at Hogwarts School of Witchcraft and Wizardry.

The main story arc concerns Harry's struggle against Lord Voldemort,

a dark wizard who intends to become immortal, overthrow the wizard

governing body known as the Ministry of Magic and subjugate all

wizards and Muggles (non-magical people). Since the release of the first novel,

Harry Potter and the Philosopher's Stone, on 26 June 1997, the books

have found immense popularity, positive reviews, and commercial success worldwide.

They have attracted a wide adult audience as well as younger readers

and are often considered cornerstones of modern young adult literature.[2]

As of February 2018, the books have sold more than 500 million

copies worldwide, making them the best-selling book series in history, and have

been translated

into eighty languages.[3] The last four books consecutively set

records as the fastest-selling books in history, with the final installment

selling roughly

eleven million copies in the United States within twenty-four hours

of its release. """


queries = [

"Who wrote Harry Potter's novels?",

"Who are Harry Potter's friends?",

"Who is the enemy of Harry Potter?",

"What are Muggles?",

"Which is the name of Harry Poter's first novel?",

"When did the first novel release?",

"Who was attracted by Harry Potter novels?",

"How many languages Harry Potter has been translated into? "

answers = [

"J. K. Rowling",

"Hermione Granger and Ron Weasley",

"Lord Voldemort",

"non-magical people",

"Harry Potter and the Philosopher's Stone",

"26 June 1997",

"a wide adult audience as well as younger readers",

"eighty"

for q,a in zip(queries,answers):

give_an_answer(context,q,a)

Model Output:

Question: Who wrote Harry Potter's novels?

Prediction: j. k. rowling

True Answer: J. K. Rowling

EM: 1

F1: 1.0
Question: Who are Harry Potter's friends?

Prediction: hermione granger and ron weasley

True Answer: Hermione Granger and Ron Weasley

EM: 1

F1: 1.0

Question: Who is the enemy of Harry Potter?

Prediction: lord voldemort

True Answer: Lord Voldemort

EM: 1

F1: 1.0

Question: What are Muggles?

Prediction: non - magical people

True Answer: non-magical people

EM: 0

F1: 0.4

Question: Which is the name of Harry Poter's first novel?

Prediction: harry potter and the philosopher's stone

True Answer: Harry Potter and the Philosopher's Stone

EM: 1

F1: 1.0

Question: When did the first novel release?

Prediction: 26 june 1997

True Answer: 26 June 1997


EM: 1

F1: 1.0

Question: Who was attracted by Harry Potter novels?

Prediction: wide adult audience as well as younger readers

True Answer: a wide adult audience as well as younger readers

EM: 1

F1: 0.875

Question: How many languages Harry Potter has been translated into?

Prediction: eighty

True Answer: eighty

EM: 1

F1: 1.0

5.4 Model Deployment


I deployed my model using the Streamlit library, with which I built my web application. For now I am
using Google Colab to create a remote tunnel through which I run the model, because the model
requires high computational power. A minimal sketch of the front end follows.
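A minimal sketch of the Streamlit front end, assuming the load_model helper from Chapter 4 and a
hypothetical make_dataframe helper that wraps the document text in the DataFrame format cdQA expects;
widget labels and file types are illustrative:

import streamlit as st

st.title("SUMMAQA")

uploaded = st.file_uploader("Upload a document", type=["txt", "pdf"])
question = st.text_input("Ask a question about the document")

if uploaded and question:
    # Read the uploaded file as plain text (PDF parsing omitted in this sketch)
    document = uploaded.read().decode("utf-8", errors="ignore")
    qa_pipeline = load_model()
    qa_pipeline.fit_retriever(df=make_dataframe(document))  # hypothetical helper
    answer, title, paragraph = qa_pipeline.predict(query=question)
    st.write(answer)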

5.5 Conclusion
In this project I have shown how this tool can help people with digital text comprehension, question
answering, and text summarization. I made this contribution to test whether or not such a tool can be
accomplished with currently existing technologies and limited hardware. I have also demonstrated what
difficulties lie ahead in producing this as a real-world application. The application managed to solve
basic problems with short to medium-sized texts. Longer texts and passages proved to yield inconsistent
results, and how the user formats his/her context to get an accurate response played a big role, since one
can ask a question which may not have an answer. The application also proved to be reasonably reliable
when processing texts with simpler content, while being inaccurate with more complex texts. The results
presented were accomplished with minimal hardware resources and free-to-use libraries, frameworks,
and tools. Test groups were presented with the application, with a mixed reception. Many users saw the
potential use of the tool but did not express trust in the current implementation; many participants felt
that the time gained would not compensate for the number of mistakes made by the algorithm. In
summary, the application showed that solving the problems of text comprehension is indeed feasible,
even for smaller systems. However, we have also seen that a more sophisticated central model is needed
to deal with text analysis, along with a better way to communicate the decision process of the machine
to its users.

5.6 Future Work

While the application serves as an optimistic prototype, a couple of improvements would be necessary
to make it a usable application in the real world. In this section, I outline these steps as a proposal for
future work. They include:

GPU Support

The first step towards making the application more usable as a product is changing the hardware it runs
on. Instead of simply adding more CPU cores, moving the application to a hosting service with dedicated
CUDA-enabled GPU support would make the NLP models much faster.

Use of Other Models

Other models such as ALBERT, RoBERTa, and XLNet could be evaluated in place of BERT.

File Support

The system in its current state supports the TXT and PDF file extensions. For future improvements, I
would like to extend the file formats and types of documents that can be parsed, such as DOC, DOCX,
ODT, MD, and CSV. I also considered support for URLs, which could parse the text content from a
website, as well as the ability to extract and parse the text content of an image (OCR).

REFERENCES
C, L. (n.d.). An Introduction to Question Answering Systems. Engineering Education (EngEd) Program, Section. https://www.section.io/engineering-education/question-answering/

Rothman, D. (2021). Transformers for Natural Language Processing. Packt.

Joshi, P. (2019, June 19). Transformers in NLP: State-of-the-Art Models. Analytics Vidhya. https://www.analyticsvidhya.com/blog/2019/06/understanding-transformers-nlp-state-of-the-art-models/

Hofesmann, E. (2021, January 21). The Machine Learning Lifecycle in 2021. Towards Data Science. https://towardsdatascience.com/the-machine-learning-lifecycle-in-2021-473717c633bc

Dwivedi, S. (2013, December 28). Research and Reviews in Question Answering System. ScienceDirect. https://www.sciencedirect.com/science/article/pii/S2212017313005409

Thompson, R. (2014, October 14). Experimental Methodology in NLP Research. Inspiritive. http://www.inspiritive.com.au/experimental-methodology-in-nlp-research/

Sabharwal, N., & Agrawal, A. (2021). Hands-On Question Answering Systems with BERT. Apress.

Bahetti, P. (2021, November 11). A Newbie-Friendly Guide to Transfer Learning: Everything You Need to Know. V7 Labs. https://www.v7labs.com/blog/transfer-learning-guide

Streamlit. https://streamlit.io

Machine Learning Mastery. https://www.machinelearningmastery.com
APPENDICES

APPENDIX A: PROJECT BUDGET

Table 1

ACTIVITY                                 BUDGET
Internet                                 1000
Application deployment on the cloud      3500
Printing & binding                       150
Transport                                500
Gantt chart
Figure 9

PERT CHART
Figure 10
APPENDIX C: HARDWARE REQUIREMENTS

Hardware requirements

The minimum hardware requirements are:

● Processor: i5 or better
● Memory: 8GB RAM
● Storage: 10 GB available space.
● Mouse/Keyboard
● Monitor (LCD, LED)
● Printer
APPENDIX D: SOFTWARE REQUIREMENTS

● Any of the following operating systems: Windows 7 SP1 32/64-bit, Windows 8.1 32/64-bit,
Windows 10 32/64-bit, Ubuntu 14.04 or later, or macOS Sierra or later
● Browser: Google Chrome or Mozilla Firefox
● Anaconda
● Python 3.6+
● cdQA == 1.3.9 (closed-domain question answering pipeline)
● Matplotlib, offering many numerical computation and visualization tools
