
Question Answering on SQuAD 2.0
Stanford CS224N Default Project, Option 3

Kevin Han (Department of Statistics, Stanford University, [email protected])
Han Wu (Department of Statistics, Stanford University, [email protected])
Yuqi Jin (Department of Computer Science, Stanford University, [email protected])

Abstract

In this project, we implemented several variants of question answering models and compared their performance on SQuAD 2.0. We also implemented a data-augmentation method that enlarges the training set and compared the model trained on the augmented dataset with the original model. We found that the QANet model trained on the augmented dataset gives the best performance in terms of both EM and F1 metrics. We achieved a 65.21 F1 score and 61.80 EM score on the dev set, and a 62.08 F1 score and 58.58 EM score on the test set.

1 Introduction
Question answering is concerned with building systems that automatically answer questions posed by humans in natural language [Wikipedia, 2004]. It has long been a popular and crucial topic in natural language processing, as researchers try to automate this important aspect of human intelligence. Despite seeming easy from a human point of view, it remains an open and complex challenge. With the rise of deep-learning-based models and an increasing number of publicly available datasets, we have seen significant progress on this task. In this project we focus on one particular dataset, Stanford SQuAD 2.0 [Rajpurkar et al., 2018]. SQuAD 2.0 improves over SQuAD 1.1 by adding unanswerable questions (roughly half of the dataset), which adds an extra layer of difficulty for the system.
The attention mechanism is indispensable in recent successful models. In this project, we investigate two kinds of attention-based models, BiDAF and QANet. We improve upon the baseline BiDAF by adding additional trainable character embeddings. Inspired by the work of Wang et al. [2017], we also add a self-matching attention layer to effectively aggregate information across the whole passage. We also implement the QANet architecture [Yu et al., 2018], which bypasses RNNs entirely and uses convolution instead. We perform a detailed ablation study and hyperparameter tuning. We also perform data augmentation using back translation, which is introduced in Yu et al. [2018], and adversarial questions, which enlarge the set of unanswerable questions.

2 Related Work
There are many existing models for the question answering task in the natural language processing literature. In the BiDAF model [Seo et al., 2016], the authors use a context-query attention layer on top of embedding and encoder layers. R-net [Wang et al., 2017] is another successful question answering model; its key ingredient is a self-matching attention layer, whose intuition is to let the model encode information from the whole context when predicting the answer. The Transformer [Vaswani et al., 2017] has become a popular model across natural language processing tasks. QANet [Yu et al., 2018] utilizes transformer blocks in both the encoder layer and the modeling layer. The authors demonstrate a significant speed-up, which makes training on an augmented dataset feasible. Another project from a past CS224n offering [Dhoot and Gupta] modified the embedding by incorporating the cosine similarities between context words and query words, and showed the advantage of using this extra information.

3 Models
We implement our question answering models based on BiDAF and QANet. We train them with different hyperparameters and also enlarge the amount of training data through data augmentation.

3.1 BiDAF based models

The baseline model, provided in the CS224N default project handout, is a slightly modified version of the BiDAF model from the original paper [Seo et al., 2016]. Like most existing models, it consists of five layers: an embedding layer, an encoder layer, an attention layer, a modeling layer and an output layer. The architecture is illustrated in Figure 1. We detail these layers and our modifications in the following subsections.

Figure 1: BiDAF Architecture

3.1.1 Embedding Layer


We combine both word-level embeddings and character-level embeddings for each word in this embedding layer. Given a word $w = [c_1, \ldots, c_l] \in \mathbb{Z}^l$, we use pretrained GloVe [Pennington et al., 2014] vectors to get its 300-dimensional word embedding, which is fixed during training. To get the character embedding, we initialize the character vectors to their pretrained values, which gives an $l \times H$ matrix for each word. We then pass this matrix through a convolution layer followed by max pooling to get an $H$-dimensional vector. We concatenate this with the word vector and apply a projection and a two-layer highway network [Srivastava et al., 2015]. The final output of this layer is a $2H$-dimensional vector for each word.
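A minimal sketch of this embedding layer is given below; the kernel size, dropout rate and exact projection shape are illustrative assumptions, and the word vectors are assumed to be frozen GloVe embeddings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Highway(nn.Module):
    """Two-layer highway network applied to the projected embeddings."""
    def __init__(self, num_layers: int, size: int):
        super().__init__()
        self.transforms = nn.ModuleList([nn.Linear(size, size) for _ in range(num_layers)])
        self.gates = nn.ModuleList([nn.Linear(size, size) for _ in range(num_layers)])

    def forward(self, x):
        for t, g in zip(self.transforms, self.gates):
            gate = torch.sigmoid(g(x))
            x = gate * F.relu(t(x)) + (1 - gate) * x
        return x

class CharWordEmbedding(nn.Module):
    """Word embedding (frozen GloVe) + char-CNN embedding, projection, highway."""
    def __init__(self, word_vectors, char_vectors, hidden_size, kernel_size=5, drop_prob=0.1):
        super().__init__()
        self.word_embed = nn.Embedding.from_pretrained(word_vectors, freeze=True)
        self.char_embed = nn.Embedding.from_pretrained(char_vectors, freeze=False)
        char_dim = char_vectors.size(1)
        self.char_conv = nn.Conv1d(char_dim, hidden_size, kernel_size, padding=kernel_size // 2)
        self.proj = nn.Linear(word_vectors.size(1) + hidden_size, 2 * hidden_size, bias=False)
        self.highway = Highway(2, 2 * hidden_size)
        self.drop_prob = drop_prob

    def forward(self, word_idxs, char_idxs):
        # word_idxs: (B, L); char_idxs: (B, L, W) character indices per word
        w = self.word_embed(word_idxs)                        # (B, L, 300)
        B, L, W = char_idxs.shape
        c = self.char_embed(char_idxs.view(B * L, W))         # (B*L, W, char_dim)
        c = self.char_conv(c.transpose(1, 2))                 # (B*L, H, W)
        c = F.max_pool1d(F.relu(c), c.size(2)).squeeze(2)     # max over characters -> (B*L, H)
        c = c.view(B, L, -1)                                  # (B, L, H)
        x = torch.cat([w, c], dim=2)                          # (B, L, 300 + H)
        x = F.dropout(x, self.drop_prob, self.training)
        return self.highway(self.proj(x))                     # (B, L, 2H)
```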

3.1.2 Encoder Layer


The encoder layer consists of a bi-directional LSTM that captures temporal dependencies; its output is the concatenation of the forward and backward hidden states. The outputs are a context matrix $C \in \mathbb{R}^{N \times 4H}$ and a query matrix $Q \in \mathbb{R}^{M \times 4H}$ obtained by
$$c_i = \text{bi-LSTM}(h_i, W_i^C) \in \mathbb{R}^{4H}, \qquad q_j = \text{bi-LSTM}(h_j, W_j^Q) \in \mathbb{R}^{4H}.$$

3.1.3 Context-Query Attention Layer


This layer is the core of the BiDAF model, where we perform bi-directional attention between context and query. Given context hidden states $c_1, \ldots, c_N \in \mathbb{R}^{4H}$ and query hidden states $q_1, \ldots, q_M \in \mathbb{R}^{4H}$, we first compute the similarity matrix $S$:
$$S_{ij} = w_{\text{sim}}^T [c_i;\, q_j;\, c_i \circ q_j].$$

Context-to-Query (C2Q) Attention. We take a row-wise softmax of $S$ to get the question-aware context representation:
$$\bar{S}_{i,:} = \text{softmax}(S_{i,:}) \in \mathbb{R}^M \quad \forall i \in \{1, \ldots, N\},$$
$$a_i = \sum_{j=1}^{M} \bar{S}_{i,j}\, q_j \in \mathbb{R}^{4H} \quad \forall i \in \{1, \ldots, N\}.$$

Question-to-Context (Q2C) Attention. We first take a column-wise softmax of $S$ to get $\bar{\bar{S}}$ and multiply it with $\bar{S}$ to get the new weight matrix $S'$, which is used to compute the Q2C attention output:
$$\bar{\bar{S}}_{:,j} = \text{softmax}(S_{:,j}) \in \mathbb{R}^N \quad \forall j \in \{1, \ldots, M\},$$
$$S' = \bar{S}\,\bar{\bar{S}}^T \in \mathbb{R}^{N \times N},$$
$$b_i = \sum_{j=1}^{N} S'_{i,j}\, c_j \in \mathbb{R}^{4H} \quad \forall i \in \{1, \ldots, N\}.$$

The final output of this layer is
$$g_i = [c_i;\, a_i;\, c_i \circ a_i;\, c_i \circ b_i] \in \mathbb{R}^{16H} \quad \forall i \in \{1, \ldots, N\}.$$
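A minimal sketch of this attention layer is given below; shapes follow the notation above ($N$ = context length, $M$ = query length, $4H$ = hidden dimension), and the memory-heavy expansion used to build $S$ is chosen for clarity rather than efficiency.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiDAFAttention(nn.Module):
    def __init__(self, dim_4h: int):
        super().__init__()
        # w_sim acts on [c; q; c∘q], a 3*(4H)-dimensional vector, and returns a scalar
        self.w_sim = nn.Linear(3 * dim_4h, 1, bias=False)

    def forward(self, C, Q):
        # C: (B, N, 4H) context states, Q: (B, M, 4H) query states
        B, N, D = C.shape
        M = Q.size(1)
        C_exp = C.unsqueeze(2).expand(B, N, M, D)
        Q_exp = Q.unsqueeze(1).expand(B, N, M, D)
        S = self.w_sim(torch.cat([C_exp, Q_exp, C_exp * Q_exp], dim=3)).squeeze(3)  # (B, N, M)

        S_bar = F.softmax(S, dim=2)                           # row-wise softmax over query positions
        A = torch.bmm(S_bar, Q)                               # (B, N, 4H): C2Q output a_i
        S_bbar = F.softmax(S, dim=1)                          # column-wise softmax over context positions
        S_prime = torch.bmm(S_bar, S_bbar.transpose(1, 2))    # (B, N, N)
        Bq2c = torch.bmm(S_prime, C)                          # (B, N, 4H): Q2C output b_i

        return torch.cat([C, A, C * A, C * Bq2c], dim=2)      # (B, N, 16H): g_i
```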
3.1.4 Modeling Layer
The modeling layer is again a bi-directional LSTM, which integrates temporal information across the context conditioned on the question. However, we reduce the dimension to $4H$.

3.1.5 Self-Matching Attention Layer


On top of the modeling layer, we add the self-attention layer introduced in R-net [Wang et al., 2017]. We implemented this layer because we want to emphasize the parts of the context that are related to the query. We removed the bi-RNN structure and replaced the additive attention with multiplicative attention due to memory constraints and speed concerns. We kept the gate structure and added a linear projection layer with bias before the output to make the dimensions match. Specifically, given the question-aware context representations $v_t \in \mathbb{R}^{4H}$, $t = 1, \ldots, N$, we compute an attention-based vector $c_t$ by
$$s_j^t = v_j^T W v_t,$$
$$a_i^t = \exp(s_i^t) \Big/ \sum_{j=1}^{N} \exp(s_j^t),$$
$$c_t = \sum_{i=1}^{N} a_i^t v_i.$$
The output is
$$g_t = \text{sigmoid}(W [v_t, c_t]),$$
$$m_t = W' (g_t \circ [v_t, c_t]) \in \mathbb{R}^{4H}.$$
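A minimal sketch of this gated multiplicative self-attention is given below; the exact parameterization of $W$, the gate, and the output projection $W'$ is an assumption based on the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfMatchingAttention(nn.Module):
    def __init__(self, dim_4h: int):
        super().__init__()
        self.W = nn.Linear(dim_4h, dim_4h, bias=False)         # multiplicative attention scores
        self.gate = nn.Linear(2 * dim_4h, 2 * dim_4h, bias=False)
        self.proj = nn.Linear(2 * dim_4h, dim_4h, bias=True)   # projection with bias back to 4H

    def forward(self, V, mask=None):
        # V: (B, N, 4H) question-aware context representations v_t
        scores = torch.bmm(self.W(V), V.transpose(1, 2))       # (B, N, N): s^t_j = v_j^T W v_t (up to transpose of W)
        if mask is not None:
            scores = scores.masked_fill(~mask.unsqueeze(1), float("-inf"))
        attn = F.softmax(scores, dim=2)                        # a^t_i
        ctx = torch.bmm(attn, V)                               # (B, N, 4H): c_t
        vc = torch.cat([V, ctx], dim=2)                        # [v_t, c_t]
        gated = torch.sigmoid(self.gate(vc)) * vc              # gate structure from R-net
        return self.proj(gated)                                # (B, N, 4H): m_t
```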
3.1.6 Output Layer
In this layer we aggregate all the information. Let $G$ be the output matrix of the BiDAF attention layer and $M$ the output matrix of the self-matching attention layer. We pass $M$ through a bi-directional LSTM to get a matrix $M'$ and then predict $p_{\text{start}}$ and $p_{\text{end}}$ by
$$p_{\text{start}} = \text{softmax}(W_{\text{start}} [G; M]), \qquad p_{\text{end}} = \text{softmax}(W_{\text{end}} [G; M']).$$

3.2 QANet

QANet shares some components with BiDAF: the embedding layer and the context-query attention layer stay the same. The main difference lies in the encoder layer and the modeling layer, which is where the transformer block comes in. We replace the recurrent architecture in BiDAF with convolution layers and transformer blocks. Specifically, both the encoder layer and the modeling layer are stacks of the basic encoder block shown on the right side of Figure 2. For an overview of the QANet model, please refer to the left side of Figure 2. Next, we introduce the key ingredients of a single encoder block.

Figure 2: Overview of QANet Architecture

3.2.1 Positional Encoding


In [Vaswani et al., 2017], a positional encoding is added to the input embeddings to encode the position of each token in the sequence. We use the same positional encoding, shown below:
$$PE_{(pos,\,2i)} = \sin\!\left(pos / 10000^{2i/d_{\text{model}}}\right) \quad (1)$$
$$PE_{(pos,\,2i+1)} = \cos\!\left(pos / 10000^{2i/d_{\text{model}}}\right) \quad (2)$$
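A minimal sketch of Eqs. (1)-(2) is shown below; it assumes an even $d_{\text{model}}$ and returns a table that is added to the input embeddings.

```python
import math
import torch

def positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Sinusoidal positional encoding table of shape (max_len, d_model); d_model assumed even."""
    pe = torch.zeros(max_len, d_model)
    pos = torch.arange(max_len, dtype=torch.float).unsqueeze(1)            # (max_len, 1)
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float)
                    * (-math.log(10000.0) / d_model))                      # 1 / 10000^{2i/d_model}
    pe[:, 0::2] = torch.sin(pos * div)   # even dimensions, Eq. (1)
    pe[:, 1::2] = torch.cos(pos * div)   # odd dimensions, Eq. (2)
    return pe                            # added element-wise to the (seq_len, d_model) embeddings
```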

3.2.2 Depthwise separable convolutions


The idea of depthwise separable convolutions comes from [Kaiser et al., 2018]. It consists of a depthwise convolution, i.e. a spatial convolution performed independently over every channel of the input, followed by a pointwise convolution, i.e. a regular convolution with kernel size 1. According to [Kaiser et al., 2018], this is a powerful simplification of a regular 2D or 3D convolution under the assumption that the inputs feature both fairly independent channels and highly correlated spatial locations.
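The sketch below shows a 1D depthwise separable convolution of the kind used over token sequences; the channel counts and kernel size are illustrative.

```python
import torch.nn as nn

class DepthwiseSeparableConv1d(nn.Module):
    def __init__(self, in_channels: int, out_channels: int, kernel_size: int):
        super().__init__()
        # Depthwise: one filter per input channel (groups = in_channels)
        self.depthwise = nn.Conv1d(in_channels, in_channels, kernel_size,
                                   groups=in_channels, padding=kernel_size // 2)
        # Pointwise: regular convolution with kernel size 1 that mixes the channels
        self.pointwise = nn.Conv1d(in_channels, out_channels, kernel_size=1)

    def forward(self, x):
        # x: (batch, channels, seq_len)
        return self.pointwise(self.depthwise(x))
```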

3.2.3 Layer Normalization


Layer normalization was first introduced in [Ba et al., 2016]. One of the challenges of deep learning is that the gradients with respect to the weights in one layer are highly dependent on the outputs of the previous layer. Batch normalization is designed to address this problem, but it is hard to apply to recurrent neural networks [Ba et al., 2016]. Layer normalization was designed to overcome this drawback of batch normalization. Suppose $H$ is the number of hidden units of a layer and $l$ indexes one layer of the model. Layer normalization computes
$$\mu^l = \frac{1}{H}\sum_{i=1}^{H} a_i^l, \qquad \sigma^l = \sqrt{\frac{1}{H}\sum_{i=1}^{H} \left(a_i^l - \mu^l\right)^2} \quad (3)$$
and then normalizes the inputs by
$$\hat{a}^l = \frac{a^l - \mu^l}{\sqrt{(\sigma^l)^2 + \epsilon}}, \quad (4)$$
where $\epsilon$ is a small positive number that avoids division by zero.
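A minimal sketch of Eqs. (3)-(4) follows; the learnable gain and bias that usually accompany layer normalization are omitted for brevity.

```python
import torch

def layer_norm(a: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # a: (..., H) activations of one layer; statistics are taken over the last dimension
    mu = a.mean(dim=-1, keepdim=True)                              # Eq. (3), mean
    sigma = a.var(dim=-1, unbiased=False, keepdim=True).sqrt()     # Eq. (3), std
    return (a - mu) / torch.sqrt(sigma ** 2 + eps)                 # Eq. (4)
```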

3.2.4 Multi-head attention


This is the key part of the transformer block; the multi-head attention mechanism first appeared in [Vaswani et al., 2017]. For each head $i$, we first project the queries, keys and values through linear projections to $d_q$, $d_k$ and $d_v$ dimensions, and then calculate
$$\text{HEAD}_i = \text{ATTENTION}(Q W_i^Q, K W_i^K, V W_i^V), \quad (5)$$
where
$$\text{ATTENTION}(Q, K, V) = \text{SOFTMAX}\!\left(\frac{Q K^T}{\sqrt{d_k}}\right) V. \quad (6)$$
Finally, we concatenate all the heads and pass the result through another linear projection:
$$\text{MULTIHEAD}(Q, K, V) = \text{CONCAT}(\text{HEAD}_1, \cdots, \text{HEAD}_h)\, W^O. \quad (7)$$
Please refer to the original paper [Vaswani et al., 2017] for more details. The self-attention layer uses the multi-head attention mechanism with the same inputs for the queries, keys and values.
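A minimal sketch of Eqs. (5)-(7) is given below, assuming $d_q = d_k = d_v = d_{\text{model}}/h$ and omitting masking and dropout; the module and parameter names are illustrative.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0
        self.h, self.d_k = num_heads, d_model // num_heads
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)   # output projection W^O

    def forward(self, Q, K, V):
        # Q, K, V: (B, L, d_model); for self-attention all three are the same tensor
        B, L, _ = Q.shape

        def split(x, W):  # project, then reshape to (B, h, L, d_k)
            return W(x).view(B, -1, self.h, self.d_k).transpose(1, 2)

        q, k, v = split(Q, self.W_q), split(K, self.W_k), split(V, self.W_v)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_k)   # Eq. (6), scaled dot product
        heads = F.softmax(scores, dim=-1) @ v                    # (B, h, L, d_k), Eq. (5)
        out = heads.transpose(1, 2).contiguous().view(B, L, -1)  # concatenate the heads
        return self.W_o(out)                                     # Eq. (7)
```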

3.2.5 Feedforward layer


This feedforward layer consists of two linear layers with the same input and output sizes. We apply a ReLU activation and then dropout to the outputs of the first linear layer.

3.3 Keyword-aware Embeddings

Inspired by [Dhoot and Gupta], we implemented a slightly modified version of the embedding layer that incorporates keyword information. In that work, the authors add the maximum element-wise cosine similarity between context and query words to the original word embeddings. We adapted their idea but computed the cosine similarities only between context words and keywords in the query. We selected the keywords using the StanfordCoreNLP dependency parser [Manning et al., 2014]. Specifically, we select those words labeled 'ROOT', 'nsubj', 'nsubjpass', 'ccomp', 'xcomp', 'nummod', 'nmod', 'advmod', 'compound' or 'conj'. Suppose $[c_1, \cdots, c_{l_c}]$ represents the context paragraph and $[q_1, \cdots, q_{l_k}]$ represents the keyword set. For each $c_i$ in the context paragraph and $q_j$ in the keyword set, we calculate
$$\frac{c_i^T q_j}{\|c_i\|\,\|q_j\|} \quad (8)$$
and append the three largest cosine similarities to the word embeddings of the context words, then resize to the original hidden size. In our experiments, we found that although [Dhoot and Gupta] claimed a relative advantage for this kind of embedding, it hurt the performance of our model on the SQuAD 2.0 dataset. Hence, we omit this particular variant from the later discussion.
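For completeness, the sketch below shows one way to compute the feature in Eq. (8): the three largest cosine similarities between each context word and the query keywords, appended to the context word embedding (keyword selection with the dependency parser is not shown, and the function names are illustrative).

```python
import torch
import torch.nn.functional as F

def keyword_features(context_emb: torch.Tensor, keyword_emb: torch.Tensor, k: int = 3) -> torch.Tensor:
    # context_emb: (l_c, d), keyword_emb: (l_k, d)
    c = F.normalize(context_emb, dim=1)
    q = F.normalize(keyword_emb, dim=1)
    sims = c @ q.t()                                   # (l_c, l_k) cosine similarities, Eq. (8)
    top = sims.topk(min(k, sims.size(1)), dim=1).values
    if top.size(1) < k:                                # pad with zeros if fewer than k keywords
        top = F.pad(top, (0, k - top.size(1)))
    return torch.cat([context_emb, top], dim=1)        # appended before resizing to the hidden size
```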

3.4 Data Augmentation

Considering the importance of data in training a good model, we implemented data augmentation in two steps to create a more varied dataset on top of the originally provided training set. Note that we only modified the training dataset, without making any changes to the development dataset, to ensure a fair evaluation of our model's performance. We introduce some notation to simplify the explanation: for each query, assume we have a tuple (c, q, a), where c represents the context, q the query and a the answer.

3.4.1 Backtranslation
In this step, we only modify q and keep c and a unchanged. By inspecting the SQuAD 2.0 dataset, we found that the wording of a question is very important in determining the answer. We therefore used a backtranslation strategy to obtain a different expression for each question without changing its meaning. We used the Cloud Translation API provided by Google for the translation step: we first translated each question from English to Spanish and then back from Spanish to English to obtain a paraphrase. We chose Spanish as the intermediate language because its sentence structure is similar to English, which helps avoid errors caused by different word orders. After this step, the augmented dataset was twice the size of the original dataset.
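A minimal sketch of the backtranslation step is shown below, assuming the Google Cloud Translation v2 Python client with credentials already configured; the exact client settings used in the project are not specified in this report.

```python
from google.cloud import translate_v2 as translate

def backtranslate(question: str, pivot: str = "es") -> str:
    """Paraphrase a question by translating English -> Spanish -> English."""
    client = translate.Client()
    spanish = client.translate(question, target_language=pivot,
                               source_language="en")["translatedText"]
    return client.translate(spanish, target_language="en",
                            source_language=pivot)["translatedText"]

# Example usage (sketch): paraphrased = backtranslate("When was the university founded?")
```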

3.4.2 Adversarial Questions


In the next step, we generated more unanswerable questions to address the new feature of the SQuAD 2.0 dataset that roughly half of the questions are unanswerable. For every originally answerable query, we modified it into an unanswerable query, producing a new tuple (c, q', a'). First, to ensure the correctness of the resulting adversarial sentence, we applied semantics-altering perturbations to the question [Jia and Liang, 2017]. Next, we used antonyms from WordNet [Soergel, 1998] to replace nouns and adjectives. Finally, we replaced named entities and numbers with the nearest word in GloVe word-vector space [Pennington et al., 2014]. The augmentation process is illustrated in Figure 3.

Figure 3: Illustration of Adversarial Questions

After this step, the final augmented dataset was three times the size of the original training dataset (i.e. 200% larger).
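As an illustration of the antonym-substitution step, the sketch below looks up WordNet antonyms through NLTK; the POS filtering for nouns/adjectives and the named-entity/number replacement via GloVe neighbors are not shown, and picking the first available antonym is a simplifying assumption.

```python
from nltk.corpus import wordnet as wn

def antonym(word: str):
    """Return a WordNet antonym of `word`, or None if none is found."""
    for synset in wn.synsets(word):
        for lemma in synset.lemmas():
            if lemma.antonyms():
                return lemma.antonyms()[0].name().replace("_", " ")
    return None

def make_unanswerable(question_tokens):
    """Replace the first token that has a WordNet antonym (POS filtering omitted)."""
    out = list(question_tokens)
    for i, tok in enumerate(out):
        alt = antonym(tok.lower())
        if alt is not None:
            out[i] = alt
            break
    return out
```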

4 Experiments
4.1 Data

We use SQuAD 2.0, a question answering dataset proposed by Rajpurkar et al. [2018]. The dataset consists of context, question and answer triples. The answer is either N/A, which means the question is not answerable given the context, or an excerpt from the passage.

4.2 Evaluation method

The training loss is the sum of the negative log probabilities of the true start and end word indices under the predicted distributions. We evaluate model performance with two measures, EM and F1. EM stands for exact match, which gives a 0 or 1 depending on whether our predicted answer matches one of the correct human-provided answers. F1 is the harmonic mean of precision and recall when we treat answers as bags of words. It is considered more reliable and is taken as the primary metric.
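For concreteness, a minimal sketch of the bag-of-words F1 computation is shown below; the official SQuAD evaluation additionally normalizes the text (lowercasing, stripping punctuation and articles) and takes the maximum over all gold answers, which is omitted here.

```python
from collections import Counter

def f1_score(prediction: str, ground_truth: str) -> float:
    pred_tokens = prediction.split()
    gold_tokens = ground_truth.split()
    common = Counter(pred_tokens) & Counter(gold_tokens)   # multiset intersection
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```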

4.3 Experimental details

For the baseline model, we use the configuration from the default starter code, namely a hidden size of 100 with no dropout. For the modified model, we use a dropout rate of 0.1 in the character embedding part and in all other layers except self-matching attention, which has a higher dropout rate of 0.3. We continue to use 100 as the hidden size. For the optimizer, we use Adadelta [Zeiler, 2012] with a constant learning rate of 0.5. We fix this across the three models to get a fair comparison. We ran experiments for all three models on one NV6 promo GPU; they take roughly 20, 25 and 30 minutes per epoch to train. We train our models for 30 epochs and checkpoint every 50k steps. During evaluation, we pick the best saved model.
As for the QANet-based models, we tried several configurations. To start with, we tried the following versions of QANet: 1 head for the self-attention layer with hidden size 96, 1 head with hidden size 128, and 2 heads with hidden size 128. We followed exactly the same model-architecture set-up as in the paper. For all three versions, we trained the model using the Adadelta optimizer with a constant learning rate of 0.5 and a dropout rate of 0.1 for all layer dropouts. The character dropout rate is 0.05, as described in the QANet paper [Yu et al., 2018]. The other hyperparameters are exactly the same as in the starter code. The average training time per epoch differs with the hidden size: the first model took about 50 minutes per epoch, while with hidden size 128 the average time increased to around 1 hour. In these three versions we did not adopt the stochastic depth method [Huang et al., 2016] mentioned in the paper; instead, we applied dropout layers with probabilities equal to the survival probabilities in the paper. After training these three models, we found that the F1 and EM scores were not high enough, so we implemented the stochastic depth method. We also changed the optimizer to Adam with $\beta_1 = 0.8$, $\beta_2 = 0.999$, $\epsilon = 10^{-7}$. Following the details in the QANet paper, we implemented a warm-up scheme with an inverse exponential increase of the learning rate from 0.0 to 0.001 over the first 1000 steps, after which the learning rate is kept constant. Finally, we trained our first model (1 attention head with hidden size 96) on our augmented dataset. Since the augmented training set is three times the size of the original one, the average training time per epoch increased to 2.5 hours.
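As a reference for the optimizer setup above, the sketch below wires up Adam with the stated betas and epsilon and a LambdaLR warm-up that ramps the learning rate from 0 to 0.001 over the first 1000 steps; the exact shape of the "inverse exponential" ramp is an assumption (a logarithmic ramp is used here), and the `Linear` module is only a stand-in for the actual QANet model.

```python
import math
import torch

base_lr, warmup_steps = 1e-3, 1000
model = torch.nn.Linear(10, 10)  # stand-in module; the project would use the QANet model here

optimizer = torch.optim.Adam(model.parameters(), lr=base_lr,
                             betas=(0.8, 0.999), eps=1e-7)

def warmup_factor(step: int) -> float:
    """Multiplier applied to base_lr: ramps from ~0 to 1, then stays constant."""
    if step >= warmup_steps:
        return 1.0
    return math.log(step + 1) / math.log(warmup_steps + 1)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_factor)

# Training loop (sketch): call scheduler.step() once per optimizer update
# for batch in loader:
#     loss = model(...).sum()
#     loss.backward(); optimizer.step(); optimizer.zero_grad(); scheduler.step()
```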

5 Results

Table 1: BiDAF-Based Model Results on DEV Dataset


Case    Char-Embedding    Self-Attention    Dev-F1    Dev-EM
1       ✗                 ✗                 61.62     58.11
2       ✓                 ✗                 62.79     59.25
3       ✓                 ✓                 63.75     60.63

Table 2: QANet-Based Model Results on DEV Dataset


Case    Head #    Hidden Size    Augmentation    Adam+Stochastic Depth    Dev-F1    Dev-EM
4       1         96             ✗               ✗                        64.37     60.91
5       1         128            ✗               ✗                        63.43     59.45
6       2         128            ✗               ✗                        63.54     59.96
7       1         96             ✓               ✗                        65.21     61.80
8       1         96             ✗               ✓                        64.98     61.44

Table 3: Model Results on TEST Dataset


Case    Model Description                          Test-F1    Test-EM
7       QANet w/ Data Augmentation                 61.83      57.90
8       QANet w/ Adam + Stochastic Depth Method    62.08      58.58
4       QANet                                      60.04      56.28

Table 1 shows our dev-set results for the BiDAF-based models. We see that the character embedding and self-attention do help improve model performance; however, the improvements are less significant than we expected. One possible explanation is that with more complex models, hyperparameter tuning and the choice of optimizer become more important. Table 2 shows our dev-set results for different variants of the QANet model. We see that increasing the number of attention heads and increasing the hidden size do not help. However, adding the stochastic depth method together with the Adam optimizer gives better performance in terms of both F1 and EM. Moreover, training the model on the augmented dataset gives an increase in both F1 and EM. These results fit our intuition: with a very large model like QANet, careful training and more data do help improve performance. Table 3 shows our three best submissions on the test leaderboard. Although the model trained with data augmentation shows the best performance on the dev set, this is not the case on the test set. One explanation is that our augmented data may not be of high quality, since we use backtranslation and some simple word substitutions; hence, the model trained on it may not generalize as well to the real test set.

6 Analysis

6.1 QANet with Data Augmentation

Question: Economy, Energy and Tourism is one of the what?


Context: Subject Committees are established at the beginning of each parliamentary session, and
again the members on each committee reflect the balance of parties across Parliament. Typically
each committee corresponds with one (or more) of the departments (or ministries) of the Scottish
Government. The current Subject Committees in the fourth Session are: Economy, Energy and
Tourism; Education and Culture; Health and Sport; Justice; Local Government and Regeneration;
Rural Affairs, Climate Change and Environment; Welfare Reform; and Infrastructure and Capital
Investment.
Answer: current Subject Committees
QANet w/ Data Augmentation Prediction: N/A
Analysis: Our QANet model with data augmentation predicted N/A for an answerable question. A possible reason is that we augmented the unanswerable questions: originally, roughly half of the questions were unanswerable, but after data augmentation about two thirds were, which may have biased the model toward unanswerable questions and made it more likely to output N/A. A possible solution is to re-balance the proportion of unanswerable questions in the dataset.

6.2 QANet

Question: How many miles is Montpellier from Paris?


Context: Montpellier was among the most important of the 66 "villes de sûreté" that the Edict of
1598 granted to the Huguenots. The city’s political institutions and the university were all handed
over to the Huguenots. Tension with Paris led to a siege by the royal army in 1622. Peace terms
called for the dismantling of the city’s fortifications. A royal citadel was built and the university and
consulate were taken over by the Catholic party. Even before the Edict of Alès (1629), Protestant rule
was dead and the ville de sûreté was no more.[citation needed]
Answer: N/A
Prediction: 66
Analysis: We see that QANet sometimes gives an answer to unanswerable questions. In this case, 66 is not the distance from Montpellier to Paris, but the model gives this answer anyway. One possible explanation is that the model does not understand the query well: although there is no answer to this query in the context, the model still tries to generate some answer. Viewed this way, one possible direction is to better model the semantic relation between the query and the context, i.e. to let the model learn whether the query and the context are related at all.

7 Conclusion

In this project, we worked on the SQuAD challenge and tried to improve model performance on the SQuAD 2.0 dataset. We studied and implemented both BiDAF- and QANet-based models and ran various experiments on them, such as adopting different data augmentation strategies. We were able to achieve better results than the baseline models on the dataset.
We fully admit that there is still much room for improvement in our models. Future work could include fine-tuning the hyperparameters or more trials with different data augmentation strategies. We are also interested in understanding why our QANet model did not achieve the high F1 and EM scores reported by other teams. We learned a lot about NLP during the whole semester and would like to express our sincere gratitude to all the teaching staff.

References
Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization, 2016.
Anand Dhoot and Anchit Gupta. Iterative reasoning with bi-directional attention flow for machine comprehension. https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1184/reports/6908250.pdf.
Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q. Weinberger. Deep networks with stochastic depth. CoRR, abs/1603.09382, 2016. URL http://arxiv.org/abs/1603.09382.
Robin Jia and Percy Liang. Adversarial examples for evaluating reading comprehension systems, 2017.
Lukasz Kaiser, Aidan N. Gomez, and Francois Chollet. Depthwise separable convolutions for neural machine translation. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=S1jBcueAb.
Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pages 55–60, 2014. URL http://www.aclweb.org/anthology/P/P14/P14-5010.
Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar, October 2014. Association for Computational Linguistics. doi: 10.3115/v1/D14-1162. URL https://www.aclweb.org/anthology/D14-1162.
Pranav Rajpurkar, Robin Jia, and Percy Liang. Know what you don't know: Unanswerable questions for SQuAD. In Association for Computational Linguistics (ACL), 2018.
Min Joon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. Bidirectional attention flow for machine comprehension. CoRR, abs/1611.01603, 2016. URL http://arxiv.org/abs/1611.01603.
Dagobert Soergel. WordNet: An electronic lexical database. October 1998.
Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. Highway networks. CoRR, abs/1505.00387, 2015. URL http://arxiv.org/abs/1505.00387.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc., 2017. URL http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf.
Wenhui Wang, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. Gated self-matching networks for reading comprehension and question answering. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 189–198, Vancouver, Canada, July 2017. Association for Computational Linguistics. doi: 10.18653/v1/P17-1018. URL https://www.aclweb.org/anthology/P17-1018.
Wikipedia. Question answering — Wikipedia, the free encyclopedia, 2004. URL https://en.wikipedia.org/wiki/Question_answering. [Online; accessed 22-July-2004].
Adams Wei Yu, David Dohan, Quoc Le, Thang Luong, Rui Zhao, and Kai Chen. Fast and accurate reading comprehension by combining self-attention and convolution. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=B14TlG-RW.
Matthew D. Zeiler. ADADELTA: An adaptive learning rate method. CoRR, abs/1212.5701, 2012. URL http://arxiv.org/abs/1212.5701.

A Appendix
We attach the TensorBoard result plots on the dev dataset for our best models.

A.1 Model Case #: 7

Figure 4: AvNA for Model 7

Figure 5: EM for Model 7

Figure 6: F1 for Model 7

Figure 7: NLL for Model 7

Figure 8: Dev Statistics for Model 8
