Script
During a crisis, Twitter, like other social media platforms such as Reddit or Facebook, becomes a major communication channel. This is because Twitter can help tackle many of the different challenges that uncertain circumstances bring: you can run donation drives, you can warn people about upcoming threats and how to deal with them, and so on. But fake rumors and fake news circulate on Twitter as well, and they can be a major problem. Various reports have shown that false reports spread much further and faster on the platform than the truth. A recent example is the events in Washington, where Trump spread a lot of misinformation using Twitter alone. As the world reacts to the COVID-19 pandemic, we are confronted with an overabundance of virus-related material, and some of this information may be misleading and dangerous. Twitter should mark content that is demonstrably inaccurate or misleading and poses a serious risk of damage (such as increased virus transmission or negative impacts on public health systems). But the problem is that, on average, there are 500 million new tweets every single day. There is no way this huge amount of data can be processed by sheer manpower, so we need machine learning and natural language processing techniques to help us out.
That’s where our project comes in. Our task is to predict several properties of tweets about COVID-19 (although the organizers claim that all the tweets are about COVID, some of them aren’t, but there’s not a lot we can do about that).
Dataset Description
The dataset has 451 different tweets. If a tweet contains a video, an image, or any other link in addition to text, it is simply written as URL, so our task is strictly restricted to text processing. For each tweet, we are given 7 different labels (a hypothetical example record is sketched after this list):
1. The first one is about whether the tweet makes a claim or not.
2. After that we have labels 2 to 5, which are applicable only when the first question is true; if the first question is false, they are not applicable. In the second question, we need to tell whether the tweet contains any false information.
3. In the third question, we need to see if the tweet is of any interest to the general public.
4. Whether the claim made is harmful
5. Whether the claim made needs to be verified by someone
6. Whether the tweet is harmful
7. Whether the tweet requires government attention. An example could be a hospital running out of surgical masks.
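To make the labeling scheme concrete, here is a hypothetical sketch of what one labeled record might look like in Python (the field names are our own illustration, not the organizers' actual schema):

# Hypothetical labeled record; field names are illustrative only.
# Labels 2-5 apply only when label 1 is true.
example_tweet = {
    "text": "Hospitals in our city are running out of surgical masks. URL",
    "q1_makes_claim": True,         # does the tweet make a claim?
    "q2_false_info": False,         # does it contain false information?
    "q3_public_interest": True,     # of interest to the general public?
    "q4_claim_harmful": False,      # is the claim made harmful?
    "q5_needs_verification": True,  # does the claim need verification?
    "q6_tweet_harmful": False,      # is the tweet itself harmful?
    "q7_govt_attention": True,      # does it require government attention?
}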
Examples
Architecture and Method
BERTweet has the same architecture configuration as BERT and has been pre-trained on 850 million English tweets. Currently, BERTweet outperforms the other state-of-the-art models on most tweet NLP tasks, such as part-of-speech tagging, named entity recognition, and text classification. And so, we decided to start all our experiments with BERTweet.
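As a minimal sketch, BERTweet can be loaded through the HuggingFace transformers library via the public vinai/bertweet-base checkpoint; this illustrates the general setup rather than our exact training code:

# Minimal sketch: obtaining a tweet embedding from BERTweet.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base")
bertweet = AutoModel.from_pretrained("vinai/bertweet-base")

inputs = tokenizer("COVID-19 cases are rising again", return_tensors="pt")
outputs = bertweet(**inputs)
embedding = outputs.last_hidden_state[:, 0, :]  # 768-dim first-token embedding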
Transfer Learning
Ideally, BERTweet would only provide us embeddings for the different tweets. But since the dataset on which BERTweet was optimized differs from the dataset on which we optimize our model, those embeddings may not perform that well on their own. If we instead allow BERTweet to fine-tune the embeddings for our specific task, it may give us better results. So this gives us two options to try: either we work with the embeddings directly, or we use transfer learning, and we can see whether the results improve.
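A minimal PyTorch sketch of the two options, assuming the HuggingFace checkpoint from the earlier sketch; whether the encoder parameters receive gradients is what separates them:

import torch
from transformers import AutoModel

bertweet = AutoModel.from_pretrained("vinai/bertweet-base")

# Option 1: embeddings only -- use BERTweet as a frozen feature extractor.
for param in bertweet.parameters():
    param.requires_grad = False

# Option 2: transfer learning -- let the encoder weights be updated too,
# so the embeddings adapt to our specific task.
for param in bertweet.parameters():
    param.requires_grad = True

# The optimizer only updates parameters that still require gradients.
optimizer = torch.optim.Adam(
    (p for p in bertweet.parameters() if p.requires_grad), lr=2e-5)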
Currently, we have a 70-15-15 split between the training, validation, and test datasets. But the dev dataset was released 2-3 days ago, and going forward we plan to incorporate it into our experiments as well. Also, the evaluation will be based on F1-score values.
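For illustration, the 70-15-15 split and the F1 metric could be set up with scikit-learn roughly as follows (the data here is a placeholder, not our actual tweets):

from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# Placeholder data standing in for the 451 tweets and one binary label.
tweets = [f"tweet {i}" for i in range(100)]
labels = [i % 2 for i in range(100)]

# 70% train, then split the remaining 30% evenly into validation and test.
train_x, rest_x, train_y, rest_y = train_test_split(
    tweets, labels, test_size=0.30, random_state=42, stratify=labels)
val_x, test_x, val_y, test_y = train_test_split(
    rest_x, rest_y, test_size=0.50, random_state=42, stratify=rest_y)

predictions = test_y  # stand-in for model output
print(f1_score(test_y, predictions))  # evaluation metric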
Our implementation allows the models for any two questions to be completely different. We exploit this by extensively fine-tuning the model for each question independently.
For BERT, we have two models: one where we only use the BERT embeddings, and a second where we use transfer learning to fine-tune those embeddings. In the first case, we only use the embeddings provided by BERTweet and don’t fine-tune them for our task. However, this model performs relatively poorly, because the tasks for which BERTweet was initially trained might have been different from what we are using it for now. The second one is the transfer-learning-based model, where we allow the embeddings to be fine-tuned depending on how the model is performing. This is our best model so far, and it outperforms both of the other models by a significant margin (in terms of F1-score). In the architecture, FC1 and FC2 denote fully connected layers, Activation denotes the activation function we can use, and Optimizer denotes the optimizer we can use. We can have a variable number of such fully connected layers, and the number of perceptron units in between can be tuned as well, given that the input and output sizes are fixed at 768 and 1 respectively. For the activation function in the hidden layers, we can choose from ReLU, Leaky ReLU, etc. The optimizer can be Adam or Adafactor.
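A sketch of such a classification head in PyTorch, under the constraints above (768-dim input, 1-dim output, a tunable stack of hidden layers and a tunable activation); the hidden sizes shown are illustrative:

import torch
import torch.nn as nn

class ClassifierHead(nn.Module):
    # A variable-depth stack of fully connected layers (FC1, FC2, ...)
    # from the 768-dim embedding down to a single output logit.
    def __init__(self, hidden_sizes=(256, 64), activation=nn.ReLU):
        super().__init__()
        layers, in_dim = [], 768             # input size fixed at 768
        for h in hidden_sizes:
            layers += [nn.Linear(in_dim, h), activation()]
            in_dim = h
        layers.append(nn.Linear(in_dim, 1))  # output size fixed at 1
        self.net = nn.Sequential(*layers)

    def forward(self, embedding):
        return self.net(embedding)  # raw logit; apply sigmoid for a probability

head = ClassifierHead(activation=nn.LeakyReLU)
optimizer = torch.optim.Adam(head.parameters())  # Adafactor is the alternative
logits = head(torch.randn(4, 768))               # a batch of 4 tweet embeddings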
We also tried ELMo with transfer learning, to see how it stands relative to the BERT models. The transfer-learning architecture is the same as for BERT, to maintain consistency.
Comparing the models, you can see that BERT with transfer learning outperforms both of the other models by a significant margin. It is only on Q5 that the gap between it and the other two models isn’t that large, but it still beats all the other models on every label.
Going forward
1. Dependency across questions - Currently, we have only used the labels for Q1 to help in predicting the labels for Q2 to Q5. Using Q6 and Q7 as well may lead to better performance; we need to see how it goes.
2. Dev data - The organizers have released 50 additional examples. Hence, we will re-run our experiments with this dev data in order to have more training data: we can train and validate on the data we initially had, and test on the dev data.
3. Cross-validation - Until now, we have kept the validation data fixed. Going forward, we will adopt a cross-validation strategy (see the sketch after this list), as it may provide better insight into the performance of our algorithms.
4. All-BERT Ensemble Model - We will try an ensemble of our best BERTweet-based models and see whether it improves performance.
5. GloVe Twitter Embeddings based model - We found GloVe embeddings trained on close to 2B tweets. These may give better performance, so we will experiment with them.
6. BERT + GloVe Ensemble Model (depending on the results of the GloVe model)
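The cross-validation mentioned in item 3 could look roughly like this with scikit-learn's StratifiedKFold (placeholder data; the fold count is a hypothetical choice):

import numpy as np
from sklearn.model_selection import StratifiedKFold

texts = np.array([f"tweet {i}" for i in range(100)])  # placeholder tweets
y = np.array([i % 2 for i in range(100)])             # placeholder binary label

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(skf.split(texts, y)):
    # Train on texts[train_idx], validate on texts[val_idx].
    print(f"fold {fold}: {len(train_idx)} train / {len(val_idx)} val")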
Naman
Slide 5:
Now we will look at some of the architectures that we used in our experiments.
Transformer -
Apart from ELMo, we have used BERT, a transformer-based architecture. A transformer has two components, an encoder and a decoder: the encoder transforms the input into a high-dimensional representation, and the decoder transforms that representation into the output. One of the key concepts used in the transformer is attention. The attention mechanism looks at an input sequence and decides at each step which other parts of the sequence are important. Apart from this, the transformer takes the whole sentence as input and encodes each word together with its position, in contrast to an RNN, which takes in one word at a time.
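To make the attention idea concrete, here is a minimal sketch of scaled dot-product self-attention (simplified to a single head, with no masking or learned projections):

import torch
import torch.nn.functional as F

def attention(q, k, v):
    d_k = q.size(-1)
    # Score how relevant every other position is to each position...
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5
    weights = F.softmax(scores, dim=-1)
    # ...then take a weighted average of the value vectors.
    return weights @ v

seq_len, d_model = 5, 16
x = torch.randn(seq_len, d_model)  # one encoded sentence, all words at once
out = attention(x, x, x)           # self-attention: q, k, v from the same sequence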
BERT
BERT is a transformer-based model; it basically contains a stack of transformer encoder modules. It has many variants, two of which are shown in Fig. 3, where BERT-base and BERT-large contain 12 and 24 encoder modules respectively.
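For reference, the depth difference between the two variants can be checked directly from their published configurations (a quick sketch using the HuggingFace transformers library):

from transformers import BertConfig

base = BertConfig.from_pretrained("bert-base-uncased")
large = BertConfig.from_pretrained("bert-large-uncased")

print(base.num_hidden_layers, base.hidden_size)    # 12 encoder modules, 768-dim
print(large.num_hidden_layers, large.hidden_size)  # 24 encoder modules, 1024-dim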