NLP Summary

The document discusses several methods for representing word meaning, including distributional semantics models that correlate word contexts with conceptual knowledge. It describes count-based and predict-based models for learning distributional semantics through word vectors, and evaluates techniques like word2vec. The document also covers part-of-speech tagging using techniques like Hidden Markov Models and RNNs, and image captioning using retrieval-based, template-based, and neural methods.

Lexical semantics: how the meaning of words can be learned

Distributional semantics: a linguistic theory for representing word meaning based on contexts of use
- Distributional semantics models correlate with how the brain stores conceptual knowledge
- The goal is to find contexts that describe the meaning of a word, and to find words with similar meaning via similar contexts
Formal semantics: handles negation and quantification, but defining concepts through enumeration of all of their features is highly problematic
WordNet: a thesaurus with lexical relations such as hypernymy (X is a kind of Y)
- Problems: subjective, because humans build it
- Misses new meanings
- Cannot compute graded word similarity
Words as one-hot vectors: no similarity as vectors are orthogonal

Representing words by their contexts works better
- Contexts are the dimensions of the vector
- Words exist as points in a vector space

Learning distributional semantics / the vector space
- Count-based models:
  - frequency of co-occurrence of a target word and a context
  - vectors are long and sparse
  - dimensions correspond to the words in the vocabulary → interpretable
- Predict-based models:
  - try to predict plausible contexts for a word
  - learn word representations in the process (dense)
  - neural network
  - short and dense vectors with latent dimensions
- GloVe: a count-based method with the benefits of predict-based models

Word windows can be unfiltered, filtered (removing uninformative words), or lexeme-based (using only stems)

TODO

Two models:
- Binary model: 1 if context word c and target word w occur together, 0 otherwise
- Basic frequency model: how often c and w occur together
  - Not a good representation, because frequent words get high counts anyway
Collocations: words or terms that co-occur more often than would be expected by chance
- Pointwise mutual information (PMI): the probability that the words occur together, normalized by the probability of each word occurring individually: PMI(w, c) = log2( P(w, c) / (P(w) P(c)) )
- Problem: over-sensitive to infrequent words
- Solution: set negative PMI values to zero (PPMI), because infrequent words do not give estimates that are accurate enough for negative PMI

De-Zipfianising with PPMI: reduces the weight of frequent words, flattening the original
Zipfian distribution. We pay less attention to frequent events, more to infrequent ones.
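A minimal sketch (not from the source) of a count-based model with PPMI weighting; the toy corpus, the window size, and all names are illustrative assumptions:

```python
import math
from collections import Counter, defaultdict

# Toy corpus and symmetric word window (both are illustrative assumptions).
corpus = "the cat sat on the mat the dog sat on the rug".split()
window = 2

# Count target-context co-occurrences within the window.
cooc = defaultdict(Counter)
for i, w in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if i != j:
            cooc[w][corpus[j]] += 1

total = sum(sum(c.values()) for c in cooc.values())
p_w = {w: sum(c.values()) / total for w, c in cooc.items()}

def ppmi(w, c):
    """Positive PMI: log2(P(w,c) / (P(w)P(c))), with negative values clipped to 0."""
    p_wc = cooc[w][c] / total
    if p_wc == 0:
        return 0.0
    return max(0.0, math.log2(p_wc / (p_w[w] * p_w[c])))

print(ppmi("cat", "sat"), ppmi("cat", "rug"))
```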

Calculate similarity
- Euclidean distance
- Dot product
- The normalized dot product (cosine similarity) is the better option, because some vectors are longer than others
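A small sketch contrasting the raw dot product with cosine similarity; the vectors are made-up toy values:

```python
import numpy as np

# Toy count vectors of different lengths (illustrative values).
v_cat = np.array([4.0, 2.0, 0.0, 1.0])
v_dog = np.array([8.0, 4.0, 0.0, 2.0])   # same direction, but twice as long
v_car = np.array([0.0, 1.0, 5.0, 3.0])

def cosine(a, b):
    # Normalized dot product: length differences no longer dominate.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(np.dot(v_cat, v_dog), np.dot(v_cat, v_car))   # raw dot products favour long vectors
print(cosine(v_cat, v_dog), cosine(v_cat, v_car))   # cosine of cat/dog is 1.0
```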

Similarity vs relatedness: similar words are interchangeable in many contexts, while related words merely occur in the same contexts
→ similarity is difficult to learn from context alone

Compare the model's similarity ratings with human ratings using Spearman rank correlation → currently around 0.7
Capture analogies via vector offsets → king − man + woman ≈ queen
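A hedged sketch of the vector-offset idea on made-up embeddings; with real word2vec vectors one would search the full vocabulary for the nearest neighbour of king − man + woman:

```python
import numpy as np

# Tiny made-up embeddings, only to show the offset arithmetic.
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.2, 0.8]),
}

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

target = emb["king"] - emb["man"] + emb["woman"]
# Nearest neighbour (excluding the query words) should be "queen".
best = max((w for w in emb if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(emb[w], target))
print(best)
```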

Hyponym: a specific instance of a category → best learned in a supervised way

Antonym: opposite meaning → best learned in a supervised way

Count vectors are long, so it is better to make them shorter
- Reduction techniques
  - Singular Value Decomposition (SVD): dimensions are reduced by exploiting redundancies in the data (see the sketch after this list)
    - efficient (200-500 dimensions)
    - the reduced SVD dimensions are not interpretable
  - Principal Component Analysis (PCA), Non-negative Matrix Factorization (NMF): use a different factorization
- Predict-based models
- GloVe
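A minimal sketch of shortening vectors with truncated SVD in numpy; the matrix here is a random stand-in for a PPMI-weighted co-occurrence matrix, and k is an assumed target dimensionality:

```python
import numpy as np

# Assume M is a (vocab x context) PPMI matrix; here a small random stand-in.
rng = np.random.default_rng(0)
M = rng.random((1000, 5000))

k = 300                                 # typical target dimensionality (200-500)
U, S, Vt = np.linalg.svd(M, full_matrices=False)

# Keep only the top-k singular dimensions; rows of W are the dense word vectors.
W = U[:, :k] * S[:k]
print(W.shape)                          # (1000, 300)
```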

Word2vec: predict the context from a word, or the word from its context

Negative sampling: only a small portion of the weights (the true context word plus a few sampled negative words) is updated in every learning step (sketched after the training process below)
- Train a neural network to predict neighbouring words
- Learn dense embeddings for the words in the training corpus in the process
- Skip-gram: predict the distribution (probability) of context words from the center word
- CBOW (continuous bag of words): predict the center word from its context

Training process:
● Go through the text and, for each position, take the center word w and a context word c
● Calculate the probability of c given w (skip-gram) or w given c (CBOW)
● Adjust the word vectors to maximize this probability
● Two vectors are learnt for each word w
○ When the word is in the center: word (target) vector v (input vector)
○ When the word is in the context: context vector c (output vector)
● Use the dot product to calculate the similarity between the w and c vectors
● For learning, we must learn the weights from the words to the middle layer and the weights from the middle layer to the output (context) layer

Two vectors are used because that makes optimization easier; both are averaged at the end
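A hedged numpy sketch of a single skip-gram update with negative sampling: the true (center, context) pair is pushed together and a few sampled negative words are pushed apart, so only those vectors are touched. The toy vocabulary, sampling scheme, and hyperparameters are assumptions (real word2vec samples negatives from a smoothed unigram distribution):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "mat", "dog"]
idx = {w: i for i, w in enumerate(vocab)}

dim = 50
V = rng.normal(scale=0.1, size=(len(vocab), dim))    # target (input) vectors
C = rng.normal(scale=0.1, size=(len(vocab), dim))    # context (output) vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_update(center, context, k=3, lr=0.05):
    """One stochastic update for a (center, context) pair with k sampled negatives."""
    negatives = rng.choice(len(vocab), size=k)        # toy uniform sampling
    for c_id, label in [(idx[context], 1.0)] + [(int(n), 0.0) for n in negatives]:
        v, c = V[idx[center]].copy(), C[c_id].copy()
        grad = sigmoid(v @ c) - label                 # derivative of the logistic loss w.r.t. the score
        C[c_id] -= lr * grad * v                      # only this context/negative vector is updated
        V[idx[center]] -= lr * grad * c
    # Only the positive context word and the k negatives are touched each step.

sgns_update("cat", "sat")
print(sigmoid(V[idx["cat"]] @ C[idx["sat"]]))         # probability of the true pair rises with training
```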

Neural Language Models


Approximate the probability of a word given prior context
- Prior context is represented by vectors or "embeddings"
- Input embeddings could be learnt simultaneously while training the network
- Could be learnt separately e.g. using word2vec etc. (pretraining)

Language model in NLP (narrow sense): assign a probability to a sentence or word sequence
- Helps with speech-to-text: which words were likely spoken, "I scream" or "ice cream"
- Machine translation: getting the word order right in the other language
Estimating sentence probability: counting entire sentences does not work, but the probability of short word sequences does → Chain Rule of Probability Theory
- Bag-of-words / unigram model
- Bigram, trigram, etc.
- Also called Markov models of order 0, 1, 2
Estimating these probabilities from counts is called Relative-Frequency Estimation / Maximum-Likelihood Estimation (MLE) (see the sketch below)
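A small sketch of relative-frequency (MLE) estimation for a bigram model; the toy sentences and the <s>/</s> boundary markers are assumptions, and no smoothing is applied:

```python
from collections import Counter

# Toy training data (illustrative); <s> and </s> mark sentence boundaries.
sentences = [["<s>", "i", "scream", "</s>"],
             ["<s>", "i", "like", "ice", "cream", "</s>"]]

unigrams, bigrams = Counter(), Counter()
for sent in sentences:
    unigrams.update(sent)
    bigrams.update(zip(sent, sent[1:]))

def p_bigram(w, prev):
    """MLE estimate: count(prev, w) / count(prev)."""
    return bigrams[(prev, w)] / unigrams[prev]

def p_sentence(sent):
    """Chain rule with a first-order Markov assumption."""
    p = 1.0
    for prev, w in zip(sent, sent[1:]):
        p *= p_bigram(w, prev)
    return p

print(p_sentence(["<s>", "i", "scream", "</s>"]))   # 0.5 on this toy data
```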

Part-of-speech (POS) tags

A fast way to grammatically analyze text
- Useful in text classification and authorship identification

Classifying words
- Based on what the word refers to → semantic criteria
- Based on the form of the word → formal criteria
- Based on the contexts in which the word occurs → distributional criteria
POS tagging is hard because of ambiguity and multiple word meanings
Also because of sparse data: words not seen before, or word-tag pairs not seen before

Three approaches to tagging:
- Rule-based: not flexible
- Stochastic / probabilistic models for sequence tagging (Hidden Markov Models)
- RNNs
Strategies for tagging:
- Most common tag / unigram tagging: ~90% correct
- Bigram tagging → also look at the previous tag and estimate the follow-up probability
  - Problem: one wrong POS tag can cause further wrong tags
- Trigram tagging might also work, but the frequencies can become too sparse to be useful
Solution: we want to find the overall most likely tag sequence, with a stochastic approach (HMM) or an RNN
- Viterbi algorithm to find the best tag sequence
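A compact sketch of Viterbi decoding for an HMM tagger; the tag set and the start/transition/emission probabilities are made-up toy values rather than estimates from data:

```python
# Toy HMM: tags, start/transition/emission probabilities (illustrative values).
tags = ["DET", "NOUN", "VERB"]
start = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
trans = {"DET":  {"DET": 0.0, "NOUN": 0.9, "VERB": 0.1},
         "NOUN": {"DET": 0.1, "NOUN": 0.3, "VERB": 0.6},
         "VERB": {"DET": 0.5, "NOUN": 0.4, "VERB": 0.1}}
emit  = {"DET":  {"the": 0.9, "dog": 0.0, "barks": 0.0},
         "NOUN": {"the": 0.0, "dog": 0.8, "barks": 0.2},
         "VERB": {"the": 0.0, "dog": 0.1, "barks": 0.9}}

def viterbi(words):
    """Return the most likely tag sequence for the sentence."""
    # v[i][t] = best probability of any tag sequence ending in tag t at position i
    v = [{t: start[t] * emit[t][words[0]] for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        v.append({})
        back.append({})
        for t in tags:
            best_prev = max(tags, key=lambda p: v[i - 1][p] * trans[p][t])
            v[i][t] = v[i - 1][best_prev] * trans[best_prev][t] * emit[t][words[i]]
            back[i][t] = best_prev
    # Follow the backpointers from the best final tag.
    last = max(tags, key=lambda t: v[-1][t])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))

print(viterbi(["the", "dog", "barks"]))   # expected: ['DET', 'NOUN', 'VERB']
```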

RNNs are better than feed-forward networks, which have no persistence and forget old input once new input arrives
In an RNN the input is combined with the hidden-layer activation from the previous time step
- Inputs to the RNN are pre-trained word embeddings
- The output at each time step is a distribution over the POS tag set, generated by a softmax
- To get the tags, select the most likely tag at each position
  - This is not necessarily the best tag sequence → Viterbi can choose the best sequence
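A hedged PyTorch sketch of an RNN tagger: word embeddings go through an LSTM and a softmax over the tag set is produced at every time step. Vocabulary size, tag-set size, and dimensions are illustrative, and the embedding layer could be initialized with pre-trained vectors:

```python
import torch
import torch.nn as nn

class RNNTagger(nn.Module):
    def __init__(self, vocab_size=10000, tagset_size=17, emb_dim=100, hidden_dim=128):
        super().__init__()
        # In practice the embedding weights could be copied from pre-trained word2vec/GloVe.
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, tagset_size)

    def forward(self, word_ids):
        # word_ids: (batch, seq_len) -> scores: (batch, seq_len, tagset_size)
        h, _ = self.rnn(self.embed(word_ids))
        return self.out(h)

tagger = RNNTagger()
sentence = torch.randint(0, 10000, (1, 6))        # one sentence of 6 word ids
tag_probs = torch.softmax(tagger(sentence), dim=-1)
greedy_tags = tag_probs.argmax(dim=-1)            # most likely tag at each position
print(greedy_tags.shape)                          # torch.Size([1, 6])
```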

Encoder-decoder: can be used to map an input sequence to an output sequence (e.g. completing a sentence, or producing a verb's past-tense form)
- The encoder RNN reads each symbol of the input string one at a time
- The decoder RNN produces a sequence of output phonemes one at a time
- An attention mechanism looks back at the encoder states as needed
- Decoding ends when a stop symbol is produced
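A hedged PyTorch sketch of the encoder-decoder idea for character sequences (attention is omitted for brevity); all sizes and names are assumptions and this is not the exact setup of the papers below:

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, n_chars=60, emb_dim=100, hidden_dim=100):
        super().__init__()
        self.embed = nn.Embedding(n_chars, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
        # Decoder is unidirectional; the encoder's final states initialize its state.
        self.decoder = nn.LSTM(emb_dim, 2 * hidden_dim, batch_first=True)
        self.out = nn.Linear(2 * hidden_dim, n_chars)

    def forward(self, src_ids, tgt_ids):
        _, (h, c) = self.encoder(self.embed(src_ids))
        # Concatenate the two directions of the (single-layer) encoder.
        h = torch.cat([h[0], h[1]], dim=-1).unsqueeze(0)
        c = torch.cat([c[0], c[1]], dim=-1).unsqueeze(0)
        dec_out, _ = self.decoder(self.embed(tgt_ids), (h, c))
        return self.out(dec_out)       # scores over output symbols at each step

model = Seq2Seq()
src = torch.randint(0, 60, (1, 5))     # e.g. the characters of an input verb
tgt = torch.randint(0, 60, (1, 6))     # shifted target characters during training
print(model(src, tgt).shape)           # torch.Size([1, 6, 60])
```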

Experiments:
Kirov and Cotterell (2018)
- Encoder is a bi-directional LSTM with 2 layers and 100 hidden units
- Decoder is a uni-directional LSTM with 1 layer and 100 hidden units
- Each character has an embedding size of 100, and training is done for 100 epochs

Results:
Kirov and Cotterell (2018):
- The model learns to conjugate all verbs seen in the training set, including irregulars
- There are no blend errors of the sort eat → ated
- Accuracy is lower for irregular verbs, due to overregularisation (throw → throwed)
- The macro U-shaped learning curve is not observed

Corkery, Matusevych, Goldwater (2019)
- Reproduce the broad results above
- Go into depth on "wug-testing"
- Behaviour on nonce verbs does not correlate with human experimental data
- Earlier rule-based models (Albright and Hayes, 2003) perform better
- The model overproduces irregulars, while humans prefer the regular form

Modern encoder-decoder architectures overcome some of the flaws of earlier networks on this task, but their fit to human acquisition data and patterns is weak.

Image captioning

Retrieval-based captioning: find the closest matching image in a database and copy its caption
Template-based captioning: create a representation of the objects in the image and translate it into language

Comparing generated captions to ground truth ones


BLEU
- How much of the generated caption is found in the ground-truth one
- Count the number of overlapping n-grams (see the sketch after the limitations list)
BLEU limitations:
- Synonyms are penalized
- Extra information in the generated caption is not considered
- Recall (information that is only in the ground truth) is not penalized
- Fluency and grammatical correctness are not taken into account
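A simplified sketch of the n-gram overlap counting behind BLEU-style scoring (clipped n-gram precision only; real BLEU combines n = 1..4 with a brevity penalty). The example captions are made up:

```python
from collections import Counter

def ngrams(tokens, n):
    return Counter(zip(*(tokens[i:] for i in range(n))))

def clipped_precision(candidate, reference, n):
    """Fraction of candidate n-grams also found in the reference (counts clipped)."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum(min(count, ref[g]) for g, count in cand.items())
    return overlap / max(1, sum(cand.values()))

generated = "a dog is running on the grass".split()
ground_truth = "a dog runs across the grass".split()

print(clipped_precision(generated, ground_truth, 1))   # unigram precision
print(clipped_precision(generated, ground_truth, 2))   # bigram precision
# Note: "runs" vs "running" gets no credit -> synonyms/inflections are penalized.
```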
Other token-matching-based metrics
- ROUGE: BLEU + recall
- METEOR: n-grams plus synonyms and different phrasings
- CIDEr: n-grams + takes the amount of information in the generated caption into account
- BERTScore: overlap calculation with tokens matched by cosine similarity of BERT embeddings

Limitations of systems trained on standard datasets (SOTA)


- Prone to hallucinations
- Unable to pick up pragmatic abnormalities → blue banana
- Cannot utilize common world knowledge
- Produce captions that are generic and uncontextualized

Contextualized image captioning


- Added world knowledge
- Added related text knowledge (article)
- Geographical knowledge, such as maps, is added

Challenges of contextualized image captioning

- Identifying information relevant to an image in external data sources
- Representing external knowledge in a way that is useful for the captioning process → distributional word embeddings are not great for named entities, especially rare ones
- Adapting the caption generation pipeline to produce contextualized captions informed by the external knowledge
  - Template approach: the caption is generated with placeholders that are filled in with the most suitable names from the article
  - Additional context approach: encoding the external information and using it as extra input for the caption generator, e.g. as an extra input word for the decoder
