NLP Summary
Word windows can be unfiltered, filtered (removing stop words), or lexeme-based: using only
stems
TODO
Two models:
- Binary model: 1 if the context word and target word occur together, else 0
- Basic frequency model: the frequency with which c and w occur together
- Not a good representation because frequent words have higher counts anyway
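A minimal sketch of the two count-based models, assuming a toy tokenised corpus and a symmetric window of size 2 (corpus and window size are made up for illustration):

```python
from collections import defaultdict

# Toy corpus and window size are illustrative assumptions.
corpus = "the cat sat on the mat the dog sat on the rug".split()
WINDOW = 2

freq = defaultdict(int)    # basic frequency model: co-occurrence counts
binary = defaultdict(int)  # binary model: 1 if the pair ever co-occurs

for i, target in enumerate(corpus):
    lo, hi = max(0, i - WINDOW), min(len(corpus), i + WINDOW + 1)
    for j in range(lo, hi):
        if j == i:
            continue
        context = corpus[j]
        freq[(target, context)] += 1
        binary[(target, context)] = 1

# Frequent words such as "the" dominate the raw counts,
# which is why the frequency model alone is a poor representation.
print(freq[("sat", "on")], binary[("sat", "on")])
```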
Collocations: words or terms that co-occur more often than would be expected by chance
- Pointwise mutual information (PMI): the joint probability of the two words, normalized
by the product of their individual probabilities
- Problem: over-sensitive to infrequent words
- Solution: set negative PMI values to zero (Positive PMI, PPMI), because infrequent words
do not give reliable enough counts to estimate negative PMI
De-Zipfianising with PPMI: reduces the weight of frequent words, flattening the original
Zipfian distribution. We pay less attention to frequent events, more to infrequent ones.
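A small PPMI sketch over a made-up co-occurrence matrix (the vocabulary and counts are illustrative, not taken from any real corpus):

```python
import numpy as np

# Toy co-occurrence matrix: rows = target words, columns = context words.
vocab = ["the", "cat", "dog", "sat"]
C = np.array([[0., 6., 4., 8.],
              [6., 0., 0., 2.],
              [4., 0., 0., 2.],
              [8., 2., 2., 0.]])

total = C.sum()
p_wc = C / total                             # joint probability P(w, c)
p_w = C.sum(axis=1, keepdims=True) / total   # P(w)
p_c = C.sum(axis=0, keepdims=True) / total   # P(c)

with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log2(p_wc / (p_w * p_c))

# PPMI: negative (and undefined) PMI values are clipped to zero,
# since rare events give unreliable negative estimates.
ppmi = np.where(np.isfinite(pmi) & (pmi > 0), pmi, 0.0)
print(np.round(ppmi, 2))
```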
Calculate similarity
- Euclidean distance
- Dot product
- Normalized option is better because some vectors are longer than others → cosine similarity
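A quick sketch contrasting the dot product with cosine similarity; the word vectors below are hypothetical:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: dot product of the length-normalised vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical word vectors, just for illustration.
cat = np.array([2.0, 0.5, 1.0])
dog = np.array([4.0, 1.0, 2.0])   # same direction as "cat", twice as long

print(np.dot(cat, dog))           # dot product rewards longer vectors
print(cosine(cat, dog))           # cosine ignores length: 1.0 here
```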
Similarity vs relatedness: similar words are interchangeable in many contexts; related words merely occur in the same contexts
→ similarity is difficult to get from context alone
Compare similarity ratings of the model vs. humans with Spearman rank correlation → currently around 0.7
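A minimal evaluation sketch using scipy.stats.spearmanr; the word-pair scores below are invented for illustration:

```python
from scipy.stats import spearmanr

# Hypothetical similarity scores for the same word pairs:
# one list from human judgements, one from the model.
human = [9.5, 7.0, 3.5, 1.0, 6.0]
model = [0.91, 0.66, 0.40, 0.05, 0.70]

rho, p_value = spearmanr(human, model)
# Toy rho here; the notes quote roughly 0.7 for current models on real datasets.
print(rho)
```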
Capture analogy via vector offsets → king/queen, man/woman
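A toy sketch of the vector-offset method; the embeddings and the analogy helper are hypothetical, and real analogies require trained vectors:

```python
import numpy as np

# Hypothetical embedding lookup; values are made up for illustration.
emb = {
    "king":  np.array([0.8, 0.7, 0.1]),
    "queen": np.array([0.8, 0.1, 0.7]),
    "man":   np.array([0.2, 0.9, 0.1]),
    "woman": np.array([0.2, 0.1, 0.8]),
}

def analogy(a, b, c):
    """Solve a : b :: c : ? via the vector offset b - a + c."""
    target = emb[b] - emb[a] + emb[c]
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates,
               key=lambda w: np.dot(emb[w], target) /
                             (np.linalg.norm(emb[w]) * np.linalg.norm(target)))

print(analogy("man", "king", "woman"))  # ideally "queen"
```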
Training process:
● Go through a text and for each position find word w and context c
● Calculate probability of w given c and c given w
● Adjust word vectors to maximize probability
● Two vectors are learnt for each word w
○ When word in center: word w or target vector v (input vector)
○ When word in context: context vector c (output vector)
Two vectors are used because that makes optimization easier; both are averaged at the end
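A rough sketch of the skip-gram-style probability P(c | w) with separate target and context vector tables; the vocabulary, dimensions, and random vectors are toy assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "mat"]
dim = 8

# Two vector tables per word, as described above:
# V = target (input) vectors, U = context (output) vectors.
V = rng.normal(scale=0.1, size=(len(vocab), dim))
U = rng.normal(scale=0.1, size=(len(vocab), dim))

def p_context_given_target(c_idx, w_idx):
    """Softmax over all context vectors: P(c | w) = exp(u_c . v_w) / sum_c' exp(u_c' . v_w)."""
    scores = U @ V[w_idx]
    scores -= scores.max()          # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[c_idx]

# Training nudges V and U so that observed (w, c) pairs get high probability;
# the final embedding of a word is often the average of its two vectors.
print(p_context_given_target(vocab.index("cat"), vocab.index("the")))
```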
Language model in NLP (narrow sense): assigns a probability to a sentence or word sequence
- Helps with speech-to-text: which word sequence was likely spoken ("I scream" vs. "ice cream")
- Machine translation: choosing a plausible word order in the target language
Estimating sentence probability: counting whole sentences does not work (too sparse), but
conditional probabilities over short word histories do → chain rule of probability theory
- Bag of words/unigram model
- Bigram, trigram etc
- Also called Markov models of 0th, 1st, 2nd order
This is called Relative-Frequency Estimation / Maximum-Likelihood Estimation (MLE)
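A small sketch of bigram relative-frequency (MLE) estimation and the chain rule under a first-order Markov assumption; the corpus and sentence markers are made up for illustration:

```python
from collections import Counter

# Toy corpus with sentence boundary markers; purely illustrative.
sentences = [["<s>", "i", "scream", "</s>"],
             ["<s>", "i", "like", "ice", "cream", "</s>"],
             ["<s>", "i", "scream", "for", "ice", "cream", "</s>"]]

unigrams = Counter(w for s in sentences for w in s)
bigrams = Counter((s[i], s[i + 1]) for s in sentences for i in range(len(s) - 1))

def p_bigram(w, prev):
    """Relative-frequency / MLE estimate: P(w | prev) = count(prev, w) / count(prev)."""
    return bigrams[(prev, w)] / unigrams[prev]

# Chain rule with a first-order Markov assumption:
# P(<s> i scream </s>) = P(i|<s>) * P(scream|i) * P(</s>|scream)
p = 1.0
sent = ["<s>", "i", "scream", "</s>"]
for prev, w in zip(sent, sent[1:]):
    p *= p_bigram(w, prev)
print(p)
```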
Classifying words
- Based on what word refers to → semantic criteria
- What is the form of word → formal criteria
- In what context does word occur → distributional criteria
POS tagging is hard because of ambiguity: many words have multiple possible tags/meanings
Also because of sparse data: words not seen before, or word-tag pairs not seen before
RNNs are better than FF networks because FF networks have no persistence and forget old input
once a new one arrives
In an RNN the current input is combined with the hidden-layer activation from the previous
time step
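A minimal sketch of one RNN time step; the dimensions, random weights, and the `step` helper are toy assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid = 4, 3                      # assumed sizes for illustration

W_x = rng.normal(size=(d_hid, d_in))    # input-to-hidden weights
W_h = rng.normal(size=(d_hid, d_hid))   # hidden-to-hidden (recurrent) weights
b = np.zeros(d_hid)

def step(x_t, h_prev):
    """One RNN time step: the new state mixes the current input
    with the previous hidden activation."""
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

h = np.zeros(d_hid)
for x_t in rng.normal(size=(5, d_in)):  # a sequence of 5 dummy inputs
    h = step(x_t, h)
print(h)
```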
- Inputs to RNN are pre-trained word embeddings
- Output at each time step is a distribution over the POS tag set, generated by softmax
- To get a tag, select the most likely tag at each position
- Not necessarily the best tag sequence → Viterbi can choose the best sequence (see the sketch below)
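A small sketch of greedy per-step tagging vs. Viterbi decoding; the tag set, per-step scores, and transition probabilities are invented toy values:

```python
import numpy as np

# Assumed toy quantities: per-step tag scores from a tagger (e.g. log-softmax
# output of an RNN) and log-probabilities of tag-to-tag transitions.
tags = ["DET", "NOUN", "VERB"]
emissions = np.log(np.array([[0.7, 0.2, 0.1],   # step 1
                             [0.1, 0.4, 0.5],   # step 2
                             [0.2, 0.5, 0.3]])) # step 3
transitions = np.log(np.array([[0.05, 0.9, 0.05],
                               [0.1,  0.2, 0.7],
                               [0.4,  0.5, 0.1]]))

def viterbi(emissions, transitions):
    """Best tag sequence under per-step scores plus transition scores."""
    T, K = emissions.shape
    score = np.zeros((T, K))
    back = np.zeros((T, K), dtype=int)
    score[0] = emissions[0]
    for t in range(1, T):
        cand = score[t - 1][:, None] + transitions + emissions[t]
        back[t] = cand.argmax(axis=0)
        score[t] = cand.max(axis=0)
    path = [int(score[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return list(reversed(path))

greedy = emissions.argmax(axis=1)      # most likely tag at each point
best = viterbi(emissions, transitions) # best overall sequence
# Here greedy gives DET VERB NOUN, while Viterbi prefers DET NOUN VERB.
print([tags[i] for i in greedy], [tags[i] for i in best])
```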
Experiments:
Kirov and Cotterell (2018)
- Encoder is a bi-directional LSTM with 2 layers and 100 hidden units
- Decoder is a uni-directional LSTM with 1 layer and 100 hidden units
- Each character has embedding size of 100, and training is done for 100 epochs
Results:
Kirov and Cotterell (2018):
- learns to conjugate all verbs seen in the training set, including irregulars.
- There are no blend errors of the sort eat → ated
- Accuracy is lower for irregular verbs, due to overregularisation (throw → throwed)
- The macro U-shaped curve is not observed