EUC1502 Module6 TextualAnalysis
Outline
Introduction to textual analysis
● Source code
Tasks
● Terms extraction
● Classification & categorisation
● Clustering
● Association (of concepts)
Tasks
- Terms extraction
Tasks
- Classification & categorisation
Tasks
- Clustering
Tasks
- Association (of concepts)
Preprocessing and feature selection
- Stemming
consigned, consigning, consignment → (stemming) → consign
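A quick way to reproduce this mapping is with an off-the-shelf stemmer; the sketch below uses NLTK's Porter stemmer, which is only one possible choice and not necessarily the algorithm behind the slide's example.

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["consigned", "consigning", "consignment"]:
    # Suffix stripping reduces all three inflected forms to the same stem.
    print(word, "->", stemmer.stem(word))   # consign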
Preprocessing and feature selection
- Stopwords
Preprocessing and feature selection
- Information Gain
● T: training examples
● a: attribute
● val(a): value of attribute a
● x: example
● y: class
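With this notation, the usual definition of information gain (entropy reduction) for attribute a over the training set T is:

H(T) = -\sum_{y} p_T(y) \log_2 p_T(y)

IG(T, a) = H(T) - \sum_{v \in val(a)} \frac{|T_v|}{|T|} \, H(T_v), \quad \text{where } T_v = \{x \in T : a(x) = v\}

Terms whose presence or absence yields the highest IG with respect to the class y are the most informative features.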
Preprocessing and feature selection
- Principal Components Analysis
Linguistic resources
- Lexicons
Linguistic resources
- Ontologies
Representation models & weighting schemes
- Vector Space Model
Representation models & weighting schemes
- Bag of Words
S1 = “The cat and the dog are friends”
S2 = “The dog hate the cat because they are not friends”
Vocabulary = “the, cat, and, dog, are, friends, hate, because, they, not”
BoW(S1) = “2, 1, 1, 1, 1, 1, 0, 0, 0, 0”
BoW(S2) = “2, 1, 0, 1, 1, 1, 1, 1, 1, 1”
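A minimal sketch of this bag-of-words construction in Python (assuming, as on the slide, a vocabulary ordered by first appearance and raw term counts):

def build_vocabulary(sentences):
    vocab = []
    for s in sentences:
        for token in s.lower().split():
            if token not in vocab:
                vocab.append(token)
    return vocab

def bow(sentence, vocab):
    tokens = sentence.lower().split()
    return [tokens.count(term) for term in vocab]

s1 = "The cat and the dog are friends"
s2 = "The dog hate the cat because they are not friends"
vocab = build_vocabulary([s1, s2])
print(vocab)            # ['the', 'cat', 'and', 'dog', 'are', 'friends', 'hate', 'because', 'they', 'not']
print(bow(s1, vocab))   # [2, 1, 1, 1, 1, 1, 0, 0, 0, 0]
print(bow(s2, vocab))   # [2, 1, 0, 1, 1, 1, 1, 1, 1, 1]

The binary and frequency weightings on the next slides are obtained from these same counts by thresholding them to 0/1 or dividing by the sentence length, respectively.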
Representation models & weighting schemes
- Binary weighting
S2 = “The dog hate the cat because they are not friends”
Vocabulary = “the, cat, and, dog, are, friends, hate, because, they, not”
BoW(S1) = “1, 1, 1, 1, 1, 1, 0, 0, 0, 0”
BoW(S2) = “1, 1, 0, 1, 1, 1, 1, 1, 1, 1”
Representation models & weighting schemes
- Frequency weighting
S2 = “The dog hate the cat because they are not friends”
Vocabulary = “the, cat, and, dog, are, friends, hate, because, they, not”
BoW(S2) = “2/10, 1/10, 0, 1/10, 1/10, 1/10, 1/10, 1/10, 1/10, 1/10”
Representation models & weighting schemes
- TF-IDF
● Term Frequency
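A common formulation of the TF-IDF weight of term t in document d, given a collection of N documents (variants of both factors exist):

tf\text{-}idf(t, d) = tf(t, d) \cdot \log \frac{N}{df(t)}

where tf(t, d) is the number of occurrences of t in d (possibly normalised by the document length) and df(t) is the number of documents containing t. Terms frequent in a document are up-weighted, while terms that occur in many documents are down-weighted.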
Representation models & weighting schemes
- Statistical Language Model
○ Backoff
○ Linear interpolation
○ Good-Turing
○ ...
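As a reference point for the smoothing techniques listed above, a bigram (first-order Markov) language model factorises a text as

P(w_1 \dots w_n) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-1})

and linear interpolation, one of the listed options, mixes the bigram and unigram maximum-likelihood estimates so that unseen bigrams do not receive zero probability:

P(w_i \mid w_{i-1}) = \lambda \, P_{ML}(w_i \mid w_{i-1}) + (1 - \lambda) \, P_{ML}(w_i), \qquad 0 \le \lambda \le 1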
Representation models & weighting schemes
- n-grams
○ 1-gram (unigrams)
○ 2-grams (bigrams)
○ 3-grams (trigrams)
S2 = “The dog hate the cat because they are not friends”
2-grams = “the cat, cat and, and the, the dog, dog are, are friends, dog
hate, hate the, cat because, because they, they are, are not, not friends”
2-grams(S1) = “1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0”
2-grams(S2) = “1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1”
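A minimal sketch of the 2-gram extraction behind this example, reusing S1 and S2 from the Bag of Words slide:

def ngrams(sentence, n):
    tokens = sentence.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

s1 = "The cat and the dog are friends"
s2 = "The dog hate the cat because they are not friends"
vocab = []
for s in (s1, s2):
    for g in ngrams(s, 2):
        if g not in vocab:
            vocab.append(g)

print(vocab)                                      # the 13 distinct bigrams, in order of first appearance
print([ngrams(s1, 2).count(g) for g in vocab])    # [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
print([ngrams(s2, 2).count(g) for g in vocab])    # [1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1]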
Representation models & weighting schemes
- EmoGraph
Graph-based 8 features
Representation models & weighting schemes
- LDR
Meaning of the measures
LDR: 30
Skip-gram: 300
SenVec: 300
BOW: 10,000
Representation models & weighting schemes
- Word embeddings
● Negative sampling
● Hierarchical soft-max
Maximizing the average of the log probability:
Using the negative sampling estimator:
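Assuming the slides follow the standard skip-gram formulation of Mikolov et al. (2013), the two objectives referred to above are: the average log probability over a corpus of T words with context window c,

\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\; j \ne 0} \log p(w_{t+j} \mid w_t)

and, for each (input, output) word pair, the negative sampling estimator with k noise words drawn from a noise distribution P_n(w),

\log \sigma\!\left({v'_{w_O}}^{\top} v_{w_I}\right) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}\!\left[\log \sigma\!\left(-{v'_{w_i}}^{\top} v_{w_I}\right)\right]

where v and v' are the input and output vector representations and \sigma is the sigmoid function.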
Learning and evaluating models
Learning and evaluating models
- Machine learning algorithms
Learning and evaluating models
- Evaluating classification tasks
CONFUSION MATRIX
                       Prediction
                    Negative   Positive
Real   Negative        a          b
       Positive        c          d
● Accuracy
● Precision
● Recall
● Sample error
● F-Score
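In terms of the confusion matrix above (a = true negatives, b = false positives, c = false negatives, d = true positives), the usual definitions are:

Accuracy = \frac{a + d}{a + b + c + d} \qquad Sample\ error = \frac{b + c}{a + b + c + d} = 1 - Accuracy

Precision = \frac{d}{b + d} \qquad Recall = \frac{d}{c + d} \qquad F\text{-}Score = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}

(the F-Score above is the balanced F1; a weighted F_\beta variant also exists).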
Learning and evaluating models
- Evaluating clustering tasks
Learning and evaluating models
- Evaluating association rules tasks
Support:
Confidence:
Lift:
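For a rule X ⇒ Y over a set of transactions D, the standard definitions of these measures are:

Support(X \Rightarrow Y) = \frac{|\{t \in D : X \cup Y \subseteq t\}|}{|D|}

Confidence(X \Rightarrow Y) = \frac{Support(X \cup Y)}{Support(X)}

Lift(X \Rightarrow Y) = \frac{Confidence(X \Rightarrow Y)}{Support(Y)}

A lift greater than 1 indicates that X and Y co-occur more often than expected if they were independent.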
Avoiding overfitting
- n/1-n split
0.7 (training) / 0.3 (test)
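A minimal sketch of this 0.7 / 0.3 hold-out split with scikit-learn (an assumed tooling choice, with toy data):

from sklearn.model_selection import train_test_split

X = [[0], [1], [2], [3], [4], [5], [6], [7], [8], [9]]   # toy feature vectors
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]                       # toy class labels

# 70% of the examples go to training, 30% are held out for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
print(len(X_train), len(X_test))   # 7 3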
Avoiding overfitting
- k-fold cross-validation
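A minimal sketch of k-fold cross-validation (k = 5 here) with scikit-learn, again an assumed tooling choice:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
# The data are split into 5 folds; each fold is used once as the test set.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores, scores.mean())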
Avoiding overfitting
- Special cases
● When the representation is built from the whole dataset rather than from the training set only (e.g. using the entire vocabulary in BoW / n-gram models), information about the test data leaks into the model and the evaluation becomes over-optimistic.
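A minimal sketch of the safe procedure with scikit-learn (an assumed tooling choice): the BoW vocabulary is fitted on the training split only and then reused to encode the test split.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

docs = ["the cat and the dog are friends",
        "the dog hate the cat because they are not friends",
        "cats and dogs can be friends",
        "they are not friends at all"]
labels = [1, 0, 1, 0]

docs_train, docs_test, y_train, y_test = train_test_split(docs, labels, test_size=0.3, random_state=0)

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(docs_train)   # vocabulary built from the training documents only
X_test = vectorizer.transform(docs_test)         # test documents mapped onto that same vocabulary
# Words appearing only in the test documents are simply ignored at transform time.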
Language Variety Identification