NLPmidterm Slide
MIDTERM REPORT
Conclusion
Reference
Exercise 1
Algorithms used for the exercise
To find similar content in news articles or reports, we use SimHash and MinHash to measure the
similarity between texts.
• Definition of MinHash: a MinHash function converts tokenized text into a set of hash integers,
then selects the minimum value as the signature entry for that hash function.
+ Math formula of the hash: h(x) = (a*x + b) mod c, where
x: input integer,
a, b: random integers smaller than the maximum value of x,
c: a number larger than the maximum value of x (typically a prime)
• Definition of SimHash: a hashing function with the property that the more similar the text inputs are,
the smaller the Hamming distance between their hashes.
+ Math formula: W_i: weight of the i-th word in the text,
TF(i): frequency of the i-th word in the text
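• Definition of Jaccard distance: a measure of how dissimilar two sample sets are, derived from the
Jaccard similarity, a statistic used for gauging the similarity and diversity of sample sets.
+ Math formula: for two sets A and B, J(A, B) = |A ∩ B| / |A ∪ B| and d_J(A, B) = 1 - J(A, B)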
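The MinHash and Jaccard definitions above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in the report: the helper names (`jaccard`, `make_hash_funcs`, `minhash_signature`, `estimate_similarity`), the use of `zlib.crc32` to turn tokens into integers, and the choice of the Mersenne prime 2^61 - 1 as c are all assumptions made for the example.

```python
import random
import zlib

MAX_X = 2**32 - 1      # largest possible token id (crc32 is 32-bit); assumption
C = (1 << 61) - 1      # a prime larger than MAX_X, used as the modulus c; assumption


def jaccard(a, b):
    """Exact Jaccard similarity |A & B| / |A | B| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def make_hash_funcs(k, seed=0):
    """Draw k random (a, b) pairs for h(x) = (a*x + b) mod c."""
    rng = random.Random(seed)
    return [(rng.randint(1, MAX_X), rng.randint(0, MAX_X)) for _ in range(k)]


def minhash_signature(tokens, hash_funcs):
    """For each hash function, keep the minimum hash over all tokens.

    Assumes `tokens` is a non-empty collection of strings.
    """
    ids = [zlib.crc32(t.encode("utf-8")) for t in tokens]
    return [min((a * x + b) % C for x in ids) for a, b in hash_funcs]


def estimate_similarity(sig_a, sig_b):
    """Fraction of signature positions that agree; approximates Jaccard similarity."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


# Usage: compare the exact and the estimated similarity of two token sets.
doc_a = set("the quick brown fox jumps over the lazy dog".split())
doc_b = set("the quick brown cat jumps over the lazy dog".split())
funcs = make_hash_funcs(k=128)
exact = jaccard(doc_a, doc_b)
approx = estimate_similarity(minhash_signature(doc_a, funcs),
                             minhash_signature(doc_b, funcs))
print(exact, approx)
```

The Jaccard distance is then `1 - jaccard(a, b)`. A larger number of hash functions k generally gives a closer estimate at the cost of more computation.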
Comparison and accuracy of the algorithms
                MinHash                                   SimHash
Big-O           O(mnk + m^2 k) with k hash functions      O(n^2)
Accuracy        SimHash < MinHash                         SimHash < MinHash
Running time    SimHash > MinHash                         SimHash > MinHash
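For comparison with MinHash, a minimal SimHash sketch is given here. It assumes that the weight W_i of each word is simply its term frequency TF(i), that words are hashed with SHA-256 truncated to 64 bits, and that the fingerprint is 64 bits long; the function names `simhash` and `hamming_distance` are illustrative and not taken from the report.

```python
import hashlib
from collections import Counter


def simhash(text, bits=64):
    """Return a `bits`-bit SimHash fingerprint of a text (bits <= 64 here)."""
    weights = Counter(text.lower().split())   # W_i = TF(i); assumption
    v = [0] * bits
    for word, w in weights.items():
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # 64-bit hash of the word
        for i in range(bits):
            # Add the weight where the hash bit is 1, subtract it where it is 0.
            if (h >> i) & 1:
                v[i] += w
            else:
                v[i] -= w
    fingerprint = 0
    for i in range(bits):
        if v[i] > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(x, y):
    """Number of bit positions in which two fingerprints differ."""
    return bin(x ^ y).count("1")


# Usage: similar texts tend to give a small Hamming distance.
a = simhash("the government announced a new economic policy today")
b = simhash("the government announced a new economic plan today")
print(hamming_distance(a, b))
```

Two documents are usually treated as near-duplicates when the Hamming distance falls below a chosen threshold; the threshold itself has to be tuned on the data.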
- Then we remove special characters from the data and convert all words in this column to lowercase.
- Split the sentences in the text into n-grams (2-grams in this case):
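The two preprocessing steps above could look roughly as follows. This is a sketch that assumes English text, so everything other than ASCII letters, digits and whitespace counts as a special character; `clean_text` and `ngrams` are hypothetical helper names.

```python
import re


def clean_text(text):
    """Lowercase the text and replace special characters with spaces."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # assumption: ASCII-only alphabet
    return re.sub(r"\s+", " ", text).strip()   # collapse repeated whitespace


def ngrams(text, n=2):
    """Return the list of word n-grams (2-grams by default) of a cleaned text."""
    words = text.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


cleaned = clean_text("Hello, World!! NLP-2023")
print(cleaned)           # hello world nlp 2023
print(ngrams(cleaned))   # [('hello', 'world'), ('world', 'nlp'), ('nlp', '2023')]
```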
N-Gram with Smoothing:
2. Use an UNK (unknown word) token to solve the zero-probability problem for unseen words
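A small bigram model illustrates how smoothing and the UNK token work together. The slides do not say which smoothing method is used, so this sketch assumes add-one (Laplace) smoothing; the `<UNK>` symbol, the `min_count` threshold and the function names are likewise assumptions for illustration.

```python
from collections import Counter

UNK = "<UNK>"


def train_bigram(sentences, min_count=2):
    """Count bigrams after mapping rare words (count < min_count) to UNK."""
    word_counts = Counter(w for s in sentences for w in s)
    vocab = {w for w, c in word_counts.items() if c >= min_count} | {UNK}
    contexts, bigrams = Counter(), Counter()
    for s in sentences:
        mapped = [w if w in vocab else UNK for w in s]
        contexts.update(mapped[:-1])             # words that have a successor
        bigrams.update(zip(mapped, mapped[1:]))
    return vocab, contexts, bigrams


def bigram_prob(prev, word, vocab, contexts, bigrams):
    """Add-one smoothed P(word | prev); unseen words are mapped to UNK."""
    prev = prev if prev in vocab else UNK
    word = word if word in vocab else UNK
    return (bigrams[(prev, word)] + 1) / (contexts[prev] + len(vocab))


# Usage: "bird" never occurs in training, yet it gets a non-zero probability.
data = [["the", "cat", "sat"], ["the", "dog", "sat"]]
vocab, contexts, bigrams = train_bigram(data)
print(bigram_prob("the", "bird", vocab, contexts, bigrams))
```

Mapping rare training words to UNK gives the model counts for "unknown" words, while add-one smoothing keeps unseen bigrams of known words from receiving probability zero.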