ISR Assignment 1
Objectives (from my part):
• Study the statistical properties of the text
• Calculate the frequency of words
• Rank words according to their frequency
• Plot the graph of frequency vs. rank
• Calculate the product of rank and frequency
• Explain the distribution: does it follow a Zipfian distribution, does it follow Zipf's law, etc.

Based on Luhn's idea:
• What words will be removed and not considered as index terms? What are the upper and lower cut-off points, and how were they decided?
• Which words are used for indexing?

Study the statistical properties of the text
• Among the given alternatives, as a group we chose Afaan Oromo as the base language for this project. Tokenization and markup removal are language-dependent tasks; Afaan Oromo uses almost the same markup and character set as English, except for " ' ", known as 'hudha'.
• For the implementation we used Python. Why Python? Because it gives us access to the matplotlib library, which plots the graphs in rank order, along with the other functionality it offers. Python is well suited to data analysis and data manipulation.
• Using Zipf's law, Luhn's method, and Heaps' law to determine the word distribution and word significance, and to show how vocabulary size grows as the corpus grows, we extracted data from about 12 documents, about 23,094 words in total.

Zipf's Law
• Zipf's law is an empirical law that describes the distribution of word frequencies in a language. Named after the linguist George Zipf, it states that in a given corpus of natural-language text, the frequency of any word is inversely proportional to its rank in the frequency table. In other words, the second most frequent word occurs approximately half as often as the most frequent word, the third most frequent word one-third as often, and so on.

Key points of Zipf's law:
• 1. Rank-frequency relationship: frequency is inversely proportional to rank (f ∝ 1/r, so the product r * f is approximately constant).
• 2. Frequency distribution: a small number of words are used very frequently, while the vast majority are used rarely.
• 3. Log-log plot: when the rank and frequency of words are plotted on a log-log scale, Zipf's law predicts a straight line with a slope of approximately -1.
• Since Zipf's law indicates that a few words are very common, data compression algorithms can exploit this property to encode text more efficiently by assigning shorter codes to frequent words.

Rank   Word   Frequency   r*f/N (N = 23,094 total words)
1      hin    1932        0.083658093
2      kan    1568        0.135792847
3      akka   1525        0.198103403
4      fi     1483        0.256863255
5      ta     1371        0.296830346
6      a      926         0.240581969
7      isaa   832         0.252186715
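The deck describes this computation but does not include the script itself. A minimal sketch of the steps above might look like the following; the file name corpus.txt and the simple regex tokenizer are assumptions, not the group's actual preprocessing.

```python
import re
from collections import Counter

import matplotlib.pyplot as plt

# Read the corpus; "corpus.txt" is a placeholder file name.
with open("corpus.txt", encoding="utf-8") as f:
    text = f.read().lower()

# Tokenize on runs of letters; keeping the apostrophe is a rough
# way to preserve the Afaan Oromo hudha (') inside words.
words = re.findall(r"[a-z']+", text)
n_total = len(words)           # about 23,094 in the assignment corpus

counts = Counter(words)
ranked = counts.most_common()  # [(word, freq), ...], most frequent first

# Product of rank and frequency, normalized by corpus size,
# as in the r*f/N column of the table above.
for rank, (word, freq) in enumerate(ranked[:7], start=1):
    print(rank, word, freq, rank * freq / n_total)

# Log-log rank-frequency plot; Zipf's law predicts a roughly
# straight line with slope close to -1.
ranks = range(1, len(ranked) + 1)
freqs = [freq for _, freq in ranked]
plt.loglog(ranks, freqs, marker=".")
plt.xlabel("Rank (log scale)")
plt.ylabel("Frequency (log scale)")
plt.title("Rank vs. frequency")
plt.show()
```

If the tokenization matches the group's preprocessing, the seven printed rows should reproduce the table above.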
Does the data follow a Zipfian distribution? Does it follow Zipf's law?
Yes, it follows Zipf's law.
Zipf's law states that when the distinct words in a text are arranged in decreasing order of their frequency of occurrence (most frequent words first), the occurrence characteristics of the vocabulary can be characterized by the constant rank-frequency law of Zipf, that is, r * f = c. In the table above, the normalized products 0.083658093, 0.135792847, 0.198103403, ... (for example, at rank 1: 1 * 1932 / 23,094 ≈ 0.0837) are roughly constant, as the law predicts.

Luhn's Law
• Luhn's method, proposed by Hans Peter Luhn in 1958, is a technique used in information retrieval and text summarization to decide a cut-off point for selecting significant words (terms) from a document. The idea is to identify the most informative words that can be used for indexing, summarizing, or further analysis. The method is based on the observation that words with extremely high or low frequencies tend to be less informative, while words with medium frequencies tend to carry more significant content.
• Steps to Decide a Cut-off Point Using Luhn’s Method:
• 1. Frequency Distribution Analysis:
• Calculate the frequency of each word in the document or corpus.
• Rank the words in descending order of their frequency.
• 2. Determine the Upper and Lower Frequency Thresholds:
• Lower Cut-off: Words that occur very infrequently are often not informative, because they may be rare terms or misspellings.
• Upper Cut-off: Words that occur very frequently are usually stop words, which are common in all texts and do not provide specific information about the document's content.
• 3. Set the Lower and Upper Cut-off Points:
• Luhn suggested that the most informative words are those whose frequency lies between the lower and upper cut-off points.
• Lower Cut-off (L): often determined by ignoring words that occur less than a certain number of times (e.g., fewer than 3 times in the document).
• Upper Cut-off (U): often determined by ignoring the topmost frequent words, typically stop words.
• In the example below, the function `luhns_method` takes a document, a lower cut-off, and an upper cut-off ratio. It returns the list of informative words based on Luhn's method. Adjust the cut-off parameters as needed for different corpora or use cases.
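The `luhns_method` code itself did not survive extraction, so the following is a reconstruction from the description above. The signature (document, lower cut-off, upper cut-off ratio) matches the text; the internals, in particular reading the upper cut-off ratio as the fraction of top-ranked words to skip, are assumptions.

```python
from collections import Counter

def luhns_method(document, lower_cutoff=3, upper_cutoff_ratio=0.1):
    """Return informative words per Luhn's method (reconstructed sketch).

    lower_cutoff: ignore words occurring fewer than this many times.
    upper_cutoff_ratio: assumed to mean the fraction of the most
    frequent distinct words to drop as likely stop words.
    """
    # Step 1: frequency distribution analysis (count and rank words).
    words = document.lower().split()
    ranked = Counter(words).most_common()

    # Steps 2-3, upper cut-off: skip the top fraction of ranked words.
    n_skip = int(len(ranked) * upper_cutoff_ratio)

    # Lower cut-off: keep only the remaining medium-frequency words.
    return [word for word, freq in ranked[n_skip:] if freq >= lower_cutoff]
```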
• Example: the lower cut-off point begins from frequency 0.5.
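A hypothetical call might look like the following; the file name and parameter values are illustrative only, since the deck's 0.5 lower cut-off appears to be on a normalized scale that the count-based sketch above does not model.

```python
# Hypothetical usage of the sketch above; values are illustrative.
with open("corpus.txt", encoding="utf-8") as f:
    text = f.read()

index_terms = luhns_method(text, lower_cutoff=3, upper_cutoff_ratio=0.05)
print(index_terms[:20])  # first 20 candidate index terms
```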