Find the k most frequent words from a data set in Python

Last Updated : 07 Jan, 2025

The goal is to find the k most common words in a given dataset of text. We’ll look at different ways to identify and return the top k words based on their frequency, using Python.

Using collections.Counter

collections.Counter is a dictionary subclass whose main job is to count how many times each item appears in a list or other iterable. It makes it simple to count items, look up how often each one appears, and find the most common ones.

Example:

Python

from collections import Counter

# A list of items to count
li = ['apple', 'banana', 'orange', 'apple', 'banana', 'apple']

# Create a Counter object
cnt = Counter(li)

# Display the counts as a dictionary
print("Item counts:", cnt)

# Access count of a specific item
print("Count of apples:", cnt['apple'])

Output

Item counts: Counter({'apple': 3, 'banana': 2, 'orange': 1})
Count of apples: 3
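
The example above only counts items. To get the top k directly, Counter also provides a most_common(k) method. The sketch below is a minimal illustration, assuming the input is a raw text string rather than a ready-made list of words; the helper name top_k_words and the simple regex tokenizer are assumptions made for this example.

```python
import re
from collections import Counter


def top_k_words(text, k):
    """Return the k most frequent words in text as (word, count) pairs."""
    # Lowercase the text and treat runs of letters/apostrophes as words.
    # This is a simplifying assumption; real text may need better tokenization.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(k)


text = "the cat and the dog and the bird"
print(top_k_words(text, 2))  # [('the', 3), ('and', 2)]
```

most_common(k) returns a list of (word, count) tuples ordered from most to least frequent, so no separate sorting step is needed.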

Using heapq.nlargest

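heapq.nlargest finds the top k most frequent words without fully sorting all of them. After the words are counted, it scans the (word, count) pairs while keeping only the k best candidates in a small heap, which takes roughly O(n log k) time for n distinct words, compared with O(n log n) for a full sort. The saving matters mainly when k is much smaller than n; the Counter itself still stores a count for every distinct word.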

Example:

Python

import heapq
from collections import Counter

# A list of words to analyze
li = ['apple', 'banana', 'orange', 'apple', 'banana', 'apple',
      'orange', 'banana', 'grape', 'grape', 'grape', 'grape']

# Step 1: Count the frequency of each word
cnt = Counter(li)

# Step 2: Find the top k most frequent words using heapq.nlargest
k = 2  # Top 2 most frequent words
top_k = heapq.nlargest(k, cnt.items(), key=lambda x: x[1])

# Display the results
print(cnt)
print(f"{k} most frequent words:", top_k)

Output

Counter({'grape': 4, 'apple': 3, 'banana': 3, 'orange': 2})
2 most frequent words: [('grape', 4), ('apple', 3)]
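
In the output above, apple and banana both appear 3 times, and which of them is returned depends on the order in which the words were first seen. If a reproducible result matters, one option is to break ties explicitly. The sketch below is one way to do this, assuming ties should be resolved alphabetically; it swaps heapq.nlargest for heapq.nsmallest with a negated count, and the sample list is made up for illustration.

```python
import heapq
from collections import Counter

# Made-up sample: 'banana' and 'apple' tie with 2 occurrences each
li = ['banana', 'apple', 'banana', 'apple', 'cherry', 'cherry', 'cherry']
cnt = Counter(li)

k = 2
# Sort key: higher counts first (negated count), then alphabetical order for ties
top_k = heapq.nsmallest(k, cnt.items(), key=lambda x: (-x[1], x[0]))

print(top_k)  # [('cherry', 3), ('apple', 2)]
```

Here apple is chosen over banana because it comes first alphabetically, even though banana appears first in the list.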

Using pandas.value_counts

The value_counts() method in pandas is a simple way to find the most frequent words, especially when our data is already in a pandas Series or a DataFrame column. It counts how often each word appears and returns the counts sorted from most to least frequent, so the first k entries are the top k words.

Example:

Python

import pandas as pd

# A list of words to analyze
li = ['apple', 'banana', 'orange', 'apple', 'banana', 'apple',
      'orange', 'banana', 'grape', 'grape', 'grape', 'grape']

# Step 1: Convert the list to a pandas Series
s = pd.Series(li)

# Step 2: Use value_counts to count the occurrences
cnt = s.value_counts()

# Step 3: Extract the top k most frequent words
k = 2  # Top 2 most frequent words
top_k = cnt.head(k)

# Display the results
print(cnt)
print(f"\n{k} most frequent words:\n", top_k)

Output

grape     4
apple     3
banana    3
orange    2
dtype: int64

2 most frequent words:
 grape    4
apple    3
dtype: int64

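Note: Here dtype is the data type of the word counts. It is int64 because value_counts() returns integer counts, which pandas stores as 64-bit integers here. Newer pandas versions (2.0 and later) may also print a "Name: count" field next to the dtype.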
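
When the words already live in a DataFrame column, the same approach can be applied to that column directly. The sketch below is a minimal illustration under that assumption; the DataFrame and its column name word are hypothetical, and the final step converts the result to plain (word, count) tuples so it matches the output of the Counter-based examples.

```python
import pandas as pd

# Hypothetical DataFrame with one word per row in a 'word' column
df = pd.DataFrame({'word': ['grape', 'apple', 'grape', 'banana', 'apple',
                            'grape', 'orange', 'banana', 'apple', 'grape']})

k = 2
# Count each word in the column and keep the k most frequent
top_k = df['word'].value_counts().head(k)

# Convert to plain (word, count) tuples with Python ints
pairs = [(word, int(count)) for word, count in top_k.items()]
print(pairs)  # [('grape', 4), ('apple', 3)]
```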