
Word Tokenization Using R

Last Updated : 28 Apr, 2025

Tokenization is a fundamental task in Natural Language Processing (NLP) and text analysis. It involves breaking text down into smaller units called tokens, which can be words, sentences or even individual characters. Word tokenization, specifically, breaks text into individual words.

For example, word tokenization splits the sentence "I love my dog" into the vector ["I", "love", "my", "dog"].
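
Before any packages are introduced, base R's strsplit() can already approximate this; a minimal sketch that splits on single spaces:

R
sentence <- "I love my dog"

# strsplit() returns a list (one element per input string), so take the first element
tokens <- strsplit(sentence, " ")[[1]]
print(tokens)

Output:

[1] "I"    "love" "my"   "dog"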

When tokenizing we should consider:

  1. Punctuation and special characters: Decide whether to retain or remove symbols like hyphens.
  2. Case normalization: Convert text to lowercase for consistency or preserve case for proper nouns.
  3. Stopword removal: Filter common words like "the" or "and" which do not contribute much to the meaning of a sentence.
  4. Stemming vs. lemmatization: Choose truncation-style stemming or dictionary-based normalization as needed (see the sketch after this list).
  5. Numbers, dates and encoding: Handle numeric tokens and ensure UTF-8 compatibility.
  6. Tokenization granularity: Select word, subword or sentence tokens according to task requirements.
  7. Domain and language specifics: Use custom tokenizers or dictionaries for specialized text.
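
As a quick illustration of point 4, the SnowballC package (an assumption here; it is not used elsewhere in this article) performs truncation-style stemming via wordStem():

R
install.packages("SnowballC")
library(SnowballC)

# Stemming truncates related word forms to a common stem
print(wordStem(c("running", "runs", "ran"), language = "english"))

Output:

[1] "run" "run" "ran"

Note that the irregular form "ran" is left untouched; this is exactly where dictionary-based lemmatization does better.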

R provides various packages and functions to perform tokenization, each with its own set of features and capabilities.

1. Using tokenizers package

This package provides functions such as tokenize_words(text) for word tokenization, where text is the input character vector. By default it lowercases the text and strips punctuation.

R
install.packages("tokenizers")
library(tokenizers)

text <- "Welcome to Geeks for Geeks.Embark on an extraordinary coding odyssey with our 
groundbreaking course,
DSA to Development - Complete Coding Guide! Discover the transformative power of mastering
Data Structures and Algorithms
(DSA) as you venture towards becoming a Proficient Developer."

word_tokens <- unlist(tokenize_words(text))
print(word_tokens)

Output:

[1] "welcome" "to" "geeks" "for" "geeks" "embark"
[7] "on" "an" "extraordinary" "coding" "odyssey" "with" "our"
[14] "groundbreaking" "course" "dsa" "to" "development" "complete" "coding"
[21] "guide" "discover" "the" "transformative" "power" "of" "mastering"
[28] "data" "structures" "and" "algorithms" "as" "you" "venture" "towards"
[35] "becoming" "a" "proficient" "developer"

Handling Stopwords

Stopwords are common words like "the", "is" and "and" that are often removed from text during analysis because they carry little meaning. We can remove them by passing a character vector of words to exclude to the stopwords argument of tokenize_words().

R
library(tokenizers)

text <- "welcome to GFG !@# 23"

# stopwords takes a character vector of words to drop;
# strip_punct and strip_numeric remove punctuation and numbers
word_tokens <- unlist(tokenize_words(text, lowercase = TRUE, stopwords = c("to"),
                                     strip_punct = TRUE, strip_numeric = TRUE,
                                     simplify = FALSE))
print(word_tokens)

Output:

[1] "welcome" "gfg"

Key features and functions of tokenizers package

  • Sentence Tokenization: The tokenize_sentences() function splits text into sentences for sentence-level analysis.
  • N-grams: The tokenize_ngrams() function creates sequences of n words to capture contextual information (both of these functions are sketched after this list).
  • Multi-lingual Support: Handles text in multiple languages and character encodings making it useful for various tasks.
  • Flexibility: Supports various tokenization patterns making it suitable for different languages and text formats.
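
A minimal sketch of the first two features:

R
library(tokenizers)

text <- "Geeks for Geeks is a portal. It hosts articles."

# Split into sentences
print(tokenize_sentences(text))

# Bigrams: sliding windows of two consecutive words
print(tokenize_ngrams(text, n = 2))

Output:

[[1]]
[1] "Geeks for Geeks is a portal." "It hosts articles."

[[1]]
[1] "geeks for"      "for geeks"      "geeks is"       "is a"
[5] "a portal"       "portal it"      "it hosts"       "hosts articles"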

2. Using the tm package

This package is primarily designed for text mining but also supports word tokenization through its corpus-processing pipeline and the words() accessor used below. It provides functions for preprocessing, exploring and analyzing text data. In the example below:

  • corpus <- Corpus(VectorSource(text)): Creates a corpus from the provided text.
  • corpus <- tm_map(corpus, content_transformer(tolower)): Converts all text to lowercase.
  • corpus <- tm_map(corpus, removePunctuation): Removes punctuation.
  • corpus <- tm_map(corpus, removeNumbers): Removes numbers.
  • corpus <- tm_map(corpus, removeWords, stopwords("english")): Removes common English stop words.
  • words <- unlist(sapply(corpus, words)): Extracts and flattens words from the corpus into a single vector.
R
install.packages("tm")
library(tm)

text <- "Welcome to Geeks for Geeks, this is the best platform for articles."

corpus <- Corpus(VectorSource(text))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))

words <- unlist(sapply(corpus, words))
print(words)

Output:

1,] "welcome"
[2,] "geeks"
[3,] "geeks"
[4,] "best"
[5,] "platform"
[6,] "artices"

Key features and functions of tm package

  • Word Tokenization: Tokenizes text into words.
  • Text Corpus Creation: Builds a collection of text documents (corpus).
  • Text Preprocessing: Provides functions for converting text to lowercase, removing stopwords, punctuation and special characters.
  • Stemming and Lemmatization: Reduces words to their base form (e.g. "running" to "run").
  • Integration with NLP Tools: Works with other NLP packages for advanced analysis.
  • Text Mining: Supports various text mining operations.
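
A short sketch of the text-mining side: building a document-term matrix from the preprocessed corpus created in the example above (assumes that code has already been run):

R
# Build a document-term matrix (documents as rows, terms as columns)
dtm <- DocumentTermMatrix(corpus)

# inspect() prints the dimensions, sparsity and per-document term counts
inspect(dtm)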

3. Using the quanteda package

This package is known for its flexibility in text analysis and performs word tokenization with its tokens() function (the tokenize_words() function shown in many older tutorials belongs to the tokenizers package, not quanteda). The input should be a character vector or corpus and the function tokenizes each element.

R
install.packages("quanteda")
library(quanteda)

text <- "Geeks for Geeks."

word_tokens <- tokenize_words(text)
print(word_tokens)

Output:

[1] "geeks" "for" "geeks"

Key features and functions of Quanteda package

  • Flexible Text Analysis: Supports a wide range of tasks including text cleaning, tokenization, feature selection and document modeling.
  • Document-Feature Matrix (DFM): Creates DFMs to represent text documents numerically for statistical and machine learning analyses (sketched after this list).
  • Customization: Allows extensive customization for text analysis workflows such as stemming and stopword removal.
  • NLP and Linguistic Analysis: Supports advanced NLP tasks like part-of-speech tagging, named entity recognition, collocation analysis and sentiment analysis.
  • Community and Documentation: Provides detailed documentation, tutorials and an active community for support.
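
A minimal sketch of the DFM feature, counting term frequencies across two short documents:

R
library(quanteda)

texts <- c("Geeks for Geeks", "for the geeks")

# Tokenize, then build a document-feature matrix of term counts
dfmat <- dfm(tokens(texts))
print(dfmat)

Output:

Document-feature matrix of: 2 documents, 3 features (16.67% sparse) and 0 docvars.
       features
docs    geeks for the
  text1     2   1   0
  text2     1   1   1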

4. Using the strsplit() function

This function in R is used to split the elements of a character vector into substrings based on specified delimiters or regular expressions. Here:

  • The lapply() function is used to apply a function to each element of the text_vector.
  • strsplit(x, ",") splits the string x into substrings wherever a comma (,) is encountered, returning a list of character vectors.
  • unlist() converts this list into a single character vector, flattening the list and removing the list structure.
R
text_vector <- c("apple,banana,cherry", "dog,cat,elephant", "red,green,blue")

split_text <- lapply(text_vector, function(x) unlist(strsplit(x, ",")))
print(split_text)

Output:

[[1]]
[1] "apple" "banana" "cherry"

[[2]]
[1] "dog" "cat" "elephant"

[[3]]
[1] "red" "green" "blue"

Use cases of Word Tokenization

  • Sentiment Analysis: Tokenization breaks down text into words to analyze sentiment. Words like 'happy' or 'sad' are identified for sentiment scoring (a toy sketch follows this list).
  • Machine Translation: Tokenization splits text into words making it easier to translate each word into another language.
  • Text Classification: Tokenized words serve as features for classifying text into categories like spam detection or topic categorization.
  • Chatbots: Tokenization helps chatbots understand user input and generate appropriate responses.
  • Text Summarization: Words are tokenized to identify key elements and create concise summaries.
  • Named Entity Recognition (NER): Tokenization helps in identifying entities like names, dates and locations in a text.
  • Speech Recognition: Breaks down spoken language into words for further analysis.
  • Topic Modeling: Tokenized words help in identifying and grouping related topics within large sets of text.
  • Information Retrieval: Helps search engines and recommendation systems understand individual words to retrieve relevant documents.
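
As a toy illustration of the sentiment-analysis use case, with hypothetical mini-lexicons (real analyses use curated lexicons such as those shipped with the tidytext or syuzhet packages):

R
library(tokenizers)

review <- "I am happy with the course but sad about the pace"

# Hypothetical mini-lexicons, for illustration only
positive <- c("happy", "good", "great")
negative <- c("sad", "bad", "poor")

# Score = positive matches minus negative matches
tokens <- unlist(tokenize_words(review))
score <- sum(tokens %in% positive) - sum(tokens %in% negative)
print(score)

Output:

[1] 0

Here "happy" and "sad" cancel out, giving a neutral score.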
