
Natural Language Processing with R

Last Updated : 06 May, 2025

Natural Language Processing (NLP) is a field of artificial intelligence (AI) that enables machines to understand and process human language. R, known for its statistical capabilities, provides a wide range of libraries to perform various NLP tasks.

Understanding Natural Language Processing

NLP involves developing algorithms to help machines interpret, generate and respond to human language. Some core NLP tasks include:

  • Text Tokenization: Breaking down text into smaller units such as words or phrases.
  • Part-of-Speech (POS) Tagging: Assigning grammatical labels (noun, verb, etc.) to each word in a sentence.
  • Named Entity Recognition (NER): Identifying and classifying entities like names, locations and organizations.
  • Sentiment Analysis: Determining the sentiment (positive, negative, neutral) in a text.
  • Text Classification: Categorizing text into predefined labels or topics.

NLP Libraries in R

R offers many libraries for NLP tasks, such as tm for text mining, NLP for basic NLP utilities, tokenizers for tokenization, udpipe for POS tagging and sentimentr for sentiment analysis.

Installing the libraries

We can install and load the required libraries with the install.packages() and library() functions.

R
install.packages(c("NLP", "tm"))
library(NLP)
library(tm)

1. Text Tokenization and Cleaning

Tokenization breaks text into smaller units, while cleaning removes unwanted elements such as punctuation and stop words. The text is preprocessed by converting it to lowercase, removing punctuation and numbers and eliminating stop words; the tokenize_words() function from the tokenizers package then splits the cleaned text into individual words (tokens).

R
library(NLP)
library(tm)
library(tokenizers)

text <- "Natural Language Processing in R is exciting!!"
text_corpus <- Corpus(VectorSource(text))
text_corpus <- tm_map(text_corpus, content_transformer(tolower))
text_corpus <- tm_map(text_corpus, removePunctuation)
text_corpus <- tm_map(text_corpus, removeNumbers)
text_corpus <- tm_map(text_corpus, removeWords, stopwords("english"))
text_corpus <- tm_map(text_corpus, stripWhitespace)

tokenize_words(as.character(text_corpus[[1]]))

Output:

["natural", "language", "processing", "r", "exciting"]
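Beyond a flat list of tokens, the cleaned corpus can be summarized as a document-term matrix that counts how often each remaining term appears in each document. A minimal sketch using the same tm pipeline (the second document is added here only for illustration):

```r
library(tm)

# Clean a small corpus, then count term frequencies per document
docs <- c("Natural Language Processing in R is exciting!!",
          "R makes text mining easy.")
corpus <- Corpus(VectorSource(docs))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)

dtm <- DocumentTermMatrix(corpus)
inspect(dtm)  # rows = documents, columns = terms, cells = counts
```

The document-term matrix is the standard bridge from raw text to the classification step shown later in this article.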

2. Part-of-Speech Tagging

For advanced NLP tasks like POS tagging, we can use the udpipe package. It downloads and loads a pre-trained English model, tokenizes the text and assigns a POS tag (e.g., noun, verb) to each word in the sentence.

R
install.packages("udpipe")
library(udpipe)

ud_model <- udpipe_download_model(language = "english", model_dir = getwd())
ud_model <- udpipe_load_model(ud_model$file_model)

sentence <- "The quick brown fox jumps over the lazy dog."
udpipe_annotations <- udpipe_annotate(ud_model, x = sentence)
udpipe_pos <- as.data.frame(udpipe_annotations)

udpipe_pos[, c("token_id", "token", "upos")]

Output:

   token_id token  upos
1         1   The   DET
2         2 quick   ADJ
3         3 brown   ADJ
4         4   fox  NOUN
5         5 jumps  VERB
6         6  over   ADP
7         7   the   DET
8         8  lazy   ADJ
9         9   dog  NOUN
10       10     . PUNCT

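Because the annotation is a plain data frame, the tags can be filtered with ordinary subsetting. As a sketch (reloading the model into a temporary directory; the download step assumes an internet connection), here is how to keep only the content words:

```r
library(udpipe)

# Download and load the pre-trained English model
ud_model <- udpipe_download_model(language = "english", model_dir = tempdir())
ud_model <- udpipe_load_model(ud_model$file_model)

# Annotate the sentence and keep only nouns and adjectives
ann <- as.data.frame(udpipe_annotate(
  ud_model, x = "The quick brown fox jumps over the lazy dog."))
subset(ann, upos %in% c("NOUN", "ADJ"), select = c(token, upos))
```

The same pattern works for any other column udpipe produces, such as lemma or dep_rel.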
3. Named Entity Recognition (NER)

We can perform Named Entity Recognition (NER) in R using the text package. Its textNER() function runs a pre-trained BERT model (fine-tuned for NER) that identifies entities such as persons, locations and organizations in the provided text.

R
install.packages("text")
install.packages("textdata")
library(text)
library(textdata)

sentence <- "Barack Obama was born in Hawaii and later became the President of the United States."

entities <- textNER(sentence, model = "dbmdz/bert-large-cased-finetuned-conll03-english")

print(entities)

Output:

    text entity_type
1 Barack         PER
2  Obama         PER
3 Hawaii         GPE
4 United         GPE
5 States         GPE

4. Sentiment Analysis

Sentiment analysis helps understand the opinions expressed in text. The sentimentr package scores text sentence by sentence, providing a simple way to categorize the sentiment as positive, negative or neutral.

R
install.packages("sentimentr")
library(sentimentr)

text <- c("I love R programming!", "I hate bugs in the code.")
sentiment_analysis <- sentiment(text)

print(sentiment_analysis)

Output:

A data.table with one row per sentence (element_id, sentence_id, word_count, sentiment), showing a positive sentiment score for the first sentence and a negative score for the second.
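When each input element should get a single score rather than one per sentence, sentimentr's sentiment_by() averages the sentence-level scores. A minimal sketch on the same two texts:

```r
library(sentimentr)

text <- c("I love R programming!", "I hate bugs in the code.")

# Aggregate sentence-level scores into one average score per text element
by_text <- sentiment_by(text)
print(by_text)
```

The ave_sentiment column holds the aggregated score: positive for the first element, negative for the second.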

5. Text Classification

Text classification categorizes text into predefined topics using machine learning algorithms such as Naive Bayes or Support Vector Machines (SVM). Here the e1071 package, which implements SVM, is used to train a classifier on a document-term matrix and predict the labels of unseen documents.

R
install.packages("e1071")
install.packages("tm")
library(e1071)
library(tm)

texts <- c("I love R programming", "R is great for data analysis", 
           "I hate bugs in code", "The weather is bad", 
           "R is fantastic", "This movie is awful")
labels <- c("positive", "positive", "negative", "negative", "positive", "negative")

corpus <- Corpus(VectorSource(texts))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))

dtm <- DocumentTermMatrix(corpus)
dtm_matrix <- as.matrix(dtm)

train_data <- dtm_matrix[1:4, ]
train_labels <- factor(labels[1:4])
test_data <- dtm_matrix[5:6, ]
test_labels <- factor(labels[5:6])

svm_model <- svm(train_data, y = train_labels, kernel = "linear")

predictions <- predict(svm_model, test_data)

accuracy <- mean(predictions == test_labels)
print(paste("Accuracy:", accuracy))

Output:

[1] "Accuracy: 0.5"
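Since Naive Bayes is mentioned as an alternative, here is a sketch that swaps e1071's naiveBayes() into the same document-term-matrix split. The counts are converted to presence/absence factors, which Naive Bayes treats categorically; with so little training data the predictions are illustrative only.

```r
library(e1071)
library(tm)

texts <- c("I love R programming", "R is great for data analysis",
           "I hate bugs in code", "The weather is bad",
           "R is fantastic", "This movie is awful")
labels <- factor(c("positive", "positive", "negative",
                   "negative", "positive", "negative"))

corpus <- Corpus(VectorSource(texts))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))

dtm_matrix <- as.matrix(DocumentTermMatrix(corpus))

# Convert counts to presence/absence factors with fixed levels so the
# train and test splits stay compatible
presence <- as.data.frame(lapply(as.data.frame(dtm_matrix > 0),
                                 factor, levels = c(FALSE, TRUE)))

# Train on the first four documents, predict the last two; Laplace
# smoothing avoids zero probabilities for terms unseen in a class
nb_model <- naiveBayes(presence[1:4, ], labels[1:4], laplace = 1)
nb_pred <- predict(nb_model, presence[5:6, ])
print(nb_pred)
```

Both classifiers consume the same document-term matrix, so swapping models requires changing only the training and prediction calls.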

In this article, we explored key NLP tasks and how to implement them in R.

