Classifying data using Support Vector Machines (SVMs) in R
Support Vector Machines (SVMs) are supervised learning models used mainly for classification, though they can also be applied to regression tasks. In this approach, each data point is represented as a point in an n-dimensional space, where n is the number of features. The goal is to find a hyperplane that best separates the two classes.
Working of SVM Algorithm
A Support Vector Machine (SVM) is a classifier that finds a separating hyperplane to differentiate between the classes in the data. A hyperplane is a flat subspace that divides the feature space into two parts. In a two-dimensional space this is simply a line, while in higher dimensions it is a plane or hyperplane that separates the data into different categories.
Mathematically, the hyperplane can be represented as:
w \cdot x + b = 0
Where:
- w is the weight vector (normal to the hyperplane).
- x is a point on the feature space.
- b is the bias term that shifts the hyperplane.
For classification, SVM aims to maximize the margin between the classes. The margin is the distance between the hyperplane and the closest data points from each class, known as support vectors. SVM chooses the hyperplane that maximizes this margin, which is given by:
\text{Margin} = \frac{2}{\|w\|}
This ensures the largest possible separation between the classes while minimizing classification errors.
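To make this concrete, here is a minimal sketch (using a small made-up two-class dataset, so the numbers are illustrative only) that fits a linear-kernel SVM with e1071 and recovers the weight vector w, the bias b and the margin 2/||w|| from the fitted model:
R
library(e1071)

# Toy two-class data (illustrative): two Gaussian clouds in 2-D
set.seed(42)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 3), ncol = 2))
y <- factor(rep(c(-1, 1), each = 20))

fit <- svm(x, y, kernel = "linear", type = "C-classification", scale = FALSE)

# For a linear kernel, w is a weighted sum of the support vectors
w <- t(fit$coefs) %*% fit$SV
b <- -fit$rho                 # e1071 stores the decision function as w.x - rho
margin <- 2 / sqrt(sum(w^2))  # margin width, 2 / ||w||
margin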
Selecting the Best Hyperplane
To determine the optimal hyperplane, the algorithm analyzes labeled training data and evaluates candidate hyperplanes by how well they separate the classes. Consider the following scenarios for selecting the best hyperplane:
Scenario 1:
In this case, we have three hyperplanes: A, B and C. The goal is to find the hyperplane that best separates the two classes, i.e. the stars and the circles. Here, hyperplane B does the best job of dividing the two classes, making it the optimal choice.

Scenario 2:
In this situation, all three hyperplanes A, B and C separate the classes well. To identify the best one, we calculate the margin, the distance between the hyperplane and the nearest data points. The hyperplane with the largest margin provides the best separation; here hyperplane C has the largest margin, making it the optimal choice.

Implementation of SVM in R
We are going to implement the SVM algorithm in R using the following steps:
1. Installing and Loading the Required Packages
We need to install and load the e1071 package, which provides the svm() function, along with caTools (train/test splitting), ggplot2 (plotting) and caret (model evaluation).
R
install.packages('e1071')    # provides svm()
install.packages('caTools')  # provides sample.split() for the train/test split
install.packages('ggplot2')  # plotting
install.packages('caret')    # provides confusionMatrix()
library(e1071)
library(caTools)
library(ggplot2)
library(caret)
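If you would rather not reinstall the packages on every run, one possible pattern (a sketch, not part of the tutorial itself) is to install only what is missing:
R
# Install only the packages that are not already available, then load them all
pkgs <- c("e1071", "caTools", "ggplot2", "caret")
missing <- pkgs[!sapply(pkgs, requireNamespace, quietly = TRUE)]
if (length(missing) > 0) install.packages(missing)
invisible(lapply(pkgs, library, character.only = TRUE))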
2. Loading the Dataset
We will use the Social Network Ads dataset stored in the file social.csv. We read it with the read.csv() function and display the first 6 rows using the head() function.
R
data = read.csv('/content/social.csv')
head(data)
Output:
sample data
3. Exploring the Dataset
We explore the dataset using the summary() function, which provides a statistical summary including measures like minimum, maximum, mean and quartiles for each column.
R
summary(data)
Output:
summary
4. Performing Data Preprocessing
We prepare the data by encoding the categorical variable Gender and scaling the continuous features Age and EstimatedSalary, then split it into training and test sets.
R
set.seed(123)

# Encode Gender numerically. Note: as.numeric() on a factor returns the
# underlying level codes (1 and 2 here), which is sufficient for the model.
data$Gender <- as.numeric(factor(data$Gender, levels = c("Male", "Female")))

# Standardize the continuous features
data[, c("Age", "EstimatedSalary")] <- scale(data[, c("Age", "EstimatedSalary")])

# 75/25 train/test split, stratified on the target
split <- sample.split(data$Purchased, SplitRatio = 0.75)
training_set <- subset(data, split == TRUE)
test_set <- subset(data, split == FALSE)
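Because sample.split() preserves the relative ratio of the target labels in both subsets, the class proportions of training and test sets should closely match. A quick optional sanity check:
R
# Compare the share of purchasers in each subset; the proportions should be similar
prop.table(table(training_set$Purchased))
prop.table(table(test_set$Purchased))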
5. Training the SVM Model
Now, we will train the SVM model using the svm() function. The model will predict whether a user purchased the product (Purchased) based on the features Age, EstimatedSalary and Gender.
R
classifier <- svm(Purchased ~ Age + EstimatedSalary + Gender,
                  data = training_set,
                  type = 'C-classification',  # classification rather than regression
                  kernel = 'radial',          # RBF kernel for a non-linear boundary
                  gamma = 0.1)                # kernel width parameter
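The choices kernel = 'radial' and gamma = 0.1 are reasonable defaults rather than tuned values. e1071 also ships a tune.svm() helper that grid-searches hyperparameters with cross-validation; here is a sketch (the candidate grids below are illustrative assumptions, not values from the tutorial):
R
# Grid-search gamma and cost with cross-validation (tune()'s default is 10-fold)
tuned <- tune.svm(Purchased ~ Age + EstimatedSalary + Gender,
                  data = training_set,
                  type = 'C-classification',
                  kernel = 'radial',
                  gamma = c(0.01, 0.1, 0.5, 1),
                  cost = c(0.1, 1, 10, 100))
summary(tuned)                      # cross-validated error for each combination
best_classifier <- tuned$best.model # refitted model with the best parameters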
6. Making Predictions
Once the model is trained, we can use it to make predictions on the test set.
R
y_pred <- predict(classifier, newdata = test_set)
table(test_set$Purchased, y_pred)  # rows: actual classes, columns: predictions
Output:
Confusion matrix
7. Evaluating the Model
We evaluate the model's performance using a confusion matrix, accuracy and related metrics such as precision, recall and F1-score; passing mode = "prec_recall" to caret's confusionMatrix() reports the latter directly.
R
cm <- table(test_set$Purchased, y_pred)
accuracy <- sum(diag(cm)) / sum(cm)
cat("Accuracy:", accuracy, "\n")
confusionMatrix(cm, mode = "prec_recall")  # also reports precision, recall and F1
Output:
Evaluation
8. Visualizing the Decision Boundary
We can also visualize the decision boundary using ggplot2. Here:
- X1: Creates a sequence for Age (with small steps).
- grid_set: Generates a grid of Age and EstimatedSalary combinations.
- grid_set$Gender: Sets default Gender using the median value.
- y_grid: Predicts the class for each grid point using the classifier.
- geom_tile: Fills grid cells with predicted class colors.
- geom_point: Plots training points with actual class colors.
- scale_fill_manual: Sets colors for predicted classes.
- scale_color_manual: Sets colors for actual training points.
R
X1 <- seq(min(training_set$Age) - 1, max(training_set$Age) + 1, by = 0.01)
X2 <- seq(min(training_set$EstimatedSalary) - 1, max(training_set$EstimatedSalary) + 1, by = 0.01)

# Name the grid columns to match the model's predictors, otherwise predict() fails
grid_set <- expand.grid(Age = X1, EstimatedSalary = X2)
grid_set$Gender <- median(training_set$Gender)  # fix Gender at its median for the 2-D plot
y_grid <- predict(classifier, newdata = grid_set)
ggplot() +
  geom_tile(data = grid_set, aes(x = Age, y = EstimatedSalary, fill = as.factor(y_grid)), alpha = 0.3) +
  geom_point(data = training_set, aes(x = Age, y = EstimatedSalary, color = as.factor(Purchased)), size = 3, shape = 21) +
  scale_fill_manual(values = c('coral1', 'aquamarine')) +
  scale_color_manual(values = c('green4', 'red3')) +
  labs(title = 'SVM Decision Boundary (Training set)', x = 'Age', y = 'Estimated Salary') +
  theme_minimal() +
  theme(legend.position = "none")
Output:
Decision Boundary
In this article we implemented the SVM algorithm in R, from data preparation and model training to evaluating its performance with accuracy, precision, recall and F1-score metrics.
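As a final usage sketch, the fitted classifier can score a single observation; here we simply reuse the first row of the test set, since it is already encoded and scaled the same way as the training data:
R
# Predict the class of one already-preprocessed observation
new_user <- test_set[1, c("Age", "EstimatedSalary", "Gender")]
predict(classifier, newdata = new_user)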