BIG DATA
ANALYTICS
UNIVERSITY OF MUMBAI
K.M.S.P. Mandal’s
Sant Rawool Maharaj Mahavidyalaya
Kudal, Dist. Sindhudurg
DEPARTMENT OF INFORMATION TECHNOLOGY
CERTIFICATE
This is to certify that
Mr./Miss. __________________________________________________
Exam Seat No. _________________, a student of Part-I M.Sc. (Information
Technology), has successfully completed the practicals prescribed by the
University of Mumbai at Sant Rawool Maharaj Mahavidyalaya during
semester ___ of the academic year 2024-25,
in the following topics:
______________________
________________________
Teacher In-Charge: External Examiner:
Date :
INDEX
Sr. No. Title Sign
1. Clustering model
2. Classification model
3. Regression model
4. Implement Decision tree classification techniques
5. Multiple regression model
6. Implement an application that stores big data in Hbase /
MongoDB and manipulate it using R / Python
K.M.S.P. Mandal’s                                    Date:
Sant Rawool Maharaj Mahavidyalaya, Kudal             Roll No:
Department of M.Sc. Information Technology           Expt No: 01    Signature:
Title: a. Clustering algorithms for unsupervised classification.
b. Plot the cluster data using R visualizations.
Clustering is an unsupervised learning technique used to group similar data points based
on features or patterns without labeled output. It helps in identifying natural groupings in
data, making it ideal for tasks like customer segmentation, image compression, or
document categorization.
In R, clustering is typically performed using algorithms like K-means, Hierarchical
clustering, and DBSCAN.
1. K-means Clustering
K-means partitions the dataset into k distinct clusters by minimizing the variance within
each cluster.
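The assign-then-update loop at the heart of k-means can be sketched in a few lines of plain Python (a toy illustration of the algorithm, not the implementation behind R's kmeans(); the data and the helper name kmeans are made up here):

```python
import math
import random

def kmeans(points, k, iters=20, seed=123):
    """Minimal k-means sketch: assign each point to its nearest centroid,
    then move each centroid to the mean of its members, and repeat."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # pick k data points as starts
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                        # assignment step
            i = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[i].append(p)
        for i, members in enumerate(clusters):  # update step
            if members:                         # keep old centroid if empty
                centroids[i] = tuple(sum(x) / len(members)
                                     for x in zip(*members))
    return centroids, clusters

# Two well-separated 2-D blobs: the centroids should land near each blob's mean.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(pts, k=2)
print(sorted(centroids))
```

Minimising within-cluster variance is exactly what the update step does: the mean is the point that minimises the summed squared distance to a cluster's members.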
2. Hierarchical Clustering
This method builds a tree-like structure (dendrogram) by merging or splitting clusters
step-by-step.
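The bottom-up (agglomerative) merging can be sketched in plain Python — a toy single-linkage illustration, not R's hclust(); the data and the helper name single_linkage are invented for the example:

```python
import math

def single_linkage(points, k):
    """Minimal agglomerative sketch: start with every point in its own
    cluster and repeatedly merge the two closest clusters (single linkage:
    distance between their nearest members) until only k remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = (math.inf, 0, 1)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(a, b)
                        for a in clusters[i] for b in clusters[j])
                if d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters[j])   # merge the closest pair
        del clusters[j]
    return clusters

# Two tight pairs: cutting the merge tree at k = 2 should recover them.
pts = [(0, 0), (0, 1), (5, 5), (5, 6)]
print(single_linkage(pts, k=2))
```

Recording the distance at which each merge happens is what produces the dendrogram: cutting the tree at a given height yields a clustering.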
3. DBSCAN (Density-Based Clustering)
DBSCAN groups points that are closely packed and marks outliers as noise. It doesn't
require the number of clusters to be specified.
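The core-point / noise distinction can be sketched in plain Python (a toy DBSCAN illustration with made-up data and parameter values, not the dbscan package R users would normally call):

```python
import math

def dbscan(points, eps=1.5, min_pts=3):
    """Minimal DBSCAN sketch: a point with at least min_pts neighbours
    within eps is a core point; clusters grow outward from core points;
    anything unreachable stays labelled -1 (noise)."""
    labels = [None] * len(points)
    cluster = 0

    def neighbours(i):  # indices within eps of point i (includes i itself)
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbours(i)
        if len(nbrs) < min_pts:
            labels[i] = -1              # provisionally noise
            continue
        labels[i] = cluster             # start a new cluster at this core
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster     # border point rescued from noise
            if labels[j] is not None:
                continue
            labels[j] = cluster
            jn = neighbours(j)
            if len(jn) >= min_pts:      # j is itself core: keep expanding
                queue.extend(jn)
        cluster += 1
    return labels

# Two dense blobs plus one far-away outlier, which should come back as -1.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
print(dbscan(pts))
```

Note that k never appears: the number of clusters falls out of the density parameters eps and min_pts, which is why DBSCAN needs no cluster count up front.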
Clustering in R provides a powerful way to explore hidden structures in data. Choosing
the right algorithm depends on:
• The shape and size of the data
• Whether you know the number of clusters
• How sensitive your application is to outliers
Each technique has its strengths and can be easily implemented and visualized using R’s
rich ecosystem of packages.
Code:
data(iris)
features <- iris[, 1:4]                     # the four numeric measurements
set.seed(123)                               # reproducible cluster assignment
k <- 3
kmeans_model <- kmeans(features, centers = k)
clustered_data <- cbind(iris, Cluster = as.factor(kmeans_model$cluster))
# Pairwise scatter plots coloured by cluster
pairs(features, col = clustered_data$Cluster, pch = 20)
# Interactive 3-D view of the first three features
if (!require(rgl)) { install.packages("rgl") }
library(rgl)
plot3d(features[, 1], features[, 2], features[, 3], col = clustered_data$Cluster)
Output:
K.M.S.P. Mandal’s                                    Date:
Sant Rawool Maharaj Mahavidyalaya, Kudal             Roll No:
Department of M.Sc. Information Technology           Expt No: 02    Signature:
Title: CLASSIFICATION MODEL a. Install relevant package for
classification. b. Choose classifier for classification problem. c. Evaluate the
performance of classifier.
a. Install Relevant Package for Classification
To perform classification in R, we use the caret package, which offers a unified interface
for a wide range of machine learning algorithms.
b. Choose Classifier for Classification Problem
For this classification task, we choose the Random Forest algorithm, a powerful ensemble
learning method based on decision trees. It is suitable for multi-class classification
problems and handles noisy or missing data well.
We use the built-in iris dataset, which includes 150 samples of iris flowers labeled by
species (Setosa, Versicolor, Virginica). The model is trained using the features (sepal and
petal dimensions) to predict the species.
c. Evaluate the Performance of the Classifier
After training the model, we evaluate its performance using a confusion matrix, which
shows the number of correct and incorrect predictions. We also calculate the overall
accuracy of the model. The confusion matrix helps assess the model's ability to distinguish
between different classes effectively.
Typical performance metrics include:
• Accuracy: The proportion of correct predictions.
• Sensitivity/Recall: Ability to correctly identify positive cases.
• Specificity: Ability to correctly identify negative cases.
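All three metrics come straight from the cells of a binary confusion matrix, as this short Python sketch shows (the counts are hypothetical, of the kind a confusionMatrix() printout would report):

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, sensitivity (recall) and specificity from the four
    cells of a binary confusion matrix."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)  # all correct / all cases
    sensitivity = tp / (tp + fn)                # true-positive rate
    specificity = tn / (tn + fp)                # true-negative rate
    return accuracy, sensitivity, specificity

# Hypothetical counts: 40 true positives, 5 false positives,
# 10 false negatives, 45 true negatives.
acc, sens, spec = binary_metrics(tp=40, fp=5, fn=10, tn=45)
print(acc, sens, spec)
```

For a multi-class problem such as iris, caret reports sensitivity and specificity per class in the same one-vs-rest fashion.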
Code:
# Install caret if missing, then load it
if (!require(caret)) { install.packages("caret") }
library(caret)
data(iris)
set.seed(123)                     # reproducible train/test split
# 70/30 stratified split on the Species label
trainIndex <- createDataPartition(iris$Species, p = 0.7, list = FALSE)
train <- iris[trainIndex, ]
test <- iris[-trainIndex, ]
# Train a Random Forest classifier on all four features
model <- train(Species ~ ., data = train, method = "rf")
print(model)
predictions <- predict(model, newdata = test)
confusionMatrix(predictions, test$Species)
Output:
K.M.S.P. Mandal’s                                    Date:
Sant Rawool Maharaj Mahavidyalaya, Kudal             Roll No:
Department of M.Sc. Information Technology           Expt No: 03    Signature:
Title: REGRESSION MODEL Import data from web storage. Name the
dataset and perform logistic regression to find the relation between the
variables affecting a student's admission to an institute based on his or her
GRE score, GPA, and rank. Also check whether the model fits.
require(foreign), require(MASS).
The goal of this project is to determine whether a student will be admitted to a graduate
program based on their GRE score, GPA, and rank of the undergraduate institution using
logistic regression.
Steps Performed:
1. Loading Required Libraries:
o We used packages like foreign, MASS, and pscl for regression and
evaluation.
2. Importing Data:
o The data was loaded from a local CSV file.
3. Data Preprocessing:
o The Rank column was converted to a factor since it is categorical.
o Columns were renamed for better readability if needed.
4. Model Building:
o A logistic regression model was built using the glm() function.
5. Model Evaluation:
o Summary statistics provided coefficient estimates and p-values to assess
the significance of predictors.
o The null deviance and residual deviance indicated overall model fit.
o McFadden's pseudo R-squared was calculated using pscl::pR2() to
evaluate model performance.
6. Predictions:
o The model was used to compute predicted probabilities of admission for
each student.
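The predicted probabilities in step 6 come from pushing the model's linear combination of predictors through the logistic (sigmoid) link, which is what predict(type = "response") does. A plain-Python sketch (the coefficient values below are made up for illustration, not fitted estimates):

```python
import math

def predict_prob(intercept, coefs, x):
    """Logistic-regression prediction: linear predictor z = b0 + b·x
    mapped through the sigmoid 1 / (1 + e^-z) to a probability in (0, 1)."""
    z = intercept + sum(b * v for b, v in zip(coefs, x))
    return 1 / (1 + math.exp(-z))

# Illustrative (made-up) coefficients: intercept, then GRE and GPA slopes,
# applied to a student with GRE = 600 and GPA = 3.5.
p = predict_prob(-4.0, [0.003, 0.8], [600, 3.5])
print(round(p, 4))
```

Each glm() coefficient is on the log-odds scale, which is why the sigmoid is needed to turn the linear predictor back into a probability.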
Code:
# Step 1: Install and load required packages
install.packages("foreign")
install.packages("MASS")
install.packages("pscl")
library(foreign)
library(MASS)
library(pscl)
# Step 2: Import the admission data (adjust the path/filename to your CSV)
admission_data <- read.csv("D:/MSc-IT/Sem 2/Bigdata/admission.csv")
colnames(admission_data) <- c("admit", "GRE", "GPA", "Rank")
admission_data$Rank <- factor(admission_data$Rank)   # Rank is categorical
# Step 3: Fit the logistic regression model
logit_model <- glm(admit ~ GRE + GPA + Rank, data = admission_data,
                   family = binomial)
summary(logit_model)
# Step 4: Deviance-based fit diagnostics
cat("Null Deviance:", logit_model$null.deviance, "\n")
cat("Residual Deviance:", logit_model$deviance, "\n")
pR2(logit_model)   # McFadden's pseudo R-squared
# Step 5: Predicted admission probabilities
admission_data$predicted_prob <- predict(logit_model, type = "response")
head(admission_data)
Output:
K.M.S.P. Mandal’s                                    Date:
Sant Rawool Maharaj Mahavidyalaya, Kudal             Roll No:
Department of M.Sc. Information Technology           Expt No: 04    Signature:
Title: Implement Decision tree classification techniques
To build a classification model that predicts the species of an iris flower using the Decision
Tree algorithm based on the features Sepal.Length, Sepal.Width, Petal.Length, and
Petal.Width.
Steps Performed:
1. Library Installation and Loading:
The party package, which provides tools for building conditional inference trees,
was installed and loaded.
2. Dataset Used:
The built-in iris dataset was used. It contains 150 observations with 5 variables:
o Sepal.Length
o Sepal.Width
o Petal.Length
o Petal.Width
o Species (target variable with 3 classes: Setosa, Versicolor, Virginica)
3. Building the Model:
A Decision Tree model was built using the ctree() function:
4. Model Interpretation:
The tree was printed in text format using:
print(iris_ctree)
This displayed the rules and split conditions used for classification.
5. Visualization:
plot(iris_ctree)
plot(iris_ctree, type = "simple")
Code:
# Step 1: Install and load required package
install.packages("party")   # only run once if not already installed
library(party)
# Step 2: Load and inspect the dataset
data(iris)
str(iris)                   # view the structure of the dataset
# Step 3: Build the Decision Tree model
iris_ctree <- ctree(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
                    data = iris)
print(iris_ctree)           # text form of the fitted tree
# Step 4: Visualize the tree
plot(iris_ctree)
plot(iris_ctree, type = "simple")
Output:
K.M.S.P. Mandal’s                                    Date:
Sant Rawool Maharaj Mahavidyalaya, Kudal             Roll No:
Department of M.Sc. Information Technology           Expt No: 05    Signature:
Title: MULTIPLE REGRESSION MODEL Apply multiple regressions, if data
have a continuous independent variable. Apply on above dataset.
Steps Involved:
1. Library Installation and Loading:
o The foreign and MASS packages are used to ensure compatibility and
statistical functions.
o rgl is used for interactive 3D plotting.
2. Data Generation:
o A synthetic dataset of 100 observations is generated.
o Variables include:
▪ gre: Graduate Record Examination score (normally distributed
around 600).
▪ gpa: Grade Point Average (normally distributed around 3.5).
▪ rank: Rank of the undergraduate institution (randomly sampled from
1 to 4).
▪ admitted: Binary target variable (0 or 1), randomly assigned.
3. Model Building:
o A multiple linear regression model is trained using lm() with predictors:
gre, gpa, and rank.
o rank is treated as a categorical variable using as.factor(rank).
4. Model Summary:
o The regression summary provides coefficients, R-squared, and p-values to
interpret the model’s performance and predictor significance.
5. Visualization:
o A 3D plot is generated using rgl::plot3d() to visualize the relationship
between GRE, GPA, and admission likelihood.
o A regression plane is estimated and overlaid to indicate the decision
boundary.
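The coefficients that lm() reports are the least-squares solution of the normal equations. The estimate can be sketched in plain Python (a toy illustration with made-up data, not the QR-based routine R actually uses):

```python
def ols(X, y):
    """Least-squares fit via the normal equations (X'X) b = X'y,
    solved here with plain Gaussian elimination."""
    n, p = len(X), len(X[0])
    XtX = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(p)]
           for a in range(p)]
    Xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(p)]
    # Forward elimination with partial pivoting
    for col in range(p):
        piv = max(range(col, p), key=lambda r: abs(XtX[r][col]))
        XtX[col], XtX[piv] = XtX[piv], XtX[col]
        Xty[col], Xty[piv] = Xty[piv], Xty[col]
        for r in range(col + 1, p):
            f = XtX[r][col] / XtX[col][col]
            for c in range(col, p):
                XtX[r][c] -= f * XtX[col][c]
            Xty[r] -= f * Xty[col]
    # Back substitution
    b = [0.0] * p
    for r in range(p - 1, -1, -1):
        b[r] = (Xty[r] - sum(XtX[r][c] * b[c]
                             for c in range(r + 1, p))) / XtX[r][r]
    return b

# y = 1 + 2*x1 + 3*x2 exactly, so the fit should recover [1, 2, 3].
X = [[1, 0, 0], [1, 1, 0], [1, 0, 1], [1, 1, 1], [1, 2, 1]]
y = [1 + 2 * x1 + 3 * x2 for _, x1, x2 in X]
print([round(v, 6) for v in ols(X, y)])
```

The leading column of ones plays the role of the intercept, just as lm() adds one implicitly; a factor like rank would contribute extra 0/1 dummy columns.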
Code:
if (!require(foreign)) { install.packages("foreign") }
if (!require(MASS)) { install.packages("MASS") }
if (!require(rgl)) { install.packages("rgl") }
library(foreign)
library(MASS)
library(rgl)
set.seed(123)                          # reproducible synthetic data
n <- 100
gre <- round(rnorm(n, mean = 600, sd = 100))
gpa <- round(rnorm(n, mean = 3.5, sd = 0.5), 2)
rank <- sample(1:4, n, replace = TRUE)
admission_data <- data.frame(admitted = sample(0:1, n, replace = TRUE),
                             gre = gre, gpa = gpa, rank = rank)
# Multiple regression with rank as a categorical predictor
multi_reg_model <- lm(admitted ~ gre + gpa + as.factor(rank), data = admission_data)
summary(multi_reg_model)
# Coefficients of the fitted plane a*gre + b*gpa + c*z + d = 0
a <- coef(multi_reg_model)[2]
b <- coef(multi_reg_model)[3]
c <- -1
d <- coef(multi_reg_model)[1]
x_grid <- seq(min(admission_data$gre), max(admission_data$gre), length = 30)
y_grid <- seq(min(admission_data$gpa), max(admission_data$gpa), length = 30)
xy_grid <- expand.grid(gre = x_grid, gpa = y_grid)
z_grid <- with(xy_grid, (-a * gre - b * gpa - d) / c)
# 3-D scatter of the data with the fitted regression plane overlaid
plot3d(admission_data$gre, admission_data$gpa, admission_data$admitted,
       type = "s", col = "blue", xlab = "GRE Score", ylab = "GPA", zlab = "Admitted")
surface3d(x_grid, y_grid, matrix(z_grid, nrow = 30), col = "grey", alpha = 0.4)
Output:
K.M.S.P. Mandal’s                                    Date:
Sant Rawool Maharaj Mahavidyalaya, Kudal             Roll No:
Department of M.Sc. Information Technology           Expt No: 06    Signature:
Title: Implement an application that stores big data in Hbase / MongoDB and
manipulate it using R / Python
The goal of this project is to implement an application that:
• Stores big data in a NoSQL database (either MongoDB or HBase)
• Processes and manipulates this data using R or Python
• Demonstrates how scalable storage systems and high-level programming languages
can be integrated to perform data analytics
System Architecture:
1. Data Source: Big data from a structured or semi-structured file (CSV, JSON).
2. Database Layer: Stores data in either:
o MongoDB: Document-based NoSQL DB, ideal for semi-structured data
o HBase: Column-oriented distributed DB built on top of HDFS (Hadoop)
3. Application Layer:
o Uses Python or R for:
▪ Data cleaning and transformation
▪ Data insertion into the DB
▪ Querying and filtering
▪ Simple analytics or reporting
Code:
import pandas as pd
from pymongo import MongoClient

df = pd.read_csv('people.csv')                      # load the CSV (use your own file name)
client = MongoClient('mongodb://localhost:27017/')  # connect to a local MongoDB instance
db = client['mydb']
collection = db['people']
records = df.to_dict(orient='records')              # DataFrame rows -> list of dicts
collection.insert_many(records)                     # bulk insert into the collection
# Verify insertion by reading every document back
for person in collection.find():
    print(person)
Output: