BIG DATA
ANALYTICS
UNIVERSITY OF MUMBAI
K.M.S.P. Mandal’s
Sant Rawool Maharaj Mahavidyalaya
Kudal, Dist. Sindhudurg
DEPARTMENT OF INFORMATION TECHNOLOGY
CERTIFICATE
This is to certify that
Mr./Miss. __________________________________________________
Exam Seat No. _________________, a student of Part-I M.Sc. (Information
Technology), has successfully completed the practicals prescribed by the
University of Mumbai at Sant Rawool Maharaj Mahavidyalaya during
semester ___ of the academic year 2024-25,
in the following topics:
______________________
________________________
Teacher In-Charge: External Examiner:
Date :
INDEX
Sr. No. Title Sign
1. Clustering model
2. Classification model
3. Regression model
4. Implement Decision tree classification techniques
5. Multiple regression model
6. Implement an application that stores big data in Hbase /
MongoDB and manipulate it using R / Python
K.M.S.P. Mandal’s                                    Date:
Sant Rawool Maharaj Mahavidyalaya, Kudal             Roll No:
Department of M.Sc. Information Technology           Expt No: 01    Signature:
Title: a. Clustering algorithms for unsupervised classification.
b. Plot the cluster data using R visualizations.
Clustering is an unsupervised learning technique used to group similar data points based
on features or patterns without labeled output. It helps in identifying natural groupings in
data, making it ideal for tasks like customer segmentation, image compression, or
document categorization.
In R, clustering is typically performed using algorithms like K-means, Hierarchical
clustering, and DBSCAN.
1. K-means Clustering
K-means partitions the dataset into k distinct clusters by minimizing the variance within
each cluster.
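The assign-then-update loop at the heart of k-means can be sketched in a few lines of plain Python (a toy illustration of the algorithm, not the implementation behind R's kmeans(); the data and the helper name kmeans are made up here):

```python
import math
import random

def kmeans(points, k, iters=20, seed=123):
    """Minimal k-means sketch: assign each point to its nearest centroid,
    then move each centroid to the mean of its members, and repeat."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # pick k data points as starts
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                        # assignment step
            i = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[i].append(p)
        for i, members in enumerate(clusters):  # update step
            if members:                         # keep old centroid if empty
                centroids[i] = tuple(sum(x) / len(members)
                                     for x in zip(*members))
    return centroids, clusters

# Two well-separated 2-D blobs: the centroids should land near each blob's mean.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(pts, k=2)
print(sorted(centroids))
```

Minimising within-cluster variance is exactly what the update step does: the mean is the point that minimises the summed squared distance to a cluster's members.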
2. Hierarchical Clustering
This method builds a tree-like structure (dendrogram) by merging or splitting clusters
step-by-step.
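The bottom-up (agglomerative) merging can be sketched in plain Python — a toy single-linkage illustration, not R's hclust(); the data and the helper name single_linkage are invented for the example:

```python
import math

def single_linkage(points, k):
    """Minimal agglomerative sketch: start with every point in its own
    cluster and repeatedly merge the two closest clusters (single linkage:
    distance between their nearest members) until only k remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = (math.inf, 0, 1)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(a, b)
                        for a in clusters[i] for b in clusters[j])
                if d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters[j])   # merge the closest pair
        del clusters[j]
    return clusters

# Two tight pairs: cutting the merge tree at k = 2 should recover them.
pts = [(0, 0), (0, 1), (5, 5), (5, 6)]
print(single_linkage(pts, k=2))
```

Recording the distance at which each merge happens is what produces the dendrogram: cutting the tree at a given height yields a clustering.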
3. DBSCAN (Density-Based Clustering)
DBSCAN groups points that are closely packed and marks outliers as noise. It doesn't
require the number of clusters to be specified.
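The core-point / noise distinction can be sketched in plain Python (a toy DBSCAN illustration with made-up data and parameter values, not the dbscan package R users would normally call):

```python
import math

def dbscan(points, eps=1.5, min_pts=3):
    """Minimal DBSCAN sketch: a point with at least min_pts neighbours
    within eps is a core point; clusters grow outward from core points;
    anything unreachable stays labelled -1 (noise)."""
    labels = [None] * len(points)
    cluster = 0

    def neighbours(i):  # indices within eps of point i (includes i itself)
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbours(i)
        if len(nbrs) < min_pts:
            labels[i] = -1              # provisionally noise
            continue
        labels[i] = cluster             # start a new cluster at this core
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster     # border point rescued from noise
            if labels[j] is not None:
                continue
            labels[j] = cluster
            jn = neighbours(j)
            if len(jn) >= min_pts:      # j is itself core: keep expanding
                queue.extend(jn)
        cluster += 1
    return labels

# Two dense blobs plus one far-away outlier, which should come back as -1.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
print(dbscan(pts))
```

Note that k never appears: the number of clusters falls out of the density parameters eps and min_pts, which is why DBSCAN needs no cluster count up front.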
Clustering in R provides a powerful way to explore hidden structures in data. Choosing
the right algorithm depends on:
• The shape and size of the data
• Whether you know the number of clusters
• How sensitive your application is to outliers
Each technique has its strengths and can be easily implemented and visualized using R’s
rich ecosystem of packages.
Code:
data(iris)
features <- iris[, 1:4]                     # the four numeric measurements
set.seed(123)                               # reproducible cluster assignment
k <- 3
kmeans_model <- kmeans(features, centers = k)
clustered_data <- cbind(iris, Cluster = as.factor(kmeans_model$cluster))
# Pairwise scatter plots coloured by cluster
pairs(features, col = clustered_data$Cluster, pch = 20)
# Interactive 3-D view of the first three features
if (!require(rgl)) { install.packages("rgl") }
library(rgl)
plot3d(features[, 1], features[, 2], features[, 3], col = clustered_data$Cluster)
Output:
K.M.S.P. Mandal’s                                    Date:
Sant Rawool Maharaj Mahavidyalaya, Kudal             Roll No:
Department of M.Sc. Information Technology           Expt No: 02    Signature:
Title: CLASSIFICATION MODEL a. Install relevant package for
classification. b. Choose classifier for classification problem. c. Evaluate the
performance of classifier.
a. Install Relevant Package for Classification
To perform classification in R, we use the caret package, which offers a unified interface
for a wide range of machine learning algorithms.
b. Choose Classifier for Classification Problem
For this classification task, we choose the Random Forest algorithm, a powerful ensemble
learning method based on decision trees. It is suitable for multi-class classification
problems and handles noisy or missing data well.
We use the built-in iris dataset, which includes 150 samples of iris flowers labeled by
species (Setosa, Versicolor, Virginica). The model is trained using the features (sepal and
petal dimensions) to predict the species.
c. Evaluate the Performance of the Classifier
After training the model, we evaluate its performance using a confusion matrix, which
shows the number of correct and incorrect predictions. We also calculate the overall
accuracy of the model. The confusion matrix helps assess the model's ability to distinguish
between different classes effectively.
Typical performance metrics include:
• Accuracy: The proportion of correct predictions.
• Sensitivity/Recall: Ability to correctly identify positive cases.
• Specificity: Ability to correctly identify negative cases.
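All three metrics come straight from the cells of a binary confusion matrix, as this short Python sketch shows (the counts are hypothetical, of the kind a confusionMatrix() printout would report):

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, sensitivity (recall) and specificity from the four
    cells of a binary confusion matrix."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)  # all correct / all cases
    sensitivity = tp / (tp + fn)                # true-positive rate
    specificity = tn / (tn + fp)                # true-negative rate
    return accuracy, sensitivity, specificity

# Hypothetical counts: 40 true positives, 5 false positives,
# 10 false negatives, 45 true negatives.
acc, sens, spec = binary_metrics(tp=40, fp=5, fn=10, tn=45)
print(acc, sens, spec)
```

For a multi-class problem such as iris, caret reports sensitivity and specificity per class in the same one-vs-rest fashion.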
Code:
# Install caret if missing, then load it
if (!require(caret)) { install.packages("caret") }
library(caret)
data(iris)
set.seed(123)                     # reproducible train/test split
# 70/30 stratified split on the Species label
trainIndex <- createDataPartition(iris$Species, p = 0.7, list = FALSE)
train <- iris[trainIndex, ]
test <- iris[-trainIndex, ]
# Train a Random Forest classifier on all four features
model <- train(Species ~ ., data = train, method = "rf")
print(model)
predictions <- predict(model, newdata = test)
confusionMatrix(predictions, test$Species)
Output:
K.M.S.P. Mandal’s                                    Date:
Sant Rawool Maharaj Mahavidyalaya, Kudal             Roll No:
Department of M.Sc. Information Technology           Expt No: 03    Signature:
Title: REGRESSION MODEL Import data from web storage. Name the
dataset and perform logistic regression to find the relation between the
variables affecting a student's admission to an institute based on his or her
GRE score, GPA, and rank. Also check whether the model fits.
require(foreign), require(MASS).
The goal of this project is to determine whether a student will be admitted to a graduate
program based on their GRE score, GPA, and rank of the undergraduate institution using
logistic regression.
Steps Performed:
1. Loading Required Libraries:
o We used packages like foreign, MASS, and pscl for regression and
evaluation.
2. Importing Data:
o The data was loaded from a local CSV file.
3. Data Preprocessing:
o The Rank column was converted to a factor since it is categorical.
o Columns were renamed for better readability if needed.
4. Model Building:
o A logistic regression model was built using the glm() function.
5. Model Evaluation:
o Summary statistics provided coefficient estimates and p-values to assess
the significance of predictors.
o The null deviance and residual deviance indicated overall model fit.
o McFadden's pseudo R-squared was calculated using pscl::pR2() to
evaluate model performance.
6. Predictions:
o The model was used to compute predicted probabilities of admission for
each student.
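The predicted probabilities in step 6 come from pushing the model's linear combination of predictors through the logistic (sigmoid) link, which is what predict(type = "response") does. A plain-Python sketch (the coefficient values below are made up for illustration, not fitted estimates):

```python
import math

def predict_prob(intercept, coefs, x):
    """Logistic-regression prediction: linear predictor z = b0 + b·x
    mapped through the sigmoid 1 / (1 + e^-z) to a probability in (0, 1)."""
    z = intercept + sum(b * v for b, v in zip(coefs, x))
    return 1 / (1 + math.exp(-z))

# Illustrative (made-up) coefficients: intercept, then GRE and GPA slopes,
# applied to a student with GRE = 600 and GPA = 3.5.
p = predict_prob(-4.0, [0.003, 0.8], [600, 3.5])
print(round(p, 4))
```

Each glm() coefficient is on the log-odds scale, which is why the sigmoid is needed to turn the linear predictor back into a probability.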
Code:
# Step 1: Install and load required packages
install.packages("foreign")
install.packages("MASS")
install.packages("pscl")
library(foreign)
library(MASS)
library(pscl)
# Step 2: Import the admission data (adjust the path/filename to your CSV)
admission_data <- read.csv("D:/MSc-IT/Sem 2/Bigdata/admission.csv")
colnames(admission_data) <- c("admit", "GRE", "GPA", "Rank")
admission_data$Rank <- factor(admission_data$Rank)   # Rank is categorical
# Step 3: Fit the logistic regression model
logit_model <- glm(admit ~ GRE + GPA + Rank, data = admission_data,
                   family = binomial)
summary(logit_model)
# Step 4: Deviance-based fit diagnostics
cat("Null Deviance:", logit_model$null.deviance, "\n")
cat("Residual Deviance:", logit_model$deviance, "\n")
pR2(logit_model)   # McFadden's pseudo R-squared
# Step 5: Predicted admission probabilities
admission_data$predicted_prob <- predict(logit_model, type = "response")
head(admission_data)
Output:
K.M.S.P. Mandal’s                                    Date:
Sant Rawool Maharaj Mahavidyalaya, Kudal             Roll No:
Department of M.Sc. Information Technology           Expt No: 04    Signature:
Title: Implement Decision tree classification techniques
To build a classification model that predicts the species of an iris flower using the Decision
Tree algorithm based on the features Sepal.Length, Sepal.Width, Petal.Length, and
Petal.Width.
Steps Performed:
1. Library Installation and Loading:
The party package, which provides tools for building conditional inference trees,
was installed and loaded.
2. Dataset Used:
The built-in iris dataset was used. It contains 150 observations with 5 variables:
o Sepal.Length
o Sepal.Width
o Petal.Length
o Petal.Width
o Species (target variable with 3 classes: Setosa, Versicolor, Virginica)
3. Building the Model:
A Decision Tree model was built using the ctree() function:
4. Model Interpretation:
The tree was printed in text format using:
print(iris_ctree)
This displayed the rules and split conditions used for classification.
5. Visualization:
plot(iris_ctree)
plot(iris_ctree, type = "simple")
Code:
# Step 1: Install and load required package
install.packages("party")   # only run once if not already installed
library(party)
# Step 2: Load and inspect the dataset
data(iris)
str(iris)                   # view the structure of the dataset
# Step 3: Build the Decision Tree model
iris_ctree <- ctree(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
                    data = iris)
print(iris_ctree)           # text form of the fitted tree
# Step 4: Visualize the tree
plot(iris_ctree)
plot(iris_ctree, type = "simple")
Output:
K.M.S.P. Mandal’s                                    Date:
Sant Rawool Maharaj Mahavidyalaya, Kudal             Roll No:
Department of M.Sc. Information Technology           Expt No: 05    Signature:
Title: MULTIPLE REGRESSION MODEL Apply multiple regressions, if data
have a continuous independent variable. Apply on above dataset.
Steps Involved:
1. Library Installation and Loading:
o The foreign and MASS packages are used to ensure compatibility and
statistical functions.
o rgl is used for interactive 3D plotting.
2. Data Generation:
o A synthetic dataset of 100 observations is generated.
o Variables include:
▪ gre: Graduate Record Examination score (normally distributed
around 600).
▪ gpa: Grade Point Average (normally distributed around 3.5).
▪ rank: Rank of the undergraduate institution (randomly sampled from
1 to 4).
▪ admitted: Binary target variable (0 or 1), randomly assigned.
3. Model Building:
o A multiple linear regression model is trained using lm() with predictors:
gre, gpa, and rank.
o rank is treated as a categorical variable using as.factor(rank).
4. Model Summary:
o The regression summary provides coefficients, R-squared, and p-values to
interpret the model’s performance and predictor significance.
5. Visualization:
o A 3D plot is generated using rgl::plot3d() to visualize the relationship
between GRE, GPA, and admission likelihood.
o A regression plane is estimated and overlaid to indicate the decision
boundary.
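The coefficients that lm() reports are the least-squares solution of the normal equations. The estimate can be sketched in plain Python (a toy illustration with made-up data, not the QR-based routine R actually uses):

```python
def ols(X, y):
    """Least-squares fit via the normal equations (X'X) b = X'y,
    solved here with plain Gaussian elimination."""
    n, p = len(X), len(X[0])
    XtX = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(p)]
           for a in range(p)]
    Xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(p)]
    # Forward elimination with partial pivoting
    for col in range(p):
        piv = max(range(col, p), key=lambda r: abs(XtX[r][col]))
        XtX[col], XtX[piv] = XtX[piv], XtX[col]
        Xty[col], Xty[piv] = Xty[piv], Xty[col]
        for r in range(col + 1, p):
            f = XtX[r][col] / XtX[col][col]
            for c in range(col, p):
                XtX[r][c] -= f * XtX[col][c]
            Xty[r] -= f * Xty[col]
    # Back substitution
    b = [0.0] * p
    for r in range(p - 1, -1, -1):
        b[r] = (Xty[r] - sum(XtX[r][c] * b[c]
                             for c in range(r + 1, p))) / XtX[r][r]
    return b

# y = 1 + 2*x1 + 3*x2 exactly, so the fit should recover [1, 2, 3].
X = [[1, 0, 0], [1, 1, 0], [1, 0, 1], [1, 1, 1], [1, 2, 1]]
y = [1 + 2 * x1 + 3 * x2 for _, x1, x2 in X]
print([round(v, 6) for v in ols(X, y)])
```

The leading column of ones plays the role of the intercept, just as lm() adds one implicitly; a factor like rank would contribute extra 0/1 dummy columns.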
Code:
if (!require(foreign)) { install.packages("foreign") }
if (!require(MASS)) { install.packages("MASS") }
if (!require(rgl)) { install.packages("rgl") }
library(foreign)
library(MASS)
library(rgl)
set.seed(123)                          # reproducible synthetic data
n <- 100
gre <- round(rnorm(n, mean = 600, sd = 100))
gpa <- round(rnorm(n, mean = 3.5, sd = 0.5), 2)
rank <- sample(1:4, n, replace = TRUE)
admission_data <- data.frame(admitted = sample(0:1, n, replace = TRUE),
                             gre = gre, gpa = gpa, rank = rank)
# Multiple regression with rank as a categorical predictor
multi_reg_model <- lm(admitted ~ gre + gpa + as.factor(rank), data = admission_data)
summary(multi_reg_model)
# Coefficients of the fitted plane a*gre + b*gpa + c*z + d = 0
a <- coef(multi_reg_model)[2]
b <- coef(multi_reg_model)[3]
c <- -1
d <- coef(multi_reg_model)[1]
x_grid <- seq(min(admission_data$gre), max(admission_data$gre), length = 30)
y_grid <- seq(min(admission_data$gpa), max(admission_data$gpa), length = 30)
xy_grid <- expand.grid(gre = x_grid, gpa = y_grid)
z_grid <- with(xy_grid, (-a * gre - b * gpa - d) / c)
# 3-D scatter of the data with the fitted regression plane overlaid
plot3d(admission_data$gre, admission_data$gpa, admission_data$admitted,
       type = "s", col = "blue", xlab = "GRE Score", ylab = "GPA", zlab = "Admitted")
surface3d(x_grid, y_grid, matrix(z_grid, nrow = 30), col = "grey", alpha = 0.4)
Output:
K.M.S.P. Mandal’s                                    Date:
Sant Rawool Maharaj Mahavidyalaya, Kudal             Roll No:
Department of M.Sc. Information Technology           Expt No: 06    Signature:
Title: Implement an application that stores big data in Hbase / MongoDB and
manipulate it using R / Python
The goal of this project is to implement an application that:
• Stores big data in a NoSQL database (either MongoDB or HBase)
• Processes and manipulates this data using R or Python
• Demonstrates how scalable storage systems and high-level programming languages
can be integrated to perform data analytics
System Architecture:
1. Data Source: Big data from a structured or semi-structured file (CSV, JSON).
2. Database Layer: Stores data in either:
o MongoDB: Document-based NoSQL DB, ideal for semi-structured data
o HBase: Column-oriented distributed DB built on top of HDFS (Hadoop)
3. Application Layer:
o Uses Python or R for:
▪ Data cleaning and transformation
▪ Data insertion into the DB
▪ Querying and filtering
▪ Simple analytics or reporting
Code:
import pandas as pd
from pymongo import MongoClient

df = pd.read_csv('people.csv')                      # load the CSV (use your own file name)
client = MongoClient('mongodb://localhost:27017/')  # connect to a local MongoDB instance
db = client['mydb']
collection = db['people']
records = df.to_dict(orient='records')              # DataFrame rows -> list of dicts
collection.insert_many(records)                     # bulk insert into the collection
# Verify insertion by reading every document back
for person in collection.find():
    print(person)
Output: