DAV practical 2

MU DAV

Uploaded by

mgade3012

Subject: Data Analytics & Visualization Course Code: CSL-601

Semester: 6 Course: AI & DS

Laboratory No: 315 Name of Subject Teacher: Rajesh Morey
Name of Student: Meghana Gade Roll Id: VU2S2223002

Practical Number 2
Title of Practical Data Exploration: Knowing the data, Data preparation and Cleaning.

Prior Concepts:
Exploring data is a crucial step in understanding its characteristics, trends, and
underlying patterns. Data exploration draws on a range of techniques and tools to
gain insight into a dataset. The main steps of a data exploration are outlined
below.

1. Data Collection and Understanding:

- Collect the dataset you want to explore. This could be a CSV file, database,
or any structured/unstructured data source.
- Understand the data sources, variables, and their meanings. Look into data
dictionaries or metadata to comprehend what each column represents.

2. Data Cleaning and Preprocessing:

- Check for missing values, duplicates, and outliers in the dataset.
- Handle missing values by imputation or deletion, depending on the nature of the
data.
- Normalize or standardize data if necessary for certain algorithms or analyses.

3. Statistical Summaries and Visualizations:

- Calculate descriptive statistics (mean, median, mode, standard deviation,
etc.) for numerical variables.
- Generate frequency counts for categorical variables.
- Create visualizations like histograms, box plots, scatter plots, and
heatmaps to understand the distribution and relationships between variables.
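
As a minimal sketch of the summaries above, the following uses R's built-in `mtcars` dataset as a stand-in for your own data. The helper `stat_mode()` is a hypothetical name introduced here, since base R has no function for the statistical mode.

```r
# Illustrative only: mtcars ships with base R and stands in for your dataset
df <- mtcars

# Descriptive statistics for one numeric variable
mpg_mean   <- mean(df$mpg)
mpg_median <- median(df$mpg)
mpg_sd     <- sd(df$mpg)

# Hypothetical helper: most frequent value (base R's mode() reports the storage type instead)
stat_mode <- function(x) {
  counts <- table(x)
  names(counts)[which.max(counts)]
}
cyl_mode <- stat_mode(df$cyl)

# Frequency counts for a categorical variable (number of cylinders)
cyl_counts <- table(df$cyl)
print(cyl_counts)
```

The same values can be passed to `hist()` or `boxplot()` to inspect the distribution visually.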

4. Exploratory Data Analysis (EDA):

- Conduct correlation analysis to identify relationships between variables.
- Perform dimensionality reduction techniques like PCA (Principal
Component Analysis) or t-SNE (t-Distributed Stochastic Neighbor Embedding) for
visualization in lower dimensions.
- Perform cluster analysis to identify natural groupings within the data.
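
The three EDA techniques above can be sketched with base R alone. This example assumes the built-in `iris` dataset, and the choice of three clusters is an assumption made purely for illustration.

```r
# Illustrative only: iris ships with base R; its first four columns are numeric
num <- iris[, 1:4]

# Correlation matrix between the numeric variables
cor_matrix <- cor(num)

# PCA on centred and scaled variables
pca <- prcomp(num, center = TRUE, scale. = TRUE)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)  # share of variance per component

# k-means clustering on the scaled data (k = 3 is an assumption)
set.seed(123)
clusters <- kmeans(scale(num), centers = 3, nstart = 10)

# Cross-tabulate cluster membership against the known species labels
print(table(clusters$cluster, iris$Species))
```

Scaling before PCA and k-means keeps variables measured on larger ranges from dominating the result.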

5. Hypothesis Testing and Feature Engineering:

- Formulate hypotheses about relationships or patterns within the data.
- Perform hypothesis tests to validate or reject these hypotheses.
- Create new features by transforming existing ones to improve model
performance if working on a predictive modeling task.
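
A small sketch of a hypothesis test and a derived feature, again assuming the built-in `mtcars` dataset. The hypothesis and the feature name `power_to_weight` are illustrative choices, not part of the practical.

```r
# Hypothesis: mean mpg differs between automatic (am = 0) and manual (am = 1) cars
result <- t.test(mpg ~ am, data = mtcars)  # Welch two-sample t-test by default
p_value <- result$p.value                  # a small p-value is evidence against equal means

# Feature engineering: derive a new feature from two existing columns
df <- mtcars
df$power_to_weight <- df$hp / df$wt
```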

6. Interactive Exploration and Tools:

- Utilize tools like Jupyter Notebooks, Pandas, Matplotlib, Seaborn, Plotly,
or Tableau for interactive exploration.
- Utilize widgets or interactive visualization libraries for dynamic
exploration of data subsets or trends.

7. Documentation and Communication:

- Document all findings, insights, and assumptions made during exploration.
- Create visualizations, summaries, and reports to communicate the insights gained.

8. Iterative Process:
- Data exploration is often iterative. Revisit steps, try different techniques,
and compare results to gain a comprehensive understanding of the dataset.

9. Ethical Considerations:
- Ensure ethical use of data, especially regarding privacy, biases, and the
implications of insights drawn from the data.

New Concept:
Data loading in R

1. CSV File

# Load a CSV file
data <- read.csv("your_file.csv")

# View the first few rows of the dataset
head(data)

# View summary statistics of the dataset
summary(data)

# Check the structure of the dataset
str(data)

# Check the dimensions of the dataset
dim(data)

2. EXCEL File
# Load an Excel file (assuming the 'readxl' package is installed)
library(readxl)
data <- read_excel("your_file.xlsx")

# View the first few rows of the dataset
head(data)

# View summary statistics of the dataset
summary(data)

# Check the structure of the dataset
str(data)

# Check the dimensions of the dataset
dim(data)

3. Identifying Missing values

# Assuming 'data' is your data frame

# Check for missing values in the entire dataset (count per column)
colSums(is.na(data))

# Check for missing values in a specific column
sum(is.na(data$column_name))

4. Removing missing values

# Remove rows with any missing values
data_clean <- na.omit(data)

# Remove rows with missing values in specific columns
data_clean <- data[complete.cases(data$column_name), ]
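
To make the behaviour of these calls concrete, here is a minimal worked example on a toy data frame. All column names and values are hypothetical.

```r
# Toy data frame (hypothetical values) with missing entries
data <- data.frame(
  id    = 1:5,
  age   = c(23, NA, 31, 45, NA),
  score = c(88, 92, NA, 75, 60)
)

# Missing values per column: id = 0, age = 2, score = 1
na_per_column <- colSums(is.na(data))

# Drop rows with a missing value in any column (keeps ids 1 and 4)
all_complete <- na.omit(data)

# Drop rows only where 'age' is missing (keeps ids 1, 3 and 4)
age_complete <- data[complete.cases(data$age), ]
```

Filtering on a single column keeps more rows than `na.omit()`, which matters when only some columns are used in the analysis.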

Data preparation and cleaning involve various steps to ensure that the dataset is in a
suitable format for analysis or modeling.

Data Cleaning Steps:

1. Handling Missing Values:

- Identify missing values using functions like `is.na()` or `complete.cases()`.
- Decide whether to remove missing values using `na.omit()` or impute
them with methods like mean, median, mode, or using advanced imputation
techniques from packages like `mice` or `missForest`.
2. Handling Outliers:
- Detect outliers using statistical methods like z-scores, IQR (Interquartile
Range), or visualization techniques such as boxplots.
- Decide whether to remove outliers or transform them based on the
context of your analysis.
3. Dealing with Duplicates:
- Check for duplicate rows using `duplicated()` and remove them using
`unique()`.

Data Preparation Steps:

1. Feature Engineering:
- Create new features from existing ones based on domain knowledge or insights.
- Transform variables (e.g., log transformation for skewed data) to improve the
distribution of data.
2. Standardization and Normalization:
- Scale numerical variables to a standard scale using methods like `scale()` for
standardization or min-max scaling for normalization.
- Normalize features to bring them on a similar scale, especially when using
distance-based algorithms.
3. Handling Categorical Variables:
- Convert categorical variables to factors using `as.factor()` or one-hot
encode them using techniques from packages like `dummies`.
- Handle ordinal variables appropriately by assigning levels.
4. Data Splitting:
- Split the dataset into training and testing subsets using functions like
`sample()` or from packages like `caret` or `tidymodels`.
5. Handling Date and Time Variables:
- Convert date/time variables to appropriate formats using functions like `as.Date()` or
`as.POSIXct()`.
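
The scaling, ordinal-factor and date steps listed above can be sketched in base R as follows. The data frame, its column names and its values are hypothetical and serve only to show the calls.

```r
# Toy data frame (hypothetical column names and values)
data <- data.frame(
  income = c(20000, 35000, 50000, 80000),
  grade  = c("low", "high", "medium", "low"),
  joined = c("2023-01-15", "2023-03-01", "2024-07-30", "2024-12-05")
)

# Standardization: mean 0, standard deviation 1
data$income_std <- as.numeric(scale(data$income))

# Min-max normalization to the range [0, 1]
data$income_norm <- (data$income - min(data$income)) /
  (max(data$income) - min(data$income))

# Ordinal variable: factor with explicitly ordered levels
data$grade <- factor(data$grade, levels = c("low", "medium", "high"), ordered = TRUE)

# Date conversion from character to Date
data$joined <- as.Date(data$joined, format = "%Y-%m-%d")
```

Declaring the level order explicitly lets comparisons such as `data$grade > "low"` behave as expected.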

Example of R Code for Data Cleaning and Preparation:

# Handling missing values
data_clean <- na.omit(data)  # Removing rows with missing values
# OR
# Imputing missing values with the mean
data$column_name[is.na(data$column_name)] <- mean(data$column_name, na.rm = TRUE)

# Handling outliers (example: using z-score)
threshold <- 3
z_scores <- as.numeric(scale(data$numeric_column))
# which() skips rows whose z-score is NA instead of returning all-NA rows
data_clean <- data[which(abs(z_scores) < threshold), ]

# Dealing with duplicates
data_unique <- unique(data)

# Feature engineering (example: creating a new feature)
data$feature_sum <- rowSums(data[, c("feature1", "feature2", "feature3")])

# Handling categorical variables (example: one-hot encoding)
library(dummies)
data <- dummy.data.frame(data, names = "categorical_column")

# Data splitting (example: 70-30 split for training and testing)
library(caret)
set.seed(123)
train_index <- createDataPartition(data$target_variable, p = 0.7, list = FALSE)
train_data <- data[train_index, ]
test_data <- data[-train_index, ]

These steps ensure that your data is clean, formatted correctly, and ready for
analysis or modeling tasks in R.

Adjust these methods based on your specific dataset and analysis requirements.
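
As one possible adjustment, outliers can be flagged with the IQR rule instead of z-scores. The sketch below uses a toy data frame with hypothetical values, and the 1.5 multiplier is the conventional choice rather than something fixed by this practical.

```r
# Toy data (hypothetical values): one extreme value and one repeated row
data <- data.frame(
  id    = c(1, 2, 3, 4, 5, 5),
  value = c(10, 12, 11, 13, 95, 95)
)

# Remove exact duplicate rows first
data_unique <- data[!duplicated(data), ]

# IQR rule: flag values beyond 1.5 * IQR from the quartiles
q1 <- quantile(data_unique$value, 0.25)
q3 <- quantile(data_unique$value, 0.75)
iqr <- q3 - q1
lower <- q1 - 1.5 * iqr
upper <- q3 + 1.5 * iqr

is_outlier <- data_unique$value < lower | data_unique$value > upper
data_clean <- data_unique[!is_outlier, ]
```

Unlike the z-score, the IQR rule does not depend on the mean and standard deviation, so a single extreme value distorts it less.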

Learning Objectives:
To understand the different techniques of Data exploration, Data preparation and
Cleaning.

Conclusion/Learning outcome:
Different tools and commands for understanding the data were studied and
implemented. The concepts of data preparation and cleaning were understood and
implemented in the R language.

DOP   DOS   R1: Conduction   R2: File Record   R3: Viva Voce   Total      Signature
            5 Marks          5 Marks           5 Marks         15 Marks
