DAV practical 2

MU DAV

Uploaded by

mgade3012

Subject: Data Analytics & Visualization
Course Code: CSL-601
Semester: 6
Course: AI & DS
Laboratory No: 315
Name of Subject Teacher: Rajesh Morey
Name of Student: Meghana Gade
Roll Id: VU2S2223002

Practical Number: 2
Title of Practical: Data Exploration: Knowing the data, Data preparation and Cleaning.

Prior Concepts:
Exploring data is a crucial step in understanding its characteristics, trends, and
underlying patterns. Data exploration draws on a range of techniques and tools to
gain insight into a dataset. The following are the main steps involved in exploring
data.

1. Data Collection and Understanding:


- Collect the dataset you want to explore. This could be a CSV file, database,
or any structured/unstructured data source.
- Understand the data sources, variables, and their meanings. Look into data
dictionaries or metadata to comprehend what each column represents.

2. Data Cleaning and Preprocessing:


- Check for missing values, duplicates, and outliers in the dataset.
- Handle missing values by imputation or deletion, depending on the nature of the
data.
- Normalize or standardize data if necessary for certain algorithms or analyses.

3. Statistical Summaries and Visualizations:


- Calculate descriptive statistics (mean, median, mode, standard deviation,
etc.) for numerical variables.
- Generate frequency counts for categorical variables.
- Create visualizations like histograms, box plots, scatter plots, and
heatmaps to understand the distribution and relationships between variables.
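
The summaries and plots listed above can be sketched in base R as follows. This is only an illustration: it assumes the built-in `mtcars` dataset as a stand-in for your own data frame, and `stat_mode()` is a hypothetical helper written for this example, since base R has no built-in function for the statistical mode.

```r
# Illustrative only: mtcars ships with base R and stands in for your dataset
df <- mtcars

# Descriptive statistics for one numeric variable
mpg_mean   <- mean(df$mpg)
mpg_median <- median(df$mpg)
mpg_sd     <- sd(df$mpg)

# Hypothetical helper: return the most frequent value as a character string
stat_mode <- function(x) {
  counts <- table(x)
  names(counts)[which.max(counts)]
}
cyl_mode <- stat_mode(df$cyl)

# Frequency counts for a categorical variable
cyl_counts <- table(df$cyl)
print(cyl_counts)

# Basic plots: distribution, spread by group, and a pairwise relationship
hist(df$mpg, main = "Distribution of mpg", xlab = "mpg")
boxplot(mpg ~ cyl, data = df, main = "mpg by number of cylinders")
plot(df$wt, df$mpg, xlab = "wt", ylab = "mpg")
```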

4. Exploratory Data Analysis (EDA):


- Conduct correlation analysis to identify relationships between variables.
- Perform dimensionality reduction techniques like PCA (Principal
Component Analysis) or t-SNE (t-Distributed Stochastic Neighbor Embedding) for
visualization in lower dimensions.
- Perform cluster analysis to identify natural groupings within the data.
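
A minimal sketch of these EDA steps, using only base R and the built-in `iris` dataset (both the dataset and the choice of three clusters are assumptions made for illustration). t-SNE is left out here because it needs an add-on package.

```r
# Illustrative only: iris ships with base R; use your own numeric columns instead
num <- iris[, 1:4]

# Correlation matrix between the numeric variables
cor_mat <- cor(num)
print(round(cor_mat, 2))

# PCA on standardized variables; keep the first two components for plotting
pca <- prcomp(num, center = TRUE, scale. = TRUE)
scores <- pca$x[, 1:2]
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# k-means clustering; k = 3 is an assumption chosen for this example
set.seed(42)
km <- kmeans(scale(num), centers = 3, nstart = 10)

# Plot the observations in PCA space, colored by cluster
plot(scores, col = km$cluster, pch = 19,
     main = "PCA scores colored by k-means cluster")
```

Standardizing before PCA and k-means keeps variables measured on larger scales from dominating the result.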

5. Hypothesis Testing and Feature Engineering:


- Formulate hypotheses about relationships or patterns within the data.
- Perform hypothesis tests to validate or reject these hypotheses.
- Create new features by transforming existing ones to improve model
performance if working on a predictive modeling task.
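
As a rough illustration of this step, the sketch below tests one hypothesis and derives one new feature. It assumes the built-in `mtcars` dataset; the hypothesis, the 5% significance level, and the `hp_per_wt` column name are choices made for this example only.

```r
# Illustrative only: mtcars stands in for your dataset
df <- mtcars

# Hypothesis: mean mpg differs between automatic (am = 0) and manual (am = 1) cars
# t.test() runs a Welch two-sample t-test by default (no equal-variance assumption)
test <- t.test(mpg ~ am, data = df)
print(test$p.value)

# Reject the null hypothesis at the 5% level if p < 0.05 (a common convention)
reject_null <- test$p.value < 0.05

# Feature engineering: derive a power-to-weight ratio from existing columns
df$hp_per_wt <- df$hp / df$wt
```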

6. Interactive Exploration and Tools:


- Utilize tools like Jupyter Notebooks, Pandas, Matplotlib, Seaborn, Plotly,
or Tableau for interactive exploration.

- Utilize widgets or interactive visualization libraries for dynamic
exploration of data subsets or trends.

7. Documentation and Communication:


- Document all findings, insights, and assumptions made during exploration.
- Create visualizations, summaries, and reports to communicate the insights gained.

8. Iterative Process:
- Data exploration is often iterative. Revisit steps, try different techniques,
and compare results to gain a comprehensive understanding of the dataset.

9. Ethical Considerations:
- Ensure ethical use of data, especially regarding privacy, biases, and the
implications of insights drawn from the data.

New Concept:
Data loading in R
1. CSV File

# Load a CSV file
data <- read.csv("your_file.csv")

# View the first few rows of the dataset
head(data)

# View summary statistics of the dataset
summary(data)

# Check the structure of the dataset
str(data)

# Check the dimensions of the dataset
dim(data)
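
To try these commands without an existing file, the following sketch first writes a small CSV to a temporary location and then loads it. The column names and values are made up for illustration.

```r
# Write a tiny CSV to a temporary file so the example runs anywhere
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,name,score",
             "1,Asha,82.5",
             "2,Ravi,NA",
             "3,Meera,91"), csv_path)

# stringsAsFactors = FALSE keeps text columns as character (the default since R 4.0)
data <- read.csv(csv_path, stringsAsFactors = FALSE)

head(data)      # first rows
summary(data)   # per-column summary statistics
str(data)       # column types and a preview of the values
dim(data)       # number of rows and columns
```

By default `read.csv()` treats the string `NA` as a missing value, so the `score` column is still read as numeric.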

2. Excel File

# Load an Excel file (assuming the 'readxl' package is installed)
library(readxl)
data <- read_excel("your_file.xlsx")

# View the first few rows of the dataset
head(data)

# View summary statistics of the dataset
summary(data)

# Check the structure of the dataset
str(data)

# Check the dimensions of the dataset
dim(data)

3. Identifying Missing values

# Assuming 'data' is your data frame
# Count missing values in each column of the dataset
colSums(is.na(data))

# Count missing values in a specific column
sum(is.na(data$column_name))
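
A small worked example of these checks, on a made-up data frame (all names and values here are hypothetical):

```r
# Small hypothetical data frame with gaps
data <- data.frame(
  age    = c(25, NA, 31, 40),
  city   = c("Pune", "Mumbai", NA, NA),
  salary = c(50000, 62000, 58000, NA)
)

na_per_column <- colSums(is.na(data))          # count of NAs per column
na_total      <- sum(is.na(data))              # count over the whole data frame
na_share      <- colMeans(is.na(data))         # fraction missing per column
rows_with_na  <- which(!complete.cases(data))  # row indices containing any NA

print(na_per_column)
```

Note that `is.na()` only detects true `NA` values. Empty strings or placeholder codes such as `-999` need to be converted to `NA` first.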

4. Removing missing values

# Remove rows with any missing values
data_clean <- na.omit(data)

# Remove rows with missing values in a specific column
data_clean <- data[complete.cases(data$column_name), ]
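
The difference between the two approaches is easiest to see on a small example. The sketch below uses a made-up data frame and shows how to restrict the check to several columns at once by passing a subset of the data frame to `complete.cases()`.

```r
# Small hypothetical data frame with gaps in different columns
data <- data.frame(
  age    = c(25, NA, 31, 40),
  city   = c("Pune", "Mumbai", NA, "Nashik"),
  salary = c(50000, 62000, 58000, NA)
)

# Drop every row that has at least one NA in any column
clean_all <- na.omit(data)

# Drop rows only when the selected columns (age and salary) are missing
clean_subset <- data[complete.cases(data[, c("age", "salary")]), ]

nrow(data)
nrow(clean_all)
nrow(clean_subset)
```

Restricting the check to the columns an analysis actually uses usually keeps more rows than `na.omit()` on the full data frame.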

Data preparation and cleaning involve various steps to ensure that the dataset is in a
suitable format for analysis or modeling.

Data Cleaning Steps:

1. Handling Missing Values:


- Identify missing values using functions like `is.na()` or `complete.cases()`.
- Decide whether to remove missing values using `na.omit()` or impute
them with methods like mean, median, mode, or using advanced imputation
techniques from packages like `mice` or `missForest`.
2. Handling Outliers:
- Detect outliers using statistical methods like z-scores, IQR (Interquartile
Range), or visualization techniques such as boxplots.
- Decide whether to remove outliers or transform them based on the
context of your analysis.
3. Dealing with Duplicates:
- Check for and remove duplicate rows using the `duplicated()` and `unique()` functions.

Data Preparation Steps:

1. Feature Engineering:
- Create new features from existing ones based on domain knowledge or insights.
- Transform variables (e.g., log transformation for skewed data) to improve the
distribution of data.
2. Standardization and Normalization:
- Scale numerical variables to a standard scale using methods like `scale()` for
standardization or min-max scaling for normalization.
- Normalize features to bring them onto a similar scale, especially when using
distance-based algorithms.
3. Handling Categorical Variables:
- Convert categorical variables to factors using `as.factor()` or one-hot
encode them using techniques from packages like `dummies`.
- Handle ordinal variables appropriately by assigning levels.
4. Data Splitting:
- Split the dataset into training and testing subsets using functions like
`sample()` or from packages like `caret` or `tidymodels`.
5. Handling Date and Time Variables:
- Convert date/time variables to appropriate formats using functions like `as.Date()` or
`as.POSIXct()`.
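
The preparation steps above can be sketched with base R alone. The data frame, its column names, and the `min_max()` helper below are hypothetical and exist only for this illustration.

```r
# Hypothetical data frame used only for illustration
data <- data.frame(
  income = c(30000, 45000, 60000, 120000, 52000, 38000, 75000, 41000, 99000, 67000),
  size   = c("small", "large", "medium", "large", "small",
             "medium", "small", "large", "medium", "small"),
  joined = c("2021-01-15", "2020-06-30", "2022-03-01", "2019-11-20", "2021-08-05",
             "2020-02-14", "2022-12-25", "2018-07-04", "2021-05-17", "2019-09-09"),
  stringsAsFactors = FALSE
)

# Log transform for a right-skewed variable
data$log_income <- log(data$income)

# Standardization (mean 0, sd 1); scale() returns a matrix, so flatten it
data$income_z <- as.numeric(scale(data$income))

# Min-max normalization to the [0, 1] range (hypothetical helper)
min_max <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}
data$income_01 <- min_max(data$income)

# Ordinal variable: an ordered factor with an explicit level order
data$size <- factor(data$size, levels = c("small", "medium", "large"), ordered = TRUE)

# Date parsing (the format string must match how the dates are written)
data$joined <- as.Date(data$joined, format = "%Y-%m-%d")

# 70-30 train/test split with base R's sample()
set.seed(123)
train_idx  <- sample(seq_len(nrow(data)), size = floor(0.7 * nrow(data)))
train_data <- data[train_idx, ]
test_data  <- data[-train_idx, ]
```

In a real modeling workflow, scaling parameters are usually estimated on the training subset only and then applied to the test subset, so that information from the test data does not leak into training.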

Example of R Code for Data Cleaning and Preparation:

# Handling missing values
data_clean <- na.omit(data)  # Removing rows with missing values
# OR
data$column_name[is.na(data$column_name)] <- mean(data$column_name, na.rm = TRUE)  # Imputing missing values with mean

# Handling outliers (example: using z-score)
threshold <- 3
data_clean <- data[abs(scale(data$numeric_column)) < threshold, ]

# Dealing with duplicates
data_unique <- unique(data)

# Feature engineering (example: creating a new feature)
data$feature_sum <- rowSums(data[, c("feature1", "feature2", "feature3")])

# Handling categorical variables (example: one-hot encoding)
library(dummies)
data <- dummy.data.frame(data, names = "categorical_column")

# Data splitting (example: 70-30 split for training and testing)
library(caret)
set.seed(123)
train_index <- createDataPartition(data$target_variable, p = 0.7, list = FALSE)
train_data <- data[train_index, ]
test_data <- data[-train_index, ]

These steps ensure that your data is clean, formatted correctly, and ready for
analysis or modeling tasks in R.

Adjust these methods based on your specific dataset and analysis requirements.
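
As one possible adjustment, the sketch below runs a cleaning pass with base R only: median imputation, the IQR rule for outliers, `duplicated()` for repeated rows, and `model.matrix()` for one-hot encoding, so no add-on package is needed. The data frame and its column names are hypothetical.

```r
# Hypothetical data frame used only for illustration
data <- data.frame(
  value = c(10, 12, 11, 13, 12, 95, NA, 11, 10, 12),
  group = c("a", "b", "a", "c", "b", "a", "c", "b", "a", "b"),
  stringsAsFactors = FALSE
)

# Median imputation: less sensitive to outliers than the mean
data$value[is.na(data$value)] <- median(data$value, na.rm = TRUE)

# IQR rule: flag values more than 1.5 * IQR beyond the first or third quartile
q <- unname(quantile(data$value, probs = c(0.25, 0.75)))
fence <- 1.5 * (q[2] - q[1])
is_outlier <- data$value < q[1] - fence | data$value > q[2] + fence
data_no_outliers <- data[!is_outlier, ]

# Duplicates: duplicated() marks repeated rows, keeping the first occurrence
n_dups <- sum(duplicated(data_no_outliers))
data_unique <- data_no_outliers[!duplicated(data_no_outliers), ]

# One-hot encoding; "- 1" drops the intercept so every level gets its own column
one_hot <- model.matrix(~ group - 1, data = data_unique)
print(one_hot)
```

The order of the steps matters: imputing before removing outliers, as done here, gives a different result from the reverse order, so choose the order that suits your data.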

Learning Objectives:
To understand the different techniques of Data exploration, Data preparation and
Cleaning.

Conclusion/Learning outcome:
The use of different tools and commands for understanding the data was studied and
implemented. The concepts of data preparation and cleaning were understood and
implemented in the R language.

DOP | DOS | R1: Conduction | R2: File Record | R3: Viva Voce | Total    | Signature
    |     | 5 Marks        | 5 Marks         | 5 Marks       | 15 Marks |
