DS Day 5

Uploaded by

ishuj759

DATA SCIENCE

Topic 7: Data Transformation

Normalization and Standardization

Normalization and standardization are techniques that rescale the numeric columns of a dataset to a common scale while preserving the relative differences between the values within each column.

- Normalization (Min-Max Scaling): Transforms the data to a fixed range, typically [0, 1]. This method is useful when you want your data to have a specific range.

Formula:

\[ x' = \frac{x - x_{\text{min}}}{x_{\text{max}} - x_{\text{min}}} \]

where \( x \) is the original value, \( x' \) is the normalized value, \( x_{\text{min}} \) is the minimum value in the dataset, and \( x_{\text{max}} \) is the maximum value in the dataset.

Example:

python

from sklearn.preprocessing import MinMaxScaler

data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]

scaler = MinMaxScaler()

print(scaler.fit_transform(data))
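As a cross-check on the formula, the same scaling can be written by hand with NumPy. This is a minimal sketch that assumes each column is scaled independently and that no column is constant (a constant column would make the denominator zero); the variable names are illustrative.

```python
import numpy as np

# Same data as the MinMaxScaler example: 4 rows, 2 columns
data = np.array([[-1, 2], [-0.5, 6], [0, 10], [1, 18]], dtype=float)

# Per-column minimum and maximum
col_min = data.min(axis=0)
col_max = data.max(axis=0)

# x' = (x - x_min) / (x_max - x_min), applied column by column
normalized = (data - col_min) / (col_max - col_min)

print(normalized)
```

Every column of the result lies in [0, 1], with the column minimum mapped to 0 and the column maximum mapped to 1.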

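- Standardization (Z-score Scaling): Transforms the data to have a mean of 0 and a standard deviation of 1. This method is commonly used when the data is roughly normally distributed or when an algorithm is sensitive to the scale of its features. It rescales the data but does not change the shape of its distribution.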

Formula:

\[ z = \frac{x - \mu}{\sigma} \]

where \( x \) is the original value, \( z \) is the standardized value, \( \mu \) is the mean of the dataset, and \( \sigma \) is the standard deviation of the dataset.

Example:

python
from sklearn.preprocessing import StandardScaler

data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]

scaler = StandardScaler()

print(scaler.fit_transform(data))
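The z-score formula can also be applied directly with NumPy. This sketch assumes the population standard deviation (ddof=0), which is what StandardScaler uses; the variable names are illustrative.

```python
import numpy as np

# Same data as the StandardScaler example
data = np.array([[-1, 2], [-0.5, 6], [0, 10], [1, 18]], dtype=float)

# Per-column mean and population standard deviation (ddof=0)
mu = data.mean(axis=0)
sigma = data.std(axis=0)

# z = (x - mu) / sigma, applied column by column
standardized = (data - mu) / sigma

print(standardized)
print(standardized.mean(axis=0))  # approximately 0 for each column
print(standardized.std(axis=0))   # approximately 1 for each column
```

Using the sample standard deviation (ddof=1) instead would give slightly smaller z-scores on a dataset this small.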

Encoding Categorical Variables

Categorical variables are often encoded to be used in machine learning algorithms that require numerical input.

- Label Encoding: Converts categorical values into numerical values by assigning each unique value a unique integer. However, the integers may imply an ordinal relationship between categories that does not exist.

Example:

python

from sklearn.preprocessing import LabelEncoder

data = ['cat', 'dog', 'mouse']

encoder = LabelEncoder()

print(encoder.fit_transform(data))
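
To see which integer was assigned to which category, the fitted encoder can be inspected. The sketch below uses a slightly different, illustrative list with a repeated value; LabelEncoder sorts the unique values before numbering them. Note that scikit-learn documents LabelEncoder as intended for target labels rather than input features.

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative data with a repeated category, not in alphabetical order
data = ['dog', 'cat', 'mouse', 'cat']

encoder = LabelEncoder()
codes = encoder.fit_transform(data)

# classes_ holds the sorted unique categories; the index is the assigned code
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}

print(codes)
print(mapping)

# inverse_transform recovers the original labels from the codes
decoded = [str(label) for label in encoder.inverse_transform(codes)]
print(decoded)
```

The codes follow alphabetical order here, which illustrates why the numbering carries no real ordinal meaning.
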
- One-Hot Encoding: Converts categorical values into a series of binary
columns. Each unique value is represented as a binary column with a 1 or
0 indicating the presence or absence of the category.

Example:

python

from sklearn.preprocessing import OneHotEncoder

import numpy as np

data = np.array(['cat', 'dog', 'mouse']).reshape(-1, 1)

encoder = OneHotEncoder(sparse_output=False)  # scikit-learn >= 1.2; older versions use sparse=False

print(encoder.fit_transform(data))
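When the data is already in a pandas DataFrame, one-hot encoding can also be sketched with `pd.get_dummies`. The column name `animal` and the sample values are illustrative assumptions.

```python
import pandas as pd

# Illustrative DataFrame with one categorical column
df = pd.DataFrame({'animal': ['cat', 'dog', 'mouse', 'cat']})

# One binary column per category; dtype=int gives 0/1 instead of booleans
one_hot = pd.get_dummies(df, columns=['animal'], dtype=int)

print(one_hot)
```

Each row contains exactly one 1, in the column of its category.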

Feature Engineering

Feature engineering is the process of using domain knowledge to create new features that make machine learning algorithms work better.

- Creating Features: New features can be created by combining existing features. For example, multiplying or adding features together, or creating interaction terms.

Example:
python

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

df['C'] = df['A'] * df['B']  # New feature: product of A and B

print(df)
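For more than a couple of columns, interaction terms can be generated automatically. One possible sketch uses scikit-learn's `PolynomialFeatures` with `interaction_only=True`; this is one option among several, not the only way to build such features.

```python
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# degree=2 with interaction_only=True keeps A, B and the product A*B,
# but skips the squared terms A^2 and B^2; include_bias=False drops the constant column
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
features = poly.fit_transform(df.to_numpy())

print(features)  # columns: A, B, A*B
```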

- Feature Selection: Selecting the most relevant features for a model. This
can be done using techniques such as Recursive Feature Elimination
(RFE), L1 regularization (Lasso), and tree-based feature importance.

Example (using RFE):

python

from sklearn.datasets import make_classification

from sklearn.feature_selection import RFE

from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, n_features=10,
random_state=42)

model = LogisticRegression()

rfe = RFE(model, n_features_to_select=5)

fit = rfe.fit(X, y)

print(fit.support_)  # Boolean mask: True for each selected feature

print(fit.ranking_)  # Rank of each feature: 1 means selected
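Tree-based feature importance, also mentioned above, can be sketched in a similar way. The choice of `RandomForestClassifier` and of keeping the top 5 features is an assumption made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Same synthetic dataset as the RFE example
X, y = make_classification(n_samples=100, n_features=10, random_state=42)

# Fit a random forest and read the impurity-based importances
forest = RandomForestClassifier(random_state=42)
forest.fit(X, y)
importances = forest.feature_importances_

# Indices of the 5 most important features, most important first
top_features = np.argsort(importances)[::-1][:5]

print(importances)
print(top_features)
```

The importances sum to 1, so each value can be read as a share of the total importance.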

Topic 8: Exploratory Data Analysis (EDA)

Descriptive Statistics

Descriptive statistics summarize and describe the main features of a dataset. They provide simple summaries about the sample and the measures.

- Measures of Central Tendency: Mean, median, and mode.

  - Mean: The average of the data points.

  - Median: The middle value when the data points are sorted (the average of the two middle values when the count is even).

  - Mode: The most frequent value in the dataset.

Example:

python

import numpy as np

from scipy import stats

data = [1, 2, 2, 3, 4]

print(np.mean(data))  # Mean

print(np.median(data))  # Median

print(stats.mode(data).mode)  # Mode (NumPy has no mode function; scipy.stats provides one)
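The same three measures are also available in Python's standard library, which avoids the SciPy dependency. A minimal sketch using the `statistics` module follows; the second list is an illustrative example of data with more than one mode.

```python
import statistics

data = [1, 2, 2, 3, 4]

mean_value = statistics.mean(data)      # sum of values divided by count
median_value = statistics.median(data)  # middle value of the sorted data
mode_value = statistics.mode(data)      # most frequent value

print(mean_value, median_value, mode_value)

# multimode returns every value that ties for most frequent
tied = statistics.multimode([1, 1, 2, 2, 3])
print(tied)
```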

- Measures of Dispersion: Range, variance, and standard deviation.

  - Range: The difference between the maximum and minimum values.

  - Variance: The average of the squared differences from the mean.

  - Standard Deviation: The square root of the variance.

Example:

python

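import numpy as np

data = [1, 2, 2, 3, 4]

print(np.var(data))  # Variance (population variance, ddof=0 by default)

print(np.std(data))  # Standard Deviation (population, ddof=0 by default)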
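The range is not covered by the example above, and NumPy's defaults compute the population variance. The sketch below adds the range and contrasts the population and sample versions; the variable names are illustrative.

```python
import numpy as np

data = [1, 2, 2, 3, 4]

# Range: maximum minus minimum
data_range = np.max(data) - np.min(data)

# Population variance divides by n (ddof=0, the NumPy default)
var_population = np.var(data)

# Sample variance divides by n - 1 (ddof=1), as pandas does by default
var_sample = np.var(data, ddof=1)

# Standard deviation is the square root of the variance
std_population = np.sqrt(var_population)

print(data_range, var_population, var_sample, std_population)
```

For this data the squared deviations from the mean (2.4) sum to 5.2, giving 5.2 / 5 = 1.04 for the population variance and 5.2 / 4 = 1.3 for the sample variance.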

Data Visualization

Data visualization involves the graphical representation of data to understand patterns, trends, and insights.

- Types of Visualizations:

  - Bar Charts: Used for categorical data to show the frequency of different categories.

  - Histograms: Used for numerical data to show the distribution of the data.

  - Box Plots: Used to show the distribution of data and identify outliers.

  - Scatter Plots: Used to show the relationship between two numerical variables.

Example (using Matplotlib and Seaborn):

python

import matplotlib.pyplot as plt

import seaborn as sns

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [5, 4, 3, 2, 1]})

sns.boxplot(data=df)  # One box per numeric column (A and B)

plt.show()
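The quantities a box plot draws can also be computed numerically. This sketch assumes the common 1.5 × IQR rule for flagging outliers (the default whisker length in Matplotlib and Seaborn); the data is made up for illustration.

```python
import numpy as np

# Illustrative data with one obviously extreme value
data = np.array([2, 3, 4, 5, 6, 7, 8, 9, 40])

# First and third quartiles (the edges of the box)
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box are drawn as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = data[(data < lower_fence) | (data > upper_fence)]

print(q1, q3, iqr)
print(lower_fence, upper_fence)
print(outliers)
```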

Data Summarization

Data summarization involves techniques to summarize and group data to understand its structure and distribution.

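- Correlation and Covariance:

  - Correlation: Measures the strength and direction of the linear relationship between two variables. Values range from -1 to 1.

  - Covariance: Measures the joint variability of two variables. Unlike correlation, its magnitude depends on the units of the variables.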

Example:

python

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [5, 4, 3, 2, 1]})

print(df.corr())  # Correlation

print(df.cov())  # Covariance
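Correlation and covariance are directly related: the Pearson correlation is the covariance divided by the product of the two standard deviations. A short sketch on the same data, with illustrative variable names:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [5, 4, 3, 2, 1]})

# Sample covariance between A and B (pandas uses ddof=1)
cov_ab = df['A'].cov(df['B'])

# Pearson correlation = cov(A, B) / (std(A) * std(B))
corr_manual = cov_ab / (df['A'].std() * df['B'].std())

# Built-in correlation for comparison
corr_builtin = df['A'].corr(df['B'])

print(cov_ab, corr_manual, corr_builtin)
```

Because B decreases by exactly one unit for every unit increase in A, the correlation is -1, a perfect negative linear relationship.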

- Grouping and Aggregation:

  - Grouping: Splitting data into groups based on some criteria.

  - Aggregation: Applying a function to each group independently.

Example:

python

import pandas as pd

df = pd.DataFrame({
    'Category': ['A', 'A', 'B', 'B'],
    'Values': [1, 2, 3, 4]
})

grouped = df.groupby('Category')

print(grouped.sum())  # Sum of values for each category
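Several aggregations can be applied to each group in one call with `agg`. The sketch below reuses the same data; the particular choice of sum, mean, and count is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Category': ['A', 'A', 'B', 'B'],
    'Values': [1, 2, 3, 4]
})

# Split by Category, then compute three aggregates of Values per group
summary = df.groupby('Category')['Values'].agg(['sum', 'mean', 'count'])

print(summary)
```

The result has one row per category and one column per aggregation function.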

Task 5

Questions on Data Transformation

1. What is normalization in data transformation?

2. What is the formula for Min-Max Scaling?

3. Provide an example of normalization using Min-Max Scaling.

4. What is standardization in data transformation?

5. What is the formula for Z-score Scaling?

6. Provide an example of standardization using Z-score Scaling.

7. What is label encoding and when is it used?

8. Provide an example of label encoding in Python.

9. What is one-hot encoding and when is it used?

10. Provide an example of one-hot encoding in Python.

11. What is feature engineering in data science?

12. How can new features be created from existing features? Provide an example.

13. What is feature selection and why is it important?

14. Describe the technique of Recursive Feature Elimination (RFE).

15. Provide an example of feature selection using RFE in Python.

Questions on Exploratory Data Analysis (EDA)

1. What are descriptive statistics and why are they important?

2. What are the measures of central tendency in descriptive statistics?

3. Define mean, median, and mode.

4. Provide a Python example to calculate mean, median, and mode.

5. What are the measures of dispersion in descriptive statistics?

6. Define range, variance, and standard deviation.

7. Provide a Python example to calculate variance and standard deviation.

8. What is data visualization and why is it used in EDA?

9. Name and describe different types of visualizations used in data analysis.

10. Provide a Python example of creating a box plot using Seaborn.

11. What is data summarization and what techniques are used?

12. Define correlation and covariance.

13. Provide a Python example to calculate correlation and covariance.

14. What is grouping in data summarization?

15. What is aggregation in data summarization?

16. Provide a Python example of grouping and aggregation in a DataFrame.
