DS Day 5

Uploaded by

ishuj759
DATA SCIENCE

Topic 7: Data Transformation

Normalization and Standardization

Normalization and standardization are techniques used to adjust the values of numeric columns in a dataset to a common scale, without distorting differences in the ranges of values.

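- Normalization (Min-Max Scaling): Transforms the data to a fixed range, typically [0, 1]. This method is useful when you want your data to have a specific range. Because the minimum and maximum define the scale, a single outlier can compress all the other values.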

Formula:

\[
x' = \frac{x - x_{\text{min}}}{x_{\text{max}} - x_{\text{min}}}
\]

where \( x \) is the original value, \( x' \) is the normalized value, and \( x_{\text{min}} \) and \( x_{\text{max}} \) are the minimum and maximum values of the column being scaled.

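Example:

python

from sklearn.preprocessing import MinMaxScaler

# Four samples with two features each; every column is scaled independently
data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]

scaler = MinMaxScaler()  # default feature_range is (0, 1)

# fit_transform learns each column's min and max, then rescales the values
print(scaler.fit_transform(data))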
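
The Min-Max formula can also be applied by hand, which makes it easier to see what the scaler does. The following is a minimal sketch using NumPy on the same toy data; the variable names (col_min, col_max, normalized) are illustrative and not part of the original notes.

```python
import numpy as np

# Same toy data as the example above: 4 samples, 2 features
data = np.array([[-1, 2], [-0.5, 6], [0, 10], [1, 18]], dtype=float)

col_min = data.min(axis=0)  # per-column minimum: [-1, 2]
col_max = data.max(axis=0)  # per-column maximum: [1, 18]

# x' = (x - x_min) / (x_max - x_min), applied column by column
normalized = (data - col_min) / (col_max - col_min)

print(normalized)
# Each column now runs from 0 to 1:
# [[0.   0.  ]
#  [0.25 0.25]
#  [0.5  0.5 ]
#  [1.   1.  ]]
```

Both columns end up with identical scaled values here because the second column happens to be a linear function of the first.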

- Standardization (Z-score Scaling): Transforms the data to have a mean of 0 and a standard deviation of 1. This method is useful when the data is approximately normally distributed, or when a model works best with centered features of comparable variance.

Formula:

\[
z = \frac{x - \mu}{\sigma}
\]

where \( x \) is the original value, \( z \) is the standardized value, \( \mu \) is the mean of the dataset, and \( \sigma \) is the standard deviation of the dataset.

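Example:

python

from sklearn.preprocessing import StandardScaler

# Four samples with two features each; every column is scaled independently
data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]

scaler = StandardScaler()

# fit_transform learns each column's mean and standard deviation, then rescales
print(scaler.fit_transform(data))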
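
As a cross-check, the z-score formula can be computed directly with NumPy. This is a sketch under the assumption that the population standard deviation (dividing by n, NumPy's default) is wanted, which is also what StandardScaler uses; the variable names are illustrative.

```python
import numpy as np

# Same toy data as the example above: 4 samples, 2 features
data = np.array([[-1, 2], [-0.5, 6], [0, 10], [1, 18]], dtype=float)

mu = data.mean(axis=0)    # per-column mean
sigma = data.std(axis=0)  # per-column population standard deviation (ddof=0)

# z = (x - mu) / sigma, applied column by column
standardized = (data - mu) / sigma

print(standardized)
print(standardized.mean(axis=0))  # approximately 0 for every column
print(standardized.std(axis=0))   # approximately 1 for every column
```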

Encoding Categorical Variables

Categorical variables are often encoded to be used in machine learning algorithms that require numerical input.

- Label Encoding: Converts categorical values into numerical values. Each unique value is assigned a unique integer. However, this method may imply an ordinal relationship between categories that may not exist.

Example:

python

from sklearn.preprocessing import LabelEncoder

data = ['cat', 'dog', 'mouse']

encoder = LabelEncoder()

# Each distinct label is mapped to an integer (labels are sorted first)
print(encoder.fit_transform(data))
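
To see which integer was assigned to which category, the fitted encoder can be inspected. The sketch below uses the classes_ attribute and inverse_transform of LabelEncoder; the extra 'dog' entry and the mapping dictionary are added here purely for illustration.

```python
from sklearn.preprocessing import LabelEncoder

# A repeated label shows that equal categories receive the same code
data = ['cat', 'dog', 'mouse', 'dog']

encoder = LabelEncoder()
codes = encoder.fit_transform(data)

# classes_ holds the sorted unique labels; a label's position is its code
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)  # {'cat': 0, 'dog': 1, 'mouse': 2}

# inverse_transform turns the integer codes back into the original labels
restored = list(encoder.inverse_transform(codes))
print(restored)
```

The codes 0, 1, 2 follow alphabetical order only, which is why they should not be read as a meaningful ranking of the categories.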

- One-Hot Encoding: Converts categorical values into a series of binary columns. Each unique value is represented as a binary column with a 1 or 0 indicating the presence or absence of the category.

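Example:

python

from sklearn.preprocessing import OneHotEncoder
import numpy as np

# The encoder expects a 2-D array: one row per sample, one column per feature
data = np.array(['cat', 'dog', 'mouse']).reshape(-1, 1)

# The 'sparse' argument was renamed to 'sparse_output' in scikit-learn 1.2 and
# later removed, so keep the default and convert the result with toarray()
encoder = OneHotEncoder()

print(encoder.fit_transform(data).toarray())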
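
When the data already lives in pandas, the same binary columns can be produced with pd.get_dummies. This is an alternative sketch rather than the method used in the notes; the Series name 'animal' is an assumption for illustration.

```python
import pandas as pd

animals = pd.Series(['cat', 'dog', 'mouse'], name='animal')

# One column per category; dtype=int gives 1/0 instead of True/False
one_hot = pd.get_dummies(animals, dtype=int)

print(one_hot)
#    cat  dog  mouse
# 0    1    0      0
# 1    0    1      0
# 2    0    0      1
```

Unlike a fitted OneHotEncoder, get_dummies does not remember the categories it saw, so it is less convenient when the same encoding must be reapplied to new data.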

Feature Engineering

Feature engineering is the process of using domain knowledge to create new features that make machine learning algorithms work better.

- Creating Features: New features can be created by combining existing features. For example, multiplying or adding features together, or creating interaction terms.

Example:

python

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# Interaction term: element-wise product of the two existing columns
df['C'] = df['A'] * df['B']

print(df)
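
For more than a couple of columns, interaction terms can be generated automatically. The sketch below uses scikit-learn's PolynomialFeatures as one possible tool (it is not mentioned in the original notes); with interaction_only=True and include_bias=False it returns the original columns followed by their pairwise products.

```python
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# degree=2 with interaction_only=True yields A, B and A*B (no squares, no constant)
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
features = poly.fit_transform(df[['A', 'B']].to_numpy())

print(features)
# [[ 1.  4.  4.]
#  [ 2.  5. 10.]
#  [ 3.  6. 18.]]
```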

- Feature Selection: Selecting the most relevant features for a model. This
can be done using techniques such as Recursive Feature Elimination
(RFE), L1 regularization (Lasso), and tree-based feature importance.

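Example (using RFE):

python

from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic classification data: 100 samples, 10 features
X, y = make_classification(n_samples=100, n_features=10, random_state=42)

model = LogisticRegression()

# Repeatedly fit the model and drop the weakest feature until 5 remain
rfe = RFE(model, n_features_to_select=5)
fit = rfe.fit(X, y)

print(fit.support_)  # Boolean mask: True for the selected features
print(fit.ranking_)  # Rank 1 marks a selected feature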
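
Tree-based feature importance, also listed above, can be sketched in a similar way. The example assumes a RandomForestClassifier on the same synthetic data; the choice of 100 trees and of keeping the top five features is arbitrary and only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Same synthetic data as in the RFE example
X, y = make_classification(n_samples=100, n_features=10, random_state=42)

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X, y)

# One score per feature; the scores are normalized to sum to 1
importances = forest.feature_importances_

# Sort feature indices from most to least important and keep the top five
ranked = sorted(enumerate(importances), key=lambda pair: pair[1], reverse=True)
top_five = [index for index, _ in ranked[:5]]

print(importances)
print(top_five)
```

Unlike RFE, this approach fits the model once and ranks all features from that single fit.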

Topic 8: Exploratory Data Analysis (EDA)

Descriptive Statistics

Descriptive statistics summarize and describe the main features of a dataset. They provide simple summaries about the sample and the measures.

- Measures of Central Tendency: Mean, median, and mode.

- Mean: The average of the data points.

- Median: The middle value when the data points are sorted.

- Mode: The most frequent value in the dataset.

Example:

python

import numpy as np
from scipy import stats  # NumPy has no mode function; scipy.stats provides one

data = [1, 2, 2, 3, 4]

print(np.mean(data))          # Mean
print(np.median(data))        # Median
print(stats.mode(data).mode)  # Mode
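
If SciPy is not available, Python's built-in statistics module covers all three measures. A minimal sketch follows; the second list passed to multimode is made up here to show what happens when two values tie.

```python
import statistics

data = [1, 2, 2, 3, 4]

mean_value = statistics.mean(data)      # (1 + 2 + 2 + 3 + 4) / 5 = 2.4
median_value = statistics.median(data)  # middle of the sorted values = 2
mode_value = statistics.mode(data)      # most frequent value = 2

print(mean_value, median_value, mode_value)

# multimode returns every value that ties for most frequent
tied_modes = statistics.multimode([1, 1, 2, 2, 3])
print(tied_modes)  # [1, 2]
```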


- Measures of Dispersion: Range, variance, and standard deviation.

- Range: The difference between the maximum and minimum values.

- Variance: The average of the squared differences from the mean.

- Standard Deviation: The square root of the variance.

Example:

python

import numpy as np

data = [1, 2, 2, 3, 4]

print(np.var(data))  # Variance (population form: divides by n)
print(np.std(data))  # Standard Deviation
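
The range is not covered by the example above, and NumPy's variance defaults to the population form. The sketch below computes the range and contrasts population and sample variance on the same data; the ddof argument is NumPy's "delta degrees of freedom".

```python
import numpy as np

data = np.array([1, 2, 2, 3, 4])

# Range: maximum minus minimum
value_range = data.max() - data.min()  # 4 - 1 = 3

# Squared deviations from the mean (2.4) sum to 5.2
population_var = np.var(data)      # 5.2 / 5 = 1.04 (divides by n)
sample_var = np.var(data, ddof=1)  # 5.2 / 4 = 1.3  (divides by n - 1)
sample_std = np.std(data, ddof=1)  # square root of the sample variance

print(value_range, population_var, sample_var, sample_std)
```

Note that pandas uses ddof=1 by default, so Series.var() and np.var() give different results on the same data unless ddof is set explicitly.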

Data Visualization

Data visualization involves the graphical representation of data to understand patterns, trends, and insights.

- Types of Visualizations:

- Bar Charts: Used for categorical data to show the frequency of different categories.

- Histograms: Used for numerical data to show the distribution of the data.

- Box Plots: Used to show the distribution of data and identify outliers.

- Scatter Plots: Used to show the relationship between two numerical variables.

Example (using Matplotlib and Seaborn):

python

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [5, 4, 3, 2, 1]})

# One box per column; x='A', y='B' would draw a separate box for every single value of A
sns.boxplot(data=df)

plt.show()
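
The other chart types listed above can be sketched with Matplotlib alone. The small DataFrame below is made up for illustration; the figure is built but not displayed, so plt.show() should be called afterwards in an interactive session.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data: one categorical column and two numerical columns
df = pd.DataFrame({
    'Category': ['A', 'A', 'B', 'B', 'B', 'C'],
    'X': [1, 2, 3, 4, 5, 6],
    'Y': [2, 4, 5, 4, 6, 7],
})

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))

# Bar chart: frequency of each category
counts = df['Category'].value_counts().sort_index()
axes[0].bar(list(counts.index), list(counts.values))
axes[0].set_title('Bar chart: category counts')

# Histogram: distribution of a numerical column
axes[1].hist(df['Y'], bins=4)
axes[1].set_title('Histogram: distribution of Y')

# Scatter plot: relationship between two numerical columns
axes[2].scatter(df['X'], df['Y'])
axes[2].set_title('Scatter plot: X vs Y')

fig.tight_layout()
```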

Data Summarization

Data summarization involves techniques to summarize and group data to understand its structure and distribution.

- Correlation and Covariance:

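- Correlation: Measures the strength and direction of the linear relationship between two variables. Values range from -1 to 1.

- Covariance: Measures the joint variability of two variables. Unlike correlation, its value depends on the units of the variables.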


Example:

python

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [5, 4, 3, 2, 1]})

print(df.corr())  # Correlation
print(df.cov())   # Covariance
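
The two measures are linked: Pearson correlation is the covariance divided by the product of the two standard deviations. A short sketch on the same data, with illustrative variable names, makes the relationship explicit.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [5, 4, 3, 2, 1]})

# Sample covariance: deviations multiply to -10 in total, divided by n - 1 = 4
cov_ab = df['A'].cov(df['B'])  # -2.5

# Correlation = covariance / (std of A * std of B)
corr_manual = cov_ab / (df['A'].std() * df['B'].std())
corr_pandas = df['A'].corr(df['B'])

print(cov_ab, corr_manual, corr_pandas)
```

Both correlation values come out at -1 (up to floating-point rounding), because B decreases by exactly one whenever A increases by one.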

- Grouping and Aggregation:

- Grouping: Splitting data into groups based on some criteria.

- Aggregation: Applying a function to each group independently.

Example:

python

import pandas as pd

df = pd.DataFrame({
    'Category': ['A', 'A', 'B', 'B'],
    'Values': [1, 2, 3, 4]
})

grouped = df.groupby('Category')

print(grouped.sum())  # Sum of values for each category
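
Several aggregations can be applied to each group in one step by passing a list of function names to agg. A minimal sketch on the same data; the variable name summary is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Category': ['A', 'A', 'B', 'B'],
    'Values': [1, 2, 3, 4]
})

# Split by Category, then compute sum, mean and count of Values per group
summary = df.groupby('Category')['Values'].agg(['sum', 'mean', 'count'])

print(summary)
#           sum  mean  count
# Category
# A           3   1.5      2
# B           7   3.5      2
```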

Task 5

Questions on Data Transformation


1. What is normalization in data transformation?

2. What is the formula for Min-Max Scaling?

3. Provide an example of normalization using Min-Max Scaling.

4. What is standardization in data transformation?

5. What is the formula for Z-score Scaling?

6. Provide an example of standardization using Z-score Scaling.

7. What is label encoding and when is it used?

8. Provide an example of label encoding in Python.

9. What is one-hot encoding and when is it used?

10. Provide an example of one-hot encoding in Python.

11. What is feature engineering in data science?

12. How can new features be created from existing features? Provide an example.

13. What is feature selection and why is it important?

14. Describe the technique of Recursive Feature Elimination (RFE).

15. Provide an example of feature selection using RFE in Python.

Questions on Exploratory Data Analysis (EDA)

1. What are descriptive statistics and why are they important?

2. What are the measures of central tendency in descriptive statistics?

3. Define mean, median, and mode.

4. Provide a Python example to calculate mean, median, and mode.

5. What are the measures of dispersion in descriptive statistics?

6. Define range, variance, and standard deviation.

7. Provide a Python example to calculate variance and standard deviation.

8. What is data visualization and why is it used in EDA?


9. Name and describe different types of visualizations used in data analysis.

10. Provide a Python example of creating a box plot using Seaborn.

11. What is data summarization and what techniques are used?

12. Define correlation and covariance.

13. Provide a Python example to calculate correlation and covariance.

14. What is grouping in data summarization?

15. What is aggregation in data summarization?

16. Provide a Python example of grouping and aggregation in a DataFrame.
