Some Exercises

The document describes several exercises for practicing different data analysis techniques. The exercises cover topics like data cleaning, exploratory data analysis, regression, clustering, classification, and visualization. Example solutions are provided for some basic exercises involving data cleaning, EDA, and visualization.

Uploaded by

Eralda FRROKU

Some exercises:

Exercise 1: Data Cleaning


Objective: Cleanse the data to prepare it for analysis.
Dataset: You have a dataset with information about customers, but it contains missing values
and outliers.
Tasks:
1. Identify and handle missing values (you can use techniques like filling with
mean/median or removing rows/columns).
2. Detect and deal with outliers in numerical variables.
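
One possible approach to the two tasks above, as a minimal sketch. The customer table, its column names, and its values are invented for illustration, and the 1.5 * IQR rule is only one of several common ways to flag outliers.

```python
import pandas as pd

# Hypothetical customer data (names and values are assumptions for this sketch).
customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5, 6, 7, 8],
    "age": [34, 29, None, 41, 38, 27, 45, 31],
    "annual_spend": [1200.0, 950.0, 1100.0, None, 1300.0, 1050.0, 25000.0, 1150.0],
})

# Task 1: count missing values per column, then fill numeric gaps with the median.
print(customers.isna().sum())
for col in ["age", "annual_spend"]:
    customers[col] = customers[col].fillna(customers[col].median())

# Task 2: flag values outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR] and drop those rows.
q1 = customers["annual_spend"].quantile(0.25)
q3 = customers["annual_spend"].quantile(0.75)
iqr = q3 - q1
is_outlier = (
    (customers["annual_spend"] < q1 - 1.5 * iqr)
    | (customers["annual_spend"] > q3 + 1.5 * iqr)
)
cleaned = customers[~is_outlier]
print(cleaned)
```

The median is used for filling because it is less sensitive than the mean to extreme values such as the 25000 entry. Whether to drop, cap, or keep an outlier depends on what the value represents.
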
Exercise 2: Exploratory Data Analysis (EDA)
Objective: Understand the characteristics of the dataset through exploratory data analysis.
Dataset: Use a dataset containing information about sales transactions.
Tasks:
1. Create summary statistics (mean, median, standard deviation, etc.) for numerical
variables.
2. Generate visualizations (histograms, box plots, scatter plots) to explore the
distribution of key variables.
3. Identify correlations between variables.
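
Tasks 1 and 3 above might look like the following sketch. The sales table and its column names are made up for illustration; the plots for task 2 are only hinted at in a comment.

```python
import pandas as pd

# Hypothetical sales transactions (an assumption; any numeric table works the same way).
sales = pd.DataFrame({
    "units": [1, 3, 2, 5, 4, 6],
    "unit_price": [20.0, 18.0, 19.0, 15.0, 16.0, 14.0],
})
sales["revenue"] = sales["units"] * sales["unit_price"]

# Task 1: summary statistics for every numeric column.
summary = sales.agg(["mean", "median", "std"])
print(summary)

# Task 2 could use, for example, sales.hist() or
# sales.plot.scatter(x="units", y="revenue") with matplotlib installed.

# Task 3: pairwise Pearson correlations between the numeric columns.
correlations = sales.corr()
print(correlations)
```

In this toy data, larger orders have lower unit prices, so the correlation between units and unit_price comes out negative.
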
Exercise 3: Regression Analysis
Objective: Explore relationships between variables and make predictions.
Dataset: Use a dataset with information about housing prices, including features like square
footage, number of bedrooms, etc.
Tasks:
1. Perform a simple linear regression to predict housing prices based on a single variable
(e.g., square footage).
2. Evaluate the model's performance using metrics like Mean Squared Error or
R-squared.
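
A minimal sketch of both tasks, assuming scikit-learn is available. The housing data is synthetic (generated so that price is roughly linear in square footage), and the column names sqft and price are placeholders rather than names from a real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic housing data (an assumption): price = 50000 + 150 * sqft + noise.
rng = np.random.default_rng(0)
sqft = rng.uniform(500, 3500, size=200)
price = 50_000 + 150 * sqft + rng.normal(0, 20_000, size=200)
housing = pd.DataFrame({"sqft": sqft, "price": price})

# Task 1: simple linear regression with a single predictor.
X = housing[["sqft"]]   # 2-D input, as scikit-learn expects
y = housing["price"]
model = LinearRegression().fit(X, y)
y_pred = model.predict(X)

# Task 2: evaluate the fit (in-sample here, since the exercise does not ask for a split).
mse = mean_squared_error(y, y_pred)
r2 = r2_score(y, y_pred)
print(f"Slope: {model.coef_[0]:.1f}, intercept: {model.intercept_:.0f}")
print(f"Mean Squared Error: {mse:.0f}")
print(f"R-squared: {r2:.3f}")
```

Scoring on the same rows used for fitting tends to be optimistic; holding out a test set with train_test_split gives a fairer estimate.
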
Exercise 4: Clustering
Objective: Group similar data points together based on certain criteria.
Dataset: Use a dataset containing customer behavior data.
Tasks:
1. Apply a clustering algorithm (e.g., k-means) to group customers based on their
behavior.
2. Visualize the clusters and interpret the results.
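
The clustering task could be approached roughly as follows. The behavior data is synthetic, with two deliberately well-separated groups, and the column names are assumptions. Choosing k = 2 is only appropriate because the data was generated that way.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic customer behavior (an assumption): occasional vs. frequent shoppers.
rng = np.random.default_rng(0)
occasional = rng.normal([2, 20], [0.5, 5], size=(50, 2))
frequent = rng.normal([10, 200], [1, 20], size=(50, 2))
behavior = pd.DataFrame(
    np.vstack([occasional, frequent]),
    columns=["visits_per_month", "avg_basket"],
)

# Task 1: scale the features so neither dominates the distance, then run k-means.
scaled = StandardScaler().fit_transform(behavior)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
behavior["cluster"] = kmeans.fit_predict(scaled)

# Task 2 (interpretation): compare the average behavior of each cluster.
# A scatter plot colored by cluster is one way to visualize this, e.g.
# behavior.plot.scatter(x="visits_per_month", y="avg_basket", c="cluster", colormap="viridis")
print(behavior.groupby("cluster").mean())
```

On real data the number of clusters is usually unknown, so it is common to try several values of k and compare them, for example by inertia or silhouette score.
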
Exercise 5: Classification
Objective: Assign observations to predefined categories or classes.
Dataset: Use a dataset with information about customer purchases.
Tasks:
1. Define a target variable (e.g., whether a customer will make a repeat purchase).
2. Split the data into training and testing sets.
3. Train a classification model (e.g., logistic regression or decision tree) to predict the
target variable.
4. Evaluate the model's performance on the testing set.
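
The four steps above can be sketched as follows. The purchase data is synthetic and the column names are invented. In particular, the rule that customers with more orders are more likely to buy again is an assumption built into the generated data, not a finding.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic purchase history (an assumption for this sketch).
rng = np.random.default_rng(0)
n = 400
purchases = pd.DataFrame({
    "num_orders": rng.integers(1, 20, size=n),
    "avg_order_value": rng.uniform(10, 200, size=n),
})

# Step 1: define the target. Repeat purchases are made more likely for
# customers with many past orders (built-in assumption of the toy data).
prob_repeat = 1 / (1 + np.exp(-(purchases["num_orders"] - 10) / 2))
purchases["repeat"] = (rng.uniform(size=n) < prob_repeat).astype(int)

# Step 2: split into training and testing sets, keeping class proportions.
X = purchases[["num_orders", "avg_order_value"]]
y = purchases["repeat"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Step 3: train a logistic regression classifier.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# Step 4: evaluate on the held-out test set.
accuracy = accuracy_score(y_test, clf.predict(X_test))
print(f"Test accuracy: {accuracy:.2f}")
```

Accuracy is a reasonable first metric when the classes are roughly balanced, as they are here. With imbalanced classes, precision, recall, or a confusion matrix are usually more informative.
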
Exercise 6: Data Visualization
Objective: Create visual representations of data to aid in understanding.
Dataset: Choose a dataset that interests you.
Tasks:
1. Select two or more variables and create appropriate visualizations (e.g., bar chart, line
plot, pie chart).
2. Use color and labeling to enhance the interpretability of the visualizations.
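
As an illustration of both tasks, the sketch below draws a line plot of two variables over time. The monthly sales figures and channel names are invented for the example.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical monthly sales per channel (values are made up).
monthly = pd.DataFrame({
    "month": ["Jan", "Feb", "Mar", "Apr"],
    "online": [120, 135, 150, 170],
    "in_store": [200, 190, 185, 160],
})

# Task 1: one line per variable on shared axes.
fig, ax = plt.subplots()
ax.plot(monthly["month"], monthly["online"],
        color="tab:blue", marker="o", label="Online")
ax.plot(monthly["month"], monthly["in_store"],
        color="tab:orange", marker="s", label="In store")

# Task 2: distinct colors, a legend, axis labels, and a title.
ax.set_title("Monthly Sales by Channel")
ax.set_xlabel("Month")
ax.set_ylabel("Units sold")
ax.legend()
plt.show()
```

A line plot suits values that change over an ordered axis such as time. A bar chart is usually a better fit for comparing unordered categories.
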

Exercises with solutions


Exercise 1: Data Cleaning
Dataset: Download the dataset
Objective: Cleanse the COVID-19 dataset to prepare it for analysis.
Tasks:
1. Load the dataset into your preferred data analysis tool.
2. Identify and handle missing values.
3. Remove unnecessary columns.
4. Ensure consistency in date formats.
Solution:
import pandas as pd

# Load the dataset
url = "https://raw.githubusercontent.com/datasets/covid-19/master/data/countries-aggregated.csv"
covid_data = pd.read_csv(url)

# Identify and handle missing values
print(covid_data.isna().sum())
covid_data = covid_data.dropna()

# Remove unnecessary columns (.copy() avoids chained-assignment warnings below)
covid_data = covid_data[['Date', 'Country', 'Confirmed', 'Recovered', 'Deaths']].copy()

# Ensure consistency in date formats
covid_data['Date'] = pd.to_datetime(covid_data['Date'])

# Display the cleaned dataset
print(covid_data.head())

Exercise 2: Exploratory Data Analysis (EDA)


Dataset: Download the dataset
Objective: Perform exploratory data analysis on the Auto MPG dataset.
Tasks:
1. Load the dataset into your preferred data analysis tool.
2. Generate summary statistics for numerical variables.
3. Create visualizations to explore the distribution of key variables.
4. Identify correlations between variables.
Solution:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset ('?' marks missing horsepower values in the raw file)
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data"
column_names = ["MPG", "Cylinders", "Displacement", "Horsepower", "Weight",
                "Acceleration", "Model Year", "Origin", "Car Name"]
auto_data = pd.read_csv(url, sep=r"\s+", names=column_names, na_values="?")

# Summary statistics
print(auto_data.describe())

# Visualizations
sns.pairplot(auto_data[['MPG', 'Cylinders', 'Displacement', 'Weight']])
plt.show()

# Correlation matrix (numeric columns only; 'Car Name' is text)
correlation_matrix = auto_data.corr(numeric_only=True)
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.show()

Exercise 3: Regression Analysis


Dataset: Download the dataset
Objective: Perform a linear regression to predict the age of abalones, using the number of rings as a proxy for age.
Tasks:
1. Load the dataset into your preferred data analysis tool.
2. Choose a variable to predict (e.g., Rings).
3. Split the data into training and testing sets.
4. Train a linear regression model.
5. Evaluate the model's performance using metrics like Mean Squared Error or
R-squared.
Solution:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Load the dataset


url = "https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data"
column_names = ["Sex", "Length", "Diameter", "Height", "WholeWeight",
"ShuckedWeight", "VisceraWeight", "ShellWeight", "Rings"]
abalone_data = pd.read_csv(url, names=column_names)

# Choose the features (X) and the variable to predict (y).
# 'Sex' is categorical, so it is left out of this numeric-only model.
X = abalone_data[['Length', 'Diameter', 'Height', 'WholeWeight', 'ShuckedWeight',
                  'VisceraWeight', 'ShellWeight']]
y = abalone_data['Rings']

# Split the data into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a linear regression model


model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions on the test set


y_pred = model.predict(X_test)

# Evaluate the model's performance


mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f'Mean Squared Error: {mse}')


print(f'R-squared: {r2}')

BASIC EXERCISES WITH SOLUTIONS


Exercise 1: Data Cleaning
Objective: Cleanse a simple dataset.
Dataset:
import pandas as pd

data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emma', 'Frank', 'Grace', 'Harry', 'Ivy', 'Jack'],
'Age': [25, 28, None, 22, 30, 35, 28, None, 24, 29],
'Salary': [50000, 60000, 75000, 48000, None, 90000, 80000, 75000, 52000, 60000]
}

df = pd.DataFrame(data)

Tasks:
1. Handle missing values in the 'Age' and 'Salary' columns.
2. Drop any rows with missing values.
3. Display the cleaned dataset.
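Solution:
# Handle missing values (assign back instead of using inplace=True on a
# column selection, which newer pandas versions warn about or ignore)
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['Salary'] = df['Salary'].fillna(df['Salary'].median())

# Drop rows with missing values (none remain after the fills above,
# so this step is only a safeguard)
df = df.dropna()

# Display the cleaned dataset
print(df)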

Exercise 2: Exploratory Data Analysis (EDA)


Objective: Explore a dataset and generate basic insights.
Dataset:
import pandas as pd

data = {
'Student_ID': [1, 2, 3, 4, 5],
'Math_Score': [85, 90, 78, 92, 88],
'English_Score': [75, 80, 85, 88, 92],
'Science_Score': [90, 85, 88, 80, 95]
}

df = pd.DataFrame(data)

Tasks:
1. Calculate the mean, median, and standard deviation for each subject.
2. Plot a bar chart to visualize the average scores for each subject.
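Solution:
import matplotlib.pyplot as plt

# Calculate mean, median, and standard deviation for the score columns only
# (Student_ID is an identifier, not a subject)
scores = df.drop(columns='Student_ID')
subject_stats = scores.describe().loc[['mean', '50%', 'std']].transpose()
print(subject_stats)

# Plot a bar chart of the mean scores, with the standard deviation as error bars
subject_stats.plot(kind='bar', y='mean', yerr='std', legend=False)
plt.title('Average Scores for Each Subject')
plt.ylabel('Score')
plt.xlabel('Subject')
plt.show()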

Exercise 3: Data Visualization


Objective: Visualize a dataset using scatter plots.
Dataset:
import pandas as pd

data = {
'Hours_Studied': [2, 3, 5, 1, 4, 6, 7, 3, 2, 5],
'Exam_Score': [50, 65, 80, 40, 75, 90, 95, 60, 55, 85]
}

df = pd.DataFrame(data)

Tasks:
1. Create a scatter plot to visualize the relationship between hours studied and exam
scores.
2. Add labels and a title to the plot.
Solution:
import matplotlib.pyplot as plt

# Scatter plot
plt.scatter(df['Hours_Studied'], df['Exam_Score'])
plt.title('Relationship Between Hours Studied and Exam Score')
plt.xlabel('Hours Studied')
plt.ylabel('Exam Score')
plt.show()
