Case Study Title: Data Wrangling and Analysis for Insights

Objective:
Identify a dataset, clean and preprocess it using Python, perform data wrangling and
analysis, and present insights through visualizations. This project will simulate real-
world scenarios of working with messy datasets and deriving actionable insights.

Guidelines
1. Dataset Selection
You can:
 Search for datasets online from repositories like Kaggle, data.gov.ph, or UCI
Machine Learning Repository.
 Collect data from:
o Department of Health (DOH) datasets.
o PhilHealth or other government agencies.
o Local Government Units (LGUs).
o Your school, e.g., student performance, attendance, or survey data.
 Ensure the dataset has at least 150 rows and 5 columns.
2. Tasks
 Data Acquisition: Download or collect the dataset.
 Data Cleaning: Handle missing values, remove duplicates, format columns, and
fix inconsistencies. (Note: if the dataset is already clean, you may skip this task.)
 Data Wrangling:
o Perform transformations (e.g., pivot, split-apply-combine).
o Merge or join datasets if using multiple sources.
 Data Analysis: Use Python libraries like Pandas, NumPy, and
Matplotlib/Seaborn to:
o Generate descriptive statistics.
o Identify trends, patterns, or anomalies.
 Visualization: Produce at least three visualizations (e.g., bar chart, scatter plot,
line graph) that communicate the findings clearly.
 Insights: Interpret the findings and provide actionable insights or
recommendations.
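To make these tasks concrete, the sketch below walks through them end to end in Pandas. The file name dataset.csv and the columns category and value are placeholders, not part of the assignment; adapt them to your own data.

# Minimal sketch of the tasks above. "dataset.csv" and the columns
# "category" and "value" are placeholders; adapt them to your own data.
import pandas as pd
import matplotlib.pyplot as plt

# Data acquisition
df = pd.read_csv("dataset.csv")

# Data cleaning: standardize column names, drop duplicates, fix types, fill gaps
df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
df = df.drop_duplicates()
df["value"] = pd.to_numeric(df["value"], errors="coerce")
df["value"] = df["value"].fillna(df["value"].median())

# Data wrangling: split-apply-combine summary per category
summary = df.groupby("category")["value"].agg(["mean", "median", "std"])

# Data analysis: descriptive statistics for the whole dataset
print(df.describe())

# Visualization: one of the (at least three) required charts
summary["mean"].plot(kind="bar", title="Mean value per category")
plt.tight_layout()
plt.show()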
3. Tools and Libraries
 Python IDEs: You can use Jupyter Notebook, Google Colab, PyCharm or VS
Code.
 Required Libraries: Pandas, NumPy, Matplotlib, Seaborn (others optional:
SciPy, Plotly).
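If these libraries are not yet available, a typical setup (assuming a pip-based Python environment) and the conventional import aliases look like this:

# Run once in a terminal or notebook cell if the libraries are missing:
#   pip install pandas numpy matplotlib seaborn
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns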

Required Documentation
1. Cover Page
 Title of the project.
 Student names and section.
 Submission date.
2. Executive Summary (1 page)
 A brief overview of the dataset and main findings.
3. Introduction
 Objective of the case study.
 Description of the data source.
 Relevance of the dataset to real-world applications.
4. Data Overview
 Dataset description:
o Number of rows and columns.
o Description of each column (type, units, significance).
 Source information:
o Dataset URL or collection method.
 Initial observations:
o Missing values, duplicates, and potential issues.
5. Methodology
 Step-by-step explanation of:
o Data cleaning process.
o Data wrangling techniques.
o Analysis methods.
6. Results and Analysis
 Include:
o Descriptive statistics (mean, median, standard deviation, etc.).
o At least three visualizations with detailed captions and explanations.
o Significant trends, correlations, or insights discovered.
7. Insights and Recommendations
 Explain findings in non-technical terms.
 Suggest practical applications or implications of the insights.
8. Conclusion
 Recap the main steps, findings, and their importance.
 Reflection on challenges faced and lessons learned.
9. References
 Cite datasets and any other sources used.
10. Appendix (if needed)
 Include Python code snippets (organized by task) and outputs if not already
detailed in the main document.

Formatting Guidelines
 Font: Arial or Times New Roman, 12 pt.
 Spacing: 1.5 lines.
 Margins: 1 inch on all sides.
 Page Numbers: Bottom center.
 Figures and Tables:
o Label and number each figure/table (e.g., Figure 1: Correlation Heatmap).
o Include captions below figures and above tables.
 Length: 10-15 pages, excluding code appendix.
Grading Rubric
Criteria                         Weight (%)
Dataset Selection                      10%
Data Cleaning & Wrangling              25%
Data Analysis                          20%
Visualizations                         20%
Insights & Recommendations             15%
Documentation & Formatting             10%


Sample Case Study: Analyzing Health Trends in the Philippines
Title:
Exploring Health Indicators in the Philippines (2015–2020)
Objective:
To analyze trends in healthcare access and common illnesses reported in the
Philippines using publicly available data from the Department of Health (DOH).

1. Dataset Overview
 Dataset Name: Philippines Health Statistics 2015–2020
 Source: DOH Open Data Portal
 Description: Contains records on common illnesses, health facility statistics, and
population health metrics for six years.
 Rows and Columns:
o Rows: 5,000 (each row represents a unique observation for a city or
province over a year).
o Columns: 7 (e.g., Region, Year, Population, Health Facility Count,
Common Illnesses, Mortality Rate).
Columns and Description:
COLUMN NAME                  DESCRIPTION                                     EXAMPLE
REGION                       Name of the region                              Western Visayas
YEAR                         Year of data observation                        2019
POPULATION                   Total population in the region for the year     5,100,000
HEALTH FACILITY COUNT        Number of health facilities available           500
COMMON ILLNESSES             Most reported illness in the region             Dengue
MORTALITY RATE (per 1,000)   Deaths per 1,000 individuals                    4.5

2. Methodology
1. Data Cleaning:
o Handle Missing Values: Replaced missing population data with the
mean population of the region.
o Remove Duplicates: Identified and removed duplicate entries based on
region and year.
o Standardize Column Names: Changed "Health Facility Count" to
"Health_Facility_Count" for consistency.
2. Data Wrangling:
o Filtered data for the years 2015–2020.
o Created a new column: Facilities per 10,000 population =
(Health_Facility_Count / Population) * 10,000.
o Aggregated data to find the average mortality rate for each region.
3. Data Analysis:
o Used groupby() to analyze trends by region and year.
o Identified regions with the highest and lowest health facility counts.
4. Visualizations:
o Bar chart of common illnesses per region.
o Line plot for mortality rates (2015–2020).
o Scatter plot showing the correlation between facility count and mortality
rate.
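The sketch below shows one way these cleaning, wrangling, and analysis steps could be written in Pandas. The file name health_stats.csv and the exact column spellings are assumptions for illustration (following the table in Section 1), not the actual DOH export.

import pandas as pd

# Assumed file name for illustration; columns follow the table in Section 1.
df = pd.read_csv("health_stats.csv")

# Data cleaning
# Standardize column names, e.g., "Health Facility Count" -> "Health_Facility_Count"
df = df.rename(columns={
    "Health Facility Count": "Health_Facility_Count",
    "Common Illnesses": "Common_Illnesses",
    "Mortality Rate": "Mortality_Rate",
})

# Remove duplicate entries based on region and year
df = df.drop_duplicates(subset=["Region", "Year"])

# Replace missing population values with the mean population of the same region
df["Population"] = df["Population"].fillna(
    df.groupby("Region")["Population"].transform("mean")
)

# Data wrangling
# Keep 2015-2020 and derive facilities per 10,000 population
df = df[df["Year"].between(2015, 2020)].copy()
df["Facilities_per_10k"] = df["Health_Facility_Count"] / df["Population"] * 10_000

# Data analysis
# Average mortality rate per region, highest first
avg_mortality = df.groupby("Region")["Mortality_Rate"].mean().sort_values(ascending=False)
print(avg_mortality)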

3. Results and Visualizations


Key Findings:
 Health Facility Accessibility:
Regions with fewer than 5 facilities per 10,000 population had mortality rates
20% higher than the national average.
 Common Illness Trends:
Dengue and respiratory illnesses consistently ranked as the most reported
illnesses in high-density regions.

Visualizations:
1. Bar Chart - Top 5 Common Illnesses by Region (2019):
(Dengue leads in Central Luzon with over 20,000 cases reported.)
Insert Chart Here
2. Line Chart - Mortality Rate Trends (2015–2020):
(Sharp decline in mortality rates in regions with increased facility investments.)
Insert Chart Here
3. Scatter Plot - Correlation between Facilities and Mortality Rate:
(Negative correlation: Higher facilities per 10,000 = Lower mortality rates.)
Insert Chart Here

4. Insights and Recommendations


1. Insights:
o Investments in health facilities are linked to improved health outcomes.
o Preventative measures for dengue should be prioritized in Central Luzon.
2. Recommendations:
o The government should allocate more resources to regions with fewer
health facilities.
o Conduct awareness campaigns on dengue prevention in high-risk regions.

5. Python Code
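A minimal plotting sketch for the three visualizations described in Section 3, continuing from the cleaned DataFrame df produced in the Methodology sketch. The column names are the assumed placeholders used there, not the actual DOH schema, and the bar chart simply counts regions per reported illness because the assumed columns carry no case counts.

import matplotlib.pyplot as plt
import seaborn as sns

# Continues from the DataFrame df built in the Methodology sketch.

# 1. Bar chart - most reported illnesses in 2019 (number of regions per illness)
top_2019 = df[df["Year"] == 2019]["Common_Illnesses"].value_counts().head(5)
top_2019.plot(kind="bar", title="Most Reported Illnesses by Region (2019)")
plt.ylabel("Number of regions")
plt.tight_layout()
plt.show()

# 2. Line chart - average mortality rate per year, 2015-2020
yearly = df.groupby("Year")["Mortality_Rate"].mean()
yearly.plot(kind="line", marker="o", title="Average Mortality Rate (2015-2020)")
plt.ylabel("Deaths per 1,000")
plt.tight_layout()
plt.show()

# 3. Scatter plot - facilities per 10,000 population vs mortality rate
sns.scatterplot(data=df, x="Facilities_per_10k", y="Mortality_Rate")
plt.title("Facilities per 10,000 Population vs Mortality Rate")
plt.tight_layout()
plt.show()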
