Case Study Guidelines
Objective:
Identify a dataset, clean and preprocess it using Python, perform data wrangling and
analysis, and present insights through visualizations. This project will simulate real-
world scenarios of working with messy datasets and deriving actionable insights.
Guidelines
1. Dataset Selection
You can:
Search for datasets online from repositories like Kaggle, data.gov.ph, or UCI
Machine Learning Repository.
Collect data from:
o Department of Health (DOH) datasets.
o PhilHealth or other government agencies.
o Local Government Units (LGUs).
o Your school, e.g., student performance, attendance, or survey data.
Ensure the dataset has at least 150 rows and 5 columns.
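A quick way to confirm the size requirement is to check the shape of the loaded DataFrame. The following is a minimal sketch; the helper name `meets_minimum_size` and the toy frame are assumptions for illustration, and a real project would load its own file instead.

```python
import pandas as pd

MIN_ROWS = 150  # minimum number of rows required by the guideline
MIN_COLS = 5    # minimum number of columns required by the guideline

def meets_minimum_size(df: pd.DataFrame) -> bool:
    """Return True if the frame has at least MIN_ROWS rows and MIN_COLS columns."""
    n_rows, n_cols = df.shape
    return n_rows >= MIN_ROWS and n_cols >= MIN_COLS

# Toy frame standing in for a real dataset (200 rows, 5 columns).
# With real data, load the file first, e.g. with pd.read_csv on your own path.
sample = pd.DataFrame({f"col_{i}": range(200) for i in range(5)})
print(sample.shape, meets_minimum_size(sample))
```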
2. Tasks
Data Acquisition: Download or collect the dataset.
Data Cleaning: Handle missing values, remove duplicates, format columns, and
fix inconsistencies. (Note: If the dataset is already clean, you can skip this task.)
Data Wrangling:
o Perform transformations (e.g., pivot, split-apply-combine).
o Merge or join datasets if using multiple sources.
Data Analysis: Use Python libraries like Pandas, NumPy, and
Matplotlib/Seaborn to:
o Generate descriptive statistics.
o Identify trends, patterns, or anomalies.
Visualization: Produce at least three visualizations (e.g., bar chart, scatter plot,
line graph) that communicate the findings clearly.
Insights: Interpret the findings and provide actionable insights or
recommendations.
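The wrangling operations named in the tasks above (removing duplicates, split-apply-combine, pivot, and merge) can be sketched on a small table. All values, column names, and adviser names below are invented for illustration; they are assumptions, not part of any required dataset.

```python
import pandas as pd

# Hypothetical attendance records (invented values, for illustration only)
attendance = pd.DataFrame({
    "section": ["A", "A", "B", "B", "B"],
    "month": ["Jan", "Feb", "Jan", "Feb", "Feb"],
    "absences": [3, 5, 2, 4, 4],
})

# Data cleaning: drop the repeated B/Feb row
attendance = attendance.drop_duplicates()

# Split-apply-combine: split by section, apply mean, combine into one Series
mean_absences = attendance.groupby("section")["absences"].mean()

# Pivot: one row per section, one column per month
wide = attendance.pivot(index="section", columns="month", values="absences")

# Merge: attach a second (hypothetical) source keyed on the same column
advisers = pd.DataFrame({"section": ["A", "B"], "adviser": ["Cruz", "Reyes"]})
merged = attendance.merge(advisers, on="section", how="left")

print(mean_absences)
print(wide)
print(merged)
```

A left merge keeps every attendance row even if a section has no matching adviser, which makes unmatched keys easy to spot as missing values.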
3. Tools and Libraries
Python IDEs: You can use Jupyter Notebook, Google Colab, PyCharm, or VS
Code.
Required Libraries: Pandas, NumPy, Matplotlib, Seaborn (others optional:
SciPy, Plotly).
Required Documentation
1. Cover Page
Title of the project.
Student names and section.
Submission date.
2. Executive Summary (1 page)
A brief overview of the dataset and main findings.
3. Introduction
Objective of the case study.
Description of the data source.
Relevance of the dataset to real-world applications.
4. Data Overview
Dataset description:
o Number of rows and columns.
o Description of each column (type, units, significance).
Source information:
o Dataset URL or collection method.
Initial observations:
o Missing values, duplicates, and potential issues.
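The facts requested for the Data Overview can be collected in one pass with Pandas. This is a rough sketch; the function name `summarize_dataset` and the toy survey table are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def summarize_dataset(df: pd.DataFrame) -> dict:
    """Collect row/column counts, column types, missing values, and duplicates."""
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        "dtypes": df.dtypes.astype(str).to_dict(),
        "missing_per_column": df.isna().sum().to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
    }

# Toy survey data (invented): one fully repeated row and one missing score
survey = pd.DataFrame({
    "student_id": [1, 2, 2, 3],
    "score": [88.0, 91.0, 91.0, np.nan],
})

overview = summarize_dataset(survey)
for key, value in overview.items():
    print(f"{key}: {value}")
```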
5. Methodology
Step-by-step explanation of:
o Data cleaning process.
o Data wrangling techniques.
o Analysis methods.
6. Results and Analysis
Include:
o Descriptive statistics (mean, median, standard deviation, etc.).
o At least three visualizations with detailed captions and explanations.
o Significant trends, correlations, or insights discovered.
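As a small illustration of the statistics listed above, the sketch below computes the mean, median, and standard deviation of each column and the Pearson correlation between two columns. The data and column names are invented for the example.

```python
import pandas as pd

# Invented study data, for illustration only
scores = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5],
    "exam_score": [52, 58, 65, 70, 80],
})

# Descriptive statistics for every column in one table
stats = scores.agg(["mean", "median", "std"])

# Pearson correlation between the two columns (Series.corr default)
correlation = scores["hours_studied"].corr(scores["exam_score"])

print(stats)
print(f"correlation: {correlation:.3f}")
```

`scores.describe()` gives a similar summary with quartiles included, which may be enough for a first look.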
7. Insights and Recommendations
Explain findings in non-technical terms.
Suggest practical applications or implications of the insights.
8. Conclusion
Recap the main steps, findings, and their importance.
Reflection on challenges faced and lessons learned.
9. References
Cite datasets and any other sources used.
10. Appendix (if needed)
Include Python code snippets (organized by task) and outputs if not already
detailed in the main document.
Formatting Guidelines
Font: Arial or Times New Roman, 12 pt.
Spacing: 1.5 lines.
Margins: 1 inch on all sides.
Page Numbers: Bottom center.
Figures and Tables:
o Label and number each figure/table (e.g., Figure 1: Correlation Heatmap).
o Include captions below figures and above tables.
Length: 10-15 pages, excluding code appendix.
Grading Rubric
Criteria Weight (%)
Visualizations 20%
1. Dataset Overview
Dataset Name: Philippines Health Statistics 2015–2020
Source: DOH Open Data Portal
Description: Contains records on common illnesses, health facility statistics, and
population health metrics for six years.
Rows and Columns:
o Rows: 5,000 (each row represents a unique observation for a city or
province over a year).
o Columns: 7 (e.g., Region, Year, Population, Health Facility Count,
Common Illnesses, Mortality Rate).
Columns and Description:
COLUMN NAME                  DESCRIPTION                                    EXAMPLE
REGION                       Name of the region                             Western Visayas
YEAR                         Year of data observation                       2019
POPULATION                   Total population in the region for the year    5,100,000
HEALTH FACILITY COUNT        Number of health facilities available          500
COMMON ILLNESSES             Most reported illness in the region            Dengue
MORTALITY RATE (per 1,000)   Deaths per 1,000 individuals                   4.5
2. Methodology
1. Data Cleaning:
o Handle Missing Values: Replaced missing population values with the mean
population of the same region.
o Remove Duplicates: Identified and removed duplicate entries based on
region and year.
o Standardize Column Names: Changed "Health Facility Count" to
"Health_Facility_Count" for consistency.
2. Data Wrangling:
o Filtered data for the years 2015–2020.
o Created a new column: Facilities per 10,000 population =
(Health_Facility_Count / Population) * 10,000.
o Aggregated data to find the average mortality rate for each region.
3. Data Analysis:
o Used groupby() to analyze trends by region and year.
o Identified regions with the highest and lowest health facility counts.
4. Visualizations:
o Bar chart of common illnesses per region.
o Line plot for mortality rates (2015–2020).
o Scatter plot showing the correlation between facility count and mortality
rate.
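The cleaning, wrangling, and analysis steps above might look roughly like this in Pandas. This is a minimal sketch on a few invented rows, not the actual DOH data. The column names follow the Dataset Overview, while `Facilities_per_10k` and the other variable names are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

# Toy rows with invented values; one duplicate (Western Visayas, 2019),
# two missing populations, and one out-of-range year (2014)
raw = pd.DataFrame({
    "Region": ["Western Visayas", "Western Visayas", "Western Visayas",
               "Central Luzon", "Central Luzon", "Central Luzon"],
    "Year": [2018, 2019, 2019, 2018, 2019, 2014],
    "Population": [5_000_000, np.nan, np.nan, 8_000_000, 8_200_000, 7_500_000],
    "Health Facility Count": [480, 500, 500, 900, 950, 850],
    "Mortality Rate": [4.8, 4.5, 4.5, 5.2, 5.0, 5.5],
})

# 1. Data cleaning
df = raw.copy()
# Fill missing population with the mean population of the same region
df["Population"] = df["Population"].fillna(
    df.groupby("Region")["Population"].transform("mean")
)
# Remove duplicate entries based on region and year
df = df.drop_duplicates(subset=["Region", "Year"])
# Standardize column names: spaces become underscores
df = df.rename(columns=lambda name: name.replace(" ", "_"))

# 2. Data wrangling
df = df[df["Year"].between(2015, 2020)].copy()  # keep 2015-2020 inclusive
df["Facilities_per_10k"] = df["Health_Facility_Count"] / df["Population"] * 10_000
avg_mortality = df.groupby("Region")["Mortality_Rate"].mean()

# 3. Data analysis
# Mortality by region and year, one column per year
trend = df.groupby(["Region", "Year"])["Mortality_Rate"].mean().unstack("Year")
avg_facilities = df.groupby("Region")["Health_Facility_Count"].mean()
most_facilities = avg_facilities.idxmax()
fewest_facilities = avg_facilities.idxmin()

print(df)
print(avg_mortality)
print(trend)
print("Most facilities:", most_facilities, "| Fewest:", fewest_facilities)
```

One design point worth noting: the sketch follows the order of the steps as listed, but removing duplicates before imputing would keep repeated rows from weighting the regional mean twice. With these toy rows both orders give the same result.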
Visualizations:
1. Bar Chart - Top 5 Common Illnesses by Region (2019):
(Dengue leads in Central Luzon with over 20,000 cases reported.)
Insert Chart Here
2. Line Chart - Mortality Rate Trends (2015–2020):
(Sharp decline in mortality rates in regions with increased facility investments.)
Insert Chart Here
3. Scatter Plot - Correlation between Facilities and Mortality Rate:
(Negative correlation: regions with more facilities per 10,000 population tend to have lower mortality rates.)
Insert Chart Here
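The three chart types listed above could be produced with Matplotlib along the following lines. This is only a sketch: every number is invented, and the `Cases` and `Facilities_per_10k` columns and the `make_figures` helper are assumptions for illustration, so the output does not reproduce the findings quoted in the captions.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripts; omit this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Invented values, for illustration only
illness = pd.DataFrame({
    "Illness": ["Dengue", "Influenza", "Pneumonia", "Hypertension", "Diarrhea"],
    "Cases": [900, 750, 600, 450, 300],
})
trend = pd.DataFrame({
    "Year": [2015, 2016, 2017, 2018, 2019, 2020],
    "Mortality_Rate": [5.6, 5.4, 5.3, 5.0, 4.8, 4.7],
})
regions = pd.DataFrame({
    "Facilities_per_10k": [0.8, 1.0, 1.2, 1.5, 1.9],
    "Mortality_Rate": [5.5, 5.1, 4.9, 4.6, 4.2],
})

def make_figures(illness, trend, regions):
    """Draw a bar chart, a line chart, and a scatter plot side by side."""
    fig, axes = plt.subplots(1, 3, figsize=(15, 4))

    axes[0].bar(illness["Illness"], illness["Cases"])
    axes[0].set_title("Top 5 Common Illnesses (2019)")
    axes[0].set_ylabel("Reported cases")
    axes[0].tick_params(axis="x", rotation=30)

    axes[1].plot(trend["Year"], trend["Mortality_Rate"], marker="o")
    axes[1].set_title("Mortality Rate Trend (2015-2020)")
    axes[1].set_xlabel("Year")
    axes[1].set_ylabel("Deaths per 1,000")

    axes[2].scatter(regions["Facilities_per_10k"], regions["Mortality_Rate"])
    axes[2].set_title("Facilities vs. Mortality Rate")
    axes[2].set_xlabel("Facilities per 10,000 population")
    axes[2].set_ylabel("Deaths per 1,000")

    fig.tight_layout()
    return fig, axes

fig, axes = make_figures(illness, trend, regions)
```

Calling `fig.savefig("figures.png", dpi=150)` afterwards writes the charts to an image file (the file name here is just an example) that can be placed in the report with a numbered caption.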
5. Python Code