Data Programming
Objectives:
● Understand the dataset with a data scientist's mindset.
● Understand and design computation logic and routines in Python.
● Assess use of Python only and Python data structures to perform extract, load, and
transformation operations.
● Assess use of Pandas dataframe to perform extract, load, transformation and calculation
operations.
● Structure code in appropriate methods (functions), looping and conditions.
● Conduct visualization in an appropriate way.
The dataset in question provides a rich overview of Housing and Development Board (HDB) flat
transactions in Singapore. It is derived from the national database managed by Singapore's open
data initiative.
The data captured includes vital information such as the resale price, flat type, address, lease
commencement date, and floor area, among other details. These elements allow for robust analysis
of aspects such as price trends and geographical price disparities. You may find more
information at `https://data.gov.sg/dataset/resale-flat-prices`.
Additionally, this dataset provides an invaluable resource for understanding the evolution of
Singapore's public housing landscape, the preferences of the populace, and market dynamics over
time. As such, it is an essential tool for policy makers, real estate professionals, urban planners,
and researchers studying Singapore's unique public housing model.
By addressing the given tasks, you will gain data analysis competencies, including data
preprocessing and manipulation, which are fundamental for preparing and managing datasets.
Additionally, you will enhance your ability to comprehend data relationships by creating data
visualizations and performing correlation analysis.
(a) Load all CSV files containing transacted flats in a given `data` directory and merge them
all into a single Pandas DataFrame. Drop the `remaining_lease` column from the merged
DataFrame. Are there any columns that contain null values or empty strings?
(6 marks)
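A minimal sketch of one way to approach (a), under stated assumptions: the helper names `load_resale_csvs` and `count_missing`, the toy CSV contents, and the file names are all hypothetical and only serve to make the example runnable. Note that `pd.read_csv` turns empty fields into nulls by default, so the string check mainly catches whitespace-only values.

```python
import glob
import os
import tempfile

import pandas as pd


def load_resale_csvs(data_dir):
    """Read every CSV in data_dir and stack the rows into one DataFrame."""
    paths = sorted(glob.glob(os.path.join(data_dir, "*.csv")))
    frames = [pd.read_csv(path) for path in paths]
    merged = pd.concat(frames, ignore_index=True)
    # errors="ignore" keeps the call safe if the column is absent.
    return merged.drop(columns=["remaining_lease"], errors="ignore")


def count_missing(df):
    """Count nulls and empty/whitespace-only strings for each column."""
    rows = {}
    for col in df.columns:
        n_null = int(df[col].isna().sum())
        n_empty = 0
        if not pd.api.types.is_numeric_dtype(df[col]):
            values = df[col].dropna().astype(str).str.strip()
            n_empty = int((values == "").sum())
        rows[col] = {"nulls": n_null, "empty_strings": n_empty}
    return pd.DataFrame.from_dict(rows, orient="index")


# Toy files standing in for the real `data` directory (illustrative values only).
with tempfile.TemporaryDirectory() as data_dir:
    with open(os.path.join(data_dir, "part1.csv"), "w") as f:
        f.write("town,resale_price,remaining_lease\nBEDOK,300000,60\n")
    with open(os.path.join(data_dir, "part2.csv"), "w") as f:
        f.write("town,resale_price,remaining_lease\n ,410000,59\nYISHUN,,58\n")
    flats = load_resale_csvs(data_dir)

report = count_missing(flats)
print(report)
```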
(b) Convert the `month` column to date-time format. Design a visualization to analyse the
`month` column by considering it as a numeric date-time and share insights.
(4 marks)
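As an illustration of the conversion in (b), the sketch below assumes the `month` values look like `YYYY-MM`; that layout and the toy rows are assumptions, not something stated in the brief. It stops at a per-month transaction count, which is one possible basis for a time-series plot.

```python
import pandas as pd

# Toy values; the "YYYY-MM" layout is an assumption about the real files.
flats = pd.DataFrame({"month": ["2023-01", "2023-01", "2023-02", "2023-04"]})

# An explicit format avoids ambiguous parsing and raises on malformed rows.
flats["month"] = pd.to_datetime(flats["month"], format="%Y-%m")

# Number of transactions per month, indexed by timestamp.
per_month = flats.groupby("month").size()
print(per_month)
```

Calling `per_month.plot()` (with matplotlib installed) would draw the counts over time; it is left out here so the sketch runs without a plotting backend.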
(c) The column `storey_range` is in the format "lower TO upper" (e.g. 1 TO 3). Compute a
new column called `storey_level` by calculating the average of the lower and upper storey
values. Drop the `storey_range` column from the DataFrame.
(5 marks)
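The calculation in (c) can be sketched as follows. The toy values are illustrative, and the regular expression is an assumption that tolerates optional zero padding and spacing around `TO`.

```python
import pandas as pd

flats = pd.DataFrame({"storey_range": ["01 TO 03", "10 TO 12", "4 TO 6"]})

# Capture the two integers on either side of "TO" as two columns.
bounds = flats["storey_range"].str.extract(r"(\d+)\s*TO\s*(\d+)").astype(float)

# Row-wise mean of the lower and upper bound.
flats["storey_level"] = bounds.mean(axis=1)
flats = flats.drop(columns=["storey_range"])
print(flats)
```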
(d) Identify inconsistent `flat_model` and `flat_type` values and perform the standardization
of the values.
(4 marks)
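One possible approach to (d) is to normalize case and whitespace first, then map any remaining variants through an alias table. The toy values and the `ALIASES` dictionary below are hypothetical; the real inconsistencies have to be found by inspecting the data, for example with `value_counts()`.

```python
import pandas as pd

# Toy values showing typical kinds of inconsistency (not taken from the dataset).
flats = pd.DataFrame({
    "flat_type": ["5 ROOM", "5 room ", "MULTI GENERATION", "MULTI-GENERATION"],
    "flat_model": ["Improved", "IMPROVED", "Model A", "model a"],
})

# Inspect the distinct spellings before cleaning.
print(flats["flat_type"].value_counts())

# Hypothetical alias table for variants that differ by more than case or spacing.
ALIASES = {"MULTI GENERATION": "MULTI-GENERATION"}

for col in ["flat_type", "flat_model"]:
    cleaned = flats[col].str.strip().str.upper()
    flats[col] = cleaned.replace(ALIASES)

print(flats)
```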
(f) Design and identify FIVE (5) factors that influence the resale price and offer a rationale
for each of these correlations.
(15 marks)
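A starting point for (f) could be a ranking of numeric columns by their Pearson correlation with the resale price. The column names and numbers below are made up for illustration and should not be read as findings; correlation alone also does not supply the rationale the task asks for.

```python
import pandas as pd

# Toy numbers for illustration only; column names are assumptions.
flats = pd.DataFrame({
    "floor_area_sqm": [67, 92, 110, 121, 145],
    "storey_level": [2, 8, 5, 11, 14],
    "lease_commence_year": [1980, 1995, 1988, 2005, 2012],
    "resale_price": [310000, 450000, 520000, 640000, 780000],
})

# Restrict to numeric columns, then rank them by correlation with the price.
numeric = flats.select_dtypes(include="number")
corr_with_price = (
    numeric.corr()["resale_price"]
    .drop("resale_price")
    .sort_values(ascending=False)
)
print(corr_with_price)
```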
Question 2 (60 marks)
Objectives:
● Understand the dataset with a data scientist's mindset
● Design computation logic and routines in Python
● Conduct visualization in an appropriate way
● Assess the design and use of database ORM / SQLite methods to perform extract, load,
transformation and calculation operations
The Mass Rapid Transit (MRT) exits dataset, obtained via Singapore's open data portal
(https://beta.data.gov.sg/datasets/367/view), offers MRT exit locations within the country. This
spatial dataset, providing data on exit coordinates and associated metadata, is instrumental in
geographic-based analysis such as the calculation of distance metrics. Harnessing this data source
facilitates a deeper understanding of the impact of public transportation infrastructure on various
urban phenomena, such as residential property resale prices.
(a) Use the `geopandas` and `contextily` libraries to visualize MRT exits based on the contents
of the GeoJSON file named `mrt-exits.geojson`.
(5 marks)
(b) Perform the following tasks:
Extract the longitude and latitude values from the `geometry` field and create two
new columns in the GeoPandas DataFrame.
Use `KMeans`
(https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html) clustering from
the `sklearn` library to identify `5` clusters of these MRT exits based on their
geographical coordinates.
Create a plot visualizing these clusters with different colors and add the map of
Singapore as the background using `geopandas` and `contextily`.
(8 marks)
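The clustering step in (b) might look like the sketch below. It uses synthetic coordinates in place of the GeoJSON file, and the `n_init` and `random_state` settings are choices made for reproducibility rather than requirements of the task. Map plotting is omitted because basemap tiles need network access.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic (longitude, latitude) points standing in for the real exits.
# With a GeoDataFrame of Point geometries, the two columns could instead come
# from gdf.geometry.x and gdf.geometry.y.
rng = np.random.default_rng(0)
centres = np.array([
    [103.70, 1.34],
    [103.85, 1.29],
    [103.95, 1.35],
    [103.82, 1.43],
    [103.89, 1.39],
])
coords = np.vstack([c + rng.normal(scale=0.005, size=(20, 2)) for c in centres])

# Fit five clusters and get one integer label per point.
model = KMeans(n_clusters=5, n_init=10, random_state=0)
labels = model.fit_predict(coords)

print(model.cluster_centers_)
```

The resulting labels are arbitrary integers, so which label corresponds to which part of the island has to be read off the cluster centres or the plot.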
(c) Perform the following tasks:
Map each cluster of MRT exits to one of the five main regions of Singapore: Central
Region, East Region, North Region, North-East Region, and West Region.
Update the GeoPandas DataFrame by adding a new column `region` representing
the region to which each MRT exit belongs.
(5 marks)
(d) Calculate the number of MRT exits for each region using three different methods:
1) Utilize the pandas DataFrame.
2) Leverage the sqlite3 library.
3) Employ SQLAlchemy and its ORM approach: first define a Python class
representing the MRT exits (`longitude`, `latitude`, `region`), then use this
class to insert the data into a SQLite database and execute a query to get the
number of exits for each region.
(9 marks)
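The three counting methods in (d) can be compared side by side on a toy table, as sketched below. The table name `mrt_exits`, the class name `MrtExit`, the surrogate `id` key, and the sample rows are assumptions for illustration; in-memory databases are used so nothing is written to disk.

```python
import sqlite3

import pandas as pd
from sqlalchemy import Column, Float, Integer, String, create_engine, func
from sqlalchemy.orm import Session, declarative_base

# Toy rows standing in for the GeoPandas DataFrame from the earlier parts.
exits = pd.DataFrame({
    "longitude": [103.85, 103.86, 103.95, 103.70],
    "latitude": [1.29, 1.30, 1.35, 1.34],
    "region": ["Central Region", "Central Region", "East Region", "West Region"],
})

# 1) pandas: group and count.
counts_pandas = exits.groupby("region").size().to_dict()

# 2) sqlite3: load the rows, then aggregate in SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mrt_exits (longitude REAL, latitude REAL, region TEXT)")
conn.executemany(
    "INSERT INTO mrt_exits VALUES (?, ?, ?)",
    exits.itertuples(index=False, name=None),
)
counts_sqlite = dict(
    conn.execute("SELECT region, COUNT(*) FROM mrt_exits GROUP BY region")
)
conn.close()

# 3) SQLAlchemy ORM: map a class to the table, insert objects, query counts.
Base = declarative_base()


class MrtExit(Base):
    __tablename__ = "mrt_exits"
    id = Column(Integer, primary_key=True)  # surrogate key needed by the ORM
    longitude = Column(Float)
    latitude = Column(Float)
    region = Column(String)


engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)
with Session(engine) as session:
    session.add_all([MrtExit(**row) for row in exits.to_dict("records")])
    session.commit()
    rows = (
        session.query(MrtExit.region, func.count(MrtExit.id))
        .group_by(MrtExit.region)
        .all()
    )
counts_orm = {region: n for region, n in rows}

print(counts_pandas, counts_sqlite, counts_orm)
```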
(e) Perform the following tasks:
Draw a random sample of 100 transacted flats from Question 1 with the random
seed set to 0.
Utilize the `geopy` library's `Nominatim` or `GoogleV3` geocoder to obtain the
longitude and latitude data for the sampled transacted flats.
Nominatim (https://geopy.readthedocs.io/en/stable/#geopy.geocoders.Nominatim.geocode)
GoogleV3 (https://geopy.readthedocs.io/en/stable/#geopy.geocoders.GoogleV3.geocode)
(5 marks)
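A hedged sketch of the geocoding loop in (e) follows. The helper names `geocode_addresses` and `make_nominatim_geocoder` are hypothetical. The geocoder is passed in as a callable so the loop can be exercised offline with a stand-in, while the Nominatim variant needs `geopy` and network access and is rate limited to stay polite to the public service.

```python
def geocode_addresses(addresses, geocode):
    """Return {address: (longitude, latitude)}, with None where nothing matched.

    `geocode` is any callable that takes an address string and returns an
    object with .latitude and .longitude attributes, or None.
    """
    results = {}
    for address in addresses:
        location = geocode(address)
        if location is None:
            results[address] = None
        else:
            results[address] = (location.longitude, location.latitude)
    return results


def make_nominatim_geocoder(user_agent):
    """Build a rate-limited Nominatim geocode callable (needs geopy + network)."""
    # Imported lazily so the pure-Python loop above works without geopy.
    from geopy.extra.rate_limiter import RateLimiter
    from geopy.geocoders import Nominatim

    geolocator = Nominatim(user_agent=user_agent)
    # Wait at least one second between requests.
    return RateLimiter(geolocator.geocode, min_delay_seconds=1)
```

The sample itself could be drawn with `DataFrame.sample` and `random_state=0`, and the address strings built from the address columns of the sampled rows before being passed to `geocode_addresses`.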
(h) Use an SQLite query to determine the three towns in each region that have the highest average
resale price for '5 ROOM' flats transacted within the first half of 2023.
(5 marks)
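One way to express a "top three per group" query in (h) is with a window function, as sketched below. The `flats` table layout, the `YYYY-MM` text format of `month`, and the placeholder towns and prices are assumptions for illustration. `ROW_NUMBER()` requires SQLite 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE flats "
    "(town TEXT, region TEXT, flat_type TEXT, month TEXT, resale_price REAL)"
)

# Placeholder rows; names and prices are invented for the example.
rows = [
    ("TOWN A", "East Region", "5 ROOM", "2023-01", 700000),
    ("TOWN A", "East Region", "5 ROOM", "2023-05", 720000),
    ("TOWN B", "East Region", "5 ROOM", "2023-02", 650000),
    ("TOWN C", "East Region", "5 ROOM", "2023-03", 600000),
    ("TOWN D", "East Region", "5 ROOM", "2023-04", 550000),
    ("TOWN D", "East Region", "4 ROOM", "2023-03", 900000),  # wrong flat type
    ("TOWN D", "East Region", "5 ROOM", "2023-07", 990000),  # outside H1 2023
    ("TOWN E", "West Region", "5 ROOM", "2023-06", 500000),
]
conn.executemany("INSERT INTO flats VALUES (?, ?, ?, ?, ?)", rows)

QUERY = """
WITH town_avg AS (
    SELECT region, town, AVG(resale_price) AS avg_price
    FROM flats
    WHERE flat_type = '5 ROOM'
      AND month BETWEEN '2023-01' AND '2023-06'
    GROUP BY region, town
),
ranked AS (
    SELECT region, town, avg_price,
           ROW_NUMBER() OVER (
               PARTITION BY region ORDER BY avg_price DESC
           ) AS rn
    FROM town_avg
)
SELECT region, town, avg_price
FROM ranked
WHERE rn <= 3
ORDER BY region, avg_price DESC
"""

top_towns = conn.execute(QUERY).fetchall()
conn.close()

for region, town, avg_price in top_towns:
    print(region, town, avg_price)
```

Comparing `month` as text works here only because `YYYY-MM` strings sort in chronological order; a different storage format would need a different filter.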