Data Programming

The document provides details on two data programming questions. Question 1 involves loading, cleaning, manipulating, and analyzing a dataset on HDB flat transactions to identify factors influencing resale prices. Question 2 focuses on analyzing an MRT exit dataset, including visualizing exit locations and clusters, calculating exit counts by region, geocoding flat locations, and correlating flat prices with distance to MRT exits and the CBD over time.
Answer all questions. (Total 100 marks)

Question 1 (40 marks)

Objectives:
● Understand the dataset with a data scientist's mindset.
● Understand and design computation logic and routines in Python.
● Assess use of Python only and Python data structures to perform extract, load, and
transformation operations.
● Assess use of Pandas dataframe to perform extract, load, transformation and calculation
operations.
● Structure code in appropriate methods (functions), looping and conditions.
● Conduct visualization in an appropriate way.

The dataset in question provides a rich overview of Housing and Development Board (HDB) flat
transactions in Singapore. It is derived from the national database managed by Singapore's open
data initiative.

The data captured includes vital information such as the resale price, flat type, address, lease
commencement date, and floor area, among other details. These elements allow for robust analysis
on a multitude of aspects such as price trends and geographical price disparities. You may refer to
more information at `https://data.gov.sg/dataset/resale-flat-prices`.

Additionally, this dataset provides an invaluable resource for understanding the evolution of
Singapore's public housing landscape, the preferences of the populace, and market dynamics over
time. As such, it is an essential tool for policy makers, real estate professionals, urban planners,
and researchers studying Singapore's unique public housing model.

By addressing the given tasks, you will gain data analysis competencies, including data
preprocessing and manipulation, fundamental for preparing and managing datasets. Additionally,
you'll enhance your ability to comprehend data relationships through the practice of creating data
visualizations and executing correlation analysis.

(a) Load all CSV files containing transacted flats in a given `data` directory and merge
them all into a single Pandas DataFrame. Drop the `remaining_lease` column from the merged
DataFrame. Are there any columns that contain null values or empty strings?
(6 marks)
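
One way to approach part (a) is sketched below. It assumes the transaction CSV files sit directly inside the `data` directory and share the same columns. The helper names `load_transactions` and `columns_with_missing`, and the `pattern` parameter, are illustrative and not part of the question. Because the same directory may hold other CSV files (Question 2 mentions `data/addresses.csv`), the glob pattern may need narrowing.

```python
from pathlib import Path

import pandas as pd


def load_transactions(data_dir="data", pattern="*.csv"):
    """Read every CSV matching pattern in data_dir and stack the rows."""
    csv_paths = sorted(Path(data_dir).glob(pattern))
    frames = [pd.read_csv(path) for path in csv_paths]
    merged = pd.concat(frames, ignore_index=True)
    # errors="ignore" keeps this safe if no file carries the column.
    return merged.drop(columns=["remaining_lease"], errors="ignore")


def columns_with_missing(df):
    """Return the names of columns holding nulls or blank strings."""
    flagged = []
    for name in df.columns:
        col = df[name]
        has_null = col.isna().any()
        # read_csv already reads empty cells as NaN by default;
        # this extra check also catches whitespace-only strings.
        has_blank = (col.dropna().astype(str).str.strip() == "").any()
        if has_null or has_blank:
            flagged.append(name)
    return flagged
```

Calling `columns_with_missing(load_transactions("data"))` would then list the columns that need attention before further cleaning.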

(b) Convert the `month` column to date-time format. Design a visualization to analyse the
`month` column by considering it as a numeric date-time and share insights.
(4 marks)

(c) The column `storey_range` is in the format "lower TO upper" (e.g. 1 TO 3). Compute a
new column called `storey_level` by calculating the average of the lower and upper storey
values. Drop the `storey_range` column from the DataFrame.
(5 marks)
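
The midpoint calculation in part (c) can be sketched as a small pure-Python helper. The function name `storey_level` and the strictness of the pattern are assumptions for illustration; the real column may contain formats this sketch rejects.

```python
import re

# Matches strings such as "01 TO 03" or "10 to 12".
_STOREY_PATTERN = re.compile(r"^\s*(\d+)\s+TO\s+(\d+)\s*$", re.IGNORECASE)


def storey_level(storey_range):
    """Return the midpoint of a 'lower TO upper' storey range."""
    match = _STOREY_PATTERN.match(storey_range)
    if match is None:
        raise ValueError(f"unexpected storey_range: {storey_range!r}")
    lower, upper = (int(part) for part in match.groups())
    return (lower + upper) / 2


print(storey_level("01 TO 03"))  # 2.0
print(storey_level("10 TO 12"))  # 11.0
```

On a DataFrame `df`, this could be applied with `df["storey_level"] = df["storey_range"].map(storey_level)`, followed by `df = df.drop(columns="storey_range")`.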

(d) Identify inconsistent `flat_model` and `flat_type` values and perform the standardization
of the values.
(4 marks)
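
A possible starting point for part (d) is a normalisation helper that removes differences in case, hyphenation, and spacing. The sample strings below are made up for illustration; the actual inconsistencies should be identified first, for example with `Series.value_counts()` on each column.

```python
import re


def normalise_label(value):
    """Upper-case a label, turn hyphens into spaces, and collapse whitespace."""
    text = str(value).upper().replace("-", " ")
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical raw values, not taken from the dataset.
samples = ["Improved", "IMPROVED", "MULTI-GENERATION", "Multi  Generation "]
print(sorted({normalise_label(s) for s in samples}))
# ['IMPROVED', 'MULTI GENERATION']
```

Whether hyphens should become spaces (or the reverse) is a design choice; what matters is that each category ends up with exactly one spelling.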

(e) Perform the following visualizations:

(i) Plot a histogram of the `resale_price` to understand its distribution. Is it normally
distributed or skewed?
(ii) Generate a boxplot for the `floor_area_sqm` column. Are there any values that lie
outside the expected range? If outliers are present, please provide an explanation
for their occurrence.
(6 marks)

(f) Design and identify FIVE (5) factors that influence the resale price and offer a rationale
for each of these correlations.
(15 marks)

Question 2 (60 marks)

Objectives:
● Understand the dataset with a data scientist's mindset
● Design computation logic and routines in Python
● Conduct visualization in an appropriate way
● Assess the design and use of database ORM / SQLite methods to perform extract, load,
transformation and calculation operations

The Mass Rapid Transit (MRT) exits dataset, obtained via Singapore's open data portal
(https://beta.data.gov.sg/datasets/367/view), offers MRT exit locations within the country. This
spatial dataset, providing data on exit coordinates and associated metadata, is instrumental in
geographic-based analysis such as the calculation of distance metrics. Harnessing this data source
facilitates a deeper understanding of the impact of public transportation infrastructure on various
urban phenomena, such as residential property resale prices.

(a) Use the `geopandas` and `contextily` libraries to visualize MRT exits based on the contents
of the GeoJSON file named `mrt-exits.geojson`.
(5 marks)
(b) Perform the following tasks:
● Extract the longitude and latitude values from the `geometry` field and create two
new columns in the GeoPandas DataFrame.
● Use `KMeans` (https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html)
clustering from the `sklearn` library to identify `5` clusters of these MRT exits based on their
geographical coordinates.
● Create a plot visualizing these clusters with different colors and add the map of
Singapore as the background using `geopandas` and `contextily`.
(8 marks)
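
The coordinate extraction in the first task of part (b) can be illustrated without any geospatial library. The sketch below uses a hypothetical two-feature GeoJSON string in place of `mrt-exits.geojson` and assumes every exit is stored as a `Point`; the function name `exit_coordinates` is illustrative.

```python
import json

# Hypothetical two-feature stand-in for the contents of mrt-exits.geojson.
geojson_text = """
{"type": "FeatureCollection", "features": [
  {"type": "Feature", "properties": {},
   "geometry": {"type": "Point", "coordinates": [103.85, 1.29]}},
  {"type": "Feature", "properties": {},
   "geometry": {"type": "Point", "coordinates": [103.95, 1.35, 0.0]}}
]}
"""


def exit_coordinates(collection):
    """Return (longitude, latitude) pairs for every Point feature."""
    pairs = []
    for feature in collection["features"]:
        geometry = feature["geometry"]
        if geometry["type"] == "Point":
            # GeoJSON positions are [longitude, latitude, optional altitude].
            longitude, latitude = geometry["coordinates"][:2]
            pairs.append((longitude, latitude))
    return pairs


coords = exit_coordinates(json.loads(geojson_text))
print(coords)  # [(103.85, 1.29), (103.95, 1.35)]
```

With a GeoPandas DataFrame `gdf` of point geometries, the same values are available as `gdf.geometry.x` (longitude) and `gdf.geometry.y` (latitude).
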
(c) Perform the following tasks:
● Map each cluster of MRT exits to one of the five main regions of Singapore: Central
Region, East Region, North Region, North-East Region, and West Region.
● Update the GeoPandas DataFrame by adding a new column `region` representing
the region to which each MRT exit belongs.
(5 marks)
(d) Calculate the number of MRT exits for each region using three different methods:
1) Utilize the pandas DataFrame.
2) Leverage the sqlite3 library.
3) Employ the SQLAlchemy ORM approach: first define a Python class
representing the MRT exits (`longitude`, `latitude`, `region`), then use this
class to insert the data into a SQLite database and execute a query to get the
number of exits for each region.
(9 marks)
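
As an illustration of the second method in part (d), the sketch below counts exits per region with the standard-library `sqlite3` module. The table name `mrt_exits` and the three sample rows are hypothetical stand-ins for the real data.

```python
import sqlite3

# Hypothetical rows (longitude, latitude, region) standing in for the real exits.
exits = [
    (103.85, 1.29, "Central Region"),
    (103.84, 1.30, "Central Region"),
    (103.95, 1.35, "East Region"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mrt_exits (longitude REAL, latitude REAL, region TEXT)")
conn.executemany("INSERT INTO mrt_exits VALUES (?, ?, ?)", exits)

# One row per region with the number of exits it contains.
counts = conn.execute(
    "SELECT region, COUNT(*) FROM mrt_exits GROUP BY region ORDER BY region"
).fetchall()
conn.close()
print(counts)  # [('Central Region', 2), ('East Region', 1)]
```

The pandas equivalent of this query would be along the lines of `df.groupby("region").size()`, which makes it easy to cross-check the three methods against each other.
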
(e) Perform the following tasks:
● Draw a random sample of 100 transacted flats from Question 1 with the random
seed set to 0.
● Utilize the `geopy` library's `Nominatim`
(https://geopy.readthedocs.io/en/stable/#geopy.geocoders.Nominatim.geocode) or `GoogleV3`
(https://geopy.readthedocs.io/en/stable/#geopy.geocoders.GoogleV3.geocode) geocoder to obtain
the longitude and latitude data for the sampled transacted flats.
(5 marks)

(f) Perform the following tasks:
● Incorporate the `data/addresses.csv` data, which contains address information, into
the dataset of transacted flats from Question 1, excluding any flats that don't have
their addresses in `addresses.csv`.
● Using the haversine formula
(https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.haversine_distances.html),
compute the distance in kilometres to the closest MRT exit for each
flat and add this data under a new column, `nearest_mrt_distance`.
● Incorporate the data from the `data/town_to_region_mapping.json` file to introduce
a new column named `region` into the DataFrame. (Note: Disregard the `region`
column present in the `addresses.csv` file during this process.)
● Based on your visualizations and data analyses, articulate two key conclusions.
(13 marks)
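
The nearest-exit distance in part (f) can be sketched in pure Python as follows. The function names and the mean Earth radius of 6371 km are assumptions of this sketch, and the brute-force minimum is only meant to show the idea, not to be efficient on the full dataset.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_mrt_distance(flat, exits):
    """Smallest haversine distance from one (lat, lon) flat to any (lat, lon) exit."""
    return min(haversine_km(flat[0], flat[1], lat, lon) for lat, lon in exits)
```

When using `sklearn.metrics.pairwise.haversine_distances` instead, note that it expects `[latitude, longitude]` pairs in radians and returns distances on a unit sphere, so the result has to be multiplied by the Earth's radius to obtain kilometres.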

(g) Perform the following tasks:
● Formulate a scatter plot to depict the correlation between the resale prices of flats
and their haversine
(https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.haversine_distances.html)
distances to the Central Business District.
● Incorporate additional dimensions into your plot: the year of the transaction
(specifically 2015, 2020, and 2023) and the region of the flat's location.
● Use distinct color codes to denote different regions.
● Also, display the town of each transaction as individual data points on the plot.
● Interpret the plot and articulate any insights or patterns you notice, explaining their
significance or implications.
(10 marks)

(h) Use an SQLite query to determine the three towns in each region that have the highest average
resale price for '5 ROOM' flats transacted within the first half of 2023.
(5 marks)
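
A "top three per group" query of the kind asked for in part (h) can be sketched with a window function. The table name `flats`, its column layout, and the sample rows are hypothetical; the sketch also assumes `month` is stored as text beginning with `YYYY-MM` and that the bundled SQLite is version 3.25 or later, which is when window functions became available.

```python
import sqlite3

# Hypothetical sample rows: (town, region, flat_type, month, resale_price).
rows = [
    ("TOWN A", "West Region", "5 ROOM", "2023-01", 600000),
    ("TOWN A", "West Region", "5 ROOM", "2023-03", 620000),
    ("TOWN B", "West Region", "5 ROOM", "2023-02", 700000),
    ("TOWN C", "West Region", "5 ROOM", "2023-05", 650000),
    ("TOWN D", "West Region", "5 ROOM", "2023-06", 500000),
    ("TOWN D", "West Region", "4 ROOM", "2023-06", 900000),  # other flat type
    ("TOWN B", "West Region", "5 ROOM", "2023-09", 990000),  # second half of 2023
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE flats (town TEXT, region TEXT, flat_type TEXT, "
    "month TEXT, resale_price REAL)"
)
conn.executemany("INSERT INTO flats VALUES (?, ?, ?, ?, ?)", rows)

query = """
WITH town_avg AS (
    SELECT region, town, AVG(resale_price) AS avg_price
    FROM flats
    WHERE flat_type = '5 ROOM' AND month >= '2023-01' AND month < '2023-07'
    GROUP BY region, town
),
ranked AS (
    SELECT region, town, avg_price,
           ROW_NUMBER() OVER (
               PARTITION BY region ORDER BY avg_price DESC
           ) AS rank_in_region
    FROM town_avg
)
SELECT region, town, avg_price
FROM ranked
WHERE rank_in_region <= 3
ORDER BY region, rank_in_region
"""
top_towns = conn.execute(query).fetchall()
conn.close()
print(top_towns)
```

`ROW_NUMBER()` breaks ties arbitrarily; if tied towns should all be reported, `RANK()` is an alternative worth considering.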
