Data Programming

The document provides details on two data programming questions. Question 1 involves loading, cleaning, manipulating, and analyzing a dataset on HDB flat transactions to identify factors influencing resale prices. Question 2 focuses on analyzing an MRT exit dataset, including visualizing exit locations and clusters, calculating exit counts by region, geocoding flat locations, and correlating flat prices with distance to MRT exits and the CBD over time.

Uploaded by

I211381 Eeman Ijaz

Data Programming

Answer all questions. (Total 100 marks)

Question 1 (40 marks)

Objectives:
● Understand the dataset with a data scientist's mindset.
● Understand and design computation logic and routines in Python.
● Assess the use of pure Python and Python data structures to perform extract, load, and
transformation operations.
● Assess the use of Pandas DataFrames to perform extract, load, transformation, and calculation
operations.
● Structure code with appropriate functions, loops, and conditions.
● Conduct visualization in an appropriate way.

The dataset in question provides a rich overview of Housing and Development Board (HDB) flat
transactions in Singapore. It is derived from the national database managed by Singapore's open
data initiative.

The data captured includes vital information such as the resale price, flat type, address, lease
commencement date, and floor area, among other details. These elements allow for robust analysis
of aspects such as price trends and geographical price disparities. For more information, refer
to [Link].

Additionally, this dataset provides an invaluable resource for understanding the evolution of
Singapore's public housing landscape, the preferences of the populace, and market dynamics over
time. As such, it is an essential tool for policy makers, real estate professionals, urban planners,
and researchers studying Singapore's unique public housing model.

By addressing the given tasks, you will gain data analysis competencies, including data
preprocessing and manipulation, which are fundamental for preparing and managing datasets.
Additionally, you will enhance your ability to comprehend data relationships by creating data
visualizations and carrying out correlation analysis.

(a) Load all CSV files containing transacted flats in a given `data` directory and merge them
all into a single Pandas DataFrame. Drop the `remaining_lease` column from the merged
DataFrame. Are there any columns that contain null values or empty strings?
(6 marks)
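
One possible shape for part (a) is sketched below. The function names `load_transactions` and `missing_value_report` are hypothetical, and the sketch assumes that every file in the directory matches `*.csv` and that the yearly files share the same column names apart from `remaining_lease`.

```python
from pathlib import Path

import pandas as pd


def load_transactions(data_dir="data"):
    """Read every CSV in data_dir and stack the rows into one DataFrame."""
    csv_paths = sorted(Path(data_dir).glob("*.csv"))
    if not csv_paths:
        raise FileNotFoundError(f"no CSV files found in {data_dir!r}")
    frames = [pd.read_csv(path) for path in csv_paths]
    merged = pd.concat(frames, ignore_index=True)
    # Assumption: some files may lack this column, hence errors="ignore".
    return merged.drop(columns=["remaining_lease"], errors="ignore")


def missing_value_report(df):
    """Count nulls and empty or whitespace-only strings in each column."""
    nulls = df.isna().sum()
    empties = pd.Series(
        {
            name: int(col.astype(str).str.strip().eq("").sum())
            for name, col in df.items()
        }
    )
    return pd.DataFrame({"nulls": nulls, "empty_strings": empties})
```

Calling `missing_value_report(load_transactions("data"))` would then give one row per column, which makes it easy to see which columns need cleaning.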

(b) Convert the `month` column to date-time format. Design a visualization to analyse the
`month` column by considering it as a numeric date-time and share insights.
(4 marks)
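
As a starting point for part (b), the sketch below converts `month` and counts transactions per month. The sample rows are invented for illustration, and the `"%Y-%m"` format is an assumption about how the real column is encoded.

```python
import pandas as pd

# Illustrative rows only; the real `month` values are assumed to look like "2023-01".
df = pd.DataFrame({"month": ["2015-03", "2020-11", "2023-01", "2023-01"]})

# Parse the strings into timestamps anchored at the first day of each month.
df["month"] = pd.to_datetime(df["month"], format="%Y-%m")

# Number of transactions per calendar month, ready for a line or bar plot.
monthly_counts = df.groupby("month").size()
```

A Series like `monthly_counts` can be plotted directly to show how transaction volume changes over time.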

(c) The column `storey_range` is in the format "lower TO upper" (e.g. 1 TO 3). Compute a
new column called `storey_level` by calculating the average of the lower and upper storey
values. Drop the `storey_range` column from the DataFrame.
(5 marks)
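
The midpoint calculation in part (c) can be sketched as follows. The sample values are made up, and the regular expression assumes that every entry follows the "lower TO upper" pattern.

```python
import pandas as pd

# Illustrative values in the "lower TO upper" format described above.
df = pd.DataFrame({"storey_range": ["01 TO 03", "10 TO 12", "04 TO 06"]})

# Capture the two numbers, convert them to integers, and average them row-wise.
bounds = df["storey_range"].str.extract(r"(\d+)\s+TO\s+(\d+)").astype(int)
df["storey_level"] = bounds.mean(axis=1)

# The original text column is no longer needed.
df = df.drop(columns=["storey_range"])
```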

(d) Identify inconsistent `flat_model` and `flat_type` values and perform the standardization
of the values.
(4 marks)
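
A minimal sketch of one way to approach part (d): normalize case and whitespace first, then map any remaining spelling variants to a single value. The sample values and the replacement mapping are illustrative assumptions, and `standardize` is a hypothetical helper. The real inconsistencies should be found by inspecting `df["flat_type"].unique()` and `df["flat_model"].unique()`.

```python
import pandas as pd

# Illustrative values; inspect the real columns to find the actual variants.
df = pd.DataFrame({
    "flat_type": ["MULTI GENERATION", "MULTI-GENERATION", "4 ROOM "],
    "flat_model": ["Improved", "IMPROVED", "New  Generation"],
})


def standardize(series, replacements=None):
    """Trim, upper-case, collapse repeated spaces, then apply explicit fixes."""
    cleaned = series.str.strip().str.upper().str.replace(r"\s+", " ", regex=True)
    if replacements:
        cleaned = cleaned.replace(replacements)
    return cleaned


df["flat_type"] = standardize(
    df["flat_type"], {"MULTI GENERATION": "MULTI-GENERATION"}
)
df["flat_model"] = standardize(df["flat_model"])
```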

(e) Perform the following visualizations:

(i) Plot a histogram of `resale_price` to understand its distribution. Is it normally
distributed or skewed?
(ii). Generate a boxplot for the `floor_area_sqm` column. Are there any values that lie
outside the expected range? If outliers are present, please provide an explanation
for their occurrence.
(6 marks)
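
To back up the boxplot in (e)(ii) with numbers, the sketch below applies the usual 1.5 × IQR whisker rule to list the values that a boxplot would draw as outliers. The floor areas are invented for illustration, and `iqr_outliers` is a hypothetical helper.

```python
import pandas as pd


def iqr_outliers(series, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]


# Invented floor areas in square metres, including one unusually large flat.
floor_area = pd.Series([67, 68, 70, 73, 82, 90, 92, 93, 104, 110, 280])
outliers = iqr_outliers(floor_area)
```

Listing the flagged rows alongside their `flat_type` and `flat_model` is one way to judge whether they are data errors or genuinely unusual flats.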

(f) Design an analysis to identify FIVE (5) factors that influence the resale price, and offer a
rationale for each of these correlations.
(15 marks)

Question 2 (60 marks)

Objectives:
● Understand the dataset with a data scientist's mindset
● Design computation logic and routines in Python
● Conduct visualization in an appropriate way
● Assess the design and use of database ORM / SQLite methods to perform extract, load,
transformation and calculation operations

The Mass Rapid Transit (MRT) exits dataset, obtained via Singapore's open data portal
([Link]), offers MRT exit locations within the country. This
spatial dataset, providing data on exit coordinates and associated metadata, is instrumental in
geographic-based analysis such as the calculation of distance metrics. Harnessing this data source
facilitates a deeper understanding of the impact of public transportation infrastructure on various
urban phenomena, such as residential property resale prices.

(a) Use the `geopandas` and `contextily` libraries to visualize MRT exits based on the contents
of the GeoJSON file named `[Link]`.
(5 marks)
(b) Perform the following tasks:
● Extract the longitude and latitude values from the `geometry` field and create two
new columns in the GeoPandas DataFrame.
● Use `KMeans` ([Link]
[Link]/stable/modules/generated/[Link]) clustering from
the `sklearn` library to identify `5` clusters of these MRT exits based on their
geographical coordinates.
● Create a plot visualizing these clusters with different colors and add the map of
Singapore as the background using `geopandas` and `contextily`.
(8 marks)
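
The clustering step in part (b) might look like the sketch below. It uses synthetic coordinates rather than the real exits, so the centers and spread are assumptions made purely for illustration. With the actual GeoDataFrame, the two columns would typically come from `gdf.geometry.x` (longitude) and `gdf.geometry.y` (latitude), assuming the geometries are points.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic (longitude, latitude) points scattered around five made-up centers.
rng = np.random.default_rng(0)
centers = np.array([
    [103.70, 1.34],
    [103.85, 1.29],
    [103.95, 1.35],
    [103.80, 1.43],
    [103.89, 1.38],
])
coords = np.vstack([c + rng.normal(scale=0.005, size=(20, 2)) for c in centers])

# Fix random_state so the cluster labels are reproducible between runs.
model = KMeans(n_clusters=5, n_init=10, random_state=0)
labels = model.fit_predict(coords)
```

The `labels` array can be stored as a new column and passed to the plotting call as the color key. Note that the cluster numbers are arbitrary, which is why part (c) asks for an explicit mapping to regions.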
(c) Perform the following tasks:
● Map each cluster of MRT exits to one of the five main regions of Singapore: Central
Region, East Region, North Region, North-East Region, and West Region.
● Update the GeoPandas DataFrame by adding a new column `region` representing
the region to which each MRT exit belongs.
(5 marks)
(d) Calculate the number of MRT exits for each region using three different methods:
1) Utilize the pandas DataFrame.
2) Leverage the sqlite3 library.
3) Employ the SQLAlchemy ORM approach: first define a Python class
representing the MRT exits (`longitude`, `latitude`, `region`), then use this
class to insert the data into a SQLite database and execute a query to get the
number of exits for each region.
(9 marks)
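
For the second method in part (d), a minimal `sqlite3` sketch is shown below. The table name `mrt_exits` and the sample rows are hypothetical. The pandas equivalent of the same count is `df.groupby("region").size()`.

```python
import sqlite3

# Hypothetical rows: (longitude, latitude, region).
exits = [
    (103.8520, 1.3000, "Central Region"),
    (103.8450, 1.2840, "Central Region"),
    (103.9490, 1.3530, "East Region"),
    (103.7060, 1.3390, "West Region"),
]

# An in-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mrt_exits (longitude REAL, latitude REAL, region TEXT)"
)
conn.executemany("INSERT INTO mrt_exits VALUES (?, ?, ?)", exits)

# One row per region with the number of exits in it.
counts = conn.execute(
    "SELECT region, COUNT(*) FROM mrt_exits GROUP BY region ORDER BY region"
).fetchall()
conn.close()
```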
(e) Perform the following tasks:
● Draw a random sample of 100 transacted flats from Question 1 with the random
seed set to 0.
● Utilize the `geopy` library's `Nominatim` or `GoogleV3` geocoder to obtain the
longitude and latitude data for the sampled transacted flats.
● Nominatim ([Link])
● GoogleV3 ([Link])
(5 marks)

(f) Perform the following tasks:

● Incorporate the `data/[Link]` data, which contains address information, into
the dataset of transacted flats from Question 1, excluding any flats that don't have
their addresses in `[Link]`.
● Using the haversine ([Link]
[Link]/stable/modules/generated/[Link].haversine_distances.
html) formula, compute the distance in kilometers to the closest MRT exit for each
flat and add this data under a new column, `nearest_mrt_distance`.
● Incorporate the data from the `data/town_to_region_mapping.json` file to introduce
a new column named `region` into the DataFrame. (Note: Disregard the `region`
column present in the `[Link]` file during this process.)
● Based on your visualizations and data analyses, articulate two key conclusions.
(13 marks)
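
The distance step in part (f) can be illustrated with a plain-Python haversine function, shown here as a sketch rather than the required implementation. The helper names and the Earth radius of 6371 km are assumptions. If `sklearn`'s `haversine_distances` is used instead, note that it expects latitude and longitude in radians and returns distances on a unit sphere, so the result must also be multiplied by the Earth's radius.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (
        sin((lat2 - lat1) / 2) ** 2
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def nearest_mrt_distance(flat, exits):
    """Smallest distance from one (lat, lon) flat to any (lat, lon) exit."""
    return min(haversine_km(flat[0], flat[1], lat, lon) for lat, lon in exits)
```

Applying `nearest_mrt_distance` row by row is simple but slow for large datasets; a vectorized distance matrix is a reasonable alternative once the logic has been checked on a small sample.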

(g) Perform the following tasks:

● Formulate a scatter plot to depict the correlation between the resale prices of flats
and their haversine ([Link]
[Link]/stable/modules/generated/[Link].haversine_distances.
html) distances to the Central Business District.
● Incorporate additional dimensions into your plot: the year of the transaction
(specifically 2015, 2020, and 2023) and the region of the flat's location.
● Use distinct color codes to denote different regions.
● Also, display the town of each transaction as individual data points on the plot.
● Interpret the plot and articulate any insights or patterns you notice, explaining their
significance or implications.
(10 marks)

(h) Use an SQLite query to determine the three towns in each region that have the highest average
resale price for '5 ROOM' flats transacted within the first half of 2023.
(5 marks)
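
One way to express the "top three per region" requirement in part (h) is a window function, sketched below. The table name `flats`, the placeholder towns, and the prices are invented. The sketch also assumes that `month` is stored as `YYYY-MM` text and that the SQLite build supports window functions (version 3.25 or later).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE flats ("
    "region TEXT, town TEXT, flat_type TEXT, month TEXT, resale_price REAL)"
)
# Invented rows; TOWN A to TOWN E are placeholders, not real towns.
rows = [
    ("Central Region", "TOWN A", "5 ROOM", "2023-01", 880000),
    ("Central Region", "TOWN A", "5 ROOM", "2023-05", 920000),
    ("Central Region", "TOWN B", "5 ROOM", "2023-02", 800000),
    ("Central Region", "TOWN C", "5 ROOM", "2023-06", 700000),
    ("Central Region", "TOWN D", "5 ROOM", "2023-03", 600000),
    ("Central Region", "TOWN D", "5 ROOM", "2023-07", 2000000),  # second half
    ("Central Region", "TOWN D", "4 ROOM", "2023-01", 3000000),  # other type
    ("East Region", "TOWN E", "5 ROOM", "2023-04", 650000),
]
conn.executemany("INSERT INTO flats VALUES (?, ?, ?, ?, ?)", rows)

query = """
WITH town_avg AS (
    SELECT region, town, AVG(resale_price) AS avg_price
    FROM flats
    WHERE flat_type = '5 ROOM' AND month BETWEEN '2023-01' AND '2023-06'
    GROUP BY region, town
),
ranked AS (
    SELECT region, town, avg_price,
           ROW_NUMBER() OVER (
               PARTITION BY region ORDER BY avg_price DESC
           ) AS rn
    FROM town_avg
)
SELECT region, town, avg_price
FROM ranked
WHERE rn <= 3
ORDER BY region, avg_price DESC
"""
top_towns = conn.execute(query).fetchall()
conn.close()
```

The first common table expression averages prices per town after filtering, and the second ranks towns within each region, so regions with fewer than three towns simply return fewer rows.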
