School of Computing and Creative Media XBIS 2023 Data Science Assignment Report
2.0 Hypothesis
COVID-19 data originating from the outbreak in Wuhan, China can be analysed and used to build models that predict the spread and outcomes of the disease as accurately as possible.
4.0 Methodology
4.1 Programming Language
4.1.1 Python
Python is a widely used, general-purpose, high-level programming language. It was created by Guido van Rossum, first released in 1991, and is now developed by the Python Software Foundation. It was designed with an emphasis on code readability, and its syntax allows programmers to express concepts in fewer lines of code.
4.1.2 Why use Python?
Easy to use and consistent
Python is a high-level, interpreted, general-purpose dynamic programming language that focuses on code readability. Its syntax lets programmers accomplish tasks in fewer steps than Java or C++. Python is widely used in large organizations because it supports multiple programming paradigms.
5.0 Libraries
5.1 Pandas
Pandas is an open-source Python library providing high-performance data manipulation and analysis tools built on powerful data structures. The name Pandas is derived from the term panel data, used in econometrics for multidimensional data. In 2008, developer Wes McKinney started building Pandas when he needed a high-performance, flexible tool for data analysis. Prior to Pandas, Python was mainly used for data munging and preparation and contributed little to the analysis itself; Pandas solved this problem. Using Pandas, we can accomplish the five typical steps in the processing and analysis of data, regardless of its origin: load, prepare, manipulate, model, and analyze. Python with Pandas is used in a wide range of academic and commercial domains, including finance, economics, statistics, and analytics.
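As an illustration of the load, prepare, manipulate, and analyze steps described above, the minimal sketch below reads a COVID-19 dataset with Pandas. The file name covid19_data.csv and the column names are hypothetical placeholders, not the actual names used in this assignment.

    import pandas as pd

    # Hypothetical file and column names; the real dataset may differ.
    df = pd.read_csv("covid19_data.csv")

    # Prepare: drop rows with missing values in the columns of interest.
    df = df.dropna(subset=["Confirmed", "Deaths", "Recovered"])

    # Manipulate: add a simple derived column (case fatality ratio).
    df["FatalityRatio"] = df["Deaths"] / df["Confirmed"]

    # Analyze: inspect the first rows and summary statistics.
    print(df.head())
    print(df.describe())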
5.2 Numpy
NumPy is a Python package whose name stands for 'Numerical Python'. It is a library consisting of multidimensional array objects and a collection of routines for processing those arrays. Numeric, the ancestor of NumPy, was developed by Jim Hugunin; another package, Numarray, was also developed with some additional functionality. In 2005, Travis Oliphant created the NumPy package by incorporating the features of Numarray into Numeric, and there are now many contributors to this open-source project. Using NumPy, a developer can perform mathematical and logical operations on arrays, Fourier transforms, shape manipulation, and linear algebra operations.
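As a small sketch of these operations, using synthetic numbers rather than the assignment data:

    import numpy as np

    # Element-wise mathematical and logical operations on arrays.
    cases = np.array([100, 250, 400, 650])
    deaths = np.array([2, 5, 9, 15])
    ratio = deaths / cases          # element-wise division
    high = cases > 300              # boolean mask

    # Shape manipulation and a basic linear algebra routine.
    m = cases.reshape(2, 2)
    print(ratio, high, np.linalg.det(m))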
5.3 Seaborn
Seaborn is a library for making statistical graphics in Python. It is built on top of Matplotlib and closely integrated with Pandas data structures. It provides a dataset-oriented API for examining relationships between multiple variables, specialized support for using categorical variables to show observations or aggregate statistics, options for visualizing univariate or bivariate distributions and for comparing them between subsets of data, automatic estimation and plotting of linear regression models for different kinds of dependent variables, and convenient views onto the overall structure of complex datasets.
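A minimal sketch of a Seaborn regression plot; the small DataFrame below is synthetic and only illustrates the call, since the assignment's real data comes from the COVID-19 dataset:

    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    # Synthetic example frame with two numeric variables.
    df = pd.DataFrame({"Confirmed": [100, 250, 400, 650, 900],
                       "Deaths": [2, 5, 9, 15, 22]})

    # Scatter plot of the two variables with a fitted regression line.
    sns.regplot(x="Confirmed", y="Deaths", data=df)
    plt.show()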
5.4 Matplotlib
Matplotlib is a powerful visualization library in Python for 2D plots of arrays. It is a multi-platform data visualization library built on NumPy arrays and designed to work with the broader SciPy stack, and it was introduced by John Hunter in 2002. One of the greatest benefits of visualization is that it gives us visual access to huge amounts of data in easily digestible form. Matplotlib supports several kinds of plots, such as line, bar, scatter, and histogram plots.
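A short sketch of these plot types, using synthetic values purely for illustration:

    import matplotlib.pyplot as plt

    # Synthetic daily counts, not taken from the assignment dataset.
    days = [1, 2, 3, 4, 5]
    new_cases = [10, 25, 40, 80, 120]

    plt.plot(days, new_cases, marker="o")   # line plot
    plt.bar(days, new_cases, alpha=0.3)     # bar plot on the same axes
    plt.xlabel("Day")
    plt.ylabel("New cases")
    plt.title("Example Matplotlib plot")
    plt.show()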
5.5 Statsmodels
Statsmodels is a Python module that provides classes and functions for estimating many different statistical models, as well as for conducting statistical tests and statistical data exploration. An extensive list of result statistics is available for each estimator, and the results are tested against existing statistical packages to ensure that they are correct.
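A minimal sketch of fitting an ordinary least squares (OLS) model with statsmodels, using synthetic data; the assignment's own OLS results appear in the figures later in this report:

    import numpy as np
    import statsmodels.api as sm

    # Synthetic data in which y is roughly linear in x.
    x = np.array([100, 250, 400, 650, 900], dtype=float)
    y = np.array([2, 5, 9, 15, 22], dtype=float)

    X = sm.add_constant(x)        # adds the intercept term
    results = sm.OLS(y, X).fit()
    print(results.summary())      # R-squared, coefficients, P>|t|, etc.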
5.6 Linear Regression
Linear regression is a basic predictive analytics technique that uses historical data to
predict an output variable. It is popular for predictive modelling because it is easily
understood and can be explained using plain English.
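In its simplest single-variable form, the fitted model can be written as Y = b0 + b1*X, where b0 is the intercept and b1 is the regression coefficient (slope); this is the same form as the Y = a + bX equations reported in the figures below.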
5.7 R2-Score
The coefficient of determination is the proportion of the variance in the dependent variable
that is predictable from the independent variables. It is used in the context of statistical
models whose main purpose is either the prediction of future outcomes or the testing
of hypotheses, on the basis of other related information. It provides a measure of how well
observed outcomes are replicated by the model, based on the proportion of total variation of
outcomes explained by the model.
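In formula terms, R2 = 1 - (SS_res / SS_tot), where SS_res is the sum of squared residuals and SS_tot is the total sum of squares of the dependent variable; a value close to 1 means the model explains most of the variation in the outcomes.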
6.0 Train and Test Data for COVID-19
As our first model, we use a train/test split to predict the outcome of COVID-19. Train/test is a method for measuring the accuracy of a model: we split the data into two sets, a training set and a test set, using 80% of the data for training and 20% for testing.
Figure 5: The code used to train and evaluate the train/test model. The last line of output shows an MSE of 0.17695; a smaller MSE is better, since it implies closer agreement between the predictions and the observed values.
[Figures 2 to 5]
Figure 8: Scatter plot with the X-axis (confirmed cases) and the Y-axis (total deaths).
Figure 10: The linear regression line of X and Y. The fitted linear model is Y = 109.007399 + 0.046859X.
Figure 11: We import statsmodels.api as sm in order to obtain the regression results. Looking at the R-squared value, about 86% of the variation in total deaths is explained by confirmed cases; if we added more factors, such as comparing confirmed cases with recovered cases, the R-squared value would change. When the P>|t| value is close to 0, the relationship between confirmed cases and total deaths is very strong. Here the P>|t| value is 0.733, which is greater than the 0.05 significance level, so the result is not significant: we fail to reject the null hypothesis and cannot accept the alternative hypothesis. Nevertheless, the R-squared of 86% indicates that the regression model fits the data well, so X and Y can be considered to have a strong relationship in the COVID-19 data.
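The R-squared and P>|t| values discussed for Figure 11 would typically be read from a fitted statsmodels model, for example as in the sketch below; the file and column names are hypothetical placeholders for confirmed cases and total deaths.

    import pandas as pd
    import statsmodels.api as sm

    # Hypothetical file and column names.
    df = pd.read_csv("covid19_data.csv")
    X = sm.add_constant(df["Confirmed"])
    y = df["Deaths"]

    results = sm.OLS(y, X).fit()
    print(results.summary())     # the full OLS Regression Results table
    print(results.rsquared)      # the R-squared value discussed above
    print(results.pvalues)       # the P>|t| values for each coefficient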
[Figures 6 to 11]
8.0 Confirmed Case with Recovered Case
We use confirmed cases and recovered cases to predict COVID-19 outcomes. The result can then be compared with the confirmed cases and total deaths model, so that we can see which of these two pairings is influenced most strongly by COVID-19.
Figure 12: The libraries we include for the prediction.
Figure 14: The prediction graph of confirmed cases against recovered cases.
Figure 16: We add the linear regression to the model, which generates the straight blue line in the graph. The fitted linear model is Y = 109.0073993 + 0.046859X. Comparing the graphs in figure 16 and figure 10 above, the linear regression lines are different.
Figure 17: Looking at the R-squared value, about 82% of the variation in recovered cases is explained by confirmed cases. The P>|t| value shown in figure 17 is 0.011, which is close to 0 and less than 0.05, so the result is statistically significant and supports the alternative hypothesis: there is a strong relationship between X (confirmed cases) and Y (recovered cases). R-squared reflects the fit of the model; its values range from 0 to 1, and a higher value generally indicates a better fit, so an R-squared of 82% means the regression model fits the data well.
[Figures 12 to 17]
9.0 Different method to get Confirmed Case and Total Death prediction
We use the same confirmed cases and total deaths from our dataset to predict the outcome of COVID-19 through a different method of obtaining the linear regression line.
Figure 18: The libraries we include for our prediction and the code used to show the values in the dataset.
Figure 19: The prediction graph with the X-axis (confirmed cases) and the Y-axis (total deaths), together with the linear regression line.
Figure 20: We then import the necessary libraries to obtain the intercept of the linear regression line, which is 60.69577885, and the regression coefficient, which is 0.04628304.
The output generated with this method is the same as with the previous method, so for comparison purposes it is better to use the previous method, as it generates an OLS Regression Results table that provides more detailed information about the prediction.
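A minimal sketch of this alternative approach, using scikit-learn's LinearRegression to obtain the intercept and coefficient directly; the file and column names are hypothetical placeholders, and the actual code is shown in Figures 18 to 20.

    import pandas as pd
    from sklearn.linear_model import LinearRegression

    # Hypothetical file and column names.
    df = pd.read_csv("covid19_data.csv")
    model = LinearRegression().fit(df[["Confirmed"]], df["Deaths"])

    print(model.intercept_)   # intercept of the regression line
    print(model.coef_)        # regression coefficient (slope)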
[Figures 18 to 20]
10.0 New Cases with New Deaths & New Deaths with New Recovered
We use both new cases with new deaths and new deaths with new recovered to predict COVID-19 outcomes, so that we can see which of these two pairings is influenced most strongly by COVID-19.
Figure 21: The libraries we include for our prediction and the code used to show the values in the dataset.
Figure 22: The prediction graph with the X-axis (new cases) and the Y-axis (new deaths), together with the linear regression line.
Figure 23: The prediction graph with the X-axis (new deaths) and the Y-axis (new recovered), together with the linear regression line.
Figure 24: We then import the necessary libraries to obtain, for the first prediction (new cases with new deaths), the intercept of the linear regression line, which is 1.14704734, and the regression coefficient, which is 0.03171409; this regression coefficient is statistically significant because its value is smaller than the usual significance level. For new deaths with new recovered, the intercept of the linear regression line is 95.21543898 and the regression coefficient is 0.57106593; this regression coefficient is not statistically significant because its value is greater than the usual significance level.
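A minimal sketch of fitting both pairings in one loop; the file and column names are hypothetical placeholders for the new cases, new deaths, and new recovered columns, and the actual code is shown in Figures 21 to 24.

    import pandas as pd
    from sklearn.linear_model import LinearRegression

    # Hypothetical file and column names.
    df = pd.read_csv("covid19_data.csv")
    pairs = [("NewCases", "NewDeaths"), ("NewDeaths", "NewRecovered")]

    for x_col, y_col in pairs:
        model = LinearRegression().fit(df[[x_col]], df[y_col])
        print(x_col, "->", y_col,
              "intercept:", model.intercept_,
              "coefficient:", model.coef_[0])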
[Figures 21 to 24]