School of Computing and Creative Media XBIS 2023 Data Science Assignment Report
2.0 Hypothesis
COVID-19 data originating from the outbreak in Wuhan, China can be analysed and used to build models that predict the spread and outcomes of the disease as accurately as possible.
4.0 Methodology
4.1 Programming Language
4.1.1 Python
Python is a widely used, general-purpose, high-level programming language. It was created by Guido van Rossum, first released in 1991, and is now developed by the Python Software Foundation. It was designed with an emphasis on code readability, and its syntax allows programmers to express concepts in fewer lines of code.
4.1.2 Why use Python?
Easy to use and consistent
Python is a high-level, interpreted, general-purpose dynamic programming language that focuses on code readability. Its syntax lets programmers accomplish tasks in fewer steps than Java or C++. Python is widely used in large organizations because it supports multiple programming paradigms.
5.0 Libraries
5.1 Pandas
Pandas is an open-source Python library providing high-performance data manipulation and analysis tools built on powerful data structures. The name Pandas is derived from the term panel data, used in econometrics for multidimensional data. In 2008, developer Wes McKinney started building Pandas when he needed a high-performance, flexible tool for data analysis. Prior to Pandas, Python was mainly used for data munging and preparation and contributed little to the analysis itself; Pandas solved this problem. Using Pandas, we can accomplish the five typical steps in the processing and analysis of data, regardless of its origin: load, prepare, manipulate, model, and analyze. Python with Pandas is used in a wide range of academic and commercial domains, including finance, economics, statistics, and analytics.
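As an illustration of the load, prepare, manipulate, and analyze steps described above, the minimal sketch below reads a COVID-19 dataset with Pandas. The file name covid19_data.csv and the column names are hypothetical placeholders, not the actual names used in this assignment.

    import pandas as pd

    # Hypothetical file and column names; the real dataset may differ.
    df = pd.read_csv("covid19_data.csv")

    # Prepare: drop rows with missing values in the columns of interest.
    df = df.dropna(subset=["Confirmed", "Deaths", "Recovered"])

    # Manipulate: add a simple derived column (case fatality ratio).
    df["FatalityRatio"] = df["Deaths"] / df["Confirmed"]

    # Analyze: inspect the first rows and summary statistics.
    print(df.head())
    print(df.describe())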
5.2 Numpy
NumPy is a Python package whose name stands for 'Numerical Python'. It is a library consisting of multidimensional array objects and a collection of routines for processing those arrays. Numeric, the ancestor of NumPy, was developed by Jim Hugunin; another package, Numarray, was also developed with some additional functionality. In 2005, Travis Oliphant created the NumPy package by incorporating the features of Numarray into Numeric, and there are now many contributors to this open-source project. Using NumPy, a developer can perform mathematical and logical operations on arrays, Fourier transforms, shape manipulation, and linear algebra operations.
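As a small sketch of these operations, using synthetic numbers rather than the assignment data:

    import numpy as np

    # Element-wise mathematical and logical operations on arrays.
    cases = np.array([100, 250, 400, 650])
    deaths = np.array([2, 5, 9, 15])
    ratio = deaths / cases          # element-wise division
    high = cases > 300              # boolean mask

    # Shape manipulation and a basic linear algebra routine.
    m = cases.reshape(2, 2)
    print(ratio, high, np.linalg.det(m))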
5.3 Seaborn
Seaborn is a library for making statistical graphics in Python. It is built on top of Matplotlib and closely integrated with Pandas data structures. It provides a dataset-oriented API for examining relationships between multiple variables, specialized support for using categorical variables to show observations or aggregate statistics, options for visualizing univariate or bivariate distributions and for comparing them between subsets of data, automatic estimation and plotting of linear regression models for different kinds of dependent variables, and convenient views onto the overall structure of complex datasets.
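A minimal sketch of a Seaborn regression plot; the small DataFrame below is synthetic and only illustrates the call, since the assignment's real data comes from the COVID-19 dataset:

    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    # Synthetic example frame with two numeric variables.
    df = pd.DataFrame({"Confirmed": [100, 250, 400, 650, 900],
                       "Deaths": [2, 5, 9, 15, 22]})

    # Scatter plot of the two variables with a fitted regression line.
    sns.regplot(x="Confirmed", y="Deaths", data=df)
    plt.show()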
5.4 Matplotlib
Matplotlib is a powerful visualization library in Python for 2D plots of arrays. It is a multi-platform data visualization library built on NumPy arrays and designed to work with the broader SciPy stack, and it was introduced by John Hunter in 2002. One of the greatest benefits of visualization is that it gives us visual access to huge amounts of data in easily digestible form. Matplotlib supports several kinds of plots, such as line, bar, scatter, and histogram plots.
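A short sketch of these plot types, using synthetic values purely for illustration:

    import matplotlib.pyplot as plt

    # Synthetic daily counts, not taken from the assignment dataset.
    days = [1, 2, 3, 4, 5]
    new_cases = [10, 25, 40, 80, 120]

    plt.plot(days, new_cases, marker="o")   # line plot
    plt.bar(days, new_cases, alpha=0.3)     # bar plot on the same axes
    plt.xlabel("Day")
    plt.ylabel("New cases")
    plt.title("Example Matplotlib plot")
    plt.show()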
5.5 Statsmodels
Statsmodels is a Python module that provides classes and functions for estimating many different statistical models, as well as for conducting statistical tests and statistical data exploration. An extensive list of result statistics is available for each estimator, and the results are tested against existing statistical packages to ensure that they are correct.
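A minimal sketch of fitting an ordinary least squares (OLS) model with statsmodels, using synthetic data; the assignment's own OLS results appear in the figures later in this report:

    import numpy as np
    import statsmodels.api as sm

    # Synthetic data in which y is roughly linear in x.
    x = np.array([100, 250, 400, 650, 900], dtype=float)
    y = np.array([2, 5, 9, 15, 22], dtype=float)

    X = sm.add_constant(x)        # adds the intercept term
    results = sm.OLS(y, X).fit()
    print(results.summary())      # R-squared, coefficients, P>|t|, etc.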
5.6 Linear Regression
Linear regression is a basic predictive analytics technique that uses historical data to
predict an output variable. It is popular for predictive modelling because it is easily
understood and can be explained using plain English.
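In its simplest single-variable form, the fitted model can be written as Y = b0 + b1*X, where b0 is the intercept and b1 is the regression coefficient (slope); this is the same form as the Y = a + bX equations reported in the figures below.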
5.7 R2-Score
The coefficient of determination is the proportion of the variance in the dependent variable
that is predictable from the independent variables. It is used in the context of statistical
models whose main purpose is either the prediction of future outcomes or the testing
of hypotheses, on the basis of other related information. It provides a measure of how well
observed outcomes are replicated by the model, based on the proportion of total variation of
outcomes explained by the model.
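In formula terms, R2 = 1 - (SS_res / SS_tot), where SS_res is the sum of squared residuals and SS_tot is the total sum of squares of the dependent variable; a value close to 1 means the model explains most of the variation in the outcomes.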
6.0 Train and Test Data for COVID-19
As our first model, we use a train/test split to predict the outcome of COVID-19. Train/test is a method for measuring the accuracy of a model: we split the data into two sets, a training set and a test set, using 80% of the data for training and 20% for testing.
Figure 5: The code used to train and evaluate the train/test model. The last line of output shows an MSE of 0.17695; a smaller MSE is better, since it implies closer agreement between the predictions and the observed values.
[Figures 2 to 5]
Figure 8: Scatter plot with the X-axis (confirmed cases) and the Y-axis (total deaths).
Figure 10: The linear regression line of X and Y. The fitted linear model is Y = 109.007399 + 0.046859X.
Figure 11: We import statsmodels.api as sm in order to obtain the regression results. Looking at the R-squared value, about 86% of the variation in total deaths is explained by confirmed cases; if we added more factors, such as comparing confirmed cases with recovered cases, the R-squared value would change. When the P>|t| value is close to 0, the relationship between confirmed cases and total deaths is very strong. Here the P>|t| value is 0.733, which is greater than the 0.05 significance level, so the result is not significant: we fail to reject the null hypothesis and cannot accept the alternative hypothesis. Nevertheless, the R-squared of 86% indicates that the regression model fits the data well, so X and Y can be considered to have a strong relationship in the COVID-19 data.
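The R-squared and P>|t| values discussed for Figure 11 would typically be read from a fitted statsmodels model, for example as in the sketch below; the file and column names are hypothetical placeholders for confirmed cases and total deaths.

    import pandas as pd
    import statsmodels.api as sm

    # Hypothetical file and column names.
    df = pd.read_csv("covid19_data.csv")
    X = sm.add_constant(df["Confirmed"])
    y = df["Deaths"]

    results = sm.OLS(y, X).fit()
    print(results.summary())     # the full OLS Regression Results table
    print(results.rsquared)      # the R-squared value discussed above
    print(results.pvalues)       # the P>|t| values for each coefficient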
[Figures 6 to 11]
8.0 Confirmed Case with Recovered Case
We use confirmed cases and recovered cases to predict COVID-19 outcomes. The result can then be compared with the confirmed cases and total deaths model, so that we can see which of these two pairings is influenced most strongly by COVID-19.
Figure 12: The libraries we include for the prediction.
Figure 14: The prediction graph of confirmed cases against recovered cases.
Figure 16: We add the linear regression to the model, which generates the straight blue line in the graph. The fitted linear model is Y = 109.0073993 + 0.046859X. Comparing the graphs in figure 16 and figure 10 above, the linear regression lines are different.
Figure 17: Looking at the R-squared value, about 82% of the variation in recovered cases is explained by confirmed cases. The P>|t| value shown in figure 17 is 0.011, which is close to 0 and less than 0.05, so the result is statistically significant and supports the alternative hypothesis: there is a strong relationship between X (confirmed cases) and Y (recovered cases). R-squared reflects the fit of the model; its values range from 0 to 1, and a higher value generally indicates a better fit, so an R-squared of 82% means the regression model fits the data well.
[Figures 12 to 17]
9.0 Different method to get Confirmed Case and Total Death prediction
We use the same confirmed cases and total deaths from our dataset to predict the outcome of COVID-19 through a different method of obtaining the linear regression line.
Figure 18: The libraries we include for our prediction and the code used to show the values in the dataset.
Figure 19: The prediction graph with the X-axis (confirmed cases) and the Y-axis (total deaths), together with the linear regression line.
Figure 20: We then import the necessary libraries to obtain the intercept of the linear regression line, which is 60.69577885, and the regression coefficient, which is 0.04628304.
The output generated with this method is the same as with the previous method, so for comparison purposes it is better to use the previous method, as it generates an OLS Regression Results table that provides more detailed information about the prediction.
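A minimal sketch of this alternative approach, using scikit-learn's LinearRegression to obtain the intercept and coefficient directly; the file and column names are hypothetical placeholders, and the actual code is shown in Figures 18 to 20.

    import pandas as pd
    from sklearn.linear_model import LinearRegression

    # Hypothetical file and column names.
    df = pd.read_csv("covid19_data.csv")
    model = LinearRegression().fit(df[["Confirmed"]], df["Deaths"])

    print(model.intercept_)   # intercept of the regression line
    print(model.coef_)        # regression coefficient (slope)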
[Figures 18 to 20]
10.0 New Cases with New Deaths & New Deaths with New Recovered
We use both new cases with new deaths and new deaths with new recovered to predict COVID-19 outcomes, so that we can see which of these two pairings is influenced most strongly by COVID-19.
Figure 21: The libraries we include for our prediction and the code used to show the values in the dataset.
Figure 22: The prediction graph with the X-axis (new cases) and the Y-axis (new deaths), together with the linear regression line.
Figure 23: The prediction graph with the X-axis (new deaths) and the Y-axis (new recovered), together with the linear regression line.
Figure 24: We then import the necessary libraries to obtain, for the first prediction (new cases with new deaths), the intercept of the linear regression line, which is 1.14704734, and the regression coefficient, which is 0.03171409; this regression coefficient is statistically significant because its value is smaller than the usual significance level. For new deaths with new recovered, the intercept of the linear regression line is 95.21543898 and the regression coefficient is 0.57106593; this regression coefficient is not statistically significant because its value is greater than the usual significance level.
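A minimal sketch of fitting both pairings in one loop; the file and column names are hypothetical placeholders for the new cases, new deaths, and new recovered columns, and the actual code is shown in Figures 21 to 24.

    import pandas as pd
    from sklearn.linear_model import LinearRegression

    # Hypothetical file and column names.
    df = pd.read_csv("covid19_data.csv")
    pairs = [("NewCases", "NewDeaths"), ("NewDeaths", "NewRecovered")]

    for x_col, y_col in pairs:
        model = LinearRegression().fit(df[[x_col]], df[y_col])
        print(x_col, "->", y_col,
              "intercept:", model.intercept_,
              "coefficient:", model.coef_[0])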
[Figures 21 to 24]