UNIVERSITY OF TECHNOLOGY
FACULTY OF COMPUTER SCIENCE AND ENGINEERING
Project Report
Contents

1 List of member & workload
2 List of Figures
4 Introduction
  4.1 Topic introduction and requirements
  4.2 Statistical methods
5 Theoretical basis
  5.1 Linear regression
    5.1.1 Definition
    5.1.2 Formula
      5.1.2.a Simple linear regression
      5.1.2.b Multiple linear regression
    5.1.3 Least square method
    5.1.4 Coefficient of determination
    5.1.5 Assessing model
6 Data handling
  6.1 Data importing
  6.2 Data properties
  6.3 Data cleaning
7 Data visualization
  7.1 Type
  7.2 Process Size (nm)
  7.3 TDP (W)
  7.4 Die Size (mm2)
  7.5 Transistors (million)
  7.6 Freq (MHz)
  7.7 Foundry
  7.8 CPU and GPU Frequency vs Process Size through Release year
    7.8.1 CPU Frequency vs Process Size through Release year
    7.8.2 GPU Frequency vs Process Size through Release year
  7.9 Conclusion
      8.4.2.b Independence
      8.4.2.c Homoscedasticity
      8.4.2.d Normality of the residual
      8.4.2.e Conclusion
    8.4.3 Hypothesis Testing
9 Full R code
10 References
1 List of member & workload
No.  Student ID  Full name               Task                                             Workload  Role
1    2152143     Đỗ Duy Khương           Data handling                                    20%       Leader
2    2152591     Nguyễn Đình Thiên Huy   Project overview, Data handling, Model building  20%
2 List of Figures

1 Import data
2 Get attributes data types
3 Data overview
4 Drop ID, FP16.GFLOPS, FP32.GFLOPS, FP64.GFLOPS columns
5 Count N/A entries
6 Add ReleaseDate.num attribute
7 Type Distribution
8 Histogram of Process size
9 Box plot of Process size by Foundry
10 Facetted Box Plot of Process Size by Type and Vendor
11 Scatter Plot of TDP vs Release Date
12 Scatter Plot of Die Size vs Release Date
13 Scatter Plot of Transistors vs Release Date
14 Scatter plot of Frequency vs Release Date and Box plot of Frequency by Vendor
15 Scatter plot of Frequency vs Release Date by Foundry
16 Foundry Distribution
17 Scatter plot of CPU Frequency vs Process Size over the Release year
18 Scatter plot of GPU Frequency vs Process Size over the Release year
19 A new column of log-transformed transistor count is added
20 Transistor count before (left) and after (right) applied Log Transformation
21 Scatter Plot of log-transformed transistors count
22 df_linear - a copy of original df
23 Summary of model
24 Residual plot against release date
25 Q-Q Plot
26 Output of Hypothesis Testing
3.2 Goal
Firstly, we want to make our data fully comprehensible by plotting graphs and making comparisons between the various attributes of CPUs and GPUs in the data. Moreover, by visualizing the data with histograms, pie charts, scatter plots, and so on, we want to get a clearer view of the development trend of each attribute of CPUs and GPUs.
Secondly, the main objective of our project is to apply testing techniques and methods derived from probability and statistics to predict, analyze, and show the connection between the number of transistors in CPUs and GPUs and Moore's Law. To be more precise, we want to determine whether Moore's Law still holds, based on the number of transistors packed onto microchips throughout the surveyed period, with the help of functions and methods included in RStudio. By analyzing the historical data on transistor counts, it may be possible to make predictions about future trends in CPU and GPU development, which could be useful for planning and investment purposes.
4 Introduction
4.1 Topic introduction and requirements
The CPU, or Central Processing Unit, is one of the primary components in any computer system. It is considered the brain of the whole computer, with the ability to execute instructions stored in memory, perform basic arithmetic calculations, and control the flow of data between different parts of the computer.
In support of the CPU, a graphics processing unit (GPU) is a specialized type of processor that can handle the complex computations required for rendering graphics and images on the computer. While a CPU is good for general-purpose processing, a GPU is specialized in parallel processing, which allows it to perform many calculations simultaneously. Therefore, it is well suited for tasks that require large amounts of data to be handled in parallel, such as rendering 3D graphics, running scientific simulations, or training machine learning models.
The fundamental building blocks of CPUs and GPUs are small electronic components called transistors. Transistors are used as the basic building blocks for creating logic gates, which are combined to form complex circuits that can perform arithmetic, logic, and control operations. The smallest transistor size that has been used in commercial CPUs or GPUs is currently 5 nanometers. As of 2022, IBM has demonstrated a 2-nanometer chip technology, with transistors roughly the size of five atoms, which is expected to enter manufacturing by 2025.
The data we are using provides information about CPUs and GPUs over a span of 21 years, from 2000 to 2021. We will be using statistical methods to evaluate the growth of these processors over the years, and then predict future statistics for them. Here is the link to our data source.
5 Theoretical basis
5.1 Linear regression
5.1.1 Definition
In statistics, linear regression is a linear approach for modelling the relationship between a scalar response and one or more explanatory variables. In other words, linear regression analysis is used to predict the value of a variable based on the value of one or more other variables. The variable that we want to predict is called the dependent variable (response variable), and the variables we use to predict it are called independent variables (explanatory variables).
The case of only one explanatory variable is called simple linear regression, and the case where there are more than one is called multiple linear regression.
5.1.2 Formula
5.1.2.a Simple linear regression
* General form:
y = β0 + β1 x + ϵ
Where:
• y: the predicted value of the dependent variable for a given value of the independent variable x
• x: the independent variable
• β0 : the intercept, i.e. the predicted value of y when x is 0
• β1 : the regression coefficient (how much we expect y to change as x increases)
• ϵ: the error of the estimate, i.e. how much variation there is in our estimate of the regression coefficient
The method for determining the regression coefficients will be discussed later.
* Matrix form:
We can think about the regression model as a whole by writing out the regression equation for every single observation i = 1, 2, ..., n:
\[
\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}
=
\begin{bmatrix}
x_{1,0} & x_{1,1} & \dots & x_{1,n} \\
x_{2,0} & x_{2,1} & \dots & x_{2,n} \\
\vdots  & \vdots  & \ddots & \vdots \\
x_{n,0} & x_{n,1} & \dots & x_{n,n}
\end{bmatrix}
\begin{bmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \\ \vdots \\ \hat{\beta}_n \end{bmatrix}
+
\begin{bmatrix} \hat{u}_1 \\ \hat{u}_2 \\ \vdots \\ \hat{u}_n \end{bmatrix}
\]
5.1.3 Least square method

\[
SSE = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left[ y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i) \right]^2
\]

The least-square method minimizes SSE to find the estimates β̂0 and β̂1.
Taking derivatives and setting them to zero, we obtain:

\[
\hat{\beta}_1 = \frac{\sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}}{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}
\]

or:

\[
\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
\]

and:

\[
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}
\]
where:

\[
\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}, \qquad \bar{y} = \frac{\sum_{i=1}^{n} y_i}{n}
\]
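As a small illustration of the least-square formulas above, the estimates can be computed directly in R and compared with the result of the built-in lm() function (a minimal sketch on made-up data, not on the report's dataset):

# illustrative data (hypothetical, for demonstration only)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)

# slope and intercept from the closed-form least-square formulas
beta1_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
beta0_hat <- mean(y) - beta1_hat * mean(x)

c(beta0_hat, beta1_hat)
coef(lm(y ~ x))   # should give (almost) the same values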
5.1.4 Coefficient of determination

Definition: To evaluate how accurately our linear regression function/model can predict, we need to consider the coefficient of determination. It is simply a value that indicates whether our model is reliable or not (we can understand it as how closely our model matches reality).
The most general definition of the coefficient of determination is:

\[
R^2 = 1 - \frac{SS_{res}}{SS_{tot}}
\]

Where:
\( SS_{res} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \): the sum of squares of residuals
\( SS_{tot} = \sum_{i=1}^{n} (y_i - \bar{y})^2 \): the total sum of squares (\( \bar{y} \) is the mean value)
To visualise the formula for the coefficient of determination, we consider the two graphs below (simple linear regression):
The better the linear regression (on the right) fits the data in comparison to the simple average (on the left graph), the closer the value of R2 is to 1. The areas of the blue squares represent the squared residuals with respect to the linear regression. The areas of the red squares represent the squared residuals with respect to the average value (R2 = 1 − SSres /SStot ).
In the best case, the modeled values exactly match the observed values, which results in
SSres = 0 and R2 = 1. A baseline model, which always predicts ȳ, will have R2 = 0. Models that
have worse predictions than the baseline will cause a negative R2 .
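To make the definition concrete, R2 can also be computed by hand from the residuals and compared with the value reported by summary() (again a minimal sketch on made-up data):

# illustrative data (hypothetical, for demonstration only)
x <- c(1, 2, 3, 4, 5, 6)
y <- c(1.2, 2.7, 2.9, 4.4, 5.1, 6.3)
fit <- lm(y ~ x)

ss_res <- sum(residuals(fit)^2)      # sum of squares of residuals
ss_tot <- sum((y - mean(y))^2)       # total sum of squares
r_squared <- 1 - ss_res / ss_tot

r_squared
summary(fit)$r.squared               # should match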
5.1.5 Assessing model

A linear regression model relies on the following assumptions:
• There is a linear relationship between the dependent variable and the independent variables.
• The independent variables are not too highly correlated with each other.
• The yi observations are selected independently and randomly from the population.
Hypothesis testing
Returning to the simple linear regression model y = β0 + β1 x + ϵ: this model is only useful when β1 ≠ 0, which means that changes in the independent variable x help to predict the value of the dependent variable y. Therefore we need to perform a hypothesis test to make sure that β1 ≠ 0.
In general, let's assume the null hypothesis: β1 = β10.
Our test statistic value will be:

\[
t = \frac{\hat{\beta}_1 - \beta_{10}}{s_{\hat{\beta}_1}}
\]

Where \( s_{\hat{\beta}_1} = \frac{s}{\sqrt{S_{xx}}} \) is the standard error of \( \hat{\beta}_1 \) (with \( S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 \)), and \( s \) is the unbiased estimator of \( \sigma \) (\( s^2 = \frac{SS_{res}}{n-2} \)).
In the model utility test, H0 : β1 = 0 and Ha : β1 ≠ 0, and the test statistic reduces to \( t = \frac{\hat{\beta}_1}{s_{\hat{\beta}_1}} \).
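For illustration, the test statistic for a slope hypothesis can be obtained from the output of lm() as follows (a minimal sketch on made-up data; beta10 denotes the hypothesized slope value):

# illustrative data (hypothetical, for demonstration only)
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(1.1, 2.3, 2.8, 4.2, 4.9, 6.4, 7.0, 8.3)
fit <- lm(y ~ x)

est <- summary(fit)$coefficients["x", "Estimate"]    # beta1 hat
se  <- summary(fit)$coefficients["x", "Std. Error"]  # s_beta1

beta10 <- 0                            # hypothesized value of beta1 (model utility test)
t_value <- (est - beta10) / se         # test statistic
df <- fit$df.residual                  # n - 2
p_value <- 2 * pt(-abs(t_value), df)   # two-sided p-value

c(t_value = t_value, p_value = p_value)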
6 Data handling
6.1 Data importing
We first need to import the data into R in order to work with it. To check the import, we print the first 5 rows and inspect the result.
raw_df <- read.csv("chip_dataset.csv")
head(raw_df, 5)
After that, we get the data type of each variable in the table.
str(raw_df)
As illustrated in the figure above, this data contains 14 variables with 4854 data entries.
We take a quick look at our data to get a general overview of it.
print(sprintf("Size of dataset: %d rows and %d columns", dim(raw_df)[1], dim(raw_df)[2]))
summary(raw_df)
There are a few methods for cleaning data, but we decided to drop all data entries with missing data (N/A, not available). First, we count the number of N/A entries in each column to identify the missing data.
colSums(is.na(df))
As can be clearly seen, the number of N/A entries varies across the data columns.
According to Figure 2, the Release Date attribute is currently of "chr" type, which is a character type representing text. We need to change this into the correct "Date" type.
df$Release.Date <- as.Date(df$Release.Date)
Using the newly updated Release Date column, we will make a new column in our data called "ReleaseDate.num". Its objective is to turn the date into a number of years by converting the date into a decimal value (numeric type).
df$ReleaseDate.num <- as.numeric(df$Release.Date, na.rm = TRUE) / 365.25 + 1970
head(df, 5)
This will be used later on for the linear regression model and data forecasting.
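As a quick sanity check of this conversion: as.numeric() applied to a Date gives the number of days since 1970-01-01, so dividing by 365.25 and adding 1970 yields an approximate decimal year.

d <- as.Date("2010-07-01")
as.numeric(d)                    # days since 1970-01-01
as.numeric(d) / 365.25 + 1970    # roughly 2010.5, i.e. mid-2010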
7 Data visualization
As all N/A values have already been dropped from the data set, we will present each part of the data in a way that is clear, concise, and visually appealing.
7.1 Type
df_pie_type <- data.frame(count(df, Type))
ggplot(data = df_pie_type, aes(x = "", y = n, fill = Type)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar(theta = "y", start = 0) + ggtitle("Type Distribution") +
  theme_void() + scale_fill_manual(values = c("#F8766D", "#7CAE00")) +
  guides(fill = guide_legend(title = "Type"))
54.8% of the total values are GPUs, and the rest are CPUs.
It is noticeable that the histogram shows a global pattern: most of the process sizes are lower than roughly 60 nm.
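For reference, the histogram in Figure 8 can be produced with a call along the following lines (a sketch only; the exact binwidth and styling used in the report may differ):

# a possible way to draw the histogram of Process Size (binwidth = 10 is an assumption)
ggplot(df, aes(x = Process.Size..nm.)) +
  geom_histogram(binwidth = 10) +
  labs(title = "Histogram of Process size", x = "Process Size (nm)", y = "Count")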
Figure 10: Facetted Box Plot of Process Size by Type and Vendor
Insights:
• It can be readily recognised that the Process Size for Intel, AMD and Nvidia lies in a comparatively lower range than for ATI and other vendors.
• We can see that Intel and AMD are the only vendors that make both GPUs and CPUs.
• Intel has a lower band than AMD for CPUs.
• Even in the GPU section, Intel and AMD are the only vendors whose band lies in the lower range of process size.
The graph above shows the highest-TDP product released each year; it seems to indicate that GPUs have a tendency to get a higher TDP over time.
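One way such a plot can be built is to take the maximum TDP per release year and type before plotting (a sketch using dplyr and lubridate; the column name TDP..W. is assumed here, and the report's exact aggregation code may differ):

# take the maximum TDP per release year and type (TDP..W. column name is assumed)
df_max_tdp <- df %>%
  mutate(Release.Year = year(Release.Date)) %>%
  group_by(Release.Year, Type) %>%
  summarise(max_tdp = max(TDP..W., na.rm = TRUE), .groups = "drop")

ggplot(df_max_tdp, aes(x = Release.Year, y = max_tdp, color = Type)) +
  geom_point() +
  labs(title = "Highest TDP per Release Year", x = "Release Year", y = "TDP (W)")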
Insight:
• A closer look at the plot reveals that the Die Size of GPUs increased over the observation period, while the figure for CPUs fluctuated between almost 0 mm2 and about 430 mm2.
• When the size of the chip increases, more transistors can be packed onto the chip, allowing for greater processing power and functionality. However, drawbacks still exist, such as higher costs and greater power consumption. As a result, this can become a limiting factor in the design of new microprocessors and integrated circuits, especially GPUs.
It is clear that the transistor count rose dramatically throughout the 2000-2020 period. Nevertheless, did it still follow Moore's Law? We will find out in the Linear Regression section.
Figure 14: Scatter plot of Frequency vs Release Date and Box plot of Frequency by Vendor
Insights:
• While Intel and AMD overall have almost similar frequency values, in 2005 the frequency values of Intel were much higher than those of AMD.
Insights:
• GF has emerged in the market only very recently, and in the past few years it has scaled up its frequency values close to the level of Intel and TSMC.
• Although TSMC has been present for many years, its frequency values have only recently increased.
• Intel seems to be very consistent with its frequency values, and it produces CPUs and GPUs with a full range of frequency values.
7.7 Foundry
df_pie_foundry <- data.frame(count(df, Foundry))
ggplot(data = df_pie_foundry, aes(x = "", y = n, fill = Foundry)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar(theta = "y", start = 0) +
  ggtitle("Foundry Distribution") +
  theme_void() +
  scale_fill_manual(values = c("#F8766D", "#7CAE00", "#F9D923", "#187498", "#573391", "#FF78F0", "#F7B449", "#7A9EAF", "#A9907E", "#FFEAEA")) +
  guides(fill = guide_legend(title = "Foundry"))
• If you know a little bit about chip companies, you must be aware that TSMC makes the highest number of chips in the world. It is a Taiwan-based company.
• Additionally, TSMC is so big that the other foundries do not even come close to TSMC's production. TSMC is followed by Intel.
7.8 CPU and GPU Frequency vs Process Size through Release year
7.8.1 CPU Frequency vs Process Size through Release year
Figure 17: Scatter plot of CPU Frequency vs Process Size over the Release year
As the scatter plot above shows, and as was discussed previously, CPUs are released each year with a wide range of frequencies, but with a smaller Process Size.
Figure 18: Scatter plot of GPU Frequency vs Process Size over the Release year
As can be seen above, GPUs tend to have a higher frequency and a smaller Process Size as time goes by.
7.9 Conclusion
There are a lot of interesting aspects to this dataset that could be observed in the Visualization
steps, such as:
• CPU frequencies are not evolving as fast as GPU frequencies.
• CPUs and GPUs tend to get a smaller Process Size over time.
• Vendors release a large number of products with a wide range of frequencies each year.
• The CPUs with higher TDP seem to be increasing in number over time.
\[
\log(Transistor\_count) \approx \frac{\log(2)}{2}\, year - 293.3129 \qquad (4)
\]

Or generally:

\[
\log(Transistor\_count) \approx \beta_0 + \beta_1\, year + \varepsilon \qquad (5)
\]

With ε being the random error.
Since this is a linear model, we can apply linear regression to assess the accuracy of Moore's Law on this dataset. Specifically, we will employ linear regression on the transistor count column (after applying the logarithm) and compare the outcome with our expectation, which is log(2)/2 or 0.1505, to determine if Moore's Law is still true.
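For completeness, the expected slope follows directly from the statement of Moore's Law that the transistor count doubles every two years (a short derivation, using the base-10 logarithm implied by the value 0.1505):

\begin{align*}
T(year) &= T(year_0)\cdot 2^{(year - year_0)/2} \\
\log_{10} T(year) &= \log_{10} T(year_0) + \frac{\log_{10}(2)}{2}\,(year - year_0) \\
\Rightarrow \beta_1 &= \frac{\log_{10}(2)}{2} \approx \frac{0.3010}{2} \approx 0.1505
\end{align*}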
The transformation was successful, resulting in a more normalized distribution of the transistor
count data:
Figure 20: Transistor count before (left) and after (right) applied Log Transformation
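A minimal sketch of how such a log-transformed column can be added (assuming the transistor count column is named Transistors..million. after import and that the base-10 logarithm is used, consistent with the expected slope of log(2)/2 ≈ 0.1505):

df_linear <- df   # work on a copy of the cleaned data frame
# log-transform the transistor count (column name is an assumption)
df_linear$log_transistor <- log10(df_linear$Transistors..million.)
head(df_linear$log_transistor)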
To better visualize the transistor count data after the logarithmic transformation, we can create
a scatter plot:
ggplot(df_linear, aes(x = ReleaseDate.num, y = log_transistor, color = Type)) +
  geom_point()
# fit a linear model of the log-transformed transistor count against the release year
model <- lm(log_transistor ~ ReleaseDate.num, data = df_linear)
summary(model)
coef(model)
At first sight, the estimated coefficient for "ReleaseDate.num" was only 0.1203, which is smaller
than the expected value of 0.1505. However, we cannot make any conclusions yet. We need to use
hypothesis testing to determine whether the coefficient estimate is significantly different from
our expected value. This will allow us to draw more informed conclusions about the relationship
between the release date and the transistor count, and whether it supports Moore’s Law or not.
8.4.2.b Independence
The assumption of independence in linear regression refers to the idea that the observations in
the data set are independent of each other. That is, the value of one observation should not be
influenced by the value of any other observation.
In this particular dataset, each observation is either a CPU or a GPU and each of them has
different parameters; therefore, it is unlikely to be the case that any observation is
influenced by the others.
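If a more formal check of this assumption is desired, one common option is the Durbin-Watson test for autocorrelation of the residuals (a sketch, assuming the lmtest package is installed; this test is not part of the report's original analysis):

library(lmtest)

# Durbin-Watson test: H0 is that the residuals are not autocorrelated.
# A statistic near 2 and a large p-value are consistent with independence.
dwtest(model)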
8.4.2.c Homoscedasticity
Homoscedasticity is an assumption in linear regression that the variance of the residuals, or
differences between the predicted and actual values of the response variable, is constant across all
values of the predictor variable. This means that the distribution of errors has the same "spread"
or variability across the entire range of values of the predictor variable, which is release date in
this case.
In order to check for homoscedasticity, we will plot the residuals against the release date and
check for the presence of any pattern in the plot:
# Create a residual plot
ggplot(df_linear, aes(x = ReleaseDate.num, y = model$residuals)) +
  geom_point(alpha = 0.6) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "Residual Plot", x = "Release Date", y = "Residuals")
Since the plot shows no clear pattern (such as cone-shaped or funnel-shaped), we can conclude
that the assumption of homoscedasticity is met.
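A formal test can complement the visual inspection, for example the Breusch-Pagan test (a sketch, again assuming the lmtest package; not part of the report's original analysis):

library(lmtest)

# Breusch-Pagan test: H0 is homoscedasticity (constant residual variance).
# A large p-value means we fail to reject homoscedasticity.
bptest(model)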
8.4.2.d Normality of the residual

Since the majority of the residuals approximately align with the straight line, which represents the expected values under the assumption that the residuals follow a normal distribution, it is reasonable to conclude that the residuals are normally distributed.
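For reference, a normal Q-Q plot like the one in Figure 25 can be produced with base R along these lines (a sketch; the report's exact plotting code may differ):

# Q-Q plot of the residuals against a theoretical normal distribution
qqnorm(residuals(model), main = "Q-Q Plot of Residuals")
qqline(residuals(model), col = "red")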
8.4.2.e Conclusion
In this section, we have examined four common assumptions of linear regression, namely linearity, independence, homoscedasticity, and normality of the residuals, and we found that all of these assumptions are satisfied in our analysis. Based on this, we can reasonably conclude that the linear regression model is appropriate for the given dataset.
8.4.3 Hypothesis Testing

As the output suggests, the null hypothesis of β1 ≥ 0.1505 is rejected, which means that β1, the growth rate of the transistor count, is unlikely to be greater than or equal to Moore's Law's expectation. Therefore, we can come to the conclusion that Moore's Law is no longer valid.
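For reference, the one-sided test reported above can be computed from the fitted model roughly as follows (a sketch consistent with the logic shown in the Full R code section; the variable names t_value and t_alpha follow that code):

alpha <- 0.05
est <- summary(model)$coefficients["ReleaseDate.num", "Estimate"]
se  <- summary(model)$coefficients["ReleaseDate.num", "Std. Error"]

# H0: B1 >= 0.1505 (Moore's Law growth rate), Ha: B1 < 0.1505
t_value <- (est - 0.1505) / se
t_alpha <- qt(1 - alpha, df = model$df.residual)

t_value < -t_alpha   # TRUE means t_value falls in the rejection region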
It is important to note that there are several objective factors that can affect the accuracy of this
conclusion, such as:
• The amount of data collected is not enough to reflect the trend of Moore’s Law
9 Full R code
library(tidyverse)
library(ggplot2)
library(tidyr)
library(gridExtra)
library(dplyr)
library(lubridate)

# read csv file and print the first 5 rows
raw_df <- read.csv("chip_dataset.csv")
head(raw_df, 5)  # only print 5 first rows

# An overview of dataset's attributes
print(sprintf("Size of dataset: %d rows and %d columns", dim(raw_df)[1], dim(raw_df)[2]))
summary(raw_df)

# get datatype of each attribute
str(raw_df)

# count the number of NA (not available)
colSums(is.na(raw_df))

# Drop the id column (first column) and the last three columns (FP16.GFLOPS, FP32.GFLOPS, FP64.GFLOPS) since they have too many NAs
df <- raw_df[, -c(1, 12:14)]
head(df, 5)

# convert Release.Date column to date
df$Release.Date <- as.Date(df$Release.Date)
head(df, 5)
colSums(is.na(df))

# drop NA values in the remaining columns
df <- na.omit(df)
head(df, 5)
dim(df)

# with the release date column, we will create a new one - a numeric release date, starting from 2000 (since the first day is 1-1-2000)
df$ReleaseDate.num <- as.numeric(df$Release.Date, na.rm = TRUE) / 365.25 + 1970
head(df, 5)

### Visualization ###
## Type distribution ##
# Pie chart type distribution
df_pie_type <- data.frame(count(df, Type))
ggplot(data = df_pie_type, aes(x = "", y = n, fill = Type)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar(theta = "y", start = 0) +
  ggtitle("Type Distribution") +
  theme_void() +
  scale_fill_manual(values = c("#F8766D", "#7CAE00")) +
  guides(fill = guide_legend(title = "Type"))

# Bar chart distribution
ggplot(data = df_pie_type, aes(x = n, y = Type, fill = Type)) +
  geom_bar(stat = "identity", width = 1) +
  ggtitle("Type Distribution") +
  theme_void() +
# Box Plot of Frequency by Vendor
ggplot(df, aes(x = df$Freq..MHz., y = df$Vendor, fill = Vendor)) +
  geom_boxplot() +
  labs(title = "Box Plot of Frequency by Vendor", x = "Vendor", y = "Freq (MHz)") +
  scale_fill_manual(values = c("#F8766D", "#7CAE00", "#F9D923", "#187498", "#573391")) +
  theme_classic()

# Scatter Plot of Frequency vs Release Date by Foundry
ggplot(df, aes(x = df$Release.Date, y = df$Freq..MHz., color = Foundry)) +
  geom_point() +
  labs(title = "Scatter Plot of Frequency vs Release Date", x = "Release Date", y = "Freq (MHz)") +
  scale_color_manual(values = c("#F8766D", "#7CAE00", "#F9D923", "#187498", "#573391", "#FF78F0", "#F7B449", "#7A9EAF", "#A9907E", "#FFEAEA")) +
  theme_classic()

# Foundry Distribution
df_pie_foundry <- data.frame(count(df, Foundry))
ggplot(data = df_pie_foundry, aes(x = "", y = n, fill = Foundry)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar(theta = "y", start = 0) +
  ggtitle("Foundry Distribution") +
  theme_void() +
  scale_fill_manual(values = c("#F8766D", "#7CAE00", "#F9D923", "#187498", "#573391", "#FF78F0", "#F7B449", "#7A9EAF", "#A9907E", "#FFEAEA")) +
  guides(fill = guide_legend(title = "Foundry"))

ggplot(data = df_pie_foundry, aes(x = n, y = Foundry, fill = Foundry)) +
  geom_bar(stat = "identity", width = 1) +
  labs(title = "Foundry Distribution", x = "Count", y = "Foundry") +
  theme_classic() +
  scale_fill_manual(values = c("#F8766D", "#7CAE00", "#F9D923", "#187498", "#573391", "#FF78F0", "#F7B449", "#7A9EAF", "#A9907E", "#FFEAEA")) +
  guides(fill = guide_legend(title = "Foundry"))

# CPU Frequency (MHz) vs Processor Size (nm) vs Release year
df_cpu <- df[df$Type == 'CPU', ]
ggplot(df_cpu, aes(x = df_cpu$Freq..MHz., y = df_cpu$Process.Size..nm., color = factor(year(df_cpu$Release.Date)))) +
  geom_point() +
  labs(title = "CPU Frequency (MHz) vs Processor Size (nm) vs Release year",
       x = "Frequency (MHz)",
       y = "Process Size (nm)",
       color = "Year") +
  scale_color_discrete(name = "Year") +
  theme_minimal()

# GPU Frequency (MHz) vs Processor Size (nm) vs Release year
df_gpu <- df[df$Type == 'GPU', ]
ggplot(df_gpu, aes(x = df_gpu$Freq..MHz., y = df_gpu$Process.Size..nm., color = factor(year(df_gpu$Release.Date)))) +
  geom_point() +
  labs(title = "GPU Frequency (MHz) vs Processor Size (nm) vs Release year",
       x = "Frequency (MHz)",
       y = "Process Size (nm)",
       color = "Years") +
  scale_color_discrete(name = "Year") +
  theme_minimal()
  message("Since t_value falls in the rejection region (t_value < -t_alpha), we reject the null hypothesis that B1 >= 0.1505")
} else {
  message("Since t_value does not fall in the rejection region (t_value >= -t_alpha), we fail to reject the null hypothesis that B1 >= 0.1505")
}
10 References
1. Zach, Introduction to Multiple Linear Regression, 27/10/2022. https://www.statology.org/multiple-linear-regression/
2. Valerie Watts, Coefficient of Multiple Determination. https://ecampusontario.pressbooks.pub/introstats/chapter/13-4-coefficient-of-multiple-determination/
3. Maurice A. Geraghty, Hypothesis Test for Simple Linear Regression, 27/01/2022. https://stats.libretexts.org/Courses/American_River_College/STAT_300%3A_My_Introductory_Statistics_Textbook_(Mirzaagha)/03%3A_Regression_Analysis/3.03%3A_Correlation_and_Linear_Regression/3.3.04%3A_Hypothesis_Test_for_Simple_Linear_Regression
4. Linear Regression in R, DataCamp, Dec 2022. https://www.datacamp.com/tutorial/linear-regression-R