
VIETNAM NATIONAL UNIVERSITY, HO CHI MINH CITY

UNIVERSITY OF TECHNOLOGY
FACULTY OF COMPUTER SCIENCE AND ENGINEERING

PROBABILITY AND STATISTICS (MT2013)

Class: CC06 | Group: 7

Project Report

Lecturer: Professor Nguyễn Tiến Dũng

Students: Đỗ Duy Khương - 2152143


Nguyễn Đình Thiên Huy - 2152591
Nguyễn Thịnh Đạt - 2152507
Trần Minh Trung - 2153073
Nguyễn Hoàng Quốc Tuấn - 2153078

HO CHI MINH CITY, APRIL 2023


University of Technology, Ho Chi Minh City
Faculty of Computer Science and Engineering

Contents
1 List of members & workload 3

2 List of Figures 4

3 Acknowledgement & Goal 5


3.1 Acknowledgement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
3.2 Goal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

4 Introduction 6
4.1 Topic introduction and requirements . . . . . . . . . . . . . . . . . . . . . . . . . 6
4.2 Statistical methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

5 Theoretical basis 7
5.1 Linear regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
5.1.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
5.1.2 Formula . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
5.1.2.a Simple linear regression . . . . . . . . . . . . . . . . . . . 7
5.1.2.b Multiple linear regression . . . . . . . . . . . . . . . . . . 7
5.1.3 Least square method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
5.1.4 Coefficient of determination . . . . . . . . . . . . . . . . . . . . . . . . . . 9
5.1.5 Assessing model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

6 Data handling 11
6.1 Data importing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
6.2 Data properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
6.3 Data cleaning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

7 Data visualization 15
7.1 Type . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
7.2 Process Size (nm) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
7.3 TDP (W) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
7.4 Die Size (mm2 ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
7.5 Transistors (million) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
7.6 Freq (MHz) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
7.7 Foundry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
7.8 CPU and GPU Frequency vs Process Size through Release year . . . . . . . . . . 21
7.8.1 CPU Frequency vs Process Size through Release year . . . . . . . . . . . 21
7.8.2 GPU Frequency vs Process Size through Release year . . . . . . . . . . . 22
7.9 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

8 Building a linear regression model for assessing Moore’s Law 24


8.1 What is Moore’s Law . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
8.2 Method for examining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
8.3 Applying Log Transformation to Transistor count . . . . . . . . . . . . . . . . . . 25
8.4 Assessing Moore’s Law . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
8.4.1 Building the model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
8.4.2 Checking for model’s validity . . . . . . . . . . . . . . . . . . . . . . . . . 27
8.4.2.a Linearity . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

Assignment for Probability and Statistics | Academic year 2022 - 2023 Page 1/36

8.4.2.b Independence . . . . . . . . . . . . . . . . . . . . . . . . . 27
8.4.2.c Homoscedasticity . . . . . . . . . . . . . . . . . . . . . . . 28
8.4.2.d Normality of the residual . . . . . . . . . . . . . . . . . . 28
8.4.2.e Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . 29
8.4.3 Hypothesis Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

9 Full R code 31

10 References 36


1 List of members & workload

ID | Student ID | Full Name              | Workload                                      | Evaluation | Note
---|------------|------------------------|-----------------------------------------------|------------|-------
1  | 2152143    | Đỗ Duy Khương          | Introduction, Data handling, Project overview | 20%        | Leader
2  | 2152591    | Nguyễn Đình Thiên Huy  | Data handling, Model building                 | 20%        |
3  | 2152507    | Nguyễn Thịnh Đạt       | Data visualization                            | 20%        |
4  | 2153073    | Trần Minh Trung        | Theoretical basis                             | 20%        |
5  | 2153078    | Nguyễn Hoàng Quốc Tuấn | Theoretical basis                             | 20%        |


2 List of Figures
1 Import data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2 Get attributes data types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3 Data overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
4 Drop ID, FP16.GFLOPS, FP32.GFLOPS, FP64.GFLOPS columns . . . . . . . . 13
5 Count N/A entries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
6 Add ReleaseDate.num attribute . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
7 Type Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
8 Histogram of Process size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
9 Box plot of Process size by Foundry . . . . . . . . . . . . . . . . . . . . . . . . . 16
10 Facetted Box Plot of Process Size by Type and Vendor . . . . . . . . . . . . . . . 17
11 Scatter Plot of TDP vs Release Date . . . . . . . . . . . . . . . . . . . . . . . . . 17
12 Scatter Plot of Die Size vs Release Date . . . . . . . . . . . . . . . . . . . . . . . 18
13 Scatter Plot of Transistors vs Release Date . . . . . . . . . . . . . . . . . . . . . 19
14 Scatter plot of Frequency vs Release Date and Box plot of Frequency by Vendor 19
15 Scatter plot of Frequency vs Release Date by Foundry . . . . . . . . . . . . . . . 20
16 Foundry Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
17 Scatter plot of CPU Frequency vs Process Size over the Release year . . . . . . . 22
18 Scatter plot of GPU Frequency vs Process Size over the Release year . . . . . . . 22
19 A new column of log-transformed transistor count is added . . . . . . . . . . . . 25
20 Transistor count before (left) and after (right) applying Log Transformation . . . 25
21 Scatter Plot of log-transformed transistors count . . . . . . . . . . . . . . . . . . 26
22 df_linear - a copy of original df . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
23 Summary of model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
24 Residual plot against release date . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
25 Q-Q Plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
26 Output of Hypothesis Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30


3 Acknowledgement & Goal


3.1 Acknowledgement
First of all, we would like to express our greatest gratitude to Professor Nguyễn Tiến Dũng for giving our team the opportunity to work together as a group. By providing and introducing us to the RStudio software, and after many days of working with it to complete our project, we have realized that this software is a very powerful tool, with many libraries and methods covering not only probability and statistics but also many other mathematical fields. Thanks to RStudio, it is no exaggeration to say that our knowledge, ideas, and imagination in mathematical matters have been considerably expanded and improved.
Probability and statistics are two of the most important concepts in mathematics. Probability is concerned with chance, while statistics is concerned with how we manage diverse data sets using various methodologies; it aids in the representation of complex facts in a simple and clear manner. Nowadays, statistics is widely used in data science professions, where professionals employ it to forecast many different characteristics and deviations. It is a crucial topic, particularly for Computer Science and Engineering students. As a result, participating in this project has developed and refined our abilities, not only in data science, but also in teamwork and problem-solving.

3.2 Goal
Firstly, we want our data to be completely comprehensible, so we plot graphs and make comparisons between the various attributes of the CPUs and GPUs in the data. Moreover, by visualizing the data with histograms, pie charts, scatter plots, and so on, we aim to get a clearer view of the development trend of each attribute of CPUs and GPUs.
Secondly, the main objective of our project is to apply testing techniques and methods derived from probability and statistics to predict, analyze, and show the connection between the number of transistors in CPUs and GPUs and Moore's Law. To be more precise, we want to determine whether Moore's Law still holds as time goes by, based on the number of transistors packed onto microchips throughout the surveyed period, with the help of the functions and methods included in the RStudio software. By analyzing the historical data on transistor counts, it may be possible to make predictions about future trends in CPU and GPU development, which could be useful for planning and investment purposes.


4 Introduction
4.1 Topic introduction and requirements
The CPU, or Central Processing Unit, is one of the primary components of any computer system. It is considered the brain of the whole computer, with the capability to execute instructions stored in memory, perform basic arithmetic calculations, and control the flow of data between different parts of the computer.
In support of the CPU, a graphics processing unit (GPU) is a specialized type of processor that can handle the complex computations required for rendering graphics and images on the computer. While a CPU is good for general-purpose processing, a GPU is specialized in parallel processing, which allows it to perform many calculations simultaneously. Therefore, it is suited for tasks that require large amounts of data to be handled in parallel, such as rendering 3D graphics, running scientific simulations, or training machine learning models.
The fundamental units of CPUs and GPUs are small electronic components called transistors. Transistors are used as the basic building blocks for creating logic gates, which are combined to form complex circuits that can perform arithmetic, logic, and control operations.
The smallest transistor size that has been used in commercial CPUs or GPUs is currently 5 nanometers. As of 2022, IBM has made a 2-nanometer chip technology, roughly the size of five atoms, which is expected to enter manufacturing by 2025.
The data we are using provides information about CPUs and GPUs over a span of 21 years, from 2000 to 2021. We will be using statistical methods to evaluate the growth of these processors over the years, and then predict future statistics for them. Here is the link to our data source.

4.2 Statistical methods


Regarding the statistical methods for the given data, we will be utilizing a linear regression model to analyze and predict the processors' future growth. Mainly, we will be analyzing the number of transistors according to Moore's Law, to see if it still holds true to this date. We will go into more detail later in the report.
The main reason we use a linear regression model rather than its counterpart, the logistic regression model, is that the number of transistors is a quantitative variable, i.e. a variable that measures the amount of something, in this case the number of transistors. Furthermore, we want to accurately determine the rate of increase of transistors, which cannot ideally be obtained with a logistic regression model, since its result is based on 0 or 1 (true or false).


5 Theoretical basis
5.1 Linear regression
5.1.1 Definition
In statistics, linear regression is a linear approach for modelling the relationship between a scalar response and one or more explanatory variables. In other words, linear regression analysis is used to predict the value of a variable based on the values of one or more other variables. The variable we wish to predict is called the dependent variable (response variable), and the variable we use to predict it is called the independent variable (explanatory variable).
The case with only one explanatory variable is called simple linear regression; when there is more than one, it is called multiple linear regression.

5.1.2 Formula
5.1.2.a Simple linear regression
* General form:
y = β0 + β1 x + ϵ
Where:

• y: the predicted value of the dependent variable for a given value of the independent variable x
• x: the independent variable
• β0: the intercept, the predicted value of y when x is 0
• β1: the regression coefficient (how much we expect y to change as x increases)
• ϵ: the error of the estimate, or how much variation there is in our estimate of the regression coefficient

The method for determining the regression coefficients will be discussed later.

5.1.2.b Multiple linear regression


* General form:
yi = β̂0 + β̂1 xi,1 + β̂2 xi,2 + ... + β̂n xi,n + ûi = ŷi + ûi
Where:
yi : the observed value with observation i
ŷi : the predicted value with data observation i
β̂0 : the y-intercept (value of yi when all other parameters are set to 0) (the value xi,0 is set to 1)
β̂1 , β̂2 , ..., β̂n : the regression coefficients calculated using the least square method
xi,1 , xi,2 , ..., xi,n : the independent variables
ûi : the residual between the predicted value and the observed value

* Matrix form:
We can think about the regression model as a whole by writing out the regression equation for every single observation i = 1, 2, ..., n:


$$y_1 = \hat{\beta}_0 + \hat{\beta}_1 x_{1,1} + \hat{\beta}_2 x_{1,2} + \dots + \hat{\beta}_n x_{1,n} + \hat{u}_1$$
$$y_2 = \hat{\beta}_0 + \hat{\beta}_1 x_{2,1} + \hat{\beta}_2 x_{2,2} + \dots + \hat{\beta}_n x_{2,n} + \hat{u}_2$$
$$\vdots$$
$$y_i = \hat{\beta}_0 + \hat{\beta}_1 x_{i,1} + \hat{\beta}_2 x_{i,2} + \dots + \hat{\beta}_n x_{i,n} + \hat{u}_i$$
$$\vdots$$
$$y_n = \hat{\beta}_0 + \hat{\beta}_1 x_{n,1} + \hat{\beta}_2 x_{n,2} + \dots + \hat{\beta}_n x_{n,n} + \hat{u}_n$$

We can transform these equations into matrix form as $\vec{y} = X\hat{\vec{\beta}} + \vec{u}$ by applying some linear algebra, or, to be more precise:

$$\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} x_{1,0} & x_{1,1} & \dots & x_{1,n} \\ x_{2,0} & x_{2,1} & \dots & x_{2,n} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n,0} & x_{n,1} & \dots & x_{n,n} \end{pmatrix} \begin{pmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \\ \vdots \\ \hat{\beta}_n \end{pmatrix} + \begin{pmatrix} \hat{u}_1 \\ \hat{u}_2 \\ \vdots \\ \hat{u}_n \end{pmatrix}$$
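As an illustrative sketch (using small made-up data, not the project's dataset), the matrix form can be solved directly in R via the normal equations, $\hat{\beta} = (X^T X)^{-1} X^T y$, and checked against the built-in lm():

```r
# Sketch with made-up data: solve the normal equations (X'X) beta = X'y
# and compare the result with lm().
set.seed(1)
x1 <- runif(20)
x2 <- runif(20)
y  <- 2 + 3 * x1 - 1.5 * x2 + rnorm(20, sd = 0.1)

X <- cbind(1, x1, x2)                      # design matrix with an intercept column
beta_hat <- solve(t(X) %*% X, t(X) %*% y)  # least-squares estimates

beta_hat
coef(lm(y ~ x1 + x2))                      # should agree with beta_hat
```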

5.1.3 Least square method


*Estimating the regression parameters:
In general, as well as in RStudio, the least-squares method is used to find the appropriate formula for a linear regression model.
Let $\hat{\beta}_0, \hat{\beta}_1$ be the estimates of $\beta_0, \beta_1$ respectively.
The fitted regression line is given by

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$$

The residual $e_i = y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i) = y_i - \hat{y}_i$ describes the error between the $i$-th predicted value and the $i$-th observation.
$\hat{\beta}_0, \hat{\beta}_1$ are found by the least-squares method (least-squares estimates).
We define the sum of squared errors:

$$SSE = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left[ y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i) \right]^2$$

The least-squares method minimizes SSE to find the estimates $\hat{\beta}_0, \hat{\beta}_1$.
Setting the derivatives of SSE to zero, we obtain:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}}{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}$$

or:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

and:

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$


where:

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}, \qquad \bar{y} = \frac{\sum_{i=1}^{n} y_i}{n}$$
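The closed-form estimates above can be computed by hand in R (on made-up data, for illustration) and verified against lm():

```r
# Sketch with made-up data: compute beta1_hat and beta0_hat from the
# closed-form least-squares formulas, then compare with lm().
set.seed(2)
x <- 1:30
y <- 5 + 0.8 * x + rnorm(30)

beta1_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
beta0_hat <- mean(y) - beta1_hat * mean(x)

c(beta0_hat, beta1_hat)
coef(lm(y ~ x))   # should agree with the manual estimates
```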

5.1.4 Coefficient of determination


* Coefficient of determination:

Definition: To evaluate how accurately our linear regression model can predict, we need to consider the coefficient of determination. It is simply a value that indicates whether our model is reliable (we can understand it as how close our model is to reality, in percentage terms).
The most general definition of the coefficient of determination is:

$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$$

where:
$SS_{res} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$: the sum of squares of residuals
$SS_{tot} = \sum_{i=1}^{n} (y_i - \bar{y})^2$: the total sum of squares ($\bar{y}$ is the mean value)

To visualise the formula for the coefficient of determination, consider two graphs (simple linear regression): the better the linear regression (on the right) fits the data in comparison to the simple average (on the left graph), the closer the value of $R^2$ is to 1. The areas of the blue squares represent the squared residuals with respect to the linear regression; the areas of the red squares represent the squared residuals with respect to the average value.

In the best case, the modelled values exactly match the observed values, which results in $SS_{res} = 0$ and $R^2 = 1$. A baseline model, which always predicts $\bar{y}$, will have $R^2 = 0$. Models with worse predictions than the baseline will have a negative $R^2$.
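The definition of $R^2$ can be checked in R on a small made-up dataset against the value lm() reports:

```r
# Sketch with made-up data: compute R^2 from its definition and compare
# with the value reported by summary(lm).
set.seed(3)
x <- runif(50)
y <- 1 + 2 * x + rnorm(50, sd = 0.3)
fit <- lm(y ~ x)

SS_res <- sum(residuals(fit)^2)    # sum of squared residuals
SS_tot <- sum((y - mean(y))^2)     # total sum of squares
1 - SS_res / SS_tot                # R^2 by definition
summary(fit)$r.squared             # same value
```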


5.1.5 Assessing model


Assumptions of regression:

• There is a linear relationship between the dependent variable and independent variables.
• The independent variables are not too highly correlated with each other.
• yi observations are selected independently and randomly from population.

• Residuals should be normally distributed with a mean of 0 and variance σ 2

Hypothesis testing
Back to the formula for the simple linear regression model: $y = \beta_0 + \beta_1 x + \epsilon$. This formula is only useful when $\beta_1 \neq 0$, which means that a change of the independent variable $x$ helps to predict the value of the dependent variable $y$. Therefore we need to perform a hypothesis test to assure that $\beta_1 \neq 0$.
In general, let us assume:
Null hypothesis: $H_0: \beta_1 = \beta_{10}$
Our test statistic value will be:

$$t = \frac{\hat{\beta}_1 - \beta_{10}}{s_{\hat{\beta}_1}}$$

where $s_{\hat{\beta}_1} = \frac{s}{\sqrt{S_{xx}}}$ is the standard deviation of $\hat{\beta}_1$, with $S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2$, and $s$ is the unbiased estimator of $\sigma$ ($s^2 = \frac{SS_{res}}{n-2}$).

Alternative hypothesis           | Rejection region for level α test
---------------------------------|----------------------------------
$H_a: \beta_1 > \beta_{10}$      | $t \geq t_{\alpha, n-2}$
$H_a: \beta_1 < \beta_{10}$      | $t \leq -t_{\alpha, n-2}$
$H_a: \beta_1 \neq \beta_{10}$   | either $t \geq t_{\alpha/2, n-2}$ or $t \leq -t_{\alpha/2, n-2}$

In the case of the model utility test, $H_0: \beta_1 = 0$ and $H_a: \beta_1 \neq 0$, in which case the test statistic value is $t = \frac{\hat{\beta}_1}{s_{\hat{\beta}_1}}$.
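The model utility test can be reproduced by hand in R (on made-up data) and compared with the t value that summary(lm) prints:

```r
# Sketch with made-up data: compute the t statistic for H0: beta1 = 0
# manually and compare with summary(lm).
set.seed(4)
x <- runif(40)
y <- 3 + 1.5 * x + rnorm(40, sd = 0.5)
fit <- lm(y ~ x)

n   <- length(x)
s2  <- sum(residuals(fit)^2) / (n - 2)     # s^2 = SS_res / (n - 2)
Sxx <- sum((x - mean(x))^2)
t_stat <- coef(fit)[2] / sqrt(s2 / Sxx)    # t = beta1_hat / s_beta1

t_stat
summary(fit)$coefficients["x", "t value"]  # matches the manual t statistic
# reject H0 at level alpha if |t_stat| >= qt(1 - alpha/2, df = n - 2)
```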


6 Data handling
6.1 Data importing
We first need to import the data into our application in order to work with it. To check the data, we print the first 5 rows and examine the result.

raw_df <- read.csv("chip_dataset.csv")
head(raw_df, 5)

Figure 1: Import data

After that, we get the data type of each variable in the table.

str(raw_df)

Figure 2: Get attributes data types

As illustrated in the figure above, this data contains 14 variables with 4854 data entries.
A quick look into our data gives a general overview of it.

print(sprintf("Size of dataset: %d rows and %d columns", dim(raw_df)[1], dim(raw_df)[2]))
summary(raw_df)


Figure 3: Data overview

6.2 Data properties


Although this data contains 14 attributes, we only specify those needed, since some data columns are missing a substantial amount of data or are not related to our topic of discussion.

1. Product: the name of the product.
2. Type: CPU or GPU.
3. Release Date (YY/MM/DD): the release date of the processor.
4. Process Size (nm): the size of one transistor.
5. TDP (W): stands for Thermal Design Power. It is used to measure the maximum amount of heat generated under normal operating conditions.
6. Die Size (mm²): the physical size of the semiconductor chip that contains the processor and other integrated circuits.
7. Transistors (million): the number of transistors in the processor.
8. Freq (MHz): the frequency, or clock speed, of the processor.
9. Foundry: the processor's manufacturer.
10. Vendor: the processor's distributor.


6.3 Data cleaning


As mentioned earlier, some columns of the data are omitted: the first column (the id column) and the last three (FP16.GFLOPS, FP32.GFLOPS, FP64.GFLOPS). We then print the first 5 rows of the data to check the validity of the code.

df <- raw_df[, -c(1, 12:14)]
head(df, 5)

Figure 4: Drop ID, FP16.GFLOPS, FP32.GFLOPS, FP64.GFLOPS columns

There are a few methods for cleaning data, but we decided on dropping all data entries with missing data (N/A, not available). First, we count the number of N/A values to identify the missing data.

colSums(is.na(df))

Figure 5: Count N/A entries

As can clearly be seen, the number of N/A entries varies across the data columns:

• 75 N/A Release Date


• 9 N/A Process Size
• 626 N/A TDP
• 715 N/A Die Size

• 711 N/A Transistors


After that, we drop all data entries containing N/A values.

df <- na.omit(df)

According to Figure 2, the Release Date attribute is currently of "chr" type, which is a character type representing text. We need to convert it into the proper "Date" type.

df$Release.Date <- as.Date(df$Release.Date)


Using the newly updated Release Date column, we create a new column in our data called "ReleaseDate.num". Its objective is to turn the date into a number of years by converting the date into a decimal value (numeric type).

df$ReleaseDate.num <- as.numeric(df$Release.Date) / 365.25 + 1970
head(df, 5)

Figure 6: Add ReleaseDate.num attribute

This will be used later on for the linear regression model and data forecasting.
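To illustrate the conversion (with a hypothetical date, not one taken from the dataset):

```r
# Hypothetical example: as.numeric() on a Date gives the number of days
# since 1970-01-01, so dividing by 365.25 and adding 1970 yields an
# approximate decimal year.
d <- as.Date("2010-07-02")
as.numeric(d)                  # days since 1970-01-01
as.numeric(d) / 365.25 + 1970  # roughly 2010.5 (mid-2010)
```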


7 Data visualization
Since all N/A values have already been dropped from the data set, we will present each attribute of the data in a way that is clear, concise, and visually appealing.

7.1 Type
df_pie_type <- data.frame(count(df, Type))
ggplot(data = df_pie_type, aes(x = "", y = n, fill = Type)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar(theta = "y", start = 0) + ggtitle("Type Distribution") +
  theme_void() + scale_fill_manual(values = c("#F8766D", "#7CAE00")) +
  guides(fill = guide_legend(title = "Type"))

ggplot(data = df_pie_type, aes(x = n, y = Type, fill = Type)) +
  geom_bar(stat = "identity", width = 1) + ggtitle("Type Distribution") +
  theme_void() + scale_fill_manual(values = c("#F8766D", "#7CAE00")) +
  guides(fill = guide_legend(title = "Type"))

Figure 7: Type Distribution

54.8% of the total values are GPUs, and the rest are CPUs.

7.2 Process Size (nm)


Let's have a look at the value distribution using a histogram plot.

ggplot(df, aes(x = df$Process.Size..nm.)) +
  geom_histogram(fill = "steelblue", color = "white", bins = 15, binwidth = 10, alpha = 0.7) +
  labs(title = "Histogram of Process Size", x = "Process Size (nm)", y = "Count") +
  theme_classic()


Figure 8: Histogram of Process size

It is noticeable that the histogram shows a global pattern: most of the process sizes are lower than roughly 60 nm.

Let’s check its relations with other features.


ggplot(df, aes(x = df$Process.Size..nm., y = df$Foundry, fill = Foundry)) +
  geom_boxplot() +
  labs(title = "Box Plot of Process Size by Foundry", x = "Process Size (nm)", y = "Foundry") +
  scale_fill_manual(values = c("#F8766D", "#7CAE00", "#F9D923", "#187498", "#573391",
                               "#FF78F0", "#F7B449", "#7A9EAF", "#A9907E", "#FFEAEA")) +
  theme_classic()

Figure 9: Box plot of Process size by Foundry

ggplot(df, aes(x = df$Process.Size..nm., y = Type, fill = Vendor)) +
  geom_boxplot() +
  facet_wrap(~ Type, ncol = 2) +
  labs(title = "Facetted Box Plot of Process Size by Type and Vendor", x = "Process Size (nm)", y = "Type") +
  scale_fill_manual(values = c("#F8766D", "#7CAE00", "#F9D923", "#187498", "#573391")) +
  theme_classic()


Figure 10: Facetted Box Plot of Process Size by Type and Vendor

Insights:

• It can be readily recognised that the process size for Intel, AMD and Nvidia lies in a comparatively lower range than for ATI and the other vendors.
• We can see that Intel and AMD are the only vendors that have both GPUs and CPUs.
• Intel has a lower band than AMD for CPUs.
• Even in the GPU section, Intel and AMD are the only vendors whose bands lie in the lower range of process size.

7.3 TDP (W)


ggplot(df, aes(x = df$Release.Date, y = df$TDP..W., color = Type)) +
  geom_point() +
  labs(title = "Scatter Plot of TDP vs Release Date", x = "Release Date", y = "TDP (W)") +
  scale_color_manual(values = c("#F8766D", "#7CAE00")) + theme_classic()

Figure 11: Scatter Plot of TDP vs Release Date


The graph above highlights the highest-TDP products released each year; it seems to indicate that GPUs have a tendency to reach higher TDPs over time.

7.4 Die Size (mm2 )


ggplot(df, aes(x = df$Release.Date, y = df$Die.Size..mm.2., color = Type)) +
  geom_point() +
  labs(title = "Scatter Plot of Die Size vs Release Date", x = "Release Date", y = "Die Size (mm2)") +
  scale_color_manual(values = c("#F8766D", "#7CAE00")) + theme_classic()

Figure 12: Scatter Plot of Die Size vs Release Date

Insight:
• A closer look at the plot reveals that the die size of GPUs increased over the observation period, whereas the figure for CPUs fluctuated between almost 0 mm² and about 430 mm².
• It can be seen that when the size of the chip increases, more transistors can be packed onto the chip, allowing for greater processing power and functionality. However, drawbacks still exist: larger dies can lead to higher costs and greater power consumption. As a result, this can become a limiting factor in the design of new microprocessors and integrated circuits, especially GPUs.

7.5 Transistors (million)


ggplot(df, aes(x = df$Release.Date, y = df$Transistors..million., color = Type)) +
  geom_point() +
  labs(title = "Scatter Plot of Transistors Count vs Release Date", x = "Release Date", y = "Transistors Count (million)") +
  scale_color_manual(values = c("#F8766D", "#7CAE00")) + theme_classic()


Figure 13: Scatter Plot of Transistors vs Release Date

It is clear that the transistor count rose dramatically throughout the 2000-2020 period. Notwithstanding, did it still follow Moore's Law? We will find out in the Linear Regression section.

7.6 Freq (MHz)


ggplot(df, aes(x = df$Release.Date, y = df$Freq..MHz., color = Vendor)) +
  geom_point() +
  labs(title = "Scatter Plot of Frequency vs Release Date by Vendor", x = "Release Date", y = "Freq (MHz)") +
  scale_color_manual(values = c("#F8766D", "#7CAE00", "#F9D923", "#187498", "#573391")) +
  theme_classic()

ggplot(df, aes(x = df$Freq..MHz., y = df$Vendor, fill = Vendor)) +
  geom_boxplot() +
  labs(title = "Box Plot of Frequency by Vendor", x = "Freq (MHz)", y = "Vendor") +
  scale_fill_manual(values = c("#F8766D", "#7CAE00", "#F9D923", "#187498", "#573391")) +
  theme_classic()

Figure 14: Scatter plot of Frequency vs Release Date and Box plot of Frequency by Vendor

Insights:


• This confirms our previous interpretation of frequency with respect to vendors.
• AMD and Intel have higher frequencies than the rest.
• While overall Intel and AMD have almost similar frequency values, in 2005 the frequency values of Intel were much higher than those of AMD.

ggplot(df, aes(x = df$Release.Date, y = df$Freq..MHz., color = Foundry)) +
  geom_point() +
  labs(title = "Scatter Plot of Frequency vs Release Date", x = "Release Date", y = "Freq (MHz)") +
  scale_color_manual(values = c("#F8766D", "#7CAE00", "#F9D923", "#187498", "#573391",
                                "#FF78F0", "#F7B449", "#7A9EAF", "#A9907E", "#FFEAEA")) +
  theme_classic()

Figure 15: Scatter plot of Frequency vs Release Date by Foundry

Insights:
• GF has emerged in the market only recently, and in the past few years it has scaled its frequency values up to close to the level of Intel and TSMC.

• Although TSMC has been present for many years, its frequency values have only recently increased.

• Intel is very consistent with its frequency values, and it produces CPUs and GPUs across the whole range of frequency values.

7.7 Foundry
df_pie_foundry <- data.frame(count(df, Foundry))
ggplot(data = df_pie_foundry, aes(x = "", y = n, fill = Foundry)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar(theta = "y", start = 0) +
  ggtitle("Foundry Distribution") +
  theme_void() +
  scale_fill_manual(values = c("#F8766D", "#7CAE00", "#F9D923", "#187498", "#573391",
                               "#FF78F0", "#F7B449", "#7A9EAF", "#A9907E", "#FFEAEA")) +
  guides(fill = guide_legend(title = "Foundry"))


ggplot(data = df_pie_foundry, aes(x = n, y = Foundry, fill = Foundry)) +
  geom_bar(stat = "identity", width = 1) +
  labs(title = "Foundry Distribution", x = "Count", y = "Foundry") +
  theme_classic() +
  scale_fill_manual(values = c("#F8766D", "#7CAE00", "#F9D923", "#187498", "#573391",
                               "#FF78F0", "#F7B449", "#7A9EAF", "#A9907E", "#FFEAEA")) +
  guides(fill = guide_legend(title = "Foundry"))

Figure 16: Foundry Distribution

• If you know a little about chip companies, you will be aware that TSMC makes the largest number of chips in the world, and it is a Taiwan-based company.

• TSMC's production is so large that the other foundries do not even come close. TSMC is followed by Intel.

7.8 CPU and GPU Frequency vs Process Size through Release year
7.8.1 CPU Frequency vs Process Size through Release year

df_cpu <- df[df$Type == 'CPU', ]
ggplot(df_cpu, aes(x = df_cpu$Freq..MHz., y = df_cpu$Process.Size..nm.,
                   color = factor(year(df_cpu$Release.Date)))) +
  geom_point() +
  labs(title = "CPU Frequency (MHz) vs Processor Size (nm) vs Release year",
       x = "Frequency (MHz)",
       y = "Process Size (nm)",
       color = "Years") +
  scale_color_discrete(name = "Year") +
  theme_minimal()


Figure 17: Scatter plot of CPU Frequency vs Process Size over the Release year

As the scatter plot above shows, and as discussed previously, CPUs are released each year with a wide range of frequencies, but with ever smaller process sizes.

7.8.2 GPU Frequency vs Process Size through Release year

df_gpu <- df[df$Type == 'GPU', ]
ggplot(df_gpu, aes(x = df_gpu$Freq..MHz., y = df_gpu$Process.Size..nm.,
                   color = factor(year(df_gpu$Release.Date)))) +
  geom_point() +
  labs(title = "GPU Frequency (MHz) vs Processor Size (nm) vs Release year",
       x = "Frequency (MHz)",
       y = "Process Size (nm)",
       color = "Years") +
  scale_color_discrete(name = "Year") +
  theme_minimal()

Figure 18: Scatter plot of GPU Frequency vs Process Size over the Release year

As can be seen above, GPUs tend to have higher frequency and smaller Process Size
as time goes by.


7.9 Conclusion
There are a lot of interesting aspects to this dataset that could be observed in the Visualization
steps, such as:
• CPU frequencies are not evolving as fast as GPU frequencies.

• CPUs and GPUs tend toward smaller process sizes over time.

• Vendors release a large number of products with a wide range of frequencies each year.

• The TDP of CPUs appears to be increasing over time.

• On average, the number of transistors tends to be higher in GPUs.


8 Building a linear regression model for assessing Moore’s


Law
8.1 What is Moore’s Law
Gordon Moore, the co-founder of Intel, made the observation that the number of transistors
that could be packed onto a microchip would roughly double every two years while the cost of
manufacturing these chips would fall, and this observation became known as Moore’s Law. Moore
didn’t utilize any actual data to predict that the historical tendency would continue, but his
assertion has now come to be known as a "law" because it has held true since 1975.
The precision of the law had a significant impact on the development of the computer industry.
It has stimulated innovation, with scientists and computer engineers seeking to create more potent
yet smaller microchips. Technology has advanced significantly as a result of this tendency, leading
to the creation of mobile gadgets like smartphones and tablets.
However, there are concerns that the trend may eventually reach its limit due to physical
constraints, such as the size of atoms and the amount of heat generated by the transistors. In
this section, we will utilize linear regression for determining if Moore’s Law is coming to its end.

8.2 Method for examining Moore's Law

In short, Moore's Law can be stated as "the number of transistors in an integrated circuit (IC) doubles about every two years". We can therefore write it as an equation:

\[ Transistor\_count \approx 2^{\frac{year - base\_year}{2}} \times base\_year\_transistor\_count \tag{1} \]

where:

• Transistor_count: the number of transistors of the CPU/GPU we want to estimate

• year: the release year of the chosen model

• base_year_transistor_count: the transistor count of the CPU/GPU chosen as the base

• base_year: the release year of the base CPU/GPU

According to the Wikipedia page on transistor count, the oldest record is the Intel 4004, produced by Intel in 1971 with 2,250 transistors. Taking this CPU as the base, our equation becomes:

\[ Transistor\_count \approx 2^{\frac{year - 1971}{2}} \times 2250 \tag{2} \]

Since we apply a logarithm to the transistor count column of the dataset (which will be discussed in the next part), we also apply the logarithm to the formula:

\[ \log(Transistor\_count) \approx \log\left(2^{\frac{year - 1971}{2}} \times 2250\right) \tag{3} \]

After a few calculation steps, we can write the equation as:

\[ \log(Transistor\_count) \approx \frac{\log(2)}{2}\, year - 293.3129 \tag{4} \]

Or, in general form:

\[ \log(Transistor\_count) \approx \beta_0 + \beta_1\, year + \varepsilon \tag{5} \]

with ε being the random error. Here log denotes the base-10 logarithm, matching the log10 transformation applied to the data.
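Equations (2) and (4) can be sanity-checked numerically. The short standalone sketch below (written in Python, independent of the report's R pipeline; the year 2021 is just an illustrative choice) evaluates both forms for the same year:

```python
import math

def predict_transistors(year, base_year=1971, base_count=2250):
    """Moore's Law prediction: one doubling every two years from the base chip."""
    return base_count * 2 ** ((year - base_year) / 2)

# Equation (2): predicted transistor count for a chip released in 2021.
count_2021 = predict_transistors(2021)  # 2250 * 2**25 transistors

# Equation (4): the same prediction on a log10 scale.
log_count_2021 = (math.log10(2) / 2) * 2021 - 293.3129

print(count_2021)                # 75497472000.0
print(round(log_count_2021, 2))  # 10.88
```

Both forms agree: 10.88 is log10 of roughly 7.5 × 10^10, which confirms that the constant 293.3129 in Equation (4) is consistent with the 1971 baseline of 2,250 transistors.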


Since this is a linear model, we can apply linear regression to assess the accuracy of Moore's Law on this dataset. Specifically, we will fit a linear regression to the log-transformed transistor count column and compare the estimated slope with our expected value of log(2)/2, or 0.1505, to determine whether Moore's Law still holds.

8.3 Applying Log Transformation to Transistor count


The figure below shows that the initial values in the transistor count column are highly skewed
to the right. To reduce this skewness, we applied a logarithmic transformation.
df$log_transistor <- log10(df$Transistors..million. * 1000000)
head(df, 5)

Figure 19: A new column of log-transformed transistor count is added

The transformation was successful, resulting in a more normalized distribution of the transistor
count data:

Figure 20: Transistor count before (left) and after (right) applying the log transformation
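The compressing effect of the logarithm can also be illustrated independently of the dataset. In the standalone Python sketch below, the transistor counts are made-up examples spanning the same rough range as the data, not actual values from it:

```python
import math

# Hypothetical transistor counts spanning a very wide, right-skewed range.
counts = [2_250, 1_000_000, 50_000_000, 2_000_000_000, 50_000_000_000]
logs = [math.log10(c) for c in counts]

spread_raw = max(counts) / min(counts)  # about a 22-million-fold range
spread_log = max(logs) - min(logs)      # only about 7.35 log10 units

print(round(spread_log, 2))  # 7.35
```

A 22-million-fold spread in raw counts collapses to about 7 units on the log10 scale, which is why the transformed histogram looks far less skewed.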

To better visualize the transistor count data after the logarithmic transformation, we can create
a scatter plot:
ggplot(df_linear, aes(x = ReleaseDate.num, y = log_transistor, color = Type)) +
  geom_point()


Figure 21: Scatter Plot of log-transformed transistors count

8.4 Assessing Moore’s Law


To assess whether Moore's Law still holds, we employ linear regression with the release year as the predictor to estimate β1 in Equation (5) and compare it to our expected value of log(2)/2 (0.1505). For this task, we make use of R's built-in function lm(), which stands for linear model.
To verify that our linear regression model is trustworthy, we will also check the model's validity and use hypothesis testing on β1 to ensure that our conclusion is reasonable.
First of all, we begin by creating new dataframes. This way, any alterations we make to the data
won’t impact the original dataframe.
df_linear <- df
head(df_linear, 5)

Figure 22: df_linear - a copy of original df

8.4.1 Building the model


In order to check for the increase rate of transistor count in the dataset, we first need to build
the linear model:
model <- lm(log_transistor ~ ReleaseDate.num, data = df_linear)
summary(model)
coef(model)

Figure 23: Summary of model

At first sight, the estimated coefficient for ReleaseDate.num is only 0.1203, which is smaller than the expected value of 0.1505. However, we cannot draw any conclusions yet: we need a hypothesis test to determine whether the estimate is significantly lower than our expected value. This will allow us to draw a more informed conclusion about the relationship between the release date and the transistor count, and whether it supports Moore's Law.
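One intuitive way to compare the fitted slope of 0.1203 with the expected 0.1505 is to convert each slope (in log10 units per year) into an implied doubling time. The standalone Python sketch below performs this conversion; the two slope values are taken from the model summary and from Moore's Law respectively:

```python
import math

def doubling_time(slope_log10_per_year):
    """Years needed for the transistor count to double at the given log10 slope."""
    return math.log10(2) / slope_log10_per_year

observed = doubling_time(0.1203)  # fitted slope from the model summary
moore = doubling_time(0.1505)     # Moore's Law slope, log10(2)/2

print(round(observed, 2))  # 2.5
print(round(moore, 2))     # 2.0
```

Read this way, the fitted model implies a doubling time of about 2.5 years rather than the 2 years that Moore's Law predicts, which is exactly the gap the hypothesis test formalizes.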

8.4.2 Checking for model’s validity


8.4.2.a Linearity
The primary assumption of a linear regression model is that the relationship between the predictor variable and the response variable is linear. In Figure 21, the scatter plot of the log of transistor count against the release date, the data points roughly follow a line from the bottom left to the upper right. This indicates that the number of transistors tends to increase over time, suggesting a linear relationship between the release date (the predictor variable) and the log of the transistor count (the response variable).
Therefore, we can conclude that the linearity assumption of the linear regression model is valid for this particular dataset.

8.4.2.b Independence
The assumption of independence in linear regression refers to the idea that the observations in
the data set are independent of each other. That is, the value of one observation should not be
influenced by the value of any other observation.
In this particular dataset, each observation is a distinct CPU or GPU with its own parameters; therefore, it is unlikely that any observation is influenced by the others.


8.4.2.c Homoscedasticity
Homoscedasticity is an assumption in linear regression that the variance of the residuals, or
differences between the predicted and actual values of the response variable, is constant across all
values of the predictor variable. This means that the distribution of errors has the same "spread"
or variability across the entire range of values of the predictor variable, which is release date in
this case.
In order to check for homoscedasticity, we will plot the residuals against the release date and
check for the presence of any pattern in the plot:
# Create a residual plot
ggplot(df_linear, aes(x = ReleaseDate.num, y = model$residuals)) +
  geom_point(alpha = 0.6) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "Residual Plot", x = "Release Date", y = "Residuals")

Figure 24: Residual plot against release date

Since the plot shows no clear pattern (such as cone-shaped or funnel-shaped), we can conclude
that the assumption of homoscedasticity is met.
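Besides eyeballing the residual plot, a crude numeric cross-check in the spirit of the Goldfeld-Quandt test is to compare the residual variance between the early and late halves of the predictor range. The standalone Python sketch below uses made-up (release year, residual) pairs, not the fitted model's actual residuals:

```python
from statistics import variance

# Hypothetical (release_year, residual) pairs for illustration only.
data = [(2000, 0.30), (2002, -0.20), (2005, 0.25), (2008, -0.35),
        (2012, 0.28), (2015, -0.22), (2018, 0.31), (2021, -0.27)]

data.sort(key=lambda pair: pair[0])  # order by release year
half = len(data) // 2
early = [r for _, r in data[:half]]  # residuals in the early half
late = [r for _, r in data[half:]]   # residuals in the late half

ratio = variance(late) / variance(early)
print(round(ratio, 2))  # close to 1 here, hinting at constant spread
```

A variance ratio far from 1 (in either direction) would hint at heteroscedasticity; values near 1, as in this toy example, are consistent with the constant-spread assumption.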

8.4.2.d Normality of the residual


Normality of residuals is an important assumption in linear regression, as it indicates that the
residuals follow a normal (or Gaussian) distribution. It is essential because it guarantees that the
statistical tests used to evaluate the model are valid. To test for this assumption, a Q-Q plot is
commonly used:


# Q-Q plot of the residuals against a theoretical normal distribution
qqnorm(model$residuals)
# Add a reference line that passes through the first and third quartiles
qqline(model$residuals)

Figure 25: Q-Q Plot

Since the majority of the residuals approximately align with the straight line, which represents
the expected values under the assumption that the residuals conform to a normal distribution, it
is reasonable to conclude that the residuals are normally distributed.
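A complementary numeric heuristic is the sample skewness, which should sit near zero for roughly normal (symmetric) residuals. The Python sketch below uses illustrative residuals, not the fitted model's; in R, one could also run shapiro.test() on the residuals, keeping in mind its 5,000-observation limit:

```python
from statistics import mean, pstdev

# Illustrative, symmetric residuals (not the model's actual residuals).
residuals = [-0.9, -0.5, -0.2, -0.1, 0.0, 0.1, 0.2, 0.5, 0.9]

m = mean(residuals)
s = pstdev(residuals)
skew = sum(((r - m) / s) ** 3 for r in residuals) / len(residuals)

print(round(skew, 3))  # 0.0 for this perfectly symmetric sample
```

Strong positive or negative skewness would contradict the straight-line pattern seen in the Q-Q plot; near-zero skewness is consistent with it.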

8.4.2.e Conclusion
In this section, we have examined four common assumptions of linear regression, namely linearity, independence, homoscedasticity, and normality of the residuals, and found that all of them are satisfied in our analysis. Based on this, we can reasonably conclude that the linear regression model is appropriate for the given dataset.

8.4.3 Hypothesis Testing


In the previous section, we determined that the linear regression model is reliable, enabling us to use its coefficients for further analysis. To evaluate Moore's Law, we now compare the estimated coefficient of the ReleaseDate.num column to our expected value of 0.1505. This comparison will enable us to determine whether Moore's Law remains valid.
At first sight, the estimated coefficient of the model is 0.1203, which is noticeably lower than our expected value. However, to ensure the accuracy of our observation, we will use statistical methods. Specifically, we carry out the following hypothesis test:

\[ H_0: \beta_1 \geq 0.1505 \qquad H_a: \beta_1 < 0.1505 \]


By applying the calculation steps mentioned in the theory section at a significance level α = 0.05, we obtain the result below:
T_alpha <- qt(p = 0.05, df = length(df_linear$ReleaseDate.num) - 2, lower.tail = FALSE)  # t_{alpha, n-2}
B1 <- coef(model)[2]                      # beta_1
B10 <- 0.1505                             # beta_{1,0}
Sb1 <- summary(model)$coefficients[2, 2]  # standard error of beta_1
t_value <- (B1 - B10) / Sb1
p_value <- pt(t_value, df = length(df_linear$ReleaseDate.num) - 2, lower.tail = TRUE)

# Print t_value, T_alpha and p-value
message("t_value: ", t_value)
message("T_alpha: ", T_alpha)
message("p-value: ", p_value)

# Check whether t_value falls in the rejection region
if (t_value < -T_alpha) {
  message("Since t_value falls in the rejection region (t_value < -T_alpha), ",
          "we reject the null hypothesis that B1 >= 0.1505")
} else {
  message("Since t_value does not fall in the rejection region, ",
          "we fail to reject the null hypothesis that B1 >= 0.1505")
}

Figure 26: Output of Hypothesis Testing

As the output suggests, the null hypothesis β1 ≥ 0.1505 is rejected, which means that β1, the growth rate of the transistor count, is unlikely to be greater than or equal to Moore's Law's expectation. Therefore, we conclude that Moore's Law no longer holds for this dataset.
It is important to note that several factors can affect the accuracy of this conclusion, such as:

• The amount of data collected may not be enough to reflect the trend of Moore's Law

• Errors during the data-collecting phase

• Over time, the increase in transistor count does not depend solely on the release date
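The arithmetic behind this test can be reproduced in a few standalone lines. In the Python sketch below, the slope estimate 0.1203 comes from the model summary, but the standard error is a made-up placeholder (the real value comes from summary(model) in R), and with thousands of observations the t critical value is approximated by the standard-normal one:

```python
from statistics import NormalDist

# One-sided test of H0: beta1 >= 0.1505 versus Ha: beta1 < 0.1505.
beta1_hat = 0.1203  # slope estimated by the model (from the summary output)
beta1_0 = 0.1505    # Moore's Law slope, log10(2)/2
se_beta1 = 0.002    # placeholder standard error, an assumption for illustration

t_value = (beta1_hat - beta1_0) / se_beta1
crit = NormalDist().inv_cdf(0.95)  # about 1.645 at alpha = 0.05

reject = t_value < -crit
print(round(t_value, 2), reject)  # a large negative t-value leads to rejection
```

With a small standard error, the gap between 0.1203 and 0.1505 translates into a t-value far below the critical threshold, matching the rejection reported above; with a much larger standard error, the same gap would not be significant.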


9 Full R code
library(tidyverse)
library(ggplot2)
library(tidyr)
library(gridExtra)
library(dplyr)
library(lubridate)

# Read the csv file and print the first 5 rows
raw_df <- read.csv("chip_dataset.csv")
head(raw_df, 5)

# An overview of the dataset's attributes
print(sprintf("Size of dataset: %d rows and %d columns", dim(raw_df)[1], dim(raw_df)[2]))
summary(raw_df)

# Get the datatype of each attribute
str(raw_df)

# Count the number of NA (not available) values
colSums(is.na(raw_df))

# Drop the id column (first column) and the last three columns (FP16.GFLOPS,
# FP32.GFLOPS, FP64.GFLOPS) since they have too many NAs
df <- raw_df[, -c(1, 12:14)]
head(df, 5)

# Convert the Release.Date column to Date
df$Release.Date <- as.Date(df$Release.Date)
head(df, 5)
colSums(is.na(df))

# Drop NA values in the remaining columns
df <- na.omit(df)
head(df, 5)
dim(df)

# Convert the release date to a numeric year
# (days since 1970-01-01 divided by 365.25, plus 1970)
df$ReleaseDate.num <- as.numeric(df$Release.Date) / 365.25 + 1970
head(df, 5)

### Visualization ###
## Type distribution ##
# Pie chart of type distribution
df_pie_type <- data.frame(count(df, Type))
ggplot(data = df_pie_type, aes(x = "", y = n, fill = Type)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar(theta = "y", start = 0) +
  ggtitle("Type Distribution") +
  theme_void() +
  scale_fill_manual(values = c("#F8766D", "#7CAE00")) +
  guides(fill = guide_legend(title = "Type"))

# Bar chart of type distribution
ggplot(data = df_pie_type, aes(x = n, y = Type, fill = Type)) +
  geom_bar(stat = "identity", width = 1) +
  ggtitle("Type Distribution") +
  theme_void() +
  scale_fill_manual(values = c("#F8766D", "#7CAE00")) +
  guides(fill = guide_legend(title = "Type"))

# Histogram of process size
ggplot(df, aes(x = df$Process.Size..nm.)) +
  geom_histogram(fill = "steelblue", color = "white", bins = 15, binwidth = 10, alpha = 0.7) +
  labs(title = "Histogram of Process Size", x = "Process Size (nm)", y = "Count") +
  theme_classic()

# Box Plot of Process Size by Foundry
ggplot(df, aes(x = df$Process.Size..nm., y = df$Foundry, fill = Foundry)) +
  geom_boxplot() +
  labs(title = "Box Plot of Process Size by Foundry", x = "Process Size (nm)", y = "Foundry") +
  scale_fill_manual(values = c("#F8766D", "#7CAE00", "#F9D923", "#187498", "#573391",
                               "#FF78F0", "#F7B449", "#7A9EAF", "#A9907E", "#FFEAEA")) +
  theme_classic()

# Facetted Box Plot of Process Size by Type and Vendor
ggplot(df, aes(x = df$Process.Size..nm., y = Type, fill = Vendor)) +
  geom_boxplot() +
  facet_wrap(~Type, ncol = 2) +
  labs(title = "Facetted Box Plot of Process Size by Type and Vendor", x = "Process Size (nm)", y = "Type") +
  scale_fill_manual(values = c("#F8766D", "#7CAE00", "#F9D923", "#187498", "#573391")) +
  theme_classic()

# Scatter Plot of TDP vs Release Date
ggplot(df, aes(x = df$Release.Date, y = df$TDP..W., color = Type)) +
  geom_point() +
  labs(title = "Scatter Plot of TDP vs Release Date", x = "Release Date", y = "TDP (W)") +
  scale_color_manual(values = c("#F8766D", "#7CAE00")) +
  theme_classic()

# Scatter Plot of Die Size vs Release Date
ggplot(df, aes(x = df$Release.Date, y = df$Die.Size..mm.2., color = Type)) +
  geom_point() +
  labs(title = "Scatter Plot of Die Size vs Release Date", x = "Release Date", y = "Die Size (mm2)") +
  scale_color_manual(values = c("#F8766D", "#7CAE00")) +
  theme_classic()

# Scatter Plot of Transistors Count vs Release Date
ggplot(df, aes(x = df$Release.Date, y = df$Transistors..million., color = Type)) +
  geom_point() +
  labs(title = "Scatter Plot of Transistors Count vs Release Date", x = "Release Date",
       y = "Transistors Count (million)") +
  scale_color_manual(values = c("#F8766D", "#7CAE00")) +
  theme_classic()

# Scatter Plot of Frequency vs Release Date by Vendor
ggplot(df, aes(x = df$Release.Date, y = df$Freq..MHz., color = Vendor)) +
  geom_point() +
  labs(title = "Scatter Plot of Frequency vs Release Date", x = "Release Date", y = "Freq (MHz)") +
  scale_color_manual(values = c("#F8766D", "#7CAE00", "#F9D923", "#187498", "#573391")) +
  theme_classic()

# Box Plot of Frequency by Vendor
ggplot(df, aes(x = df$Freq..MHz., y = df$Vendor, fill = Vendor)) +
  geom_boxplot() +
  labs(title = "Box Plot of Frequency by Vendor", x = "Freq (MHz)", y = "Vendor") +
  scale_fill_manual(values = c("#F8766D", "#7CAE00", "#F9D923", "#187498", "#573391")) +
  theme_classic()

# Scatter Plot of Frequency vs Release Date by Foundry
ggplot(df, aes(x = df$Release.Date, y = df$Freq..MHz., color = Foundry)) +
  geom_point() +
  labs(title = "Scatter Plot of Frequency vs Release Date", x = "Release Date", y = "Freq (MHz)") +
  scale_color_manual(values = c("#F8766D", "#7CAE00", "#F9D923", "#187498", "#573391",
                                "#FF78F0", "#F7B449", "#7A9EAF", "#A9907E", "#FFEAEA")) +
  theme_classic()

# Foundry Distribution
df_pie_foundry <- data.frame(count(df, Foundry))
ggplot(data = df_pie_foundry, aes(x = "", y = n, fill = Foundry)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar(theta = "y", start = 0) +
  ggtitle("Foundry Distribution") +
  theme_void() +
  scale_fill_manual(values = c("#F8766D", "#7CAE00", "#F9D923", "#187498", "#573391",
                               "#FF78F0", "#F7B449", "#7A9EAF", "#A9907E", "#FFEAEA")) +
  guides(fill = guide_legend(title = "Foundry"))

ggplot(data = df_pie_foundry, aes(x = n, y = Foundry, fill = Foundry)) +
  geom_bar(stat = "identity", width = 1) +
  labs(title = "Foundry Distribution", x = "Count", y = "Foundry") +
  theme_classic() +
  scale_fill_manual(values = c("#F8766D", "#7CAE00", "#F9D923", "#187498", "#573391",
                               "#FF78F0", "#F7B449", "#7A9EAF", "#A9907E", "#FFEAEA")) +
  guides(fill = guide_legend(title = "Foundry"))

# CPU Frequency (MHz) vs Processor Size (nm) vs Release year
df_cpu <- df[df$Type == 'CPU', ]
ggplot(df_cpu, aes(x = df_cpu$Freq..MHz., y = df_cpu$Process.Size..nm.,
                   color = factor(year(df_cpu$Release.Date)))) +
  geom_point() +
  labs(title = "CPU Frequency (MHz) vs Processor Size (nm) vs Release year",
       x = "Frequency (MHz)",
       y = "Process Size (nm)",
       color = "Year") +
  scale_color_discrete(name = "Year") +
  theme_minimal()

# GPU Frequency (MHz) vs Processor Size (nm) vs Release year
df_gpu <- df[df$Type == 'GPU', ]
ggplot(df_gpu, aes(x = df_gpu$Freq..MHz., y = df_gpu$Process.Size..nm.,
                   color = factor(year(df_gpu$Release.Date)))) +
  geom_point() +
  labs(title = "GPU Frequency (MHz) vs Processor Size (nm) vs Release year",
       x = "Frequency (MHz)",
       y = "Process Size (nm)",
       color = "Years") +
  scale_color_discrete(name = "Year") +
  theme_minimal()

### Linear Regression ###
# Add a log-transformed column of the transistor count to the original df
df$log_transistor <- log10(df$Transistors..million. * 1000000)
head(df, 5)

# Create a new dataframe
df_linear <- df
head(df_linear, 5)

# Scatter plot of log transistors vs release date
ggplot(df_linear, aes(x = ReleaseDate.num, y = log_transistor, color = Type)) +
  geom_point()

# Histogram
ggplot(df_linear, aes(x = log_transistor, y = after_stat(density))) +
  geom_histogram(fill = "steelblue", color = "white", bins = 15) +
  geom_density(lwd = 1.2, linetype = 2, colour = 2, adjust = 2) +
  labs(x = "Transistor count", y = "Frequency") +
  ggtitle("Transistor count (for both CPU and GPU)") +
  theme(plot.title = element_text(hjust = 0.5, size = 22))

# Building the model
model <- lm(log_transistor ~ ReleaseDate.num, data = df_linear)
summary(model)
coef(model)

# Plot the regression line
ggplot(df_linear, aes(x = ReleaseDate.num, y = log_transistor)) +
  geom_point() +
  stat_smooth(method = "lm", size = 2)

# Create a residual plot
ggplot(df_linear, aes(x = ReleaseDate.num, y = model$residuals)) +
  geom_point(alpha = 0.6) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "Residual Plot", x = "Release Date", y = "Residuals")

# Q-Q plot of the residuals against a theoretical normal distribution
qqnorm(model$residuals)
# Add a reference line that passes through the first and third quartiles
qqline(model$residuals)

# Hypothesis testing:
# + H0: B1 >= 0.1505
# + Ha: B1 < 0.1505
# Significance level: alpha = 0.05
T_alpha <- qt(p = 0.05, df = length(df_linear$ReleaseDate.num) - 2, lower.tail = FALSE)  # t_{alpha, n-2}
B1 <- coef(model)[2]                      # beta_1
B10 <- 0.1505                             # beta_{1,0}
Sb1 <- summary(model)$coefficients[2, 2]  # standard error of beta_1
t_value <- (B1 - B10) / Sb1
p_value <- pt(t_value, df = length(df_linear$ReleaseDate.num) - 2, lower.tail = TRUE)

# Print t_value, T_alpha and p-value
message("t_value: ", t_value)
message("T_alpha: ", T_alpha)
message("p-value: ", p_value)

# Check whether t_value falls in the rejection region
if (t_value < -T_alpha) {
  message("Since t_value falls in the rejection region (t_value < -T_alpha), ",
          "we reject the null hypothesis that B1 >= 0.1505")
} else {
  message("Since t_value does not fall in the rejection region, ",
          "we fail to reject the null hypothesis that B1 >= 0.1505")
}


10 References
1. Zach, "Introduction to Multiple Linear Regression", 27/10/2022, https://www.statology.org/multiple-linear-regression/
2. Valerie Watts, "Coefficient of Multiple Determination", https://ecampusontario.pressbooks.pub/introstats/chapter/13-4-coefficient-of-multiple-determination/
3. Maurice A. Geraghty, "Hypothesis Test for Simple Linear Regression", 27/01/2022, https://stats.libretexts.org/Courses/American_River_College/STAT_300%3A_My_Introductory_Statistics_Textbook_(Mirzaagha)/03%3A_Regression_Analysis/3.03%3A_Correlation_and_Linear_Regression/3.3.04%3A_Hypothesis_Test_for_Simple_Linear_Regression
4. "Linear Regression in R", Dec 2022, https://www.datacamp.com/tutorial/linear-regression-R
