
MLPC Group Assignment

Uploaded by

S U P R E M

TAYLOR’S UWE DUAL AWARDS PROGRAMMES

JANUARY 2024 SEMESTER

MACHINE LEARNING AND PARALLEL COMPUTING


(ITS66604)

Assignment 2 – Group (30%)


DUE DATE: 16th March 2024 via myTIMeS (8pm)

STUDENT DECLARATION
1. I confirm that I am aware of the University’s Regulation Governing Cheating in a University Test
and Assignment and of the guidance issued by the School of Computing and IT concerning
plagiarism and proper academic practice, and that the assessed work now submitted is in
accordance with this regulation and guidance.
2. I understand that, unless already agreed with the School of Computing and IT, assessed work may
not be submitted that has previously been submitted, either in whole or in part, at this or any other
institution.
3. I recognise that should evidence emerge that my work fails to comply with either of the above
declarations, then I may be liable to proceedings under Regulation.
No  Student Name      Student ID  Date  Signature  Score
1   Monish Shrestha   0362091
2   Praphul Shrestha  0362191
3   Khushi Thami      0362676
Part A: Machine Learning – A Case Study
  1. Describe your observation and understanding of the whole dataset by answering the
     following questions
     a. What are the available data types in this data set?
     b. What is the statistical summary of all the attributes?
     c. How to handle the missing values which are those presented as ‘?’ ?
     d. What are the independent and dependent variables?
        i. Independent Variables
        ii. Dependent Variables
  2. Find the correlation coefficient between independent and dependent variables?
  3. By using any algorithm of your choice, create a model which could predict the price of
     the laptop
  4. Is this a Supervised model or an unsupervised model? Why so? Explain in detail
  5. Continue with the same built model in No.3, but choose different independent
     variables and compare the results
Part B: Parallel Computing
  Article 1: Parallel computing method of deep belief networks and its application to traffic
  flow prediction
     Study Background
     Research and Development Methodologies
        1. Pre-training Phase Methodology
        2. Fine-tuning Phase Methodology
        3. Parallel Architecture
        4. Evaluation Indices
     Performance Analysis
     Results Analysis
     Conclusion
Part A: Machine Learning – A Case Study

1. Describe your observation and understanding of the whole dataset by answering
the following questions.
a. What are the available data types in this data set?
The dataset contains three data types: integer, float, and object (string).

b. What is the statistical summary of all the attributes?
For each numerical attribute, the statistical summary includes the Count, Mean, Standard
Deviation, Minimum, 25th Percentile, 50th Percentile (Median), 75th Percentile, and Maximum.
c. How to handle the missing values which are those presented as ‘?’ ?
To handle the missing values represented as '?', we first replaced them with NaN
(Not a Number) using the replace() function.

We then identified the numerical and categorical columns in the dataset and imputed the
missing values accordingly: numerical columns were filled with the column mean, and
categorical columns were filled with the column mode.

d. What are the independent and dependent variables?

i. Independent Variables:
These are the variables the researcher chooses as inputs and can manipulate or
control. In this dataset, they are the laptop attributes: company, product, type
name, screen size (in inches), CPU, RAM, memory, GPU, operating system, and
weight.

ii. Dependent Variables:
This is the variable being studied and measured. In this case, it is the price of
the laptops in euros. The price depends on the values of the independent
variables.
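
The answers to questions a to d can be sketched in pandas as follows. This is a minimal illustration, not the code used in the assignment: the rows are hypothetical toy data, and only the column names 'Inches', 'Ram', 'Weight', and 'Price_euros' come from the report.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; the real assignment dataset is not reproduced here.
df = pd.DataFrame({
    "Company": ["Apple", "HP", "?", "Dell"],
    "Inches": [13.3, 15.6, 15.6, "?"],
    "Ram": ["8GB", "8GB", "16GB", "4GB"],
    "Weight": ["1.37kg", "?", "2.1kg", "2.2kg"],
    "Price_euros": [1339.69, 575.00, 1495.00, 400.00],
})

print(df.dtypes)  # (a) data type of each column

# (c) turn the '?' placeholder into NaN so pandas treats it as missing
df = df.replace("?", np.nan)
df["Inches"] = pd.to_numeric(df["Inches"])

# Impute: mean for numerical columns, mode for categorical columns
num_cols = df.select_dtypes(include="number").columns
cat_cols = df.select_dtypes(exclude="number").columns
df[num_cols] = df[num_cols].fillna(df[num_cols].mean())
for col in cat_cols:
    df[col] = df[col].fillna(df[col].mode()[0])

print(df.describe())  # (b) count, mean, std, min, quartiles, max

# (d) independent variables X and dependent variable y
X = df.drop(columns="Price_euros")
y = df["Price_euros"]
```

Note that describe() only summarises numerical columns by default, which is why 'Ram' and 'Weight' do not appear in it until their unit suffixes are stripped.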
2. Find the correlation coefficient between independent and dependent variables?
The aim of this section is to investigate the relationship between the independent variables
(features) and the dependent variable (price of the laptop). We do this by computing the
correlation coefficient, which measures the strength and direction of the linear relationship
between two variables.

First, we cleaned the dataset: we cast 'Weight' to string, removed non-numeric characters
such as "kg", and converted the result to float. Similarly, we removed the non-numeric
characters from the 'Ram' column and converted it to float.

We then selected the numerical columns 'Inches', 'Ram', 'Weight', and 'Price_euros' and
used the pandas corr() function to obtain the correlation matrix, from which we read the
correlation coefficient between Price_euros and each of the other numeric columns.

Next, we analyzed the correlation coefficients obtained:

i. Inches: The correlation coefficient between screen size (Inches) and price
(Price_euros) is around 0.068, a very weak positive correlation.
ii. Ram: The correlation coefficient between RAM capacity and price is about 0.743,
a strong positive correlation. As memory capacity increases, the price of the
laptop tends to rise as well.
iii. Weight: The correlation coefficient between weight and price is roughly 0.210,
a weak positive correlation. Heavier laptops tend to be slightly more expensive.

These correlation coefficients show how the independent variables (Inches, Ram and
Weight) relate to the dependent variable (Price_euros), and indicate that, of the three,
Ram has the strongest linear relationship with laptop price.
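
The cleaning and correlation steps described above can be sketched as follows. The rows are hypothetical, so the printed coefficients will not match the values reported in this section; the "GB" suffix on 'Ram' is an assumption about the raw column format.

```python
import pandas as pd

# Hypothetical rows; only the column names come from the report.
df = pd.DataFrame({
    "Inches": [13.3, 15.6, 15.6, 14.0, 17.3],
    "Ram": ["8GB", "4GB", "16GB", "8GB", "32GB"],
    "Weight": ["1.37kg", "2.1kg", "2.4kg", "1.6kg", "3.0kg"],
    "Price_euros": [1339.69, 400.00, 1495.00, 900.00, 2800.00],
})

# Strip the unit suffix, then cast to float
df["Weight"] = df["Weight"].astype(str).str.replace("kg", "", regex=False).astype(float)
df["Ram"] = df["Ram"].astype(str).str.replace("GB", "", regex=False).astype(float)

# Pearson correlation of every numeric column with the price
numeric_cols = ["Inches", "Ram", "Weight", "Price_euros"]
corr_with_price = df[numeric_cols].corr()["Price_euros"]
print(corr_with_price.drop("Price_euros"))
```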
3. By using any algorithm of your choice, create a model which could predict the
price of the laptop.

To develop a predictive model for estimating laptop prices based on certain features, we
employed the Linear Regression algorithm. Here's a step-by-step explanation of the process:

i. Data Preparation:
The dataset provides the input variables (features) and a dependent variable (price).
To keep the model simple, we used multiple linear regression with 'Inches', 'Ram',
and 'Weight' as the independent variables and 'Price_euros' as the dependent variable.

ii. Dataset Splitting:
The dataset was divided into a training set and a testing set using the
train_test_split function of the sklearn library. 80% of the data was reserved for
training, with the remaining 20% for testing.

iii. Model Training:
We created a Linear Regression model and trained it on the training data. During this
phase, the model learned the relationships between the independent variables and the
target variable (price).

iv. Prediction:
We used the trained model to make predictions on the testing data, estimating the
prices of laptops from their features.

v. Model Evaluation:
We measured the performance of the model by calculating the Mean Squared Error
(MSE) between the actual prices and the predicted prices on the testing set. The MSE
shows how closely the model's predictions match the real values; a smaller value
indicates better performance.

vi. Results:
After evaluation, the model produced a Mean Squared Error (MSE) of around
225444.48. The MSE is the average squared gap between the predicted and actual
prices. Although the MSE indicates how accurate the model is overall, it is most
informative when compared with the MSE of other models, so such a comparison is
needed to judge the effectiveness of the chosen model.

vii. Conclusion:
The Linear Regression model showed a promising capability to predict laptop prices
from the chosen features. Further tuning of the model, trying other algorithms, and
optimization could improve its precision, predictive performance, and robustness.
Such a predictive capability benefits consumers, manufacturers, and retailers, who can
make better-informed decisions about laptop buying, pricing, and market
competitiveness, respectively.
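
Steps i to v can be sketched with scikit-learn as below. This is an illustration under stated assumptions: the data is synthetic (generated here, not the laptop dataset), and random_state=42 is an arbitrary choice that the report does not specify, so the printed MSE is unrelated to the value in the Results step.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned dataset (assumption, for illustration only)
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "Inches": rng.uniform(11, 18, n),
    "Ram": rng.choice([4, 8, 16, 32], n).astype(float),
    "Weight": rng.uniform(1.0, 3.5, n),
})
df["Price_euros"] = 60 * df["Ram"] + 100 * df["Weight"] + rng.normal(0, 150, n)

# i. Data preparation: features and target
X = df[["Inches", "Ram", "Weight"]]
y = df["Price_euros"]

# ii. 80/20 train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# iii. Train, iv. predict, v. evaluate
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f"MSE: {mse:.2f}")
```

Because the MSE is in squared euros, taking its square root gives an error in euros that is easier to interpret.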

4. Is this a Supervised model or an unsupervised model? Why so? Explain in detail.

The model is a supervised learning model: it is designed to predict laptop prices from
their specifications. Supervised learning is the most common paradigm in machine
learning. Its defining feature is that every example in the dataset is paired with a
known output label, which gives the model clear guidance on the expected output for
a given input.

In this case, the dataset consists of labeled examples. Each laptop is represented by
a tuple of attributes such as 'Inches', 'Ram', and 'Weight', together with its price,
labeled 'Price_euros'. Pairing the features with the target variable lets the model
learn the relationship between them and then predict the price of unseen laptops.

During training, a supervised model learns to map input features to their output
labels by minimizing a predefined loss function. Here, the model minimizes the
squared error between its predictions and the actual laptop prices in the training
data. Iterative optimization such as gradient descent is one way to do this; an
ordinary least-squares model such as scikit-learn's LinearRegression solves the
problem directly. Either way, the parameters are chosen so that the predicted prices
are as close as possible to the actual ones.

Supervised learning is well suited to tasks where the aim is to learn a mapping from
input features to output labels, such as pricing a laptop from its specifications. By
using the labeled data in the dataset, the model learns the patterns and relationships
within the data, which allows it to generalize to unseen instances and predict their
prices.

In summary, the model exemplifies supervised learning because it uses labeled data
to learn to predict laptop prices from their specifications. By minimizing a
predefined loss function, it learns to map input features to output labels, which
shows how supervised learning addresses regression tasks such as price prediction.
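
To make "minimizing a predefined loss function" concrete, the sketch below fits a linear model by ordinary least squares with NumPy and checks that moving the parameters away from the solution does not lower the training MSE. The five data points and the perturbation are hypothetical values chosen for illustration.

```python
import numpy as np

# Hypothetical features (Ram in GB, Weight in kg) and labeled prices in euros
X = np.array([[4, 2.2], [8, 1.4], [8, 2.0], [16, 2.4], [32, 3.0]], dtype=float)
y = np.array([400.0, 1340.0, 575.0, 1495.0, 2800.0])

# Design matrix with a leading column of ones for the intercept
A = np.column_stack([np.ones(len(X)), X])

def mse(w):
    """Mean squared error of the linear model with parameters w."""
    return np.mean((A @ w - y) ** 2)

# Least-squares solution: the parameters that minimize the training MSE
w_best, *_ = np.linalg.lstsq(A, y, rcond=None)

# Any other parameter vector gives an equal or larger training MSE
w_other = w_best + np.array([10.0, -1.0, 5.0])
print(mse(w_best), mse(w_other))
```

The labels y are what make this supervised: without them there would be no error to minimize.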

5. Continue with the same built model in No.3, but choose different independent
variables and compare the results.

To compare the results of the model with different independent variables, we followed a
similar approach as in the previous steps. Here's how we proceeded:

i. Data preprocessing:
We extracted numeric values from the 'ScreenResolution' column to create two new
columns: 'ScreenResolution_Width' and 'ScreenResolution_Height'. These columns
represent the width and height of the screen resolution, respectively. We converted
them to a numeric data type and dropped the rows where the width or height could
not be extracted.

ii. Model Training with New Variables:
We defined a new set of independent variables: 'Ram', 'Weight',
'ScreenResolution_Width', and 'ScreenResolution_Height'. We then split the data
into training and testing sets using these variables and trained a new Linear
Regression model on the training set.

iii. Model Evaluation:
We made predictions on the testing set with the new model and evaluated its
performance by calculating the Mean Squared Error (MSE) between the predicted and
actual values of the dependent variable.

iv. Results:
The Mean Squared Error with the new set of independent variables was
approximately 187697.57.

v. Conclusion:
The MSE of approximately 187697.57 is lower than the MSE obtained in No.3 with
'Inches', 'Ram', and 'Weight', so replacing 'Inches' with the screen resolution width
and height gives better predictions of laptop price. This comparison shows how the
choice of independent variables affects the model's predictive performance.
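
The preprocessing in step i can be sketched with a regular expression. The example strings are hypothetical, and the pattern assumes the resolution appears as "<width>x<height>" somewhere in the text.

```python
import pandas as pd

# Hypothetical 'ScreenResolution' values
df = pd.DataFrame({
    "ScreenResolution": [
        "IPS Panel Retina Display 2560x1600",
        "Full HD 1920x1080",
        "1366x768",
        "Touchscreen",  # no resolution present: this row will be dropped
    ],
})

# Capture the two numbers around the 'x'; unmatched rows become NaN
res = df["ScreenResolution"].str.extract(r"(\d+)x(\d+)")
df["ScreenResolution_Width"] = pd.to_numeric(res[0])
df["ScreenResolution_Height"] = pd.to_numeric(res[1])

# Drop rows where width or height could not be extracted
df = df.dropna(subset=["ScreenResolution_Width", "ScreenResolution_Height"])
print(df[["ScreenResolution_Width", "ScreenResolution_Height"]])
```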
Part B: Parallel Computing

Article 1: Parallel computing method of deep belief networks and its application to
traffic flow prediction

Study Background:
The article belongs to the field of parallel computing applications and focuses on speeding up
Deep Belief Networks (DBNs) with parallel computing methods. Its goal is to make machine learning
techniques efficient enough for large datasets processed in real-time settings. The research applies
parallel computing to the two stages of DBN training, pre-training and fine-tuning, in order to reduce
the computation time needed to train the model.

Research and Development Methodologies:

The article presents specific techniques that let DBNs take full advantage of parallel computing.
These techniques target DBN training within the machine learning framework, in particular
accelerating the pre-training and fine-tuning stages to reduce the time needed for model development.

1. Pre-training Phase Methodology:

Data Partitioning: The dataset X is divided into Q smaller subsets, Xq (q = 1, 2, …, Q),
which are distributed across multiple computing nodes.

Master-Slave Structure: A master computing node coordinates the weight and bias updates
computed by the slave nodes. The master aggregates (sums) the gradients, updates the
parameters, and sends them to the worker (slave) nodes, which compute their local changes
and return them to the master.

Algorithm 4: Parallel pre-training consists of partitioning the dataset, initializing the
DBN structure, transmitting the parameters, computing the variations, updating the weights
and biases, and repeating these steps for each training epoch.

2. Fine-tuning Phase Methodology:

Dataset Division: As in pre-training, the original dataset X is partitioned into Q parts for
fine-tuning.

Error Function Revision: A factor m is added to the error function so that it is averaged
over the sub-datasets Xq.

Algorithm 5: Parallel fine-tuning proceeds in several steps: dataset distribution, DBN
model initialization, model parameter broadcasting, variation computation, weight and bias
updating, and repetition for each epoch.
3. Parallel Architecture:
Master-Slave Computing Structure: The master-slave computing structure is used in both
the pre-training and fine-tuning phases.

Comparison with Serial Computing: The technique that uses multiple nodes and multiple
sub-datasets is the parallel method, while the technique that uses a single node and a single
dataset is the serial method. The parallel approach is designed to produce, both in theory
and in practice, the same result as the serial method.

4. Evaluation Indices:
Acceleration Ratio and Efficiency: The impact of parallel computing is analyzed with the
acceleration ratio and efficiency metrics, which are computed from the runtimes and compare
the serial and parallel learning models in the pre-training and fine-tuning phases.

These strategies are designed to shorten DBN training and make it more resource-friendly
by combining parallel computing with a model reduction approach. The study stresses the
importance of optimizing pre-training and fine-tuning with parallel computing in order to
speed up DBN training. This matters most in real-time applications, where quick data
processing is essential.
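
The master-slave idea, and the claim that the parallel result equals the serial one, can be illustrated with a small data-parallel gradient computation. This is a sketch under strong assumptions: a linear least-squares model stands in for the DBN, the "workers" run one after another in a single process, and none of the names come from the article.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 3))          # full dataset X
y = X @ np.array([1.0, -2.0, 0.5])     # targets
w = np.zeros(3)                        # parameters broadcast by the master

def gradient(Xq, yq, w):
    """Worker step: summed squared-error gradient on one partition Xq."""
    return Xq.T @ (Xq @ w - yq)

# Partition the data into Q sub-datasets, one per worker
Q = 4
parts = zip(np.array_split(X, Q), np.array_split(y, Q))

# Master step: sum the worker gradients and average over all samples
g_parallel = sum(gradient(Xq, yq, w) for Xq, yq in parts) / len(X)

# Serial reference: one node, whole dataset
g_serial = gradient(X, y, w) / len(X)
print(np.allclose(g_parallel, g_serial))

# Master updates the parameters, then would broadcast them for the next epoch
w = w - 0.1 * g_parallel
```

Because the total gradient is a sum over samples, summing partial gradients from the partitions gives the same update as the serial computation; the gain from parallelism is only in wall-clock time.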

Performance Analysis:
The research concentrates on shortening the training time of Deep Belief Networks (DBNs) by
bringing parallel computing into the training stages. Its main objective is to speed up DBN training
by sharing the workload across multiple nodes and running the pre-training and fine-tuning phases in
parallel. This should greatly reduce the total model training time, which matters in scenarios where
large datasets must be handled reliably and quickly for machine learning algorithms to perform well.
The parallel computing method is adapted to the DBN learning algorithms to make training faster
and more efficient. The study splits the computational workload across several nodes to accelerate
training while keeping the result accurate. Because the approach targets the difficulty of processing
large datasets, the trained DBNs can be used for real-time predictions that are both fast and accurate.
In short, the research applies parallel computing to raise the training speed of DBNs. It addresses the
problems of working with bulky datasets by having several nodes contribute processing power at the
same time. Parallelization takes place in every stage of training and distributes the workload across
the network. The approach accomplishes two goals: it improves the performance of the learning
algorithms, and it indicates which parallel computing techniques are suitable for optimizing machine
learning models in real-time applications that require high speed and accuracy.
Results Analysis:
The research evaluates the parallel computing approach by comparing DBNs trained with the parallel
methods against DBNs trained with traditional serial methods. The comparison shows that the
parallel training algorithms achieve prediction results comparable to those of the serial training
algorithms. This indicates that the parallel DBN training techniques match the serial results while
improving training efficiency. In addition, the acceleration ratio and efficiency indices are used to
quantify the performance gains that parallel computing brings to pre-training and fine-tuning
(Zhao et al., 2019).

Conclusion:
In conclusion, the research suggests that parallel computation makes it possible both to speed up the
training of a Deep Belief Network and to reproduce the results of serial training. The study
demonstrates that parallel pre-training and fine-tuning can substantially improve computational
efficiency and training speed, underscoring the value of parallel computing methodologies for
machine learning algorithms that deal with large-scale datasets.
Article 2: QuantCloud: A Software with Automated Parallel Python for Quantitative
Finance Applications

Study Background:
The study presents QuantCloud, a software suite for quantitative finance applications that integrates
a parallel Python system with a big data system written in C++. Its main aim is to accelerate both
execution and the software development life cycle, which is highly important for trading companies
operating in quantitative finance. Parallel execution of Python code is demonstrated on Intel Xeon E5
and Intel Xeon Phi processors using moving-window and autoregressive moving-average (ARMA)
algorithms. The tests show nearly linear speedup, which suits modern multicore processors.
Combining C++ for the big data infrastructure with Python for user code enables fast strategy
development and testing in the quantitative investment industry, where speed is critical to staying
competitive.

Research and Development Methodologies:

Specific Implementation Techniques:

Coprocess-Based Approach: The study utilizes a coprocess-based strategy for parallel execution of
Python codes, allowing for concurrent processing and efficient utilization of system resources.
Shared Memory System: Data communication between the main C++ program and embedded
Python scripts occurs through a shared memory system, facilitating seamless interaction and data
exchange.
Embedded Python Interface: An embedded Python interface is designed to enable effortless
integration with the big data infrastructure, ensuring smooth execution of Python scripts within the
system.

Optimization Strategies:
Intra-Node Parallelism: The system leverages multithreaded programming for intra-node parallelism
on shared memory, optimizing resource utilization within a single computing node.
Thread Pool Management: A thread pool is employed to manage threads efficiently, enhancing the
scalability and performance of parallel Python execution.
Asynchronous Execution Mechanism: The system implements an asynchronous execution
mechanism to overlap data serialization and analytics operations, reducing latency and improving
overall system efficiency.

Testing Procedures:
Performance Benchmarking: The study conducts performance benchmarking to evaluate the
speedup, parallel efficiency, and wallclock time for executing Python codes in parallel.
Real-World Market Data Testing: Testing procedures involve the application of the system to
real-world market data, assessing the system's performance in handling complex financial datasets.
Comparative Analysis: Extensive comparative studies are conducted between different processors,
such as Intel Xeon E5 and Xeon Phi, to analyze the system's performance under varying hardware
configurations.
By incorporating these specific implementation techniques, optimization strategies, and testing
procedures, the methodology section provides a comprehensive overview of how the QuantCloud
software suite effectively integrates Python and C++ for parallel execution in quantitative finance
applications, catering to the evolving demands of big data analysis in the financial sector.
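
The task-dispatch pattern behind the thread pool and asynchronous execution can be sketched with Python's standard library. This is not QuantCloud's interface: the function, symbols, and prices are hypothetical, and the paper runs Python in separate coprocesses, whereas threads in standard Python do not execute CPU-bound code in parallel because of the global interpreter lock.

```python
from concurrent.futures import ThreadPoolExecutor

def moving_average(prices, window=3):
    """Simple moving-window mean over a list of prices."""
    return [
        sum(prices[i - window + 1:i + 1]) / window
        for i in range(window - 1, len(prices))
    ]

# Hypothetical per-symbol tick prices
ticks = {
    "AAA": [10.0, 11.0, 12.0, 13.0],
    "BBB": [5.0, 5.0, 8.0, 8.0],
}

# Submit one task per symbol; submit() returns immediately, so the caller
# can keep preparing data while earlier tasks are still running.
with ThreadPoolExecutor(max_workers=2) as pool:
    futures = {sym: pool.submit(moving_average, p) for sym, p in ticks.items()}
    results = {sym: f.result() for sym, f in futures.items()}

print(results)
```

Each symbol's series is independent, which is what makes a moving-window analysis easy to spread over several workers.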

Performance Analysis:
The performance analysis in the study delves into an in-depth comparative examination to evaluate
the effectiveness of the algorithms implemented in QuantCloud. By assessing various performance
metrics on real-world market data, the study provides valuable insights into the system's efficiency
and speed when executing time series analysis models coded in Python, especially on advanced
multicore processors like Intel Xeon Phi.

Performance Metrics Interpretation:

Wall Clock Time:

Overall Wallclock Time: This metric represents the total elapsed time from the initiation to the
completion of a process pipeline, encompassing data queries, preparation, Python script execution,
and result output. A decrease in overall wallclock time signifies improved efficiency in processing
financial data and executing analysis models.
Embedded-Python Wall Clock Time: This metric focuses on the time spent specifically in executing
Python codes within the coprocess-based parallel strategy. A reduction in embedded-Python wallclock
time indicates enhanced speed and efficiency in Python script execution.

Latency:
Microseconds per Tick: Latency, reported in microseconds per tick, reflects the average time taken
to process a single tick message. Lower latency values indicate quicker processing and improved
responsiveness of the system to market data, enhancing real-time decision-making capabilities.

Speedup and Parallel Efficiency:


Speedup Ratio: The speedup for Python codes is defined as the ratio of elapsed times with varying
numbers of coprocesses. A higher speedup ratio signifies a more efficient parallel execution of Python
scripts, leading to faster processing and analysis of financial data.
Parallel Efficiency: Parallel efficiency provides an estimate of how well the system utilizes
parallelism to enhance Python script execution. Higher parallel efficiency values indicate optimal
resource utilization and improved performance on multicore processor architectures.
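
The metrics above reduce to simple ratios, sketched below. The efficiency formula (speedup divided by the number of coprocesses) is the common textbook definition and is assumed here rather than quoted from the paper; the timings and tick count are hypothetical.

```python
def speedup(t_base, t_parallel):
    """Ratio of the baseline elapsed time to the parallel elapsed time."""
    return t_base / t_parallel

def parallel_efficiency(t_base, t_parallel, n_workers):
    """Speedup per worker; 1.0 corresponds to ideal linear scaling."""
    return speedup(t_base, t_parallel) / n_workers

def latency_us_per_tick(wallclock_seconds, n_ticks):
    """Average processing time per tick message, in microseconds."""
    return wallclock_seconds * 1e6 / n_ticks

# Hypothetical timings, not taken from the paper
t1, t8 = 120.0, 16.0  # seconds with 1 and with 8 coprocesses
print(speedup(t1, t8))                        # 7.5
print(parallel_efficiency(t1, t8, 8))         # 0.9375
print(latency_us_per_tick(16.0, 4_000_000))   # 4.0
```

A "nearly linear" speedup corresponds to an efficiency close to 1.0 as the number of coprocesses grows.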

Results Interpretation:

Scalability and Performance Improvement:


Consistent Performance Improvement: The study demonstrates consistent performance
improvement with an increase in coprocesses, reaffirming the scalability and efficiency of the system
in executing Python codes in parallel.
Linear Speedup: The nearly linear speedup observed in the performance tests indicates the system's
ability to scale effectively with additional coprocesses, resulting in improved processing speed and
performance.
Comparative Analysis:
Intel Xeon E5 vs. Xeon Phi: The comparison between Intel Xeon E5 and Xeon Phi processors
reveals significant performance differences, with the Xeon Phi processor consistently outperforming
the Xeon E5 in terms of overall wallclock time, latency, and speedup for Python codes.
Optimal Processor Selection: The study highlights the superiority of the Xeon Phi processor in
handling complex financial analysis models, showcasing its ability to reduce overall wallclock time
and improve parallel efficiency compared to the Xeon E5 processor.

Results Analysis:
The outcomes of the study show that the QuantCloud software suite can deliver large speedups and
good parallel efficiency when executing Python code in parallel. The evaluation shows that running
production time-series quantitative analysis models on a hybrid system, with a parallel Python
component and a high-speed C++ data infrastructure, streamlines the computation needed for the
chosen analysis. The research also demonstrates a marked advantage of Intel Xeon Phi processors
over Xeon E5 processors in per-stock latency and workload throughput, indicating the importance of
choosing the right processor for quantitative finance applications (Zhang et al., 2018).

Conclusion:
In conclusion, this investigation demonstrates how powerful tools such as QuantCloud are in
quantitative finance, where speed is a prerequisite for companies to hold a competitive advantage.
By combining Python with a C++ based big data backend, the software suite offers an environment
tailor-made for strategy development and testing in quantitative finance. The findings demonstrate
that the system achieves significant performance improvements and speedups, especially on modern
multicore processors, indicating its potential for enhancing the performance of quantitative finance
applications.
References

Zhao, L., Zhou, Y., Lu, H., & Fujita, H. (2019). Parallel computing method of deep belief
networks and its application to traffic flow prediction. Knowledge-Based Systems, 163,
972–987. https://doi.org/10.1016/j.knosys.2018.10.025

Zhang, P., Gao, Y., & Shi, X. (2018). QuantCloud: a software with automated parallel python
for quantitative finance applications. Ieeexplore.ieee.org; IEEE.
https://ieeexplore.ieee.org/abstract/document/8424990
