MLPC Group Assignment
STUDENT DECLARATION
1. I confirm that I am aware of the University’s Regulation Governing Cheating in a University Test
and Assignment and of the guidance issued by the School of Computing and IT concerning
plagiarism and proper academic practice, and that the assessed work now submitted is in
accordance with this regulation and guidance.
2. I understand that, unless already agreed with the School of Computing and IT, assessed work may
not be submitted that has previously been submitted, either in whole or in part, at this or any other
institution.
3. I recognise that should evidence emerge that my work fails to comply with either of the above
declarations, then I may be liable to proceedings under the Regulation.
No Student Name Student ID Date Signature Score
Part A: Machine Learning – A Case Study
1. Describe your observation and understanding of the whole dataset by answering the following questions
a. What are the available data types in this data set?
b. What is the statistical summary of all the attributes?
c. How to handle the missing values which are presented as ‘?’?
d. What are the independent and dependent variables?
i. Independent Variables
ii. Dependent Variables
2. Find the correlation coefficient between independent and dependent variables
3. By using any algorithm of your choice, create a model which could predict the price of the laptop
4. Is this a supervised model or an unsupervised model? Why so? Explain in detail
5. Continue with the same built model in No.3, but choose different independent variables and compare the results
Part B: Parallel Computing
Article 1: Parallel computing method of deep belief networks and its application to traffic flow prediction
Study Background
Research and Development Methodologies
1. Pre-training Phase Methodology
2. Fine-tuning Phase Methodology
3. Parallel Architecture
4. Evaluation Indices
Performance Analysis
Results Analysis
Conclusion
Article 2: QuantCloud: A Software with Automated Parallel Python for Quantitative Finance Applications
Study Background
Implementation Techniques
Optimization Strategies
Testing Procedures
Performance Analysis
Results Analysis
Conclusion
References
Part A: Machine Learning – A Case Study
2. Find the correlation coefficient between independent and dependent variables
Firstly, we processed the dataset by converting the 'Weight' column to string, removing
non-numeric characters such as "kg", and finally converting it to a float type. Similarly, we
removed the non-numeric characters (such as "GB") from the 'Ram' column and converted it to a
float. We then selected the relevant numerical columns, e.g. 'Inches', 'Ram', 'Weight', and
'Price_euros', to examine the correlation. We used the pandas corr() function to obtain the
correlation matrix and the correlation coefficients between the Price_euros column and the other
numeric columns (a pandas sketch follows the list below).
i. Inches: The value of the correlation coefficient for the laptop screen size (Inches)
versus its price (Price_euros) is around 0.068, showing a very weak positive
correlation.
ii. Ram: The correlation coefficient of RAM capacity with price is approximately 0.743,
indicating a moderately strong positive correlation: as memory capacity increases, the
price of the laptop also tends to rise.
iii. Weight: The correlation coefficient between laptop weight and price is roughly 0.210,
suggesting a weak positive correlation: heavier laptops tend to be slightly more
expensive on the market.
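A minimal sketch of this preprocessing and correlation step with pandas, assuming the dataset is stored in a file named laptops.csv (the file name is an assumption):

```python
import pandas as pd

# Load the dataset (the file name is an assumption)
df = pd.read_csv("laptops.csv")

# Strip the "kg" suffix from 'Weight' and convert to float
df["Weight"] = df["Weight"].astype(str).str.replace("kg", "", regex=False).astype(float)

# Strip the "GB" suffix from 'Ram' and convert to float
df["Ram"] = df["Ram"].astype(str).str.replace("GB", "", regex=False).astype(float)

# Correlation of each selected numeric column with the price
numeric_cols = ["Inches", "Ram", "Weight", "Price_euros"]
print(df[numeric_cols].corr()["Price_euros"])
```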
3. By using any algorithm of your choice, create a model which could predict the price of the laptop
To develop a predictive model for estimating laptop prices from selected features, we
employed the Linear Regression algorithm. Here's a step-by-step explanation of the process:
i. Data Preparation:
The dataset provided the input variables (features) and the dependent variable (price).
For straightforward modelling, a multiple linear regression equation was used, with
'Inches', 'Ram', and 'Weight' selected as the independent variables and 'Price_euros'
as the dependent variable.
ii. Prediction:
We trained the model and used it to make predictions on the test data in order to
estimate the prices of laptops from their features.
iii. Model Evaluation:
We evaluated the performance of the model by calculating the Mean Squared Error
(MSE) between the actual prices and the predicted prices on the held-out test set.
The MSE indicates how closely a model's predictions match the real values, with a
smaller value revealing better performance.
iv. Results:
After evaluation, our model exhibited a Mean Squared Error (MSE) of approximately
225444.48. This indicator is simply the average squared gap between predicted and
actual prices. Though the MSE value points to the overall accuracy of the model, a
detailed analysis, together with comparison against other models, is equally important
to determine the overall effectiveness of the chosen model.
v. Conclusion:
The developed Linear Regression model showed a promising capability to predict
laptop prices based on the chosen factors (a sketch of the implementation follows this
list). Further refinement of the model, incorporation of other algorithms, and
hyperparameter optimization could boost its precision, predictive performance, and
robustness. This predictive capacity benefits consumers, manufacturers, and retailers
alike, enabling better decisions related to laptop buying, pricing, and market
competitiveness.
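A minimal sketch of these steps with scikit-learn, assuming df is the preprocessed DataFrame from the earlier sketch; the 80/20 split and the random seed are assumptions rather than the exact setup we used:

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Features and target as described in the Data Preparation step
X = df[["Inches", "Ram", "Weight"]]
y = df["Price_euros"]

# Hold out a test set for evaluation (80/20 split assumed)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Fit the multiple linear regression model on the training data
model = LinearRegression()
model.fit(X_train, y_train)

# Evaluate with Mean Squared Error on the held-out test set
y_pred = model.predict(X_test)
print("MSE:", mean_squared_error(y_test, y_pred))
```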
4. Is this a supervised model or an unsupervised model? Why so? Explain in detail.
Yes, the given model is a supervised learning model, designed to predict laptop prices
based on their specifications or features. The defining feature of supervised learning,
the most common paradigm in machine learning, is that each example in the dataset is
paired with a known output label. This facilitates the model's learning process by
providing clear guidance on the expected outputs for given inputs.
In the context of the given model, the dataset consists of labeled examples in which
every laptop is represented by a tuple of attributes, e.g. 'Inches', 'Ram', and
'Weight', together with its price label 'Price_euros'. Aligning these features with the
target variable enables the model to learn the relationship between the input features
and the target, thereby making prediction on unseen data possible.
During training, the supervised learning model learns to map input features to their
corresponding output labels while minimizing a pre-defined loss function. In this
situation, the model aims to make the discrepancy between its predictions and the
actual prices of the laptops in the training data as low as possible. Iterative
optimization, for example via gradient descent, lets the model fine-tune its
parameters so that it can predict prices with the desired accuracy.
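As an aside, scikit-learn's LinearRegression is actually solved in closed form by least squares; an explicitly iterative, gradient-descent-style variant of the same model is SGDRegressor, sketched below reusing X_train and y_train from the previous sketch (the scaling pipeline and hyperparameters are assumptions):

```python
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Gradient descent is sensitive to feature scale, so standardize first
sgd_model = make_pipeline(
    StandardScaler(),
    SGDRegressor(loss="squared_error", max_iter=1000, random_state=42))
sgd_model.fit(X_train, y_train)
print("Test R^2:", sgd_model.score(X_test, y_test))
```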
Supervised learning performs well when the aim is to learn a mapping from input
features to output labels, and predicting a laptop's price from its specifications is
exactly such a task. By leveraging the labeled data provided in the dataset, the model
can effectively learn the patterns and relationships within the data, thereby enabling
it to generalize well to unseen instances and make accurate price predictions.
5. Continue with the same built model in No.3, but choose different independent
variables and compare the results.
To compare the results of the model with different independent variables, we followed a
similar approach as in the previous steps. Here's how we proceeded:
i. Data preprocessing:
We extracted numeric values from the 'ScreenResolution' column to create two new
columns: 'ScreenResolution_Width' and 'ScreenResolution_Height'. These represent the
width and height of the screen resolution, respectively. We converted the new columns
to a numeric data type and dropped rows where the width or height could not be
extracted (a sketch follows this list).
ii. Results:
The Mean Squared Error with the new set of independent variables was
approximately 187697.57.
iii. Conclusion:
This comparison helps us assess the impact of different independent variables on the
model's predictive performance, allowing us to identify which set of variables yields
better results for predicting the price of laptops. Here, the new feature set achieved
a lower MSE (about 187697.57 versus roughly 225444.48), suggesting that the
resolution-based features fit the data better than screen size alone.
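A minimal sketch of this preprocessing and refit, assuming df is the DataFrame from the earlier sketches and that the new model combines the two resolution columns with 'Ram' and 'Weight' (the exact feature set is an assumption):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Extract width and height from strings like "IPS Panel Full HD 1920x1080"
res = df["ScreenResolution"].astype(str).str.extract(r"(\d+)x(\d+)")
df["ScreenResolution_Width"] = pd.to_numeric(res[0], errors="coerce")
df["ScreenResolution_Height"] = pd.to_numeric(res[1], errors="coerce")
df = df.dropna(subset=["ScreenResolution_Width", "ScreenResolution_Height"])

# Refit the same Linear Regression model on the new feature set
X = df[["ScreenResolution_Width", "ScreenResolution_Height", "Ram", "Weight"]]
y = df["Price_euros"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)
print("MSE:", mean_squared_error(y_test, model.predict(X_test)))
```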
Part B: Parallel Computing
Article 1: Parallel computing method of deep belief networks and its application to
traffic flow prediction
Study Background:
The research falls within the field of parallel computing applications, focusing on improving
Deep Belief Networks (DBNs) using parallel computing methods. The goal of the study is to
develop efficient machine learning techniques, especially when huge datasets must be processed
in real-time situations. The research centres on using parallel computing to speed up the two
DBN training phases, namely pre-training and fine-tuning, with the aim of reducing the
computational time necessary for model training.
Research and Development Methodologies:
2. Fine-tuning Phase Methodology:
Error Function Revision: A factor m is introduced so that the overall error function becomes
the average of the error functions computed on the m sub-datasets Xq.
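A hedged reconstruction of this revised error function, assuming E_q denotes the error computed on sub-dataset X_q and m is the number of sub-datasets:

\[
E = \frac{1}{m} \sum_{q=1}^{m} E_q(X_q)
\]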
Algorithm 5: Fine-tuning is accomplished via several steps: dataset distribution, DBN
model initialization, model parameter broadcasting, variation computation, weight and bias
updating, and iteration over epochs.
3. Parallel Architecture:
Master-Slave Computing Structure: In both the pre-training and fine-tuning phases, a
master-slave computational structure is used.
Comparison with Serial Computing: The technique using multiple nodes and multiple
sub-datasets is referred to as the combination (parallel) method, while the single-node,
single-dataset technique is called the serial method. The parallel approach is designed so
that, both in theory and in practice, its results match those of serial training. A toy
illustration of this master-slave, data-parallel scheme follows.
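The following is a toy sketch (not the paper's exact algorithm) of such a master-slave, data-parallel update in Python, assuming a simple linear model and gradients averaged over m sub-datasets:

```python
import numpy as np
from multiprocessing import Pool

def local_gradient(args):
    # Slave node: gradient of the mean squared error on one sub-dataset X_q
    w, X_q, y_q = args
    return X_q.T @ (X_q @ w - y_q) / len(y_q)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X, y = rng.normal(size=(1000, 5)), rng.normal(size=1000)

    m = 4  # number of sub-datasets, one per slave process
    subsets = list(zip(np.array_split(X, m), np.array_split(y, m)))

    w = np.zeros(5)
    with Pool(m) as pool:  # the master distributes work to the slaves
        for epoch in range(100):
            grads = pool.map(local_gradient,
                             [(w, X_q, y_q) for X_q, y_q in subsets])
            # Master averages the slaves' gradients and updates the weights
            w -= 0.1 * np.mean(grads, axis=0)
    print("learned weights:", w)
```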
4. Evaluation Indices:
Acceleration Ratio and Efficiency: The impact of parallel computing is analysed using
acceleration ratio (speedup) and efficiency metrics, comparing the runtimes of the serial
and parallel learning models in both the pre-training and fine-tuning phases.
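For reference, the standard definitions of these indices, where T_1 is the serial runtime, T_p the parallel runtime on p nodes, S_p the acceleration ratio (speedup), and E_p the efficiency:

\[
S_p = \frac{T_1}{T_p}, \qquad E_p = \frac{S_p}{p}
\]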
These strategies are designed to reduce the training time of DBNs and make them more
resource-friendly by combining two approaches: parallel computing and model reduction. The
study emphasizes the importance of optimizing the pre-training and fine-tuning phases with
parallel computing in order to speed up the training of DBNs. This is especially important
in real-time applications where quick data processing is necessary.
Performance Analysis:
The research concentrates on shortening the training time of Deep Belief Networks (DBNs) by
incorporating parallel computing strategies into the training stages. The main objective is to
enhance the training process of DBNs by sharing the workload across multiple nodes, running the
training phases in parallel so as to optimize the model parameters. This method should result in
a large reduction of the total model training time, which is highly relevant in scenarios where
large datasets must be handled in a time-efficient manner for machine learning algorithms to
perform well.
The parallel computing method is adapted to the training of the learning algorithms to increase
their efficiency and speed up the DBN training procedure. The study splits the computational
workload across several nodes with the goal of accelerating the training process while keeping
it smooth and precise. This parallel computing tactic, designed to handle the complexity of
processing big datasets, trains DBNs so that real-time predictions can be obtained quickly and
accurately.
Essentially, the research applies parallel computing to increase the training speed and
productivity of deep belief networks (DBNs). The study overcomes the problems of working with a
bulky dataset by having multiple nodes contribute processing power at the same time. The
parallelization effort covers all stages of training and distributes the workload across the
network. This approach accomplishes two goals: it improves the performance of the learning
algorithms, and it indicates which parallel computing techniques can help optimize machine
learning models for real-time applications that require high speed and accuracy.
Results Analysis:
The research evaluates the performance of the parallel computing approach by comparing the
results of DBNs trained with the parallel methods against those trained with traditional serial
methods. This leads to the conclusion that the parallel training algorithms achieve prediction
results comparable to the serial training algorithms, showing that the parallel DBN training
techniques preserve prediction quality while improving efficiency. Moreover, the acceleration
ratio and efficiency indices are used to quantify the performance improvements that parallel
computing brings to both pre-training and fine-tuning (Zhao et al., 2019).
Conclusion:
In conclusion, the research suggests that parallel computation makes it possible to speed up
the training of a Deep Belief Network while reproducing the results of serial training. The
study demonstrates how parallelizing the pre-training and fine-tuning phases leads to a
substantial enhancement in computational efficiency and training speed, underscoring the
significance of parallel computing methodologies for improving the performance of machine
learning algorithms on large-scale datasets.
Article 2: QuantCloud: A Software with Automated Parallel Python for Quantitative
Finance Applications
Study Background:
The study concerns the development of QuantCloud, a software suite designed for quantitative
finance applications that integrates a parallel Python system with a big data infrastructure
written in C++. The main task is to accelerate both execution speed and the software life
cycle, which is of high importance for trading companies operating in quantitative finance.
The parallel execution of Python code is demonstrated by testing the software on Intel Xeon E5
and Intel Xeon Phi processors with moving-window and autoregressive moving-average (ARMA)
algorithms, yielding a nearly linear speedup that is well suited to modern multicore
processors. Combining C++ for the big data infrastructure with Python for user-defined methods
enables fast strategy development and testing in the quantitative investment industry, where
speed is critical for staying competitive.
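For context, a moving-window statistic like those benchmarked in the paper can be sketched with pandas (an illustration only, not QuantCloud's code; the window length and data are assumptions):

```python
import pandas as pd

# Toy tick-price series; a real system would stream market data
prices = pd.Series([190.10, 190.40, 190.20, 190.80, 191.00, 190.70])

# 3-tick moving-window mean, a simple time series analysis model
print(prices.rolling(window=3).mean())
```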
Implementation Techniques:
Coprocess-Based Approach: The study utilizes a coprocess-based strategy for parallel execution of
Python codes, allowing for concurrent processing and efficient utilization of system resources.
Shared Memory System: Data communication between the main C++ program and embedded
Python scripts occurs through a shared memory system, facilitating seamless interaction and data
exchange.
Embedded Python Interface: An embedded Python interface is designed to enable effortless
integration with the big data infrastructure, ensuring smooth execution of Python scripts within the
system.
Optimization Strategies:
Intra-Node Parallelism: The system leverages multithreaded programming for intra-node parallelism
on shared memory, optimizing resource utilization within a single computing node.
Thread Pool Management: A thread pool is employed to manage threads efficiently, enhancing the
scalability and performance of parallel Python execution (a toy sketch follows this list).
Asynchronous Execution Mechanism: The system implements an asynchronous execution
mechanism to overlap data serialization and analytics operations, reducing latency and improving
overall system efficiency.
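As an illustration only (not QuantCloud's actual code), a Python thread pool that lets per-tick serialization and analytics overlap might look like this; all names here are hypothetical stand-ins:

```python
from concurrent.futures import ThreadPoolExecutor

def serialize(tick):
    # Stand-in for data serialization / preparation of one tick message
    symbol, raw = tick
    return {"symbol": symbol, "price": float(raw)}

def analyze(record):
    # Stand-in for an analytics model applied to one prepared record
    return record["symbol"], round(record["price"] * 1.001, 4)

def pipeline(tick):
    # One unit of work: prepare the tick, then run analytics on it
    return analyze(serialize(tick))

ticks = [("AAPL", "190.10"), ("MSFT", "410.50"), ("GOOG", "142.30")]

# The thread pool runs pipelines for different ticks concurrently, so the
# serialization of one tick can overlap with the analytics of another
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(pipeline, ticks))
print(results)
```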
Testing Procedures:
Performance Benchmarking: The study conducts performance benchmarking to evaluate the
speedup, parallel efficiency, and wallclock time for executing Python codes in parallel.
Real-World Market Data Testing: Testing procedures involve the application of the system to
real-world market data, assessing the system's performance in handling complex financial datasets.
Comparative Analysis: Extensive comparative studies are conducted between different processors,
such as Intel Xeon E5 and Xeon Phi, to analyze the system's performance under varying hardware
configurations.
By incorporating these specific implementation techniques, optimization strategies, and testing
procedures, the methodology section provides a comprehensive overview of how the QuantCloud
software suite effectively integrates Python and C++ for parallel execution in quantitative finance
applications, catering to the evolving demands of big data analysis in the financial sector.
Performance Analysis:
The performance analysis in the study delves into an in-depth comparative examination to evaluate
the effectiveness of the algorithms implemented in QuantCloud. By assessing various performance
metrics on real-world market data, the study provides valuable insights into the system's efficiency
and speed when executing time series analysis models coded in Python, especially on advanced
multicore processors like Intel Xeon Phi.
Overall Wallclock Time: This metric represents the total elapsed time from the initiation to the
completion of a process pipeline, encompassing data queries, preparation, Python script execution,
and result output. A decrease in overall wallclock time signifies improved efficiency in processing
financial data and executing analysis models.
Embedded-Python Wallclock Time: This metric focuses on the time spent specifically executing
Python code within the coprocess-based parallel strategy. A reduction in embedded-Python
wallclock time indicates enhanced speed and efficiency of Python script execution.
Latency:
Microseconds per Tick: Latency, reported in microseconds per tick, reflects the average time taken
to process a single tick message. Lower latency values indicate quicker processing and improved
responsiveness of the system to market data, enhancing real-time decision-making capabilities.
Results Analysis:
The outcomes of the study show that the QuantCloud software suite delivers large speedups and
good parallel efficiency when executing Python code in parallel. The evaluation shows that
running production time series quantitative analysis models on a hybrid system, with a parallel
Python part on top of a C++-based high-speed data infrastructure, streamlines the computation
needed for the chosen analysis. The research also demonstrates the striking difference between
Intel Xeon Phi and Xeon E5 processors in per-stock latency and workload throughput, indicating
the importance of choosing the right processor for quantitative finance applications
(Zhang et al., 2018).
Conclusion:
In conclusion, this investigation demonstrates how powerful tools such as QuantCloud are in
quantitative finance, where speed remains a prerequisite for companies to enjoy a competitive
advantage. Through its combined parallel Python front end and C++-based big data backend, the
software suite offers a fundamentally powerful environment tailor-made for strategy development
and testing in quantitative finance. The findings demonstrate the system's efficacy in achieving
significant performance improvements and speedups, especially on modern multicore processors,
indicating its potential for enhancing the performance of quantitative finance applications.
References
Zhao, L., Zhou, Y., Lu, H., & Fujita, H. (2019). Parallel computing method of deep belief
networks and its application to traffic flow prediction. Knowledge-Based Systems, 163,
972–987. https://doi.org/10.1016/j.knosys.2018.10.025
Zhang, P., Gao, Y., & Shi, X. (2018). QuantCloud: A software with automated parallel Python
for quantitative finance applications. IEEE.
https://ieeexplore.ieee.org/abstract/document/8424990