RAST: Evaluating Performance of a Legacy System Using Regression Analysis and Simulation

Abstract—A challenging aspect in developing and deploying distributed systems with strict real-time constraints is how to evaluate the performance of the system running in a production environment without disrupting its regular operation. The challenge is even greater when the System Under Evaluation (SUE) is a poorly documented legacy system with a database-centric architecture that works within a resource-sharing environment. Current performance evaluation methods dealing with this challenge require live monitoring software or distributed tracing tools that are typically unavailable in legacy systems and hard to establish. In this paper, we propose an alternative approach, RAST (Regression Analysis, Simulation, and load Testing); it evaluates the response time as the major performance characteristic of a distributed real-time legacy system using the available system's log files. Our use case is a commercial alarm system in productive use that is provided and further developed by the GS company group in Germany. We show in extensive experiments that our approach allows us to adequately estimate to what degree the workload of a legacy production system can rise in the future while complying with the strict requirements on the response time. We provide a GitHub repository with non-proprietary parts of our predictive model generation, simulation, and load testing software to reproduce our experiments.

Index Terms—performance evaluation, real-time requirements, regression analysis, simulation, alarm systems, legacy system

I. MOTIVATION AND RELATED WORK

We address the problem of evaluating the performance of a complex distributed system in productive use to find out if the System Under Evaluation (SUE) complies with real-time requirements. Our use case is a real-world industrial alarm system designed and provided by the GS company group in Germany [1], with strict requirements regarding the system's real-time behavior. The performance evaluation problem is complicated by the fact that we deal with a large, legacy system [2] that is in continuous productive use and, at the same time, its technical documentation is incomplete, there are remaining bugs, and the system remains under development, often not by the original designers of the system. Performance evaluation is even more challenging when the SUE is part of a larger system; the SUE executes in a resource-sharing environment and has a database-centric architecture.

When dealing with legacy systems, a complex and challenging aspect is evaluating their performance without disrupting the regular operation of the production environment. In particular, such disruptions could corrupt databases or cause too much load on the system, leading to the violation of real-time requirements. Another challenge is that the performance of the SUE and, thus, the evaluation results are adversely affected by other programs that run simultaneously with the SUE and use the same hardware resources and database [3].

There are two traditional kinds of approaches to performance evaluation of software systems in general: performance modeling and performance measurement – the latter is often called performance or load testing; we will use load testing [4], [5]. In our latest application case study on the legacy alarm system of the GS company group [6], we demonstrated that traditional performance modeling, i.e., creating analytical models based on abstract characteristics, is hardly applicable because it requires an in-depth analysis and support from application experts [3]. Also, when the legacy system is under development, it is probable that the performance model will not represent the system well enough, especially as time passes. Therefore, we used load testing, which means generating artificial workloads and measuring the system's performance characteristics, like the response time¹. However, it disrupts the regular operation of the production environment by consuming computing resources and changing the contents of the database. While load testing can be done in an isolated test environment, such environments often have less computing power than production environments, making it difficult to produce results that precisely represent the performance of the production environment [3], [6].

¹ Response time is the time it takes between sending a request and receiving its response. It is the sum of the server's processing time and network latency.

Because both load testing and performance modeling, according to our previous work, do not produce plausible and reproducible performance prediction results, we propose in this paper an alternative approach that combines predictive modeling, simulation, and load testing.

There are several approaches similar to ours. Okanović and Vidaković [7] use linear regression to predict the performance of software. They collect the data about the system behavior using live monitoring software. A recent approach by Grohmann et al. [8] uses six different machine learning algorithms for performance prediction. Like Okanović and Vidaković, they use live monitoring software for data collection. Courageux-Sudan et al. [9] introduce a model for distributed software to simulate the processing time of a request. They determine the model's parameters, like queuing time and CPU usage, using distributed tracing software.
They simulate the model using the SimGrid simulation framework and evaluate its performance with the LIMBO load testing tool. Aichernig et al. [10] simulate users to generate a synthetic workload for their SUE. To preserve the regular operation of the production system, they use a test environment in which the SUE is being executed.

While these and other previous approaches require live monitoring software or distributed tracing software, our target class of systems does not provide such tools, making it difficult to gather low-level metrics like CPU usage, network latency, request sizes, etc. Also, it is a common challenge that test environments, even when available, do not represent the production environment well enough regarding computing capacity, which makes running the SUE in a test environment ineffective to determine its performance.

Therefore, we propose RAST – our approach in which we combine Regression Analysis, Simulation, and load Testing.

II. THE RAST APPROACH: OVERVIEW

Fig. 1 shows a high-level overview of RAST – our approach to measuring the response time of a production system by combining Regression Analysis, Simulation, and load Testing.

[Figure 1: high-level overview of RAST – Production System, Log Files, Requirements Checker, Pipeline A, Pipeline B, Predictive Model, Request Type Mapping, Request Types List, Workload Pattern.]
If they do, the load test is repeated with a greater workload. Otherwise, the load test is finished, which means that RAST has found the saturation point of the production system.

A. RAST: Pipeline A

Fig. 2 shows the Log-Transformer and Predictive-Model-Creator components inside of Pipeline A of RAST and the data flow between the components.

[Figure 2: the Log Files are read by the Log-Transformer (LT), which produces training data rows such as (28.05.2021 18:46:31.010, Alarm, ..., 16 ms), (28.05.2021 18:46:31.100, Alarm, ..., 15 ms), (28.05.2021 18:46:32.100, Heartbeat, ..., 18 ms); the training data is stored in a database and used by the Predictive-Model-Creator (PMC).]
Figure 2. RAST: Components of Pipeline A

The training data is passed to the Predictive-Model-Creator (PMC) component that uses regression analysis to create a predictive model (Fig. 2). To clean the training data of outliers, the PMC component computes the mean and the standard deviation of the processing times and then removes all times that are more than three times the standard deviation away from the mean [13].

After cleaning the data of outliers, our PMC component evaluates common regression algorithms, in the following called estimators. An estimator takes the training data as input and learns the optimal parameters for a particular regression algorithm. The PMC component evaluates different estimators using the popular approach of cross-validation [14] that returns a score for each estimator. The best estimator is the one with the best score. Exporting the best estimator as a file makes the predictive model usable for the Simulator.

Many regression algorithms require numerical values as input. Therefore, when it loads the data, the PMC component transforms every textual request type in the training data, as shown in Fig. 2. At the end of Pipeline A, the Simulator obtains the predictive model and the request type mapping.

B. RAST: Pipeline B

[Figure: components of Pipeline B – Log Files, Log-Transformer (LT), Simulator, Load Tester, Response.]
Alarm Receiving Center (ARC) – a computer center which acts as the endpoint for incoming alarm messages.

[Figure 4: the customer's home contains an Alarm Device (AD); alarm messages travel to the Alarm Receiving Centre (ARC), where the Alarm Receiving Software (ARS) runs, and on to the Emergency and Service control Centre (ESC) with the Risk Management Software (RMS) and the ESC-Employees; the customer and the police are also shown.]
Figure 4. Use Case: A typical alarm system

The ARS is responsible for receiving and processing alarm messages and forwarding them to the Risk Management Software (RMS) running within the Emergency and Service control Center (ESC). Once the ARS has processed the alarm message, an acknowledgment is sent to the AD, confirming that no retransmission of the alarm message is required. The ESC-Employees then coordinate appropriate actions according to an intervention plan agreed in advance with the customer.

Our particular use case is production-quality software for alarm systems designed and implemented by the GS company group [1], which is a leading alarm system provider in Germany. The company owns three ARCs and three ESCs located in the north-western part of Germany that handle, around the clock, about 13,000 alarms per day from about 82,000 alarm devices located all over Germany.

The current software system provided by the company is a large, complex, and poorly documented distributed legacy system consisting of around 80 executable software components (programs) and libraries written in BASIC, C++, C#, and Java, summing up to over 1.5 million lines of code. The majority of the original developers are no longer in the company; moreover, the software system lacks automated testing mechanisms like unit tests. Therefore, we deal with a typical legacy system, i.e., "a large software system that we don't know how to cope with but that is vital to our organization" [2].

In summary, performing load tests to find the saturation point of the ARS is challenging for the following reasons:

1) Resource-sharing environment: the ARS is one of twelve server programs that form our target system. In addition, programs like backup and anti-virus software are also running in the system. All programs executed within virtual machines (VMs) share computing resources like network bandwidth, CPU, storage, etc.;
2) Database-centric architecture: one database instance is used by all server programs of the system, thus making it a common bottleneck;
3) Software bugs, continuous development, and legacy IT infrastructure of the test environment.

Around 3% of the requests the system processes during a day are received and processed by the ARS. The other 97% are requests destined for other server programs. This high amount of background workload causes unreliable performance measurements when load testing the ARS due to the database-centric architecture and the resource-sharing environment. For these reasons, we propose an alternative approach in this paper.

An alarm system's timing behavior is regulated by the EN 50136 standard [15]. The most relevant timing metric is the response time between the AD and the ARS; it indicates how quickly the system responds to a request. Commonly, response time is understood as the time between a request and any response, according to the ISO/IEC 2382 standard [16]. The EN 50136 standard, however, defines the response time as the time between sending an alarm message (request) and receiving a positive acknowledgment (response). Timeouts or negative responses are not treated as acknowledgments. Therefore, the ARS has to comply with the specified real-time requirements even under circumstances like high workloads and failures.

The currently valid EN 50136 standard, which we rely upon, specifies the real-time requirements for the response time as follows: the arithmetic mean of all response times measured in any time interval must not exceed 10 seconds, and the maximum response time must not exceed 30 seconds.

IV. CASE STUDY: A LEGACY ALARM SYSTEM

In this section, we present our proof-of-concept implementation of the RAST approach for the case study of the legacy alarm system described in Section III. We implement all components of RAST in the Python [17] language because of the availability of widely-used libraries for regression analysis (scikit-learn [18]), data management (pandas [19]), and load testing (Locust [20]). We implement both LT components of Pipelines A and B as Extract-Transform-Load (ETL) pipelines to facilitate reuse between Pipeline A and Pipeline B. We use an ETL pipeline [21] which, in our case, performs the following steps: it Extracts data related to the workload of the SUE from different log files, Transforms it to a common format, and Loads the data into a database that serves as the single source of truth for future data processing.

The Predictive-Model-Creator (PMC) component described in Section II-A uses the training data to create a predictive model. The component uses the popular Python libraries pandas and scikit-learn. The PMC component extracts the data from the database and stores it in a pandas dataframe. Having the data in a pandas dataframe is a requirement of scikit-learn. In the process of extraction, the component transforms the request types stored as strings to integers because the estimators expect numerical inputs. This transformation is done by assigning a unique number to each distinct request type. Each assignment is stored in a hash map, thus creating the request type mapping.
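As an illustration of this transformation (a minimal sketch, not the authors' code; the column name and the request types other than Alarm and Heartbeat are assumptions), the mapping can be built with pandas and an ordinary Python dict:

import pandas as pd

# toy training data in the textual format of Fig. 2
df = pd.DataFrame({"request_type": ["Alarm", "Alarm", "Heartbeat", "Status"]})

# hash map (dict) that assigns a unique integer to each distinct request type
request_type_mapping = {t: i for i, t in enumerate(df["request_type"].unique())}
df["request_type"] = df["request_type"].map(request_type_mapping)

print(request_type_mapping)            # {'Alarm': 0, 'Heartbeat': 1, 'Status': 2}
print(df["request_type"].tolist())     # [0, 0, 1, 2]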
After that, the PMC component applies outlier detection and removal. Then, it splits the training data into random train and test subsets – terms used by the scikit-learn developers – using the train_test_split function of scikit-learn so that 75% of the data is in the train subset and the remainder is in the test subset. This distribution is the default in scikit-learn.

The PMC component takes the train subset and uses cross-validation to compare the performance of the estimators. All estimators use default settings. After cross-validation, the PMC component fits² the estimator with the best score using the train subset. Then, it performs a prediction using the test subset as input and evaluates the predictions using the mean absolute error, mean squared error, and R² score [14]. Finally, the PMC component exports the estimator to create the predictive model and the request type mapping as files.

² Fitting is the process of learning the best parameters for a particular estimator based on the given train subset.

The Simulator component of RAST provides a Web API capable of processing all types of requests of the GS legacy system. For each incoming request, a new thread is created to serve the request. For this, we extend Python's built-in ThreadingHTTPServer and increase the size of the TCP listen backlog from the default value of 5 to 20,000. In this regard, the Simulator reflects the behavior of the real GS legacy system. Serving a request means transforming the request type, represented as text, to the respective numerical value by using the request type mapping, then simulating the processing time using the predictive model, and then sending a response.

Fig. 5 illustrates how our Simulator simulates the processing time: every time a request is received, the Simulator performs an initial prediction (1) and waits for the predicted amount of time (2). Then, the Simulator performs another prediction (3) because other requests might have arrived in parallel, changing the input for the model. If the next predicted amount of time is greater than the previous prediction (4), the Simulator takes the difference in time (5) and waits for it (2). Otherwise, it finishes the execution of this request and sends a response (6).

[Figure 5 (flowchart): Receive Request, Predict Processing Time, Wait, Predict New Processing Time, check "Processing time greater than last time?"; YES: wait for (New Processing Time - Last Processing Time) and predict again; NO: Send Response.]
Figure 5. Flowchart: How the request processing is simulated.
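A minimal sketch of such a simulator is shown below. It is an illustration, not the authors' implementation: the model and mapping file names, the request header carrying the request type, the assumption that the model also receives the number of requests currently in flight, and the conversion of predicted milliseconds to seconds are all assumptions.

import threading
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

import joblib  # assumed export format for the predictive model

model = joblib.load("predictive_model.joblib")                 # hypothetical file names
request_type_mapping = joblib.load("request_type_mapping.joblib")

in_flight = 0                        # assumption: the model input also contains the
in_flight_lock = threading.Lock()    # number of requests processed in parallel

def predict_seconds(request_type_code):
    with in_flight_lock:
        parallel = in_flight
    # hypothetical two-feature input; the real feature set is produced by Pipeline A
    return float(model.predict([[request_type_code, parallel]])[0]) / 1000.0  # ms -> s

class SimulatorServer(ThreadingHTTPServer):
    request_queue_size = 20_000      # enlarged TCP listen backlog (default is 5)

class SimulatorHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        global in_flight
        code = request_type_mapping[self.headers.get("X-Request-Type", "Alarm")]
        with in_flight_lock:
            in_flight += 1
        try:
            last = predict_seconds(code)      # (1) initial prediction
            time.sleep(last)                  # (2) wait for the predicted time
            while True:
                new = predict_seconds(code)   # (3) predict again under the current load
                if new > last:                # (4) the prediction grew ...
                    time.sleep(new - last)    # (5) ... so wait only for the difference (2)
                    last = new
                else:
                    break
            self.send_response(200)           # (6) send a positive acknowledgment
            self.end_headers()
            self.wfile.write(b"ACK")
        finally:
            with in_flight_lock:
                in_flight -= 1

if __name__ == "__main__":
    SimulatorServer(("0.0.0.0", 8000), SimulatorHandler).serve_forever()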
The Load Tester component of RAST is based on our previous load testing approach [6] with the load testing tool Locust. Our load test aims to evaluate how many alarm devices the system can serve while complying with the real-time requirements when it is under production workload. The workload the Load Tester component generates consists of two parts: the alarm device workload caused by alarm devices sending alarm messages and the background workload caused by the other programs that run simultaneously to the ARS. The component uses the extracted workload pattern from the production system to produce the background workload. The alarm device workload is steadily increased by simulating alarm devices and increasing their number over time until the workload is reached where the Requirements Checker component of RAST reports that the performance requirements have been exceeded. This way, we find the saturation point of the system.

V. EXPERIMENTS: FINDING THE SATURATION POINT OF THE LEGACY ALARM SYSTEM

Here, we describe our experimental approach to find the saturation point of the system.

To prepare the experiment, we collect log files from the production system and feed them into the implemented RAST components to create the predictive model, the request type mapping, and the workload pattern.

Table I. Mean score and the 95% confidence interval of the score estimate for the estimators we used.
Estimator               | Mean score and 95% confidence interval
Linear Regression       | 0.54 (+/- 0.00)
Ridge Regression        | 0.54 (+/- 0.00)
Lasso Regression        | -0.00 (+/- 0.00)
ElasticNet Regression   | -0.00 (+/- 0.00)
DecisionTree Regression | 0.63 (+/- 0.00)

We include five out of twelve server programs in our case study, as only these programs produce log files with information about the received and processed requests. Every request processed by these programs produces a pair of two lines of text in the log file, one for the beginning of request processing and one for the end. We refer to one line of text in the log file as a log entry. From these selected server programs, we use 180 log files from 30 distinct days collected over 13 months, from December 2020 till January 2022. Altogether, our log files contain around 39.99 million log entries. The number of resulting rows in the training database is around half of the number of log entries, as one pair of log entries is required to calculate the processing time of a request. This number is further reduced after outlier detection and removal. Feeding these log files into Pipeline A produces a training database with around 19.9 million rows and five columns of data.

Table I shows the mean score and the 95% confidence interval of the score estimate [14] of the estimators we used. We observe that the DecisionTree Regression has the best score, so the PMC component fits the estimator using the train subset, performs a prediction using the test subset as input, and evaluates the predictions.
The results are: the mean absolute error of 0.034, mean squared error of 0.01, and the R² score of 0.63.
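The PMC steps described above can be condensed into the following sketch (an illustration, not the authors' exact code; the database file, table, and column names are assumptions, and the feature set is reduced to the request type for brevity):

import sqlite3

import joblib
import pandas as pd
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeRegressor

# load the training data produced by Pipeline A (assumed schema)
df = pd.read_sql("SELECT request_type, processing_time FROM training_data",
                 sqlite3.connect("training.db"))

# request type mapping: textual request types -> integers
mapping = {t: i for i, t in enumerate(df["request_type"].unique())}
df["request_type"] = df["request_type"].map(mapping)

# outlier removal: drop processing times more than three standard deviations from the mean
mean, std = df["processing_time"].mean(), df["processing_time"].std()
df = df[(df["processing_time"] - mean).abs() <= 3 * std]

X, y = df[["request_type"]], df["processing_time"]
X_train, X_test, y_train, y_test = train_test_split(X, y)   # 75% / 25% by default

estimators = {
    "Linear Regression": LinearRegression(),
    "Ridge Regression": Ridge(),
    "Lasso Regression": Lasso(),
    "ElasticNet Regression": ElasticNet(),
    "DecisionTree Regression": DecisionTreeRegressor(),
}
# cross-validation on the train subset returns a score per estimator (cf. Table I)
scores = {name: cross_val_score(est, X_train, y_train).mean()
          for name, est in estimators.items()}
best_name = max(scores, key=scores.get)

best = estimators[best_name].fit(X_train, y_train)           # fit the best estimator
pred = best.predict(X_test)                                   # evaluate on the test subset
print(best_name,
      mean_absolute_error(y_test, pred),
      mean_squared_error(y_test, pred),
      r2_score(y_test, pred))

joblib.dump(best, "predictive_model.joblib")                  # export the predictive model
joblib.dump(mapping, "request_type_mapping.joblib")           # and the request type mapping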
Giving the log files to Pipeline B produces the workload pattern shown in Fig. 6 and Fig. 7. In Fig. 6, we observe that the [...]

[Figure: response times in ms (log scale, 10 to 100k) per request type; number of different request types: 350.]
[Figure 6: workload pattern – requests per hour for each day of the week (Mon–Sun).]
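Under the same schema assumptions as above, a workload pattern of this kind could be derived from the transformed log data roughly as follows (a sketch of the idea, not the authors' Pipeline B implementation; the output file name is hypothetical):

import sqlite3

import pandas as pd

df = pd.read_sql("SELECT timestamp FROM training_data",
                 sqlite3.connect("training.db"),
                 parse_dates=["timestamp"])

requests_per_hour = df.groupby(df["timestamp"].dt.floor("h")).size()   # hourly pattern (Fig. 6)
requests_per_day = df.groupby(df["timestamp"].dt.date).size()          # daily pattern (Fig. 7)

requests_per_day.to_csv("daily_workload_pattern.csv")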
[Figure 7: count of requests per day of the week (axis 0–700k); Thu 645,915, Fri 573,559.5, Sat 548,150.5, Sun 549,223.5.]
Figure 7. Daily workload pattern: number of total requests per day.

For the experiments in this paper, our previous work [6] provides us with a complete testing infrastructure based on the load testing tool Locust, including ancillary shell and Python scripts for automation and visualization. We build upon this work and develop three additional Python scripts shown in Fig. 9. We make the source code of all scripts, including the ancillary scripts, available on GitHub [11].

One of these scripts generates the alarm device workload by simulating alarm devices; it also logs each measured response time to a log file. The real-time requirements defined in Section III refer to alarm devices; thus, in our experiment, each simulated AD sends an alarm message with a random interval of 20 to 90 sec between messages, which is a realistic workload that happens in practice. We initialize the random number generator with a fixed number, so that the send times of the ADs remain the same between repetitions of our experiment.

Our Python script locust-parameter-variation.py automates the process of finding the saturation point of the software. It executes Locust with our gen_gs_alarm_device_workload.py Locust script, waits until Locust completes execution, then reads the measurements from the locust.log file, and compares the measurements against the real-time requirements, thus serving as the Requirements Checker component in the RAST approach.
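For illustration, a minimal Locust user class for the simulated ADs described above could look as follows (a sketch, not the actual gen_gs_alarm_device_workload.py; the endpoint path, request header, and seed value are assumptions):

import random

from locust import HttpUser, between, task

random.seed(42)   # fixed seed so that the send times remain the same between repetitions

class AlarmDevice(HttpUser):
    wait_time = between(20, 90)   # random interval of 20 to 90 s between alarm messages

    @task
    def send_alarm(self):
        # hypothetical endpoint of the Simulator's Web API
        self.client.post("/alarm", headers={"X-Request-Type": "Alarm"})

How many such users are started, and how quickly, is controlled by Locust's worker and spawn_rate settings discussed in the following paragraphs.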
Whenever the real-time requirements have been met, i.e., both the average and the maximum response times are below the requirements stated in Section III, the script increases the number of ADs by 5000 (the initial value is also 5000) and executes Locust again. The script executes repeatedly, until the average or the maximum response time required by the standard is exceeded. At this point, the script has determined the saturation point. Once the locust-parameter-variation.py script finds the saturation point, it returns to the previous number, and from there, it increases the number of ADs by 100 to obtain a finer-grained result.

When simulating tens of thousands of alarm devices, it is likely to receive a warning from Locust that the load test requires too much CPU load, which completely invalidates the measurements. The reason is that the Global Interpreter Lock (GIL) of the CPython runtime, the most widely-used Python runtime, prevents a single Python process from utilizing multiple CPU cores in parallel, even if using multiple threads and running on a multi-core CPU [22]. To solve this, our locust-parameter-variation.py script uses Locust's built-in distributed load generation mode to distribute the load across multiple workers, each one being a distinct Python process. Also, in the field, ADs are installed over time. Therefore, the script uses Locust's spawn_rate parameter to start 100 ADs per second. Using a higher value results in a warning from Locust.

All our experiments are conducted within a virtual machine (VM) on an HP ProLiant DL380 G7 server with two Intel Xeon X5690 processors with 3.46 GHz. This VM has eight virtual CPU (vCPU) cores and 16 GB of RAM. We launch the Simulator and the Load Tester on the same machine.

To obtain practically relevant results, we must simulate tens of thousands of ADs, thus potentially creating thousands of concurrently established network connections. For large-scale load testing to work, we tune the server's parameters, like raising the limit of open file descriptors to 50,000 and the maximum number of connections in the TCP listen backlog (SOMAXCONN) to 20,000. Additionally, we use the mininet tool [23] to simulate network latency and bandwidth. Our mininet topology (the number of simulated hosts, switches and links) and the link parameters (delay, jitter, packet loss, bandwidth) are modeled to reflect the production system. We provide the source code, details of our server's parameters, and our mininet configuration on GitHub [11].

The process of running an experiment on our testing infrastructure is as follows: we begin the experiment with the launch of the Simulator. After that, we start the gen_gs_prod_workload.py Locust script so that the worst-case production workload is generated in the background. Then, we start the alarm device workload by executing the locust-parameter-variation.py script. The script uses five Locust workers to distribute the load of simulating the ADs. Five is the maximum number of workers we can run because, with the already running background workload and the Simulator, seven vCPU cores are utilized by our experiment, leaving one core for the mininet and operating system processes. locust-parameter-variation.py runs for 10 min so that each AD sends between 6 and 30 requests throughout the experiment. Running the test for a longer period of time has no significant impact on the results. The experiment ends once the locust-parameter-variation.py script reports the saturation point.

Figure 10. Avg and Max Response Times dependent on the number of ADs

Fig. 10 shows the experimental results: for each executed load test, the average and maximum response times are shown as blue and orange curves, correspondingly. At 42,900 ADs the maximum response time exceeds the threshold. However, the reason is that our simulated ARS cannot handle such a high amount of concurrent connections, and thus, our simulated ADs run into connection timeouts. The simulated ARS cannot handle that many connections because Python's GIL prevents the process from utilizing multiple CPU cores even though we are using one thread per connection. We expect that repeating our experiment on a more powerful machine with a CPU with greater single-thread performance than our machine would yield higher results. That also means that the GS system can potentially process even more than 42,800 ADs because the response times returned by our model were not the limiting factor in the experiment.

In summary, our results confirm that under normal operating conditions, e.g., when there are no faults in the system, the GS system can handle the alarm device workload of 42,800 ADs while processing the most demanding background workload that happens in production and still complies with the real-time requirements of the current EN 50136 standard for alarm systems.
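The search logic of locust-parameter-variation.py described above boils down to the following simplified sketch (not the published script; run_locust is a hypothetical helper that starts a Locust run for a given number of ADs and returns the average and maximum response times read from locust.log):

MEAN_LIMIT_S, MAX_LIMIT_S = 10.0, 30.0   # EN 50136: mean <= 10 s, max <= 30 s

def requirements_met(avg_s, max_s):
    return avg_s <= MEAN_LIMIT_S and max_s <= MAX_LIMIT_S

def find_saturation_point(run_locust, coarse=5000, fine=100):
    ads = coarse                                          # the initial value is 5000 ADs
    while requirements_met(*run_locust(num_ads=ads)):     # coarse search in steps of 5000
        ads += coarse
    ads -= coarse                                         # return to the previous number ...
    while requirements_met(*run_locust(num_ads=ads + fine)):
        ads += fine                                       # ... and refine in steps of 100
    return ads                                            # largest AD count that still passed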
VI. CONCLUSION AND FUTURE WORK

In this paper, we propose RAST – an approach to measuring the response time of production systems by combining Regression Analysis, Simulation, and load Testing. RAST consists of components that are adaptable to different kinds of systems. We present a proof-of-concept implementation of RAST for an industrial, legacy alarm system that is in productive use and has strict real-time requirements.

The experimental results of our case study confirm that under normal operating conditions, the System Under Evaluation can handle the workload of 42,800 modern Alarm Devices (ADs) while complying with the real-time requirements of the EN 50136 standard. Therefore, our short-term goal formulated in Section I is achieved.

A current limitation of our implementation is that our predictive model considers only a subset of the programs that consume hardware resources in the legacy system's resource-sharing environment. We take five of twelve server programs into account and leave out the rest because they do not produce log files that contain information about the requests the programs received and processed. That is why our predictive model has an R² score of 0.63 (1.00 being a perfect score), which means that 63% of the variability in the processing times is explained by the predictor variables of our predictive model. We assume that the load of the other programs causes the remaining 37%. Therefore, we expect the system's performance in these situations to drop below our measured saturation point of 42,800 ADs. Thus, our measurements are only valid when the system operates under normal conditions, i.e., no faults in the system are present, and no backups, anti-virus scans or other periodic tasks are running.

In the future, we plan to improve the predictive model so that our simulation is closer to the real-world system. Also, we will implement RAST for different (open-source) software systems to verify its validity for other systems and allow for better reproduction of our research. Furthermore, we plan to export the predictive model into a more common format to enable sharing our predictive model with other researchers.

REFERENCES

[1] "GS," GS electronic GmbH. [Online]. Available: https://www.gselectronic.com/
[2] S. Matthiesen and P. Bjørn, "Why replacing legacy systems is so hard in global software development: An information infrastructure perspective," in CSCW '15. ACM, 2015, pp. 876–890.
[3] Y. Jin, A. Tang, J. Han, and Y. Liu, "Performance evaluation and prediction for legacy information systems," in ICSE '07, May 2007, pp. 540–549.
[4] A. Avritzer, J. Kondek, D. Liu, and E. J. Weyuker, "Software performance testing based on workload characterization," in WOSP '02, 2002, pp. 17–24.
[5] L. Zhang, "Performance models for legacy system migration and multi-core computers – an MVA approach," Ph.D. dissertation, McMaster University, 2015.
[6] J. Tomak and S. Gorlatch, "Measuring performance of fault management in a legacy system: An alarm system study," in MASCOTS '20. Springer, 2021, pp. 129–146.
[7] D. Okanović and M. Vidaković, "Software performance prediction using linear regression," in Proc. of the 2nd Int. Conf. on Information Society Technology and Management. Citeseer, 2012, pp. 60–64.
[8] J. Grohmann et al., "Monitorless: Predicting performance degradation in cloud applications with machine learning," in Middleware '19. ACM, 2019, pp. 149–162.
[9] C. Courageux-Sudan, A.-C. Orgerie, and M. Quinson, "Automated performance prediction of microservice applications using simulation," in MASCOTS '21, 2021, pp. 1–8.
[10] B. K. Aichernig et al., "Learning and statistical model checking of system response times," Software Quality Journal, vol. 27, no. 2, pp. 757–795, 2019.
[11] J. Tomak, "Performance Testing Infrastructure," 2022. [Online]. Available: https://github.com/jtpgames/LocustScripts
[12] "Regression analysis essentials for machine learning." [Online]. Available: http://www.sthda.com/english/articles/40-regression-analysis/
[13] J. Brownlee, "How to remove outliers for machine learning," 2020. [Online]. Available: https://machinelearningmastery.com/how-to-use-statistics-to-identify-outliers-in-data/
[14] Scikit-learn Development Team, "Model selection and evaluation." [Online]. Available: https://scikit-learn.org/stable/model_selection.html
[15] DIN EN 50136-1:2012-08, "Alarm systems – alarm transmission systems and equipment – part 1: General requirements for alarm transmission systems," Aug. 2012.
[16] ISO/IEC, "ISO/IEC 2382:2015: Information technology – Vocabulary," ISO, Tech. Rep., 2015.
[17] G. Van Rossum and F. L. Drake, Python 3 Reference Manual. CreateSpace, 2009.
[18] F. Pedregosa et al., "Scikit-learn: Machine learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
[19] The pandas development team, "pandas-dev/pandas: Pandas 1.0.5," Jun. 2020. [Online]. Available: https://doi.org/10.5281/zenodo.3898987
[20] C. Byström, J. Heyman, J. Hamrén, and H. Heyman, "Locust." [Online]. Available: https://docs.locust.io/en/stable/what-is-locust.html
[21] Databricks, "Extract-Transform-Load." [Online]. Available: https://databricks.com/glossary/extract-transform-load
[22] D. Beazley, "Understanding the Python GIL," in PyCON Python Conference, 2010.
[23] F. Keti and S. Askar, "Emulation of software defined networks using mininet in different simulation environments," in ISMS '15, 2015, pp. 205–210.