
2022 30th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS) | DOI: 10.1109/MASCOTS56607.2022.00015

RAST: Evaluating Performance of a Legacy System using Regression Analysis and Simulation

Juri Tomak, University of Muenster, Germany, [email protected]
Sergei Gorlatch, University of Muenster, Germany, [email protected]

Abstract—A challenging aspect in developing and deploying distributed systems with strict real-time constraints is how to evaluate the performance of the system running in a production environment without disrupting its regular operation. The challenge is even greater when the System Under Evaluation (SUE) is a poorly documented legacy system with a database-centric architecture that works within a resource-sharing environment. Current performance evaluation methods dealing with this challenge require live monitoring software or distributed tracing tools that are typically unavailable in legacy systems and hard to establish. In this paper, we propose an alternative approach, RAST (Regression Analysis, Simulation, and load Testing); it evaluates the response time as the major performance characteristic of a distributed real-time legacy system using the available system's log files. Our use case is a commercial alarm system in productive use that is provided and further developed by the GS company group in Germany. We show in extensive experiments that our approach allows us to adequately estimate to what degree the workload of a legacy production system can rise in the future while still complying with the strict requirements on the response time. We provide a GitHub repository with the non-proprietary parts of our predictive model generation, simulation, and load testing software to reproduce our experiments.

Index Terms—performance evaluation, real-time requirements, regression analysis, simulation, alarm systems, legacy system

I. MOTIVATION AND RELATED WORK

We address the problem of evaluating the performance of a complex distributed system in productive use to find out if the System Under Evaluation (SUE) complies with real-time requirements. Our use case is a real-world industrial alarm system designed and provided by the GS company group in Germany [1], with strict requirements regarding the system's real-time behavior. The performance evaluation problem is complicated by the fact that we deal with a large legacy system [2] that is in continuous productive use while, at the same time, its technical documentation is incomplete, there are remaining bugs, and the system remains under development, often not by its original designers. Performance evaluation is even more challenging when the SUE is part of a larger system; the SUE executes in a resource-sharing environment and has a database-centric architecture.

When dealing with legacy systems, a complex and challenging aspect is evaluating their performance without disrupting the regular operation of the production environment. In particular, such disruptions could corrupt databases or cause too much load on the system, leading to the violation of real-time requirements. Another challenge is that the performance of the SUE and, thus, the evaluation results are adversely affected by other programs that run simultaneously with the SUE and use the same hardware resources and database [3].

There are two traditional kinds of approaches to performance evaluation of software systems in general: performance modeling and performance measurement – the latter is often called performance or load testing; we will use load testing [4], [5]. In our latest application case study on the legacy alarm system of the GS company group [6], we demonstrated that traditional performance modeling, i.e., creating analytical models based on abstract characteristics, is hardly applicable because it requires an in-depth analysis and support from application experts [3]. Also, when the legacy system is still under development, it is probable that the performance model will not represent the system well enough, especially as time passes. Therefore, we used load testing, which means generating artificial workloads and measuring the system's performance characteristics, like the response time¹. However, load testing disrupts the regular operation of the production environment by consuming computing resources and changing the contents of the database. While load testing can be done in an isolated test environment, such environments often have less computing power than production environments, making it difficult to produce results that precisely represent the performance of the production environment [3], [6].

¹ Response time is the time between sending a request and receiving its response. It is the sum of the server's processing time and the network latency.

Because both load testing and performance modeling, according to our previous work, do not produce plausible and reproducible performance prediction results, we propose in this paper an alternative approach that combines predictive modeling, simulation, and load testing.

There are several approaches similar to ours. Okanović and Vidaković [7] use linear regression to predict the performance of software. They collect the data about the system behavior using live monitoring software. A recent approach by Grohmann et al. [8] uses six different machine learning algorithms for performance prediction. Like Okanović and Vidaković, they use live monitoring software for data collection. Courageux-Sudan et al. [9] introduce a model for distributed software to simulate the processing time of a request. They determine the model's parameters, like queuing time and CPU usage, using distributed tracing software, simulate the model using the SimGrid simulation framework, and evaluate its performance with the LIMBO load testing tool.
Aichernig et al. [10] simulate users to generate a synthetic workload for their SUE. To preserve the regular operation of the production system, they use a test environment in which the SUE is executed.

While these and other previous approaches require live monitoring software or distributed tracing software, our target class of systems does not provide such tools, making it difficult to gather low-level metrics like CPU usage, network latency, request sizes, etc. Also, it is a common challenge that test environments, even when available, do not represent the production environment well enough regarding computing capacity, which makes running the SUE in a test environment ineffective for determining its performance.

Therefore, we propose RAST – our approach in which we combine Regression Analysis, Simulation, and load Testing. RAST has similarities with previous work in using regression analysis to create a predictive model for processing time. The novel features of RAST are as follows:
1) It uses the available log files of the production system as input to regression analysis.
2) It simulates the system as a server software that receives the same requests as the real legacy software and sends valid responses.
3) It uses load testing to submit the same workload to the simulation that the production system is processing.
4) It automatically finds the optimal predictive model for the target system based on the provided log files by choosing the best-performing regression algorithm via cross-validation of common regression algorithms, such as: Linear, Ridge, Lasso, Elastic net, and Decision tree regression.

The short-term goal of our work is to evaluate the performance of the current legacy software produced by the GS company group to identify the parts of the system that prevent compliance with real-time requirements. Our long-term goal is, based on the performance results, to decide how these problematic parts should be redesigned, then redesign them and verify the improvements afterward. In the process of pursuing these goals, we aim to find the system's saturation point, i.e., how many modern Alarm Devices (ADs) the system can simultaneously handle at maximum.

In summary, our contributions in this paper are as follows:
• We propose RAST – an approach that combines Regression Analysis, Simulation, and load Testing to accurately measure the response time of distributed real-time legacy production software without disrupting its regular operation (Section II).
• We present our proof-of-concept implementation of RAST for the particular use case of an industrial alarm system with hard real-time requirements (Sections III and IV).
• We experimentally evaluate our RAST implementation in this case study and present the results (Section V).
• We provide a GitHub repository [11] with the non-proprietary parts of our software for model generation, simulation, and load testing, which allows our experiments to be reproduced.

II. THE RAST APPROACH: OVERVIEW

Fig. 1 shows a high-level overview of RAST – our approach to measuring the response time of a production system by combining Regression Analysis, Simulation, and load Testing.

[Figure 1. RAST: Overview of the components and their data flows]

The production system (1) provides RAST with log files that contain information about the requests the system has received and processed, like the request type, timestamp, etc. The specific contents and the format depend on the particular system. Each log file is transferred to Pipelines A and B. Pipeline A (2) creates the predictive model and the request type mapping and transfers them to the Simulator (4). Pipeline B (3) produces the list of all request types the production system accepts and transmits it to the Simulator and the Load Tester (5). Additionally, Pipeline B produces the production system's workload pattern and sends it to the Load Tester. The workload pattern describes the number and types of requests the system processed in a given time interval, e.g., the number of requests per hour. The Simulator (4) is a server software that processes all requests that the production system accepts. The Load Tester (5) generates a synthetic workload and measures the response times of the Simulator.

We adopt a data-driven approach to make the Simulator and the Load Tester adaptable to different kinds of systems. After receiving a request, the Simulator verifies that the request type is part of the request types list and uses the request type mapping to transform the type of the incoming request into a numerical value as expected by the predictive model. The Simulator then uses the predictive model to delay the response for as long as the production system would need to process the respective request. The Load Tester steadily increases the number of requests per second, starting with the highest number of requests per second extracted from the workload pattern, i.e., the worst-case workload the production system had to process in the past. After each load test, the Load Tester sends the measured response times to the Requirements Checker (6) to verify that the response times comply with the performance requirements.

If they do, the load test is repeated with a greater workload. Otherwise, the load test is finished, which means that RAST has found the saturation point of the production system.

A. RAST: Pipeline A

Fig. 2 shows the Log-Transformer and Predictive-Model-Creator components inside Pipeline A of RAST and the data flow between the components.

[Figure 2. RAST: Components of Pipeline A (the Log-Transformer turns log files into a training data database with columns such as Timestamp, Request Type, and Processing Time; the Predictive-Model-Creator produces the predictive model and the request type mapping for the Simulator)]

The Log-Transformer (LT) extracts the contents from the given log files and creates a database with the training data. Having a dedicated LT component makes the Predictive-Model-Creator component more generic by offloading the processing of SUE-specific details to the LT component.

After the LT component has produced the training data, it is passed to the Predictive-Model-Creator (PMC) component, which uses regression analysis to create a predictive model (Fig. 2). Our predictive model predicts the processing time, i.e., the time the system spends processing a request. In our model, the processing time depends on the type and the number of requests executed concurrently.

A predictive model consists of a regression algorithm and its parameters, which are determined in the process of regression analysis, e.g., linear regression and its coefficients. Regression analysis applies a set of machine learning methods that allow us to predict a continuous outcome variable (y) based on the value of one or multiple predictor variables (x) [12].

When choosing suitable predictor variables for RAST, it is essential that the variables can be observed at simulation runtime in order to create an input vector for our predictive model when running the simulation. In our approach, the outcome is the time the system needs to process a particular request, while the predictor variables are as follows: the number of parallel requests at the beginning of the request, the number of parallel requests finished while processing the request, and the request type.

To improve the quality of our predictive model, the PMC component performs outlier detection and removal on the training data using the standard deviation method, i.e., it calculates the mean and standard deviation of the processing times and then removes all times that are more than three times the standard deviation away from the mean [13].

After cleaning the data of outliers, our PMC component evaluates common regression algorithms, in the following called estimators. An estimator takes the training data as input and learns the optimal parameters for a particular regression algorithm. The PMC component evaluates different estimators using the popular approach of cross-validation [14], which returns a score for each estimator. The best estimator is the one with the best score. Exporting the best estimator as a file makes the predictive model usable for the Simulator.

Many regression algorithms require numerical values as input. Therefore, when it loads the data, the PMC component transforms every textual request type in the training data, as shown in Fig. 2. At the end of Pipeline A, the Simulator obtains the predictive model and the request type mapping.
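To make the outlier-removal step concrete, the following minimal sketch applies the standard-deviation method to a pandas DataFrame; the column name processing_time and the example values are our own illustrative assumptions, not something prescribed by RAST.

    import pandas as pd

    def remove_outliers(training_data: pd.DataFrame) -> pd.DataFrame:
        """Drop rows whose processing time lies more than three standard
        deviations away from the mean (the standard-deviation method)."""
        mean = training_data["processing_time"].mean()
        std = training_data["processing_time"].std()
        within = training_data["processing_time"].between(mean - 3 * std,
                                                          mean + 3 * std)
        return training_data[within]

    # Illustrative data: twenty typical processing times and one extreme value.
    data = pd.DataFrame({"processing_time": [0.016] * 20 + [10.0]})
    cleaned = remove_outliers(data)
    print(len(cleaned))  # 20 -> the extreme value has been removed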
B. RAST: Pipeline B

Fig. 3 shows the Log-Transformer (LT) and the Workload Characterization components inside Pipeline B of RAST and the data flow between the components.

[Figure 3. RAST: Components of Pipeline B (the LT merges the log files into a common format; the Workload Characterization component derives the request types list for the Simulator and the Load Tester, and the workload pattern, i.e., requests per second over time, for the Load Tester)]

Similar to Pipeline A, the log files are first given to the LT component, which works very similarly to the one in Pipeline A. However, instead of producing training data, it merges and transforms the log files into a common format and passes these transformed log files to the Workload Characterization component to create the production system's workload pattern, e.g., the number of requests per second at a specific time, and the list of all request types the production system accepts. At the end of Pipeline B, the request types list is passed on to the Simulator and the Load Tester. The Load Tester additionally obtains the workload pattern.
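As a sketch of what this characterization step computes, assume the merged log entries have already been parsed into a pandas DataFrame with a timestamp and a request_type column; the column names and the sample values below are our assumptions.

    import pandas as pd

    # A few hypothetical merged log entries; in RAST they come from the LT.
    log = pd.DataFrame({
        "timestamp": pd.to_datetime(["2021-05-28 18:46:31.010",
                                     "2021-05-28 18:46:31.100",
                                     "2021-05-28 18:46:32.100"]),
        "request_type": ["Alarm", "Alarm", "Heartbeat"],
    })

    # List of all request types the production system accepts.
    request_types = sorted(log["request_type"].unique())

    # Workload pattern: number of requests processed per second.
    workload_pattern = (log.set_index("timestamp")
                           .resample("1s")["request_type"]
                           .count()
                           .rename("requests_per_second"))
    print(request_types)
    print(workload_pattern)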
III. USE CASE: AN ALARM SYSTEM

Fig. 4 shows a high-level overview of our use case: a typical alarm system and its components. An Alarm Device (AD) is installed at the customer's home. If the AD detects a breach of the safety criteria, like a fire or a burglary, or a technical malfunction of the AD itself, it transmits an alarm message to the Alarm Receiving Software (ARS) running within the Alarm Receiving Center (ARC) – a computer center which acts as the endpoint for incoming alarm messages.
[Figure 4. Use Case: A typical alarm system]

The ARS is responsible for receiving and processing alarm messages and forwarding them to the Risk Management Software (RMS) running within the Emergency and Service control Center (ESC). Once the ARS has processed the alarm message, an acknowledgment is sent to the AD, confirming that no retransmission of the alarm message is required. The ESC employees then coordinate appropriate actions according to an intervention plan agreed in advance with the customer.

Our particular use case is production-quality software for an alarm system designed and implemented by the GS company group [1], which is a leading alarm system provider in Germany. The company owns three ARCs and three ESCs located in the north-western part of Germany, which handle about 13,000 alarms per day from about 82,000 alarm devices located all over Germany, around the clock.

The current software system provided by the company is a large, complex, and poorly documented distributed legacy system consisting of around 80 executable software components (programs) and libraries written in BASIC, C++, C#, and Java, summing up to over 1.5 million lines of code. The majority of the original developers are no longer with the company; moreover, the software system lacks automated testing mechanisms like unit tests. Therefore, we deal with a typical legacy system, i.e., "a large software system that we don't know how to cope with but that is vital to our organization" [2].

In summary, performing load tests to find the saturation point of the ARS is challenging for the following reasons:
1) Resource-sharing environment: the ARS is one of twelve server programs that form our target system. In addition, programs like backup and anti-virus software are also running in the system. All programs, executed within virtual machines (VMs), share computing resources like network bandwidth, CPU, storage, etc.;
2) Database-centric architecture: one database instance is used by all server programs of the system, thus making it a common bottleneck;
3) Software bugs, continuous development, and the legacy IT infrastructure of the test environment.

Around 3% of the requests the system processes during a day are received and processed by the ARS; the other 97% are requests destined for other server programs. This high amount of background workload causes unreliable performance measurements when load testing the ARS, due to the database-centric architecture and the resource-sharing environment. For these reasons, we propose an alternative approach in this paper.

An alarm system's timing behavior is regulated by the EN 50136 standard [15]. The most relevant timing metric is the response time between the AD and the ARS; it indicates how quickly the system responds to a request. Commonly, response time is understood as the time between a request and any response, according to the ISO/IEC 2382 standard [16]. The EN 50136 standard, however, defines the response time as the time between sending an alarm message (request) and receiving a positive acknowledgment (response). Timeouts or negative responses are not treated as acknowledgments. Therefore, the ARS has to comply with the specified real-time requirements even under circumstances like high workloads and failures. The currently valid EN 50136 standard, which we rely upon, specifies the real-time requirements for the response time as follows: the arithmetic mean of all response times measured in any time interval must not exceed 10 seconds, and the maximum response time must not exceed 30 seconds.
IV. CASE STUDY: A LEGACY ALARM SYSTEM

In this section, we present our proof-of-concept implementation of the RAST approach for the case study of the legacy alarm system described in Section III. We implement all components of RAST in the Python [17] language because of the availability of widely used libraries for regression analysis (scikit-learn [18]), data management (pandas [19]), and load testing (Locust [20]). We implement both LT components of Pipelines A and B as Extract-Transform-Load (ETL) pipelines to facilitate reuse between Pipeline A and Pipeline B. We use an ETL pipeline [21] which, in our case, performs the following steps: it Extracts data related to the workload of the SUE from different log files, Transforms it to a common format, and Loads the data into a database that serves as the single source of truth for future data processing.
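To illustrate the idea, here is a deliberately simplified ETL sketch. The log-line format, the regular expression, and the table layout are purely illustrative assumptions (the real GS log formats are proprietary), and requests are paired by type only, whereas a real implementation would pair begin and end entries by request identifier.

    import re
    import sqlite3
    from datetime import datetime

    # Assumed illustrative log format: "<date> <time> BEGIN|END <request type>".
    LINE = re.compile(r"(?P<ts>\S+ \S+) (?P<phase>BEGIN|END) (?P<rtype>\w+)")

    def etl(log_lines, db_path="training.db"):
        """Extract begin/end pairs, transform them into processing times,
        and load the rows into a database used as the single source of truth."""
        open_requests, rows = {}, []
        for line in log_lines:                          # Extract
            m = LINE.match(line)
            if not m:
                continue
            ts = datetime.fromisoformat(m["ts"].replace(" ", "T"))
            if m["phase"] == "BEGIN":
                open_requests[m["rtype"]] = ts
            elif m["rtype"] in open_requests:           # Transform: pair -> duration
                began = open_requests.pop(m["rtype"])
                rows.append((began.isoformat(), m["rtype"],
                             (ts - began).total_seconds()))
        con = sqlite3.connect(db_path)                  # Load
        con.execute("CREATE TABLE IF NOT EXISTS training "
                    "(timestamp TEXT, request_type TEXT, processing_time REAL)")
        con.executemany("INSERT INTO training VALUES (?, ?, ?)", rows)
        con.commit()
        con.close()

    etl(["2021-05-28 18:46:31.010 BEGIN Alarm",
         "2021-05-28 18:46:31.026 END Alarm"])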
The Predictive-Model-Creator (PMC) component described in Section II-A uses the training data to create a predictive model. The component uses the popular Python libraries pandas and scikit-learn. The PMC component extracts the data from the database and stores it in a pandas dataframe; having the data in a pandas dataframe is a requirement of scikit-learn. In the process of extraction, the component transforms the request types stored as strings into integers because the estimators expect numerical inputs. This transformation is done by assigning a unique number to each distinct request type; the resulting assignments constitute the request type mapping. After that, the PMC component applies outlier detection and removal. Then, it splits the training data into random train and test subsets – terms used by the scikit-learn developers – using the train_test_split function of scikit-learn, so that 75% of the data is in the train subset and the remainder is in the test subset. This distribution is the default in scikit-learn.

The PMC component takes the train subset and uses cross-validation to compare the performance of the estimators. All estimators use default settings. After cross-validation, the PMC component fits² the estimator with the best score using the train subset. Then, it performs a prediction using the test subset as input and evaluates the predictions using the mean absolute error, the mean squared error, and the R² score [14]. Finally, the PMC component exports the estimator to create the predictive model and the request type mapping as files.

² Fitting is the process of learning the best parameters for a particular estimator based on the given train subset.
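A condensed sketch of this model-selection step with scikit-learn is shown below; the column names, the output file name, and the fixed random_state are our assumptions, and the real PMC additionally exports the request type mapping.

    import joblib
    import pandas as pd
    from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.model_selection import train_test_split, cross_val_score
    from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

    def create_predictive_model(df: pd.DataFrame):
        # Predictor variables and outcome as defined in Section II-A.
        X = df[["request_type", "parallel_requests_start",
                "parallel_requests_finished"]]
        y = df["processing_time"]
        # Default scikit-learn split: 75% train subset, 25% test subset.
        X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

        estimators = {
            "Linear": LinearRegression(), "Ridge": Ridge(), "Lasso": Lasso(),
            "ElasticNet": ElasticNet(), "DecisionTree": DecisionTreeRegressor(),
        }
        # Cross-validate every estimator with default settings on the train subset.
        scores = {name: cross_val_score(est, X_train, y_train).mean()
                  for name, est in estimators.items()}
        best_name = max(scores, key=scores.get)

        # Fit the best estimator and evaluate it on the held-out test subset.
        best = estimators[best_name].fit(X_train, y_train)
        predictions = best.predict(X_test)
        print(best_name,
              mean_absolute_error(y_test, predictions),
              mean_squared_error(y_test, predictions),
              r2_score(y_test, predictions))

        joblib.dump(best, "predictive_model.joblib")  # export for the Simulator
        return best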
The Simulator component of RAST provides a Web API capable of processing all types of requests of the GS legacy system. For each incoming request, a new thread is created to serve the request. For this, we extend Python's built-in ThreadingHTTPServer and increase the size of the TCP listen backlog from the default value of 5 to 20,000. In this regard, the Simulator reflects the behavior of the real GS legacy system. Serving a request means transforming the request type, represented as text, into the respective numerical value by using the request type mapping, then simulating the processing time using the predictive model, and then sending a response.

Fig. 5 illustrates how our Simulator simulates the processing time: every time a request is received, the Simulator performs an initial prediction (1) and waits for the predicted amount of time (2). Then, the Simulator performs another prediction (3) because other requests might have arrived in parallel, changing the input for the model. If the new predicted amount of time is greater than the previous prediction (4), the Simulator takes the difference in time (5) and waits for it (2). Otherwise, it finishes the execution of this request and sends a response (6).

[Figure 5. Flowchart: How the request processing is simulated.]
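The loop in Fig. 5 can be sketched as follows; the predict callable stands in for an invocation of the exported predictive model with the current request type and parallel-request counts, and sending the actual HTTP response is left to the surrounding request handler.

    import time

    def simulate_processing_time(request_type_id: int, predict) -> None:
        """Delay the response for as long as the predictive model says the
        production system would need (simplified version of Fig. 5)."""
        last_prediction = predict(request_type_id)    # (1) initial prediction
        time.sleep(last_prediction)                   # (2) wait predicted time
        while True:
            # (3) re-predict: requests arriving in parallel change the input.
            new_prediction = predict(request_type_id)
            if new_prediction <= last_prediction:     # (4) not greater?
                return                                # (6) finish, send response
            time.sleep(new_prediction - last_prediction)  # (5)+(2) wait difference
            last_prediction = new_prediction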
The Load Tester component of RAST is based on our previous load testing approach [6] with the load testing tool Locust. Our load test aims to evaluate how many alarm devices the system can serve while complying with the real-time requirements when it is under production workload. The workload the Load Tester component generates consists of two parts: the alarm device workload caused by alarm devices sending alarm messages, and the background workload caused by the other programs that run simultaneously with the ARS. The component uses the workload pattern extracted from the production system to produce the background workload. The alarm device workload is steadily increased by simulating alarm devices and increasing their number over time until the workload is reached at which the Requirements Checker component of RAST reports that the performance requirements have been exceeded. In this way, we find the saturation point of the system.

V. EXPERIMENTS: FINDING THE SATURATION POINT OF THE LEGACY ALARM SYSTEM

Here, we describe our experimental approach to finding the saturation point of the system.

To prepare the experiment, we collect log files from the production system and feed them into the implemented RAST components to create the predictive model, the request type mapping, and the workload pattern.

We include five out of twelve server programs in our case study, as only these programs produce log files with information about the received and processed requests. Every request processed by these programs produces a pair of two lines of text in the log file, one for the beginning of request processing and one for the end. We refer to one line of text in the log file as a log entry. From these selected server programs, we use 180 log files from 30 distinct days collected over 13 months, from December 2020 till January 2022. Altogether, our log files contain around 39.99 million log entries. The number of resulting rows in the training database is around half of the number of log entries, as one pair of log entries is required to calculate the processing time of a request. This number is further reduced after outlier detection and removal. Feeding these log files into Pipeline A produces a training database with around 19.9 million rows and five columns of data.

Table I shows the mean score and the 95% confidence interval of the score estimate [14] for the estimators we used.

Table I. Mean score and the 95% confidence interval of the score estimate for the estimators we used.

  Estimator                 Mean score and 95% confidence interval
  Linear Regression         0.54 (+/- 0.00)
  Ridge Regression          0.54 (+/- 0.00)
  Lasso Regression          -0.00 (+/- 0.00)
  ElasticNet Regression     -0.00 (+/- 0.00)
  DecisionTree Regression   0.63 (+/- 0.00)

We observe that the DecisionTree Regression has the best score, so the PMC component fits this estimator using the train subset, performs a prediction using the test subset as input, and evaluates the predictions. The results are: a mean absolute error of 0.034, a mean squared error of 0.01, and an R² score of 0.63.
Giving the log files to Pipeline B produces the workload pattern shown in Fig. 6 and Fig. 7. In Fig. 6, we observe that the requests per hour are at their highest level between 8 and 16 o'clock. In Fig. 7, we see that the overall daily workload is significantly lower at the end of the week. The explanation for both observations is that most employees work between 8 and 16 o'clock from Monday till Thursday and use the system at this time. Additionally, Pipeline B produces ancillary visualizations that provide further insight into the processing times of the production system. Fig. 8 shows that the average processing time of most of our 350 distinct request types is under one second; this also includes the alarm message request. We also observe that there are three requests that require more than ten seconds and one request that even requires 100 seconds. We provide additional figures on GitHub [11].

[Figure 6. Daily workload pattern: number of requests per hour (roughly 20,000 to 46,000 requests per hour, peaking between 8 and 16 o'clock).]

[Figure 7. Daily workload pattern: number of total requests per day (roughly 548,000 to 727,000 requests per day, with the lowest counts from Friday to Sunday).]

[Figure 8. Average processing time for every distinct request type (350 request types; response time in ms on a logarithmic scale).]

For the experiments in this paper, our previous work [6] provides us with a complete testing infrastructure based on the load testing tool Locust, including ancillary shell and Python scripts for automation and visualization. We build upon this work and develop three additional Python scripts, shown in Fig. 9. We make the source code of all scripts, including the ancillary scripts, available on GitHub [11].

[Figure 9. Our Testing Infrastructure (gen_gs_prod_workload.py generates the background workload against the Simulator; gen_gs_alarm_device_workload.py simulates ADs and measures response times; locust-parameter-variation.py executes the latter, verifies the real-time requirements from locust.log, and increases the load parameters).]

The following two Locust scripts generate the background and the alarm device workload as explained in Section IV:
• gen_gs_prod_workload.py: generates the background workload by a) processing the files created by our Workload Characterization component (Section II-B), b) randomly sending one of the 349 different types of requests, and c) constantly sending the highest number of requests per second extracted from the daily workload pattern, i.e., generating the worst-case workload. In our case study, we determine a value of 70 requests per second from the workload pattern;
• gen_gs_alarm_device_workload.py: generates the AD workload by simulating alarm devices. It also logs each measured response time to a log file. The real-time requirements defined in Section III refer to alarm devices; thus, in our experiment, each simulated AD sends an alarm message with a random interval of 20 to 90 seconds between messages, which is a realistic workload that happens in practice. We initialize the random number generator with a fixed number, so that the send times of the ADs remain the same between repetitions of our experiment.
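For illustration, the core of such an alarm-device simulation in Locust looks roughly like the sketch below; the endpoint path and the JSON payload are placeholders, since the published scripts [11] speak the proprietary GS request formats.

    import logging
    from locust import HttpUser, task, between

    class AlarmDevice(HttpUser):
        # Each simulated AD waits a random 20 to 90 seconds between alarm
        # messages, as described above.
        wait_time = between(20, 90)

        @task
        def send_alarm(self):
            # "/alarm" and the body are placeholder stand-ins for the real
            # alarm-message request accepted by the Simulator.
            response = self.client.post("/alarm", json={"type": "Alarm"})
            # Log the measured response time (in ms) so it can later be
            # checked against the EN 50136 limits.
            logging.info("response_time_ms=%.1f",
                         response.elapsed.total_seconds() * 1000)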
Our Python script locust-parameter-variation.py automates the process of finding the saturation point of the software. It executes Locust with our gen_gs_alarm_device_workload.py Locust script, waits until Locust completes execution, then reads the measurements from the locust.log file and compares the measurements against the real-time requirements, thus serving as the Requirements Checker component of the RAST approach.
Whenever the real-time requirements have been met, i.e., both the average and the maximum response times are below the limits stated in Section III, the script increases the number of ADs by 5000 (the initial value is also 5000) and executes Locust again. The script executes repeatedly until the average or the maximum response time required by the standard is exceeded. At this point, the script has determined the saturation point. Once the locust-parameter-variation.py script finds the saturation point, it returns to the previous number of ADs and, from there, increases the number of ADs by 100 to obtain a finer-grained result.
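A stripped-down sketch of this parameter-variation loop is given below; the Locust command-line options are standard, while the log-parsing helper assumes the placeholder response_time_ms log format from the sketch above rather than the real script's output.

    import re
    import subprocess

    def requirements_met(logfile: str) -> bool:
        """Check the logged response times against the EN 50136 limits
        (mean <= 10 s, maximum <= 30 s)."""
        with open(logfile) as f:
            times = [float(m.group(1)) / 1000
                     for m in re.finditer(r"response_time_ms=([\d.]+)", f.read())]
        return bool(times) and sum(times) / len(times) <= 10 and max(times) <= 30

    def run_load_test(num_ads: int) -> bool:
        """Run one headless Locust load test with num_ads simulated ADs."""
        subprocess.run(["locust", "-f", "gen_gs_alarm_device_workload.py",
                        "--headless", "--users", str(num_ads),
                        "--spawn-rate", "100", "--run-time", "10m",
                        "--logfile", "locust.log"])
        return requirements_met("locust.log")

    def find_saturation_point(coarse: int = 5000, fine: int = 100) -> int:
        ads = coarse                           # initial value: 5000 ADs
        while run_load_test(ads):              # increase by 5000 while passing
            ads += coarse
        ads -= coarse                          # return to the previous number
        while run_load_test(ads + fine):       # refine in steps of 100
            ads += fine
        return ads                             # largest AD count that still passed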
When simulating tens of thousands of alarm devices, one is likely to receive a warning from Locust that the load test causes too much CPU load, which completely invalidates the measurements. The reason is that the Global Interpreter Lock (GIL) of the CPython runtime, the most widely used Python runtime, prevents a single Python process from utilizing multiple CPU cores in parallel, even when using multiple threads and running on a multi-core CPU [22]. To solve this, our locust-parameter-variation.py script uses Locust's built-in distributed load generation mode to distribute the load across multiple workers, each one being a distinct Python process. Also, in the field, ADs are installed over time. Therefore, the script uses Locust's spawn_rate parameter to start 100 ADs per second; using a higher value results in a warning from Locust.
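For reference, such a distributed headless run can be started roughly as follows from Python; the user count is an arbitrary example value, and the options used (--master, --worker, --expect-workers, --spawn-rate) are Locust's standard distributed-mode flags.

    import subprocess

    # Five workers: separate Python processes, so a single process's GIL does
    # not cap the achievable request rate.
    workers = [subprocess.Popen(["locust", "-f", "gen_gs_alarm_device_workload.py",
                                 "--worker"])
               for _ in range(5)]

    # One master coordinates the workers and spawns 100 ADs per second.
    subprocess.run(["locust", "-f", "gen_gs_alarm_device_workload.py",
                    "--master", "--headless", "--expect-workers", "5",
                    "--users", "40000", "--spawn-rate", "100",
                    "--run-time", "10m"])

    for w in workers:
        w.terminate()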
All our experiments are conducted within a virtual machine
(VM) on an HP ProLiant DL380 G7 server with two Intel Xeon X5690 processors at 3.46 GHz. This VM has eight virtual CPU (vCPU) cores and 16 GB of RAM. We launch the Simulator and the Load Tester on the same machine.

To obtain practically relevant results, we must simulate tens of thousands of ADs, thus potentially creating thousands of concurrently established network connections. For large-scale load testing to work, we tune the server's parameters, e.g., raising the limit of open file descriptors to 50,000 and the maximum number of connections in the TCP listen backlog (SOMAXCONN) to 20,000. Additionally, we use the mininet tool [23] to simulate network latency and bandwidth. Our mininet topology (the number of simulated hosts, switches, and links) and the link parameters (delay, jitter, packet loss, bandwidth) are modeled to reflect the production system. We provide the source code, the details of our server's parameters, and our mininet configuration on GitHub [11].

The process of running an experiment on our testing infrastructure is as follows: we begin the experiment by launching the Simulator. After that, we start the gen_gs_prod_workload.py Locust script so that the worst-case production workload is generated in the background. Then, we start the alarm device workload by executing the locust-parameter-variation.py script. The script uses five Locust workers to distribute the load of simulating the ADs. Five is the maximum number of workers we can run because, with the already running background workload and the Simulator, seven vCPU cores are utilized by our experiment, leaving one core for the mininet and operating system processes. locust-parameter-variation.py runs each load test for 10 minutes, so that each AD sends between 6 and 30 requests throughout the experiment. Running the test for a longer period of time has no significant impact on the results. The experiment ends once the locust-parameter-variation.py script reports the saturation point.

[Figure 10. Avg and Max Response Times dependent on the number of ADs]

Fig. 10 shows the experimental results: for each executed load test, the average and maximum response times are shown as blue and orange curves, respectively. At 42,900 ADs, the maximum response time exceeds the threshold. However, the reason is that our simulated ARS cannot handle such a high number of concurrent connections, and thus our simulated ADs run into connection timeouts. The simulated ARS cannot handle that many connections because Python's GIL prevents the process from utilizing multiple CPU cores, even though we are using one thread per connection. We expect that repeating our experiment on a more powerful machine with a CPU with greater single-thread performance would yield higher results. That also means that the GS system can potentially process even more than 42,800 ADs, because the response times returned by our model were not the limiting factor in the experiment.

In summary, our results confirm that under normal operating conditions, e.g., when there are no faults in the system, the GS system can handle the alarm device workload of 42,800 ADs while processing the most demanding background workload that occurs in production, and still complies with the real-time requirements of the current EN 50136 standard for alarm systems.

VI. CONCLUSION AND FUTURE WORK

In this paper, we propose RAST – an approach to measuring the response time of production systems by combining Regression Analysis, Simulation, and load Testing. RAST consists of components that are adaptable to different kinds of systems. We present a proof-of-concept implementation of RAST for an industrial, legacy alarm system that is in productive use and has strict real-time requirements.

The experimental results of our case study confirm that under normal operating conditions, the System Under Evaluation can handle the workload of 42,800 modern Alarm Devices (ADs) while complying with the real-time requirements of the EN 50136 standard. Therefore, our short-term goal formulated in Section I is achieved.

A current limitation of our implementation is that our predictive model considers only a subset of the programs that consume hardware resources in the legacy system's resource-sharing environment. We take five of the twelve server programs into account and leave out the rest because they do not produce log files that contain information about the requests the programs received and processed. That is why our predictive model has an R² score of 0.63 (1.00 being a perfect score), which means that 63% of the variability in the processing times is explained by the predictor variables of our predictive model. We assume that the load of the other programs causes the remaining 37%. Therefore, in such situations we expect the system's performance to drop below our measured saturation point of 42,800 ADs. Thus, our measurements are only valid when the system operates under normal conditions, i.e., no faults in the system are present, and no backups, anti-virus scans, or other periodic tasks are running.

In the future, we plan to improve the predictive model so that our simulation is closer to the real-world system. Also, we will implement RAST for different (open-source) software systems to verify its validity for other systems and to allow for better reproduction of our research. Furthermore, we plan to export the predictive model into a more common format to enable sharing our predictive model with other researchers.

REFERENCES

[1] "GS," GS electronic GmbH. [Online]. Available: https://www.gselectronic.com/
[2] S. Matthiesen and P. Bjørn, "Why replacing legacy systems is so hard in global software development: An information infrastructure perspective," in CSCW '15. ACM, 2015, pp. 876–890.
[3] Y. Jin, A. Tang, J. Han, and Y. Liu, "Performance evaluation and prediction for legacy information systems," in ICSE '07, May 2007, pp. 540–549.
[4] A. Avritzer, J. Kondek, D. Liu, and E. J. Weyuker, "Software performance testing based on workload characterization," in WOSP '02, 2002, pp. 17–24.
[5] L. Zhang, "Performance models for legacy system migration and multi-core computers – an MVA approach," Ph.D. dissertation, McMaster University, 2015.
[6] J. Tomak and S. Gorlatch, "Measuring performance of fault management in a legacy system: An alarm system study," in MASCOTS '20. Springer, 2021, pp. 129–146.
[7] D. Okanović and M. Vidaković, "Software performance prediction using linear regression," in Proc. of the 2nd Int. Conf. on Information Society Technology and Management. Citeseer, 2012, pp. 60–64.
[8] J. Grohmann et al., "Monitorless: Predicting performance degradation in cloud applications with machine learning," in Middleware '19. ACM, 2019, pp. 149–162.
[9] C. Courageux-Sudan, A.-C. Orgerie, and M. Quinson, "Automated performance prediction of microservice applications using simulation," in MASCOTS '21, 2021, pp. 1–8.
[10] B. K. Aichernig et al., "Learning and statistical model checking of system response times," Software Quality Journal, vol. 27, no. 2, pp. 757–795, 2019.
[11] J. Tomak, "Performance Testing Infrastructure," 2022. [Online]. Available: https://github.com/jtpgames/LocustScripts
[12] "Regression analysis essentials for machine learning." [Online]. Available: http://www.sthda.com/english/articles/40-regression-analysis/
[13] J. Brownlee, "How to remove outliers for machine learning," 2020. [Online]. Available: https://machinelearningmastery.com/how-to-use-statistics-to-identify-outliers-in-data/
[14] Scikit-learn Development Team, "Model selection and evaluation." [Online]. Available: https://scikit-learn.org/stable/model_selection.html
[15] DIN EN 50136-1:2012-08, "Alarm systems – alarm transmission systems and equipment – part 1: General requirements for alarm transmission systems," Aug. 2012.
[16] ISO/IEC, "ISO/IEC 2382:2015: Information technology – Vocabulary," ISO, Tech. Rep., 2015.
[17] G. Van Rossum and F. L. Drake, Python 3 Reference Manual. CreateSpace, 2009.
[18] F. Pedregosa et al., "Scikit-learn: Machine learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
[19] The pandas development team, "pandas-dev/pandas: Pandas 1.0.5," Jun. 2020. [Online]. Available: https://doi.org/10.5281/zenodo.3898987
[20] C. Byström, J. Heyman, J. Hamrén, and H. Heyman, "Locust." [Online]. Available: https://docs.locust.io/en/stable/what-is-locust.html
[21] Databricks, "Extract-Transform-Load." [Online]. Available: https://databricks.com/glossary/extract-transform-load
[22] D. Beazley, "Understanding the Python GIL," in PyCon Python Conference, 2010.
[23] F. Keti and S. Askar, "Emulation of software defined networks using mininet in different simulation environments," in ISMS '15, 2015, pp. 205–210.
