Patterns for Performance and Operability: Building and Testing Enterprise Software
AUERBACH PUBLICATIONS
www.auerbach-publications.com
To Order Call: 1-800-272-7737 • Fax: 1-800-374-3401
E-mail: [email protected]
This book contains information obtained from authentic and highly regarded sources. Reprinted
material is quoted with permission, and sources are indicated. A wide variety of references are
listed. Reasonable efforts have been made to publish reliable data and information, but the author
and the publisher cannot assume responsibility for the validity of all materials or for the conse‑
quences of their use.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced,
transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or
hereafter invented, including photocopying, microfilming, and recording, or in any information
storage or retrieval system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access www.
copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC)
222 Rosewood Drive, Danvers, MA 01923, 978‑750‑8400. CCC is a not‑for‑profit organization that
provides licenses and registration for a variety of users. For organizations that have been granted a
photocopy license by the CCC, a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and
are used only for identification and explanation without intent to infringe.
QA76.76.D47P3768 2008
005.1‑‑dc22 2007030244
Dedications......................................................................................................v
The Purpose of This Book.............................................................................. xv
Acknowledgments........................................................................................xvii
About the Authors.........................................................................................xix
1 Introduction............................................................................................1
Production Systems in the Real World..........................................................1
Case 1—The Case of the Puzzlingly Poor Performance.......................2
Case 2—The Case of the Disappearing Database................................5
Why Should I Read This Book?....................................................................7
The Non-Functional Systems Challenge.......................................................8
What Is Covered by Non-Functional Testing...............................................9
Planning for the Unexpected......................................................................10
Patterns for Operability in Application Design...........................................11
Ensuring Data and Transaction Integrity..........................................11
Capturing and Reporting Exception Conditions in a Consistent
Fashion...................................................................................11
Automated Recovery from Exception Conditions..............................14
Application Availability and Health...................................................14
Summary....................................................................................................14
2 Planning and Project Initiation............................................................17
The Business Case for Non-Functional Testing...........................................17
What Should Be Tested.....................................................................17
How Far Should the System Be Tested?.............................................19
Justifying the Investment...................................................................20
Negative Reasoning...........................................................................21
Scoping and Estimating..............................................................................22
Determining the Scope of Non-Functional Testing...........................22
Estimating Effort and Resource.........................................................26
Estimating the Delivery Timeline......................................................29
Capacity.......................................................................................... 153
Change Management......................................................................154
Historical Data.........................................................................................154
Summary.................................................................................................. 157
7 Test Preparation and Execution..........................................................159
Preparation Activities................................................................................ 159
Script Development..................................................................................160
Validating the Test Environment.....................................................164
Establishing Mixed Load.................................................................164
Seeding the Test Bed.......................................................................167
Tuning the Load..............................................................................167
Performance Testing.................................................................................171
Priming Effects................................................................................172
Performance Acceptance..................................................................173
Reporting Performance Results....................................................... 176
Performance Regression: Baselining................................................177
Stress Testing...................................................................................181
Operability Testing...................................................................................181
Boundary Condition Testing...........................................................182
Failover Testing...............................................................................183
Fault Tolerance Testing....................................................................186
Sustainability Testing...............................................................................188
Challenges................................................................................................192
Repeatable Results...........................................................................193
Limitations......................................................................................193
Summary..................................................................................................194
8 Deployment Strategies........................................................................195
Procedure Characteristics.........................................................................196
Packaging.................................................................................................197
Configuration..................................................................................197
Deployment Rehearsal..............................................................................198
Rollout Strategies......................................................................................198
The Pilot Strategy............................................................................198
The Phased Rollout Strategy............................................................199
The Big Bang Strategy.....................................................................199
The Leapfrog Strategy..................................................................... 200
Case Study: Online Banking................................................................... 200
Case Study: The Banking Front Office.....................................................202
Back-Out Strategies................................................................................. 204
Complete Back-Out........................................................................ 204
Partial Back-Out............................................................................. 204
Summary..................................................................................................256
11 Troubleshooting and Crisis Management...........................................257
Reproducing the Issue...............................................................................257
Determining Root Cause..........................................................................258
Troubleshooting Strategies........................................................................259
Understanding Changes in the Environment...................................259
Gathering All Possible Inputs..........................................................261
Approach Based on Type of Failure.................................................263
Predicting Related Failures..............................................................265
Discouraging Bias........................................................................... 268
Pursuing Parallel Paths................................................................... 268
Considering System Age..................................................................269
Working Around the Problem.........................................................269
Applying a Fix..........................................................................................270
Fix versus Mitigation versus Tolerance.............................................270
Assessing Level of Testing................................................................271
Post-Mortem Review................................................................................272
Reviewing the Root Cause...............................................................272
Reviewing Monitoring.....................................................................272
Summary..................................................................................................275
12 Common Impediments to Good Design.............................................277
Design Dependencies............................................................................... 277
What Is the Definition of Good Design?..................................................279
What Are the Objectives of Design Activities?.................................279
Rating a Design...............................................................................281
Testing a Design...................................................................................... 286
Contributors to Bad Design.............................................................287
Common Impediments to Good Design..................................................287
Confusing Architecture with Design.............................................. 288
Insufficient Time/Tight Timeframes.............................................. 288
Missing Design Skills on the Project Team..................................... 288
Lack of Design Standards............................................................... 288
Personal Design Preferences.............................................................289
Insufficient Information...................................................................289
Constantly Changing Technology...................................................289
Fad Designs.....................................................................................290
Trying to Do Too Much..................................................................290
The 80/20 Rule................................................................................290
Minimalistic Viewpoint...................................................................290
Lack of Consensus...........................................................................291
Constantly Changing Requirements................................................291
References.........................................................................................................295
Index.................................................................................................................297
Intended Audience
This book is intended for anyone who has a lead architectural, design, or busi-
ness role on a systems project, including those working on the projects, the project
sponsors, or those directly benefiting from the results. It also encompasses a wide
range of roles and responsibilities, including project sponsors, executives, directors,
project managers, program managers, project leaders, architects, designers, lead
developers, business users, consultants, leads and testers, and other resources on a
project team.
This book can be read and applied by beginners or experts alike. However, we do
assume that the reader has some knowledge of the basic project development lifecycles
and concepts of both functional and non-functional requirements.
This book represents the authors’ accumulated experience and knowledge as tech-
nology professionals. Tackling difficult problems is thoroughly enjoyable when it is
in the company of passionate and talented colleagues. Accordingly, we acknowledge
the following individuals who have unknowingly contributed to this book: Andrew
Adams, Daniil Andryeyev, Poorna Bhimavarapu, Neil Bisset, Lucy Boetto, Debi
Brown, Dave Bruyea, Ryan Carlsen, Steve Carlson, Cono D’Elia, Olivie De Wolf,
David Dinsmore, Bruno DuBreiul, Marc Elbirt, James Fehrenbach, Peter Fer-
rante, Oleg Fonarev, Kenny Fung, Michael Gardhouse, Wayne Gramlich, Frank
Haels, Michael Han, Mark Harris, John Hetherington, Eric Hiernaux, Steve Hill,
David Howard, Steve Hu, Hiram Hsu, Anand Jaggi, Tommy Kan, John Krasnay,
Andre Lambart, Marcus Leef, Robert Lei, Clement Ma, Richard Manicom, Ron-
nie Mitra, Odette Moraru, Shankara Narayanan, Nader Nayfeh, David Nielsen,
Rich O’Hanley, Kevin Paget, Alex Papanastassiou, Ray Paty, Cris Perdue, Neil
Phasey, Betty Reid, Adam Scherer, Sean Shelby, Dan Sherwood, Manjit Sidhu,
Greg Smiarowski, David A. Smith, Gilbert Swinkes, Chris Tran, Brent Walker,
Mark Williamson, John Wyzalek, Bo Yang, Eric Yip, and George Zhou. We also
acknowledge the talents of Richard Lowenburg in Toronto, Ontario for his contri-
bution to the illustrations in this book.
Based in Toronto, Ontario, Chris Ford has extensive experience providing hands-
on technical and strategy consulting services to large organizations throughout the
United States and Canada. Chris is currently a managing principal with Capital
Markets Company (Capco) (http://www.capco.com/), specializing in highly avail-
able software systems for the financial services industry. He is a graduate of the
University of Waterloo’s Systems Design Engineering program.
Introduction
n Based on performance test results, the application code itself was capable of supporting the expected performance, assuming the required system resources were available.
n Performance degradation was continuing over time. However, there was no
indication as to whether the degradation was correlated to time itself or to the
rollout to additional bank branches.
n Performance degradation was sporadic at first but had become more and
more consistent over time.
n At the start of the day, performance appeared to be generally within the
required range, but quickly worsened over the course of the day.
n Based on the infrastructure test results, the organization was correct in con-
cluding that the infrastructure itself did not appear to be at fault.
Based on the preliminary assessment, the team decided to profile the J2EE application container's key resources, including memory, threads, connection pools, and other internal resources. In doing so, the following observations were quickly captured:
n The application container (in this case, a Java virtual machine) was executing full garbage collections at a much higher frequency than expected. A garbage collection is a memory management service performed by the application container.
n Each time a full garbage collection was executed, all application threads were
put on hold. In other words, all business processing was paused until the
completion of the memory management task.
n Garbage collection operations were taking an average of 20 seconds to
complete.
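A lightweight way to confirm this kind of behavior from inside the application is to poll the JVM's own garbage collection statistics. The sketch below is illustrative only; the one-minute reporting interval and the console destination are assumptions, not details from this case study.

    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    // Minimal sketch: periodically report cumulative GC counts and times so that
    // unusually frequent or long collections become visible in the application log.
    public class GcReporter implements Runnable {
        @Override
        public void run() {
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                System.out.printf("GC %-20s count=%d totalTime=%dms%n",
                        gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
            }
        }

        public static void main(String[] args) {
            Executors.newSingleThreadScheduledExecutor()
                    .scheduleAtFixedRate(new GcReporter(), 0, 60, TimeUnit.SECONDS);
        }
    }

In practice, much the same information can be obtained without code changes by enabling the JVM's garbage collection logging (for example, the standard -verbose:gc option), which is how a profiling exercise of this kind would typically begin.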
Based on this new information, the team was confident that it had identified the culprit. In collaboration with the development team, the expert team fine-tuned the memory parameters, and peace was restored to the organization.
Everyone thanked the technology team for their heroic efforts and late nights, and congratulated them on solving the problem. Only a few questions loomed; foremost amongst them: could this incident have been avoided? What was missing from the test-case coverage that would have identified this issue in advance of production deployment? How does the root cause explain the chronology of events observed in production?
In answering these questions, we begin to see the need for a structured approach
to non-functional design and implementation.
1. How does one explain the events that were observed in production?
Observation: System resources were not strained. More specifically, the problem was memory exhaustion, yet there was abundant memory on the server.
Explanation: Application containers allocate a certain amount of system resources, including memory, threads, and connection pools. Applications running within a container cannot use more resources than have been allocated. Ample system memory is of no use to an application running inside an application container that imposes limits.

Observation: Performance was initially good or at acceptable levels.
Explanation: The memory usage profile changed dramatically with the introduction of more users to the system. More specifically, as time elapsed, objects were accumulating in memory with a life span of many days. This problem intensified with the introduction of additional users.

Observation: Performance testing with 5,000 simulated users did not show any performance degradation.
Explanation: While conducting short-term tests, even with a large number of simulated users, the state of the memory is fairly clean. When the system is used over a long period of time without restart, the memory saturates and there is a need for frequent full garbage collections (assuming the memory heap size was not set appropriately to begin with).

Observation: Performance was not consistent.
Explanation: In the production architecture, there were several servers load balancing user requests. If one of the servers was restarted, response time would return to normal for a short period until full garbage collections would resume at a high frequency.
2. What was missing in the test coverage that would have allowed us to see this
problem in advance of production deployment?
Despite good performance test coverage, the project team did not fully
simulate production conditions. In this case, the impact of long system
uptime under heavy user operation was not factored into the test plan. Later
in this book, we will formally introduce the concept of sustainability test-
ing, which is designed exactly to avoid this type of incident. Sustainability
tests simulate long periods of system operation under various loads to observe the resource and performance trends that only emerge after extended uptime.
record the state of its processes or retain its data. The infrastructure team worked
diligently to correct the issue and was able to restore service on the secondary data-
base server in less than 4 hours.
During the 4-hour outage, users were frantically trying to assess the state of their transactions. In some cases, they manually posted transactions through to the book of records system outside of the normal application-driven process.
Once the application had been restored, it became clear that many of the trans-
actions that were initiated prior to the outage were now in an indeterminate or
broken state. In other words, business users were unable to continue processing
in-flight transactions. The business operations team was forced into the unenviable
position of having to investigate each transaction individually. In many cases, a
labor-intensive manual process was required to complete the transaction outside of
the application. The full recovery became a very slow and painful exercise for both
business and technology operations as technology was required to produce a variety
of ad-hoc reports.
Over the next four weeks, the database server infrastructure failed an additional 4–5 times due to a combination of human error and system configuration issues. Not surprisingly, business operations demanded that the technology team develop applications that were resilient. Resilient applications were defined as applications that can recover from a major infrastructure failure in a consistent, mostly automated way (a few manual exceptions were allowed).
The development team was faced with a challenging new requirement for which
they had no prior experience. They had always assumed that the infrastructure
would be there to support the application, based on the extensive investment that
had been made in the highly available, fault-tolerant infrastructure. Ultimately
the team met the challenge in two ways. First, they developed a set of recovery
procedures and tools that would improve the efficiency and accuracy of any busi-
ness recovery in the event of an application outage. Second, they enhanced the
application design such that transactions would reliably transition to a state from
which they could be recovered by the previously mentioned recovery procedure
and tools.
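To make the second part of that approach concrete, the following is a simplified, hypothetical sketch (the states, store interface, and method names are invented for illustration and are not taken from this case study) of a design in which every transaction carries an explicit persisted state, and a recovery routine re-drives or flags anything left in flight after an outage.

    import java.util.List;

    // Hypothetical sketch of a recoverable transaction design: every transaction is
    // persisted with an explicit state so that a post-outage scan can finish it or
    // route it to a manual work queue instead of leaving it "broken".
    public class TransactionRecovery {

        enum TxState { RECEIVED, SENT_TO_BOOK_OF_RECORDS, CONFIRMED, FAILED }

        interface TxStore {                               // persistent transaction store
            List<String> findByState(TxState state);
            void markState(String txId, TxState newState);
        }

        private final TxStore store;

        TransactionRecovery(TxStore store) {
            this.store = store;
        }

        // Called on restart after a failure.
        void recoverInFlight() {
            for (String txId : store.findByState(TxState.RECEIVED)) {
                resubmit(txId);                           // nothing was sent yet; safe to retry
            }
            for (String txId : store.findByState(TxState.SENT_TO_BOOK_OF_RECORDS)) {
                if (confirmedDownstream(txId)) {
                    store.markState(txId, TxState.CONFIRMED);
                } else {
                    store.markState(txId, TxState.FAILED); // surfaced for manual repair
                }
            }
        }

        private void resubmit(String txId) { /* re-drive the transaction idempotently */ }

        private boolean confirmedDownstream(String txId) { return false; /* query the book of records */ }
    }

The essential design choice is that every transaction is always in a known, persisted state, so recovery becomes a mechanical scan rather than a transaction-by-transaction investigation.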
The combination of these two efforts resulted in a system that was resilient to most
types of failures, meaning that recovery would be automated and transparent to busi-
ness users in most cases. The development approach leveraged the following technolo-
gies and approaches:
This book is designed with the software development lifecycle in mind. Our
hope is that you will be able to use it throughout your implementation as a reference
and guide for achieving highly available systems.
In summary, while the UAT is well designed to verify and accept the functional aspects of an application, it is not designed to test and verify any of the key non-functional aspects of the application. This translates into the need for a set of dedicated environments for the sole purpose of testing for non-functional requirements and certification. At a minimum, the requirement would be to have a project non-functional test environment and a non-functional certification environment. We will discuss the characteristics of these environments in more detail in the chapters ahead.
The key challenges when designing and executing non-functional tests are:
n The tests are executed in an environment that is not the final target environ-
ment and as such needs to predict the behavior of the target environment.
n Some of the tests are designed to test for unforeseen conditions in produc-
tion. These need to be simulated to the best of our ability.
n The scope and characteristics of the tests are based on a predicted business
utilization model that may or may not accurately predict the real usage of
the application.
n Performance tests for online and offline key activities (such activities to be
defined).
n Capacity test, to allow for a reasonable capacity plan.
Examples of some more detailed design patterns that draw on the above prin-
ciples are listed below. This topic will be covered in further detail in Chapter 4,
“Designing for Operability.”
Table 1.2 Non-Functional Test Inventory

Test Type: User Online Performance
Description: Testing of online response time as observed by a user of the application.
Expected Outcome: For each test case we would measure the average, maximum, minimum, and 90th percentile response times.
Comments: Typically measured as the time it takes the application to render the next page of the application.

Test Type: System Online Performance
Description: Testing of online system-to-system response time (i.e., the time it takes one system to respond to a request by another system).
Expected Outcome: For each test case we would measure the average, maximum, minimum, and 90th percentile response times.
Comments: The request can be synchronous or asynchronous; in both cases the time for the complete response to arrive will be measured.

Test Type: Offline Performance
Description: Testing of an offline activity, which could be a bulk operation that happens during the availability window or a batch operation that takes place outside of the availability window.
Expected Outcome: Average time to complete the full operation (bulk or batch) and a profile of the performance of each component of the offline operation.
Comments: In most cases, the batch operation would be broken into subcomponents, profiling each component for potential improvements.

Test Type: Component Failover
Description: Testing of the system recoverability when critical components are failed over to the redundant component.
Expected Outcome: For each component we expect to see the system recover with no data or transaction lost. We observe and measure the time to recover, the number of errors reported, and any loss of data or transactions.
Comments: All critical redundant components should be tested. Some examples are the message broker, database server, application server, and disk volume.

Test Type: Capacity
Description: Testing of the system capacity requirements at peak volumes with transaction and user volumes that are based on the business-utilization model as stated in the non-functional requirements.
Expected Outcome: While the application is running at peak volumes for a period of at least an hour (for stabilization), we measure system resource utilization such as memory, CPU, disk, and network bandwidth on all application tiers.
Comments: The requirements should include a projection for a period of at least one year to allow for all volumes anticipated a year in advance. This test is greatly facilitated by monitoring tools such as HP OpenView or Mercury to capture and record resource usage during the execution of the test.

Test Type: Sustainability
Description: Testing of application resource management behavior over time.
Expected Outcome: Monitor trending of resource availability and application server behavior over time. For example, we would monitor the following: database connection pools, threads, and MQ connection factories; full and minor garbage collection frequency and duration; memory recovery over time; and so on.
Comments: This test would allow us to observe behavior that would occur in production when the system is not recycled for a lengthy period of time. We can observe memory leaks, connection leaks, memory tuning requirements, and connection and thread pool configuration (high-water marks).

Test Type: Operability
Description: This is a broad category of testing that measures the system behavior under a variety of miscellaneous conditions. A typical example is boundary-condition testing in which the system is subjected to highly unexpected inputs outside the functional range.
Expected Outcome: For each defined test case we observe the application for errors being reported, potential data and transaction loss, and recoverability once the component is available again.
Comments: The challenge with this type of testing is to identify the critical elements that should be tested. The number of permutations of test cases that can be created is typically very large, and careful scoping and rationale must be applied.
Summary
While the functional-requirements side of software engineering has evolved and
improved over the years, non-functional requirements are still very much an after-
thought on most IT projects. In many cases it would seem that the scope of non-
functional requirements and testing is limited to performance and load testing,
excluding critical elements of non-functional requirements such as capacity plan-
ning, operability, monitoring, system health checks, failover, memory manage-
ment, and sustainability, among others.
More than ever, technology executives, managers, and professionals are aware
of the gap in the definition, design, testing, and implementation of systems’ non-
functional requirements. Such a gap has consistently been the cause for significant
system outages and loss of credibility for IT organizations.
User and business communities have become better at defining their functional
requirements, but are clearly not able to articulate the non-functional characteris-
tics of the system in more than broad terms. It is up to the technology community
to build the tools, templates, and methodologies needed to extract the correct level
of detailed requirements, challenge the business as to their real requirements, design
applications with operability in mind, ensure sufficient non-functional test cover-
age, and implement ongoing monitoring tools that will guarantee high availability.
This book addresses the development of scope, requirements, design patterns, test strategies, and coverage and deployment strategies for a system's non-functional characteristics. It can be used as a detailed implementation guide by technologists at all levels, and as a conceptual guide that helps educate the business and user communities on the importance of a system's non-functional requirements and design activities.
The detailed material provided in the chapters ahead is based on years of experi-
ence in designing, building, tuning, and operating large complex systems within
demanding mission-critical environments. This book is filled with practical exam-
ples and advice that can be leveraged immediately to assist your current projects.
Planning and Project Initiation
task in detail. For the moment, let us just say that a different approach, based on observation and prediction of usage patterns as well as a decomposition of all system components for availability analysis, will be required to identify the scope of testing.
At a minimum, project teams should perform the following non-functional tests:
Online Performance
n Test and report all critical functions identified in the requirement
documentation
n Test for online response anomalies (responses that are well above an accept-
able range)
n Test response time for the above tests under simulated load
Batch Performance
n Test and report the overall batch processing time as well as each individual
component of the batch process
n Test the recovery time for failure during the batch processing window
Capacity Test
n Run the application/systems for a minimum period of one hour with full
simulation of one- to two-year projected utilization at peak usage
n Observe system resource utilization during the test (including central pro-
cessing unit [CPU], memory, execution threads, database [DB] connections,
etc.)
n Overlay resource utilization results on top of a current production baseline
and report overall system resource capacity requirements
n Determine any additional resource requirements based on the results of this
test
Failover Test
n Analyze all key failure points in the system
n Initiate failover conditions under load, based on a predicted utilization model, to generate a representative range of in-flight transactions during the test
n Fail component by component, and observe failover functionality and num-
ber of transactions impacted
n Confirm that all in-flight transactions are recovered on the failover system
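Commercial and open-source load tools are the usual way to drive the response-time tests listed above, but the underlying measurement is simple enough to sketch. The fragment below is illustrative only: the target URL, user count, and iteration count are placeholders, and a real test would also script realistic navigation and think times.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    // Minimal load driver: a fixed number of virtual users issue requests and the
    // driver reports the average, 90th percentile, and maximum response times.
    public class SimpleLoadDriver {
        public static void main(String[] args) throws Exception {
            final int users = 50, iterations = 20;                        // placeholder load profile
            final URI target = URI.create("http://localhost:8080/login");   // placeholder URL
            HttpClient client = HttpClient.newHttpClient();
            List<Long> latencies = Collections.synchronizedList(new ArrayList<>());

            ExecutorService pool = Executors.newFixedThreadPool(users);
            for (int u = 0; u < users; u++) {
                pool.submit(() -> {
                    for (int i = 0; i < iterations; i++) {
                        long start = System.nanoTime();
                        try {
                            client.send(HttpRequest.newBuilder(target).GET().build(),
                                    HttpResponse.BodyHandlers.discarding());
                        } catch (Exception e) {
                            // a real test would count and report failures separately
                        }
                        latencies.add((System.nanoTime() - start) / 1_000_000);  // milliseconds
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(10, TimeUnit.MINUTES);

            List<Long> sorted = new ArrayList<>(latencies);
            Collections.sort(sorted);
            long avg = sorted.stream().mapToLong(Long::longValue).sum() / sorted.size();
            System.out.printf("samples=%d avg=%dms 90th=%dms max=%dms%n", sorted.size(), avg,
                    sorted.get((int) (sorted.size() * 0.9) - 1), sorted.get(sorted.size() - 1));
        }
    }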
In addition to the minimum set of tests above, there are many additional tests
that are critical to understanding and verifying the non-functional behavior of your
application and system. These are discussed extensively throughout this book and
are highly recommended. However, it is expected that projects will have different
areas of focus and criticality of functions, which will dictate the necessity for cer-
tain additional tests.
It is clearly not feasible to test every condition and permutation. It is also not feasible to test all possible data combinations. We are therefore presented with the question, “How far should we test?”
The answer to this question is governed by the following parameters and the amount of flexibility or degree of freedom you can exercise on each:
For instance, if there is a unique type of data set that is very difficult to create and that risks not performing as well as required, one may opt to mitigate that risk by noting the very low occurrence of this data set in production and by monitoring production performance to implement any additional fine-tuning as needed. This may save considerable time and money for a large implementation effort.
Another common question is, “How much load should be put on the system during testing?” One can argue that adding increasing load to the point of system failure may be interesting to graph, so that management is aware of the load under which the system will break. While this could be a useful test, the value it adds is marginal: it is typically sufficient to test the system under the anticipated peak load, and under two times that load for the event of a failover condition. Again, testing the extreme condition, while satisfying some need to know on the part of management, may be a costly exercise that yields information that might never prove useful.
In summary, the extent of non-functional testing should be driven entirely by the level of risk assumed by not testing a given scenario and by the cost of executing such a test. You may find that management, faced with the cost of executing a certain test and the real likelihood of the scenario actually occurring in production, may be less inclined to invest the funds.
n Accurate measurement
n Reduction in functional test interruption
n Production simulation (size, capacity, performance, operability)
n Certification with exact production-like configuration
n Flexible scheduling of tests in parallel with other testing activities
Negative Reasoning
In the event that all the reasoning and business casing for non-functional test
investment falls on deaf ears, it is useful to document all the risks to which the
business will be subjected due to lack of investment in non-functional testing. This
can be used to achieve the following goals:
n Clearly communicate the risks and potential issues that have a high probability of occurring in production
n Create a sense of accountability with management for any future potential
issues
n Potentially reverse the decision not to invest based on the two previous
points
n Ensure that the technology team has done all in its power to alert management to the risks the project is about to undertake
Clearly communicating and documenting these risks for management, business, and sponsors has, more often than not, influenced the decision to invest in non-functional testing.
n Response time for key functions and in some cases a broad statement regard-
ing online response time
n Expected delivery time for reports, files, or other periodic artifacts that are
routinely generated by the application
n Expected system availability time (i.e., system uptime)
n Acceptable maintenance windows
n DR and BCP (business continuity plan) requirements
n Key reporting metrics on which the service level is measured
n Penalties associated with not meeting certain service levels
Using the above documents, the technology team can provide a statement of
non-functional testing scope that will describe the following.
Performance Testing
n Key functions to be tested and reported for response time, including expected
response time ranges
n List of bulk and batch processes and the expected time to execute each
n Transaction volumes average and peak
n User volumes average and peak
n Transaction and user volumes at peak and peak × 2 under which the system
will be tested
Capacity Testing
n “Transactions-per-minute” requirement for each transaction type to be
executed
n Number of users logged on to the system
n Number of concurrent users executing a variety of key functions
n Duration of the test
n System resources to be monitored
n Approach to measurement of baseline
Failover Testing
n List of all components to be tested
n Expected failover results (automated versus manual, number of retried
transactions)
n Criticality of automated failover per component
Operability Testing
n List of critical system functions to be tested
n Operability conditions to be tested against each system function
n Expected results and system alerts, reporting, and recovery
n Automated versus manual recovery
n Expected time for recovery
n Expected transaction state for each test
Sustainability Testing
n Duration of test
n Volume of users and transactions to be run during the test
n Data setup requirements
n Key metrics to track and report on (i.e., memory profile, threads,
connections)
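As a simple illustration of how metrics such as those listed above can be captured over a long-running test (a sketch only; production-grade sustainability tests would normally rely on dedicated monitoring tools, and the five-minute interval is an arbitrary choice), the JVM exposes heap and thread figures that can be sampled into a trend log:

    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryMXBean;
    import java.lang.management.ThreadMXBean;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    // Illustrative sampler: emits one trend record per interval so that memory or
    // thread growth over a multi-day sustainability run is easy to chart afterwards.
    public class SustainabilitySampler {
        public static void main(String[] args) {
            MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
            ThreadMXBean threads = ManagementFactory.getThreadMXBean();

            Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(() -> {
                long usedHeapMb = memory.getHeapMemoryUsage().getUsed() / (1024 * 1024);
                System.out.printf("%d,heapUsedMb=%d,liveThreads=%d%n",
                        System.currentTimeMillis(), usedHeapMb, threads.getThreadCount());
            }, 0, 5, TimeUnit.MINUTES);
        }
    }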
Certification Testing
n Which tests are included in certification
n Scope of configuration management (i.e., application only or including sys-
tem configuration)
n Code drop to be used for final certification
n Certification criteria—gating criteria for production deployment
the system conduct initial tests and proofs of concept to verify that the non-functional behavior around these “hotspots” is as expected.
n Memory management
n Transaction management and boundaries
n Code efficiency
n SQL code design for performance
n Data model simplification and design for performance (based on key high-
volume queries)
n Error and exception management
n Recovery code after failure events
All of the above early detection methods should be considered for inclusion in
the scope of the overall non-functional work to be conducted by the development
teams and the non-functional engineering team.
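One inexpensive form of early detection is a timing assertion in the developers' own unit tests. The sketch below uses a JUnit 5 style timeout assertion; the repository class, query, and 200-millisecond budget are all hypothetical placeholders rather than recommendations.

    import static org.junit.jupiter.api.Assertions.assertTimeout;

    import java.time.Duration;
    import java.util.Collections;
    import java.util.List;
    import org.junit.jupiter.api.Test;

    // Stand-in for a real data access object; a project would test its actual DAO.
    class CustomerRepository {
        List<String> findCustomersByBranch(String branchId) {
            return Collections.emptyList();   // placeholder for a real JDBC/JPA query
        }
    }

    // Hypothetical performance unit test: fails during development if the query
    // drifts past its time budget, long before formal performance testing begins.
    class CustomerQueryPerformanceTest {

        private final CustomerRepository repository = new CustomerRepository();

        @Test
        void highVolumeCustomerLookupStaysWithinBudget() {
            assertTimeout(Duration.ofMillis(200), () -> {
                repository.findCustomersByBranch("BRANCH-001");
            });
        }
    }

A test of this kind is not a substitute for formal performance testing, but it catches obvious regressions in a hotspot at the point where they are cheapest to fix.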
People Resources
This category includes the non-functional engineering test team, as well as any additional effort required from the development and project test teams. It is also good practice to include professional services and implementation costs for any net new infrastructure for new test environments, as well as the professional services associated with installing and configuring test and monitoring tools.
Test Tools
This category includes software tools that are required to simulate load, create and
execute test scripts, and report on test results. It also includes any required system
resource monitoring tools for observing, measuring, and trending system resource
utilization during capacity, sustainability, and failover testing.
Infrastructure Cost
This category includes all the hardware (HW) and software (SW) needed to construct the required non-functional test environments.
The exact estimates will be driven by the actual scope of non-functional work
defined for the project; however, the following guidelines and advice can be used
for high-level estimates.
People Cost
Test Tools
Many organizations make central investments in enterprise-wide software tools. In such cases, the project may benefit from existing licenses or may only be charged brown dollars (via internal allocation) rather than spending money on new licenses. It is important to review the set of tools that already exists within the organization and to ensure that these will satisfy the project's non-functional testing requirements.
In the event that test tools are not already available, the project may elect to
commit to a long-term investment in industrialized products that would benefit
the organization post-project or go with open-source tools to minimize direct cost
to the project.
In general, the number of licenses required should be calculated based on the expected transaction rate and the concurrent user load required to satisfy the application's business utilization models.
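As a purely illustrative calculation (all figures hypothetical): if the business utilization model calls for a peak of 16 transactions per second and each scripted virtual user spends roughly 60 seconds per transaction (think time plus response time), then the tool must sustain about 16 × 60 ≈ 960 concurrent virtual users, and licenses should be sized for at least that number plus a margin for stress and failover scenarios.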
Infrastructure Cost
This is the cost associated with the construction of the necessary non-functional test
environments. Later in this chapter we will discuss the various non-functional test
environments, the characteristics of each environment, and the conditions under
which these environments are required. Table 2.2 is a list of environments for non-
functional testing, their intended use, and basic cost considerations.
A common oversight is to forget the cost associated with running, supporting,
and maintaining these environments. Make sure to include all resource costs for
support and batch operation for these environments as well as deployments, cur-
rency updates, and software licensing costs.
Table 2.2 (continued)

Environment: Failover
Intended Use: Testing failover and recovery of all identified failover points and components.
Considerations: This environment must be isolated with all its components to allow for destructive and failure-condition testing without impacting any of the other nonproduction environments. It must contain all the components participating in the failover testing in a configuration that is similar to production. The size of the environment can be much reduced because the test scope would not include a large volume of transactions.
[Project timeline figure: non-functional activities plotted against the delivery schedule, including NFT scope and plan/strategy, NFT requirements, NFT script build, performance unit test, design/code reviews, initial performance test, failover tests, operability tests, capacity tests, sustainability test, certification tests, and ongoing monitoring of performance and operability.]
n Early code review and architectural reviews for non-functional aspects of the
application had been conducted successfully
n Early non-functional unit testing had been successfully executed and reported
by the development team
n The application code entering the integrated testing cycles is sufficiently sta-
ble to allow for non-functional testing
n No major redesign is required based on the results of the non-functional
tests
n Multiple environments are in existence to support both functional and non-
functional test activities
The key areas of parallelism that are accomplished by the above plan are as
follows:
n Failover and operability tests are conducted in parallel with integrated functional testing; early code and architectural reviews will ensure that the development team has invested the right level of focus and effort into making the application operable and responsive to failover conditions.
n Final capacity testing is conducted in parallel with the first cycle of user-accep-
tance testing (UAT); this is supported by an early cycle of capacity testing that
will confirm that the required system resources are available.
n Sustainability test is conducted in parallel with UAT, leveraging the code that
had been frozen as entry criteria into UAT. A potential code refresh can be
considered upon start of the second cycle of UAT.
[Adjusted project timeline figure: the same non-functional activities (NFT scope and plan/strategy, requirements, script build, performance unit test, design/code reviews, initial performance test, failover, operability, capacity, sustainability, and certification tests, and ongoing monitoring of performance and operability) replotted against the extended schedule described below.]
n Addition of an extra week of integrated testing to allow for any work required
based on the findings from performance, operability, and failover testing.
n Addition of an extra week of UAT to allow for the certification tests’ first (and
most likely final) cycle to complete prior to the end of UAT.
n Extension of failover, operability, and capacity testing schedules as well as
allowing an extra week for certification.
If time permits, you may want to consider moving the certification testing to a post-UAT activity, thereby ensuring that certification is the last testing activity and is conducted on fully frozen code and configuration.
Operability Testing
This is a test of likely production conditions that may affect the application/system
in an unpredictable way. The objective of the test is to identify the key operability
test cases with the highest potential impact to the system and the business using the
system, and test the application behavior under such conditions. Some examples of
operability tests include:
Failover Testing
This is a test designed to verify the failover design, configuration, and process by
simulating failover conditions. The objective of the test is to ensure that the systems can fail over as designed under load, with full in-flight transaction recovery, and to ensure that failover detection is triggered appropriately. Key considerations for this test include:
Capacity Testing
This test is intended to confirm and validate the capacity model that is developed
by the non-functional engineering team based on business utilization information
provided by the business requirements teams. The objective of the test is to run a
simulated production load for a sustained duration and monitor the utilization of
all relevant system resources. This test will finalize the hardware sizing require-
ments (i.e., the CPU, memory, disk, etc.); it would also confirm any additional
configuration of resource allocations. Key considerations for this test include:
Performance Testing
This test is intended to test the performance of online response as well as bulk
and batch operations. The objective of the test is to measure response time under
load and verify that it meets user requirements. For batch and bulk operations, the
objective is to test full data-load execution within the processing window defined in
the non-functional requirement. Key considerations for this test include:
n Invest in scripting of the test cases for consistency in execution, data buildup,
and measurement
n Invest in tools that provide for scripting and execution/results capture
n Test performance under simulated peak load
n Test and monitor performance in a sustained environment (environment that
has been running for a sustained period)
n Invest in preparing a data bank that would simulate 6–12 months of produc-
tion buildup
Sustainability Testing
This test is intended to measure and observe resource utilization in the application
and the infrastructure environment over a sustained period of time during which
daily load is simulated. The objective of the test is to identify any anomalies in
resource utilization over time, or any utilization trend that may suggest a potential issue (such as a memory leak, thread leak, or connection leak). Key considerations for this test include:
Certification Testing
This test is intended to verify all configuration settings in a production-like envi-
ronment. At a minimum this test would include a subset of the performance test
cases and a complete set of failover tests. In some cases the scope of a certification
test would also include a capacity test to ensure that capacity is calculated based on production-identical infrastructure. Key considerations for this test include:
Test Environments
Table 2.3 lists the test types, and the target environments in which these should
be executed. In addition, the table states the minimum requirements for such test
environments.
Environment: Performance Development
Minimum Requirements: Capacity to run all test scripts from a CPU, data, and disk standpoint. Tools for load and automated testing available.
Tests Executed: Script verification testing; performance test execution runs.

Environment: Development Integration Test
Minimum Requirements: Base requirement to allow for deployment of end-to-end functionality. No need for capacity, load, failover, or any other production-like configuration.
Tests Executed: Development integration test.
Simulator Developers
This group is responsible for the development of all simulators and injectors that are required to simulate the behavior of external systems, to seed data into the applications, or to generate load. The team would typically have a good code development background, with specific knowledge of databases, messaging systems, Web services, and other interface methods.
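As a small illustration of the kind of artifact this group produces (a sketch only; the credit-check endpoint, port, and canned payload are invented for the example), a stub for an external system can be as simple as an embedded HTTP server returning a fixed response:

    import com.sun.net.httpserver.HttpServer;
    import java.io.OutputStream;
    import java.net.InetSocketAddress;
    import java.nio.charset.StandardCharsets;

    // Illustrative simulator: answers like a downstream credit-check service would,
    // so load and failover tests can run without the real external system.
    public class CreditCheckSimulator {
        public static void main(String[] args) throws Exception {
            HttpServer server = HttpServer.create(new InetSocketAddress(9090), 0);
            server.createContext("/credit-check", exchange -> {
                byte[] body = "{\"decision\":\"APPROVED\"}".getBytes(StandardCharsets.UTF_8);
                exchange.getResponseHeaders().add("Content-Type", "application/json");
                exchange.sendResponseHeaders(200, body.length);
                try (OutputStream out = exchange.getResponseBody()) {
                    out.write(body);
                }
            });
            server.start();
            System.out.println("Credit-check simulator listening on port 9090");
        }
    }

A richer simulator would add configurable latency and error injection so that the same stub can also support operability and failover scenarios.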
Troubleshooters
Though this group is sometimes forgotten, its function is key to ensuring the smooth
execution of the test pack. Typically the project teams are too busy to provide any
significant support for troubleshooting non-functional test environments. Having a
troubleshooter on the team who can dissect the problem and pinpoint the area where
the issue resides will allow the team to move forward, bypassing the issue or getting
more focused help from the project team.
Communication Planning
Setting Expectations
As with any delivery-based work, setting expectations up front is a key to success. The non-functional test team must invest in setting expectations correctly from the start with two distinct groups: the project teams and the management/steering committee of the program.
n Defects may be identified late in the delivery cycle due to the nature of the testing
n The project team is expected to identify and test performance hotspots early
in the development test cycles
n The non-functional test team may conduct code review and make recom-
mendations for tuning and code performance improvements
n The non-functional test team will require its own environment that cannot be
shared with the functional testing activities
Summary
This chapter demonstrates the need to advance the scope and planning of the non-functional activities that are often overlooked when a project is initially business cased and budgeted. The planning required is extensive, and includes budgeting, business casing of cost and risk, determining scope, sizing the team(s), and determining the required environments and schedules for testing.
Engaging the technology leads early in the planning process will ensure that
all the considerations mentioned in this chapter are addressed head-on and will
provide for accurate planning and budgeting for execution.
Non-Functional Requirements
the support organization, and the development team. It can also introduce sloppy
errors and vulnerabilities into a system. Reacting to crisis after crisis in your produc-
tion environment is not an efficient way to build or maintain software, and will end
up costing your organization money, resources, and possibly its reputation.
Consider a scenario in which end users and the development team proceed
under the optimistic belief that the software will perform to an acceptable level.
But what is acceptable? What if the development team confidently releases software
that performs a given business operation in 2 seconds but the business is accus-
tomed to 0.5 seconds for the same operation? The day after you launch a new system
is not a good time to reconcile differing expectations.
From an end-user perspective, it is self-evident that the software must run fast,
that it must never crash, and that it must be free from any and all defects. In the
real world, we know that systems rarely meet these requirements with perfect effi-
cacy. Like any engineering activity, all system characteristics need to be specified
in writing to ensure that they are implemented and tested as part of the solution.
Documenting non-functional requirements serves the following critical benefits:
1. Serves as a basis for constructing a robust System Design: During design
and development, the implementation team knows exactly what behavior is
expected from the system.
2. Serves as a prerequisite for Non-Functional Testing: Non-functional
requirements give the QA (quality assurance) organization clear objectives
and the input it needs to generate representative test cases. Like any require-
ment, a non-functional requirement cannot be considered met until it has
been thoroughly tested.
3. Defines a Usage Contract with the End Users: The business users understand that the system is tested and rated to meet requirements for a designated load. If the end users triple the number of people using the system, they can no longer expect the same level of service when that load is outside of the documented usage parameters.
4. Provides a Basis for Capacity Planning: Depending on your application,
your system may or may not accommodate increasing volumes over time. For
many business applications, the level of usage is expected to increase as the
business itself expands. In these situations, capacity planning will need non-
functional requirements as input to infrastructure planning activities.
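By way of illustration only (the figures here are invented, not recommendations): a well-formed non-functional requirement for the online banking example used later in this chapter might read, "At a peak load of 500 concurrent users submitting up to 10 bill payments per second, the 90th percentile response time for bill-payment confirmation shall not exceed 2 seconds, and the service shall be available 99.9 percent of the time between 6:00 a.m. and midnight." A statement of this kind is specific enough to design against, to test, and to hold up as a usage contract.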
In this chapter we will define the types of requirements that are included in the
non-functional realm and we will look at how these requirements are derived from
business inputs. Important considerations will be illustrated using the example of
an online banking system. We will also visit the topic of roles and responsibilities,
where we see how an organization should approach the formulation of non-func-
tional requirements. At the end of this chapter, you will be familiar with the scope,
definition process, and terminology required to write meaningful non-functional
requirements.
able without this having a significant business impact, you may decide to be more
selective in documenting non-functional requirements.
In this chapter we make the assumption that your system meets most or all of
the following criteria in order to illustrate a formal, structured approach to defining
non-functional requirements:
analyst to vet the documentation to ensure it is complete. The technical test lead
has additional responsibilities as follows:
n Coaxing variations in the business usage out of the business analyst, i.e., ask-
ing leading questions to populate detail into the business model
n Challenging the defined requirements to ensure that realistic targets are being
proposed for the application
n Helping the business analyst to understand where details are important in
order to properly test the system
n Helping to provide detailed content for the business analyst to include in the
requirements documents
Challenging Requirements
Requirements are generated by analysts who do their best to document what users
want, but users may not always know what they want—or they may change their
mind after seeing a product. Furthermore, analysts do not always ask the right ques-
tions, nor do they always interpret the responses they receive accurately. Subject mat-
ter experts and consultants, in general, are notorious for imposing a view of the world
on users who may or may not fully agree with the picture that is being presented.
Requirement swapping is a term invented by one of the authors after many years
of trying to implement badly formulated requirements. There were many times
when a requirement clearly expressed intent in a way that was technically complex
to implement. An implementer could offer an alternative that would satisfy the
user’s intent but not necessarily meet their requirement verbatim. For cases like this,
proposing to swap the complex requirement for a simpler, more natural one (often one the user likes even better) can build consensus. In many circles this activity is part of a broader discipline referred to as requirements engineering.
If you take the time to review requirements with the technical team, you will
usually avoid future disconnects that result in wasted effort. In general, the amount
of time that the technical team spends implementing a requirement should be
proportional to the value of the requirement. Performance requirements can be
great illustrations of this concept. A business user may arbitrarily decide that log-
in should take no more than one second for any request. In the technical design,
there is a robust, reusable log-in service available from another system but it only
supports a two-second response time. In talking to the sponsor, we learn that most users will log in once or twice a day at most. Users are internal, so we don't have to worry about
a competitor offering a faster login. In this case it should be easy to convince the
sponsor to relax the one-second login requirement, especially if the technical team
is willing to commit to a faster response time for another part of the system that is
more frequently accessed.
In some instances you may have to explain to the users exactly what it is they are asking for. We have encountered situations in which the users expected a thousand people to use the system concurrently and therefore expected the system to have the capacity to handle a thousand transactions per second. Each user, however, had to fill in a lengthy form for each transaction, which would take at least a minute, making it physically impossible for any one user to submit one transaction per second. Once this was explained to the user community, everyone agreed to a more realistic requirement of roughly 17 transactions per second (1,000/60 ≈ 16.7).
Non-functional requirements can be expensive to test and accommodate in the
technical design. It is prudent to ensure that all parties with a stake in the system
understand the effort and expected benefit for each requirement.
If you are introducing a new software product or service, you may have no
empirical basis whatsoever for establishing a usage model. In this case, the usage
model is entirely theoretical and based on predicted adoption and usage.
Some aspects of human usage are very difficult to estimate without actual
observation of users on the new system. If you give users two buttons to press that
perform slight variations of the same function, which button will they press? If
you give users a suite of new functionality based on a consultative requirements
gathering process, which features will they actually use and in what proportions?
Users themselves can only tell you how they think they will use a software system.
To make matters worse, the users who participate in your requirements gathering
may or may not represent the perspective of the majority of end users. If consultants
or management have the majority influence in the functional specification of the
system, they may totally misrepresent the behavior of the user community when
the system is actually in production. Malcolm Gladwell, in his book Blink, makes
convincing arguments that in a very large number of situations, users make totally
inaccurate predictions of their own behavior.
The subtleties of human interaction with a user interface are difficult to antici-
pate; however, the number of business transactions that users will initiate is usually
measurable—or, at least, more readily predictable. We consider business transac-
tions as coarse inputs in the usage model. The number of times that users click a
button or the number of times that users encounter validation failures is incidental
to the number of coarse inputs. If a customer is using an online banking system to
pay a bill, each bill payment is a good example of a coarse input. A coarse input is
a high-level functional activity. It includes all the nuances of how each customer
pays a bill.
Software systems often have nonhuman inputs. Machine inputs to your system
are equally important aspects of your usage model. Many complex systems func-
tion based on a combination of human and machine inputs. Machine inputs can
be continuous feeds or can come in batches. Continuous feeds are requests or data
inputs that arrive on a continuous and unscheduled basis. Batch inputs are a bulk
series of requests or inputs that usually arrive and are processed as a single unit of
work. Batch inputs are often, but not necessarily, scheduled interactions with your
system. The characteristics of batch inputs may be predictable or unpredictable in
nature. Consider a scenario in which an insurance company must process car insur-
ance applications and make approval recommendations on a nightly basis. Such a
system may involve the collection of application requests from multiple channel
front-end systems. Will the number of applications be constant over time? Will the
number of applications be subject to seasonal or time-of-month variations? In the
real world it is likely that the size of the batch input to the system will vary with
time. Again, this variation needs to be accounted for in your model. A usage model
that fails to anticipate a surge in insurance applications at month-end is not a rep-
resentative usage model.
The first step in establishing your business usage involves quantifying it for both
human and machine inputs. This can be done by answering the following questions:
Human Inputs
- For human inputs, what is the operations window for the software system?
- For each class of user, how many users are in the user population now? Projected in one year? Projected in five years?
- For each class of user, how many coarse inputs do we expect on average and as a maximum in the operations window?
- For each class of user, what is the distribution of coarse inputs in the operations window?
- In particular, what is the busiest interval for the system with respect to the creation of coarse inputs?
Machine Inputs
- For machine inputs, what is the operations window for the software system?
- How many interfaces supply machine inputs to the system?
- For each interface, what is the expected and maximum number of coarse inputs now? Projected in one year? Projected in five years?
- For each interface, what is the distribution of coarse inputs in the operations window?
- In particular, what is the busiest interval for the system with respect to the creation of coarse inputs?
Based on historical business reporting for the legacy online system, coarse
inputs for each class of user are expected to be as shown in Table 3.2. The data is
shown by month.
In scrutinizing historical data, it is clear that there is seasonal variation for a
number of coarse inputs. Logins remain constant throughout the year, but bill
payments are highest in January and lowest in July and August. Not surprisingly,
it appears that people are on holiday during the summer and pay bills most actively
following the busy Christmas shopping season, in January.
In looking at the weekly volumes, it also appears that peak usage is from 12:00
pm (noon) to 1:00 pm during the day. During this interval, 30% of the daily volumes
are typically completed. The busiest day of the week is Friday, as this corresponds to
the day following Thursday, when many employees receive weekly paychecks. This
one-hour period qualifies as our busiest interval in the business usage for human
input. Non-functional design and testing is all about worst-case scenarios. If the bank's systems can accommodate the busiest hour of the busiest day of the year, then we can be confident that they will handle all other intervals. We add this parameter, the busiest interval, to the business usage (as shown in Table 3.3).
To the best of our abilities we have adequately quantified the usage of the sys-
tem based on human interactions. However, we are not done yet. We must also
describe the system in terms of machine inputs. The new online banking platform
is expected to have at least three interfaces that will accommodate machine coarse
inputs. A nightly job is expected to extract a report that captures all online cus-
tomer actions for business analytics. Further, another job is expected to produce an
extract file that captures all bill payments made on the online banking platform for
Table 3.2 Expected coarse inputs for each class of user, shown by month (12 monthly values per transaction)

Login:            65,645  65,656  66,547  6,875   45,353  47,765  76,576  65,474  76,586  56,363  47,574  74,547
Account Inquiry:  4,535   45,435  52,451  52,534  75,676  76,575  76,576  75,766  76,757  65,656  65,463  45,654
Bill Payment:     45,345  56,433  36,363  64,463  74,756  45,356  65,533  5,646   65,465  5,353   45,645  63,635
Funds Transfer:   63,463  53,646  5,346   6,346   36,361  63,446  63,356  63,535  36,635  65,363  65,346  43,213

(Table 3.3 excerpt: peak daily Login volume of 2,309,039 on the busiest day of the year.)
available continuously from the source system, we will designate the machine input
as continuous. The source system only operates from 7:00 am to 10:00 pm daily,
Monday to Friday. The source system enjoys a maintenance window nightly from
10:00 pm to 7:00 am and on weekends.
The business analyst has met with the customer marketing organization, and
they have provided the expected and maximum number of marketing messages as a
percentage of the total number of users. On a typical day, they will forward market-
ing messages to the system for 5% of the registered customer base. On a busy day,
they will send messages to the online banking platform for 20% of the registered
customer volume (as shown in Table 3.6).
An important aspect of the usage model is the number of human users who will
be active on the system. We have just finished a discussion in which we divided load
between human and machine inputs. Consider a login scenario in which 30% of
the peak daily login volume constitutes 692,712 login operations in a single hour.
What does this really mean? Does it mean that a single person is serially logging in
to the application 692,712 times? Or does it mean that 692,712 people are logging
in to the application once? Or is it somewhere in between? Or does it even matter?
The purpose of a usage model is to accurately reflect the expected production
usage of your system. The usage model will drive the load scenarios that you use for
testing most of the non-functional requirements for the system. In reality, for many
systems the number of users executing business operations is just as important as
the number of operations that are executed. For stateful systems (and any system requiring authentication is stateful to some extent), the number of active concurrent users is highly meaningful to the accuracy of the test. Many systems maintain state
information as part of a user session. Each concurrent user will have a correspond-
ing user session. The true performance characteristics of the system can only be
measured if we are executing business operations with a representative number of
concurrent users.
Let’s introduce the notion of user volumes as an additional attribute of our
usage model (as shown in Table 3.7). The business is asked to provide a statistical
view of this attribute for the busiest day of the year over a 24-hour period.
As we will see, user volumes are an important attribute when we go to apply
load to the system to achieve the target rate of business operations.
Human Inputs
- Which coarse input(s) are achieved in this load scenario?
- How many unique scenarios are required to achieve the total number of coarse inputs?
- Are specific scenarios required to certify specific performance requirements?
- What are the cost and budget constraints that will impact the number of load scenarios that are devised?
Albert Einstein is renowned for the statement that “everything should be made
as simple as possible, but no simpler.” This very much applies to the specification of
load scenarios. The objective is to provide enough detail to accurately model the sys-
tem, but detail for the sake of detail offers diminishing returns. Too much detail will
be difficult to implement and maintain. At the same time, an oversimplified view of
your system will increase the likelihood of real problems going undetected.
In many situations, performance requirements may require the specification
of additional load scenarios. We will discuss this topic in more detail later in this
chapter.
As you increase the number and complexity of load scenarios in your usage
model, you will also increase your costs. When you are generating the load profile,
it is preferable to describe the business usage in as much detail as possible. When
it comes time to test, you will usually take a practical view of your load scenarios
and adjust them. We will discuss this activity further when we describe testing
approaches in Chapters 6 and 7.
Non-Functional Requirements
An Important Clarification
We have been using the term non-functional as an adjective since the first chap-
ter of this book. On the topic of requirements, there is an important clarification
that we must make. For many people, there is a perception that non-functional
requirements are technical requirements. However, this is a misleading and inac-
curate perspective.
Non-functional requirements are still business requirements. Like any other set
of requirements, the technology team will interpret and translate non-functional
requirements into a concrete implementation. Non-functional requirements need
to be defined by a business analyst as part of the same exercise as functional require-
ments. We will illustrate the distinction between good and bad non-functional
requirements with some examples. The following requirements may sound appeal-
ing, but are out of context in a non-functional requirements document.
1. The system must verify the integrity of all file outputs that are generated
for customers by inspecting the first and last record in the file.
2. The system must log the username and time for each user login to the
system to a file.
3. All application code must include in-line documentation for support
purposes.
4. Performance testing must be conducted for a sustained period of at least
eight hours at 200% peak load.
strategy and plan is the appropriate place for describing the detailed test case
composition. Adding this type of requirement to your scope will convolute the
intent of your requirements. Furthermore, stakeholders who sign off on non-
functional requirements are seldom in a position to evaluate your detailed test
strategy. For this part of the example, our recommendation is to omit the require-
ment completely.
As you can see, there is a temptation to make non-functional requirements
a broad, all-encompassing container for requirements that don’t seem to fit any-
where else. The scope of non-functional requirements should be limited to true
business requirements that reflect genuine performance, operability, and availabil-
ity concerns. It is not efficient to communicate other topics like testing, delivery
process, or technical design in what is supposed to be a business requirements docu-
ment. There are more appropriate vehicles for this content such as test strategies,
project charters, and technical design documents in which business participants
can be asked to provide sign-off if required.
Performance Requirements
Performance requirements are usually the most prominent type of non-functional
requirement in a software implementation. Users readily understand that systems
that perform slowly will keep them waiting. More than likely, users have firsthand
experience with systems that perform badly and are anxious to avoid similar experi-
ences with any new system.
Performance requirements specify what should happen and how long it should
take. Describing this in a meaningful way is usually more difficult than it sounds.
We usually refer to “how long it takes” as the response time and “what should hap-
pen” as a transaction. This type of requirement will vary greatly depending on the
type of application.
For an animation or graphics-intensive application, performance requirements will
usually be expressed in terms of the refresh rate of the screen. The human eye cannot generally distinguish individual frames at rates much above roughly 30 frames per second; anything faster than this will usually be acceptable to end users. In this case, each frame refresh is considered to be a transaction.
For Internet-based applications, the screen refresh rate is typically based on the
amount of time it takes for a server component to generate a new screen and send
it over the network to the end user. Users are accustomed to Internet applications
and a threshold of under two seconds is usually acceptable for screen refreshes for
applications of this type. For this example, the time it takes to request and then
fully render a Web page is considered to be a transaction.
If a system is responsible for generating a complex report, users may be comfort-
able waiting hours for the report to be available. The generation of the report in this
example can be referred to as a transaction.
Table 3.9 (excerpt): for each transaction, an average, a 90th-percentile, and a maximum response time are specified.
Using this approach, all transactions are classified as either light, medium, or
heavy. This is easy to understand and avoids confusion during the testing and vali-
dation phase of the software development lifecycle.
In assigning transactions to categories, you must agree on the range of inputs
for which these performance requirements will be met. From a user’s perspective,
viewing account details is the “same” type of request for every user. From a technol-
ogy point of view, rendering account information may vary significantly depending
on any one of the following factors:
Before you decide that an account inquiry is of medium weight for all users, you
may consult the technical team and determine that the weights shown in Table 3.11
are more appropriate.
(Table excerpt: the Log-in transaction is classified as Light.)
Does this mean that we are done? No; we are missing a critical piece. Perfor-
mance requirements are only meaningful in the context of load. A response time
of one second may be met easily if only one person is using a system. Meeting
the one-second requirement becomes much more difficult if there are hundreds or thousands of requestors attempting the same transaction simultaneously. Fortunately, we completed the business usage model for this application
earlier in this chapter. Users will expect performance requirements to be met under
all circumstances. Accordingly, we must select the most strenuous interval in the
business usage and use that as the basis for our performance acceptance.
In the previous section, we identified the interval from 12:00 to 1:00 pm on the third Friday in January as the busiest interval in the business usage. This means that we will test for our performance requirements using this load profile.
In order to be more specific, we calculate a transaction rate for the load profile. The
transaction rate can be expressed as transactions per minute or transactions per
second depending on the volumes for your system.
The busiest interval for our banking application processes 3,400 bill payments in
one hour. We can then calculate the transaction rate for bill payments as follows:
Transaction Rate = Transactions / Interval = 3,400 / 3,600 s ≈ 0.94 TPS
We refer to the transaction rate for the busiest interval as the peak transaction
rate. Assuming we conduct a similar exercise for each of the other transactions for
which performance requirements are specified, our example requirements evolve as
shown in Table 3.12.
When combined with the load scenarios defined in the business usage, we are
well positioned to prepare test cases and conduct performance acceptance for this
application from a requirements perspective. We will see more of the testing chal-
lenge in Chapters 6 and 7.
Operability Requirements
Business users do not specify the majority of operability requirements. Not surpris-
ingly, the stakeholders for most operability requirements are the operators of the
system. These types of requirements address the ease of operation, robustness, and overall availability of the software solution.
Component Autonomy
Complex systems are often implemented as a set of dependent components. Sys-
tems may also have dependencies on external systems. Robust, highly available
systems typically meet the following minimum requirements:
able, there should be no impact to the portal itself or a user’s ability to access
the other three subsystems.
- If a component needs to be restarted, re-deployed, or otherwise taken out of service, it should be possible to reintroduce that component without having to restart, re-deploy, or alter any other components in the system. Consider the example of an enterprise service that provides securities pricing information to a number of applications at an investment firm. If the enterprise service is taken out of service and then reintroduced, there should be no need to restart any of the dependent applications.
Trace Logging
Problems that arise in production environments are often difficult to troubleshoot
because processing can be distributed across many disparate systems. If different
systems are responsible for different components, it is difficult for any one support
organization to reproduce the problem. When an external system does not respond
in an expected way, it is critical to be able to provide a log of the request and response
from that system. The exchange of data between systems can be logged at the level
of the database and/or the file system. In most cases, the performance trade-off
of this logging is well worth the increased visibility that it provides. Systems that
include good trace capability are easier to test and support. If performance must
trump operability for your application, consider asking for a configurable switch
that enables trace logging selectively for specific components. In this way, logging
can be introduced when a problem is suspected, or enabled in quality assurance (QA) environments only.
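As a simple illustration (the component name and the use of Log4j are our own assumptions, not a prescription), such a switch can be as little as a dedicated logger per integration point whose level is adjusted through configuration or at run time:

import org.apache.log4j.Level;
import org.apache.log4j.Logger;

public class TraceSwitchExample {

    // One logger per integration point so trace output can be enabled selectively.
    private static final Logger PRICING_TRACE = Logger.getLogger("trace.pricingService");

    public static void enablePricingTrace(boolean on) {
        // Raising or lowering the level acts as the configurable switch; in production
        // the level would normally be driven from configuration rather than code.
        PRICING_TRACE.setLevel(on ? Level.DEBUG : Level.OFF);
    }

    public static void logExchange(String request, String response) {
        if (PRICING_TRACE.isDebugEnabled()) {
            PRICING_TRACE.debug("request=" + request + " response=" + response);
        }
    }
}

Because the switch is just a log level, it can be flipped for a single component when a problem is suspected without paying the logging cost everywhere else.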
when there is a risk that they will not be processed. The appropriate behavior can
only be determined in the context of the application.
A variation of this same capability will be required when the system is being
upgraded and/or maintained. In this case, the communication should tell the user
when the system will once again become available.
Exception Logging
From a requirements point of view, every system should log exceptions with enough
detail that the cause of the failure can be investigated and understood by technical
resources. Error logging can be a critical aspect of production monitoring for the
system. It is also desirable for the error to be presented to the user in a way that can
be tied to additional technical logging at the level of the file system. We will look at
exception handling and logging in more detail in Chapter 4.
Failover
Availability is achieved by increasing quality and redundancy of software and infra-
structure components in your system. In the real world, even the best quality hard-
ware will fail, and it is critical that you discuss the implications of such failure with
your users. In the event of a failover, is it sufficient that service is still available for
the initiation of new requests? Or is there a more stringent requirement for in-flight requests to be processed successfully? Is it sufficient for the request to be processed
when the failed component is recovered, or does a redundant component need to
recognize the failure and stand in to continue processing?
The behavior during a failover will depend on the criticality of your system and
the sensitivity of the users that are using it. Consider the example of an end user
who must complete a multi-screen form process that requires input of over 200
fields. If a system component fails when the user is inputting the 199th field, is it
acceptable for the user to have to start the process over? Depending on the system,
there will be a cost to implementing failover for this scenario and it may or may
not be warranted for your application. Before stipulating failover requirements, it
is recommended that you consult with the technologists who will be designing the
system. It is quite likely that these types of requirements were already factors in the
infrastructure and software decisions that were made in the planning phase of the
project. If the target platform for your system does not provide support for failover,
then it is unwise to allow your users to specify requirements for this feature.
Fault Tolerance
Fault Tolerance requirements describe what the system should do when it encoun-
ters a failure. In many cases, these requirements should be described as alternative
flows in your use case documentation.
Availability Requirements
Availability is typically expressed as a percentage of time that the system is expected
to be available during the operations window. It is usually documented as a critical
metric in the SLA with the user community. Availability is a function of applica-
tion inputs, application robustness, and infrastructure availability. If your magnifi-
cently designed application runs on servers that are only available 80% of the time,
then your application will be available, at best, 80% of the time. Conversely, you
can invest in the most redundant, fail-safe hardware the market has to offer, but if
your application is fragile you will not meet your availability targets.
Like so many things, quality comes at a price. As you invest in both infra-
structure and software quality, you will asymptotically approach 100% availability.
However, no seasoned engineer will ever expect or promise 100% availability. At
best, the “five nines” are touted as the highest possible availability: a system at this
level is available 99.999% of the time. For a 24-hour application that operates 365 days a year, this means that the application is meeting its SLA if it experiences less than about 5.26 minutes of unscheduled downtime in a year (0.001% of 31,536,000 seconds is roughly 315 seconds). There are very few applica-
tions that require this level of availability, and you should speak candidly with your
user community to discuss the cost/benefit trade-offs associated with availability
at this level. In later chapters, we will look at infrastructure, software, and test-case
design to support availability requirements. Table 3.13 illustrates typical availabil-
ity for common system profiles.
Archive Requirements
End users rely on business systems to access information, and depending on the
nature of the business, there will be a requirement for how long that data must
be accessible to them. Some data must be available for the life of the software
system. As an example, most businesses require customer profile information to
persist forever. Alternately, some business data has a more short-lived requirement.
Transactional data is data that accumulates steadily over the life of the system; it is
required over the course of the business transaction and may be required for report-
ing purposes in the future. In general, transactional data is transient in nature and
there is no requirement for business users to have access to it historically. As data
accumulates in the system, this introduces ongoing storage costs and can degrade
performance over time. As a result, it is important for non-functional requirements
to specify the retention period for the different types of data in the system. An example set of retention requirements is provided in Table 3.14.

Table 3.13 (excerpt)
Availability: 99.999%
Profile: Full hardware and software redundancy for all system components; full-time, dedicated monitoring and application support infrastructure; support response time is 15 minutes or less for all incidents.
Example systems: securities trading systems; high-availability customer self-service portal.
In this example, these archive requirements are for a procurement system used
by a manufacturer. From a business perspective, there are different retention peri-
ods for different types of business data. The supplier database is a permanent record
of all organizations that supply the manufacturer with materials.
Summary
In this chapter we’ve seen that the definition of non-functional requirements
encompasses many different topics spanning performance, operability, availability,
and expected business usage. We’ve also seen that different projects have different
needs in terms of the scope and depth of non-functional requirements. Pairing an
experienced business analyst with a technical resource is the recommended staffing
approach during requirements formulation. As we move forward, we will next look
at how non-functional considerations influence software design.
In the previous chapters we examined the initial phases of the software develop-
ment lifecycle—namely (1) the planning phase; and (2) the requirements phase. In
this chapter we turn our attention to software design, which traditionally follows
the first two phases, and is often driven out subsequent to software architecture
within the same high-level phase.
Software design has been raised to the level of high art by many who prac-
tice it. Good software design accomplishes many things, including quality, flexibility, extensibility, and development efficiency, many of which are non-functional requirements or characteristics of them.
Over the years, there have been major enhancements in the process and approach
to software design. Notable milestones include object-oriented and pattern-based
design. Pattern-based design was introduced to a wide audience by the famous
“gang of four”—authors Erich Gamma, Richard Helm, Ralph Johnson, and John
M. Vlissides—in their book Design Patterns: Elements of Reusable Object-Oriented
Software.
Design patterns are powerful because they are language-independent approaches
that apply to recurring scenarios in software development. Many design patterns are so ingrained in developers' practice that they expect to see common patterns in each other's code. Software is easier to understand when it has been designed using a mutually
understood set of concepts and terminology.
In this chapter we make our own contributions to the growing catalog of avail-
able design patterns. We will illustrate our patterns using current technologies and
demonstrate how they are useful in achieving non-functional objectives. If you are
a developer, you may find these techniques useful in writing quality software. If
you are a manager or architect, you may find that these examples help you to chal-
lenge your development team to write better and more defensive code.
Error Categorization
As part of any software design activity, you should agree on standard error severi-
ties. Too often this decision is left until late in the implementation, after individual developers have already settled on an assortment of error severities, each developer having a unique understanding of what each severity means.
Every project has its own unique needs, but the authors of this book have found
the following categorizations (shown in Table 4.1) to be useful and widely adopted.
Table 4.1 (excerpt)
Severity: Info
Description: An event occurred in the system that, although not critical to the system, warrants informational output. It may, for example, be useful to log the attempt of someone to transfer an amount of money larger than what they are allowed to transfer. If this pattern is found repetitively in the logs, this may warrant investigation.
Implication: Informational messages are not meant for operations but can be used by log scrapers to detect unexpected usage patterns, or by the support team to determine the cause of a problem.
Many people may recognize these severity levels as standard for many vendor
software products and source-code frameworks. What is not standard is the mean-
ing and implication of each of these error severities. As we will see in a later chapter
in this book (Chapter 9), standardization of error types is especially important
from a monitoring and operations point of view.
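As a sketch, a project might capture its agreed severities in a shared type so that every developer uses the same vocabulary; the specific names and implications below are illustrative assumptions rather than a prescription:

public enum ErrorSeverity {
    // Not meant for operations; useful to log scrapers and support staff.
    INFO,
    // Unexpected but recoverable; investigate if the pattern repeats.
    WARNING,
    // A business request failed; support follow-up is expected.
    ERROR,
    // The system (or a component) cannot continue; alert operations immediately.
    FATAL
}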
Design Patterns
One of the candidates for this book’s title was Designing Software that Breaks Prop-
erly. Despite our fondness for this title, we decided on a broader one that reflected
the full scope of the book. Nonetheless, this title is highly appropriate to the con-
tent in this chapter. The basis for operability design patterns is that they anticipate
and design for problems. When problems happen, these design patterns ensure that
the software breaks in a predictable and acceptable way.
this case, there is no opportunity to fix a defect and avoid the problem, but there is
an opportunity to introduce a feature.
Retry capability is not a novel or new concept. Most network protocols use retry
extensively when sending data over a physical network. If an acknowledgment is not received within a specified time threshold, the message is re-sent, up to a configured maximum number of send attempts. A slow network is often a network that
is experiencing frequent packet loss, requiring multiple send attempts for a good
portion of the packets. Such a network is slow but working, which most users will
prefer to a network that is not working at all.
This concept is also appropriate for many software scenarios, but in some cases it will require more effort on the part of the software developer. Wherever your
application initiates complex, asynchronous processing, it is worthwhile to consider
a retry capability as part of the solution. The most obvious example is when your
system is completing work in tandem with a third-party system. In such a scenario,
you need to consider the following before you embark on such a scheme:
1. Is the third-party system capable of processing duplicate requests? If
you are attempting to send the same request multiple times, you are assum-
ing a risk that the destination system will receive the message more than
once. Depending on the system, this may entail duplicate processing, which
usually has adverse business consequences. For systems that cannot man-
age duplicate submissions, there may be an opportunity to selectively retry
processing depending on the error type that is detected. If the error is clearly
part of the communication to the external system (e.g., obtaining or testing a
connection) then you may want to permit retries for errors of this type only.
2. Is your operations window large enough to allow for multiple retry
attempts? What are the business requirements for processing? If the business
is expecting you to process the message within one minute or less, it may not
be helpful to retry delivery of the message. In fact, the business may expect
the message to be discarded as its contents will expire if it is not delivered
within this window. On the other hand, if the business is willing to wait 48
hours for processing to complete, then your scenario is a good candidate for
retry processing.
Assuming that your scenario is appropriate for retry processing, you will need
to answer the following questions.
1. What time interval makes sense between retry attempts? This is a deci-
sion with two opposing factors. The smaller the retry interval, the more likely
you are to process successfully with a minimum level of delay. However, if
your system is processing high volumes, you may flood the system to which
you are posting. If the system is not acknowledging replies or appears to be
unavailable, you may be compounding its difficulties by resending at a high
frequency.
2. For how long should you retry? Business requirements will factor heavily
in choosing this setting. The retry window should be as long as your users
can tolerate without experiencing business impact minus some contingency
during which you can manually process if the retry capability is not effective.
If a business user is expecting a transaction to be processed in no longer than
48 hours, and the message is still not processed after 24 hours, it is likely that
you need to escalate and manually intervene.
We illustrate this thinking with the following equation, which indicates that
the allowable retry period should be the sum of the maximum system recovery and
expected manual recovery.
Time(retry period) = Time(maximum system recovery) + Time(manual recovery)
It is also worth mentioning that these settings should be configurable and exter-
nalized from your application code. Once your system is in use, you may decide to
fine-tune these settings to provide a higher level of service. You may in fact need to turn
the retry capability off completely if you discover that a third-party destination system cannot, in fact, handle duplicate requests as originally believed.
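A minimal sketch of such externalized settings follows; the property names, default values, and RetryPolicy class are illustrative assumptions rather than a prescribed design:

import java.io.FileInputStream;
import java.io.IOException;
import java.util.Properties;

public class RetryPolicy {

    private final long retryIntervalMillis;
    private final int maxAttempts;          // zero disables retry entirely
    private final boolean retryEnabled;

    public RetryPolicy(String configPath) throws IOException {
        Properties props = new Properties();
        FileInputStream in = new FileInputStream(configPath);
        try {
            props.load(in);
        } finally {
            in.close();
        }
        retryIntervalMillis = Long.parseLong(props.getProperty("fulfillment.retry.intervalMillis", "60000"));
        maxAttempts = Integer.parseInt(props.getProperty("fulfillment.retry.maxAttempts", "10"));
        retryEnabled = maxAttempts > 0;
    }

    public boolean shouldRetry(int attemptsSoFar) {
        return retryEnabled && attemptsSoFar < maxAttempts;
    }

    public long getRetryIntervalMillis() {
        return retryIntervalMillis;
    }
}

Because the values are read at run time rather than compiled in, support staff can lengthen the interval or set the maximum attempts to zero to switch retries off without a code change.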
Once you have determined the conceptual retry characteristics for your system, you will need to implement a mechanism to realize them (as shown in Figure 4.2).
Figure 4.2 shows that there is a clear separation between the source application, the
fulfillment service with retry, and the third-party service.
If the invoking application requires an immediate initial response, then the
queuing service can be implemented to invoke the fulfillment service directly. In
this case, the queuing service would log the request to the queue as completed. As
we will see later in this chapter, it is often important to have a trace of request/
response messages for troubleshooting and reporting purposes.
(Figure 4.2: a queue of pending requests, Request 1 through Request N, feeding a fulfillment scheduling component inside the fulfillment service.)
If at any point the retry mechanism proves not to work as expected, it can be
disabled by configuring the retry attempts to zero.
An important characteristic of this solution is that the queue is transparent and
can be viewed by an application support resource. System transparency is a critical
support characteristic for any system that is maintainable. At any point, a technical
resource should be able to answer the following questions:
1. How many requests are pending?
2. What is the oldest pending request?
3. When was the last time a request was successfully fulfilled?
4. When was the last failed fulfillment request?
Each of these questions adds valuable insight to any troubleshooting effort. In
the solution we have presented, a database implementation of the request queue
would answer each of these questions.
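For a database-backed request queue, each of these questions maps to a simple query. The table and column names below (FULFILLMENT_QUEUE, STATUS, CREATED_TS, COMPLETED_TS) are illustrative assumptions about such a schema:

public final class QueueQueries {

    // 1. How many requests are pending?
    static final String PENDING_COUNT =
        "SELECT COUNT(*) FROM FULFILLMENT_QUEUE WHERE STATUS = 'PENDING'";

    // 2. What is the oldest pending request?
    static final String OLDEST_PENDING =
        "SELECT MIN(CREATED_TS) FROM FULFILLMENT_QUEUE WHERE STATUS = 'PENDING'";

    // 3. When was the last time a request was successfully fulfilled?
    static final String LAST_SUCCESS =
        "SELECT MAX(COMPLETED_TS) FROM FULFILLMENT_QUEUE WHERE STATUS = 'COMPLETED'";

    // 4. When was the last failed fulfillment request?
    static final String LAST_FAILURE =
        "SELECT MAX(COMPLETED_TS) FROM FULFILLMENT_QUEUE WHERE STATUS = 'FAILED'";

    private QueueQueries() { }
}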
Ensuring transparency of the request queue also creates opportunities for moni-
toring. You may decide that a properly functioning system should never have more
than 50 items in the queue. You could then choose to introduce a monitoring
mechanism that alerts operations whenever the pending items count exceeds 50
items.
Software Fuses
Most people are familiar with the concept of a fuse. When a threshold or error
condition is reached, the fuse blows and halts processing. This same concept is
applicable in a software environment. A familiar example in the software realm is
the user-account lockout. As a security feature, many software systems only allow
a finite number of authentication attempts before “locking” the account. No fur-
Since there is a real possibility that this job will halt processing, it is critical that
the software implementation conform as follows:
1. Return an error code and/or generate a fatal event for monitoring.
2. Ensure the system is in a state such that the job can be rerun without risk of
duplicated or partial processing. If the system cannot be left in a consistent state, you will need to provide a mechanism to compensate for the inconsistency before or after resuming the job.
3. Ensure that the system has generated sufficient output that a technical sup-
port resource can reliably determine which records have been processed and
which records have not.
Software Valves
When a system is experiencing errors, a typical reaction is to stop all processing
until the problem is understood. Through communication and restriction of user
access, you may be able to prevent human users from creating inputs to your system.
qualifying records is expected to take up to two hours for the largest forecasted
weekly volume. However, our operations window requires the system to be available
again the next morning at 7:00 am. All archival must complete in the nine-hour
window between 10:00 pm and 7:00 am. If the system performs to specification,
archiving should never last beyond 1:00 am; but if there is one thing this book has tried to impress upon you, it is that should is a word you need to remove from your vocabulary.
What if the system goes down at 11:00 pm, unexpectedly? What if database
backups are scheduled during this window at some point in the future and the
archival solution runs eight times slower? What if the forecasted business volumes
are wrong, and the peak volumes are in fact much higher? What if the first part of
the solution doesn’t run successfully at 10:00 pm and a well-intentioned operator
runs it at 6:00 am? All of these hypothetical scenarios make this solution an ideal
candidate for the introduction of a valve. We do not want archiving to run beyond
the window allocated as it has an unknown impact on online usage of the system.
A software valve is introduced at the point of message consumption. A con-
figuration table is introduced or extended to indicate whether archiving is allowed
or disallowed for a given point in time. The first task in the archival process is
to enable archiving. The process listening for archive record requests checks the
archive-enabled flag prior to processing each record.
If the archive flag is enabled, it processes the record. If the archive flag is dis-
abled, the listener discards the message. It is acceptable to discard the message
because processing will be repeated the next time the selection job is run. It is not
business critical that records be archived immediately after they qualify. In this
way, we ensure that archive activities run only during the designated window (as
shown in Figure 4.4).
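A minimal sketch of the valve check follows; the interface and class names are illustrative assumptions about such a design:

public class ArchiveRecordListener {

    /** Hypothetical accessor backed by the archive configuration table. */
    public interface ArchiveConfigDao {
        boolean isArchivingEnabled();
    }

    /** Hypothetical service that performs the actual archival work. */
    public interface ArchiveService {
        void archive(String recordId);
    }

    private final ArchiveConfigDao config;
    private final ArchiveService archiver;

    public ArchiveRecordListener(ArchiveConfigDao config, ArchiveService archiver) {
        this.config = config;
        this.archiver = archiver;
    }

    public void onMessage(String recordId) {
        // The valve: consult the flag before processing each record.
        if (!config.isArchivingEnabled()) {
            // Discard the message; the selection job will queue the record again
            // the next time archiving runs inside its allowed window.
            return;
        }
        archiver.archive(recordId);
    }
}

Discarding rather than failing is safe here only because the selection job will requeue the record on its next run; the same flag-check technique applies to the retry scenario discussed next.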
As another example, let’s revisit the retry pattern from the previous section. If
the third-party service is down, our system will quickly enter a state in which there
are many outstanding requests, each of which is generating retry attempts. Assum-
ing that the third-party is now aware that they have a problem, they may request
that we stop sending additional requests until they have resolved the issue. Unfor-
tunately, our fulfillment service schedules retry events on a per-record basis; we
have no way of shutting this off unless it is designed into the solution. Obviously, a
software valve is also appropriate for this scenario. Let’s look at the revised solution
with the addition of a valve (as shown in Figure 4.5).
If the valve is open, the request is queued again for another retry. In this case
the number of retries is not incremented. Again, the software valve is nothing more
than a configuration parameter that is dynamically checked by the fulfillment ser-
vice before making each request.
(Figure 4.5: the same request queue and fulfillment scheduling service as Figure 4.2, with a software valve checked before each fulfillment request is made.)
that, when invoked, reports a detailed status on health. Typical attributes that are
verified by a system health check include the following:
Simple Is Better
Intuitively, most people would agree that the simpler something is, the less likely it
is to break. Generally speaking, this is true; the more components a system has, the higher its overall probability of failure (approximately the sum of the individual component failure probabilities when those probabilities are small). Richard Manicom, the executive
responsible for the Canadian government’s federal tax processing systems through
most of the 1990’s, once articulated a valid point to the authors with this anecdote.
Consider the scenario of a twin-engine aircraft flying across the Atlantic Ocean in
which you are a passenger. Do you feel safer because there is a redundant engine in
the plane? What if you were told that the likelihood of experiencing an engine failure in the twin is twice that of a single-engine plane? In order for this to make you feel safer, you must
have confidence in the ability of the plane and its pilot to recover from an engine
failure. In other words, you are accepting additional complexity in the system and
trusting that it will improve the overall reliability of the aircraft. The relative safety
of single vs. multi-engine aircraft has been a topic of ongoing debate in the aviation
industry since the 1960's. Dick Collins, writing in Flying Magazine, was among the first to point out that, statistically, multi-engine aircraft are involved in more fatal crashes than single-engine planes. This statistic makes it tempting to conclude that single-
engine planes must be safer, but an equitable comparison requires consideration of
many more factors than we are able to discuss here.
Complexity is often a requirement in order to achieve the objectives of the sys-
tem you are building. In the real world, there are many factors that can cause com-
plexity to increase and as a systems designer, you must ensure that you are accepting
complexity in your design for the right reasons. You may make well-intentioned
choices in your design that are meant to improve operability or availability, but if
the complexity you introduce is not properly designed, tested and implemented, it
may have the opposite effect from what you intend. As a general rule, you should
strive for minimal, simple designs and accept complexity only when you have the
means and the commitment to implement it properly.
Isolation
Many large organizations support hundreds of different information systems. In
an effort to control costs, businesses are increasingly adopting strategies to oper-
ate multiple applications on shared hardware. This is often referred to as a shared
services model and it can be a cost-effective way to manage infrastructure costs. A
shared services model allows an enterprise to make large, bulk purchases in infra-
structure and then distribute this cost amongst different applications and lines of
business. Managing your infrastructure as a shared service also creates opportuni-
ties to simplify and streamline your support organization. However, these attrac-
tive cost-savings often come with a hidden cost. If you are implementing multiple
applications on a shared hardware platform, you are exposing yourself to the
potential for undesirable interactions between applications. For example, if your
production applications are deployed such that they all rely on a single network
path, you are accepting the risk that a single misbehaving application could impact
all of your production applications. As a general rule, you should strive for dedi-
cated infrastructure for applications that require high availability. Applications that
are isolated from interactions with other systems will be simpler to operate, more
straightforward to troubleshoot and will enjoy higher availability.
Application Logging
Historically, developers have had two means of understanding the runtime behav-
ior of their applications. They can look inside the application while it is running,
or they can rely on the outputs the application creates while it is running. The for-
mer is usually referred to as runtime debugging or application profiling. The latter is
referred to as application logging.
Runtime debuggers for many software platforms are sophisticated and incredibly
useful. Debuggers allow the developer to run the program line by line, inspecting
and changing variable values and influencing the runtime behavior to understand
the program. Debuggers tend to be intrusive in that the software must run in a spe-
cial container or allow the debugger to connect to the software itself on a specified
interface. In production environments, it is usually not feasible to run the applica-
tion in a mode that permits debugging. Debugging is usually a single-threaded
activity and may seriously impact the performance/availability of your system.
Application logging is non-intrusive; it is compiled into the code and is capable
of creating output during the normal execution of the system. Good application
logging is a critical element of any maintainable software solution. Time and time
again the authors of this book have seen good return on investment in develop-
ment, QA (quality assurance), and production for application logs. The following
guidelines have proven to be effective.
1. Ensure your log level is dynamically configurable: Many modern program-
ming platforms have logging frameworks available that allow you to dynami-
cally toggle logging on and off or change the log level. Log4j for the Java platform is perhaps the most pervasive example.
2. More is better: In general, the operations benefit of application logging far
outweighs any performance penalty. This is true assuming you avoid unnec-
(Table excerpt: 3. Binary, shown as <not-printable-characters>.)
database layout for the coded object. If an additional attribute is required, the new
object type is serialized back into the table with no database changes required.
Though tempting, there are two serious drawbacks to this approach that make
this type of design counterproductive.
1. Visibility: Once you have stored data in a binary format, you relinquish all
hope of querying/reporting on this data once it is in storage. The only means
to access it is through the code that serialized it into the database. When
an end user calls to report an issue with a specific database record, it will
be inconvenient to say the least to look at specific attributes of the serialized
object.
2. Compatibility: If you are relying on your platform’s native capabilities for
marshaling/unmarshaling serialized objects, you must ensure that changes to
the software object remain backwards compatible with data that was previ-
ously serialized with the earlier code. This is error prone, and requires that
you test with production data in order to be certain you are not introducing
a problem.
Except in very unusual circumstances, the authors of this book recommend
that you avoid binary transmission between systems and storage of data. In this
way, you achieve the architectural advantage of clear separation between your cho-
sen software platform and your data model. If you decide to rewrite your applica-
tion for a different platform, you are more likely to preserve the data model intact.
- This request identifier can be displayed to users who can use this identifier to refer to their request in the event of problems.
- This request identifier can be propagated into all data structures that contain data related to this request. It is convenient as a foreign key into all database tables that house related transaction data.
- All error and application logging should reference your system request identifier. This makes it easy to scan logs for all messages related to a specific request identifier.
- Where possible, your design should propagate this identifier to external systems in requests that your system makes. Again, when possible, you should ask that the external system include your request identifier in response messages.
- In fulfilling the business request, if your system must interface with systems that are not capable of maintaining a reference to your unique identifier, you will need to maintain a local mapping of your request to the transaction identifier that is used by the uncooperative external system.
- You should maintain state for all business requests using appropriate data structures. For example, if your processing requires you to make an asynchronous request to an external system, the request status should reflect that a request has been successfully made and that the system is awaiting a response. Again, the global request identifier should be at least part of a composite key to such a data structure.
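A brief sketch of generating and propagating such an identifier follows; the use of a UUID and of Log4j's mapped diagnostic context (MDC) is our own illustration rather than a prescription:

import java.util.UUID;
import org.apache.log4j.Logger;
import org.apache.log4j.MDC;

public class RequestIdentifier {

    private static final Logger LOG = Logger.getLogger(RequestIdentifier.class);

    /** Generate a new global request identifier at the system boundary. */
    public static String newRequestId() {
        return UUID.randomUUID().toString();
    }

    /** Bind the identifier to the current thread so every log line can carry it. */
    public static void bindToLoggingContext(String requestId) {
        MDC.put("requestId", requestId);   // reference %X{requestId} in the log layout
    }

    public static void example() {
        String requestId = newRequestId();
        bindToLoggingContext(requestId);
        LOG.info("Accepted bill payment request");   // log line now carries the identifier
        // The same identifier would be written to every related database row
        // and passed to downstream systems in outbound request messages.
    }
}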
If you adhere to transparent design for your data as discussed in the previous
section, consistent use of a global request identifier will allow you to determine the
state of a request and to extract any and all data related to that request. This is an
indispensable ability when you are troubleshooting an incident on any system. Even
if the information is not immediately useful to you in your investigation, it is criti-
cal that you be able to inform the business users of the exact status of their request.
If the business users have accurate information, they can take steps to mitigate the
impact of a lost request outside of your software system (although they will prob-
ably not be happy about it).
For highly traceable systems, designers will even go one notch further: not only
will each request be traceable, as mentioned above, but the data model will also
be structured so as to maintain a history of the changes made by various requests
over time. In these systems each version of a data element is maintained separately
or each change to a data element over time is maintained. The request identifier is
appended to each revision of the data element together with the timestamp for the
change.
Reconciliation is normally a batch activity with an objective to ensure that sys-
tem state is correct based on the inputs that have been received up until that point.
Consider the example of a client-server call center application in which customer
service representatives (CSRs) are taking orders for telephone customers. Each
order that is placed results in a database entry on the call center application server.
Whenever an order is received, a separate subsystem reads the order and initiates
a fulfillment process to the inventory and fulfillment system, which is hosted cen-
trally for the organization.
This same fulfillment system services a number of channels, including the Web, regular mail, and a small number of brick-and-mortar offices. For this business,
customer service is based on the successful initiation of a fulfillment order for every
order that is taken at the call center. Both the fulfillment system and the call center
application have been implemented by a highly conscientious technical team, but
despite their best efforts, orders taken at the call center do not always translate into
fulfillments. In order to mitigate this risk, the organization has initiated a nightly
reconciliation process in which reports are generated from both the fulfillment and
call center applications. If the number of orders taken does not match the number
of fulfillment requests, the discrepancy is investigated. Since introducing the recon-
ciliation reporting, the technical team has seen two distinct types of failures:
The purpose of the reconciliation report was to monitor for the first type of
failure; however, the technical team quickly realized that they had two problems on
their hands. In some cases, the fulfillment system was generating duplicate orders.
Customers were being sent (and potentially billed for) the same order twice. The
reconciliation process not only informs the technical support team when orders do not equal fulfillments but also shows where the discrepancy lies. This system adheres to our
advice on the topic of traceability. Not only is every order assigned a unique system-
generated identifier, but this identifier is propagated to the fulfillment system.
When the reconciliation report does not match, it shows exactly which orders
have been omitted or exactly which orders have been fulfilled twice. The support
team can use the problematic order identifiers to interrogate the system for order
status and correct the problem before the call center or the customer is even aware
that there was a problem. Of course, each time an issue is identified in the reconcili-
ation, the root cause for the discrepancy is investigated and a code fix is made to
eliminate this scenario from ever happening again. In this case, the reconciliation
is a part of the monitoring and continuous improvement strategy for the organiza-
tion. The important design observation is that reconciliation approaches are not
possible if the system is not designed in a transparent and traceable way.
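Because the order identifier is propagated to the fulfillment system, the nightly comparison can be expressed as two simple queries; the table and column names below (CALL_CENTER_ORDER, FULFILLMENT_REQUEST, ORDER_ID) are illustrative assumptions about such a schema:

public final class ReconciliationQueries {

    // Orders taken at the call center with no matching fulfillment request.
    static final String MISSING_FULFILLMENTS =
        "SELECT o.ORDER_ID FROM CALL_CENTER_ORDER o " +
        "LEFT JOIN FULFILLMENT_REQUEST f ON o.ORDER_ID = f.ORDER_ID " +
        "WHERE f.ORDER_ID IS NULL";

    // Orders that have been fulfilled more than once.
    static final String DUPLICATE_FULFILLMENTS =
        "SELECT f.ORDER_ID, COUNT(*) FROM FULFILLMENT_REQUEST f " +
        "GROUP BY f.ORDER_ID HAVING COUNT(*) > 1";

    private ReconciliationQueries() { }
}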
Exception Handling
An entire book could be written on the topic of exception handling. A core fea-
ture of any programming language is its native exception-handling capabilities.
We would like to avoid a technology-specific discussion, so in this section we will
define some general guidelines and then move on.
In our experience, the three most problematic and recurring themes for soft-
ware systems are as follows.
1. Insufficient error checking in code.
2. Insufficient detail in error messages when they are logged.
3. No reliable way to correlate user events with logged exceptions.
We will now visit each of these topics in the context of another example. Our
concern is how errors are handled by application code so this example will reference
an example code fragment in the Java programming language. We will consider the implementation of a business operation that calculates an insurance quote. In our example, the method calculateQuote takes an object of type QuoteRequest as its
argument and then performs the required business operation, ultimately return-
ing an object of type QuoteResult. The implementation of calculateQuote is shown
below.
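As a sketch of the kind of implementation being described (the QuoteRequest, Address, QuoteResult shapes and the rating logic are illustrative assumptions, not the book's original listing), the flaw discussed below is the unchecked dereference of the address attribute:

// Minimal supporting types so the sketch is self-contained; shapes are assumptions.
class Address {
    private final String postalCode;
    Address(String postalCode) { this.postalCode = postalCode; }
    String getPostalCode() { return postalCode; }
}

class QuoteRequest {
    private final String requestId;   // global request identifier
    private final Address address;    // may be null if the caller omitted it
    private final double coverageAmount;
    QuoteRequest(String requestId, Address address, double coverageAmount) {
        this.requestId = requestId;
        this.address = address;
        this.coverageAmount = coverageAmount;
    }
    String getRequestId() { return requestId; }
    Address getAddress() { return address; }
    double getCoverageAmount() { return coverageAmount; }
}

class QuoteResult {
    private final double premium;
    QuoteResult(double premium) { this.premium = premium; }
    double getPremium() { return premium; }
}

public class QuoteHelper {

    public QuoteResult calculateQuote(QuoteRequest request) {
        // Flaw discussed in the text: the address attribute is dereferenced without
        // any checking, so a missing address surfaces as a raw NullPointerException
        // rather than a meaningful application error.
        String postalCode = request.getAddress().getPostalCode();
        double premium = baseRateFor(postalCode) * request.getCoverageAmount();
        return new QuoteResult(premium);
    }

    private double baseRateFor(String postalCode) {
        return 0.002;   // rating logic omitted to keep the sketch short
    }
}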
One of the results of writing defensive code is that you get the appropriate error
messages. This is the first problem with the application code in our example. There
is no checking on the address attribute before dereferencing. As a result, a system
exception is thrown and the opportunity is missed to log a much more meaningful
exception. We could easily have avoided this by checking the address attribute as
follows:
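Continuing the illustrative sketch above, the check and the typed application exception might look like this (QuoteException is a hypothetical checked exception; QuoteRequest and QuoteResult are the types from the previous sketch):

class QuoteException extends Exception {
    QuoteException(String message) { super(message); }
}

public class QuoteHelperChecked {

    public QuoteResult calculateQuote(QuoteRequest request) throws QuoteException {
        if (request.getAddress() == null) {
            // Rethrow as a typed, checked application exception so the caller is
            // forced to handle the condition and can show a genuine error message.
            throw new QuoteException("Unable to calculate quote: address is missing for request "
                    + request.getRequestId());
        }
        String postalCode = request.getAddress().getPostalCode();
        return new QuoteResult(0.002 * request.getCoverageAmount());
    }
}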
By checking for this error condition and rethrowing a typed application excep-
tion, Java will force the calling object to handle the checked exception. This greatly
improves the chances that the end user will see a genuine error message and not an
incomprehensible system exception.
In addition to argument checking, it is important to log as much information as possible. Contrast a well-formed log message with a generic one such as the following:

Tue Jul 10, 2006 09:08:22 -- ERROR -- QuoteHelper.java:67 “Error calculating quote”

A well-formed log message gives us two key pieces of information that are absent in this example. First, it tells us why the exception is being thrown: that the address attribute has been checked and found to be null. Second, it allows us to correlate the error to an actual user request. If three users report issues calculating insurance quotes on a given day, we can use the error logs to determine exactly which users experienced this particular problem.
We can look at the same code fragment again with each of our recommenda-
tions implemented:
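A sketch of the revised method, combining the argument check, a typed application exception, and contextual logging; the logger field, exception type, and accessor names are illustrative assumptions:

    public QuoteResult calculateQuote(QuoteRequest request)
            throws QuoteValidationException {

        if (request.getAddress() == null) {
            String message = "Error calculating quote: address attribute is null "
                    + "on quote request " + request.getRequestId()
                    + " for user " + request.getUsername();
            // Log enough context to correlate the failure to a specific user request.
            log.error(message);
            throw new QuoteValidationException(message);
        }

        String postalCode = request.getAddress().getPostalCode();
        double premium = baseRateFor(postalCode) * request.getCoverageAmount();
        return new QuoteResult(premium);
    }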
To many developers reading this book, what we suggest in this section will sound
patently obvious. We make these remarks because a vast number of software sys-
tems have been built (and continue to be built) that do not meet this standard. If
you are an architect, technical lead, or development manager, you need to insist
that this level of error handling be accounted for through the code review process
for your deliverables.
Fortunately, emerging technologies continue to make appropriate error logging
and handling increasingly easy to implement. For example, exception handling is
a major improvement over the developer obligation to properly implement return
codes. More recently, aspect-oriented programming (AOP) approaches make it
easier to crosscut broad swaths of your application with consistent behavior and
handling. Error handling is one of the most often referenced applications of AOP
constructs. If you are a Java technologist, you are encouraged to investigate the
Spring framework created by Rod Johnson, which at the time of writing is the most
popular AOP framework for this platform.
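As an illustration of the AOP style, the following sketch uses Spring AOP with AspectJ annotations to log every exception thrown from a package in a consistent way; the package name in the pointcut and the logging choices are assumptions:

    import java.util.Arrays;
    import java.util.logging.Level;
    import java.util.logging.Logger;

    import org.aspectj.lang.JoinPoint;
    import org.aspectj.lang.annotation.AfterThrowing;
    import org.aspectj.lang.annotation.Aspect;

    @Aspect
    public class ExceptionLoggingAspect {

        private static final Logger LOG =
                Logger.getLogger(ExceptionLoggingAspect.class.getName());

        // Applied automatically to every method in the named package, giving
        // one place to enforce consistent exception logging.
        @AfterThrowing(pointcut = "execution(* com.example.quotes..*(..))",
                       throwing = "ex")
        public void logException(JoinPoint joinPoint, Throwable ex) {
            LOG.log(Level.SEVERE,
                    "Exception in " + joinPoint.getSignature().toShortString()
                    + " with arguments " + Arrays.toString(joinPoint.getArgs()), ex);
        }
    }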
Infrastructure Services
What should you expect from the infrastructure? The short answer is: nothing.
It does not matter what promises are made around the quality and availability of
the infrastructure; the application needs to be coded in a way that is resilient to
infrastructure failures.
That said, the technical team must be aware of the features in the software
platform that are expected to provide resiliency. For example, if the chosen software
platform provides redundancy between clustered servers, the development team
should review vendor documentation for this feature to ensure that the design of
the solution is compliant with vendor recommendations.
A good example of where this applies is the BEA WebLogic application server.
This product supports session failover in a clustered environment, but only if ses-
sion-based application data implements the java.io.Serializable interface. Without
this understanding, it would be easy for a development team to invalidate this
vendor feature.
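A sketch of what compliance looks like in practice; the class and attribute names are illustrative:

    import java.io.Serializable;
    import java.util.ArrayList;
    import java.util.List;

    // Any object placed in the HTTP session must be serializable if the
    // application server is to replicate it to another node on failover.
    public class ShoppingCart implements Serializable {

        private static final long serialVersionUID = 1L;

        private final List<String> itemIds = new ArrayList<String>();

        public void addItem(String itemId) {
            itemIds.add(itemId);
        }

        public List<String> getItemIds() {
            return itemIds;
        }
    }

The object can then be placed in the session with request.getSession().setAttribute("cart", cart), leaving the server free to replicate it to other nodes in the cluster.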
Design Reviews
Design reviews are needed to confirm that standards and guidelines are being followed
and that the design is going to meet the requirements, both functional and non-functional.
Reviews can be conducted at the object level for your application as well as at the software
component level for the total software system. Let’s consider a view of an application
that allows users to upload multimedia content to a web site that stores this content
on their behalf in a repository. We will use this simple example to illustrate the
purpose of an operability review.
In the operability review, two principles will guide us: first, that things will go wrong,
no matter how unlikely that may be; and second, that we must know with confidence
how the system will react when they do.
[Figure: multimedia upload flow — the file is written to the local file system, a success response is returned, and the success page is displayed to the user; the Web server then asynchronously begins a transaction to store the content to the archive, the archive begins a database transaction to commit the binary content, the content is committed to database storage, the transaction and request complete successfully, and the content is marked for cleanup from the local file system.]
When you catch yourself saying that a component “should” do something, it often means
you don’t really know how the application will behave—as in, “The archive server
should roll back the transaction and return an error status.” Alternatively, it might
mean that you are assuming that a failure scenario will never happen—as in, “The
Web server should never lose connectivity because it is directly attached to the same
switch as the archive server.” You need to avoid this type of thinking in the context
of an operability review. Remember, things will go wrong, no matter how unlikely
that may be, and you need to know with confidence how your system will react
when they do.
If you are uncertain of the application behavior, you are encouraged to devise
a test to find out. Through this process, you may find that you need to revise
your design or build additional robustness into the application. At the same time,
depending on the likelihood of the failure scenario, the criticality of the business
operation, and the process for detecting and correcting the incident, not all scenar-
ios may require design intervention. For these cases, it is important to identify them
as a team and make a collective and documented decision to address them or not.
Summary
Good software design can be applied to achieve a host of benefits: flexibility, exten-
sibility, readability, maintenance, and quality. Successful projects are often projects
with strong technical leadership that insists on a thorough design phase. In this
chapter, we’ve argued that some of the most tangible benefits of good software
design are in the area of operability. Extensibility and flexibility are important but
loftier benefits. You may need to extend or change direction in your software, but
in the real world, applications that recover gracefully from errors will earn acco-
lades sooner and on more occasions.
In the next chapter we will look at effective techniques and guidelines for build-
ing scalable, high-performing software systems.
Designing for Performance
The goal of this chapter is to help you to better architect, design, and develop
software that meets the performance requirements of your system. We will focus
on the different aspects of the solution design that will inevitably influence its per-
formance, or at the very least the perception that the end user will have of the
application’s responsiveness.
In our experience, performance considerations need to be part of every step of
the development process. Projects that delay performance considerations until late
in the software lifecycle are at significant risk when it comes to their non-functional
test results.
Requirements
The performance requirements of a system are gathered as part of the non-func-
tional requirements of a software solution, as discussed in Chapter 3. In what fol-
lows we will highlight how performance considerations should be looked at as an
influencing factor of the requirements gathering process.
The “Ilities”
Performance is intrinsic to a system, whereas some of the capabilities (further
referred to as “ilities”) of a solution—although not all of them—can be added as an
afterthought. Performance will also have a major impact on the “ilities” so much
so that for some systems, some of these “ilities” will have to be sacrificed in favor
of performance. Note that the reverse is also true, and that performance may have
to be sacrificed for one or more of the “ilities.” The important thing is to determine
where performance is critical and, given that criticality, how far it should be allowed
to shape your overall system.
Scalability
The first thing that will come to mind for many people when talking about per-
formance is system scalability. This property does not relate so much to the perfor-
mance of a system but rather to its capacity to uphold the same performance under
heavier volumes.
The requirement for scalability must be considered in relation to the need for
future growth of the business function that is supported by the software solution.
As a rule of thumb, for a business with a moderate or slow growth rate, vertical
scalability of a system will be sufficient as long as Moore’s law holds true, which
seems likely to be the case for a number of years to come.
It is notable, however, that chip makers have started concentrating more efforts
on multicore central processing unit (CPU) solutions, and we would argue that
today software solutions should be built to scale horizontally in order to sustain
business demands at affordable costs in the future.
Scalability can influence performance in different ways. In distributed systems,
scalability will usually be the result of load-balancing requests coming into the sys-
tem, so that they can be processed by multiple nodes. The load balancing will carry
with it a small overhead that should be taken into account during specification,
especially if load balancing occurs for each tier in the distributed solution.
Distributed databases or application servers will, in some cases, provide caching
mechanisms to speed up data lookups. Although the cache will drastically acceler-
ate data access in some cases, it will also require synchronization of the data across
all nodes, which does not come for free. Whether the system is mostly to be used
for reading data or for writing data will need to be assessed in order to define the
appropriate caching strategy (more about caching follows on p. 109).
Grid-computing solutions have multiple computers act as one; however, this is
not fully transparent and will require data synchronization to occur during specific
points of the processing. This will also add overhead to the total performance and
must be factored in when defining the system.
Usability
Making a system that is both enticing and easy for people to use is a complex task,
worthy of a library in itself. Consequently, we will limit our interest here to the
impact usability requirements can have on performance.
Everyone will agree that a poorly performing system is not usable; people will
get frustrated and very soon abandon the application as a whole, even if the perfor-
mance issues are only related to one functional area of the entire solution.
The solution architect and business or usability analysts must therefore col-
laborate in order to come up with usage patterns that are both efficient for the end
user and computationally viable for the system under construction. The architect’s
role will be to provide input regarding the technology options available to the team,
whereas the usability analyst will ensure that these technology options are used in
a context most suitable for the end user.
Both should ensure that the user is not subjected to long waiting periods.
Expensive computations should not be performed while the user is waiting; they
should be removed from the user interaction flow and handled separately so that
the user can go on with her work.
A contemporary example of this is the use of Web 2.0 technologies in order to
execute front-end validations that would previously have taxed the backend systems.
Another example is the use of asynchronous processing and exception handling
using workflow systems. Using this paradigm, processing errors are not reported
to the user immediately but through some form of a notification mechanism. This
type of processing is advantageous in environments where the user’s error rate is
very low and rapid response times are of the essence.
Extensibility
The extensibility of a system can on occasion jeopardize its effectiveness.
In order to make a system extensible, designers and developers are often forced
to add additional controls and decision logic into the computational model, which
will quite often lead to performance degradation.
In all instances a design should be kept simple, except if the requirements
explicitly mandate the need for an extensible solution. In the latter case the require-
ments will also have to provide guidance on the specific conditions under which the
solution should be extensible, and not simply make a high-level statement about the
need for extensibility.
An interesting example of the impact of flexibility on performance is the
Enterprise Java Bean (EJB) framework. The EJB framework by its very nature had
to be designed with extensibility in mind. In its prior iterations (versions before
3.0), every component that was built for the system had to exhibit characteristics
of security, transaction support, etc. As a result, each EJB call had to go through
some wrapper code to handle these aspects of the component. This was because
the notion of extensibility had been baked into many aspects of the standard. As
of EJB 3.0, the standard changed rather fundamentally: instead of baking things
into the way components were defined it was decided that components should be
built with no or very little knowledge of the framework, and extensibility would be
handled by injecting the capabilities mentioned above using aspect-oriented
programming techniques.
Securability
Networks, and the Internet in particular, have opened up the door for many threats to
the enterprise. As a result, many companies became conscious of the need for tighter
security and especially for building security into all aspects of their software solutions.
Tightening security rarely happens without an impact on system performance.
Therefore (and as with each theme in this section) securing your system needs to be
done in a way that neither degrades the performance of your solution nor disrupts
the operation of its functions.
In order to do so, the stakeholders, business analyst, security architect, and
solutions architect will need to work together to determine the scope of the security
measures that are required for the specific purposes of the system.
A number of questions must be answered in order to determine which principles
will underpin the security architecture. The answers will allow the security architect
to determine:
■ Whether encryption is needed and, if so, whether it should be hardware
accelerated, which may be necessary if many concurrent users have to be sup-
ported or if large volumes of data require encryption.
■ How to implement access controls. Heavy access verification will obviously
impact overall system performance; hence this quality of the system may
require the implementation of cached access control lists, or other optimizations
around such checks.
■ The extent of non-repudiation, another security measure that will impact the
responsiveness of the system, given that each transaction requiring non-repudiation
will need signing. It is advisable to limit the requirements for non-repudiation to
only those specific transactions that may be legally binding.
Data required for trending and long-term analysis may need to be captured on perma-
nent media, whereas for data that is needed only at system runtime it may be suf-
ficient to keep a transient in-memory copy.
With today’s technologies it is, in certain cases, even possible to benefit from
runtime instrumentation, which allows adding or removing instrumentation to an
application while it is running.
Maintainability
The ability to maintain a system will not really have an impact on its performance.
Source code comments and design documentation do not impact system perfor-
mance, although we have sometimes wondered whether some development teams
believed so, given the scantiness of documentation and comments provided for
some of the solutions we had to review.
The reason we wanted to cover this topic is to stress the fact that design and code
documentation are important, and even more so when dealing with algorithms that
have to be heavily optimized. In many cases when optimization is required, algo-
rithms become either very complex or unreadable—or both. Therefore, particular
attention should be paid to the documentation that surrounds such artifacts.
Recoverability
When a process will take a considerable amount of time to execute, you will usually
want the capability to recover from a failure in the middle of the process without
the need to rerun the whole process. The capacity to recover will depend on what
information is available for recovery, and maintaining this information will inevitably
impact the overall performance of the process.
The rule of thumb is that recovery should not take longer than simply rerunning
the process, if indeed the process can be rerun. If either the process cannot be
rerun or recovery is faster than rerunning it, then the price of maintaining
recovery data is acceptable, and that price will have to be incorporated into the
capacity requirements for the system.
The system’s architect should ascertain with the business analyst that there is
indeed a requirement for recovery. In some cases data is perceived to be critical to
a system, when in fact the data is either transient or maintained as part of another
system. In those cases it is most likely that recovery of the data is not mandatory,
and therefore the performance penalty of recoverability should not be incurred.
In such cases you could, for instance, disable the database features that maintain
recovery logs.
Architecture
When taking a critical eye to a system’s architecture in order to figure out how to
design for performance while at the same time keeping to the project’s timeline,
there are two important parts to your approach.
First of all you will have to determine which parts of the solution will need a
more thorough look to determine whether special measures are needed to ensure
the required level of performance. We call this activity the “hotspot” analysis of the
architectural picture, which we’ll examine in more detail in the coming section.
Secondly you’ll gain time by applying standard architectural patterns to the
performance issues that are specific to your problem. We will try to help you in
this regard by introducing you to a set of common performance patterns, as well as
their antipatterns.
Finally we also encourage you, when defining the architecture of a system—
whether it has high performance requirements or not—to take a pragmatic approach
as outlined in our personal note. Whenever possible, use the K.I.S.S. approach:
Keep It Simple, Stupid. On many occasions we have seen development teams
come up with designs that were far too complicated and convoluted for the
problem at hand. This seems to stem from a perception of the designer that if the
solution is too simple then he hasn’t done his job right. In our experience it is better
to reward people for coming up with simple, elegant solutions rather than overly
complicated ones that address more than the requirement. It is up to the manager
to clearly communicate this to the team.
Hotspots
When defining the architecture of a system it is important to clearly identify those
parts of the system that are liable to cause performance bottlenecks. These areas of
the system are quite often referred to as hotspots.
The determination of hotspots within an architecture will be achieved by map-
ping the non-functional requirements for the system onto the logical architecture.
Doing this will provide the design and architecture team with a view of which parts
of the system will require particular attention when it comes to the technical design
and even the implementation of the solution.
We suggest that you approach hotspot mapping as follows:
■ Make sure that the non-functional requirements of the system have been
accurately articulated, with as much detail as possible around volumes and
response times.
■ Map each input or output channel of the system to its associated non-func-
tional requirement(s) and determine whether a hotspot would result from the
volumes or response requirements expected from said channel.
■ Make sure each component of your architecture has its input and output
flows defined. The throughput and response requirements of a component are
a combination of the requirements for all of its inflows and outflows.
■ Based on the throughput or response requirements for each component, iden-
tify those components that are potential bottlenecks. Start with the compo-
nents that have inflows or outflows from or to external entities, and then move
on to those components that receive their inputs and outputs from other compo-
nents of the system.
Patterns
Divide and Conquer
By “divide and conquer” we mean splitting the work up into smaller
parts. It is our opinion that the divide and conquer pattern is one that engineers
cannot live without in light of the increasing complexity of the systems that they
are tasked to build in today’s world.
When designing for performance this pattern is specifically helpful for tackling
the following problems:
■ The identification of hotspots is made easier, and the analysis of said hot-
spots is straightforward when one only has to concentrate on the inputs and
outputs of the problem area.
■ In many instances, the computation of all the parts combined will take the same
amount of time and resources as the computation of the total problem, but
splitting up the computation of certain parts will often give the end user the
impression that the system performs more efficiently.
Load Balancing
Load balancing is a typical pattern used to achieve horizontal scalability in a sys-
tem. A typical load balancing setup is shown in Figure 5.1.
In order for such a setup to achieve optimal scalability on requests made to the
system, these requests should be independent, and short-lived. When every request
is independent, each one can be processed by any hardware node in a server cluster
or farm, as long as the same application is deployed on each one of these servers.
This is an ideal solution, as it allows the load balancing mechanism to choose the
least busy server to execute the request, thereby optimizing overall resource usage.
Moreover, if every request is short-lived and will consume approximately the
same system resources whatever the request, the mechanism to load balance does
not need to be complex, and a simple “round robin” approach will usually suffice.
Typically serving up static Web pages falls into this category.
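A minimal sketch of the round-robin idea; the class name and the form of the server list are illustrative assumptions:

    import java.util.List;
    import java.util.concurrent.atomic.AtomicInteger;

    // Each call hands back the next server in the list, wrapping around at the end.
    public class RoundRobinBalancer {

        private final List<String> servers;            // e.g. "host1:8080", "host2:8080"
        private final AtomicInteger next = new AtomicInteger();

        public RoundRobinBalancer(List<String> servers) {
            this.servers = servers;
        }

        public String nextServer() {
            int index = (next.getAndIncrement() & Integer.MAX_VALUE) % servers.size();
            return servers.get(index);
        }
    }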
Most applications are not that fortunate, and will have varying resource require-
ments for each request as well as requests that are not totally independent of each
other. A typical Web application will maintain a user’s session, and all requests
coming from that user’s session will need to share his session’s state. State is gener-
ally maintained on the server, thereby binding requests to the server(s) on which
said state is stored.
When faced with long-running requests and/or dependent requests, load bal-
ancing can still be of use but will require more intelligent allocation of load. For
instance, you may decide to direct the load of resource intensive requests to specific
servers, which have been allocated and tuned specifically for the execution of such
requests. You may want to send the load balancer information about current server
utilization so that it can route requests to the less utilized servers.
Note that load balancing will quite often be used to support failover as well.
This will require additional mechanisms to be put in place in order for the load
balancer to be aware of dead nodes in a cluster, as well as which nodes have a replica
of any state information required to fulfill a request.
In the end, the more intricate the strategy to decide where to send the load of a
request, the more impact the load-balancing process will have on end-to-end per-
formance of your requests. Depending on your specific requirements, you will need
to find the right balance between the complexity of a good load-balancing mecha-
nism and the benefit it provides you in terms of overall scalability of your system.
Parallelism
Whereas load balancing deals with the execution of unrelated requests in parallel,
this topic will cover the parallel execution of related computations. We decided to
keep the two topics separated, though we could have handled load balancing as a
subtopic of this one.
Even with today’s fastest computers, some calculations may still take a long
time to complete. Hence, if either the hardware does not exist to speed up your
computations, or buying the hardware that would allow faster computation is eco-
nomically not viable, dividing the problem into parallel computations is your only
remaining option (short of dropping the problem altogether) in order to improve the
performance of these computations. Note that if this is true for the CPU usage of
your solution, it can also be true for its I/O usage. In one case the solution is said to
be “CPU bound” whereas in the other case it is “I/O bound”; in both cases, making
things run in parallel will help.
Before thinking of parallelization you will need to identify which parts of your
application would benefit the most from it, and if these parts can be parallelized at
all, meaning whether the algorithms exist to handle the task in parallel.
Once you have defined which parts of your solution will be benefiting from par-
allelism you will have to evaluate the overhead cost of the parallelization algorithm
in order to determine the boundary conditions under which parallel processing will
be triggered or not.
In many instances parallelized processing is only worthwhile once certain volumes
of data need to be manipulated, and does not make any sense for
small data volumes. For instance, when performing a parallel sort, the time of each
parallel sort together with the time needed to aggregate the results of these parallel
sorts should not exceed the time it would have taken to sort everything without
parallelism (as shown in Figure 5.2).
Now that we have defined the criteria that should guide you in answering the
question—To parallelize or not to parallelize?—we will look at some examples of
where to use parallelism when a system is I/O bound.
[Figure 5.2: parallel sorts taking times t1 and t2, with the criterion that the overall time t3, including aggregation of the results, must satisfy t3 < t1 + t2 for parallelism to pay off.]
Baking the parallelism strategy into the application could be a self-defeating strategy as
systems evolve. An appropriate response to this design challenge would be to assign
four worker threads from a pool to the completion of each consolidated request and
allow them to complete the work as quickly as possible. In this type of scheme, if a
thread takes longer to complete its work than expected, other threads can pitch in
and off-load effort as soon as they become available.
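A sketch of that scheme using a fixed pool of four worker threads; the task and result types are placeholders:

    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class ConsolidatedRequestProcessor {

        // Four pooled workers drain the sub-tasks of each consolidated request;
        // a slow sub-task does not leave the other workers idle.
        private final ExecutorService pool = Executors.newFixedThreadPool(4);

        public <T> List<Future<T>> process(List<Callable<T>> subTasks)
                throws InterruptedException {
            // invokeAll returns once every sub-task has completed or failed.
            return pool.invokeAll(subTasks);
        }

        public void shutdown() {
            pool.shutdown();
        }
    }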
asynchronous processing. The design team found that most input errors were being
caught by the intraform validations and that the first submission rarely needed
to be repeated. As a result, an inbox was added to the user interface that would
populate with the request status once the request had been fulfilled. As a conve-
nience feature, the team added an email notification that would allow users to be
informed when their request had completed processing. An unexpected advantage
of the alternative implementation was that when the backend fulfillment system
was unavailable, business users could still submit requests asynchronously to the
front-end system. This operability advantage improved the perception of system
availability as well as performance.
Finally, whether to opt for synchronous or asynchronous execution will depend
on the type of system you are building, the skills you have available, and many
other parameters. In Table 5.1 we attempt to provide a series of guidelines that may
help you to decide which way you want to go.
Deferred Processing
A subcategory of the asynchronous processing pattern can be dubbed “deferred
processing.” For the lazier amongst us this means never do more work than you
have to—advice that is particularly relevant in today’s climate of object-oriented
implementation and component-based frameworks. Component development
offers irrefutable advantages from the standpoints of extensibility, reusability, and
maintainability. However, components can also lead to serious performance
issues when used inappropriately. It is common for a component to be
designed for one purpose and then re-purposed for something else. The secondary
usage of the component may not require the full component implementation, but it
is more convenient to use the existing “as is” component than to design something
new or change what is already available.
A typical example of this type of danger is the “User” object itself, which is
a common object in modern software implementations. The “User” object is an
abstraction of all characteristics of the authenticated user who is interacting with
the system. The user object commonly includes attributes for username, full name,
address, date of birth, email address, payment information, etc. When a user
first initiates a session with the system, it is common to construct the user object
and populate all of its attributes from storage. In some circumstances this may
be entirely appropriate if the attributes for the user all reside in a single, local,
and efficient data store. However, if the user object is a composite of information
from different systems, the context in which the user object is being used may not
require the object to be fully constituted. In this case, it may be prudent to defer
the construction of the object until a request is actually made to the object for the
specific attribute. Consider the example in which payment information for the user
is stored in a different profile database than name and address information. For this
example, we might propose an object interface as follows:
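A sketch of the kind of interface being proposed; the supporting types (Address, PaymentInfo) are assumptions:

    public interface User {
        String getUsername();
        String getFullName();
        Address getAddress();
        // Payment details live in a separate profile database and are the
        // natural candidate for deferred loading.
        PaymentInfo getPaymentInfo();
    }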
A user object is constructed when the user authenticates to the system and pro-
vides a valid username and password. In a deferred processing scenario, we would
suggest that the object construction start by verifying the username and password
against the authentication store. The implementation may also load the user’s per-
sonal information, including name and address, but the object construction would
not necessarily load payment information for the user. This independent initializa-
tion is deferred until the application calls the getPaymentInfo() method at some
point in the future.
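A sketch of how the deferral might be implemented; the profile repository and its lookup method are illustrative assumptions:

    public class LazyUser implements User {

        private final String username;
        private final String fullName;
        private final Address address;

        private final PaymentProfileRepository paymentRepository;
        private PaymentInfo paymentInfo;    // not loaded at construction time

        public LazyUser(String username, String fullName, Address address,
                        PaymentProfileRepository paymentRepository) {
            this.username = username;
            this.fullName = fullName;
            this.address = address;
            this.paymentRepository = paymentRepository;
        }

        public String getUsername() { return username; }
        public String getFullName() { return fullName; }
        public Address getAddress() { return address; }

        public synchronized PaymentInfo getPaymentInfo() {
            // Deferred initialization: the slower profile database is only
            // consulted the first time payment information is actually needed.
            if (paymentInfo == null) {
                paymentInfo = paymentRepository.loadPaymentInfo(username);
            }
            return paymentInfo;
        }
    }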
Caching
Another mechanism that is commonly used to improve performance, or at the very
least give the perception thereof, is caching. This is mainly used to improve the per-
formance in scenarios requiring slow or voluminous I/O interactions, but can also
be used in parallel computing to maintain local copies of shared data.
Caching is such a pervasive performance-improvement pattern that it is very
common to have many layers of cache between the end user and the physical data
storage. Let’s consider the conventional three-tier architecture for a Web-based
application and look at a subset of different caches that may be at play. A nonex-
haustive example of how caches can be distributed is shown in Figure 5.3.
The main challenges related to caching information are twofold: 1) keeping the
cached information in sync when the information is distributed or when the infor-
mation can be modified by mechanisms that do not involve the caching mecha-
nism; and 2) managing the memory used by the cache in a way that minimizes
memory use but maximizes the use of the cache (in other words, choose the caching
strategy that will maximize cache hits).
These, however, are technical challenges, and solutions exist. For instance, most
application servers will use one or more caching mechanisms, and some will even
allow you to provide your own caching strategy. There are also a number of free and
commercial solutions available, such as OpenSymphony’s OSCache or GigaSpaces.
Cache synchronization will ensure that the cache reliably reflects the contents
of the primary data store. When cache contents become out of synch, the contents
are referred to as stale. Depending on the nature of the data that you wish to cache,
you will need to choose a suitable synchronization policy. There are a variety of
choices, which we describe in Table 5.2.
Table 5.1 Guidelines for Synchronous versus Asynchronous Execution

Scenario: A request requiring some form of validation feedback is executed.
Synchronous execution: Ideal for this kind of interaction.
Asynchronous execution: Asynchronous treatment does not add a lot of merit, given that a reply is always due to the requestor. Using asynchronous methods in this case will add complexity to the implementation with no likely benefit. One notable exception that has proven its benefits is the approach taken by some AJAX-based implementations: validation feedback is provided to the user in real time while she inputs the data, and the validation of the entered data is done asynchronously while the user continues to type in more data. Although overall this uses up more resources (due to the multiple asynchronous calls), it gives the system a much better user experience.

Scenario: A packet of information is sent to the system. No results or feedback is required.
Synchronous execution: If absolutely no treatment of the information is required on the receiving end, then a synchronous interaction will be fine.
Asynchronous execution: When some form of treatment of the information is required, it is best to put the information on a queue and perform the treatment when resources become available.

Scenario: The interaction is very dynamic. Continuous request/response exchanges are required.
Synchronous execution: Although not much different from the first scenario, this scenario will quickly put heavy constraints on the resources of your system. Synchronous interaction is only recommended if you have a lot of resources at your disposal.
Asynchronous execution: If you have limited resources at your disposal, introducing some form of asynchronous behavior in this instance will allow the system to better manage resource consumption. Some middleware solutions will do this for you (see the note below).

Note: Although we speak here of synchronous and asynchronous interactions, we are talking from the point of view of the system’s designer and/or developer. At the level of the CPU, most executions will be asynchronous to some extent. This asynchronous behavior comes from the fact that the operating system will manage the execution of multiple processes and therefore pre-empt or queue the execution of some of these processes, thereby introducing a form of asynchronous behavior. Moreover, when running code on a transaction server or application server, the middleware will normally rely on one or more resource usage control mechanisms that will also introduce a form of asynchronous execution. These considerations are important either when defining the capacity requirements of a system that is running more than just one application or when troubleshooting performance issues on a shared production system.
In order to ensure that the cache is just a cache and not a full replica of your
data store, the cache will have to be provided with a caching policy that will deter-
mine which elements to remove from the cache once the memory consumption of
the cache has reached a certain limit. A list of commonly used caching policies can
be found in Table 5.3.
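As an illustration of one common policy, least-recently-used eviction can be sketched in a few lines of Java; the class name and capacity are arbitrary:

    import java.util.LinkedHashMap;
    import java.util.Map;

    // A size-bounded cache that ejects the least recently used entry first.
    public class LruCache<K, V> extends LinkedHashMap<K, V> {

        private final int maxEntries;

        public LruCache(int maxEntries) {
            super(16, 0.75f, true);   // access order, so reads refresh recency
            this.maxEntries = maxEntries;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
            return size() > maxEntries;
        }
    }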
Finally, one conundrum we have experienced with regard to caching on some
of our projects could be dubbed “too much of a good thing.” On many occasions we
have found development teams replicating cached information in various pieces of
related code. Although we mentioned earlier that it was not uncommon to see caches
at different layers of an architecture, we have to caution you that this does increase
the chances of desynchronization between the various caches and the root entity
that is being cached. And it also consumes a lot of memory, which is still a valuable
commodity. The right places for caching must therefore be defined as part of your
global system architecture and not left to the whim of each and every developer.
Antipatterns
Whereas design patterns illustrate proven approaches to common problems, anti-
patterns exemplify design flaws that consistently cause applications to have prob-
lems. On the topic of performance, the authors have seen the following patterns
repeated over and over again without predictable results. As important as it is to
“do the right thing,” it is equally important to be able to recognize the wrong thing
and be equipped to avoid it.
Table 5.3 (excerpt) — Time Expiration: the cache maintains a timestamp for each member and ejects members once they have aged to the configured timeout.
Overdesign
During the design phase of a project, it is easy to become enamored with perfor-
mance strategies and build them into your solution design. The introduction of
these strategies can quickly escalate the complexity of your application. The best
advice that we can give you is to design to best practices and then performance-tune
to your bottlenecks.
In other words, you may spend weeks perfecting your caching strategy only to
find that the native I/O for most of your data retrieval is perfectly acceptable without
a cache at all. To make matters worse, you may find that you have serious perfor-
mance issues, but none of them are in the focus areas you invested in during your
design phase. If you design your application flexibly and follow simple industry stan-
dards around performance, you are unlikely to have problems introducing tuning
and enhancements into your design once you have identified concrete problems.
Overserialization
Innovations in technology continue to make it increasingly convenient to build
systems based on distributed architectures. Support for distributed processing is
a core feature in the two most common development platforms in use today: the
J2EE specification and Microsoft’s .NET framework. These frameworks allow you
to develop objects, deploy them in a distributed way, and then access them trans-
parently from any of the components in the distributed architecture.
The platform manages all of the implementation details associated with remote
invocation, freeing the application developer to focus on the business-specific
aspects of the system. This is a powerful advantage for any developer working with
these platforms. However, this flexibility comes at a price. Anytime you exchange
data over a network, the request data must be serialized into a stream and then
transmitted over a wire. At the receiving end of the request, the remote implemen-
tation must deserialize the request data and reconstitute the request in object form.
This process is usually referred to as marshalling and unmarshalling the request.
The same process is required in reverse to transmit the response back to the caller.
There are two performance exposures in this scenario: the processing cost of marshalling and unmarshalling the data on every call, and the overhead incurred by each remote invocation itself.
[Figure: an ORDER table (ORDER_ID [PK], DATE, STATUS, CONTACT_ID [FK], DESCRIPTION) mapped to an OrderEntityBean whose remote interface exposes create() plus a getter and setter for each attribute, alongside an OrderValue object carrying the same attributes (orderId, date, status, contactId, description).]
By introducing a value object, a client is able to construct a single value object and set
all of the attributes on the order entity bean using a single method call. Let’s look at
the before-and-after sequence of operations between the local and remote application tiers:
Without the value object, the client makes five remote calls: create(), setDate(), setStatus(), setContactId(), and setDescription(). With the value object, the client calls create(), constructs an OrderValue object locally, and then makes a single remote call to setOrderValue().
The introduction of a value object allows us to avoid three remote method invo-
cations that were required in the original implementation. This is a simple and
widely used design pattern. Method invocation for EJBs also includes layers for
security and transaction handling that introduce marginal overhead. In addition
to avoiding serialization on each call, the value object implementation avoids these
additional costs also.
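A sketch of the value object itself and of the coarse-grained call that replaces the individual setters; the names follow the example above:

    import java.io.Serializable;
    import java.util.Date;

    // Serializable carrier for the order attributes: one object crosses the
    // network instead of one remote call per attribute.
    public class OrderValue implements Serializable {

        private static final long serialVersionUID = 1L;

        private Integer orderId;
        private Date date;
        private String status;
        private Integer contactId;
        private String description;

        // getters and setters omitted for brevity
    }

    // On the bean's remote interface, a single method accepts the whole value:
    //
    //     void setOrderValue(OrderValue value) throws java.rmi.RemoteException;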
Related to this topic, the EJB 2.0 specification introduced the notion of local
interfaces for EJBs. This feature allows EJBs to define both local and remote inter-
faces. Local interfaces allow a calling application to pass arguments by reference
instead of by value. Prior to EJB 2.0, all EJB method invocations had to be by
value, meaning that a copy of the parameter data had to be serialized to the remote
instance and unmarshalled. Client code can now use local interfaces when the
developer knows that the calling application code will be located in the same
application container as the bean implementation.
Oversynchronization
Synchronization is an important implementation tactic for ensuring data integrity
in software systems. The term usually refers to the need to ensure that only a single
thread of execution is able to use a given resource at any one time. This is usually
achieved by introducing a lock or semaphore that can only ever be granted to a
single thread of execution at any one time. A good example of this is write opera-
tions for database records.
If a user is updating a record in the database, you do not want concurrent write
operations to proceed simultaneously. In a worst-case scenario you might end up
with a record that has been updated by a combination of two separate updates.
Synchronization issues can be difficult to find once they have been introduced
to a system. It is important to review your application design carefully prior to
performance testing to try to avoid this type of bottleneck.
If your application has been built and you suspect synchronization may be caus-
ing performance degradation, this type of problem is often characterized by lower
than expected CPU usage under load, for obvious reasons. Custom instrumenta-
tion and profiling is often required to isolate this type of problem.
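A sketch of how the problem typically appears in code, together with the narrower critical section that relieves it; the business methods are placeholders:

    public class AccountService {

        // Oversynchronized: every caller queues behind one lock, even though
        // only the balance update touches shared state.
        public synchronized void recordPaymentCoarse(Payment payment) {
            Receipt receipt = renderReceipt(payment);   // CPU work, no shared state
            applyToBalance(payment);                    // the only step needing the lock
            archive(receipt);                           // I/O, no shared state
        }

        // Narrower critical section: concurrent callers overlap on the
        // expensive work and only serialize on the shared-state update.
        public void recordPayment(Payment payment) {
            Receipt receipt = renderReceipt(payment);
            synchronized (this) {
                applyToBalance(payment);
            }
            archive(receipt);
        }

        private Receipt renderReceipt(Payment payment) { return new Receipt(); }
        private void applyToBalance(Payment payment)   { /* update shared balance */ }
        private void archive(Receipt receipt)          { /* write to storage */ }

        public static class Payment { }
        public static class Receipt { }
    }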
Algorithms
There is no denying that it is often more elegant to optimize a computer algorithm
so that it will yield the appropriate performance rather than buying a bigger com-
puter to handle a poorly conceived program.
It is not our intention to provide the reader with an exhaustive list of all of
the incredible algorithms that exist. Not only would the list need revisiting on an
hourly basis, but, when it comes to the basic algorithms that matter, Donald E.
Knuth did a much better job in his “The Art of Computer Programming” series than
we could ever hope to achieve.5
Our aim is to make you aware that when it comes to performance program-
ming you will need to surround yourself with professionals that understand the
ins and outs of building efficient algorithms. These professionals will need at least
some basic notions of computational complexity theory, and will understand the
advantages, shortcomings, and pitfalls of the software libraries they will be using.
In other words, they will be able to tell you whether an algorithm will take expo-
nential time to compute or not, and which library function is best suited to support
the execution of your algorithms.
For instance, when it comes to sorting algorithms, that person will be able to
tell you that Quicksort has an average-case complexity of Θ(n log n) but a worst case
of Θ(n²), and that there are other algorithms, such as Heapsort and Mergesort, that may
be more suitable depending on the problem you are tackling. Moreover, if he is a Java
developer he will also tell you that the Arrays.sort method uses a Mergesort variant for
arrays of objects, which has the advantage over Quicksort that it provides a stable sort
(it maintains the relative ordering of elements with the same comparable value). By
relying on developers who possess these skills, you will require less investment in
hardware capacity and in what is certain to be a long and tedious non-functional test cycle.
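A small illustration of what a stable sort guarantees when sorting objects with Arrays.sort; the data is arbitrary:

    import java.util.Arrays;
    import java.util.Comparator;

    public class StableSortDemo {
        public static void main(String[] args) {
            String[] names = { "Smith, Anne", "Jones, Bob", "Smith, Carl" };
            // Compare on the last name only; the two Smith entries are "equal".
            Arrays.sort(names, new Comparator<String>() {
                public int compare(String a, String b) {
                    return a.split(",")[0].compareTo(b.split(",")[0]);
                }
            });
            // A stable sort keeps "Smith, Anne" ahead of "Smith, Carl".
            System.out.println(Arrays.toString(names));
        }
    }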
Technology
Programming Languages
The chosen programming language will most certainly have an impact on the per-
formance of your application. The choice of language must be a careful balancing
act between the need for execution efficiency and the need for programming
efficiency; in most cases, the need for companies to standardize on a particular set of
technologies will also be a factor.
From the standpoint of performance it is best to look at programming lan-
guages based on how the resulting program will be executed in the target environ-
ment rather than based on the language itself:
Compiled languages
Languages such as C++, Cobol, or Fortran are compiled so as to execute using the
instruction set of the target platform/CPU. This will usually yield the best execu-
tion times, given that the compiler can fully optimize the executable code for the
target system. We will not further elaborate on these languages, as it can be accepted
that these are probably the most efficient languages from a performance perspec-
tive. But in many cases these languages do not yield the same level of productivity
as more modern languages such as those we will discuss hereafter.
Virtual-machine-based languages
In this category Java and C# are probably the most prominent examples although
many other languages are available that either run on a Java or .NET runtime, or
have their own VM implementation.
Note that some parties may not agree with the statement that C# and the
other .NET languages are virtual-machine based, but by our reckoning there isn’t
much difference between Java’s bytecode and JVM approach and .NET’s common
language runtime approach except, perhaps, for the fact that the CLR is more
language-agnostic than the JVM. From a performance perspective it has been dem-
onstrated that there is little or no difference between the two technologies.6
There is a price to pay for the use of a virtual machine. It will have a larger
memory footprint than that of an average compiled program, given the need for it
to house its own runtime environment as well as the extensions it uses to instru-
ment or optimize code execution, such as a just-in-time or Hotspot compiler, or
built-in monitoring capabilities.
The virtual machine will also have a slight performance cost. This performance
cost is linked to a number of factors. First and foremost, there is the startup cost
due to the need to convert the code targeted at the virtual machine (VM) to code
targeted at the underlying CPU. The way this impacts performance may differ
depending on whether a just-in-time compiler is used versus a Hotspot compiler
(more about this below).
Then there is the fact that the virtual machine also serves as a “sandbox” for the
code’s execution. In other words, it will attempt to contain any malicious activity
that may emanate from the code. This means that additional checks will be per-
formed during code execution, which will also slow down the functions impacted
by such checks. Note that if the code comes from a trusted source you can disable
most of these checks; here again you must find the right balance between security and
performance (as discussed in the section on securability).
Finally, one of the main causes for performance degradation with VM-based
languages is not so much related to the VM but to the fact that these languages
make use of garbage collection for their memory management. Although memory
managed through garbage collection proves conducive to faster development (given
the fact that the developer “seemingly” doesn’t need to care about how his use of
memory gets managed), it is also the primary reason why some VM programs
perform very poorly.
Many programmers will not think about memory consumption anymore when
using a language that does all the memory management work for them. This, how-
ever, will result in the garbage collector having to do all the “thinking” for the pro-
grammer—at the cost of performance. Because of some of the constraints imposed
on a garbage collector, it will stop a program’s execution in order to collect the
memory that has become unused. As a result, the program’s overall performance
gets degraded and in many instances the user’s perception of this performance
degradation negatively impacts the acceptance of a system. It is important for a
development team to understand this issue and ensure that memory management
remains a concern when using these languages.
Some of the techniques to alleviate these problems, such as object pool-
ing, are well known and should be part of every programmer’s bag of tricks.
Although applying good programming practices will remedy the problem,
it is also noteworthy that research in the area of garbage collection has not
stopped and that today new approaches to this complex issue solve some, if
not all, of the performance impacts brought about by this type of memory
management.7
Interpreted Languages
Although any language can be either compiled or interpreted, the languages that
were built with an interpreter as the underlying engine usually have two things in
common. They are either purpose-built to be efficient at one or more specific tasks
or they have been conceived to be very dynamic in nature, and quite often they
have both characteristics.
Most, if not all, so-called scripting languages are interpreted languages. The
vast majority of scripting languages are purpose-built; for instance, shell script
languages target the manipulation of operating system artifacts such as files and processes.
Distributed Processing
Distributed processing can take many forms and has been around for quite some
time. Before Web Services ever saw the light of day there were RPC, CORBA, DCOM,
RMI, and possibly other mechanisms to enable a software solution to execute func-
tion calls across a network.
These calls come at a great computing cost. Not only does the function call
need to be translated into a format that is platform independent (remember the
section on overserialization), but additional checks are required to verify connectiv-
ity, additional mechanisms are required to manage the lifecycle of remote objects
or processes, security has to be taken into consideration, and possibly distributed
transaction solutions might have to be involved. All of these will drain the capacity
of your system for the sole purpose of making a call over the network. You must
therefore make sure that this luxury is used sparingly and for the right purpose.
One of the greater benefits brought by Web Services is that this technology has
put an emphasis on the notion of providing services rather than functions across
the network. Services are of a higher order than functions and, when designed cor-
rectly, will elicit different usage patterns that aim at limiting the number of calls
over the network. A service-oriented approach is the right approach to designing
distributed solutions; it gives cause for thought as to why it has taken us so long to
figure this out.
Make sure to keep this in mind when designing your distributed solution even
if you do not use actual Web Services technology. Design with services in mind,
rather than functions. Create services that represent actual business functions, and
therefore have a real business value. Build the interfaces so as to limit the number
of calls required during any given interaction.
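A sketch contrasting a chatty, function-style remote interface with a coarser, service-style one; the interface and type names are illustrative:

    // Function style: three network round trips to assemble one customer view.
    interface CustomerFunctions {
        String getName(long customerId);
        String getAddress(long customerId);
        String getPhoneNumber(long customerId);
    }

    // Service style: one round trip returns a business-level document.
    interface CustomerService {
        CustomerProfile getCustomerProfile(long customerId);
    }

    class CustomerProfile implements java.io.Serializable {
        private static final long serialVersionUID = 1L;
        String name;
        String address;
        String phoneNumber;
    }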
The additional bonus you get from using actual Web Services is that you can
rely on the actual infrastructure that was built for the Web. This gives you access to
a whole plethora of solutions for load balancing hypertext transfer protocol (HTTP)
requests, monitoring network traffic, and handling network failover.
Distributed Transactions
If you have decided to go distributed you may be faced with a dilemma regard-
ing the way to deal with transactions. Transactions are dealt with easily when a
single resource is involved (e.g., a database); however, when multiple resources are
involved, and these resources are moreover distributed across the network, the com-
plexity of transaction processing gets multiplied.
XML
In what follows we talk about all the variants of the extensible markup language
(XML), and not about a specific standard or a particular industry. Although it can
be said that XML has done wonders to enable collaboration of widely disparate
systems over the Internet, it is probably one of the worst technological choices when
it comes to performance.
Given that the goal of XML (and its predecessor, standard generalized markup
language [SGML]) was to create a language that was both readable by humans and
by the computer at the same time, it is a structured language, but not one that is
the most efficient for a machine to read. Humans require verbose identifiers and
some formatting—such as spacing, line breaks, and tabulations—in order to be
able to read and understand XML, whereas the computer couldn’t care less and
would be more efficient if it didn’t have to read all the formatting characters and
was provided with numerical identifiers that take up less space and can be more
readily matched to records in a database or memory array.
By making the above statement we are not encouraging you to make XML
more machine readable and less human readable, as this would defeat one of the
main reasons for the use of XML. If you were inclined to do so, we would encour-
age you to look at other means for transporting data rather than using XML.
The message we want to pass on is that XML is a beneficial technology when
it comes to the definition of messaging contracts between heterogeneous systems,
but that it should not be used indiscriminately for any sort of communication,
especially when performance is critical.
When you do end up choosing XML as the mechanism for communication of
your application, the one thing to choose correctly is the parsing technology that
will read the XML. A number of parsing mechanisms exist, some more efficient
than others. Choosing the one that is right for you will depend on what you need
to do with the XML data. At different ends of the complexity spectrum you have
mechanisms that are SAX (simple API for XML) based and those that are DOM
(document object model) based.
SAX-based solutions will handle the XML piecemeal, one element at a time.
The overhead of the parser is minimal but you have to do all the leg work yourself.
The advantage is that you have complete control of the parsing and can stop it at
any time if you do not require all of the information in the XML, or if the XML
is incorrect.
DOM-based techniques will parse the complete XML and provide the devel-
oper with a document object model, which can be used to programmatically tra-
verse the XML elements. The advantage here is that using the object model the
developer has complete flexibility in the manipulation of the XML structure. It is
possible to get a list of all elements with a certain name, to add or remove elements
to the structure, etc.
In both cases the parser will usually give you the option of validating the XML for
you against either an XML schema or an XML document type definition (DTD).
Without detailing these two mechanisms—which is not within the scope of this
book—we can, however, mention that a schema is more complex to validate than
a DTD.
The right parsing mechanism is the mechanism that will perform exactly the
amount of work that you require. In most cases SAX-based mechanisms will do the
trick when all you need to do is read the XML once to transform it into some other
format or object model, whereas DOM will be more useful if the XML structure
needs to be traversed a number of times and possibly modified.
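A minimal SAX sketch in Java: the parser streams through the document and the handler reacts to one element at a time; the element name and file argument are illustrative:

    import java.io.File;

    import javax.xml.parsers.SAXParser;
    import javax.xml.parsers.SAXParserFactory;

    import org.xml.sax.Attributes;
    import org.xml.sax.helpers.DefaultHandler;

    public class QuoteElementCounter {

        public static void main(String[] args) throws Exception {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            final int[] count = { 0 };

            parser.parse(new File(args[0]), new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes attributes) {
                    // Handle the document piecemeal, one element at a time.
                    if ("quote".equals(qName)) {
                        count[0]++;
                    }
                }
            });

            System.out.println("quote elements: " + count[0]);
        }
    }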
Software
This section will look at performance from the perspective of different software
solutions found in the most common system architectures in the industry. Each one
of these common software infrastructure pieces will require special attention when
it comes to performance tuning. Our goal here is to provide some commonsense
guidelines regarding the attention points for each of these systems when scrutiniz-
ing performance.
These guidelines will obviously not replace the expertise of a person specialized
in the configuration and operation of these solutions, but should provide the reader
with enough insight to tackle some of the more common performance issues found
when dealing with these often used infrastructure components. However, when in
doubt, hire a professional!
When building a system that will require a large amount of tuning and has some
very stringent performance requirements, we are confident that you will require
support from the software vendor(s) you have selected to support your system.
It is not unlikely that you will need to ask your vendors for changes and fixes.
Databases
When it comes to databases, you can summarize the things to focus on when con-
sidering the performance of your database server in one word: structure. In what
follows we will discuss relational database systems, since these are the systems that
we the authors are most familiar with. We are confident that whatever the database
system, the means to tune it will always deal with structure. Other database engines
will likely use a different terminology to refer to their specific structures.
Storage Structures
The structure of your database will be important at different levels. Let us start at
the lowest level: the structure of the data files on the physical storage system.
Four main data structures normally compose a database system:
1. The system tables that hold information about the database structure itself, or
what is usually referred to as metadata (data about the data).
2. The database tables and other objects such as stored procedures, views, and so
forth.
3. The database indexes, which, although they are another type of database
object, are considered separately given the essential role they play in making
a database efficient.
4. The transaction logs, and other log files used to handle various aspects of a
database’s operations.
Each of the above structures is stored by most database solutions in one or more
files. In order to optimize access to these files it is preferable to store them on dif-
ferent file systems, segregated across different disks. As a result, when these files are
accessed in parallel by the database engine, disk access will also occur in parallel.
Index Structures
Once the files are correctly structured, the next area to look into is the indexes. Indexes allow the database engine to optimize query access to your data, but they slow down creation, update, and deletion operations. Define your indexes with care and make sure to include the appropriate columns in each index. It is possible to combine more than one column into a single index, which enables the engine to use that index for either column; you will, however, have to give precedence to the most frequently used column. For instance, in the example below, both index creation stanzas allow the engine to optimize access based on the values of columns A and B. However, the first stanza is better when the sort order or selection criteria favor first A and then B, whereas for the second stanza the opposite is true.
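The original example is not reproduced in this copy; the following is a minimal sketch of what such a pair of index-creation stanzas might look like, issued here through JDBC. Only the columns A and B come from the text; the table and index names are illustrative.

import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

public class IndexCreation {
    static void createIndexes(Connection conn) throws SQLException {
        try (Statement stmt = conn.createStatement()) {
            // Favors queries whose selection or sort criteria use A first, then B
            stmt.execute("CREATE INDEX idx_a_b ON my_table (a, b)");
            // Favors queries whose selection or sort criteria use B first, then A
            stmt.execute("CREATE INDEX idx_b_a ON my_table (b, a)");
        }
    }
}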
Partitions
If you still do not achieve the desired performance after tuning all of the above structures, you may need to partition your data. Partitioning should only be considered when dealing with very large amounts of data (as a rule of thumb, more than a million rows in one table). When dealing with these kinds of volumes, partitioning allows you to apply a "divide and conquer" strategy: the data is divided into smaller volumes that can be managed more efficiently. When a query has to take into account data across all of the partitions, it is also possible for the engine to optimize execution by accessing the information in the different partitions in parallel and then merging the results from all partitions at the end. If the data that needs to be retrieved is not large, this will be a lot more efficient than looking up the data in a linear fashion.
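As an illustration only, and assuming Oracle-style range-partitioning syntax (other engines differ), a partitioned table might be declared as follows; the table, columns, and partition boundaries are made up for the example.

import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

public class PartitionExample {
    static void createPartitionedTable(Connection conn) throws SQLException {
        // Each partition holds one year of rows; the engine can skip partitions
        // a query does not need, or scan several partitions in parallel.
        String ddl =
            "CREATE TABLE orders ("
          + "  order_id   NUMBER,"
          + "  order_date DATE,"
          + "  amount     NUMBER"
          + ") PARTITION BY RANGE (order_date) ("
          + "  PARTITION p2006 VALUES LESS THAN (TO_DATE('2007-01-01','YYYY-MM-DD')),"
          + "  PARTITION p2007 VALUES LESS THAN (TO_DATE('2008-01-01','YYYY-MM-DD')),"
          + "  PARTITION pmax  VALUES LESS THAN (MAXVALUE)"
          + ")";
        try (Statement stmt = conn.createStatement()) {
            stmt.execute(ddl);
        }
    }
}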
Application Servers
As this book aims to be generic, we will not try to discuss performance tuning of any specific application server on the market; there are books better suited than this one for divulging all the tips and tricks of a given vendor's application server. Instead of giving you a grocery list of all the different parameters that can be used to get the most out of this or that application server, we will focus on the resources that have to be tuned for any middleware of this sort.
Tuning of an application server could be referred to as “the tuning of the pools”
given that the control of resources within such servers is usually managed by chang-
ing the size of a pool of resources. The pools that can normally be sized are listed
in Table 5.4.
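The sizing rules described in Table 5.4 below come down to simple arithmetic; a minimal sketch, with purely illustrative input figures rather than measured ones:

public class PoolSizing {
    // Connections = ConnectionsPerRequest * Threads + K, where K is a small
    // safety margin; PoolSize = ObjectsInTransaction * Threads. The figures
    // below are illustrative, not taken from the text.
    public static void main(String[] args) {
        int threads               = 50;  // concurrent requests the server may process
        int connectionsPerRequest = 2;   // e.g., one database and one queue connection
        int safetyMargin          = 5;   // the constant K
        int objectsInTransaction  = 40;  // largest object graph touched by one request

        int connectionPool = connectionsPerRequest * threads + safetyMargin;
        int objectPool     = objectsInTransaction * threads;

        System.out.println("Size the connection pool to at least " + connectionPool);
        System.out.println("Size the object pool to at least " + objectPool);
    }
}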
Table 5.4  Resource Pools

Threads. This is the central resource of your application server: it determines how many requests, synchronous or asynchronous, can be processed concurrently inside the server. Note that, depending on your server, it may be possible to define more than one set of threads (also referred to as thread pools); each set can then be associated with a specific request channel (e.g., one set for all HTTP online requests and one set for all message-based asynchronous requests). It is important to understand the relation between threads and the other resources in the system. In order for a thread to completely handle a request it will most likely need to access various other resources in the application server. If these other resources are not sized in a way that guarantees a resource is always free when a thread needs it, that resource will become a bottleneck and introduce performance issues. It is therefore a good rule of thumb to have more of these other resources than there are threads. If many different types of requests are processed by the system, it is useful to divide the processing between different sets of threads and to size the resources used by those requests based on the size of the corresponding thread set.

Connections. Many different connections can be managed by the application server. Database connections are the most common, but there can also be connections to messaging middleware, connections to third-party applications, network connections for all sorts of protocols, and so on. As mentioned above, you will have to size these connections based on the number required by a typical request multiplied by the number of parallel requests the system can handle at any given time, which is equivalent to the number of threads that can process the request: Connections = ConnectionsPerRequest * Threads + K. K is a small constant added to account for errors in your knowledge of how many connections are required per request; 5 is usually a good number. You will also have to be certain that the target system for the connections (database, third-party application) has sufficient capacity to handle the number of connections you plan to open against it. If that application is itself application server–based, for instance, you may need to ensure that its thread count is equal to the number of connections you have foreseen.

Objects. Although not true of all application servers, most modern ones use an object or component paradigm. To manage the memory usage of the server, the server will not allow unlimited creation of the base components in memory, but will instead rely on pools of objects that can only grow to a certain size. These object pools are a resource like any other in the system, and hence can be sized according to the same rules as the connections discussed above. We discuss objects separately because some application servers use the object pool not only to recycle old objects when creating new ones, but also as a transactional cache. In that case the cache maintains the state of objects over the lifecycle of a transaction. For some types of requests a large number of objects may participate in a transaction, and hence the size of the object pool should be based on the largest number of objects that may participate in a transaction for your requests: PoolSize = ObjectsInTransaction * Threads. In this case you will also have to verify that your system has enough memory to host all of the different object pools.

Messaging Middleware
Messaging middleware, also known as queues, plays an important role when it comes to asynchronous processing. Depending on the use you want to make of this type of middleware, your concerns should focus on different characteristics of these solutions.
- If you are looking for raw speed, there are solutions (e.g., Tibco Rendezvous) that are very efficient at fast message delivery. These solutions draw heavily on your network resources and are highly dependent on your network topology. Their purpose is to deliver messages fast, but as a result they will not always guarantee actual delivery of a message, or the uniqueness of that delivery (the message might get delivered two or more times). They are ideal when messages need to be broadcast very efficiently, actual delivery is not mandatory, and the receiving system tolerates multiple deliveries of the same message.
- If guaranteed delivery is what you are looking for, the messaging solution you choose will have to include a mechanism to persist the data. This means that performance will be impacted by the additional I/O cost that is incurred; depending on the underlying persistence mechanism, the impact can be non-negligible. Many messaging systems (e.g., IBM WebSphere MQ) will use a database system as their persistence mechanism. This provides additional flexibility for the management of the messages (they can be indexed by topic or other criteria, or the transaction manager of the database engine can be used to enroll the message persistence activity in a transaction), but it adds overhead to the whole operation of sending a message. If all you are interested in is that your message is guaranteed to get from point A to point B, a simple file-based solution may be all you require. (A minimal code sketch of this delivery-mode trade-off follows this list.)
- Your requirements may also involve complex routing, in which case the throughput of your setup will depend on the routing rules you have defined and on the associated network infrastructure. In the case of complex routing across multiple networks, the overall behavior of this type of middleware will depend more on network latency, network traffic, and the like than on the configuration of the middleware itself.
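For middleware that exposes the JMS API (IBM WebSphere MQ does, for example), the delivery-guarantee trade-off discussed above surfaces as the delivery mode set on the producer. A minimal sketch, assuming a provider-specific ConnectionFactory and Queue obtained elsewhere (via JNDI, for instance):

import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.DeliveryMode;
import javax.jms.JMSException;
import javax.jms.MessageProducer;
import javax.jms.Queue;
import javax.jms.Session;

public class DeliveryModes {
    static void send(ConnectionFactory factory, Queue queue, String text,
                     boolean guaranteed) throws JMSException {
        Connection connection = factory.createConnection();
        try {
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            MessageProducer producer = session.createProducer(queue);
            // PERSISTENT messages survive a broker restart at the cost of extra I/O;
            // NON_PERSISTENT messages are faster but may be lost.
            producer.setDeliveryMode(guaranteed ? DeliveryMode.PERSISTENT
                                                : DeliveryMode.NON_PERSISTENT);
            producer.send(session.createTextMessage(text));
        } finally {
            connection.close();
        }
    }
}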
One of the nicer things about messaging middleware is that performance problems associated with these tools are fairly easy to identify: find where messages are piling up in a queue and you have found where the problem is. This does not mean, however, that the problem will be easy to resolve.
ETLs
Now that we have discussed software that is used mainly for processing discrete units of work, such as messages and online requests, let's talk a bit about tools that are geared toward processing high-volume "batch" units of work. This software family is referred to as ETL, which stands for extract, transform, and load: extract a lot of information from one or more places (databases, files, or other storage media), transform it in some way, and load it into (usually) another place or set of places.
ETLs are by their very nature resource intensive. They will try to squeeze the most out of your system in order to extract the data as quickly as possible, transform it at blazing speed, and load it into its target. Extracting and loading put a heavy strain on your system's I/O capacity, whereas the transformation drains memory and CPU. These tools usually offer an impressive number of parameters that help you tune them so that they use only the resources you want them to use.
The one thing to understand about this type of software is that it is mainly a way to ease the implementation of processes that conform to the pipe-and-filter pattern. The main characteristic of this pattern is that it is linear and does not automatically lend itself to parallelization, which would cut the time necessary to perform the required transformations. This means that it will often be the job of the ETL engineer to determine how to parallelize the transformation process.
Parallelization of a pipe-and-filter process is straightforward in itself (as shown
in Figure 5.4). All you need to do is split the data up so that it can be processed
in parallel.
In practice, however, it is seldom as easy as we make it sound. It must be possible to split the data, and whether it is depends on several aspects of the data and of the transformation process (a minimal parallel-split sketch follows the list below):
- Splitting the data and processing it in parallel must be less costly than processing everything linearly. In other words, the splitting process must be cheap, and you must have sufficient CPU power to process the data in parallel.
- The data entities being split up must not be interdependent from the perspective of the filtering process; otherwise that process will not function correctly.
- When other data inputs are used within the same process, it must be possible to split those inputs as well, or to replicate them so that one copy is available to each parallel process.
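A minimal sketch of the split-and-merge idea in plain Java follows; the transform method is a placeholder for whatever the filter stage actually does, and the chunking strategy is only illustrative.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelPipe {

    // Placeholder for the transformation (filter) applied to each record.
    static String transform(String record) {
        return record.toUpperCase();
    }

    static List<String> run(List<String> records, int partitions) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(partitions);
        try {
            // Split: carve the input into roughly equal, independent chunks.
            int chunk = (records.size() + partitions - 1) / partitions;
            List<Future<List<String>>> futures = new ArrayList<>();
            for (int start = 0; start < records.size(); start += chunk) {
                final List<String> slice =
                    records.subList(start, Math.min(start + chunk, records.size()));
                futures.add(pool.submit(new Callable<List<String>>() {
                    public List<String> call() {
                        List<String> out = new ArrayList<>();
                        for (String r : slice) out.add(transform(r));
                        return out;
                    }
                }));
            }
            // Merge: gather the partial results back into one output.
            List<String> merged = new ArrayList<>();
            for (Future<List<String>> f : futures) merged.addAll(f.get());
            return merged;
        } finally {
            pool.shutdown();
        }
    }
}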
If you are unable to split the processing into a number of parallel chunks, you may be reduced to finding the ETL solution that is most appropriate for your problem. As usual, the software landscape is rife with different kinds of solutions in this space. Some are very generic, favoring all sorts of transformations and ease of use but providing results that are not always optimal. Others are targeted at very specific problems, e.g., sorting data. Many database vendors also offer solutions for handling data extracted from their database; these solutions are usually not very user friendly or loaded with functionality, but they are designed to optimize the extraction and load process from and to the database, which is often costly because of its I/O nature.
Hardware Infrastructure
Phew! You’ve made it this far. You’ve made sure that your requirements were speci-
fied with performance in mind, you designed your application to use every bit of
CPU and memory available to you, you optimized your code, and you tuned all of
the software pieces you were reliant upon. And still you want more bang for your
buck. It is now time to look at your hardware infrastructure.
Resources
When it comes to hardware, the problem of performance becomes a problem of
managing the resources that you have available to you, knowing that most of these
resources do not come cheap. You’ll have to determine the configuration that will
optimize your usage of CPU, storage, network, and possibly other hardware devices.
Where to look first for optimization options will highly depend on the profile of
your application.
If you are dealing with a computing intensive application you’ll want to have
the fastest CPUs, and possibly a lot of them as long as the application scales hori-
zontally. For such applications, looking at network throughput and latency may
only be necessary if you want to distribute the processing across multiple computer
nodes, and communication between said nodes is intensive. Storage will likely be
of little concern.
If you are dealing with an online transaction system, storage and network capabilities will likely be of the essence. You'll want network hardware that can distribute your load across multiple servers, you'll want your database and storage systems tuned to minimize I/O latency and maximize throughput, and you'll want to do the same for your network. The network topology will have to be designed to minimize packet hops; preferably, computer nodes that exchange a lot of data should be on the same subnet and use gigabit connectivity (or better, if available). Storage will likely require direct channel attachment of the storage devices, probably using technologies such as dark fiber.
If you are dealing with heavy batch processing, you’ll have to be particularly
attentive to the I/O capabilities of your servers. You’ll want to ensure that I/O
bandwidth can be tuned and that sufficient I/O channels are available on the
machine to allow some level of scalability for I/O operations.
Whatever your challenge, you’ll want to make sure that a proper capacity pro-
jection was done as part of your project, and that it is later substantiated by taking
adequate measurements during your performance and sustainability tests.
Yet another approach that we have seen used is the federation of applications on shared application servers. In this case applications are deployed as packages (e.g., EAR files) onto the target application server, and it is the resources dedicated to the application server (see page 125) that are shared by the different applications. This is probably the most difficult approach to control from a capacity-management perspective. If applications deployed in this way are not well behaved and the application server resources they use are not segregated, situations will arise in which one application ends up consuming all resources, thereby leaving the other applications with no processing power. It is therefore important, should you choose this approach, to impose strict rules on the application developers regarding the way they use application server resources and how they configure their components. You should try to favor independent resource pools (see Table 5.4) as much as possible; this way each application will impact only its own resources.
Summary
Somebody recently told me, “presentation is 50% of success,” and although he was
talking about the clothes he was wearing I believe this is very true when it comes to
presenting information to a user.
For the end user, the perception of performance is what counts; it is quite pos-
sible that the underlying system is doing more processing than what would be abso-
lutely necessary, but if this gives the user the impression that everything is going
very fast, you have probably done something right.
Today's AJAX solutions, for instance, are a big help in making the user perceive that things are going faster. While the user fills out a Web form, his inputs are being checked by XML-based requests made in the background. Although the XML messages sent back and forth between the Web browser and the server require more capacity from the server systems than a single request for the whole validation would, the overall perception for the user is that his work, and therefore the system, is faster.
Another of today's technologies that can be used to give the user a perception of faster processing is work-flow technology. Using a work-flow system (a.k.a. an exception management system) you can refrain from executing tedious parts of the processing as part of the user's transaction. If a complex validation can be split into a simple validation capturing 80% of the problems and a more involved validation that is required yet only triggers an error 20% of the time, the secondary check can be left for later and executed as a separate part of the work-flow process. When the error is triggered, either a compensating action can be taken or the user can be notified at that time that something went wrong. By using this strategy the user is never bothered by the overhead of the difficult validation, and is only bothered by its outcome, after the fact, 20% of the time.
These technologies may not be the right ones to solve your performance prob-
lems, but decoupling parts of the processing from the user’s interaction process may
well be what you need for some of them. So don’t always think of performance;
think of appearance.
Test Planning
Non-functional test planning must anticipate what happens to your system when:
- Things go wrong
- A number of things happen all at once
- Extreme conditions are reached (e.g., extreme data volumes)
As we shall see, there are hundreds of ways that things can go wrong. For com-
plex systems, the permutations of human and machine inputs are virtually endless.
In this section, we look at ways of coping with the magnitude of possible test cases
that this reality presents for both operability and performance test types. But first we
will discuss how to assign components inside and outside of your system boundary.
System Boundaries
Your test strategy must include a statement defining system boundaries for your
intended scope. Components that fall within your system boundary are generally
components that are being developed as part of the program or project under which
you are working. For these components it is expected that you have access to a
software vendor or in-house development team. It is also expected that you have an
in-house testing environment along with the deployment capability to install and
configure the application.
Components that are outside the system boundary are external systems and
dependencies over which you have no control. In defining these components out-
side of the system boundary, you assume that they will meet an agreed upon service
level in the production environment and do not require any direct testing as part
of your efforts.
As we shall see in Chapter 7, for these components it is often necessary to simu-
late the external system with a homegrown component that stands in for the exter-
nal dependency in order to support your test scenarios. Let’s look at an example
architecture, as shown in Figure 6.1, and apply the previous definitions to deter-
mine a system boundary.
The example shown in Figure 6.1 describes a CRM (customer relationship man-
agement) solution in which 500+ customer service representatives (CSRs) respond
to customer telephone inquiries; these customer service agents are widely distrib-
uted geographically. Primarily, agents work from home on desktop computers that
are provided to them for this purpose. The client application communicates with
an application server using SOAP/HTTPS. A collection of services are exposed as
Web services to the application server for shared functionality like sending email
and faxes. Customer information is drawn from an enterprise customer database that is accessed over an MQ Series (message queuing) interface.
[Figure 6.1  Example CRM architecture. A Win32 thick client on the agent desktop communicates with the application server over SOAP/HTTPS; the application server reaches the Oracle application database over SqlNet, the enterprise services gateway over SOAP/HTTP, and the enterprise customer database over MQ Series. The system boundary encloses the thick client, the application server, and the application database.]
In this example, all of the components shown within the system boundary are being upgraded as part of a major system enhancement. The development team is actively engaged and can assist with scoping, deployment, troubleshooting, and tuning. The enterprise customer database and the enterprise gateway tier, however, are existing services that have been previously tested. These systems are already in the production environment and support loads similar to those that will be imposed by the new system.
The organization has tested these systems well beyond the business usage model
for the distributed call center. Both of these applications are in a support mode and
do not have development resources available to assist with testing and development.
Furthermore, these enterprise services belong to a different division in the organiza-
tion. The bureaucracy required to include them within the system boundary would
cause costs to multiply. In Chapter 7 we will see how our system boundary will
influence our test execution.
It should be obvious that the elements within our system boundary are high risk and justify the bulk of our testing efforts, while the supporting legacy systems remain outside our system boundary. You will need to make a similar judgment call for your application and document it in your test strategy as part of your test planning. Our next area of focus is how to scope coverage for the test cases themselves.
Scope of Operability
For even moderately complex systems, there are a myriad of ways that things can
go wrong. Consider a simple example like database failure. Here is a list of different
scenarios in which your application can experience database failures. We use an
Oracle database as an example:
- The network fails and can no longer route responses back to the application
- The network fails and can no longer forward requests to the database
- The network cannot find the database server
- The network is functioning, but latency increases and causes 50% of requests to time out
- The database is listening for requests, but is returning an authentication error
- All Oracle processes on the database server are down
- The listener on the database is down, but the database is otherwise healthy
- The listener is running and accepting requests, but the shared server processes are down
- The database is experiencing performance issues and 50% of requests are timing out
- The database is refusing new connections
- The database is refusing new connections after a delay of two minutes
- The database is returning transaction errors
- The database is configured with Real Application Clusters (RAC) and one of the servers has stopped responding
- One or both of the servers in the RAC cluster experiences memory corruption
- One or both of the servers in the RAC cluster is unplugged
- One or both of the servers in the RAC cluster experiences a kernel panic
- An Oracle shared server process crashes and creates a core dump
- The Oracle server runs out of disk space for table extents
In this list we have included only scenarios that are external to your software system; client connectivity can also be subject to a host of potentially fatal errors. If you decide to include database failure in your non-functional test scope, which of the above test cases will you include? No one would dispute that including all of these failure scenarios results in the highest-quality test coverage, but the testing may take four weeks to execute. Is this practical?
To make matters worse, the load profile for the system is also variable. For each of the failure scenarios above there are potentially hundreds of variations of in-flight requests based on time of day and season, not to mention the randomness associated with individual user behavior. Consider a scenario in which there are 12 distinct scheduled jobs for your system. Are you going to test each and every job with each of the 20 failure scenarios above? That would result in 240 test cases just for the combination of job execution and database failure, and we haven't yet considered human and machine online inputs in this planning.
As you can see, defining your test scope means making judicious decisions. A prudent approach weighs inputs such as the technical similarity between failure modes, the associated business risk, and the cost of executing the tests. This discussion can only be had by involving the business and technical participants who helped to formulate your non-functional requirements. If we revisit the list of Oracle database failure types above, we can group them by similarity as follows:
Basically, each type of failure belongs to one of three types: (1) the database is
not responding at all; (2) the database is responding immediately with an applica-
tion error; or (3) the database is only partially available.
If we agree that fundamentally the application should react in a similar way for
each of the failure modes in any one of these categories, we can take the first step
in defining our test scope.
In each of the three categories, we select the most representative failure mode for that category. For example, we may decide that the network configuration is static and reliable, so the most likely mode of failure would be a sudden performance/capacity event on the database in which all or the majority of database requests begin to time out. We apply similar thinking to the remaining two categories and agree on a failure mode for each of them as well.
Next we examine associated business risk. In dialog with the business and tech-
nical participants, we learn that 10 of the 12 jobs scheduled for this application
are not critical. These 10 jobs perform housekeeping tasks that can be completed
anytime within a one week window. If a job fails or does not run, its processing will
be completed the following day. Further, these ten jobs run on a dedicated server
that is isolated from the more critical online application. The remaining two jobs,
however, are highly business critical. Architecturally, these jobs are constrained to
share the online application server and, to make matters worse, they run at the end
of the peak usage period for the online application. In this case, it is an easy deci-
sion to categorize these jobs as mandatory test cases for each of the three failure
modes that we have defined. As a result of our analysis, instead of 240 test cases, we
now have six test cases. Since the two jobs are independent of one another and run
on a similar schedule, we may further optimize our execution to schedule a total of
three tests in which each failure mode is tested for both of the jobs at the same time.
We will talk more about optimizing test execution later in this chapter.
Scope of Performance
The factor that most complicates performance testing efforts is variability in busi-
ness usage. In the same way that a business usage model approximates actual usage,
our test cases will approximate the business usage model. How heavily you are able
to invest in performance testing will be a function of perceived risk and expected
cost. If time and budget permit, you should derive your test cases from 100% of the load scenarios defined in the non-functional requirements. If this is not possible, you may need to sit down with business and technical participants to exclude scenarios that are:
- Low business risk (for example, a feature is not frequently accessed and/or a business user can tolerate poor performance)
- Complex to implement, because the load scenario requires preconditions that are hard to achieve or requires interactions with systems outside of your system boundary
- Technically equivalent to a scenario that has already been defined for your scope
Product Features
A prerequisite for executing performance and operability tests is a solution for how
you will create load in your testing environment. For applications having a large
number of concurrent users, a software solution is required. It is impractical to cre-
ate load manually with human testers.
Choosing the right software solution for your testing is a decision involving
many factors, including the skill set of your testers, available budget, corporate
standards, software platform for your solution, and whether your application will
have an ongoing need for non-functional test capabilities. In this section we will
enumerate a number of product features that are important in any solution.
1. Randomness: Many loading tools can create randomness in the execution of your test scenarios on your behalf. This is helpful because it allows you to execute load in a way that is generally consistent with expected usage, but that also allows for the subtle variations expected in human usage. The most common implementation of a randomness strategy is the concept of think time. Think time refers to the length of the pauses between execution of steps in the load scenario; it simulates the time that a user or external system spends looking at a screen or otherwise contemplating or processing outputs from your system. (A minimal think-time sketch follows this list.)
2. Supported interfaces (web, wireless, client/server, etc.): Some load test-
ing solutions support exclusively web-based applications. If you represent the
needs of a large organization, you should inventory the full range of appli-
cations for which testing will be required and make sure that you choose a
solution that can support each of them. For example, only a subset of load
testing tools can support client/server protocols necessary to simulate thick
client interaction.
3. Scripting and programmable logic: Some solutions offer you a great deal of
flexibility in terms of building logic into your load scenarios. Other solutions
limit you to a record/playback capability in which there is limited means to
make scripts intelligent. A single “smart” script may support three different scenarios using conditional logic. A scripting language also enables you to
include non-standard pauses and validation conditions. If you suspect that
you will need to develop more complex load scenarios for your application,
make sure that you have the required development expertise available. Prod-
ucts may employ widely used scripting languages like Visual Basic or Jython.
Other products require use of their own proprietary language.
4. Cost: Predictably, feature-rich, industry-leading load solutions can be very
expensive. In contrast, there are some very functional open source solutions
that are free.
5. Externalizing data from scripting steps: A major part of the variation in many load scenarios is variation in the data. In the online banking example, for instance, the account numbers and payment amounts submitted by each virtual user can be read from an external data file rather than hard-coded into the script.
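As a small illustration of the think-time idea from the first item above, the following sketch inserts a random pause between scripted steps; the bounds are illustrative, and real load testing tools normally let you configure this per step.

import java.util.Random;

public class ThinkTime {
    private static final Random RANDOM = new Random();

    // Pause for a random interval between min and max milliseconds so that
    // virtual users do not execute their steps in lockstep.
    static void pause(long minMillis, long maxMillis) throws InterruptedException {
        long span = maxMillis - minMillis;
        Thread.sleep(minMillis + (long) (RANDOM.nextDouble() * span));
    }

    public static void main(String[] args) throws InterruptedException {
        pause(3000, 8000);  // e.g., a user reading a confirmation page for 3 to 8 seconds
    }
}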
Vendor Products
Selecting the right load testing software for your system will depend on many fac-
tors, including how you prioritize features described in the previous section. In this
section, we offer some opinions on some of the more popular load testing tools available at the time of writing.
Mercury LoadRunner, now a part of HP’s IT management product suite,
enjoys broad market penetration amongst performance testing software. Load-
Runner is a rich solution that provides all of the features listed in the previous
section. For Web-based applications, LoadRunner scripts can be developed using
a record/playback approach. The record/playback process generates a proprietary
script that can be inspected and/or altered. LoadRunner also supports load testing
for non-Web-based interfaces including client/server, legacy, Citrix, Java, .NET, and all widely known ERP (enterprise resource planning) and CRM (customer relationship management) solutions like PeopleSoft, Oracle, SAP, and Siebel. From a monitoring
perspective, there is rich plug-in support for a number of platforms and vendors
including J2EE, .NET, Siebel, Oracle, and SAP. Features and flexibility come at a
cost. Mercury’s product is the most expensive to license in the industry. Further,
there is no synergy between the load testing solution and the automated functional
test suite (WinRunner and QTP) so resources cannot rely on a common skill set in
order to develop scripts for functional and non-functional scenarios.
E-Load, from Empirix, is a load testing tool, supporting each of the features
listed in the previous section. The Empirix product is focused on web-based and
call center applications, thus lacking support for the full range of systems that
are likely to exist in a large enterprise. E-Load can scale to simulate thousands of
concurrent users and uses a distributed architecture to do so. E-Load is based on
the JBoss open-source J2EE application server. The authors have found that some tuning is required out of the box to create significant loads. Also, the stability of the
loading engine seems to degrade as scenario complexity and length increases. For
tests that must run for longer than 2 hours, the authors have found that this tool
may require careful monitoring and occasionally must be restarted.
If budget concerns are overriding for your project, there are a number of open-
source contributions in the area of load testing software. JMeter, from the Apache Software Foundation, is a popular Jakarta-based Java solution that can be used to create load for different types of interfaces, among them Web applications, Perl scripts, Java objects, database queries, and FTP (file transfer protocol) servers. Because JMeter is open source, developers on your project can extend and customize JMeter as needed; for example, custom plug-ins can be written to create load for unsupported interfaces. During tests, statistics capture and graphing capabilities are highly configurable but do not have the same ease of use as other vendor-supported solutions. JMeter is written in pure Java, and thus can run on any platform with support for a JRE (Java runtime environment).
Grinder, available from SourceForge.net, is another popular open-source load
testing tool. Like each of the preceding three products, Grinder supports a distrib-
uted architecture for creating large concurrent volumes of requests. As of version
1.3, Grinder scenario execution is driven by Jython scripting. Jython is a script-
ing language based on Python that adds support for Java. Jython allows for great
flexibility in scenario scripting but requires a developer’s programming skill set.
Grinder is a good choice for developers or technical resources in a QA (quality
assurance) role who need to do discrete testing of services over standard proto-
cols such as IIOP, RMI/IIOP, RMI/JRMP, JMS, POP3, SMTP, FTP, LDAP, SOAP, XML-RPC, HTTP, HTTPS, and JDBC.
PureLoad, from Minq, is another offering in the area of load testing soft-
ware. Minq is the maker of the popular Java-based DBVisualizer database utility
familiar to many developers. PureLoad includes a record/playback feature for Web-
based applications and also supports authoring of scripts to test services exposed
over standard protocols such as NNTP, FTP, SMTP, IMAP, JDBC, LDAP, Telnet,
and DNS. Other standard features include creation and storage of test scenarios,
statistics capture, and graphical presentation of results.
A stub is similar to a simulator or a reflector, except that it is part of the run-time code for the applica-
tion itself. Typically, stubs need to be configured to override the intended
production configuration for the application. Stubs can be a very convenient
means of substituting for an external system but because of their embedded
nature, they will compete with your application for resources. An externally
hosted simulator or reflector will yield more accurate performance and capac-
ity results but will be more cumbersome to implement and maintain.
If you decide that additional testing apparatus is required, you will need to staff
a custom development activity. In most cases, the development team will have the
required skills and will be the most familiar with the interfaces of the system. The
need for test apparatus is frequently an afterthought for projects and becomes an
unplanned burden for the development team. Hopefully, this need is identified
early enough in your project that it can be accommodated in a planned and coor-
dinated way. By documenting your apparatus requirements in your non-functional
test strategy, you are communicating your needs to the development team.
You will also need to decide where any additional test apparatus will be hosted.
Generally, you have two choices. You can host the software on the system infrastruc-
ture itself or on additional infrastructure designated for this purpose. Since your test
apparatus is standing in for components that are outside your system boundary, it is
preferable that they be hosted on dedicated infrastructure. Unfortunately, dedicated
infrastructure means additional cost. If required, this infrastructure should have
been planned for in your project initiation. In many cases, there will be an opportu-
nity to co-host loading software and test apparatus on the same hardware so long as
it has been sized with sufficient capacity during project initiation.
Test Beds
Your test bed comprises two elements: the data required for your load scenarios, and the data already seeded in the software system to represent historical, previously executed transactions.
Test-Case Data
Your load scenarios will determine what test-case data is required. An important
feature of the test data is the amount of variation required. Let’s consider a securities
trading solution that coordinates settlement instructions on behalf of financial insti-
tutions. For such a system, there might be seven distinct request message types. Each request type may involve a security ID that identifies the unique security
that is involved in the request. If the security IDs are referenced against a database
and/or influence the complexity of processing, it will be important for the security
IDs used in the load scenario to be varied and representative of the real world.
In assembling your test bed you will need to work with business resources to
identify the range of security IDs that will be submitted to the system in order to do accurate testing. On the other hand, other pieces of data in the request may not
be important. Price information may be effectively “pass through” in this system.
In other words, there is no special processing for price information. Price informa-
tion may be logged to a database table and the performance characteristics are not
impacted by variations in the data.
These types of decisions can only be made in the course of reviewing test cases
with technical project participants. Omitting natural, real-world variation in your
data may be a time-saving simplification, but it also introduces marginal risk. Tech-
nical resources may not correctly anticipate the effect that variation in your test bed
will have on the system. As a result, it is always preferable to make your test bed as
similar to the expected production inputs as possible.
Test Environments
Your test environment should have been defined during the initiation and plan-
ning phase of your project. Projects that fail to allocate hardware infrastructure during this phase can experience serious delays as time is wasted waiting
to procure additional environments. Your environment will need to support the
software system itself, load testing tools, and any additional test apparatus that
you have identified.
In planning your test environment you will need to define the level of iso-
lation you will achieve, the change management procedures you will follow,
and the scale of your test environment as a proportion of the target production
environment.
Isolation
At a minimum, you should strive for isolation of the software system from all other
software components, including the load testing solution and any supporting test
apparatus. The part of the system that you are isolating is, of course, all components
that you have identified that are within your system boundary. As we will see in
the next chapter, it can be difficult to achieve repeatable test results even in well
isolated environments.
Many organizations have reluctantly conceded that the only way to reliably test
mission-critical systems is to introduce a production-scale environment that is ded-
icated to non-functional testing. Such environments are commonly referred to as
staging, pre-production, performance, and certification environments. In addition to
supporting non-functional activities, these environments are useful for rehearsing
production deployments. Attractive as this option is, if you do not have budget or
time for a production-scale, dedicated testing environment then you may consider
one of the following alternatives:
1. Reduced-scale Dedicated Test Environment: If the capital cost of a pro-
duction-scale environment is prohibitive, you may consider a smaller-scale,
but dedicated, environment to support non-functional testing. This approach
requires you to make compromises and accept some additional risk as you
must extrapolate test results to the target production environment.
2. Re-purpose production hardware: If you are building a new system for
which there is currently no production environment, you may be able to
test using the target production infrastructure itself. Of course, this strat-
egy means that you will lose your test environment once the production sys-
tem is commissioned. More specifically, this means that you will not have
a non-production environment in which you can regression-test changes or
reproduce load-related issues. For mission-critical systems, neither of these
consequences is acceptable.
3. Time-shift existing hardware: With some coordination, you can dedicate
your existing functional testing hardware to performance testing during
specific intervals. For example, non-functional testing can be scheduled on
evenings and weekends. You should recognize that this type of arrangement
is usually not an efficient use of resources, nor does it provide for much con-
tingency if any of the activities on the shared infrastructure begin to track
behind schedule.
4. Create logical instances: If neither of the previous two options is realistic,
you should at least configure your system as its own logical instance. For
example, a single database server can often support many development and
QA instances of an application. For your non-functional testing you should
strive to isolate your test system on its own instance. This will mitigate outside
influences, and is more representative of the target production environment.
5. Cross-purpose the Disaster Recovery (DR) environment. Increasingly,
software systems are a core part of every large business operation. Conse-
quently, in the event of a large-scale disaster, it is unacceptable for the business
to be completely deprived of its software systems. As a result, large organiza-
tions commonly make an investment in a geographically separate computing
facility that can host critical software systems in the event of a disaster. Such
a facility is referred to as a DR site. A DR site must have equivalent hardware
capacity to the primary facility that it supports in order for it to be effective
in the event of a disaster. Since this hardware is not utilized unless a disaster is
declared, it is common and cost-effective to cross-purpose this infrastructure
as a non-functional testing environment. If your software solution is sup-
ported by a disaster recovery site, you are strongly encouraged to consider
leveraging this site for your non-functional testing.
In designating your test environment, you need to inventory all of the hard-
ware required by the system. This includes servers, network components, and
storage devices. Many organizations provide network and storage services from a
central pool of resources. Be sure that you understand what the SLA is for these
components and whether or not it is consistent with the target production envi-
ronment. You should also be aware of whether or not unrelated activities within
your organization can exert influence on your testing with respect to these shared
resources.
Capacity
Hardware costs for some systems can be exorbitant. The prospect of duplicating
this cost for the non-functional test environment(s) can be a menacing thought for
many executives. For mission-critical systems expected to be in operation for many
years, an investment in a proper non-functional test environment is a necessary cost
of doing business.
The cost of unscheduled downtime that could have been avoided with proper
testing usually makes the hardware cost of the test environment seem justifiable.
However, there will be situations in which a scaled-down version of the production
environment is suitable for most non-functional testing. A scaled-down version of
the production environment is designed with a reduction in some or all of the fol-
lowing resources: servers, memory, CPUs, storage, and network devices. A system one-half the size of the production system may actually cost one-tenth as much; in such cases, the cost savings justify the risk. If you are going to proceed
with a reduced version of production for your non-functional testing, you should
review the following list of considerations:
1. Operability Testing: Operability testing is usually not impacted by a scaled-
down non-functional testing environment. Failover, fault-tolerance, and
boundary testing are typically unrelated to hardware capacity.
2. Performance Testing: Hardware resources like CPU and memory are read-
ily seen as commodities that can be scaled linearly; in other words, doubling the load should require twice the CPU and memory. However, there is no guarantee of this relationship. Further,
resources specific to your application may not scale linearly. If you are unable
to test peak load for your system in your test environment because of hard-
ware limitations, you are taking a considerable risk.
3. Capacity Testing: If your system is not big enough to support peak load, you
are relying on extrapolation for capacity planning and measurement. For many
systems, this is an acceptable measure, but it is not entirely without risk.
Change Management
An important question to answer during your test planning is who will have access to
the non-functional testing environment. Specifically, which individuals can deploy
the system into the environment and/or make changes to the system during testing?
Ideally, the same resources and procedures used to deploy the production system are used to deploy into your non-functional testing environment. This assures that the configuration in your test environment is identical to production. Once your system is deployed,
there will be an ongoing need to make subtle changes to the system. We have
already discussed the need to load transaction volume—perhaps artificially. Addi-
tionally, we know that we may be introducing test apparatus into the environment
that may require configuration changes to the system.
Non-functional testing, when executed by technical resources, can be intru-
sive. Technical resources executing non-functional tests need the flexibility to make
tuning and configuration changes in order for testing to succeed. For example,
operability tests often require testing resources to purposefully configure the sys-
tem “wrong” to observe the outcome. Requirements to change the system can be
met in two ways: the same resources responsible for the deployment can make all
changes to the system on a by-request basis, or the non-functional test team can
make these changes themselves.
In either case, it is imperative that a log be maintained for all changes that
are made to the environment. The non-functional test team must be confident
that when acceptance testing is executed, the configuration and state of the system
in the test environment is aligned with the intended production configuration.
Towards the end of your test cycles, changes that have been made for tuning/opti-
mization purposes should be communicated to the deployment team, who should
then re-deploy the system into your environment for final acceptance. Following
this approach ensures that only changes that are in the production package are
deployed to the test environment at the time that acceptance testing is executed.
The detailed specification for your non-functional testing environment should
be a documented part of your testing strategy. If you are making compromises
in the capacity, isolation, or change-management processes for your environment,
then these risks should be documented in your test strategy so that management is
aware of them.
Historical Data
Many systems accumulate data over time; such data is referred to as transactional data.
The accumulation of transactional data may influence the performance characteristics
of your system and, as a result, should be modeled and included in your testing.
Business volumes can be derived from the business usage model constructed
during the requirements phase of your project. Let’s revisit the example we used in
the requirements chapter. For each of the transactions in Table 6.1 we have included
the average daily volume for four key coarse inputs.
In consultation with the development team, we have learned that login and
account inquiries do not create transactional data on the system. In other words,
any number of logins and account inquiries will leave the system unchanged. As a
result, we can ignore these coarse inputs.
Bill payments and funds transfers, however, do create transactional data on the
system as they are completed. A technical resource has provided the information
shown in Table 6.2 with respect to the database.
Operationally, transactional data is preserved for up to one year in the produc-
tion system. When the system is running at steady-state, there will be one year’s
worth of business volumes in each transactional table. Combining the record
counts and business volumes, we can forecast transactional table volumes shown
in Table 6.3.
Before we begin testing, each of the tables must be loaded up to the correspond-
ing record counts. There are two ways this can be achieved: volumes can be generated by running the load scenarios themselves, or by authoring custom data-loading scripts that populate data directly into the system. It can be time-consuming
to generate table volumes by running load scenarios. Also, if development for the
system is still underway, sometimes this approach isn’t even an option until much
later in the project lifecycle. Time spent authoring scripts for generating volumes
artificially is often a good investment. It gives you the flexibility to run the scripts
on demand against multiple systems without impacting your timelines. However,
you will need to weigh this against the risk of there being defects or omissions in
the script you are using to populate data.
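One common way to author such a data-loading script is a batched JDBC insert; the sketch below is illustrative only, and the table and columns are placeholders rather than the banking example's actual schema.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class HistoricalDataLoader {
    // Insert 'rows' synthetic payment records in batches to seed the test bed.
    static void seedPayments(Connection conn, int rows) throws SQLException {
        String sql = "INSERT INTO payment_information (payment_id, amount) VALUES (?, ?)";
        conn.setAutoCommit(false);
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            for (int i = 1; i <= rows; i++) {
                ps.setLong(1, i);
                ps.setDouble(2, 100.00);      // placeholder amount
                ps.addBatch();
                if (i % 1000 == 0) {          // flush every 1,000 rows
                    ps.executeBatch();
                    conn.commit();
                }
            }
            ps.executeBatch();                // flush the remainder
            conn.commit();
        }
    }
}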
[Tables 6.1, 6.2, and 6.3 (recovered excerpts). Table 6.1 lists average daily volumes for the coarse inputs, including Login: 2,309,039. Table 6.2 lists the records created per transaction: a bill payment creates 1 Payment information record and 2 Transaction fulfillment records; a funds transfer creates 1 Transfer information record and 2 Transaction fulfillment records. Table 6.3 forecasts the resulting transactional table volumes: Payment information 529,143; Transaction fulfillment (bill payments) 1,058,286; Transfer information 210,985; Transaction fulfillment (funds transfers) 421,970.]
Summary
The choices you make during your test planning will determine the efficacy and
the ease with which your test execution is completed. A non-functional test strategy
is a critical planning deliverable that should be completed prior to any test execu-
tion. The non-functional test strategy enumerates the key factors and assumptions
in preparing your detailed test plan including the definition of system boundaries,
your performance and operability test scope, load testing software, additional test
apparatus, test environments, and test data.
It is usually impossible to test every mode of failure with every load scenario for
your system. This reality requires you to use informed judgment in the determina-
tion of your test scope. Systems that must interact with software systems outside
of your control may require the introduction of additional software apparatus that
mimics the interaction of external systems. The test environment in which you
execute your tests may be a full-scale replica of the target production environment
or a logically separate instance that is defined on the same infrastructure as your
functional test environment.
The choices you make for your test environment will reflect a cost-benefit analy-
sis based on your risk tolerance. Finally, the test data that is used for your test
execution requires careful forethought. Non-functional testing is only as good as the resemblance that the historical and test-case data used in the execution bear to production. In Chapter 7 we
will delve deeply into the next topic in the software lifecycle: test case preparation
and execution.
Test Preparation
and Execution
During the test planning phase of your project, you would have defined the high-
level scope of the test execution. You would also have identified specific data
requirements, characteristics of your test bed, and any additional test apparatus like
injectors, reflectors, or simulators that would be needed to efficiently execute tests.
Seeking out data, authoring data load scripts, and developing your test apparatus
are all activities that will be completed during the test preparation and execution
phase of the project. It is during this phase that you will bring each of these con-
cepts together and actually commence the testing initiative.
In this chapter, we will review test preparation and execution activities, includ-
ing common challenges that are faced and general considerations for reporting test
results to project stakeholders.
Preparation Activities
The test cases that are defined in your project scope during the planning phase
include most of the details needed for execution, such as required data, execution steps, and success criteria. Unfortunately, there is usually considerable work
to be done before actual testing activities can begin once the development team has
declared that the application is available for testing. Prior to commencing testing
activities you will need to attend to the following details:
1. Script Development: If you are using load testing software, you will need
to develop scripts that implement the load scenarios contained in your test
plan.
2. Validating the Test Environment: When the system is deployed into the
test environment, you will need to validate that your performance scripts
execute as expected.
3. Seeding the Test Bed: If you are using custom data loading scripts, these will
need to be executed against the test environment.
4. Establishing Mixed Load: Mixed load is a combination of test cases that best characterizes application usage. Mixed load is the default load profile that you should use whenever you execute operability, failover, sustainability, and capacity testing. The mixed load should include test cases that generate load in proportion to the actual business usage (a minimal sketch of deriving such a mix follows this list). Because the mixed load provides broad functional coverage for the application, it is useful for verifying new deployments in your testing environment.
5. Tuning the Load: When you begin to subject the application to load, you
will need to make adjustments to parameters in the load testing software.
This is achieved through trial and error.
Script Development
The load testing software that you plan to use must support the development of test
scripts that create the virtual load needed for testing. The amount of development
required depends on the software package and the number and complexity of your
test scenarios. You should try to retain the following characteristics for your test
scripts as much as possible:
1. Resilient to changes in the user interface: Software systems will change
over time, especially the user interface. Business users are likely to refine the
user experience as a system is integrated into business operations, and they
begin to see the impact of their original ideas. In order to reduce rework in
your load testing scripts, you should try to avoid validation and control logic
that is heavily dependent on details in the user interface.
2. Self-sustaining: A load testing scenario may incorporate business logic that
expects the system to be in a certain state or to have transactional data pre-
populated. Where possible, it is always preferable for scripts to be self-sustain-
ing. This means that the scripts themselves create all data and preconditions
that they require. Consider the example of a load testing script for a sales-
force automation application. A load scenario may require a user to login and
view prospect information created by a different user. One way to implement
this in a self-sustaining way is to build a single script that logs in as one user and
creates the prospect information. This is followed by a second step in which a
second user logs in to view the prospect information. The drawback of pair-
ing activities like this is that they may make it more difficult to achieve your
target transaction rate for coarse inputs in the right proportions. Later in this
chapter we will look at sustainability testing where a load is applied for a long
period of time. This activity will be complex to execute if a test operator must
intervene and reset conditions following each test iteration.
3. Leaves system in a state where tests can be repeated: Wherever reasonably
possible, use your load scripts to leave the system in a state that does not inter-
fere with testing should you choose to resume or repeat testing at a future date.
This may seem obvious, but in some cases it can be difficult to achieve. Many
scripts that emulate human usage are required to login to the application as a
prerequisite step for all tests. However, many applications are designed such
that a user can only have one concurrent authenticated session. Trivially, this
means that each one of our scripts needs to logout at completion in order to
ensure that they can be executed again without error. However, we must also
consider the case in which our scripts stop executing abruptly before they
have the opportunity to logout. There are lots of failures that can bring this
scenario about, including the following: the load testing software could fail
midway through the test, a critical component in the application may start
failing, or a functional defect could impact the ability of a subset of scripts to
complete. In each of these cases the system will be left in a state that blocks us
from repeating our testing. In such situations we may have to rely on restart-
ing the system to reset the system state. If this is too time-consuming—or
worse, doesn’t work—we may need to build a custom solution in order to
intervene and artificially reset the system state.
4. Achieve target coarse inputs with an optimal number of scripts: The busi-
ness usage defined in your requirements describes the types and transaction
rates for coarse inputs. In implementing load testing scripts, you should try
to minimize the number and complexity of scripts while achieving the target
transaction rate for your coarse inputs. The remainder of this section elabo-
rates on this topic.
If we revisit the human inputs from our online banking example in Chapter
3, we see the following targets for coarse inputs. Our challenge now is to translate
these coarse inputs (as shown in Table 7.1) into detailed load testing scripts that
create the inputs in the right proportions.
Because our system requires authentication, each of our scripts requires login as
the first execution step. If we take a simple approach and write a separate script for
each coarse input, we will run into a problem.
In order to achieve a target transaction rate of 2.17 TPS for the account inquiry
operations, we will indirectly achieve a transaction rate of 2.17 TPS for login also.
That is, we will overstate the transaction rate for login by a considerable margin.
To make matters worse, the bill payment and funds transfer operations will also
introduce logins at their transaction rates. Our login transaction rate will end up
being the sum of all transaction rates (i.e., 3.82 TPS). This is nearly four times the
required transaction rate. If we were to follow this approach, and login performance
does not meet our requirements, how will we know if we would have the same prob-
lem if the transaction rate was 1.06 TPS? Or worse, perhaps the login operation
is so taxing that it is compromising performance of the other business operations
also? We are going to have to plan our script development more creatively.
We note that the ratio of account inquiries to logins is approximately 2-to-1. In
other words, for each login, there must be at least two account inquiries if we are
to achieve the target transaction rate for account inquiry. We can also see that for
each login there is a little less than one bill payment. For every four logins, there
appear to be about three transfers. We can accommodate these proportions if we
implement our test scripts as shown in Table 7.2.
This is a fairly simple example, and we were able to arrive at reasonable pro-
portions through a trial-and-error strategy. The approach we have taken in this
example involves creating two scripts that observe the 2-to-1 relationship between
account inquiries and logins.
We consider the login transaction rate to anchor the transaction rate for the
other operations in the script. For script 1, if our login rate is 0.45 TPS, this
means that, necessarily, each discrete step in the test script must also
be achieving this TPS. If we manually distribute the load between each of the two
scripts, we can make adjustments until we achieve something very close to the tar-
get transaction rates in the business usage. When we go to tune our load, we will
have the opportunity to make additional adjustments to our load scenario. For now
this seems like a reasonable approach for us to use in developing our test scripts for
our human coarse inputs.
Table 7.2 Test scripts and per-step transaction rates

Bill Payment Script (0.45 TPS)
    Login                      0.45 TPS
    Perform account inquiry    0.45 TPS
    Perform bill payment       0.45 TPS
    Perform account inquiry    0.45 TPS

Funds Transfer Script (0.68 TPS)
    Login                      0.68 TPS
    Perform account inquiry    0.68 TPS
    Perform funds transfer     0.68 TPS
    Perform account inquiry    0.68 TPS
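As a minimal sketch of the arithmetic behind this composition, the Java fragment below derives the effective coarse-input rates from the per-script rates. The class and variable names are ours, and the figures simply mirror the example above.

public class CoarseInputRates {
    public static void main(String[] args) {
        double billPaymentScriptTps = 0.45;   // per-step rate of the Bill Payment Script (equal to its login rate)
        double fundsTransferScriptTps = 0.68; // per-step rate of the Funds Transfer Script

        // Each script performs two account inquiries per iteration, so the
        // account inquiry rate is twice the sum of the per-script rates.
        double accountInquiryTps = 2 * billPaymentScriptTps + 2 * fundsTransferScriptTps;
        double loginTps = billPaymentScriptTps + fundsTransferScriptTps;

        System.out.printf("Account inquiry: %.2f TPS%n", accountInquiryTps);    // 2.26
        System.out.printf("Login:           %.2f TPS%n", loginTps);             // 1.13
        System.out.printf("Bill payment:    %.2f TPS%n", billPaymentScriptTps);
        System.out.printf("Funds transfer:  %.2f TPS%n", fundsTransferScriptTps);
    }
}

Adjusting either script's rate moves all of its steps together, which is exactly the property we rely on when tuning the load later in this chapter.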
Standardized mixed load scenarios can be used for many purposes, as described
below:
1. Performance Certification: Running the mixed load at peak transaction rates is
usually the best condition under which to conduct performance certification.
2. Performance Regression: A mixed load is an efficient way to conduct per-
formance regression testing. It is a single test in which you run the load and
compare results against the baseline from the most recent previous test.
3. Operability Testing: A mixed load is the most useful test scenario with
which to execute operability tests. The mixed load should include broad
coverage for different application functions. By using the same mixed load
for each of your operability tests, you standardize the load and simplify the
activities of the test operators.
4. Sustainability Testing: The sustainability test requires a sustained, representa-
tive load. When the mixed load is applied for a long interval, this requirement
is met.
5. Capacity Testing: When the mixed load is run at the peak expected volumes,
the environment becomes suitable for taking capacity measurements. Later in
this book, we will see how this translates to capacity planning activities.
Let’s revisit the online banking example from Chapter 3 again and look at
assembling a suitable mixed load. For human usage, the metrics shown in Table 7.3
were defined for coarse inputs.
Earlier in this chapter we revealed how we could achieve these coarse inputs
through the introduction of two load testing scripts executed at specific transaction
rates. In terms of machine inputs, there were three separate interfaces (as shown in
Table 7.4).
Table 7.3 Transactions

Transaction                                 Classification    Target Transaction Rate
Account Inquiry: Less than five accounts    Light             1.87 TPS
Account Inquiry: Five accounts or more      Medium            0.30 TPS
In this example, only one of the machine inputs runs concurrently with the
peak online usage. Human inputs dominate the load profile for this application, so
we are most interested in the behavior of the system during the business day—that
is, from 7:00 am to 10:00 pm.
Theoretically, busy evening online volumes could coincide with execution of
the bill payment fulfillment job. Since complexity is still manageable with the addi-
tion of the bill payment fulfillment job into the mixed load, we will include it. The
business reporting job, however, runs off-hours. We will indeed test it in our scope,
but we will not include it in the mixed load that we use as the basis for performance
regression, sustainability testing, and failover.
In light of the decisions we have made, our mixed load scenario looks like that
represented in Table 7.5.
Note that all of the human inputs are modeled using two load testing scripts
running continuously at the specified transaction rate. The customer marketing
messages are a continuous machine input. The transaction rate for marketing mes-
sages is tied to the login rate, as we saw in Chapter 3. The bill payment fulfillment
is scheduled to run every six hours, reflecting the fact that this scenario represents
a compressed business day; the scenario runs constantly at peak load.
In this example the mixed load scenario that we have defined is simpler than
you would expect for a critical, multifunction system like a national online bank-
ing system. For enterprise systems, the mixed load may include anywhere from a
few dozen to a few hundred scripts.
Tuning the Load
Once your scripts and mixed load are in place, you will need to tune the load by
adjusting a handful of parameters in your load testing software; a brief configuration
sketch follows the list below. The parameters you will work with most often are:
1. Number of virtual users: “Virtual users” is an industry term for the number
of concurrent threads executing load against your application. For load sce-
narios that emulate human usage, the term “users” is appropriate because the
load testing software is emulating actual human users.
2. Ramp-up: Ramp-up refers to the rate at which load is increased over the
duration of the test. You can begin the test with the maximum number of
users, or you can step up the number of virtual users over an interval at the
beginning of the test.
3. Think time: “Think time” is another industry term that refers to the interval
between execution steps for each script. Think time is a pause in the script
execution that emulates the time a human user spends processing the output
of the previous step.
4. Test duration: For your application, you will also need to designate the
length of the load test. For example, is it enough to demonstrate acceptable
performance for 30 minutes, 60 minutes, 120 minutes, or longer?
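The sketch referred to above collects these parameters into a simple configuration object. The class and field names are illustrative rather than taken from any particular load testing tool; the values echo the online banking example.

public class LoadScenarioConfig {
    int virtualUsers;        // concurrent threads generating load
    int rampUpUsersPerStep;  // users added per ramp-up step
    int rampUpStepSeconds;   // interval between ramp-up steps
    int thinkTimeSeconds;    // pause between script steps
    int durationMinutes;     // total test duration

    // Illustrative values for the online banking mixed load.
    public static LoadScenarioConfig onlineBankingMixedLoad() {
        LoadScenarioConfig cfg = new LoadScenarioConfig();
        cfg.virtualUsers = 5_100;
        cfg.rampUpUsersPerStep = 10;
        cfg.rampUpStepSeconds = 8;
        cfg.thinkTimeSeconds = 4;
        cfg.durationMinutes = 60;
        return cfg;
    }
}

Whatever tool you use, recording the chosen values in this explicit form makes it much easier to repeat a test with exactly the same load later.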
In Chapter 3 we saw that for human inputs, the number of concurrent users
is an important characteristic of the overall business usage. However, the fact that
you have 500 concurrent users logged on does not help you to assign virtual users
across load scenarios in your test suite. For example, how many virtual users should
be assigned to bill payment versus funds transfer? To get to this level of detail, you
will need to make an educated guess and then refine your parameters as you observe
the system.
Let’s continue to work with the online banking mixed load example from ear-
lier in this chapter. When we go to execute these scripts against the test environ-
ment, we may find that the duration for the bill payment script is 40 seconds while
the funds transfer script executes in only 15 seconds. This is shown in Table 7.6.
The difference in execution time is explainable, as the bill payment script includes
a number of steps that are not actually measured for performance, i.e. the script has
many more steps than the funds transfer script. We know from Chapter 3 that the
number of concurrent users expected to be on the system during the peak interval
is 5,100. If we assign 2,550 virtual users to each of our two load testing scripts, we
are unlikely to achieve our target transaction rate. In order to account for the
longer execution time of the bill payment script, we will need to assign it a
higher number of users. We make an educated estimate for virtual user distribution
to establish a starting point (as shown in Table 7.7).
When we allocate virtual users in the ratio above, we find that we are still not
meeting our target transaction rate for the bill payment script. We are also running
at significantly higher loads than required for the funds transfer script. We can try
to compensate again by shifting an additional 200 virtual users to the bill payment
script (as shown in Table 7.8).
By shifting an additional 200 users to the bill payment script, we have achieved
our target TPS for bill payments. Unfortunately, we are still executing far too many
funds transfers based on our requirements.
At this point we don’t want to subtract users from either test script because our
target for the total number of users is still 5,100. Ideally, we would like to slow
down the funds transfer script. Fortunately, there is a mechanism for us to do so in
our load testing software.
Think time is a common industry term for a pause between executions of steps
in a load testing script. Think time emulates the time that a user spends processing
the results of their previous action.
(Tables 7.7 and 7.8: the Bill Payment Script runs continuously against a 0.45 TPS
target; it achieves 0.30 TPS with 3,300 virtual users and 0.44 TPS with 3,500
virtual users.)
The load testing software we are using injects a default think time of 4 seconds
between steps in the execution. We can slow down
the entire script execution by increasing the think time. If we adjust the think time
for the funds transfer script to 6 seconds, our results will show a decrease in the
transaction rate for this script (as shown in Table 7.9).
In this example, adding an additional 2 seconds between execution steps has
dramatically decreased the transaction rate for this script. As a secondary effect, the
transaction rate for the bill payment script has actually increased. By slowing down
the funds transfer script, we have created some slack in the system that has allowed the
bill payment script to execute slightly faster.
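As a rough, back-of-the-envelope sketch of why think time throttles a script, the Java fragment below estimates a script's transaction rate from its virtual users, step count, average step response time, and think time. The numbers are hypothetical and are not meant to reproduce the figures in the tables above.

public class ThinkTimeEstimate {
    static double scriptTps(int virtualUsers, int steps,
                            double avgStepResponseSeconds, double thinkTimeSeconds) {
        // One iteration consists of every step's response time plus a think-time
        // pause per step; the script rate is users divided by iteration length.
        double iterationSeconds = steps * (avgStepResponseSeconds + thinkTimeSeconds);
        return virtualUsers / iterationSeconds;
    }

    public static void main(String[] args) {
        int users = 120;          // virtual users assigned to the script (illustrative)
        int steps = 10;           // number of steps in the script (illustrative)
        double avgResponse = 2.0; // seconds per step (illustrative)

        System.out.printf("4s think time: %.2f TPS%n", scriptTps(users, steps, avgResponse, 4.0)); // 2.00
        System.out.printf("6s think time: %.2f TPS%n", scriptTps(users, steps, avgResponse, 6.0)); // 1.50
    }
}

The point of the estimate is the shape of the relationship: adding two seconds of think time lengthens every iteration, so the same population of users completes fewer transactions per second.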
Many load testing packages also support a parameter called delay between itera-
tions, which is an interval of time that the software package will wait before re-
executing the load script. This parameter can also be used to adjust the load of a
specific script. You should be careful in using the parameter, however. By introduc-
ing a delay between iterations, you will create periods of time in which there are fewer
than 5,100 concurrent users on the system. Alternatively, if you increase the think
time, you can be assured that there are still 5,100 active users on the application at
any point in time.
Before we move on we also need to discuss the concept of ramp-up. In the
loading scenario we discussed a moment ago, if all 5,100 virtual users logged in
simultaneously and proceeded to execute, how do you suppose the system would
react? That many instantaneous logins would make an interesting operability test for any system, but
it is not representative of the actual production environment.
A more likely scenario is that user activities are randomized. At any point in
time, there are blocks of users executing different functions. Some users are finish-
ing a bill payment at the same instant that another user is just logging in. In order
to create this distribution, load testing software allows you to configure the ramp-
up parameters for your load. The most common ramp-up parameters are the number
of virtual users to add and the interval at which to add them. If our target load
is 1,600 virtual users, one ramp-up scenario would be as shown in Table 7.10.
In this configuration, an additional ten users will be added to the load every 8
seconds. As a result, it will take (1,600/10) × 8 = 1,280 seconds = ~21 minutes to
reach the target number of virtual users.
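The same arithmetic can be expressed as a small sketch; the values mirror the ramp-up scenario just described.

public class RampUpTime {
    public static void main(String[] args) {
        int targetUsers = 1_600;
        int usersPerStep = 10;   // users added per step
        int stepSeconds = 8;     // seconds between steps

        int rampUpSeconds = (targetUsers / usersPerStep) * stepSeconds;
        System.out.println(rampUpSeconds + " seconds, or about "
                + (rampUpSeconds / 60) + " minutes"); // 1,280 seconds, about 21 minutes
    }
}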
Performance Testing
Business users tend to wait for performance test results with the most anticipa-
tion. Fortunately, your test execution strategy should prioritize performance testing
towards the front of your execution schedule. Many of the operability tests that
you will need to conduct can only be executed under load. If your application has
not been certified for performance, you are very likely to have difficulty achieving
the required load for your operability testing. In this section, we will look various
aspects of performance testing including priming effects, stress testing, regression
and reporting of results.
(Figures: transaction load (TPS) plotted against time, showing a ramp-up, a sustained peak transaction rate, and a ramp-down.)
Priming Effects
Depending on the nature of the system you are testing, priming effects can exert
considerable influence on your performance test results. Priming effects manifest
themselves through degraded performance when a system is first started. There are
a number of causes for priming effects, among them caches, pools, and other
resources that are only populated once the system has begun to process load. Unless
your system will be restarted frequently, your performance testing should be based
on a "warm" system. A warm system is one that has already been subjected to load
in order to eliminate priming effects.
Performance Acceptance
Performance acceptance is the process by which a new application is tested to
validate its performance and certify it against its non-functional requirements.
Assuming that you have completed all of the neces-
sary planning and preparation activities, performance acceptance is a matter of
applying load and reporting response times.
A typical performance report from a load testing application for a 60-minute
performance test will look like the one shown in Table 7.11.
In assessing the value of the performance results, you will need to determine
if the transaction rate is accurate and also if the error rate is acceptable. The trans-
action rate is easy to calculate in this example for both the bill payment and the
funds transfer scripts. We use the number of successful iterations in each of our
calculations.
The transaction rate for the bill payment script is calculated as

    Transaction Rate = Transactions / Interval = 1,728 / 3,600 s = 0.48 TPS

The transaction rate for the funds transfer script is calculated as

    Transaction Rate = Transactions / Interval = 2,351 / 3,600 s = 0.65 TPS
Both of these transaction rates are very close to our targets of 0.48 TPS and
0.61 TPS, respectively.
Our next task is to evaluate the error rate for this test. If the error rate is too
high, it is unlikely that we will achieve the target transaction rate. If the applica-
tion is exhibiting a high error rate, this can also have an unknown effect on your
performance results. Even if you are hitting your target transaction rate, if the
error rate is higher than 5%, it is recommended that you resolve the errors before
reporting performance results. For some applications, you may decide on a more- or
less-forgiving error rate.
We calculate the error rates for this performance test as follows. The error rate for
the bill payment script is calculated as

    Error Rate = Errors / Total Transactions × 100 = 10 / 1,728 × 100 = 0.58%

The error rate for the funds transfer script is calculated as

    Error Rate = Errors / Total Transactions × 100 = 8 / 2,351 × 100 = 0.34%
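A minimal Java sketch of the two calculations above, using the iteration and error counts reported for the 60-minute test; the method names are ours.

public class ResultMetrics {
    static double transactionRate(int iterations, int intervalSeconds) {
        return (double) iterations / intervalSeconds;
    }

    static double errorRatePercent(int errors, int totalIterations) {
        return 100.0 * errors / totalIterations;
    }

    public static void main(String[] args) {
        System.out.printf("Bill payment:   %.2f TPS, %.2f%% errors%n",
                transactionRate(1_728, 3_600), errorRatePercent(10, 1_728));
        System.out.printf("Funds transfer: %.2f TPS, %.2f%% errors%n",
                transactionRate(2_351, 3_600), errorRatePercent(8, 2_351));
    }
}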
The error rate is very low for both of our scripts. Fortunately, this means that we
can report these test results against our non-functional requirements. If our error
rate had been much higher, further investigation would have been required. Errors
in your script execution can come from a variety of sources. The load testing soft-
ware itself may encounter an error for a specific thread of execution.
A subset of the data in your test bed may be bad, i.e., it does not meet valida-
tion criteria for the application. It is also possible that the application begins to
encounter errors under load. The most common such error is a timeout, in which
no response is received within the configured timeout for the load testing software.
Timeouts are common as a result of priming effects, i.e. right after a system has
been started. If the error rate is consistently high for your application, some recom-
mended actions are:
1. Repeat the test at lower load: If the error rate goes away at a more moderate
load, then your problem is likely load related. If the error rate is being caused
by a high number of timeouts, you should see poor performance response
times in your results. If response time is good for passing iterations, then you
are probably not looking at a timeout scenario.
2. Run scripts individually: You may also find it useful to run each script
under load individually. This will rule out interplay between different busi-
ness operations as a source of problems.
3. Look for application errors in the log: If the software system has followed
the logging best practices described earlier in this book, there should be
descriptive information in the application logs.
4. Ensure you have sufficient capacity: You should look at the load profile on
the hardware. If you are maximizing resources like the CPU (central processing
unit) or memory, it should be no surprise that the error rate is high.
Before we move on, we will comment briefly on standard deviation. Many load
testing packages report standard deviations in your performance results. The stan-
dard deviation is a statistical description of how dispersed your data is between
the minimum and maximum values. A very low standard deviation means that
your data is clustered around the arithmetic average for the data set.
With respect to performance results, the smaller the standard deviation, the
more reliable the average is as a projection of system response time. Assuming that
your data is normally distributed about the mean, another interpretation of the
standard deviation is that approximately two-thirds (68%) of your data falls within
one standard deviation of the mean.
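As an illustration of these statistics, the sketch below summarizes a set of made-up response-time samples the way a load testing package might, computing the average, the standard deviation, and a simple 90th percentile.

import java.util.Arrays;

public class ResponseTimeSummary {
    public static void main(String[] args) {
        double[] seconds = {1.8, 2.1, 2.4, 2.2, 1.9, 2.0, 3.5, 2.3, 2.1, 2.2};

        double mean = Arrays.stream(seconds).average().orElse(0);
        double variance = Arrays.stream(seconds)
                .map(s -> (s - mean) * (s - mean)).average().orElse(0);
        double stdDev = Math.sqrt(variance);

        double[] sorted = seconds.clone();
        Arrays.sort(sorted);
        // Simple 90th percentile: the value below which 90% of samples fall.
        double p90 = sorted[(int) Math.ceil(0.9 * sorted.length) - 1];

        System.out.printf("mean=%.2fs stddev=%.2fs p90=%.2fs%n", mean, stdDev, p90);
    }
}

A single outlier (the 3.5-second sample) barely moves the average but widens the standard deviation, which is exactly why the deviation is worth reporting alongside the mean.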
Performance Regression
If you have preserved the integrity of your testing, actual production performance
should show consistency with the baseline that you recorded prior to introducing
the system into production.
A performance regression report as shown in Table 7.15–7.17 will compare per-
formance results against the original requirements and the most recent baseline for
the system. In a regression test, we assume the same load as was used in previous
tests.
By including the previous baseline, you get a view of how you are affecting per-
formance as a result of the change under consideration. The new results may still
meet the requirements, but users will not react favorably if performance degrades
by, say, a full second, across all the business operations.
Performance regression report (Tables 7.15–7.17):

Transaction rates
Transaction                                 Requirement    Result
Account Inquiry: Less than five accounts    1.87 TPS       (0.48 × 2) + (0.65 × 2) = 2.26 TPS
Account Inquiry: Five accounts or more      0.30 TPS       2.26 TPS

Response times (seconds)
Scenario                 Iterations    Errors    Average    90% Percentile    Average (baseline)    90% Percentile (baseline)    Requirement    Test Result
Bill Payment Script      1,728         10        40.007     47.965            40.007                43.965                       n/a            n/a
Funds Transfer Script    2,351         8         22.492     24.273            18.492                19.273                       n/a            n/a
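The comparison can be sketched as a simple check against both the requirement and the previous baseline. The requirement values and the one-second degradation threshold below are illustrative assumptions, not figures from the report above.

public class RegressionCheck {
    static String assess(double newAvgSeconds, double baselineAvgSeconds, double requirementSeconds) {
        if (newAvgSeconds > requirementSeconds) {
            return "FAIL: requirement not met";
        }
        // Meeting the requirement is not enough; users will notice a large
        // degradation relative to the previous baseline.
        double degradation = newAvgSeconds - baselineAvgSeconds;
        return degradation > 1.0 ? "PASS, but degraded versus baseline" : "PASS";
    }

    public static void main(String[] args) {
        System.out.println("Bill payment:   " + assess(40.0, 40.0, 45.0)); // unchanged
        System.out.println("Funds transfer: " + assess(22.5, 18.5, 25.0)); // degraded by 4 s
    }
}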
Stress Testing
Formally, your project has met its commitments if performance testing shows that
response times are satisfied under peak transaction rates. However, it is informa-
tive to tell the business what transaction rate causes the system to miss its response
time targets. In other words, how much more load can the business apply and still
expect satisfactory performance? For some applications, this is useful information
for business planning purposes.
Executives with technology portfolios will sometimes ask for this test to improve
their sense of comfort with new applications. This test can also indicate the margin
of error you are working with in the production environment with respect to busi-
ness usage. If your business requirements for usage are way off the mark, the stress
test informs you how much contingency you have in the business usage.
The performance profile for most systems looks like that shown in Figure 7.3.
As load increases, response time gradually increases until you hit a knee in the
curve where response time increases dramatically. At this point it is futile to apply
additional load. The stress test establishes at what load system response time will no
longer meet performance requirements.
In Figure 7.4, L1 indicates the level of load at which performance is certified. For
L1, response time is well below the stated performance requirements of the applica-
tion. L2 indicates the level of load at which the system response time is equivalent
to the performance requirements. This is the breaking point for the system. In this
example, we would state that the system is certified for L1 but it is rated up to L2.
There are no guarantees of system behavior beyond L2.
(Figure 7.3: response time plotted against load; response time rises gradually until a knee in the curve, after which it increases dramatically.)
(Figure 7.4: response time plotted against load, marking the certified load L1 with response time R1 and the rated load L2 with response time R2.)
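A stress test can be sketched as a loop that steps up the load until the response-time requirement is missed; the point at which it is missed approximates L2. In the sketch below, applyLoadAndMeasure is a hypothetical stand-in for your load testing tool, and its synthetic curve simply imitates a knee.

public class StressTest {
    // Hypothetical hook: runs the mixed load at the given rate and returns the
    // observed 90th-percentile response time in seconds. The formula below is a
    // stand-in that degrades sharply past a knee around 8 TPS.
    static double applyLoadAndMeasure(double transactionsPerSecond) {
        return 2.0 + Math.pow(transactionsPerSecond / 8.0, 6);
    }

    public static void main(String[] args) {
        double requirementSeconds = 5.0;
        for (double tps = 1.0; tps <= 20.0; tps += 1.0) {
            double responseTime = applyLoadAndMeasure(tps);
            System.out.printf("%.0f TPS -> %.1f s%n", tps, responseTime);
            if (responseTime > requirementSeconds) {
                System.out.printf("Rated load (L2) is roughly %.0f TPS%n", tps);
                break;
            }
        }
    }
}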
Operability Testing
Operability testing is a broad category of testing that encompasses everything in
the non-functional domain that is not performance testing. In this section, we will
decompose operability tests into more specific categories and discuss approaches for
testing each one of them.
Boundary Condition Testing
In accepting the application into the test environment, you should scrutinize
each of the system interfaces and devise additional test conditions as needed.
Boundary condition test cases are not difficult to document and generally do
not require a load testing solution. These tests can often be introduced into the
test schedule during downtime, i.e., time during which you are waiting for other
activities like deployments, script development, etc. Table 7.18 is an example of two
documented boundary condition test cases.
Failover Testing
High availability is achieved using high-quality infrastructure and software compo-
nents combined with redundancy. When there is redundancy in your environment,
failover testing confirms that your system will take advantage of the redundancy
to maintain availability. In the course of executing failover tests, you will need to
address the following topics:
1. What is the mode of failure for the failover test? There are many modes of
failure for most software components. A process can stop responding or the
network cable can be unplugged from the server itself. You will need to decide
on the mode of failure for your testing. In some cases, you may elect to test
multiple modes of failure.
2. Which software components are being tested for failover? This question
should be answered in the solution architecture, i.e., which components were
intended as redundant.
3. What load is suitable for failover testing? Ideally, you want to identify a
broad and representative mix of business functionality that can be executed
during a single test. The mixed load that we established during preparation
activities is usually appropriate (or at least a good starting point).
4. What are the performance requirements during failover? Assuming the
failover is successful, is there sufficient capacity in the infrastructure to accom-
modate peak load on the surviving components? Is the system required to
meet its performance requirements in the event of a failover?
5. Does the system require fail-back capability? Will the failed component
be resumed automatically or manually? In either case, is the system required
to fail back under load? Some systems do not support fail-back, meaning that
the system must be brought offline in order to restore the original
service level.
6. What functional expectations are there for in-flight processing? What
error rate is tolerable during failover (if any)? Most systems should expect
some degree of exceptions during a failover scenario.
Let’s look at an example test case definition that addresses each of these topics.
In this example, we will consider a failover scenario for a clustered Web Services
interface on the IBM WebSphere application server platform. The Web Service
supports address lookup and validation. This is an enterprise service that supports a
number of different mission critical applications. The service architecture is shown
in Figure 7.5.
Incoming hypertext transfer protocol (HTTP) traffic is addressed to a VIP (vir-
tual IP) address on a Content Switching Service (CSS). The CSS
load balances requests across the WebSphere cluster as shown in the diagram. In
evaluating this architecture, we have two tiers of redundancy. We have redundant
AIX servers managed by Veritas.
At the level of the application server, we have four clustered WebSphere pro-
cesses that provide the enterprise Web Service. In this example, the address lookup
service is stateless. For purposes of this illustration, the Web Service tier supports
only a single request type. Let’s look at how we would define failover test cases for
each level of redundancy (as shown in Table 7.19 and 7.20).
(Figure 7.5: incoming requests are addressed to VIP 10.403.54.783 on the content switch and load balanced across a Veritas cluster of servers hosting the clustered WebSphere server processes.)
One particularly harsh mode of failure is to route requests to an IP (Internet
protocol) address that is not on the network. This would subject all requests
to the TCP (transmission control protocol) timeout, which can be as high as 15
minutes.
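A failover test is usually observed as well as driven: while the mixed load runs and the failure is induced, a simple probe can record the error rate for in-flight requests during the failover window. The sketch below assumes a hypothetical health URL for the address lookup service; timeouts and connection failures are counted as errors.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class FailoverObserver {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(
                URI.create("http://address-lookup.example.com/health")).build();

        int attempts = 0, errors = 0;
        long end = System.currentTimeMillis() + 5 * 60 * 1000; // observe for five minutes
        while (System.currentTimeMillis() < end) {
            attempts++;
            try {
                HttpResponse<String> response =
                        client.send(request, HttpResponse.BodyHandlers.ofString());
                if (response.statusCode() != 200) errors++;
            } catch (Exception e) {
                errors++; // timeouts and connection failures count as errors
            }
            Thread.sleep(1_000);
        }
        System.out.printf("Error rate during failover window: %.2f%%%n",
                100.0 * errors / attempts);
    }
}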
Sustainability Testing
The aim of sustainability testing is to prove that there is no performance degrada-
tion or risk to availability over long term usage of the system. It is not uncommon
for an application to slowly bleed resources until it abruptly fails. Sustainability
testing is also referred to as soak testing in some circles.
The most difficult challenge in executing a sustainability test is the sheer length
of time it can take to complete. As a result, it is important to select a reasonable
duration that maximizes the utility of the test while minimizing your efforts.
As a starting point you should ensure that your system is operable for at least as
long as your longest operations window. For example, if your application must pro-
vide service for an 18-hour business window 5 days a week, your sustainability test
should, at a minimum, run for 18 hours. Should your application fail in the 19th
hour, this scheme means that you would need to restart the application 5 times a
week. This is hardly a characteristic of a highly available, operable system.
A better suggestion would be to run your sustainability test for 5 consecutive
days, or 90 hours. If your system fails in the 91st hour, your operations team has a
much longer maintenance window in which to restart the application, not to men-
tion that they are doing so less frequently. Of course, if we could run the applica-
tion for much longer—say, 4 weeks—this would improve our confidence level in
the application even further.
Unfortunately, what we are considering is a time-consuming endeavor. As we
have discussed previously, well designed tests are tests that can be run repeatedly
and conveniently. A 90-hour test will take almost 4 days to run, assuming we have
the resources to operate the test on a 24-hour basis. If we are running the test on a
10-hour workday, the test will still take 9 working days to complete. To make mat-
ters worse, if the test fails on the eighth day, perhaps because of a software failure,
we must restart the test.
Few organizations have the luxury of weeks of resource and environment avail-
ability in which to complete this type of testing. We can approach this difficulty
by changing the criteria for the test. Instead of planning our test based on elapsed
time, we can plan our test based on elapsed business volumes. For most software
systems, an idle system is not a very interesting specimen. In the 90-hour test we
have been considering, the system is idle or at least not very busy a large fraction
of the time.
In fact, the system may do 25% of its processing in a 1-hour window. Let’s con-
sider a content management application for a pharmaceuticals company. Employees
of the company use the content management system to look up documentation on
drugs that the company manufactures. A large number of employees actually update
and create new documentation in the system also.
The business usage for this system is normally distributed around two peaks
at 11:30 am and 3:30 pm. It appears that employees strive to finish documentation
tasks prior to lunch and again at the end of the day. The business usage for the sys-
tem has been documented as shown in Table 7.22.
In looking at the usage, it is clear that 60% of the transaction volumes are
expected within 2 two-hour windows. When we go to calculate the transaction
rate for these intervals, these will be our peak periods of usage. Next we look at the
operations window for the system and see that it is fairly generous. The system is
available from 7:00 am until 6:00 pm, seven days a week (as shown in Table 7.23).
Our objective is to prove that the system is sustainable for a four-week period.
Instead of elapsed time, let’s calculate how long it would take to drive our weeks of
business volumes based on the peak transaction rate for the system. We do this in
the calculation shown in Table 7.24.
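The volume-based calculation can be sketched as follows. The weekly volume and peak transaction rate below are hypothetical values chosen only to reproduce the shape of the result discussed next (four weeks of volume in a little less than four days); the real figures would come from Tables 7.22 and 7.24.

public class SustainabilityDuration {
    public static void main(String[] args) {
        long weeklyTransactions = 400_000; // hypothetical weekly business volume
        double peakTps = 5.0;              // hypothetical peak transaction rate

        double seconds = (4 * weeklyTransactions) / peakTps; // four weeks of volume
        double days = seconds / (24 * 60 * 60);
        System.out.printf("Four weeks of volume at peak rate: %.1f days of load%n", days); // ~3.7
    }
}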
In the previous calculation we see that we can drive four weeks of business
volumes in a little less than four days. Based on our schedule, it will be difficult to
accommodate a full four days of testing of this type. We have one week in which to
complete sustainability testing. Four days is less than the one week we have allot-
ted, but there is little margin for error.
If we need to repeat or restart the test, we will immediately overrun our activ-
ity. As a result, we compromise on 14 days’ sustainability and will run load for two
days. We will compensate for the abbreviated test duration with close attention
to system metrics while the system is under load. Our next point of business is to
discuss system monitoring during sustainability testing.
(Table: number of users by class: Supervisors 55, Documentation 75, Engineering 35, Legal 10, Quality 60.)
If your sustainability test does reveal degradation or instability, the cause is usually
one of the following:
1. Resource Leak: You have a resource leak in your application. That is, the
system is creating or requesting resources and then losing track of them. The
system continues to create or request resources until the requests cannot be
fulfilled. For lower-level programming languages like C/C++, it is entirely
possible to allocate physical memory and then obliterate all references to
this memory. This is a true memory leak in which the memory can never
be recovered. If this pattern continues, your process will reach the maxi-
mum process size for the operating system and be terminated or exhaust the
physical memory available on the server. On other platforms, including Java,
memory is managed by the execution environment (JRE), so it is not possible
to truly leak memory. However, if your application allocates memory into an
unbounded collection and never destroys references to this data, the effect
is the same: an increasing memory footprint that will eventually exhaust all
memory. (A brief sketch of this pattern, and a bounded alternative, follows this list.)
2. Resource Sizing: You have sized a configurable resource in your system too
large or too small. For example, if your system has a hard limit of 250 MB of
available memory, and you configure an in-memory cache to contain 10,000
objects (each object being 25 kB or more), you will exhaust the memory you
have allocated. This problem is easily resolved by shrinking the number of
objects permitted in the in-memory cache. Of course, this may impact per-
formance, so configuration changes in this category will require performance
regression. A good illustration of where this type of sizing can be problematic
is in the area of sizing Enterprise Java Bean (EJB) cache sizes for the J2EE
platform. J2EE containers manage cached and pooled bean instances on your
behalf. The number of objects that can be pooled is set in XML deploy-
ment descriptors, which can be configured on a per-environment basis. If
these parameters are not sized properly, it is easy for the cache sizes to over-
run the physical limitations of the runtime environment and cause memory
exhaustion.
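The sketch referred to in point 1 shows the unbounded-collection pattern in Java and a simple bounded alternative; the class and constant names are ours, and the 10,000-entry cap is arbitrary.

import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

public class LookupCache {
    // Unbounded: entries are never evicted, so the heap grows for as long as new
    // keys keep arriving. Under sustained load this eventually exhausts memory.
    private static final Map<String, Object> UNBOUNDED = new HashMap<>();

    // Bounded: evicts the eldest entry once the cache exceeds a fixed size.
    private static final int MAX_ENTRIES = 10_000;
    private static final Map<String, Object> BOUNDED =
            new LinkedHashMap<>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<String, Object> eldest) {
                    return size() > MAX_ENTRIES;
                }
            };

    // The leaky variant: every call adds an entry that is never removed.
    public static void cacheUnbounded(String key, Object value) {
        UNBOUNDED.put(key, value);
    }

    // The bounded variant keeps the memory footprint stable under load.
    public static void cacheBounded(String key, Object value) {
        BOUNDED.put(key, value);
    }
}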
For you to be confident that your sustainability test is successful, it is not
enough to observe performance over the duration of the test and conclude that your
test has passed. It is equally valuable to monitor the system behavior throughout
the test and ensure that the system has reached a steady state in which there is no
unbounded resource growth. The following metrics merit close attention during
sustainability testing (a minimal monitoring sketch follows the list):
1. Memory: As discussed earlier, unbounded memory growth surely spells the
demise of your application. Some platforms allocate physical memory to the
process and then manage this memory internally. The Java platform can be
configured to work in this way. The Java heap can be allocated once at system
start-up. In order to have a view of the Java heap internally, you can configure
verbose memory logging for the JRE. For cases like this, be certain that you
measure the memory footprint at the OS level and internal to the process.
2. CPU. You should monitor CPU during the course of your test. If the amount
of CPU that is required for the application is steadily increasing while load
remains constant, you may have a CPU leak. A CPU leak will eventually
exhaust the available processing power on your platform and cause your
application to fail.
3. File System Growth: You should monitor the file system to ensure that the
application is using disk at a sustainable rate. In the production environment,
you will need to allocate sufficient storage for log files and transactional files
that are generated within the required retention period for the application.
4. Internal Caches and Pools: Depending on your platform, you may be able
to monitor standard containers, caches and pools. For example, on J2EE-
based applications, most pooled resources for connections and EJB caches are
exposed through JMX (Java Management Extensions). There
is a growing population of monitoring tools that support this standard and
allow you to monitor your system over the duration of the test.
5. Performance: The easiest way to measure performance for degradation is to
run a performance regression on your system at the conclusion of your sus-
tainability test. If you certified the application using a one-hour peak load,
then run this same test and compare the performance results against the
original baseline.
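The following is the minimal monitoring sketch referred to above: it samples heap usage and system load once a minute using the standard java.lang.management beans. In practice you would write these samples to a log and chart them over the duration of the soak test; a flat trend suggests a steady state, while a rising one suggests a leak.

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.OperatingSystemMXBean;

public class SoakMonitor {
    public static void main(String[] args) throws InterruptedException {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();

        while (true) {
            long usedHeapMb = memory.getHeapMemoryUsage().getUsed() / (1024 * 1024);
            double loadAverage = os.getSystemLoadAverage(); // -1 if unavailable
            System.out.printf("%tT heap=%d MB loadAvg=%.2f%n",
                    System.currentTimeMillis(), usedHeapMb, loadAverage);
            Thread.sleep(60_000); // one sample per minute
        }
    }
}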
Sustainability testing is among the most important tests in your repertoire as
a non-functional expert. Sustainability testing can expose subtle defects that are
hard to detect in a production environment where you may not have the flexibility
of your test environment and the benefit of intrusive monitoring capabilities. You
should be satisfied with your efforts once you have demonstrated consistent per-
formance for a period at least as long as your longest operations window. At the
same time, you can improve your confidence level by carefully monitoring system
metrics and showing stable and predictable resource usage for your application at
steady state.
Challenges
Test execution can be a trying activity fraught with system restarts, database
imports, data loading, and script warm-up, amongst other time-consuming events.
Before we move on to the next chapter, we would like to share some wisdom on the
challenges you may face during your test execution.
Repeatable Results
A great deal of coordination and planning is sometimes necessary to execute non-
functional tests. It can be a major inconvenience to repeat all of this effort in your
quest to achieve consistent, repeatable results. By now, you should know that you
can’t trust your results until they are repeatable. If your results are not consistent,
you should look to the following possible explanations:
1. Isolation: If your test environment is not isolated from other activities, it is
possible that external influences are impacting your test results. The only way
to mitigate this is to discover who is impacting your environment and try to
schedule your tests during periods when they are inactive.
2. Variable Load: If your test results are not consistent, you should verify that
you are running the same test. It is not difficult for a test operator to config-
ure the load with different scripts, or different parameters such as ramp-up or
think time. If response time is radically degraded, make sure the number of
virtual users hasn’t been increased.
3. Sample Size: If your results are not consistent, make sure your sample size
is large enough. You can also extend the duration of your test, if need be. A
statistical average will not be consistent if there aren’t enough data points in
the calculation.
4. Rollback the Application: If you are seeing dramatically different results
for a new version of the application, you should try to execute an equivalent
test on the previous version of the application. This may indicate whether
the problem is in the application or the environment. If you’ve followed the
advice in Chapter 4 on test planning, you should have a logical instance of
the previous version in your environment already.
Limitations
Werner Heisenberg, a founding scientist in the field of quantum mechanics, is best
known for the uncertainty principle, which states that it is impossible to measure
the precise position and momentum of a particle at the same time. The basis for the
theory is that your measurement of position compromises measurement of momen-
tum. To put it more simply, it is impossible to measure an attribute of the particle
without exerting an effect on the particle that invalidates the other measurement.
The complexity of non-functional testing is not exactly on par with quantum
theory, but, interestingly, we consistently face the same challenge. When a phenom-
enon of the system occurs under load, it is often very difficult to conduct analysis
of the behavior without changing the phenomenon itself.
For example, a performance problem may arise once load has crossed a certain
threshold and we need to determine what is causing the degradation. Two strategies
come to mind: we could introduce custom instrumentation code (performance log-
ging; see Chapter 3) or, if our platform supports it, we could run the load with pro-
filing software attached. Unfortunately, in both cases the additional load imposed
by these alternatives will certainly change the performance characteristics of the
system. In fact, the specific performance degradation may not arise at all when we
run in this configuration. Perhaps more likely, we may not be able to achieve the
production load because of the additional overhead of our measurement. If we are
lucky, we may see a similar performance degradation and if we are luckier still,
our instrumentation may point to the source of the problem. All things said, it is
important that you understand that intrusive efforts to measure and understand
your system have the capacity to also influence system behavior.
Summary
The mechanics of test case preparation and execution take time and experience to
master. This chapter has equipped you with the tools you need to approach each set
of activities with confidence. By now, you should be comfortable with each of the
tasks that are prerequisite for your test execution. You should be familiar with per-
formance testing itself, including stress and regression tests. In devising test scripts,
we explored strategies for combining execution steps in scripts to achieve target
transaction rates. This chapter introduced two important concepts: mixed load and
performance baselines. A mixed load is a representative combination of test scripts
that can be leveraged for a variety of operability tests; a performance baseline is the
most recent successful performance test result that is used to contrast with new
test results. In this chapter, we also reviewed a number of categories of operability
testing including boundary conditions, failover, fault tolerance, and sustainability.
Based on our experience, we also discussed common frustrations including the
difficulty of achieving repeatable test results and of measuring system behavior
without influencing it. In the next chapter we assume that your test activities have executed suc-
cessfully and move on to a discussion of deployment strategies that mitigate risk
and improve your chances of delivering successful projects.
Deployment Strategies
We have spent most of this book describing how to build and test software systems.
In this chapter, we shift our focus and begin to look at considerations for deploying
critical software into a production infrastructure.
Failed deployments are a nightmare for everyone. Your project team has spent
months building and testing your application only for it to fail business verifica-
tion when it is deployed. The ensuing weeks will be a scrambled, unplanned effort
to correct the issue and prepare for another deployment. Generally speaking, there
are two varieties of failed deployments, and this chapter provides you with tools to
mitigate the likelihood of either of them.
A deployment can fail because the deployment procedure itself is bungled. This
can happen for many reasons, the simplest of which is that the procedure itself can
be wrong. Alternately, an operator executing the procedure can make a mistake.
Or, an important set of configuration parameters may not be correct for the pro-
duction environment. Basically, the software itself may be fine, but the procedure
to implement it is not. For software systems that are large, complex, or both, the
deployment procedure can be equally large and complex.
A deployment can also fail because the new software itself does not anticipate
the production environment correctly. Rigorous functional testing does not always
ensure compatibility with the production environment. This is common for sce-
narios where your system must interact with complex legacy systems that do not
have well-defined behavior. In these cases, your functional testing may have relied
on a test system that is woefully out of synch with the production environment.
For new systems, you may also face a situation in which projected business
usage falls well short of reality. The result is that your application must cope
with volumes that were not part of your non-functional test scope. In this chapter
we will look at deployment strategies that help mitigate your risk in these types of
circumstances.
Procedure Characteristics
Risk management is an important theme throughout this book. A good deploy-
ment process is focused heavily on minimizing risk. Deployment strategies that
manage risk effectively have the following characteristics:
1. Minimal: There is always the potential for error in a software deployment. By
minimizing the number of components that you are changing, you are likely
to shorten and simplify the deployment procedure. As we will see in this
chapter, structuring your applications in loosely coupled component architec-
tures helps to position you for future deployments with a minimal footprint.
2. Automated: There are two key reasons why you should strive for automated
deployment procedures. Firstly, manual processes are executed by human
operators who are prone to error. Secondly, automated procedures deploy the
application in the same way in every environment. A human operator may
follow a deployment procedure and introduce subtle differences across dif-
ferent environments. The objective of a deployment is to propagate the exact
same system that was tested and verified in non-production environments
into the production environment. Automated deployments also tend to com-
plete more quickly and efficiently than manual deployments. This efficiency
gives you more time to verify the deployment and reverse the deployment if
necessary.
3. Auditable: Each step in the deployment should be auditable. This means that
a person looking at the production environment should be able to reverse-engi-
neer the deployment procedure from the production environment and the out-
puts of the deployment. Many automated deployment procedures generate a
deployment log that can be used for this purpose.
4. Reversible: Given that there is always risk that a deployment will introduce
serious problems in a production environment, it is always recommended to
have a back-out, rollback, or contingency procedure that resurrects the state
of the software system prior to your deployment. Of course, each of the char-
acteristics mentioned in this list should also apply to your back-out procedure
that reverses the deployment.
5. Tested: If you have invested design and development effort in your deploy-
ment process, you need to ensure that it is fully tested and exercised. This
means that you should use it consistently to build all of your test environ-
ments. If possible, you should encourage the development team and indi-
vidual developers to use your process for building environments in their own
activities.
The approach you take to achieving an auditable, reversible, automated deploy-
ment will depend on your software platform. There are dozens of scripting lan-
guages and technologies that can be used to automate deployment procedures for
common enterprise platforms like UNIX and Windows.
Packaging
Packaging refers to the way in which your application code is bundled into deploy-
able units. Your packaging options will depend in large part on the software plat-
form with which you are developing. Many software platforms like J2EE and .NET
are designed to encourage and support component architectures.
Component architectures consist of a family of components that interoperate
in order to implement the overall software system. Individual components can be
upgraded (or downgraded) independently. In order for such a scheme to work, you
need to ensure that the combination of components you are deploying has been
certified to work together in your testing. Development activities also scale better
for component architectures, as developers can be aligned to work on different
components in parallel.
Component architectures are also efficient at supporting code reuse; individual
components can be shared among different applications.
In component architectures, you can group features and application code into
components that are likely to change together. For example, you may have a com-
plex set of application code that implements some industry-specific business logic.
Since business logic is more likely to change than structural and utility functions in
your application, you would be well advised to package all of the business logic into
a dedicated component. In minor releases that alter business logic, only the compo-
nent encapsulating business logic need be upgraded as part of the deployment.
The alternative to component architectures tends to be a monolithic application
that forces you to re-deploy the entire application every time you need to make a
single change. Re-deploying the entire application increases your testing obliga-
tions to confirm that nothing unexpected has been somehow introduced into the
deployment. The deployment procedure itself is likely to include many more steps,
increasing the number of opportunities for error.
Configuration
Technologies like XML and Spring for Java-based applications have made it
increasingly attractive for developers to make software systems highly configurable
through text files. Among other things, text file configuration is commonly used to
enable and disable business functions, size software resources (e.g., cache size), and
parameterize business logic.
The advantage of flat-file configurations is that they are highly transparent. It is
easy for a third-party to audit text file changes and be confident that a deployment
includes no more than the stated changes in configuration. Database-based con-
figurations offer a similar advantage as configurations are manipulated using SQL
(structured query language), which is also quite readable in plain text.
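As a minimal sketch of flat-file configuration, the Java fragment below reads a feature toggle from a properties file; the file name and property key are illustrative. The same kind of toggle is what makes the logical back-out described later in this chapter possible.

import java.io.FileInputStream;
import java.io.IOException;
import java.util.Properties;

public class FeatureToggles {
    private final Properties properties = new Properties();

    public FeatureToggles(String path) throws IOException {
        try (FileInputStream in = new FileInputStream(path)) {
            properties.load(in);
        }
    }

    // e.g., funds.transfer.enabled=true in application.properties (hypothetical key)
    public boolean isEnabled(String featureName) {
        return Boolean.parseBoolean(properties.getProperty(featureName, "false"));
    }
}

Because the file is plain text, a third party can audit exactly which behavior changed in a deployment simply by comparing the old and new configuration files.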
Deployment Rehearsal
In Chapter 6 we discussed the advantages associated with a production-scale
test environment used to support non-functional testing activities. A produc-
tion-scale environment is also advantageous for rehearsing your production
deployment.
The rehearsal process ensures that your deployment is compatible with all of
the infrastructure nuances of the production environment. It is a test where the
emphasis is not on the application itself, but rather on the mechanism by which it
is deployed.
Deployment rehearsals also build familiarity within the deployment team, so
that when the system is deployed to production there is reduced risk of human error
in the execution. A deployment rehearsal is also often referred to as a dry run. Ideally, a
deployment rehearsal should also exercise the verification and back-out procedures
of the deployment.
Rollout Strategies
The deployment of your application is often only the first step in making new or
upgraded functionality available to end users. Once the software is deployed, you
can choose from different strategies for actually rolling out new features to users.
This section will look at common rollout strategies.
In many cases, for existing systems, pilot functionality is available at the same
time as the original system. If users have a negative experience with the pilot sys-
tem, they can always revert to the original system.
This is a good strategy for mitigating the reaction of sensitive end users to
major changes in an existing application. If the pilot system is hosted outside the
production system, this is also a good rehearsal for the deployment procedure
itself.
Pilot rollout (figure callouts):
1. Internal pilot users access the new online banking system directly using a dedicated pilot URL. Access to the new system is not published to external customers.
2. Customers access the Online Banking site through the original, legacy login form.
3. The new and legacy banking systems are both connected to the bank's back-end systems through a common services tier. Technically, customers can conduct banking from either the new or the legacy site.

Phased rollout (figure callouts; the customer login flows through a login module with a phased rollout lookup that routes to either the new or the legacy online banking system):
1. Customers access the Online Banking site through a single login form. The login form is rendered by the new banking system login module as shown in the diagram.
2. Customers supply user information to the login module. The login module determines, based on user information, whether the user should be redirected to the new or legacy online banking system.
3. The login module redirects the customer to the new or legacy online banking system as appropriate.
4. The new and legacy banking systems are both connected to the bank's back-end systems through a common services tier. Technically, customers can conduct banking from either the new or the legacy site.
As new features were introduced, one financial institution we worked with needed a strategy for gradually
rolling out new business functionality to the front office in increments. Immedi-
ately following a release, the institution did not want all 4,000 front-office users
simultaneously attempting to access the same new feature.
Unfortunately, the front-office group could not be trusted to comply with a
phased rollout strategy, i.e., if you added additional capabilities to their current
interface, they would use those capabilities, irrespective of what had been com-
municated to them. Front-office users were located at retail branches across the
country. Each branch would have between 5 and 200 front-office users. A good
solution would have been to allow users to have access to new features based on
their associated branch. Branches could be added incrementally to the rollout.
This problem was solved programmatically in the application
itself by building an additional layer into the security model for the system.
The development team imposed a branch lookup prior to building the menu of
available options for front-office users. In this way, a list of branches could be con-
figured to have access to a given new feature, as shown in Figure 8.3. The branch
list was implemented in a database that could be altered via a simple administrative interface. This feature gave the transition team the control needed to implement a phased rollout strategy.
In the screen schematics shown below, you can see how the Transfer Securities menu
item is only available to users who are associated with branches that are included
in the rollout.
The screen schematics show two versions of the Front-Office User Screen: both offer Transfer Cash and My Profile under the Transfers menu, but the Transfer Securities link appears only for users whose branch is in the rollout branch list.
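The branch-gating mechanism described above can be implemented with very little code. The following is a minimal sketch in Java, assuming a hypothetical FEATURE_BRANCH table with FEATURE_NAME and BRANCH_ID columns maintained through the administrative interface; the table, column, and class names are illustrative only and are not taken from the case study.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import javax.sql.DataSource;

// Gates menu items on a branch rollout list held in the database.
// Table and column names are hypothetical.
public class BranchRolloutLookup {

    private final DataSource dataSource;

    public BranchRolloutLookup(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    // Returns true if the user's branch has been added to the rollout list for the feature.
    public boolean isFeatureEnabledForBranch(String featureName, String branchId)
            throws SQLException {
        String sql = "SELECT 1 FROM FEATURE_BRANCH WHERE FEATURE_NAME = ? AND BRANCH_ID = ?";
        try (Connection con = dataSource.getConnection();
                PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setString(1, featureName);
            ps.setString(2, branchId);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next();
            }
        }
    }
}

The menu-building code would consult this lookup before rendering an item such as Transfer Securities, and the transition team would add rows to the table as each branch joins the rollout.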
Back-Out Strategies
A back-out usually follows a failed deployment; it is required when a deployment fails and leaves an existing business-critical system in a state that is unusable to end users. Backing out an application is itself a risk, but when you decide to back out you are already in a situation where the production system is broken.
A deployment procedure for an enterprise application should always include a
back-out procedure. The back-out procedure needs to be tested with the same rigor
as the deployment process itself. Backing out an application is an embarrassing and
undesirable scenario for any project.
Complete Back-Out
A back-out procedure can follow one of two strategies. The most conservative
approach to back-out removes all traces of the new deployment from the produc-
tion environment. This approach reverts the production system to a state identical
to the pre-deployment state of the infrastructure. Your deployment plan should
always include a complete back-out procedure.
Partial Back-Out
As you might expect, a partial back-out removes a subset of the deployed changes
from the production environment. In general, partial back-outs are heavily discouraged
for enterprise systems because it is too difficult to test and anticipate all of the pos-
sible combinations of partial back-out procedures.
Partial back-outs also have difficulty meeting our requirement that deployments
be fully auditable as it isn’t clear in the documentation which back-out procedures
were actually followed and which procedures weren’t. However, in some situations,
partial back-outs are a manageable way to mitigate the impact of a failed deploy-
ment. Your business verification should outline which test cases are associated with
which back-out procedures. If the components that are being left in versus the
components that are being backed out are sufficiently remote from one another,
this can be a reasonable strategy.
Logical Back-Out
An alternative back-out procedure leaves the deployed application intact, but dis-
ables the new or changed functionality. This type of back-out is quicker to apply,
but usually requires built-in support from the application. For this case, toggles can
be built into the database or text file configuration to enable and disable specific
business functions.
This approach to back-out leaves the door open to re-enable the new business
function—if, for example, a required external dependency is met post-deployment.
Your deployment procedure will need to include technical and business verification
steps that also indicate the conditions for when a partial or complete back-out is
acceptable.
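As an illustration of such a toggle, the following sketch reads business-function switches from a plain text properties file; the file path and property names are hypothetical, and a database-backed lookup would work equally well.

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

// Minimal feature-toggle check backed by a text configuration file.
// A disabled function behaves as if the deployment had never introduced it.
public class FeatureToggles {

    private final Properties properties = new Properties();

    public FeatureToggles(String configPath) throws IOException {
        try (InputStream in = new FileInputStream(configPath)) {
            properties.load(in);
        }
    }

    // Example entry in the file: feature.transferSecurities=false
    public boolean isEnabled(String functionName) {
        return Boolean.parseBoolean(
                properties.getProperty("feature." + functionName, "false"));
    }
}

A logical back-out then consists of flipping the relevant entries to false and refreshing the application; the technical verification steps in the deployment procedure should confirm the disabled behavior.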
Summary
Any seasoned technologist will tell you that risk accompanies any change you make
in a production environment. You mitigate this risk in your deployment proce-
dure by ensuring that your deployments are minimal, automated, auditable, and
reversible.
Once you have deployed software into your environment, you can choose from
one of several rollout strategies including piloting, multiple and short phases, and a
big-bang implementation, depending on your risk tolerance. A rollout strategy is a
good way to compensate for shortfalls in your non-functional test coverage. Finally,
in the event that your deployment is unsuccessful, you will need to follow a pre-
determined back-out strategy.
In this chapter we have reviewed complete, partial, and logical back-outs as
alternatives if you are in this undesirable situation. In the next chapter we will look
at operations considerations including important topics like monitoring, trending,
and reporting.
Resisting Pressure from the Functional Requirements Stream
Non-functional requirements seldom command the full attention of the business team. There are several reasons for this, including the following:
1. Defining functionality is challenging and can become all-consuming in time
and effort. SMEs feel the need to focus all their efforts and energy to get the
functional requirements absolutely right.
2. Non-functional requirements are considered important, but until the func-
tional requirements are clearly known, there is believed to be little point in
spending time on them.
3. SMEs, due to their background, will focus on what they know best—driving
out the functional requirements, while viewing non-functional requirements
as being a technical issue only.
4. Functional requirements are generally on the critical path of most project plans.
5. Businesses do not pay for a system that is very fast, very secure, and highly
usable. They pay for functionality that is all of these things.
The importance and criticality of functional requirements cannot be reasonably
debated. This is what the business is buying; this is the reason the project exists in
the first place. Very few business resources would deny the importance of security,
availability, performance, and ease of use. This is where the insidious challenges
creep into the picture. Despite everyone’s best intentions and recognition of the
importance of non-functional requirements, this stream is still usually neglected,
incomplete, or incorrect. This is true even when the non-functional requirements
stream is launched at the same time as the functional stream.
Even when work on non-functional requirements is under way, when push comes to shove the functional requirements stream will win the
competition for resources. The demands from the functional stream can insidiously
creep up and draw attention and resources away from the non-functional stream.
There clearly needs to be a balance in the pursuit of both functional and non-func-
tional requirements.
This chapter focuses on defining a framework that allows a project team to
resist continued and unrelenting pressure from the functional requirements stream
to draw resources and attention away from the non-functional stream despite
everyone’s best intentions to the contrary. Non-functional requirements require
attention, but the key is to maintain that attention for the duration of the project,
regardless of the pressure being felt to complete the functional requirements.
A Question of Degree
Functional requirements tend to be business-domain-specific, while non-functional
requirements tend to have several components. They are generic in the sense that
performance, throughput, and security requirements are universal. This implies that the same pool of specialized resources serves both streams, which leaves the functional stream free to compete with the non-functional stream and draw those resources away to meet its own schedule at the expense of the project as a whole. The situation is insidious because it happens
slowly and innocently. The project team believes they are doing the right thing, not
realizing that they are postponing some of the most difficult requirements into the
time crunch that typically happens at the end of a project lifecycle.
The process starts simply enough. A project plan is constructed that shows,
among many other things, how functional and non-functional requirements are
going to be accommodated. More than likely, activities representing the start of
the functional requirements stream will be reflected in the project plan with ear-
lier start dates than their non-functional counterparts. The latter will also likely
reach completion sometime following the former’s end date. This is shown in Fig-
ure 9.1. Note that there is likely to be some degree of iteration and reworking in
the process.
Figure 9.1 also shows several points of dependency between the two streams.
The non-functional requirements stream can be initiated, but it requires input or
answers from the functional team at specific points. The figure also groups the non-functional requirements into a set of generic categories.
Figure 9.1 also shows that the functional requirements stream can be depen-
dent on the non-functional requirements stream. This may sound like a novel idea
to some purists, while others might be confused at this statement, believing it to
be completely obvious. The principle behind this is to produce a stronger return
on investment by aligning technical/non-functional capabilities with the objec-
tives sought by the business. It is possible for the business to modify their require-
ments in response to technical feasibility rather than to spend the extra money
(e.g., on hardware) to completely satisfy their wish list. This type of alignment is
only possible if the two requirement streams are jointly conducted and are equally
respected.
The two-way dependency between the streams results in contention for project resources. The two streams can begin as planned. As time progresses, however, we tend to see a magnetlike attraction for resources in various areas, including the following:
1. Attention
2. Human Resources
3. Hardware Resources
4. Software Resources
5. Issue Resolution
These are described further in the following subsections.
Attention
This really refers to the areas that are getting the attention of the project team
and stakeholders. While everyone is interested in non-functional requirements,
the detailed outputs of the functional stream—business rules, input screen layouts, reports, interface dialogue—are the immediate artifacts for user signoff. These then
get the attention until signoff is achieved. In most projects, the complexity of the
functional requirements stream needs continued SME and business attention to get
the level of detail and accuracy to warrant user signoff.
Together with the architecture, modeling, and design teams working to ensure that the functional requirements are supported, a good portion of the extended project team is involved in this activity. If—and this is usually the case—there are challenges in getting signoff due to missing, incomplete, or incompatible functional requirements, team members shared between streams will delay their involvement in the non-functional stream.
Human Resources
This is the greatest point of contention in completing both the functional and
non-functional requirement streams per a project plan. The resources that are
required in both streams can include: SMEs, business analysts, architects, design-
ers, users, project sponsors, and modelers. Increasing complexity on the functional
side requires more attention from these resources, with a corresponding decrease in
attention to the non-functional requirements. These are discussed in further detail
later in this chapter.
Hardware Resources
Hardware for non-functional requirements such as stress testing, throughput veri-
fication, and end-to-end security is often neglected due to the high cost of acqui-
sition, setup, and maintenance. This pushes these activities well into the project
lifecycle and may delay them to a point where they cannot be completed in time to
meet the project deadline. This leads to a choice of delaying the project or imple-
menting the application without a full understanding of how it will behave in a
production environment.
Software Resources
This deals with the type of software and the number of licenses required by the
application. The focus of most project teams tends to be on the tools required
for designing, modeling, and building the application. Tools for non-functional
requirements are left until later or entirely written out of the budget. This also tends
to be a first point of reduction when the budget needs to be cut back. It is difficult
to see the impact of a delay here during development, while a missing development
tool is visible immediately.
Issue Resolution
As development progresses, many issues are identified by different members of the
project team. These are typically categorized, prioritized, and logged. As the pres-
sure to sign off on the functional requirements increases, related issues tend to have
a higher prioritization and an earlier resolution date. Non-functional issues tend to
be given lower priorities or longer resolution dates, which again removes urgency
and attention away from them to the point that there again may not be enough
time for resolution.
Defining Success
The definition of project success in the industry is fairly standard, and is based upon
whether the project is completed on time and on budget. This also assumes delivery
of a mandatory set of functions that are within the project scope. The functions
themselves are governed by a set of non-functional specifications that drive out how
well and completely the functions behave. Consider the following examples:
n A claim entry screen that requires more than one second to save or update a
claim is unusable
n A funds transfer in a banking application that adds funds to one account but
breaks before subtracting them from the other is a complete disaster
n A reporting application that needs information current to the hour (otherwise meaningful executive decisions cannot be made) requires several non-functional design solutions
Each of these examples begins with a business requirement that provides fur-
ther definition to the non-functional requirements stream. The project’s success
depends on both parts of each requirement statement being met. Without paying
full attention to non-functional business requirements, and maintaining that atten-
tion, the project cannot be successful. Completing either the functional stream or
the non-functional stream alone is not enough.
Key non-functional milestone deliverables across the project lifecycle include the following:
n Plan
− Non-functional resource estimates complete
− Budget secured for hardware/software needed
n Architecture and Design
− Non-functional test environment defined
− Software testing tools defined
n Develop
− Non-functional requirements completed
− Development completed with attention to performance/operability
n Test
− Deployment to non-functional test environment completed
− Development for automation and load testing completed
− Performance testing completed
− Failover and operability testing completed
− Sustainability testing completed
n Deploy
− Capacity model and plan completed
Framework
The best way to protect non-functional requirements against the pressures of the
functional requirements stream is to build the activities directly into the standard
project-development lifecycle and to align delivery to the performance metrics of the
resources on the project team. Figure 9.3 shows a number of non-functional threads
that should be incorporated as a set of activities across the project lifecycle.
The non-functional requirements thread stretches across the entire framework.
It is not an afterthought. It has as much importance as the project management
thread. Figure 9.3 also shows the project management, change management, and risk management threads running in parallel. The non-functional thread should be complemented by a set of milestone deliverables in the project plan (as shown in Table 9.2). The milestone deliverables can be further subdivided.
For example, Hardware requirements can be subdivided into development server,
stress testing server, testing server, development desktop, and external user accep-
tance area.
Table 9.3 (continued)
Data Architect: Possibly the owner of the data and the database. Must lead the efficiency of the data architecture. Works closely with the data modeler and the database administrator.
Developers: Responsible for sounding the alarm when non-functional requirements are not being identified or addressed.
The other project development lifecycle activities apply around these specific
ones. In constructing a framework for dealing with non-functional requirements,
consider these objectives:
n Usability
n Documentation
n System Help
n Call Center
n Ease of Future Enhancements
n Audit Reports
n Ad-hoc Reports
Project Sponsorship
While the project manager has the ultimate responsibility to deliver the suc-
cessful application, the project sponsors play a key role. As the ultimate source
of problem resolution on the project, they must ensure that non-functional
requirements are not ignored if deadlines begin to slip or resources begin to
get drawn in different directions. They can also work with other stakeholders
to bring additional resources onto the project to ensure that the project plan
continues to be met.
Ideally, a business and a technical sponsor will jointly have access to all the
other resources in the organization required to complete the project. While
these resources may be involved in operational activities, the combined spon-
sor team can work with senior management to affect other priorities in the
organization.
Performance Metrics
We have discussed the importance of non-functional requirements and ensuring
that they are included in the project plan. We have also discussed the fact that the
project manager and the project sponsors have the ultimate responsibility for ensur-
ing that resources are in place to adequately address these. However, members of
the project team must have the incentive and initiative to also play a key role in the
fulfillment of these.
In an ideal situation, all the members of the project team would share in the
prioritization of non-functional requirements. However, many different priorities
emerge in the project trenches. The real-world situation is generally far from ideal,
and so we need a vehicle to share the responsibility. Neither the project manager
nor the stakeholders can be successful by themselves alone. It is also difficult for them to ascertain the truth of statements they will undoubtedly hear from project team members that satisfying both requirement streams is impossible due to time constraints or other reasons.
Performance metrics need to be extended to the key project team members
that are needed to adequately address the non-functional stream. Working on
the functional stream should not be enough to give them an excellent perfor-
mance review. Signoff of the non-functional requirements, per the key mile-
stone deliverable list, should be included in their success criteria. This should
be regularly revisited with the project team members until the dependencies
have all been met.
With the performance metrics in everyone’s mind, a regular (e.g., weekly) sta-
tus report should track each of the non-functional requirements so that progress is
clearly visible.
Escalation Procedures
With the other tools in place, the non-functional requirements stream should be
positioned to resist pressure from the functional side. However, feedback from dif-
ferent members of the project team may still identify risks, future or immediate,
that need to be processed. The feedback may also show that the non-functional
stream is starting to be neglected or is falling behind.
A published escalation procedure is needed from the start of the project to deal
with issues where this stream is still being neglected. This should include a process
for establishing project responsibilities above stream responsibilities. This could
mean that a specific function may actually not get the resources that are needed to
work on backup and recovery capabilities, for example.
The escalation procedures should go through the project manager into a regu-
larly occurring meeting with business users, stakeholders, and the project sponsors.
Lack of resolution may lead to an executive steering committee that has the author-
ity to provide any of the following:
n Additional funding for resources, tools, and technology to deal with compet-
ing requests
n Ability to divert knowledgeable resources from operational responsibilities
n Ability to divert knowledgeable resources from other projects to provide relief
n Ability to modify the scope or timeline of the project
Setting Expectations
Clarity at the outset of a project around risk management, the problems that may occur, and what is expected of everyone involved in resolving them is the only way to ensure that there is the will and ability to deliver both requirements streams within the constraints of the project. Expectations need to be established across several areas.
These should be written and shared with the project sponsors at the start of a
project. With their agreement, the expectations should be shared with the team
leads and then the rest of the project team. They should also be communicated to
executive management.
Controls
Controls are needed to ensure that the functional stream does not encroach on the
non-functional stream. Indeed, they are also needed to ensure that the converse
does not occur, either. Projects that are driven by the business tend to be the former;
projects that are dominated by the IT (information technology) team can easily
slant toward the latter. Well-defined controls offer balance to a project.
Summary
Businesses do not make money by constructing systems that are defined by typical
non-functional requirements such as fast response time, being highly secure, or
being immensely scalable. Businesses spend money and time to get systems that
satisfy specific functions. However, many of these are complex and time-consum-
ing to define, design, and build. Most times these functions are at odds with non-
functional requirements.
There are many factors that drive tight—but usually competitive—relationships between the functional and non-functional requirement streams. There will be pressure to dedicate resources to define, design, and build functionality, while non-functional requirements generally do not get the same “mind space” from the project team. Nobody is going to say that response time or security is unimportant. But do project teams make investments of time, money, and other resources commensurate with the importance of response time? The answer is generally no. In the subset of cases where this level of attention is given, does it remain? Again, the answer is generally no. The reasons for this are complex, and are not due to any deliberate neglect.
This chapter described situations where a functional requirements stream can
begin to draw resources from the non-functional requirements stream, thereby put-
ting that stream and the project as a whole at risk. This chapter also presented a
framework for ensuring that this does not occur, along with suggestions on how
to deal with the inevitable pressures to do so when the project timeline becomes
threatened.
Operations Trending and Monitoring
No amount of careful design and testing can substitute for a thorough and effec-
tive operations strategy. From an availability standpoint, monitoring and trending
are critical. Eventually, things will go wrong in your operations, and you will be measured by how quickly you detect and respond to the failure. The organization’s profitability and perhaps even survival may depend on minimizing or neutralizing the impact of any problems that occur. Setting up early warnings can make all the difference in successfully dealing with a problem, or perhaps avoiding it altogether.
Monitoring
Monitoring your system has two distinct objectives: detecting problems as early as possible, and capturing the diagnostic information needed to resolve them quickly. If you can detect problems early, you may be able to correct them before they have any end-user impact. Your ability to resolve an issue quickly depends on the quality of diagnostic information you have available at the time of the failure. Consider the difference between the failure scenarios for a production system shown in Table 10.1.
In the first scenario, users begin to report that they are no longer receiving e-
mail from the production system. At this point, there is already a business impact.
Users are likely conferring with one another, comparing experiences, and complain-
ing about the impact. They may be questioning the validity of the work they are
doing on the system. This lost productivity is a financial impact that could have
been avoided.
In the second scenario, a series of errors are logged by the application when
individual e-mail requests are attempted and fail. Each attempt results in a gen-
erated alert. Because errors are at the level of an individual transaction, they are
logged and alerted with ERROR severity. This may attract the attention of the
operations team, depending on their training and the criticality of this application.
It is certainly preferable to the first scenario. A proactive operations team may be
able to do further investigation and discover the root cause of the failure. Unfortu-
nately, other than reporting that the error is associated with the e-mail capability for
the system, there is very little diagnostic information to help the operator.
In the third scenario, we see a much better level of service from our monitoring
infrastructure. Additional monitors in the system have detected that the simple
mail transfer protocol (SMTP) gateway process is actually down. Because this is
such a clear mode of failure, the alert level has been correctly raised as FATAL.
This is certain to command the full attention of the operations team. Furthermore,
a correlating alert has been raised from the application server indicating that the
SMTP gateway is not reachable on the network. This event is consistent with the
fact that the SMTP server is down. This alert shows the link between the applica-
tion errors and the root cause failure of the SMTP gateway. Again, this connectiv-
ity failure is logged as FATAL; if the SMTP gateway is not reachable, it is clear
that e-mail services are totally compromised. Because the diagnostics are so clear
and precise in this situation, the operations team is able to restore the SMTP server
quickly.
The operations team does not need to escalate involvement to the infrastructure
or development teams. Most importantly, the business user community may never
learn that there was a temporary failure for e-mail services on the application. Of
course, this would require a defensive design pattern that supports automated retry
to recover failed transactions. Readers of this book will have acquired this knowl-
edge in Chapter 4.
Based on the previous example, there can be no disputing how valuable a comprehensive monitoring strategy can be for high-availability applications. In this section, we will look at how to develop such a strategy for your application.
Monitoring Scope
An approach that is useful in assessing your application for monitoring starts with
dividing the application into layers and evaluating each layer independently. In this
approach, we will also see how roles and responsibilities can be defined for each
layer.
This approach begins by considering your system as a composition of the follow-
ing elements:
n Infrastructure: This refers to the hardware and software components that are
part of the base platform for your application, and includes network connectiv-
ity between servers, server availability, storage devices, and network devices.
Infrastructure monitors also apply to commoditized resources like the central
processing unit (CPU), available disk, and memory. We will look at example
monitors in this category later in this section.
n Container: Container monitors are resources that are explicitly configured and installed in the infrastructure to support your application. Container resources are typically vendor software applications. Your application may use services from these components, as in the case of the SMTP gateway example from earlier in this chapter. Your application may also be deployed into a container provided by a software vendor. We will also see examples of this type later in this chapter.
n Application: Application monitors are custom built or rely on custom application outputs such as the application’s own error logs. They require more effort to implement but provide the best diagnostic information.
n End user: End-user monitors exercise the application in the same way a real user would, verifying overall availability from the outside in.
Before we discuss these types of monitors in more detail, we should first outline
roles and responsibilities. Defining good monitoring interfaces requires application
knowledge, so it is important to have the right people engaged. At a minimum,
your project should designate one person who is responsible for the monitoring
strategy as a whole. This person will need to be a liaison to infrastructure, devel-
opment, and business participants. Usually a technical person is best equipped to
function in this capacity. It may make sense for the monitoring task owner to also
be engaged with non-functional test activities. In this way, verifying monitors can
be interwoven with non-functional tests.
Your approach to building monitoring capabilities into your platform will
need to follow the same guidelines as the rest of the functionality in a project
lifecycle. You need to define requirements, design a strategy that meets these
requirements, and then test and implement your solution. When you are designing a new system with high-availability requirements, it is key to designate
an individual with overall responsibility for executing a plan that delivers the
intended strategy.
On most projects, whether you are hosting your application or outsourcing, you
will have an infrastructure lead. This person is responsible for building and main-
taining the production infrastructure. This same person needs to be accountable
for the quality and depth of infrastructure monitoring. Fortunately, as we will see
shortly, infrastructure monitors are fairly commoditized and there are management
platforms that can be purchased for this purpose. The monitoring strategy lead will
challenge the infrastructure monitoring lead to ensure that the appropriate moni-
toring is in place.
Container monitors require a joint effort from the infrastructure and develop-
ment teams. The development team will need to designate a monitoring lead for
defining custom application monitors. It is recommended that this same person
work in tandem with the infrastructure lead to define appropriate container moni-
tors. Again, the monitoring strategy lead should challenge the other participants to
ensure maximum coverage from custom and configured monitors.
Finally, end-user monitors require cooperation between the development and
business teams. The development team is positioned to recommend a series of
monitors that exercise application functionality in a way that efficiently monitors
overall availability. A business participant is usually required to validate that these
monitors are permissible and do not have any negative business impact. Business
participants may also need to provide suitable data for end-user monitors. The busi-
ness user in this case is an advisor and participant only. It is recommended that the
development monitoring lead or the overall monitoring strategy lead own end-user
monitoring.
We summarize the roles and responsibilities we have defined in Table 10.2.
Please note that this is by no means the only way to structure your efforts, but it
is a proven configuration that will work for many types of projects. For small or
medium-sized projects, these roles are usually not full-time commitments.
Now that we have introduced high-level categories for monitoring, we will look
at each of these categories in detail in the following sections.
Infrastructure Monitoring
Infrastructure monitors are low-level monitors that verify hardware availability and
capacity. For highly available systems, Table 10.3 provides a reference for com-
mon, critical metrics for some example device types. You should consult vendor
documentation for specific devices to identify additional metrics. Many of these
metrics are threshold based—that is, you specify a threshold based on percentage
utilization.
In setting a threshold, you need to appreciate the granularity over which you
are taking measurements. For thresholds that are applied to rate, this is particu-
larly important. In CPU measurement, the instantaneous CPU utilization may be
100% for a small fraction of a second, but the steady-state average CPU utilization
based on measurement at 1s intervals may be only 30%. For CPU monitors, a suit-
able polling interval for CPU measurement is 1 minute.
In a typical SNMP deployment, a network management system (NMS) communicates with one or more agents running on each managed device. SNMP supports the following types of operations between the NMS and managed devices:
n Read: The NMS can interrogate a managed device to collect data on device
health.
n Write: The NMS can also issue configuration commands to devices to con-
trol them. This is less interesting to us from a monitoring standpoint.
n Trap: Managed devices can also asynchronously send messages to the NMS.
These events are called traps.
n Traversal: Traversal operations are used by the NMS to determine supported
variables on the managed device.
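As a concrete illustration of the read operation, the following sketch issues an SNMP GET for the standard sysUpTime object using the open-source SNMP4J library, which is not referenced elsewhere in this book; the community string, device address, and library version are assumptions, and in practice your NMS performs this polling for you.

import org.snmp4j.CommunityTarget;
import org.snmp4j.PDU;
import org.snmp4j.Snmp;
import org.snmp4j.event.ResponseEvent;
import org.snmp4j.mp.SnmpConstants;
import org.snmp4j.smi.GenericAddress;
import org.snmp4j.smi.OID;
import org.snmp4j.smi.OctetString;
import org.snmp4j.smi.VariableBinding;
import org.snmp4j.transport.DefaultUdpTransportMapping;

public class SnmpReadExample {
    public static void main(String[] args) throws Exception {
        Snmp snmp = new Snmp(new DefaultUdpTransportMapping());
        snmp.listen();
        try {
            // Managed device address and community string are placeholders.
            CommunityTarget target = new CommunityTarget();
            target.setCommunity(new OctetString("public"));
            target.setAddress(GenericAddress.parse("udp:192.0.2.10/161"));
            target.setVersion(SnmpConstants.version2c);
            target.setRetries(2);
            target.setTimeout(1500);

            // 1.3.6.1.2.1.1.3.0 is sysUpTime from the standard MIB-II system group.
            PDU pdu = new PDU();
            pdu.setType(PDU.GET);
            pdu.add(new VariableBinding(new OID("1.3.6.1.2.1.1.3.0")));

            ResponseEvent response = snmp.get(pdu, target);
            if (response.getResponse() != null) {
                System.out.println("sysUpTime = "
                        + response.getResponse().get(0).getVariable());
            } else {
                System.out.println("No response from managed device (timeout)");
            }
        } finally {
            snmp.close();
        }
    }
}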
Container Monitoring
Container monitoring refers to the installed vendor software in your infrastructure
and its attributes. This category of monitoring can be further broken down into the
following types:
n Availability Monitoring: This is the most basic level of monitoring for soft-
ware containers and services. Basically, this type of monitoring determines
if the software itself is available. In the SMTP gateway example from earlier
in this chapter, the monitor for SMTP availability is clearly an example in
this category. Some vendor software supports a “ping” capability that can be
used to ensure that it is responding normally. This type of capability may be
exposed via SNMP or through a custom plug-in for a well-penetrated vendor
solution like HP OpenView. In the absence of anything else, you can always
configure a vendor monitoring solution to detect processes at the operating
system level. If you are running an application server, you may expect four
processes with a specific footprint to be running at all times and this can eas-
ily be configured.
n Dependency Monitoring: If the vendor software you are running is depen-
dent on specific services in the infrastructure (e.g. an SMTP gateway), you
should consider introducing additional monitors that will create alerts when
this dependency is broken. For example, for Oracle to run efficiently, you may
decide that no tablespace configured on physical storage should ever exceed
60% capacity. This is a finer-grained monitor than the available disk storage
that would be part of infrastructure monitoring. This Oracle-specific monitor
is a type of container monitor.
n Vendor Messages: Most vendor software applications will report errors to an
exception log. As part of your container-monitoring strategy you should ensure
that you are monitoring and alerting on these errors. Vendor software may also
be capable of generating SNMP traps that can be caught and alerted for using
your management and monitoring solution.
n Container Resource Monitoring: Container resources are specific resources
configured in the container to support your application. These resources can
be heavily impacted by the runtime behavior of your application and require
the most sensitivity when applying monitors. Resource monitors are the most
interesting aspect of container monitoring, and we will spend the rest of this
section on them.
In this book we have intentionally tried to engineer examples that will appeal
to a broad audience and are not vendor- or technology-specific. However, container
monitors are by definition vendor-specific, so much of the discussion in the follow-
ing example will revolve around the BEA Application Server. Fortunately, BEA
exposes its monitoring interfaces using the Java management extensions (JMX)
standard. All of BEA’s competitors, including JBoss and WebSphere, have chosen
the same direction conceptually; thus, everything we discuss in this example will
apply to you if you are writing software applications that are deployed to J2EE-
based containers.
Resource monitors are really the meeting point between development and
infrastructure team responsibilities. Resource monitors need to be implemented in
a standardized way. This inclines us to involve the infrastructure team, which can
equip the infrastructure with generic monitoring capabilities for the containers in
the environment. On the other hand, the development team is specifying and siz-
ing the resources that are important to the application. Consequently, only a joint
effort between teams is effective in specifying and implementing a proper container
monitoring solution.
The BEA example we will look at next is ideal because it includes resources
that are widely used by applications. For our purposes, let’s revisit an example from
Chapter 4. The example shown in Figure 10.2 is a conceptual architecture for a
fulfillment service that includes a retry capability and a software valve. (You may
wish to revisit Chapter 4 to familiarize yourself with this example if you have not
done so already.)
As our first step, let’s recast this example using J2EE constructs (as shown in
Figure 10.3). We will then look at the example from the perspective of how we
would monitor container resources within the application server.
The conceptual architecture (Figure 10.2) shows incoming requests feeding a fulfillment scheduling service, with a software valve controlling whether requests flow on to the third-party fulfillment service. The J2EE recast (Figure 10.3) maps these elements onto container resources: request queuing is handled by a persistent JMS queue and a message-driven bean, fulfillment scheduling is a stateless EJB, and the fulfillment service itself is a plain old Java object (POJO) that reads and updates state over JDBC via a WebLogic Server (WLS) DataSource. The JMX infrastructure consists of a JMX agent hosting an MBean server, which exposes individual MBeans.
Each MBean exposes attributes that can be read for monitoring and set as configuration values for control purposes. For our immediate use, application server
vendors like BEA have implemented their own connector architecture that can be
used to expose JMX MBeans to monitoring software. This means we can use JMX
to interrogate and monitor J2EE components in the infrastructure. You will find
that many enterprise monitoring solutions have built-in capabilities for looking up
JMX-exposed objects and attributes on J2EE application servers.
Let’s now look at the JMX attributes that apply to our example. They are as
follows:
n Execute Queue: The application server processes work using a pool of threads known as the execute queue. The number of idle threads is a JMX-exposed attribute on the execute queue and can also be easily monitored in the container.
n Message-Driven Beans (MDBs) and Stateless Enterprise Java Beans
(EJBs): The EJBs in our design are also exposed as JMX MBeans. We can
use APIs provided by the application server to ensure that both of the EJBs in
our fulfillment service are deployed at all times.
n Data Sources and Connection Pools: Both the connection pool and the
data source abstraction that gates application access to the connection pool
can also be monitored through JMX APIs. In our case, we should monitor
that both of these resources are deployed and that the number of available
connections is always one or greater. This ensures that there is always a data-
base connection available for a thread if it needs it.
n JMS Queue: Lastly, the JMS queue is the asynchronous messaging interface
between the rest of the application and the fulfillment service. We expect
messages to be consumed from this queue immediately after they are depos-
ited from the requesting application. Consequently, we can verify that the
queue is deployed and that the message depth in the queue never exceeds a
threshold value that is reasonable based on the expected business usage.
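To make this concrete, the following sketch polls one such attribute over a remote JMX connection using the standard javax.management API. The service URL, MBean object name, attribute name, and threshold are placeholders; the actual names are vendor-specific, so consult your application server documentation for the MBeans that correspond to your execute queues, EJB deployments, connection pools, and JMS destinations.

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class JmxQueueDepthMonitor {
    public static void main(String[] args) throws Exception {
        // Placeholder JMX service URL for the application server's MBean server.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://appserver01:9999/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection connection = connector.getMBeanServerConnection();

            // Placeholder object and attribute names for the fulfillment JMS queue.
            ObjectName queue = new ObjectName(
                    "example:Type=JMSDestinationRuntime,Name=FulfillmentQueue");
            Number depth = (Number) connection.getAttribute(queue, "MessagesCurrentCount");

            // A depth that keeps growing suggests the message-driven bean is not consuming.
            if (depth.longValue() > 1000) {
                System.out.println("ALERT: fulfillment queue depth " + depth
                        + " exceeds threshold");
            } else {
                System.out.println("Fulfillment queue depth is " + depth);
            }
        } finally {
            connector.close();
        }
    }
}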
Application Monitoring
In our discussion thus far, application monitors can be described as monitors that
are custom built or rely on custom application outputs. Container and infrastruc-
ture monitors tend to be supported through configuration of vendor monitoring
solutions. Application monitors require a little more work, but we are rewarded for
our efforts with improved diagnostic information.
For most software systems, application alerts are derived from application
error logs. Software packages like HP OpenView include capabilities for monitor-
ing application log files and generating alerts based on text-based pattern match-
ing. This is a powerful and flexible mechanism but it requires careful thought and
consideration.
If you have adopted a universal logging severity level in your software imple-
mentation, you will be well-positioned to take advantage of log-file monitoring.
This will enable you to establish policies like for every log message with a FATAL
severity, generate an alert with FATAL severity. Further, you can include the full text
of the error message and an excerpt of the surrounding log file with the alert that
is generated.
If you are monitoring a log file for messages that do not include severity, or—
worse—your log file includes severities that are not dependable, then you will need
to inventory error messages that can be generated from your application, assess
them for criticality, and then configure your log file monitoring solution to watch
for these messages. This is error-prone, unreliable, and time-consuming. You are
highly recommended to adopt and enforce a universal log severity strategy to avoid
this manual effort if at all possible.
If your development team has been strict in applying correct log levels to applica-
tion logs, then you are well positioned to alert on FATAL events. You will also want to consider
ERROR events. A common pattern is to assign ERROR severity to errors that the
application believes are specific to a transaction or request. FATAL errors are reserved
for exceptions that indicate a common component or service is down or unavailable.
Failures in the FATAL category indicate that service is totally disrupted.
What if your application begins to generate hundreds or thousands of logged
events with ERROR severity? Does it make sense to ignore them if there is no
corresponding FATAL event? Probably not. To address this scenario, you are rec-
ommended to implement frequency-based monitoring for log events with ERROR-
based severity. You would implement such a rule with the logic, If there are more
than 100 errors in a sliding 10 minute window, raise a single FATAL alert. You may
wish to implement this additional logic for specific types of error messages or as a
general rule. This decision will depend on your application.
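A minimal sketch of such a rule follows, assuming that log events have already been parsed into a severity and a message; the raiseAlert method is a stand-in for whatever interface your alerting platform provides, and the threshold and window are the illustrative values from the rule above.

import java.util.ArrayDeque;
import java.util.Deque;

// FATAL events alert immediately; ERROR events raise a single FATAL alert only when
// more than 100 of them fall inside a sliding 10-minute window.
public class LogAlertRule {

    private static final long WINDOW_MILLIS = 10L * 60L * 1000L;
    private static final int ERROR_THRESHOLD = 100;

    private final Deque<Long> recentErrorTimestamps = new ArrayDeque<Long>();
    private boolean windowAlertRaised = false;

    public synchronized void onLogEvent(String severity, String message, long timestampMillis) {
        if ("FATAL".equals(severity)) {
            raiseAlert("FATAL", message);
            return;
        }
        if (!"ERROR".equals(severity)) {
            return;
        }
        recentErrorTimestamps.addLast(timestampMillis);
        // Discard errors that have aged out of the sliding window.
        while (!recentErrorTimestamps.isEmpty()
                && timestampMillis - recentErrorTimestamps.peekFirst() > WINDOW_MILLIS) {
            recentErrorTimestamps.removeFirst();
        }
        if (recentErrorTimestamps.size() > ERROR_THRESHOLD) {
            if (!windowAlertRaised) {
                raiseAlert("FATAL", "More than " + ERROR_THRESHOLD
                        + " ERROR events in the last 10 minutes; most recent: " + message);
                windowAlertRaised = true;
            }
        } else {
            windowAlertRaised = false;
        }
    }

    private void raiseAlert(String severity, String description) {
        // Placeholder: integrate with your monitoring or alerting platform here.
        System.out.println(severity + " ALERT: " + description);
    }
}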
In addition to log-file-based application monitoring, there are many other types
of monitors that we encourage you to consider. Most of these monitors are easy to
script and/or are supported by vendor monitoring platforms.
End-User Monitoring
If you are resource-constrained and have very little capacity to implement a monitoring strategy, you should implement end-user monitors. End-user monitors generally do not require deep technical expertise, and they can usually be implemented without significant investment in tooling.
Historical Reporting
As we will see in Chapter 11, when a problem arises it is extremely valuable to have a
historical view of your system. Information that has been recorded by your monitor-
ing solution will give technical resources information on system state in the period
leading up to the failure. If a resource is exhausted, data logged by your monitors
will show the depletion of the resource over time. Further, logged metrics can often
answer the important question, When did this problem first start to happen? Finally,
historical information for your system is also critical from a tuning, sizing, and
capacity-planning standpoint. We will come back to this topic later in this chapter.
Performance Trending
As systems mature, performance can gradually change over time. From an opera-
tions perspective, it is in your best interests to monitor your system and detect this
before it is reported to you by your users. Responding to an issue in its infancy will
give you time to plan a suitable resolution. If you wait until the problem is percep-
tible to your users, you will find yourself referencing the materials in Chapter 11
on crisis management.
Fortunately, performance trending is readily supported by a host of vendor
software solutions. Earlier in this chapter we discussed artificial transactions as a
technique for end-user monitoring. In order to perform trending against artificial
transactions, you need to ensure that the software package you are using is capable
of storing historical data over a sustained period (i.e., years). Trends are easily iden-
tified graphically. If your software package supports long-term reporting, it is also
likely to support graphing. If you graph response against time, your graph is likely
to resemble one of the examples shown in Figure 10.5.
In example A of Figure 10.5, we see steady-state response over time. Example
A is what you should expect from a properly designed and tested system, assuming
constant business usage. In example B we see a more worrisome scenario. Response
appears to be gradually increasing over time. From looking at example B, we have
no way of knowing if response time will ultimately plateau to a value within the
stated requirements for the system. Example C is also a problematic scenario; in it
we see a dramatic spike in response over time. The increase in response time seems
to have been a one-time event, but performance has not returned to its original
level. Let’s now look at a list of frequent explanations for the patterns we see in
these examples.
n Data Growth. As the volume of data managed by the system increases, the proportion of that data that can be accommodated in memory will steadily decrease. Degraded performance is not only associated with database behavior. In the shorter term, your application may traverse data structures or caches that have steadily increased in size, causing a gradual impact to system performance.
n System Changes. In example (c) of Figure 10.5, we see a sudden, drastic
impact to performance. Depending on the sensitivity of your end users, this
may or may not be reported. Following any system change, you should con-
trast the new system performance against response time prior to the change.
If you have worsened performance as a result of an application or infrastruc-
ture change, you will need to determine the explanation, evaluate the impact,
and then take action if necessary.
Error Reporting
Large, complex, and especially new enterprise software systems will create errors.
If your system processes thousands or millions of records every day, your applica-
tion may log hundreds or thousands of errors. It is not tractable for an individual
or support team to manually assess application log files and evaluate which errors
merit investigation. As a further complication, it is not uncommon for systems
to have many known errors that can be safely ignored by the operations and sup-
port teams. The development team may have gone so far as to request that alerts
be suppressed for specific error events. There can be no disputing that fixing real
errors and erroneously logged error messages alike is an important priority for the
development team. However, cluttered log files are a reality for many systems and
it may take some time before a new system has settled and is no longer generating
large volumes of error-log output.
An effective strategy for managing error output is to process it into a summary
format as part of an automated daily routine. This summary report categorizes
errors using text-match patterns for severity and uniqueness. This is easier to illus-
trate with an example than it is to describe; see Figure 10.6.
In this example, errors have been force-ranked by severity and frequency. There
are some significant advantages to having information in this format. First of all,
the support team can more readily identify new errors in the log output. If an exist-
ing error suddenly shows a higher incidence rate or loses its position to a new error,
the support organization can home in on this particular error. As we will see in the next chapter, application log output is an important input to your troubleshooting efforts.
When you are investigating a production issue, you will find it useful to consult
this report to look for changes over time. This report is also a useful mechanism
for setting priorities for the development team. Obviously FATAL events and high-
incident ERROR log events should be addressed by the development team. As your
application matures, your objective should be to drive your application logs to zero length. This strategy can still be applied to systems for which there are no standardized log severities. As you would imagine, all of the log events will be force-ranked in a single category.

FATAL
Count Description
------------------------------------------------------------------
2     Tue Jul 10, 2006 09:08:22 -- FATAL --
      QuoteHelper.java:67 “InvalidQuoteRequest exception being thrown
      for request:
1     Tue Jul 10, 2006 13:14:01 -- FATAL --
      QuoteHelper.java:67 “InvalidQuoteRequest exception being thrown
      for request:

ERROR
Count Description
------------------------------------------------------------------
97    Tue Jul 10, 2006 10:12:06 -- ERROR --
      QuoteHelper.java:67 “InvalidQuoteRequest exception being thrown
The script that generated this report used a raw application log file as input
and outputted this result. You may want to configure such a script to e-mail the
error report to the support team on a daily basis. As an alternative, you could
also have the script deposit the report to the document root of a Web server.
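A minimal sketch of such a summarization script is shown below in Java, assuming the log format used in the sample report above, where a timestamp line carries a -- FATAL -- or -- ERROR -- marker and the message text follows on the next line; real log formats will need their own parsing, and the class name is illustrative.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ErrorLogSummarizer {

    public static void main(String[] args) throws IOException {
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        Map<String, Integer> fatalCounts = new LinkedHashMap<String, Integer>();
        Map<String, Integer> errorCounts = new LinkedHashMap<String, Integer>();

        for (int i = 0; i < lines.size(); i++) {
            String line = lines.get(i);
            String severity = line.contains("-- FATAL --") ? "FATAL"
                    : line.contains("-- ERROR --") ? "ERROR" : null;
            if (severity == null) {
                continue;
            }
            // Group events by the message text that follows the timestamp/severity line.
            String message = (i + 1 < lines.size()) ? lines.get(i + 1).trim() : line.trim();
            Map<String, Integer> counts = "FATAL".equals(severity) ? fatalCounts : errorCounts;
            Integer current = counts.get(message);
            counts.put(message, current == null ? 1 : current + 1);
        }

        print("FATAL", fatalCounts);
        print("ERROR", errorCounts);
    }

    // Force-ranks messages by frequency within each severity.
    private static void print(String severity, Map<String, Integer> counts) {
        System.out.println(severity);
        System.out.println("Count Description");
        counts.entrySet().stream()
                .sorted((a, b) -> b.getValue() - a.getValue())
                .forEach(e -> System.out.println(e.getValue() + " " + e.getKey()));
        System.out.println();
    }
}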
Reconciliation
Reconciliation is another category of reporting that is an important check for
many types of applications. Reconciliations balance business inputs with applica-
tion outputs to ensure that no records have been omitted from processing. Some
reconciliation reports are simple counts that match the number of inputs with the
number of outputs. You may also have reconciliation reports for your application
that apply business logic to calculate expected values for a given set of inputs.
More sophisticated reconciliation and balancing is common in the financial ser-
vices industry. In Chapter 4 we outlined the circumstances under which a recon-
ciliation process is recommended. We mention reconciliation again in this chapter
because it is an important safety net for detecting failed processing. You should
consider automating your reconciliation if possible and generating alerts when the
reconciliation fails.
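A count-based reconciliation of the simplest kind reduces to the comparison sketched below; how the input and output counts are obtained (from source files, database queries, or business totals) is specific to your application, and the alerting call is a placeholder.

// Compares the number of business inputs with the number of application outputs
// for a processing window and raises an alert on any mismatch.
public class ReconciliationCheck {

    public static boolean reconcile(long inputCount, long outputCount) {
        if (inputCount == outputCount) {
            System.out.println("Reconciliation OK: " + inputCount + " records processed");
            return true;
        }
        // Placeholder alert; wire this into your monitoring platform.
        System.out.println("FATAL ALERT: reconciliation failed, " + inputCount
                + " inputs versus " + outputCount + " outputs ("
                + Math.abs(inputCount - outputCount) + " records unaccounted for)");
        return false;
    }
}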
Capacity Planning
There are entire books on the topic of capacity planning, and many of these books
are academic and theoretical in their approach. Capacity planning is an essential
part of operating an enterprise software system and we would be remiss if we did
not address it in this book. Capacity planning is the exercise of determining the
required hardware for a given software system. This includes CPU, memory, net-
work, and storage requirements. In this section we will outline some practical strat-
egies and advice on the topic of capacity planning. Our goal is to equip you with
the knowledge to develop an accurate capacity model based on the right inputs for
your application. First, we will present a set of best practices and then we will illus-
trate the use of these recommendations using a detailed example.
Planning Inputs
A capacity model states the infrastructure requirements for your system over time.
In the end, your capacity plan may simply draw the conclusion that you have ade-
quate capacity on the existing hardware for the foreseeable future. Alternately, your
capacity model may point to an impending problem unless additional hardware is
provisioned in the very near future. In either case, your capacity plan will be based
on an analysis performed using a capacity model. A capacity model is a theoretical
view of your system that considers load requirements over time mapped against
required physical resources in the infrastructure. Once you have established an accu-
rate capacity model, you will be equipped to define your capacity plan, which will
specify exactly what hardware purchases and upgrades are required to maintain your
system over time. As we will see, most of your efforts will be expended in building a
capacity model in which you are confident.
Capacity planning and modeling starts before a software system is commis-
sioned and is ongoing over the lifetime of the system. This activity is at its most
challenging before your system is in production. For many endeavors you are
required to order hardware before your application has even been built in order to
meet timelines. At this point, capacity planning is best described as a mixture of art
and science. Hardware sizing is based on a combination of inputs, including:
1. Vendor recommendations: If you are able to quantify your business usage,
many software vendors are willing to provide recommendations for your
hardware configuration. These recommendations can be a good source of
input, as the software vendor may have broad experience working with cli-
ents who have similar needs to your own. On the other hand, vendors often
license their product based on hardware configuration, so they may have a
stake in sizing your system towards the high end. And of course, your hard-
ware vendors may have input to your sizing efforts but their objectivity is even
more questionable for obvious reasons.
2. Equivalence estimates to existing, similar systems: If the system you are building is similar to existing systems in terms of technology and business usage, you may be able to draw comparisons to them. If you do not have existing systems suitable for basing estimates on, you may want to involve consultants who can make this expertise available to you.
3. Growth and business criticality of the system: You may also wish to apply
some subjective factors to your decision making. If your software system is
considered strategic and business-critical, then additional cost may be a good
trade-off to ensure a wider margin for error. Also, if your system is expected to
experience rapid growth that has not been well quantified by the business you
may want to provide more capacity rather than less.
4. Technology platform: The technology platform that has been selected for
your architecture will influence your hardware options. Some software is
not supported or runs better on specific hardware platforms. This could eas-
ily mean the difference between a large farm of Windows-based x86 servers
and a single multiprocessor reduced instruction set computer (RISC)-based
server.
In your sizing decisions, you will likely have the option to purchase hardware
that provides varying levels of flexibility for future expansion. It may be more cost-
effective to purchase a fully loaded entry-level server, but you will soon strand this
investment if you grow outside of its capacity. You may elect for a more expensive
midrange server that is not fully configured with CPUs and memory to provide for
future growth. Hardware costing is complex, and vendor strategies are designed
to try to maximize your expenditure. Beyond this brief introduction, there is little
advice we can offer you to make this decision any easier.
A preferable circumstance is one in which you have time to test your
system under load before making procurement decisions for your infrastructure.
In your non-functional test cycles, you will have prepared load scenarios that are
representative of the forecasted business usage. In Chapter 7 we described strategies
for execution of sustainability testing. We proposed that you run your mixed load
of performance scenarios at peak transaction rate to perform the required number
of business operations in the shortest amount of time. If you record system metrics
during execution of a similar test in which you run your application mixed load at
peak, these metrics are good predictors of the required capacity you will need in the
production environment. It is important to note that these measurements are based
on a model that approximates the actual business usage. This is clearly superior to
no measurement at all, but it can only project the required capacity within a certain
degree of accuracy.
For most systems, you determine required capacity based on peak usage plus
a contingency factor based on risk tolerance. Software systems do not run well on
hardware that is almost fully utilized. You will usually see performance begin to
degrade significantly once CPU utilization passes the 75% point on most systems.
The contingency factor is there to ensure that your application runs below this
threshold. The contingency factor is also there to compensate for any inaccuracy in
your measurement or estimate of the system load.
A standard contingency for enterprise systems is 40% over and above the mea-
sured system load. In other words, you require 140% of the measured hardware
capacity required by your mixed load at peak. A contingency factor can also be
designed to account for increased load due to a failover scenario. If your application
is load balanced across two application servers, you need to consider the effect of
failover. If a single server fails, and the second application server is overwhelmed by
the additional load, then you may as well have not bothered to implement failover
for your application. For your application, you may decide on a higher or lower
contingency factor; this is not written in stone, but 40% is a good value for most
systems.
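To make this arithmetic concrete, the short sketch below (our own illustration, not taken from any particular system) applies a 40% contingency to a measured peak and then checks whether a load-balanced pool can still run below a 75% degradation threshold after losing one node. The class name, the two-node pool, and the 45% measured peak are all assumptions chosen for the example.

/**
 * Illustrative capacity head-room check. Utilization figures are expressed as a
 * fraction of the capacity of the full production pool (0.45 = 45%).
 */
public class CapacityHeadroom {

    /** Measured peak plus a contingency factor (e.g., 0.40 for 40%). */
    static double requiredCapacity(double measuredPeak, double contingency) {
        return measuredPeak * (1.0 + contingency);
    }

    /**
     * True if the surviving nodes stay below the degradation threshold
     * (e.g., 0.75) after one node fails and its load rebalances evenly.
     */
    static boolean survivesNodeFailure(double measuredPeak, int nodes, double threshold) {
        double perNodeAfterFailure = (measuredPeak * nodes) / (nodes - 1);
        return perNodeAfterFailure < threshold;
    }

    public static void main(String[] args) {
        double peak = 0.45;  // assumed measured peak: 45% of total pool capacity
        System.out.printf("Plan for %.0f%% of current capacity%n",
                requiredCapacity(peak, 0.40) * 100);
        System.out.println("Survives single-node failure: "
                + survivesNodeFailure(peak, 2, 0.75));
    }
}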
The accuracy of your capacity plan will depend on the quality of the capacity
model you are able to devise based on available inputs. You may have no choice but
to estimate your system load if your application is yet to be built. If your application
can be tested, you can measure system load based on the business usage model that
you constructed from the non-functional requirements. However, neither of these
inputs is superior to actually measuring system capacity for a production system.
As a result, once your application is rolled out to production users, you should
measure the actual capacity and feed this back into your capacity model to ensure
it is accurate.
Once you have real production measurements, there is no need to rely on hypo-
thetical or experimental values in your capacity model. If you are completing a
capacity plan to determine hardware requirements based on increased business
usage, then you may be able to use exclusively production measurements in your
model. Later in this chapter we will look at an example that employs this strategy
as part of building a complete model.
Best Practice
Let’s review the key points from the preceding discussion. The following list enu-
merates our view on best practice in the area of capacity planning.
1. The accuracy of your model will depend on the accuracy of your inputs. Use
production measurements, test measurements, and estimates based on equiv-
alent systems in that order. Some capacity models will require a combination
of these three inputs.
2. Your planned capacity should be based on a mixed, representative business
load running at peak volume.
3. You should decide on a contingency factor and apply this on top of the esti-
mated or measured needs of the application; 40% of the required system
load is a standard contingency. Do not add contingency to each input as you
build the model. You should use the most accurate, yet conservative, estimate
available and apply contingency once when the capacity model is nearing
completion.
4. When possible, use measurements based on the full application load. If logistics do not permit this, you should assume that hardware capacity is a linear commodity, i.e., you can stack application loads and the required capacity will sum together (a brief sketch follows this list).
5. Express all present and future capacity as a percentage of your existing hard-
ware platform.
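As a minimal sketch of points 3 and 4, with made-up input values, the fragment below stacks component loads, each expressed as a percentage of the existing hardware platform, and applies the 40% contingency once to the total rather than to each input.

/** Illustrative capacity model: stack loads linearly, add contingency once. */
public class CapacityModel {

    /** Sums per-scenario loads, each a percentage of the existing platform. */
    static double stackedLoadPercent(double[] loadPercents) {
        double total = 0.0;
        for (double load : loadPercents) total += load;
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical inputs: current production load plus two new feature loads.
        double[] inputs = {52.0, 6.5, 3.5};
        double planned = stackedLoadPercent(inputs) * 1.40;  // contingency applied once
        System.out.printf("Planned capacity: %.1f%% of the current platform%n", planned);
    }
}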
The remainder of this chapter will illustrate the application of these best prac-
tices in a detailed example.
and an external person. For introductions to nonusers, the introduction is made via
e-mail, with the goal being to draw additional users onto the network. Normally,
customers contact other customers directly, so the notion of third-party introduc-
tions is a significant enhancement. We will refer to this feature as introductions in
the rest of this example. The second new feature allows customers to post recom-
mendations for date venues to a bulletin board. The bulletin board is accessed by
online users who are looking for ideas on where to go for dates. In a future release,
the business would like people to actually initiate dates based on mutual interest in
different date venues. This feature is referred to as date venues.
As a simplification, we will focus our example on the application server tier for
the online dating service. For a software system like this we would normally need
to consider multiple software tiers, including Web, application, and database. The
logic we will follow is identical for each component in the infrastructure; we do not
want to clutter our example, so we will focus on a single tier.
We start by looking at the current and forecasted business usage for the online
dating service. The current service has over 250,000 customers in North America.
This number is expected to grow at the established rate of 15% per year for at least
the next five years. The total UK market is estimated at 150,000. In the first year,
marketing anticipates 10,000 users. The number of users is expected to double
in each of the first three years and then grow at a rate of 15%. These statistics are
shown in Table 10.5.
It is reasonable to assume that the number of business operations will vary
directly with the number of users on the system. The best quality information we
have for this system is the actual business usage and production utilization for
the current system. A graph for CPU utilization of the current system is shown in
Figure 10.7.
[Figure 10.7 Online dating example: CPU utilization of the current system (evening peak of approximately 52%).]
[Figure 10.8 Online dating example: capacity model with additional U.S. usage (evening peak of approximately 59.8%).]
The UK spans fewer time zones than North America, so its spikes will probably be somewhat more compressed. One year from now, the UK market as a percentage of the North American market will be 25,000 / 287,500, or roughly 8.7%.
To apply the additional UK usage, we can inflate our model by 9%. We must
also time shift this usage to account for the time zone difference between North
America and the UK. Time shifting in this case works in our favor as it distributes
our peaks and smooths our CPU usage, as shown in Figure 10.9.
Our next step is to sum the time-shifted UK contribution to the required
CPU with the forecasted North American CPU requirements. When we do
this, our graph of the outcome is a smoother utilization as predicted (see
Figure 10.10).
The North American and UK evening peaks do not overlap. The UK non-peak
usage does contribute to the original North American peak, shifting it up slightly
to 61.6%. Management will be happy to learn that in the first year of operation, the
UK expansion makes more efficient use of the existing hardware capacity without
introducing the need for additional expenditure. Remember, we still haven’t looked
at the impact of the additional features that are being introduced, nor have we con-
sidered the five-year outlook for this system.
[Figure 10.9 Online dating example: capacity model for additional UK usage only.]
[Figure 10.10 Online dating example: capacity model for combined UK and U.S. usage (North American peak of 61.6%, with a separate, smaller UK peak).]
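The arithmetic behind the combined model can be sketched in a few lines of code. The hourly profile values below are placeholders rather than the measured data behind Figures 10.7 through 10.10, and the five-hour shift is an assumed offset between the UK and the North American (Eastern) evening; the approach of inflating, scaling, time-shifting, and summing is the same.

import java.util.Arrays;

/** Illustrative combination of forecasted NA usage with time-shifted UK usage. */
public class CombinedProfile {

    /** Multiplies each hourly utilization value by the given factor. */
    static double[] scale(double[] hourly, double factor) {
        double[] out = new double[hourly.length];
        for (int i = 0; i < hourly.length; i++) out[i] = hourly[i] * factor;
        return out;
    }

    /** Shifts the 24-hour profile by the given number of hours, wrapping around. */
    static double[] timeShift(double[] hourly, int hours) {
        double[] out = new double[hourly.length];
        for (int i = 0; i < hourly.length; i++) {
            out[Math.floorMod(i + hours, hourly.length)] = hourly[i];
        }
        return out;
    }

    public static void main(String[] args) {
        double[] current = new double[24];
        Arrays.fill(current, 0.20);  // placeholder off-peak utilization
        current[20] = 0.52;          // placeholder 52% evening peak

        double[] northAmerica = scale(current, 1.15);            // one-year NA growth
        double[] uk = timeShift(scale(northAmerica, 0.09), -5);  // ~9% of NA load, shifted earlier

        double peak = 0.0;
        for (int h = 0; h < 24; h++) peak = Math.max(peak, northAmerica[h] + uk[h]);
        System.out.printf("Combined peak utilization: %.1f%%%n", peak * 100);
    }
}

With the real measured profile in place of the placeholders, this is the calculation that yields the 61.6% combined peak described above.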
Let’s complete our one-year outlook before we generate a five-year forecast. All
of the inputs to the model have been based on the one-year forecasted business vol-
umes, so all that is remaining is to add the business usage for the two new features
that are being bundled with the release. We start by looking at the business usage in
Table 10.6 as it is defined for the North American user community. (A similar table
would exist to describe the UK usage, but it is omitted for brevity.)
Table 10.6 is taken from the updated non-functional requirements for the
online dating service. We see that the busiest period for site activity is between 7:30
pm and 9:30 pm, which aligns with our production CPU measurements. The busi-
ness feels that for every two individual contacts that are made on the site, approxi-
mately one introduction is likely.
Posting date venues is forecasted to be far less common than browsing date venues. In fact, the marketing team plans to supplement postings if take-up is not high among users. Finally, for every ten personal profiles that are viewed, marketing feels that there will be at least one date venue that is also browsed. The business has posted the expected one- and five-year business volumes based on the projected increase in usage.
Business volumes for this online dating service are not seasonal, nor is there much
weekly variation in online usage. (This may or may not be true. Feel free to write
to us should you have information that contradicts this assumption.) As a result,
we need to calculate what the incremental peak load is expected to be for the sys-
tem and then add it to our current model in order to determine the new hardware
capacity.
Busiest interval: 7:30 pm to 9:30 pm, weekday evenings (40% of an average day's business volume).
Fortunately for us, the non-functional testing team had previously identi-
fied that the North American and UK peak loads do not intersect one another.
Accordingly, they have devised two mixed load scenarios, one for each of the North
American and UK peaks. As you may have guessed, they are actually the same
mixed-load scenarios run at different transaction rates. For the two new features,
the non-functional test team added additional load scenarios to the original mixed
load for the dating service.
The transaction rates for the three coarse inputs defined by the business were
calculated as follows. For these volumes, rates are calculated in transactions per
minute (TPM). We assume 40% of business operations will be completed in the
two-hour window indicated. Remember, the calculations below describe North
American usage only; an equivalent set of calculations would be required for the
UK usage to derive the UK transaction rates.
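A daily business volume converts to a peak transaction rate by concentrating 40% of the day's operations into the 120-minute window. The sketch below shows the conversion; the daily volumes are invented placeholders, not the figures from Table 10.6.

/** Converts a forecasted daily business volume into a peak transaction rate. */
public class PeakRate {

    /**
     * Peak transactions per minute, assuming the given fraction of the daily
     * volume occurs within a peak window of the given length in minutes.
     */
    static double peakTpm(double dailyVolume, double peakFraction, double windowMinutes) {
        return (dailyVolume * peakFraction) / windowMinutes;
    }

    public static void main(String[] args) {
        // Placeholder daily volumes for the three business inputs.
        double introductions = 5_000, postVenue = 500, browseVenue = 20_000;
        System.out.printf("Introductions: %.1f TPM%n", peakTpm(introductions, 0.40, 120));
        System.out.printf("Post venue:    %.1f TPM%n", peakTpm(postVenue, 0.40, 120));
        System.out.printf("Browse venue:  %.1f TPM%n", peakTpm(browseVenue, 0.40, 120));
    }
}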
In order to complete our model, we ask the non-functional test team to exe-
cute two sets of performance trials in which they run load for the incremental
scenarios that exercise the new business features in which we are interested. The
performance team fulfills our request and provides two numbers that reflect peak
CPU usage under both the North American and UK transaction rates (as shown
in Table 10.7).
Before we can incorporate these inputs into our model, we need to convert them
into our standard units of measurement. Our capacity model is currently expressed
in terms of percentage of current capacity. The non-functional test environment is
actually half the size of the production infrastructure. The application server has
half the number of CPUs as its production counterpart. As a result, the numbers
we need to use in our analysis are actually 6.5% and 3.5%. If we add these numbers
to the two observed peaks in our capacity plan, we arrive at the conclusions shown
in Table 10.8 for our one-year capacity model.
[Table 10.7 Peak CPU Usage in North America and the United Kingdom (columns: Scenario, CPU Measurement).]
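The normalization itself is a simple ratio of CPU counts, but it is easy to apply backwards. The sketch below assumes, as our own inference rather than a figure from Table 10.7, that the raw test measurements were 13% and 7% on a test server with half the production CPU count; the CPU counts of 4 and 8 are likewise hypothetical.

/** Normalizes a CPU measurement taken on a smaller test host to production terms. */
public class EnvironmentNormalizer {

    /**
     * Converts a utilization percentage observed in the test environment to the
     * equivalent percentage on production, assuming even scaling across CPUs.
     */
    static double toProductionPercent(double testPercent, int testCpus, int prodCpus) {
        return testPercent * testCpus / prodCpus;
    }

    public static void main(String[] args) {
        // Hypothetical raw measurements on a test server with half the production CPUs.
        double naScenario = toProductionPercent(13.0, 4, 8);  // -> 6.5%
        double ukScenario = toProductionPercent(7.0, 4, 8);   // -> 3.5%
        System.out.printf("NA incremental load: %.1f%% of production capacity%n", naScenario);
        System.out.printf("UK incremental load: %.1f%% of production capacity%n", ukScenario);
    }
}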
At the end of the next year, the online dating service will be straining the limits
of its current infrastructure based on the North American peak business usage. The
advantage in building this model incrementally is that you can summarize results
for management so that they can see the impact of different factors on hardware
requirements. By creating an incremental view of your capacity model, you provide
the visibility management needs to make more efficient business decisions. Based
on the work we have done, it is clear that expansion to the United Kingdom does
not impose an urgent need for increased capacity.
You should also complete this same exercise based on the five-year business vol-
ume. The one-year view of the capacity plan indicates that the business will need to
make an additional investment in infrastructure very shortly. The five-year view of
the capacity plan will give them a perspective on how big an infrastructure invest-
ment is required in the long term. In this example, we have seen that the impact
of new features is the single largest contributor to the need for more capacity. In a
real situation, we would emphasize this in reporting results to management. The
five-year plan will not show the need for increased processing power based on fea-
tures that will no doubt be introduced as the platform matures. Conversely, as the
development team refines the application, you may see a significant drop in the
capacity requirements for the system. This can only be evaluated on an ongoing
basis using an approach that strongly positions current production measurements
in the revised capacity plan.
over time to meet the needs projected by your model. The capacity plan specifically addresses which hardware expenditures will be needed and when, and will make statements along the lines of: "The current Sun Enterprise 10K (40 x 400MHz UltraSPARC II) servers will need to be upgraded to Sun Fire 6900 (16 x 1.8GHz, 48GB UltraSPARC IV+) servers no later than July of next year in order to sustain performance for projected volumes."
The capacity plan typically considers when the best time would be for your organization to accommodate an upgrade. For example, the plan
may recommend that the total upgrade be accomplished in stages over the course
of 16 months. The capacity plan will also need to factor in the requirements of
non-production environments (e.g., if the production environment is expanding,
a cost-benefit analysis will be required for whether to expand the pre-production
non-functional testing environment as well).
Since cost is a factor for every organization, it is the responsibility of the capac-
ity plan owner to make judicious decisions about which hardware vendor and plat-
form are appropriate for the organization. In some cases, the capacity plan may
recommend retiring equipment from one production platform to be re-introduced
as part of the infrastructure for another smaller, existing production system.
Summary
This chapter has covered a broad range of topics that come into play for systems
that are already deployed to production environments. We have looked at appli-
cation monitoring from an operations perspective spanning application, infra-
structure, container, and end-user categories of monitoring. This chapter has also
emphasized the importance of trending and reporting as a means of understanding
system health and predicting future system failures. The second half of the chapter
explored the complex topic of capacity planning based on measurement of the pro-
duction system combined with test results from non-functional activities. We illus-
trated how to define a capacity model based on multiple inputs and how to draw
conclusions and make recommendations based on that model. In Chapter 11 our
focus shifts to the topic of troubleshooting and crisis management. Despite your
best efforts to design, test, and operate highly available systems, you must still be
prepared to respond quickly and effectively to production incidents if they occur.
Troubleshooting and
Crisis Management
It is with great sincerity that we hope you never have need for any of the material in
this chapter. Yet, despite your best efforts to design and test for robustness in your
applications, you may be required to manage and resolve unforeseen issues for your
production applications. You may also have the misfortune to inherit responsibility
for applications that have not been designed and tested using the expertise in this
book. In either case, this chapter enumerates a list of troubleshooting strategies and
outlines crisis management techniques developed by way of hard-earned experi-
ence. Much of what we describe in this chapter is common sense, but in a crisis,
discipline and calm are required to work through the situation in a structured
fashion. This chapter is a good reference for situations in which things are quickly
going from bad to worse.
mindset of the business may be severe, i.e. If these guys knew what they were doing in
the first place, we wouldn’t be in this mess. And now they want me to authorize more
tinkering in the environment?
In earlier chapters on project initiation and test planning, we recommended
that you plan for a logical environment in your non-functional test environment
that is at all times synchronized with the version of your application that is in pro-
duction. If you have followed this advice, you are well positioned for your efforts in
reproducing a production issue. If you do not have such an environment, you will
need to acquire or designate a suitable environment. This may mean repurposing
an existing environment and deploying the production version of your application
in there.
Reproducing the problem usually precedes your determination of root cause. In
fact, once you have reproduced a problem, it doesn't tend to take long for capable technical resources to home in on the underlying issue. In the next section, we will
look at the difficult task of troubleshooting a problem, including scenarios that are
not readily reproduced.
Troubleshooting Strategies
For tough problems, it can be difficult to know where to start. The strategies we
discuss in this section are a reference for these types of situations. This material
will also be helpful to you if you find that you are completely blocked because you
believe you have exhausted all avenues of investigation.
begin to answer this question. In many cases, this exercise will yield nothing or a list of changes that have no discernible linkage to the symptoms of your failure. Consequently, you must do two things: look harder, and cast a wider net. You must capture all changes in the system, including those that are not documented in your change control process. The full list of changes that can be responsible for a sudden system failure is:
1. Documented Changes: These include application upgrades, bug fixes, vendor
patches, scheduled maintenance tasks, and hardware migrations. If a problem
arises following a significant change in the environment, it is obvious that this
should be the focus of your initial investigation, as discussed above.
2. Undocumented Changes: In some organizations, changes are made in the
production environment without the benefit of an audit trail. A well-inten-
tioned technical resource may perform a seemingly harmless maintenance
activity unrelated to your application. You should challenge the operations
team to be forthcoming about any of these undocumented activities no mat-
ter how benign or irrelevant they may seem at the time. Remember that the
operations team may be reluctant to volunteer this information if it exposes
a flaw in their operations, so this may take some coaxing. In many environ-
ments, the application support team may have access to the production envi-
ronment, so ensure that you clearly understand the full population of users
with access to the production environment.
3. Scheduled Jobs: You should look at the execution schedule for batch jobs. If jobs run on an infrequent basis, there may have been a change since the last job was executed that triggered the failure. Many systems have multiple subsystems or scheduling components that can launch jobs, so make sure you inventory everything, including operating system (OS) schedulers (e.g., CRON, Windows scheduling), custom application scheduling (e.g., in J2EE, some Java Message Service (JMS) providers support deferred message delivery, and the EJB 2.1 specification includes a Timer Service for timed operations; a brief sketch follows this list), and third-party scheduling software.
4. Usage: You should question the user community to understand how the
application usage may have changed. About eight years ago, I was part of a
technical team supporting the rollout of a call center application. Without
any notice to the operations or support teams, the call center manager decided
to triple the usage of the online system when he moved training activities into
production one Monday morning, unexpectedly. This decision triggered a
serious degradation of service that required immediate intervention. Unfor-
tunately, the technical team wasn’t expecting a surge in business usage and
wasted considerable time looking at alternative explanations and scenarios.
5. External Systems: If your application interfaces with external systems, you
should verify that those systems are available and ask for an inventory of
recently applied changes.
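As a reminder of how scheduled work can hide inside the application itself, the fragment below sketches the EJB 2.1 Timer Service pattern referred to in item 3; a bean registered this way fires deferred work with no entry in cron or any external scheduler. The bean, method, and payload names are our own invention.

import java.io.Serializable;
import javax.ejb.SessionBean;
import javax.ejb.SessionContext;
import javax.ejb.TimedObject;
import javax.ejb.Timer;

/**
 * Hypothetical EJB 2.1 stateless session bean that schedules deferred work
 * through the container's Timer Service, an in-application scheduler that
 * will not appear in any external job schedule.
 */
public class DeferredTaskBean implements SessionBean, TimedObject {

    private SessionContext context;

    /** Business method, exposed through the bean's component interface. */
    public void scheduleCleanup(long delayMillis, Serializable payload) {
        // The container persists the timer and calls ejbTimeout after the delay.
        context.getTimerService().createTimer(delayMillis, payload);
    }

    /** Invoked by the container when the timer expires. */
    public void ejbTimeout(Timer timer) {
        Serializable payload = timer.getInfo();
        // ... perform the deferred maintenance work here ...
    }

    // Standard EJB 2.x lifecycle plumbing.
    public void setSessionContext(SessionContext ctx) { this.context = ctx; }
    public void ejbCreate() {}
    public void ejbRemove() {}
    public void ejbActivate() {}
    public void ejbPassivate() {}
}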
- Verify the Application Configuration: You may find that a process dump was not created because of an OS or application configuration setting. In one experience, we failed to capture a core dump from an Oracle database failure because the system was not configured to deposit core files onto a file system with sufficient capacity. On another occasion, a multiple virtual storage (MVS) system we were working with was not configured with the correct job control language (JCL) settings to create a detailed dump on system ABEND (abnormal ending or termination) codes.
- Enable Additional Logging and Trace Information: If you were attentive to the recommendations in Chapter 4, your application should include extensive debug and informational logging. You can reproduce the issue with these settings enabled to glean more insight into the problem (a brief logging sketch follows this list).
- Run Incremental Load: Once the problem is reproduced, you should continue to try to reproduce the problem with an objective of using the minimum possible business load. If you reproduce the problem with seven load scenarios, you may find that it only takes two of the seven scenarios to actually cause the failure. This allows you to focus on what distinguishes those two load scenarios from any others.
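If your application happens to use a logging framework such as log4j, the extra detail can often be enabled with a configuration change or a couple of lines of code; the package name below is hypothetical.

import org.apache.log4j.Level;
import org.apache.log4j.Logger;

/** Illustrative log4j 1.x snippet: raise logging detail for a suspect package. */
public class DebugLogging {
    public static void main(String[] args) {
        // Hypothetical package under suspicion; in practice this is usually done
        // through log4j configuration rather than code.
        Logger.getLogger("com.example.loans.fulfillment").setLevel(Level.DEBUG);
        Logger.getRootLogger().setLevel(Level.INFO);
    }
}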
If you cannot reproduce the problem, then you should think about what addi-
tional monitoring may be helpful should the problem happen again. If you do not
understand the root cause of the problem, you should assume that the problem will
happen again. Sometimes you have no choice but to wait for another occurrence
of the problem, but there is no excuse for not capturing more information on sub-
sequent occurrences. It is tempting to conclude that the problem was a one-time
glitch, especially if the problem appears to go away once the system is restarted.
This is ill-advised; if you do not understand the root cause of a problem, you should
assume that it will happen again.
Another common failure scenario is one in which the application enters a state
from which it begins to experience a high number of application errors. In this situ-
ation, it is common to have to restart the application in order to restore service. For
these types of incidents, it may be obvious what is causing errors. What is not clear
is the event that forced the system into this state in the first place. For this type of
problem, a reliable strategy is as follows:
1. Determine the time of the last successful transaction: You need to estab-
lish a timeline for the events that led up to the failure. The last successful
transaction completed is usually a good starting point. For busy systems
under constant load, the timestamp of the last good transaction will be very
close to the commencement of the failure condition.
2. Inspect application logs and outputs at that time: From the point of the last
successful transaction, look closely at logs and application outputs. If you are
lucky, these inputs will provide an indication of what triggered the failure.
3. Inspect data and system state at that time: It is worthwhile to look at the
last valid data that was processed by the system and compare it to the data of
transactions that are now failing. Differences in what you see may lead you to
an explanation.
4. Review the business usage and application behavior at that time: A sudden change in business usage or unusual inputs may have forced the system into a state that compromised future processing. There may not have been errors at the time, but the actual usage may be the cause of the errors you are seeing now.
5. Reproduce the problem: Using the information gathered, try to reproduce the
problem by subjecting the system to the equivalent load and set of inputs.
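For step 1, the timestamp of the last good transaction can usually be pulled straight from the application's own data. A hypothetical JDBC lookup along these lines (the connection details, table, and column names are invented for illustration) is often all that is needed:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.Timestamp;

/** Hypothetical lookup of the last successfully completed transaction. */
public class LastGoodTransaction {
    public static void main(String[] args) throws Exception {
        // Connection details and schema are placeholders for illustration only.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:sqlserver://dbhost;databaseName=loans", "support_ro", "secret");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                "SELECT MAX(completed_at) FROM loan_application WHERE status = 'COMPLETED'")) {
            if (rs.next()) {
                Timestamp last = rs.getTimestamp(1);
                System.out.println("Last successful transaction: " + last);
            }
        }
    }
}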
We’ve looked at two template approaches for two different types of system failures.
Many of the problems you face will fall into one of these two categories. In the sections
following, we will look at more specific techniques and examples for troubleshooting.
[Figure: Loan request processing flow (validation, loan approval decision, and return of a loan application number).]
fulfillment service, we can’t tell if the transaction is attempted and rolled back or
never attempted at all. We need to get more information.
Our first approach will be to subject the application to negative inputs. This
should tell us whether the form validation itself is functioning correctly. We ask
the business user to submit a valid loan application, but omit a required field. The
business user does this and finds that the loan application is immediately rejected.
The business user attempts various combinations of illegal inputs and finds that the
loan application is rejecting them all, as expected. Based on this, we conclude that
it doesn’t look like there is a problem with the form validation processing itself.
Our next effort is to submit a well-formed loan application that purposefully
triggers a business rule that causes the loan application to be rejected. Loan appli-
cations that are rejected are not persisted to the SQL Server, nor are they posted
to the IBM MQ Series for fulfillment. Again, our business user attempts multiple
loan application inputs and, as expected, the bad loan applications are rejected with
the correct business error, and the eligible applications produce the same error as
was originally reported. Based on this evidence we conclude that the business rules
engine for loan processing is working correctly.
So far our efforts seem to be pointing to an issue in either persisting or fulfill-
ing valid loan application requests. We need to continue our efforts to isolate the
problem. We ask a database administrator (DBA) to temporarily remove the insert
privilege from the SQL Server table into which the record is being inserted. If we
see the same error, then this is evidence to support the theory that there is a prob-
lem with the database operation. If we see a different or more informative error,
then this means it is less likely that the database is implicated in our problem. The
DBA accommodates our request and we repeat our testing. To our surprise, we see
the same obtuse error message that we have been struggling with since the prob-
lem was first reported. It seems that the error handling for database operations
has been poorly implemented. At this point, it would be prudent to ask a support
resource to look at the application code and confirm that a SQL Server exception
would result in the observed error. At the same time we decide to look closely at
the insert operation and try to formulate theories that explain this behavior. Fol-
lowing the advice from earlier in this chapter, we ask the DBA to tell us when the
last successful insert was made on the database. The DBA reports that inserts were
made successfully today at 11:01 am. This closely corresponds with the first reports
from the field that the system was experiencing errors. We ask the DBA to forward
the last 6 successful entries in the loan application table and they appear as follows
(see Table 11.1).
In inspecting the data, the technical team very quickly identifies that the last
successful input is suspiciously 999,999. When the data type for the primary key
column is scrutinized, the team realizes that it was improperly designated as a
character field instead of a numeric field. The field width was suddenly exceeded
at 11:00 am when an attempt was made to process the millionth loan application.
Unfortunately, this field overflow coincided with deployment of a new version of
the application. This coincidence distracted the team, causing them to focus on
application changes that were introduced as part of the system upgrade.
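The mechanism is easy to demonstrate. In the hypothetical fragment below, the application-number column is assumed to have been declared as a six-character field, so the value 1,000,000 no longer fits and the insert fails with a truncation error; the schema and connection details are invented.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

/** Hypothetical reproduction of the loan-number field overflow. */
public class FieldOverflowDemo {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:sqlserver://dbhost;databaseName=loans", "support", "secret")) {
            insertLoan(conn, "999999");   // last value that fits in a CHAR(6) key
            insertLoan(conn, "1000000");  // seven characters: the insert now fails
        }
    }

    static void insertLoan(Connection conn, String applicationNumber) {
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO loan_application (application_number, status) VALUES (?, ?)")) {
            ps.setString(1, applicationNumber);
            ps.setString(2, "SUBMITTED");
            ps.executeUpdate();
            System.out.println("Inserted " + applicationNumber);
        } catch (SQLException e) {
            // With a six-character key column this surfaces as a data-truncation error.
            System.out.println("Insert failed for " + applicationNumber + ": " + e.getMessage());
        }
    }
}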
Discouraging Bias
In a crisis situation, you cannot afford to let bias play a role in problem determina-
tion. The following list includes common sources of bias. You will need to work
actively to encourage open-mindedness. As a general rule (and unlike our criminal
justice system) everything is suspect until proven otherwise.
1. Politics: In many organizations, politics are an unfortunate and inescap-
able part of getting things done. In a crisis situation, politics can be very
counterproductive. Spinning a problem as an “application problem” is a con-
venient way for the infrastructure or operations team to shift emphasis and
responsibility in an investigation. This type of bias works in many different
directions. The application team will often cite “environmental issues” as the
most likely explanation even when there is little or no evidence to support
this claim.
2. Pride: No one likes to think that they are responsible for a problem. Technologists
can be fiercely proud and opinionated. However, good technologists also appre-
ciate that anyone can make a mistake, an oversight, or a flawed assumption.
3. Expertise: If you have a task force composed mostly of DBAs looking at a problem, then this team will do a great job of formulating theories based
on their own expertise. In other words, the database may quickly become the
focal point of your investigation. We often become beholden to the theories
we understand best, but these are not necessarily the correct theories. In a
crisis situation, you need to stretch yourself and your team to ensure that
theories and speculation are grounded and supported by evidence. If you
don’t have sufficient technical coverage for the application, be honest about it
and escalate the investigation to get the right people involved.
4. Communication: Don’t assume that everyone knows everything that you
do about the system. If you find yourself thinking, “that person should know
this already,” then take the time to confirm that they, in fact, do.
As painful as it may be, there will be situations where a subtle design change in your
application is the best way to mitigate risk and satisfy your end users.
Applying a Fix
Determining root cause for a problem is usually the hardest part of responding to
a failure in a production system. Developing, testing, and applying a fix may be less challenging, but it can require just as much effort. In this section we will look at
processes and considerations for introducing a fix into a production environment.
5. Comfort Level Regarding Root Cause Analysis: In many cases you may think you have found the root cause of your problem but you are not quite certain. When you are acting on a hunch or don't have undeniable proof that the root cause has been found, make sure that your uncertainty is compensated for by sufficient verification.
It is never a comfortable option to waive testing cycles when considering a pro-
duction implementation. Ultimately, the decision to put a change in production
without the benefit of testing needs to be an objective business decision that weighs
business benefit against technical risk.
Post-Mortem Review
Once you have implemented a successful fix for a production incident, your last
obligation in emerging from the crisis is to conduct a post-mortem review. The
goal of the review is to ensure that you never wind up in a similar predicament ever
again. In the review you need to look at the root cause of the problem, whether
it was preventable, and whether it is theoretically possible for anything similar to
happen again. You are also encouraged to review your monitoring and operations
procedures and score your organization on how effectively you reacted to the crisis.
We will discuss these topics next.
Reviewing Monitoring
In addition to preventing problems from happening in the first place, you also need
to ensure that you detect them as quickly as possible. As we discussed in Chapter
10, monitoring is about detecting problems and capturing the maximum amount
of supporting information. If the alert that is generated contains enough informa-
tion for the support team to immediately home in on root cause, then it has done its
job admirably. In reviewing your problem-detection capability, your organization
should ask itself the following questions:
1. Through what channel was the problem first reported? (If the answer is, "a business user called the help desk," then you have a problem.)
2. Did the monitoring infrastructure detect the problem through multiple interfaces? If not, what additional monitors should have recognized a problem?
3. Did each monitor include as much diagnostic information as possible to aid in the technical investigation?
4. Did the monitors produce alerts with the correct severity for the urgency of the problem?
5. Did the monitors produce alerts with an appropriate frequency for an ongoing problem?
6. Did technical support resources have efficient access to data and alerts generated by the monitoring infrastructure?

Table 11.2 Failure Type and Corresponding Post-Mortem Assessment

Vendor Defect
- Has the vendor fixed the problem, or did we work around it?
- For a vendor fix, have all environments, including development environments, been upgraded to the new version?
- Has the vendor stated that the fix will continue to be included in future versions of their product?
- For a work-around, has it been communicated widely to the development team that this vendor feature is problematic?

Operator Error
- Has the operator been identified and informed of the error?
- Has operations documentation been reviewed and updated if required?

Illegal or Unexpected Usage
- For illegal inputs, has a business analyst reviewed the unexpected input and updated requirements?
- Has the development team reviewed the application for all other processing that may have to process a similar input?
- For unexpected usage, has a business analyst reviewed the observed usage and updated the business usage model, including future forecasts?
- For unexpected inputs from other systems, have external parties been contacted to confirm the possible range of inputs from their systems?
- Has the development team reviewed and strengthened (if necessary) the application handling for unexpected inputs?

Configuration Error
- Have all related configurations been reviewed by the development and deployment team for correctness?
- Has the deployment team identified why this parameter was not configured or was misconfigured in the environment?
- Is there a risk that this parameter is misconfigured in other nonproduction environments or other applications under your management?
Improving your monitoring and diagnostic capability is an ongoing responsi-
bility for owners of mission-critical systems. There is an applicable expression that
comes to mind: what doesn’t kill you makes you stronger. Every application failure is an
opportunity to improve your monitoring and response capability. Failures can also be
canaries in the coal mine: they can alert you to more serious, imminent problems.
Summary
We hope that this is a chapter that you will not reference very often. This chapter
is an admission that we will occasionally face problems in our production envi-
ronments. We have provided you with practical advice on how to minimize the
impact of production incidents and optimize your efforts around troubleshooting
and problem resolution. This includes properly assessing the severity, managing
all stakeholders in the incident, and prioritizing mitigation activities against root-
cause analysis and resolution. We have emphasized the importance of being able to
reproduce issues in nonproduction environments. This approach insists that fixes
are fully verified and tested prior to implementation. In this way, you are assured
that you are solving the “right” problem and mitigating the risk associated with the
technical change under consideration.
This chapter also introduced a series of troubleshooting strategies that you may
find helpful. We discussed the need to evaluate all changes to the environment
irrespective of how unrelated they may seem. When faced with a difficult problem,
we recommend capturing all possible inputs and discouraging bias on the part of
the technical team that is tackling the problem. We have also looked at specific
examples of failures and proposed methods for poking and prodding the system
into revealing more information. This chapter also discussed the criteria by which
you should assess the required level of testing for a fix that is proposed for produc-
tion. Finally, we have explored the post-mortem process and challenged you to
review all aspects of the incident management with an emphasis on whether or not
the monitoring infrastructure performed to expectations.
Common Impediments
to Good Design
Design is the cornerstone for supporting performance, operability, and other non-
functional requirements in an application. It is the point of intersection where
many different project streams converge and is the place to confirm that all the
different pieces fit together. Design activities sit between the conceptual, logical
view of an application and the physical one that will ultimately be deployed into
the production environment.
In terms of the system lifecycle, the design phase precedes development but fol-
lows the requirement-gathering activities. Design usually follows architecture, but
is often described as being the “architecture/design phase,” which in itself can be a
problem that we will examine later in this chapter.
Performance and operability are dependent on design. An inefficient or inap-
propriate design will destroy the possibility of meeting both of these non-functional
requirements, as well as other functional ones. Improper design can also make it
impossible to scale the application without violating these requirements. There are
several common impediments to establishing good design on most nontrivial proj-
ects. This chapter looks at these and describes ways to mitigate their impact.
Design Dependencies
Design has a dependency on the requirements—functional and non-functional—
and the overall architecture. Architecture is constructed to satisfy both the defined
requirements and the anticipated requirements if there is enough information to
[Figure: Design dependencies, showing high-level design specifications and detailed design in relation to abstract, conceptual, high-level, detailed, and integrated architecture.]
- Performance: This must be within the service level agreement, but have the capability for improvement through some easy-to-access focal points.
- Completeness: The design should be an end-to-end solution, able to drive an architecture to a lower level of detail to satisfy all the defined requirements.
- Flexibility: The design must be flexible enough to accommodate future business requirements without requiring extensive rework. More on the meaning of this later in this chapter.
- Modern: The technology and techniques used in the design should be within the industry mainstream. Design that requires techniques or technology that is legacy or too bleeding edge results in very similar problems, including the difficulty of finding appropriate resources to build out the design and having to pay a premium for their services.
Rating a Design
Good design is not always possible because of real-life resource constraints that
restrict implementation of the solution that was defined. There are several evalua-
tion criteria elements that can be applied to a given design to determine an objective
rating for it. Figure 12.2 shows several other ratings that can be considered, with
the desirable ones shown near the top of the list.
The design ratings on the left side of the figure lead toward excellent design.
Increasing time and money can take an incomplete design and still move it forward
on an iterative and manageable basis. Excellent design is a desirable project goal,
but generally unattainable in the real world. Bad design cannot be improved by
spending more time on it, nor by increasing the financial investment. It has too
many inherent flaws to be iteratively improved. In fact, the only way to react to bad
design may be to throw it out and start over, making sure that the same thing does
not happen again.
Designs on the left side of Figure 12.2, below the good rating level, can be
augmented by a set of conditions and clarity that describe when and where further
investment should be made to adjust the design to the next level.
The ratings on the left side of Figure 12.2 are characterized by degree of com-
pleteness and the ability to anticipate future needs. The ratings at the bottom of the
list are focused on supporting immediate requirements. Moving up the list intro-
duces support for future anticipated needs. The following sections of this chapter
describe each of the evaluation categories in the context of the design components
and the criteria that should be applied to decide where a given design lies.
[Figure 12.2 Design ratings, ranging from excellent design at the top through good, adequate, and incomplete design, plotted against increasing time to implement and increasing cost.]
The ratings shown in Figure 12.2 are defined here. Subsequent sections expand
on the attributes and their values in each of the rankings below.
tion iteratively, it is also highly risky as future details might cause substantial
reworking. This can be a result of bad planning or because of trying to do too
much and then having to settle for a lot less in a worried hurry.
- Bad Design: A bad design means that one or more components do not adequately address the business requirements. This could involve building a batch solution for a real-time application or setting up a database to satisfy update/insert requests with no consideration for massive searches in a call center application.
The following subsections discuss what the design attributes and their values
would need to be to fall within each of the rankings shown in Figure 12.2.
Excellent Design
An excellent design requires a thorough understanding of current requirements,
but also requires an ability to forecast or anticipate the future—generally 2–5 years
out—to build a design that requires minimal redevelopment or new code while
being able to deal with the domain of known non-functional requirements and
business/functional requirements. The design also needs to consider a range of
potential technologies and techniques as well. A design that is ranked as excellent
would also need to be adaptable to a range of future outcomes. This level of design
is characterized by reaching an excellent rating in each of the following functional
areas:
The problem with excellent design is that it tries to do too much. After the
known business requirements, and some meaningful forecasting, the situation
begins to get hazy—after, say, 2 years. At this point, design considerations become
highly speculative and based on assumptions that may not come to pass.
Excellent design is expensive to define and build, and it requires a long timeline. The need to support multiple years, multiple technologies, and multiple business scenarios demands a heroic amount of work and a long list of assumptions. Many projects cannot afford these. Even worse, once the process starts, valuable deliverables may never be created before management determines that there is not enough time to finish the design and mandates a switch in strategy.
Good Design
As stated previously, a good design is a reasonable and achievable objective for most
modern systems. It is built to support known functionality, but uses a forecasting
approach to anticipate future business needs, possibly based on business forecasts,
to build a set of likely scenarios. A good design also needs to consider reality in
terms of the resources (money and people) available and the timeline that exists
to be able to support the design. During aggressive timelines, a good design can
be formulated but implemented in stages based on the resources available and the
needs of the business. The following design considerations apply.
Sound Design
Sound design can be viewed as a scaled-down version of good design, one that
needs to be filled out over a period of time. The principle behind a sound design is
that it can support known requirements, but can be extended for future require-
ments as they emerge. Forecasting is kept to a minimum. A sound design should
be used as a stepping stone to good design if there are time or resource limitations
that keep a good design from being constructed. It is better to define a good design and implement at the sound design level than to operate only at the sound design level. The following design considerations apply at the functional requirements level.
- Performance: Meet the known requirements but have a roadmap for satisfying future requirements.
- Completeness: Addresses all the functional needs of the application.
- Flexibility: Defines flexibility, but may have compromises to meet a timeline within a set of resources.
- Modern: Same as good design.
- Extensible: Should have a roadmap for the future, but details may be missing.
- Scalable: Roadmap for scaling the application.
- Right-sized: The design is suited to the implementation schedule.
- Throughput: Designed for the known throughput, with a roadmap for future likely scenarios.
- Operability: Can operate under the known business situations.
Adequate Design
Adequate design is a dramatic compromise to meet a current business need, but
with an understanding that reworking will be required to improve the design at
some point in the future. It is better to define at the good design level, but imple-
ment at the adequate design level, than to define at the adequate level and then try
to move up. However, the former approach may not be possible given resources or
budgets available to the initiative.
Incomplete Design
An incomplete design is work in progress; time for the design activities has run out. This could be due to bad project management, focusing on the wrong things, changing requirements, or some combination of all of these and more. An incomplete design may show up during testing, and it could be apparent in a couple of places in the overall design. An incomplete design needs to be moved up to a higher rating before an application can be deployed into a production environment.
Bad Design
The worst side of the design process is ending up with a bad design that does not
and cannot meet the functional and non-functional requirements of the applica-
tion. This can be the result of many factors, including the following functional
areas.
Testing a Design
Testing a design requires a combination of inspection techniques and tool sets.
Inspection techniques involve activities such as design reviews, expert reviews, expert
analysis, and evaluation templates. One form these can take is a set of questions that probe the soundness of a design through workshops involving business users, designers, architects, and other stakeholders on the project team.
Design testing requires a focus on the known requirements, likely scenarios,
and anticipated requirements. Tools include simulation tools, measurement tools,
regression tools, and automated test scripts.
Insufficient Information
A proper set of design activities requires several sources of input information (as shown in Figure 12.3). The degree of completeness in these input deliverables will
drive the rating on the design being defined. Gaps or lack of details in these areas
lead to assumptions that result in ambiguity.
[Figure 12.3 Inputs to design: design standards, functional requirements, approved architecture, and non-functional requirements.]
the rate of technological change is very rapid—within five years. This means that
design may only be valid for at most five years before new technologies bring new
solution options.
Technology is evolving so rapidly that a two-year timeline can see dramatic
changes in available tools. Keeping pace with this rapid change is difficult. Knowl-
edgeable resources are difficult to find. References that show how the solutions
should fit together, and their limitations, are not prevalent as an industrywide set of
best practices. Bottlenecks may not be known until the testing activities begin.
Fad Designs
While rapidly changing technology poses problems in building good design from
a time and skills perspective, fad designs are even worse. The IT industry has had
many cycles in which a new technology or technique led to some rapid design
decisions that did not pan out. Incorporating fad designs is a problem from sev-
eral perspectives. The design may look reasonable under certain situations, but its
implications may not be known until later in the lifecycle. When the fad passes, the
design may need to be revised, at great expense.
Minimalistic Viewpoint
Building a design for tactical reasons, which leads to an adequate or sound design,
could in fact be considered bad design when a wider perspective is considered.
Building only for the present opens up this risk.
Lack of Consensus
This can be caused by an inconsistent set of standards. Lack of consensus can exist
around requirements, needs, and potential solutions. Team members may have
individual ideas that are too strong to compromise. This makes design decisions
difficult, and infighting or divergent opinions may make it impossible to determine
and prove the correct designs.
Lack of Facts
Design is a combination of science and art. Decisions need to be based on facts.
However, when these are not known, guesses or near guesses may be used to drive
design decisions to meet a project schedule.
External Impacts
External impacts can either affect the design or the requirements. This can nega-
tively impact a good design at the last minute. These are difficult to ignore, espe-
cially when they are of the legislative variety.
Insufficient Testing
Design deliverables on paper may look good; however, they still need to be validated
as early in the project lifecycle as possible. Several types of testing can be considered, mostly of the non-functional variety, including stress and regression testing.
Design Principles
Is there a bad design inside every good one, trying to get out? Not exactly, although
it might seem that way most of the time. There are certain things that a design team
and project management should consider to improve the odds of building a good
design, in the context of functional and non-functional requirements:
Summary
Design refers to the portion of the project lifecycle in which business or functional
requirements, non-functional requirements, and architecture are brought together
to map out how an application is going to be built and whether the end result is
going to be successful.
Each of the initial phases of the standard development lifecycle is crucial to the outcome by definition. However, the design phase really acts as the glue that
binds the creative, often abstract nature of functional and non-functional require-
ments with the development phase.
The first requirement of a good design is that it meet the known requirements.
This includes satisfying the business or functional requirements of an application as
they are currently defined. This must also extend to the support of non-functional
requirements. The second requirement of a good design is to anticipate future needs
and provide a roadmap for supporting these within a timeframe and cost model
that is acceptable to the business.