App Engine Web Architecture
Developers leverage Google App Engine to simplify the development and deployment of web applications. These applications use the autoscaling
compute power of App Engine, as well as integrated features such as the distributed in-memory cache, task queues, and datastore, to create robust
applications quickly and easily.
App Engine is Google's PaaS platform, a robust development environment for applications written in Java, Python, PHP, and Go. The App Engine
SDK supports development and deployment of applications to the cloud. App Engine supports multiple application versions, which enables easy
rollout of new application features as well as traffic splitting to support A/B testing.
Integrated within App Engine are the Memcache and Task Queue services. Memcache is an in-memory cache shared across the App Engine
instances. This provides extremely high-speed access to information cached by the web server (e.g. authentication or account information).
Task Queues provide a mechanism to offload longer-running tasks to backend servers, freeing the frontend servers to service new user requests.
Finally, App Engine features a built-in load balancer (provided by the Google Load Balancer) which provides transparent Layer 3 and Layer 7 load
balancing to applications.
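The Memcache service described above is typically used in a cache-aside pattern: check the cache first, and fall back to the datastore on a miss. The following is a minimal sketch of that pattern using an in-process dict as a stand-in for the shared Memcache API; the function and key names are hypothetical illustrations, not App Engine APIs.

```python
cache = {}  # stand-in for the shared Memcache service


def load_account_from_datastore(user_id):
    # Placeholder for a (slower) Datastore read.
    return {"user_id": user_id, "plan": "free"}


def get_account(user_id):
    """Return account info, serving from the cache when possible."""
    key = "account:%s" % user_id
    account = cache.get(key)  # fast path: in-memory cache hit
    if account is None:
        account = load_account_from_datastore(user_id)  # slow path
        cache[key] = account  # populate the cache for later requests
    return account


print(get_account("alice"))
```

On App Engine itself, the dict would be replaced by calls to the Memcache API so that the cached entries are shared across all instances of the application.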
Google's Cloud DNS service can be used to manage your DNS zones.
Google App Engine offers excellent solutions to the challenges of scaling web applications. In this document, we will look at how App Engine handles
incoming user requests and how it scales your application as traffic increases and decreases. You will learn how to configure App Engine's scaling
behavior to find the optimal balance between performance and cost, so that you can successfully harness the power and flexibility of Google App
Engine for your projects.
Figure 1: How user requests are routed to application instances
A user request is routed to the geographically closest Google data center, which may be in the United States, the European Union, or the Asia-Pacific
region. In that data center, an HTTP server known as the Google front end receives the request and routes it through the Google-owned fiber
backbone to an App Engine data center that runs your application.
The App Engine architecture includes the following components:
App Engine front end servers are responsible for load balancing and failover of App Engine applications. As shown in Figure 1, the App
Engine front end receives user requests from the Google front end and dispatches each request either to an app server for dynamic content or
to a static server for static content.
App servers are containers for App Engine applications. An app server creates application instances and distributes requests based on
traffic load. The application instances contain your code: the request handlers that implement your application.
The app server runtime environment includes APIs to access the full suite of App Engine services, allowing you to easily build scalable and
highly available web applications.
Static servers are dedicated to serving static files for App Engine applications; they are optimized to provide the content rapidly with minimal
latency. (To further reduce latency, Google may move high-volume public content to an edge cache.)
The app master is the conductor of the whole App Engine orchestra. When you deploy a new version of an App Engine application, the app
master uploads your program code to app servers and your static content to static servers.
The App Engine front end and the app servers work together, tracking the available application instances as your application uses more or fewer
instances.
It can take tens of seconds to boot up an operating system each time a new VM is added to handle more requests. In contrast, an application instance
can be created in a few seconds.
Application instances operate against high-level APIs, eliminating the need to communicate through layers of code with virtual device drivers.
In summary, application instances are the computing units that App Engine uses to scale your application. Compared to VMs, application instances
are fast to initialize, lightweight, memory efficient, and cost effective.
Overall, App Engine scales easily because its lightweight-container architecture can spin up application instances far more quickly than VMs can be
started.
The next section describes how you can design your application to take advantage of that architecture.
The overall cost of your application is driven by the application instances running inside the app servers.
To reduce the response time, App Engine automatically increases the number of instances based on the current load and developer-specified
configuration parameters. However, additional instances cost more. To minimize the additional cost, you need to understand how and when instances
are created or deleted by App Engine in response to changes in the traffic load.
The following sections explain what an instance is and how to configure its parameters to balance responsiveness and cost effectiveness for your
application.
Manual scaling: App Engine creates the number of instances that you specify in a configuration file. The number of instances you configure
depends on the desired throughput, the speed of the instances, and the size of your dataset, balanced against cost considerations. Manually
scaled instances run continuously, so complex initializations and other in-memory data are preserved across requests.
Basic scaling: App Engine creates instances to handle requests and releases them when they become idle. Basic scaling is ideal
and cost effective if your workload is intermittent or driven by user activity.
Automatic scaling: For modules on frontend instance classes, App Engine adjusts the number of instances based on the request rate,
response latencies, and other application metrics. You can control the scaling of your instances to meet your performance requirements.
Overall, the dynamic scalability and cost effectiveness of an App Engine application is primarily controlled by the design and configuration of the
frontend instances. Follow these important best practices to optimize their scalability:
1. Design for reduced latency and more queries per second (QPS).
2. Optimize idle instances and pending latency.
The rest of this paper will focus on applying these practices to automatic-scaling frontend instance classes. Refer to the App Engine Modules
documentation (Java, Python, Go, PHP) for information on managing backend instance classes.
An instance is busy while it processes a request, so App Engine may create another instance to handle additional requests. If you can cut that
processing time in half, for example to 200 ms, the user experience is improved, and your QPS may be doubled without running extra instances.
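The latency-to-throughput relationship above can be sketched with simple arithmetic. This is a back-of-envelope model only; it assumes single-threaded instances and ignores queueing effects, and the instance count and latencies are illustrative.

```python
def max_qps(num_instances, latency_ms, threads_per_instance=1):
    """Throughput ceiling: each thread can serve at most one request
    per `latency_ms` milliseconds."""
    return num_instances * threads_per_instance * 1000.0 / latency_ms


# Halving request latency from 400 ms to 200 ms doubles the QPS ceiling
# without adding instances.
print(max_qps(5, 400))  # 12.5
print(max_qps(5, 200))  # 25.0
```

The same model shows why enabling concurrent requests (discussed below) raises QPS: increasing `threads_per_instance` scales the ceiling without adding instances.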
Appstats is a powerful tool you can use to understand, optimize, and improve your application's QPS. As shown in Figure 5, it shows the number of
RPC calls that are invoked inside each request, the duration of each RPC call (such as Datastore or Memcache access), and the contribution of each
RPC call to the overall latency of the request. This information gives you hints for finding bottlenecks in your application.
If the Appstats graphs indicate that your application's bottleneck is in CPU-intensive tasks, rather than waiting for RPC calls to return, you could try
a higher CPU class for the frontend instances to reduce the latency. While this increases the CPU cost of each instance, the number of instances
required to support the load will decrease, and the user experience improves without a major shift in the total cost.
You can also increase the QPS by letting App Engine assign multiple requests to each instance simultaneously. By default, one instance can run only
one thread to prevent unexpected behavior or errors caused by concurrent processing. If your application code is thread-safe and implements proper
concurrency control, you can increase the QPS of each instance without additional cost by specifying the threadsafe element in the configuration
file.
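For the Python runtime, this declaration lives in the app.yaml configuration file (the Java runtime uses a threadsafe element in appengine-web.xml). A minimal sketch follows; the application ID and script name are placeholders.

```yaml
# app.yaml (Python runtime): allow concurrent requests on each instance.
application: myapp        # placeholder application ID
version: v1
runtime: python27
api_version: 1
threadsafe: true          # requires thread-safe request handlers

handlers:
- url: /.*
  script: main.app        # placeholder WSGI application
```

With `threadsafe: true`, App Engine may dispatch several requests to the same instance concurrently, so all request handlers and shared state must be safe under concurrent access.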
Idle Instances
Idle instances help your site handle a sudden influx of requests. Usually, requests are handled by existing, active, available application instances. If a
request arrives and there are no available application instances, App Engine may need to activate an application instance to handle that request
(called a loading request). A loading request takes longer to respond, because it must wait while the new instance is initialized.
Idle instances (also called resident instances) represent the number of instances that App Engine keeps loaded and initialized, even when the
application is not receiving any requests. The default is to have zero idle instances, which means requests will be delayed every time your application
scales up to more instances.
You can adjust the minimum and maximum number of idle instances independently with sliders in the Admin Console.
We recommend that you maintain idle instances if you do not want requests to wait for instance creation and initialization. For example, if you specify
a minimum of ten idle instances, your application will be able to service a burst of requests immediately on those ten instances. We recommend that
you allocate idle instances carefully because they will always be resident and incur some cost.
You can also set an upper limit to the number of idle instances. This parameter is designed to control how gradually App Engine reduces the number
of idle instances as load levels return to normal after a spike. This helps your application maintain steady performance through fluctuations in request
load, but it also raises the number of idle instances (and consequently running cost) during periods of heavy load. Lowering the maximum number of
idle instances can reduce cost.
Pending Latency
Pending latency is the time that a request spends in a pending queue for an app server. You can set minimum and maximum values for this
parameter.
When an App Engine front end receives a request from a user and no instance is available to service that request, the request is added to a pending
queue until an instance becomes available. App Engine tracks how long requests are held in this queue. If requests are held for too long, App Engine
creates another instance to distribute the load. Figure 7 shows how instances are added or deleted based on traffic volume.
For example, if you set the maximum pending latency to one second, App Engine will create a new instance when a request has been waiting in the
pending queue for more than one second. Adding more instances results in increased throughput but incurs more cost.
Note: If you have specified a minimum number of idle instances, the pending latency parameters will have little or no effect (unless there is a
sustained traffic spike that grows to exhaust the idle instances faster than they can be initialized).
[Figure: Admin Console sliders for the minimum and maximum pending latency and the minimum and maximum number of idle (resident) instances,
with the pending queues illustrated. Low settings trade lower cost for slower responses during a traffic spike; high settings trade higher cost for
better spike handling.]
For example, if you expect high traffic to your site because you have scheduled an event or expect major media coverage related to a product release,
you could increase the minimum number of idle instances and decrease the maximum pending latency shortly before and during the event to
smoothly handle traffic spikes.
Known anti-patterns are setting the minimum and maximum numbers of idle instances too close to each other and specifying a very small gap
between the minimum and maximum pending latency. Either of these may cause unexpected scaling behavior in your application.
We recommend the following configurations:
Best performance: Increase the value for the minimum number of idle instances and lower the maximum pending latency while leaving the
other settings on automatic.
Lowest cost: Keep the maximum number of idle instances low and increase the minimum pending latency while leaving the other settings on
automatic.
We also recommend that you conduct a load test of your application before trying out the recommended settings. This will help you choose the best
values for idle instances and pending latency.
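When an application is converted to modules (see footnote 3), these idle-instance and pending-latency settings move from Admin Console sliders into the module's configuration file. The following sketch shows the "best performance" configuration for a Python module; the values are illustrative, not recommendations.

```yaml
# Module configuration (Python runtime): automatic scaling tuned for
# responsiveness rather than lowest cost.
module: default
runtime: python27
api_version: 1
threadsafe: true

automatic_scaling:
  min_idle_instances: 10         # keep ten warm instances for traffic bursts
  max_idle_instances: automatic  # let App Engine trim idle instances
  min_pending_latency: automatic
  max_pending_latency: 1s        # add instances if requests queue too long

handlers:
- url: /.*
  script: main.app               # placeholder WSGI application
```

Lowering `max_pending_latency` and raising `min_idle_instances` favors performance; the lowest-cost configuration would instead raise `min_pending_latency` and keep `max_idle_instances` low.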
To shorten loading requests, load code from a zip or jar file, which is faster than loading from many separate files.
If you cannot decrease the time required for a loading request to complete, you may need more idle instances to ensure responsiveness when
the load increases. Reducing the loading-request time increases the elasticity of your application and lowers the cost.
Special Notes
One of the biggest advantages of Google App Engine is that lightweight application instances can be added within a few seconds. This enables highly
elastic scaling that adapts to sudden increases in traffic volume. To benefit from this power, you have to understand how requests are distributed to
application instances, how to maximize the QPS of your application by increasing the throughput per instance, and how to control elasticity. By
following best practices, you can build web applications that scale smoothly when traffic increases rapidly, and you can tune your application for an
optimal balance of cost and performance.
1. The term QPS is Google's terminology to express requests per second. It includes all HTTP requests to the servers and is not restricted to
search queries.
2. For the mathematically precise: QPS is computed over the past 60 seconds. The seven instances handled 4401 requests in 60 seconds, for
4401 / 60 = 73.35 QPS, so the average is 73.35 / 7 = 10.479 QPS per instance. For the first instance, 13.133 QPS implies that the instance
processed about 13.133 * 60 = 788 requests.
3. If you convert your application to use modules, this graphical interface is replaced by parameter settings in the per-module configuration
files.
4. App Engine knows what requests are outstanding, how long those requests are likely to take (from past statistics), and how loaded the various
app servers are. This means it can predict whether an instance will be available in time to service a request before the maximum pending
latency is reached.
5. The minimum idle instances setting is available in the Console only for a paid app.
Application hierarchy
At the highest level, an App Engine application is made up of one or more modules. Each module consists of source code and configuration files. The
files used by a module represent a version of the module. When you deploy a module, you always deploy a specific version of the module. For this
reason, whenever we speak of a module, it usually means a version of a module.
You can deploy multiple versions of the same module, to account for alternative implementations or progressive upgrades as time goes on.
Every module and version must have a name. A name can contain numbers, letters, and hyphens. It cannot be longer than 63 characters and cannot
start or end with a hyphen.
While running, a particular module/version will have one or more instances. Each instance runs its own separate executable. The number of instances
running at any time depends on the module's scaling type and the volume of incoming requests.
Stateful services (such as Memcache, Datastore, and Task Queues) are shared by all the modules in an application. Every module, version, and
instance has its own unique URI (for example, v1.my-module.my-app.appspot.com). Incoming user requests are routed to an instance of a
particular module/version according to URL addressing conventions and an optional customized dispatch file.
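The optional dispatch file mentioned above overrides the default URL-based routing. A minimal sketch follows; the URL patterns and module names are hypothetical.

```yaml
# dispatch.yaml: route matching requests to specific modules.
dispatch:
- url: "*/mobile/*"
  module: mobile-frontend   # hypothetical module name
- url: "*/work/*"
  module: static-backend    # hypothetical module name
```

Requests that match no dispatch rule fall through to the default routing conventions (for example, `v1.my-module.my-app.appspot.com`).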
Note: After April 2013, Google does not issue SSL certificates for double-wildcard domains hosted at appspot.com (i.e. *.*.appspot.com). If you
rely on such URLs for HTTPS access to your application, change any application logic to use "-dot-" instead of ".". For example, to access version
v1 of application myapp, use https://v1-dot-myapp.appspot.com. The certificate will not match if you
use https://v1.myapp.appspot.com, and an error occurs for any user agent that expects the URL and certificate to match exactly.
Scaling Types
The following comparison summarizes how the three scaling types differ, feature by feature:

Deadlines
    Automatic scaling: Requests must complete within 60 seconds; task queue tasks within 10 minutes.
    Manual scaling: Requests and tasks can run much longer.
    Basic scaling: Same as manual scaling.

Background threads
    Automatic scaling: Not allowed.
    Manual scaling: Allowed.
    Basic scaling: Allowed.

CPU/Memory
    Automatic scaling: Configurable by instance class (F1, F2, F4, or F4_1G).
    Manual scaling: Configurable by instance class (B1, B2, B4, B4_1G, or B8).
    Basic scaling: Same as manual scaling.

Residence
    Automatic scaling: Instances are evicted from memory based on usage patterns.
    Manual scaling: Instances remain in memory until you stop them, and state is preserved across requests.
    Basic scaling: Instances are evicted based on the `idle_timeout` parameter. If an instance has been idle longer than `idle_timeout`,
    shutdown occurs.

Startup and Shutdown
    Automatic scaling: Instances are created on demand to handle requests and automatically turned down when idle.
    Manual scaling: Instances run continuously; you start and stop them when you deploy or stop the module version.
    Basic scaling: Instances are created on demand to handle requests and automatically turned down when idle.

Instance Addressability
    Automatic scaling: Instances are anonymous.
    Manual scaling: Instances can be addressed individually by instance number.
    Basic scaling: Same as manual scaling.

Scaling
    Automatic scaling: App Engine adjusts the number of instances automatically in response to processing volume, based on factors in the
    `automatic_scaling` settings of the module's configuration file.
    Manual scaling: The number of instances is specified in the configuration file, provided on a per-module-version basis.
    Basic scaling: A module version is configured with a maximum number of instances using the `basic_scaling` setting's `max_instances`
    parameter. The number of live instances scales with the processing volume.

Free Daily Usage Quota
    Automatic scaling: 28 instance-hours.
    Manual scaling: 8 instance-hours.
    Basic scaling: 8 instance-hours.
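The `basic_scaling` and `manual_scaling` settings above map to module configuration like the following sketch; the module name and values are illustrative.

```yaml
# Module configuration (Python runtime): basic scaling with up to 11
# instances, each shut down after 10 idle minutes.
module: worker
runtime: python27
api_version: 1
threadsafe: true

basic_scaling:
  max_instances: 11
  idle_timeout: 10m   # instance is turned down after this idle period

handlers:
- url: /.*
  script: worker.app  # placeholder WSGI application
```

A manually scaled module would instead use a `manual_scaling` setting with an `instances` count, and its instances would run continuously until stopped.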
Instance classes
Instances are priced based on an hourly rate determined by the instance class.
Instance Class | Memory Limit | CPU Limit | Cost per Hour
B1             | 128 MB       | 600 MHz   | $0.05
B2             | 256 MB       | 1.2 GHz   | $0.10
B4             | 512 MB       | 2.4 GHz   | $0.20
B4_1G          | 1024 MB      | 2.4 GHz   | $0.30
B8             | 1024 MB      | 4.8 GHz   | $0.40
F1             | 128 MB       | 600 MHz   | $0.05
F2             | 256 MB       | 1.2 GHz   | $0.10
F4             | 512 MB       | 2.4 GHz   | $0.20
F4_1G          | 1024 MB      | 2.4 GHz   | $0.30
Manual and basic scaling instances are billed at hourly rates based on uptime. Billing begins when an instance starts and ends fifteen minutes after a
manual instance shuts down, or fifteen minutes after a basic instance has finished processing its last request. Runtime overhead is counted against
the instance memory limit; this overhead is higher for Java than for other languages.
Important: When you are billed for instance hours, you will not see any instance classes in your billing line items. Instead, you will see the appropriate
multiple of instance hours. For example, if you use an F4 instance for one hour, you do not see "F4" listed, but you will see billing for four instance
hours at the F1 rate.
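The instance-hour multiple can be checked against the rates in the table above. A quick sketch, with the hourly rates copied from the table:

```python
# Hourly rates from the instance class table (USD per instance-hour).
rates = {"F1": 0.05, "F2": 0.10, "F4": 0.20, "F4_1G": 0.30}


def billed_f1_hours(instance_class, hours):
    """Convert usage of an instance class into the equivalent number of
    instance-hours at the F1 rate, which is how the bill is itemized."""
    return rates[instance_class] / rates["F1"] * hours


# One hour on an F4 instance is billed as four instance-hours at the F1 rate.
print(billed_f1_hours("F4", 1))
```

The same conversion applies to the other classes: an F2 hour bills as two instance-hours, and an F4_1G hour as six.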