Architecture Design PDF
[Diagram: inputs to architecture design. Labels: Business Requirements, Business Process, Technical Requirements, Technical Process; Stakeholder, High Level Objectives, Change, Team, Cost considerations, Compliance & Regulations, Success Measures; Functional, Non-functional, Application Design considerations, Network, Scalability, Durability; SDLC, ITIL, CICD]
Architecture: Business Requirements
System Integration & Data Management
• Integration checkpoints
• Integration investments (RESTful APIs, point-to-point, publisher/subscriber)
• Data management (how much data, how long to store it, what to process, and who will use it)
Compliance & Regulations
• HIPAA, GDPR, SOX, COPPA, PCI DSS, Open Data
• Data privacy regulations – PII data
Non-functional
Availability
Availability is a measure of the time that services are functioning correctly and accessible to users. It is generally measured as the percentage of time that a system
is available and responding to requests with latency below a defined threshold. Service level agreements (SLAs) are used as the measuring yardstick for GCP products.
Ex – 99.9999% availability allows 2.63 seconds of downtime a month; 99.999% allows 26.3 seconds/month; 99.99% allows 4.38 minutes/month; 99.00% allows 7.31 hours/month.
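The downtime figures above follow from simple arithmetic, sketched below. The month length (365.25 days / 12) is an assumption chosen because it reproduces the slide's numbers to within rounding; the function name is illustrative only.

```python
# Assumed average month length: 365.25 days / 12 = 2,629,800 seconds.
SECONDS_PER_MONTH = 365.25 * 24 * 3600 / 12


def monthly_downtime_seconds(availability_percent: float) -> float:
    """Return the downtime per month implied by an availability percentage."""
    unavailable_fraction = (100.0 - availability_percent) / 100.0
    return unavailable_fraction * SECONDS_PER_MONTH


for target in (99.9999, 99.999, 99.99, 99.0):
    seconds = monthly_downtime_seconds(target)
    print(f"{target}% -> {seconds:.2f} s = {seconds / 60:.2f} min = {seconds / 3600:.2f} h")
```

Each additional nine cuts the allowed downtime by a factor of ten, which is why the cost of an availability target rises steeply with every nine added.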
The following guidelines and architectural patterns can be used to ensure high availability for GCP services.
• Compute Engine – Live Migration, Managed Instance Groups, Multiple Regions and Global Load Balancing
• Kubernetes Engine – Managed Instance Groups, Replicated Master
• Storage
  • Object Storage – Fully managed
  • File and block storage – Persistent disks support online resizing
• Database Services
  • Self-managed DB – Shared disk, filesystem replication, synchronous multi-master replication
  • Managed databases – Fully managed (Firestore, BigQuery); regional replication can be enabled (Cloud SQL, Bigtable)
• Network
  • Redundant network connections – Dedicated or Partner Interconnect
  • Premium Network Tier – Data is transmitted among regions over Google’s internal network.
Reliability
Reliability is a measure of the probability that a service will continue to function under a specific load over a period of time. Reliability is highly dependent on the availability of the
underlying systems, so the likelihood of system failures must be taken into account.
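To illustrate how a service depends on the availability of its underlying systems, the sketch below combines component availabilities. It assumes component failures are independent, which real systems only approximate; the function names and sample values are illustrative.

```python
from math import prod


def serial_availability(component_availabilities):
    """Availability of a request path that needs every component to be up."""
    return prod(component_availabilities)


def redundant_availability(replica_availability, replicas):
    """Availability when any one of N independent replicas is sufficient."""
    return 1 - (1 - replica_availability) ** replicas


# A service that needs a load balancer, an app tier, and a database, in series.
print(serial_availability([0.9999, 0.999, 0.9995]))

# Two redundant replicas of a 99% available component.
print(redundant_availability(0.99, 2))
```

A chain of dependencies is always less available than its weakest link, while redundancy raises availability above that of any single replica.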
The following guidelines and architectural patterns can be used to ensure reliability for GCP services.
Scalability
Scalability is the ability of a service to adapt its infrastructure to the load on the system. It is the process of adding and removing infrastructure resources to
meet workload demands efficiently. Stateless applications are easy to scale horizontally. Stateful applications are difficult to scale horizontally, so vertical
scaling is often the first choice for them.
The following guidelines and architectural patterns can be used to ensure scalability for GCP services.
• Compute
  • Autoscaling can be configured based on average CPU utilisation, load balancer capacity, or Cloud Monitoring metrics.
  • Kubernetes Engine autoscales the number of nodes (VMs) in a cluster.
• Storage
  • Zonal and regional persistent disks and persistent SSDs
  • Managed databases scale themselves based on the workload.
• Serverless and managed services scale automatically.
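Utilisation-based autoscaling can be pictured with the proportional rule below. This is a simplified sketch, not the actual Compute Engine or Kubernetes Engine autoscaler, which also apply stabilisation windows and cool-down periods; the function name, parameters, and limits are assumptions.

```python
import math


def desired_instances(current_instances, current_utilisation_pct,
                      target_utilisation_pct, min_instances=1, max_instances=10):
    """Scale the instance count in proportion to utilisation over its target."""
    if target_utilisation_pct <= 0:
        raise ValueError("target utilisation must be positive")
    raw = math.ceil(current_instances * current_utilisation_pct / target_utilisation_pct)
    # Clamp to the configured bounds of the group.
    return max(min_instances, min(max_instances, raw))


print(desired_instances(4, 90, 60))   # overloaded group: scale out
print(desired_instances(10, 30, 60))  # underused group: scale in
```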
Durability
Durability measures the likelihood that a stored object will be retrievable in the future.
Cloud Storage is designed for 11 nines (99.999999999%) of annual durability, which means it is extremely unlikely that an object stored in Cloud Storage will be lost.
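To put 11 nines in perspective, the sketch below estimates expected object loss. It treats the durability figure as an independent per-object probability of surviving one year, which is a simplifying assumption; the function names are illustrative.

```python
from fractions import Fraction

# 11 nines, held as an exact fraction to avoid floating-point rounding.
ANNUAL_DURABILITY = Fraction("0.99999999999")


def expected_objects_lost_per_year(object_count):
    """Expected number of objects lost in one year."""
    return object_count * (1 - ANNUAL_DURABILITY)


def years_per_expected_loss(object_count):
    """Average number of years before a single object is expected to be lost."""
    return 1 / expected_objects_lost_per_year(object_count)


print(years_per_expected_loss(10_000_000))  # 10000
```

Under these assumptions, storing ten million objects would be expected to lose one object roughly every 10,000 years.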
Failure domains:
• Machine-level – Hardware failure of an individual machine.
• Zonal – An entire zone becomes unavailable because of a building fire, power outage, fibre-optic cable loss, or network isolation.
• Regional – All zones within a region become unavailable. Examples are hurricanes and large-scale earthquakes.
For managed services like BigQuery, data is stored in a single region but backed up in a geographically separated region to provide resilience to a regional
disaster. For soft failures, data is never lost. For hard failures (flood, terrorist attack, earthquake, hurricane), recent data that has not yet been backed up
would be lost.
Architecture: Microservices
Microservices refers to an architectural style for developing applications. Microservices allow a large application to be decomposed into independent constituent parts, with each part having its
own realm of responsibility. To serve a single user or API request, a microservices-based application can call many internal microservices to compose its response.
[Diagram: example microservices: Pricing Service, Image Service, Metadata Service, Recommendation Service]
• Well-architected microservices design: draw service boundaries around stateful and stateless services.
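As a sketch of how one request can be composed from several internal microservices, the example below fans out to the four services named above and merges their answers. The service functions are hypothetical in-process stand-ins for network calls, and every field name and value is made up for illustration.

```python
import asyncio


# Hypothetical stand-ins; a real system would make HTTP or gRPC calls here.
async def pricing_service(product_id):
    await asyncio.sleep(0)
    return {"price": 19.99, "currency": "USD"}


async def image_service(product_id):
    await asyncio.sleep(0)
    return {"thumbnail": f"/images/{product_id}.png"}


async def metadata_service(product_id):
    await asyncio.sleep(0)
    return {"title": "Example product"}


async def recommendation_service(product_id):
    await asyncio.sleep(0)
    return {"related": ["sku-456", "sku-789"]}


async def product_page(product_id):
    """Call the four services concurrently and compose a single response."""
    pricing, image, metadata, recommendations = await asyncio.gather(
        pricing_service(product_id),
        image_service(product_id),
        metadata_service(product_id),
        recommendation_service(product_id),
    )
    return {
        "product_id": product_id,
        "pricing": pricing,
        "image": image,
        "metadata": metadata,
        "recommendations": recommendations,
    }


page = asyncio.run(product_page("sku-123"))
print(page)
```

Because each service owns its own realm of responsibility, the composing endpoint only needs to know each service's interface, not its implementation.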
Architecture: The Twelve-Factor App
[Figure: Twelve Factor Cloud Pipeline]
1. Codebase
One codebase tracked in revision control, many deploys. Ex – Git
2. Dependencies
Explicitly declare and isolate dependencies. Ex – npm, pip, Maven
3. Config
Store config in the environment
4. Backing services
Treat backing services as attached resources. Ex – URL access for storage services
5. Build, release, run
Strictly separate build and run stages
6. Processes
Execute the app as one or more stateless processes
7. Port binding
Export services via port binding
8. Concurrency
Scale out via the process model
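9. Disposability
Maximize robustness with fast startup and graceful shutdown
10. Dev/prod parity
Keep development, staging, and production as similar as possible
11. Logs
Treat logs as event streams
12. Admin processes
Run admin/management tasks as one-off processes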
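Two of the factors above, Config and Logs, can be sketched in a few lines. The environment variable names (PORT, DATABASE_URL, LOG_LEVEL) and their defaults are assumptions made for this example, not part of the twelve-factor text.

```python
import logging
import os
import sys


def load_config(environ=os.environ):
    """Factor 3: read configuration from the environment, not from code."""
    return {
        "port": int(environ.get("PORT", "8080")),
        "database_url": environ.get("DATABASE_URL", "sqlite:///local.db"),
        "log_level": environ.get("LOG_LEVEL", "INFO"),
    }


def build_logger(level):
    """Factor 11: write logs to stdout as an event stream, never to files."""
    logger = logging.getLogger("app")
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.handlers = [handler]
    logger.setLevel(level)
    return logger


config = load_config()
log = build_logger(config["log_level"])
log.info("listening on port %d", config["port"])
```

The same build can then run unchanged in every deploy, with only the environment differing, and the execution environment decides where the log stream is routed.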
Quantitative requirements are artefacts that can be measured for service-level outcomes. The type of system being evaluated determines the
data that can be measured.
• Service level objectives (SLOs) specify a target level for the reliability of
your service. Because SLOs are key to making data-driven decisions
about reliability, they're at the core of SRE practices.
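A minimal sketch of checking a measurement against an SLO is shown below. It assumes a request-based target (the fraction of successful requests in a window); the allowed-failure count is what SRE practice commonly calls the error budget, a term the slide itself does not use. All names and sample numbers are illustrative.

```python
def measured_success_ratio(good_requests, total_requests):
    """The measured service level: fraction of requests that succeeded."""
    return good_requests / total_requests


def meets_slo(good_requests, total_requests, slo_target):
    """True when the measured ratio is at or above the SLO target."""
    return measured_success_ratio(good_requests, total_requests) >= slo_target


def allowed_failures(total_requests, slo_target):
    """Failures the target tolerates in the window (the error budget)."""
    return (1 - slo_target) * total_requests


total, good, target = 1_000_000, 999_250, 0.999
print(meets_slo(good, total, target))
print(allowed_failures(total, target) - (total - good))  # budget still unspent
```

Comparing failures so far with the allowed failures gives a data-driven basis for deciding whether to ship more changes or to prioritise reliability work.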
DORA's research program: DORA's State of DevOps research program represents an independent view into the practices and
capabilities that drive high performance in technology delivery and ultimately organizational outcomes. The
research uses behavioral science to identify the most effective and efficient ways to develop and deliver
software.
Cloud Build: Cloud Build is a service that executes builds on Google Cloud Platform infrastructure.
Cloud Build can import source code from Google Cloud Storage, Cloud Source
Repositories, GitHub, or Bitbucket, execute a build to specifications, and produce artifacts
such as Docker containers or Java archives.
Artifact Registry: Artifact Registry provides a single location for managing packages and Docker
container images. It integrates with CI/CD tools and Google Cloud runtime
environments to manage the full artifact lifecycle.
Google Cloud's Architecture Framework describes best practices, makes implementation recommendations, and helps design cloud deployments for business needs.
Operational excellence
This section explores how operational excellence results from efficiently running, managing, and monitoring systems that deliver business value.
Use these strategies to achieve operational excellence:
✓ Automate build, test, and deploy. Use continuous integration and continuous deployment (CI/CD) pipelines to build automated testing into your
releases. Perform automated integration testing and deployment.
✓ Monitor business objective metrics. Define, measure, and alert on relevant business metrics.
✓ Conduct disaster recovery testing. Don't wait for a disaster to strike. Instead, periodically verify that your disaster recovery procedures work, and
test the processes regularly.
Reliability
This section describes how to apply technical and procedural requirements to architect and operate reliable services on Google Cloud.
Use these strategies to achieve reliability:
✓ Reliability is defined by the user. For user-facing workloads, measure UX metrics. For batch and streaming workloads, measure job KPIs.
✓ Create redundancy, include horizontal scaling, ensure overload tolerance, and prevent traffic spikes.
✓ Test failure recovery, detect failures, and make incremental changes.
✓ Create, document, and automate emergency response.
✓ Reduce toil. Continually aim to reduce or eliminate toil. Otherwise, operational work will eventually overwhelm operators, leaving little room for growth.