
TA050 System Availability Strategy

The document outlines the System Availability Strategy for <Company Short Name>, detailing approaches to ensure high system availability and minimize downtime during outages. It covers various types of failures, including physical hardware, network components, and software, along with strategies for recovery and maintenance. The document serves as a guide for managing system availability to support business operations effectively.


AIM

TA.050 SYSTEM AVAILABILITY STRATEGY
<Company Long Name>
<Subject>

Author: <Author>
Creation Date: May 29, 1999
Last Updated: June 10, 1999
Document Ref: <Document Reference Number>
Version: DRAFT 1A

Approvals:

<Approver 1>

<Approver 2>
TA.050 System Availability Strategy Doc Ref: <Document Reference Number>
June 10, 1999

Document Control

Change Record

Date | Author | Version | Change Reference
-----+--------+---------+-----------------
29-May-99 | <Author> | Draft 1a | No Previous Document

Reviewers

Name Position

Distribution

Copy No. | Name | Location
---------+------+---------
1 | Library Master | Project Library
2 | Project Manager |
3 | |
4 | |

Note To Holders:

If you receive an electronic copy of this document and print it out, please
write your name on the equivalent of the cover page, for document
control purposes.

If you receive a hard copy of this document, please write your name on
the front cover, for document control purposes.

<Subject> Physical (Hardware or Network) Component Failure


File Ref: 877741618.doc (v. DRAFT 1A )
Company Confidential - For internal use only

Contents

Document Control
Introduction
    Purpose
    Scope
    Definitions
Critical Systems Availability
Unplanned System Outages
Planned System Outages
Physical (Hardware or Network) Component Failure
    Database Server Failure
    Application Server Failure
    Network Failure
    Data Center Failure
    Client Desktop Failure
Software Component Failure
    Database Software Failure
    Application Software Failure
Century Date Failure
    Century Date Compliance
    Interface Failure
    Reporting Failure
Maintenance Outages
    Database Maintenance
    Software Maintenance
Open and Closed Issues for this Deliverable
    Open Issues
    Closed Issues


Introduction

Purpose

The purpose of the System Availability Strategy document is to describe
the high-level system strategies and approaches that <Company Short Name>
will use to provide the level of system availability needed to support
the new business operations.

System availability is a key element of an architecture design. The
applications, databases, hardware, and networks must be selected or
designed carefully so that the integrated system maintains the minimum
required continuous availability and so that, when outages do occur,
they can be rectified within the allowable downtime metrics.

There are many points of failure in a complex system of applications,
databases, hardware, and networks. Only by properly considering all of
them can the business be confident that unplanned outages will not
create unacceptable operational downtime, especially during periods of
peak demand such as the end of fiscal periods.

Note: It is important to remember that system outages can often be
avoided by a well-planned approach to systems management using
proactive monitoring and maintenance wherever possible.

Scope

This document covers the following data centers:



The following architecture components (database servers, file servers,
and so on) are covered in this document:



Definitions

Mean Time To Failure (MTTF)

Mean Time To Recover (MTTR)
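
The two measures above are often combined into a single availability
figure. The following is a minimal sketch, assuming the commonly used
steady-state relationship availability = MTTF / (MTTF + MTTR). The
function names and the sample figures are illustrative only and are not
part of this deliverable.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year


def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Fraction of time a component is expected to be up.

    Assumes the steady-state relationship MTTF / (MTTF + MTTR).
    """
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTTF must be positive and MTTR must not be negative")
    return mttf_hours / (mttf_hours + mttr_hours)


def outage_frequency_per_year(mttf_hours: float) -> float:
    """Approximate outages per year, using the 1/MTTF rate scaled to a year."""
    if mttf_hours <= 0:
        raise ValueError("MTTF must be positive")
    return HOURS_PER_YEAR / mttf_hours


if __name__ == "__main__":
    # Hypothetical component: fails every 2,000 hours, 4 hours to recover.
    print(f"availability: {availability(2000, 4):.4%}")
    print(f"outages per year: {outage_frequency_per_year(2000):.2f}")
```

A shorter MTTR raises availability even when MTTF is unchanged, which is
why the recovery strategies in later sections matter as much as the
prevention strategies.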


Critical Systems Availability


This section summarizes the major (and most critical) system
availability requirements.

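Critical System Component or Application | Component Type: System, Application, Function | Availability Requirement | Acceptable Downtime | Comments
-----------------------------------------+-----------------------------------------------+--------------------------+---------------------+---------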
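
When filling in the Availability Requirement and Acceptable Downtime
columns, it can help to check that the two figures agree. The sketch
below converts an availability percentage into a downtime budget. It
assumes availability is measured over a full calendar year of
continuous operation; the function name and the 99.9% figure are
hypothetical examples, not requirements taken from this document.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: 24x7 operation, non-leap year


def allowed_downtime_hours(availability_pct: float,
                           period_hours: float = HOURS_PER_YEAR) -> float:
    """Downtime budget implied by an availability percentage over a period."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    return period_hours * (1 - availability_pct / 100)


if __name__ == "__main__":
    # Hypothetical requirement of 99.9% over one year.
    print(f"{allowed_downtime_hours(99.9):.2f} hours per year")
```

If the business only needs the system during working hours, pass the
number of scheduled service hours as period_hours instead of the full
year.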


Unplanned System Outages


This section summarizes the main unplanned system outages caused by
failures in different components of the overall system, and the strategy
that will be used to handle each. The table also summarizes the special
architecture components or procedures that will need to be built into
the architecture to handle system failure and provide the desired level
of system availability.

Code | Unplanned Outage | Est. Outage Frequency without Strategy (1/MTTF) | Availability/Recovery Strategy | Est. Outage Frequency with Strategy (1/MTTF) | Predicted TTR | Comments
-----+------------------+-------------------------------------------------+--------------------------------+----------------------------------------------+---------------+---------
<Enter the codes you use below> | Single database server disk outage | | 2-way disk mirroring | | | Compute from MTTF per disk as per manufacturer's spec. MTTF takes into account entire disk array.
 | Single database I/O controller outage | | | | | Compute from MTTF per disk as per manufacturer's spec.
 | Single database server CPU outage | | | | |
 | Database server outage | | | | |
 | Single Application Client File server disk outage | | | | |
 | Single Application Client File server CPU outage | | | | |
 | Application Client File server outage | | | | |
 | Local area network outage | | | | |
 | Wide area network outage within US | | | | |
 | Wide area network outage outside US | | | | |
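
The table asks for disk outage frequencies to be computed from the
manufacturer's MTTF per disk, taking the whole disk array into account.
The following is a rough sketch of that arithmetic under stated
assumptions: disks fail independently at a constant rate, and a 2-way
mirrored pair is lost only if the second disk fails while the first is
being replaced (the usual approximation MTTF squared divided by twice
the MTTR, valid when MTTR is much smaller than MTTF). All names and
figures are hypothetical.

```python
def unmirrored_outage_rate(disk_mttf_hours: float, disk_count: int) -> float:
    """Outages per hour when any single disk failure causes an outage."""
    return disk_count / disk_mttf_hours


def mirrored_outage_rate(disk_mttf_hours: float, pair_count: int,
                         disk_mttr_hours: float) -> float:
    """Outages per hour for an array of 2-way mirrored pairs.

    Uses the approximation pair MTTF = MTTF**2 / (2 * MTTR), which
    assumes independent failures and MTTR much smaller than MTTF.
    """
    pair_mttf_hours = disk_mttf_hours ** 2 / (2 * disk_mttr_hours)
    return pair_count / pair_mttf_hours


if __name__ == "__main__":
    # Hypothetical array: 10 disks, 100,000-hour MTTF, 24 hours to replace.
    without = unmirrored_outage_rate(100_000, 10)
    with_mirror = mirrored_outage_rate(100_000, 5, 24)
    print(f"without strategy: {without:.2e} outages/hour")
    print(f"with 2-way mirroring: {with_mirror:.2e} outages/hour")
```

The two results correspond to the "without Strategy" and "with
Strategy" frequency columns for the disk outage row.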


Planned System Outages


This section summarizes planned outages that are predicted to be
needed for regular or routine system maintenance.

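Code | Planned Outage | Outage Frequency | Predicted Schedule | Predicted Duration | Details
-----+----------------+------------------+--------------------+--------------------+--------
 | Oracle Database Software Upgrade | | | |
 | Oracle Database Software Patch | | | |
 | Oracle Applications Client Software Upgrade | | | |
 | Oracle Applications Server Software Upgrade | | | |
 | Oracle Applications Software Patch | | | |
 | Oracle Database Cold Backup | | | |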
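
Once frequencies and durations have been entered, the total planned
downtime per year follows by multiplying the two for each row and
summing. The sketch below shows one way to do this. The PlannedOutage
type is hypothetical, and the durations are placeholders; only the
weekly cold backup frequency is taken from the Database Maintenance
section of this document.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class PlannedOutage:
    name: str
    occurrences_per_year: float
    duration_hours: float  # placeholder values, to be replaced with estimates


def annual_planned_downtime(outages: Iterable[PlannedOutage]) -> float:
    """Total planned downtime in hours per year (frequency x duration)."""
    return sum(o.occurrences_per_year * o.duration_hours for o in outages)


if __name__ == "__main__":
    schedule = [
        PlannedOutage("Oracle Database Cold Backup", 52, 2.0),
        PlannedOutage("Oracle Database Software Patch", 4, 3.0),
    ]
    print(f"{annual_planned_downtime(schedule):.1f} hours per year")
```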


Physical (Hardware or Network) Component Failure


Each of the hardware or network components in the information systems
architecture can fail in different ways, and it is important to document
how each component will be protected against the various failures (if at
all). This section focuses on physical component failures. The failure
analysis is subdivided into failures for:

• database servers
• application file servers
• networks
• data centers
• client desktop machines

For each type of component, all possible physical failure events are
considered for every machine or component within the scope of this
analysis.

Database Server Failure

This section lists the causes, results and the strategy for providing
continuing availability through database server failure or for recovering
from the failure.

Failure Cause | Failure Code | Result | Availability/Recovery Strategy | Comments
--------------+--------------+--------+--------------------------------+---------
Loss of power supply | DS-1 | Loss of all server resident application and database processing | Uninterruptible power supply |
Failed CPU | DS-2 | | |
Failed System Bus | DS-3 | | |
Failed Memory | DS-4 | | |
Loss or corruption of a single disk | DS-5 | Failure of database instances that access data or control structures on the disk | 2-way disk mirror |
 | DS-6 | Failure of application processing for processes that need to read application code from the failed disk | |
Loss of a disk I/O Controller | DS-7 | Failure of database instances that access data or control structures through that controller | 2-way disk mirror |
 | DS-8 | Failure of application processing for processes that need to read application code through the controller | |
<Failure Cause> | | | |


Application Server Failure

This section lists the causes, results, and the strategy for providing
continuing availability through application server failure, or for
recovering from the failure. This applies to the following application
servers:

• <Application Server Name>

Failure Cause | Failure Code | Result | Availability/Recovery Strategy | Comments
--------------+--------------+--------+--------------------------------+---------
Loss of power supply | FS-1 | Loss of all application network client file services to desktop PCs | Uninterruptible power supply |
Failed CPU | FS-2 | Loss of all application network client file services to desktop PCs | |
Failed System Bus | FS-3 | Loss of all application network client file services to desktop PCs | |
Failed Memory | FS-4 | Loss of all application network client file services to desktop PCs | |
Loss or corruption of a single disk | FS-5 | Loss of all application network client file services to desktop PCs | |
Loss of a disk I/O Controller | FS-6 | Loss of all application network client file services to desktop PCs | |
<Failure Cause> | | | |

Network Failure

Network failures can occur within both the local and the wide area
networks (LANs and WANs).

Network failures can affect business applications in two main areas:

• loss of communication between a client and a server machine,
  preventing user access to the system in the Oracle Applications
  client-server architecture
• inability to transfer interface data between applications or
  databases linked through the failed network segment

The network segment failure may also affect systems management
capability by preventing:

• network transfer of files
• network monitoring of systems and applications

Failure Cause | Failure Code | Result | Availability/Recovery Strategy | Comments
--------------+--------------+--------+--------------------------------+---------
Loss of a LAN segment | | Loss of communication between desktop clients and DB server. Application processing stops. | |
Loss of Router | | | |
Loss of WAN connection between US and AsiaPac | | Loss of application interface data transfers. | Redundant leased line connection between US and AsiaPac data centers |
<Failure Cause> | | | |

Data Center Failure

Data Center failures entail the loss of functioning of all physical
architecture components that are resident in the data center or the data
center service area. The loss of a data center is an extreme or
catastrophic event that is very rare but needs to be considered as a
remote possibility.

Failure Cause | Failure Code | Result | Availability/Recovery Strategy | Comments
--------------+--------------+--------+--------------------------------+---------
Civil strife, earthquake, flood | DC-1 | Loss of all application and database processing resident in the data center | Geographically remote disaster recovery site in... |
<Failure Cause> | | | |

Client Desktop Failure

Client desktop PCs can fail because of:

• virus infection through the network or a copied disk
• non-standard (non-IS-approved) software loaded by users, causing
  unpredictable PC resource usage
• disk corruption
• PC operating system bugs causing software failure

Failure Cause | Failure Code | Result | Availability/Recovery Strategy | Comments
--------------+--------------+--------+--------------------------------+---------
Virus infection | PC-1 | Partial or complete loss of regular PC functions | Backup spare PCs | Assume PC will require disinfecting and reload of standard IS approved software
Load of non-approved software | PC-2 | Partial loss of regular PC function | IS standards imposed for PC usage by users. Backup spare PC can be made available | Deinstall of offending software. Reconfiguration of PC may be necessary.
Disk corruption | PC-3 | Partial or complete loss of regular PC functions | PC disk backups. Backup spare PC can be made available |
<Failure Cause> | | | |


Software Component Failure


This section focuses on software component failures. The failure
analysis is subdivided into failures at the level of:

• database software
• application software

This section does not include failures that are caused by the failure of
physical system components. These failures are discussed in the Physical
(Hardware or Network) Component Failure section.

Software component failures include these sources:

• user errors
• operations (maintenance) staff errors
• software bugs
• lack of adequate database space management

Database Software Failure

Failure Cause | Failure Code | Result | Availability/Recovery Strategy | Comments
--------------+--------------+--------+--------------------------------+---------
Database object dropped by IS Operations staff | DB-1 | Users unable to access or process the dropped object | Restore using database backups |
Inadequate database space in structures | DB-2 | Users unable to create new data in the affected database structure | Regular scheduled database monitoring. Emergency shutdown and reconfiguration of database. |
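
The DB-2 strategy relies on regular scheduled monitoring of database
space. As an illustration only, the sketch below flags structures that
have crossed a warning threshold so that space can be added before
users are affected. The input dictionary, the structure names, and the
85% threshold are all hypothetical; in practice the usage figures would
come from whatever monitoring tool the project selects.

```python
from typing import Dict, List, Tuple


def structures_low_on_space(usage: Dict[str, Tuple[float, float]],
                            warn_pct: float = 85.0) -> List[Tuple[str, float]]:
    """Return (name, percent used) for structures at or above warn_pct.

    usage maps a structure name to (used_mb, allocated_mb).
    """
    flagged = []
    for name, (used_mb, allocated_mb) in sorted(usage.items()):
        if allocated_mb <= 0:
            raise ValueError(f"{name}: allocated space must be positive")
        pct_used = 100.0 * used_mb / allocated_mb
        if pct_used >= warn_pct:
            flagged.append((name, round(pct_used, 1)))
    return flagged


if __name__ == "__main__":
    # Hypothetical snapshot of two database structures.
    snapshot = {"GL_DATA": (900, 1000), "AP_DATA": (400, 1000)}
    for name, pct in structures_low_on_space(snapshot):
        print(f"WARNING: {name} is {pct}% full")
```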

Application Software Failure

Failure Cause | Failure Code | Result | Availability/Recovery Strategy | Comments
--------------+--------------+--------+--------------------------------+---------
Database object dropped by IS Operations staff | DB-1 | Users unable to access or process the dropped object | Restore using database backups |
Inadequate database space in structures | DB-2 | Users unable to create new data in the affected database structure | Regular scheduled database monitoring. Emergency shutdown and reconfiguration of database. |


Century Date Failure


This section focuses on Century Date related failures. The failure
analysis is subdivided into failures at the level of:

• interfaces
• reporting and analysis

This section does not include failures that are caused by the failure of
physical system components. These failures are discussed in the Physical
(Hardware or Network) Component Failure section.

Century Date failures include these sources:

• user errors
• operations (maintenance) staff errors
• software bugs (interfaces)
• improper access to data for reporting and analysis

Century Date Compliance

In the past, two-character date coding was an acceptable convention
because of the perceived cost of the additional disk and memory storage
that full four-character date encoding requires. As the year 2000
approached, it became evident that a full four-character coding scheme
was more appropriate.

In the context of the Application Implementation Method (AIM), the
convention Century Date or C/Date support is used rather than Year2000
or Y2K support. Coding for any future Century Date is now the modern
business and technical convention.

Every applications implementation team needs to consider the impact of
the century date on their implementation project. As part of the
implementation effort, all customizations, legacy data conversions,
custom interfaces, data extraction mechanisms, and architecture
components need to be reviewed for Century Date compliance.
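
To illustrate the kind of logic such a review looks for, the sketch
below expands a legacy two-character year into a four-character year
using a fixed pivot ("windowing"). This is one common conversion
technique, not a method prescribed by this document; the function name
and the pivot value of 50 are assumptions that would have to be agreed
for each data source.

```python
def expand_two_digit_year(yy: int, pivot: int = 50) -> int:
    """Expand a two-character year to four characters using a pivot window.

    Years at or above the pivot are taken as 19xx; years below it as 20xx.
    The pivot of 50 is an illustrative assumption.
    """
    if not 0 <= yy <= 99:
        raise ValueError("two-character year must be between 0 and 99")
    return 1900 + yy if yy >= pivot else 2000 + yy


if __name__ == "__main__":
    for yy in (99, 0, 49, 50):
        print(f"{yy:02d} -> {expand_two_digit_year(yy)}")
```

Windowing only defers the ambiguity, so converted data should be stored
with the full four-character year.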

Interface Failure

Failure Cause | Failure Code | Result | Availability/Recovery Strategy | Comments
--------------+--------------+--------+--------------------------------+---------
Interface truncates four-character date | CD-1 | Possible generation of date with wrong century | Repair interface, review and repair any data that was imported. |


Reporting Failure

Failure Cause | Failure Code | Result | Availability/Recovery Strategy | Comments
--------------+--------------+--------+--------------------------------+---------
Invalid selection of data in query due to invalid date logic in query. | CD-10 | Report/Query may yield incorrect results. | Repair query, rerun report/query and validate results. |
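
As a hypothetical example of the invalid date logic behind CD-10, the
sketch below compares a range test on two-character-year strings with
the same test on full dates. The YYMMDD string layout and both function
names are assumptions made for illustration.

```python
from datetime import date


def in_range_two_char_year(value: str, start: str, end: str) -> bool:
    """Flawed range test: compares YYMMDD strings character by character."""
    return start <= value <= end


def in_range_full_date(value: date, start: date, end: date) -> bool:
    """Correct range test: compares dates that carry the full century."""
    return start <= value <= end


if __name__ == "__main__":
    # A transaction on 15 January 2000, queried for 1 Dec 1999 to 31 Jan 2000.
    flawed = in_range_two_char_year("000115", "991201", "000131")
    correct = in_range_full_date(date(2000, 1, 15),
                                 date(1999, 12, 1), date(2000, 1, 31))
    print(f"two-character year test selects the row: {flawed}")
    print(f"full date test selects the row: {correct}")
```

The string test drops the row because "00" sorts before "99", so a
report spanning the century boundary silently omits data. Validation of
repaired queries should include a period that crosses that boundary.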


Maintenance Outages

Maintenance outages are planned outage events that are required to
perform some form of maintenance on the system. Examples of system
maintenance events include:

• Regular database maintenance
    - database backups
    - database tuning
    - database space management
• Software maintenance
    - bug patches
    - software upgrade

Database Maintenance

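Maintenance Event | Code | Availability During Maintenance | Maintenance Strategy | Comments
------------------+------+---------------------------------+----------------------+---------
Cold database backup | DM-1 | Database and applications unavailable | Use hot backups as much as possible. Cold backups once per week. |
Data archive | | | |
Data purge | | | |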

Software Maintenance

Maintenance Event | Code | Availability During Maintenance | Maintenance Strategy | Comments
------------------+------+---------------------------------+----------------------+---------
Oracle Database Software Upgrade | SM-1 | Database and applications unavailable | |
Oracle Database Software Patch | SM-2 | Database and applications unavailable | |
Oracle Applications Client Software Upgrade | SM-3 | Applications unavailable | |
Oracle Applications File Server Software Upgrade | SM-4 | Applications unavailable | Minimize applications downtime by using Oracle Client Software Manager |
Oracle Applications Software Patch | SM-5 | Applications unavailable | Minimize applications downtime by using Oracle Client Software Manager |


Open and Closed Issues for this Deliverable

Open Issues

ID | Issue | Resolution | Responsibility | Target Date | Impact Date
---+-------+------------+----------------+-------------+------------

Closed Issues

ID | Issue | Resolution | Responsibility | Target Date | Impact Date
---+-------+------------+----------------+-------------+------------
