Welcome: Classroom Session
Troubleshooting
Software / Hardware Requirements and Recommendations
VERITAS NetBackup
Veritas NetBackup offers a single console for all your backup and recovery operations.
What is a Cluster?
A cluster is a group of servers and other resources that act like a single system and enable high availability and, in some cases, load balancing and parallel processing. Each system, or node, runs its own OS, and the nodes cooperate at the software level to form a cluster. A cluster links commodity hardware with intelligent software to provide application failover and control. A cluster consists of the following:
Cluster Terminology
Cluster: The name of your HA environment.
Nodes: The physical systems that make up the cluster.
Service group: A virtual container that holds all the hardware and software resources required to run the managed application.
Attributes: Parameter values that define the resources.
Agents: Multi-threaded processes that provide the logic to manage resources.
Dependencies: Links between resources or service groups.
Switchover: A switchover is an orderly shutdown of an application and its supporting resources on one server, followed by a controlled startup on another server.
Failover: A failover is similar to a switchover, except that the ordered shutdown of applications on the original node may not be possible, so the services are simply started on another node.
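The difference between the two can be sketched in a few lines of Python. This is a toy model only, not VCS code: the `Node` class, the helper functions, and the resource names in `GROUP` are all hypothetical and exist purely for illustration.

```python
from dataclasses import dataclass, field

# Illustrative resource stack, listed bottom-up (assumed names, not a real config).
GROUP = ["nic", "ip", "diskgroup", "volume", "mount", "oracle"]

@dataclass
class Node:
    name: str
    reachable: bool = True
    running: list = field(default_factory=list)  # resources online, in start order

def start_group(node, resources):
    """Bring resources online bottom-up, in the order given."""
    for res in resources:
        node.running.append(res)

def stop_group(node):
    """Take resources offline top-down and return the order they were stopped in."""
    stopped = []
    while node.running:
        stopped.append(node.running.pop())
    return stopped

def switchover(source, target, resources):
    """Orderly shutdown on the source, then controlled startup on the target."""
    stopped = stop_group(source)
    start_group(target, resources)
    return stopped

def failover(source, target, resources):
    """The source may be unable to shut down cleanly; start on the target anyway."""
    stopped = stop_group(source) if source.reachable else []
    start_group(target, resources)
    return stopped
```

In a switchover the returned list is the full stack in reverse order; in a failover from an unreachable node it is empty, because nothing could be stopped in an orderly way before the target took over.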
The following diagram illustrates the use of the regular and virtual IP addresses in the cluster configuration: Administration of the nodes uses IP1/IP2 and IP3/IP4, respectively. The cluster IP address for external clients over the public network uses VIP1.
Cluster Communication
Cluster membership is defined in the primary cluster configuration file, which is simply an ASCII file that the administrator edits. A cluster communication path is any network path between the cluster nodes.
VCS agents track the state of all resources and service groups in the cluster.
HAD (High Availability Daemon) polls the various agents on the node and, if there is a change, reports it to GAB.
GAB (Group Membership Services/Atomic Broadcast) has two jobs.
First, it tracks which systems are part of the cluster. Cluster membership is defined by systems sharing the same cluster ID and a pair of redundant Ethernet LLT cables.
GAB's second job is to transmit resource status changes to all nodes in the cluster. The "atomic broadcast" portion of the name implies, correctly, that all systems in the cluster are notified of any change. If a failure occurs during the update, the status change is rolled back, ensuring that, upon recovery, all nodes have the same status information. It is the same paradigm as a database commit, if that is familiar to you.
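The all-or-nothing behaviour described above can be illustrated with a small Python sketch. It is a toy model of the commit/rollback idea only, not GAB's actual protocol; `NodeView`, `DeliveryError`, and `atomic_broadcast` are hypothetical names invented for this example.

```python
class DeliveryError(Exception):
    """Raised when a node fails to accept an update (hypothetical)."""

class NodeView:
    """One node's copy of the cluster-wide resource status."""
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.status = {}

    def apply(self, resource, state):
        if not self.healthy:
            raise DeliveryError(f"{self.name} did not acknowledge the update")
        self.status[resource] = state

def atomic_broadcast(nodes, resource, state):
    """Apply the status change on every node, or on none of them."""
    # Snapshot every node first so a partial update can be undone.
    snapshots = [(node, dict(node.status)) for node in nodes]
    try:
        for node in nodes:
            node.apply(resource, state)
    except DeliveryError:
        # Roll back: restore the pre-update view on all nodes.
        for node, old_status in snapshots:
            node.status = old_status
        return False
    return True
```

If one node fails to accept the change, every node is restored to its earlier view, so all nodes keep agreeing on the resource status.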
LLT (Low Latency Transport) is responsible for transmitting the heartbeat signals that GAB uses to maintain cluster membership. A cluster can have between 2 and 8 LLT cables. LLT links can be identified as low or high priority.
VCS Architecture
Agents monitor resources on each system and provide status to HAD on the local system
HAD on each node takes corrective action, such as failover, when necessary
Cluster Startup
Here is what the cluster does at startup:
The node checks whether the other node is already started; if so, it stays OFFLINE.
If no other machine is running, it checks communication (gabconfig). This may need system administrator intervention if the cluster requires both nodes to be available (/sbin/gabconfig -c -x).
Once communication between the machines is open, or gabconfig has been started, it sets up the network (NIC and IP address) and starts the cluster server.
It also brings up the volume manager, the file system, and then the application (Oracle).
If any of the critical processes fail, the whole system is faulted.
The most common reason for failure is expired licenses, so check the licenses with vxlicense -p before doing any work.
hashadow-log_A: hashadow checks whether the HA cluster daemon (had) is up and restarts it if needed. This is the log of that process.
engine.log_A: primary log, usually what you will be reading for debugging
Oracle_A: Oracle process log (related to cluster only)
Sqlnet_A: sqlnet process log (related to cluster only)
IP_A: related to the shared IP
Volume_A: related to Volume Manager
Mount_A: related to mounting the actual filesystems
DiskGroup_A: related to Volume Manager/Cluster Server
NIC_A: related to the actual network device
Look at the most recent ones for debugging purposes (ls -ltr).
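For scripted checks, the "most recent logs" lookup can be approximated in Python. This is a minimal sketch: `newest_logs` is a hypothetical helper, and the log directory is left as a parameter because its location is not stated here.

```python
from pathlib import Path

def newest_logs(log_dir, limit=5):
    """Return up to `limit` file names from log_dir, oldest first and newest
    last (the same ordering that `ls -ltr` shows)."""
    files = [p for p in Path(log_dir).iterdir() if p.is_file()]
    files.sort(key=lambda p: p.stat().st_mtime)
    return [p.name for p in files[-limit:]]
```

Calling `newest_logs(some_log_dir, limit=3)` returns the three most recently modified files, with the newest one last, which is usually where to start reading.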
Network conf: /etc/gabtab
If it contains /sbin/gabconfig -c -n2, you will need to run /sbin/gabconfig -c -x if only one system comes up after both systems were down.
Cluster conf: /etc/VRTSvcs/conf/config/main.cf
Has exact details on what the cluster contains.
Most executables are in /opt/VRTSvcs/bin or /sbin.
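The gabtab rule above can be expressed as a small check. The sketch below is illustrative only: `seed_count` and `needs_manual_seed` are hypothetical helpers, and the parsing assumes the file holds a single gabconfig line of the form shown above.

```python
import re

def seed_count(gabtab_text):
    """Return N from a `gabconfig ... -nN` line, or None if no -n option is present."""
    for line in gabtab_text.splitlines():
        if "gabconfig" not in line:
            continue
        match = re.search(r"(?:^|\s)-n\s*(\d+)(?:\s|$)", line)
        if match:
            return int(match.group(1))
    return None

def needs_manual_seed(gabtab_text, nodes_up):
    """True when fewer nodes are up than gabtab waits for, so an administrator
    would have to run `/sbin/gabconfig -c -x` by hand."""
    required = seed_count(gabtab_text)
    return required is not None and nodes_up < required
```

With the `-n2` line from above and only one node up, `needs_manual_seed` returns True, which matches the case where both systems were down and only one came back.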
vxlicense -p
If any licenses are invalid or expired, get them FIXED before continuing!
All licenses should say "No expiration". If ANY license has an actual expiration date, the test failed. Permanent licenses do NOT have an expiration date.
Non-essential licenses may be moved; however, a senior admin should do this.
If only one system is shown, start the other system with hastart.
Note: one system should ALWAYS be OFFLINE for the way we configure systems here. (If we ran Oracle Parallel Server, this could change, but currently we run standard Oracle server.)
If both systems are up but OFFLINE, hastart did NOT correct the problem, and the Oracle filesystems are not running on either system, the cluster needs to be reset. (This happens under strange network situations with GE Access.) In other words, you ran hastart and that was not enough to get the full cluster to work.
VCS Troubleshooting
Hopefully we are all now familiar with VCS concepts and ready for the lab session.