Clusteraware PDF
Cluster Management
Note Before using this information and the product it supports, read the information in Notices on page 27.
This edition applies to AIX Version 7.1 and to all subsequent releases and modifications until otherwise indicated in new editions. Copyright IBM Corporation 2010, 2012. US Government Users Restricted Rights Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
Contents
About this document
    Highlighting
    Case-sensitivity in AIX
    ISO 9000
Cluster management
    What's new in Cluster management
    Cluster Aware concepts
        Cluster repository
        Cluster system architecture flow
        Naming a cluster
        Cluster communication
        Deadman switch
    Configuring Cluster Aware
        Setting up cluster storage communication
        Configuring cluster security
    Managing clusters with commands
    Managing cluster events
    Programming cluster sockets
    Troubleshooting Cluster Aware
        Troubleshooting with the snap command
        Troubleshooting with node maintenance mode
        Troubleshooting with component trace
    Sample output for cluster commands
        clcmd date command sample output
        lscluster -d command sample output
        lscluster -i command sample output
        lscluster -m command sample output
        lscluster -s command sample output
        nodeState cluster event sample output
    Code samples for cluster events
        Cluster events using AHAFS sample code
        Cluster socket programming sample code
Notices
    Trademarks
Highlighting
The following highlighting conventions are used in this document:
Bold: Identifies commands, subroutines, keywords, files, structures, directories, and other items whose names are predefined by the system. Also identifies graphical objects such as buttons, labels, and icons that the user selects.
Italics: Identifies parameters whose actual names or values are to be supplied by the user.
Monospace: Identifies examples of specific data values, examples of text similar to what you might see displayed, examples of portions of program code similar to what you might write as a programmer, messages from the system, or information you should actually type.
Case-sensitivity in AIX
Everything in the AIX operating system is case-sensitive, which means that it distinguishes between uppercase and lowercase letters. For example, you can use the ls command to list files. If you type LS, the system responds that the command is not found. Likewise, FILEA, FiLea, and filea are three distinct file names, even if they reside in the same directory. To avoid causing undesirable actions to be performed, always ensure that you use the correct case.
ISO 9000
ISO 9000 registered quality systems were used in the development and manufacturing of this product.
Cluster management
The Cluster Aware function is part of the AIX operating system. Using Cluster Aware for AIX, you can create a cluster of AIX nodes and build a highly available architectural solution for a data center.
May 2012
The following information is a summary of the updates made to this topic collection:
- The Defining a virtual Ethernet adapter topic was added.
- The reservation policy in the Cluster repository section was updated.
- Information about 4-port 8 Gb adapter support and settings was added to the Setting up cluster storage communication topic.
December 2011
The following information is a summary of the updates made to this topic collection:
- The function and actions of the deadman switch as used in CAA are described.
- Troubleshooting CAA with component trace is defined.
- Migration is not supported for AIX 6 with 6100-07 or for AIX 7 with 7100-01. To upgrade from AIX 6.1 with 6100-06 of Cluster Aware AIX (CAA) or from AIX 7 with 7100-00 of CAA to AIX 6 with 6100-07 or to AIX 7 with 7100-01, first remove the cluster, and then install AIX 6 with 6100-07 or AIX 7 with 7100-01 on all nodes that will be included in the new cluster.
- CAA no longer uses an embedded IBM solidDB database. The bos.cluster.solid fileset still exists, but it is now obsolete. The solid and solidhac daemons are no longer used by CAA.
- The CAA infrastructure now provides limited support for some disks that are managed by vendor disk drivers. No disk events are available for these disks, but they can be configured into a cluster as a repository or as shared disks. See the documentation for the clustering product that you are using, such as IBM PowerHA SystemMirror for AIX, for a complete list of vendor disk devices that are supported for your environment.
- CAA commands no longer support force cleanup options. The following options, listed by command, are not supported in the 2011 release:
    chcluster -f
    clusterconf -f, -s, -u
    rmcluster -f
- The clctrl command can be used for tuning the cluster subsystem. Tune the cluster subsystem only at the direction of IBM customer support.
Cluster repository
The cluster repository disk is used as the central repository for the cluster configuration data.
The cluster repository disk must be accessible from all nodes in the cluster. The minimum size of the repository depends largely on the cluster configuration. A minimum disk size of 10 GB is preferred. For a VIOS or PowerHA pureScale cluster, see the respective release notes for the minimum size.

The cluster repository disk is backed up by a redundant and highly available storage configuration. The cluster repository disk should be configured for RAID to accommodate the requirements of the data center.

The cluster repository disk is a special device for the cluster. The use of LVM commands is not supported on the cluster repository disk. The AIX LVM commands are single-node administrative commands and are not applicable in a clustered configuration. Because of the special device characteristics that the cluster repository disk requires, cluster operations use both a raw section of the disk and a section of the disk that contains a special volume group and special logical volumes.

When CAA is configured with the repos_loss mode set to assert and CAA loses access to the repository disk, the system automatically shuts down.

Reservation policy for repository disk

The following is an explanation of the reservation policy used in Cluster Aware.

All storage area network (SAN) provisioned disks must be zoned to all Fibre Channel adapters on the Virtual I/O Servers that will be members of the shared storage pool cluster. The disks must have the reserve policy set to no_reserve. One disk with a minimum of 1 GB is used as the repository disk for the cluster.

Notes:
- Cluster Aware AIX (CAA) opens the repository disk, and CAA sets the ODM reserve attribute to no_reserve for all storage types.
- For nonrepository disks, use the chdev command to change the attribute to no_reserve.

Related information: chdev Command
- The node discovers all of the available communication interfaces.
- The cluster interface monitoring starts.
- The cluster interacts with Autonomic Health Advisor File System (AHAFS) for clusterwide event distribution.
- The cluster exports cluster messaging and cluster socket services to other functions in the operating system, such as Reliable Scalable Cluster Technology (RSCT) and PowerHA SystemMirror.
Naming a cluster
When you name a cluster, you must follow specific guidelines. The only acceptable ASCII characters in a cluster name are A - Z, a - z, 0 - 9, - (hyphen), . (period), and _ (underscore). The first character of the cluster name and domain name cannot be a hyphen. The maximum length of a cluster name is 63 characters.
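The naming guidelines can be checked before a name is passed to a cluster command. The following C sketch is illustrative only: the function name is_valid_cluster_name and the macro CLUSTER_NAME_MAX are hypothetical and are not part of any CAA interface, and rejecting an empty name is an assumption that the guidelines do not state.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define CLUSTER_NAME_MAX 63   /* maximum length given in the naming guidelines */

/*
 * Hypothetical helper (not a CAA API): returns true when name follows the
 * guidelines: only A-Z, a-z, 0-9, hyphen, period, and underscore; no leading
 * hyphen; at most 63 characters. Rejecting an empty name is an assumption.
 */
bool is_valid_cluster_name(const char *name)
{
    size_t len;
    size_t i;

    if (name == NULL)
        return false;

    len = strlen(name);
    if (len == 0 || len > CLUSTER_NAME_MAX)
        return false;

    if (name[0] == '-')          /* first character cannot be a hyphen */
        return false;

    for (i = 0; i < len; i++) {
        char c = name[i];
        bool ok = (c >= 'A' && c <= 'Z') ||
                  (c >= 'a' && c <= 'z') ||
                  (c >= '0' && c <= '9') ||
                  c == '-' || c == '.' || c == '_';
        if (!ok)
            return false;
    }
    return true;
}
```

For example, is_valid_cluster_name("mycluster") is accepted, and is_valid_cluster_name("-mycluster") is rejected because of the leading hyphen.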
Cluster communication
Cluster communication takes advantage of traditional networking interfaces, such as IP-based network communications and storage interface communication through Fibre Channel and SAS adapters. When you use both the IP-based network communications and the storage interface communications, all nodes in the cluster can always communicate with any other nodes in the cluster configuration. Having clusters in this configuration eliminates "split brain" incidents.

You must complete the Fibre Channel setup before the cluster can use the storage interfaces as an alternative communication path. The SAS adapter does not require special setup. During storage area network (SAN) port configuration, you must verify that your server interfaces are connected to the SAN fabric ports in the same zone.

Related concepts: Setting up cluster storage communication
You must complete the following setup before creating a cluster that uses storage communication interfaces.

Defining a virtual Ethernet adapter

Additional procedures for cluster communications.

During storage area network (SAN) port configuration, you must verify that your server interfaces are connected to the SAN fabric ports in the same zone.

To configure the VLAN to establish SAN communication when the storage adapters are virtualized through VIOS, complete the following steps:
1. Enable the target mode enabled (TME) attribute on the VIOS Fibre Channel adapters as the padmin user by entering the following commands:
chdev -dev fcs0 -attr tme=yes -perm
shutdown -restart
2. On the Hardware Management Console (HMC), add a virtual Ethernet adapter that has a VLAN ID of 3358 to the profile of each PowerHA virtual client node. To create a virtual Ethernet adapter on the Virtual I/O Server by using HMC Version 7, or later, go to "Creating a virtual Ethernet adapter using HMC version 7".
3. Reactivate the partition by using the new profile. The partition boots with the new profile and displays a new entX adapter. To display the interface status, enter the lscluster -i command.
Notes:
1. VLAN 3358 must be created on the virtual client LPARs and VIOS servers.
2. VLAN 3358 is the only value that CAA uses. The VLAN tag of sfw0 must not be changed.
3. The entX adapter that is associated with VLAN 3358 does not require an enX interface or an IP address.
Deadman switch
A deadman switch is an action that occurs when Cluster Aware AIX (CAA) detects that a node has become isolated in a multinode environment. This condition occurs when nodes are not communicating with each other through either the network or the repository disk. The AIX operating system reacts differently depending on the deadman switch setting, deadman_mode, which is tunable. The deadman switch mode can be set either to force a system shutdown or to generate an Autonomic Health Advisor File System (AHAFS) event.

Related information: clctrl Command
- 3 Gb Dual-Port SAS Adapter PCI-X DDR External (FC 5900 and 5912; CCIN 572A)

Note: For the most current list of supported Fibre Channel adapters, contact your IBM representative.

For the adapter to be supported, it must have target mode support. The target mode enabled (TME) attribute for a supported adapter is present only when the minimum AIX level for CAA is installed.

To configure the Fibre Channel adapters that will be used for cluster storage communications, complete the following steps:

Note: In the following steps, the X in fcsX represents the number of your Fibre Channel adapter, for example, fcs1, fcs2, or fcs3.
Note: If you booted from the Fibre Channel adapter, you do not need to complete this step.
2. Run the following command:
chdev -l fcsX -a tme=yes
Note: If you booted from the Fibre Channel adapter, add the -P flag.
3. Run the following command:
chdev -l fscsiX -a dyntrk=yes -a fc_err_recov=fast_fail
4. Run the cfgmgr command.
Note: If you booted from the Fibre Channel adapter and used the -P flag, you must reboot.
5. Verify the configuration changes by running the following command:
lsdev -C | grep sfwcom
The following is an example of the output displayed from the lsdev -C | grep sfwcom command:
lsdev -C | grep sfwcom
sfwcomm0 Available 01-00-02-FF Fiber Channel Storage Framework Comm
sfwcomm1 Available 01-01-02-FF Fiber Channel Storage Framework Comm
After you create the cluster, you can list the cluster interfaces and view the storage interfaces by running the following command:
lscluster -i
Related concepts: Cluster communication
Cluster communication takes advantage of traditional networking interfaces, such as IP-based network communications and storage interface communication through Fibre Channel and SAS adapters.
mkcluster: Use this command to create a cluster. The following example creates a multinode cluster:
mkcluster -n mycluster -m nodeA,nodeB,nodeC -r hdisk7 -d hdisk20,hdisk21,hdisk22
chcluster: Use this command to change the cluster configuration. The following example adds a node to the cluster configuration:
chcluster -n mycluster -m +nodeD
rmcluster: Use this command to remove the cluster configuration. The following example removes the cluster configuration:
rmcluster -n mycluster
lscluster: Use this command to list cluster configuration information. The following example lists the cluster configuration for all nodes:
lscluster -m
clcmd: Use this command to distribute a command to a set of nodes that are members of a cluster. The following example lists the date for all the nodes in the cluster:
clcmd date
Related concepts: Sample output for cluster commands
You can view sample output for the lscluster -d command, the lscluster -i command, the lscluster -m command, and the lscluster -s command.

Related information:
- chcluster command
- clcmd command
- lscluster command
- mkcluster command
- rmcluster command
The following steps describe the process for event handling:
1. Create a monitor file based on the /aha directory.
2. Write the required information to the monitor file to represent the wait type, either a select() call or a blocking read() call, and when the event should be triggered. For example, a state change of node down.
3. Wait in a select() call or a blocking read() call.
4. Read from the monitor file to obtain the event data.

Related concepts: nodeState cluster event sample output
Related information: AIX Event Infrastructure for AIX and AIX Clusters - AHAFS
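The four event-handling steps can be condensed into a short C sketch that covers the select() path only. The function names build_monitor_request and wait_for_event are hypothetical. WAIT_TYPE=WAIT_IN_READ appears in the sample code in this document; the WAIT_TYPE=WAIT_IN_SELECT value and the file mode passed to open() are assumptions, so check the AHAFS documentation for the keys that a particular event producer accepts.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/select.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: formats the string written in step 2.
 * extra is an optional "key=value" list (may be NULL).
 * Returns 0 on success, -1 if the string does not fit in buf.
 */
int build_monitor_request(char *buf, size_t size, int wait_in_read,
                          const char *extra)
{
    /* WAIT_IN_SELECT is an assumed value; WAIT_IN_READ is from the sample. */
    const char *wait = wait_in_read ? "WAIT_TYPE=WAIT_IN_READ"
                                    : "WAIT_TYPE=WAIT_IN_SELECT";
    int n;

    if (extra != NULL && extra[0] != '\0')
        n = snprintf(buf, size, "%s;%s", wait, extra);
    else
        n = snprintf(buf, size, "%s", wait);

    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}

/*
 * Hypothetical helper: performs steps 1 to 4 with a select() wait.
 * Returns 0 and a NUL-terminated event text in out, or -1 on error.
 */
int wait_for_event(const char *mon_path, const char *request,
                   char *out, size_t out_size)
{
    fd_set readfds;
    ssize_t n;
    int fd;

    if (out_size == 0)
        return -1;

    /* Step 1: create (or open) the monitor file under /aha. */
    fd = open(mon_path, O_CREAT | O_RDWR, 0644);
    if (fd < 0)
        return -1;

    /* Step 2: write the wait type and the trigger condition. */
    if (write(fd, request, strlen(request)) < 0) {
        close(fd);
        return -1;
    }

    /* Step 3: block in select() until the event occurs. */
    FD_ZERO(&readfds);
    FD_SET(fd, &readfds);
    if (select(fd + 1, &readfds, NULL, NULL, NULL) <= 0) {
        close(fd);
        return -1;
    }

    /* Step 4: read the event data from the beginning of the file. */
    n = pread(fd, out, out_size - 1, 0);
    close(fd);
    if (n < 0)
        return -1;
    out[n] = '\0';
    return 0;
}
```

A caller would build the request with build_monitor_request and then pass it, together with the path of a .mon file, to wait_for_event. A blocking read() wait would replace step 3 with a read() call on the same descriptor.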
Note: To find the node number, view the output from the lscluster -m command. For the cluster shorthand ID, you can also use the get_clusterid function.

To start the socksimple program as the sender on node 3 (nodeC), run the following command:
./socksimple -s -a 1
Note: The -a (address) option sends the packets to node 1 in this local cluster.

The following output is displayed when you run the ./socksimple -s -a 1 command:
./socksimple -s -a 1
socksimple version 1.2
socksimple 1/12 with ttl=1:

1275 bytes from cluster host id = 1: seqno=1275 ttl=1 time=0.411 ms
1276 bytes from cluster host id = 1: seqno=1276 ttl=1 time=0.275 ms
1277 bytes from cluster host id = 1: seqno=1277 ttl=1 time=0.287 ms
1278 bytes from cluster host id = 1: seqno=1278 ttl=1 time=0.284 ms

--- socksimple statistics ---
4 packets transmitted, 4 packets received
round-trip min/avg/max = 0.267/0.291/0.411 ms
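The statistics line at the end of the output summarizes the round-trip times of the received packets. The following C sketch shows one way such a summary could be accumulated. The rtt_stats structure and its functions are hypothetical and are not taken from the socksimple source, which keeps its running totals in separate variables.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical accumulator for round-trip times, in milliseconds. */
struct rtt_stats {
    double total;   /* sum of all samples */
    double min;     /* smallest sample seen so far */
    double max;     /* largest sample seen so far */
    int    count;   /* number of samples */
};

void rtt_stats_init(struct rtt_stats *s)
{
    s->total = 0.0;
    s->min = 0.0;
    s->max = 0.0;
    s->count = 0;
}

/* Record one round-trip time; the first sample seeds min and max. */
void rtt_stats_add(struct rtt_stats *s, double rtt_ms)
{
    if (s->count == 0 || rtt_ms < s->min)
        s->min = rtt_ms;
    if (s->count == 0 || rtt_ms > s->max)
        s->max = rtt_ms;
    s->total += rtt_ms;
    s->count++;
}

/* Average of the recorded samples, or 0.0 when none were recorded. */
double rtt_stats_avg(const struct rtt_stats *s)
{
    return (s->count > 0) ? s->total / s->count : 0.0;
}

/* Format a summary line in the style of the sample output. */
int rtt_stats_format(const struct rtt_stats *s, char *buf, size_t size)
{
    return snprintf(buf, size, "round-trip min/avg/max = %.3f/%.3f/%.3f ms",
                    s->min, rtt_stats_avg(s), s->max);
}
```

Seeding the minimum and maximum from the first sample avoids having to choose an artificial initial value for either bound.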
The following structure is an example of the data files collected during the snap script execution for Cluster Aware for AIX:
/tmp/ibmsupt
|-- caa
    |-- Data
        |-- 20100817215934 (For example, a timestamp at which "snap caa" was run)
        |   |-- nodeA.austin.ibm.com.tar.gz
        |   |-- ...
        |   |-- nodeB.austin.ibm.com.tar.gz
        |   |-- nodeC.austin.ibm.com.tar.gz
        |-- ... (For example, more timestamp directories to distinguish separate "snap caa" invocations)
The clctrl -stop command quiesces cluster services on one or more nodes. You may make cluster configuration changes as long as one node in the cluster is in normal operation. If all nodes in the cluster are stopped, you cannot make cluster configuration changes. Nodes that have been stopped do not participate in cluster configuration or communications and are seen by the other nodes as down. The stopped state is persistent. Nodes that have been stopped must be explicitly started via the clctrl -start command before they can resume cluster participation. To set a node in maintenance mode, run the following command:
clctrl -stop -n mycluster -m nodeA
The cluster subsystem uses component trace, which is controlled by the ctctrl command.

The hierarchy is as follows:
cluster          : Base parent component for CAA
    .config      : Component for configuration
    .lock        : Component for locking
    .ahafs       : Component for AHAFS
    .comm        : Parent component for communication
        .disk    : Subcomponent for disk communication
        .net     : Subcomponent for network communication
        .san     : Subcomponent for SAN communication

AHAFS: Autonomic Health Advisor File System

Related information: clctrl Command
        Interface state UP RESTRICTED AIX_CONTROLLED
        Pseudo Interface
        Interface State DOWN

Node nodeB.austin.ibm.com
Node uuid = 7382a214-0f7f-11e1-a8bc-00145e764238
Number of interfaces discovered = 2
    Interface number 1 en4
        ifnet type = 6
        ndd type = 7
        Mac address length = 6
        Mac address = 00:14:5E:76:42:39
        Smoothed rrt across interface = 8
        Mean Deviation in network rrt across interface = 4
        Probe interval for interface = 120 ms
        ifnet flags for interface = 0x1E080863
        ndd flags for interface = 0x0021081B
        Interface state UP
        Number of regular addresses configured on interface = 1
        IPv4 ADDRESS: 10.33.2.109 broadcast 10.33.255.255 netmask 255.255.0.0
        Number of cluster multicast addresses configured on interface = 1
        IPv4 MULTICAST ADDRESS: 228.33.2.109 broadcast 0.0.0.0 netmask 0.0.0.0
    Interface number 2 dpcom
        ifnet type = 0
        ndd type = 305
        Mac address length = 0
        Mac address = 00:00:00:00:00:00
        Smoothed rrt across interface = 576
        Mean Deviation in network rrt across interface = 334
        Probe interval for interface = 9100 ms
        ifnet flags for interface = 0x00000000
        ndd flags for interface = 0x00000009
        Interface state UP RESTRICTED AIX_CONTROLLED
        Pseudo Interface
        Interface State DOWN
Number of clusters node is a member in: 1
CLUSTER NAME   TYPE    SHID   UUID
mycluster      local          ff48c404-a711-11df-9d99-0245c0002003
Number of points_of_contact for node: 1
Point-of-contact interface & contact state
    en0 UP
------------------------------
Node name: nodeC.austin.ibm.com
Cluster shorthand id for node: 3
uuid for node: ff57b98c-a711-11df-9d99-0245c0002003
State of node: UP
Smoothed rtt to node: 7
Mean Deviation in network rtt to node: 3
Number of zones this node is a member in: 0
Number of clusters node is a member in: 1
CLUSTER NAME   TYPE    SHID   UUID
mycluster      local          ff48c404-a711-11df-9d99-0245c0002003
Number of points_of_contact for node: 1
Point-of-contact interface & contact state
    en0 UP
Related concepts: Managing cluster events
AIX event management is implemented using a pseudo-file system architecture. The use of the pseudo-file system allows you to use existing application programming interfaces (APIs) to program the monitoring of events, such as a select() call or a blocking read() call.
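#include <errno.h>
#include <fcntl.h>
#include <libgen.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <strings.h>
#include <unistd.h>
#include <sys/select.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <sys/types.h>

/* The value is missing in this listing; 255 is an assumed value. */
#define MAX_WRITE_STR_LEN 255

void syntax(char *prog);
int ahaMonFile(char *str);
static int mk_parent_dirs (char *path);
void read_data (int fd,int outfd);

char *monFile;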
test_prog :: main
int main (int argc, char *argv[])
{
    int fd,outfd, rc,i=0,cnt=0;
    fd_set readfds;
    char *outputFile;
    char wrStr[MAX_WRITE_STR_LEN+1];
    char waitInRead[] = "WAIT_TYPE=WAIT_IN_READ";

    if (argc < 5)
        syntax( argv[0]);

    monFile = argv[1];
    if ( ! ahaMonFile(monFile) ) /* Not a .mon file under /aha */
        syntax( argv[0]);

    /* Create intermediate directories of the .mon file */
    rc = mk_parent_dirs(monFile);
    if (rc)
    {
        fprintf (stderr,
            "Could not create intermediate directories of the file %s !\n",
            monFile);
        return(-1);
    }
    printf("Monitor file name: %s\n", monFile);

    /* Bound the copy so that a long argument cannot overflow wrStr. */
    snprintf (wrStr, sizeof(wrStr), "%s", argv[2]);
    cnt = atoi(argv[3]);
    printf("Write String : %s\n", wrStr);
    outputFile = argv[4];

    /* open() requires a mode argument when O_CREAT is specified. */
    fd = open (monFile, O_CREAT|O_RDWR, 0644);
    if (fd < 0)
    {
        fprintf (stderr,"Could not open the file %s; errno = %d\n",
            monFile,errno);
        exit (1);
    }
    outfd = open (outputFile, O_CREAT|O_RDWR, 0644);
    if (outfd < 0)
    {
        fprintf (stderr, "Could not open the file %s; errno = %d !\n",
            outputFile, errno);
        return(-1);
    }

    /* Write the wait type and trigger condition to the monitor file. */
    write(fd, wrStr, strlen(wrStr));

    for(i = 0; i < cnt; i++)
    {
        if (strstr(wrStr, waitInRead) == NULL)
        {
            FD_ZERO(&readfds);
            FD_SET(fd, &readfds);
            printf(
                "Entering select() to wait till the event corresponding to the AHA node %s occurs.\n",
                monFile);
            printf("Please issue a command from another window to trigger this event.\n");
            rc = select (fd+1, &readfds, NULL, NULL, NULL);
            printf("\nThe select() completed. \n");
            if (rc <= 0) /* No event occurred or an error was found. */
            {
                fprintf (stderr, "The select() returned %d.\n", rc);
                perror ("select: ");
                return (-1);
            }
            if(! FD_ISSET(fd, &readfds))
                goto end;
            printf("The event corresponding to the AHA node %s has occurred.\n",
                monFile);
        }
        else
        {
            printf(
                "Entering read() to wait till the event corresponding to the AHA node %s occurs.\n",
                monFile);
            printf("Please issue a command from another window to trigger this event.\n");
        }
        read_data(fd,outfd);
    }
end:
    close(fd);
    close(outfd);
    return (0);
}
test_prog :: syntax
/* -------------------------------------------------------------------------- */
void syntax(char *prog)
{
    printf("\nSYNTAX: %s <aha-monitor-file> [<key1>=<value1>[;<key2>=<value2>;...]] <count> <outfile> \n",prog);
    exit (1);
}
test_prog :: ahaMonFile
/* --------------------------------------------------------------------------
 * PURPOSE: To check whether the file provided is an AHA monitor file.
 */
int ahaMonFile(char *str)
{
    char cwd[PATH_MAX];
    int len1=strlen(str), len2=strlen(".mon");
    int rc = 0;
    struct stat sbuf;

    /* Make sure /aha is mounted. */
    if ((stat("/aha", &sbuf) < 0) || (sbuf.st_flag != FS_MOUNT))
    {
        printf("ERROR: The filesystem /aha is not mounted!\n");
        return (rc);
    }

    /* Make sure the path has .mon as a suffix. */
    if ((len1 <= len2) || (strcmp ( (str + len1 - len2), ".mon")) )
        goto end;

    if (! strncmp (str, "/aha",4)) /* The given path starts with /aha */
        rc = 1;
    else /* It could be a relative path */
    {
        getcwd (cwd, PATH_MAX);
        if ((str[0] != '/' ) &&           /* Relative path and */
            (! strncmp (cwd, "/aha",4))   /* cwd starts with /aha . */
           )
            rc = 1;
    }
end:
    if (!rc)
        printf("ERROR: %s is not an AHA monitor file !\n", str);
    return (rc);
}
test_prog :: mk_parent_dirs
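/*-----------------------------------------------------------------
 * NAME: mk_parent_dirs()
 * PURPOSE: To create intermediate directories of a .mon file if
 *          they are not created.
 */
static int mk_parent_dirs (char *path)
{
    char s[PATH_MAX];
    char *dirp;
    struct stat buf;
    int rc=0;

    dirp = dirname(path);
    if (stat(dirp, &buf) != 0)
    {
        /* The rest of this function is missing from this listing. The body
         * below is an assumed completion that creates the missing
         * directories with mkdir -p. */
        snprintf(s, sizeof(s), "/usr/bin/mkdir -p %s", dirp);
        rc = system(s);
    }
    return (rc);
}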
test_prog :: read_data
/*-----------------------------------------------------------------
 * PURPOSE: To parse and print the data received at the occurrence
 *          of the event.
 */
void read_data (int fd,int outfd)
{
#define READ_BUF_SIZE 3072
    char data[READ_BUF_SIZE];
    char *p, *line;
    char cmd[64];
    time_t sec, nsec;
    pid_t pid;
    uid_t uid, luid;
    gid_t gid;
    char curTm[64];
    int n;
    int stackInfo = 0;
    char uname[64], lname[64], gname[64];

    bzero((char *)data, READ_BUF_SIZE);

    /* Read the info from the beginning of the file. Leave room for the
     * terminating null character so that the data can be printed as a string. */
    n=pread(fd, data,READ_BUF_SIZE - 1, 0);
    p = data;
    printf("%s\n",p);
    if (n > 0)
        write(outfd, data, n);
}
Function :: main
#include <socksimple.h>

/* TEST Program Only */

int sndflag=0;                  /* sender flag */
int rcvflag=0;                  /* receiver flag */
int iend=DEFAULT_END;
int istart=DEFAULT_START;
int errcount=DEFAULT_ERRCOUNT;
int actual_err=0;
int current_ping;
int main(int argc, char **argv)
{
    int c;                      /* hold command-line args */
    extern int getopt();        /* for getopt */
    extern char *optarg;        /* for getopt */

    /* parse command-line arguments */
    while ((c = getopt(argc, argv, "vrsa:p:t:b:e:c:")) != -1) {
        switch (c) {
        case 'r':               /* socksimple receiver */
            rcvflag=1;
            break;
        case 's':               /* socksimple sender */
            sndflag=1;
            break;
        case 'v':
            verbose=1;
            break;
        case 'a':               /* socksimple address override */
            strcpy(arg_addr_str, optarg);
            break;
        case 'p':               /* socksimple port override */
            arg_port = atoi(optarg);
            break;
        case 'b':
            istart = atoi(optarg);
            if ( istart <= 0 )
                istart = 1;
            break;
        case 'c':
            errcount = atoi(optarg);
            break;
        case 'e':
            /* Parse the value first, and then clamp it to the buffer size. */
            iend = atoi(optarg);
            if ( iend > MAX_BUF_LEN )
                iend = MAX_BUF_LEN;
            break;
        case 't':               /* socksimple ttl override */
            arg_ttl = atoi(optarg);
            break;
        case '?':
            usage();
            break;
        }
    }

    /* verify one and only one send or receive flag */
    if ( ((!rcvflag) && (!sndflag)) || ((rcvflag) && (sndflag)) ) {
        usage();
    }

    current_ping=istart;
    printf("socksimple version %d.%d\n", VERSION_MAJOR, VERSION_MINOR);
    init_socket();
    get_local_host_info();

    if (sndflag) {
        printf("socksimpleing %s/%d with ttl=%d:\n\n", arg_addr_str,
            arg_port, arg_ttl);
        /* catch interrupts with clean_exit() */
        signal(SIGINT, clean_exit);
        /* catch alarm signal with send_socksimple() */
        signal(SIGALRM, send_socksimple);
        /* send an alarm signal now */
        send_socksimple(SIGALRM);
        /* The rest of main() is missing from this listing. The lines below
         * are an assumed completion that enters the matching listen loop. */
        sender_listen_loop();
    } else {
        receiver_listen_loop();
    }
    return 0;
}
Function :: init_socket
void init_socket()
{
    int flag_on=1;

    /* create a UDP socket */
    if ((sock = socket(AF_CLUST, SOCK_DGRAM, 0)) < 0) {
        perror("receive socket() failed");
        exit(1);
    }

    /* construct a cluster address structure */
    memset(&dst_addr, 0, sizeof(dst_addr));
    dst_addr.sclust_family = AF_CLUST;
    dst_addr.sclust_len = sizeof(struct sockaddr_clust);
    if ( sndflag ) {
        dst_addr.sclust_addr = atoi(arg_addr_str);
        dst_addr.sclust_port = arg_port;
        dst_addr.sclust_cluster_id = WWID_LOCAL_CLUSTER;
    }

    memset(&src_addr, 0, sizeof(src_addr));
    src_addr.sclust_family = AF_CLUST;
    src_addr.sclust_len = sizeof(struct sockaddr_clust);
    src_addr.sclust_addr = get_clusterid();
    src_addr.sclust_port = arg_port;
    src_addr.sclust_cluster_id = WWID_LOCAL_CLUSTER;

    /* bind the address to the socket */
    if ((bind(sock, (struct sockaddr *) &src_addr, sizeof(src_addr))) < 0) {
        perror("bind() failed");
        exit(1);
    }
}
Function :: get_local_host_info
void get_local_host_info()
{
    char hostname[MAX_HOSTNAME_LEN];
    struct hostent* hostinfo;

    /* lookup local hostname */
    gethostname(hostname, MAX_HOSTNAME_LEN);
    if (verbose)
        printf("Localhost is %s, ", hostname);

    /* use gethostbyname to get the host's IP address */
    if ((hostinfo = gethostbyname(hostname)) == NULL) {
        perror("gethostbyname() failed");
        exit(1);    /* hostinfo cannot be dereferenced after a failed lookup */
    }

    /* copy only as many bytes as the address field holds */
    memcpy(&localIP.s_addr, hostinfo->h_addr_list[0], sizeof(localIP.s_addr));
    if (verbose)
        printf("%s\n", inet_ntoa(localIP));
    pid = getpid();
}
Function :: send_socksimple
void send_socksimple(int sig)
{
    struct timeval now;
    int ioffset;

    /* increment count, check if done */
    if (current_ping >= iend) {
        clean_exit();
    }

    /* fill the send buffer with a known byte pattern */
    memset(&socksimple_payload, 4, sizeof(socksimple_payload));

    /* populate the socksimple packet */
    socksimple_payload.socksimple_packet.type = SENDER;
    socksimple_payload.socksimple_packet.version_major = htons(VERSION_MAJOR);
    socksimple_payload.socksimple_packet.version_minor = htons(VERSION_MINOR);
    socksimple_payload.socksimple_packet.seq_no = htonl(current_ping);
    socksimple_payload.socksimple_packet.src_host = get_clusterid();
    socksimple_payload.socksimple_packet.dest_host = atoi(arg_addr_str);
    socksimple_payload.socksimple_packet.ttl = arg_ttl;
    socksimple_payload.socksimple_packet.pid = pid;

    /* place the end marker at the tail of the payload for this packet size */
    ioffset = current_ping - strlen(PKT_END) - sizeof(struct socksimple_struct) - 2;
    strcpy((char *) &socksimple_payload.payload[ioffset],PKT_END);

    gettimeofday(&now, NULL);
    socksimple_payload.socksimple_packet.tv.tv_sec = htonl(now.tv_sec);
    socksimple_payload.socksimple_packet.tv.tv_usec = htonl(now.tv_usec);

    /* send the outgoing packet */
    send_packet(&socksimple_payload, &dst_addr, current_ping);
    current_ping++;

    /* set another alarm call to send in 1 second */
    (void) signal(SIGALRM, send_socksimple);
    alarm(1);
}
Function :: send_packet
void send_packet(struct socksimple_payload *packet,
                 struct sockaddr_clust *target, int ilen)
{
    int pkt_len;

    pkt_len = ilen;

    /* send string to cluster socket address */
    if ((sendto(sock, packet, pkt_len, 0, (struct sockaddr *) target,
                sizeof(struct sockaddr_clust))) != pkt_len) {
        perror("sendto() sent incorrect number of bytes");
        exit(1);
    }
    packets_sent++;
}
Function :: sender_listen_loop
void sender_listen_loop()
{
    char *recv_packet;              /* buffer to receive packet */
    int recv_len;                   /* len of packet received */
    struct timeval current_time;    /* time value structure */
    double rtt;                     /* round trip time */
    socklen_t from_len;
    struct sockaddr_clust send_host;
    int ilen;

    ilen = sizeof(struct socksimple_payload);
    if (!(recv_packet = (char *)malloc(ilen))) {
        fprintf(stderr,"malloc_failed\n");
        exit(-1);
    }
    from_len = sizeof(struct sockaddr_clust);

    while (1) {
        /* clear the receive buffer */
        memset(recv_packet, 0, ilen);

        /* block waiting to receive a packet */
        if ((recv_len = recvfrom(sock, recv_packet, ilen, 0,
                (struct sockaddr *) &send_host, &from_len)) < 0) {
            if (errno == EINTR) {
                /* interrupt is ok */
                continue;
            } else {
                perror("recvfrom() failed");
                exit(1);
            }
        }

        /* get current time */
        gettimeofday(&current_time, NULL);

        /* process the received packet */
        if (process_socksimple_packet(recv_packet, recv_len, RECEIVER) == 0) {
            /* packet processed successfully */
            /* calculate round trip time in milliseconds */
            subtract_timeval(&current_time, &rcvd_pkt->socksimple_packet.tv);
            rtt = timeval_to_ms(&current_time);

            /* keep rtt total, min and max */
            rtt_total += rtt;
            if (rtt > rtt_max) rtt_max = rtt;
            if (rtt < rtt_min) rtt_min = rtt;

            /* output received packet information */
            printf("%d bytes from cluster host id = %d: seqno=%d ttl=%d time=%.3f ms\n",
                recv_len, send_host.sclust_addr,
                rcvd_pkt->socksimple_packet.seq_no,
                rcvd_pkt->socksimple_packet.ttl, rtt);
        }
    }
}
Function :: receiver_listen_loop
void receiver_listen_loop()
{
    char *recv_packet;              /* buffer to receive packet */
    int recv_len;                   /* len of string received */
    socklen_t from_len;
    struct sockaddr_clust send_host;
    int ilen,ioffset;

    ilen = sizeof(struct socksimple_payload);
    if (!(recv_packet = (char *)malloc(ilen))) {
        fprintf(stderr,"malloc_failed\n");
        exit(-1);
    }
    printf("Listening on %s/%d:\n\n", arg_addr_str, arg_port);
    from_len = sizeof(struct sockaddr_clust);

    while (1) {
        /* clear the receive buffer */
        memset(recv_packet, 0, ilen);

        /* block waiting to receive a packet */
        if ((recv_len = recvfrom(sock, recv_packet, ilen, 0,
                (struct sockaddr *) &send_host, &from_len)) < 0) {
            perror("recvfrom() failed");
            exit(1);
        }
        /* printf("recvfrom cluster node id = %d port = %d \n",send_host.sclust_addr, send_host.sclust_port); */

        /* process the received packet */
        if (process_socksimple_packet(recv_packet, recv_len, SENDER) == 0) {
            /* packet processed successfully */
            /* printf("Replying to socksimple from cluster node id = %d bytes=%d seqno=%d ttl=%d\n",
                rcvd_pkt->src_host, recv_len, rcvd_pkt->seq_no, rcvd_pkt->ttl); */
            printf("Replying to socksimple from cluster node id = %d bytes=%d seqno=%d ttl=%d\n",
                send_host.sclust_addr, recv_len,
                rcvd_pkt->socksimple_packet.seq_no,
                rcvd_pkt->socksimple_packet.ttl);

            /* populate socksimple response packet */
            memset(&socksimple_payload, 6, sizeof(socksimple_payload));
            socksimple_payload.socksimple_packet.type = RECEIVER;
            socksimple_payload.socksimple_packet.version_major = htons(VERSION_MAJOR);
            socksimple_payload.socksimple_packet.version_minor = htons(VERSION_MINOR);
            socksimple_payload.socksimple_packet.seq_no = htonl(rcvd_pkt->socksimple_packet.seq_no);
            socksimple_payload.socksimple_packet.dest_host = rcvd_pkt->socksimple_packet.src_host;
            socksimple_payload.socksimple_packet.src_host = get_clusterid();
            socksimple_payload.socksimple_packet.ttl = rcvd_pkt->socksimple_packet.ttl;
            socksimple_payload.socksimple_packet.pid = rcvd_pkt->socksimple_packet.pid;
            socksimple_payload.socksimple_packet.tv.tv_sec = htonl(rcvd_pkt->socksimple_packet.tv.tv_sec);
            socksimple_payload.socksimple_packet.tv.tv_usec = htonl(rcvd_pkt->socksimple_packet.tv.tv_usec);
            ioffset = recv_len - sizeof(struct socksimple_struct) - strlen(PKT_END) - 2;
            strcpy((char *) &socksimple_payload.payload[ioffset],PKT_END);

            /* send response packet */
            send_packet(&socksimple_payload, &send_host, recv_len);
        }
    }
}
Function :: subtract_timeval
void subtract_timeval(struct timeval *val, const struct timeval *sub)
{
    /* subtract sub from val and leave result in val */
    if ((val->tv_usec -= sub->tv_usec) < 0) {
        val->tv_sec--;
        val->tv_usec += 1000000;
    }
    val->tv_sec -= sub->tv_sec;
}
Function :: timeval_to_ms
double timeval_to_ms(const struct timeval *val)
{
    /* return the timeval converted to a number of milliseconds */
    return (val->tv_sec * 1000.0 + val->tv_usec / 1000.0);
}
Function :: process_socksimple_packet
int process_socksimple_packet(char *packet, int recv_len, unsigned char type)
{
    int ioffset, icheck;

    /* validate packet size */
    ioffset = recv_len - strlen(PKT_END) - 2 - sizeof(struct socksimple_struct);

    /* cast data to socksimple_struct */
    rcvd_pkt = (struct socksimple_payload *) packet;

    /* convert required fields to host byte order */
    rcvd_pkt->socksimple_packet.version_major = ntohs(rcvd_pkt->socksimple_packet.version_major);
    rcvd_pkt->socksimple_packet.version_minor = ntohs(rcvd_pkt->socksimple_packet.version_minor);
    rcvd_pkt->socksimple_packet.seq_no = ntohl(rcvd_pkt->socksimple_packet.seq_no);
    rcvd_pkt->socksimple_packet.tv.tv_sec = ntohl(rcvd_pkt->socksimple_packet.tv.tv_sec);
    rcvd_pkt->socksimple_packet.tv.tv_usec = ntohl(rcvd_pkt->socksimple_packet.tv.tv_usec);

    /* validate socksimple version matches */
    if ((rcvd_pkt->socksimple_packet.version_major != VERSION_MAJOR) ||
        (rcvd_pkt->socksimple_packet.version_minor != VERSION_MINOR)) {
        if (verbose)
            printf("Discarding packet: version mismatch (%d.%d)\n",
                   rcvd_pkt->socksimple_packet.version_major,
                   rcvd_pkt->socksimple_packet.version_minor);
        return(-1);
    }

    /* validate socksimple packet type (sender or receiver) */
    if (rcvd_pkt->socksimple_packet.type != type) {
        if (verbose) {
            switch (rcvd_pkt->socksimple_packet.type) {
            case SENDER:
                printf("Discarding sender packet\n");
                break;
            case RECEIVER:
                printf("Discarding receiver packet\n");
                break;
            default:
                printf("Discarding packet: unknown type(%c)\n",
                       rcvd_pkt->socksimple_packet.type);
                break;
            }
        }
        return(-1);
    }

    /* if response packet, validate pid */
    if (rcvd_pkt->socksimple_packet.type == RECEIVER) {
        if (rcvd_pkt->socksimple_packet.pid != pid) {
            if (verbose)
                printf("Discarding packet: pid mismatch (%d/%d)\n",
                       (int)pid, (int)rcvd_pkt->socksimple_packet.pid);
            return(-1);
        }
    }

    /* the payload must end with the PKT_END marker */
    if (strcmp((char *) &rcvd_pkt->payload[ioffset], PKT_END)) {
        printf("Payload mismatch: = %s\n", &rcvd_pkt->payload[ioffset]);
        printf(" payload mismatch: = %x:%x:%x:%x:%x:%x\n",
               rcvd_pkt->payload[ioffset],
               rcvd_pkt->payload[ioffset+1],
               rcvd_pkt->payload[ioffset+2],
               rcvd_pkt->payload[ioffset+3],
               rcvd_pkt->payload[ioffset+4],
               rcvd_pkt->payload[ioffset+5]);
        actual_err++;
    }

    /* every byte before the marker must hold the expected fill value:
       6 in receiver (response) packets, 4 otherwise */
    for (icheck = 0; icheck < ioffset; icheck++) {
        if (rcvd_pkt->socksimple_packet.type == RECEIVER) {
            if ( (int) rcvd_pkt->payload[icheck] != 6 ) {
                printf("Junk at offset %d 0x%x\n", icheck, rcvd_pkt->payload[icheck]);
                actual_err++;
            }
        } else {
            if ( (int) rcvd_pkt->payload[icheck] != 4 ) {
                printf("Junk at offset %d 0x%x\n", icheck, rcvd_pkt->payload[icheck]);
                actual_err++;
            }
        }
        if ( actual_err > errcount )
            exit(-1);
    }

    /* packet validated, increment counter */
    packets_rcvd++;
    return(0);
}
Function :: clean_exit
void clean_exit()
{
    /* close the socket */
    close(sock);

    /* output statistics and exit program */
    printf("\n--- socksimple statistics ---\n");
    printf("%d packets transmitted, %d packets received\n",
           packets_sent, packets_rcvd);
    if (packets_rcvd == 0)
        printf("round-trip min/avg/max = NA/NA/NA ms\n");
    else
        printf("round-trip min/avg/max = %.3f/%.3f/%.3f ms\n",
               rtt_min, (rtt_total/packets_rcvd), rtt_max);
    exit(0);
}
Function :: usage
void usage()
{
    printf("Usage: socksimple -r|-s [-v] [-a address]");
    printf(" [-p port] [-t ttl]\n\n");
    printf("-r|-s       Receiver or sender. Required argument,\n");
    printf("            mutually exclusive\n");
    printf("-a address  Cluster address to listen/send on,\n");
    printf("            overrides the default.\n");
    printf("-p port     port to listen/send on,\n");
    printf("            overrides the default of 12.\n");
    printf("-t ttl      Time-To-Live to send,\n");
    printf("            overrides the default of 1.\n");
    printf("-v          Verbose mode\n");
    exit(1);
}
#define SENDER   's'           /* socksimple sender identifier */
#define RECEIVER 'r'           /* socksimple receiver identifier */
#define PKT_END  "lwrwashere"  /* marker written at the end of the payload */

/* socksimple packet structure */
struct socksimple_struct {
    unsigned short version_major;
    unsigned short version_minor;
    unsigned char type;
    unsigned char ttl;
    clustid_t src_host;
    clustid_t dest_host;
    unsigned int seq_no;
    pid_t pid;
    struct timeval tv;
};

struct socksimple_payload {
    struct socksimple_struct socksimple_packet;
    char payload[MAX_BUF_LEN];
} socksimple_payload;

/* pointer to socksimple packet buffer */
struct socksimple_payload *rcvd_pkt;

int sock;                        /* socket descriptor */
pid_t pid;                       /* pid of socksimple program */
struct sockaddr_clust dst_addr;  /* socket address structure */
struct sockaddr_clust src_addr;  /* socket address structure */
struct in_addr localIP;

/* counters and statistics variables */
int packets_sent = 0;
int packets_rcvd = 0;
double rtt_total = 0;
double rtt_max = 0;
double rtt_min = 999999999.0;

/* default command-line arguments */
char arg_addr_str[16] = "1";
int arg_port = 12;
unsigned char arg_ttl = 1;
int verbose=0;

/* function prototypes */
void init_socket();
void get_local_host_info();
void send_socksimple(int);
void send_packet(struct socksimple_payload *payload,
                 struct sockaddr_clust *target, int len);
void sender_listen_loop();
void receiver_listen_loop();
void subtract_timeval(struct timeval *val, const struct timeval *sub);
double timeval_to_ms(const struct timeval *val);
int process_socksimple_packet(char *packet, int recv_len, unsigned char type);
void clean_exit();
void usage();

#define CLUSTPCB_REF(rp) {                        \
    fetch_and_add(&((rp)->rclust_refcnt), 1);     \
}
#define CLUSTPCB_UNREF(rp) {                      \
    fetch_and_add(&((rp)->rclust_refcnt), -1);    \
}
#endif /* _H_CLUST_VAR */
Notices
This information was developed for products and services offered in the U.S.A.

IBM may not offer the products, services, or features discussed in this document in other countries. Consult your local IBM representative for information on the products and services currently available in your area. Any reference to an IBM product, program, or service is not intended to state or imply that only that IBM product, program, or service may be used. Any functionally equivalent product, program, or service that does not infringe any IBM intellectual property right may be used instead. However, it is the user's responsibility to evaluate and verify the operation of any non-IBM product, program, or service.

IBM may have patents or pending patent applications covering subject matter described in this document. The furnishing of this document does not give you any license to these patents. You can send license inquiries, in writing, to:

IBM Director of Licensing
IBM Corporation
North Castle Drive
Armonk, NY 10504-1785
U.S.A.

For license inquiries regarding double-byte character set (DBCS) information, contact the IBM Intellectual Property Department in your country or send inquiries, in writing, to:

Intellectual Property Licensing
Legal and Intellectual Property Law
IBM Japan, Ltd.
1623-14, Shimotsuruma, Yamato-shi
Kanagawa 242-8502 Japan

The following paragraph does not apply to the United Kingdom or any other country where such provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or implied warranties in certain transactions, therefore, this statement may not apply to you.

This information could include technical inaccuracies or typographical errors. Changes are periodically made to the information herein; these changes will be incorporated in new editions of the publication. IBM may make improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time without notice.

Any references in this information to non-IBM websites are provided for convenience only and do not in any manner serve as an endorsement of those websites. The materials at those websites are not part of the materials for this IBM product and use of those websites is at your own risk.

IBM may use or distribute any of the information you supply in any way it believes appropriate without incurring any obligation to you.

Licensees of this program who wish to have information about it for the purpose of enabling: (i) the exchange of information between independently created programs and other programs (including this
one) and (ii) the mutual use of the information which has been exchanged, should contact:

IBM Corporation
Dept. LRAS/Bldg. 903
11501 Burnet Road
Austin, TX 78758-3400
U.S.A.

Such information may be available, subject to appropriate terms and conditions, including in some cases, payment of a fee.

The licensed program described in this document and all licensed material available for it are provided by IBM under terms of the IBM Customer Agreement, IBM International Program License Agreement or any equivalent agreement between us.

Any performance data contained herein was determined in a controlled environment. Therefore, the results obtained in other operating environments may vary significantly. Some measurements may have been made on development-level systems and there is no guarantee that these measurements will be the same on generally available systems. Furthermore, some measurements may have been estimated through extrapolation. Actual results may vary. Users of this document should verify the applicable data for their specific environment.

Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products.

All statements regarding IBM's future direction or intent are subject to change or withdrawal without notice, and represent goals and objectives only.

All IBM prices shown are IBM's suggested retail prices, are current and are subject to change without notice. Dealer prices may vary.

This information is for planning purposes only. The information herein is subject to change before the products described become available.

This information contains examples of data and reports used in daily business operations. To illustrate them as completely as possible, the examples include the names of individuals, companies, brands, and products. All of these names are fictitious and any similarity to the names and addresses used by an actual business enterprise is entirely coincidental.

COPYRIGHT LICENSE:

This information contains sample application programs in source language, which illustrate programming techniques on various operating platforms. You may copy, modify, and distribute these sample programs in any form without payment to IBM, for the purposes of developing, using, marketing or distributing application programs conforming to the application programming interface for the operating platform for which the sample programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore, cannot guarantee or imply reliability, serviceability, or function of these programs. The sample programs are provided "AS IS", without warranty of any kind. IBM shall not be liable for any damages arising out of your use of the sample programs.

Each copy or any portion of these sample programs or any derivative work, must include a copyright notice as follows:

(your company name) (year). Portions of this code are derived from IBM Corp. Sample Programs. Copyright IBM Corp. _enter the year or years_.

If you are viewing this information softcopy, the photographs and color illustrations may not appear.
Trademarks
IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the web at "Copyright and trademark information" at www.ibm.com/legal/copytrade.shtml.
Printed in USA