Cmode Explained Latest
In 1992, NetApp introduced Data ONTAP and ushered in the network-attached storage industry. Since
then, NetApp has continued to add features and solutions to its product portfolio to meet the needs of its
customers. In 2004, NetApp acquired Spinnaker Networks in order to fold its scalable Clustered file
system technology into Data ONTAP. That plan came to fruition in 2006 as NetApp released Data ONTAP
GX, the first Clustered product from NetApp. NetApp also continued to enhance and sell Data ONTAP 7G.
Having two products provided a way to meet the needs of the NetApp customers who were happy with the
classic Data ONTAP, while allowing customers with certain application requirements to use Data ONTAP
GX to achieve even higher levels of performance, and with the flexibility and transparency afforded by its
scale-out architecture.
Although the goal was always to merge the two products into one, the migration path for Data ONTAP 7G
customers to get to Clustered storage would eventually require a big leap. Enter Data ONTAP 8.0. The
goal for Data ONTAP 8.0 was to create one code line that allows Data ONTAP 7G customers to operate a
Data ONTAP 8.0 7-Mode system in the manner in which they're accustomed, while also providing a first
step in the eventual move to a Clustered environment. Data ONTAP 8.0 Cluster-Mode allows Data ONTAP
GX customers to upgrade and continue to operate their Clusters as they're accustomed.
13
16
Vserver - A vserver is an object that provides network access through unique network addresses, that may
serve data out of a distinct namespace, and that is separately administrable from the rest of the cluster.
There are three types of vservers: cluster, admin, node.
Cluster Vserver - A cluster vserver is the standard data serving vserver in cluster-mode. It is the
successor to the vserver of GX. It has both data and (optional) admin LIFs, and also owns a namespace
with a single root. It has separate administrative domains, such as Kerberos realms, NIS domains, etc.,
and can live on separate virtual networks from other vservers.
Admin Vserver - Previously called the "C-server", the admin vserver is a special vserver that does not
provide data access to clients or hosts. However, it has overall administrative access to all objects in the
cluster, including all objects owned by other vservers.
Node Vserver - A node vserver is restricted to operation in a single node of the cluster at any one time,
and provides administrative and data access to 7-mode objects owned by that node. The objects owned by
a node vserver will fail over to a partner node when takeover occurs. The node vserver is equivalent to the
pfiler, also known as vfiler0, on a particular node. In 7G systems, it is commonly called the "filer".
17
This example shows many of the key resources in a cluster. There are three types of virtual servers,
plus nodes, aggregates, volumes, and namespaces.
18
Notice the types of vservers. Each node in the Cluster automatically has a node vserver created to
represent it. The administration vserver is automatically created when the Cluster is created. The Cluster
vservers are created by the administrator to build global namespaces.
19
20
22
23
25
26
27
Physical things can be touched and seen, like nodes, disks, and ports on those nodes.
Logical things cannot be touched, but they do exist and take up space. Aggregates are logical groupings of
disks. Volumes, Snapshot copies, and mirrors are areas of storage carved out of aggregates. Clusters are
groupings of physical nodes. A virtual server is a virtual representation of a resource or group of resources.
A logical interface is an IP address that is associated with a single network port.
A cluster, which is a physical entity, is made up of other physical and logical pieces. For example, a cluster
is made up of nodes, and each node is made up of a controller, disks, disk shelves, NVRAM, etc. On the
disks are RAID groups and aggregates. Also, each node has a certain number of physical network ports,
each with its own MAC address.
28
29
30
Cluster-Mode supports V-Series systems. As such, the setup will be a little different when using V-Series.
Each controller should have a console connection, which is needed to get to the firmware and to get to the boot menu
(for the setup, install, and init options, for example). A Remote LAN Module (RLM) connection, although not required,
is very helpful in the event that you cannot get to the UI or console. It allows for remote rebooting and forcing core
dumps, among other things.
Each node must have at least one connection (ideally, two connections) to the dedicated cluster network. Each node
should have at least one data connection, although these data connections are only necessary for client access.
Because the nodes will be clustered together, it's possible to have a node that participates in the cluster with its
storage and other resources, but doesn't actually field client requests. Typically, however, each node will have data
connections.
The cluster connections must be on a network dedicated to cluster traffic. The data and management connections
must be on a network that is distinct from the cluster network.
There is a large amount of cabling to be done with a Data ONTAP 8.0 cluster. Each node has NVRAM
interconnections to its HA partner, and each node has Fibre Channel connections to its disk shelves and to those of its
HA partner.
This is standard cabling, and is the same as Data ONTAP GX and 7-Mode.
For cabling the network connections, the following things must be taken into account:
Each node is connected to at least two distinct networks: one for management (UI) and data access (clients), and one
for intra-cluster communication. Ideally, there would be at least two cluster connections to each node in order to create
redundancy and improve cluster traffic flow.
The cluster can be created without data network connections but not without a cluster network connection.
Having more than one data network connection to each node creates redundancy and improves client traffic flow.
To copy flash0a to flash0b, run flash flash0a flash0b. To flash (put) a new image onto the primary flash, you
must first configure the management interface. The -auto option of ifconfig can be used if the management
network has a DHCP/BOOTP server. If it doesn't, you'll need to run ifconfig <interface> -addr=<ip>
-mask=<netmask> -gw=<gateway>. After the network is configured, make sure you can ping the IP address of the
TFTP server that contains the new flash image. To then flash the new image, run flash
tftp://<tftp_server>/<path_to_image> flash0a.
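As a sketch of that sequence, assuming a hypothetical management interface e0M, address 10.1.1.50, gateway 10.1.1.1, and TFTP server 10.1.1.10 (substitute your own values and image path), the firmware prompt session might look like this:
LOADER> ifconfig e0M -addr=10.1.1.50 -mask=255.255.255.0 -gw=10.1.1.1
LOADER> ping 10.1.1.10                                        (verify the TFTP server is reachable)
LOADER> flash tftp://10.1.1.10/<path_to_image> flash0a        (write the new image to the primary flash)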
The environment variables for Cluster-Mode can be set as follows:
set-defaults
setenv ONTAP_NG true
setenv bootarg.init.usebootp false
setenv bootarg.init.boot_clustered true
10
ONTAP 8.0 uses an environment variable to determine which mode of operation to boot with. For Cluster-Mode the
correct setting is:
LOADER> setenv bootarg.init.boot_clustered true
If the environment variable is unset, the controller will boot up in 7-Mode.
11
12
13
The time it takes to initialize the disks is based on the size of one of the disks, not on the sum capacity of the disks,
because all disks are initialized in parallel with each other. Once the disks are initialized, the node's first aggregate and
its vol0 volume will be automatically created.
15
After the reboot, if the node stops at the firmware prompt by itself (which will happen if the firmware environment
variable AUTOBOOT is set to false), type boot_primary to allow it to continue to the boot menu. If AUTOBOOT is set
to true, the node will go straight to the boot menu.
When using TFTP, beware of older TFTP servers that have limited capabilities and may cause installation failures.
16
The setup option on the boot menu configures the local information about this node, such as the host name,
management IP address, netmask, default gateway, DNS domain and servers, and so on.
17
18
19
20
Autoconfig is still somewhat inflexible: it doesn't allow you to choose the host names of the nodes, only two
cluster ports can be configured, the cluster ports are fixed (always the same), and the cluster IPs are out of sequence.
As such, NetApp recommends that cluster joins be done manually.
The first node in the cluster will perform the "cluster create" operation. All other nodes will perform a "cluster join"
operation. Creating the cluster also defines the cluster-management LIF. The cluster-management LIF is an
administrative interface used for UI access and general administration of the cluster. This interface can fail over to data-role ports across all the nodes in the cluster, using predefined failover rules (clusterwide).
The cluster network is an isolated, non-routed subnet or VLAN, separate from the data or management networks, so
using non-routable IP address ranges is common and recommended.
Using 9000 MTU on the cluster network is highly recommended, for performance and reliability reasons. The cluster
switch or VLAN should be modified to accept 9000 byte payload frames prior to attempting the cluster join/create.
22
After a cluster has been created with one node, the administrator must invoke the cluster join command on each
node that is going to join the cluster. To join a cluster, you need to know a cluster IP address of one of the nodes in the
cluster, and you need some information that is specific to this joining node.
The cluster join operation ensures that the root aggregates are uniquely named. During this process, the first root
aggregate will remain named "aggr0" and subsequent node root aggregates will have the hostname appended to the
aggregate name as "aggr0_node01". For consistency, the original aggregate should also be renamed to match the
naming convention, or renamed as per customer requirements.
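For example, to bring the first node's root aggregate in line with that convention, a command along these lines could be used (the aggregate and node names are hypothetical; verify the names in your cluster first):
node::> storage aggregate rename -aggregate aggr0 -newname aggr0_node01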
23
When the storage controller(s) that were unjoined from the cluster are powered back on, they will display information
about the cluster to which they previously belonged.
24
25
The base Cluster-Mode license is fixed and cannot be installed as a temporary/expiring license. The base license
determines the cluster serial number and is generated for a specific node count (as are the protocol licenses). The
base license can also be installed on top of an existing base license as additional node counts are purchased. If a
customer purchases a 2-node upgrade for their current 2-node cluster, they will need a 4-node base license for the
given cluster serial number. The licenses are indexed on the NOW site by the *cluster* serial number, not the node serial
number.
By default, there are no feature licenses installed on an 8.0 Cluster-Mode system as shipped from the factory. The
cluster create process installs the base license, and all additional purchased licenses can be found on the NOW site.
26
27
The controllers will default to GMT timezone. Modify the date, time and timezone using the system date command.
While configuring NTP is not a hard requirement for NFS-only environments, it is required for a cluster with the CIFS
protocol enabled and is a good idea in most environments. If there are time servers available in the customer
environment, the cluster should be configured to sync up to them.
Time synchronization can take some time, depending on the skew between the node time and the reference clock
time.
28
29
30
31
Although the CLI and GUI interfaces are different, they both provide access to the same information, and both have the
ability to manage the same resources within the cluster. All commands are available in both interfaces. This will always
be the case because both interfaces are generated from the same source code that defines the command hierarchy.
The hierarchical command structure is made up of command directories and commands. A command directory may
contain commands and/or more command directories. Similar to a typical file system directory and file structure, the
command directories provide the groupings of similar commands
commands. For example
example, all commands for storage-related
things fall somewhere within the storage command directory. Within that directory, there are directories for disk
commands and aggregate commands. The command directories provide the context that allows similar commands to
be used for different objects. For example, all objects/resources are created using a create command, and removed
using a delete command, but the commands are unique because of the context (command directory) in which they're
used. So, storage aggregate create is different from network interface create.
There is a cluster login, by way of the cluster management LIF. There is also a login capability for each node, by way
of the node management LIF for each node.
The preferred way to manage the cluster is to log in to the clustershell by way of the cluster management LIF IP
address, using ssh. If a node is experiencing difficulties and cannot communicate with the rest of the cluster, the node
management LIF of a node can be used. And if the node management LIF cannot be used, then the Remote LAN
Module (RLM) interface can be used.
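As a minimal illustration, assuming a hypothetical cluster management LIF address of 192.168.1.100 and the default admin account, a clustershell session could be opened like this:
client$ ssh admin@192.168.1.100
node::>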
This diagram shows the software stack making up Data ONTAP 8.0 Cluster-Mode. The most obvious difference
between this stack and the 7-Mode stack is the addition of a networking component called the N-blade, and more
logical interfaces (LIFs). Also, notice that Cluster-Mode does not yet support the SAN protocols (FC and iSCSI).
The N-blade is the network blade. It translates between the NAS protocols (NFS and CIFS) and the SpinNP protocol
that the D-blade uses. SpinNP is the protocol used within a cluster to communicate between N-blades and D-blades.
In Cluster-Mode, the D-blade does not service NAS or SAN protocol requests.
Data ONTAP GX had one management virtual interface on each node. Cluster-Mode still has that concept, but it's
called a node management LIF. Like the management interfaces of Data ONTAP GX, the node management LIFs do
not fail over to other nodes.
Cluster-Mode introduces a new management LIF, called the cluster management LIF, that has failover and migration
capabilities. The reason for this is so that, regardless of the state of each individual node (rebooting after an upgrade,
halted for hardware maintenance), there is a LIF address that can always be used to manage the cluster, and the
current node location of that LIF is transparent.
The two mgmt1 LIFs that are shown here are the node management LIFs, and are each associated with their
respective node virtual servers (vservers).
The one cluster management LIF, named clusmgmt in this example, is not associated with any one node vserver, but
rather is associated with the admin vserver, called hydra, which represents the entire physical cluster.
10
Nodeshell is accessible only via run -node from within the clustershell.
It has visibility to only those objects that are attached to the given controller, like hardware, disks, aggregates,
volumes, and things inside volumes such as snapshots and qtrees. Both 7-Mode and Cluster-Mode volumes on that
controller are visible.
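For instance, assuming a hypothetical node named node1, the nodeshell can be entered interactively or a single 7-Mode command can be run from the clustershell, along these lines:
node::> run -node node1                        (enter the nodeshell interactively; type exit to leave)
node::> run -node node1 -command aggr status   (run one nodeshell command and return to the clustershell)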
11
12
In these examples, the hostname command was invoked from the UI of one node, but actually executed on the other
node. In the first example, the command was invoked from the clustershell. In the second example, the administrator
entered the nodeshell of the other node, and then ran the command interactively.
13
The FreeBSD shell is only to be used internally for ONTAP development, and in the field for emergency purposes
(e.g., system diagnostics by trained NetApp personnel). All system administration and maintenance commands must
be made available to customers via the cluster shell.
14
Access to the systemshell is not needed as much as it was in Data ONTAP GX because many of the utilities that only
ran in the BSD shell have now been incorporated into the clustershell.
But there are still some reasons why the systemshell may need to be accessed. You can no longer log in to a node or
the cluster as root and be placed directly into the systemshell. Access to the systemshell is limited to a user named
diag, and the systemshell can only be entered from within the clustershell.
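A sketch of that access path, assuming a hypothetical node named node1 (the exact privilege level and prompts may vary by release, and the diag account must already be unlocked with a password set):
node::> set -privilege diag
node::*> systemshell -node node1    (log in as the diag user when prompted; type exit to return to the clustershell)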
15
16
Element Manager is the web-based user interface for administration of the cluster. All the operations that can be
done using the CLI, ZAPI, etc., can be done using this interface.
To use Element Manager, point a web browser to the URL http://<cluster_management_ip>/
17
SMF and RDB provide the basis for single system image administration of a cluster in the
M-host. SMF provides the basic command framework and the ability to route commands to
different nodes within the cluster. RDB provides the mechanism for maintaining cluster-wide
data.
19
20
The clustershell has features similar to the tcsh shell that is popular on UNIX machines, such as the ability to pull
previous commands out of a command history buffer, then optionally edit those commands and reissue them. The
command editing is very similar to tcsh and Emacs editing, with key combinations like Ctrl-a and Ctrl-e to move the
cursor to the beginning and end of a command, respectively. The up and down arrows allow for cycling through the
command history.
Simple online help also is available. The question mark (?) can be used almost anywhere to get help within whatever
context you may find yourself. Also, the Tab key can be used in many of the same contexts to complete a command or
parameter in order to reduce the amount of typing you have to do.
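For example (exact output varies by release, so this is only an illustration), the question mark and Tab key can be exercised like this:
node::> storage ?            (lists the command directories and commands under storage)
node::> volume create ?      (lists the required and optional parameters of volume create)
node::> net<Tab>             (completes the word to "network")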
21
22
These are the command directories and commands available at the top level of the command hierarchy.
23
This demonstrates how the question mark is used to show the available commands and command directories at any
level.
24
This demonstrates how the question mark is used to show the required and optional parameters. It can also be used to
show the valid keyword values that are allowed for parameters that accept keywords.
The Tab key can be used to show other directories, commands, and parameters that are available, and can complete
a command (or a portion of a command) for you.
25
This is the initial page that comes up when logging into the Element Manager. It's a dashboard view of the
performance statistics of the entire cluster. The left pane of the page contains the command directories and
commands. When there is a + beside a word, it can be expanded to show more choices. Not until you click an object
at the lowest level will the main pane switch to show the desired details.
26
27
This shows the further expansion of the aggregate directory within the STORAGE directory. The main pane continues
to show the Performance Dashboard.
28
After selecting manage on the left pane, all the aggregates are listed. Notice the double arrow to the left of each
aggregate. Clicking that will reveal a list of actions (commands) that can be performed on that aggregate.
29
This shows what you see when you click the arrow for an aggregate to reveal the storage aggregate commands. The
modify command for this particular aggregate is being selected.
30
The modify action for an aggregate brings up this page. You can change the state, the RAID type, the maximum
RAID size, or the high-availability policy. Also, from the Aggregate drop-down menu, you can select a different
aggregate to work on without going back to the previous list of all the aggregates.
31
This shows the set adv command (short for set -privilege advanced) in the clustershell. Notice the options
available for the storage directory before (using the admin privilege) and after (using the advanced privilege), where
firmware is available.
Note that the presence of an asterisk in the command prompt indicates that you are not currently using the admin
privilege.
32
This page, selected by clicking PREFERENCES on the left pane, is how you would change the privilege level from
within the GUI.
The privilege level is changed only for the user and interface in which this change is made; that is, if another admin
user is using the clustershell, that admin user's privilege level is independent of the level in use here, even if both
interfaces are accessing the same node.
33
34
35
Here is an example of a FAS3040 or FAS3070 controller. Use this as a reference, but keep in mind that as new cards
are supported, some of this could change.
Here is an example of a FAS6070 or FAS6080 controller. Use this as a reference, but keep in mind that as new cards
are supported, some of this could change.
This is the back of a typical disk shelf. Here, we're highlighting the in and out ports of loop A (top) and loop B (bottom).
The following example shows what the storage disk show -port command output looks like for an SFO
configuration that does not use redundant paths:
node::> storage disk show -port
Primary         Port Secondary       Port Type   Shelf Bay
--------------- ---- --------------- ---- ------ ----- ---
node2a:0a.16    A                         FCAL   1     0
node2a:0a.17    A                         FCAL   1     1
node2a:0a.18    A                         FCAL   1     2
node2a:0a.19    A                         FCAL   1     3
node2a:0a.20    A                         FCAL   1     4
node2a:0a.21    A                         FCAL   1     5
.
.
.
node2a:0b.21    B                         FCAL   1     5
node2a:0b.22    B                         FCAL   1     6
node2a:0b.23    B                         FCAL   1     7
Multipath HA Storage enhances data availability and performance for active/active system configurations. It is highly
recommended for customers who want to avoid unnecessary failovers resulting from storage-related faults. By
providing redundant paths, Multipath HA Storage avoids controller failover due to storage faults from shelf I/O
modules, cables, and disk HBA failures.
MultiPathing is supported on ESH2, ESH4 and AT-FCX disk shelves. If the shelf modules are not of these types
then upgrade them before proceeding. If there are no free HBAs on the node, then add additional HBAs.
Use the following procedure to dual-path each loop. This can be done while the node is online.
Insert optical connectors into the out connection on both the A and B modules on the last shelf in the loop.
Determine if the node head is plugged into the A or B module of the first shelf.
Connect a cable from a different host adapter on the node to the opposite module on the last shelf. For
example, if the node is attached, via adapter 1, to the in port of module A on the first shelf, then it should be
attached, via adapter 2, to the out port of module B on the last shelf, and vice versa.
Repeat step 2 for all the loops on the node.
Repeat steps 2 & 3 for the other node in the SFO pair.
Use the storage disk show -port command to verify that all disks have two paths.
As a best practice, cable shelf loops symmetrically for ease of administration: use the same node FC port for owner
and partner loops.
Consult the appropriate ISI (Installation and Setup Instructions) for graphical cabling instructions.
10
The types of traffic that flow over the InfiniBand links are:
Failover: The directives related to performing storage failover (SFO) between the two nodes, regardless of whether
the failover is:
negotiated (planned and as a response to administrator request)
non-negotiated (unplanned in response to a dirty system shutdown or reboot)
Disk firmware: Nodes in an HA pair coordinate the update of disk firmware. While one node is updating the firmware,
the other node must not do any I/O to that disk
Heartbeats: Regular messages to demonstrate availability
Version information: The two nodes in an HA pair must be kept at the same major/minor revision levels for all
software components
11
Each node of an HA pair designates two disks in the first RAID group in the root aggregate as the mailbox disks. The
first mailbox disk is always the first data disk in RAID group RG0. The second mailbox disk is always the first parity
disk in RG0. The mroot disks are generally the mailbox disks.
Each disk, and hence each aggregate and volume built upon them, can be owned by exactly one of the two nodes in
the HA pair at any given time. This form of software ownership is made persistent by writing the information onto the
disk itself. The ability to write disk ownership information is protected by the use of persistent reservations. Persistent
reservations can be removed from disks by power-cycling the shelves, or by selecting Maintenance Mode while in
Boot Mode and issuing manual commands there. If the node that owns the disks is running in normal mode, it
reasserts its persistent reservations every 30 seconds. Changes in disk ownership are handled automatically by
normal SFO operations, although there are commands to manipulate them manually if necessary.
Both nodes in an HA pair can perform reads from any disk to which they are connected, even if a node isn't that disk's
owner. However, only the node marked as that disk's current owner is allowed to write to it.
12
Persistent reservations can be removed from disks by power-cycling the shelves, or by selecting Maintenance Mode
while in Boot Mode and issuing manual commands there. If the node that owns the disks is running in normal mode, it
reasserts its persistent reservations every 30 seconds.
A disk's data contents are not destroyed when it is marked as unowned; only its ownership information is erased.
Unowned disks residing on an FC-AL loop where owned disks exist will have ownership information automatically
applied to guarantee all disks on the same loop have the same owner.
13
To enable SFO within an HA pair, the nodes must have the Data ONTAP 7G cf license installed on them, and they
must both be rebooted after the license is installed. Only then can SFO be enabled on them.
Enabling SFO is done within pairs regardless of how many nodes are in the cluster. For SFO, the HA pairs must be of
the same model, for example, two FAS3050s, two FAS6070s, and so on. The cluster itself can contain a mixture of
models, but each HA pair must be homogeneous. The version of Data ONTAP must be the same on both nodes of the
HA pair, except for the short period of time during which the pair is being upgraded. During that time, one of the nodes
will be rebooted with a newer release than its partner, with the partner to follow shortly thereafter. The NVRAM cards
must be installed in the nodes, and two interconnect cables are needed to connect the NVRAM cards to each other.
Remember, this cluster is not simply the pairing of machines for failover; it's the Data ONTAP cluster.
14
In SFO, interface failover is separate from storage failover. During giveback, the aggregate that contains the mroot
volume of the partner node is returned first, and then the rest of the aggregates are returned one by one.
15
Multiple controllers are connected together to provide a high-level of hardware redundancy and resilience against
single points of failure.
All controllers in an HA array can access the same shared storage backend.
All controllers in an HA array can distribute their NVRAM contents, including the NVLog, to facilitate takeover without
data loss in the event of failure.
In the future, the HA array will likely expand to include more than two controllers.
16
17
If the node-local licenses are not installed on each node, enabling storage failover will result in an error. Verify and/or
install the appropriate node licenses, then reboot each node.
For clusters of more than two nodes, enable SFO on one node per HA pair (a reboot is required later), as shown in the
sketch below.
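A hedged sketch of the enablement and verification steps, using a hypothetical node name nodeA:
node::> storage failover modify -node nodeA -enabled true   (enabling SFO on one node of an HA pair applies to both nodes)
node::> storage failover show                               (verify; the reboot requirement noted above still applies)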
18
CFO used to stand for cluster failover, but the term cluster is no longer being used in relation to Data ONTAP 7G or
Data ONTAP 8.0 7-Mode.
19
20
This example shows a 2-node cluster, which is also an HA pair. Notice that SFO is enabled on both nodes.
21
23
When the aggregates of one node failover to the SFO partner node, the aggregate that contains the mroot of that node
goes too. Each node needs its mroot to boot, so when the rebooted node begins to boot, the first thing that happens is
that it signals the partner to do a sendhome of that one aggregate and then it waits for that to happen. If SFO is
working properly, sendhome will happen quickly, the node will have its mroot and be able to boot, and then when it
gets far enough in its boot process, the rest of the aggregates will be sent home (serially). If there are problems, you'll
probably see the rebooted node go into a "waiting for sendhome" state. If this happens, it's possible that its aggregates
are stuck in a transition state between the two nodes and may not be owned by either node. If so, contact
NetApp Technical Support.
The EMS log will show why the sendhome was vetoed.
25
26
Note: Changing epsilon can be done from any node in the cluster.
The steps to move epsilon are as follows (see the sketch after this list):
1. Mark all nodes in the cluster as epsilon false
2. Mark one node epsilon true
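A minimal sketch of those two steps in the clustershell, assuming a hypothetical target node named node1 (the epsilon parameter is available at the advanced privilege level):
node::> set -privilege advanced
node::*> cluster modify -node * -epsilon false
node::*> cluster modify -node node1 -epsilon true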
28
For clusters of only two nodes, the replicated database (RDB) units rely on the disks to help maintain quorum within
the cluster in the case of a node being rebooted or going down. This is enabled by configuring this 2-node HA
mechanism. Because of this reliance on the disks, SFO enablement and auto-giveback is also required by 2-node HA
and will be configured automatically when 2-node HA is enabled. For clusters larger than two nodes, quorum can be
maintained without using the disks. Do not enable 2-node HA for clusters that are larger than two nodes.
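Assuming the behavior described above, enabling and checking the 2-node HA mechanism might look like the following sketch (command availability can vary by release):
node::> cluster ha modify -configured true
node::> cluster ha show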
29
30
31
Note: 2-node HA mode should be disabled on an existing 2-node cluster prior to joining the third and subsequent
nodes.
32
33
34
35
The HA policy determines the takeover and giveback behavior and is set to either CFO or SFO.
CFO HA Policy: CFO policy aggregates (or CFO aggregates for short) can contain 7-mode volumes. When these
aggregates are taken over, they are available in partner mode. During giveback, all CFO aggregates are given back in
one step. This is the same as what happens during takeover and giveback on 7G. CFO aggregates can also contain
cluster-mode volumes, but this is not recommended because such cluster-mode volumes could experience longer
outages during giveback while waiting for applications like VLDB to stabilize and restore access to these
volumes. Cluster-mode volumes are supported in CFO aggregates because Tricky allowed data volumes in a root
aggregate.
SFO HA Policy: SFO policy aggregates (or SFO aggregates for short) can contain only cluster-mode volumes. They
cannot contain 7-mode volumes. When these aggregates are taken over, they are available in local mode. This is the same
as what happens during takeover on GX. During giveback, the CFO aggregates are given back first, the partner boots,
and then the SFO aggregates are given back one aggregate at a time. This SFO aggregate giveback behavior is the same
as the non-root aggregate giveback behavior on GX.
The root aggregate has an HA policy of CFO in cluster mode. In BR.0 cluster-mode, only the root can have the CFO policy. All
other aggregates will have the SFO policy.
36
37
38
39
40
41
Cluster-Mode volumes can be flexible volumes. The flexible volumes are functionally equivalent to flexible volumes in
7-Mode and Data ONTAP 7G. The difference is in how they're used. Because of the flexibility inherent in Data ONTAP
clusters (specifically, the volume move capability), volumes are deployed as freely as UNIX directories and
Windows folders to separate logical groups of data. Volumes are created and deleted, mounted and unmounted, and
moved around as needed. To take advantage of this flexibility, cluster deployments typically use many more volumes
than traditional 7G deployments.
Volumes can be moved around, copied, mirrored, and backed up.
42
This example shows some volumes. The name for the vserver root volume was chosen by the administrator to indicate
clearly that the volume is a root volume.
You can see that the Type values are all RW, which shows that these are read/write volumes, as opposed to load-sharing (LS) mirrors or data protection (DP) mirrors. We'll learn more about mirrors later.
Also, the difference between the Size and Available values is the amount of the volume that is used, but also reflects
some administrative space used by the WAFL (Write Anywhere File Layout) file system, as well as Snapshot reserve
space.
43
For example, an explicit NFS license is required (it was not previously required with GX). Mirroring requires a
new license.
44
ONTAP 8.0 Cluster-Mode supports a limited subset of the 7-Mode qtree functionality. In cluster-mode, qtrees are
basically quota containers, not a storage unit of management.
Qtrees can be created within flexvols and can be configured with a security style and default or specific tree quotas.
User quotas are not supported in the 8.0.0 release, and backup functionality remains targeted at the volume level.
Cluster virtual servers are an integral part of the cluster architecture and the means for achieving secure multi-tenancy
and delegated administration. Each serves data out of its own namespace and has its own network identities and
administrative domains.
A cluster virtual server (vserver) ties together volumes, logical interfaces, and other things for a namespace. No
volumes can be created until there is a cluster vserver with which to associate them.
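As an illustrative sketch only (parameter names and required options vary by release; the names vs1, vs1_root, aggr1, and vol1 are hypothetical), a cluster vserver and a first data volume might be created along these lines:
node::> vserver create -vserver vs1 -rootvolume vs1_root -aggregate aggr1 -rootvolume-security-style unix -ns-switch file
node::> volume create -vserver vs1 -volume vol1 -aggregate aggr1 -size 20g -junction-path /vol1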
46
Think of the cluster as a bunch of hardware (nodes, disk shelves, and so on). A vserver is a logical piece of that
cluster, but it is not a subset or partitioning of the nodes. It's more flexible and dynamic than that. Every vserver can
use all the hardware in the cluster, and all at the same time.
Here is a simple example: A storage provider has one cluster, and two customers, ABC Company and XYZ Company.
A vserver can be created for each company. The attributes that are related to specific vservers (volumes, LIFs,
mirrors, and so on) can be managed separately, while the same hardware resources can be used for both. One
company can have its own NFS server, while the other can have its own NFS and CIFS servers, for example.
47
There is a one-to-many relationship between a vserver and its volumes. The same is true for a vserver and its data
LIFs. Cluster vservers can have many volumes and many data LIFs, but those volumes and LIFs are associated only
with this one cluster vserver.
48
49
Please note that this slide is a representation of logical concepts and is not meant to show any physical relationships.
For example, all of the objects shown as part of a vserver are not necessarily on the same physical node of the cluster.
In fact, that would be very unlikely.
This slide shows four distinct vservers (and namespaces). Although the hardware is not shown, these four vservers
could be living within a single cluster. These are not actually separate entities of the vservers, but are shown merely to
indicate that each vserver has a namespace. The volumes, however, are separate entities. Each volume is associated
with exactly one vserver. Each vserver has one root volume, and some have additional volumes. Although a vserver
may only have one volume (its root volume), in real life it is more likely that a vserver would be made up of a number
of volumes, possibly thousands. Typically, a new volume is created for every distinct area of storage. For example,
every department and/or employee may have its own volume in a vserver.
50
A namespace is simply a file system. It is the external (client-facing) representation of a vserver. It is made up of
volumes that are joined together through junctions. Each vserver has exactly one namespace, and the volumes in one
vserver cannot be seen by clients that are accessing the namespace of another vserver. The namespace provides the
logical arrangement of the NAS data available in the vserver.
51
These nine volumes are mounted together via junctions. All volumes must have a junction path (mount point) to be
accessible within the vserver's namespace.
Volume R is the root volume of a vserver. Volumes A, B, C, and F are mounted to R through junctions. Volumes D and
E are mounted to C through junctions. Likewise, volumes G and H are mounted to F.
Every vserver has its own root volume, and all non-root volumes are created within a vserver. All non-root volumes are
mounted into the namespace, relative to the vserver root.
52
53
This is a detailed volume show command. Typing volume show by itself will show a summary view of all volumes. If you do a show of
a specific virtual server and volume, you'll see the instance (detailed) view of the volume rather than the summary list
of volumes.
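For instance, with hypothetical names vs1 and vs1_root:
node::> volume show                                 (summary list of all volumes)
node::> volume show -vserver vs1 -volume vs1_root   (instance/detailed view of one volume)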
54
Junctions are conceptually similar to UNIX mountpoints. In UNIX, a hard disk can be carved up into partitions and then
those partitions can be mounted at various places relative to the root of the local file system, including in a hierarchical
manner. Likewise, the flexible volumes in a Data ONTAP cluster can be mounted at junction points within other
volumes, forming a single namespace that is actually distributed throughout the cluster. Although junctions appear as
directories, they have the basic functionality of symbolic links.
A volume is not visible in its vserver's namespace until it is mounted within the namespace.
55
Typically, when volumes are created by way of the volume create command, a junction path is specified at that
time. That is optional; a volume can be created and not mounted into the namespace. When it's time to put that volume
into use, the volume mount command is the way to assign the junction path to the volume. The volume also can be
unmounted, which takes it out of the namespace. As such, it is not accessible by NFS or CIFS clients, but it is still
online, and can be mirrored, backed up, moved, and so on. It then can be mounted again to the same or a different place
in the namespace and in relation to other volumes (for example, it can be unmounted from one parent volume and
mounted to another parent volume).
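A small sketch of that lifecycle, using hypothetical vserver, volume, and junction-path names (vs1, mp3, /smith/mp3):
node::> volume unmount -vserver vs1 -volume mp3
node::> volume mount -vserver vs1 -volume mp3 -junction-path /smith/mp3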
56
This is a representation of the volume hierarchy of a namespace. These five volumes are connected by way of
junctions, with the root volume of the namespace at the top of the hierarchy. From an NFS or CIFS client, this
namespace will look like a single file system.
57
It's very important to know the differences between what the volume hierarchy looks like to the administrator
(internally) as compared to what the namespace looks like from an NFS or CIFS client (externally).
The name of the root volume of a vserver (and hence, the root of this namespace) can be chosen by the administrator,
but the junction path of the root volume is always /. Notice that the junction path for (the mountpoint of) a volume is not
tied to the name of the volume. In this example, we've prefixed the name of the volume smith_mp3 to associate it with
volume smith, but that's just a convention to make the relationship between the smith volume and its mp3 volume
more obvious to the cluster administrator.
58
59
Here again is the representation of the volumes of this namespace. The volume names are shown inside the circles
and the junction paths are listed outside of them. Notice that there is no volume called user. The user entity is
simply a directory within the root volume, and the junction for the smith volume is located in that directory. The acct
volume is mounted directly at the /acct junction path in the root volume.
60
62
63
64
Kernel modules are loaded into the FreeBSD kernel. This gives them special privileges that are not available to user
space processes. There are great advantages to being in the kernel; there are downsides too. For one, it's more
difficult to write kernel code, and the penalty for a coding error is great. User space processes can be swapped out by
the operating system, but on the plus side, user space processes can fail without taking the whole system down, and
can be easily restarted on the fly.
This diagram shows the software stack making up Data ONTAP 8.0 Cluster-Mode. The most obvious difference
between this stack and the 7-Mode stack is the addition of a networking component called the N-blade, and more
logical interfaces (LIFs). Also, notice that Cluster-Mode does not yet support the SAN protocols (FC and iSCSI).
The N-blade is the network blade. It translates between the NAS protocols (NFS and CIFS) and the SpinNP protocol
that the D-blade uses. SpinNP is the protocol used within a cluster to communicate between N-blades and D-blades.
In Cluster-Mode, the D-blade does not service NAS or SAN protocol requests.
10
The term blade refers to separate software state machines, accessed only by well-defined application program
interfaces, or APIs. Every node contains an N-blade, a D-blade, and Management. Any N-blade in the cluster can talk
to any D-blade in the cluster.
The N-blade translates client requests into Spin Network Protocol (SpinNP) requests (and vice versa). The D-blade,
which contains the WAFL (Write Anywhere File Layout) file system, handles SpinNP requests. CSM is the SpinNP
layer between the N-blade and D-blade.
The members of each RDB unit, on every node in the cluster, are in constant communication with each other to remain
in sync. The RDB communication is like the heartbeat of each node. If the heartbeat cannot be detected by the other
members of the unit, the unit will correct itself in a manner to be discussed later. The three RDB units on each node
are: VLDB, VifMgr, and Management. There will be more information about these RDB units later.
11
This graphic is very simplistic, but each node contains the following: N-blade, CSM, D-blade, M-host, RDB units (3),
and the node's vol0 volume.
12
An NFS or CIFS client sends a write request to a data logical interface, or LIF. The N-blade that is currently associated
with that LIF translates the NFS/CIFS request to a SpinNP request. The SpinNP request goes through CSM to the
local D-blade. The D-blade sends the data to nonvolatile RAM (NVRAM) and to the disks. The response works its way
back to the client.
13
This path is mostly the same as the local write request, except that when the SpinNP request goes through CSM, it
goes to a remote D-blade elsewhere in the cluster, and vice versa.
14
The N-blade architecture comprises a variety of functional areas, interfaces, and components. The N-blade itself
resides as a loadable module within the FreeBSD kernel. It relies heavily on services provided by SK (within the D-blade).
The N-blade supports a variety of protocols. Interaction with these protocols is mediated by the PCP (protocol and
connection processing) module. It handles all connection and packet management between the stream protocols and
the network protocol stack/device drivers.
15
16
Transports requests from any N-blade to any D-blade and vice versa (even on the same node)
The protocol is called SpinNP (Spinnaker network protocol) and is the language that the N-blade speaks to the D-blade
Uses UDP/IP
17
SpinNP is the protocol family used within a cluster or between clusters to carry high frequency/high bandwidth
messages between blades or between an m-host and a blade.
18
Cluster Session Manager (CSM) is the communication layer that manages connections using the SpinNP protocol
between two blades. The blades can be either both local or one local and one remote. Clients of CSM use it because it
provides for blade to blade communication without the client's knowledge of where the remote blade is located.
19
20
21
Basically a wrapper around Data ONTAP 7G that translates SpinNP for WAFL. The Spinnaker D-blade (SpinFS file
system, storage pools, VFS, Fibre Channel Driver, N+1 storage failover) was replaced by Data ONTAP (encapsulated
into a FreeBSD kernel module)
Certain parts of the old Data ONTAP aren't used (UI, network, protocols)
It speaks SpinNP on the front end
The current D-blade is mostly made up of WAFL
22
D-blade is the disk-facing software kernel module and is derived from ONTAP. It contains WAFL, RAID, and Storage.
SpinHI is part of the D-blade, and sits directly above WAFL. It processes all incoming SpinNP fileop messages. Most
of these are translated into WAFL messages.
23
24
The M-host is a user space environment on a node, along with the entire collection of software services:
Command shells and API servers.
Service processes for upcalls from the kernel.
User space implementation of network services, such as DNS, and file access services such as HTTP and FTP.
Underlying cluster services, such as RDB, cluster membership services, and quorum.
Logging services such as EMS.
Environmental monitors.
Higher level cluster services, such as VLDB, job manager, and LIF manager.
Processes that interact with external servers, such as Kerberos and LDAP.
Processes that perform operational functions such as NDMP control and auditing.
Services that operate on data, such as anti-virus and indexing.
25
SMF currently supports two types of persistent data storage via table-level attributes: persistent and replicated. The
replicated tables are identical copies of the same set of tables stored on every node in the cluster. Persistent tables are
node specific and stored locally on each node in the cluster.
Colloquially, these table attributes are referred to as RDB (replicated) and CDB (persistent).
26
27
28
29
30
32
Manages all cluster-mode network connections: data, cluster, and mgmt networks.
Uses RDB to store network configuration information.
Uses RDB to know when to migrate a LIF to another node.
33
34
35
36
The vol0 volume of a node is analogous to the root volume of a Data ONTAP 7G system. It contains the data needed
for the node to function.
The vol0 volume does not contain any user data, nor is it part of the namespace of a vserver. It lives (permanently) on
the initial aggregate that is created when each node is initialized.
The vol0 volume is not protected by mirrors or tape backups, but that's OK. Although it is a very important volume (a
node cannot boot without its vol0 volume), the data contained on vol0 is (largely) re-creatable. If it were lost, the log
files would indeed be gone. But because the RDB data is replicated on every node in the cluster, that data can be
automatically re-created onto this node.
37
Each vserver has one namespace and, therefore, one root volume. This is separate from the vol0 volume of each
node.
38
The RDB units do not contain user data, but rather they contain data that helps manage the cluster. These databases
are replicated, that is, each node has its own copy of the database, and that database is always in sync with the
databases on the other nodes in the cluster. RDB database reads are performed locally on each node, but an RDB
write is performed to one master RDB database, and then those changes are replicated to the other databases
throughout the cluster. When reads are done of an RDB database, they can be fulfilled locally, without the need to
send any requests over the cluster networks.
The RDB is transactional in that it guarantees that when something is being written to a database, either it all gets
written successfully or it all gets rolled back. No partial/inconsistent database writes are committed.
There are three RDB units (VLDB, Management, VifMgr) in every cluster, which means that there are three RDB unit
databases on every node in the cluster.
39
Replicated Database
Currently three RDB units: VLDB, VifMgr, Management
Maintains the data that manages the cluster
Each unit has its own replication unit
Unit is made up of one master (read/write) and other secondaries (read-only)
One node contains the master of an RDB app, others contain the secondaries
Writes go to the master, then get propagated to others in the unit (via the cluster network)
Enables the consistency of the units through voting and quorum
The user space processes for each RDB unit vote to determine which node (process) will be the master
Each unit has a master, which could be a different node for each unit
The master can change as quorum is lost and regained
An RDB unit is considered to be healthy only when it is in quorum (i.e., a master is able to be elected)
A simple majority of online nodes is required to have a quorum
One node is designated as epsilon (can break a tie) for all RDB units
An RDB replication ring stays online as long as a bare majority of the application instances are healthy and in
communication (a quorum). When an instance is online (part of the quorum), it enjoys full read/write capability on
up-to-date replicated data. When offline, it is limited to read-only access to the potentially out-of-date data offered by the
local replica. The individual applications all require online RDB state to provide full service.
40
Each RDB unit has its own ring. If n is the number of nodes in the cluster, then each unit/ring is made up of n databases
and n processes. At any given time, one of those databases is designated as the master and the others are designated
as secondary databases. Each RDB unit's ring is independent of the other RDB units. If nodeX has the master
database for the VLDB unit, nodeY may have the master for the VifMgr unit and nodeZ may have the master for the
Management unit.
The master of a given unit can change. For example, when the node that is the master for the Management unit gets
rebooted, a new Management master needs to be elected by the remaining members of the Management unit. It's
important to note that a secondary can become a master and vice versa. There isn't anything special about the
database itself, but rather the role of the process that manages it (master versus secondary).
When data has to be written to a unit, the data is written to the database on the master and then the master takes care
of immediately replicating the changes to the secondary databases on the other nodes. If a change cannot be
replicated to a certain secondary, then the entire change is rolled back everywhere. This is what we mean by no partial
writes. Either all databases of an RDB unit get the change, or none get the change.
41
42
Quorum requirements are based on a straight majority calculation. To promote easier quorum formation given an
even number of replication sites, one of the sites is assigned an extra partial weight (epsilon). So, for a cluster of 2n
sites, quorum can be formed by the n-site partition that includes the epsilon site.
43
Let's define some RDB terminology. A master can be elected only when there is a quorum of members available (and
healthy) for a particular RDB unit. Each member votes for the node that it thinks should be the master for this RDB
unit. One node in the cluster has a special tie-breaking ability called epsilon. Unlike the master, which may be
different for each RDB unit, epsilon is a single node that applies to all RDB units.
Quorum means that a simple majority of nodes are healthy enough to elect a master for the unit. The epsilon power is
only used in the case of a voting tie. If a simple majority does not exist, the epsilon node (process) chooses the master
for a given RDB unit.
A unit goes out of quorum when cluster communication is interrupted, for example, due to a reboot, or perhaps a
cluster network hiccup that lasts for a few seconds. It comes back into quorum automatically when the cluster
communication is restored.
44
45
A master can be elected only when there is a majority of local RDB units connected (and healthy) for a particular RDB
unit. A master is elected when each local unit agrees on the first reachable healthy node in the RDB site list. A
healthy node would be one that is connected, able to communicate with the other nodes, has CPU cycles, and has
reasonable I/O.
The master of a given unit can change. For example, when the node that is the master for the Management unit gets
rebooted, a new Management master needs to be elected by the remaining members of the Management unit.
A local unit goes out of quorum when cluster communication is interrupted for a few seconds, for example, due to a
reboot or a brief cluster network hiccup. It comes back in quorum automatically as the
RDB units are always working to monitor and maintain a good state. When a local unit goes out of quorum and then
comes back into quorum, the RDB unit is re-synchronized. It's important to note that the VLDB process on a node
could go out of quorum for some reason, while the VifMgr process on that same node has no problem at all.
When a unit goes out of quorum, reads from that unit can be done, but writes to that unit cannot. That restriction is
enforced so that no changes to that unit happen during the time that a master is not agreed upon. Besides the VLDB
example above, if the VifMgr goes out of quorum, access to LIFs is not affected, but no LIF failover can occur.
46
Marking a node as ineligible (by way of the cluster modify command) means that it no longer affects RDB quorum or
voting. If the epsilon node is marked as ineligible, epsilon will be automatically given to another node.
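For example, assuming a hypothetical node named node3:
node::> cluster modify -node node3 -eligibility false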
47
48
49
The cluster ring show command is available only at the advanced privilege level or higher.
The DB Epoch values of the members of a given RDB unit should be the same. For example, as shown, the DB
epoch for the mgmt unit is 8, and it's 8 on both node5 and node8. But that is different than the DB epoch for the vldb
unit, which is 6. This is fine. The DB epoch needs to be consistent across nodes for an individual unit. Not all units
have to have the same DB epoch.
Whenever an RDB ring forms a new quorum and elects the RDB master, the master starts a new epoch.
The combination of epoch number and transaction number <epoch,tnum> is used to construct RDB versioning.
The transaction number is incremented with each RW transaction.
All RDB copies that have the same <epoch,tnum> combination contain exactly the same information.
51
52
When a majority of the instances in the RDB ring are available, they elect one of these instances the master, with the
others becoming secondaries. The RDB master is responsible for controlling updates to the data within the replication
ring.
When one of the nodes wishes to make an update, it must first obtain a write transaction from the master. Under this
transaction, the node is free to make whatever changes it wants; however, none of these changes are seen externally
until the node commits the transaction. On commit, the master attempts to propagate the new data to the other nodes
in the ring.
53
If a quorum's worth of nodes is updated, the changes are made permanent; if not, the changes are rolled back.
54
55
One node in the cluster has a special voting weight called epsilon. Unlike the masters of each RDB unit, which may be
different for each unit, the epsilon node is the same for all RDB units. This epsilon vote is only used in the case of an
even partitioning of a cluster, where, for example, four nodes of an eight-node cluster cannot talk to the other four
nodes. This is very rare, but should it happen, a simple majority would not exist and the epsilon node would sway the
vote for the masters of the RDB units.
56
57
This diagram shows that each node contains the following: N-blade, CSM, D-blade, M-host, RDB units (3), and vol0.
58
59
60
61
The key change in Boilermaker from 7G is that we now have a dual-stack architecture. However, saying "dual-stack"
tends to imply that both stacks have equal prominence. In our case, the stack inherited from 7G, referred
to as the SK stack, owns the network interfaces in normal operation and runs the show for 7-mode and C-mode apps
for the most part. The FreeBSD stack inherited from GX runs as a surrogate to the SK stack and provides the
programmatic interface (BSD sockets) for the mhost apps to communicate with the network. The FreeBSD stack itself
does not directly talk to the network in normal operational mode. This is because it does not own any of the physical
network interfaces. The FreeBSD stack maintains the protocol (TCP+UDP) state for all mhost connections and sets up the
TCP/IP frames over mhost data. It sends the created TCP/IP frames to the SK stack for delivery to the network. On the
ingress side, the SK stack delivers all packets destined for the mhost to the FreeBSD stack.
Data ONTAP 8.0 makes a distinction between physical network ports and logical interfaces, or LIFs. Each port has a
role associated with it by default, although that can be changed through the UI. The role of each network port should
line up with the network to which it is connected.
Management ports are for administrators to connect to the node/cluster, for example, through SSH or a Web browser.
Cluster ports are strictly for intra-cluster traffic.
Data ports are for NFS and CIFS client access, as well as the cluster management LIF.
10
11
Using a FAS30x0 as an example, the e0a and e0b ports are defined as having a role of cluster, while the e0c and e0d
ports are defined for data. The e1a port would be on a network interface card in one of the four horizontal slots at the
top of the controller. The e1a port is, by default, defined with a role of mgmt.
12
The network port show command shows the summary view of the ports of this 4-node cluster. All the ports are
grouped by node, and you can see the roles assigned to them, as well as their status and Maximum Transmission Unit
(MTU) size. Notice the e1b data ports that are on the nodes, but not connected to anything.
13
A LIF in Cluster-Mode terminology refers to an IP address and netmask associated with a data port.
Each node can have multiple data LIFs, and multiple data LIFs can reside on a single data port or, optionally, on an interface group.
The default LIF creation command will also create default failover rules. If manual/custom failover rule creation is
desired, or if multiple data subnets will be used, add the "use-failover-groups disabled" or specific "-failover-group"
options to the "network interface create" command.
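As a hedged illustration of those options (the vserver, LIF, node, port, address, and failover-group names here are hypothetical), a data LIF tied to a specific failover group might be created like this:
node::> network interface create -vserver vs1 -lif data1 -role data -home-node node1 -home-port e0c -address 192.168.10.50 -netmask 255.255.255.0 -failover-group data_fg1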
14
Data ONTAP connects with networks through physical interfaces (or links). The most common interface is an Ethernet
port, such as e0a, e0b, e0c, and e0d.
Data ONTAP has supported IEEE 802.3ad link aggregation for some time now. This standard allows multiple network
interfaces to be combined into one interface group. After being created, this group is indistinguishable from a physical
network interface.
Multiple ports in a single controller can be combined into a trunked port via the interface group feature. An interface group supports three distinct modes: multimode, multimode-lacp, and singlemode, and the load distribution is selectable between mac, ip, and sequential. Using interface groups requires a matching configuration on the connected client Ethernet switch, depending on the configuration selected.
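A minimal sketch of building an interface group with the modes and load-distribution choices just described (the node, ifgrp, and port names are hypothetical; verify the exact syntax against the network port ifgrp man pages for your release):
node::> network port ifgrp create -node node1 -ifgrp a0a -mode multimode -distr-func ip
node::> network port ifgrp add-port -node node1 -ifgrp a0a -port e0c
node::> network port ifgrp add-port -node node1 -ifgrp a0a -port e0d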
15
Ports are either physical ports (NICs), or virtualized ports such as ifgrps or vlans. Ifgrps treat several physical ports as
a single port, while vlans subdivide a physical port into multiple separate ports. A LIF communicates over the network
through the port it is currently bound to.
16
Using 9000 MTU on the cluster network is highly recommended, for performance and reliability reasons. The cluster switch or VLAN should be modified to accept 9000-byte payload frames prior to attempting the cluster join/create. Standard 1500 MTU cluster ports should only be used in non-production lab or evaluation situations, where performance is not a consideration.
The LIF names need to be unique within their scope. For data LIFs, the scope is a cluster virtual server, or vserver. For
the cluster and management LIFs the scopes are limited to their nodes. Thus, the same name, like mgmt1, can be
used for all the nodes, if desired.
18
19
A routing group is automatically created when the first interface on a unique subnet is created. The routing group is role-specific, and allows the use of the same set of static and default routes across many logical interfaces. The default naming convention for a routing group is representative of the interface role and the subnet it is created for.
20
The first interface created on a subnet will trigger the automatic creation of the appropriate routing-group. Subsequent
LIFs created on the same subnet will inherit the existing routing group.
Routing groups cannot be renamed. If a naming convention other than the default is required, the routing group can be pre-created with the desired name, then applied to an interface during LIF creation or as a modify operation on the LIF.
21
22
23
Routing groups are created automatically as new LIFs are created, unless an existing routing group already covers
that port role/network combination. Besides the node management LIF routing groups, other routing groups have no
routes defined by default.
The node management LIFs on each node have static routes automatically set up for them, using the same default gateway.
There is a metric value for each static route, which is how the administrator can configure which route would be
preferred over another (the lower the metric, the more preferred the route) in the case where there is more than one
static route defined for a particular LIF. The metric values for the node management LIFs are 10. When routes are
created for data LIFs, if no metric is defined, the default will be 20.
24
25
As with the network interface show output, the node management LIFs have a Server that is the node itself. The data LIFs are associated with a cluster vserver, so they're grouped under that.
26
Why migrate a LIF? It may be needed for troubleshooting a faulty port, or perhaps to offload a node whose data network ports are being saturated with other traffic. A LIF will also fail over if its current node is rebooted.
Unlike storage failover (SFO), LIF failover or migration does not cause a reboot of the node from which the LIF is migrating. Also unlike SFO, LIFs can migrate to any node in the cluster, not just within the high-availability pair. Once a LIF is migrated, it can remain on the new node for as long as the administrator wants it to.
We'll cover failover policies and rules in more detail later.
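As a rough sketch of a manual migration and a later revert back to the home port (the vserver, LIF, node, and port names are hypothetical):
node::> network interface migrate -vserver vs1 -lif data1 -dest-node node3 -dest-port e0c
node::> network interface revert -vserver vs1 -lif data1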
27
Data LIFs can migrate or failover from one node and/or port to any other node and/or port within the cluster
LIF migration is generally for load balancing; LIF failover is for node failure
Data LIF migration/failover is NOT limited to an HA pair
Nodes in a cluster are paired as high-availability (HA) pairs (these are called pairs, not clusters)
Each member of an HA pair is responsible for the storage failover (SFO) of its partner
Each node of the pair is a fully functioning node in the greater cluster
Clusters can be heterogeneous (in terms of hardware and Cluster-Mode versions), but an HA pair must be the same
controller model
First, we show a simple LIF migration
Next, we show what happens when a node goes down:
Both data LIFs that reside on that node fail over to other ports in the cluster
The storage owned by that node fails over to its HA partner
The failed node is gone (i.e., its partner does not assume its identity like in 7G and 7-Mode)
The data LIF IP addresses remain the same, but are associated with different NICs
28
Remember that data LIFs aren't permanently tied to their nodes. However, the port to which a LIF is migrating is tied to a node. This is another example of the line between physical and logical. Also, ports have a node vserver scope, whereas data LIFs have a cluster vserver scope.
All data and cluster-mgmt LIFs can be configured to automatically fail over to other ports/nodes in the event of failure. Failover can also be used for load balancing if an N-blade is overloaded. The TCP state is not carried over during failover to another node.
Best practice is to fail LIFs from even nodes over to other even nodes and LIFs from odd nodes to other odd nodes.
29
The default policy that gets set when a LIF is created is nextavail, but priority can be chosen if desired.
In a 2-node cluster, the nextavail failover-group policy creates rules to fail over between interfaces on the 2 nodes. In clusters with 4 or more nodes, the system-defined group will create rules between alternating nodes, to prevent the storage failover partner from receiving the data LIFs as well in the event of a node failure. For example, in a 4-node cluster, the default failover rules are created so that node1 -> node3, node2 -> node4, node3 -> node1, and node4 -> node2.
Priority rules can be set by the administrator. The default rule (priority 0, which is the highest priority) for each LIF is its home port and node. Additional rules that are added will further control the failover, but only if the failover policy for that LIF is set to priority. Otherwise, rules can be created but won't be used if the failover policy is nextavail. Rules are attempted in priority order (lowest to highest) until the port/node combination for a rule is able to be used for the LIF. Once a rule is applied, the failover is complete.
Manual failover rules can also be created, in instances where explicit control is desired, by using the disabled option.
30
31
32
33
As the cluster receives different amounts of traffic, the traffic on all of the LIFs of a virtual server can become unbalanced. DNS load balancing aims to dynamically choose a LIF based on load instead of using the round-robin method of providing IP addresses.
With DNS load balancing enabled, a storage administrator can choose to allow the new built-in load balancer to balance client logical interface (LIF) network access based on the load of the cluster. This DNS server resolves names to LIFs based on the weight of a LIF. A vserver can be associated with a DNS load-balancing zone, and LIFs can be either created or modified in order to be associated with a particular DNS zone. A fully qualified domain name can be added to a LIF in order to create a DNS load-balancing zone by specifying a dns-zone parameter on the network interface create command.
There are two methods that can be used to specify the weight of a LIF: the storage administrator can specify a LIF weight, or the LIF weight can be generated based on the load of the cluster. Ultimately, this feature helps to balance the overall utilization of the cluster. It does not increase the performance of any one individual node; rather, it makes sure that each node is more evenly used. The result is better performance utilization from the entire cluster.
DNS load balancing also improves the simplicity of maintaining the cluster. Instead of manually deciding which LIFs are used when mounting a particular global namespace, the administrator can let the system dynamically decide which LIF is the most appropriate. And once a LIF is chosen, that LIF may be automatically migrated to a different node to ensure that the network load remains balanced throughout the cluster.
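A minimal sketch of creating a LIF in a DNS load-balancing zone using the dns-zone parameter mentioned above (the vserver, LIF, node, port, address, and zone name are hypothetical):
node::> network interface create -vserver vs1 -lif data2 -role data -home-node node2 -home-port e0c -address 192.168.10.51 -netmask 255.255.255.0 -dns-zone storage.example.com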
35
The -allow-lb-migrate true option will allow the LIF to be migrated based on failover rules to an underutilized port on
another head. Pay close attention to the failover rules because an incorrect port may cause a problem. A good practice
would be to leave the value false unless you're very certain about your load distribution.
The -lb-weight load option takes the system load into account. CPU, throughput, and number of open connections are measured when determining load. These currently cannot be changed.
The -lb-weight 1..100 value for the LIF is like a priority. If you assign a value of 1 to LIF1, and a value of 10 to LIF2,
LIF1 will be returned 10 times more often than LIF2. An equal numeric value will round robin each LIF to the client.
This would be equivalent to DNS Load Balancing on a traditional DNS Server.
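As a hedged example using the options described on this slide (the vserver and LIF names are hypothetical), an administrator might pin a static weight and leave load-based migration off:
node::> network interface modify -vserver vs1 -lif data1 -lb-weight 10 -allow-lb-migrate false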
36
The weights of the LIFs are calculated on the basis of CPU utilization and throughput (the average of both is taken):
1. LIF_weight_CPU = ((Max CPU on node - used CPU on node) / (number of LIFs on node)) * 100
2. LIF_weight_throughput = ((Max throughput on port - used throughput on port) / (number of LIFs on port)) * 100
The higher the weight, the lower the probability that the associated LIF is returned.
37
38
39
40
41
NFS is the standard network file system protocol for UNIX clients, while CIFS is the standard network file system for
Windows clients. Macintosh clients can use either NFS or CIFS.
The terminology is slightly different between the two protocols. NFS servers are said to export their data, and the
NFS clients mount the exports. CIFS servers are said to share their data, and the CIFS clients are said to use or
map the shares.
NFS is the de facto standard for UNIX and Linux, CIFS is the standard for Windows
N-blade does the protocol translation between {NFS and CIFS} and SpinNP
NFS and CIFS have a virtual server scope (so, there can be multiples of each running in a cluster)
NFS is a licensed protocol, and is enabled per vserver by creating an NFS server associated with the vserver.
Similarly, CIFS is a licensed protocol, and is enabled per vserver by creating a CIFS server associated with the vserver.
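A brief, hedged sketch of enabling both protocols for one vserver (the vserver, CIFS server, and domain names are hypothetical; CIFS configuration is covered in more detail later in this module):
node::> vserver nfs create -vserver vs1
node::> vserver cifs create -vserver vs1 -cifs-server VS1CIFS -domain nau01.example.com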
The name-service switch is assigned at a virtual server level and, thus, Network Information Service (NIS) and
Lightweight Directory Access Protocol (LDAP) domain configurations are likewise associated at the virtual server level.
A note about virtual servers: although a number of virtual servers can be created within a cluster, with each one containing its own set of volumes, vifs, NFS, and CIFS configurations (among other things), most customers only use one virtual server. This provides for the most flexibility, as virtual servers cannot, for example, share volumes.
10
The Kerberos realm is not created within a Data ONTAP cluster. It must already exist, and then configurations can be
created to associate the realm for use within the cluster.
Multiple configurations can be created. Each of those configurations must use a unique Kerberos realm.
11
The NIS domain is not created within a Data ONTAP cluster. It must already exist, and then configurations can be
created to associate the domain with cluster vservers within Data ONTAP 8.0.
Multiple configurations can be created within a vserver and for multiple vservers. Any or all of those configurations can
use the same NIS domain or different ones. Only one NIS domain configuration can be active for a vserver at one
time.
Multiple NIS servers can be specified for an NIS domain configuration when it is created, or additional servers can be
added to it later.
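A minimal sketch of associating an existing NIS domain with a vserver (the vserver, domain, and server address are hypothetical; confirm the option names against the man page for your release):
node::> vserver services nis-domain create -vserver vs1 -domain nis.example.com -active true -servers 192.168.10.5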
12
The LDAP domain is not created within a Data ONTAP cluster. It must already exist, and then configurations can be
created to associate the domain with cluster vservers within Data ONTAP 8.0.
LDAP can be used for netgroup and UID/GID lookups in environments where it is implemented.
Multiple configurations can be created within a vserver and for multiple vservers. Any or all of those configurations can
use the same LDAP domain or different ones. Only one LDAP domain configuration can be active for a vserver at one
time.
13
14
15
16
Each volume will have an export policy associated with it. Each policy can have rules that govern the access to the volume based on criteria such as a client's IP address or network, the protocol used (NFS, NFSv2, NFSv3, CIFS, any), and many other things. By default, there is an export policy called default that contains no rules.
Each export policy is associated with one cluster vserver. An export policy name need only be unique within a vserver. When a vserver is created, the default export policy is created for it.
Changing the export rules within an export policy changes the access for every volume using that export policy. Be careful.
17
18
Export policies control the clients that can access the NAS data in a vserver. They are applicable to both CIFS and NFS access. Each export policy consists of a set of export rules that define the mapping of a client, its permissions, and the access protocol (CIFS, NFS). Export policies are associated with volumes, which, by virtue of being attached to the namespace by a junction, control the access to the data in the volume.
19
20
Export policies serve as access controls for the volumes. During configuration and testing, a permissive export policy
should be implemented, and tightened up prior to production by adding additional export policies and rules to limit
access as desired.
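A hedged sketch of a permissive rule for testing, followed by assigning the policy to a volume (the policy, vserver, and volume names are hypothetical):
node::> vserver export-policy create -vserver vs1 -policyname testpol
node::> vserver export-policy rule create -vserver vs1 -policyname testpol -clientmatch 0.0.0.0/0 -rorule any -rwrule any -protocol any
node::> volume modify -vserver vs1 -volume vol1 -policy testpol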
21
If you're familiar with NFS in Data ONTAP 7G (or on UNIX NFS servers), then you'll wonder about how things are tagged to be exported. In Data ONTAP clusters, all volumes are exported as long as they're mounted (through junctions) into the namespace of their cluster vservers. The volume and export information is kept in the Management RDB unit, so there is no /etc/exports file. This data in the RDB is persistent across reboots and, as such, there are no temporary exports.
The vserver root volume is exported and, because all the other volumes for that vserver are mounted within the namespace of the vserver, there is no need to export anything else. After the NFS client does a mount of the namespace, the client has NFS access to every volume in this namespace. NFS mounts can also be done for specific volumes other than the root volume, but then the client is limited to only being able to see this volume and its descendant volumes in the namespace hierarchy.
Exporting a non-volume directory within a volume is permitted but not recommended. NetApp recommends that a separate volume be set up for that directory, followed by an NFS mount of that volume.
If a volume is created without being mounted into the namespace, or if it gets unmounted, it is not visible within the namespace.
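As a brief sketch of mounting a volume into the namespace at creation time via a junction (the vserver, volume, aggregate, size, and junction path are hypothetical):
node::> volume create -vserver vs1 -volume proj1 -aggregate node1_aggr1 -size 100g -junction-path /proj1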
22
23
24
25
26
27
To prevent clock skew errors, ensure that the NTP configuration is working properly prior to the cifs server create operation.
28
The machine account does not need to be pre-created on the domain for a domain >= Windows 2000 (it will be
created during the vserver cifs create), but the userid/passwd requested does need to have domain join
permissions/credentials to the specified OU container.
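A hedged example of the create operation just described (the vserver, CIFS server name, domain, and OU are hypothetical; the credentials supplied at the prompt must have domain-join rights to that OU):
node::> vserver cifs create -vserver vs1 -cifs-server VS1CIFS -domain nau01.example.com -ou OU=Filers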
29
30
In this slide, we see the user interface of the Windows Active Directory or domain controller where the machine
account has been created for the CIFS configuration of a vserver.
31
Active Directory uses Kerberos authentication, while NT LAN Manager (NTLM) is provided for backward compatibility
with Windows clients prior to Windows 2000 Server. Prior to Active Directory, Windows domains had primary and
secondary domain controllers (DCs). With Active Directory, there may be one or multiple Windows servers that work in
cooperation with each other for the Windows domain. A domain controller is now a role that is played by an Active
Directory machine.
When configuring CIFS, the domain controller information will be automatically discovered and the account on the
domain controller will be created for you. If the virtual server requires preferred DC ordering, this can be set via the
"vserver cifs domain preferred-dc add" command.
32
Some typical steps needed to configure CIFS for a cluster vserver are shown here. The first step is creating a CIFS
configuration. This is the CIFS server itself for the vserver vs1.
Three CIFS shares are created. The first one, root, represents the normal path to the root of the namespace. Keep in mind that if the root volume or any other volumes have load-sharing (LS) mirrors, this normal path will use the read-only volumes. Therefore, a read/write share called root_rw needs to be created. If a client maps to the read/write share, it will always use the read/write volumes throughout the namespace of that vserver (no LS mirrors will be used).
More details about mirrors, read/write, and read-only paths will be provided later.
The third share uses dynamic shares, based on the user name. For example, if user bill in the nau01 domain connects
to the CIFS server, and there is a path in the namespace of /user/bill, then the %u will be translated dynamically into
bill such that there is a share called bill that maps to the junction path /user/bill. While bill is on his PC in the nau01
domain, he can go to \\mycifs\bill and be put into whatever volume has a junction path of /user/bill.
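A hedged sketch of the three shares described on this slide (the vserver name is hypothetical, and the exact dynamic-share syntax may differ by release):
node::> vserver cifs share create -vserver vs1 -share-name root -path /
node::> vserver cifs share create -vserver vs1 -share-name root_rw -path /.admin
node::> vserver cifs share create -vserver vs1 -share-name %u -path /user/%u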
33
34
Creating a CIFS configuration is the process of enabling a CIFS server for a given cluster vserver. It is, in effect,
creating a CIFS server. But remember that a CIFS server is specific to a vserver. To enable CIFS for another vserver,
you would need to create a vserver-specific CIFS configuration.
35
A CIFS configuration is limited in scope to a cluster vserver, and a vserver does not have to have a CIFS configuration.
As such, a vserver must exist before a CIFS configuration can be created.
36
Kerberos is sensitive to time skew among nodes and between the cluster and a Kerberos server. When multiple machines are working together, as is the case with a Data ONTAP cluster and a Kerberos server, the times on those machines should be within a few minutes of each other. By default, a five-minute time skew is allowed. A time skew greater than that will cause problems. Time zone settings take care of machines being in different time zones, so that's not a problem. NTP is a good way to keep multiple machines in time sync with each other. You also can widen the allowable time skew, but it's best to keep the machine times in sync anyway.
37
Cluster-Mode allows concurrent access to files by way of NFS and CIFS and with the use of Kerberos. All of these protocols have the concept of a principal (user or group), but they're incompatible with each other. So, name mappings provide a level of compatibility.
CIFS principals explicitly contain the CIFS domain as part of the principal. Likewise, Kerberos principals contain an instance and a realm. NFS uses NIS to store its principal information and is simply a name (the NIS domain is implied and so is not needed in the principal). Because of these differences, the administrator needs to set up rules (specific or regular expression) to enable these protocols to resolve the differences and correlate these principals with each other.
38
There are no default mappings configured on an 8.0 cluster, so for multiprotocol access and UNIX <--> Windows username matching, generic name mappings will need to be created using regular expressions.
39
Note the two backslashes between the CIFS domains and the CIFS user. Because these parameters take regular
expressions, and the backslash is a special character in regular expressions, it must be escaped with another
backslash. Thus, in a regular expression, two backslashes are needed to represent one backslash.
40
In the second example, a more dynamic mapping is set up. In this example, any user in the MYCIFS domain would be
mapped with a like-named UNIX user. So, user yoda in the MYCIFS domain a.k.a. MYCIFS\yoda would be mapped
with the UNIX user yoda.
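A hedged sketch of that dynamic mapping (the vserver name is hypothetical; note the doubled backslash explained on the previous slide):
node::> vserver name-mapping create -vserver vs1 -direction win-unix -position 1 -pattern MYCIFS\\(.+) -replacement \1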
41
42
43
44
There are a number of ways to protect your data, and a customer's data protection plan will likely use all of these methods.
Snapshot functionality is controlled by Management, which provides the UI for manual Snapshot copies and the Job
Manager policies and schedules for automated Snapshot operations. Each volume can have a Snapshot policy
associated with it. This policy can have multiple schedules in it, so that Snapshot copies can be created using any
combinations of hourly, daily, weekly, and so on. The policy also says how many of each of those to retain before
deleting an old one. For example, you can keep four hourly Snapshot copies, and when the fifth one is taken, the
oldest one is removed, such that a rolling window of the previous four hours of Snapshot copies is retained.
The .snapshot directories are visible and usable by clients, allowing users to restore their own data without the need
for administrator intervention. When the entire volume needs to be restored from a Snapshot copy, the administrator
uses the volume snapshot promote command, which is basically the same thing as doing a restore using SnapRestore
technology. The entire Snapshot copy is promoted, replacing the entire volume. Individual files can be restored only if
done through a client.
The Snapshot copies shown here are scheduled Snapshot copies. We have three Snapshot copies that were taken
five minutes apart for the past 15 minutes, two daily Snapshot copies, six hourly Snapshot copies, and two weekly
Snapshot copies.
Note:
We recommend that you manually replicate all mirrors of a volume immediately after you promote its Snapshot copy. Not doing
so can result in unusable mirrors that must be deleted and recreated.
There are two Snapshot policies that are automatically created: default and none. New volumes are associated with
a default snapshot policy and schedule. The defaults provide 6 hourly, 2 daily and 2 weekly snapshots. A pre-defined
snapshot policy named "none" is also available for volumes that do not require snapshots.
A volume that has none as its Snapshot policy will have no Snapshot copies taken. A volume that uses the default policy will, after two weeks, have a total of ten Snapshot copies retained (six hourly copies, two daily copies, and two weekly copies).
Volumes are created by default with a 20% snapshot reserve.
New schedules for use with a snapshot policy can be defined via the "job schedule cron create" command.
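A hedged example of defining a custom cron schedule that a Snapshot policy could then reference (the schedule name and times are hypothetical; check the job schedule cron create man page for the exact options in your release):
node::> job schedule cron create -name daily2am -hour 2 -minute 0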
10
11
Mirrors are read-only volumes. Each mirror is created with an association with a read/write (R/W) volume, and labeled
as either an LS or DP mirror. LS and DP mirrors are the same in substance, but the type dictates how the mirror is
used and maintained.
Mirrors are copies of their R/W volumes, and are only as synchronized with the R/W as the administrator keeps them, through manual replication or scheduled (automated) replication. Generally, DP mirrors do not need to be as up-to-date as LS mirrors, due to their different purposes.
Each mirror that is created can have a replication schedule associated with it, which determines when (cron) or how often (interval) the replications are performed on this mirror. All LS mirrors of a volume are treated as a unified group; they use the same schedule (which is enforced by the UI; that is, if you choose a different schedule for one LS mirror, the other LS mirrors of that volume will be changed for you). All DP mirrors are independent of each other, and are not forced to use the same schedule as any other mirror.
12
All replication is done directly from the R/W volume to the appropriate mirrors. This is different from the cascading that
occurs within Data ONTAP 7G.
Creating a mirror, associating it with a source volume, and replicating to it are separate steps.
An LS or DP mirror can be promoted (like a restore using SnapRestore technology) to take the place of its R/W
volume.
13
14
15
The purpose of LS mirrors is to offload volumes (and a single D-blade) of read activity. As such, it is very important that all mirrors are in sync with each other (at the same data version level). When a replication of a volume to its LS mirrors is performed, all LS mirrors of the volume are synced together and directly from the volume (no cascading).
The way that an NFS mount is performed on a client, or which CIFS share is mapped, makes a difference in what data is accessed (the read/write volume or one of its LS mirrors). The normal method of NFS mounting the root of a virtual server (vserver), for example, is mount <ip address>:/ /myvserver. This will cause the LS selection algorithm to be invoked. If, however, the NFS mount is executed using the .admin path, as in mount <ip address>:/.admin /myvserver, this mount from the client will always access the R/W volumes when traversing the namespace, even if there are LS mirrors for volumes. For CIFS, the difference is not in how a share is accessed, but in what share is accessed. If a share is created for the .admin path, then use of that share will cause the client to always have R/W access. If a share is created without using .admin, then the LS selection algorithm will be used.
Clients are transparently directed to an LS mirror for read operations, rather than to the read/write volume, unless the special .admin path is being used.
16
17
18
When the / path is used (that is, the /.admin path is not used) and a read or write request comes through that path into the N-blade of a node, the N-blade first determines if there are any LS mirrors of the volume that it needs to access. If there aren't any LS mirrors of that volume, the read request will be routed to the R/W volume. If there are LS mirrors of it, preference is given to an LS mirror on the same node as the N-blade that fielded the request. If there isn't an LS mirror on that node, then an up-to-date LS mirror from another node is chosen.
If a write request goes to an LS mirror, it will return an error to the client, indicating that this is a read-only file system. To write to a volume that has LS mirrors, the .admin path must be used.
For NFS clients, an LS mirror is used for a set period of time (minutes), after which a new LS mirror is chosen. Once a file is opened, different LS mirrors may be used across different NFS operations. The NFS protocol can handle the switch from one LS mirror to another.
For CIFS clients, the same LS mirror will continue to be used for as long as a file is open. Once the file is closed, and the period of time expires, a new LS mirror will be selected prior to the next file open operation. This is done because the CIFS protocol cannot handle the switch from one LS mirror to another.
19
If a load-sharing mirror is lagging behind the most up-to-date load-sharing mirror in the set, the exported-snapshot field will show a dash (-).
20
21
When a client accesses a junction, the N-blade detects that there are multiple MSIDs and will direct the packet to one of the LS mirrors. It will prefer an LS mirror on the same node.
Since the LS mirror is the default volume, there is a separate path to access the RW volume: /.admin.
Each vserver has an entry in the root called .admin. It can only be accessed from the root. When passing through this path, all packets will be directed to the RW volumes.
23
24
25
Replication in BR is through Paloma (Logical replication engine). This will change again to block replication engine in
Rolling Rock.
Volume mirror (SnapMirror) relationships need to be removed and recreated during the upgrade from ONTAP GX
10.0.4 to 8.0.
(The original goal was to convert existing volume mirrors to Paloma, but it was not implemented.) The same applies to upgrades from 8.0 to 8.1.
26
27
28
29
The Administrative function in the M-Host is responsible for maintaining the peer-to-peer relationships and scheduling
transfers in the context of a relationship. It also ensures that no more than the maximum permissible transfers can be
initiated at any point in time.
The Data Mover function in the D-blade is responsible for differencing the identified snapshots (at the Source) and transferring the data over the wire in conformance with the relevant protocol. The Data Mover function at the destination then lays out the data at the destination Data Container object, appropriately. The limit on the maximum amount of data that can be transferred in the context of a transfer session is also ensured by the Data Mover engine.
30
31
32
33
34
35
36
37
38
39
Mirrors use schedules directly, whereas Snapshot copies are controlled by Snapshot policies, which in turn contain
schedule(s). For mirrors, the schedule is defined as part of the mirror definition. For Snapshot copies, the Snapshot
policy is defined as part of the R/W volume definition.
The schedules are maintained under the job command directory. There are some schedules defined by default, as
shown in this example. If a mirror is assigned the 5min schedule, for example, the mirror will be replicated every five
minutes, based on the system clock. If a Snapshot policy uses the hourly schedule, a Snapshot copy will be created
at five minutes after every hour.
40
41
42
43
Here we see that the volume called root has three mirrors: two DP mirrors and one LS mirror.
The instance view of the root_ls2 mirror shows the aggregate on which the mirror lives, when it was last replicated, as
well as other information.
44
45
46
47
There are no native backup or restore commands. All tape backups and restores are done through third-party NDMP applications.
Consider the use of a stretched cluster to have a cluster that is geographically distributed, in case a disaster hits one site. The data can be mirrored to and backed up at a secondary site.
48
50
51
52
53
54
55
56
57
The data copy to the new volume is achieved by a series of copies of the snapshots, each time copying a diminishing delta from the previous snapshot copy.
Only in the final copy is the volume locked for I/O while the final changed blocks are copied, and the file handles are updated to point to the new volume. This should easily complete within the default NFS timeout (600 seconds) and almost always within the CIFS timeout period of 45 seconds. In some very active environments, sufficient data will have changed that it will take a longer period of time to copy than the timeout period. In this case, the end user sees the same effect as if the drive had become disconnected; they will simply need to reconnect to the share and retry the operation.
58
59
The same capability can be used to optimize performance for critical projects. In many types of work there are important crunch times where the project absolutely must complete by deadline. One option is to buy such a very large system that it guarantees that any critical projects complete on time. That might be very expensive. An alternative possible with Cluster-Mode is to reallocate a smaller pool of available resources to prefer the critical project.
Let's assume that Project A is critical and will be the top priority starting next week. The system administrator can transparently move all the other volumes to free up resources for the critical project. The other projects may now get less performance, but that's a trade-off you can control. The system is capable of fluidly adjusting with business cycles and critical needs.
Another Cluster-Mode capability is to transparently grow the storage system. The storage administrator
can add new nodes to the system at any time, and transparently rebalance existing volumes to take
advantage of the new resources.
Backups are the one thing in Data ONTAP clusters that are not cluster-aware. As such, the backup administrator
needs to be aware of what volumes are on what nodes (determined by volume show queries by node), and the
backups of the volumes on each node need to be performed through their respective nodes.
Backups can be done across the cluster using 3-way NDMP, provided that the third-party backup application is given access to the cluster network.
It may be tempting to assume that a backup of a volume includes all the volumes that are mounted under it, but that's not the case. NDMP backups do not traverse junctions. Therefore, every volume that is to be backed up needs to be listed explicitly. The exception to that is if the backup vendor software supports an auto-discovery of file systems, or supports some sort of wildcarding.
Although backing up through an NFS or CIFS client is possible, doing so would utilize all the cluster resources that are meant to serve data, as well as filling the N-blade caches with data that most clients aren't actually using. The best practice is to send the data through a dedicated Fibre Channel connection to the tape device(s) using NDMP, as this doesn't tax the N-blade, data network, or cluster network. But using NFS or CIFS is the only way (at this time) to back up full-striped volumes. The legacy GX data-striped volumes can be backed up through NDMP.
64
65
66
67
68
User space core dumps are named according to the process name (for example, mgwd) and also use the process ID
(pid) of that instance of the process that generated the core file.
Kernel core dumps include the sysid, which is not the node name, but a numerical representation of this node. The date and time in the core dump name indicate when the panic occurred.
When a node panics, a kernel core dump will be generated. There are times, however, when a node is up and running,
but having issues that cannot be debugged live. NetApp Global Support may request that a system core dump be
generated for one or multiple nodes to capture the complete picture of what is happening at that time. If a node is
healthy enough to issue UI commands, then a system reboot command can be entered with the -dump true
parameter. If a node is not healthy enough for that, then from the RLM session to that node, the system core RLM
command can be used to generate a core dump.
RLM is an out-of-band connection to a node that allows for some management of a node even when it is inaccessible
from the console and UI. The RLM connection has a separate IP address and has its own shell. Some sample RLM
commands are system power off, system power on, system reset, and system console.
Core files are meant to be examined by NetApp Global Support and should be reported and uploaded to NetApp
Global Support. The default location to which core dumps should be uploaded (as shown through system coredump
config show) is ftp://ftp.netapp.com/to-ntap/.
The cluster is maintained by constant communication over the cluster network. As such, the cluster network must be
reliable. One of the first things to check when there are problems is the health of the cluster.
Each node writes to log files locally on that node. Those log files are only local and do not contain log messages from the other nodes. Log messages also are written to the Event Management System (EMS), and that enables an administrator on one node (using the UI) to see the event messages from all nodes in the cluster.
ASUPs are also a great way to get the log files.
While a node is booting, and until the vol0 volume is available, all logging goes to /var/log/. After vol0 is available, the
logging goes to /mroot/etc/log/.
Each process has its own log file, for example, mgwd.log, vldb.log, and vifmgr.log.
Log files are rotated every time the particular process starts, and several previous log files are kept for each process, for example, vldb.log.1, vldb.log.2, and so on.
EMS messages are available to be viewed through the UI. The D-blade, N-blade, and Management event log
messages go to the EMS log. The EMS log is rotated once a week, at the same time that the AutoSupport messages
are sent out.
The tail UNIX command will print the last few lines of a file to the screen. The -f flag causes it to continuously refresh that output as new data is written to that file. Using tail -f for a log file is a great way to watch the logging as it happens. For example, if you run a command in the UI and get an error, you could open up another window to that node, run the tail -f command on the log file that you think may provide information for this error, and then go back to the other window/browser and run the UI command again. This helps to establish the cause-and-effect relationship between a UI command and a log message.
10
Logs live on the mroot. You may access logs by logging into the FreeBSD shell on the system which is running Clustered ONTAP. Logs are located in /mroot/etc/log. You may copy individual logs to another system using the secure copy command 'scp' from the FreeBSD shell.
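As a brief, hedged example from that FreeBSD shell (the log file, destination host, and user are hypothetical):
node4% tail -f /mroot/etc/log/mgwd.log
node4% scp /mroot/etc/log/mgwd.log admin@adminhost:/tmp/node4-mgwd.log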
11
12
Beware that ps and top can be a bit confusing, due to the way the schedulers operate. From FreeBSD, the CPUs will look 100% busy, but that's because they're actually managed by a scheduler other than the normal FreeBSD scheduler.
13
If you run the ps command and don't see processes like vldb, mgwd, or vifmgr, then something is wrong. For example, if the vldb process is not running, you'll want to look at the vldb.log* files to see if there is an indication of what happened.
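A quick, hedged check from the systemshell (the prompt is hypothetical):
node4% ps auxww | egrep 'mgwd|vldb|vifmgr'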
14
A process has to register with the Service Process Manager (SPM) to be managed.
If the process is not running, SPM will restart the process. It will generate an EMS message when a process dies.
If a process has reached its threshold number of restarts, SPM will shift to an interval-based restart and generate an
ASUP. Currently the threshold is 10 restarts in an hour.
Interval-based restarts range from 5 minutes to a maximum of 60 minutes per process. The interval between restarts is twice that of the last value, for example, 5 minutes, then 10 minutes, then 20 minutes, up to 60 minutes. Any further retries will happen once every hour after the first 15 (10+5) retries.
Process manager never gives up managing a process.
15
16
17
18
19
vreport is a tool that scans the vldb and the d-blade for differences and reports them.
The output is generated for differences in aggregates, volumes, junctions and snapmirrors.
Once the differences are generated, vreport provides the option to fix any of the differences. This tool does not have
the authority to change values in the d-blade. Using the fix option on any difference would modify ONLY the vldb to be
consistent with the d-blade.
20
The cluster commands are a quick check as to the health of the cluster. Remember that a 2-node cluster needs to
have 2-node HA enabled. If this step is forgotten, problems will arise, especially during storage failover (SFO).
The cluster ping-cluster command is a great way to make sure that all the cluster ports and cluster logical
interfaces (LIFs) are working properly.
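As a hedged example of those health checks (the node name is hypothetical):
node::> cluster show
node::> cluster ping-cluster -node node1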
21
For the most part, these commands are self-explanatory. Most show commands give you a good picture of what's happening in a particular area of the cluster. Also, most show commands have some powerful query capabilities that, if you take the time to learn them, can greatly pinpoint potential problems.
In the volume show -state !online command, the exclamation point means not (negation). So, this command shows all volumes that do not have a state of online. It's important to use !online rather than offline, because there are other states that you'll want to know about.
22
When the aggregates of one node fail over to the high-availability (HA) partner node, the aggregate that contains the vol0 volume of that node goes too. Each node needs its vol0 to boot, so when the rebooted node begins to boot, the first thing that happens is that it signals the partner to do a giveback of that one aggregate, and then it waits for that to happen. If SFO is working properly, giveback will happen quickly, the node will have its vol0 and be able to boot, and then when it gets far enough in its boot process, the rest of the aggregates will be given back. If there are problems, you'll probably see the rebooted node go into a waiting-for-giveback state. If this happens, it's possible that its aggregates are stuck in a transition state between the two nodes and may not be owned by either node. If this happens, contact NetApp Global Support.
23
24
It is a best practice for the vol0 aggregate to contain only the vol0 volume. The reason for this is that during giveback the vol0 aggregate must be given back to the home node, and must complete that giveback, before any other aggregates can be given back. The more volumes that are on that aggregate, the longer the vol0 giveback will take, and thus the longer the delay before all the other aggregates can be given back. The exception to this best practice would be during an evaluation or proof of concept, where a configuration may only contain one or two disk shelves per node.
25
26
We want to make sure that all the network ports are OK, including the cluster ports. If those are fine, then take a look at the LIFs. Make sure they're working properly, and make note of which ones are home and which ones aren't. Just because they're not home doesn't mean that there is a problem, but it might give you a sense of what's been happening.
To get help for the pcpconfig command, simply type pcpconfig with no parameters (this is a systemshell utility, not a clustershell command).
27
28
WAFLtop allows customers to map the utilization at a higher level to their applications based on the volume or the type
of client/internal protocol, and possibly use this information to identify the source of bottlenecks within their systems.
The command is also useful internally to monitor performance and identify bottlenecks.
The most common use case for WAFLtop can be described as follows:
1. Customer sees some degradation in performance, in terms of throughput or response time, on their system.
2. Customer wishes to determine if there is a volume or a particular application or client protocol which is consuming resources in a way that leads to the degradation.
3. Customer can look at sysstat and other utilities to determine overall system usage of resources.
4. Customer can additionally look at the output of WAFLtop to determine the topmost consumers of various system resources. Based on this information, the customer may be able to determine the cause of the degradation.
29
vmstat -w 5
will print what the system is doing every five seconds; this is a good choice of printing interval since this is how
often some of the statistics are sampled in the system. Others vary every second and running the output for a
while will make it apparent which are recomputed every second.
node4# vmstat -w 5
procs memory page disk faults cpu
r b w avm fre flt re pi po fr sr ad0 in sy cs us sy id
1 1 0 145720 254560 20 0 0 0 18 0 0 463 433 531 0 100 0
0 1 0 145720 254580 46 0 0 0 33 0 0 192 540 522 0 100 0
6 1 0 145720 254560 38 0 0 0 32 0 0 179 552 515 0 100 0
0 0 0 145164 254804 7 0 0 0 13 0 0 174 297 468 0 100 0
0 0 0 145164 254804 0 0 0 0 0 0 0 182 269 512 0 100 0
30
31
32
The rdb_dump utility is a tool that is run from the systemshell. It gives us a cluster-wide view of which RDB units are healthy and which aren't. If any are not healthy, rdb_dump might give a decent picture of which ring member (node) is not healthy. But if the node on which this command is being invoked is the unhealthy one, then it'll just look like everything is bad, which is misleading.
rdb_dump can be run with the -c (continuous) option, which will iteratively run the rdb_dump command for you, allowing you to see if something is going in and out of quorum.
33
This section of output from rdb_dump shows three RDB units (Management, VifMgr, and VLDB) of a 2-node cluster.
From this partial output, we can see that the first two units are healthy. If one or more of the nodes were in an offline
state, it would indicate some issues that are affecting the RDB units, most likely cluster networking issues.
Notice that this concise view does not show you the names of the nodes, although you can tell that the ID 1001 is this local node.
34
This rdb_dump -f output is not quite as concise as rdb_dump, but it shows useful information, including the
correlations between the IDs and the host names.
To see more rdb_dump options, run rdb_dump -help.
35
36
Given the correct configuration, the health information summarizes the status of the replication group.
Health information obtained from the master is always the most accurate. There is a slight delay in the propagation of secondary information to other secondaries, but they will come into agreement.
37
38
39
40
41
42
There are specific category values that can be used. The object parameter can be used to specify a particular
instance, for example, a volume name or aggregate name. There are a number of counter values that can be used.
Notice that some use a hyphen within the string while others use an underscore. Running stat show with no other
parameters will show all possible categories and counters.
You can narrow down your output by being more specific with your query. You cannot narrow things down by virtual
server, as the statistics command is either cluster-wide or node-specific. Because volume names only have to be
unique within a virtual server, if you query on a volume name (using the object parameter), you may see multiple
volumes with the same name, perhaps even on the same node.
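Following the query style shown elsewhere in this module (the node and volume names are hypothetical), the output can be narrowed like this:
node::> statistics show node node1 category volume object vol1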
43
44
This output shows statistics specifically for all NFS categories and only for this node.
45
A typical CIFS log message shows the error code (316), as well as the file and line number in the Data ONTAP 8.0
source code that issued the error message. From the systemshell, there is a tool called printsteamerr that will
translate the error code into a more useful error string. In this example, code 316 gets translated into cifs: access
denied.
46
47
48
Although tcpdump is no longer used to capture traces, it is still used to look at the pktt traces.
49
50
51
52
53
54
55
56
57
58
We generally want to see the same network bandwidth coming in and going out. A good rule to follow is to match the cluster and data network bandwidth; that is, if using four cluster ports, then use four data ports. We have some guidelines on the number of ports to use for maximum performance. This takes into account cached workloads as well. Noncached workloads can probably decrease the port count by one data port per node (as compared to the number of cluster ports).
10
11
Using LS mirrors of vserver root volumes is very important for high-availability access to the other volumes in the
namespace. As such, LS mirroring does not require a separate license, but rather is included in the Base license. A
best practice is to create LS mirrors of the vserver root volume and to situate one on each node, including the node
that contains the vserver root volume itself.
Splitting a volume is a manual process. For example, if a volume has two directories to which many writes are being
sent, and such that this volume has become a hot spot, then that volume can be divided into two volumes. With a new
volume on another node, the contents of one of those directories can be moved (by way of NFS or CIFS commands)
into the new one, and then that new volume can be mounted into the namespace at the same point as the original
directory. The clients would use the same path to write the data, but the writes would go to two separate volumes
rather than one.
12
13
14
As part of benchmarking, it's important to understand the capabilities and limitations of a single client within the context of the benchmark. This data will allow for a better understanding of the results when a group of clients is running the benchmark.
15
The number of nodes in a Data ONTAP cluster has everything to do with scaling performance higher, while having
very little effect in terms of overhead. NetApp refers to this as near-linear performance scalability. This means that a
2-node cluster has about twice the performance of a single node, and a 4-node cluster has about twice the
performance of a 2-node cluster, and so on.
How the data is distributed in the cluster is a big deal. Variables like how the namespace is distributed across nodes
and how the striped member volumes are distributed have a major impact on performance. Some customers do not
take advantage of spreading out volumes (and work) across nodes, and instead configure one large volume per node
(7-mode style). Doing something like this will negate many of the performance benefits of the cluster-mode
architecture.
16
SIO and IOzone are multithreaded benchmarking tools, while dd and mkfile are not. The tools that are not
multithreaded may not be able to accurately simulate a needed environment.
17
The dashboard commands provide quick views of the nodes and the cluster.
18
The following example shows detailed performance-dashboard information for a node named node13:
node::> dashboard performance show -node node13
Node: node13
Average Latency (usec): 624us
CPU Busy: 84%
Total Ops/s: 27275
NFS Ops/s: 27275
CIFS Ops/s: 0
Data Network Utilization: 0%
Data Network Received (MB/s): 0
Data Network Sent (MB/s): 0
Cluster Network Utilization: 0%
Cluster Network Received (MB/s): 0
Cluster Network Sent (MB/s): 0
Storage Read (MB/s): 0
Storage Write (MB/s): 0
CIFS Average Latency: 0us
NFS Average Latency: 624us
By default, the performance dashboard displays the following information about system and cluster performance:
Node name or cluster summary
Average operation latency, in microseconds
Total number of operations
Percentage of data network utilization
Data received on the data network, in MB per second
Data sent on the data network, in MB per second
Percentage of cluster network utilization
Data received on the cluster network, in MB per second
Data sent on the cluster network, in MB per second
Data read from storage, in MB per second
Data written to storage, in MB per second
The command can display a wide range of performance information; see the reference page for the command for
further details.
This performance view can be used in conjunction with statistics show node <node> category <category> to get more detailed statistics.
20
The command can display a wide range of information about storage utilization and trend; see the reference page for
the command for further details.
The following example shows storage utilization trend information for all aggregates
during the past seven days:
node::> dashboard storage show -week
                                     ~1 day       ~2 days      ~3 days      ~7 days
Aggregate       Size     Used Vols   Used  Vols   Used  Vols   Used  Vols   Used  Vols
----------- -------- -------- ---- ------- ---- ------- ---- ------- ---- ------- ----
node1_aggr0  113.5GB  99.91GB    1   620KB    0  1.18MB    0  1.77MB    0  4.36MB    0
node1_aggr2  908.3GB  50.00GB    1     4KB    0    12KB    0    16KB    0    40KB    0
node2_aggr0  113.5GB  99.91GB    1   612KB    0  1.13MB    0  1.68MB    0  4.02MB    0
node3_aggr0  229.1GB  109.9GB    2   648KB    0  1.23MB    0  1.84MB    0  4.34MB    0
node3_aggr1  687.3GB  110.1GB    7    48KB    0    80KB    0   128KB    0   344KB    0
node4_aggr0  229.1GB  99.92GB    1   624KB    0  1.18MB    0  1.74MB    0  4.06MB    0
node4_aggr1  687.3GB  90.08GB    8    56KB    0   108KB    0   164KB    0   436KB    0
7 entries were displayed.
22
Perfstat executes a number of different commands and collects all the data.
There are a number of commands that are familiar to those who are accustomed to Data ONTAP 7G, available at the
nodeshell.
23
24
Under the admin privilege, the statistics show category processor command shows a basic view of the
utilization of each processor of a node.
25
Under the diag privilege, the statistics show node mowho-05 category processor object
processor1 command shows a detailed view of the utilization of processor1 of node mowho-05.
26
The statistics periodic command runs until Ctrl-C is pressed, with each line of output reporting the stats since
the previous line of output (interval). The default interval is one second. When Ctrl-C is pressed, some summary data
is presented.
This output can tell you a lot. If the cluster busy values are near zero, it's a good indication that the user data isn't being sent over the cluster links. The same is true if the cluster recv and cluster sent values are in the KB range. So, if there are ops going on with no data being sent over the cluster network, it shows that data is being served locally, like when a lot of reads are being done to LS mirrors that are on the same nodes as the data LIFs being accessed by the clients. When cluster traffic is happening, the cluster recv and cluster sent values will be in the MB range.
Some other good options to use with this same command are:
statistics periodic category latency node <node>
statistics periodic category volume node <node> interval 1
statistics show node <node> category volume object sio counter *latency (This wildcard shows all the different latency counters.)
27
28
The following example will print what the system is doing every five seconds. Five seconds is a good interval since this
is how often some of the statistics are sampled in the system. Others vary every second and running the output for a
while will make it apparent which are recomputed every second. Check the Web for vmstat usage information.
node4# vmstat -w 5
procs memory page disk faults cpu
r b w avm fre flt re pi po fr sr ad0 in sy cs us sy id
1 1 0 145720 254560 20 0 0 0 18 0 0 463 433 531 0 100 0
0 1 0 145720 254580 46 0 0 0 33 0 0 192 540 522 0 100 0
29
To analyze the statit results, refer to the 7G man pages. The output is the same in 8.0 as it is in 7G.
30
The CIFS server is essentially divided into two major pieces: the CIFS N-blade protocol stack and the Security Services (secd) module.
31
32
33
34
For example, an explicit NFS license is required (it was not previously with GX). Mirroring requires a new license.
From any node in the cluster, you can see the images from all other nodes. Of the two images on each node, the one that has an
Is Current value of true is the one that is currently booted. The other image can be booted at any time, provided that the
release of Data ONTAP 8.0 on that image is compatible with that of its high-availability (HA) partner and the rest of the cluster.
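A quick way to confirm which image is booted and which will be used for the next boot is sketched below; the command path (system image show) is an assumption here and may differ slightly between releases:

node::> system image show

The output lists both images on every node along with their Data ONTAP versions and their Is Default and Is Current values, which is the information needed before deciding to boot the alternate image.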
10
In the process of managing a cluster, it will probably be necessary to scale out at some point. Clusters provide a number of ways to do this, many of which you've seen and performed already. This is a recap of some of the ways that a cluster can scale.
11
12
If it's determined that an aggregate needs to be re-created, or if the disks are needed for another purpose (for example, to grow a different aggregate), you may need to delete an aggregate. The volumes must be removed from the aggregate first, which can be accomplished with volume move if you don't want to delete them.
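A minimal sketch of that sequence, assuming a volume vol1 on vserver vs1 that should be preserved and aggregates aggr_old and aggr_new (all names are placeholders; exact parameter spellings may vary by release):

node::> volume move start -vserver vs1 -volume vol1 -destination-aggregate aggr_new
node::> storage aggregate delete -aggregate aggr_old

The aggregate delete will refuse to run until every volume has been moved off or deleted, which is a useful safety check in itself.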
13
Event messages beginning with callhome.* are a good collection to configure for initial monitoring. As the customer becomes more familiar with the system, individual messages can be added or removed as required. The callhome.* event names are the same events that trigger an AutoSupport.
Events that begin with callhome.* are configured with the internal "asup" destination, which cannot be removed. Use the advanced "-add-destinations" option to add an additional destination to these events.
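As a hedged sketch of that step (admin_email is a placeholder destination name, and the exact spelling of the message-name parameter varies between releases):

node::*> event route add-destinations -messagename callhome.* -destinations admin_email

The asup destination remains attached to these events; the command above only adds a second destination alongside it.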
14
15
16
There is only one event configuration. The named event destination(s) need to be created or modified appropriately, for example, to indicate to which e-mail address certain event notifications should be sent. Event routes are associations between predefined event messages and event destinations. You can enable notification of a message to a destination by modifying that message's destination value. This can also be done for many messages at once by using a regular expression when specifying the event name in the event route modify command.
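For example, creating an e-mail destination and then pointing a family of events at it could look like the following sketch (the destination name, address, and event-name pattern are placeholders, and parameter spellings may differ by release):

node::> event destination create -name storage-admins -mail admin@example.com
node::> event route modify -messagename raid* -destinations storage-admins

Because event route modify accepts a pattern for the event name, one command can update the destination value of a whole group of related messages.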
17
Event destinations can also be created for SNMP and syslog hosts. The SNMP-capable events can be obtained via the "event route show -snmp true" command.
18
Event routes have nothing to do with network routes but are merely associations between event messages and
destinations.
19
AutoSupport is NetApp's 'phone home' mechanism that allows our products to do automated configuration, status, and error reporting. This data is then used in a variety of critical ways:
It provides a wealth of data that can be mined for real-world issues and usage. This is especially valuable for product management and engineering for product planning and for resolving case escalations. We also use this as a source for product quality and usage metrics.
It provides current status and configuration information for NGS (and customers), who use this information for case resolution, system health check and audit reporting, system upgrade planning, customer system inventory reporting, and many other creative uses.
AutoSupport is now an M-Host (user space) process called notifyd. It collects information from the D-Blade, from the Management Gateway (mgwd), from BSD commands, and from files.
20
21
22
Users can be created to provide different access methods (ssh, http, console), authentication mechanisms (password, publickey), and capabilities via profiles (admin, readonly, none, or user-defined).
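A sketch of creating one such user; the user name is a placeholder, and whether the capability parameter is spelled -role or -profile depends on the release (shown with -role here as an assumption):

node::> security login create -username monitor1 -application ssh -authmethod password -role readonly

A separate entry is needed for each application the user should be allowed to use, so the same user name can appear several times with different access methods.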
23
24
25
The statistics periodic command gives a good summary of operations within the cluster. It prints a line for
every given interval of time (the default is once a second) so that you can see real-time statistics.
26
The dashboard commands are meant to give summaries of what's going on in the cluster. In particular, the dashboard performance show command gives a quick view of the nodes in this cluster.
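For instance, a first look at node load can be as simple as the bare invocation below (any node- or interval-narrowing parameters are release-dependent and not shown here):

node::> dashboard performance show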
27
28
ClusterView is a Web-based tool that graphically displays performance, usage, and health information from a Data ONTAP cluster. ClusterView is implemented as a set of Adobe Flash Web pages that are served up from any node in the cluster. The user points a Web browser to one particular node, which is referred to as the "serving node." Dynamic content is constructed using performance, health, and resource utilization data that ClusterView periodically fetches from the serving node. The serving node constructs this data by querying other nodes in the cluster as appropriate.
29
30
This is a graphical representation of the space utilization: aggregates on the left and volumes on the right.
31
32
33
34
Complete manageability of Data ONTAP cluster systems will be provided by a combination of products. The scope of each product is as follows:
Operations Manager: discovery, monitoring, reporting, alerting, File SRM, quota management
Provisioning Manager: policy-based provisioning of storage on cluster systems
Protection Manager: policy-based data protection and disaster recovery of cluster systems
Performance Advisor: performance monitoring and alerting for cluster systems
35
36
37
38
39
40
41
Upgrades can be staged by leaving the old image as the default, so that a reboot will not bring up the upgraded image.
Rolling upgrades of an HA pair are faster than parallel reboots.
Multiple nodes (only one per HA pair) can be rebooted in parallel, but be aware of quorum rules, which demand that fewer than half of the nodes in a cluster be down/rebooting at any given time. Also, be aware of the LIF failover rules to ensure that the data LIFs are not all failing over to nodes that are also being rebooted.
Certain management operations (for example, NIS lookup) happen over the management network port, which can be a
single point of failure. The cluster management LIF can take advantage of LIF failover functionality.
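When it is time to activate the new image on one node of an HA pair, the sequence might look like the sketch below; the node and image names are placeholders, and the command paths (system image modify, system node reboot) are assumptions that may differ slightly by release:

node::> system image modify -node node1 -image image2 -isdefault true
node::> system node reboot -node node1

Repeat one node at a time (never both members of an HA pair at once), keeping fewer than half of the cluster's nodes down at any given moment, per the quorum rules above.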
10
11
12
13
14
15
16
Snapshot copies should be turned on for striped volumes, even if it's only to keep one Snapshot copy per volume. Snapshot copies are used as part of the overall consistency story for striped volumes.
17
18
19
20
21
22