Storage
Types of Storage
There are many types of storage media:
• Flash memory, which has become a cheap form of fast storage, especially in consumer products.
• Optical storage, which comes in the form of CDs, DVDs, Blu-ray discs etc. These are slow for data access, but still very useful for archives and movies.
• Magnetic tape backup systems, which are still in use in corporate IT centres, but are slow and not suited to random access. Random access refers to the ability to (effectively) access any piece of data by its address (e.g. block number on a hard disk – see below) instantly.
• Magnetic or hard disks, which are discussed at length below, are a ubiquitous form of
high-volume, high-speed, random-access storage.
• A solid-state drive (SSD) is a data storage device that uses solid-state memory to
store persistent data. Unlike flash-based memory cards, an SSD emulates a hard disk
drive, thus easily replacing it in most applications. An SSD using SRAM or DRAM
(instead of flash memory) is often called a RAM drive. The advantage over (magnetic) disk drives is speed, but the cost per gigabyte is 4 to 5 times that of disk drives, and, at the moment, the capacity per unit is much lower.
These types of storage can be either fixed or removable, thanks to the ubiquity of USB and FireWire. In this paper we’ll be talking exclusively about magnetic disk (as in hard-drive) storage.
Hard Disk
The disk drive is covered and hermetically sealed, since heads are designed to float
above the disk platter with less than a micron of space.
Each surface of the disk platter is divided into areas as shown. In the context of data, the word ‘block’ can have many different meanings, but in this context a block is the smallest unit of data that is read and written. A block is also called (more formally) a ‘track sector’ or simply ‘sector’, as shown by (C). A series of sectors makes up a track (A), and there are several tracks on a surface. At the lowest level, when a block (or segment) is being read or written, there are several things that identify where that block goes: the surface (identified by the head number), the track that the head should move to, and the sector that should be read or written. This lowest level of data storage is called ‘block-level’ storage and implies that the data is composed of a series of bits, with the drive having no notion of format, or of what the data belongs to. In the operating system (OS), there will be device drivers, file systems and applications that impose the meaning on, and keep tabs on, the individual blocks of data. This higher level of data storage is called ‘file-level’ storage.
Seek time is one of the three delays associated with reading or writing data on a disk drive. The others are the rotational delay of the disk, and the transfer time. Their sum is the access time. In order to read or write data in a particular sector, the head of the disk needs to be physically moved to the correct place. This process is known as seeking, and the time it takes for the head to move to the right place is the seek time. Seek time for a given disk varies depending on how far the head's destination is from its origin at the time of each read or write; usually one discusses a disk's average seek time.
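To make these three delays concrete, here is a minimal Python sketch that estimates an average access time. The RPM, average seek time, transfer rate and block size are assumed example figures, not the specification of any particular drive.

```python
# Illustrative estimate of average disk access time:
# access time = average seek + average rotational delay + transfer time.

def average_access_time_ms(rpm, avg_seek_ms, transfer_mb_s, block_kb=4):
    # Average rotational delay is the time for half a revolution.
    rotational_delay_ms = (60_000 / rpm) / 2
    # Time to transfer one block at the sustained transfer rate.
    transfer_ms = (block_kb / 1024) / transfer_mb_s * 1000
    return avg_seek_ms + rotational_delay_ms + transfer_ms

# Example: a 10,000 RPM drive, 4.5 ms average seek, 150 MB/s transfer rate.
print(round(average_access_time_ms(10_000, 4.5, 150), 2))   # ~7.53 ms
```

Note that the rotational delay dominates once the seek is short, which is why higher spindle speeds (10,000 or 15,000 RPM) matter so much for random access.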
For disk drives that are to be installed internally to a computer (such as a server), the
interface from the disk will be cabled to a Host Controller Card (or simply controller).
Depending on the type of controller, this can usually accommodate multiple disk drives.
You may also be able to plug in external devices to an external-facing interface. The
controller will usually plug straight into a slot on the computer motherboard and draw its
power from there. This will also let the CPU talk to the host adapter and disks through
the system “bus”. Sometimes you may find the controller actually integrated into the
computer’s motherboard.
SCSI disks usually spin at 10,000 or 15,000 RPM. Because of this, and the more
complicated electronics, SCSI components are much more expensive than S/ATA.
However, SCSI disks are renowned for their speed of access, and data transfer.
Because disk drives are sophisticated mechanical devices, when they fail they tend to take all the data with them. RAID (Redundant Array of Independent Disks) defines several types of redundancy and efficiency enhancements built by clustering commonly available disks. For example:
• RAID 0: Striped set, no parity. Striping is where each successive block of information is written to alternate disks in the array. RAID 0 offers no protection – a single disk failure in the array loses the whole set – but it is often used to get the increased read speed. The increase in read speed comes from being able to simultaneously move the disk read/write heads of the different drives containing the sequential blocks to be read. Write speeds may also improve, since sequential blocks can be written at the same time to the different disks in the array.
• RAID 1: Mirroring, no parity. Mirroring is where each block is duplicated across all
disks in the array. Here, any one disk failure will not impact data integrity. Better read
speeds are achieved by using the drive whose read/write head is closest to the track
containing the block to be read. There is generally no improvement in write speeds.
• RAID 5: Striped set with distributed parity. The advantage here is that the data from one drive can be rebuilt using the parity information contained on the other drives (see the XOR sketch after this list). RAID 5 can tolerate the failure of only one drive at a time.
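As a rough illustration of the distributed-parity idea mentioned for RAID 5, the following Python sketch stores the XOR of a stripe's data blocks as parity and rebuilds a lost block from the surviving blocks. The block contents and the three-disk layout are made up for the example.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

data = [b"AAAA", b"BBBB", b"CCCC"]        # a stripe across three data disks
parity = xor_blocks(data)                 # parity block (rotates between disks in real RAID 5)

# Simulate losing the second disk and rebuilding its block from the survivors.
rebuilt = xor_blocks([data[0], data[2], parity])
assert rebuilt == data[1]
print(rebuilt)                            # b'BBBB'
```

The same XOR property explains why only one drive may fail: with two blocks missing from a stripe there is no longer enough information to solve for either of them.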
These are the most common RAID levels, but there are other RAID levels, and indeed combinations of levels, that can be configured (http://en.wikipedia.org/wiki/RAID). RAID can be implemented in the host controller, or built into the operating system. Either way, with RAID we are beginning to see an abstraction of a physical disk into a logical one. For example, with RAID 1, if we decided to use two identical 100 GB disks for mirroring, this would ultimately end up as a 100 GB (not 200 GB) logical disk to the OS. So:
• Traditionally, disk drives are called either Physical Volumes (PV) or Logical Volumes (LV), depending on where in the infrastructure you are talking about them.
• A PV can be split up into partitions, where each partition can also look, to the
operating system, like an individual PV.
• A LUN (logical unit number) comes from the SCSI protocol, but more generally refers
to an LV in storage terminology.
• On some systems, Physical Volumes can be pooled into Volume Groups (VG), from
which Logical Volumes can be created. In this case a Logical Volume may stretch
across many different sizes and types of Physical Disks, and take advantage of RAID.
In a Linux system, this software management of disk storage is called the Logical
Volume Manager (LVM).
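To make the PV/VG/LV relationship concrete, here is a toy Python model of extent-based pooling. The 4 MB extent size mirrors a common LVM default, but the class, the volume names and the sizes are invented purely for illustration.

```python
EXTENT_MB = 4   # mirrors a common LVM default physical-extent size

class VolumeGroup:
    """Toy model: physical volumes pool their extents; LVs are carved from the pool."""

    def __init__(self, physical_volume_sizes_mb):
        self.free_extents = sum(size // EXTENT_MB for size in physical_volume_sizes_mb)
        self.logical_volumes = {}

    def create_lv(self, name, size_mb):
        needed = -(-size_mb // EXTENT_MB)            # round up to whole extents
        if needed > self.free_extents:
            raise ValueError("not enough free space in the volume group")
        self.free_extents -= needed
        self.logical_volumes[name] = needed * EXTENT_MB

# Two physical disks pooled into one group, then one LV spanning both.
vg = VolumeGroup([100_000, 200_000])                 # 100 GB + 200 GB PVs (sizes in MB)
vg.create_lv("data", 250_000)                        # a 250 GB LV across both PVs
print(vg.free_extents * EXTENT_MB, "MB free")        # 50000 MB free
```

The point of the model is only that a logical volume is allocated from the pooled extents, not tied to any one physical disk.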
It didn’t take long to see the appearance of disk cabinets – or Disk Arrays, connected to
servers via an external SCSI cable, that were separately managed. In some cases
these storage cabinets could be connected to multiple servers, so they could share the
storage (perhaps for fault tolerance). Also, being able to ‘hot swap’ failed disks and have the unit rebuild that disk from parity on the other disks was an expected feature. This led to the acronym DAS, or Directly Attached Storage (actually the acronym was coined in more recent times to distinguish it from other technologies). The main technology used with DAS is SCSI, with specialized Host Bus Adapters (HBAs) installed in the servers. (More on HBAs later.) A DAS afforded multiple-server access (up to four servers) for clustering, but the main disadvantage was that DAS ended up yielding an island of information.
Storage Networking Protocols
Since the length of a SCSI cable is very limited, there came a need for low-level access
to storage over networks. In effect, the equivalent of stretching the permissible distance
of the SCSI cable to much larger distances. This led to advances in Storage Networking Protocols. These protocols carry the same block-level SCSI commands that go over the interface cables of a disk, and have no knowledge of how clusters of blocks are aggregated (or used) by the OS to give us a file system. This gives us a network of
disk appliances, where each appliance is a fault-tolerant disk array with its own
management interface. The two predominant networking protocols used for Storage
Networks are the Fibre Channel Protocol (FCP) and iSCSI (over Gigabit Ethernet). In
these cases, both the Fibre Channel and the Gigabit Ethernet infrastructures are used
to carry SCSI commands over the network. iSCSI uses TCP/IP, whereas FCP has its
own 5-layer stack definition.
Often, in the physical implementation, port connections are made through a Gigabit
Interface Converter (GBIC). A GBIC is a standard for transceivers, commonly used with
Gigabit Ethernet and Fibre Channel (explained below). By offering a standard, hot
swappable electrical interface, a one gigabit Ethernet port, for example, can support a
wide range of physical media, from copper to long-wave single-mode optical fibre, at
lengths of hundreds of kilometres.
Fibre Channel
Fibre Channel was originally designed to support fiber optic cabling only. When copper
support was added, the Fibre Channel committee decided to keep the name in principle,
but to use the UK English spelling (Fibre) when referring to the standard. Fibre Channel
can use either optical fiber (for distance) or copper cable links (for short distance at low
cost). However, fiber-optic cables enjoy a major advantage in noise immunity.
Fibre Channel, or FC, is a gigabit-speed network technology primarily used for storage
networking, using fibre optics. There are three topologies that can be used:
• Point-to-Point (FC-P2P). Two devices are connected back to back. This is the
simplest topology, with limited connectivity.
• Arbitrated loop (FC-AL). In this design, all devices are in a loop or ring, similar to
token ring networking. Adding or removing a device from the loop causes all activity on
the loop to be interrupted. The failure of one device causes a break in the ring. Fibre
Channel hubs exist to connect multiple devices together and may bypass failed ports. A
loop may also be made by cabling each port to the next in a ring. A minimal loop
containing only two ports, while appearing to be similar to FC-P2P, differs considerably
in terms of the protocol.
• Switched fabric (FC-SW). All devices or loops of devices are connected to Fibre
Channel switches, similar conceptually to modern Ethernet implementations. The
switches manage the state of the fabric, providing optimized interconnections.
FC-SW is the most flexible topology, enabling all servers and storage devices to
communicate with each other. It also provides for failover architecture if a server or disk
array fails. FC-SW involves one or more intelligent switches, each providing multiple
ports for nodes. Unlike FC-AL, FC-SW bandwidth is fully scalable, i.e. there can be any
number of 8Gbps (Gigabits per second) transfers operating simultaneously through the
switch. In fact, if using full-duplex, each connection between a node and a switch port
can use 16Gbps bandwidth. Because switches can be cascaded and interwoven, the
resultant connection cloud has been called the fabric.
Fibre Channel Host Bus Adapters
Fibre Channel HBAs are available for all major open systems, computer architectures,
and buses. Some are OS dependent. Each HBA has a unique Worldwide Name (WWN,
or WWID for Worldwide Identifier), which is similar to an Ethernet MAC address in that it
uses an Organizationally Unique Identifier (OUI) assigned by the IEEE. However,
WWNs are longer (8 bytes). There are two types of WWNs on an HBA: a node WWN
(WWNN), which is shared by all ports on a host bus adapter, and a port WWN (WWPN),
which is unique to each port. Some Fibre Channel HBA manufacturers are Emulex, LSI,
QLogic and ATTO Technology.
Fibre Ports
The basic building block of the Fibre Channel is the port:
N_Port: This is a node port that is not loop capable. It is used to connect an equipment
port to the fabric.
NL_Port: This is a node port that is loop capable. It is used to connect an equipment
port to the fabric in a loop configuration through an L_Port or FL_Port.
FL_Port: This is a fabric port that is loop capable. It is used to connect an NL_Port to
the switch in a public loop configuration.
L_Port: This is a loop-capable node or switch port.
E_Port: This is an expansion port. A port is designated an E_Port when it is used as an
inter-switch expansion port (ISL) to connect to the E_Port of another switch, to enlarge
the switch fabric.
F_Port: This is a fabric port that is not loop capable. It is used to connect an N_Port
point-point to a switch.
G_Port: This is a generic port that can operate as either an E_Port or an F_Port. A port
is defined as a G_Port after it is connected but has not received a response to loop
initialization or has not yet completed the link initialization procedure with the adjacent
Fibre Channel device.
U_Port: This is a universal port—a more generic switch port than a G_Port. It can
operate as either an E_Port, F_Port, or FL_Port. A port is defined as a U_Port when it is
not connected or has not yet assumed a specific function in the fabric.
MTx_Port: CNT port used as a mirror for viewing the transmit stream of the port to be
diagnosed.
MRx_Port: CNT port used as a mirror for viewing the receive stream of the port to be
diagnosed.
SD_Port: Cisco SPAN port used for mirroring another port for diagnostic purposes.
iSCSI
iSCSI over Gigabit Ethernet
Ethernet has evolved into the most widely implemented physical and link layer protocol
today. Fast Ethernet increased speed from 10 to 100 megabits per second (Mbit/s).
Gigabit Ethernet was the next iteration, increasing the speed to 1000 Mbit/s. In the
marketplace, full-duplex with switches is the norm. The physical layer
standards for Gigabit Ethernet cover three kinds of media:
Optical fiber (1000BASE-X)
Twisted pair cable (1000BASE-T)
Balanced copper cable (1000BASE-CX)
iSCSI (RFC3720) is a mapping of the regular SCSI protocol over TCP/IP, more
commonly over Gigabit Ethernet. Unlike Fibre Channel, which requires special-purpose
cabling, iSCSI can be run over long distances using an existing network infrastructure.
TCP/IP uses a client/server model, but iSCSI uses the terms initiator (for the client that consumes the data) and target (for the storage resource that presents the LUNs).
• A software initiator uses code to implement iSCSI, typically as a device driver.
• A hardware initiator mitigates the overhead of iSCSI, TCP processing and Ethernet
interrupts, and therefore may improve the performance of servers that use iSCSI. An
iSCSI host bus adapter (HBA) implements a hardware initiator and is typically packaged
as a combination of a Gigabit Ethernet NIC, some kind of TCP/IP offload technology
(TOE) and a SCSI bus adapter (controller), which is how it appears to the operating
system.
The initiator presents both its iSCSI Initiator Name and the iSCSI Target Name to which
it wishes to connect in the first login request of a new session. The only exception is if a
discovery session is to be established; the iSCSI Initiator Name is still required, but the
iSCSI Target Name may be omitted. The default name "iSCSI" is reserved and is not
used as an individual initiator or target name. iSCSI Names do not require special
handling within the iSCSI layer; they are opaque and case-sensitive for purposes of
comparison. iSCSI provides three name-formats:
iSCSI Qualified Name (IQN), format: iqn.yyyy-mm.{reversed domain name}[:{optional identifier}]
o iqn.2001-04.com.acme:storage.tape.sys1.xyz
o iqn.1998-03.com.disk-vendor.diskarrays.sn.45678
o iqn.2000-01.com.gateways.yourtargets.24
o iqn.1987-06.com.os-vendor.plan9.cdrom.12345
o iqn.2001-03.com.service-provider.users.customer235.host90
Extended Unique Identifier (EUI), format: eui.{EUI-64 bit address}
o eui.02004567A425678D
T11 Network Address Authority (NAA), format: naa.{NAA 64 or 128 bit identifier}
o naa.52004567BA64678D
IQN format addresses occur most commonly, and are qualified by a date (yyyy-mm)
because domain names can expire or be acquired by another entity. iSCSI nodes (i.e. the machines that contain the LUN targets) also have addresses. An iSCSI address specifies a single path to an iSCSI node and has the format <domain-name>[:<port>], where <domain-name> can be either an IP address in dotted-decimal notation or a Fully Qualified Domain Name (FQDN, or host name). If the <port> is not specified, the default port 3260 is assumed.
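As a small illustration of the addressing rules above, the following Python sketch parses the <domain-name>[:<port>] form and applies the default port 3260 when none is given; the helper name is my own, not part of any iSCSI library.

```python
# Parse an iSCSI node address of the form <domain-name>[:<port>],
# applying the default iSCSI port (3260) when none is given.

DEFAULT_ISCSI_PORT = 3260

def parse_iscsi_address(address):
    host, sep, port = address.rpartition(":")
    if not sep:                          # no colon: just a host name or IP address
        return address, DEFAULT_ISCSI_PORT
    return host, int(port)

print(parse_iscsi_address("storage.example.com"))     # ('storage.example.com', 3260)
print(parse_iscsi_address("192.168.10.20:3261"))      # ('192.168.10.20', 3261)
```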
iSCSI Security
To ensure that only valid initiators connect to storage arrays, administrators most
commonly run iSCSI only over logically-isolated backchannel networks.
For authentication, iSCSI initiators and targets prove their identity to each other using
the CHAP protocol, which includes a mechanism to prevent cleartext passwords from
appearing on the wire. Additionally, as with all IP-based protocols, IPsec can operate at
the network layer. Though the iSCSI negotiation protocol is designed to accommodate
other authentication schemes, interoperability issues limit their deployment. An initiator
authenticates not to the storage array, but to the specific storage asset (target) it intends
to use. For authorization, iSCSI deployments require strategies to prevent unrelated
initiators from accessing storage resources. Typically, iSCSI storage arrays explicitly
map initiators to specific target LUNs.
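To show roughly how the CHAP exchange avoids sending the secret in cleartext, here is a Python sketch of the CHAP one-way hash (RFC 1994) that iSCSI reuses; the shared secret and challenge values are made-up examples.

```python
import hashlib
import os

def chap_response(identifier: int, secret: bytes, challenge: bytes) -> bytes:
    # RFC 1994: response = MD5(identifier || shared secret || challenge)
    return hashlib.md5(bytes([identifier]) + secret + challenge).digest()

secret = b"example-shared-secret"       # made-up value, configured on both ends
challenge = os.urandom(16)              # random challenge sent by the target
identifier = 1

# The initiator computes the response; the target recomputes it and compares.
response = chap_response(identifier, secret, challenge)
assert response == chap_response(identifier, secret, challenge)
```

Only the hash of the secret (combined with a fresh challenge) crosses the wire, so a passive listener does not see the password itself.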
iSCSI Zoning
Though there really isn’t a zoning protocol associated with iSCSI, VLANs can be
leveraged to accomplish the segregation needed. http://en.wikipedia.org/wiki/Vlan
Storage Architectures
DAS, NAS & SAN
The emergence of these Storage Networking Protocols has led to the development of
different types of storage architectures, depending on the needs:
• We talked earlier about the Directly Attached Storage (DAS).
• Network Attached Storage (NAS): First conceived by Novell, but more commonly associated with MS LAN Manager (CIFS) and with NFS (predominant in the UNIX/Linux worlds), all of which serve up file shares. These days it’s more common to see a NAS appliance, which is essentially a self-contained computer connected to a network, with the sole purpose of supplying file-based data storage services to other devices on the network. Due to its multiprotocol nature, and the reduced CPU and OS layer, a NAS appliance – as such – has its limitations compared to the FC/GbE systems. This is known as file-level storage.
• Storage Area Network (SAN) is an architecture to attach remote storage devices (such
as disk arrays, tape libraries and optical jukeboxes) to servers in such a way that, to the
OS, the devices appear as locally attached. That is, to the OS the storage behaves as if it were attached by an interface cable to a locally installed host adapter. This is known as
block-level storage.
Interestingly, Auspex Systems was one of the first to develop a dedicated NFS
appliance for use in the UNIX market. A group of Auspex engineers split away in the
early 1990s to create the integrated NetApp filer, which supported both CIFS for
Windows and NFS for UNIX, and had superior scalability and ease of deployment. This
started the market for proprietary NAS devices.
Hybrid
What if the NAS uses the SAN for storage? A NAS head refers to a NAS which does not
have any on-board storage, but instead connects to a SAN. In effect, it acts as a
translator between the file-level NAS protocols (NFS, CIFS, etc.) and the block-level
SAN protocols (Fibre Channel Protocol, iSCSI). Thus it can combine the advantages of
both technologies.
Tiered storage
Tiered storage is a data storage environment consisting of two or more kinds of storage
delineated by differences in at least one of these four attributes: Price, performance,
capacity and function. In mature implementations, the storage architecture is split into
different tiers. Each tier differs in the:
Type of hardware used
Performance of the hardware
Scale factor of that tier (amount of storage available)
Availability of the tier and policies at that tier
A very common model is to have a primary tier of expensive, high-performance and limited storage. Secondary tiers typically comprise less expensive storage media and disks, and either host data migrated (or staged) from the primary tier by Lifecycle Management software, or host data saved directly to the secondary tier by application servers and workstations whose needs did not warrant primary-tier access. Both tiers are typically serviced by a backup tier, where data is copied into long-term and offsite storage. In this context, you may hear two terms:
• ILM – Information Lifecycle Management refers to a wide-ranging set of strategies for
administering storage systems on computing devices.
• HSM – Hierarchical Storage Management is a data storage technique which
automatically moves data between high-cost and low-cost storage media. HSM systems
exist because high-speed storage devices, such as hard disk drive arrays, are more
expensive (per byte stored) than slower devices, such as optical discs and magnetic
tape drives. http://en.wikipedia.org/wiki/Hierarchical_storage_management
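As a toy illustration of an HSM-style policy, the sketch below selects files on a primary tier that have not been accessed for a given number of days as candidates for migration; the 90-day threshold, the directory path and the migration helper are all invented for the example.

```python
import os
import time

def migration_candidates(primary_tier_path, max_idle_days=90):
    """Yield files on the primary tier not accessed for max_idle_days."""
    cutoff = time.time() - max_idle_days * 86_400
    for root, _dirs, files in os.walk(primary_tier_path):
        for name in files:
            path = os.path.join(root, name)
            if os.path.getatime(path) < cutoff:
                yield path

# for path in migration_candidates("/primary"):
#     migrate_to_secondary_tier(path)   # hypothetical migration step
```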
There are different kinds of multipathing software (managing multiple physical paths between a server and its storage) available from different vendors.
Storage Replication
Depending on the details behind how the particular replication works, the application
layer may or may not be involved. If blocks are replicated without the knowledge of file
systems or applications built on top of the blocks being replicated, when recovering
using these blocks, the file system may be in an inconsistent state.
• A “Restartable” recovery implies that the application layer has full knowledge of the
replication, and so the replicated blocks that represent the applications are in a
consistent state. This means that the application layer (and possibly OS) had a chance
to ‘quiesce’ before the replication cycle.
• A “Recoverable” recovery implies that some extra work needs to be done to the
replicated data before it can be useful in a recovery situation.
Snapshots
Even though snapshots were talked about in the context of replication, snapshots have their uses on the local systems as well. Typically a snapshot is not a copy, since that would take too long; rather, it is a freezing of all the blocks in a LUN, making them read-only at that point in time. Any logical block that needs to be updated is allocated a new physical block, thus preserving the original snapshot blocks as a backup. Any new blocks are what take up new space, and are allocated for the writes after the snapshot took place. Allocating space in this manner can take substantially less space than taking a whole copy. Deleting a snapshot can be done in the background, essentially freeing any blocks that have been updated since the snapshot.
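The copy-on-write behaviour just described can be sketched in a few lines of Python; the block-map representation, block size and class name are invented for illustration.

```python
# Minimal copy-on-write snapshot sketch: a snapshot freezes the current
# logical-to-physical block map, and later writes allocate new physical
# blocks instead of overwriting the ones the snapshot still references.

class CowVolume:
    def __init__(self, num_blocks):
        self.physical = {i: b"\x00" * 512 for i in range(num_blocks)}   # block store
        self.block_map = {i: i for i in range(num_blocks)}              # logical -> physical
        self.next_physical = num_blocks
        self.snapshots = []

    def snapshot(self):
        self.snapshots.append(dict(self.block_map))    # freeze the current mapping

    def write(self, logical_block, data):
        # Allocate a fresh physical block so snapshot-referenced blocks stay intact.
        self.physical[self.next_physical] = data
        self.block_map[logical_block] = self.next_physical
        self.next_physical += 1

vol = CowVolume(num_blocks=4)
vol.snapshot()
vol.write(2, b"new data".ljust(512, b"\x00"))
# The snapshot still maps logical block 2 to its original, untouched contents.
assert vol.physical[vol.snapshots[0][2]] == b"\x00" * 512
```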
Snapshotting can be implemented in the management tools of the storage array, or built
into the OS. As with RAID, the advantage of building this functionality at the block-level
is that it can be abstracted from the file systems that are built on top of the blocks. Being
at this low level also has a drawback, in that when the snapshot is taken, the file
systems (and hence applications) may not be in a consistent state. There is usually a
need to quiesce the running machine (virtual or otherwise) before a snapshot is made.
This implies that all levels (up to the application) should be aware that they reside on a
snapshot-capable system.
Terminology
Thin Provisioning & Over-Allocation
[Thin provisioning is called sparse volumes in some contexts] In a storage consolidation
environment, where many applications are sharing access to the same storage array,
thin provisioning allows administrators to maintain a single free space buffer pool to
service the data growth requirements of all applications. This avoids the poor utilization
rates, often as low as 10%, that occur on traditional storage arrays where large pools of
storage capacity are allocated to individual applications, but remain unused (i.e. not
written to). This traditional model is often called fat provisioning. On the other hand,
over-allocation or over-subscription is a mechanism that allows server applications to be
allocated more storage capacity than has been physically reserved on the storage array
itself. This allows flexibility in growth and shrinkage of application storage volumes,
without having to predict accurately how much a volume will grow or contract. Physical
storage capacity on the array is only dedicated when data is actually written by the
application, not when the storage volume is initially allocated.
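A toy Python sketch of this behaviour: the pool advertises (and over-subscribes) virtual capacity, but physical blocks are only consumed when data is written. The pool size, volume size and block granularity are illustrative values.

```python
# Toy thin-provisioning sketch: volumes advertise their full (virtual) size,
# but blocks are only drawn from the shared physical pool when written.

class ThinPool:
    def __init__(self, physical_blocks):
        self.free_blocks = physical_blocks
        self.allocated = {}                      # (volume name, logical block) -> used

    def provision(self, volume, virtual_size_blocks):
        # Nothing is reserved here: the virtual size can exceed physical space.
        return {"name": volume, "virtual_blocks": virtual_size_blocks, "pool": self}

    def write(self, volume, logical_block):
        key = (volume["name"], logical_block)
        if key not in self.allocated:
            if self.free_blocks == 0:
                raise RuntimeError("thin pool exhausted")   # the over-allocation risk
            self.free_blocks -= 1
            self.allocated[key] = True

pool = ThinPool(physical_blocks=1000)
vol_a = pool.provision("a", 5000)     # over-subscribed: 5000 virtual vs 1000 physical blocks
pool.write(vol_a, 0)
print(pool.free_blocks)               # 999: only written blocks consume space
```

The exhaustion error in the sketch is exactly the operational risk of over-subscription: the administrator must watch the shared free pool, not the per-volume sizes.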
LUN Masking
Logical Unit Number Masking or LUN masking is an authorization process that makes a
Logical Unit Number available to some hosts and unavailable to other hosts. The
security benefits are limited in that with many HBAs it is possible to forge source
addresses (WWNs/MACs/IPs). However, it is mainly implemented not as a security
measure per se, but rather as protection against misbehaving servers from corrupting
disks belonging to other servers. For example, Windows servers attached to a SAN will
under some conditions corrupt non-Windows (Unix, Linux, NetWare) volumes on the
SAN by attempting to write Windows volume labels to them. By hiding the other LUNs
from the Windows server, this can be prevented, since the Windows server does not
even realize the other LUNs exist. (http://en.wikipedia.org/wiki/LUN_masking)
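Conceptually, LUN masking is little more than a lookup table from initiator identity to visible LUNs, as in this Python sketch; the WWPNs and LUN numbers are made-up examples.

```python
# Toy LUN-masking table: each initiator WWPN only "sees" the LUNs it is
# explicitly mapped to.

masking_table = {
    "10:00:00:00:c9:aa:bb:01": {0, 1},     # Windows server: its own LUNs only
    "10:00:00:00:c9:aa:bb:02": {2, 3, 4},  # Linux server: a different set
}

def visible_luns(initiator_wwpn):
    return masking_table.get(initiator_wwpn, set())   # unknown initiators see nothing

print(visible_luns("10:00:00:00:c9:aa:bb:01"))   # {0, 1}
print(visible_luns("10:00:00:00:c9:aa:bb:99"))   # set()
```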
Data De-duplication
This is an advanced form of data compression. Data de-duplication software, offered as a standalone appliance or as a feature in another storage product, provides file-, block-, or sub-block-level elimination of duplicate data by storing pointers to a single copy of the data item. This concept is sometimes referred to as data redundancy elimination or single-instance storage. The effects of de-duplication primarily involve the improved cost structure of disk-based solutions. As a result, businesses may be able to use disks for more of their backup operations, and to retain data on disk for longer periods of time, enabling restoration from disk.
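A minimal single-instance store can illustrate the pointer-to-a-single-copy idea; the fixed 4 KB chunk size and the use of SHA-256 hashes are illustrative choices, not a description of any particular product.

```python
# Minimal single-instance ("de-duplicated") block store: identical chunks
# are stored once, and files keep only a list of chunk hashes (pointers).

import hashlib

class DedupStore:
    CHUNK = 4096

    def __init__(self):
        self.chunks = {}          # hash -> chunk bytes (stored once)
        self.files = {}           # file name -> list of chunk hashes

    def put(self, name, data):
        refs = []
        for i in range(0, len(data), self.CHUNK):
            chunk = data[i:i + self.CHUNK]
            digest = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(digest, chunk)     # duplicate chunks stored only once
            refs.append(digest)
        self.files[name] = refs

    def get(self, name):
        return b"".join(self.chunks[d] for d in self.files[name])

store = DedupStore()
store.put("a.bin", b"x" * 8192)
store.put("b.bin", b"x" * 8192)          # identical data: no new chunks stored
assert store.get("b.bin") == b"x" * 8192
print(len(store.chunks))                 # 1 unique chunk kept for both files
```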