Massively Scalable Data Center (MSDC) Design and Implementation Guide
All other trademarks mentioned in this document or website are the property of their respective owners. The use of the word partner does not imply a partnership relationship
between Cisco and any other company. (1002R)
THE SOFTWARE LICENSE AND LIMITED WARRANTY FOR THE ACCOMPANYING PRODUCT ARE SET FORTH IN THE INFORMATION PACKET THAT
SHIPPED WITH THE PRODUCT AND ARE INCORPORATED HEREIN BY THIS REFERENCE. IF YOU ARE UNABLE TO LOCATE THE SOFTWARE LICENSE
OR LIMITED WARRANTY, CONTACT YOUR CISCO REPRESENTATIVE FOR A COPY.
The Cisco implementation of TCP header compression is an adaptation of a program developed by the University of California, Berkeley (UCB) as part of UCB’s public
domain version of the UNIX operating system. All rights reserved. Copyright © 1981, Regents of the University of California.
NOTWITHSTANDING ANY OTHER WARRANTY HEREIN, ALL DOCUMENT FILES AND SOFTWARE OF THESE SUPPLIERS ARE PROVIDED “AS IS” WITH
ALL FAULTS. CISCO AND THE ABOVE-NAMED SUPPLIERS DISCLAIM ALL WARRANTIES, EXPRESSED OR IMPLIED, INCLUDING, WITHOUT
LIMITATION, THOSE OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT OR ARISING FROM A COURSE OF
DEALING, USAGE, OR TRADE PRACTICE.
IN NO EVENT SHALL CISCO OR ITS SUPPLIERS BE LIABLE FOR ANY INDIRECT, SPECIAL, CONSEQUENTIAL, OR INCIDENTAL DAMAGES, INCLUDING,
WITHOUT LIMITATION, LOST PROFITS OR LOSS OR DAMAGE TO DATA ARISING OUT OF THE USE OR INABILITY TO USE THIS MANUAL, EVEN IF CISCO
OR ITS SUPPLIERS HAVE BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
Preface i
Provisioning 1-22
Power On Auto Provisioning (PoAP) 1-22
Monitoring 1-26
Buffers 1-26
Why Congestion Matters 1-26
Buffer Monitoring Challenges 1-27
When Watching Buffer Utilization Isn’t Enough 1-27
Pull vs Push Models 1-28
On-switch Buffer Monitoring for Fine-grained Stats 1-28
Deployment in Testing 1-29
Issues and Notes 1-30
Recommendations 1-30
Caveats 1-31
Role of Virtualization 1-31
Scale 1-31
Fast Failure Detection (FFD) 1-31
Quick BFD Overview 1-32
Graceful Restart 1-34
Hiding Fabric Routes 1-34
TCP Incast 1-35
Why A Concern in MSDC 1-35
Current Work at Cisco 1-35
Industry Research Gaps This Testing Addresses 1-36
PoAP 2-1
PoAP Benefits 2-2
Topology Setup 2-2
MGMT0 2-3
Inband 2-3
Infrastructure 2-4
DHCP Server 2-4
isc-dhcpd Configuration 2-5
TFTP/FTP/SFTP/HTTP Server 2-5
Demo 2-6
PoAP Considerations 2-8
Servers A-1
Server Specs A-1
Operating System A-2
Network A-6
F2/Clipper References A-6
F2/Clipper VOQs and HOLB A-6
Python Code, Paramiko A-7
Spine Configuration A-8
Leaf Configuration A-8
buffer_check.py B-1
check_process.py B-11
Dell/Force10 D-1
Arista D-1
Juniper D-1
Brocade D-2
HP D-2
fail-mapper.sh E-1
find-reducer.sh E-3
tcp-tune.sh E-4
irqassign.pl E-4
VM configuration E-5
Cisco customers in the MSDC space are expanding their East-West networks at ever-increasing rates to keep up with their own demand. Because networks at MSDC scale are large cost centers, designers and operators of these networks are faced with the task of getting the most out of their capital, power and cooling, and data center investments. Commodity pricing for networking gear, previously seen only in the server space, is pushing vendors to rethink how customers architect and operate their network environments as a whole: to do more (faster), safely (resiliently), at lower cost (smaller buffers, fewer features, power efficiency).
This document intends to guide the reader in the concepts and considerations impacting MSDC
customers today. We:
1. Examine characteristics of traditional data centers and MSDCs and highlight differences in design
philosophy and characteristics.
2. Discuss scalability challenges unique to MSDCs and provide examples showing when a MSDC is
approaching upper limits. Design considerations that improve scalability are also reviewed.
3. Present summaries and conclusions to SDU’s routing protocol, provisioning and monitoring, and
TCP performance testing.
4. Provide tools for network engineers to understand scaling considerations in MSDCs.
While any modern network can benefit from topics covered in this document, it is intended for customers
who build very large data centers with significantly larger East-West than North-South traffic. Cisco
calls this space Massively Scalable Data Center (MSDC).
The following scaling characteristics are defined as a prelude to the Massively Scalable Data Center
design that drives the MSDC technology and differentiates it from VMDC/Enterprise.
Drivers
• Commoditization!
• Cloud Networking. Classical reasons to adopt cloud networking include:
– Improving compute, memory, and storage utilization across large server fleets (Figure 1-1).
Efficiencies improve when troughs of the utilization cycle are filled in with useful work.
Figure 1-1 CPU/Mem/Storage Utilization over Time, Before and After Cloud and Virtualization
– Increased efficiencies enable customers to innovate by freeing up compute cycles for other
work as well as providing a more flexible substrate to build upon.
• Operations and Management (OaM).
• Scalability. Application demands are growing within MSDCs. This acceleration requires
infrastructure to keep pace.
• Predictability. Latency variation needs to be kept within reasonable bounds across the entire MSDC
fabric. If every element is nearly the same, growth is easier to conceptualize and the impact scaling
has on the overall system is relatively easy to predict—homogeneity, discussed later in Design
Tenets, page 1-5, is a natural outgrowth of predictability.
These networks are characterized by a set of aggregation pairs (AGGs) which aggregate many access
(aka Top of Rack, or ToR) switches. AGGs then connect to an upstream distribution (DIS) layer, which
is followed by a core (COR) layer which aggregates the DIS layer and connects to other networks as
needed. Another noticeable characteristic in these networks which differ from that of MSDCs is
inter-AGG, inter-DIS, and inter-COR links between pairs; in MSDCs the amount of bandwidth needed,
and the fact that today’s platforms do not provide the necessary port density, make it unnecessary and
even cost-prohibitive to provide inter-device links which meet requirements. In MSDCs, the routing
decision to take a particular path from ToR to the rest of the network is made early on at the ToR layer.
Traditional data center networks are designed on principles of fault avoidance. The strategy for
implementing this principle is to take each switch1 (and links) and build redundancy into it. For example,
two or more links are connected between devices to provide redundancy in case of fiber or transceiver
failures. These redundant links are bundled into port-channels that require additional configuration or
protocols. Devices are typically deployed in pairs requiring additional configuration and protocols like
VRRP and spanning-tree to facilitate inter-device redundancy. Devices also have intra-device
redundancy such as redundant power supplies, fabric modules, clock modules, supervisors, and line
cards. Additional features (SSO) and protocol extensions (graceful-restart) are required to facilitate
supervisor redundancy. The steady state of a network designed with this principle is characterized by a
stable routing protocol. But it comes at the expense of:
• Operational complexity.
• Configuration complexity.
• Cost of Redundant Hardware—this in turn increases capital cost per node, adds more components that can fail, and lengthens development time and test plans.
• Inefficient use of bandwidth (single rooted).
• Not being optimized for small flows (required by MSDCs).
MSDCs, by contrast, are more interested in being able to fail a device without the overall system caring, thereby reducing the liability each network element can introduce into the system upon failure; this is fault tolerance.
1. For the purposes of this document, the term “switch[es]” refers to basic networking elements of an MSDC
network. These basic elements can be routers and/or switches, in the strict sense. However unless otherwise
noted, a “switch” is a L3 device that can perform both traditional L2 switching and L3 routing functions,
including speaking routing protocols such as OSPF and BGP.
Evidence of this break from traditional data center design is already observed in the field, as seen in this
sanitized version of a particular customer’s network above. Here a higher degree of ECMP is seen than
is present in earlier network architectures, however there are still weaknesses in the above design – most
notably the sizeable reduction in bandwidth capacity if one AGG device fails. ECMP allows for higher cross-sectional bandwidth between layers, and thus greater east-west bandwidth for applications, and it reduces the fault domain compared to traditional designs; failure of a single device reduces available bandwidth by only a fraction.
Finally, the logical conclusion to the trend towards more ECMP is a Clos2 design with a “Spine” and a
“Leaf” as shown in Figure 1-4. The Spine is responsible for interconnecting all Leafs, and provides a
way for servers in one rack to talk to servers in another in a consistent way. Leafs are responsible for
equally distributing server traffic across all Spine nodes.
2. Refer to Interconnecting Building Blocks, page 1-9 for details on how MSDC’s use Clos topology.
Design Goals
Data in this guide is based on thorough research into customer motivations, design tenets, and top-of-mind concerns, coupled with the drivers discussed above.
Design Tenets
All engineering requirements for the design of a MSDC are mapped, at varying degrees, to these
fundamental concerns and governing tenets:
• Cost—Customers want to spend less on their network, or turn their network into a profit-center. For
example, a 1W savings in power on a server NIC can translate to $1M saved overall.
• Power—Reducing power footprint, as well as improving PDU efficiencies, are major concerns to
customers.
• East-West BW—AKA "crosstalk". Applications are demanding more bandwidth due to multiple tiers and large fanout. In an MSDC context, applications typically generate huge fanout ratios, for example 1:100: for every byte inbound to the data center, roughly 100 bytes may be generated inside the MSDC, because a typical Web 2.0 social site performs well over 100 backend (east-west) transactions per single north-south transaction. Oversubscription is less tolerated in MSDC environments.
• Transparency—Customers use this term to help communicate the idea of building an intelligent
network which fosters easier, predictable communication between East-West components.
• Homogeneity—Eliminating one-offs makes operating MSDC networks easier at scale.
• Multipathing—ECMP brings fault domain optimization. ECMP reduces liability of a single fault,
or perhaps a small number of faults, to the overall system.
• Control—Programmability, automation, monitoring, bug/defect management, and innovation velocity. The more customers can control (code) and influence a vendor's adoption of relevant technologies, the more they can integrate the network into their own software infrastructure, which gives them a competitive advantage. Being able to influence a vendor's quality assurance is another trait that gives customers the control they need to operate successful environments.
reduce available bandwidth by one-eighth. Leaf devices could have two independent uplink failures and
still operate at 75% capacity. From these two examples it is apparent that fault tolerant network design
moves redundancy from individual network elements to the network as a system. Instead of each
network element having a unique mechanism to handle its own failures, the routing protocol is
responsible for handling failures at all levels. This drastically reduces configuration and operational
complexity of each device. But simplification, as always, comes at a cost (flexibility) which must be
balanced against the benefits of simplification.
As mentioned earlier, “M” in MSDC means “massive”. MSDC networks are massive, and require
astounding amounts of fiber (24,576 links), transceivers (49,152 xfp/sfp+), power supplies (over 800
devices), line cards, supervisors, chassis, etc. Such data centers are home to tens or hundreds of
thousands of physical servers, and the burden to interconnect those in intelligent ways is non-trivial.
These network elements are put into the domain of a single routing protocol. Due to the sheer number
of network elements in a single domain, failures are routine. Failures are the norm! Also, in MSDC
networks, the “application” is tightly integrated with the network and often participates with routing
protocols. For example, Virtual IPs (VIPs, the IP addresses of services that load balancers advertise) can be injected into or withdrawn from the network at the discretion of the application. Routing
protocols must keep the network stable despite near constant changes coming from both application
updates and network element failures. Dealing with churn is a primary motivation for moving all
redundancy and resiliency into the network.
Scalability
The scalability limits of individual devices that make up MSDCs are well known. Each platform has
route scale limits defined by TCAM partitioning and size. Base measurements like these can be used to
quantify a network with a stable steady state. However, these limits do not properly define scalability of
MSDC networks. Routing protocols are invoked in nearly every fault scenario, and as discussed in a
previous section titled “Scale, Differences between VMDC/Enterprise and MSDC”, MSDCs are so large
that faults are routine. Therefore true scalability limits of MSDC networks are in part defined by the
capacity of its routing protocol to handle network churn.
Deriving a measurement to quantify network churn can be difficult. The frequency and amplitude of routing protocol updates depend on several factors: explicit network design, application integration, protocols used, failure rates, fault locations, and so on. Any measurement derived would be specific to the particular network, and variations between networks would introduce statistical ambiguity. A more useful question is: "Depending on a particular network architecture, how does one know when churn limits have been reached?" MSDC customers are asking such a question today. For details, refer to Scalability, page 1-6.
Predictability
Predictable latencies across the MSDC fabric are critical for effective application workload placement. A feature of the Clos topology is that all endpoints are equidistant from one another, so it doesn't matter where workloads are placed, at least in terms of topological placement.
Building Blocks
The guide uses specific hardware and software building blocks, all of which fulfill MSDC design tenets. Building blocks must be cost-effective, consume less power, be simpler and programmable, and facilitate sufficient multipathing width (both hardware and software are required for this).
Building blocks are broken down into three areas:
• Leaf Layer, page 1-7
• Spine Layer, page 1-8
• FIB, page 1-8
Leaf Layer
The Leaf Layer is responsible for advertising server subnets into the network fabric. In MSDCs this
usually means Leaf devices sit in the Top-of-Rack (ToR), if the network is configured in a standard
3-stage folded Clos design5.
Figure 1-5 shows the Nexus 3064, the foundation of the Leaf layer.
The Leaf layer is what determines oversubscription ratios, and thus size of the Spine. As such, this layer
is of top priority to get right. The N3064 provides 64x 10G linerate ports, utilizes a shared memory
buffer, is capable of 64-way ECMP, and features a solid enhanced-manageability roadmap.
In contrast to Cisco devices that employ more feature-rich ASICs (M-series linecards, 5500 switches, ISSU, triple redundancy), this layer employs simpler designs that have fewer "moving parts" and that effectively forward packets while learning the network graph accurately.
Spine Layer
The Spine layer is responsible for interconnecting all Leafs. Individual nodes within a Spine are not connected to one another, nor do they form routing protocol adjacencies among themselves. Rather, Spine
devices are responsible for learning “infrastructure” routes, that is routes of point-to-point links and
loopbacks, to be able to correctly forward from one Leaf to another. In most cases, the Spine is not used
to directly connect to the outside world, or other MSDC networks, but will forward such traffic to
specialized Leafs acting as a Border Leaf. Border Leafs may inject default routes to attract traffic
intended for external destinations.
Figure 1-6 shows the F2 linecard providing 48x 10G linerate ports (with the appropriate Fabric Cards).
The Nexus 7K is the platform of choice: it provides the high density needed for large-bandwidth networks and has a modular operating system that allows for programmability. The N7004 consumes 7RU of space but provides only 2 I/O slots and uses side-to-side airflow (although not a first-order concern, MSDCs prefer front-to-back, hot-aisle/cold-aisle cooling when they can get it). The N{7009|7010|7018} are preferable since their port-to-RU ratio is much higher (real estate is a concern in MSDCs). If front-to-back airflow is required, the N7010 provides it; the N7009 and N7018 use side-to-side airflow. The building blocks in SDU testing employ all three N{7009,7010,7018} platforms.
Customers have voiced concern about the complexity and cost of the M-series linecards, and have requested simpler linecards that do less, but do those fewer tasks very fast and with high reliability. F2 fits those requirements well: it provides low power per 10G port and low latency, and it utilizes ingress buffering to support large-fanout topologies.
Note The F2 linecard is based on Cisco’s Clipper ASIC detailed in Appendix C, “F2/Clipper Linecard
Architecture”.
FIB
New FIB management schemes are needed to meet the demands of larger networks. The number of
loopbacks, point-to-point interfaces, and edge subnets are significantly higher than in traditional
networks. And as MSDCs are becoming more cloud-aware, more virtualization-aware, the burden on a
FIB can skyrocket.
Obviously, dedicating hardware such as TCAM to ever-growing FIBs is not scalable; the number of entries can grow to hundreds of millions, as seen in some MSDC customers' analyses. This is cost- and power-prohibitive.
Regardless of size, managing FIB churn is a concern.
Strategies to address this concern:
1. One strategy to manage the FIB is simply to reduce its size by separating infrastructure6 routes from customer7 routes. If the network system could relegate hardware to managing infrastructure routes ONLY, this could take the FIB from hundreds of thousands, even millions, of entries down to 24,000 or less. Customer routes could be managed by a system that is orthogonal to the infrastructure itself; this could be the network itself, or off-box route controller cluster(s).
2. The strategy used in Phase 1 was to manage the FIB by learning routes over stable links, that is, directly connected routes. In such situations churn is introduced only as physical links go down, which is less fragile than a topology that relies completely on dynamic insertion of prefixes. For example, an MSDC network based on a 3-stage Clos architecture may have 32 Spine devices (N7K+F2) and 768 Leaf devices (N3064). The FIB then comprises a stable set of 24,576 point-to-point routes and 800 loopbacks, plus the server subnets advertised by the Leafs; the quick calculation below illustrates the scale.
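The arithmetic behind those numbers, as a minimal Python sketch (not part of the tested configuration; the device counts are simply the ones quoted in this example):

# Stable-FIB arithmetic for a 3-stage Clos of 32 Spine and 768 Leaf devices.
spines, leafs = 32, 768

p2p_links = spines * leafs    # every Leaf uplinks once to every Spine -> 24576
loopbacks = spines + leafs    # one loopback per device -> 800

print("point-to-point links:", p2p_links)
print("loopbacks:", loopbacks)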
These networks are characterized by a set of aggregation pairs (AGGs) which aggregate many access
(aka Top of Rack) switches.
The bandwidth increases significantly near the root of the tree, but non-blocking operation is not provided, which introduces significant oversubscription. Examples of oversubscription and blocking in traditional architectures are displayed in Figure 1-8.
Oversubscription—means ingress capacity exceeds egress capacity. In Figure 1-8, if you have a rack of 24x 10G attached servers, the ACC device needs at least 240G of port capacity facing the upstream layer to be 1:1 oversubscribed (1:1 actually means there is NO oversubscription). If the ACC device has 24x 10G server ports and 2x 10G uplinks, you have 12:1 oversubscription (see the calculation sketch below). To allow the entire network to operate at linerate, 1:1 oversubscription is required. However, not all networks need to provide 1:1 performance; some applications operate fine when oversubscription occurs, so in some scenarios non-blocking designs aren't necessary. The architect should have a thorough understanding of application traffic patterns, bursting needs, and baseline states in order to accurately define the oversubscription limits a system can tolerate.
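The following minimal Python sketch shows the oversubscription arithmetic for the ACC example above (the port counts are the ones quoted in the text, not a recommendation):

# Oversubscription = ingress (server-facing) capacity : egress (uplink) capacity.
def oversubscription(server_ports, uplink_ports, port_speed_gbps=10):
    ingress = server_ports * port_speed_gbps
    egress = uplink_ports * port_speed_gbps
    return ingress / float(egress)

print(oversubscription(24, 2))     # 12.0 -> 12:1 oversubscribed
print(oversubscription(24, 24))    # 1.0  -> 1:1, i.e. no oversubscription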
Blocking—Oversubscription at the device level, and even at logical layers, causes applications to block, which results in network queueing. For example, in Figure 1-9 server A wants to talk to server Z, but the upstream DIS layer is busy handling other inter-rack traffic. Because the DIS layer is overwhelmed, the network causes server A to "block". Depending on the queueing mechanisms and disciplines of the hardware, queueing may occur at ingress to the DIS layer.
In Figure 1-8 and Figure 1-9 10G interfaces are being considered as the foundation. If the system wants
to deliver packets at line-rate these characteristics should be considered:
• Each ACC/ToR device could only deliver 20G worth of server traffic to the AGG layer, if we assume there are only 2x 10G uplinks per ACC device. That represents only about 8% of total possible server traffic capability! This scenario results in 12:1 oversubscription.
• Each AGG device needs to deliver ten times the number of racks-worth of ACC traffic to the DIS
layer.
• Each DIS device needs to deliver multiple terabits of traffic to the COR layer.
Scaling such a network becomes cost-prohibitive, and growth becomes more complex because additional branches of the tree need to be built to accommodate new pods. In addition to bandwidth constraints, there are also queueing concerns based on the amount of buffer available to each port within the fabric.
The problem with these traditional topologies, in the MSDC space, is that they can't sustain the bursty east-west traffic patterns and bandwidth needs common in MSDC environments.
Clos
Figure 1-10 shows an example of a Clos topology composed of a hypothetical 6-port building block.
In 1953 Charles Clos created the mathematical theory of the topology that bears his name: a non-blocking, multi-stage topology that provides greater bandwidth than a single node8 is capable of supplying. The initial purpose of the Clos topology was to solve the n² interconnect problem in telephone switching systems: it interconnects n inputs to n outputs with far fewer than n² nodes. The labeling of both inputs and outputs with the same variable, n, is by design; we marry each output with an input, the number of outputs equals the number of inputs, and there is precisely one connection between nodes of one stage and those of the next stage. Said another way, a Clos network connects a large number of inputs and outputs with "smaller-sized" nodes.9
In this example, using 6-port switches, we connect 18 endpoints (or "edge ports") in non-blocking fashion10 using a 3-stage Clos topology. We use the phrase "folded Clos" to mean the same thing as a 3-stage Clos, but it is more convenient for network engineers to visualize ports, servers, and topology in a folded manner. As for terminology, in a 3-stage Clos we have an ingress Leaf layer, a Spine center layer, and an egress Leaf layer; if we fold it, we simply have a Leaf layer and a Spine layer.
If we create a Clos network using building blocks of uniform size, the number of edge ports can be calculated from a relationship derived from Charles Clos' original work, expressed in terms of k, the radix of each node (its total number of edges), and h, the number of stages (the "height" of the Clos). Some examples:
• k=6, h=3
• k=64, h=3
• k=64, h=5
Intuition, however, shows a 5-stage Clos built using, say, the Nexus 3064, doesn’t actually give you more
than 4 million edge ports, but rather 65,536 edge ports (2048 Leafs multiplied by 32 edge-facing ports).
Figure 1-11 shows an example of a 5-stage folded Clos, using 6-port building blocks.
Here we have 54 edge ports (not 214 ports as the formula predicts), up from 18 when using a 3-stage
Clos. The primary reason to increase the number of stages is to increase the overall cross-sectional
bandwidth between Leaf and Spine layers, thus allowing for an increased number of edge ports.
Note The discrepancy between the above formula and intuition can be explained by a “trunking” factor in
Clos’ derivation due to the middle stages – since the N3064 isn’t a perfect single crossbar, but rather a
multi-stage crossbar itself, the above formula does not work where h ≥ 5. And it should be noted that in
the strict sense a Clos network is one in which each building-block is of uniform size and is a perfect,
single crossbar.
As such, because the nodes of today (Nexus 3064) are multi-stage Clos'es themselves, a more appropriate formula for MSDC purposes is one in which h is always 3 (in part because cabling of a strict Clos network where h ≥ 5 is presently cost-prohibitive, and the discussion of more stages is beyond the scope of this document), and the formula simplifies to:
edge ports = (N × k) / 2
where N is the radix of the Spine nodes and k is the radix of each Leaf; we divide by two because only half the ports on the Leaf (k) are available for edge ports at 1:1 oversubscription. Therefore a 3-stage Clos using only N3064s would provide 2048 edge ports (64 × 64 / 2 = 2048). Or, with a 3-stage Clos using fully loaded N7018s+F2 linecards as Spine nodes and N3064s as Leafs, you get 24,576 edge ports (768 × 64 / 2 = 24,576). A short script reproducing these calculations follows.
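This is a minimal Python sketch only; the radixes are the ones quoted above:

# 3-stage (folded) Clos edge ports: (Spine radix * Leaf radix) / 2.
def clos3_edge_ports(spine_radix, leaf_radix):
    return spine_radix * leaf_radix // 2

print(clos3_edge_ports(64, 64))     # N3064 Spines + N3064 Leafs    -> 2048
print(clos3_edge_ports(768, 64))    # N7018+F2 Spines + N3064 Leafs -> 24576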
Fat Tree
A Fat Tree is a tree-like arrangement of a Clos network (Figure 1-12), where the bandwidth between each layer increases by 2x, so the tree gets thicker (fatter) the closer you get to the trunk.
Note the boxes outlined with dotted-blue; these are the “Leafs” of the Clos; the topmost grouping of
nodes, 3x3, are each the “Spines”. In other words, you essentially have a 3-stage folded Clos of 6
“nodes”, comprised of 3x 6-port Spines nodes and 3x 6-port Leafs. This creates 27 edge-ports.
Compared to a standard Clos, while it's true you get more edge ports with a Fat Tree arrangement, you also potentially have more devices and more links. This additional cost must be considered when deciding on a particular topology.
Table 1-1 compares relative costs of Clos and Fat-trees using hypothetical 6-port building blocks.
Table 1-1 Clos and Fat-Trees Relative Cost Comp Using Hypothetical 6-Port Building Blocks
Table 1-2 shows the N3K as the building block (x, y, and z are left for the reader to calculate for an N3K-based Fat Tree).
11. Costs, in this case, refers to the number of Fabric Boxes, Links, and Optics to achieve a particular end-host
capacity.
Table 1-2 Relative Cost Using the N3K (64-Port) Building Block

Topology | Ports/Box | Fabric Boxes | Fabric Links | Total End-hosts
Fat Tree | 64 | x | y | z
Clos-3 | 64 | 96 (32 spines + 64 leafs) | 2048 | 2048
Clos-5 | 64 | 5120 ((96*32) spines + (32*64) leafs) | 131072 (2048*32 + 2048*32) | 65536
Table 1-3 shows a modification for the CLOS-5 case that might employ a 16-wide “Spine” rather than a
32-wide “Spine” (Spine, in the CLOS-5 sense, means that each Spine “node” is comprised of a 3-stage
Clos), thus each Leaf has 2 connections/Spine. In other words, you cut the number of devices and
end-hosts in half.
Table 1-3 Clos-5' Modification Using the N3K (64-Port) Building Block

Topology | Ports/Box | Fabric Boxes | Fabric Links | Total End-hosts
Clos-5' | 64 | 2560 ((96*16) spines + (32*32) leafs) | 131072 (2048*32 + 2048*32) | 32768
The importance of also considering the amount and cost of cabling, and the quantity and cost of optics, in large Clos topologies cannot be overstated! The short calculation below reproduces the Clos-3 and Clos-5 rows of Table 1-2.
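The sketch below is illustrative only and makes the same structural assumptions as Table 1-2 (64-port building blocks, strict 1:1 oversubscription):

k = 64                                  # ports per building block

# Clos-3: one Spine node per Leaf uplink; Leafs split ports 50/50.
spines3, leafs3 = k // 2, k             # 32 Spines, 64 Leafs
boxes3 = spines3 + leafs3               # 96 fabric boxes
links3 = leafs3 * (k // 2)              # 2048 fabric links
hosts3 = leafs3 * (k // 2)              # 2048 end-hosts

# Clos-5: each "Spine" is itself a 3-stage Clos of 96 boxes, 32 Spines wide,
# with 32 groups of 64 Leafs (per Table 1-2).
boxes5 = 96 * 32 + 32 * 64              # 5120 fabric boxes
links5 = 2048 * 32 + 2048 * 32          # 131072 fabric links
hosts5 = (32 * 64) * (k // 2)           # 65536 end-hosts

print(boxes3, links3, hosts3)
print(boxes5, links5, hosts5)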
The width of a Spine is determined by the uplink capacity of the platform chosen for the Leaf layer. In the Reference Architecture, the N3064 is used to build the Leaf layer. To be 1:1 oversubscribed the N3K needs to have 32x 10G upstream and 32x 10G edge-facing. In other words, each Leaf layer device can support racks of 32x 10G attached servers or less. This also means that the Spine needs to be 32 nodes wide.
Note In Figure 1-14, there is latitude in how one defines a Spine “node”. For example, there might be 16
nodes, but each Leaf uses 2x 10G ports to connect to each Spine node. For simplicity, a Spine in the strict
sense, meaning that for each Leaf uplink there must be a discrete node, is what is used. The size of the
Spine node will determine the number of Leafs the Clos network can support.
With real-world gear we construct the Clos with N3Ks (32 server-facing ports each) as Leafs and N7Ks+F2 (768 Leaf-facing ports, which means there are a total of 768 Leafs) as Spines. This means a total of 24,576 10G ports are available to interconnect servers, at the cost of 800 devices, 24,576 cables, and 49,152 10G optical transceivers.
Because of limited rack real estate, power, and hardware availability, the test topology employs a 16-wide Spine with 20 Leafs, each Leaf supporting at most 16 servers/rack (this leaves a total of 32 ports on the N3Ks unused/non-existent for the purposes of our testing).
The lab topology had 16 N7Ks as a spine, so it is a 16-wide Spine Clos architecture.
Other Topologies
Clos’es and Fat Trees are not the only topologies being researched by customers. While Clos’es are the
most popular, other topologies under consideration include the Benes12 network (Figure 1-15).
Others include 1-dimensional (ring) topologies (Figure 1-16), 2- and 3-dimensional toroids13 (Figure 1-17 and Figure 1-18), and hypercubes (Figure 1-19).
12. http://en.wikipedia.org/wiki/Benes_network#Clos_networks_with_more_than_three_stages
13. http://en.wikipedia.org/wiki/Grid_network
Each topology has pros and cons, and may be better suited to a particular application. For example,
toroid and hypercube topologies favor intra-node, east-west traffic, but are terrible for north-south
traffic. Toroid networks are in fact being utilized by some vendors, especially in the High Performance
Computing (HPC) space. While a fuller discussion of alternative topologies is no doubt interesting to
network engineers and architects trying to optimize the network to their applications, it is beyond the
scope of the present document.
As of this writing, customers by and large gravitate to 3-stage Clos’es since they represent an acceptable
balance of east-west vs north-south capability, implementation cost and complexity, number of cables
and optics needed, and ease of operating. It goes without saying that Clos topologies are very well
understood since they’ve been in use, such as in ASICs, for around 60 years and the theory is well
developed as compared to more exotic topologies.
Refer to the bullet points in the "Design Tenets" section on page 1-5 to see how this type of topology can be used to meet those needs.
COST
Low-cost platforms for the Leaf, such as N3064, are based on commodity switching chipsets. For the
Spine, F2 linecards on N7K are used. F2 is a higher density, high performance linecard with a smaller
feature set than that of the M-series.
POWER
The testing did not focus on power as much as on the other areas of concern.
EAST-WEST BW
This area was of greatest concern when considering an MSDC topology for the lab. Utilizing the 3-stage folded Clos design afforded 2.5Tbps of east-west bandwidth, with each Leaf getting an equal 160Gbps share of the total.
TRANSPARENCY
An area of concern that is not discussed in this guide. It is expected that overlays may play an important
part in achieving sufficient transparency between logical and physical networks, and between customer
applications and the network.
HOMOGENEITY
There are only 2 platforms used in our MSDC topology, N3K and N7K. With only a small number of
platform types it is expected that software provisioning, operations, and performance predictability will
be achievable with present-day tools.
MULTIPATHING
The use of 16-way ECMP between the Leaf and Spine layers is key.14 For a long time, IOS, as well as
other network operating systems throughout the industry, were limited to 8 path descriptors for each
prefix in the FIB. Modern platforms such as those based on NX-OS double the historical number to 16
as well as provide a roadmap to significantly greater ECMP parallelization. 64-way is currently
available.15 128-way is not far off.16
CONTROL
This aspect of MSDC design tenets is met by programmability, both in initial provisioning (PoAP) and
monitoring. This guide addresses both of these areas later in the document. However, it is acknowledged that "control" isn't just about programmability and monitoring; it may also include the customer's ability to influence a Vendor's design, or even the openness of large portions of the network operating system to customer modification and innovation. These last 2 aspects of control are not addressed in this guide.
Applications
When discussing applications in MSDC environments, it is important to recognize not all MSDC
operators actually control the applications running on their infrastructure. In the case of MSDC-scale
public cloud service providers, for example, MSDC operators have little control over what applications
tenants place into the cloud, when they run workloads, or how they tune their software.
Conversely in situations where public tenancy is not a constraint, operators tend to have very
fine-grained control over the applications workloads. In many cases, the applications are written
in-house rather than being purchased “off the shelf”. Many use open source components such as
databases, frameworks, message queues, or programming libraries. In such scenarios, applications vary
widely between MSDC customers, but many share common threads:17
• Workloads are distributed across many nodes.
• Because of their distributed nature, many-to-one conversations among participating nodes are
common.
• Applications that the data center owner controls are generally designed to tolerate (rather than
avoid) failures in the infrastructure and middleware layers.
• Workloads can be sensitive to race conditions; but customers have made great efforts to minimize
this with increased intelligence in the application space (independent of the network).
Exceptions to the above application characteristics certainly exist, but by and large these characteristics represent the trends seen in present-day MSDCs.
Distribution
Distribution in MSDC environments may vary. Common distribution schemes include:
• In most cases, workloads are distributed among multiple racks (for scale and resiliency).
• In many cases, workloads are distributed among multiple independent clusters or pods (for
manageability, resiliency, or availability reasons).
• In some cases, workloads are distributed among multiple data centers (for resiliency or proximity).
While the exact schemas for distribution may vary, some common rationales drive the design. A few
common key characteristics which determine how workloads are distributed include:
• Performance
• Manageability
• Resiliency to failures (redundancy, fault isolation zones, etc)
• Proximity (to audience or other interacting components/data stores/applications)
• Scalability and elasticity
17. This information is based on extensive work with Account Teams as well as customer surveys.
• Cost
Workload Characterizations
Workloads vary between MSDCs based on the applications being supported and how much control the customer has over the workload. For example, large cloud service providers hosting thousands of tenants have little control over the workloads tenants deploy on top of the provider's IaaS or PaaS. Traffic in the network, disk I/O on end hosts, and other resource usage may be very inconsistent and hard to predict. Even in these cases, however, MSDC customers may have some distributed applications running atop the hardware that they have direct control over, such as orchestration systems, cloud operating systems, monitoring agents, and log analysis tools. Most such applications are designed to have as light a footprint as possible in order to preserve the maximum resources possible for sale to tenants.
By contrast, web portal or e-commerce providers may run applications designed in-house and therefore have flexibility to tune workloads to best suit the underlying infrastructure. In such networks,
tenants tend to be entities within the same corporation which actively collaborate on how best to use
available resources. Workloads can be tuned for maximum efficiency, and elasticity may follow
predictable trends (e-commerce sites might expect more load during holiday shopping season).
Workloads in such customer environments can be loosely characterized as a series of interacting
applications that together create a singular end-user SaaS experience. Characteristics of these systems
reflect the purpose of the application. For example, distributed applications participating in the
presentation of a website generate small packets (128-512 bytes) and short-lived conversations. Big data
analysis workloads by contrast may have longer sustained flows as chunks of data are passed around and
results of analysis returned.
Because of workload variability found in MSDC environments, it is strongly recommended that
architects make careful study of the applications to be deployed before making infrastructure design
decisions.
Provisioning
It doesn't matter if a network is the highest-performing network engineers can build for their applications if it cannot be provisioned quickly and accurately. Timing is essential because MSDC networks change often. Popular reasons for frequent network changes include Change Management (CM) procedures and rapid scale growth to meet seasonal traffic bursts. It is not uncommon for customers to require entire datacenters to be built within weeks of an initial request. Also, provisioning systems that easily integrate into customers' own software mechanisms are the ones that get deployed.
PoAP in a MSDC
PoAP in MSDCs is important for the following reasons:
• MSDCs have a lot of devices to provision, especially Leafs.
• Configuring devices manually doesn’t scale.
• MSDCs already use the network to bootstrap servers and would like to be able to treat network
infrastructure in a similar manner.
• Speed of deployment.
PoAP Step-by-Step
PoAP Scripts
• Scripts can be written in Python or TCL; Python is considered more modern.
• The first line of the script is an md5sum computed over the rest of the script text (a helper sketch follows this list):
– #md5sum="0b96a4f2b9f876b4af97d4e1b212fabf"
– Update it with every script change!
18. http://www.cisco.com/en/US/docs/switches/datacenter/nexus3000/sw/fundamentals/503_U3_1/b_Nexus_3000_Fundamentals_Guide_Release_503_U3_1_chapter_0111.html
19. If the Python script does not require image or configuration download, then FTP/HTTP servers aren’t
required.
20. Configurations applied after the first reboot may be things like hardware profile
portmode, hardware profile unicast, and system urpf.
• Sample scripts available on CCO download page (it’s with kickstart images)
– Upcoming scripting “community” for code sharing with/among customers to be available.21
• Full system initialization and libraries available
– Script can be customized to do almost anything!22
• Script troubleshooting is time consuming, therefore keep the script simple!
• PoAP process on switch is very basic, script does all the magic.
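Because the md5sum line must be updated with every script change, it is convenient to automate it. The helper below is a minimal sketch (not a Cisco-provided tool); the filename is a placeholder, and the script is assumed to already start with a #md5sum= line:

#!/usr/bin/env python
# update_poap_md5.py - rewrite line 1 of a PoAP script with the md5sum of the rest.
import hashlib
import sys

path = sys.argv[1] if len(sys.argv) > 1 else "poap_script.py"   # placeholder name
lines = open(path).readlines()

# Hash everything after the existing first (md5sum) line.
body = "".join(lines[1:])
digest = hashlib.md5(body.encode()).hexdigest()

lines[0] = '#md5sum="%s"\n' % digest
open(path, "w").writelines(lines)
print("updated md5sum to %s" % digest)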
Figure 1-21 Two Parallel and Independent Topologies, one testing OSPF, the other BGP
21. https://github.com/datacenter
22. http://www.cisco.com/en/US/docs/switches/datacenter/nexus3000/sw/python/api/python_api.html
Figure 1-24 One Full Topology, Running Hadoop for Incast Testing
Monitoring
As networks grow, extracting useful telemetry in a timely manner is critical. Without relevant
monitoring data it is impossible to manage MSDC-sized networks and stay profitable. MSDC customers
want to do more with less, thus monitoring (with the requisite automation) is the glue which holds the
infrastructure together. In addition to daily operations, monitoring provides essential information that
allows for effective forward-planning and scaling, with a minimum number of network engineers.
Buffers
Statistics, gleaned from probes which monitor buffers, reveal important real-time MSDC characteristics.
These characteristics show how traffic is distributed, how the infrastructure is performing, and are key
indicators of where applications may suffer from blocking and queueing.
Traditionally, monitoring systems poll each node for data periodically. A classic Free Open Source Software (FOSS) example is Nagios. Usually the monitoring polling is done serially, but it can be parallelized to some degree (mod_gearman). Polling systems can only interact with switches via mechanisms such as SNMP, SSH to the CLI, Netconf, etc. In cases where the CLI or Netconf is used, all command output must be sent to the monitoring system to be parsed and analyzed. Generally, these polling nodes don't tax CPU and memory on the switch much (same as a user shell).
Deployment in Testing
Refer to Figure 1-20 on page 1-19 for details.
The daemon was written in Python with approximately 600 lines of code, and it used only modules
provided by NX-OS – it wasn’t necessary to load 3rd party libraries from bootflash, for example. The
program sets up a TCP socket to a Graphite23 receiver once then sends data via the Pickle24 protocol at
configurable intervals. Several CLI options are available to alter the frequency of stats collection, which
stats are collected, where data is sent, and so forth.
SDU was able to demonstrate these capabilities with the on-switch system:
• Gathers data from both XML (when available) and raw CLI commands (when XML output not
supported).
• Uses fast, built-in modules like cPickle and Expat to gather some stats, such as buffer cell utilization, and calculates other information not provided by NX-OS, such as the percentage of buffer threshold used per port. As expected, there is a tradeoff between CPU impact and stat collection frequency; this choice is exposed at runtime via CLI arguments. A minimal sketch of the Graphite publishing loop follows.
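For illustration, here is a minimal sketch of such a publishing loop (this is not the SDU daemon itself; the host address, metric name, and collection routine are placeholders):

# Publish stats to a Graphite carbon receiver using the pickle protocol.
import pickle
import socket
import struct
import time

GRAPHITE_HOST = "10.0.0.100"     # placeholder carbon-relay address
GRAPHITE_PICKLE_PORT = 2004      # default carbon pickle listener port
INTERVAL = 1.0                   # ~1s intervals kept CPU at 2-5% in testing

sock = socket.create_connection((GRAPHITE_HOST, GRAPHITE_PICKLE_PORT))

def collect_buffer_stats():
    # Placeholder for parsing 'show queuing interface'-style output.
    return {"leaf-r4.eth1-1.buffer_cells_used": 123}

while True:
    now = int(time.time())
    metrics = [(name, (now, value))
               for name, value in collect_buffer_stats().items()]
    payload = pickle.dumps(metrics, protocol=2)
    # Carbon expects a 4-byte length header before each pickled batch.
    sock.sendall(struct.pack("!L", len(payload)) + payload)
    time.sleep(INTERVAL)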
Graphite Setup
• Single server (8 cores/16 threads, 12GB RAM, 4-disk SATA3 RAID5 array).
• 8 carbon-cache instances fed by 1 carbon-relay daemon.
• The server receives stats from collectd25 on each of 40 physical servers, as well as from the on-switch monitoring daemons on each Leaf.
• Each collectd instance also provides stats for 14 VMs/server acting as Hadoop nodes.
• An incoming rate of over 36,000 metrics/sec is possible, with 17,000-21,000 metrics/sec being more typical.
Performance
• Buffer utilization stats every 0.18-0.20s possible, but uses 40-50% CPU!
• Buffer stats at approximately 1s intervals used negligible CPU, ranging from 2-5%.
• About 10.5MB footprint in memory.
Why Graphite and collectd? Both are high performance, open source components popular in cloud
environments and elsewhere.
Graphite
• Apache2 license, originally created by Orbitz.
• Scales horizontally, graphs in near realtime even under load.
• Written in Python.
• Accepts data in multiple easy-to-create formats.
23. http://graphite.wikidot.com/
24. https://graphite.readthedocs.org/en/latest/feeding-carbon.html#the-pickle-protocol
25. http://collectd.org/
Collectd
• GPLv2 license.
• Written in C.
• Low overhead on server nodes.
• Extensible via plugins.26
• Can send stats to Graphite via the Write Graphite plugin.27
Recommendations
The following recommendations are provided.
• When possible, stick with Python modules already included on the N3K. Loading 3rd party pure-Python modules from bootflash is possible, but provisioning and maintenance become more painful. This could be mitigated, however, by config management tools like Puppet, if they have support for Cisco devices.
• Balance granularity and CPU/memory footprint to specific needs. Adding more commands and stats
to the daemon quickly lengthens the amount of time required to collect data and therefore the
interval at which metrics can be published. The bulk of the overhead is in issuing commands and
receiving output (not usually parsing or calculating). Parallelization can help by running multiple
daemons or multiple instances of a daemon, each configured to gather only certain stats. This will
certainly increase memory footprint, and may even increase CPU burden. But parallelization makes collecting different stats at different intervals easier.
• Use XML output from commands when possible. Parsing is more reliable, and also fast, especially with the C-based Expat parser.
• Carefully select data sink, as it can become a choke point. SDU used Graphite, which scales
relatively well, horizontally on multiple hosts and/or behind loadbalancers. Many MSDC customers
have the resources and experience to design their own data sink systems.
• Avoid using per-interface commands when possible, especially if you have a lot of interfaces to check. Parsing 'show queuing interface' once is faster than issuing and parsing 64 individual 'show queuing interface x/y' commands (a timing sketch follows the footnotes below).
26. https://collectd.org/wiki/index.php/Table_of_Plugins
27. https://collectd.org/wiki/index.php/Plugin:Write_Graphite
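The timing sketch below illustrates the bulk-versus-per-interface tradeoff using Paramiko (the library referenced in the appendix); the hostname, credentials, and port count are placeholders:

import time
import paramiko

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect("leaf-r4", username="admin", password="example")   # placeholders

def run(cmd):
    _, stdout, _ = client.exec_command(cmd)
    return stdout.read().decode()

# One bulk command: every port is parsed from a single output.
t0 = time.time()
bulk_output = run("show queuing interface")
print("bulk command took %.2fs" % (time.time() - t0))

# Per-interface: 64 separate round trips, typically much slower overall.
t0 = time.time()
for port in range(1, 65):
    run("show queuing interface ethernet 1/%d" % port)
print("64 per-interface commands took %.2fs" % (time.time() - t0))

client.close()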
Caveats
The following caveats are provided.
• The on-switch approach still has some of the same pain points as other approaches. It still has to
deal with issuing commands and parsing output. Full support for getting data via any one method
other than the CLI is lacking. Some commands have XML output, some don't. Some commands have a Python API, most don't.
• The bottleneck for metric frequency is usually the CLI. Most of the bottlenecks SDU found were in
how long it took to issue commands and get back output. For example:
– show queuing interface has no XML output, takes ~1.1s
– show interface x/y | xml on 29 interfaces took ~1.9s
Role of Virtualization
In this guide, virtualization plays only a supporting role and is not the focus of Phase 1. Virtualization
was configured on the servers to provide infrastructure services, such as DHCP, TFTP, FTP/SFTP, and
HTTP daemons. Virtualization was also used to create additional “nodes” in the Hadoop framework for
the purpose of having finer-grained control over where workloads were placed.
Scale
Here we discuss best practice designs to raise the limits of the MSDC network. Refer to Fabric Protocol
Scaling, page 2-8 for details on what the top churn elements are. BFD and routing protocols are
discussed. Also, TCP Incast is introduced.
• Indirect validation of forwarding plane failure. Not helpful in link down scenarios between p2p
links.28
• May impact SSO and ISSU in redundant systems.
• High CPU overhead caused by the additional information carried in Routing protocol messages not
needed for failure detection. Link utilization also increases as a result of frequent updates.
• Aggressive Routing protocol timers can lead to false positives under load/stress.
Table 1-4 lists differences between BFD and protocol timers in MSDC.
BFD | Protocol Timers
Single set of timers for all protocols. | Hello/dead timers differ for each protocol and load.
Lightweight and can work with a large number of peers without introducing instability (scalable). | Routing protocol messages carry superfluous information not needed for failure detection; higher CPU load, link utilization, and false positives can occur.
Distributed implementation (hellos sent from I/O module1). | Centralized implementation (hellos sent from SUP).
Failure notification to other protocols. | No failure notification to other protocols.
Interacts well under system HA events. | May impact SSO and ISSU in redundant systems (not as relevant in MSDCs).
Single L3 hop only2. | Capable of single and multi L3 hop.
Sub-second failure detection. | Failure detection not sub-second.
1. For the N7K implementation; not true for N3K.
2. The standard includes multi-hop, but the Cisco implementation is single-hop only. Multi-hop is on the roadmap.
28. On NXOS and many other products, link-down notification to protocols will always be faster than default
dead-timer expiration.
BFD was jointly developed by Cisco and Juniper. Many major vendors now support BFD, such as TLAB,
Huawei, ALU. BFD at Cisco is implemented in IOS, IOS XR, and NX-OS with support for both BFD
v0 and v1 packet formats. NX-OS implementation has been tested to interoperate with Cat6k, CRS, and
various JUNOS platforms. Table 1-5 compares the different implementations across Cisco's network OSes.
Step 1 A session request is received from the application (for example, OSPF or BGP).
Step 2 The SUP-BFD process on the SUP determines the type of port, the port's operational parameters, and its IP address.
Step 3 A session discriminator is assigned and a session context is created. A response is sent to the application.
Step 4 The Finite State Machine (FSM) selects the linecard where the session will be installed. ACLMGR programs the required ACL (ACLs are required to redirect incoming BFD packets to the appropriate linecard CPU).
Step 5 Session control is passed from the SUP to the linecard.
Step 6 The LC-BFD process on the linecard sends notification to registered applications indicating session UP or DOWN status.
Step 7 If the session state changes during the session, the BFD process on the linecard notifies all registered applications.
BFD Recommendation
Graceful Restart
It is recommended to turn this feature off in an MSDC network. Graceful Restart allows the data plane to continue forwarding packets should a control-plane-only failure occur (the routing protocol needs to restart but no links have changed). In a network with a stable control plane during steady state, this is very useful as it allows for hitless control-plane recovery. However, in a network with an unstable control plane during steady state, this feature can cause additional packet loss because the data plane cannot handle additional updates during the restart interval.
TCP Incast
TCP Incast, also known as “TCP Throughput Collapse”, a form of congestive collapse, is an extreme
response in TCP implementations that results in gross under-utilization of link capacity in certain N:1
communication configurations.32
Packet loss, usually occurring at the last-hop network device, results from the N senders exceeding the capacity of the switch's internal buffering. Such packet loss across a large fleet of senders may lead to TCP Global Synchronization (TGS), an undesirable condition where senders respond to packet losses by taking TCP timeouts in "lock-step". In stable networks, buffer queues are usually either empty or full; in bursty environments these limited queues are quickly overrun. A popular method for dealing with overrun queues is to enforce "tail drop". However, when there is a large number of [near] simultaneous senders, N, all sending to a single requestor, the resultant tail-drop packet losses occur at roughly the same time. This in turn causes the senders' automatic TCP congestion avoidance and recovery mechanisms ("slow-start" and its variants and augmentations) to kick in at the same time. The net effect is wasted bandwidth: capacity is consumed without doing much real work.33
The team showed that with 10G attached servers there are fewer burdens on network buffering because
servers will consume network data faster. Thus CPU, memory, and storage I/O becomes the bottleneck
as opposed to network buffers (Figure 1-27). Work done in support of this guide differs from what has
previously been done in two ways:
1. Testing used Hadoop as a way to generalize Incast conditions rather than analyzing Hadoop itself.
2. Testing builds upon work that has already been done by introducing a broader class of failure and churn scenarios and observing how the network behaves: for example, what happens when you fail larger groups of servers, or have gross-level rack failures?
This chapter discusses Power on Auto Provisioning (PoAP) and fabric protocol scaling.
PoAP
As was discussed earlier, PoAP was used to configure the various logical topologies—one major change
for each of 4 cycles (a, b, c, and d) for this phase of testing1. Setup and testing is documented below.
The Goals of the PoAP testing can be summarized in 4 bullet points, along with a summary of results:
1. It should be demonstrated that automation of simultaneous initial provisioning of all Leafs, without
human intervention, is possible.
• SUCCESS. After issuing write erase;reload, no human intervention was needed in order for the
switches to load new images/configuration and for the network to reconverge.
2. If failures occur during the PoAP process, there should be troubleshooting steps engineers can take
to determine root cause using logs.
• CONDITIONAL SUCCESS. Log messages left on bootflash by the PoAP script helped determine
root cause of failures in most cases. However some corner cases (bootflash full) prevented logs from
being written, and log verbosity is partly dependent on the PoAP script code (which is up to the
customer/script author).
a. Upon failure, PoAP will restart continuously.
b. On console, abort PoAP process when prompted.
c. Go through user/pass setup to get to bootflash to read logs.
d. Problems with PoAP process:
– PoAP never gets to script execution step
– bootflash:<ccyymmdd>_<HHMMss>_PoAP_<PID>_init.log files contain log of PoAP
process:
DHCP related problems (DHCP Offer not received, incorrect options in OFFER, etc)
HTTP/TFTP related problems (couldn’t reach server, file not found, etc)
Check DHCP/TFTP/HTTP/FTP/SFTP server logs for additional information
e. Errors in script execution:
– NO STDOUT or STDERR – only what script writes to logfile.
PoAP Benefits
Here are a few benefits provided by PoAP:
• Pipelining device configuration
– Pre-build configurations for Phase N+1 during Phase N.
• Fast reconfiguration of entire topology
– Phase N complete and configs saved offline.
– ‘write erase’ and ‘reload’ devices and recable testbed.
– After PoAP completes, the new topology is fully operational.
• Ensuring consistent code version across testbed/platforms.
• Scripting allows for customization.
• Revision control: config files can be stored in SVN/Git/etc, off-box in a centralized repository, for
easy versioning and backup.
Topology Setup
Each method of enabling PoAP, below, has its pros and cons. One of the most important decisions is how
any method scales. MGMT0, page 2-3 and Inband, page 2-3 are two possible ways to enable PoAP in
the topology.
MGMT0
Here is a detailed depiction of how PoAP can be used with the mgmt0 interface of each Spine and Leaf
node (Figure 2-1).
Pros
• Simple setup (no relay).
• DHCP server can be single homed.
• Single subnet in DHCP config.
Cons
• This is not how most MSDCs would deploy; the cost of a separate mgmt network at MSDC scale is prohibitive.
• The DHCP server could potentially respond to DISCOVER messages from outside the primary network, depending on cabling and configuration.
If using this setup, the PoAP script uses the management VRF.
Inband
In this setup, no mgmt network is used, but rather the normal network (Figure 2-2).
Pros
• Customers prefer this method; L3-only, no separate network needed.
• DHCP scope limited to just the main network.
Cons
• Requires DHCP relay on devices.
• When testing, this setup requires extra non-test gear within the topology (dedicated servers).
• DHCP is multi-homed.
• More complex DHCP server configuration.
The test topology used this arrangement for PoAP. The pros of inband outweigh the cons, and it scales much better than a dedicated L2 network. With software automation, the complexity of the DHCP server configuration is easily managed.
Infrastructure
PoAP requires supporting services, such as DHCP, TFTP, FTP/SFTP, and HTTP to properly function.
These are discussed below.
DHCP Server
PoAP requires DHCP Offer to contain:
1. IP
2. Subnet
3. routers option
4. domain-name-server option
5. next-server
6. tftp-server-name option
7. bootfile-name option
8. lease time of 1 hour or greater
If PoAP does not get an offer with adequate information, init.log will show:
poap_dhcp_select_interface_config: No interface with required config
poap_dhcp_intf_ac_action_config_interface_select: Failed in the interface selection to
send DHCPREQUEST for interface 1a013000
isc-dhcpd Configuration
Split the config into Subnet and Host portions (a combined sketch follows this list).
• Subnets
– Switch could DHCP from any interface. Need a subnet entry for every network where DHCP
Discover could originate. For inband, that is every point-to-point link where dhcp-relay is
configured.
– IP/Subnet/Router unique for each subnet.
– Use ‘group’ to specify same next-server, tftp-server, domain-name-server for all subnets.
• Hosts
– Host entries need to map Serial Number (prepended with \0) to device hostname.
host msdc-leaf-r4 {
option dhcp-client-identifier "\000FOC1546R0SL";
option host-name "msdc-leaf-r4";
}
– Use ‘group’ to specify the same filename/bootfile-name for hosts that will use the same PoAP
script (a generation sketch follows this list).
– Grouping can be based on platform, network role, testbed, etc.
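At MSDC scale, maintaining one host stanza per switch by hand quickly becomes tedious, which is why
automated generation of the DHCP configuration was called out earlier. The following is a minimal sketch
of how the host portion could be generated from a serial-number-to-hostname inventory; the CSV file name,
its format, and the single-group layout are illustrative assumptions rather than part of the tested setup.
#!/usr/bin/env python
# Sketch: generate isc-dhcpd host stanzas for PoAP from an inventory file.
# Assumption: "switch_inventory.csv" contains lines like
#   FOC1546R0SL,msdc-leaf-r4
# and all listed hosts share the same PoAP script.
import csv
import sys

TEMPLATE = """host %(hostname)s {
    option dhcp-client-identifier "\\000%(serial)s";
    option host-name "%(hostname)s";
}
"""

def main(inventory_file, script_name):
    stanzas = []
    with open(inventory_file) as f:
        for serial, hostname in csv.reader(f):
            stanzas.append(TEMPLATE % {"serial": serial, "hostname": hostname})
    # Wrap all hosts that share the same PoAP script in a single 'group'.
    print 'group {'
    print '    filename "%s";' % script_name
    print "".join(stanzas)
    print '}'

if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2])
The generated stanzas follow the host/group format shown above and can be kept under revision control
alongside the per-device configuration files.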
TFTP/FTP/SFTP/HTTP Server
• The PoAP process on the switch downloads the PoAP script via TFTP or HTTP. Most TFTP servers chroot,
so only the filename, not the path, is required. For HTTP, configure the DHCP option tftp-server-name to be
“http://servername.domain.com”.
• The PoAP script then downloads the image and configuration via TFTP, FTP, SFTP, or SCP.
– The script needs login credentials and the full path to the files.
• Host-specific config files are named directly or indirectly.2
– Identified directly by hostname when using os.environ['POAP_HOST_NAME']
– Best Practice: MAC or S/N mapped to hostname in DHCP config
– Identified indirectly by serial number, MAC address, or CDP neighbor (see the sketch below).
2. As of this writing hostname is only available in Caymen+ (U4.1) and GoldCoast Maintenance.
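To tie the naming conventions together, the following is a minimal sketch of how a PoAP script might pick
the per-device configuration file and verify the download against its .md5 sidecar, as seen in the demo
transfer log below. The hostname-based filename is an assumption; only the POAP_HOST_NAME environment
variable and the conf_&lt;serial&gt;.cfg/.md5 naming are taken from this chapter.
# Sketch: per-device config selection and integrity check in a PoAP script.
import hashlib
import os

def config_filename(serial_number):
    # Direct identification: hostname handed to the script by the platform.
    hostname = os.environ.get('POAP_HOST_NAME')
    if hostname:
        return "conf_%s.cfg" % hostname      # assumption: hostname-keyed files
    # Indirect identification: fall back to the chassis serial number,
    # matching the demo's conf_FOC....cfg naming.
    return "conf_%s.cfg" % serial_number

def verify_md5(cfg_path, md5_path):
    # Compare the md5 of the downloaded config against its .md5 sidecar file.
    with open(cfg_path, 'rb') as f:
        actual = hashlib.md5(f.read()).hexdigest()
    with open(md5_path) as f:
        expected = f.read().split()[0]
    return actual == expected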
Demo
The following collection of logfiles demonstrates a successful PoAP event.
• leaf-r13
2012 Jun 4 19:53:22 %$ VDC-1 %$ %NOHMS-2-NOHMS_DIAG_ERR_PS_FAIL: System minor alarm
on power supply 1: failed
Starting Power On Auto Provisioning...
2012 Jun 4 19:54:17 %$ VDC-1 %$ %VDC_MGR-2-VDC_ONLINE: vdc 1 has come online
2012 Jun 4 19:54:17 switch %$ VDC-1 %$ %POAP-2-POAP_INITED: POAP process initialized
Done
Abort Power On Auto Provisioning and continue with normal setup ?(yes/no)[n]:
2012 Jun 4 19:54:37 switch %$ VDC-1 %$ %POAP-2-POAP_DHCP_DISCOVER_START: POAP DHCP
Discover phase started
2012 Jun 4 19:54:37 switch %$ VDC-1 %$ %POAP-2-POAP_INFO: Abort Power On Auto
Provisioning and continue with normal setup ?(yes/no)[n]:
• DHCP Server and Script Output. The first reboot happens at 19:55. Then configuration requiring a
reboot is applied (system URPF, hardware profile, etc.). The second reboot happens at 19:58:
Jun 4 10:54:19 milliways-cobbler dhcpd: DHCPDISCOVER from 54:7f:ee:34:10:c1 via
10.3.1.32
Jun 4 10:54:19 milliways-cobbler dhcpd: DHCPDISCOVER from 54:7f:ee:34:10:c1 via
10.2.1.32
Jun 4 10:54:19 milliways-cobbler dhcpd: DHCPDISCOVER from 54:7f:ee:34:10:c1 via
10.4.1.32
Jun 4 10:54:19 milliways-cobbler dhcpd: DHCPDISCOVER from 54:7f:ee:34:10:c1 via
10.1.1.32
Jun 4 10:54:20 milliways-cobbler dhcpd: DHCPOFFER on 10.3.1.33 to 54:7f:ee:34:10:c1
via 10.3.1.32
Jun 4 10:54:20 milliways-cobbler dhcpd: DHCPOFFER on 10.2.1.33 to 54:7f:ee:34:10:c1
via 10.2.1.32
Jun 4 10:54:20 milliways-cobbler dhcpd: DHCPOFFER on 10.4.1.33 to 54:7f:ee:34:10:c1
via 10.4.1.32
Jun 4 10:54:20 milliways-cobbler dhcpd: DHCPOFFER on 10.1.1.33 to 54:7f:ee:34:10:c1
via 10.1.1.32
Jun 4 10:54:34 milliways-cobbler dhcpd: DHCPREQUEST for 10.3.1.33 (10.128.3.132) from
54:7f:ee:34:10:c1 via 10.3.1.32
Jun 4 10:54:34 milliways-cobbler dhcpd: DHCPACK on 10.3.1.33 to 54:7f:ee:34:10:c1 via
10.3.1.32
2012 Jun 4 19:54:53 switch %$ VDC-1 %$ %POAP-2-POAP_INFO: Using DHCP, information
received over Eth1/19 from 10.128.3.132
2012 Jun 4 19:54:53 switch %$ VDC-1 %$ %POAP-2-POAP_INFO: Assigned IP address:
10.3.1.33
2012 Jun 4 19:54:53 switch %$ VDC-1 %$ %POAP-2-POAP_INFO: Netmask: 255.255.255.254
2012 Jun 4 19:54:53 switch %$ VDC-1 %$ %POAP-2-POAP_INFO: DNS Server: 10.128.3.136
2012 Jun 4 19:54:53 switch %$ VDC-1 %$ %POAP-2-POAP_INFO: Default Gateway: 10.3.1.32
2012 Jun 4 19:54:53 switch %$ VDC-1 %$ %POAP-2-POAP_INFO: Script Server: 10.128.3.132
$ head -n 1 poap_script.py
#md5sum="b9b180bd70baee9fabb7a253d59e909a"
Mon Jun 4 10:54:50 2012 1 10.3.1.33 886 /var/lib/tftpboot/conf_FOC1539R06D.cfg b _ o
r administrator ftp 0 * c
Mon Jun 4 10:54:51 2012 1 10.3.1.33 0 /var/lib/tftpboot/conf_FOC1539R06D.cfg.md5 b _
o r administrator ftp 0 * i
Mon Jun 4 10:54:53 2012 1 10.3.1.33 3060 /var/lib/tftpboot/conf_mgmt_milliways.cfg b
_ o r administrator ftp 0 * c
Mon Jun 4 10:54:55 2012 1 10.3.1.33 0 /var/lib/tftpboot/conf_mgmt_milliways.cfg.md5 b
_ o r administrator ftp 0 * i
Mon Jun 4 10:54:56 2012 1 10.3.1.33 632 /var/lib/tftpboot/conf_proto_ospf.cfg b _ o r
administrator ftp 0 * c
Mon Jun 4 10:54:58 2012 1 10.3.1.33 0 /var/lib/tftpboot/conf_proto_ospf.cfg.md5 b _ o
r administrator ftp 0 * i
PoAP Considerations
The following PoAP considerations should be kept in mind.
• No “default” config using PoAP
– If no admin user is configured during PoAP - you’ll lock yourself out of the box.
– No CoPP policy applied to box by default – you must have it in your config.
– Any IP address received via DHCP during PoAP is discarded when PoAP is complete.
• DHCP Relay issues on N7k
– CSCtx88353 – DHCP Relay; Boot Reply packet not forwarded over L3 interface
– CSCtw55298 – With broadcast flag set, dhcp floods resp pkt with dmac=ch_addr
• System configuration after aborted PoAP
– If PoAP was initiated because of ‘write erase’, the config will be blank.
– If PoAP was initiated by ‘boot poap enable’, the config will be in an unknown state; there is no
fall-back to the previous config.
• Ensure you have enough free space on bootflash for script logs, downloaded images, and
downloaded configs.
Churn
Figure 2-3 is used to describe the day in the life of a packet and how it relates to various routing events
and actions.
Figure 2-3 Day in the Life of a Packet Through Routing and Processing Subsystems
Several terms are used to describe a routing protocol failure: meltdown, cascading failures, etc. The
underlying problem in each of these is that the network reaches the point where the protocol can no longer
keep up. It is so far backed up processing and sending updates that it becomes the cause of problems instead
of routing packets around them. From an application point of view, this manifests as communication
failures between endpoints. But how can one tell, from the router's point of view, that this is occurring?
Every routing protocol does three basic things: receive updates, compute new route tables based on those
updates, and send out new updates. The most obvious item to check is CPU utilization. If the CPU is pegged
at 100% computing new route tables, then the limit has obviously been reached. There are, however,
other potential breakpoints: from when new updates are taken off the wire, to when those updates are
processed by the routing protocol, to when the new RIB and FIB are generated and pushed to hardware, to
when new updates are sent out.
CoPP
Control Plane Policing (CoPP) protects the supervisor from becoming overwhelmed by DDoS-type
attacks using hardware rate-limiters. The CoPP configuration is user customizable. The default N7k
CoPP policy puts all routing protocol packets into the copp-system-p-class-critical class. By default this
class is given the strict policy, a one-rate, two-color policer with a Bc value of 250ms. The default N3k
CoPP policy divides the routing protocol packets into several classes based on each protocol. Should a
routing protocol exceed the configured rates, packets will be dropped. Dropped Hellos can lead to the entire
neighbor session being dropped. Dropped updates/LSAs can lead to increased load due to
retransmissions or inconsistent routing state.
CoPP Commands
On the N7k the show policy-map interface control-plane class copp-system-p-class-critical command
displays counters for the default CoPP class regulating routing protocol traffic. A violated counter that is
continuously incrementing indicates the network churn rate is approaching meltdown.
msdc-spine-r9# show pol int cont class copp-system-p-class-critical | begin mod
module 3 :
conformed 14022805664 bytes; action: transmit
violated 0 bytes; action: drop
module 4 :
conformed 8705316310 bytes; action: transmit
violated 0 bytes; action: drop
On the N3k, the show policy-map interface control-plane command displays counters for all CoPP
classes. A routing protocol class DropPackets counter that is continuously incrementing indicates the
network churn rate is approaching meltdown.
msdc-leaf-r21# show policy-map interface control-plane | begin copp-s-igmp
class-map copp-s-igmp (match-any)
match access-grp name copp-system-acl-igmp
police pps 400
OutPackets 0
DropPackets 0
class-map copp-s-eigrp (match-any)
match access-grp name copp-system-acl-eigrp
match access-grp name copp-system-acl-eigrp6
police pps 200
OutPackets 0
DropPackets 0
class-map copp-s-pimreg (match-any)
match access-grp name copp-system-acl-pimreg
police pps 200
OutPackets 0
DropPackets 0
class-map copp-s-pimautorp (match-any)
police pps 200
OutPackets 0
DropPackets 0
class-map copp-s-routingProto2 (match-any)
match access-grp name copp-system-acl-routingproto2
police pps 1300
OutPackets 0
DropPackets 0
class-map copp-s-v6routingProto2 (match-any)
match access-grp name copp-system-acl-v6routingProto2
police pps 1300
OutPackets 0
DropPackets 0
class-map copp-s-routingProto1 (match-any)
match access-grp name copp-system-acl-routingproto1
match access-grp name copp-system-acl-v6routingproto1
police pps 1000
OutPackets 1208350
DropPackets 0
class-map copp-s-arp (match-any)
police pps 200
OutPackets 9619
DropPackets 0
class-map copp-s-ptp (match-any)
police pps 1000
OutPackets 0
DropPackets 0
class-map copp-s-bfd (match-any)
police pps 350
OutPackets 24226457
DropPackets 0
<snip>
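Counters like these can also be polled programmatically. The sketch below, in the spirit of the on-switch
monitoring scripts in Appendix B, walks the N3K output above and reports any class whose DropPackets
value has increased; the use of the same CLI() helper as buffer_check.py, the one-minute polling interval,
and the print-only reporting are illustrative assumptions.
# Sketch: poll N3K CoPP classes and report increasing DropPackets counters.
import re
import time
from cisco import CLI   # same on-switch helper used by buffer_check.py

last = {}

def poll_copp_drops():
    out = CLI('show policy-map interface control-plane').get_raw_output()
    current_class = None
    for line in out.splitlines():
        m = re.search(r'class-map (\S+)', line)
        if m:
            current_class = m.group(1)
        m = re.search(r'DropPackets\s+(\d+)', line)
        if m and current_class:
            drops = int(m.group(1))
            if drops > last.get(current_class, 0):
                print "CoPP drops increasing in %s: %d" % (current_class, drops)
            last[current_class] = drops

while True:
    poll_copp_drops()
    time.sleep(60)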
On the N7k, the inband rate limit for Sup1 is 32kpps, while the limit for Sup2 is 64kpps. The show
hardware internal cpu-mac inband stats command gives a vast array of statistics regarding the inband
interface, specifically statistics about throttling. Seeing the Rate limit reached counter incrementing
indicates the network churn rate is approaching meltdown.
msdc-spine-r1# show hard int cpu-mac inband stats | be Throttle | head
Throttle statistics
-----------------------------+---------
Throttle interval ........... 2 * 100ms
Packet rate limit ........... 32000 pps
Rate limit reached counter .. 0
Tick counter ................ 2217856
Active ...................... 0
Rx packet rate (current/max) 261 / 3920 pps
Tx packet rate (current/max) 618 / 4253 pps
Netstack
Netstack is the set of NX-OS processes that implement all protocol stacks required to send and receive
control plane packets. Routing protocols register with the IP Process to receive their Hello and Update
packets. MTS is used to pass these updates between IP Process and routing protocols. When routing
protocols are too busy processing previous messages or doing route recalculations to receive these
messages, they can be dropped. Dropped Hellos can lead to the entire neighbor session being dropped.
Dropped updates/LSAs can lead to increased load due to retransmissions or inconsistent routing state. Each
routing protocol registers as a client of IP process to receive these messages. Statistics are available on
a per-client basis.
The show ip client command lists all the processes that have registered to receive IP packets. Seeing the
failed data messages counter incrementing is an indication that the network churn rate is approaching
meltdown.
msdc-spine-r9# show ip client ospf
CPU Utilization
Once the update has reached its final destination, the routing protocol requires compute time on the
supervisor to run its SPF or best-path algorithms. The more frequently the network converges, the more
load is put on the CPU. However, each platform has a different type of CPU, so load will differ across
platforms. The location of the device in the network also has an impact (routers in an OSPF totally stubby
area are insulated from churn in other areas). Thus CPU utilization is one metric to examine carefully, but
all devices must be monitored until it is determined which platform/role combinations will be the
high-water marks. If the network melts before any device has pegged its CPU, then one of the other
breakpoints is being reached first. Useful commands for checking CPU and memory utilization are:
• show process cpu sort
• show process cpu hist
• show system resources module all
# = average CPU%
1 1 11
777877697797678967989767785988798980787586978798098788009679
166077546715148676827549868699342800060935474641066850000773
100 * * * * * * ** **
90 * * ** * * * *** * * * * * * *** * *** *
80 ***** ***** ** ***** *** *** ***** * * *************** **
70 ************ ********* *** ************ ********************
60 ************************************************************
50 ************************************************************
40 **#****#**********#******#*#*****#******#*#*****#*****##***#
30 **##*#*##*#***#***#*##***#*###***#*###**#*###***#*#***##**##
20 ###############*############################################
10 ############################################################
0....5....1....1....2....2....3....3....4....4....5....5....
0 5 0 5 0 5 0 5 0 5
-----------------------------------------------------------
Processor memory: Module Total(KB) Free(KB) % Used
-----------------------------------------------------------
1 2075900 1339944 35
2 2075900 1340236 35
3 2075900 1333976 35
4 2075900 1339780 35
5 2075900 1341112 35
6 2075900 1344648 35
7 2075900 1344492 35
8 2075900 1344312 35
10 8251592 6133856 25
11 2075900 1344604 35
12 2075900 1344904 35
13 2075900 1344496 35
14 2075900 1344496 35
15 2075900 1344808 35
16 2075900 1344416 35
17 2075900 1344536 35
msdc-spine-r1#
URIB
When there is a lot of network instability, urib-redist can run out of shared memory while waiting for acks
caused by routing changes. urib-redist uses 1/8 of the memory allocated to urib, which can be increased
by modifying the limit for 'limit-resource u4route-mem' (urib).
The data below shows urib-redist with 12292 KB allocated, which is 1/8 of urib (98308 KB):
n7k# show processes memory shared
Component Shared Memory Size Used Available Ref
Address (kbytes) (kbytes) (kbytes) Count
smm 0X50000000 1028 4 1024 41
cli 0X50101000 40964* 25151 15813 12
npacl 0X52902000 68 2 66 2
u6rib-ufdm 0X52913000 324* 188 136 2
u6rib 0X52964000 2048+ (24580) 551 1497 11
urib 0X54165000 7168+ (98308) 5161 2007 22
u6rib-notify 0X5A166000 3076* 795 2281 11
urib-redist 0X5A467000 12292* 11754 538 22
urib-ufdm 0X5B068000 2052* 0 2052 2
Protocols often express interest in notifications whenever there is a change in the status of their own
routes or the routes of others (redistribution). Previously, no flow control existed in this notification
mechanism; that is, urib kept sending notifications to protocols without checking whether the protocol was
able to process them. These notifications use shared memory buffers, so shared memory could be
exhausted. As part of this feature, urib now allows only a fixed number of unacknowledged buffers; until
those buffers are acknowledged, additional notifications are not sent.
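The same approach can be applied to the shared-memory table above. Below is a minimal sketch that flags
urib-redist when its available shared memory drops below a threshold; the 10% threshold is arbitrary, and
the sketch assumes the on-switch CLI() helper from Appendix B (or an equivalent off-box collection
method, such as the Paramiko code in Appendix A) is available on the platform.
# Sketch: warn when urib-redist shared memory is close to exhaustion.
import re
from cisco import CLI

def check_urib_redist(threshold_pct=10):
    out = CLI('show processes memory shared').get_raw_output()
    for line in out.splitlines():
        # Table columns: Component, Address, Size(kB), Used(kB), Available(kB), Ref
        m = re.match(r'\s*urib-redist\s+\S+\s+(\d+)\*?\s+(\d+)\s+(\d+)', line)
        if m:
            size, used, avail = (int(x) for x in m.groups())
            if avail * 100 < size * threshold_pct:
                print "urib-redist low on shared memory: %d of %d kB free" % (
                    avail, size)

check_urib_redist()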
EOBC
Once a new FIB has been generated from the RIB, updates are sent to the forwarding engine on each
linecard via the Ethernet Out of Band Channel (EOBC) interface on the supervisor. Many other internal
system processes utilize the EOBC as well. As the level of network churn increases, the number of FIB
updates is expected to increase, and with it the RX and TX utilization on the EOBC interface. Should this
interface become overwhelmed, throttling will occur and packets will be dropped. This delays the
programming of new entries into the forwarding engine, causing packet misrouting and increased
convergence times.
EOBC Commands
On the N7k, the EOBC rate limit for SUP1 is 16kpps, while the limit for SUP2 is significantly higher.
The show hardware internal cpu-mac eobc stats command gives a vast array of statistics regarding the
EOBC interface; the statistics about throttling are the ones of interest here. Seeing the Rate limit reached
counter incrementing indicates the network churn rate is approaching meltdown.
msdc-spine-r8# show hard int cpu-mac eobc stats | be Throttle | head
Throttle statistics
-----------------------------+---------
Throttle interval ........... 3 * 100ms
Packet rate limit ........... 16000 pps
Rate limit reached counter .. 0
Tick counter ................ 6661123
Active ...................... 0
Rx packet rate (current/max) 30 / 6691 pps
Tx packet rate (current/max) 28 / 7581 pps
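Because both the inband and EOBC throttle statistics share the format shown above, a single routine can
watch either interface. Below is a minimal sketch that extracts the 'Rate limit reached counter'; treating
any increase between two polls as a warning, and the availability of the on-switch CLI() helper, are
illustrative assumptions.
# Sketch: extract 'Rate limit reached counter' from the throttle statistics.
import re
from cisco import CLI

def rate_limit_reached(interface='eobc'):
    # interface is 'inband' or 'eobc'; both commands appear in this chapter.
    out = CLI('show hardware internal cpu-mac %s stats' % interface).get_raw_output()
    m = re.search(r'Rate limit reached counter\s*\.+\s*(\d+)', out)
    return int(m.group(1)) if m else None

prev = rate_limit_reached('eobc')
# ... poll again after some interval ...
curr = rate_limit_reached('eobc')
if prev is not None and curr is not None and curr > prev:
    print "EOBC throttling observed: counter went from %d to %d" % (prev, curr)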
OSPF
Open Shortest Path First (OSPF) testing focused on control plane scale at a real MSDC customer
network, herein referred to as ACME_1. ACME_1 has an OSPF network that runs at a higher scale
than Cisco originally published as supported for the N7K platform, and it is growing at a rapid pace.
This testing verifies the Nexus 7000's capability to handle ACME_1's specific scenario.
This version of ACME_1 testing includes the following primary technology areas:
• OSPF Scale
• Unicast Traffic
• ECMP
DDTS caveats discovered and/or encountered in this initial testing effort are identified in the “Defects
Encountered” section of the external test results document.7
All routing protocols are susceptible to scale limitations in the number of routes in the table and the
number of peers to which they are connected. Link-state protocols like OSPF are also susceptible to
limitations in the number of routers and links within each area. The ACME_1 topology pushes all of these
limits, as is typical of most MSDC customers.
Summary of Results
OSPF testing results demonstrated that the network remains stable up to 30k LSAs, and can scale to 60k
LSAs if BFD is enabled. OSPF and OSPF with BFD enabled showed some instability in a few instances
with steady-state flaps and LSA propagation delays; however, both those issues are addressed in
NX-OS 6.2.
BGP
Another MSDC customer, ACME_2, was selected to examine alternative BGP arrangements for
increasing scale of an MSDC without compromising convergence. Both resiliency and reliability were
also top concerns needing attention, and are discussed below. The test topology was not a
straightforward three-stage Clos, but rather closer to a “reduced” five-stage Clos with multiple Spine
“networks”; nevertheless, the same high-level topological principles apply (Figure 2-12). It was run
within the test topology.
The system was composed of 3 physical Podsets8: Podsets 1, 2, and 3. Each Podset consisted of 4 Nexus
3064 Leaf nodes and a mixture of Nexus 3064/3048 ToRs. Podset 1 had over a dozen ToRs, while Podsets
2 and 3 had 3 ToRs each. IXIA IXNetwork was used to bring the total number of real and simulated ToRs
to 17 for each Podset. Route-maps were configured on each ToR to advertise four /24 directly connected
prefixes. A 300x VM Hadoop cluster was also connected to Podset 1 (also used for TCP incast and buffer
utilization testing). Each VM connected to the ToR via a /30 connected subnet, configured through
DHCP.
Note /30 masks were used to provide location awareness for Hadoop nodes.
Based on the DHCP forwarding address, backend servers map requests to specific racks and positions
within the rack. Inband management was used for the Hadoop cluster; out-of-band was used for network
devices. Each Leaf node connected to a single Spine. Depending on the Leaf node there were either two
or three parallel connections to the Spine layer (ACME_2 requirement). IXNetwork was used to simulate
up to 32 BGP spine sessions for each Leaf node.
Scaling was done to 140 Podsets at the Spine layer using combinations of real and simulated
equipment. Each Spine node connected to three non-simulated Leaf nodes, and the remaining 137 nodes
were simulated using IXIA. All Leafs advertised 68 /24 IPv4 prefixes to each Spine node, and
each Spine node received over 9000 BGP prefixes, in total, from the Leaf layer.
8. A Podset is comprised of hundreds of servers. ToRs for each rack were N3064s. Podsets connect to
an infrastructure based on the three-stage Clos topology. For the purposes of testing, a smaller-scale version
of what the customer has in production was used.
With the exception of the programmable BGP Speakers (pBS), BFD was enabled across the topology for
each BGP session. BFD is enabled for all ToR <-> Leaf, Leaf <-> Spine, and Spine <-> Border
connections.
pBSes were simulated using IXIA. Each Spine and Leaf node peered with a pBS. There were 32 BGP
sessions with the pBS, per device, broken down into two groups, with each group consisting of sixteen
BGP sessions. All 32 BGP sessions advertised hundreds of /32 VIPs used for service load balancing to
the servers. For all VIPs advertised, Group 1 advertised prefixes with MED 100 while Group 2 advertised
MED 200. Each VIP had 16 equal-cost paths in the route table; NH reachability for all VIPs pointed to the
physical IP address of the load balancer(s).
To reach the final goal of 16,000 IPv4 prefixes, IXIA injected 4700 prefixes at the Border Leaf layer.
The Nexus 3000 limits the route table to 8K entries in hardware if uRPF is enabled (the default). To get to
the target of 16K routes, uRPF had to be disabled on Leaf and ToR nodes.
Two types of traffic were used in testing:
1. Background server-to-server traffic
a. Podset 2 <-> Podset 1
b. Podset 3 <-> Podset 1
c. Podset 3 <-> Podset 2
2. VIP traffic from servers to loadbalancers
a. Podset 2 -> VIP
b. Podset 1 -> VIP
c. Podset 3 -> VIP
With the entire system configured as outlined above, these were the 3 major test sets executed:
1. Baseline tests
2. Route Convergence
3. Multi-Factor Reliability
Note Test sets are defined as a broad characterization of individual tests; in other words, Test set 1 had 17
individual tests (BGP steady state with and without churn, BGP soft clearing, Link Flapping, ECMP path
addition and reduction, etc), Test set 2 had 7, Test set 3 had 6.
Summary of Results
All platforms must be considered when examining routing scale limits. For the N7K,9 two session limits
exist when running BGP with and without BFD: BFD is limited to 200 sessions per module and 1000
sessions per system, and BGP supports 1000 neighbors per system. Limits for the N3K were lower than
for the N7K.
Observations
• Peering at both Spine and Leaf provides greater granularity of available hardware load balancing.
However, peering at the Spine requires customizing route-maps to change the next-hop, which is less
scalable.
4. SDU validated these numbers in testing:
9. http://www.cisco.com/en/US/docs/switches/datacenter/sw/verified_scalability/b_Cisco_Nexus_7000_Series_
NX-OS_Verified_Scalability_Guide.html#concept_2CDBB777A06146FA934560D7CDA37525
Features that address the FIB issues encountered above are available in IOS-XR and would benefit
NX-OS development.
8. The FIB and MAC tables are not coupled. The recommendation is to configure identical aging timers to
maintain synchronization; options are to either increase MAC aging or decrease ARP aging. This
primarily applies to unidirectional flows.
9. If BFD is implemented in the network, BFD echo packets need to be assigned to a priority queue to
ensure network stability under load.
10. URPF must be disabled to support 16K routes in hardware on the N3K.
11. To work around an ECMP polarization issue, hashing algorithms must be different between ToR and
Leaf layers. A new CLI command was created to configure different hash offsets to avoid the ECMP
polarization.
Refer to subsequent testing documentation for complete details about ACME_2 testing.
BFD
Bidirectional Forwarding Detection (BFD), a fast failure detection technology, was found to allow for
relaxed routing protocol timers. This in turn creates room for scaling routing protocols.
Summary of Results
BFD testing occurred between test instrumentation hardware and the Spine. 384 sessions were validated
at the spine with both BGP and OSPF. A 500ms interval was configured based on overall system
considerations for other LC specific processes.
Servers
Servers are distributed throughout the fabric with 10G connectivity. Refer to Server and Network
Specifications, page A-1 for server specifications, configurations, and Hadoop applications details.
Intel recommends the following based on real world applications:
http://www.intel.com/content/dam/doc/application-note/82575-82576-82598-82599-ethernet-controller
s-interrupts-appl-note.pdf
Note File transfer buffering behaviors were observed – kernel controls how frequently data is dumped from
cache; with default kernel settings, the kernel wasn’t committing all memory available, thus there was a
difference between committed memory vs. what it’s able to burst up to. As a result, VMs that hadn’t
committed everything behaved worse than those that did. To keep all experiments consistent, all VMs
were configured to have all memory 100% “committed”.
TCP receive buffers were configured at 32MB. They were set this high because the goal was to remove
the receive window size as a potential limitation on throughput and to rely completely on CWND. This is not
realistic for a production deployment, but it made tracking key dependencies easier. Refer to Incast
Utility Scripts, IXIA Config, page E-1 for relevant sysctl.conf items.
The formula for the TCP receive window, in terms of the Linux tcp_adv_win_scale parameter, is:
receive window = receive buffer x (1 - 1 / 2^tcp_adv_win_scale)
Based on the formula, with the default tcp_adv_win_scale of 2, 75% of the buffer size is available for the
TCP receive window (roughly 25MB here, advertised with a window scale factor of 10). This value is never
reached because CWND is always the limiting factor.
Note Regarding window size: as of Linux kernel 2.6.19 and above, CUBIC is the default implementation for
congestion control.
• IP forward disabled:
• Misc settings:
[root@r09-p02-vm01 ipv4]# more tcp_congestion_control
Cubic
[root@r09-p02-vm01 ipv4]# more tcp_reordering
3
Note This link outlines additional issues to be aware of when hot plugging vcpu:
https://bugzilla.redhat.com/show_bug.cgi?id=788562
To manage failures and their impact to Incast events, two scripts were written to track the status of a job:
“fail-mapper.sh” and “find-reducer.sh”. fail-mapper.sh reloads 15% of the VMs immediately before the
reduce phase, and find-reducer.sh launches tcpdump on the reducer. Tcpdump output was used to analyze
TCP windowing behavior during Incast events.
The following logic was implemented in fail-mapper.sh (a simplified sketch follows this list):
1. User inputs two job ids (example 0051, 0052)
2. Query each map task and generate a unique list of VMs responsible for each job. There will be two
lists generated, one per job.
3. Compare the two lists and generate a third list by suppressing the common VMs.
4. Query the job status; once the map tasks reach 100% completion (96% for the cascading-failure test),
reload 15% of the VMs from list #3.
Find-reducer.sh determines the location of the reducer and launches tcpdump.
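The full scripts are listed in Incast Utility Scripts, IXIA Config, page E-1; the fragment below is a
simplified Python rendering of steps 2 through 4 above. It assumes the per-job VM lists have already been
written to text files, one hostname per line, and that a passwordless 'ssh <vm> reboot' is an acceptable
stand-in for the reload.
# Simplified sketch of the fail-mapper logic (steps 2-4 above).
import random
import subprocess

def load_vms(path):
    # One VM hostname per line (assumed to be produced by the map-task query).
    with open(path) as f:
        return set(line.strip() for line in f if line.strip())

job_a = load_vms('job_a_vms.txt')   # e.g., job 0051
job_b = load_vms('job_b_vms.txt')   # e.g., job 0052 (the job to be impacted)

# Step 3: suppress VMs common to both jobs so only one job is affected.
candidates = sorted(job_b - job_a)

# Step 4: once the map phase reports (near) completion, reload 15% of them.
if candidates:
    to_fail = random.sample(candidates, max(1, int(len(candidates) * 0.15)))
    for vm in to_fail:
        subprocess.call(['ssh', vm, 'reboot'])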
Topology
Figure 2-13 shows a standard 3-stage folded Clos topology, with 8 Spines and 16 Leafs.
Note Physical servers are arranged in logical racks, numbered “r01-r16”. Even though a physical server spans
two logical racks, it is the physical NICs (and the VMs mapped to them) that are actually assigned to a
logical rack. For example, the first server shown in the top-leftmost position has NIC_1 which is “in”
rack r01 and NIC_2 in r02.
Initially, there was noise traffic sent to exhaust both “bandwidth” and “buffer utilization”, but it was
determined exercising buffers was sufficient, along with Hadoop traffic, to create Incast events. For
completeness, the “bandwidth utilization” noise floor traffic method is described in Bandwidth
Utilization Noise Floor Traffic Generation, page F-1.
The border devices represent “external” networks and are injecting a default route, effectively acting as
a sensor for spurious traffic.
Buffer Utilization
Figure 2-14 shows an IXIA shared buffer setup.
The IXIA is connected to each Leaf indirectly, and using a series of oscillating traffic bursts, in
conjunction with the bandwidth “noise” traffic above, both dedicated and shared buffers on the Leafs are
consumed at will (oscillating traffic is needed because the IXIA wasn’t able to consistently consume
N3K buffers with steady-stream traffic). The purposes of sending traffic through the border leaf and to
the Spines are two-fold:
1. IXIA didn’t have enough 10G ports to connect to every Leaf.
2. Sending traffic via ECMP towards the Spine, and then from the Spine down to the Leafs, simulates real
traffic flow, albeit uni-directional (IXIA is both the source and sink).
In detail, this is how the IXIA is configured for shared buffer impairment traffic:
2x 10G interfaces, in total, are used to Send (Ix3/7) and Recv (Ix3/8) uni-directional UDP traffic. The
source traffic comes into an N5K fanout switch (this switch held other experiments to the border, so it
was left intact – technically, the IXIA could be connected directly to the border leaf, achieving the same
result) to Border leaf-r1 (msdc-leaf-r17), which connects to Spines r1 – r8.
• Refer to the following example for Leaf dest IP 10.128.4.131:
msdc-leaf-r17# show ip route 10.128.4.131
IP Route Table for VRF "default"
'*' denotes best ucast next-hop
• Traffic is sourced from the same IP (10.128.128.151), but there are 3 unique dest IP’s for each leaf
(msdc-leaf-r1-16), Vlans 11-13:
msdc-leaf-r1# show ip int brief
IP Interface Status for VRF "default"(1)
Interface IP Address Interface Status
Vlan11 10.128.4.129 protocol-up/link-up/admin-up
Vlan12 10.128.5.1 protocol-up/link-up/admin-up
Vlan13 10.128.6.1 protocol-up/link-up/admin-up
• All Leaf switches have 3x 100Mb links connected to an N3K fan-in switch, which connects to IXIA
(Ix3/8):
msdc-leaf-r1# show cdp neighbors
Capability Codes: R - Router, T - Trans-Bridge, B - Source-Route-Bridge
S - Switch, H - Host, I - IGMP, r - Repeater,
V - VoIP-Phone, D - Remotely-Managed-Device,
s - Supports-STP-Dispute
Shared_Buffer_Xtra (Figure 2-16) has the same 48 endpoints and traffic profile except that it sends
traffic at ~ 800Kb.
This exceeds the interface throughput when combined with the first profile and starts to consume shared
buffers. To achieve a shared buffer impairment without running out of buffers an IXIA script is used to
stop and start the Xtra traffic stream, while the Shared_Buffer stream runs continuously (Figure 2-17).
The timing of the script first loads up the shared buffers to ~8.5k for each of the 3 interfaces and then
switches to a pattern where it alternates between bleeding off and increasing the buffer usage. This
allows for a majority of the shared buffers to be used without exceeding the limit and dropping packets.
The process forms a saw tooth pattern of usage shown in Figure 2-18.
Buffer Allocation
Because the primary objective in these tests is to observe buffer behavior on the N3K Leaf layer, it must
be ensured that dedicated buffers are consumed and shared buffer space is being exercised.
Figure 2-19 shows the overall schema of shared vs dedicated buffers on the N3K
This means the noise floor will consume all 128 dedicated buffers per port and has the capability of
leeching into shared space, at will. With this control, Incast traffic can be pushed over the tipping point
of consuming the remainder of available buffer space, i.e. – shared buffers, thus causing an Incast event.
Table 2-2 shows how buffers are allocated system-wide.
Note There is a defined admission control related to when shared buffer space is consumed by each port.
Admission control criteria are:
1. Queue Reserved space available
2. Queue dynamic limit not exceeded
3. Shared Buffer Space available
The N3064-E imposes dynamic limits on a per-queue basis for each port. The dynamic limit is controlled by
the alpha parameter, which is set to 2. In dynamic mode, the buffers allocated to an interface cannot exceed
the value given by this formula:
dynamic limit = alpha x (free shared buffer cells remaining) = 2 x (free shared buffer cells remaining)
See the N3K datasheets for a more detailed treatment of buffer admission control.
Monitoring
Standard Hadoop, Nagios, Graphite and Ganglia tools were used to monitor all VMs involved. Custom
Python scripts, running on the native N3K Python interpreter, were created to monitor shared buffer
usage.
Incast Event
Figure 2-20 shows a logical representation of the Incast event created.
Note Actual locations of M or R VMs are determined by the Hadoop system when a job is created, thus the
monitoring scripts must first query for the locations before executing their code.
For the first example (Figure 2-21, Figure 2-22), two Hadoop jobs were executed: _0026 and _0027. Job
26 was tracked, and when the Map phase reached 96% of completion a script would kill 15% of the Map
nodes only used in job 27. This would force failures on that particular job and cause block replication
(data xfer) throughout the network. This was an attempt to introduce a cascading failure. However, it did
not occur – Job 26 experienced the expected incast event, but no additional failure events were seen.
Though numerous errors due to force-failed datanodes were observed in Job27, it too completed once it
was able to recover after the Incast event.
The Reduce Copy phase is when the reducer requests all Map data in order to sort and merge the resulting
data to be written to the output directory. The Incast burst occurs during this ‘Copy’ phase, which occurs
between the Start time and Shuffle Finished time (Figure 2-23). Due to the tuning parameters used to
maximize network throughput bursting, the 1GB data transfer completed within a few seconds of the
11s window.
The interfaces on the Leaf switch that connect to the servers are 1-33 through 1-37, mapping to
r02-p0(1-5)_vm01 respectively; the Leaf interface that connects to the Reducer is therefore 1-35.
Figure 2-24 shows packet loss seen by the switch interface during the event. Because data points for
dropped packets are plotted every 10s by Graphite, but reported every 1s by the switch, the time period is
slightly skewed.
Figure 2-25 shows global instant cell usage and max cell usage, observed as the sharp burst in traffic,
for the Reducer (Leaf-R2). The instant cell data point doesn’t show up for this interface because the
event occurs quickly then clears before the data point can be captured. However, max cell usage is
persistent and reflects the traffic event.
Figure 2-25 Instant and Max Cell (Buffer) Usage, as Seen on the N3K
Figure 2-26 is a zoomed-in view of the spike. The additional spiking after the event is due to block
replication that occurs from the force-failed VMs.
The spike did not use all 37976 shared buffer cells available on the N3K system because of buffer
admission control: an interface cannot exceed 2x the available shared buffer (the dynamic limit described
in Buffer Allocation above).
Lastly, for Job26, Figure 2-27 shows a Wireshark Expert Analysis of this job from a trace taken on the
Reducer. Throughput collapse is evidenced by the “Zero window” condition (the TCP connection has a
window size of 0 and no payload can be transmitted/acknowledged), after which the TCP slow-start
mechanism kicks in.
The second example is Job47 (Figure 2-28, Figure 2-29), which looks similar to Job26, but there is an
additional comparison to the Control at the end. As before, there are 33 Mappers and 1 Reducer. One
Hadoop job was launched with the IXIA shared buffer impairment running without any force failures.
The Reduce copy phase produced a spike causing drops and degradation.
Due to the tuning parameters used to maximize network throughput bursting, the 1GB data transfer
completed within a few seconds of the 12s window.
Figure 2-29 Completed Successfully After it Recovered From the Incast Event
As with Job26, the burst received by Reducer (r16-p02_vm01) is seen in Figure 2-30:
Note Detailed analysis that follows is based on TCP sessions which contribute to the overall whole of the
Hadoop job.
The following configuration is a parsed tcptrace CLI output on VMs, with important metrics highlighted:
TCP connection 6:
host k: r16-p02-vm01.dn.voyager.cisco.com:43809
host l: r10-p01-vm01.dn.voyager.cisco.com:50060
complete conn: yes
first packet: Fri Nov 9 14:44:48.479320 2012
last packet: Fri Nov 9 14:45:02.922288 2012
elapsed time: 0:00:14.442968
total packets: 3107
filename: job_0047.pcap
k->l: l->k:
total packets: 1476 total packets: 1631
ack pkts sent: 1475 ack pkts sent: 1631
pure acks sent: 1473 pure acks sent: 1
sack pkts sent: 40 sack pkts sent: 0
dsack pkts sent: 0 dsack pkts sent: 0
max sack blks/ack: 1 max sack blks/ack: 0
unique bytes sent: 302 unique bytes sent: 33119860
actual data pkts: 1 actual data pkts: 1628
actual data bytes: 302 actual data bytes: 33158956
rexmt data pkts: 0 rexmt data pkts: 5
rexmt data bytes: 0 rexmt data bytes: 39096
zwnd probe pkts: 0 zwnd probe pkts: 0
zwnd probe bytes: 0 zwnd probe bytes: 0
outoforder pkts: 0 outoforder pkts: 0
pushed data pkts: 1 pushed data pkts: 60
SYN/FIN pkts sent: 1/1 SYN/FIN pkts sent: 1/1
req 1323 ws/ts: Y/Y req 1323 ws/ts: Y/Y
adv wind scale: 10 adv wind scale: 10
req sack: Y req sack: Y
sacks sent: 40 sacks sent: 0
urgent data pkts: 0 pkts urgent data pkts: 0 pkts
urgent data bytes: 0 bytes urgent data bytes: 0 bytes
mss requested: 1460 bytes mss requested: 1460 bytes
max segm size: 302 bytes max segm size: 26064 bytes
min segm size: 302 bytes min segm size: 1448 bytes
avg segm size: 301 bytes avg segm size: 20367 bytes
max win adv: 3950592 bytes max win adv: 16384 bytes
min win adv: 1024 bytes min win adv: 16384 bytes
Note the RTT was quite large, especially considering all VMs for these tests are in the same datacenter.
Figure 2-34 shows a scatterplot taken from raw tcptrace data as sampled on the Reducer; throughput
collapse and the ensuing TCP slow-start are easily visible. Yellow dots are raw, instantaneous throughput
samples. The red line is the average throughput based on the past 10 samples. The blue line (difficult to
see) is the average throughput up to that point in the lifetime of the TCP connection.
By way of comparison, here is the Control for the test: a copy of the same 1GB job from the Reducer
to the output directory, as assigned by HDFS, with no Incast event present (it is a one-to-many, not
many-to-one, communication).
TCP connection 46:
host cm: r16-p02-vm01.dn.voyager.cisco.com:44839
host cn: r10-p05-vm01.dn.voyager.cisco.com:50010
complete conn: yes
first packet: Fri Nov 9 14:45:13.413420 2012
last packet: Fri Nov 9 14:45:15.188133 2012
elapsed time: 0:00:01.774713
total packets: 4542
filename: job_0047.pcap
cm->cn: cn->cm:
total packets: 2146 total packets: 2396
ack pkts sent: 2145 ack pkts sent: 2396
pure acks sent: 100 pure acks sent: 1360
sack pkts sent: 0 sack pkts sent: 0
dsack pkts sent: 0 dsack pkts sent: 0
max sack blks/ack: 0 max sack blks/ack: 0
unique bytes sent: 67659222 unique bytes sent: 12399
actual data pkts: 2044 actual data pkts: 1034
actual data bytes: 67659222 actual data bytes: 12399
rexmt data pkts: 0 rexmt data pkts: 0
rexmt data bytes: 0 rexmt data bytes: 0
zwnd probe pkts: 0 zwnd probe pkts: 0
zwnd probe bytes: 0 zwnd probe bytes: 0
outoforder pkts: 0 outoforder pkts: 0
pushed data pkts: 928 pushed data pkts: 1034
SYN/FIN pkts sent: 1/1 SYN/FIN pkts sent: 1/1
req 1323 ws/ts: Y/Y req 1323 ws/ts: Y/Y
adv wind scale: 10 adv wind scale: 10
req sack: Y req sack: Y
sacks sent: 0 sacks sent: 0
urgent data pkts: 0 pkts urgent data pkts: 0 pkts
It comes as no surprise that the RTT is significantly lower than when there was Incast: 3ms, down from
~60ms, which is what one would expect for a 1:1 interaction.
Finally, Figure 2-35 shows the scatterplot of the TCP connection while the file was being copied.
Figure 2-35 Example of Good TCP Throughput for 1:1 Control Test
The reason for the dip a quarter of the way through is inconclusive, but the important points are that
throughput does not go to zero, slow-start is not seen after the dip (as it would be if collapse had occurred),
and the file copy for the Control test completed in 1.7 seconds (with reasonable RTT), as opposed to
14 seconds for Job47.
MSDC Conclusion
The purpose of this document was to:
1. Examine the characteristics of a traditional data center and a MSDC and highlight differences in
design philosophy and characteristics.
2. Discuss scalability challenges unique to a MSDC and provide examples showing when an MSDC is
approaching upper limits. Design considerations which improve scalability are also reviewed.
3. Present summaries and conclusions to SDU’s routing protocol, provisioning and monitoring, and
TCP Incast testing.
4. Provide tools for a network engineer to understand scaling considerations in MSDCs.
It achieved that purpose.
• Customers’ top-of-mind concerns were taken into consideration. The effective use of Clos
topologies, particularly the 3-stage folded Clos, was examined, demonstrating how they enable
designers to meet east-west bandwidth needs with predictable traffic variations.
• The Fabric Protocol Scaling section outlined considerations with Churn, OSPF, BGP, and BFD with
regard to scaling.
• OSPF was tested, showing where current system-wide limits sit in contrast with BGP today. For BGP, it
was demonstrated how the customer’s peering, reliability, and resiliency requirements could be met
with BGP + BFD.
• Along with (3), the N3K was shown to have effective tools for buffer monitoring and signaling when
and where thresholds are crossed.
Using underlying theory, coupled with hands-on examples and use-cases, knowledge and tools are given
to help network architects be prepared to build and operate MSDC networks.
This appendix provides MSDC phase 1 server testing requirements and specifications, network
configurations, and buffer monitoring with configurations.
Servers
The lab testbed has forty (40) Cisco M2 servers. Each server, with 48GB RAM and 2.4GHz CPUs, runs the
CentOS 6.2 64-bit OS and the KVM hypervisor. There are 14 VMs configured per server; each VM is
assigned 3GB RAM and one HyperThread. The servers connect to the network via two (2) 10G NICs
capable of TSO/USO and multiple receive and transmit queues.
Server Specs
2x Xeon E5620, X58 Chipset
• Per CPU
– 4 Cores, 2 HyperThreads/Core
– 2.4Ghz
– 12M L2 cache
– 25.6GB/s memory bandwidth
– 64-bit instructions
– 40-bit addressing
48GB RAM
• 6x 8GB DDR3-1333-MHz RDIMM/PC3-10600/2R/1.35v
3.5TB (LVM)
• 1x 500GB Seagate Constellation ST3500514NS HD
– 7200RPM
– SATA 3.0Gbps
• 3x 1TB Seagate Barracuda ST31000524AS HDs
– 7200RPM
– SATA 6.0Gbps
• Partitions
Operating System
CentOS 6.2, 64-bit
• 2.6.32-220.2.1.el6.x86_64
• eth1
echo 100 > /proc/irq/79/smp_affinity
• /etc/sysctl.conf
fs.file-max = 65535
net.ipv4.ip_local_port_range = 1024 65000
net.ipv4.tcp_sack = 0
net.ipv4.tcp_timestamps = 0
net.ipv4.tcp_rmem = 10000000 10000000 10000000
net.ipv4.tcp_wmem = 10000000 10000000 10000000
net.ipv4.tcp_mem = 10000000 10000000 10000000
net.core.rmem_max = 524287
net.core.wmem_max = 524287
net.core.rmem_default = 524287
net.core.wmem_default = 524287
net.core.optmem_max = 524287
net.core.netdev_max_backlog = 300000
Virtual Machines
KVM
• libvirt-0.9.4-23.el6_2.1.x86_64
• qemu-kvm-0.12.1.2-2.209.el6_2.1.x86_64
14 VMs/server
Per VM:
• 3GB RAM
• 230.47GB 213GB HD
• Single {Hyper}Thread
Iptables Configurations
N/A.
• core-site.xml
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://namenode.nn.voyager.cisco.com:8020/</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/data/tmp</value>
</property>
<property>
<name>topology.script.file.name</name>
<value>/etc/hadoop-0.20/conf/rackaware.pl</value>
</property>
<property>
<name>topology.script.number.args</name>
<value>1</value>
</property>
</configuration>
• /etc/hadoop-0.20/conf/rackaware.pl
#!/usr/bin/perl
use strict;
use Socket;
my @addrs = @ARGV;
foreach my $addr (@addrs){
my $hostname = $addr;
if ($addr =~ /^(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})/){
# We have an IP.
$hostname = gethostbyaddr(inet_aton($1), AF_INET);
}
get_rack_from_hostname($hostname);
}
sub get_rack_from_hostname () {
my $hostname = shift;
if ($hostname =~ /^(r\d+)/){
print "/msdc/$1\n";
} else {
print "/msdc/default\n";
}
}
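For reference, the script maps any address whose hostname begins with r&lt;number&gt; to the rack path
/msdc/r&lt;number&gt; (for example, a datanode such as r05-p02-vm01 resolves to /msdc/r05), and anything
else falls back to /msdc/default; this is what gives Hadoop its rack awareness in this topology.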
• /etc/hadoop-0.20/conf.rtp_cluster1/hdfs-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>dfs.name.dir</name>
<value>/data/namespace</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/data/data</value>
</property>
<property>
<name>dfs.heartbeat.interval</name>
<value>3</value>
<description> DN heartbeat interval in seconds default 3 second </description>
</property>
<property>
<name>heartbeat.recheck.interval</name>
<value>80</value>
<description> DN heartbeat interval in seconds default 5 minutes </description>
</property>
<property>
<name>dfs.namenode.decommission.interval</name>
<value>10</value>
<description> DN heartbeat interval in seconds </description>
</property>
</configuration>
• /etc/hadoop-0.20/conf.rtp_cluster1/mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>mapred.child.java.opts</name>
<value>-Xmx5120m</value>
</property>
<property>
<name>io.sort.mb</name>
<value>2047</value>
</property>
<property>
<name>io.sort.spill.percent</name>
<value>1</value>
</property>
<property>
<name>io.sort.factor</name>
<value>900</value>
</property>
<property>
<name>mapred.job.shuffle.input.buffer.percent</name>
<value>1</value>
</property>
<property>
<name>mapred.map.tasks.speculative.execution</name>
<value>false</value>
</property>
<property>
<name>mapred.job.reduce.input.buffer.percent</name>
<value>1</value>
</property>
<property>
<name>mapred.reduce.parallel.copies</name>
<value>200</value>
</property>
<property>
<name>mapred.reduce.slowstart.completed.maps</name>
<value>1</value>
</property>
<property>
<name>mapred.local.dir</name>
<value>/data/mapred</value>
</property>
<property>
<name>mapred.job.tracker</name>
<value>jobtracker.jt.voyager.cisco.com:54311</value>
</property>
<property>
<name>mapred.system.dir</name>
<value>/data/system</value>
</property>
<property>
<name>mapred.task.timeout</name>
<value>1800000</value>
</property>
</configuration>
Network
The following network configurations were used.
F2/Clipper References
The following F2/Clipper references are available.
• Clipper ASIC Functional Specification—EDCS: 588596
• Clipper Device Driver Software Design Specification—EDCS-960101
• Packet Arbitration in Data Center Switch. Kevin Yuan
Python Code, Paramiko
The following are excerpts from the Paramiko-based collection code (connection setup, module attach,
and command handling):
username=nexus_user, password=nexus_password,
allow_agent=False, look_for_keys=False)
else:
man.connect(nexus_host, port=nexus_ssh_port,
username=nexus_user, password=nexus_password)
except paramiko.SSHException:
return 4, man
except paramiko.BadHostKeyException:
return 4, man
except paramiko.AuthenticationException:
return 4, man
except socket.error:
return 4, man
return 1, man
chan.send(attach_cmd)
if args.verbosity > 0:
logger.debug("attaching to module %s\n" % mod)
prompt = "module-" + mod + "# "
buff = ''
while not buff.endswith(prompt):
resp = chan.recv(9999)
buff += resp
if args.verbosity > 0:
logger.debug("buffer output is %s" % (buff))
logger.debug("chan is %s status is %d" % (chan, chan.recv_ready()))
return chan
if args.verbosity > 0:
logger.debug("processed command is %s" % (processed_cmd))
buff = ''
resp = ''
Spine Configuration
Forthcoming in supplemental documentation.
Leaf Configuration
Forthcoming in supplemental documentation.
The following buffer monitoring code and configuration files are available for consideration:
• buffer_check.py, page B-1
• check_process.py, page B-11
• NX-OS Scheduler Example, page B-14
• Collectd Configuration, page B-15
– collectd.conf, page B-15
– Puppet Manifest, page B-16
• Graphite Configuration, page B-17
– carbon.conf, page B-18
– graphite.wsgi, page B-19
– graphite-vhost.conf, page B-19
– local_settings.py, page B-20
– relay-rules.conf, page B-20
– storage-schemas.conf, page B-21
– Puppet Manifest (init.pp), page B-22
buffer_check.py
#!/usr/bin/python
#
# A script for monitoring buffer utilization on the Cisco Nexus 3000
# platform. Tested with Nexus 3064 and Nexus 3048 switches. Intended
# to be run on the switch. Reports data to Graphite via pickled data
# over TCP (or any other data sink that can read pickle data).
#
# Written by Mark T. Voelker
# Copyright 2012 Cisco Systems, Inc.
#
import os
import sys
import re
import logging
import argparse
import time
import cPickle
import socket
import struct
import copy
import xml.parsers.expat
from cisco import CLI
def daemonize():
"""
Daemonizes the process by forking the main execution off
into the background.
"""
try:
pid = os.fork()
except OSError, e:
raise OSError("Can't fork(%d): %s" % (e.errno, e.strerror))
if (pid == 0):
# This is the child process.
# Become the session leader/process group leader and ensure
# that we don't have a controlling terminal.
os.setsid()
def write_pidfile(pid=os.getpid()):
"""
Writes a pid file to /bootflash/buffer_check.py.pid.
The file contains one line with the PID.
"""
global args
f = open(args.pidfile, 'w')
f.write(str(pid))
f.close()
"""
global current_tag
current_tag = copy.copy(name)
#logger.debug("Current tag: '%s'" % (current_tag))
def end_element(name):
"""
Callback routine for handling the end of a tagged element.
"""
global current_tag
current_tag = ''
def char_data(data):
"""
Callback routine to handle data within a tag.
"""
global current_tag
global current_int
global parsed_data
#logger.debug("char_data handler called [current_tag = %s] on '%s'" % (
# current_tag, data)
# )
if current_tag == 'total_instant_usage':
parsed_data['instant_cell_usage'] = int(copy.copy(data))
logger.debug("FOUND TOTAL INSTANT CELL USAGE: %s" % (data))
elif current_tag == 'max_cell_usage':
parsed_data['max_cell_usage'] = int(copy.copy(data))
logger.debug("FOUND TOTAL MAX CELL USAGE: %s" % (data))
elif current_tag == 'rem_instant_usage':
parsed_data['rem_instant_usage'] = int(copy.copy(data))
logger.debug("FOUND REMAINING INSTANT USAGE: %s" % (data))
elif current_tag == 'front_port':
current_int = int(copy.copy(data))
parsed_data[current_int] = 0
logger.debug("Started a new front port: %s" % (data))
elif re.search('^[m|u]cast_count_\d$', current_tag):
logger.debug("Found queue counter (port %s): %s" % (current_int, data))
if current_int in parsed_data:
parsed_data[current_int] += int(copy.copy(data))
else:
parsed_data[current_int] = int(copy.copy(data))
logger.debug("Added %s to counter for port %s (total: %s)" % (
data, current_int, parsed_data[current_int])
)
def int_char_data(data):
"""
Callback routine to handle data within a tag.
"""
global interface_rates
global current_tag
global current_int
global get_cmd_timestamp
global pickle_data
global logger
if current_tag in keepers:
# Set up some data storage.
))
def get_show_queuing_int():
"""
Parses output from 'show queuing interface' and reports stats.
Unicast drop stats are reported for each interface given in the
list of interfaces on the command line. Drop stats for multicast,
unicast, xon, and xoff are added up for all interfaces (including
those not specified on the command line) to provide switch-level
totals for each.
)
switch_stat_name = "%s_%s_dropped" % (match.group(1).lower(),
match.group(2).lower())
def get_int_counters():
"""
Parses stats from the output of 'show interface x/y | xml'.
"""
global args
global pickle_data
global logger
global interface_rates
global current_int
global current_tag
global get_cmd_timestamp
int_xml_parser.Parse(get_cmd_reply, 1)
def get_buffer_stats():
"""
Parses stats from the output of 'show hardware internal buffer pkt-stats detail |
xml'.
"""
global args
global pickle_data
global logger
global interface_rates
global exit_code
# Before we process the reply, send another message to clear the counters
# unless we've been told not to do so.
if args.clear_counters:
clear_obj = CLI(clear_message)
clear_cmd_reply = clear_obj.get_raw_output()
logger.debug("Result of clear command:\n%s" % (clear_cmd_reply))
""" % (port_num)
def do_switch_commands():
"""
A hook function for executing any switch-level command necessary.
Commands for individual interfaces are handled elsewhere.
"""
global args
# TODO (mvoelker): add CLI options here to determine which
# commands get run.
if args.get_queuing_stats:
get_show_queuing_int()
if args.get_buffer_stats:
get_buffer_stats()
def do_interface_commands():
"""
A hook function for executing any per-interface command necessary.
Commands for handling switch-level stats and commands which
provide data for multiple interfaces are generally handled in
do_switch_commands().
"""
global args
if args.get_int_counters:
get_int_counters()
Example:
%prog -H myN3K.mydomain.com -l admin -p password \\
-m -i 46 47 48
"""
parser = argparse.ArgumentParser(description=usage)
parser.add_argument("-H", "--hostname", dest="hostname",
help="Hostname or IP address", required=True)
parser.add_argument("-p", "--pidfile", dest="pidfile",
help="File in which to write our PID", default="/bootflash/buffer_check.py.pid")
parser.add_argument("-v", "--verbose", dest="verbosity", action="count",
help="Enable verbose output.", default=0)
parser.add_argument("-b", "--clear_buffer_counters", dest="clear_counters",
help="Clear buffer counters after checking", default=False,
action="store_true")
parser.add_argument("-m", "--max_buffer", dest="get_max_buf",
help="Show max buffer utilization", default=False,
action="store_true")
parser.add_argument("-i", "--instant_buffer", dest="get_instant_buf",
help="Show instant buffer utilization", default=False,
action="store_true")
parser.add_argument("interfaces", metavar="N", type=int, nargs='*',
# Set up a logger.
logger = logging.getLogger('n3k_buffer_check')
logging.basicConfig()
# Since this started out purely as a script for buffer monitoring commands,
# certain command options imply others. Fix things up here.
if args.get_instant_buf:
args.get_buffer_stats = True
logger.debug("CLI: assuming -f because I received -i.")
if args.get_max_buf:
args.get_buffer_stats = True
logger.debug("CLI: assuming -f because I received -m.")
if args.clear_counters:
args.get_buffer_stats = True
logger.debug("CLI: assuming -f because I received -b.")
port_num = 0
get_cmd_timestamp = 0.0
while True:
# Clear out old pickled data.
pickle_data = list()
if args.verbosity > 0:
logger.debug(pickle_data)
check_process.py
#!/usr/bin/python
import os
import sys
import re
import argparse
from cisco import CLI
Example:
%prog -k
"""
parser = argparse.ArgumentParser(description=usage)
parser.add_argument("-k", "--kill", dest="kill",
help="Kill buffer_check.py if running", default=False,
action="store_true")
args = parser.parse_args()
def start_process():
"""
Starts the buffer_check.py script. For our implementation, we
start three instances: one to check buffer stats, one to check
interface stats on server-facing ports, and one to check interface
stats on other ports and queuing stats (at lower granularity).
"""
def check_pid():
"""
Checks to see if the buffer_check.py script is running.
"""
retval = True
for pf in pidfiles:
# Try to open our pidfile.
try:
f = open(pf, 'r')
except IOError:
print "No pidfile %s found!" % (pf)
retval = False
# Read the pid from the file and grock it down to an int.
pid = f.readline()
pidmatch = re.search('^(\d+)\s*$', pid)
if pidmatch:
pid = pidmatch.group(1)
print "Pid from pidfile is %s" % (pid)
global options
try:
if args.kill:
os.kill(int(pid), 9)
print "Killed %s" % (pid)
else:
os.kill(int(pid), 0)
except OSError:
print "%s is dead." % (pid)
retval = False
else:
if not args.kill:
print "%s is alive." % (pid)
else:
print "No pid found!"
retval = False
return retval
if check_pid():
# We can exit, the scripts are running.
exit(0)
else:
# We need to start the scripts.
if not args.kill:
start_process()
exit(1)
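The body of start_process() is not reproduced in this excerpt. A minimal, hypothetical sketch of such a launcher is shown below; the argument sets and interface numbers are illustrative assumptions (only -i, -m, -b, -v and the positional interface list are defined by buffer_check.py above), and the production script would more likely launch the instances through the NX-OS Python environment or scheduler rather than a generic subprocess call.
#!/usr/bin/python
# Hypothetical sketch of start_process(): launch three buffer_check.py
# instances, one per monitoring task described in the docstring above.
import subprocess

BUFFER_CHECK = "/bootflash/buffer_check.py"

def start_process():
    arg_sets = [
        ["-i", "-m"],          # buffer stats: instant + max utilization
        ["1", "2", "3", "4"],  # interface stats, server-facing ports (example numbers)
        ["5", "6"],            # interface/queuing stats on remaining ports
    ]
    for args in arg_sets:
        # Each instance runs independently; per-instance pidfile
        # arguments are omitted in this sketch.
        subprocess.Popen(["python", BUFFER_CHECK] + args)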
config terminal
scheduler job name buffer_check
python bootflash:/check_process.py
end
config terminal
scheduler schedule name every_minute
job name buffer_check
time start 2012:09:10:09:58 repeat 1
end
Collectd Configuration
The following collectd configurations are available for consideration:
• collectd.conf, page B-15
• Puppet Manifest, page B-16
collectd.conf
BaseDir "/var/lib/collectd"
PIDFile "/var/run/collectd.pid"
PluginDir "/usr/lib64/collectd"
TypesDB "/usr/share/collectd/types.db"
Interval 1
ReadThreads 5
LoadPlugin syslog
LoadPlugin cpu
LoadPlugin disk
LoadPlugin ethstat
LoadPlugin libvirt
LoadPlugin load
LoadPlugin memory
LoadPlugin write_graphite
<Plugin disk>
Disk "/^[hs]d[a-f][0-9]?$/"
IgnoreSelected false
</Plugin>
<Plugin ethstat>
Interface "eth0"
Interface "eth1"
Map "rx_packets" "pkt_counters" "rx_packets"
Map "tx_packets" "pkt_counters" "tx_packets"
Map "rx_bytes" "byte_counters" "rx_bytes"
Map "tx_bytes" "byte_counters" "tx_bytes"
Map "rx_errors" "error_counters" "rx_errors"
Map "tx_errors" "error_counters" "tx_errors"
Map "rx_dropped" "drop_counters" "rx_dropped"
Map "tx_dropped" "drop_counters" "tx_dropped"
Map "collisions" "error_counters" "collisions"
Map "rx_over_errors" "error_counters" "rx_over_errors"
Map "rx_crc_errors" "error_counters" "rx_crc_errors"
Map "rx_frame_errors" "error_counters" "rx_frame_errors"
Map "rx_fifo_errors" "error_counters" "rx_fifo_errors"
Map "rx_missed_errors" "error_counters" "rx_missed_errors"
Map "tx_aborted_errors" "error_counters" "tx_aborted_errors"
Map "tx_carrier_errors" "error_counters" "tx_carrier_errors"
Map "tx_fifo_errors" "error_counters" "tx_fifo_errors"
Map "tx_heartbeat_errors" "error_counters" "tx_heartbeat_errors"
Map "rx_pkts_nic" "pkt_counters" "rx_pkts_nic"
<Plugin libvirt>
Connection "qemu:///system"
RefreshInterval 5
IgnoreSelected false
HostnameFormat hostname name
</Plugin>
<Plugin write_graphite>
<Carbon>
Host "voyager-graphite.hosts.voyager.cisco.com"
Port "2003"
Prefix "collectd"
Postfix "collectd"
StoreRates false
AlwaysAppendDS false
EscapeCharacter "_"
</Carbon>
</Plugin>
Include "/etc/collectd.d"
Puppet Manifest
class collectd {
package { "collectd":
name => "collectd",
ensure => 'latest',
require => [File['/etc/yum.conf']]
}
package { "collectd-graphite":
name => "collectd-graphite",
ensure => 'latest',
require => [File['/etc/yum.conf']]
}
package { "collectd-ethstat":
name => "collectd-ethstat",
ensure => 'latest',
require => [File['/etc/yum.conf']]
}
package { "collectd-libvirt":
name => "collectd-libvirt",
ensure => 'latest',
require => [File['/etc/yum.conf']]
}
service { "collectd":
enable => 'true',
ensure => 'running',
start => '/etc/init.d/collectd start',
stop => '/etc/init.d/collectd stop',
require => [Package['collectd'], Package['collectd-graphite'],
Package['collectd-ethstat'], File['/etc/collectd.conf']]
}
if $fqdn =~ /^r05+-p0[1-5]\.hosts\.voyager\.cisco\.com$/ {
file { '/etc/collectd.conf':
#source => 'puppet:///modules/collectd/collectd.conf.enabled',
source => 'puppet:///modules/collectd/collectd.conf',
owner => 'root',
group => 'root',
mode => '644',
notify => Service['collectd'],
require => Package['collectd']
}
} else {
file { '/etc/collectd.conf':
source => 'puppet:///modules/collectd/collectd.conf',
owner => 'root',
group => 'root',
mode => '644',
notify => Service['collectd'],
require => Package['collectd']
}
}
}
Graphite Configuration
The following graphite configurations are available for consideration:
• carbon.conf, page B-18
• graphite.wsgi, page B-19
• graphite-vhost.conf, page B-19
• local_settings.py, page B-20
• relay-rules.conf, page B-20
• storage-schemas.conf, page B-21
• Puppet Manifest (init.pp), page B-22
carbon.conf
[cache]
USER =
MAX_CACHE_SIZE = inf
MAX_UPDATES_PER_SECOND = 50000
MAX_CREATES_PER_MINUTE = 500
LINE_RECEIVER_INTERFACE = 0.0.0.0
LINE_RECEIVER_PORT = 2103
ENABLE_UDP_LISTENER = True
UDP_RECEIVER_INTERFACE = 0.0.0.0
UDP_RECEIVER_PORT = 2103
PICKLE_RECEIVER_INTERFACE = 0.0.0.0
PICKLE_RECEIVER_PORT = 2104
USE_INSECURE_UNPICKLER = False
CACHE_QUERY_INTERFACE = 0.0.0.0
CACHE_QUERY_PORT = 7102
USE_FLOW_CONTROL = True
LOG_UPDATES = False
WHISPER_AUTOFLUSH = False
[cache:b]
LINE_RECEIVER_PORT = 2203
PICKLE_RECEIVER_PORT = 2204
CACHE_QUERY_PORT = 7202
UDP_RECEIVER_PORT = 2203
[cache:c]
LINE_RECEIVER_PORT = 2303
PICKLE_RECEIVER_PORT = 2304
CACHE_QUERY_PORT = 7302
UDP_RECEIVER_PORT = 2303
[cache:d]
LINE_RECEIVER_PORT = 2403
PICKLE_RECEIVER_PORT = 2404
CACHE_QUERY_PORT = 7402
UDP_RECEIVER_PORT = 2403
[cache:e]
LINE_RECEIVER_PORT = 2503
PICKLE_RECEIVER_PORT = 2504
CACHE_QUERY_PORT = 7502
UDP_RECEIVER_PORT = 2503
[cache:f]
LINE_RECEIVER_PORT = 2603
PICKLE_RECEIVER_PORT = 2604
CACHE_QUERY_PORT = 7602
UDP_RECEIVER_PORT = 2603
[cache:g]
LINE_RECEIVER_PORT = 2703
PICKLE_RECEIVER_PORT = 2704
CACHE_QUERY_PORT = 7702
UDP_RECEIVER_PORT = 2703
[cache:h]
LINE_RECEIVER_PORT = 2803
PICKLE_RECEIVER_PORT = 2804
CACHE_QUERY_PORT = 7802
UDP_RECEIVER_PORT = 2803
[cache:i]
LINE_RECEIVER_PORT = 2903
PICKLE_RECEIVER_PORT = 2904
CACHE_QUERY_PORT = 7902
UDP_RECEIVER_PORT = 2903
[cache:j]
LINE_RECEIVER_PORT = 3003
PICKLE_RECEIVER_PORT = 3004
CACHE_QUERY_PORT = 8002
UDP_RECEIVER_PORT = 3003
[relay]
LINE_RECEIVER_INTERFACE = 0.0.0.0
LINE_RECEIVER_PORT = 2003
PICKLE_RECEIVER_INTERFACE = 0.0.0.0
PICKLE_RECEIVER_PORT = 2004
RELAY_METHOD = rules
REPLICATION_FACTOR = 1
DESTINATIONS = 127.0.0.1:2104:a, 127.0.0.1:2204:b, 127.0.0.1:2304:c, 127.0.0.1:2404:d,
127.0.0.1:2504:e, 127.0.0.1:2604:f, 127.0.0.1:2704:g, 127.0.0.1:2804:h,
127.0.0.1:2904:i, 127.0.0.1:3004:j
MAX_DATAPOINTS_PER_MESSAGE = 500
MAX_QUEUE_SIZE = 10000
USE_FLOW_CONTROL = True
[aggregator]
LINE_RECEIVER_INTERFACE = 0.0.0.0
LINE_RECEIVER_PORT = 2023
PICKLE_RECEIVER_INTERFACE = 0.0.0.0
PICKLE_RECEIVER_PORT = 2024
DESTINATIONS = 127.0.0.1:2004
REPLICATION_FACTOR = 1
MAX_QUEUE_SIZE = 10000
USE_FLOW_CONTROL = True
MAX_DATAPOINTS_PER_MESSAGE = 500
MAX_AGGREGATION_INTERVALS = 5
graphite.wsgi
import os, sys
sys.path.append('/opt/graphite/webapp')
os.environ['DJANGO_SETTINGS_MODULE'] = 'graphite.settings'
import django.core.handlers.wsgi
application = django.core.handlers.wsgi.WSGIHandler()
from graphite.logger import log
log.info("graphite.wsgi - pid %d - reloading search index" % os.getpid())
import graphite.metrics.search
graphite-vhost.conf
<IfModule !wsgi_module.c>
LoadModule wsgi_module modules/mod_wsgi.so
</IfModule>
WSGISocketPrefix run/wsgi
<VirtualHost *:80>
ServerName voyager-graphite
ServerAlias voyager-graphite.cisco.com
DocumentRoot "/opt/graphite/webapp"
ErrorLog /opt/graphite/storage/log/webapp/error.log
CustomLog /opt/graphite/storage/log/webapp/access.log common
<Directory /opt/graphite/conf/>
Order deny,allow
Allow from all
</Directory>
</VirtualHost>
local_settings.py
TIME_ZONE = 'America/New_York'
DEBUG = True
USE_LDAP_AUTH = True
LDAP_SERVER = "ldap.cisco.com"
LDAP_PORT = 389
LDAP_SEARCH_BASE = "OU=active,OU=employees,ou=people,o=cisco.com"
LDAP_USER_QUERY = "(uid=%s)" #For Active Directory use "(sAMAccountName=%s)"
CARBONLINK_HOSTS = ["127.0.0.1:7102:a", "127.0.0.1:7202:b", "127.0.0.1:7302:c",
"127.0.0.1:7402:d", "127.0.0.1:7502:e", "127.0.0.1:7602:f", "127.0.0.1:7702:g",
"127.0.0.1:7802:h", "127.0.0.1:7902:i", "127.0.0.1:8002:j"]
relay-rules.conf
[collectd01-02]
pattern = collectdr01.*
destinations = 127.0.0.1:2104:a
[collectd03-04]
pattern = collectdr03.*
destinations = 127.0.0.1:2204:b
[collectd05-06]
pattern = collectdr05.*
destinations = 127.0.0.1:2304:c
[collectd07-08]
pattern = collectdr07.*
destinations = 127.0.0.1:2404:d
[collectd09-10]
pattern = collectdr09.*
destinations = 127.0.0.1:2504:e
[collectd11-12]
pattern = collectdr11.*
destinations = 127.0.0.1:2604:f
[collectd13-14]
pattern = collectdr13.*
destinations = 127.0.0.1:2704:g
[collectd15-16]
pattern = collectdr15.*
destinations = 127.0.0.1:2804:h
[iface_eth_inb]
pattern = iface_eth_inb.*
destinations = 127.0.0.1:2904:i
[iface_eth_inp]
pattern = iface_eth_inp.*
destinations = 127.0.0.1:2904:i
[iface_eth_outb]
pattern = iface_eth_outb.*
destinations = 127.0.0.1:3004:j
[iface_eth_outp]
pattern = iface_eth_outp.*
destinations = 127.0.0.1:3004:j
[max_cell]
pattern = max_cell.*
destinations = 127.0.0.1:2504:e
[instant_cell]
pattern = instant_cell.*
destinations = 127.0.0.1:2604:f
[percent_buf]
pattern = percent_buf.*
destinations = 127.0.0.1:2704:g
[carbon]
pattern = carbon.*
destinations = 127.0.0.1:3004:j
[default]
default = true
destinations = 127.0.0.1:2904:i
storage-schemas.conf
[carbon]
pattern = ^carbon\.
retentions = 60:90d
[interface_max_buffer]
pattern = ^max_cell_usage*
retentions = 1s:10d,1m:30d
[interface_instant_buffer]
pattern = ^instant_cell_usage*
retentions = 1s:10d,1m:30d
[interface_percent_threshhold]
pattern = ^iface_instant_cell_usage*
retentions = 1s:10d,1m:30d
[collectd]
pattern = ^collectd*
retentions = 1s:10d,1m:30d
[selective_in_byte_count]
pattern = ^iface_eth_inbytes+?\.1-3\d*
retentions = 1s:10d,1m:30d
[selective_out_byte_count]
pattern = ^iface_eth_outbytes+?\.1-3\d*
retentions = 1s:10d,1m:30d
[selective_in_bit_count]
pattern = ^iface_eth_inbits_rate\.1-3\d*
retentions = 1s:10d,1m:30d
[selective_out_bit_count]
pattern = ^iface_eth_outbits_rate\.1-3\d*
retentions = 1s:10d,1m:30d
[default_1min_for_1day]
pattern = .*
retentions = 10s:10d,1m:30d
package { "gcc":
name => "gcc",
ensure => "installed",
}
package { "pycairo":
name => "pycairo",
ensure => 'installed',
}
package { "mod_python":
name => "mod_python",
ensure => 'installed',
}
package { "Django":
name => "Django",
ensure => 'installed',
}
package { "django-tagging":
name => "django-tagging",
ensure => 'installed',
}
package { "python-ldap":
name => "python-ldap",
ensure => 'installed',
}
package { "python-memcached":
name => "python-memcached",
ensure => 'installed',
}
package { "python-sqlite2":
name => "python-sqlite2",
ensure => 'installed',
}
package { "bitmap":
name => "bitmap",
ensure => 'installed',
}
package { "bitmap-fixed-fonts":
name => "bitmap-fixed-fonts",
ensure => 'installed',
}
package { "bitmap-fonts-compat":
name => "bitmap-fonts-compat",
ensure => 'installed',
}
package { "python-devel":
name => "python-devel",
ensure => 'installed',
}
package { "python-crypto":
name => "python-crypto",
ensure => 'installed',
}
package { "pyOpenSSL":
name => "pyOpenSSL",
ensure => 'installed',
}
package { "graphite-web":
name => "graphite-web",
ensure => 'installed',
provider => 'pip',
require => [Package['pycairo'], Package['mod_python'], Package['Django'],
Package['python-ldap'], Package['python-memcached'], Package['python-sqlite2'],
Package['bitmap'], Package['bitmap-fonts-compat'], Package['bitmap-fixed-fonts']]
}
package { "carbon":
name => "carbon",
ensure => 'installed',
provider => 'pip',
}
package { "whisper":
name => "whisper",
ensure => 'installed',
provider => 'pip',
require => [Package['pycairo'], Package['mod_python'], Package['Django'],
Package['python-ldap'], Package['python-memcached'], Package['python-sqlite2'],
Package['bitmap'], Package['bitmap-fonts-compat'], Package['bitmap-fixed-fonts']]
}
file { '/opt/graphite/conf/carbon.conf':
source => 'puppet:///modules/graphite/carbon.conf',
owner => 'apache',
group => 'root',
mode => '644',
require => Package['carbon']
}
file { '/opt/graphite/conf/storage-schemas.conf':
source => 'puppet:///modules/graphite/storage-schemas.conf',
owner => 'apache',
group => 'root',
mode => '644',
require => Package['whisper']
}
file { '/opt/graphite/conf/graphite.wsgi':
source => 'puppet:///modules/graphite/graphite.wsgi',
owner => 'apache',
group => 'root',
mode => '655',
require => Package['graphite-web']
}
file { '/opt/graphite/webapp/local_settings.py':
source => 'puppet:///modules/graphite/local_settings.py',
owner => 'apache',
group => 'root',
mode => '655',
require => Package['graphite-web']
}
file { '/etc/httpd/conf.d/graphite-vhost.conf':
source => 'puppet:///modules/graphite/graphite-vhost.conf',
owner => 'root',
group => 'root',
mode => '655',
require => Package['graphite-web'],
notify => Service['httpd']
}
service { "httpd":
enable => 'true',
ensure => 'running',
start => '/etc/init.d/httpd start',
stop => '/etc/init.d/httpd stop',
require => [Package['graphite-web'],
File['/etc/httpd/conf.d/graphite-vhost.conf']]
}
}
Since the F2 linecard is an important building block for high-density, line-rate Spines in the MSDC
space, further discussion is warranted. This section explores unicast forwarding (only) in greater detail
within the F2 linecard.
F2 topics which are beyond the scope of this document include:
1. FIB lookup success/failure
2. EOBC/inband
3. TCAM programming
Traffic sources and bursts are extremely random and non-deterministic for a typical distributed
application in MSDCs. As an example, consider a Hadoop workload. Map and reduce nodes are
determined at runtime by the job tracker based on data block location and server memory/CPU
utilization. As a workload enters the shuffle phase, network "hotspots" can occur on any physical link(s)
across the L3 fabric. Since congestion control is closely aligned with egress interface buffers, deep
buffers are required on all physical interfaces. Insufficient buffer size can cause application
degradation1. On the other hand, buffers that are too large can hinder predictability because of increased
latency and latency variation. As such, careful examination of how buffering works on key MSDC
building blocks is essential.
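As a rough illustration of why oversized buffers increase latency, the worst-case queuing delay a full buffer adds is simply its size divided by the drain (link) rate. The short sketch below is illustrative only; it uses the ~1.5MB per-port figure discussed later in this appendix for the F2 module.
# Worst-case delay added by a full buffer: size / drain rate.
def buffer_drain_latency_ms(buffer_bytes, link_gbps):
    return (buffer_bytes * 8.0) / (link_gbps * 1e9) * 1000.0

# ~1.5MB of buffer in front of a 10G port drains in roughly 1.2-1.3 ms,
# so much deeper buffers can add many milliseconds of queuing delay.
print buffer_drain_latency_ms(1.5 * 1024 * 1024, 10)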
F2 Architecture Overview
The F2 line module consists of 12 dedicated System on Chip (SoC) ASICs; each ASIC supports 4 line-rate
10GE interfaces and has the following characteristics (Figure C-1):
• Embedded SRAM, DRAM and TCAM
• Ingress / Egress Buffering and congestion management
• L2 key features2: 4 VDC profiles, 4K VLAN, 16K MAC Table, 16 SPAN sessions
• L3 key features: 4K LIF, 32K IP Prefix, 16K ACL, 1K Policers
Other features include3:
• FCoE
• TRILL
• VN-Tag/FEX
• SFLOW
Figure C-1 F2 I/O module layout: 12 SoCs, each serving 4x 10G front-panel ports (ports 1-48), connecting to the LC CPU / arbitration aggregator and to the Fabric 2 ASIC.
With regard to data forwarding, each SoC has two dedicated connections:
1. To the arbitration aggregator
2. To an (8x 6.5Gbps) bundled connection to the 1st stage fabric, local to the module.
Similar to previous generation M or F series modules, traffic between F2 line cards is sent
across the crossbar (xbar) fabric module (2nd stage fabric). Each F2 I/O module has up to 10x 55G
connections towards the 2nd stage fabric (Figure C-2). Five fabric connections are referred to as channel
0 connections; the remaining five are classified as channel 1 connections. The second stage fabric is either
a 1st gen FAB1 or 2nd gen FAB2 module. While F2 is compatible with both, FAB2 cards are required
for line-rate deployment scenarios; FAB2s are used in the SDU MSDC test topology. Except for
migration purposes, Cisco does not recommend deployments with a mixture of FAB1 and FAB2 modules.
F2 requires deploying an F2-only VDC; interoperability of F2 with M or F1 cards in the same VDC
is not supported. If an F2 module is inserted in a chassis with M modules, all interfaces remain in an
unallocated state until placed in a dedicated F2-only VDC.
Figure C-2 F2 fabric connectivity: channel 0 fabric connections serve ports 1-24 and channel 1 connections serve ports 25-48; ingress and egress linecards (stage 1 and stage 3 fabric) connect across the xbar (stage 2) fabric modules.
Arbitration is required for all unicast flows. The purpose of arbitration is to avoid Head-of-Line Blocking
(HOLB)4 within the switch fabric and to provide Quality of Service (QoS) differentiation based on
traffic class. Arbitration ASICs on linecards perform arbitration tasks among local SoC requesters and
act as a proxy, requesting credit from the central arbiter on the Supervisor (SUP) on behalf of all SoCs
on the same linecard. Unicast packets are only transmitted to the fabric when credits are available.
Broadcast, unicast flood, and multicast traffic do not require arbitration.
Introduction to Queueing
Figure C-3 shows the three widely supported buffer/queue models in datacenter switches: shared,
ingress, and egress queuing.
Since this section examines F2 (and F2 does not employ a shared buffer model), the remainder of this
sub-section discusses ingress and egress buffering. See the TCP Incast discussion
further down in this document for an examination of how N3064s utilize shared buffering.
In egress buffering methods, traffic is pushed through the switch fabric; output scheduling and queuing
occur on the egress interface. Most egress-based buffer implementations allocate a fixed-size buffer to
each interface, and consequently the amount of burst an interface can absorb is limited by the size of the
egress buffer.
Unlike egress buffering, ingress buffering architectures absorb congestion on ingress via distributed
buffer pools. The amount of buffer available is a function of traffic flow. For example, in a 2:1 Incast
scenario, there are 2x input buffers to absorb the burst. As the number of source interfaces increases, the
amount of ingress buffer increases accordingly. If we add one additional sender to create a 3:1 Incast
scenario, we have 3x input buffers. The simplest way to characterize ingress buffering is: the available
buffer equals the number of interfaces sending traffic to a single destination times the per-port buffer.
Traffic bursts which exceed the egress capacity are dropped on the ingress interface. Ingress queuing in
general scales well in environments with large fan-outs. The F2 implementation is based on ingress
buffering. Each port is assigned ~1.5MB of ingress buffer.
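A simple way to express the scaling property described above: the buffer available to absorb an N:1 Incast burst grows with the fan-in, since each sending port contributes its own ingress buffer. The sketch below uses the per-port F2 figure quoted in this appendix.
# Ingress-buffered N:1 Incast: burst absorption scales with the number of
# senders, each contributing ~1.5MB of F2 ingress buffer.
PER_PORT_INGRESS_BUFFER_MB = 1.5

def incast_burst_capacity_mb(senders):
    return senders * PER_PORT_INGRESS_BUFFER_MB

for n in (2, 3, 16):
    print "%d:1 Incast -> ~%.1f MB of ingress buffer" % (n, incast_burst_capacity_mb(n))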
Queueing in F2 Hardware
Packet queuing occurs at multiple stages when transmitting across a F2 based system. Figure C-4 shows
key queuing points with F2 10G I/O modules.
Figure C-4 Key F2 queuing points and parameters: port trust state, class-map type queuing, priority queue, and DWRR weight.
Virtual Lane—separated into ingress (iVL) / egress (oVL). VL enables traffic
differentiation based on CoS or DSCP on a physical link. There are three mechanisms to classify a packet
into an iVL:
1. On a trusted interface, if a packet is 802.1Q tagged, the Ethernet CoS concept is extended to support VL;
the three bits in the 802.1p header identify the VL for the frame.
2. If a port is not trusted, or the frame is untagged, traffic is assigned to VL0 and handled FIFO.
3. Starting from NX-OS 6.1.1, DSCP-based classification for IPv4 is supported.
Packet classification for oVL depends on the type of traffic. Default classification is based on received
CoS for bridged traffic. For routed traffic, received CoS is rewritten based on DSCP, and the derived CoS
is used for egress queuing.
Flow Control—Flow control is a congestion avoidance mechanism that signals the remote station to stop
sending traffic due to high buffer usage. There are two types of flow control: Priority Flow Control
(PFC) and Link Flow Control (LFC). LFC operates at the link level and is independent of CoS. PFC is based on
VL and is typically implemented to provide lossless service such as FCoE. PFC and LFC are mutually exclusive.
F2 does not support flow control per CoS value.
ACoS (Advanced Class of Service)—This is the internal classification and treatment of a packet within
the data path of the switch; it is carried end to end as part of the DC3 internal header across the data path.
ACoS values are often derived from configured inbound policies during the forwarding lookup process.
CCoS—CCoS is derived from the ACoS. Based on the CL (fabric QoS level), the switch maintains a static
mapping table between ACoS and CCoS. The combination VQI:CCoS makes up the VoQ.
Credited / Un-credited Traffic—Credited traffic is unicast traffic that has gone through the full
arbitration process. Un-credited traffic does not require arbitration; typically multicast,
broadcast, and unknown unicast are transmitted as un-credited traffic flows. If an interface has both
credited and un-credited traffic, the configured DWRR weight determines the amount of traffic to send for each
type.
VQI and VoQ—A VQI (virtual queue index) is the index or destination port-group id over the fabric.
A VQI always maps to a port group representing 10G worth of bandwidth. Each port group consists of
multiple interfaces (M1 series 10G shared mode or 12x 1GE) or a single dedicated interface (M1 series
running in dedicated mode or F series line cards). The number of QoS levels per VQI varies between
hardware: M series supports 4 CoS per VQI, while F2 supports 8. The mapping of VQI to CCoS is often
referred to as the VoQ. A line card can have up to (QoS levels * number of VQIs) VoQs. Specific to the F2 line
card:
1. Each port group maps to a single 10GE interface, so there is a single VQI per interface.
2. Up to 1024 VQIs/destinations and up to 8 QoS levels per VQI. This translates to 8192 VoQs.
3. Packets are queued into one of the VoQs based on destination (VQI) and CCoS.
In practice, the number of usable VoQs is based on fabric QoS levels. The current implementation in the
Nexus 7000 family supports 4 Credit Loops (QoS levels), which implies the number of usable VoQs is 4 *
1024, or 4096 VoQs (this arithmetic is summarized in the short sketch following the LDI definition below).
LDI—LDI is an index local to the linecard. An interface is defined by a 6-bit LDI with SUP-1 and a 7-bit LDI
with SUP-2. When communicating with the central arbiter, linecards use LDIs to indicate the interface id. On
the central arbiter, every LDI maps to a unique VQI. This mapping is based on the received interface of
the arbitration message and the LDI.
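The VoQ arithmetic quoted in the VQI and VoQ definition above can be summarized as follows (the values are those stated for F2 and the current Nexus 7000 implementation):
# VoQ count = number of VQIs (destinations) x QoS levels per VQI.
VQIS = 1024
QOS_LEVELS_SUPPORTED = 8    # supported by F2 hardware
CREDIT_LOOPS_IN_USE = 4     # current Nexus 7000 implementation

print "Hardware VoQs:", VQIS * QOS_LEVELS_SUPPORTED   # 8192
print "Usable VoQs:  ", VQIS * CREDIT_LOOPS_IN_USE    # 4096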
Ingress Logic
The Nexus 7000 F2 implementation is based on ingress buffering. In the case of the F2 I/O module, the VoQ
buffer is the input port buffer. Each SoC has 6MB of buffer shared by its 4 front-panel ports, which
results in ~1.5MB of ingress buffer per port, represented as 3584 pages of input buffer at 384 bytes per
page. 6MB is equivalent to roughly 1.25 ms of buffering. There is also a 1MB skid buffer per SoC (250KB per
interface); this is only used with PFC. Buffer pages are assigned to various input queues depending on
the system / port queue configuration. Using the default 2q4t (two input queues and 4 thresholds per
queue) configuration with a 90/10 queue limit, 3195 pages are assigned to queue 0 (90%) and 338 pages
are assigned to queue 1 (iVL 5) (10%).
Ingress Queuing for Ethernet1/1 [System]
-------------------------------------------
Trust: Trusted
DSCP to Ingress Queue: Disabled
-----------------------------------
Que# Group Qlimit% IVL CoSMap
-----------------------------------
0 1 90 0 0-4
1 0 10 5 5-7
Traffic-to-ingress-queue mapping can be based on CoS (default) or DSCP (starting from NX-OS 6.1.1).
Packets are queued to the corresponding iVL, awaiting lookup results, based on the following mapping:
UP 0 - 4 -> IVL 0
UP 5 - 7 -> IVL 5
iVL buffer utilization can be monitored by attaching to the linecard. Refer to the Monitor F2 drop section
for details.
Once lookup results are received from the decision engine, packet headers are rewritten and the packet is
forwarded to the corresponding VoQ for arbitration. ACoS values are used to determine QoS levels across the fabric.
There are three lookup tables: 8cl, 4cl, and 4clp; the CL mode determines which table to query. 8cl is
used for 8-queue (CL) mode while 4cl and 4clp are for 4-queue mode (p indicates whether the priority bit is set).
module-1# show hardware internal qengine inst 0 vq acos_ccos_4cl
ACOS CCOS
---- ----
0 3
1 3
2 2
3 1
4 1
5 0
6 0
7 0
8 - 15 3
16 - 23 2
24 - 39 1
40 - 63 0
This mapping has to be the same on all I/O modules and SUPs across the entire switch. CSCuc07329
details some of the issues that can occur when a mismatch exists.
Once the CCoS is determined, the packet is queued into the corresponding VoQ (VQI:CCoS) for central
arbitration. Current generation SoCs do not allow drops on the VoQ; all congestion-related drops for
unicast traffic can only occur at the iVL. iVL drops show up as input discards on the physical interface and
as queue drops under the QoS queuing policy. Future F2 series I/O modules will support VoQ drops. Before
accepting a packet into the VoQ, WRED and tail checks are performed based on instant buffer usage. If the
instant buffer usage is greater than the buffer threshold, or the number of packets in the output queue is
higher than the packet count threshold, the incoming packet is dropped. It is important to highlight that,
when PFC is enabled, WRED and tail drop apply to droppable VLs only. Central arbitration occurs once packets are
accepted into the VoQ. Refer to Introduction to Arbitration Process for arbitration details.
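A condensed model of the admission check described above is sketched below. The thresholds and units are placeholders; the actual checks are implemented in the SoC hardware based on the configured queuing policy.
# Simplified model of the F2 ingress admission check: a packet is accepted
# into its VoQ (VQI:CCoS) only if instantaneous buffer usage and the output
# queue depth are below their thresholds; otherwise it is dropped at the iVL.
def accept_into_voq(instant_buffer_pages, buffer_threshold_pages,
                    output_queue_pkts, pkt_count_threshold):
    if instant_buffer_pages > buffer_threshold_pages:
        return False   # shows up as an input discard / ingress queue drop
    if output_queue_pkts > pkt_count_threshold:
        return False   # shows up as an input discard / ingress queue drop
    return True        # packet proceeds to central arbitration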
Egress Logic
On the egress side, there are an additional 755 pages of FIFO buffer for credited traffic and 512 pages for
broadcast / multicast traffic. Credited traffic consists of super frames (packets or jumbo frames with
segments), and buffer space is reserved and managed by the arbiter. Credited traffic is sent to an egress
linecard only when it has been granted buffer space by the central arbiter, and it can only be destined to
one VQI. The egress buffer must be returned to the arbiter once the egress interface completes its transfer.
If traffic arrives out of order on the egress line card, it is the responsibility of the egress logic to re-order
packets before they are transmitted out of the output interface.
If an interface has both credited and un-credited traffic, the configured DWRR weights determine the
amount of traffic to send for each type.
DWRR weight for credited and uncredited traffic:
DWRR weights:
Q# Credited Uncredited
0 8190 5460
1 8190 5460
2 8190 5460
3 8190 5460
Egress QoS policy controls how various classes of traffic are prioritized when being transmitted out of
an interface. The default egress queue structure is based on 1p3q4t (one priority queue and 3 normal queues,
each queue with 4 drop thresholds); CoS 5, 6, and 7 are mapped to the priority queue, and DWRR is implemented
between queues Q1-Q3. For bridged traffic, received CoS is used for both ingress and egress classification
by default. If ingress classification is changed to DSCP, by default the egress CoS value for bridged traffic
remains unchanged: on the egress side, the received CoS is still used for egress queue selection, and DSCP
is ignored. For example, a bridged packet marked with CoS 0 / DSCP 46 as it enters the
switch is treated as premium data on ingress based on DSCP classification, but on egress
it continues to be mapped to the default queue because of CoS 0. A policy map can be applied at
the egress interface if DSCP-based queuing on egress is required. For routed traffic, either CoS or
DSCP can be used for ingress queue selection. DSCP is used to rewrite the CoS on the egress interface, and the
derived CoS is used for egress queue selection.
WRED is not supported on F2 modules. The following output highlights F2 output queue information.
Flexible Scheduler config:
System Queuing mode: 4Q
Q 0: VLs (5,6,7) priority Q HI,
Q 1: VLs (3,4) DWRR weight 33,
Q 2: VLs (2) DWRR weight 33,
Q 3: VLs (0,1) DWRR weight 33
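The default mapping shown in the output above can be expressed as a small lookup. This is only a sketch of the default CoS-to-egress-queue behavior; an egress policy-map, or DSCP-based queuing, would override it.
# Default F2 egress queue selection (1p3q4t), mirroring the scheduler
# output above: Q0 is the priority queue for CoS 5-7, Q1-Q3 are DWRR queues.
def egress_queue_for_cos(cos):
    if cos in (5, 6, 7):
        return 0   # priority queue
    if cos in (3, 4):
        return 1   # DWRR weight 33
    if cos == 2:
        return 2   # DWRR weight 33
    return 3       # CoS 0-1, DWRR weight 33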
In a unicast-only environment, no drops occur on the egress for credited traffic. The arbitration process
ensures traffic is only sent over the fabric if egress buffers exist. This is a by-product of ingress-based
queuing: traffic exceeding the bandwidth of the egress port consumes only the necessary fabric
bandwidth and is dropped at the ingress. Egress queuing policy on F2 controls how much an egress
port receives from ingress ports. If a mixture of priority and best-effort traffic exists, the egress policy
assigns higher precedence to priority traffic.
Central arbitration: the arbitration ASIC on the Supervisor grants fabric access between ingress and egress linecards (550 Gbps of fabric bandwidth per slot).
For unicast packet forwarding, once a lookup decision is made on ingress, arbitration requests containing the
output VQI and Credit Loop (QoS level) are sent to the central arbiter seeking permission for transmission
across the fabric. If egress buffers are available, the central arbiter sends a grant (GNT) message to the
arbitration aggregator on the linecard. The ingress linecard starts transmission across the fabric upon
receiving the GNT message. Super frames are used if multiple small packets are destined for the same egress
VoQ; they are handled in the same arbitration cycle. Packets are stored in egress buffers once they reach the
egress linecard. Once processed, packets are sent out via the egress port logic and a token (GID) is returned
to the central arbiter via buffer available (CRD) messages.
As the industry's demand for 100G increases, requirements for high-density 100G line-rate interfaces will
correspondingly increase as well. High-density 100G interfaces require substantial increases in slot and
system throughput. Increases in capacity requirements are accommodated with distributed flow
arbitration. Distributed flow arbitration removes the central arbiter and integrates buffer/token
allocation with the flow status on the ingress / egress linecards. Buffer management is based on flow
status: sequence numbers and/or TCP window sizes are used to control the rate of data transmission.
When implemented properly, flow status alleviates egress congestion and enables deadlock avoidance
at multi-Tbps throughput. Distributed arbitration is a roadmap item and is not required for current
throughput demands.
IVL/Pause Frames
Port QoS configuration indicates flow control status:
module-1# show hardware internal mac port 1 qos configuration
QOS State for port 1 (Asic 0 Internal port 1)
GD
TX PAUSE:
VL# ENABLE RESUME REFRESH REF_PERIOD QUANTA
0 OFF OFF OFF 0x0 0x0
1 OFF OFF OFF 0x0 0x0
2 OFF OFF OFF 0x0 0x0
3 OFF OFF OFF 0x0 0x0
4 OFF OFF OFF 0x0 0x0
5 OFF OFF OFF 0x0 0x0
6 OFF OFF OFF 0x0 0x0
7 OFF OFF OFF 0x0 0x0
LFC ON ON ON 0x1000 0xffff
RX PAUSE:
VL 0-7 ENABLE: OFF OFF OFF OFF OFF OFF OFF OFF LFC: ON
As discussed previously, the system sends pause frames when buffers run low. The number of pause
states entered is viewed with the show hardware command. IDs 2125-2130 indicate which UPs are in an
internal pause state due to high buffer usage; UP0 - UP4 map to iVL0.
Ingress Buffer
Since F2 is based on the ingress buffering model, visibility into input buffer usage is critical to determining
the overall performance of a distributed application. When input buffers are consumed (because the egress
end host is receiving more traffic than it can handle), it is important to proactively identify the set of
interfaces contributing to congestion and re-route workloads to another system that has excess
capacity. F2 input buffer usage is viewed by issuing the show hardware internal mac command. This
command reports the number of input buffers allocated and used per iVL (a small parsing sketch follows the sample output):
module-1# show hardware internal mac port 1 qos configuration
IB
Port page limit : 3584 (1376256 Bytes)
VL# HWM pages(bytes) LWM pages(bytes) Used PL_STOP(HWM & LWM)
Pages THR
0 3195 ( 1226880) 3075 ( 1180800) 21 3195 3075
1 2 ( 768) 1 ( 384) 0 2 1
2 2 ( 768) 1 ( 384) 0 2 1
3 2 ( 768) 1 ( 384) 0 2 1
4 2 ( 768) 1 ( 384) 0 2 1
5 338 ( 129792) 266 ( 102144) 0 338 266
6 2 ( 768) 1 ( 384) 0 2 1
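For trending this data off-box, the output above can be screen-scraped in the same spirit as the buffer_check.py approach in Appendix B. The following is a hypothetical parsing sketch; the column positions are assumptions based on the sample output shown here.
# Hypothetical parser for the per-VL "Used Pages" column of the
# "show hardware internal mac port <n> qos configuration" output.
import re

def parse_ivl_used_pages(output):
    used = {}
    for line in output.splitlines():
        m = re.match(r'\s*(\d+)\s+\d+\s+\(\s*\d+\)\s+\d+\s+\(\s*\d+\)\s+(\d+)', line)
        if m:
            used[int(m.group(1))] = int(m.group(2))   # {iVL: used pages}
    return used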
The number of drops per iVL or interface is tracked via the input QoS policy; use the show policy-map
command to get the total number of ingress drops at the ingress queue:
msdc-spine-r1# show policy-map interface ethernet 1/1
Global statistics status : enabled
Ethernet1/1
Service-policy (queuing) input: default-4q-8e-in-policy
The show hardware internal statistics device fabric errors command is used with FAB2 to display
the number of times a frame with a bad CRC enters the fabric ASIC.
To display the number of bad frames received by egress engines, use the show hardware internal
statistics module-all device qengine errors command.
VOQ Status
In F2's case, VQI index tracks interface LTL index. Both indexes are assigned by the Port manager:
msdc-spine-r1# show system internal ethpm info interface ethernet 1/1
Information from GLDB Query:
Platform Information:
Slot(0), Port(0), Phy(0x2)
LTL(0x77), VQI(0x77), LDI(0x1), IOD(0x358)
Backplane MAC address in GLDB: 6c:9c:ed:48:c9:28
Router MAC address in GLDB: 00:24:98:6c:72:c1
Packets requiring central arbitration are queued in their respective VoQs awaiting tokens. The number of
outstanding frames per VoQ is monitored based on VQI:CCoS:
module-3# show hardware internal qengine voq-status
VQI:CCOS CLP0 CLP1 CLP2 CLP3 CLP4 CLP5 CLP6 CLP7 CLP8 CLP9 CLPA CLPB
-------- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
0033:3 0 2 0 0 0 0 0 0 0 0 0 0
0033:4 0 0 0 0 0 0 0 0 0 0 0 0
Note Due to a known hardware limitation, the VoQ counters above are not implemented correctly in the F2.
This limitation is addressed in F2E and beyond.
A summary view of VQI-to-module and LDI mappings is obtained by querying the VQI map table:
module-3# show hardware internal qengine vqi-map | i 33
VQI SUP SLOT LDI EQI FPOE NUM XBAR IN ASIC ASIC SV FEA_
NUM VQI NUM NUM NUM BASE DLS MASK ORD TYPE IDX ID TURE
---- --- ---- --- --- ---- --- ----- --- ---- ---- -- ----
33 no 2 33 2 162 1 0x155 0 CLP 8 0 0x80
Pktflow output provides a breakdown of ingress / egress traffic based on credited / uncredited status, and
packet drops per VL:
module-1# show hardware internal statistics device mac pktflow port 1
|------------------------------------------------------------------------|
| Device:Clipper MAC Role:MAC Mod: 1 |
| Last cleared @ Wed Sep 12 10:04:03 2012
The show hardware queueing drops ingress | egress command reports the total number of drops per
VoQ. Since existing F2 modules do not drop packets at the VoQ, this command does not apply; it is used
directly from the SUP when monitoring future F2 I/O modules.
Central Arbitration
Each linecard has up to 2 SERDES links to send and receive arbitration messages from the central
arbiter. One link goes to the primary SUP and the other goes to the secondary SUP (if present). These
links are referred to as a "group". In an 18-slot system (N7018), there are up to 16 port
groups for linecards and 2 port groups for SUPs on the central arbiter. The mapping of group-to-linecard
connections is determined at boot time:
msdc-spine-r1# test hardware arbiter print-map-enabled
be2_ch_type:10 Sup slot:9
Slots with groups Enabled
-------------------------
Slot 10 GROUP: 0 gp: 9
Slot 1 GROUP: 1 gp: 0
Slot 3 GROUP: 2 gp: 2
Slot 4 GROUP: 3 gp: 3
Slot 2 GROUP:15 gp: 1
-------------------------
Bucket Count (BKT) is used to count the number of received request, grant, and credit messages. The
request and grant message bucket lookup uses a 10-bit LDI from the arbitration message, concatenated
with the fabric CoS, to form a 12-bit bucket table index. A dedicated BKT table exists per group
(linecard).
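The index construction described above can be written out directly. The field widths are those stated in the text; the exact bit ordering of the concatenation is an assumption.
# Bucket table index: 10-bit LDI concatenated with the 2-bit fabric CoS,
# yielding a 12-bit index into the per-group BKT table.
def bkt_index(ldi, fabric_cos):
    assert 0 <= ldi < 1024 and 0 <= fabric_cos < 4
    return (ldi << 2) | fabric_cos   # bit ordering assumed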
REQ messages contain a 2-bit CoS field which maps to three levels of priority. Mapped output "CoS 0"
has absolute priority; CoS1, CoS2, and CoS3 share the same priority level during arbitration:
msdc-spine-r2# show hardware internal arbiter counters 2
GROUP:2
LDI COS OUT_REQ CREDIT CREDITNA
1 3 1 122087645 63
3 3 1 120508256 63
Bkt Cos Gresend Grant Request Rresend
0 0 0 39459 39459 0
0 1 0 1 1 0
0 2 0 1 1 0
0 3 0 686452080 686452776 0
64 0 0 23740 23740 0
64 1 0 1 1 0
64 2 0 1 1 0
64 3 0 203618 203618 0
For credit id based arbitration, CRD messages carry unique tags per buffer id, plus the LDI and CoS for
the credit. The central arbiter maintains all received GIDs per {LDI, CoS}. When a GNT is issued the
arbiter removes a GID from the table. It is possible to query credits available in the arbiter by looking
at GID usage:
msdc-spine-r2# show hardware internal arbiter gid 2 1
Gid Group:2 carb:2 cgp:0
LDI COS LGID UGID B2 PTR B1 PTR CNAGID
----------------------------------------------------
1 0 f f 7 0 0
1 1 f e 2 0 0
1 2 f e 2 0 0
1 3 0 0 1 0 0 <<<<<< Bit map of available GID
- In this case, all tokens have been assigned.
If DSCP-based classification is enabled on ingress, a known limitation exists such that DSCP 5, 6, and
7 are treated as priority data on egress.
Use the show hardware internal statistics command to see which egress queue is processing the majority
of traffic. In the following example, the majority of traffic belongs to queue #3. Unfortunately, as of this
writing, per-CoS statistics are not available by default:
module-1# show hardware internal statistics device mac all port 1 | i EB
20480 EB egress_credited_fr_pages_ucast 0000035840088269 1-4 -
20482 EB egress_uncredited_fr_pages_ucast 0000000000002004 1-4 -
20484 EB egress_rw_cred_fr0_pages_ucast (small cnt) 0000035840088278 1-4 -
20488 EB num credited page returned by RO 0000035840088287 1-4 -
20515 EB egress_credited_tx_q_#0 0000000000529420 1 -
20516 EB egress_credited_tx_q_#1 0000000000000209 1 -
20517 EB egress_credited_tx_q_#2 0000000000000159 1 -
20518 EB egress_credited_tx_q_#3 0000181154592683 1 -
module-1#
The show policy-map interface command is used to view output drops; however, in unicast
environments drops should not occur in the egress path. For completeness, fabric utilization can also be
monitored:
msdc-spine-r1# show hardware fabric-utilization
------------------------------------------------
Slot Total Fabric Utilization
Bandwidth Ingress % Egress %
------------------------------------------------
1 550 Gbps 1.50 1.50
2 550 Gbps 1.50 1.50
3 550 Gbps 1.50 1.50
4 550 Gbps 1.50 1.50
10 115 Gbps 0.00 0.00
In unicast-only environments the N7k fabric should never be over-subscribed, due to arbitration and
traffic load balancing over the available xbar modules.
Nagios Plugin
For monitoring dropped packet counts across F2, several methods exist to retrieve stats. A majority of
stats are available via NETCONF, some are available via both SNMP and NETCONF, while a small percentage are
available via CLI only. Nagios is an excellent, highly configurable, open source tool and is easily
customized to collect statistics via any access mechanism. Nagios stores device performance and status
information in a central database. Real-time and historical device information is retrieved directly from
a web interface. It is also possible to configure email or SMS notifications if an aberrant event occurs.
Nagios' strengths include:
• Open Source
• Robust and Reliable
• Highly Configurable
• Easily Extensible
• Active Development
• Active Community
SNMP monitoring is enabled via straightforward config files.5
For input drops on F2, the IF-MIB tracks the critical interface statistics (ifInDiscards, ifOutDiscards, ifInOctets, etc.).
Developing Nagios plug-ins for CLI-based statistics is slightly more involved. Three general steps include:
5. Detailed examples on how to enable custom plug-in via SNMP can be found here:
http://conshell.net/wiki/index.php/Using_Nagios_with_SNMP
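As an illustration of an SNMP-based check against the IF-MIB counters mentioned above, a minimal Nagios-style plug-in might look like the following sketch. The host name, community string, ifIndex, and thresholds are placeholders, and the sketch assumes net-snmp's snmpget is available on the Nagios host.
#!/usr/bin/python
# Hypothetical Nagios check: alert on input discards retrieved via SNMP.
import subprocess, sys

WARN, CRIT = 1000, 10000   # example discard-count thresholds
IFINDEX = 1                # placeholder ifIndex of the monitored interface

def get_in_discards(host, community, ifindex):
    out = subprocess.check_output(
        ["snmpget", "-v2c", "-c", community, "-Oqv", host,
         "IF-MIB::ifInDiscards.%d" % ifindex])
    return int(out.strip())

if __name__ == "__main__":
    discards = get_in_discards("msdc-spine-r1", "public", IFINDEX)
    if discards >= CRIT:
        print "CRITICAL - %d input discards" % discards
        sys.exit(2)
    if discards >= WARN:
        print "WARNING - %d input discards" % discards
        sys.exit(1)
    print "OK - %d input discards" % discards
    sys.exit(0)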
While a comprehensive analysis of Cisco’s competition is beyond the scope of this document, it is worth
quickly mentioning what the competitive landscape looks like. It is also worth noting that MSDC
customers are themselves looking into building their own devices based on merchant silicon.
Dell/Force10
Dell/Force10 (formerly just “Force10”) has made its name with high-density gigE and 10G switches.
Primary MSDC focal points from within their portfolio are:
• Z-Series Core Switches, such as the Z9000. Cheap 128x non-blocking 10G ports. Leaf and Spine.
• E-Series Virtualized Core Switching, such as the E600i. 224x non-blocking 10G ports. Spine.
• The C300 Chassis-based Switch. Glorified Leaf, such as an “end of row” Leaf.
Arista
Arista has traditionally been very focused on 3 things: low-footprint/high-density 10G chassis, ultra
low-latency, and modular software. Strengths they bring to the table are:
• 7500. 192x linerate 10G.
• EOS Network Operating System. Complete separation of networking state and route & packet
processing. Extensible and customizable.
• 7150S. 64x 10G linerate ports. SDN-aware.
Juniper
Juniper made its splash into the industry with their M-series pure routers and their unified Network
Operating System, JUNOS. They have traditionally held a large portion of the Service Provider segment,
but have since branched out into MSDCs, namely with their proprietary Q-Fabric. The primary
competitive concerns they bring are:
• Q-Fabric. 6000x 10G ports:
– QFabric Scenario 1 - QFX3500 in standalone mode as an Ethernet switch
– QFabric Scenario 2 - QFX3600 in standalone mode as an Ethernet switch
Brocade
Brocade, formerly Foundry Networks, has long been among the market leaders in the density battle. Their
VCS/VDX family of switches is the foundation of their datacenter switching fabric portfolio. For
example:
• VDX 8770-8, 384x 10G ports.
• 15RU.
• Focusing on flatter “Ethernet fabrics”.
• Virtual Cluster Switching (VCS), similar to Juniper’s Q-Fabric. “Self-healing” and resilient fabrics.
HP
Not to be left out of the large datacenter fabric market, HP has rolled out their 5900 series switch which
provides a low-cost, 64x 10G low-latency ToR platform that competes directly with Cisco’s Nexus 3064.
This appendix includes scripts used for setting up and creating Incast events in the lab. “fail-mapper.sh”
fails relevant Hadoop mappers (VMs), thus causing a shift in traffic. “find-reducer.sh” determines the
location of relevant reducers. “tcp-tune.sh” and “irqassign.pl” help prepare the servers for the lab
environment.
fail-mapper.sh
#!/bin/bash
URL=http://jobtracker.jt.voyager.cisco.com:50030/jobdetails.jsp?jobid=job_201211051628
_
RURL=http://jobtracker.jt.voyager.cisco.com:50030/taskdetails.jsp?tipid=task_201211051
628_
JOB_ID=$1
JOB_ID2=$2
RLOG=_r_000000
MLOG=_m_0000
W=0
counter=0
for i in {0..77};
do
printf -v MID "%03d" $i
wget $RURL$JOB_ID$MLOG$MID
wget $RURL$JOB_ID2$MLOG$MID
done
'rm' task*
diff --suppress-common-lines vmnames1 vmnames2 | grep ">" | sed 's/> //' > vmnames
while [ $W -lt 96 ]
do
wget $URL$JOB_ID
W1=`grep "jobtasks.jsp?jobid=job_201211051628_$JOB_ID&type=map&pagenum=1"
jobdetails.jsp?jobid=job_201
211051628_$JOB_ID | awk -F 'align="right">' '{print $2}' | cut -c 1-5`
rm jobdetails.jsp?jobid=job_201211051628_$JOB_ID
if [ -z $W1 ]; then
W=0
else
W=${W1/\.*}
fi
echo -e "\n currently at ******** $W \n"
done
fi
fi
if (( $i > 10 ))
then
HOSTS=($(cat vmnames | grep r$i-p0$z | cut -c 9-13))
HOSTS2=($(cat vmnames | grep r$j-p0$z | cut -c 12-13 | awk '{printf
"vm-%02d\n", $1+7}'))
fi
for h2 in ${HOSTS2[@]}
do
if [ -z "$WW" ]; then
printf -v WW "virsh destroy $h2"
printf -v WW1 "virsh start $h2"
else
printf -v WW "virsh destroy $h2 ; $WW"
printf -v WW1 "virsh start $h2 ; $WW1"
fi
done
unset WW
unset WW1
done
done
'rm' task*
find-reducer.sh
#!/bin/bash
URL=http://jobtracker.jt.voyager.cisco.com:50030/jobdetails.jsp?jobid=job_201211051628
_
RURL=http://jobtracker.jt.voyager.cisco.com:50030/taskdetails.jsp?tipid=task_201211051
628_
JOB_ID=$1
RLOG=_r_000000
MLOG=_m_0000
W=0
counter=0
while [ $W -lt 100 ]
do
wget $URL$JOB_ID
W1=`grep "jobtasks.jsp?jobid=job_201211051628_$JOB_ID&type=map&pagenum=1"
jobdetails.jsp?jobid=job_201211051628_$JOB_ID | awk -F 'align="right">' '{print $2}' |
cut -c 1-5`
rm jobdetails.jsp?jobid=job_201211051628_$JOB_ID
if [ -z $W1 ]; then
W=0
else
W=${W1/\.*}
fi
date;
echo -e "\n currently at ******** $W \n"
done
EVENODD=`expr $RACK % 2`
echo -e "\n reducer is at Physical host ***** $Reducer $Reducer1 \n"
if [ $RACK -eq 13 -o $RACK -eq 14 ]; then
if [ $EVENODD -eq 1 ]; then
ssh -o StrictHostkeyChecking=no $Reducer.hosts.voyager.cisco.com "date ; tcpdump
-s128 -i eth2 -n -w bla3"
else
RACK=`expr $RACK - 1`
printf -v RACKID "%02d" $RACK
ssh -o StrictHostkeyChecking=no r$RACKID-p$POD.hosts.voyager.cisco.com "date ;
tcpdump -s128 -i eth3 -n -w bla3-eth1"
fi
else
if [ $EVENODD -eq 1 ]; then
ssh -o StrictHostkeyChecking=no $Reducer.hosts.voyager.cisco.com "date ; tcpdump
-s128 -i eth0 -n -w bla3"
else
RACK=`expr $RACK - 1`
printf -v RACKID "%02d" $RACK
ssh -o StrictHostkeyChecking=no r$RACKID-p$POD.hosts.voyager.cisco.com "date ;
tcpdump -s128 -i eth1 -n -w bla3-eth1"
fi
fi
tcp-tune.sh
#!/bin/bash
irqassign.pl
#!/usr/bin/perl
use strict;
use POSIX;
# Open a logfile.
my $log;
open($log, '>>/tmp/irq_assign.log') or die "Can't open logfile: $!";
print $log strftime('%m/%d/%Y %H:%M:%S', localtime), ": Starting run.\n";
my %irqmap = (
79 => 2, # Start of eth1
80 => 200,
81 => 8,
82 => 800,
83 => 20,
84 => 2000,
85 => 80,
86 => 8000,
87 => 2,
88 => 200,
89 => 8,
90 => 800,
91 => 20,
92 => 2000,
93 => 80,
94 => 8000,
95 => 2, # End of eth1
62 => 1, # Start of eth0
63 => 100,
64 => 4,
65 => 400,
66 => 10,
67 => 1000,
68 => 40,
69 => 4000,
70 => 1,
71 => 100,
72 => 4,
73 => 400,
74 => 10,
75 => 1000,
76 => 40,
77 => 4000,
78 => 40, # End of eth0
);
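The portion of irqassign.pl that applies the map is not reproduced here. As a hypothetical illustration, such a map is typically applied by writing each hexadecimal CPU mask to /proc/irq/<irq>/smp_affinity (root required); the short Python sketch below shows the idea using a small subset of the map above as example data.
#!/usr/bin/python
# Hypothetical sketch: apply an IRQ -> CPU-mask map by writing each mask
# (as a hex string) to /proc/irq/<irq>/smp_affinity.  Example subset only.
irqmap = {62: "1", 63: "100", 79: "2", 80: "200"}

def apply_irq_affinity(mapping):
    for irq, mask in sorted(mapping.items()):
        path = "/proc/irq/%d/smp_affinity" % irq
        try:
            with open(path, "w") as f:
                f.write(mask + "\n")
        except IOError as e:
            print "Could not set affinity for IRQ %d: %s" % (irq, e)

if __name__ == "__main__":
    apply_irq_affinity(irqmap)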
VM configuration
for z in {1..16..2};
do
for i in {1..5};
do
printf -v RACK "%02d" $z;
ssh r$RACK-p0$i.hosts.voyager.cisco.com " virsh setvcpus vm-01 4 --maximum
--config;
virsh setvcpus vm-08 4 --maximum --config";
done;
done
for z in {1..16..2};
do
for i in {1..5};
do
printf -v RACK "%02d" $z;
ssh r$RACK-p0$i.hosts.voyager.cisco.com " virsh setmaxmem vm-01 24576000;
virsh setmaxmem vm-08 24576000";
done;
done
for z in {1..16..2};
do
for i in {1..5};
do
printf -v RACK "%02d" $z;
ssh r$RACK-p0$i.hosts.voyager.cisco.com " virsh setvcpus vm-01 4 --config;
virsh setvcpus vm-08 4 --config";
done;
done
for z in {1..16..2};
do
for i in {1..5};
do
printf -v RACK "%02d" $z;
ssh r$RACK-p0$i.hosts.voyager.cisco.com " virsh setmem vm-01 20480000
--config;
virsh setmem vm-08 20480000 --config";
done;
done
Figure F-1 shows a walk-through of how noise traffic is generated by utilizing both IXIA and iptables
on the servers:
1. IXIA sends 6-8Gbps of traffic down each of the 5 links connected to leaf-r2, with ip_dst set to servers
hanging off leaf-r3. ip_src is set to a range owned by the IXIA ports.
2. Since the ip_dsts don’t live off leaf-r2, traffic is attracted to the Spine layer in ECMP fashion.