DC Fundamentals
By Colin Wrightson
Day One: Data Center Fundamentals provides a thorough understanding of all the components that make up a data center solution using Juniper Networks products and
networking technologies. You'll learn how key data center principles fit together by
using an example architecture that is common throughout the book, providing an easy
reference point to gauge different data center solutions.
By the book's end, you'll be able to design your own data center network and, in the process, come to understand why you would favor one technology or design principle over
another. The author points out subtle differences along the way, and provides links to
authoritative content that will help you with the technical specifications of the components, protocols, controllers, and configurations.
Data center architectures are evolving at an exponential rate. With the cloud reaching
maturity and SDN's rise from the trough of disillusionment towards the plateau of
profitability, an update on the new fundamentals of data center architecture is long overdue. Colin Wrightson tackles the topic head-on in this excellent new addition to the Day
One library.
Perry Young, SVP, Tier-1 US Bank, Juniper Ambassador, JNCIP-SEC/SP/ENT, JNCDS-DC
This very timely book is essential reading if you want to keep up with the rapidly changing world of data centers. It takes you all the way from cabling to SDN technology, explaining all the fundamental principles in a well-written, easily digestible format.
Julian Lucek, Author and Distinguished Engineer, Juniper Networks
After years of remaining relatively static, the designs and technologies behind a data
center network are now evolving rapidly. The speed of change makes it difficult to keep
up with the latest architectures and technologies. Colin Wrightson, one of the data center wizards at Juniper Networks, has done an amazing job of explaining these design
choices and technologies in a simple, easy-to-read book. Highly recommended for anyone who is considering implementing a new data center network.
Andy Ingram, VP Data Centers, Center of Excellence, Juniper Networks.
Juniper Networks Books are singularly focused on network productivity and efficiency.
Peruse the complete library at www.juniper.net/books.
ISBN 978-1-941441-39-8
PDF files.
If your device or ebook app uses .epub files, but isn't an Apple
product, open iTunes and download the .epub file from the iTunes
Store. You can now drag and drop the file out of iTunes onto your
desktop and sync with your .epub device.
4 and 6
You have a basic understanding of network design at a campus
level
You work with servers and want to understand the network side
of data centers
Preface
This Day One book provides you with a thorough understanding of all
the components that make up a Juniper Networks data center solution;
in essence, it offers a 10,000-foot view of how everything regarding
Juniper's data center solution fits together. Such a view enables you to
see the value Juniper provides over other vendors in the same space by
glimpsing the big picture.
This book starts by covering the basics and slowly builds upon core
ideas in order to cover more complex elements. Design examples relate
back to an example architecture that is common throughout the book,
thus providing an easy reference point to gauge guideline solutions.
The idea is to allow you to design your own network and, in the
process, come to understand why you would favor one technology or
design principle over another. In order to do that, this book points out
subtle differences along the way:
Chapter One: Common Components starts with the various
common components (products) you can use and where they sit in
the design topology. This is important because of the differences
between merchant silicon and custom silicon, and because
different vendors have different approaches and those approaches
can affect the network and services you may want to implement.
Chapter Two: The Top-of-Rack or End/Middle-of-Row chapter
Chapter 1
Common Components
The first question to pose in any data center design is "What switch
should I use and where?" Juniper Networks, like other vendors,
produces data center switches that meet stringent specifications in
order to fit within particular segments of a network. This chapter
provides an overview of the Juniper Networks switching solution,
allowing you to understand and compare the different port densities,
form factors, and port capabilities of the devices. The book then
moves on to explain the placement of devices in the network and the
different architectures they support.
NOTE
The architecture and different layers within a data center are described in more detail in subsequent chapters.
The QFX5100-48S is a 10-Gigabit Ethernet Enhanced Small Form-Factor Pluggable (SFP+) top-of-rack switch with 48 SFP+ ports and six Quad SFP+ (QSFP+) 40GbE ports. Each SFP+ port can operate as a native 10-Gigabit port or as a 100Mb/1-Gigabit port when 1-Gigabit optics are inserted. Each QSFP+ port can operate as either a 40GbE uplink port or an access port, and can also operate as 4x 10GbE ports using a 4x 10GbE breakout cable. The QFX5100-48S provides full duplex throughput of 1.44Tbps, has a 1U form factor, and comes standard with redundant fans and redundant power supplies. The switch is available with either back-to-front or front-to-back airflow and with AC or DC power supplies. The QFX5100-48S can be used in multiple architectures, such as:
A standalone switch
A spine or leaf in an IP Fabric (covered in later chapters)
A master, backup, or line card in a QFX Virtual Chassis (covered
later)
A spine or leaf device in a Virtual Chassis Fabric (VCF) (covered
later)
A satellite device in a Junos Fusion fabric (covered later)
QFX5100-48T
QFX5100-24Q
QFX5100-96S
The QFX5100-96S is a 10-Gigabit Ethernet Enhanced Small Form-Factor Pluggable (SFP+) top-of-rack switch with 96 SFP+ ports and eight 40GbE Quad SFP+ (QSFP+) ports. Each SFP+ port can operate as a native 10 Gbps port or as a 100Mb/1 Gbps port. The eight QSFP+ ports can operate at native 40 Gbps speed or can be channelized into four independent 10 Gbps ports, taking the total number of 10GbE ports on the switch to 128. The QFX5100-96S switch has a 2U form factor and comes as standard with redundant fans, redundant power supplies, either AC or DC power support, and the option of either front-to-back or back-to-front airflow. The QFX5100-96S can be used in multiple architectures, such as:
A standalone switch
A spine or leaf in an IP Fabric
A member of a QFX Virtual Chassis
A spine or leaf device in a VCF
A satellite device in a Junos Fusion fabric
QFX5100-24Q-AA
The data sheets and other information for all of the QFX5100 Series outlined above can be found at http://www.juniper.net/assets/us/en/local/pdf/datasheets/1000480-en.pdf.
The QFX5200-32C comes standard with redundant fans and redundant power supplies supporting either AC or DC, and is available with
either front-to-back or back-to-front airflow. The QFX5200 can be
used in multiple architectures, such as:
A standalone switch
A spine or leaf in an IP Fabric
A satellite device in a Junos Fusion fabric
MORE?
The datasheets and other information for all of the QFX5200 Series outlined above can be found at http://www.juniper.net/us/en/products-services/switching/qfx-series/
30 ports of 100GbE
The QFX10000-60S-6Q line card provides 60 ports of 1/10GbE
An aggregation device in a Junos Fusion fabric (covered in later sections)
QFX10016
30 ports of 100GbE
The QFX10000-60S-6Q line card provides 84 ports of 10GbE,
An aggregation device in a Junos Fusion fabric (covered in later sections)
MORE?
For more detailed information about the QFX Series and data centers, see the Juniper/O'Reilly book, The QFX10000 Series, at http://www.juniper.net/books.
With so many iterations of the QFX Series, it's worth discussing the
different silicon types used in these products as it can have a bearing on
their placement in the data center and their capabilities once installed.
the market longer than a merchant silicon version (to recoup the initial
production costs) and that you need to consider future technology
shifts that may happen and their effects on both types of products.
There are pros and cons to both approaches, so I suggest you consider
using both merchant and custom silicon, but in different positions
within the network to get the best results.
Network designers tend to use the following rule of thumb: use
merchant silicon at the leaf/server layer where the Layer 2 and Layer 3
throughput and latency are the main requirements, with minimal buffers, higher port densities, support for open standards, and innovation in the switch's OS software. Then, at the spine or core, where all the
traffic is aggregated, custom silicon should be used, as the benefits are
greater bandwidth, port density, and larger buffers. You can also
implement more intelligence at the spine or core to allow for other
protocols such as EVPN for Data Center Interconnect (DCI), analytics
engines, and other NFV-based products that may need more resources
than are provided on a merchant silicon-based switch.
Chapter 2
Architectures
Top-of-Rack
In top-of-rack (ToR) designs one or two Ethernet switches are
installed inside each rack to provide local server connectivity.
While the name top-of-rack would imply placement of the switch
at the top of the physical rack, in reality the switch placement can
be at the bottom or middle of the rack (top-of-rack typically
provides an easier point of access and cable management to the
switch).
Figure 2.1
Figure 2.2
Figure 2.3
In the other rack the same is done again, and 50% of the server connections cross into the other rack and vice versa. As mentioned earlier, this is not the most elegant of designs, and you would need to factor in extra cable length to bridge the gap between racks, but it is more practical and less expensive than a two-switch-per-rack design.
And this situation brings up a good point: not every data center
installation or design is perfect, so if the last example works for you
and you are aware of its limitations and potential issues, then implement accordingly.
Table 2.1    Advantages and Limitations
End-of-Row
The end-of-row design (EoR) was devised to provide two central
points of aggregation for server connectivity in a row of cabinets as
opposed to aggregation within each rack as shown in the top-of-rack
design. Each server within each cabinet would be connected to each end-of-row switch cabinet either directly via RJ45 or fiber, via DAC if the length is not too great, or via a patch panel present in each rack.
Figure 2.4
EoR Design
Table 2.2    Advantages and Limitations
Chapter 3
Cabling
Copper
Copper cabling is often referred to as Cat5e and Cat6, which refers to
RJ45-terminated cabling over twisted pairs that supports 10/100/1000Mb and 10GbE connection speeds at distances up to 100m, depending on the cable's rated frequency. See Table 3.1 for detailed information.
In the context of a top-of-rack design solution, one would expect
copper to be implemented within each rack. In an end-of-row design
solution the copper cabling could be implemented between the servers
and switches, but the distance would need to be considered as well as
the size of the cabling plant. Also, any future upgrades might mean
removing this cabling to support higher speeds, which is always a
consideration.
But if this is what's required, then the QFX5100 series supports it in
the form of the QFX5100-48T switch. You also have the option of
using the EX4300 and EX3400 series.
Table 3.1 provides an overview of the different connector types, cable
standards, cabling, distances each cable can cover, the frequency
typically used, and where in the design the cable is used, as well as the
Juniper Networks supported switch series, the EX or QFX Series.
Table 3.1

Connector/Media                      Max Distance     Frequency (MHz)   Recommended Placement and Supported Switch
1000BASE-T Copper CAT5e              328ft (100m)     1-100             ToR in rack: QFX5100-48T, EX4300 and EX3400
1000BASE-T Copper CAT6               328ft (100m)     1-250             ToR in rack: QFX5100-48T, EX4300 and EX3400
1000BASE-T Copper CAT6A (STP/UTP)    328ft (100m)     1-500             ToR in rack: QFX5100-48T, EX4300 and EX3400
100BASE-TX CAT5e                     328ft (100m)     1-100             ToR in rack: QFX5100-48T, EX4300 and EX3400
100BASE-TX CAT6                      328ft (100m)     1-250             ToR in rack: QFX5100-48T, EX4300 and EX3400
100BASE-TX CAT6A (STP/UTP)           328ft (100m)     1-500             ToR in rack: QFX5100-48T, EX4300 and EX3400
10GBASE-T CAT7                       328ft (100m)     1-600             ToR in rack: QFX5100-48T
10GBASE-T CAT6A                      328ft (100m)     1-500             ToR in rack: QFX5100-48T
10GBASE-T CAT6 (UTP)                 98ft (30m)       1-250             ToR in rack: QFX5100-48T
10GBASE-T CAT6 (STP)                 98ft (30m)       1-250             ToR in rack: QFX5100-48T
Fiber
There are two classifications for optical fiber: single-mode fiber-optic
(SMF) and multimode fiber-optic (MMF).
In SMF, light follows a single path through the fiber, while in MMF it takes multiple paths, resulting in differential mode delay (DMD).
In the context of a top-of-rack design solution one would expect MMF
to be used within each rack and between connecting racks up to the
aggregation layer, if the distance is relatively short (up to 400m). In an
end-of-row design solution it could be implemented between the
servers and switches. If distance permits, then MMF could also be
implemented up to a core or aggregation layer, and if not, then
you could implement SMF.
All of the switches in the QFX Series support fiber interfaces. To make
things a little easier Table 3.2 lists an overview of the different connector types, cable standards, cabling types, the distances each can cover,
the wavelengths used, where in the design the cable is used, and the
Juniper Networks supported switch series.
Table 3.2

Connector/Media                        IEEE Cable Standard         Max Distance                                             Wavelength          Recommended Placement and Supported Switch
1000BASE-T                             RJ45 Copper SFP (RJ-45)     328ft (100m)                                                                 ToR in rack: QFX5100-48S & 96T
1000BASE-SX / 1000BASE-LX / 10G-USR    LC-MMF / LC-SMF / LC-MMF    1-160 / 1-200 / 1-400 / 1-500 / 1-500 / 1-400 / 1-500    840nm to 860nm      ToR & EoR within rack: QFX5100 Series
10GBASE-SR                             LC-MMF                      See note *                                               840nm to 860nm      QFX5100 & QFX10000 Series
10GBASE-ER                             LC-SMF                                                                               1530nm to 1565nm    QFX5100 & QFX10000 Series
40GBASE-SR4                                                        See note *                                               840nm to 860nm      QFX5100 & QFX10000 Series
40GX10G-ESR4                                                       See note *                                               840nm to 860nm      QFX5100 & QFX10000 Series
40GBASE-LR4                            LC-SMF                      10 Km (6.2 miles)                                        1260nm to 1355nm    QFX5100 & QFX10000 Series
40G-LX4                                Dual LC SMF & MMF           OS1 2 Km (1.24 miles)                                                        QFX5100 & QFX10000 Series
100GBASE-SR4                                                       See note *                                               840nm to 860nm      QFX10000 Series
100GBASE-LR4                           LC or Dual LC SMF           10 Km (6.2 miles)                                                            QFX10000 Series
DAC or Twinax
DAC or Twinax cabling is a copper cable that comes in either an active or passive assembly and connects directly into an SFP+ (small form-factor pluggable plus) or QSFP+ (quad small form-factor pluggable plus) housing. An
active DAC cable has amplification and equalization built into the
cable assembly to improve the signal quality, while a passive DAC
cable has a straight wire with no signal amplification built into the
cable assembly. In most cases a rule of thumb is that for distances
shorter than 5 meters you go with a passive DAC, and greater than 5
meters with an active DAC.
Due to its low cost, in comparison to fiber, DAC makes perfect sense
for short runs inside the rack in a top-of-rack solution, and if the
distance between racks is less than 10 meters, between racks in an end-of-row solution as well.
The SFP+ DAC cable allows for serial data transmission up to 10.3Gb/s, which makes it a low-cost choice for very short reach applications of 10GbE or 1-8G Fibre Channel.
QSFP+ 40GbE DAC allows for bidirectional data transmission up to
40GbE over four lanes of twin-axial cable, delivering serialized data at
a rate of 10.3125Gbit/s per lane.
QSFP+ 100GbE DAC cable allows for bidirectional data transmission
up to 100GbE over four lanes of twin-axial cable, delivering serialized
data at a rate of 28Gbit/s per lane.
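The per-lane rates quoted above line up with the nominal Ethernet speeds once the 64b/66b line coding used by 10GbE and 40GbE is taken into account. The quick Python check below is illustrative only; the encoding assumption is mine, not the book's.

    LINE_CODE_EFFICIENCY = 64 / 66   # 64b/66b encoding overhead

    def payload_gbps(lanes, raw_lane_gbps):
        """Usable Ethernet payload rate for a multi-lane DAC or optic."""
        return lanes * raw_lane_gbps * LINE_CODE_EFFICIENCY

    print(payload_gbps(1, 10.3125))  # SFP+ 10GbE DAC  -> 10.0
    print(payload_gbps(4, 10.3125))  # QSFP+ 40GbE DAC -> 40.0
    # QSFP28 100GbE DAC: 4 lanes x 28 Gb/s = 112 Gb/s of raw lane capacity,
    # comfortably above the 100 Gb/s of Ethernet payload it carries.
    print(4 * 28)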
Table 3.3 outlines the different types of DAC cables you can use in a
data center. This table differs from the preceding tables on copper and
fiber cabling because it includes the bend radius. The bend radius is an
important point to keep in mind because like many cables, these cables
are sensitive to adverse bending, which can affect data rates.
Table 3.3

Connector/Media             IEEE Cable Standard          Max Distance      Minimum Cable Bend Radius    Recommended Placement and Supported Switch
QFX-SFP-DAC-1M 10GbE        SFP permanently attached.    1 m (3.3 ft)                                   ToR in rack: QFX5100 & QFX10000 Series
QFX-SFP-DAC-2M 10GbE        SFP permanently attached.    2 m (6.6 ft)                                   ToR in rack: QFX5100 & QFX10000 Series
QFX-SFP-DAC-3M 10GbE        SFP permanently attached.    3 m (9.9 ft)                                   ToR in rack: QFX5100 & QFX10000 Series
QFX-SFP-DAC-5M 10GbE        SFP permanently attached.    5 m (16.4 ft)                                  ToR in rack: QFX5100 & QFX10000 Series
QFX-SFP-DAC-7MA 10GbE       SFP permanently attached.    7 m (23 ft)                                    ToR in rack: QFX5100 & QFX10000 Series
QFX-SFP-DAC-10MA 10GbE      SFP permanently attached.    10 m (32.8 ft)                                 ToR in rack: QFX5100 & QFX10000 Series
QFX-QSFP-DAC-1M 40GbE       SFP permanently attached.    1 m (3.3 ft)                                   ToR in rack: QFX5100 & QFX10000 Series
QFX-QSFP-DAC-3M 40GbE       SFP permanently attached.    3 m (9.9 ft)                                   ToR in rack: QFX5100 & QFX10000 Series
QFX-QSFP-DAC-5M 40GbE       SFP permanently attached.    5 m (16.4 ft)                                  ToR in rack: QFX5100 & QFX10000 Series
QFX-QSFP-DAC-7MA 40GbE      SFP permanently attached.    7 m (22.9 ft)                                  ToR in rack: QFX5100 & QFX10000 Series
QFX-QSFP-DAC-10MA 40GbE     SFP permanently attached.    10 m (32.8 ft)                                 ToR in rack: QFX5100 & QFX10000 Series
QFX-QSFP28-DAC-1M 100GbE    SFP permanently attached.    1 m (3.3 ft)                                   ToR in rack: QFX5200 & QFX10000 Series
QFX-QSFP28-DAC-3M 100GbE    SFP permanently attached.    3 m (9.9 ft)                                   ToR in rack: QFX5200 & QFX10000 Series
There are also a few other choices of cable types at 40GbE, including
Active Optical Cables, which can go up to 30 meters and can provide
for another option in both top-of-rack and end-of-row solutions.
Another solution, if you don't require 40GbE on day one, is to use
40GbE to 4x 10GbE breakout cables. These allow you to break a
native 40GbE interface into 4x 10GbE SFP or DAC interfaces.
If there were a best practice design, then it would be as simple as DAC/
Active Optical Cables and RJ45 within racks, and fiber between racks.
MORE? Details on the cables supported for all of the QFX products mentioned
Chapter 4
Oversubscription
Oversubscription Design
The starting point in designing any network is to understand the
requirements that have been set and design from that point forward.
To make this hypothetical as real as possible and provide a foundation
for the rest of the book, let's consider the following client requirements:
The client has two greenfield data centers (DC1 & DC2) that are 40Km
apart and will be connected to each other via the client's MPLS WAN
currently running on Juniper MX Series.
DC1:
chassis
Client would like an end-of-row/middle-of-row solution
Oversubscription ratio of 1:4 or lower (lower is preferred)
DC1 Design
Starting with DC1, each rack houses 14 servers. Each server has 3x
10GbE fiber connections plus 1x 1GbE RJ45 connection for out-of-band (OoB) management.
Total 10GbE per rack = 3 x 14 (number of 10GbE per server times the number of servers) = 42, or 420Gb of bandwidth
Total 1GbE per row = 1 x 14 x 10 (1GbE per server times the number of servers times the number of racks per row) = 140
NOTE
The ratio of 10GbE and 1GbE per row was used just in case end-of-row is offered instead of top-of-rack at DC1.
From a subscription point of view, a rack has a total bandwidth of 420
GbE. With a 4:1 subscription ratio that would be 420 / 4 (total bandwidth divided by the subscription ratio) = 105 GbE as the uplink
capacity.
To hit the 105GbE mark or better you need to use 40GbE uplinks. You could go with 3 x 40GbE, which would equal 120GbE and lower the subscription ratio to 3.5:1, but this would mean three spine or aggregation layer switches to terminate the uplinks, which is never ideal. If you propose 4 x 40GbE per top-of-rack switch, then you have 160GbE of uplink capacity, which would give us a subscription ratio of roughly 2.6:1. This would also mean you could have either two or four spine-layer switches per row, with either 2 x 40GbE to each spine or, if we go with four spine-layer switches, a single 40GbE connection to each.
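The arithmetic above is easy to script. The sketch below is a hypothetical Python helper, not taken from the book, that reproduces the DC1 numbers: 14 servers per rack at 3 x 10GbE gives 420Gb of access bandwidth, the 4:1 target sets a 105Gb minimum uplink, and 4 x 40GbE of uplinks lands at the roughly 2.6:1 ratio discussed.

    def oversubscription(access_gbps, uplink_gbps):
        """Ratio of access bandwidth to uplink bandwidth (2.62 means 2.62:1)."""
        return access_gbps / uplink_gbps

    # DC1 rack: 14 servers, each with 3 x 10GbE access connections
    access = 14 * 3 * 10            # 420 Gb of server-facing bandwidth
    target_ratio = 4                # client asked for 4:1 or lower
    print(access / target_ratio)    # 105 Gb -> minimum uplink capacity needed

    for uplinks in (3, 4):          # candidate 40GbE uplink counts per ToR
        print(uplinks, round(oversubscription(access, uplinks * 40), 2))
    # 3 -> 3.5 (120 Gb of uplink), 4 -> 2.62 (160 Gb of uplink)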
Once you know what your interface specifications are you can match
them against the available products. The QFX5100-48S provides 48
ports of 10GbE and 6 x 40GbE uplinks, so it's perfect.
NOTE
At this point a good designer should ask the client about resiliency.
Would a single switch per rack be okay, or would the client rather have
two switches per rack, which would increase cost and waste ports?
This is where an end-of-row solution could be a more cost-efficient
answer, as you get both the resiliency and better port utilization. For
the sake of brevity, let's assume that the client is happy with the single
top-of-rack per rack and a top-of-rack-based architecture.
Figure 4.1
Figure 4.1 includes a switch at the bottom of the rack to terminate the
OoB 1GbE connections from the servers. Best practice would suggest
that the OoB connection on the back of the top-of-rack switch is also
connected to this bottom switch, thus providing access to the switch
outside of the normal data path while allowing network management
traffic and analytics to flow without interruption.
Let's now move on to the spine layer, or aggregation points, per row. The number of 40GbE connections per top-of-rack switch is four, and there are ten racks, so our total number of 40GbE connections per row is 40 x 40GbE. You have the option at this point to either position
two spine layer switches, so 20 x 40GbE per spine, or four, which would
take that down to 10 x 40GbE per spine. But in order to make this
decision you have to work out the subscription ratio of each row.
There are two ways of approaching this: either oversubscribe by a factor of two again, or try to keep the same subscription ratio as prescribed for the leaf to spine. If you reduce the 400GbE (the row's uplink capacity at the prescribed 4:1 ratio, given the 40 x 40GbE = 1.6Tb entering the spine layer) by a factor of two, you would have to provision 200GbE of uplink capacity from the spine to the core. If you propose two spines, that would be 100GbE per spine switch, and if you propose four spines, that would be 50GbE per spine.
With two spines you could propose 3 x 40GbE per spine or with four
spines you could propose 2 x 40GbE. In each of these cases you would
still be below the initial subscription ratio.
The other option is to propose 100GbE uplinks to the core. With a
two-spine solution you could propose 100GbE per spine and with a
four-spine design you could propose 1 x 100GbE per spine, and, in doing so, you could keep the same subscription ratio as
defined between the leaf and the spine. So the client would see no drop
in bandwidth northbound of the top-of-rack switches within each rack.
From a QFX Series point of view, if you go down the 40GbE route then
today's choice is the QFX5100-24Q, with its 24 ports of 40GbE plus
additional expansion. But if you want the option to do both 40GbE and
100GbE, then the QFX10002, with its flexibility to support both
solutions, and its built-in upgrade path from 40GbE to 100GbE, would
be the better option. The QFX10002 would also provide additional
benefits to the client in regard to its buffer capacity of 300Mb per port
and analytics feature set.
The other option would be the QFX5200 model that also has 40 and
100GbE connectivity, but has a much smaller buffer capacity than the
10K. In each case it provides you with options to present back to the
client.
The choice to present either 200GbE or 400GbE uplink capacity from
each row to the client, based on their specifications, which would mean
a four-rack design per row, would look like Figure 4.2.
Figure 4.2
And the logical subscription design of Figure 4.2 would look like the
one in Figure 4.3, below.
Figure 4.3
Now that the number of uplinks from each row is known, you can
work out what your core layer product should be. The core layer acts
as the point of aggregation between the rows of spine and leaves while
providing onward connectivity to the WAN.
As outlined previously, there are five rows of ten racks. Each row of
racks will have either 200GbE or 400GbE of uplink connectivity. So, at
a minimum, you would have to connect 10 x 100GbE links per core.
But it would also be prudent to make sure you can support the higher capacity, in case the client wants it at some stage in the near future.
From the Juniper playbook, the QFX10000 Series precisely fits the bill
as it supports 100GbE. The choice is between an eight-slot chassis and
a 2RU fixed-format switch. The 2RU platform provides either 12 or 24
ports of 100GbE, while the 13RU chassis can provide up to 240 x
100GbE via a series of line cards. In that case you could have either the
30C line card supporting 30 x 100GbE, or the 36Q, which supports 12
x 100GbE.
The two options are similar in price, but here the 2RU unit is more
than sufficient for the proposed client solution. It provides 100% scale
even if the client wants a higher capacity bandwidth solution, and it
takes up a smaller amount of space, uses less power, and will require
less cooling. On that basis it would seem strange to pick a larger
chassis when 70% of it would be unused, even if the higher bandwidth
solution is chosen.
Utilizing the QFX10002-72Q would mean our rack design would look
similar to Figure 4.4.
Figure 4.4
DC1 Proposed Data Center Rack Design Core, Spine, and Leaf
Figure 4.5
The subscription ratio and onward connectivity from the core to the WAN and its MX Series layer can be treated slightly differently, since it's generally known that around 75% of your traffic stays local to the data center. This has become a common traffic profile as servers have moved to a virtual basis and applications, along with their dependencies, have become more distributed in nature. As such, you end up with a 75/25% split in how your traffic traverses your DC, with 75% local to the leaf/spine/core and 25% traversing the WAN.
This means you can provision a higher subscription ratio out to the
WAN router, which in turn means smaller physical links. Again the
choice comes down to 40GbE or 100GbE. While the MX Series
supports both, the cost of those interfaces differs a lot as WAN router
real estate is quite expensive compared to switching products.
In either case you have the option of 40GbE or 100GbE connectivity
from each core switch to the WAN layer. Going 100GbE would be the
better option but it may mean additional costs that the client wasn't expecting. If that's the case then it's easy to provision multiple 40GbE
connections instead. In either case, the logical subscription solution will
be similar to that shown in Figure 4.6.
Figure 4.6
DC2 Design
The second data center presents a different challenge because the client
has stated they would prefer an end-of-row/middle-of-row architecture.
In review of the requirements, DC2 will have five rows of ten racks, the same as DC1, but with each rack housing five blade chassis, each chassis connecting at 4 x 10GbE. The client would also like
a 4:1 or lower subscription ratio, similar to DC1.
So the calculations would be:
Total per rack = 4 x 5 (number of 10GbE per chassis times the number of chassis) = 20
Total per rack is 20 x 10GbE = 200GbE
Per row = 20 x 10 (number of 10GbE per rack times the number of racks) = 200 x 10GbE
As opposed to DC1, all of the connections in DC2 have to be connected to the two end-of-row switches. As stated in an earlier chapter, the
most efficient way would be to provision two chassis, with one at
either end of the rows. Split the connections from each rack by two, so
100 x 10GbE to one end of row and 100 x 10GbE to the other end of
row as shown in Figure 4.7, where each cable from each rack is 10 x 10
GbE.
Figure 4.7
And from an oversubscription point of view, our first hop from the
servers to the end-of-row switches will be at line rate as shown in
Figure 4.8 below.
Figure 4.8
With this in mind the first point of concern for oversubscription is from
the leaf layer to the spine layer. With 100 x 10GbE per leaf, or 1Tb, an oversubscription ratio of at least 4:1 would mean 1Tb divided by 4, which equals 250GbE of uplink connectivity to the spine layer.
The ideal product for the leaf layer would be the QFX10008 chassis.
It has eight slots, so plenty of room for expansion if additional blade
servers are installed into the racks, and it also has a range of line cards
to suit any combination of connectivity. The main line card of interest
is the 60S-6Q, which has 60 ports of 10GbE plus six extra uplink ports
supporting either 6 x 40GbE, or 2 x 100GbE. You could take the 100 x
10GbE rack connections, divide these over two 60S line cards, and
then use the two 100GbE uplink ports from each line card to connect
to the spine layer. This would drop the oversubscription ratio by half,
as the ratio would be based on 50 x 10GbE ports per line card and 2 x
100GbE uplinks. So, 500 divided by 200 (total bandwidth of incoming
connections / total bandwidth of uplinks) = 2.5 to 1, well within the
client's oversubscription ratio.
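The same arithmetic applies per line card. The short illustrative Python fragment below (not from the book) splits the 100 x 10GbE rack-facing connections across two 60S-6Q line cards, each with 2 x 100GbE uplinks, and confirms the 2.5:1 figure.

    # DC2 end-of-row leaf: 100 x 10GbE from the racks, spread over two
    # QFX10000-60S-6Q line cards, each using 2 x 100GbE uplinks to the spine.
    ports_per_card = 100 // 2                  # 50 x 10GbE per line card
    access_per_card = ports_per_card * 10      # 500 Gb of incoming bandwidth
    uplink_per_card = 2 * 100                  # 200 Gb of uplink bandwidth
    print(access_per_card / uplink_per_card)   # 2.5 -> a 2.5:1 ratio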
So our logical oversubscription design would look like the one illustrated in Figure 4.9.
Figure 4.9
The effect of provisioning 4 x 100GbE per leaf is that the spine layer
has to accommodate 8 x 100GbE per row and with ten rows that's 80
x 100GbE across two spine switches. Given the quantity of 100GbE
that's required, the choice of spine layer device is made for us in the
guise of the QFX10008 chassis.
The most suitable line cards are either the 36Q, which supports 12 ports of 100GbE, or the 30C, which supports 30 ports of 100GbE. Both line cards also provide breakout options for 40GbE and 10GbE, so you don't have to provision additional line cards for those connectivity options. If you remove cost concerns from this design, then the 30C
line card is the obvious choice because you can have two line cards per
spine, providing 60 x 100GbE ports. But when you factor cost concerns into this design, then it might be slightly more economical to
place four 36Q line cards per spine than two 30Cs.
So the physical diagram for DC2 would look like Figure 4.10.
Figure 4.10
Figure 4.11
Notice that both Figure 4.9 and Figure 4.11 consolidate the number of
layers by using an end-of-row design. The leaf nodes are now in the
end-of-row position and the spine layer now moves up to become what
was the core layer. So the principle of spine and leaf stays the same
regardless of whether you use top-of-rack or end-of-row design.
DC2 The Difference Between DC1 and DC2 Spine and Leaf
If you follow the same principle for connectivity to the WAN layer as
designed for DC1, whereby 75% of the traffic would be local to the
spine and leaf and 25% over the WAN, then the connectivity would
consist of either several 40GbE interfaces from both spines to the
WAN layer, or you could provision 100GbE, but again, that's an
option to offer to the client.
In each case, the logical oversubscription solution will look like the one
shown in Figure 4.12.
Figure 4.12
So now that the designs for DC1 and DC2 are established, let's move
on to logical design and consider the different options available using
the same client requirements as stated at the beginning of this chapter.
Chapter 5
Fabric Architecture
Figure 5.1
Traditional MC-LAG
Our starting point is multichassis link aggregation (MC-LAG), which enables a client device to form a logical link aggregation group (LAG) interface between two MC-LAG
peers. An MC-LAG provides redundancy and load balancing between
the two MC-LAG peers, multi-homing support, and a loop-free Layer
2 network without running STP.
In Figure 5.2, on one end of an MC-LAG is an MC-LAG client device,
such as a server or switch, which has one or more physical links in a
LAG. This client device uses these links as a single LAG and traffic is distributed
across all links in the LAG. On the other side of the MC-LAG, there
are two MC-LAG peers. Each of the MC-LAG peers has one or more
physical links connected to a single client device.
Figure 5.2
Basic MC-LAG
Figure 5.3
Figure 5.4
But how does this relate to the DC1 and DC2 scenarios? The quick
answer is: it depends. Recall that in earlier chapters, our two options
were spine and leaf in the form of a top-of-rack and a spine and leaf in
an end-of-row topology.
MC-LAG relates to Layer 2-based solutions, and as such, if your proposed network is going to have Layer 2 VLANs present at both the leaf layer and the spine layer, in both solutions, then you are going to need MC-LAG at the spine layer for DC1 and at either the spine or leaf layer for DC2. As Figure 5.5 shows, the spine layer switches would have to be interconnected to each other for MC-LAG to work. It also means your
MC-LAG setup for every Layer 2 domain is present at a central point in
the network, as opposed to distributed between multiple leaf devices,
which is possible but can be messy.
Figure 5.5
With the DC2 logical architecture you have the option to implement MC-LAG one step closer to the server layer, at the leaf layer, as shown in Figure 5.6.
Figure 5.6
Now the question is, should you implement MC-LAG in either of these
scenarios? I would suggest not. Solutions like VCF, Junos Fusion, and a
Layer 3 IP Clos with a Layer 2 overlay would be far better solutions,
and they scale considerably further and with less management overhead.
Where MC-LAG setups do become relevant is when you don't want a
proprietary fabric or an IP fabric but need a failover architecture for a
small deployment of QFX Series in a data center. In this instance,
MC-LAG would be appropriate and is fully supported across all QFX
Series platforms.
MORE? For more detailed information on MC-LAG and the QFX Series refer to
Virtual Chassis
Juniper released Virtual Chassis on the EX4200 Series Ethernet Switch
in 2009 as an alternative to the stacking technologies that were, and are
still, present. The basic principle in Virtual Chassis is to take the benefits
of a single control plane in a chassis and virtualize it over a series of
distributed switches interconnected via a cable backplane. It allows the
management of up to ten switches to be simplified via a single management interface, so you have a common Junos OS version across all
switches, a single configuration file, and an intuitive chassis-like slot and
module/port interface numbering scheme.
The design is further simplified through a single control plane and the
ability to aggregate interfaces across Virtual Chassis switch members as
you would with a normal chassis.
Switches within a Virtual Chassis are assigned one of three roles: master RE (Routing Engine), backup RE, and line card. As the name implies, the
master RE is the main routing engine for the Virtual Chassis. It acts as
the main point of configuration and holds the main routing table, which
is copied to each Virtual Chassis member to provide local, or distributed, switching and routing.
The backup provides a backup to the master RE and it is ready to take
over mastership of the Virtual Chassis in the event of a failure of the
master. The backup maintains a copy of the switching and routing tables
and when an update to the master is enacted, then the backup RE
automatically updates its tables. Think of it as an active/active sync.
Figure 5.7
All other switches are classed as line cards and in the event a master RE
goes down and the backup RE takes over mastership, then a switch
member classed as a line card automatically becomes the backup RE. In
reality, when the old master RE has resolved its issue, it should come
back as either the master or new backup. You have the flexibility to
select the best option depending on your needs.
The special sauce that makes all this possible is Juniper's proprietary protocol called, unsurprisingly, Virtual Chassis Control Protocol (VCCP). It's based on Intermediate System-to-Intermediate System (IS-IS), a link-state protocol, and works on the principle of discovering, maintaining, and then flooding topology information so that traffic takes the shortest paths across the Virtual Chassis to all members, with the most efficient routes chosen on a flow-by-flow basis. VCCP is not user-configurable and operates automatically on both the rear-panel and front-facing Virtual Chassis ports.
So how does Virtual Chassis apply to this book's data center solution?
Besides the EX Series lineage and tradition, Virtual Chassis is also
supported on the QFX5100 Series and that gives you the option to
contrast a Virtual Chassis for both the top-of-rack and end-of-row
data centers.
In fact, the ability to mix and match different switch types in a single
Virtual Chassis is a fundamental feature and Virtual Chassis supports
these different Juniper Networks switching platforms:
EX4200
EX4550
EX4300
EX4600
QFX5100
The EX4200 and EX4550 can be part of the same Virtual Chassis, and
the EX4300 and EX4600 can be part of the same Virtual Chassis, but
you can also mix the EX4300 and the QFX5100 in the same Virtual
Chassis, which could be an option in our DC1 and DC2 to support
100/1000Mbps interfaces as well as 10GbE in the same virtual switch.
Let's investigate the design option as shown in Figure 5.8.
Figure 5.8
For DC1, a Virtual Chassis could run across the entire row of top-of-rack switches on a rack-by-rack basis. This would give a common fabric at the rack layer, while allowing traffic to switch locally without having to traverse the spine layer, as all members of the Virtual Chassis are interconnected (hence the Virtual Chassis flow in Figure 5.8). The oversubscription ratio would stay the same and might even be reduced slightly due to the local switching at the rack layer.
But the cabling costs go up and you have to provision additional cables
between each of the switches. Also, traffic isn't using the most efficient route to get to its destination, as traffic from top-of-rack 1 would need to pass through all of the other switches to get to top-of-rack 10. This is due to Virtual Chassis assigning a higher weighting in its calculation to that internal path than to sending the traffic externally up to the spine and down to top-of-rack 10, even though that's the more efficient route. The other negative
factor is scale. While the requirements are for ten switches over ten
racks, what if you need to introduce more switches, or create a secondary Virtual Chassis in the same racks? Either situation would just add
to the complexity of the solution.
This is why Virtual Chassis Fabric was introduced, to allow you to
take the best elements of Virtual Chassis and institute better scale and
traffic distribution. It's discussed in the next section.
For DC2, Virtual Chassis only works well when you introduce a
stack-based solution into the two end-of-row cabinets as opposed to a
chassis, as shown in Figure 5.9. The limitations still exist, however,
which means that while a small implementation of Virtual Chassis
could work, you are still going to be limited by the backplane capacity.
In a Virtual Chassis the backplane is made up of a number of Virtual Chassis port (VCP) links, which could lead to a higher internal oversubscription because your traffic distribution won't take the most efficient path, as traffic potentially still needs to pass through multiple switches to reach a destination.
Figure 5.9
Figure 5.10
single virtual MAC and IP gateway. All other switches in the Virtual
Chassis Fabric become line cards of the master. There's a single point of CLI through the master, and there's a distributed forwarding plane so every switch has a copy of both the Layer 2 and Layer 3 tables.
Figure 5.11
With Virtual Chassis Fabric you also gain the option of plug-and-play implementation. In the existing Virtual Chassis, you had to designate the master and backup, and then you could pre-provision the other members of the Virtual Chassis through configuring their serial numbers into the master so the master knew which line cards would be members. That option still exists, but you also have the option to pre-provision just the master and backup at the spine layer; any other nodes that are directly connected to any of the spine nodes are then allowed to join the Virtual Chassis automatically without pre-provisioning their serial numbers. In this mode, leaf nodes can be brought into the Virtual Chassis Fabric without any user intervention; their configuration, VCP port conversions, and dynamic software updates will be handled automatically through the use of the open standard LLDP (Link Layer Discovery Protocol).
Figure 5.12
As you can see in Figure 5.12, the links between leaf and spine now
become Virtual Chassis Ports and allow the Virtual Chassis Fabric to
become a single fabric over which traffic will be distributed. Northbound connectivity from the Virtual Chassis Fabric to the core is via
the spine layer advertised with the single IP gateway for the Virtual
Chassis Fabric. The purpose of the core layer in this design is to
provide a point of aggregation for the Virtual Chassis Fabric to
connect to the other Virtual Chassis Fabrics present in the other rows.
You can also LAG the links up to the core to allow traffic to be distributed to both spines as shown in Figure 5.13.
Figure 5.13
This design can then be replicated over the other rows of racks allowing what were 50 racks of switches to be virtualized into five IP
addresses and five points of management for the whole of DC1.
Virtual Chassis Fabric also provides full support for SDN controller-based solutions integrated with virtual servers, which is covered in Chapter 9.
For DC2, which uses an end-of-row based solution, Virtual Chassis
Fabric becomes a little more challenging. With all of the switches
located in single racks at each end of the row you would need to
implement a spine in each row and then interconnect each leaf across
the rows to reach each spine (as shown in Figure 5.14) to allow the
Virtual Chassis Fabric to form. As noted earlier, for a Virtual Chassis
Fabric to form every leaf needs to be directly connected to each spine
layer. This would increase cabling issues and costs. The other option
would be to implement a Virtual Chassis Fabric-per-end-of-row, with
two spines and rest of the switches as leafs, but stacked. That would
provide a single point of management the same as a chassis would
provide, but would require a major increase in cabling as you would
have to follow the same principles of leaf and spine connectivity but
within a single rack. In addition, the oversubscription ratio is different for DC2, and as such you would need to increase the cabling at 40GbE to make a Virtual Chassis Fabric work.
Figure 5.14
This is why you need to work through the pros and cons for each
scenario. Virtual Chassis Fabric works really well for a top-of-rack-based solution, but for end-of-row it becomes a little more problematic, and I suspect it would still be better to position a chassis for the end-of-row solution as opposed to individual switches.
Our next architecture takes many of the principles of Virtual Chassis
and Virtual Chassis Fabric and increases the number of switches you can virtualize.
Junos Fusion
Junos Fusion is a new technology from Juniper, which, at its most basic,
takes the elements of Virtual Chassis Fabric, such as the single management domain, the removal of spanning tree, and the full bandwidth use,
and expands the support up to 128 switches per virtualized domain
while simplifying the management of all those switches through the use
of a new software implementation.
The Junos Fusion architecture is comprised of Aggregation Devices (QFX10000 switches) and Satellite Devices (EX4300 and/or QFX5000 switches). Satellite devices connect directly to the aggregation device in a hub-and-spoke topology as shown in Figure 5.15.
Figure 5.15
switches) remotely
Configuration, software image management, statistics polling
devices. Junos Fusion has been designed in such a way that all
components are loosely coupled. This allows us to introduce the
notion of software upgrade groups. Devices within the same
SUG, or software upgrade group, need to run the same OS. A
given Junos Fusion system can have as many SUGs as the operator prefers, and the OS across different SUGs does not have to be
the same. This enables operational simplicity and ease of maintenance.
Supported Aggregation Device: QFX10002-72Q, 36Q,
aggregation devices
No local management required
Can be single or dual-homed to aggregation devices
Satellite devices run Windriver Yocto Linux or Linux Forwarding
Junos Fusion comes with a lot of failure protections built in, and, for the sake of brevity, they are not covered here, but Junos Fusion is an active/active control plane across the data center, and as such the
potential for a control plane failure affecting the whole data center
should be a concern you need to understand so you can incorporate its
management into your best practices.
Is Junos Fusion an option for DC1? Yes, if you want to simplify the
management domain to cover a large number of switches while
providing an easy point of configuration and management.
Figure 5.16
As for DC2, the potential issue with Junos Fusion is the same as
implementing Virtual Chassis Fabric. You're dependent on 1RU QFX
devices to do the 10GbE termination in both solutions. This would
mean implementing stacks of 1RU switches and connecting those
satellite devices back to the core/aggregation layer. The benefit of the
chassis solution is that local switching is done via the backplane on the
chassis, as opposed to pushing the traffic up to the core and potentially
back down to the leaf layer.
Again, this is why Juniper provides you with options, because one
solution does not always fit all scenarios, and while Junos Fusion works better for DC1 than for DC2, maybe an IP fabric as outlined in
Chapter 6 could provide a solution for both.
MORE? For more detailed information about Junos Fusion and data centers,
Chapter 6
IP Fabrics and BGP
IP Fabrics
IP Fabrics have been popular of late because they provide the best
support for virtual servers and their applications, and they offer
the ability for those applications to talk to each other at scale via
Layer 2. While the virtual server world has been moving at breakneck speed to provide speed and agility, the networking world has
been a little slow to respond.
Up until a few years ago, most data center networks were based on
the traditional three-tiered approach that had been copied from
campus network design. While it's fine when most of the traffic has a north/south profile like a campus network, it's not really suitable
when the majority of virtual applications and associated workloads
Figure 6.1
As you can see in Figure 6.1, this design introduces a fair amount of
latency, as the packets need to traverse several hops in order to reach
their destination. The more hops required between the different
dependent applications, the more those applications are subjected to
additional latency, contributing to unpredictable performance and user
access to those applications.
This is why fabrics have become so popular; they reduce latency by
flattening the architecture so any server node can connect to any other
server node with a maximum of two hops, as shown in Figure 6.2.
Data center fabric architectures typically use only two tiers of switches
as opposed to older data centers that implemented multiple tiered
network architectures.
Figure 6.2
Figure 6.3
IP Fabric Architecture
You can see in Figure 6.3 that the spine layer is made up of four
switches. Each leaf has four uplinks, one to each spine. The maximum
number of leaves supported in this topology is dictated by the maximum number of ports per spine. So if our spine switch supports 40 x
40GbE connections, the maximum number of leaf devices would be 40
(though we should allow for onward connectivity at a similar rate to
the uplink connectivity, so 36 would be about right).
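A small sketch makes this sizing rule concrete. It assumes a two-tier spine-and-leaf fabric in which every leaf has one uplink to every spine; the port counts below are illustrative values in the style of the QFX5100-48S leaf used earlier, not figures quoted from the book.

    def fabric_capacity(spine_ports, leaf_access_ports, leaf_uplink_ports,
                        access_gbps=10, uplink_gbps=40):
        """Rough sizing for a two-tier leaf/spine IP fabric.

        Every leaf uses one uplink per spine, so the number of leaves is
        capped by the number of ports available on a spine switch.
        """
        max_leaves = spine_ports
        server_ports = max_leaves * leaf_access_ports
        ratio = (leaf_access_ports * access_gbps) / (leaf_uplink_ports * uplink_gbps)
        return max_leaves, server_ports, ratio

    # Spine with 40 x 40GbE ports, leaves with 48 x 10GbE access ports and
    # 4 x 40GbE uplinks (one to each of four spines):
    leaves, servers, ratio = fabric_capacity(40, 48, 4)
    print(leaves, servers, round(ratio, 1))   # 40 leaves, 1920 server ports, 3.0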
Clos
Where did the spine and leaf architecture come from? Its based on the
principles of a Clos network. Clos networks are named after Bell Labs
researcher Charles Clos, who proposed the model in 1952 as a way to
overcome the performance and cost-related challenges of electro-mechanical switches then used in telephone networks. Clos used mathematical theory to prove that achieving non-blocking performance in a
switching array (now known as a fabric) was possible if the switches
were organized in a hierarchy.
The design Charles Clos came up with was what he classed as a three-stage Clos. A three-stage Clos is comprised of an ingress stage, a middle stage, and an egress stage, as shown in Figure 6.4.
Figure 6.4
As Figure 6.4 shows, the number of sources which feed into each of the
ingress switches is equal to the number of connections connecting to
the middle switches and is equal to the number of connections connecting to the egress switches. This provides a non-blocking architecture
with no oversubscription. If you now rename the ingress, middle, and egress stages to leaf and spine, you get a three-stage Clos as shown in Figure 6.5:
Figure 6.5
Now if you fold this topology in half and turn it on its side, you'll end
up with the same spine and leaf architecture as you have seen previously and as outlined in Figure 6.6.
Figure 6.6
BGP
Layer 3 acts as the control plane in allowing routing information to be
distributed to all switches in the fabric. But as you are no doubt aware,
there are several different Layer 3 protocols you can choose from. Best
practice would argue in favor of any one of the three main open standard protocols: OSPF, IS-IS, or BGP would do the job. Each protocol
can essentially advertise routing prefixes, but each protocol varies in
terms of scale and features.
Open Shortest Path First (OSPF) and IS-IS use a flooding technique to
send updates and other routing information. Creating areas can help
limit the amount of flooding, but then you start to lose the benefits of an
SPF routing protocol. On the other hand, Border Gateway Protocol
(BGP) was created from the ground up to support a large number of
prefixes and peering points. The Internet and most service providers are
the best examples of BGP as a control plane at scale.
It's hard to find a reason why you wouldn't use BGP, unless you have
had little cause up until now to use BGP in any part of your network.
From a campus point of view, many people would consider BGP as a
service provider-based protocol, but the fear of learning this complex
protocol and how to configure it is being surpassed by automation
tools such as OpenClos, which automates the whole process: not only
the BGP configuration for each switch, but also the push of the configuration to the switch and bringing the whole fabric up.
Using an automation tool such as OpenClos will still require you to
have a basic understanding of BGP. So let's review some BGP basics and BGP design.
BGP Basics
BGP is an exterior gateway protocol (EGP) used to exchange routing information among routers in different autonomous systems (ASs).
An AS is a group of routers that are under a single technical administration. BGP routing information includes the complete route to each
destination and it uses the routing information to maintain a database
of network reachability information, which it exchanges with other
BGP systems. BGP uses the network reachability information to
construct a graph of AS connectivity, which enables BGP to remove
routing loops and enforce policy decisions at the AS level.
BGP uses TCP as its transport protocol, using port 179 for establishing
connections. The reliable transport protocol eliminates the need for
BGP to implement update fragmentation, retransmission, acknowledgment, and sequencing.
An AS is a set of routers or switches that are under a single technical
administration and normally use a single interior gateway protocol
and a common set of metrics to propagate routing information within
the set of routers. To other ASs, the AS appears to have a single routing
plan and presents a consistent picture of what destinations are reachable through it.
The route to each destination is called the AS path, and the additional
route information is included in path attributes. BGP uses the AS path
and the path attributes to completely determine the network topology.
Once BGP understands the topology, it can detect and eliminate
routing loops and select among groups of routes to enforce administrative preferences and routing policy decisions.
BGP supports two types of exchanges of routing information: exchanges between different ASs and exchanges within a single AS.
When used among ASs, BGP is called external BGP (EBGP). When
used within an AS, BGP is called internal BGP (IBGP), which leads us
quite nicely in to the next section.
BGP Design
The first question posed by using BGP is which variant to use: IBGP or
EBGP. The difference between the two may seem quite small, but those
minor differences can lead to major changes in your DC implementation. The first big difference between the two is the way they use
autonomous systems or ASs.
In IBGP, all switches in the spine and leaf sit under a single AS, as
shown here in Figure 6.7. And in EBGP, each switch within the spine
and leaf has its own AS, as shown in Figure 6.8.
Figure 6.7
IBGP Example
The next issue is how each switch in the fabric distributes prefixes or
routes.
IBGP
Figure 6.8
EBGP Example
The only issue is the number of ASs you might consume with the IP
fabric. Each switch has its own BGP AS number, and the BGP private range is 64,512 to 65,534, which leaves you with 1,023 BGP autonomous system numbers. If your IP fabric is larger than 1,023 switches, you'll need to consider moving into the public BGP autonomous system number range, which I wouldn't advise for an internal data center, or move to 32-bit autonomous system numbers.
The 32-bit range takes the number up to potentially four billion ASs, with 4,200,000,000 to 4,294,967,294 of the range reserved for private use, giving you plenty to play with. Note that while Juniper supports 32-bit ASs as standard on its equipment, other vendors may not.
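To put those ranges in perspective, here is a quick count of both private AS pools (a simple Python check; the boundary values are the commonly used ones and are stated here as an assumption, not a quote from the book).

    # 2-byte private AS range: 64512 - 65534
    print(65534 - 64512 + 1)              # 1023 usable private AS numbers

    # 4-byte (32-bit) private AS range: 4200000000 - 4294967294
    print(4294967294 - 4200000000 + 1)    # 94,967,295 private AS numbers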
Table 6.1 outlines the main points between IBGP and EBGP.
Table 6.1

IBGP                 EBGP
Simple design        Simple design
                     ECMP as standard
Figure 6.9
As you can see, each switch at both the spine and leaf layer is given its
own AS number. This can be repeated for the other rows. So, you could
have 64,512 to 64,523 for row one, and then increase for each device.
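Tools such as OpenClos generate this numbering and the matching configuration automatically, but the scheme itself is simple enough to sketch. The fragment below is illustrative Python only: the device names are made up, and the printed line merely mimics the kind of Junos-style statement an automation tool might push to each switch.

    # Assign one private 2-byte AS number per device, row by row,
    # starting at 64512, in the spirit of the EBGP numbering just described.
    BASE_AS = 64512

    def assign_asns(row, spines=2, leaves=10):
        """Return {device_name: asn} for one row of an EBGP-based IP fabric."""
        devices = [f"row{row}-spine{s}" for s in range(1, spines + 1)]
        devices += [f"row{row}-leaf{l}" for l in range(1, leaves + 1)]
        offset = (row - 1) * (spines + leaves)
        return {name: BASE_AS + offset + i for i, name in enumerate(devices)}

    for device, asn in assign_asns(row=1).items():
        # Example of the sort of statement a tool could emit for the underlay:
        print(f"{device}: set routing-options autonomous-system {asn}")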
With DC2, as we are using an end-of-row solution, the EBGP design
would look like Figure 6.10:
Figure 6.10
Again, as outlined for the DC1 solution, you can use AS64512 and
64513 for row one and then increment by one for each device. If you
choose the IBGP method of implementation the design becomes really
easy because all you do is assign all of the components to a single AS.
This single AS could cover the whole of DC1 and you could have the
same approach for DC2 as shown in Figures 6.11 and 6.12.
Figure 6.11
Figure 6.12
www.juniper.net/us/en/local/pdf/whitepapers/2000565-en.pdf.
Chapter 7
Overlay Networking
Tunneling Protocols
So why do we need overlay networking?
Where previously one application would be run on one physical server,
you can now carve up the processing and memory of a physical server
to be able to run multiple virtual servers, thus reducing the number of
physical servers you need while increasing the productivity of those
same servers. Your average physical server now runs a hypervisor, or VM monitor, which is a piece of software or firmware that creates and runs multiple virtual machines (VMs).
A computer on which a hypervisor is running one or more VMs is
defined as a host machine, and each virtual machine is called a guest
machine. The hypervisor presents the guest operating systems with a
virtual operating platform and manages the execution of the guest
operating systems. Multiple instances of a variety of operating systems
can then share the virtualized hardware resources.
The applications that run on these VMs need to communicate with
each other and in general the communication is Layer 2. But why go to
the trouble of an overlay if all they need to do is talk to each other on
the same Layer 2 domain?
Overlay networking is a way to view the network as secondary to the
application. With overlay networks youre managing from the application down rather than from the network up because you want to
architect that the application gets the services and support from the
network to be able to deliver quickly and efficiently. VLANs are a
network-imposed solution with limitations, such as the number you
can have on a single switch or in a data center. A Layer 2 overlay
technology such as a Virtual Extensible LAN (VXLAN) removes those
constraints, allowing your server team to implement the network
element and bring the application to the user in a manner consistently
faster than before. From a network point of view, youre not redundant
in this process, but your concern is the Layer 3 element, which is
presented to you by the encapsulated protocol.
So, to transport traffic between the different VMs on different physical hardware, you generally use virtual tunnels that run between VXLAN tunnel endpoints (VTEPs). These VTEPs typically reside in hypervisor hosts, such as VMware hypervisor hosts or kernel-based virtual machine (KVM) hosts, or they can even reside on the switches themselves.
This limitation, the 12-bit VLAN ID that caps you at roughly 4,000 usable VLANs, becomes apparent when you start to scale your data center to support multiple customers. These customers want multiple Layer 2 domains to host their applications, and they want these applications to be active all the time, which means you may need larger facilities to support these requirements. This is what is classed as multitenancy.
To get around this issue, the VXLAN VNI is 24 bits long, which means it can provide up to 16 million unique Layer 2 IDs, supporting a considerably larger number of Layer 2 domains and subnets and removing the scaling and overlap concerns that would have been present using VLANs.
The second element to understand is the VXLAN tunnel endpoint (VTEP), which performs the encapsulation and decapsulation of packets so they can travel over the Layer 3 fabric.
VTEPs typically reside within the hypervisor hosts. Each VTEP has
two interfaces. One is a switching interface that faces the VMs in the
host and provides communication between VMs on the local LAN
segment. The other is an IP interface that faces the Layer 3 network.
Each VTEP has a unique IP address that is used for routing the UDP
packets between VTEPs.
A quick walkthrough might help clarify things, as shown in Figure 7.1.
Figure 7.1
A Walkthrough of VTEP
Figure 7.2
You can see in Figure 7.2 that the original Ethernet frame consists of
the source and destination MAC addresses, Ethernet type, and an
optional IEEE 802.1q header (VLAN ID). When this Ethernet frame is
encapsulated using VXLAN, VXLAN adds these additional elements:
- A VXLAN header: an 8-byte (64-bit) field that carries the 24-bit VNI identifying the Layer 2 segment.
- An outer UDP header: the source port is dynamically assigned by the originating VTEP, and the destination port is typically the well-known UDP port 4789.
- An outer IP header: the source IP address is that of the source VTEP associated with the inner frame source (that could be a VM or a physical server), and the destination IP address is that of the destination VTEP that corresponds to the inner frame destination.
- An outer Ethernet header: this has the source MAC address of the VTEP sending the frame and the destination MAC address of the next hop towards the destination VTEP.
In total, VXLAN encapsulation adds between 50 and 54 additional
bytes of header information to the original Ethernet frame. Because
this can result in Ethernet frames that exceed the default 1514 byte
MTU, best practice is to implement jumbo frames throughout the
network.
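As a sanity check, a full-size 1,514-byte frame plus the 54 bytes of worst-case VXLAN overhead needs at least 1,568 bytes on the wire, so simply raising every fabric-facing interface to a jumbo MTU gives plenty of headroom. A hedged Junos example, with the interface name and value chosen only for illustration:

    set interfaces et-0/0/0 mtu 9216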
MORE? For a very detailed review of VXLAN, please refer to this IETF draft:
https://tools.ietf.org/html/draft-mahalingam-dutt-dcops-vxlan-00.
7. VTEP-B (on Host-B) receives the VXLAN-encapsulated packet.
8. VTEP-B strips the VXLAN header and switches the original Ethernet frame towards VM2.
9. The original frame is now just a native Layer 2 frame with payload
as was transmitted by VM1.
10. VM2 receives the original frame as if it was on the same VLAN as
VM1.
Figure 7.3
Ideally, as a best practice, you start VNI numbering from 4K/5K upwards, for example, VLAN 100 -> VNI 4000 or 5000. However, nothing prevents you from configuring VLAN 100 -> VNI 1001, as shown in Figure 7.3.
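On a Juniper switch acting as the VTEP, that VLAN-to-VNI mapping is only a few statements. The following is a minimal sketch, with the VLAN name, VNI value, and loopback address all assumed for illustration, and with the rest of the underlay and control plane configuration omitted:

    set interfaces lo0 unit 0 family inet address 10.0.0.20/32
    set switch-options vtep-source-interface lo0.0
    set vlans VLAN100 vlan-id 100
    set vlans VLAN100 vxlan vni 5000

The vtep-source-interface statement is what gives the VTEP the unique, routable IP address described earlier.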
The steps in Figure 7.3 apply equally to a VTEP terminated within the hypervisor and to a VTEP terminated in a switch. That's because VTEP endpoints can be located within the hardware of the switch as well as within a VM, due to the simple fact that data centers run a lot of non-virtualized, or bare metal, servers (BMS), which don't have native support for VXLAN.
By placing VTEP endpoints on the switches that the BMS attach to, you allow the switch to act as a gateway between the virtualized and non-virtualized worlds, letting traffic move between the two different data planes as if they were connected to the same VLAN, as shown in Figure 7.4.
Figure 7.4
Figure 7.4 shows two VMs that reside in VLAN 10 being encapsulated into VXLAN 110, which is transported across the Layer 3 network, but with the VTEP terminated in the switch's hardware. The VTEP is associated with the interface that connects to the physical server, which is configured for VLAN 10, so from the point of view of both the physical and virtual servers, they all reside on VLAN 10.
Support for this feature is standard across both switches and routers
from Juniper Networks.
Layer 2 Learning
To allow VMs and BMS servers to communicate with each other there
has to be a learning mechanism in place (with associated tables) that
maps the MAC addresses of VMs and BMSs to specific VTEPs and
maintains that mapping.
Let's start with the data plane. VXLAN as a standard doesn't include a control plane mechanism for VTEPs to share the addresses they have discovered in the network, but it does include a mechanism that is very similar to the way traditional Ethernet switches learn MAC addresses. Whenever a VTEP receives a VXLAN packet, it records the IP address of the source VTEP, the MAC address of the VM, and the VNI in its forwarding table. So when the VTEP later receives an Ethernet frame for that destination VM on its VNI segment, it is ready to encapsulate the frame in a VXLAN header and push it towards that VTEP.
If a VTEP receives a packet destined for a VM with an unknown address, it will flood and learn like a traditional Ethernet switch to see if another VTEP knows the destination MAC address. But to stop unnecessary flooding of traffic, each VNI is assigned to a multicast group, so the flood-and-learn process is limited to the VTEPs in that VNI's multicast group.
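In a flood-and-learn deployment on a Juniper switch, that VNI-to-multicast-group assignment is part of the same VLAN configuration. A brief sketch, assuming the VLAN name, VNI, and group address, and assuming the underlay is separately running multicast routing (PIM):

    set vlans VLAN100 vxlan vni 5000
    set vlans VLAN100 vxlan multicast-group 233.252.0.100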
Figure 7.5
VXLAN Routing
At its most basic, VXLAN routing is the ability for one VLAN to talk to
another VLAN. To allow this to happen, you need a Layer 3 gateway. Its
the same in an overlay network, but when a VXLAN VNI needs to talk
to another VXLAN VNI, you need to route between them and you need
a Layer 3 gateway.
Juniper prefers to implement VXLAN routing natively in hardware as
opposed to either using a multi-stage solution or cabling, either internally
Figure 7.6
MORE? For more detail than this fundamentals book can provide, start at Juniper's Tech Library, where all the products, protocols, and configurations can be found: http://www.juniper.net/documentation. The data sheets for all the Juniper devices mentioned in this Day One book can be found on their specific product pages at http://www.juniper.net/us/en/products-services/.
Chapter 8
Controllers
The overlay in data centers can either be constructed by a controller or independently by the network nodes. In a controller-based
solution, a central brain such as Juniper Contrail or VMwares
NSX solution holds the Layer 2 tables and knows how to reach all
of the elements in the virtual network, as shown in Figure 8.1. In a
controller-less based solution, you are reliant on a protocol to
distribute from one point to another.
Figure 8.1
Figure 8.2
As you can see in Figure 8.2, in the controller-less model the VTEPs are provisioned on the network switches themselves. VM-based servers connect to these switches via standard Layer 2 trunks. While these VTEPs can be controlled via a management platform such as Junos Space, a controller-less solution is a little more static in its implementation and management.
EVPN is covered in detail in Chapter 9; it's a complex subject, and the protocol has to be versatile in its support for different architectures.
Now, let's get back to a controller-based solution. How does it work? Let's look at Juniper Contrail as an example.
Figure 8.3
Contrail Controller
From the Controller's point of view, each of the three elements (configuration, control, and analytics) becomes its own node, so there's a control node, a configuration node, and an analytics node. Each of these logical nodes runs on an x86 processor, which may be a separate physical server, a VM, or indeed all of them running on a single server in your lab. Because all nodes run in an active/active configuration, you can run multiple instances of each node so that no one node becomes a bottleneck, which allows you to scale out the solution and provides redundancy. These nodes are interconnected with each other, with the control nodes being the ones the vRouters interact with, as shown in Figure 8.4.
Figure 8.4
Contrail Controller
Each vRouter agent connects to two control nodes and receives control state from both nodes. The vRouter then makes a local decision about which copy of the control state to use; it's similar to how a BGP PE router receives multiple copies of the same route (one from each BGP neighbor) and makes a local best-route selection. The information is then added to the local table until the controller advertises a better route.
If a control node fails, the vRouter agent will notice that the connection to that control node is lost and will flush all routes that it received from the failed control node. Because it already has a redundant copy of all the state from the other control node, the vRouter can locally and immediately switch over without any need for resynchronization. The vRouter agent will then do a discovery for the remaining control nodes to see if a new control node exists to replace the old one. If not, it will continue to use the single control node it knows about until another one recovers.
Contrail vRouter
The vRouter sits within the compute servers hosting your VMs. The
standard configuration assumes Linux is the host OS, and KVM is the
hypervisor (but other hypervisors are supported). The vRouter is then
divided into two functions, the vRouter forwarding plane and the
vRouter agent. The vRouter forwarding plane for Virtio sits in the
Linux kernel, but for those running the Data Plane Development Kit
(DPDK) support it sits in user space and the vRouter agent is in the
local user space as shown in Figure 8.5.
Figure 8.5
Among other functions, the vRouter agent is responsible for:
- Reporting statistics and flow records to the analytics nodes
- Installing routing tables into the forwarding plane
- Discovery of existing VMs and their associated attributes
- Applying policy for the first packet of each new flow and installing the flow entry in the forwarding plane
Figure 8.6
Where the detail in Figure 8.6 becomes naturally advantageous is in a multitenant environment, where you need to offer multiple VMs and domains with complete separation but simple, centralized control and configuration.
As mentioned previously, Contrail vRouters, and Contrail as a whole, support MPLS over GRE or UDP as well as VXLAN. It is worth noting that the selection of which data plane protocol to use is based on a preference order that can be defined during setup, and it also takes into account the capabilities of the two endpoints of the tunnel: either you define the tunnel type, or the vRouters decide based on what the tunnel endpoints support.
So the elements covered in Chapter 7 on VXLAN stay the same, and you implement Contrail to manage that overlay element. If, on the other hand, you come from an MPLS background, then the support for MPLS is the same as it would be for an MPLS WAN network, but implemented in the data center, where the packets are encapsulated in MPLS with an MPLS label acting as the ID.
Let's join this all together with a packet walkthrough of how VM1A could talk to VM1B across our fabric.
Figure 8.7
- If the destination is reachable from that routing instance, then a route will be applied in that instance providing the next hop and label required to reach the destination, with the normal encapsulation as outlined earlier;
- If there is a default route, the packet will follow that path;
- If nothing is defined, then the packet will be dropped.
MORE? For more information about Contrail and its specifications, please refer to its product page on the Juniper website, with links to papers, documentation, and solutions: http://www.juniper.net/us/en/products-services/sdn/contrail/.
MORE? If you require more detail on how VMware NSX and Juniper work together, please refer to the solution material on the Juniper website.
In the data plane you can use VXLAN encapsulation (as discussed in previous chapters). The VTEP for the BMS is on the top-of-rack switch. Unicast traffic from the BMS is VXLAN-encapsulated by the top-of-rack switch and forwarded if the destination MAC address is known within the virtual switch. Unicast traffic from the virtual instances in the Contrail cluster is forwarded to the top-of-rack switch, where the VXLAN encapsulation is terminated and the packet is forwarded to the BMS.
Broadcast traffic from the BMS is received by the ToR services node (TSN), which uses the replication tree to flood the broadcast packets across that specific virtual network. Broadcast traffic from the virtual instances in the Contrail cluster is sent to the TSN, which then replicates the packets to the top-of-rack switches.
Figure 8.8
Figure 8.9
You can see in DC2 that traffic is more centralized due to the end-of-row design. It also means that the gateway between the wider network and the overlay sits at the spine/core, acting as the main aggregation point between all of the rows and providing onward connectivity to the WAN or outside of the data center.
Okay, remember all this, because next up is EVPN as a controller-less solution for supporting overlay networking.
Chapter 9
EVPN Protocol
For the past few years you could use VPLS (virtual private LAN service) to stretch that Layer 2 domain between sites. But while VPLS did a good job, like any protocol it came with limitations around MAC address scaling, sensible multicast support, active/active multihoming, transparent customer MAC address transport, speed of convergence, and, no doubt the largest pain, ease of management.
EVPN attempts to address these issues, but remember that it's still a relatively new protocol and in some cases the standards are still being worked out, which is why you'll see slightly different implementations by different vendors.
EVPN is in the BGP family. It uses Multiprotocol BGP for the learning of MAC addresses between switches and routers and allows those MAC addresses to be treated as routes in the BGP table. This means you can use multiple active paths both inside and between data centers without blocking links. And you are not just limited to MAC addresses: you can use an IP address plus a MAC address (which forms an ARP entry) to be routed, and you can combine them further with a VLAN tag as well.
Given this flexibility for both Layer 2 and Layer 3 addressing, and the
fact that you can use a single control plane such as BGP for both the
internal network and the external WAN, the benefits of EVPN quickly
become apparent.
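In Junos terms, using BGP as that single control plane largely comes down to adding the EVPN address family to the overlay peering sessions. A rough sketch, with the group name and loopback addresses assumed for illustration:

    set protocols bgp group overlay type internal
    set protocols bgp group overlay local-address 10.0.0.20
    set protocols bgp group overlay family evpn signaling
    set protocols bgp group overlay neighbor 10.0.0.1
    set protocols bgp group overlay neighbor 10.0.0.2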
Before delving into the innards of EVPN, let's sync up and run through some of the terminology and how that terminology applies to a fabric. Let's start with Figure 9.1, which shows two servers, with two VMs per server, attached to leaves in a standard spine and leaf topology. Compared to previous diagrams, note that each server is attached to two leaves for resiliency, just as it should be in the real world. The diagram in Figure 9.1 is used throughout this chapter to build upon EVPN concepts and where they apply.
Starting with the server connection, as shown in Figure 9.2, the server connects to the switch with a trunk carrying two VLANs (VLAN 1 and VLAN 2). In EVPN-speak these two VLANs are classed as Ethernet tags; an Ethernet tag is simply the identity of the VLAN, so Ethernet tag 1 corresponds to VLAN tag 1, and in EVPN the Ethernet tag maps to the VLAN tag.
Figure 9.1
Figure 9.2
Inside each EVI (EVPN instance) is a MAC-VRF, which is a virtual MAC table for the forwarding and receiving of MAC addresses within the EVI. You also have an import policy and an export policy.
The VRF export policy for EVPN is a statement configured in the VRF. This statement causes all locally learned MACs to be copied into the VRF table as EVPN Type 2 routes. Each of the Type 2 routes associated with locally learned MACs is tagged with a route target community of, say, 1:1, and these tagged routes are then advertised to all switches in the fabric.
The VRF import policy statement does the reverse of the export statement, accepting routes that are tagged with that target community.
You also have a route distinguisher (RD) that is assigned to the MAC-VRF; again, this is unique, and its ID is advertised into the BGP control plane that runs across the whole of our fabric.
There is also a route target (RT) community. Each EVPN route advertised by a switch in the fabric contains one or more route target communities. These communities are added using the VRF export policy or by configuration, as mentioned earlier. When another switch in the fabric receives a route advertisement, it determines whether the route target matches one of its local VRF tables. A matching route target causes the switch to install the route into the VRF table whose configuration matches that route target.
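On a QFX-style leaf using the default switch instance, these building blocks often reduce to a handful of statements. The sketch below is illustrative only; the RD, route target, and VNI values are assumptions:

    set switch-options route-distinguisher 10.0.0.20:1
    set switch-options vrf-target target:1:1
    set protocols evpn encapsulation vxlan
    set protocols evpn extended-vni-list 5000

The vrf-target statement is effectively shorthand for the matching export and import policies described above; you can also define those policies explicitly when you need finer control.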
Finally, EVPN gives you the flexibility to support different VLAN mapping options per EVPN instance. These different supported options are classed as EVPN services.
VLAN Services
There are three VLAN services, the first being the VLAN-based service, as shown in Figure 9.3. With this service a VLAN is mapped to a single EVI, and it becomes the EVI for that VLAN across the fabric.
Figure 9.3
VLAN-Based Service
Figure 9.4
The second service is the VLAN bundle service, shown in Figure 9.4, where multiple VLANs are bundled into a single EVI. The benefit here is the efficient way in which you can bundle like VLANs together, which operationally makes the configuration a lot easier. But traffic flooding will affect every VLAN in the EVI, you have no support for per-VLAN mapping, and you are sharing the EVI with lots of other VLANs.
The last service is called the VLAN-aware service, and it allows multiple VLANs and bridge domains to be mapped to a single EVI, as shown in Figure 9.5. All of the VLANs share the single EVI, but because there is a one-to-one mapping of VLAN to bridge domain, you can assign an ID or label to each bridge domain to provide separation from the other VLANs. It also means that flooding only affects the VLAN in which it occurs, as opposed to the bundle service, where it affects every VLAN in the EVI.
Figure 9.5
At the time of this book's writing, the Juniper QFX Series supports only the VLAN-aware and VLAN bundle services. The MX Series and the EX9000 Series support all three.
While you have already read about data plane encapsulations like VXLAN and the different protocols that you can use for the overlay, it's still worth touching upon the ones supported by EVPN.
Data Plane
As shown in Figure 9.6, there are currently five supported data plane technologies for EVPN: MPLS, PBB, SST, NVGRE, and VXLAN. All five are encapsulation methods that use EVPN with BGP as the common control plane.
Figure 9.6
You should now have two VLANs mapped to two VXLAN VNIs, with their MAC addresses learned and advertised in EVPN EVI 1, which is assigned an RD that provides its unique ID across the fabric.
Okay, but one of the elements that you have to consider is EVPN routing in BGP. It relates to the various types of reachability information that EVPN advertises into BGP. In the wonderfully named IETF document BGP MPLS-Based Ethernet VPN (https://tools.ietf.org/html/draft-ietf-l2vpn-evpn-11), this information is classed as Network Layer Reachability Information (NLRI).
be used for forwarding. If the flag is set to 0, it means that all links associated with the Ethernet segment can be used for forwarding data. Basically, it's the difference between active/active and single-active links.
So if a remote leaf (say Leaf 4 in Figure 9.7) receives an Ethernet auto-discovery route from Leaf 1 and Leaf 2 (as the server is dual-attached), it will look at the advertisement, see that it is for the RED EVPN segment (which it is a part of), and install that route into its table. It also knows it will have two VXLAN tunnels to forward traffic back over to Server 1, as both leaves and the VXLAN tunnels are in the same EVI (RED).
Figure 9.7
Figure 9.8
Using Figure 9.8, let's say that Leaf 4 learns MAC addresses in the data plane from Ethernet frames received from Server 2. Once Leaf 4 learns Server 2's MAC address, it automatically advertises the address to the other leaves in the fabric and attaches a route target community, which is the solid red circle in Figure 9.8.
Upon receiving the route, Leaf 1 has to decide whether to keep it. It makes that decision based on whether an import policy has been configured to accept red route targets; with no such policy, the advertisement would be discarded.
So, at a minimum, each EVI on any participating switch for a given EVPN must be configured with an export policy that attaches a unique target community to the MAC advertisements, and it must also be configured with an import policy that matches and accepts advertisements based on that unique target community.
router or gateway from this data center to the WAN. The edge gateway (let's say the DC1 MX) will have an IRB interface that acts as the gateway address for Leaf 1.
The MX strips off the VXLAN header and does a lookup on the remaining Ethernet packet, which has a destination of Server 2 in DC2. The edge gateway strips the Ethernet header and routes the remaining IP packet based on the IP route table related to its IRB interface. The MX then uses a Type 5 route that it received from its opposite number in DC2 and forwards the packet over the VXLAN tunnel between the DC1 MX and the DC2 MX.
The DC2 MX receives the VXLAN-encapsulated packet, strips the VXLAN encapsulation off, routes the IP packet via its IRB interface, and re-encapsulates the packet with an Ethernet header carrying a destination MAC address of Server 2. It does the MAC address lookup and forwards the packet over the VXLAN tunnel to the corresponding leaf and then to Server 2.
So in this scenario, Type 5 routes allow inter-data center traffic to be forwarded over VXLAN tunnels while allowing the Ethernet MAC lookup to be translated into an IP lookup.
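On the gateway, that behavior comes from an EVPN Type 5 (IP prefix route) configuration inside a Layer 3 routing instance. The following is a hedged sketch only; the instance name, RD, route target, IRB unit, and VNI are all invented for illustration:

    set routing-instances TENANT1 instance-type vrf
    set routing-instances TENANT1 interface irb.100
    set routing-instances TENANT1 route-distinguisher 10.0.0.1:100
    set routing-instances TENANT1 vrf-target target:100:100
    set routing-instances TENANT1 protocols evpn ip-prefix-routes advertise direct-nexthop
    set routing-instances TENANT1 protocols evpn ip-prefix-routes encapsulation vxlan
    set routing-instances TENANT1 protocols evpn ip-prefix-routes vni 9100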
Hopefully, these five types of EVPN route explain how EVPN signals into BGP. The next element to understand is how the gateway is distributed across several switches.
MORE? Check out the Juniper TechLibrary for more about EVPN and Juniper's EVPN documentation: http://www.juniper.net/documentation.
Figure 9.9
This provides you with the flexibility to map related applications and their associated VLANs together under a single EVPN instance to provide them with isolated Layer 2 connectivity. You can then replicate this for other application groups and, in time, allow for the introduction of tenant-based solutions.
One aspect not touched upon is the management of this EVPN-VXLAN solution. Because it is protocol-based, the configuration of these elements needs to be managed. This can be done very easily through Juniper's Junos Space Network Director platform, which provides IP fabric deployment as well as EVPN-VXLAN configuration and management.
NOTE
Junos Space Network Director is beyond the scope of this book, but
look for its inclusion in a future Day One: Data Center Management
Fundamentals.
The other benefit of EVPN not covered here is the extension of Layer 2 between data centers, referred to as data center interconnect (DCI). While not covered in this book, it has been extensively covered in Day One: Using Ethernet VPNs for Data Center Interconnect, so please refer to that excellent Day One book at http://www.juniper.net/us/en/training/jnbooks/day-one/proof-concept-labs/using-ethernet-vpns/.
Summary
This has been a book about data center fundamentals and how Juniper
Networks technology can build data centers. It glosses over very
complicated topics and issues to give you a fundamental understanding
so you can get started on day one.
Everything in this book is subject to change. Please make every effort to visit the Juniper links provided throughout these pages for the most up-to-date data sheets and specifications, as those details will change much faster than this book's ability to track them.
Juniper Networks is the home of many data center and data center interconnect solutions. This book attempts to favor none while trying to explain them all. Keep in mind that data center networking is still a complicated engineering science and no one quite agrees on the perfect data center; even the technical reviewers who helped proof this book had conflicting advice.
While the book has shown you the basics of how to build a data center, it stops rather abruptly at an architected example, and it says nothing about data center administration, where CoS, analytics, automation, and orchestration rule. The author hopes to write that next book for this Day One series as soon as possible. Until that time, please use the Juniper website to track the many data center administration tools, from Juniper Contrail, to the NorthStar Controller, to Juniper cloud-based security products and analytics.