Project Report Flow Based Load Balancer On NETFPGA
This project proposes a FPGA-based load balancer that routes packets based on the least recently used node to combine the strengths of layer 4 and layer 7 load balancers. It aims to provide low latency load balancing without the large overhead of layer 7 balancers. The document describes the NetFPGA platform that will be used, including its FPGA, memory, Ethernet ports, and typical usage models. It then provides details on the pipeline design for the proposed load balancer, covering the five stages of instruction fetch, decode, execute, memory access, and writeback.
Project Report
Flow Based Load Balancer on NETFPGA
CHAPTER 1

1. Abstract:
Currently there are two main load balancers available in the market: the Layer 7 load balancer and the Layer 4 load balancer. A Layer 7 load balancer runs on the application layer; this adds a huge overhead and hence a large latency, but it provides additional features such as protection against DoS attacks and content caching. A Layer 4 load balancer runs on the network layer. It has lower latency, but it does not provide the additional features of a Layer 7 load balancer. This motivates combining the strengths of both load balancers into one. In this project, we propose an FPGA-based solution that routes packets to the least recently used node. Operating at the physical layer, it does not suffer from large overhead. The design will be compared with a socket program run on the reference router in Deterlab. The main benefits of this project are low latency and the ability to reroute packets on the fly.

2. Introduction
The NetFPGA is a low-cost platform, primarily designed as a tool for teaching networking hardware and router design. It has also proved to be a useful tool for networking researchers. Through partnerships and donations from sponsors of the project, the NetFPGA is widely available to students, teachers, researchers, and anyone else interested in experimenting with new ideas in high-speed networking hardware.

2.1 Usage Models
At a high level, the board contains four 1 Gigabit/second Ethernet (GigE) interfaces, a user-programmable Field Programmable Gate Array (FPGA), and four banks of locally attached Static and Dynamic Random Access Memory (SRAM and DRAM). It has a standard PCI interface allowing it to be connected to a desktop PC or server.
A reference design can be downloaded from the http://NetFPGA.org website that contains a hardware-accelerated Network Interface Card (NIC) or an Internet Protocol Version 4 (IPv4) router that can be readily configured into the NetFPGA hardware. The router kit allows the NetFPGA to interoperate with other IPv4 routers.
The NetFPGA offloads processing from a host processor. The host's CPU has access to main memory and can use DMA to read and write registers and memories on the NetFPGA. Unlike other open-source projects, the NetFPGA provides a hardware-accelerated datapath: a direct hardware interface connected to four GigE ports and multiple banks of local memory installed on the card. NetFPGA packages (NFPs) are available that contain source code (both hardware and software) implementing networking functions. Using the reference router as an example, there are three main ways that a developer can use an NFP. In the first usage model, the default router hardware is configured into the FPGA and the software is modified to implement a custom protocol.
Another way to modify the NetFPGA is to start with the reference router and extend the design with a custom user module. Finally, it is also possible to implement a completely new design where the user places their own logic and data-processing functions directly in the FPGA. In summary:

1. Use the hardware as-is as an accelerator and modify the software to implement new protocols. In this scenario, the NetFPGA board is programmed with the IPv4 hardware and the Linux host uses the Router Kit software distributed in the NFP. The Router Kit daemon mirrors the routing table and ARP cache from software to the tables in the hardware, allowing IPv4 routing at line rate. The user can modify Linux to implement new protocols and test them using the full system.

2. Start with the provided hardware from the official NFP (or from a third-party NFP), modify it by using modules from the NFP's library or by writing your own Verilog code, then compile the source code using industry-standard design tools. The implemented bitfile can then be downloaded to the FPGA. The new functionality can be complemented by additional software or by modifications to the existing software. For the IPv4 router, an example would be implementing a trie-based longest prefix match (LPM) lookup instead of the currently implemented CAM LPM lookup for the hardware routing table. Another example would be modifying the router to implement NAT or a firewall.

3. Implement a new design from scratch. The design can use modules from the official NFP's library or third-party modules to implement the needed functionality, or can use completely new source code.
2.2 Major Components

The NetFPGA platform contains one large Xilinx Virtex-II Pro 50 FPGA, which is programmed with user-defined logic and has a core clock that runs at 125 MHz. It also contains one small Xilinx Spartan-II FPGA holding the control logic for the PCI interface to the host processor.

Two 18 Mbit external Cypress SRAMs are arranged in a configuration of 512k words by 36 bits (4.5 MBytes total) and operate synchronously with the FPGA logic at 125 MHz. One bank of external Micron DDR2 SDRAM is arranged in a configuration of 16M words by 32 bits (64 MBytes total). Using both edges of a separate 200 MHz clock, the memory has a bandwidth of 400 MWords/second (1,600 MBytes/s = 12,800 Mbits/s).

The Broadcom BCM5464SR quad Gigabit/second external physical-layer transceiver (PHY) sends packets over standard category 5, 5e, or 6 twisted-pair cables. The quad PHY interfaces with four Gigabit Ethernet Media Access Controllers (MACs) instantiated as soft cores on the FPGA. The NetFPGA also includes two Serial ATA (SATA) connectors that enable multiple NetFPGA boards in a system to exchange traffic directly without using the PCI bus.
CHAPTER 2
Pipeline:
The underlying processor is a general-purpose processor with five pipeline stages: IF, ID, EX, MEM, and WB. The processor employs pipelining so that all parts of the processing and memory systems can operate continuously. Typically, while the result of one instruction is being written back to the register file, its successor is performing a memory operation, a third instruction is being executed, a fourth is being decoded, and a fifth is being fetched from instruction memory.
2.1 IF Stage:
This is the first stage of the pipeline, where the processor fetches the instruction from instruction memory. The instruction memory is logically divided into two memories to implement multithreading, and it is ensured that each thread accesses a different instruction memory. As shown in the figure, two program counters are required to access the instruction memory. The instruction width is 32 bits. The thread scheduler determines which program counter becomes active for a particular thread.
2.2 ID Stage:

Instruction decoding is carried out in this stage; that is, the control signals required for the execution of instructions are generated here. The decoding logic determines which type of instruction has been fetched, for example whether the incoming instruction is an R-type, LW, or SW, and correspondingly generates the control signals required for executing it.
[Fig. ID stage: the Control Unit and Register Bank between the IF/ID and ID/EX pipeline registers, with Opcode(3:0), Func(3:0), rs/rt/rd addresses (5 bits each), 64-bit rs/rt read data, the 9-bit offset, the register-write signals (regwrite, waddr, wdata), and the 4-bit ALU select.]
2.3 EX Stage:
The following operations are done in EX stage:
1. The ALU performs the arithmetic or logical operation for register-to-register instructions.
2. The ALU calculates the data address for load and store instructions.
3. The jump instruction is executed taking the thread id into consideration.
4. The branch condition is evaluated, and if found true the branch instruction is executed.
The ALU block is capable of performing operations such as addition, subtraction, set less than (SLT), logical AND, OR, etc. The operation the ALU performs is selected by the four ALU control signals generated from the instruction in the ID stage.
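As a behavioural sketch, the EX-stage ALU can be modelled as a select over the 4-bit control signal. The control encodings below are illustrative assumptions (the report does not give the actual values), and only a subset of the supported operations is shown.

```c
#include <stdint.h>
#include <assert.h>

/* Hypothetical 4-bit ALU control encodings; the real values are
 * design-specific and not stated in the report. */
enum { ALU_ADD = 0, ALU_SUB = 1, ALU_AND = 2, ALU_OR = 3,
       ALU_NOR = 4, ALU_SLT = 5, ALU_SLL = 6, ALU_SRL = 7 };

/* Behavioural model of the EX-stage ALU: performs one of the
 * supported operations on the 64-bit rs/rt data, selected by the
 * 4-bit control signal from the ID stage. */
int64_t alu(int sel, int64_t a, int64_t b)
{
    switch (sel) {
    case ALU_ADD: return a + b;
    case ALU_SUB: return a - b;
    case ALU_AND: return a & b;
    case ALU_OR:  return a | b;
    case ALU_NOR: return ~(a | b);
    case ALU_SLT: return a < b ? 1 : 0;          /* set less than */
    case ALU_SLL: return a << 1;                 /* shift in zeros */
    case ALU_SRL: return (int64_t)((uint64_t)a >> 1);
    default:      return 0;
    }
}
```

The 64-bit operand width follows the rs_data(63:0)/rt_data(63:0) signals shown in the ID-stage figure.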
2.4 MEM Stage:

The following operations are done in the MEM stage:

1. It performs the memory access for load and store instructions through the FIFO.
2. It stores the packet information in the FIFO, which can later be used by the processor to perform operations on it.
Each processor is designed to run two threads, so that it can exploit thread-level parallelism on the incoming packets. Fine-grain multithreading is employed, executing the threads alternately. The thread id is obtained by dividing the input clock frequency by 2. The two program counters are enabled by two different clocks generated from the CE signal in the thread scheduler.

There are two such dual-threaded cores that process packets simultaneously, which increases the throughput of the router.
[Fig. Multithreaded multicore architecture: two dual-threaded processors, each fed from a FIFO (in_fifo_data_p), with per-core Header Parser, Re-routing and Checksum, and B/W Calculator accelerators feeding the output port lookup.]
3.2 Thread Scheduler: The thread scheduler uses a T flip-flop and a demux to generate the thread IDs. The two clock enables for the program counters, CE_PC0 and CE_PC1, are generated using the T flip-flop as shown in the figure below.
[Fig. Thread scheduler: a T flip-flop (T input tied to vdd) and a demux generate Thread1_ID and Thread2_ID; an inverter on the flip-flop output derives the complementary program-counter clock enables CE_PC0 and CE_PC1.]
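A cycle-level sketch of the scheduler, assuming the T flip-flop's T input is tied high (vdd, as in the figure) so its output toggles every clock, and the demux routes the clock-enable to one program counter at a time. The C names are our own.

```c
#include <assert.h>

/* Model of a T flip-flop with T tied high: Q toggles each clock. */
typedef struct { int q; } t_ff;

static int tff_clock(t_ff *ff) { ff->q ^= 1; return ff->q; }

/* One scheduler cycle: the toggled output is the thread id, and the
 * demux asserts exactly one of the two program-counter enables. */
static void schedule(t_ff *ff, int *ce_pc0, int *ce_pc1)
{
    int thread_id = tff_clock(ff);
    *ce_pc0 = (thread_id == 0);
    *ce_pc1 = (thread_id == 1);
}
```

Driving `schedule` on successive cycles alternates CE_PC0 and CE_PC1, which is what interleaves the two threads' fetches.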
3.3 Working of FIFO and Processor:
There are three states, namely WriteFIFO, ReadFIFO, and CPU. In the WriteFIFO state the packet enters the pipeline and is stored in the FIFO memory. The CPU processes the packet in the CPU state. After the CPU has operated on the packets, they are read out of the FIFO in the ReadFIFO state.
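The three states can be sketched as a small state machine. The transition conditions (`packet_stored`, `cpu_done`) are assumptions for illustration; the report does not name them.

```c
#include <assert.h>

/* State names follow the text: write the packet into the FIFO,
 * let the CPU process it, then read it back out. */
typedef enum { WRITE_FIFO, CPU, READ_FIFO } pkt_state;

pkt_state next_state(pkt_state s, int packet_stored, int cpu_done)
{
    switch (s) {
    case WRITE_FIFO: return packet_stored ? CPU : WRITE_FIFO;
    case CPU:        return cpu_done ? READ_FIFO : CPU;
    case READ_FIFO:  return WRITE_FIFO;  /* packet read out; wait for next */
    }
    return WRITE_FIFO;
}
```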
4.1 The router consists of three hardware accelerators:

4.1.1 Header Parser: This hardware accelerator extracts the source IP address, destination port number, and protocol, which facilitates the implementation of the 3-tuple method of {IP, Port, Proto}. As a packet enters the stage, the Header Parser pulls the relevant fields from the packet and concatenates them. This forms the flow header.
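The concatenation step can be sketched as below. The field widths follow the standard IPv4/UDP headers (32-bit source IP, 16-bit destination port, 8-bit protocol); the ordering of the fields inside the flow header is an assumption.

```c
#include <stdint.h>
#include <assert.h>

/* Concatenate the 3-tuple {IP, Port, Proto} into one flow header,
 * as the Header Parser does in hardware. Bit layout (assumed):
 * [55:24 src_ip][23:8 dst_port][7:0 proto]. */
uint64_t flow_header(uint32_t src_ip, uint16_t dst_port, uint8_t proto)
{
    return ((uint64_t)src_ip << 24) | ((uint64_t)dst_port << 8) | proto;
}
```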
4.1.2 Re-Routing and Checksum Calculation: A hash-based algorithm is used for classifying the packets. First, a lookup table is created for flows so that each of its entries corresponds to a flow. The hash of each packet's 3-tuple field is then calculated and stored in the CAM lookup table. When a packet arrives, the hash calculated from its 3-tuple field is used to access the lookup table, and the index obtained points to another block memory which stores the destination IP address and destination port number to which the packet is to be routed. If the hashed 3-tuple does not match any hash-table entry, the packet is routed to the output link with the minimum bandwidth, and the lookup table is updated with this new entry. Once the packet gets its destination IP address and port number, the packet's checksum needs to be re-calculated and modified. The new checksum is calculated using the formula:

H' = ~( ~H + ~SUM_old + SUM_new )

where H is the old checksum, H' is the new checksum, and SUM_old and SUM_new are the one's-complement sums computed over the source IP address, destination IP address, source port, destination port, and window size of the old and new values respectively (the only fields that change their value). Here ~ denotes a bit-wise complement operation and + denotes a one's-complement sum.
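A minimal software model of this incremental update, in the style of RFC 1624, assuming 16-bit one's-complement arithmetic with end-around carry:

```c
#include <stdint.h>
#include <assert.h>

/* One's-complement 16-bit addition with end-around carry. */
static uint16_t ones_add(uint16_t a, uint16_t b)
{
    uint32_t s = (uint32_t)a + b;
    return (uint16_t)((s & 0xFFFF) + (s >> 16));
}

/* Incremental checksum update, matching the formula above:
 *   H' = ~( ~H + ~SUM_old + SUM_new )
 * sum_old / sum_new are the one's-complement sums over the fields
 * that changed (IP addresses, ports, window size). */
uint16_t checksum_update(uint16_t h, uint16_t sum_old, uint16_t sum_new)
{
    uint16_t t = ones_add((uint16_t)~h, (uint16_t)~sum_old);
    return (uint16_t)~ones_add(t, sum_new);
}
```

For example, the checksum over the words {0x1111, 0x2222} is ~0x3333 = 0xCCCC; changing 0x1111 to 0x1112 yields ~0x3334 = 0xCCCB, which the incremental update reproduces without re-summing the whole packet.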
4.1.3 Minimum Bandwidth Calculation: This accelerator finds the output link with the minimum bandwidth. The bandwidth is calculated over a time interval of T seconds by counting the total number of bits sent out on each link during that interval.
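A software sketch of the selection step, assuming each link accumulates a bit counter over the interval T and the link with the smallest count is chosen; the number of links here is an assumption for illustration.

```c
#include <stdint.h>
#include <assert.h>

#define NUM_LINKS 4  /* assumed: one counter per output link */

/* Return the index of the link that sent the fewest bits during the
 * last interval T, i.e. the minimum-bandwidth link. */
int min_bw_link(const uint64_t bits_sent[NUM_LINKS])
{
    int min = 0;
    for (int i = 1; i < NUM_LINKS; i++)
        if (bits_sent[i] < bits_sent[min])
            min = i;
    return min;
}
```

In hardware the counters would be cleared at the end of each interval by the timer described later in this section.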
4.2 NETFPGA BASED ROUTER:
[Fig. NetFPGA-based router datapath: eight receive queues (a CPU RXQ and a MAC RXQ per port) feed, through a FIFO, the dual-core dual-threaded processor with its Header Parser, Minimum B/W Calculator, and Rerouting and Checksum accelerators; packets leave through the corresponding CPU TXQ and MAC TXQ transmit queues on the Ethernet interfaces nf2c0 to nf2c3.]
4.3 Fields Description:
[Fig. Fields Description: layout of the packet words in the input FIFO, covering dst/src MAC, ethertype, version/IHL/TOS, total length, ID, flags and fragment offset, TTL, protocol, checksum, src/dst IP, UDP source and destination ports, UDP length, UDP checksum, a 32-bit packet sequence number, and the output-queue event fields.]
4.4 Hardware Accelerators:
[Fig. Hardware Accelerators: the Header Parser extracts src_ip and the other 3-tuple fields to form the flow header; a hash function indexes the CAM lookup table, and the output port lookup selects either the matched flow's port or the least-B/W link.]
4.5 Minimum Bandwidth Calculation Accelerator
[Fig. Minimum Bandwidth Calculation accelerator: a timer-driven block monitors the in_fifo_data traffic on Links 1, 2, and 3 and reports the minimum-bandwidth link to the processor through a FIFO.]
4.6 Header parser pin diagram:
The header parser block fetches the corresponding source IP address, destination port number, and protocol ID. These values are stored in the hash table to maintain a lookup table, and are further used to determine the corresponding destination MAC address, which helps identify the target IP addresses and perform the routing successfully.

Rerouting is done by replacing the MAC address and the destination port number. The checksum is updated by taking the difference between the current IP address and the new IP address; the resulting value is written into the checksum field, and the packet is routed ahead to the desired node based on the least recently used bandwidth. The least bandwidth is calculated after a constant time interval; this identifies the least recently used node among the various nodes.
4.8 Minimum Bandwidth Calculation: A timer is run which expires after T = 250,000,000 clock cycles (2 s at the 125 MHz core clock). On each expiry the per-node packet counts are updated: the number of bits sent is accumulated by fetching the corresponding packet length of each outgoing packet.
Instruction Set Architecture:

5.1 Instruction Field Format: Each instruction has a fixed width of 32 bits. The processor contains a 32-entry register file; thus the source register (rs), transfer register (rt), and destination register (rd) fields are 5 bits wide each. The immediate addressing mode has a 9-bit offset field, so it can encode 512 distinct offsets (0 to 511). The instruction is divided into the following fields:
The processor supports different types of instructions: register logic instructions, arithmetic instructions, immediate arithmetic instructions, conditional instructions, memory-based instructions, and shift instructions. General-purpose registers such as rs, rt, and rd are indicated with a $ sign. The register file consists of 32 registers, hence any register is represented by 5 bits. Register $0 is always grounded (it always reads as zero).
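The field widths above can be sketched as pack/unpack helpers. The widths (4-bit opcode and func, three 5-bit register fields, 9-bit offset, totalling 32 bits) come from the report; the exact bit positions are assumptions.

```c
#include <stdint.h>
#include <assert.h>

/* Assumed bit layout of a 32-bit instruction:
 * [31:28 opcode][27:23 rs][22:18 rt][17:13 rd][12:4 offset][3:0 func] */
uint32_t encode(uint32_t op, uint32_t rs, uint32_t rt,
                uint32_t rd, uint32_t offset, uint32_t func)
{
    return (op & 0xFu) << 28 | (rs & 0x1Fu) << 23 | (rt & 0x1Fu) << 18
         | (rd & 0x1Fu) << 13 | (offset & 0x1FFu) << 4 | (func & 0xFu);
}

/* Field extractors, as the ID-stage decode logic would produce them. */
uint32_t get_rs(uint32_t instr)     { return (instr >> 23) & 0x1Fu; }
uint32_t get_offset(uint32_t instr) { return (instr >> 4) & 0x1FFu; }
```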
5.3 Instructions: Register Arithmetic ADD Syntax add $rd,$rs,$rt Description Adds the contents of rs and rt and stores into rd
SUB Syntax sub $rd,$rs,$rt Description Subtracts rt from rs and stores the result in rd.
Immediate Instruction:
ADDI Syntax addi $rt,$rs,immediate Description Adds rs and a sign-extended immediate value and stores the result in rt.
SUBI Syntax subi $rt,$rs,immediate Description Subtracts the immediate value from rs and stores the result in rt.
SLTI Syntax slti $rt,$rs,immediate Description Sets rt if the content in rs is less than the value mentioned in the offset.
Register Logic Instruction
AND Syntax and $rd,$rs,$rt Description Bitwise ANDs contents of rs and rt and stores the result in rd.
OR Syntax or $rd,$rs,$rt Description Bitwise ORs contents of rs and rt and stores the result in rd.
NOR Syntax nor $rd,$rs,$rt Description Bitwise NORs contents of rs and rt and stores the result in rd.
XNOR Syntax xnor $rd,$rs,$rt Description Bitwise XNORs contents of rs and rt and stores the result in rd.
NOT Syntax not $rd,$rs Description Bitwise negates the contents of rs and stores the result in rd.
Shift Instructions
SLL Syntax sll $rd,$rs Description Shifts the value left, contained in rs by one bit and stores in rd. Zeros are shifted in.
SRL Syntax slr $rd,$rs Description Shifts the value right, contained in rs by one bit and stores in rd. Zeros are shifted in.
Conditional Instructions:
SLT Syntax slt $rd,$rs,$rt Description Sets rd if the content of rs is less than the content of rt, else resets rd.
BEQ Syntax beq $rs, $rt, offset Description Branches to the location pointed by the offset if (rs) and (rt) are equal.
Memory Based Instructions:
LW Syntax lw $rt, offset ($rs) Description A word (data) is loaded into rt from the memory location calculated by adding the contents of rs to the offset value.
SW Syntax sw $rt, offset ($rs) Description A word (data) is stored from rt in the memory location calculated by adding the contents of rs with the offset value.
Control Unit: The ALU is in the EX stage of the pipeline.
Table: Control Unit Description
Chapter 6
Purpose of the Compiler: A compiler is a computer program that converts source code written in one programming language into another language, usually into the binary equivalent of the code. These binaries are then stored in the instruction memory of the processor and are used to execute a program.
Compiling Process: The gcc compiler runs on x86 Linux machines and converts the source code into MIPS assembly. This MIPS assembly code is translated into the custom ISA of the processor. Finally, this ISA is converted into the corresponding binaries, which are loaded into the instruction memory using a Perl script. There are instances of complex code that are not supported by the processor; these instructions are broken down into instructions that the processor does support. This translator is written in C. The operations generally performed by the processor are sorting the contents of the payload, swapping the contents, etc.
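A toy sketch of the mnemonic rewriting performed by the translator, using the mappings listed in the report's translation table; the function name and interface are our own, and real instruction splitting (e.g. mov becoming a three-operand add) is omitted.

```c
#include <string.h>
#include <assert.h>

/* Rewrite an unsupported MIPS mnemonic to the custom-ISA mnemonic.
 * Mappings follow the report: addu -> add, addiu -> addi, li -> addi,
 * mov -> add (mov $x,$y becomes add $x,$y,$0). Supported mnemonics
 * pass through unchanged. */
const char *translate(const char *mips)
{
    static const char *map[][2] = {
        { "addu",  "add"  },
        { "addiu", "addi" },
        { "li",    "addi" },
        { "mov",   "add"  },
    };
    for (unsigned i = 0; i < sizeof map / sizeof map[0]; i++)
        if (strcmp(mips, map[i][0]) == 0)
            return map[i][1];
    return mips;
}
```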
MIPS instructions supported by the processor, along with their translations, are as follows:

MIPS ISA      Custom ISA
add           add
addi          addi
addu          add
addiu         addi
sub           sub
subi          subi
mov $x,$y     add $x,$y,$0
li            addi
slt           slt
slti          slti
sll           sll
slr           slr
lw            lw
sw            sw
beq           beq
We insert 2 NOOPs between consecutive instructions issued to the processor because of data-dependency problems in the 5-stage pipeline. No internal forwarding is implemented in the pipeline, but because of the fine-grain scheduling only 2 no-ops are sufficient. Register renaming is also employed to counter the problem of the same registers being updated by both threads: the register file is divided between the threads at the compiler level, splitting the 32-entry register file into two halves of 16 registers per thread.
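The NOP insertion can be sketched as a simple expansion pass over the emitted instruction stream; the all-zeros NOP encoding is an assumption.

```c
#include <stdint.h>
#include <assert.h>

#define NOP 0x00000000u  /* assumed NOP encoding */

/* Expand n instructions into 3n words, following each instruction
 * with two NOPs so a dependent instruction never reads a register
 * before the producer's writeback (no forwarding, two interleaved
 * threads). Caller must provide an output buffer of 3*n words. */
int insert_noops(const uint32_t *in, int n, uint32_t *out)
{
    int k = 0;
    for (int i = 0; i < n; i++) {
        out[k++] = in[i];
        out[k++] = NOP;
        out[k++] = NOP;
    }
    return k;  /* number of words written */
}
```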