ExpressFabric: Thinking Outside the Box
Avago PEX9700 Series of PCIe Next Generation Switches
Transform the Rack with ExpressFabric® Technology
The Avago PEX9700 series switch chips offer an industry-first suite of new features that dramatically improve performance while reducing power consumption and cost by 50% for the most demanding hyper-converged, NVMe, and rack scale systems.
The PEX9700 series with ExpressFabric technology enables high-performance, low-latency, scalable, cost-effective fabric capabilities. The technology, based on Gen3 PCI Express®, now provides the ability to share I/O with standard SR-IOV or multifunction devices, and to enable multiple hosts to reside on a single PCIe®-based network using standard PCIe enumeration. The hosts communicate through Ethernet-like DMA (NIC DMA) using standard hosts, end-points, and application software. A special very low latency host-to-host communication capability for short packets, called Tunneled Window Connection (TWC), is also available.
Reduced Cost, Power, and Latency
ExpressFabric technology is designed to replace the “bridging” and switching devices that operate within a cloud/data center rack. This situation is possible because virtually all of the components that form the foundation of the data center—CPUs, storage devices, and communication devices—have PCIe as at least one of their connections. By using PCIe as the main fabric, all of the components can interoperate directly. By removing the need to translate from PCIe (on the component) to Ethernet or InfiniBand (as two common alternatives), the cost and power of the rack can be substantially reduced. In addition, communicating directly between components also reduces latency.
It is common within data centers to have multiple fabrics within the rack. Ethernet is typically used for communications, Fibre Channel is popular for storage, and InfiniBand is common for HPC traffic. ExpressFabric technology has the capability to handle all of the different data types at line speed with a single fabric based on PCIe. This capability eliminates the need to partition different types of data using different protocols, which allows for a truly converged fabric where processors and endpoints can be allocated across the rack as required. In addition, they will all communicate across the low latency, high bandwidth PCIe path efficiently.
Direct Connection to SSDs
Enterprise-level SSDs are rapidly standardizing on PCIe as the primary interconnect to high performance flash memory. Storage subsystems based on this approach can attach directly to ExpressFabric technology, which allows for high performance, low-latency flash elements to be integrated into the fabric in a scalable fashion.
Shared I/O using Standards
An ExpressFabric-based system allows multiple hosts to share data with end-points using standard SR-IOV capable devices. Usually, an SR-IOV device allows multiple virtual machines (VMs) within a single host to share an end-point. ExpressFabric technology extends that, allowing the VMs within multiple hosts to have that same capability. In addition, this feature operates with standard, vendor-supplied SR-IOV drivers, so the hardware and software already installed can remain in place.
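The sharing model described above can be pictured with a small sketch. It is illustrative only: the class name `SriovDevice`, its methods, and the first-free allocation policy are assumptions made for this example, not part of any Avago software or of the SR-IOV specification. It shows virtual functions (VFs) of one end-point being handed out to VMs that live on different hosts.

```python
class SriovDevice:
    """Toy model of one SR-IOV end-point shared by VMs on several hosts.

    Hypothetical names and policy, for illustration only.
    """

    def __init__(self, num_vfs):
        self.free_vfs = list(range(num_vfs))  # VF indices not yet handed out
        self.assigned = {}                    # (host, vm) -> VF index

    def assign(self, host, vm):
        """Give the VM a VF, reusing its existing one if it already has one."""
        key = (host, vm)
        if key in self.assigned:
            return self.assigned[key]
        if not self.free_vfs:
            raise RuntimeError("no free virtual functions")
        vf = self.free_vfs.pop(0)
        self.assigned[key] = vf
        return vf

    def release(self, host, vm):
        """Return the VM's VF to the free pool."""
        vf = self.assigned.pop((host, vm))
        self.free_vfs.append(vf)


nic = SriovDevice(num_vfs=4)
vf_a = nic.assign("host1", "vm1")
vf_b = nic.assign("host2", "vm1")  # a VM on a different host, same end-point
```

In this sketch the two VMs receive distinct VFs of the same device even though they run on different hosts, which is the extension over single-host SR-IOV that the paragraph describes.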
General-Purpose Host-to-Host DMA
The majority of applications that run within a data center use Ethernet as the fabric, and a vast library of applications exist that have been deployed for this purpose. ExpressFabric technology enables that application software to run unchanged through the use of a virtual Ethernet NIC on each host port.
Low-Latency Host-to-Host NIC DMA
When performance is critical in clustering applications, NIC DMA is used to eliminate most of the software overhead of copying the data repeatedly. ExpressFabric has dedicated NIC DMA hardware to handle this function, offering high performance without specialized hardware.
The ExpressFabric-based solution is built on a hybrid hardware/software platform. The critical pathways have direct hardware support, which enables the fabric to offer non-blocking, line speed performance with features such as sharing or DMA.
The solution offers an innovative approach to setup and control, making use of an off-chip management CPU (mCPU) to initialize the fabric, configure the routing tables, handle errors and hot-plug events, and enable the solution to extend the capabilities without modifying the hardware.
One key feature that the mCPU enables is the ability to allow multiple hosts to reside on the PCIe network, but to do so using standard host enumeration methods. This has been a capability that, until now, was not possible with a PCIe-based system. The mCPU performs this task by synthesizing a hierarchy for each host. Because of this synthesis, the hosts “see” a normal PCIe hierarchy, but, in fact, they only see what the mCPU allows. The hosts have no direct connection within the fabric, and are thus able to run standard enumeration and software.
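As a rough illustration of the synthesis idea, the sketch below models an mCPU that keeps a fabric-wide device list and builds a separate view for each host. Everything here is hypothetical: the `ManagementCpu` class, the device IDs, and the flat slot numbering are assumptions for the example and do not reflect the actual firmware or a real PCIe bus/device/function layout.

```python
class ManagementCpu:
    """Toy model of an mCPU that decides what each host may enumerate."""

    def __init__(self):
        self.devices = {}  # fabric-wide device ID -> description
        self.grants = {}   # host name -> ordered list of visible device IDs

    def add_device(self, dev_id, description):
        self.devices[dev_id] = description

    def grant(self, host, dev_id):
        """Make one physical device visible to one host."""
        if dev_id not in self.devices:
            raise KeyError(f"unknown device {dev_id!r}")
        visible = self.grants.setdefault(host, [])
        if dev_id not in visible:
            visible.append(dev_id)

    def synthesize_hierarchy(self, host):
        """Return the host's private view as {virtual_slot: description}."""
        return {
            slot: self.devices[dev_id]
            for slot, dev_id in enumerate(self.grants.get(host, []))
        }


mcpu = ManagementCpu()
mcpu.add_device("nic0", "SR-IOV NIC")
mcpu.add_device("ssd0", "NVMe SSD A")
mcpu.add_device("ssd1", "NVMe SSD B")
mcpu.grant("hostA", "nic0")
mcpu.grant("hostA", "ssd0")
mcpu.grant("hostB", "nic0")
mcpu.grant("hostB", "ssd1")

view_a = mcpu.synthesize_hierarchy("hostA")
view_b = mcpu.synthesize_hierarchy("hostB")
```

Each host in the sketch enumerates only its own numbered view, so both can share `nic0` while neither sees the other's SSD.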
Tunneled Window Connection (TWC)
As part of the overall solution, the hosts can communicate in two different ways. DMA is typically used in data centers, and the ExpressFabric solution supports it seamlessly for larger message sizes.
When a need exists for a small message to be passed between hosts, an approach called TWC is available. TWC allows messages to be sent from one host to another in a very low-latency manner, and without the overhead associated with DMA.
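The split between the two host-to-host paths can be sketched as a simple size check. The cutoff value and the function name below are assumptions chosen for illustration; this brief does not state the message size at which one path becomes preferable to the other.

```python
# Hypothetical cutoff for illustration; not a PEX9700 specification.
TWC_MAX_BYTES = 256


def choose_path(message_len, twc_max_bytes=TWC_MAX_BYTES):
    """Pick a host-to-host path by message size.

    Short messages take the low-latency TWC path; larger ones use DMA,
    where the setup overhead is spread over more data.
    """
    if message_len < 0:
        raise ValueError("message length must be non-negative")
    return "TWC" if message_len <= twc_max_bytes else "DMA"


small = choose_path(64)
large = choose_path(64 * 1024)
```

A real system would likely tune the cutoff by measurement rather than fix it in code.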
Downstream Port Containment (DPC/eDPC)
Most servers have difficulty handling serious errors, especially when an end-point disappears from the system due to, for example, a cable being pulled. The problem tends to proliferate through the system until recovery becomes impossible. DPC/eDPC allows a downstream link to be disabled after an uncorrectable error. This capability makes error recovery feasible with the appropriate software, and it is especially critical in storage systems because the removal of a drive must be handled in a controlled and robust manner.
In addition to offering this PCI-SIG ECN, ExpressFabric devices track outstanding reads to downstream ports, and they synthesize a completion so that the host does not receive a completion timeout if the end-point is removed.
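The read-tracking behavior can be sketched as follows. This is a conceptual model only: the `DownstreamPort` class, the tuple format, and the `"synthesized"` status label are assumptions for the example, and the brief does not say what status the synthesized completion carries.

```python
class DownstreamPort:
    """Toy model of a downstream port that tracks outstanding reads."""

    def __init__(self):
        self.link_enabled = True
        self.outstanding = {}  # request tag -> requesting host

    def issue_read(self, tag, requester):
        """Forward a read, or answer it at once if the port is contained."""
        if not self.link_enabled:
            return (requester, tag, "synthesized")
        self.outstanding[tag] = requester
        return None

    def complete_read(self, tag):
        """The end-point answered normally; stop tracking the request."""
        requester = self.outstanding.pop(tag)
        return (requester, tag, "ok")

    def contain(self):
        """Disable the link after an uncorrectable error.

        Every read still pending gets a synthesized completion, so no
        requester is left waiting for a completion timeout.
        """
        self.link_enabled = False
        synthesized = [
            (requester, tag, "synthesized")
            for tag, requester in self.outstanding.items()
        ]
        self.outstanding.clear()
        return synthesized


port = DownstreamPort()
port.issue_read(tag=1, requester="hostA")
port.issue_read(tag=2, requester="hostB")
done = port.complete_read(1)   # the end-point answers tag 1
flushed = port.contain()       # the end-point is then removed
late = port.issue_read(tag=3, requester="hostA")
```

In the sketch, the read that was still in flight when the end-point vanished is answered by the port itself, and later reads to the contained port are answered immediately.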
Flexible Fabric Topologies
ExpressFabric technology eliminates the topology restrictions of PCIe. Usually, PCIe networks must be arranged in a hierarchical topology, with a single path to get from one point to another. ExpressFabric technology allows other topologies, such as mesh, fat tree, and many others, and it does this while allowing the components to remain architecturally and software-compatible with standard PCIe.
Improved SSC Isolation
ExpressFabric devices offer several mechanisms for supporting multi-clock domains that include spread spectrum clocking, which eliminates the need to pass a common clock across a backplane. In addition to the standard Avago approach to the problem—a mechanism that the company has included in its products for several generations—Avago also has added the new PCI-SIG approach, called SRIS (Separate Refclk Independent SSC Architecture).
With this standard approach to SSC isolation, devices from different vendors can offer this benefit, which provides more flexibility to the system designer.
Built on a Solid Foundation
Avago ExpressFabric devices are built on the same basic switching element foundation as the current family of high lane count devices. As such, they support the same powerful set of features that are offered in the standard devices from Avago, including:
- Low latency of ~150 ns (x16 to x16) per hop
- Highly flexible port configuration
- Flexible register configuration
- SerDes power and signal management
- Flexible internal buffer allocation and packet flow control
- Direct hot-plug capability for up to six ports on the largest device, and serial hot-plug for all of the ports
- Avago performancePAK and visionPAK suites
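For topologies with more than one switch in the path, the per-hop figure in the list above gives a quick estimate of the switch contribution to latency. The sketch below simply multiplies the approximate 150 ns value by a hop count. The function name and the hop counts are assumptions, and the estimate leaves out cable, end-point, and software latency.

```python
# Approximate x16-to-x16 per-hop latency quoted in the feature list.
PER_HOP_LATENCY_NS = 150


def switch_latency_ns(hops, per_hop_ns=PER_HOP_LATENCY_NS):
    """Estimate the latency added by the switches alone along a path."""
    if hops < 0:
        raise ValueError("hop count must be non-negative")
    return hops * per_hop_ns


one_hop = switch_latency_ns(1)     # path through a single switch
three_hops = switch_latency_ns(3)  # for example, leaf -> spine -> leaf
```

Under these assumptions a three-switch path adds roughly 450 ns of switch latency.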
Products based on ExpressFabric technology can deliver outstanding solutions for designing a heterogeneous system where a requirement exists for a flexible mix of processors, storage elements, and communication devices.
An appliance is a dedicated function box that offers a specific capability and is connected to the rest of the system through a standard interface—usually Ethernet. This approach is commonly used in storage, because it is relatively easy to add more storage by just including it as part of the network.
Most modern high speed storage subsystems have a mix of rotating media and SSDs to balance performance and cost, and include some processing as well to manage the system. These systems can be deployed efficiently with ExpressFabric, because the storage subsystems all hook up to PCIe either directly (SSDs) or indirectly (SAS or SATA controllers), and can communicate directly with the processors and communication chips.
High-Performance Computing (HPC) Clusters
HPC clusters are made up of high-performance processing elements that communicate through high bandwidth, low latency pathways to execute applications, such as medical imaging, financial trading, data warehousing, and so on.
An ExpressFabric-based solution can offer the same capabilities: high bandwidth, low latency, and a switch fabric. The processing subsystems can be hooked up directly to the PCIe fabric and run the same application software, which benefits from lower cost and power due to the elimination of the bridging devices. In addition, clustering systems can be built with I/O sharing as an additional native capability when required, which is not usually provided with traditional clustering systems built on InfiniBand.
Rack- and Blade-based Servers
Typical server boxes that create modern cloud and enterprise data centers consist of racks that include modular subsystems that communicate with each other over a backplane or through cables. The connections within the racks benefit from using ExpressFabric technology. Instead of treating each subsystem as a separate server node (with some predetermined or limited quanta of processing, storage, and communication), the blades on an ExpressFabric-based solution can be put together with dedicated blades that perform a specific function.
This disaggregated approach allows the right mix of each function, depending upon the specific needs of the application. Because they are all hooked up directly to each other through a PCIe connection, the latency between subsystems is very low, and, architecturally, the entire system looks like one large system from a software perspective.
A MicroServer is a system designed with a large number of lower power and lower cost processing engines rather than larger (and thus, much higher power and cost) high-end server processors. They offer substantial benefits when the applications require a large amount of aggregate processing, but where the application can be spread among a lot of smaller engines. Some typical applications are Web servers and Hadoop data analysis.
Most MicroServer elements today are made up of Systems-on-a-Chip (SoCs) that have processing, storage, and communication, and these elements are hooked together with either proprietary or low-speed Ethernet connections. Because similar processing elements have PCIe on them, in general, ExpressFabric is an ideal interconnect for a MicroServer system.
The existing SoCs can be hooked together for a standard, low latency, high performance solution, or the different elements can be disaggregated as with standard servers.
Avago offers an ExpressFabric-based development platform that includes both hardware and software tools. A hardware reference platform implements a full rack-level top-of-rack switch box, a fully functional firmware package that enables the fabric switch to operate, and the host drivers that complete the package.
ExpressFabric Reference Platform
To allow system development and demonstration, Avago offers the PXF 55033: a 32-port, 1U form factor rack-mountable top-of-rack (ToR) fabric switch box. This system attaches to the rack servers through an optional redriver-based PCIe plug-in card: PXF 51003. The connection between the adapter card and the ToR switch is through industry standard QSFP+ connectors and either copper or optical cables.