
International Field Programmable Logic & Applications

Placement Strategies for 2.5D FPGA Fabric Architectures

FPGAs take advantage of 2.5D stacking technology to manufacture large-capacity, high-performance heterogeneous devices at reasonable cost. EDA tools need to be aware of, and exploit, the physical characteristics of such devices: for example, the reduced connection count between SLRs, the infrequent occurrence of SLL channels in the fabric, and the aspect ratios of individual SLRs. We implement a partition-driven placer to explore various EDA options that take advantage of architectural features in 2.5D FPGAs. We improve the routability of designs by optimizing the placement for discrete SLL channels and reduced connection counts. We propose a cut schedule for the partitioner that orients the placement with awareness of the aspect ratio of SLRs to reduce track demand within each SLR.

I. Introduction

2.5D stacking enables FPGAs to meet the twin demands of higher logic capacity and heterogeneity. It also permits lower-latency communication between dies than competing technologies. Devices with logic capacities that are impossible to build on a single die are made feasible by assembling multiple smaller, better-yielding dies on a passive interposer. Market demands for heterogeneity and specialized functionality can be met by integrating application-specific dies with FPGAs in a single package. In this paper, we address the EDA challenges specific to implementing multi-die FPGA systems. Betz et al. investigate the placement and routing challenges in multi-die FPGAs by enhancing the open-source academic VPR CAD tool to model and optimize for 2.5D FPGAs. We consider current customer designs and synthetic designs to outline key EDA challenges beyond those studied in that work, and propose techniques to address them. We limit our study to manufacturable 2.5D FPGAs as constrained by current technology and economic factors.

A. Terminology

We refer to a single monolithic FPGA die as a Super Logic Region, or SLR. A "2.5D" or multi-SLR device is assembled on a passive silicon interposer, and connections are made through micro-bumps, or uBumps. The inter-SLR connections, called Super Long Lines (SLLs), are made on the silicon interposer. In this paper, we express SLL capacity as a percentage of the track capacity that exists within the SLR. For example, 25% SLL capacity means that the number of SLLs crossing the SLR boundary is 25% of the number of wires that would be cut if the same cut were made at an arbitrary location within the SLR.

B. 2.5D Stacking in FPGAs 

Stacking technology is especially interesting for FPGAs because of the regularity of their logic cells and interconnect, which allows identical arrays to be connected on an interposer with fine-pitch wiring. The interposer consists of metal layers carrying the wire traces that connect the individual FPGA SLRs. We illustrate the physical limitations of 2.5D FPGAs by working through an example based on the 4-SLR 28nm test vehicle described in [6]. The number of uBumps available on each SLR limits the SLL count. Assuming a uBump pitch of 45 µm and an FPGA die size of 7 mm × 12 mm, we compute the maximum number of uBumps to be about 155 × 267 ≈ 41K per SLR. Assuming 30% of uBumps are unusable due to power and global signal considerations, and using half the uBump rows to communicate with adjacent SLRs, we can use (155 × 0.7) ÷ 2 ≈ 54 uBump rows. Assuming we meet the latency requirements and provision a sufficient number of metal layers on the interposer, we can achieve 54 × 267 ≈ 14.4K inter-SLR connections. Compared to a virtual monolithic device with identical logic capacity, this is about 25% of the vertical wires that would exist in the same region. Interfaces to the interposer (referred to as "SLL channels") on the FPGA fabric need to appear at discrete intervals because the uBump pitch is coarser than that of the fabric routing channels. The primary challenges in supporting placement and routing on multi-SLR devices arise from these two characteristics of SLLs: their relative infrequency and their reduced count compared to traditional interconnect resources.
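
The uBump budget above can be reproduced with a few lines of arithmetic. The sketch below is only an illustration: it takes the row and column counts quoted in the text (155 and 267) as inputs, and the function and parameter names are our own, not part of any tool.

```python
def sll_budget(rows=155, bumps_per_row=267,
               usable_fraction=0.7, neighbor_share=0.5):
    """Reproduce the uBump arithmetic from the text.

    rows, bumps_per_row: uBump grid quoted in the text for a 7 mm x 12 mm
        die at a 45 um pitch (7000/45 ~ 155.6, 12000/45 ~ 266.7).
    usable_fraction: share of uBumps left after power and global signals.
    neighbor_share: share of usable rows devoted to adjacent-SLR traffic.
    """
    total_ubumps = rows * bumps_per_row
    # Partial rows cannot be used, so truncate 54.25 down to 54.
    usable_rows = int(rows * usable_fraction * neighbor_share)
    inter_slr_connections = usable_rows * bumps_per_row
    return total_ubumps, usable_rows, inter_slr_connections


total, usable_rows, slls = sll_budget()
print(total, usable_rows, slls)  # 41385 54 14418
```

Changing `usable_fraction` or `neighbor_share` shows how sensitive the 14.4K figure is to the power/global-signal overhead and to how many rows are reserved for neighbor communication.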


II. Evaluation Platform

A. Device Model and Designs

We generate device models for 2.5D FPGAs with feature mixes and logic counts similar to those of commercial FPGAs, and implement them in the Vivado® Design Suite, modified to handle experimental architectures. We develop a global partition-driven placer, described in Section IV, combined with a packer and a simple move-based optimizer. We use the Vivado® router to implement designs. Implementation tools trade off the number of wires cut between SLRs against how balanced the utilization of each SLR is. If the utilization of the SLRs is balanced, the probability of routing failure within each SLR is reduced, while the number of inter-SLR cuts is increased. We use synthetic designs to understand this tradeoff because they allow us to incrementally control the design size, topology and complexity, as described in prior work. Synthetic designs offer several benefits. They allow us to:

1) Analyze the incremental impact of utilization and design complexity.
2) Create designs with expected logic capacities and complexities that may not exist in current customer designs.
3) Identify the "breaking point" (i.e., the point where designs become impossible to implement) of architectural decisions.

B. Estimated Channel Demand

To analyze routing demand, we compute the estimated channel demand (ECD) based on the placement of designs. For each net, we look at the connectivity between its blocks and apply a stochastic model to compute the probability of using horizontal or vertical tracks on a two-dimensional grid. This metric lets us assess the placement quality of a design in terms of routing congestion, independent of the routing architecture. The ECD computation is enhanced to be multi-SLR aware by identifying SLL channel locations in each SLR and splitting multi-SLR nets into subnets. We recursively partition the net at each SLR crossing and compute a separate ECD for each subnet.
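
To make the idea of a multi-SLR-aware ECD concrete, here is a minimal sketch. The stochastic model used in the paper is not specified in this text, so the sketch substitutes a simple stand-in: each two-pin net spreads its horizontal and vertical track demand uniformly over its bounding box. It also handles only a single SLR boundary rather than recursing over several. All names (`add_two_pin_demand`, `estimate_channel_demand`, `boundary_y`, `sll_channels`) are hypothetical.

```python
def add_two_pin_demand(h, v, p, q):
    """Spread a two-pin net's track demand uniformly over its bounding box."""
    (x1, y1), (x2, y2) = p, q
    dx, dy = abs(x2 - x1), abs(y2 - y1)
    cells = (dx + 1) * (dy + 1)
    for y in range(min(y1, y2), max(y1, y2) + 1):
        for x in range(min(x1, x2), max(x1, x2) + 1):
            h[y][x] += dx / cells  # expected horizontal track-units in this cell
            v[y][x] += dy / cells  # expected vertical track-units in this cell


def estimate_channel_demand(width, height, nets, boundary_y=None, sll_channels=()):
    """Return (horizontal, vertical) demand grids for a list of two-pin nets.

    boundary_y: first row of the upper SLR (assumed single boundary), or None
        for a monolithic device.
    sll_channels: x positions where SLLs are available (assumed discrete);
        must be non-empty if any net crosses the boundary.
    """
    h = [[0.0] * width for _ in range(height)]
    v = [[0.0] * width for _ in range(height)]
    for p, q in nets:
        crosses = (boundary_y is not None
                   and (p[1] < boundary_y) != (q[1] < boundary_y))
        if not crosses:
            add_two_pin_demand(h, v, p, q)
            continue
        # Split the net at the SLL channel with the smallest horizontal detour
        # and account for each subnet separately within its own SLR.
        lower, upper = (p, q) if p[1] < q[1] else (q, p)
        cx = min(sll_channels, key=lambda c: abs(lower[0] - c) + abs(upper[0] - c))
        add_two_pin_demand(h, v, lower, (cx, boundary_y - 1))
        add_two_pin_demand(h, v, (cx, boundary_y), upper)
    return h, v


def total(grid):
    return sum(sum(row) for row in grid)


net = [((0, 0), (0, 3))]  # a purely vertical net on a 4 x 4 grid
h_mono, v_mono = estimate_channel_demand(4, 4, net)
h_slr, v_slr = estimate_channel_demand(4, 4, net, boundary_y=2, sll_channels=[2])
print(total(h_mono), total(v_mono))
print(total(h_slr), total(v_slr))
```

In this toy case the monolithic estimate has no horizontal demand at all, while the split version picks up horizontal demand in both SLRs because each subnet has to detour to the channel at x = 2.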

III. EDA Issues for 2.5D FPGAs

A. Feasibility of Multi-SLR Devices

In this section, we examine the feasibility and effects of implementing current customer designs on multi-SLR devices. These designs utilize 20% to 95% of the various tile types available on modern FPGAs, including CLBs, RAM blocks, and DSPs. The number of nets in these designs ranges from 500K to 3.6 million. We recursively bisect each design into 4 partitions, where each partition must fit in one SLR of a 4-SLR target device, and compute the number of SLLs demanded across the SLR boundaries. The cut demand ranges from 0.5% to 10%, while physical limitations allow an SLL capacity of up to 25% (see Section I-B). This is the primary result that motivates the possibility of multi-SLR devices. In our experience, current customer designs can always be partitioned such that the SLL demand is less than the number of SLL tracks that we can physically supply.

Although the SLL demand is well below the supply, we still observe degradation in various design implementation metrics as the SLL supply is reduced. To illustrate this, we create 6 variants of a 4-SLR device, each with a different SLL count, and implement the partitioned customer designs on each variant. In Fig. 1, 100% SLL capacity refers to a device with no reduction of tracks across the SLR boundaries; hence, the 4-SLR device can be treated as a large monolithic device. We arbitrarily trim the interconnect resources that cross SLR boundaries to create devices with SLL capacities ranging from 75% down to 5%. Fig. 1 shows that all device variants with 12.5% SLL capacity or more can successfully route the designs. As expected, with 5% supply, the majority of designs fail due to oversubscription of SLLs. Routed wirelength increases by 3-4%, and critical path delay is impacted by 2%, as the SLL supply is reduced to 12.5% compared to the monolithic variant.
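
The cut-demand measurement described in this section can be sketched as follows. This is an illustrative simplification, not the paper's partitioner: it assumes the SLRs are stacked in a line, that a net spanning several SLRs needs one SLL at every boundary it crosses, and that the partition assignment is already given. The cell names, the `tracks_per_cut` value, and both function names are hypothetical.

```python
def sll_demand_per_boundary(nets, slr_of, num_slrs=4):
    """Count SLLs demanded at each boundary of a linear stack of SLRs.

    nets: iterable of nets, each a list of cell names.
    slr_of: mapping from cell name to SLR index (0 .. num_slrs - 1).
    Boundary b lies between SLR b and SLR b + 1.
    """
    demand = [0] * (num_slrs - 1)
    for net in nets:
        slrs = [slr_of[cell] for cell in net]
        for boundary in range(min(slrs), max(slrs)):
            demand[boundary] += 1
    return demand


def fits_supply(demand, tracks_per_cut, sll_capacity):
    """Check demand against supply, with SLL capacity given as a fraction
    of the tracks cut at an arbitrary location inside an SLR."""
    supply = tracks_per_cut * sll_capacity
    return all(d <= supply for d in demand)


nets = [["a", "b"], ["a", "c"], ["b", "c", "d"], ["c", "d"]]
slr_of = {"a": 0, "b": 0, "c": 2, "d": 3}
demand = sll_demand_per_boundary(nets, slr_of)
print(demand)                                # [2, 2, 2]
print(fits_supply(demand, 16, 0.25))         # True: supply is 4 SLLs per boundary
print(fits_supply(demand, 16, 0.05))         # False: supply is 0.8 SLLs per boundary
```

Dividing each entry of `demand` by `tracks_per_cut` gives the cut demand as a percentage, which is the quantity compared against SLL capacity in the text.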

B. Inter-SLR Connections and SLL Channels

We now illustrate the impact of inter-SLR cuts on placement quality and routability by experimenting with synthetic designs with controlled SLL demand. We implement the benchmarks on a 2-SLR FPGA with 25% SLL capacity. We create a device model with realistic uBump pitches, resulting in relatively infrequent SLL channels on the FPGA fabric. We generate the synthetic designs in the following manner:
1) Create 180 designs of varying utilization and routing complexity that are placeable in one SLR. These designs are neither too easy nor completely impossible to implement.
2) Create a duplicate instance of each design.
3) Connect the two design instances at the top level.
4) Constrain each design instance to a single SLR.

We control the SLL demand between SLRs by varying the number of top-level IO ports on each design instance in step (3). Since we constrain the instances to SLRs in step (4), the SLL demand is guaranteed to equal the number of connections between the two design instances. In Fig. 2a, we show the average ECD increase across the benchmark suite at various SLL counts. The plot shows that as the SLL demand increases, both horizontal and vertical ECD grow, indicating an increase in routing congestion and resource usage and a degradation in routability within each SLR. This is because more tracks are consumed for routing to and from the SLL channels. Fig. 2b shows the ECD heat map of a design with 80% resource utilization and 80% SLL demand, where most of the nets demanding SLLs are concentrated in the middle. The bar chart shows that several SLL channels are oversubscribed by more than 3x the SLLs available in the channel. This is because the placer is unaware of the capacity and location of the SLL channels. In Section IV-A, we explore strategies to make the placer SLL-channel aware to improve SLL access and routability.
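
The oversubscription effect can be illustrated with a small sketch. It assumes each inter-SLR connection simply uses the SLL channel nearest to where the placer happened to put it, which mimics a placer that ignores channel location and capacity. The positions, the channel capacity, and the function name are made up for illustration.

```python
from collections import Counter


def channel_subscription(crossing_xs, channel_xs, slls_per_channel):
    """Map each inter-SLR connection to its nearest SLL channel and return
    the load of every channel as a multiple of its capacity."""
    load = Counter()
    for x in crossing_xs:
        # Ties are broken toward the channel with the smaller x position.
        nearest = min(channel_xs, key=lambda c: (abs(c - x), c))
        load[nearest] += 1
    return {c: load[c] / slls_per_channel for c in channel_xs}


# Connections clustered near the middle of the die, as in the heat map.
crossings = [48, 50, 50, 52, 55, 49, 51, 10]
ratios = channel_subscription(crossings, channel_xs=[10, 50, 90], slls_per_channel=2)
print(ratios)  # {10: 0.5, 50: 3.5, 90: 0.0}
```

Here the middle channel is subscribed at 3.5x its capacity while the outer channels are nearly idle, even though the total demand (8) exceeds the total supply (6) only modestly.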



C. SLR Aspect Ratios

To minimize development costs, commercial FPGA vendors normally design a single routing architecture for an entire family of devices. Modern FPGA families offer both monolithic and 2.5D FPGAs on the same package technology. To remain reliable, packages have to be of reasonable size and aspect ratio. Hence, monolithic dies are relatively square, as shown in Fig. 3a. In 2.5D stacking, as we add more SLRs to a device to increase logic capacity or heterogeneity, we increase the size of the package in a single dimension (e.g., height), while the other dimension (e.g., width) remains constant. To enable flexibility in SLR integration, we naturally migrate towards SLR aspect ratios that are biased in one dimension. This can result in SLRs with noticeably different aspect ratios than monolithic dies. In Fig. 3b, we illustrate the difference in aspect ratio between the SLRs of a 4-SLR device and a monolithic die. As the ratio wSLR : hSLR increases, the design placed in each SLR is forced into a horizontal orientation, resulting in increased demand for horizontal tracks. In Section IV-B, we discuss strategies to optimize for different SLR aspect ratios by orienting the placement with awareness of the total tracks available in each dimension.
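
A rough way to see why a wide SLR shifts demand toward horizontal tracks is a toy model in which the endpoints of two-pin nets are placed uniformly at random in a w × h region. The expected horizontal span is then w/3 and the expected vertical span h/3, so the horizontal share of wirelength is w / (w + h). This uniform-placement model is our own simplification for illustration, not the analysis used in the paper, and the function name is hypothetical.

```python
def horizontal_wirelength_share(width, height):
    """Expected fraction of two-pin wirelength that runs horizontally when
    endpoints are uniform in a width x height region (toy model)."""
    expected_dx = width / 3.0   # mean |x1 - x2| for x uniform on [0, width]
    expected_dy = height / 3.0  # mean |y1 - y2| for y uniform on [0, height]
    return expected_dx / (expected_dx + expected_dy)


for w, h in [(1, 1), (2, 1), (4, 1)]:
    print(f"wSLR:hSLR = {w}:{h} -> horizontal share {horizontal_wirelength_share(w, h):.2f}")
```

Under this model a square region splits wirelength evenly, while a 4:1 region puts about 80% of it on horizontal tracks, which is the imbalance an aspect-ratio-aware placement tries to counter.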



Read more : http://kalman.mee.tcd.ie/fpl2018/content/pdfs/FPL2018-43iDzVTplcpussvbfIaaHz/zqvWXlbX8qRsjJmla3maX/2zU5fqaUOGluQQDLrJ3rMP.pdf
