Architectures and Compilers to Support Reconfigurable Computing

by João M. P. Cardoso and Mário P. Véstias

Introduction

The main characteristic of Reconfigurable Computing (RC) is the presence of hardware that can be reconfigured (reconfigware - RW) to implement specific functionality more suitable for specially tailored hardware than on a simple uniprocessor. RC systems join microprocessors and programmable hardware in order to take advantage of the combined strengths of hardware and software [20, 5] and have been used in applications ranging from embedded systems to high performance computing. Many of the fundamental theories have been identified and used by the Hardware/Software Co-Design research field [16]. Although the same background ideas are shared in both areas, they have different goals and use different approaches. Although the basic concept was proposed in the 1960s, RC has only recently been feasible; this is the result of the availability of high-density reconfigurable devices. These devices make the hardware characteristics of Application Specific Integrated Circuits (ASICs) much more flexible.
During the past 5 years a large number of RC systems developed by the research community have demonstrated the potential for achieving high performance for a range of applications. However, performance improvements possible with these systems typically are dependent on the skill and experience of hardware designers. This method of RW programming cannot fully exploit the increasing density of reconfigurable devices. Hence, a current challenge in this area is the establishment of an efficient RW compiler that would help the designer accomplish adequate performance improvements without the need to be involved in complex low level manipulations. Although mature design tools exist for logic and layout synthesis for programmable devices, High-Level Synthesis (HLS) and multi-unit partitioning (both spatial and temporal) need to be further developed.

Reconfigurable computing: Why and How?

Usually the target RC system is based on multiple SRAM-based Field Programmable Gate-Arrays (SRAM-FPGAs), which act as Reconfigurable Processing Units (RPUs), coupled to a personal computer. The coupling of dedicated hardware to a host computer (as shown in Figure 1) in order to accelerate some computationally intensive tasks, is not a new concept. A common example in the PC industry is the use of graphic boards to accelerate graphical transformations. However, such hardwired application-specific accelerator circuits require great design effort and typically use more silicon area than the microprocessor itself.

Figure 1. A hardware device added to an existing architecture for acceleration purposes.

The use of reconfigurable hardware makes it possible to couple the host computer with more flexible hardware resources, better adapted to the application under current execution, and with the possibility to adapt to new versions of an application. Moreover, the dynamic reconfiguration capability allows for time-sharing of different tasks, which may significantly reduce the required silicon area. These devices have a faster growth of transistor density than that of general processors, as can be seen in Figure 2. The newest FPGAs can support circuits with about 500,000 equivalent gates in the same device and improved FPGAs (such as the Xilinx Virtex(tm) series [29]) with one-million gates have been announced.
With the introduction of FPGAs with faster reconfiguration times and partial reconfiguration support, it is possible to use FPGAs in a dynamically reconfigurable environment. This technology makes possible the concept of unlimited hardware or "virtual hardware".

Figure 2. The growth of transistor density in processors [24] and FPGA devices.

Virtual Hardware

One of the major disadvantages of RC is that only the portion of the application that does not exceed the total size of the available RPUs can be migrated to hardware. Recently, a concept similar to virtual memory was adapted to allow the use of multiple configuration RAM sets stored in an RPU. Hardware units can be mapped onto RPUs that lack sufficient hardware resources by means of temporal partitioning. This new concept is addressed in [21, 9]. However, since temporal partitioning techniques are not yet mature, their practical implications for hardware acceleration have not been established, and the algorithms proposed so far are not suitable for the acceleration of programs in an RC environment. There is more research effort on the concept of virtual hardware in the field of rapid prototyping of large hardware circuits. Researchers in this field do not consider memory accesses, because the input is a pure hardware description. This means that special effort must be made in order to partition and schedule large graphs obtained from high-level programming languages.

The virtual hardware concept is implemented by time-sharing a given RPU. It needs a scheduler that is responsible for the configurations, the execution, and the communications between temporal partitions.
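The time-sharing idea can be illustrated with a minimal sketch of temporal partitioning. It is not one of the algorithms from [21, 9]; it is a simple greedy pass that assumes the operators are already listed in topological order and that each has a known area. The function name temporal_partition and the area units are hypothetical.

```c
#include <stddef.h>

/* Greedy temporal partitioning sketch (illustrative only).
 * area[i]  : area of operator i, in hypothetical units such as FPGA cells
 * n        : number of operators, assumed to be in topological order
 * capacity : area available in one RPU configuration
 * part[i]  : output, index of the temporal partition holding operator i
 * Returns the number of partitions, or -1 if one operator alone
 * exceeds the capacity of the RPU. */
int temporal_partition(const int *area, size_t n, int capacity, int *part)
{
    int current = 0;   /* index of the partition being filled */
    int used = 0;      /* area already used in that partition */

    for (size_t i = 0; i < n; i++) {
        if (area[i] > capacity)
            return -1;             /* operator larger than the whole RPU */
        if (used + area[i] > capacity) {
            current++;             /* start a new configuration */
            used = 0;
        }
        part[i] = current;
        used += area[i];
    }
    return n == 0 ? 0 : current + 1;
}
```

Because operators are assigned in topological order, data only flows from earlier partitions to later ones, so a runtime scheduler could load the configurations one after another and pass intermediate values between them.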

RC systems and parallel computing systems have the same objective: to speed up the execution of a given application. For parallel computations, acceleration is achieved through the exploitation of the parallelism in a program and its mapping onto an architecture of various processors, but for RC it is achieved through the migration of the most computationally-intensive parts of the program to RPUs. With RC, the parallelism of a given application is exploited through the concurrent model of execution (hardware), instead of the sequential von Neumann model of computation used by traditional parallel computation.
Scientific journals devoted to general research have already highlighted this field [25] and during the last few years, numerous approaches to RC have appeared in special topical conferences [1, 2, 3]. However, aside from new architectures that have been proposed and new concepts that have been presented, the utilization of these systems has been reported only for academic research. As discussed in [22], a methodology to automatically exploit the RC target architecture must be supported by compilers for RC systems to become widespread. Until effective support for these systems is available, software programmers will not be attracted to RC because the implementation of the hardware parts of the application is time consuming and needs hardware specialists.
The development of applications in an RC environment requires the use of one of the two following methodologies:

the use of hardware objects (cores) bought from specific vendors or developed by hardware designers, with a specific interface to the software language used. This can be suitable for developing applications that need standard functionality such as FFT, JPEG/MPEG encoders/decoders, etc.
the use of a tool to compile from a high-level software programming language to the object code to run on a processor and the configuration files to program the RPUs. State-of-the-art techniques from the compiler and design automation worlds must be combined to achieve performance improvements. In particular, special care must be devoted to loop transformations, such as the exploitation of partial and total loop unrolling, to automatically achieve better acceleration factors. A hardware library of operators commonly used by high-level software programming languages must allow the specialized synthesis techniques to take advantage of the partial and on-the-fly reconfiguration characteristics of the target hardware.

The integration of HLS [10] techniques in the methodology to target the RC systems is the best known technique. This will permit the rapid generation of the hardware with the use of a library of basic hardware units. However, traditional HLS techniques, which target ASIC implementations, must be redefined because the target technology does not have the layout freedom of ASICs and the objective of accelerating the given application in the RC environment must prevail. The time spent in HLS, logic synthesis, placement, and routing, is unacceptable in an RC environment. To be adopted as a new computing paradigm, it is necessary to decrease the overall compilation time, to provide efficient support for virtual hardware, and to provide the use of RPUs tightly coupled to the main processor.

RC Architectures Design Space

The type of interconnection between the RPU and the host system and the level of granularity of the RPU constitute a wide design space to be explored. Many points of this design space have already been explored and many more are left for further research. However, it is still not possible to identify a dominant solution for all kinds of applications. The next two sections will review some of the architectures used in RC systems.

Coupling RPUs To The Host

The type of coupling of the RPUs to the existing computing system has a big impact on the communication cost. The coupling can be classified into one of the four groups listed below, which are presented in order of decreasing communication costs:

RPUs coupled to the I/O bus of the host (Figure 3.a). This group includes many commercial circuit boards. Some of them are connected to the PCI (Peripheral Component Interconnect) bus of a PC or workstation.
RPUs coupled to the local bus of the host (Figure 3.b).
RPUs coupled like co-processors (such as the REMARC - Reconfigurable Multimedia Array Coprocessor - [18]) (Figure 3.c).
RPUs acting like an extended data-path of the processor (such as the OneChip [27], the PRISC - Programmable Reduced Instruction Set Computer - [23], and the Chimaera [13]) (Figure 3.d).


Figure 3. Organization of RC systems with respect to the coupling of the RPU to the host computer: a) RPU coupled to the I/O system bus; b) RPU coupled to the local bus; c) RPU coupled to the CPU; d) RPU integrated in the processor chip.
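The effect of the coupling on the achievable acceleration can be made concrete with a rough, Amdahl-style estimate. The sketch below is an assumption for illustration, not a model taken from the cited work: it supposes that the communication and reconfiguration times simply add to the hardware execution time, and the function and parameter names are hypothetical.

```c
/* Rough speed-up estimate for moving one kernel to an RPU.
 * All times use the same unit (for example, milliseconds).
 * t_total_sw  : execution time of the whole program in software
 * t_kernel_sw : part of t_total_sw spent in the kernel to be migrated
 * t_kernel_hw : execution time of that kernel on the RPU
 * t_comm      : time to move data between host and RPU
 * t_reconfig  : time to (re)configure the RPU
 * Returns 0.0 if the inputs give a non-positive accelerated time. */
double estimated_speedup(double t_total_sw, double t_kernel_sw,
                         double t_kernel_hw, double t_comm,
                         double t_reconfig)
{
    double t_accel = (t_total_sw - t_kernel_sw)
                   + t_kernel_hw + t_comm + t_reconfig;
    if (t_accel <= 0.0)
        return 0.0;
    return t_total_sw / t_accel;
}
```

Under this model, a kernel that takes 80 of 100 time units in software and 8 units on the RPU gives a speed-up of about 3.3 when communication costs 2 units, and no speed-up at all when communication costs 72 units, which is why the tighter couplings in the list above matter.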

RPUs Granularity

Most current research efforts use FPGAs and directly manipulate their low-level processing elements. An FPGA cell typically consists of a flip-flop and a function generator that implements a boolean function of up to 4 variables. These elements can be used to implement nearly any digital circuit. This fine grain parallelism has been shown to be inefficient in time and area for certain classes of problems. Some ongoing research uses Arithmetic-Logic Units (ALUs) as the basic hardware primitives of the RC system. In some cases this approach can provide more efficient solutions, but somewhat limits the flexibility of the system. The approaches using an abstraction layer of coarse-grain granularity (operand level) over the physical fine-grain granularity of an FPGA have more flexibility but the compilation time is larger. This abstraction level is provided with the use of a library of uni-functional Relatively Placed Macros (RPMs) with direct correspondence to the operators of the high-level software programming languages (arithmetic, logical, and memory access). This library can have more than one implementation of the same operator (different relations of area/latency/configuration time). Furthermore, the use of a library will decrease the overall RW compilation time, will make possible more accurate estimations, and is already a proven concept in HLS systems.
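A library with several implementations of the same operator lets a compiler trade area against latency. The following sketch shows one possible selection rule, assuming each library entry records only an area and a latency; the struct rpm_impl, the function select_impl, and the example entries are hypothetical and are not taken from any real RPM library.

```c
#include <stddef.h>

/* One implementation of an operator in a hypothetical RPM library. */
struct rpm_impl {
    const char *name;
    int area;      /* occupied cells */
    int latency;   /* clock cycles per operation */
};

/* Return the index of the lowest-latency implementation whose area
 * fits within area_budget, or -1 if none fits. Ties keep the entry
 * that appears first in the library. */
int select_impl(const struct rpm_impl *impls, size_t n, int area_budget)
{
    int best = -1;
    for (size_t i = 0; i < n; i++) {
        if (impls[i].area > area_budget)
            continue;                      /* does not fit */
        if (best < 0 || impls[i].latency < impls[best].latency)
            best = (int)i;                 /* faster candidate found */
    }
    return best;
}
```

A fuller cost function would also weigh configuration time, which the text lists as a third dimension of the trade-off.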
Various architecture concepts have been considered for use in RPUs, with differences in the flexibility and level of granularity used:

The use of dynamic processor cores in the FPGA, such as the DISC (Dynamic Instruction Set Computer). This approach permits the dynamic reconfiguration of dedicated hardware units when a new instruction appears whose corresponding unit has not yet been configured [26]. This processor reconfigures the new instructions at runtime. A more flexible solution is the use of a processor with a single MOVE instruction, such as the URISC (Ultimate RISC), as proposed in [7].
The utilization of RPUs with medium granularity like the Reconfigurable Data Path Array (RDPA) [11]: These RPUs are designated as Field-Programmable ALU Arrays (FPAAs) and have ALUs as reconfigurable blocks (see Figure 4). These 32-bit ALUs can be reconfigured to execute some of the operators of the high-level language C. A framework to deal with this type of RPU, CoDe-X, is presented in [6]. Its objective is to automatically map the most suitable portions of a C program to the RDPAs. Results achieved with this approach are shown in [11], and speed-ups (compared to the execution of the programs without RW support) of 13 for a JPEG compression algorithm and 70 for a two-dimensional FIR (Finite Impulse Response) filter are reported.

Figure 4. The matrix of reconfigurable datapaths of the Xputer system.

A coarse-grain approach such as the MATRIX project [17], which is based on an array of identical 8-bit functional units and a reconfigurable network: Each functional unit contains memory, an ALU, a multiplier unit, and control logic.

With the existence of RPUs that permit partial reconfiguration, it is possible to reconfigure regions of the RPU while others are executing. These types of RPUs have many advantages over traditional FPGAs. One of the FPGAs especially suitable to RC is the XC6200 series of Xilinx(tm) [28]. It has very good properties such as a feature that allows the host processor to access the internal registers of the FPGA by address (like a typical memory access) without the need for any special routing or wasting of cells. It has a symmetrical internal structure, which allows units to be mapped independently of the position, and which configures 1,000 times faster than the traditional internal structures. A proposed reconfiguration based on layers of reconfiguration selected by a context switch has been addressed in [12].
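Accessing RPU registers "by address" means that, from the host's point of view, talking to the hardware looks like ordinary loads and stores. The sketch below illustrates the idea in general terms; it does not reproduce the XC6200 register map, and the helper names and register indices are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register indices inside the RPU address window. */
enum { REG_OPERAND_A = 0, REG_OPERAND_B = 1, REG_RESULT = 2 };

/* Write one 32-bit word to a memory-mapped RPU register.
 * 'base' is where the RPU address window is mapped on the host;
 * 'volatile' keeps the compiler from caching or dropping accesses. */
static void rpu_write_reg(volatile uint32_t *base, size_t index,
                          uint32_t value)
{
    base[index] = value;
}

/* Read one 32-bit word from a memory-mapped RPU register. */
static uint32_t rpu_read_reg(volatile uint32_t *base, size_t index)
{
    return base[index];
}
```

On a real board the base pointer would come from the operating system's mapping of the device; for experimentation, an ordinary array can stand in for the register window.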

Compilation Techniques Suitable for RC

As detailed in [15], to develop an effective RW compiler it is necessary to understand the state of the art in two research fields: compilers and ECAD (Electronic Computer-Aided Design). A number of data-flow techniques used by software compilers [4, 19] to generate object code can be used to generate hardware images with faster execution: operator strength reduction, dead code elimination, common sub-expression elimination, constant folding (constant propagation), variable propagation (copy propagation), tree-height reduction, memory accesses reduction (e.g., the elimination of redundant memory accesses over loop iterations), function inlining, etc.

Example 1:

Figure 5. Operation strength reduction of the example: A = 3 x B (where A and B are integers).
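Since the figure itself is not reproduced here, the transformation can be sketched in code. The usual reduction rewrites 3 x B as (B shifted left by one) + B, which replaces a multiplier with a single adder, since a shift by a constant is only wiring in hardware. The function names are hypothetical, and the unsigned casts are an addition of this sketch to keep the shift well defined for negative values in C (the final conversion back to int assumes the usual two's complement behaviour).

```c
/* Original form: one integer multiplication. */
int times3_mul(int b)
{
    return 3 * b;
}

/* Strength-reduced form: 3*B = (B << 1) + B.
 * The arithmetic is done on unsigned values because left-shifting
 * a negative signed integer is undefined behaviour in C. */
int times3_shift_add(int b)
{
    unsigned int ub = (unsigned int)b;
    return (int)((ub << 1) + ub);
}
```

Both functions return the same value for every input whose tripled value fits in an int.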

These techniques produce better results when they are used with some type of loop transformation technique because of the repetitive nature of these structures. At the loop level, many transformations are used by software compiler designers [19]; they include strip mining or loop tiling, loop fusion, loop splitting or loop distribution, loop interchanging, loop permutation, loop reversal, loop skewing, loop peeling, and loop unrolling or loop expansion.
The loop unrolling of deterministic (statically bound) cyclic structures can achieve a large gain in performance, because of the elimination of the loop control and by making possible the use of the above data-flow techniques. However, loop unrolling, when mapped to hardware, can produce an unacceptable hardware area. Thus, it is important to exploit the loop unrolling factor together with the temporal partitioning of the large unrolled loops to make mapping to a single or multiple RPU(s) possible.
The exploitation of potential parallelism and scheduling algorithms in loop unrolling with different unrolling factors has not yet been analyzed by any author. Furthermore, loop regions may contain many memory accesses which constrain the scheduling of operations because the system architecture usually has only one-port memories. Thus, the optimization of memory accesses can have great impact on the achieved performance.

Example 2:

a) Initial code:

int Sum = 0;
for (int i = 0; i < 4; i++) {
    Sum += a[i];
}

b) Final code:

int Sum;
Sum = a[0] + a[1] + a[2] + a[3];

Figure 6. A loop unrolling example which sums four array elements: a) initial code; b) final code.

Figure 7. Scheduling operations from the unrolled loop under memory access constraints (data flow graph).
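The memory constraint can be explored with a small scheduling model. The sketch below is a simplification for illustration, not the scheduler behind Figure 7: it assumes that every load and every addition takes one cycle, that adders are unlimited, and that a value becomes usable in the cycle after it is produced. The function name sum_schedule_cycles is hypothetical.

```c
/* Number of cycles needed to sum n array elements when at most
 * 'ports' memory reads can be issued per cycle.
 * Model assumptions: unit latency for loads and adds, unlimited
 * adders, values usable one cycle after they are produced.
 * Returns -1 for invalid arguments. */
int sum_schedule_cycles(int n, int ports)
{
    if (n < 0 || ports < 1)
        return -1;

    int loaded = 0;   /* elements already read from memory */
    int ready = 0;    /* values available as adder inputs */
    int cycles = 0;

    while (loaded < n || ready > 1) {
        int adds = ready / 2;          /* pair up all ready values */
        int loads = n - loaded;        /* reads still outstanding */
        if (loads > ports)
            loads = ports;             /* limited by memory ports */
        ready = (ready % 2) + adds + loads;
        loaded += loads;
        cycles++;
    }
    return cycles;
}
```

For the four-element sum of Example 2, this model gives 5 cycles with a one-port memory, 4 cycles with two ports, and 3 cycles with four ports, which shows how the single memory port, rather than the adders, limits the unrolled loop.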

Reconfigware/Software Compilation

The compiler for RC systems has as input the software program of a given application and produces two images: the executable model for the CPU and the configurations for the RPUs, as shown in the compilation flow of Figure 8.

Figure 8. Compilation flow of a software program to run on an RC system.

The intermediate format can be obtained by front-end compilers of high-level software programming languages. These graph representations can also be obtained, for example, from the intermediate model used by the GNU C/C++ compiler. The compiler must be an integrated tool suitable to the programming of RC systems.
The compiler must generate a structural hardware representation (such as VHDL-RTL) that represents the connections between units contained in a library of RPMs, with direct correspondence to the operators of high-level programming languages (such as Java(tm)). Unlike generic HLS techniques, allocation for partial hardware reconfiguration requires that different digital blocks have identical "shapes" and similar interfaces in order to enable temporal sharing. Therefore, the representation of the hardware operators will include specific attributes to specify the relative locations and shapes of the corresponding hardware which will be taken into account by the placement and routing tool. Moreover, the system will generate a description of the control unit of the runtime scheduler of reconfigurations to the same RPU. By using temporal partitioning, the host computer will view the set of RPUs as an unlimited hardware resource.

The GALADRIEL Compiler

Generic HLS systems [10] that target ASIC implementations have been shown inadequate for RC systems. Technology driven architectural synthesis algorithms must be developed with the following specific objectives in mind:

The generation of hardware images with the lowest execution time possible by exploiting the existence of the potential parallelism of the given application.
Temporal partitioning of the graph segment that corresponds to hardware when there is not sufficient available area. This case implies the introduction of communication between partitions and the control of the scheduling of different configurations, as well as partial executions.
The control of Direct Memory Accesses (DMA) on each partition. Since the memory accesses are quantitatively of great importance in the computer systems [14], alternative solutions for mapping arrays on the host computer or on local memory (board of RPUs) must be evaluated.
Exploitation of loop unrolling techniques targeting theoretically unlimited hardware.

Conclusion

One of the difficulties of research in this field is that each author uses his own examples, often publicly unavailable, so results from different implementations are impossible to compare. The majority of work starts from programming languages with little or no utilization in the real world and frequently uses hardware description languages, which prevents the use of the compiler tools by software programmers. Since RC systems are partially handcrafted, it is very important to develop a Reconfigware/Software compiler tool that receives as input a software program described at a high level of abstraction. To make this possible it is necessary to close the gap between the software and the hardware model with advanced compiler transformations.
A significant speed-up can be achieved only by using a compiler capable of extracting and exploiting the massive amounts of parallelism existent in the input application.
We have described briefly some types of architectures to support the RC concept. However, adopting a particular RPU granularity or abstraction level is not trivial because of the lack of comparable implementation results for a set of applications.

Acknowledgements

We would like to acknowledge our research advisor Prof. Horácio Neto for his guidance and assistance, and Virginia Chu, who helped us to improve the English in this paper. We would also like to acknowledge the support given by: the Portuguese Ph.D. program of the Prodep 5.2 action, and the Portuguese Ph.D. program of the Praxis XXI.

Glossary

Reconfigurable Computing: A kind of computing based on custom computing machines, which combines high-density FPGAs along with processors to achieve the best of both worlds. Definition by Dinesh Bhatia: "An ability for software to reach through the hardware layers and change the data-path for optimising the performance".
Configuration: Set-up the logic in the cells and the routing between them.
Reconfigurable Devices: Logic devices that can be customised once or as much as needed. In contrast to configurable devices which can be customised only once.
Dynamic Reconfiguration: The concept of customisation on-the-fly.
High-Level Synthesis (HLS): Also known as behavioural synthesis or architectural synthesis. Definition in [10]: "A transformation of a behavioural description into a set of connected storage and functional units".
Temporal Partition: The ability to map a function which exceeds the available space of the reconfigurable device using time sharing.
Spatial Partition: The ability to partition a function onto a set of reconfigurable devices.
Equivalent Gates: It is the number of transistors of the considered circuit divided by the number of transistors of a 2-input NAND gate (usually four transistors in a CMOS technology).
Hardware/Software Co-Design: Concurrent design of hardware and software.
VHDL: VHSIC (Very High Speed Integrated Circuit) Hardware Description Language.
SRAM FPGA: A kind of FPGA based on static memory technology that is re-programmable and in-system programmable and requires external boot devices.

References

1: ___, Proceedings of the International Workshop on Field-Programmable Logic and Applications. 1991-1998. Printed by Springer-Verlag. See http://xputers.informatik.uni-kl.de/FPL/index_fpl.html
2: ___, Proceedings of the Reconfigurable Architectures Workshop. 1994-1998. Printed by Springer-Verlag. See http://xputers.informatik.uni-kl.de/RAW/index_raw.html
3: ___, Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines. 1992-1998. Printed by IEEE Computer Society Press, Los Alamitos, Calif. See http://www.fccm.org/.
4: Aho, A. V., Sethi, R., Ullman, J. D. Compilers: Principles, Techniques and Tools. Addison Wesley, 1986.
5: Athanas, P., and Silverman, H. Processor Reconfiguration through Instruction-Set Metamorphosis: Architecture and Compiler. IEEE Computer, vol. 26, n. 3, March 1993, pp. 11-18.
6: Becker, J., Hartenstein, R., Herz, M., Nageldinger, U. Parallelization in Co-Compilation for Configurable Accelerators. In Proceedings of the Asia South Pacific Design Automation Conference (ASP-DAC'98), Yokohama, Japan, February 10-13, 1998.
7: Brebner, G., Donlin, A. Runtime Reconfigurable Routing. In Proceedings of the Reconfigurable Architectures Workshop (RAW'98), Orlando, Florida, USA, March 30, 1998.
8: Cardoso, J. M. P., and Neto, H. C. Towards an Automatic Path from Java(tm) Bytecodes to Hardware Through High-Level Synthesis. In Proceedings of the 5th IEEE International Conference on Electronics, Circuits and Systems (ICECS-98), Lisbon, Portugal, September 7-10, 1998, pp. 85-88.
9: GajjalaPurna, K. M., and Bhatia, D. Temporal Partitioning and Scheduling for reconfigurable Computing. In Proceedings of the 6th IEEE Symposium on Field Programmable Custom Computing Machines (FCCM'98), Napa Valley, California, USA, April 15-17, 1998.
10: Gajski, D. D., Dutt, N. D., Wu, A. C.-H., Lin, S. Y.-L. High-Level Synthesis, Introduction to Chip and System Design. Kluwer Academic Publishers, Boston, Dordrecht, London, 1992.
11: Hartenstein, R. W., Becker, J., et al. High-Performance Computing Using a Reconfigurable Accelerator. In CPE Journal, Special Issue of Concurrency: Practice and Experience, John Wiley & Sons Ltd., 1996.
12: Hartenstein, R., Herz, M., Hoffmann, T., Nageldinger, U. On Reconfigurable Co-Processing Units. In Proceedings of the Reconfigurable Architectures Workshop (RAW'98), Orlando, Florida, USA, March 30, 1998.
13: Hauck, S., Fry, T. W., Hosler, M. M., Kao, J. P. The Chimaera Reconfigurable Functional Unit. In Proceedings of the 5th IEEE Symposium on Field Programmable Custom Computing Machines (FCCM'97), Napa Valley, California, USA, April, 1997, pp. 87-96.
14: Hennessy, J. L., Patterson, D. A. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publ., 1990.
15: Mangione-Smith, W. H., et al. Seeking Solutions in Configurable Computing. IEEE Computer 30,12 December 1997, pp. 38-43.
16: Micheli, G. De, Gupta, R. Hardware/Software Co-Design. In Proceedings of the IEEE, vol. 85, no. 3, March 1997, pp. 349-365.
17: Mirsky, E., and DeHon, A. MATRIX: A Reconfigurable Computing Device with Configurable Instruction Deployable Resources. In Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines, Napa Valley, California, USA, April 17-19, 1996, pp. 157-166.
18: Miyamori, T., and Olukotun, K. A Quantitative Analysis of Reconfigurable Coprocessors for Multimedia Applications. In Proceedings of the 6th IEEE Symposium on Field Programmable Custom Computing Machines (FCCM'98), Napa, California, USA, April 15-17, 1998.
19: Muchnick, S. Advanced Compiler Design and Implementation. Morgan Kaufmann Publishers, Inc., San Francisco, California, USA, 1997, ISBN 1-55860-320-4.
20: Olukotun, K. A., Helaihel, R., Levitt, J., and Ramirez, R. A Software-Hardware Cosynthesis Approach to Digital System Simulation. IEEE Micro, vol. 14, Nov. 1994, pp. 48-58.
21: Ouaiss, I., Govindarajan, S., Srinivasan, V., Kaul, M., and Vemuri, R. An Integrated Partitioning and Synthesis System for Dynamically Reconfigurable Multi-FPGA Architectures. In Proceedings of the Reconfigurable Architectures Workshop (RAW'98), Orlando, Florida, USA, March 30, 1998.
22: Radunovic, B., Milutinovic, V. A Survey of Reconfigurable Computing Architectures. To appear as a tutorial in the 8th International Workshop on Field Programmable Logic and Applications (FPL'98), Tallinn, Estonia, 30 August - 2 September, 1998.
23: Razdan, R., Brace, K., and Smith, M. D. PRISC Software Acceleration Techniques. In Proceedings of the IEEE International Conference on Computer Design, Oct. 1994, pp. 145-149.
24: See General Processor Information, http://infopad.eecs.berkeley.edu/CIC/summary/local.
25: Villasenor, J., Mangione-Smith, W. H. Configurable Computing. Scientific American, June 1997, pp. 66-71. See http://www.sciam.com/0697issue/0697villasenor.html
26: Wirthlin, M. J., and Hutchings, B. L. A Dynamic Instruction Set Computer. In Proceedings of the 4th IEEE Symposium on FPGAs for Custom Computing Machines (FCCM'95), Napa Valley, California, USA, April 19-21, 1995, pp. 99-107.
27: Wittig, R. D., and Chow, P. OneChip: An FPGA Processor with Reconfigurable Logic. In Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines, Napa Valley, California, USA, April 17-19, 1996, pp. 126-135.
28: Xilinx Inc. XC6000 Field Programmable Gate Arrays. Version 1.10, April 24, 1997. http://www.xilinx.com.
29: Xilinx, Inc., "The Virtex Series of FPGAs". See .

João M. P. Cardoso (Joao.Cardoso@inesc.pt) is currently a Ph.D. student at the Instituto Superior Técnico in Lisbon, Portugal, working on the compilation of Java programs to custom computing machines. His research interests include system-level synthesis, high-level synthesis, hardware/software co-design, and reconfigurable computing.

Mário P. Véstias (mpv@hobbes.inesc.pt) is currently a Ph.D. student at the Instituto Superior Técnico in Lisbon, Portugal, working on co-synthesis of embedded real-time systems. His research interests include hardware/software co-design, real-time systems, and reconfigurable computing.
