
11.3.10

Reconfigurable Computing Primer

Designers of embedded systems face three significant challenges in today's ultra-competitive marketplace. Products must always do more, cost less, and reach the market faster. Fortunately, new flexible hardware design techniques are emerging from the study of reconfigurable computing.
Although originally proposed in the late 1960s by a researcher at UCLA, reconfigurable computing is a relatively new field of study. The decades-long delay had mostly to do with a lack of acceptable reconfigurable hardware. Reprogrammable logic chips like field programmable gate arrays (FPGAs) have been around for many years, but these chips have only recently reached gate densities making them suitable for high-end applications. (The densest of the current FPGAs have approximately 100,000 reprogrammable logic gates.) With an anticipated doubling of gate densities every 18 months, the situation will only become more favorable from this point forward.
One of our clients has been developing and using reconfigurable computing technologies for almost three years. Their primary product is groundstation equipment for satellite communications. This application involves high-rate communications, signal processing, and a variety of network protocols and data formats. What follows is an introduction to the terminology and techniques we have developed as our experience with reconfigurable computing has grown. I hope that this explanation will help other system designers benefit from the work that's already been done.

Reconfigurable computing

When we talk about reconfigurable computing we’re usually talking about FPGA-based system designs. Unfortunately, that doesn’t qualify the term precisely enough. System designers use FPGAs in many different ways. The most common use of an FPGA is for prototyping the design of an ASIC. In this scenario, the FPGA is present only on the prototype hardware and is replaced by the corresponding ASIC in the final production system. This use of FPGAs has nothing to do with reconfigurable computing.
However, many system designers are choosing to leave the FPGAs as part of the production hardware. Lower FPGA prices and higher gate counts have helped drive this change. Such systems retain the execution speed of dedicated hardware but also have a great deal of functional flexibility. The logic within the FPGA can be changed if or when it is necessary, which has many advantages. For example, hardware bug fixes and upgrades can be administered as easily as their software counterparts. In order to support a new version of a network protocol, you can redesign the internal logic of the FPGA and send the enhancement to the affected customers by email. Once they’ve downloaded the new logic design to the system and restarted it, they’ll be able to use the new version of the protocol. This is configurable computing; reconfigurable computing goes one step further.
Reconfigurable computing involves manipulation of the logic within the FPGA at run-time. In other words, the design of the hardware may change in response to the demands placed upon the system while it is running. Here, the FPGA acts as an execution engine for a variety of different hardware functions — some executing in parallel, others in serial — much as a CPU acts as an execution engine for a variety of software threads. We might even go so far as to call the FPGA a reconfigurable processing unit (RPU).
Reconfigurable computing allows system designers to execute more hardware than they have gates to fit, which works especially well when parts of the hardware are occasionally idle. One theoretical application is a smart cellular phone that supports multiple communication and data protocols, though just one at a time. When the phone passes from a geographic region served by one protocol into a region served by another, the hardware is automatically reconfigured. This is reconfigurable computing at its best, and using this approach it is possible to design systems that do more, cost less, and have shorter design and implementation cycles.
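To make the mechanism concrete, here is a minimal sketch of what the software side of such a protocol swap might look like. Everything in it (the function names, the bitstream table, the protocol identifiers) is a hypothetical illustration, not a real vendor API:

    /* Hypothetical sketch: reload the phone's protocol logic when it
     * crosses into a region served by a different standard. */
    #include <stddef.h>
    #include <stdint.h>

    typedef enum { PROTO_A = 0, PROTO_B, PROTO_COUNT } protocol_t;

    /* Assumed: per-protocol logic designs stored in on-board ROM. */
    extern const uint8_t *proto_bitstream[PROTO_COUNT];
    extern const size_t   proto_bitstream_len[PROTO_COUNT];

    /* Assumed reconfiguration services (not a real vendor API): */
    void fpga_halt(void);
    void fpga_load(const uint8_t *bits, size_t len);
    void fpga_run(void);

    /* Called when the phone detects it has entered a new service region. */
    void on_region_change(protocol_t new_proto)
    {
        fpga_halt();                                /* quiesce the current logic */
        fpga_load(proto_bitstream[new_proto],
                  proto_bitstream_len[new_proto]);  /* reprogram the device      */
        fpga_run();                                 /* resume under new protocol */
    }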

Technical advantages

Reconfigurable computing has several advantages. First, it is possible to achieve greater functionality with a simpler hardware design. Because not all of the logic must be present in the FPGA at all times, the cost of supporting additional features is reduced to the cost of the memory required to store the logic design. Consider again the multiprotocol cellular phone. It would be possible to support as many protocols as could be fit into the available on-board ROM. It is even conceivable that new protocols could be uploaded from a base station to the handheld phone on an as-needed basis, thus requiring no additional memory.
 The second advantage is lower system cost, which does not manifest itself exactly as you might expect. On a low-volume product, there will be some production cost savings, which result from the elimination of the expense of ASIC design and fabrication. However, for higher-volume products, the production cost of fixed hardware may actually be lower. We have to think in terms of lifetime system costs to see the savings. Here, technical obsolescence drives up the cost of systems based on fixed-hardware designs. Systems based on reconfigurable computing are upgradable in the field. Such changes extend the useful life of the system, thus reducing lifetime costs.
The final advantage of reconfigurable computing is reduced time-to-market. The fact that you’re no longer using an ASIC is a big help in this respect. There are no chip design and prototyping cycles, which eliminates a large amount of development effort. In addition, the logic design remains flexible right up until (and even after) the product ships. This allows an incremental design flow, a luxury not typically available to hardware designers. You can even ship a product that meets the minimum requirements and add features after deployment. In the case of a networked product like a set-top box or cellular telephone, it may even be possible to make such enhancements without customer involvement!

Reconfigurable hardware

Traditional FPGAs are configurable, but not run-time reconfigurable. Many of the older FPGAs expect to read their configuration out of a serial EEPROM, one bit at a time, and they can only be made to do so by asserting a chip reset signal. This means that the FPGA must be reprogrammed in its entirety and that its previous internal state cannot be captured beforehand. Though these characteristics are compatible with configurable computing applications, they are not sufficient for reconfigurable computing.
In order to benefit from run-time reconfiguration, the FPGAs involved must have some or all of the following features. The more of these features they have, the more flexible the system design can be.

On-the-fly reprogrammability

Whenever possible, we’d like to avoid resetting the FPGA, mostly because it takes a lot of time. Ideally, we could just stop the clock going to some or all of the chip, change the logic within that region, and restart the clock. That way, less time is wasted on configuration overhead. The more configuration overhead there is, the more likely that system performance will fall unacceptably below that of a fixed-hardware version. Of course, a small performance hit (like stopping the clock) is itself a reasonable trade-off for the added benefits of hardware flexibility.

Partial reprogrammability

Even better would be the ability to leave most of the internal logic in place and change just one part. The Atmel 40K and Xilinx 62xx series FPGAs have such a feature. Any gate or set of gates may be changed without affecting the state of the others. Figure 1 shows how this might be used in practice. It will always be much faster to change a small piece of the logic than the entire FPGA contents.

Externally-visible internal state

If you can see the internal state of the FPGA at any time, then it is also possible to capture that state and save it for later use. For example, the Xilinx 62xx series FPGAs feature a 32-bit data bus called the FastMAP processor interface. This allows the internal state of the FPGA to be read and written just like memory and makes it possible to “swap” hardware designs in much the same way that pages of virtual memory are swapped into and out of physical memory.
Figure 1. Partial reprogrammability allows partial changes
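To see why a memory-like interface matters, consider the following sketch of state capture. The base address, region size, and flat-array view of the state are assumptions made for illustration; they are not the actual FastMAP register map:

    /* Sketch: save/restore FPGA internal state through a FastMAP-style
     * memory-mapped interface. FPGA_STATE_BASE and FPGA_STATE_WORDS are
     * hypothetical values, not taken from the Xilinx documentation. */
    #include <stdint.h>

    #define FPGA_STATE_BASE  ((volatile uint32_t *)0x40000000u)  /* assumed */
    #define FPGA_STATE_WORDS 2048u                               /* assumed */

    void fpga_state_save(uint32_t *buf)
    {
        for (uint32_t i = 0; i < FPGA_STATE_WORDS; i++)
            buf[i] = FPGA_STATE_BASE[i];     /* read internal state like RAM */
    }

    void fpga_state_restore(const uint32_t *buf)
    {
        for (uint32_t i = 0; i < FPGA_STATE_WORDS; i++)
            FPGA_STATE_BASE[i] = buf[i];     /* write it back before resuming */
    }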

Hardware objects

Before going on, we need to define a new term. A hardware object is a functional or logical hardware component that contains its own configuration and state information. In other words, it is a piece of logic that can be executed in an RPU. Hardware objects are position-independent, or relocatable, to allow us to execute the hardware object from any convenient and available position within the chip. To actually take that leap requires a few assumptions.
Figure 2. Relocatable hardware objects are position-independent
First, if we’re going to be working with relocatable logic blocks, it is desirable to add constraints on their size and shape. These constraints limit the number of possible positions within the FPGA and make run-time decision-making more efficient and effective. The actual constraints should be based on the features of a particular FPGA or FPGA family. However, the best constraints require that all hardware objects be rectangular in shape and have edge lengths that are multiples of some unit length (called the hardware page size), which may be any convenient number of gates. For example, page sizes of 4 and 16 gates work very well for the Xilinx 62xx series FPGAs because these parts have additional routing resources at each of those intersections, which makes routing between hardware objects or a hardware object and its I/O pins much easier.
Second, it is desirable to define a standard look and feel for hardware object interfaces. The idea here is to make interobject routing easier by defining standard interfaces between them. This is especially important if routing between objects will be performed on-the-fly, and it also paves the way for greater hardware object re-use. By standardizing the interfaces of all hardware objects, it is possible to maintain libraries of frequently used objects and to quickly build larger designs from these smaller components. In some cases, it may even be possible to purchase third-party hardware objects rather than designing your own.
You may be wondering how you can build a “generic” hardware object that will work in any system. To do that, we need to make one final assumption. Assume that any hardware objects that expect to interface to the world outside the RPU (to a block of memory, the processor, a peripheral, or even another RPU) must do so through an abstraction. This abstraction is called the "hardware object framework", which is a ring of logic that is always present within the RPU and physically located along the outer edges. The framework provides a set of standard interfaces to memory and peripheral devices located outside of the RPU. This ring of logic shrinks the available space for executing hardware objects (see Figure 3), but that is a small price to pay for greater hardware object re-usability and, hence, faster design cycles.
Figure 3. A hardware object framework surrounds the hardware objects
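One way to represent such an object inside the run-time software is sketched below. The field names and page size are illustrative assumptions, not taken from any real tool:

    /* Sketch: descriptor for a relocatable hardware object under the
     * constraints above (rectangular, edge lengths in multiples of the
     * hardware page size). All names are illustrative. */
    #include <stddef.h>
    #include <stdint.h>

    #define HW_PAGE_GATES 16u   /* assumed page size; 4 or 16 suit the 62xx */

    typedef struct {
        uint16_t       width_pages;   /* width in hardware pages             */
        uint16_t       height_pages;  /* height in hardware pages            */
        const uint8_t *config;        /* relocatable configuration data      */
        size_t         config_len;
        uint32_t      *saved_state;   /* captured state if swapped out, else NULL */
        uint8_t        priority;      /* used by the run-time scheduler      */
    } hw_object_t;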

Run-time environments

Due to the dynamic nature of reconfigurable computing, it is sometimes helpful to have software manage the processes of:
  • Deciding which hardware objects to execute and when
  • Swapping hardware objects into and out of the reconfigurable logic
  • Performing routing between hardware objects or between hardware objects and the hardware object framework.
Of course, having software manage the reconfigurable hardware usually means having an embedded processor or microcontroller on-board. (We expect several vendors to introduce single-chip solutions that combine a CPU core and a block of reconfigurable logic by year’s end.) The embedded software that runs there is called the run-time environment and is analogous to the operating system that manages the execution of multiple software threads. Like threads, hardware objects may have priorities, deadlines, contexts, and so on. It is the job of the run-time environment to organize this information and make decisions based upon it.
The reason we need a run-time environment at all is that there are decisions to be made while the system is running. And as human designers, we are not available to make these decisions. So we impart these responsibilities to a piece of software. This allows us to write our application software at a very high level of abstraction. For example, if the application involves manipulation of images in the JPEG format, it would be ideal to have only two blocks of logic: one for JPEG compression, and the other for decompression. Then we could simply hand our input data and the appropriate logic block to the run-time environment and wait for the results. This is equivalent to saying: “Please execute the attached hardware object and let me know when it is done. If there are any results, please let me know as soon as they become available.”
To do this, the run-time environment must first locate space within the RPU that is large enough to execute the given hardware object. It must then perform the necessary routing between the hardware object’s inputs and outputs and the blocks of memory reserved for each data stream. Next, it must stop the appropriate clock, reprogram the internal logic, and restart the RPU. Once the object starts to execute, the run-time environment must continuously monitor the hardware object’s status flags to determine when it is done executing. Once it is done, the caller can be notified and given the results. The run-time environment is then free to reclaim the reconfigurable logic gates that were taken up by that hardware object and to wait for additional requests to arrive from the application software.
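Reduced to code, that sequence might look like the following sketch. It reuses the hypothetical hw_object_t descriptor from the earlier sketch; every other type and rte_*/rpu_* call is an assumed placeholder for the corresponding run-time service:

    /* Sketch of the run-time environment's execution sequence.
     * hw_object_t is the descriptor sketched in the "Hardware objects"
     * section; everything else below is an assumed placeholder. */
    #include <stddef.h>
    #include <stdint.h>

    typedef struct stream stream_t;                  /* opaque data stream  */
    typedef struct { int col, row, w, h; } region_t; /* placed rectangle    */

    int  rte_find_free_region(const hw_object_t *obj, region_t *slot);
    void rte_route(const hw_object_t *obj, const region_t *slot,
                   stream_t *in, stream_t *out);
    void rpu_clock_stop(const region_t *slot);
    void rpu_program(const region_t *slot, const uint8_t *bits, size_t len);
    void rpu_clock_start(const region_t *slot);
    int  rpu_done(const region_t *slot);
    void rte_yield(void);
    void rte_notify_caller(hw_object_t *obj, stream_t *out);
    void rte_free_region(const region_t *slot);

    int rte_execute(hw_object_t *obj, stream_t *in, stream_t *out)
    {
        region_t slot;

        if (rte_find_free_region(obj, &slot) != 0)
            return -1;                    /* not enough free gates right now */

        rte_route(obj, &slot, in, out);   /* wire object I/O to reserved memory */
        rpu_clock_stop(&slot);
        rpu_program(&slot, obj->config, obj->config_len);
        rpu_clock_start(&slot);

        while (!rpu_done(&slot))          /* poll the object's status flags  */
            rte_yield();

        rte_notify_caller(obj, out);      /* hand results back to the caller */
        rte_free_region(&slot);           /* reclaim the gates               */
        return 0;
    }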
By assigning all of these tasks to a special piece of software, we hope to make it possible to develop generic run-time environments. Much as there is a market for commercial operating systems for CPUs, we expect a market for commercial run-time environments for RPUs will emerge if reconfigurable computing becomes popular. That would save system designers even more time by allowing them to purchase a run-time environment rather than design their own. At that point, system design becomes a matter of purchasing or developing the required hardware object libraries, configuring a third-party run-time environment, and writing the application software--a truly efficient system development paradigm.
Internally, our run-time environment can be thought of as a series of three layers (see Figure 4). The device abstraction layer is the lowest level and is responsible for hiding the details of a particular FPGA or FPGA family. This is analogous to the parts of an operating system that must be written in assembly language because they are processor-specific. The device abstraction layer can answer the following questions about the hardware: How many FPGAs are present? What types are they? What is the hardware page size? What are their dimensions (height and width as multiples of the hardware page size)? What routing resources are available at the edge of each hardware page? The device abstraction layer also provides a simple read/write interface for the layer above.
Figure 4. A run-time environment contains three layers
The middle layer is responsible for placement and routing of hardware objects. It maintains a logical representation of the free space within the RPU and decides where each object will be physically located within the device. It is also responsible for adding routing between hardware objects or between one hardware object and the hardware object framework. This is the most complicated layer of the three.
The uppermost layer is called the object scheduler. It provides an application programming interface (API) that makes using the RPUs easy for the application programmer and is responsible for deciding which hardware objects are currently running. This decision may be based on any convenient scheduling algorithm. For example, first-come first-serve, round-robin, and priority-based schemes are reasonable choices. But in order to implement the latter pair, it would be necessary to first implement hardware object swapping. Hardware object swapping involves saving the current state of a running piece of logic and later restoring it, and is only possible in systems that employ FPGAs with externally visible internal states.

Looking forward

The principal benefits of reconfigurable computing are the ability to execute larger hardware designs with fewer gates and to realize the flexibility of a software-based solution while retaining the execution speed of a more traditional, hardware-based approach. This makes doing more with less a reality.
In our own business we have seen tremendous cost savings, simply because our systems do not become obsolete as quickly as our competitors’. This has even led us to use the marketing slogan “Obsolescence is Obsolete,” because reconfigurable computing enables the addition of new features in the field, allows rapid implementation of new standards and protocols on an as-needed basis, and protects our customers’ investment in computing hardware.
Whether you do it for your customers or for yourselves, you should at least consider using reconfigurable computing in your next design. You may find, as we have, that the benefits far exceed the initial learning curve. And as reconfigurable computing becomes more popular, these benefits will only increase. The idea of buying third-party hardware objects and run-time environments and simply combining them in new and interesting ways to create a product is certainly forward-looking, but it may not be that far over the horizon.

C-to-Verilog.com provides C to HDL as a service

If you follow this blog, then you've read my ramblings on EDA SaaS. An interesting new advance in this area is http://www.c-to-verilog.com. This website lets you compile C code into synthesizable Verilog modules.
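For context, the input to such tools is plain, loop-oriented C like the sketch below. This particular function is my own illustration, not an example from the site:

    /* Illustrative input for a C-to-HDL tool: a fixed-bound MAC loop of
     * the kind that maps naturally onto a pipelined Verilog datapath. */
    int dot_product(const int a[64], const int b[64])
    {
        int acc = 0;
        for (int i = 0; i < 64; i++)
            acc += a[i] * b[i];   /* one multiply-accumulate per iteration */
        return acc;
    }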

I've also discovered a few other websites related to EDA, SaaS, and "Cloud Computing." Harry the ASIC guy's blog covers the burgeoning EDA SaaS market and Xuropa is creating an online community for EDA users and developers. Here's a two-part EETimes piece which talks about EDA SaaS.

I already use a remote desktop connection to run most of my EDA jobs remotely. I've argued that there is a "generation gap" between SaaS and traditional software license markets. The people who coded in university basements when they were 13 in the '60s and '70s invented the software licensing industry. Now, the people who herded botnets when they were 13 are graduating with computer science degrees. The nu-hacker distributes code updates to all his users immediately, without forcing anyone to wait through an install.

HINDAWI - International Journal of Reconfigurable Computing

Table of Contents

  1. Flexible Interconnection Network for Dynamically and Partially Reconfigurable Architectures, Ludovic Devaux, Sana Ben Sassi, Sebastien Pillement, Daniel Chillet, and Didier Demigny
    Volume 2010 (2010), Article ID 390545, 15 pages
  2. Concurrent Calculations on Reconfigurable Logic Devices Applied to the Analysis of Video Images, Sergio R. Geninatti, José Ignacio Benavides Benítez, Manuel Hernández Calviño, and Nicolás Guil Mata
    Volume 2010 (2010), Article ID 962057, 8 pages
  3. Timing-Driven Nonuniform Depopulation-Based Clustering, Hanyu Liu and Ali Akoglu
    Volume 2010 (2010), Article ID 158602, 11 pages
  4. High-Speed FPGA 10's Complement Adders-Subtractors, G. Bioul, M. Vazquez, J. P. Deschamps, and G. Sutter
    Volume 2010 (2010), Article ID 219764, 14 pages
  5. Power Characterisation for Fine-Grain Reconfigurable Fabrics, Tobias Becker, Peter Jamieson, Wayne Luk, Peter Y. K. Cheung, and Tero Rissa
    Volume 2010 (2010), Article ID 787405, 9 pages
  6. Multiloop Parallelisation Using Unrolling and Fission, Yuet Ming Lam, José Gabriel F. Coutinho, Chun Hok Ho, Philip Heng Wai Leong, and Wayne Luk
    Volume 2010 (2010), Article ID 475620, 10 pages
  7. Reaction Diffusion and Chemotaxis for Decentralized Gathering on FPGAs, Bernard Girau, César Torres-Huitzil, Nikolaos Vlassopoulos, and José Hugo Barrón-Zambrano
    Volume 2009 (2009), Article ID 639249, 15 pages
  8. Speeding Up FPGA Placement via Partitioning and Multithreading, Cristinel Ababei
    Volume 2009 (2009), Article ID 514754, 9 pages
  9. Hardware Accelerated Sequence Alignment with Traceback, Scott Lloyd and Quinn O. Snell
    Volume 2009 (2009), Article ID 762362, 10 pages
  10. Selected Papers from ReCoSoC 2008, Michael Hübner, J. Manuel Moreno, Gilles Sassatelli, and Peter Zipf
    Volume 2009 (2009), Article ID 894059, 2 pages
  11. Parallel Processor for 3D Recovery from Optical Flow, Jose Hugo Barron-Zambrano, Fernando Martin del Campo-Ramirez, and Miguel Arias-Estrada
    Volume 2009 (2009), Article ID 973475, 11 pages
  12. Experiencing a Problem-Based Learning Approach for Teaching Reconfigurable Architecture Design, Erwan Fabiani
    Volume 2009 (2009), Article ID 923415, 11 pages
  13. An ILP Formulation for the Task Graph Scheduling Problem Tailored to Bi-Dimensional Reconfigurable Architectures, F. Redaelli, M. D. Santambrogio, and S. Ogrenci Memik
    Volume 2009 (2009), Article ID 541067, 12 pages
  14. Architectural Synthesis of Fixed-Point DSP Datapaths Using FPGAs, Gabriel Caffarena, Juan A. López, Gerardo Leyva, Carlos Carreras, and Octavio Nieto-Taladriz
    Volume 2009 (2009), Article ID 703267, 14 pages
  15. An Automatic Design Flow for Data Parallel and Pipelined Signal Processing Applications on Embedded Multiprocessor with NoC: Application to Cryptography, Xinyu Li and Omar Hammami
    Volume 2009 (2009), Article ID 631490, 14 pages
  16. Reducing Reconfiguration Overheads in Heterogeneous Multicore RSoCs with Predictive Configuration Management, Stéphane Chevobbe and Stéphane Guyetant
    Volume 2009 (2009), Article ID 390167, 7 pages
  17. Answer Set versus Integer Linear Programming for Automatic Synthesis of Multiprocessor Systems from Real-Time Parallel Programs, Harold Ishebabi, Philipp Mahr, Christophe Bobda, Martin Gebser, and Torsten Schaub
    Volume 2009 (2009), Article ID 863630, 11 pages
  18. An Interface for a Decentralized 2D Reconfiguration on Xilinx Virtex-FPGAs for Organic Computing, Christian Schuck, Bastian Haetzer, and Jürgen Becker
    Volume 2009 (2009), Article ID 273791, 11 pages
  19. A Hardware Filesystem Implementation with Multidisk Support, Ashwin A. Mendon, Andrew G. Schmidt, and Ron Sass
    Volume 2009 (2009), Article ID 572860, 13 pages
  20. Enabling Self-Organization in Embedded Systems with Reconfigurable Hardware, Christophe Bobda, Kevin Cheng, Felix Mühlbauer, Klaus Drechsler, Jan Schulte, Dominik Murr, and Camel Tanougast
    Volume 2009 (2009), Article ID 161458, 9 pages

  21. FPGA Interconnect Topologies Exploration, Zied Marrakchi, Hayder Mrabet, Umer Farooq, and Habib Mehrez
    Volume 2009 (2009), Article ID 259837, 13 pages
  22. Non-Power-of-Two FFTs: Exploring the Flexibility of the Montium TP, Marcel D. van de Burgwal, Pascal T. Wolkotte, and Gerard J. M. Smit
    Volume 2009 (2009), Article ID 678045, 12 pages
  23. Pipeline FFT Architectures Optimized for FPGAs, Bin Zhou, Yingning Peng, and David Hwang
    Volume 2009 (2009), Article ID 219140, 9 pages
  24. A Reconfigurable Systolic Array Architecture for Multicarrier Wireless and Multirate Applications, H. Ho, V. Szwarc, and T. Kwasniewski
    Volume 2009 (2009), Article ID 529512, 14 pages
  25. Analysis and Enhancement of Random Number Generator in FPGA Based on Oscillator Rings, Knut Wold and Chik How Tan
    Volume 2009 (2009), Article ID 501672, 8 pages
  26. Software Toolchain for Large-Scale RE-NFA Construction on FPGA, Yi-Hua E. Yang and Viktor K. Prasanna
    Volume 2009 (2009), Article ID 301512, 10 pages
  27. A Message-Passing Hardware/Software Cosimulation Environment for Reconfigurable Computing Systems, Manuel Saldaña, Emanuel Ramalho, and Paul Chow
    Volume 2009 (2009), Article ID 376232, 9 pages
  28. Providing Memory Management Abstraction for Self-Reconfigurable Video Processing Platforms, Kurt Franz Ackermann, Burghard Hoffmann, Leandro Soares Indrusiak, and Manfred Glesner
    Volume 2009 (2009), Article ID 851613, 15 pages
  29. Analysis and Design of a Context Adaptable SAD/MSE Architecture, Arvind Sudarsanam, Aravind Dasu, and Karthik Vaithianathan
    Volume 2009 (2009), Article ID 789592, 21 pages
  30. A Decentralised Task Mapping Approach for Homogeneous Multiprocessor Network-On-Chips, Peter Zipf, Gilles Sassatelli, Nurten Utlu, Nicolas Saint-Jean, Pascal Benoit, and Manfred Glesner
    Volume 2009 (2009), Article ID 453970, 14 pages

  31. A System on a Programmable Chip Architecture for Data-Dependent Superimposed Training Channel Estimation, Fernando Martín del Campo, René Cumplido, Roberto Perez-Andrade, and A. G. Orozco-Lugo
    Volume 2009 (2009), Article ID 912301, 10 pages
  32. A Taxonomy of Reconfigurable Single-/Multiprocessor Systems-on-Chip, Diana Göhringer, Thomas Perschke, Michael Hübner, and Jürgen Becker
    Volume 2009 (2009), Article ID 395018, 11 pages
  33. An Adaptive Message Passing MPSoC Framework, Gabriel Marchesan Almeida, Gilles Sassatelli, Pascal Benoit, Nicolas Saint-Jean, Sameer Varyani, Lionel Torres, and Michel Robert
    Volume 2009 (2009), Article ID 242981, 20 pages
  34. A Design Technique for Adapting Number and Boundaries of Reconfigurable Modules at Runtime, Thilo Pionteck, Roman Koch, Carsten Albrecht, and Erik Maehle
    Volume 2009 (2009), Article ID 942930, 10 pages
  35. Multilevel Simulation of Heterogeneous Reconfigurable Platforms, Damien Picard and Loic Lagadec
    Volume 2009 (2009), Article ID 162416, 12 pages
  36. vMAGIC—Automatic Code Generation for VHDL, Christopher Pohl, Carlos Paiz, and Mario Porrmann
    Volume 2009 (2009), Article ID 205149, 9 pages
  37. A Reconfigurable and Biologically Inspired Paradigm for Computation Using Network-On-Chip and Spiking Neural Networks, Jim Harkin, Fearghal Morgan, Liam McDaid, Steve Hall, Brian McGinley, and Seamus Cawley
    Volume 2009 (2009), Article ID 908740, 13 pages
  38. High level modeling of Dynamic Reconfigurable FPGAs, Imran Rafiq Quadri, Samy Meftali, and Jean-Luc Dekeyser
    Volume 2009 (2009), Article ID 408605, 15 pages
  39. Efficient Scheme for Implementing Large Size Signed Multipliers Using Multigranular Embedded DSP Blocks in FPGAs, Shuli Gao, Dhamin Al-Khalili, and Noureddine Chabini
    Volume 2009 (2009), Article ID 145130, 11 pages
  40. Current Trends on Reconfigurable Computing, Jürgen Becker, Michael Hübner, Roger Woods, Philip Leong, Robert Esser, and Lionel Torres
    Volume 2008 (2008), Article ID 918525, 1 page

  41. Selected Papers from SPL 2008: Programmable Logic and Applications, Gustavo Sutter and Richard Katz
    Volume 2008 (2008), Article ID 921921, 2 pages
  42. Neuromorphic Configurable Architecture for Robust Motion Estimation, Guillermo Botella, Manuel Rodríguez, Antonio García, and Eduardo Ros
    Volume 2008 (2008), Article ID 428265, 9 pages
  43. The Coarse-Grained/Fine-Grained Logic Interface in FPGAs with Embedded Floating-Point Arithmetic Units, Chi Wai Yu, Julien Lamoureux, Steven J. E. Wilton, Philip H. W. Leong, and Wayne Luk
    Volume 2008 (2008), Article ID 736203, 10 pages
  44. Multiobjective Optimization for Reconfigurable Implementation of Medical Image Registration, Omkar Dandekar, William Plishker, Shuvra S. Bhattacharyya, and Raj Shekhar
    Volume 2008 (2008), Article ID 738174, 17 pages
  45. Burst-Mode Asynchronous Controllers on FPGA, Duarte L. Oliveira, Marius Strum, and Sandro S. Sato
    Volume 2008 (2008), Article ID 926851, 10 pages
  46. Architecture-Level Exploration of Alternative Interconnection Schemes Targeting 3D FPGAs: A Software-Supported Methodology, Kostas Siozios, Alexandros Bartzas, and Dimitrios Soudris
    Volume 2008 (2008), Article ID 764942, 18 pages
  47. Design of a Mathematical Unit in FPGA for the Implementation of the Control of a Magnetic Levitation System, Juan José Raygoza-Panduro, Susana Ortega-Cisneros, Jorge Rivera, and Alberto de la Mora
    Volume 2008 (2008), Article ID 634306, 9 pages
  48. On the Use of Magnetic RAMs in Field-Programmable Gate Arrays, Y. Guillemenet, L. Torres, G. Sassatelli, and N. Bruchon
    Volume 2008 (2008), Article ID 723950, 9 pages
  49. Area Optimisation for Field-Programmable Gate Arrays in SystemC Hardware Compilation, Johan Ditmar, Steve McKeever, and Alex Wilson
    Volume 2008 (2008), Article ID 674340, 14 pages
  50. An Embedded Reconfigurable IP Core with Variable Grain Logic Cell Architecture, Motoki Amagasaki, Ryoichi Yamaguchi, Masahiro Koga, Masahiro Iida, and Toshinori Sueyoshi
    Volume 2008 (2008), Article ID 180216, 14 pages

  51. Dynamic Hardware Development, Stephen Craven and Peter Athanas
    Volume 2008 (2008), Article ID 901328, 10 pages
  52. SystemC Transaction-Level Modeling of an MPSoC Platform Based on an Open Source ISS by Using Interprocess Communication, Sami Boukhechem and El-Bay Bourennane
    Volume 2008 (2008), Article ID 902653, 10 pages
  53. A Game-Theoretic Approach for Run-Time Distributed Optimization on MP-SoC, Diego Puschini, Fabien Clermidy, Pascal Benoit, Gilles Sassatelli, and Lionel Torres
    Volume 2008 (2008), Article ID 403086, 11 pages
  54. FPGA-Based Embedded Motion Estimation Sensor, Zhaoyi Wei, Dah-Jye Lee, Brent E. Nelson, James K. Archibald, and Barrett B. Edwards
    Volume 2008 (2008), Article ID 636145, 8 pages
  55. On the Power Dissipation of Embedded Memory Blocks Used to Implement Logic in Field-Programmable Gate Arrays, Scott Y. L. Chin, Clarence S. P. Lee, and Steven J. E. Wilton
    Volume 2008 (2008), Article ID 751863, 13 pages

Architectures and Compilers to Support Reconfigurable Computing

by João M. P. Cardoso and Mário P. Véstias

Introduction

The main characteristic of Reconfigurable Computing (RC) is the presence of hardware that can be reconfigured (reconfigware - RW) to implement functionality that is better suited to specially tailored hardware than to execution on a simple uniprocessor. RC systems join microprocessors and programmable hardware in order to take advantage of the combined strengths of hardware and software [20, 5] and have been used in applications ranging from embedded systems to high-performance computing. Many of the fundamental theories have been identified and used by the Hardware/Software Co-Design research field [16]. Although the same background ideas are shared in both areas, they have different goals and use different approaches. Although the basic concept was proposed in the 1960s, RC has only recently become feasible, thanks to the availability of high-density reconfigurable devices. These devices offer the hardware characteristics of Application Specific Integrated Circuits (ASICs) with much more flexibility.
During the past five years a large number of RC systems developed by the research community have demonstrated the potential for achieving high performance for a range of applications. However, the performance improvements possible with these systems typically depend on the skill and experience of hardware designers. This method of RW programming cannot fully exploit the increasing density of reconfigurable devices. Hence, a current challenge in this area is the establishment of an efficient RW compiler that would help the designer accomplish adequate performance improvements without becoming involved in complex low-level manipulations. Although mature design tools exist for logic and layout synthesis for programmable devices, High-Level Synthesis (HLS) and multi-unit partitioning (both spatial and temporal) need further development.

Reconfigurable computing: Why and How?

Usually the target RC system is based on multiple SRAM-based Field-Programmable Gate Arrays (SRAM-FPGAs), which act as Reconfigurable Processing Units (RPUs), coupled to a personal computer. The coupling of dedicated hardware to a host computer (as shown in Figure 1) in order to accelerate computationally intensive tasks is not a new concept. A common example in the PC industry is the use of graphics boards to accelerate graphical transformations. However, such hardwired application-specific accelerator circuits require great design effort and typically use more silicon area than the microprocessor itself.
Figure 1. A hardware device added to an existing architecture for acceleration purposes.
The use of reconfigurable hardware makes it possible to couple the host computer with more flexible hardware resources, better adapted to the application currently executing, and with the possibility of adapting to new versions of an application. Moreover, the dynamic reconfiguration capability allows for time-sharing of different tasks, which may significantly reduce the required silicon area. Transistor density is also growing faster in these devices than in general-purpose processors, as can be seen in Figure 2. The newest FPGAs can support circuits with about 500,000 equivalent gates in a single device, and FPGAs with one million gates (such as the Xilinx Virtex(tm) series [29]) have been announced.
With the introduction of FPGAs with faster reconfiguration times and partial reconfiguration support, it is possible to use FPGAs in a dynamically reconfigurable environment. This technology makes possible the concept of unlimited hardware or "virtual hardware".
Figure 2. The growth of transistor density in processors [24] and FPGA devices.
 
Virtual Hardware
    One of the major limitations of RC is that only the portion of the application that fits within the total size of the available RPUs can be migrated to hardware. Recently, a concept similar to virtual memory was adapted to allow the use of multiple configuration RAM sets stored in an RPU. Hardware units that exceed an RPU's available resources can be mapped onto it by means of temporal partitioning. This new concept is addressed in [21, 9]. However, since temporal partitioning techniques are not yet mature, their practical implications for hardware acceleration have not been established, and the algorithms proposed so far are not suitable for the acceleration of programs in an RC environment. There is more research effort on the concept of virtual hardware in the field of rapid prototyping of large hardware circuits. Researchers in this field do not consider memory accesses, because the input is a pure hardware description. This means that special effort must be made to partition and schedule the large graphs obtained from high-level programming languages. The virtual hardware concept is implemented by time-sharing a given RPU. It needs a scheduler that is responsible for the configurations, the execution, and the communications between temporal partitions.
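A minimal sketch of that scheduler's inner loop is shown below, assuming a hypothetical partition descriptor and rpu_* services (none of these names come from the cited work):

    /* Sketch: virtual hardware by time-sharing one RPU. Each temporal
     * partition is configured and executed in turn, with intermediate
     * results carried in a shared buffer. All names are assumed. */
    #include <stddef.h>
    #include <stdint.h>

    typedef struct {
        const uint8_t *bitstream;   /* configuration for this partition */
        size_t         len;
    } partition_t;

    typedef struct buffer buffer_t; /* opaque intermediate-data buffer  */

    void rpu_configure(const uint8_t *bits, size_t len);
    void rpu_write_inputs(const buffer_t *data);
    void rpu_execute(void);
    void rpu_read_outputs(buffer_t *data);

    void run_temporal_partitions(const partition_t parts[], int n, buffer_t *data)
    {
        for (int i = 0; i < n; i++) {
            rpu_configure(parts[i].bitstream, parts[i].len);
            rpu_write_inputs(data);    /* pass values from the previous stage */
            rpu_execute();             /* run until this partition is done    */
            rpu_read_outputs(data);    /* collect results for the next stage  */
        }
    }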
RC systems and parallel computing systems have the same objective: to speed up the execution of a given application. In parallel computing, acceleration is achieved by exploiting the parallelism in a program and mapping it onto an architecture of multiple processors; in RC it is achieved by migrating the most computationally intensive parts of the program to RPUs. With RC, the parallelism of a given application is exploited through the concurrent execution model of hardware, instead of the sequential von Neumann model of computation used by traditional parallel computing.
Scientific journals devoted to general research have already highlighted this field [25], and during the last few years numerous approaches to RC have appeared in special topical conferences [1, 2, 3]. However, although new architectures have been proposed and new concepts presented, the utilization of these systems has been reported only in academic research. As discussed in [22], for RC systems to become widespread, compilers must support a methodology that automatically exploits the RC target architecture. Until effective support for these systems is available, software programmers will not be attracted to RC, because implementing the hardware parts of an application is time-consuming and requires hardware specialists.
The development of applications in an RC environment requires the use of one of the two following methodologies:
  • the use of hardware objects (cores) bought from specific vendors or developed by hardware designers, with a specific interface to the software language used. This can be suitable for developing applications that need standard functionality such as FFT, JPEG/MPEG encoders/decoders, etc.
  • the use of a tool to compile from a high-level software programming language to the object code to run on a processor and the configuration files to program the RPUs. State-of-the-art techniques from the compiler and design automation worlds must be combined to achieve performance improvements. In particular, special care must be devoted to loop transformations, such as the exploitation of partial and total loop unrolling, to automatically achieve better acceleration factors. A hardware library of operators commonly used by high-level software programming languages must allow the specialized synthesis techniques to take advantage of the partial and on-the-fly reconfiguration characteristics of the target hardware.
The integration of HLS [10] techniques in the methodology targeting RC systems is the best-known approach. It permits the rapid generation of hardware through the use of a library of basic hardware units. However, traditional HLS techniques, which target ASIC implementations, must be redefined, because the target technology does not have the layout freedom of ASICs and the objective of accelerating the given application in the RC environment must prevail. The time spent in HLS, logic synthesis, placement, and routing is unacceptable in an RC environment. To be adopted as a new computing paradigm, it is necessary to decrease the overall compilation time, to provide efficient support for virtual hardware, and to support RPUs tightly coupled to the main processor.

RC Architectures Design Space

The type of interconnection between the RPU and the host system and the level of granularity of the RPU constitute a wide design space to be explored. Many points of this design space have already been explored and many more are left for further research. However, it is still not possible to identify a dominant solution for all kinds of applications. The next two sections will review some of the architectures used in RC systems.

Coupling RPUs To The Host

The type of coupling of the RPUs to the existing computing system has a big impact on the communication cost. The coupling can be classified into one of the four groups listed below, which are presented in order of decreasing communication costs:
  • RPUs coupled to the I/O bus of the host (Figure 3.a). This group includes many commercial circuit boards. Some of them are connected to the PCI (Peripheral Component Interconnect) bus of a PC or workstation.
  • RPUs coupled to the local bus of the host (Figure 3.b)
  • RPUs coupled like co-processors (such as the REMARC - Reconfigurable Multimedia Array Coprocessor - [18]) (Figure 3.c)
  • RPUs acting like an extended data-path of the processor (such as the OneChip [27], the PRISC - Programmable Reduced Instruction Set Computer - [23], and the Chimaera [13]) (Figure 3.d)
 
a) RPU coupled to the I/O system bus
b) RPU coupled to the local bus
c) RPU coupled to the CPU
d) RPU integrated in the processor chip
Figure 3. Organization of RC systems with respect to the coupling of the RPU to the host computer

RPUs Granularity

Most current research efforts use FPGAs and directly manipulate their low-level processing elements. An FPGA cell typically consists of a flip-flop and a function generator that implements a Boolean function of up to 4 variables. These elements can be used to implement nearly any digital circuit. This fine-grain parallelism has been shown to be inefficient in time and area for certain classes of problems. Some ongoing research uses Arithmetic-Logic Units (ALUs) as the basic hardware primitives of the RC system. In some cases this approach can provide more efficient solutions, but it somewhat limits the flexibility of the system. Approaches using an abstraction layer of coarse granularity (the operand level) over the physical fine-grain granularity of an FPGA have more flexibility, but the compilation time is longer. This abstraction level is provided through a library of uni-functional Relatively Placed Macros (RPMs) with direct correspondence to the operators of high-level software programming languages (arithmetic, logical, and memory access). This library can have more than one implementation of the same operator (with different trade-offs of area, latency, and configuration time). Furthermore, the use of such a library decreases the overall RW compilation time, makes more accurate estimations possible, and is already a proven concept in HLS systems.
Various architecture concepts have been considered for use in RPUs, with differences in the flexibility and level of granularity used:
  • The use of dynamic processor cores in the FPGA, such as the DISC (Dynamic Instruction Set Computer), which permits the dynamic reconfiguration of dedicated hardware units at runtime whenever an instruction appears whose corresponding unit has not yet been configured [26]. A more flexible solution is the use of a processor with a single MOVE instruction, such as the URISC (Ultimate RISC), as proposed in [7].
  • The utilization of RPUs with medium granularity, like the Reconfigurable Data Path Array (RDPA) [11]: These RPUs are designated Field-Programmable ALU Arrays (FPAAs) and have ALUs as reconfigurable blocks (see Figure 4). The 32-bit ALUs can be reconfigured to execute some of the operators of the high-level language C. A framework for dealing with this type of RPU, CoDe-X, is presented in [6]; its objective is to automatically map the most suitable portions of a C program to the RDPAs. Results achieved with this approach are shown in [11]: speed-ups (compared to the execution of the programs without RW support) of 13 for a JPEG compression algorithm and 70 for a two-dimensional FIR (Finite Impulse Response) filter are reported.
Figure 4. The matrix of reconfigurable datapaths of the Xputer system.
  • A coarse-grain approach such as the MATRIX project [17], which is based on an array of identical 8-bit functional units and a reconfigurable network: Each functional unit contains memory, an ALU, a multiplier unit, and control logic.
With RPUs that permit partial reconfiguration, it is possible to reconfigure regions of the RPU while others are executing. These types of RPUs have many advantages over traditional FPGAs. One FPGA especially suitable for RC is the Xilinx(tm) XC6200 series [28]. It has very good properties, such as a feature that allows the host processor to access the internal registers of the FPGA by address (like a typical memory access) without the need for any special routing or wasted cells. It has a symmetrical internal structure, which allows units to be mapped independently of position, and it can be configured 1,000 times faster than traditional internal structures. A reconfiguration scheme based on layers of configuration selected by a context switch is addressed in [12].

Compilation Techniques Suitable for RC

As detailed in [15], to develop an effective RW compiler it is necessary to understand the state of the art in two research fields: compilers and ECAD (Electronic Computer-Aided Design). A number of data-flow techniques used by software compilers [4, 19] to generate object code can also be used to generate hardware images with faster execution: operator strength reduction, dead code elimination, common sub-expression elimination, constant folding (constant propagation), variable propagation (copy propagation), tree-height reduction, memory-access reduction (e.g., the elimination of redundant memory accesses across loop iterations), function inlining, etc.


Example 1:
    Operator strength reduction can be applied to division and multiplication by constants. The simple cases occur when an operand is multiplied or divided by a constant that is a power of two (accomplished by simply shifting the non-constant operand). In the general case, multiplication by a constant can be transformed into a series of shifts and additions/subtractions (Figure 5). This produces very good results because shifters and adders/subtracters are smaller and faster than multipliers when implemented in the RPU.
Figure 5. Operation strength reduction of the example: A = 3 x B (where A and B are integers).
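In C, the transformation of Figure 5 amounts to the following sketch (unsigned operands are used so the shift is well defined):

    /* Strength reduction of A = 3 * B: a shift and an add replace the
     * multiplier, which is much smaller and faster when mapped to an RPU. */
    unsigned times3(unsigned b)
    {
        return (b << 1) + b;   /* 2*B + B = 3*B */
    }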


These techniques produce better results when they are used with some type of loop transformation technique because of the repetitive nature of these structures. At the loop level, many transformations are used by software compiler designers [19]; they include strip mining or loop tiling, loop fusion, loop splitting or loop distribution, loop interchanging, loop permutation, loop reversal, loop skewing, loop peeling, and loop unrolling or loop expansion.
The loop unrolling of deterministic (statically bound) cyclic structures can achieve a large gain in performance, because the loop control is eliminated and the data-flow techniques above become applicable. However, loop unrolling, when mapped to hardware, can consume an unacceptable hardware area. Thus, it is important to exploit the loop unrolling factor together with the temporal partitioning of large unrolled loops to make mapping onto a single RPU or multiple RPUs possible.
The exploitation of potential parallelism and of scheduling algorithms for loops unrolled with different unrolling factors has not been analyzed by any author. Furthermore, loop regions may contain many memory accesses, which constrain the scheduling of operations because the system architecture usually has only single-port memories. Thus, the optimization of memory accesses can have a great impact on the achieved performance.



Example 2:
    Figure 6.a) shows a simple example of a loop that sums 4 elements of an array. Figure 6.b) shows the code after the loop has been unrolled. Figure 7 shows the scheduling of the Data Flow Graph (DFG) related to the unrolled loop. In this case, applying the tree-height transformation to the DFG before scheduling produces worse results than scheduling directly over the DFG. This shows that compiler transformations must be applied with care to improve the obtained results, and that most timing estimates must be technology-driven.
    int Sum = 0;
    for (int i = 0; i < 4; i++) {
        Sum += a[i];
    }

    a)

    int Sum;
    Sum = a[0] + a[1] + a[2] + a[3];

    b)

    Figure 6. A loop unrolling example which sums four array elements:
    a) Initial code b) Final code
 
Figure 7. Scheduling operations from the unrolled loop under memory access constraints.
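The two DFG shapes in question can be written out as below. The scheduling commentary in the comments is one plausible reading of the example, not taken from the original figure: with a single memory port the four loads are serialized, so the balanced tree gains no latency over the chain yet requires a second adder.

    /* The two DFG shapes for Sum = a[0]+a[1]+a[2]+a[3]. With a one-port
     * memory the loads arrive one per cycle either way, so the balanced
     * tree cannot start its second add any earlier than the chain, yet it
     * needs two adders in parallel; the chain reuses a single adder. */
    int sum4_chain(const int a[4])   /* ((a0+a1)+a2)+a3 */
    {
        return ((a[0] + a[1]) + a[2]) + a[3];
    }

    int sum4_tree(const int a[4])    /* (a0+a1)+(a2+a3) */
    {
        return (a[0] + a[1]) + (a[2] + a[3]);
    }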
 


Reconfigware/Software Compilation

The compiler for RC systems takes as input the software program of a given application and produces two images: the executable model for the CPU and the configurations for the RPUs, as shown in the compilation flow of Figure 8.
Figure 8. Compilation flow of a software program to run on a RC system.
The intermediate format can be obtained by front-end compilers of high-level software programming languages. These graph representations can also be obtained, for example, from the intermediate model used by the GNU C/C++ compiler. The compiler must be an integrated tool suitable to the programming of RC systems.
The compiler must generate a structural hardware representation (such as VHDL-RTL) that represents the connections between units contained in a library of RPMs, with direct correspondence to the operators of high-level programming languages (such as Java(tm)). Unlike generic HLS techniques, allocation for partial hardware reconfiguration requires that different digital blocks have identical "shapes" and similar interfaces in order to enable temporal sharing. Therefore, the representation of the hardware operators will include specific attributes to specify the relative locations and shapes of the corresponding hardware, which will be taken into account by the placement and routing tool. Moreover, the system will generate a description of the control unit of the runtime scheduler of reconfigurations for the same RPU. By using temporal partitioning, the host computer will view the set of RPUs as an unlimited hardware resource.
 
The GALADRIEL Compiler 
 
    Members of our research group are working on a front-end compiler called GALADRIEL [8], which analyzes the Java(tm) class files produced by the Java compiler and processes the information in order to exploit the implicit parallelism in each method, so that it can be efficiently implemented by multiple hardware and/or software components. After the programmer has identified the most critical functions in the program (with the help of profiling, for example), the compiler will find the code segments of the functions selected to migrate to the RPUs. The compiler will be able to evaluate different levels of granularity of the code specification, from the basic block (as defined in [4]) to the overall function, so that the programmer does not need to reprogram the application to make it suitable for the compiler. Bytecodes are produced for the software image. These bytecodes include calls to the communication library implemented in C/C++. Each hardware image related to a configuration is described in VHDL, which defines the units used from the library set, the interconnections between them, and, when scheduling of operations is necessary, the control part (for sharing resources such as operators or memory-access units). These descriptions are then passed to a VHDL-based logic synthesis tool (such as the Synopsys Design Compiler(tm)), which has access to the unit library for the FPGA used. The synthesis tool translates each hardware description into a list of primitive cells supported by the FPGA technology library, in a format accepted by the placement and routing tools of the given FPGA family.
  Generic HLS systems [10] that target ASIC implementations have been shown to be inadequate for RC systems. Technology-driven architectural synthesis algorithms must be developed with the following specific objectives in mind:
  • The generation of hardware images with the lowest execution time possible by exploiting the existence of the potential parallelism of the given application.
  • Temporal partitioning of the graph segment that corresponds to hardware when there is not sufficient available area. This case implies the introduction of communication between partitions and the control of the scheduling of different configurations, as well as partial executions.
  • The control of Direct Memory Access (DMA) on each partition. Since memory accesses are quantitatively of great importance in computer systems [14], alternative solutions for mapping arrays onto the host computer's memory or onto local memory (on the RPU board) must be evaluated.
  • Exploitation of loop unrolling techniques targeting theoretically unlimited hardware.

Conclusion

One of the difficulties of research in this field is that each author uses his own examples, often publicly unavailable, so results from different implementations are impossible to compare. The majority of work starts from programming languages with little or no real-world utilization and frequently uses hardware description languages, which prevents software programmers from using the compiler tools. Since RC systems are still partially handcrafted, it is very important to develop a Reconfigware/Software compiler tool that accepts as input a software program described at a high level of abstraction. To make this possible it is necessary to close the gap between the software and hardware models with advanced compiler transformations.
A significant speed-up can be achieved only by using a compiler capable of extracting and exploiting the massive amounts of parallelism existent in the input application.
We have described briefly some types of architectures to support the RC concept. However, adopting a particular RPU granularity or abstraction level is not trivial because of the lack of comparable implementation results for a set of applications.

Acknowledgements

We would like to acknowledge our research advisor Prof. Horácio Neto for his guidance and assistance, and Virginia Chu, who helped us to improve the English in this paper. We would also like to acknowledge the support given by: the Portuguese Ph.D. program of the Prodep 5.2 action, and the Portuguese Ph.D. program of the Praxis XXI.

Glossary

Reconfigurable Computing
A kind of computing based on custom computing machines, which combine high-density FPGAs with processors to achieve the best of both worlds. Definition by Dinesh Bhatia: "An ability for software to reach through the hardware layers and change the data-path for optimising the performance".
Configuration
Setting up the logic in the cells and the routing between them.
Reconfigurable Devices
Logic devices that can be customised as many times as needed, in contrast to configurable devices, which can be customised only once.
Dynamic Reconfiguration
The concept of customisation on-the-fly.
High-Level Synthesis (HLS)
Also known as behavioural synthesis or architectural synthesis. Definition in [10]: "A transformation of a behavioural description into a set of connected storage and functional units".
Temporal Partition
The ability to map a function that exceeds the available space of the reconfigurable device by using time-sharing.
Spatial Partition
The ability to partition a function onto a set of reconfigurable devices.
Equivalent Gates
The number of transistors in the circuit under consideration divided by the number of transistors in a 2-input NAND gate (usually four transistors in CMOS technology).
Hardware/Software Co-Design
Concurrent design of hardware and software.
VHDL
VHSIC (Very High Speed Integrated Circuit) Hardware Description Language.
SRAM FPGA
A kind of FPGA based on static memory technology that is re-programmable and in-system programmable and requires external boot devices.

References

1
___, Proceedings of the International Workshop on Field-Programmable Logic and Applications. 1991-1998. Printed by Springer-Verlag. See http://xputers.informatik.uni-kl.de/FPL/index_fpl.html
2
___, Proceedings of the Reconfigurable Architectures Workshop. 1994-1998. Printed by Springer-Verlag. See http://xputers.informatik.uni-kl.de/RAW/index_raw.html
3
___, Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines. 1992-1998. Printed by IEEE Computer Society Press, Los Alamitos, Calif. See http://www.fccm.org/.
4
Aho, A. V., Sethi, R., Ullman, J. D. Compilers: Principles, Techniques and Tools. Addison Wesley, 1986.
 
5
Athanas, P., and Silverman, H. Processor Reconfiguration through Instruction-Set Metamorphosis: Architecture and Compiler. IEEE Computer, vol. 26, n. 3, March 1993, pp. 11-18.
6
Becker, J., Hartenstein, R., Herz, M., Nageldinger, U. Parallelization in Co-Compilation for Configurable Accelerators. In Proceedings of the Asia South Pacific Design Automation Conference (ASP-DAC'98), Yokohama, Japan, February 10-13, 1998.
7
Brebner, G., Donlin, A. Runtime Reconfigurable Routing. In Proceedings of the Reconfigurable Architectures Workshop (RAW'98), Orlando, Florida, USA, March 30, 1998.
8
Cardoso, J. M. P., and Neto, H. C. Towards an Automatic Path from Java(tm) Bytecodes to Hardware Through High-Level Synthesis. In Proceedings of the 5th IEEE International Conference on Electronics, Circuits and Systems (ICECS-98), Lisbon, Portugal, September 7-10, 1998, pp. 85-88.
9
GajjalaPurna, K. M., and Bhatia, D. Temporal Partitioning and Scheduling for Reconfigurable Computing. In Proceedings of the 6th IEEE Symposium on Field Programmable Custom Computing Machines (FCCM'98), Napa Valley, California, USA, April 15-17, 1998.
10
Gajski, D. D., Dutt, N. D., Wu, A. C.-H., Lin, S. Y.-L. High-Level Synthesis, Introduction to Chip and System Design. Kluwer Academic Publishers, Boston, Dordrecht, London, 1992.
11
Hartenstein, R. W., Becker, J., et al. High-Performance Computing Using a Reconfigurable Accelerator. In CPE Journal, Special Issue of Concurrency: Practice and Experience, John Wiley & Sons Ltd., 1996.
12
Hartenstein, R., Herz, M., Hoffmann, T., Nageldinger, U. On Reconfigurable Co-Processing Units. In Proceedings of the Reconfigurable Architectures Workshop (RAW'98), Orlando, Florida, USA, March 30, 1998.
13
Hauck, S., Fry, T. W., Hosler, M. M., Kao, J. P. The Chimaera Reconfigurable Functional Unit. In Proceedings of the 5th IEEE Symposium on Field Programmable Custom Computing Machines (FCCM'97), Napa Valley, California, USA, April, 1997, pp. 87-96.
14
Hennessy, J. L., Patterson, D. A. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers, 1990.
15
Mangione-Smith, W. H., et al. Seeking Solutions in Configurable Computing. IEEE Computer 30,12 December 1997, pp. 38-43.
16
Micheli, G. De, Gupta, R. Hardware/Software Co-Design. In Proceedings of the IEEE, vol. 85, no. 3, March 1997, pp. 349-365.
17
Mirsky, E., and DeHon, A. MATRIX: A Reconfigurable Computing Device with Configurable Instruction Deployable Resources. In Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines, Napa Valley, California, USA, April 17-19, 1996, pp. 157-166.
18
Miyamori, T., and Olukotun, K. A Quantitative Analysis of Reconfigurable Coprocessors for Multimedia Applications. In Proceedings of the 6th IEEE Symposium on Field Programmable Custom Computing Machines (FCCM'98), Napa, California, USA, April 15-17, 1998.
19
Muchnick, S. Advanced Compiler Design and Implementation. Morgan Kaufmann Publishers, Inc., San Francisco, California, USA, 1997, ISBN 1-55860-320-4.
20
Olukotun, K. A., Helaihel, R., Levitt, J., and Ramirez, R. A Software-Hardware Cosynthesis Approach to Digital System Simulation. IEEE Micro, vol. 14, Nov. 1994, pp. 48-58.
21
Ouaiss, I., Govindarajan, S., Srinivasan, V., Kaul, M., and Vemuri, R. An Integrated Partitioning and Synthesis System for Dynamically Reconfigurable Multi-FPGA Architectures. In Proceedings of the Reconfigurable Architectures Workshop (RAW'98), Orlando, Florida, USA, March 30, 1998.
22
Radunovic, B., Milutinovic, V. A Survey of Reconfigurable Computing Architectures. To appear as a tutorial in the 8th International Workshop on Field Programmable Logic and Applications (FPL'98), Tallinn, Estonia, 30 August - 2 September, 1998.
23
Razdan, R., Brace, K., and Smith, M. D. PRISC Software Acceleration Techniques. In Proceedings of the IEEE International Conference on Computer Design, Oct. 1994, pp. 145-149.
24
See General Processor Information,  http://infopad.eecs.berkeley.edu/CIC/summary/local.
25
Villasenor, J., Mangione-Smith, W. H. Configurable Computing. Scientific American, June 1997, pp. 66-71. See http://www.sciam.com/0697issue/0697villasenor.html
26
Wirthlin, M. J., and Hutchings, B. L. A Dynamic Instruction Set Computer. In Proceedings of the 4th IEEE Symposium on FPGAs for Custom Computing Machines (FCCM'95), Napa Valley, California, USA, April 19-21, 1995, pp. 99-107.
27
Wittig, R. D., and Chow, P. OneChip: An FPGA Processor with Reconfigurable Logic. In Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines, Napa Valley, California, USA, April 17-19, 1996, pp. 126-135.
28
Xilinx Inc. XC6000 Field Programmable Gate Arrays. Version 1.10, April 24, 1997. http://www.xilinx.com.
29
Xilinx, Inc. The Virtex Series of FPGAs.

João M. P. Cardoso (Joao.Cardoso@inesc.pt) is currently a Ph.D. student at the Instituto Superior Técnico in Lisbon, Portugal, working on the compilation of Java programs to custom computing machines. His research interests include system-level synthesis, high-level synthesis, hardware/software co-design, and reconfigurable computing. 

Mário P. Véstias (mpv@hobbes.inesc.pt) is currently a Ph.D. student at the Instituto Superior Técnico in Lisbon, Portugal, working on co-synthesis of embedded real-time systems. His research interests include hardware/software co-design, real-time systems, and reconfigurable computing.