Reducing the overheads of hardware acceleration through datapath integration
Research output: Chapter in Book/Report/Conference proceeding › Conference contribution
|Title of host publication||Multimedia on Mobile Devices 2008, 28-29 January, 2008, San Jose, California, USA. Proceedings of SPIE-IS&T Electronic Imaging|
|Editors||R. Creutzburg, J. Takala|
|Number of pages||10|
|State||Published - 2008|
|Publication type||A4 Article in a conference publication|
|Conference||SPIE CONFERENCE PROCEEDINGS|
|Period||1/01/00 → …|
Hardware accelerators are used to speed up the execution of specific tasks such as video coding. Often the purpose of hardware acceleration is to allow a cheaper or more energy-efficient processor to execute the majority of the application in software. However, hardware acceleration introduces new overheads, mainly from transferring data to and from the accelerator and from signaling the completion of the accelerator's computation to the processor. We find the traditional mechanisms suboptimal for fine-grain hardware acceleration, especially when energy efficiency is important.
This paper explores a technique unique to Transport Triggered Architectures for interfacing with hardware accelerators. The proposed technique places hardware accelerators in the processor data path, making them visible to the programmer as regular function units. Communication costs are reduced because data can be transferred directly to the accelerator from other processor data path components, and synchronization can be done by polling a simple ready flag in the accelerator function unit. Additionally, this setup lets the instruction scheduler of the compiler schedule the hardware accelerator like any other operation and thus partially hide its latency behind other program operations.
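The ready-flag polling and latency hiding described above can be illustrated with a small cycle-counting model. This is only a sketch under stated assumptions, not the paper's implementation: the class `AcceleratorFU`, the function `run`, the one-operation-per-cycle cost model, and the squaring stand-in computation are all hypothetical, and Python is used purely for brevity. On a real Transport Triggered Architecture the overlap would be arranged by the compiler's instruction scheduler rather than by a run-time loop.

```python
from dataclasses import dataclass


@dataclass
class AcceleratorFU:
    """Toy model of an accelerator exposed as a function unit (hypothetical)."""

    latency: int          # cycles between triggering and the result being ready
    ready: bool = True    # the "simple ready flag" the processor can poll
    _remaining: int = 0
    _result: int = 0

    def trigger(self, operand: int) -> None:
        # Moving an operand to the trigger port starts the computation.
        self._result = operand * operand  # stand-in for the real accelerator work
        self._remaining = self.latency
        self.ready = self.latency == 0

    def tick(self) -> None:
        # Advance the accelerator by one clock cycle.
        if self._remaining > 0:
            self._remaining -= 1
            if self._remaining == 0:
                self.ready = True

    def read(self) -> int:
        # Reading the result port is only valid once the ready flag is set.
        assert self.ready, "result read before the accelerator finished"
        return self._result


def run(fu: AcceleratorFU, operand: int, work, overlap: bool):
    """Return (accelerator result, sum of independent work, cycles spent).

    Each independent work item is assumed to cost one cycle. With
    overlap=False the processor only polls the ready flag while waiting;
    with overlap=True it executes independent work in those cycles.
    """
    pending = list(work)
    acc = 0
    cycles = 0
    fu.trigger(operand)
    while not fu.ready:               # poll the ready flag once per cycle
        if overlap and pending:
            acc += pending.pop(0)     # useful work hides accelerator latency
        cycles += 1
        fu.tick()
    result = fu.read()
    for item in pending:              # work that could not be overlapped
        acc += item
        cycles += 1
    return result, acc, cycles


if __name__ == "__main__":
    print(run(AcceleratorFU(latency=8), 3, range(5), overlap=False))
    print(run(AcceleratorFU(latency=8), 3, range(5), overlap=True))
```

In this toy model the serialized variant costs the accelerator latency plus the number of independent work items, while the overlapped variant costs only the larger of the two. The accelerator latency is hidden only to the extent that independent operations are available to fill the waiting cycles.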
The paper presents a case study with an audio decoder application in which fine-grain and coarse-grain hardware accelerators are integrated into the processor data path as function units. The case is used to study several synchronization, communication, and latency-hiding techniques enabled by this kind of setup.