Tampere University of Technology

TUTCRIS Research Portal

Customized Low Power Processor for Object Recognition

Research output: Book/Report › Master's Thesis › Scientific


Original language: English
Publisher: Eindhoven University of Technology
Number of pages: 93
Publication status: Published - 2016
Externally published: Yes
Publication type: G2 Master's thesis, polytechnic Master's thesis


Convolutional Neural Networks (CNNs) are machine learning algorithms that allow computers to perform object recognition on images with high accuracy. With cameras becoming cheaper and computing power increasingly available, computer vision applications are on the rise. This has created demand for fast, low-power embedded processors that can execute CNNs with high performance in small-form-factor devices and mobile applications.
In this work we present a design for a programmable hardware accelerator that can perform object recognition on HD images. The accelerator has a transport triggered architecture, which offers the instruction-level parallelism of a VLIW without the increased register file complexity. We use vector processing for high throughput, and by analyzing the data access patterns of the application we exploit the data locality inherent to CNNs, reducing the number of memory accesses to 4% of a naive implementation. Thanks to memory tiling, our accelerator requires only small SRAMs with low power consumption.
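The effect of exploiting data locality through tiling can be illustrated with a simple fetch-count model. This is a hypothetical sketch, not the thesis's actual access scheme: it compares a naive K×K convolution, where every output pixel re-reads its full input window from memory, against a tiled scheme where each tile (plus halo) is loaded into on-chip SRAM once and reused.

```python
import math

def naive_fetches(h, w, k):
    # Naive scheme: every output pixel re-reads its full KxK input
    # window from external memory.
    out_h, out_w = h - k + 1, w - k + 1
    return out_h * out_w * k * k

def tiled_fetches(h, w, k, tile):
    # Tiled scheme: each tile of outputs loads its inputs (tile plus a
    # k-1 halo on each axis) into on-chip SRAM once and reuses them.
    tiles_h = math.ceil((h - k + 1) / tile)
    tiles_w = math.ceil((w - k + 1) / tile)
    return tiles_h * tiles_w * (tile + k - 1) ** 2

# Illustrative numbers: 720p image, 3x3 kernel, 32x32 output tiles.
h, w, k, tile = 720, 1280, 3, 32
print(naive_fetches(h, w, k))        # 8258436 fetches
print(tiled_fetches(h, w, k, tile))  # 1063520 fetches
```

Even this crude model cuts fetches by roughly 8×; the thesis's 4% figure reflects reuse across the full CNN data access pattern, not just a single convolution layer.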
The accelerator has three on-chip SRAMs: one for neurons, one for weights, and an instruction memory. The total chip area including these three memories is 0.36 mm² in 28 nm technology. The entire chip consumes 74.3 mW at 1 GHz. Hardware simulations of a compiled network have shown that the accelerator can perform detection at 11.5 HD frames per second at a cost of 6.6 mJ per frame. However, the current version of the compiler is not able to fully utilize the accelerator, and our vector unit utilization is around 42%. We demonstrate that higher utilization is realistic and that 80% utilization is achievable, allowing 21 frames per second at 5.2 mJ per frame.
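The reported figures can be roughly cross-checked against each other: energy per frame is power divided by frame rate, assuming the chip draws its full 74.3 mW throughout (a sanity check on consistency, not a calculation from the thesis):

```python
# Cross-check: energy per frame = power / frame rate.
power_w = 0.0743                  # reported 74.3 mW
fps = 11.5                        # reported HD frames per second
energy_per_frame_mj = power_w / fps * 1000
print(round(energy_per_frame_mj, 1))  # ~6.5 mJ, close to the reported 6.6 mJ
```

The small gap between ~6.5 mJ and the reported 6.6 mJ is plausibly rounding in the published numbers.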
The accelerator is designed with scalability in mind. We demonstrate that adding more parallel vector function units scales performance almost linearly while the energy cost per frame drops. Using our efficient memory access scheme, we contain the memory bandwidth bottleneck that plagues hardware-accelerated CNNs and achieve excellent scalability. This allows us to increase the computational capacity of the accelerator and lower the clock frequency, resulting in reduced power consumption and an even lower energy cost per frame. Low power consumption combined with programmability makes our accelerator an excellent solution for a broad range of low-power, embedded visual recognition applications.