Evaluation of a Heterogeneous Multicore Architecture by Design and Test of an OFDM Receiver

This paper presents an evaluation of a Heterogeneous Multicore Architecture (HMA) by implementing Orthogonal Frequency-Division Multiplexing (OFDM) receiver blocks as designs for the test of functionality. An OFDM receiver consists of computationally intensive and general-purpose processing tasks that can provide maximum coverage to test and evaluate a massively-parallel as well as a general-purpose platform like the HMA. The blocks of the receiver are primarily designed by crafting template-based Coarse-Grained Reconfigurable Array (CGRA) devices and then arranging them in a sequence over a Network-on-Chip (NoC) structure along with a few RISC cores for complete OFDM processing. The OFDM blocks such as Fast Fourier Transform (FFT) and Time Synchronization are computationally intensive and require parallel processing. The OFDM receiver also contains tasks such as frequency offset estimation, which require the processing of Taylor series and CORDIC algorithms that are serial in nature. Such a combination of serial and parallel algorithms enables a thorough exploration and evaluation of almost all the design features of an HMA. The OFDM implementation has required scaling CGRAs to different dimensions, instantiating Processing Elements (PEs) as multiple arithmetic resources and establishing almost all possible ways of PE interconnection. It further explores time-multiplexed patterns for data placement in the CGRA memories. In addition, the data can be exchanged among different nodes over the NoC structure simultaneously and independently by using direct memory access devices. In this experimental work, the performance of each CGRA, the collective performance of the whole platform and the NoC traffic are recorded in terms of the number of clock cycles and several high-level performance metrics. Today’s HMAs are generally over- or under-resourced for the applications that they are designed for and thus not an optimal choice for the end user. Apart from the interesting comparisons to other state-of-the-art platforms, our experimental setup has provided important insight and guidelines that designers can use to implement near-optimal solutions for their target applications.

General information
State: Published
Ministry of Education publication type: A1 Journal article-refereed
Authors: Nouri, S., Hussain, W., Nurmi, J.
Number of pages: 17
Pages: 3171-3187
Publication date: 1 Nov 2017
Peer-reviewed: Yes
Early online date: 22 May 2017

Publication information
Journal: IEEE Transactions on Parallel and Distributed Systems
ISSN (Print): 1045-9219
Ratings:
Scopus rating (2016): SJR 1.129 SNIP 3.261 CiteScore 5.45
Scopus rating (2015): SJR 1.177 SNIP 3.564 CiteScore 4.66
Scopus rating (2014): SJR 1.048 SNIP 3.886 CiteScore 4.29
Scopus rating (2013): SJR 0.975 SNIP 3.4 CiteScore 3.78
Scopus rating (2012): SJR 0.831 SNIP 3.144 CiteScore 3.49
Scopus rating (2011): SJR 0.785 SNIP 3.008 CiteScore 3.31
Scopus rating (2010): SJR 0.938 SNIP 2.826
Scopus rating (2009): SJR 0.939 SNIP 2.538
Scopus rating (2008): SJR 1.009 SNIP 2.606
Scopus rating (2007): SJR 1.059 SNIP 2.872
Scopus rating (2006): SJR 1.145 SNIP 2.736
Scopus rating (2005): SJR 1.394 SNIP 3.033
Scopus rating (2004): SJR 1.141 SNIP 2.748
Scopus rating (2003): SJR 0.952 SNIP 2.183
Scopus rating (2002): SJR 1.016 SNIP 2.108
Scopus rating (2001): SJR 0.991 SNIP 1.943
Scopus rating (2000): SJR 0.759 SNIP 2.16
Scopus rating (1999): SJR 0.507 SNIP 1.82
Original language: English
ASJC Scopus subject areas: Hardware and Architecture, Electrical and Electronic Engineering
DOIs:
10.1109/TPDS.2017.2706691
Research output: Scientific - peer-review › Article
MergeTree: A Fast Hardware HLBVH Constructor for Animated Ray Tracing

Ray tracing is a computationally intensive rendering technique traditionally used in offline high-quality rendering. Powerful hardware accelerators have been recently developed that put real-time ray tracing even in the reach of mobile devices. However, rendering animated scenes remains difficult, as updating the acceleration trees for each frame is a memory-intensive process. This article proposes MergeTree, the first hardware architecture for Hierarchical Linear Bounding Volume Hierarchy (HLBVH) construction, designed to minimize memory traffic. For evaluation, the hardware constructor is synthesized on a 28 nm process technology. Compared to a state-of-the-art binned surface area heuristic sweep (SAH) builder, the present work speeds up construction by a factor of 5, reduces build energy by a factor of 3.2, and memory traffic by a factor of 3. A software HLBVH builder on a graphics processing unit (GPU) requires 3.3 times more memory traffic. To take tree quality into account, a rendering accelerator is modeled alongside the builder. Given the use of a top-level build to improve tree quality, the proposed builder reduces system energy per frame by an average of 41% with primary rays and 13% with diffuse rays. In large (> 500K triangles) scenes, the difference is more pronounced, 62% and 35%, respectively.
Fast Hardware Construction and Refitting of Quantized Bounding Volume Hierarchies

There is recent interest in GPU architectures designed to accelerate ray tracing, especially on mobile systems with limited memory bandwidth. A promising recent approach is to store and traverse Bounding Volume Hierarchies (BVHs), used to accelerate ray tracing, in low arithmetic precision. However, so far there is no research on refitting or construction of such compressed BVHs, which is necessary for any scenes with dynamic content. We find that in a hardware-accelerated tree update, significant memory traffic and runtime savings are available from streaming, bottom-up compression. Novel algorithmic techniques of modulo encoding and treelet-based compression are proposed to reduce backtracking inherent in bottom-up compression. Together, these techniques reduce backtracking to a small fraction. Compared to a separate top-down compression pass, streaming bottom-up compression with the proposed optimizations saves on average 42% of memory accesses for LBVH construction and 56% for refitting of compressed BVHs, over 16 test scenes. In architectural simulation, the proposed streaming compression reduces LBVH runtime by 20% compared to a single-precision build, and 41% compared to a single-precision build followed by top-down compression. Since memory traffic dominates the energy cost of refitting and LBVH construction, energy consumption is expected to fall by a similar fraction.

General information
State: Published
Ministry of Education publication type: A1 Journal article-refereed
Organisations: Pervasive Computing, Research area: Computer engineering
Authors: Viitanen, T., Koskela, M., Jääskeläinen, P., Immonen, K., Takala, J.
Number of pages: 12
Pages: 167-178
Publication date: 5 Jul 2017
Peer-reviewed: Yes

Publication information
Journal: Computer Graphics Forum
Volume: 36
Issue number: 4
ISSN (Print): 0167-7055
Ratings:
Scopus rating (2016): SJR 0.89 SNIP 1.239 CiteScore 2.33
Scopus rating (2015): SJR 0.785 SNIP 1.549 CiteScore 2.34
Scopus rating (2014): SJR 0.991 SNIP 1.721 CiteScore 2.35
Scopus rating (2013): SJR 1.074 SNIP 1.869 CiteScore 2.68
Scopus rating (2012): SJR 0.771 SNIP 2.043 CiteScore 2.28
Scopus rating (2011): SJR 0.879 SNIP 1.691 CiteScore 2.2
Scopus rating (2010): SJR 0.814 SNIP 1.792
Scopus rating (2009): SJR 0.685 SNIP 1.81
Scopus rating (2008): SJR 0.688 SNIP 2.348
Scopus rating (2007): SJR 0.768 SNIP 1.753
Scopus rating (2006): SJR 0.664 SNIP 2.331
Scopus rating (2005): SJR 0.522 SNIP 2.256
Integration Issues of a Run-Time Configurable Memory Management Unit to a RISC Processor on FPGA

This paper presents the integration issues of a proposed run-time configurable Memory Management Unit (MMU) to the COFFEE processor developed by our group at Tampere University of Technology. The MMU consists of three Translation Lookaside Buffers (TLBs) in two levels of hierarchy. The MMU and its integration into the processor are prototyped on a Field Programmable Gate Array (FPGA) device. Furthermore, analytical results of scaling the second-level Unified TLB (UTLB) to three configurations (with 16, 32, and 64 entries) are shown with respect to the effect on overall hit rate as well as energy consumption. The critical path analysis of the logical design running on the target FPGA is presented together with a description of optimization techniques that improve static timing performance and yield a 22.75% speed-up. We reached our target operating frequency of 200 MHz for the 64-entry UTLB and, thus, it is our preferred option. The 32-entry UTLB configuration provides a decent trade-off for resource-constrained or speed-critical hardware designs, while the 16-entry configuration delivers unsatisfactory performance. Next, integration challenges and how to resolve each of them (such as employing a wrapper around the MMU, modifying the hardware description of the COFFEE core, etc.) are investigated in detail. This paper not only provides invaluable information with regard to the implementation and integration phases of the MMU to a RISC processor, it also opens a new horizon for our processor to provide virtual memory for its running operating system without degrading the operating frequency. This work also aims to serve as a general reference for future integration to the COFFEE core as well as other similar processor architectures.

General information
State: Published
Ministry of Education publication type: A1 Journal article-refereed
Organisations: Department of Electronics and Communications Engineering, Research group: System-on-Chip for GNSS, Wireless Communications and Cyber-Physical Embedded Computing
Authors: Shamani, F., Fakour Sevom, V., Ahonen, T., Nurmi, J.
Pages: 179-191
Publication date: Mar 2017
Peer-reviewed: Yes
Early online date: 5 Dec 2016

Publication information
Journal: Microprocessors and Microsystems
Volume: 49
ISSN (Print): 0141-9331
Ratings:
Scopus rating (2016): SJR 0.238 SNIP 0.864 CiteScore 1.11
Scopus rating (2015): SJR 0.26 SNIP 0.859 CiteScore 0.89
Scopus rating (2014): SJR 0.241 SNIP 1.1 CiteScore 0.97
Scopus rating (2013): SJR 0.231 SNIP 1.234 CiteScore 1.02
Scopus rating (2012): SJR 0.218 SNIP 0.818 CiteScore 0.94
Scopus rating (2011): SJR 0.22 SNIP 0.847 CiteScore 0.99
Scopus rating (2010): SJR 0.276 SNIP 0.937
Scopus rating (2009): SJR 0.241 SNIP 0.843
Scopus rating (2008): SJR 0.262 SNIP 0.915
Scopus rating (2007): SJR 0.35 SNIP 0.848
Scopus rating (2006): SJR 0.332 SNIP 0.57
Scopus rating (2005): SJR 0.18 SNIP 0.751
Scopus rating (2004): SJR 0.229 SNIP 0.774
Scopus rating (2003): SJR 0.274 SNIP 0.997
Scopus rating (2002): SJR 0.253 SNIP 0.62
Scopus rating (2001): SJR 0.241 SNIP 0.572
Scopus rating (2000): SJR 0.166 SNIP 0.371
Scopus rating (1999): SJR 0.171 SNIP 0.248

Original language: English
Keywords: FPGA Implementation, Memory Management Unit, Virtual-to-Physical Address Translation, Run-Time Configurable MMU, COFFEE RISC Processor

Computing Platforms for Software-Defined Radio

General information
State: Published
Ministry of Education publication type: C2 Edited books
Organisations: Department of Electronics and Communications Engineering, Research group: System-on-Chip for GNSS, Wireless Communications and Cyber-Physical Embedded Computing, University of Turku, Fraunhofer Institute
Authors: Hussain, W., Nurmi, J., Isoaho, J., Garzia, F.
Publication date: 2017

Publication information
Publisher: Springer
ISBN (Print): 978-3-319-49678-8
ISBN (Electronic): 978-3-319-49679-5
Original language: English
ASJC Scopus subject areas: Electrical and Electronic Engineering
DOIs:
10.1016/j.micpro.2016.12.001
Source: RIS
Source-ID: urn:232F4BBA78E77F148E205D553683F92D
Research output: Scientific - peer-review › Article

Data Flow Algorithms for Processors with Vector Extensions: Handling Actors With Internal State

Full use of the parallel computation capabilities of present and expected CPUs and GPUs requires use of vector extensions. Yet many actors in data flow systems for digital signal processing have internal state (or, equivalently, an edge that loops from the actor back to itself) that imposes serial dependencies between actor invocations, making vectorization across actor invocations impossible. Ideally, issues of inter-thread coordination required by serial data dependencies should be handled by code written by parallel programming experts that is separate from code specifying signal processing operations. The purpose of this paper is to present one approach for doing so in the case of actors that maintain state. We propose a methodology for using the parallel scan (also known as prefix sum) pattern to create algorithms for multiple simultaneous invocations of such an actor that result in vectorizable code. Two examples of applying this methodology are given: (1) infinite impulse response filters and (2) finite state machines. The correctness and performance of the resulting IIR filters and one class of FSMs are studied.
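
The scan idea can be illustrated with a minimal sketch for a first-order IIR filter, y[n] = a*y[n-1] + x[n] with zero initial state. This is not the authors' implementation: the function names, the Hillis-Steele scan variant and the single-pole filter are all assumptions chosen for brevity. Each sample is treated as an affine map y -> a*y + x[n]; composing such maps is associative, so the running outputs can be obtained with a scan whose inner loop has no dependency between positions.

```python
def combine(first, second):
    """Compose two affine maps y -> a*y + b, applying `first` and then `second`."""
    a1, b1 = first
    a2, b2 = second
    return (a1 * a2, a2 * b1 + b2)


def inclusive_scan(items, op):
    """Hillis-Steele inclusive scan; each pass is independent across positions."""
    result = list(items)
    step = 1
    while step < len(result):
        prev = result
        # Every position i >= step combines with the element `step` places
        # earlier; these updates do not depend on each other, so a vector
        # unit could perform a whole pass at once.
        result = [
            prev[i] if i < step else op(prev[i - step], prev[i])
            for i in range(len(prev))
        ]
        step *= 2
    return result


def iir_serial(x, a):
    """Reference: the usual serial recurrence y[n] = a*y[n-1] + x[n]."""
    y, out = 0.0, []
    for sample in x:
        y = a * y + sample
        out.append(y)
    return out


def iir_scan(x, a):
    """Same filter, expressed as a scan over (a, x[n]) pairs."""
    pairs = [(a, sample) for sample in x]
    return [b for _, b in inclusive_scan(pairs, combine)]
```

For an input of length N the scan needs about log2(N) passes instead of N dependent steps, at the cost of more total arithmetic; whether that trade pays off depends on the vector width available.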

General information
State: Published
Ministry of Education publication type: A1 Journal article-refereed
Organisations: Department of Pervasive Computing, Research area: Computer engineering, Signal Processing Research Community (SPRC), Keysight Technologies, University of Maryland
Authors: Barford, L., Bhattacharyya, S. S., Liu, Y.
Pages: 21-31
Publication date: 2017
Peer-reviewed: Yes

Research output: Scientific - peer-review › Anthology
Design and Implementation of IEEE 802.11a/g Receiver Blocks on a Coarse-Grained Reconfigurable Array

This chapter presents the design and evaluation of template-based Coarse-Grained Reconfigurable Array (CGRA) generated accelerators that process Orthogonal Frequency-Division Multiplexing receiver blocks. The CGRA operates as a coprocessor with a Reduced Instruction-Set Computing (RISC) processor so that the overall system yields the benefits of general- and special-purpose processing. The accelerators are designed by crafting the CGRA template to the computational and communication requirements of the algorithms in an effort to minimize the resource utilization and power dissipation on the target Field Programmable Gate Array (FPGA) device. The performance of each CGRA is recorded in terms of the number of clock cycles and several performance metrics. The power consumption is also estimated by simulating the postfit gate-level FPGA netlist of the accelerators.

General information
State: Published
Ministry of Education publication type: A3 Part of a book or another research book
Organisations: Department of Electronics and Communications Engineering, Research group: System-on-Chip for GNSS, Wireless Communications and Cyber-Physical Embedded Computing, Ruhr-Universität Bochum
Authors: Nouri, S., Hussain, W., Göhringer, D., Nurmi, J.
Pages: 61-89
Publication date: 2017

Host publication information
Title of host publication: Computing Platforms for Software-Defined Radio
Publisher: Springer
Editors: Hussain, W., Nurmi, J., Isoaho, J., Garzia, F.
ISBN (Print): 978-3-319-49678-8
ISBN (Electronic): 978-3-319-49679-5
ASJC Scopus subject areas: Electrical and Electronic Engineering
Design Transformation from a Single-Core to a Multi-Core Architecture Targeting Massively Parallel Signal Processing Algorithms

This chapter describes single-core and multi-core platforms that are reconfigurable and heterogeneous by design and are specifically targeted to accelerate computationally intensive signal processing algorithms mostly used in software-defined radio applications. The single-core accelerator architectures are tightly integrated with a C programmable processor core while the backbone of communications and control in the multi-core architecture is a network-on-chip. The platforms were instantiated multiple times for different proof-of-concept application scenarios. The single- and multi-core platforms were subjected to self-aware dynamic frequency scaling while being prototyped for a field programmable gate array device. The performance of the platforms was measured and estimated in terms of many basic and high-level metrics, and comparisons with other state-of-the-art platforms are established for design evaluation.

Efficient fast-convolution based implementation of 5G waveform processing using circular convolution decomposition

Multirate fast convolution (FC) has recently been introduced as an effective tool for communication waveform processing, especially for advanced multicarrier systems targeting a well-contained spectrum. These include filter bank based multicarrier waveforms and filtered OFDM schemes which are receiving increasing attention in the 5G radio access development. Recalling that the key idea of FC is effective implementation of high-order linear filtering through frequency-domain processing, this paper investigates possibilities to reduce the complexity of FC based waveforms. Special focus is on scenarios where a relatively small part of the bandwidth is in active use, which could be the case, e.g., in low-rate machine-type communication devices. A new variant of fast-convolution filter bank (FC-FB) is developed which uses circular convolution decomposition. The narrowband variant of the decomposed structure, called D-FC-FB, achieves significantly reduced complexity, which is proportional to the active bandwidth, while maintaining filtering performance equivalent to FC-FB. Therefore, this variant is considered a low-complexity solution for low-rate devices. D-FC-FB can be used in any multicarrier scheme that utilizes filtering at subcarrier or resource block level. This paper develops closed-form complexity expressions for the case of filter bank multicarrier with offset-QAM subcarrier modulation (FBMC/OQAM), demonstrating significant complexity reduction in a case study.
Filtered multitone multicarrier modulation with partially overlapping sub-channels

Future wireless networks demand multicarrier modulation schemes with improved spectrum efficiency and superior spectrum containment. Orthogonal frequency division multiplexing (OFDM) has been the favorite technique in recent developments, but due to its limited spectrum containment, various alternative schemes are under consideration for future systems. Theoretically, it is not possible to reach maximum spectrum efficiency, high spectral containment, and orthogonality of subcarriers simultaneously, when using quadrature amplitude modulation (QAM) for subcarriers. This has motivated the study of non-orthogonal multicarrier modulation schemes. This paper focuses on the filtered multitone (FMT) scheme, one of the classical configurations of filter bank multicarrier modulation (FBMC) utilizing QAM subcarrier symbols. Our main aim is to improve the spectral efficiency of FMT by introducing controlled overlap of adjacent subbands. An analytical model is developed for evaluating the tradeoffs between spectrum efficiency and intercarrier interference introduced by the overlap. An efficient fast convolution waveform processing scheme is adopted for the generation of the proposed waveform. It allows effective adjustment of the roll-off and subcarrier spacing to facilitate waveform adaptation in real time. Analytical studies, confirmed by simulation results, indicate that the proposed FMT system can obtain significant spectral density improvement without requiring additional ICI cancellation techniques.

Introduction and Book Structure

Wireless positioning and navigation is a prevalent area embedded in the majority of wireless communication devices and applications. The number of technologies supporting wireless navigation has been continuously increasing in the last decade, from those based on classical satellite navigation systems to technologies employing inertial sensors, Wireless Local Area Networks (WLAN) and cellular systems, and even ultra-sound and visible light systems.
Multi-Technology Positioning

General information
State: Published
Ministry of Education publication type: C2 Edited books
Organisations: Department of Electronics and Communications Engineering, Research group: System-on-Chip for GNSS, Wireless Communications and Cyber-Physical Embedded Computing, Research group: Wireless Communications and Positioning, Department of Mathematics, Research group: MAT Intelligent Information Systems Laboratory, Chalmers University of Technology
Publication date: 2017

Publication information
Publisher: Springer
ISBN (Print): 978-3-319-50426-1
ISBN (Electronic): 978-3-319-50427-8
Original language: English
ASJC Scopus subject areas: Electrical and Electronic Engineering, Aerospace Engineering
DOIs: 10.1007/978-3-319-50427-8
Research output: Scientific - peer-review › Anthology

Ninesilica: A Homogeneous MPSoC Approach for SDR Platforms

This chapter presents the study of Software Defined Radio applications on homogeneous multi-core architectures based on the Silicon Café template. Two instances of the template have been realized and implemented on an Altera Stratix IV FPGA device. Ninesilica, the first instance of the template, is a homogeneous 3 × 3 mesh of processing elements realizing a standalone cluster. The second instance of the template is a clustered architecture composed of four Ninesilica clusters. Significant kernels of WCDMA and OFDM were ported to the architectures, and the platform performance was analyzed in terms of computational power, algorithm scalability, energy consumption and efficiency, portability of the mapping and hardware scalability. The achieved results showed that the proposed approach offers high flexibility and parallelization efficiency, making homogeneous solutions a good candidate for the implementation of SDR systems.

General information
State: Published
Ministry of Education publication type: A3 Part of a book or another research book
Organisations: Department of Electronics and Communications Engineering, Research group: System-on-Chip for GNSS, Wireless Communications and Cyber-Physical Embedded Computing, Fraunhofer Institute
Authors: Airoldi, R., Garzia, F., Ahonen, T., Nurmi, J.
Pages: 107-119
Publication date: 2017

Host publication information
Title of host publication: Computing Platforms for Software-Defined Radio
Publisher: Springer
Editors: Hussain, W., Nurmi, J., Isoaho, J., Garzia, F.
ISBN (Print): 978-3-319-49678-8
ISBN (Electronic): 978-3-319-49679-5
ASJC Scopus subject areas: Electrical and Electronic Engineering
DOIs: 10.1007/978-3-319-49679-5_6
Research output: Scientific - peer-review › Chapter

Proceedings of 2017 27th International Conference on Field Programmable Logic and Applications (FPL)

General information
State: Published
Ministry of Education publication type: C2 Edited books
Organisations: Electronics and Communications Engineering, Research group: Wireless Communications and Positioning, Politecnico di Milano, Dipartimento di Elettronica e Informazione, Technical University of Dresden, University of Gent, University of Leuven
Synchronization In NC-OFDM-Based Cognitive Radio Platforms

This chapter provides essential information with regard to the synchronization issues in Non-Contiguous Orthogonal Frequency Division Multiplexing (NC-OFDM)-based systems. It also provides a flexible timing synchronization scheme implemented on an Altera Stratix-V Field Programmable Gate Array (FPGA) device. The main component of the synchronizer is a reconfigurable module which calculates the Sum-of-Products (SoP) of the incoming signal with predefined coefficients. The SoP module performs as a multicorrelator on demand. Furthermore, different architectures of the SoP block and their respective performance evaluations are discussed in detail. Eventually, all developed architectures are compared to each other in terms of power consumption, silicon area, maximum frequency, etc.

The Evolution of Software-Defined Radio: An Introduction

The Software-Defined Radio (SDR) concept was originally developed by the combined efforts of various research groups in private and government organizations of the United States (US) in the 1970s–1980s. The important ones to mention are the US Department of Defense Laboratory and a team at the Garland, Texas Division of E-Systems Inc. In 1991, Joe Mitola independently reinvented the term ‘Software Radio’ (SR) in cooperation with E-Systems as a plan to build a true software-based GSM transceiver (Mitola, Telesystems Conference, 1992). The SR platform essentially processes almost all the transceiver algorithms as software for a processor. This includes nearly all layers of transmission. However, an optimal implementation of the physical layer is always challenging due to the enormous amount of mathematical computation. Over time, many developmental changes occurred and an interesting feature of cognition was added to existing SDR platforms, thereby inventing the term ‘Cognitive Radio’ (CR). The main idea was to reduce over-sampling by the analog-to-digital converter, reduce on-chip processing and target only the spectrum of interest. This book also touches on the CR feature of the large SDR field in some of its selected chapters. Since the very first few articles of J. Mitola, there has been a tremendous amount of research work conducted in industry and academia. The evolution of SDRs is continuous and provides a number of excellent opportunities for researchers to explore and come up with their own findings. Present-day SDR implementations are such that the designers are focused mostly on the design of hardware and software, their interfacing and optimizations for varying architectural choices. This includes multiple cases of application-specific general-purpose acceleration platforms that are scalable, homogeneous and heterogeneous in nature while providing multiple programmable cores on a single-chip computing system.
The Future of Software-Defined Radio: Recommendations

An efficient Software-Defined Radio solution comes when all the aspects of system design are collectively addressed under application specifications and constraints. This includes all the efforts to design wideband antennas, powerful software to process a huge bandwidth of information, hardware optimizations to maximize performance and, not least, compilers and operating systems. It is important that every engineer or scientist working on a particular block of an SDR has at least a bare-minimum understanding of the entire design stack. There is a need for a clear vision of the targets to be achieved and the trade-offs to be made, and for a unified approach so that all the objectives are measurable to enable a qualitative and quantitative analysis.

Xor-Masking: A Novel Statistical Method for Instruction Read Energy Reduction in Contemporary SRAM Technologies

Pervasive computing calls for ultra-low-power devices to extend the battery life enough to enable usability in everyday life. Especially in devices involving programmable processors, the energy consumption of integrated memories often plays a critical role. Consequently, contemporary memory technologies focus more on the energy-efficiency aspects with new custom CMOS SRAM cells with tailored energy consumption profiles constantly being proposed.

This paper proposes a method that exploits such contemporary low power SRAM memories that are energy optimized for storing a certain logic value to improve the energy-efficiency of instruction fetching, a major energy overhead in programmable designs. The method utilizes a low overhead xor-masking approach combined with statistical program analysis to produce optimal masks to reduce the occurrence of the more energy consuming bit values in the fetched instructions.

In comparison to the "bus invert" technique typically used with similar SRAMs, the proposed method incurs minimal area overhead while still reducing the total energy consumption of an example LatticeMico32 core up to 5%. The improvement to instruction memory energy consumption alone is up to 13% with a set of benchmarks.
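
As a rough illustration of the xor-masking idea, the sketch below derives one global mask from per-bit statistics of a program image. It rests on assumptions not stated in the abstract: that reading a 1 costs more energy than reading a 0, that a single mask covers the whole program, and that a per-bit majority vote is the statistic used. The function names and the sample instruction words are hypothetical.

```python
def best_mask(words, width=32):
    """Set a mask bit wherever that bit is 1 in more than half of the words."""
    mask = 0
    for bit in range(width):
        ones = sum((word >> bit) & 1 for word in words)
        if 2 * ones > len(words):
            mask |= 1 << bit
    return mask


def apply_mask(words, mask):
    """XOR every word with the mask; applying it twice restores the input."""
    return [word ^ mask for word in words]


def count_ones(words):
    """Total number of 1 bits, used here as a stand-in for read energy."""
    return sum(bin(word).count("1") for word in words)


if __name__ == "__main__":
    # Hypothetical instruction words, not taken from any real benchmark.
    program = [0xFFFF0001, 0xFFFF0002, 0xFF0F0003, 0xFFFF0000]
    mask = best_mask(program)
    stored = apply_mask(program, mask)        # what the memory would hold
    fetched = apply_mask(stored, mask)        # XOR again on the fetch path
    print(hex(mask), count_ones(program), count_ones(stored), fetched == program)
```

Because each bit position is decided independently by majority, the masked image never contains more 1 bits than the original under this cost model, and the fetch path only needs one XOR gate per bit to undo the mask.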
FPGA Implementation Issues of a Flexible Synchronizer Suitable for NC-OFDM-Based Cognitive Radios

This paper presents a flexible timing synchronization scheme alongside the hardware implementation issues on an Altera Stratix-V Field Programmable Gate Array (FPGA) device. The core of the synchronizer is based on a Finite Impulse Response (FIR) filter which operates as a multicorrelator on demand. The term “flexibility” refers to a specific part of the synchronizer where the multicorrelator reconfigures its FIR filter block on-the-fly by employing the Partial Reconfiguration (PR) feature. Moreover, different implementations have been evaluated for the multicorrelator, including the MultiplierLess (ML) approach (an approximate computing technique only for autocorrelation purposes) along with the Direct Form as well as the Transposed, Parallel, and Pipelined-Parallel Direct Form FIR filters. All the developed architectures are compared to each other in terms of power consumption, silicon area, and maximum frequency. Preliminary synthesis results show that the ML approach achieves better performance (including 94% less power dissipation, 75% less logic utilization as well as 67% fewer registers) than other architectures when performing the autocorrelation function. Furthermore, the critical path is analyzed and appropriate optimization techniques (such as DSP register packing and intermediate register insertion) are applied to the best candidates of the architectures mentioned. As the best results, 2.83× speed-up, 56.57% less logic utilization along with 38.86% fewer registers are achieved for different architectures. Accordingly, we discover that the parallel form, as well as the pipelined-parallel one, achieves more interesting results than the transposed version in most cases.
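
The difference between two of the FIR structures named above can be sketched in software. This is a behavioural model only, with hypothetical function names; it says nothing about the paper's hardware mapping, timing or resource usage. Both forms compute y[n] = sum over k of h[k]*x[n-k], but the direct form delays the input samples while the transposed form delays partial sums.

```python
def fir_direct(x, h):
    """Direct form: a delay line of input samples feeds one sum of products."""
    delay = [0] * len(h)
    out = []
    for sample in x:
        delay = [sample] + delay[:-1]          # shift the new sample in
        out.append(sum(c * d for c, d in zip(h, delay)))
    return out


def fir_transposed(x, h):
    """Transposed form: the input is broadcast to all taps, sums are delayed."""
    state = [0] * (len(h) - 1)                 # registers holding partial sums
    out = []
    for sample in x:
        out.append(h[0] * sample + (state[0] if state else 0))
        shifted = state[1:] + [0]              # each register takes its neighbour
        state = [h[k + 1] * sample + shifted[k] for k in range(len(state))]
    return out
```

Feeding a unit impulse through either function returns the coefficients themselves. Loading the coefficients with a known reference sequence makes the same structure act as a correlator, which is one way to read the "multicorrelator" role described in the abstract.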
Using OpenCL to Rapidly Prototype FPGA Designs

Field Programmable Gate Arrays (FPGAs) have gained popularity because their reconfigurability can speed up development and verification with relatively low cost. However, the deep understanding of hardware logic programming that is required has discouraged many software engineers. An interface between host devices and FPGAs to enable designing and programming FPGAs using a software programming standard and encapsulating hardware details is much desired. In this paper we evaluate leveraging Open Computing Language (OpenCL) to rapidly design FPGAs, considering both hardware logic utilization efficiency and computing performance. On a heterogeneous computer system consisting of ARM processors and an Altera FPGA, we execute an OpenCL host program on the ARM processors and an OpenCL kernel on the FPGA, to compute a parametrizable two-dimensional Mandelbrot fractal. We explore three design aspects of adjusting OpenCL work-group size, coalescing memory access, and replicating compute units to improve the FPGA computation performance. After optimizing the core algorithm, we efficiently reduced the logic utilization and Digital Signal Processing (DSP) blocks required for a single compute unit, and successfully increased the number of replicated compute units from four to six, thus delivering a 1.5X increase of parallel computation capacity of the FPGA, and improving the computing speed by 1.5X and memory bandwidth by 1.7X.
Accelerating Computation on an Android Phone with OpenCL Parallelism and Optimizing Workload Distribution between a Phone and a Cloud Service

We evaluate workload distribution optimization between an Android phone and a cloud service by considering the overall impact of both computation and data transfer. We use OpenCL parallelism on Android to obtain high computation performance. We implement an escape time algorithm to compute the Mandelbrot set with OpenCL, with Java as a reference for comparison. In an experiment of setting the escape boundary at 256, OpenCL offers about 5.0X to 7.5X faster computation compared to Java. With a cloud service, data transfer becomes a dominant factor when the amount of computation is low. In a set of four experiments of sharing the workload of computing the Mandelbrot set between a cloud service and a phone, data transfer consumes on average over 80% of processing time. In those experiments computing locally with OpenCL on an Android phone yields faster processing time. On the other hand, local computation capacity becomes a bottleneck when the amount of computation is high. With the escape boundary at 65536, requesting computation from a cloud service yields up to 7.55X speedup.
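
For reference, the escape time algorithm can be sketched as follows. This is a plain Python model rather than the OpenCL or Java code used in the paper, and it assumes that the "escape boundary" in the abstract refers to the maximum iteration count per point; the function names and the default coordinate ranges are hypothetical.

```python
def escape_time(c, max_iter):
    """Iterations of z -> z*z + c before |z| exceeds 2, capped at max_iter."""
    z = 0j
    for n in range(max_iter):
        if abs(z) > 2.0:
            return n
        z = z * z + c
    return max_iter


def mandelbrot_grid(width, height, max_iter,
                    re_range=(-2.0, 1.0), im_range=(-1.0, 1.0)):
    """Escape times for a width x height grid of points in the complex plane."""
    grid = []
    for row in range(height):
        im = im_range[0] + (im_range[1] - im_range[0]) * row / max(height - 1, 1)
        line = []
        for col in range(width):
            re = re_range[0] + (re_range[1] - re_range[0]) * col / max(width - 1, 1)
            line.append(escape_time(complex(re, im), max_iter))
        grid.append(line)
    return grid
```

Every grid point is computed independently, which is why the problem maps naturally onto one OpenCL work-item per pixel. Points inside the set always run the full `max_iter` iterations, so raising the cap from 256 to 65536 increases the computation while leaving the amount of output data unchanged.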
IEEE 802.11ac MIMO Transceiver Baseband Processing on a VLIW Processor

Wireless standards are evolving rapidly due to the exponential growth in the number of portable devices along with the applications with high data rate requirements. Adaptable software based signal processing implementations for these devices can make the deployment of the constantly evolving standards faster and less expensive. The flagship technology from the IEEE WLAN family, the IEEE 802.11ac, aims at achieving very high throughputs in local area connectivity scenarios. This article presents a software based implementation for the Multiple Input and Multiple Output (MIMO) transmitter and receiver baseband processing conforming to the IEEE 802.11ac standard which can achieve transmission bit rates beyond 1Gbps. This work focuses on the Physical layer frequency domain processing. Various configurations, including 2×2 and 4×4 MIMO are considered for the implementation. To utilize the available data and instruction level parallelism, a DSP core with vector extensions is selected as the implementation platform. Then, the feasibility of the presented software-based solution is assessed by studying the number of clock cycles and power consumption of the different scenarios implemented on this core. Such Software Defined Radio based approaches can potentially offer more flexibility, high energy efficiency, reduced design efforts and thus shorter time-to-market cycles in comparison with the conventional fixed-function hardware methods.

General information
State: Published
Ministry of Education publication type: A1 Journal article-refereed
Organisations: Department of Electronics and Communications Engineering, Research group: Wireless Communications and Positioning, Department of Pervasive Computing, Research area: Computer engineering
Authors: Aghababaeetafreshi, M., Lehtonen, L. K., Levanen, T., Valkama, M., Takala, J.
Publication date: 2016
Peer-reviewed: Yes

Publication information
Journal: Journal of Signal Processing Systems
ISSN (Print): 1939-8018
Ratings:
Scopus rating (2016): CiteScore 0.78 SJR 0.226 SNIP 0.625
Scopus rating (2015): SJR 0.228 SNIP 0.639 CiteScore 0.7
Scopus rating (2014): SJR 0.292 SNIP 1 CiteScore 0.99
Scopus rating (2013): SJR 0.27 SNIP 0.858 CiteScore 0.97
Impact of Operand sharing to the Processor Energy Efficiency

Transport triggered architecture processors may have function unit input registers, which allow operands to be written to function units earlier than the clock cycle in which the operation begins execution. An operand used in consecutive operations may be written only once, saving register file accesses and internal bus traffic. This optimization is called operand sharing. In this paper, the effectiveness of operand sharing is analyzed from the perspective of improving the energy efficiency of the processor. On average 12.0% and in the best case 32.4% of register file reads could be eliminated, resulting in best-case power savings of 5.3% and energy savings of 8.8%. In one of the 14 measured cases, operand sharing allowed a register file read port to be removed without a performance penalty.
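
To make the idea of operand sharing concrete, the following sketch counts register file reads for a short operation sequence with and without sharing. It is a simplified model and not the paper's analysis: it assumes each function unit has one operand input register that keeps its value and one trigger input that must be written for every operation, and that no register is overwritten in between. The function name and the tuple layout are hypothetical.

```python
def count_rf_reads(ops):
    """Count register file reads without and with operand sharing.

    ops: list of (function_unit, operand_reg, trigger_reg) tuples in issue order.
    Assumption: no register is overwritten between the listed operations.
    """
    held = {}  # function unit -> register whose value sits in its operand input register
    naive = shared = 0
    for fu, operand_reg, trigger_reg in ops:
        naive += 2        # without sharing, both inputs are read every time
        shared += 1       # the trigger move always happens
        if held.get(fu) != operand_reg:
            shared += 1   # operand input register must be (re)loaded
            held[fu] = operand_reg
    return naive, shared


ops = [("alu", "r1", "r2"), ("alu", "r1", "r3"),
       ("alu", "r1", "r4"), ("mul", "r5", "r6")]
naive_reads, shared_reads = count_rf_reads(ops)
```

In this toy sequence, the three consecutive ALU operations reuse `r1`, so two of the eight reads disappear.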

OpenCL Programmable Exposed Datapath High Performance Low-Power Image Signal Processor

Sophisticated computational imaging algorithms require both high performance and good energy efficiency when executed on mobile devices. A recent trend has been to exploit the abundant data-level parallelism found in general-purpose programmable GPUs. However, for low-power mobile use cases, generic GPUs consume excessive amounts of power. This paper proposes a programmable computational imaging processor with 16-bit half-precision SIMD floating point vector processing capabilities combined with the power efficiency of an exposed datapath. In comparison to traditional VLIW architectures with similar computational resources, the exposed datapath reduces the register file traffic and complexity. These and the specific optimizations enabled by the explicit programming model enable extremely good power-performance. When synthesized on a 28 nm ASIC technology, the accelerator consumes 71 mW of power while running a state-of-the-art denoising algorithm, and occupies only 0.2 mm² of chip area. For the algorithm, energy usage per frame is 7 mJ, which is 10x less than the best found GPU-based implementation.

**Optimization of parallel processing intensive digital front-end for IEEE 802.11ac receiver**

Modern computing platforms offer increasing levels of parallelism for the fast execution of different signal processing tasks. In this paper a digital front-end concept is developed, where the parallel processing is utilized for dividing the inherent structure of IEEE 802.11ac waveform to two or more parallel signals and by processing the resulting signals further e.g. using legacy IEEE 802.11n digital receiver chains. Two multirate channelization architectures are developed with the corresponding filter coefficient optimization. The full radio link performance simulations with commonly adopted indoor WiFi channel profiles are provided, verifying the overall link performance with the proposed channelization architectures.

**Remotely Powered Piezoresistive Pressure Sensor: Toward Wireless Monitoring of Intracranial Pressure**

This paper presents the results of pressure measurements taken after the successful activation of an implantable piezoresistive pressure sensor. The sensor was activated using inductive power transmission for an Intracranial Pressure (ICP) monitoring application. This generated sufficient power (4.47 mW) and voltage (1.894 V) at the sensor input to monitor the pressure changes. Although the changes in voltage were monitored through wires, the required electronics for wireless voltage transfer and measurement in a biological environment are planned in the future. The simulated and measured results of the wireless link, along with the measured changes in pressure are presented. The results are the first step towards a wirelessly powered implant for ICP monitoring.
In this paper, we describe the Heterogeneous System Architecture Foundation's application to digital signal processors (DSP) and hardware accelerators. We provide an overview of the HSA runtime, system architecture and programmer’s model, identify characteristics of DSPs and compare differences in algorithms to GPUs. We show an example mapping of HSA agents to a modern DSP using the HSA intermediate language.
Model-based design and implementation of an adaptive digital predistortion filter

Dataflow models of computation are widely used for modeling signal processing systems. These models have inherent concurrency and the task (actor) execution depends only on the availability of the input data (tokens). This property of dataflow models can be exploited for dynamic power management by automatically switching off the actors with no available input tokens. This idea is applied in this paper for efficient modeling and implementation of an adaptive Digital Predistortion (DPD) filter. The DPD filter is required to operate with different profiles under varying operation scenarios, hence requiring a methodology to manage power dynamically. The paper presents a dataflow model for Adaptive Digital Predistortion based on the Core Functional Dataflow (CFDF) model of computation using the Light Weight Dataflow (LWDF) programming methodology. The paper also provides a methodology for dynamic power management under the dataflow paradigm. To the authors’ best knowledge, this work is the first to integrate dataflow-based power management systematically in the context of adaptive DPD implementation.
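
The firing rule described in this abstract, and the power management idea built on it, can be illustrated with a small sketch. The class below is hypothetical and does not reproduce the CFDF or LWDF APIs. It only shows an actor that is enabled when every input FIFO holds enough tokens, and a helper that marks actors without available input as candidates for being switched off.

```python
from collections import deque


class Actor:
    """A dataflow actor that may fire only when every input holds enough tokens."""

    def __init__(self, name, inputs, tokens_needed=1):
        self.name = name
        self.inputs = inputs              # list of deque objects (input FIFOs)
        self.tokens_needed = tokens_needed

    def enabled(self):
        return all(len(q) >= self.tokens_needed for q in self.inputs)

    def invoke(self):
        # Consume the tokens and return them; a real actor would compute outputs.
        return [[q.popleft() for _ in range(self.tokens_needed)]
                for q in self.inputs]


def power_states(actors):
    """Map each actor to 'on' if it can fire, else 'off' (a candidate for gating)."""
    return {a.name: ("on" if a.enabled() else "off") for a in actors}
```

Because the decision depends only on token availability, it can be taken by a scheduler without any knowledge of what the actor computes.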

General information
State: Published
Ministry of Education publication type: A4 Article in a conference publication
Organisations: Department of Pervasive Computing, Research area: Computer engineering, Department of Electronics and Communications Engineering, Research group: Wireless Communications and Positioning, Signal Processing Research Community (SPRC), Wireless Communications and Positioning (WICO), Oulun Yliopisto/CWC, University of Maryland, USA
Authors: Ghazi, A., Boutellier, J., Silven, O., Shahabuddin, S., Juntti, M., Bhattacharyya, S., Anttila, L.
Number of pages: 6
Publication date: Oct 2015

Host publication information
Title of host publication: IEEE Workshop on Signal Processing Systems (SiPS)
Publisher: Institute of Electrical and Electronics Engineers, IEEE
ISBN (Print): 978-1-4673-9604-2
DOI:
10.1109/SiPS.2015.7345010

Research output: Scientific - peer-review › Conference contribution

E-learning of ethics, awareness, hacking and research by information security majors

Some earlier courses were reorganized in 2013 to construct a syllabus for the information security major at Tampere University of Technology, a 30 ECTS credit unit package in the 300-cu master's degree. As their other subjects, the students may have, for instance, communications, software engineering, or information management. This paper describes how the compulsory courses introduce four important but not very technical engineering skills using mainly an e-learning approach. The reason for such an approach is to save resources in the very beginning – because of the large number of students heading for other majors – and after that to offer flexibility in scheduling to serve the elective courses, as well as the studies of other disciplines – those that provide a need for security. The four topic areas are ethics of individuals and organizations, personal awareness of security issues, hacking, i.e. an offensive way of thinking, and research. The described introductory stage of exposing the students' minds to these matters does not forget innovativeness, but that remains more in the background before the students start working with cases and hands-on experiments later. The description covers four separate courses, forming a prerequisite chain. The first and last one are lecture-based and it takes at least two years to pass them; 3–4 years is more normal. The academic units are not essential here. Instead, one of the main points is the repeated exposure to the various ways of thinking. In the following summary of the succession the numbers 1–4 refer to the courses, but they can be just thought of as time-separated occasions: Ethics: 1. Laws 2. Laws 3. Ethical questions in one's own environment – technology-related ethical questions for individuals – ethical questions for organizations. 4. Interview a security professional, ethical point of view included. Awareness: 1 & 2. Policies, guidelines and web-sites of security information. 3. Daily observations (own or from news) and actions regarding information security, campaigns etc. Hacking: 1. By-pass authentication by changing the source code of a web page. 2. -- 3. Carry out and report an exercise found at one of the listed sites. 4. Laboratory exercises in hacking. Research: 1. Fill in a questionnaire resembling the one from the 3rd stage. 2. -- 3. A questionnaire to five acquaintances, completed by interviewing them; deal with the results. 4. Read research papers, interview a security professional trying to generalize together with peers. The paper explains the rationale of these exposures and how they are delivered. It must be noted that not everything is compulsory for passing the courses. The paper reports observations concerning the student choices and feedback. Course #3 appears in its earlier form in [1]. The current version was updated to be two times larger and more professionally oriented. Reference: [1] Jukka A. Koskinen, Tomi O. Kelo: Pure e-learning course in information security. Proc. 2nd Int. Conf. on Security of Information and Networks, 2009. 8–13.

MEMS IMU Carouseling for Ground Vehicles

Microelectromechanical system (MEMS) gyroscopes have advantageous properties for orientation sensing and navigation as they are small, low cost, and consume little power. However, the significant noise at low frequencies results in large orientation errors as a function of time. Controlled physical rotation of the gyroscope can be used to remove the constant part of the gyro errors and reduce low-frequency noise. As adding motors for this would increase the system cost, it would be advantageous to attach gyroes to a rotating platform that is already built in the vehicle. In this paper, we present theory and results for novel navigation systems where an inertial measurement unit (IMU) is attached to the wheel of a ground vehicle. The results show that a low-cost MEMS IMU can provide a very accurate navigation solution using this placement option.
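
The claim that controlled rotation removes the constant part of the sensor error can be checked with a small numerical sketch. It is not the paper's navigation filter: it only assumes a constant bias in the two sensor axes perpendicular to the rotation axis, rotates that bias into the vehicle frame at evenly spaced angles over one revolution, and averages the result. The function name and the step count are hypothetical.

```python
import math


def mean_bias_in_vehicle_frame(bias_xy, n_steps=360):
    """Average a constant sensor-frame bias over one full revolution of the sensor."""
    bx, by = bias_xy
    sx = sy = 0.0
    for k in range(n_steps):
        theta = 2.0 * math.pi * k / n_steps
        # Rotate the sensor-frame bias into the vehicle frame at angle theta.
        sx += math.cos(theta) * bx - math.sin(theta) * by
        sy += math.sin(theta) * bx + math.cos(theta) * by
    return sx / n_steps, sy / n_steps
```

The average is zero up to rounding for any constant bias in the rotating plane. A bias along the rotation axis itself is not affected by the rotation and is therefore not cancelled in this model.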

General information

State: Published
Ministry of Education publication type: A1 Journal article-refereed
Organisations: Department of Pervasive Computing, Research area: Computer engineering, Signal Processing Research Community (SPRC)
Authors: Collin, J.
Pages: 2242-2251
Publication date: 16 Jun 2015
Peer-reviewed: Yes

Publication information
Journal: IEEE Transactions on Vehicular Technology
Volume: 64
Issue number: 6
ISSN (Print): 0018-9545
Ratings:
Scopus rating (2016): CiteScore 4.33 SJR 0.855 SNIP 1.91
Scopus rating (2015): SJR 0.97 SNIP 2.325 CiteScore 4
Scopus rating (2014): SJR 1.063 SNIP 2.567 CiteScore 4.02
Scopus rating (2013): SJR 1.599 SNIP 3.031 CiteScore 4.23
Scopus rating (2012): SJR 1.542 SNIP 2.852 CiteScore 3.83
Scopus rating (2011): SJR 1.168 SNIP 2.393 CiteScore 3.16
Scopus rating (2010): SJR 0.995 SNIP 1.927
Scopus rating (2009): SJR 1.074 SNIP 2.009
Scopus rating (2008): SJR 1.193 SNIP 2.294
Scopus rating (2007): SJR 1.463 SNIP 2.8
Scopus rating (2006): SJR 1.055 SNIP 2.332
Scopus rating (2005): SJR 1.004 SNIP 1.885
Scopus rating (2004): SJR 1.238 SNIP 2.185
Scopus rating (2003): SJR 1.283 SNIP 2.267
Scopus rating (2002): SJR 2.363 SNIP 2.337
Scopus rating (2001): SJR 1.677 SNIP 2.108
Scopus rating (2000): SJR 1.739 SNIP 1.631
Scopus rating (1999): SJR 1.059 SNIP 2.012
Original language: English
Constant-rate clock recovery and jitter measurement on deep memory waveforms using dataflow

General information
State: Published
Ministry of Education publication type: A4 Article in a conference publication
Organisations: Department of Pervasive Computing, Research area: Computer engineering, Signal Processing Research Community (SPRC), University of Maryland
Authors: Liu, Y., Barford, L., Bhattacharyya, S.
Number of pages: 6
Pages: 1590-1595
Publication date: 1 May 2015

Host publication information
Title of host publication: 2015 IEEE International Instrumentation and Measurement Technology Conference (I2MTC)
Publisher: IEEE
ISBN (Print): 978-1-4799-6113-9
Keywords: data flow analysis, error statistics, measurement errors, measurement standards, synchronisation, time measurement, timing jitter, BER, bit error rate, constant-rate clock recovery measurement, dataflow method, deep memory waveform, digital communication circuitry, field programmable gate array, jitter standard deviation, measurement error, multicore platform, timing jitter measurement, Approximation algorithms, Approximation methods, Clocks, Jitter, Schedules, Signal processing algorithms, Threshold voltage

Towards automation security research and training environment
An automation system is a networked software product in a hardware-intensive environment and requires more than normal IT security skills. Building a security research and training environment for automation requires knowledge of the internal workings of an automation system as well as a creative approach to keeping the system secure where needed, and broken when required for development and teaching purposes. The main challenges are to combine the amount of automation-specific hardware and to create good practices which keep the need for maintenance, versatility and pedagogical aspects in balance. This paper presents a project called TUTCyberLabs, the lessons learned and the design decisions. The main focus is on the Department of Automation Science and Engineering environment ASECyberLab.

General information
State: Published
Ministry of Education publication type: A4 Article in a conference publication
Authors: Seppälä, J., Salmenperä, M., Koivisto, H., Harju, J., Repo, S., Holmström, J., Ahonen, P.
Publication date: 18 Mar 2015

Host publication information
Title of host publication: Proceedings of Automaatio XXI, The Industrial Revolution of Internet – From Intelligent Devices to Networked Intelligence
Place of publication: Helsinki, Finland
Publisher: Suomen Automaatioseura ry

Publication series
Name: SAS Julkaisusarja
Publisher: Finnish Society of Automation
Volume: 44
Midair User Interfaces Employing Particle Screens
Recent developments with low-cost optical sensors and ultralight, mobile fogscreens enable new applications such as midair volumetric screens and augmented reality. The major advantages of the fogscreens are their walk-through and immaterial nature, translucency, good image quality, possibility for direct interaction, and the intriguing appearance of the screen. Fogscreens can create a collaborative visualization space or even personal views for each viewer. Dual-sided face-to-face interaction mimics most real-world competitive games and sports and enables body language and free switching of sides. This article reports the authors’ experiments with these emerging technologies and their suitability for novel midair user interfaces.
keeps multiple GPUs occupied, and uses all available compute resources. We present an implementation of hybrid microscopy image stitching using HTGS that reduces code size by ≈ 25% and shows favorable performance compared to a similar hybrid workflow implementation without HTGS. The HTGS-based implementation reuses the computational functions of the hybrid workflow implementation.

**General information**
State: Published
Ministry of Education publication type: A4 Article in a conference publication
Organisations: Department of Pervasive Computing, Research area: Computer engineering
Authors: Blattner, T., Keyrouz, W., Halem, M., Brady, M., Bhattacharyya, S.
Number of pages: 4
Pages: 634-637
Publication date: 2015

**Host publication information**
Title of host publication: IEEE Global Conference on Signal and Information Processing (IEEE GlobalSIP 2015)
Publisher: Institute of Electrical and Electronics Engineers, IEEE
ISBN (Electronic): 978-1-4799-7590-7
DOIs: 10.1109/GlobalSIP.2015.7418273
Research output: Scientific - peer-review › Conference contribution

**An efficient GPU implementation of a multirate resampler for multi-carrier systems**
In modern communication systems, a sample rate conversion is necessary since often the system clock is fixed at some specific rate. Such resampling is critical because there exists a tight coupling between the data rates and sampling rates. It is desirable to have a flexible, high performance, and resource efficient resampler that can accommodate various required data rates. To achieve these objectives, we present a novel multirate resampling method based on graphics processing units (GPUs). We also extend this method to multiple dimensions, which enables efficient processing of multiple channels across different frequency bands. The overall result is a flexible, high-throughput, and low-latency implementation that is capable of processing many channels simultaneously.

**General information**
State: Published
Ministry of Education publication type: A4 Article in a conference publication
Organisations: Department of Pervasive Computing, Research area: Computer engineering
Authors: Kim, S. C., Bhattacharyya, S. S.
Number of pages: 5
Pages: 751-755
Publication date: 2015

**Host publication information**
Title of host publication: IEEE Global Conference on Signal and Information Processing (IEEE GlobalSIP 2015)
Publisher: Institute of Electrical and Electronics Engineers, IEEE
ISBN (Electronic): 978-1-4799-7590-7
DOIs: 10.1109/GlobalSIP.2015.7418297
Research output: Scientific - peer-review › Conference contribution

**Design and Implementation of a Power-aware FFT Core for OFDM-based DSA-enabled Cognitive Radios**
This research work presents the design and the physical implementation of a power-aware FFT core for OFDM-based, dynamic spectrum access (DSA) enabled cognitive radios. The FFT core is equipped with a pruning engine that allows the run-time removal of dummy operations (e.g. multiplications by a zero term) related to the pruning of sub-carriers of the communication system. The pruning algorithm introduced by this research work utilizes a reduced-size configuration matrix, which limits the memory overhead. Finally, the physical implementation of the FFT on a 45 nm technology node showed that, for an 8% area overhead, the total power saving settles around 10% in the presence of a medium to high pruning level, justifying the silicon area overhead introduced by the pruning unit.
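
The notion of removing "dummy operations" can be illustrated in software with a radix-2 FFT that skips twiddle multiplications whose input is zero. This is a sketch under stated assumptions and not the paper's hardware pruning engine or its configuration matrix: the function name, the `stats` dictionary and the run-time zero test are all hypothetical, and the input length is assumed to be a power of two.

```python
import cmath


def pruned_fft(x, stats):
    """Radix-2 decimation-in-time FFT that skips multiplications by zero terms.

    stats: dict with integer counters "done" and "skipped" (twiddle multiplications).
    """
    n = len(x)
    if n == 1:
        return list(x)
    if all(v == 0 for v in x):
        # The whole sub-transform is zero: skip all of its butterflies.
        stats["skipped"] += (n // 2) * (n.bit_length() - 1)
        return [0j] * n
    even = pruned_fft(x[0::2], stats)
    odd = pruned_fft(x[1::2], stats)
    out = [0j] * n
    for k in range(n // 2):
        if odd[k] == 0:
            stats["skipped"] += 1          # multiplication by a zero term
            t = 0j
        else:
            stats["done"] += 1
            t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out
```

The two counters always add up to (N/2)·log2(N), the number of twiddle multiplications of an unpruned radix-2 FFT, so the ratio between them gives a rough measure of the pruning level for a given sub-carrier pattern.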

**General information**
State: Published
Ministry of Education publication type: A1 Journal article-refereed
Organisations: Department of Electronics and Communications Engineering, Research group: System-on-Chip for GNSS, Wireless Communications and Cyber-Physical Embedded Computing, Wireless Communications and Positioning (WICO), Simon Fraser Univ, Simon Fraser University, Sch Engn Sci
Authors: Airoldi, R., Campi, F., Cucchi, M., Revanna, D., Anjum, O., Nurmi, J.
Number of pages: 9
Parallelization of Kvazaar HEVC Intra Encoder for Multi-core Processors

This paper introduces key parallelization strategies of our Kvazaar HEVC intra encoder for multicore processors. The schemes implemented in Kvazaar are 1) tiles; 2) Wavefront Parallel Processing (WPP); and 3) picture-level parallel processing. Kvazaar is the only practical open-source HEVC encoder that supports all these schemes. In addition, its rate-distortion-complexity characteristics are superior to other public implementations in all-intra (AI) coding. Our experiments with high-quality encoder presets show that a C implementation of Kvazaar is 19% faster than the corresponding implementation of x265 for the same coding efficiency with 8 threads and 38% faster with 16 threads. With the high-speed presets, Kvazaar improves coding efficiency by 4.5% while being twice as fast as x265. The high-speed preset of Kvazaar obtains almost the same coding efficiency as the high-quality preset of f265 while being 24 times faster when 16 threads are used.
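
As a rough illustration of Wavefront Parallel Processing, the sketch below groups coding tree units (CTUs) by the earliest step at which they could start. It is not Kvazaar's scheduler. It assumes every CTU takes one time step and uses the usual HEVC WPP dependency, where each CTU row trails the row above by two CTUs; the function name and the `lag` parameter are hypothetical.

```python
def wpp_schedule(rows, cols, lag=2):
    """Group CTU coordinates (row, col) by the earliest step at which they can start."""
    steps = {}
    for r in range(rows):
        for c in range(cols):
            # A CTU waits for its left neighbour and for the row above to be `lag` CTUs ahead.
            steps.setdefault(c + lag * r, []).append((r, c))
    return [steps[t] for t in sorted(steps)]
```

For a picture of `rows` by `cols` CTUs the schedule has `cols + lag * (rows - 1)` steps, and the number of CTUs in a step is the available parallelism at that point. The ramp-up and ramp-down at the start and end of a picture are one reason WPP is commonly combined with picture-level parallel processing.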
Pedestrian Localization in Moving Platforms Using Dead Reckoning, Particle Filtering and Map Matching

Localization in global navigation satellite system denied environments using inertial sensors alone, radio sensors alone, or a combination of both is a currently active research topic. Current research is primarily focused on static environments with earth-fixed coordinate frames and nonmoving maps. In this research work, we use microelectromechanical inertial sensors, band-pass filtering, particle filtering, maps and map matching techniques for pedestrian localization with respect to on-ground moving platforms such as a train or a bus. Since these platforms are moving, their maps are moving maps with respect to earth-centered, earth-fixed coordinate frames. The techniques of this research work could further be extended and adapted to other moving platforms such as airplanes, boats and submarines.

Performance Evaluation of Kvazaar HEVC Intra Encoder on Xeon Phi Many-core Processor

This paper analyzes parallel scalability and coding speed of our open-source Kvazaar HEVC intra encoder on Intel Xeon Phi 61-core coprocessor that supports up to four hardware threads per core. The evaluated parallelization schemes of Kvazaar are 1) Wavefront Parallel Processing (WPP); and 2) tiles, both accelerated with picture-level parallel processing. With WPP, the C implementation of Kvazaar high-quality preset achieves an average speedup of 1.3 and a bit rate gain of 0.7% over the respective implementation of x265. Using tiles makes Kvazaar 1.4 times faster than x265 but at a cost of 0.3% bit rate loss. When high-speed presets are used, the speedup of Kvazaar increases to 1.4 with WPP and to 1.9 with tiles. Moreover, the respective coding efficiency of Kvazaar rises to 11.2% and 10.3%. Kvazaar also scales almost linearly to the number of cores in the processor. Even if the peak coding speed of Kvazaar on Xeon Phi is lower than that on the Intel 8-core i7 processor, our parallel scalability results promise excellent speed for Kvazaar on massively parallel processors equipped with more powerful cores.
Power Optimizations for Transport Triggered SIMD Processors
Power consumption in modern processor design is a key aspect. Optimizing the processor for power leads to direct savings in battery energy consumption in case of mobile devices. At the same time, many mobile applications demand high computational performance. In case of large scale computing, low power compute devices help in thermal design and in reducing the electricity bill. This paper presents a case study of a customized low power vector processor design that was synthesized on a 28 nm process technology. The processor has a programmer exposed datapath based on the transport triggered architecture programming model. The paper’s focus is on the RTL and microarchitecture level power optimizations applied to the design. Using register file datapath gating, register file banking and enabling clock gating of individual pipeline stages in pipelined function units, up to one fourth of power and energy savings could be achieved with only a small area overhead. On top of this, for the measured radio applications, the exposed datapath architecture helped to achieve major power improvements in comparison to the traditional VLIW programming model by utilizing optimizations unique to transport triggered architectures.

Programmable Data Parallel Accelerator for Mobile Computer Vision
The demand for high performance yet extremely low-power multimedia accelerators for mobile communication is ever growing. To meet this challenge a novel approach with a very low-power programmable TTA processor is proposed in this paper. The processor is benchmarked with two OpenCL computer vision applications; depth estimation and face detection. The former is an excellent example of a highly parallel algorithm that suits our TTA processor extremely well whereas the latter is an example of a more serial algorithm that poses a challenge for GPU-style parallel platforms. Both algorithms are also implemented and optimized for a high throughput AMD Radeon HD 7750 GPU, Qualcomm Adreno 330 mobile GPU and Intel Core i5-480M for a fair comparison of performance and energy efficiency. These platforms are chosen because they all can be programmed with OpenCL with equivalent programming efforts. In this paper we show that our novel approach can achieve real-time requirements and easily outperform both GPUs as well as the CPU in terms of throughput per watt criterion, making it an excellent candidate for power-constrained mobile platforms.
Resolving parameter reference management in IP-XACT using Kactus2

Modern VLSI and FPGA chip designs utilize automated generation of the structure and component configuration for different product variations. This is based on re-usable, parametrized library components, and tools for definition, assembly, configuration and generation of final HW and SW code. A product version includes several structural hierarchies, in which each component is independently reusable and must be configured for the specific product context. IEEE 1685 "IP-XACT" standardizes the component and design descriptions and the overall process. Still, the challenges are a very large parameter space, name-based referencing and propagation of parameter values. Practical user problems are careless parameter renaming, duplicate names, and removing a parameter definition without first removing all references to it. In this paper, we present solutions to these problems implemented in Kactus2 v2.7, an open-source IP-XACT tool. Our basis is automatic identifier generation and referencing. This required major changes to the Kactus2 import wizards, generators, expression editors and evaluators. The implementation was carried out in C++/Qt5 and we modified and added 5k LOC compared to Kactus2 v2.6. According to an analysis of several use cases, the new solution practically eliminates user errors in parameter referencing, which significantly improves productivity.

General information
State: Published
Ministry of Education publication type: A4 Article in a conference publication
Organisations: Department of Pervasive Computing, Research area: Computer engineering
Authors: Pekkarinen, E., Teuho, M., Salminen, E., Hämäläinen, T. D.
Pages: 2765-2770
Publication date: 2015

Host publication information
Title of host publication: 41st Annual Conference of the IEEE Industrial Electronics Society IECON 2015
Publisher: IEEE
ISBN (Print): 978-1-4799-1762-4
Keywords: reuse, generation, parameter, configuration, IP-XACT, SystemVerilog, Kactus2
DOIs:
10.1109/GlobalSIP.2015.7418271
Research output: Scientific - peer-review › Conference contribution

WarmPie: A bare-bones implementation of message passing interface for embedded many-cores

In this paper we present a message-passing based interface, WarmPie, to simplify data communication and management on a Multi-Processor System-on-Chip (MPSoC). WarmPie defines a subset of Message Passing Interface (MPI) library routines. We provide C language implementation of those routines on a 9-core MPSoC. WarmPie offers an abstract view of the MPSoC to facilitate effortless integration of software to hardware. In one use case study of developing a ring communication program on the MPSoC, software development effort is reduced by a factor of 3.75 due to using WarmPie. The application using WarmPie is fully compatible with a reference MPI environment on Linux. WarmPie has a small memory footprint of 7.3KB per core. Although data transmission latency has increased due to using the interface, the overhead is amortized when transferring a bigger payload in one message.

General information
State: Published
Ministry of Education publication type: A4 Article in a conference publication
Organisations: Department of Electronics and Communications Engineering, Research group: System-on-Chip for GNSS, Wireless Communications and Cyber-Physical Embedded Computing, Department of Pervasive Computing, Research area: Computer engineering, Wireless Communications and Positioning (WICO)
Authors: Wang, K., Salminen, E., Nurmi, J., Ahonen, T.
Number of pages: 4
Pages: 33-36
Publication date: 2015

Host publication information
Title of host publication: 2015 11th Conference on Ph.D. Research in Microelectronics and Electronics (PRIME)
Publisher: IEEE
DOIs:
10.1109/PRIME.2015.7251087
Research output: Scientific - peer-review › Conference contribution
2013 International Symposium on System-on-Chip Proceedings

General information
State: Published
Ministry of Education publication type: C2 Edited books
Organisations: Department of Electronics and Communications Engineering
Publication date: 2014

Publication information
Publisher: Institute of Electrical and Electronics Engineers IEEE
ISBN (Print): 978-1-4799-1189-9
Original language: English

Research output: Scientific - peer-review › Anthology

Accommodating the fast-paced evolution of VLSI in engineering curricula

General information
State: Published
Ministry of Education publication type: A4 Article in a conference publication
Organisations: Department of Electronics and Communications Engineering, Wireless Communications and Positioning (WICO)
Authors: Campi, F., Airoldi, R., Nurmi, J.
Number of pages: 5
Pages: 208-212
Publication date: 2014

Host publication information
Title of host publication: 10th European Workshop on Microelectronics Education (EWME), 14-16 May 2014, Tallinn
Publisher: Institute of Electrical and Electronics Engineers IEEE
ISBN (Print): 978-1-4799-4016-5
DOIs:
10.1109/EWME.2014.6877427

Bibliographical note
Contribution: organisation=elt,FACT1=1
Portfolio EDEND: 2014-09-15
Publisher name: Institute of Electrical and Electronics Engineers IEEE
Source: researchoutputwizard
Source-ID: 202
Research output: Scientific - peer-review › Conference contribution

Advanced Acquisition and Tracking Algorithms

General information
State: Published
Ministry of Education publication type: A3 Part of a book or another research book
Organisations: Department of Electronics and Communications Engineering
Authors: Lohan, E. S.
Number of pages: 36
Pages: 85-120
Publication date: 2014

Host publication information

Approximate computing for complexity reduction in timing synchronization

General information
State: Published
Ministry of Education publication type: A1 Journal article-refereed
Organisations: Department of Electronics and Communications Engineering, Wireless Communications and Positioning (WICO)
Authors: Airoldi, R., Campi, F., Nurmi, J.
Number of pages: 7
Pages: 1-7
Publication date: 2014
Peer-reviewed: Yes

Publication information
Journal: Eurasip Journal on Advances in Signal Processing
Volume: 2014
Issue number: 1
Article number: 155
ISSN (Print): 1687-6172
Ratings:
Scopus rating (2016): SJR 0.313 SNIP 0.78 CiteScore 1.21
Scopus rating (2015): SJR 0.279 SNIP 0.592 CiteScore 0.83
Scopus rating (2014): SJR 0.229 SNIP 0.54 CiteScore 0.7
Scopus rating (2013): SJR 0.267 SNIP 0.506 CiteScore 0.63
Scopus rating (2012): SJR 0.278 SNIP 0.582 CiteScore 0.72
Scopus rating (2011): SJR 0.371 SNIP 0.724 CiteScore 0.91
Scopus rating (2010): SJR 0.403 SNIP 0.982
Scopus rating (2009): SJR 0.474 SNIP 0.823
Scopus rating (2008): SJR 0.468 SNIP 0.897
Scopus rating (2007): SJR 0.386 SNIP 0.913
Scopus rating (2006): SJR 0.362 SNIP 0.92
Scopus rating (2005): SJR 0.519 SNIP 0.968
Scopus rating (2004): SJR 0.603 SNIP 1.155
Scopus rating (2003): SJR 0.63 SNIP 1.023
Scopus rating (2002): SJR 0.14 SNIP 0.329
Scopus rating (2001): SJR 0.118 SNIP 0.372
Scopus rating (2000): SJR 0.115 SNIP 0.236
Scopus rating (1999): SJR 0.194 SNIP 0.381
Original language: English
DOIs:

Area estimation of time-domain GNSS receiver architectures

General information
State: Published
Ministry of Education publication type: A4 Article in a conference publication
Organisations: Department of Electronics and Communications Engineering, Wireless Communications and Positioning (WICO)
Authors: Eerola, V., Nurmi, J.
Number of pages: 6
Pages: 1-6
Publication date: 2014

Host publication information
Title of host publication: 2014 International Conference on Localization and GNSS (ICL-GNSS), 24-26 June 2014, Helsinki
Publisher: Institute of Electrical and Electronics Engineers IEEE
ISBN (Print): 978-1-4799-5123-9
DOIs: 10.1109/ICL-GNSS.2014.6934163

Bibliographical note
Contribution: organisation=elt,FACT1=1
Portfolio EDEND: 2014-11-21
Publisher name: Institute of Electrical and Electronics Engineers IEEE
Source: researchoutputwizard
Source-ID: 276
Research output: Scientific - peer-review › Conference contribution

Baseband Hardware Implementations for Galileo Receiver

General information
State: Published
Ministry of Education publication type: A3 Part of a book or another research book
Organisations: Department of Electronics and Communications Engineering
Authors: Hurskainen, H., Nurmi, J.
Number of pages: 17
Pages: 121-137
Publication date: 2014

Host publication information
Title of host publication: GALILEO Positioning Technology
Publisher: Springer Science+Business Media
Editors: Nurmi, J., Lohan, E., Sand, S., Hurskainen, H.
ISBN (Print): 978-94-007-1829-6

Publication series
Name: Signals and Communication Technology
Volume: 182
ISSN (Print): 1860-4862
DOIs: 10.1007/978-94-007-1830-2_6

Bibliographical note
Contribution: organisation=elt,FACT1=1
Portfolio EDEND: 2014-11-21
Publisher name: Springer Science+Business Media
Source: researchoutputwizard
Source-ID: 509
Research output: Scientific - peer-review › Chapter

Enabling Real-Time Resource Oriented Architectures with REST Observers

General information
State: Published
Ministry of Education publication type: A3 Part of a book or another research book
Organisations: Department of Pervasive Computing, Research area: Software engineering, Managing digital industrial transformation (mDIT)
Authors: Stirbu, V., Aaltonen, T.
Number of pages: 17
Pages: 51-67
Publication date: 2014

Host publication information
Title of host publication: REST: Advanced Research Topics and Practical Applications
Place of publication: New York, NY
Publisher: Springer Science+Business Media
Editors: Pautasso, C., Wilde, E., Alarcon, R.
ISBN (Print): 978-1-4614-9298-6
DOIs:
10.1007/978-1-4614-9299-3_4
Source: Bibtex
Source-ID: urn:80a2646f491c33d9d980260a4da6a9c4
Research output: Scientific - peer-review › Chapter

Faster than real-time GNSS receiver testing

General information
State: Published
Ministry of Education publication type: A4 Article in a conference publication
Organisations: Department of Electronics and Communications Engineering, Wireless Communications and Positioning (WICO)
Authors: Paakki, T., Nurmi, J.
Number of pages: 4
Pages: 1-4
Publication date: 2014

Host publication information
Title of host publication: 2014 International Conference on Localization and GNSS (ICL-GNSS), 24-26 June 2014, Helsinki
Publisher: Institute of Electrical and Electronics Engineers IEEE
ISBN (Print): 978-1-4799-5123-9
DOIs:
10.1109/ICL-GNSS.2014.6934172

Bibliographical note
Contribution: organisation=elt,FACT1=1
Portfolio EDEND: 2014-08-30
Publisher name: Institute of Electrical and Electronics Engineers IEEE
Source: researchoutputwizard
Source-ID: 511
Research output: Scientific - peer-review › Conference contribution

Bibliographical note
Contribution: organisation=elt,FACT1=1
Portfolio EDEND: 2014-11-21
Publisher name: Institute of Electrical and Electronics Engineers IEEE
Source: researchoutputwizard
Source-ID: 1206
Research output: Scientific - peer-review › Conference contribution

Implementation of Multicore Communications API

General information
State: Published
Ministry of Education publication type: A4 Article in a conference publication
Organisations: Department of Pervasive Computing, Signal Processing Research Community (SPRC)
Authors: Virtanen, J., Matilainen, L., Salminen, E., Hämäläinen, T. D.
Number of pages: 6
Pages: 1-6
Publication date: 2014

Host publication information
Title of host publication: International Symposium on System-on-Chip, SoC 2014, October 28-29, 2014, Tampere, Finland
Place of publication: Piscataway, NJ
Publisher: IEEE
Editors: Nurmi, J., Ellervee, P., Milojevic, D., Daniel, O., Paakki, T.
ISBN (Print): 978-1-4799-6889-3

Publication series
Name: International Symposium on System-on-Chip
Links:

Bibliographical note
Contribution: organisation=elt,FACT1=1
Portfolio EDEND: 2014-08-28
Publisher name: Elsevier
Source: researchoutputwizard
Source-ID: 277
Research output: Scientific - peer-review › Article
Scopus rating (2008): SJR 0.302 SNIP 0.813
Scopus rating (2007): SJR 0.351 SNIP 0.92
Scopus rating (2006): SJR 0.291 SNIP 0.824
Scopus rating (2005): SJR 0.258 SNIP 0.707
Scopus rating (2004): SJR 0.263 SNIP 0.709
Scopus rating (2003): SJR 0.229 SNIP 0.676
Scopus rating (2002): SJR 0.239 SNIP 0.378
Scopus rating (2001): SJR 0.194 SNIP 0.35
Scopus rating (2000): SJR 0.176 SNIP 0.431
Scopus rating (1999): SJR 0.159 SNIP 0.363
Original language: English
DOIs:
10.1016/j.sysarc.2013.10.005

Bibliographical note
Contribution: organisation=elt,FACT1=1
Portfolio EDEND: 2014-03-15
Publisher name: Elsevier BV * North-Holland; European Association for Micro-Processing and Micro-Programming
Source: researchoutputwizard
Source-ID: 105
Research output: Scientific - peer-review › Article

MULTI-POS - Multi-technology Positioning Professionals Training Network

General information
State: Published
Ministry of Education publication type: A4 Article in a conference publication
Organisations: Department of Electronics and Communications Engineering, Wireless Communications and Positioning (WICO)
Authors: Nurmi, J., Della Rosa, F., Lohan, E.
Number of pages: 4
Pages: 1-4
Publication date: 2014

Host publication information
Title of host publication: 2014 International Conference on Localization and GNSS (ICL-GNSS), 24-26 June 2014, Helsinki, Finland
Publisher: Institute of Electrical and Electronics Engineers IEEE
ISBN (Print): 978-1-4799-5123-9
DOIs:
10.1109/ICL-GNSS.2014.6934174

Bibliographical note
Contribution: organisation=elt,FACT1=1
Portfolio EDEND: 2014-08-20
Publisher name: Institute of Electrical and Electronics Engineers IEEE
Source: researchoutputwizard
Source-ID: 1160
Research output: Scientific - peer-review › Conference contribution

PVT Computation Issues in Mixed Galileo/GPS Reception

General information
State: Published
Ministry of Education publication type: A3 Part of a book or another research book
Organisations: Department of Electronics and Communications Engineering
Authors: Paakki, T., Della Rosa, F., Nurmi, J.
Number of pages: 29
Pages: 139-167
Publication date: 2014

Host publication information
Title of host publication: GALILEO Positioning Technology
Publisher: Springer Science+Business Media

Software Simulators and Multi-Frequency Test Scenarios for GALILEO

General information
State: Published
Ministry of Education publication type: A3 Part of a book or another research book
Organisations: Department of Electronics and Communications Engineering
Authors: Thombre, S., Nurmi, J.
Number of pages: 33
Pages: 289-321
Publication date: 2014

Host publication information
Title of host publication: GALILEO Positioning Technology
Publisher: Springer Science+Business Media
Editors: Nurmi, J., Lohan, E., Sand, S., Hurskainen, H.
ISBN (Print): 978-94-007-1829-6

Publication series
Name: Signals and Communication Technology
Volume: 182
ISSN (Print): 1860-4862
DOIs:
10.1007/978-94-007-1830-2_7

Bibliographical note
Contribution: organisation=elt,FACT1=1
Portfolio EDEND: 2014-11-21
Publisher name: Springer Science+Business Media
Source: researchoutputwizard
Source-ID: 1205
Research output: Scientific - peer-review › Chapter

Tactical Applications of Heterogeneous Ad Hoc Networks - Cognitive Radios, Wireless Sensor Networks and COTS in Networked Mobile Operations

General information
State: Published
Ministry of Education publication type: A4 Article in a conference publication
Organisations: Department of Electronics and Communications Engineering
Authors: Suojanen, M., Nurmi, J.
Number of pages: 5
Publication date: 2014

Host publication information
Title of host publication: COCORA 2014, The Fourth International Conference on Advances in Cognitive Radio, 23-27 February 2014, Nice, France
Publisher: International Academy, Research, and Industry Association IARIA
Editor: Lorenz, P.
ISBN (Print): 978-1-61208-323-0
Links:
http://www.thinkmind.org/download.php?articleid=cocora_2014_1_10_80004

Bibliographical note
Contribution: organisation=elt,FACT1=1
Portfolio EDEND: 2014-04-29
Publisher name: International Academy, Research, and Industry Association IARIA
Source: researchoutputwizard
Source-ID: 1566
Research output: Scientific - peer-review › Conference contribution

Training communication skills in project-oriented microelectronics courses

General information
State: Published

Transport triggered architecture to perform carrier synchronization for LTE

General information
State: Published
Ministry of Education publication type: A1 Journal article-refereed
Organisations: Department of Electronics and Communications Engineering, Wireless Communications and Positioning (WICO)
Authors: Anjum, O., Ali, M., Pitkänen, T., Nurmi, J.
Number of pages: 15
Publication date: 2014
Peer-reviewed: Yes

Publication information
Journal: ACM Transactions on Embedded Computing Systems
Volume: 13
Issue number: 4
Article number: 89
ISSN (Print): 1539-9087
Ratings:
Scopus rating (2016): SJR 0.377 SNIP 1.145 CiteScore 1.69
Scopus rating (2015): SJR 0.322 SNIP 0.895 CiteScore 1.28
Scopus rating (2014): SJR 0.233 SNIP 0.708 CiteScore 0.8
Scopus rating (2013): SJR 0.31 SNIP 1.095 CiteScore 1.05
Scopus rating (2012): SJR 0.329 SNIP 1.257 CiteScore 1.48
Scopus rating (2011): SJR 0.445 SNIP 1.087 CiteScore 2.08
Scopus rating (2010): SJR 0.557 SNIP 1.073
Scopus rating (2009): SJR 0.43 SNIP 1.25
Scopus rating (2008): SJR 0.166 SNIP 0.307
Original language: English
DOIs:
10.1145/2560036

Bibliographical note
Contribution: organisation=elt,FACT1=1
Portfolio EDEND: 2014-08-29
Publisher name: Association for Computing Machinery
Source: researchoutputwizard
Source-ID: 106
Research output: Scientific - peer-review › Article