- Name: Dr.ir. Bart (A.C.J.) Kienhuis
- Position: Assistant Professor at LIACS, University Leiden, The
Netherlands
- Group: Computer
Systems Group
- Section: Embedded
Design
- Area of expertise:
- Embedded System Design, in particular compiler support for
Embedded Multi Core Systems
- High Level System Synthesis
- Software Engineering
- Web based Software Technologies
- Main Research Field:
- Software Engineering
- Tools for Embedded Multi Core System Design
- Design Space Exploration Support
- Office: Room 126
- Phone: +31-(0)71 527 5776
- Fax: +31-(0)71 527 6985
- Email: a.c.j.kienhuis@liacs.leidenuniv.nl
- Work address:
Leiden University
LIACS (Leiden Institute of Advanced
Computer Science)
Niels Bohrweg 1
2333 CA Leiden
The Netherlands
For more details see also: LIACS Organization
Bio
Bart Kienhuis received a MSEE from Delft University of Technology in
1994 and he received his Ph.D. from Delft University of Technology in
1999. During his Ph.D., he has worked at Philips Research in Eindhoven
on a design methodology (the Y-chart approach) for high performance
video architectures for consumer products. His primary interest is in
the area of embedded system design with an emphasis on design space
exploration, performance modeling, architectural analysis, and
hardware/software codesign. From 1999 until 2000, Bart Kienhuis was a
Post Doc
in the group of Prof. Edward A. Lee at the University of California at
Berkeley. He is an assistant professor in the
Computer Systems
group.
Bart is a Senior Member of the IEEE
Master and PhD projects
Finished Master Project Publications.
Former PhD students
Involved Research Projects (Dutch/European)
- 2008 - 2013 (FINISHED)SoftSoC,
Project 2A714, Hardware Dependent Software for Systems on Chip
(European Funded Project and PointOne),
2 PhDs.
Hardware dependant Software (HdS) solutions to improve IP integration
in the SoC design process (quality / productivity): To separate the
Operating System (OS) and the application software from the underlying
hardware and HdS for efficiency, dependability, flexibility and
manageability. Systematic, highly automated HW/SW-integration of IP. See also the official SoftSoc website.
- 2008 - 2013 (FINISHED)TSAR,
Project 2A718, Tera-Scale Multi-core Processor ARchitecture
(European Funded Project and PointOne),
2 PhDs.
Design and application of multi-core processor architectures targeting
tera-flops performance. Topics are In-Network Cache Coherence Protocol
on NoC-based architecture, Scalable Tera Bit/s IO design, Application
Specific GALS NoC design, optimal Compilation for
Multi-Core/Multi-Thread Processor.
There is a special workshop organized around TSAR on the Design and Reuse workshop on December 3/4 2008 in Grenoble. See program for more information.
- 2005 - 2009 (FINISHED) NEVA,
Project 2A703, Networks on Chips Design Driven by Video and
Distribution Applications
(European Funded Project), 2 PhDs.
Circuits for electronic devices are becoming so complex that they are
expected to reach a billion transistors by the end of 2008.
Traditional silicon chip architectures are nearing the limit of their
performance in such applications so the NEVA project was set up to
introduce innovative network-on-chip designs, based on multiple
processors and asynchronous circuitry. The goal is to allow designers
and application engineers to cope with emerging applications resulting
from multimedia/communication convergence. In the first instance,
datastream applications, mainly from the video environment, will be
used as drivers, with a targeted computing power around one-giga
operations per second per chip.
Neva is a MEDEA+ collaboration of industrial and academic
partners: ACE, Bull Certess, LETI/CEA, NXP Semiconductors, Silicomp,
STMicroelectronics, TIMA, and VERIMAG.
- 2005 - 2009 (FINISHED) TRADER, Embedded Systems Institute,
(Dutch Funded Project, BSIK), 1 PhD,
Modern technical systems such as household goods, DVD players, PCs,
medical X-ray imaging, printers, advanced car vehicles, and airplanes
rely increasingly on software. Complex systems cannot be built without
software accomplishing their integration. Embedded computer programs
monitor the whole system and take care that the system accomplishes
more than its parts would. In such software-intensive systems
reliability is of prime importance.
Presentations:
- Keynote presenation at Memocode 2008, DAC Anaheim, June 5- 7 2008.
Title ``Programming multicores with Kahn Process Networks; a smart choice?''.
- Presenation at the workshop on Software & Compilers
for Embedded Systems (SCOPES'07), Nice, April 20 2007.
Title ``Mapping
Stream based Applications to an Intel IXP Network Processor using
Compaan''.
- invited speaker on ICT -
Kenniscongres
``Pushing
to the limit: how to teach a billion transistor chip a new trick'',
Amsterdam, The Netherlands, April 10, 2006.
- Invited speaker on PROGRESS Symposium``From My
Research to My Company?'', Utrecht, The Netherlands, 2004.
- Invited tutorial presenter at ASAP'03, Den Hague, The
Netherlands, June 25, 2003.``How do you run your next Software
Project; A Software Practice''. A tutorial on how to improve the
quality of software developed in the academic environment.
- Presentation in the joined
Electronic Systems Design / Chess Seminar
series at UC Berkeley, Berkeley May 13, 2003.``System
design using Kahn Process Networks - the Compaan/Laura Approach''
- Organized together with Kees Vissers one
of the tutorials at the
37th DAC at Los Angeles, on system level design using embedded
platforms. ``System
Level Design with Embedded Platforms". This link contains all
the presentation slides in PDF format.
- Paper presentation at the 8th International Workshop on
Hardware/Software Codesign (CODES'2000), May 3 -- 5 2000, San Diego,
CA, USA.``Compaan:
Deriving Process Networks from Matlab for Embedded Signal Processing
Architectures.''
- Poster presented at the Ptolemy Mini conference, at the
University of California at Berkeley 19 February 1999.``System
Level Design using Y-charts'' (See also this slide pack)
Publications:
Most of these articles may be copyright of
ACM
or
IEEE
. Please understand their copyright policy before reproducing these
articles.
Tsvetan Shoshkov, Todor Stefanov and Bart Kienhuis,"Parameterized System Level Design : Real-Time X-Ray Image Processing Case Study", The 27th Annual IEEE International Conference on
Application-specific Systems, Architectures and Processors (ASAP2016), Jul 6th-8th 2016, London, England. Abstract.
-
Complex embedded systems are used to facilitate real-time data processing applications. The embedded systems we are interested in typically have to process large amount of data that represents, for example, medical images, astronomy data, network traffic, or even stock exchange information. To cope with their surroundings, these systems need to provide some form of
dynamic parameterization. For example, to adjust parameters of filters, to switch in or out particular computational blocks or to select regions for further analysis; all at runtime. The development of such systems is challenging as a designer needs to develop a high-performing system that executes in a stable and consistent way with the parameterization. In this paper, we want to present how we can develop such a system using the Compaan technology and show a real-time medical X-Ray image processing case study. The Compaan technology realizes this design faster and more reliable than traditional design approaches that rely on more low level or adhoc development techniques to realize parameterization.
Wouter van Teijlingen, Carlo Galuzzi, Rene van Leuken, and Bart Kienhuis,"Determining Performance Boundaries on High-Level System Specifications", The 19th International Workshop on Software and Compilers for Embedded Systems (SCOPES2016), May 23 - 25, 2016, St. Goar, Germany. Abstract.
-
We can significantly reduce the time required to realize designs if it is possible to find limits to the performance of an embedded system, solely based on high-level system specifications. For that purpose, we present in this paper the cprof profiler, which determines the number of clock cycles needed to execute a C-program in hardware. The cprof tool is based on the Clang compiler front-end to parse C-programs and to produce instrumented source code for the profiling. Using cprof, we determine a lower and upper bound limit for all 29 cases of the PolyBench/C benchmark suite. The lower and upper bound are determined using the absolute performance estimations assuming all statement are mapped onto the same processing resource and unbounded performance estimations assuming unlimited resources. We also compared the clock cycles found by cprof with RTL implementations for all 29 Polybench/C cases and found that cprof determines with 1.2 procent accuracy the correct number of clock cycles. It does this in a fraction of the time compared to the time needed to do a full RTL simulation.
Ana Balevic and Bart Kienhuis,"Deriving a Multi-Level Program Model for Efficient Parallelization on Heterogeneous Platforms", The 11th IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN 2013), February 11 – 13, 2013, Innsbruck, Austria. Abstract.
-
Sven van Haastreg and Bart Kienhuis,"Enabling Automatic Pipeline Utilization Improvement in Polyhedral Process Network Implementations". In Intl. Conference on Application-specific Systems, Architectures and Processors (ASAP'12), July 2012. Abstract.
-
Ana Balevic and Bart Kienhuis,"Employing Dataflow Principles for Acceleration of Streaming Applications on Heterogeneous Platform", Data-Flow Execution Models for Extreme Scale Computing (DFM 2011), Galveston Island, Texas, USA, October 10, 2011, in conjunction with PACT 2011. Abstract.
-
Razvan Nane, Sven van Haastregt, Todor Stefanov, Bart Kienhuis, Vlad Mihai Sima and Koen Bertels,"IP-XACT Extensions for Reconfigurable Computing", In Intl. Conference on Application-specific Systems, Architectures and Processors (ASAP'11), pp. 215-218, September 2011. Abstract.
-
Ana Balevic and Bart Kienhuis,"A Data Parallel View on Polyhedral Process Networks", In the proceedings of 14th International Workshop on Software and Compilers for Embedded Systems June 27-28, 2011 Schloss Rheinfels, St. Goar, Germany. Abstract.
-
Ana Balevic and Bart Kienhuis,"KPN2GPU: An Approach for Discovery and Exploitation of Fine-Grain Data Parallelism in Process Networks", In the proceedings of the International Workshop in Highly-Efficient Accelerators and Reconfigurable Technologies, June 2-3, 2011, Imperial College, London. Abstract.
-
Ana Balevic and Bart Kienhuis,"Exploiting Task Parallel Execution on CUDA: A Case Study", In the proceedings of the 2nd Workshop on Applications for Multi and Many Core Processors, June 4-8, 2011 in San Jose, CA, USA. Abstract.
-
Sven van Haastregt, Stephen Neuendorffer, Kees Vissers and Bart Kienhuis,"High Level Synthesis for FPGAs Applied to a Sphere Decoder Channel Preprocessor", In Intl. Symposium on Field Programmable Gate Arrays (FPGA'11), pp. 278, February-March 2011. Abstract.
-
Sven van Haastregt and Bart Kienhuis,"Cost Modeling and Cycle-Accurate Co-Simulation of Heterogeneous Multiprocessor Systems", In the proceedings of Design, Automation and Test in Europe (DATE'10), Dresden 2010. Abstract.
-
In this paper, we present a method to analyze different implementations of
stream-based applications on heterogeneous multiprocessor systems. We take both
resource usage and performance constraints into account.
For the first aspect we use an empirical cost model. For the second aspect we
build a network of cycle-accurate processor simulators.
The simulation and resource cost estimation have been integrated in an existing
framework, allowing one to generate fast exploration simulations,
cycle-accurate simulations and FPGA implementations from a single system level
specification. We show that with our methodology cycle-accurate performance numbers of candidate
systems can be obtained. In our experiments with the QR and MJPEG applications,
we found that the error of our resource cost model is below two percent.
Sven van Haastregt and Bart Kienhuis,
``Automated Synthesis of Streaming C Applications to Process Networks in Hardware'', In proceedings of Design, Automation and Test in Europe (DATE'09), April 2009. Abstract.
-
The demand for embedded computing power is continuously increasing and FPGAs are
becoming very interesting computing platforms, as they provide huge amounts of
customizable parallelism. However, programming them is challenging, let alone
from a high level language. The ESPAM methodology was
already presented to quickly obtain realizations on FPGAs from sequential C
code. The realization consists of a network of processors and IP cores. In this
approach, a problem was that the IP cores had to be provided manually. In this
paper, we present an extension on the ESPAM methodology by incorporating the
industrial high level synthesis tool PICO from Synfora Inc. In this way, we
realize the automated generation of efficient hardware implementations on FPGAs
from a single sequential C input specification of a streaming application.
We demonstrate our approach for the Sobel and QR applications.
Bin Jiang, Bart Kienhuis, and Ed Deprettere,
``Hierarchical Run Time Deadlock Detection in Process Networks'',
Published in the proceedings of 2008 IEEE Workshop on Signal Processing Systems, October 8-10, 2008
Washington, D.C. Metro Area, USA. Abstract.
-
Deadlock detection is a well-studied problem that may be
considered solved from a theoretical point of view. However,
specific cases may demand for specific solutions. One such specific
case is deadlock detection in Kahn Process Networks. The Kahn
process network (KPN) is an expressive model of computation that is
widely used to model and specify deterministic streaming
applications. The processes in the network communicate
point-to-point over FIFO channels whose sizes are undecidable in
general. As a consequence, deadlock may occur and, therefore, a
run-time deadlock detection mechanism is required. This can be
organized in a centralized way, a distributed way, and a
hierarchical way. Centralized and distributed procedures have been
reported in the literature. In this paper, we propose a novel
hierarchical approach for KPN deadlock detection at run time. We
also give results for the implementation on the IBM Cell processor.
Steven Derrien, Alexandru Turjan, Claudiu Zissulescu,
Bart Kienhuis and Ed Deprettere,
``Deriving Efficient Control in Process Networks with Compaan/laura'',
Published in the International
Journal of Embedded Systems (IJES), volume 3, Issue 3, p 170 - 180,
2008, Abstract.
-
At Leiden Embedded Research
Center (LERC), we are building a tool chain called Compaan/Laura
that allows us to map rapidly and efficiently signal processing
applications written in Matlab onto reconfigurable platforms. In this
chain, first the Matlab code is converted automatically to executable
Kahn Process Network (KPN) specification, then a tool called Laura
transforms the PN specification into design implementations described
as synthesizable VHDL.
The applications targeted by Compaan are usually data-flow intensive,
requiring large computational power. Therefore, an important issue in
Laura is the derivation of efficient and scalable hardware control
structures. This control is based on an abstract representation given
as parametrized polytopes. Although this representation can be
directly translated into nested guarded for-loops (very suitable for
software implementation) its mapping to hardware is much more
difficult. In this paper we investigate the opportunities of deriving
different hardware realizations for this control, and explore the
trade off between speed and resources usage.
Sjoerd Meijer, Sven van Haastregt, Dmitry Nadezhkin, and
Bart Kienhuis``Kahn
Process Network IR Modeling for Multicore Compilation'', Technical
Report, LIACS, December 2007. Abstract.
-
The complexity
of embedded applications has reached a point where the performance
requirements of these applications can no longer be supported by
embedded systems based on a single processor. Instead, emerging
embedded System-on-Chip platforms are increasingly becoming multicore
architectures. On these platforms two major problems emerge: how to
design and how to program such multicore platforms in a systematic and
automated way? We focus on the latter and are building compiler
support for programming multicore platforms. For uniprocessor systems,
auto parallelization techniques such as vectorization, instruction
level parallelism, and software pipelining mainly focused on the
back-end of the compiler. We believe that for multicore systems this
is not sufficient and a difference can be made in the middle-end.
Therefore, we present a compiler framework in which we use the Kahn
Process Network (KPN) model of computation in the Intermediate
Representation (IR) of the compiler. This IR model is crucial and
allows us to focus on: 1) code generation for multicore platforms, and
2) KPN profiling and network restructuring transformations, where the
former is a prerequisite for the latter. In this paper we focus on 1),
and manually apply the network restructuring transformations. We
demonstrate a working prototype compiler tool-chain that automatically
creates multithreaded code from a sequential program specification and
show results for the Cell processor.
Sjoerd Meijer, Johan Walters, David Snuijf, and Bart
Kienhuis``Automatic
partitioning and mapping of stream-based applications onto the Intel
IXP Network Processor'', Workshop on Software & Compilers
for Embedded Systems (SCOPES'07), Nice, April 20 2007. Abstract
-
When studying the IXP
Network processor architecture from Intel, we found quite some
interesting aspects that make the IXP attractive for stream-based
applications. The architecture is highly optimized for streaming data,
albeit in the form of Internet packets. Furthermore, the architecture
has Gigabit Ethernet connectors for handling incoming and outgoing
traffic and can process this data at real-time using dedicated
microengines. In this paper, we try to answer three questions; 1) Can
we use the IXP architecture for stream-based applications? 2) can we
map applications written as a KPN onto an IXP? 3) can we integrate the
generation of KPNs using the Compaan compiler with a tool flow that
maps the KPN to an IXP, thereby make the programming of IXP much
simpler? As will be shown, all three steps can be performed and we
show that we can map automatically two non-Internet stream-based
applications (QR and DWT) onto the IXP.
Sjoerd Meijer, Bart Kienhuis, Alex Turjan and Erwin de
Kock``A Generic Process Splitting Transformation For Kahn Process
Networks'', Proceedings of the DATE conference (DATE'07),
Nice, April 17 - 19 2007. Abstract
-
Kahn Process Networks (KPN)
are very suitable for systematic mapping onto multiprocessor
architectures. Running applications written in this parallel program
specification on a multiprocessor architecture, however, does not
guarantee that the runtime requirements are met. Therefore, it may be
necessary to further analyze and optimize Kahn process networks. In
this paper, we will present a process splitting technique that results
in a functionally equivalent process network, but with a changed and
optimized network structure. The class of networks that can be handled
is generic as it is not restricted to static networks. The presented
approach can also handle processes with dynamic program statements. We
prototyped the splitting transformation in the GCC compiler for a JPEG
decoder application. This resulted in a 21% performance improvement.
Ming-Yung Ko, Claudiu Zissulescu, Sebastian
Puthenpurayil, Shuvra Bhattacharyya, Bart Kienhuis, and Ed Deprettere,``Parameterized
Looped Schedules for Compact Representation of Execution Sequences in
DSP Hardware and Software Implementation'', in the IEEE Transactions on
Signal Processing. 2007, vol. 55 (2), no6, pp. 3126-3138. Describes
joined work between LIACS and Dept of Electrical Engineering, Univ. of
Maryland. Abstract
-
In this paper, we present a
technique for compact representation of execution sequences in terms
of efficient looping constructs. Here, by a looping construct, we mean
a compact way of specifying a finite repetition of a set of execution
primitives. Such compaction, which can be viewed as a form of
hierarchical runlength encoding (RLE), has application in many VLSI
signal processing contexts, including efficient control generation for
Kahn processes on FPGAs, and software synthesis for static dataflow
models of computation. In this paper, we significantly generalize
previous models for loop-based code compaction of DSP programs to
yield a configurable code compression methodology that exhibits a
broad range of achievable trade-offs. Specifically, we formally
develop and apply to DSP hardware and software synthesis a
parameterizable loop scheduling approach with compact format, dynamic
reconfigurability, and low-overhead decompression.
Alexandru Turjan, Bart Kienhuis and Ed Deprettere,``Classifying
Interprocess Communication in Process Network Representation of Nested
Loop Programs'', in the ACM Transactions on Embedded Computing Systems
Volume 6, Issue 2 (May 2007).
Claudiu Zissulescu, Bart Kienhuis and Ed Deprettere,``Communication
Synthesis in a multiprocessor environment'', Presented at
Field-Programmable Logic and Applications conference (FPL'05) Tampere, Finland August 24 - 26,
2005. Abstract
-
At Leiden University, we are
developing a design methodology that allows for fast mapping of
nested-loop applications (e.g. DSP, Imaging, or Multi-Media) written
in a subset of Matlab onto reconfigurable devices. This design
methodology is implemented into a tool chain that we call Compaan/Laura
. This methodology generates a process network in which the
inter-process communication takes place in a point-to-point fashion.
Four types of point-to-point inter-processor communication exist in
the PN. Two of them use a FIFO like communication and the other two
use a cache like memory to exchange data. In this paper, we
investigate the realizations for the four communication types and show
that point-to-point communication at the level of scalars can be
realized automatically and very efficiently in today's FPGAs.
Claudiu Zissulescu, Bart Kienhuis and Ed Deprettere,``Expression
Synthesis in Process Networks generated by Laura'', Presented at the
16th IEEE International Conference on Application-specific Systems,
Architectures and Processors ASAP2005,
July 23 -- 25, 2005, Samos, Greece. Abstract
-
The Compaan/Laura
tool chain maps nested loop applications written in Matlab onto
reconfigurable platforms, such as FPGAs. \Compaan rewrites the
original Matlab application as a Process Network in which the control
is parameterized and distributed. This control is given as
parameterized polytopes that are expressed in terms of pseudo-linear
expressions. These expressions cannot always be mapped efficiently
onto hardware as they contain multiplication and integer division
operations. This obstructs the data flow through the processes.
Therefore, we present in this paper the Expression Compiler that
efficiently maps pseudo-linear expressions onto reconfigurable
hardware in such a way that the distributed and parameterized control
never obstructs the data flow through processors. This compiler
employs techniques like number theory axioms, method of difference,
and predicated static single assignment code.
Alexandru Turjan, Bart Kienhuis and Ed Deprettere,
``Solving Out-of-Order Communication in Kahn Process Networks'', in the
Journal of VLSI Signal Processing, Issue: Volume 40, Number 1 Date: May
2005 Pages: 7 - 18. Abstract
-
The Compaan compiler
framework automates the transformation of DSP applications written in
Matlab into Kahn Process Networks (KPNs). These KPNs express the
signal processing applications in a parallel distributed way making
them more suitable for mapping onto parallel architectures. A simple
instance of a generated KPN by Compaan is a Producer process that
communicates with a Consumer process via a FIFO buffer, with the
Consumer reading data from the FIFO using a blocking read. When the
sequence of producing data is different from the sequence of consuming
data, a simple FIFO is not sufficient to implement the communication.
For such case, extra storage and control are needed at the consumer
side. This paper presents a compile approach that determines whether a
FIFO buffer is sufficient for every Producer/Consumer pair of a
Compaan-generated KPN. When additional memory is required, we provide
an address generation mechanism to perform the reordering and
furthermore give a lower bound on the amount of memory needed to
perform the reordering. The presented approach is based on the Ehrhart
theory.
Andre van der Plas, Bart Kienhuis, ``A Methodology for
Efficient Reuse of Open Source Code''
,presented the First Conference for the dutch Software Engineering
Community (JACQUARD2005)Abstract
-
Research has been
carried out to explore if, and how, it is possible to increase the
efficiency in which open source code can be used. The research has
shown that structuring such code can improve this efficiency. It has
lead to a methodology and a set of techniques, which can be used to
(re) structure the code.
Alexandru Turjan, Bart Kienhuis and Ed Deprettere,``A
Hierarchical Classification Scheme to Derive Interprocess Communication
in Process Networks'', in the proceedings of the 15th IEEE
International Conference on Application-specific Systems, Architectures
and Processors ASAP2004,
Sept 27 -- 29, 2004, Galveston, Texas. Abstract
-
The Compaan compiler
automatically derives a Process Network (PN) description from an
application written in Matlab. The basic element of a PN is a Producer/Consumer
(P/C) pair . Four different communication patterns for a P/C pair have
been identified and the complexity of communication structure differs
depending on the communication pattern involved. Therefore, in order
to obtain cost-efficient process networks our compiler automatically
identifies the communication pattern of each P/C pair. This problem is
equivalent to integer linear programming and thus in general can not
be solved efficiently. In this paper we present simpler techniques
that allow to classify the interprocess communication in a PN.
However, in some cases those techniques do not allow to find an answer
and therefore, an ILP test has still to be applied. Thus, we introduce
a hierarchical classification scheme that correctly classifies the
interprocess communication, but uses dramatically less integer linear
programming. In only 5% of the cases to classify, we still relay on
integer linear programming while in the remaining 95%, the techniques
presented in this paper are able to classify a case correctly.
Alexandru Turjan, Bart Kienhuis and Ed Deprettere,
``Translating affine nested-loop programs to Process Networks'', in
proceedings of at the international conference on compilers,
architecture, and synthesis for Embedded Systems (CASES'04), Sept
23 -- 25, 2004, Washington D.C Abstract
-
New heterogeneous
multiprocessor platforms are emerging that are typically composed of
loosely coupled components that exchange data using programmable
interconnections. The components can be CPUs or DSPs, specialized IP
cores, reconfigurable units, or memories. To program such platform, we
use the Process Network (PN) model of computation. The localized
control and distributed memory are the two key ingredients of a PN
allowing us to program the platforms. The localized control matches
the loosely coupled components and the distributed memory matches the
style of interaction between the components. To obtain applications in
a PN format, we have built the Compaan compiler that translates affine
nested-loop programs into functionally equivalent PNs. In this paper,
we describe a novel analytical translation procedure we use in our
compiler that is based on integer linear programming. The translation
procedure consists of four main steps and we will present each step by
describing the main idea involved, followed by a representative
example.
Claudiu Zissulescu, Bart Kienhuis and Ed Deprettere,
``Increasing pipelined IP core utilization in Process Networks using
Exploration'', In proceedings of Field-Programmable Logic and
Applications (FPL'04), Springer LNCS
3203, pages 690 -- 699. Abstract
-
At Leiden Embedded Research
Center, we are building a tool chain called Compaan/Laura that
allows us to do fast mapping of applications written in Matlab onto
reconfigurable platforms, such as FPGAs, using IP cores to implement
the data-path of the applications. A particular characteristic of the
derived networks is the existence of selfloops. These selfloops have a
large impact on the utilization of IP cores in the final hardware
implementation of a PN, especially if the IP cores are deeply
pipelined. In this paper, we present an exploration methodology that
uses feedback provided by the Laura tool to increase the utilization
of IP cores embedded in our PN network. Using this exploration, we go
from 60MFlops to 1,7GFlops for the QR algorithm using the same number
of resources except for memory.
Alexandru Turjan, Bart Kienhuis and Ed Deprettere,
``An Approach to Classify Inter-Process Communication in Process
Networks at Compile Time'', In proceedings of 8th International
Workshop on Software and Compilers for Embedded Systems (SCOPES2004), Springer LNCS 3199,
pages 62 -- 76, Sept 2 -- 3, 2004, Amsterdam. Abstract
-
New embedded signal
processing architectures are emerging that are composed of loosely
coupled heterogeneous components like CPUs or DSPs, specialized IP
cores, reconfigurable units, or memories. We believe that these
architectures should be programmed using the Process Network model of
computation. To ease the mapping of applications, we are developing
the Compaan compiler that automatically derives a Process
Network (PN) description from an application written in Matlab. In
this paper, we investigate a particular problem in nested loop
programs, which is about classifying the interprocess communication in
the PN representation of the nested loop program. The global memory
arrays present in the Matlab code have to be replaced by a distributed
communication structure used for sending data to the network
processes. We will show that four types of communication exists, each
exhibiting different requirements when realizing them in hardware of
software. We present two compile time tests that decide the type of a
particular static array. These tests have become an important part of
our Compaan compiler.
Ingrid Verbauwhede (UCLA/KU Leuven), Patrick Schaumont
(UCLA), Christian Piquet (CSEM), and Bart Kienhuis,
``Architectures and Design techniques for energy efficient embedded DSp
and multimedia processing'' presented at the embedded Low-power
Tutorial at the Design, Automation and Test in Europe conference DATE2004, Feb 16-20 2004,
Paris, France. Abstract
-
Energy efficient embedded
systems consist of a heterogeneous collection of very specific
building blocks, connected together by a complex network of many
dedicated busses and interconnect options. The trend to merge multiple
functions into one device makes the design and integration of these
"systems-on-chip" (SOC's) even more problematic. Yet, specifications
and applications are never fixed and require the embedded units to be
programmable. The topic of this paper is to give the designer
architectures and design techniques to find the right balance between
energy efficiency and flexibility. The key is to include
programmability (or reconfiguration) at the right level of abstraction
and tuned to the application domain. The challenge is to provide an
exploration and programming environment for this heterogeneous
architecture platform.
Todor Stefanov, Claudiu Zissulescu, Alexandru Turjan,
Bart Kienhuis, Ed Deprettere,
``System Design using Kahn Process Networks: The Compaan/Laura
Approach'', in proceedings of the Design, Automation and Test in Europe
conference DATE2004, Feb
16-20 2004, Paris, France. Abstract
-
New emerging embedded
system platforms in the realm of high-throughput multimedia, imaging,
and signal processing will consist of multiple microprocessors and
reconfigurable components. One of the major problems is how to program
these platforms in a systematic and automated way so as to satisfy the
performance need of applications executed on these platforms.In this
paper, we present our system design approach as an efficient solution
to this programming problem. We show how for an application written in
Matlab, a Kahn Process Network specification can automatically be
derived and systematically mapped onto a target platform composed of a
microprocessor and an FPGA. Furthermore, we illustrate how the mapping
approach is applied on a real-life example, namely an M-JPEG encoder.
Claudiu Zissulescu, Todor Stefanov, Bart Kienhuis, Ed
Deprettere,``Laura:
Leiden Architecture Research and Exploration Tool'', Presented at the
International Conference on Field Programmable Logic and Applications (FPL), Sept 1-3 2003, Lisbon, Portugal. Abstract
-
At Leiden Embedded Research
Center (LERC), we are building a tool chain called Compaan/Laura
that allows us to map fast and efficiently application written in
Matlab onto reconfigurable platforms. In this chain, first the Matlab
code is converted automatically to executable Kahn Process Network
(KPN) specification. Then a tool called Laura accepts this
specification and transforms the specification into design
implementations described as synthesizable VHDL. In this paper, we
present our methodology implemented in Laura tool to
automatically convert KPNs to synthesizable VHDL code targeted form
mapping onto FPGA-based platforms. With the help of Laura , a
designer is able to either fast prototype signal processing and
multimedia applications directly in hardware or to extract very fast
valuable low-level quantitative implementation data such as
performance in terms of clock cycles or silicon area.
Alexandru Turjan, Bart Kienhuis``Storage
Management in Process Networks using the Lexicographically Maximal
Preimage'', Presented at the IEEE 14th International Conference on
Application-specific Systems, Architectures and Processors (ASAP'03) June 24-26,
2003, Den Hague, The Netherlands. Abstract
-
At the Leiden Embedded
Research Center, we are developing a compiler called Compaan that
automatically translates signal processing applications written in
Matlab into Kahn Process Networks (KPNs). In general, these signal
processing applications are data-flow intensive, requiring large
storage capacities, usually represented by matrices. An important
issue in Compaan is the derivation of a memory management mechanism
that allows for efficient inter-process communication. This mechanism
has previously been published and is called the Extended Linearization
Model (ELM). The controller needed in the ELM is derived using the
Ehrhart theory, leading to a computational intensive procedure. In
this paper, we present a new approach to derive the ELM controller,
based on the notion of Lexicographically Maximal Preimage. Using
polytope manipulations and parametric integer linear programming
techniques, we get less computational intensive and easier to be
derived controller implementation for the ELM.
Alexandru Turjan, Bart Kienhuis, Ed Deprettere,
``Realizations of the Extended Linearization Model'', In chapter 9 of
book "Domain-Specific Embedded Multiprocessors", page 171 --
191, 2003, by Marcel Dekker,
Inc. Abstract
-
At the Leiden Embedded
Research Center, we are working towards a framework called Compaan
that automates the transformation of digital signal processing (DSP)
applications to Kahn Process Networks (KPNs). These applications are
written in Matlab as parameterized nested loop programs. This
transformation is interesting as KPNs are well suited for mapping onto
parallel architectures. Although the KPN semantic always assumes that
FIFO buffers can be used between processes, we have found cases in
which the FIFO is not enough as data may arrive in the wrong order. To
solve this order problem, we previously presented the Extended
Linearization Model (ELM) that describes a mechanism to reorder
tokens. The introduction of the ELM does not violate the Kahn Process
Network semantics; we still use a FIFO between a Producer and
Consumer. The ELM relays on some additional memory and a controller to
perform the reordering. The ELM model can be implemented in different
ways. In this chapter, we investigate four different realizations of
the ELM. The realizations differ in the computational complexity of
performing the reordering, the kind of reordering memory used, and the
size of the reordering memory.
Bart Kienhuis and Ed F. Deprettere,``Modeling
Stream-Based Applications Using the SBF Model of Computation'', The
Journal of VLSI Signal Processing-Systems for Signal, Image, and Video
Technology (Kluwer),
pag 291 -- 300, Vol. 34, Issue 3, 2003.
Edwin Rijpkema (Promotor Ed Deprettere and Co-promotor
Bart Kienhuis), ``Modeling
Task Level Parallelism in Piece-wise Regular Programs'', PhD thesis,
Leiden University, Leiden Institute of Advanced Computer Science (LIACS), The Netherlands,
Sept 2002.
Tim Harriss, Richard Walke, Bart Kienhuis, and Ed
Deprettere``Compilation
from Matlab to Process Networks Realized in FPGA'', In
journal on Design Automation of Embedded Systems, Kluwer, Vol 7, Issue
4, 2002. Abstract
-
Compaan is a software tool
that is capable of automatically translating nested loop programs,
written in Matlab, into parallel process network descriptions suitable
for implementation in hardware. In this article, we show a methodology
and tool to convert these process networks into FPGA implementations.
We will show that we can in principle obtain high performing
realizations in a fraction of the design time currently employed to
realize a parameterized implementation. This allows us to rapidly
explore a range of transformations, such as loop unrolling and
skewing, to generate a circuit that meets the requirements of a
particular application. The QR decomposition algorithm is used to
demonstrate the capability of the tool. We present results showing how
the number of clock cycles and calculations-per-second vary with these
transformations using a simple implementation of the function units.
We also provide an indication of what we expect to achieve in the near
future once the tools are completed and applied the transformations to
parallel, highly pipelined implementations of the function units.
Alexandru Turjan, Bart Kienhuis, and Ed Deprettere ``A
compile time based approach for solving out-of-order communication in
Kahn Process Networks'', in proceeding of IEEE 13th International
Conference on Aplication-specific Systems, Architectures and Processors
(ASAP'2002), San Jose, CA, USA, July 17-19, 2002. Abstract
-
The Compaan compiler
framework automates the transformation of DSP applications written in
Matlab into Kahn Process Networks (KPNs). These KPNs express the
signal processing applications in a parallel distributed way making
them more suitable for mapping onto parallel architectures. A simple
instance of a generated KPN by Compaan is a Producer process that
communicates with a Consumer process via a FIFO buffer, with the
Consumer reading data from the FIFO using a blocking read. When the
sequence of producing data is different from the sequence of consuming
data, a simple FIFO is not sufficient to implement the communication.
For such case, extra storage and control are needed at the consumer
side. This paper presents a novel approach that determines at compile
time whether a FIFO buffer is sufficient for every Producer/Consumer
pair of a Compaan-generated KPN. For the case when the additional
memory is required, we also provide an address generation mechanism at
compile time. The presented approach is based on the Ehrhart theory.
Todor Stefanov, Bart Kienhuis, and Ed Deprettere``Algorithmic
Transformation Techniques for Efficient Exploration of Alternative
Application Instances'', in proceeding of Tenth International Symposium
on Hardware/Software Codesign CODES'2002, Stanley Hotel, Estes Park,
Colorado, USA, May 6 -- 8, 2002. Abstract
-
Following the Y-chart
paradigm for designing a system, an application and an architecture
are modeled separately and mapped onto each other in an explicit
design step. Next, a performance analysis for alternative application
instances, architecture instances and mappings has to be done, thereby
exploring the design space of the target system. Deriving alternative
application instances is not trivially done. Nevertheless, many
instances of a single application exist that are worth to be derived
for exploration. In this paper, we present algorithmic transformation
techniques for systematic and fast generation of alternative
application instances that express task-level concurrency hidden in an
application in some degree of explicitness. These techniques help a
system designer to speedup significantly the design space exploration
process.
Bart Kienhuis, Ed Deprettere, Pieter van der Wolf and
Kees Vissers ``A
Methodology to Design Programmable Embedded Systems'', in the LNCS
series of Springer-Verlag (c), Volume 2268, SAMOS: Systems,
Architectures, Modeling, and Simulation, editors Ed F. Deprettere,
Jurgen Teich, and Stamatis Vassiliadis, November 2001. Abstract
-
Embedded systems
architectures are increasingly becoming programmable , which
means that an architecture can execute a set of applications instead
of only one. This makes these systems cost-effective, as the same
resources can be reused for another application by reprogramming the
system. To design these programmable architectures, we present in this
article a number of concepts of which one is the Y-chart approach.
These concepts allow designers to perform a systematic exploration of
the design space of architectures. Since this design space may be
huge, it is narrowed down in a number of steps. The concepts presented
in this article provide a methodology in which architectures can be
obtained that satisfies a set of constraints while establishing enough
flexibility to support a given set of applications.
Tim Harriss, Richard Walke, Bart Kienhuis and Ed. F.
Deprettere,``Compilation
from Matlab to Process Networks Realized in FPGA'' , Presented at the
35th Asilomar conference on Signals, Systems, and Computers , November
4 -- 7, 2001, Pacific Grove, CA, USA, QinetiQ, Ltd (c). Abstract
-
Compaan is a software
tool capable of automatically translating nested loop programs,
written in Matlab, into parallel Kahn process network descriptions
suitable for implementation in hardware. In this paper we present a
tool for converting these process networks into FPGA implementations.
The QR decomposition algorithm is used to demonstrate the capability
of the tool to quickly generate high-performance parallel
implementations. This allows us to rapidly explore a range of
transformations, such as loop unrolling and skewing, to generate a
circuit that meets the requirements of a particular application. We
present results showing how the control logic complexity and number of
clock cycles vary with these transformations
Bart Kienhuis and Ed. F. Deprettere``Modeling
Stream-Based Applications using the SBF model of computation'' ,
Presented at: IEEE Workshop on Signal Processing Systems (SIPS 2001),
Antwerp, Belgium, September 26-28, 2001. Abstract
-
Modeling applications and
architectures at various levels of abstraction is becoming more and
more an accepted approach in embedded system design. When looking at
the modeling of applications in the domain of video, audio, and
graphics applications, we notice that they exhibit a high degree of
task parallelism and operate on streams of data. Models that we can
use to specify such stream-based applications on a high level of
abstraction are the dataflow models and process network models. Each
of these models has its own merits. Therefore, an alternative approach
is to introduce a model of computation that combines the semantics of
both models of computation. In this paper, we introduce such a model
of computation, which we call the Stream-Based Functions (SBF)
model of computation and show an example. Furthermore, we discuss the
composition and decomposition of SBF objects and put the SBF model of
computation in the context of relevant dataflow models and process
network models.
Edwin Rijpkema, Ed F. Deprettere and Bart Kienhuis``Deriving
Process Networks From Nested Loop Algorithms'' , Parallel Processing
Letters, Vol 10 Nos. 2 & 3 (2000), Pages 165 -- 176. Abstract
-
High level modeling and
(quantitative) performance analysis of signal processing systems
requires high level models for the applications (algorithms) and the
implementations (architectures), a mapping of the former into the
latter, and a simulator for fast execution of the whole. Signal
processing algorithms are very often nested-loop algorithms with a
high degree of inherent parallelism. This paper presents - for such
applications - a suitable application model and a method to convert a
given imperative executable specification to a specification in terms
of this application model. The methods and tools are illustrated by
means of an example.
Ed F. Deprettere, Edwin Rijpkema, Paul Lieverse and Bart
Kienhuis``High
Level Modeling for Parallel Executions of Nested Loop Algorithms'' ,
IEEE International Conference on Application-specific Systems,
Architectures and Processors (ASAP'2000), July 10 -- 12 2000, Boston
Massachusetts, USA. Abstract
-
High level modeling and
(quantitative) performance analysis of signal processing systems
requires high level models for the applications (algorithms) and the
implementations (architecture), a mapping of the former into the
latter, and a simulator for fast execution of the whole. Signal
processing algorithms are very often nested-loop algorithms with a
high degree of inherent parallelism. This paper presents - for such
applications - suitable application and implemetation models, a method
to convert a given imperative executable specification to a
specification in terms of the application model, a method to map this
specification into an architecture specification in terms of the
implementation model, and a method to analyze the performance through
simulation. The methods and tools are illustrated by means of an
example.
Bart Kienhuis, Edwin Rijpkema, and Ed F. Deprettere``Compaan:
Deriving Process Networks from Matlab for Embedded Signal Processing
Architectures.'', 8th International Workshop on Hardware/Software
Codesign (CODES'2000), May 3 -- 5 2000, San Diego, CA, USA. Abstract
-
This paper presents the
Compaan tool that automatically transforms a nested loop program
written in Matlab into a process network specification. The process
network model of computation fits better with the new emerging kind of
embedded architectures that use coprocessors. Process networks can
describe both fine-grained and coarse-grained parallelism, making the
mapping of the applications easier.
Bart Kienhuis``MatParser:
An array dataflow analysis compiler.'',Technical Report UCB/ERL M00/9.
Abstract
-
This paper
presents MatParser , which is an array analysis compiler that
automatically converts an affine nested loop program into a single
assignment program. The nested loop programs may contain non-linear
operators like div/mod/floor/ceil and stepsizes other than one. The
focus of this article is on the software architecture used in
MatParser to resolve the data dependencies. Finding that two variables
are dependent on each other and at which iteration, is a very
computational intensive procedure. MatParser employs a particular
linear programming technique to find the data-dependencies, based on
parametric integer programming (PIP) as proposed by Paul Feautrier. To
appreciate the implementation of MatParser, we will explain in this
paper in enough detail the basics of the linear programming technique
used.
Ed F. Deprettere, Edwin Rijpkema, and Bart Kienhuis``Application
and Architecture Modeling for Parallel Execution of Jacobi-type
Algorithms'',33rd Asilomar conference on signals, systems, and
computers, October 24 -- 27, Pacific Grove, CA. Abstract
-
In wireless
communications of tomorrow, high-resolution space-time filtering
techniques will play a crucial role. To make these techniques broadly
available, low-power, low-cost embedded processors for efficient
implementation must be designed. As many of these space-time filters
implement routines that are built on robust Jacobi-type algorithms, it
makes sense to design a processor with a good performance-cost ratio
over a subset of the Jacobi-type algorithms. This paper summarizes the
steps in a design procedure that takes as its input executable
specifications in Matlab of a subset of Jacobi-type algorithms and
outputs one or more candidate processors for efficient execution of
these algorithms. The procedure heavily relies on models of
computation and models of realization to specify algorithm and
architecture instances for evaluation of performance and cost metrics
through extensive simulation. Details of the various steps can be
found in the list of references.
Edwin Rijpkema, Bart Kienhuis and Ed F. Deprettere,``Compilation
from Matlab to Process Networks'', Presented at the Second
International Workshop on Compiler and Architecture Support for
Embedded Systems (CASES'99), October 1-3 1999, Washington. Abstract
-
Novel high-performance,
domain specific, embedded architectures are more and more composed of
a microprocessor, some memory, and a number of dedicated coprocessors.
Onto these embedded architectures, applications will be executed that
belong to the domain of multi-media processing, mobile communication,
and adaptive array processing. In general, these applications are
written using an imperative model of computation most commonly C or
Matlab. A better specification format would be to use an inherent
parallel model of computation like Process Networks. This paper
presents a three-step approach to automatically transform a class of
Matlab specification into a process network specification.
A.C.J. Kienhuis,``Design
Space Exploration of Stream-based Dataflow Architectures: Methods and
Tools'',PhD thesis, Delft University of Technology, The Netherlands,
January 1999. (Explains the Y-chart approach in great detail.)
B. Kienhuis, E. Deprettere, K. Vissers and P. van der
Wolf, ``The
Construction of a Retargetable Simulator for an Architecture
Template'', In Proc. 6-th Int. Workshop on Hardware/Software
Codesign (CODES'98), Seattle, Washington, March 15 - 18 1998. Abstract
-
Systems in the domain of
high-performance video signal processing are becoming more and more
programmable. We suggest an approach to design such systems that
involves measuring, via simulation, the performance of various
architectures on which a set of applications are mapped. This approach
requires a retargetable simulator for an architecture template. We
describe the retargetable simulator that we constructed for a
stream-oriented application-specific dataflow architecture. For each
architecture instance of the architecture template, a specific
simulator is derived in three steps: the architecture instance is
constructed, an execution model is added, and the executable
architecture is instrumented to obtain performance numbers. We used
object oriented principles together with a high-level simulation
mechanism to ensure retargetability and an efficient simulation speed.
Finally, we explain how a retargetable simulator can be encapsulated
within an environment for automated design space exploration.
P. Lieverse, E.F. Deprettere, A.C.J. Kienhuis and E.A. de Kock,
``A
Clustering Approach to Explore Grain-sizes in the Definition of Weakly
Programmable Processing Elements'', In 1997 IEEE Workshop on Signal
Processing Systems: Design and Implementation, pp. 107-120, De Montfort
University, Leicester, UK, November 3-5 1997. Rewarded with Best
Student Paper Award. Abstract
-
We explore the efficiency
of a class of stream-oriented dataflow architectures as a function of
the grain-size, for a given set of applications. We believe the
grain-size is a key parameter in balancing flexibility and efficiency
of this class of architectures. We apply a clustering approach on the
well-defined set of applications to determine a set of weakly
programmable processing elements of different grain-sizes. The
resulting architectures are compared with respect to their silicon
area. For a set of twenty-one industrially relevant video algorithms,
we determined architectures with various grain- sizes. The results of
this exercise indicate an improvement factor of two for the silicon
area, while changing the grain-size from fine-grain to coarser- grain.
B. Kienhuis, E. Deprettere, K. Vissers and P. van der
Wolf,``An
Approach for Quantitative Analysis of Application-Specific Dataflow
Architectures'', In Proc. 11-th Int. Conf. on
Application-specific Systems, Architectures and Processors , Zurich,
Switzerland, July 14-16 1997. Abstract
-
In this paper we present an
approach for quantitative analysis of application- specific dataflow
architectures. The approach allows the designer to rate design
alternatives in a quantitative way and therefore supports him in the
design process to find better performing architectures. The context of
our work is Video Signal Processing algorithms which are mapped onto
weakly- programmable, coarse-grain dataflow architectures. The
algorithms are represented as Kahn graphs with the functionality of
the nodes being coarse- grain functions. We have implemented an
architecture simulation environment that permits the definition of
dataflow architectures as a composition of architecture elements, such
as functional units, buffer elements and communication structures. The
abstract, clock-cycle accurate simulator has been built using a
multi-threading package and employs object oriented principles. This
results in a configurable and efficient simulator. Algorithms can
subsequently be executed on the architecture model producing
quantitative information for selected performance metrics. Results are
presented for the simulation of a realistic application on several
dataflow architecture alternatives, showing that many different
architectures can be simulated in modest time on a modern workstation.
P.C. Held and A.C.J. Kienhuis ``DIV,
FLOOR, CEIL, MOD and STEP Functions in Nested Loop Programs and
Linearly Bounded Lattices'' ,Algorithms and Parallel VLSI Architectures
III, M. Moonen and F. Catthoor (Editors), Page 271 -- 282, Elsevier
Science B.V. 1995. Abstract
-
This paper describes the
conversion of nested loop programs into single assignment forms. The
nested loop programs may contain the integer operators: integer
division , floor , ceil , and modulo , in
expressions and the stride, or step size, of for loops may be greater
than one. The programs may be parametrized but must have static
control. The conversion is done completely automatical by the tool HiPars
. The output of HiPars is a single assignment program (SAP) and
a dependence graph (DG). The description of the dependence graph is
based on linearly bounded lattices. We will show the relation between
the integer division operators in the SAP and linearly bounded
lattices in the corresponding DG.
Peter Held,``Functional
Design of Data-Flow Networks'', PhD thesis, Delft University of
Technology, The Netherlands, May 1996. Added reference. This thesis was
an important inspiration source for some steps as done in the Compaan
effort.
Bart Kienhuis,``Parallelizing
Nested Loop Programs containing DIV, FLOOR, CEIL, MOD and STEP
functions'',Master thesis, Delft University of Technology, The
Netherlands, Technical Report nr: 94-132, March 1994.
Previous Research Work:
- 1994 - 1998
Philips Research (Natlab) / TU
Delft
- High Level System Evaluation ( Y-chart approach )
- 1993 - 1994 TU Delft,
CAS group
- The HiPars parallel compiler for Nested Loop Programs