Bachelor project proposals

The overview presented on January 15, 2014, can be found here.

Algorithms

Visualization of structural data augmented with spatial constraints

Background & motivation

Treemapping is a method for displaying hierarchical data structures using nested rectangles. Each branch of the tree is given its own rectangle, also called a tile, which is then tiled with smaller rectangles representing sub-branches. A node's tile has an area proportional to a specified dimension of the data.

The method has become a popular tool in data visualization, able to effectively display at least two data dimensions: one via the area allocated to the corresponding rectangles and another via their color or an associated image. As a simple example, think of a map showing the folders on your computer, where the rectangle corresponding to each folder has an area proportional to its size.

To create a treemap, one must define a tiling algorithm, i.e., a way to divide a rectangle into sub-rectangles of specified areas. There are multiple ways to make such a division. Existing algorithms try to optimize some global property of the allocation, for example, making the rectangles as close to square as possible. However, in many applications it may be desirable to associate certain areas with certain data items based on their meaning, user expectations, or relational constraints with respect to other data elements. For example, to show the volumes of human body organs on such a map, it is better to place the rectangles so that their positions reflect the positions of the organs in the human body.
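
As a concrete illustration of what a tiling algorithm does, here is a minimal sketch of the classic slice-and-dice scheme (written in Python purely for illustration; it is not the constraint-based method this project is about):

# Minimal slice-and-dice tiling sketch (illustrative only): split a rectangle
# into sub-rectangles whose areas are proportional to the given weights.
# In a full treemap the procedure is applied recursively, alternating the
# split direction at each level of the hierarchy.

def slice_and_dice(rect, weights, horizontal=True):
    """rect = (x, y, width, height); returns one sub-rectangle per weight."""
    x, y, w, h = rect
    total = float(sum(weights))
    tiles, offset = [], 0.0
    for weight in weights:
        frac = weight / total
        if horizontal:
            tiles.append((x + offset, y, w * frac, h))
            offset += w * frac
        else:
            tiles.append((x, y + offset, w, h * frac))
            offset += h * frac
    return tiles

# Example: divide the unit square among three folders of sizes 50, 30 and 20.
print(slice_and_dice((0, 0, 1, 1), [50, 30, 20]))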

Most data sources that describe structural data, such as human body anatomy, do not provide any indication of how to distribute the corresponding elements on a treemap. We can either manually create a suitable layout or, alternatively, associate spatial constraints with these data items, e.g., “the head is placed above all other parts” or “the left hand should be placed to the left of the right hand”, and use a constraint-satisfaction algorithm to create a layout that satisfies all given constraints.

Research question, problem or aim of the project

The goal of this project is to implement an algorithm for automated generation of layouts satisfying spatial constraints. The basic task is to implement a simple brute-force algorithm for finding all layouts that adhere to a given set of orthogonal constraints in the form “tile X should be placed left/right of tile Y” or “tile X should be placed above/below of tile Y”. It is expected that a simple language or a handy intuitive format to describe positional constraints will be created, and given a data set (tree) and such constraints, the tool will create a set of appropriate layouts.
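
To make the basic task concrete, below is a hedged sketch of such a brute-force search over grid placements. The constraint triples, the tile names, and the grid-based layout model are assumptions made for illustration only, not the constraint format or data model to be designed in the project:

# Brute-force enumeration of grid layouts satisfying orthogonal constraints.
# Constraints are triples (a, relation, b) with relation 'left_of' or 'above';
# this format is an illustrative assumption, not a proposed design.
from itertools import permutations

def satisfying_layouts(tiles, rows, cols, constraints):
    """Assign tiles to distinct cells of a rows x cols grid and keep only
    the assignments that satisfy every constraint."""
    cells = [(r, c) for r in range(rows) for c in range(cols)]
    for assignment in permutations(cells, len(tiles)):
        pos = dict(zip(tiles, assignment))
        ok = True
        for a, relation, b in constraints:
            (ra, ca), (rb, cb) = pos[a], pos[b]
            if relation == 'left_of' and not ca < cb:
                ok = False
            if relation == 'above' and not ra < rb:
                ok = False
        if ok:
            yield pos

tiles = ['head', 'torso', 'left_hand', 'right_hand']
constraints = [('head', 'above', 'torso'),
               ('left_hand', 'left_of', 'right_hand')]
for layout in satisfying_layouts(tiles, 2, 2, constraints):
    print(layout)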

There are many opportunities to extend this work to (i) deal with more sophisticated constraints, (ii) apply the method to other types of visualization methods (circular treemaps, graphs, spatial trees, etc.), (iii) work on the optimization techniques or heuristics to find an appropriate layout faster or select a layout which is optimal according to some other metric (e.g., average aspect ratio, etc.), (iv) study the properties of the method (performance, scalability, usability) when applied to large datasets, (v) build impressive demos involving bio-medical ontologies (body anatomy, cells, proteins, genes, etc.).

Expected types of work

Programming in C++, Programming in other languages than C++, Developing algorithms, Running experiments, Developing a graphical user interface, Developing a command line interface

Context of the project

The treemapping prototype that allows users to control tile positions manually is implemented in JavaScript. However, the algorithm can be implemented in any language of your choice; in that case, it should export the result to a text file (JSON would be best for integration with the existing visualization tool).

It is advisable, although not strictly necessary, to learn about the research on treemaps and existing tiling algorithms to better understand why the method in this project is needed.

The suggested method has a very promising application to the automated schematic visualization of complex data (such as the structure of the human brain) that typically takes considerable effort to draw manually (e.g., for anatomy textbooks).

The algorithm sketch you will need to implement is outlined in the following paper: N. Kokash, B. de Bono, J. Kok: "Template-based treemaps to preserve spatial constraints", International Conference on Information Visualization Theory and Applications (IVAPP), to appear. A preliminary version is available at http://nkokash.com/documents/IVAPP2014-treemaps.pdf

Required background and interests of the students

Algorithmic complexity, software development

Work plan

The approximate roadmap is as follows:
  • Learn about data visualization using treemaps
  • Understand the idea of template-based treemaps and the suggested algorithm for constraint satisfaction
  • Implement the basic brute-force algorithm for building templates that define layouts satisfying orthogonal constraints
  • Evaluate the algorithm's complexity and scalability
  • Write a report/paper describing the outcome
  • Consider any of the further research directions mentioned above (see research questions)

Names, emails of the supervisors/advisors

Natallia Kokash, natallia.kokash@gmail.com
Joost Kok, joost@liacs.nl

Generation of 3D schematic images for biomedical data

Background & motivation

In computer science, ontologies represent knowledge as collections of concepts within a domain, using a shared vocabulary to denote the types, properties, and interrelationships of those concepts. Ontologies created by experts in some domain often become publicly available and are widely used by various software tools for concept disambiguation, reasoning, information retrieval, etc. Ontologies have gained immense importance in bioinformatics as frameworks for handling complex structural information about the human body and other living organisms. Ontologies representing knowledge about human anatomy, genes, proteins, cells, diseases, etc. are available.

While ontologies are successfully used by many software tools, the human brain cannot efficiently perceive and reason about large amounts of data in structured textual form. A visual representation of these data may help biomedical students and researchers to learn and analyze body anatomy and the corresponding biological processes. Since manually drawn images are hard to create, navigate, scale, and maintain, it is useful to generate visually comprehensive schematics of body parts from ontological knowledge. However, most ontologies, such as foundational models of anatomy, do not provide any indication of how to distribute the corresponding elements on a surface or in space. Therefore, we are working on methods to quickly specify how images corresponding to individual anatomical concepts should be allocated on a surface or in space, and to produce visual schematics that can be analyzed by biomedical experts.

Research question, problem or aim of the project

The goal of this project is to investigate how to extend ontologies with positional information and to generate 3D images representing these data. As a starting point, a method for generating 2D images using treemaps should be studied. The extension of template-based treemap generation to the 3D case is rather straightforward and comprises the basic task in this project. Given an existing solution for the 2D case, your task will be to implement a tool for defining layout templates and a rendering algorithm to display the corresponding 3D treemaps. Apart from that, there are many opportunities to realize the aforementioned idea using other visualization techniques. A curious and creative student can investigate other ways to define positional constraints on structured data in order to generate images that are more easily perceived by the human brain.

Expected types of work

Programming in other languages than C++, Developing algorithms, Running experiments, Software engineering, Developing a graphical user interface, Reading scientific computer science literature

Context of the project

One of the methods we use for visualizing ontological data is treemapping. Treemapping is a method for displaying hierarchical data structures using nested rectangles. Each branch of the tree is given its own rectangle, also called a tile, which is then tiled with smaller rectangles representing sub-branches. A node's tile has an area proportional to a specified dimension of the data. The method has become a popular tool in data visualization, able to effectively display at least two data dimensions using the area allocated to the corresponding rectangles and their color or an associated image. As a simple example, think of a map showing the folders on your computer, where the rectangle corresponding to each folder has an area proportional to its size. To create a treemap, one must define a tiling algorithm, i.e., a way to divide a rectangle into sub-rectangles of specified areas. There are multiple ways to make such a division. We developed a method to control the desired positions of tiles by defining templates that prescribe the relative positions of data items in the treemap, e.g., “the head is placed above all other parts” or “the left hand should be placed to the left of the right hand”. We expect you to extend this method to define three-dimensional constraints and produce 3D images. Although many languages can be used to implement the aforementioned tools, we advise you to use JavaScript, as in that case the code for 2D template-based treemap generation can be reused.
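
To give an idea of the elementary geometric step involved, here is a minimal sketch of splitting an axis-aligned 3D box into sub-boxes with volumes proportional to given weights (written in Python purely for illustration; the project itself would more naturally be done in JavaScript, and a template-based generator would derive the split axis and ordering from the template rather than take them as parameters):

# Elementary step for a 3D treemap: split an axis-aligned box into sub-boxes
# whose volumes are proportional to the given weights.  The split axis is a
# parameter here; a template-based generator would take it from the template.

def split_box(box, weights, axis=0):
    """box = (origin, size), both 3-tuples of floats; axis in {0, 1, 2}."""
    origin, size = box
    total = float(sum(weights))
    boxes, offset = [], 0.0
    for w in weights:
        extent = size[axis] * (w / total)
        new_origin = list(origin)
        new_origin[axis] += offset
        new_size = list(size)
        new_size[axis] = extent
        boxes.append((tuple(new_origin), tuple(new_size)))
        offset += extent
    return boxes

# Example: split the unit cube into three layers along the z-axis.
print(split_box(((0.0, 0.0, 0.0), (1.0, 1.0, 1.0)), [2, 1, 1], axis=2))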

Required background and interests of the students
  • JavaScript
  • Some acquaintance with 3D graphics (no advanced techniques will be needed; the final images are schematic and can be produced using basic geometric shapes available in any library).
Work plan
  • analyze related work on automatic visualization of structured data
  • learn about template-based treemaps
  • develop a tool to define layout templates for the 3D environment
  • modify the algorithm for the 2D treemap generation to generate 3D treemaps
  • prepare a set of sample images to illustrate the result
  • write the project report and, if interesting results are obtained, a scientific paper presenting the work
  • optionally, think of other methods to extend structural data with spatial information and visualize such data
Names, emails of the supervisors/advisors

Natallia Kokash, natallia.kokash@cwi.nl
Joost Kok, joost@liacs.nl

Slides Siegfried Nijssen (PDF)

Solving jungle checkers (PDF)

Tree of life (PDF)

Mining graphs with labels

Background & motivation

Graph mining tools have been used in the past to analyze databases of molecular structures and to identify promising molecular fragments for use in medicines. A weakness of these tools, however, is that they do not deal well with graph data annotated with additional information. Such additional information is commonly available in chemistry: for instance, atoms in molecules can be hydrogen donors, hydrogen acceptors, part of an aromatic ring, and so on.

Research question, problem or aim of the project

The aim of this project is to extend an existing graph mining system such that it deals well with labels in graphs. There are two questions that need to be answered: 1) can labels be included without making the graph mining algorithm less efficient? 2) does the inclusion of labels improve the quality of the mining results?

Expected types of work

Programming in C++, Developing algorithms, Running experiments, Reading scientific computer science literature

Context of the project

The student will continue working on an implementation of the gSpan algorithm described here:

Xifeng Yan, Jiawei Han: gSpan: Graph-Based Substructure Pattern Mining. ICDM 2002: 721-724
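
Purely as an illustration of the kind of annotated data involved (the field names and the matching rule below are assumptions, not the internal format of the existing gSpan implementation), a molecule with extra per-atom annotations can be represented as a labeled graph like this:

# Illustrative sketch: a molecular graph whose atoms (nodes) carry, besides
# an element label, a set of extra annotations such as 'hydrogen_donor' or
# 'aromatic'.  Field names are assumptions made only for this sketch.

molecule = {
    'nodes': {
        0: {'element': 'O', 'annotations': {'hydrogen_acceptor'}},
        1: {'element': 'C', 'annotations': {'aromatic'}},
        2: {'element': 'N', 'annotations': {'hydrogen_donor', 'aromatic'}},
    },
    'edges': [
        (0, 1, {'bond': 'single'}),
        (1, 2, {'bond': 'aromatic'}),
    ],
}

def node_matches(pattern_node, data_node):
    """A label-aware miner must decide when a pattern node matches a data node:
    here, exact match on the element and subset match on the annotations."""
    return (pattern_node['element'] == data_node['element'] and
            pattern_node['annotations'] <= data_node['annotations'])

print(node_matches({'element': 'N', 'annotations': {'aromatic'}},
                   molecule['nodes'][2]))   # True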

Required background and interests of the students

The student should be interested in algorithm development and data mining. An interest in biology and/or chemistry would be helpful, but is not necessary.

Work plan

1) The student will study the existing graph mining system. 2) A proposal will be made for expanding this system. 3) This extension will be implemented. 4) Experiments will be performed. 5) Depending on the outcome of the experiments, further extensions can be considered.

Names, emails of the supervisors/advisors

Siegfried Nijssen, s.nijssen@liacs.leidenuniv.nl

Declarative data mining

Background & motivation

Most data mining systems are currently implemented in procedural programming languages such as C++ and Java. While these languages are powerful, implementing new data mining tasks in them is time-consuming. The situation is very different in the database world, where languages such as SQL make it possible to implement databases quickly without writing a single line of C++ or Java. A key feature of SQL is that it allows a programmer to specify which task (s)he is interested in, without requiring the programmer to specify how the computer has to find an answer. A language with this property is called declarative.

Research question, problem or aim of the project

This project will study the possibilities for developing an "SQL for data mining". Questions include: which systems can be used to implement such a language? What would such a language look like? Which tasks are easy to implement and which are hard?

Expected types of work

Programming in C++, Programming in other languages than C++, Making mathematical proofs, Developing algorithms, Running experiments, Reading scientific computer science literature

Context of the project

The student will build on earlier research, which showed that constraint programming systems can be used to implement a range of data mining tasks:

Tias Guns, Siegfried Nijssen, Luc De Raedt: Itemset mining: A constraint programming perspective. Artif. Intell. 175(12-13): 1951-1983 (2011).
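
To illustrate the declarative idea in miniature, the sketch below states the mining task ("an itemset is frequent") as a predicate and leaves the search to a generic enumeration. It is a pure-Python toy under simplifying assumptions and uses neither a constraint programming system nor Numberjack:

# Declarative toy version of frequent itemset mining: the 'what' is a
# predicate, the 'how' is a generic enumeration over candidate itemsets.
from itertools import combinations

transactions = [{'a', 'b', 'c'}, {'a', 'c'}, {'a', 'd'}, {'b', 'c'}]
min_support = 2

def frequent(itemset):
    """Constraint: the itemset is contained in at least min_support transactions."""
    return sum(itemset <= t for t in transactions) >= min_support

items = sorted(set().union(*transactions))
solutions = [set(c) for k in range(1, len(items) + 1)
             for c in combinations(items, k) if frequent(set(c))]
print(solutions)   # [{'a'}, {'b'}, {'c'}, {'a', 'c'}, {'b', 'c'}]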

There are several open questions that can be answered.

One is practical in nature: can the above approach also be made to work successfully in Python?

Python is currently the most commonly used language by data scientists. However, it is procedural and not declarative in nature. It will be explored whether a declarative extension is possible by building on the Numberjack library:

http://numberjack.ucc.ie/

Ideally, the approach would be linked to a popular Python-based machine learning toolkit:

http://scikit-learn.org/

Another question is theoretical in nature. The above approach is simplified, as it does not take probabilistic models or statistical tests into account. Clearly, statistics are very important in many data mining tasks. A theoretical question is how to include probabilistic models in a constraint programming framework.

Required background and interests of the students

This project is not an applied project. The practical project involves programming in languages other than C++ (Python, CP languages). The theoretical project requires mathematical skills and allows more freedom in the choice of implementation language.

Work plan

The student will first study the existing work; for the practical project, a first implementation will be made immediately afterwards; for the theoretical project first a conceptual proposal will be made.

Names, emails of the supervisors/advisors

Siegfried Nijssen, s.nijssen@liacs.leidenuniv.nl

Mining a conference

Background & motivation

Recently we organized a large conference in data mining and machine learning:

http://www.ecmlpkdd2013.org/

We collected a large amount of information at this conference:
  • 450 conference submissions and 150 journal submissions, with abstracts, author names, and author affiliations
  • 1800 reviews, with answers to questions about the papers, and final decisions (accepted or rejected at the conference)

Note that we hence have different types of data:
  • Text: reviews, abstracts
  • Attribute-value data: topical categories, nationalities, accepted or not

Finally, there is the possibility to link this data to publicly available social networks. For instance, from http://dblp.uni-trier.de/ co-authorship graphs and citation graphs can be constructed for both authors and reviewers of the papers at the conference.

Many conferences are collecting similar data.

Research question, problem or aim of the project

We are interested in analysing this wealth of conference data.

Questions include:
  • Can we predict whether a paper is accepted?
  • Can we predict the length of a review?
  • Can we predict the verdict of a review based on its text?
  • Are there large differences between topics of machine learning and data mining?
  • Can we predict whether a paper should receive a summary reject?
  • Can we predict how long it will take to review a paper?
  • Can social network data be used to make the above tasks easier?

Expected types of work

Programming in C++, Programming in other languages than C++, Running experiments, Reading scientific computer science literature

Context of the project

This project will use existing data mining software as much as possible. This may be in Weka, KNIME, Orange, R or Python - depending on the preferences of the student.
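
As a hedged sketch of what a first experiment in Python could look like with scikit-learn (the toy abstracts and labels below are placeholders standing in for the data actually exported from the conference system):

# First-experiment sketch: predict acceptance from abstract text with
# scikit-learn.  The lists below are toy placeholders for the real data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

abstracts = ["graph mining with labels", "deep networks for vision",
             "itemset mining revisited", "kernels for structured data",
             "clustering noisy streams", "reinforcement learning agents",
             "topic models for reviews", "scalable matrix factorisation"]
accepted = [1, 0, 1, 1, 0, 0, 1, 0]   # 1 = accepted, 0 = rejected

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
scores = cross_val_score(model, abstracts, accepted, cv=4)
print("cross-validated accuracy:", scores.mean())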

We do not currently have social network data. A topic of interest is to write software to collect such data from DBLP for the authors in our database. This involves addressing problems such as named entity recognition.

Required background and interests of the students

The student should have an interest in data mining and data science. Programming in C++ is likely not necessary.

Work plan

The student will first need to understand the data and will run small experiments ignoring the textual data. Depending on the outcome of these results, either textual data or social network data will be included. The student will have to become familiar with these as well.

Names, emails of the supervisors/advisors

Siegfried Nijssen, s.nijssen@liacs.leidenuniv.nl
Hendrik Blockeel, hendrik.blockeel@cs.kuleuven.be

Patterns in data visualization

Background & motivation

Data mining aims at gaining insight into data. One of the most direct ways of getting insight into data is to show the spreadsheet containing it. Unfortunately, however, most datasets are too large to show in a spreadsheet on one screen. Other visualization techniques are necessary to provide a more insightful perspective on the data.

One possibility is to observe a relationship between data matrices and images: if we see a data matrix as an image, we are essentially interested in zooming out. However, it can be shown that the zoomed-out view of a dataset depends highly on the order of the columns and rows chosen in the visualization. An important question is hence how to choose a good order.
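
The following hedged sketch illustrates the effect: the same 0/1 matrix, shown as an image before and after reordering its rows and columns. The ordering used here (sorting by row and column sums) is only a naive baseline, not the pattern-mining-based ordering the project should investigate:

# Show a 0/1 data matrix as an image before and after a simple reordering.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=0)
data = rng.integers(0, 2, size=(60, 40))      # a random 0/1 data matrix

row_order = np.argsort(data.sum(axis=1))      # reorder rows by density
col_order = np.argsort(data.sum(axis=0))      # reorder columns by density
reordered = data[row_order][:, col_order]

fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.imshow(data, cmap='gray_r')
ax1.set_title('original order')
ax2.imshow(reordered, cmap='gray_r')
ax2.set_title('rows/columns reordered')
plt.show()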

Research question, problem or aim of the project

The aim of this project is to investigate how pattern mining systems can be used to find a good order for rows and columns of data, and hence, nice visualizations of datasets.

Expected types of work

Programming in C++, Developing algorithms, Running experiments, Developing a graphical user interface, Reading scientific computer science literature

Context of the project

This project will build on existing tools for unsupervised pattern mining (in most cases, itemset mining), which are mostly implemented in C++.

Required background and interests of the students

Students need to be interested in developing software in C++, drawing visualizations and running experiments. The project will also involve algorithm development, as the output of pattern mining systems needs to be transformed into an ordered set of rows and columns.

Work plan

The student will first study existing pattern mining tools. Next, a visualization tool will be developed that assumes a given order of rows and columns. Subsequently, it will be studied how the output of pattern mining systems can be used to draw images of data matrices, and the different possibilities will be compared experimentally.

Names, emails of the supervisors/advisors

Siegfried Nijssen, s.nijssen@liacs.leidenuniv.nl

Overview slides natural computing (PDF)

Overview slides algorithms (PDF)

Solving jungle checkers (slides) (PDF)

Tree of life (slides) (PDF)

Computer systems

Overview slides embedded research center (PDF)

Overview slides high performance computing (PDF)

Foundations of Software Technology

KAT partial derivatives

Background & motivation

Kleene Algebra with Tests (KAT) is an extension of regular expressions with conditional symbolic tests. KAT expressions are useful, for example, for expressing symbolic computations of programs.

Research question, problem or aim of the project

The aim of this project is to develop the theoretical work necessary to associate a non-deterministic finite automaton with a KAT expression using partial derivatives. Partial derivatives are a technique that associates with each expression E and each letter a of the alphabet a set of expressions denoting the remainder of the original E after reading a.
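
As a small illustration of the underlying construction, here is a minimal Python sketch of Antimirov-style partial derivatives for plain regular expressions (without tests); the project would lift this construction to KAT expressions:

# Partial derivatives of plain regular expressions (no tests yet).
# Expressions are nested tuples:
#   ('empty',), ('eps',), ('sym', a), ('alt', e1, e2), ('cat', e1, e2), ('star', e)

def nullable(e):
    """Does the language of e contain the empty word?"""
    tag = e[0]
    if tag in ('eps', 'star'):
        return True
    if tag in ('empty', 'sym'):
        return False
    if tag == 'alt':
        return nullable(e[1]) or nullable(e[2])
    return nullable(e[1]) and nullable(e[2])   # 'cat'

def pderiv(e, a):
    """Set of partial derivatives of e with respect to the letter a."""
    tag = e[0]
    if tag in ('empty', 'eps'):
        return set()
    if tag == 'sym':
        return {('eps',)} if e[1] == a else set()
    if tag == 'alt':
        return pderiv(e[1], a) | pderiv(e[2], a)
    if tag == 'cat':
        ds = {('cat', d, e[2]) for d in pderiv(e[1], a)}
        if nullable(e[1]):
            ds |= pderiv(e[2], a)
        return ds
    return {('cat', d, e) for d in pderiv(e[1], a)}   # 'star'

# Example: the derivatives of (ab)* by 'a' are the successor states of the NFA.
expr = ('star', ('cat', ('sym', 'a'), ('sym', 'b')))
print(pderiv(expr, 'a'))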

Expected types of work

Making mathematical proofs, Developing algorithms, Running experiments

Context of the project

The student is expected to study existing algorithms for calculating partial derivatives, and to apply them to KAT expressions.

Required background and interests of the students

Knowledge of formal languages and automata.

Work plan

The expected end result is a thesis describing KAT and the novel algorithm for constructing an automaton from a KAT expression via partial derivatives.

Names, emails of the supervisors/advisors

Marcello Bonsangue, m.m.bonsangue@liacs.leidenuniv.nl
Jurriaan Rot, jurriaanrot@gmail.com

Parsing Boolean grammars

Background & motivation

Boolean grammars are an extension of context-free grammars with intersection and negation operators.

Research question, problem or aim of the project

The aim of this project is to develop parsing techniques for Boolean grammars, and syntactic formats that characterize regular languages.

Expected types of work

Making mathematical proofs, Developing algorithms

Context of the project

The student is expected to study existing work on Boolean grammars and to develop new efficient parsing algorithms for them.

Required background and interests of the students

The project is rather open, and it involves knowledge on automata and formal languages.

Work plan

The student is expected to deliver a thesis describing her/his own results.

Names, emails of the supervisors/advisors

Marcello Bonsangue, m.m.bonsangue@liacs.leidenuniv.nl
Jurriaan Rot, jurriaanrot@gmail.com

Music from streams

Background & motivation

Streams are infinite sequences of numbers. In some cases we can represent them finitely as solutions of systems of equations involving operations on the "head" of the stream and its "tail".

Numbers can be thought of as encoding musical notes, in the simplest case by assigning each number to an octave.

Research question, problem or aim of the project

In this project we aim to build a system that transforms stream definitions, given in terms of heads and tails, into musical notes. The idea is to interactively generate a MIDI file from the numbers of a stream defined by the user.
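
A minimal sketch of the idea, assuming streams are modelled as Python generators defined by head and tail and numbers are mapped into a single octave; writing the resulting note numbers to an actual MIDI file is left to whatever MIDI library is chosen:

# Streams as generators defined by head and tail, mapped to MIDI note numbers.
# The mapping (modulo 12 above middle C) is an illustrative choice.
from itertools import islice

def nats(start=0):
    """The stream sigma with head(sigma) = start, tail(sigma) = nats(start + 1)."""
    yield start                  # head
    yield from nats(start + 1)   # tail

def even(stream):
    """even(sigma) = head(sigma) : even(tail(tail(sigma)))."""
    while True:
        yield next(stream)       # keep the head
        next(stream)             # skip the first element of the tail

def to_notes(stream, base=60):
    """Map each number to a MIDI note number in the octave above middle C."""
    for n in stream:
        yield base + (n % 12)

print(list(islice(to_notes(even(nats())), 8)))   # first eight notes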

Expected types of work

Programming in other languages than C++

Context of the project

The student will need to study the theoretical background of streams and coinductive definitions.

Required background and interests of the students

Basic knowledge of MIDI programming is required. Knowledge of Haskell is a plus.

Work plan

The student can start immediately. The expected end result is a piece of software and a thesis explaining the work done.

Names, emails of the supervisors/advisors

Marcello Bonsangue, m.m.bonsangue@liacs.leidenuniv.nl
Jurriaan Rot, jurriaanrot@gmail.com

Bisimulation-up-to for formal languages

Background & motivation

Bisimulation is a very powerful coinductive technique for proving the equality of two languages. Bisimulation-up-to is a recent improvement of this technique that has, for example, recently led to one of the best algorithms for proving the equivalence of non-deterministic automata.
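
For intuition, here is a small Python sketch of the plain bisimulation method (without the up-to enhancement) for checking language equivalence of two deterministic automata; the dictionary encoding of automata is an assumption made only for this sketch:

# Plain coinductive equivalence check for two complete DFAs (no up-to yet):
# try to build a bisimulation that contains the pair of initial states.
# Automata: {state: {letter: next_state}} plus a set of accepting states.

def equivalent(delta1, acc1, init1, delta2, acc2, init2, alphabet):
    relation = set()                       # candidate bisimulation
    todo = [(init1, init2)]
    while todo:
        p, q = todo.pop()
        if (p, q) in relation:
            continue
        if (p in acc1) != (q in acc2):     # outputs differ: not bisimilar
            return False
        relation.add((p, q))
        for a in alphabet:
            todo.append((delta1[p][a], delta2[q][a]))
    return True

# Example: two automata accepting the words with an even number of a's.
d1 = {0: {'a': 1}, 1: {'a': 0}}
d2 = {'x': {'a': 'y'}, 'y': {'a': 'x'}}
print(equivalent(d1, {0}, 0, d2, {'x'}, 'x', ['a']))   # True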

Research question, problem or aim of the project

In this project we aim to implement several bisimulation-up-to techniques for language inclusion and equivalence.

Expected types of work

Programming in other languages than C++, Running experiments

Context of the project

The project will be carried out in Maude, a programming language suitable for handling algebraic structures and reasoning. However, no prior knowledge of Maude is required.

Required background and interests of the students

The student should have knowledge of formal languages and automata theory.

Work plan

The expected result is a piece of software written in Maude and a thesis explaining the work done.

Names, emails of the supervisors/advisors

Marcello Bonsangue, m.m.bonsangue@liacs.leidenuniv.nl
Jurriaan Rot, jurriaanrot@gmail.com

Overview slides (PDF)

Software engineering (PDF)

Petri Nets: Boundedness and coverability

Background & motivation

Finiteness of the state space and coverability of states are important notions in the analysis of the behaviour of distributed dynamic systems. For Place/Transition nets, finiteness can be decided using a monotonicity property of the firing rule, and the classical coverability tree construction also relies on this property. Recently, new results have been published on the construction of minimal coverability sets.
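
For illustration, the sketch below (in Python, a simplification rather than a full Karp-Miller implementation) shows the ingredients the classical coverability tree construction builds on: markings as vectors of token counts, the monotone firing rule, and the ω-acceleration applied when a new marking strictly covers an ancestor:

# Ingredients of the classical coverability tree construction for
# Place/Transition nets: markings as tuples of token counts, the monotone
# firing rule, and omega-acceleration when a marking strictly covers an
# ancestor on the path from the root.  OMEGA stands for "arbitrarily many".
OMEGA = float('inf')

def enabled(marking, pre):
    """A transition is enabled if every place holds at least 'pre' tokens."""
    return all(m >= p for m, p in zip(marking, pre))

def fire(marking, pre, post):
    """Monotone firing rule: consume 'pre' tokens, produce 'post' tokens."""
    return tuple(m - p + q for m, p, q in zip(marking, pre, post))

def accelerate(marking, ancestor):
    """If 'marking' strictly covers 'ancestor', set the growing places to OMEGA."""
    if marking != ancestor and all(m >= a for m, a in zip(marking, ancestor)):
        return tuple(OMEGA if m > a else m for m, a in zip(marking, ancestor))
    return marking

# Example: a transition consuming one token from place 0 and producing one
# token on each of places 0 and 1 -- place 1 is unbounded.
pre, post = (1, 0), (1, 1)
m0 = (1, 0)
m1 = fire(m0, pre, post) if enabled(m0, pre) else m0   # (1, 1)
print(accelerate(m1, m0))                               # (1, inf)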

Research question, problem or aim of the project

The aim of the project is to provide an overview of results on the construction of coverability sets and coverability trees for different net classes. Special attention should be paid to complexity and to the role of monotonicity.

Expected types of work

Developing algorithms, Reading scientific computer science literature

Context of the project

This project would be part of ongoing research into finiteness, coverability, and boundedness issues for extended (not necessarily monotonic) Petri net models.

Required background and interests of the students

Theory, algorithmic complexity

Work plan

Start from the available literature, create an annotated bibliography, and compile the results into a comprehensive thesis. There would be room to deviate and explore a particularly interesting case (and provide/implement an algorithm of one's own).

Names, emails of the supervisors/advisors

Jetty Kleijn h.c.m.kleijn (at) liacs.leidenuniv.nl

Compatibility in teams

Background & motivation

Message loss and deadlocks are two main reasons why a distributed environment with many interacting components may function in a suboptimal manner. Team automata can be used to model collaboration and communication between system components. A definition of correct interaction between the component automata of a team (compatibility) has been proposed recently.

Research question, problem or aim of the project

Compatibility checks are relevant when developing correct-by-construction multi-component systems. The aim of this project is to implement an algorithm to check whether component automata that are to form a team automaton are compatible.

Expected types of work

Programming in C++, Programming in other languages than C++, Developing algorithms, Reading scientific computer science literature

Context of the project

Team automata theory with a view to application to the design of (asynchronous) systems.

Required background and interests of the students

Theory of concurrency, automata theory, algorithmic interest

Work plan

Start from a few research papers to extract the necessary concepts, identify the problem and the proposed approach. Implement the algorithm supported by a thesis providing all necessary background and information regarding the design decisions.

Names, emails of the supervisors/advisors

Jetty Kleijn h.c.m.kleijn (at) liacs.leidenuniv.nl

Nets that do not count

Background & motivation

So-called 'set nets' are a novel Petri net model introduced as a semantical model for reaction systems, a framework for the investigation of processes carried out by biochemical reactions in living cells. In contrast to standard Petri net models like Place/Transition systems with their multiset-based token arithmetic, set nets support set-based operations on markings. Moreover, due to the lack of quantification, their execution semantics is also quite different from that of Elementary Net systems.

Research question, problem or aim of the project

Several initial results have been obtained already in the investigation of set nets and their behaviour. The aim of this project is to continue the research into (extended) set nets and to further investigate, e.g., their expressive power, their causality and concurrency semantics, and their relationship with other Petri net models.

Expected types of work

Developing algorithms, Reading scientific computer science literature

Context of the project

This project is part of ongoing research in the area of Petri nets and biologically motivated computing.

Required background and interests of the students

Theory; an interest in implementing algorithms (adding to existing tools) is appreciated.

Work plan

After a brief period of getting acquainted with the model and gaining insight into existing results, concrete research questions are identified together with the supervisor, depending on the student's interests. The project should lead to a comprehensive thesis (and possibly an implementation of relevant algorithms).

Names, emails of the supervisors/advisors

Jetty Kleijn h.c.m.kleijn (at) liacs.leidenuniv.nl

Design and development of FpML

Background & motivation

The market for 'over-the-counter' traded (OTC) derivatives (i.e. traded outside of the regulated exchanges), practically non-existent in the late nineteen eighties, has grown in size from about USD 90 trillion in 2000 to about USD 650 trillion outstanding notional amount just before the start of the financial crisis mid-2008.

The Financial products Markup Language (FpML) is an XML-based business information exchange standard for Internet-based electronic dealing and processing of OTC derivatives, developed in 1997 by JP Morgan (a leading OTC-derivatives dealer) together with PricewaterhouseCoopers as one of the core technologies to help manage this massive growth of very complex derivative transactions.

Following the financial meltdown of 2008, triggered by the default of Lehman Brothers, regulators around the world identified the expanded OTC-derivatives market as a major source of systemic risk, and at the G20 Pittsburgh summit in 2009 they decided on a concerted action to bring this market under their control. FpML was identified as one of the key technologies to be included in the infrastructure needed to make that work.

Research question, problem or aim of the project

FpML can be considered as a 'domain specific' computer language specifically designed to solve a number of problems that (part of) the financial industry and its regulators have with managing large volumes of very complex data.

The assignment is to investigate a number of research questions, including:
  1. What exactly were the problems that triggered the development of FpML?
  2. Which considerations led to the design choices made?
  3. How well did FpML perform in solving the problems it was designed to solve (which, by the way, might have changed in the course of its history)?
  4. What have turned out to be shortcomings of FpML and how could these be solved? Here you can give your own ideas as well as relate them to efforts undertaken by the financial industry to improve FpML.

Expected types of work

Reading outside the computer science literature, Reading scientific computer science literature

Context of the project

The size of Financial Institutions has grown enormously in the last 30 years or so, which to a great extent would be unthinkable without the massive application of modern information and communications technology. The IT-managers struggling to maintain and develop the ICT-infrastructure of these Institutions have difficulty keeping up with the growing complexity of this infrastructure using the methods currently at their disposal. As the problems encountered here are 'domain specific', we might expect that the solutions will also need to be domain specific. The FaST group has started a project to investigate the potential of domain specific languages for financial markets. FpML is an interesting case in point where the industry tried to walk this road.

Required background and interests of the students

This is a project specifically intended for Informatica-en-Economie students, preferably with an interest in financial markets.

Work plan

Literature study; possibly interviewing of experts or users. This should lead to a comprehensive thesis addressing the questions above with adequate referencing.

Names, emails of the supervisors/advisors

Pieter Kwantes kwant000 (at) planet.nl

Jetty Kleijn h.c.m.kleijn (at) liacs.leidenuniv.nl

Ontological adequacy of BPMN

Background & motivation

The Business Process Redesign (BPR) movement, gaining wide popularity in the nineteen nineties by the writings of people like Michael Hammer and Thomas Davenport, raised in its wake much interest in languages specifically suited for the modeling of business processes. BPMN (Business Process Model and Notation) is a graphical language of which version 2.0 was published by the Object Management Group (OMG) in 2011 as a standard for modeling business processes.

One of the main advantages of using graphical modeling languages like BPMN or UML is that it enables us to make a conceptual model of some part of reality (like a business process or an information system) in such a way that it allows communication and mutual understanding of that part of reality between stakeholders.

However, the extent to which this approach is effective will depend on the quality of the conceptual model, which in turn will depend on the quality of the modeling language used.
One way to determine the quality of a modeling language is by establishing its 'ontological adequacy'. The ontological adequacy of a language expresses the extent to which the vocabulary of that language adequately represents the concepts that are modeled with the language. Guizzardi (2005, PhD thesis, University of Twente) applied this quality measure to the UML modeling language. The purpose of this bachelor project is to apply this quality measure to analyze the ontological adequacy of BPMN and, depending on the outcome of this analysis, to propose improvements.

Research question, problem or aim of the project

The aim of this research project is to determine the quality of BPMN as a modeling language for business processes using the concept of ontological adequacy, and to identify possible ways to improve the language. The assignment is to investigate the following questions:
  1. What is the ontology that is implicitly present in the BPMN standard as published by the OMG?
  2. To what extent does this ontology capture the essential concepts in the business process domain?
  3. What are possible improvements to the ontology derived as an answer to question 2 above?
  4. What are possible improvements to the BPMN standard, based on the answer to question 3, using the concept of ontological adequacy?

Expected types of work

Developing algorithms, Reading outside the computer science literature, Reading scientific computer science literature

Context of the project

Increasing the abstraction level of programming languages has proved in the past to be a successful strategy for increasing the productivity of software development. Much of the research undertaken under the heading of 'Model Driven Development' and 'Domain Specific Languages' involves the design of programming languages motivated by the desire to increase this abstraction level even further in a direction closer to the language that is used by the human (domain) expert. What is the best way to design such a language is still largely an open question. One potential direction for research that looks promising is to derive the design of a language from the ontology of the domain that the language is supposed to describe. The purpose of this research project is to find supporting evidence for that hypothesis.

Required background and interests of the students

This is a project specifically intended for Informatica-en-Economie students, preferably with an interest in (business process) modeling and formal languages.

Work plan
  1. Literature study to get acquainted with the concepts of ontology, ontology engineering, and ontological adequacy.
  2. Analysis of the documents containing the BPMN 2.0. standard published by the OMG and identify the core business process concepts and their relationships as present in these documents.
  3. Design of an ontology implied by the concepts and relationships identified in step number 2.
  4. Establish any problems that might be present in the ontology resulting from the third step (possibly using secondary literature commenting on the BPMN 2.0. standard as published by the OMG).
  5. Propose improvements to the ontology to solve the problems identified in step 4.
  6. Derive proposals for improvement of the BPMN 2.0. standard based on the concept of ontological adequacy studied in step 1 and the results of step 5.
Names, emails of the supervisors/advisors

Pieter Kwantes kwant000 (at) planet.nl

Jetty Kleijn h.c.m.kleijn (at) liacs.leidenuniv.nl

Imagery & Media

Overview slides bioinformatics (PDF)

Overview slides multimedia (PDF)

Linking persons via spatio-temporal data (PDF)

Technology & Innovation Management

Overview slides (PDF)

For comments or questions, contact Siegfried Nijssen.