Data Mining is concerned with the discovery of patterns and relations
in large collections of data. Most current-day data mining algorithms can
only handle data with a fixed structure, where the data schema is defined in advance.
However, data on the web and bioinformatics databases often lack such a regular structure.
We call such data semi-structured.
The MISTA project aims at mining this type of data.
Since in normal databases all data items share the same structure, the structure
of the data items plays no role in standard data mining. For semi-structured
databases, however, the structure of an individual data item encodes an important
part of its semantics. Therefore, mining for patterns and models in the structure
of the data, e.g., for frequent substructures, is an important aspect of mining
semi-structured databases. In that respect, semi-structured data mining is close
to multi-relational data mining and Inductive Logic Programming: in
both cases patterns have structure.
The aim of the project is to develop algorithms
that discover rules or patterns from large collections of semi-structured data.
We will focus on the discovery of patterns that represent interactions between entities,
like frequent sub-trees and frequent sub-graphs. These can be seen as
generalizations of frequent item sets, which form the basis of the popular
association rule algorithms.
For example,
for data stored in tree structured XML documents (widely occurring in practice),
instead of discovering frequent item sets one would like to discover frequent (sub-)trees.
Frequent trees give a feeling for the general information content in the database: it is
a way of summarizing the data. This information can then be used to formulate
queries, as a guideline for building indexes, as a basis for
structure based clustering and for discovering access patterns.
Another example arises in the analysis of
accesslogs from webservers, where the behavior
of visitors produces lots of different trees (graphs) that can
be examined with respect to the occurrence of frequent patterns.
The project is divided into two parts: one which focuses on the algorithms for
finding frequent trees and graphs (main location: Leiden), and one which aims at the employment of
user-defined constraints and condensed representations (main location: Utrecht).
PhD Position 1: Algorithms for frequent trees and graphs (Leiden)
Finding frequent sub-trees is not straightforward.
First one has to decide on the type of trees and on a notion of inclusion on trees.
Here are several possibilities that will heavily influence the algorithm.
For example, are the elements in the tree ordered or not? Do we consider sets or
multi-sets? When do we consider a tree to
be a sub-tree of another tree?
In this context, it is also possible to consider graphs.
The occurrence of cycles may complicate the matter further.
(For example, in a movie database
a director may serve as an actor in his/her own movie.)
There is an interesting link with the theory of parallel processes
which can be further exploited.
Checking of the inclusion relation should be performed as efficiently as possible: speed is an important issue.
The actual mining algorithm should be a balance
between generation of candidate trees,
pruning of the search space
and checking of the inclusion of candidate trees on all
elements of the database.
Assumptions about the size of the database will also influence the algorithm.
For example, algorithms simplify if one can assume that one can store inclusion
information. Also, the fact that the whole database may or may not fit in main memory,
necessitates different approaches.
The actual performance of the different algorithms
needs to be understood by means of
a thorough theoretical analysis.
PhD Position 2: Constraints and condensed representations (Utrecht)
In many cases, a substantial part of the patterns generated
by frequent pattern mining algorithms is of little or no interest to the user.
Hiding or even eliminating such results is clearly beneficial.
We consider two lines of attack:
- Constraint-based mining: devise algorithms that employ
user-defined constraints in order to reduce the number of uninteresting results.
The use of constraints to prune the search space in
algorithms for finding frequent itemsets has been studied extensively
in the literature. An important question from the efficiency
viewpoint is how far the constraints can be pushed into the mining process.
To extend the ideas of constraint-based mining to cover frequent patterns in
semi-structured data, we consider the following issues:
- The definition of classes of useful constraints on the structures
(trees and graphs) to be mined from semi-structured data.
- The development of a categorization of those constraints
similar to what has been done for mining frequent itemsets (antimonotone, monotone, etc).
- The development of algorithms that can use such constraints to
increase the efficiency of the search process.
The class of pattern-growth algorithms
appears to be a promising candidate, but other possibilities will be considered as well.
- Condensed representations: develop more compact representations
of the frequent patterns instead of their simple enumeration, and develop algorithms
that mine directly for those condensed representations. Well-known examples of
condensed representation for frequent item sets are maximal and closed frequent itemsets.
The underlying idea of both is that they represents all frequent
itemsets but are far smaller. Our aim is to develop condensed representations for patterns,
i.e., trees and graphs, that are mined from semi-structured data, and to develop algorithms
for their discovery.
Applications and Cooperation
Candidates must have a Master's degree in computer science.
Candidates are offered a 4-year contract
as a PhD researcher ("assistent in opleiding" or "promovendus"); the gross monthly salary is
euro 1683 in the first year and increases progressively to euro 2394
in the forth year. Salary and benefits are conform the
Collective Employment Agreement for Dutch Universities.
Candidates are expected to produce a PhD thesis at the end of the fourth year.
The researchers design and develop prototypes,
that can be implemented and tested on synthetic
and real life databases, such as a bioinformatics database, a supermarket dataset and
a movie database:
the different approaches need to be verified for their practical implications.
To this end we have contacts with
companies such as insurance firms and large retailers, as well as a close cooperation
with the department of Biology of Utrecht University and the Netherlands Institute of
Developmental Biology (Hubrecht Laboratory).
In order to promote cooperation between the researchers we
expect regular meetings, also on the group level.
MISTA is a joint research project of the
Algorithms and Programmethodology group from the
Leiden Institute of Advanced Computer Science (LIACS) of
Leiden University, and the
Large Distributed Databases group
from the
Institute of Information and Computing Sciences (ICS)
of Utrecht University, both in the Netherlands.
The two universities offer a full Bachelor, Master and PhD program
in different branches of computer science.
The project is funded by the
Netherlands Organisation for Scientific Research (NWO),
with project number 612.066.304.
The full text of the MISTA proposal is available
here
(PDF, 112 kB).
For further information,
please contact the project leaders
dr. W.A.(Walter) Kosters
(Leiden)
and
dr. A.J.(Ad) Feelders
(Utrecht).
Applications should be sent to the project leaders.