MISTA: Mining in Semi-Structured Data

Data Mining is concerned with the discovery of patterns and relations in large collections of data. Most current-day data mining algorithms can only handle data with a fixed structure, where the data schema is defined in advance. However, data on the web and bioinformatics databases often lack such a regular structure. We call such data semi-structured. The MISTA project aims at mining this type of data.

Since in normal databases all data items share the same structure, the structure of the data items plays no role in standard data mining. For semi-structured databases, however, the structure of an individual data item encodes an important part of its semantics. Therefore, mining for patterns and models in the structure of the data, e.g., for frequent substructures, is an important aspect of mining semi-structured databases. In that respect, semi-structured data mining is close to multi-relational data mining and Inductive Logic Programming: in both cases patterns have structure.

The aim of the project is to develop algorithms that discover rules or patterns from large collections of semi-structured data. We will focus on the discovery of patterns that represent interactions between entities, like frequent sub-trees and frequent sub-graphs. These can be seen as generalizations of frequent item sets, which form the basis of the popular association rule algorithms. For example, for data stored in tree structured XML documents (widely occurring in practice), instead of discovering frequent item sets one would like to discover frequent (sub-)trees. Frequent trees give a feeling for the general information content in the database: it is a way of summarizing the data. This information can then be used to formulate queries, as a guideline for building indexes, as a basis for structure based clustering and for discovering access patterns. Another example arises in the analysis of accesslogs from webservers, where the behavior of visitors produces lots of different trees (graphs) that can be examined with respect to the occurrence of frequent patterns.

The project is divided into two parts: one which focuses on the algorithms for finding frequent trees and graphs (main location: Leiden), and one which aims at the employment of user-defined constraints and condensed representations (main location: Utrecht).

PhD Position 1: Algorithms for frequent trees and graphs (Leiden)

Finding frequent sub-trees is not straightforward. First one has to decide on the type of trees and on a notion of inclusion on trees. Here are several possibilities that will heavily influence the algorithm. For example, are the elements in the tree ordered or not? Do we consider sets or multi-sets? When do we consider a tree to be a sub-tree of another tree?

In this context, it is also possible to consider graphs. The occurrence of cycles may complicate the matter further. (For example, in a movie database a director may serve as an actor in his/her own movie.) There is an interesting link with the theory of parallel processes which can be further exploited.

Checking of the inclusion relation should be performed as efficiently as possible: speed is an important issue. The actual mining algorithm should be a balance between generation of candidate trees, pruning of the search space and checking of the inclusion of candidate trees on all elements of the database. Assumptions about the size of the database will also influence the algorithm. For example, algorithms simplify if one can assume that one can store inclusion information. Also, the fact that the whole database may or may not fit in main memory, necessitates different approaches. The actual performance of the different algorithms needs to be understood by means of a thorough theoretical analysis.

PhD Position 2: Constraints and condensed representations (Utrecht)

In many cases, a substantial part of the patterns generated by frequent pattern mining algorithms is of little or no interest to the user. Hiding or even eliminating such results is clearly beneficial. We consider two lines of attack:

Constraint-based mining: devise algorithms that employ user-defined constraints in order to reduce the number of uninteresting results. The use of constraints to prune the search space in algorithms for finding frequent itemsets has been studied extensively in the literature. An important question from the efficiency viewpoint is how far the constraints can be pushed into the mining process. To extend the ideas of constraint-based mining to cover frequent patterns in semi-structured data, we consider the following issues:
- The definition of classes of useful constraints on the structures (trees and graphs) to be mined from semi-structured data.
- The development of a categorization of those constraints similar to what has been done for mining frequent itemsets (antimonotone, monotone, etc).
- The development of algorithms that can use such constraints to increase the efficiency of the search process. The class of pattern-growth algorithms appears to be a promising candidate, but other possibilities will be considered as well.
Condensed representations: develop more compact representations of the frequent patterns instead of their simple enumeration, and develop algorithms that mine directly for those condensed representations. Well-known examples of condensed representation for frequent item sets are maximal and closed frequent itemsets. The underlying idea of both is that they represents all frequent itemsets but are far smaller. Our aim is to develop condensed representations for patterns, i.e., trees and graphs, that are mined from semi-structured data, and to develop algorithms for their discovery.

Applications and Cooperation

Candidates must have a Master's degree in computer science. Candidates are offered a 4-year contract as a PhD researcher ("assistent in opleiding" or "promovendus"); the gross monthly salary is euro 1683 in the first year and increases progressively to euro 2394 in the forth year. Salary and benefits are conform the Collective Employment Agreement for Dutch Universities. Candidates are expected to produce a PhD thesis at the end of the fourth year. The researchers design and develop prototypes, that can be implemented and tested on synthetic and real life databases, such as a bioinformatics database, a supermarket dataset and a movie database: the different approaches need to be verified for their practical implications. To this end we have contacts with companies such as insurance firms and large retailers, as well as a close cooperation with the department of Biology of Utrecht University and the Netherlands Institute of Developmental Biology (Hubrecht Laboratory). In order to promote cooperation between the researchers we expect regular meetings, also on the group level.

MISTA is a joint research project of the Algorithms and Programmethodology group from the Leiden Institute of Advanced Computer Science (LIACS) of Leiden University, and the Large Distributed Databases group from the Institute of Information and Computing Sciences (ICS) of Utrecht University, both in the Netherlands. The two universities offer a full Bachelor, Master and PhD program in different branches of computer science.

The project is funded by the Netherlands Organisation for Scientific Research (NWO), with project number 612.066.304.

The full text of the MISTA proposal is available here (PDF, 112 kB).

For further information, please contact the project leaders dr. W.A.(Walter) Kosters (Leiden) and dr. A.J.(Ad) Feelders (Utrecht). Applications should be sent to the project leaders.

September 29, 2003 - http://www.liacs.nl/home/kosters/mista/advertentie.html