Seminar DNA String Algorithms (Fall 2008)

Schedule | Participants | Contents | Deliverables | References


The Seminar DNA String Algorithms is intended for Master students in Computer Science. Lectures are in English. The seminar is organised by dr. H.J. (Hendrik Jan) Hoogeboom and dr. W.A. (Walter) Kosters. The seminar examines algorithms that deal with DNA, viewed as a string. Students should perhaps write programs and certainly study recent literature, and present their results. Meetings are on a weekly basis. Students can also suggest subjects. For a "full" list, see below.
Dates: Monday September 8, 2008 — Monday November 24, 2008 (12 sessions, see below); time: 11.15-13.00; room: 407, Snellius. Credit points: 7 ECTS (at some other locations "6 ECTS" are mentioned, but it is 7 (seven)). An extra session has been added in December. There is no written exam.

Schedule

Participants

   teams
Giovanni Bernardigbtito X gmail A1C2
Bertram Bourdrezbertram X bourdrez.org C1F2
Ramon van Damramonvandam X gmail G1D2
Bart van der Driftbartvanderdrift X hotmail G1E2
Sjoerd van Egmondsjoerdvanegmond X gmail E1E2
Johan Groenenjohan.groenen X gmail E1G2
-Sjoerd Henstrashenstra X liacs.nl .
Erik Jongsmaejongsma X liacs.nl F1G2
Susan Laraghysusan X running-gag.nl F1D2
Jori van Lierjlier X liacs.nl B1F2
Edgar Reehuisedgarreehuis X hotmail D1C2
Jan van Rijnjanvanrijn X gmail A1B2
 Johan de Ruiterjohan.de.ruiter X gmail C1A2
Marijn Swennemarijn.swenne X gmail D1B2
-Frank Takesftakes X liacs.nl B1A2

Contents

In this seminar we will discuss several issues dealing with DNA and strings. A typical question looks like this:
Given a large piece of DNA, for example the (or a) human genome. How many unique substrings of length 25 does this string possess?
To get some ideas, see The solutions may use sophisticated techniques like suffix arrays, but can also be brute-force, or hybrid. In scientific papers many algorithms are described for these problems. Students are supposed to discuss these papers during class. Results are presented by the students in one hour lectures (i.e., 45 minutes), where they could also contrast their own findings with book chapters and research papers: see the list of references below. Both members of the team seriously take part in the presentation. Next to the oral presentations (in PowerPoint or PDF) 10 page self-contained team essays on the subject must be written (in LaTeX); it is also possible to give two 6 page individual essays. Deadline: one month after the presentation. (Each person writes on his/her own two presentations.) Essays should be improved until students and lecturers agree on the contents. Students must be present during all sessions, and are supposed to actively take part in the presentations, e.g., ask questions. Immediately after each lecture there is a short discussion between students and organizers, where the presentation is evaluated. Some of the subjects can perhaps be split into two, together making a two hour lecture. The first meeting is used to make a proper schedule. Every student is supposed to choose and present at least two subjects, one in the first half and one in the second half of the seminar. The two-person teams split up after their first session, to make new teams.
You can use your own laptop for the presentations. If you don't have one available, you can use ours — let us know in advance! In that case, please email the .ppt- or .pdf-files a day before the presentation — at the latest. It is wise to also put them on a memory-stick. We take care of a beamer.

Prerequisites: Algorithms, Artificial intelligence, Data mining, Complexity, Datastructures (all at bachelor's level, see for instance the Dutch course pages Artificial intelligence and Complexity). And perhaps some Biology.

Students should register in advance, in August 2008, by personal visit to the organizers. The number of participants is at most 10 (approximately :).

Deliverables

team   date     first     second   done   topic
D1 (Edgar & Marijn)29.9WAKHJHFast Algorithms for Selecting ...
WABI'07, Davila doi
A1 (Giovanni & Jan)6.10HJHWAKFixed Parameter Tractable ...
CPM'08, Möhl doi
C1 (Bertram & JohandR)6.10HJHWAKIndexing a Dictionary ...
SPIRE'07, Landau doi
E1 (SjoerdvE & JohanG)13.10WAKHJHJump-Matching with Errors
SPIRE'07, Butman doi
F1 (Erik & Susan)13.10WAKHJHMore Efficient Algorithms ...
RECOMB'08, Ma doi
G1 (Ramon & Bart)20.10WAKHJHOptimal Self-adjusting Trees ...
SPIRE'07, Ko doi
B1 (Jori (& Frank))10.11WAKHJHFaster Algorithm for the Set ...
CPM'08, Gąsieniec doi
B2 (Jan & Marijn)10.11HJHWAKEfficient Text Proximity Search
SPIRE'07, Schenkel doi
C2 (Giovanni & Edgar)17.11WAKHJHA Faster Algorithm for RNA ...
WABI'08, Ziv-Ukelson doi
G2 (Erik & JohanG)17.11WAKHJHSearching for Gapped Palindromes
CPM 2008, Kolpakov doi
D2 (Susan & Ramon)24.11WAKHJHTuning Approximate Boyer-Moore ...
SPIRE'07, Kalsi doi
E2 (Bart & SjoerdvE)24.11WAKHJHMinimum Common String Partition
WABI'08, Damaschke doi
F2 (Jori & Bertram)1.12HJHWAKCompostBin: A DNA Composition-Based ...
RECOMB 2008, Chatterji doi
A2 ((Frank &) JohandR)1.12HJHWAKComputation of Median Gene Clusters
RECOMB 2008, Böcker doi

The columns "first" and "second" denote the person handling your paper. A period "." means that that person is still busy reading ;-).
And what's "done"✓ is done.

References

References are either books or papers, see the examples below. Books are usually available in rooms 159 and 162. More papers can be found in various conference proceedings, we have provided some candidate conferences. Remember that most doi's can only be used from computers within the university domain. Many other books and the internet may be helpful.

Conferences

  1. CPM (Combinatorial Pattern Matching) 2008.
  2. RECOMB (Research in Computational Molecular Biology) 2008.
  3. SPIRE (String Processing and Information Retrieval) 2007, 2008
  4. WABI (Workshop Algorithms in Bioinformatics) 2007, 2008.
Books
  1. M. Crochemore, C, Hancart and T. Lecroq, Algorithms on Strings, Cambridge University Press, 2007.
  2. D. Gusfield, Algorithms on Strings, Trees and Sequences, Cambridge University Press, 1997.
  3. M. Kasahara and S. Morishita, Large-scale Genome Sequence Processing, Imperial College Press, 2006; Google books.
  4. W.F. Smyth, Computing Patterns in Strings, ACM Press Books, 2003.
Papers
  1. Jie Lin, Yue Jiang and Don Adjeroh, The Virtual Suffix Tree: An Efficient Data Structure for Suffix Trees, The Prague Stringology Conference, 2008
  2. B. Phoophakdee and M.J. Zaki, Genome-scale disk-based suffix tree indexing, ACM SIGMOD International Conference on Management of Data, 2007; Proceedings pp. 833-844; doi.
  3. S.J. Puglisi, W.F. Smyth and A.H. Turpin, A taxonomy of suffix array construction algorithms, ACM Computing Surveys 39 (2007) 2-4; doi.
  4. K.-B. Schürmann and J. Stoye, An incomplex algorithm for fast suffix array construction, Software: Practice and Experience 37 (2006) 309-329; doi.


Questions/remarks: kosters@liacs.nl.

21 June 2010 — http://www.liacs.nl/home/kosters/semdna/index.html