Seminar DNA String Algorithms (Fall 2008)

Schedule | Participants | Contents | Deliverables | References

The Seminar DNA String Algorithms is intended for Master students in Computer Science. Lectures are in English. The seminar is organised by dr. H.J. (Hendrik Jan) Hoogeboom and dr. W.A. (Walter) Kosters. The seminar examines algorithms that deal with DNA, viewed as a string. Students should perhaps write programs and certainly study recent literature, and present their results. Meetings are on a weekly basis. Students can also suggest subjects. For a "full" list, see below.
Dates: Monday September 8, 2008 — Monday November 24, 2008 (12 sessions, see below); time: 11.15-13.00; room: 407, Snellius. Credit points: 7 ECTS (at some other locations "6 ECTS" are mentioned, but it is 7 (seven)). An extra session has been added in December. There is no written exam.

Schedule

========== PAST: ================

list of presentations moved to "Deliverables"
Monday September 8, 2008: First meeting; teams are formed (note: for second choice teams are reshuffled; this choice can be made later).
Introductory lectures, for slides see below.
Monday September 15, 2008: Distribution of papers; Discussion of subjects
Monday September 22, 2008: Problem solving session
Monday October 13, 2008: Topics and teams for second presentation.
Monday October 27, 2008: Problem solving session
Monday November 3, 2008: Problem solving session
note: not in room 407

Participants

			teams
✓	Giovanni Bernardi	`gbtito X gmail`	A1	C2
✓	Bertram Bourdrez	`bertram X bourdrez.org`	C1	F2
✓	Ramon van Dam	`ramonvandam X gmail`	G1	D2
✓	Bart van der Drift	`bartvanderdrift X hotmail`	G1	E2
✓	Sjoerd van Egmond	`sjoerdvanegmond X gmail`	E1	E2
✓	Johan Groenen	`johan.groenen X gmail`	E1	G2
-	Sjoerd Henstra	`shenstra X liacs.nl`	.
✓	Erik Jongsma	`ejongsma X liacs.nl`	F1	G2
✓	Susan Laraghy	`susan X running-gag.nl`	F1	D2
✓	Jori van Lier	`jlier X liacs.nl`	B1	F2
✓	Edgar Reehuis	`edgarreehuis X hotmail`	D1	C2
✓	Jan van Rijn	`janvanrijn X gmail`	A1	B2
	Johan de Ruiter	`johan.de.ruiter X gmail`	C1	A2
✓	Marijn Swenne	`marijn.swenne X gmail`	D1	B2
-	Frank Takes	`ftakes X liacs.nl`	B1	A2

In this seminar we will discuss several issues dealing with DNA and strings. A typical question looks like this:

Given a large piece of DNA, for example the (or a) human genome. How many unique substrings of length 25 does this string possess?

To get some ideas, see

Exact pattern matching: Aho-Corasick, Suffix trees (from Molecular Computational Biology 2008; HJH)
Bioinformatics: Fragment assembly (IPA lecture 29 June 2007; WAK)

The solutions may use sophisticated techniques like suffix arrays, but can also be brute-force, or hybrid. In scientific papers many algorithms are described for these problems. Students are supposed to discuss these papers during class. Results are presented by the students in one hour lectures (i.e., 45 minutes), where they could also contrast their own findings with book chapters and research papers: see the list of references below. Both members of the team seriously take part in the presentation. Next to the oral presentations (in PowerPoint or PDF) 10 page self-contained team essays on the subject must be written (in LaTeX); it is also possible to give two 6 page individual essays. Deadline: one month after the presentation. (Each person writes on his/her own two presentations.) Essays should be improved until students and lecturers agree on the contents. Students must be present during all sessions, and are supposed to actively take part in the presentations, e.g., ask questions. Immediately after each lecture there is a short discussion between students and organizers, where the presentation is evaluated. Some of the subjects can perhaps be split into two, together making a two hour lecture. The first meeting is used to make a proper schedule. Every student is supposed to choose and present at least two subjects, one in the first half and one in the second half of the seminar. The two-person teams split up after their first session, to make new teams.
You can use your own laptop for the presentations. If you don't have one available, you can use ours — let us know in advance! In that case, please email the .ppt- or .pdf-files a day before the presentation — at the latest. It is wise to also put them on a memory-stick. We take care of a beamer.

Prerequisites: Algorithms, Artificial intelligence, Data mining, Complexity, Datastructures (all at bachelor's level, see for instance the Dutch course pages Artificial intelligence and Complexity). And perhaps some Biology.

Students should register in advance, in August 2008, by personal visit to the organizers. The number of participants is at most 10 (approximately :).

Deliverables

team	date	first	second	done	topic
D1 (Edgar & Marijn)	29.9	WAK	HJH	✓	Fast Algorithms for Selecting ... WABI'07, Davila doi
A1 (Giovanni & Jan)	6.10	HJH	WAK	✓	Fixed Parameter Tractable ... CPM'08, Möhl doi
C1 (Bertram & JohandR)	6.10	HJH	WAK	✓	Indexing a Dictionary ... SPIRE'07, Landau doi
E1 (SjoerdvE & JohanG)	13.10	WAK	HJH	✓	Jump-Matching with Errors SPIRE'07, Butman doi
F1 (Erik & Susan)	13.10	WAK	HJH	✓	More Efficient Algorithms ... RECOMB'08, Ma doi
G1 (Ramon & Bart)	20.10	WAK	HJH	✓	Optimal Self-adjusting Trees ... SPIRE'07, Ko doi
B1 (Jori (& Frank))	10.11	WAK	HJH	✓	Faster Algorithm for the Set ... CPM'08, Gąsieniec doi
B2 (Jan & Marijn)	10.11	HJH	WAK	✓	Efficient Text Proximity Search SPIRE'07, Schenkel doi
C2 (Giovanni & Edgar)	17.11	WAK	HJH	✓	A Faster Algorithm for RNA ... WABI'08, Ziv-Ukelson doi
G2 (Erik & JohanG)	17.11	WAK	HJH	✓	Searching for Gapped Palindromes CPM 2008, Kolpakov doi
D2 (Susan & Ramon)	24.11	WAK	HJH	✓	Tuning Approximate Boyer-Moore ... SPIRE'07, Kalsi doi
E2 (Bart & SjoerdvE)	24.11	WAK	HJH	✓	Minimum Common String Partition WABI'08, Damaschke doi
F2 (Jori & Bertram)	1.12	HJH	WAK	✓	CompostBin: A DNA Composition-Based ... RECOMB 2008, Chatterji doi
A2 ((Frank &) JohandR)	1.12	HJH	WAK	✓	Computation of Median Gene Clusters RECOMB 2008, Böcker doi

The columns "first" and "second" denote the person handling your paper. A period "." means that that person is still busy reading ;-).
And what's "done"✓ is done.

References

References are either books or papers, see the examples below. Books are usually available in rooms 159 and 162. More papers can be found in various conference proceedings, we have provided some candidate conferences. Remember that most doi's can only be used from computers within the university domain. Many other books and the internet may be helpful.

Conferences

CPM (Combinatorial Pattern Matching) 2008.
RECOMB (Research in Computational Molecular Biology) 2008.
SPIRE (String Processing and Information Retrieval) 2007, 2008
WABI (Workshop Algorithms in Bioinformatics) 2007, 2008.

Books

M. Crochemore, C, Hancart and T. Lecroq, Algorithms on Strings, Cambridge University Press, 2007.
D. Gusfield, Algorithms on Strings, Trees and Sequences, Cambridge University Press, 1997.
M. Kasahara and S. Morishita, Large-scale Genome Sequence Processing, Imperial College Press, 2006; Google books.
W.F. Smyth, Computing Patterns in Strings, ACM Press Books, 2003.

Papers

Jie Lin, Yue Jiang and Don Adjeroh, The Virtual Suffix Tree: An Efficient Data Structure for Suffix Trees, The Prague Stringology Conference, 2008
B. Phoophakdee and M.J. Zaki, Genome-scale disk-based suffix tree indexing, ACM SIGMOD International Conference on Management of Data, 2007; Proceedings pp. 833-844; doi.
S.J. Puglisi, W.F. Smyth and A.H. Turpin, A taxonomy of suffix array construction algorithms, ACM Computing Surveys 39 (2007) 2-4; doi.
K.-B. Schürmann and J. Stoye, An incomplex algorithm for fast suffix array construction, Software: Practice and Experience 37 (2006) 309-329; doi.

Questions/remarks: kosters@liacs.nl.

21 June 2010 — http://www.liacs.nl/home/kosters/semdna/index.html