I am a researcher at the intersection of Natural Language Processing (NLP) and Information Retrieval (IR). My research focus is on text mining and retrieval in specific domains. I have a particular interest in user-centric methods (interactive information access) and I am involved in a number of projects on social media mining, collaborating with social scientists.

I am an Associate Professor at the Leiden Institute of Advanced Computer Science (LIACS). I am affiliated with the Data Science Research Programme of Leiden University. I am group leader of Text Mining and Retrieval Leiden.

I currently supervise projects that develop and evaluate text mining and retrieval methods in a variety of domains. My group works with a wide range of textual data: archaeological reports, patents, scientific and legal publications, EU meeting reports, health records, newspaper texts, user-generated content in online patient communities (discussion forums), and posts on social media.

I collaborate with many different public and private partners such as Wolters Kluwer, Zylab, Nedap Healthcare, Medstone, ZetaAlpha, and the Dutch National Institute for Public Health and the Environment (RIVM).

Go to the TMR homepage

News and updates

December 2020:

  • We concluded the year in my research group on December 18 with a roundtable in which everyone shared a photo, something positive and something negative about 2020. Tweets.
  • On December 15, I was interviewed about false news, social media analytics, and the role of AI during the SAILS event “The Future of AI is Human”. Tweet.
  • I attended COLING 2020 (online) from 8 to 11 December. Together with Yuting Hu, I presented our paper on Named Entity Recognition for Chinese biomedical patents.
  • On December 7, I gave an invited lecture in the Glasgow Information Retrieval seminar series on Explainable IR. Video. Tweet.
  • The last lecture of the Text Mining course was on December 2nd. The online course was a much better experience than I had expected, thanks to the active participation of the students and my great TAs. We had ~90 participants, of whom ~60 came to the online lectures.

November 2020:

  • Reviewing season! I am a reviewer for CHIIR, senior programme committee member for TheWebConf and ECIR, and area chair for EACL.
  • Only a few lectures left in the Text Mining course. In lectures 11 and 13, we have presentations of scientific papers. In groups of 4, the students choose a paper from the list, read and discuss it, make sure they understand it, and present it in 15 minutes in break-out sessions.

October 2020:

  • Myrthe Reuver, PhD student at the VU Amsterdam on the project “Rethinking news algorithms: nudging users towards diverse news exposure”, has started her project and we welcomed her as a guest member of TMR.
  • The MISDOOM symposium on Misinformation in Open Online Media, held online on October 26-27, was a great success! (tweets about the event)
  • I participated in the discussion panel on Misinformation and Covid-19 during the MAISoN workshop on October 19.
  • Master student Mohamed Barbouch presented his paper “Combining Language Models and Network Features for Relevance-based Tweet Classification” at #SocInfo20
  • We have a PhD position open at the UvA on “Automated Social network and news media analysis to promote cancer screening” (a ZonMW project coordinated by Gert-Jan de Bruijn)
  • Only a few weeks left until #MISDOOM2020, the 2nd Multidisciplinary International Symposium on Disinformation in Open Online Media. We have compiled a very interesting online programme with 70 regular presentations and 3 invited talks: 2020.misdoom.org/program/
  • My PhD student Anne Dirkson explained her work on text mining from patient discussion forums in the ‘3 October University‘, and she did really great!
  • We have been awarded a grant in the Archeologie Telt programme of NWO-SSH on multi-lingual text mining and retrieval for archaeology! This means that I will keep working with Alex Brandsen for the next years.

September 2020:

  • The paper “Named Entity Recognition for Chinese biomedical patents”, work by my master student Yuting Hu, was accepted for COLING 2020!
  • My former master students Ioannis Chios and Anneloes Louwe received their diploma during an online graduation ceremony.
  • The research meetings of my research group TMR are still held online. We welcome some new group members!
  • My PhD student Alex Brandsen was interviewed about his work on Natural Language Processing and Information Retrieval methods for Archaeological texts
  • After six months of working from home, I was able to go to the Snellius building for one day per week in September. It was great to see colleagues again. Unfortunately, by the end of September, the Covid restrictions became more strict again and I went back to working from home for the whole semester.
  • We kicked off the online master course Text Mining with around 100 students and – luckily – 4 excellent teaching assistants!

August 2020:

  • After 3.5 years on the tenure track, I have been awarded tenure and promotion to associate professor (‘universitair hoofddocent’)!
  • I took some very essential vacation 🙂

July 2020:

June 2020:

  • End of semester! I taught a bachelor course (elective) on Data Science with about 30 students. Halfway through, we had to switch to online teaching, and we survived.
  • The paper “Helping results assessment by adding explainable elements to the deep relevance matching model” with master student Ioannis Chios was accepted for the workshop on ExplainAble Recommendation and Search (EARS2020) at SIGIR
  • Historical linguist Lauren Fonteyn has been awarded a Digital Infrastructure grant for the project MacBERTh, in which she will develop a BERT model for historical English.
  • My colleague Antal van den Bosch talked on De Taalstaat on Radio 1 about our project in which we measure well-being during the corona crisis through social media analysis.

May 2020:

  • Our paper (Alex Brandsen et al.) on creating a NER training dataset in the archaeology domain has been published in the LREC2020 proceedings.
  • Our paper (Hugo de Vos and Suzan Verberne) on Challenges of Applying Automatic Speech Recognition for Transcribing EU Parliament Committee Meetings has been published in the Proceedings of the Second ParlaCLARIN Workshop.
  • We have published a first draft of a Diversity, Equity and Inclusion (DEI) checklist for ACM-SIGIR events, as a reference for the community
  • We have a PhD position open for the project ‘Rethinking news algorithms: nudging users towards diverse news exposure’. The position is at the VU Amsterdam and the first supervisor is Antske Fokkens. I am the second supervisor.

April 2020:

  • I am involved in a project on Social Media Monitoring of the RIVM (National Institute for Public Health and the Environment), focussing on the extraction of behavioral aspects related to COVID-19 from open social media. I supervise one of my PhD students (Anne Dirkson) and two master students (Mohamed Barbouch and Rayan Suryadikara) in this project.
  • On April 22, we had the annual meeting of the H2020 RISE project ‘Social Media Analytics’, as a virtual online event. I chaired the session about prototypes, addressing the 2020 deliverables for the analysis of social media content.
  • All group meetings and thesis presentations have moved online as well.
  • Two of my PhD students, Gineke Wiggers and Juan Pablo Bascur, presented their work in the 10th International Workshop on Bibliometric-enhanced Information Retrieval (BIR 2020) on April 14. At the invitation of the organizers, I recorded a video to congratulate them on the 10th edition of the workshop.
  • On April 14, I chaired the doctoral consortium at ECIR together with Stefan Rueger. It was a successful online event with two inspiring keynotes.

March 2020:

  • Mare published an interview with me about the H2020 RISE project ‘Social Media Analytics’, in which COVID-19 is a relevant current case study.
  • An interview about my experience with working and teaching from home was published on the university’s home page.
  • All education has been moved online. My bachelor course has been adapted accordingly with web lectures and interactive sessions in Kaltura: http://tmr.liacs.nl/DS.html
  • We had to postpone the MISDOOM 2020 conference because of the COVID-19 situation.


  • LIACS organizes BNAIC/BeneLearn 2020 in Leiden. I am one of the local organizers and in charge of the website.

January 2020:

  • My group has 7 presentations at CLIN30, 2 talks and 5 posters.
  • I am happy to be part of a new interdisciplinary NWO project with Ioulia Ossokina (TU/e), Theo Arentze (TU/e), and Vladimir Karamychev (EUR): BEL (Behaviour, Energy transition, Low income): “Tenants’ behavioural responses to residential energy transition: are intended energy savings feasible?”
  • On March 6, 2020 we are organizing the symposium ‘360 Degrees of Data Science‘, giving the floor to Data Science in new and exciting domains.

December 2019:

  • I am proud that I am one of the nominees for the faculty of science’s Teacher of the Year award!
  • I am currently recruiting two international PhD students in the H2020 project DoSSIER, on Domain-specific Information Extraction and Retrieval. Application web page.
  • I am a member of the newly established Ethics Review Committee of the Science Faculty
  • I am general co-chair of MISDOOM 2020: the 2nd Multidisciplinary International Symposium on Disinformation in Open Online Media, to be held on April 20-22, 2020 in Leiden

November 2019:

  • I presented two posters at the Dutch-Belgian Information Retrieval workshop: one together with Benjamin van der Burgh about experiments on Dutch data showing the benefits of ULMFiT for classification with small datasets, and one with Alex Brandsen on the release of the Dutch BERT model, BERT-NL.
  • My group (in particular Alex Brandsen and Benjamin van der Burgh) has published a number of Dutch-language data sets, together with pre-trained BERT and ULMFiT language models on textdata.nl.
  • NWO published the First national research agenda for Artificial Intelligence (AIREA-NL), for which I was a member of the expert committee
  • On November 5, I gave an invited lecture in the Lorentz workshop “The Future of Academic Lexicography” on Text Mining for Lexicography. There was a large, interested and engaging crowd. I uploaded the slides here.

October 2019:

  • Pre-print just out: “The merits of Universal Language Model Fine-tuning for Small Datasets – a case with Dutch book reviews” by Benjamin van der Burgh and Suzan Verberne on evaluating the effectiveness of ULMFiT for small training sets. Paper on arXiv. Data on GitHub.
  • Vacancies! 15 fully-funded PhD positions on Domain-Specific Search (EU Marie Curie Action project). See the webpage of the DoSSIER project. I am hiring for projects 6 and 7, addressing transparency and explainability in legal search. Note that if you have been living in the Netherlands for the last 2 years, you cannot apply for my projects, but you can apply for those in the other countries.

September 2019:

  • On September 10, I was one of the speakers at the event ‘DIT WORDT HET NIEUWS‘, on the future of journalism. My message to the chief editors of three national newspapers: good data journalism is the key.
  • First Text Mining lecture of the semester with a packed room!
  • It is the start of the academic year! We welcome new master students in our group for research projects and thesis projects!
  • The course schedule for the Text Mining master course has been published here

August 2019:

July 2019:

  • I presented the paper “Extracting and Matching Patent In-text References to Scientific Publications” (with Ioannis Chios and Jian Wang) in the BIRNDL workshop at SIGIR 2019 in Paris.
  • I chaired the Women in IR session during SIGIR 2019 in Paris.
  • I was interviewed for the Volkskrant, on the topic of Natural Language Processing for Scientific Discovery. “Hoe kunstmatige intelligentie nieuwe kennis opduikt uit miljoenen wetenschappelijke artikelen” (“How artificial intelligence digs up new knowledge from millions of scientific articles”, July 19, 2019)

June 2019:

May 2019:

  • Our proposal “Constructing a Unified Knowledge Base by joint Deep Learning from images and text” in the Innovation talent programme Leiden-XJTU joint PhD on AI/Bioscience was accepted! This means that Xue Wang will continue her PhD project in LIACS under supervision of Fons Verbeek and me.
  • Our paper “Lexical Normalization of User-Generated Medical Forum Data” (Anne Dirkson, Suzan Verberne, and Wessel Kraaij) was accepted for the workshop on Social Media Mining for Health Applications at ACL 2019. In addition, Anne successfully participated in all shared tasks at the SMM4H workshop, resulting in the paper “Transfer learning for health-related Twitter data”.
  • Together with a number of LIACS colleagues, I am part of the project Curriculum Development in Data Science and Artificial Intelligence / DS&AI, funded by Erasmus+
  • The spring semester is almost finished! I taught my last lecture in the Data Science bachelor course. The final course schedule can be found here.
  • I organize an open meeting in Utrecht on clinical NLP on June 12, with 8 presentations (academic and non-academic) and discussion on the challenges of using text data in health records for knowledge extraction and predictive models. Program and registration.
  • I am one of the course coordinators of the SIKS course ‘Advances in Information Retrieval’, together with Arjen de Vries and Djoerd Hiemstra. The course takes place on October 8th and 9th in Utrecht.

News/updates archive