Data download

The data collections that we created and described in our publications are available to researchers in the field.

The Nijmegen 2011 query intent data set

(added March 2013)

The dataset consists of:

  • A txt-file containing 605 queries with annotations according to our multi-dimensional intent classification scheme;
  • A txt-file (README.txt) containing documentation.

You can download the data as a zipped archive here. If you use the data, please refer to this paper:

Why-questions with snippets from Bing and relevance assessments for each snippet

(added March 2011)

This download set (as described in Verberne et al., 2011) consists of:

  • A txt-file containing 238 questions with 10 snippets per question + relevance assessments on a 3-point scale;
  • A txt-file (00about.txt) containing documentation.

You can download the data as a gzipped tar archive here.

Why-questions and answers with relevance labels for machine learning (learning-to-rank) purposes

(added 2010)

This download set (as described in Verberne et al., 2010) consists of:

  • A txt-file containing 186 questions with 150 candidate answers per question + labels for your own feature extraction;
  • A txt-file containing 37 feature values for 150*186 answers + labels in SVMlight format for machine learning purposes;
  • A txt-file containing documentation.

You can download the data as a gzipped tar archive here.
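
For readers unfamiliar with the SVMlight format mentioned above, the following is a minimal sketch of how one line of such a file could be read in Python. It assumes the general SVMlight layout (a label, an optional `qid:` field as used for ranking, then `index:value` pairs and an optional `#` comment); whether the file in this download set contains `qid` fields or comments is an assumption, so consult the documentation file in the archive. The names `Example` and `parse_svmlight_line` and the sample line are hypothetical.

```python
from typing import Dict, NamedTuple, Optional


class Example(NamedTuple):
    """One labelled feature vector (hypothetical container for illustration)."""
    label: float
    qid: Optional[str]
    features: Dict[int, float]


def parse_svmlight_line(line: str) -> Optional[Example]:
    """Parse one SVMlight-style line; return None for blank or comment-only lines."""
    # Drop a trailing "# comment" and surrounding whitespace.
    body = line.split("#", 1)[0].strip()
    if not body:
        return None
    tokens = body.split()
    label = float(tokens[0])
    qid = None
    features: Dict[int, float] = {}
    for token in tokens[1:]:
        key, _, value = token.partition(":")
        if key == "qid":
            # Ranking data groups candidates per question via qid (assumed here).
            qid = value
        else:
            features[int(key)] = float(value)
    return Example(label, qid, features)


# Made-up sample line, not taken from the data set.
example = parse_svmlight_line("1 qid:3 1:0.5 2:0.0 37:1.25 # hypothetical line")
```

Grouping the parsed examples by `qid` (if present) would give one list of candidate answers per question, which is the unit a learning-to-rank method works on.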

Why-questions from the Webclopedia collection with Wikipedia answer fragments

(added March 2007)

This download set (as described in Verberne et al., 2007b) consists of:

  • An Excel sheet with 400 randomly selected why-questions from the Webclopedia set (questions submitted to the online QA system answers.com, gathered by Hovy et al.) and, for each question, a Wikipedia text fragment giving the answer and a pointer to the complete Wikipedia document;
  • A zip-file containing all complete Wikipedia documents that are referred to in the Excel sheet;
  • A zip-file containing all answer fragments in context (complete paragraph and sometimes also the previous paragraph or heading), manually annotated with RST structures (Carlson et al. 2003);
  • A readme file.

You can download the data here.

Why-questions and answers formulated to RST-annotated WSJ-texts

(added January 2007)

This download set (as described in Verberne et al., 2007) consists of:

  • Seven documents selected from the RST Treebank (Carlson et al., 2003), both the annotated and the unannotated versions, used for elicitation;
  • All 372 why-questions and the corresponding answers, formulated by native speakers;
  • A readme file.

All files are plain text files.

You can download the data here.

Why-questions and answers formulated to newspaper texts

(added March 2006)

This download set (as described in Verberne et al., 2006) consists of:

  • The source documents from Reuters and Guardian news archives, used for elicitation;
  • All 395 why-questions and the corresponding answers, formulated by native speakers;
  • 211 user-formulated paraphrases and the 166 corresponding questions;
  • A readme file.

All files are plain text files.

You can download the data here.