TREC CAR

Old Data Releases (v1.4)

We provide the following data sets under a Creative Commons Attribution-ShareAlike 3.0 Unported License. They are based on content extracted from Wikipedia, which is itself licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported License.

Please cite this data set as:

Laura Dietz, Ben Gamari. 
"TREC CAR: A Data Set for Complex Answer Retrieval". 
Version 1.4, 2017. 
http://trec-car.cs.unh.edu

Data sets

  • spritzer-v1.4.zip: Six example articles, provided as articles, outlines, paragraphs, and qrels.

  • corpus-v1.4.zip: Paragraph collection covering both train and test articles.

  • release-v1.4.zip: Articles from the 50% of Wikipedia that matches the selection criterion; this half is dedicated to training. Provided as articles, outlines, and qrels, already split into five folds for cross-validation.

  • halfwiki-v1.4.tar.xz: Minimally processed version of the training 50% of Wikipedia. Feel free to use it to derive knowledge bases or to map to a knowledge base of your choice. (Uses XZ compression; unpacked: 11 GB.)

  • test200-v1.4.zip: A manual selection of 200 pages from fold0, provided for training.

  • benchmarkY1.zip: Train/test set for the Y1 benchmark. Includes outlines, articles, and qrels for the train set; for the test set you receive outlines only. test.benchmarkY1.omit.cbor.outlines will be the official evaluation topics for TREC CAR 2017. The qrels for the test set will be released after the TREC CAR submission deadline.

Support Tools

Support tools for reading this data (v1.4 release) can be found here: https://github.com/TREMA-UNH/trec-car-tools.

Support tools for this release:

  • v1.4 in the v1.4 branch.

Other versions are not supported.
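The support tools parse the CBOR files into article, outline, and paragraph objects; a common first step is to flatten an article's outline into the hierarchical section-path query IDs used throughout the qrels files. The sketch below is illustrative only: it uses a nested-tuple stub in place of a parsed outline and assumes "/"-joined heading paths; with trec-car-tools you would obtain real outlines from the CBOR files instead (reader names and ID encodings vary by tool version).

```python
# Sketch: flatten an article outline into section-path query IDs such as
# "PageName/Heading1/Heading1.1". The "/"-joined path format and the
# nested (heading, children) stub are assumptions for illustration;
# trec-car-tools provides the real parsed outline objects.

def section_paths(page_name, sections):
    """Yield a query ID for the article and for every section in a
    nested (heading, children) outline, depth-first."""
    def walk(prefix, subsections):
        for heading, children in subsections:
            path = prefix + "/" + heading
            yield path
            yield from walk(path, children)
    yield page_name                      # the article itself is also a query
    yield from walk(page_name, sections)

# Hypothetical outline for illustration:
outline = [("Heading1", [("Heading1.1", [])]), ("Heading2", [])]
print(list(section_paths("PageName", outline)))
# → ['PageName', 'PageName/Heading1',
#    'PageName/Heading1/Heading1.1', 'PageName/Heading2']
```

Each yielded path corresponds to one retrieval query; the deeper the path, the more specific the information need.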

Contents

The following kinds of data derivatives are provided.

Data:

  • *.cbor: full articles as derived from Wikipedia (after cleaning)

  • *.cbor.outlines: outlines of articles (= queries), preserving the hierarchy with section IDs.

  • *.cbor.paragraphs: paragraphs from these articles (= documents), with paragraph IDs. These are intended for training only. (Evaluation will run on the collection of paragraphs from the whole of Wikipedia; see release-v1.1.cbor.paragraphs.)

Qrels (trec_eval-compatible qrels files), automatically derived from articles (to be complemented by human judgments):

  • *.cbor.hierarchical.qrels: every paragraph is relevant only for its most specific (leaf) section (example: PageName/Heading1/Heading1.1, but not PageName/Heading1).

  • *.cbor.toplevel.qrels: every paragraph is relevant only for its top-level section (the above example is relevant for PageName/Heading1; any second-level heading is ignored).

  • *.cbor.articles.qrels: every paragraph is relevant for its article (the above example is relevant for PageName); any section info is ignored.

  • *.cbor.hierarchical.entity.qrels: every entity is relevant only for its most specific (leaf) section (example: PageName/Heading1/Heading1.1, but not PageName/Heading1).

  • *.cbor.toplevel.entity.qrels: every entity is relevant only for its top-level section (the above example is relevant for PageName/Heading1; any second-level heading is ignored).

  • *.cbor.articles.entity.qrels: every entity is relevant for its article (the above example is relevant for PageName); any section info is ignored.
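The three flavors differ only in how much of a paragraph's section path is kept as the query ID. The sketch below shows that relationship, assuming the standard trec_eval qrels line format ("queryId 0 docId relevance"), "/"-joined section paths, and a made-up paragraph ID; the actual query-ID encoding in the released files may differ.

```python
# Sketch: derive the hierarchical, toplevel, and articles qrels lines for
# one paragraph from its section path. The "queryId 0 docId 1" line format
# follows the usual trec_eval qrels convention; the paragraph ID below is
# a hypothetical placeholder.

def qrels_lines(section_path, paragraph_id):
    """Return (hierarchical, toplevel, article) qrels lines for a
    paragraph appearing under section_path,
    e.g. 'PageName/Heading1/Heading1.1'."""
    parts = section_path.split("/")
    article = parts[0]
    toplevel = "/".join(parts[:2]) if len(parts) > 1 else article
    line = "{} 0 {} 1".format
    return (line(section_path, paragraph_id),  # hierarchical: leaf section
            line(toplevel, paragraph_id),      # toplevel: first heading only
            line(article, paragraph_id))       # articles: page name only

h, t, a = qrels_lines("PageName/Heading1/Heading1.1", "0123abcd")
print(h)  # PageName/Heading1/Heading1.1 0 0123abcd 1
print(t)  # PageName/Heading1 0 0123abcd 1
print(a)  # PageName 0 0123abcd 1
```

The entity qrels are built the same way, with entity IDs in place of paragraph IDs.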

Test topics

For evaluation, only a *.cbor.outlines file will be distributed, and a decision will have been made as to which of the three qrels files will be used.

Issues

Please report any issues with trec-car-tools, as well as with the provided data sets, using the issue tracker on GitHub.

Creative Commons License
TREC-CAR Dataset by Laura Dietz, Ben Gamari is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
Based on a work at www.wikipedia.org.