TREC CAR

Old Data Releases (v1.2)

We provide the following data sets under a Creative Commons Attribution-ShareAlike 3.0 Unported License. They are based on content extracted from Wikipedia, which is licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported License.

  • corpus-v1.2.zip: Paragraph collection covering both train and test articles.

  • release-v1.2.zip: Articles from 50% of Wikipedia that match the selection criterion. This half is dedicated to training. Provided as articles, outlines, and qrels, already split into five folds for cross-validation.

  • spritzer-v1.2.zip: Six example articles, provided as articles, outlines, paragraphs, and qrels.

  • Coming soon: training-release-v1.2.zip: Hand-selected articles from the subset of “halfwiki” that characterize the nature of the test queries.
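The five folds in release-v1.2.zip lend themselves to standard cross-validation: train on four folds and hold out the fifth, rotating through all five. A minimal sketch follows, assuming folds are numbered 0 to 4. The `fold_file` naming scheme is hypothetical and purely illustrative; check the actual file names inside the archive.

```python
def cross_validation_splits(num_folds=5):
    """Yield (training folds, held-out fold) for each rotation."""
    folds = list(range(num_folds))
    for held_out in folds:
        train = [f for f in folds if f != held_out]
        yield train, held_out


def fold_file(fold, suffix):
    # Hypothetical naming scheme, NOT taken from the release;
    # adapt it to the file names found in release-v1.2.zip.
    return f"fold{fold}.cbor{suffix}"


splits = list(cross_validation_splits())
for train, held_out in splits:
    train_files = [fold_file(f, ".outlines") for f in train]
    test_file = fold_file(held_out, ".outlines")
    print(train_files, "->", test_file)
```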

Support Tools

Support tools for reading this data can be found here: https://github.com/TREMA-UNH/trec-car-tools.

Support tools for previous releases:

  • v1.2 in v1.2 branch.
  • v1.1 in v1.1 branch.

The v1.0 format is incompatible and no longer supported.

Contents

The following kinds of data derivatives are provided:

Data:

  • *.cbor: full articles as derived from Wikipedia (after cleaning)
  • *.cbor.outlines: outlines of articles (=queries), preserving the hierarchy with sectionIDs.
  • *.cbor.paragraphs: paragraphs from these articles (=documents), with paragraph ids. These are intended for training only. (Evaluation will run on the collection of paragraphs from the whole of Wikipedia; see release-v1.1.cbor.paragraphs.)
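Since all three data kinds share the .cbor stem, a script that walks an unpacked archive has to test the longer suffixes first. The sketch below shows one way to do this; `data_kind` is an illustrative helper, not part of trec-car-tools, and it deliberately returns None for qrels files.

```python
# Longer suffixes come first so that "x.cbor.outlines" is not
# mistaken for a plain "x.cbor" articles file.
DATA_KINDS = [
    (".cbor.outlines", "outlines"),      # queries
    (".cbor.paragraphs", "paragraphs"),  # documents
    (".cbor", "articles"),               # full cleaned articles
]


def data_kind(filename):
    """Return 'articles', 'outlines', 'paragraphs', or None."""
    for suffix, kind in DATA_KINDS:
        if filename.endswith(suffix):
            return kind
    return None


print(data_kind("release-v1.1.cbor.paragraphs"))
```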

Qrels (trec_eval-compatible qrels files), which are automatically derived from the articles (to be complemented by human judgments):

  • *.cbor.hierarchical.qrels: every paragraph is relevant only for its leaf (most specific) section (example: PageName/Heading1/Heading1.1, but not PageName/Heading1!)
  • *.cbor.toplevel.qrels: every paragraph is relevant only for its top-level section (the above example is relevant for PageName/Heading1; any second-level heading is ignored.)
  • *.cbor.articles.qrels: every paragraph is relevant for its article (the above example is relevant for PageName); any section info is ignored.

  • *.cbor.hierarchical.entity.qrels: every entity is relevant only for its leaf (most specific) section (example: PageName/Heading1/Heading1.1, but not PageName/Heading1!)
  • *.cbor.toplevel.entity.qrels: every entity is relevant only for its top-level section (the above example is relevant for PageName/Heading1; any second-level heading is ignored.)
  • *.cbor.articles.entity.qrels: every entity is relevant for its article (the above example is relevant for PageName); any section info is ignored.
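The three qrels granularities differ only in how much of the section path is kept as the query ID. The following sketch illustrates this with the PageName/Heading1/Heading1.1 example from above. It assumes IDs are plain slash-joined paths, as in that example; the IDs in the released files may be encoded differently. The function names and the paragraph ID "paragraph-id-1" are hypothetical. The output line follows the standard trec_eval qrels layout (query ID, iteration, document ID, relevance).

```python
def qrels_query_ids(page, headings):
    """Map each qrels granularity to the query ID a paragraph
    (or entity) is relevant for.

    page:     page name, e.g. "PageName"
    headings: headings from the top level down to the leaf section
    """
    return {
        # Leaf (most specific) section only.
        "hierarchical": "/".join([page] + headings),
        # Top-level section; deeper headings are ignored.
        "toplevel": "/".join([page] + headings[:1]),
        # Whole article; all section info is ignored.
        "articles": page,
    }


def qrels_line(query_id, doc_id, relevance=1):
    # trec_eval qrels columns: query ID, iteration (unused, 0),
    # document ID, relevance.
    return f"{query_id} 0 {doc_id} {relevance}"


ids = qrels_query_ids("PageName", ["Heading1", "Heading1.1"])
lines = {kind: qrels_line(qid, "paragraph-id-1") for kind, qid in ids.items()}
for kind, line in lines.items():
    print(kind, "->", line)
```

The same mapping applies to the entity qrels, with an entity ID in place of the paragraph ID.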

Test topics

For evaluation, only a *.cbor.outlines file will be distributed, and a decision will be made as to which of the three qrels files will be used.

Creative Commons License
TREC-CAR Dataset by Laura Dietz, Ben Gamari is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
Based on a work at www.wikipedia.org.