We provide the following data sets under a Creative Commons Attribution-ShareAlike 3.0 Unported License. They are based on content extracted from Wikipedia, which is licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported License.
corpus-v1.2.zip: Paragraph collection covering both train and test articles.
release-v1.2.zip: Articles from 50% of Wikipedia that match the selection criterion. This half is dedicated to training. Provided as articles, outlines, and qrels, already split into five folds for cross-validation.
spritzer-v1.2.zip: Six example articles, provided as articles, outlines, paragraphs, and qrels.
Coming soon: training-release-v1.2.zip: Hand-selected articles from the subset of “halfwiki” that characterize the nature of the test queries.
Support tools for reading this data can be found here: https://github.com/TREMA-UNH/trec-car-tools.
Support tools for previous releases:
The v1.0 format is incompatible and no longer supported.
The following kinds of data derivatives are provided:
Data:
*.cbor
: full articles as derived from Wikipedia (after cleaning)
*.cbor.outlines
: outlines of articles (=queries), preserving the hierarchy with sectionIDs.
*.cbor.paragraphs
: paragraphs from these articles (=documents), with paragraph ids. These are intended for training only. Evaluation will run on the collection of paragraphs from the whole of Wikipedia (see release-v1.1.cbor.paragraphs).
Qrels (trec_eval-compatible qrels files), which are automatically derived from the articles (to be complemented by human judgments):
*.cbor.hierarchical.qrels
: every paragraph is relevant only for its leaf (most specific) section (example: PageName/Heading1/Heading1.1, but not PageName/Heading1!)
*.cbor.toplevel.qrels
: every paragraph is relevant only for its top-level section (the above example is relevant for PageName/Heading1; any second-level heading is ignored.)
*.cbor.articles.qrels
: every paragraph is relevant for its article (the above example is relevant for PageName); any section info is ignored.
*.cbor.hierarchical.entity.qrels
: every entity is relevant only for its leaf (most specific) section (example: PageName/Heading1/Heading1.1, but not PageName/Heading1!)
*.cbor.toplevel.entity.qrels
: every entity is relevant only for its top-level section (the above example is relevant for PageName/Heading1; any second-level heading is ignored.)
*.cbor.articles.entity.qrels
: every entity is relevant for its article (the above example is relevant for PageName); any section info is ignored.
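To illustrate how the hierarchical, toplevel, and articles variants differ, here is a minimal Python sketch that maps one paragraph's position in an article to the query ID it would count as relevant for under each variant. The function name `qrels_query_ids` is hypothetical and not part of trec-car-tools. The plain `PageName/Heading1` notation follows the example used above, so the IDs in the released files may be encoded differently. Treating a paragraph with no heading as belonging to the page itself in the toplevel variant is also an assumption.

```python
def qrels_query_ids(page_name, headings):
    """Map a paragraph's position in an article to its query ID per qrels variant.

    page_name: article title, e.g. "PageName"
    headings:  heading path from top-level to leaf, e.g. ["Heading1", "Heading1.1"]

    Hypothetical helper for illustration only; not part of trec-car-tools.
    """
    path = [page_name] + list(headings)
    return {
        # relevant only for the most specific (leaf) section
        "hierarchical": "/".join(path),
        # relevant only for the top-level section; deeper headings are dropped
        # (assumption: with no headings at all, this falls back to the page)
        "toplevel": "/".join(path[:2]),
        # relevant for the whole article; all section info is dropped
        "articles": page_name,
    }


ids = qrels_query_ids("PageName", ["Heading1", "Heading1.1"])
print(ids["hierarchical"])  # PageName/Heading1/Heading1.1
print(ids["toplevel"])      # PageName/Heading1
print(ids["articles"])      # PageName
```

The same mapping applies to the entity qrels, with entities in place of paragraphs.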
For evaluation, only a *.cbor.outlines file will be distributed, and a decision will be made as to which of the three qrels files will be used.
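Whichever qrels file is chosen, it can be loaded with a few lines of Python. The sketch below assumes the standard four-column trec_eval layout (query ID, iteration, document ID, relevance) separated by whitespace. The function name `parse_qrels` and the sample IDs are hypothetical and not taken from the released files.

```python
from collections import defaultdict


def parse_qrels(lines):
    """Parse trec_eval-style qrels lines into {query_id: {doc_id: relevance}}.

    Assumes four whitespace-separated columns per line:
    query_id, iteration (unused), doc_id, relevance.
    Lines that do not have exactly four columns are skipped.
    """
    qrels = defaultdict(dict)
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # blank or malformed line
        query_id, _iteration, doc_id, relevance = parts
        qrels[query_id][doc_id] = int(relevance)
    return dict(qrels)


# Hypothetical sample lines; real query and paragraph IDs will look different.
sample = [
    "PageName/Heading1 0 para-a 1",
    "PageName/Heading1 0 para-b 1",
    "PageName 0 para-a 1",
]
judgments = parse_qrels(sample)
print(sorted(judgments["PageName/Heading1"]))
```

To read a file from disk, pass an open text file handle instead of the list, since iterating over a file yields its lines.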
TREC-CAR Dataset by Laura Dietz, Ben Gamari is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
Based on a work at www.wikipedia.org.