We provide the following data sets under a Creative Commons Attribution-ShareAlike 3.0 Unported License. They are based on content extracted from Wikipedia, which is licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported License.
Please cite this data set as
Laura Dietz, Ben Gamari. "TREC CAR: A Data Set for Complex Answer Retrieval". Version 1.5, 2017. http://trec-car.cs.unh.edu
All archives use XZ compression; the listed sizes refer to the uncompressed data.
paragraphcorpus-v1.5.tar.xz (16GB) paragraph collection covering both train and test articles, entities of all types, and paragraphs from all content sections and lead paragraphs.
benchmarkY1test.public-v1.5.tar.xz (123KB) Official evaluation topics for TREC CAR 2017. Test set for Y1 benchmark, includes outlines only. The qrels for the test set will be released after the submission deadline of TREC CAR.
benchmarkY1train-v1.5.tar.xz (19MB) Train set for Y1 benchmark. Includes outlines, articles, qrels. (Selected with the same process as the test set.)
train-v1.5.tar.xz (9.4GB): Articles from 50% of Wikipedia that match the selection criterion. This half is dedicated for training any data-hungry machine learning methods. Data provided as articles, outlines, and qrels. Additionally, segmentation of the data into five folds for cross validation is provided in this archive.
test200-v1.5.tar.xz (19MB) A manual selection of 200 pages from fold0, provided for training.
unprocessedtrain-v1.5.tar.xz (37GB) Minimally processed version of the 50% of Wikipedia for training. Includes all page types, all sections, etc. Also includes images embedded on the page. Feel free to use this to derive knowledge bases or map to a knowledge base of your choice. This archive includes a list of “legal” Wikipedia page titles for filtering external resources.
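Because the archives are XZ-compressed tarballs and some expand to tens of gigabytes, it can help to inspect an archive before unpacking it. The following is a minimal sketch using only Python's standard `tarfile` module; the function name `summarize_archive` is hypothetical and not part of trec-car-tools.

```python
import tarfile


def summarize_archive(archive_path):
    """Return (file names, total uncompressed bytes) for a .tar.xz archive.

    Nothing is written to disk; only the tar headers are inspected,
    although the XZ stream still has to be decompressed to find them.
    """
    # Mode "r:xz" asks tarfile to decompress the XZ/LZMA stream transparently.
    with tarfile.open(archive_path, mode="r:xz") as archive:
        files = [member for member in archive.getmembers() if member.isfile()]
        names = [member.name for member in files]
        total_bytes = sum(member.size for member in files)
    return names, total_bytes
```

For example, `summarize_archive("benchmarkY1train-v1.5.tar.xz")` would return the member names and their combined uncompressed size, which can be compared against free disk space before extraction.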
Official submission of runs through TREC CAR Submission Site – FYI: We will only consider the top 100 paragraphs of your rankings.
check_car.py Validation script for run submissions. To avoid delays when submitting, we recommend validating your runs before uploading them to the submission system.
all_titles_passages_and_entities-v1.5.txt List of all query ids needed by the validation script.
paragraphcorpus.cbor.docid.list.gz List of legal paragraph ids. All other paragraph ids will be removed from your rankings.
legal-entities-v1.5.1.tar.xz For entity task submissions, list of legal entity ids. All other entity ids will be removed from your rankings. The file has two tab-separated columns: the first is the official trec-car ID for this entity, the second the page name on Wikipedia (with spaces instead of underscores). We kindly ask you to submit entity rankings with trec-car ids.
entity-redirects-v1.5.1.tar.xz List of page/entity redirects in the form
toPage, with intermediate redirects already resolved. We recommend using this table to resolve link targets (which can be redirects) to the actual page that is being referred to.
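Since illegal ids are removed and only the top 100 entries per query are considered, a run can be cleaned locally before it is validated with check_car.py. The sketch below is not the official validation logic. It assumes the standard six-column TREC run format (query id, `Q0`, document id, rank, score, run name) and ids without whitespace; the function names `load_legal_ids` and `clean_run` are hypothetical.

```python
from collections import defaultdict


def load_legal_ids(lines):
    """Collect legal ids from an id list, one id per line.

    For the two-column legal-entities file, the first tab-separated
    column (the trec-car id) is used and the page name is ignored.
    """
    return {line.rstrip("\n").split("\t")[0] for line in lines if line.strip()}


def clean_run(run_lines, legal_ids, cutoff=100):
    """Drop illegal ids, keep the best `cutoff` entries per query, and re-rank."""
    per_query = defaultdict(list)
    for line in run_lines:
        parts = line.split()
        if len(parts) != 6:  # skip blank or malformed lines
            continue
        query_id, _q0, doc_id, _rank, score, run_name = parts
        if doc_id in legal_ids:
            per_query[query_id].append((float(score), doc_id, run_name))

    cleaned = []
    for query_id, entries in per_query.items():
        # Highest score first; ranks are reassigned after filtering.
        entries.sort(key=lambda entry: entry[0], reverse=True)
        for rank, (score, doc_id, run_name) in enumerate(entries[:cutoff], start=1):
            cleaned.append(f"{query_id} Q0 {doc_id} {rank} {score} {run_name}")
    return cleaned
```

The same functions work for paragraph runs (with the ids from paragraphcorpus.cbor.docid.list.gz, after decompression) and for entity runs (with the first column of the legal-entities file).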
For discontinued data releases see:
Support tools for reading this data can be found here: https://github.com/TREMA-UNH/trec-car-tools.
Support tools to work with the v1.4 release will be maintained in the v1.5 branch
Earlier versions are not supported.
The following kinds of data derivatives are provided
*.cbor: full articles as derived from Wikipedia (after cleaning)
*.cbor.outlines: outlines of articles (=queries), preserving the hierarchy with sectionIDs.
*.cbor.paragraphs: paragraphs from these articles (=documents), with paragraph ids. These are intended for training only. (Evaluation will run on the collection of paragraphs from the whole Wikipedia; see release-v1.1.cbor.paragraphs.)
Qrels (trec_eval-compatible qrels files) which are automatically derived from Articles (to be complemented by human judgments).
*.cbor.hierarchical.qrels: every paragraph is relevant only for its leaf (most specific) section (example: PageName/Heading1/Heading1.1 - but not PageName/Heading1!)
*.cbor.toplevel.qrels: every paragraph is relevant only for its top-level section (the above example is relevant for PageName/Heading1, any second-level heading is ignored.)
*.cbor.articles.qrels: every paragraph is relevant for its article (the above example is relevant for PageName) - any section info is ignored.
*.cbor.hierarchical.entity.qrels: every entity is relevant only for its leaf (most specific) section (example: PageName/Heading1/Heading1.1 - but not PageName/Heading1!)
*.cbor.toplevel.entity.qrels: every entity is relevant only for its top-level section (the above example is relevant for PageName/Heading1, any second-level heading is ignored.)
*.cbor.articles.entity.qrels: every entity is relevant for its article (the above example is relevant for PageName) - any section info is ignored.
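The relationship between the three qrels variants can be sketched as follows. This is only an illustration: it uses the slash-separated path from the example above, whereas the ids in the released files may be encoded differently, and the function names and the paragraph id `p1` are hypothetical. The output line follows the usual trec_eval qrels layout (query id, an unused iteration field, item id, relevance).

```python
def query_ids_for_section(section_path):
    """Map an item's section path to the query id used by each qrels variant."""
    parts = section_path.split("/")
    return {
        # Relevant only for the leaf (most specific) section.
        "hierarchical": section_path,
        # Relevant only for the top-level section; deeper headings are ignored.
        # For an item with no heading at all, this falls back to the page name.
        "toplevel": "/".join(parts[:2]),
        # Relevant for the article as a whole; section info is ignored.
        "articles": parts[0],
    }


def qrels_line(query_id, item_id, relevance=1):
    """Format one trec_eval-compatible qrels line."""
    return f"{query_id} 0 {item_id} {relevance}"
```

For the example path, `query_ids_for_section("PageName/Heading1/Heading1.1")` yields `PageName/Heading1/Heading1.1`, `PageName/Heading1`, and `PageName` for the hierarchical, top-level, and article variants respectively, and `qrels_line("PageName/Heading1", "p1")` produces `PageName/Heading1 0 p1 1`.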
For evaluation, only a *.cbor.outlines file will be distributed, and a decision will have been made as to which of the three qrels files will be used.
Please submit any issues with trec-car-tools as well as the provided data sets using the issue tracker on github.
TREC-CAR Dataset by Laura Dietz, Ben Gamari is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
Based on a work at www.wikipedia.org.