We provide the following data sets under a Creative Commons Attribution-ShareAlike 3.0 Unported License. They are based on content extracted from Wikipedia, which is itself licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported License, and on the Textbook Question Answering (TQA) corpus from AI2.
Please cite this data set as:
Laura Dietz, Ben Gamari.
"TREC CAR 2.3: A Data Set for Complex Answer Retrieval".
Version 2.3, 2019.
http://trec-car.cs.unh.edu
This is the official dataset for the Y3 evaluation.
It contains new test queries (benchmarkY3test), as well as training data from automatic benchmarks and manual assessments from Y1 and Y2.
The paragraph IDs and entity IDs have not changed between this version and v2.0. The paragraphCorpus and allButBenchmark are the same as from v2.1.
For follow-up datasets and Wikipedia dumps see v2.4
Submission concluded.
Validation script and the required car.topics file: use these to check that your run files are in the correct format.
All archives use XZ or GZ compression; data sizes refer to uncompressed data.
Please use these files for your TREC CAR 2019 submissions!
Code and Data for Validation and Population
Software (python3): https://github.com/TREMA-UNH/car-convert-ranking-to-ordering contains two standalone scripts:
paragraph_ids.txt.xz List of valid paragraph ids required for validating TREC CAR Y3 submission files.
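For illustration (this is not a replacement for the official validation script), here is a minimal Python sketch of such a check: it verifies that every paragraph id referenced in a run file appears in paragraph_ids.txt.xz. The run-file layout assumed here (trec_eval style, with the paragraph id in the third whitespace-separated column) and the run file name are assumptions; adapt them to the actual Y3 submission format.

import lzma

# Load the set of valid paragraph ids (one id per line, xz-compressed).
with lzma.open("paragraph_ids.txt.xz", mode="rt", encoding="utf-8") as f:
    valid_ids = {line.strip() for line in f}

# Assumption: a trec_eval-style run file with the paragraph id in the
# third whitespace-separated column; adapt to the actual Y3 format.
with open("my-run.run", encoding="utf-8") as run:
    for lineno, line in enumerate(run, start=1):
        fields = line.split()
        if len(fields) < 3:
            print(f"line {lineno}: malformed line")
            continue
        if fields[2] not in valid_ids:
            print(f"line {lineno}: unknown paragraph id {fields[2]}")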
Y3 Results
trec-car-y3-runs-and-eval.tar.gz Evaluation results from Y3. Includes run files derived from participant submissions, a trec_eval-compatible qrel file from the manual assessment, and evaluation results produced with trec_eval -c.
trec-car-benchmarkY3test-section.qrels Same as above, but just the qrel file.
Y3 Releases
benchmarkY3test.public.tar.gz Official evaluation topics for TREC CAR 2019 (Year 3) based on the TQA corpus. The test set for Y3 includes outlines only. Qrels will be released later in the year. (Released June 18, 2019)
benchmarkY3train.tar.gz Small set of training topics for TREC CAR Year 3, derived from the annotated subset of TQA topics in the benchmarkY2test topics. Provided are outlines, gold articles, a table for matching Y3 ids to Y2 ids, and manually annotated gold keywords for each section. Note: the paragraph ids given in the *pages file are not contained in the paragraph corpus!
benchmarkY2test-auto-manual-qrels.tar.gz Automatic assessments, manual NIST assessments from evaluation in 2018, and gold passages — used in official TREC CAR Y2 evaluation.
The dataset for Y3 is based on the TQA dataset of AI2, further described here:
Kembhavi, Aniruddha, et al. “Are You Smarter Than a Sixth Grader? Textbook Question Answering for Multimodal Machine Comprehension.” CVPR. Vol. 2. 2017.
The following previous releases of v2.0 and v2.1 are still valid for Y3:
Corpus of paragraphs
paragraphCorpus.v2.0.tar.xz Paragraph collection covering both train and test articles, entities of all types, and paragraphs from all content sections and lead paragraphs.
Alternatively, paragraphCorpus-splits.tar.xz: the same paragraph corpus, but split into 30 chunks of 10000 paragraphs each.
Knowledge base
extra-large training set
benchmarkY2test
benchmarkY1test
benchmarkY1-test-public.v2.0.tar.xz Official evaluation topics for TREC CAR 2017. Test set for the Y1 benchmark; includes outlines only. For qrels for the test set, see the datasets below.
benchmarkY1-test.tar.xz Automatic ground truth and full articles for the official evaluation topics for TREC CAR 2017.
benchmarkY1Test-manual-qrels-v2.0.tar.xz Manual NIST assessments from 2017, translated to paragraph ids and entity ids of the v2.0 release.
Tree-qrels Hierarchical ground truth also for internal sections (to distinguish these from previous qrels, which carried hierarchical information only for the leaf sections, I refer to them as tree-qrels). Contains tree-qrels for test200, benchmarkY1train, benchmarkY1test, and train. Remarks: Previous benchmarks did not contain automatic qrels for internal sections; hierarchical qrels only contained relevance assessments for paragraphs/entities directly included in a section. In contrast, tree-qrels mark a paragraph/entity as relevant if it is contained in the section or any of its child sections. We ask participants in the 2018 evaluation to also include rankings for internal sections. (Released July 26, 2018)
benchmarkY1train
Obsolete datasets (which still work, but are not recommended)
test200.v2.0.tar.xz A manual selection of 200 pages from fold0, provided for training.
unprocessedTrain.v2.0.tar.xz Minimally processed version of the training half (50%) of Wikipedia. Includes all page types, all sections, etc., as well as images embedded on the pages. Feel free to use this to derive knowledge bases or to map to a knowledge base of your choice. This archive includes a list of “legal” Wikipedia page titles for filtering external resources.
Unless denoted otherwise, files are based on a Wikipedia dump from December 20, 2016. BenchmarkY2test is based on files from AI2’s TQA corpus and a Wikipedia dump from June 2018.
The following kinds of data derivatives are provided:
Data:
*.cbor
: full articles as derived from Wikipedia (after cleaning)
*.cbor-outlines.cbor
: outlines of articles (=queries), preserving the hierarchy with sectionIDs.
*.cbor-paragraphs.cbor
: paragraphs from these articles (=documents), with paragraph ids. These are intended for training only. (Evaluation will be based on the collection of paragraphs from all of Wikipedia; see release-v1.1.cbor.paragraphs.)
Qrels (trec_eval-compatible qrels files), which are automatically derived from articles (to be complemented by human judgments).
*.cbor.hierarchical.qrels
: every paragraph is relevant only for its most specific (leaf) section (example: PageName/Heading1/Heading1.1, but not PageName/Heading1!)
*.cbor.toplevel.qrels
: every paragraph is relevant only for its top-level section (the above example is relevant for PageName/Heading1; any second-level heading is ignored.)
*.cbor.articles.qrels
: every paragraph is relevant for its article (the above example is relevant for PageName) - any section info is ignored.
*.cbor.tree.qrels
: every paragraph is relevant for any parent heading (this covers hierarchical, toplevel, and article; a sketch of this expansion follows the list below)
*.cbor.hierarchical.entity.qrels
: every entity is relevant only for its most specific (leaf) section (example: PageName/Heading1/Heading1.1, but not PageName/Heading1!)
*.cbor.toplevel.entity.qrels
: every entity is relevant only for its top-level section (the above example is relevant for PageName/Heading1; any second-level heading is ignored.)
*.cbor.articles.entity.qrels
: every entity is relevant for its article (the above example is relevant for PageName) - any section info is ignored.
*.cbor.tree.entity.qrels
: every entity is relevant for any parent heading (this covers hierarchical, toplevel, and article)
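To make the tree-qrels semantics concrete, here is a minimal Python sketch that expands hierarchical qrels into tree qrels by marking each paragraph as relevant for its section and every ancestor section. It assumes query ids are /-separated section paths as in the examples above (with slashes inside headings escaped) and uses placeholder file names.

# Sketch: derive tree qrels from hierarchical qrels.
# Assumption: query ids are /-separated section paths, e.g.
# "PageName/Heading1/Heading1.1".

def ancestors(section_path):
    """Yield the section path itself and every ancestor up to the page."""
    parts = section_path.split("/")
    for i in range(len(parts), 0, -1):
        yield "/".join(parts[:i])

tree = set()
with open("hierarchical.qrels", encoding="utf-8") as f:
    for line in f:
        query, _iter, para_id, _rel = line.split()
        for ancestor in ancestors(query):
            tree.add((ancestor, para_id))

with open("tree.qrels", "w", encoding="utf-8") as out:
    for query, para_id in sorted(tree):
        out.write(f"{query} 0 {para_id} 1\n")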
For evaluation, only a *.cbor-outlines.cbor file will be distributed, and a decision will have been made as to which of the qrels files will be used.
Please submit any issues with trec-car-tools, as well as with the provided data sets, using the issue tracker on GitHub.
Please use release version 17 from trec-car-tools-java
Alternative 1: Use Maven (highly recommended!)
<repositories>
  <repository>
    <id>jitpack.io</id>
    <url>https://jitpack.io</url>
  </repository>
</repositories>
<dependencies>
  <dependency>
    <groupId>com.github.TREMA-UNH</groupId>
    <artifactId>trec-car-tools-java</artifactId>
    <version>17</version>
  </dependency>
  <dependency>
    <groupId>co.nstant.in</groupId>
    <artifactId>cbor</artifactId>
    <version>RELEASE</version>
  </dependency>
  [...]
Alternative 2: Check out the source code
Git repository: https://github.com/TREMA-UNH/trec-car-tools-java/tree/33de7dde9e61510fe71a41c2bd2bf2f3bc9a8fad
Please check out tag “17”!
Example
An example of how to load TREC CAR data from Java can be found on GitHub at trec-car-tools/trec-car-tools-example.
Alternative 1: Install via PyPI and pip
pip install trec-car-tools
More info on https://pypi.org/project/trec-car-tools/
Versions 2.3 and 2.1 should both work for benchmarkY3.
Alternative 2: Check out the source code
Python support tools for reading this data can be found at https://github.com/TREMA-UNH/trec-car-tools (branch: v2.1, directory: python3). They are also available on PyPI and Anaconda Cloud as trec-car-tools v2.1.
python setup.py install
Look at test.py for an example on how to access the data.
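For orientation, here is a minimal sketch in the spirit of test.py; the file names are placeholders, and the API names (iter_outlines, iter_paragraphs, flat_headings_list, para_id, get_text) are those of the trec-car-tools v2.x Python package as we understand them.

from trec_car.read_data import iter_outlines, iter_paragraphs

# Outlines (queries): page titles and their section hierarchy.
with open("benchmarkY3test.cbor-outlines.cbor", "rb") as f:
    for page in iter_outlines(f):
        print(page.page_id, page.page_name)
        for heading_path in page.flat_headings_list():
            print("  " + " / ".join(section.heading for section in heading_path))

# Paragraphs (documents): paragraph ids and plain text.
with open("dedup.articles-paragraphs.cbor", "rb") as f:
    for paragraph in iter_paragraphs(f):
        print(paragraph.para_id, paragraph.get_text()[:80])
        break  # just the first paragraph as a demo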
For discontinued data releases see:
TREC-CAR Dataset by Laura Dietz, Ben Gamari, Jeff Dalton is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
Based on a work at www.wikipedia.org.