Data Releases (latest, v2.1 – official Y2 benchmark)

Please cite this data set as

Laura Dietz, Ben Gamari, Jeff Dalton.
"TREC CAR 2.1: A Data Set for Complex Answer Retrieval".
Version 2.1, 2018.
http://trec-car.cs.unh.edu

This is the official dataset for the Y2 evaluation.

It contains the test queries (benchmarkY2test) as well as a new AllButBenchmark file for knowledge graph construction. Note that submissions that use the v2.0 AllButBenchmark file are not allowed.

The paragraph IDs and entity IDs have not changed between this version and v2.0.

Submission URL

Submit your runs here: https://ir.nist.gov/trecsubmit/car.html (requires pre-registration as a TREC team).

Data sets

All archives use XZ compression; any data size given refers to the uncompressed data.
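Python's standard tarfile module can read XZ-compressed archives directly. The following is a minimal sketch for inspecting an archive and its uncompressed size before unpacking it; the helper names `list_archive` and `uncompressed_size` are illustrative and not part of the release.

```python
import tarfile


def list_archive(path):
    """Return (member name, uncompressed size in bytes) for each file in a .tar.xz archive."""
    # Mode "r:xz" makes tarfile decompress the XZ stream transparently.
    with tarfile.open(path, mode="r:xz") as archive:
        return [(m.name, m.size) for m in archive.getmembers() if m.isfile()]


def uncompressed_size(path):
    """Total uncompressed size in bytes of all regular files in the archive."""
    return sum(size for _, size in list_archive(path))
```

For example, `uncompressed_size("train.v2.0.tar.xz")` would report how much disk space the extracted files need, assuming the archive has been downloaded to the current directory.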

• benchmarkY2test-public.v2.1.1.tar.xz Official evaluation topics for TREC CAR 2018 (Year 2). Test set for Y2 benchmark, includes outlines only. Qrels will be released later in the year. (fixed July 25, 2018) **

• unprocessedAllButBenchmark.v2.1.tar.xz Like unprocessedTrain, but contains nearly everything of Wikipedia, only omitting pages in the benchmarkY2, benchmarkY1, and test200 benchmarks. The splits offered are consistent with the “train” file below.

** Parts of the benchmarkY2test benchmark are based on the TQA dataset of AI2, further described here: Kembhavi, Aniruddha, et al. “Are You Smarter Than a Sixth Grader? Textbook Question Answering for Multimodal Machine Comprehension.” CVPR. Vol. 2. 2017.

• validation script and the required car.topics file: Use these to validate that your run files are in the correct format before uploading them.

• Tree-qrels Hierarchical ground truth that also covers internal sections (to distinguish these from previous qrels with hierarchical information only for the leaf sections, I refer to these as tree.qrels). Contains tree-qrels for test200, benchmarkY1train, benchmarkY1test, and train. Remarks: Previous benchmarks did not contain automatic qrels for internal sections. Hierarchical qrels only contained relevance assessments for paragraphs/entities directly included in that section. In contrast, tree-qrels mark a paragraph/entity as relevant if it is contained in the section or any child section. We ask participants in the 2018 evaluation to include rankings for internal sections as well. (Released July 26, 2018).
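The difference between hierarchical qrels and tree-qrels can be sketched in a few lines of Python. This is an illustration of the idea only, not the script used to produce the released files. It assumes that section paths use "/" as the level separator (as in PageName/Heading1/Heading1.1) and that the bare page name is not itself treated as a section; the function name `tree_qrels` is hypothetical.

```python
from collections import defaultdict


def tree_qrels(hierarchical):
    """Propagate relevance from each section to all of its ancestor sections.

    `hierarchical` maps a section path such as "PageName/Heading1/Heading1.1"
    to the set of paragraph or entity ids directly contained in that section.
    """
    tree = defaultdict(set)
    for section_path, ids in hierarchical.items():
        parts = section_path.split("/")
        # Assumption: start at depth 2 so that the page name alone is skipped.
        for depth in range(2, len(parts) + 1):
            tree["/".join(parts[:depth])].update(ids)
    return dict(tree)


hierarchical = {
    "PageName/Heading1": {"p1"},
    "PageName/Heading1/Heading1.1": {"p2"},
}
tree = tree_qrels(hierarchical)
# The internal section Heading1 now also covers p2 from its child section.
```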

The following releases of v2.0 are still valid.

• paragraphCorpus.v2.0.tar.xz paragraph collection covering both train and test articles, entities of all types, and paragraphs from all content sections and lead paragraphs.

• benchmarkY1-test-public.v2.0.tar.xz Official evaluation topics for TREC CAR 2017. Test set for Y1 benchmark, includes outlines only. For qrels for the test set see datasets below.

• benchmarkY1-train.v2.0.tar.xz Train set for Y1 benchmark. Includes outlines, articles, qrels. (Selected with the same process as the test set.)

• train.v2.0.tar.xz: Articles from 50% of Wikipedia that match the selection criterion. This half is dedicated for training any data-hungry machine learning methods. Data provided as articles, outlines, and qrels. Additionally, segmentation of the data into five folds for cross validation is provided in this archive.

• test200.v2.0.tar.xz A manual selection of 200 pages from fold0, provided for training.

• unprocessedTrain.v2.0.tar.xz Minimally processed version of the 50% of Wikipedia for training. Includes all page types, all sections, etc. Also includes images embedded on the page. Feel free to use this to derive knowledge bases or map to a knowledge base of your choice. This archive includes a list of “legal” Wikipedia page titles for filtering external resources.

• benchmarkY1Test-manual-qrels-v2.0.tar.xz Manual NIST assessments from 2017, translated to paragraph ids and entity ids of the V2.0 release.

• benchmarkY1-test.tar.xz Automatic ground truth and full articles for the official evaluation topics for TREC CAR 2017.

Unless noted otherwise, files are based on a Wikipedia dump from December 20, 2016. BenchmarkY2test is based on files from AI2’s TQA corpus and a Wikipedia dump from June 2018.

Old Releases

For discontinued data releases see:

Support Tools

Python support tools for reading this data can be found here: https://github.com/TREMA-UNH/trec-car-tools (branch: v2.1, directory: python3). They are available on PyPI and Anaconda Cloud as trec-car-tools v2.1.

Java support tools for reading this data can be found here: https://github.com/TREMA-UNH/trec-car-tools-java tag:11.

Don’t copy the code; use it as a Maven dependency:

    <repositories>
        <repository>
            <id>jitpack.io</id>
            <url>https://jitpack.io</url>
        </repository>
    </repositories>

    <dependencies>
        ...

        <dependency>
            <groupId>com.github.TREMA-UNH</groupId>
            <artifactId>trec-car-tools-java</artifactId>
            <version>12</version>
        </dependency>
        ...
    </dependencies>


With v2.1 and Maven version 11, the behavior for iterating over flat headings has changed: they now include the section path for internal sections.

See examples on how to use the tools here: https://github.com/TREMA-UNH/trec-car-tools/.

You may be able to read v1.5 data files with this code, but we can’t guarantee correctness.

Contents

The following kinds of data derivatives are provided.

Data:

• *.cbor: full articles as derived from Wikipedia (after cleaning)

• *.cbor-outlines.cbor: outlines of articles (=queries), preserving the hierarchy with sectionIDs.

• *.cbor-paragraphs.cbor: paragraphs from these articles (=documents), with paragraph ids. These are intended for training only. (Evaluation will run on the collection of paragraphs from the whole of Wikipedia; see release-v1.1.cbor.paragraphs.)

Qrels (trec_eval-compatible qrels files), which are automatically derived from articles (to be complemented by human judgments):

• *.cbor.hierarchical.qrels: every paragraph is relevant only for its most specific (leaf) section (example: PageName/Heading1/Heading1.1 - but not PageName/Heading1!)

• *.cbor.toplevel.qrels: every paragraph is relevant only for its top-level section (the above example is relevant for PageName/Heading1, any second-level heading is ignored.)

• *.cbor.articles.qrels: every paragraph is relevant for its article (the above example is relevant for PageName) - any section info is ignored.

• *.cbor.hierarchical.entity.qrels: every entity is relevant only for its most specific (leaf) section (example: PageName/Heading1/Heading1.1 - but not PageName/Heading1!)

• *.cbor.toplevel.entity.qrels: every entity is relevant only for its top-level section (the above example is relevant for PageName/Heading1, any second-level heading is ignored.)

• *.cbor.articles.entity.qrels: every entity is relevant for its article (the above example is relevant for PageName) - any section info is ignored.
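To make the three granularities concrete, the sketch below derives the query id that a single paragraph (or entity) would be judged relevant for under each qrels type, and formats it as a trec_eval qrels line (query id, iteration, document id, relevance). It is an illustration only: the function name `qrels_lines` and the id `para42` are hypothetical, and "/" is assumed to separate the levels of a section path as in the examples above.

```python
def qrels_lines(section_path, doc_id):
    """Return one trec_eval qrels line per granularity for a single document.

    `section_path` is the most specific section containing the document,
    e.g. "PageName/Heading1/Heading1.1".
    """
    parts = section_path.split("/")
    queries = {
        "hierarchical": section_path,     # most specific (leaf) section
        "toplevel": "/".join(parts[:2]),  # page name plus first-level heading
        "articles": parts[0],             # page name only
    }
    # The second column (iteration) is ignored by trec_eval and conventionally 0.
    return {kind: f"{query} 0 {doc_id} 1" for kind, query in queries.items()}


lines = qrels_lines("PageName/Heading1/Heading1.1", "para42")
```

Here `lines["toplevel"]` is `PageName/Heading1 0 para42 1`, while `lines["articles"]` is `PageName 0 para42 1`.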

Test topics

For evaluation, only a *cbor.outlines file will be distributed, and a decision will be made as to which of the three qrels files will be used.

Issues

Please submit any issues with trec-car-tools or the provided data sets using the issue tracker on GitHub.