TREC CAR

Data Releases (latest, v2.1 – official Y2 benchmark)

We provide the following data sets under a Creative Commons Attribution-ShareAlike 3.0 Unported License. They are based on content extracted from Wikipedia, which is licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported License.

Please cite this data set as

Laura Dietz, Ben Gamari, Jeff Dalton. 
"TREC CAR 2.1: A Data Set for Complex Answer Retrieval". 
Version 2.1, 2018. 
http://trec-car.cs.unh.edu

This is the official dataset for the Y2 evaluation.

It contains the test queries (benchmarkY2test) as well as a new AllButBenchmark file for knowledge graph construction. Note that submissions based on the v2.0 AllButBenchmark file are not allowed.

The paragraph IDs and entity IDs have not changed between this version and v2.0.

Submission URL

Submit your runs at https://ir.nist.gov/trecsubmit/car.html. Submission requires pre-registration as a TREC team.

Data sets

All archives use XZ compression; the stated data sizes refer to the uncompressed data.
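As a minimal sketch of unpacking such an archive with the Python standard library, assuming the download is a tar archive compressed with XZ: the function name `extract_tar_xz` and any file names passed to it are illustrative and not part of the release.

```python
import tarfile
from pathlib import Path


def extract_tar_xz(archive_path, target_dir):
    """Unpack an XZ-compressed tar archive and return its sorted member names.

    Assumption: the archive is a .tar.xz file. The paths are supplied by the
    caller; nothing here depends on specific TREC CAR file names.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    # Mode "r:xz" tells tarfile to decompress with LZMA/XZ while reading.
    with tarfile.open(archive_path, mode="r:xz") as archive:
        names = sorted(archive.getnames())
        archive.extractall(target)
    return names
```

Because the listed sizes are for uncompressed data, it is worth checking free disk space before extracting the larger archives.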

Please use these files for your TREC CAR 2018 submissions!

** Parts of the benchmarkY2test benchmark are based on the TQA dataset of AI2, further described in: Kembhavi, Aniruddha, et al. “Are You Smarter Than a Sixth Grader? Textbook Question Answering for Multimodal Machine Comprehension.” CVPR. Vol. 2. 2017.

The following releases of v2.0 are still valid.

Unless noted otherwise, files are based on a Wikipedia dump from December 20, 2016. BenchmarkY2test is based on files from AI2’s TQA corpus and a Wikipedia dump from June 2018.

Old Releases

For discontinued data releases see:

Support Tools

Python support tools for reading this data can be found at https://github.com/TREMA-UNH/trec-car-tools (branch: v2.1, directory: python3). They are also available on PyPI and Anaconda Cloud as trec-car-tools v2.1.

Java support tools for reading this data can be found at https://github.com/TREMA-UNH/trec-car-tools-java (tag: 11).

Don’t copy the code; use it as a Maven dependency:

    <repositories>
        <repository>
            <id>jitpack.io</id>
            <url>https://jitpack.io</url>
        </repository>
    </repositories>

    <dependencies>
        ...
        <dependency>
            <groupId>com.github.TREMA-UNH</groupId>
            <artifactId>trec-car-tools-java</artifactId>
            <version>12</version>
        </dependency>
        ...
    </dependencies>

With v2.1 and Maven version 11, the behavior for iterating over flat headings has changed. They now include the section paths of internal sections.
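To illustrate what enumerating section paths for internal as well as leaf sections can look like, here is a small standalone sketch. It is not the trec-car-tools implementation: the nested `(heading, children)` tuples, the function name `iter_section_paths`, and the `/` separator are all assumptions made for this example.

```python
def iter_section_paths(page_id, sections, prefix=()):
    """Yield one path string for every section, internal and leaf alike.

    `sections` is a list of (heading, children) tuples. This layout is a
    hypothetical stand-in for an outline, not the library's data model.
    """
    for heading, children in sections:
        path = prefix + (heading,)
        # Emit the path for this section even when it has subsections.
        yield "/".join((page_id,) + path)
        # Then descend, carrying the path of the enclosing sections along.
        yield from iter_section_paths(page_id, children, path)


# Invented outline for illustration only.
outline = [
    ("History", [("Early period", []), ("Modern era", [])]),
    ("Uses", []),
]
paths = list(iter_section_paths("Chocolate", outline))
```

Here `paths` contains an entry for the internal section `History` in addition to its two subsections and the leaf section `Uses`.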

See examples on how to use the tools here: https://github.com/TREMA-UNH/trec-car-tools/.

You may be able to read v1.5 data files with this code, but we can’t guarantee correctness.

Contents

The following kinds of data derivatives are provided:

Data:

Qrels (trec_eval-compatible qrels files) which are automatically derived from Articles (to be complemented by human judgments).
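For readers unfamiliar with the qrels format, the sketch below parses the four whitespace-separated columns that trec_eval expects (query ID, an iteration column that is typically ignored, document ID, relevance grade). The function name `parse_qrels` is illustrative, and the sample lines in the usage note are invented rather than taken from the released files.

```python
from collections import defaultdict


def parse_qrels(lines):
    """Map each query ID to a {document ID: relevance grade} dictionary."""
    judgments = defaultdict(dict)
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 4:
            raise ValueError(f"expected 4 columns, got {len(fields)}: {line!r}")
        # The second column (iteration) is read but not used.
        query_id, _iteration, doc_id, relevance = fields
        judgments[query_id][doc_id] = int(relevance)
    return dict(judgments)
```

For example, `parse_qrels(["q1 0 docA 1", "q1 0 docB 0"])` groups both judgments under the key `"q1"`. With a real file, pass an open text file handle, since iterating over it yields lines.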

Test topics

For evaluation, only a *cbor.outlines file will be distributed, and a decision will be made as to which of the three qrels files will be used.

Issues

Please report any issues with trec-car-tools or the provided data sets using the issue tracker on GitHub.

TREC-CAR Dataset by Laura Dietz, Ben Gamari, Jeff Dalton is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
Based on a work at www.wikipedia.org.