TREC CAR

V2.0 Data Releases (Jan 2018, v2.0)

We provide the following data sets under a Creative Commons Attribution-ShareAlike 3.0 Unported License. They are based on content extracted from Wikipedia that is licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported License.

Please cite this data set as

Laura Dietz, Ben Gamari. 
"TREC CAR 2.0: A Data Set for Complex Answer Retrieval". 
Version 2.0, 2018. 
http://trec-car.cs.unh.edu

Note that there are significant changes between this dataset and v1.5 regarding paragraph and entity ids.

Data sets

All archives use XZ compression; data sizes refer to the uncompressed data.
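As a minimal sketch of unpacking one of these archives, the following uses only Python's standard tarfile module. The function name extract_archive and the destination directory are illustrative assumptions, not part of the official tooling; only extract archives from a source you trust.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Extract a .tar.xz (or plain .tar) archive and return its member names.

    Hypothetical helper for illustration; not part of trec-car-tools.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # Mode "r:*" lets tarfile detect the compression (xz, gz, or none) itself.
    with tarfile.open(archive_path, mode="r:*") as archive:
        names = archive.getnames()
        archive.extractall(dest)
    return names
```

For example, extract_archive("benchmarkY1-train.v2.0.tar.xz", "data") would unpack the Y1 train set into a local data directory, assuming the archive has already been downloaded to the working directory.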

  • paragraphCorpus.v2.0.tar.xz paragraph collection covering both train and test articles, entities of all types, and paragraphs from all content sections and lead paragraphs.

  • benchmarkY1-test-public.v2.0.tar.xz Official evaluation topics for TREC CAR 2017. Test set for Y1 benchmark, includes outlines only. For qrels for the test set see datasets below.

  • benchmarkY1-train.v2.0.tar.xz Train set for Y1 benchmark. Includes outlines, articles, qrels. (Selected with the same process as the test set.)

  • train.v2.0.tar.xz: Articles from 50% of Wikipedia that match the selection criterion. This half is dedicated to training data-hungry machine learning methods. Data is provided as articles, outlines, and qrels. Additionally, a segmentation of the data into five folds for cross validation is provided in this archive.

  • test200.v2.0.tar.xz A manual selection of 200 pages from fold0, provided for training.

  • unprocessedTrain.v2.0.tar.xz Minimally processed version of the 50% of Wikipedia for training. Includes all page types, all sections, etc. Also includes images embedded on the page. Feel free to use this to derive knowledge bases or map to a knowledge base of your choice. This archive includes a list of “legal” Wikipedia page titles for filtering external resources.

  • unprocessedAllButBenchmark.v2.0.tar Like unprocessedTrain, but contains nearly all of Wikipedia, omitting only pages in the test200 and benchmarkY1 benchmarks.

  • benchmarkY1Test-manual-qrels-v2.0.tar.xz (NEW!) Manual NIST assessments from 2017, translated to paragraph ids and entity ids of the V2.0 release.

  • benchmarkY1-test.tar.xz (NEW!) Automatic ground truth and full articles for the official evaluation topics for TREC CAR 2017.

Old Releases

For discontinued data releases see:

Support Tools

Support tools for reading this data can be found here: https://github.com/TREMA-UNH/trec-car-tools-java.

Don’t copy the code; use it as a Maven dependency:

    <repositories>
        <repository>
            <id>jitpack.io</id>
            <url>https://jitpack.io</url>
        </repository>
    </repositories>

    <dependencies>
...
        <dependency>
            <groupId>com.github.TREMA-UNH</groupId>
            <artifactId>trec-car-tools-java</artifactId>
            <version>9</version>
        </dependency>
...

See examples on how to use the tools here: https://github.com/TREMA-UNH/trec-car-tools/.

You may be able to load v1.5 data files with this code, but we can’t guarantee correctness.

Contents

The following kinds of data derivatives are provided:

Data:

  • *.cbor: full articles as derived from Wikipedia (after cleaning)

  • *.cbor-outlines.cbor: outlines of articles (=queries), preserving the hierarchy with sectionIDs.

  • *.cbor-paragraphs.cbor: paragraphs from these articles (=documents), with paragraph ids. These are intended for training only. (Evaluation will run on the collection of paragraphs from the whole of Wikipedia; see release-v1.1.cbor.paragraphs.)

Qrels (trec_eval-compatible qrels files), automatically derived from articles (to be complemented by human judgments):

  • *.cbor.hierarchical.qrels: every paragraph is relevant only for its leaf (most specific) section (example: PageName/Heading1/Heading1.1 - but not PageName/Heading1!)

  • *.cbor.toplevel.qrels: every paragraph is relevant only for its top-level section (the above example is relevant for PageName/Heading1, any second-level heading is ignored.)

  • *.cbor.articles.qrels: every paragraph is relevant for its article (the above example is relevant for PageName) - any section info is ignored.

  • *.cbor.hierarchical.entity.qrels: every entity is relevant only for its leaf (most specific) section (example: PageName/Heading1/Heading1.1 - but not PageName/Heading1!)

  • *.cbor.toplevel.entity.qrels: every entity is relevant only for its top-level section (the above example is relevant for PageName/Heading1, any second-level heading is ignored.)

  • *.cbor.articles.entity.qrels: every entity is relevant for its article (the above example is relevant for PageName) - any section info is ignored.
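The three granularities above can be illustrated with a short sketch that maps one section path to the query id used by each qrels variant. It assumes the plain "/"-separated path form of the example in the text (actual ids in the released files may be encoded differently), and the function names and the paragraph id "para-123" are hypothetical.

```python
def qrels_query_ids(section_path):
    """Return the query id at each granularity for a '/'-separated section path.

    Assumes the path form 'PageName/Heading1/Heading1.1' used in the text;
    this is an illustration, not the code that produced the released qrels.
    """
    parts = section_path.split("/")
    return {
        "hierarchical": section_path,     # leaf (most specific) section only
        "toplevel": "/".join(parts[:2]),  # page plus first-level heading
        "article": parts[0],              # page only, section info ignored
    }


def qrels_line(query_id, doc_id, relevance=1):
    """Format one trec_eval qrels line: query id, iteration, document id, relevance."""
    # The second column (iteration) is conventionally 0 and ignored by trec_eval.
    return f"{query_id} 0 {doc_id} {relevance}"


ids = qrels_query_ids("PageName/Heading1/Heading1.1")
print(qrels_line(ids["toplevel"], "para-123"))
# prints: PageName/Heading1 0 para-123 1
```

The same mapping applies to the entity qrels, with an entity id in place of the paragraph id.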

Test topics

For evaluation, only a *cbor.outlines file will be distributed, and a decision will have been made as to which of the three qrels files will be used.

Issues

Please submit any issues with trec-car-tools or the provided data sets using the issue tracker on GitHub.

Creative Commons License
TREC-CAR Dataset by Laura Dietz, Ben Gamari is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
Based on a work at www.wikipedia.org.