TREC CAR

Data Releases (latest, v2.4 – Wiki dumps)

The data set is released under the Creative Commons Attribution-ShareAlike 3.0 Unported License (CC BY-SA 3.0).

Please cite this data set as

Laura Dietz, Ben Gamari. 
"TREC CAR 2.4: A Machine-Readable Wikipedia Dump". 
Version 2.4, 2020. 
http://trec-car.cs.unh.edu

Some of these datasets will be used in the TREC News Track 2020.

Data sets

All archives use XZ or GZ compression; the listed data size refers to the uncompressed data.
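As a minimal sketch of working with the compressed archives, the following Python snippet picks a decompressor from the file suffix and measures the uncompressed size by streaming. The helper names `open_archive` and `uncompressed_size` are hypothetical, and the sketch assumes each file is a single XZ or GZ stream; a tarball would additionally need the standard `tarfile` module.

```python
import gzip
import lzma
from pathlib import Path


def open_archive(path):
    """Open an .xz or .gz file for binary reading, chosen by its suffix.

    Hypothetical helper: assumes the file is one compressed stream.
    """
    suffix = Path(path).suffix.lower()
    if suffix == ".xz":
        return lzma.open(path, "rb")
    if suffix == ".gz":
        return gzip.open(path, "rb")
    raise ValueError(f"unsupported compression suffix: {suffix!r}")


def uncompressed_size(path, chunk_size=1 << 20):
    """Return the number of bytes in the decompressed stream.

    Reads in chunks so that large archives never sit in memory at once.
    """
    total = 0
    with open_archive(path) as stream:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total
```

Streaming in fixed-size chunks keeps memory use flat, which matters because the uncompressed data can be much larger than the download.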

In May 2020, we updated our Wikimedia parser to better handle fragmented templates and illegal XML tags. New: the data now includes Infobox and image information.

Data Format Description

We provide the passage corpus, outlines, and example articles in CBOR-encoded archives to preserve hierarchical structure and entity links. CBOR is a binary format with a JSON-like data model, which avoids text-encoding issues. Support tools for accessing the CBOR data are provided for Python, Java, and Haskell.
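To illustrate what "binary JSON" means in practice, here is a minimal sketch of a CBOR decoder in Python that uses only the standard library. The function name `decode_cbor` is hypothetical, and the sketch covers only a small subset of CBOR (unsigned integers, byte strings, text strings, arrays, and maps with definite lengths). It is meant to show the encoding idea, not to read the released archives; use the support tools for that.

```python
def decode_cbor(data, pos=0):
    """Decode one CBOR item starting at data[pos].

    Returns (value, next_pos). Hypothetical helper covering only a
    definite-length subset of CBOR, for illustration.
    """
    initial = data[pos]
    major, info = initial >> 5, initial & 0x1F
    pos += 1

    # Additional info below 24 is the argument itself; 24..27 mean the
    # argument follows in 1, 2, 4, or 8 big-endian bytes.
    if info < 24:
        arg = info
    elif info <= 27:
        size = 1 << (info - 24)
        arg = int.from_bytes(data[pos:pos + size], "big")
        pos += size
    else:
        raise ValueError("indefinite or reserved lengths are not handled here")

    if major == 0:  # unsigned integer
        return arg, pos
    if major == 2:  # byte string of arg bytes
        return data[pos:pos + arg], pos + arg
    if major == 3:  # text string: arg bytes of UTF-8
        return data[pos:pos + arg].decode("utf-8"), pos + arg
    if major == 4:  # array of arg items
        items = []
        for _ in range(arg):
            item, pos = decode_cbor(data, pos)
            items.append(item)
        return items, pos
    if major == 5:  # map of arg key/value pairs
        result = {}
        for _ in range(arg):
            key, pos = decode_cbor(data, pos)
            value, pos = decode_cbor(data, pos)
            result[key] = value
        return result, pos
    raise ValueError(f"major type {major} is not handled in this sketch")


# Hand-assembled sample bytes: the array [1, "é", [2]].
sample = bytes.fromhex("8301 62c3a9 8102")
value, end = decode_cbor(sample)
print(value)  # → [1, 'é', [2]]
```

Because text strings carry an explicit byte length and are always UTF-8, nested structure and non-ASCII text survive without escaping or delimiter ambiguity.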

The articles we provide are encoded according to the following grammar. Terminal nodes are marked with $, and square brackets denote a list of the enclosed element.

      Page         -> $pageName $pageId [PageSkeleton] PageType PageMetadata
      PageType     -> ArticlePage | CategoryPage | RedirectPage ParaLink | DisambiguationPage
      PageMetadata -> RedirectNames DisambiguationNames DisambiguationIds CategoryNames CategoryIds InlinkIds InlinkAnchors
      RedirectNames       -> [$pageName] 
      DisambiguationNames -> [$pageName] 
      DisambiguationIds   -> [$pageId] 
      CategoryNames       -> [$pageName] 
      CategoryIds         -> [$pageId] 
      InlinkIds           -> [$pageId] 
      InlinkAnchors       -> [$anchorText] 
      
      PageSkeleton -> Section | Para | Image | ListItem
      Section      -> $sectionHeading [PageSkeleton]
      Para         -> Paragraph
      Paragraph    -> $paragraphId, [ParaBody]
      ListItem     -> $nestingLevel, Paragraph
      Image        -> $imageURL [PageSkeleton]
      ParaBody     -> ParaText | ParaLink
      ParaText     -> $text
      ParaLink     -> $targetPage $targetPageId $linkSection $anchorText
      Infobox      -> $template, [($key, [PageSkeleton])]
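The skeleton part of the grammar can be mirrored as plain Python data classes. The sketch below is an illustration only: the class and field names are hypothetical and are not the classes of the trec-car-tools bindings, Infobox and page metadata are left out for brevity, and the example values are made up. It shows how a nested PageSkeleton can be walked to recover each paragraph together with its section path.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class ParaText:
    text: str


@dataclass
class ParaLink:
    target_page: str
    target_page_id: str
    link_section: str
    anchor_text: str


@dataclass
class Paragraph:
    paragraph_id: str
    bodies: List[Union[ParaText, ParaLink]]


@dataclass
class Para:
    paragraph: Paragraph


@dataclass
class ListItem:
    nesting_level: int
    paragraph: Paragraph


@dataclass
class Section:
    heading: str
    children: List["PageSkeleton"] = field(default_factory=list)


@dataclass
class Image:
    image_url: str
    caption: List["PageSkeleton"] = field(default_factory=list)


PageSkeleton = Union[Section, Para, Image, ListItem]


def paragraph_text(paragraph):
    """Join plain text and link anchor text into one string."""
    return "".join(
        body.text if isinstance(body, ParaText) else body.anchor_text
        for body in paragraph.bodies
    )


def iter_paragraphs(skeleton, path=()):
    """Yield (section_path, Paragraph) pairs in document order."""
    for node in skeleton:
        if isinstance(node, Section):
            yield from iter_paragraphs(node.children, path + (node.heading,))
        elif isinstance(node, (Para, ListItem)):
            yield path, node.paragraph
        elif isinstance(node, Image):
            # Image captions are themselves skeleton lists.
            yield from iter_paragraphs(node.caption, path)


# Made-up example page body; IDs and text are illustrative only.
example = [
    Para(Paragraph("p1", [
        ParaText("Chocolate is made from "),
        ParaLink("Cocoa bean", "enwiki:Cocoa%20bean", "", "cocoa"),
    ])),
    Section("History", [
        ListItem(1, Paragraph("p2", [ParaText("Mesoamerican origins")])),
    ]),
]

for path, para in iter_paragraphs(example):
    print("/".join(path) or "(lead)", "|", paragraph_text(para))
```

Keeping links as separate ParaLink nodes, rather than flattening them to text up front, is what lets entity links survive alongside the paragraph text.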

Support Tools

Note: You have to upgrade to the latest version of the trec-car-tools bindings. Older versions will not work.

Python support tools for reading this data can be found at https://github.com/TREMA-UNH/trec-car-tools (branch: v2.5.2, directory: python3). The package is available on PyPI and Anaconda Cloud as trec-car-tools v2.5.2.

Java support tools for reading this data can be found at https://github.com/TREMA-UNH/trec-car-tools-java (tag: 18).

Don’t copy the code; use it as a Maven dependency:

    <repositories>
        <repository>
            <id>jitpack.io</id>
            <url>https://jitpack.io</url>
        </repository>
    </repositories>

    <dependencies>
...

        <dependency>
            <groupId>com.github.TREMA-UNH</groupId>
            <artifactId>trec-car-tools-java</artifactId>
            <version>18</version>
        </dependency>
...

Examples of how to use the tools are available at https://github.com/TREMA-UNH/trec-car-tools/.

Support tools for reading this format are provided in the trec-car-tools repository (the Java version is available through Maven).