The data set is released under the Creative Commons Attribution-ShareAlike 3.0 Unported License (CC BY-SA 3.0).
Please cite this data set as
Laura Dietz, Ben Gamari.
"TREC CAR 2.4: A Machine-Readable Wikipedia Dump".
Version 2.4, 2020.
http://trec-car.cs.unh.edu
Some of these datasets will be used in the TREC News Track 2020.
All archives use XZ or GZ compression; the stated data size refers to the uncompressed data.
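Since an archive may use either compression scheme, a reader has to choose the matching decompressor. The following is a minimal sketch using only the Python standard library. The helper name `open_archive` and the assumption that the scheme can be inferred from a `.xz` or `.gz` file suffix are illustrative and not part of the official tools.

```python
import gzip
import lzma
from pathlib import Path
from typing import BinaryIO


def open_archive(path: str) -> BinaryIO:
    """Open an XZ- or GZ-compressed archive as a stream of uncompressed bytes.

    Hypothetical helper: the compression scheme is inferred from the suffix.
    """
    suffix = Path(path).suffix.lower()
    if suffix == ".xz":
        return lzma.open(path, "rb")
    if suffix == ".gz":
        return gzip.open(path, "rb")
    raise ValueError(f"unsupported compression suffix: {suffix!r}")
```

The returned object can be read like any binary file, so it can be handed to whichever CBOR reader is in use.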
In May 2020 we updated our Wikimedia parser to better handle fragmented templates and illegal XML tags. New: the data now includes Infobox and image information.
We provide the passage corpus, outlines, and example articles in CBOR-encoded archives to preserve hierarchical structure and entity links. CBOR is a binary, JSON-like format that avoids encoding issues. Support tools for accessing the CBOR data are provided for Python, Java, and Haskell.
The articles we provide are encoded with the following grammar. Terminal nodes are indicated by $.
Page -> $pageName $pageId [PageSkeleton] PageType PageMetadata
PageType -> ArticlePage | CategoryPage | RedirectPage ParaLink | DisambiguationPage
PageMetadata -> RedirectNames DisambiguationNames DisambiguationIds CategoryNames CategoryIds InlinkIds InlinkAnchors
RedirectNames -> [$pageName]
DisambiguationNames -> [$pageName]
DisambiguationIds -> [$pageId]
CategoryNames -> [$pageName]
CategoryIds -> [$pageId]
InlinkIds -> [$pageId]
InlinkAnchors -> [$anchorText]
PageSkeleton -> Section | Para | Image | ListItem | Infobox
Section -> $sectionHeading [PageSkeleton]
Para -> Paragraph
Paragraph -> $paragraphId, [ParaBody]
ListItem -> $nestingLevel, Paragraph
Image -> $imageURL [PageSkeleton]
ParaBody -> ParaText | ParaLink
ParaText -> $text
ParaLink -> $targetPage $targetPageId $linkSection $anchorText
Infobox -> $template, [($key, [PageSkeleton])]
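To make the nesting concrete, here is a minimal sketch that mirrors the paragraph-level part of the grammar with Python dataclasses and flattens a skeleton into text. All class, field, and function names are illustrative assumptions rather than the trec-car-tools API; Infobox and PageMetadata are omitted for brevity, and the sample values are made up.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class ParaText:
    text: str


@dataclass
class ParaLink:
    target_page: str
    target_page_id: str
    link_section: str
    anchor_text: str


ParaBody = Union[ParaText, ParaLink]


@dataclass
class Paragraph:
    paragraph_id: str
    bodies: List[ParaBody] = field(default_factory=list)


@dataclass
class Para:
    paragraph: Paragraph


@dataclass
class ListItem:
    nesting_level: int
    paragraph: Paragraph


@dataclass
class Section:
    heading: str
    children: List["PageSkeleton"] = field(default_factory=list)


@dataclass
class Image:
    image_url: str
    caption: List["PageSkeleton"] = field(default_factory=list)


PageSkeleton = Union[Section, Para, Image, ListItem]


def paragraph_text(paragraph: Paragraph) -> str:
    """Join plain text and link anchor text in their original order."""
    return "".join(
        b.text if isinstance(b, ParaText) else b.anchor_text
        for b in paragraph.bodies
    )


def skeleton_text(node: PageSkeleton) -> List[str]:
    """Collect headings and paragraph texts depth-first, in document order."""
    if isinstance(node, Section):
        lines = [node.heading]
        for child in node.children:
            lines.extend(skeleton_text(child))
        return lines
    if isinstance(node, Image):
        lines = []
        for child in node.caption:
            lines.extend(skeleton_text(child))
        return lines
    # Para and ListItem both wrap a single Paragraph.
    return [paragraph_text(node.paragraph)]


# Made-up sample data, for illustration only.
para = Paragraph("p1", [
    ParaText("Chocolate is made from "),
    ParaLink("Cocoa bean", "example-page-id", "", "cocoa beans"),
    ParaText("."),
])
section = Section("Production", [Para(para)])
print(skeleton_text(section))
# prints ['Production', 'Chocolate is made from cocoa beans.']
```

Note that a link contributes only its anchor text to the running text, while the target page and section remain available on the ParaLink node.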
Note: You have to upgrade to the latest version of the trec-car-tools bindings. Older versions will not work.
Python support tools for reading this data can be found here: https://github.com/TREMA-UNH/trec-car-tools (branch: v2.5.2, directory: python3). They are available on PyPI and Anaconda Cloud as trec-car-tools v2.5.2.
Java support tools for reading this data can be found here: https://github.com/TREMA-UNH/trec-car-tools-java (tag: 18).
Don’t copy the code; use it as a Maven dependency:
<repositories>
<repository>
<id>jitpack.io</id>
<url>https://jitpack.io</url>
</repository>
</repositories>
<dependencies>
...
<dependency>
<groupId>com.github.TREMA-UNH</groupId>
<artifactId>trec-car-tools-java</artifactId>
<version>18</version>
</dependency>
...
</dependencies>
See examples on how to use the tools here: https://github.com/TREMA-UNH/trec-car-tools/.
Support tools for reading this format are provided in the trec-car-tools repository (the Java version is available through Maven).