The English Wikipedia dump from Dec 20th, 2016 is used as a basis for the current data release. We obtain the data set using a pipeline of the following phases.
We parse the raw MediaWiki format (see caveats) to identify the page title, section headings and their levels, paragraph content, and link information.
The content of the page is heavily cleaned of images, refs, tables, galleries, math, infoboxes, etc.
Furthermore, pages in the following namespaces will be dropped.
We split all pages into 50% training and 50% test. This split is decided by a modulo 2 operation on the SipHash of the page name (e.g. “Green Sea Turtle”). The test topics in the TREC CAR evaluation will be taken from the test set.
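As an illustration of the split rule, the following is a minimal sketch of a hash-parity split using a pure-Python SipHash-2-4. The SipHash variant, the all-zero 16-byte key, the UTF-8 encoding of the page name, and the mapping of even digests to the training half are all assumptions made for this example; the names `siphash24` and `assign_split` are hypothetical. It will therefore not necessarily reproduce the split of the released data.

```python
MASK = 0xFFFFFFFFFFFFFFFF  # keep all arithmetic within 64 bits


def _rotl(x: int, b: int) -> int:
    """Rotate a 64-bit integer left by b bits."""
    return ((x << b) | (x >> (64 - b))) & MASK


def siphash24(key: bytes, data: bytes) -> int:
    """Pure-Python SipHash-2-4, returning a 64-bit integer digest."""
    if len(key) != 16:
        raise ValueError("SipHash needs a 16-byte key")
    k0 = int.from_bytes(key[:8], "little")
    k1 = int.from_bytes(key[8:], "little")
    v = [
        k0 ^ 0x736F6D6570736575,
        k1 ^ 0x646F72616E646F6D,
        k0 ^ 0x6C7967656E657261,
        k1 ^ 0x7465646279746573,
    ]

    def sipround() -> None:
        v[0] = (v[0] + v[1]) & MASK
        v[1] = _rotl(v[1], 13) ^ v[0]
        v[0] = _rotl(v[0], 32)
        v[2] = (v[2] + v[3]) & MASK
        v[3] = _rotl(v[3], 16) ^ v[2]
        v[0] = (v[0] + v[3]) & MASK
        v[3] = _rotl(v[3], 21) ^ v[0]
        v[2] = (v[2] + v[1]) & MASK
        v[1] = _rotl(v[1], 17) ^ v[2]
        v[2] = _rotl(v[2], 32)

    # Pad to a multiple of 8 bytes; the final byte carries len(data) mod 256.
    padded = data + b"\x00" * (7 - len(data) % 8) + bytes([len(data) & 0xFF])
    for i in range(0, len(padded), 8):
        m = int.from_bytes(padded[i:i + 8], "little")
        v[3] ^= m
        sipround()  # two compression rounds per 8-byte word
        sipround()
        v[0] ^= m

    v[2] ^= 0xFF
    for _ in range(4):  # four finalization rounds
        sipround()
    return v[0] ^ v[1] ^ v[2] ^ v[3]


def assign_split(page_name: str, key: bytes = bytes(16)) -> str:
    """Map a page name to 'train' or 'test' by the parity of its digest.

    Assumptions: all-zero key, UTF-8 encoding, even digest means 'train'.
    """
    digest = siphash24(key, page_name.encode("utf-8"))
    return "train" if digest % 2 == 0 else "test"
```

Because the assignment depends only on the page name and a fixed key, calling `assign_split("Green Sea Turtle")` returns the same half on every run. Python's built-in `hash()` is unsuitable for this purpose because string hashes are salted per process by default.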
The raw parsed data of the training half is made available as “halfwiki” on the data releases page.
Any knowledge graph derived from this set of pages is safe for use in the competition.
To build a training collection that mimics the test cases as closely as possible, we discard any page tagged with a category that contains the following substrings (lower cased):
Further, any page in the namespace “Category:” or “Portal:” will be discarded as well as pages with titles starting with “List of” or “Lists of”.
Additionally, redirect pages (indicated by #REDIRECT in their lead paragraph) and disambiguation pages (i.e. pages where the name contains “(disambiguation)”) are removed.
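The namespace, list, redirect, and disambiguation rules can be sketched as a single predicate. This is only an illustration: the function name `is_discarded` and its two string arguments are hypothetical, matching is assumed to be case-sensitive, and the category-substring filter is left out because its list is not reproduced here.

```python
DROPPED_NAMESPACES = ("Category:", "Portal:")
LIST_PREFIXES = ("List of", "Lists of")


def is_discarded(title: str, lead_paragraph: str) -> bool:
    """Return True if a page would be dropped by the title and redirect rules."""
    # Pages in the "Category:" or "Portal:" namespace.
    if title.startswith(DROPPED_NAMESPACES):
        return True
    # Pages whose title starts with "List of" or "Lists of".
    if title.startswith(LIST_PREFIXES):
        return True
    # Redirect pages, detected by the marker in the lead paragraph
    # (case-sensitive match is an assumption of this sketch).
    if "#REDIRECT" in lead_paragraph:
        return True
    # Disambiguation pages, detected by the page name.
    if "(disambiguation)" in title:
        return True
    return False
```

For example, `is_discarded("List of sea turtles", "")` is true, while `is_discarded("Green Sea Turtle", "The green sea turtle is a large sea turtle.")` is false.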
Of the remaining pages, we will remove lead paragraphs (these are paragraphs before the first heading) as well as links to categories.
We remove content and heading for sections with the following list of titles (lower cased):
We also discard any heading with fewer than three or more than 100 characters (such headings are often a sign of invalid markup).
Any page that has fewer than three top-level headings left after this filtering will be discarded.
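The section-level filtering can be sketched as follows. The `Section` structure and all function names are hypothetical, the excluded-title set holds a placeholder rather than the real list, and the sketch assumes that discarding a heading also discards its content and nested subsections. Lead paragraphs are dropped implicitly because only sections are passed in.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Section:
    heading: str
    paragraphs: List[str] = field(default_factory=list)
    children: List["Section"] = field(default_factory=list)


# Placeholder only: the real list of excluded titles is not reproduced here.
EXCLUDED_TITLES: Set[str] = {"hypothetical excluded title"}


def keep_section(section: Section, excluded: Set[str]) -> bool:
    """Keep a section if its title is not excluded and has 3 to 100 characters."""
    title = section.heading.strip()
    return title.lower() not in excluded and 3 <= len(title) <= 100


def filter_sections(sections: List[Section], excluded: Set[str]) -> List[Section]:
    """Recursively drop excluded or malformed sections with their content."""
    kept = []
    for s in sections:
        if keep_section(s, excluded):
            kept.append(
                Section(s.heading, list(s.paragraphs),
                        filter_sections(s.children, excluded))
            )
    return kept


def filter_page(top_level: List[Section],
                excluded: Set[str] = EXCLUDED_TITLES) -> Optional[List[Section]]:
    """Return the filtered top-level sections, or None if fewer than three remain."""
    kept = filter_sections(top_level, excluded)
    return kept if len(kept) >= 3 else None
```

A page is kept only when `filter_page` returns a list; a return value of `None` means the page would be discarded.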
All remaining pages from the 50% training data (halfwiki) are divided into five folds (based on the SipHash on the page name). We also provide a small dataset “spritzer” containing six hand-selected articles.
Each of these subsets (spritzer, and fold0..fold4) is provided as:
- articles (nested headings and paragraphs in order): cbor
- just the outline (nested headings with content removed): cbor.outlines
- just the paragraphs (in a random order): cbor.paragraphs
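The relationship between the three views can be sketched on an in-memory tree. This does not read the CBOR files; the `Section` structure, the function names, and the seeded shuffle are assumptions made for illustration only.

```python
import random
from dataclasses import dataclass, field
from typing import List


@dataclass
class Section:
    heading: str
    paragraphs: List[str] = field(default_factory=list)
    children: List["Section"] = field(default_factory=list)


def outline(sections: List[Section]) -> List[Section]:
    """Copy the heading tree with all paragraph content removed."""
    return [Section(s.heading, [], outline(s.children)) for s in sections]


def paragraphs(sections: List[Section]) -> List[str]:
    """Collect every paragraph in document order (depth-first)."""
    out: List[str] = []
    for s in sections:
        out.extend(s.paragraphs)
        out.extend(paragraphs(s.children))
    return out


def shuffled_paragraphs(sections: List[Section], seed: int = 0) -> List[str]:
    """Return the paragraphs in a random order, detached from their headings."""
    paras = paragraphs(sections)
    random.Random(seed).shuffle(paras)  # seeded only to make the sketch repeatable
    return paras
```

The outline keeps the nesting but no text, and the shuffled paragraph list keeps the text but no position, so neither view alone reveals which paragraph belongs under which heading.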
We also derive qrels based on the automatic evaluation for 1) entities and paragraphs, as well as 2) hierarchical, toplevel, and article.
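One plausible reading of the three paragraph-level variants is sketched below: a paragraph counts as relevant to the page it came from (article), to its top-level section (toplevel), and to its full nested section path (hierarchical). This interpretation, the slash-joined query identifiers, and the function names are assumptions of this sketch and not the released format; entity qrels are not covered. The output lines follow the usual four-column TREC qrels layout, which requires identifiers without whitespace.

```python
from typing import Dict, Iterable, List, Set, Tuple

# (page id, heading path from the top level down, paragraph id)
Placement = Tuple[str, Tuple[str, ...], str]
Pair = Tuple[str, str]  # (query id, paragraph id)


def build_qrels(placements: Iterable[Placement]) -> Dict[str, Set[Pair]]:
    """Derive article, toplevel, and hierarchical judgments from placements."""
    qrels: Dict[str, Set[Pair]] = {
        "hierarchical": set(), "toplevel": set(), "article": set(),
    }
    for page, path, para_id in placements:
        qrels["article"].add((page, para_id))
        if path:
            # Hypothetical query ids: page and headings joined by "/".
            qrels["toplevel"].add((f"{page}/{path[0]}", para_id))
            qrels["hierarchical"].add(("/".join((page,) + path), para_id))
    return qrels


def to_trec_lines(pairs: Set[Pair]) -> List[str]:
    """Format judgments as 'query 0 document relevance' lines."""
    return [f"{query} 0 {doc} 1" for query, doc in sorted(pairs)]
```

Under these assumptions, a paragraph `p1` placed under `Habitat` then `Nesting` on page `Turtle` yields the hierarchical query id `Turtle/Habitat/Nesting` and the toplevel query id `Turtle/Habitat`.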
The paragraph collection used for training and testing is obtained by omitting Phase 2, but applying all remaining filtering stages on the full Wikipedia parse. Only the paragraphs (in random order) will be made available.
We base the extract on the raw MediaWiki format because it contains rich information not available in other formats. We use frisby, a Parsing Expression Grammar (PEG) parsing framework with infinite lookahead and cheap backtracking. However, MediaWiki markup does not follow any well-defined grammar, and because editors write it by hand, it often contains invalid statements such as missing closing parentheses or swapped parenthetical expressions like ([..)]. We try our best to fix many of these cases, but you will still encounter fragments of wiki markup in the collection.
If you encounter parse issues, please submit an issue in our issue tracker with a link to the page, the expected content, and the incorrect content.
As italics ('') and bold (''') are very difficult for editors to get right, we decided to preserve their markup in the content and leave it to your tokenizer to deal with it.
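A tokenizer preprocessing step for this markup could look like the minimal sketch below. It assumes that any run of two or more straight apostrophes is emphasis markup and can be removed without losing content; the function name `strip_emphasis` is hypothetical.

```python
import re

# Runs of two or more apostrophes are wiki emphasis markers
# ('' for italics, ''' for bold, ''''' for both combined).
_EMPHASIS = re.compile(r"'{2,}")


def strip_emphasis(text: str) -> str:
    """Remove italic and bold wiki markers, leaving single apostrophes intact."""
    return _EMPHASIS.sub("", text)
```

Single apostrophes, as in possessives, are left untouched, so `strip_emphasis("the turtle's shell")` returns its input unchanged.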