TREC CAR

Introduction

Current retrieval systems provide good solutions for phrase-level retrieval to satisfy simple fact- and entity-centric needs. This track encourages research on answering more complex information needs with longer answers. Much like Wikipedia pages synthesize knowledge that is globally distributed, we envision systems that collect relevant information from an entire corpus, creating synthetically structured documents by collating retrieved results.

Organization

Join our mailing list: https://groups.google.com/d/forum/trec-car

Example

As a brief motivating example, consider a user wondering about the latest advances in mobile technology. She has heard that there is a new iPhone on the market and is looking for a summary of its features or issues. With this intention in mind, she enters the query “iPhone 7”. A possible answer she would be very happy to receive could look like this:

Despite the inclusion of an adapter, the removal of the headphone jack was met with criticism. Criticism was based primarily on the following arguments:

  • digital output (as opposed to analog output from the 3.5 mm headphone jack) does not show any notable improvement in sound quality;

  • the inability to charge the iPhone 7 and listen to music simultaneously without Bluetooth;

  • and the inconvenience of having to carry around an adapter for what is purely a mobile device, diminishing its utility.

  • In particular, Apple’s vice president Phillip Schiller, who announced the change, was mocked extensively online for stating that removing the headphone jack took ‘courage’.

  • An online petition created by the consumer group SumOfUs that accuses Apple of planned obsolescence and causing substantial electronic waste by removing the headphone jack reached over 300,000 signatures.

While this example was taken from the Wikipedia article IPhone_7, it should be possible to identify such a list of issues from a Web collection through passage retrieval, consolidation, and organization. Of course, one might envision other responses that would satisfy the information need equally well.

Task

Given an article stub Q, retrieve, for each of its sections Hi, a ranking of relevant entity-passage tuples (E, S). The passage S is taken from a provided passage corpus. The entity E refers to an entry in the provided knowledge base. We define a passage or entity as relevant if the passage content or entity should be mentioned in the knowledge article. Different degrees of relevance (“could be mentioned”, “should be mentioned”, or “must be mentioned”) are represented through the PEGFB annotation scale during manual assessment.

Providing E or S is optional, allowing participants to take part in either the entity ranking task (entity-level, not utilizing passages), the passage ranking task (not utilizing entities), or the combined task. The idea is that the passage provides provenance for the entity's relevance and clarifies which aspect of the entity is relevant in the context of the section.
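As a concrete illustration, a minimal passage-only baseline could issue each section heading (prefixed with the stub's title) as a keyword query against a passage index and emit one run line per retrieved passage. The sketch below is hypothetical: the search function stands in for whatever retrieval backend (e.g., a BM25 index over the paragraph corpus) a participant chooses.

    # Hypothetical passage-task baseline; `search(query, k)` is a stand-in
    # for any retrieval backend returning (passageId, score) pairs.
    def rank_stub(page_name, sections, search, k=10):
        """sections: list of (sectionId, headingText) pairs of one stub Q."""
        for section_id, heading in sections:
            query = "{} {}".format(page_name, heading)
            for rank, (psg_id, score) in enumerate(search(query, k), 1):
                # trec_eval run format: queryId Q0 docId rank score runName
                print("{} Q0 {} {} {} baseline".format(
                    section_id, psg_id, rank, score))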

Passage Corpus and Training Data

To support the neural IR community, we provide a large (yet approximate) training set based on a large collection of knowledge articles from Wikipedia, divided into content passages and outlines. Across all articles, all extracted paragraphs are provided as one large passage corpus (see the paragraphs release). We ensure that entity links remain intact. These are gathered from links to Wikipedia (and in the future might be augmented with the FACC1 collection and the TagMe toolkit).

From all stubs obtained in this process, we will select 1000 stubs for which the association with the original article content is revealed as a training set. Additionally, we open up 50% of a recent Wikipedia dump for training (see the halfwiki release).

For each section Hi of a training stub Q, we indicate which passages originated from that section and which knowledge base entities are mentioned therein. We provide three variations of trec_eval compatible qrels files: using the full section hierarchy, only top-level sections, or only the article.
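For illustration (with made-up identifiers), the same relevant passage would be keyed by increasingly coarse query ids across the three variations:

    Some%20Page/Section%20A/Subsection%20A.1 0 $psgId 1   (section hierarchy)
    Some%20Page/Section%20A                  0 $psgId 1   (top-level sections)
    Some%20Page                              0 $psgId 1   (article only)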

Passage retrieval tasks are often criticized for leading to test collections that are not reusable. We are aware of this problem and intend to address it through legal span sets and optional candidate sets.

Test Data and Evaluation

We employ both an automatic binary evaluation and a fine-grained manual evaluation procedure on 40 test stubs.

Analogously to the training signal, an automatic (yet approximate) test signal is derived from the passages of the original article as well as the entities mentioned therein. However, this test signal is not necessarily reliable, as even the best article has room for improvement. Especially for complex answers, there are often different ways to provide an equally good answer.

Consequently, for a subset of test stubs, contributed entity-passage rankings will additionally be manually assessed for relevance, reflecting different levels of relevance. This manual assessment also provides a reference for the quality of the automatic evaluation procedure.

Data

We provide the passage corpus, stubs, and example articles in CBOR-encoded archives to preserve hierarchical structure and entity links. Support tools are provided for Python 3.5 and Java 1.7.

Articles we provide are encoded with the following grammar. (Terminal nodes are indicated by $.)

     Page         -> $pageName $pageId [PageSkeleton]
     PageSkeleton -> Section | Para
     Section      -> $sectionHeading $sectionId [PageSkeleton]
     Para         -> Paragraph
     Paragraph    -> $paragraphId [ParaBody]
     ParaBody     -> ParaText | ParaLink
     ParaText     -> $text
     ParaLink     -> $targetPage $targetPageId $linkSection $anchorText

Support tools for reading this format are provided in the trec-car-tools repository.
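As a sketch of how this grammar maps onto the Python tools, the snippet below walks a page skeleton and yields (sectionpath, paragraphId) pairs. The names used here (iter_annotations, Section, Para, and their attributes) follow the trec-car-tools repository, but should be checked against the tools branch matching your release.

    from trec_car.read_data import iter_annotations, Section, Para

    def section_paragraphs(cbor_file):
        # Yield (sectionpath, paragraphId) for every paragraph of every page.
        with open(cbor_file, 'rb') as f:
            for page in iter_annotations(f):
                stack = [(page.page_name, node) for node in page.skeleton]
                while stack:
                    path, node = stack.pop()
                    if isinstance(node, Section):
                        stack.extend((path + '/' + node.heading, child)
                                     for child in node.children)
                    elif isinstance(node, Para):
                        yield path, node.paragraph.para_id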

Passage Format

All passages that are cut out of Wikipedia articles are assigned a unique (SHA256-hashed) paragraphId that does not reveal the originating article. Pairs of (paragraphId, paragraphContent) are provided as the passage corpus.
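For reranking experiments it can be convenient to load the corpus into an id-to-text map. The sketch below assumes the iter_paragraphs reader and the Paragraph.get_text() helper from trec-car-tools; note that holding the full corpus in memory this way may be impractical, in which case an on-disk index is preferable.

    from trec_car.read_data import iter_paragraphs

    def load_passages(cbor_file):
        # Build a paragraphId -> plain-text lookup over the passage corpus.
        with open(cbor_file, 'rb') as f:
            return {p.para_id: p.get_text() for p in iter_paragraphs(f)}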

Outline Format

Each extracted page outline provides the pageName (i.e., the Wikipedia title), the pageId, and a nested list of children. Each section is represented by a heading, a headingId, and an ordered list of its children. Children can be sections or paragraphs.

Each section in the corpus is uniquely identified by its sectionpath:

    pageName/section1/section1.1/section1.1.1

To avoid UTF-8 encoding and whitespace issues with existing evaluation tools like trec_eval, the pageId and headingId are projected to ASCII characters using percent encoding.
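The projection can be approximated with standard percent encoding, as in the sketch below; the exact set of characters the official tools leave unencoded is an assumption here and should be verified against the released ids.

    from urllib.parse import quote

    def to_id(section_path):
        # Percent-encode a section path, keeping the '/' separators.
        return quote(section_path, safe='/')

    to_id("Green sea turtle/Ecology and behavior")
    # -> 'Green%20sea%20turtle/Ecology%20and%20behavior'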

Qrels

We provide training/evaluation annotations in the trec_eval compatible Qrels format.

We provide Qrels for passage and entity rankings, as well as different variations on the section path.

Format for passages:

    $sectionId 0 $psgId $isRelevant

Format for entities (where an entity is represented by the pageId of the Wikipedia article it refers to, using percent encoding):

    $sectionId 0 $pageId $isRelevant

For automatic training/test data, $isRelevant is either 1 or 0 (only entries with 1 are reported).

For manual test data, $isRelevant is a grade on the PEGFB scale (4-0): Perfect (4), Excellent (3), Good (2), Fair (1), Bad (0).
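Since these are plain trec_eval qrels, no special tooling is needed to read them; a minimal Python sketch:

    from collections import defaultdict

    def read_qrels(path):
        # Parse a qrels file into {queryId: {docId: grade}}.
        qrels = defaultdict(dict)
        with open(path) as f:
            for line in f:
                query_id, _, doc_id, grade = line.split()
                qrels[query_id][doc_id] = int(grade)
        return qrels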

Training Articles

We also provide training data not just as stubs + Qrels but also (alternatively) as annotated articles.

These follow a similar format to stubs, but also include the passages with their IDs inside their sections.
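Combining the skeleton walker sketched in the Data section with the qrels format above, automatic hierarchical qrels can be reproduced from these annotated articles. Here section_paragraphs and to_id are the illustrative helpers from the earlier sketches, not official tool functions.

    def write_auto_qrels(cbor_file, out_file):
        # Emit hierarchical automatic qrels from annotated training articles.
        with open(out_file, 'w') as out:
            for path, para_id in section_paragraphs(cbor_file):
                out.write("{} 0 {} 1\n".format(to_id(path), para_id))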

Details on the data selection process are available here.

Releases

v1.4-release (Jan 24, 2017): release of the paragraph collection, spritzer, and halfwiki

As formats may change slightly between releases, make sure you check out support tools from the corresponding branch in trec-car-tools.

Previous releases - do not use!

v1.2-release (Jan 17, 2017): release of the paragraph collection, spritzer, and halfwiki

v1.1-release (Jan 11, 2017): release of the paragraph collection, spritzer, and halfwiki

2016-11-v1.0-release: Pre-release to straighten out issues in the data format. The format has since changed; this release is no longer supported.

Spritzer

Training articles

Paragraph corpus

A collection of paragraphs drawn from Wikipedia articles that meet the selection criterion. Links between articles are preserved (see the format definition above).

Paragraph Wikipedia Corpus

Training corpora

Half of the Wikipedia articles are dedicated to training. A minimally processed version of this data is available as halfwiki.

Excluding articles that were already studied in the literature, we derive a subset of pages matching the list of selection criteria as an initial training corpus. The data is released as five folds, which are intended for cross-validation.

Additionally, a manual selection of 200 pages will be provided for training as test200.

See data releases for more information.

Creative Commons License
TREC-CAR Dataset by Laura Dietz, Ben Gamari is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
Based on a work at www.wikipedia.org.