# Introduction

Current retrieval systems provide good solutions for phrase-level retrieval of simple facts and entity-centric needs. This track encourages research on answering more complex information needs with longer answers. Much like Wikipedia pages synthesize knowledge that is globally distributed, we envision systems that collect relevant information from an entire corpus, creating synthetically structured documents by collating retrieved results.

The Overview Report from 2018 (CAR-Y2) will be published in April 2019. TREC attendees receive it on their thumb drive.

More details can be found in the TREC CAR overview report from 2017 (TREC CAR Y1). Please cite it as:

Dietz, Laura; Verma, Manisha; Radlinski, Filip; Craswell, Nick: TREC Complex Answer Retrieval Overview. In: Proceedings of the Text REtrieval Conference (TREC), 2017.

## Organization

July 25: New dataset for Y2 evaluation (v2.1)!

## Example

To motivate a brief example, consider a user wondering about the latest advances in mobile technology. She heard that there is a new iPhone on the market and is looking for a summary of its features or issues. With this intention in mind, she enters the query "iPhone 7 new features and issues". A possible answer she would be very happy to receive could look like this:

Despite the inclusion of an adapter, the removal of the headphone jack was met with criticism. Criticism was based primarily on the following arguments:

• Digital output (as opposed to analog output from the 3.5 mm headphone jack) does not show any notable improvement in sound quality;

• the inability to charge the iPhone 7 and listen to music simultaneously without Bluetooth;

• and the inconvenience of having to carry around an adapter for what is purely a mobile device, diminishing its utility.

• In particular, Apple’s vice president Phillip Schiller, who announced the change, was mocked extensively online for stating that removing the headphone jack took ‘courage’.

• An online petition created by the consumer group SumOfUs that accuses Apple of planned obsolescence and causing substantial electronic waste by removing the headphone jack reached over 300,000 signatures.

While this example was taken from the Wikipedia article IPhone_7, it should be possible to identify such a list of issues from a Web collection with passage retrieval, consolidation, and organization. Of course, one might envision other responses that would satisfy the information need equally well.

Given an article stub $$Q$$, retrieve for each of its sections $$H_i$$, a ranking of relevant entity-passage tuples $$(E,P)$$. The passage $$P$$ is taken from a provided passage corpus. The entity $$E$$ refers to an entry in the provided knowledge base. We define a passage or entity as relevant if the passage content or entity should be mentioned in the knowledge article. Different degrees of “could be mentioned”, “should be mentioned”, or “must be mentioned” are represented through a PEGFB annotation scale during manual assessment.

Two tasks are offered: Passage ranking and Entity ranking.

In the passage ranking task, entities are omitted.

In the entity ranking task, the entity must be given and can optionally be complemented with a passage that serves as provenance for why the entity is relevant. If the passage is omitted, it will be replaced with the first paragraph from the entity’s article.

## Passage Corpus and Training Data

Supporting the neural IR community, we provide a large (yet approximate) training set based on a large collection of knowledge articles from Wikipedia that are divided into content passages and outlines. Across all articles, all extracted paragraphs are provided as one large passage corpus (see paragraphs release). We ensure that entity links remain intact. These are gathered from links to Wikipedia (and might in the future be augmented with the FACC1 collection and the TagMe toolkit).

From all stubs obtained in this process, we will select 1000 stubs for which the original stub association is revealed as a training set. Additionally, we open up 50% of a recent Wikipedia dump for training (see halfwiki release).

For each section $$H_i$$ in training stubs $$Q$$ we indicate which passages originated from the section and which knowledge base entities are mentioned therein. We provide three variations of trec_eval compatible qrels files, using the section hierarchy, only top-level sections, and only the article.

Passage retrieval tasks are often criticized for leading to test collections that are not reusable. We are aware of this problem and intend to address it through legal span sets and optional candidate sets.

## Test Data and Evaluation

We employ both an automatic binary evaluation and a fine-grained manual evaluation procedure on 40 test stubs.

Analogously to the training signals, an automatic (yet approximate) test signal is derived from the passages of the original article as well as entities mentioned therein. However, this test signal is not necessarily reliable, as even the best article has room for improvement. Especially for complex answers, there are often different ways to provide an equally good answer.

Consequently, for a subset of test stubs, contributed entity-passage rankings will additionally be manually assessed for relevance, reflecting different levels of relevance. This manual assessment also provides a reference for the quality of the automatic evaluation procedure.

# Data Set Description

We provide passage corpus, stubs, and example articles in CBOR-encoded archives to preserve hierarchical structure and entity links. Support tools are provided for Python 3.5 and Java 1.7.

Articles we provide are encoded with the following grammar. Terminal nodes are indicated by $.

    Page         -> $pageName $pageId [PageSkeleton]
    PageSkeleton -> Section | Para | Image
    Section      -> $sectionHeading $sectionHeadingId [PageSkeleton]
    Para         -> Paragraph
    Paragraph    -> $paragraphId, [ParaBody]
    Image        -> $imageURL [PageSkeleton]
    ParaBody     -> ParaText | ParaLink
    ParaText     -> $text
    ParaLink     -> $targetPage $targetPageId $linkSection $anchorText

Support tools for reading this format are provided in the trec-car-tools repository.
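As a rough sketch of reading these archives with the Python tools (file names are placeholders; function and attribute names follow the read_data module of trec-car-tools and should be checked against the branch matching your release):

    from trec_car.read_data import iter_paragraphs, iter_annotations

    # Iterate over the passage corpus: each paragraph carries its id and plain text.
    with open('paragraphCorpus.cbor', 'rb') as f:
        for para in iter_paragraphs(f):
            print(para.para_id, para.get_text())

    # Iterate over full articles (pages with their section skeleton).
    with open('articles.cbor', 'rb') as f:
        for page in iter_annotations(f):
            print(page.page_id, page.page_name)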

## Passage Format

All passages that are cut out of Wikipedia articles are assigned a unique (SHA256-hashed) paragraphId that does not reveal the originating article. Pairs of (paragraphId, paragraphContent) are provided as the passage corpus.

## Outline Format

Each extracted page outline provides the pageName (i.e., the Wikipedia title), the pageId, and a nested list of children. Each section is represented by a heading, a headingId, and a list of its children in order. Children can be sections or paragraphs.

Each section in the corpus is uniquely identified by a section path of the form

    pageName/section1/section1.1/section1.1.1

To avoid UTF-8 encoding and whitespace issues with existing evaluation tools like trec_eval, the pageId and headingId are projected to ASCII characters using a variant of URL percent encoding.
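As a rough illustration of the idea (the exact safe-character set used by the track may differ; trec-car-tools provides the authoritative identifiers):

    from urllib.parse import quote

    def to_ascii_id(heading):
        # Percent-encode everything outside unreserved ASCII characters so that
        # whitespace and non-ASCII characters survive trec_eval round trips.
        return quote(heading, safe='')

    parts = ['Green sea turtle', 'Distribution and habitat']
    print('/'.join(to_ascii_id(p) for p in parts))
    # -> Green%20sea%20turtle/Distribution%20and%20habitat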

## Qrels

We provide training/evaluation annotations in the trec_eval compatible Qrels format.

We provide qrels for passage and entity rankings as well as different variations on the section path:

• full section paths (hierarchical)
• only top-level sections (toplevel)
• only the article (article)

Format for passages:

    $sectionId 0 $paragraphId $isRelevant

Format for entities (where the entity is represented through the Wikipedia article it represents, using percent encoding):

    $sectionId 0 $pageId $isRelevant

For automatic training/test data, $isRelevant is either 1 or 0 (only entries with value 1 are reported). For manual test data, $isRelevant is a grading on the PEGFB scale (4-0) according to:

• 4: perfect, sums up everything you need to know.
• 3: excellent, must be mentioned
• 2: good, should be mentioned
• 1: fair, could be mentioned
• 0: bad, should not be mentioned
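For illustration only (the section path and the shortened paragraph ids below are made up), an automatic binary entry and a manually graded entry could look like this:

    Green%20sea%20turtle/Distribution%20and%20habitat 0 38bbd5c1a0dfd4d547c4d4ba84f40de3 1
    Green%20sea%20turtle/Distribution%20and%20habitat 0 7c1f0a9e2d4b6c8a1e3f5d7b9a0c2e4f 2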

## Training Articles

We also provide training data not just as stubs and Qrels but also (alternatively) as annotated articles.

These follow a similar format to the stubs, but also include the passages (with their IDs) inside their sections.

Details on the data selection process are available here.

# Submission Format

Retrieval results are uploaded through the official NIST submission page, using a format compatible with trec_eval .run files. Each ranking entry is represented by a single line in the format below. Please submit a separate run file for every approach, but include rankings for all queries.

Notice that the format differs slightly between tasks. A validation script for your runs will be provided soon.

Per task and per team, up to three submissions will be accepted.

## Passage Retrieval

For every section, a ranking of paragraphs is to be retrieved (as identified by the $paragraphId).

    $sectionId Q0 $paragraphId $rank $score $teamname-$methodname

## Entity Retrieval

For every section, a ranking of entities is to be retrieved (as identified by the $entityId, which coincides with the $pageId – the percent-encoded Wikipedia title). Optionally, a paragraph from the corpus can be provided as provenance for why the entity is relevant for the section. If the paragraph is omitted, the first paragraph from the entity’s page will be used as default provenance. The format is as follows (first: with paragraph, second: default provenance):

    $sectionId Q0 $paragraphId/$entityId $rank $score $teamname-$methodname
    $sectionId Q0 $entityId $rank $score $teamname-$methodname
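A minimal sketch of writing a passage run in this format (the helper function, identifiers, and scores below are hypothetical and not part of the track tooling):

    # Hypothetical helper: writes one trec_eval-style run line per ranking entry,
    # i.e. "$sectionId Q0 $paragraphId $rank $score $teamname-$methodname".
    def write_passage_run(rankings, team, method, path):
        """rankings: dict mapping sectionId -> list of (paragraphId, score), best first."""
        with open(path, 'w') as out:
            for section_id, entries in rankings.items():
                for rank, (para_id, score) in enumerate(entries, start=1):
                    out.write('%s Q0 %s %d %f %s-%s\n' %
                              (section_id, para_id, rank, score, team, method))

    write_passage_run(
        {'Green%20sea%20turtle/Distribution%20and%20habitat':
            [('38bbd5c1a0dfd4d547c4d4ba84f40de3', 12.5)]},
        team='myteam', method='bm25', path='myteam-bm25.run')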

# Releases

• v2.1-release (July 25, 2018): Official Y2 evaluation dataset! Released test queries based on selected outlines from a Wikipedia dump of June 1, 2018 and outlines from AI2’s Textbook Question Answering training dataset. Corpus and train data are based on the Wikipedia dump from December 22, 2016.

Previous releases - do not use!

• v2.0-release (January 1, 2018): Fixed Wikipedia parsing issues, leading to new paragraph ids. New release of all datasets including mapping of v1.x paragraph ids to v2.x paragraph ids. Release of allButBenchmark. All data based on Wikipedia dump from December 22, 2016.

• v1.5-release (June 22, 2017): release of the paragraph collection, test topics, train topics, test200, large-scale training data and unprocessed training data. All based on the Wikipedia dump from December 22, 2016.

As formats may change slightly between releases, make sure you check out support tools from the corresponding branch in trec-car-tools.

• v1.4-release (Jan 24, 2017): release of the paragraph collection, spritzer, and halfwiki

• v1.2-release (Jan 17, 2017): release of the paragraph collection, spritzer, and halfwiki

• v1.1-release (Jan 11, 2017): release of the paragraph collection, spritzer, and halfwiki – no longer supported.

• 2016-11-v1.0-release: Pre-release to straighten out issues in the data format – Format change, no longer supported.

## Paragraph corpus

Collection of paragraphs extracted from Wikipedia articles that meet the selection criteria. The links between articles are preserved (see format definition above).

## Test topics Y1

The official test topics for benchmarkY1 are released as benchmarkY1test.

The selection process is as follows:

• data selection criteria apply
• 50/50 training/test split on the raw data was decided in advance
• Inspired by topics in the news (e.g., sustainability and fair labor) and things of popular interest (e.g., coffee), a set of 250 pages was selected by one person.
• The set of pages that fall into the training set are released as training topics (see below). The remaining pages constitute the test set.

## Test topics Y2

The official test topics for benchmarkY2 are released as benchmarkY2test.

The selection process is as follows:

• data selection criteria apply
• 30ish outlines from a 2018 Wikipedia dump were selected (non-overlapping with earlier releases of training data)
• 30ish outlines from AI2’s Textbook Question Answering (TQA) training dataset

• Inspired by topics of popular interest (e.g., coffee), the outlines were selected by two people.

• The outlines are to be filled with the v2.x paragraphCorpus (based on a Wiki dump from 2016). Models are to be trained on all previously released training data.

## Training corpora

We derive a subset of pages matching the list of selection criteria as an initial training corpus. The data is provided as five folds, which are intended for cross-validation (see release).
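A minimal sketch of a leave-one-fold-out loop over the five folds (the fold file names are placeholders; substitute the names that ship with the release):

    # Hypothetical 5-fold cross-validation loop; file names are placeholders.
    folds = ['train.fold%d.cbor' % i for i in range(5)]

    for held_out in range(5):
        train_files = [f for i, f in enumerate(folds) if i != held_out]
        test_file = folds[held_out]
        # train a model on train_files, evaluate on test_file, average the results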

Additionally, a manual selection of pages with characteristics, types, and structure similar to the test topics is provided for training:

• 199 pages from fold0 test200
• 130 pages across all folds benchmarkY1train – automatic assessments only.
• 100+ pages from the official Y1 evaluation benchmarkY1test – automatic (for passage and entity) and manual (passage only) assessments.

## Knowledge Graph Data

To enable knowledge graph-aware methods, we offer a large dump of unprocessed and unfiltered Wikipedia pages called unprocessedAllButBenchmark. Please do not use article information from DBpedia as you are likely leaking test data into your model.

The dump contains all 2016 Wikipedia pages in a (nearly) unprocessed format, only omitting outlines contained in the benchmarks: test200, benchmarkY1train, benchmarkY1test, benchmarkY2test. Only redirects, disambiguation pages, and category links were resolved and stored in the pages’ metadata.

To avoid train/test signal leakage, the dump is also offered as five folds that are consistent with the larger training corpora.