# Introduction

Current retrieval systems provide strong solutions for phrase-level retrieval of simple facts and entity-centric needs. This track encourages research on answering more complex information needs with longer answers. Much like Wikipedia pages synthesize knowledge that is globally distributed, we envision systems that collect relevant information from an entire corpus, creating synthetically structured documents by collating retrieved results.

More details in:

• TREC CAR Overview report from 2017 (TREC CAR Y1). Please cite as

Dietz, Laura and Verma, Manisha and Radlinski, Filip and Craswell, Nick: “TREC Complex Answer Retrieval Overview”. In: Proceedings of the Text REtrieval Conference (TREC), 2017.

• TREC CAR Overview report from 2018 (TREC CAR Y2). Please cite as

Dietz, Laura and Gamari, Ben and Dalton, Jeff and Craswell, Nick: “TREC Complex Answer Retrieval Overview”. In: Proceedings of the Text REtrieval Conference (TREC), 2018.

## Organization

• June 20: New v2.3 data release
• May 8: Task and guidelines for Y3 evaluation (v2.3)!

## Example

To motivate a brief example, consider a user interested in learning about water pollution through fertilizers, ocean acidification, and aquatic debris, and the effects these have. There is no short and simple answer to this information need. Instead we need to retrieve a complex answer that covers the topic with its different facets and elaborates pertinent connections between entities/concepts. A suitable answer would cover the following:

Through photosynthesis, algae provide food and nutrients for the marine ecosystem. However, during rain storms, fertilizers used in agriculture and lawn care are swept into rivers and coastal seas. Fertilizers contain nitrogen and phosphorus, which stimulate algae growth so that the algae population grows very large very quickly; these events are called algal blooms. The problem is that these algae do not live long; when they die and decompose, oxygen is removed from the water. As a result, fish and shellfish die.

Furthermore, some algal blooms release toxins into the water, which are consumed by shellfish. Humans who consume these toxins through shellfish can suffer neurological damage.

A different source of water pollution is high levels of carbon dioxide in the atmosphere. Oceans absorb carbon dioxide, which lowers the pH level of the water, causing oceans to become acidic. As a result, corals and shellfish are killed and other marine organisms reproduce less. This leads to issues in the food chain, and thereby fewer fish and shellfish for humans to consume.

Finally, trash and other debris enter the waterways through shipping accidents, landfill erosion, or direct dumping of trash into the ocean. This debris is dangerous for aquatic wildlife in two ways. Animals may mistake debris for food and swallow plastic bags, which kills them. Other aquatic animals are tangled in nets and strangled by trash like plastic six-pack rings.

This example was taken from the TQA collection, Effects of Water Pollution, and many similar examples can be found on Wikipedia. Nevertheless, such articles are not available for every imaginable kind of information need, which is why we aim to generate such comprehensive summaries automatically from Web sources through passage retrieval, consolidation, and organization. Of course, one might envision other responses that would satisfy the information need equally well.

We are now entering TREC CAR Y3. While Y1 and Y2 were dedicated to producing passage and entity rankings for the query and facets given in the outline, with Y3 we turn to arranging paragraphs into a topically coherent article.

## Y1 + Y2: Passage and Entity Ranking Task

(This task was only offered in Y1 and Y2.)

Given an outline $$Q$$, retrieve for each of its sections $$H_i$$, a ranking of relevant entity-passage tuples $$(E,P)$$. The passage $$P$$ is taken from a provided passage corpus. The entity $$E$$ refers to an entry in the provided knowledge base. We define a passage or entity as relevant if the passage content or entity should be mentioned in the knowledge article. Different degrees of “could be mentioned”, “should be mentioned”, or “must be mentioned” are represented through a graded annotation scale during manual assessment.

Two tasks are offered: Passage ranking and Entity ranking.

In the passage ranking task, entities are omitted.

In the entity ranking task, the entity must be given, and optionally, complemented with a passage that serves as provenance for why the entity is relevant. If the passage is omitted, it will be replaced with the first paragraph from the entity’s article.

## Y3: Passage Ordering Task

Given an outline $$Q$$, retrieve, select, and arrange a sequence of $$k$$ passages $$P$$ from the provided passage corpus, ideally with:

1. Highest relevance of all passages.
2. Balanced coverage of all query facets as defined through headings $$H_i$$ in the outline.
3. Maximized topical coherence with minimized topic switches, i.e., first all passages about one topic, then all passages of the next topic, avoiding interleaving multiple topics.

The number of passages $$k$$ is given with the topic.

However, to do well on this task, you also need a component that retrieves relevant passages (as in Y1). We will make baseline code available to convert Y1/Y2 passage ranking output into a Y3 passage order. This code implements a baseline that selects the top $$k$$ passages from each heading’s ranking.
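The conversion described above can be sketched as follows. This is an illustrative reimplementation, not the official baseline code; function and variable names are our own. It takes per-heading rankings (in outline order), draws an equal quota from each, keeps passages of the same heading adjacent to limit topic switches, and deduplicates passages that appear in several rankings.

```python
def rankings_to_passage_order(per_heading_rankings, k):
    """Turn Y1/Y2-style per-heading rankings into a Y3 passage order.

    per_heading_rankings: one ranking per heading H_i (in outline order),
    each a list of paragraph ids sorted by descending score.
    k: total number of passages requested for the topic.
    """
    # Equal quota per heading, for balanced facet coverage (assumption).
    quota = max(1, k // max(1, len(per_heading_rankings)))
    seen = set()
    order = []
    for ranking in per_heading_rankings:
        taken = 0
        for pid in ranking:
            if len(order) >= k:
                return order
            # Skip duplicates so each passage appears only once.
            if pid not in seen and taken < quota:
                order.append(pid)
                seen.add(pid)
                taken += 1
    return order
```

Because passages are emitted heading by heading, the arrangement never interleaves topics, matching criterion 3 above.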

## Passage Corpus, Knowledge Base, and Training Data

Paragraph Corpus: A corpus of 20 million paragraphs is provided. These are harvested from Wikipedia pages in a snapshot from 2016 (with hyperlinks preserved). Given the high amount of duplication across Wikipedia pages, the collection was de-duplicated before the data release. These paragraphs are to be used for the passage ranking task.

AllButBenchmark/knowledge base: We provide (nearly all) Wikipedia pages (2016) as standardized information on entities – only Wikipedia pages that are used as test outlines in the benchmarkY… subsets are omitted. We provide full meta information for each page including disambiguation pages and redirects, as well as the full page contents (parsed as outline with paragraphs). To avoid train/test signal leakage, the dump is also offered as five folds that are consistent with the large train corpus.

Queries/Outlines: We provide queries as outlines that are structured into a title and a hierarchy of headings. Please obtain the clean text of these headings from the outlines file – do not attempt to parse the text from the query id!

# Test Data and Evaluation

We enact both an automatic (binary) and a manual (graded) evaluation procedure.

automatic benchmark: We generate a large-scale automatic benchmark for training and automatic evaluation by splitting Wikipedia pages into paragraphs and outlines and asking systems to put the paragraphs back in their place of origin.

For each section $$H_i$$ in training stubs $$Q$$ we indicate which passages originated from the section. To derive a ground truth for entity ranking, we indicate which entities are linked from the section.
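For illustration, deriving the automatic passage ground truth amounts to walking a page’s section tree and marking each paragraph as relevant for the section it came from. The nested-dict page representation below is a simplified stand-in for the actual CBOR structures, and the field names are our own:

```python
def auto_qrels(page_id, sections, parent_path=None):
    """Emit (sectionPath, paragraphId, 1) tuples: each paragraph is
    relevant for the section it originated from.

    sections: list of dicts with keys 'headingId' (str), 'paragraphs'
    (list of paragraph ids), and 'children' (nested sections).
    """
    parent_path = parent_path or page_id
    qrels = []
    for sec in sections:
        # Section paths are of the form pageId/heading1/heading1.1/...
        path = parent_path + "/" + sec["headingId"]
        for pid in sec.get("paragraphs", []):
            qrels.append((path, pid, 1))
        # Recurse into subsections with the extended path.
        qrels.extend(auto_qrels(page_id, sec.get("children", []), path))
    return qrels
```

The entity ground truth is derived analogously, listing the entities linked from each section instead of the paragraphs.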

Passage retrieval tasks are often criticized for leading to test collections that are not reusable. We address this problem through unique paragraph ids; in past years this approach has worked well and yielded reusable benchmarks.

train: For a large-scale automatic benchmark for training data-hungry methods like neural networks, we selected half of all Wikipedia articles that are eligible for TREC CAR and provide them as five folds. As TREC CAR will not choose test outlines from pages that describe people, organizations, or events, these are also removed from the training data. (Full details on the data selection process are available here.)

benchmarkY…: To create high-quality topics, we manually selected outlines from pages about popular science and the environment, which were released across different years: benchmarkY1train (from Wiki 16), benchmarkY1test (from Wiki 16), and benchmarkY2test (from Wiki 18 and TQA). For these subsets, we provide automatic ground truth (similar to “train”) as well as manual ground truth where it was collected.

## Benchmark Train / Test topics Y1

The official test topics for benchmarkY1 are released as benchmarkY1test.

Simultaneously, a subset of train topics was released as benchmarkY1train.

The selection process is as follows:

• data selection criteria apply
• A 50/50 train/test split of the raw data was decided randomly
• Inspired by topics in the news (e.g., sustainability and fair labor) and things of popular interest (e.g., coffee), a set of 250 pages was selected by one person.
• The set of pages that fall into the train set are released as training topics. The remaining pages constitute the test set.

## Benchmark Test topics Y2

The official test topics for benchmarkY2 are released as benchmarkY2test.

The selection process is as follows:

• data selection criteria apply
• About 30 outlines from a 2018 Wikipedia dump were selected (non-overlapping with earlier releases of training data)
• About 30 outlines from AI2’s Textbook Question Answering (TQA) train dataset
• Inspired by topics of popular interest (e.g., coffee), the outlines were selected by two people.
• The outlines are to be filled with the v2.x paragraphCorpus (based on a Wiki dump from 2016). Models are to be trained on all previously released training data.

## Qrels

We provide training/evaluation annotations in the trec_eval compatible Qrels format.

We provide qrels for passage and entity rankings as well as different variations on the section path:

• full section paths (hierarchical)
• only top-level sections (toplevel)
• only the article (article)
• the full hierarchy containing all of the above (tree)

Format for passages:

    $sectionId 0 $paragraphId $isRelevant

Format for entities (where the entity is represented through the Wikipedia article it refers to – using percent encoding):

    $sectionId 0 $pageId $isRelevant

For automatic training/test data, $isRelevant is either 1 or 0 (only entries with 1 are reported). For manual test data, $isRelevant is a graded relevance scale according to

• 3: must be mentioned (excellent)
• 2: should be mentioned (good)
• 1: could be mentioned (fair)
• 0: roughly on topic but not relevant (a near miss)
• -1: not relevant (not even close)
• -2: trash, this passage is not useful for any information need.
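A sketch of reading this qrels format and binarizing the manual grades for use with binary metrics (the field layout follows the format above; the cutoff at grade 1 is one reasonable choice, not an official one):

```python
def read_qrels(lines):
    """Parse trec_eval-style qrels lines '$queryId 0 $docId $grade'
    into a nested dict {queryId: {docId: grade}}."""
    qrels = {}
    for line in lines:
        if not line.strip():
            continue
        query_id, _zero, doc_id, grade = line.split()
        qrels.setdefault(query_id, {})[doc_id] = int(grade)
    return qrels

def lenient_binary(qrels):
    """Binarize manual grades: treat 'could be mentioned' (grade 1)
    and better as relevant, everything else as non-relevant."""
    return {q: {d: int(g >= 1) for d, g in docs.items()}
            for q, docs in qrels.items()}
```

Note that section ids may contain percent-encoded characters; they are still whitespace-free, so a plain `split()` is safe.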

Manual assessments are created by six NIST assessors on assessment pools built from participant submissions.

# Data Format Description

We provide passage corpus, outlines, and example articles in CBOR-encoded archives to preserve hierarchical structure and entity links. Support tools are provided for Python 3.5 and Java 1.7.

Articles we provide are encoded with the following grammar. Terminal nodes are indicated by $.

    Page     -> $pageName $pageId [PageSkeleton] PageType PageMetadata
    PageType -> ArticlePage | CategoryPage | RedirectPage ParaLink | DisambiguationPage

    PageMetadata -> RedirectNames DisambiguationNames DisambiguationIds CategoryNames CategoryIds InlinkIds InlinkAnchors
    RedirectNames       -> [$pageName]
    DisambiguationNames -> [$pageName]
    DisambiguationIds   -> [$pageId]
    CategoryNames       -> [$pageName]
    CategoryIds         -> [$pageId]
    InlinkIds           -> [$pageId]
    InlinkAnchors       -> [$anchorText]

    PageSkeleton -> Section | Para | Image | ListItem
    Section      -> $sectionHeading [PageSkeleton]
    Para         -> Paragraph
    Paragraph    -> $paragraphId, [ParaBody]
    ListItem     -> $nestingLevel, Paragraph
    Image        -> $imageURL [PageSkeleton]
    ParaBody     -> ParaText | ParaLink
    ParaText     -> $text
    ParaLink     -> $targetPage $targetPageId $linkSection $anchorText

Support tools for reading this format are provided in the trec-car-tools repository (the Java version is available through Maven).

## Passage Format

All passages that are cut out of Wikipedia articles are assigned a unique (SHA256-hashed) paragraphId that does not reveal the originating article. Pairs of paragraphId, paragraphContent are provided as the passage corpus. Hyperlinks to other Wikipedia pages (aka entity links) are preserved in the paragraph.

## Outline Format

Each extracted page outline provides the pageName (e.g., the Wikipedia title) and pageId, and a nested list of children. Each section is represented by a heading, a headingId, and a list of its children in order. While for pages children can be sections or paragraphs, outlines do not contain paragraphs.

Each section in the corpus is uniquely identified by a section path of the form

    pageName/section1/section1.1/section1.1.1

To avoid UTF-8 encoding and whitespace issues with existing evaluation tools like trec_eval, the pageId and headingId are projected to ASCII characters using a variant of URL percent encoding.

Unfortunately, there have been many problems whenever participants have tried to “parse” the text of the heading or query out of the heading id (or query id). Please do not do this: the ID is a hash key and does not perfectly preserve the text. Please obtain the text from the outlines file.

# Y1 + Y2 Submission Format

Retrieval results are uploaded through the official NIST submission page, using a format compatible with trec_eval .run files. Each ranking entry is represented by a single line in the format below. Please submit a separate run file for every approach, but include rankings for all queries. Notice that the format differs slightly between tasks. A validation script for your runs will be provided soon. Per task and per team, up to three submissions will be accepted.
## Passage Retrieval

For every section, a ranking of paragraphs is to be retrieved (as identified by the $paragraphId).

    $sectionId Q0 $paragraphId $rank $score $teamname-$methodname
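A sketch of emitting run lines in this format (the four-decimal score formatting is our convention; trec_eval accepts any float):

```python
def run_lines(section_id, ranked, team, method):
    """Format a ranking (list of (paragraphId, score) pairs, best
    first) as trec_eval-compatible run lines:
    $sectionId Q0 $paragraphId $rank $score $teamname-$methodname
    """
    return ["{} Q0 {} {} {:.4f} {}-{}".format(
                section_id, pid, rank, score, team, method)
            for rank, (pid, score) in enumerate(ranked, start=1)]
```

For example, `run_lines("Enwiki%3AAardvark/Behavior", [("p1", 2.5)], "myteam", "bm25")` yields one line with rank 1. Remember to write rankings for all queries of the topic set into a single run file per approach.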

## Entity Retrieval

For every section, a ranking of entities is to be retrieved (as identified by the $entityId, which coincides with the $pageId – the percent-encoded Wikipedia title).

Optionally, a paragraph from the corpus can be provided as provenance for why the entity is relevant for the section. If the paragraph is omitted, the first paragraph from the entity’s page will be used as default provenance.

The format is as follows (first: with paragraph, second: default)

# Releases

Previous releases - do not use!

• v2.1-release (July 25, 2018): Official Y2 evaluation dataset! Released test queries based on selected outlines from the Wikipedia dump of June 1, 2018 and outlines from AI2’s Textbook Question Answering training dataset. Corpus and train data based on the Wikipedia dump from December 22, 2016.

• v2.0-release (January 1, 2018): Fixed Wikipedia parsing issues, leading to new paragraph ids. New release of all datasets including mapping of v1.x paragraph ids to v2.x paragraph ids. Release of allButBenchmark. All data based on Wikipedia dump from December 22, 2016.

• v1.5-release (June 22, 2017): release of the paragraph collection, test topics, train topics, test200, large-scale training data and unprocessed training data. All based on the Wikipedia dump from December 22, 2016.

As formats may change slightly between releases, make sure you check out support tools from the corresponding branch in trec-car-tools.

• v1.4-release (Jan 24, 2017): release of the paragraph collection, spritzer, and halfwiki

• v1.2-release (Jan 17, 2017): release of the paragraph collection, spritzer, and halfwiki

• v1.1-release (Jan 11, 2017): release of the paragraph collection, spritzer, and halfwiki – no longer supported.

• 2016-11-v1.0-release: Pre-release to straighten out issues in the data format – Format change, no longer supported.