Current retrieval systems handle phrase-level retrieval well for simple fact- and entity-centric needs. This track encourages research on answering more complex information needs with longer answers. Much like Wikipedia pages synthesize knowledge that is globally distributed, we envision systems that collect relevant information from an entire corpus, creating synthetically structured documents by collating retrieved results.

Overview Report from 2018 (CAR-Y2) will be published in April 2019. TREC Attendees receive it on their thumbdrive.

More details are available in the TREC CAR Overview report from 2017 (TREC CAR Y1). Please cite as:

Laura Dietz, Manisha Verma, Filip Radlinski, and Nick Craswell.
"TREC Complex Answer Retrieval Overview".
In: Proceedings of the Text REtrieval Conference (TREC), 2017.


Join our mailing list:

July 25: New dataset for Y2 evaluation (v2.1)!


To motivate the task with a brief example, consider a user wondering about the latest advances in mobile technology. She heard that there is a new iPhone on the market and is looking for a summary of its features or issues. With this intention in mind, she enters the query "iPhone 7 new features and issues". A possible answer she would be very happy to receive could look like this:

Despite the inclusion of an adapter, the removal of the headphone jack was met with criticism. Criticism was based primarily on the following arguments:

  • Digital output (as opposed to analog output from the 3.5 mm headphone jack) does not show any notable improvement in sound quality;

  • the inability to charge the iPhone 7 and listen to music simultaneously without Bluetooth;

  • and the inconvenience of having to carry around an adapter for what is purely a mobile device, diminishing its utility.

  • In particular, Apple’s vice president Phillip Schiller, who announced the change, was mocked extensively online for stating that removing the headphone jack took ‘courage’.

  • An online petition created by the consumer group SumOfUs that accuses Apple of planned obsolescence and causing substantial electronic waste by removing the headphone jack reached over 300,000 signatures.

While this example was taken from the Wikipedia article IPhone_7, it should be possible to identify such a list of issues from a Web collection with passage retrieval, consolidation, and organization. Of course, one might envision other responses that would satisfy the information need equally well.


Given an article stub \(Q\), retrieve, for each of its sections \(H_i\), a ranking of relevant entity-passage tuples \((E,P)\). The passage \(P\) is taken from a provided passage corpus. The entity \(E\) refers to an entry in the provided knowledge base. We define a passage or entity as relevant if the passage content or entity should be mentioned in the knowledge article. Different degrees of relevance (“could be mentioned”, “should be mentioned”, or “must be mentioned”) are represented through a PEGFB annotation scale during manual assessment.

Two tasks are offered: Passage ranking and Entity ranking.

In the passage ranking task, entities are omitted.

In the entity ranking task, the entity must be given and can optionally be complemented with a passage that serves as provenance for why the entity is relevant. If the passage is omitted, it will be replaced with the first paragraph from the entity’s article.

Details on the submission format are given at the end of this page.

Passage Corpus and Training Data

To support the neural IR community, we provide a large (yet approximate) training set based on a large collection of knowledge articles from Wikipedia that are divided into content passages and outlines. Across all articles, all extracted paragraphs are provided as one large passage corpus (see paragraphs release). We ensure that entity links remain intact. These are gathered from links to Wikipedia (and in future might be augmented with the FACC1 collection and the TagMe toolkit).

From all stubs obtained in this process, we will select 1000 stubs for which the original stub association is revealed as a training set. Additionally, we open up 50% of a recent Wikipedia dump for training (see halfwiki release).

For each section \(H_i\) in training stubs \(Q\) we indicate which passages originated from the section and which knowledge base entities are mentioned therein. We provide three variations of trec_eval compatible qrels files, using the section hierarchy, only top-level sections, and only the article.
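
As a rough illustration of the three granularities, the sketch below maps a section to the query unit each qrels variation would use. It assumes, purely for illustration, that a section is addressed by a tuple of identifiers (page first, then nested headings); the function name `truncate_section`, the variant labels, and the example identifiers are hypothetical and not part of the released tools.

```python
from typing import Tuple

# Hypothetical representation: (pageId, heading1, heading2, ...)
SectionPath = Tuple[str, ...]


def truncate_section(path: SectionPath, variant: str) -> SectionPath:
    """Map a full section path to the query unit used by one qrels variation."""
    if variant == "hierarchical":
        return path          # keep the full section hierarchy
    if variant == "toplevel":
        return path[:2]      # page plus its top-level section only
    if variant == "article":
        return path[:1]      # the article as a whole
    raise ValueError("unknown variant: " + variant)


# Illustrative identifiers only.
path = ("IPhone_7", "Reception", "Headphone%20jack")
for variant in ("hierarchical", "toplevel", "article"):
    print(variant, truncate_section(path, variant))
```

Under this reading, a passage that is relevant to a nested section also counts as relevant to its top-level section and to the article.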

Passage retrieval tasks are often criticized for leading to test collections that are not reusable. We are aware of this problem and intend to solve it through legal span sets and optional candidate sets.

Test Data and Evaluation

We enact both an automatic binary and a fine-grained manual evaluation procedure on 40 test stubs.

Analogously to the training signals, an automatic (yet approximate) test signal is derived from the passages of the original article as well as entities mentioned therein. However, this test signal is not necessarily reliable, as even the best article has room for improvement. Especially for complex answers, there are often different ways to provide an equally good answer.

Consequently, for a subset of test stubs, contributed entity-passage rankings will additionally be manually assessed for relevance, reflecting different levels of relevance. This manual assessment also provides a reference for the quality of the automatic evaluation procedure.

Data Set Description

See the data releases page for data set download.

We provide passage corpus, stubs, and example articles in CBOR-encoded archives to preserve hierarchical structure and entity links. Support tools are provided for Python 3.5 and Java 1.7.

Articles we provide are encoded with the following grammar. Terminal nodes are indicated by $.

     Page         -> $pageName $pageId [PageSkeleton]
     PageSkeleton -> Section | Para | Image
     Section      -> $sectionHeading $sectionHeadingId [PageSkeleton]
     Para         -> Paragraph
     Paragraph    -> $paragraphId [ParaBody]
     Image        -> $imageURL [PageSkeleton]
     ParaBody     -> ParaText | ParaLink
     ParaText     -> $text
     ParaLink     -> $targetPage $targetPageId $linkSection $anchorText

Support tools for reading this format are provided in the trec-car-tools repository.
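
To make the grammar concrete, here is a minimal Python sketch that mirrors it with plain data classes. It is not the trec-car-tools API: all class, field, and function names are assumptions chosen for illustration, the `Para` wrapper is folded into `Paragraph`, `Image` nodes are left out for brevity, and the identifiers in the example are placeholders rather than real corpus IDs.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple, Union


@dataclass
class ParaText:
    text: str


@dataclass
class ParaLink:
    target_page: str
    target_page_id: str
    link_section: str
    anchor_text: str


@dataclass
class Paragraph:
    paragraph_id: str
    bodies: List[Union[ParaText, ParaLink]] = field(default_factory=list)

    def plain_text(self) -> str:
        # Links contribute their anchor text to the readable passage.
        return "".join(
            b.text if isinstance(b, ParaText) else b.anchor_text
            for b in self.bodies
        )


@dataclass
class Section:
    heading: str
    heading_id: str
    children: List[Union["Section", Paragraph]] = field(default_factory=list)


@dataclass
class Page:
    page_name: str
    page_id: str
    skeleton: List[Union[Section, Paragraph]] = field(default_factory=list)


def section_paths(page: Page) -> Iterator[Tuple[str, ...]]:
    """Yield (pageId, headingId, ...) tuples for every section, depth-first."""
    def walk(children, prefix):
        for child in children:
            if isinstance(child, Section):
                path = prefix + (child.heading_id,)
                yield path
                yield from walk(child.children, path)
    yield from walk(page.skeleton, (page.page_id,))


# Placeholder page; real paragraph IDs are hashes.
page = Page("IPhone 7", "IPhone_7", [
    Paragraph("p0", [
        ParaText("The "),
        ParaLink("Apple Inc.", "Apple_Inc.", "", "Apple"),
        ParaText(" phone."),
    ]),
    Section("Reception", "Reception", [
        Section("Headphone jack", "Headphone%20jack",
                [Paragraph("p1", [ParaText("...")])]),
    ]),
])

for path in section_paths(page):
    print(path)
```

Keeping links as separate `ParaLink` nodes, rather than flattening them into text, is what lets the entity annotations survive alongside the passage content.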

Passage Format

All passages that are cut out of Wikipedia articles are assigned a unique (SHA256 hashed) paragraphId that will not reveal the originating article. Pairs of (paragraphId, paragraphContent) are provided as the passage corpus.

Outline Format

Each extracted page outline provides the pageName (e.g., Wikipedia title), the pageId, and a nested list of children. Each section is represented by a heading, a headingId, and a list of its children in order. Children can be sections or paragraphs.

Each section in the corpus is uniquely identified by a section path of the form


To avoid UTF8-encoding and whitespace issues with existing evaluation tools like trec_eval, the pageId and headingId are projected to ASCII characters using a variant of URL percent encoding.
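
The idea behind the projection can be sketched with Python's standard library. The exact encoding variant used for the official identifiers is not specified here, so `urllib.parse.quote` is only a stand-in and may not reproduce released IDs byte for byte; the helper name `to_ascii_id` is hypothetical.

```python
from urllib.parse import quote, unquote


def to_ascii_id(title: str) -> str:
    # safe="" also escapes "/", so the result contains no slash or whitespace.
    return quote(title, safe="")


encoded = to_ascii_id("Café au lait")
print(encoded)           # Caf%C3%A9%20au%20lait
print(unquote(encoded))  # Café au lait
```

Because the encoded identifier contains only ASCII characters and no whitespace, it can be used directly as a whitespace-delimited field in trec_eval input files.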


We provide training/evaluation annotations in the trec_eval compatible Qrels format.

We provide qrels for passage and entity rankings as well as different variations on the section path:

Format for passages:

    $sectionId 0 $paragraphId $isRelevant

Format for entities (where each entity is identified by the percent-encoded pageId of its Wikipedia article):

    $sectionId 0 $pageId $isRelevant

For automatic training/test data, $isRelevant is either 1 or 0 (only entries with value 1 are reported).
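
A minimal sketch of reading such lines into a lookup table follows. The function name `read_qrels` and the sample identifiers are assumptions for illustration; the code relies only on the four whitespace-separated fields shown above.

```python
from collections import defaultdict
from typing import Dict, Iterable


def read_qrels(lines: Iterable[str]) -> Dict[str, Dict[str, int]]:
    """Return {sectionId: {documentId: relevance}} from qrels lines."""
    qrels = defaultdict(dict)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        section_id, _unused, doc_id, relevance = line.split()
        qrels[section_id][doc_id] = int(relevance)
    return dict(qrels)


# Placeholder identifiers; real paragraph IDs are hashes.
sample = [
    "IPhone_7/Reception 0 p1 1",
    "IPhone_7/Reception 0 p7 1",
    "IPhone_7/History 0 p3 1",
]
qrels = read_qrels(sample)
print(qrels["IPhone_7/Reception"])
```

The same reader works for passage and entity qrels, since only the meaning of the third field differs, and for graded judgments, since the last field is parsed as an integer.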

For manual test data $isRelevant is a grading on the PEGFB-scale (4-0) according to

Training Articles

In addition to stubs and qrels, we provide the training data as annotated articles.

These follow a format similar to that of the stubs, but also include the passages with their IDs inside their sections.

Details on the data selection process are available here.

Submission Format

Retrieval results are uploaded through the official NIST submission page, using a format compatible with trec_eval .run files. Each ranking entry is represented by a single line in the format below. Please submit a separate run file for every approach, but include rankings for all queries.

Notice that the format differs slightly between tasks. A validation script for your runs will be provided soon.

Per task and per team, up to three submissions will be accepted.

Passage Retrieval

For every section, a ranking of paragraphs is to be retrieved (as identified by the $paragraphId).

    $sectionId Q0 $paragraphId $rank $score $teamname-$methodname
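
As a hedged sketch, run lines in this format could be produced as follows. The function name, the choice to start ranks at 1, and the sample identifiers are assumptions; only the field order is taken from the format above.

```python
def format_passage_run(section_id, scored_paragraphs, team, method):
    """Yield run lines for one section, best-scoring paragraph first."""
    ranked = sorted(scored_paragraphs, key=lambda pair: pair[1], reverse=True)
    for rank, (paragraph_id, score) in enumerate(ranked, start=1):
        yield "{} Q0 {} {} {} {}-{}".format(
            section_id, paragraph_id, rank, score, team, method
        )


# Placeholder section and paragraph identifiers.
lines = list(format_passage_run(
    "IPhone_7/Reception",
    [("p7", 1.0), ("p1", 2.5)],
    "myteam", "bm25",
))
for line in lines:
    print(line)
```

Sorting by score before assigning ranks keeps the two columns consistent with each other.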

Entity Retrieval

For every section, a ranking of entities is to be retrieved (as identified by the $entityId, which coincides with the $pageId, i.e., the percent-encoded Wikipedia title).

Optionally, a paragraph from the corpus can be provided as provenance for why the entity is relevant for the section. If the paragraph is omitted, the first paragraph from the entity’s page will be used as default provenance.

The format is as follows (first: with paragraph, second: default):

    $sectionId Q0 $paragraphId/$entityId $rank $score $teamname-$methodname
    $sectionId Q0 $entityId $rank $score $teamname-$methodname
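
The two variants differ only in the third field, which a small helper can build. This is a sketch under stated assumptions: the function name `entity_run_line` and the sample identifiers are hypothetical, and only the field layout comes from the format above.

```python
from typing import Optional


def entity_run_line(section_id: str, entity_id: str, rank: int, score: float,
                    run_name: str, paragraph_id: Optional[str] = None) -> str:
    # With provenance the document field is "$paragraphId/$entityId";
    # without it, only "$entityId" is written.
    if paragraph_id is None:
        doc_field = entity_id
    else:
        doc_field = paragraph_id + "/" + entity_id
    return "{} Q0 {} {} {} {}".format(section_id, doc_field, rank, score, run_name)


# Placeholder identifiers.
with_provenance = entity_run_line(
    "IPhone_7/Reception", "Apple_Inc.", 1, 3.2, "myteam-links", paragraph_id="p1"
)
default = entity_run_line("IPhone_7/Reception", "SumOfUs", 2, 1.7, "myteam-links")
print(with_provenance)
print(default)
```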


Previous releases - do not use!

As formats may change slightly between releases, make sure you check out support tools from the corresponding branch in trec-car-tools.

Paragraph corpus

Collection of paragraphs extracted from Wikipedia articles that meet the selection criteria. The links between articles are preserved (see format definition above).

Test topics Y1

The official test topics for benchmarkY1 are released as benchmarkY1test.

The selection process is as follows:

Test topics Y2

The official test topics for benchmarkY2 are released as benchmarkY2test.

The selection process is as follows:

Training corpora

We derive a subset of pages matching the list of selection criteria as an initial training corpus. The data is provided as five folds, which are intended for cross-validation.

Additionally, a manual selection of pages with characteristics, types, and structure similar to those of the test topics is provided for training:

Knowledge Graph Data

To enable knowledge graph-aware methods, we offer a large dump of unprocessed and unfiltered Wikipedia pages called unprocessedAllButBenchmark. Please do not use article information from DBpedia, as doing so is likely to leak test data into your model.

The dump contains all 2016 Wikipedia pages in a (nearly) unprocessed format, omitting only the outlines contained in the benchmarks: test200, benchmarkY1train, benchmarkY1test, benchmarkY2test. Only redirects, disambiguation pages, and category links were resolved and stored in each page’s metadata.

To avoid train/test signal leakage, the dump is also offered as five folds that are consistent with the larger training corpus.

See data releases for more information.