# Introduction

Current retrieval systems provide good solutions for phrase-level retrieval of simple facts and entity-centric needs. This track encourages research on answering more complex information needs with longer answers. Much like Wikipedia pages synthesize knowledge that is globally distributed, we envision systems that collect relevant information from an entire corpus, creating synthetically structured documents by collating retrieved results.

The Overview Report from 2018 (CAR-Y2) will be published in April 2019. TREC attendees receive it on their thumb drive.

More details can be found in the TREC CAR overview report from 2017 (TREC CAR Y1). Please cite it as:

Dietz, Laura; Verma, Manisha; Radlinski, Filip; Craswell, Nick: TREC Complex Answer Retrieval Overview. In: Proceedings of the Text REtrieval Conference (TREC), 2017.

## Organization

July 25: New dataset for Y2 evaluation (v2.1)!

## Example

To motivate a brief example, consider a user wondering about the latest advances in mobile technology. She heard that there is a new iPhone on the market and is looking for a summary of its features or issues. With this intention in mind, she enters the query "iPhone 7 new features and issues". A possible answer she would be very happy to receive could look like this:

Despite the inclusion of an adapter, the removal of the headphone jack was met with criticism. Criticism was based primarily on the following arguments:

• Digital output (as opposed to analog output from the 3.5 mm headphone jack) does not show any notable improvement in sound quality;

• the inability to charge the iPhone 7 and listen to music simultaneously without Bluetooth;

• and the inconvenience of having to carry around an adapter for what is purely a mobile device, diminishing its utility.

• In particular, Apple’s vice president Phillip Schiller, who announced the change, was mocked extensively online for stating that removing the headphone jack took ‘courage’.

• An online petition created by the consumer group SumOfUs that accuses Apple of planned obsolescence and causing substantial electronic waste by removing the headphone jack reached over 300,000 signatures.

While this example was taken from the Wikipedia article IPhone_7, it should be possible to identify such a list of issues from a Web collection with passage retrieval, consolidation, and organization. Of course, one might envision other responses that would satisfy the information need equally well.

Given an article stub $$Q$$, retrieve for each of its sections $$H_i$$, a ranking of relevant entity-passage tuples $$(E,P)$$. The passage $$P$$ is taken from a provided passage corpus. The entity $$E$$ refers to an entry in the provided knowledge base. We define a passage or entity as relevant if the passage content or entity should be mentioned in the knowledge article. Different degrees of “could be mentioned”, “should be mentioned”, or “must be mentioned” are represented through a PEGFB annotation scale during manual assessment.

Two tasks are offered: Passage ranking and Entity ranking.

In the passage ranking task, entities are omitted.

In the entity ranking task, the entity must be given and can optionally be complemented with a passage that serves as provenance for why the entity is relevant. If the passage is omitted, it will be replaced with the first paragraph from the entity’s article.

## Passage Corpus and Training Data

Supporting the neural IR community, we provide a large (yet approximate) training set based on a large collection of knowledge articles from Wikipedia that are divided into content passages and outlines. Across all articles, all extracted paragraphs are provided as one large passage corpus (see paragraphs release). We ensure that entity links remain intact. These are gathered from links to Wikipedia (and might in the future be augmented with the FACC1 collection and the TagMe toolkit).

From all stubs obtained in this process, we will select 1000 stubs for which the original stub association is revealed as a training set. Additionally, we open up 50% of a recent Wikipedia dump for training (see halfwiki release).

For each section $$H_i$$ in training stubs $$Q$$ we indicate which passages originated from the section and which knowledge base entities are mentioned therein. We provide three variations of trec_eval compatible qrels files, using the section hierarchy, only top-level sections, and only the article.

Passage retrieval tasks are often criticized for leading to test collections that are not reusable. We are aware of this problem and intend to address it through legal span sets and optional candidate sets.

## Test Data and Evaluation

We employ both an automatic binary evaluation and a fine-grained manual evaluation procedure on 40 test stubs.

Analogously to the training signals, an automatic (yet approximate) test signal is derived from the passages of the original article as well as entities mentioned therein. However, this test signal is not necessarily reliable, as even the best article has room for improvement. Especially for complex answers, there are often different ways to provide an equally good answer.

Consequently, for a subset of test stubs, contributed entity-passage rankings will additionally be manually assessed for relevance, reflecting different levels of relevance. This manual assessment also provides a reference for the quality of the automatic evaluation procedure.

# Data Set Description

We provide passage corpus, stubs, and example articles in CBOR-encoded archives to preserve hierarchical structure and entity links. Support tools are provided for Python 3.5 and Java 1.7.

Articles we provide are encoded with the following grammar. Terminal nodes are indicated by $.

    Page         -> $pageName $pageId [PageSkeleton]
    PageSkeleton -> Section | Para | Image
    Section      -> $sectionHeading $sectionHeadingId [PageSkeleton]
    Para         -> Paragraph
    Paragraph    -> $paragraphId, [ParaBody]
    Image        -> $imageURL [PageSkeleton]
    ParaBody     -> ParaText | ParaLink
    ParaText     -> $text
    ParaLink     -> $targetPage $targetPageId $linkSection $anchorText

Support tools for reading this format are provided in the trec-car-tools repository.
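As a rough sketch of reading these archives with the Python tools (file names are placeholders; function and attribute names follow the read_data module of trec-car-tools and should be checked against the branch matching your release):

    from trec_car.read_data import iter_paragraphs, iter_annotations

    # Iterate over the passage corpus: each paragraph carries its id and plain text.
    with open('paragraphCorpus.cbor', 'rb') as f:
        for para in iter_paragraphs(f):
            print(para.para_id, para.get_text())

    # Iterate over full articles (pages with their section skeleton).
    with open('articles.cbor', 'rb') as f:
        for page in iter_annotations(f):
            print(page.page_id, page.page_name)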

## Passage Format

All passages that are cut out of Wikipedia articles are assigned a unique (SHA256-hashed) paragraphId that does not reveal the originating article. Pairs of (paragraphId, paragraphContent) are provided as the passage corpus.

## Outline Format

Each extracted page outline provides the pageName (i.e., the Wikipedia title), the pageId, and a nested list of children. Each section is represented by a heading, a headingId, and a list of its children in order. Children can be sections or paragraphs.

Each section in the corpus is uniquely identified by a section path of the form

    pageName/section1/section1.1/section1.1.1

To avoid UTF-8 encoding and whitespace issues with existing evaluation tools like trec_eval, the pageId and headingId are projected to ASCII characters using a variant of URL percent encoding.
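As a rough illustration of the idea (the exact safe-character set used by the track may differ; trec-car-tools provides the authoritative identifiers):

    from urllib.parse import quote

    def to_ascii_id(heading):
        # Percent-encode everything outside unreserved ASCII characters so that
        # whitespace and non-ASCII characters survive trec_eval round trips.
        return quote(heading, safe='')

    parts = ['Green sea turtle', 'Distribution and habitat']
    print('/'.join(to_ascii_id(p) for p in parts))
    # -> Green%20sea%20turtle/Distribution%20and%20habitat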

## Qrels

We provide training/evaluation annotations in the trec_eval compatible Qrels format.

We provide qrels for passage and entity rankings as well as different variations on the section path:

• full section paths (hierarchical)
• only top-level sections (toplevel)
• only the article (article)

Format for passages:

    $sectionId 0 $paragraphId $isRelevant

Format for entities (where the entity is represented through the Wikipedia article it represents, using percent encoding):

    $sectionId 0 $pageId $isRelevant

For automatic training/test data, $isRelevant is either 1 or 0 (only entries with value 1 are reported). For manual test data, $isRelevant is a grading on the PEGFB scale (4-0) according to:

• 4: perfect, sums up everything you need to know.
• 3: excellent, must be mentioned
• 2: good, should be mentioned
• 1: fair, could be mentioned
• 0: bad, should not be mentioned
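For illustration only (the section path and the shortened paragraph ids below are made up), an automatic binary entry and a manually graded entry could look like this:

    Green%20sea%20turtle/Distribution%20and%20habitat 0 38bbd5c1a0dfd4d547c4d4ba84f40de3 1
    Green%20sea%20turtle/Distribution%20and%20habitat 0 7c1f0a9e2d4b6c8a1e3f5d7b9a0c2e4f 2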

## Training Articles

We also provide training data not just as stubs and Qrels but also (alternatively) as annotated articles.

These follow a similar format to the stubs, but also include the passages (with their IDs) inside their sections.

Details on the data selection process are available here.

# Submission Format

Retrieval results are uploaded through the official NIST submission page, using a format compatible with trec_eval .run files. Each ranking entry is represented by a single line in the format below. Please submit a separate run file for every approach, but include rankings for all queries.

Notice that the format differs slightly between tasks. A validation script for your runs will be provided soon.

Per task and per team, up to three submissions will be accepted.

## Passage Retrieval

For every section, a ranking of paragraphs is to be retrieved (as identified by the $paragraphId).

    $sectionId Q0 $paragraphId $rank $score $teamname-$methodname

## Entity Retrieval

For every section, a ranking of entities is to be retrieved (as identified by the $entityId, which coincides with the $pageId – the percent-encoded Wikipedia title). Optionally, a paragraph from the corpus can be provided as provenance for why the entity is relevant for the section. If the paragraph is omitted, the first paragraph from the entity’s page will be used as default provenance. The format is as follows (first: with paragraph, second: default provenance):

    $sectionId Q0 $paragraphId/$entityId $rank $score $teamname-$methodname
    $sectionId Q0 $entityId $rank $score $teamname-$methodname
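A minimal sketch of writing a passage run in this format (the helper function, identifiers, and scores below are hypothetical and not part of the track tooling):

    # Hypothetical helper: writes one trec_eval-style run line per ranking entry,
    # i.e. "$sectionId Q0 $paragraphId $rank $score $teamname-$methodname".
    def write_passage_run(rankings, team, method, path):
        """rankings: dict mapping sectionId -> list of (paragraphId, score), best first."""
        with open(path, 'w') as out:
            for section_id, entries in rankings.items():
                for rank, (para_id, score) in enumerate(entries, start=1):
                    out.write('%s Q0 %s %d %f %s-%s\n' %
                              (section_id, para_id, rank, score, team, method))

    write_passage_run(
        {'Green%20sea%20turtle/Distribution%20and%20habitat':
            [('38bbd5c1a0dfd4d547c4d4ba84f40de3', 12.5)]},
        team='myteam', method='bm25', path='myteam-bm25.run')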

# Releases

• v2.1-release (July 25, 2018): Official Y2 evaluation dataset! Released test queries based on selected outlines from a Wikipedia dump of June 1, 2018 and outlines from AI2’s Textbook Question Answering training dataset. Corpus and train data are based on the Wikipedia dump from December 22, 2016.

Previous releases - do not use!

• v2.0-release (January 1, 2018): Fixed Wikipedia parsing issues, leading to new paragraph ids. New release of all datasets including mapping of v1.x paragraph ids to v2.x paragraph ids. Release of allButBenchmark. All data based on Wikipedia dump from December 22, 2016.

• v1.5-release (June 22, 2017): release of the paragraph collection, test topics, train topics, test200, large-scale training data and unprocessed training data. All based on the Wikipedia dump from December 22, 2016.

As formats may change slightly between releases, make sure you check out support tools from the corresponding branch in trec-car-tools.

• v1.4-release (Jan 24, 2017): release of the paragraph collection, spritzer, and halfwiki

• v1.2-release (Jan 17, 2017): release of the paragraph collection, spritzer, and halfwiki

• v1.1-release (Jan 11, 2017): release of the paragraph collection, spritzer, and halfwiki – no longer supported.

• 2016-11-v1.0-release: Pre-release to straighten out issues in the data format – Format change, no longer supported.

## Paragraph corpus

Collection of paragraphs extracted from Wikipedia articles that meet the selection criteria. The links between articles are preserved (see format definition above).

## Test topics Y1

The official test topics for benchmarkY1 are released as benchmarkY1test.

The selection process is as follows:

• data selection criteria apply
• 50/50 training/test split on the raw data was decided in advance
• Inspired by topics in the news (e.g., sustainability and fair labor) and things of popular interest (e.g., coffee), a set of 250 pages was selected by one person.
• The set of pages that fall into the training set are released as training topics (see below). The remaining pages constitute the test set.

## Test topics Y2

The official test topics for benchmarkY2 are released as benchmarkY2test.

The selection process is as follows:

• data selection criteria apply
• 30ish outlines from a 2018 Wikipedia dump were selected (non-overlapping with earlier releases of training data)
• 30ish outlines from AI2’s Textbook Question Answering (TQA) training dataset

• Inspired by topics of popular interest (e.g., coffee), the outlines were selected by two people.

• The outlines are to be filled with the v2.x paragraphCorpus (based on a Wiki dump from 2016). Models are to be trained on all previously released training data.

## Training corpora

We derive a subset of pages matching the list of selection criteria as an initial training corpus. The data is provided as five folds, which are intended for cross-validation (see release).
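A minimal sketch of a leave-one-fold-out loop over the five folds (the fold file names are placeholders; substitute the names that ship with the release):

    # Hypothetical 5-fold cross-validation loop; file names are placeholders.
    folds = ['train.fold%d.cbor' % i for i in range(5)]

    for held_out in range(5):
        train_files = [f for i, f in enumerate(folds) if i != held_out]
        test_file = folds[held_out]
        # train a model on train_files, evaluate on test_file, average the results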

Additionally, a manual selection of pages with characteristics, types, and structure similar to the test topics is provided for training:

• 199 pages from fold0 test200
• 130 pages across all folds benchmarkY1train – automatic assessments only.
• 100+ pages from the official Y1 evaluation benchmarkY1test – automatic (for passage and entity) and manual (passage only) assessments.

## Knowledge Graph Data

To enable knowledge graph-aware methods, we offer a large dump of unprocessed and unfiltered Wikipedia pages called unprocessedAllButBenchmark. Please do not use article information from DBpedia as you are likely leaking test data into your model.

The dump contains all 2016 Wikipedia pages in a (nearly) unprocessed format, only omitting outlines contained in the benchmarks: test200, benchmarkY1train, benchmarkY1test, benchmarkY2test. Only redirects, disambiguation pages, and category links were resolved and stored in the pages’ metadata.

To avoid train/test signal leakage, the dump is also offered as five folds that are consistent with the larger training corpora.