Current retrieval systems provide good solutions for phrase-level retrieval of simple, fact- and entity-centric needs. This track encourages research on answering more complex information needs with longer answers. Much like Wikipedia pages synthesize knowledge that is globally distributed, we envision systems that collect relevant information from an entire corpus and create synthetically structured documents by collating the retrieved results.
TREC CAR has concluded. We thank all the participants and TREC organizers and NIST assessors.
PLEASE NOTE: CAR offers multiple tasks (each with their own ground truth): passage retrieval, entity retrieval, article construction. Most published results are on the passage retrieval task. Results between passage and entity retrieval tasks are not comparable!
More details about TREC CAR are available in
TREC CAR Overview report from 2017 (TREC CAR Y1). Please cite as
Dietz, Laura and Verma, Manisha and Radlinski, Filip and Craswell, Nick. “TREC Complex Answer Retrieval Overview”. In: Proceedings of the Text REtrieval Conference (TREC), 2017.
TREC CAR Overview report from 2018 (TREC CAR Y2). Please cite as
Dietz, Laura and Gamari, Ben and Dalton, Jeff and Craswell, Nick. “TREC Complex Answer Retrieval Overview”. In: Proceedings of the Text REtrieval Conference (TREC), 2018.
TREC CAR Overview report from 2019 (TREC CAR Y3). Please cite as
Dietz, Laura and Foley, John. “TREC CAR Y3: Complex Answer Retrieval Overview”. In: Proceedings of the Text REtrieval Conference (TREC), 2019.
Please see the TREC proceedings for participants’ papers: Y3, Y2, Y1
To reproduce results please see the datasets for
We also offer additional Wikipedia dumps in the same format:
Join our mailing list: https://groups.google.com/d/forum/trec-car
July 3, 2020: Released Tools to convert page titles to proper TREC CAR entity ids.
June 1, 2020: Released Dump of Wikipedia 01/01/2020.
As a brief motivating example, consider a user interested in learning about water pollution through fertilizers, ocean acidification, and aquatic debris, and the effects these have. There is no short and simple answer to this information need. Instead we need to retrieve a complex answer that covers the topic with its different facets and elaborates pertinent connections between entities/concepts. A suitable answer would cover the following:
Through photosynthesis, algae provide food and nutrients for the marine ecosystem. However, during rain storms, fertilizers used in agriculture and lawn care are swept into rivers and the coastal sea. Fertilizers contain nitrogen and phosphorus, which stimulate algae growth so that the algae population grows very quickly; this is called an algal bloom. The problem is that these algae do not live long: when they die and decompose, oxygen is removed from the water. As a result, fish and shellfish die.
Furthermore, some algal blooms release toxins into the water, which are consumed by shellfish. Humans who consume these toxins through shellfish can suffer neurological damage.
A different source of water pollution is high levels of carbon dioxide in the atmosphere. Oceans absorb carbon dioxide, which lowers the pH level of the water, causing the oceans to become acidic. As a result, corals and shellfish are killed and other marine organisms reproduce less. This leads to issues in the food chain, and thereby less fish and shellfish for humans to consume.
Finally, trash and other debris get into the waterways through shipping accidents, landfill erosion, or direct dumping of trash into the ocean. This debris is dangerous for aquatic wildlife in two ways. Animals may mistake debris for food and swallow plastic bags, which kills them. Other aquatic animals are tangled in nets and strangled by trash such as plastic six-pack rings.
This example was taken from the TQA collection (Effects of Water Pollution), and many similar examples can be found on Wikipedia. Nevertheless, such articles are not available for every imaginable kind of information need, which is why we aim to generate such comprehensive summaries automatically from Web sources through passage retrieval, consolidation, and organization. Of course, one might envision other responses that would satisfy the information need equally well.
We are now entering TREC CAR Y3. While Y1 and Y2 were dedicated to producing passage and entity rankings for the query and facets given in the outline, with Y3 we turn to arranging paragraphs into a topically coherent article.
(This task was only offered in Y1 and Y2.)
Given an outline \(Q\), retrieve for each of its sections \(H_i\), a ranking of relevant entity-passage tuples \((E,P)\). The passage \(P\) is taken from a provided passage corpus. The entity \(E\) refers to an entry in the provided knowledge base. We define a passage or entity as relevant if the passage content or entity should be mentioned in the knowledge article. Different degrees of “could be mentioned”, “should be mentioned”, or “must be mentioned” are represented through a graded annotation scale during manual assessment.
Two tasks are offered: Passage ranking and Entity ranking.
In the passage ranking task, entities are omitted.
In the entity ranking task, the entity must be given and can optionally be complemented with a passage that serves as provenance for why the entity is relevant. If the passage is omitted, it will be replaced with the first paragraph from the entity’s article.
Details on the submission format are at the end of this page.
(New task for Y3)
Given an outline \(Q\), retrieve, select, and arrange a sequence of \(k\) passages \(P\) from the provided passage corpus, with ideally:
The number of passages \(k\) is given with the topic.
However, in order to do well on this task, you also need some component that retrieves relevant passages (as in Y1). We will make some baseline code available to convert Y1/Y2 passage ranking output into a Y3 passage order. This code implements a baseline that selects the top k passages from each heading’s ranking.
Paragraph Corpus: A corpus of 20 million paragraphs is provided. These are harvested from Wikipedia pages from a 2016 snapshot (with hyperlinks preserved). Given the high amount of duplication across Wikipedia pages, the collection was de-duplicated before the data release. These paragraphs are to be used for the passage ranking task.
AllButBenchmark/knowledge base: We provide (nearly all) Wikipedia pages (2016) as standardized information on entities – only Wikipedia pages that are used as test outlines in the benchmarkY… subsets are omitted. We provide full meta information for each page including disambiguation pages and redirects, as well as the full page contents (parsed as outline with paragraphs). To avoid train/test signal leakage, the dump is also offered as five folds that are consistent with the large train corpus.
Queries/Outlines: We provide queries as outlines that are structured into a title and a hierarchy of headings. Please obtain the clean text for these headings from the outlines file – do not attempt to parse the text from the query id!
We employ both an automatic (binary) and a manual (graded) evaluation procedure.
automatic benchmark: We generate a large-scale automatic benchmark for training and automatic evaluation by splitting Wikipedia pages into paragraphs and outlines and asking systems to put the paragraphs back into their place of origin.
For each section \(H_i\) in training stubs \(Q\) we indicate which passages originated from the section. To derive a ground truth for entity ranking, we indicate which entities are linked from the section.
Passage retrieval tasks are often criticized for leading to test collections that are not reusable. We address this problem through unique paragraph ids; in past years this approach has worked well and yielded reusable benchmarks.
train: For a large-scale automatic benchmark for training data-hungry methods like neural networks, we selected half of all Wikipedia articles that are eligible for TREC CAR and provide them as five folds. As TREC CAR will not choose test outlines from pages that describe people, organizations, or events, these are also removed from the training data. (Full details on the data selection process are available here.)
benchmarkY…: To create high-quality topics, we manually selected outlines from pages about popular science and the environment, which were released across different years, namely benchmarkY1train (from Wiki 16), benchmarkY1test (from Wiki 16), benchmarkY2test (from Wiki 18 and TQA). For these subsets, we provide automatic ground truth (similar to “train”) as well as manual ground truth if it was collected.
The official test topics for the benchmarkY1 are released as benchmarkY1test.
Simultaneously, a subset of train topics was released as benchmarkY1train.
The selection process is as follows:
The official test topics for the benchmarkY2 are released as benchmarkY2test.
The selection process is as follows:
See data releases for more information.
We provide training/evaluation annotations in the trec_eval compatible Qrels format, with qrels for passage and entity rankings as well as different variations on the section path:
Format for passages:
$sectionId 0 $paragraphId $isRelevant
Format for entities (where the entity is represented through the Wikipedia article it represents - using percent encoding):
$sectionId 0 $pageId $isRelevant
For automatic training/test data, $isRelevant is either 1 or 0 (only 1s are reported).
For manual test data, $isRelevant is a graded relevance value according to the annotation scale used during manual assessment (ranging from “could be mentioned” to “must be mentioned”; see above).
Manual assessments are created by six NIST assessors on assessment pools built from participant submissions.
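For illustration, a hypothetical passage qrels line for the Eutrophication facet of the example topic shown further below (paragraph id taken from that example) could look like this:

tqa:effects%20of%20water%20pollution/Eutrophication 0 ece8bed05f22e7c84e63c40759289dd0fd09dae9 1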
See the data releases page for data set download.
We provide passage corpus, outlines, and example articles in CBOR-encoded archives to preserve hierarchical structure and entity links. Support tools are provided for Python 3.5 and Java 1.7.
Articles we provide are encoded with the following grammar. Terminal nodes are indicated by $.
Page -> $pageName $pageId [PageSkeleton] PageType PageMetadata
PageType -> ArticlePage | CategoryPage | RedirectPage ParaLink | DisambiguationPage
PageMetadata -> RedirectNames DisambiguationNames DisambiguationIds CategoryNames CategoryIds InlinkIds InlinkAnchors
RedirectNames -> [$pageName]
DisambiguationNames -> [$pageName]
DisambiguationIds -> [$pageId]
CategoryNames -> [$pageName]
CategoryIds -> [$pageId]
InlinkIds -> [$pageId]
InlinkAnchors -> [$anchorText]
PageSkeleton -> Section | Para | Image | ListItem
Section -> $sectionHeading [PageSkeleton]
Para -> Paragraph
Paragraph -> $paragraphId, [ParaBody]
ListItem -> $nestingLevel, Paragraph
Image -> $imageURL [PageSkeleton]
ParaBody -> ParaText | ParaLink
ParaText -> $text
ParaLink -> $targetPage $targetPageId $linkSection $anchorText
Support tools for reading this format are provided in the trec-car-tools repository (the java version is available through maven).
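For illustration, a minimal sketch of reading the archives with the Python bindings (assuming the trec_car.read_data module with iter_pages and iter_paragraphs; the file names below are placeholders – check the trec-car-tools repository for the exact API of your release):

from trec_car.read_data import iter_pages, iter_paragraphs

# Walk a pages archive: each Page carries its pageName, pageId, and the nested
# PageSkeleton structure described by the grammar above.
with open("allButBenchmark.cbor", "rb") as f:        # placeholder file name
    for page in iter_pages(f):
        print(page.page_name, page.page_id)

# Walk the paragraph corpus: each Paragraph carries its paragraphId and body
# chunks (plain text interleaved with entity links).
with open("paragraphCorpus.cbor", "rb") as f:        # placeholder file name
    for para in iter_paragraphs(f):
        print(para.para_id, para.get_text()[:80])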
All passages that are cut out of Wikipedia articles are assigned a unique (SHA256-hashed) paragraphId that will not reveal the originating article. Pairs of paragraphId, paragraphContent are provided as the passage corpus. Hyperlinks to other Wikipedia pages (aka entity links) are preserved in the paragraph.
Each extracted page outline provides the pageName (e.g., Wikipedia title) and pageId, and a nested list of children. Each section is represented by a heading, a headingId, and a list of its children in order. While for pages the children can be sections or paragraphs, outlines do not contain paragraphs.
Each section in the corpus is uniquely identified by a section path of the form
pageName/section1/section1.1/section1.1.1
To avoid UTF8-encoding and whitespace issues with existing evaluation tools like trec_eval, the pageId and headingId are projected to ASCII characters using a variant of URL percent encoding.
Unfortunately, there have been many problems whenever participants have tried to “parse” the text of the heading or query out of the heading id (or query id). Please do not do this, as the ID is a hash key and does not perfectly preserve the text. Please obtain the text from the outlines file.
Retrieval results are uploaded through the official NIST submission page, using a format compatible with trec_eval .run files. Each ranking entry is represented by a single line in the format below. Please submit a separate run file for every approach, but include rankings for all queries.
Notice that the format differs slightly between tasks. A validation script for your runs will be provided soon.
Per task and per team, up to three submissions will be accepted.
For every section, a ranking of paragraphs is to be retrieved (as identified by the $paragraphId).
$sectionId Q0 $paragraphId $rank $score $teamname-$methodname
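For example, a hypothetical line of a passage run for the Eutrophication facet of the example topic shown further below (paragraph id taken from that example, team/method name made up):

tqa:effects%20of%20water%20pollution/Eutrophication Q0 ece8bed05f22e7c84e63c40759289dd0fd09dae9 1 0.90 myteam-mymethod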
For every section, a ranking of entities is to be retrieved (as identified by the $entityId, which coincides with the $pageId – the percent-encoded Wikipedia title).
Optionally, a paragraph from the corpus can be provided as provenance for why the entity is relevant for the section. If the paragraph is omitted, the first paragraph from the entity’s page will be used as default provenance.
The format is as follows (first: with paragraph, second: default)
$sectionId Q0 $paragraphId/$entityId $rank $score $teamname-$methodname
$sectionId Q0 $entityId $rank $score $teamname-$methodname
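For example, hypothetical lines using the entity and paragraph ids from the example topic shown further below (team/method name made up; first with provenance paragraph, second without):

tqa:effects%20of%20water%20pollution/Eutrophication Q0 ece8bed05f22e7c84e63c40759289dd0fd09dae9/enwiki:Eutrophication 1 0.95 myteam-mymethod
tqa:effects%20of%20water%20pollution/Eutrophication Q0 enwiki:Eutrophication 2 0.90 myteam-mymethod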
We decided to depart from the TREC RUN format towards a JSON-based data format. Each submitted run constitutes one JSON-lines file in the format described below. We opted to include redundancies to help validate for errors (see the population and validation script on the data release page). A JSON-lines file contains each populated page (i.e., passage ordering per page) as one JSON object on a separate single line, delimited by a newline character. It goes without saying that you cannot use a JSON pretty-printer that inserts newlines in the middle of a JSON object. Your submission file must have exactly 131 lines, because the benchmarkY3test benchmark asks for 131 pages.
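As a quick illustration of this constraint (not the official validation script; the file name is a placeholder), a submission can be sanity-checked like this:

import json

# One complete JSON object per line; benchmarkY3test asks for exactly 131 pages.
with open("y3-run.jsonl") as f:
    pages = [json.loads(line) for line in f]
assert len(pages) == 131, "expected 131 pages, got %d" % len(pages)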
Only the following fields are mandatory to participate in Y3:
While we offer only one task, we also evaluate the ranking quality if the paragraph_origins field is populated. We expect that the ranking was used as input to predict the passage ordering and, therefore, all entries in paragraphs have an entry in paragraph_origins. The following fields are mandatory for the ranking evaluation:
The format of the JSON representation of one page, explained in detail. Note that this example is based on an outline from benchmarkY2test; in contrast, benchmarkY3 squids start with ‘tqa2:’ and do not contain any ‘%20’.
{
"run_id": "UNH-p-l2r", # system that produced the ranking
"squid": "tqa:effects%20of%20water%20pollution", # the topic's stable query unique ID
"title": "effects of water pollution" # Plain text title of the article
"query_facets": [ # query facets (aka outline headings) to be covered
{
"heading": "Ocean Acidification", # plain text facet description
"heading_id": "tqa:effects%20of%20water%20pollution/Ocean%20Acidification" # facet ID
},
{
"heading": "Eutrophication", # plain text facet description
"heading_id": "tqa:effects%20of%20water%20pollution/Eutrophication" # facet ID
},
...
],
"paragraphs": [ # candidate paragraphs to consolidate
{
"para_id": "ece8bed05f22e7c84e63c40759289dd0fd09dae9", # unique paragraph id, maps to paragraphCorpus
"para_body": [ # content (for your convenience) as chunks of plain text and entity links.
{ # first plain text chunk
"text": "Nutrients are important to the growth and survival of living organisms, and hence, are essential for development and maintenance of healthy ecosystems. Humans have greatly influenced the phosphorus cycle by mining phosphorus, converting it to fertilizer, and by shipping fertilizer and products around the globe. Transporting phosphorus in food from farms to cities has made a major change in the global Phosphorus cycle. However, excessive amounts of nutrients, particularly phosphorus and nitrogen, are detrimental to aquatic ecosystems. Waters are enriched in phosphorus from farms' run-off, and from effluent that is inadequately treated before it is discharged to waters. Natural "
},
{ # Entity link / Wikipedia hyperlink
"entity": "enwiki:Eutrophication", # TREC CAR entity id (please don't fudge the string)
"entity_name": "Eutrophication", # Plain text of entity name (aka Wiki title)
"link_section": null, # If the link points to a section on the entity's article
"text": "eutrophication" # surfaceform / Anchor text of link
},
{ # next plain text chunk...
"text": " is a process by which lakes gradually age and become more productive and may take thousands of years to progress. Cultural or anthropogenic eutrophication, however, is water pollution caused by excessive plant nutrients; this results in excessive growth in the algal population; when this algae dies its putrefaction depletes the water of oxygen.Such eutrophication may also give rise to toxic algal bloom.Both these effects cause animal and plant death rates to increase as the plants take in poisonous water while the animals drink the poisoned water. Surface and subsurface runoff and erosion from high-phosphorus soils may be major contributing factors to this fresh water eutrophication. The processes controlling soil Phosphorus release to surface runoff and to subsurface flow are a complex interaction between the type of phosphorus input, soil type and management, and transport processes depending on hydrological conditions." }
],
},
...
],
"paragraph_origins": [ # As generated from a ranking, info on rank, score and query where each paragraph is found
{
"para_id": "5516be8a3b7a395bbbd7e532931fbfd8c69c2394", # only paragraph ids from the paragraphCorpus are valid
"rank": 1, # rank information is optional, but if given it must be consistent with the rank_score. The highest rank is 1!
"rank_score": 0.9038483096258834, # mandatory rank_score information. Must be a float. Cannot have rank ties.
"section_path": "tqa:effects%20of%20water%20pollution/Eutrophication" # section path (squid/headingid).
},
{
"para_id": "ece8bed05f22e7c84e63c40759289dd0fd09dae9", # This is the one from example above
"rank": 2,
"rank_score": 0.8846124107754259,
"section_path": "tqa:effects%20of%20water%20pollution/Eutrophication"
},
...
]
}
Also see Y3 format validation rules.
To support comparison to ranking systems (akin to previous years), we offer software to convert rankings in the TREC RUN file format (used in Y1 and Y2) into the Y3 JSON-based submission format. This software takes the top (k / number of facets) paragraphs from every section ranking and concatenates them across all query facets.
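For illustration only, a minimal sketch of such a conversion in Python is given below. It is not the official converter: it populates only the fields shown in the example above (the authoritative field requirements are in the Y3 format validation rules), and the function and file names are made up.

import json
from collections import defaultdict
from math import ceil

def convert_run(run_path, k, run_id):
    # Group Y1/Y2-style ranking entries ($sectionId Q0 $paragraphId $rank $score $tag)
    # by topic (squid) and by facet (section path).
    rankings = defaultdict(lambda: defaultdict(list))
    with open(run_path) as f:
        for line in f:
            section_path, _q0, para_id, rank, score, _tag = line.split()
            squid = section_path.split("/")[0]
            rankings[squid][section_path].append((int(rank), float(score), para_id))

    pages = []
    for squid, facets in rankings.items():
        per_facet = ceil(k / len(facets))
        selected, seen = [], set()
        # Take the top ceil(k / #facets) paragraphs from each facet's ranking.
        for section_path, entries in facets.items():
            for rank, score, para_id in sorted(entries)[:per_facet]:
                if para_id not in seen:          # keep each paragraph only once
                    seen.add(para_id)
                    selected.append((section_path, rank, score, para_id))
        selected = selected[:k]                  # keep at most k passages per topic
        pages.append({
            "run_id": run_id,
            "squid": squid,
            "paragraphs": [{"para_id": p} for (_, _, _, p) in selected],
            "paragraph_origins": [
                {"para_id": p, "rank": r, "rank_score": s, "section_path": sec}
                for (sec, r, s, p) in selected],
        })
    return pages

# Write one JSON object per line (JSON-lines) – no pretty-printing.
# In the real benchmark, k is given per topic; a fixed value is used here for brevity.
with open("y3-run.jsonl", "w") as out:
    for page in convert_run("y1-run.run", k=20, run_id="myteam-mymethod"):
        out.write(json.dumps(page) + "\n")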
v2.3-release (June 18, 2019): Official Y3 evaluation dataset! Released test queries based on selected outlines from the AI2’s Textbook Question Answering training dataset. Corpus and train data based on Wikipedia dump from December 22, 2016.
Bindings to load the data: trec-car-tools (the java version is available through maven).
Previous releases - do not use!
v2.1-release (July 25, 2018): Official Y2 evaluation dataset! Released test queries based on selected outlines from the Wikipedia dump of June 1, 2018 and outlines from the AI2’s Textbook Question Answering training dataset. Corpus and train data based on Wikipedia dump from December 22, 2016.
v2.0-release (January 1, 2018): Fixed Wikipedia parsing issues, leading to new paragraph ids. New release of all datasets including mapping of v1.x paragraph ids to v2.x paragraph ids. Release of allButBenchmark. All data based on Wikipedia dump from December 22, 2016.
v1.5-release (June 22, 2017): release of the paragraph collection, test topics, train topics, test200, large-scale training data and unprocessed training data. All based on the Wikipedia dump from December 22, 2016.
As formats may change slightly between releases, make sure you check out support tools from the corresponding branch in trec-car-tools.
v1.4-release (Jan 24, 2017): release of the paragraph collection, spritzer, and halfwiki
v1.2-release (Jan 17, 2017): release of the paragraph collection, spritzer, and halfwiki
v1.1-release (Jan 11, 2017): release of the paragraph collection, spritzer, and halfwiki – no longer supported.
2016-11-v1.0-release: Pre-release to straighten out issues in the data format – Format change, no longer supported.