TREC CAR

Introduction

Current retrieval systems provide good solutions towards phrase-level retrieval for simple fact and entity-centric needs. This track encourages research for answering more complex information needs with longer answers. Much like Wikipedia pages synthesize knowledge that is globally distributed, we envision systems that collect relevant information from an entire corpus, creating synthetically structured documents by collating retrieved results.

More details in

Organization

Join our mailing list: https://groups.google.com/d/forum/trec-car

August 1, 2019: TREC CAR Y3 Submission deadline (link coming soon!)
July 8: Release of Y3 validation and population code; also see v2.3 data release.
For the Y3 Submission:
  1. Each outline in benchmarkY3test should be populated with exactly 20 passages from the paragraphCorpus.
  2. Each team can submit up to 10 runs; but for assessment purposes only one can be marked as highest priority and only two marked with medium priority.
  3. See more: “Y3 Submission Format” below and Y3 validation rules
June 20: New v2.3 data release
May 8: Task and guidelines for Y3 evaluation (v2.3)!

Example

To motivate a brief example, consider a user interested in learning about water pollution through fertilizers, ocean acidification, and aquatic debris and the effects it has. There is no short and simple answer to this information need. Instead we need to retrieve a complex answer that covers the topic with its different facets and elaborates pertinent connections between entities/concepts. A suitable answer would cover the following:

Through photosynthesis, algae provide food and nutrients for the marine ecosystem. However, through rain storms, fertilizers used in agriculture and lawn care are swept into the rivers and coastal sea. Fertilizers contain nitrogen and phosphorus, which stimulate algae growth so that the algae population grows large very quickly, a phenomenon called algal blooms. The problem is that these algae do not live long; when they die and decompose, oxygen is removed from the water. As a result, fish and shellfish die.

Furthermore, some algal blooms release toxins into the water, which are consumed by shellfish. Humans who consume these toxins through shellfish can suffer neurological damage.

A different source of water pollution is high levels of carbon dioxide in the atmosphere. Oceans absorb carbon dioxide, but this lowers the pH level of the water, meaning that the oceans become more acidic. As a result, corals and shellfish are killed and other marine organisms reproduce less. This leads to issues in the food chain, and thereby fewer fish and shellfish for humans to consume.

Finally, trash and other debris gets into the waterways through shipping accidents, landfill erosion, or direct dumping of trash in the ocean. This debris is dangerous for aquatic wildlife in two ways. Animals may mistake debris for food and swallow plastic bags, which kills them. Other aquatic animals get tangled in nets or are strangled by trash like plastic six-pack rings.

This example was taken from the TQA collection, Effects of Water Pollution, and many similar examples can be found on Wikipedia. Nevertheless, such articles are not available for every imaginable kind of information need, which is why we aim to generate such comprehensive summaries automatically from Web sources through passage retrieval, consolidation, and organization. Of course, one might envision other responses that would satisfy the information need equally well.

Evaluation Tasks and Data

We are now entering TREC CAR Y3. While Y1 and Y2 were dedicated to producing passage and entity rankings for the query and facets given in the outline, in Y3 we turn to arranging paragraphs into a topically coherent article.

Y1 + Y2: Passage and Entity Ranking Task

(This task was only offered in Y1 and Y2)

Given an outline \(Q\), retrieve for each of its sections \(H_i\), a ranking of relevant entity-passage tuples \((E,P)\). The passage \(P\) is taken from a provided passage corpus. The entity \(E\) refers to an entry in the provided knowledge base. We define a passage or entity as relevant if the passage content or entity should be mentioned in the knowledge article. Different degrees of “could be mentioned”, “should be mentioned”, or “must be mentioned” are represented through a graded annotation scale during manual assessment.

Two tasks are offered: Passage ranking and Entity ranking.

In the passage ranking task, entities are omitted.

In the entity ranking task, the entity must be given, and optionally, complemented with a passage that serves as provenance for why the entity is relevant. If the passage is omitted, it will be replaced with the first paragraph from the entity’s article.

Details on the submission format are given at the end of this page.

Y3: Passage Ordering Task

(New task for Y3)

Given an outline \(Q\), retrieve, select, and arrange a sequence of \(k\) passages \(P\) from the provided passage corpus, with ideally:

  1. Highest relevance of all passages.
  2. Balanced coverage of all query facets as defined through headings \(H_i\) in the outline.
  3. Maximal topical coherence with minimal topic switches, i.e., first all passages about one topic, then all passages about the next topic, without interleaving multiple topics.

The number of passages \(k\) is given with the topic.

However, in order to do well on this task, you also need some component that retrieves relevant passages (as in Y1). We will make some baseline code available to convert Y1/Y2 passage ranking output into a Y3 passage order. This code implements a baseline that selects the top k passages from each heading’s ranking.
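As an illustration of the baseline idea described above, the following Python sketch concatenates the top passages of each heading's ranking in outline order, so that all passages for one heading stay together. The function name, the input layout, the per-heading budget, and the removal of duplicate passages are assumptions made for this sketch; it is not the released baseline code.

```python
from typing import Dict, List, Tuple


def order_from_rankings(
    heading_ids: List[str],
    rankings: Dict[str, List[Tuple[str, float]]],
    per_heading: int,
) -> List[str]:
    """Concatenate the top passages of each heading, in outline order.

    heading_ids: heading ids in the order they appear in the outline.
    rankings:    heading id -> list of (paragraph id, score) pairs.
    per_heading: passages to take per heading (an assumption of this sketch).
    """
    ordered: List[str] = []
    seen = set()
    for heading_id in heading_ids:
        # Sort by descending score so the best passages come first.
        ranked = sorted(rankings.get(heading_id, []), key=lambda pair: -pair[1])
        taken = 0
        for para_id, _score in ranked:
            if taken == per_heading:
                break
            if para_id in seen:
                continue  # assumption: each passage is used at most once
            seen.add(para_id)
            ordered.append(para_id)
            taken += 1
    return ordered


# Made-up ids and scores, for illustration only.
example_rankings = {
    "h1": [("p1", 0.9), ("p2", 0.5), ("p3", 0.7)],
    "h2": [("p3", 0.8), ("p4", 0.6)],
}
example_order = order_from_rankings(["h1", "h2"], example_rankings, per_heading=2)
print(example_order)
```

Because passages are emitted heading by heading, the resulting order never interleaves topics, at the cost of ignoring any relevance differences between headings.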

Passage Corpus, Knowledge Base, and Training Data

Paragraph Corpus: A corpus of 20 million paragraphs is provided. These are harvested from paragraphs on Wikipedia pages from a snapshot of 2016 (with hyperlinks preserved). Given the high amount of duplication on Wikipedia pages, the collection was de-duplicated before the data release. These paragraphs are to be used for the passage ranking task.

AllButBenchmark/knowledge base: We provide (nearly all) Wikipedia pages (2016) as standardized information on entities – only Wikipedia pages that are used as test outlines in the benchmarkY… subsets are omitted. We provide full meta information for each page including disambiguation pages and redirects, as well as the full page contents (parsed as outline with paragraphs). To avoid train/test signal leakage, the dump is also offered as five folds that are consistent with the large train corpus.

Queries/Outlines: Queries are provided as outlines that are structured into a title and a hierarchy of headings. Please obtain the clean text of these headings from the outlines file – do not attempt to parse the text from the query id!

Test Data and Evaluation

We enact both an automatic (binary) and a manual (graded) evaluation procedure.

automatic benchmark: We generate a large-scale automatic benchmark for training and automatic evaluation by splitting Wikipedia pages into paragraphs and outlines and asking systems to put the paragraphs back in their place of origin.

For each section \(H_i\) in training stubs \(Q\) we indicate which passages originated from the section. To derive a ground truth for entity ranking, we indicate which entities are linked from the section.

Passage retrieval tasks are often criticized for leading to test collections that are not reusable. We address this problem through unique paragraph ids; in past years this approach has worked well and yielded reusable benchmarks.

train: For a large-scale automatic benchmark for training data-hungry methods like neural networks, we selected half of all Wikipedia articles that are eligible for TREC CAR and provide them as five folds. As TREC CAR will not choose test outlines from pages that describe people, organizations, or events, these are also removed from the training data. (A full list of details on the data selection process is available here)

benchmarkY…: To create high-quality topics, we manually selected outlines from pages about popular science and the environment, which were released across different years, namely benchmarkY1train (from Wiki 16), benchmarkY1test (from Wiki 16), benchmarkY2test (from Wiki 18 and TQA). For these subsets, we provide automatic ground truth (similar to “train”) as well as manual ground truth if it was collected.

Benchmark Train / Test topics Y1

The official test topics for the benchmarkY1 are released as benchmarkY1test.

Simultaneously, a subset of train topics was released as benchmarkY1train.

The selection process is as follows:

Benchmark Test topics Y2

The official test topics for the benchmarkY2 are released as benchmarkY2test.

The selection process is as follows:

See data releases for more information.

Qrels

We provide training/evaluation annotations in the trec_eval compatible Qrels format.

We provide qrels for passage and entity rankings as well as different variations on the section path:

Format for passages:

    $sectionId 0 $paragraphId $isRelevant

Format for entities (where the entity is identified by its Wikipedia article, using percent encoding):

    $sectionId 0 $pageId $isRelevant

For automatic training/test data, $isRelevant is either 1 or 0 (only entries with 1 are reported).
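To make the four-column layout concrete, here is a minimal Python sketch that reads qrels lines into a nested dictionary. The function name `read_qrels` and the example lines are made up for illustration; the sketch relies only on the column order shown above and on the ids containing no whitespace (they are percent-encoded).

```python
from collections import defaultdict
from typing import Dict, Iterable


def read_qrels(lines: Iterable[str]) -> Dict[str, Dict[str, int]]:
    """Map each section id to {paragraph id or page id: relevance}."""
    qrels = defaultdict(dict)  # type: Dict[str, Dict[str, int]]
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Four whitespace-separated columns; the second column is unused.
        section_id, _unused, doc_id, relevance = line.split()
        qrels[section_id][doc_id] = int(relevance)
    return dict(qrels)


# Made-up example lines in the passage format.
example_lines = [
    "tqa:effects%20of%20water%20pollution/Eutrophication 0 ece8bed05f22e7c84e63c40759289dd0fd09dae9 1",
    "tqa:effects%20of%20water%20pollution/Eutrophication 0 5516be8a3b7a395bbbd7e532931fbfd8c69c2394 1",
]
example_qrels = read_qrels(example_lines)
```

The same function works for entity qrels, since only the meaning of the third column changes.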

For manual test data $isRelevant is a graded relevance-scale according to

Manual assessments are created by six NIST assessors on assessment pools built from participant submissions.

Data Format Description

See the data releases page for data set download.

We provide passage corpus, outlines, and example articles in CBOR-encoded archives to preserve hierarchical structure and entity links. Support tools are provided for Python 3.5 and Java 1.7.

Articles we provide are encoded with the following grammar. Terminal nodes are indicated by $.

      Page         -> $pageName $pageId [PageSkeleton] PageType PageMetadata
      PageType     -> ArticlePage | CategoryPage | RedirectPage ParaLink | DisambiguationPage
      PageMetadata -> RedirectNames DisambiguationNames DisambiguationIds CategoryNames CategoryIds InlinkIds InlinkAnchors
      RedirectNames       -> [$pageName] 
      DisambiguationNames -> [$pageName] 
      DisambiguationIds   -> [$pageId] 
      CategoryNames       -> [$pageName] 
      CategoryIds         -> [$pageId] 
      InlinkIds           -> [$pageId] 
      InlinkAnchors       -> [$anchorText] 
      
      PageSkeleton -> Section | Para | Image | ListItem
      Section      -> $sectionHeading [PageSkeleton]
      Para         -> Paragraph
      Paragraph    -> $paragraphId, [ParaBody]
      ListItem     -> $nestingLevel, Paragraph
      Image        -> $imageURL [PageSkeleton]
      ParaBody     -> ParaText | ParaLink
      ParaText     -> $text
      ParaLink     -> $targetPage $targetPageId $linkSection $anchorText

Support tools for reading this format are provided in the trec-car-tools repository (the java version is available through maven).

Passage Format

All passages that are cut out of Wikipedia articles are assigned a unique (SHA256 hashed) paragraphId that will not reveal the originating article. Pairs of paragraphId,paragraphContent are provided as passage corpus. Hyperlinks to other Wikipedia pages (aka Entity links) are preserved in the paragraph.

Outline Format

Each extracted page outline provides the pageName (e.g., Wikipedia title), the pageId, and a nested list of children. Each section is represented by a heading, a headingId, and a list of its children in order. While for pages children can be sections or paragraphs, outlines do not contain paragraphs.

Each section in the corpus is uniquely identified by a section path of the form

    pageName/section1/section1.1/section1.1.1

To avoid UTF8-encoding and whitespace issues with existing evaluation tools like trec_eval, the pageId and headingId are projected to ASCII characters using a variant of URL percent encoding.

Unfortunately there have been many problems whenever participants have tried to “parse” the text of the heading or query out of the heading id (or query id). Please do not do this, as the ID is a hashkey and does not perfectly preserve the text. Please obtain the text from the outlines file.
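The following Python sketch shows one way to enumerate section paths by joining ids taken from an outline, rather than reconstructing them from text. The nested-dictionary layout and its key names (`pageId`, `headingId`, `children`) are hypothetical stand-ins for whatever the support tools return; this is not the trec-car-tools API.

```python
from typing import Dict, Iterator, List


def section_paths(parent_path: str, sections: List[Dict]) -> Iterator[str]:
    """Yield the section path of every section below parent_path, depth first."""
    for section in sections:
        # Join ids as given in the outline; never rebuild them from heading text.
        path = parent_path + "/" + section["headingId"]
        yield path
        for nested in section_paths(path, section.get("children", [])):
            yield nested


# Hypothetical outline with placeholder ids.
example_outline = {
    "pageId": "pageName",
    "children": [
        {"heading": "Section 1", "headingId": "section1", "children": [
            {"heading": "Section 1.1", "headingId": "section1.1", "children": []},
        ]},
        {"heading": "Section 2", "headingId": "section2", "children": []},
    ],
}
example_paths = list(section_paths(example_outline["pageId"], example_outline["children"]))
```

Keeping the plain-text heading next to its id in the same structure means the text never has to be decoded from the id.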

Y1 + Y2 Submission Format

Retrieval results are uploaded through the official NIST submission page, using a format compatible with trec_eval .run files. Each ranking entry is represented by a single line in the format below. Please submit a separate run file for every approach, but include rankings for all queries.

Notice that the format differs slightly between tasks. A validation script for your runs will be provided soon.

Per task and per team, up to three submissions will be accepted.

Passage Retrieval

For every section, a ranking of paragraphs is to be retrieved (as identified by the $paragraphId).

    $sectionId Q0 $paragraphId $rank $score $teamname-$methodname
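As a small illustration of this line format, the Python sketch below turns one section's scored paragraphs into run-file lines. The function name, team name, and method name are hypothetical, and deriving the rank from the score order is an assumption of this sketch.

```python
from typing import List, Tuple


def passage_run_lines(
    section_id: str,
    ranking: List[Tuple[str, float]],
    team: str,
    method: str,
) -> List[str]:
    """Format one section's ranking as run-file lines, best score first."""
    run_tag = "{}-{}".format(team, method)
    ranked = sorted(ranking, key=lambda pair: -pair[1])
    return [
        "{} Q0 {} {} {} {}".format(section_id, para_id, rank, score, run_tag)
        for rank, (para_id, score) in enumerate(ranked, start=1)
    ]


# Hypothetical section id, paragraph ids, and run tag.
example_run = passage_run_lines(
    "sectionA", [("para2", 0.5), ("para1", 0.9)], team="myteam", method="bm25"
)
print("\n".join(example_run))
```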

Entity Retrieval

For every section, a ranking of entities is to be retrieved (as identified by the $entityId which coincides with $pageIds – the percent-encoded Wikipedia title)

Optionally, a paragraph from the corpus can be provided as provenance for why the entity is relevant for the section. If the paragraph is omitted, the first paragraph from the entity’s page will be used as default provenance.

The format is as follows (first: with paragraph, second: default)

    $sectionId Q0 $paragraphId/$entityId $rank $score $teamname-$methodname
    $sectionId Q0 $entityId $rank $score $teamname-$methodname
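The two variants differ only in the third column, which the following Python sketch makes explicit. The function name and the run tag are hypothetical; the entity id is borrowed from the Y3 example further down purely for illustration.

```python
from typing import Optional


def entity_run_line(
    section_id: str,
    entity_id: str,
    rank: int,
    score: float,
    run_tag: str,
    paragraph_id: Optional[str] = None,
) -> str:
    """Format one entity ranking entry, with or without a provenance paragraph."""
    if paragraph_id is None:
        # Default provenance: the first paragraph of the entity's page is used.
        doc_field = entity_id
    else:
        doc_field = "{}/{}".format(paragraph_id, entity_id)
    return "{} Q0 {} {} {} {}".format(section_id, doc_field, rank, score, run_tag)


with_provenance = entity_run_line(
    "sectionA", "enwiki:Eutrophication", 1, 0.9, "myteam-ent", paragraph_id="para1"
)
default_provenance = entity_run_line("sectionA", "enwiki:Eutrophication", 1, 0.9, "myteam-ent")
```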

Y3 Submission Format

We decided to depart from the TREC RUN format in favor of a JSON-based data format. Each submitted run constitutes one JSON-lines file in the format described below. We opted to include redundancies to help validate for errors (see population and validation script on data release page). A JSON-lines file contains each populated page (i.e., passage ordering per page) as one JSON object on a separate single line, delimited by a newline (‘\n’). Consequently, you cannot use a JSON pretty-printer that inserts newlines in the middle of a JSON object. Your submission file must have exactly 131 lines, because the benchmarkY3test benchmark asks for 131 pages.
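A minimal Python sketch of writing and re-reading such a file with the standard `json` module follows. The helper names and the toy pages are hypothetical, and the official validation script remains the authority on what counts as a valid submission.

```python
import json
from typing import Dict, Iterable

EXPECTED_PAGES = 131  # number of pages in benchmarkY3test, as stated above


def to_json_lines(pages: Iterable[Dict]) -> str:
    """Serialize each populated page as exactly one line of JSON."""
    # json.dumps without an indent argument emits no raw newlines;
    # newlines inside string values are escaped.
    return "".join(json.dumps(page) + "\n" for page in pages)


def count_pages(text: str) -> int:
    """Parse every non-empty line as its own JSON object and count them."""
    count = 0
    for line in text.split("\n"):
        if line.strip():
            json.loads(line)  # raises ValueError if an object spans several lines
            count += 1
    return count


# Toy pages with made-up content; a real file would hold EXPECTED_PAGES pages.
toy_pages = [
    {"squid": "tqa2:a", "paragraphs": []},
    {"squid": "tqa2:b", "title": "line\nbreak"},
]
toy_text = to_json_lines(toy_pages)
toy_count = count_pages(toy_text)
```

For a real submission, comparing `count_pages(text)` against `EXPECTED_PAGES` before uploading catches both pretty-printed files and missing pages.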

Only the following fields are mandatory to participate in Y3:

While we offer only one task, we also evaluate the ranking quality if the paragraph_origins field is populated. We expect that the ranking was used as input to predict the passage ordering and that, therefore, all entries in paragraphs have an entry in paragraph_origins. The following fields are mandatory for the ranking evaluation:

The format of the JSON representation of one page is explained in detail below. Note that this example is based on an outline from benchmarkY2test; in contrast, benchmarkY3 squids start with ‘tqa2:’ and do not contain any ‘%20’.

{
    "run_id": "UNH-p-l2r",                            # system that produced the ranking
    "squid": "tqa:effects%20of%20water%20pollution",  # the topic's stable query unique ID
    "title": "effects of water pollution",            # Plain text title of the article
    "query_facets": [                                 # query facets (aka outline headings) to be covered
	{
	    "heading": "Ocean Acidification",         # plain text facet description
	    "heading_id": "tqa:effects%20of%20water%20pollution/Ocean%20Acidification"  # facet ID
	},
	{
	    "heading": "Eutrophication",              # plain text facet description
	    "heading_id": "tqa:effects%20of%20water%20pollution/Eutrophication"  # facet ID
	},
        ...
    ],
    "paragraphs": [                                  # candidate paragraphs to consolidate
        {
            "para_id": "ece8bed05f22e7c84e63c40759289dd0fd09dae9",   # unique paragraph id, maps to paragraphCorpus
            "para_body": [                           # content (for your convenience) as chunks of plain text and entity links.
               {                                     # first plain text chunk
                    "text": "Nutrients are important to the growth and survival of living organisms, and hence, are essential for development and maintenance of healthy ecosystems. Humans have greatly influenced the phosphorus cycle by mining phosphorus, converting it to fertilizer, and by shipping fertilizer and products around the globe. Transporting phosphorus in food from farms to cities has made a major change in the global Phosphorus cycle. However, excessive amounts of nutrients, particularly phosphorus and nitrogen, are detrimental to aquatic ecosystems. Waters are enriched in phosphorus from farms' run-off, and from effluent that is inadequately treated before it is discharged to waters. Natural "
                },
                {                                    # Entity link / Wikipedia hyperlink
                    "entity": "enwiki:Eutrophication",   # TREC CAR entity id (please don't fudge the string)
                    "entity_name": "Eutrophication",     # Plain text of entity name (aka Wiki title)
                    "link_section": null,                # If the link points to a section on the entity's article
                    "text": "eutrophication"             # surfaceform / Anchor text of link
                },
                {                                    # next plain text chunk...
                    "text": " is a process by which lakes gradually age and become more productive and may take thousands of years to progress. Cultural or anthropogenic eutrophication, however, is water pollution caused by excessive plant nutrients; this results in excessive growth in the algal population; when this algae dies its putrefaction depletes the water of oxygen.Such eutrophication may also give rise to toxic algal bloom.Both these effects cause animal and plant death rates to increase as the plants take in poisonous water while the animals drink the poisoned water. Surface and subsurface runoff and erosion from high-phosphorus soils may be major contributing factors to this fresh water eutrophication. The processes controlling soil Phosphorus release to surface runoff and to subsurface flow are a complex interaction between the type of phosphorus input, soil type and management, and transport processes depending on hydrological conditions."                }
            ]
        },
        ...
    ],
    "paragraph_origins": [                           # As generated from a ranking, info on rank, score and query where each paragraph is found
        {
            "para_id": "5516be8a3b7a395bbbd7e532931fbfd8c69c2394",  # only paragraph ids from the paragraphCorpus are valid
            "rank": 1,                               # rank information is optional, but if given it must be consistent with the rank_score. The highest rank is 1!
            "rank_score": 0.9038483096258834,        # mandatory rank_score information. Must be a float. Cannot have rank ties.
            "section_path": "tqa:effects%20of%20water%20pollution/Eutrophication"    # section path (squid/headingid).
        },
        {
            "para_id": "ece8bed05f22e7c84e63c40759289dd0fd09dae9",      # This is the one from example above
            "rank": 2,
            "rank_score": 0.8846124107754259,
            "section_path": "tqa:effects%20of%20water%20pollution/Eutrophication"
        },
        ...
   ]
  }
  

Also see Y3 format validation rules.
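The constraints mentioned in the comments of the example can be checked with a few lines of Python. The sketch below is only an approximation under stated assumptions: the function name is hypothetical, ties and rank consistency are checked per section_path (an assumption), and the linked validation rules take precedence over anything shown here.

```python
from collections import defaultdict
from typing import Dict, List


def check_page(page: Dict) -> List[str]:
    """Return a list of human-readable problems found in one page object."""
    problems = []
    origins = page.get("paragraph_origins", [])
    origin_ids = {origin["para_id"] for origin in origins}
    # Every entry in paragraphs is expected to have an entry in paragraph_origins.
    for paragraph in page.get("paragraphs", []):
        if paragraph["para_id"] not in origin_ids:
            problems.append("no origin for paragraph " + paragraph["para_id"])

    by_section = defaultdict(list)
    for origin in origins:
        if not isinstance(origin.get("rank_score"), float):
            problems.append("rank_score must be a float: " + origin["para_id"])
            continue
        by_section[origin["section_path"]].append(origin)

    for section_path, entries in by_section.items():
        scores = [entry["rank_score"] for entry in entries]
        if len(set(scores)) != len(scores):
            problems.append("tied rank_score in " + section_path)
        # Ranks are optional; when given, rank 1 must carry the highest score.
        ranked = sorted((e for e in entries if "rank" in e), key=lambda e: e["rank"])
        ranked_scores = [e["rank_score"] for e in ranked]
        if ranked_scores != sorted(ranked_scores, reverse=True):
            problems.append("rank disagrees with rank_score in " + section_path)
    return problems


# Made-up pages for illustration.
good_page = {
    "paragraphs": [{"para_id": "a"}],
    "paragraph_origins": [
        {"para_id": "a", "rank": 1, "rank_score": 0.9, "section_path": "s"},
        {"para_id": "b", "rank": 2, "rank_score": 0.5, "section_path": "s"},
    ],
}
bad_page = {
    "paragraphs": [{"para_id": "a"}, {"para_id": "c"}],
    "paragraph_origins": [
        {"para_id": "a", "rank": 1, "rank_score": 0.5, "section_path": "s"},
        {"para_id": "b", "rank": 2, "rank_score": 0.9, "section_path": "s"},
    ],
}
print(check_page(good_page))
print(check_page(bad_page))
```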

To support comparison to ranking systems (akin to previous years), we offer software to convert rankings in TREC RUN file format (used in Y1 and Y2) into the Y3 JSON-based submission format. This software will take the top ${} paragraphs from every section-ranking to be concatenated across all query facets.

Releases

Previous releases - do not use!

As formats may change slightly between releases, make sure you check out support tools from the corresponding branch in trec-car-tools.