# Old Data Releases (v1.2)

• corpus-v1.2.zip: Paragraph collection covering both train and test articles.

• release-v1.2.zip: Articles from 50% of Wikipedia that match the selection criterion. This half is dedicated to training. Provided as articles, outlines, and qrels, already split into five folds for cross-validation.

• spritzer-v1.2.zip: Six example articles, provided as articles, outlines, paragraphs, and qrels.

• Coming soon: training-release-v1.2.zip: Hand-selected articles from the subset of “halfwiki” that characterize the nature of the test queries.
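The five cross-validation folds in release-v1.2.zip are typically used by holding out one fold at a time and training on the other four. A minimal sketch of that rotation follows; the fold labels (`fold0` to `fold4`) are placeholders assumed for illustration, not the actual file names in the archive.

```python
from typing import Iterator, List, Tuple


def cross_validation_splits(folds: List[str]) -> Iterator[Tuple[List[str], str]]:
    """Yield (training folds, held-out fold) pairs, one pair per fold."""
    for i, held_out in enumerate(folds):
        # Every fold except the i-th is used for training.
        train = folds[:i] + folds[i + 1:]
        yield train, held_out


# Hypothetical fold labels; substitute the real fold files from the archive.
FOLDS = [f"fold{i}" for i in range(5)]

splits = list(cross_validation_splits(FOLDS))
for train, held_out in splits:
    print(f"train on {train}, validate on {held_out}")
```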

## Support Tools

Support tools for reading this data can be found here: https://github.com/TREMA-UNH/trec-car-tools.

Support tools for previous releases:

• v1.2 in v1.2 branch.
• v1.1 in v1.1 branch.

The v1.0 format is incompatible and no longer supported.

## Contents

The following kinds of data derivatives are provided.

Data:

• *.cbor: full articles as derived from Wikipedia (after cleaning)
• *.cbor.outlines: outlines of articles (=queries), preserving the hierarchy with sectionIDs.
• *.cbor.paragraphs: paragraphs from these articles (=documents), with paragraph ids. These are intended for training only. Evaluation will be based on the collection of paragraphs from the whole Wikipedia (see release-v1.1.cbor.paragraphs).
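Since the three data kinds are distinguished only by file suffix, a small helper can sort the files of an unpacked archive. This is a sketch: the function name `data_kind` and the example file names are assumptions made for illustration, and it does not read the CBOR content itself (use the support tools for that).

```python
from typing import Optional

# Suffixes as listed above, mapped to the kind of data they contain.
SUFFIX_TO_KIND = {
    ".cbor.outlines": "outlines",
    ".cbor.paragraphs": "paragraphs",
    ".cbor": "articles",
}


def data_kind(filename: str) -> Optional[str]:
    """Return the data kind for a file name, or None if it is not a data file."""
    for suffix, kind in SUFFIX_TO_KIND.items():
        if filename.endswith(suffix):
            return kind
    return None


# Hypothetical file names, used only to exercise the helper.
for name in ["example.cbor", "example.cbor.outlines", "example.cbor.paragraphs"]:
    print(name, "->", data_kind(name))
```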

Qrels: trec_eval-compatible qrels files, automatically derived from the articles (to be complemented by human judgments).

• *.cbor.hierarchical.qrels: every paragraph is relevant only for its leaf (most specific) section (example: PageName/Heading1/Heading1.1 - but not PageName/Heading1!)
• *.cbor.toplevel.qrels: every paragraph is relevant only for its top-level section (the above example is relevant for PageName/Heading1; any second-level heading is ignored.)
• *.cbor.articles.qrels: every paragraph is relevant for its article (the above example is relevant for PageName) - any section info is ignored.
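The three granularities differ only in how much of a paragraph's section path is kept. The sketch below makes that explicit using the path notation from the examples above. The function name `qrels_targets` is hypothetical, the plain `/`-joined strings are illustrative rather than the exact query ids stored in the files, and the handling of paragraphs that sit directly under the page (no heading) is an assumption.

```python
from typing import Dict, List


def qrels_targets(page: str, headings: List[str]) -> Dict[str, str]:
    """Map a paragraph's location (page plus heading path) to the section
    it is relevant for under each qrels granularity."""
    return {
        # Full path down to the most specific (leaf) section.
        "hierarchical": "/".join([page] + headings),
        # Page plus the first heading only; deeper headings are ignored.
        "toplevel": "/".join([page] + headings[:1]),
        # Page only; all section information is ignored.
        "articles": page,
    }


targets = qrels_targets("PageName", ["Heading1", "Heading1.1"])
for granularity, section in targets.items():
    print(f"{granularity}: {section}")
```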

• *.cbor.hierarchical.entity.qrels: every entity is relevant only for its leaf (most specific) section (example: PageName/Heading1/Heading1.1 - but not PageName/Heading1!)
• *.cbor.toplevel.entity.qrels: every entity is relevant only for its top-level section (the above example is relevant for PageName/Heading1; any second-level heading is ignored.)
• *.cbor.articles.entity.qrels: every entity is relevant for its article (the above example is relevant for PageName) - any section info is ignored.
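Because the qrels files are trec_eval-compatible, each line holds four whitespace-separated columns: query id, an iteration field (ignored), document id, and relevance. A minimal reader under that assumption is sketched below; the sample lines use made-up ids in the notation of the examples above, not ids taken from the release. For the entity qrels, the third column would hold an entity id instead of a paragraph id.

```python
from collections import defaultdict
from typing import Dict, Iterable


def parse_qrels(lines: Iterable[str]) -> Dict[str, Dict[str, int]]:
    """Parse trec_eval-style qrels lines into {query_id: {doc_id: relevance}}."""
    qrels: Dict[str, Dict[str, int]] = defaultdict(dict)
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 4:
            raise ValueError(f"expected 4 columns, got {len(fields)}: {line!r}")
        query_id, _iteration, doc_id, relevance = fields
        qrels[query_id][doc_id] = int(relevance)
    return dict(qrels)


# Made-up sample lines for illustration only.
SAMPLE = [
    "PageName/Heading1/Heading1.1 0 paragraph-a 1",
    "PageName/Heading1/Heading1.1 0 paragraph-b 1",
    "PageName/Heading2 0 paragraph-c 1",
]

parsed = parse_qrels(SAMPLE)
for query_id, judgments in parsed.items():
    print(query_id, sorted(judgments))
```

To read a real file, pass an open file handle, for example `parse_qrels(open(path, encoding="utf-8"))`, where `path` points to one of the `*.qrels` files.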

## Test topics

For evaluation, only a *.cbor.outlines file will be distributed. By then, a decision will have been made as to which of the three qrels files will be used.