RUBRIC@ TREC RAG 24

Rubric evaluation with FLAN-T5-large (we also experimented with Llama3 but found its results too unreliable). Generated with
rubric-autograder-workbench (https://github.com/laura-dietz/rubric-internal) and rubric-trec-rag (https://github.com/TREMA-UNH/rubric-trec-rag).

Methods anonymized by Ian Soboroff. Rubric analysis conducted by Laura Dietz.

Data Archives

Each archive contains (an extraction sketch follows this list):
1. qrel file
2. run files
3. exported run-measure-topic-value file (one value per run, measure, and topic)
4. leaderboard (TSV) with mean and stdev under different measures
5. plots
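
For instance, the minimal sketch below (assuming Python with pandas installed) inspects one of the archives listed further below: it prints the tar members, extracts them, and loads the leaderboard TSV. The member name of the leaderboard inside the archive is an assumption; list the contents first and adjust the path.

```
import tarfile
import pandas as pd

# One of the results archives from the list below.
archive = "questions-rate--rubric-rag24-retrieval.jsonl.results.tar.gz"

with tarfile.open(archive, "r:gz") as tar:
    for member in tar.getmembers():
        print(member.name)        # qrel file, run files, leaderboard, plots, ...
    tar.extractall("extracted/")

# Hypothetical member name -- replace with the leaderboard TSV printed above.
leaderboard = pd.read_csv("extracted/leaderboard.tsv", sep="\t")
print(leaderboard.head())
```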

Measures provided (a thresholding sketch follows the list):
cover-1: RUBRIC-Cover with minimum grade level 1 (on a 0--5 scale)
cover-4: same, but with minimum grade level 4
cover-5: same, but with minimum grade level 5
qrels-ndcg: RUBRIC-qrels with the NDCG@20 metric (NDCG uses the multiple relevance levels)
qrels-1: RUBRIC-qrels with P@20 and minimum grade level 1
qrels-4: same, but with minimum grade level 4
qrels-5: same, but with minimum grade level 5
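
As an illustration of the minimum grade levels, the toy sketch below thresholds 0--5 grades into binary relevance and computes a coverage-style and a P@20-style score. This is not the exam-pp / rubric-trec-rag implementation, and the grade values are made up.

```
# Toy illustration of grade-level thresholding (not the official implementation).
# grades: best grade (0-5) achieved for each rubric question by one system.
grades = {"q1": 5, "q2": 4, "q3": 1, "q4": 0}

def cover_at(grades, min_grade):
    """Fraction of rubric questions covered at or above min_grade."""
    return sum(1 for g in grades.values() if g >= min_grade) / len(grades)

# Per-passage grades of one system's ranking (toy data).
ranking = [5, 4, 4, 0, 1, 3, 0, 5, 2, 0]

def precision_at_k(passage_grades, min_grade, k=20):
    """P@k after mapping grades >= min_grade to relevant."""
    return sum(1 for g in passage_grades[:k] if g >= min_grade) / k

print(cover_at(grades, 4))          # analogous to cover-4
print(precision_at_k(ranking, 1))   # analogous to qrels-1 (P@20)
```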

Results archives:
questions-rate--rubric-rag24-auggen.jsonl.results.tar.gz
questions-rate--rubric-rag24-concat-auggen.jsonl.results.tar.gz
questions-rate--rubric-rag24-gen.jsonl.results.tar.gz
questions-rate--rubric-rag24-concat-gen.jsonl.results.tar.gz
questions-rate--rubric-rag24-retrieval.jsonl.results.tar.gz

Rubric Grade Files:

questions-rate--rubric-rag24-auggen.jsonl.gz
questions-rate--rubric-rag24-concat-auggen.jsonl.gz
questions-rate--rubric-rag24-gen.jsonl.gz
questions-rate--rubric-rag24-concat-gen.jsonl.gz
questions-rate--rubric-rag24-retrieval.jsonl.gz
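
The grade files are gzip-compressed JSON-lines; they can be streamed record by record, for example with Python's standard library as sketched below. The exact record schema (query ids, passage ids, per-question grades) is defined by the exam-pp / rubric-trec-rag packages, so print the first record to see the actual fields.

```
import gzip
import json

# Stream a gzipped JSONL grade file; no record layout is assumed here --
# print the first record to discover the actual schema.
path = "questions-rate--rubric-rag24-retrieval.jsonl.gz"
with gzip.open(path, "rt", encoding="utf-8") as f:
    first = json.loads(next(f))
    print(first)
```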

Generated question-style rubric elements:
rag24-questions.jsonl.gz

Jupyter Notebooks for Result Analysis

(also available in the repository)
rubric-rag-results-auggen.ipynb
rubric-rag-results-concat-auggen.ipynb
rubric-rag-results-gen.ipynb
rubric-rag-results-concat-gen.ipynb
rubric-rag-results.ipynb (retrieval)

How to run and reproduce the Jupyter notebooks?

Using Google Colab or Jupyter with PyPI:

1. Open a notebook from the GitHub repository or this website

2. Uncomment and run the cell with

```
!pip install exam-pp
!pip install rubric-trec-rag
```

3. Run the remaining cells in the notebook

Alternative: Using nix

1. Check out https://github.com/TREMA-UNH/rubric-trec-rag

2. Install the nix package manager from nixos.org (install the nix package, not the OS!)

3. From the rubric-trec-rag directory, call nix develop to create the Python environment (more instructions in the repository's README file)

4. From that environment, start the notebook server with jupyter notebook

5. Open the rubric-rag-result notebooks in the browser

6. Copy the Rubric Grade files into a subdirectory ./data/

Plots

The x-axis shows the participating systems, sorted by qrels-ndcg performance. Any system that performed above the 85th percentile (or 75th percentile) is marked with a green dot, to show how many "very good" systems would be missed had a different measure been chosen. The analysis shows general agreement among the different measures on selecting the best systems. Of course, the coverage-based metrics, which do not reward redundancy in the generated passages, prefer different systems than the precision-based metrics; within each of these two groups, the measures strongly correlate with one another. We find that grade level 5 is only rarely assigned by the LLM and hence is not sensitive enough for comparing systems robustly. On the other hand, cover-1 is often too lenient.
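
As a sketch of this percentile-based marking (assuming a leaderboard TSV with a "run" column and one column per measure, e.g. "qrels-ndcg" and "cover-4" -- adjust to the actual column headers):

```
import pandas as pd

# Assumed columns: "run" plus one column per measure.
leaderboard = pd.read_csv("extracted/leaderboard.tsv", sep="\t")

# Systems above the 85th percentile of qrels-ndcg count as "very good".
threshold = leaderboard["qrels-ndcg"].quantile(0.85)
very_good = set(leaderboard.loc[leaderboard["qrels-ndcg"] >= threshold, "run"])

# How many of them would be missed if another measure's top systems were chosen instead?
other = "cover-4"
kept = set(leaderboard.loc[leaderboard[other] >= leaderboard[other].quantile(0.85), "run"])
print("missed very good systems:", very_good - kept)
```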

Auggen


Gen

Retrieval

RUBRIC@ TREC RAG 24 by Laura Dietz is licensed under Creative Commons Attribution-ShareAlike 4.0 International