5. Open the rubric-rag-result notebooks in the browser
6. Copy Rubric Grade files into a subdirectory ./data/
Plots
The x-axis lists the participating systems, sorted by qrels-ndcg performance. Any system that performed above the 85th percentile (or 75th percentile) is marked with a green dot, to show how many "very good systems" would be missed if a different measure had been chosen.
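The marking rule can be sketched in a few lines of Python. This is only an illustration, not the code used in the notebooks: the system names and scores are made up, the helper names `percentile` and `top_systems` are hypothetical, and it assumes the percentile is computed with linear interpolation and that "above" means strictly greater than the threshold.

```python
def percentile(values, q):
    """Return the q-th percentile (0-100) using linear interpolation between ranks."""
    xs = sorted(values)
    if not xs:
        raise ValueError("need at least one value")
    pos = (len(xs) - 1) * q / 100.0   # fractional rank of the percentile
    lo = int(pos)                     # rank just below (or at) the percentile
    hi = min(lo + 1, len(xs) - 1)     # rank just above, clamped to the last index
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def top_systems(scores, q=85):
    """Names of systems scoring strictly above the q-th percentile of `scores`."""
    threshold = percentile(scores.values(), q)
    return {name for name, score in scores.items() if score > threshold}


# Made-up scores for five hypothetical systems under two measures.
qrels_ndcg = {"sysA": 0.62, "sysB": 0.55, "sysC": 0.48, "sysD": 0.40, "sysE": 0.31}
other_measure = {"sysA": 0.50, "sysB": 0.71, "sysC": 0.66, "sysD": 0.30, "sysE": 0.25}

# X-axis order: systems sorted by qrels-ndcg, best first.
order = sorted(qrels_ndcg, key=qrels_ndcg.get, reverse=True)

# "Very good" systems under qrels-ndcg that the other measure would not select.
missed = top_systems(qrels_ndcg, 75) - top_systems(other_measure, 75)
print(order, missed)
```

With real data, `qrels_ndcg` and `other_measure` would hold one score per participating system for the two measures being compared.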
The analysis shows general agreement among the different measures on which systems are best. As expected, the "coverage"-based metrics, which do not reward redundancy in the generated passages, prefer different systems than the "precision"-based metrics. Within each of these groups, the metrics correlate strongly with each other. We find that the LLM only rarely assigns a grade of 5, so a grade threshold of 5 is not sensitive enough for comparing systems robustly. On the other hand, cover-1 is often too lenient.
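One way to check how strongly two measures agree on the system ranking is a rank correlation. The text above does not say which statistic was used, so the sketch below assumes Kendall's tau (the tau-a variant, with no tie correction) purely as an illustration. The measure names and scores are invented, and `kendall_tau` is a hypothetical helper.

```python
from itertools import combinations


def kendall_tau(a, b):
    """Kendall's tau-a between two {system: score} dicts over their shared systems."""
    systems = sorted(set(a) & set(b))
    if len(systems) < 2:
        raise ValueError("need at least two shared systems")
    concordant = discordant = 0
    for s, t in combinations(systems, 2):
        # Positive product: both measures order the pair the same way.
        agreement = (a[s] - a[t]) * (b[s] - b[t])
        if agreement > 0:
            concordant += 1
        elif agreement < 0:
            discordant += 1
    n_pairs = len(systems) * (len(systems) - 1) // 2
    return (concordant - discordant) / n_pairs


# Invented scores: two coverage-style measures and one precision-style measure.
coverage_a = {"sysA": 0.9, "sysB": 0.7, "sysC": 0.5, "sysD": 0.2}
coverage_b = {"sysA": 0.8, "sysB": 0.75, "sysC": 0.4, "sysD": 0.3}
precision_a = {"sysA": 0.3, "sysB": 0.6, "sysC": 0.5, "sysD": 0.1}

within_group = kendall_tau(coverage_a, coverage_b)
across_groups = kendall_tau(coverage_a, precision_a)
print(within_group, across_groups)
```

In this toy data the two coverage-style measures rank the systems identically, while the precision-style measure disagrees on two of the six system pairs, so the across-group correlation is lower.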
RUBRIC@ TREC RAG 24 by Laura Dietz is licensed under Creative Commons Attribution-ShareAlike 4.0 International