Human Relevance Testing and the Dilemma of Rater Consistency

Posted: 16 Dec 2019

Date Written: December 16, 2019

Abstract

Human relevance testing uses subject matter experts as proxies for actual users to measure relevance quality. We ask our raters to exercise their judgment in assessing the relevance of a particular document to a query, but what happens when the experts don't agree? Our philosophy is that some degree of disagreement is healthy and to be expected, but at what point does it compromise the results?

This talk examines the impact of rater inconsistency and tactics for managing it. We look at the use of rating guidelines and at working directly with raters to resolve issues, striking a balance between providing helpful guidance and being overly prescriptive. We will also share lessons learned from experimenting with different metrics to measure inter-rater reliability and mitigate inconsistencies.

Keywords: search relevance measurement, human relevance testing, IRR, inter-rater reliability, rater bias, joint probability of agreement, Fleiss’ kappa, free marginal multi-rater kappa
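
As a rough, illustrative sketch only (not the authors' implementation), the Python snippet below computes the three agreement measures named in the keywords on toy relevance ratings. The data layout is an assumption for illustration: one row of per-grade rating counts per query-document pair, with the same number of raters for every pair.

    def observed_agreement(counts):
        # Joint probability of agreement: for each item, the fraction of
        # rater pairs that agree, averaged over all items.
        n = sum(counts[0])  # raters per item (assumed constant)
        per_item = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts]
        return sum(per_item) / len(per_item)

    def fleiss_kappa(counts):
        # Fleiss' kappa: chance agreement estimated from the observed
        # distribution of ratings across grades.
        N, k = len(counts), len(counts[0])
        n = sum(counts[0])
        p_bar = observed_agreement(counts)
        p_j = [sum(row[j] for row in counts) / (N * n) for j in range(k)]
        p_e = sum(p * p for p in p_j)
        return (p_bar - p_e) / (1 - p_e)

    def free_marginal_kappa(counts):
        # Free-marginal multi-rater kappa: chance agreement fixed at 1/k,
        # i.e. raters are assumed free to use any grade with equal prior probability.
        k = len(counts[0])
        p_bar = observed_agreement(counts)
        return (p_bar - 1 / k) / (1 - 1 / k)

    # Toy data: 4 query-document pairs, 3 raters, 3 relevance grades.
    # Each row gives how many raters assigned the pair to each grade.
    ratings = [
        [3, 0, 0],
        [0, 2, 1],
        [1, 1, 1],
        [0, 0, 3],
    ]

    print("joint agreement:", round(observed_agreement(ratings), 3))
    print("Fleiss' kappa:  ", round(fleiss_kappa(ratings), 3))
    print("free-marginal:  ", round(free_marginal_kappa(ratings), 3))

The two kappas differ only in how chance agreement is estimated: Fleiss' kappa derives it from the observed grade distribution, while the free-marginal variant assumes each grade is equally likely a priori, which matters when the observed marginals are heavily skewed.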

Suggested Citation

Belwal, Deekshant and Diedrichsen, Tara, Human Relevance Testing and the Dilemma of Rater Consistency (December 16, 2019). Proceedings of the 3rd Annual RELX Search Summit, Available at SSRN: https://ssrn.com/abstract=3504567

Deekshant Belwal (Contact Author)

LexisNexis ( email )

P. O. Box 933
Dayton, OH 45401
United States

Tara Diedrichsen

LexisNexis ( email )

P. O. Box 933
Dayton, OH 45401
United States
