Human Relevance Testing and the Dilemma of Rater Consistency
Posted: 16 Dec 2019
Date Written: December 16, 2019
Human relevance testing uses subject matter experts as proxies for actual users to measure relevance quality. We ask our raters to exercise their judgement in assessing the relevance of a particular document to a query, but what happens when the experts don't agree? Our philosophy is that some degree of disagreement is healthy and to be expected, but at what point does it compromise the results?
This talk examines the impact of rater inconsistency and tactics for managing it. We look at the use of rating guidelines, and at working directly with raters to resolve issues, striking a balance between providing helpful guidance and being overly prescriptive. We will also share lessons from experimenting with different metrics for measuring inter-rater reliability and mitigating inconsistencies.
Keywords: search relevance measurement, human relevance testing, IRR, inter-rater reliability, rater bias, joint probability of agreement, Fleiss’ kappa, free marginal multi-rater kappa
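Two of the reliability metrics named in the keywords, Fleiss' kappa and the free-marginal multi-rater kappa, differ only in how they estimate chance agreement. A minimal sketch (function names and the example ratings matrix are illustrative, not from the talk):

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a ratings matrix.

    counts[i][j] = number of raters who assigned item i to category j;
    every row must sum to the same number of raters n.
    """
    N = len(counts)          # number of rated items
    n = sum(counts[0])       # raters per item
    k = len(counts[0])       # number of categories
    # Observed agreement: mean pairwise agreement per item
    # (this is also the "joint probability of agreement").
    P_bar = sum((sum(c * c for c in row) - n) / (n * (n - 1))
                for row in counts) / N
    # Chance agreement from the observed category marginals.
    p_j = [sum(row[j] for row in counts) / (N * n) for j in range(k)]
    P_e = sum(p * p for p in p_j)
    return (P_bar - P_e) / (1 - P_e)


def free_marginal_kappa(counts):
    """Randolph's free-marginal multi-rater kappa: same observed
    agreement, but chance agreement is fixed at 1/k because raters
    are not constrained to any target category distribution."""
    N = len(counts)
    n = sum(counts[0])
    k = len(counts[0])
    P_bar = sum((sum(c * c for c in row) - n) / (n * (n - 1))
                for row in counts) / N
    return (P_bar - 1.0 / k) / (1 - 1.0 / k)


# Hypothetical example: 4 query-document pairs, 3 raters,
# 2 categories (relevant, not relevant).
ratings = [
    [3, 0],  # unanimous
    [2, 1],  # split
    [3, 0],  # unanimous
    [1, 2],  # split
]
print(fleiss_kappa(ratings))        # 1/9  ~ 0.111
print(free_marginal_kappa(ratings)) # 1/3  ~ 0.333
```

Note how skewed marginals inflate Fleiss' chance-agreement term, which is one reason the free-marginal variant can give a noticeably higher score on the same ratings.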