Many IR collections contain forbidden documents (F-docs), i.e., documents that should not be retrieved for the searcher. In an ideal scenario, F-docs are clearly flagged, so the ranker can filter them out, guaranteeing that no F-doc is exposed. In real-world scenarios, however, filtering algorithms are prone to errors. An IR evaluation system should therefore measure filtering quality in addition to ranking quality. Typically, filtering is treated as a classification task and evaluated independently of ranking quality. However, because the two are closely related, it is desirable to evaluate ranking quality while filtering decisions are being made. In this work we propose nDCGf, a novel extension of the nDCGmin metric [14], which measures both the ranking and the filtering quality of search results. We show both theoretically and empirically that, while nDCGmin is not suitable for the simultaneous ranking and filtering task, nDCGf is a reliable metric in this case.
We experiment with three datasets for which both ranking and filtering are required. In the PR dataset, the task is to rank product reviews while filtering out those marked as spam. Similarly, in the CQA dataset, the task is to rank a list of human answers per question while filtering out bad answers. We also experiment with the TREC web-track datasets, where F-docs are explicitly labeled; there we sort participant runs by their ranking and filtering quality and demonstrate the stability, sensitivity, and reliability of nDCGf for this task. We further propose a learning-to-rank-and-filter (LTRF) framework specifically designed to optimize nDCGf by learning a ranking model and optimizing a filtering threshold, below which documents are discarded. We experiment with several loss functions and demonstrate their success in learning an effective LTRF model for the simultaneous ranking and filtering task.
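The thresholding idea behind LTRF can be illustrated with a minimal sketch. It assumes a ranking model has already scored every document, and it selects the cut-off by a plain grid search over observed scores. All names here (`rank_and_filter`, `toy_utility`, `pick_threshold`, `example_queries`) are hypothetical. In particular, `toy_utility` is only a stand-in that penalizes retrieved forbidden documents; it is not nDCGf, whose definition is not given in this abstract, and the grid search is not the optimization procedure used in the paper.

```python
import math
from typing import Callable, List, Sequence, Tuple

# (model score, label). Assumption: a negative label marks a forbidden document.
Doc = Tuple[float, float]


def rank_and_filter(docs: Sequence[Doc], threshold: float) -> List[Doc]:
    """Drop documents scoring below the threshold, then sort by score, descending."""
    kept = [d for d in docs if d[0] >= threshold]
    return sorted(kept, key=lambda d: d[0], reverse=True)


def toy_utility(ranked: Sequence[Doc]) -> float:
    """Stand-in metric (NOT nDCGf): log-discounted sum of labels.

    Because forbidden documents carry negative labels, retrieving them lowers
    the score, while filtering out useful documents forfeits their gain.
    """
    return sum(label / math.log2(rank + 2) for rank, (_, label) in enumerate(ranked))


def pick_threshold(
    queries: Sequence[Sequence[Doc]],
    utility: Callable[[Sequence[Doc]], float] = toy_utility,
) -> float:
    """Grid-search observed scores for the threshold with the best mean utility.

    Assumes `queries` is non-empty.
    """
    candidates = sorted({score for docs in queries for score, _ in docs})
    candidates.append(float("inf"))  # discarding every document is also allowed

    def mean_utility(threshold: float) -> float:
        total = sum(utility(rank_and_filter(docs, threshold)) for docs in queries)
        return total / len(queries)

    return max(candidates, key=mean_utility)


# Hypothetical scored lists for two queries; each ends with a forbidden document.
example_queries = [
    [(0.9, 2.0), (0.6, 1.0), (0.3, -1.0)],
    [(0.8, 1.0), (0.2, -2.0)],
]
best_threshold = pick_threshold(example_queries)
```

On this toy data the selected threshold keeps the relevant documents of the first query and drops the forbidden document from both lists, at the cost of nothing relevant. In general a single threshold trades lost relevant documents against exposed forbidden ones, which is why ranking and filtering need to be evaluated together.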
IR evaluation and learning in the presence of forbidden documents
2022