Heterogeneous Benchmarking across Domains and Languages: The Key to Enable Meaningful Progress in IR Research.

Title: Heterogeneous Benchmarking across Domains and Languages: The Key to Enable Meaningful Progress in IR Research.
Guest Speaker: Nandan Thakur
Date: 24th January 2024 (Wednesday)
Time: 4.30-5.30pm (IST)
Venue: C102, LHC
Abstract
Benchmarks are essential for measuring realistic progress in Information Retrieval. However, existing benchmarks saturate quickly because they are prone to overfitting, which hurts the generalization of retrieval models. To address these challenges, I will present two of my research efforts: BEIR, a heterogeneous benchmark for zero-shot evaluation across specialized domains, and MIRACL, a monolingual benchmark covering a diverse range of languages. In BEIR, we show that neural retrievers surprisingly struggle to generalize zero-shot to specialized domains due to a lack of training data. To overcome this, we develop GPL, which distils cross-encoder knowledge using cross-domain BEIR synthetic data. On the language side, MIRACL is robust in its annotations and covers a broader range of languages. However, generating supervised training data is cumbersome in realistic settings. To supplement it, we construct SWIM-IR, a synthetic training dataset with 28 million LLM-generated pairs across 37 languages. Multilingual retrievers trained on it are comparable to supervised models on three multilingual retrieval benchmarks, and the approach can be extended to several new languages.
Speaker Bio
Nandan Thakur is a third-year PhD student in the David R. Cheriton School of Computer Science at the University of Waterloo, supervised by Prof. Jimmy Lin. His research broadly investigates data efficiency and model generalization across specialized domains and languages in information retrieval. He co-organized the MIRACL competition at WSDM 2023 and will co-organize the upcoming RAG Track at TREC 2024. His work has been published in top conferences and journals, including ACL, NAACL, NeurIPS, SIGIR and TACL.