Hofstätter, S., Khattab, O., Althammer, S., Sertkan, M., & Hanbury, A. (2022). Introducing Neural Bag of Whole-Words with ColBERTer: Contextualized Late Interactions using Enhanced Reduction. In CIKM ’22: Proceedings of the 31st ACM International Conference on Information & Knowledge Management (pp. 737–747). Association for Computing Machinery (ACM). https://doi.org/10.1145/3511808.3557367
Recent progress in neural information retrieval has demonstrated large gains in quality, while often sacrificing efficiency and interpretability compared to classical approaches. We propose ColBERTer, a neural retrieval model using contextualized late interaction (ColBERT) with enhanced reduction. Along the effectiveness Pareto frontier, ColBERTer dramatically lowers ColBERT's storage requirements while simultaneously improving the interpretability of its token-matching scores. To this end, ColBERTer fuses single-vector retrieval, multi-vector refinement, and optional lexical matching components into one model. For its multi-vector component, ColBERTer reduces the number of stored vectors by learning unique whole-word representations and learning to identify and remove word representations that are not essential to effective scoring. We employ an explicit multi-task, multi-stage training procedure to facilitate using very small vector dimensions. Results on the MS MARCO and TREC-DL collections show that ColBERTer reduces the storage footprint by up to 2.5x, while maintaining effectiveness. With just one dimension per token in its smallest setting, ColBERTer achieves index storage parity with the plaintext size, with very strong effectiveness results. Finally, we demonstrate ColBERTer's robustness on seven high-quality out-of-domain collections, yielding statistically significant gains over traditional retrieval baselines.
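The scoring idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the mean-pooling of subword vectors into whole-word vectors, the fixed gate threshold, and the fixed mixing weight `alpha` are all assumptions chosen for illustration, and the vectors are toy values rather than model outputs.

```python
from collections import defaultdict


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def aggregate_whole_words(subword_vectors, word_ids):
    """Pool subword vectors that share a word id into one vector per word.

    Mean-pooling is an assumption made for this sketch.
    """
    groups = defaultdict(list)
    for vec, wid in zip(subword_vectors, word_ids):
        groups[wid].append(vec)
    return [
        [sum(col) / len(vecs) for col in zip(*vecs)]
        for _, vecs in sorted(groups.items())
    ]


def prune_words(word_vectors, gate_scores, threshold=0.5):
    """Keep only word vectors whose (hypothetical) gate score passes the threshold."""
    return [v for v, g in zip(word_vectors, gate_scores) if g >= threshold]


def maxsim(query_vectors, doc_vectors):
    """Late interaction: for each query vector take its best match in the document, then sum."""
    if not doc_vectors:
        return 0.0
    return sum(max(dot(q, d) for d in doc_vectors) for q in query_vectors)


def fused_score(q_cls, d_cls, query_vectors, doc_vectors, alpha=0.5):
    """Combine a single-vector score with the multi-vector refinement score.

    `alpha` is a hypothetical fixed weight used only in this sketch.
    """
    return alpha * dot(q_cls, d_cls) + (1 - alpha) * maxsim(query_vectors, doc_vectors)


# Toy document: three subword tokens, the first two belonging to the same word.
doc_words = aggregate_whole_words(
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], word_ids=[0, 0, 1]
)
kept_words = prune_words(doc_words, gate_scores=[0.9, 0.2])
query_words = [[1.0, 0.0], [0.0, 2.0]]
score = fused_score([1.0, 1.0], [0.5, 0.5], query_words, kept_words)
```

In this toy setup the document stores one vector instead of three after aggregation and pruning, which is the kind of storage reduction the abstract describes; the actual savings and effectiveness reported there come from the trained model, not from this sketch.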
Domain Specific Systems for Information Extraction and Retrieval: 860721 (European Commission)