Using Graph and Transformer Embeddings for Vector Based Retrieval in Scopus

Posted: 28 Jan 2021

Date Written: November 4, 2020

Abstract

For decades, term-based vector representations built from whole-document statistics, such as TF-IDF, have been the staple of efficient and effective information retrieval. The popularity of Deep Learning over the past decade has resulted in the development of many interesting embedding schemes. Like term-based vector representations, these embeddings depend on structure implicit in language and user behavior. Unlike them, they leverage the distributional hypothesis, which states that the meaning of a word is determined by the context in which it appears. These embeddings have been found to encode the semantics of words better than term-based representations. Despite this, it has only recently become practical to use embeddings in Information Retrieval at scale.
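
As a point of reference for the term-based baseline mentioned above, the sketch below shows a minimal TF-IDF retrieval loop built with scikit-learn. It is not taken from the presentation; the documents and query are illustrative placeholders.

```python
# Minimal sketch of term-based retrieval: TF-IDF vectors plus cosine similarity.
# Documents and query are toy placeholders, not data from the presentation.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "Transformer language models for scientific text",
    "Citation graphs and bibliometric analysis",
    "Efficient term-based indexing for information retrieval",
]

# Each document becomes a sparse vector of whole-document term statistics.
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)

# Queries are projected into the same term space and ranked by cosine similarity.
# Only literal term overlap contributes to the score, which is the limitation
# that distributional embeddings are meant to address.
query_vector = vectorizer.transform(["language models for science"])
scores = cosine_similarity(query_vector, doc_vectors).ravel()
print(sorted(zip(scores, docs), reverse=True))
```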

In this presentation, we will describe how we applied two new embedding schemes to the subset of documents available in Scopus and the CORD-19 dataset. Both schemes are based on the distributional hypothesis but come from very different backgrounds. The first is a graph embedding, node2vec, which encodes papers using the citation relationships between them as specified by their authors. The second leverages Transformers, a recent Deep Learning innovation in which language models are trained on large bodies of text. The two embeddings exploit the signal implicit in these data sources and produce semantically rich user-based and content-based vector representations, respectively. We evaluate search results from these embedding schemes against judgement lists from the TREC-COVID competition. Finally, we will describe how RELX staff can access these embeddings for their own data science needs, independent of the search application.
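
For readers who want a concrete picture of the two schemes, the sketch below approximates them with open-source tooling (the node2vec and sentence-transformers packages). It is not the pipeline used in the presentation; the citation edges, paper texts, query, and the allenai-specter model name are all assumptions chosen for illustration.

```python
# Sketch (not the authors' pipeline) of the two embedding schemes described above.
import networkx as nx
from node2vec import Node2Vec
from sentence_transformers import SentenceTransformer, util

# --- 1. Graph embedding: node2vec over an author-specified citation graph ----
citations = [("paperA", "paperB"), ("paperA", "paperC"), ("paperB", "paperD")]
graph = nx.DiGraph(citations)

# Random walks over the citation graph feed a skip-gram model, so papers that
# share citation neighbourhoods end up close together in the vector space.
n2v = Node2Vec(graph, dimensions=128, walk_length=40, num_walks=10, workers=2)
graph_model = n2v.fit(window=5, min_count=1)
paper_vector = graph_model.wv["paperA"]           # user/behaviour-based signal

# --- 2. Transformer embedding: encode title + abstract text directly ---------
encoder = SentenceTransformer("allenai-specter")  # assumed model; swap as needed
texts = ["Title A. Abstract of paper A ...", "Title B. Abstract of paper B ..."]
text_vectors = encoder.encode(texts, convert_to_tensor=True)

# Vector-based retrieval: rank documents by cosine similarity to a query.
query_vec = encoder.encode("coronavirus vaccine efficacy", convert_to_tensor=True)
scores = util.cos_sim(query_vec, text_vectors)
print(scores)
```

In a production setting the resulting vectors would typically be loaded into an approximate nearest-neighbour index rather than compared exhaustively as shown here.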

Keywords: graph embeddings, node2vec, transformers, BERT, vector search, Scopus

Suggested Citation

Pal, Sujit and Scerri, Antony, Using Graph and Transformer Embeddings for Vector Based Retrieval in Scopus (November 4, 2020). Proceedings of the 4th Annual RELX Search Summit, Available at SSRN: https://ssrn.com/abstract=3774457

Antony Scerri

Elsevier

Radarweg 29
Amsterdam, 1043 NX
Netherlands

