C-Graph Architectural Evolution
Posted: 16 Nov 2021
Last revised: 18 Nov 2021
Date Written: September 21, 2021
Abstract
The C-Graph project started as a mechanism to calculate document and author metrics in 'real time' and store them in SOLR for Scopus. It replaced a batch solution based on AWS Redshift, which was expensive to run and difficult to extend.
Once the migration of the metric calculation to C-Graph was complete, it became clear that C-Graph could also be used to compute metrics for other Elsevier products such as Science Direct and Engineering Village, as well as to power complex algorithms such as query intent for Search.
This paper presents how the architecture of C-Graph, which was initially developed to compute a finite, known set of metrics on one dataset, had to evolve to handle the following:
* other data sources: some delivered as xocs feeds (such as Science Direct), others as Kafka topics (such as grant awards), and others as RDF triples.
* not only additional metrics but also different ways of computing them, for example with an option to exclude self-citations.
* the set of documents on which the metrics are calculated: customers want metrics computed over only a subset of the documents.
The initial implementation was also strongly coupled to its first client, SOLR for Scopus, and we present how we incrementally decoupled the two.
The new architecture had to accommodate these new requirements while maintaining good performance and low operating costs.
Keywords: Graph, architecture evolution, Kafka, decoupling, semantic technologies