Repeatable Process for Extracting Health Data from HL7 CDA Documents

29 Pages Posted: 1 Apr 2024

Harry-Anton Talvik

affiliation not provided to SSRN

Marek Oja

University of Tartu

Sirli Tamm

University of Tartu

Kerli Mooses

University of Tartu

Dage Särg

affiliation not provided to SSRN

Marcus Lõo

affiliation not provided to SSRN

Õie Renata Siimon

affiliation not provided to SSRN

Hendrik Šuvalov

affiliation not provided to SSRN

Raivo Kolde

University of Tartu - Institute of Computer Science

Jaak Vilo

affiliation not provided to SSRN

Sulev Reisberg

University of Tartu - Institute of Computer Science

Sven Laur

affiliation not provided to SSRN

Abstract

Objective: This study addresses a gap in the literature on converting real-world Clinical Document Architecture (CDA) data into the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM), focusing on the initial steps that precede the mapping phase. We highlight the importance of a repeatable Extract-Transform-Load (ETL) pipeline for extracting health data from HL7 CDA documents in Estonia for research purposes.

Methods: We developed a repeatable ETL pipeline that extracts, cleans, and restructures health data from CDA documents for loading into OMOP CDM, producing high-quality, structured data. The pipeline was designed to adapt to continual changes in the data exchange format and to handle different subsets of CDA documents for different scientific studies.

Results: Our pipeline successfully transformed a significant portion of diagnosis codes and eGFR measurements from CDA documents into OMOP CDM, showing that structured data are straightforward to extract. However, we encountered challenges such as harmonising diverse coding systems and extracting lab results from free-text sections. Iterative development of the pipeline enabled swift error detection and correction, making the process more efficient.

Conclusion: After a decade of focused work, we have developed an ETL pipeline that effectively transforms HL7 CDA documents into OMOP CDM in Estonia and addresses key data extraction and transformation challenges. The pipeline’s repeatability and adaptability to various data subsets make it a valuable resource for researchers working with health data. Although it was tested on Estonian data, the principles outlined are broadly applicable and may help in handling health data standards that vary by country. Although newer health data standards are emerging, CDA remains relevant for retrospective health studies, which ensures the continuing importance of this work.

Keywords: HL7 Clinical Document Architecture, ETL, workflow, pipeline, OMOP CDM, NLP
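
To make the extraction step described in the abstract concrete, the following is a minimal sketch and not the authors' pipeline. It assumes a heavily simplified, hypothetical CDA fragment in which diagnoses appear as coded `value` elements inside `observation` entries, and it emits rows whose field names are only loosely modelled on OMOP CDM source-value columns. The sample document, the function name `extract_diagnoses`, and the `source_vocabulary` field are illustrative assumptions.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# HL7 v3 namespace used by CDA documents.
NS = {"hl7": "urn:hl7-org:v3"}

# Hypothetical, heavily simplified CDA fragment (illustration only).
SAMPLE_CDA = """<ClinicalDocument xmlns="urn:hl7-org:v3">
  <recordTarget><patientRole><id extension="P001"/></patientRole></recordTarget>
  <component><structuredBody><component><section>
    <entry>
      <observation classCode="OBS" moodCode="EVN">
        <effectiveTime value="20200115"/>
        <value code="N18.3" codeSystemName="ICD-10"
               displayName="Chronic kidney disease, stage 3"/>
      </observation>
    </entry>
    <entry>
      <observation classCode="OBS" moodCode="EVN">
        <effectiveTime value="20200301"/>
        <value displayName="uncoded free-text diagnosis"/>
      </observation>
    </entry>
  </section></component></structuredBody></component>
</ClinicalDocument>"""


def extract_diagnoses(cda_xml: str) -> list[dict]:
    """Return one row per coded diagnosis found in the document."""
    root = ET.fromstring(cda_xml)

    # Patient identifier, kept as a source value (no ID mapping here).
    patient = root.find(".//hl7:recordTarget/hl7:patientRole/hl7:id", NS)
    person = patient.get("extension") if patient is not None else None

    rows = []
    for obs in root.iterfind(".//hl7:entry/hl7:observation", NS):
        value = obs.find("hl7:value", NS)
        # Skip entries without a structured code; in this sketch,
        # free-text diagnoses are simply left out.
        if value is None or not value.get("code"):
            continue

        time = obs.find("hl7:effectiveTime", NS)
        raw = time.get("value") if time is not None else None
        # CDA timestamps start with YYYYMMDD; keep only the date part.
        date = (
            datetime.strptime(raw[:8], "%Y%m%d").date().isoformat()
            if raw
            else None
        )

        rows.append(
            {
                "person_source_value": person,
                "condition_source_value": value.get("code"),
                "source_vocabulary": value.get("codeSystemName"),  # hypothetical field
                "condition_start_date": date,
            }
        )
    return rows


if __name__ == "__main__":
    for row in extract_diagnoses(SAMPLE_CDA):
        print(row)
```

The sketch stops before vocabulary mapping, in line with the abstract's focus on the steps that precede the mapping phase. Entries without a structured code are skipped here; lab results and diagnoses held in free-text sections would need separate handling.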

Suggested Citation

Talvik, Harry-Anton and Oja, Marek and Tamm, Sirli and Mooses, Kerli and Särg, Dage and Lõo, Marcus and Siimon, Õie Renata and Šuvalov, Hendrik and Kolde, Raivo and Vilo, Jaak and Reisberg, Sulev and Laur, Sven, Repeatable Process for Extracting Health Data from HL7 CDA Documents. Available at SSRN: https://ssrn.com/abstract=4776237 or http://dx.doi.org/10.2139/ssrn.4776237

Kerli Mooses (Contact Author)

University of Tartu
