Leveraging Protein Language Models to Identify Complex Trait Associations with Previously Inaccessible Classes of Functional Rare Variants
36 Pages. Posted: 16 Dec 2024. Publication Status: Under Review
Abstract
Protein language models (PLMs) provide variant effect predictions for previously underexplored classes of rare variants in exome sequencing studies. Here we present novel approaches that leverage the unique properties of PLMs to test for associations between complex traits and rare variants. First, we develop an allelic series-based regression test for isoform-specific variants and discover ~22% more significant associations than standard tests. Furthermore, 17 gene-trait pairs show significantly larger effect sizes in non-canonical than in canonical transcripts. Next, we search for Evolutionary Plausible Variants (EPVs), variants to which PLMs assign positive scores and which lie at the opposite end of the spectrum from annotated deleterious variants. We find that EPVs make up a small percentage of missense variants (0.45%) and that, consistent with differential selection pressures, their allele frequencies are significantly higher than those of non-EPV and synonymous variants (p < 2.2e-16). We additionally identify eight associations with EPVs, including novel protective associations with LDL and bone mineral density. Our results show how applying PLMs to exome data expands the universe of gene-trait association mapping and interpretation.
Keywords: protein language model, rare variant, gene-based test, exome sequence