Machine Learning Mitigants for Speech Based Cyber Risk
36 Pages. Posted: 31 Jul 2020
Date Written: July 5, 2020
Automatic Speaker Verification (ASV) technologies are a core component of the services industry, which increasingly relies on speech as one of the bio-metric measures used to access personal data and client information. A core challenge faced by such speech-based ASV systems is differentiating between samples generated by two distinct populations of utterances: those produced by an authentic human voice and those coming from a malicious cyber attack formed by a synthetic voice. Addressing this problem from a statistical perspective requires the definition of a decision function, commonly referred to as a classifier. The problem is then formulated as a learning procedure aimed at identifying the optimal classifier, that is, the one minimizing the probability of error. The primary goal of this work is to introduce such a statistical classification framework to the context of ASV systems through the following key contributions. Firstly, we define a new class of features representing the raw speech time series. These summary statistics combine two main components:
(1) a basis decomposition technique called Empirical Mode Decomposition (EMD), capable of capturing non-stationarity in speech;
(2) the Mel Frequency Cepstral Coefficients (MFCCs), which detect energy concentration around specific frequencies characterizing each individual's unique vocal tract resonance.
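In symbols, the learning problem described above can be sketched as follows. The notation is an assumption introduced here for illustration and is not taken from the paper: X denotes the feature vector extracted from an utterance, Y its label (1 for authentic speech, 0 for synthetic speech), and g a candidate classifier.

```latex
g : \mathcal{X} \to \{0, 1\}, \qquad
g^{*} = \arg\min_{g} \; \mathbb{P}\big( g(X) \neq Y \big)
```

Under this reading, g* is the classifier with the smallest probability of mislabelling an utterance.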
We then adopt a Support Vector Machine classifier in the multi-kernel learning context, as it provides a high degree of flexibility in learning the discrimination boundary between the classes of real and synthetic speech. We undertook two large case studies on real and synthetic speech, which overwhelmingly demonstrated the significance of our feature extraction and classifier approach in reducing the threat of cyber attacks perpetrated by synthetic voice replication that aims to trick a bio-metric voice system into granting unauthorized access to sensitive data.
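The idea of combining kernels over several feature groups can be illustrated with a minimal, dependency-free sketch. It is not the paper's implementation, and several simplifications are assumed: the kernel weights are fixed by hand rather than learned (as they would be in multi-kernel learning), a kernel perceptron stands in for the Support Vector Machine, and the toy two-block feature vectors are invented placeholders for EMD-based and MFCC-based features. All function names are hypothetical.

```python
import math

def rbf(x, y, gamma):
    """Gaussian (RBF) kernel between two equal-length vectors."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def combined_kernel(s, t, weights=(0.5, 0.5), gammas=(1.0, 1.0)):
    """Convex combination of one RBF kernel per feature block.

    Each sample is a pair (block_a, block_b). The weights are fixed here
    (an assumption); multi-kernel learning would estimate them from data.
    """
    return sum(w * rbf(a, b, g) for w, g, a, b in zip(weights, gammas, s, t))

def decision_score(x, samples, labels, alphas, kernel):
    """Kernel expansion: sum_j alpha_j * y_j * k(x_j, x)."""
    return sum(a * yj * kernel(xj, x)
               for a, xj, yj in zip(alphas, samples, labels))

def train_kernel_perceptron(samples, labels, kernel, epochs=20):
    """Kernel perceptron, used as a simple stand-in for an SVM solver."""
    alphas = [0] * len(samples)
    for _ in range(epochs):
        mistakes = 0
        for i, (x, y) in enumerate(zip(samples, labels)):
            if y * decision_score(x, samples, labels, alphas, kernel) <= 0:
                alphas[i] += 1  # misclassified: increase this sample's weight
                mistakes += 1
        if mistakes == 0:  # every training sample is classified correctly
            break
    return alphas

def predict(x, samples, labels, alphas, kernel):
    """Return +1 (labelled 'real') or -1 (labelled 'synthetic')."""
    score = decision_score(x, samples, labels, alphas, kernel)
    return 1 if score > 0 else -1

# Invented toy data: each sample is (block_a, block_b).
samples = [
    ((0.0, 0.1), (1.0,)),   # real
    ((0.2, 0.0), (0.9,)),   # real
    ((1.0, 1.1), (0.0,)),   # synthetic
    ((0.9, 1.0), (0.1,)),   # synthetic
]
labels = [1, 1, -1, -1]

alphas = train_kernel_perceptron(samples, labels, combined_kernel)
new_utterance = ((0.1, 0.1), (0.95,))
print(predict(new_utterance, samples, labels, alphas, combined_kernel))
```

Keeping one kernel per feature block lets each block have its own bandwidth and weight, which is the source of the flexibility in the discrimination boundary that the multi-kernel setting is meant to provide.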
Keywords: Speech Bio-metric Cyber Security, Automatic Speaker Verification, Support Vector Machines, Non-Stationary Feature Extraction, Empirical Mode Decomposition, Cyber Risk Mitigation