Machine Learning Mitigants for Speech Based Cyber Risk

36 Pages Posted: 31 Jul 2020

Marta Campi

UCL

Gareth Peters

Department of Actuarial Mathematics and Statistics, Heriot-Watt University; University College London - Department of Statistical Science; University of Oxford - Oxford-Man Institute of Quantitative Finance; London School of Economics & Political Science (LSE) - Systemic Risk Centre; University of New South Wales (UNSW) - Faculty of Science

Nourddine Azzaoui

Mathematics Department, Université Blaise Pascal

Date Written: July 5, 2020

Abstract

Automatic Speaker Verification (ASV) technologies are a core component of the services industry, which increasingly relies on speech as one of the bio-metric measures used to access personal data and client information. A core challenge for such speech-based ASV systems is differentiating between samples drawn from two distinct populations of utterances: those produced by an authentic human voice and those generated by a synthetic voice as part of a malicious cyber attack. Approaching this problem from a statistical perspective requires defining a decision function, commonly referred to as a classifier. The problem is then formulated as a learning procedure that aims to identify the optimal classifier, namely the one that minimizes the probability of classification error. The primary goal of this work is to introduce such a statistical classification framework to the context of ASV systems through the following key contributions. First, we define a new class of features representing the raw speech time series. These summary statistics combine two main components:

(1) a basis decomposition technique called the Empirical Mode Decomposition (EMD), which is capable of capturing non-stationarity in speech;

(2) the Mel Frequency Cepstral Coefficients (MFCCs), which detect energy concentration around the specific frequencies that characterize each individual's unique vocal tract resonance.
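
As a rough illustration of how these two components might be combined, the sketch below extracts an approximate first intrinsic mode function (IMF) from a signal and then computes MFCCs on it. Everything here is an assumption made for illustration and is not taken from the paper: the function names, the parameter values, the synthetic test signal, the choice to compute MFCCs on the first IMF only, and the use of linear interpolation for the envelopes (standard EMD uses cubic splines).

```python
import numpy as np

def local_extrema(x):
    """Indices of strict interior local maxima and minima of a 1-D array."""
    d = np.diff(x)
    maxima = np.where((d[:-1] > 0) & (d[1:] < 0))[0] + 1
    minima = np.where((d[:-1] < 0) & (d[1:] > 0))[0] + 1
    return maxima, minima

def sift_first_imf(x, n_sift=10):
    """Approximate the first IMF by repeatedly subtracting the envelope mean.

    Simplification (assumption): envelopes are linearly interpolated with
    np.interp, whereas standard EMD uses cubic-spline envelopes.
    """
    h = np.asarray(x, dtype=float).copy()
    t = np.arange(len(h))
    for _ in range(n_sift):
        maxima, minima = local_extrema(h)
        if len(maxima) < 2 or len(minima) < 2:
            break  # too few extrema to build envelopes
        upper = np.interp(t, maxima, h[maxima])
        lower = np.interp(t, minima, h[minima])
        h = h - 0.5 * (upper + lower)
    return h

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters equally spaced on the mel scale, shape (n_filters, n_fft//2 + 1)."""
    hz_pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2))
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    fb = np.zeros((n_filters, len(freqs)))
    for i in range(n_filters):
        lo, mid, hi = hz_pts[i], hz_pts[i + 1], hz_pts[i + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        fb[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

def mfcc(x, sr, n_fft=512, hop=256, n_filters=20, n_coeffs=12):
    """Frame the signal, take log mel-band energies, then apply a DCT-II."""
    n_frames = 1 + (len(x) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = x[idx] * np.hamming(n_fft)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    log_mel = np.log(power @ mel_filterbank(n_filters, n_fft, sr).T + 1e-10)
    k = np.arange(n_coeffs)[:, None]
    n = np.arange(n_filters)[None, :]
    dct = np.cos(np.pi * k * (2 * n + 1) / (2 * n_filters))
    return log_mel @ dct.T  # shape (n_frames, n_coeffs)

# Hypothetical non-stationary test signal: a chirp plus a low-frequency tone.
sr = 16000
t = np.arange(sr) / sr
signal = np.sin(2 * np.pi * (200 * t + 300 * t ** 2)) + 0.5 * np.sin(2 * np.pi * 50 * t)

imf1 = sift_first_imf(signal)
features = mfcc(imf1, sr).mean(axis=0)  # one summary vector per utterance
```

Averaging the per-frame coefficients is only one possible summary statistic; it is used here to keep the example short.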

We then adopt a Support Vector Machine classifier in the multi-kernel learning context, as it provides a high degree of flexibility in learning the discrimination boundary between the classes of real and synthetic speech. We undertook two large case studies on real and synthetic speech, which overwhelmingly demonstrated the significance of our feature extraction and classifier approach in reducing the threat of cyber attacks perpetrated through synthetic voice replication that aims to trick a bio-metric voice system into granting unauthorized access to sensitive data.
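
For orientation, the classification problem and a generic multi-kernel SVM decision rule can be written as below. This is the standard textbook form and may differ from the exact formulation used in the paper; all symbols are assumptions introduced here ($X$ a feature vector, $Y \in \{-1, +1\}$ the real/synthetic label, $K_m$ the base kernels, $\beta_m$ the kernel weights, $\alpha_i$ and $b$ the fitted SVM coefficients over $n$ training pairs $(x_i, y_i)$).

```latex
% Optimal classifier: minimizes the probability of error
f^{\star} = \arg\min_{f} \; \Pr\bigl( f(X) \neq Y \bigr)

% Combined kernel: convex combination of M base kernels
K(x, x') = \sum_{m=1}^{M} \beta_m \, K_m(x, x'),
\qquad \beta_m \ge 0, \quad \sum_{m=1}^{M} \beta_m = 1

% Resulting SVM decision function
f(x) = \operatorname{sign}\!\left( \sum_{i=1}^{n} \alpha_i \, y_i \, K(x_i, x) + b \right)
```

Because a non-negative weighted sum of valid kernels is itself a valid kernel, the combined $K$ can be used wherever a single kernel would be, which is the source of the flexibility mentioned above.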

Keywords: Speech Bio-metric Cyber Security, Automatic Speaker Verification, Support Vector Machines, Non-Stationary Feature Extraction, Empirical Mode Decomposition, Cyber Risk Mitigation

Suggested Citation

Campi, Marta and Peters, Gareth and Azzaoui, Nourddine, Machine Learning Mitigants for Speech Based Cyber Risk (July 5, 2020). Available at SSRN: https://ssrn.com/abstract=3643826 or http://dx.doi.org/10.2139/ssrn.3643826

Marta Campi

UCL

1-19 Torrington Place
London, WC1 7HB
United Kingdom

Gareth Peters (Contact Author)

Department of Actuarial Mathematics and Statistics, Heriot-Watt University

Edinburgh Campus
Edinburgh, EH14 4AS
United Kingdom

HOME PAGE: http://garethpeters78.wixsite.com/garethwpeters

University College London - Department of Statistical Science

1-19 Torrington Place
London, WC1 7HB
United Kingdom

University of Oxford - Oxford-Man Institute of Quantitative Finance

University of Oxford Eagle House
Walton Well Road
Oxford, OX2 6ED
United Kingdom

London School of Economics & Political Science (LSE) - Systemic Risk Centre

Houghton St
London
United Kingdom

University of New South Wales (UNSW) - Faculty of Science

Australia

Nourddine Azzaoui

Mathematics Department, Université Blaise Pascal

24 Avenue des Landais
63117 Aubière Cedex
France
