The Impact of Hard and Easy Negative Training Data on Vulnerability Prediction Performance

JSSOFTWARE-D-23-00626R2

36 Pages; Posted: 27 Mar 2023; Last revised: 1 Mar 2024

Fahad Al Debeyan

affiliation not provided to SSRN

Lech Madeyski

Wrocław University of Science and Technology

Tracy Hall

affiliation not provided to SSRN

David Bowes

affiliation not provided to SSRN

Date Written: July 14, 2023

Abstract

Context: Vulnerability prediction models perform poorly in the real world. This has been shown to be partly because models are evaluated on datasets from which large portions of the negative samples are typically excluded. The impact of negative samples on vulnerability prediction is largely unexplored.

Objective: We examine how different strategies for collecting negative samples influence vulnerability prediction model performance. Inspired by other disciplines (e.g., image processing), we distinguish between 'easy' negative training data (easily distinguished from positives) and 'hard' negatives.

Method: We measure model performance on datasets that vary only in the negative sampling strategy. We also use AutoML to observe whether the best-performing model changes with the negative sampling strategy.

Results: Models evaluated on easy negatives outperform those evaluated on hard negatives. AutoML selects different algorithms based on the negative sampling strategy. Models trained on one strategy underperform on datasets following another strategy. Models trained on higher ratios of easy negatives perform better, plateauing at 15 easy negatives per positive instance for complete project evaluation.

Conclusions: The chosen negative sampling approach significantly impacts model performance, potentially leading to overly optimistic results. Evaluation on realistic test sets is crucial to assess practical suitability.
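As an illustration of the ratio-based setup described in the Results, the following is a minimal sketch of how a training set with a fixed number of easy negatives per positive instance might be assembled. It is not the authors' implementation: the function name `build_training_set`, its parameters, and the representation of samples as plain Python objects are all assumptions made for this example, and the abstract does not specify how easy and hard negatives are identified or combined.

```python
import random


def build_training_set(positives, easy_negatives, hard_negatives=(),
                       easy_per_positive=15, seed=0):
    """Assemble (sample, label) pairs with a fixed easy-negative ratio.

    Hypothetical helper: labels are 1 for vulnerable (positive) samples
    and 0 for negatives. How samples are classified as 'easy' or 'hard'
    is assumed to be decided elsewhere.
    """
    rng = random.Random(seed)
    # Cap the request at the size of the available easy-negative pool.
    n_easy = min(len(easy_negatives), easy_per_positive * len(positives))
    sampled_easy = rng.sample(list(easy_negatives), n_easy)

    data = [(x, 1) for x in positives]
    data += [(x, 0) for x in hard_negatives]
    data += [(x, 0) for x in sampled_easy]
    rng.shuffle(data)
    return data


if __name__ == "__main__":
    positives = ["vuln_fn_1", "vuln_fn_2"]
    easy_pool = [f"clean_fn_{i}" for i in range(100)]
    train = build_training_set(positives, easy_pool, easy_per_positive=15)
    print(len(train), "training samples")
```

Varying `easy_per_positive` while holding the test set fixed is one way to probe for the kind of plateau the abstract reports; the value 15 used as the default here is taken from the Results, everything else is illustrative.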

Keywords: Software Vulnerability Prediction, Vulnerability Datasets, Machine Learning

Suggested Citation

Al Debeyan, Fahad and Madeyski, Lech and Hall, Tracy and Bowes, David, The Impact of Hard and Easy Negative Training Data on Vulnerability Prediction Performance (July 14, 2023). JSSOFTWARE-D-23-00626R2, Available at SSRN: https://ssrn.com/abstract=4401545 or http://dx.doi.org/10.2139/ssrn.4401545

Fahad Al Debeyan (Contact Author)

affiliation not provided to SSRN

Lech Madeyski

Wrocław University of Science and Technology

wybrzeże Stanisława Wyspiańskiego 27
Wrocław, 50-370
Poland

Tracy Hall

affiliation not provided to SSRN

David Bowes

affiliation not provided to SSRN
