Impact of the Distribution Parameter of Data Sampling Approaches on Software Defect Prediction Models

Kwabena Ebo Bennin, Jacky Keung, Akito Monden

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

3 Citations (Scopus)

Abstract

Sampling methods are known to impact defect prediction performance. These sampling methods have configurable parameters that can significantly affect prediction performance. It is, however, impractical to assess the effect of every possible setting in the parameter space for all of the existing sampling methods. One parameter that is common to all sampling methods and easy to tweak is the resulting distribution of defective and non-defective modules in the dataset, known as Pfp (the percentage of fault-prone modules). In this paper, we investigate and assess the performance of defect prediction models when the Pfp parameter of the sampling methods is tweaked. An empirical experiment assessing seven sampling methods on five prediction models over 20 releases of 10 static-metric projects indicates that (1) Area Under the Receiver Operating Characteristic Curve (AUC) performance is not improved by tweaking the Pfp parameter, (2) pf (false alarm) performance degrades as Pfp is increased, and (3) a stable predictor is difficult to achieve across different Pfp rates. Hence, we conclude that the Pfp setting can have a large impact on the performance of defect prediction models (except for AUC). We therefore recommend that researchers experiment with the Pfp parameter of their chosen sampling method, since the class distribution of training datasets varies.
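To make the Pfp parameter concrete, the sketch below shows one way a training set could be rebalanced to a target Pfp using plain random oversampling of the defective class. This is only an illustration of what the parameter controls; the paper evaluates seven different sampling methods, and the function and variable names here are illustrative rather than taken from the study.

import numpy as np

def resample_to_pfp(X, y, target_pfp, seed=None):
    # Randomly oversample the defective class (y == 1) until defective
    # modules make up roughly `target_pfp` percent of the training set.
    # Plain random oversampling is used purely for illustration; it is not
    # implied to be one of the authors' sampling methods.
    rng = np.random.default_rng(seed)
    defective = np.flatnonzero(y == 1)
    clean = np.flatnonzero(y == 0)

    # Number of defective modules needed so that
    # n_defective / (n_defective + n_clean) == target_pfp / 100.
    needed = int(round(len(clean) * target_pfp / (100.0 - target_pfp)))
    extra = max(needed - len(defective), 0)
    duplicates = rng.choice(defective, size=extra, replace=True)

    idx = np.concatenate([clean, defective, duplicates])
    rng.shuffle(idx)
    return X[idx], y[idx]

# Toy usage: rebalance a 10%-defective training set to Pfp = 40%.
X = np.random.rand(100, 5)
y = np.array([1] * 10 + [0] * 90)
X_bal, y_bal = resample_to_pfp(X, y, target_pfp=40, seed=0)
print(round(y_bal.mean(), 2))   # approximately 0.40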

Original language: English
Title of host publication: Proceedings - 24th Asia-Pacific Software Engineering Conference, APSEC 2017
Publisher: IEEE Computer Society
Pages: 630-635
Number of pages: 6
Volume: 2017-December
ISBN (Electronic): 9781538636817
DOI: 10.1109/APSEC.2017.76
Publication status: Published - Mar 1 2018
Event: 24th Asia-Pacific Software Engineering Conference, APSEC 2017 - Nanjing, Jiangsu, China
Duration: Dec 4 2017 - Dec 8 2017

Other

Other: 24th Asia-Pacific Software Engineering Conference, APSEC 2017
Country: China
City: Nanjing, Jiangsu
Period: 12/4/17 - 12/8/17

Keywords

  • Defect prediction
  • Empirical software engineering
  • Imbalanced Data
  • Preprocessing
  • Sampling methods
  • Search based SE

ASJC Scopus subject areas

  • Software

Cite this

Bennin, K. E., Keung, J., & Monden, A. (2018). Impact of the Distribution Parameter of Data Sampling Approaches on Software Defect Prediction Models. In Proceedings - 24th Asia-Pacific Software Engineering Conference, APSEC 2017 (Vol. 2017-December, pp. 630-635). IEEE Computer Society. https://doi.org/10.1109/APSEC.2017.76

