Empirical evaluation of similarity-based missing data imputation for effort estimation

Koichi Tamura, Koji Toda, Masateru Tsunoda, Akito Monden, Ken Ichi Matsumoto, Takeshi Kakimoto, Naoki Ohsugi

Research output: Contribution to journal › Article

1 Citation (Scopus)

Abstract

Multivariate regression models are commonly used to estimate software development effort in support of project planning and management. Because project data sets used for model construction often contain missing values, a complete data set must first be built, either by applying an imputation method or by removing the projects and metrics that have missing values (the removing method). Although several such methods exist, it is unclear which is most suitable for a given project data set. In this paper, using project data of 706 cases (47% missing value rate) collected from several companies, we applied four imputation methods (mean imputation, pair-wise deletion, k-nn method and applied CF method) and the removing method to build regression models. Then, using project data of 143 cases with no missing values, we evaluated the estimation performance of the models built after applying each imputation method and the removing method. The results showed that the similarity-based imputation methods (k-nn method and applied CF method) performed best.
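To make the difference between mean imputation and similarity-based imputation concrete, the following is a minimal Python sketch and not the authors' implementation. The abstract does not specify the distance measure, the value of k, or any normalization, so the distance function (root mean squared difference over the metrics both projects have observed), `k=2`, the function names, and the toy `projects` table are all assumptions made for illustration. The applied CF method is not sketched because the abstract gives no details about it.

```python
import math


def mean_impute(rows):
    """Replace each missing value (None) with the mean of the observed
    values in the same column (metric)."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed) if observed else None)
    return [[means[j] if v is None else v for j, v in enumerate(r)] for r in rows]


def _distance(a, b):
    """Assumed similarity measure: root mean squared difference over the
    metrics that are observed in both projects. Returns infinity when the
    two projects share no observed metric."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return math.inf
    return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))


def knn_impute(rows, k=2):
    """Replace each missing value with the average of that metric over the
    k most similar projects that have the metric observed."""
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate neighbours: other projects with metric j observed
            # and at least one metric in common with the current project.
            candidates = [
                (_distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            ]
            candidates = [c for c in candidates if c[0] != math.inf]
            candidates.sort(key=lambda c: c[0])
            nearest = candidates[:k]
            if nearest:
                filled[i][j] = sum(v for _, v in nearest) / len(nearest)
    return filled


# Hypothetical toy data: each row is a project, each column a metric.
projects = [
    [10.0, 100.0, 5.0],
    [12.0, None, 6.0],
    [11.0, 110.0, None],
    [50.0, 500.0, 30.0],
]

mean_filled = mean_impute(projects)
knn_filled = knn_impute(projects, k=2)
```

In this toy table the fourth project is much larger than the others, so mean imputation pulls the filled values toward it, while the k-nn variant fills each gap from the two most similar small projects. Real project metrics would normally be normalized before distances are computed; that step is omitted here for brevity.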

Original language: English
Pages (from-to): 44-55
Number of pages: 12
Journal: Computer Software
Volume: 26
Issue number: 3
Publication status: Published - 2009
Externally published: Yes

Fingerprint

Software engineering
Planning
Industry

ASJC Scopus subject areas

  • Software

Cite this

Tamura, K., Toda, K., Tsunoda, M., Monden, A., Matsumoto, K. I., Kakimoto, T., & Ohsugi, N. (2009). Empirical evaluation of similarity-based missing data imputation for effort estimation. Computer Software, 26(3), 44-55.

@article{ca7c7d13a890498ca48bb0efe44842de,
title = "Empirical evaluation of similarity-based missing data imputation for effort estimation",
author = "Koichi Tamura and Koji Toda and Masateru Tsunoda and Akito Monden and Matsumoto, {Ken Ichi} and Takeshi Kakimoto and Naoki Ohsugi",
year = "2009",
language = "English",
volume = "26",
pages = "44--55",
journal = "Computer Software",
issn = "0289-6540",
publisher = "Japan Society for Software Science and Technology",
number = "3",

}

TY - JOUR
T1 - Empirical evaluation of similarity-based missing data imputation for effort estimation
AU - Tamura, Koichi
AU - Toda, Koji
AU - Tsunoda, Masateru
AU - Monden, Akito
AU - Matsumoto, Ken Ichi
AU - Kakimoto, Takeshi
AU - Ohsugi, Naoki
PY - 2009
Y1 - 2009
UR - http://www.scopus.com/inward/record.url?scp=70350451005&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=70350451005&partnerID=8YFLogxK
M3 - Article
AN - SCOPUS:70350451005
VL - 26
SP - 44
EP - 55
JO - Computer Software
JF - Computer Software
SN - 0289-6540
IS - 3
ER -