Handling categorical variables in effort estimation

Masateru Tsunoda, Sousuke Amasaki, Akito Monden

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

2 Citations (Scopus)

Abstract

Background: Accurate effort estimation is the basis of software development project management. The linear regression model is one of the most widely used methods for this purpose. A dataset used to build a model often includes categorical variables, such as the programming language. Categorical variables are usually handled with one of two methods: stratification or transformation into dummy variables. These methods improve accuracy but have shortcomings. Two other handling methods, interaction and the hierarchical linear model (HLM), might compensate for those shortcomings; however, neither has been examined in this research area. Aim: To give useful suggestions for handling categorical variables with stratification, dummy variables, interaction, or HLM when building an estimation model. Method: We built estimation models with the four handling methods on the ISBSG, NASA, and Desharnais datasets and compared the accuracy of the methods. Results: The most effective method differed across datasets, and the difference was statistically significant on both mean balanced relative error (MBRE) and mean magnitude of relative error (MMRE). Interaction and HLM were effective in a certain case. Conclusions: At a minimum, stratification and dummy variables should be tried in order to obtain an accurate model. In addition, we suggest considering interaction and HLM when building the estimation model.
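
The abstract compares methods on MMRE and MBRE without defining them. As a minimal sketch, assuming the definitions commonly used in the effort estimation literature (the paper's exact formulas are not shown on this page), the two criteria could be computed as follows. The function names and the effort values are hypothetical and are not taken from the paper or its datasets.

```python
from statistics import mean


def mre(actual, estimate):
    """Magnitude of relative error: absolute error divided by the actual effort."""
    return abs(actual - estimate) / actual


def bre(actual, estimate):
    """Balanced relative error: absolute error divided by the smaller of
    the actual and estimated effort, so over- and underestimates are
    penalized symmetrically."""
    return abs(actual - estimate) / min(actual, estimate)


def mmre(actuals, estimates):
    """Mean MRE over all projects."""
    return mean(mre(a, e) for a, e in zip(actuals, estimates))


def mbre(actuals, estimates):
    """Mean BRE over all projects."""
    return mean(bre(a, e) for a, e in zip(actuals, estimates))


# Hypothetical effort values in person-hours (illustration only).
actual_effort = [100.0, 200.0, 400.0]
estimated_effort = [50.0, 400.0, 400.0]

print("MMRE:", mmre(actual_effort, estimated_effort))
print("MBRE:", mbre(actual_effort, estimated_effort))
```

In this toy data the first project is underestimated by half and the second is overestimated by double. MRE scores them differently (0.5 and 1.0), while BRE scores both as 1.0, which is one reason a study might report both criteria.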

Original language: English
Title of host publication: International Symposium on Empirical Software Engineering and Measurement
Pages: 99-102
Number of pages: 4
DOI: https://doi.org/10.1145/2372251.2372267
Publication status: Published - 2012
Externally published: Yes
Event: 6th ACM-IEEE International Symposium on Empirical Software Engineering and Measurement, ESEM 2012 - Lund, Sweden
Duration: Sep 19 2012 to Sep 20 2012

Keywords

  • Dummy variable
  • Hierarchical linear model
  • Interaction
  • Mixed effects
  • Model-based effort estimation
  • Stratification

ASJC Scopus subject areas

  • Computer Science Applications
  • Software

Cite this

Tsunoda, M., Amasaki, S., & Monden, A. (2012). Handling categorical variables in effort estimation. In International Symposium on Empirical Software Engineering and Measurement (pp. 99-102). https://doi.org/10.1145/2372251.2372267

@inproceedings{8fa7dc262bb047339654a65c3ef8d46b,
title = "Handling categorical variables in effort estimation",
keywords = "Dummy variable, Hierarchical linear model, Interaction, Mixed effects, Model-based effort estimation, Stratification",
author = "Masateru Tsunoda and Sousuke Amasaki and Akito Monden",
year = "2012",
doi = "10.1145/2372251.2372267",
language = "English",
pages = "99--102",
booktitle = "International Symposium on Empirical Software Engineering and Measurement",

}

TY - GEN

T1 - Handling categorical variables in effort estimation

AU - Tsunoda, Masateru

AU - Amasaki, Sousuke

AU - Monden, Akito

PY - 2012

Y1 - 2012

KW - Dummy variable

KW - Hierarchical linear model

KW - Interaction

KW - Mixed effects

KW - Model-based effort estimation

KW - Stratification

UR - http://www.scopus.com/inward/record.url?scp=84867540713&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84867540713&partnerID=8YFLogxK

U2 - 10.1145/2372251.2372267

DO - 10.1145/2372251.2372267

M3 - Conference contribution

SP - 99

EP - 102

BT - International Symposium on Empirical Software Engineering and Measurement

ER -