Generation of mimic software project data sets for software engineering research

Maohua Gan, Kentaro Sasaki, Akito Monden, Zeynep Yucel

Research output: Contribution to journal › Conference article

Abstract

To conduct empirical research on industry software development, it is necessary to obtain data on real software projects from industry. However, only a few such industry data sets are publicly available, and unfortunately most of them are very old. In addition, most of today’s software companies cannot make their data open: software development involves many stakeholders, so the confidentiality of its data must be strongly preserved. This paper proposes a method to artificially generate a “mimic” software project data set whose characteristics (such as the average, standard deviation and correlation coefficients) are very similar to those of a given confidential data set. The proposed method uses the Box–Muller method to generate normally distributed random numbers; exponential transformation and number reordering are then used for data mimicry. Instead of using the original (confidential) data set, researchers are expected to use the mimic data set to produce results similar to those obtained from the original data set. To evaluate the usefulness of the proposed method, effort estimation models were built from an industry data set and from its mimic data set. We confirmed that the two models are very similar to each other, which suggests the usefulness of our proposal.
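
The three steps named in the abstract (Box–Muller sampling, exponential transformation, reordering) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the details, so the sketch assumes that each variable is positive and roughly log-normal, that the exponential transformation maps normal samples back from the log scale, and that reordering means arranging the generated values so their rank order follows that of the confidential column. The function names (`box_muller`, `mimic_column`, `reorder_like`, `mimic_dataset`) are hypothetical.

```python
import math
import random


def box_muller(n, rng):
    """Return n standard-normal samples via the Box-Muller transform."""
    out = []
    while len(out) < n:
        u1 = 1.0 - rng.random()  # shift to (0, 1] so log(u1) is defined
        u2 = rng.random()
        r = math.sqrt(-2.0 * math.log(u1))
        out.append(r * math.cos(2.0 * math.pi * u2))
        out.append(r * math.sin(2.0 * math.pi * u2))
    return out[:n]


def mimic_column(values, rng):
    """Assumption: draw log-normal values whose log-scale mean and
    standard deviation are taken from `values` (all must be positive)."""
    logs = [math.log(v) for v in values]
    n = len(logs)
    mu = sum(logs) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in logs) / (n - 1))
    # Exponential transformation of the normal samples.
    return [math.exp(mu + sigma * z) for z in box_muller(n, rng)]


def reorder_like(reference, generated):
    """Assumption: reorder `generated` so its rank order matches
    the rank order of `reference`."""
    order = sorted(range(len(reference)), key=lambda i: reference[i])
    sorted_gen = sorted(generated)
    result = [0.0] * len(reference)
    for rank, idx in enumerate(order):
        result[idx] = sorted_gen[rank]
    return result


def mimic_dataset(columns, seed=0):
    """Build a mimic data set from a dict of name -> list of positive values."""
    rng = random.Random(seed)
    return {
        name: reorder_like(values, mimic_column(values, rng))
        for name, values in columns.items()
    }
```

Because every mimic column keeps the rank order of its source column in this sketch, rank correlations between columns carry over, while the individual values are freshly generated. How closely the means, standard deviations and Pearson correlations match would have to be checked against the confidential data.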

Original language: English
Pages (from-to): 30-37
Number of pages: 8
Journal: CEUR Workshop Proceedings
Volume: 2273
Publication status: Published - Jan 1 2018
Event: 6th International Workshop on Quantitative Approaches to Software Quality, QuASoQ 2018 - Nara, Japan
Duration: Dec 4 2018 → …

Keywords

  • Data confidentiality
  • Data mining
  • Empirical software engineering
  • Software effort estimation

ASJC Scopus subject areas

  • Computer Science(all)
