Tohru Kikuno
Graduate School of Information Science and Technology, Osaka University
1-5 Yamadaoka, Suita, Osaka, Japan
[email protected]

Nahomi Kikuchi
Oki Electric Industry Co., Ltd.
1-16-8 Chuou, Warabi, Saitama, Japan
[email protected]

Masayuki Hirayama
Department of Electronics and Computer Science, College of Science and Technology, Nihon University
[email protected]

ABSTRACT

We constructed a prediction model with the data set provided by the Software Engineering Center, Information-technology Promotion Agency, Japan (IPA/SEC) by applying the naive Bayesian classifier. The result showed that the accuracy of predicting successful projects was 0.86. However, the accuracy of predicting unsuccessful projects was 0.53, which was very low. To find the reason for the low accuracy, we analyzed the characteristics of the IPA/SEC dataset and revealed the following two factors that appeared to affect accuracy. (1) Incompleteness: 44.6% of the values in the data were missing. (2) Imbalance: The number of successful projects was three times that of unsuccessful projects. We attempted to reduce the degree of incompleteness by mechanically removing the data that contained many missing values. A preliminary experiment showed that our method reduced the degree of incompleteness and, at the same time, reduced the imbalance. On the basis of these preliminary experiments, from a given dataset we deleted the data in which the number of missing values was larger than γ, where γ is an integer value that indicates the upper limit of the number of missing values. The experimental evaluation showed that when γ was 6, the accuracy of predicting unsuccessful projects was 0.88, which indicated a major improvement.

Categories and Subject Descriptors
D.2.9 [Software Engineering]: Management

Keywords
Project Success, Prediction, Bayesian Classifier

1. INTRODUCTION

In empirical software engineering, many researchers have tried to predict the quality and cost of a product using project data sets. In previous research [1-3], the project data sets were obtained from well-designed projects. Therefore, the project data sets were complete, i.e., all metric data were filled in. Even when the data were not complete, there were only a few missing values [4, 5]. However, we worked with a public project data set obtained from ordinary software development companies, which contained many missing values. For example, the International Software Benchmarking Standards Group (ISBSG) collects software development project data with 65.4% of the values missing [6]. In Japan, the Software Engineering Center, Information-technology Promotion Agency, Japan (IPA/SEC) collects data with 83.8% of the values missing [7].

In this paper, we report an attempt to improve the accuracy of predicting product quality by using a subset of the project data collected by IPA/SEC [7]. The total number of project data in the dataset is 237. First, we constructed a prediction model with the IPA/SEC data set by applying the naive Bayesian classifier. The result showed that the accuracy of predicting successful projects was 0.86, which was very high. However, the accuracy of predicting unsuccessful projects was 0.53, which was very low. To find the reason for the low accuracy, we analyzed the characteristics of the IPA/SEC data set and revealed the following two factors that appeared to affect accuracy: (1) Incompleteness: 44.6% of the values in the data were missing.

Figure 1: Modification of the dataset (D0 with M0 = 633 metrics and N0 = 1397 projects; D1 with M1 = 69 and N1 = 237, used in Experiment 1; improvement of γ yields the datasets D2 used in Experiment 2)

(2) Imbalance: The number of successful projects was three times that of unsuccessful projects.

We first attempted to reduce the degree of incompleteness by mechanically removing the data that contained many missing values. The results of the preliminary experiment on the IPA/SEC data set showed that our method reduced the rate of missing values. Moreover, the ratio between the numbers of successful and unsuccessful projects also gradually decreased, which simultaneously reduced the imbalance. On the basis of the preliminary experiments, from a given dataset we deleted the data in which the number of missing values was larger than γ, where γ is a certain integer value, and calculated prediction accuracy by constructing a prediction model on the resultant dataset. The experimental evaluation showed that when γ was 6 and the total number of data was 56, the accuracy of predicting successful projects was 0.92. In addition, the accuracy of predicting unsuccessful projects was 0.88, which indicated a major improvement.

Figure 1 shows how the data set is successively modified in this study. D0 is the original data collected by IPA/SEC; its characteristics are explained in Section 3. D1 is the dataset for the prediction (Experiment 1), which is explained in Section 4. On the basis of Experiment 1, we improved the data set and obtained the resultant dataset D2; we explain the details of the improvement in Section 5. Finally, we performed the prediction with D2, which is explained in Section 6.

2. RELATED WORKS

In this section, we refer to related works that have considered the two characteristics of the data sets: incompleteness and imbalance.

Tsunoda et al. studied the influence of imbalanced data on prediction accuracy [8]. They evaluated some prediction methods while decreasing the number of unsuccessful projects and showed that prediction accuracy drops when imbalanced data are used.

Similarly, many studies have been conducted on incompleteness [4, 5, 9-12]. Three methods have been proposed to deal with missing values: (a) use the incomplete data without any change, (b) delete the data with missing values, and (c) substitute each missing value with some value. Research efforts [4] and [5] used incomplete project data, as in method (a). Abe et al. applied the Bayesian classifier technique to software development project data to predict whether a project will be completed successfully [4]. Amasaki et al. applied association rule mining to software project data to identify risk factors [5]. Research efforts [10] and [11] used method (c). Twala et al. evaluated some imputation methods and concluded that a multiple imputation method attains the highest prediction accuracy [10]. Cartwright et al. evaluated the k-Nearest Neighbor and median imputation methods and found that the k-Nearest Neighbor method provided the best results [11]. Research efforts [9] and [12] compared methods (b) and (c). Strike et al. evaluated some imputation and deletion methods [9] and concluded that listwise deletion is the most reasonable approach. Kromrey et al. found that when the project data contain many missing values, prediction accuracy drops significantly with method (c) [12].

Because there are many missing values in the IPA/SEC dataset, we did not use imputation methods; instead, we used the incomplete data after deleting the projects with many missing values.

3. CHARACTERISTICS OF THE DATASET

3.1 Project Data D0

Using the IPA/SEC data released as "IPA/SEC White Paper 2008 on Software Development Projects in Japan" [7], we attempt to predict, at the end of the design phase, whether a project will be completed successfully. In Figure 1, this data is described as D0, where M0 is the number of metrics and N0 is the number of projects. The data consist of 1,397 custom enterprise software projects from 20 Japanese companies¹, described by 633 metrics. Thus, N0 is 1,397 and M0 is 633. Among the 633 metrics, in this paper we use the metric "evaluation of results (quality)," which indicates whether the project was completed successfully or not. Note that the data are characterized by a large number of metrics that are commonly used in these companies. However, the data also include many missing values, which are discussed in Section 3.3.

¹ The data were collected from 2,056 projects; however, we deleted the data that IPA/SEC or the company evaluated as unreliable. A detailed explanation is found in Appendix A.

Table 1: An Example of Confusion Matrix

                                 Correct Result
                            Successful  Unsuccessful
Prediction   Successful         s            t
Result       Unsuccessful       u            v

Figure 2: Explanation of missing data (a grid of 10 projects by seven metrics, in which the mark "x" denotes a missing value)

3.2 Successful/Unsuccessful Projects

In this study, we distinguish successful projects from unsuccessful projects by using the metric "evaluation of results (quality)." The value of this metric indicates the level of the delivered quality and can take a value from "a" to "e." Here, "a" indicates that the number of defects after the system cutover is less than the planned value by 20% or more, and "b" indicates that it is less than the planned value. A value of "c" indicates that the number of defects after the system cutover exceeds the planned value by 50% or less, "d" indicates that it exceeds the planned value by 100% or less, and "e" indicates that it exceeds the planned value by more than 100%. In this study, a project with the value "a" or "b" is considered to be successfully completed, and a project with the value "c," "d," or "e" is considered to be unsuccessful. In data set D0, 186 projects were successful and 51 projects were unsuccessful; the value was missing in 1,160 projects.
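The grade-to-label rule above can be sketched in a few lines; the function name and label strings below are illustrative, not part of the IPA/SEC schema.

```python
def project_label(grade):
    """Map the "evaluation of results (quality)" grade to a success label.

    Grades "a" and "b" mean the delivered defect count met or beat the
    plan (successful); "c", "d", and "e" mean it exceeded the plan
    (unsuccessful). A missing grade yields None.
    """
    if grade in ("a", "b"):
        return "successful"
    if grade in ("c", "d", "e"):
        return "unsuccessful"
    return None  # value missing: the project is excluded from prediction
```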

3.3 Missing Values

Figure 2 shows an example data set consisting of seven metrics and 10 projects; in data set D0, the number of metrics is 633 and the number of projects is 1,397. The mark "x" in Figure 2 denotes a missing value: the fourth metric has two missing values, and the third project has four missing values. The rate of missing values is calculated as 100 × X/(M × N)%, where M is the number of metrics, N is the number of projects, and X is the number of missing values. In Figure 2, X is 19, M is 7, and N is 10; thus, the rate of missing values is 100 × 19/(7 × 10) ≈ 27.1%. For dataset D0, the rate of missing values is 83.8%.
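The rate of missing values defined above can be computed directly. A minimal sketch, assuming each project is represented as a list of metric values with None marking a missing cell:

```python
def missing_rate(projects):
    """Return 100 * X / (M * N), the rate of missing values in percent,
    where N is the number of projects, M is the number of metrics, and
    X is the number of missing (None) cells."""
    n = len(projects)        # N: number of projects
    m = len(projects[0])     # M: number of metrics
    x = sum(value is None for row in projects for value in row)  # X
    return 100.0 * x / (m * n)
```

For the Figure 2 example (X = 19, M = 7, N = 10), this yields 100 × 19/70 ≈ 27.1%.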

4. EXPERIMENT 1

4.1 Outline of Experiment 1

In this paper, we predict whether a project will be completed successfully. For this purpose, we selected only those projects that had a value for the metric "evaluation of results (quality)." As mentioned in Section 3.2, 237 projects have a value; of these, 186 were successful and 51 were unsuccessful. Furthermore, the prediction is performed at the end of the design phase of the software lifecycle, so the metrics that are unavailable at that time were deleted from D0. As a result, the number of metrics was reduced from 633 to 69. As shown in Figure 1, the resultant data set D1 has N1 = 237 and M1 = 69.

In this study, projects are predicted by applying the naive Bayesian classifier, commonly known as the Bayesian classifier, which is one of the most common approaches for classifying categorical data into several classes based on a probabilistic model. This method is optimal for supervised learning if the values of the attributes of an example are independent given the class of the example. Although this assumption is almost always violated in practice, research [13] has shown that naive Bayesian supervised learning is nevertheless effective. A project is predicted to be successful when the success probability calculated by 10-fold cross validation is greater than the criterion (in this paper, we set the criterion to 0.5); when the success probability is less than 0.5, the project is predicted to be unsuccessful.

We used the F-measure to evaluate the prediction accuracy for successful and unsuccessful projects. Table 1 shows an example of a confusion matrix. In this example, the F-measures for predicting successful and unsuccessful projects are calculated as 2s/(2s + t + u) and 2v/(2v + t + u), respectively. To ensure statistical validity, we repeated the 10-fold cross validation 100 times and used the averages. The average F-measures for predicting successful and unsuccessful projects were 0.87 and 0.52, respectively; the F-measure for predicting unsuccessful projects was clearly lower than that for successful projects.
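The two F-measure formulas derived from Table 1 can be written down directly; a minimal sketch (the function name is ours, not the paper's):

```python
def f_measures(s, t, u, v):
    """F-measures for the successful and unsuccessful classes, given the
    confusion-matrix cells of Table 1: s and t are projects predicted
    successful whose correct result is successful/unsuccessful, and
    u and v are the corresponding cells for predicted unsuccessful."""
    f_successful = 2.0 * s / (2 * s + t + u)
    f_unsuccessful = 2.0 * v / (2 * v + t + u)
    return f_successful, f_unsuccessful
```

Note that t + u (the misclassified projects) penalizes both scores, while s and v each reward only their own class.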

4.2 Analysis of Experiment 1

To find the reason for the low accuracy in predicting unsuccessful projects, we analyzed the characteristics of dataset D1. The analysis showed that the following two factors appeared to affect the prediction accuracy:

• Incompleteness: The rate of missing values in data set D1 was approximately 44.6%, which was high.
• Imbalance: Successful projects constituted 78.5% of the project data, whereas unsuccessful projects comprised only 21.5%.

Regarding incompleteness, Kromrey et al. found that when there are many missing values in the project data, prediction accuracy drops significantly [12]. Regarding imbalance, Tsunoda et al. found that imbalanced data affect prediction accuracy [8]; therefore, we conjecture that reducing the imbalance may improve accuracy. Hence, we tried to improve the accuracy of predicting unsuccessful projects by reducing both incompleteness and imbalance.

5. IMPROVEMENT OF DATASET

5.1 Distribution of Missing Values

To reduce incompleteness, we first analyzed the distribution of the number of missing values in dataset D1. As mentioned earlier, the project data included 186 successful and 51 unsuccessful projects, and the rate of missing values in D1 was 44.6%. The result of the analysis is shown in Figure 3, which is a box plot of the number of missing values. From the figure, it can be observed that in successful projects the maximum number of missing values is 53, the minimum is two, the median is 41, and the average is 33.5, whereas in unsuccessful projects the maximum is 47, the minimum is two, the median is 18, and the average is 20.9. Thus, the number of missing values in successful projects tends to be larger than that in unsuccessful projects; for example, among the projects with 43 missing values, 26 are successful and four are unsuccessful. Therefore, if projects are deleted in descending order of the number of missing values, incompleteness and imbalance may be reduced simultaneously.

Figure 3: Distribution of the number of missing values

5.2 Removing Missing Values

To remove projects in descending order of the number of missing values, we introduce γ, an integer that indicates the upper limit of the number of missing values. By definition, γ takes values that occur as numbers of missing values in dataset D1. The algorithm for eliminating the projects that have a large number of missing values is as follows: for γi, where i = 1, 2, · · · , m and γ1 > γ2 > · · · > γm, remove from the given data the data in which the number of missing values is larger than γi.

Here, we explain how the method is applied using the example in Figure 2. In this example, γ assumes the values 4, 3, and 1. From the 10 projects, the projects in which the number of missing values is four are removed first; thus, the third, fifth, and ninth projects are removed. When we apply the method to dataset D1, γ assumes the values 53, 51, · · · , 3, and 2. First, the projects in which the number of missing values is 53 are deleted; thus, one project is removed. Next, the projects in which the number of missing values is 51 are removed; thus, two projects are deleted. The remaining projects are removed in a similar manner.

5.3 Improvement of Incompleteness

In this section, we evaluate the incompleteness of the resultant dataset. According to [12], when the rate of missing values is greater than 30%, prediction accuracy drops significantly; thus, in this experiment, we set a goal of a missing rate of 30%. Figure 4 shows the relationship between γ (horizontal axis) and the rate of missing values (vertical axis). From Figure 4, it is observed that as γ decreases, the rate of missing values decreases. Furthermore, when γ is 42, the rate of missing values is 27.6%, and when γ is 43, it is 33.9%. Thus, from the viewpoint of incompleteness, it is desirable that γ be at most 42.

Figure 4: Improvement of incompleteness

5.4 Improvement of Imbalance

As mentioned in Section 5.1, the imbalance is expected to be reduced by removing the projects with many missing values. In this section, we evaluate the imbalance of the resultant dataset. Figure 5 shows the relationship between γ (horizontal axis) and the rate of unsuccessful projects (vertical axis). When the rate of unsuccessful projects is 0.5 in Figure 5, the data is balanced. From Figure 5, it is observed that when γ is 6, the rate of unsuccessful projects is the closest to 0.5, namely 0.43. In this case, the total number of projects is 56, with 32 successful and 24 unsuccessful projects.

Figure 5: Improvement of imbalance
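The removal step of Section 5.2 amounts to dropping every project whose missing-value count exceeds the current γ. A minimal sketch, again representing a project as a list of metric values with None for a missing cell:

```python
def remove_incomplete(projects, gamma):
    """Keep only the projects whose number of missing (None) metric
    values is at most gamma; projects with more than gamma missing
    values are removed, as in the removal algorithm of Section 5.2."""
    def n_missing(row):
        return sum(value is None for value in row)
    return [row for row in projects if n_missing(row) <= gamma]
```

Applying this repeatedly with decreasing γ reproduces the sweep γ = 53, 51, · · · , 2 described above; because each step only removes rows, it is equivalent to a single pass with the final γ.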

6. EXPERIMENT 2

6.1 Overview of Experiment 2

On the basis of the method proposed in Section 5.2, we predict product quality using an improved dataset by a procedure that consists of improving the dataset for a certain γ value and then evaluating the prediction accuracy for that γ. For the improvement step, we applied the method of Section 5.2; for the evaluation step, we used 10-fold cross validation to calculate the F-measure. To ensure statistical validity, we repeated the 10-fold cross validation 100 times and used the average value. The procedure is defined as follows:

Phase 1 (Improvement for γ): Remove from the given data the data in which the number of missing values is larger than γi.

Phase 2 (Evaluation of accuracy): Calculate prediction accuracy by constructing the prediction model by applying the naive Bayesian classifier to the resultant dataset.

6.2 Analysis of Experiment 2

Figure 6 shows the result of the experiment. The horizontal axis indicates γ and the vertical axis indicates the F-measure; the "+" marks indicate the F-measure for predicting successful projects, and the "x" marks indicate that for predicting unsuccessful projects. From Figure 6, it is observed that when γ is 6, the F-measure for predicting unsuccessful projects is 0.88, the highest in this experiment. Furthermore, the F-measure for predicting successful projects is 0.92, which is higher than that calculated from the given data. When γ is 6, the rate of unsuccessful projects is 0.43, the number of projects is 56, and the rate of missing values is 5.1%. Regarding the F-measure for predicting successful projects, the highest value is 0.95, obtained when γ is 3; in contrast, the F-measure for predicting unsuccessful projects is then 0.83. When γ is 3, the rate of unsuccessful projects is 0.27, the number of projects is 30, and the rate of missing values is 3.9%. As mentioned in Section 4.2, our goal was to improve the accuracy of predicting unsuccessful projects; thus, we take the case where γ is 6 as the result of the improvement.

Figure 6: Relationship between γ and F-Measure

7. CONCLUSION

In this paper, we attempted to improve the prediction accuracy (F-measure) on the IPA/SEC dataset [7]. Initially, with 237 projects, the accuracy of predicting successful projects was 0.86 and that of predicting unsuccessful projects was 0.53. We then analyzed the characteristics of the IPA/SEC dataset. The analysis showed that incompleteness and imbalance may affect accuracy, and that when incompleteness is reduced, imbalance is also reduced. Therefore, we deleted the data in which the number of missing values was larger than γ, where γ is a certain integer value, and then calculated prediction accuracy by constructing the prediction model on the resultant dataset. The experimental evaluation showed that when γ was 6 and the total number of data was 56, the resultant data was the most balanced and the accuracy of predicting successful projects was 0.92. Furthermore, the accuracy of predicting unsuccessful projects was 0.88, which indicated a major improvement.

8. ACKNOWLEDGMENTS

This work was partially supported by a Grant-in-Aid for Scientific Research (C) (21500035), Japan, and a Grant-in-Aid for JSPS Fellows (21 3963), Japan.

9. REFERENCES

[1] M. Shepperd and C. Schofield. Estimating software project effort using analogies. IEEE Transactions on Software Engineering, vol. 23, pp. 736–743, 1997.
[2] Z. Chen, B. Boehm, T. Menzies, and D. Port. Finding the right data for software cost modeling. IEEE Software, vol. 23, pp. 38–46, 2005.
[3] M. Kläs, H. Nakao, F. Elberzhager, and J. Münch. Predicting defect content and quality assurance effectiveness by combining expert judgment and defect data - a case study. In Proceedings of the 19th International Symposium on Software Reliability Engineering, pp. 17–26, Washington, DC, USA, 2008. IEEE Computer Society.
[4] S. Abe, O. Mizuno, T. Kikuno, N. Kikuchi, and M. Hirayama. Estimation of project success using Bayesian classifier. In Proceedings of the 28th International Conference on Software Engineering (ICSE 2006), pp. 600–603, May 2006. Shanghai, China.
[5] S. Amasaki, Y. Hamano, O. Mizuno, and T. Kikuno. Characterization of runaway software projects using association rule mining. In Proceedings of the 7th International Conference on Product Focused Software Process Improvement (PROFES 2006), vol. LNCS 4034, pp. 402–407, June 2006. Amsterdam, The Netherlands.
[6] International Software Benchmarking Standards Group. ISBSG estimating, benchmarking and research suite release 11. http://www.isbsg.org/, 2009.
[7] Information-technology Promotion Agency Software Engineering Center. IPA/SEC White Paper 2008 on Software Development Projects in Japan. Nikkei Business Publications, Tokyo, Japan, http://www.ipa.go.jp/english/sec/reports/20100507a_2/20100507a_2_WP2008E.pdf, 2008.
[8] M. Tsunoda, A. Monden, J. Shibata, and K. Matsumoto. Empirical evaluation of cost overrun prediction with imbalance data. In Proceedings of the International Conference on Computer and Information Science (ICIS 2010), August 2010. Yamagata, Japan.
[9] K. Strike, K. E. Emam, and N. Madhavji. Software cost estimation with incomplete data. IEEE Transactions on Software Engineering, vol. 27, pp. 890–908, 2001.
[10] B. Twala, M. Cartwright, and M. Shepperd. Comparison of various methods for handling incomplete data in software engineering databases. In Proceedings of the International Symposium on Empirical Software Engineering, pp. 105–114, 2005.
[11] M. H. Cartwright, M. J. Shepperd, and Q. Song. Dealing with missing software project data. In Proceedings of the IEEE International Symposium on Software Metrics, p. 154, 2003.
[12] J. Kromrey and C. Hines. Nonrandomly missing data in multiple regression: An empirical comparison of common missing-data treatments. Educational and Psychological Measurement, vol. 54, no. 3, pp. 573–593, 1994.
[13] P. Domingos and M. Pazzani. On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, vol. 29, pp. 103–130, November 1997.

APPENDIX
A. DATA RELIABILITY

To delete unreliable projects, we referred to the metrics "Data reliability (IPA/SEC)" and "Data reliability (company)" in the white paper [7]. These metrics evaluate the reliability of the project data and take four values: "a: The project data is confirmed as reasonable and completely consistent," "b: The project data looks reasonable, but it has several factors that affect consistency of the data," "c: Consistency of the project data cannot be evaluated because critical data items are missing," and "d: The project data has one or more factors indicating that the data is unreliable." In this study, we deleted the projects in which the value of these metrics was "c" or "d."
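As a sketch, this reliability screening could look as follows. The dictionary keys are hypothetical (not the real metric names), and we read the rule above as dropping a project if either reliability metric is "c" or "d", so only projects graded "a" or "b" on both metrics survive.

```python
def drop_unreliable(projects):
    """Keep only the projects whose reliability grades, both IPA/SEC's
    and the company's, are "a" or "b". The keys "reliability_ipa" and
    "reliability_company" are illustrative placeholders."""
    return [p for p in projects
            if p.get("reliability_ipa") in ("a", "b")
            and p.get("reliability_company") in ("a", "b")]
```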

B. THE BAYESIAN CLASSIFIER METHOD

B.1 Bayes' Theorem

Bayes' theorem relates the conditional probabilities of events A and B, provided that the probability of B is not zero. In Bayes' theorem, P(A|B), the conditional probability of A given B, is represented as follows:

P(A|B) = P(A)P(B|A) / P(B)

In this expression, P(A), P(B), and P(B|A) are defined as below:

• P(A) is the prior probability of A.
• P(B) is the prior probability of B.
• P(B|A) is the conditional probability of B given A.

B.2 Bayesian Classifier

Let M1, M2, · · · , Mn be the variables denoting the observed attribute values used to predict a discrete class C, and let c represent a particular class. Given the values m1, m2, · · · , mn, we can use Bayes' theorem to calculate the probability P(C = c | M1 = m1 ∧ · · · ∧ Mn = mn) and then predict the most probable class. Under the naive independence assumption, this probability is expressed as follows:

P(C = c | M1 = m1 ∧ · · · ∧ Mn = mn) = ( Π_{i=1}^{n} P(Mi = mi | C = c) × P(C = c) ) / P(M1 = m1 ∧ · · · ∧ Mn = mn)
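To make the formula concrete, here is a small numeric sketch with two classes and two observed attributes; all of the probabilities below are invented for illustration and are not taken from the IPA/SEC data.

```python
from math import prod

def naive_bayes_posterior(prior, likelihoods, evidence):
    """P(C = c | M1 = m1 ∧ ... ∧ Mn = mn): the product of the per-attribute
    likelihoods P(Mi = mi | C = c), times the prior P(C = c), divided by
    the evidence P(M1 = m1 ∧ ... ∧ Mn = mn)."""
    return prod(likelihoods) * prior / evidence

# Invented numbers: class priors 0.75 / 0.25 and per-attribute likelihoods
# for one observed attribute vector (m1, m2).
num_success = prod([0.6, 0.5]) * 0.75   # numerator for the success class
num_failure = prod([0.2, 0.4]) * 0.25   # numerator for the failure class
evidence = num_success + num_failure    # the shared denominator

p_success = naive_bayes_posterior(0.75, [0.6, 0.5], evidence)
p_failure = naive_bayes_posterior(0.25, [0.2, 0.4], evidence)
```

The class with the larger posterior (here, success, at roughly 0.92) is the one predicted; note that the evidence term is the same for every class, so it only normalizes the scores.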