Utility of Real-time Decision-making in Commercial Data Stream Mining Domains Clifton Phua, Vincent C S Lee, and Kate Smith-Miles

commercial data is streaming and domain-specific. Decisions must be almost instant: data flows continuously and very quickly, and usually arrives from multiple smaller sub-streams. Decisions must also fit operational requirements: for example, whether the application is fraud detection or database marketing, and whether decision-making is centralised or de-centralised. This paper describes three reasons to measure the utility of real-time decision-making in commercial data stream mining domains:

Abstract— The objective is to measure the utility of real-time commercial decision-making. This is important due to the higher possibility of mistakes in real-time decisions, problems with recording actual occurrences, and the significant costs associated with predictions produced by algorithms. The first contribution is to use overall utility and to represent individual utility with a monetary value instead of a prediction. The second is to calculate the benefit from predictions using a utility-based decision threshold. The third is to incorporate the cost of predictions. For the experiments, overall utility is used to evaluate communal and spike detection, and their adaptive versions. The overall utility results show that with fewer alerts, communal detection is better than spike detection. With more alerts, adaptive communal and spike detection are better than their static versions. To maximise overall utility with all algorithms, only the highest 1% to 4% of predictions should be alerts.

Reason 1 – Real-time decisions can be wrong: Although real-time decisions allow immediate responses to possible positive class examples, there is also an increased risk of wrong decisions due to the lack of additional information. To understand the consequences of real-time decisions, overall utility and individual utility should be practically represented with a monetary value. Question 1 – How to calculate the overall utility of all decisions and the individual utility of each decision?

Index Terms— costs and benefits, measurement, real-time decision-making, utility

I. INTRODUCTION

Reason 2 – Class-labels give feedback but are not problem-free: Predictions refer to the probability of occurrence. Model-like classification algorithms (for example, naïve Bayes, decision trees, and support vector machines) produce predictions which are normally distributed and range between zero and one. However, exceptions apply. Anomaly-like algorithms can produce predictions which are heavily skewed to the right (for example, the majority of the predictions are zero, very few have high predictions, and high predictions can exceed one). Class-labels refer to actual occurrence, but usually only the positive examples are manually labeled because they are the class of interest. In the case of binary classes, they are represented by either zero or one, where the former belongs to the negative class and the latter is the positive class-label. In theory, class-labels are useful because they allow a retrospective look at the quality of decisions, made on the basis of predictions, which have already taken place. In practice, class-labels can be untimely, incomplete, and unreliable. To explain further, there are time delays in labeling examples as positive because it takes time for the event to be revealed. This provides a window of possible change in the nature of positive class-labels which is not available to real-time decision-making. In addition, some positive examples are possibly unlabelled because they were overlooked, and labels are prone to unintentional human error.

“I go with my gut feeling [on decisions] when I have acquired information somewhere in the range of 40 to 70 percent.” - Colin Powell, 1995

FROM a commercial point-of-view, the company seeks to maximise its utility (or profit) by increasing benefit (or revenue) and decreasing cost (or expenses). At the same time, the company’s data has a reasonable percentage of positive examples annually, has well-defined patterns to find, and its false alarm costs are low [1]. In this scenario, data mining services allow the company to achieve a sustainable competitive advantage over its competitors. This means making decisions on data or examples immediately when they are received. Yet, good real-time decisions are hard to make because

Manuscript received December 1, 2007. The first author was previously supported by the ARC under Linkage Grant Number LP0454077 and by DEST under an Endeavour Research Fellowship. Clifton Phua is currently with the Institute for Infocomm Research (I2R), Singapore (e-mail: [email protected]). Vincent Lee is with the Clayton School of Information Technology, Monash University, Victoria, Australia (e-mail: [email protected]). Kate Smith-Miles is with the School of Engineering and Information Technology, Deakin University, Victoria, Australia (e-mail: [email protected]).

978-1-4244-1672-1/08/$20.00 © IEEE 2008.


using non-utility or utility measures, and contrasts them with this work. The calculations of the utility of real-time decisions, the benefit from predictions and class-labels, and the cost of predictions are presented in Sections 3, 4, and 5 respectively. Section 6 provides a simple utility illustration and Section 7 gives the experimental design. Section 8 lists and explains the results. Section 9 concludes the paper.

Without taking costs into consideration, the following figure shows the traditional relationship between the decision threshold, the imbalanced class distribution, and the alert threshold:

II. PREVIOUS WORK


The next figure illustrates two groups of performance measures: non-utility and utility measures, and highlights the contribution of individual utility to existing utility measures:


Fig. 1. Traditional decision-making on predictions.


In Figure 1, the traditional decision threshold is a cutoff point between zero and one. If predictions are higher than this cutoff point, then alerts are raised and actions (or responses) will be carried out on them. Actions can take the forms of investigate/nothing, reject/accept, and contact/avoid for fraud detection and database marketing. Imbalanced class distribution is common in commercial data: the proportion of rare but valuable positive class-labels to negative ones. The alert threshold is the maximum number of predictions the company can carry out actions on. It is due to the company’s limited resources and unwillingness to cause negative public relations [2]. The traditional decision threshold can be set by finding a balance between the presence of an imbalanced class distribution (how easy is it to find the positive class?) and the limited number of alerts (how many resources are available to find the positive class?). Also, the imbalanced class distribution cannot be controlled by the company, but the alert threshold can be varied. This paper argues for a utility-based decision threshold – that the decision threshold should be a monetary value. Question 2 – How to calculate the benefit from predictions and class-labels using a utility-based decision threshold?
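The traditional scheme described above can be sketched as follows; the function name and the scores are illustrative assumptions, not from the paper:

```python
# A sketch of traditional decision-making on predictions: raise alerts on
# the highest-scoring examples above a cutoff, capped by the alert
# threshold (the company's resource limit). Names and numbers are
# illustrative, not from the paper.

def traditional_alerts(predictions, decision_threshold, max_alerts):
    """Return indices of examples to act on: predictions above the cutoff,
    limited to the top max_alerts by score."""
    candidates = [i for i, p in enumerate(predictions) if p > decision_threshold]
    candidates.sort(key=lambda i: predictions[i], reverse=True)
    return candidates[:max_alerts]

scores = [0.1, 0.95, 0.4, 0.8, 0.99, 0.2]
print(traditional_alerts(scores, decision_threshold=0.5, max_alerts=2))  # [4, 1]
```

Here three predictions exceed the cutoff, but the alert threshold limits actions to the two highest.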


Fig. 2. Non-utility and utility measures.

In Figure 2, non-utility measures (in the light grey box) use only predictions and class-labels. In addition, utility measures (in the medium grey box) also incorporate other constraints, benefits, and costs (if they are available). In previous work, all measures use the predictions from trained classifiers on a test set. These retrospective measures return an overall measure or utility at the end of a time period, to compare against other algorithms and the baseline. In addition to overall utility, this paper proposes a utility on each prediction (in the dark grey box), which is better than solely using predictions in real-time decision-making for data streams.

Reason 3 – Predictions can be in real-time but they are not cheap: In the academic literature, real-time predictions are a simple and elegant way to make real-time decisions. They avoid the need to specify costs, as these are typically confidential information. However, in the real world, most costs to produce predictions are known. Some costs, such as data and processing costs, are substantial enough to be accounted for in real-time decision-making. Question 3 – How to calculate the cost of predictions?

Section 2 highlights the developments of related work in



A. Non-Utility Measures

Receiver Operating Characteristic (ROC) is a plot of the true positive rate versus the false positive rate, used to analyse performance under all possible thresholds. However, only the front part of the ROC curve is relevant, as alerts are raised only on the highest predictions. The Area Under the ROC Curve (AUC) measures how many times instances have to be swapped with their neighbours when sorting the data by predicted scores. Another alternative is to measure the percentage of positives in the top few percent of ranked predictions. There are other relevant non-utility measures, such as precision-recall curves and F-measure curves. The most effective way to assess classifiers is to use one metric each from the threshold, ordering, and probability metrics, such as the average of mean squared error, accuracy, and AUC [3]. Different non-utility measures are subjective and can lead to different conclusions. Therefore, as a rule of thumb, at least a few non-utility measures should be used.
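Two of the non-utility measures above can be sketched minimally; the pairwise formulation of AUC used here is a standard equivalent of the rank-based description, and the scores and labels are made up for illustration:

```python
# Minimal sketches of two non-utility measures: AUC in its standard
# pairwise form, and the percentage of positives in the top few percent
# of ranked predictions. Scores and labels are illustrative only.

def auc(scores, labels):
    """Probability that a random positive outranks a random negative
    (ties count as half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def positives_in_top(scores, labels, fraction):
    """Fraction of positives among the top `fraction` of ranked predictions."""
    k = max(1, int(len(scores) * fraction))
    ranked = sorted(zip(scores, labels), reverse=True)[:k]
    return sum(y for _, y in ranked) / k

scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.15, 0.1, 0.05, 0.01]
labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
print(round(auc(scores, labels), 3))          # 0.952
print(positives_in_top(scores, labels, 0.2))  # 1.0
```

The second measure reflects the front-of-the-ROC-curve focus: with alerts limited to the top 20% of predictions, only the ranking of the highest scores matters.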

Equation (3) introduces our individual utility for each new example’s prediction to make a real-time decision:

U(vi) = [u(vi) + U / (i * k)] * P(vi)    (3)

where U(vi) is the utility of the ith prediction (the most recent example). u(vi) is the utility attribute value unique to the example: a negative value for fraud detection, meaning potential loss, and a positive value for database marketing, meaning potential gain. i is the total number of examples and k is the imbalanced class distribution; their product gives the approximate number of positive class-labels (some class-labels are not yet known). P(vi) is the prediction of the example. Here, the decision threshold is no longer a cutoff point between zero and one but a monetary value.
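Equation (3) can be checked against the second-year numbers in the Section VI illustration; the function name below is ours, not the paper's:

```python
# A sketch of equation (3): individual utility of the i-th prediction,
# checked against the paper's Section VI illustration. Function name ours.

def individual_utility(u_vi, overall_u, i, k, p_vi):
    """U(vi) = [u(vi) + U / (i * k)] * P(vi)."""
    return (u_vi + overall_u / (i * k)) * p_vi

# 1.5M-th example: u(vi) = -$9,500, U = $1.82M, i = 1.5M, k = 0.01, P(vi) = 0.95.
u = individual_utility(-9_500, 1.82e6, 1.5e6, 0.01, 0.95)
print(round(u))  # -8910, close to the paper's -$8,909; below -$8,640, so investigate
```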

IV. BENEFIT FROM PREDICTIONS

The following table contains the confusion matrix for binary class data:

TABLE I
CONFUSION MATRIX

          Positive   Negative
Alert     tp         fp
No alert  fn         tn

tp, fp, fn, and tn stand for the total number of true positives, false positives, false negatives, and true negatives respectively.

III. UTILITY OF REAL-TIME DECISIONS

Equation (1) is the basic utility equation, which originated from [9]:

U = B − C1 − C2    (1)

where U is the overall utility or monetary value of decisions, and the goal is to maximise it. B is the benefit or detection capability from predictions, which should be maximised (for details, see Section IV). C1 is the cost of data and C2 is the cost of processing, which should be minimised (for details, see Section V). U will be lower in the short term due to the fixed costs in C2.

Equation (4) defines this paper’s utility-based decision threshold. It is derived from the utility attribute value distribution (how risky or rewarding is it to find each positive class-label?) and the percentage of alerts in all predictions (how many resources are available to find the positive class?):

T = (|umax| − |umin|) * [1 − (tp + fp) / i]    (4)
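Equation (4) can be checked with the Section VI illustration's numbers (utility attribute bounds of −$1,000 and −$10,000, and 40,000 alerts out of 1M examples); the function name is ours:

```python
# A sketch of equation (4), the utility-based decision threshold, checked
# against the Section VI illustration. Function name ours.

def utility_threshold(u_max, u_min, alerts, total):
    """T = (|u_max| - |u_min|) * [1 - (tp + fp) / i]."""
    return (abs(u_max) - abs(u_min)) * (1 - alerts / total)

T = utility_threshold(-1_000, -10_000, 40_000, 1_000_000)
print(round(T))  # -8640, the -$8,640 threshold of the illustration
```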

Equation (2) represents this paper’s re-evaluation of the overall utility of all decisions at regular intervals, to avoid overestimation:


U = U / (1 + d)^x    (2)

where U is a more realistic overall utility. d is a simplified discount factor that accounts for hidden costs such as inflation (to represent money in present-value terms), depreciation (for hardware, which loses value over time), and interest (for borrowed money). In practice, an accurate d can be difficult to determine, but it can be estimated. x is the number of intervals.

B. Utility Measures

Simple but explicit cost [4] or simple benefit measures [5] can be defined. In cost-sensitive learning [6], a monetary value is also placed on predictions with class-labels to maximise benefit and minimise cost. More complex utility measures can derive ideas from recent work. One can minimise the overall cost (if the misclassification ratio is known) or maximise the amount of fraud detected given the maximum number of actions affordable (if the maximum value for actions is known) [2]. The value of an example (its cost attribute) can be used for quantification and cost quantification: classifiers are trained from labelled data to return class distributions and total cost estimates on the classes of unlabelled test data [7]. Utility measures can also include other cost and benefit aspects such as loan amount and interest rate [8].


where T is the utility-based decision threshold. umax is the highest utility attribute value and umin is the lowest. The use of absolute values in |umax|-|umin| is to represent negative utility


for fraud detection and positive utility for database marketing.

For the fraud detection domain, the utility attribute value of possible losses is assumed to be normally distributed and in the range of:

−$10,000 ≤ u(vi) ≤ −$1,000

Equation (5) is our approach to derive the benefit from predictions:

B = [tp * u(vi)] − [(tp + fp) * ca]    (5)

where B is the classification benefit over doing nothing and should be maximised. Classification benefit is the benefit of the true positives discovered minus the cost of the alert resources spent to discover them. ca is the average cost of one alert. Misclassification cost is the cost of false negatives plus the cost of false positives, and should be minimised. From a commercial and utility point-of-view, classification benefit is more intuitive than misclassification cost for justifying the use of the data mining algorithm.

The number of alerts is limited by the total cost of alerts permitted by the company and the average cost of one alert:

tp + fp = $2M / $50 = 40,000

Applying equation (4), the utility-based decision threshold to find lower utility is:

T = ($1,000 − $10,000) * [1 − (40,000 / 1M)] = −$8,640

At the end of the first year, it is assumed that with the utility-based decision threshold, the best algorithm successfully rejected 10% of all examples and 2.56% are in all alerts:

V. COST OF PREDICTIONS

Equation (6) is our approach to derive the cost of data:

C1 = [i * ce] + [i * k * cp]    (6)
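Equation (6) can be checked against the Section VI illustration's figures (i = 1M examples, k = 0.01, average costs ce = $1 per prediction and cp = $5 per positive class-label); the function name is ours:

```python
# A sketch of equation (6), the variable cost of data, checked against
# the Section VI illustration. Function name ours.

def cost_of_data(i, k, c_e, c_p):
    """C1 = [i * c_e] + [i * k * c_p]."""
    return i * c_e + i * k * c_p

C1 = cost_of_data(1_000_000, 0.01, 1, 5)
print(round(C1))  # 1050000, the $1.05M in the illustration
```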

TABLE II
SIMPLE ILLUSTRATION OF UTILITY-BASED CONFUSION MATRIX

where i is the total number of examples and k is the imbalanced class distribution, as also used in equation (3). C1 is a variable cost. ce is the average cost of purchasing one prediction. cp is the average cost of purchasing or obtaining one positive class-label. However, exceptions apply: sometimes it can be B1, a variable benefit, so that be and bp are the average benefits paid to process each prediction. C2 is the cost of processing and is mainly a fixed cost. It can consist of:
• Cost of research and development of new customised prototype systems, which requires technical expertise, domain knowledge, and a testing phase
• Cost of equipment such as hardware, software, and rental of office space
• Cost of maintenance in the form of labour, electricity, and upgrades to the system
• Cost of miscellaneous expenses like patenting, power and data loss, and hardware failure and software bugs

          Positive   Negative
Alert     1,000      39,000
No alert  9,000      951,000

Applying equations (1), (5), and (6), the overall utility at the end of the first year is:

U = B − C1 − C2
  = {[1,000 * $5,500] − [39,000 * $50]} − {[1M * $1] + [1M * 0.01 * $5]} − {$0.5M}
  = $3.55M − $1.05M − $0.5M
  = $2M

Applying equation (2), a more realistic overall utility is:

U = $2M / 1.1 = $1.82M
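The first-year illustration can be reproduced end-to-end by combining equations (5), (6), (1), and (2); the variable names are ours, while the dollar amounts are the illustration's:

```python
# Reproducing the first-year illustration: benefit, costs, overall
# utility, and the discounted (more realistic) overall utility.
# Variable names are ours; amounts are from the worked example.

tp, fp = 1_000, 39_000
B = tp * 5_500 - fp * 50            # benefit as in the worked example: $3.55M
C1 = 1_000_000 * 1 + 10_000 * 5     # equation (6), with i*k = 10,000 positives: $1.05M
C2 = 500_000                        # fixed processing cost: $0.5M
U = B - C1 - C2                     # equation (1): $2M
d, x = 0.1, 1
U_realistic = U / (1 + d) ** x      # equation (2): about $1.82M
print(U, round(U_realistic))        # 2000000 1818182
```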

VI. A SIMPLE UTILITY ILLUSTRATION

This section demonstrates our utility concepts. In our hypothetical fraud detection situation, the idea is to investigate examples whose individual utility is below the decision threshold. At the start of the first year, the company expects to receive at least a certain number of examples with a certain imbalanced class distribution:

i ≈ 1M, k ≈ 0.01

Applying equation (3) in the second year, the individual utilities of the 1.5Mth and 2Mth examples are:

U(v1.5M) = [−$9,500 + $1.82M / (1.5M * 0.01)] * 0.95 = −$8,909 < −$8,640


U(v2M) = [−$3,000 + $1.82M / (2M * 0.01)] * 2.5 = −$7,273 > −$8,640
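The decision rule that the illustration applies (investigate when individual utility falls below the utility-based threshold, otherwise accept) can be sketched as follows; the function name is ours:

```python
# A sketch of the real-time decision rule: investigate when equation (3)'s
# individual utility falls below the utility-based threshold. Function
# name ours; numbers from the illustration.

def decide(u_vi, overall_u, i, k, p_vi, threshold):
    utility = (u_vi + overall_u / (i * k)) * p_vi  # equation (3)
    return "investigate" if utility < threshold else "accept"

T = -8_640
print(decide(-9_500, 1.82e6, 1.5e6, 0.01, 0.95, T))  # investigate
print(decide(-3_000, 1.82e6, 2.0e6, 0.01, 2.5, T))   # accept
```

Note that the second prediction, 2.5, exceeds one, as anomaly-like algorithms can produce.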

VII. EXPERIMENTAL DESIGN

The purpose is to examine overall utility under various percentages of alerts in all predictions, for all the different algorithms. For confidentiality reasons, the cost and benefit figures used in the following experiments cannot be listed. Overall utility is used to evaluate communal detection, spike detection, and their adaptive versions. With their new experiment numbers, six representative sets of predictions and their class-labels are compared to each other. They consist of one static communal detection [10] and two adaptive communal versions [11], and one static spike detection and two adaptive spike versions [12].

TABLE III
COMPARISON OF PREVIOUS EXPERIMENTS WITH THE OVERALL UTILITY MEASURE

Experiment  Brief Description
i1          Static Communal Detection
i2          Adaptive Communal Detection – Modify only one parameter according to input size and output suspiciousness
i3          Adaptive Communal Detection – Randomly modify one parameter from a set of multiple parameters
i4          Static Spike Detection
i5          Adaptive Spike Detection – Measure weights and rank all attributes monthly without the class-label attribute
i6          Adaptive Spike Detection – Filter extreme ranked weights and then choose top two attributes monthly for suspicion score


For confidentiality reasons, the y-axis is overall utility and the actual numbers have been removed. The x-axis is the percentage of alerts in all predictions, ranging from 0% to 10%. In Figure 3, the overall utility of each algorithm differs greatly from the others and changes significantly as the number of alerts increases. With no alerts, there is a positive utility; it comes from having B1 instead of C1 in equation (6). With fewer alerts, communal detection (experiments i1 to i3) is more economically feasible than spike detection. With more alerts, adaptive spike detection (i5 and i6) and adaptive communal detection (i2 and i3) yield much higher utility than static communal detection (i1). Static spike detection (i4) has the lowest utility. To maximise overall utility with all algorithms, only the highest 1% to 4% of predictions should be alerts.

IX. CONCLUSION

It is important to have a new definition of utility for real-time decision-making, due to decision mistakes, problems with recording actual occurrences, and prediction costs. This work introduces individual utility for each prediction, which builds upon the existing non-utility and recent utility literature. Utility for real-time decisions is defined along with its benefits and costs. A simple utility illustration of fraud detection is given to better explain the utility concepts. For the experiments, overall utility is used to evaluate communal and spike detection and their adaptive versions. The overall utility results show that with fewer alerts, communal detection is better than spike detection; with more alerts, adaptive communal and spike detection are better than their static versions. To maximise overall utility with all algorithms, only the highest 1% to 4% of predictions should be alerts. In other words, the utility results demonstrate that communal detection, adaptiveness, and low alerts are essential to fraud detection.

VIII. UTILITY RESULTS AND DISCUSSION

This paper proposes a utility (monetary value) on each prediction (a score between zero and one), which is better than solely using predictions in real-time decision-making for data streams. Figure 3 compares the overall utility results (each overall utility result is made up of individual utilities) of the previous static/adaptive communal and spike detection experiments, and highlights the optimum number of alerts required to maximise utility:


Fig. 3. Overall utility results of previous experiments (x-axis: percentage of alerts in all examples, from 0% to 10%; y-axis: overall utility, for experiments i1 to i6).





Thus, applying the utility-based decision threshold, the 1.5Mth example should be rejected and the 2Mth example should be accepted.


ACKNOWLEDGMENT

This paper benefited from comments by Peter Christen, Warwick Graco, Van Munin Chhieng, Hyoungjoo Lee, and Pilsung Kang.

REFERENCES

[1] B. Schneier, “How To Not Catch Terrorists,” Forbes, 2007.
[2] D. Hand, C. Whitrow, N. Adams, P. Juszczak, and D. Weston, “Performance Criteria for Plastic Card Fraud Detection Tools,” Journal of the Operational Research Society, 2007.
[3] R. Caruana and A. Niculescu-Mizil, “Data Mining in Metric Space: An Empirical Analysis of Supervised Learning Performance Criteria,” in Proc. of SIGKDD04, 2004.
[4] C. Phua, D. Alahakoon, and V. Lee, “Minority Report in Fraud Detection: Classification of Skewed Data,” ACM SIGKDD Explorations, 2004.
[5] W. Fan, “Systematic Data Selection to Mine Concept-Drifting Data Streams,” in Proc. of SIGKDD04, 2004.
[6] C. Elkan, “The Foundations of Cost-Sensitive Learning,” in Proc. of IJCAI01, 2001.
[7] G. Forman, “Quantifying Trends Accurately Despite Classifier Error and Class Imbalance,” in Proc. of SIGKDD06, 2006.
[8] N. Chawla and X. Li, “Price Based Framework for Benefit Scoring,” in Proc. of SIGKDD06 UBDM Workshop, 2006.
[9] G. Weiss and Y. Tian, “Maximizing Classifier Utility when Training Data is Costly,” in Proc. of SIGKDD06 UBDM Workshop, 2006.
[10] C. Phua, R. Gayler, K. Smith-Miles, and V. Lee, “Communal Detection of Implicit Personal Identity Streams,” in Proc. of ICDM06 Workshop on Mining Evolving and Streaming Data, 2006.
[11] C. Phua, V. Lee, K. Smith-Miles, and R. Gayler, “Adaptive Communal Detection in Search of Adversarial Identity Crime,” in Proc. of SIGKDD07 Workshop on Domain-Driven Data Mining, 2007.
[12] C. Phua, K. Smith-Miles, V. Lee, and R. Gayler, “Adaptive Spike Detection for Resilient Data Stream Mining,” in Proc. of AusDM07, 2007.

Clifton Phua is currently a Research Fellow at the Institute for Infocomm Research (I2R). He has a Bachelor of Business Systems (Honours) and a PhD from the Clayton School of Information Technology, Monash University. His research interests are data mining-based detection, security, data stream mining, and anomaly detection.

Vincent C S Lee is an Associate Professor in the Clayton School of Information Technology, Monash University. He has had 15 years of tertiary teaching at three other universities in Australia and Singapore, and has supervised research masters and PhD students from inception to completion. His current research includes complex (interdisciplinary, multi-paradigm) adaptive systems (including artificial immune systems) for operational and financial risk management, and data and text mining for business intelligence.

Kate Smith-Miles is a Professor and Head of the School of Engineering and Information Technology at Deakin University in Australia. Prior to joining Deakin University in 2006, she was a Professor in the Faculty of Information Technology at Monash University, Australia, where she held a variety of leadership roles over the previous ten years, including Deputy Head and Director of Research for the School of Business Systems, and co-Director of the Monash Data Mining Centre. Her research interests include intelligent systems, data mining, machine learning, and meta-learning.

