Privacy Protected Knowledge Management in Services with Emphasis on Quality Data

Debapriyo Majumdar, Rose Catherine, Shajith Ikbal, Karthik Visweswariah
{debapriyo, rosecatherinek, shajmoha, v-karthik}@in.ibm.com
IBM Research, Bangalore, India

ABSTRACT
Improving the productivity of practitioners through effective knowledge management, and delivering high quality service in the Application Management Services (AMS) domain, are key focus areas for all IT services organizations. One source of historical knowledge in AMS is the large amount of resolved problem ticket data, which is often confidential and immensely valuable, but much of which is of very poor quality. In this paper we present a knowledge management tool that detects the quality of the information present in problem tickets and enables effective knowledge search over tickets by prioritizing quality data in the search ranking. The tool facilitates leveraging of knowledge across different AMS accounts, while preserving data privacy, by masking client confidential information. It also extracts several relevant entities from the noisy unstructured text entered in the tickets and presents them to the users. We present several experimental evaluations and a pilot study conducted with an AMS account, which show that our tool is effective and leads to substantial improvement in the productivity of the practitioners.

Categories and Subject Descriptors
H.3.3 [Information storage and retrieval]: Information search and retrieval

General Terms
Design, Experimentation, Security, Management

Keywords
Ticket Search, Measuring Quality, Sharing Knowledge

1. INTRODUCTION

Application Management Services (AMS) form a major chunk of the services business. In an AMS scenario, a services organization manages a set of applications for a client organization, with a dedicated set of practitioners, trained in the domain of the application, solving issues raised by clients about that application. These issues and their resolution details are entered into a system commonly referred to as a ticketing tool. Some of the popular enterprise ticketing tools are TSRM (1), SAP Solution Manager (2) and Remedy (3).

Delivering top quality service is of utmost importance to any services organization that wants to distinguish itself and excel in the field. This business problem has two dimensions: the first is to reduce the time taken to solve tickets, and the second is to determine the optimal team structure. Most ticketing tools do provide a search functionality over tickets, but it fails in a large fraction of cases because, most of the time, the resolution field of solved tickets is incomplete, badly written, or simply contains no technical details. For example, a resolution statement such as "Directly guided her through fixing the issue over a phone call" does not help a practitioner trying to solve a similar issue later. Another commonly reported issue when searching the ticket repository is that a significant fraction of tickets have noisy content in the resolution field, which leads to irrelevant tickets being ranked higher. This noise is usually due to copy-pasting the emails exchanged between the practitioner and the client in the process of solving the ticket; the final resolution steps are therefore cluttered with a lot of non-technical content.

Practitioners do search the web for solutions to tickets. Apart from using web search engines such as Google, Bing and Yahoo!, they also search domain-specific repositories and discussion forums; for example, SDN (4) is a discussion forum for SAP domain issues. However, public repositories on the web can help only in solving general problems, not account-specific ones. So, building a knowledge management tool for confidential data is very important in Services.

In this paper, we describe a knowledge management tool, hereafter referred to as the KM Tool, which enables top quality service delivery in the Application Management Services business. The key aspects of our tool are: (1) estimating the quality of resolution information in tickets, and incorporating that information in the search algorithm to improve the ranking so that the useful results appear at the top, and (2) leveraging knowledge across accounts, while preserving data privacy, by masking client confidential information. We also present the results of piloting the tool with an AMS account, which show a substantial improvement in the productivity and efficiency of the practitioners, demonstrating the effectiveness of our tool. We additionally present evaluations of a few selected sub-components to illustrate their usefulness in the overall tool.

The rest of this paper is organized as follows. Section 2 discusses prior research relevant to our work. In Section 3 we present our method for searching ticket data with emphasis on data quality, and in Section 4 we describe how we ensure privacy protection in knowledge sharing between accounts. In Section 5 we provide details on the architecture and components of our tool. The experiments and the pilot study conducted using our tool are discussed in Section 6. Finally, Section 7 presents our conclusions and directions for future work.

Footnotes:
(1) TSRM: www.ibm.com/software/tivoli/products/service-requestmgr
(2) SAP Solution Manager: http://www.sap.com/platform/netweaver/components/solutionmanager/index.epx
(3) Remedy: http://www.bmc.com/products/productlisting/53035210-143801-2527.html
(4) SDN: http://www.sdn.sap.com

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. CIKM'11, October 24–28, 2011, Glasgow, Scotland, UK. Copyright 2011 ACM 978-1-4503-0717-8/11/10 ...$10.00.

2. RELATED WORK

Knowledge management to improve the efficiency of service delivery has been addressed previously by Lenchner et al. [14], but in the domain of server management: they integrate different data sources and facilitate search using eSearch [6], but do not aim to de-prioritize useless results, possibly because the average quality of the data in that domain is good. Most ticket management tools also come with some knowledge management features. For example, Tivoli Service Request Manager (TSRM) offers search over the ticket database, but its ranking is based on exact text match and the number of hits. In a scenario where the majority of the data has non-informative content, such a system is unlikely to provide an effective search.

Classification constitutes an important component of knowledge management systems. Support Vector Machines (SVM) [12] and MaxEnt [16] are examples of commonly used classifiers for text data. A major challenge in classifying ticket data is dealing with noise. Past work on classifying noisy text includes [4, 24] for web data and [11] for call transcripts. Extracting useful information by classifying parts of a document was discussed in [23] and [17]. Detecting spam was studied in [3] and [25].

A key distinguishing aspect of our tool is that it facilitates leveraging of knowledge across different AMS accounts. This is achieved by masking confidential information in the data, to preserve privacy. Approaches to hide identities and preserve sensitive information were studied in [2, 8, 9, 13, 22], but in different contexts, such as data mining, data publishing, and speech data.

Automatically creating structure out of the heterogeneous text data contained in problem tickets has been studied previously: [23] focuses on classifying each sentence into a pre-defined information type, while [21] discusses methods to identify the discourse structure in ticket data. However, neither of them has an end-to-end system in which the usefulness of this structuring can be demonstrated. [15] proposed a system called TroubleMiner, which clusters tickets into a hierarchy such that frequently occurring problems are closer to the top, to help network operators in maintenance activities. [10] suggested that correlating an incoming ticket with Configuration Items (CIs) stored in a Configuration Management Database can help quicken the resolution of the ticket; there, the focus is on identifying the CIs, not on identifying resolution information from similar tickets. There is also considerable research in the area of trouble ticket routing [18, 19, 20], where the focus is to automatically decide on an expert or a resolving group that can resolve the ticket quickly, taking into account multiple other factors such as the number of unresolved tickets in the system, the tickets already assigned to that group, and service level agreements.

3. DATA QUALITY-AWARE SEARCH

The important aspects of our knowledge search tool that distinguish it from other similar systems are: (a) its ability to extract the useful parts of a ticket, and (b) incorporating this information to provide an effective ranking technique.

3.1 Detecting useful parts in tickets

A ticket contains sentences that provide useful technical information and sentences that do not. Observing the contents of different tickets, it becomes clear that sentences of the latter category can be characterized by a particular set of words and usage patterns. For example, salutations and email headers use sets of words that enable humans to identify that they do not give technical details on how to solve a problem. So, we categorize the sentences into two categories, namely Non-Technical (N) and Technical (T). Technical sentences can be of three major sub-categories: Problem Description (P) – sentences that specify the details of the issue being reported; Root Cause (R) – sentences that discuss the cause of the problem; and Solution (S) – sentences that describe how it was solved. We frame the detection of the useful parts of a ticket as a sentence-level classification problem. A classifier was trained on manually labeled data to classify each sentence of a ticket into one of the P, S, R and N categories. But as the experiments in Section 6.1 show, classification into the R and S categories achieved negligible accuracy, mostly owing to the small fraction of such examples in the training data and the ambiguity between the classes R and S. Hence, we simplified the approach and focused on the two-class classification problem: for each sentence, determine whether it has technical content (class T) or not (class N). The results of experiments conducted for T-N classification, discussed in Section 6.1, show that it is possible to obtain reasonable accuracy for this task.
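Our classifier is an SVM trained on labeled sentences (Section 6.1). As a rough illustration of why the N class is detectable at all from surface cues, a keyword-heuristic stand-in could look like the following sketch; the word lists and patterns here are illustrative assumptions, not the trained model:

```python
import re

# Illustrative patterns that typically mark non-technical sentences:
# salutations, sign-offs, and email headers (NOT the paper's trained SVM).
NON_TECH_PATTERNS = [
    re.compile(r"^\s*(hi|hello|dear)\b", re.IGNORECASE),              # salutations
    re.compile(r"\b(thanks|thank you|regards|best wishes)\b", re.IGNORECASE),
    re.compile(r"^\s*(from|to|cc|subject|sent)\s*:", re.IGNORECASE),  # email headers
]

def classify_sentence(sentence: str) -> str:
    """Return 'N' (non-technical) or 'T' (technical) for one sentence."""
    for pattern in NON_TECH_PATTERNS:
        if pattern.search(sentence):
            return "N"
    return "T"
```

In practice a trained classifier generalizes far better than such fixed lists, which is why the tool uses a statistical model over labeled data.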

3.2 Ranking with quality scores

The set of sentences of a ticket that contain technical information is then used to compute a quality score for that ticket. For each ticket t, we pre-compute the quality score as:

qual(t) = min(nT / c, 1) × w + (nT / n) × (1 − w)

where nT is the total number of words in sentences classified as T, n is the total number of words in the resolution description of t, and the constant c is a tuned parameter (set to 40 in our experiments) regarded as the ideal minimum number of words in sentences classified as T. Essentially, the first part of the formula becomes close to (or equal to) 1 if the number of words in the technical sentences of the resolution of a ticket is close to (respectively, equal to or more than) this ideal number, and the second part computes the fraction of the text length taken up by the technical part. The parameter w is tuned (set to 0.75) to compute the weighted average of the two parts for the overall quality score.

For our system, the keywords of a query are expected to describe the problem. So, we boost the weight of the problem description and problem details fields for the search performed by Lucene. We retrieve up to the top 1000 results, with scores, from the Lucene engine. However, since the problem statement of many of these tickets may match the query while the resolution statement is useless or incomplete, we re-rank them. For the final ranking, the similarity between a query q and a ticket t is computed as a weighted average, with equal weights, of the text-similarity score between q and t returned by the search engine and the pre-computed quality score of t.
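The quality score and the final ranking combination above can be sketched directly, with c = 40 and w = 0.75 as in our experiments; the assumption that the Lucene text-similarity score has been normalized to [0, 1] before averaging is ours:

```python
def quality_score(n_technical_words: int, n_total_words: int,
                  c: int = 40, w: float = 0.75) -> float:
    """qual(t) = min(nT/c, 1) * w + (nT/n) * (1 - w), as in Section 3.2."""
    if n_total_words == 0:
        return 0.0
    coverage = min(n_technical_words / c, 1.0)    # are there enough technical words?
    fraction = n_technical_words / n_total_words  # technical share of the text
    return coverage * w + fraction * (1 - w)

def final_score(text_similarity: float, qual: float) -> float:
    """Equal-weight average of the (normalized) text score and qual(t)."""
    return 0.5 * text_similarity + 0.5 * qual
```

For example, a resolution with 50 technical words out of 100 total scores 1.0 × 0.75 + 0.5 × 0.25 = 0.875.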

4. PRIVACY PROTECTED SEARCH

Our tool enables privacy protected technical knowledge sharing across different accounts. Protection of the client-specific confidential information in the ticket text is achieved through data sanitization, i.e., detection and masking of the instances of entities in the text that contain sensitive information or can disclose the identity of the customer.

4.1 Sanitization of client sensitive data

Entities that could potentially violate privacy, and thus should be masked in the ticket data, include company names, product names, locations, person names, telephone numbers, email addresses, and web addresses. Sanitization of these entities in our tool is achieved through a combination of dictionary look-up and regular expression matching. The above entities are divided into sub-entities, such as proper names and number strings. All possible variants of these sub-entities are specified in a dictionary, from which the complex entities are built, so that their instances in the ticket text can be detected through regular expression matching. The tool also provides an option to mask entities and patterns that are specific to an account and not covered by the general list above; account administrators can specify such patterns as simple regular expressions. Furthermore, the tool removes some account-specific jargon, to prevent practitioners belonging to a different account from speculating about the identity of the account of a sanitized ticket. For example, the date of creation of a ticket could be referred to as "Date into OCF MH" in the original ticket; in the sanitized version, this is simply referred to as "CREATED ON".

4.2 Knowledge sharing with customizable privacy settings

Using the masking techniques described in the previous section, the sanitizer module creates a sanitized version of each ticket. The sanitized version contains the same technical information as the original ticket, except for the confidential data, which is masked; this data is usually not important for specifying the resolution of a technical problem. In our system, the administrator for an account can set the data privacy setting for their tickets by choosing one of the following three options: (1) Strict – only users from the same account can view the data. (2) Knowledge sharing with sanitization – users from the same account search and view the data in its original form; other users are restricted to searching and viewing the data in sanitized form. This scenario is illustrated in Figure 1. (3) Unprotected – the data is shared with all users without sanitization.

Figure 1: Creation of sanitized data and access control in search

Among these, the second option is recommended, as it protects the sensitive information in the client's data while still enabling knowledge sharing across teams, leading to improvement in efficiency.
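A minimal sketch of the Section 4.1 masking pipeline follows, assuming a couple of illustrative regexes and a per-account dictionary; the tool's actual entity lists and patterns are far more extensive:

```python
import re

# Illustrative regexes for generic sensitive entities (not exhaustive).
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
PHONE_RE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")
URL_RE = re.compile(r"\bhttps?://\S+", re.IGNORECASE)

def sanitize(text: str, dictionary: dict) -> str:
    """Mask sensitive entities via regexes, then dictionary look-up.

    `dictionary` maps account-specific terms (e.g. client names) to mask
    tokens. Regexes run first so that, e.g., a company name embedded in an
    email address is already masked before the dictionary pass.
    """
    text = EMAIL_RE.sub("<EMAIL>", text)
    text = PHONE_RE.sub("<PHONE>", text)
    text = URL_RE.sub("<URL>", text)
    for term, mask in dictionary.items():
        text = re.sub(re.escape(term), mask, text, flags=re.IGNORECASE)
    return text
```

Sanitizing "Acme server down, mail jane@acme.com" with the dictionary {"acme": "&lt;COMPANY&gt;"} masks both the company name and the address.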

5. THE SYSTEM

In this section we describe the architecture of our system and details of how it is built. Figure 2 shows a high level architecture of the tool. The input data goes through a mapping and sanitization module and gets stored in a database, which also stores all user information, configuration details and domain specific semantics. The ticket data to be made available for search is stored in a Lucene index. The search engine, along with the other modules used online at query time, interacts with both the database and the index to support the various functionalities of the system.

Figure 2: System Architecture of the KM Tool

5.1 Importing data

The solved tickets that are input to the tool come from various ticketing tools. Before being passed to the backend pre-processing stages, all tickets are first converted to a standard format using pluggable adapter modules written specifically for each input source type. The tool supports two types of import mechanisms: (1) Offline import via Excel files: most ticketing tools support exporting ticket data into Excel files. Account administrators export ticket data into Excel files and use a UI to upload these files to our tool; the files are converted to the standard format by an Excel adapter module and then passed to the backend pre-processor. (2) Automated import from ticketing tools: in this scenario, our tool directly connects to the ticketing tool and pulls new solved tickets. In the current implementation, our tool can import data from Tivoli Service Request Manager (TSRM), a widely used ticket management product available in the market today. New ticket sources can easily be supported by plugging in appropriate adapter modules that convert the input source data into the standard format used by the tool.
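At its core, a pluggable adapter is a field mapping from the source schema to the tool's standard format. The sketch below illustrates the idea; the source and target field names are hypothetical, not the tool's actual schema:

```python
# Hypothetical field mapping for a TSRM-style export; real ticketing
# tools each need their own adapter with the actual field names.
TSRM_FIELD_MAP = {
    "TICKETID": "ticket_id",
    "DESCRIPTION": "problem_description",
    "RESOLUTION": "resolution",
    "REPORTDATE": "created_on",
}

def adapt_row(source_row: dict, field_map: dict = TSRM_FIELD_MAP) -> dict:
    """Convert one exported row into the tool's standard ticket format."""
    standard = {target: source_row.get(source, "")
                for source, target in field_map.items()}
    standard["source_system"] = "TSRM"  # provenance, useful downstream
    return standard
```

Supporting a new source then amounts to supplying a new field map (and, for richer sources, custom parsing logic) without touching the backend pre-processor.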

5.2 Batch processing in the backend

Once the tickets are converted to the standard format, they are passed through a processor module. The tasks of this module are to compute the quality score, as described in Section 3.2, and to derive structured fields by extracting domain specific entities and classifying tickets according to their issue type. These derived attributes make it possible to support efficient drill-down through facets in our tool for accurate retrieval.

Classification of issue types: the problems addressed in tickets can typically be categorized into a set of broader issue categories. Categorizing the tickets according to the problem being addressed, and assigning the issue category label as a structured attribute of the ticket, helps in grouping together tickets of similar categories within the tool. In our tool, we extract the issue category of a ticket using a regularized linear classifier with behavior similar to an SVM [26], trained on a training set manually marked with issue class labels.

Extraction of domain specific entities: ticket data typically includes information about the process area and the department of its origin. In addition, for problem tickets related to a specific product or module, the description includes mentions of the module, product, program, and version. We use manually written rule-based annotators to extract the above-mentioned entities. Using these entities as structured attributes facilitates efficient grouping and retrieval of the historical problem tickets related to a particular module, product, and version.

Sanitization: the processed tickets are then passed to the Sanitizer module, which creates a sanitized version of each input ticket. The sanitized version has the client confidential information masked out, as described in Section 4. The sanitized tickets, along with the original tickets, are then written to the backend database of the tool.
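The rule-based entity annotators above can be sketched as named regular expressions; the module and version patterns below are illustrative assumptions, since the real annotators are hand-written per domain:

```python
import re

# Illustrative annotators; real patterns are domain-specific and
# hand-written by experts, as described above.
PATTERNS = {
    "module": re.compile(r"\b[A-Z]{2,4}-[A-Z]{2,4}\b"),            # e.g. FI-GL
    "version": re.compile(r"\bversion\s+(\d+(?:\.\d+)*)", re.IGNORECASE),
}

def extract_entities(text: str) -> dict:
    """Return {entity_type: [matches]} for the configured annotators."""
    found = {}
    for entity, pattern in PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            found[entity] = matches
    return found
```

Each extracted entity becomes a structured attribute of the ticket, which the UI later exposes as a facet.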
Indexing: an Indexer module picks up the new tickets and adds them incrementally to an index. The present implementation of the tool uses Lucene [1], a commonly used open-source Java text search engine library, to create full-text indexes for efficient searching and retrieval.

5.3 The user interface

The main user interface of our tool, as used by the practitioners, is shown in Figure 3. The user enters her search query in the search box, which also supports interactive query suggestion using the method presented in [5]. The search engine ranks the results and shows, on the right side, the relevance of each result t to the query q along three dimensions: the text match (the text similarity score between t and q), the quality score of t, and the overall score, as described in Section 3.2. The facets on the left side of the page help the user zoom into the results using structured fields. Some of these facets are present in the tickets as structured fields, while others are derived. The tool also supports synonym based query expansion. A dictionary of synonyms specific to the domain

Class   #examples   κ      Precision   Recall   F-measure
P       240         0.73   0.84        0.84     0.84
S       33          0.63   0.5         0.16     0.25
R       24          0.85   0.42        0.18     0.25
N       477         0.63   0.85        0.92     0.88

Table 1: Label distribution in PRSN and classification accuracy using SVM

is provided by experts and loaded into the tool. The synonyms are shown as suggestions that the user can choose to add to her list of keywords. Apart from the main search interface, there are several role-based user interfaces for other system-related features, such as managing users, tracking usage, and uploading and managing ticket data.
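Synonym-based query expansion, where the expert-supplied dictionary only produces suggestions for the user to accept, might be sketched as follows; the dictionary entries are made up for illustration:

```python
# Hypothetical domain synonym dictionary; in the tool, domain experts
# supply it and it is loaded at startup.
SYNONYMS = {
    "crash": ["abend", "dump"],
    "login": ["logon", "sign-in"],
}

def suggest_synonyms(query: str, synonyms: dict = SYNONYMS) -> list:
    """Collect synonym suggestions for the keywords of a query.

    The user chooses which suggestions to add; the tool does not
    expand the query silently.
    """
    suggestions = []
    for keyword in query.lower().split():
        suggestions.extend(synonyms.get(keyword, []))
    return suggestions
```

Keeping expansion opt-in avoids hurting precision when a domain synonym does not apply to the user's intent.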

6. EXPERIMENTS AND ANALYSIS

We conducted several experiments to assess the effectiveness of our data quality detection and search ranking. We also conducted a controlled user study to measure the benefits of using our tool. Since the data used for these experiments contains client confidential information from several clients, details about the data are provided only at a macroscopic level. The number of tickets present in our system at the time of these experiments was about 200,000.

6.1 Detecting useful parts of tickets

For extracting the useful parts of a ticket as detailed in Section 3.1, we consider each sentence in the ticket to be one of P, R, S or N. Around 800 sentences from 220 tickets were manually tagged by a volunteer; this will henceforth be referred to as the PRSN dataset, whose distribution is given in Table 1. As can be seen from the table, more than 60% of the sentences were tagged as non-technical, which motivates the need for detecting the useful parts of a ticket and separating them from the rest. From the above 800 sentences, 200 were randomly chosen and tagged by a second volunteer to calculate the inter-annotator agreement. The Kappa statistic for these two manual taggings is also given in Table 1. Cohen's Kappa coefficient [7] is a statistical measure of inter-annotator agreement: the lower the Kappa value, the lower the chances of being able to train a statistical classifier with good classification accuracy. Table 1 gives the accuracy measures obtained using an SVM. Other algorithms such as Decision Tree and Naive Bayes were also used for classification, but the accuracy numbers remained similar to or lower than those of the SVM. From the table, it is clear that the accuracy values for R and S are very low, owing to the small number of examples available in the training set. Hence, in the next experiment, the P, R and S labels were merged into their parent class T, as discussed in Section 3.1. This merging gives the TN dataset. Classification accuracies on TN, at label level and as weighted averages, are given in Table 2. The individual class accuracies improve significantly, and this is the classifier used in the ranking experiments described in the next section.
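Cohen's Kappa, used above to measure inter-annotator agreement, corrects observed agreement for the agreement expected by chance and can be computed as:

```python
def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """Cohen's kappa for two annotators labeling the same items."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: sum over classes of the product of marginals.
    classes = set(labels_a) | set(labels_b)
    chance = sum((labels_a.count(c) / n) * (labels_b.count(c) / n)
                 for c in classes)
    return (observed - chance) / (1 - chance)
```

For example, two annotators agreeing on 3 of 4 T/N labels with balanced vs. skewed marginals can yield κ = 0.5, well below the raw 0.75 agreement.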

6.2 Ranking of tickets

To evaluate the effectiveness of our ranking algorithm, based on the combined score of text match and resolution quality, we performed a manual evaluation of our system with some domain experts. With the help of the domain experts we created 15 search queries and let the experts judge each of the top 10 results as relevant or non-relevant in two versions of the system: one that ranked the results using only content match (TXT), and another that ranked them using a combination of content match and resolution quality score (TXT+RES). We computed the precision at the top 5 and top 10 results, and the normalized discounted cumulative gain (NDCG) at the top 5 and top 10 results, for both methods. As we see in Table 3, our TXT+RES method improves over the baseline TXT method by 50% or more in all measures. Although our evaluation set of 15 queries is rather small (we could not perform such a thorough evaluation on a larger set of queries due to the unavailability of experts), the improvement is substantial and consistent, indicating that our method improves the effectiveness of the system considerably.

Figure 3: Screenshot of KM Tool Search UI (confidential data masked out)

Label          Precision   Recall   F-measure
T              0.825       0.747    0.784
N              0.851       0.901    0.876
Weighted Avg   0.841       0.842    0.841

Table 2: Classification Accuracy on TN using SVM

Type of ticket                      A     B     C
Baseline turnaround time (hours)    40    3.7   1
Improvement on average              29%   24%   Not significant
Number of tickets per month         28    5     205

Table 4: Benefit measurement of our tool

           prec@5   prec@10   ndcg@5   ndcg@10
TXT        0.41     0.37      0.40     0.38
TXT+RES    0.63     0.56      0.62     0.57

Table 3: Effectiveness of the ranking
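The two evaluation measures, precision@k and NDCG@k over binary relevance judgments, can be computed as:

```python
import math

def precision_at_k(relevances: list, k: int) -> float:
    """Fraction of the top-k results judged relevant (binary 0/1 labels)."""
    return sum(relevances[:k]) / k

def ndcg_at_k(relevances: list, k: int) -> float:
    """NDCG@k: DCG of the ranking divided by the DCG of the ideal ranking."""
    def dcg(rels):
        # Standard log2 position discount: position i (0-based) gets 1/log2(i+2).
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels))
    ideal = sorted(relevances, reverse=True)
    idcg = dcg(ideal[:k])
    return dcg(relevances[:k]) / idcg if idcg > 0 else 0.0
```

With 15 queries, each reported number in Table 3 is the mean of these per-query values over the query set.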

6.3 User study and business benefit

In this section, we present the results of user studies conducted with three accounts. The first two evaluations were guided user studies in which our team met with a team of members engaged in solving problem tickets for an account, picked n tickets randomly from the tickets open at that time, and performed searches in our tool together with the practitioners. For each query, if any of the first four results had adequate information about how the open problem at hand could be solved, that search was considered a success; otherwise, it was marked as a failure. In the first of these studies, our tool reported 4 successes out of 6 trials; in the second study, 6 out of 10 tickets could be solved by leveraging knowledge from the top results found by our tool. The number of trials we could perform was rather small, as each one took considerable time owing to the complexity of the tickets in this domain.

A more systematic and longer pilot was performed with a team of 30 practitioners in an account over a duration of two months. Before starting the pilot, their baseline performance was measured for three types of problem tickets of varying degrees of difficulty, in terms of the average turnaround time for each type. The same measurement was taken during the period the team used our tool. As given in Table 4, the results of this pilot show that, for tickets that generally take a long time, our tool helped in reducing the turnaround time significantly. For the easier type of problem tickets, the turnaround time did not reduce from the original one hour; we expect that those tickets were not complicated enough to benefit from previous knowledge. The account also reported that the tool was particularly helpful for new practitioners.

7. CONCLUSION AND FUTURE WORK

In this paper, we presented several real challenges of knowledge management in the application management industry and presented an effective system to overcome those challenges. We presented a method for automatically detecting the quality of resolution information in ticket data, and showed through our experiments that using the quality score in ranking tickets improves the effectiveness of knowledge search significantly, in fact turning an unusable system into an effective one. Our system includes several advanced features, such as faceted search, ticket classification, and entity detection, to derive structure out of unstructured data and present it as facets while searching tickets. Finally, we showed through a user study and a pilot that our system increases the productivity of its users significantly.

Most of the novel components of our system were built while working with the users and understanding their requirements, and some of the features provide scope for improvement and directions for further research. Systematic evaluation and improvement of our method for detecting the quality of resolution information in ticket data is a research direction in its own right. It can also be investigated whether such methods apply to other types of data consisting of problems and solutions, such as technical forums and mailing lists, and whether quality detection enhances retrieval in those domains. The classification of such very noisy ticket data into categories, ideally with minimal training, is another important research problem of great interest to the industry for managing risks and resources.

8. REFERENCES

[1] Apache Lucene. http://lucene.apache.org, 2009.
[2] C. C. Agarwal and P. S. Yu. Privacy-preserving data mining: Models and algorithms. Springer, 2008.
[3] I. Androutsopoulos, J. Koutsias, K. V. Chandrinos, and C. D. Spyropoulos. An experimental comparison of naive Bayesian and keyword-based anti-spam filtering with personal e-mail messages. In Proc. SIGIR, 2000.
[4] T. Baldwin, D. Martinez, and R. B. Penman. Automatic thread classification for Linux user forum information access. In Proc. of the 12th Australasian Document Computing Symposium, 2007.
[5] S. Bhatia, D. Majumdar, and P. Mitra. Query suggestions in the absence of query logs. In SIGIR, pages 795–804. ACM, 2011.
[6] D. Carmel, M. Shtalhaim, and A. Soffer. eResponder: Electronic question responder. In CoopIS, 2000.
[7] J. Cohen. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 1960.
[8] T. Dalenius. Finding a needle in a haystack - or identifying anonymous census records. Journal of Official Statistics, 1986.
[9] B. C. M. Fung, K. Wang, R. Chen, and P. S. Yu. Privacy-preserving data publishing: A survey of recent developments. ACM Computing Surveys, June 2010.
[10] R. Gupta, K. Hima Prasad, and M. Mohania. Information integration techniques to automate incident management. In IEEE Network Operations and Management Symposium, 2008.
[11] P. Haffner, G. Tur, and J. Wright. Optimizing SVMs for complex call classification. In Proc. of ICASSP, 2003.
[12] T. Joachims. Making large-scale support vector machine learning practical. In Advances in Kernel Methods: Support Vector Machines. MIT Press, 1998.
[13] B. L., B.-S. K., P. A., and A. H. The gene trustee: A universal identification system that ensures privacy and confidentiality for human genetic databases. Journal of Law and Medicine, 10, 2003.
[14] J. Lenchner, D. Rosu, N. F. Velasquez, S. Guo, K. Christiance, D. DeFelice, P. M. Deshpande, K. Kummamuru, N. Kraus, L. Z. Luan, D. Majumdar, M. McLaughlin, S. Ofek-Koifman, D. P, C. Perng, H. Roitman, C. Ward, and J. Young. A service delivery platform for server management services. IBM Journal of Research and Development, 53(6), 2009.
[15] A. Medem, M.-I. Akodjenou, and R. Teixeira. TroubleMiner: Mining network trouble tickets. In IFIP/IEEE International Symposium on Integrated Network Management – Workshops, 2009.
[16] K. Nigam, J. Lafferty, and A. McCallum. Using maximum entropy for text classification. In Proc. IJCAI-99 Workshop on Machine Learning for Information Filtering, 1999.
[17] P. Raghavan, R. Catherine, S. Ikbal, N. Kambhatla, and D. Majumdar. Extracting problem and resolution information from online discussion forums. In Proc. of the International Conference on Management of Data (COMAD'10), 2010.
[18] Q. Shao, Y. Chen, S. Tao, X. Yan, and N. Anerousis. EasyTicket: A ticket routing recommendation engine for enterprise problem resolution. In Proc. VLDB, 2008.
[19] Q. Shao, Y. Chen, S. Tao, X. Yan, and N. Anerousis. Efficient ticket routing by resolution sequence mining. In Proc. KDD, 2008.
[20] P. Sun, S. Tao, X. Yan, N. Anerousis, and Y. Chen. Content-aware resolution sequence mining for ticket routing. In Business Process Management, volume 6336 of Lecture Notes in Computer Science. Springer Berlin / Heidelberg, 2010.
[21] S. Symonenko, S. Rowe, and E. D. Liddy. Illuminating trouble tickets with sublanguage theory. In Proc. HLT/NAACL, Companion Volume: Short Papers, NAACL-Short '06, 2006.
[22] M. Tang, D. Hakkani-Tur, and G. Tur. Preserving privacy in spoken language databases. In Proc. of the International Workshop on Privacy and Security Issues in Data Mining, ECML/PKDD, 2004.
[23] X. Wei, A. Sailer, R. Mahindru, and G. Kar. Automatic structuring of IT problem ticket data for enhanced problem resolution. In 10th IFIP/IEEE International Symposium on Integrated Network Management, 2007.
[24] L. Yi, B. Liu, and X. Li. Eliminating noisy information in web pages for data mining. In Proc. ACM SIGKDD, 2003.
[25] L. Zhang, J. Zhu, and T. Yao. An evaluation of statistical spam filtering techniques. ACM Transactions on Asian Language Information Processing (TALIP), 3, December 2004.
[26] T. Zhang. On the dual formulation of regularized linear systems. Machine Learning, 46, 2002.
