Improved Mining of Outliers in Distributed Large Data Sets ... - IJRIT


IJRIT International Journal of Research in Information Technology, Volume 2, Issue 5, May 2014, Pg: 321-327

International Journal of Research in Information Technology (IJRIT) www.ijrit.com

ISSN 2001-5569

Improved Mining of Outliers in Distributed Large Data Sets Using Parallel Data Mining

Ms.K.Deepika1, Mrs.N.Vijitha2, Ms.D.Thamaraiselvi3

1 PG Scholar, CSE Department, Computer Science and Engineering, Anna University, Chennai
2 Assistant professor, CSE Department.
3 PG Scholar, CSE Department, Computer Science and Engineering, Anna University, Chennai
Vivekanandha college of technology for women, Elayampalayam, Thiruchengode, Tamilnadu, India.

1 [email protected] 2 [email protected] 3 [email protected]

Abstract- This paper presents a distributed approach for detecting distance-based outliers in very large data sets. The proposed algorithm is based on the concept of outlier detection solving set, which is a small subset of the data set that can also be provably used for predicting novel outliers. The algorithm exploits parallel computation in order to achieve large time savings, and it meets two basic requirements: the reduction of the run time with respect to the centralized version and the ability to deal with distributed data sets. The proposed schema exhibits excellent performance. Here, outliers are objects that deviate from the correlation structure of the data. Because the data resides on distributed nodes, the approach avoids sending all data to a coordinator and increases safety without any performance degradation. The data sent from the local nodes to the supervisor node is reduced, at the cost of an increase in the data sent from the supervisor node to the local nodes. The number of additional communications corresponding to node requirement executions is quite low. Importantly, the solving set computed by this approach in a distributed environment has the same quality as that produced by the corresponding centralized method.

Index terms- Distance-based outliers, novel outliers, distributed data, parallel and distributed algorithms.

I. INTRODUCTION

In data mining, outlier detection has long been considered a major problem. Outliers can corrupt the data in a network. When data is sent from source to destination, it may be lost or delayed in the network because a particular node is busy. The data must wait until that node becomes free, which causes delay and traffic and also reduces the packet delivery ratio and throughput. To reduce delay and traffic we use a parallel algorithm: the data are distributed first, and the parallel algorithm is then applied. The distance between the source node and its parallel node is calculated using the distance-based outlier method [3]. Finally, outliers are detected in the network by the parallel algorithm, which increases the throughput of the network.

Today, the arguments for developing distributed data mining (DDM) algorithms are even stronger, as the tendency towards generating larger and inherently distributed data sets amplifies performance and communication inefficiencies [1], [2]. In fact, when applied to very large data sets, even scalable data mining algorithms may still require execution times that are excessive when compared to the stringent requirements of today’s applications [8].

Ms.K.Deepika, IJRIT


Parallel processing of mining tasks could dramatically reduce the effect of constant factors and decrease execution times. In mining data from distributed sources, the data set is divided into many local data sets generated at separate nodes of a network [4]. A widely adopted solution requires the transfer of all the data sets to a single storage and processing site, usually a data warehouse, prior to the application of a centralized algorithm at the site. The advantages of this solution are simplicity and feasibility with established technology. On the other hand, the transmission times of large data sets are of the same order of magnitude as the running times of scalable data mining algorithms executed on systems with high-performance secondary memory [14], [5], [11]. Parallel data mining is used for the computation of distance-based outliers [9], [12], [15]. More than a decade ago, it was recognized that such a design approach was too limited to deal effectively with the continuous increase in the size and complexity of real data sets and in the prevalence of distributed data sources [11]. The method exploits parallel computation in order to obtain vast time savings. In fact, beyond preserving the correctness of the result, the proposed schema exhibits excellent performance.

II. OUTLIER AND OUTLIER DETECTION

An outlier is an observation point that is distant from other observations, that is, a datum or object which is dissimilar from the remaining data. An outlier may be due to variability in the measurement or it may indicate experimental error; the latter are sometimes excluded from the data set.

A. TYPES OF OUTLIER

Point Outlier: An individual data instance is anomalous with respect to the rest of the data.
Contextual Outlier: An individual data instance is anomalous within a context. This requires a notion of context; such outliers are also referred to as conditional anomalies.
Collective Outlier: A collection of related data instances is anomalous. This requires a relationship among the data instances.

B. OUTLIER DETECTION

Outlier detection is the data mining task whose goal is to isolate the observations which are considerably dissimilar from the remaining data [7]. This task has practical applications in several domains such as intrusion detection, fraud detection, medical diagnosis, data cleaning and many others. Unsupervised approaches to outlier detection are able to discriminate each datum as normal or exceptional when no training examples are available.

C. DISTANCE-BASED METHOD

Distance-based methods distinguish an object as an outlier on the basis of the distances to its nearest neighbors [3]. An object can be associated with a weight or score, which is a function of the distances to its k nearest neighbors and quantifies the dissimilarity of the object from its neighbors. A top-n distance-based outlier in a data set is an object having weight not smaller than the n-th largest weight, where the weight of a data set object is computed as the sum of the distances from the object to its k nearest neighbors. Many prominent data mining algorithms have been designed on the assumption that data are centralized in a single memory hierarchy. Moreover, such algorithms are mostly designed to be executed by a single processor. Many data mining algorithms consider outliers as noise that must be eliminated because it degrades their predictive accuracy, as shown in the figure.
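The top-n distance-based outlier definition of the distance-based method can be illustrated with a minimal brute-force sketch in Python. It is only an illustration under stated assumptions, not the paper's algorithm: the function names `weight` and `top_n_outliers` are hypothetical, the objects are assumed to be numeric tuples, and the Euclidean distance is assumed.

```python
import heapq
import math


def weight(data, i, k):
    """Sum of the distances from object i to its k nearest neighbors."""
    distances = (math.dist(data[i], q) for j, q in enumerate(data) if j != i)
    return sum(heapq.nsmallest(k, distances))


def top_n_outliers(data, k, n):
    """Indices of the n objects with the largest weights (brute force)."""
    return heapq.nlargest(n, range(len(data)), key=lambda i: weight(data, i, k))


# Four points on a unit square plus one distant point (illustrative data).
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
print(top_n_outliers(points, k=2, n=1))
```

This sketch compares every object with every other object, so its cost grows quadratically with the size of the data set; that cost is what motivates the solving set and the distributed computation.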


Fig.1. Outlier Detection

For example, in classification algorithms mislabeled instances are considered outliers and thus are removed from the training set to improve the accuracy of the resulting classifier.

D. METHODOLOGY FOR OUTLIER DETECTION IN STATISTICAL APPROACH

The statistical approach to outlier detection assumes a distribution or probability model for the given data set and then identifies outliers with respect to the model using a discordance test [7]. In particular, an analysis using the statistical approach is based on five phases: data collection; computing the average value or the linear regression equation; computing the upper and lower control limits or the upper and lower bound values; data testing; and analysis and comparison of the output. One drawback of the statistical approach is that it requires knowledge about parameters of the data set, such as the data distribution. However, the data distribution may not be known in many cases.

III. EXISTING SYSTEM

In the existing system, a user can send data to the destination only through a particular node. If that node is busy sending other data, the new data must wait in a queue until the node becomes free, so there is a delay and a chance of traffic. Outliers may also be present at the coordinator of that node. The outliers are detected by distance-based outlier detection in the distributed environment in an iterative way.

[Fig.2. Existing System (flowchart): NODE -> NODE BUSY? YES: wait in queue (so traffic and delay occur); NO: send data to destination.]


A. SOLVINGSET ALGORITHM

The SolvingSet algorithm compares all data set objects with a selected small subset of the overall data set, called candidate objects, and stores their k nearest neighbors with respect to this subset. From these stored neighbors, an upper bound to the true weight of each data set object can thus be obtained. The objects having a weight upper bound lower than the n-th greatest weight associated with a candidate object are called non active, while the others are called active. Once an object becomes non active during the computation, it is no longer considered for insertion into the set of candidates, since it cannot be an outlier. As the algorithm processes new objects, more accurate weights are computed and the number of non active objects increases. The algorithm stops when no more objects have to be examined, that is, when the set of candidates becomes empty. The solving set is the union of the candidate sets computed during each iteration.
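The iteration described above can be sketched in Python as follows. This is a simplified, centralized illustration under stated assumptions, not the authors' implementation: the function name `solving_set`, the parameter `m` (number of candidates per iteration), the random choice of the first candidates, and the Euclidean distance on numeric tuples are all assumptions made for the example.

```python
import heapq
import math
import random


def solving_set(data, k, n, m, seed=0):
    """Return (top, solving): the n heaviest (weight, index) pairs and the
    indices of the solving set. Simplified sketch, hypothetical names."""
    rng = random.Random(seed)
    size = len(data)
    # For each object, the k smallest distances to the candidates seen so far.
    # Their sum is an upper bound on the object's true weight.
    nearest = [[] for _ in range(size)]
    solving = set()  # union of the candidate sets of all iterations
    top = []         # current n heaviest candidates as (weight, index)
    candidates = rng.sample(range(size), min(m, size))

    while candidates:
        solving.update(candidates)
        for c in candidates:
            dists = []
            for p in range(size):
                if p == c:
                    continue
                d = math.dist(data[c], data[p])
                dists.append(d)
                nearest[p] = heapq.nsmallest(k, nearest[p] + [d])
            # The candidate was compared with every object: its weight is exact.
            top.append((sum(heapq.nsmallest(k, dists)), c))
        top = heapq.nlargest(n, top)
        # n-th greatest candidate weight (0 while fewer than n candidates exist).
        min_out = top[-1][0] if len(top) == n else 0.0

        # Objects whose upper bound is below min_out become non active.
        active = []
        for p in range(size):
            if p in solving:
                continue
            bound = sum(nearest[p]) if len(nearest[p]) == k else math.inf
            if bound >= min_out:
                active.append((bound, p))
        # Next candidates: the active objects with the largest upper bounds.
        candidates = [p for _, p in heapq.nlargest(m, active)]

    return top, solving


# Illustrative data: 60 points in the unit square plus two distant points.
rng = random.Random(1)
sample = [(rng.random(), rng.random()) for _ in range(60)] + [(5.0, 5.0), (-4.0, 6.0)]
top, solving = solving_set(sample, k=3, n=2, m=5)
print(sorted(i for _, i in top), len(solving))
```

Choosing the active objects with the largest upper bounds as the next candidates is a design choice of this sketch; it tends to reach the true outliers quickly, so that most objects become non active after a few iterations and the solving set stays small.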

B. DISTRIBUTED SOLVING SET

The main work of the distributed solving set algorithm consists of the core computation, which is carried out simultaneously by all the other nodes, and the synchronization of the partial results returned by each node after completing its job.

C. LAZY DISTRIBUTED SOLVINGSET

The lazy variant reduces the amount of data transferred over the network and thereby achieves performance improvements.

IV. PROPOSED SYSTEM

In the proposed system, if a node is busy sending other data while data is being sent to the destination, a node parallel to that node is chosen, and this is repeated until a free node is found. In the system architecture of the proposed system, the large data sets are partitioned across several distributed data nodes. In the distributed environment, the data nodes are used for the outlier analysis of large data sets by parallel data mining. Outliers are detected on the basis of the distance-based outlier detection method. With this approach the overall execution time is reduced while the correctness of the results is preserved.

[Fig.3. System Architecture: Large data sets -> Distributed data nodes -> Parallel data mining -> Distance based outlier detection -> Outlier detection -> Performances and correct results]
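The division of work in the distributed solving set, where the nodes compute partial results in parallel and the results are then synchronized, can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the paper's protocol: the function names `local_nearest` and `candidate_weight` are hypothetical, each local data set is modelled as a list of distinct numeric tuples, and a thread pool stands in for the network of nodes.

```python
import heapq
import math
from concurrent.futures import ThreadPoolExecutor


def local_nearest(partition, candidate, k):
    """Local node job: the k smallest distances from the candidate to the
    objects stored at this node (the candidate itself is skipped)."""
    return heapq.nsmallest(
        k, (math.dist(candidate, p) for p in partition if p != candidate)
    )


def candidate_weight(partitions, candidate, k):
    """Supervisor job: collect the partial results of all nodes, merge them,
    and sum the k smallest distances to obtain the candidate's weight."""
    with ThreadPoolExecutor() as pool:
        partial = list(
            pool.map(lambda part: local_nearest(part, candidate, k), partitions)
        )
    return sum(heapq.nsmallest(k, (d for dists in partial for d in dists)))


# Three local data sets held by three nodes (illustrative data).
nodes = [[(0, 0), (1, 0)], [(0, 3), (10, 10)], [(0, 1)]]
print(candidate_weight(nodes, (0, 0), k=2))
```

Each node returns at most k distances per candidate instead of its whole local data set, which is why the data never has to be gathered at a single site.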


A. DISTRIBUTED DATA NODES

The tendency towards generating larger and inherently distributed data sets amplifies performance and communication inefficiencies. In fact, when applied to very large data sets, even scalable data mining algorithms may still require excessive execution times. In mining data from distributed sources, the data set is divided into many local data sets, generated at separate nodes of a network. The widely adopted solution requires the transfer of all the data sets to a single storage and processing site, usually a data warehouse, prior to the application of a centralized algorithm at the site.

B. PARALLEL DATA MINING

[Fig.4. Proposed System (flowchart): NODE -> NODE BUSY? YES: search for the nearest parallel node until a free node is found; NO: send data to destination.]

By using parallel data mining in the distributed environment, a non-centralized node search is implemented to improve performance. In parallel data mining, if a particular node is busy, the search moves to the nearest neighbouring node that is parallel to it and continues until a free node is found; the data is then sent to the destination. Parallel mining thus reduces the delay and traffic in the distributed environment.

The huge size of the available data sets and their high dimensionality make large-scale data mining applications computationally very demanding, to the extent that high-performance parallel computing is fast becoming a necessary component of the solution. Moreover, the quality of data mining results often depends directly on the amount of computing resources available. In fact, data mining applications are poised to become the dominant consumers of supercomputing in the near future. There is therefore a demand for effective parallel algorithms for various data mining techniques.

Data mining is the automated analysis of large volumes of data, looking for the interesting relationships and knowledge that are implicit in them. Parallel data mining concerns research and development work on the study and definition of parallel algorithms, methods and tools for the extraction of novel, useful and accurate patterns from data using high-performance architectures. Data mining tools implemented on high-performance parallel computers can analyze massive databases in a reasonable time. Faster processing also means that users can experiment with more models to understand complex data. High performance makes it practical for users to analyze greater quantities of data that, in turn, yield improved predictions. This implementation is portable to a large number of parallel architectures and it proves to be scalable in terms of speedup and scale-up.
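The node search of the proposed system (try the intended node, then its parallel neighbours, until a free node is found) can be sketched in Python as follows. The function name `find_free_node`, the node labels, and the assumption that the neighbours are already ordered nearest first are all hypothetical and serve only to illustrate the idea.

```python
def find_free_node(start, neighbours, is_busy):
    """Return `start` if it is free, otherwise the first free node among its
    parallel neighbours (assumed to be ordered nearest first), or None if
    every node is busy."""
    for node in [start, *neighbours]:
        if not is_busy(node):
            return node
    return None


# Illustrative state: nodes n1 and n2 are busy, n3 and n4 are free.
busy = {"n1", "n2"}
print(find_free_node("n1", ["n2", "n3", "n4"], lambda node: node in busy))
```

In contrast to the existing system, the data does not wait in a queue at the busy node; it is forwarded through the first free node found.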


C. OUTLIER DETECTION

The outlier detection task can be very time consuming, and recently there has been growing interest in parallel/distributed methods for outlier detection. The task has applications in several domains such as intrusion detection, medical diagnosis, data cleaning, fraud detection, and many others. Unsupervised approaches to outlier detection are able to separate each datum as normal or exceptional. Outlier mining can be used in telecom or credit card fraud detection to find atypical usage of telecom services or credit cards, in medical analysis (for example, to test abnormal reactions to new medical therapies), in pharmaceutical research, in financial applications, in weather prediction, and in marketing and customer segmentation to identify customers spending much more or much less than the average customer. The locality properties of the problem at hand are exploited to partition the computation among the processors of a multiprocessor system or the host nodes of a communication network in order to obtain vast time savings.

D. DISTANCE-BASED OUTLIERS

This section concerns the computation of distance-based outliers. The key point of this approach is to exploit the locality properties of the problem at hand to partition the computation among the processors of a multiprocessor system or the host nodes of a communication network. For the distance-based outlier detection task in the distributed scenario, the method computes an outlier detection solving set of the overall data set. It is worth noting that this is a unique peculiarity of the approach, since other distributed methods for outlier detection are not able to return a model of the data.

V. CONCLUSION

Starting from an algorithm founded on a compressed form of the data, a parallel version was derived by computing local distances and merging them at a coordinator site in an iterative way. The lazy version, which sends distances only when needed, shows the most promising performance. This schema could also be useful for the parallelized version of other kinds of algorithms. A coordinator can be avoided and safety increased without performance degradation. When distributed computing power is available, the good speedup guarantees an optimal exploitation of computing facilities and a better throughput.

REFERENCES

[1] F. Angiulli, S. Basta, S. Lodi, and C. Sartori, “Distributed Strategies for Mining Outliers in Large Data Sets,” IEEE Trans. Knowledge and Data Eng., vol. 25, no. 7, pp. 1520-1532, 2013.
[2] F. Angiulli, S. Basta, S. Lodi, and C. Sartori, “A Distributed Approach to Detect Outliers in very Large Data Sets,” Proc. 16th Int’l Euro-Par Conf. Parallel Processing (Euro-Par), pp. 329-340, 2010.
[3] F. Angiulli, S. Basta, and C. Pizzuti, “Distance-Based Detection and Prediction of Outliers,” IEEE Trans. Knowledge and Data Eng., vol. 18, no. 2, pp. 145-160, Feb. 2006.
[4] F. Angiulli and F. Fassetti, “Dolphin: An Efficient Algorithm for Mining Distance-Based Outliers in very Large Datasets,” ACM Trans. Knowledge Discovery from Data, vol. 3, no. 1, article 4, 2009.
[5] F. Angiulli and C. Pizzuti, “Outlier Mining in Large High-Dimensional Data Sets,” IEEE Trans. Knowledge and Data Eng., vol. 17, no. 2, pp. 203-215, Feb. 2005.
[6] A. Asuncion and D. Newman, UCI Machine Learning Repository, 2007.
[7] V. Chandola, A. Banerjee, and V. Kumar, “Anomaly Detection: A Survey,” ACM Computing Surveys, vol. 41, no. 3, pp. 15:1-15:58, 2009.
[8] A. Ghoting, S. Parthasarathy, and M.E. Otey, “Fast Mining of Distance-Based Outliers in High-Dimensional Datasets,” Data Mining Knowledge Discovery, vol. 16, no. 3, pp. 349-364, 2008.
[9] E. Hung and D.W. Cheung, “Parallel Mining of Outliers in Large Database,” Distributed and Parallel Databases, vol. 12, no. 1, pp. 5-26, 2002.
[10] E. Knorr and R. Ng, “Algorithms for Mining Distance-Based Outliers in Large Datasets,” Proc. 24th Int’l Conf. Very Large Data Bases (VLDB), pp. 392-403, 1998.
[11] A. Koufakou and M. Georgiopoulos, “A Fast Outlier Detection Strategy for Distributed High-Dimensional Data Sets with Mixed Attributes,” Data Mining Knowledge Discovery, vol. 20, pp. 259-289, 2009.
[12] E. Lozano and E. Acuña, “Parallel Algorithms for Distance-Based and Density-Based Outliers,” Proc. Fifth IEEE Int’l Conf. Data Mining (ICDM), pp. 729-732, 2005.


[13] M.E. Otey, A. Ghoting, and S. Parthasarathy, “Fast Distributed Outlier Detection in Mixed-Attribute Data Sets,” Data Mining Knowledge Discovery, vol. 12, nos. 2/3, pp. 203-228, 2006.
[14] S. Ramaswamy, R. Rastogi, and K. Shim, “Efficient Algorithms for Mining Outliers from Large Data Sets,” Proc. ACM SIGMOD Int’l Conf. Management of Data (SIGMOD), pp. 427-438, 2000.
[15] Large-Scale Parallel Data Mining, M.J. Zaki and C.-T. Ho, eds. Springer, 2000.
