IJRIT International Journal of Research in Information Technology, Volume 1, Issue 7, July 2014, Pg. 146-151

International Journal of Research in Information Technology (IJRIT)

www.ijrit.com

ISSN 2001-5569

Design and implementation of Interactive visualization of GHSOM clustering algorithm for text mining tasks

Martin Sarnovsky
Department of Cybernetics and Artificial Intelligence, Technical University Kosice
Kosice, Slovak Republic
[email protected]

Abstract

This paper focuses on text clustering, and in particular on the visualization of text clustering algorithms. We present the GHSOM (Growing Hierarchical Self-Organizing Map) algorithm, an extension of the standard text clustering approach based on self-organizing maps. The algorithm combines adaptability, achieved through map expansion, with hierarchical clustering, achieved through the creation of map layers. The paper focuses primarily on model visualization. We describe existing techniques for visualizing clustering models and propose a GHSOM model visualization that combines several of these methods. We designed and implemented the solution using the JBowl text mining library and the Processing visualization language, and integrated the component into a cloud-based text mining portal used for educational purposes.

Keywords: Clustering, Text mining, Self-organizing maps, Visualization.

1. Introduction

Knowledge discovery from text databases (knowledge discovery in texts, KDT, frequently labeled text mining) is more complex than classic knowledge discovery in databases. One reason is that the process has to deal with unstructured data and uncertainty [4]. Text clustering is the process of assigning text documents to a collection of clusters based on their similarity. Each cluster contains documents that are closely related to each other and as different as possible from the documents in other clusters. Because documents within one cluster are similar in content, text clustering can be viewed as a method for detecting the content of document collections and identifying possible topics. From the machine learning perspective, clustering methods are unsupervised learning approaches. Model training is usually automatic, without feedback from the user, which leaves the algorithm to find the patterns within the data. The discovered clusters are usually disjoint [1].

Various approaches to clustering exist, including hierarchical methods, “lazy learners” (similarity methods), and neural networks. One of the basic neural network models based on unsupervised learning is the self-organizing map (SOM) [2]. SOM neural networks are frequently used in text clustering tasks, and several extensions of the standard SOM exist. One worth mentioning is the Growing SOM (GSOM), which enables map expansion. The expansion proceeds in two directions: by adding new columns or new rows of neurons (clusters). The resulting output map preserves its structure. Another important SOM extension is based on the idea of hierarchical extension: a neuron (cluster) can be expanded into a new map with separate output clusters.
Neurons can be expanded on different levels, so the resulting structure is hierarchical. The neuron expansion condition is based on the variability of the input vectors (documents, in the case of text mining). The combination of both extensions leads to the GHSOM (Growing Hierarchical Self-Organizing Map) algorithm [3]. GHSOM builds a hierarchical structure consisting of multiple layers, where each layer consists of separate SOMs.
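The variability-based expansion condition can be illustrated with a minimal sketch. It assumes that variability is measured as the average distance between a neuron's weight vector and the vectors assigned to it, and that a neuron is expanded when this value exceeds a fraction of a reference error. The function names, the `tau2` parameter and the sample vectors are illustrative assumptions, not part of the JBowl implementation.

```python
import math

def mean_quantization_error(weight, vectors):
    """Average Euclidean distance between a weight vector and its assigned inputs."""
    if not vectors:
        return 0.0
    return sum(math.dist(weight, v) for v in vectors) / len(vectors)

def neurons_to_expand(neurons, reference_mqe, tau2=0.5):
    """Return ids of neurons whose assigned vectors are still too heterogeneous.

    `neurons` maps a neuron id to (weight_vector, assigned_vectors).
    `tau2` is a hypothetical threshold chosen only for this illustration.
    """
    return [
        neuron_id
        for neuron_id, (weight, vectors) in neurons.items()
        if mean_quantization_error(weight, vectors) > tau2 * reference_mqe
    ]

# Made-up sample data: one tight cluster and one spread-out cluster.
neurons = {
    "n0": ([0.0, 0.0], [[0.0, 0.1], [0.1, 0.0]]),
    "n1": ([1.0, 1.0], [[0.0, 1.0], [2.0, 1.0]]),
}
expanded = neurons_to_expand(neurons, reference_mqe=1.0)
print(expanded)
```

With these sample values only the spread-out neuron is selected, so it would become the root of a new map on the next layer.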

2. Design of the GHSOM algorithm visualization

Various visualization methods exist. Their purpose is not limited to presenting information about the models: some techniques visualize the clustering process itself, while others focus on the content and interpretation of the discovered knowledge. We briefly introduce some of the most frequently used techniques.

One commonly used method is the dendrogram. A dendrogram is a tree-like structure that serves as a graphic representation of the hierarchical clustering process. Depending on the type of hierarchical method, objects are either merged into clusters or, conversely, a cluster is divided into objects. Each object initially represents a separate cluster, and clusters are then connected according to the distances between them [4]. Dendrograms are a simple visual representation of the hierarchical aspects of clustering. For similarity-based methods such as k-means, 2D map visualization is commonly used. For multidimensional input data, various transformation methods project the data from the high-dimensional vector space into 2D or 3D.

Self-organizing maps do not assign the data to concrete pre-defined clusters and do not identify cluster borders. From that point of view, visualization of SOMs is a key factor in data analysis using these methods [5]. Some visualization techniques are based on the weight vectors of the trained map and use them to graphically display the clusters and their boundaries (e.g. the U-Matrix method [6]). Another group of methods focuses on visualizing the input data and their distribution in the vector space. Those methods usually visualize a particular group of vectors (e.g. Hit Histograms or Component Planes) [11].

Several approaches to GHSOM model visualization are possible, depending on the processed data.
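The idea behind weight-vector-based techniques such as the U-Matrix can be sketched as follows. This is a simplified variant that stores, for every neuron, the average distance to the weight vectors of its direct grid neighbours; high values suggest a cluster boundary. The grid layout and the sample weights are assumptions made for the illustration.

```python
import math

def u_matrix(weights):
    """weights[r][c] is the weight vector of the neuron at row r, column c."""
    rows, cols = len(weights), len(weights[0])
    result = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # Distances to the up, down, left and right neighbours that exist.
            distances = [
                math.dist(weights[r][c], weights[nr][nc])
                for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                if 0 <= nr < rows and 0 <= nc < cols
            ]
            result[r][c] = sum(distances) / len(distances) if distances else 0.0
    return result

# Made-up 1x3 map with one-dimensional weights.
sample_weights = [[[0.0], [1.0], [5.0]]]
heights = u_matrix(sample_weights)
print(heights)
```

In this sample the values grow towards the right, which indicates that the third neuron lies far from the other two in the input space.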
Because the presented visualization will be used in text mining tasks, we designed a model suited to that purpose. The designed visualization method uses a table layout in which each table cell represents one neuron (cluster) of the map. Each neuron is described by some of the most important characteristics of its cluster, e.g. the most significant terms (chosen according to the information gain criterion). Maps are connected together and form a hierarchical structure that follows the structure of the GHSOM model. The main objective of the interactive visualization was to enable the user to browse through the model structure, explore different map layers, and explore particular neurons and their content. Our visualization method is inspired by several existing methods: it combines aspects of table methods with dendrograms and extends the interface with task-specific information. With this kind of GHSOM model visualization, the user gains visual feedback on the model structure and content at all model levels.
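The combination of a table layout with a dendrogram-like hierarchy can be sketched with a small data model. The class names, fields and sample terms below are hypothetical and serve only to show how maps, neurons and descriptive terms relate to each other.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Neuron:
    neuron_id: str
    top_terms: List[str]
    child_map: Optional["SomMap"] = None  # set only when the neuron is expanded

@dataclass
class SomMap:
    map_id: str
    rows: int
    cols: int
    neurons: List[Neuron] = field(default_factory=list)

def outline(som_map, depth=0):
    """Return indented text lines for a map, recursing into expanded neurons."""
    lines = [f"{'  ' * depth}map {som_map.map_id} ({som_map.rows}x{som_map.cols})"]
    for neuron in som_map.neurons:
        terms = ", ".join(neuron.top_terms)
        lines.append(f"{'  ' * (depth + 1)}{neuron.neuron_id}: {terms}")
        if neuron.child_map is not None:
            lines.extend(outline(neuron.child_map, depth + 2))
    return lines

# Made-up two-layer model.
child = SomMap("m1", 1, 2, [Neuron("m1-0", ["oil", "crude"]),
                            Neuron("m1-1", ["gas", "energy"])])
root = SomMap("m0", 1, 2, [Neuron("m0-0", ["trade", "market"], child),
                           Neuron("m0-1", ["wheat", "grain"])])
print("\n".join(outline(root)))
```

Each printed line corresponds to one table cell or one map, and the indentation plays the role of the connections between layers.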


Fig. 1 Visualization of the GHSOM structure on the left and particular map on the right

3. Implementation of the GHSOM visualization method

The designed solution was implemented and integrated into the cloud text mining portal presented in [11]. The portal provides a coherent system that leverages analytical cloud services and offers a simple user interface as well as administration and monitoring interfaces. It consists of data modules covering data storage, data access and various pre-processing methods; services providing particular analytical methods; and an information system that manages the workflow of data analysis tasks and provides the necessary interfaces for users and administrators.

The GridGain framework was used as the cloud infrastructure. GridGain is a Java-based middleware for developing data processing applications in distributed environments [8]. It supports the development of scalable, data-intensive and high-performance distributed applications. The main benefit of GridGain is that it is independent of infrastructure and platform, which allows applications developed with the framework to be deployed on various types of cloud infrastructure. GridGain provides native support for the Java, Scala and Groovy programming languages.

We used the JBowl implementation of the GHSOM algorithm. JBowl (Java Bag-of-Words Library) is an open source Java library for text mining tasks. The distributed version of the GHSOM clustering algorithm presented in [12] was integrated into JBowl. The library is being developed as open source with the intention of providing an easily extensible, modular framework for pre-processing, indexing and further exploration of large text collections, as well as for the creation and evaluation of supervised and unsupervised text mining models. It contains methods for preprocessing, classification, clustering (including the GHSOM algorithm) and evaluation.
It provides a set of classes and interfaces that enable the integration of various classifiers and clustering algorithms.

The Processing language was used for visualization [9]. Processing is an open source programming language and development environment designed for creating graphical interfaces and visualizations. Its main feature is that it enables programmers to visualize data and design graphic elements efficiently. Visualizations created in Processing can be run within the development environment, in Java projects, and in HTML/JSP pages in tags. Integration into web pages requires the processing.js extension, whose main purpose is to translate the whole sketch into JavaScript.

The JBowl method creates the first-layer map of the GHSOM model, selects the neurons to expand, and assigns the vector instances corresponding to these neurons. The algorithm then builds the rest of the GHSOM structure. Each GSOM map is created in a distributed fashion using the GridGain framework. When the final model is constructed, an XML representation of the GHSOM model is created. The structure of the document was designed to provide all the data necessary to build the model visualization.

The visualization module is implemented in Processing.js, a Processing extension that combines the original language with JavaScript. The module contains two main methods, setup() and draw(). The first method initializes the lists of input maps, neurons and graphical objects, fills the list of maps, and computes and draws the initial map positions. The second runs in an infinite loop and draws the model and the control panel.

The interactive visualization consists of three components (Fig. 2):

• Main screen: contains the model visualization, showing the structure of the GHSOM model as well as a detailed visualization of the selected map.
• Secondary screen: contains extended information about the selected neuron/cluster, including the cluster position within the structure (connections), the MQE of the cluster's vectors, the size of the selected cluster, the number of assigned vectors, etc.
• Control panel: provides a set of tools for manipulating the structure, including zoom in, zoom out, pruning of the lower layers, resizing options and redesigning tools.
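The step from the XML model representation to the lists of maps and neurons used by the visualization can be sketched as follows. The paper does not give the XML schema, so the element and attribute names below (`ghsom`, `map`, `neuron`, `mqe`, `vectors`) are purely hypothetical; the sketch only shows how a nested model could be flattened into records with parent links.

```python
import xml.etree.ElementTree as ET

# Hypothetical model document; the real schema is not described in the paper.
SAMPLE = """
<ghsom>
  <map id="m0" rows="1" cols="2">
    <neuron id="m0-0" mqe="0.42" vectors="120">
      <map id="m1" rows="1" cols="1">
        <neuron id="m1-0" mqe="0.10" vectors="120"/>
      </map>
    </neuron>
    <neuron id="m0-1" mqe="0.08" vectors="35"/>
  </map>
</ghsom>
"""

def load_maps(xml_text):
    """Flatten the nested model into a list of map records with parent links."""
    root = ET.fromstring(xml_text.strip())
    maps = []

    def visit(map_el, parent_neuron, layer):
        record = {"id": map_el.get("id"), "layer": layer,
                  "parent": parent_neuron, "neurons": []}
        maps.append(record)
        for neuron_el in map_el.findall("neuron"):  # direct children only
            child_el = neuron_el.find("map")
            record["neurons"].append({
                "id": neuron_el.get("id"),
                "mqe": float(neuron_el.get("mqe")),
                "vectors": int(neuron_el.get("vectors")),
                "expanded": child_el is not None,
            })
            if child_el is not None:
                visit(child_el, neuron_el.get("id"), layer + 1)

    for top_el in root.findall("map"):
        visit(top_el, None, 1)
    return maps

model = load_maps(SAMPLE)
for m in model:
    print(m["id"], "layer", m["layer"], "parent", m["parent"])
```

A flat list with parent links is convenient for an initialization step such as setup(), because map positions can be computed layer by layer without walking the XML tree again.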

Fig. 2 GHSOM structure visualization with context visualization of the selected map

On the main screen, each circle represents a single map of the GHSOM model. Right-clicking a map displays additional information about it. The map detail consists of a table view of the neurons (clusters). Extended information is displayed in the information panel on the left. When a map is selected and displayed, its position within the complete structure is highlighted in the background. Each map consists of clusters, and each cluster is described by its ID, the mean quantization error (MQE) of the assigned vectors, and the number of vectors. The visualization also gives visual feedback on whether neurons are expanded. As depicted in Fig. 3, neurons with a green background in the table view are clusters that are not expanded further in the model. Red identifies neurons that are hierarchically expanded into another layer.
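The table view and its colour coding can be sketched as a small function that turns a list of neuron records into table cells. The record fields, the RGB values and the label format are assumptions made for this illustration and do not reproduce the Processing.js module.

```python
GREEN = (0, 170, 0)  # assumed colour for clusters that are not expanded further
RED = (200, 0, 0)    # assumed colour for neurons expanded into another layer

def cell_color(expanded):
    """Pick the background colour of a table cell from the expansion flag."""
    return RED if expanded else GREEN

def table_cells(neurons, cols):
    """Arrange neurons row by row and attach a label and background colour."""
    cells = []
    for index, neuron in enumerate(neurons):
        cells.append({
            "row": index // cols,
            "col": index % cols,
            "label": f"{neuron['id']} | MQE {neuron['mqe']:.2f} | {neuron['vectors']} vectors",
            "color": cell_color(neuron["expanded"]),
        })
    return cells

# Made-up neuron records for a map with two columns.
sample_neurons = [
    {"id": "n0", "mqe": 0.42, "vectors": 120, "expanded": True},
    {"id": "n1", "mqe": 0.08, "vectors": 35, "expanded": False},
    {"id": "n2", "mqe": 0.05, "vectors": 12, "expanded": False},
]
cells = table_cells(sample_neurons, cols=2)
for cell in cells:
    print(cell["row"], cell["col"], cell["label"], cell["color"])
```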


Fig. 3 Particular map detail

A more detailed view of a particular map, together with the extended text information, is depicted in Fig. 4. Here the visualization was tested on the Reuters-21578 dataset, and a second-level map is displayed. It consists of 6 clusters, none of which is expanded further within the current structure. Each cluster is described by 6 terms, the most significant within the cluster according to information gain. The extended text field on the left gives context information about the parent cluster, which lies on the first-layer map.
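Selecting the most significant terms by information gain can be sketched as follows. The sketch assumes that each document is reduced to a set of terms and that the gain of a term is computed for the binary split "document belongs to the cluster or not"; the paper does not specify these details, and the documents below are made up.

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy (in bits) of a distribution given as raw counts."""
    counts = list(counts)
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(docs, labels, term):
    """Gain of the feature 'term occurs in the document' with respect to labels."""
    with_term = [l for d, l in zip(docs, labels) if term in d]
    without_term = [l for d, l in zip(docs, labels) if term not in d]
    n = len(labels)
    return (entropy(Counter(labels).values())
            - len(with_term) / n * entropy(Counter(with_term).values())
            - len(without_term) / n * entropy(Counter(without_term).values()))

def top_terms(docs, assignments, cluster, k=6):
    """Rank the vocabulary by information gain for membership in `cluster`."""
    labels = [a == cluster for a in assignments]
    vocabulary = sorted(set().union(*docs))
    ranked = sorted(vocabulary,
                    key=lambda t: information_gain(docs, labels, t),
                    reverse=True)
    return ranked[:k]

# Made-up documents (as term sets) and cluster assignments.
docs = [{"oil", "price"}, {"oil", "crude"}, {"wheat", "grain"}, {"wheat", "price"}]
assignments = ["c0", "c0", "c1", "c1"]
best = top_terms(docs, assignments, "c0", k=2)
print(best)
```

Note that information gain also rewards terms whose absence characterizes a cluster, so a practical implementation would probably restrict the candidates to terms that actually occur in the cluster's documents.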

Fig. 4 Visualization of the clusters with context information on the left

4. Conclusions

The main objective of the work presented in this paper was to design and implement a visualization of the GHSOM algorithm and to integrate the solution into a text mining portal used in education. The designed interactive visualization method was implemented in the Processing language and integrated into the portal. The backend of the solution is provided by the JBowl text mining library and the GridGain framework. The visualization provides a set of tools for visualizing the structure of the created clustering models and gives the user visual feedback on the information contained in particular clusters and on their content. Specific text mining tasks are supported by methods that visualize different model characteristics and structure, and by an intuitive graphical interface that lets the user explore different parts of more complex models.

Acknowledgments

This work was supported by the Slovak VEGA Grant No. 1/1147/12.

References

[1] N. O. Andrews, E. A. Fox, “Recent developments in document clustering”, Technical Report TR-07-35, Department of Computer Science, Virginia Tech, 2007.
[2] T. Kohonen, “Self-organizing maps”, Springer-Verlag, Berlin, 1995.
[3] M. Dittenbach, A. Rauber, D. Merkl, “The Growing Hierarchical Self-Organizing Map”, in Proceedings of the International Joint Conference on Neural Networks, Como, Italy, 2000.
[4] J. Paralič, et al., “Dolovanie znalostí v textoch”, Košice: Equilibria, 2010, 182 p., ISBN 978-80-8928462-7.
[5] R. Mayer, T. Aziz, A. Rauber, “Visualising class distribution on self-organising maps”, in Proceedings of the International Conference on Artificial Neural Networks (ICANN’07), volume 4669 of LNCS, Springer, pp. 359-368.
[6] A. Ultsch, “U*-Matrix: A tool to visualize clusters in high dimensional data”, Department of Computer Science, University of Marburg, Technical Report Nr. 36:1-12.
[7] P. Bednar, P. Butka, “JBOWL - Java bag-of-words library”, in 5th PhD Student Conference, FEI TU Kosice, Slovakia, pp. 19-20, 2005.
[8] GridGain Systems, GridGain 3.0 White Paper [online], 2010. Available online: http://www.gridgain.com/media/gridgain_white_paper.pdf
[9] Processing framework, online: http://www.processing.org
[10] C. J. Kaufman, Rocky Mountain Research Lab., Boulder, CO, private communication, May 1995.
[11] M. Sarnovsky, P. Butka, J. Pocsova, “Cloud computing as a platform for distributed fuzzy FCA approach in data analysis”, 2012 IEEE 16th International Conference on Intelligent Engineering Systems (INES), 2012.
[12] M. Sarnovsky, Z. Ulbrik, “Cloud-based clustering of text documents using the GHSOM algorithm on the GridGain platform”, Applied Computational Intelligence and Informatics (SACI), 2013 IEEE 8th International Symposium on, pp. 309-313, 23-25 May 2013.
