Particle Swarm Optimization for Clustering Short-Text Corpora

Diego A. Ingaramo, Marcelo L. Errecalde, Leticia C. Cagnina
{daingara, merreca, lcagnina}@unsl.edu.ar
LIDIC, Universidad Nacional de San Luis, Argentina

Paolo Rosso
[email protected]
NLE Lab, DSIC, Universidad Politécnica de Valencia, Spain

Introduction 

What is document clustering? Finding groups of documents such that the documents in a group are similar (or related) to one another and different from (or unrelated to) the documents in other groups.

What problem are we working on?

Main goal: to develop effective algorithms for clustering short-text corpora.

These algorithms assign documents to unknown categories in an unsupervised way.

Our interest is in clustering:
- short texts (in general)
- narrow-domain short texts (in particular).

Why is it important?

Applicability in different areas of text processing:
- Text mining (question answering, etc.)
- Summarization
- Information retrieval

The tendency of people to use "small languages":
- Blogs
- Text messages
- Snippets

Why is this problem difficult?

General problems of text clustering:
- Synonymy
- Polysemy

Additional difficulties due to:
- Low frequencies of the document terms.
- High degree of overlap between their vocabularies.

These aspects can negatively affect the estimation of how similar the documents are and, in consequence, the whole clustering process.

What questions are we trying to answer in our work?
- Are bio-inspired algorithms suitable approaches for clustering short-text corpora? Which are the most effective?
- Could we consider clustering as an optimization problem?
- Is it possible to use unsupervised measures of cluster validity as functions to be optimized?

Clustering process: an overview 

A more detailed look at the document clustering process…

Clustering as Optimization 

Document clustering is the assignment of documents to unknown categories.



In real document clustering problems, the results cannot be evaluated with typical external measures such as F-Measure, because the correct categorization is not available.

For that reason, the quality of the resulting groups is evaluated with internal clustering validity measures, such as Global Silhouette (GS) and the Expected Density Measure (ρ).



GS and ρ were selected as optimization functions because they give a reasonable estimation of the quality of the groups obtained.

Clustering as Optimization 

Global Silhouette (GS): the average of the cluster silhouettes of all clusters. The cluster silhouette of a cluster C is the average of the silhouette coefficients s(i) of all objects i in C, which are obtained with:

$$ s(i) = \frac{b(i) - a(i)}{\max(a(i),\, b(i))} $$

where a(i) is the average dissimilarity of object i to the remaining objects in its cluster, and b(i) is the average dissimilarity of object i to all objects in the nearest cluster.
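As an illustration, the Global Silhouette can be sketched as follows. This is a minimal sketch, not the authors' implementation: it assumes a precomputed symmetric dissimilarity matrix with a zero diagonal, at least two clusters, and the common convention that objects in singleton clusters get s(i) = 0. The function name `global_silhouette` is hypothetical.

```python
import numpy as np

def global_silhouette(dist, labels):
    """Average of the per-cluster mean silhouette coefficients.

    dist   : (n, n) symmetric dissimilarity matrix with zero diagonal (assumed)
    labels : length-n sequence of cluster ids (at least two distinct ids)
    """
    dist = np.asarray(dist, dtype=float)
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")
    s = np.zeros(len(labels))
    for i in range(len(labels)):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:
            continue  # assumed convention: s(i) = 0 for singleton clusters
        # a(i): mean dissimilarity to the remaining objects of i's cluster
        a = dist[i, own].sum() / (n_own - 1)
        # b(i): mean dissimilarity to the objects of the nearest other cluster
        b = min(dist[i, labels == c].mean() for c in clusters if c != labels[i])
        denom = max(a, b)
        s[i] = (b - a) / denom if denom > 0 else 0.0
    # cluster silhouette = mean s(i) within a cluster; GS = mean over clusters
    return float(np.mean([s[labels == c].mean() for c in clusters]))
```

Values close to 1 indicate compact, well-separated groups, which is why a search procedure can try to maximize this quantity directly.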



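Expected Density Measure (ρ): for a clustering C = {C_1, …, C_k} of a weighted graph G = ⟨V, E, w⟩, where G_i = ⟨V_i, E_i, w_i⟩ is the subgraph induced by cluster C_i, the expected density of C is

$$ \rho(C) = \sum_{i=1}^{k} \frac{|V_i|}{|V|} \cdot \frac{w(G_i)}{|V_i|^{\theta}} $$

The density θ of the graph is obtained from the equation $w(G) = |V|^{\theta}$, with $w(G) = |V| + \sum_{e \in E} w(e)$, and is computed as

$$ \theta = \frac{\ln w(G)}{\ln |V|} $$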
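The Expected Density Measure can be sketched in code under the commonly cited definition, which is an assumption here: the graph weight is w(G) = |V| + sum of edge weights, the density θ solves w(G) = |V|^θ, and ρ is the size-weighted sum over clusters of w(G_i) / |V_i|^θ. The function name and the use of a similarity matrix as edge weights are hypothetical choices for illustration.

```python
import math
import numpy as np

def expected_density(sim, labels):
    """Expected density of a clustering (assumed definition, see lead-in).

    sim    : (n, n) symmetric matrix of edge weights (e.g. similarities), n >= 2
    labels : length-n sequence of cluster ids
    """
    sim = np.asarray(sim, dtype=float)
    labels = np.asarray(labels)
    n = len(labels)

    def graph_weight(idx):
        # w(G) = |V| + sum of edge weights, each undirected edge counted once
        sub = sim[np.ix_(idx, idx)]
        return len(idx) + np.triu(sub, k=1).sum()

    # theta solves w(G) = |V| ** theta for the whole graph
    theta = math.log(graph_weight(np.arange(n))) / math.log(n)

    rho = 0.0
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        rho += (len(idx) / n) * graph_weight(idx) / (len(idx) ** theta)
    return float(rho)
```

Under this definition, a clustering that keeps heavily connected documents together scores higher than one that splits them.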

CLUDIPSO

Clustering with a Discrete Particle Swarm Optimizer (CLUDIPSO):
- Representation of solutions (particles)
- Equations to update particles and velocities
- Process of updating particles
- Dynamic mutation operator to avoid premature convergence

CLUDIPSO 

Representation of solutions (particles): A particle represents a valid clustering. For a collection of n documents, the particle is an n-dimensional vector. Each dimension represents a document, and its value is the group to which that document is assigned.

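[Figure: a particle drawn as an n-dimensional vector. Positions 1, 2, 3, 4, …, n correspond to the n documents; each position holds one of the k group labels, e.g. 1, 3, 3, 2, …, 1.]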
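To make the encoding concrete, the following small sketch decodes a particle into its groups. The particle values and the helper name `decode_particle` are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

def decode_particle(particle):
    """Map a particle (one group label per document) to {group: [document ids]}."""
    groups = defaultdict(list)
    # documents are numbered 1..n, matching the positions of the vector
    for doc_id, group in enumerate(particle, start=1):
        groups[group].append(doc_id)
    return dict(groups)

# hypothetical particle for n = 5 documents and k = 3 groups
particle = [1, 3, 3, 2, 1]
clusters = decode_particle(particle)
# documents 1 and 5 share group 1, documents 2 and 3 share group 3,
# and document 4 is alone in group 2
```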

CLUDIPSO 

Equations to update velocities and positions:

CLUDIPSO 

Process of updating particles: In our approach, updating particles is not as direct as in the continuous case. In CLUDIPSO, the update is not carried out on all dimensions at each iteration. To determine which dimensions of a particle will be updated, we follow these steps:
1) all dimensions of the velocity vector are normalized to the [0..1] range;
2) a random number r in [0..1] is drawn;
3) all dimensions whose value in the (normalized) velocity vector is higher than r are selected in the position vector, and updated using:
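The three steps can be sketched as follows. Two points are assumptions and not taken from the text: the normalization is done by min-max scaling, and a selected dimension copies the value stored in the particle's personal best (`pbest`). The function name `update_position` is hypothetical.

```python
import random

def update_position(position, velocity, pbest, rng=random):
    """One discrete position update following steps 1-3.

    ASSUMPTION: a selected dimension takes the value of the personal best;
    min-max scaling is likewise an assumed normalization.
    """
    lo, hi = min(velocity), max(velocity)
    span = hi - lo
    # 1) normalize the velocity components to the [0, 1] range
    norm = [(v - lo) / span if span > 0 else 0.0 for v in velocity]
    # 2) draw a random threshold r in [0, 1)
    r = rng.random()
    # 3) update only the dimensions whose normalized velocity is higher than r
    return [pb if nv > r else x for x, nv, pb in zip(position, norm, pbest)]
```

Because only the dimensions with the largest velocities tend to change, a particle keeps most of its current group assignments from one iteration to the next.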

CLUDIPSO 

Dynamic mutation operator to avoid premature convergence: applied to each particle with probability pm.

This operator swaps the values of two randomly chosen dimensions of the particle.
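A minimal sketch of the mutation operator follows. The swap itself is as described; the linear schedule that lowers pm from pm_max to pm_min over the run is an assumption about what "dynamic" means here, and both function names are hypothetical.

```python
import random

def mutation_probability(iteration, max_iterations, pm_min=0.4, pm_max=0.9):
    """ASSUMPTION: pm decreases linearly from pm_max to pm_min during the run."""
    return pm_max - (pm_max - pm_min) * (iteration / max_iterations)

def mutate(particle, pm, rng=random):
    """With probability pm, swap the values of two random dimensions."""
    mutated = list(particle)
    if len(mutated) >= 2 and rng.random() < pm:
        i, j = rng.sample(range(len(mutated)), 2)  # two distinct positions
        mutated[i], mutated[j] = mutated[j], mutated[i]
    return mutated
```

Note that a swap preserves the number of documents in each group, so it perturbs the clustering without changing the group sizes.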

CLUCOPSO

Clustering with a Continuous Particle Swarm Optimizer (CLUCOPSO):
- Representation of solutions (particles, representing centroids)
- Equations to update particles and velocities
- Process of updating particles
- Dimensionality reduction in document representation

CLUCOPSO 

Representation of solutions (particles, representing centroids): In the continuous version (CLUCOPSO), the particles are (K × T)-dimensional real vectors, in which K centroids (one for each cluster) of T terms each are stored contiguously:
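The layout can be illustrated with a short sketch that reshapes a flat particle into K centroids. The second function rests on an assumption: documents are assigned to the nearest centroid by Euclidean distance, which the text does not specify. Both function names are hypothetical.

```python
import numpy as np

def decode_centroids(particle, k, t):
    """Reshape a flat (K*T)-dimensional particle into K centroids of T terms."""
    return np.asarray(particle, dtype=float).reshape(k, t)

def assign_documents(docs, centroids):
    """ASSUMPTION: each document joins its nearest centroid (Euclidean distance)."""
    docs = np.asarray(docs, dtype=float)
    # distances[d, c] = distance between document d and centroid c
    distances = np.linalg.norm(docs[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1)

# hypothetical particle with K = 2 centroids over T = 2 terms
centroids = decode_centroids([0.0, 0.0, 10.0, 10.0], k=2, t=2)
groups = assign_documents([[1, 0], [9, 10], [0, 1]], centroids)
```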

CLUCOPSO 

Equations to update velocities and positions: This version is similar to the classical PSO algorithm with respect to the gbest and pbest particles and the updating formulas for position and velocity:
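For reference, a sketch of one step of the textbook inertia-weight PSO update is given below. It is assumed, not confirmed by the text, that CLUCOPSO uses exactly this form; the default values of `w`, `gamma1` and `gamma2` are the ones listed for CLUCOPSO in the parameter settings, and the function name `pso_step` is hypothetical.

```python
import numpy as np

def pso_step(x, v, pbest, gbest, w=0.7, gamma1=1.7, gamma2=1.7, rng=None):
    """One classical PSO step (assumed textbook form).

    v_new = w * v + gamma1 * r1 * (pbest - x) + gamma2 * r2 * (gbest - x)
    x_new = x + v_new
    """
    rng = np.random.default_rng() if rng is None else rng
    x, v = np.asarray(x, dtype=float), np.asarray(v, dtype=float)
    # independent uniform random factors in [0, 1) for every dimension
    r1 = rng.random(x.shape)
    r2 = rng.random(x.shape)
    v_new = w * v + gamma1 * r1 * (np.asarray(pbest) - x) \
                  + gamma2 * r2 * (np.asarray(gbest) - x)
    return x + v_new, v_new
```

Here `pbest` is the best position found so far by the particle itself and `gbest` the best position found by the whole swarm.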

CLUCOPSO 

Dimensionality reduction in document representation: A dimensionality reduction phase is often applied to reduce the size of the document representations to a much smaller number of features. It usually aims to make the problem more manageable for the clustering method. PSO-based methods can be seriously affected by high dimensionality in the document representation: the dimensionality of each centroid depends directly on the dimensionality of the vectors used to represent the documents. Consequently, a dimensionality reduction phase is usually required to obtain particles of an acceptable size.

Dimensionality reduction often takes the form of feature selection or feature extraction. For CLUCOPSO, we used both forms of dimensionality reduction:
- a feature selection method named Transition Point (the terms whose frequency is closest to the transition point tp are used as indexes for the vector space model, VSM);
- a feature extraction method known as Random Indexing (accumulation of context vectors based on the occurrence of words in contexts).
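The Transition Point selection can be sketched as follows. The formula used for tp, tp = (sqrt(8 * I1 + 1) - 1) / 2 with I1 the number of terms that occur exactly once, is the commonly cited one and is an assumption here, as are the function names and the alphabetical tie-breaking.

```python
import math
from collections import Counter

def transition_point(term_freqs):
    """ASSUMPTION: tp = (sqrt(8 * I1 + 1) - 1) / 2, I1 = number of terms with frequency 1."""
    i1 = sum(1 for f in term_freqs.values() if f == 1)
    return (math.sqrt(8 * i1 + 1) - 1) / 2

def select_terms(tokens, n_terms):
    """Keep the n_terms terms whose frequency is closest to the transition point."""
    freqs = Counter(tokens)
    tp = transition_point(freqs)
    # sort by distance to tp; ties are broken alphabetically for determinism
    ranked = sorted(freqs, key=lambda term: (abs(freqs[term] - tp), term))
    return ranked[:n_terms]
```

The selected terms then serve as the dimensions of the document vectors, so `n_terms` directly controls the value of T and therefore the particle size K × T.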

Data Sets

We selected three short-text collections to test our approach:
- CICLing-2002: regarded in many research works as a high-complexity corpus, with short documents and high vocabulary overlap.
- EasyAbstracts: an easier collection than CICLing-2002 with respect to the degree of overlap of the documents' vocabulary.
- Micro4News: an easier collection than EasyAbstracts and CICLing-2002 with respect to document length and vocabulary overlap.

Test & results

Parameters for CLUDIPSO:
- 50 independent runs for each collection.
- 10,000 iterations per run.
- Swarm size: 50 particles.
- pm_min = 0.4, pm_max = 0.9
- w = 0.9, γ1 = γ2 = 1.0

Parameters for CLUCOPSO:
- 50 independent runs for each collection.
- 10,000 iterations per run.
- Swarm size: 50 particles.
- pm_min = 0.4, pm_max = 0.9
- w = 0.7, γ1 = γ2 = 1.7

Conclusions

We introduce new ideas for clustering short-text corpora:
- CLUDIPSO, a novel discrete PSO-based algorithm adapted to this kind of problem;
- CLUCOPSO, a continuous PSO-based algorithm adapted to this kind of problem;
- the use of two internal clustering validity measures, Global Silhouette and Expected Density, as explicit objective functions to be optimized.

The preliminary results obtained by CLUDIPSO and CLUCOPSO indicate that our approach is a highly competitive alternative for clustering short-text corpora.

Thanks / Grazie

Paolo Rosso. Una formalizzazione degli schemi senso-motori di Arbib. Tesi di laurea. Università di Pisa. 1992. Co-relatori: Andrea Maggiolo-Schettini e Antonina Starita
