Keyword Extraction, Ranking, and Organization for the Neuroinformatics Platform

S. Usui a,∗, P. Palmes a, K. Nagata a, T. Taniguchi a, N. Ueda b

a RIKEN Brain Science Institute, 2-1 Hirosawa, Wako City, Saitama 351-0198, Japan
b NTT Communication Science Laboratories, 2-4 Hikaridai, Seika-cho, Soraku-gun, Kyoto, Japan

Abstract

Brain-related research encompasses many fields of study and usually involves worldwide collaboration. Recognizing the value of these international collaborations for the efficient use of resources and for improving the quality of brain research, the INCF (International Neuroinformatics Coordinating Facility) started to coordinate the establishment of Neuroinformatics (NI) centers and portal sites among the participating countries. These NI centers and portal sites will serve as conduits for the interchange of information and brain-related resources among different countries. In Japan, several NI platforms are being developed under the support of the NIJC (NI Japan Center), with one platform, Visiome, already operating and publicly accessible at “http://www.platform.visiome.org”. Each of these platforms requires its own set of keywords representing important terms that cover its field of study. One important function of this predefined keyword list is to help contributors classify the contents of their contributions and group related resources. It is vital, therefore, that this predefined list be properly chosen to cover the necessary areas. Currently, the process of identifying these appropriate keywords relies on the availability of human experts, which does not scale well given that the different areas are rapidly evolving. This problem prompted us to develop a tool to automatically filter the terms most likely to be preferred by human experts. We tested the effectiveness of the proposed approach using the abstracts of the Vision Research Journal (VR) and Investigative Ophthalmology and Visual Science Journal (IOVS) as source files.

Key words: Neuroinformatics, Relevance Ranking, Weighting, Indexing, Automatic Extraction, Co-occurrence, Clustering

Preprint submitted to Elsevier Science

8 September 2006

1 Introduction

Understanding the brain as a system requires the worldwide collaboration of scientists specializing in different areas of the brain. With the advancement and widespread adoption of information technology among the scientific communities, scientists working together nowadays attain a much richer level of understanding of a given phenomenon. These rich interactions, while hastening the discovery of new science, produce new information at a rate that makes understanding an entire system like the brain overwhelmingly complex for any individual. Consequently, further understanding and development in a particular field are difficult to achieve due to information overload. These issues confront many areas of research and are even more compounded in the fields of brain research; they prompted the development of a field called Neuroinformatics (NI). Its main goal is to help brain scientists handle the analysis, modeling, simulation, and management of information resources before, during, and after the conduct of research. Scientists working together in different places need a common, remotely accessible environment that provides them with tools for easy organization and storage of their research findings, and that allows smooth integration of their results with those of other collaborators. A Neuroinformatics Platform such as “Visiome” (http://platform.visiome.org) aims to address these issues by providing portal sites for different fields of brain research (Usui, 2003a,b). These portal sites allow collaborators to share research resources, including not only published papers but also their corresponding support files, such as source code of algorithms and mathematical/statistical models, experimental data, movies, slides/images, presentations, etc. One vital component of the Neuroinformatics Platform is the index tree, which is used to organize the materials submitted by contributors.
Since each contributor, upon submission of his/her work, has to choose appropriate terms from the index tree, it is important that the elements of the index tree are reasonably chosen so that submitted work can be properly organized and characterized in a coherent manner. These index terms should cover almost all areas deemed highly relevant by human experts, organized in a structure where the resources they point to can easily be located. As the different fields of study evolve, the structure and composition of the index tree must also evolve, and the current manual scheme does not scale well. Automating index keyword extraction is necessary to support the evolution of the platform in operation and the establishment of new platforms.

∗ Corresponding author. Tel.: +81 48 462 111x7601; fax: +81 48 467 7498
Email addresses: [email protected] (S. Usui), [email protected] (P. Palmes), [email protected] (K. Nagata), [email protected] (T. Taniguchi), [email protected] (N. Ueda).

[Figure 1: bar chart comparing the two databases (years 1992−2004; VR 1−3 grams: 112,520; IOVS 1−3 grams: 219,244). No. of Abstracts: VR 3714, IOVS 5184; Total KW: VR 1055, IOVS 476; Unigram KW: VR 952, IOVS 173; Ngram KW (N>1): VR 103, IOVS 303.]

Fig. 1. VR and IOVS Abstracts Basic Statistics. The IOVS database has a relatively smaller number of keywords and a significantly wider search space compared to the VR database. Moreover, most IOVS keywords are non-unigrams. These properties make extracting IOVS keywords harder than extracting VR keywords.

2 Extracting Keywords

This section describes the datasets used as well as the data processing techniques and the rationale behind the formulation of the proposed weighting measures.

2.1 Data Sets

In this study on the automatic extraction of technical keywords, we use the collections of research abstracts from 1992-2004 of the VR (Vision Research) and IOVS (Investigative Ophthalmology and Visual Science) journals as test cases. Although using the full paper contents could have provided better data quality and accuracy, we preferred the practicality of analyzing research abstracts because, in the majority of cases, they are readily available free of charge. In order to assess the effectiveness of the different weighting schemes, it is important to have a good basis of expert knowledge for determining the most relevant terms in the collection of abstracts being studied. For evaluation purposes, this paper considers the two sets of keywords defined by the VR and IOVS editorial boards/publishers, respectively, to constitute the correct sets of keywords.

[Figure 2: data processing flowchart. Pre-processing: Titles + Abstracts → Stopword Removal → Stemming → Terms and Docs ID Assignment. Processing: Information Matrices Construction, Weighting and Ranking, Ngram Extraction, Data Analysis.]

Fig. 2. Data Processing Flowchart. The pre-processing stage employs techniques such as stemming and stopword removal, which are popularly used in the text-mining community to extract terms for vector-space representation.
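The pre-processing stage (stopword removal followed by stemming) can be sketched as below. This is a toy illustration only: the stopword list and the naive suffix-stripping rule are stand-ins for a full stopword lexicon and the Porter (1980) stemmer referenced in the text.

```python
# Toy sketch of the pre-processing stage: stopword removal then stemming.
# STOPWORDS and naive_stem are illustrative stand-ins, not the real lexicon
# or the full Porter algorithm.
STOPWORDS = {"the", "of", "in", "and", "a", "is", "to"}

def naive_stem(word):
    """Toy stemmer: strip a few common English suffixes (not full Porter)."""
    for suffix in ("ations", "ation", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lowercase, tokenize, drop stopwords, stem the remaining terms."""
    tokens = [t for t in text.lower().split() if t.isalpha()]
    return [naive_stem(t) for t in tokens if t not in STOPWORDS]

print(preprocess("Stemming of the retinal adaptation experiments"))
# → ['stemm', 'retinal', 'adapt', 'experiment']
```

Each surviving stem would then receive a term-id in the next stage of the pipeline.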

Figure 1 summarizes the basic statistics of the databases derived from both journals. Although VR and IOVS are related through their common focus on vision science, they differ in scope and perspective. One prominent difference lies in their keyword lists: while the majority of VR keywords are unigrams (single-term keywords), IOVS keywords are mostly bigrams, with a smaller fraction composed of unigrams and trigrams. Also, the number of IOVS Ngrams [single- or multiple-term keywords] (219,244) is almost twice that of VR (112,520), yet the number of IOVS keywords (476) is only about half the total number of VR keywords (1055). In this sense, extracting the main IOVS keywords is more difficult due to the larger data size but relatively smaller number of keywords. The differences in the statistical properties of VR and IOVS allow us to determine which of the approaches is the most consistent, stable, and robust for keyword extraction.

2.2 Data Processing

All approaches included in this study utilize the vector-space or bag-of-words representation of terms and documents (Salton and McGill, 1983), the most common and generally accepted representation in the text-mining community. Each unique term in the collection of research abstracts (Fig. 2) is assigned a unique term-id after stopword removal and stemming (Porter, 1980), and each unique abstract is assigned a unique document-id. The combination of term-ids and doc-ids facilitates the construction of the term-document matrix, where each cell (i, j) corresponds to the frequency of occurrence of term i in document j, as shown in Figure 3. The rows of this matrix are the vector-space representations of terms embedded in the document space, while its columns are the vector-space representations of documents embedded in the term space. Statistical and machine learning approaches in this study use the base information encoded in this matrix to derive other tables.

[Figure 3: term-document matrix schematic, with rows indexed by term IDs (term vectors) and columns indexed by document IDs (document vectors).]

Fig. 3. Vector-Space or Bag-of-Words Representation. Each cell (i, j) corresponds to the number of occurrences of term i in document j. The rows represent the collection of terms embedded in the document space, while the columns represent the collection of documents embedded in the term space.
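The term-document matrix construction can be sketched as follows. The three abstracts and their terms are invented for illustration; ids are assigned in order of first appearance, which is one possible convention (the paper does not specify one).

```python
# Sketch of term-document matrix construction: each unique term gets a
# term-id (row), each abstract a document-id (column), and cell (i, j)
# holds the frequency of term i in document j. The documents are toy data.
docs = ["retina cone retina", "cone adaptation", "cortex adaptation retina"]

term_ids = {}
for doc in docs:
    for term in doc.split():
        term_ids.setdefault(term, len(term_ids))  # id = order of appearance

# Dense matrix: rows are term vectors, columns are document vectors.
tf = [[0] * len(docs) for _ in term_ids]
for j, doc in enumerate(docs):
    for term in doc.split():
        tf[term_ids[term]][j] += 1

print(tf[term_ids["retina"]])  # term vector of "retina": [2, 0, 1]
```

A real implementation would use a sparse matrix, since most cells in a large abstract collection are zero.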

2.3 Term Ranking

As shown in Fig. 2, automatic keyword extraction is performed by ranking terms using term weighting measures. Different weighting measures have a significant influence on the relevance ranking of terms, i.e., on their interestingness. The typical measure of interestingness is based on a term's specificity and generality of occurrence. The simplest measure is the term frequency TF(i), given by

TF(i) = Σ_{j=1}^{N} TF(i, j).    (1)

Here, TF(i, j) is the frequency with which term i appears in document j, and N is the total number of documents. As shown in Figure 3, TF(i, j) is the (i, j)-th element of the term-document matrix in the bag-of-words representation. Clearly, this measure has the problem that non-interesting general terms receive high weights. To mitigate this problem, Salton (1991) proposed the TF-IDF (term frequency & inverse document frequency) weighting. The inverse document frequency, IDF(i), is defined as

IDF(i) = log(N / DF(i)),    (2)

where DF(i), the document frequency of term i, is the number of documents containing term i. Finally, TF-IDF is defined by

TF-IDF(i) = TF(i) × IDF(i).    (3)
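Equations (1)-(3) can be computed directly from a toy term-document matrix; the terms and counts below are invented for illustration.

```python
# Equations (1)-(3) on a toy matrix: TF(i) sums row i, DF(i) counts the
# documents containing term i, and TF-IDF(i) = TF(i) * log(N / DF(i)).
import math

tf_matrix = {            # term -> per-document frequencies TF(i, j)
    "retina":     [2, 0, 1],
    "adaptation": [0, 1, 1],
    "the":        [3, 2, 4],   # general term: appears in every document
}
N = 3  # total number of documents

def tf(term):
    return sum(tf_matrix[term])                    # Eq. (1)

def idf(term):
    df = sum(1 for f in tf_matrix[term] if f > 0)  # DF(i)
    return math.log(N / df)                        # Eq. (2)

def tf_idf(term):
    return tf(term) * idf(term)                    # Eq. (3)

# "the" is frequent but occurs in all documents, so IDF drives its weight
# to zero, while the more specific "retina" keeps a positive weight.
print(tf_idf("the"), tf_idf("retina"))
```

This makes the penalty described in the next paragraph concrete: the general term's weight collapses to zero despite its high raw frequency.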

The additional IDF factor penalizes general terms that appear frequently in many documents and favors specific terms that appear frequently in a relatively small number of documents. However, in our exploration we found that technical keywords often have the following properties: (P1) Keywords often appear in documents that deal with similar topics. (P2) Some keywords co-occur with other keywords in a document. Clearly, these properties are not directly considered in TF-IDF. Hence, we propose new measures. To incorporate (P1) into the measure, we consider the inverse topic frequency, ITF, instead of IDF. More specifically, ITF(i) is given by

ITF(i) = log(K / TPF(i)),    (4)

where TPF(i), the topic frequency of term i, is the number of topics to which the documents containing term i belong, and K is the total number of topics. This measure penalizes general terms appearing in many topics in the same way that IDF penalizes general terms appearing in many documents. To compute TPF, we first have to obtain document clusters, each of which corresponds to a latent topic (research field). For this purpose, we apply the spherical K-means (SKM) algorithm [see Duda et al. (2001); Dhillon et al. (2003, 2001) for details] to the document vectors shown in Fig. 3. Clearly, ITF depends on the value of K. However, as we describe later, the final ranking performance is not very sensitive to the choice of K. Next, to incorporate (P2) into the measure, we propose the term-document co-occurrence frequency, TDCF(i), defined by

TDCF(i) = log(1 / min_{j≠i} Fisher(i, j)),    (5)

where Fisher(i, j) denotes Fisher's exact probability (p-value) for the co-occurrence relationship between terms i and j. We omit the details, but intuitively, when the number of co-occurrences of terms (i, j) within documents is large, the value of Fisher(i, j) becomes small. Thus the inverse of the min_{j≠i} Fisher(i, j) value provides the maximum degree of co-occurrence of term i with any other term. We experimentally confirmed that incorporating the log scale into this measure was appropriate, as in IDF and ITF. Finally, our new ranking measure for term i is given by

TF-ITF-TDCF(i) = TF(i) × ITF(i) × TDCF(i).    (6)

Fig. 4. Rank Assignment for Unigrams and Ngrams. Two-way ranking is carried out by first ranking the unigrams vertically, then ranking the Ngrams in each row horizontally.
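The proposed measures can be sketched on toy data as follows. The cluster labels below stand in for the spherical K-means output, and the one-sided Fisher's exact p-value is computed from the hypergeometric distribution; all counts and labels are assumed values for illustration.

```python
# Sketch of the proposed measures (Eqs. 4-5) on toy data. The doc_topic
# labels stand in for spherical K-means cluster assignments.
import math

def itf(term_docs, doc_topic, K):
    """Eq. (4): ITF(i) = log(K / TPF(i)); TPF is the number of distinct
    topics among the documents containing the term."""
    tpf = len({doc_topic[d] for d in term_docs})
    return math.log(K / tpf)

def fisher_one_sided(a, b, c, d):
    """One-sided Fisher's exact p-value P(X >= a) for the 2x2
    co-occurrence table [[a, b], [c, d]], via the hypergeometric law."""
    n, row1, col1 = a + b + c + d, a + b, a + c
    return sum(math.comb(col1, x) * math.comb(n - col1, row1 - x)
               / math.comb(n, row1)
               for x in range(a, min(row1, col1) + 1))

def tdcf(pvalues):
    """Eq. (5): TDCF(i) = log(1 / min over j != i of Fisher(i, j))."""
    return math.log(1.0 / min(pvalues))

# Toy term occurring in documents {0, 1}, which share one topic of K = 4:
doc_topic = {0: 0, 1: 0, 2: 1, 3: 2}        # assumed cluster labels
print(itf({0, 1}, doc_topic, K=4))           # log(4 / 1) ≈ 1.386
# Term pair co-occurring in 3 of 10 documents (a=3, b=1, c=1, d=5):
print(fisher_one_sided(3, 1, 1, 5))
```

A strongly co-occurring pair yields a small p-value, so its reciprocal (and hence TDCF) is large, which is exactly the boost property (P2) calls for.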

2.4 Performance Evaluation

For general applicability, the extraction process has to take into account the presence of multiple-term keywords. Figure 4 outlines how both unigrams and Ngrams are ranked: vertical ranking is used to rank the unigrams first, followed by horizontal ranking of the Ngrams. It should be noted that the Ngrams in each row contain the corresponding unigram at their leftmost position, which serves as their anchor or root word. Because ranking covers both the vertical and horizontal directions, the evaluation criterion utilizes the concept of an ideal rectangle. Evaluation is carried out by counting the number of keywords inside this rectangle (Fig. 5), anchored at the uppermost part of the final table. In the ideal case, all keywords are located inside this rectangle. We initially tested the effect of combining the different weighting schemes in the two-way ranking scheme described in Fig. 4. The tests indicate that ranking performance is dominated by the weighting scheme used for the vertical ranking. Hence, the final implementation employs a single weighting scheme for both the horizontal and vertical ranking. Figures 6(a) and 6(b) show the precision and recall performance of the three weighting schemes in returning the topmost terms (topN) of VR and IOVS, respectively. Among the three schemes, it is apparent that TF-ITF-TDCF performs best in both precision and recall.

Fig. 5. Ngram Evaluation Method. Two-way ranking induces the accumulation of interesting terms towards the leftmost and uppermost part of the rectangle.

[Figures 6 and 7: average recall vs. ranking scheme (tf, tf−idf, tf−itf−tdcf); panels (a) VR and (b) IOVS.]

Fig. 6. Precision/Recall Performance. Both plots indicate that TF-ITF-TDCF performs better than the conventional approaches in both VR and IOVS.

Fig. 7. Confidence Interval Plots. Over 10 trials, TF-ITF-TDCF shows a superior average recall performance in both VR and IOVS.

Since the proposed scheme relies on the use of spherical K-means clustering (Dhillon et al., 2003), it is important to check the significance and consistency of its performance. Hence, we conducted significance testing using 10 trials for each database and analyzed the results with pairwise t-tests. Our tests indicate that the observed performance advantage of TF-ITF-TDCF in VR and IOVS is significant at the 0.05 level. Figures 7(a) and 7(b) show the confidence interval plots for VR and IOVS; the plots clearly indicate the superior performance of the proposed approach over the two conventional approaches. As mentioned in the previous section regarding the choice of K for the TPF, we also tested how sensitive the performance of TF-ITF-TDCF is to different values of K. Figure 8 indicates that its ranking performance in both VR and IOVS does not vary significantly for K between 80 and 200 clusters. Given these results, we used K = 100 for the ITF computations of both VR and IOVS.

Fig. 8. Recall Performance of TF-ITF-TDCF for Different K [(a) VR; (b) IOVS]. For K in the range 80 to 200, recall does not vary significantly in either VR or IOVS.
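The ideal-rectangle criterion can be sketched as a simple count over the top-left window of the two-way-ranked table; the ranked table and keyword set below are hypothetical.

```python
# Sketch of the ideal-rectangle evaluation: after two-way ranking, count
# how many true keywords fall inside the top-left rows x cols window of
# the table. The toy table and keyword set are illustrative.
def rectangle_recall(ranked_table, keywords, rows, cols):
    """Fraction of the keyword set found inside the top-left rows x cols
    rectangle of the two-way-ranked term table."""
    hits = {term
            for row in ranked_table[:rows]
            for term in row[:cols]
            if term in keywords}
    return len(hits) / len(keywords)

ranked = [  # row = unigram anchor followed by its Ngrams, ranked left to right
    ["retina", "retinal ganglion cell", "retinal adaptation"],
    ["cone", "cone pigment", "cone mosaic"],
    ["noise", "noise model", "noise source"],
]
keywords = {"retina", "cone", "cone pigment", "noise source"}
print(rectangle_recall(ranked, keywords, rows=2, cols=2))  # 3 of 4 → 0.75
```

A better ranking scheme pushes more true keywords into the rectangle, raising this fraction toward the ideal value of 1.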

2.5 Visualization of Results

To help visualize the effectiveness of the different ranking schemes involving Ngrams, equation (7) describes a colormap assignment based on the relative rank of each Ngram term with respect to the term of minimum rank. Figure 9(a) shows the corresponding color-weight lookup table. The higher the rank of a term, the lighter its color; conversely, the darker the color, the lower the relative rank.

color-indx(i, j) = log10[rank(i, j) − minRank] / log10[maxRank − minRank].    (7)

Figures 9(a) and 9(b) show the colormaps of the worst and best ranking schemes (TF vs. TF-ITF-TDCF) for VR and IOVS, respectively. We show only the colormaps of the worst and the optimal ranking schemes in order to highlight the significant differences between these two extremes. The TF-ITF-TDCF colormaps for VR and IOVS show a relatively higher concentration of keywords than TF within the same pixel window (350x250). As previously demonstrated by the Wilcoxon test, all of the schemes help filter keywords, as indicated by the relatively high density of hot pixels (terms in the keyword vocabulary list) near the leftmost column of their colormap tables. However, the greater density of hot pixels in the TF-ITF-TDCF colormaps compared to the TF colormaps suggests that TF-ITF-TDCF discriminates keywords from non-keywords much better.

Fig. 9. VR and IOVS Colormaps [(a) VR; (b) IOVS]. The hot pixels along the left side represent terms that are part of the keyword vocabulary list. TF-ITF-TDCF ranking produces a greater number of hot pixels in the upper-left part of the VR and IOVS colormap tables than TF.

Fig. 10. Stability of VR and IOVS Rank Assignments of Terms. TF-ITF-TDCF rank assignment was applied to 400 randomly selected documents over 256 trials to determine which terms have stable rank assignments (constant rank across the 256 trials), depicted by darker colors. Results show that the stable terms are unigrams for VR and bigrams for IOVS. This strongly suggests that most of the interesting terms in VR are unigrams, while those in IOVS are bigrams, in agreement with the true nature of the keywords in VR and IOVS.

Since the proposed algorithm uses spherical K-means, which relies on random initialization of the centroids, it is important to check the algorithm's consistency and stability in generating an optimal ranking. The stability test (Fig. 10) involves applying TF-ITF-TDCF to 400 randomly selected documents over 256 independent trials. The experiments record the number of times each term receives a particular rank order from TF-ITF-TDCF; terms with a high recurrence rate have a stable rank order and are most likely keywords. Interestingly, the terms with high recurrence rates in VR are unigrams, while in IOVS they are bigrams. As Fig. 1 shows, VR keywords are mostly unigrams while IOVS keywords are mostly bigrams. The experiments were thus able to detect this particular difference between VR and IOVS even though the raw data do not explicitly provide this information.
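The stability measurement above can be sketched as a per-term count of how often the same rank recurs across trials; the rank histories below are invented for illustration.

```python
# Sketch of the stability test: given each term's rank across repeated
# trials, report the fraction of trials at the term's most frequent
# (modal) rank. Terms whose fraction is near 1.0 are "stable".
from collections import Counter

def stability(trial_ranks):
    """trial_ranks: {term: [rank in trial 1, rank in trial 2, ...]}.
    Returns {term: fraction of trials at the term's modal rank}."""
    out = {}
    for term, ranks in trial_ranks.items():
        _, count = Counter(ranks).most_common(1)[0]
        out[term] = count / len(ranks)
    return out

trials = {"retina": [1, 1, 1, 1], "noise": [5, 9, 2, 7]}  # toy histories
print(stability(trials))  # {'retina': 1.0, 'noise': 0.25}
```

In the paper's setting, 256 trials on random 400-document samples play the role of the toy rank histories here.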

3 Conclusion

Extracting relevant terms that closely match experts' preferences is a great challenge because of the nature of the data, which is high-dimensional, sparse, and noisy. In spite of these realities, we have demonstrated encouraging results that can lessen the burden of identifying and coherently organizing highly relevant terms. One main target of this work is the development of a tool that automates the entire process of term extraction and keyword/topic identification. This tool (Fig. 11) will allow experts to immediately recognize the most relevant terms and easily incorporate their preferred terms into the list of suggested terms. The main weighting engine will be TF-ITF-TDCF. However, for greater flexibility, the tool will also support other weighting schemes, which users can choose based on their assessment of the accuracy of each scheme's results. In the future, we would like to embed this technology in the Neuroinformatics portal sites to help users organize their local databases by indexing relevant terms and incorporating their preferences into the automated output. It will help ordinary users organize information of particular interest to them. Information overload is a side effect of the advancement of information technology, and we surely need information technology to combat it as well. Our research is one attempt to help scientists manage their information resources, and we expect that tools similar to the one we developed will become increasingly important as the information era matures.

Fig. 11. Keyword Extractor. The main window is composed of three panes. The main (leftmost) pane lists terms ranked according to the preferred weighting measure. The middle pane contains the list of Ngram keywords corresponding to the unigram keyword selected in the main pane. Finally, the rightmost pane lists the terms selected to become part of the index keywords of a particular database.

References

Dhillon, I., Fan, J., Guan, Y., 2001. Efficient clustering of very large document collections. In: Data Mining for Scientific and Engineering Applications. Kluwer Academic Publishers, pp. 357–381.
Dhillon, I., Mallela, S., Kumar, R., 2003. A divisive information-theoretic feature clustering algorithm for text classification. Journal of Machine Learning Research (JMLR) 3, 1265–1287.
Duda, R., Hart, P., Stork, D., 2001. Pattern Classification. John Wiley & Sons, USA.
Porter, M., July 1980. An algorithm for suffix stripping. Program 14, 130–137.
Salton, G., 1991. Developments in automatic text retrieval. Science 253, 974–979.
Salton, G., McGill, M., 1983. Introduction to Modern Retrieval. McGraw-Hill Book Company.
Usui, S., 2003a. Neuroinformatics research for vision science: NRV project. Biosystems 71, 189–193.
Usui, S., 2003b. Visiome: Neuroinformatics research in vision project. Neural Networks 16, 1293–1300.

