Text Genre Boundary Detection Using Co-occurrence Networks

Ranzivelle Marianne L. Roxas, Josephine Jill Cabatbat, and Giovanni A. Tapang
Instrumentation Physics Laboratory, National Institute of Physics
University of the Philippines, Diliman, Quezon City
(+632)9209749

ABSTRACT
The transition between two different literary genres in a single article is detected by analyzing the topology of its co-occurrence networks. A text window is shifted through the text and network parameters for the words within this window are calculated. Results show that an abrupt increase or decrease in the clustering coefficient indicates the boundary between prose and poetry.

Keywords
Text categorization, genre identification, language network.

1. INTRODUCTION

Automatic text categorization has become important with the advent of the Internet and electronic mail. Detection of spam messages in e-mail and site identification by web crawlers are but a few examples of the usefulness of automatically knowing the language, authorship and genre of a given text.

Traditional methods that rely on a professional to categorize text have limited applicability because of time and cost constraints as the corpus of text grows. Statistical methods and machine learning techniques can be applied instead, such as nearest neighbor classifiers [1], decision trees [2], Bayesian classifiers [3], Support Vector Machines [4], rule learning algorithms and neural networks [5].

Network approaches to studying language have been able to explain some trends such as Zipf’s Law [6] as well as the processes of language acquisition and growth. Semantic, syntactic, and co-occurrence networks are examples of this effort to explain particular aspects of language [6]. Syntax involves a set of rules for building phrases and sentences from separate words. These rules, which reflect the syntactic relations among words, can be directly built into a graph. A network of lexical collocations has also been used to segment texts into thematically coherent units [7].

When viewed as a dynamic network, e-mail exchanges have been shown to develop self-organized coherent structures similar to those appearing in other complex systems [8]. Thematic relationships between Web pages can be constructed precisely by looking only at hyperlinks, without having to read what is in the pages [9]. Hierarchical structures such as networks built from text have also been found to follow long-range correlations in written texts that reflect the changes in content within them [10]. These studies show that the network structure in language is clearly important in the interpretation of its content.

Since the order in which words co-occur (collocation) is significant to a written text, a directed network built from the syntactic elements of the text visualizes both its syntactic and semantic (meaning) content.

By analyzing the topology of co-occurrence networks and visualizing the syntactic and semantic content of written text, a novel method of genre boundary detection is presented here. Network parameters were used to distinguish prose from poetry. Genre boundaries within a single article can also be detected. This method can be useful for data mining in large databases of literary works.

2. NETWORK PARAMETERS

In graph theoretic modeling of semantic networks, nodes represent words, which are connected by edges. Since network structure reflects function [11], certain features of this network may be used for analysis. Parameters such as the average path length L, diameter D, and clustering coefficient C can be utilized.

The sequence of edges or arcs from one node to another is called a path, and the number of arcs traversed in a path is the path length. The average path length L is the average of the shortest path lengths between nodes. The diameter D is the maximum of the distances between any two nodes [12]. The clustering coefficient C is the probability that two neighbors of a particular node are themselves neighbors. It is commonly calculated by counting the number of edges between the node’s neighbors and dividing by the number of all possible edges between them [12]. The clustering coefficient C also depends on the number of nodes; for short articles, C will not be a good parameter for distinguishing one network from another.

3. METHODOLOGY

Written text was pre-processed using Perl to isolate words, tokenize punctuation marks, and build the co-occurrence network for further processing. Punctuation marks were treated like words since they reflect the structure and style unique to an author.
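As an illustration of the pre-processing step in the Methodology, the sketch below isolates words and keeps punctuation marks as separate tokens. The original work used Perl; Python is used here only for illustration, and the function name `tokenize`, the regular expression, and the lowercasing of words are assumptions rather than details given in the text.

```python
import re

# Hypothetical tokenizer: a word (optionally with internal apostrophes or
# hyphens) is one token, and every punctuation mark is a token of its own.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+(?:['’-][A-Za-z0-9]+)*|[^\sA-Za-z0-9]")


def tokenize(text):
    """Return the lowercase word tokens and punctuation tokens of `text`."""
    return [token.lower() for token in TOKEN_PATTERN.findall(text)]


if __name__ == "__main__":
    print(tokenize("It was many and many a year ago,"))
```

Keeping punctuation marks as tokens means they become nodes of the network in the same way as words, in line with the treatment described in the Methodology.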

A text window was shifted through the text to locate the genre boundaries in written works that combine poetry and prose. We used text window sizes S of 50, 100 and 200 words, and the window was moved by an increment I of 50 words. The parameters of the syntactic network of the first S words of the text are calculated first; the window is then shifted by I words, so that the next S words, starting from the (I+1)th word of the text, are considered. The window is shifted in this way until the end of the text. The syntactic network of each window was built with each unique word taken as an individual node. Two words are classified as neighbors if they are adjacent to each other in the text.
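The windowing and network construction described above can be sketched as follows. This is a minimal sketch under stated assumptions: the names `sliding_windows` and `build_network` are hypothetical, the tokens are assumed to be already isolated, and edges are stored as undirected neighbor sets even though the network is also described as directed.

```python
def sliding_windows(tokens, size, increment):
    """Yield (start_index, window) pairs of `size` tokens, moved by `increment`.

    A text shorter than `size` yields a single, shorter window.
    """
    if size <= 0 or increment <= 0:
        raise ValueError("size and increment must be positive")
    last_start = max(len(tokens) - size, 0)
    for start in range(0, last_start + 1, increment):
        yield start, tokens[start:start + size]


def build_network(window):
    """Map each unique token to the set of tokens adjacent to it in the window."""
    neighbors = {token: set() for token in window}
    for left, right in zip(window, window[1:]):
        if left != right:  # assumption: immediate repeats do not create self-loops
            neighbors[left].add(right)
            neighbors[right].add(left)
    return neighbors


if __name__ == "__main__":
    tokens = "the cat saw the dog and the dog saw the cat .".split()
    for start, window in sliding_windows(tokens, size=6, increment=3):
        print(start, sorted(build_network(window)))
```

With S = `size` and I = `increment`, each yielded window corresponds to one point on a plot of a network parameter against position in the text.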

For this study, we analyze poems and prose works by Edgar Allan Poe. To verify that the method detects genre boundaries, we analyze The Poetic Principle, an essay by the same author that contains both poetry and prose passages.

4. RESULTS AND DISCUSSION

Figure 1 shows a visualization of the co-occurrence language networks of a poem and a prose example. Functional words, such as articles, prepositions, and auxiliaries, are immediately identified as hubs in the prose network. For poems, punctuation marks generally act as hubs.

Figure 1. Co-occurrence language network of (a) a poem and (b) a prose work by Edgar Allan Poe, built from word adjacency in the text.

The calculated network parameters for both prose and poetry are recorded in Table 1. D and L directly reflect the structure of the text. Prose tends to have lower path lengths and diameter since words tend to be repeated more often in prose than in poetry. Prose also contains more functional words. Poetry has higher D and L because poems are usually composed of unique words, i.e., words tend to be repeated less.

An exception to this generalization is the poem Annabel Lee. In the poem, words are used repetitiously, which resulted in the low D and L of the generated network. This also resulted in a higher clustering coefficient.

Table 1. Summary of the co-occurrence language network statistical properties for prose and poetry.

Literary Work                           C       L      D
Poem
  A dream                               0.021   4.43   11
  A dream within a dream                0.028   3.87    8
  Alone                                 0.043   3.59    7
  Annabel Lee                           0.073   3.28    8
  Imitation                             0.013   3.88    9
  Serenade                              0.041   3.93   10
  Tamerlane                             0.042   3.86   10
  The city in the sea                   0.053   3.89   10
  The raven                             0.047   3.98    9
  The valley of unrest                  0.024   4.15   10
Prose
  A descent into the Maelstrom          0.043   3.71    9
  The black cat                         0.059   3.47    8
  The cask of Amontillado               0.078   3.52    8
  The facts in the case of Valdemar     0.056   3.77    9
  The imp of the perverse               0.070   3.56    8
  The masque of the red death           0.068   3.57    9
  The pit and the pendulum              0.060   3.68    9
  The premature burial                  0.070   3.59    8
  The purloined letter                  0.077   3.49    8
  The tell-tale heart                   0.078   3.54    8

A higher clustering coefficient means that words are used more often in different combinations throughout the text, thereby increasing the probability that they will be adjacent to each other. This is the case for prose, as reflected in the C values in Table 1.

Figure 2. 3D plot of the calculated network parameters. Two clusters are observed, corresponding to the two literary genres, poetry and prose.
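To make the three parameters reported in Table 1 concrete, the sketch below computes C, L and D for a small network given as neighbor sets. It is an illustration, not the authors' implementation: the function names are hypothetical, edges are treated as undirected, C is averaged over all nodes (nodes with fewer than two neighbors contribute zero), and L and D are taken over reachable pairs of nodes.

```python
from collections import deque
from itertools import combinations


def clustering_coefficient(neighbors):
    """Average over nodes of (edges among a node's neighbors) / (possible edges)."""
    total = 0.0
    for adjacent in neighbors.values():
        k = len(adjacent)
        if k < 2:
            continue  # assumption: such nodes contribute 0 to the average
        links = sum(1 for u, v in combinations(adjacent, 2) if v in neighbors[u])
        total += links / (k * (k - 1) / 2)
    return total / len(neighbors) if neighbors else 0.0


def shortest_path_lengths(neighbors, source):
    """Breadth-first search distances, counted in edges, from `source`."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in neighbors[node]:
            if nxt not in distance:
                distance[nxt] = distance[node] + 1
                queue.append(nxt)
    return distance


def path_length_and_diameter(neighbors):
    """Return (average shortest path length L, diameter D)."""
    lengths = [d
               for source in neighbors
               for target, d in shortest_path_lengths(neighbors, source).items()
               if target != source]
    if not lengths:
        return 0.0, 0
    return sum(lengths) / len(lengths), max(lengths)


if __name__ == "__main__":
    # Toy network: a triangle a-b-c with one extra node d attached to c.
    toy = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
    print(clustering_coefficient(toy), path_length_and_diameter(toy))
```

Repeated words create shortcuts between otherwise distant parts of a text, which is why a more repetitive text gives smaller L and D in a calculation of this kind.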


Shown in Figure 2 is the 3D plot of the clustering coefficient, path length and diameter, which further illustrates the difference between the network parameters obtained for poems and prose works. Two clusters were observed, corresponding to the two literary genres.

These differences in the network parameters were used to detect the boundaries between poems and prose contained in a single article. When the clustering coefficient of the co-occurrence network is plotted for each window, peaks and dips can be observed, as in Figure 3. Since it has been established that poems have a lower clustering coefficient than prose, the peaks and dips can be presumed to be transitions between the two genres. A sudden decrease in the clustering coefficient indicates a switch from prose to poem. An abrupt change from low to high clustering coefficient signifies a transition from poem to prose. Poem-to-prose and prose-to-poem transitions were observed upon inspection of the part of the article contained in the text window where these abrupt changes occur. Thus, a method for genre boundary detection has been developed.

Figure 3. Clustering coefficient for a moving window 50 words wide with an increment of (a) 1, (b) 10, (c) 20 and (d) 50 words. Abrupt changes indicate genre boundaries.

For S < 50 words, the network created for both poetry and prose is small, and the parameters cannot be used to distinguish between the two genres. Shown in Figure 3 is the clustering coefficient of the network for S = 50 words shifted with an increment of I = 1, 10, 20 and 50 words. Peaks and dips are observed to occur at the same interval. Shifting the text window by smaller intervals just gives a more finely resolved indication of the location of the genre boundaries. The same is true for varying the text window size. However, a larger text window size decreases the computation time needed.

5. CONCLUSION

The structure of the co-occurrence language network reflects the differences in structure between prose and poetry. By visualizing the text via a co-occurrence network, one can have an overview of the underlying syntactic and semantic units as well as the central words that have low semantic content but important grammatical functions, such as articles, prepositions and auxiliaries. Functional words, such as articles, prepositions, and auxiliaries, are identified as hubs in the prose network. For poems, punctuation marks generally act as hubs.

Prose tends to have lower path lengths and diameter since words tend to be repeated more often in prose than in poetry. Clustering coefficients of prose are higher because words in prose are used more often in different combinations throughout the text than in verse. Although prose and poetry are identifiable by human readers, the method presented here is a novel way to quantify and identify the differences between the two literary genres. If both genres are present in a written text, the boundaries between the two can be detected by calculating the clustering coefficient of the text bounded by a moving window. A sudden increase in the clustering coefficient indicates a transition from poem to prose. An abrupt dip signifies the switch from prose to poem.
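The detection rule stated above, where a sudden rise in the clustering coefficient marks a poem-to-prose transition and a sudden dip marks a prose-to-poem transition, can be sketched as a simple threshold on the change between consecutive windows. The function name `detect_boundaries`, the `threshold` parameter and the sample values are hypothetical; the paper identifies the abrupt changes from the plotted curve and does not state a numerical criterion.

```python
def detect_boundaries(c_values, threshold):
    """Return (window_index, label) for each abrupt change in clustering coefficient.

    `c_values[i]` is C for the i-th window. `threshold` is an assumed minimum
    jump size that would have to be tuned for a given text and window size.
    """
    boundaries = []
    for i in range(1, len(c_values)):
        change = c_values[i] - c_values[i - 1]
        if change >= threshold:
            boundaries.append((i, "poem to prose"))
        elif change <= -threshold:
            boundaries.append((i, "prose to poem"))
    return boundaries


if __name__ == "__main__":
    # Made-up series for illustration only, not data from the paper.
    series = [0.06, 0.07, 0.02, 0.03, 0.02, 0.07, 0.06]
    print(detect_boundaries(series, threshold=0.03))
```

The reported index is the window in which the change is first seen, so the boundary itself lies within roughly one increment I of that window's position in the text.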


6. REFERENCES

[1] Kwon, O. and J. Lee, “Text categorization based on k-nearest neighbor approach for Web site classification,” Information Processing and Management, 39(1), 25-44 (January 2003).

[2] Shin, C., D. Doermann and A. Rosenfeld, “Classification of document pages using structure-based features,” International Journal on Document Analysis and Recognition, 3(4), 232-247 (May 2001).

[3] Denoyer, L. and P. Gallinari, “Bayesian network model for semi-structured document classification,” Information Processing and Management, 40(5), 807-827 (June 2004).

[4] Tong, S., and D. Koller, “Support Vector Machine Active Learning with Applications to Text Classification,” Journal of Machine Learning Research, 2, 45-66 (November 2001).

[5] Ruiz, M., and P. Srinivasan, “Hierarchical text classification using neural networks,” Information Retrieval, 5(1), 87-118 (January 2002).

[6] Ferrer, R., “Zipf’s law from a communicative phase transition,” Eur. Phys. J. B, 47, 449-457 (2005).

[7] Ferret, O., “How to thematically segment texts by using lexical collocations,” Proceedings of the 36th annual meeting on Association for Computational Linguistics, 1481-1483 (1998).

[8] Eckmann, J. P., E. Moses, and D. Sergi, “Entropy of dialogues creates coherent structures in e-mail traffic,” Proc. Nat. Acad. Sci. 101 (40), 14333 (2004).

[9] Eckmann, J-P, and E. Moses, “Curvature of co-links uncovers hidden thematic layers in the World Wide Web,” Proc. Nat. Acad. Sci. 99 (9), 5825 (2002).

[10] Alvarez-Lacalle E., B. Dorow, J.-P. Eckmann, and E. Moses, “Hierarchical structures induce long-range dynamical correlations in written texts,” Proc. Nat. Acad. Sci. 103 (9), 5825 (2006).

[11] Strogatz, S. H., Nature 410, 268 (2001).

[12] Bales, M. E., and S. B. Johnson, “Graph theoretic modeling of large-scale semantic networks,” Journal of Biomedical Informatics (2005).
