Feature Preprocessing on Web Page Language Identification∗

Choon-Ching Ng & Ali Selamat†

Intelligent Software System Research Laboratory (ISSLab), Faculty of Computer Science and Information Systems, Universiti Teknologi Malaysia, 81310 Johor Bahru.

[email protected]; [email protected]

ABSTRACT

Globalization has increased information flows between geographically remote locations and helped realize a common global market. With the rapid growth of multilingual data across the World Wide Web, the need for effective automated language identifiers has increased further. Multilingual web page applications include, for example, crawlers, search systems, translation or transliteration systems, and information retrieval. As more information finds its way into computer systems and the web, using manual methods to classify it is becoming increasingly infeasible. There are 7000 languages identified in the Ethnologue. However, fewer than 150 languages have been implemented in Google or Microsoft systems. Consequently, the problem of the digital divide across the internet will become more serious. Web page language identification is defined as the assignment of textual documents to one or more predefined languages based on their content. In this work, we discuss the effect of feature preprocessing on Arabic script web page language identification. The results show that preprocessed features give better performance than the original content.

Keywords

Web page language identification, feature preprocessing, letter frequency

Categories and Subject Descriptors H.4 [Information Systems Applications]: Miscellaneous; I.2.7 [Computing Methodologies]: Natural Language Processing—language models, text analysis

General Terms
Algorithms

∗ The portion of this research has been certified under Intellectual Property Corporation of Malaysia with the application number PI 20084793 on 26 November 2008.

† Present address: Intelligent Software System Research Lab (ISSLab), Faculty of Computer Science & Information Systems, Universiti Teknologi Malaysia, 81310 UTM Skudai, Johor, Malaysia. Tel.: +6-07-5532638; Fax: +6-07-5565044.

1. INTRODUCTION

Language here refers to natural language used for human communication, whether spoken or written. Ethnologue, a widely cited reference for languages around the world, reports 7000 languages [12]. Globalization has led to unrestricted information sharing across the internet, so communication among people in a bilingual environment is a critical issue to overcome. Abd Rozan et al. [1] have justified the importance of monitoring the behavior and activities of world languages in cyberspace. The information collected from such a study has implications for customized education, in which Information and Communication Technology (ICT) has to cope with the ‘digital divides’1 that exist both within countries and regions and between countries. Furthermore, Maclean [22] has reasserted the status of language as a topic of major interest to researchers in light of the rise of the transnational corporation, and Redondo-Bellon [25] has analyzed the effects of bilingualism on the consumer in Spain. All these examples reflect the importance of multiple languages in globalization. According to Internet World Stats, internet usage increased dramatically throughout the world between 2000 and 2008, for example in Middle Eastern countries such as Iran, Syria, Saudi Arabia and Yemen [13, 24]. In addition, the Summer Institute of Linguistics has reported that 69 languages, including English, are each spoken or used by more than 10 million people [8]. Since many people, such as speakers of Japanese, Arabic and Chinese, do not use an international language like English, language identification is needed to support multilingual processing systems. Language identification is the process of automatically determining the predefined language of a given content (e.g. English, Malay, Chinese, Japanese or Arabic).

In various applications, language is an important tool for human communication, and at present the language dominating the internet is English. A web page is a digital document displayed in a web browser. A web page can be written in diverse languages or in different scripts under an encoding scheme such as Unicode [2]. Figure 1 shows examples of web pages that use Roman script to display their content; the languages used on those web pages are Indonesian, Spanish, Malay and English [28]. A computer system can identify only the character set or encoding scheme that has been applied; it cannot discriminate the predefined language of the web page. Therefore, effective and automatic web page language identification is needed to solve this issue.

1 Digital divide refers to the disparity between those who have use of and access to ICT and those who do not [1].

Figure 1: Example of web pages in different languages using Latin script. a) Indonesian b) Spanish c) Malay d) English

This paper is organized as follows: related research on written language identification is described in Section 2. The research methodology and the proposed algorithms are described in Section 3. The experimental results and discussion are presented in Section 4. Finally, the conclusion is summarized in Section 5.

2. LITERATURE REVIEW

Language identification is often the initial step in a text processing system that may involve machine translation, semantic understanding, categorization, searching, routing or storage for information retrieval [7]. Traditionally, language identification was done manually. However, it is possible to automate the process of language identification2. Knowing the language of a text allows the correct dictionaries, sentence parsers, profiles, distribution lists and stop-word lists to be used. Incorrectly identifying the language would result in garbled translations, faulty or missing information analysis, and poor precision and recall in searching [20].

There are several important application areas for automatic language identification. As the global economic community expands, there is an increasing need for automatic language identification services [9, 27]. For example, checking into a hotel, arranging a meeting or making travel arrangements can be difficult for a non-native speaker. Telephone companies will be better equipped to handle foreign language calls if an automatic language identification system can be used to route the call to an operator fluent in that language. Furthermore, rapid language identification and translation can even save lives. There are many reported cases of emergency response operators being unable to understand the language of a distressed caller. In response to these needs, an automatic language identification system could serve as a front-end for a multi-language translation system [19, 32] in which the input speech can be in one of several languages. The input language needs to be identified quickly before translation to the target language can begin [31].

Language is a basic requirement for any type of communication, whether between persons or between organizations. A multilingual environment is a space in which communication among people of different races or countries occurs. Therefore, language identification is a core technology supporting various multilingual applications such as Optical Character Recognition (OCR) [3, 14, 16, 29], automatic transliteration systems [17], multilingual speech recognition [21], text categorization and spoken document retrieval [4, 19, 26, 29] (as shown in Figure 2).

Figure 2: Language identification is a core technology in any bilingual processing system

2 An example of automatic language identification is Google Translate, which provides the option of automatically identifying the language of a given text. http://translate.google.com

A multilingual environment could be rigidly defined as being native-like in two or more languages. Currently, a number of tools support multilingual environments, especially in web applications, for example Rosette Language Identifier [30], TextCat [6] and Xerox Language Identifier [10]. The most popular system, the TextCat Language Guesser, makes use of language-specific letter N-gram distributions and can identify 69 different natural languages, most of which are majority languages. According to Dunning [11], letter trigrams can identify the language almost without error from a text length of 500 bytes. However, most of the available applications focus on common languages such as European or Asian languages.

There are several difficulties when dealing with web documents: the programming code used for the visual appearance of a web page; grammatical errors in the contents of a web page; the character set used in formatting the web page (the charset of web documents) [23]; and the large number of short forms or terms used throughout the internet. All these examples reflect the noise contained in a web document, which can limit the identification process [31].

Language identification was typically performed by trained professionals [15]. The manual language identification process is very time-consuming and costly if it requires expertise in diverse languages, which limits its applicability. To overcome the inefficiency of the manual process, learning-based language identification has emerged. While existing language identification methods can produce reasonable results, they often do so at a large computational cost (in terms of both space and time) [15]. Many methods require large lists of words and/or n-grams with associated frequency counts for each language. Others require matrices whose size depends on the number of unique words and the number of documents in the reference language set. Calculations on large lists and matrices make these methods expensive to use [5].

With the rapid emergence and explosion of the internet and the trend of globalization, a tremendous number of textual documents written in different languages are electronically accessible online. Managing these documents efficiently and effectively is important to organizations and individuals. For this purpose, many studies have been carried out to automatically identify the language in which the information on a web document is written [31]. A suitable method of feature selection or extraction is required to extract the useful features from web documents before identification is performed. In turn, classification performance can be increased if the features used are reliable and robust [5]. Many efforts have been made to prevent the decline of minority and less-computerized languages in the online community. With the increasing number of web pages on the web, it has become necessary to provide techniques that identify and retrieve encoded information effectively and automatically.

3. T-TEST

The t-test is applied when the population is assumed to be normally distributed but the sample size is small enough that the statistic on which inference is based is not normally distributed, because it relies on an uncertain estimate of the standard deviation rather than on a precisely known value. A t-test compares the means of two groups, for example whether systolic blood pressure differs between a control and a treated group, between men and women, or between any other two groups. It is a hypothesis test for answering questions about the mean where the data are collected from two random samples of independent observations, each from an underlying normal distribution. The t-test is also a method of supporting stochastic models of population viability, based on assessing the mean and variance of the predicted population size [18]. Usually, when we use t-test functions, we have to identify the ranges that contain the two sets of sample data. A two-tailed test takes into account both ends of the sampling distribution when setting the critical region, while a one-tailed test is concerned with only one end of the sampling distribution [18]. When carrying out a t-test, we formally entertain two hypotheses:

H0: Population means are the same, µ1 = µ2.
H1: Population means are not the same, µ1 ≠ µ2.

The mean is the arithmetic average of a set of values or a distribution and is given by Equation 1.

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \tag{1}$$

Standard deviation is a measure of the dispersion of a set of values. It can apply to a probability distribution, a random variable or a population. The standard deviation is usually denoted by the letter σ. It is defined as the square root of the variance, as shown in Equation 2.

$$\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2} \tag{2}$$

The standard deviation of the mean S is given by Equation 3.

$$S = \frac{\sigma}{\sqrt{n}} \tag{3}$$

where σ is the standard deviation of the variable and n is the number of observations. The degree of freedom (DoF) is the number of parameters that may be varied independently in a problem. If there are two means to be estimated, then DoF is given by Equation 4.

$$DoF = n_1 + n_2 - 2 \tag{4}$$

The t statistic, which is used to determine whether the means are different, is calculated as in Equation 5.

$$t = \frac{\bar{x}_1 - \bar{x}_2}{S_{\bar{x}_1 - \bar{x}_2}} \tag{5}$$

and

$$S_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}\left(\frac{1}{n_1} + \frac{1}{n_2}\right)} \tag{6}$$

where $\bar{x}_1$ and $\bar{x}_2$ are the means, $n_1$ and $n_2$ the numbers of observations, and $S_1^2$ and $S_2^2$ the variances of the two samples being compared, here the original and the preprocessed data. Note that $S_{\bar{x}_1 - \bar{x}_2}$ is the standard error of the difference between the two means, computed from the pooled variance of the two samples. The t value is positive if the first mean is larger than the second and negative if it is smaller. The difference is not statistically significant if the t value lies between the negative and positive critical values. Conversely, the difference is statistically significant when the t value falls outside this interval, that is, in the critical (rejection) region [18].
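As an illustration only, Equations 4 to 6 can be sketched in a few lines of Python. The function names `pooled_t_test` and `is_significant` are hypothetical and not from the paper, and the sketch assumes the sample variance with an n - 1 denominator (as computed by `statistics.variance`), which is the usual choice for the pooled estimate.

```python
import math
from statistics import mean, variance


def pooled_t_test(sample_1, sample_2):
    """Return (t, dof) for a two-sample t-test with pooled variance."""
    n1, n2 = len(sample_1), len(sample_2)
    dof = n1 + n2 - 2  # Equation 4
    # Assumption: sample variances use the n - 1 denominator.
    s1_sq, s2_sq = variance(sample_1), variance(sample_2)
    pooled_var = ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / dof
    std_err = math.sqrt(pooled_var * (1 / n1 + 1 / n2))  # Equation 6
    t = (mean(sample_1) - mean(sample_2)) / std_err  # Equation 5
    return t, dof


def is_significant(t, critical_value):
    """Two-tailed decision: reject H0 when |t| exceeds the critical value."""
    return abs(t) > critical_value


# Toy data (not from the paper): two groups of three observations each.
t_value, dof = pooled_t_test([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
print(f"t = {t_value:.4f}, DoF = {dof}")
```

The critical value passed to `is_significant` would be read from a t table such as Table 4 for the chosen significance level and degrees of freedom.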

4. EXPERIMENTAL RESULTS

In this work, our proposed methods are based on letter frequency. Therefore, the letters of a document used for feature selection affect identification performance. In the following experiment, we show that data preprocessing of the original documents is important. The experiment was done on Arabic script web documents, with 100 web documents for each language. The languages included Arabic, Persian, Urdu and Pashto. The Backpropagation Neural Network (BPNN)3 was used to assess performance, with its Root Mean Square Error (RMSE) serving as the input to the T-Test analysis [18]. Firstly, the web documents are collected using a crawler. The data is divided into two groups, original and preprocessed. The original group keeps the content as downloaded from the web, while the preprocessed group is obtained by applying data preprocessing to the original data. The programming code in the original documents is removed4. Then, the code-free documents are further filtered so that letters outside the range between 1536 and 17925 are removed. The preprocessed data thus retains only the original web content. Table 1 shows the difference between original and preprocessed data.

Table 1: The difference between original and preprocessed data in T-Test analysis

Description          Original           Preprocessed
Language             Arabic Script      Arabic Script
Dataset              100 units          100 units
Feature Selection    Letter Weighting   Letter Weighting
Identifier           BPNN               BPNN
Data Preprocessing   No                 Yes

In order to do the T-Test analysis, we applied the RMSE of the BPNN. Table 2 shows the parameter settings of the BPNN. The input, hidden and output nodes are 20, 8 and 1, respectively; the learning rate is 0.01, the momentum rate is 0.009, the number of epochs is 1000, the minimum RMSE is 0.01, the features are normalized between -1 and 1, and the output is normalized between 0 and 1. As stated, the T-Test analysis comprises the following hypotheses H0 and H1:

H0: Population means are the same, µ1 = µ2.
H1: Population means are not the same, µ1 ≠ µ2.

Table 2: The backpropagation neural networks structure

Description           Value
Function              Logistic
Input Node            20
Hidden Node           8
Output Node           1
Learning Rate         0.009
Momentum Rate         0.0001
Epochs                1000
RMSE                  0.01
Features Normalized   -1 to 1
Output Normalized     0 to 1

We obtained the RMSE of the BPNN simulation based on 5-fold cross validation, as shown in Figure 3. The RMSE values of the original data are 0.2366, 0.2315, 0.1435, 0.1655 and 0.1066, respectively. The RMSE values of the preprocessed data are 0.0498, 0.0501, 0.0595, 0.0443 and 0.0604, respectively.

3 For details of the BPNN algorithm and the methodology used in collecting the data, refer to the work of Selamat and Ng [28].
4 The original documents are parsed with a standard Java function (version 1.4.2), namely HTMLEditorKit.ParserCallback, to remove the HTML code.
5 This is the decimal code point boundary of Arabic script letters in Unicode.
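The preprocessing described in this section (removing markup, then keeping only characters whose decimal code points lie between 1536 and 1792) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the paper reports using Java's HTMLEditorKit.ParserCallback, whereas this sketch substitutes Python's standard `html.parser`. The helper names, the inclusive treatment of both bounds, and the relative-frequency normalization are all assumptions.

```python
from collections import Counter
from html.parser import HTMLParser

# Decimal code point bounds quoted in the text (treated as inclusive here).
ARABIC_START, ARABIC_END = 1536, 1792


class _TextExtractor(HTMLParser):
    """Collect text nodes while skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def strip_markup(html):
    """Return the visible text of an HTML document."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


def keep_arabic_script(text):
    """Drop every character outside the Arabic script code point range."""
    return "".join(ch for ch in text if ARABIC_START <= ord(ch) <= ARABIC_END)


def letter_frequencies(text):
    """Relative frequency of each remaining character."""
    counts = Counter(text)
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()} if total else {}


page = ("<html><head><style>p{color:red}</style></head>"
        "<body><p>Hello \u0633\u0644\u0627\u0645</p></body></html>")
clean = keep_arabic_script(strip_markup(page))
print(letter_frequencies(clean))
```

Note that the range filter also removes spaces and Latin text, so the resulting frequencies are computed over Arabic script characters only.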

Figure 3: The Root Mean Square Error (RMSE) of original and preprocessed data using Backpropagation Neural Networks (BPNN)

Table 3: Statistical analysis of hypothesis test

Description   Preprocessed Data   Original Data
Mean (x̄)      0.0528              0.1767
STDEV (S)     0.0069              0.0564
n             5                   5
DoF           8
x̄1 − x̄2       -0.1239
Sx̄1−x̄2        0.0254
t             -4.8780

Based on the information in Figure 3, we further validated the critical region of the T-Test analysis. Table 3 shows the statistical hypothesis test (T-Test) analysis. The mean RMSE (x̄) of the preprocessed data and the original data is 0.0528 and 0.1767, respectively, so the x̄ of the preprocessed data is lower than that of the original data. The standard deviation (S) of the preprocessed data and the original data is 0.0069 and 0.0564, respectively. The number of observations (n) is 5 in each group, the degree of freedom (DoF) is 8, the difference of means (x̄1 − x̄2, preprocessed minus original) is -0.1239, and the standard error of the difference (Sx̄1−x̄2) is 0.0254, so the t value is -4.8780.

Based on Table 4 in Appendix A, the critical values are -2.015 and 2.015, for which the number of observations is 5, the significance level is 0.05 and the Confidence Interval (CI) is 95%. The t value (-4.8780) obtained in the experiment lies outside the interval between the critical values, as shown in Figure 4, so the difference is statistically significant and the null hypothesis H0 is rejected. Note that 2.015 is the table entry for DoF = 5 at the one-tailed 0.05 level; for DoF = 8, the two-tailed critical value at the 0.05 level is 2.31, and the t value lies outside that interval as well. Therefore, we conclude that the original data is significantly different from the preprocessed data.
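As a check on the arithmetic, the values reported in Table 3 can be substituted into Equations 5 and 6. The worked step below uses only the rounded figures from the table (standard deviations 0.0069 and 0.0564, means 0.0528 and 0.1767, five observations per group), so the last digits may differ slightly from a computation on the unrounded fold values.

```latex
\begin{align*}
S_{\bar{x}_1 - \bar{x}_2}
  &= \sqrt{\frac{(5 - 1)(0.0069)^2 + (5 - 1)(0.0564)^2}{5 + 5 - 2}
     \left(\frac{1}{5} + \frac{1}{5}\right)}
   \approx 0.0254, \\
t &= \frac{0.0528 - 0.1767}{0.0254} \approx -4.878 .
\end{align*}
```

Since |t| ≈ 4.88 exceeds both 2.015 and 2.31, the conclusion does not depend on which of these critical values is used.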

[Figure 4 plot: probability (0.0 to 1.0) against values (-4 to 4), marking the critical values T = -2.015 and T = 2.015 and the observed t = -4.878.]

Figure 4: Critical region of T-Test

5. CONCLUSIONS

Language identification is defined as the process of determining the predefined language of any content. It is used to support multilingual applications. Initially, we performed web page language identification using a letter frequency neural network. However, we found that the features used in the experiment are critical to producing promising web page language identification results. Therefore, in this work we have examined the effect of feature preprocessing on web page language identification. Based on the result analysis, we conclude that using preprocessed data is better than using the original data for web page language identification.

6. ACKNOWLEDGMENTS

This work is supported by the Ministry of Science, Technology & Innovation (MOSTI), Malaysia and Research Management Center, Universiti Teknologi Malaysia (UTM), under the Vot 79200.

7. REFERENCES

[1] M. Z. Abd Rozan, Y. Mikami, A. Z. Abu Bakar, and O. Vikas. Multilingual ict education: Language observatory as a monitoring instrument. In Proceedings of the South East Asia Regional Computer Confederation (SEARCC) 2005: ICT Building Bridges Conference, volume 46, pages 53–61, Sydney, Australia, 2005.
[2] J. D. Allen and C. Unicode. The Unicode Standard 5.0. Addison-Wesley, 2007.
[3] N. E. B. Amara and F. Bouslama. Classification of arabic script using multiple sources of information: State of the art and perspectives. International Journal on Document Analysis and Recognition, 5(4):195–212, 2003.
[4] G.-W. Bian and H.-H. Chen. Machine translation and the information soup, volume 1529/1998, chapter Integrating query translation and document translation in a cross-language information retrieval system, pages 250–265. Springer Berlin / Heidelberg, 1998.
[5] G. Botha, V. Zimu, and E. Barnard. Text-based language identification for the south african languages. In Proceedings of the 17th Annual Symposium of the Pattern Recognition Association of South Africa, pages 7–13, Parys, South Africa, 2006.
[6] W. B. Cavnar and J. M. Trenkle. Textcat, 2008. accessed on June 2008.
[7] G. Chowdhury. Natural language processing. Annual Review of Information Science and Technology, 37(1):51–89, 2003.
[8] B. A. C. Comrie. Language: Microsoft encarta online encyclopedia, 2007. accessed on 10 December 2007.
[9] P. Constable and G. Simons. Language identification and it: Addressing problems of linguistic diversity on a global scale. SIL Electronic Working Papers, 2000-2001.
[10] X. Corporation. Xerox language identifier, 2008. accessed 10 June 2008.
[11] T. Dunning. Statistical identification of language, 1994. In Technical Report CRL MCCS-94-273, Computing Research Lab (CRL), New Mexico State University.
[12] R. G. Gordon, B. F. Grimes, and L. Summer Institute of. Ethnologue: Languages of the world. SIL International, 2005.
[13] M. M. Group. Internet world users by language: Top ten internet languages used in the web, 2007. accessed on 15 November 2007.
[14] J. Hochberg, K. Bowers, M. Cannon, and P. Kelly. Script and language identification for handwritten document images. International Journal on Document Analysis and Recognition, 2(2):45–52, 1999.
[15] H. Jin and K. F. Wong. A chinese dictionary construction algorithm for information retrieval. ACM Transactions on Asian Language Information Processing (TALIP), 1(4):281–296, 2002.
[16] G. D. Joshi, S. Garg, and J. Sivaswamy. Document Analysis Systems VII, volume 3872/2006, chapter Script identification from Indian documents. Springer Berlin / Heidelberg, 2006.
[17] S. Y. Jung, S. L. Hong, and E. Paek. An english to korean transliteration model of extended markov window. In Proceedings of the 18th conference on Computational linguistics, volume 1, pages 383–389, 2000.
[18] Y.-S. Lee. The impact of vmax activation function in particle swarm optimization neural network. Master’s thesis, Universiti Teknologi Malaysia, 2008.
[19] G. A. Levow, D. W. Oard, and P. Resnik. Dictionary-based techniques for cross-language information retrieval. Information Processing and Management, 41(3):523–547, 2005.
[20] D. Lewandowski. Problems with the use of web search engines to find results in foreign languages. Online Information Review, 32(5):668–672, 2008.
[21] H.-Z. Li, B. Ma, and C.-H. Lee. A vector space modeling approach to spoken language identification. IEEE Transactions on Audio, Speech, and Language Processing, 15(1):271–284, 2007.
[22] D. Maclean. Beyond english: Transnational corporations and the strategic management of language in a complex multilingual business environment. Management Decision, 44(10):1377–1390, 2006.
[23] Y. Mikami and I. Suzuki. The language observatory project and its experiment: Cyber census survey. In Proceedings of the SCALLA 2004 Working Conference: Crossing the Digital Divide - Shaping Technologies to Meet Human Needs, 2004.
[24] P. J. Payack. The global language monitor, 2007. accessed on 20 November 2007.
[25] I. Redondo-Bellon. The effects of bilingualism on the consumer: The case of spain. European Journal of Marketing, 33(11/12):1136–1160, 1999.
[26] S. Sagiroglu, U. Yavanoglu, and E. N. Guven. Web based machine learning for language identification and translation. In Proceedings of the Sixth International Conference on Machine Learning and Applications, pages 280–285, 2007.
[27] S. A. SantoshKumar and V. Ramasubramanian. Automatic language identification using ergodic hmm. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 1, 2005.
[28] A. Selamat and C.-C. Ng. Arabic script language identification using letter frequency neural networks. International Journal of Web Information Systems, 4(4):484–500, 2008.
[29] P. Sibun and A. L. Spitz. Language determination: Natural language processing from scanned document images. In Proceedings of the fourth conference on Applied natural language processing, pages 15–21. Morgan Kaufmann Publishers Inc. San Francisco, CA, USA, 1994.
[30] T. Toman. Basic technology rosetta language identifier, 2008. accessed on June 2008.
[31] A. Xafopoulos, C. Kotropoulos, G. Almpanidis, and I. Pitas. Language identification in web documents using discrete hmms. Pattern Recognition, 37(3):583–594, 2004.
[32] J. Xu, J. Gao, K. Toutanova, and H. Ney. Bayesian semi-supervised chinese word segmentation for statistical machine translation. In Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), pages 1017–1024, 2008.

APPENDIX

A. CRITICAL VALUE OF T-TEST

Table 4: Critical value of T-Test (columns give the one-tailed significance level)

DoF    0.1     0.05    0.025   0.01    0.005
1      3.08    6.31    12.71   31.82   63.66
2      1.89    2.92    4.30    6.97    9.93
3      1.64    2.35    3.18    4.54    5.84
4      1.53    2.13    2.78    3.75    4.60
5      1.48    2.02    2.57    3.37    4.03
6      1.44    1.94    2.45    3.14    3.71
7      1.42    1.90    2.37    3.00    3.50
8      1.40    1.86    2.31    2.90    3.36
9      1.38    1.83    2.26    2.82    3.25
10     1.37    1.81    2.23    2.76    3.17
11     1.36    1.80    2.20    2.72    3.11
12     1.36    1.78    2.18    2.68    3.06
13     1.35    1.77    2.16    2.65    3.01
14     1.35    1.76    2.15    2.62    2.98
15     1.34    1.75    2.13    2.60    2.95
16     1.34    1.75    2.12    2.58    2.92
17     1.33    1.74    2.11    2.57    2.90
18     1.33    1.73    2.10    2.55    2.88
19     1.33    1.73    2.09    2.54    2.86
20     1.33    1.73    2.09    2.53    2.85
21     1.32    1.72    2.08    2.52    2.83
22     1.32    1.72    2.07    2.51    2.82
23     1.32    1.71    2.07    2.50    2.81
24     1.32    1.71    2.06    2.49    2.80
25     1.32    1.71    2.06    2.49    2.79
26     1.32    1.71    2.06    2.48    2.78
27     1.31    1.70    2.05    2.47    2.77
28     1.31    1.70    2.05    2.47    2.76
29     1.31    1.70    2.05    2.46    2.76
30     1.31    1.70    2.04    2.46    2.75
40     1.30    1.68    2.02    2.42    2.70
60     1.30    1.67    2.00    2.39    2.66
120    1.29    1.66    1.98    2.36    2.62
∞      1.28    1.65    1.96    2.33    2.58