FUZZY INTERVAL NUMBERS (FINs) TECHNIQUES AND THEIR APPLICATIONS IN NATURAL LANGUAGE QUERY PROCESSING AND DOCUMENT CLASSIFICATION

CHRISTOS SKOURLAS (1), THEODOROS ALEVIZOS (2), PETROS BELSIS (1), KOSTAS FRAGOS (1), VASSILIS KABURLAZOS (2), STELIOS PAPADAKIS (2)

(1) Department of Informatics, TEI of Athens, Ag. Spyridonos, Athens, Greece
e-mail: [email protected]; [email protected]; [email protected]

(2) Department of Industrial Informatics, TEI of Kavala, Kavala, Greece
e-mail: {alteo, vgkabs, spap}@teikav.edu.gr

In this paper, the definition, interpretation, and a computation algorithm of Fuzzy Interval Numbers (FINs) are briefly presented, and we discuss a series of FIN-based techniques for solving natural language query processing and classification problems. Examples showing the importance of these techniques are also presented, and our method is evaluated using a standard monolingual test set (MED) and bilingual records (bibliographic descriptions) extracted from the public databases of the Greek National Documentation Centre (NDC). We pay particular attention to the multi-language approach, with emphasis on the particularities of Greek and English text, queries, and test collections. The main advantage of our method is its simplicity: English stemming is the only language-dependent NLP tool used. It is worth mentioning that, in some experiments, we also used simple stop-word files and centroids to improve the accuracy of query and document classification.

Keywords: Fuzzy Interval Numbers (FINs); Information Retrieval; Bilingual Retrieval; Classification; Natural Language Query Processing; Cross Language Information Retrieval (CLIR)

1 INTRODUCTION

During the last fifteen years much effort has been spent experimenting with different techniques, and the collective effort of Information Retrieval (IR) researchers (TREC, CLEF, NTCIR) has produced systems able to retrieve effectively (Sanderson, 1999), (Berger, 2001). Cross Language Information Retrieval (CLIR) is a branch of IR devoted to overcoming language boundaries; experiments were first initiated in the early seventies. CLIR can be defined as the process of searching for texts written in foreign languages based on native-language queries, e.g. typing a query in English to retrieve documents written in Greek. For such a process to succeed, work must be conducted in one of the following two directions: a) translation of the user query written in the source language (e.g. English), or b) translation of the documents written in the target language (e.g. Greek). Oard (Oard, 1997) classifies free (full) text CLIR approaches into corpus-based and knowledge-based approaches. Knowledge-based approaches encompass dictionary-based and ontology (thesaurus)-based approaches, while corpus-based approaches encompass parallel, comparable and monolingual corpora. Dictionary-based systems translate query terms one by one using all the possible senses of each term. The main drawbacks of this procedure are: a) the lack of fully updated Machine Readable Dictionaries

(MRDs), and b) the ambiguity of term translations, which results in a 50% loss of precision (Davis, 1996). Since the machine translation of a query is less accurate than that of a full text, experiments have been conducted with collections containing machine translations of all the collection texts into all languages of interest. Such systems are really multi-monolingual systems. Parallel and comparable corpora systems are different: the parallel (or comparable) corpora are used to train the system, and after that no translations are used for retrieval. Today, there are large parallel corpora, such as that of the European Parliament. Some interesting systems in the past were based on Latent Semantic Indexing (LSI) (Dumais, 1996), (Berry, 1995). Today, there are other novel ideas; e.g. McNamee and Mayfield (McNamee, 2004) attempt to overcome the bottleneck of bad translation by translating n-grams instead of whole words.

1.1 TEXT RETRIEVAL AND GREEK LANGUAGE

The Greek language is characterized by its highly inflectional nature and its rich morphology. Tambouratzis and Carayannis (Tambouratzis, 2001) suggest that "to encompass the language evolution throughout the ages (ancient Greek, formal Greek (Katharevousa), and casual Greek (Dimotiki, Monotoniko)) one would need to generate a large number of different morphological lexica, each one covering a specific Language Evolution Sample". They also claim that, "for text retrieval and information extraction purposes", when working with texts in the Greek language it is important to be able to use two or three different stemming operators in order to fully represent the inflectional evolution of the language through time. Ralli (Ralli, 1986) mentions that in the Greek language the morphological phenomenon of derivation, together with inflection, leads to the formation of different word forms from the same stem. Hence, a word may contain several valid stems of the Greek language! According to Ralli (Ralli, 1992), Modern Greek is particularly rich in compound formations, usually defined as an association of two stems or of a stem and a word. Ralli and Galiotou (Ralli, 2004) discuss the computational processing of phenomena such as stress and syllabification, which are indispensable for the analysis of Greek compound words. Their recognition process for such constructions is still at a prototype level and is based on the PC-KIMMO version 2 software; certain limitations of the system with respect to the principles of compound formation are shown. Fragos et al. (Fragos, 2004) applied two statistical methods for extracting collocations from text corpora written in Modern Greek: the mean and variance method and a method based on the X² test. Both methods extracted interesting collocations.

1.2 TRANSLATION AND BILINGUAL RETRIEVAL

Savoy and Berger (Savoy, 2004b), in the 2004 CLEF evaluation campaign, evaluated various query translation tools, together with a combined translation strategy, often (though not always) obtaining a retrieval performance that is worth considering. However, while bilingual search can be viewed as easier for some language pairs (e.g., from an English query into a French document collection, or English to Portuguese), the task is clearly more complex for other pairs (e.g., English to Finnish). They automatically translated English requests (queries) into four different languages using the bilingual dictionary BABYLON (www.babylon.com) and nine freely available machine translation (MT) systems. They translate an English request word by word and, when more than one translation is provided, pick only a small number of the available translations (e.g. the first three). Savoy (Savoy, 2003) claims that a given translation tool may produce acceptable translations for one set of requests and perform poorly for other queries. For the Greek language we found only two translation tools (SYSTRAN, MLS), but unfortunately their overall performance was not very good for specific document collections, e.g. medical

bibliographic records. Savoy and Berger (Savoy, 2004b) stress that the main difficulty in their bilingual search was the translation of English topics into Finnish, due to the limited number of free translation tools. They conclude that, when handling less widely spoken languages, it seems worthwhile to consider other translation alternatives, such as probabilistic translation based on parallel corpora. For monolingual searches, as described in Savoy (Savoy, 2004a), a data fusion search strategy was used that combined the Okapi and Prosit probabilistic models; it was indicated that data fusion approaches may result in better retrieval effectiveness in some cases. At the moment, it seems that translation between Greek and English for specific scientific data collections is not a mature task. Hence, we have decided to use standard test sets and parallel corpora.

1.3 FUZZY (SETS) TECHNIQUES

Fuzzy (set) techniques were proposed for Information Retrieval (IR) applications many years ago (Radecki, 1979), (Kraft, 1993), mainly for modelling. Fuzzy Interval Numbers (FINs) were introduced in fuzzy system applications (Kaburlasos, 2004, 2006); the related theoretical (mathematical) background is based on the metric space of generalized intervals. A FIN (see Figure 1) may be interpreted as a conventional fuzzy set; additional interpretations of a FIN are possible, including a statistical interpretation. FINs and lattice algorithms have been employed in various real-world applications involving numeric and non-numeric data (Kaburlasos, 2006). Marinagi et al. (Marinagi, 2006) propose the use of a FIN classifier to handle problems of Cross Language Information Retrieval. In Section 2 a brief introduction to the definition and construction of FINs is given. In Section 3 the similarity of FINs (vectors) is defined and applied to the similarity of queries and documents. Section 4 presents FIN-based document classification, examples of our experimentation, and the evaluation of the FIN techniques. Conclusions are presented in Section 5.

2 DEFINITION AND CONSTRUCTION OF FINS

A FIN can be regarded as an abstract "mathematical object" with various interpretations and uses. Consider a vector (a "population") x = [x1, x2, ..., xN] of term frequencies ("measurements") that are real numbers sorted in ascending order. The dimension of a vector x is denoted by dim(x), e.g. dim([2,-1]) = 2, dim([-3,4,0,-1,7]) = 5. The median(x) of the vector x = [x1, x2, ..., xN] is defined to be a number such that half of the N numbers x1, x2, ..., xN are smaller than median(x) and the other half are larger; for instance, median([x1,x2,x3]) = x2, with x1 < x2 < x3, whereas median([x1,x2,x3,x4]) = (x2 + x3)/2, with x1 < x2 < x3 < x4. A FIN can be computed for the vector x (population) by applying the following CALFIN algorithm (Kaburlasos, 2006).

Algorithm CALFIN // Calculate a FIN
Let x be a vector of term frequencies (real numbers), sorted incrementally. Initially vector pts is empty.

function calfin(x) {
    if (dim(x) ≠ 1) {
        medi := median(x)
        insert medi in vector pts
        x_left := elements of vector x less than median(x)
        x_right := elements of vector x larger than median(x)
        calfin(x_left)
        calfin(x_right)
    }
} // function calfin(x)

Sort vector pts incrementally.

Store in vector val dim(pts)/2 numbers from 0 up to 1 in steps of 2/dim(pts), followed by another dim(pts)/2 numbers from 1 down to 0 in steps of 2/dim(pts).

For example, consider the following vector:
x = (4.333, 4.334, 4.335, 4.667, 4.668, 5, 5.25, 5.251, 5.252, 5.5, 5.501, 5.75, 6, 8, 8.001, 8.5, 8.501, 8.502, 9)
It follows that median(x) = 5.5 and
x_left = (4.333, 4.334, 4.335, 4.667, 4.668, 5, 5.25, 5.251, 5.252),
x_right = (5.501, 5.75, 6, 8, 8.001, 8.5, 8.501, 8.502, 9).
Then median(x_left) = 4.668 with halves (4.333, 4.334, 4.335, 4.667) and (5, 5.25, 5.251, 5.252), and median(x_right) = 8.001 with halves (5.501, 5.75, 6, 8) and (8.5, 8.501, 8.502, 9), etc.

The above procedure is repeated recursively log2(N) times, until "half vectors" containing a single number are computed; the latter numbers are, by definition (construction), median numbers. The computed median values are stored (sorted) in vector pts, whose entries constitute the abscissae of a positive FIN's membership function; the corresponding ordinate values are computed in vector val. Eventually, algorithm CALFIN computes two vectors, pts and val, where vector val includes the degrees of (fuzzy) membership of the corresponding real numbers in vector pts.

2.1 FINs AND DOCUMENTS' REPRESENTATION

In the Vector Space Model for Information Retrieval, a text document is represented by a vector in a space of many dimensions, one for each different term in the collection. In the simplest case, the components of each vector are the frequencies of the corresponding terms in the document: Dock = (fk1, fk2, ..., fkn), where fkj stands for the frequency of occurrence of term tj in document Dock. Table 1 depicts the vector space model of a small collection comprising four documents:

Doc1 = microscopy of lung or bronchi.
Doc2 = lung or bronchial neoplasms.
Doc3 = microscopy of lung. lung neoplasms. microscopy of bronchi.
Doc4 = blood in breast or prostatic neoplasms. microscopy of bronchial neoplasms.

To illustrate our notation: tf3,5 = 2 stands for the frequency of occurrence of term t5 (= lung) in document Doc3, and ctf5 = 4 stands for the total frequency of occurrence of term t5 in the whole collection. The collection term frequencies (ctf) are used as term identifiers; in order to ensure the uniqueness of the identifiers, a multiple of a small ε is added to the ctfs when needed.

TABLE 1 VECTOR SPACE MODEL FOR THE DOCUMENTS' COLLECTION

Term t_i     ctf_i   #docs   tf in Doc1   Doc2   Doc3   Doc4   Identifier
blood          1       1         0          0      0      1       1
breast         1       1         0          0      0      1       1.333
prostatic      1       1         0          0      0      1       1.667
bronchial      4       4         1          1      1      1       4
lung           4       3         1          1      2      0       4.25
microscopy     4       3         1          0      2      1       4.50
neoplasms      4       3         0          1      1      2       4.75
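Returning to the CALFIN construction above, the recursion can be sketched in Python. This is a minimal sketch of our reading of the algorithm: the treatment of singleton "half vectors" (stored as medians, per the text) and the exact construction of the membership degrees in val are our interpretation; the function names are ours.

```python
def _median(v):
    """Median of an incrementally sorted vector, as defined in Section 2."""
    n = len(v)
    return v[n // 2] if n % 2 else (v[n // 2 - 1] + v[n // 2]) / 2.0

def calfin(x):
    """CALFIN: compute a FIN (pts, val) from a population of measurements."""
    pts = []

    def rec(v):
        if not v:
            return
        if len(v) == 1:                    # a singleton is itself a median
            pts.append(v[0])
            return
        m = _median(v)
        pts.append(m)
        rec([e for e in v if e < m])       # x_left
        rec([e for e in v if e > m])       # x_right

    rec(sorted(x))
    pts.sort()
    n = len(pts)
    # membership degrees: ascending towards 1 in steps of 2/n, then descending
    val = []
    for i in range(n):
        if i < n // 2:
            val.append((i + 1) * 2.0 / n)
        elif i == n - 1 - i:               # middle element of an odd-length pts
            val.append(1.0)
        else:
            val.append((n - i) * 2.0 / n)
    return pts, val

# the 19-measurement population of the worked example in Section 2
x = [4.333, 4.334, 4.335, 4.667, 4.668, 5, 5.25, 5.251, 5.252, 5.5,
     5.501, 5.75, 6, 8, 8.001, 8.5, 8.501, 8.502, 9]
pts, val = calfin(x)
```

On this population the sketch reproduces the medians of the worked example (5.5, 4.668, 8.001, ...), with the overall median 5.5 receiving the membership degree 1.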

3 FUZZY INTERVAL NUMBERS AND FINS SIMILARITY

The basic idea for introducing a metric distance between arbitrary-shaped FINs is illustrated in Figure 1. Two documents (vectors of terms) of the CACM test collection are plotted as Fuzzy Interval Numbers constructed by the CALFIN algorithm. Each value on the term axis represents a term (the stem of the term). A FIN is constructed such that any horizontal line εh, h ∈ [0, 1], intersects the FIN at exactly two points (only for h = 1 is there a single intersection point). Hence, a horizontal line εh results in a "pulse" of height h, which is called a generalized interval of height h. In our example, the two generalized intervals, denoted by [a', c']h and [b', d']h, are positive and intersecting. The area [ac'ca'] is the support F(h) of the first FIN and the area [bd'db'] is the support of the second FIN. If h = 0.25, then each area contains about 75% of the values of the corresponding FIN. If a metric distance can be defined between two generalized intervals of height h, then a metric distance between two FINs is implied simply by computing the corresponding definite integral from h = 0 to 1. The points a, b, c, d in Figure 1 are used to define dh(F1(h),F2(h)) = dh([a,b]h,[c,d]h). As shown in Figure 2, a bell-shaped mass function is used for the calculation of the distance between two FINs (or the similarity between two queries / documents represented by their FINs).

[Figure 1 plot: membership degree h (0 to 1) versus term identifier, ctf (mod), 0 to 500; the intersection points a, b, c, d and a', b', c', d' are marked at heights h = 0.25, 0.5 and 0.75.]

FIGURE 1: FINs F1 AND F2 REPRESENTING DOCUMENTS OF THE CACM COLLECTION.

The concept of generalized intervals is used to introduce a metric into the lattice of Fuzzy Interval Numbers (FINs) (Kaburlasos, 2006). The interpretation of a generalized interval depends on the application; for instance, the presence of a feature (a term) in a document can be indicated by a positive generalized interval (Marinagi, 2006). The area "under" a generalized interval is a real number that can be calculated. We can define a metric distance and an inclusion measure function on the set (lattice) Mh of generalized intervals.

A positive generalized interval of height h ∈ (0,1], denoted by [x1, x2]h with x1 ≤ x2, is defined as a map [x1, x2]h : R → {0, h} given by
  [x1, x2]h(x) = h if x1 ≤ x ≤ x2, and 0 otherwise.
A negative generalized interval of height h ∈ (0,1], denoted by [x1, x2]h with x1 > x2, is defined as a map [x1, x2]h : R → {0, −h} given by
  [x1, x2]h(x) = −h if x1 ≥ x ≥ x2, and 0 otherwise.

The set of all positive generalized intervals of height h is denoted by Mh+, the set of all negative generalized intervals by Mh−, and the set of all generalized intervals by Mh.

FIGURE 2. TWO QUERIES / DOCUMENTS (VECTORS) ILLUSTRATED AS TWO FINS. EACH VALUE ON THE TERM AXIS REPRESENTS A TERM (STEM). A BELL-SHAPED MASS FUNCTION IS USED FOR THE CALCULATION OF THE DISTANCE (SIMILARITY) BETWEEN THE DOCUMENTS.

3.1 FIN METRICS

Let mh : R → R+ be a positive real function (a mass function) for h ∈ (0,1] (it could be independent of h), and let fh(x) = ∫(0 to x) mh(t) dt. The function fh is strictly increasing.

The real function vh : Mh+ → R, given by vh([a,b]h) = fh(b) − fh(a), is a positive valuation function on the set of positive generalized intervals of height h (Kaburlasos, 2006). Then

  dh([a,b]h, [c,d]h) = [fh(a∨c) − fh(a∧c)] + [fh(b∨d) − fh(b∧d)],

where a∧c = min{a,c} and a∨c = max{a,c}; therefore dh([a,b]h, [c,d]h) is a metric distance between the two generalized intervals [a,b]h and [c,d]h.

A positive Fuzzy Interval Number (FIN) is a continuous function F : (0,1] → Mh+ such that h1 ≤ h2 ⇒ support(F(h1)) ⊇ support(F(h2)), for 0 < h1 ≤ h2 ≤ 1. Given two positive FINs F1 and F2,

  d(F1, F2) = c ∫(0 to 1) dh(F1(h), F2(h)) dh,

where c is a user-defined positive constant, is a metric distance (for a proof see (Kaburlasos, 2004)).

3.2 CALCULATION OF THE DISTANCE BETWEEN DOCUMENTS
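The metric just defined can be sketched numerically in Python. This is a minimal sketch, not the paper's implementation: it assumes the unit mass function m(t) = 1, so that fh(x) = x; the representation of a FIN by its (pts, val) vectors and the midpoint sampling of the integral over h are our choices.

```python
def f_h(x):
    # f_h(x) = integral of the mass function from 0 to x;
    # with the unit mass function m(t) = 1 this is simply x
    return x

def d_h(ab, cd):
    # metric between generalized intervals [a,b]_h and [c,d]_h:
    # d_h = [f_h(max(a,c)) - f_h(min(a,c))] + [f_h(max(b,d)) - f_h(min(b,d))]
    (a, b), (c, d) = ab, cd
    return (f_h(max(a, c)) - f_h(min(a, c))) + (f_h(max(b, d)) - f_h(min(b, d)))

def interval_at(pts, val, h):
    # F(h): smallest and largest abscissae whose membership degree is >= h
    xs = [p for p, v in zip(pts, val) if v >= h]
    return (min(xs), max(xs))

def fin_distance(fin1, fin2, c=1.0, samples=100):
    # d(F1,F2) = c * integral over h in (0,1] of d_h(F1(h), F2(h)),
    # approximated here by midpoint sampling of h
    total = 0.0
    for i in range(samples):
        h = (i + 0.5) / samples
        total += d_h(interval_at(fin1[0], fin1[1], h),
                     interval_at(fin2[0], fin2[1], h))
    return c * total / samples

# two triangular FINs in (pts, val) form, the second shifted right by 1;
# under the unit mass function their distance is exactly 2 (the shift, twice)
F1 = ([0.0, 1.0, 2.0], [0.5, 1.0, 0.5])
F2 = ([1.0, 2.0, 3.0], [0.5, 1.0, 0.5])
dist = fin_distance(F1, F2)
```

The shifted pair gives dh = 2 at every height, so the integral evaluates to 2; two identical FINs give distance 0, as a metric requires.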

The FIN distance is used instead of a similarity measure between documents: the smaller the distance, the more similar the documents. For document comparisons, a bell-shaped mass function can be selected:

  mh(t) = (ρ + (1 − ρ)h) / { [σ + (1 − σ)t]² · [1 + ((1 − z)/z) · (t/(maxctf/ν) − 1)²] }

Figure 2 illustrates the calculation of the distance of two FINs F1 and F2 (representing two documents) using the mass function. The points a, b, c, d are used to define the distance at height h, dh(F1(h),F2(h)) = dh([a,b]h,[c,d]h); dh(F1(h),F2(h)) equals the sum of the areas of the shaded regions at height h. The distance between the two FINs is then the definite integral of this quantity from h = 0 to 1:

  Distance = ∫(h=0 to 1) ( | ∫(a' to b') f(t) dt | + | ∫(c' to d') f(t) dt | ) dh

4 EXAMPLES, EVALUATION AND DISCUSSION OF FINS BASED TECHNIQUES
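Before turning to the examples, the bell-shaped mass function of Section 3.2 can be sketched with the parameter values ρ = 0, σ = 1, ν = 2, z = 0.05 used in the experiments of this section. This is a sketch under our transcription of the formula (the exponent on the bracketed term and the helper name `mass` are our assumptions):

```python
def mass(t, h, maxctf, rho=0.0, sigma=1.0, nu=2.0, z=0.05):
    """Bell-shaped mass function m_h(t) of Section 3.2 (our transcription).

    With rho=0 and sigma=1 the leading factors reduce to h, and the
    function peaks at t = maxctf/nu; smaller z gives a narrower bell.
    """
    num = rho + (1.0 - rho) * h
    den = (sigma + (1.0 - sigma) * t) ** 2 \
          * (1.0 + ((1.0 - z) / z) * (t / (maxctf / nu) - 1.0) ** 2)
    return num / den

# with maxctf = 500 and nu = 2 the bell is centred at t = 250
profile = [mass(t, h=1.0, maxctf=500) for t in range(0, 501, 50)]
```

With these parameters the function is symmetric about t = maxctf/ν and attains its maximum value h there, which is the "bell" shape the text refers to.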

The basic features of a FIN method for query and document classification are the following: 1) queries and documents are represented as FINs, a representation based on the use of the collection term frequencies as term identifiers; 2) the FIN distance (metric) is used instead of a similarity measure. From previous experimentation (Marinagi, 2006) we know that the longer the queries and documents, the better the results of the method. Therefore, relatively long queries of the MED monolingual collection (Glasgow IDOM server) were selected as items of primary interest for our research and experimentation. We also focus on various samples of bibliographic descriptions extracted from the (bibliographic) databases of the Greek National Documentation Centre (NDC). Each bibliographic record (text, document) is publicly available in various formats (e.g. text, MARC XML). NDC hosts various bibliographic databases covering different research topics (e.g. medical bibliography, dissertations, financial reports and bibliography). Each bibliographic record comprises at least the following bilingual fields: title(s), abstract(s), author(s) and key-phrases.

4.1 SUCCESSFUL EVALUATION USING MONOLINGUAL TEST SET (MED)

The MED test set is composed of a collection of 1033 documents, a set of 30 natural language queries, and a list of the collection documents ("qrels") that are relevant to each query. An example is Query 14: "renal amyloidosis as a complication of tuberculosis and the effects of steroids on this condition. only the terms kidney diseases and nephrotic syndrome were selected by the requester. prednisone and prednisolone are the only steroids of interest".

The "qrels" for queries 14 and 25 are:
qrels14 = {23, 24, 25, 26, 28, 29, 454, 455, 456, 457, 459, 461, 463, 466, 467, 468}
qrels25 = {687, 688, 689, 690, 691, 692, 693, 694, 695, 697, 698, 699, 944, 945, 947, 948, 949, 951, 952, 953, 954, 955, 956, 958}

We can define the classification problem in the following way: all the "qrels" can be seen as classes of the MED collection, and we must correctly classify the queries using these classes. To illustrate our method, we shall classify query 14 (into "qrels14") and query 25 (into "qrels25"). First, we calculate the term frequency for all the terms of the queries and select representative features for all the classes / "qrels" (Table 2).

TABLE 2 REPRESENTATIVE FEATURES FOR SIX CLASSES ("qrels")

Class       Representative features (terms)
"qrel14"    Amyloidosis, Tuberculosis, Nephrotic, Prednisone, Prednisolone
"qrel17"    Nickel, Toxicity
"qrel20"    Cartilage, Pituitary, Dwarfism
"qrel25"    Diabetes, Insipidus, Chlorothiazide, Sodium
"qrel27"    Filaria, Parasite
"qrel29"    Neonatal, Jaundice, Bile, Duct, Biliary, Atresia, Hepatitis

After that, we can use various techniques based on the term frequencies. Table 3 summarizes some successful experimental results. In this case we used the representative features (terms) to construct one centroid for each class and then calculated the distance of every query from these centroids (see Table 3). If the FIN distance of a query from the centroid of a class is minimal, then the query is classified to this class. For the calculations we also used a stop-words file and the following parameters for the mass function: ρ = 0, σ = 1, ν = 2, z = 0.05. For the MED collection it is better to use TfIdf instead of simple tf (term frequency).

TABLE 3 EXAMPLES OF SUCCESSFUL CLASSIFICATION OF THE QUERIES 14 AND 25 USING CENTROIDS AND WEIGHTS

Distances of queries 1-30 from the centroid of "qrels14" (the minimal distance, for query 14, is marked):
14.3021; 14.1693; 14.2014; 13.8026; 14.0759; 14.4419; 14.2721; 13.1545; 13.8602; 14.1755; 13.5543; 12.8904; 14.3909; [query 14:] 7.93429; 14.0486; 14.1869; 13.9115; 13.5024; 13.0359; 13.5292; 13.7044; 14.0708; 14.4825; 13.8046; 14.0954; 14.0296; 14.0022; 13.351; 13.8408; 13.5773

Distances of queries 1-30 from the centroid of "qrels25" (the minimal distance, for query 25, is marked):
20.5103; 19.2489; 20.5912; 20.3334; 17.7892; 20.4212; 20.4358; 15.9012; 19.1723; 20.2346; 19.0442; 18.916; 20.5114; 19.2268; 20.5234; 20.112; 20.0813; 20.3913; 15.8797; 18.8823; 20.2971; 19.8986; 20.6172; 19.5655; [query 25:] 11.7364; 19.8817; 19.8922; 18.8049; 20.1242; 19.385
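The nearest-centroid rule illustrated in Table 3 can be sketched as follows; the distance values are those reported in Table 3 for queries 14 and 25 against the two centroids, and the helper name `classify` is ours:

```python
def classify(distances):
    """Nearest-centroid rule: assign the query to the class whose
    centroid has the minimal FIN distance from the query."""
    return min(distances, key=distances.get)

# FIN distances taken from Table 3 (queries 14 and 25 against the
# centroids of classes "qrels14" and "qrels25")
d_from_centroids = {
    "query 14": {"qrels14": 7.93429, "qrels25": 19.2268},
    "query 25": {"qrels14": 14.0954, "qrels25": 11.7364},
}
labels = {q: classify(d) for q, d in d_from_centroids.items()}
# labels == {"query 14": "qrels14", "query 25": "qrels25"}
```

Both queries are assigned to their correct class, matching the marked minima in Table 3.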

Our approach also focused on achieving high precision and recall. All the records of "qrels14" and "qrels25" formed a test set, and we then measured the efficiency of our method, for queries 14 and 25, using the F-measure:

  F-measure = 2 · (precision · recall) / (precision + recall).

For the calculation of the distances of the queries from the documents of the test set we used centroids, weights, a stop-words file, the values ρ = 0, σ = 1, ν = 2, z = 0.05 for the mass function, and TfIdf instead of simple tf (term frequency). The calculated value of the F-measure was promising (0.80).

4.2 MEDICAL BIBLIOGRAPHIC RECORDS

The records of the NDC were searched using the following search terms: 1) «παιδιατρική» ("pediatrics" in Greek) and 2) "pediatrics", extracting 9 records containing the search term in their key-phrases (782, 3282, 3371, 6570, 8303, 8421, 9202, 10481, 10711) and 5 records not containing the search term in their key-phrases (6525, 11102, 11931, 12532, 12498); and 3) "mastectomy", for which 25 records were extracted (183, 185, 1289, 1625, 1627, 1978, 1990, 2118, 3013, 4958, 5512, 6538, 6540, 7977, 8040, 8252, 8755, 9434, 9510, 10734, 10871, 11136, 11192, 12499, 14183), 16 of which contain the search term in their keywords. All records without an abstract were deleted, and the documents that have a search term in their keywords formed two classes: the Pediatrics class and the Mastectomy class. We then treated the rest of the documents as queries. All these queries (documents) were successfully classified using the FIN-based similarity (see Table 4 for some calculations).

TABLE 4 MEDICAL DOCUMENTS' CLASSIFICATION

Record   "Average" distance from   "Average" distance from   Comments
         the Pediatrics class      the Mastectomy class
6525     0.2048                    0.3769                    Correct
11102    0.2288                    0.3836                    Correct
11931    0.2129                    0.4252                    Correct
12532    0.1891                    0.3116                    Correct
12948    0.2930                    0.4910                    Correct

It is worth mentioning that: 1) the use of weights related to the features (terms) of a document improved the results; 2) the above-described technique of using centroids was also applied to our sample and improved the results remarkably; and 3) by assigning proper weights to the keywords of a bibliographic record, we can calculate two similar FINs for the Greek and the English parts of the same bilingual record.

5 CONCLUSION AND FURTHER WORK

We have concluded that the FIN-based calculation of the similarity between queries and documents (vectors) is a novel method for solving various problems. In particular, the method seems promising for query and document classification. We verified in our experiments that if we focus on documents related to different search terms, then we can easily apply FIN-based techniques and correctly calculate the distance (similarity) between them. The longer the documents, the better the results of the method. Document classification can be based on various FIN techniques: we can use a training set, form classes ("sub-collections") of the sample, and then calculate the FIN distance of the "unclassified" documents from these classes. This distance can be seen as an average distance of every unclassified document from all the elements of the class. Classification is more accurate if we use weights for the search terms contained in the retrieved documents and/or a penalty (negative weight) when a search term is not included. It is worth mentioning that using centroids, one for each class, generally improves the accuracy of document classification. At present, further experiments with promising results are being conducted, mainly with other bilingual samples extracted from the databases of the Greek NDC and with three standard monolingual collections, namely MED, CACM and WSJ.

REFERENCES

1. Berger, A. (2001). Statistical machine learning for Information Retrieval. Dissertation, Carnegie Mellon University, p. 147.
2. Berry, M. and Young, P. (1995). Using Latent Semantic Indexing for multi-language information retrieval. Computers and the Humanities, 29(6), 413-429.
3. Davis, M. (1996). New experiments in Cross-Language Text Retrieval at NMSU's Computing Research Lab. TREC-5.
4. Dumais, S.T., Landauer, T.K. and Littman, M.L. (1996). Automatic Cross-Linguistic Information Retrieval using Latent Semantic Indexing. Working Notes of the Workshop on Cross-Linguistic Information Retrieval, ACM SIGIR 1996.
5. Fragos, K., Maistros, Y. and Skourlas, C. (2004). Discovering Collocations in Modern Greek Language. 1st NLUCS 2004, 151-159.
6. Kaburlasos, V.G. (2004). Fuzzy Interval Numbers (FINs): Lattice Theoretic Tools for Improving Prediction of Sugar Production from Populations of Measurements. IEEE Trans. on Systems, Man and Cybernetics - Part B, 34(2), 1017-1030.
7. Kaburlasos, V.G. (2006). Towards a Unified Modeling and Knowledge-Representation based on Lattice Theory - Computational Intelligence and Soft Computing Applications. Springer-Verlag.
8. Kraft, D.H. and Buell, D.A. (1993). Fuzzy Set and Generalized Boolean Retrieval Systems. In: Readings in Fuzzy Sets for Intelligent Systems, D. Dubois, H. Prade, R.R. Yager (eds).
9. Marinagi, K., Alevisos, T., Kaburlasos, V.G. and Skourlas, C. (2006). Fuzzy Interval Number (FIN) Techniques for Cross Language Information Retrieval. 8th ICEIS.
10. McNamee, P. and Mayfield, J. (2004). JHU/APL Experiments in Tokenization and Non-Word Translation. LNCS 3227, 85-97.
11. Oard, D.W. (1997). Alternative Approaches for Cross-Language Text Retrieval. In: Cross-Language Text and Speech Retrieval, AAAI Technical Report SS-97-05. Available at http://www.clis.umd.edu/dlrg/filter/sss/papers/
12. Radecki, T. (1979). Fuzzy Set Theoretical Approach to Document Retrieval. Information Processing and Management, 15, Pergamon Press.
13. Ralli, A. (1986). Derivation and inflection. Studies in Greek Linguistics, 7th annual meeting of the Department of Linguistics, Aristotelian University of Thessaloniki, 29-48.
14. Ralli, A. (1992). Compounds in Modern Greek. Rivista di Linguistica, 4(1), 143-174.
15. Ralli, A. and Galiotou, E. (2004). Greek Compounds: A challenging case for the parsing techniques of PC-KIMMO. Intl Journal of Computational Intelligence, 2.
16. Sanderson, M. and Croft, B. (1999). Deriving concept hierarchies from text. Proceedings of the 22nd Annual International ACM SIGIR 1999, 206-213.
17. Savoy, J. (2003). Report on CLEF-2003 multilingual tracks. CLEF-2003.
18. Savoy, J. (2004a). Report on CLEF-2004 monolingual tracks. CLEF-2004.
19. Savoy, J. and Berger, P-Y. (2004b). Selection and Merging Strategies for Multilingual Information Retrieval. CLEF-2004.
20. Tambouratzis, G. and Carayannis, G. (2001). Automatic Corpora based Stemming in Greek. Literary and Linguistic Computing, 16(4), 445-466.
21. Glasgow IDOM server - IR Collections, http://ir.dcs.gla.ac.uk/resources/test_collections/
22. Greek National Documentation Centre server: www.ekt.gr
