The Contextual-Probability Model: Developing an Automated System for Structured Information Processing

Maria V. Zimakova
Penza State University, Russia

Abstract: In this paper we present a new complex contextual-probability approach to the recognition of the logical structure of semi-structured documents.


Contents
Introduction
1 The contextual-probability model
1.1 The model
1.2 Basic operations
1.3 The probability model
2 Constructing the structure grammar
3 The algorithm in more detail
3.1 The algorithm description
3.2 General algorithm of logical structure recognition
3.3 The algorithm using physical structure and contextual-probability dependences
4 Structured information storage and retrieval methods
5 Structured information storage and retrieval automated system
Conclusions
References

Introduction

The purpose of document recognition is to extract information from documents. This ranges from character recognition and identification of the layout of printed documents, through recognition of the document logical structure, to the (largely domain-dependent) extraction of semantic content at the high end. This paper focuses on recognizing the logical structure of untagged electronic documents in order to transform them into structured XML (eXtensible Markup Language) documents. This application plays an important role in the publication cycle: structured documents can be exchanged and further processed much more easily to produce hyper-documents, individualized printed documents or databases.

With the introduction of corporate networks for management-system support and the use of Web networks for inter-corporate information exchange, a growing need arose for management of distributed document style, representation and presentation, leading to the new meta-language XML, a subset of SGML. The problem of semi-structured data control is therefore of great scientific interest worldwide. It is being investigated by numerous corporations and scientific centers, such as Stanford University (USA), the Database Group from CS+E (Center 'Science + Education') at Washington University (USA), CEDAR (Center of Excellence for Document Analysis and Recognition) at Buffalo University (USA), CENPARMI (Center for Pattern Recognition and Machine Intelligence) at Concordia University (Canada), and DAR (Document Analysis and Recognition) at Fribourg University (Switzerland).

The development of an automated system for logical structure recognition for a given document class and for storing structured documents in a database is therefore a very important problem.
The application of this automated system to semi-structured information has required the development of new mathematical models and methods for logical structure recognition of semi-structured document classes. The implementation of storage functions in the automated system has required mapping the document logical structure to different database models and creating a special query language for retrieving structured data.

Some researchers have proposed applying syntactic analysis to document structure recognition, based on predefined knowledge of the documents' specific features. This method is applicable when the style of the document is known, i.e. the logical structure is known beforehand and can be expressed through physical structure elements according to a given set of rules, usually a regular or context-free grammar. Examples of using such grammars for logical structure recognition include the following: in [8, 12] various methods are proposed for extracting payment information from checks and financial documents; in [13] a grammar describing a document class covering technical reports, scientific papers and theses is used. These methods are applicable only if an unambiguous representation style for the given document class is known.

Several researchers have proposed methods that allow recognizing document structures which do not precisely correspond to a specific grammar. For example, in [2] the authors use fuzzy syntactic analysis for logical structure recognition. When the document structure cannot be derived by the given grammar, one or more "similar" elements are selected to correct the discrepancy between the structure and the grammar. However, if the distance from the given document style is significant, this method fails.

Another group of researchers has developed various methods for structure recognition systems with preliminary training. This setting corresponds to the analysis of a document set whose style is either not completely determined or simply unknown. In [6] a training system for logical structure grammar recognition is presented which uses a set of predetermined but changeable rules. The system of [4], based on a statistical n-gram model, also provides a training process. These methods rely on a preliminary training process on a specific document class.

The key information handling tasks for management systems are recognition, storage and retrieval of structured information. A critical analysis of logical structure recognition methods for semi-structured document classes has shown that iterative recognition methods with learning capabilities are most suitable for these tasks [14].
Combining parsing methods with a probabilistic approach provides a more adequate representation of document classes and enables effective methods and algorithms for handling semi-structured document classes.

1 The contextual-probability model

In this section we present the contextual-probability model as the underlying mathematical model for describing the logical structure of a document class, together with the corresponding methods for logical structure grammar recognition and for constructing a structure tree according to this grammar.

1.1 The model

The contextual-probability model of a document class is based not only on the physical and logical structure but also on statistics of logic element appearance in a given context. A contextual-probability model ℋƊ is a tuple consisting of the following three units:

ℋƊ = (GƊ, H, ℳ), GƊ = {NƊ, TƊ, PƊ, ΔƊ}, ℳ = (ΜT, ΜB, ΜL, ΜR),

where the context-free grammar GƊ determines the document logical structure, H associates document physical attributes with the document logical structure, and ℳ is a set of cubic matrices which determine contextual-probability dependencies of unit appearance in a document logical structure tree.

In order to impose structure on a document, a set M of logical structure labels is assumed. These labels are used to tag document fragments. A tagged fragment is referred to as a logic area of that document.

Definition 1. Let m ∈ M be a logical structure label and Γ a part of document D. Then the pair (m, Γ) is a logic area of document D.

Definition 2. Logic areas (m1, Γ1) and (m2, Γ2) are equal, denoted as (m1, Γ1) = (m2, Γ2), if m1 = m2 and Γ1 = Γ2.

Logic areas may be nested:

Definition 3. Logic area (m1, Γ1) is enclosed in logic area (m2, Γ2), denoted as (m1, Γ1) ≼ (m2, Γ2), if Γ1 ⊆ Γ2, and Γ1 = Γ2 ⇔ (m1, Γ1) = (m2, Γ2).

Let ℒ be the set of logic areas of document D, and ≼ℒ the restriction of ≼ to ℒ. Then (ℒ, ≼ℒ) is a lattice [3].

Next we focus on layout. Let Ψ be the set of possible physical (figure, table, equation) and formatting (font, alignment) attributes of document D. Then the mapping H: ℒ → Ψ associates logic areas of document D with physical and formatting attributes.
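As a minimal illustration, Definitions 1–3 can be sketched in Python under the assumption that a fragment Γ is a character range (the paper leaves Γ abstract; the labels below are hypothetical):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LogicArea:
    """A logic area (m, Γ): a label m tagging a fragment Γ of the document.
    Fragments are modeled as character ranges -- an assumption made only
    for this illustration."""
    label: str
    start: int
    end: int

def encloses(outer: LogicArea, inner: LogicArea) -> bool:
    """Definition 3: (m1, Γ1) ≼ (m2, Γ2) holds when Γ1 ⊆ Γ2; here the
    subset test is containment of character ranges."""
    return outer.start <= inner.start and inner.end <= outer.end

title = LogicArea("TITLE", 0, 40)
word = LogicArea("WORD", 5, 10)
print(encloses(title, word))   # the word lies inside the title
print(encloses(word, title))   # but not vice versa
```

The enclosure relation defined this way is reflexive, antisymmetric and transitive on the ranges, matching the lattice claim above.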

Example: In Figure 1 a web page is presented, consisting of three separate frames. We will focus on the frame on the right. A number of logical areas have been marked.


Figure 1. Course descriptions of some lecturing institute

In Figure 1 we can see the following elements of the set ℒ: (, Γ1), (, Γ2), (, Γ3), (, Γ4), (, Γ5), (, Γ6), (, Γ7). The grammar GƊ for this example is:

ΔƊ = TƊ = {, , , , , , , , , , , , , , }
NƊ = {, , , , , , , }
PƊ = {
(1) → {}
(2) {}{}
(3)
(4)
(5) → {}
(6)
(7)
(8) ? │ ? │ ? │ ? │ ? │ ?
}

Note that this grammar is very simple indeed, but it may be considerably expanded depending on the user's requirements. It should also be noted that GƊ is a top-level grammar: it is not elaborated down to the recognition of low-level lexemes. In this example the physical attributes almost unambiguously determine the terminal and non-terminal elements of the logical structure; note that in general this is not the case.



↔ Large font size, bold, black font color
↔ Green font color, no background
↔ Lilac background color, black font color
↔ Black font color, no background
↔ Image

1.2 Basic operations

With CF-grammar GƊ a forest FƊ = {T1, T2, …, TN, …} is associated, such that each grammar rule has a corresponding subtree (bush) in FƊ. The nodes in these trees correspond to terminal and non-terminal symbols of grammar GƊ; the edges reflect the production rules. This correspondence is captured by the function

ℑ: PƊ → ℙƊ,

which associates each production of grammar GƊ with some bush from ℙƊ, where ℙƊ is the set of all possible bushes of FƊ. In our running example, grammar rule (4) is represented by the bush in Figure 2: the root LECTION with children LEC_NUM, LEC_THEME and LEC_BODY.

Figure 2. The bush representing rule (4)

Let p1, p2 ∈ PƊ be productions of grammar GƊ and ℘1, ℘2 ∈ ℙƊ their corresponding bushes. Let productions p1 and p2 be as follows:

p1:

α → β1 β2 … βk

and

p2:

γ → δ1 δ2 … δm,

where α, γ ∈ NƊ; βi, δj ∈ NƊ ∪ TƊ, i = 1, 2, …, k; j = 1, 2, …, m. By analogy with traditional syntactic analysis [1] we add the following definitions.

Definition 4. Bush ℘1 is an ancestor of ℘2 (equivalently, ℘2 is a descendant of ℘1) if the right-hand side of rule p1 contains the left-hand side of rule p2: γ = βi for some i ∈ [1, k].

Definition 5. Bushes ℘1 and ℘2 are neighbors if they have a common ancestor p3 ∈ PƊ,

p3: ϕ → ψ1 ψ2 … ψl,

where ϕ ∈ NƊ, ψi ∈ NƊ ∪ TƊ, i = 1, 2, …, l, such that among the units of its right-hand side there are two units ψq and ψs with α = ψq ∧ γ = ψs, q, s ∈ [1, l].

Definition 6. Bush ℘1 is the left (right) neighbor of ℘2 if ℘1 is a neighbor of ℘2 and q < s (q > s). Note that two bushes can be both left and right neighbors. Graphic representations of these definitions and relations are shown in Figure 3.

To define the contextual-probability model of the document class it is necessary to define four relations between bushes in a tree [4]. Let ℘, ℘′, ℘′′ ∈ ℙƊ. Then
(a) NT (℘, ℘′, ℘′′) ≡ ℘ is an ancestor of ℘′ and ℘′ an ancestor of ℘′′;
(b) NB (℘, ℘′, ℘′′) ≡ ℘ is a descendant of ℘′ and ℘′ a descendant of ℘′′;
(c) NL (℘, ℘′, ℘′′) ≡ ℘ is a left neighbor of ℘′ and ℘′ a left neighbor of ℘′′;
(d) NR (℘, ℘′, ℘′′) ≡ ℘ is a right neighbor of ℘′ and ℘′ a right neighbor of ℘′′.
These relations are not independent: NB (℘, ℘′, ℘′′) = NT (℘′′, ℘′, ℘) and NL (℘, ℘′, ℘′′) = NR (℘′′, ℘′, ℘).

Figure 3. (a) – NT (℘, ℘′, ℘′′); (b) – NB (℘, ℘′, ℘′′); (c) – NL (℘, ℘′, ℘′′); (d) – NR (℘, ℘′, ℘′′)

1.3 The probability model

The probability that bush ℘ is an ancestor of ℘′ and ℘′ an ancestor of ℘′′, for every ℘, ℘′, ℘′′, can be reduced to a cubic contextual-probability matrix

ΜT = ||μijk||, i, j, k = 1, …, N + 1,

where the elements of the matrix express the following conditional probability:

μijk = Ρ(℘ = ℘i | NT (℘, ℘j, ℘k)).

Note that ℘ is not a tree; it is a bush in some tree of FƊ (see Figure 3). This conditional probability and its practical estimate are as follows:

Ρ(℘ = ℘i | NT (℘, ℘j, ℘k)) = Ρ(NT (℘i, ℘j, ℘k)) / Ρ(NT (℘j, ℘k)),

Ρ̂(℘ = ℘i | NT (℘, ℘j, ℘k)) = W(NT (℘i, ℘j, ℘k)) / W(NT (℘j, ℘k)),

where W(A) is the frequency of event A and Ρ̂ is the practical estimate of the conditional probability Ρ. The cubic contextual-probability matrices ΜB, ΜL and ΜR are reduced analogously.

The resulting contextual-probability model of the document class is closer to powerful linguistic models but still allows effective algorithmization of the methods. Based on the contextual-probability model of the document class, an iterative algorithm for logical structure grammar determination can be designed. This algorithm solves the inverse parsing problem for this case.
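The frequency-based estimate of μijk can be sketched as follows; the sparse counter representation (instead of an explicit cubic matrix) is an implementation choice for illustration, not from the paper:

```python
from collections import Counter

class ContextProbabilityMatrix:
    """Practical estimate of mu_ijk = P(bush = i | NT(., j, k)) from
    observed frequencies W, following the estimate above."""
    def __init__(self):
        self.w_ijk = Counter()  # W(NT(i, j, k))
        self.w_jk = Counter()   # W(NT(j, k))

    def observe(self, i, j, k):
        """Record one observed ancestor chain: bush i above j above k."""
        self.w_ijk[(i, j, k)] += 1
        self.w_jk[(j, k)] += 1

    def mu(self, i, j, k):
        """Estimated conditional probability; 0 if (j, k) was never seen."""
        denom = self.w_jk[(j, k)]
        return self.w_ijk[(i, j, k)] / denom if denom else 0.0

m = ContextProbabilityMatrix()
for triple in [(1, 2, 3), (1, 2, 3), (4, 2, 3)]:
    m.observe(*triple)
print(m.mu(1, 2, 3))  # 2 of the 3 observed (j, k) = (2, 3) chains
```

The same accumulation scheme would serve for the ΜB, ΜL and ΜR matrices by swapping the relation being counted.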

2 Constructing the structure grammar

The construction of a document logical structure grammar can be divided into two main stages: low-level grammar construction and top-level grammar construction. The separation of top-level grammar lexemes according to knowledge base rules belongs to the low-level grammar construction. From the point of view of systems analysis this process corresponds to the transformation of an abstract model into a concrete model. The use of a knowledge base, in which the initial information about lexeme selection is stored, allows the designed abstract model of the document class to be filled with concrete data about the considered document class.

First we consider the construction process of a document class logical structure grammar G. The input for this process is a (possibly infinite) sequence of documents:

Ɗ = {D1, D2, …, DN, …},

where each document Di (i ≥ 1) is described by some grammar Gi. The grammar construction may then be carried out by an iterative method with initial approximation G[0] = ∅. G[k] (k > 0) is introduced by successive approximations

G[k] = ψ (G[k-1] ∪ ψ ∘ ϕ ∘ R (Dk)), k = 1, 2, …, N,

where ψ is a function that associates grammar alternatives, ϕ a grammar generalization function, and R a function that converts the lexeme set into the document class grammar. This grammar sequence converges to some solution G̃:

G[k] → G̃ as k → ∞.

Let L[k] (k > 0) be the language generated by grammar G[k], and let language L̃ be generated by the limit grammar G̃. Then the size of the generated languages grows: L[k] ⊆ L[k+1]. Thus the languages L[k] (k > 0) gradually approach the language L̃.

Let us define a residual function f as the residual of the precise and approximated languages L̃ and L[k]. The iterative approximation process should minimize the absolute value |f(L̃, L[k])| of the residual function:

min → |f(L̃, L[k])| = |L̃ \ L[k]|.

Obviously this criterion cannot be used in practice because:
(a) the language L̃ is not known beforehand and is a theoretical abstraction;
(b) the determination of the language L̃ generated by an arbitrary CF-grammar G̃ without additional information is an algorithmically unsolvable problem.

A more practical criterion is the following: if at iteration N grammar G[N] does not undergo any changes in comparison with grammar G[N-1], the iterative process can be terminated.

The most complicated task in determining the document class logical structure grammar is the construction of the low-level grammar set from some set of lexemes. In general this task is ill-posed and unsolvable without additional information; therefore the use of a predetermined knowledge base is necessary.
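Under strong simplifications (grammar rules as hashable tuples, the functions ψ, ϕ and R collapsed into a single extraction function), the iterative construction with the practical stopping criterion might be sketched as:

```python
def learn_grammar(documents, extract_rules):
    """G[0] = empty; G[k] = G[k-1] merged with rules extracted from
    document k. Stops when an iteration leaves the grammar unchanged
    (the practical criterion G[N] = G[N-1])."""
    grammar = frozenset()  # G[0]
    for doc in documents:
        updated = grammar | frozenset(extract_rules(doc))
        if updated == grammar:  # no change: terminate early
            return grammar
        grammar = updated
    return grammar

# Hypothetical rule extractor: every word yields a rule S -> word.
docs = ["a b", "a b c", "a b"]
g = learn_grammar(docs, lambda d: {("S", w) for w in d.split()})
print(sorted(g))
```

Here the third document contributes nothing new, so the loop terminates by the unchanged-grammar criterion rather than by exhausting the input.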

3 The algorithm in more detail

3.1 The algorithm description

Next we analyze this construction process. Let Di ∈ Ɗ be a training document of the selected class, ℒi the set of all logic areas into which document Di is broken, and Gi the logical structure grammar of this document Di. The process of grammar construction is divided into two fundamental stages: construction of the grammar set {Gi(1), Gi(2), …, Gi(q)} on the lower level (this grammar set is constructed from the logic areas ℒi through the function R), and construction of the grammar Gi on the upper level (this grammar is constructed from the grammar set {Gi(1), Gi(2), …, Gi(q)} through the iteration process above). The functions ϕ and ψ are described below, as far as the scope of this paper allows.

Let us assume each document Di (i > 0) from the training set has been fragmented into a set of strings

Si = {Si(1), Si(2), …, Si(q)}, i = 1, 2, …, N,

according to its logical structure. This splitting should be such that each string Si(j) (j = 1, 2, …, q) uniquely corresponds to some logic area (mi(j), Γi(j)) ∈ ℒi not containing any other nested logic areas, and defines some initial language Li(j). According to the data stored in a knowledge base, the generating grammar G′i(j) can be obtained from the language Li(j). However, the grammar G′i(j) constructed this way will only determine the string Si(j). Therefore the language Li(j) is expanded so that it also includes strings syntactically similar to string Si(j). Applying the grammar generalization function [6] to the resulting grammar G′i(j) yields the low-level grammar Gi(j):

Gi(j) = ϕ (G′i(j)), i = 1, 2, …, N, j = 1, 2, …, q.

Example: In the running example (see Figure 1), the string What the searcher wants is an atomic fragment Si(j) for some i and j. The corresponding grammar G′i(j) is:

ΔƊ = {}
TƊ = {}
NƊ = {}
PƊ = {
(1) →
(2) →
(3) → 'A'|'B'|'C'|'D'|'E'|'F'|'G'|'H'|'I'|'J'|'K'|'L'|'M'|'N'|'O'|'P'|'Q'|'R'|'S'|'T'|'U'|'V'|'W'|'X'|'Y'|'Z'
(4) → 'a'|'b'|'c'|'d'|'e'|'f'|'g'|'h'|'i'|'j'|'k'|'l'|'m'|'n'|'o'|'p'|'q'|'r'|'s'|'t'|'u'|'v'|'w'|'x'|'y'|'z'
(5) → ' '
}

And the grammar Gi(j) = ϕ (G′i(j)) is:

ΔƊ = {}
TƊ = {}
NƊ = {}
PƊ = {
(1) → {{}?}
(2) → 'A'|'B'|'C'|'D'|'E'|'F'|'G'|'H'|'I'|'J'|'K'|'L'|'M'|'N'|'O'|'P'|'Q'|'R'|'S'|'T'|'U'|'V'|'W'|'X'|'Y'|'Z'
(3) → 'a'|'b'|'c'|'d'|'e'|'f'|'g'|'h'|'i'|'j'|'k'|'l'|'m'|'n'|'o'|'p'|'q'|'r'|'s'|'t'|'u'|'v'|'w'|'x'|'y'|'z'
(4) → ' '
(5) →
}

Based on the constructed set of low-level grammars {Gi(1), Gi(2), …, Gi(q)} and the set of logic areas ℒ labeled by the user, we create an initial high-level grammar G′i. This grammar is extended by first applying the generalization function and then applying the association function ψ to rules with identical left-hand sides. The resulting top-level grammar is:

Gi = ψ (ϕ (G′i)), i = 1, 2, …, N.

Next we consider the metagrammar structure in more detail. Assume the knowledge base has a preset metagrammar GM = {NM, TM, PM, ΔM}. Furthermore, let us consider some limitations to be obeyed by the set of metagrammar rules PM = {π1, π2, …, πq}. Let rule i of the metagrammar GM have the form

αi → βi1 βi2 … βiki,

where αi ∈ NM, βij ∈ TM ∪ NM; ki are positive numbers; i = 1, 2, …, q; j = 1, 2, …, ki. Let Α ⊆ NM be the set of all nonterminal symbols αi standing on the left-hand side of rules πi, and Β the set of all right-hand sides βi1 βi2 … βiki of rules πi, i = 1, 2, …, q. Then the set of rules PM should obey the following main condition: μ: Β → Α is an injective function. When this condition holds on the set PM, it is possible to introduce the relation ⊴ as follows:

π1 ⊴ π2 ⇔ π1 = π2,
or ∃ β2j, j = 1, 2, …, k2, such that β2j = α1,
or ∃ π(1), π(2), …, π(r) ∈ PM such that π(1) = π1, π(r) = π2 and π(i) ⊴ π(i+1), i = 1, 2, …, r – 1.

Theorem 1. The relation ⊴ is an ordering of the set of rules PM. The set of rules PM of metagrammar GM forms a lattice.

3.2 General algorithm of logical structure recognition

This leads to the following algorithm to obtain the document class logical structure description:
1. Reset the iteration counter: k ← 0; make the initial approximation G[0] ← ∅.
2. Extract the next document Dk+1 from the learning sample.
3. Divide document Dk+1 into a set of logic areas ℒk+1.
3a. From ℒk+1 construct the tree of its logical structure Tk+1.
3b. From tree Tk+1 construct the cubic probability matrices ΜT(k+1), ΜB(k+1), ΜL(k+1) and ΜR(k+1) for document Dk+1.
3c. Merge matrices ΜT(k+1), ΜB(k+1), ΜL(k+1) and ΜR(k+1) into the general probability matrices ΜT, ΜB, ΜL and ΜR respectively for document class Ɗ.
4. Construct the set of low-level grammars {Gk+1(1), Gk+1(2), …, Gk+1(q)}.
5. Construct the top-level grammar Gk+1.
6. Integrate grammars G[k] and Gk+1.
7. Integrate rules with identical left-hand sides in the obtained grammar G[k+1].
8. Compare grammars G[k] and G[k+1]. If the grammars are congruent (T[k] = T[k+1], N[k] = N[k+1] and P[k] = P[k+1]) then exit; else k ← k + 1 and go to step 2.
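Step 3b collects ancestor chains from the logical structure tree before they are counted into the matrices. A sketch, assuming trees are encoded as (label, children) pairs (an encoding chosen here for illustration):

```python
def nt_triples(tree):
    """Enumerate NT(p, p', p'') triples -- grandparent/parent/child
    chains -- from a logical-structure tree given as (label, [children])
    pairs. These triples feed the frequency counts behind matrix MT."""
    out = []
    def walk(node, parent=None, grand=None):
        label, children = node
        if grand is not None:
            out.append((grand, parent, label))
        for child in children:
            walk(child, label, parent)
    walk(tree)
    return out

doc = ("DOC", [("SECTION", [("PARA", [])]), ("SECTION", [])])
print(nt_triples(doc))  # only DOC -> SECTION -> PARA forms a full chain
```

The NB triples are the same chains reversed, matching the identity NB(℘, ℘′, ℘′′) = NT(℘′′, ℘′, ℘) noted earlier.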

The result of this algorithm is the approximated grammar G̃. Based on the constructed model of the document class and the logical structure grammar, parsing methods and algorithms for the document logical structure using the physical structure and contextual-probability dependences are designed.

3.3 The algorithm using physical structure and contextual-probability dependences

Let Λ = {λ1, λ2, ..., λh} be the set of leaves and Δ the root of the tree. Let P = {p1, p2, ..., pg} be the set of logic units defined by the physical attributes. Select a subset P′ of P: P ⊇ P′ = {p(1), p(2), ..., p(q)}, where p(i) ∈ P, i = 1, 2, ..., q and q ≤ g, such that each unit p(i), i = 1, 2, ..., q, is at the minimum level above the leaves of the tree. Then, based on the set of leaves Λ, it is possible to construct a partition whose cosets consist of those leaves of Λ that are descendants of the same unit of the set P′ ∪ {Δ}. Thus we have (q + 1) cosets Ci, i = 1, 2, ..., q + 1, where the first q cosets correspond to the q units of set P′ and the coset Cq+1 corresponds to the initial unit Δ. Let P′′ = P′ ∪ {Δ} = {p(1), p(2), ..., p(q), p(q+1)}, where p(q+1) = Δ, and let the relation Des(a, b) mean that a is a descendant of b. We obtain

λ ∈ Ci ⇔ Des (λ, p(i)) = true,

where p(i) ∈ P′′; i = 1, 2, ..., q + 1; λ ∈ Λ.

The following parsing method for the document logical structure using the physical structure is offered:
1. Partition Λ into cosets Ci, i = 1, 2, ..., q + 1.
2. Bottom-up: select the subtree T(i) with root p(i) and leaves from coset Ci, i = 1, …, q + 1.
3. Top-down: perform syntactic analysis of the subtree T(i).
4. Insert the obtained subtree T(i) into the common tree T.
5. If the root of the subtree T(i) coincides with the root of the tree T, then exit; else go to step 2.

The use of contextual-probability dependences makes the top-down parsing method more efficient through the introduction of fuzzy sets defining the probabilities of bush appearances in a tree, {(℘, ξΒ(℘))}, ∀ ℘ ∈ ℙƊ, and of given tree appearances in a set of trees, {(T, ξΘ(T))}, ∀ T ∈ FƊ. The characteristic functions of these fuzzy sets can be described as follows [4]:

ξΒ (℘, α) = ( ((ΜT(℘))^α + (ΜB(℘))^α + (ΜL(℘))^α + (ΜR(℘))^α) / 4 )^(1/α),

ξΘ (℘1, ℘2, …, ℘q; α) = ( ((ξΒ(℘1))^α + (ξΒ(℘2))^α + … + (ξΒ(℘q))^α) / q )^(1/α).
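Both characteristic functions have the form of a power (generalized) mean of order α; a minimal sketch:

```python
def power_mean(values, alpha):
    """((x1^a + ... + xn^a) / n)^(1/a) -- the common form of the
    characteristic functions xi_B and xi_Theta above."""
    return (sum(v ** alpha for v in values) / len(values)) ** (1.0 / alpha)

probs = [0.9, 0.5, 0.1, 0.7]
print(power_mean(probs, 1))    # alpha = 1: plain arithmetic mean
print(power_mean(probs, -20))  # strongly negative alpha nears the minimum
```

Varying α tunes how pessimistically the four matrix estimates are combined: α = 1 averages them, while large negative α is dominated by the smallest probability.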

Let ϑ ∈ NƊ and ϕ ∈ TƊ be the initial and final units of a chain. Then the «following chain search» method [11] used in the classical top-down parsing method can be replaced by the proposed «most approaching chain search» method, which selects the chain having the highest probability of appearance in the given context:
1. c(0) = ϑ.
2. Using the classical «following chain search» method, find the next unit ck[r] which can be used to continue the current chain c(k – 1) = c0 c1 … ck–1.
3. Calculate the probability of appearance of the chain {c(k – 1) & ck[r]} in the parsing tree: ξΘ[r] (℘0 ℘1 … ℘k[r]; α), where ℘i is the bush with root node ci and ℘k[r] is the bush with root node ck[r].
4. Excluding element ck[r] and increasing the value r, repeat steps 2–3 while a next element ck[r] can be found.
5. Consider the set of probabilities {ξΘ[1], ξΘ[2], …, ξΘ[r]} and select from it the maximal probability ξΘ[q]. Then ck = ck[q].
6. Add the new unit to the chain: c(k) = c(k – 1) & ck, where c(i) = c0 c1 … ci, i = 0, 1, …, k – 1.
7. Repeat steps 2–6 until the equality ck = ϕ holds or all possible chains have been examined.

As a result of this algorithm we obtain a chain c(n) = c0 c1 … cn, where c0 = ϑ, cn = ϕ and ξΘ(℘0 ℘1 … ℘n; α) is maximal, i.e. the most probable chain in this context.

Theorem 2. There is a constant c such that, when processing an entry chain w of length n ≥ 1, the algorithm performs no more than qc^n elementary operations (where q is the number of logic units defined by the physical attributes), provided

that the calculation of one step of the algorithm requires a fixed number of elementary operations.

Theoretical evaluations of the complexity of the designed algorithms have shown that if no statistical information is available they are exponential. However, as statistical information accumulates, their behavior approaches that of deterministic parsing methods.
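Step 1 of the parsing method in this section (partitioning the leaves Λ into cosets by their lowest ancestor from P′, with the root Δ collecting the remainder) can be sketched as follows; the (label, children) tree encoding is an assumption for illustration:

```python
def partition_leaves(tree, gates):
    """Partition the leaves into cosets keyed by the lowest enclosing
    unit from P' (the labels in `gates`); leaves under no such unit fall
    into the root coset, which plays the role of C_{q+1}."""
    cosets = {}
    def walk(node, current):
        label, children = node
        if label in gates:
            current = label          # a unit of P' becomes the coset key
        if not children:
            cosets.setdefault(current, []).append(label)
        for child in children:
            walk(child, current)
    walk(tree, tree[0])              # the root label stands for Delta
    return cosets

doc = ("DOC", [("LECTION", [("LEC_NUM", []), ("LEC_THEME", [])]),
               ("FOOTER", [])])
print(partition_leaves(doc, {"LECTION"}))
```

Each resulting coset then seeds one bottom-up subtree in step 2, so the expensive top-down analysis runs only within small, physically delimited regions.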

4 Structured information storage and retrieval methods

In the thesis a sentential calculus for structured queries is designed. It is based on the following mappings:

I: ΩINDEX → Z;
K: Z → ΩINDEX / Ker I;
J: S → ΩTAGS.
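A toy illustration of the mappings I and J, under the assumptions that indexed terms are single words, indexes are word positions, and J recognizes a purely hypothetical tag pattern (none of these specifics are fixed by the paper):

```python
import re

def index_terms(words):
    """Mapping I: associate each indexed term (here simply a word) with
    a unique index given by its position in the document."""
    return {pos: word for pos, word in enumerate(words)}

def tag_heading(s):
    """Mapping J: single out a substring of s and associate it with a
    marking tag. The heuristic (a run of capitalized words is a HEADING)
    is illustrative only."""
    m = re.match(r"([A-Z][a-z]*(?: [A-Z][a-z]*)*)$", s)
    return ("HEADING", m.group(1)) if m else None

doc = "Course Description This course covers parsing".split()
idx = index_terms(doc)
print(idx[0], idx[1])                 # terms recovered by their indexes
print(tag_heading("Course Description"))
```

With word positions as indexes, any document part reduces to a pair of positions, which is exactly the extent representation introduced below.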

Here the mapping I from the set of indexed terms of the document to the set of indexes is surjective and associates any indexed term with a unique index according to its layout in the document. According to the main mapping theorem there is a bijective mapping K, where ΩINDEX / Ker I is the factor set over the kernel, which allows each index to uniquely define a coset of the appropriate indexed term. Let S be the set of strings of an electronic document, with S ⊇ ΩINDEX / Ker I. Consider the mapping J that, according to some heuristic observations, singles out from a string s ∈ S a substring s′ ⊆ s and associates it with a marking tag. Then the mapping

K ∘ J ∘ I: Z → Z

means that a unit r1 ∈ Z can be associated (if possible) with a unit r2 ∈ Z such that K(r2) represents the tag appropriate to K(r1). From these mappings it is clear that every indexed term present in the document can be associated with an index defining its layout relative to other terms. Therefore any part of the document can be represented as a segment (i, j), where i is the index of the first word and j is the index of the last word in the selected part of the document [5].

Definition 7. An extent a = (i, j) is a pair of numbers representing the beginning and end of a segment corresponding to some part of the document.

Based on these mappings the following predicates are offered:
(a) INCLUDE (a, b) determines whether extent a includes extent b.
(b) ORDER (a, b, c) determines whether extent c is such that it begins with extent a and ends with extent b, and a ≠ b.
(c) CONTENTS (a, b) determines whether the contents of extent b are a substring of the contents of extent a, with a ⊇ b.
(d) CORRESPOND (a, b) determines whether the content of extent b is the tag appropriate to the content of extent a, i.e. J(a) = b.
(e) EQUAL (x, y) determines whether the argument x is equal to the argument y.

Next we add some definitions based on the terminology of relational calculus.

Definition 8. A universe of extent components U is the set of numbers to which all possible extent components belong.

Definition 9. An active extent domain dom is the set of extents (i, j) such that i, j ∈ U and i ≤ j.

Definition 10. Let ψ be a formula of the calculus. An extended active extent domain with respect to the formula ψ, denoted edom (ψ), is the set of extents (i, j) ∈ dom such that their components i and j appear explicitly or implicitly in ψ. All predicates included in a well-formed formula (WFF) ψ are determined on the set edom (ψ). Thus the extended active extent domain limits the use of the calculus to finite sets only.

Theorem 3. All predicates offered in the sentential calculus are independent, in the sense that none of them can be expressed only through the remaining predicates.

Proof. Note that the mutual independence of the predicate groups {INCLUDE, ORDER}, {CONTENTS, CORRESPOND} and {EQUAL} is obvious. The first group of predicates determines the positional relationship of extents without their contents; the second group determines the relation between extents and their contents using the mappings K

11 and J; finally, predicate EQUAL determines equality not only extents and strings and also functions of extents and strings, therefore it forms independent group. Hence, for the theorem proof it is enough to prove that predicates INCLUDE and ORDER, and also predicates CONTENTS and CORRESPOND are independent. Let ψ (x1, x2, …, xk) be WFF. According to the de Morgan theorem it is possible to propose that ψ contains only operators ∨ and ¬. Let also edom (ψ) be an extended active extent domain with respect to the formula ψ. (I) Prove independence of predicate INCLUDE. For this purpose we shall propose that the formula ψ contains only predicate ORDER except for operators ∨ and ¬, however ψ it is equivalent to INCLUDE (a, b) for anyone a = (a1, a2) and b = (b1, b2) from edom (ψ). We prove an induction on number of the operators used in anyone subformule ω from ψ that from ω it is impossible to conclude that a includes b. Let generally a1 < b1 and a2 > b2. Basis. Zero of operators in ψ. Then ω is ORDER (x, y, z). If we substitute instead of x, y and z to a and b, then generally, by means of predicate ORDER it is impossible to show that a includes b. Induction. Let ω contain at least one operator and the induction hypothesis fair for all subformulas of formula ψ having less operators than ω. Case 1. ω = ω1 ∨ ω2. As under the induction hypothesis neither from ω1, nor from ω2 it is impossible to conclude that a includes b, ω1 ∨ ω2 also meets to this hypothesis. Case 2. ω = ¬ ω1. In this case, from ω it is possible to conclude that a includes b if and only if ω1 approves that a not includes b. However this statement also can not be constructed without use of predicate INCLUDE. Hence, ω meets to the induction hypothesis. (II) Analogously to a case (I) independence of predicate ORDER (a, b, c) is proved. 
Really, by means of predicate INCLUDE it is possible to show only contents a and b in c, however it is impossible to show an additional condition: c begins with a and ends to b. Consideration of cases 1 and 2 of inductive proof completely coincides with (I). (III) Prove independence of predicate CONTENTS (a, b). It is obvious that it can be expressed through predicate CORRESPOND only if the string a contains a tag name itself. In the general case it is impossible. (IV) Independence of predicate CORRESPOND follows from that it is unique predicate using the mapping J. Therefore it is obvious that it can not be expressed through other predicates which are not including the mapping J. The theorem is proved. ■ Relational calculus is the prevailing mathematical tool used to databases. Therefore the fact of a reduction from constructed calculus to relational calculus is the important fact. Denote as B a set of binary relations. Any set of extents can be represented as some relation from B which tuples are extents of this set, i.e. it is possible to introduce an injective mapping E: dom → B. Definition 11. A GC-relation (Generalized Concordance relation) is a binary relation from B which tuples represent the extents not enclosed each other [5]. Denote as G a set of GC-relations. Then the following theorem is correct: Theorem 4. If G is a set of all GC-relations and G* = G ∪ ∅ then a surjective mapping N: B → G* exists. Proof. Let S ∈ B and tuples of S are extents. We shall construct a relation S ′ = N (S) which tuples-extents will not be enclosed each other. A construction of relation S ′ using relational calculus with variables on tuples is following: S ′ = N (S) = {s | S (s) ∧ ¬ (∃ s′) (S (s′) ∧ s′ [1] > s [1] ∧ s′ [2] < s [2])} This relation S ′ is either GC-relation or empty, i.e. S ′ ∈ G*. Surjective of mapping N follows from G ⊂ B. Therefore, for any S ′ ∈ G* there is even if one S ∈ B such as S ′ = N (S). ■ Consequence. 
It follows from Theorem 4 that any (possibly empty) GC-relation can be associated with some set of extents from dom, i.e. there exists a mapping EN: dom → G*. Since, in the construction of the calculus, all WFFs satisfied the basic restriction, the result of each expression in relational calculus is a GC-relation.
Definition 12. An extended relational calculus with tuple variables is a relational calculus extended by the functions I, K and J, as well as by the string comparison and concatenation functions ⊃, ⊂, =, +.
Theorem 5. For every WFF of the calculus there exists an equivalent safe expression of the extended relational calculus with tuple variables.
Proof. Let ψ (t1, t2, …, tk) be a WFF of the calculus. Proceeding from the purpose of the query, one of the variables t1, t2, …, tk bound by an existential quantifier is moved to the front of the relational-calculus expression as a free tuple variable. In

other words, if ψ (t1, t2, …, tk) = (∃ ti) (ϕ (t1, …, ti-1, ti+1, …, tk)), then this formula is replaced by the expression G ({ti | ϕ′ (t1, …, ti-1, ti+1, …, tk)}). Here ϕ′ denotes the formula ϕ in which all quantifiers are left unchanged, while the predicates INCLUDE, ORDER, CONTENTS, CORRESPOND and EQUAL are replaced by the corresponding formulas of relational calculus with tuple variables:
(a) INCLUDE (a, b) ∼ (a[1] ≤ b[1] ∧ a[2] ≥ b[2]);
(b) ORDER (a, b, c) ∼ (b[1] > a[2] ∧ c[1] = a[1] ∧ c[2] = b[2]);
(c) CONTENTS (a, b) ∼ (K (a) ⊇ K (b));
(d) CORRESPOND (a, b) ∼ (K (b) = J (K (a)));
(e) EQUAL (x, y) ∼ (x = y).
Next we prove the existence of a safe formula ϕ′′ equivalent to ϕ. Since the formula ϕ of the calculus is defined on the extended active extent domain edom (ϕ), the formula ϕ′ is defined on the extended active domain edom (ϕ′), so this is the limited interpretation of a formula of tuple calculus. According to the classical theory, for any formula ϕ′ of tuple calculus under the limited interpretation there exists an equivalent safe formula ϕ′′. Since ϕ′′ is equivalent to ϕ′ and ϕ′ is equivalent to ϕ, the formula ϕ′′ is equivalent to ϕ. ■
Based on the offered calculus, a language for structured queries was designed.
Main query operator

[Let … SuchAs … {Let … SuchAs …}] Show … SuchAs …

Add operator

Append (…)

Delete operator

Delete … [From (… SuchAs …)]

Modify operator

Modify … [From (… SuchAs …)] To …
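The predicate translations (a)–(e) given above can be made concrete with a small Python sketch. This is an illustrative assumption, not the paper's implementation: extents are modelled as 0-indexed pairs (start, end), the string function K and the tag-correspondence mapping J are supplied by the caller, and the containment K(a) ⊇ K(b) of translation (c) is read here as substring containment.

```python
def include(a, b):
    # (a) a includes b: a starts no later and ends no earlier than b
    return a[0] <= b[0] and a[1] >= b[1]

def order(a, b, c):
    # (b) b begins strictly after a ends, and c spans exactly from a to b
    return b[0] > a[1] and c[0] == a[0] and c[1] == b[1]

def contents(a, b, K):
    # (c) the string of extent a contains the string of extent b
    return K(b) in K(a)

def correspond(a, b, K, J):
    # (d) the string of b is the image under the tag mapping J of the string of a
    return K(b) == J(K(a))

def equal(x, y):
    # (e) plain equality of extents or strings
    return x == y

# A begin/end tag pair under a hypothetical K and a hypothetical J:
K = {(1, 5): "<h1>", (20, 25): "</h1>"}.get
assert include((1, 25), (1, 5))
assert order((1, 5), (20, 25), (1, 25))
assert correspond((1, 5), (20, 25), K, lambda s: "</" + s[1:])
```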

According to the offered method of structured document storage, different database models were designed, with the main attention given to the application of a relational DBMS for structured information storage and retrieval. The theorem on the reduction of the offered language for structured queries to SQL is proved.
Theorem 6. Any query composed in the offered language for structured queries can be reduced to an SQL query.
The practical significance of this theorem is that it allows developing a program for SQL-query generation for the compiler of the structured information storage and retrieval automized system.
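Two pieces of the reduction chain can be sketched in Python as a hedged illustration (the table name `extents` and its columns are assumptions, not the paper's actual schema): the normalization N(S) of Theorem 4, which turns an arbitrary extent set into a GC-relation, and the compilation of an INCLUDE condition into SQL via translation (a).

```python
def normalize(S):
    """N(S): keep only the tuples with no other tuple strictly enclosed
    in them, mirroring the relational-calculus expression
    S' = {s | S(s) and not exists s' (S(s') and s'[1] > s[1] and s'[2] < s[2])}."""
    return {s for s in S
            if not any(t[0] > s[0] and t[1] < s[1] for t in S)}

def include_to_sql(b_start, b_end):
    """Compile INCLUDE(a, b) for a fixed extent b into SQL over a
    hypothetical table extents(start, end_), using translation (a):
    a[1] <= b[1] and a[2] >= b[2]."""
    return (f"SELECT start, end_ FROM extents "
            f"WHERE start <= {b_start} AND end_ >= {b_end}")

# (1, 10) is dropped because (2, 5) lies strictly inside it:
print(sorted(normalize({(1, 10), (2, 5), (7, 9)})))  # [(2, 5), (7, 9)]
print(include_to_sql(2, 5))
```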

5 Structured information storage and retrieval automized system
The designed structured information storage and retrieval automized system, shown in Figure 4, includes a set of independent units joined together:

∑A = ∑(I)Created → ∑(II)Structured → ∑(III)Stored.
Here ∑(I) is an editor of XML documents with a graphical interface, together with a similar editor for the creation and validation of DTD rules; ∑(II) is an interactive system for document logical structure recognition, used for XML document creation; ∑(III) is an information system for structured information storage and retrieval with the specialized query language.
The testing results of the designed learning system are presented in Figure 5 and evaluated as the ratio of the number of unrecognized units n to the total number of units N in the documents. These results make clear that the obtained grammar converges asymptotically, and rather quickly, to the common grammar of the document class. It should also be noted that the convergence speed varies with the sample of learning documents, a fact the user should take special note of.
The comparative testing of classical parsing methods, methods using physical structure, and contextual-probability methods was carried out with respect to the ratio of the distance d between the automatically obtained and the true structure to the total number of units N in the document. This characteristic allows estimating the amount of corrections the user must make to the automatically labeled document. The results of this testing are shown in Figure 6.
Measuring the query time interval of the information system amounts to measuring the time necessary for the execution of each separate query [9]. The graph of the averaged time interval T(n) as a function of the number of documents n in the database is shown in Figure 7. From the graph in Figure 7 one can conclude that the time interval depends logarithmically on the number of documents in the database. Therefore the designed information system spends, at any rate, no more time on query processing than the classical methods of information retrieval.
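The evaluation measures just described can be summarized in a short Python sketch (a minimal illustration: the counts n, N and the structure distance d are assumed to come from the recognition system, and the logarithmic trend is the empirical fit reported in Figure 7, not an exact law):

```python
import math

def unrecognized_ratio(n_unrecognized, n_total):
    # Figure 5 measure: share of units the learned grammar fails to recognize
    return n_unrecognized / n_total

def correction_ratio(d, n_total):
    # Figure 6 measure: structure distance per unit, estimating how many
    # corrections the user must make to the automatically labeled document
    return d / n_total

def fitted_query_time(n_documents):
    # Figure 7 trend: averaged query time T(n), in seconds, grows
    # logarithmically with the document count (empirical fit)
    return 84.77 * math.log(n_documents) - 199.86

print(unrecognized_ratio(3, 60))  # 0.05
print(correction_ratio(6, 60))    # 0.1
print(round(fitted_query_time(50)))
```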


Figure 4. Structured information storage and retrieval automated system (user, document structure editor, parser, compiler for structured queries, SQL-query generation, integration of grammars, database and knowledge base of integrated structure grammars)

Figure 5. Testing results of the learning system (the n/N ratio against the document count in the learning sample, for the method with use of physical structure and the contextual-probability method)

Figure 6. Comparison testing of parsing methods (the d/N ratio against the document count)

Figure 7. Measurement of a query time interval to the information system (average time interval T(n), sec, against the document count n; fitted curve y = 84.77 ln x − 199.86)

6 Conclusion
This research has yielded the following results:
1. As a result of the systems analysis of information processes in management systems, it is proposed to use a structured information storage and retrieval automized system, and the primary problems of research and development for automized systems of this class are formulated.
2. An abstract mathematical model of the document class is devised that defines not only the physical and logical structure of documents but also the set of contextual-probability dependences of structure units.
3. Iterative methods and algorithms for the representation of a given finite document class using the contextual-probability model are designed and investigated.
4. Methods and algorithms of combined parsing using both physical structure and contextual-probability dependences are designed and investigated.
5. A special calculus for structured queries is offered, whose predicates are directed at fixing the logical structure of documents, and the corresponding query language is designed.
6. Methods of structured information storage in relational databases with a flexible representation of logical structure are designed and investigated.
7. As an experimental result of research on the designed structured information storage and retrieval automized system, numerical characteristics and experimental dependences were obtained that confirm the efficiency of the offered models and methods.

7 References

1. Aho A.V., Ullman J.D. The Theory of Parsing, Translation and Compilation, 1973.
2. Bapst F., Brugger R., Ingold R. Towards an interactive document structure recognition system. Internal working paper, Institute of Informatics, University of Fribourg, Switzerland, 1995.
3. Birkhoff G., Bartee T.C. Modern Applied Algebra, 1971.
4. Brugger R., Zramdini A., Ingold R. Document modeling using generalized n-grams. Proceedings of the 3rd International Conference on Document Analysis and Recognition, Montreal, Canada, 1995.
5. Clarke C., Cormack G., Burkowski F. An algebra for structured text search and a framework for its implementation. Technical Report CS-94-30, University of Waterloo, Waterloo, Canada, 1994.
6. Fankhauser P., Xu Y. MarkItUp! An incremental approach to document structure recognition. Electronic Publishing: Origination, Dissemination and Design, Vol. 6(4), 1993.
7. Knuth D.E. The Art of Computer Programming, 1968–1973.
8. Kuikka E., Penttonen M. Transformation of structured documents. Electronic Publishing: Origination, Dissemination and Design, Vol. 8(4), 1995.
9. Lancaster F.W. Information Retrieval Systems: Characteristics, Testing and Evaluation, 1965.
10. Liang J. Document structure analysis and performance evaluation. PhD thesis, University of Washington, Washington, USA, 1999.
11. Rayward-Smith V.J. A First Course in Formal Language Theory, 1974.
12. Suen C.Y., Liu K., Strathy N.W. Sorting and recognizing cheques and financial documents. Proceedings of the 5th International Conference on Document Analysis and Recognition, New York, USA, 1999.
13. Summers K.M. Automatic discovery of logical document structure. PhD thesis, Cornell University, Ithaca, USA, 1998.
14. Zimakova M.V. Mathematical Models and Methods for Automized Systems of Structured Information Processing. PhD thesis, Penza State University, Penza, Russia, 2001.
