The Contextual-probability Model: Developing an automized system for structured information processing
Maria V. Zimakova
Penza State University, Russia

Abstract: In this paper we present a new complex contextual-probability approach to the logical structure recognition of semi-structured documents.
Introduction
The contextual-probability model
  2.1 The model
  2.2 Basic operations
  2.3 The probability model
Constructing the structure grammar
The algorithm in more detail
  4.1 The algorithm description
  4.2 General algorithm of logical structure recognition
  4.3 The algorithm using physical structure and contextual-probability dependences
Structured information storage and retrieval methods
Structured information storage and retrieval automized system
Conclusions
References
Introduction

The purpose of document recognition is to extract information from documents. The task ranges from character recognition and identification of the layout of printed documents, through recognition of the logical structure of a document, to the (largely domain-dependent) extraction of semantic content at the high end. This paper focuses on recognizing the logical structure of untagged electronic documents in order to transform them into structured XML (eXtensible Markup Language) documents. This application plays an important role in the publication cycle: structured documents are much easier to exchange and to process further into hyper-documents, individualized printed documents or databases.

With the introduction of corporate networks for management systems support and the use of Web networks for inter-corporate information exchange, a growing need has arisen for managing the style, representation and presentation of distributed documents, which led to the new meta-language XML, a subset of SGML. There is therefore great scientific interest worldwide in the problem of controlling semi-structured data. The problem is being investigated by numerous corporations and research centers, such as Stanford University (USA), the Database Group from CS+E (Center 'Science + Education') at the Washington University (USA), CEDAR (Center of Excellence for Document Analysis and Recognition) at Buffalo University (USA), CENPARMI (Center for Pattern Recognition and Machine Intelligence) at Concordia University (Canada), DAR (Document Analysis and Recognition) at Fribourg University (Switzerland), etc. The development of an automized system for logical structure recognition for a given document class and for storing structured documents in a database is therefore a very important problem.
The application of this automized system to semi-structured information has required the development of new mathematical models and methods for logical structure recognition of semi-structured document classes. The implementation of storage functions in the automized system has required the development of a mapping of the document logical structure to different database models and the creation of a special query language for retrieving structured data.

Some researchers have proposed applying syntactic analysis to document structure recognition, based on predefined knowledge of the documents' specific features. This method is applicable if the style of the document is known, i.e. the logical structure is known beforehand and can be derived from physical structure elements according to a given set of rules. Usually this set of rules is a regular or context-free (CF) grammar. Examples of using such grammars for logical structure recognition are the following: in [8, 12] various methods for extracting payment information from checks and financial documents are proposed; in [13] a grammar describing a document class comprising technical reports, scientific papers and theses is used. These methods are applicable only if an unambiguous representation style for the given document class is known.

Several researchers have proposed methods that can recognize document structures which do not correspond precisely to a specific grammar. For example, in [2] the authors propose using fuzzy syntactic analysis for logical structure recognition. When the document structure cannot be determined by the given grammar, one or more "similar" elements are selected that make it possible to correct the discrepancy between the structure and the given grammar. However, if the document deviates significantly from the given style, this method fails.

Another group of researchers has developed various methods for preliminary training of a structure recognition system. This problem corresponds to the analysis of a document set whose style is either not completely determined or simply unknown. In [6] a trainable system for logical structure grammar recognition is presented which uses a set of predetermined but changeable rules. The system in [4], based on a statistical n-gram model, also includes a training process. These methods rely on a preliminary training process carried out on a specific document class.

The key information handling tasks for management systems are recognition, storage and retrieval of structured information. A critical analysis of logical structure recognition methods for semi-structured document classes has shown that iterative recognition methods with learning capabilities are most suitable for these tasks [14].
Combining parsing methods with a probabilistic approach provides a more adequate representation of document classes and enables effective methods and algorithms for handling semi-structured document classes.
1 The contextual-probability model

In this section the contextual-probability model as an underlying mathematical model for describing the logical structure of a document class, and the corresponding methods for logical structure grammar recognition and construction of a structure tree according to this grammar, are presented.

1.1 The model
The contextual-probability model of a document class is based not only on physical and logical structure but also on statistics of the appearance of logical elements in a given context. A contextual-probability model ℋƊ is a tuple consisting of the following three units:

ℋƊ = (GƊ, H, ℳ), GƊ = {NƊ, TƊ, PƊ, ΔƊ}, ℳ = (ΜT, ΜB, ΜL, ΜR)

where the context-free grammar GƊ determines the document logical structure, H associates document physical attributes with the document logical structure, and ℳ is a set of cubic matrices which determine the contextual-probability dependencies of unit appearance in a document logical structure tree.

In order to impose structure on a document, a set M of logical structure labels is assumed. These labels are used to tag document fragments. A tagged fragment is referred to as a logical area of that document.

Definition 1. Let m ∈ M be a logical structure label and Γ a part of document D. Then the pair (m, Γ) is a logical area of document D.
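As a reading aid, the tuple ℋƊ = (GƊ, H, ℳ) can be pictured as a small data structure. The sketch below is only an illustration under stated assumptions, not the author's implementation: ΔƊ is assumed to be the start symbol of the grammar, each cubic matrix is stored sparsely as a mapping from a triple of labels to a probability (the meaning of the three indices is not fixed by this section), H is simplified to a mapping from labels to attribute sets, and all class names, labels and numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass
class Grammar:
    """Context-free grammar G = {N, T, P, Delta}."""
    nonterminals: Set[str]                          # N
    terminals: Set[str]                             # T
    productions: Dict[str, List[Tuple[str, ...]]]   # P: left side -> alternatives
    start: str                                      # Delta (assumed to be the start symbol)

    def is_well_formed(self) -> bool:
        """Check that P uses only declared symbols and that Delta is in N."""
        symbols = self.nonterminals | self.terminals
        if self.start not in self.nonterminals:
            return False
        for left, alternatives in self.productions.items():
            if left not in self.nonterminals:
                return False
            for right in alternatives:
                if any(symbol not in symbols for symbol in right):
                    return False
        return True


# Assumption: a "cubic matrix" is kept sparse, label triple -> probability.
CubicMatrix = Dict[Tuple[str, str, str], float]


@dataclass
class ContextualProbabilityModel:
    """The tuple (G, H, M) with M = (M_T, M_B, M_L, M_R)."""
    grammar: Grammar
    attributes: Dict[str, Set[str]]    # H, simplified to label -> attributes
    matrices: Dict[str, CubicMatrix]   # keyed by the subscripts "T", "B", "L", "R"


# Hypothetical toy instance; labels and numbers are illustrative only.
toy_grammar = Grammar(
    nonterminals={"course", "title"},
    terminals={"text"},
    productions={"course": [("title", "text")], "title": [("text",)]},
    start="course",
)
toy_model = ContextualProbabilityModel(
    grammar=toy_grammar,
    attributes={"title": {"font:bold"}},
    matrices={"T": {("course", "title", "text"): 0.9}, "B": {}, "L": {}, "R": {}},
)
```

Storing the matrices sparsely is a design choice of this sketch: most label triples are likely never to occur together, so a dense three-dimensional array would mostly hold zeros.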
Definition 2. Logical areas (m1, Γ1) and (m2, Γ2) are equal, denoted as (m1, Γ1) = (m2, Γ2), if m1 = m2 and Γ1 = Γ2.

Logical areas may be nested:

Definition 3. Logical area (m1, Γ1) is enclosed in logical area (m2, Γ2), denoted as (m1, Γ1) ≼ (m2, Γ2), if Γ1 ⊆ Γ2, and Γ1 = Γ2 ⇔ (m1, Γ1) = (m2, Γ2).

Let ℒ be the set of logical areas of document D, and ≼ℒ the restriction of ≼ to ℒ. Then (ℒ, ≼ℒ) is a lattice ([3]).

Next we focus on layout. Let Ψ be the set of possible physical (figure, table, equation) and formatting (font, alignment) attributes of document D. Then the mapping H: ℒ → Ψ associates logical areas of document D with physical and formatting attributes.
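The definitions of equality and enclosure can be checked on a toy example. The following minimal sketch assumes that a document part Γ is modeled as a set of character offsets. That representation, the helper `span`, the labels and the attribute strings are all hypothetical choices made for illustration, not part of the model itself.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Set


@dataclass(frozen=True)
class LogicalArea:
    """A pair (m, Gamma); dataclass equality matches Definition 2."""
    label: str             # m, a logical structure label from M
    part: FrozenSet[int]   # Gamma, modeled as character offsets (assumption)


def is_enclosed(a: LogicalArea, b: LogicalArea) -> bool:
    """Definition 3: a is enclosed in b iff Gamma1 is a subset of Gamma2
    and (Gamma1 = Gamma2 exactly when the two areas are equal)."""
    same_part = a.part == b.part
    return a.part <= b.part and (same_part == (a == b))


def span(start: int, stop: int) -> FrozenSet[int]:
    """Hypothetical helper: offsets start..stop-1 as a document part."""
    return frozenset(range(start, stop))


course = LogicalArea("course", span(0, 100))
title = LogicalArea("title", span(0, 20))
heading = LogicalArea("heading", span(0, 20))  # same part, different label

# H maps logical areas to physical and formatting attributes (toy values).
H: Dict[LogicalArea, Set[str]] = {title: {"font:bold", "alignment:center"}}

print(is_enclosed(title, course))   # title lies inside course
print(is_enclosed(title, heading))  # equal parts but unequal areas
```

Under this reading, two areas that cover exactly the same part of the document but carry different labels are not enclosed in one another, which is what the second condition of Definition 3 expresses.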
Example: In Figure 1 a web page is presented, consisting of three separate frames. We will focus on the frame on the right. A number of logical areas have been marked.
Figure 1. Course descriptions of some lecturing institute

In Figure 1 we can see the following elements of the set ℒ: (, Γ1), (, Γ2), (, Γ3), (, Γ4), (, Γ5), (, Γ6), (, Γ7). The grammar GƊ for this example is: ΔƊ = TƊ = {, , , , , , , , , , , , , ,