IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS-PART B: CYBERNETICS, VOL. 28, NO. 1, FEBRUARY 1998


Human Identification of Letters in Mixed-Script Handwriting: An Upper Bound on Recognition Rates

Caroline Barrière and Réjean Plamondon

Abstract- This paper focuses on a reading task consisting of the identification of letters in mixed-script handwritten words. This task is performed by humans using extended or limited linguistic context. Their performance rate gives an upper bound on the recognition rates of computer programs designed to recognize handwritten letters in mixed-script writing. Many recognition algorithms are being developed in the research community, and there is a need for establishing ways to compare them. As some effort is under way to provide large test sets with standard formats, we propose an algorithm to determine a test set of reduced size that is appropriate for the task to achieve (the type of texts or words to be recognized). Also, with respect to a particular task, we propose a method for finding an upper limit on the letter recognition rate to aim for.

Index Terms- Algorithm comparison methodology, letter identification, mixed script recognition, reading experiment.

I. INTRODUCTION

Handwriting recognition is a very difficult task in which much more human knowledge is involved than just the learned shapes of letters. Linguistic context is very important when some letters are difficult to read. Within a word, interpretation of those letters is facilitated by the knowledge of a limited lexicon and, more broadly, of pronounceable syllables. At a higher level, syntactic and semantic knowledge helps recognizing words within a sentence. In this perspective, a large number of on-line and off-line handwriting recognition systems rely on different types of hierarchical architecture where one or many low level recognizers mostly deal with the recognition of letters as represented by different sequences of handwriting strokes [1], [2]. An intermediate list of potential letter candidates is then analyzed and combined using lexical and sometimes syntactical knowledge to build up a list of potential word candidates [3]-[5]. Although different types of architecture can be used to combine high and low level processes, the bottleneck of these approaches is generally dependent upon the letter recognition performance of the low level recognizers. What is a realistic goal to reach when trying to recognize, at the letter level, mixed-script handwriting, that is, handwritten words made in part of discrete letters and continuous cursive script sequences?

To fix an upper limit to letter recognizer performance, a first approach is to compare algorithms using the same database (see, e.g., [9]). The results of such an experiment are the more conclusive the larger the database is. In this perspective, there has been, in recent years, a serious effort to build up public databases incorporating a very large number of words written by a very large number of writers. For off-line character recognition, a few databases are now currently being used [10]-[13]. Similarly, for on-line data, the UNIPEN project [14] is an international effort to put together large databases to be used by researchers to compare their algorithms. The results might be more conclusive as well if the database used is representative of the task to achieve and if we can establish, in advance, what "good" results to hope for are. Those two aspects are of importance if we work with mixed script, or cursive script, where the problem of letter recognition is embedded in the problem of word recognition. The writing and reading of a letter depend on the other letters in the word, more particularly the preceding and following letters [17], [18]. Therefore we cannot assume that recognizing the same letter in any word is the same task, either for a machine or a human. We do not encounter these problems in words written with discrete letters. Therefore we wonder: in mixed-script handwriting recognition, how far can we go with letter-based approaches? When is it appropriate, in a design process, to focus on high level mechanisms to improve letter recognizer performances? These are the questions that we address partially in this correspondence by defining a methodology that can be used to specify an empirical upper bound, or a reference performance, with respect to which an algorithm can be situated.

Manuscript received January 4, 1992; revised May 18, 1994, February 2, 1996, and December 9, 1996. This work was supported by NSERC Canada under Grant OGP000915 and by FCAR Québec under Grant ER-1220.

C. Barrière was with the Laboratoire Scribens, Département de Génie Electrique et Génie Informatique, Ecole Polytechnique, Montréal, P.Q., Canada. She is now with the Natural Language Laboratory, Simon Fraser University, Vancouver, B.C., V5A 1S6 Canada.

R. Plamondon is with the Laboratoire Scribens, Département de Génie Electrique et Génie Informatique, Ecole Polytechnique, Montréal, P.Q., Canada (e-mail: rejean@scribens.polymtl.ca).

Publisher Item Identifier S 1083-4419(98)00213-1.

II. EXPERIMENT

Our approach is based upon the fact that literate human beings are actually the best systems for recognizing mixed script. In this context, we are interested in developing some writing and reading tests that are easy to run with a small number of subjects, and helpful in providing some statistically significant results about an upper limit that could be reached with letter-based recognition algorithms for some specific projects. We do not claim here that, in the long run, computer systems will not exceed human performance, but we want to take advantage of the actual superiority of humans to help us develop better automatic systems.

The whole methodology involves the selection of a representative handwritten vocabulary subset to test any proposed computer algorithm as well as to perform a human reading experiment. To specify an upper bound performance, a group of human subjects is thus asked to write words of the vocabulary subset. Another group is involved in reading some of these handwritten words. The group of readers is composed of two subgroups: the readers familiar with the language in which the words are written are said to work with a large linguistic context, as opposed to the readers not knowing the language, who work with little linguistic context. Readers using different linguistic contexts are chosen to roughly evaluate the influence of that factor on recognition rates.

There are five writers, each writing on a digitizer two sets of words, a training set of 250 words and a test set of 275 words. They write in a free style, mixing as desired discrete letters and cursive script. For the purpose of this article, only the second set of 275 words is used. Both sets have been used in a previous work on developing a recognition algorithm [15].¹ Hereafter, we briefly describe the methodology for generating sets of words, and we justify our choice of readers who will be given the task of reading those sets of words.

A. Generating a Subset of Words

The task given to the computer algorithms is the recognition of French words; the words are taken randomly from the Larousse de Poche [16].

¹The databases are available from the authors.

1083-4419/98$10.00 © 1998 IEEE


The Larousse de Poche contains about 30 000 French words. This represents the set of possible words to be recognized by the computer systems. The algorithms will perform letter recognition and then word reconstruction from the possible letters identified. We mentioned earlier the importance of the preceding and following letters in the recognition of a particular letter. This directs our attention to the digrams (two consecutive letters) present in the dictionary. We decided against looking at trigrams (three consecutive letters) because it would lead to (26 x 26 x 26) criteria to satisfy, as explained in the next paragraph. We use a method which analyzes the distribution of the number of occurrences of all digrams in the dictionary and tries to reproduce this distribution in a smaller subset.

To generate our test set, our problem consists in finding a combination of X words among the 30 000 words that respects the frequency of occurrence of the digrams found in the dictionary. The number of words X must be chosen small enough so that further experiments with human subjects using the generated training or test set containing X words can proceed in a reasonable amount of time. There are (26 x 26) possible digrams, that is, 676 constraints that would need to be satisfied. Trying to solve this problem optimally would result in a combinatorial explosion. Therefore a nonoptimal polynomial selection algorithm has been developed to choose a subset of words that is representative of the proportions of digrams in the set of words to recognize.

Before giving the algorithm for that nonoptimal solution, the terminology employed hereafter must be clarified. A digram consists of two consecutive letters (for example "be"), a digram's count (DC) is the absolute number of times a digram occurs in a set of words, and the total digram count (TDC) is the sum of all the DC's over all the digrams. The steps for generating a subset of words from a global set (the set to recognize) are as follows.

1) Find all the possible digrams in the dictionary, and keep their DC in a table. In the Larousse French dictionary used, the most frequent digram is "er" with a DC of 7719, while at the other extreme, "bh" appears just once, and 197 digrams are never present.

2) Scale the DC's by choosing an integer N and multiplying all DC's by N/7719. If a digram was present but got scaled to less than 1, set it to 1. Put the new DC's in the table.

3) Examine the dictionary in an order that respects the decreasing word length proportions of the global set and, for each word length, in a random order based on the first letter in the word. The first consideration avoids generating subsets of words containing just a few long words that the participants may have difficulties writing while maintaining a natural style, or a subset containing thousands of little words. The second consideration

avoids a selection biased toward a particular part of the alphabet.

4) Choose a word only if all its digrams are available (DC > 0) in the table. For a chosen word, decrement by 1 all its DC's in the table.

A reducing factor N = 100 was used, giving a TDC of about 2100. The algorithm was run twice. It generated the sets of 250 and 275 words mentioned earlier. To form those sets, the algorithm used about 90% of the TDC, showing the nonoptimality of the algorithm. The same procedure could be used for any language, using a large dictionary specific to the language the system is aimed at.
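The selection procedure above can be sketched in a few lines. This is an illustrative reconstruction of the published steps, not the authors' code: the function and parameter names (select_subset, n_scale) are invented, and the word-length balancing of step 3 is omitted for brevity.

```python
# Greedy, nonoptimal selection of a word subset whose digram counts
# approximate those of the full dictionary. Illustrative sketch: names
# and defaults are invented, not taken from the paper.
from collections import Counter
import random

def digrams(word):
    return [word[i:i + 2] for i in range(len(word) - 1)]

def select_subset(words, n_scale=100):
    # Step 1: count every digram of the dictionary (the DC table).
    dc = Counter(d for w in words for d in digrams(w))
    top = max(dc.values())  # 7719 for "er" in the Larousse de Poche
    # Step 2: scale the DC's by n_scale / top, flooring present digrams at 1.
    dc = {d: max(1, round(c * n_scale / top)) for d, c in dc.items()}
    subset = []
    # Step 3: examine words in random order (the paper also balances word
    # lengths, which this sketch omits).
    for w in random.sample(words, len(words)):
        need = Counter(digrams(w))
        # Step 4: keep a word only if all its digrams are still available
        # (DC > 0), then decrement the corresponding DC's.
        if need and all(dc.get(d, 0) >= k for d, k in need.items()):
            subset.append(w)
            for d, k in need.items():
                dc[d] -= k
    return subset
```

The sum of the scaled DC's is the target TDC; with N = 100 the paper reports a TDC of about 2100, of which the greedy pass consumes roughly 90%.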

B. Choosing the Readers

Based on the fact that the handwritten words to examine are in French, we define as the group working with extended linguistic context (EXT group) native French speakers. Subjects of this group have a fairly large French lexical knowledge, as well as intuition on the occurrence frequencies of digrams and trigrams, and knowledge of pronounceable syllables used in French words. Then we define as the

group working with limited linguistic context (LIM group) native English speakers who did not learn French as a second language and who do not have day-to-day contact with French people. We consider that these subjects have a limited linguistic context because they do not know the French language, but similar words can occur in French and English, and they also have intuition of some unpronounceable syllables. Moreover, unlike machines, they know intrinsically how to generate continuous handwriting using letter formation and concatenation rules.

Five native French speaking Canadians from Montréal formed the EXT group working with extended linguistic context, and six native English speaking Americans from New York State formed the LIM group working with limited linguistic context. Among the test sets of 275 words given by the five writers, 55 words were chosen randomly, so as to diminish the amount of time asked of the readers. A reader saw, randomly arranged on sheets of paper, the 55 complete words coming from the five different writers. They had to write under each word the letters that they had identified. If one letter could not be guessed at all, they were asked to put "-" under that letter.

A few weeks after the reading experiment, the readers from the French group were shown a sheet of paper with the typed version of the handwritten words they had to identify before. They were asked to cross out the words they did not know. For each reader, his/her unknown words were taken away from his/her test set to ensure that each French reader had been asked to identify letters from words that he/she was familiar with. We wanted to keep a clear demarcation between the French and English readers, namely that the French readers are familiar with the lexicon of the test set, and the English readers are not.

III. RESULTS

Fig. 1 shows the results of the identification of letters from the 55 words presented, averaged over six English readers for the LIM group and over five French readers for the EXT group. In the first column an example of the writing style of the different writers is shown with the initials of the writers [WG, PY, PC, FN, FL]. The second and third columns show the mean recognition results over all the words for each writer from the groups LIM and EXT, respectively.

The recognition rate is computed by looking at the maximum number of recognized letters. Humans, like computer algorithms, can make different types of mistakes when identifying the letters. They can omit a letter, insert a letter, or substitute a letter for another one. For example, if the word to recognize is "devidoir" (see Fig. 1, writer FL) and the reader saw "durdoir", we have five recognized letters out of a total of eight letters to recognize, giving a recognition rate of 62.5%. There is also the sequence "ur" that is recognized instead of "evi." This can be interpreted as one deletion and two substitutions. For the human reading, very few insertions and deletions occur; it is mostly substitutions, which makes the substitution rate the complement of the recognition rate.

Expecting computers to reach a 100% recognition rate in a similar experiment might be very optimistic when humans, even with the help of their linguistic context, do not reach that goal. Just the fact that humans can see the whole word, when trying to identify letters,

gives them an incommensurate advantage over a computer approach that would try to identify one letter at a time. But even with this enormous advantage, their performances are well below a perfect score. In fact, some writers have a style very difficult to read, and our reader subjects performed poorly on them. In this case, writer [PY], who has a very small and hard-to-read writing, gets an 87% recognition rate from the EXT group (76% from the LIM group), and writer [FL], whose writing has a lot of loops and flourishes, gets an 85% recognition rate from the EXT group (78% from the LIM group). Writer [WG] has an easy-to-read style, where almost all the letters are separated, and therefore the LIM group performs as well as the EXT group (both 98%).

Fig. 1. Recognition rates for the LIM and EXT groups. (The original figure also shows a handwriting sample for each writer; only the rates are reproduced here.)

Writer   LIM GROUP (English readers)   EXT GROUP (French readers)
PC       89%                           99%
PY       76%                           87%
FN       92%                           95%
FL       78%                           85%
WG       98%                           98%
Mean     86.6%                         92.8%

Over all the writers, the LIM group has a recognition rate of 86.6% and the EXT group of 92.8%. This shows a 6.2% drop in the recognition rate between readers working with extended linguistic context and readers working with limited context to help them interpolate missing letters. A paired t test comparing the two groups confirms a significant difference by rejecting the null hypothesis H0 at level a = 0.05. With t_paired = 2.671 > (t_{0.05,4} = 2.132), the data support the hypothesis that the EXT group performs better than the LIM group.
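The paired t test can be reproduced from the per-writer means. The sketch below uses the rounded rates read from Fig. 1, so its statistic differs from the 2.671 reported in the text, which was presumably computed on unrounded data; the decision against the same critical value t_{0.05,4} = 2.132 is unchanged.

```python
# Paired t test on per-writer recognition rates (rounded values read
# from Fig. 1; the paper reports t = 2.671 on its own, unrounded data).
from statistics import mean, stdev
from math import sqrt

lim = [89, 76, 92, 78, 98]   # LIM group (limited context), %
ext = [99, 87, 95, 85, 98]   # EXT group (extended context), %

diffs = [e - l for e, l in zip(ext, lim)]
n = len(diffs)
# Paired t statistic with n - 1 = 4 degrees of freedom.
t = mean(diffs) / (stdev(diffs) / sqrt(n))

# One-tailed critical value t_{0.05,4} = 2.132 (from a t table).
print(f"mean LIM = {mean(lim):.1f}%, mean EXT = {mean(ext):.1f}%")
print(f"t = {t:.3f}, reject H0 at alpha = 0.05: {t > 2.132}")
```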

For the participants using a writing style very difficult to read, both reader groups have low recognition rates. This shows that sometimes even seeing the whole word doesn't help the readers much, probably because they usually have access to even more contextual information. In our reading experiments, they face one isolated word at a time; they can't build any expectations from the context of a phrase or a paragraph. If a word is considered hard to read, the word itself doesn't show enough global characteristics for a reader to speculate about the whole word based on them and then interpolate the more vague letters. When really nothing else can help them, the readers will base their judgment on the shapes of letters they know, and this is where the computer approaches can compete with them on a fairer level, and perhaps even give results that are better than humans, as shown in [19] when discrete letters are partitioned and recognized by parts. This decomposition of letters into regions is certainly not a task that humans perform every time they read a letter, especially not a letter within a word written in cursive script.

IV. A TESTING PROTOCOL

We have described the experiment and results obtained in an attempt at giving an upper limit on the recognition rate to be achieved


on low-level letter recognition. Any computer algorithm working at the letter level could be compared to that limit. Depending on the language chosen, as well as the task to achieve, the procedure described in Section II-A can be used to generate some sets of words. This procedure was explained for a random choice of words from a dictionary, but for a more realistic task it can be adapted to choose a number of words representative of a corpus of texts to be recognized. At least one set should be generated for testing, and if the algorithm used needs a training set, the same algorithm should be used to generate a training set. The test set, or a subset of it as large as possible (depending on the availability of the readers), can be presented to human readers. By choosing readers with a limited linguistic context, we will find a better approximation of an upper limit for the recognition results to be achieved, because they don't have much access to the upper lexical level of their knowledge to help them, which makes them more closely related to a blind letter recognition algorithm.

A computer algorithm trying to recognize the letters in the test set will often give many candidate letters for each position where it identified a letter. For example, in an algorithm developed in an earlier work [15], using a similarity measure on sequences of strokes, we obtained for the word "inculper" (see Fig. 1, writer PY) a sequence of proposed letter sets containing one deletion followed by a substitution, the "n" being replaced by "m", and another substitution where the "p" is seen as a "j" or "q". Averaging over the same five writers involved in the present experiment, this earlier project gave a letter recognition rate of 78%. To reach the 86.6% obtained by the LIM group, there is still work to be done.

The recognition results on letters as obtained by the LIM group could be seen as goals to achieve by the computer approaches working on a stroke or letter basis only. What we define as "limited context" for the English readers seems enormous as opposed to the computer approaches, which do not have access to anything besides the learned shapes of partial letters. In fact, just by being able to see the whole word, and also by having intuitions of pronounceable syllables and nonfrequent digrams, the LIM group knows a lot more than a computer approach working only with shapes. Therefore, taking the recognition rate of the LIM group as a target seems optimistic, but still more realistic than hoping for 100% from computer approaches which only work with the learned shapes of letters.

Emphasis has been given to the advantages that humans have as compared to computers, but still, we certainly did not emphasize enough the knowledge that humans have of their vocabulary: the shapes of whole words seen over and over, and the ability of trying to pronounce those words. And yet we did not even talk about the higher level information available when reading words in a text. High level information includes expectations of upcoming words based on the context. The organization of the words in the sentence helps predicting that a certain part of speech is coming (verb, noun). Also, and probably mostly, all the semantic information given by the sentence enormously restricts the quantity of "possible" words coming. All those pieces of information could certainly be incorporated in a text recognition system where the low-level information about the shapes and the high-level information about expected words could meet in between, with the help of a semantically and syntactically limited-size dictionary, to recognize the right words in a sentence.
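The recognition rate and the error breakdown used throughout (e.g., "devidoir" read as "durdoir": five of eight letters recognized, one deletion and two substitutions) can be reproduced with a standard edit-distance alignment. The sketch below is one consistent way to do the bookkeeping, not necessarily the authors' exact procedure; the function name is invented.

```python
# Score a transcription against the reference word with a standard
# Levenshtein alignment (a sketch of one consistent scoring, not
# necessarily the authors' exact bookkeeping).
def align_score(ref, hyp):
    m, n = len(ref), len(hyp)
    # dp[i][j] = minimum edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j - 1] + cost,  # match / substitution
                           dp[i - 1][j] + 1,         # deletion from ref
                           dp[i][j - 1] + 1)         # insertion into hyp
    # Backtrace, counting matches and each type of edit operation.
    i, j = m, n
    matches = subs = dels = ins = 0
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and ref[i - 1] == hyp[j - 1]
                and dp[i][j] == dp[i - 1][j - 1]):
            matches += 1; i -= 1; j -= 1
        elif i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + 1:
            subs += 1; i -= 1; j -= 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            dels += 1; i -= 1
        else:
            ins += 1; j -= 1
    return matches / m, {"sub": subs, "del": dels, "ins": ins}

rate, errs = align_score("devidoir", "durdoir")
print(rate, errs)   # → 0.625 {'sub': 2, 'del': 1, 'ins': 0}
```

The same routine makes the "mostly substitutions" observation of Section III directly measurable on each reader's transcriptions.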

In general, the computer approaches, as opposed to human readers, will give a larger number of deletions and insertions, and the comparison to the human reading does not take this into account. For algorithms giving similar recognition rates, the deletion and insertion rates should be considered as criteria to choose among them. As well, for two algorithms giving similar results, we should reevaluate our recognition rates by looking at fewer candidates: the fewer letter candidates, the less post-processing work there is to be done if we want to use a dictionary, for example, to find existing words. Testing against human readers can be used for on-line approaches as well as off-line ones. It seems more intuitive to compare an optical off-line approach to human reading, but it can also be used with on-line approaches that segment the word signal into possible letters.

V. CONCLUSION

With the help of human readers, we tried to answer the question "Is it realistic today to ask a computer to perform a 100% recognition rate on letters within mixed-script handwritten words?" We described an experiment which consisted of asking humans to identify letters within French words. This experiment was done on readers having access to an extended linguistic context (EXT group-French readers) and on readers having access to a limited linguistic context (LIM group-English readers). To be able to later compare human results with the results from computer experiments, the words given to the readers are chosen among a set of words that would be used in a computer experiment as a training or test set. This set of words is generated in a way that respects the proportion of all possible digrams in the total set of words that an algorithm might have to recognize. This is an important aspect, as in cursive script the writing and the reading of a letter depend on the preceding and following letters. The test set must be representative of the set of words the algorithm will eventually have to recognize. The recognition rates obtained are 92.8% for the EXT group and 86.6% for the LIM group.

REFERENCES

[1] C. C. Tappert, C. Y. Suen, and T. Wakahara, "The state of the art in on-line handwriting recognition," IEEE Trans. Pattern Anal. Machine Intell., vol. 12, no. 8, pp. 787-808, 1990.
[2] G. Lorette and Y. Lecourtier, "Reconnaissance et interprétation de textes manuscrits hors-ligne: Un problème d'analyse de scène," in Actes du Colloque National sur l'Ecrit et le Document, Nancy, France, 1992, pp. 109-135.
[3] D. M. Ford and C. A. Higgins, "A tree based dictionary search technique and comparison with n-gram letter graph reduction," in Computer Processing of Handwriting, R. Plamondon and C. G. Leedham, Eds. Singapore: World Scientific, 1990, pp. 291-312.
[4] A. Goshtasby and R. W. Ehrich, "Contextual word recognition using probabilistic relaxation labeling," Pattern Recognit., vol. 21, pp. 455-462, 1988.
[5] R. Plamondon, S. Clergeau de Tournemire, and C. Barrière, "Handwritten sentence recognition: From signal to syntax," in Proc. 12th Int. Conf. Pattern Recognition, Jerusalem, Israel, 1994.
[6] G. Sabah, "Traitement des non-attendus," in L'intelligence Artificielle et le Langage, 1988, pp. 152-184.
[7] C. J. Wells, L. J. Evett, and R. J. Whitrow, "Word look up for script recognition: Choosing a candidate," in Proc. 1st Int. Conf. Document Analysis and Recognition, St-Malo, France, 1991, pp. 620-628.
[8] S. Clergeau de Tournemire and R. Plamondon, "Integration of lexical and syntactical knowledge in a handwriting recognition system," Mach. Vision Applicat., vol. 8, no. 4, pp. 249-260, 1995.
[9] L. Bottou et al., "Comparison of classifier methods: A case study in handwritten digit recognition," in Proc. 12th IAPR Int. Conf. Pattern Recognition, 1994, vol. 2, pp. 77-82.
[10] J. Franke, L. Lam, R. Legault, C. Nadal, and C. Y. Suen, "Experiments with the CENPARMI database combining different classification approaches," in Proc. 3rd Int. Workshop Frontiers Handwriting Recognition, Buffalo, NY, 1993, pp. 305-311.
[11] R. Fenrich and J. J. Hull, "Concerns in creation of image databases," in Proc. 3rd Int. Workshop Frontiers Handwriting Recognition, Buffalo, NY, 1993, pp. 305-311.
[12] D. H. Kim, Y. S. Hwang, S. T. Park, E. J. Kim, S. H. Paek, and S. Y. Bang, "Handwritten Korean character image database PE92," in Proc. 2nd Int. Conf. Document Analysis Recognition, Tokyo, Japan, 1993, pp. 470-473.
[13] K. Toraichi, R. Mori, I. Sekita, K. Yamamoto, and H. Yamada, "Handprinted Chinese character database," in Computer Recognition and Human Production of Handwriting. Singapore: World Scientific, 1989.
