Chinese Writer Identification Based on the Distribution of Character Skeleton

Luo Wei, Zhang Dexian, Wang Feng, Gong Zhile, Zhu Min and Bao Na
School of Information Science and Engineering, Henan University of Technology, Zhengzhou, P.R. China
e-mail: [email protected]

Abstract—In this paper, a method for Chinese writer identification is proposed and tested. First, since handwriting can be regarded as a texture image in some sense, the Gabor wavelet is used to extract texture features. Then a Local Direction Contribution Method (LDCM) is adopted to extract local features of the feature characters: we first skeletonise each character and then compute the skeleton direction distribution in each sub-region. A nearest neighbor classifier based on the weighted Euclidean distance (WED) is used for classification. Experimental results verify that the classification performance of LDCM is better than that of the Gabor method, and that the correct identification rate of the Top-3 candidates reaches 100% under random combinations of three feature characters.

Keywords-writer identification; LDCM; Gabor wavelet; WED

I. INTRODUCTION

The rapid growth of biometric person identification has driven fast development of DNA typing, fingerprint classification, and iris and handwriting identification. Unlike physiological characteristics such as fingerprints and the iris, handwriting reflects a behavioral characteristic of human beings. In contrast to other forms of biometric person identification used in forensic labs, handwriting is unstable, can be forged and is not unique. Consequently, it has always been extremely difficult to identify automatically by machine. Offline handwriting identification is mainly used to narrow the range of suspects, improve the objectivity of processing and offer a scientific basis for handwriting identification experts.

Writer identification is the task of determining the author of a handwriting sample from a set of writers. Recent advances in image processing, pattern classification and machine learning have enabled substantial new methods in this field. Said et al. [1] treat the writer identification task as a texture analysis problem using multi-channel Gabor filtering and grey-scale co-occurrence matrix techniques. Zois et al. [2] base their approach on single words by morphologically processing horizontal projection profiles. Edge-based directional probability distributions and connected-component contours are proposed as features for the writer identification task in [3]. Leedham et al. [4] present a set of eleven features which can be extracted easily and used for the identification and verification of documents containing handwritten digits. Hidden Markov Model (HMM) based recognizers are used for the identification and verification of persons based on their handwriting in [5]. Blankers et al. [6] introduce loop and lead-in features for describing the individual properties of handwriting. Zhang et al. [7, 8] extract features of feature words and characters, and determine the discriminability of digits and characters. Yu et al. [9] extract the basic strokes of Chinese characters as classification models and use a Gaussian function to model each class.

Unlike these earlier feature extraction methods, in this paper we analyze the skeleton of each character, extract the skeleton direction distribution and combine it with the corresponding position information to form the handwriting features. We also briefly describe the Gabor feature, which is one of the global features, and compare its discriminability with that of the proposed method. The experiments and analysis indicate that the proposed feature performs better than the Gabor feature.

The remainder of the paper is structured as follows. Section 2 describes data collection. Feature extraction is introduced in detail in Section 3. The experiment method and the corresponding analysis follow in Section 4. Section 5 concludes the paper.

II. DATA COLLECTION

A source document in Chinese that also contains the ten Arabic numerals 0 to 9 was designed for the purpose of this study and was copied by each writer three times. A second sample, about 5 to 10 lines of characters, was a self-generated description on any topic in free writing style. To account for the instability of one person's writing over time, we collected the materials at intervals of two weeks to one month. Each collected handwritten document was digitally scanned at 300 dpi and stored as an 8-bit (256 grey levels) image. All samples were written on white woodfree writing paper of size 15 cm × 21 cm with a 0.5 mm gel-ink pen. Given the limitations of current image segmentation and image processing, we assume that preprocessing such as denoising has already been carried out. Because writing velocity and pressure cannot be collected for offline writer identification, we restrict the discussion in this paper to normal and free writing styles and discard forged documents. One writing sample is shown in Figure 1(a); Figure 1(b) shows copies of the same feature character provided by four writers, each manually segmented from their samples.

Figure 1. (a) Handwriting sample. (b) Character “前” written by 4 writers, 3 copies each.

III. FEATURE EXTRACTION

A. Gabor wavelet feature

The Gabor wavelet is an effective analysis tool for wavelet transforms. Its characteristics closely resemble the visual neural mechanisms of human beings, and it can capture arbitrary frequencies and directions. It plays an important role in the analysis of texture images because of its concise form and excellent time-frequency performance. An input image $I(x, y)$, $(x, y) \in \Omega$ ($\Omega$ is the set of image points), is convolved with a two-dimensional Gabor function $g(x, y)$, $(x, y) \in \Omega$, to obtain a Gabor feature image $r(x, y)$ as follows [10]:

$$ r(x, y) = \iint I(\xi, \eta)\, g(x - \xi, y - \eta)\, d\xi\, d\eta \tag{1} $$

We use the following family of Gabor functions:

$$ g_{\lambda,\theta,\varphi}(x, y) = \exp\left(-\frac{x'^2 + \gamma^2 y'^2}{2\sigma^2}\right) \cos\left(2\pi \frac{x'}{\lambda} + \varphi\right), \qquad x' = x\cos\theta + y\sin\theta, \quad y' = -x\sin\theta + y\cos\theta \tag{2} $$

We adopt circularly symmetric orthogonal filters in our experiments, with $\varphi = 0, \pi/2$ and $\gamma = 1$. Texture characteristics can be extracted at different frequencies and directions by altering the values of $\lambda$ and $\theta$, which are the radial frequency and orientation that define the location of the channel in the frequency plane. We use frequencies of 4, 8, 16, 32, 48 and 64 cycles/degree. For each central frequency $\lambda$, filtering is performed at $\theta$ = 0°, 45°, 90° and 135°. This gives a total of 24 output images (4 for each frequency). The feature vector consists of the mean and standard deviation of each output image, so 48 features per input image are calculated. Testing was performed using all 48 features. Figure 2 shows an original binarized image of size 128×128 pixels and its filtered images obtained with Gabor filters in directions 0°, 45°, 90° and 135°, with the central frequency set to 4, 8, 16, 32, 48 and 64.

Figure 2. (a) Normalized binary image. (b) Filtered images in directions 0°, 45°, 90° and 135° from top to bottom, with the central frequency set to 4, 8, 16, 32, 48 and 64 from left to right.
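The Gabor feature described in this subsection can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the text does not give $\sigma$ or the kernel size, so the sketch assumes $\sigma = 0.56\lambda$ and truncates the kernel at about $2\sigma$; it treats the six listed values as $\lambda$ in pixels, as $\lambda$ appears in Eq. (2); and it uses only the phase $\varphi = 0$, which yields the 24 output images and 48 features quoted above. The function names are hypothetical.

```python
import math

def gabor_kernel(lam, theta, phi=0.0, gamma=1.0, sigma=None, half=None):
    """Sample the Gabor function of Eq. (2) on a square pixel grid."""
    if sigma is None:
        sigma = 0.56 * lam                    # assumption: sigma is not given in the text
    if half is None:
        half = max(1, int(round(2 * sigma)))  # assumption: truncate at about 2 sigma
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(xr * xr + gamma * gamma * yr * yr) / (2 * sigma * sigma))
            row.append(envelope * math.cos(2 * math.pi * xr / lam + phi))
        kernel.append(row)
    return kernel

def convolve(image, kernel):
    """Discrete form of Eq. (1); pixels outside the image count as 0."""
    h, w = len(image), len(image[0])
    half = len(kernel) // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for ky in range(-half, half + 1):
                yy = y - ky
                if yy < 0 or yy >= h:
                    continue
                for kx in range(-half, half + 1):
                    xx = x - kx
                    if 0 <= xx < w:
                        acc += image[yy][xx] * kernel[ky + half][kx + half]
            out[y][x] = acc
    return out

def mean_std(img):
    """Mean and (population) standard deviation of all pixel values."""
    values = [v for row in img for v in row]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(var)

def gabor_features(image, wavelengths=(4, 8, 16, 32, 48, 64),
                   angles_deg=(0, 45, 90, 135)):
    """Mean and standard deviation of each filtered image: 6 x 4 x 2 = 48 values."""
    feats = []
    for lam in wavelengths:
        for deg in angles_deg:
            response = convolve(image, gabor_kernel(lam, math.radians(deg)))
            feats.extend(mean_std(response))
    return feats
```

The direct double loop is slow on a full 128×128 sub-image; a practical version would use an FFT-based or library convolution, but the structure of the feature vector stays the same.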

B. LDCM feature

Ho et al. [11] introduced the Local Direction Contribution Method (LDCM) for character recognition. It first segments the character into several cells and then counts the direction distribution of the black pixels in each cell. Position and direction together form the final feature vector. In this paper, we first binarize the original image and manually segment the feature characters from the document. Gravity normalization [12] is used to normalize each feature character into an image of 64×64 pixels. We then extract the character skeleton and segment the skeleton image into 16 cells. Finally, we compute the direction distribution of the black pixels in each cell.

Because the strokes of Chinese characters lie mainly in the horizontal, vertical, left-diagonal and right-diagonal directions, we quantize the direction of every black pixel into one of these four directions. At each black pixel in the image, the longest continuous run of black pixels in each of the four directions is computed, and the pixel is labeled with the direction in which the run length is maximum. That is, each black pixel is labeled as part of a stroke in one of the four directions. In our experiments, we assign a value from 1 to 4 (for horizontal, left diagonal, vertical and right diagonal) that indicates the direction of the run with the maximum length at the current location in the image. A value of 0 is recorded for a cell that contains no skeleton pixels or lacks a stroke in one of the four directions. For each of the 16 cells in the image area, the labeled black pixels of each type in that cell are counted. The counts are then normalized by the total number of black pixels in the skeleton image. The stroke direction distribution is represented by a 64-dimensional feature vector, which stores the normalized counts of black pixels of each of the four types in the 16 cells. The statistical characteristics of the feature character “他” are shown in Figure 3, together with its 64-dimensional feature vector.

Figure 3. (a) Original character. (b) Binary character after gravity normalization. (c) Skeleton image. (d) Skeleton direction in 0°, 45°, 90° and 135°. (e) 64-dimensional feature vector.
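The run-length labeling and per-cell counting of the LDCM feature can be sketched as follows. This is a minimal illustration, assuming the input is already a binarized, gravity-normalized and skeletonized 64×64 image given as a list of rows of 0/1 values; skeletonization itself is not shown. Three details are assumptions of the sketch, since the text does not fix them: image rows grow downward, so 45° is taken as the up-right/down-left diagonal; ties between directions go to the first direction in the order horizontal, 45°, vertical, 135°; and an isolated pixel is therefore counted as horizontal. The function names are hypothetical.

```python
# Unit steps (dx, dy) for the four stroke directions, in label order 1..4:
# horizontal (0 deg), 45 deg, vertical (90 deg), 135 deg.
# Assumption: rows grow downward, so 45 deg runs up-right / down-left.
DIRECTIONS = [(1, 0), (1, -1), (0, 1), (1, 1)]

def run_length(img, x, y, dx, dy):
    """Length of the continuous run of black pixels through (x, y) along +-(dx, dy)."""
    h, w = len(img), len(img[0])
    length = 1
    for sign in (1, -1):
        cx, cy = x + sign * dx, y + sign * dy
        while 0 <= cx < w and 0 <= cy < h and img[cy][cx]:
            length += 1
            cx += sign * dx
            cy += sign * dy
    return length

def ldcm_features(skeleton, grid=4):
    """Return grid*grid*4 direction counts, normalized by the number of black pixels."""
    h, w = len(skeleton), len(skeleton[0])
    cell_h, cell_w = h // grid, w // grid
    counts = [0] * (grid * grid * 4)
    total = 0
    for y in range(h):
        for x in range(w):
            if not skeleton[y][x]:
                continue
            total += 1
            runs = [run_length(skeleton, x, y, dx, dy) for dx, dy in DIRECTIONS]
            label = runs.index(max(runs))   # ties go to the first direction (assumption)
            cell = min(y // cell_h, grid - 1) * grid + min(x // cell_w, grid - 1)
            counts[cell * 4 + label] += 1
    if total == 0:
        return [0.0] * len(counts)          # empty image: all-zero feature vector
    return [c / total for c in counts]
```

With `grid=4` on a 64×64 image, each cell is 16×16 pixels and the result has 16 × 4 = 64 entries, matching the dimensionality stated above; entries for cells without skeleton pixels stay at 0.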

IV. EXPERIMENT DESIGN

A. The Weighted Euclidean Distance (WED) Classifier

A nearest neighbor classifier based on the weighted Euclidean distance was used to classify the samples in our experiments. Representative features for each writer are determined from the features extracted from the training handwriting texts of that writer. Then, for an unseen handwritten text block by an unknown writer (who has contributed training images), the same feature extraction operations are carried out. The extracted features are then compared with the representative features of the set of known writers. The handwriting is attributed to writer K by the WED classifier if the following distance function attains its minimum at k = K:

$$ d(k) = \sum_{n=1}^{N} \frac{\left(f_n - f_n^{(k)}\right)^2}{\left(v_n^{(k)}\right)^2} \tag{3} $$

where $f_n$ is the nth feature of the input document, and $f_n^{(k)}$ and $v_n^{(k)}$ are the sample mean and sample standard deviation of the nth feature of writer k, respectively.
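A minimal sketch of the WED classifier of Eq. (3), together with the Top-k identification rate used to report results, might look as follows. The function names and the `eps` guard against a zero standard deviation are assumptions of this sketch; the text does not say how a zero deviation is handled.

```python
from statistics import mean, stdev

def train_writer_models(samples_by_writer):
    """Map each writer to per-feature (sample mean, sample standard deviation) lists."""
    models = {}
    for writer, samples in samples_by_writer.items():
        columns = list(zip(*samples))            # one tuple of values per feature
        models[writer] = ([mean(c) for c in columns],
                          [stdev(c) for c in columns])
    return models

def wed(features, model, eps=1e-6):
    """Eq. (3); eps guards against a zero standard deviation (an assumption)."""
    means, stds = model
    return sum((f - m) ** 2 / (max(v, eps) ** 2)
               for f, m, v in zip(features, means, stds))

def rank_writers(features, models):
    """Writers sorted by increasing distance; the first entry is the WED decision."""
    return sorted(models, key=lambda k: wed(features, models[k]))

def top_k_rate(test_set, models, k):
    """Fraction of (true_writer, features) pairs whose writer is among the k nearest."""
    hits = sum(1 for writer, feats in test_set
               if writer in rank_writers(feats, models)[:k])
    return hits / len(test_set)
```

Each writer needs at least two training samples, because the sample standard deviation is undefined for a single sample. Dividing by the per-writer variance means that features a writer reproduces consistently weigh more heavily in the distance than features that vary from sample to sample.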

B. Experiment method

A source document in Chinese, containing the ten Arabic numerals 0 to 9, was copied three times by each of the 10 writers. The other sample, about 5 to 10 lines of characters, was a self-generated description on any topic in free writing style. We randomly selected two of the same-content documents as the training set and used the rest as the test set.

For the Gabor feature, we first binarize the image and remove the gaps between rows and columns, and then use bilinear interpolation to normalize the image to 256×256 pixels. The feasibility of this method is analyzed in detail in the next section. To obtain more training samples for each writer and improve accuracy, we divide each image evenly into 4 sub-images of 128×128 pixels. Every writer thus has 8 training samples, and the remaining two samples are divided in the same way into 8 test samples, four of which have content that also appears in the training samples. The total number of trials is 800 (4×10×10 + 4×10×10).

For the LDCM feature, we primarily test its use in text-dependent writer identification, because the study relies on feature characters. These experiments were therefore carried out only on the three same-content documents. As above, we randomly choose two of the three documents and select 10 feature characters in each as the training set, and test the discriminability using the feature characters from the third. Experiments were first run 100 times with a single feature character, each sample being compared with all the others apart from itself, then $C_{10}^{2} \times (10 + 10) = 900$ times with combinations of any two feature characters, and 3600 times with combinations of three feature characters. The final identification rate is the average over all combinations. The feature characters should be chosen to have rich but concise strokes that readily reveal the writer's individuality, such as the character shown in Figure 1(b).

C. Experiment result

Considering the difficulty we encountered in character segmentation, we adopted a global compression method to produce the texture image described above. Figure 4 shows the discriminability of the Gabor feature in the text-dependent and text-independent cases and its average discriminability based on WED. The text-dependent result based on global compression fell somewhat short of our expectations: the accuracy is not as high as we had supposed, although the Top-3 correct identification rate still reaches 85%. A likely underlying reason is the non-uniform normalization of individual characters. We conclude that, to improve the correct identification rate, the texture image must be constructed at the character level during preprocessing. Otherwise, the text-dependent discriminability is no better than the text-independent one. Nevertheless, the experiment confirms that the Gabor feature is an effective feature for writer identification.

Figure 5 shows the classification performance of WED with the LDCM feature. The feature characters selected in our experiments are “要, 行, 动, 前, 们, 始, 牵, 坚, 成, 作”. The Top-6 writer identification rate is 30% when using one feature character, and the Top-7 rate is 60% for combinations of two. The performance improves markedly with combinations of any three of these feature characters, for which the Top-3 writer identification rate reaches 100%.
From the experiments we infer that the selection of feature characters affects the discriminability to some extent, and that the results are subject to strong chance effects, especially when too few feature characters are available. These chance effects remain low, and the discriminability becomes stable, when three randomly chosen feature characters are used. Our experiments therefore indicate that the LDCM feature based on the character skeleton is an effective feature for writer identification. Comparing the Gabor and LDCM features, LDCM clearly performs better than the Gabor method.
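As a quick arithmetic check on the trial counts quoted in Section IV.B, the snippet below reproduces them. The Gabor total uses the two terms given in the text. For LDCM, the text gives a formula only for pairs; reading all three counts as C(10, k) × 10k is an assumption of this sketch, chosen because it matches the stated 100, 900 and 3600.

```python
from math import comb

WRITERS = 10      # writers in the study
CHARACTERS = 10   # feature characters selected per document

def ldcm_trials(k):
    """Trials for combinations of k feature characters, read as C(10, k) * (10 * k)."""
    return comb(CHARACTERS, k) * (WRITERS * k)

gabor_trials = 4 * 10 * 10 + 4 * 10 * 10   # the two terms given in the text

print(gabor_trials, [ldcm_trials(k) for k in (1, 2, 3)])
# prints: 800 [100, 900, 3600]
```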

Figure 4. Discriminability of the Gabor feature.

Figure 5. Discriminability of the LDCM feature.

V. CONCLUSION

In view of the difficulties of handwritten character segmentation, we adopted a global compression based on bilinear interpolation to produce the texture image. Experiments on text-dependent and text-independent samples revealed that the global compression can affect the discriminability of the Gabor feature. The LDCM feature based on the character skeleton can be used effectively for writer identification when three feature characters are combined, and it performed better than the Gabor feature.

The experiments verified that the direction distribution of the character skeleton is an effective discriminative feature and also showed that the local feature performs better than the global feature for writer identification. The limitations of our experiments will be addressed in future work, including experiments on more handwriting samples, comparison of the discriminability of different local features, fusion of characteristics at different layers, and the selection and integration of different classifiers.

REFERENCES

[1] H. E. S. Said, T. Tan, and K. Baker, “Personal identification based on handwriting,” Pattern Recognition, vol. 33, pp. 149–160, 2000.
[2] E. N. Zois and V. Anastassopoulos, “Morphological waveform coding for writer identification,” Pattern Recognition, vol. 33, pp. 385–398, 2000.
[3] L. Schomaker and M. Bulacu, “Automatic writer identification using connected-component contours and edge-based features of uppercase western script,” IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 26(6), pp. 787–798, 2004.
[4] G. Leedham and S. Chachra, “Writer identification using innovative binarised features of handwritten numerals,” in Proc. 7th Int. Conf. on Document Analysis and Recognition, pp. 413–417, 2003.
[5] A. Schlapbach and H. Bunke, “Off-line handwriting identification using HMM based recognizers,” in Proc. 17th Int. Conf. on Pattern Recognition, vol. 2, pp. 654–658, 2004.
[6] Vivian Blankers and Ralph Niels, “Writer identification by means of loop and lead-in features,” in Proceedings of the 19th Belgian-Dutch Conference on Artificial Intelligence (BNAIC 2007), pp. 17–24, Utrecht, The Netherlands, November 5–6, 2007.
[7] Bin Zhang and Sargur N. Srihari, “Analysis of Handwriting Individuality Using Word Features,” Proceedings International Conference on Document Analysis and Recognition (ICDAR), Edinburgh, Scotland, August 2003, pp. 1142–1146.
[8] B. Zhang and S. N. Srihari, “Individuality of Handwritten Characters,” Proceedings International Conference on Document Analysis and Recognition (ICDAR), Edinburgh, Scotland, August 2003, pp. 1086–1090.
[9] Kun Yu, Yunhong Wang, Tieniu Tan, “Writer Authentication Based on the Analysis of Strokes,” Proceedings of SPIE, vol. 5404, pp. 215–224, 2004.
[10] P. Kruizinga, N. Petkov and S. E. Grigorescu, “Comparison of texture features based on Gabor filters,” in V. Roberto et al. (Eds.), Proceedings of the 10th International Conference on Image Analysis and Processing, Venice, Italy, September 27–29, 1999, pp. 142–147.
[11] Tin Kam Ho, Jonathan J. Hull and Sargur N. Srihari, “A Word Shape Analysis Approach to Lexicon Based Word Recognition,” Proceedings USPS Advanced Technology Conf., Washington, D.C., November 1990, pp. 217–229.
[12] Liu Chenlin, Dai Ruwei, Liu Yingjian, “Character Image Preprocessing and Matching for Writer Identification,” Journal of Chinese Information Processing, vol. 10, no. 3, pp. 50–57, 1995 (in Chinese).
