Sparse-parametric writer identification using heterogeneous feature groups

L.R.B. Schomaker, M. Bulacu and M. van Erp
AI Institute, Groningen University, NICI, The Netherlands

AI (RuG) · NICI (Nijmegen Institute for Cognition and Information)

Problem

Traditional methods for forensic writer identification require considerable manual effort in individual-character measurements by human experts. However, with current background-removal methods, it has become feasible to use automatic image-based features on regions of interest which describe the individuality of handwriting style. Nevertheless, a single feature representation cannot be expected to capture all particularities of writing style, and combination methods are needed. The application domain precludes training on large datasets, so sparse-parametric combination methods are preferred, excluding MLP- or SVM-based combination functions.

f4: (Brush PDF, continued) After summing all luminances, the accumulator window is normalized to a volume of 1, yielding a PDF for ink presence at stroke endings in any direction. This feature is not size invariant: the window of 15x15 pixels was chosen because it captures 6-7 pixel-wide ink traces (size normalization is assumed). Figure 2 displays the overall shape and subtle writer differences.

WYX P


· T¹¸»º¬¼©½ ¶ ½ ¾ µ

Given a sample of unknown identity uniquely labeled

µ



can construct a Borda rank-combination scheme. Assume a set of tween the unknown sample

and the reference set

for each feature group

É ÌÀ ÌÊ Ã ÀÁ Ë ÀÆ

Ë Ê Ã Ì À ÆSÄ ÌªÀ ÍªÎ²Ï ¶ Ò¹ÓSÕ ÖØ× Ù à Û Ð Ã ÑÅæ å Æ Ã Í ¶ ÀÆ

µ

which returns for each dimension in

set

with respect to

dexed

function



. Thus


is the distance between an unknown sample

operating on a tensor:

Feature        Explanation                                 Dim.  Distance
f1  ACF        Autocorrelation in horizontal raster        100   Euclid.
f2  VrunB      PDF of vertical run lengths of ink          100   χ²
f3  HrunW      PDF of horizontal run lengths of ’white’    100   χ²
f4  Brush      Ink-density PDF at stroke endings           225   χ²
f5  p(φ)       Edge-direction PDF                          16    Euclid.
f6  p(φ1,φ2)   Hinge-angle combination PDF                 464   χ²
f7  p(φ1,φ3)   Horiz. edge-angle co-occurrence             512   χ²
f8  WR         Writer: handedness, sex, age, style         16    Euclid.

f1: ACF, autocorrelation function of the horizontal raster, detects the presence of regularity in writing: regular vertical strokes will overlap in the original row and its horizontally shifted copy for offsets equal to integer multiples of the local wavelength. Every row of the image is shifted onto itself by a given offset and then the normalized dot product between the original row and the shifted copy is computed. The maximum offset (’delay’) corresponds to 100 pixels. All autocorrelation functions are then accumulated for all rows and the sum is normalized to obtain a zero-lag correlation of 1.
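The ACF computation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `horizontal_acf`, the convention that ink has high pixel values, and the choice to keep the zero-lag entry in the returned vector are all assumptions.

```python
import numpy as np

def horizontal_acf(image, max_offset=100):
    """Accumulate the normalized row autocorrelation over all rows.

    image: 2-D array in which ink has high values (assumed convention).
    Returns a vector indexed by offset 0..max_offset, scaled so that the
    zero-lag entry equals 1.
    """
    img = np.asarray(image, dtype=float)
    acf = np.zeros(max_offset + 1)
    for row in img:
        for lag in range(max_offset + 1):
            if lag >= row.size:
                break
            original = row[: row.size - lag]   # overlapping part of the row
            shifted = row[lag:]                # the horizontally shifted copy
            norm = np.linalg.norm(original) * np.linalg.norm(shifted)
            if norm > 0:
                acf[lag] += np.dot(original, shifted) / norm
    if acf[0] > 0:
        acf /= acf[0]                          # zero-lag correlation becomes 1
    return acf
```

For a synthetic image with vertical strokes every 4 pixels, the result peaks at offsets 4, 8, ... and is zero in between, which is the regularity the feature is meant to detect.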


f5: p(φ), simple edge-direction PDF, is computed by considering the PDF of quantized directions of the Sobel edges in the image. Sixteen bins were used in the histogram (Figure 3).
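A rough sketch of a 16-bin direction histogram built from Sobel responses is given below. The poster does not state how the direction is derived from the Sobel output, so using the gradient angle from `arctan2`, the magnitude threshold `min_magnitude`, and the name `edge_direction_pdf` are assumptions made for illustration only.

```python
import numpy as np

def edge_direction_pdf(image, n_bins=16, min_magnitude=1e-6):
    """Histogram of quantized Sobel gradient directions, normalized to sum 1."""
    img = np.asarray(image, dtype=float)
    # 3x3 Sobel responses on the interior pixels, written out with slices.
    gx = (img[:-2, 2:] + 2 * img[1:-1, 2:] + img[2:, 2:]) \
       - (img[:-2, :-2] + 2 * img[1:-1, :-2] + img[2:, :-2])
    gy = (img[2:, :-2] + 2 * img[2:, 1:-1] + img[2:, 2:]) \
       - (img[:-2, :-2] + 2 * img[:-2, 1:-1] + img[:-2, 2:])
    magnitude = np.hypot(gx, gy)
    angle = np.arctan2(gy, gx)                 # in [-pi, pi]
    strong = magnitude > min_magnitude         # ignore flat regions
    bins = np.floor((angle[strong] + np.pi) / (2 * np.pi) * n_bins).astype(int)
    bins %= n_bins                             # angle == pi wraps to bin 0
    hist = np.bincount(bins, minlength=n_bins).astype(float)
    total = hist.sum()
    return hist / total if total > 0 else hist
```

On an image containing a single vertical step edge, all probability mass falls into one bin, as expected for a single dominant direction.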

ûüüü üüü üüüý

It is known that axial pen force (’pressure’) is a highly informative signal in on-line writer identification [1]. In ink traces of ball-point pens, there exist lift-off and landing shapes in the form of blobs or tapering [2] due to the ink-depositing process during take-off and landing of the pen. A convolution window of 15x15 pixels was used, only accumulating the local image if the current region satisfies the constraints for a stroke ending. This constraint is determined by a supraliminal ink intensity in the central pixel of the window, co-occurring with a long run of white pixels along at least 50% of the perimeter of the window, which is interrupted by one ink strip of at least 5% of the window perimeter (Figure 1).
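The stroke-ending constraint and the accumulation step can be illustrated with the sketch below. It is one possible reading of the description, not the authors' code: the names `perimeter_pixels`, `is_stroke_ending` and `brush_pdf`, the ink threshold of 0.5, the convention that ink has high values, and the interpretation "exactly one ink strip on the perimeter" are assumptions.

```python
import numpy as np

def perimeter_pixels(window):
    """Border pixels of a square window, in clockwise order."""
    return np.concatenate([
        window[0, :],          # top edge, left to right
        window[1:, -1],        # right edge, top to bottom
        window[-1, -2::-1],    # bottom edge, right to left
        window[-2:0:-1, 0],    # left edge, bottom to top
    ])

def is_stroke_ending(window, ink_threshold=0.5, min_white=0.5, min_ink=0.05):
    """Central pixel is ink and the perimeter holds one ink strip in white."""
    center = window.shape[0] // 2
    if window[center, center] <= ink_threshold:
        return False
    ring = perimeter_pixels(window) > ink_threshold      # True where ink
    # Count white-to-ink transitions around the closed perimeter.
    n_strips = int(np.sum(ring & ~np.roll(ring, 1)))
    if n_strips != 1:
        return False
    ink_fraction = ring.mean()
    return ink_fraction >= min_ink and (1.0 - ink_fraction) >= min_white

def brush_pdf(image, size=15, ink_threshold=0.5):
    """Accumulate windows at stroke endings and normalize to volume 1."""
    img = np.asarray(image, dtype=float)
    half = size // 2
    acc = np.zeros((size, size))
    for r in range(half, img.shape[0] - half):
        for c in range(half, img.shape[1] - half):
            window = img[r - half:r + half + 1, c - half:c + half + 1]
            if is_stroke_ending(window, ink_threshold):
                acc += window
    total = acc.sum()
    return acc / total if total > 0 else acc
```

With `size=15` the accumulator has 225 cells, matching the dimensionality listed for f4 in Table 1.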



      



ÿþ

(1)

    



Ê

æÊÃÆ

¼P

¼ · TÔ¸ ¿ ¾

We evaluated the effectiveness of different features for writer identification using the Firemaker data set [4]. A total of 251 Dutch subjects each wrote four different A4 pages. On page 1 they were asked to copy a text presented as machine-printed characters. On page 2 they were asked to describe a given cartoon in their own words. The same kind of paper, pen and support were used for all subjects. The A4 sheets were scanned at 300 dpi, 8 bits/pixel gray-scale. Performance was tested using leave-one-out. For a query sample, the reference set will contain one matching sample of the same writer and 500 distractor samples by 250 other writers.
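The leave-one-out protocol with a nearest-neighbour hit list can be sketched as below. The names `euclidean` and `top_k_rate` and the toy data in the usage note are hypothetical; the poster does not give its evaluation code.

```python
import numpy as np

def euclidean(u, v):
    """Plain Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(np.asarray(u, float) - np.asarray(v, float)))

def top_k_rate(features, writer_ids, k, distance=euclidean):
    """Fraction of queries whose top-k hit list contains the correct writer.

    Each sample in turn is the query; all remaining samples form the
    reference set (leave-one-out).
    """
    n = len(features)
    hits = 0
    for q in range(n):
        others = [i for i in range(n) if i != q]
        dists = [distance(features[q], features[i]) for i in others]
        hit_list = np.argsort(dists, kind="stable")[:k]   # nearest first
        if any(writer_ids[others[i]] == writer_ids[q] for i in hit_list):
            hits += 1
    return hits / n
```

For example, with four toy samples `[[0, 0], [0.1, 0], [1, 1], [1.1, 1]]` and writer labels `["a", "a", "b", "b"]`, every query finds its own writer at rank 1, so the Top-1 rate is 1.0.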


Figure 3. (left): Two handwriting samples from two different subjects. (right): Superimposed polar diagrams of the edge-direction distribution for pages 1 and 2 contributed to our data set by each of the two subjects.

f6: p(φ1, φ2), hinge-angle combination PDF. In order to capture the curvature of the ink trace, which is very typical for different writers, another feature is needed, using local angles along the edges [3]. The computation of this feature is similar to the one previously described, but it has added complexity. The central idea is to consider the two edge fragments emerging from a central pixel and, subsequently, compute the joint probability distribution of the orientations of the two fragments of this ’hinge’. The final normalized histogram gives the joint probability distribution p(φ1, φ2), quantifying the chance of finding in the image two ”hinged” edge fragments oriented at the angles φ1 and φ2, respectively. The orientation is quantized in 16 directions for a single angle. We consider only the non-redundant angle combinations and also eliminate the cases in which the ending pixels have a common side. Therefore the final number of combinations is 464. See Figure 4 for more details.

Figure 6 axes: x = rank (or: size of writer hit list), 1 to 10; y = % correct writer in hit list.

Figure 6. (left): Results for individual feature groups. (right): Results for sorted feature groups, using sequential Borda rank combination.

Recent tests with the Min operator, not reported here, indicate that this rule may be preferable to sequential Borda. Ongoing studies have revealed still better identification performance if (a) feature vectors are computed separately from the upper and lower parts of lines of text [5], and additional improvement if (b) local component-shape features are used (Fig. 7).

Figure 7 panels: comparison of split- vs. entire-line features on lower-case text and on UPPER-case text. x-axis: list size (1 to 50). Curves: p(phi1, phi2), p(phi1, phi3), p(brush), p(phi) and p(rl), each for split and entire lines.

Figure 7. Recent results on refinement of angular features [5]

ÑÇ Ñ ° W

Conclusions

Localized, angular (co-)occurrences based on edges are very good features for writer identification.

Figure 4. (left): The computation of the angular ’hinge’ feature and (right): An example of a single-writer hinge PDF.

f7: p(φ1, φ3), horizontal edge-angle co-occurrence. This feature is a variant of the edge-hinge feature, in that the combination of angles is computed along the rows of the image. For the angle φ1 of a found edge fragment, the co-occurrence probability is computed with the angles φ3 of fragments which are horizontally displaced from it (Figure 5).

Actual forensic systems (System A: 34% (90%) and System B: 65% (90%) for Top-1 (Top-10), using only a subset of 100 from the same data) are largely outperformed by our method: 79% (95%).

³

³

f4: Brush, ink-density PDF at stroke endings.





æ ÀÃ Æ ÎÏ Ê Ù Ö Ã µHÄŶÆ ¼ Ë Ê Ã Â Ë Ê Ã Á À  à à µHÄŶÆ Æ æ À Ã Æ ËÊ Æ


Results

f2: VrunB, PDF of vertical run lengths in ink
f3: HrunW, PDF of horizontal run lengths in background pixels

Run lengths are determined on the binarized image, taking into consideration either the black pixels, corresponding to the ink-trace width distribution, or the white pixels, corresponding to the horizontal stroke and character-placement distribution for the writer. The histogram of run lengths is normalized and interpreted as a probability distribution. We use horizontal run lengths of up to 300 pixels (f3) and vertical run lengths of up to 100 pixels (f2), i.e., the height of a written line in the data set used (resolution is 300 dpi). This feature is not size invariant. However, size normalization is not an issue in interactive writer search. The run-length PDFs provide information orthogonal to the directional features.
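The run-length PDFs f2 and f3 can be sketched with one helper, shown below. The function name `run_length_pdf`, the encoding of ink as 1 and background as 0, and the decision to count runs that touch the image border are assumptions for illustration.

```python
import numpy as np

def run_length_pdf(binary_image, value, axis, max_length):
    """Normalized histogram of run lengths of `value` pixels.

    axis=0 scans columns (vertical runs), axis=1 scans rows (horizontal runs).
    Runs longer than max_length are ignored.
    """
    img = np.asarray(binary_image)
    lines = img.T if axis == 0 else img
    hist = np.zeros(max_length, dtype=float)
    for line in lines:
        run = 0
        for pixel in line:
            if pixel == value:
                run += 1
            else:
                if 1 <= run <= max_length:
                    hist[run - 1] += 1        # close the current run
                run = 0
        if 1 <= run <= max_length:
            hist[run - 1] += 1                # run that reaches the border
    total = hist.sum()
    return hist / total if total > 0 else hist
```

Under these assumptions, f2 would correspond to `run_length_pdf(img, value=1, axis=0, max_length=100)` and f3 to `run_length_pdf(img, value=0, axis=1, max_length=300)`.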

ÿ þþ   

  ÿþ   

 ÿ  



where the combination function returns a vector which has a monotonic relation to the combined rank vector. The output hit list contains the samples in the final rank order. In the regular Borda vote, the combination function is the Sum function: ranks are summed per dimension before being re-sorted by the rank operator. However, many Borda-operator variants are known: Sum, Max, Median, Min, Majority, Plurality, etc. In this study, we tested the use of the Sum operator. The problem of the Sum operator is that all votes are treated equally. Since the Median did not improve on this, we applied the Sum rule in a sequential and cumulative fashion, from the worst to the best feature group. This is comparable to taking a weighted sum with rank weights that depend on the quality index of the feature group (1 = best, the highest index = worst) after optimal group reordering.
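The difference between the regular Sum rule and the sequential, cumulative variant can be sketched as follows. This is one interpretation of the description, not the authors' code: the names `rank_vector`, `borda_sum` and `sequential_borda`, the re-ranking after every step, and the tie-breaking by sample index are assumptions.

```python
import numpy as np

def rank_vector(values):
    """Rank operator: 1 for the smallest value; ties broken by sample index."""
    order = np.argsort(values, kind="stable")
    ranks = np.empty(len(values), dtype=int)
    ranks[order] = np.arange(1, len(values) + 1)
    return ranks

def borda_sum(distance_vectors):
    """Regular Borda vote: sum the ranks per sample, then rank the sums."""
    rank_matrix = np.array([rank_vector(d) for d in distance_vectors])
    return rank_vector(rank_matrix.sum(axis=0))

def sequential_borda(distance_vectors_worst_to_best):
    """Cumulative Sum rule: fold in one feature group at a time.

    Groups are given from worst to best, so later (better) groups are
    diluted less by the earlier ones.
    """
    combined = rank_vector(distance_vectors_worst_to_best[0])
    for distances in distance_vectors_worst_to_best[1:]:
        combined = rank_vector(combined + rank_vector(distances))
    return combined
```

In a toy case with four reference samples and three feature groups, the plain Sum rule can end in a tie while the sequential variant lets the best (last) group decide the top of the hit list.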

Probability of correct hit %

Feature ACF VrunB HrunW Brush

rank, in-

Data & Evaluation

14

Figure 2. Superimposed brush PDFs for two writers, and examples of an ”a” tail for two writers

Feature & Distance function Overview

A=

and a known sample of

,

. Then a Borda rank combination scheme can be considered as a rank-combination

100

?=

of a rank

guarantees that

φ

:<;>?$= @ AC= B

ÃÑ Æ Ð Ã ÒÔÓ¦Õ ÖØ× Ù Ú ÛÝÜ(ÒÔÓSÕ ÅÖ × Ù Þ Û¢Ü È È Ü(ÒÔÓSÕ ÖÅ× Ù ß»Û Æ µ áãâ ä

according to values in , in ascending order. The dimension

¼¿

12

, such that

(i.e, handwritten sample) its unique rank in the

ú çè3éêÔëVì¥íïî>ðñ-ò 6 ó ô ö õ ÷ùø


be-

more, given that a vector of ranks will be denoted by , assume the availability of a rank operator

where


distance vectors

each dimension of the distance vectors corresponds to one and the same sample index. Further-


Figure 1. (left): A lower-case letter a with its tail stroke. (right): An example of detecting end strokes on the basis of a central inked pixel and a constrained ink and paper runlength configuration around the window border (actually 15x15 pixels).

¿

, each

feature groups describing a sample, we


A number of feature groups have been selected for this experiment, on the basis of the literature and earlier work on on-line writer identification. Complementarity of the extracted information in the feature groups was an important design goal.

Table 1: Feature groups used for writer identification and the distance function used between two samples. Colors correspond to performance-curve colors in Figure 6.

¿

¶ ÁSÀ  à µHÄŶÆ Ç ¼ · T Ä¦È È ÄÅ¿ ¾

and a universe of samples of known writer identity

, and assuming there exist

value uniquely refers to a sample in


Method

Forensic writer search is similar to Information Retrieval, yielding a hit list, in this case of suspect documents, given a query in the form of a questioned script sample. Given the requirements, simple nearest-neighbour search is a viable solution. However, a proper distance function has to be identified. For the combination of results, rank combination (Borda) will be tested.

Borda Rank-Combination Schemes

´

¨

For feature vectors which are PDFs, the χ² distance measure is mostly the natural choice.
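A χ² distance between two PDFs can be sketched as below. The poster does not spell out the formula; the symmetric form with the sum of both PDFs in the denominator is a common variant and is assumed here, as is the name `chi_square_distance`.

```python
import numpy as np

def chi_square_distance(p, q):
    """Sum over bins of (p_i - q_i)^2 / (p_i + q_i), skipping empty bins."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    denom = p + q
    mask = denom > 0                  # bins empty in both PDFs contribute 0
    return float(np.sum((p[mask] - q[mask]) ** 2 / denom[mask]))
```

Compared with the Euclidean distance, this form gives relatively more weight to differences in low-probability bins.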

With multiple feature groups, where trained parametric combination cannot be applied, a sequential Borda approach that gives more weight to the better feature groups can be useful.

Figure 5. Computation of the horizontal (or vertical) edge-angle co-occurrence.

f8: Writer characteristics (WR) is a ’pseudo’ feature vector, containing writer parameters which are often known in the application context: handedness, sex, age and style. Style may be one of Handprint, Cursive or Mixed. The parameters are represented as a bit vector. This feature is added to underscore the possibility of using heterogeneous sources of information in a rank-combination scheme.

References

[1] L. R. B. Schomaker and R. Plamondon, “The Relation between Pen Force and Pen-Point Kinematics in Handwriting,” Biological Cybernetics, vol. 63, pp. 277–289, 1990.
[2] D. S. Doermann and A. Rosenfeld, “Recovery of temporal information from static images of handwriting,” in Proc. of IEEE Conference on Computer Vision and Pattern Recognition, 1992, pp. 162–168.
[3] M. Bulacu, L. Schomaker, and L. Vuurpijl, “Writer identification using edge-based directional features,” in Proc. of ICDAR 2003, 2003, pp. 937–941.
[4] L. R. B. Schomaker and L. G. Vuurpijl, “Forensic writer identification [internal report for the Netherlands Forensic Institute],” Tech. Rep., Nijmegen: NICI, 2000.
[5] M. Bulacu and L. Schomaker, “Writer style from oriented edge fragments,” in Proc. of the 10th Int. Conference on Computer Analysis of Images and Patterns, 2003, pp. 460–469.

Acknowledgements: Thanks to the Dutch Forensic Institute (NFI), who enabled us to set up a large data collection; to Katrin Franke of the Fraunhofer Institut (IPK), whose expertise in background removal revived the problem of writer identification in our lab; and to Louis Vuurpijl (NICI) for the data collection and preparatory research. This poster was presented at ICIP’2003, Barcelona.
