On Detecting Spatially Similar and Dissimilar Objects Using Adaboost Wing Teng Ho, Yong Haur Tay Computer Vision and Intelligent Systems (CVIS) Group, Faculty of Information and Communication Technology, Universiti Tunku Abdul Rahman, Malaysia. [email protected]

Abstract
AdaBoost has been shown to be proficient at processing images rapidly while attaining a high detection rate in face detection. The speed of AdaBoost in face detection is demonstrated in [1], where detection runs at 15 frames per second. This combination of speed and high accuracy in locating target objects has made AdaBoost successful in classification problems. In this paper, we examine the capability of AdaBoost with Haar-like features in detecting text in images. We distinguish text into two categories, i.e. fixed text and variable text, which represent spatially similar and dissimilar objects, respectively. As a reference, we first present a face detector using AdaBoost with Haar-like features. Next, we apply the same feature set to fixed text detection and variable text detection. Experimental results show that Haar-like features in AdaBoost are suitable for detecting spatially similar objects such as faces and fixed texts. However, these features are not adequate for detecting spatially dissimilar objects such as variable texts.
Keywords: object detection, AdaBoost, Haar features.

1. Introduction
Haar-like features, as adopted in AdaBoost, have been verified to be successful in face detection, with robust speed and a high detection rate. These features have also been used in other object detection tasks, such as soda can detection and pedestrian detection. The objective of this paper is to demonstrate whether AdaBoost's current Haar-like features are only suitable for "spatially similar object" detection. AdaBoost is a boosting algorithm that selects the features that are suitable for detecting the target object. The AdaBoost examined in this paper implements Haar-like features as the base features used in the detection process.

Spatially similar objects are objects that share similar characteristics in the image space. Fig. 1 illustrates two examples of spatially similar objects, i.e. a face and a fixed text: every face has eyes, a nose, and a mouth at certain relative positions, and likewise we can expect the same character to appear at the same relative position in a fixed text. On the contrary, spatially dissimilar objects are objects that do not share the same features in the image space. Variable texts, such as license plates, where the characters of each object differ from those of the others, are an example of spatially dissimilar objects.

Fig. 1 Two examples of spatially similar objects: (a) Face, (b) Fixed Text

Section 2 explains the background of AdaBoost and alternative algorithms for similar object classification problems. Section 3 illustrates the Haar-like features and the AdaBoost algorithm. Experimental results and analyses are presented in Section 4, followed by the conclusion in the last section.

2. Background
AdaBoost has been a successful approach for face detection that minimizes computational time while obtaining a high detection rate. A new image representation introduced in [1], known as the integral image, is used to calculate the image feature values; it enables Haar-like features to be computed very rapidly, at many scales and locations, in constant time. AdaBoost is an algorithm that selects relevant weak classifiers and builds an efficient classifier by combining multiple weak classifiers. AdaBoost minimizes an exponential function of the margin over the training set, but the strong classifier learned by AdaBoost does not necessarily achieve the minimum error rate. FloatBoost, a procedure presented in [5], generates a better boosted classifier to overcome this problem. It uses a backtracking mechanism after each iteration of AdaBoost to remove the weak classifiers that cause higher error rates. This approach constructs a classifier that contains fewer weak classifiers and achieves a lower error rate; however, the training time increases about five-fold compared with the existing AdaBoost algorithm. There is also an application of detecting and reading text in natural scenes using AdaBoost [4]. The authors use the AdaBoost algorithm to train the strong classifier, with 4 strong classifiers containing 79 features. They claim that the choice of feature sets used for faces is not efficient for text, since text exhibits different spatial variation. Therefore, informative features which give similar results on all text regions are applied in their algorithm to differentiate text from non-text.
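The integral image mentioned above can be illustrated with a short sketch (an illustrative example, not the authors' code): ii holds the cumulative sum of all pixels above and to the left of each position, so the sum over any rectangle costs only four array lookups, regardless of the rectangle's size.

```python
# Sketch of the integral image from [1] (illustrative, not the authors' code).
# ii[y][x] stores the sum of all pixels p[v][u] with v < y and u < x, so the
# sum over any rectangle costs four lookups regardless of its size.

def integral_image(pixels):
    h, w = len(pixels), len(pixels[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += pixels[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii

def rect_sum(ii, x, y, w, h):
    # Sum of the w x h rectangle whose top-left corner is (x, y).
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]

img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
ii = integral_image(img)
print(rect_sum(ii, 0, 0, 2, 2))  # 1+2+4+5 = 12
print(rect_sum(ii, 1, 1, 2, 2))  # 5+6+8+9 = 28
```

Building ii takes one pass over the image; every subsequent rectangle sum is constant-time, which is what makes evaluating thousands of Haar-like features per window feasible.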

3 AdaBoost Training
The AdaBoost training algorithm selects a small number of simple features and cascades them to construct strong classifiers. The scaling of the sub-window size contributes to a very large set of Haar-like features, far larger than the number of pixels. Motivated by the work of Tieu and Viola [1], each weak classifier selects only one feature at each stage of the boosting process. The AdaBoost learning algorithm can thus be seen as a feature selection process that chooses features to be cascaded together in a structure that increases the speed of the detector by focusing on the regions that have higher probabilities of containing the target object. Therefore, after the weak classifiers are cascaded, background areas are easily discarded, increasing both speed and performance.
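The cascade structure described above can be sketched as follows (a simplified, hypothetical illustration: the stage functions and thresholds below are invented toy stand-ins, not trained strong classifiers). A window must pass every stage, and most background windows are rejected by the cheap early stages.

```python
# Sketch of a cascaded detector (illustrative; the stage functions and
# thresholds are hypothetical stand-ins for trained strong classifiers).

def cascade_classify(window, stages):
    """stages: list of (score_fn, threshold) pairs, cheapest first.
    A window is rejected as soon as any stage's score falls below its
    threshold, so later (more expensive) stages run only on survivors."""
    for score_fn, threshold in stages:
        if score_fn(window) < threshold:
            return False          # rejected early: no later stage is evaluated
    return True                   # survived all stages: report a detection

# Toy stages: mean brightness first, then contrast (max - min).
stages = [
    (lambda w: sum(w) / len(w), 50),     # stage 1: cheap test
    (lambda w: max(w) - min(w), 30),     # stage 2: stricter test
]

print(cascade_classify([90, 100, 110, 120], stages))  # True: passes both
print(cascade_classify([5, 6, 7, 8], stages))         # False: fails stage 1
```

This early-rejection structure is why removing layers from the cascade (as done in the experiments of Section 4) raises both the hit rate and the false positive rate.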

3.1 Haar-like Features
Features are used instead of raw pixel values because features provide useful information and encode more knowledge about an image than its raw intensities; detection using features is also much faster than pixel-based detection. In [3], the features for simple object detection are defined over a window of W x H pixels, where W is the width and H is the height of the window. Fig. 2 illustrates the 14 feature prototypes, i.e. 4 edge features, 8 line features, 2 center-surround features, and 1 special diagonal line feature, used in [3, 4, 5]. All of these prototypes are scaled independently in the vertical and horizontal directions to provide an overcomplete set of features.

Fig. 2. 14 types of features commonly used in the object detection of AdaBoost
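Each such prototype evaluates to the difference between the pixel sums of adjacent rectangles. A minimal sketch of one hypothetical two-rectangle edge feature (computed by brute-force summation for clarity; in practice the rectangle sums come from the integral image):

```python
# Sketch of a two-rectangle "edge" Haar-like feature (illustrative example,
# not the authors' implementation). The value is
#   (sum of left half) - (sum of right half)
# of the window; a vertical intensity edge yields a large magnitude.

def two_rect_edge_feature(window):
    h, w = len(window), len(window[0])
    half = w // 2
    left = sum(window[y][x] for y in range(h) for x in range(half))
    right = sum(window[y][x] for y in range(h) for x in range(half, 2 * half))
    return left - right

edge = [[0, 0, 9, 9],
        [0, 0, 9, 9]]
flat = [[5, 5, 5, 5],
        [5, 5, 5, 5]]
print(two_rect_edge_feature(edge))  # -36: strong vertical edge
print(two_rect_edge_feature(flat))  # 0: uniform region
```

Scaling and translating such prototypes over every position in a 20 x 20 window is what produces the overcomplete feature set from which AdaBoost selects.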

3.2. AdaBoost Boosting Learning Algorithm
In every round, AdaBoost selects the best weak classifier from the large set of generated features. The classifier is called "weak" because it is not expected to be perfect or very good at detecting the object by itself; it may classify the training data correctly only slightly better than chance, at about 51% or more. The selected classifiers are later combined and cascaded into a strong classifier. During the learning process, all samples that are classified incorrectly are re-weighted. This allows the next classifier to pay more attention to the samples misclassified by the previous classifiers. In general, a good classification function (classifier) receives a high weight, while a poor one holds a smaller weight. AdaBoost in effect searches for a small set of good classifiers that together provide significant detection performance. Each weak classifier is restricted to one simple feature that best separates the positive samples from the negative samples, and for each weak classifier the optimal threshold is determined:

h_j(x) = 1 if p_j f_j(x) < p_j θ_j, and 0 otherwise
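The threshold-and-polarity weak classifier above, together with the search for its optimal threshold under the current sample weights, can be sketched as follows (an illustrative toy implementation, not the authors' code; the data at the bottom is invented):

```python
# Sketch of weak-classifier (decision stump) training for one AdaBoost round.
# For one feature f, it finds the (threshold, polarity) pair minimizing the
# weighted error  sum_i w_i * |h(x_i) - y_i|  over the training samples.

def train_stump(feature_values, labels, weights):
    best = (float("inf"), None, None)       # (error, threshold, polarity)
    for theta in sorted(set(feature_values)):
        for polarity in (+1, -1):
            err = 0.0
            for f, y, w in zip(feature_values, labels, weights):
                prediction = 1 if polarity * f < polarity * theta else 0
                err += w * abs(prediction - y)
            if err < best[0]:
                best = (err, theta, polarity)
    return best

# Toy data: low feature values tend to be positives (label 1).
values  = [1.0, 2.0, 3.0, 8.0, 9.0]
labels  = [1,   1,   1,   0,   0]
weights = [0.2] * 5
err, theta, polarity = train_stump(values, labels, weights)
print(err, theta, polarity)  # 0.0 8.0 1
```

In a full training run this search is repeated over every Haar-like feature, and the feature whose stump attains the smallest weighted error becomes the round's weak classifier.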

A weak classifier h(x, f, p, θ) consists of a feature f, a threshold θ, and a polarity p indicating the direction of the inequality. The training images are denoted (x_1, y_1), ..., (x_n, y_n), where y_i = 0 for negative samples and y_i = 1 for positive samples. Each sample is initialized with a weight

w_{1,i} = 1/(2m) for negative samples, and w_{1,i} = 1/(2l) for positive samples,

where m is the number of negative samples and l is the number of positive samples. The process then iterates for a large number of rounds until its convergence criteria are met. For t = 1, ..., T, the sample weights are first normalized:

w_{t,i} ← w_{t,i} / Σ_{j=1..n} w_{t,j}

where n is the number of samples. Then, for each candidate weak classifier, the weighted error is calculated:

ε_t = min_{f,p,θ} Σ_i w_i |h(x_i, f, p, θ) − y_i|

The feature with minimum error ε_t is selected and defined as h_t(x) = h(x, f_t, p_t, θ_t) at round t. After that, the algorithm updates the weights of the samples:

w_{t+1,i} = w_{t,i} β_t^{1−e_i}

where e_i = 0 if the example x_i is classified correctly, e_i = 1 if it is classified incorrectly, and

β_t = ε_t / (1 − ε_t)

In the next round of the iteration, the weak classifiers therefore pay more attention to the samples that were classified incorrectly. After T rounds, the strong classifier is constructed by combining the selected weak classifiers:

C(x) = 1 if Σ_{t=1..T} α_t h_t(x) ≥ (1/2) Σ_{t=1..T} α_t, and 0 otherwise, where α_t = log(1/β_t).

4 Experiments
4.1 Experiments Setup
4.1.1 Training Samples
We prepared training samples consisting of positive samples and negative samples, where the positive samples contain the objects we want to detect and the negative samples are images without the target object. Initially, we used about 1000 negative samples, but this resulted in a high false positive rate. In later experiments we therefore increased the number of negative samples to decrease the false positive rate, although this also causes a longer training time.

4.1.1.1 Positive Samples
Face

Fig. 3. Frontal face database with 1000 faces (20 x 20)

The face training set consists of 1000 frontal face images of size 20 x 20 pixels as our positive images. All faces were scaled and aligned to a base resolution of 20 by 20 pixels. The face images in the positive samples are cropped from the eyebrows down to the mouth.

Variable Text

Fig. 4. Artificial license plates created using a generator application (Variable Text)

To inspect the applicability of detecting spatially dissimilar objects using AdaBoost, we chose license plates as the target objects in this experiment. We supply artificial license plates with different brightness levels. The purpose is to determine whether, with the available training data, we can detect license plates that are not in the training set. We provide about 1000 artificial license plates of size 30 x 10 as the variable text positive samples.

Fixed Text

The "Teksi" plate is used to examine the applicability of detecting spatially similar objects using AdaBoost. In this experiment, the word "Teksi" is used as the fixed text; we provide 360 positive samples and about 10000 negative samples, where the 360 positive samples consist of the word "Teksi" in different font types. All positive samples are scaled to a base resolution of 17 x 6.

Fig. 5. Artificial "Teksi" String (Fixed Text)

4.2 Test Set
All images not used in training are used as the test data set. The test images for the license plates were taken from streets, car parks, etc. Meanwhile, to test the fixed text detector, we manually pasted the word "Teksi" onto real-world images with different font types, font colors, and text sizes, all of which vary from the training set.

5 Results & Analysis
5.1.1 Application of Face Detector
Fig. 6 shows the receiver operating characteristic (ROC) curves [1] for the three object detection experiments. The detection rates of the three detectors increase as more layers are removed from the cascaded classifiers; at the same time, the false positive rates increase as well. The curves show how the hit rates change as more layers are removed and how the false positive rates react accordingly. From the experiments, the face detector achieves the best hit rate of 82.21%, with 258 false positives over 46 test images containing a total of 298 faces; the fixed text detector achieves a hit rate of 67.65%, with 80 false positives over 21 images; and the variable text detector attains a hit rate of 13.89%, with 70 false positives over 35 images. Although the base resolutions of the three detectors differ (20 x 20 for the face detector, 17 x 6 for fixed text, and 30 x 10 for variable text), this does not affect the performance of any particular detector, because the final trained detector can detect target objects whose size is equal to or larger than the default size.


Fig. 6 Receiver Operating Characteristic (ROC) curves comparing 20x20 face detector, 30x10 license plate detector, and 17x06 Teksi Plate Detector.

The results obtained from each experiment show that the face detector and the fixed text detector achieve reasonable hit rates with small positive sample sets. For the variable text detector, however, we had to increase the positive sample set to 1000 images just to reach a hit rate of 13.89%. The performance of the face detector can be improved by increasing the number of positive samples. Comparing our face detector with the one in [1], their detector achieves a 92% detection rate with 65 false positives; the significant difference in hit rate arises because our training set (1000 samples) is smaller than theirs (4916 samples). Therefore, to improve the hit rate we may need to add more variants of faces to the training samples, so that the detector can trace a greater variety of faces.

Subsequently, the fixed text detector achieved a moderate hit rate: it does not perform well when the test data contain the word "Teksi" with an aspect ratio different from the training samples, but it is able to detect the word "Teksi" in different font types, font sizes, and font colors. This shows that the detector can trace objects that are spatially similar to the training samples. The license plate detector achieves only a 13.89% hit rate, considerably lower than the other two detectors. Analyzing the test data used to evaluate the trained detector, we found that it does not perform well under different lighting conditions on the license plate. We also note that, although we provided 1000 training images, the detector can only trace license plates similar to those in the training samples.

5.4 Test samples

6. Conclusion
The face, fixed text, and variable text detectors achieve detection rates of 82.21%, 67.65%, and 13.89%, respectively. The face detection rate is much lower than in [1], probably due to insufficient positive training samples. We observe that AdaBoost can be trained to detect spatially similar objects well, i.e. the face and the "Teksi" plate. However, it fails to learn to detect the variable texts. This shows that Haar-like features do not supply sufficient information to AdaBoost for discriminating variable texts. The features required for variable text detection are more global in nature, e.g. the density and variance of sub-regions [6], which are not found in Haar features. In future work, we hope to integrate global features into the feature extraction process by computing those features based on the integral image. We hope that with this implementation the detection rate for variable text will improve.
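The global features proposed above (density and variance of sub-regions [6]) can indeed be computed from integral images: with one integral image of the pixels and one of the squared pixels, the mean and variance of any rectangle follow in constant time. A hypothetical sketch of this idea (illustrative, not the authors' planned implementation):

```python
# Sketch: constant-time mean and variance of any rectangle, using an integral
# image of the pixels and one of the squared pixels (illustrative example).

def integral(pixels, f=lambda v: v):
    h, w = len(pixels), len(pixels[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        s = 0
        for x in range(w):
            s += f(pixels[y][x])
            ii[y + 1][x + 1] = ii[y][x + 1] + s
    return ii

def region_mean_var(ii, ii_sq, x, y, w, h):
    def rect(t):
        return t[y + h][x + w] - t[y][x + w] - t[y + h][x] + t[y][x]
    n = w * h
    mean = rect(ii) / n
    var = rect(ii_sq) / n - mean * mean   # E[v^2] - (E[v])^2
    return mean, var

img = [[2, 2, 8, 8],
       [2, 2, 8, 8]]
ii = integral(img)
ii_sq = integral(img, f=lambda v: v * v)
print(region_mean_var(ii, ii_sq, 0, 0, 2, 2))  # (2.0, 0.0): uniform block
print(region_mean_var(ii, ii_sq, 0, 0, 4, 2))  # (5.0, 9.0): mixed region
```

Because both integral images are built in a single pass, such global sub-region statistics would cost no more per window than the Haar-like features already do.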

7. References
[1] P. Viola and M. J. Jones, "Robust Real-Time Face Detection", International Journal of Computer Vision, vol. 57(2), pp. 137-154, May 2004.
[2] O. Tuzel, F. Porikli, and P. Meer, "Region Covariance: A Fast Descriptor for Detection and Classification", Proc. 9th European Conference on Computer Vision, Graz, Austria, vol. 2, pp. 589-600, 2006.
[3] R. Lienhart, A. Kuranov, and V. Pisarevsky, "Empirical Analysis of Detection Cascade of Boosted Classifiers for Rapid Object Detection", MRL Technical Report, Intel Corporation, Santa Clara, USA, December 2002.

Fig. 7 Output of our face detector

[4] X. R. Chen and A. L. Yuille, "Detecting and Reading Text in Natural Scenes", Proc. 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 366-373, 27 June - 2 July 2004.
[5] S. Z. Li and Z. Q. Zhang, "FloatBoost Learning and Statistical Face Detection", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, pp. 1112-1123, September 2004.
[6] H. Zhang, W. Jia, X. He, and Q. Wu, "Learning-Based License Plate Detection Using Global and Local Features", Proc. 18th International Conference on Pattern Recognition (ICPR'06), Hong Kong, China, pp. 909-912, August 2006.

Fig. 8 Output of our fixed text detector
