Identification of Time-Varying Objects on the Web

Satoshi Oyama [email protected]

Kenichi Shirasuna [email protected]

Katsumi Tanaka [email protected]

Department of Social Informatics, Graduate School of Informatics, Kyoto University Yoshida-Honmachi, Sakyo, Kyoto 606-8501, Japan

ABSTRACT

We have developed a method for determining whether data found on the Web are for the same object or for different objects, taking into account the possibility that attribute values change over time. Specifically, we estimate the probability that the observed data were generated for a single object whose attribute values have changed over time and the probability that the data are for different objects, and we define similarities between observations using these probabilities. By giving a specific form to the distributions of time-varying attributes, we can calculate the similarity between given data and identify objects by agglomerative clustering on the basis of this similarity. Experiments comparing identification accuracy between our proposed method and a method that regards all attribute values as constant showed that the proposed method improves the precision and recall of object identification.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—clustering; I.5.3 [Pattern Recognition]: Clustering—algorithms, similarity measures

General Terms
Algorithms, Experimentation

Keywords
Object Identification, Object-Level Search, Temporal Data

1. INTRODUCTION

We can find various kinds of information on the Web by using search engines. Such information includes news topics like “global warming” and “subprime lending,” HOWTOs like cooking recipes and health tips, and services like transportation timetables and weather forecasts. The names of real-world objects, including persons, products, organizations, and places, are particularly popular queries in Web search.


For example, it is reported that queries using personal names account for 5 to 10% of all Web queries [11]. General-purpose search engines receive keywords as queries from users and return lists of links to pages containing the keywords along with snippets from the pages. Although user needs are diverse, the kinds of information provided in the search results and the output styles are uniform and do not reflect the type of query submitted. For example, in a person search, a user is usually looking for attribute information about a specific person (typically, his or her name is specified in the query), such as his or her occupation or contact address. A general-purpose search engine simply returns a list of pages over which the target information is dispersed, and the pages generally include information irrelevant to the target. This makes finding the required information in the search results a burdensome task. Our recent large-scale survey of Web users [19] showed that many (45.7%) want a domain-focused search capability. Even more (48.1%) want to receive more than page titles and snippets in their search results.

To satisfy such user needs, there has been much research on systems that enable domain-specific, i.e., vertical, Web searching [5, 6, 7, 10, 18, 23, 27, 29]. Domain-specific search engines, or vertical search engines, enable more accurate Web searching in a specific domain than general-purpose search engines. They use focused crawling or meta-searching to collect Web pages related to the domain. Object-level search engines [16, 20, 21, 22], a type of domain-specific search engine, treat an object, rather than a page, as a retrieval unit. For example, bibliographic data on a certain paper may appear on many pages in various formats. An object-level search engine aggregates information on a certain object, removes duplicate items, and displays the results for different objects separately. To enable this function, an object schema is defined to specify object attributes and relations between objects. The search engine extracts information on objects from Web pages in accordance with the object schema.

In object-level search, a process called object identification is necessary. Object identification associates distinct observations with distinct objects. It is generally done by clustering observed data on the basis of their similarities. Although object identification is not a new problem and had been addressed in database research even before the Web was introduced, identifying objects appearing on the Web is a much more difficult task.

One reason for the difficulty is related to the accuracy of information extraction.

In Web object retrieval, attribute values are not entered by a person as with traditional databases. They are extracted automatically from Web pages, and the heterogeneity of Web pages causes errors in information extraction. This leads to incorrect attribute values, which makes object identification difficult.

Another major difficulty comes from the fact that attribute values of objects change over time. For some kinds of objects, such as research papers, which were commonly treated as target objects in previous research on object-level search, the attribute values (titles, venues, etc.) are determined when the objects are generated and do not change over time. However, the attribute values of objects like persons and companies change over time. For example, it is not unusual for the affiliation and/or address of a person to change every few years. The address and officers of a company also occasionally change. In a managed database, obsolete information is discarded or overwritten with new information, so the consistency of object attributes is usually maintained. On the Web, however, it is often the case that new information on an object is added while old information remains. Thus, the information collected for an object may be for different points in time, which may lead to the mistaken assumption that the various observations are for different objects or to the presentation of obsolete information to the user.

On the other hand, we sometimes want to collect not only current information but also past information on a target object. For example, we may want to know past attributes or activities of a person or company to obtain a comprehensive picture of the target object. This suggests an important future direction of object search: object-history search [14]. This kind of information searching will become more frequent as more and more pages are added to the Web and are accumulated in Web archives like the Wayback Machine (http://www.archive.org), so techniques for accurately disambiguating time-varying objects will become increasingly important.

We have developed a method that can accurately identify time-varying objects on the Web. This object identification method is based on two estimated probabilities: the probability that observed data are for the same object that has changed over time and the probability that the observed data are for different objects.

2. RELATED WORK

Research on domain-specific Web search engines has been conducted since the early years of the Web era [18, 29]. Initially, the main technical interest was the efficient collection of Web pages relevant to the target domain by using focused crawling [5, 6, 18] or by meta-searching general-purpose search engines [7, 10, 23, 27, 29]. Various target domains, such as bibliographic information [18, 27], personal home pages [10, 29], and cooking recipes [23], were used in this research. While some domain-specific search engines extract relevant information from pages and organize it in the search results, object search engines do this in a more systematic way by using an object schema that models the attributes of objects in the domain [16, 20, 21, 22]. As with domain-specific search engines, collecting pages in target domains is still an important function of object search engines. Research on object-level search has also focused on object information extraction [20], object identification [21], and object ranking [22]. Bibliographic information [16, 20, 21, 22] and product information [20] have also been used as target domains.

Object identification involves determining the correspondence between data observed in documents or databases and real-world objects, and it is an important process in information integration. It has long been an area of interest in both database research and linguistic research as “record linkage” or “reference resolution.” Object identification has been accomplished by clustering observed data on the basis of a measure of similarity between data, which is defined heuristically or obtained from training examples using machine learning [4, 8, 12, 24, 25, 28, 30]. Other object identification techniques, including the traditional approaches, are summarized by Bilenko et al. [3].

A typical example of object identification is personal name disambiguation. Research on personal name disambiguation has been conducted for databases and text collections. There have been many studies on record linkage for identifying the same person in databases such as those containing census data [32]. However, previous studies did not explicitly deal with the possibility of changes in attribute values over time. Personal name disambiguation in Web search results has recently come to be considered an important problem [1, 2, 11, 14, 17, 31]. For example, if we enter the search expression “Katsumi Tanaka” into the search box of Google (http://www.google.co.jp), we get search results for a professor, a poet, and a pianist. The possibility of encountering people with the same name in Web search results is higher than with closed databases and text collections, so finding information relevant to a specific person in the results can be a tedious task. There have been several approaches to improving the accuracy of name disambiguation, such as using extracted profiles of persons [11, 14, 17, 31], using hyperlinks between pages [1], and using social networks among persons [2]. However, in previous work, persons were treated as static objects, and the possibility of changes in attribute values was not considered when performing object identification.

Our previous system [14] extracts from Web pages information for a person in relationship to time and organizes it along a time line as a personal history. This system also involves object identification. However, time and personal information are extracted from pages after page-level object identification is performed, so the object identification process itself does not utilize changes in objects over time.

Methods for modeling attributes that change over time have been investigated in research on temporal databases. One example is the MADS model [26]. In our research, we also specify time-varying attributes in our object schema. The difference between our modeling technique and that for existing temporal databases is that ours uses probabilistic models to specify patterns of value changes in the schema.

3. PAIRWISE IDENTIFICATION OF TIME-VARYING OBJECTS

In this section, we describe our method for determining whether two observations are for the same time-varying object or for different objects. Let (x_i, t_i) denote an observation, where

$$x_i = (x_i^{(1)}, x_i^{(2)}, \ldots, x_i^{(N)})$$

is a set of observed attribute values and N is the number of distinct attributes; t_i denotes the time when x_i was observed. The object that generated observation (x_i, t_i) is denoted by o_i. That is, if two observations, (x_i, t_i) and (x_j, t_j), were generated for the same object, o_i = o_j, and if they were generated for two different objects, o_i ≠ o_j. We assume the value of o_i is unknown.

Given observations (x_i, t_i) and (x_j, t_j), we want to compare the posterior probability that the data were generated for the same object,

$$p(o_i = o_j \mid (x_i, t_i), (x_j, t_j)),$$

and the posterior probability that the data were generated for different objects,

$$p(o_i \neq o_j \mid (x_i, t_i), (x_j, t_j)).$$

To compare the two probabilities, it is convenient to use the logarithm of the odds ratio:

$$\log \frac{p(o_i = o_j \mid (x_i, t_i), (x_j, t_j))}{p(o_i \neq o_j \mid (x_i, t_i), (x_j, t_j))}. \tag{1}$$

It takes a large positive value if there is a high probability that the two observations were generated for the same object, and it takes a large negative value if there is a high probability that they were generated for different objects. We use this odds ratio as a measure of the similarity between observations i and j and perform object identification by clustering observations in accordance with their similarities.

Using Bayes' theorem, we can write the posterior probability of observations (x_i, t_i) and (x_j, t_j) being generated for the same object,

$$p(o_i = o_j \mid (x_i, t_i), (x_j, t_j)) = \frac{p((x_i, t_i), (x_j, t_j) \mid o_i = o_j)\, p(o_i = o_j)}{p((x_i, t_i), (x_j, t_j))},$$

and the posterior probability of the two observations being generated for different objects,

$$p(o_i \neq o_j \mid (x_i, t_i), (x_j, t_j)) = \frac{p((x_i, t_i), (x_j, t_j) \mid o_i \neq o_j)\, p(o_i \neq o_j)}{p((x_i, t_i), (x_j, t_j))}.$$

Substituting these into (1) gives us

$$\log \frac{p((x_i, t_i), (x_j, t_j) \mid o_i = o_j)}{p((x_i, t_i), (x_j, t_j) \mid o_i \neq o_j)} + \log \frac{p(o_i = o_j)}{p(o_i \neq o_j)}.$$

The first term is equivalent to the formalization of record linkage by Fellegi and Sunter [9]. The second term is the logarithm of the odds ratio between the prior probabilities of observations i and j being generated for the same object and being generated for different objects. It takes the same value for any (i, j) and does not depend on the observation. Thus, we omit this term in computing similarities between observations.

We assume that attributes are conditionally independent given the object identity of the two observations. That is,

$$p((x_i, t_i), (x_j, t_j) \mid o_i = o_j) = p((x_i^{(1)}, t_i), (x_j^{(1)}, t_j) \mid o_i = o_j)\, p((x_i^{(2)}, t_i), (x_j^{(2)}, t_j) \mid o_i = o_j) \cdots p((x_i^{(N)}, t_i), (x_j^{(N)}, t_j) \mid o_i = o_j)$$

and

$$p((x_i, t_i), (x_j, t_j) \mid o_i \neq o_j) = p((x_i^{(1)}, t_i), (x_j^{(1)}, t_j) \mid o_i \neq o_j)\, p((x_i^{(2)}, t_i), (x_j^{(2)}, t_j) \mid o_i \neq o_j) \cdots p((x_i^{(N)}, t_i), (x_j^{(N)}, t_j) \mid o_i \neq o_j)$$

hold. The similarity between observations can then be calculated using

$$\mathrm{similarity}((x_i, t_i), (x_j, t_j)) = \sum_{n=1}^{N} \log \frac{p((x_i^{(n)}, t_i), (x_j^{(n)}, t_j) \mid o_i = o_j)}{p((x_i^{(n)}, t_i), (x_j^{(n)}, t_j) \mid o_i \neq o_j)}. \tag{2}$$

In this way, we can calculate the attribute-level similarity,

$$\mathrm{similarity}((x_i^{(n)}, t_i), (x_j^{(n)}, t_j)) = \log \frac{p((x_i^{(n)}, t_i), (x_j^{(n)}, t_j) \mid o_i = o_j)}{p((x_i^{(n)}, t_i), (x_j^{(n)}, t_j) \mid o_i \neq o_j)}, \tag{3}$$

for each attribute separately and then sum the attribute-level similarities to calculate the overall observation-level similarity.

We now describe the calculation of (3) for each attribute. To simplify the notation, we omit the superscript (n), which denotes the attribute, unless it is necessary.

First we consider the denominator. It represents the conditional probability that, given i and j being different objects, the attribute of one object takes value x_i at time t_i and the attribute of the other object takes value x_j at time t_j. If we assume that the distribution of the values of the attribute in the population (the set of all objects) is stationary, p((x, t)) = p(x), and that the observed attributes are conditionally independent given that the objects are different, the denominator of (3) can be approximated by p(x_i)p(x_j). This is the probability that, when we randomly sample two objects from the population, the values of the attribute are x_i and x_j.

Next we consider the numerator. Without loss of generality, we assume t_i < t_j (observation i precedes observation j). Since observations i and j belong to the same object, we can interpret this to mean that the value of the object's attribute has changed from x_i to x_j during the time interval from t_i to t_j. As before, we assume that the distribution of the values of the attribute in the population is stationary. This means that the probability of observing attribute value x_i at time t_i is independent of t_i, so it can be simply represented by p(x_i). We represent the probability of the attribute value changing from x_i to x_j during the interval [t_i, t_j] by q((x_j, t_j) | (x_i, t_i)). Then the numerator of (3) can be written as p(x_i) q((x_j, t_j) | (x_i, t_i)).

Finally, we have a simplified form of the attribute-level similarity:

$$\mathrm{similarity}((x_i, t_i), (x_j, t_j)) = \log \frac{p((x_i, t_i), (x_j, t_j) \mid o_i = o_j)}{p((x_i, t_i), (x_j, t_j) \mid o_i \neq o_j)} = \log \frac{p(x_i)\, q((x_j, t_j) \mid (x_i, t_i))}{p(x_i)\, p(x_j)} = \log \frac{q((x_j, t_j) \mid (x_i, t_i))}{p(x_j)}. \tag{4}$$
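To make the attribute-wise framework concrete, the following Python sketch computes the observation-level similarity of (2) by summing the attribute-level log-odds of (4). It is only an illustration of the formulas above: the function names, the dictionary-based observation format, and the convention that p and q are supplied as callables per attribute are assumptions made for this sketch, not part of the original system.

```python
import math
from typing import Callable, Dict, Tuple

# An observation is an (attribute-value dict, time) pair; missing attributes are simply absent.
Observation = Tuple[Dict[str, object], float]

def attribute_similarity(p: Callable, q: Callable, xi, ti, xj, tj) -> float:
    """Attribute-level log-odds of (4): log q((xj, tj) | (xi, ti)) - log p(xj)."""
    if tj < ti:                       # ensure ti <= tj, as assumed in the derivation
        xi, ti, xj, tj = xj, tj, xi, ti
    qv = q(xi, ti, xj, tj)
    if qv <= 0.0:
        return float("-inf")          # the observations can never belong to the same object
    return math.log(qv) - math.log(p(xj))

def observation_similarity(models: Dict[str, Tuple[Callable, Callable]],
                           obs_i: Observation, obs_j: Observation) -> float:
    """Observation-level similarity of (2): sum of attribute-level scores.

    `models` maps an attribute name to its (p, q) pair; attributes missing in
    either observation contribute zero (see Section 5.2).
    """
    (xi, ti), (xj, tj) = obs_i, obs_j
    total = 0.0
    for name, (p, q) in models.items():
        if name in xi and name in xj:
            total += attribute_similarity(p, q, xi[name], ti, xj[name], tj)
    return total
```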

4. COMPUTING SIMILARITIES

To calculate the similarity (4), we need to know the specific forms of p(x) and q((x_j, t_j) | (x_i, t_i)), which differ by attribute type. The types of the attributes as well as the probabilistic models and their parameters (we call this specification the object schema) have to be designed using domain knowledge. The parameters of the models can be estimated from sample data by using statistical techniques. If the models precisely describe the characteristics of the actual data, object identification using these models should be highly accurate. However, designing complicated models with many parameters is difficult and requires much domain expertise, and estimating models with many parameters requires many training examples. In this section we therefore introduce simple models; within the basic framework described in the previous section, more complicated models can be used if they are necessary and reasonable. In the remainder of this section, we define p and q for categorical and numerical attributes.

4.1 Categorical Attributes

First we discuss how we deal with time-varying categorical attributes. A categorical attribute is an attribute whose value corresponds to a distinct category. For example, the team to which a baseball player belongs is a categorical attribute. In general, we have to define the probability of transitioning between each pair of attribute values. Here we discuss the simple case in which these transition probabilities are homogeneous. Assume that there are L teams and that each team has the same number of players. Also assume that players change teams each year at rate r. Then we have the following distribution:

$$q((x_j, t_j) \mid (x_i, t_i)) = \begin{cases} (1 - r)^{t_j - t_i} + \frac{1}{L}\left(1 - (1 - r)^{t_j - t_i}\right) & \text{if } x_i = x_j, \\ \frac{1}{L}\left(1 - (1 - r)^{t_j - t_i}\right) & \text{otherwise.} \end{cases}$$

Since p(x) = 1/L in both cases,

$$\mathrm{similarity}((x_i, t_i), (x_j, t_j)) = \log \frac{q((x_j, t_j) \mid (x_i, t_i))}{p(x)} \to 0$$

holds when t_j − t_i → ∞. This means that after a long time has passed, the effect of the attribute values in clustering is lost.

Some attribute values, like the blood type of a person, never change over time. We can deal with these cases by setting r to zero. Then we have

$$q((x_j, t_j) \mid (x_i, t_i)) = \begin{cases} 1 & x_j = x_i, \\ 0 & x_j \neq x_i. \end{cases} \tag{5}$$

If x_i = x_j, the value of (4) is log(1/p(x_j)) = −log p(x_j). This implies that the rarer the common value of the two observations, the more likely these observations are for the same object. For example, among Japanese people, persons with blood type A account for 40% of the population while persons with blood type AB account for 10%. If the observed attribute value is A, the similarity value is −log(0.4) ≈ 0.9. On the other hand, if the observed attribute value is AB, the similarity value is −log(0.1) ≈ 2.3, which is higher. If x_i ≠ x_j, the value of (4) is −∞. These two observations are never merged during clustering and are always regarded as belonging to different objects. For example, a person with blood type A and a person with blood type AB cannot be the same person.
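As an illustration of this model, the sketch below evaluates the transition probability q and the log-odds similarity for a homogeneous categorical attribute with L equally likely values and per-year change rate r (the team example above). The function names and the example values are assumptions made for this sketch; note how the score decays toward zero as the time gap grows and becomes negative infinity for mismatched constant attributes (r = 0).

```python
import math

def categorical_q(xi, xj, dt: float, L: int, r: float) -> float:
    """Transition probability for a homogeneous categorical attribute.

    With L equally likely values and per-year change rate r, the value stays
    the same with probability (1 - r)**dt plus a uniform share of the changes.
    """
    stay = (1.0 - r) ** dt
    if xi == xj:
        return stay + (1.0 - stay) / L
    return (1.0 - stay) / L

def categorical_similarity(xi, xj, dt: float, L: int, r: float) -> float:
    """Log-odds of (4) with p(x) = 1/L."""
    qv = categorical_q(xi, xj, dt, L, r)
    if qv == 0.0:                     # r = 0 and xi != xj: never the same object
        return float("-inf")
    return math.log(qv) - math.log(1.0 / L)

# Example: two "Team" observations five years apart (L = 30 teams, r = 0.1).
print(categorical_similarity("Red Sox", "Red Sox", dt=5, L=30, r=0.1))   # positive
print(categorical_similarity("Red Sox", "Yankees", dt=5, L=30, r=0.1))   # negative
```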

4.2 Numerical Attributes

Now we discuss how we deal with time-varying numerical attributes. For example, a person's salary generally increases over time, and there is some variance in the amount of the increase between individuals. That is, some increases are larger than the average, and some are smaller than the average. There may even be some decreases. We model the distribution of x_j after time t_j − t_i has passed as a normal distribution with a mean of μ_{x_i, t_j − t_i} and a variance of ρ²_{t_j − t_i}. We assume the average increase is linear over time; that is, μ_{x_i, t_j − t_i} = x_i + α(t_j − t_i), where α is the increase rate per unit time. When t_j − t_i = 0, the value of x_j is equal to x_i, so its variance equals zero. When t_j − t_i → ∞, we assume that the variance of x_j converges to a certain constant, ρ². To ensure that, we set

$$\rho^2_{t_j - t_i} = \frac{\beta (t_j - t_i)}{1 + \beta (t_j - t_i)}\, \rho^2,$$

where β > 0 is the rate of change per unit time. This gives us the following Gaussian distribution:

$$q((x_j, t_j) \mid (x_i, t_i)) = N\!\left(x_i + \alpha(t_j - t_i),\ \frac{\beta (t_j - t_i)}{1 + \beta (t_j - t_i)}\, \rho^2\right). \tag{6}$$

Attribute values increasing on average may be seen as contradicting the previous assumption that the distribution of the values of the attributes in the population is stationary. However, in a population in which objects can appear and disappear, the average increase in the value of an attribute for each object does not necessarily correspond to an average increase in the value of the attribute over the population. For example, the age of a person monotonically increases over time, but the age distribution in the population can be stationary.

In the case of x_j ≠ x_i, the value of (6) is zero at t_j − t_i = 0. This means that an attribute of an object never takes a different value at any one time. However, in reality, errors in measurement may cause slightly different observed values for the same attribute at the same time. This causes observations for the same object to be incorrectly regarded as being for different objects. To handle errors in numeric attribute values, we assume the distribution of the error δ obeys a Gaussian error function:

$$p_{\mathrm{error}}(\delta) = N(0, \sigma^2_{\mathrm{error}}), \tag{7}$$

where σ_error is the standard deviation of the errors. The observed value at time t_j is the sum of the true value and the error in measurement. The true value follows the normal distribution (6) with a mean of μ_{x_i, t_j − t_i} and a variance of ρ²_{t_j − t_i}, while the error follows a normal distribution with a mean of zero and a variance of σ²_error. Since the sum of two independent normally distributed variables follows a normal distribution whose mean is the sum of the means of the two variables and whose variance is the sum of their variances, the distribution of the observed value follows

$$q((x_j, t_j) \mid (x_i, t_i)) = N\!\left(x_i + \alpha(t_j - t_i),\ \frac{\beta (t_j - t_i)}{1 + \beta (t_j - t_i)}\, \rho^2 + \sigma^2_{\mathrm{error}}\right). \tag{8}$$

This coincides with (7) when t_j − t_i = 0.

Let us consider a simple case in which the attribute can be considered to be constant (for example, the height of an adult). We can then set α = β = 0. If we assume that the distribution p(x) of attribute values in the population is a normal distribution,

$$p(x) = N(\mu_x, \sigma_x^2), \tag{9}$$

where μ_x and σ_x are the mean and standard deviation of the distribution, respectively, (4) becomes

$$\log \frac{\sigma_x}{\sigma_{\mathrm{error}}} - \frac{(x_j - x_i)^2}{2 \sigma^2_{\mathrm{error}}} + \frac{(x_j - \mu_x)^2}{2 \sigma_x^2}.$$

As in the case of categorical attributes, this means that if the difference between the observed values is large, the possibility of them being for the same object is small, and if the observed values are similar and significantly different from the mean, the possibility of them being for the same object is high.

Some attribute values, such as the size of a company, typically have skewed distributions that differ greatly from a normal distribution. They typically undergo not additive but multiplicative updates, like x_j ≈ x_i η^{t_j − t_i}. By taking a logarithm, we can transform this multiplicative updating into a linear form:

$$\log x_j \approx \log x_i + (t_j - t_i) \log \eta.$$

Various measures, including company size, are empirically known to follow log-normal distributions [15]. To deal with such attributes, we can take the logarithms of the original attribute values and use them in (8) and (9).

5. CLUSTERING FOR IDENTIFYING TIME-VARYING OBJECTS

We perform object identification for a set of observations by clustering them. Here we take an agglomerative clustering approach [13]. We extend the similarity measure between two observations defined by (2) to a similarity measure between two clusters (sets of observations). The similarity is defined by the ratio of the likelihood of considering the observations in the different clusters to be for the same object (after merging the two clusters) and the likelihood of considering them to be for two different objects (before merging the two clusters).

Consider clusters C' = {(x'_1, t'_1), (x'_2, t'_2), ..., (x'_K, t'_K)} and C'' = {(x''_1, t''_1), (x''_2, t''_2), ..., (x''_L, t''_L)}, where t'_1 ≤ t'_2 ≤ ··· ≤ t'_K and t''_1 ≤ t''_2 ≤ ··· ≤ t''_L. Then consider merging the two clusters to form a new cluster:

$$C = C' \cup C'' = \{(x_1, t_1), (x_2, t_2), \ldots, (x_{K+L}, t_{K+L})\},$$

where t_1 ≤ t_2 ≤ ··· ≤ t_{K+L}. The ratio of likelihoods after and before merging the two clusters is computed using

$$\mathrm{similarity}(C', C'') = \left\{ \log p(x_1) + \sum_{m=2}^{K+L} \log q((x_m, t_m) \mid (x_{m-1}, t_{m-1})) \right\} - \left\{ \log p(x'_1) + \sum_{k=2}^{K} \log q((x'_k, t'_k) \mid (x'_{k-1}, t'_{k-1})) \right\} - \left\{ \log p(x''_1) + \sum_{l=2}^{L} \log q((x''_l, t''_l) \mid (x''_{l-1}, t''_{l-1})) \right\}.$$

It can be shown that the above formula is equivalent to

$$\mathrm{similarity}(C', C'') = \sum_{m=2}^{K+L} \mathrm{similarity}((x_{m-1}, t_{m-1}), (x_m, t_m)) - \sum_{k=2}^{K} \mathrm{similarity}((x'_{k-1}, t'_{k-1}), (x'_k, t'_k)) - \sum_{l=2}^{L} \mathrm{similarity}((x''_{l-1}, t''_{l-1}), (x''_l, t''_l)).$$

We continue merging the most-similar cluster pair until there are no pairs that have a similarity higher than a threshold. The clustering algorithm is summarized in Figure 1. We next discuss several problems that need to be addressed when dealing with actual data.

Figure 1: Clustering method for identifying time-varying objects
  Input: Set of observations D = {(x_i, t_i) | i = 1, ..., I}; threshold θ for stopping the clustering
  Output: Partition of observations {C_k | ∪_k C_k = D, C_k ∩ C_l = ∅ (k ≠ l)}
  1. Assign observations to initial clusters.
  2. Find the pair of clusters (C', C'') = argmax_{C_k, C_l} similarity(C_k, C_l).
  3. If similarity(C', C'') < θ, go to 4; otherwise merge the clusters, C' = C' ∪ C'', C'' = ∅, and go back to 2.
  4. Output the results: {C_k | C_k ≠ ∅}.
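The clustering step can be sketched directly from Figure 1 and the cluster-level similarity above: score a cluster by chaining pairwise similarities over its time-sorted observations, score a merge by the difference of chained scores, and greedily merge the best pair until the best score falls below the threshold θ. This is a minimal illustration, assuming a pairwise similarity function such as the earlier sketches; the naive quadratic pair search and the data structures are simplifications for exposition, not the authors' implementation.

```python
from itertools import combinations

def chain_score(cluster, pair_similarity):
    """Sum of similarities over consecutive, time-sorted observations in a cluster."""
    obs = sorted(cluster, key=lambda o: o[1])          # observations are (attribute dict, time)
    return sum(pair_similarity(obs[m - 1], obs[m]) for m in range(1, len(obs)))

def cluster_similarity(c1, c2, pair_similarity):
    """Likelihood-ratio score for merging c1 and c2 (Section 5)."""
    return (chain_score(c1 + c2, pair_similarity)
            - chain_score(c1, pair_similarity)
            - chain_score(c2, pair_similarity))

def identify_objects(observations, pair_similarity, theta):
    """Agglomerative clustering of Figure 1: merge the best pair until its score < theta."""
    clusters = [[obs] for obs in observations]         # 1. singleton initial clusters
    while len(clusters) > 1:
        # 2. find the most similar pair of clusters (naive quadratic search)
        (a, b), best = max(
            ((pair, cluster_similarity(clusters[pair[0]], clusters[pair[1]], pair_similarity))
             for pair in combinations(range(len(clusters)), 2)),
            key=lambda t: t[1])
        if best < theta:                               # 3. stop when below the threshold
            break
        clusters[a] = clusters[a] + clusters[b]        # merge cluster b into cluster a
        del clusters[b]
    return clusters                                    # 4. the non-empty clusters

# Usage with the earlier sketch (hypothetical `models` schema):
#   import functools
#   identify_objects(data, functools.partial(observation_similarity, models), theta=0.0)
```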

5.1 Handling Outliers

On the Web, an attribute may have an incorrect value (an outlier) due to a mistake on the original Web page or an error in the information extraction process. The similarity between constant categorical attributes is −∞ in such cases. If an outlier deviates greatly from the true value, the similarity between numerical attributes is a very large negative number. The hard constraint that an observation having an outlier in one of its attributes is never merged into the correct cluster may degrade object identification. To avoid this problem, we take into account the possibility of outliers in the observed attribute values.

We denote the rate of outliers in an attribute value by ε. If we ignore the case where outliers occur in the attribute of both observations and the incorrect values coincide, the probability of an outlier in either of the observed values is approximated by 2ε. If we also assume that the values of outliers follow the distribution of attribute values in the population, the probability of an outlier with value x_j is given by 2ε p(x_j). In the case of no outliers, the value of x_j is determined by the distribution q((x_j, t_j) | (x_i, t_i)). Then, the distribution of x_j that takes into account the possibility of outliers is

$$q'((x_j, t_j) \mid (x_i, t_i)) = (1 - 2\varepsilon)\, q((x_j, t_j) \mid (x_i, t_i)) + 2\varepsilon\, p(x_j). \tag{10}$$

This gives similarity((x_i, t_i), (x_j, t_j)) = log 2ε even for q((x_j, t_j) | (x_i, t_i)) with a zero or very small value, and it eliminates the problem of −∞ similarity.
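A minimal sketch of the mixture in (10), assuming the clean transition probability and the population probability of the observed value have already been computed (the function name and default rate are ours):

```python
def robust_q(q_value: float, p_xj: float, eps: float = 0.01) -> float:
    """Outlier-tolerant transition probability of (10).

    q_value is q((xj, tj) | (xi, ti)) under the clean model, p_xj is the population
    probability of the observed value, and eps is the assumed outlier rate.
    """
    return (1.0 - 2.0 * eps) * q_value + 2.0 * eps * p_xj
```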

5.2 Handling Missing Attributes

Even when the values of some attributes cannot be found on a Web page, we still need to compute the similarity (2). If the value of an attribute is missing for either of the two observations, we set the similarity (4) for that attribute to zero; that is, the attribute does not contribute to the overall similarity between the observations.

5.3 Handling Various Presentation Styles for Categorical Attributes

The values of some attributes, such as addresses, are presented in various styles. To avoid incorrectly treating the same attribute value as different values, we use the similarity of the strings to determine whether two attribute values are equal. That is, given strings A and B, let |A|, |B|, and |A ∩ B| denote the length of A, the length of B, and the length of the longest common subsequence of A and B, respectively. Then,

$$\mathrm{similarity}_{\mathrm{string}}(A, B) = \frac{|A \cap B|^2}{|A|\,|B|} \tag{11}$$

is used for computing the similarity between strings. A pair of attribute values whose similarity is larger than a threshold is considered to be the same attribute value.
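The string measure in (11) can be computed with a standard dynamic-programming longest-common-subsequence length. The following sketch is illustrative; the function names are ours, and the 0.5 default threshold mirrors the setting reported in Section 6.3.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (dynamic programming)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if ca == cb else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def string_similarity(a: str, b: str) -> float:
    """Similarity of (11): |A ∩ B|^2 / (|A| |B|)."""
    if not a or not b:
        return 0.0
    return lcs_length(a, b) ** 2 / (len(a) * len(b))

def same_value(a: str, b: str, threshold: float = 0.5) -> bool:
    """Treat two surface forms as the same categorical value when the similarity exceeds the threshold."""
    return string_similarity(a, b) > threshold
```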

6. EXPERIMENTS

To evaluate the effectiveness of the proposed method, we conducted experiments that compared the results of object identification with our method and a method that does not take into account the change in attribute values over time.

6.1 Target Classes

We used professional sports players (including baseball, football, and hockey players) and companies as target classes and defined a schema for each class (Tables 1 and 2). The unit of time was one year. Some attributes are not relevant to all objects. For example, the attribute “Bats” is applicable only to baseball players. For the other objects, the value of the attribute was treated as a missing value, as discussed in Section 5.2.

Object identification is most critical and difficult for objects with ambiguous names. For the professional sports player class, we selected names shared by several different players. For the company class, we selected names for which there are several companies with confusing names (in the sense of the string similarity (11)). These names are shown in Table 3.

6.2 Data Sets

With an object-level search engine, object identification is performed on attribute values extracted from Web pages by information extractors. We want to examine whether clustering errors are caused by errors in information extraction or by the clustering algorithm itself, and to measure the effect of the outlier rate. The distribution of observation times strongly depends on how the data were collected. A search engine generally ranks new pages higher than old ones. On the other hand, we can collect Web pages more uniformly over time if we use a Web archive. We also want to determine how the distribution of observation times affects the accuracy of object identification.

Therefore, we first manually reconstructed the histories of distinct objects as completely as possible and used them as a gold standard. Web pages containing the names in Table 3 were collected using a search engine (Google) and a Web archive (Wayback Machine). We collected English Web pages for the professional sports players and Japanese Web pages for the companies. We manually extracted attribute values and dates from the Web pages and identified the corresponding objects. The number of objects for each name is also shown in Table 3. After the history of each object was completed, we sampled observations from the object histories by the following procedure with different parameter settings and made several test collections.

1. We first sampled over different objects. On the Web, the frequency of observations for different objects varies significantly. We determined the probability of observation for each object on the basis of the frequency of Web observations.

2. We next sampled over time to determine the observation times. We used two different scenarios: a “search engine” scenario and a “Web archive” scenario. In the former, more recent observations are sampled more frequently; the observation times follow an exponential distribution, p(t) = λ exp(−λt), where t denotes how long ago (in years) the observation occurred and λ is 0.4. In the latter, the observation times are sampled uniformly. (A simulation sketch of these two scenarios is given at the end of this subsection.)

3. Then we sampled over different attributes. Different attributes have different observation probabilities. For example, the team to which a sports player belongs is mentioned more frequently than his height or weight. We determined the probability of observation for each attribute on the basis of the frequency of Web observations.

4. To model errors that occur in information extraction, we substituted an outlier value for the actual value of an attribute with a certain probability, ε_data. The default value of ε_data = 0.01 was used in the following experiments.

We generated five random test sets for each ambiguous name listed in Table 3. Each test set consists of 100 observations.
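For reference, the two observation-time scenarios and the outlier substitution can be simulated as below. This is a sketch under the stated parameters (λ = 0.4 and ε_data); the function and variable names are ours, not part of the original data-generation code.

```python
import random

def sample_observation_age(scenario: str, horizon_years: float = 10.0, lam: float = 0.4) -> float:
    """Draw 'years ago' for one observation under the two sampling scenarios."""
    if scenario == "search_engine":        # recent pages are sampled more often
        return random.expovariate(lam)
    if scenario == "web_archive":          # an archive covers time roughly uniformly
        return random.uniform(0.0, horizon_years)
    raise ValueError(scenario)

def maybe_corrupt(value, population_sampler, eps_data: float = 0.01):
    """Replace a value with a random outlier with probability eps_data (extraction-error model)."""
    return population_sampler() if random.random() < eps_data else value
```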

6.3 Parameter Settings

We set the estimate of the outlier rate ε used in computing similarity (10) to 0.01. The threshold for the string similarity (11) of categorical attribute values was set to 0.5.

Table 1: Schema for professional sports players

Attribute    | Categorical/Numerical   | Time-variation                       | Distribution in Population
Height       | Numerical (σ_error = 3) | Constant                             | p(x) = N(72, 10²)
Weight       | Numerical (σ_error = 3) | Linear (α = 0, β = 2×10⁻³, ρ² = 10)  | p(x) = N(200, 15²)
Birthdate    | Categorical             | Constant                             | p(x) = 10⁻⁵
Throws       | Categorical             | Constant                             | p(right) = 0.8, p(left) = 0.2
Bats         | Categorical             | Constant                             | p(right) = 0.8, p(left) = 0.2
Birthplace   | Categorical             | Constant                             | p(x) = 10⁻⁵
High School  | Categorical             | Constant                             | p(x) = 10⁻⁵
College      | Categorical             | Constant                             | p(x) = 10⁻⁵
Age          | Numerical (σ_error = 1) | Linear (α = 1, β = 0)                | p(x) = N(25, 5²)
Team         | Categorical             | Random (r = 0.1)                     | p(x) = 1/30
Experience   | Numerical (σ_error = 1) | Linear (α = 1, β = 0)                | p(x) = N(3, 4²)
Full Name    | Categorical             | Constant                             | p(x) = 1/5
Debut        | Categorical             | Constant                             | p(x) = 1/5
Position     | Categorical             | Random (r = 0.01)                    | p(x) = 1/10

Table 2: Schema for companies

Attribute              | Categorical/Numerical           | Time-variation                              | Distribution in Population
Company Name           | Categorical                     | Random (r = 10⁻³)                           | p(x) = 10⁻²
Representative         | Categorical                     | Random (r = 10⁻¹)                           | p(x) = 10⁻⁵
Address of head office | Categorical                     | Random (r = 10⁻²)                           | p(x) = 10⁻⁵
Postal code            | Categorical                     | Random (r = 10⁻²)                           | p(x) = 10⁻⁵
Capital                | Numerical (σ_error = log(1.03)) | Multiplicative (α = 0, β = 2×10⁻⁵, ρ² = 10) | p(log(x)) = N(log(10⁹), log(10⁵))
Establishment date     | Categorical                     | Constant                                    | p(x) = 10⁻⁵
Number of employees    | Numerical (σ_error = log(1.03)) | Multiplicative (α = 0, β = 5×10⁻⁵, ρ² = 3)  | p(log(x)) = N(log(10³), log(10²))
Number of offices      | Numerical (σ_error = log(1.03)) | Multiplicative (α = 0, β = 2×10⁻⁵, ρ² = 10) | p(log(x)) = N(log(10³), log(10²))
Phone number           | Categorical                     | Random (r = 0.01)                           | p(x) = 10⁻⁵
Net business profit    | Numerical (σ_error = log(1.03)) | Multiplicative (α = 0, β = 10⁻⁵, ρ² = 10)   | p(log(x)) = N(log(10⁹), log(10⁵))
Net sales              | Numerical (σ_error = log(1.03)) | Multiplicative (α = 0, β = 2×10⁻³, ρ² = 10) | p(log(x)) = N(log(10⁹), log(10⁵))

Table 3: Target classes and object names

Class                      | Object Name      | Number of objects
Professional sports player | Mark Johnson     | 3
                           | Mike Johnson     | 5
                           | Matt Smith       | 5
                           | Steve Smith      | 2
                           | James Williams   | 10
Company                    | Hitachi          | 2
                           | JR               | 3
                           | Mitsui Sumitomo  | 3
                           | NTT              | 3
                           | Tokyo Mitsubishi | 2

As a baseline method that does not consider the time-variation of attribute values, we used a schema in which the parameters r, α, and β are set to zero for all time-varying attributes, and we used the single-linkage clustering method [13].

6.4 Evaluation Metrics

For each target name in Table 3, we performed object identification by clustering the set of observations for the name. We used precision, recall, and the F-measure as metrics. If S is the set of pairs of observations placed in the same cluster (i.e., the clustering algorithm predicts that these pairs belong to the same object) and T is the set of pairs that actually belong to the same object, precision, recall, and the F-measure are defined as

$$\mathrm{Precision} = \frac{|S \cap T|}{|S|}, \quad \mathrm{Recall} = \frac{|S \cap T|}{|T|}, \quad \mathrm{F\text{-}measure} = \frac{2}{\frac{1}{\mathrm{Precision}} + \frac{1}{\mathrm{Recall}}},$$

respectively.
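These pairwise metrics can be computed directly from the predicted clusters and the gold object labels. A minimal sketch, assuming observations are identified by ids, clusters are lists of ids, and a dictionary maps each id to its true object (all names are ours):

```python
from itertools import combinations

def pairwise_prf(clusters, true_object):
    """Pairwise precision, recall, and F-measure over co-clustered observation pairs."""
    predicted = {frozenset(p) for c in clusters for p in combinations(c, 2)}   # set S
    by_object = {}
    for obs_id, obj in true_object.items():
        by_object.setdefault(obj, []).append(obs_id)
    actual = {frozenset(p) for members in by_object.values() for p in combinations(members, 2)}  # set T
    tp = len(predicted & actual)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(actual) if actual else 0.0
    f_measure = 2 / (1 / precision + 1 / recall) if precision and recall else 0.0
    return precision, recall, f_measure
```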


6.5 Results

We obtained results for different numbers of clusters by changing the threshold for stopping the clustering. We measured the precision and recall for each number of clusters and drew a recall-precision curve. The maximum F-measure value for each recall-precision curve is given for the uniform time distribution in Table 4 and for the exponential time distribution in Table 5. The method that takes into account the time-variation of attribute values generally achieved a higher F-measure than the one that does not, even though the difference between the two methods was not significant for some object names. There was no apparent difference between the results for the uniform time distribution and the exponential time distribution. Figure 2 shows the recall-precision curves for the test sets with the uniform time distribution. The proposed method achieved similar or higher precision than the baseline at each recall level.

To measure the sensitivity of the results to the parameter values for the rate of time variation, we changed the values of r and β in Tables 1 and 2 by multiplying them by scale constants. The results are shown in Tables 6 and 7, respectively. In general, the F-measure was not very sensitive to the values of r and β, though some dependency was observed for several object names. There is thus room for improving the accuracy of object identification by tuning these parameters.

Table 4: Maximum F-measure values (uniform time distribution)

Object Name      | Time-variation not considered | Time-variation considered
Mark Johnson     | 0.8529 | 0.9983
Mike Johnson     | 0.5169 | 0.9737
Matt Smith       | 0.8015 | 0.9862
Steve Smith      | 0.7016 | 0.9923
James Williams   | 0.6915 | 0.9539
Average          | 0.7129 | 0.9809
Hitachi          | 0.9389 | 0.9683
JR               | 0.7297 | 0.9174
Mitsui Sumitomo  | 0.8479 | 0.9089
NTT              | 0.8789 | 0.9076
Tokyo Mitsubishi | 0.9203 | 0.9424
Average          | 0.8631 | 0.9289

Table 5: Maximum F-measure values (exponential time distribution)

Object Name      | Time-variation not considered | Time-variation considered
Mark Johnson     | 0.7089 | 0.9831
Mike Johnson     | 0.6343 | 0.9746
Matt Smith       | 0.7020 | 0.9861
Steve Smith      | 0.7260 | 0.9852
James Williams   | 0.6870 | 0.9315
Average          | 0.6916 | 0.9721
Hitachi          | 0.9299 | 0.9438
JR               | 0.7028 | 0.9373
Mitsui Sumitomo  | 0.9116 | 0.9361
NTT              | 0.8202 | 0.8833
Tokyo Mitsubishi | 0.9098 | 0.9183
Average          | 0.8549 | 0.9238

We also examined the effect of the outlier rate in the test data. Tables 8 and 9 show the results for the test sets with ε_data = 0.01 and ε_data = 0.1, respectively. Here we also varied the value of ε (the estimate of the outlier rate) used in computing similarity (10). Obviously, the F-measure was lower for the data sets with many outliers than for the data sets with few outliers. We can also see that taking into account the possibility of outliers when computing similarities was effective for the data sets with many outliers.

7. CONCLUSION AND FUTURE WORK

We have developed a method for computing the similarity between two observations found on the Web at different time points. It estimates the probability that the data are for the same object even though some of the object's attributes might have changed and the probability that the data are for two different objects. To enable computing the similarity for each type of attribute, we developed a probability model that specifies the pattern of change in an attribute value over time. We also developed an algorithm for identifying time-varying objects that is based on agglomerative clustering and uses the similarities of observations at different time points. We conducted experiments in which we compared identification accuracy between our proposed method and a method that regards all attribute values as constant. The results showed that the proposed method improves the precision and recall of object identification.

In this work, we assumed that the probabilistic models and their parameters for attribute values are designed by domain experts. In future work, we plan to develop a method for automatically determining or adjusting the parameters of the models by using data collected from the Web.

Table 6: Maximum F-measure values with different r values (uniform time distribution)

Object Name      | r ×10  | r ×1   | r ×10⁻¹
Mark Johnson     | 0.9963 | 0.9983 | 0.9983
Mike Johnson     | 0.9576 | 0.9737 | 0.9737
Matt Smith       | 0.9882 | 0.9862 | 0.9855
Steve Smith      | 0.9894 | 0.9923 | 0.9923
James Williams   | 0.9476 | 0.9539 | 0.9539
Average          | 0.9758 | 0.9809 | 0.9807
Hitachi          | 0.9042 | 0.9683 | 0.9474
JR               | 0.8715 | 0.9174 | 0.9174
Mitsui Sumitomo  | 0.8413 | 0.9089 | 0.9108
NTT              | 0.8387 | 0.9076 | 0.9076
Tokyo Mitsubishi | 0.8967 | 0.9424 | 0.9424
Average          | 0.8705 | 0.9289 | 0.9251

Table 7: Maximum F-measure values with different β values (uniform time distribution)

Object Name      | β ×10  | β ×1   | β ×10⁻¹
Mark Johnson     | 0.9983 | 0.9983 | 0.9983
Mike Johnson     | 0.9737 | 0.9737 | 0.9744
Matt Smith       | 0.9862 | 0.9862 | 0.9863
Steve Smith      | 0.9888 | 0.9923 | 0.9923
James Williams   | 0.9470 | 0.9539 | 0.9542
Average          | 0.9788 | 0.9809 | 0.9811
Hitachi          | 0.9695 | 0.9683 | 0.9683
JR               | 0.8835 | 0.9174 | 0.9267
Mitsui Sumitomo  | 0.8909 | 0.9089 | 0.9179
NTT              | 0.9062 | 0.9076 | 0.9137
Tokyo Mitsubishi | 0.9310 | 0.9424 | 0.9433
Average          | 0.9162 | 0.9289 | 0.9340

8. ACKNOWLEDGMENTS

This work was supported in part by Grants-in-Aid for Scientific Research (Nos. 18049041 and 19700091) from MEXT of Japan, a MEXT project entitled “Development of Fundamental Software Technologies for Digital Archives,” a Kyoto University GCOE Program entitled “Informatics Education and Research for Knowledge-Circulating Society,” and a Microsoft IJARC CORE4 project entitled “Toward Spatio-Temporal Object Search from the Web.”

9. REFERENCES

[1] R. Al-Kamha and D. W. Embley. Grouping search-engine returned citations for person-name queries. In Proc. WIDM 2004, pages 96–103, 2004.
[2] R. Bekkerman and A. McCallum. Disambiguating Web appearances of people in a social network. In Proc. WWW 2005, pages 463–470, 2005.
[3] M. Bilenko, W. W. Cohen, S. Fienberg, R. J. Mooney, and P. Ravikumar. Adaptive name-matching in information integration. IEEE Intelligent Systems, 18(5):16–23, 2003.
[4] M. Bilenko and R. J. Mooney. Adaptive duplicate detection using learnable string similarity measures. In Proc. KDD 2003, pages 39–48, 2003.
[5] S. Chakrabarti, M. van den Berg, and B. Dom. Focused crawling: A new approach to topic-specific Web resource discovery. Computer Networks, 31(11-16):1679–1693, 1999.
[6] M. Chau and H. Chen. Comparison of three vertical search spiders. IEEE Computer, 36(5):56–62, 2003.

[Figure 2: Recall-precision curves (uniform time distribution). Panels: (a) Mark Johnson, (b) Mike Johnson, (c) Matt Smith, (d) Steve Smith, (e) James Williams, (f) Hitachi, (g) JR, (h) Mitsui Sumitomo, (i) NTT, (j) Tokyo Mitsubishi. Each panel plots precision against recall for the proposed method (time-variation considered) and the baseline (time-variation not considered).]

Table 8: Maximum F-measure values for the test set with ε_data = 0.01 (time-variation considered, uniform time distribution)

Object Name      | ε = 0.1 | ε = 0.01 | ε = 0.001 | ε = 0
Mark Johnson     | 0.9988  | 0.9983   | 0.9963    | 0.9815
Mike Johnson     | 0.9737  | 0.9737   | 0.9685    | 0.9672
Matt Smith       | 0.9672  | 0.9862   | 0.9848    | 0.9781
Steve Smith      | 0.9949  | 0.9923   | 0.9893    | 0.9807
James Williams   | 0.9504  | 0.9539   | 0.9515    | 0.9408
Average          | 0.9770  | 0.9809   | 0.9781    | 0.9697
Hitachi          | 0.9683  | 0.9683   | 0.9681    | 0.9681
JR               | 0.9174  | 0.9174   | 0.9174    | 0.9072
Mitsui Sumitomo  | 0.9089  | 0.9089   | 0.9089    | 0.9073
NTT              | 0.9076  | 0.9076   | 0.9076    | 0.9076
Tokyo Mitsubishi | 0.9424  | 0.9424   | 0.9424    | 0.9319
Average          | 0.9289  | 0.9289   | 0.9289    | 0.9244

Table 9: Maximum F-measure values for the test set with ε_data = 0.1 (time-variation considered, uniform time distribution)

Object Name      | ε = 0.1 | ε = 0.01 | ε = 0.001 | ε = 0
Mark Johnson     | 0.9735  | 0.9587   | 0.9553    | 0.8885
Mike Johnson     | 0.8911  | 0.8752   | 0.8565    | 0.8146
Matt Smith       | 0.9359  | 0.9117   | 0.9021    | 0.8309
Steve Smith      | 0.9110  | 0.9160   | 0.9163    | 0.8790
James Williams   | 0.7455  | 0.7369   | 0.7098    | 0.6398
Average          | 0.8914  | 0.8797   | 0.8680    | 0.8106
Hitachi          | 0.9184  | 0.9245   | 0.9218    | 0.9143
JR               | 0.8552  | 0.8214   | 0.8096    | 0.7132
Mitsui Sumitomo  | 0.8152  | 0.8229   | 0.8416    | 0.7754
NTT              | 0.7516  | 0.7543   | 0.7534    | 0.7402
Tokyo Mitsubishi | 0.8716  | 0.8729   | 0.8784    | 0.8076
Average          | 0.8424  | 0.8392   | 0.8410    | 0.7901

[7] M. Chau, H. Chen, J. Qin, Y. Zhou, Y. Qin, W.-K. Sung, and D. McDonald. Comparison of two approaches to building a vertical search tool: A case study in the nanotechnology domain. In Proc. JCDL 2002, pages 135–144, 2002.
[8] W. W. Cohen and J. Richman. Learning to match and cluster large high-dimensional data sets for data integration. In Proc. KDD 2002, pages 475–480, 2002.
[9] I. P. Fellegi and A. B. Sunter. A theory for record linkage. Journal of the American Statistical Association, 64(328):1183–1210, 1969.
[10] E. Glover, G. Flake, S. Lawrence, W. P. Birmingham, A. Kruger, C. L. Giles, and D. Pennock. Improving category specific Web search by learning query modifications. In Proc. SAINT 2001, pages 23–31, 2001.
[11] R. Guha and A. Garg. Disambiguating people in search. Stanford University, 2004.
[12] H. Han, L. Giles, H. Zha, C. Li, and K. Tsioutsiouliklis. Two supervised learning approaches for name disambiguation in author citations. In Proc. JCDL 2004, pages 296–305, 2004.
[13] A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice-Hall, 1988.
[14] R. Kimura, S. Oyama, H. Toda, and K. Tanaka. Creating personal histories from the Web using namesake disambiguation and event extraction. In Proc. ICWE 2007, pages 400–414, 2007.
[15] E. Limpert, W. A. Stahel, and M. Abbt. Log-normal distributions across the sciences: Keys and clues. BioScience, 51(5):341–352, 2001.
[16] L. Lin, G. Li, and L. Zhou. Meta-search based Web resource discovery for object-level vertical search. In Proc. WISE 2006, pages 16–27, 2006.
[17] G. S. Mann and D. Yarowsky. Unsupervised personal name disambiguation. In Proc. CoNLL 2003, pages 33–40, 2003.
[18] A. McCallum, K. Nigam, J. Rennie, and K. Seymore. A machine learning approach to building domain-specific search engines. In Proc. IJCAI 1999, pages 662–667, 1999.
[19] S. Nakamura, S. Konishi, A. Jatowt, H. Ohshima, H. Kondo, T. Tezuka, S. Oyama, and K. Tanaka. Trustworthiness analysis of Web search results. In Proc. ECDL 2007, pages 38–49, 2007.
[20] Z. Nie, Y. Ma, S. Shi, J.-R. Wen, and W.-Y. Ma. Web object retrieval. In Proc. WWW 2007, pages 81–90, 2007.
[21] Z. Nie, J.-R. Wen, and W.-Y. Ma. Object-level vertical search. In Proc. CIDR 2007, pages 235–246, 2007.
[22] Z. Nie, Y. Zhang, J.-R. Wen, and W.-Y. Ma. Object-level ranking: Bringing order to Web objects. In Proc. WWW 2005, pages 567–574, 2005.
[23] S. Oyama, T. Kokubo, and T. Ishida. Domain-specific Web search with keyword spices. IEEE Transactions on Knowledge and Data Engineering, 16(1):17–27, 2004.
[24] S. Oyama and C. D. Manning. Using feature conjunctions across examples for learning pairwise classifiers. In Proc. ECML 2004, pages 322–333, 2004.
[25] S. Oyama and K. Tanaka. Learning a distance metric for object identification without human supervision. In Proc. PKDD 2006, pages 609–616, 2006.
[26] C. Parent, S. Spaccapietra, and E. Zimányi. Conceptual Modeling for Traditional and Spatio-Temporal Applications: The MADS Approach. Springer, 2006.
[27] J. Qin, Y. Zhou, and M. Chau. Building domain-specific Web collections for scientific digital libraries: a meta-search enhanced focused crawling method. In Proc. JCDL 2004, pages 135–141, 2004.
[28] S. Sarawagi and A. Bhamidipaty. Interactive deduplication using active learning. In Proc. KDD 2002, pages 269–278, 2002.
[29] J. Shakes, M. Langheinrich, and O. Etzioni. Dynamic reference sifting: a case study in the homepage domain. In Proc. WWW 1997, pages 189–200, 1997.
[30] S. Tejada, C. A. Knoblock, and S. Minton. Learning domain-independent string transformation weights for high accuracy object identification. In Proc. KDD 2002, pages 350–359, 2002.
[31] X. Wan, J. Gao, M. Li, and B. Ding. Person resolution in person search results: WebHawk. In Proc. CIKM 2005, pages 163–170, 2005.
[32] W. E. Winkler. Overview of record linkage and current research directions. Research Report Series (Statistics #2006-2), U.S. Census Bureau, 2006.
