Pareto versus lognormal: a maximum entropy test

Marco Bee∗

Massimo Riccaboni†

Stefano Schiavo‡

February 16, 2011

Abstract

It is commonly found that distributions that seem to be lognormal over a broad range change to a power-law (Pareto) distribution in the last few percentiles. The distributions of species abundance, income and wealth, as well as file, city and firm sizes, are examples with this structure. We present a new test for the occurrence of power-law tails in statistical distributions based on maximum entropy. This methodology makes it possible to identify the true data-generating process even when it is neither lognormal nor Pareto. The maximum entropy approach is then compared with alternative methods at different levels of aggregation of economic systems. Our results provide support for the theory that distributions with a lognormal body and a Pareto tail can be generated as mixtures of lognormally distributed units.

Keywords: Pareto distribution, power-law, lognormal distribution, maximum entropy, firm size, international trade

JEL classification: C14, C51, C52



∗ Department of Economics, University of Trento. E-mail: [email protected]
† DISA, University of Trento. E-mail: [email protected]
‡ Department of Economics, University of Trento. E-mail: [email protected]

1 Introduction

Several phenomena in physics, biology, computer science, demography, economics, finance and the social sciences are distributed according to a power-law, or at least display power-law behavior in the tails (Simon, 1955; Montroll and Shlesinger, 1982; Barabási and Albert, 1999; Brown et al., 2002; Reed and Hughes, 2002; Mitzenmacher, 2004; Newman, 2005; Gabaix, 2009). The power-law upper tail of the distribution can be generated by an amplification mechanism (Montroll and Shlesinger, 1982), such as a mixture of lognormals (Allen et al., 2001; Reed and Hughes, 2002; Growiec et al., 2008). In the last decade the debate on the appropriate procedures to detect power-law distributions in empirical data has intensified (Goldstein et al., 2004; Gabaix and Ioannides, 2004; Coronel-Brizio and Hernández-Montoya, 2005; Pisarenko and Sornette, 2006), and a number of approaches have been proposed to establish the length of the power-law tail (Clauset et al., 2009; Malevergne et al., 2009), quickly gaining widespread acceptance and use.

In the literature, the power-law (Pareto) distribution is generally compared with an alternative represented by the lognormal distribution, though other candidate distributions have been proposed (Fisher et al., 1943; Hubbell, 2001; Reed, 2001; Mitzenmacher, 2004; Pisarenko and Sornette, 2006). While in many cases the exact shape of the empirical distribution is not crucial, as long as heavy tails are accounted for, the debate appears to be especially animated in economics (Ijiri and Simon, 1977; Sutton, 1997; Axtell, 2001; Gabaix, 1999; Eeckhout, 2004; Levy, 2009; Eeckhout, 2009) and biology (Allen et al., 2001; Williamson and Gaston, 2005). A further complication comes from the fact that the same system may display different tail behaviors when analyzed at different levels of aggregation, due to composition and sample size effects (Allen et al., 2001; De Fabritiis et al., 2003; Perline, 2005; Hisano and Mizuno, 2010).

In this paper we provide a new methodology, based on maximum entropy estimation (Kapur, 1989; Wu, 2003), to identify the data-generating process and to determine the existence of a power-law tail in the data.1 Two of the main benefits of this approach are its flexibility and the fact that it delivers a well-defined alternative to the power-law or lognormal distribution. As the Maximum Entropy (ME) density encompasses most commonly used distributions, the estimated ME density can easily be compared with a number of well-defined alternatives. Here we apply the ME methodology to evaluate the fit of lognormal versus power-law distributions, to compare different systems, and to analyze the behavior of the same systems at different levels of aggregation.

In what follows we briefly describe the methodological framework and compare it with the most popular tools for estimating upper-tail behavior. We then analyze two real-world distributions in economics: world trade flows and business firm sizes. In the latter case we find support for a theoretical prior suggesting the emergence of a power-law upper tail in the firm size distribution upon aggregation (De Fabritiis et al., 2003; Fu et al., 2005; Growiec et al., 2008).

1 The Matlab code implementing the methodology is available at www.stefanoschiavo.tk.


2 The Maximum Entropy method

The ME method is a technique for obtaining a probability distribution, called the ME distribution, that is consistent with the information in the data and “expresses maximum uncertainty with respect to all other matters” (Jaynes, 1957; Kapur, 1989). The distribution results from maximizing Shannon's information entropy

W = −∫ p(x) log(p(x)) dx

under constraints that impose the equality of the first k theoretical and empirical moments. The constraints are usually the arithmetic and geometric moments; they are also called “characterizing moments” and are given respectively by ∫ x^i p(x) dx and ∫ (log x)^i p(x) dx, i = 0, 1, . . . , k. The characterizing moments are sufficient statistics for exponential families, so that in this case they uniquely determine the entire distribution.

Most commonly used distributions, including the lognormal and the Pareto, are encompassed by the Maximum Entropy (ME) distribution. If the data follow some parametric model that the investigator cannot assume in advance, the estimated ME density will coincide with it, thus suggesting the correct model. If, on the contrary, the data do not follow a given parametric model, the method can be used to measure the distance from it.

Let T_i(x) and T̄_i(x) be respectively the i-th theoretical and empirical characterizing moment. Formally, the task consists in maximizing W under the constraints T_i(x) = T̄_i(x), and can be solved by introducing k + 1 Lagrange multipliers λ_i (i = 0, . . . , k), so that the solution (that is, the ME density) takes the form

f(x) = exp( −∑_{i=0}^{k} λ_i T_i(x) ).

Before considering the possible ways of determining the value of k, we briefly summarize the distributions nested by the ME that are of interest in this paper.

The Pareto distribution is an ME density with k = 1, and the characterizing moment is the first logarithmic moment. The ME(1) density corresponds to the Par(c, α) distribution with density f(x) = α c^α / x^{α+1} for x > c when its parameters are given by (Kapur, 1989)

λ_0 = −log(α c^α),    λ_1 = α + 1.    (1)

If X ∼ Par(c, α), then Y = log(X/c) ∼ Exp(α) with density f(y) = α e^{−αy}. In this case the ME(1) density (with the first arithmetic moment as characterizing moment) is Exp(α), with λ_0 = −log(α) and λ_1 = α. So, if the data are X ∼ Par(c, α), it is also possible to apply the logarithmic transformation and work with the exponential distribution.

The lognormal distribution is an ME density with k = 2. In this case the characterizing moments are the first two logarithmic moments, and the ME(2) density with parameters

λ_0 = µ²/(2σ²) − (1/2) log(1/(2σ²)) + (1/2) log(π),    λ_1 = 1 − µ/σ²,    λ_2 = 1/(2σ²),    (2)

corresponds to the lognormal distribution with parameters µ and σ² (Kapur, 1989). The normal distribution is also an ME density with k = 2, and the characterizing moments are the first two arithmetic moments. The ME(2) density with parameters

λ_0 = log(√(2π) σ),    λ_1 = 0,    λ_2 = 1/(2σ²),

corresponds to the N(µ, σ²) distribution (Kapur, 1989). The truncated normal distribution is an ME distribution as well; the functional forms of the relations among λ_0, λ_1 and λ_2 and the truncated normal parameters can be found in Kapur (1989).
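The paper's own implementation is in Matlab (see the footnote in the Introduction); as a rough illustration only, the following Python sketch fits an ME(k) density of the exponential-polynomial form above to (standardized) data by minimizing the dual of the entropy problem, λ_0(λ) + ∑_i λ_i m_i, with m_i the empirical moments. The truncation of the support to the observed range and the standardization step are simplifying assumptions of this sketch, not part of the paper.

    # Illustrative sketch only (the paper's own implementation is in Matlab).
    # ME(k) density f(y) = exp(-lambda_0 - sum_{i=1..k} lambda_i * y**i), fitted by
    # minimizing the dual objective lambda_0(lambda) + sum_i lambda_i * m_i, where
    # m_i are the empirical moments of y. Simplifying assumptions: the support is
    # truncated to the observed range and y is standardized for numerical stability.
    import numpy as np
    from scipy.integrate import quad
    from scipy.optimize import minimize

    def _exponent(lam, t):
        # sum_{i=1}^k lam[i-1] * t**i
        return sum(l * t ** (i + 1) for i, l in enumerate(lam))

    def fit_me_density(y, k):
        """Return (lambda_0, lambda_1..k, maximized log-likelihood) of the ME(k) fit."""
        y = (np.asarray(y, dtype=float) - np.mean(y)) / np.std(y)    # standardize
        m = np.array([np.mean(y ** i) for i in range(1, k + 1)])     # empirical moments
        lo, hi = y.min(), y.max()                                    # integration support

        def lambda0(lam):
            z, _ = quad(lambda t: np.exp(-_exponent(lam, t)), lo, hi)
            return np.log(z)                  # log of the normalizing constant

        def dual(lam):
            # minimizing this is equivalent to maximizing the likelihood
            return lambda0(lam) + float(np.dot(lam, m))

        res = minimize(dual, x0=np.zeros(k), method="Nelder-Mead",
                       options={"maxiter": 5000, "xatol": 1e-6, "fatol": 1e-8})
        lam = res.x
        loglik = -len(y) * (lambda0(lam) + float(np.dot(lam, m)))
        return lambda0(lam), lam, loglik

Since the ME(k) family on a fixed support is closed under affine transformations of the data, the standardization shifts the maximized log-likelihood by the same constant for every k, so the comparisons across k used in the next subsection should be unaffected.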

2.1 Testing with respect to parametric distributions

The most important practical issue in ME estimation is the choice of k. A larger number of constraints results in a more precise approximation, but also in a model with more parameters, whose estimation introduces further uncertainty. The advantage of a better fit must therefore be balanced against the noise caused by the estimation of additional parameters. Accordingly, there are at least two ways of choosing the optimal value of k (k*, say).

Since the maximized log-likelihood ℓ(k) is equal to −N ∑_{i=0}^{k} λ̂_i T̄_i(x), a log-likelihood ratio (llr) test is easily computed. The test of the hypothesis k = k* against k = k* + 1 is given by

llr = −2 [ℓ(k*) − ℓ(k* + 1)] = 2N ( ∑_{i=0}^{k*} λ̂_i T̄_i(x) − ∑_{i=0}^{k*+1} λ̂_i T̄_i(x) );

from standard limiting theory, its asymptotic distribution is χ²_1. The procedure would thus be based on the following steps: (1) estimate the ME density sequentially for k = 1, 2, . . .; (2) perform the test for each value of k; (3) stop at the first value of k (k_0, say) such that the hypothesis k = k_0 cannot be rejected, and conclude that k* = k_0. However, this method does not take into account the cost of estimating a model with a larger number of parameters. A possible remedy consists in computing an information criterion, such as the Akaike (AIC) or Bayesian (BIC) Information Criterion. To avoid overfitting, one can stop at the value k* such that at least one of the following two conditions holds: (1) the llr test cannot reject the hypothesis k = k*; (2) the numerical value of AIC(k* + 1) [or BIC(k* + 1)] is larger than the numerical value of AIC(k*) [or BIC(k*)].

To measure the discrepancy between some theoretical distribution and the ME density, one can use the Kullback-Leibler (KL) distance between the two distributions, K(f||g) = ∫ f(x) log(f(x)/g(x)) dx, or, more conveniently, the Information Distinguishability (ID) index ID(f||g) = 1 − e^{−K(f||g)}, which is normalized to lie in the unit interval (Soofi et al., 1995). We prefer not to use the Kolmogorov-Smirnov test because, when the parameters are estimated from the data, the asymptotic null distribution is no longer equal to the one obtained using the true parameters. The null distribution could be approximated by means of Monte Carlo simulation (Clauset et al., 2009), but this would be computationally heavy because of the need to compute the cdf of the ME distribution numerically.
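A minimal sketch of this selection rule, assuming the fit_me_density helper from the previous sketch (which returns the maximized log-likelihood as its third output): the llr of k against k + 1 is referred to a χ²_1 distribution and the AIC comparison acts as the overfitting guard. The value of k_max and the 5% level are arbitrary choices of the sketch.

    # Sketch of the sequential choice of k*, reusing fit_me_density from the
    # previous sketch. The llr of k against k + 1 is referred to a chi^2_1
    # distribution; the AIC comparison guards against overfitting.
    from scipy.stats import chi2

    def select_k(y, k_max=6, level=0.05):
        fits = {k: fit_me_density(y, k) for k in range(1, k_max + 1)}
        for k in range(1, k_max):
            ll_k, ll_k1 = fits[k][2], fits[k + 1][2]
            llr = -2.0 * (ll_k - ll_k1)              # test of k against k + 1
            p_value = chi2.sf(llr, df=1)
            aic_k = 2 * (k + 1) - 2 * ll_k           # k + 1 multipliers estimated
            aic_k1 = 2 * (k + 2) - 2 * ll_k1
            if p_value > level or aic_k1 > aic_k:    # stop: k* = k
                return k, p_value
        return k_max, None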

It has often been noted that the Pareto tail of a system seems longer at a higher level of aggregation. For instance, simulating N observations from the same lognormal distribution, the top 10² observations look definitely Pareto when N = 10⁵, and not Pareto when N = 10³ (Perline, 2005). This result has an analytical explanation based on extreme value theory. Since both the exponential (i.e. the log-transformed Pareto) and the normal (the log-transformed lognormal) belong to the domain of attraction of the Gumbel distribution, the asymptotic properties of their upper order statistics are the same (Embrechts et al., 1997). If the distribution is exponential, the expected value of a sufficiently large order statistic X_{(i,N)} from an exponential population of size N is exactly linear on a log scale, with a slope not depending on N. On the other hand, the same expectation for a sufficiently large order statistic X_{(i,N)} from a normal population of size N is asymptotically approximately linear, with a slope equal to −σ/(2 log(N))^{1/2}. In this case the slope goes to zero, but the convergence is of order O(1/√(log(N))), and therefore extremely slow. In other words, departures from linearity are expected to become clear only for very large N, especially when σ is large, so that the Pareto tail becomes less evident only slowly as N increases (Malevergne et al., 2009).

Consider a population of size N_1, and suppose that the i-th standardized order statistic (X_{(i,N_1)} − a_{N_1})/b_{N_1} converges to the Gumbel-type Extreme Value distribution with cdf G_i(x):

P( (X_{(i,N_1)} − a_{N_1})/b_{N_1} ≤ x ) → G_i(x)   ⇔   P( X_{(i,N_1)} ≤ b_{N_1} x + a_{N_1} ) → G_i(x)   as N_1 → ∞,

where, for the log-transformed lognormal case (Embrechts et al., 1997),

a_N = µ + σ (2 log(N))^{1/2} − σ [log(log(N)) + log(4π)] / [2 (2 log(N))^{1/2}],    b_N = σ / (2 log(N))^{1/2}.    (3)

The same result holds for the order statistic X_{(j,N_2)} of a population of size N_2 < N_1 following the same distribution. If i and j are sufficiently large, it turns out that G_i ≈ G_j. Let x_min^(j) (j = 1, 2) be the smallest number such that the distribution of the j-th population above x_min^(j) is Pareto. Then, from (3), it must be true that

b_{N_1} x_min^(1) + a_{N_1} = b_{N_2} x_min^(2) + a_{N_2},

and thus

x_min^(2) ≈ ( b_{N_1} x_min^(1) + a_{N_1} − a_{N_2} ) / b_{N_2}.

This approximation is expected to be rather rough, because the rate of convergence is very low, it is often difficult to know in advance whether the true distribution belongs to the maximum domain of attraction of the Gumbel, and j may not be large enough. However, under the hypothesis of lognormality, it gives an idea of the asymptotic properties of extreme order statistics.
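As a small numerical illustration (not part of the paper's analysis), the norming constants in (3) can be evaluated for two population sizes and the Pareto-onset threshold of the smaller population deduced from that of the larger one; all parameter values in the example line are hypothetical and refer to the log scale.

    # Numerical sketch of the approximation above: Gumbel norming constants (3)
    # for log-transformed lognormal data and the implied Pareto-onset threshold
    # of a smaller population. All values in the example are hypothetical.
    import numpy as np

    def gumbel_norming(N, mu, sigma):
        s = np.sqrt(2.0 * np.log(N))
        a = mu + sigma * s - sigma * (np.log(np.log(N)) + np.log(4.0 * np.pi)) / (2.0 * s)
        b = sigma / s
        return a, b

    def implied_threshold(xmin1, N1, N2, mu, sigma):
        a1, b1 = gumbel_norming(N1, mu, sigma)
        a2, b2 = gumbel_norming(N2, mu, sigma)
        return (b1 * xmin1 + a1 - a2) / b2

    # e.g. implied_threshold(xmin1=15.0, N1=10**5, N2=10**3, mu=5.0, sigma=2.0)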

2.2 Alternative approaches

The classical approach to the estimation of the parameters of the Pareto distribution is based on a random sample from the Par(c, α) distribution. In this setup, the maximum likelihood estimators of the parameters are (Kleiber and Kotz, 2003)

ĉ = min_{1≤i≤n} x_i,    α̂ = n / ∑_{i=1}^{n} log(x_i/ĉ).    (4)
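Equation (4) translates directly into code; the following minimal sketch computes both estimators from a sample that is assumed to be entirely Pareto.

    # Minimal sketch of the unconditional Pareto MLEs in (4).
    import numpy as np

    def pareto_mle(x):
        x = np.asarray(x, dtype=float)
        c_hat = x.min()                                   # scale: smallest observation
        alpha_hat = len(x) / np.sum(np.log(x / c_hat))    # shape
        return c_hat, alpha_hat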

However, our problem is different, since it is known a priori that only observations larger than some unknown threshold x_min follow the Pareto distribution, so that the threshold cannot be estimated by means of (4). In this case, the following two-step procedure is often suggested (Embrechts et al., 1997). First, plot the Mean Excess Function (MEF) e(x) = E(X − x | X > x).

Then, letting ∆_n(x) = {i : X_i > x}, choose x_min equal to the point x* such that the so-called empirical MEF, e_n(x) = (1/#∆_n(x)) ∑_{i∈∆_n(x)} (X_i − x), becomes approximately linear for x > x*.

Alternatively, one may use the Hill estimator (Hill, 1975; Embrechts et al., 1997), which is equivalent to the Maximum Likelihood Estimator (MLE) of the shape parameter α if the underlying distribution is Pareto. When the underlying distribution is in the maximum domain of attraction of the Pareto (equivalently, when the tail is Pareto above a certain threshold x_min), the Hill estimator is the MLE conditional on the threshold being equal to the k-th order statistic x_(k). Various asymptotically equivalent versions of this estimator can be derived by means of different methods, so that it is a very natural solution. For the purpose of identifying the threshold x_min, one can observe that the Hill plot (a plot of the Hill estimator as a function of the order statistics x_(k)) is approximately linear for x_(k) ≥ x_min. The justification is that the Hill estimator can be interpreted as the empirical MEF of log(X) calculated at the threshold u = log(X_(k)) (Embrechts et al., 1997). However, the Hill estimator can be biased in finite samples (Embrechts et al., 1997; Gabaix and Ioannides, 2004).

Another approach is based on the fact that if X ∼ Par(c, α), then Y = log(X/c) ∼ Exp(α) (Malevergne et al., 2009; Hisano and Mizuno, 2010). Furthermore, the logarithm of a (truncated) lognormal is a (truncated) normal, so that testing the null hypothesis of a Pareto distribution against the alternative of a (truncated) lognormal is equivalent to testing the null hypothesis of an exponential against the alternative of a (truncated) normal. The Uniformly Most Powerful Unbiased (UMPU) test can be based on the clipped sample coefficient of variation c̄ = min{1, ĉ}, where ĉ = σ̂/µ̂ is the sample coefficient of variation. Various approximations to the critical values of the test are available (del Castillo and Puig, 1999). The test is therefore both computationally simple and theoretically sound.

Finally, Clauset, Shalizi and Newman (CSN) proposed a different method based on the Kolmogorov-Smirnov (KS) statistic (Clauset et al., 2009). In particular, the estimated value of x_min is the value x̂_min that minimizes the KS distance between the empirical cumulative distribution function (cdf) F_n(x) and the cdf of the fitted model: D = max_{x ≥ x_min} |F_n(x) − F(x)|. Although CSN also show how to test the hypothesis that the data larger than x̂_min are truly power-law distributed, their method only provides the best threshold; it does not tell whether, and how plausibly, different thresholds also determine a power-law tail.
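For concreteness, here is a minimal sketch of two of the alternatives just described: the Hill estimator for a given number k of upper order statistics, and a CSN-style choice of x_min by minimizing the KS distance between the empirical tail and the Pareto fitted above each candidate threshold. It is a simplified stand-in for the original procedures (in particular, the bootstrap p-value of CSN is omitted), and min_tail = 50 is an arbitrary choice.

    # Sketch of the Hill estimator and of CSN-style threshold selection by
    # KS-distance minimization (the bootstrap goodness-of-fit step of CSN is omitted).
    import numpy as np

    def hill_estimator(x, k):
        """Hill estimate of alpha from the k largest observations."""
        xs = np.sort(np.asarray(x, dtype=float))[::-1]        # descending order
        return 1.0 / (np.mean(np.log(xs[:k])) - np.log(xs[k]))

    def csn_xmin(x, min_tail=50):
        """Return (KS distance, x_min, alpha_hat) minimizing the KS distance."""
        xs = np.sort(np.asarray(x, dtype=float))
        best = (np.inf, None, None)
        for i in range(len(xs) - min_tail):
            xmin, tail = xs[i], xs[i:]
            alpha = len(tail) / np.sum(np.log(tail / xmin))   # conditional MLE
            emp = np.arange(1, len(tail) + 1) / len(tail)     # empirical cdf of the tail
            fit = 1.0 - (xmin / tail) ** alpha                # fitted Pareto cdf
            ks = np.max(np.abs(emp - fit))
            if ks < best[0]:
                best = (ks, xmin, alpha)
        return best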


3 Empirical analysis

We analyze two widely investigated economic distributions: international trade flows (Bhattacharya et al., 2008; Easterly et al., 2009; Gabaix, 2009; Riccaboni and Schiavo, 2010) and business firm sizes (Ijiri and Simon, 1977; Axtell, 2001; Sutton, 1997; Growiec et al., 2008). First, we estimate the maximum entropy distribution against the (truncated) lognormal. Second, we compute the ME test to estimate the power-law tail and compare our results with the outcome of the UMPU and CSN tests. Third, we evaluate the effect of sample size on the power-law tail. Lastly, we discuss a theoretical model that properly accounts for the emergence of a power-law tail in the business firm size distribution (Fu et al., 2005; Growiec et al., 2008).

Trade data are taken from the COMTRADE database maintained by the United Nations. It collects data on 6 002 617 bilateral trade flows among 157 reporting countries (sources) and 230 destinations, at the 6-digit level of the Harmonized System classification, which consists of roughly 5 000 products. In the analysis we focus on the latest available year (2007) and analyze both disaggregate data at the level of single product categories (D-TRADE) and total trade obtained by summing up all trade flows for each of 20 767 non-null country pairs (A-TRADE). To analyze the distribution of business firm size we exploit a unique dataset on the yearly sales of 916 036 pharmaceutical products by 5 721 firms in 21 countries in 2004 (Fu et al., 2005; Riccaboni et al., 2008). Information is available both at the disaggregate level of product sales (D-PHARMA) and reaggregated by assigning each product to the firm that sells it (A-PHARMA). All data are expressed in thousands of US dollars (USD). For notational convenience, in what follows the original data and their logarithms will be called “levels” and “logarithms” respectively.

The distributions of both aggregate and disaggregate trade logarithms are not normal, but rather truncated normal, because of many small observations (corresponding to many levels clustered near zero). Moreover, the observations smaller than zero seem to be scarcely informative, as there are clusters and peaks. Since (i) the distribution is truncated anyway, (ii) the smallest observations do not appear very reliable, (iii) there is some switch in the distribution approximately at x_t = 1, and (iv) we are not particularly interested in the left tail of the distribution, we decided to discard the observations smaller than x_t = 1 and estimate just the left-truncated lognormal defined on (x_t, +∞). Notice that the distribution of all the available observations is likely to be a mixture of some distribution on (0, x_t) and a left-truncated lognormal on (x_t, +∞). It is only because the distribution below x_t does not seem to be lognormal, and because we are not interested in modeling the whole distribution, that we are entitled to discard the observations smaller than x_t. As for the pharmaceutical data, all observations are larger than one thousand USD. It is, however, clear that the distribution is truncated, in particular at the disaggregate level, so that we fit a truncated normal in this case as well, setting the truncation threshold equal to one thousand USD.
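The left-truncated lognormal is fitted on the log scale as a left-truncated normal; the sketch below obtains the MLEs by direct numerical maximization of the truncated-normal likelihood above a known truncation point y_t. The paper reports using the EM algorithm for this step (see Tab. 1 below); direct maximization is used here only as a simpler stand-in.

    # Sketch: MLE of a left-truncated normal for the log data above a known
    # truncation point y_t, by direct maximization of the truncated likelihood
    # (the paper uses the EM algorithm for this step; this is a simpler stand-in).
    import numpy as np
    from scipy.optimize import minimize
    from scipy.stats import norm

    def fit_left_truncated_normal(y, y_t):
        y = np.asarray(y, dtype=float)
        y = y[y > y_t]

        def negloglik(theta):
            mu, log_sigma = theta
            sigma = np.exp(log_sigma)                     # keep sigma positive
            # density of N(mu, sigma^2) conditional on exceeding y_t
            ll = norm.logpdf(y, mu, sigma) - norm.logsf(y_t, mu, sigma)
            return -np.sum(ll)

        res = minimize(negloglik, x0=[y.mean(), np.log(y.std())], method="Nelder-Mead")
        return res.x[0], np.exp(res.x[1])                 # (mu_hat, sigma_hat)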


[Figure 1 panels: (a) Logarithms of A-TRADE (N = 20767), ME(5); (b) Logarithms of D-TRADE (N = 6002617), ME(6); (c) Logarithms of A-PHARMA (N = 7184), ME(3); (d) Logarithms of D-PHARMA (N = 916036), ME(4). Each panel shows the data histogram with the fitted normal and optimal ME densities; horizontal axis: logarithms of thousands of dollars.]

Figure 1: Distributions of the logarithms of the data, with the optimal ME density and the truncated normal density (parameters estimated from the data) superimposed. From top to bottom and from left to right, the panels refer to aggregate trade data (A-TRADE), disaggregate trade data (D-TRADE), aggregate pharmaceutical data (A-PHARMA) and disaggregate pharmaceutical data (D-PHARMA). N is the population size.

Fig. 1(a) shows the distribution of the logarithms of the aggregate trade data with the optimal ME density, k* = 5, superimposed. For comparison purposes, the truncated normal density with parameters estimated from the data is shown as well. The ME density fits the data much better than the truncated normal. The fact that k* = 5 also implies that the lognormal hypothesis for the levels should be rejected. For disaggregate trade data (Fig. 1(b)), the lognormal hypothesis is again rejected, but the distance between the optimal ME and the normal seems smaller. Turning now to the pharmaceutical data (Fig. 1(c) and (d)), the normal distribution is clearly not appropriate for the whole distribution at either aggregation level. The distributions are more skewed than in the trade case, and the distance between the normal and the optimal ME looks large at both levels.


These considerations can be made more precise by computing the KL distance and the ID index (see Sect. 2.1) between the optimal ME density and the truncated normal distribution with parameters estimated from the data. The results are shown in Tab. 1, where the MLEs of the truncated normal are obtained by means of the EM algorithm (Dempster et al., 1977). The lognormal hypothesis receives more support as the aggregation level decreases, particularly in the trade case.

                A-TRADE   D-TRADE        A-PHARMA   D-PHARMA
  ID index      0.0073    4.9702·10^-4   0.0711     0.0688
  KL distance   0.0073    4.9714·10^-4   0.0737     0.0713

Table 1: Information Distinguishability (ID) index and Kullback-Leibler (KL) distance of the ME distribution from the lognormal, for different levels of aggregation.

We now consider the tail properties of the fitted distributions. In order to estimate the threshold x_min (the Pareto scale parameter), we estimate the tail distribution for the data {x : x > x_m} for various x_m. For aggregate trade data, the threshold sequence goes from rank 50 to rank 800. We do not show results for ranks smaller than 50 (because estimates obtained with fewer than 50 observations are likely to be too unstable) or larger than 800 (because the Pareto hypothesis is definitely rejected by all tests for ranks larger than 800). According to the details in Sect. 2, if the null hypothesis k = 1 cannot be rejected, the true distribution is Pareto. Fig. 2 shows the p-value of the llr test for H0 : k = 1 against H1 : k = 2 in the ME setup, the estimate of x_min obtained by CSN, and the p-value of the UMPU test for exponential vs. truncated normal (del Castillo and Puig, 1999). The shaded area corresponds to thresholds such that AIC(1) > AIC(2) and therefore suggests rejection of the null hypothesis, where AIC(k) is the AIC criterion for the model with k parameters. Finally, the p-value of the power-law distribution found by CSN is reported in the caption.

In the case of aggregate trade (Fig. 2(a)) the evidence is mixed. At the 5% level, the Pareto hypothesis is valid approximately for ranks smaller than 650 (quantile 96.87%) according to the ME test, and for ranks larger than 150 (quantile 99.28%) according to the UMPU test. However, the results are far from clear-cut: on the one hand, the quantity AIC(1) − AIC(2) is sometimes negative even for ranks smaller than 650; on the other hand, the p-value of the UMPU test is sometimes near 0.05 for ranks larger than 150. The CSN approach yields a rank equal to 408 (quantile 98.04%), but the p-value is equal to 0.024, so that the presence of a power-law tail seems questionable. The only conclusions that can be drawn with reasonable certainty are that the distribution is Pareto for ranks smaller than 150 and is not Pareto for ranks larger than 700 (quantile 96.63%). Note that, according to the ME procedure, for ranks larger than 700 three parameters are typically needed (that is, we reject k = 1 but accept k = 2), so that the distribution is a truncated lognormal.
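The threshold scan behind Fig. 2 can be sketched as follows, reusing the fit_me_density helper of Sect. 2: for each candidate rank r, ME(1) and ME(2) are fitted to the logarithms of the r largest observations, and the llr p-value and the AIC difference are recorded. The rank grid and the array name in the usage line are illustrative assumptions.

    # Sketch of the threshold scan underlying Fig. 2, reusing fit_me_density
    # from Sect. 2: for each candidate rank r, fit ME(1) and ME(2) to the logs
    # of the r largest observations and record the llr p-value and AIC(1) - AIC(2).
    import numpy as np
    from scipy.stats import chi2

    def scan_thresholds(x, ranks):
        logs = np.log(np.sort(np.asarray(x, dtype=float)))[::-1]   # descending logs
        out = []
        for r in ranks:
            y = logs[:r]
            _, _, ll1 = fit_me_density(y, 1)
            _, _, ll2 = fit_me_density(y, 2)
            p_value = chi2.sf(-2.0 * (ll1 - ll2), df=1)
            aic_diff = (2 * 2 - 2 * ll1) - (2 * 3 - 2 * ll2)        # AIC(1) - AIC(2)
            out.append((r, p_value, aic_diff))
        return out

    # e.g. scan_thresholds(a_trade, ranks=range(50, 801, 10))   # a_trade: hypothetical array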

Figure 2: The graphs show the p-value of the llr test for H0 : k = 1 against H1 : k = 2 in the ME setup, the estimate of x_min obtained by CSN, and the p-value of the UMPU test for exponential vs. truncated normal. The shaded area marks thresholds such that AIC(1) > AIC(2) (H0 rejected). From top to bottom and from left to right, the panels refer to aggregate trade data (A-TRADE), disaggregate trade data (D-TRADE), aggregate pharmaceutical data (A-PHARMA) and disaggregate pharmaceutical data (D-PHARMA). The p-value of the CSN threshold is equal to 0.024 for A-TRADE, 0.274 for D-TRADE, 0.026 for A-PHARMA and 0.868 for D-PHARMA.

Turning now to disaggregate trade data (Fig. 2(b)), the Pareto hypothesis is valid approximately for ranks smaller than 1 600 for ME (quantile 99.97%), 1 480 for CSN (quantile 99.98%) and 300 for UMPU (quantile > 99.99%). However, similarly to the case of aggregate data, for ranks between 500 (quantile > 99.99%) and 1 500 (quantile 99.98%) the p-value of the UMPU test is sometimes near 0.05. The p-value of the CSN approach seems to confirm that the distribution is power-law, and all tests suggest that the distribution is not Pareto for ranks larger than 1 600. Although the ranks at which the power-law hypothesis is accepted are larger for disaggregate data, the population size is much larger, so that only a very small fraction of the observations is generated by a Pareto tail.

As for the pharmaceutical data, Fig. 2(c) shows that the distribution is Pareto for ranks approximately smaller than 1 400 for ME (quantile 80.51%), 1 200 to 1 300 for UMPU (quantiles 83.30% and 81.90%, respectively) and 900 for CSN (quantile 87.47%). The p-value of the CSN approach is small, so that the tail may actually not be Pareto.

Finally, with the disaggregate data (Fig. 2(d)) we get similar results with ME and CSN, for which the threshold is between ranks 8 and 9 thousand (quantiles 99.13% and 99.02% respectively), and different results with the UMPU test, for which the rank is approximately equal to 300 (quantile 99.97%). However, for ranks between 1 and 6 thousand (quantiles 99.89% and 99.35% respectively), the sign of the AIC difference fluctuates considerably and the p-value is often smaller than 0.1, two facts suggesting that the optimal ME distribution is at the border between k = 1 and k = 2 in this range of ranks. The CSN p-value is large, so that the tail corresponding to the largest 8 thousand observations is likely to be Pareto, a result in line with the ME p-value for similar thresholds.

             A-TRADE   D-TRADE   A-PHARMA   D-PHARMA
  α̂(ME)     0.948     1.380     0.532      1.021
  α̂(CSN)    1.080     1.402     0.601      1.038

Table 2: Estimates of the Pareto shape parameter obtained with the ME and CSN methods. α̂(CSN) is the estimated parameter generated by the CSN method minus 1.

One of the benefits of the ME approach is that it delivers the estimated parameters of the distribution. Tab. 2 shows the estimates of the Pareto shape parameter obtained by means of the ME and CSN approaches; these are obtained exploiting (1) and the maximum likelihood method, respectively. The values produced by the two methodologies are close to each other, a result that is not surprising given the asymptotic equivalence of ME and maximum likelihood. In line with this remark, the smallest difference is observed for D-TRADE, i.e. the population with the largest size.

In some cases, the evidence from the three tests is not the same. It may therefore be of interest to check how different the ME densities are when k = 1 (Pareto) and k = 2 (truncated lognormal). To this aim, Fig. 3 gives some insight for the aggregate trade data. The graph displays the histogram of the logarithms above four different thresholds and the estimated ME(1) and ME(2) densities (respectively exponential and truncated normal when using the logarithms). The four thresholds correspond to ranks in different positions of the tail: panel (a) uses rank 100, for which all the tests accept the Pareto hypothesis; panels (b) and (c) use ranks 300 and 550, for which the UMPU test rejects but the ME test accepts the Pareto hypothesis; and panel (d) uses rank 750, for which all the tests reject the Pareto hypothesis. The two densities are almost indistinguishable for the three smallest ranks, for which the tests give somewhat different results. On the other hand, when the rank is equal to 750 and all the tests suggest rejection of the Pareto hypothesis, the difference is more evident. These results are quite reassuring, as they show that, when the outcomes of the tests are not unambiguous, the possible data-generating processes are almost identical.

A final issue requires further investigation. As pointed out in Sect. 2.1, under the lognormal hypothesis, when the threshold is large, so that the number of observations is small, it is often observed that the tail seems to follow a Pareto distribution.

[Figure 3 panels: (a) A-TRADE, Rank = 100; (b) A-TRADE, Rank = 300; (c) A-TRADE, Rank = 550; (d) A-TRADE, Rank = 750. Each panel shows the data histogram with the fitted ME(1) and ME(2) densities.]

Figure 3: Results for the sampled aggregate trade data (A-TRADE).

In order to quantify how the sample size influences the statistical features of the tail (and, in particular, the estimated x_min), we apply the procedure again to a sample of the disaggregate data of the same size as the aggregate populations (n = 20 767 for trade and n = 5 721 for pharmaceutical data). Thus, the size of the two datasets to be compared is now the same. The results are reported in Fig. 4. In qualitative terms, for the trade data, the outcomes are similar to the ones shown in Fig. 2(d); the thresholds obtained with the three tests are also in good agreement with each other. The Pareto tail approximately corresponds to the 97.5% quantile for the sampled data and to the 99.98% quantile for the original data. For the pharmaceutical data, the Pareto tail approximately corresponds to the 94.1% quantile for the sampled data and to the 98.1% quantile for the original data. Consistently with the analytical remarks in Sect. 2.1, the Pareto tail is more pronounced in the samples than in the original populations. In particular, the difference between the length of the Pareto tail in the population and in the sample is larger for the trade data than for the pharmaceutical sales data, as was to be expected in view of the larger difference between population and sample size in the trade case.

Disaggregate data show, in the best case, a power-law tail confined to the last percentile of the distributions. However, trade and pharmaceutical data show a different behavior upon aggregation.
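The equal-size comparison just described amounts to drawing a without-replacement subsample of the disaggregate data with the size of the corresponding aggregate population and re-running the threshold scan; a minimal sketch, with hypothetical array names (d_trade, d_pharma) and reusing scan_thresholds from above, follows.

    # Sketch of the equal-size comparison: subsample the disaggregate data down
    # to the aggregate population size and re-run the threshold scan.
    # d_trade and d_pharma are hypothetical arrays of disaggregate observations;
    # scan_thresholds is the helper sketched earlier in this section.
    import numpy as np

    rng = np.random.default_rng(42)
    trade_sample = rng.choice(d_trade, size=20767, replace=False)
    pharma_sample = rng.choice(d_pharma, size=5721, replace=False)
    trade_scan = scan_thresholds(trade_sample, ranks=range(50, 1001, 10))
    pharma_scan = scan_thresholds(pharma_sample, ranks=range(50, 1001, 10))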

Figure 4: Results on the length of the Pareto tail for the sampled trade (a) and pharmaceutical (b) data. The p-value of the Pareto tail found by means of the CSN approach is 0.096 for trade and 0.084 for pharmaceutical data.

Since the aggregate trade (A-TRADE) distribution is certainly not Pareto below the 96.63% percentile, while a sample of the same size drawn from the disaggregate trade (D-TRADE) distribution has a Pareto tail starting at the 97.5% quantile, we can conclude that the trade distribution has a very short Pareto tail (if any). Conversely, the business firm distribution has a power-law tail from the 80.51% quantile onward, while the Pareto tail of a sample of the same size from the D-PHARMA data is limited to the 94.1% quantile. Therefore the Pareto tail of the business firm size distribution emerges upon aggregation, and not just as a matter of sample size.

To make sense of this, consider that the aggregate size of each firm (S) is given by the sum of the sizes of its products (s) over the total number of products sold (K). If product sizes are approximately lognormally distributed, firm size is a sum of lognormals. Thus, the aggregate pdf is given by

p(S) = ∑_{K=1}^{∞} p(K) p(S|K).

Unfortunately, the distribution of a sum of lognormals does not have a closed form. A mixture of the Slimane bounds can be used as a first approximation of the aggregate cdf (Slimane, 2002; Growiec et al., 2008):

P(S) = ∑_{K=1}^{∞} P(K) [ 1 − Θ( (ln(S/K^γ) − m_s) / √V_s )^K ],

with 0 ≤ γ ≤ 1, where m_s and V_s are the mean and variance of ln(s), and Θ denotes the cdf of the standardized normal distribution. Denoting Θ( (ln(S/K^γ) − m_s) / √V_s ) ≡ H(S) and differentiating with respect to S yields the following pdf:

p(S) = −P′(S) = H′(S) × ∑_{K=1}^{∞} K H(S)^{K−1} p(K).

The firm size distribution is therefore a lognormal density (H′(S)) multiplied by a stretching factor that increases with S. If p(K) ≈ K^{−2}, then, replacing summation with integration, the stretching factor can be approximated by ∑_{K=1}^{∞} H(S)^{K−1}/K ≈ ∫_1^∞ H(S)^{x−1}/x dx ≈ ∫_0^∞ H(S)^x dx = −1/ln H(S). For large S, with H(S) → 1, the stretching factor can be arbitrarily large. Fig. 5 shows that the distribution of the number of products sold by each pharmaceutical firm, P(K_f), is approximately Pareto, whereas the same distribution for international trade, P(K_c) (the number of products traded by each country pair), is far less skewed. Thus the emergence of a power-law tail in the pharmaceutical data can be explained by the presence of a Pareto component in the stretching factor (Growiec et al., 2008).
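The mechanism can be illustrated by simulation (a sketch with purely illustrative parameter values, not estimates from the data): lognormal product sizes are summed within units whose product counts K are drawn either from a Zipf-like rule with p(K) ∝ K^{−2} or from a mildly skewed alternative; the former produces a much heavier upper tail of aggregate sizes.

    # Simulation sketch of the aggregation argument: lognormal product sizes are
    # summed over K products per unit, with K either Zipf-distributed
    # (p(K) ~ K^-2) or mildly skewed. Parameter values are illustrative only.
    import numpy as np

    rng = np.random.default_rng(0)

    def aggregate_sizes(n_units, draw_k, mu=0.0, sigma=1.0):
        return np.array([rng.lognormal(mu, sigma, size=draw_k()).sum()
                         for _ in range(n_units)])

    zipf_k = lambda: min(int(rng.zipf(2.0)), 100_000)   # cap the occasional huge draw
    mild_k = lambda: 1 + int(rng.poisson(5.0))

    S_zipf = aggregate_sizes(5_000, zipf_k)
    S_mild = aggregate_sizes(5_000, mild_k)
    # Comparing the upper quantiles (e.g. on a log-log rank plot) shows a much
    # heavier tail for S_zipf than for S_mild.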


Figure 5: The complementary cumulative distribution of the number of commodities traded by country pairs, P(K_c) (trade data, broken line), and of the number of products sold by each firm, P(K_f) (pharmaceutical data, full line). Double logarithmic scale (large picture) and semi-logarithmic scale (inset).


To substantiate the claim that the Pareto tail in A-PHARMA can be generated by the aggregation of products into firms according to a very skewed distribution P(K_f), we run the tests on a synthetic dataset (A*-PHARMA) obtained by aggregating D-PHARMA data according to P(K_c) instead of P(K_f). We find that the Pareto tail is limited to ranks ranging from 136 (CSN) to 162 (ME test), which correspond to the 97.55% and 97.17% quantiles respectively. Hence, the power-law tail is much shorter than in the true A-PHARMA dataset, a result in line with the conjecture that the skewness of the aggregation rule P(K_f) contributes to the emergence of a Pareto tail in the data.
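The construction of the synthetic dataset can be sketched as follows, with hypothetical array names (d_pharma_sales for product-level sales, k_c_counts for the observed numbers of products per country pair): product sales are shuffled and grouped into synthetic units whose product counts are drawn from the empirical distribution of K_c rather than K_f.

    # Sketch of the A*-PHARMA construction: reaggregate D-PHARMA product sales
    # using product counts drawn from the empirical trade rule P(K_c).
    # d_pharma_sales and k_c_counts are hypothetical arrays.
    import numpy as np

    rng = np.random.default_rng(1)

    def reaggregate(product_sales, count_pool):
        sales = rng.permutation(np.asarray(product_sales, dtype=float))
        units, i = [], 0
        while i < len(sales):
            k = int(rng.choice(count_pool))      # draw a product count from P(K_c)
            units.append(sales[i:i + k].sum())
            i += k
        return np.array(units)

    a_star_pharma = reaggregate(d_pharma_sales, k_c_counts)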

Figure 6: Results on the length of the Pareto tail on pharmaceutical data obtained by aggregating D-PHARMA according to P (Kc ). The p-value found by means of the CSN approach is XXX.

4 Conclusion

In this paper we present a new methodology to analyze the tail behavior of empirical distributions and compare it with existing approaches. While no single method is sufficient to discriminate among skewed distributions, the ME estimator and the UMPU test appear to be less sensitive to the sample size. The main advantage of the ME method is its wider scope: it always delivers the estimated parameters of the distribution and encompasses a broader range of alternatives. Moreover, it provides a well-defined model when the distribution is non-standard and makes it possible to measure the distance of the best-fitting density from all commonly used parametric distributions.

References

Allen, A., Li, B. and Charnov, E. (2001). Population fluctuations, power laws and mixtures of lognormal distributions. Ecology Letters, 4 (1), 1–3.

Axtell, R. L. (2001). Zipf distribution of U.S. firm sizes. Science, 293 (5536), 1818–1820.

Barabási, A.-L. and Albert, R. (1999). Emergence of scaling in random networks. Science, 286 (5439), 509–512.

Bhattacharya, K., Mukherjee, G., Saramäki, J., Kaski, K. and Manna, S. (2008). The International Trade Network: weighted network analysis and modelling. Journal of Statistical Mechanics: Theory and Experiment, 2.

Brown, J., Gupta, V., Li, B., Milne, B., Restrepo, C. and West, G. (2002). The fractal nature of nature: power laws, ecological complexity and biodiversity. Philosophical Transactions of the Royal Society of London. Series B: Biological Sciences, 357 (1421), 619.

Clauset, A., Shalizi, C. and Newman, M. (2009). Power-law distributions in empirical data. SIAM Review, 51 (4), 661–703.

Coronel-Brizio, H. and Hernández-Montoya, A. (2005). On fitting the Pareto-Levy distribution to stock market index data: Selecting a suitable cutoff value. Physica A: Statistical Mechanics and its Applications, 354 (1-4), 437–449.

De Fabritiis, G., Pammolli, F. and Riccaboni, M. (2003). On size and growth of business firms. Physica A: Statistical Mechanics and its Applications, 324 (1-2), 38–44.

del Castillo, J. and Puig, P. (1999). Test of exponentiality against singly truncated normal alternatives. Journal of the American Statistical Association, 94, 529–532.

Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39 (1), 1–38.

Easterly, W., Reshef, A. and Schwenkenberg, J. (2009). The power of exports. Policy Research Working Paper Series 5081, The World Bank.


Eeckhout, J. (2004). Gibrat's law for (all) cities. American Economic Review, 94 (5), 1429–51.

— (2009). Gibrat's law for (all) cities: Reply. American Economic Review, 99 (4), 1676–83.

Embrechts, P., Klüppelberg, C. and Mikosch, T. (1997). Modelling Extremal Events for Insurance and Finance. Springer.

Fisher, R., Corbet, A. and Williams, C. (1943). The relation between the number of species and the number of individuals in a random sample of an animal population. The Journal of Animal Ecology, 12 (1), 42–58.

Fu, D., Pammolli, F., Buldyrev, S., Riccaboni, M., Matia, K., Yamasaki, K. and Stanley, H. (2005). The growth of business firms: Theoretical framework and empirical evidence. Proceedings of the National Academy of Sciences of the United States of America, 102 (52), 18801.

Gabaix, X. (1999). Zipf's law for cities: An explanation. Quarterly Journal of Economics, 114 (3), 739–67.

— (2009). Power laws in economics and finance. Annual Review of Economics, 1, 255–93.

— and Ioannides, Y. (2004). The evolution of city size distributions. Handbook of Regional and Urban Economics, 4, 2341–2378.

Goldstein, M., Morris, S. and Yen, G. (2004). Problems with fitting to the power-law distribution. European Physical Journal B, 41 (2), 255–258.

Growiec, J., Pammolli, F., Riccaboni, M. and Stanley, H. E. (2008). On the size distribution of business firms. Economics Letters, 98 (2), 207–212.

Hill, B. M. (1975). A simple general approach to inference about the tail of a distribution. Annals of Statistics, 3, 1163–1174.

Hisano, R. and Mizuno, T. (2010). Sales distribution of consumer electronics. arXiv:1004.0637v2.

Hubbell, S. (2001). The Unified Neutral Theory of Biodiversity and Biogeography. Princeton, NJ: Princeton University Press.

Ijiri, Y. and Simon, H. (1977). Skew Distributions and the Sizes of Business Firms. Amsterdam: North Holland.

Jaynes, E. T. (1957). Information theory and statistical mechanics. Physical Review, 106, 620–630.

Kapur, J. (1989). Maximum Entropy Models in Science and Engineering. Wiley.

Kleiber, C. and Kotz, S. (2003). Statistical Size Distributions in Economics and Actuarial Sciences. Wiley.

Levy, M. (2009). Gibrat's law for (all) cities: Comment. American Economic Review, 99 (4), 1672–75.

Malevergne, Y., Pisarenko, V. and Sornette, D. (2009). Gibrat's law for cities: uniformly most powerful unbiased test of the Pareto against the lognormal. Swiss Finance Institute Research Paper Series, 09-40.

Mitzenmacher, M. (2004). A brief history of generative models for power law and lognormal distributions. Internet Mathematics, 1 (2), 226–251.

Montroll, E. and Shlesinger, M. (1982). On 1/f noise and other distributions with long tails. Proceedings of the National Academy of Sciences of the United States of America, 79 (10), 3380.

Newman, M. (2005). Power laws, Pareto distributions and Zipf's law. Contemporary Physics, 46 (5), 323–351.

Perline, R. (2005). Weak and false inverse power laws. Statistical Science, 20, 68–88.

Pisarenko, V. and Sornette, D. (2006). New statistic for financial return distributions: Power-law or exponential? Physica A: Statistical Mechanics and its Applications, 366 (1), 387–400.

Reed, W. (2001). The Pareto, Zipf and other power laws. Economics Letters, 74 (1), 15–19.

— and Hughes, B. (2002). From gene families and genera to incomes and internet file sizes: Why power laws are so common in nature. Physical Review E, 66, 067103.

Riccaboni, M., Pammolli, F., Buldyrev, S. V., Ponta, L. and Stanley, H. E. (2008). The size variance relationship of business firm growth rates. Proceedings of the National Academy of Sciences, 105 (50), 19595–19600.

— and Schiavo, S. (2010). Structure and growth of weighted networks. New Journal of Physics, 12, 023003.

Simon, H. (1955). On a class of skew distribution functions. Biometrika, 42 (3-4), 425.

Slimane, B. S. (2002). Bounds on the distribution of a sum of independent lognormal random variables. IEEE Transactions on Communications, 49 (6), 975–978.

Soofi, E., Ebrahimi, N. and Habibullah, M. (1995). Information distinguishability with application to analysis of failure data. Journal of the American Statistical Association, 90, 657–668.

Sutton, J. (1997). Gibrat's legacy. Journal of Economic Literature, 35 (1), 40–59.

Williamson, M. and Gaston, K. J. (2005). The lognormal distribution is not an appropriate null hypothesis for the species–abundance distribution. Journal of Animal Ecology, 74 (3), 409–422.

Wu, X. (2003). Calculation of maximum entropy densities with application to income distribution. Journal of Econometrics, 115, 347–354.

