Experiments in fitting bivariate power laws (Rev #6)

This page looks at fitting a power law to an urban statistics dataset. The freely available dataset found is the relationship between city size and the number of list of cities with the most high-rise buildings. A high-rise is defined as a structure at least 35 meters or 12 stories tall. It’s unclear that this should be determined by a power law, but given tall buildings are generally erected to accomodate high-level economic activity, which is the kind of thing that is argued to depend on population size via a power law. The data comes from wikipedia and no attempt has been made to verify the data or apply any “corrections” such as removing outliers. The data is also right censored at a minmum of 100 high-rise buidlings.

Since we’re considering a situation where the dominant deviation from a power law is likely to, be due to e.g., a city being underdeveloped or overdeveloped due to unique factors, rather than measurement error, we’ll refer to differences from the basic power law as “deviations” rather than “errors”.

The aim of this page is **not** to find “the correct” power law fit for this data, but to look at the question of whether convincing power law fitting is possible in the absence of at least an idea of what kind of deviations are believed to be occurring.

Original data: $x$-axis is $v$, $y$-axis is $b$.

Double logarithm data: $x$-axis is $\log v$, $y$-axis is $\log b$.

Arguably both these plots could suggest there are some outliers which may for some reason be generated by an anomalous process. Many fitting methods, including maximum likelihood can be heavily skewed by such outliers and they should be removed for best results. However, removal can only be done in a principled way if there are pre-existing theory justifying their exclusion, and we’re considering the case of trying to fit a power law in advance of building a theory.

In order to see the visual effect of binning in the double logarithm domain, here is a plot using 20 equispaced bins. Note how this makes the data look like a line fit would bemuch more reasonable than the unbinned data.

Since we’re fitting using models in the original data rather than the log variables we don’t do anything more with this reduced data set.

We look at two very simple deviation models: additive normal deviation in the original data and additive normal deviation in the original data (which is log-normal deviation in the double logarithm data). Here $v$ is the city population and $b$ is the number of high-rise buildings.

model | hence | elt log likelihood | argmin | C @ min |
---|---|---|---|---|

$b ~ C v^\beta+N(0,\sigma)$ | $b-C v^\beta ~ N(0,\sigma)$ | $\sigma^{-2}\sum_i (b_i-C v_i^\beta)^2$ | $[0.6925,0.693]$ | $0.0326$ |

$b ~ C v^\beta(1+N(0,\sigma))$ | $b/(C v^\beta)-1 ~ N(0,\sigma)$ | $\sigma^{-2}\sum_i (b_i/(C v_i^\beta)-1)^2$ | $[0.653,0.6535]$ | $0.0913$ |

(Since this deviation measures the “vertical” distance to the fitted curve, this model is essentially assuming that cities produce more or fewer hirise buildings for their size: a deviation based on distance to the closest point on the curve would embody different assumptions.) For comparision, least squares fitting in the double logarithm domain is equivalent ot an original data model of

$b ~ C v^\beta \exp N(0,\sigma)$

The model technically has three parameters $(C,\beta,\sigma)$. However, *providing $\sigma$ is essentially independent of the other parameters and the $v$ values* it only scales the likelihood and so does not affect the location of the minimum. Additionally, using the analytical differentiation of the netative log likelihood the value of $C$ at the minimum has a simple form in terms of the data and $\beta$. So in these cases maximum likelihood parameter fitting collapses to just a 1-D search over $\beta$.

Plots of the negated log likelihoods of the two models (over the same range of $\beta$ values). Note that because the deviation term has a different meaning in the different models, the fact the two functions have very different ranges of values doesn’t mean anything.

Model 1 negative log likelihood plot: $x$-axis is $\beta$, $y$-axis proportional to negative log likelihood.

Model 2 negative log likelihood plot: $x$-axis is $\beta$, $y$-axis proportional to negative log likelihood.

The $\beta$ range within which the minimum falls, and approximately the corresponding $C$ value, are recorded in the table. As can be seen there is a dramatic difference between the two values.

Original data with two fitted power laws overlaid.

Double logarithm data with two fitted power laws overlaid.

It’s highly debatable if either of these fitted power laws are a good fit to the data, but it is clear that very different results are obtained depending on the deviation model chosen. This raises doubts whether it’s possible to find power laws in data with medium to high deviation purely by fitting procedures.

Source scripts will be available here.

category: statistical methods