We will work with data from the Colombia WFS Household Survey, conducted in 1975-76. I tabulated the age distribution of all household members and saved it in an ascci file, which we now read and plot:
. infile age pop using /// > https://grodri.github.io/datasets/cohhpop.dat, clear (99 observations read) . line pop age , /// > title(Colombia 1975-76) subtitle(WFS Household Survey) /// > ytitle(population) . graph export cohhpop.png, width(500) replace file cohhpop.png saved as PNG format
> library(ggplot2) > co <- read.table("https://grodri.github.io/datasets/cohhpop.dat", + col.names=c("age","pop"), header=FALSE) > ggplot(co, aes(age, pop)) + geom_line() + ggtitle("Colombia WFS 1975-76") > ggsave("cohhpopr.png", width=500/72, height=400/72, dpi=72)
As you can see, the distribution looks somewhat less smooth than the data from the Philippines that we studied earlier. Can you compute the Myers index for this distribution?
The simplest way to smooth a scatterplot is to use a moving
average, also known as a “running mean”. The most common approach
is to use a window of 2k + 1 observations, k to the
left and k to the right of each observation. The value of
k is a trade off between smoothness of goodness of fit. Special
care must be taken at the extremes of the range. Stata can compute running means via lowess
with the options mean
and noweight
.
A common problem with running means is bias. A solution is to use
weights that give more importance to the closest neighbors and
less to those farther away. A popular weight function is Tukey’s
tri-cube, defined as w(d) = (1-|d|3)3 for
|d| < 1 and 0 otherwise, where d is the distance to
the target point expressed as a fraction of the bandwidth. Stata can do this calculation via lowess
with
the option mean
if you omit
noweight
.
An even better solution is to use running lines. We define
again a neighborhood for each point, typically the k nearest
neighbors on each side, fit a regression line to the points in the
neighborhood, and then use it to predict a smoother value for the index
observation. This sounds like a lot of work, but the calculations can be
done efficiently using regression updating formulas. Stata can compute a running line via lowess
if you omit mean
but include
noweight
.
Better still is to use weighted running lines, giving more
weight to the closest observations, which is what the lowess
smoother does. A variant follows this estimation with a few iterations
to obtain a more robust line. This is clearly the best technique in the
family. Stata’s lowess
uses a weighted
running line if you omit mean
and
noweight
R implements the lowess smoother through the functions
lowess()
and the newer loess()
, which uses a
formula interface with one or more predictors and somewhat different
defaults. The parameter degree
controls the degree of the
local polynomial; the default is 2 for quadratic, alternatives are 1 for
linear and 0 for running means. Both implementations can use a robust
estimator, with the number of iterations controlled by a parameter
iter
or iterations
. Type ?loess
and ?lowess
in the R console for more information. In
ggplot()
you can overlay a lowess smoother by calling
geom_smooth()
.
The figure below shows the Colombian data and a lowess smoother with a span or bandwidth equal to 25% of the data.
. lowess pop age, bwidth(0.25) gen(smooth) title(Lowess Smoother) . graph export cohhrm.png, width(500) replace file cohhrm.png saved as PNG format
> ggplot(co, aes(age, pop)) + geom_point() + + geom_smooth(span=0.25, se=FALSE) + ggtitle("Lowess Smoother") > ggsave("cohhrmr.png", width=500/72, height=400/72, dpi=72)
You may want to try different badwidths to see how the results vary.
Smoothing the age distribution provides a better way to assess digit preference than Myers’ blending. Let us compute the last digit of age and tabulate it over the entire range of the data using the observed frequencies and a lowess smoother.
. gen lastdigit = mod(age,10) . tab lastdigit [fw=pop], matcell(obs) lastdigit │ Freq. Percent Cum. ────────────┼─────────────────────────────────── 0 │ 7,797 13.99 13.99 1 │ 4,847 8.69 22.68 2 │ 5,946 10.67 33.35 3 │ 5,388 9.66 43.01 4 │ 5,281 9.47 52.48 5 │ 6,577 11.80 64.28 6 │ 5,290 9.49 73.77 7 │ 4,879 8.75 82.52 8 │ 5,476 9.82 92.34 9 │ 4,269 7.66 100.00 ────────────┼─────────────────────────────────── Total │ 55,750 100.00
> library(dplyr) > N <- sum(co$pop) > obs <- co |> + mutate(last_digit = age %% 10) |> group_by(last_digit) |> + summarize(prop = sum(pop)/N) > obs # A tibble: 10 × 2 last_digit prop <dbl> <dbl> 1 0 0.140 2 1 0.0869 3 2 0.107 4 3 0.0966 5 4 0.0947 6 5 0.118 7 6 0.0949 8 7 0.0875 9 8 0.0982 10 9 0.0766
The raw frequencies show evidence of preference for ages ending in 0 and 5, which is very common, and probably 2 as well. We now use the smooth as weight
. tab lastdigit [aw=smooth], matcell(fit) lastdigit │ Freq. Percent Cum. ────────────┼─────────────────────────────────── 0 │ 11.0366512 11.15 11.15 1 │ 10.8233452 10.93 22.08 2 │10.59322023 10.70 32.78 3 │ 10.3362373 10.44 43.22 4 │ 10.0601082 10.16 53.38 5 │ 9.77877815 9.88 63.26 6 │ 9.50162757 9.60 72.86 7 │ 9.22961745 9.32 82.18 8 │ 8.95876398 9.05 91.23 9 │ 8.68165074 8.77 100.00 ────────────┼─────────────────────────────────── Total │ 99 100.00
> lf <- as.data.frame(lowess(co$age, co$pop, f = 0.25)) > M <- sum(lf$y) > fit <- lf |> mutate(last_digit = x %% 10) |> group_by(last_digit) |> + summarize(prop = sum(y)/M) > fit # A tibble: 10 × 2 last_digit prop <dbl> <dbl> 1 0 0.114 2 1 0.111 3 2 0.108 4 3 0.105 5 4 0.101 6 5 0.0983 7 6 0.0952 8 7 0.0922 9 8 0.0894 10 9 0.0867
The smoothed frequencies show that we expect fewer people at higher digits, even in a smooth distribution, with more ending in 0 than 9. We are now ready to compute an index of digit preference, defined as half the sum of absolute differences between observed and smooth frequencies:
. mata: obs = st_matrix("obs"); fit = st_matrix("fit") . mata: sum(abs(obs/sum(obs) - fit/sum(fit)))/2 .0553043845
> sum( abs(obs$prop - fit$prop) )/2 [1] 0.0547919
We see that we would need to reshuffle 5.5% of the observations to eliminate digit preference. You may wish to compare this result with the Myers index.