Stata 17 introduced a new system for producing highly-customizable
tables. At the heart of the system is a new collect
command
that can be used to collect the results left behind by various Stata
commands and present them in tables. It also introduced a new
table
command that simplifies the process for many kinds of
tabulations, and later an etable
command that specializes
in tables of estimates. Stata 18 added a dtable
command to
easily produce tables of descriptive statistics. In this tutorial we
will touch briefly on all four commands. Stata 16 and earlier had a
different table
command with its own syntax and features,
still available under version control.
Frequency tables include marginals or one-way distributions, crosstabs or two-way tabulations, and multi-way tables involving three or more variables.
The simplest table we can consider is just a one-way frequency table, where we often want to show percents as well as counts. The example below uses an extract from the 1975 Dominican Republic Fertility Survey and tabulates the distribution of the respondent’s education:
. use https://grodri.github.io/datasets/drsr03x, clear
(DRSR03 extract)

. table educg, statistic(frequency) statistic(percent)

────────────────┬─────────────────────
                │  Frequency   Percent
────────────────┼─────────────────────
Education level │
  0-2           │        941     30.21
  3-4           │        771     24.75
  5-7           │        744     23.88
  8-18          │        659     21.16
  Total         │      3,115    100.00
────────────────┴─────────────────────
If you just type table educg
you will see the
frequencies, which is the default. If you want percents instead you use
the option statistic(percent)
. If you want both frequencies
and percents you use the statistic
option twice, as we did
here.
You could, of course, obtain the same results using
tabulate educg
, which also gives you cumulative
frequencies. However, the new table
command is much more
powerful, letting you customize the table and export the result in
various formats.
To give you just one example, suppose you wanted to label the columns
N
and %
. Although we view this as a one-way
table, it has two dimensions, the education groups that go in
the rows, and the two results that go in the columns, a dimension Stata
calls result
with levels frequency
and percent
. We can use collect
to replace the
labels of the levels of result and then preview our change. Try the next
two commands
collect label levels result frequency "N" percent "%", modify
collect preview
The table above can be transposed, putting the results in the rows
and the categories of education in the columns using the command
collect layout (result) (educg)
. (Alternatively, we could
specify table () (educg)
from the outset.)
The collect
commands act on the current collection,
which was produced by the table
command and is actually
called Table
. We’ll see how to generate our own collections
in Section 3.4. To learn more about one-way tables type
help table oneway
.
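The steps in this section can be combined into one short script. The following is a minimal sketch, assuming the dataset URL used above is reachable; it builds the table quietly, relabels the two levels of result, and then transposes the layout.

```stata
* Build the one-way table without displaying it
use https://grodri.github.io/datasets/drsr03x, clear
quietly table educg, statistic(frequency) statistic(percent)

* Relabel the levels of the result dimension
collect label levels result frequency "N" percent "%", modify

* Put the results in the rows and the education groups in the columns
collect layout (result) (educg)
```

Because collect layout displays the table it lays out, no separate collect preview is needed at the end.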
To obtain a two-way table we specify a row and a column variable. The example below looks at contraceptive use by education groups.
. table educg cuse, statistic(percent, across(cuse))

────────────────┬──────────────────────────────────────────────
                │              Contraceptive use
                │  Not using   Inefficient   Efficient    Total
────────────────┼──────────────────────────────────────────────
Education level │
  0-2           │      69.38          5.37       25.25   100.00
  3-4           │      59.65          5.45       34.90   100.00
  5-7           │      50.00          8.50       41.50   100.00
  8-18          │      31.84         14.43       53.73   100.00
  Total         │      57.07          7.36       35.57   100.00
────────────────┴──────────────────────────────────────────────
If you just type table educg cuse
you will get the
frequencies. Here we are more interested in row percents, which we
obtain using the percent
statistic with the
across(cuse)
option. We see that use of both efficient and
inefficient methods increases substantially with educational level.
This survey defined contraceptive use only for currently married
fecund women, and table
by default excludes missing values.
To include missing values use the missing
option. To see
the frequencies add the statistic(frequency)
option. To
learn more about two-way tables type help table twoway
.
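As a sketch of the two options just mentioned, the command below requests both counts and row percents and keeps missing values of the classification variables. It assumes the same extract is used; the output is not shown here.

```stata
use https://grodri.github.io/datasets/drsr03x, clear

* Counts and row percents, with missing values included as a category
table educg cuse, statistic(frequency) statistic(percent, across(cuse)) missing
```

With the missing option the women for whom contraceptive use was not defined appear in their own column, so the row percents no longer refer only to currently married fecund women.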
It is also possible to do three-way tables, which is as far as we’ll go because tables get rather unwieldy as the number of dimensions increases. Let us look at contraceptive use by area and education:
. table (area educg) (cuse), statistic(percent, across(cuse))

────────────────────┬──────────────────────────────────────────────
                    │              Contraceptive use
                    │  Not using   Inefficient   Efficient    Total
────────────────────┼──────────────────────────────────────────────
Type of area        │
  Urban             │
    Education level │
      0-2           │      54.17          5.95       39.88   100.00
      3-4           │      49.70          2.42       47.88   100.00
      5-7           │      47.92          7.81       44.27   100.00
      8-18          │      31.40         14.53       54.07   100.00
      Total         │      45.77          7.75       46.48   100.00
  Rural             │
    Education level │
      0-2           │      77.01          5.07       17.91   100.00
      3-4           │      66.53          7.53       25.94   100.00
      5-7           │      53.51          9.65       36.84   100.00
      8-18          │      34.48         13.79       51.72   100.00
      Total         │      68.06          6.97       24.97   100.00
  Total             │
    Education level │
      0-2           │      69.38          5.37       25.25   100.00
      3-4           │      59.65          5.45       34.90   100.00
      5-7           │      50.00          8.50       41.50   100.00
      8-18          │      31.84         14.43       53.73   100.00
      Total         │      57.07          7.36       35.57   100.00
────────────────────┴──────────────────────────────────────────────
This command combines categories of residence and education in the rows and shows contraceptive use in the columns. I used parentheses for clarity, but they can be omitted. We see that use of contraception increases with education in both areas, and is generally much higher in urban than rural areas.
We could also produce separate tables for urban and rural areas. Try the following command
table (educg) (cuse) (area), statistic(percent, across(cuse))
Here parentheses are required, and the order is rows, columns,
panels, so area
comes last. The results are the same as
before, but to compare urban and rural you have to look across
panels.
You can suppress marginal totals using the nototals
option, or specify which margins to include with totals(),
using # to interact variables. For example we could suppress
the total panel but keep the row totals, so it is clear that the
percents add to 100% in each row, by using
totals(educg#area). To learn more type
help table multiway.
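The two ways of controlling totals can be tried side by side. This is a sketch that assumes the same extract; the first command is the one described in the text and the second drops every margin.

```stata
use https://grodri.github.io/datasets/drsr03x, clear

* Keep only the totals across contraceptive use within each area-by-education row
table (area educg) (cuse), statistic(percent, across(cuse)) totals(educg#area)

* Suppress all marginal totals
table (area educg) (cuse), statistic(percent, across(cuse)) nototals
```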
Tables of statistics are just like the frequency tables we have seen, except that the cells show summary statistics of yet another variable. The table can have rows, columns and panels, each with one or more variables. We illustrate with two classification variables.
Here is a table showing the mean number of years of education by age groups and area of residence.
. table ageg area, statistic(mean educ) nformat(%5.2f)

───────────┬───────────────────────
           │      Type of area
           │  Urban   Rural   Total
───────────┼───────────────────────
Age groups │
  15-19    │   6.21    4.20    5.31
  20-29    │   6.67    3.75    5.40
  30-39    │   4.98    2.64    3.88
  40-49    │   4.32    1.63    2.90
  Total    │   5.87    3.25    4.66
───────────┴───────────────────────
We use the nformat
option to set the format for numeric
output, so we get just two decimal places. We notice that younger women
have achieved more education than their older counterparts in both
areas, and that average education is higher in urban than in rural
areas.
This table could use a title. As it happens the table
command does not have a title option, but there is a
collect title
command that adds a title to the current
collection, and a collect preview
command to display the
collection. Try
collect title "Mean years of education by age and area"
collect preview
Alternatively, you could add a note at the foot of the table with
collect note "Cells show mean years of education"
.
Tables of statistics can include not just means, but many other
statistics, such as the median, quartiles, standard deviation or
variance. For a full list of the statistics available type
help table_summary##stat
. An interesting “statistic” is
fvproportion
, which gives relative frequencies for a factor
or categorical variable.
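To give a sense of the other statistics, here is a small sketch using the median and quartiles of education, followed by fvproportion for contraceptive use. The choice of statistics and the %5.3f format are just for illustration; the second command follows the (var) layout used for factor variables later in this tutorial.

```stata
use https://grodri.github.io/datasets/drsr03x, clear

* Median and quartiles of years of education by age group and area
table ageg area, statistic(median educ) statistic(q1 educ) statistic(q3 educ)

* Relative frequencies of contraceptive use by area
table (var) (area), statistic(fvproportion cuse) nformat(%5.3f)
```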
It is possible to include two (or more) statistics in the same table. Here is an example showing the mean and standard deviation of years of education by age groups and area of residence.
. table ageg area, statistic(mean educ) statistic(sd educ) ///
>     nformat(%5.2f) sformat((%s) sd) style(table-tab2)

───────────┬──────────────────────────
           │       Type of area
           │   Urban    Rural    Total
───────────┼──────────────────────────
Age groups │
  15-19    │    6.21     4.20     5.31
           │  (2.97)   (2.67)   (3.01)
           │
  20-29    │    6.67     3.75     5.40
           │  (3.95)   (2.87)   (3.81)
           │
  30-39    │    4.98     2.64     3.88
           │  (3.87)   (2.40)   (3.47)
           │
  40-49    │    4.32     1.63     2.90
           │  (3.93)   (1.71)   (3.26)
           │
  Total    │    5.87     3.25     4.66
           │  (3.79)   (2.71)   (3.58)
───────────┴──────────────────────────
Type just the first line first to see all the defaults. The second
line adds some customization. We use our old friend nformat
to display the statistics with just two decimals. We also use
sformat
to print the standard deviation in parentheses,
specifying sd
to ensure that this format applies only to
that statistic.
Why two kinds of formats? All numeric output is first converted to a
string, using an nformat
if any. Then that string is
displayed using an sformat
if any. So a standard deviation
of 9.4148 becomes “9.41” using the numeric format %5.2f
,
and is displayed as “(9.41)” using the string format
(%s)
.
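The two-step conversion can be mimicked by hand. The sketch below is only an analogy built with strofreal() and display, not the code that table itself runs, and the local macro name s is arbitrary.

```stata
* Step 1: the numeric format turns the number into a string
local s = strtrim(strofreal(9.4148, "%5.2f"))

* Step 2: the string format wraps that string, with %s standing for it
display "(`s')"
```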
Finally we use a built-in style called table-tab2
to
hide the labels for the statistics and add some space between the age
groups. To learn more about the available styles type
help Predefined styles
.
To learn more about the table
command, and its many
options, including the command
option that lets you run any
Stata command and collect its results, type help table
.
Research reports often include a table showing descriptive statistics
for a number of variables, using the mean and standard deviation for
numeric or continuous variables, and relative frequencies for
categorical or factor variables, frequently within categories of another
variable of interest. Sometimes this is called “Table 1”. The
table
command can produce this type of table, but the
dtable
command added in version 18 makes it very easy.
Here is a table showing means and standard deviations for age and years of education, our two continuous variables, and the frequency and percent distribution of contraceptive use, all separately for urban and rural areas.
. dtable age educ i.cuse, by(area, test)
note: using test regress across levels of area for age and educ.
note: using test pearson across levels of area for cuse.

────────────────────────────────────────────────────────────────────────
                                   Type of area
                       Urban           Rural          Total         Test
────────────────────────────────────────────────────────────────────────
N                    1,683 (54.0%)   1,432 (46.0%) 3,115 (100.0%)
Age in years       27.435 (9.415) 28.515 (10.083) 27.931 (9.741)   0.002
Education in years  5.866 (3.786)   3.249 (2.708)  4.663 (3.580)  <0.001
Contraceptive use
  Not using           319 (45.8%)     488 (68.1%)    807 (57.1%)  <0.001
  Inefficient           54 (7.7%)       50 (7.0%)     104 (7.4%)
  Efficient           324 (46.5%)     179 (25.0%)    503 (35.6%)
────────────────────────────────────────────────────────────────────────
As you can see, all we need to do is list the variables to be
described, using the i.
prefix for factor variables. The
by()
option specifies a classification variable, with the
suboption test
to request a test of differences across that
variable, based on regression or Pearson’s statistic as indicated in the
notes. That’s quite a bit of work with little effort on our part.
We see that the sample has a few more urban than rural women, and that urban women are younger, more educated, and more likely to use contraception (particularly efficient methods) than rural women. Moreover, all three differences are highly significant.
The sample statistics showing the urban/rural split can be omitted
using the nosample
option. You can also select which
statistics to calculate and where to place them using the
sample
option; type help dtable##sample
for
details.
There is a continuous
option to specify the statistics
and/or tests to use for one or more continuous variables. For example if
you wanted to use the median and interquartile range as descriptive
statistics and the Kruskal-Wallis rank test for education you could use
the option
continuous(educ, stat(median iqr) test(kwallis))
. Omitting
the variable name would apply these choices to all continuous variables.
To see a list of all the statistics and tests available for continuous
variables type help dtable##cstats
and
help dtable##ctests
.
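Putting the last two paragraphs together, the sketch below drops the sample row and switches education to the median, interquartile range and Kruskal-Wallis test, leaving age with the defaults. It assumes the same extract and reuses the option exactly as written in the text.

```stata
use https://grodri.github.io/datasets/drsr03x, clear

* Median, IQR and a Kruskal-Wallis test for education only;
* age keeps the default statistics and test
dtable age educ i.cuse, by(area, test) ///
    continuous(educ, stat(median iqr) test(kwallis)) nosample
```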
There is an equivalent factor
option to specify the
statistics and tests to be used for factor variables. For example you
can use Fisher’s exact test, or a test based on ordinal association,
such as Kendall’s tau or Goodman and Kruskal’s gamma. Type
help dtable##fstats
and help dtable##ftests
for a full list of statistics and tests available for factor
variables.
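As a sketch of the factor option, the command below asks for Fisher’s exact test for contraceptive use. The keyword fisher is our assumption about how the test is named; check help dtable##ftests for the exact spelling before relying on it. Exact tests can also be slow on tables of this size.

```stata
use https://grodri.github.io/datasets/drsr03x, clear

* Fisher's exact test for the factor variable (test name assumed, see lead-in)
dtable age educ i.cuse, by(area, test) factor(cuse, test(fisher))
```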
The dtable
command has a large number of options,
including several that control table styles. The command creates its own
collection called DTable
, which allows further
customization using collect
commands. To learn more type
help dtable
.
The code below shows an alternative “table 1” that can be obtained
with the table
command in both Stata 17 and 18. It shows
sample sizes, mean and standard deviations on separate lines for
continuous variables, and just percents for factor variables, but no
significance tests.
. gen N = 1

. table (var) (area) ,                                 ///
>     stat(count N)                                    /// sample
>     stat(mean age educ) stat(sd age educ)            /// continuous
>     stat(fvpercent cuse)                             /// factor
>     nformat(%5.2f mean sd) nformat(%5.1f fvpercent)  ///
>     sformat((%s) sd) sformat(%s%% fvpercent) style(table-1)

───────────────────┬───────────────────────────
                   │        Type of area
                   │    Urban    Rural    Total
───────────────────┼───────────────────────────
N                  │    1,683    1,432    3,115
                   │
Age in years       │    27.43    28.51    27.93
                   │   (9.41)  (10.08)   (9.74)
                   │
Education in years │     5.87     3.25     4.66
                   │   (3.79)   (2.71)   (3.58)
                   │
Contraceptive use  │
  Not using        │    45.8%    68.1%    57.1%
  Inefficient      │     7.7%     7.0%     7.4%
  Efficient        │    46.5%    25.0%    35.6%
───────────────────┴───────────────────────────
We first create a new variable called N
to obtain sample
sizes. We specify the table rows using var, which refers to
the variables named in the statistic
options, and the columns
using area. We then request the count
for the
sample size, the mean
and sd
for our
continuous variables, and the fvpercent
for our factor
variable.
To control the number of decimals printed we use our old friend
nformat
, specifying 2 decimals for the mean and standard
deviation, but just one for percents. To enclose the standard deviations
in parentheses and append a %
sign to the percents we use
sformat
. (If you are puzzled by the %s%%
format, note that %s
is the placeholder for the string and
that to append a %
symbol we need to escape it using
%%
.)
Finally we use the built-in style table-1
, which
provides a more compact layout for factor variables and a few other
tweaks. Try running the table without the style to see what it does.
We now turn our attention to tables presenting the results of one or
more estimation commands. We will use as an example simple linear
regression with the regress
command, but the same ideas
apply to other models. We could collect the results ourselves using
collect
as a prefix of the regress
command, or
even the command
option of table
, but
the
etable
command makes things easier.
If you type etable
after a regress
command
you get a table showing coefficients with standard errors in
parentheses, and the number of observations at the bottom. Let us add
just a couple of options.
. sysuse auto, clear
(1978 automobile data)

. quietly regress mpg i.foreign

. etable, showstars showstarsnote

──────────────────────────────────
                            mpg
──────────────────────────────────
Car origin
  Foreign                 4.946 **
                         (1.362)
Intercept                19.827 **
                         (0.743)
Number of observations       74
──────────────────────────────────
** p<.01, * p<.05
So foreign cars travel almost 5 more miles per gallon than domestic
cars. The option showstars
shows the usual significance
stars, and showstarsnote
adds an explanatory note. The
stars may be customized using the stars()
option; type
help etable##starspec
to see how.
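As a hedged illustration of customizing the stars, the sketch below asks for three significance levels attached to the coefficients. The cutoffs and labels are arbitrary choices of ours, and the exact form of the specification should be checked against the etable help file.

```stata
sysuse auto, clear
quietly regress mpg i.foreign

* Three significance levels instead of the default two (illustrative cutoffs)
etable, stars(.05 "*" .01 "**" .001 "***", attach(_r_b)) showstars showstarsnote
```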
To compare two or more regressions all we have to do is save the
results of each one using estimates store
(before they are
overwritten by the next regression) and then pass the list of stored
estimates to etable.
. gen gphm = 100/mpg

. quietly regress gphm i.foreign

. estimates store unadjusted

. quietly regress gphm i.foreign weight

. estimates store adjusted

. etable, estimates(unadjusted adjusted) column(estimates) ///
>     cstat(_r_b) cstat(_r_z, sformat((%s)))               ///
>     note(test statistic in parentheses) showstars showstarsnote

──────────────────────────────────────────────
                       unadjusted    adjusted
──────────────────────────────────────────────
Car origin
  Foreign                -1.005 **    0.622 **
                         (-3.29)      (3.11)
Weight (lbs.)                         0.002 **
                                     (13.74)
Intercept                 5.318 **   -0.073
                         (31.92)     (-0.18)
Number of observations       74          74
──────────────────────────────────────────────
** p<.01, * p<.05
test statistic in parentheses
Here we compare the efficiency of foreign and domestic cars before
and after adjusting for weight. Our measure of efficiency is gallons per
100 miles or gphm
rather than the usual mpg
,
because it has a more linear relationship with weight. To get the
defaults try etable, estimates(unadjusted adjusted)
. Here we
added a couple of options.
The option column(estimates)
specifies that we want the
columns to be labeled with the name of the estimates rather than the
name of the dependent variable, which is the default.
The cstat
option (short for coefficient statistics),
lets you select which statistics to display. Type
help etable##cstat
to see a complete list. Here we selected
the coefficient (_r_b
) and the test statistic
(_r_z
). To make sure the test statistic is in parentheses
we use the sformat
option of cstat
to specify
(%s)
, where %s
is a placeholder for the
string, just as we did earlier in Section 3.2.1. We also use the
note
option of etable
to indicate exactly
what’s shown.
There is also an mstat
option (short for model
statistics) that lets you select model statistics to display, such as
the number of cases, R-squared, Akaike’s information criterion, and
others. Type help etable##mstat
to see a list. Try adding
R-squared to the previous table.
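One possible solution is sketched below. It repeats the previous commands so it can be run on its own, and adds mstat(N) and mstat(r2), the same keywords used in the next example of this tutorial.

```stata
sysuse auto, clear
gen gphm = 100/mpg

quietly regress gphm i.foreign
estimates store unadjusted
quietly regress gphm i.foreign weight
estimates store adjusted

* Same table as before, now listing N and R-squared at the bottom
etable, estimates(unadjusted adjusted) column(estimates) ///
    cstat(_r_b) cstat(_r_z, sformat((%s)))               ///
    mstat(N) mstat(r2)                                   ///
    note(test statistic in parentheses) showstars showstarsnote
```

Once any mstat option is specified the default number of observations is no longer added automatically, which is why mstat(N) is listed explicitly.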
Our last example compared regressions with the same outcome and
different predictors. It is also possible to compare regressions with
different outcomes and the same predictors (or at least some overlap).
The table below compares regressions of weight
and
length
using four and three predictors, respectively, with
foreign cars as the reference cell for car origin:
. quietly regress weight ib1.foreign price rep78 headroom

. estimates store weight

. quietly regress length ib1.foreign price rep78

. estimates store length

. etable, estimates(weight length) eqrecode(weight=both length=both) ///
>     mstat(N) mstat(r2) showstars showstarsnote

────────────────────────────────────────────────
                          weight       length
────────────────────────────────────────────────
Car origin
  Domestic               893.057 **    29.353 **
                        (137.788)      (5.013)
Price                      0.140 **     0.003 **
                          (0.017)      (0.001)
Repair record 1978       -47.367       -0.211
                         (61.474)      (2.347)
Headroom (in.)           222.060 **
                         (61.361)
Intercept               1048.304 **   147.845 **
                        (320.826)     (11.229)
Number of observations        69           69
R-squared                   0.76         0.56
────────────────────────────────────────────────
** p<.01, * p<.05
The essential new option here is eqrecode()
which
ensures that coefficients for the same predictor with different outcomes
appear in the same row. Try running the command without this option to
see the default. This option is also essential if you run a multivariate
regression. At the bottom of the table we listed R-squared for each
regression, but you already knew how to do that, right? Did you notice
that to keep the number of observations you have to add
mstat(N)
?
The etable
command creates a collection called
ETable
which becomes the current collection and can then be
modified and/or exported. Type help etable
to learn
more.
Let us move now to an example where we will collect the results of
standard Stata commands ourselves. We want to calculate Tukey’s five
number summary, namely the minimum, first quartile, median, third
quartile and maximum. These statistics are all computed by
summarize
with the detail
option. We would
like to do this for several variables.
The collect
command can be used as a prefix to gather
the results stored by a general command in r()
or by an
estimation command in e()
. You can find out exactly what a
command has stored by typing return list
after a general
command such as summarize
, or typing
ereturn list
after an estimation command. But don’t worry,
collect
will gather everything. So here is our table:
. sysuse auto, clear
(1978 automobile data)

. collect clear

. quietly collect, tags(cmdset[mpg]): summarize mpg, detail

. quietly collect, tags(cmdset[length]): summarize length, detail

. quietly collect, tags(cmdset[weight]): summarize weight, detail

. collect style autolevels result min p25 p50 p75 max

. collect label levels result ///
>     min "Min" p25 "Q1" p50 "Md" p75 "Q3" max "Max", modify

. collect layout (cmdset) (result)

Collection: default
      Rows: cmdset
   Columns: result
   Table 1: 3 x 5

───────┬──────────────────────────
       │  Min   Q1    Md   Q3  Max
───────┼──────────────────────────
mpg    │   12   18    20   25   41
length │  142  170 192.5  204  233
weight │ 1760 2240  3190 3600 4840
───────┴──────────────────────────
This will require a bit of explanation. We start by clearing the
collection system with collect clear
.
We then collect the results of summarize mpg, detail
,
which will produce the statistics we need, using quietly
to
skip displaying them. We also ask the system to tag the results with the
name of the variable being summarized, which unfortunately is not stored
with the results. Fortunately Stata creates a dimension called
cmdset
for our commands, which are just numbered 1, 2, and
3. The tags
option creates a more informative tag, using
the name of the variable.
Next we define a style. As it happens, summarize, detail
produces 19 results and we don’t want them all, just the five-number
summary. The collect style autolevels result
command sets
the levels of result
to the five statistics we want.
(Alternatively, you can specify which results to collect; type
help collect get
to learn more.)
Stata generates labels for practically all the results stored by its
commands, for example the label for p25
is “25th
percentile”, and by default uses these on the tables. We would like to
use shorter labels, in this case “Q1”, hence the
collect label levels result
command.
The final step is to specify the layout of the table with
collect layout
, which says we want the cmdset
with the variable names in the rows, and the result
with
the five-number summaries in the columns. The row and column
specifications in collect layout
must be enclosed in
parentheses.
Rather than repeat essentially the same command three times, varying only the name of the variable, we could have used a loop, a concept discussed later in Section 5.2 of this tutorial. That would make it easy to include many more variables in our table.
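A minimal sketch of the loop version follows, using foreach over the same three variables. The loop variable name v is arbitrary; everything else is taken from the commands above.

```stata
sysuse auto, clear
collect clear

* Collect the detailed summary of each variable, tagged with its name
foreach v of varlist mpg length weight {
    quietly collect, tags(cmdset[`v']): summarize `v', detail
}

* Keep and relabel the five-number summary, then lay out the table
collect style autolevels result min p25 p50 p75 max
collect label levels result ///
    min "Min" p25 "Q1" p50 "Md" p75 "Q3" max "Max", modify
collect layout (cmdset) (result)
```

Adding more variables to the table is then just a matter of extending the variable list in the foreach line.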
It is possible to produce similar results using table
,
as all five summaries are in the list of statistics available, but the
idea here was to collect the results ourselves to give you a sense of
the power and flexibility of the collection system.
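For comparison, here is a sketch of how the table command might be used for the same purpose, with one statistic option per summary and the variables in the rows. The labels and spacing will differ from the collected version unless they are customized as before.

```stata
sysuse auto, clear

* Five-number summary with variables in the rows and statistics in the columns
table (var) (result),                   ///
    statistic(min    mpg length weight) ///
    statistic(q1     mpg length weight) ///
    statistic(median mpg length weight) ///
    statistic(q3     mpg length weight) ///
    statistic(max    mpg length weight)
```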
Consider the two-way table in Section 3.1.2, showing contraceptive
use by education. We would like to show just the row percents, as we
did, but add a column with the total number of observations in each row.
One way to do this is to get both the frequencies and percents, and then
decide exactly what we want to show and how. We will also modify the
header, and remove a vertical border. Try the following commands (you
may want to try the first two without quietly
to see what
happens at each step):
. use https://grodri.github.io/datasets/drsr03x, clear
(DRSR03 extract)

. quietly table educg cuse, stat(percent, across(cuse)) stat(frequency)

. quietly collect layout (educg) ///
>     (cuse#result[percent] cuse[.m]#result[frequency])

. collect style header result , level(hide)

. collect style cell border_block, border(right, pattern(nil))

. collect preview

──────────────────────────────────────────────────────────────────────
                                  Contraceptive use
                  Not using   Inefficient   Efficient      Total
──────────────────────────────────────────────────────────────────────
Education level
  0-2                 69.38          5.37       25.25   100.00     503
  3-4                 59.65          5.45       34.90   100.00     404
  5-7                 50.00          8.50       41.50   100.00     306
  8-18                31.84         14.43       53.73   100.00     201
  Total               57.07          7.36       35.57   100.00   1,414
──────────────────────────────────────────────────────────────────────
After using table
to tabulate the data, we use
collect layout
to specify rows with educg
and
columns with the percents for cuse
(using an interaction
between cuse
and result[percent]
) and the
frequency for the total (interacting cuse[.m]
with
result[frequency]
).
We have used dimensions informally to refer to the rows and
columns of a table, but the concept of dimension here is more
general, representing all features used to tag the elements of a
collection. Type collect dims
to list all dimensions of the
current collection. Type collect levelsof
dimname
to list the levels of a dimension, and
collect label list
dimname
to list
the labels of the levels. This is how I learned that
cuse[.m]
had the totals.
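The exploration just described can be sketched as follows, using the collection produced by the table command in this example. The output is not reproduced here.

```stata
use https://grodri.github.io/datasets/drsr03x, clear
quietly table educg cuse, stat(percent, across(cuse)) stat(frequency)

* List all dimensions of the current collection
collect dims

* List the levels of cuse and their labels; the totals are stored in level .m
collect levelsof cuse
collect label list cuse

* List the results that were collected
collect levelsof result
```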
Finally we use a couple of collect style
commands that
aim for a cleaner look; one to remove the labels of the levels of result
from the header, and another to omit the vertical border between the row
headers and the body of the table. This, by the way, uses yet another
dimension called border_block
, used to tag cells in the row
and column headers, the top-left corner, and the body of the table with
the items. Type collect levelsof border_block
to list the
level names.
This example has barely touched the surface of table customization.
To learn more type help collect
.
Tables are displayed on your screen but can also be exported in
various formats, including HTML, Word documents, Excel documents, LaTeX,
PDF, plain text, Markdown and even Stata’s own SMCL format. Type
help collect export
to learn more.
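As a brief sketch, the commands below rebuild the first regression table and export the current collection in three formats. The file name mytable is hypothetical; the file suffix determines the format.

```stata
sysuse auto, clear
quietly regress mpg i.foreign
quietly etable, showstars showstarsnote

* Export the current collection; the suffix selects the output format
collect export mytable.docx, replace
collect export mytable.html, replace
collect export mytable.tex, replace
```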