This is a collection of small datasets used in the course, classified by the type of statistical technique that may be used to analyze them. A couple of datasets appear in more than one category. The datasets are now available in Stata format as well as two plain text formats, as explained below. They can all be read from the datasets section of this website, as illustrated in the Stata and R logs.
All datasets are available as plain-text ASCII files, usually in two formats:
.dat has a header line with the variable names, and codes categorical variables using character strings. This version is best for users of S-Plus or R and can be read using read.table(). Some files do not have column names; in these cases use header=FALSE.
.raw omits the header line and codes all variables using numeric codes. This version is better for users of Stata or other packages that prefer numerical codes. (However, Stata can read the character version if you specify the string width using str.)
To download any of these files using your browser I recommend that you right-click and choose 'save as...'. If you left-click what happens next depends on how your browser is configured to handle these file types, and will often require an extra step.
The datasets are also available as Stata system files with extension .dta, and can be read directly from net-aware Stata versions 10 or higher via the use command. This is the easiest method for Stata users. You can also right click on the links to save a local copy. R users can read the Stata files using the read_dta() function in the haven package.
Here are the famous program effort data from Mauldin and Berelson. This extract consists of observations on an index of social setting, an index of family planning effort, and the percent decline in the crude birth rate (CBR) between 1965 and 1975, for 20 countries in Latin America.
                setting effort change
Bolivia              46      0      1
Brazil               74      0     10
Chile                89     16     29
Colombia             77     16     25
CostaRica            84     21     29
Cuba                 89     15     40
DominicanRep         68     14     21
Ecuador              70      6      0
ElSalvador           60     13     13
Guatemala            55      9      4
Haiti                35      3      0
Honduras             51      7      7
Jamaica              87     23     21
Mexico               83      4      9
Nicaragua            68      0      7
Panama               84     19     22
Paraguay             74      3      6
Peru                 73      0      2
TrinidadTobago       84     15     29
Venezuela            91      7     11
The data are available as plain text files effort.dat, which has a header line with the variable names, and effort.raw, which omits it; otherwise both files look like the listing above. The data are also available in Stata format as effort.dta.
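As a rough illustration of how the header line works in the "dat" layout, the following Python sketch parses the first few rows of the listing. It assumes whitespace-separated fields and mimics the read.table() convention that a header with one fewer field than the data rows means the first column holds row names. The function name `parse_dat` is hypothetical.

```python
# First rows of the effort listing, as they would appear in effort.dat
SAMPLE = """\
          setting effort change
Bolivia        46      0      1
Brazil         74      0     10
Chile          89     16     29
"""

def parse_dat(text):
    """Parse a whitespace-delimited table with a header line.

    If the header has one fewer field than the data rows, the first
    field of each row is treated as a row name (as R's read.table does).
    """
    lines = [ln.split() for ln in text.splitlines() if ln.strip()]
    header, rows = lines[0], lines[1:]
    has_row_names = len(rows[0]) == len(header) + 1
    table = {}
    for i, fields in enumerate(rows, start=1):
        if has_row_names:
            name, values = fields[0], fields[1:]
        else:
            name, values = str(i), fields  # fall back to a running index
        table[name] = dict(zip(header, (int(v) for v in values)))
    return table

effort = parse_dat(SAMPLE)
print(effort["Chile"])
```

The same function would handle a file without row names, in which case rows are keyed by their position.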
Reference: P.W. Mauldin and B. Berelson (1978). Conditions of fertility decline in developing countries, 1965-75. Studies in Family Planning, 9:89-147. JSTOR: http://www.jstor.org/stable/1965523.
These are the salary data used in Weisberg's book, consisting of observations on six variables for 52 tenure-track professors in a small college. The variables are:
The file is available in the usual plain text formats as salary.dat using character codes and salary.raw using numeric codes, and in Stata format as salary.dta. Here's an excerpt of the "dat" file:
sx      rk         yr  dg         yd  sl
male    full       25  doctorate  35  36350
male    full       13  doctorate  22  35350
male    full       10  doctorate  23  28200
female  full        7  doctorate  27  26775
male    full       19  masters    30  33696
male    full       16  doctorate  21  28516
...
female  assistant   1  doctorate   1  16686
female  assistant   1  doctorate   1  15000
female  assistant   0  doctorate   2  20300
Reference: S. Weisberg (1985). Applied Linear Regression, Second Edition. New York: John Wiley and Sons. Page 194.
These are data based on a 5% sample of all births occurring in Philadelphia in 1990. The sample has 1115 observations (after deleting 32 cases with incomplete information) on five variables:
The data are available in plain text format in the files phbirths.raw and phbirths.dat, and in Stata format as phbirths.dta.
The 'dat' file codes black and smoke using TRUE or FALSE, whereas the 'raw' file uses 1 and 0.
Reference: I. T. Elo, G. Rodríguez and H. Lee (2001). Racial and Neighborhood Disparities in Birthweight in Philadelphia. Paper presented at the Annual Meeting of the Population Association of America, Washington, DC 2001.
Here are the contraceptive use data from page 46 of the lecture notes (and from the Stata handout), showing the distribution of 1607 currently married and fecund women interviewed in the Fiji Fertility Survey, according to age, education, desire for more children and current use of contraception.
age    education  wantsMore  notUsing  using
<25    low        yes              53      6
<25    low        no               10      4
<25    high       yes             212     52
<25    high       no               50     10
25-29  low        yes              60     14
25-29  low        no               19     10
25-29  high       yes             155     54
25-29  high       no               65     27
30-39  low        yes             112     33
30-39  low        no               77     80
30-39  high       yes             118     46
30-39  high       no               68     78
40-49  low        yes              35      6
40-49  low        no               46     48
40-49  high       yes               8      8
40-49  high       no               12     31
The data are available in the format shown above as cuse.dat, and also as a Stata system file cusew.dta using numeric codes and labels for all variables. These files represent binomial data with 16 groups.
The dataset is also available in a long format simulating individual data and using weights to represent the frequencies.
Reference: Little, R. J. A. (1978). Generalized Linear Models for Cross-Classified Data from the WFS. World Fertility Survey Technical Bulletins, Number 5.
This is the alternative version of the contraceptive use data, showing the distribution of 1607 currently married and fecund women interviewed in the Fiji Fertility Survey, according to age, education, desire for more children and current use of contraception.
This version has 32 rows corresponding to all possible covariate and response patterns, and includes a weight indicating the frequency of each combination. The file has 5 columns with numeric codes:
The data in this alternative format are available in plain text as cuse.raw and in Stata format as cuse.dta. An excerpt of the "raw" file is shown below:
1 0 0 0  53
1 0 0 1   6
1 0 1 0  10
1 0 1 1   4
1 1 0 0 212
1 1 0 1  52
...
4 1 0 1   8
4 1 1 0  12
4 1 1 1  31
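The relationship between the grouped layout and this weighted layout can be sketched in Python. The snippet expands three of the 16 binomial groups into one row per response pattern, with the frequency as a weight. The numeric codes are inferred from the excerpt of the raw file (age group 1 for <25, education 1 for high, third column 1 when no more children are wanted, fourth column 1 for users) and should be read as assumptions, as should the function name `wide_to_long`.

```python
# Three of the 16 groups: (age, education, wantsMore, notUsing, using)
WIDE = [
    ("<25", "low", "yes", 53, 6),
    ("<25", "low", "no", 10, 4),
    ("<25", "high", "yes", 212, 52),
]

# Assumed numeric codes, inferred from the excerpt of the raw file
AGE = {"<25": 1, "25-29": 2, "30-39": 3, "40-49": 4}
EDUC = {"low": 0, "high": 1}
NOMORE = {"yes": 0, "no": 1}

def wide_to_long(groups):
    """Return two weighted rows (non-users, users) per binomial group."""
    rows = []
    for age, educ, wants, not_using, using in groups:
        codes = (AGE[age], EDUC[educ], NOMORE[wants])
        rows.append(codes + (0, not_using))  # not using contraception
        rows.append(codes + (1, using))      # using contraception
    return rows

for row in wide_to_long(WIDE):
    print(*row)
```

Applied to all 16 groups this would yield the 32 weighted rows, and the weights would still add up to the 1607 women in the sample.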
Reference: Little, R. J. A. (1978). Generalized Linear Models for Cross-Classified Data from the WFS. World Fertility Survey Technical Bulletins, Number 5.
These are the data from Fiji on children ever born, from page 84 of the lecture notes (and the Stata handout).
The dataset has 70 rows representing grouped individual data. Each row has entries for:
This file is available in the usual two formats: ceb.dat has a header and uses character labels for the factors, and ceb.raw uses numeric codes, as described above. Here's an excerpt of the dat file:
   dur    res    educ   mean    var    n        y
1  0-4    Suva   none   0.50   1.14    8     4.00
2  0-4    Suva   lower  1.14   0.73   21    23.94
3  0-4    Suva   upper  0.90   0.67   42    37.80
4  0-4    Suva   sec+   0.73   0.48   51    37.23
5  0-4    urban  none   1.17   1.06   12    14.04
6  0-4    urban  lower  0.85   1.59   27    22.95
...
69 25-29  rural  none   7.48  11.34  195  1458.60
70 25-29  rural  lower  7.81   7.57   59   460.79
71 25-29  rural  upper  5.80   7.07   10    58.00
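In the rows shown, the last column y matches the product of the mean and the group size n, rounded to two decimals, which suggests it holds the total number of children ever born in each cell. That reading is inferred from the excerpt, not from the file's documentation. A small Python check on a few rows:

```python
# (mean, n, y) taken from the excerpt of ceb.dat
ROWS = [
    (0.50, 8, 4.00),
    (1.14, 21, 23.94),
    (0.90, 42, 37.80),
    (7.48, 195, 1458.60),
    (7.81, 59, 460.79),
]

def cell_total(mean, n):
    """Total for a cell, reconstructed as mean times group size."""
    return round(mean * n, 2)

for mean, n, y in ROWS:
    # Allow a small tolerance for floating-point rounding
    assert abs(cell_total(mean, n) - y) < 0.005, (mean, n, y)
print("y equals mean * n in every sampled row")
```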
Reference: Little, R. J. A. (1978). Generalized Linear Models for Cross-Classified Data from the WFS. World Fertility Survey Technical Bulletins, Number 5.
This dataset has information from a Canadian study of mortality by age and smoking status.
The file in "raw" format, smoking.raw, has four columns:
The file is also available in "dat" format as smoking.dat, with variable names, row names and string labels for age and smoking status. An excerpt appears below:
   age    smoke           pop   dead
1  40-44  no              656     18
2  45-59  no              359     22
3  50-54  no              249     19
4  55-59  no              632     55
5  60-64  no             1067    117
6  65-69  no              897    170
....
32 60-64  cigarretteOnly 3791    778
33 65-69  cigarretteOnly 2421    689
34 70-74  cigarretteOnly 1195    432
35 75-79  cigarretteOnly  436    214
36 80+    cigarretteOnly  113     63
The dataset comes from Best, E.W.R. and Walker, C.B. (1964). A Canadian study of smoking and health. Canadian Journal of Public Health, 58,1. Also given in Mosteller, F. and Tukey, J.W. (1977) Data analysis and regression, Reading, MA:Addison-Wesley, Exhibit 1, 559. Thanks to Moritz Marback for providing the reference, and to Ingeborg Gullikstad Hem for pointing out that the number of deaths is over 6 years. The population counts and age are at the start of the follow-up period.
These are the data from McCullagh and Nelder. The file has 34 rows corresponding to the observed combinations of type of ship, year of construction and period of operation. Each row has information on five variables as follows:
Note that there are no ships of type E built in 1960-64, and that ships built in 1975-79 could not have operated in 1960-74. These combinations are omitted from the data file.
You can get the data in the usual versions: ships.dat has a header and codes the factors using strings, and ships.raw uses the numeric codes shown above. Here's an excerpt of the dat file:
   type  construction  operation  months  damage
1  A     1960-64       1960-74       127       0
2  A     1960-64       1975-79        63       0
3  A     1965-69       1960-74      1095       3
4  A     1965-69       1975-79      1095       4
5  A     1970-74       1960-74      1512       6
6  A     1970-74       1975-79      3353      18
...
32 E     1970-74       1960-74      1157       5
33 E     1970-74       1975-79      2161      12
34 E     1975-79       1975-79       542       1
Reference: McCullagh, P. and Nelder, J. (1989) Generalized Linear Models, 2nd Edition. Chapman and Hall, London. Page 204.
These are the data from Wilner, Walkley and Cook on the effect of racial attitudes on segregation and integration of public housing. The data can be viewed as a 2x2x2x2 contingency table:
                                     Sentiment
Proximity  Contact     Norms         fav  unfav
close      frequent    favorable      77     32
                       unfavorable    30     36
           infrequent  favorable      14     19
                       unfavorable    15     27
distant    frequent    favorable      43     20
                       unfavorable    36     37
           infrequent  favorable      27     36
                       unfavorable    41    118
You can get a file in the usual character and numeric formats from housing.dat or housing.raw, respectively, and in Stata format from housing.dta.
The "raw" data file codes the factor levels in order of appearance as follows:
For regression analysis it would have been better to code these variables using 1 and 0 instead of 1 and 2, and rename them to something like proximClose, contactFreq, and normsFav. I haven't done this because it might break existing code, but the new variables can easily be added.
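The recoding mentioned above is a one-line transformation, since a 1/2 code x becomes a 1/0 indicator as 2 - x. A hedged Python sketch follows. The dictionary layout and the sample rows are illustrative assumptions, not the actual layout of housing.raw; the new names are the ones suggested in the text.

```python
# Illustrative rows using 1/2 codes in order of appearance
# (1 = close, frequent, favorable; 2 = distant, infrequent, unfavorable).
rows = [
    {"proximity": 1, "contact": 1, "norms": 1},
    {"proximity": 1, "contact": 2, "norms": 2},
    {"proximity": 2, "contact": 1, "norms": 2},
]

# Mapping from each existing variable to the suggested indicator name
NEW_NAMES = {
    "proximity": "proximClose",
    "contact": "contactFreq",
    "norms": "normsFav",
}

def add_indicators(row):
    """Return a copy of row with 1/0 indicators added, originals kept."""
    out = dict(row)
    for old, new in NEW_NAMES.items():
        out[new] = 2 - row[old]  # 1 -> 1, 2 -> 0
    return out

recoded = [add_indicators(r) for r in rows]
print(recoded[1])
```

Keeping the original columns alongside the new indicators means code written against the 1/2 codes continues to work.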
Reference: Wilner, D., Walkley, R.R. and Cook, S.W. (1955). Human relations in interracial housing: A study of the contact hypothesis. University of Minnesota Press.
These are the Madsen data used in the revised lecture notes. This is a four-way table classifying 1681 residents of twelve areas in Copenhagen in terms of:
The data file contains 72 rows, one for each combination of values of the four variables, and has six columns: a row number, the four variables, and the number of cases in the category. The file is available in the usual character and numeric formats, copen.dat or copen.raw, respectively, and in Stata format as copen.dta. Here's an excerpt of the "dat" file:
   housing   influence  contact  satisfaction   n
1  tower     low        low      low           21
2  tower     low        low      medium        21
3  tower     low        low      high          28
4  tower     low        high     low           14
5  tower     low        high     medium        19
6  tower     low        high     high          37
...
70 terraced  high       high     low            5
71 terraced  high       high     medium         6
72 terraced  high       high     high          13
Reference: Madsen, M. (1976). Statistical Analysis of Multiple Contingency Tables. Two Examples. Scand. J. Statist. 3:97-106. JSTOR: http://www.jstor.org/stable/4615621
These are the data from Bishop, Fienberg and Holland on the three-year survival status of breast-cancer patients by age and malignancy of tumor:
                       survive?
  age      malignant   yes   no
1 under50  no           77   10
2 under50  yes          51   13
3 50-69    no           51   11
4 50-69    yes          38   20
5 70+      no            7    3
6 70+      yes           6    3
You can get a file in the usual character and numeric formats from cancer.dat or cancer.raw, and in Stata format from cancer.dta.
Reference: Bishop, Y. M. M.; Fienberg, S. E. and Holland, P. W. (1975). Discrete Multivariate Analysis: Theory and Practice. MIT Press, Cambridge.
The method choice data from Brazil are available in a file containing three columns:
As usual, the file is available in two formats: brazil.dat codes the factors using character labels, and brazil.raw uses numeric codes (the age groups are coded 1-6 and the methods are coded 1=not_using, 2=inefficient, 3=efficient, 4=sterilization). Here's an excerpt of the dat file:
15-19  sterilization    2
15-19  efficient       75
15-19  inefficient      6
15-19  not_using       90
20-24  sterilization   32
20-24  efficient      223
...
40-44  efficient       71
40-44  inefficient     69
40-44  not_using       17
You can read the file with character labels (brazil.dat) into Stata using the command
infile str6 age str14 method freq ///
    using brazil.dat
but of course we now provide a Stata file as brazil.dta.
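For readers working outside Stata, the mapping between the two plain-text versions can be sketched in Python. The method codes are those listed above for brazil.raw. Numbering the age groups by order of appearance is an assumption, and only rows from the excerpt are used.

```python
# Rows from the excerpt of brazil.dat: age group, method, frequency
LINES = """\
15-19 sterilization 2
15-19 efficient 75
15-19 inefficient 6
15-19 not_using 90
20-24 sterilization 32
20-24 efficient 223
""".splitlines()

# Method codes as documented for brazil.raw
METHOD = {"not_using": 1, "inefficient": 2, "efficient": 3, "sterilization": 4}

def to_numeric(lines):
    """Convert character-coded rows to (age code, method code, freq)."""
    age_codes = {}  # assumed: age groups numbered in order of appearance
    rows = []
    for line in lines:
        age, method, freq = line.split()
        code = age_codes.setdefault(age, len(age_codes) + 1)
        rows.append((code, METHOD[method], int(freq)))
    return rows

for row in to_numeric(LINES):
    print(*row)
```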
This dataset comes from the Guatemalan Survey of Family Health, a survey of rural women that contains detailed data on care received during pregnancy and delivery along with extensive background information.
We have tabulated data on 3334 pregnancies. The outcome is the type of provider seen during pregnancy and there are three predictors. The raw data file has five columns, as follows:
The data are available using numeric codes as healthCare.raw and using string codes as well as row and column labels as healthCare.dat. Here are a few lines from the latter:
   eth       migr  avail  provider     n
1  indNoSpa  no    no     none         7
2  indNoSpa  no    no     midwife     93
3  indNoSpa  no    no     healthPost   6
...
34 ladino    yes   yes    doctor      83
Reference: Glei, D. A. and Goldman, N. (2000), Understanding Ethnic Variation in Pregnancy-related Care in Rural Guatemala, Ethnicity and Health, 5:5-22.
The Social Mobility Data are available in a file containing five columns:
The file is available as mobility.dat, and also in Stata format. Here's an excerpt of the dat file:
   fatherOccup   sonOccup      black  nonintact     n
1  farm          farm          no     no          592
2  farm          farm          no     yes          55
3  farm          farm          yes    no           41
4  farm          farm          yes    yes          15
5  farm          unskilled     no     no         1005
6  farm          unskilled     no     yes         134
...
61 professional  professional  no     yes         317
62 professional  professional  yes    no           52
63 professional  professional  yes    yes          19
This is a simplified version of a dataset from StatLib which may be found at http://lib.stat.cmu.edu/datasets/socmob. I rounded the counts for son's current occupation to the nearest integer, and grouped both father's and son's occupation into just four categories, treating 1-2 as farm, 3-6 as unskilled, 7-11 as skilled and 12-17 as professional/managerial.
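The grouping rule can be written as a small lookup. The Python sketch below maps the 17 original occupation codes to the four categories just described. The function name is hypothetical, and the label `professional` follows the excerpt of the data file.

```python
# Upper bound of each range of original codes and the grouped label
GROUPS = [
    (2, "farm"),           # codes 1-2
    (6, "unskilled"),      # codes 3-6
    (11, "skilled"),       # codes 7-11
    (17, "professional"),  # codes 12-17 (professional/managerial)
]

def group_occupation(code):
    """Map an original occupation code (1-17) to one of four groups."""
    if not 1 <= code <= 17:
        raise ValueError(f"occupation code out of range: {code}")
    for upper, label in GROUPS:
        if code <= upper:
            return label

print([group_occupation(c) for c in (1, 3, 7, 12, 17)])
```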
If you use the data in a publication please acknowledge Statlib and the original authors, David L. Featherman and Robert M. Hauser (1978). Opportunity and Change. New York: Academic Press. The data were also analyzed by Timothy J. Biblarz and Adrian E. Raftery (1993). "The Effects of Family Disruption on Social Mobility", American Sociological Review, 58(1):97-109.
The Time to Ph.D. data are available in a file containing five columns:
The file has 73 rows and is called phd.dat. A brief excerpt is shown below:
 1 1 1  31 7422
 2 1 1 177 7166
 3 1 1 393 6759
 4 1 1 484 6138
 5 1 1 500 5506
 6 1 1 399 4824
...
 6 3 2   8   85
 7 3 2   2   72
12 3 2   2   37
Reference: Espenshade, T.J. and Rodríguez, G. (1997). Completing the Ph.D.: Comparative Performances of U.S. and Foreign Students. Social Science Quarterly, 78:593-605.
The data show the length of remission in weeks for two groups of leukemia patients, treated and control, and were analyzed by Cox in his original proportional hazards paper. The data are available in a file containing three columns:
Thus, the third and fourth observations, 6 and 6+, corresponding to a death and a censored observation at six weeks, are coded 6, 1 and 6, 0, respectively.
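This coding is easy to reproduce. The Python sketch below converts remission times written in the usual notation, where a trailing + marks a censored time, into (time, failure) pairs. The input holds only the first six treated observations, and the function name is hypothetical.

```python
# First six treated observations; "+" marks a censored remission time
TIMES = ["6", "6", "6", "6+", "7", "9+"]

def code_observation(value):
    """Return (time, failure): failure is 1 for an event, 0 if censored."""
    censored = value.endswith("+")
    time = int(value.rstrip("+"))
    return time, 0 if censored else 1

coded = [code_observation(v) for v in TIMES]
print(coded[2], coded[3])
```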
The data are available in the usual two plain-text formats in gehan.dat and gehan.raw (group codes are 1=control, 2=treated), and as a Stata file in gehan.dta. Here's an excerpt of the dat file:
   treatment  time  failure
1  treated       6  TRUE
2  treated       6  TRUE
3  treated       6  TRUE
4  treated       6  FALSE
5  treated       7  TRUE
6  treated       9  FALSE
...
40 control      17  TRUE
41 control      22  TRUE
42 control      23  TRUE
These data actually come from a matched-pairs design, where patients were paired according to remission status (partial or complete) and then randomly assigned to the treated or control group, but most analyses have ignored this fact. See Andersen et al (1993), pages 22-23, which has references to several papers using this dataset.
Reference: Andersen, P. K.; Borgan, O.; Gill, R. D. and Keiding, N. (1993). Statistical Models Based on Counting Processes, Springer-Verlag, New York.
These are Somoza's data on infant and child survival in Colombia, used in the notes (Table 3). The dataset comes from the World Fertility Survey, which was fielded in Colombia in 1976. Women in the reproductive ages were asked about their children and these were tabulated by sex, year of birth (cohort), survival status and age at death or at interview.
The file has 48 lines, corresponding to the 48 combinations of sex, cohort and age, and six columns:
The data are available in plain text format as somoza.dat, which uses character labels for sex, cohort and age, and somoza.raw, which uses numeric codes for all variables, and in Stata format as somoza.dta. A brief excerpt of the dat file is shown below.
sex     cohort   age        dead  alive
Male    1941-59  0-1/12       99      0
Male    1941-59  1/12-3/12    35      0
...
Female  1968-76  10+           0      0
In order to analyze these data using piece-wise exponential models you first have to calculate events and exposure by sex, cohort and age. The details of this calculation are shown in our Stata logs. The final step of that process, a file with events and exposure by cohort and age (collapsing over sex) is available in Stata format as somoza2.dta.
Reference: Somoza, J. (1980). Illustrative Analysis: Infant and Child Mortality in Colombia. World Fertility Survey Scientific Reports, Number 10.
This dataset, adapted from an example in the software package aML, is based on a longitudinal survey conducted in the U.S.
The unit of observation is the couple and the event of interest is divorce, with interview and widowhood treated as censoring events. We have three fixed covariates: education of the husband and two indicators of the couple's ethnicity: whether the husband is black and whether the couple is mixed. The variables are:
The dataset has 3771 couples and is available in "raw" format as divorce.raw and in "dat" format as divorce.dat, see excerpt below. The file is also available in Stata format as divorce.dta.
id     heduc        heblack  mixed  years   div
9      12-15 years  No       No     10.546  No
11     < 12 years   No       No     34.943  No
13     < 12 years   No       No     2.834   Yes
15     < 12 years   No       No     17.532  Yes
33     12-15 years  No       No     1.418   No
36     < 12 years   No       No     48.033  No
...
17294  12-15 years  Yes      No     7.269   No
17302  12-15 years  No       Yes    18.73   No
Reference: Lillard and Panis (2000), aML Multilevel Multiprocess Statistical Software, Release 1.0, EconWare, LA, California.