Germán Rodríguez
Generalized Linear Models Princeton University

Datasets

This is a collection of small datasets used in the course, classified by the type of statistical technique that may be used to analyze them. A couple of datasets appear in more than one category. The datasets are now available in Stata format as well as two plain text formats, as explained below. They can all be read from the datasets section of this website, as illustrated in the Stata and R logs.

Data Formats

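All datasets are available as plain-text ASCII files, usually in two formats: a "dat" file, which has a header line with the variable names and typically uses character labels for factors, and a "raw" file, which omits the header and typically uses numeric codes.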

To download any of these files using your browser I recommend that you right-click and choose 'save as...'. If you left-click what happens next depends on how your browser is configured to handle these file types, and will often require an extra step.

The datasets are also available as Stata system files with extension .dta, and can be read directly from net-aware Stata versions 10 or higher via the use command. This is the easiest method for Stata users. You can also right-click on the links to save a local copy. R users can read the Stata files using the read_dta() function in the haven package.

The Program Effort Data

Here are the famous program effort data from Mauldin and Berelson. This extract consists of observations on an index of social setting, an index of family planning effort, and the percent decline in the crude birth rate (CBR) between 1965 and 1975, for 20 countries in Latin America.

                 setting  effort   change
   Bolivia            46       0        1
   Brazil             74       0       10
   Chile              89      16       29
   Colombia           77      16       25
   CostaRica          84      21       29
   Cuba               89      15       40
   DominicanRep       68      14       21
   Ecuador            70       6        0
   ElSalvador         60      13       13
   Guatemala          55       9        4
   Haiti              35       3        0
   Honduras           51       7        7
   Jamaica            87      23       21
   Mexico             83       4        9
   Nicaragua          68       0        7
   Panama             84      19       22
   Paraguay           74       3        6
   Peru               73       0        2
   TrinidadTobago     84      15       29
   Venezuela          91       7       11

The data are available as plain text files effort.dat, which has a header line with the variable names, and effort.raw, which omits it; otherwise both files look like the listing above. The data are also available in Stata format as effort.dta.
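
For readers working outside Stata and R, here is a minimal Python sketch of parsing the "dat" layout. It assumes the file is whitespace-delimited, that the header names only the three numeric variables, and that each data line starts with the country label, as in the listing above. The inline sample stands in for a local copy of effort.dat, and the function name parse_dat is purely illustrative.

```python
# A few lines in the layout of the listing above (an inline stand-in for
# a local copy of effort.dat; no file or URL is read here).
sample = """\
                 setting  effort   change
   Bolivia            46       0        1
   Brazil             74       0       10
   Chile              89      16       29
"""

def parse_dat(text):
    """Return one dict per country, keyed by the header's variable names."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    names = lines[0].split()              # setting, effort, change
    rows = []
    for ln in lines[1:]:
        fields = ln.split()
        row = {"country": fields[0]}      # first field is the row label
        row.update({name: int(v) for name, v in zip(names, fields[1:])})
        rows.append(row)
    return rows

rows = parse_dat(sample)
print(rows[2])
```

The "raw" file would need the variable names supplied by hand, since it omits the header line.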

Reference: P.W. Mauldin and B. Berelson (1978). Conditions of fertility decline in developing countries, 1965-75. Studies in Family Planning, 9:89-147. JSTOR: http://www.jstor.org/stable/1965523.

Discrimination in Salaries

These are the salary data used in Weisberg's book, consisting of observations on six variables for 52 tenure-track professors in a small college. The variables are:

The file is available in the usual plain text formats as salary.dat using character codes and salary.raw using numeric codes, and in Stata format as salary.dta. Here's an excerpt of the "dat" file:

    sx        rk yr        dg yd    sl
   male      full 25 doctorate 35 36350
   male      full 13 doctorate 22 35350
   male      full 10 doctorate 23 28200
 female      full  7 doctorate 27 26775
   male      full 19   masters 30 33696
   male      full 16 doctorate 21 28516
  ...
 female assistant  1 doctorate  1 16686
 female assistant  1 doctorate  1 15000
 female assistant  0 doctorate  2 20300

Reference: S. Weisberg (1985). Applied Linear Regression, Second Edition. New York: John Wiley and Sons. Page 194.

Births in Philadelphia

These are data based on a 5% sample of all births occurring in Philadelphia in 1990. The sample has 1115 observations (after deleting 32 cases with incomplete information) on five variables:

The data are available in plain text format in the files phbirths.raw and phbirths.dat, and in Stata format as phbirths.dta.

The 'dat' file codes black and smoke using TRUE or FALSE, whereas the 'raw' file uses 1 and 0.

Reference: I. T. Elo, G. Rodríguez and H. Lee (2001). Racial and Neighborhood Disparities in Birthweight in Philadelphia. Paper presented at the Annual Meeting of the Population Association of America, Washington, DC 2001.

The Contraceptive Use Data (W)

Here are the contraceptive use data from page 46 of the lecture notes (and from the Stata handout), showing the distribution of 1607 currently married and fecund women interviewed in the Fiji Fertility Survey, according to age, education, desire for more children and current use of contraception.

    age education wantsMore notUsing using 
    <25       low       yes       53     6
    <25       low        no       10     4
    <25      high       yes      212    52
    <25      high        no       50    10
  25-29       low       yes       60    14
  25-29       low        no       19    10
  25-29      high       yes      155    54
  25-29      high        no       65    27
  30-39       low       yes      112    33
  30-39       low        no       77    80
  30-39      high       yes      118    46
  30-39      high        no       68    78
  40-49       low       yes       35     6
  40-49       low        no       46    48
  40-49      high       yes        8     8
  40-49      high        no       12    31

The data are available in the format shown above as cuse.dat, and also as a Stata system file cusew.dta using numeric codes and labels for all variables. These files represent binomial data with 16 groups.
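
As a quick consistency check, the 16 binomial groups can be tallied in a few lines. This is a Python sketch with the counts typed in from the table above rather than read from cuse.dat; the variable names are illustrative only.

```python
# (age, education, wantsMore, notUsing, using), copied from the table above
cuse = [
    ("<25",   "low",  "yes",  53,  6),
    ("<25",   "low",  "no",   10,  4),
    ("<25",   "high", "yes", 212, 52),
    ("<25",   "high", "no",   50, 10),
    ("25-29", "low",  "yes",  60, 14),
    ("25-29", "low",  "no",   19, 10),
    ("25-29", "high", "yes", 155, 54),
    ("25-29", "high", "no",   65, 27),
    ("30-39", "low",  "yes", 112, 33),
    ("30-39", "low",  "no",   77, 80),
    ("30-39", "high", "yes", 118, 46),
    ("30-39", "high", "no",   68, 78),
    ("40-49", "low",  "yes",  35,  6),
    ("40-49", "low",  "no",   46, 48),
    ("40-49", "high", "yes",   8,  8),
    ("40-49", "high", "no",   12, 31),
]

# Total number of women: the binomial denominators summed over groups
total = sum(not_using + using for _, _, _, not_using, using in cuse)

# Observed proportion using contraception in each group
prop_using = {
    (age, educ, wants): using / (not_using + using)
    for age, educ, wants, not_using, using in cuse
}

print(len(cuse), total)
```

The total should agree with the 1607 women mentioned in the description.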

The dataset is also available in a long format simulating individual data and using weights to represent the frequencies.

Reference: Little, R. J. A. (1978). Generalized Linear Models for Cross-Classified Data from the WFS. World Fertility Survey Technical Bulletins, Number 5.

The Contraceptive Use Data (L)

This is the alternative version of the contraceptive use data, showing the distribution of 1607 currently married and fecund women interviewed in the Fiji Fertility Survey, according to age, education, desire for more children and current use of contraception.

This version has 32 rows corresponding to all possible covariate and response patterns, and includes a weight indicating the frequency of each combination. The file has 5 columns with numeric codes:

The data in this alternative format are available in plain text as cuse.raw and in Stata format as cuse.dta. An excerpt of the "raw" file is shown below:

     1         0         0         0        53
     1         0         0         1         6
     1         0         1         0        10
     1         0         1         1         4
     1         1         0         0       212
     1         1         0         1        52
    ...
     4         1         0         1         8
     4         1         1         0        12
     4         1         1         1        31
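
The relationship between the two versions can be sketched in Python. Matching this excerpt against the 16-group table suggests the columns are age group (1-4), education (0=low, 1=high), an indicator that the woman wants no more children, an indicator of current use, and the weight. That reading is inferred from the numbers, so treat the code tables below as assumptions.

```python
# First three groups of the wide (binomial) version:
# age, education, wantsMore, notUsing, using
wide = [
    ("<25", "low",  "yes",  53,  6),
    ("<25", "low",  "no",   10,  4),
    ("<25", "high", "yes", 212, 52),
]

# Numeric codes inferred by comparing the two listings (assumptions)
AGE = {"<25": 1, "25-29": 2, "30-39": 3, "40-49": 4}
EDUC = {"low": 0, "high": 1}
NO_MORE = {"yes": 0, "no": 1}   # 1 when the woman wants no more children

def to_long(rows):
    """Split each binomial group into two weighted rows (not using, using)."""
    out = []
    for age, educ, wants, not_using, using in rows:
        key = (AGE[age], EDUC[educ], NO_MORE[wants])
        out.append(key + (0, not_using))   # response 0, weight = non-users
        out.append(key + (1, using))       # response 1, weight = users
    return out

long_rows = to_long(wide)
for row in long_rows:
    print(*row)
```

Applied to all 16 groups, the same function would yield the 32 rows of the long format.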

Reference: Little, R. J. A. (1978). Generalized Linear Models for Cross-Classified Data from the WFS. World Fertility Survey Technical Bulletins, Number 5.

The Children Ever Born Data

These are the data from Fiji on children ever born, from page 84 of the lecture notes (and the Stata handout).

The dataset has 70 rows representing grouped individual data. Each row has entries for:

This file is available in the usual two formats: ceb.dat has a header and uses character labels for the factors, and ceb.raw uses numeric codes, as described above. Here's an excerpt of the dat file:

    dur   res  educ mean   var   n       y 
 1   0-4  Suva  none 0.50  1.14   8    4.00
 2   0-4  Suva lower 1.14  0.73  21   23.94
 3   0-4  Suva upper 0.90  0.67  42   37.80
 4   0-4  Suva  sec+ 0.73  0.48  51   37.23
 5   0-4 urban  none 1.17  1.06  12   14.04
 6   0-4 urban lower 0.85  1.59  27   22.95
    ...
69 25-29 rural  none 7.48 11.34 195 1458.60
70 25-29 rural lower 7.81  7.57  59  460.79
71 25-29 rural upper 5.80  7.07  10   58.00
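
In the rows shown, y appears to be the product of the group mean and the group size n, that is, the total number of children ever born in the cell. This is an inference from the listed numbers, and the short Python check below verifies it only for the excerpt.

```python
# (mean, n, y) for the nine rows shown in the excerpt above
excerpt = [
    (0.50,   8,    4.00),
    (1.14,  21,   23.94),
    (0.90,  42,   37.80),
    (0.73,  51,   37.23),
    (1.17,  12,   14.04),
    (0.85,  27,   22.95),
    (7.48, 195, 1458.60),
    (7.81,  59,  460.79),
    (5.80,  10,   58.00),
]

# Absolute gap between mean * n and the listed y, row by row
gaps = [abs(mean * n - y) for mean, n, y in excerpt]

# True if every row matches to two decimal places
print(max(gaps) < 0.005)
```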

Reference: Little, R. J. A. (1978). Generalized Linear Models for Cross-Classified Data from the WFS. World Fertility Survey Technical Bulletins, Number 5.

Smoking and Lung Cancer

This dataset has information from a Canadian study of mortality by age and smoking status.

The file in "raw" format, smoking.raw, has four columns:

The file is also available in "dat" format as smoking.dat, with variable names, row names and string labels for age and smoking status. An excerpt appears below:

     age         smoke   pop dead
1  40-44            no   656   18
2  45-59            no   359   22
3  50-54            no   249   19
4  55-59            no   632   55
5  60-64            no  1067  117
6  65-69            no   897  170
....
32 60-64 cigarretteOnly 3791  778
33 65-69 cigarretteOnly 2421  689
34 70-74 cigarretteOnly 1195  432
35 75-79 cigarretteOnly  436  214
36   80+ cigarretteOnly  113   63

The dataset comes from Best, E.W.R. and Walker, C.B. (1964). A Canadian study of smoking and health. Canadian Journal of Public Health, 58,1. Also given in Mosteller, F. and Tukey, J.W. (1977). Data analysis and regression. Reading, MA: Addison-Wesley, Exhibit 1, 559. Thanks to Moritz Marback for providing the reference, and to Ingeborg Gullikstad Hem for pointing out that the number of deaths is over 6 years. The population counts and age are at the start of the follow-up period.

The Ship Damage Data

These are the data from McCullagh and Nelder. The file has 34 rows corresponding to the observed combinations of type of ship, year of construction and period of operation. Each row has information on five variables as follows:

Note that there are no ships of type E built in 1960-64, and that ships built in 1975-79 could not have operated in 1960-74. These combinations are omitted from the data file.

You can get the data in the usual versions: ships.dat has a header and codes the factors using strings, and ships.raw uses the numeric codes shown above. Here's an excerpt of the dat file:

   type construction operation months damage
1     A      1960-64   1960-74    127      0
2     A      1960-64   1975-79     63      0
3     A      1965-69   1960-74   1095      3
4     A      1965-69   1975-79   1095      4
5     A      1970-74   1960-74   1512      6
6     A      1970-74   1975-79   3353     18
    ...
32    E      1970-74   1960-74   1157      5
33    E      1970-74   1975-79   2161     12
34    E      1975-79   1975-79    542      1

Reference: McCullagh, P. and Nelder, J. (1989) Generalized Linear Models, 2nd Edition. Chapman and Hall, London. Page 204.

The Housing Data

These are the data from Wilner, Walkley and Cook on the effect of racial attitudes on segregation and integration of public housing. The data can be viewed as a 2x2x2x2 contingency table:

                                     Sentiment
Proximity  Contact     Norms         fav unfav
close      frequent    favorable     77    32
                       unfavorable   30    36
           infrequent  favorable     14    19
                       unfavorable   15    27
distant    frequent    favorable     43    20
                       unfavorable   36    37
           infrequent  favorable     27    36
                       unfavorable   41   118

You can get a file in the usual character and numeric formats from housing.dat or housing.raw, respectively, and in Stata format from housing.dta.

The "raw" data file codes the factor levels in order of appearance as follows:

For regression analysis it would have been better to code these variables using 1 and 0 instead of 1 and 2, and rename them to something like proximClose, contactFreq, and normsFav. I haven't done this because it might break existing code, but the new variables can easily be added.
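
A minimal Python sketch of that recoding follows. It assumes, per the "order of appearance" rule, that code 1 means close, frequent and favorable respectively and code 2 means the other level. The input keys proximity, contact and norms are hypothetical names for the raw columns.

```python
def to_indicator(code):
    """Map the 1/2 coding to a 1/0 indicator: 1 -> 1, 2 -> 0."""
    if code not in (1, 2):
        raise ValueError(f"expected 1 or 2, got {code!r}")
    return 2 - code

def add_indicators(row):
    """Return a copy of the row with the three suggested 0/1 variables added."""
    out = dict(row)
    out["proximClose"] = to_indicator(row["proximity"])
    out["contactFreq"] = to_indicator(row["contact"])
    out["normsFav"] = to_indicator(row["norms"])
    return out

# Hypothetical raw row: close proximity, infrequent contact, favorable norms
example = add_indicators({"proximity": 1, "contact": 2, "norms": 1})
print(example["proximClose"], example["contactFreq"], example["normsFav"])
```

Adding new variables rather than overwriting the old ones keeps existing code that relies on the 1 and 2 coding working.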

Reference: Wilner, D., Walkley, R.R. and Cook, S.W. (1955). Human relations in interracial housing: A study of the contact hypothesis. University of Minnesota Press.

Housing Conditions in Copenhagen

These are the Madsen data used in the revised lecture notes. This is a four-way table classifying 1681 residents of twelve areas in Copenhagen in terms of:

The data file contains 72 rows, one for each combination of values of the four variables, and has six columns: a row number, the four variables, and the number of cases in the category. The file is available in the usual character and numeric formats: copen.dat or copen.raw, respectively, and in Stata format as copen.dta. Here's an excerpt of the "dat" file:

    housing influence contact satisfaction  n 
 1    tower       low     low          low 21
 2    tower       low     low       medium 21
 3    tower       low     low         high 28
 4    tower       low    high          low 14
 5    tower       low    high       medium 19
 6    tower       low    high         high 37
    ...
70 terraced      high    high          low  5
71 terraced      high    high       medium  6
72 terraced      high    high         high 13

Reference: Madsen, M. (1976). Statistical Analysis of Multiple Contingency Tables. Two Examples. Scand. J. Statist. 3:97-106. JSTOR: http://www.jstor.org/stable/4615621

The Cancer Data

These are the data from Bishop, Fienberg and Holland on the three-year survival status of breast-cancer patients by age and malignancy of tumor:

                    survive?
      age malignant yes no 
1 under50        no  77 10
2 under50       yes  51 13
3   50-69        no  51 11
4   50-69       yes  38 20
5     70+        no   7  3
6     70+       yes   6  3

You can get a file in the usual character and numeric formats from cancer.dat or cancer.raw, and in Stata format from cancer.dta.

Reference: Bishop, Y. M. M.; Fienberg, S. E. and Holland, P. W. (1975). Discrete Multivariate Analysis: Theory and Practice. MIT Press, Cambridge.

The Method Choice Data

The method choice data from Brazil are available in a file containing three columns:

As usual, the file is available in two formats: brazil.dat codes the factors using character labels, and brazil.raw uses numeric codes (the age groups are coded 1-6 and the methods are coded 1=not_using, 2=inefficient, 3=efficient, 4=sterilization). Here's an excerpt of the dat file:

15-19 sterilization     2
15-19     efficient    75
15-19   inefficient     6
15-19     not_using    90
20-24 sterilization    32
20-24     efficient   223
    ...
40-44     efficient    71
40-44   inefficient    69
40-44     not_using    17

You can read the file with character labels (brazil.dat) into Stata using the command

infile str6 age str14 method freq ///
  using brazil.dat

but of course we now provide a Stata file as brazil.dta.

Health Care Utilization in Guatemala

This dataset comes from the Guatemalan Survey of Family Health, a survey of rural women that contains detailed data on care received during pregnancy and delivery along with extensive background information.

We have tabulated data on 3334 pregnancies. The outcome is the type of provider seen during pregnancy and there are three predictors. The raw data file has five columns, as follows:

The data are available using numeric codes as healthCare.raw and using string codes as well as row and column labels as healthCare.dat. Here are a few lines from the latter:

          eth  migr avail    provider    n 
 1   indNoSpa    no    no        none    7  
 2   indNoSpa    no    no     midwife   93  
 3   indNoSpa    no    no  healthPost    6  
    ...
34     ladino   yes   yes      doctor   83

Reference: Glei, D. A. and Goldman, N. (2000), Understanding Ethnic Variation in Pregnancy-related Care in Rural Guatemala, Ethnicity and Health, 5:5-22.

The Social Mobility Data

The Social Mobility Data are available in a file containing five columns:

The file is available as mobility.dat, and also in Stata format. Here's an excerpt of the dat file:

     fatherOccup     sonOccup black nonintact   n
  1         farm         farm    no        no  592  
  2         farm         farm    no       yes   55  
  3         farm         farm   yes        no   41  
  4         farm         farm   yes       yes   15  
  5         farm    unskilled    no        no 1005  
  6         farm    unskilled    no       yes  134
    ...
 61 professional professional    no       yes  317  
 62 professional professional   yes        no   52  
 63 professional professional   yes       yes   19

This is a simplified version of a dataset from StatLib which may be found at http://lib.stat.cmu.edu/datasets/socmob. I rounded the counts for son's current occupation to the nearest integer, and grouped both father's and son's occupation into just four categories, treating 1-2 as farm, 3-6 as unskilled, 7-11 as skilled and 12-17 as professional/managerial.
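
The grouping of the 17 original occupation codes can be written as a small lookup. This Python sketch uses the cut points given above; the function name is illustrative, and the labels follow the excerpt, where the last category appears as "professional".

```python
def group_occupation(code):
    """Collapse an original occupation code (1-17) into four categories."""
    if 1 <= code <= 2:
        return "farm"
    if 3 <= code <= 6:
        return "unskilled"
    if 7 <= code <= 11:
        return "skilled"
    if 12 <= code <= 17:
        return "professional"   # professional/managerial
    raise ValueError(f"occupation code out of range: {code!r}")

# Number of original codes that fall in each group
sizes = {}
for code in range(1, 18):
    label = group_occupation(code)
    sizes[label] = sizes.get(label, 0) + 1
print(sizes)
```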

If you use the data in a publication please acknowledge Statlib and the original authors, David L. Featherman and Robert M. Hauser (1978). Opportunity and Change. New York: Academic Press. The data were also analyzed by Timothy J. Biblarz and Adrian E. Raftery (1993). "The Effects of Family Disruption on Social Mobility", American Sociological Review, 58(1):97-109.

Time to Ph.D.

The Time to Ph.D. data are available in a file containing five columns:

The file has 73 rows and is called phd.dat. A brief excerpt is shown below:

  1    1   1     31  7422
  2    1   1    177  7166
  3    1   1    393  6759
  4    1   1    484  6138
  5    1   1    500  5506
  6    1   1    399  4824
    ...
  6    3   2      8    85
  7    3   2      2    72
 12    3   2      2    37

Reference: Espenshade, T.J. and Rodríguez, G. (1997). Completing the Ph.D.: Comparative Performances of U.S. and Foreign Students. Social Science Quarterly, 78:593-605.

The Gehan-Freireich Survival Data

The data show the length of remission in weeks for two groups of leukemia patients, treated and control, and were analyzed by Cox in his original proportional hazards paper. The data are available in a file containing three columns:

Thus, the third and fourth observations, 6 and 6+, corresponding to a death and a censored observation at six weeks, are coded 6, 1 and 6, 0, respectively.
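
The coding rule can be illustrated with a short Python sketch that turns the conventional notation, where a trailing "+" marks a censored time, into a pair of duration and failure indicator. The function name code_remission is hypothetical.

```python
def code_remission(value):
    """Convert '6' or '6+' into (weeks, failure), with failure 1=death, 0=censored."""
    text = value.strip()
    censored = text.endswith("+")
    weeks = int(text.rstrip("+"))
    return weeks, 0 if censored else 1

# A death and a censored observation, both at six weeks
print(code_remission("6"), code_remission("6+"))
```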

The data are available in the usual two plain-text formats in gehan.dat and gehan.raw (group codes are 1=control, 2=treated), and as a Stata file in gehan.dta. Here's an excerpt of the dat file:

   treatment time failure
1    treated    6    TRUE
2    treated    6    TRUE
3    treated    6    TRUE
4    treated    6   FALSE
5    treated    7    TRUE
6    treated    9   FALSE
    ...
40   control   17    TRUE
41   control   22    TRUE
42   control   23    TRUE

These data actually come from a matched-pairs design, where patients were paired according to remission status (partial or complete) and then randomly assigned to the treated or control group, but most analyses have ignored this fact. See Andersen et al. (1993), pages 22-23, which has references to several papers using this dataset.

Reference: Andersen, P. K.; Borgan, O.; Gill, R. D. and Keiding, N. (1993). Statistical Models Based on Counting Processes, Springer-Verlag, New York.

The Somoza Dataset

These are Somoza's data on infant and child survival in Colombia, used in the notes (Table 3). The dataset comes from the World Fertility Survey, which was fielded in Colombia in 1976. Women in the reproductive ages were asked about their children and these were tabulated by sex, year of birth (cohort), survival status and age at death or at interview.

The file has 48 lines, corresponding to the 48 combinations of sex, cohort and age, and six columns:

The data are available in plain text format as somoza.dat, which uses character labels for sex, cohort and age, and somoza.raw, which uses numeric codes for all variables, and in Stata format as somoza.dta. A brief excerpt of the dat file is shown below.

    sex cohort        age   dead  alive
   Male 1941-59    0-1/12     99      0
   Male 1941-59 1/12-3/12     35      0  
    ...
 Female 1968-76       10+      0      0

In order to analyze these data using piece-wise exponential models you first have to calculate events and exposure by sex, cohort and age. The details of this calculation are shown in our Stata logs. The final step of that process, a file with events and exposure by cohort and age (collapsing over sex) is available in Stata format as somoza2.dta.

Reference: Somoza, J. (1980). Illustrative Analysis: Infant and Child Mortality in Colombia. World Fertility Survey Scientific Reports, Number 10.

Marriage Dissolution in the U.S.

This dataset, adapted from an example in the software package aML, is based on a longitudinal survey conducted in the U.S.

The unit of observation is the couple and the event of interest is divorce, with interview and widowhood treated as censoring events. We have three fixed covariates: education of the husband and two indicators of the couple's ethnicity: whether the husband is black and whether the couple is mixed. The variables are:

The dataset has 3771 couples and is available in "raw" format as divorce.raw and in "dat" format as divorce.dat, see excerpt below. The file is also available in Stata format as divorce.dta.

     id       heduc  heblack  mixed    years  div
     9  12-15 years       No     No   10.546   No
    11   < 12 years       No     No   34.943   No
    13   < 12 years       No     No    2.834  Yes
    15   < 12 years       No     No   17.532  Yes
    33  12-15 years       No     No    1.418   No
    36   < 12 years       No     No   48.033   No
    ...
 17294  12-15 years      Yes     No    7.269   No
 17302  12-15 years       No    Yes    18.73   No

Reference: Lillard and Panis (2000), aML Multilevel Multiprocess Statistical Software, Release 1.0, EconWare, LA, California.