Logistic regression

Logistic regression

Datasets

R Script


Exercise 1 (birth)

In a study in Massachusetts, USA, birth weight was measured for the children of 189 women. The main variable in the study was birth weight, bwt, which is an important indicator of the condition of a newborn child. Low birth weight (below 2500 g) may be a medical risk factor. A major question is whether smoking during pregnancy influences the birth weight. One has also studied whether a number of other factors are related to birth weight, such as hypertension in the mother.

The variables of the study are

  • Identification number (id)
  • Low birth weight (low), i.e. bwt below or above 2500g
  • Age of the mother in years (age)
  • Weight (in pounds) at last menstrual period (lwt)
  • Ethnicity: white, black, or other (eth)
  • Smoking status, smoker means current smoker, nonsmoker means not smoking during pregnancy (smk)
  • History of premature labour, values 0, 1, 2… (ptl)
  • History of hypertension: yes vs no (ht)
  • Uterine irritability, yes vs no (ui)
  • First trimester visits, values 0, 1, 2… (ftv)
  • Third trimester visits, values 0, 1, 2… (ttv)
  • Birth weight in grams (bwt)

1a)

Perform a logistic regression analysis with low as dependent variable, and age as independent variable. Interpret the results.

Recode low as is_low_bwt

With logistic regression (glm()), you need to specify the outcome variable as

  • either 0 or 1 (numeric values)
  • or a factor (ordered so that it matches 0 or 1 category)

In the solutions we should the first option (numeric 0 or 1)

It could be a good practice to name another variable, is_low_bwt, so that you keep the original low variable untouched; this is useful in case you want to double check if the coding is correct or need to start over.

1b)

Perform a logistic regression analysis with low as dependent variable and smk as independent variable.

1c)

Perform a logistic regression with low as dependent variable, and eth as independent variable. Be careful with the reference category.

1d)

Perform a logistic regression with low as dependent variable, and age,smk and eth as independent variables.

1e)

Based on the above analysis, set up a result table which reports:

  • odds ratios OR (unadjusted, and adjusted)
  • 95% confidence intervals for OR
  • p-values for OR

Make sure you know how to interpret the table.

Exercise 2 (framingham)

We use the data from the Framingham study, framingham. The dataset contains a selection of n = 500 men aged 31 to 65 years.

The response variable is FIRSTCHD, and this is equal to 1 if the individual has coronary heart disease and 0 otherwise.

We have four explanatory variables:

  • MEANSBP, the average systolic blood pressure (mmHg) of two blood pressure measurements;
  • SMOKE which is smoking (1 = yes, 0 = no);
  • CHOLESTEROL which is serum cholesterol in mg/dl;
  • AGE (age in years).

2a)

Analyse the relationship between firstchd and smoke in a logistic regression model.

2b)

Analyse the relationship between firstchd and meansbp in a logistic regression model.

2c)

Include also the other two explanatory variables in a logisic regression model. Interpret the results.

Solution

Exercise 1 (birth)

In a study in Massachusetts, USA, birth weight was measured for the children of 189 women. The main variable in the study was birth weight, bwt, which is an important indicator of the condition of a newborn child. Low birth weight (below 2500 g) may be a medical risk factor. A major question is whether smoking during pregnancy influences the birth weight. One has also studied whether a number of other factors are related to birth weight, such as hypertension in the mother.

The variables of the study are

  • Identification number (id)
  • Low birth weight (low), i.e. bwt below or above 2500g
  • Age of the mother in years (age)
  • Weight (in pounds) at last menstrual period (lwt)
  • Ethnicity: white, black, or other (eth)
  • Smoking status, smoker means current smoker, nonsmoker means not smoking during pregnancy (smk)
  • History of premature labour, values 0, 1, 2… (ptl)
  • History of hypertension: yes vs no (ht)
  • Uterine irritability, yes vs no (ui)
  • First trimester visits, values 0, 1, 2… (ftv)
  • Third trimester visits, values 0, 1, 2… (ttv)
  • Birth weight in grams (bwt)
# load data
birth_data <- read.csv('data/birth.csv')

# print the first rows of the data set
head(birth_data)
  id         low age lwt   eth       smk ptl  ht  ui fvt ttv  bwt
1  4 bwt <= 2500  28 120 other    smoker   1  no yes   0   0  709
2 10 bwt <= 2500  29 130 white nonsmoker   0  no yes   2   0 1021
3 11 bwt <= 2500  34 187 black    smoker   0 yes  no   0   0 1135
4 13 bwt <= 2500  25 105 other nonsmoker   1 yes  no   0   0 1330
5 15 bwt <= 2500  25  85 other nonsmoker   0  no yes   0   4 1474
6 16 bwt <= 2500  27 150 other nonsmoker   0  no  no   0   5 1588

1a)

Perform a logistic regression analysis with low as dependent variable, and age as independent variable. Interpret the results.

Recode low as is_low_bwt

With logistic regression (glm()), you need to specify the outcome variable as

  • either 0 or 1 (numeric values)
  • or a factor (ordered so that it matches 0 or 1 category)

In the example we should the first option (numeric 0 or 1)

It could be a good practice to create another variable, is_low_bwt, so that you keep the original low variable untouched; this is useful in case you want to double check if the coding is correct or need to start over.

# be careful with the levels: bwt<= 2500 is the cases (code as 1)
table(birth_data$low)

bwt <= 2500  bwt > 2500 
         59         130 
# recode low variable (numeric 1 and 0)
birth_data$is_low_bwt <- ifelse(birth_data$low == 'bwt <= 2500',1,0)

# check each category
table(birth_data$is_low_bwt)

  0   1 
130  59 
# fit lr (using is_low_bwt)
lr_low_age <- glm(is_low_bwt ~ age, 
                  data = birth_data, 
                  family = 'binomial')
summary(lr_low_age)

Call:
glm(formula = is_low_bwt ~ age, family = "binomial", data = birth_data)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  0.38458    0.73212   0.525    0.599
age         -0.05115    0.03151  -1.623    0.105

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 234.67  on 188  degrees of freedom
Residual deviance: 231.91  on 187  degrees of freedom
AIC: 235.91

Number of Fisher Scoring iterations: 4
# double check whether 0/1 categories are correct
table(lr_low_age$y)

  0   1 
130  59 
# the result regresssion coefficient is not odds ratio
lr_low_age$coefficients
(Intercept)         age 
 0.38458192 -0.05115294 
# equivalently,
# coefficients(lr_low_age)
# coef(lr_low_age)
# confidence interval for beta:
# confint(lr_low_age)


# to get odds ratio, exponentiate 
exp(coef(lr_low_age)) # OR
(Intercept)         age 
  1.4690000   0.9501333 
exp(confint(lr_low_age)) # CI for OR
                2.5 %   97.5 %
(Intercept) 0.3557144 6.343305
age         0.8912949 1.009027

1b)

Perform a logistic regression analysis with low as dependent variable and smk as independent variable.

# low ~ smk 
lr_low_smk <- glm(is_low_bwt ~ smk, 
                  data = birth_data, 
                  family = 'binomial')
summary(lr_low_smk)

Call:
glm(formula = is_low_bwt ~ smk, family = "binomial", data = birth_data)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -1.0871     0.2147  -5.062 4.14e-07 ***
smksmoker     0.7041     0.3196   2.203   0.0276 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 234.67  on 188  degrees of freedom
Residual deviance: 229.80  on 187  degrees of freedom
AIC: 233.8

Number of Fisher Scoring iterations: 4
# odds ratio 
exp(coef(lr_low_smk))
(Intercept)   smksmoker 
  0.3372093   2.0219436 
exp(confint(lr_low_smk))
                2.5 %    97.5 %
(Intercept) 0.2177709 0.5070199
smksmoker   1.0818724 3.8005817

1c)

Perform a logistic regression with low as dependent variable, and eth as independent variable. Be careful with the reference category.

# low ~ eth 
# ethnicity needs to be factorised
# baseline: white
birth_data$eth <- eth <- factor(birth_data$eth, 
              levels = c('white', 'black', 'other'), 
              labels = c('white', 'black', 'other'))
head(birth_data$eth)
[1] other white black other other other
Levels: white black other
# fit lr 
lr_low_eth <- glm(is_low_bwt ~ eth, 
                  data = birth_data, 
                  family = 'binomial')
summary(lr_low_eth)

Call:
glm(formula = is_low_bwt ~ eth, family = "binomial", data = birth_data)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -1.1550     0.2391  -4.830 1.36e-06 ***
ethblack      0.8448     0.4634   1.823   0.0683 .  
ethother      0.6362     0.3478   1.829   0.0674 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 234.67  on 188  degrees of freedom
Residual deviance: 229.66  on 186  degrees of freedom
AIC: 235.66

Number of Fisher Scoring iterations: 4
# odds ratio 
exp(coef(lr_low_eth))
(Intercept)    ethblack    ethother 
  0.3150685   2.3275362   1.8892340 
exp(confint(lr_low_eth))
                2.5 %    97.5 %
(Intercept) 0.1929856 0.4950281
ethblack    0.9255511 5.7746995
ethother    0.9565420 3.7578709

1d)

Perform a logistic regression with low as dependent variable, and age,smk and eth as independent variables.

lr_low_all <- glm(is_low_bwt ~ age + smk + eth, 
                  data = birth_data, 
                  family = 'binomial')
summary(lr_low_all)

Call:
glm(formula = is_low_bwt ~ age + smk + eth, family = "binomial", 
    data = birth_data)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)   
(Intercept) -1.00755    0.86166  -1.169  0.24228   
age         -0.03488    0.03340  -1.044  0.29634   
smksmoker    1.10055    0.37195   2.959  0.00309 **
ethblack     1.01141    0.49342   2.050  0.04039 * 
ethother     1.05673    0.40596   2.603  0.00924 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 234.67  on 188  degrees of freedom
Residual deviance: 218.86  on 184  degrees of freedom
AIC: 228.86

Number of Fisher Scoring iterations: 4
# odds ratio 
exp(coef(lr_low_all))
(Intercept)         age   smksmoker    ethblack    ethother 
  0.3651110   0.9657186   3.0058203   2.7494834   2.8769483 
exp(confint(lr_low_all))
                 2.5 %   97.5 %
(Intercept) 0.06601379 1.967972
age         0.90303360 1.029955
smksmoker   1.47208358 6.378576
ethblack    1.03958814 7.308152
ethother    1.31818618 6.531492

1e)

Based on the above analysis, set up a result table which reports:

  • odds ratios OR (unadjusted, and adjusted)
  • 95% confidence intervals for OR
  • p-values for OR

Make sure you know how to interpret the table.

Exercise 2 (framingham)

We use the data from the Framingham study, framingham. The dataset contains a selection of n = 500 men aged 31 to 65 years.

The response variable is FIRSTCHD, and this is equal to 1 if the individual has coronary heart disease and 0 otherwise.

We have four explanatory variables:

  • MEANSBP, the average systolic blood pressure (mmHg) of two blood pressure measurements;
  • SMOKE which is smoking (1 = yes, 0 = no);
  • CHOLESTEROL which is serum cholesterol in mg/dl;
  • AGE (age in years).

2a)

Analyse the relationship between firstchd and smoke in a logistic regression model.

framingham <- read.csv('data/framingham.csv')
head(framingham)
  obsno age       smoke cholesterol    firstchd meansbp
1     2  38     smoking         211 no-evidence   122.0
2     6  43     smoking         192 no-evidence   129.0
3    15  42     smoking         205 no-evidence   129.5
4    22  57     smoking         196 no-evidence    77.5
5    24  38     smoking         167 no-evidence   133.5
6    26  48 non-smoking         347 no-evidence   133.0
# check the outcome variable
# evidence is 1, no-evidence is 0
table(framingham$firstchd)

   evidence no-evidence 
         37         463 
# code into 1 and 0 (numeric)
framingham$firstchd_coded <- ifelse(framingham$firstchd == 'evidence', 1, 0)

# check if correct
table(framingham$firstchd_coded)

  0   1 
463  37 
# firstchd, smoke 
lr_chd_smk <- glm(firstchd_coded ~ smoke, 
                  data = framingham, 
                  family = 'binomial')
# double check if category is correct
table(lr_chd_smk$y)  

  0   1 
463  37 
summary(lr_chd_smk)

Call:
glm(formula = firstchd_coded ~ smoke, family = "binomial", data = framingham)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)   -2.6655     0.3656  -7.290  3.1e-13 ***
smokesmoking   0.1806     0.4136   0.437    0.662    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 263.86  on 499  degrees of freedom
Residual deviance: 263.67  on 498  degrees of freedom
AIC: 267.67

Number of Fisher Scoring iterations: 5
# odds ratio 
exp(coef(lr_chd_smk))
 (Intercept) smokesmoking 
  0.06956522   1.19791667 
exp(confint(lr_chd_smk))
                  2.5 %   97.5 %
(Intercept)  0.03120555 0.133367
smokesmoking 0.55661970 2.876730

2b)

Analyse the relationship between firstchd and meansbp in a logistic regression model.

# firstchd, meansbp
lr_chd_bp <- glm(firstchd_coded ~ meansbp, 
                 data = framingham, 
                 family = 'binomial')

summary(lr_chd_bp)

Call:
glm(formula = firstchd_coded ~ meansbp, family = "binomial", 
    data = framingham)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -5.846488   1.020563  -5.729 1.01e-08 ***
meansbp      0.024479   0.007178   3.410 0.000649 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 263.86  on 499  degrees of freedom
Residual deviance: 253.36  on 498  degrees of freedom
AIC: 257.36

Number of Fisher Scoring iterations: 5
# odds ratio 
exp(coef(lr_chd_bp))
(Intercept)     meansbp 
 0.00289003  1.02478099 
exp(confint(lr_chd_bp))
                   2.5 %     97.5 %
(Intercept) 0.0003828574 0.02159271
meansbp     1.0100963603 1.03921069

2c)

Include also the other two explanatory variables in a logisic regression model. Interpret the results.

lr_chd_all <- glm(firstchd_coded ~ meansbp + smoke + cholesterol + age, 
                  data = framingham, 
                  family = 'binomial')
summary(lr_chd_all)

Call:
glm(formula = firstchd_coded ~ meansbp + smoke + cholesterol + 
    age, family = "binomial", data = framingham)

Coefficients:
               Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -10.097386   1.719736  -5.871 4.32e-09 ***
meansbp        0.016815   0.007698   2.184  0.02893 *  
smokesmoking   0.408227   0.432310   0.944  0.34502    
cholesterol    0.007685   0.003764   2.042  0.04115 *  
age            0.066512   0.022283   2.985  0.00284 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 263.86  on 499  degrees of freedom
Residual deviance: 240.25  on 495  degrees of freedom
AIC: 250.25

Number of Fisher Scoring iterations: 6
# odds ratio 
round(exp(coef(lr_chd_all)), digits = 3)
 (Intercept)      meansbp smokesmoking  cholesterol          age 
       0.000        1.017        1.504        1.008        1.069 
round(exp(confint(lr_chd_all)), digits = 3)
             2.5 % 97.5 %
(Intercept)  0.000  0.001
meansbp      1.001  1.032
smokesmoking 0.674  3.743
cholesterol  1.000  1.015
age          1.024  1.118