Logistic regression

Datasets

Exercise 1: birth (rda link, csv link)
Exercise 2: framingham (rda link, csv link)

Exercise 1 (birth)

In a study in Massachusetts, USA, birth weight was measured for the children of 189 women. The main variable in the study was birth weight, bwt, which is an important indicator of the condition of a newborn child. Low birth weight (below 2500 g) may be a medical risk factor. A major question is whether smoking during pregnancy influences birth weight. The study also examined whether a number of other factors, such as hypertension in the mother, are related to birth weight.

The variables of the study are

Identification number (id)
Low birth weight indicator (low), i.e. whether bwt is below or above 2500 g
Age of the mother in years (age)
Weight (in pounds) at last menstrual period (lwt)
Ethnicity: white, black, or other (eth)
Smoking status: smoker (current smoker) or nonsmoker (not smoking during pregnancy) (smk)
History of premature labour, values 0, 1, 2… (ptl)
History of hypertension: yes vs no (ht)
Uterine irritability, yes vs no (ui)
First trimester visits, values 0, 1, 2… (ftv)
Third trimester visits, values 0, 1, 2… (ttv)
Birth weight in grams (bwt)

1a)

Perform a logistic regression analysis with low as dependent variable, and age as independent variable. Interpret the results.

Recode low as is_low_bwt

For logistic regression with glm(), the outcome variable must be specified as

either a numeric variable with values 0 and 1
or a factor whose levels are ordered so that the first level corresponds to 0 and the second to 1

In the solutions we show the first option (numeric 0 or 1).

It is good practice to create a new variable, is_low_bwt, so that the original low variable stays untouched; this is useful if you want to double-check that the coding is correct or need to start over.
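The recoding step described above can be sketched as follows. This is only an illustration on a small made-up data frame: the labels "low" and "normal" are an assumption about how low is coded, so check the actual values in your data (for example with table(birth$low)) before adapting it.

```r
# Toy data only: the labels "low"/"normal" are an assumption about how
# `low` is coded, and all values below are made up for illustration.
birth_toy <- data.frame(
  low = c("normal", "low", "normal", "low", "normal", "low", "normal", "normal"),
  age = c(19, 33, 20, 21, 18, 25, 29, 26)
)

# Create a new 0/1 variable and leave the original `low` untouched
birth_toy$is_low_bwt <- ifelse(birth_toy$low == "low", 1, 0)

# Cross-tabulate old against new to verify the recoding
table(birth_toy$low, birth_toy$is_low_bwt)

# Logistic regression: binomial family with the default logit link
fit_age <- glm(is_low_bwt ~ age, data = birth_toy, family = binomial)
summary(fit_age)
```

The cross-tabulation should show every original category mapped to exactly one of 0 or 1; if it does not, the recoding needs to be revisited before fitting the model.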

1b)

Perform a logistic regression analysis with low as dependent variable and smk as independent variable.

1c)

Perform a logistic regression with low as dependent variable, and eth as independent variable. Be careful with the reference category.
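One way to control the reference category is relevel(). The sketch below uses made-up values with the level names from the variable list; choosing white as the reference is only an example, and it assumes eth is (or has been converted to) a factor.

```r
# Made-up values; the level names follow the variable list above
eth <- factor(c("white", "black", "other", "white", "other", "black"))

# Factor levels are alphabetical by default, so "black" is the reference
levels(eth)

# Move "white" to the first position so that it becomes the reference
eth_ref_white <- relevel(eth, ref = "white")
levels(eth_ref_white)
```

In a glm() fit, each reported coefficient for a factor compares one level against the first level, so the odds ratios change meaning when the reference changes.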

1d)

Perform a logistic regression with low as dependent variable, and age, smk and eth as independent variables.

1e)

Based on the above analysis, set up a result table which reports:

odds ratios OR (unadjusted, and adjusted)
95% confidence intervals for OR
p-values for OR

Make sure you know how to interpret the table.
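A possible way to assemble such a table is sketched below. The data are simulated purely so that the code runs on its own, the helper or_table() is a hypothetical name introduced here, and the intervals are Wald intervals from confint.default(); profile-likelihood intervals from confint() are an alternative and can differ slightly.

```r
# Simulated toy data, not the birth dataset
set.seed(1)
n <- 200
toy <- data.frame(
  age = round(runif(n, 15, 45)),
  smk = factor(sample(c("nonsmoker", "smoker"), n, replace = TRUE))
)
toy$is_low_bwt <- rbinom(n, 1, plogis(-1 + 0.7 * (toy$smk == "smoker")))

# Unadjusted (one predictor) and adjusted (several predictors) models
fit_crude <- glm(is_low_bwt ~ smk, data = toy, family = binomial)
fit_adj   <- glm(is_low_bwt ~ smk + age, data = toy, family = binomial)

# Hypothetical helper: odds ratios, Wald 95% CIs and p-values from a glm fit
or_table <- function(fit) {
  est <- coef(summary(fit))    # estimates on the log-odds scale
  ci  <- confint.default(fit)  # Wald 95% CI on the log-odds scale
  out <- data.frame(
    OR      = exp(est[, "Estimate"]),
    lower   = exp(ci[, 1]),
    upper   = exp(ci[, 2]),
    p_value = est[, "Pr(>|z|)"],
    row.names = rownames(est)
  )
  # The intercept is a baseline odds, not an odds ratio, so drop it
  out[rownames(out) != "(Intercept)", , drop = FALSE]
}

or_table(fit_crude)  # unadjusted OR for smk
or_table(fit_adj)    # OR for smk adjusted for age, plus OR for age
```

Exponentiating the coefficient and both interval limits moves them from the log-odds scale to the odds-ratio scale; the p-value is unchanged by this transformation.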

Exercise 2 (framingham)

We use the data from the Framingham study, framingham. The dataset contains a selection of n = 500 men aged 31 to 65 years.

The response variable is FIRSTCHD, and this is equal to 1 if the individual has coronary heart disease and 0 otherwise.

We have four explanatory variables:

MEANSBP, the average systolic blood pressure (mmHg) of two blood pressure measurements;
SMOKE which is smoking (1 = yes, 0 = no);
CHOLESTEROL which is serum cholesterol in mg/dl;
AGE (age in years).

2a)

Analyse the relationship between firstchd and smoke in a logistic regression model.

2b)

Analyse the relationship between firstchd and meansbp in a logistic regression model.

2c)

Also include the other two explanatory variables in a logistic regression model. Interpret the results.
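When interpreting a continuous predictor such as blood pressure, the odds ratio per 1 mmHg is often very close to 1, and rescaling to a larger step can be easier to read. The sketch below uses simulated data with the lowercase variable names from the exercise text; the column names in the real dataset may be cased differently, and none of the numbers come from the Framingham study.

```r
# Simulated toy data, not the framingham dataset
set.seed(2)
n <- 500
fram_toy <- data.frame(
  meansbp     = round(rnorm(n, mean = 135, sd = 20)),
  smoke       = rbinom(n, 1, 0.4),
  cholesterol = round(rnorm(n, mean = 230, sd = 40)),
  age         = sample(31:65, n, replace = TRUE)
)
fram_toy$firstchd <- rbinom(
  n, 1, plogis(-5 + 0.02 * fram_toy$meansbp + 0.3 * fram_toy$smoke)
)

# Model with all four explanatory variables
fit_full <- glm(firstchd ~ meansbp + smoke + cholesterol + age,
                data = fram_toy, family = binomial)

# Odds ratio for a 1 mmHg and for a 10 mmHg increase in blood pressure
b_sbp     <- coef(fit_full)["meansbp"]
or_per_1  <- exp(b_sbp)
or_per_10 <- exp(10 * b_sbp)
c(or_per_1, or_per_10)
```

Because the model is linear on the log-odds scale, the odds ratio for a 10-unit increase equals the 1-unit odds ratio raised to the power 10, with the other variables held fixed.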

Solution