In a study in Massachusetts, USA, birth weight was measured for the children of 189 women. The main variable in the study was birth weight, bwt, which is an important indicator of the condition of a newborn child. Low birth weight (below 2500 g) may be a medical risk factor. A major question is whether smoking during pregnancy influences the birth weight. One has also studied whether a number of other factors are related to birth weight, such as hypertension in the mother.
The variables of the study are
Identification number (id)
Low birth weight (low), i.e. bwt below or above 2500g
Age of the mother in years (age)
Weight (in pounds) at last menstrual period (lwt)
Ethnicity: white, black, or other (eth)
Smoking status, smoker means current smoker, nonsmoker means not smoking during pregnancy (smk)
History of premature labour, values 0, 1, 2… (ptl)
History of hypertension: yes vs no (ht)
Uterine irritability, yes vs no (ui)
First trimester visits, values 0, 1, 2… (ftv)
Third trimester visits, values 0, 1, 2… (ttv)
Birth weight in grams (bwt)
1a)
Perform a logistic regression analysis with low as dependent variable, and age as independent variable. Interpret the results.
Recode low as is_low_bwt
With logistic regression (glm()), you need to specify the outcome variable as
either 0 or 1 (numeric values)
or a factor (ordered so that it matches 0 or 1 category)
In the solutions we should the first option (numeric 0 or 1)
It could be a good practice to name another variable, is_low_bwt, so that you keep the original low variable untouched; this is useful in case you want to double check if the coding is correct or need to start over.
1b)
Perform a logistic regression analysis with low as dependent variable and smk as independent variable.
1c)
Perform a logistic regression with low as dependent variable, and eth as independent variable. Be careful with the reference category.
1d)
Perform a logistic regression with low as dependent variable, and age,smk and eth as independent variables.
1e)
Based on the above analysis, set up a result table which reports:
odds ratios OR (unadjusted, and adjusted)
95% confidence intervals for OR
p-values for OR
Make sure you know how to interpret the table.
Exercise 2 (framingham)
We use the data from the Framingham study, framingham. The dataset contains a selection of n = 500 men aged 31 to 65 years.
The response variable is FIRSTCHD, and this is equal to 1 if the individual has coronary heart disease and 0 otherwise.
We have four explanatory variables:
MEANSBP, the average systolic blood pressure (mmHg) of two blood pressure measurements;
SMOKE which is smoking (1 = yes, 0 = no);
CHOLESTEROL which is serum cholesterol in mg/dl;
AGE (age in years).
2a)
Analyse the relationship between firstchd and smoke in a logistic regression model.
2b)
Analyse the relationship between firstchd and meansbp in a logistic regression model.
2c)
Include also the other two explanatory variables in a logisic regression model. Interpret the results.
Solution
Exercise 1 (birth)
In a study in Massachusetts, USA, birth weight was measured for the children of 189 women. The main variable in the study was birth weight, bwt, which is an important indicator of the condition of a newborn child. Low birth weight (below 2500 g) may be a medical risk factor. A major question is whether smoking during pregnancy influences the birth weight. One has also studied whether a number of other factors are related to birth weight, such as hypertension in the mother.
The variables of the study are
Identification number (id)
Low birth weight (low), i.e. bwt below or above 2500g
Age of the mother in years (age)
Weight (in pounds) at last menstrual period (lwt)
Ethnicity: white, black, or other (eth)
Smoking status, smoker means current smoker, nonsmoker means not smoking during pregnancy (smk)
History of premature labour, values 0, 1, 2… (ptl)
History of hypertension: yes vs no (ht)
Uterine irritability, yes vs no (ui)
First trimester visits, values 0, 1, 2… (ftv)
Third trimester visits, values 0, 1, 2… (ttv)
Birth weight in grams (bwt)
# load databirth_data <-read.csv('data/birth.csv')# print the first rows of the data sethead(birth_data)
id low age lwt eth smk ptl ht ui fvt ttv bwt
1 4 bwt <= 2500 28 120 other smoker 1 no yes 0 0 709
2 10 bwt <= 2500 29 130 white nonsmoker 0 no yes 2 0 1021
3 11 bwt <= 2500 34 187 black smoker 0 yes no 0 0 1135
4 13 bwt <= 2500 25 105 other nonsmoker 1 yes no 0 0 1330
5 15 bwt <= 2500 25 85 other nonsmoker 0 no yes 0 4 1474
6 16 bwt <= 2500 27 150 other nonsmoker 0 no no 0 5 1588
1a)
Perform a logistic regression analysis with low as dependent variable, and age as independent variable. Interpret the results.
Recode low as is_low_bwt
With logistic regression (glm()), you need to specify the outcome variable as
either 0 or 1 (numeric values)
or a factor (ordered so that it matches 0 or 1 category)
In the example we should the first option (numeric 0 or 1)
It could be a good practice to create another variable, is_low_bwt, so that you keep the original low variable untouched; this is useful in case you want to double check if the coding is correct or need to start over.
# be careful with the levels: bwt<= 2500 is the cases (code as 1)table(birth_data$low)
bwt <= 2500 bwt > 2500
59 130
# recode low variable (numeric 1 and 0)birth_data$is_low_bwt <-ifelse(birth_data$low =='bwt <= 2500',1,0)# check each categorytable(birth_data$is_low_bwt)
0 1
130 59
# fit lr (using is_low_bwt)lr_low_age <-glm(is_low_bwt ~ age, data = birth_data, family ='binomial')summary(lr_low_age)
Call:
glm(formula = is_low_bwt ~ age, family = "binomial", data = birth_data)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.38458 0.73212 0.525 0.599
age -0.05115 0.03151 -1.623 0.105
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 234.67 on 188 degrees of freedom
Residual deviance: 231.91 on 187 degrees of freedom
AIC: 235.91
Number of Fisher Scoring iterations: 4
# double check whether 0/1 categories are correcttable(lr_low_age$y)
0 1
130 59
# the result regresssion coefficient is not odds ratiolr_low_age$coefficients
(Intercept) age
0.38458192 -0.05115294
# equivalently,# coefficients(lr_low_age)# coef(lr_low_age)# confidence interval for beta:# confint(lr_low_age)# to get odds ratio, exponentiate exp(coef(lr_low_age)) # OR
(Intercept) age
1.4690000 0.9501333
exp(confint(lr_low_age)) # CI for OR
2.5 % 97.5 %
(Intercept) 0.3557144 6.343305
age 0.8912949 1.009027
1b)
Perform a logistic regression analysis with low as dependent variable and smk as independent variable.
# low ~ smk lr_low_smk <-glm(is_low_bwt ~ smk, data = birth_data, family ='binomial')summary(lr_low_smk)
Call:
glm(formula = is_low_bwt ~ smk, family = "binomial", data = birth_data)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.0871 0.2147 -5.062 4.14e-07 ***
smksmoker 0.7041 0.3196 2.203 0.0276 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 234.67 on 188 degrees of freedom
Residual deviance: 229.80 on 187 degrees of freedom
AIC: 233.8
Number of Fisher Scoring iterations: 4
# check the outcome variable# evidence is 1, no-evidence is 0table(framingham$firstchd)
evidence no-evidence
37 463
# code into 1 and 0 (numeric)framingham$firstchd_coded <-ifelse(framingham$firstchd =='evidence', 1, 0)# check if correcttable(framingham$firstchd_coded)
0 1
463 37
# firstchd, smoke lr_chd_smk <-glm(firstchd_coded ~ smoke, data = framingham, family ='binomial')# double check if category is correcttable(lr_chd_smk$y)
0 1
463 37
summary(lr_chd_smk)
Call:
glm(formula = firstchd_coded ~ smoke, family = "binomial", data = framingham)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.6655 0.3656 -7.290 3.1e-13 ***
smokesmoking 0.1806 0.4136 0.437 0.662
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 263.86 on 499 degrees of freedom
Residual deviance: 263.67 on 498 degrees of freedom
AIC: 267.67
Number of Fisher Scoring iterations: 5
---title: "Logistic regression"description: "Logistic regression"format: html: code-fold: false code-tools: trueeditor: source---Datasets* Exercise 1: `birth` ([rda link](https://github.com/ocbe-uio/teaching_mf9130e/blob/main/lab/data/birth.rda), [csv link](https://github.com/ocbe-uio/teaching_mf9130e/blob/main/lab/data/birth.csv))* Exercise 2: `framingham` ([rda link](https://github.com/ocbe-uio/teaching_mf9130e/blob/main/lab/data/framingham.rda), [csv link](https://github.com/ocbe-uio/teaching_mf9130e/blob/main/lab/data/framingham.csv))[R Script](https://github.com/ocbe-uio/teaching_mf9130e/blob/main/lab/code/9_logisticreg.R)------------------## Exercise 1 (birth)In a study in Massachusetts, USA, birth weight was measured for the children of 189 women. The main variable in the study was birth weight, `bwt`, which is an important indicator of the condition of a newborn child. Low birth weight (below 2500 g) may be a medical risk factor. A major question is whether smoking during pregnancy influences the birth weight. One has also studied whether a number of other factors are related to birth weight, such as hypertension in the mother. The variables of the study are* Identification number (id) * Low birth weight (low), i.e. bwt below or above 2500g * Age of the mother in years (age) * Weight (in pounds) at last menstrual period (lwt) * Ethnicity: white, black, or other (eth) * Smoking status, `smoker` means current smoker, `nonsmoker` means not smoking during pregnancy (smk) * History of premature labour, values 0, 1, 2... (ptl) * History of hypertension: yes vs no (ht) * Uterine irritability, yes vs no (ui) * First trimester visits, values 0, 1, 2... (ftv) * Third trimester visits, values 0, 1, 2... (ttv) * Birth weight in grams (bwt) #### 1a) Perform a logistic regression analysis with `low` as dependent variable, and `age` as independent variable. Interpret the results.::: {.callout-note}## Recode `low` as `is_low_bwt`With logistic regression (glm()), you need to specify the outcome variable as * either 0 or 1 (numeric values)* or a factor (ordered so that it matches 0 or 1 category)In the solutions we should the first option (numeric 0 or 1)It could be a good practice to name another variable, `is_low_bwt`, so that you keep the original `low` variable untouched; this is useful in case you want to double check if the coding is correct or need to start over.:::#### 1b) Perform a logistic regression analysis with `low` as dependent variable and `smk` as independent variable. #### 1c) Perform a logistic regression with `low` as dependent variable, and `eth` as independent variable. Be careful with the reference category.#### 1d)Perform a logistic regression with `low` as dependent variable, and `age`,`smk` and `eth` as independent variables.#### 1e) Based on the above analysis, set up a result table which reports:* odds ratios OR (unadjusted, and adjusted)* 95% confidence intervals for OR* p-values for ORMake sure you know how to interpret the table. ## Exercise 2 (framingham)We use the data from the Framingham study, `framingham`. The dataset contains a selection of n = 500 men aged 31 to 65 years. The response variable is FIRSTCHD, and this is equal to 1 if the individual has coronary heart disease and 0 otherwise. We have four explanatory variables: * MEANSBP, the average systolic blood pressure (mmHg) of two blood pressure measurements; * SMOKE which is smoking (1 = yes, 0 = no);* CHOLESTEROL which is serum cholesterol in mg/dl;* AGE (age in years). #### 2a)Analyse the relationship between `firstchd` and `smoke` in a logistic regression model.#### 2b)Analyse the relationship between `firstchd` and `meansbp` in a logistic regression model.#### 2c)Include also the other two explanatory variables in a logisic regression model. Interpret the results. # Solution### Exercise 1 (birth)In a study in Massachusetts, USA, birth weight was measured for the children of 189 women. The main variable in the study was birth weight, `bwt`, which is an important indicator of the condition of a newborn child. Low birth weight (below 2500 g) may be a medical risk factor. A major question is whether smoking during pregnancy influences the birth weight. One has also studied whether a number of other factors are related to birth weight, such as hypertension in the mother. The variables of the study are* Identification number (id) * Low birth weight (low), i.e. bwt below or above 2500g * Age of the mother in years (age) * Weight (in pounds) at last menstrual period (lwt) * Ethnicity: white, black, or other (eth) * Smoking status, `smoker` means current smoker, `nonsmoker` means not smoking during pregnancy (smk) * History of premature labour, values 0, 1, 2... (ptl) * History of hypertension: yes vs no (ht) * Uterine irritability, yes vs no (ui) * First trimester visits, values 0, 1, 2... (ftv) * Third trimester visits, values 0, 1, 2... (ttv) * Birth weight in grams (bwt) ```{r}#| label: lr-1a-1#| warning: false#| echo: true# load databirth_data <-read.csv('data/birth.csv')# print the first rows of the data sethead(birth_data)```### 1a)Perform a logistic regression analysis with `low` as dependent variable, and `age` as independent variable. Interpret the results.::: {.callout-note}## Recode `low` as `is_low_bwt`With logistic regression (glm()), you need to specify the outcome variable as * either 0 or 1 (numeric values)* or a factor (ordered so that it matches 0 or 1 category)In the example we should the first option (numeric 0 or 1)It could be a good practice to create another variable, `is_low_bwt`, so that you keep the original `low` variable untouched; this is useful in case you want to double check if the coding is correct or need to start over.:::```{r}#| label: lr-1a-2#| warning: false#| echo: true# be careful with the levels: bwt<= 2500 is the cases (code as 1)table(birth_data$low)# recode low variable (numeric 1 and 0)birth_data$is_low_bwt <-ifelse(birth_data$low =='bwt <= 2500',1,0)# check each categorytable(birth_data$is_low_bwt)``````{r}#| label: lr-1a-3#| warning: false#| echo: true# fit lr (using is_low_bwt)lr_low_age <-glm(is_low_bwt ~ age, data = birth_data, family ='binomial')summary(lr_low_age)# double check whether 0/1 categories are correcttable(lr_low_age$y)``````{r}#| label: lr-1a-4#| warning: false#| echo: true# the result regresssion coefficient is not odds ratiolr_low_age$coefficients# equivalently,# coefficients(lr_low_age)# coef(lr_low_age)# confidence interval for beta:# confint(lr_low_age)# to get odds ratio, exponentiate exp(coef(lr_low_age)) # ORexp(confint(lr_low_age)) # CI for OR```### 1b) Perform a logistic regression analysis with `low` as dependent variable and `smk` as independent variable. ```{r}#| label: lr-1b-1#| warning: false#| echo: true# low ~ smk lr_low_smk <-glm(is_low_bwt ~ smk, data = birth_data, family ='binomial')summary(lr_low_smk)# odds ratio exp(coef(lr_low_smk))exp(confint(lr_low_smk))```### 1c) Perform a logistic regression with `low` as dependent variable, and `eth` as independent variable. Be careful with the reference category.```{r}#| label: lr-1c-1#| warning: false#| echo: true# low ~ eth # ethnicity needs to be factorised# baseline: whitebirth_data$eth <- eth <-factor(birth_data$eth, levels =c('white', 'black', 'other'), labels =c('white', 'black', 'other'))head(birth_data$eth)# fit lr lr_low_eth <-glm(is_low_bwt ~ eth, data = birth_data, family ='binomial')summary(lr_low_eth)# odds ratio exp(coef(lr_low_eth))exp(confint(lr_low_eth))```### 1d)Perform a logistic regression with `low` as dependent variable, and `age`,`smk` and `eth` as independent variables.```{r}#| label: lr-1d-1#| warning: false#| echo: truelr_low_all <-glm(is_low_bwt ~ age + smk + eth, data = birth_data, family ='binomial')summary(lr_low_all)# odds ratio exp(coef(lr_low_all))exp(confint(lr_low_all))```### 1e) Based on the above analysis, set up a result table which reports:* odds ratios OR (unadjusted, and adjusted)* 95% confidence intervals for OR* p-values for ORMake sure you know how to interpret the table. ### Exercise 2 (framingham)We use the data from the Framingham study, `framingham`. The dataset contains a selection of n = 500 men aged 31 to 65 years. The response variable is FIRSTCHD, and this is equal to 1 if the individual has coronary heart disease and 0 otherwise. We have four explanatory variables: * MEANSBP, the average systolic blood pressure (mmHg) of two blood pressure measurements; * SMOKE which is smoking (1 = yes, 0 = no);* CHOLESTEROL which is serum cholesterol in mg/dl;* AGE (age in years). ### 2a)Analyse the relationship between `firstchd` and `smoke` in a logistic regression model.```{r}#| label: lr-2a-1#| warning: false#| echo: trueframingham <-read.csv('data/framingham.csv')head(framingham)# check the outcome variable# evidence is 1, no-evidence is 0table(framingham$firstchd)# code into 1 and 0 (numeric)framingham$firstchd_coded <-ifelse(framingham$firstchd =='evidence', 1, 0)# check if correcttable(framingham$firstchd_coded)``````{r}#| label: lr-2a-2#| warning: false#| echo: true# firstchd, smoke lr_chd_smk <-glm(firstchd_coded ~ smoke, data = framingham, family ='binomial')# double check if category is correcttable(lr_chd_smk$y) summary(lr_chd_smk)# odds ratio exp(coef(lr_chd_smk))exp(confint(lr_chd_smk))```### 2b)Analyse the relationship between `firstchd` and `meansbp` in a logistic regression model.```{r}#| label: lr-2b-1#| warning: false#| echo: true# firstchd, meansbplr_chd_bp <-glm(firstchd_coded ~ meansbp, data = framingham, family ='binomial')summary(lr_chd_bp)# odds ratio exp(coef(lr_chd_bp))exp(confint(lr_chd_bp))```### 2c)Include also the other two explanatory variables in a logisic regression model. Interpret the results. ```{r}#| label: lr-2c-1#| warning: false#| echo: truelr_chd_all <-glm(firstchd_coded ~ meansbp + smoke + cholesterol + age, data = framingham, family ='binomial')summary(lr_chd_all)# odds ratio round(exp(coef(lr_chd_all)), digits =3)round(exp(confint(lr_chd_all)), digits =3)```