Analysis of non-normal data¶

    example by harald.weedon-fekjar@medisin.uio.no, 2025

Data:
USA NHANES 2009-2012 data on alcohol consumption for people >= 18 years old, measured as number of days over the past year that participant drank alcoholic.

In [74]:
# Load data from internet:
NHANES <- read.csv2(file="https://raw.githubusercontent.com/ProjectMOSAIC/NHANES/refs/heads/master/data-raw/NHANES.csv",sep=",")

# Select variables of interest:
NHANES <- NHANES[c("AlcoholYear","Gender")]

# Only include with known alcohol consumption:
NHANES <- NHANES[!is.na(NHANES$AlcoholYear),]

# Select first 80 observations:
NHANES_f80 <- NHANES[1:80, ]

# View data
head(NHANES_f80)
A data.frame: 6 × 2
AlcoholYearGender
<int><chr>
1 72female
3 0male
4156male
6 2female
8 40male
9364female

Check distribution; Is data alcohol consumption normally distributed?¶

In [75]:
hist(NHANES$AlcoholYear,freq=F,
     main="Alcohol consumption",
     xlab="Days of alcohol consumption per year",ylab="Frequency")
lines(density(NHANES$AlcoholYear,na.rm=T,from=0,to=365),lwd=2)
No description has been provided for this image
In [76]:
qqnorm(NHANES$AlcoholYear);qqline(NHANES$AlcoholYear)
No description has been provided for this image

Descriptive statistics:¶

In [77]:
mean(NHANES_f80$AlcoholYear)
61.0125
In [78]:
median(NHANES_f80$AlcoholYear)
12

Differences between females and males:¶

In [79]:
mean(NHANES_f80$AlcoholYear[NHANES_f80$Gender=="female"])
31.3243243243243
In [80]:
mean(NHANES_f80$AlcoholYear[NHANES_f80$Gender=="male"])
86.5581395348837
In [81]:
mean(NHANES_f80$AlcoholYear[NHANES_f80$Gender=="female"])-mean(NHANES_f80$AlcoholYear[NHANES_f80$Gender=="male"])
-55.2338152105594

Bootstrap confidence intervals:¶

95% bootstrap confidence intervals based on 10 000 bootstrap replicates.
Use bias-corrected and accelerated (BCa) if no warning, else percentile.

In [82]:
library(boot)

Bootstrap confidence interval for mean:

In [83]:
meanboot <- function(data, indices) {
  bootdata <- data[indices,]
  return (mean(bootdata$AlcoholYear))
}
set.seed(4)
resBoot <- boot(data=NHANES_f80, statistic=meanboot,R=10000)
boot.ci(resBoot)
Warning message in boot.ci(resBoot):
"bootstrap variances needed for studentized intervals"
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 10000 bootstrap replicates

CALL : 
boot.ci(boot.out = resBoot)

Intervals : 
Level      Normal              Basic         
95%   (39.43, 82.78 )   (38.76, 81.61 )  

Level     Percentile            BCa          
95%   (40.41, 83.26 )   (42.66, 86.85 )  
Calculations and Intervals on Original Scale

Bootstrap confidence interval for median:

In [84]:
medianboot <- function(data, indices) {
  bootdata <- data[indices,]
  return (median(bootdata$AlcoholYear))
}
set.seed(4)
resBoot <- boot(data=NHANES_f80, statistic=medianboot,R=10000)
boot.ci(resBoot)
Warning message in boot.ci(resBoot):
"bootstrap variances needed for studentized intervals"
Warning message in norm.inter(t, adj.alpha):
"extreme order statistics used as endpoints"
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 10000 bootstrap replicates

CALL : 
boot.ci(boot.out = resBoot)

Intervals : 
Level      Normal              Basic         
95%   (-2.14, 20.97 )   ( 0.00, 18.00 )  

Level     Percentile            BCa          
95%   ( 6, 24 )   ( 1, 12 )  
Calculations and Intervals on Original Scale
Warning : BCa Intervals used Extreme Quantiles
Some BCa intervals may be unstable

Bootstrap confidence interval for difference of means:

In [85]:
meanboot_bySex <- function(data, indices) {
  bootdata <- data[indices,]
  meanF <- mean(bootdata[bootdata$Gender=="female",]$AlcoholYear)
  meanM <- mean(bootdata[bootdata$Gender=="male",]$AlcoholYear)
  return ( meanF-meanM )
}
set.seed(4)
resBoot <- boot(data=NHANES_f80, statistic=meanboot_bySex,R=10000)
boot.ci(resBoot)
Warning message in boot.ci(resBoot):
"bootstrap variances needed for studentized intervals"
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 10000 bootstrap replicates

CALL : 
boot.ci(boot.out = resBoot)

Intervals : 
Level      Normal              Basic         
95%   (-95.14, -15.28 )   (-95.37, -14.54 )  

Level     Percentile            BCa          
95%   (-95.92, -15.10 )   (-96.62, -15.66 )  
Calculations and Intervals on Original Scale

Classical normal based statistics for comparison:¶

In [86]:
t.test(NHANES_f80$AlcoholYear)
	One Sample t-test

data:  NHANES_f80$AlcoholYear
t = 5.526, df = 79, p-value = 4.072e-07
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 39.03609 82.98891
sample estimates:
mean of x 
  61.0125 
In [87]:
AlcoholYear_female_f80 <- NHANES_f80$AlcoholYear[NHANES_f80$Gender=="female"]
AlcoholYear_male_f80   <- NHANES_f80$AlcoholYear[NHANES_f80$Gender=="male"]
t.test(AlcoholYear_female_f80 ,AlcoholYear_male_f80 )
	Welch Two Sample t-test

data:  AlcoholYear_female_f80 and AlcoholYear_male_f80
t = -2.6813, df = 69.007, p-value = 0.009168
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -96.32941 -14.13822
sample estimates:
mean of x mean of y 
 31.32432  86.55814