Exploratory data analysis (Part I)

Exploring a dataset, descriptive statistics

Datasets

Explore a dataset: birth (csv link)
Exercise 1: no data, we type in data directly
Exercise 2: PEFH98-english (rda link, csv link)

Explore a dataset

In this section, we will introduce some frequently used commands when you first load a dataset.

We use birth data as an example.

# read a csv file, name it 'birth'
# this is the method to read in the data with R command
# you can also use the Rstudio interface to import a dataset
birth <- read.csv('data/birth.csv', sep = ',')

You can type birth in the console to print the data; but you will see that it gets hard to read. This is usually the case: you have lots of data that is hard to fit in your screen.

Therefore, it is useful to

print out a few rows (head of the data);
have a list of column names
know the dimension of the dataset (number of observations and features/measurements)
produce some summary statistics

We extract the head (first several rows, by default 6) rows of your data. Similarly, you can get the last several rows using tail()

head(birth)

  id         low age lwt   eth       smk ptl  ht  ui fvt ttv  bwt
1  4 bwt <= 2500  28 120 other    smoker   1  no yes   0   0  709
2 10 bwt <= 2500  29 130 white nonsmoker   0  no yes   2   0 1021
3 11 bwt <= 2500  34 187 black    smoker   0 yes  no   0   0 1135
4 13 bwt <= 2500  25 105 other nonsmoker   1 yes  no   0   0 1330
5 15 bwt <= 2500  25  85 other nonsmoker   0  no yes   0   4 1474
6 16 bwt <= 2500  27 150 other nonsmoker   0  no  no   0   5 1588

# this is giving the first 6 rows of your data
# birth[1:6, ]
# head(birth, 10)  gives first 10 rows
# tail(birth)

In RStudio, you can also use View() command to look at your dataset in a tabular form.

To know what measurements (or feature, variable, parameter) are in the data, you can extract the column names.

colnames(birth)

 [1] "id"  "low" "age" "lwt" "eth" "smk" "ptl" "ht"  "ui"  "fvt" "ttv" "bwt"

ncol(birth)

[1] 12

To know how many subjects are in a dataset, there are a few options. You can find the dimension of the dataset; or find how many rows are in the data.

Alternatively, if you have one variable (say age), you can also use length(age).

# by index
birth[1, ]

  id         low age lwt   eth    smk ptl ht  ui fvt ttv bwt
1  4 bwt <= 2500  28 120 other smoker   1 no yes   0   0 709

birth[, 1]

  [1]   4  10  11  13  15  16  17  18  19  20  22  23  24  25  26  27  28  29
 [19]  30  31  32  33  34  35  36  37  40  42  43  44  45  46  47  49  50  51
 [37]  52  54  56  57  59  60  61  62  63  65  67  68  69  71  75  76  77  78
 [55]  79  81  82  83  84  85  86  87  88  89  91  92  93  94  95  96  97  98
 [73]  99 100 101 102 103 104 105 106 107 108 109 111 112 113 114 115 116 117
 [91] 118 119 120 121 123 124 125 126 127 128 129 130 131 132 133 134 135 136
[109] 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 154 155 156
[127] 159 160 161 162 163 164 166 167 168 169 170 172 173 174 175 176 177 179
[145] 180 181 182 183 184 185 186 187 188 189 190 191 192 193 195 196 197 199
[163] 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217
[181] 218 219 220 221 222 223 224 225 226

# by column name
birth[, 'age']

  [1] 28 29 34 25 25 27 23 24 24 21 32 19 25 16 25 20 21 24 21 20 25 19 19 26 24
 [26] 17 20 22 27 20 17 25 20 18 18 20 21 26 31 15 23 20 24 15 23 30 22 17 23 17
 [51] 26 20 26 14 28 14 23 17 21 19 33 20 21 18 21 22 17 29 26 19 19 22 30 18 18
 [76] 15 25 20 28 32 31 36 28 25 28 17 29 26 17 17 24 35 25 25 29 19 27 31 33 21
[101] 19 23 21 18 18 32 19 24 22 22 23 22 30 19 16 21 30 20 17 17 23 24 28 26 20
[126] 24 28 20 22 22 31 23 16 16 18 25 32 20 23 22 32 30 20 23 17 19 23 36 22 24
[151] 21 19 25 16 29 29 19 19 30 24 19 24 23 20 25 30 22 18 16 32 18 29 33 20 28
[176] 14 28 25 16 20 26 21 22 25 31 35 19 24 45

# by $ operator
birth$age

  [1] 28 29 34 25 25 27 23 24 24 21 32 19 25 16 25 20 21 24 21 20 25 19 19 26 24
 [26] 17 20 22 27 20 17 25 20 18 18 20 21 26 31 15 23 20 24 15 23 30 22 17 23 17
 [51] 26 20 26 14 28 14 23 17 21 19 33 20 21 18 21 22 17 29 26 19 19 22 30 18 18
 [76] 15 25 20 28 32 31 36 28 25 28 17 29 26 17 17 24 35 25 25 29 19 27 31 33 21
[101] 19 23 21 18 18 32 19 24 22 22 23 22 30 19 16 21 30 20 17 17 23 24 28 26 20
[126] 24 28 20 22 22 31 23 16 16 18 25 32 20 23 22 32 30 20 23 17 19 23 36 22 24
[151] 21 19 25 16 29 29 19 19 30 24 19 24 23 20 25 30 22 18 16 32 18 29 33 20 28
[176] 14 28 25 16 20 26 21 22 25 31 35 19 24 45

# what column names are there?
colnames(birth)

 [1] "id"  "low" "age" "lwt" "eth" "smk" "ptl" "ht"  "ui"  "fvt" "ttv" "bwt"

# how many columns and rows?
ncol(birth)

[1] 12

nrow(birth)

[1] 189

dim(birth)

[1] 189  12

Sometimes it is also useful to know what types of data it is for each variable. You can either use str() (structure) on the dataset; or use class() on the variable you are interested in.

str(birth)

'data.frame':   189 obs. of  12 variables:
 $ id : int  4 10 11 13 15 16 17 18 19 20 ...
 $ low: chr  "bwt <= 2500" "bwt <= 2500" "bwt <= 2500" "bwt <= 2500" ...
 $ age: int  28 29 34 25 25 27 23 24 24 21 ...
 $ lwt: int  120 130 187 105 85 150 97 128 132 165 ...
 $ eth: chr  "other" "white" "black" "other" ...
 $ smk: chr  "smoker" "nonsmoker" "smoker" "nonsmoker" ...
 $ ptl: int  1 0 0 1 0 0 0 1 0 0 ...
 $ ht : chr  "no" "no" "yes" "yes" ...
 $ ui : chr  "yes" "yes" "no" "no" ...
 $ fvt: int  0 2 0 0 0 0 1 1 0 1 ...
 $ ttv: int  0 0 0 0 4 5 5 2 5 4 ...
 $ bwt: int  709 1021 1135 1330 1474 1588 1588 1701 1729 1790 ...

class(birth$age)

[1] "integer"

class(birth$low)

[1] "character"

Descriptive statistics

Continuous variable

A continuous variable can take values in a continuous range. Height and weight are examples of this type. Usually the data type in R is numeric; if not, you might need to convert it first.

Common summary statistics for continuous variables are

extremes (minimum, maximum)
average and central tendency (mean, median)
quantiles (top 1 percent, bottom 5 percent, IQR - interquartile range)

Let us focus on one continuous variable, bill_length_mm.

# create a variable called age
age <- birth$age


# use the command summary on a continuous variable
summary(age)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  14.00   19.00   23.00   23.24   26.00   45.00

# compute the min, max
min(age)

[1] 14

max(age)

[1] 45

mean(age)

[1] 23.2381

median(age)

[1] 23

quantile(age, 0.05)

5% 
16

quantile(age, 0.95)

95% 
 32

Categorical variable

Categorical variables take discrete values, such as sex (male and female), smoking habit (frequent smoker or not), age groups.

For categorical variables, the metrics above (mean and median, max and min, quantiles) do not make much sense. We use counts and percentages instead.

We use the variable species and island from the penguin data.

# categorical variable
smoker <- birth$smk
low_birthweight <- birth$low

# categories with counts
table(smoker)

smoker
nonsmoker    smoker 
      115        74

table(low_birthweight)

low_birthweight
bwt <= 2500  bwt > 2500 
         59         130

# two categories
table(smoker, low_birthweight)

           low_birthweight
smoker      bwt <= 2500 bwt > 2500
  nonsmoker          29         86
  smoker             30         44

# to create percentage, divide by number of subjects
table(smoker)/length(smoker)

smoker
nonsmoker    smoker 
0.6084656 0.3915344

Visualization

Histogram

hist(age)

Box plot

A boxplot gives the position of

median (50th percentile; second quartile)
25th and 75th percentile (first and third quartile)
maximum and minimum
outliers

boxplot(age)

# multiple boxplot together
boxplot(bwt ~ smk, data = birth)

Scatterplot of two continuous variables

plot(birth$age, birth$bwt)

Exercise 1 (weight)

1a

Generate a variable named weight, with the following measurements:

50 75 70 74 95 83 65 94 66 65 65 75 84 55 73 68 72 67 53 65

1b

Make a simple descriptive analysis of the variable, what are the mean, standard deviation, maximum, minimum and quantiles?

How to interpret the data?

1c

Make a histogram.

1d

Make a boxplot. What do the two dots on the top represent?

Exercise 2 (lung function)

Lung function has been measured on 106 medical students. Peak expiratory flow rate (PEF, measured in liters per minute) was measured three times in a sittinng position, and three times in a standing position.

The variables are

Age (years)
Gender (1 is female, 2 is male)
Height (cm)
Weight (kg)
PEF measured three times in a sitting position (pefsit1, pefsit2, pefsit3)
PEF measured three times in a standing position (pefsta1, pefsta2, pefsta3)
Mean of the three measurements made in a sitting position (pefsitm)
Mean of the three measurements made in a standing position (pefstam)
Mean of all six PEF values (pefmean)

2a)

Download and open PEFH98-english.dta into R.

If you have problem with .dta data format, you can also use PEFH98-english.csv.

Pay attention to how gender is coded. We might have to modify it.

2b)

How many observations are there (number of subjects)? How do you get a list of variable names from your dataset?

Make a histogram for each of the following variables. Compute means, and interpret the results.

height
weight
age
pefsitm
pefstam

2c)

Make histograms for the variables height and pefmean for men and women separately. Also try to make boxplots.

What conclusion can you draw?

2d)

Make three scatterplots to compare

pefmean with height
pefmean with weight
pefmean with age

What association do you see?

Solution

Exercise 1 (weight)

1a

Generate a variable named weight, with the following measurements:

50 75 70 74 95 83 65 94 66 65 65 75 84 55 73 68 72 67 53 65

weight <- c(50, 75, 70, 74, 95, 
            83, 65, 94, 66, 65, 
            65, 75, 84, 55, 73, 
            68, 72, 67, 53, 65)

1b

Make a simple descriptive analysis of the variable, what are the mean, median, maximum, minimum and quantiles?

How to interpret the data?

mean(weight)

[1] 70.7

median(weight)

[1] 69

max(weight)

[1] 95

min(weight)

[1] 50

# alternatively, 
summary(weight)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   50.0    65.0    69.0    70.7    75.0    95.0

1c

Make a histogram.

hist(weight)

1d

Make a boxplot. What do the two dots on the top represent?

boxplot(weight)

They are the largest two points in the dataset. These are outliers in this data.

(Optional) Outliers in box plot

There are different ways to define outliers. The default box plot in R determines points beyond \(Q_1 - 1.5\times IQR\) and \(Q_3 + 1.5\times IQR\) are outliers. You can check the values of quartiles using summary(), and read up the documentation ?boxplot().