Exploratory data analysis (Part II)

More on data manipulation, visualisation

Dataset: liggetid (rda link, csv link)

R script


Overview

In the previous section (EDA Part I), we have introduced some basic manipulation, where we focused mostly on 1 or 2 variables.

In this part, we will introduce how to

  • select multiple columns from a data frame (base R and tidyverse);
  • assign new column names for variables (base R and tidyverse);
  • explore a dataset with slightly more advanced visualization using ggplot.

Tidyverse is a collection of R packages for data manipulation and visualization. It is widely used in the R community, both in academia and industry.

This course is introductory statistics where statistical methodology is the focus. However, we think that exploratory data analysis (EDA) and an intuitive way of understanding of your data is crucial for selecting the correct method; that is why we have this section.

Do I need to learn tidyverse and ggplot?

Tidyverse data manipulation and ggplot are for slightly more experienced R users; no, you do not need them for your home exam.

If you are new to R, focus on base R functions and plots (scatter, boxplot, histogram, qqplot).

Explore dataset: length of hospital stay

The data liggetid was collected at the Geriatric Department at UllevÄl Sykehus. For a complete description of the variables, please refer to Exercise 2 in Non-parametric tests.

We will focus on the following variables:

  • Year of hospital admission (innaar)
  • Gender, men and women (kjoenn)
  • Admission from, where 1 = home, 2 = Div. of Medicine, 3 = Div. of Surgery, 4 = Other division, 5 = Other hospital, 6 = Nursing home (kom_fra)
  • Stroke, where 1 = yes, 0 = no (slag)
  • Age (alder)
  • Hospital stay, in days (liggetid)

We will first select a subset of the variables above from the original dataset, rename the variables, filter (to extract rows or subjects from) the dataset based on one or several conditions; and then introduce some advanced visualization techniques as part of the exploration (not required in home exam).

liggetid <- read.csv('data/liggetid.csv')
head(liggetid, 3)
  faar fmaan fdag innaar innmaan inndag utaar utmaan utdag kjoenn kom_fra slag
1 1906     3    4   1987       3      5    87      3    18 kvinne       1    0
2 1891     4    3   1987       3      6    87      3    23 kvinne       1    0
3 1908     9    6   1987       3     10    87      3    16 kvinne       1    0
  alder liggetid lnliggti kom_fra2 kom_fra3 kom_fra4 kom_fra5 kom_fra6 status
1    81       13 2.564949        0        0        0        0        0      1
2    96       17 2.833213        0        0        0        0        0      1
3    79        6 1.791759        0        0        0        0        0      1

Base R functions

Subsetting and rename variables

In the previous section, we have used the following commands when we first explore a dataset.

# basic descriptive 
dim(liggetid)
colnames(liggetid)
ncol(liggetid)
summary(liggetid) # on the entire dataset
summary(liggetid$liggetid) # on one variable

You can also use str() on the dataset to get some information of the data.

# str (structure) tells you data type 
str(liggetid) 

To take a smaller subset of the dataset, you can write in the column name in the square bracket. You can not extract more than 1 variable using the $.

# year of admission, age, sex, admission from, stroke, length of stay
data_los <- liggetid[, c('innaar', 'alder', 'kjoenn', 
                         'kom_fra', 'slag', 'liggetid')]
head(data_los, 3)
  innaar alder kjoenn kom_fra slag liggetid
1   1987    81 kvinne       1    0       13
2   1987    96 kvinne       1    0       17
3   1987    79 kvinne       1    0        6

Now we can rename the variables in English; or any name you prefer. Simply assign the names(dataset) with the new names.

# rename in english
names(data_los) # same as colnames(data_los)
[1] "innaar"   "alder"    "kjoenn"   "kom_fra"  "slag"     "liggetid"
names(data_los) <- c('adm_year', 'age', 'sex', 'adm_from', 'stroke', 'los')
head(data_los, 3)
  adm_year age    sex adm_from stroke los
1     1987  81 kvinne        1      0  13
2     1987  96 kvinne        1      0  17
3     1987  79 kvinne        1      0   6

Filter data

Filtering a dataset is a very common task in data analysis. It is often the case where you need to filter based on gender, a certain range of values or combined.

Note that with base R filtering, you need to call the variable in this way: data$variable (e.g. data_los$sex). If you want to avoid this, you can try the tidyverse solution (next section).

# base R filtering needs you to use data$variable inside the square bracket
# filter based on sex == kvinne
dsex_kvinne <- data_los[data_los$sex == 'kvinne', ]
# dsex_kvinne # print out yourself to see the outcome!

# based on length of stay over 1000
data_los[data_los$los > 1000, ]
     adm_year age    sex adm_from stroke  los
1046     1982  82 kvinne        2      1 1020
1048     1982  86 kvinne        3      0 1022
1062     1981  87 kvinne        2      0 1074
1074     1981  71 kvinne        3      0 1013
1104     1981  76 kvinne        2      0 1022
1120     1981  85 kvinne        1      0 1215
1128     1982  78 kvinne        3      0 1037
1131     1982  86 kvinne        4      0 1010
# stroke has some NA, let us examine those
dstroke_na <- data_los[is.na(data_los$stroke), ]
# dstroke_na # print out yourself to see the outcome!

# combine multiple conditions with &
data_los[data_los$adm_year == 1986 & 
           data_los$age > 80 & 
           data_los$sex == 'mann' &
           data_los$los > 100, ]
    adm_year age  sex adm_from stroke los
366     1986  87 mann        2     NA 214
399     1986  82 mann        2     NA 123
408     1986  93 mann        2     NA 170
552     1986  82 mann        3      0 138
583     1986  86 mann        3      0 135

Tidyverse and data.table

Tidyverse is a collection of R packages that are very useful for data manipulation and visualization. Packages such as dplyr, tidyr, ggplot2 are all part of tidyverse.

data.table is an R package (and a data format) that facilitates large dataset manipulation; however there are some useful syntax that are handy for small datasets as well.

In this section we will load the packages via library(package_name).

Learning tidyverse

As tidyverse is a complex ecosystem for R packages, we are not going to cover the details; rather, we recommend that you try to install the packages and run the code in this section.

To install an R package (such as ggplot2), type install.packages('ggplot2') in the console.

Subsetting and rename variables

First load the packages.

# install.packages('tidyverse')
library(data.table)
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:data.table':

    between, first, last
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(ggplot2)
library(RColorBrewer) # color palette

Use select() (from dplyr package) to select a subset of the data

data_los2 <- select(liggetid, c('innaar', 'alder',  'kjoenn', 
                                'kom_fra', 'slag', 'liggetid'))
head(data_los2, 3)
  innaar alder kjoenn kom_fra slag liggetid
1   1987    81 kvinne       1    0       13
2   1987    96 kvinne       1    0       17
3   1987    79 kvinne       1    0        6

Use setnames(data, oldname, newname) (from data.table package) to change variable names one by one.

# rename
# note: you can also use the base R way here
setnames(data_los2, old = 'innaar', new = 'adm_year')
setnames(data_los2, old = 'alder', new = 'age')
setnames(data_los2, old = 'kjoenn', new = 'sex')
setnames(data_los2, old = 'kom_fra', new = 'adm_from')
setnames(data_los2, old = 'slag', new = 'stroke')
setnames(data_los2, old = 'liggetid', new = 'los')

head(data_los2, 3)
  adm_year age    sex adm_from stroke los
1     1987  81 kvinne        1      0  13
2     1987  96 kvinne        1      0  17
3     1987  79 kvinne        1      0   6

Filter data with dplyr

The benefit of using the filter function from dplyr package is that you do not need to use the $ anymore:

# filter based on sex == kvinne
dsex_kvinne2 <- filter(data_los2, sex == 'kvinne')
# dsex_kvinne2 # print out yourself to see what happened here

# based on length of stay over 1000
filter(data_los2, los > 1000)
  adm_year age    sex adm_from stroke  los
1     1982  82 kvinne        2      1 1020
2     1982  86 kvinne        3      0 1022
3     1981  87 kvinne        2      0 1074
4     1981  71 kvinne        3      0 1013
5     1981  76 kvinne        2      0 1022
6     1981  85 kvinne        1      0 1215
7     1982  78 kvinne        3      0 1037
8     1982  86 kvinne        4      0 1010
# combine multiple conditions with &
# admission year 1986, age greater than 80, sex mann
filter(data_los2, adm_year == 1986 & age > 80 & sex == 'mann' & los > 100)
  adm_year age  sex adm_from stroke los
1     1986  87 mann        2     NA 214
2     1986  82 mann        2     NA 123
3     1986  93 mann        2     NA 170
4     1986  82 mann        3      0 138
5     1986  86 mann        3      0 135

Visualization with ggplot2

Resources for ggplot2

If you want to learn ggplot2, you can start with the following resources:

  • Software carpentry workshop material link
  • Lecture notes at UiO software carpentries workshop by Chi link
  • ggplot2 book by Hadley Wickham link

Google often, and re-use your own code from other plots!

## Data processing

It is usually required to pre-process the data before visualization.

We know that there are some missing values, so we can remove them now (with filter).

Some variables also need to be re-coded so that they display nicely in the plot. For example, we can code adm_from (admission from) with text.

#data_los <- liggetid[, c('innaar', 'alder', 'kjoenn', 
#                         'kom_fra', 'slag', 'liggetid')]
#names(data_los) <- c('adm_year', 'age', 'sex', 'adm_from', 'stroke', 'los')

# remove NA 
data_los <- filter(data_los, !is.na(sex) & !is.na(stroke) & !is.na(adm_from))

# code admission from with text
data_los$adm_from <- as.character(data_los$adm_from)
data_los$adm_from <- factor(data_los$adm_from, 
                            levels = c('1', '2', '3', '4', '5', '6'), 
                            labels = c('Home', 'Div.Medicine', 'Div.Surgery', 
                                       'Div.Other', 'Other hospital', 'Nursing home'))

# code admission year with text
data_los$adm_year <- factor(data_los$adm_year,
                            levels = c(1981:1987),
                            labels = as.character(1981:1987))

data_los$stroke <- factor(data_los$stroke, 
                          levels = c(0, 1), 
                          labels = c('no','yes'))

Scatter plot

We first explore age and length of hospital stay (los) with scatter plots.

# age vs length of stay
# baseR:
plot(data_los$age, data_los$los)

# ggplot 
plt_scat <- ggplot(data = data_los, 
                   mapping = aes(x = age, y = los))
plt_scat <- plt_scat + geom_point() # add point
plt_scat

# can make some customizations: change titles, bigger fonts
plt_scat <- plt_scat + labs(
  x = 'Age', 
  y = 'Length of hosptial stay (days)', 
  title = 'Length of stay versus age'
)
plt_scat <- plt_scat + theme_bw() # make white background
plt_scat <- plt_scat + theme(
  axis.text = element_text(size = 15),
  axis.title = element_text(size = 20), 
  plot.title = element_text(size = 20)
)
plt_scat

You can add more information in the plot, with colors and shapes.

# can add on more information: color
plt_scat2 <- ggplot(data = data_los, 
                    mapping = aes(x = age, y = los, shape = sex, color = sex))
plt_scat2 <- plt_scat2 + geom_point(size = 2, alpha = 0.7)
# customize
plt_scat2 <- plt_scat2 + labs(
  x = 'Age', 
  y = 'Length of hosptial stay (days)', 
  title = 'Length of stay versus age'
)
plt_scat2 <- plt_scat2 + theme_bw() # make white background
# change text size
plt_scat2 <- plt_scat2 + theme(
  axis.text = element_text(size = 12),
  axis.title = element_text(size = 12), 
  plot.title = element_text(size = 15)
)
# change color
plt_scat2 <- plt_scat2 + scale_color_brewer(palette = 'Set1')
plt_scat2

You can also use R packages beyond the tidyverse, to add histograms on top of the scatter plot.

# can add histogram on top
library(ggExtra)

ggMarginal(plt_scat, type = 'histogram')

Histogram

plt_hist <- ggplot(data = data_los, mapping = aes(x = los, fill = sex))
plt_hist <- plt_hist + geom_histogram() 
plt_hist <- plt_hist + facet_wrap( ~ sex, ncol = 1)

# some customization
plt_hist <- plt_hist + theme_minimal() # make minimal background
# change axis
plt_hist <- plt_hist + labs(
  x = 'Length of hospital stay (days)', 
  y = 'Count', 
  title = 'Histograms for length of hospital stay'
)
# change text size
plt_hist <- plt_hist + theme(
  axis.text = element_text(size = 12),
  axis.title = element_text(size = 12), 
  plot.title = element_text(size = 15)
)
# change color
plt_hist <- plt_hist + scale_fill_brewer(palette = 'Set1')
plt_hist
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

You can plot a few histogram (or density) in the same plot.

library(ggridges)
plt_ridge <- ggplot(data = data_los, 
                    mapping = aes(x = los, y = adm_year, fill = sex))
plt_ridge <- plt_ridge + geom_density_ridges(alpha = 0.6) 
plt_ridge <- plt_ridge + theme_ridges()
plt_ridge <- plt_ridge + labs(
  x = 'Length of hosptial stay (days)', 
  y = 'Admission year', 
  title = 'Length of stay in each year, for each gender'
)
# change color
plt_ridge <- plt_ridge + scale_fill_brewer(palette = 'Set1')
plt_ridge
Picking joint bandwidth of 32.7

Box plot

# we try to plot length of stay versus year, and potentially sex and adm type
# base R is limited in this regard
par(mfrow = c(1, 3))
boxplot(los ~ adm_year, data = data_los, main = 'Length of stay vs year')
boxplot(los ~ sex, data = data_los,  horizontal = T, main = 'Length of stay vs sex')
boxplot(los ~ adm_from, data = data_los, horizontal = T,
        main = 'Length of stay vs admission type')

# ggplot can make more flexible plots
# add color 
plt_box <- ggplot(data = data_los, 
                  mapping = aes(x = adm_year, y = los, fill = sex))
plt_box <- plt_box + geom_boxplot(outlier.size = 1)
# plt_box <- plt_box + facet_wrap( ~ sex)
plt_box <- plt_box + coord_flip()

# customize
plt_box <- plt_box + theme_bw() # make white background
plt_box <- plt_box + labs(
  x = 'Admission year', 
  y = 'Length of hosptial stay (days)', 
  title = 'Length of stay in each year, both men and women'
)
plt_box <- plt_box + theme(
  axis.text = element_text(size = 12),
  axis.title = element_text(size = 12), 
  plot.title = element_text(size = 15), 
  strip.text = element_text(size = 12)
)

plt_box <- plt_box + scale_fill_brewer(palette = 'Set1')
plt_box 

# add even more information 
plt_box2 <- ggplot(data = data_los, 
                  mapping = aes(x = adm_year, y = los, fill = sex))
plt_box2 <- plt_box2 + geom_boxplot(outlier.size = 0.8)
plt_box2 <- plt_box2 + facet_wrap( ~ adm_from)


# customize
plt_box2 <- plt_box2 + theme_bw() # make white background
plt_box2 <- plt_box2 + labs(
  x = 'Admission year', 
  y = 'Length of hosptial stay (days)', 
  title = 'Length of stay in each year, each type of admission'
)
plt_box2 <- plt_box2 + theme(
  axis.text = element_text(size = 11),
  axis.title = element_text(size = 12), 
  plot.title = element_text(size = 15), 
  strip.text = element_text(size = 12), 
  axis.text.x = element_text(angle = 45) # more readable
)

plt_box2 <- plt_box2 + scale_fill_brewer(palette = 'Set1')
plt_box2