Introduction to R

Create a variable, data types, data structure, basic data manipulation, import data

R is a language and environment for statistical computing, data analysis, visualisation and graphics and many more. It is a free and open source software, under the terms of GNU General Public License.

R runs on a wide variety of platforms, including Windows, Linux and MacOS.

Getting help
  • Use Google well! There are a lot of discussions in forums such as StackOverflow
  • Cheatsheets made by Posit and community
  • Type ?functionName in the console
  • Package vignettes (longer format of documentation)

Create variables

Create a numeric variable

To create a variable, you type variable_name <- variable_value in the console.

# create a numeric variable number_1
a <- 3
a
[1] 3

You can carry out **mathematical calculation8* on numeric variables, such as exponentiate, addition, division and many more.

# assign values to variables a, b, c
a <- 3
b <- 4
c <- 7

# calculate the average of a,b,c
# output directly
(a+b+c)/3
[1] 4.666667
# or, save into a new variable d
d <- (a+b+c)/3
d
[1] 4.666667
# e to the power of a (e = 2.7182)
exp(a)
[1] 20.08554

Data types

In R, there are a few types of variables. The ones you will interact with are:

  • numeric (real numbers): 1.2, -5
  • integer: 1, 2, 2000
  • character (strings): “male”, “female”
  • logical (binary, 1/0): True or False

Note that code that start with # are comments, and are not evaluated.

# create a numeric variable number_1
number_1 <- 1.2

# a character variable student
student <- 'hadley'

# a logical variable true_or_false
true_or_false <- T

To evaluate (or return) the variable you have created, you can either type the name of the variable, or print() with the variable name inside the bracket.

number_1
[1] 1.2
print(number_1)
[1] 1.2

You can check the variable type using class(variable_name):

class(number_1)
[1] "numeric"
class(student)
[1] "character"
class(true_or_false)
[1] "logical"
Name your variable

It is good practice to give your variable a name that is both easy to understand, and also valid.

  • Names are case sensitive, VariableA is not the same as variablea
  • Numbers can not be a variable name by itself. Combining numbers and letters is allowed, but should start with a letter, such as variable3, but NOT 22variable
  • You can use underscores (“snake_case” naming style). In fact it encourages readability, so it is my personal favoriate.

Avoid the following:

  • Other special characters, such as dot and dollar sign: var.A, var$A have special meanings in R.
  • Avoid using function names like function, list and so on. If you really can’t think of a better name, you can use names my_function, list_1 to avoid the ambiguity.

Data structure

Vectors

A vector is a list of values; it can be numeric, and also characters and logical.

To create a vector, use function c().

# numeric
num_vector <- c(1, 2, 3, 4, 5)
num_vector
[1] 1 2 3 4 5
# character
char_vector <- c('student_a', 'student_b', 'student_c')
char_vector
[1] "student_a" "student_b" "student_c"
# logical 
logical_vector <- c(T, F, T, F)
logical_vector
[1]  TRUE FALSE  TRUE FALSE

There are some shortcuts to create a sequence of values; not required to learn, but very useful.

# numeric
# num_vector <- c(1, 2, 3, 4, 5)
num_vector <- 1:5 # from 1 to 5
seq(from = 1, to = 11, by = 2) # from 1 to 11, with 2 between each
[1]  1  3  5  7  9 11
rep(1, 5) # repeat 1 for 5 times
[1] 1 1 1 1 1
# character
# char_vector <- c('student_a', 'student_b', 'student_c')
char_vector <- paste0('student_', c('a', 'b', 'c'))
char_vector
[1] "student_a" "student_b" "student_c"
Types of elements in a vector

In a vector, types of the elements must be the same. If you try to combine multiple types of variables in the same vector, such as a numeric number and a character, R will try to convert them into the same type.

Try to combine the following values into a vector, and see what happens.

  • 1.52, “student_a”
  • 1.52, TRUE (logical)
  • TRUE, “student_a”

Combine multiple vectors

You can combine multiple vectors using c(). For example, vec1 has 3 elements, vec2 has 2 elements (assuming that they are of the same type), combining them gives 5 elements.

vec1 <- c(1, 3, 5)
vec2 <- c(100, 101)
c(vec1, vec2)
[1]   1   3   5 100 101
# you can also save it into a new variable, 
# so that you can access it in the future
vec_combined <- c(vec1, vec2)
vec_combined
[1]   1   3   5 100 101

Matrix

A matrix can be thought of as a stack of vectors. When you collect data from \(n\) patients (or subjects), you measure a few aspects on each patient such as age, sex, height and smoking. Let’s say you have measured \(p\) aspects. This forms a matrix of size \(n \times p\).

You might not need to create a matrix from scratch in R (because the focus of this course is data analysis); but it is helpful to understand some basic data manipulation commands.

You can create a matrix using matrix(), with some parameters:

matrix_1 <- matrix(data = c(1, 2, 3, 4), nrow = 2, ncol = 2, byrow = T)
matrix_1
     [,1] [,2]
[1,]    1    2
[2,]    3    4

You can also create a matrix by combining two vectors of the same size, using cbind() or rbind(), which stands for “column bind” and “rowbind”.

vec1 <- c(1, 2)
vec2 <- c(3, 4)

# bind by columnn
matrix_c <- cbind(vec1, vec2)
matrix_c
     vec1 vec2
[1,]    1    3
[2,]    2    4
# bind by row
matrix_r <- rbind(vec1, vec2)
matrix_r
     [,1] [,2]
vec1    1    2
vec2    3    4

Dataframe

Dataframe, data.frame is a format of data commonly used in data analysis with R and python. It can be considered as a matrix, but allows a mixture of data types, such as numeric and categorical measurements (age and sex).

In this course, you will mostly be working with dataframes.

We create a small dataframe of 3 subjects:

  • Subject 1 is a 20 years-old male who has covid
  • Subject 2 is a 50 years-old female who has covid
  • Subject 3 is a 32 years-old male who does not have covid

This is how you can present the dataframe, where each column has a different data type.

mini_data <- data.frame(
  age = c(20, 50, 32), 
  sex = c('male', 'female', 'male'), 
  has_covid = c(T, T, F)
)
mini_data
  age    sex has_covid
1  20   male      TRUE
2  50 female      TRUE
3  32   male     FALSE

Basic data manipulation

Dimension of your data

You can find the size of a vector with length().

For a matrix or dataframe, you can use dim(). It will return nrow ncol, number of rows and number of columns.

vec1 <- c(1, 2)
length(vec1)
[1] 2
# matrix
mat <- matrix(data = c(1, 2, 3, 4), nrow = 2, byrow = 2)
dim(mat)
[1] 2 2
# dataframe
dim(mini_data)
[1] 3 3
dim() or length()

If you use dim() on a vector, it returns NULL. Given that a vector is just a matrix with 1 row (or column), this seems insensible.

Nonetheless, dim() works on matrix objects. if you convert the vector into a matrix with nrow =1 or ncol = 1, dim() will work.

If you use length() on a matrix, it will return the total number of elements, i.e. ncol times nrow.

You can also use nrow(), ncol() to get the number of rows and columns explicitly.

# ncol, nrow
# dim(mini_data)
nrow(mini_data)
[1] 3
ncol(mini_data)
[1] 3

Accessing elements in your data

For a vector, you can access

  • an element at a given position
  • multiple elements at given positions
  • elements beyond, or below a certain element

Sometimes you might need to combine previous knowledge to get what you want (e.g. to know how many elements in total there are).

letters <- c('a', 'b', 'c', 'd', 'e', 'f', 'g', 'h')

# 3rd letter
letters[3]
[1] "c"
# 3rd, and 5th
letters[c(3, 5)]
[1] "c" "e"
# letters beyond 4
letters[5:8] # or, letters[5:length(letters)]
[1] "e" "f" "g" "h"

For a matrix,

  • matrix[r, c] to get the element on \(r\)-th row, \(c\)-th column.
  • matrix[r, ], matrix[, c] to get all the elements on \(r\)-th row or \(c\)-th column
mat_3by3 <- matrix(data = 1:9, nrow = 3, byrow = T)
mat_3by3
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
# element (2,3)
mat_3by3[2, 3]
[1] 6
# first row
mat_3by3[1,]
[1] 1 2 3

For a dataframe,

  • you can use indices (row, col) in the same way as matrices above;
  • use data$column_name, or data['column_name'] to access the entire column

Conventionally, each row is a subject, and each columnn is a variable (or aspect of measurement, feature, characteristic, risk factor etc).

mini_data
  age    sex has_covid
1  20   male      TRUE
2  50 female      TRUE
3  32   male     FALSE
# first row 
mini_data[1, ]
  age  sex has_covid
1  20 male      TRUE
# second col
mini_data[, 2]
[1] "male"   "female" "male"  
# via column name using $
mini_data$age
[1] 20 50 32
# alternatively, 
mini_data['age']
  age
1  20
2  50
3  32
Filter data based on criteria

You might have a task where you need to filter elements based on another variable: for example, select the age based on sex. This task is done in 2 steps:

  • create a logical (binary, true or false) variable on sex, call it sex_indicator
  • select the elements in age vector, corresponding to sex_ind == TRUE. (The operator == evaluates whether the criteria is met)

The following example illustrates this process. You will use this a few times in the course, for example to select the height measured for men and women.

age <- c(55, 60, 65)
sex <- c('Male', 'Female', 'Male')

# select age for only female
# first create a variable indicating 'sex == Female'
# i.e. if the element is Female, returns T; otherwise, F

sex_indicator <- sex == 'Female'
sex_indicator
[1] FALSE  TRUE FALSE
# next combine age with sex_indicator, this only selects the 2nd element
age[sex_indicator] 
[1] 60
# you can skip the middle step:
age[sex == 'Female']
[1] 60

Modify existing data (optional)

Keep your original data safe!

Modifying an existing data is easy, but you should be aware of the risks. In this class we only modify data we created in the class so there is little risk, but you might have your own datasets to analyse in the future.

You should keep your original data in a safe place, and work on copies of it.

Version control is a good skill to learn.

# vector
# make e into E 
letters[5] <- 'E'
letters
[1] "a" "b" "c" "d" "E" "f" "g" "h"
# matrix
# make (1, 1) 20
mat_3by3[1, 1] # originally was 1
[1] 1
mat_3by3[1, 1] <- 20
mat_3by3
     [,1] [,2] [,3]
[1,]   20    2    3
[2,]    4    5    6
[3,]    7    8    9
# dataframe
# make so that subject 2 does not have covid
mini_data
  age    sex has_covid
1  20   male      TRUE
2  50 female      TRUE
3  32   male     FALSE
mini_data$has_covid[2] <- F
mini_data
  age    sex has_covid
1  20   male      TRUE
2  50 female     FALSE
3  32   male     FALSE

Import data

Before importing a dataset, you need to know where it is, and how to tell R to find it in your file system.

Working directory, R project

You can think of the working directory as the folder where R looks for (and saves) your scripts by default.

You can check where your working directory by running the following command.

getwd()
[1] "/Users/andrea/Documents/GitHub/course_med3007/lab"

You can manually set this to a folder of your choosing by setwd(path).

It is recommanded to use R project. It sets a folder just for the current tasks you work on, so that you do not need to set the working directory every time you open RStudio. Read more about how to create an R project.

Import data

Data exist in different formats,

  • csv is one of the most commonly used data format for tabular data. If possible, it is a good idea to use this data format as it is readable by different languages and softwares
  • xlsx is also good for storing tabular data; however it is slightly more complicated than csv.
  • rda can be used to store R data (such as lists, higher dimensional arrays);
  • Some formats are data created by foreign softwares (such as dta created by STATA), and they would require some specific R packages to load in.

It is difficult to summarise all the data formats here, so you should check the documentation on how to import and write (save) data of different types.