# create a numeric variable
number_1 <- 3
number_1
[1] 3
R is a language and environment for statistical computing, data analysis, visualisation and graphics, among other uses. It is free and open-source software, released under the terms of the GNU General Public License.
R runs on a wide variety of platforms, including Windows, Linux and macOS.
To get help on a function, type `?functionName` in the console.
To create a variable, type `variable_name <- variable_value` in the console.
You can carry out **mathematical calculations** on numeric variables, such as exponentiation, addition and division.
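A minimal sketch of arithmetic on numeric variables. The names `x` and `y` are illustrative and do not come from the course material.

```r
# Hypothetical variables, chosen only for illustration
x <- 6
y <- 3

sum_xy   <- x + y   # addition
ratio_xy <- x / y   # division
power_xy <- x^y     # exponentiation: x to the power of y

sum_xy
ratio_xy
power_xy
```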
In R, there are a few types of variables. The ones you will interact with are:
Note that lines starting with `#` are comments and are not evaluated.
To evaluate (or return) the variable you have created, you can either type the name of the variable, or use `print()` with the variable name inside the brackets.
You can check the variable type using `class(variable_name)`:
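As a small sketch, the following creates one variable of each common type, returns one of them in both ways, and checks their types. The names `name_1` and `flag_1` are made up for this example.

```r
number_1 <- 3           # numeric
name_1 <- "student_a"   # character (hypothetical name)
flag_1 <- TRUE          # logical (hypothetical name)

number_1         # typing the name returns the value
print(number_1)  # print() returns the same value

class(number_1)  # "numeric"
class(name_1)    # "character"
class(flag_1)    # "logical"
```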
It is good practice to give your variable a name that is both easy to understand, and also valid.
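- Names are case-sensitive: `VariableA` is not the same as `variablea`.
- A name can contain digits but cannot start with one: `variable3` is valid, but NOT `22variable`.
Avoid the following:
- Special characters: `var.A`, `var$A` have special meanings in R.
- Names of existing functions and keywords: `function`, `list` and so on. If you really can’t think of a better name, you can use names such as `my_function` or `list_1` to avoid the ambiguity.
A vector is an ordered collection of values of the same type; it can be numeric, character or logical.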
To create a vector, use the function `c()`.
[1] 1 2 3 4 5
[1] "student_a" "student_b" "student_c"
[1] TRUE FALSE TRUE FALSE
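The three outputs above could be produced by code along the following lines. The names `num_vector` and `char_vector` appear later in this section; `logical_vector` is an assumed name.

```r
# One vector of each type, built with c()
num_vector <- c(1, 2, 3, 4, 5)
char_vector <- c("student_a", "student_b", "student_c")
logical_vector <- c(TRUE, FALSE, TRUE, FALSE)  # assumed name

num_vector
char_vector
logical_vector
```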
There are some shortcuts to create a sequence of values; not required to learn, but very useful.
# numeric
# num_vector <- c(1, 2, 3, 4, 5)
num_vector <- 1:5 # from 1 to 5
seq(from = 1, to = 11, by = 2) # from 1 to 11, with 2 between each
[1] 1 3 5 7 9 11
rep(1, times = 5) # repeat the value 1 five times
[1] 1 1 1 1 1
# character
# char_vector <- c('student_a', 'student_b', 'student_c')
char_vector <- paste0('student_', c('a', 'b', 'c'))
char_vector
[1] "student_a" "student_b" "student_c"
In a vector, all elements must have the same type. If you try to combine multiple types in the same vector, such as a number and a character, R will convert them to a common type.
Try to combine the following values into a vector, and see what happens.
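The values used in the original exercise are not shown here, so the following is only a sketch with made-up values. It illustrates how R converts mixed inputs to a single type.

```r
# Made-up values for illustration
mixed_1 <- c(1, "a")        # number and character
mixed_2 <- c(1, TRUE)       # number and logical
mixed_3 <- c(1, "a", TRUE)  # all three

class(mixed_1)  # "character": the number becomes "1"
class(mixed_2)  # "numeric": TRUE becomes 1
mixed_3         # "1" "a" "TRUE"
```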
You can combine multiple vectors using `c()`. For example, if `vec1` has 3 elements and `vec2` has 2 elements (assuming that they are of the same type), combining them gives a vector of 5 elements.
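A short sketch of this, with values chosen for illustration and `vec_combined` as an assumed name:

```r
vec1 <- c(1, 2, 3)  # 3 elements
vec2 <- c(4, 5)     # 2 elements

vec_combined <- c(vec1, vec2)  # assumed name for the result
vec_combined
length(vec_combined)  # 5
```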
A matrix can be thought of as a stack of vectors. When you collect data from \(n\) patients (or subjects), you measure a few aspects on each patient such as age, sex, height and smoking. Let’s say you have measured \(p\) aspects. This forms a matrix of size \(n \times p\).
You might not need to create a matrix from scratch in R (because the focus of this course is data analysis); but it is helpful to understand some basic data manipulation commands.
You can create a matrix using `matrix()`, with some parameters:
     [,1] [,2]
[1,]    1    2
[2,]    3    4
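One way to obtain the matrix printed above is sketched below. The call is an assumption: the original code is not shown, and `byrow = TRUE` is used because the output is filled row by row. The name `small_matrix` is also made up.

```r
# nrow and ncol set the size; byrow = TRUE fills the matrix one row at a time
small_matrix <- matrix(c(1, 2, 3, 4), nrow = 2, ncol = 2, byrow = TRUE)
small_matrix
```

Without `byrow = TRUE`, R fills the matrix column by column, so the first column would be 1, 2.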
You can also create a matrix by combining two vectors of the same size, using `cbind()` or `rbind()`, which stand for “column bind” and “row bind”.
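A brief sketch of both functions; the vectors `v1` and `v2` are made up for this example.

```r
v1 <- c(1, 2, 3)
v2 <- c(4, 5, 6)

by_column <- cbind(v1, v2)  # each vector becomes a column: 3 rows, 2 columns
by_row <- rbind(v1, v2)     # each vector becomes a row: 2 rows, 3 columns

by_column
by_row
```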
A dataframe (`data.frame` in R) is a data format commonly used in data analysis with R and Python. It can be thought of as a matrix that allows a mixture of data types, such as numeric and categorical measurements (age and sex).
In this course, you will mostly be working with dataframes.
We create a small dataframe of 3 subjects:
This is how you can present the dataframe, where each column has a different data type.
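The code that builds this dataframe is not shown, so the following is a sketch that reproduces the values printed later in this section. The object name `data` is taken from the access examples below.

```r
# Three subjects, one column per measurement
data <- data.frame(
  age = c(20, 50, 32),                # numeric
  sex = c("male", "female", "male"),  # categorical, stored as text
  has_covid = c(TRUE, TRUE, FALSE)    # logical
)

data
str(data)  # shows the type of each column
```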
You can find the size of a vector with `length()`.
For a matrix or dataframe, you can use `dim()`. It returns `nrow ncol`, the number of rows and the number of columns.
[1] 2
[1] 2 2
[1] 3 3
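The objects behind the three outputs above are not shown. The sketch below uses assumed objects of matching sizes: a vector of 2 elements, a 2 by 2 matrix and a dataframe of 3 rows and 3 columns.

```r
# Assumed objects, sized to match the outputs above
size_vec <- c(1, 2)
size_matrix <- matrix(1:4, nrow = 2)
size_data <- data.frame(
  age = c(20, 50, 32),
  sex = c("male", "female", "male"),
  has_covid = c(TRUE, TRUE, FALSE)
)

length(size_vec)  # 2
dim(size_matrix)  # 2 2
dim(size_data)    # 3 3
```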
`dim()` or `length()`
If you use `dim()` on a vector, it returns `NULL`. Since a vector can be thought of as a matrix with 1 row (or column), this may seem counterintuitive.
Nonetheless, `dim()` works on matrix objects: if you convert the vector into a matrix with `nrow = 1` or `ncol = 1`, `dim()` will work.
If you use `length()` on a matrix, it returns the total number of elements, i.e. `ncol` times `nrow`.
You can also use `nrow()` and `ncol()` to get the number of rows and columns explicitly.
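These behaviours can be checked with a small sketch; all object names here are made up.

```r
six_values <- c(1, 2, 3, 4, 5, 6)
dim(six_values)  # NULL: a plain vector has no dimensions

# Convert the vector into a matrix with a single row
one_row <- matrix(six_values, nrow = 1)
dim(one_row)  # 1 6

# The same values arranged as a 2 by 3 matrix
two_by_three <- matrix(six_values, nrow = 2)
length(two_by_three)  # 6, the total number of elements
nrow(two_by_three)    # 2
ncol(two_by_three)    # 3
```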
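For a vector, you can access elements by position with square brackets, for example `vector[i]`, where `i` is a single index or a vector of indices.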
Sometimes you might need to combine previous knowledge to get what you want (e.g. to know how many elements in total there are).
[1] "c"
[1] "c" "e"
[1] "e" "f" "g" "h"
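The outputs above are consistent with a vector of the letters a to h. The sketch below assumes such a vector, with `letter_vector` as a made-up name, and uses `length()` to reach the last element.

```r
letter_vector <- c("a", "b", "c", "d", "e", "f", "g", "h")  # assumed data

letter_vector[3]         # third element: "c"
letter_vector[c(3, 5)]   # third and fifth elements: "c" "e"

n <- length(letter_vector)  # total number of elements
letter_vector[5:n]          # fifth element to the last: "e" "f" "g" "h"
```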
For a matrix, use `matrix[r, c]` to get the element on the \(r\)-th row and \(c\)-th column, and `matrix[r, ]` or `matrix[, c]` to get all the elements on the \(r\)-th row or \(c\)-th column.
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
[1] 6
[1] 1 2 3
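A sketch that reproduces the matrix and the two results above. The construction with `byrow = TRUE` and the name `number_matrix` are assumptions.

```r
# 3 by 3 matrix filled row by row with 1 to 9
number_matrix <- matrix(1:9, nrow = 3, byrow = TRUE)

number_matrix[2, 3]  # element on row 2, column 3: 6
number_matrix[1, ]   # all elements on row 1: 1 2 3
number_matrix[, 1]   # all elements on column 1: 1 4 7
```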
For a dataframe, use `data$column_name` or `data['column_name']` to access an entire column.
Conventionally, each row is a subject, and each column is a variable (or aspect of measurement, feature, characteristic, risk factor etc.).
  age    sex has_covid
1  20   male      TRUE
2  50 female      TRUE
3  32   male     FALSE
  age  sex has_covid
1  20 male      TRUE
[1] "male" "female" "male"
[1] 20 50 32
  age
1  20
2  50
3  32
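The outputs above could come from calls such as the following. This is a sketch: the dataframe is rebuilt here so the block runs on its own, and the exact calls used in the course are not shown.

```r
data <- data.frame(
  age = c(20, 50, 32),
  sex = c("male", "female", "male"),
  has_covid = c(TRUE, TRUE, FALSE)
)

data[1, ]    # first row (first subject), returned as a dataframe
data$sex     # the sex column as a vector
data$age     # the age column as a vector
data["age"]  # the age column as a one-column dataframe
```

Note the difference between the last two: `data$age` returns a plain vector, while `data["age"]` keeps the dataframe structure.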
You might have a task where you need to filter elements based on another variable: for example, select the `age` based on `sex`. This task is done in 2 steps:
1. Create a logical vector from `sex` that marks the elements meeting your criterion; call it `sex_indicator`.
2. Select the elements of the `age` vector corresponding to `sex_indicator == TRUE`. (The operator `==` evaluates whether the criterion is met.)
The following example illustrates this process. You will use this a few times in the course, for example to select the height measured for men and women.
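The two steps can be sketched as follows, using the ages and sexes of the three subjects from the small dataframe. Choosing "male" as the criterion and the name `age_male` are assumptions made for illustration.

```r
age <- c(20, 50, 32)
sex <- c("male", "female", "male")

# Step 1: logical vector, TRUE where the criterion is met
sex_indicator <- sex == "male"
sex_indicator  # TRUE FALSE TRUE

# Step 2: keep the ages where the indicator is TRUE
age_male <- age[sex_indicator]
age_male  # 20 32
```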
Modifying existing data is easy, but you should be aware of the risks. In this class we only modify data we created in the class, so there is little risk, but you might have your own datasets to analyse in the future.
You should keep your original data in a safe place, and work on copies of it.
Version control is a good skill to learn.
[1] "a" "b" "c" "d" "E" "f" "g" "h"
[1] 1
     [,1] [,2] [,3]
[1,]   20    2    3
[2,]    4    5    6
[3,]    7    8    9
  age    sex has_covid
1  20   male      TRUE
2  50 female      TRUE
3  32   male     FALSE
  age    sex has_covid
1  20   male      TRUE
2  50 female     FALSE
3  32   male     FALSE
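The modifications shown in the outputs above could be made with assignments like these. This is a sketch under assumptions: the object names are made up, and each change is applied to a copy so the original stays untouched.

```r
# Original objects (assumed to match the earlier examples)
letter_vector <- c("a", "b", "c", "d", "e", "f", "g", "h")
number_matrix <- matrix(1:9, nrow = 3, byrow = TRUE)
data <- data.frame(
  age = c(20, 50, 32),
  sex = c("male", "female", "male"),
  has_covid = c(TRUE, TRUE, FALSE)
)

# Work on copies rather than the originals
letters_modified <- letter_vector
letters_modified[5] <- "E"         # replace the fifth element

matrix_modified <- number_matrix
matrix_modified[1, 1] <- 20        # replace the element on row 1, column 1

data_modified <- data
data_modified$has_covid[2] <- FALSE  # change has_covid for the second subject

letters_modified
matrix_modified
data_modified
```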
Before importing a dataset, you need to know where it is, and how to tell R to find it in your file system.
You can think of the working directory as the folder where R looks for (and saves) your files by default.
You can check where your working directory is by running `getwd()`.
You can manually set this to a folder of your choosing with `setwd(path)`.
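A minimal sketch of checking and changing the working directory. Here `tempdir()` stands in for a folder of your choosing; in practice you would pass your own path.

```r
original_wd <- getwd()  # check the current working directory
original_wd

setwd(tempdir())        # tempdir() is only a stand-in for your own folder
getwd()                 # the working directory has changed

setwd(original_wd)      # switch back to where we started
```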
It is recommended to use an R project. It sets a folder just for the tasks you are currently working on, so that you do not need to set the working directory every time you open RStudio. Read more about how to create an R project.
Data exist in different formats:
- `csv` is one of the most commonly used formats for tabular data. If possible, it is a good idea to use this format, as it is readable by many languages and software packages.
- `xlsx` is also good for storing tabular data; however, it is slightly more complicated than `csv`.
- `rda` can be used to store R data (such as lists and higher-dimensional arrays).
- Other formats also exist (such as `dta`, created by Stata), and they require specific R packages to load.
It is difficult to summarise all the data formats here, so you should check the documentation on how to import and write (save) data of different types.
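As a sketch for the `csv` case only, base R provides `write.csv()` and `read.csv()`. The file name `example_data.csv` is hypothetical, and the file is written to a temporary folder so the example does not touch your own files.

```r
data <- data.frame(
  age = c(20, 50, 32),
  sex = c("male", "female", "male"),
  has_covid = c(TRUE, TRUE, FALSE)
)

# Hypothetical file name, saved in a temporary folder
csv_path <- file.path(tempdir(), "example_data.csv")

write.csv(data, csv_path, row.names = FALSE)  # save without row numbers
data_loaded <- read.csv(csv_path)             # import the file again

data_loaded
```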