
Some basic R concepts

R objects

Types of simple R objects

  • NULL: empty object
  • logical: boolean or logical object (FALSE, TRUE)
  • numeric: numeric object
  • character: character object
  • complex: complex object (2+3i)

The operators <- and = assign into the environment in which they are evaluated. The operator <- can be used anywhere, whereas the operator = is only allowed at the top level (e.g., in the complete expression typed at the command prompt) or as one of the subexpressions in a braced list of expressions. The recommendation is to use <- in your programs.
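As a small illustration of the difference, the sketch below (the object names `a`, `b`, `m1`, `m2` and `y` are only illustrative) shows that inside a function call = names an argument, whereas <- performs an assignment:

```r
a <- 5        # usual assignment
b = 5         # also accepted at the top level
identical(a, b)

# Inside a call, = names the argument x of mean(); no other object is created
m1 <- mean(x = c(1, 2, 3))

# Inside a call, <- first creates y in the workspace, then passes it to mean()
m2 <- mean(y <- c(1, 2, 3))
exists("y")
```

This is one reason why <- is usually preferred for assignment: its meaning does not depend on where it appears.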

x <- value

x<-2.3

mode(x)
## [1] "numeric"
# The is.XXX() functions test the type of an object
# and return a boolean
is.numeric(x)
## [1] TRUE
is.character(x)
## [1] FALSE
is.complex(x)
## [1] FALSE
is.null(x)
## [1] FALSE
is.logical(x)
## [1] FALSE

To convert an object to a different mode, use the as.XXX() functions.

as.numeric(x)
## [1] 2.3
as.character(x)
## [1] "2.3"
as.null(x)
## NULL
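
Conversions do not always succeed. In the following sketch (the names `txt` and `num` are illustrative), a string that cannot be read as a number is converted to NA, and R normally emits a warning, which is silenced here:

```r
# Character strings, one of which is not a valid number
txt <- c("2.3", "10", "abc")

# as.numeric() converts what it can; "abc" becomes NA
num <- suppressWarnings(as.numeric(txt))
num

# Logical values are converted to 1 (TRUE) and 0 (FALSE)
as.numeric(c(TRUE, FALSE))

# Numbers are converted to their text representation
as.character(c(1.5, 2))
```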

Missing data

Missing data, coded NA for “not available”, has its own dedicated functions.

x<-NA
x
## [1] NA
print(x*2)
## [1] NA
# is x a missing data?
is.na(x)
## [1] TRUE
# ! negates the test: is x not missing?
!is.na(x)
## [1] FALSE
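
Missing values propagate through most computations. The sketch below, on an illustrative vector `v`, shows how to count them and how to ignore them with the na.rm argument:

```r
# A numeric vector with one missing value (illustrative data)
v <- c(1.5, NA, 3.5)

# Most computations propagate the missing value
mean(v)

# na.rm = TRUE drops the NA before computing
mean(v, na.rm = TRUE)

# Count the missing values
sum(is.na(v))

# Keep only the observed values
v[!is.na(v)]
```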

Vector

A vector is the simplest type of data structure in R. The R manual defines a vector as “a single entity consisting of a collection of things.” A collection of numbers, for example, is a numeric vector. We can have numeric vectors or character vectors or logical vectors…

#--- numeric vector built using c()
vecnum<-c(2,3.2,6.1,0.005,1e-2,150)

# print it
vecnum
## [1]   2.000   3.200   6.100   0.005   0.010 150.000
# length of my vector
length(vecnum)
## [1] 6
# Add some elements to my vector
vecnum<-c(vecnum,2.3,5.3)

vecnum
## [1]   2.000   3.200   6.100   0.005   0.010 150.000   2.300   5.300
# Test if all the values are > 1
all(vecnum > 1)
## [1] FALSE
# Test if at least one of the values of my vector is > 1
any(vecnum > 1)
## [1] TRUE
# Print the first elements of my vector
head(vecnum)
## [1]   2.000   3.200   6.100   0.005   0.010 150.000
# Print the last elements of my vector
tail(vecnum)
## [1]   6.100   0.005   0.010 150.000   2.300   5.300
# Print the 5th element of my vector
vecnum[5]
## [1] 0.01
# Print the 2nd to the 4th elements of my vector
vecnum[2:4]
## [1] 3.200 6.100 0.005
# Print the 2nd, 4th and 6th element of my vector
vecnum[c(2,4,6)]
## [1]   3.200   0.005 150.000
# Select and print the elements of my vector > 1

vecnum[vecnum > 1]
## [1]   2.0   3.2   6.1 150.0   2.3   5.3
# Select and print the elements in the interval [2,10]
vecnum[ (vecnum >= 2) & (vecnum <= 10)]
## [1] 2.0 3.2 6.1 2.3 5.3
# Print the minimum value of my vector

min(vecnum)
## [1] 0.005
# Print the position in the vector of the minimal value
which.min(vecnum)
## [1] 4
# Print the maximum of my vector
max(vecnum)
## [1] 150
# Print the position in the vector of the maximal value
which.max(vecnum)
## [1] 6
# Sort the vector by ascending order
sort(vecnum)
## [1]   0.005   0.010   2.000   2.300   3.200   5.300   6.100 150.000
# Sort the vector by descending order
rev(sort(vecnum))
## [1] 150.000   6.100   5.300   3.200   2.300   2.000   0.010   0.005
# Build a vector using seq()
myvec<-seq(1,5)

myvec
## [1] 1 2 3 4 5
# Build a vector using rep()
myvec2<-rep(1,5)

myvec2
## [1] 1 1 1 1 1
# another way
myvec3<-rep(c(1,2),each=4)

myvec3
## [1] 1 1 1 1 2 2 2 2
#--- a character vector
vecchar<-c("toto","titi","tutu")

vecchar
## [1] "toto" "titi" "tutu"
vecchar2<-rep("black",5)

vecchar2
## [1] "black" "black" "black" "black" "black"
# Test if vecchar is really a vector
is.vector(vecchar)
## [1] TRUE
#--- Build a logical vector (TRUE and FALSE without quotes, otherwise the vector is a character vector)
veclog<-c(TRUE,FALSE,TRUE)

is.vector(veclog)
## [1] TRUE
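
A vector holds a single type of element. The following sketch (with illustrative names `mixed`, `flags` and `vals`) shows what happens when types are mixed in c(), and how a logical vector can be used for counting and filtering:

```r
# Mixing types in c() coerces everything to the most general type
mixed <- c(1, "a", TRUE)
mode(mixed)

# A logical vector can be counted and used as a filter
flags <- c(TRUE, FALSE, TRUE)
sum(flags)            # number of TRUE values
vals <- c(10, 20, 30)
vals[flags]           # elements where flags is TRUE

# Arithmetic is applied element by element
vals * 2
```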

Factor

A factor is a type of vector that describes a qualitative (categorical) variable. Internally, each value is stored as an integer code rather than as the character string that labels it.

#--- Build a factor
fac <- factor(c("red", "green", "red", "blue", "green"))
fac
## [1] red   green red   blue  green
## Levels: blue green red
# the levels of fac
levels(fac)
## [1] "blue"  "green" "red"
# Test whether fac is a factor or a vector

is.vector(fac)
## [1] FALSE
is.factor(fac)
## [1] TRUE
# By default, the levels of a factor are sorted in alphabetical order
# you can impose the order of the levels
fac<-factor(c("red","green","red","blue","green"),levels=c("green","red","blue"))
fac
## [1] red   green red   blue  green
## Levels: green red blue
# To make "blue" the first level again
fac2<-relevel(fac, "blue")
fac2
## [1] red   green red   blue  green
## Levels: blue green red
#--- To update the levels ----

# Build fac2 from fac, keeping only the first 3 elements
fac2 <- fac[1:3]
fac2
## [1] red   green red  
## Levels: green red blue
# Printing fac2 shows that the level "blue" is still present even though no element uses it.
# As this can cause problems in later analyses, we update the levels of fac2 with factor()
fac2 <- factor(fac2)
fac2
## [1] red   green red  
## Levels: green red
# Now, the levels of fac2 are only values contained in the factor ("blue" disappeared)
#--- Transform a factor to character
fac3<-as.character(fac2)
fac3
## [1] "red"   "green" "red"
# Test the mode
is.factor(fac3)
## [1] FALSE
is.character(fac3)
## [1] TRUE
#--- Retrieve the internal coding of a factor
as.numeric(fac)
## [1] 2 1 2 3 1
#--- Count frequencies by levels of the fac factor
table(fac)
## fac
## green   red  blue 
##     2     2     1
# Remove the duplicated elements of a factor
# unique() function can be used on any kind of variable
unique(fac)
## [1] red   green blue 
## Levels: green red blue
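
Because as.numeric() returns the internal codes, care is needed when the labels of a factor look like numbers. A minimal sketch, using an illustrative factor `f`:

```r
# A factor whose labels look like numbers (illustrative data)
f <- factor(c("10", "5", "10", "20"))

# as.numeric() returns the internal codes, not the labels
codes <- as.numeric(f)
codes

# Convert to character first to recover the original values
values <- as.numeric(as.character(f))
values
```

Converting through as.character() is the safe route whenever the labels, and not the codes, are the quantity of interest.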

Matrix

A matrix is a base object containing only one type of element: a matrix therefore contains either numeric elements or character elements, and so on. Each element of the matrix is identified by its row and column numbers.

# Create a matrix
matr<-matrix(c(1,2.3,2.5,26,45,5,4.2,1.2,15,10),nrow=2)
matr
##      [,1] [,2] [,3] [,4] [,5]
## [1,]  1.0  2.5   45  4.2   15
## [2,]  2.3 26.0    5  1.2   10
# number of rows of matr
nrow(matr)
## [1] 2
# number of columns of matr
ncol(matr)
## [1] 5
# is it really a matrix?
is.matrix(matr)
## [1] TRUE
# what does matr contain?
mode(matr)
## [1] "numeric"
# dimension of the matrix
dim(matr)
## [1] 2 5
# Number of elements in matr
length(matr)
## [1] 10
# Add a new column by concatenating
matr<-cbind(matr,seq(1,2))
matr
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]  1.0  2.5   45  4.2   15    1
## [2,]  2.3 26.0    5  1.2   10    2
# Add a new row by concatenating
matr<-rbind(matr,seq(1,6))
matr
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]  1.0  2.5   45  4.2   15    1
## [2,]  2.3 26.0    5  1.2   10    2
## [3,]  1.0  2.0    3  4.0    5    6
# Print the value of row 2 and column 4
matr[2,4]
## [1] 1.2
# Print the 1st row of matr
matr[1,]
## [1]  1.0  2.5 45.0  4.2 15.0  1.0
# Print the 3rd column of matr
matr[,3]
## [1] 45  5  3
# Print matr without the 2nd column
matr[,-2]
##      [,1] [,2] [,3] [,4] [,5]
## [1,]  1.0   45  4.2   15    1
## [2,]  2.3    5  1.2   10    2
## [3,]  1.0    3  4.0    5    6
# Add 2 matrices (the operators -, * and / also work element by element)
matr+matr
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]  2.0    5   90  8.4   30    2
## [2,]  4.6   52   10  2.4   20    4
## [3,]  2.0    4    6  8.0   10   12
# Note: "*" multiply element by element
# the matrix multiplication is performed using %*%

# matrix transposition
t(matr)
##      [,1] [,2] [,3]
## [1,]  1.0  2.3    1
## [2,]  2.5 26.0    2
## [3,] 45.0  5.0    3
## [4,]  4.2  1.2    4
## [5,] 15.0 10.0    5
## [6,]  1.0  2.0    6
# Square root of every element
# The usual mathematical functions are available: sin, cos, exp, log etc...
sqrt(matr)
##          [,1]     [,2]     [,3]     [,4]     [,5]     [,6]
## [1,] 1.000000 1.581139 6.708204 2.049390 3.872983 1.000000
## [2,] 1.516575 5.099020 2.236068 1.095445 3.162278 1.414214
## [3,] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490
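
To make the difference between * and %*% concrete, here is a small sketch on two illustrative 2 x 2 matrices `A` and `B`, plus a non-square matrix `M`:

```r
# Two small matrices (illustrative values), filled column by column
A <- matrix(c(1, 2, 3, 4), nrow = 2)
B <- matrix(c(0, 1, 1, 0), nrow = 2)

# Element-by-element product
A * B

# Matrix product: rows of A times columns of B
A %*% B

# For non-square matrices the inner dimensions must agree,
# e.g. a 2 x 3 matrix times its 3 x 2 transpose gives a 2 x 2 matrix
M <- matrix(1:6, nrow = 2)
dim(M %*% t(M))
```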

It is also possible to build array-type objects, which generalize matrices to more than 2 dimensions.
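
A minimal sketch of such an object, built with array() on illustrative values:

```r
# A 2 x 3 x 2 array built from the integers 1 to 12 (illustrative values)
arr <- array(1:12, dim = c(2, 3, 2))
dim(arr)

# Element in row 1, column 2 of the second slice
arr[1, 2, 2]

# Each slice along the third dimension is an ordinary 2 x 3 matrix
is.matrix(arr[, , 1])
```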

List

A list is a heterogeneous object, i.e. it can contain elements of different types and lengths. The objects contained in a list are its components; they may have names (otherwise they are numbered by default). Lists can thus contain single values, vectors, matrices, data frames, etc. They are important objects in R because many functions return a list of several objects.

# Create a list containing vecnum, matr
liste<-list(vecnum,matr)
liste
## [[1]]
## [1]   2.000   3.200   6.100   0.005   0.010 150.000   2.300   5.300
## 
## [[2]]
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]  1.0  2.5   45  4.2   15    1
## [2,]  2.3 26.0    5  1.2   10    2
## [3,]  1.0  2.0    3  4.0    5    6
# length of liste
length(liste)
## [1] 2
# type of elements in liste
mode(liste)
## [1] "list"
# structure of liste

str(liste)
## List of 2
##  $ : num [1:8] 2 3.2 6.1 0.005 0.01 150 2.3 5.3
##  $ : num [1:3, 1:6] 1 2.3 1 2.5 26 2 45 5 3 4.2 ...
# Give names to the 2 elements of liste
names(liste)<-c("UnVecteur","UneMatrice")
names(liste)
## [1] "UnVecteur"  "UneMatrice"
# print liste
liste
## $UnVecteur
## [1]   2.000   3.200   6.100   0.005   0.010 150.000   2.300   5.300
## 
## $UneMatrice
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]  1.0  2.5   45  4.2   15    1
## [2,]  2.3 26.0    5  1.2   10    2
## [3,]  1.0  2.0    3  4.0    5    6
# --- Note: the result differs depending on how you access the elements of a list
# Print the 1st object of liste
liste[1] # gives a list
## $UnVecteur
## [1]   2.000   3.200   6.100   0.005   0.010 150.000   2.300   5.300
is.list(liste[1])
## [1] TRUE
liste[[1]] # gives a vector
## [1]   2.000   3.200   6.100   0.005   0.010 150.000   2.300   5.300
is.vector(liste[[1]])
## [1] TRUE
liste$UnVecteur # gives a vector
## [1]   2.000   3.200   6.100   0.005   0.010 150.000   2.300   5.300
is.vector(liste$UnVecteur)
## [1] TRUE
# Print the 2nd object of liste
liste[2] # gives a list
## $UneMatrice
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]  1.0  2.5   45  4.2   15    1
## [2,]  2.3 26.0    5  1.2   10    2
## [3,]  1.0  2.0    3  4.0    5    6
is.list(liste[2])
## [1] TRUE
liste[[2]] # gives a matrix
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]  1.0  2.5   45  4.2   15    1
## [2,]  2.3 26.0    5  1.2   10    2
## [3,]  1.0  2.0    3  4.0    5    6
is.matrix(liste[[2]])
## [1] TRUE
liste$UneMatrice # gives a matrix
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]  1.0  2.5   45  4.2   15    1
## [2,]  2.3 26.0    5  1.2   10    2
## [3,]  1.0  2.0    3  4.0    5    6
is.matrix(liste$UneMatrice)
## [1] TRUE
#---- Manipulate an object of a list
liste[[1]]
## [1]   2.000   3.200   6.100   0.005   0.010 150.000   2.300   5.300
is.vector(liste[[1]])
## [1] TRUE
liste[[1]] + liste[[1]]
## [1]   4.00   6.40  12.20   0.01   0.02 300.00   4.60  10.60
liste[[1]]*10
## [1]   20.00   32.00   61.00    0.05    0.10 1500.00   23.00   53.00
# Concatenate 2 lists
c(liste,liste)
## $UnVecteur
## [1]   2.000   3.200   6.100   0.005   0.010 150.000   2.300   5.300
## 
## $UneMatrice
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]  1.0  2.5   45  4.2   15    1
## [2,]  2.3 26.0    5  1.2   10    2
## [3,]  1.0  2.0    3  4.0    5    6
## 
## $UnVecteur
## [1]   2.000   3.200   6.100   0.005   0.010 150.000   2.300   5.300
## 
## $UneMatrice
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]  1.0  2.5   45  4.2   15    1
## [2,]  2.3 26.0    5  1.2   10    2
## [3,]  1.0  2.0    3  4.0    5    6
# Flatten the list into a vector containing all the atomic elements of liste
unlist(liste)
##   UnVecteur1   UnVecteur2   UnVecteur3   UnVecteur4   UnVecteur5   UnVecteur6 
##        2.000        3.200        6.100        0.005        0.010      150.000 
##   UnVecteur7   UnVecteur8  UneMatrice1  UneMatrice2  UneMatrice3  UneMatrice4 
##        2.300        5.300        1.000        2.300        1.000        2.500 
##  UneMatrice5  UneMatrice6  UneMatrice7  UneMatrice8  UneMatrice9 UneMatrice10 
##       26.000        2.000       45.000        5.000        3.000        4.200 
## UneMatrice11 UneMatrice12 UneMatrice13 UneMatrice14 UneMatrice15 UneMatrice16 
##        1.200        4.000       15.000       10.000        5.000        1.000 
## UneMatrice17 UneMatrice18 
##        2.000        6.000
# is it a vector?

is.vector(unlist(liste))
## [1] TRUE
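
As an illustration of a function that returns a list, the sketch below uses rle() from base R (chosen here only as a convenient example; the name `runs` is illustrative). Its result has named components that are extracted in the usual way:

```r
# rle() (run-length encoding, base R) returns its result as a list
runs <- rle(c("a", "a", "b", "b", "b", "a"))
is.list(runs)

# The components have names and are extracted with $ or [[ ]]
names(runs)
runs$lengths
runs[["values"]]
```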

Data.frame

A data frame is an object with a matrix-like structure whose columns can be of different types. The data sets used in statistics are often data frames: a data frame can hold numeric columns (the measurements) together with character or factor columns that identify the individuals, which are stored by rows.

# Create a dataframe
vec1<-paste("S",seq(1,5),sep="")
vec2<-as.factor(c(1,1,1,2,2))
vec3<-c(1.2,2.3,2.5,4.2,1.2)
mydata<-data.frame(Id=vec1,Type=vec2,Variable=vec3)
mydata
##   Id Type Variable
## 1 S1    1      1.2
## 2 S2    1      2.3
## 3 S3    1      2.5
## 4 S4    2      4.2
## 5 S5    2      1.2
# another way
mydata<-cbind.data.frame(vec1,vec2,vec3)
colnames(mydata)<-c("ID","type","Variable")
mydata
##   ID type Variable
## 1 S1    1      1.2
## 2 S2    1      2.3
## 3 S3    1      2.5
## 4 S4    2      4.2
## 5 S5    2      1.2
# dimension of mydata
dim(mydata)
## [1] 5 3
# number of rows
nrow(mydata)
## [1] 5
# number of columns
ncol(mydata)
## [1] 3
# is it really a data.frame?
is.data.frame(mydata)
## [1] TRUE
# Structure of the dataframe
str(mydata)
## 'data.frame':    5 obs. of  3 variables:
##  $ ID      : chr  "S1" "S2" "S3" "S4" ...
##  $ type    : Factor w/ 2 levels "1","2": 1 1 1 2 2
##  $ Variable: num  1.2 2.3 2.5 4.2 1.2
# summary of the dataframe
summary(mydata)
##       ID            type     Variable   
##  Length:5           1:3   Min.   :1.20  
##  Class :character   2:2   1st Qu.:1.20  
##  Mode  :character         Median :2.30  
##                           Mean   :2.28  
##                           3rd Qu.:2.50  
##                           Max.   :4.20
# Print the 2nd row of the data.frame
mydata[2,]
##   ID type Variable
## 2 S2    1      2.3
# Print the 3rd column of the data.frame
mydata [,3]
## [1] 1.2 2.3 2.5 4.2 1.2
mydata[,"Variable"]
## [1] 1.2 2.3 2.5 4.2 1.2
# Print only rows containing S1, S3 and S4 ID
mydata[mydata[,"ID"] %in% c("S1","S3","S4"),]
##   ID type Variable
## 1 S1    1      1.2
## 3 S3    1      2.5
## 4 S4    2      4.2
# Do not print rows containing  S1, S3 and S4 ID
mydata[!mydata[,"ID"] %in% c("S1","S3","S4"),]
##   ID type Variable
## 2 S2    1      2.3
## 5 S5    2      1.2
# Transform the matr matrix into a data.frame
tmp<-as.data.frame(matr)
tmp
##    V1   V2 V3  V4 V5 V6
## 1 1.0  2.5 45 4.2 15  1
## 2 2.3 26.0  5 1.2 10  2
## 3 1.0  2.0  3 4.0  5  6
# Is it a matrix or a data.frame?
is.matrix(tmp)
## [1] FALSE
is.data.frame(tmp)
## [1] TRUE
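
Since a data frame is also a list of columns, columns can be extracted and added with $ or [[ ]]. A short sketch on an illustrative data frame `df`:

```r
# A small data frame (illustrative data)
df <- data.frame(Id = c("S1", "S2", "S3"), Value = c(1.2, 2.3, 2.5))

# Extract a column as a vector with $ or [[ ]]
df$Value
df[["Value"]]

# Add a new column computed from an existing one
df$Double <- df$Value * 2

# Select rows with a condition and columns by name
sub <- df[df$Value > 2, c("Id", "Double")]
sub
```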

Manipulate data in R

Libraries useful for manipulating data in R, such as dplyr and tidyr (loaded below), are provided by RStudio.

Functions for character variables

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)

# Create a character vector
mychar<-c("plane","train","bus","car")
mychar
## [1] "plane" "train" "bus"   "car"
#--- Append the pattern "funny" to the end of each value, separated by a "-"
mychar<-paste(mychar,"funny",sep="-")
mychar
## [1] "plane-funny" "train-funny" "bus-funny"   "car-funny"
#--- Prepend the pattern "A" to the beginning of each value, separated by a "_"
mychar<-paste("A",mychar,sep="_")
mychar
## [1] "A_plane-funny" "A_train-funny" "A_bus-funny"   "A_car-funny"
# Remove the pattern "A_" from mychar
mychar<-gsub(pattern="A_",replacement="",x=mychar)
mychar
## [1] "plane-funny" "train-funny" "bus-funny"   "car-funny"
# Replace "funny" by "little" in mychar
mychar<-gsub(pattern="funny",replacement="little",x=mychar)
mychar
## [1] "plane-little" "train-little" "bus-little"   "car-little"
# Set all in upper case
mychar<-toupper(mychar)
mychar
## [1] "PLANE-LITTLE" "TRAIN-LITTLE" "BUS-LITTLE"   "CAR-LITTLE"
# Set all in lower case
mychar<-tolower(mychar)
mychar
## [1] "plane-little" "train-little" "bus-little"   "car-little"
# Count the number of characters for each element of mychar
nchar(mychar)
## [1] 12 12 10 10
# Split each element of mychar on a separator (here "-")
# This function returns a list!
strsplit(mychar,split="-")
## [[1]]
## [1] "plane"  "little"
## 
## [[2]]
## [1] "train"  "little"
## 
## [[3]]
## [1] "bus"    "little"
## 
## [[4]]
## [1] "car"    "little"
#--- Have a look at the other functions dedicated to manipulating character vectors
# grep()
# sub(), gsub()
# regexpr() etc...

# These functions (grep etc...) rely on regular expressions
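
A short sketch of these functions on an illustrative character vector `words`, using simple regular expressions:

```r
words <- c("plane-little", "train-little", "bus-little", "car-little")

# grep() returns the positions of the elements matching a pattern
# ("^t" means: starts with the letter t)
grep("^t", words)

# grepl() returns a logical vector, convenient for filtering
words[grepl("a", words)]

# sub() replaces only the first match, gsub() replaces all matches
sub("l", "L", words[1])
gsub("l", "L", words[1])

# regexpr() gives the position of the first match (-1 if none)
as.integer(regexpr("-", words))
```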

Import a data set

All the following functions import tabular data. The user specifies the path to the file, the file name, the column separator, whether the file has a header, the missing data symbol, etc. Please have a look at the help pages of these functions.

  • read.table(): reads a file in table format and creates a data frame from it, with cases corresponding to lines and variables to fields in the file. It accepts csv and txt files.
  • library readr: a set of functions faster than read.table(), which import the same kinds of files (csv, txt and so on). https://readr.tidyverse.org/
  • library readxl: makes it easy to get data out of Excel and into R. It is easy to install and use on all operating systems and is designed to work with tabular data. https://readxl.tidyverse.org/
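
As a minimal sketch of an import with read.table(), the example below first writes a tiny tab-separated file to a temporary location (the file content and the names `path` and `dat` are illustrative), then reads it back:

```r
# Write a tiny tab-separated file to a temporary location (illustrative data)
path <- tempfile(fileext = ".txt")
writeLines(c("Id\tValue", "S1\t1.2", "S2\tNA", "S3\t2.5"), path)

# Read it back: header row, tab separator, "NA" as missing data symbol
dat <- read.table(path, header = TRUE, sep = "\t", na.strings = "NA",
                  stringsAsFactors = FALSE)
str(dat)

# readr::read_delim() and readxl::read_excel() play a similar role
# for delimited text files and Excel workbooks respectively
```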

Sorting a data set

Souche<-paste0("S",seq(1,19,1))
Type<-c(rep("BA",5),rep("CL",4),rep("CO",5),rep("FB",5))
Vmax<-c(1.61,1.68,1.75,1.66,1.70,1.54,1.81,1.86,1.69,1.59,1.62,1.55,1.83,1.72,1.85,1.71,1.85,1.71,1.93)
T80<-c(163,444,80,86,124,69,122,NA,86,86,81,86,80,102,NA,340,133,144,98)

mydata<-cbind.data.frame(Souche,Type,Vmax,T80)
str(mydata)
## 'data.frame':    19 obs. of  4 variables:
##  $ Souche: chr  "S1" "S2" "S3" "S4" ...
##  $ Type  : chr  "BA" "BA" "BA" "BA" ...
##  $ Vmax  : num  1.61 1.68 1.75 1.66 1.7 1.54 1.81 1.86 1.69 1.59 ...
##  $ T80   : num  163 444 80 86 124 69 122 NA 86 86 ...
dim(mydata)
## [1] 19  4
# Print the first rows of mydata
head(mydata)
##   Souche Type Vmax T80
## 1     S1   BA 1.61 163
## 2     S2   BA 1.68 444
## 3     S3   BA 1.75  80
## 4     S4   BA 1.66  86
## 5     S5   BA 1.70 124
## 6     S6   CL 1.54  69
# Print the last rows of mydata
tail(mydata)
##    Souche Type Vmax T80
## 14    S14   CO 1.72 102
## 15    S15   FB 1.85  NA
## 16    S16   FB 1.71 340
## 17    S17   FB 1.85 133
## 18    S18   FB 1.71 144
## 19    S19   FB 1.93  98
# Sorting mydata according to Vmax column with a function from base
mydata[order(mydata[,"Vmax"]),]
##    Souche Type Vmax T80
## 6      S6   CL 1.54  69
## 12    S12   CO 1.55  86
## 10    S10   CO 1.59  86
## 1      S1   BA 1.61 163
## 11    S11   CO 1.62  81
## 4      S4   BA 1.66  86
## 2      S2   BA 1.68 444
## 9      S9   CL 1.69  86
## 5      S5   BA 1.70 124
## 16    S16   FB 1.71 340
## 18    S18   FB 1.71 144
## 14    S14   CO 1.72 102
## 3      S3   BA 1.75  80
## 7      S7   CL 1.81 122
## 13    S13   CO 1.83  80
## 15    S15   FB 1.85  NA
## 17    S17   FB 1.85 133
## 8      S8   CL 1.86  NA
## 19    S19   FB 1.93  98
# Sorting mydata according to Vmax column with a dplyr function
arrange(mydata,Vmax)
##    Souche Type Vmax T80
## 1      S6   CL 1.54  69
## 2     S12   CO 1.55  86
## 3     S10   CO 1.59  86
## 4      S1   BA 1.61 163
## 5     S11   CO 1.62  81
## 6      S4   BA 1.66  86
## 7      S2   BA 1.68 444
## 8      S9   CL 1.69  86
## 9      S5   BA 1.70 124
## 10    S16   FB 1.71 340
## 11    S18   FB 1.71 144
## 12    S14   CO 1.72 102
## 13     S3   BA 1.75  80
## 14     S7   CL 1.81 122
## 15    S13   CO 1.83  80
## 16    S15   FB 1.85  NA
## 17    S17   FB 1.85 133
## 18     S8   CL 1.86  NA
## 19    S19   FB 1.93  98
# Sorting mydata according to Vmax column with a dplyr function, in descending order
arrange(mydata,desc(Vmax))
##    Souche Type Vmax T80
## 1     S19   FB 1.93  98
## 2      S8   CL 1.86  NA
## 3     S15   FB 1.85  NA
## 4     S17   FB 1.85 133
## 5     S13   CO 1.83  80
## 6      S7   CL 1.81 122
## 7      S3   BA 1.75  80
## 8     S14   CO 1.72 102
## 9     S16   FB 1.71 340
## 10    S18   FB 1.71 144
## 11     S5   BA 1.70 124
## 12     S9   CL 1.69  86
## 13     S2   BA 1.68 444
## 14     S4   BA 1.66  86
## 15    S11   CO 1.62  81
## 16     S1   BA 1.61 163
## 17    S10   CO 1.59  86
## 18    S12   CO 1.55  86
## 19     S6   CL 1.54  69
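
Sorting on several keys works in the same way. The sketch below, on an illustrative data frame `df`, sorts by Type and then by decreasing Vmax with base R, and shows where order() places missing values:

```r
# A small data frame with ties in the first sort key (illustrative data)
df <- data.frame(Type = c("CO", "BA", "CO", "BA"),
                 Vmax = c(1.59, 1.70, 1.83, 1.61))

# Base R: sort by Type, then by decreasing Vmax within each Type
sorted <- df[order(df$Type, -df$Vmax), ]
sorted

# order() puts NA last by default; na.last controls this
order(c(3, NA, 1))
order(c(3, NA, 1), na.last = FALSE)
```

With dplyr, the equivalent of the first call would be arrange(df, Type, desc(Vmax)).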

Dealing with missing data

# Remove rows with NA
mydata2<-na.omit(mydata)

# with the NAs
str(mydata)
## 'data.frame':    19 obs. of  4 variables:
##  $ Souche: chr  "S1" "S2" "S3" "S4" ...
##  $ Type  : chr  "BA" "BA" "BA" "BA" ...
##  $ Vmax  : num  1.61 1.68 1.75 1.66 1.7 1.54 1.81 1.86 1.69 1.59 ...
##  $ T80   : num  163 444 80 86 124 69 122 NA 86 86 ...
# without the NAs
str(mydata2)
## 'data.frame':    17 obs. of  4 variables:
##  $ Souche: chr  "S1" "S2" "S3" "S4" ...
##  $ Type  : chr  "BA" "BA" "BA" "BA" ...
##  $ Vmax  : num  1.61 1.68 1.75 1.66 1.7 1.54 1.81 1.69 1.59 1.62 ...
##  $ T80   : num  163 444 80 86 124 69 122 86 86 81 ...
##  - attr(*, "na.action")= 'omit' Named int [1:2] 8 15
##   ..- attr(*, "names")= chr [1:2] "8" "15"

With the tidyr library:

  • drop_na(): removes the rows containing missing values
  • replace_na(): replaces missing values with a specified value
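
A minimal sketch of both functions on an illustrative data frame `df` (this assumes the tidyr package is installed):

```r
library(tidyr)

# A small data frame with a missing value (illustrative data)
df <- data.frame(Souche = c("S1", "S2", "S3"),
                 T80 = c(163, NA, 80))

# drop_na() removes every row containing at least one NA
complete <- drop_na(df)
complete

# The check can be restricted to selected columns
drop_na(df, T80)

# replace_na() substitutes a chosen value, given per column in a list
filled <- replace_na(df, list(T80 = 0))
filled
```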

The mutate family of functions in dplyr

mutate() adds new columns to your data set and can be used with all kinds of base functions:

https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf

  • offsets: lag(), lead()
  • cumulative aggregates: cumall(), cumany(), cummax(), cummean() and so on…
  • rankings: cume_dist(), dense_rank(), min_rank(), row_number()
  • math: +, -, *, /, log(), log2(), sin(), cos(), sqrt() and so on…
  • misc: between(), na_if(), if_else(), recode(), recode_factor() and so on…

mutate(mydata,sqrt_vmax=sqrt(Vmax))
##    Souche Type Vmax T80 sqrt_vmax
## 1      S1   BA 1.61 163  1.268858
## 2      S2   BA 1.68 444  1.296148
## 3      S3   BA 1.75  80  1.322876
## 4      S4   BA 1.66  86  1.288410
## 5      S5   BA 1.70 124  1.303840
## 6      S6   CL 1.54  69  1.240967
## 7      S7   CL 1.81 122  1.345362
## 8      S8   CL 1.86  NA  1.363818
## 9      S9   CL 1.69  86  1.300000
## 10    S10   CO 1.59  86  1.260952
## 11    S11   CO 1.62  81  1.272792
## 12    S12   CO 1.55  86  1.244990
## 13    S13   CO 1.83  80  1.352775
## 14    S14   CO 1.72 102  1.311488
## 15    S15   FB 1.85  NA  1.360147
## 16    S16   FB 1.71 340  1.307670
## 17    S17   FB 1.85 133  1.360147
## 18    S18   FB 1.71 144  1.307670
## 19    S19   FB 1.93  98  1.389244
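
Several of the helper functions listed above can be combined in a single mutate() call. A sketch on an illustrative data frame `df` (the new column names are arbitrary):

```r
library(dplyr)

df <- data.frame(Souche = c("S1", "S2", "S3", "S4"),
                 Vmax = c(1.61, 1.68, 1.75, 1.66))

res <- mutate(df,
              previous = lag(Vmax),            # value of the preceding row
              running_max = cummax(Vmax),      # cumulative maximum
              vmax_rank = min_rank(Vmax),      # rank, 1 = smallest value
              high = if_else(Vmax > 1.65, "yes", "no"))
res
```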

Joining data sets

  • to concatenate 2 datasets side by side: bind_cols()
  • to join 2 datasets: left_join(), right_join(), inner_join(), full_join(). Please have a look at the cheatsheet for these functions to better understand their outputs: https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf
  • left_join(x,y,…): keep all rows of x and add the matching values from y
  • right_join(x,y,…): keep all rows of y and add the matching values from x
  • inner_join(x,y,…): join data, retain only rows with matches in both
  • full_join(x,y,…): join data, retain all values, all rows

You specify the column(s) (the key) by which to join the data sets: by="A", by=c("col_df1"="col_df2"), and so on.
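
The sketch below compares the joins on two illustrative tables, `strains` and `measures`, whose key column has a different name in each table:

```r
library(dplyr)

strains <- data.frame(Souche = c("S1", "S2", "S3"),
                      Type = c("BA", "BA", "CL"))
measures <- data.frame(Strain = c("S2", "S3", "S4"),
                       Vmax = c(1.68, 1.75, 1.66))

# Keep all rows of strains; Vmax is NA for S1, which has no match
lj <- left_join(strains, measures, by = c("Souche" = "Strain"))

# Keep only the rows present in both tables (S2 and S3)
ij <- inner_join(strains, measures, by = c("Souche" = "Strain"))

# Keep all rows of both tables (S1 to S4)
fj <- full_join(strains, measures, by = c("Souche" = "Strain"))

# bind_cols() pastes tables side by side; the row counts must match
bc <- bind_cols(strains, measures)
```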

Summarize a data set

The summarise functions of the dplyr library create new variables, in a new data set, that summarize the initial data set.

There are 3 ways to use them:

  • summarise(dataset,newVar = mean(col1)): apply a summarizing function to one variable over the whole dataset
  • summarise_each(dataset,funs(mean)): apply a summarizing function to every column of the whole dataset (in recent versions of dplyr, summarise_each() and funs() are deprecated in favour of across())
  • summarise(group_by(dataset,col1), newVar=mean(col2)): apply a summarizing function to grouped data from the initial dataset

Some summarizing functions:

  • first: first value of a vector
  • last: last value of a vector
  • nth: nth value of a vector
  • n: number of values in a vector
  • n_distinct: number of distinct values in a vector
  • IQR: IQR of a vector
  • min, max: minimum and maximum of a vector
  • mean, median, var, sd: take care to use these with the na.rm=TRUE option when the data contain missing values

summarise(mydata,mymean=mean(Vmax))
##     mymean
## 1 1.718947
summarise(group_by(mydata,Type),mymean=mean(Vmax))
## # A tibble: 4 x 2
##   Type  mymean
##   <chr>  <dbl>
## 1 BA      1.68
## 2 CL      1.72
## 3 CO      1.66
## 4 FB      1.81
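
Several summaries can be computed in one call. The following sketch uses an illustrative data frame `df` and assumes dplyr 1.0.0 or later for across(), which replaces the older summarise_each()/funs() pair:

```r
library(dplyr)

df <- data.frame(Type = c("BA", "BA", "CL", "CL"),
                 Vmax = c(1.61, 1.68, 1.54, 1.81),
                 T80 = c(163, NA, 69, 122))

# Several summaries per group; na.rm = TRUE ignores the missing T80
by_type <- summarise(group_by(df, Type),
                     n = n(),
                     mean_vmax = mean(Vmax),
                     mean_t80 = mean(T80, na.rm = TRUE))
by_type

# The same function applied to every numeric column of the whole data set
all_means <- summarise(df, across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))
all_means
```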

Export a data set

write.table() exports a data.frame to a tabular file. The user specifies the column separator, the file name and its extension (csv, txt…).

# write.table(mydata,file="aName.txt",append=FALSE,quote=FALSE,sep="\t",row.names=FALSE)
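
A runnable sketch of the export, writing to a temporary file so that nothing is created in the working directory (the data and the names `df`, `out` and `back` are illustrative):

```r
df <- data.frame(Id = c("S1", "S2"), Value = c(1.2, 2.3))

# Export to a tab-separated text file in a temporary location
out <- tempfile(fileext = ".txt")
write.table(df, file = out, sep = "\t", quote = FALSE, row.names = FALSE)

# Inspect the raw lines, then read the file back to check the round trip
readLines(out)
back <- read.table(out, header = TRUE, sep = "\t")
back
```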

Save an analysis

# Save all the objects of the workspace
save(list=ls(),file="aName.Rdata")

# Save just one object
save(toto,file="toto.Rdata")
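
Saved objects are restored with load(). A minimal round trip, using a temporary file and an illustrative object `toto`:

```r
toto <- c(1, 2, 3)
rdata <- tempfile(fileext = ".Rdata")

# Save one object; the file argument must be given by name
save(toto, file = rdata)

# Remove the object, then restore it from the file
rm(toto)
loaded <- load(rdata)   # load() returns the names of the restored objects
loaded
toto
```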

Create a report: rmarkdown

It is possible to create a report in R by combining the R and markdown languages. Please have a look at the R Markdown web site:

https://rmarkdown.rstudio.com/gallery.html

as well as the cheatsheet, which lists all the steps needed to produce a report.

https://github.com/rstudio/cheatsheets/raw/master/rmarkdown-2.0.pdf

The report can be produced as an HTML file, a PDF file (this requires pandoc and LaTeX on your system), a DOCX file, slides, a web site, etc.