R objects
Types of simple R objects
- NULL: empty object
- logical: boolean or logical object (FALSE, TRUE)
- numeric: numeric object
- character: character object
- complex: complex object (2+3i)
The operators <- and = assign into the environment in which they are evaluated. The operator <- can be used anywhere, whereas the operator = is only allowed at the top level (e.g., in the complete expression typed at the command prompt) or as one of the subexpressions in a braced list of expressions. The recommendation is to use <- in your programs.
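The difference between the two operators is easiest to see inside a function call. The sketch below uses a hypothetical helper, double_it(), and hypothetical variable names (val, tmp_val) that are not part of the original text.

```r
# Hypothetical helper used only for this illustration
double_it <- function(val) val * 2

# '=' inside a call matches the argument 'val'; no variable 'val' is created
double_it(val = 5)
exists("val", inherits = FALSE)   # FALSE in a fresh session

# '<-' inside a call creates 'tmp_val' in the calling environment,
# then passes its value to the function
double_it(tmp_val <- 5)
tmp_val
```

Assigning inside a call works but is easy to misread, which is one more reason to keep assignments on their own line with <-.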
x <- value
x<-2.3
mode(x)
## [1] "numeric"
# The is.XXX() functions test whether an object is of a given type
# and return a logical value
is.numeric(x)
## [1] TRUE
is.character(x)
## [1] FALSE
is.complex(x)
## [1] FALSE
is.null(x)
## [1] FALSE
is.logical(x)
## [1] FALSE
To convert an object to a different mode, use the as.XXX() functions.
as.numeric(x)
## [1] 2.3
as.character(x)
## [1] "2.3"
as.null(x)
## NULL
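A few more conversions may help to see what the as.XXX() functions do with values that do not convert cleanly. This is a small sketch; the variable name bad is only illustrative.

```r
as.numeric("3.14")    # the string "3.14" becomes the number 3.14
as.integer(2.9)       # 2: the decimal part is dropped, not rounded
as.logical("TRUE")    # TRUE
as.logical(0)         # FALSE (any non-zero number gives TRUE)
as.character(TRUE)    # "TRUE"

# A string that is not a number cannot be converted:
# the result is NA and R emits a warning (silenced here)
bad <- suppressWarnings(as.numeric("abc"))
is.na(bad)            # TRUE
```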
Missing data
Missing data is represented by NA (“not available”), which has its own dedicated functions.
x<-NA
x
## [1] NA
print(x*2)
## [1] NA
# is x a missing value?
is.na(x)
## [1] TRUE
# ! negates the test
!is.na(x)
## [1] FALSE
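Because NA propagates through most computations, it is usually handled explicitly. The sketch below, on an illustrative vector v, shows how to count, skip and filter missing values.

```r
v <- c(1.5, NA, 3.5, NA)

is.na(v)                # FALSE TRUE FALSE TRUE
sum(is.na(v))           # 2 missing values
mean(v)                 # NA: one missing value is enough
mean(v, na.rm = TRUE)   # 2.5: missing values are ignored
v[!is.na(v)]            # 1.5 3.5: keep only the observed values

# Comparing with == does not detect NA; use is.na() instead
NA == NA                # NA
```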
Vector
A vector is the simplest type of data structure in R. The R manual defines a vector as “a single entity consisting of a collection of things.” A collection of numbers, for example, is a numeric vector. We can have numeric vectors or character vectors or logical vectors…
#--- numeric vector built using c()
vecnum<-c(2,3.2,6.1,0.005,1e-2,150)
# print it
vecnum
## [1] 2.000 3.200 6.100 0.005 0.010 150.000
# length of my vector
length(vecnum)
## [1] 6
# Add some elements to my vector
vecnum<-c(vecnum,2.3,5.3)
vecnum
## [1] 2.000 3.200 6.100 0.005 0.010 150.000 2.300 5.300
# Test if all the values are > 1
all(vecnum > 1)
## [1] FALSE
# Test if at least one of the values of my vector is > 1
any(vecnum > 1)
## [1] TRUE
# Print the first elements of my vector
head(vecnum)
## [1] 2.000 3.200 6.100 0.005 0.010 150.000
# Print the last elements of my vector
tail(vecnum)
## [1] 6.100 0.005 0.010 150.000 2.300 5.300
# Print the 5th element of my vector
vecnum[5]
## [1] 0.01
# Print the 2nd to the 4th elements of my vector
vecnum[2:4]
## [1] 3.200 6.100 0.005
# Print the 2nd, 4th and 6th element of my vector
vecnum[c(2,4,6)]
## [1] 3.200 0.005 150.000
# Select and print the elements of my vector > 1
vecnum[vecnum > 1]
## [1] 2.0 3.2 6.1 150.0 2.3 5.3
# Select and print the elements in the interval [2,10]
vecnum[ (vecnum >= 2) & (vecnum <= 10)]
## [1] 2.0 3.2 6.1 2.3 5.3
# Print the minimum value of my vector
min(vecnum)
## [1] 0.005
# Print the position in the vector of the minimal value
which.min(vecnum)
## [1] 4
# Print the maximum of my vector
max(vecnum)
## [1] 150
# Print the position in the vector of the maximal value
which.max(vecnum)
## [1] 6
# Sort the vector by ascending order
sort(vecnum)
## [1] 0.005 0.010 2.000 2.300 3.200 5.300 6.100 150.000
# Sort the vector by descending order
rev(sort(vecnum))
## [1] 150.000 6.100 5.300 3.200 2.300 2.000 0.010 0.005
# Build a vector using seq()
myvec<-seq(1,5)
myvec
## [1] 1 2 3 4 5
# Build a vector using rep()
myvec2<-rep(1,5)
myvec2
## [1] 1 1 1 1 1
# another way
myvec3<-rep(c(1,2),each=4)
myvec3
## [1] 1 1 1 1 2 2 2 2
#--- a character vector
vecchar<-c("toto","titi","tutu")
vecchar
## [1] "toto" "titi" "tutu"
vecchar2<-rep("black",5)
vecchar2
## [1] "black" "black" "black" "black" "black"
# Test if vecchar is really a vector
is.vector(vecchar)
## [1] TRUE
#--- Build a logical vector (TRUE and FALSE are written without quotes)
veclog<-c(TRUE,FALSE,TRUE)
is.vector(veclog)
## [1] TRUE
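Two properties of vectors are worth seeing in action: operations are applied element by element, and a vector holds a single type. The following sketch uses illustrative names (a, b, flags, mixed).

```r
a <- c(1, 2, 3, 4)
b <- c(10, 20, 30, 40)

a + b            # 11 22 33 44: element-by-element addition
a * 2            # 2 4 6 8: the scalar is recycled

# A comparison returns a logical vector
flags <- a > 2   # FALSE FALSE TRUE TRUE
is.logical(flags)
sum(flags)       # 2: TRUE counts as 1
which(flags)     # 3 4: positions of the TRUE values

# Mixing types in c() coerces everything to the most general type
mixed <- c(1, "a", TRUE)
mode(mixed)      # "character"
```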
Factor
A factor is a type of vector that describes a qualitative (categorical) variable. Internally, each value is stored as an integer code rather than as the character string that labels it.
#--- Build a factor
fac <- factor(c("red", "green", "red", "blue", "green"))
fac
## [1] red green red blue green
## Levels: blue green red
# the levels of fac
levels(fac)
## [1] "blue" "green" "red"
# Test whether fac is a factor or a vector
is.vector(fac)
## [1] FALSE
is.factor(fac)
## [1] TRUE
# By default, the levels of a factor are sorted in alphabetical order,
# but you can impose the order of the levels
fac<-factor(c("red","green","red","blue","green"),levels=c("green","red","blue"))
fac
## [1] red green red blue green
## Levels: green red blue
# Put "blue" back as the first level
fac2<-relevel(fac, "blue")
fac2
## [1] red green red blue green
## Levels: blue green red
#--- To update the levels ----
# Build fac2 from fac but with only the first 3 elements
fac2 <- fac[1:3]
fac2
## [1] red green red
## Levels: green red blue
# Printing fac2 shows that the level "blue" is still there even though no element has this value.
# As this can be a problem in further analyses, we update the levels of fac2 with factor()
fac2 <- factor(fac2)
fac2
## [1] red green red
## Levels: green red
# Now, the levels of fac2 are only values contained in the factor ("blue" disappeared)
#--- Transform a factor to character
fac3<-as.character(fac2)
fac3
## [1] "red" "green" "red"
# Test the mode
is.factor(fac3)
## [1] FALSE
is.character(fac3)
## [1] TRUE
#--- Retrieve the internal coding of a factor
as.numeric(fac)
## [1] 2 1 2 3 1
#--- Count frequencies by levels of the fac factor
table(fac)
## fac
## green red blue
## 2 2 1
# Remove the duplicated elements of a factor
# unique() function can be used on any kind of variable
unique(fac)
## [1] red green blue
## Levels: green red blue
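Because as.numeric() returns the internal codes, converting a factor whose labels are numbers needs one extra step. A minimal sketch with an illustrative factor f:

```r
f <- factor(c("10", "5", "20", "5"))

levels(f)                     # "10" "20" "5": labels are sorted as text
as.numeric(f)                 # 1 3 2 3: internal codes, not the values
as.numeric(as.character(f))   # 10 5 20 5: the values themselves
as.numeric(levels(f))[f]      # 10 5 20 5: same result, converting each level once
```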
Matrix
A matrix is a base object containing only one type of element: a matrix therefore contains either numeric elements, or character elements, and so on… Each element of the matrix is identified by its row and column numbers.
# Create a matrix
matr<-matrix(c(1,2.3,2.5,26,45,5,4.2,1.2,15,10),nrow=2)
matr
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1.0 2.5 45 4.2 15
## [2,] 2.3 26.0 5 1.2 10
# number of rows of matr
nrow(matr)
## [1] 2
# number of columns of matr
ncol(matr)
## [1] 5
# is it really a matrix?
is.matrix(matr)
## [1] TRUE
# what does matr contain?
mode(matr)
## [1] "numeric"
# dimension of the matrix
dim(matr)
## [1] 2 5
# Number of elements in matr
length(matr)
## [1] 10
# Add a new column by concatenating
matr<-cbind(matr,seq(1,2))
matr
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 1.0 2.5 45 4.2 15 1
## [2,] 2.3 26.0 5 1.2 10 2
# Add a new row by concatenating
matr<-rbind(matr,seq(1,6))
matr
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 1.0 2.5 45 4.2 15 1
## [2,] 2.3 26.0 5 1.2 10 2
## [3,] 1.0 2.0 3 4.0 5 6
# Print the value of row 2 and column 4
matr[2,4]
## [1] 1.2
# Print the 1st row of matr
matr[1,]
## [1] 1.0 2.5 45.0 4.2 15.0 1.0
# Print the 3rd column of matr
matr[,3]
## [1] 45 5 3
# Print matr without the 2nd column
matr[,-2]
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1.0 45 4.2 15 1
## [2,] 2.3 5 1.2 10 2
## [3,] 1.0 3 4.0 5 6
# Add 2 matrices element by element (the operators *, / and - work the same way)
matr+matr
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 2.0 5 90 8.4 30 2
## [2,] 4.6 52 10 2.4 20 4
## [3,] 2.0 4 6 8.0 10 12
# Note: "*" multiply element by element
# the matrix multiplication is performed using %*%
# matrix transposition
t(matr)
## [,1] [,2] [,3]
## [1,] 1.0 2.3 1
## [2,] 2.5 26.0 2
## [3,] 45.0 5.0 3
## [4,] 4.2 1.2 4
## [5,] 15.0 10.0 5
## [6,] 1.0 2.0 6
# all elements to square root
# All mathematical functions exist: sin, cos, exp, log etc...
sqrt(matr)
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 1.000000 1.581139 6.708204 2.049390 3.872983 1.000000
## [2,] 1.516575 5.099020 2.236068 1.095445 3.162278 1.414214
## [3,] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490
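The difference between * and %*% noted above can be checked on two small matrices. The names A, B and M below are illustrative only.

```r
A <- matrix(c(1, 2, 3, 4), nrow = 2)   # rows: (1, 3) and (2, 4)
B <- matrix(c(0, 1, 1, 0), nrow = 2)   # rows: (0, 1) and (1, 0)

A * B     # element by element: rows (0, 3) and (2, 0)
A %*% B   # matrix product:     rows (3, 1) and (4, 2)

# %*% requires the number of columns of the left matrix to equal the
# number of rows of the right one, e.g. a 2 x 3 matrix times its transpose
M <- matrix(1:6, nrow = 2)
dim(M %*% t(M))   # 2 2
```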
It is also possible to construct array-type objects, which generalize matrices to more than 2 dimensions.
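As a brief sketch, a 3-dimensional array can be built with array() and indexed like a matrix with one more index. The name arr is illustrative.

```r
# 24 values arranged in 2 rows, 3 columns and 4 "slices"
arr <- array(1:24, dim = c(2, 3, 4))

dim(arr)                # 2 3 4
is.array(arr)           # TRUE
arr[1, 2, 3]            # 15: row 1, column 2, slice 3
arr[, , 1]              # the first slice
is.matrix(arr[, , 1])   # TRUE: a slice is an ordinary 2 x 3 matrix
```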
List
A list is a heterogeneous object, i.e. it can contain elements of different types and lengths. The objects contained in a list are its components and may have names (otherwise they are numbered by default). Lists can thus contain values, vectors, matrices, data frames etc… They are important objects in R because most functions return lists of several objects.
# Create a list containing vecnum, matr
liste<-list(vecnum,matr)
liste
## [[1]]
## [1] 2.000 3.200 6.100 0.005 0.010 150.000 2.300 5.300
##
## [[2]]
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 1.0 2.5 45 4.2 15 1
## [2,] 2.3 26.0 5 1.2 10 2
## [3,] 1.0 2.0 3 4.0 5 6
# length of liste
length(liste)
## [1] 2
# type of elements in liste
mode(liste)
## [1] "list"
# structure of liste
str(liste)
## List of 2
## $ : num [1:8] 2 3.2 6.1 0.005 0.01 150 2.3 5.3
## $ : num [1:3, 1:6] 1 2.3 1 2.5 26 2 45 5 3 4.2 ...
# Give names to the 2 elements of liste
names(liste)<-c("UnVecteur","UneMatrice")
names(liste)
## [1] "UnVecteur" "UneMatrice"
# print liste
liste
## $UnVecteur
## [1] 2.000 3.200 6.100 0.005 0.010 150.000 2.300 5.300
##
## $UneMatrice
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 1.0 2.5 45 4.2 15 1
## [2,] 2.3 26.0 5 1.2 10 2
## [3,] 1.0 2.0 3 4.0 5 6
# --- Note: the result differs according to how you access the elements of a list
# Print the 1st object of liste
liste[1] # gives a list
## $UnVecteur
## [1] 2.000 3.200 6.100 0.005 0.010 150.000 2.300 5.300
is.list(liste[1])
## [1] TRUE
liste[[1]] # gives a vector
## [1] 2.000 3.200 6.100 0.005 0.010 150.000 2.300 5.300
is.vector(liste[[1]])
## [1] TRUE
liste$UnVecteur # gives a vector
## [1] 2.000 3.200 6.100 0.005 0.010 150.000 2.300 5.300
is.vector(liste$UnVecteur)
## [1] TRUE
# Print the 2nd object of liste
liste[2] # gives a list
## $UneMatrice
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 1.0 2.5 45 4.2 15 1
## [2,] 2.3 26.0 5 1.2 10 2
## [3,] 1.0 2.0 3 4.0 5 6
is.list(liste[2])
## [1] TRUE
liste[[2]] # gives a matrix
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 1.0 2.5 45 4.2 15 1
## [2,] 2.3 26.0 5 1.2 10 2
## [3,] 1.0 2.0 3 4.0 5 6
is.matrix(liste[[2]])
## [1] TRUE
liste$UneMatrice # gives a matrix
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 1.0 2.5 45 4.2 15 1
## [2,] 2.3 26.0 5 1.2 10 2
## [3,] 1.0 2.0 3 4.0 5 6
is.matrix(liste$UneMatrice)
## [1] TRUE
#---- Manipulate an object of a list
liste[[1]]
## [1] 2.000 3.200 6.100 0.005 0.010 150.000 2.300 5.300
is.vector(liste[[1]])
## [1] TRUE
liste[[1]] + liste[[1]]
## [1] 4.00 6.40 12.20 0.01 0.02 300.00 4.60 10.60
liste[[1]]*10
## [1] 20.00 32.00 61.00 0.05 0.10 1500.00 23.00 53.00
# Concatenate 2 lists
c(liste,liste)
## $UnVecteur
## [1] 2.000 3.200 6.100 0.005 0.010 150.000 2.300 5.300
##
## $UneMatrice
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 1.0 2.5 45 4.2 15 1
## [2,] 2.3 26.0 5 1.2 10 2
## [3,] 1.0 2.0 3 4.0 5 6
##
## $UnVecteur
## [1] 2.000 3.200 6.100 0.005 0.010 150.000 2.300 5.300
##
## $UneMatrice
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 1.0 2.5 45 4.2 15 1
## [2,] 2.3 26.0 5 1.2 10 2
## [3,] 1.0 2.0 3 4.0 5 6
# Transform the list object creating a vector containing all the atomic elements of liste
unlist(liste)
## UnVecteur1 UnVecteur2 UnVecteur3 UnVecteur4 UnVecteur5 UnVecteur6
## 2.000 3.200 6.100 0.005 0.010 150.000
## UnVecteur7 UnVecteur8 UneMatrice1 UneMatrice2 UneMatrice3 UneMatrice4
## 2.300 5.300 1.000 2.300 1.000 2.500
## UneMatrice5 UneMatrice6 UneMatrice7 UneMatrice8 UneMatrice9 UneMatrice10
## 26.000 2.000 45.000 5.000 3.000 4.200
## UneMatrice11 UneMatrice12 UneMatrice13 UneMatrice14 UneMatrice15 UneMatrice16
## 1.200 4.000 15.000 10.000 5.000 1.000
## UneMatrice17 UneMatrice18
## 2.000 6.000
# is it a vector?
is.vector(unlist(liste))
## [1] TRUE
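Components can also be added to or removed from an existing list, and a function can be applied to every component. The following is a small sketch on an illustrative list named res.

```r
res <- list(name = "sample A", values = c(4.2, 5.1, 3.9))

# Add a component with $
res$n <- length(res$values)
names(res)           # "name" "values" "n"

# Apply a function to every component
sapply(res, class)   # character, numeric, integer
lapply(res, length)  # returns a list: 1, 3 and 1

# Remove a component by assigning NULL
res$name <- NULL
length(res)          # 2
```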
Data.frame
A data frame is an object with a matrix-like structure whose columns can be of different types. The data sets used in statistics are often data frames: a data frame can contain numeric columns (the measurements) together with character or factor columns that identify the individuals (which are in rows).
# Create a dataframe
vec1<-paste0("S",seq(1,5))
vec2<-as.factor(c(1,1,1,2,2))
vec3<-c(1.2,2.3,2.5,4.2,1.2)
mydata<-data.frame(Id=vec1,Type=vec2,Variable=vec3)
mydata
## Id Type Variable
## 1 S1 1 1.2
## 2 S2 1 2.3
## 3 S3 1 2.5
## 4 S4 2 4.2
## 5 S5 2 1.2
# another way
mydata<-cbind.data.frame(vec1,vec2,vec3)
colnames(mydata)<-c("ID","type","Variable")
mydata
## ID type Variable
## 1 S1 1 1.2
## 2 S2 1 2.3
## 3 S3 1 2.5
## 4 S4 2 4.2
## 5 S5 2 1.2
# dimension of mydata
dim(mydata)
## [1] 5 3
# number of rows
nrow(mydata)
## [1] 5
# number of columns
ncol(mydata)
## [1] 3
# is it really a data.frame?
is.data.frame(mydata)
## [1] TRUE
# Structure of the dataframe
str(mydata)
## 'data.frame': 5 obs. of 3 variables:
## $ ID : chr "S1" "S2" "S3" "S4" ...
## $ type : Factor w/ 2 levels "1","2": 1 1 1 2 2
## $ Variable: num 1.2 2.3 2.5 4.2 1.2
# summary of the dataframe
summary(mydata)
## ID type Variable
## Length:5 1:3 Min. :1.20
## Class :character 2:2 1st Qu.:1.20
## Mode :character Median :2.30
## Mean :2.28
## 3rd Qu.:2.50
## Max. :4.20
# Print the 2nd row of the data.frame
mydata[2,]
## ID type Variable
## 2 S2 1 2.3
# Print the 3rd column of the data.frame
mydata [,3]
## [1] 1.2 2.3 2.5 4.2 1.2
mydata[,"Variable"]
## [1] 1.2 2.3 2.5 4.2 1.2
# Print only rows containing S1, S3 and S4 ID
mydata[mydata[,"ID"] %in% c("S1","S3","S4"),]
## ID type Variable
## 1 S1 1 1.2
## 3 S3 1 2.5
## 4 S4 2 4.2
# Do not print rows containing S1, S3 and S4 ID
mydata[!mydata[,"ID"] %in% c("S1","S3","S4"),]
## ID type Variable
## 2 S2 1 2.3
## 5 S5 2 1.2
# Transform the matr matrix into a data.frame
tmp<-as.data.frame(matr)
tmp
## V1 V2 V3 V4 V5 V6
## 1 1.0 2.5 45 4.2 15 1
## 2 2.3 26.0 5 1.2 10 2
## 3 1.0 2.0 3 4.0 5 6
# Is it a matrix or a data.frame?
is.matrix(tmp)
## [1] FALSE
is.data.frame(tmp)
## [1] TRUE
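Columns of a data frame can also be read and created with $, and rows can be filtered with a logical condition. This is a sketch on a small illustrative data frame df; the column Double is invented for the example.

```r
df <- data.frame(Id = c("S1", "S2", "S3", "S4"),
                 Type = factor(c(1, 1, 2, 2)),
                 Variable = c(1.2, 2.3, 4.2, 1.2))

df$Variable                    # a column as a vector
df$Double <- df$Variable * 2   # add a new column

# Rows for which Variable is greater than 2 (rows 2 and 3)
df[df$Variable > 2, ]

# subset() does the same and can also select columns
subset(df, Type == "2", select = c(Id, Variable))
```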
Manipulate data in R
Libraries useful for manipulating data in R are provided by RStudio:
- dplyr: https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf
- tidyr: https://github.com/rstudio/cheatsheets/raw/master/data-import.pdf
Functions for character variables
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
# Create a character vector
mychar<-c("plane","train","bus","car")
mychar
## [1] "plane" "train" "bus" "car"
#--- Append the pattern "funny" to the end of each value, separated by a "-"
mychar<-paste(mychar,"funny",sep="-")
mychar
## [1] "plane-funny" "train-funny" "bus-funny" "car-funny"
#--- Prepend the pattern "A" to the beginning of each value, separated by a "_"
mychar<-paste("A",mychar,sep="_")
mychar
## [1] "A_plane-funny" "A_train-funny" "A_bus-funny" "A_car-funny"
# Remove the pattern "A_" from mychar
mychar<-gsub(pattern="A_",replacement="",x=mychar)
mychar
## [1] "plane-funny" "train-funny" "bus-funny" "car-funny"
# Replace "funny" by "little" in mychar
mychar<-gsub(pattern="funny",replacement="little",x=mychar)
mychar
## [1] "plane-little" "train-little" "bus-little" "car-little"
# Set all in upper case
mychar<-toupper(mychar)
mychar
## [1] "PLANE-LITTLE" "TRAIN-LITTLE" "BUS-LITTLE" "CAR-LITTLE"
# Set all in lower case
mychar<-tolower(mychar)
mychar
## [1] "plane-little" "train-little" "bus-little" "car-little"
# Count the number of characters for each element of mychar
nchar(mychar)
## [1] 12 12 10 10
# Split each element of mychar on a symbol (here "-")
# This function returns a list!!
strsplit(mychar,split="-")
## [[1]]
## [1] "plane" "little"
##
## [[2]]
## [1] "train" "little"
##
## [[3]]
## [1] "bus" "little"
##
## [[4]]
## [1] "car" "little"
#--- Have a look at the other functions dedicated to manipulating character vectors
# grep()
# sub(), gsub()
# regexpr() etc...
# These functions (grep etc...) rely on regular expressions
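The functions listed above can be tried on a small vector. The sketch below uses an illustrative vector named words and only base R functions.

```r
words <- c("plane-little", "train-little", "bus-little", "car-little")

grep("a", words)                 # 1 2 4: positions of the elements containing "a"
grep("a", words, value = TRUE)   # the matching elements themselves
grepl("^b", words)               # FALSE FALSE TRUE FALSE: starts with "b"?

sub("l", "L", words)             # replaces only the first "l" of each element
gsub("l", "L", words)            # replaces every "l"

as.integer(regexpr("-", words))  # 6 6 4 4: position of the first "-"
substr(words, 1, 3)              # "pla" "tra" "bus" "car"
```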
Import a data set
All the following functions import tabular data. The user specifies the path to the file, the file name, the field separator, whether there is a header, the missing data symbol etc… Please have a look at their help pages.
- read.table(): reads a file in table format and creates a data frame from it, with cases corresponding to lines and variables to fields in the file. csv and txt files are allowed.
- library readr: a set of functions faster than read.table(), which import the same kinds of files (csv, txt and so on…). https://readr.tidyverse.org/
- library readxl: makes it easy to get data out of Excel and into R. It is easy to install and use on all operating systems, and is designed to work with tabular data. https://readxl.tidyverse.org/
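As a minimal sketch of an import with read.table(), the example below first writes a small semicolon-separated file to a temporary location, so that it can be run anywhere. The file content and the names tmp_file and dat are invented for the illustration.

```r
# Create a small example file in a temporary location
tmp_file <- tempfile(fileext = ".csv")
writeLines(c("Id;Type;Value",
             "S1;BA;1.61",
             "S2;BA;NA",
             "S3;CL;1.75"), tmp_file)

# header: the first line holds the column names
# sep: the field separator; na.strings: the missing data symbol
dat <- read.table(tmp_file, header = TRUE, sep = ";",
                  na.strings = "NA", stringsAsFactors = FALSE)
str(dat)

# With readr installed, readr::read_delim(tmp_file, delim = ";")
# should read the same file and return a tibble
```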
Sorting a data set
Souche<-paste0("S",seq(1,19,1))
Type<-c(rep("BA",5),rep("CL",4),rep("CO",5),rep("FB",5))
Vmax<-c(1.61,1.68,1.75,1.66,1.70,1.54,1.81,1.86,1.69,1.59,1.62,1.55,1.83,1.72,1.85,1.71,1.85,1.71,1.93)
T80<-c(163,444,80,86,124,69,122,NA,86,86,81,86,80,102,NA,340,133,144,98)
mydata<-cbind.data.frame(Souche,Type,Vmax,T80)
str(mydata)
## 'data.frame': 19 obs. of 4 variables:
## $ Souche: chr "S1" "S2" "S3" "S4" ...
## $ Type : chr "BA" "BA" "BA" "BA" ...
## $ Vmax : num 1.61 1.68 1.75 1.66 1.7 1.54 1.81 1.86 1.69 1.59 ...
## $ T80 : num 163 444 80 86 124 69 122 NA 86 86 ...
dim(mydata)
## [1] 19 4
# Print the first rows of mydata
head(mydata)
## Souche Type Vmax T80
## 1 S1 BA 1.61 163
## 2 S2 BA 1.68 444
## 3 S3 BA 1.75 80
## 4 S4 BA 1.66 86
## 5 S5 BA 1.70 124
## 6 S6 CL 1.54 69
# Print the last rows of mydata
tail(mydata)
## Souche Type Vmax T80
## 14 S14 CO 1.72 102
## 15 S15 FB 1.85 NA
## 16 S16 FB 1.71 340
## 17 S17 FB 1.85 133
## 18 S18 FB 1.71 144
## 19 S19 FB 1.93 98
# Sorting mydata according to Vmax column with a function from base
mydata[order(mydata[,"Vmax"]),]
## Souche Type Vmax T80
## 6 S6 CL 1.54 69
## 12 S12 CO 1.55 86
## 10 S10 CO 1.59 86
## 1 S1 BA 1.61 163
## 11 S11 CO 1.62 81
## 4 S4 BA 1.66 86
## 2 S2 BA 1.68 444
## 9 S9 CL 1.69 86
## 5 S5 BA 1.70 124
## 16 S16 FB 1.71 340
## 18 S18 FB 1.71 144
## 14 S14 CO 1.72 102
## 3 S3 BA 1.75 80
## 7 S7 CL 1.81 122
## 13 S13 CO 1.83 80
## 15 S15 FB 1.85 NA
## 17 S17 FB 1.85 133
## 8 S8 CL 1.86 NA
## 19 S19 FB 1.93 98
# Sorting mydata according to Vmax column with a dplyr function
arrange(mydata,Vmax)
## Souche Type Vmax T80
## 1 S6 CL 1.54 69
## 2 S12 CO 1.55 86
## 3 S10 CO 1.59 86
## 4 S1 BA 1.61 163
## 5 S11 CO 1.62 81
## 6 S4 BA 1.66 86
## 7 S2 BA 1.68 444
## 8 S9 CL 1.69 86
## 9 S5 BA 1.70 124
## 10 S16 FB 1.71 340
## 11 S18 FB 1.71 144
## 12 S14 CO 1.72 102
## 13 S3 BA 1.75 80
## 14 S7 CL 1.81 122
## 15 S13 CO 1.83 80
## 16 S15 FB 1.85 NA
## 17 S17 FB 1.85 133
## 18 S8 CL 1.86 NA
## 19 S19 FB 1.93 98
# Sorting mydata according to Vmax column with a dplyr function, in descending order
arrange(mydata,desc(Vmax))
## Souche Type Vmax T80
## 1 S19 FB 1.93 98
## 2 S8 CL 1.86 NA
## 3 S15 FB 1.85 NA
## 4 S17 FB 1.85 133
## 5 S13 CO 1.83 80
## 6 S7 CL 1.81 122
## 7 S3 BA 1.75 80
## 8 S14 CO 1.72 102
## 9 S16 FB 1.71 340
## 10 S18 FB 1.71 144
## 11 S5 BA 1.70 124
## 12 S9 CL 1.69 86
## 13 S2 BA 1.68 444
## 14 S4 BA 1.66 86
## 15 S11 CO 1.62 81
## 16 S1 BA 1.61 163
## 17 S10 CO 1.59 86
## 18 S12 CO 1.55 86
## 19 S6 CL 1.54 69
Dealing with missing data
# Remove rows with NA
mydata2<-na.omit(mydata)
# with the NAs
str(mydata)
## 'data.frame': 19 obs. of 4 variables:
## $ Souche: chr "S1" "S2" "S3" "S4" ...
## $ Type : chr "BA" "BA" "BA" "BA" ...
## $ Vmax : num 1.61 1.68 1.75 1.66 1.7 1.54 1.81 1.86 1.69 1.59 ...
## $ T80 : num 163 444 80 86 124 69 122 NA 86 86 ...
# without the NAs
str(mydata2)
## 'data.frame': 17 obs. of 4 variables:
## $ Souche: chr "S1" "S2" "S3" "S4" ...
## $ Type : chr "BA" "BA" "BA" "BA" ...
## $ Vmax : num 1.61 1.68 1.75 1.66 1.7 1.54 1.81 1.69 1.59 1.62 ...
## $ T80 : num 163 444 80 86 124 69 122 86 86 81 ...
## - attr(*, "na.action")= 'omit' Named int [1:2] 8 15
## ..- attr(*, "names")= chr [1:2] "8" "15"
With the tidyr library:
- drop_na(): remove the rows containing missing values
- replace_na(): replace missing values with a specified value
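A short sketch of these two tidyr functions on an illustrative data frame df (three rows taken from the data above), assuming tidyr is installed:

```r
library(tidyr)

df <- data.frame(Souche = c("S7", "S8", "S9"),
                 Vmax = c(1.81, 1.86, 1.69),
                 T80 = c(122, NA, 86))

# Drop every row that has an NA in any column (S8 is removed)
drop_na(df)

# Only look for NA in the Vmax column (nothing is removed here)
drop_na(df, Vmax)

# Replace NA with a chosen value, column by column
replace_na(df, list(T80 = 0))
```

Replacing missing values with 0 is only for illustration; whether a replacement value makes sense depends on the analysis.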
Mutate family functions with dplyr
mutate() adds new columns to your data set and can be used with all kinds of base functions:
https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf
- offsets: lag(), lead()
- cumulative aggregates: cumall(), cumany(), cummax(), cummean() and so on…
- rankings: cume_dist(), dense_rank(), min_rank(), row_number()
- math: +, -, *, /, log(), log2(), sin(), cos(), sqrt() and so on…
- misc: between(), na_if(), if_else(), recode(), recode_factor() and so on…
mutate(mydata,sqrt_vmax=sqrt(Vmax))
## Souche Type Vmax T80 sqrt_vmax
## 1 S1 BA 1.61 163 1.268858
## 2 S2 BA 1.68 444 1.296148
## 3 S3 BA 1.75 80 1.322876
## 4 S4 BA 1.66 86 1.288410
## 5 S5 BA 1.70 124 1.303840
## 6 S6 CL 1.54 69 1.240967
## 7 S7 CL 1.81 122 1.345362
## 8 S8 CL 1.86 NA 1.363818
## 9 S9 CL 1.69 86 1.300000
## 10 S10 CO 1.59 86 1.260952
## 11 S11 CO 1.62 81 1.272792
## 12 S12 CO 1.55 86 1.244990
## 13 S13 CO 1.83 80 1.352775
## 14 S14 CO 1.72 102 1.311488
## 15 S15 FB 1.85 NA 1.360147
## 16 S16 FB 1.71 340 1.307670
## 17 S17 FB 1.85 133 1.360147
## 18 S18 FB 1.71 144 1.307670
## 19 S19 FB 1.93 98 1.389244
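Several of the helper functions listed above can be combined in a single mutate() call. This sketch uses a reduced, illustrative data frame df, and the new column names are invented for the example.

```r
library(dplyr)

df <- data.frame(Souche = c("S1", "S2", "S3", "S4"),
                 Vmax = c(1.61, 1.68, 1.75, 1.66))

out <- mutate(df,
  previous    = dplyr::lag(Vmax),                  # value of the previous row (NA for the first)
  running_max = cummax(Vmax),                      # cumulative maximum
  vmax_rank   = min_rank(Vmax),                    # rank, 1 = smallest value
  high        = if_else(Vmax > 1.65, "yes", "no"), # vectorised if/else
  in_range    = between(Vmax, 1.6, 1.7)            # TRUE if 1.6 <= Vmax <= 1.7
)
out
```

lag() is written as dplyr::lag() here because stats also has a lag() function, as the message printed when loading dplyr points out.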
Joining data sets
- to concatenate 2 datasets column-wise: bind_cols()
- to join 2 datasets: left_join(), right_join(), inner_join(), full_join(). Please have a look at the cheatsheet for these functions to better understand their outputs: https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf
- left_join(x,y,…): keep all rows of x and add the matching values from y
- right_join(x,y,…): keep all rows of y and add the matching values from x
- inner_join(x,y,…): join data, retain only rows with matches in both x and y
- full_join(x,y,…): join data, retain all values, all rows
You specify the column(s) (key) by which to join the data sets: by="A", by=c("col_df1"="col_df2"), and so on.
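The behaviour of the four joins is easiest to see on two small tables that share only some keys. The data frames strains, types and types2 below are invented for this sketch.

```r
library(dplyr)

strains <- data.frame(Souche = c("S1", "S2", "S3"),
                      Vmax = c(1.61, 1.68, 1.75))
types <- data.frame(Souche = c("S2", "S3", "S4"),
                    Type = c("BA", "BA", "CL"))

left_join(strains, types, by = "Souche")    # S1, S2, S3 (Type is NA for S1)
right_join(strains, types, by = "Souche")   # S2, S3, S4 (Vmax is NA for S4)
inner_join(strains, types, by = "Souche")   # S2, S3 only
full_join(strains, types, by = "Souche")    # S1, S2, S3, S4

# When the key has a different name in each table
types2 <- rename(types, Strain = Souche)
left_join(strains, types2, by = c("Souche" = "Strain"))

# bind_cols() pastes columns side by side, without any matching on keys
bind_cols(strains, data.frame(T80 = c(163, 444, 80)))
```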
Summarize a data set
The summarize functions of the dplyr library create new variables, in a new data set, that summarize the initial data set.
3 ways:
- summarise(dataset,newVar = mean(col1)): apply a summarizing function to one variable over the whole dataset
- summarise(dataset,across(everything(),mean)): apply a summarizing function to each column over the whole dataset (this form replaces summarise_each(dataset,funs(mean)), which is deprecated in current versions of dplyr)
- summarise(group_by(dataset,col1), newVar=mean(col2)): apply a summarizing function to grouped data from the initial dataset
Some summarizing functions:
- first: first value of a vector
- last: last value of a vector
- nth: nth value of a vector
- n: number of rows in the current group
- n_distinct: number of distinct values in a vector
- IQR: IQR of a vector
- min, max: minimum and maximum of a vector
- mean, median, var, sd: take care to use these with the na.rm=TRUE option when the data contain missing values
summarise(mydata,mymean=mean(Vmax))
## mymean
## 1 1.718947
summarise(group_by(mydata,Type),mymean=mean(Vmax))
## # A tibble: 4 x 2
## Type mymean
## <chr> <dbl>
## 1 BA 1.68
## 2 CL 1.72
## 3 CO 1.66
## 4 FB 1.81
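Several summaries can be computed at once, and na.rm=TRUE matters as soon as a column contains NA. The sketch below uses a reduced, illustrative data frame df and assumes dplyr 1.0.0 or later for across().

```r
library(dplyr)

df <- data.frame(Type = c("BA", "BA", "CL", "CL", "CL"),
                 Vmax = c(1.61, 1.68, 1.54, 1.81, 1.86),
                 T80 = c(163, 444, 69, 122, NA))

# Several summaries per group
by_type <- summarise(group_by(df, Type),
                     n = n(),                            # rows per group
                     mean_T80 = mean(T80, na.rm = TRUE), # ignores the NA in group CL
                     max_Vmax = max(Vmax))
by_type

# The same function applied to several columns of the whole data set
overall <- summarise(df, across(c(Vmax, T80), ~ mean(.x, na.rm = TRUE)))
overall
```

Without na.rm=TRUE, the mean of T80 for the CL group would be NA.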
Export a data set
write.table() exports a data.frame to a tabular file. The user specifies the column separator and the file name with its extension (csv, txt…).
# write.table(mydata,file="aName.txt",append=FALSE,quote=FALSE,sep="\t",row.names=FALSE)
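To check what such a call produces without touching the working directory, the file can be written to a temporary location and read back. This is a sketch with an illustrative data frame df and file name out_file.

```r
df <- data.frame(Souche = c("S1", "S2", "S3"),
                 Vmax = c(1.61, 1.68, 1.75),
                 T80 = c(163, NA, 80))

out_file <- tempfile(fileext = ".txt")
write.table(df, file = out_file, append = FALSE, quote = FALSE,
            sep = "\t", row.names = FALSE)

# First line of the file: the column names, separated by tabs
readLines(out_file)[1]

# Read the file back to check the round trip (NA is written as "NA")
back <- read.table(out_file, header = TRUE, sep = "\t")
str(back)
```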
Save an analysis
# Save all the objects of the workspace
save(list=ls(),file="aName.Rdata")
# Save just one object
save(toto,file="toto.Rdata")
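A minimal round trip shows how a saved object is restored with load(). The file names below are temporary and illustrative; saveRDS() and readRDS() are shown as a base R alternative for a single object.

```r
rdata_file <- tempfile(fileext = ".Rdata")

toto <- c(1, 2, 3)
save(toto, file = rdata_file)   # the file name must be given with file=
rm(toto)                        # remove the object from the workspace

# load() restores the object under its original name
# and returns the names of the restored objects
loaded <- load(rdata_file)
loaded   # "toto"
toto     # 1 2 3

# Alternative for a single object: the name is chosen when reading
rds_file <- tempfile(fileext = ".rds")
saveRDS(toto, rds_file)
copy <- readRDS(rds_file)
```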
Create a report: rmarkdown
It is possible to create a report in R by combining the R and markdown languages. Please have a look at the rmarkdown web site:
https://rmarkdown.rstudio.com/gallery.html
as well as the cheatsheet, which lists all the steps needed to produce a report:
https://github.com/rstudio/cheatsheets/raw/master/rmarkdown-2.0.pdf
The report can be produced as an HTML file, a PDF file (you need pandoc and LaTeX installed on your system), a DOCX file, slides, a web site etc…
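As a rough sketch of the workflow, the example below writes a very small R Markdown source file to a temporary folder and renders it to HTML. The file name and its content are invented for the illustration, and the rendering step only runs if the rmarkdown package and pandoc are available.

```r
# The chunk delimiter is three backticks; it is built with strrep()
# so that it can be written from within this example
fence <- strrep("`", 3)

rmd_lines <- c(
  "---",
  "title: \"Minimal report\"",
  "output: html_document",
  "---",
  "",
  "Some text written in markdown.",
  "",
  paste0(fence, "{r}"),
  "summary(c(1.61, 1.68, 1.75))",
  fence
)

rmd_file <- file.path(tempdir(), "minimal_report.Rmd")
writeLines(rmd_lines, rmd_file)

# Render only if rmarkdown and pandoc are available on this system
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  out <- rmarkdown::render(rmd_file, quiet = TRUE)
  out   # path of the generated HTML file
}
```

In RStudio, the same result is usually obtained by opening the .Rmd file and using the Knit button.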