Image Credit: Pixabay
It is all about extracting information out of data in order to make better decisions.
Basics of R Programming
Data Import
Data Wrangling
Data Visualization
R is a software environment for statistical computing and graphics.
Language designers: Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand.
Parent language: S
The latest R version, 4.0.1, was released on 2020-06-06.
Free
Powerful: Over 14600 contributed packages on the main repository (CRAN), as of July 2019, provided by top international researchers and programmers.
Flexible: It is a language, and thus allows you to create your own solutions.
Community: A large, friendly, and helpful global community, with lots of resources.
Image Credit: R-Ladies Newcastle
"If R were an airplane, RStudio would be the airport, providing many, many supporting services that make it easier for you, the pilot, to take off and go to awesome places. Sure, you can fly an airplane without an airport, but having those runways and supporting infrastructure is a game-changer."
-- Julie Lowndes
Image Credit: Clastic Detritus
7+1
[1] 8
rnorm(10)
[1]  0.42886236  0.87398624 -0.14720398  1.29780260  0.09344244  1.44028046
[7]  0.73037732 -1.11167627  0.50529457  0.69291615
a <- rnorm(10)
a
[1] -0.9555159 -0.5979601 -1.0782460  1.0624732  0.2590546  0.2246372
[7] -0.6120881  0.6849044  2.4574333  0.8701314
b <- a * 100
b
[1]  -95.55159  -59.79601 -107.82460  106.24732   25.90546   22.46372
[7]  -61.20881   68.49044  245.74333   87.01314
ls() can be used to display the names of the objects that are currently stored within R.
The collection of objects currently stored is called the workspace.
ls()
[1] "a" "b" "p"
To remove objects, use the function rm().
Remove all objects: rm(list = ls())
Remove specific objects: rm(x, y, z)
rm(a)
ls()
[1] "b" "p"
rm(list = ls())
ls()
character(0)
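The ls() and rm() steps above can be tried safely with a small sketch like the one below. It assumes a scratch environment (the name demo_env is hypothetical) so that your real workspace is left untouched; in an interactive session you would normally call ls() and rm() without the envir argument.

```r
# A minimal sketch: list and remove objects inside a scratch environment.
# demo_env is a hypothetical name used only for this illustration.
demo_env <- new.env()
assign("a", rnorm(10), envir = demo_env)
assign("b", 1:5, envir = demo_env)

ls(demo_env)                               # names of the stored objects
rm("a", envir = demo_env)                  # remove one specific object
ls(demo_env)                               # only "b" is left
rm(list = ls(demo_env), envir = demo_env)  # remove everything that remains
ls(demo_env)                               # nothing left to list
```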
At the end of an R session, if you choose to save, the objects are written to a file called .RData in the current directory, and the command lines used in the session are saved to a file called .Rhistory.
When R is started at a later time from the same directory, it reloads the associated workspace and command history.
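Saving and reloading can also be done by hand. The sketch below is only an illustration: it writes a single object to a temporary file (the names scores and tmp are hypothetical) with save() and restores it with load(). The save.image() function does the same for the whole workspace.

```r
# A minimal sketch: save one object to disk and load it back.
scores <- c(1, 2, 3)                 # hypothetical example object
tmp <- tempfile(fileext = ".RData")  # temporary file, not the real .RData

save(scores, file = tmp)  # write the object to the file
rm(scores)                # remove it from the workspace
load(tmp)                 # restore it under its original name
scores
```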
rnorm(10) # This is a comment
[1] -0.7146212 -0.6647724 -0.4563877 -0.3020826  0.9044316  0.7319009
[7] -2.1557290 -0.7404257  1.2983685  1.0892268
sum(1:10) # 1 + 2 + ... + 10
[1] 55
sum(1:10)#Bad commenting style
[1] 55
sum(1:10) # Good commenting style
[1] 55
# Read data ----------------
# Plot data ----------------
To learn more, read Hadley Wickham's style guide.
R is an object-oriented language.
An object in R is anything that can be assigned to a variable (data structures, functions, etc.).
Let's take a look at some common types of objects.
Data structures are the ways of arranging data.
Functions tell R to do something.
A function may be applied to an object.
The result of applying a function is usually an object too.
In a function call, the function name must be followed by parentheses.
a <- 1:20 # data structure
sum(a) # sum is a function applied to a
[1] 210
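The points above can be sketched as follows: the value returned by sum() is an object that can be stored, and the function itself is an object that can be assigned to another name. The names total and add_up are hypothetical and used only for illustration.

```r
# A minimal sketch: function results and functions themselves are objects.
a <- 1:20

total <- sum(a)   # the result of the call is stored as a new object
total
class(total)      # the result has a class, like any other object

is.function(sum)  # sum is an object of type function
add_up <- sum     # so it can be assigned to another (hypothetical) name
add_up(a)         # and called through that name
```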
help.start() # Some functions work on their own.
help(rnorm)
Keywords and operators such as for, if, and [[ must be quoted:
help("[[")
help.search("weighted mean")
?rnorm
??rnorm
Data structures differ in terms of,
Type of data they can hold
How they are created
Structural complexity
Notation to identify and access individual elements
Image Credit: venus.ifca.unican.es
Vectors are one-dimensional arrays that can hold numeric data, character data, or logical data.
The combine function c() is used to form a vector.
Data in a vector must only be one type or mode (numeric, character, or logical). You can’t mix modes in the same vector.
Syntax
vector_name <- c(element1, element2, element3)
x <- c(5, 6, 3, 1 , 100)
<- is the assignment operator; = can be used as an alternative.
c() is the combine function.
What will be the output of the following code?
y <- c(x, 500, 600)
first_vec <- c(10, 20, 50, 70)
second_vec <- c("Jan", "Feb", "March", "April")
third_vec <- c(TRUE, FALSE, TRUE, TRUE)
fourth_vec <- c(10L, 20L, 50L, 70L)
To check whether an object is a vector, or a vector of a particular type, use the is.*() functions.
is.vector()
is.vector(first_vec)
[1] TRUE
is.character()
is.character(first_vec)
[1] FALSE
is.double()
is.double(first_vec)
[1] TRUE
is.integer()
is.integer(first_vec)
[1] FALSE
is.logical()
is.logical(first_vec)
[1] FALSE
length(first_vec)
[1] 4
Vectors must be homogeneous. When you attempt to combine different types they will be coerced to the most flexible type so that every element in the vector is of the same type.
Order from least to most flexible: logical --> integer --> double --> character
a <- c(3.1, 2L, 3, 4, "GPA")
typeof(a)
[1] "character"
anew <- c(3.1, 2L, 3, 4)
typeof(anew)
[1] "double"
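The coercion order above can be checked one step at a time. The sketch below is only an illustration: it combines two types at each step and asks typeof() which type the resulting vector has. The variable names are hypothetical.

```r
# A minimal sketch: each combination is coerced to the more flexible type.
t_lgl_int <- typeof(c(TRUE, 2L))     # logical + integer
t_int_dbl <- typeof(c(2L, 3.5))      # integer + double
t_dbl_chr <- typeof(c(3.5, "GPA"))   # double + character

t_lgl_int
t_int_dbl
t_dbl_chr

# Logical values become 1 and 0 when coerced to a numeric type
c(TRUE, FALSE, 2L)
```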
Vectors can be explicitly coerced from one class to another using the as.* functions, if available. For example, as.character, as.numeric, as.integer, and as.logical.
vec1 <- c(TRUE, FALSE, TRUE, TRUE)
typeof(vec1)
[1] "logical"
vec2 <- as.integer(vec1)
typeof(vec2)
[1] "integer"
vec2
[1] 1 0 1 1
Why does the code below output NAs?
x <- c("a", "b", "c")
as.numeric(x)
Warning: NAs introduced by coercion
[1] NA NA NA
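As a hint for the question above, the sketch below mixes strings that look like numbers with one that does not. It assumes a hypothetical vector called mixed and uses suppressWarnings() only to keep the output quiet; as.numeric() can convert text that reads as a number and returns NA for everything else.

```r
# A minimal sketch: only strings that read as numbers can be converted.
mixed <- c("1", "2.5", "abc", "4")        # hypothetical example data
nums <- suppressWarnings(as.numeric(mixed))

nums                  # the non-numeric string becomes NA
is.na(nums)           # flags the positions that failed to convert
mixed[is.na(nums)]    # the original strings that could not be converted
```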
x1 <- 1:3
x2 <- c(10, 20, 30)
combinedx1x2 <- c(x1, x2)
combinedx1x2
[1] 1 2 3 10 20 30
class(x1)
[1] "integer"
class(x2)
[1] "numeric"
class(combinedx1x2)
[1] "numeric"
y1 <- c(1, 2, 3)
y2 <- c("a", "b", "c")
c(y1, y2)
[1] "1" "2" "3" "a" "b" "c"
The : operator produces regularly spaced ascending or descending sequences.
10:16
[1] 10 11 12 13 14 15 16
-0.5:8.5
[1] -0.5 0.5 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5
seq(initial_value, final_value, increment)
seq(1,11)
[1] 1 2 3 4 5 6 7 8 9 10 11
seq(1, 11, length.out=5)
[1] 1.0 3.5 6.0 8.5 11.0
seq(0, 11, by=2)
[1] 0 2 4 6 8 10
rep()
rep(9, 5)
[1] 9 9 9 9 9
rep(1:4, 2)
[1] 1 2 3 4 1 2 3 4
rep(1:4, each=2) # each element is repeated twice
[1] 1 1 2 2 3 3 4 4
rep(1:4, times=2) # whole sequence is repeated twice
[1] 1 2 3 4 1 2 3 4
rep(1:4, each=2, times=3)
[1] 1 1 2 2 3 3 4 4 1 1 2 2 3 3 4 4 1 1 2 2 3 3 4 4
rep(1:4, 1:4)
[1] 1 2 2 3 3 3 4 4 4 4
rep(1:4, c(4, 1, 4, 2))
[1] 1 1 1 1 2 3 3 3 3 4 4
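Two further points may help when reading the seq() and rep() calls above; the sketch below is an illustration with hypothetical names (by_step, by_count, groups). First, by and length.out are two ways of describing the same sequence. Second, rep() works on any vector, not only on numbers.

```r
# A minimal sketch: two descriptions of one sequence, and rep() on text.
by_step  <- seq(0, 1, by = 0.25)        # fix the increment
by_count <- seq(0, 1, length.out = 5)   # fix the number of values
all.equal(by_step, by_count)            # both give 0, 0.25, 0.5, 0.75, 1

# times can be a vector: repeat the first label twice, the second three times
groups <- rep(c("control", "treatment"), times = c(2, 3))
groups
```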
c(1, 2, 3) == c(10, 20, 3)
[1] FALSE FALSE TRUE
c(1, 2, 3) != c(10, 20, 3)
[1] TRUE TRUE FALSE
1:5 > 3
[1] FALSE FALSE FALSE TRUE TRUE
1:5 < 3
[1] TRUE TRUE FALSE FALSE FALSE
<= - less than or equal to
>= - greater than or equal to
| - or
& - and
%in% - in the set
a <- c(1, 2, 3)
b <- c(1, 10, 3)
a %in% b
[1] TRUE FALSE TRUE
x <- 1:10
y <- 1:3
x
[1] 1 2 3 4 5 6 7 8 9 10
y
[1] 1 2 3
x %in% y
[1] TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
y %in% x
[1] TRUE TRUE TRUE
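The operators <=, >=, | and & are listed above without an example. The sketch below shows how they combine element by element; the names in_range, at_ends and not_even_low are hypothetical.

```r
# A minimal sketch: combining comparisons with & (and), | (or), and ! (not).
x <- 1:10

in_range <- x >= 3 & x <= 6   # TRUE where both conditions hold
in_range

at_ends <- x < 3 | x > 8      # TRUE where at least one condition holds
at_ends

not_even_low <- !(x %in% c(2, 4, 6))  # TRUE where x is not in the set
not_even_low

sum(in_range)                 # TRUE counts as 1, so this counts the matches
```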
c(10, 100, 100) + 2 # two is added to every element in the vector
[1] 12 102 102
v1 <- c(1, 2, 3); v2 <- c(10, 100, 1000)
v1 + v2
[1] 11 102 1003
Add two vectors of unequal length
longvec <- seq(10, 100, length.out = 10); shortvec <- c(1, 2, 3, 4, 5)
shortvec + longvec
[1] 11 22 33 44 55 61 72 83 94 105
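The result above comes from recycling: R repeats the shorter vector until it matches the length of the longer one. The sketch below makes that step explicit with rep(); the names recycled, implicit and explicit are hypothetical.

```r
# A minimal sketch: writing out the recycling that R performs implicitly.
longvec  <- seq(10, 100, length.out = 10)
shortvec <- c(1, 2, 3, 4, 5)

# Repeat the short vector until it is as long as the long one
recycled <- rep(shortvec, length.out = length(longvec))
recycled

implicit <- shortvec + longvec   # R recycles shortvec automatically
explicit <- recycled + longvec   # the same sum with recycling written out
all.equal(implicit, explicit)
```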
What will be the output of the following code?
first <- c(1, 2, 3, 4); second <- c(10, 100)
first * second
Use NA or NaN to place a missing value in a vector.
z <- c(10, 101, 2, 3, NA)
is.na(z)
[1] FALSE FALSE FALSE FALSE TRUE
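Missing values affect calculations, which the sketch below illustrates with the same vector z. It relies on the na.rm argument of sum() and mean(), and on is.nan(), none of which appear above; the name nan_val is hypothetical.

```r
# A minimal sketch: how NA and NaN behave in calculations.
z <- c(10, 101, 2, 3, NA)

sum(z)                  # NA: one missing value makes the total unknown
sum(z, na.rm = TRUE)    # 116: missing values are dropped first
mean(z, na.rm = TRUE)   # 29: mean of the four observed values

nan_val <- 0 / 0        # NaN, "not a number"
is.na(nan_val)          # TRUE: NaN is treated as missing
is.nan(NA)              # FALSE: NA is not NaN
```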
What R code would produce the following output?
[1] 1 2 3 4 5 5 4 3 2 1
myvec <- 1:20; myvec
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
myvec[1]
[1] 1
myvec[5:10]
[1] 5 6 7 8 9 10
myvec[-1]
[1] 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
myvec[myvec > 3]
[1] 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Extract elements present in vector a
a
[1] 1 2 3
myvec
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
myvec %in% a
 [1]  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
myvec[myvec %in% a]
[1] 1 2 3
b <- 100:105
myvec[myvec %in% b]
integer(0)
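The indexing forms above can be combined. The sketch below uses a small hypothetical vector v to show several positions at once, several exclusions at once, and a compound condition; which() is an addition not covered above, and it returns the positions where a condition is TRUE.

```r
# A minimal sketch: more ways to index a vector.
v <- c(10, 20, 30, 40, 50)   # hypothetical example data

picked  <- v[c(1, 3)]          # elements at positions 1 and 3
dropped <- v[-(1:2)]           # everything except positions 1 and 2
middle  <- v[v >= 30 & v < 50] # elements meeting both conditions
where   <- which(v > 25)       # positions (not values) where v > 25

picked
dropped
middle
where
```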
Generate a sequence using the code seq(from=1, to=10, by=1).
What other ways can you generate the same sequence?
Using the function rep, create the sequence below: 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4
Extract the 5th element.
Extract elements greater than 2.
Rectangular arrangement of data with rows corresponding to observational units and columns corresponding to variables.
More general than a matrix in that different columns can contain different modes of data.
It’s similar to the datasets you’d typically see in SPSS and MINITAB.
Data frames are the most common data structure you’ll deal with in R.
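The difference from a matrix can be seen in a small sketch. It assumes two hypothetical columns, id and city: a matrix coerces both to a single type, while a data frame keeps each column's own type. The stringsAsFactors = FALSE argument is included so the result does not depend on the R version.

```r
# A minimal sketch: a matrix holds one type, a data frame one type per column.
m <- cbind(id = c(1, 2), city = c("Beijing", "Wuhan"))
typeof(m)           # the numbers were coerced to character

df <- data.frame(id = c(1, 2),
                 city = c("Beijing", "Wuhan"),
                 stringsAsFactors = FALSE)
sapply(df, class)   # each column keeps its own class
```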
Image Credit: Hadley Wickham and Garrett Grolemund
Syntax
name_of_the_dataframe <- data.frame(
  var1_name = vector of values of the first variable,
  var2_name = vector of values of the second variable)
Example
corona <- data.frame(ID = c("C001", "C002", "C003", "C004"),
                     Location = c("Beijing", "Wuhan", "Shanghai", "Beijing"),
                     Test_Results = c(FALSE, TRUE, FALSE, FALSE))
corona
    ID Location Test_Results
1 C001  Beijing        FALSE
2 C002    Wuhan         TRUE
3 C003 Shanghai        FALSE
4 C004  Beijing        FALSE
To check if it is a dataframe
is.data.frame(corona)
[1] TRUE
colnames(corona)
[1] "ID" "Location" "Test_Results"
length(corona)
[1] 3
dim(corona)
[1] 4 3
nrow(corona)
[1] 4
ncol(corona)
[1] 3
summary(corona)
      ID              Location         Test_Results
 Length:4           Length:4           Mode :logical
 Class :character   Class :character   FALSE:3
 Mode  :character   Mode  :character   TRUE :1
str(corona)
'data.frame': 4 obs. of 3 variables:
 $ ID          : chr "C001" "C002" "C003" "C004"
 $ Location    : chr "Beijing" "Wuhan" "Shanghai" "Beijing"
 $ Test_Results: logi FALSE TRUE FALSE FALSE
corona$Location
[1] "Beijing" "Wuhan" "Shanghai" "Beijing"
corona[,2]
[1] "Beijing" "Wuhan" "Shanghai" "Beijing"
corona[, "Location"]
[1] "Beijing" "Wuhan" "Shanghai" "Beijing"
corona[2, ]
    ID Location Test_Results
2 C002    Wuhan         TRUE
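Row and column selection can be combined with the logical conditions used for vectors. The sketch below rebuilds the corona data frame so that it runs on its own; the names positives and beijing_ids are hypothetical.

```r
# A minimal sketch: selecting rows of a data frame by a condition.
corona <- data.frame(
  ID = c("C001", "C002", "C003", "C004"),
  Location = c("Beijing", "Wuhan", "Shanghai", "Beijing"),
  Test_Results = c(FALSE, TRUE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

# Rows where the test result is TRUE, all columns
positives <- corona[corona$Test_Results, ]
positives

# IDs of the rows recorded in Beijing
beijing_ids <- corona[corona$Location == "Beijing", "ID"]
beijing_ids

# First two rows, two named columns
corona[1:2, c("ID", "Location")]
```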
install.packages("ggplot2")
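Installing a package is needed only once, while loading it with library() is needed in every session. The sketch below is one possible way to check before loading; is_installed is a hypothetical helper, built on requireNamespace(), which returns TRUE or FALSE instead of stopping with an error.

```r
# A minimal sketch: check whether a package is available before loading it.
# is_installed is a hypothetical helper written for this illustration.
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

if (is_installed("ggplot2")) {
  library(ggplot2)   # load the package for this session
} else {
  message("ggplot2 is not installed; run install.packages(\"ggplot2\") first")
}
```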
Image Credit: Professor Di Cook Monash University, AUS
library(gapminder)
data(gapminder)
head(gapminder)
# A tibble: 6 x 6
  country     continent  year lifeExp      pop gdpPercap
  <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
1 Afghanistan Asia       1952    28.8  8425333      779.
2 Afghanistan Asia       1957    30.3  9240934      821.
3 Afghanistan Asia       1962    32.0 10267083      853.
4 Afghanistan Asia       1967    34.0 11537966      836.
5 Afghanistan Asia       1972    36.1 13079460      740.
6 Afghanistan Asia       1977    38.4 14880372      786.
str(gapminder)
tibble [1,704 × 6] (S3: tbl_df/tbl/data.frame)
 $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ year     : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
 $ lifeExp  : num [1:1704] 28.8 30.3 32 34 36.1 ...
 $ pop      : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
 $ gdpPercap: num [1:1704] 779 821 853 836 740 ...
Use the R dataset “gapminder” to answer the following questions:
How many rows and columns does gapminder have?
Extract column names in gapminder.
Display the first 10 rows of the data.
Display the last 3 rows of the data.
Select rows from 10 to 20, containing all variables.
Select rows from 10 to 20 containing country and lifeExp.
Create a single vector (a new object) called ‘LifeExpectancy’ containing the values in lifeExp in gapminder.
How many countries in Asia had lifeExp larger than 30 in 1952?
Image Credit: Pixabay