ICTC 3104.3 Data Analytics and Big Data

Basics of R Programming

2020-08-13

What is Data Analysis?

Image Credit: Pixabay

It is all about extracting information out of data in order to make better decisions.

Data Analysis Workflow

[Figure: data analysis workflow]

Outline

Basics of R Programming
Data Import
Data Wrangling
Data Visualization

What is R?

R is a software environment for statistical computing and graphics.
Language designers: Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand.
Parent language: S
R version 4.0.1 was released on 2020-06-06.

Why R?

Free
Powerful: Over 14600 contributed packages on the main repository (CRAN), as of July 2019, provided by top international researchers and programmers.
Flexible: It is a language, and thus allows you to create your own solutions.
Community: A large, friendly, and helpful global community, with lots of resources.

R environment

The RStudio IDE

Image Credit: Clastic Detritus

R and RStudio

"If R were an airplane, RStudio would be the airport, providing many, many supporting services that make it easier for you, the pilot, to take off and go to awesome places. Sure, you can fly an airplane without an airport, but having those runways and supporting infrastructure is a game-changer."

-- Julie Lowndes

Image Credit: Clastic Detritus

Create a new project

R Console

7+1

[1] 8

rnorm(10)

 [1]  0.42886236  0.87398624 -0.14720398  1.29780260  0.09344244  1.44028046
 [7]  0.73037732 -1.11167627  0.50529457  0.69291615

Variable assignment

a <- rnorm(10)
a

 [1] -0.9555159 -0.5979601 -1.0782460  1.0624732  0.2590546  0.2246372
 [7] -0.6120881  0.6849044  2.4574333  0.8701314

b <- a*100
b

 [1]  -95.55159  -59.79601 -107.82460  106.24732   25.90546   22.46372
 [7]  -61.20881   68.49044  245.74333   87.01314

Data permanency

ls() can be used to display the names of the objects which are currently stored within R.
The collection of objects currently stored is called the workspace.

ls()

[1] "a" "b" "p"

To remove objects, use the function rm().
- remove all objects: rm(list=ls())
- remove specific objects: rm(x, y, z)

rm(a)
ls()

[1] "b" "p"

rm(list=ls())
ls()

character(0)

At the end of an R session, if you choose to save, the objects are written to a file called .RData in the current directory, and the command lines used in the session are saved to a file called .Rhistory.

When R is started at a later time from the same directory, it reloads the associated workspace and command history.
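
The same save-and-restore cycle can also be run by hand. The following is a minimal sketch, assuming a throwaway file in a temporary directory (the name demo.RData is made up for illustration); save.image() and load() are the base R functions that write and read a workspace file.

```r
# Hypothetical file name in a temporary directory, so no real .RData is overwritten
workspace_file <- file.path(tempdir(), "demo.RData")

a <- 1:5
b <- a * 100

# Write all objects in the current workspace to the file
save.image(file = workspace_file)

# Remove two objects, then restore them from the saved file
rm(a, b)
load(workspace_file)
b
```

Saving to an explicit file name keeps the example separate from the .RData file that R writes automatically when a session ends.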

Comment your code

Each line of a comment should begin with the comment symbol and a single space: # .

rnorm(10) # This is a comment

 [1] -0.7146212 -0.6647724 -0.4563877 -0.3020826  0.9044316  0.7319009
 [7] -2.1557290 -0.7404257  1.2983685  1.0892268

sum(1:10) # 1 + 2 + ... + 10

[1] 55

Style Guide

Good coding style is like correct punctuation: you can manage without it, butitsuremakesthingseasiertoread. -- Hadley Wickham

sum(1:10)#Bad commenting style

[1] 55

sum(1:10) # Good commenting style

[1] 55

Also, use commented lines of - and = to break up your file into easily readable sub-sections.

# Read data ----------------
# Plot data ----------------

To learn more, read Hadley Wickham's style guide.

Objects in R

R is an object-oriented language.
An object in R is anything (data structures, functions, etc.) that can be assigned to a variable.

Let's take a look at some common types of objects.

Data structures are the ways of arranging data.
- You can create objects using the left-pointing arrow <-
Functions tell R to do something.
- A function may be applied to an object.
- The result of applying a function is usually an object too.
- To call a function, follow its name with parentheses.

a <- 1:20 # data structure
sum(a) # sum is a function applied on a

[1] 210

help.start() # Some functions work on their own.
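
As a small illustration of the points above, the sketch below stores the result of a function call as a new object and shows that a function can itself be assigned to a variable. The names values, total, average, and add_up are made up for this example.

```r
values <- c(4, 8, 15, 16, 23, 42)   # a data structure (numeric vector)

total <- sum(values)                # the result of a function is an object too
average <- total / length(values)
total                               # 108
average                             # 18

class(sum)                          # "function"
add_up <- sum                       # a function can be assigned like any other object
add_up(values)                      # 108
```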

Getting help with functions and features

R has a built-in help facility.

Method 1

help(rnorm)

For a feature specified by special characters or reserved words, such as for, if, or [[, put the name in quotes:

help("[[")

Search the help files for a word or phrase.

help.search("weighted mean")

Method 2

?rnorm

??rnorm

Data structures

Data structures differ in terms of:

Type of data they can hold
How they are created
Structural complexity
Notation to identify and access individual elements

Image Credit: venus.ifca.unican.es

1. Vectors

Vectors

Vectors are one-dimensional arrays that can hold numeric data, character data, or logical data.
The combine function c() is used to form a vector.
Data in a vector must be of only one type or mode (numeric, character, or logical). You can’t mix modes in the same vector.

Vector assignment

Syntax

vector_name <- c(element1, element2, element3)

x <- c(5, 6, 3, 1, 100)

<- is the assignment operator; = can be used as an alternative.
c() is the combine function.

What will be the output of the following code?

y <- c(x, 500, 600)

Types and tests with vectors

first_vec <- c(10, 20, 50, 70)
second_vec <- c("Jan", "Feb", "March", "April")
third_vec <- c(TRUE, FALSE, TRUE, TRUE)
fourth_vec <- c(10L, 20L, 50L, 70L)

To check if it is a

vector: is.vector()

is.vector(first_vec)

[1] TRUE

character vector: is.character()

is.character(first_vec)

[1] FALSE

double: is.double()

is.double(first_vec)

[1] TRUE

integer: is.integer()

is.integer(first_vec)

[1] FALSE

logical: is.logical()

is.logical(first_vec)

[1] FALSE

length

length(first_vec)

[1] 4

Coercion

Vectors must be homogeneous. When you attempt to combine different types they will be coerced to the most flexible type so that every element in the vector is of the same type.

Order from least to most flexible

logical --> integer --> double --> character

a <- c(3.1, 2L, 3, 4, "GPA") 
typeof(a)

[1] "character"

anew <- c(3.1, 2L, 3, 4)
typeof(anew)

[1] "double"
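
The coercion order can be checked one step at a time. This is a short sketch using only base R; the vector name passed is made up for illustration.

```r
typeof(c(TRUE, 2L))      # "integer": TRUE becomes 1L
typeof(c(2L, 3.5))       # "double"
typeof(c(3.5, "GPA"))    # "character"

# Coercion also happens inside functions: TRUE counts as 1, FALSE as 0
passed <- c(TRUE, FALSE, TRUE, TRUE)
sum(passed)              # 3
mean(passed)             # 0.75
```

Summing a logical vector is a common way to count how many elements satisfy a condition.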

Explicit coercion

Vectors can be explicitly coerced from one class to another using the as.* functions, if available. For example, as.character, as.numeric, as.integer, and as.logical.

vec1 <- c(TRUE, FALSE, TRUE, TRUE)
typeof(vec1)

[1] "logical"

vec2 <- as.integer(vec1)
typeof(vec2)

[1] "integer"

vec2

[1] 1 0 1 1

Why does the code below output NAs?

x <- c("a", "b", "c")
as.numeric(x)

Warning: NAs introduced by coercion

[1] NA NA NA
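
A hint: as.numeric() can only convert strings that look like numbers. The sketch below uses a made-up vector, mixed, and wraps the call in suppressWarnings() only to keep the output quiet.

```r
mixed <- c("10", "2.5", "abc")

# "10" and "2.5" can be read as numbers; "abc" cannot, so it becomes NA
converted <- suppressWarnings(as.numeric(mixed))
converted          # 10.0  2.5   NA
is.na(converted)   # FALSE FALSE  TRUE
```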

x1 <- 1:3
x2 <- c(10, 20, 30)
combinedx1x2 <- c(x1, x2)
combinedx1x2

[1]  1  2  3 10 20 30

class(x1)

[1] "integer"

class(x2)

[1] "numeric"

class(combinedx1x2)

[1] "numeric"

If you combine a numeric vector and a character vector

y1 <- c(1, 2, 3)
y2 <- c("a", "b", "c")
c(y1, y2)

[1] "1" "2" "3" "a" "b" "c"

Simplifying vector creation

The colon operator : produces regularly spaced ascending or descending sequences.

 10:16

[1] 10 11 12 13 14 15 16

-0.5:8.5

 [1] -0.5  0.5  1.5  2.5  3.5  4.5  5.5  6.5  7.5  8.5

sequence: seq(initial_value, final_value, increment)

seq(1,11)

 [1]  1  2  3  4  5  6  7  8  9 10 11

seq(1, 11, length.out=5)

[1]  1.0  3.5  6.0  8.5 11.0

seq(0, 11, by=2)

[1]  0  2  4  6  8 10

repeats rep()

rep(9, 5)

[1] 9 9 9 9 9

rep(1:4, 2)

[1] 1 2 3 4 1 2 3 4

rep(1:4, each=2) # each element is repeated twice

[1] 1 1 2 2 3 3 4 4

rep(1:4, times=2) # whole sequence is repeated twice

[1] 1 2 3 4 1 2 3 4

rep(1:4, each=2, times=3)

 [1] 1 1 2 2 3 3 4 4 1 1 2 2 3 3 4 4 1 1 2 2 3 3 4 4

rep(1:4, 1:4)

 [1] 1 2 2 3 3 3 4 4 4 4

rep(1:4, c(4, 1, 4, 2))

 [1] 1 1 1 1 2 3 3 3 3 4 4

Logical operators

c(1, 2, 3) == c(10, 20, 3)

[1] FALSE FALSE  TRUE

c(1, 2, 3) != c(10, 20, 3)

[1]  TRUE  TRUE FALSE

1:5 > 3

[1] FALSE FALSE FALSE  TRUE  TRUE

1:5 < 3

[1]  TRUE  TRUE FALSE FALSE FALSE

<= less than or equal to
>= greater than or equal to
| or
& and
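
The four operators listed above work element by element, like == and !=. A brief sketch with a made-up vector, scores:

```r
scores <- c(45, 60, 75, 90)

scores <= 60                  # TRUE  TRUE FALSE FALSE
scores >= 75                  # FALSE FALSE  TRUE  TRUE

# | is TRUE when at least one condition holds
scores < 50 | scores > 80     # TRUE FALSE FALSE  TRUE

# & is TRUE only when both conditions hold
scores > 50 & scores < 80     # FALSE  TRUE  TRUE FALSE
```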

Operators: `%in%` - in the set

a <- c(1, 2, 3)
b <- c(1, 10, 3)
a %in% b

[1]  TRUE FALSE  TRUE

x <- 1:10
y <- 1:3
x

 [1]  1  2  3  4  5  6  7  8  9 10

y

[1] 1 2 3

x %in% y

 [1]  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

y %in% x

[1] TRUE TRUE TRUE

Vector arithmetic

Operations are performed element by element.

c(10, 100, 100) + 2 # two is added to every element in the vector

[1]  12 102 102

Operations between two vectors

v1 <- c(1, 2, 3); v2 <- c(10, 100, 1000)
v1 + v2

[1]   11  102 1003

Add two vectors of unequal length

longvec <- seq(10, 100, length.out=10); shortvec <- c(1, 2, 3, 4, 5)
shortvec + longvec

 [1]  11  22  33  44  55  61  72  83  94 105
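
What happens here is called recycling: the shorter vector is repeated until it matches the length of the longer one. The sketch below spells this out with made-up vectors, long_vec and short_vec.

```r
long_vec <- c(1, 2, 3, 4, 5, 6)
short_vec <- c(10, 20)

# short_vec is reused as 10, 20, 10, 20, 10, 20
recycled <- long_vec + short_vec
recycled                                   # 11 22 13 24 15 26

# The same result with the repetition written out
explicit <- long_vec + rep(short_vec, times = 3)
identical(recycled, explicit)              # TRUE

# If the longer length is not a multiple of the shorter one,
# R still recycles but issues a warning
partial <- suppressWarnings(c(1, 2, 3) + c(10, 20))
partial                                    # 11 22 13
```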

What will be the output of the following code?

first <- c(1, 2, 3, 4); second <- c(10, 100)
first * second

Missing values

Use NA to place a missing value in a vector. NaN (not a number) marks an undefined numeric result such as 0/0; is.na() returns TRUE for both.

z <- c(10, 101, 2, 3, NA)
is.na(z)

[1] FALSE FALSE FALSE FALSE  TRUE
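
Missing values propagate through most calculations. The following sketch shows the effect and the na.rm argument that sum() and mean() provide for skipping missing values; the name undefined is made up for illustration.

```r
z <- c(10, 101, 2, 3, NA)

sum(z)                  # NA: the result is unknown because one value is missing
sum(z, na.rm = TRUE)    # 116
mean(z, na.rm = TRUE)   # 29

undefined <- 0 / 0      # NaN
is.na(undefined)        # TRUE
is.nan(NA)              # FALSE: NA is missing, not "not a number"
```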

Your turn

 [1] 1 2 3 4 5 5 4 3 2 1

What R code produces this output?

Vectors: Subsetting

myvec <- 1:20; myvec

 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20

myvec[1]

[1] 1

myvec[5:10]

[1]  5  6  7  8  9 10

myvec[-1]

 [1]  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20

myvec[myvec > 3]

 [1]  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20

Vectors: Subsetting

Extract elements present in vector a

a

[1] 1 2 3

myvec

 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20

myvec %in% a

 [1]  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

myvec[myvec %in% a]

[1] 1 2 3

b <- 100:105
myvec[myvec %in% b]

integer(0)
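
A few more subsetting patterns follow from the same rules. This is a short sketch using base R only, with myvec defined as before.

```r
myvec <- 1:20

myvec[c(1, 5, 20)]               # several positions at once: 1 5 20
myvec[-(1:15)]                   # drop the first 15 elements: 16 17 18 19 20
myvec[myvec > 3 & myvec <= 6]    # combine two conditions: 4 5 6

# which() returns the positions where a condition is TRUE
which(myvec > 17)                # 18 19 20
```

Because myvec holds 1 to 20, the positions returned by which() happen to equal the values; for other vectors they would differ.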

Your turn

Generate a sequence using the code seq(from=1, to=10, by=1).
What other ways can you generate the same sequence?
Using the function rep(), create the following sequence: 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4
Extract the 5th element.
Extract elements greater than 2.

2. Data Frames

Data frames

Rectangular arrangement of data with rows corresponding to observational units and columns corresponding to variables.
More general than a matrix in that different columns can contain different modes of data.
It’s similar to the datasets you’d typically see in SPSS and MINITAB.
Data frames are the most common data structure you’ll deal with in R.

Figure 1: Components of a dataframe.

Image Credit: Hadley Wickham and Garrett Grolemund

Create a data frame

Syntax

name_of_the_dataframe <- data.frame(
                          var1_name=vector of values of the first variable,
                          var2_name=vector of values of the second variable)

Example

corona <- data.frame(ID=c("C001", "C002", "C003", "C004"),
                     Location=c("Beijing", "Wuhan", "Shanghai", "Beijing"),
                     Test_Results=c(FALSE, TRUE, FALSE, FALSE))
corona

    ID Location Test_Results
1 C001  Beijing        FALSE
2 C002    Wuhan         TRUE
3 C003 Shanghai        FALSE
4 C004  Beijing        FALSE

To check if it is a dataframe

is.data.frame(corona)

[1] TRUE

Some useful functions with dataframes

colnames(corona)

[1] "ID"           "Location"     "Test_Results"

length(corona)

[1] 3

dim(corona)

[1] 4 3

nrow(corona)

[1] 4

ncol(corona)

[1] 3

Some useful functions with dataframes (cont.)

summary(corona)

      ID              Location         Test_Results   
 Length:4           Length:4           Mode :logical  
 Class :character   Class :character   FALSE:3        
 Mode  :character   Mode  :character   TRUE :1

str(corona)

'data.frame':    4 obs. of  3 variables:
 $ ID          : chr  "C001" "C002" "C003" "C004"
 $ Location    : chr  "Beijing" "Wuhan" "Shanghai" "Beijing"
 $ Test_Results: logi  FALSE TRUE FALSE FALSE

Subsetting dataframes

corona$Location

[1] "Beijing"  "Wuhan"    "Shanghai" "Beijing"

corona[,2]

[1] "Beijing"  "Wuhan"    "Shanghai" "Beijing"

corona[, "Location"]

[1] "Beijing"  "Wuhan"    "Shanghai" "Beijing"

corona[2, ]

    ID Location Test_Results
2 C002    Wuhan         TRUE
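
Rows can also be selected by a condition, in the same way as vector elements. The sketch below rebuilds the corona data frame so it runs on its own; the names positive and beijing are made up, and the outputs in the comments assume R 4.0 or later, where data.frame() keeps strings as character columns.

```r
corona <- data.frame(ID = c("C001", "C002", "C003", "C004"),
                     Location = c("Beijing", "Wuhan", "Shanghai", "Beijing"),
                     Test_Results = c(FALSE, TRUE, FALSE, FALSE))

# Rows where Test_Results is TRUE, all columns
positive <- corona[corona$Test_Results, ]
positive$ID                      # "C002"

# Rows for one location, keeping two columns
beijing <- corona[corona$Location == "Beijing", c("ID", "Test_Results")]
beijing$ID                       # "C001" "C004"

# First two rows of a single column
corona[1:2, "ID"]                # "C001" "C002"
```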

Installing R Packages

Method 1

Method 2

install.packages("ggplot2")

install.packages vs library

Image Credit: Professor Di Cook Monash University, AUS
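
In short, install.packages() downloads a package once, while library() attaches it and has to be called again in every new session. One way to combine the two is sketched below; load_package() is a hypothetical helper written for this example, not a function that comes with R.

```r
# Hypothetical helper, not part of base R: install a package only when it is
# missing, then attach it for the current session
load_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)  # one-time download from CRAN; needs internet access
  }
  library(pkg, character.only = TRUE)  # must be repeated in every new session
}

# "splines" ships with R, so this call attaches it without downloading anything
load_package("splines")
"package:splines" %in% search()   # TRUE once the package is attached
```

The same call with "ggplot2" would install the package on first use and only attach it afterwards.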

Built-in dataframes

library(gapminder)
data(gapminder)
head(gapminder)

# A tibble: 6 x 6
  country     continent  year lifeExp      pop gdpPercap
  <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
1 Afghanistan Asia       1952    28.8  8425333      779.
2 Afghanistan Asia       1957    30.3  9240934      821.
3 Afghanistan Asia       1962    32.0 10267083      853.
4 Afghanistan Asia       1967    34.0 11537966      836.
5 Afghanistan Asia       1972    36.1 13079460      740.
6 Afghanistan Asia       1977    38.4 14880372      786.

str(gapminder)

tibble [1,704 × 6] (S3: tbl_df/tbl/data.frame)
 $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ year     : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
 $ lifeExp  : num [1:1704] 28.8 30.3 32 34 36.1 ...
 $ pop      : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
 $ gdpPercap: num [1:1704] 779 821 853 836 740 ...

Your turn

Use the R dataset “gapminder” to answer the following questions:

How many rows and columns does gapminder have?
Extract column names in gapminder.
Display the first 10 rows of the data.
Display the last 3 rows of the data.
Select rows from 10 to 20, containing all variables.
Select rows from 10 to 20 containing country and lifeExp.
Create a single vector (a new object) called ‘LifeExpectancy’ containing the values in lifeExp in gapminder.
How many countries in Asia had lifeExp larger than 30 in 1952?

What have we learned today?

Dr. Thiyanga S. Talagala
