What does a statistician do?

# A tibble: 15 × 6
   country     continent  year lifeExp      pop gdpPercap
   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
 1 Afghanistan Asia       1952    28.8  8425333      779.
 2 Afghanistan Asia       1957    30.3  9240934      821.
 3 Afghanistan Asia       1962    32.0 10267083      853.
 4 Afghanistan Asia       1967    34.0 11537966      836.
 5 Afghanistan Asia       1972    36.1 13079460      740.
 6 Afghanistan Asia       1977    38.4 14880372      786.
 7 Afghanistan Asia       1982    39.9 12881816      978.
 8 Afghanistan Asia       1987    40.8 13867957      852.
 9 Afghanistan Asia       1992    41.7 16317921      649.
10 Afghanistan Asia       1997    41.8 22227415      635.
11 Afghanistan Asia       2002    42.1 25268405      727.
12 Afghanistan Asia       2007    43.8 31889923      975.
13 Albania     Europe     1952    55.2  1282697     1601.
14 Albania     Europe     1957    59.3  1476505     1942.
15 Albania     Europe     1962    64.8  1728137     2313.

What is Regression Analysis?

A statistical technique for investigating and modelling the relationship between variables.

Statistical Modelling

A simplified, mathematically formalized way to approximate reality (i.e. what generates your data) and, optionally, to make predictions from this approximation.

Statistical Modelling: The Bigger Picture

Statistical Modelling Workflow

Image Credit: Hadley Wickham

Software: R and RStudio (IDE) [Visit: https://hellor.netlify.app/]

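Consider trying to answer the following kind of question:

Can we use parents' heights to predict their children's heights? The data below record mothers' heights (mheight) and their daughters' heights (dheight), in inches.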

  mheight dheight
1    59.7    55.1
2    58.2    56.5
3    60.6    56.0
4    60.7    56.8
5    61.8    56.0
6    55.5    57.9

What is the predicted height of a daughter whose mother is 66 inches tall?
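
One way to approach this question in R is a simple linear regression of dheight on mheight. The sketch below is only an illustration: it assumes a straight-line relationship and uses just the six rows shown above, so the fitted line and the resulting prediction are not meaningful estimates. The object names `heights`, `fit`, and `pred` are chosen here for illustration.

```r
# Only the six rows displayed above; the full dataset is not reproduced here.
heights <- data.frame(
  mheight = c(59.7, 58.2, 60.6, 60.7, 61.8, 55.5),
  dheight = c(55.1, 56.5, 56.0, 56.8, 56.0, 57.9)
)

# Fit dheight = b0 + b1 * mheight by least squares.
fit <- lm(dheight ~ mheight, data = heights)

# Predicted daughter's height for a mother who is 66 inches tall.
# Note: 66 lies outside the range of these six mheight values (extrapolation).
pred <- predict(fit, newdata = data.frame(mheight = 66))
print(coef(fit))
print(pred)
```

With the full dataset, the same two calls (`lm()` followed by `predict()`) would give a usable answer.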

Regression Analysis involves curve fitting.
Curve fitting: The process of finding a relation or equation of best fit.

Model

$Y = f(x_{1}, x_{2}, x_{3}) + \epsilon$

Goal: Estimate $f$.

How do we estimate $f$?

Non-parametric methods:

estimate $f$ using observed data without making explicit assumptions about the functional form of $f$.

Parametric methods:

estimate $f$ using observed data by making assumptions about the functional form of $f$.

Ex: $Y = \beta_{0} + \beta_{1} x_{1} + \beta_{2} x_{2} + \beta_{3} x_{3} + \epsilon$
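
To make the distinction concrete, here is a minimal R sketch with a single predictor. It assumes simulated data in which the true $f$ is linear (the values 2 and 0.5 are arbitrary choices for illustration), fits a parametric model with `lm()`, and fits a non-parametric local regression with `loess()`, which is just one of many possible non-parametric methods.

```r
set.seed(1)

# Simulated data: the true f is linear here (an assumption of this sketch).
n <- 200
x <- runif(n, 0, 10)
y <- 2 + 0.5 * x + rnorm(n, sd = 1)
d <- data.frame(x = x, y = y)

# Parametric: assume f(x) = b0 + b1 * x and estimate b0 and b1.
fit_param <- lm(y ~ x, data = d)

# Non-parametric: local regression, no global functional form assumed.
fit_nonparam <- loess(y ~ x, data = d)

# Compare the two estimates of f at a few interior points.
new_x <- data.frame(x = c(2, 5, 8))
pred_param <- predict(fit_param, newdata = new_x)
pred_nonparam <- predict(fit_nonparam, newdata = new_x)
print(cbind(new_x, pred_param, pred_nonparam))
```

When the assumed form is close to the truth, as in this simulation, the two sets of predictions should be similar; the parametric fit additionally summarizes $f$ in two interpretable numbers.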

Do not underestimate the power of simple models.

Create something new that is more efficient than the existing method.

Machine Learning Algorithms

Random Forest
XGBoost
Neural networks, etc.

Pearson's Correlation Coefficient

Measures the strength of the linear relationship between two quantitative variables.
Does not completely characterize their relationship.
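
A small R illustration of the second point, using made-up values: below, y is completely determined by x, yet the relationship is not linear, and the correlation coefficient is zero.

```r
# y depends on x exactly, but through a symmetric, non-linear relationship.
x <- -5:5
y <- x^2

# Pearson's correlation only captures the linear part of the relationship.
r <- cor(x, y)
print(r)
```

A value of r near zero therefore does not mean the variables are unrelated, which is why plotting the data alongside the coefficient is worthwhile.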

$\operatorname{cor}(x, y) = \frac{\sum_{i = 1}^{N} (x_{i} - \mu_{x}) (y_{i} - \mu_{y})}{N \sigma_{x} \sigma_{y}}$
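
The formula can be checked directly in R. The sketch below defines a helper, `pearson_cor()` (a name chosen here for illustration), that follows the formula term by term, taking the standard deviations to be population versions (dividing by N), and compares the result with R's built-in `cor()`. The data are the six mother and daughter heights shown earlier.

```r
# Direct implementation of the formula, with population standard deviations.
pearson_cor <- function(x, y) {
  n <- length(x)
  mu_x <- mean(x)
  mu_y <- mean(y)
  sigma_x <- sqrt(mean((x - mu_x)^2))  # divides by N, not N - 1
  sigma_y <- sqrt(mean((y - mu_y)^2))
  sum((x - mu_x) * (y - mu_y)) / (n * sigma_x * sigma_y)
}

mheight <- c(59.7, 58.2, 60.6, 60.7, 61.8, 55.5)
dheight <- c(55.1, 56.5, 56.0, 56.8, 56.0, 57.9)

r_manual <- pearson_cor(mheight, dheight)
r_builtin <- cor(mheight, dheight)  # Pearson is the default method
print(c(manual = r_manual, builtin = r_builtin))
```

The two values agree because the choice between N and N - 1 cancels between the numerator and the denominator.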