STA 506 2.0 Linear Regression Analysis
Lecture 10: Confidence Intervals in Multiple Regression
Dr Thiyanga S. Talagala
2020-10-31

Recap

1. Exploratory data analysis (scatterplot, correlation)
2. Fit a regression model
3. Check the validity of the assumptions (residual analysis)
4. Check $R_{adj}^{2}$
5. Hypothesis testing: ANOVA
   - Test the significance of regression
6. Hypothesis testing: t-test
   - Tests on individual regression coefficients
7. Interpret point estimates of coefficients
8. Compute confidence intervals on the regression coefficients and mean response and interpret the results
9. Prediction of new observations

Confidence Intervals in Multiple Regression

Confidence intervals on the regression coefficients
Confidence interval estimation of the mean response at a given point

Confidence Intervals on the Regression Coefficients

To construct confidence intervals for the regression coefficients $β_{j}$, $j = 0, 1, \ldots, p$, we continue to assume that the errors $ϵ_{i}$ are normally and independently distributed with mean zero and variance $σ^{2}$.

Hence, before constructing the confidence intervals you need to check the validity of the assumptions.
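
As a rough illustration of such a check, the sketch below fits the same kind of model to simulated data and looks at the residuals. The data frame sim, its coefficients, and the noise level are made-up stand-ins, not the course data set, and the choice of diagnostics (residual plot, normal Q-Q plot, Shapiro-Wilk test) is only one reasonable option.

```r
# Simulated stand-in for the heart data (hypothetical values, not the course data set)
set.seed(506)
n <- 200
sim <- data.frame(biking = runif(n, 1, 75), smoking = runif(n, 0.5, 30))
sim$heart.disease <- 15 - 0.2 * sim$biking + 0.18 * sim$smoking + rnorm(n, sd = 0.65)
fit <- lm(heart.disease ~ biking + smoking, data = sim)

res <- residuals(fit)

# Residuals vs fitted values: look for no pattern and roughly constant spread
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Normal Q-Q plot: points close to the line support the normality assumption
qqnorm(res)
qqline(res)

# Shapiro-Wilk test of normality as a numerical complement to the plots
sw <- shapiro.test(res)
sw$p.value
```

A large p-value here does not prove normality; it only means the test found no evidence against it.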

Data set

library(tidyverse)
heart.data <- read_csv("heart.data.csv")
heart.data

# A tibble: 498 x 4
      X1 biking smoking heart.disease
   <dbl>  <dbl>   <dbl>         <dbl>
 1     1  30.8    10.9          11.8 
 2     2  65.1     2.22          2.85
 3     3   1.96   17.6          17.2 
 4     4  44.8     2.80          6.82
 5     5  69.4    16.0           4.06
 6     6  54.4    29.3           9.55
 7     7  49.1     9.06          7.62
 8     8   4.78   12.8          15.9 
 9     9  65.7    12.0           3.07
10    10  35.3    23.3          12.1 
# … with 488 more rows

Confidence Intervals on the Regression Coefficients (cont.)

regHeart <- lm(heart.disease ~ biking + smoking, data = heart.data)
regHeart


Call:
lm(formula = heart.disease ~ biking + smoking, data = heart.data)
Coefficients:
(Intercept)       biking      smoking  
    14.9847      -0.2001       0.1783

Validity of the assumptions: all satisfied, as discussed in Week 8.

Compute 95% confidence intervals for regression coefficients

confint(regHeart, level=0.95)

                 2.5 %     97.5 %
(Intercept) 14.8272075 15.1421084
biking      -0.2028166 -0.1974495
smoking      0.1713800  0.1852878

Compute 90% confidence intervals for regression coefficients

confint(regHeart, level=0.90)

                   5 %       95 %
(Intercept) 14.8525973 15.1167186
biking      -0.2023839 -0.1978822
smoking      0.1725014  0.1841665
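
To see what confint() computes, the sketch below rebuilds the 95% intervals by hand as estimate ± t critical value × standard error and compares them with the built-in result. It uses simulated data (the data frame sim and its parameters are hypothetical stand-ins for heart.data), so the numbers will not match the output above.

```r
# Simulated stand-in for the heart data (hypothetical values, not the course data set)
set.seed(506)
n <- 200
sim <- data.frame(biking = runif(n, 1, 75), smoking = runif(n, 0.5, 30))
sim$heart.disease <- 15 - 0.2 * sim$biking + 0.18 * sim$smoking + rnorm(n, sd = 0.65)
fit <- lm(heart.disease ~ biking + smoking, data = sim)

est   <- coef(fit)                 # point estimates
se    <- sqrt(diag(vcov(fit)))     # standard errors of the estimates
df    <- df.residual(fit)          # residual degrees of freedom: 200 - 2 - 1
tcrit <- qt(1 - 0.05 / 2, df)      # upper 2.5% point of the t distribution

manual95 <- cbind(lower = est - tcrit * se, upper = est + tcrit * se)
ci95     <- confint(fit, level = 0.95)
ci90     <- confint(fit, level = 0.90)

# Interval widths: a lower confidence level gives a narrower interval
width95 <- ci95[, 2] - ci95[, 1]
width90 <- ci90[, 2] - ci90[, 1]

manual95
ci95
```

The same pattern appears in the heart data output: each 90% interval sits inside the corresponding 95% interval.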

Interpretation of confidence intervals for regression coefficients

confint(regHeart, level=0.95)

                 2.5 %     97.5 %
(Intercept) 14.8272075 15.1421084
biking      -0.2028166 -0.1974495
smoking      0.1713800  0.1852878

Intercept: 95% confidence interval [14.83, 15.14]

This means that when $X_{1}$ (biking) and $X_{2}$ (smoking) are both zero, we are 95% confident that the mean percentage of people with heart disease is between 14.83% and 15.14%.

$β_{1}$ : 95% confidence interval [-0.203, -0.197]

This means that, holding $X_{2}$ (smoking) fixed, we are 95% confident that a one percentage point increase in biking is associated with a decrease in the mean percentage of people with heart disease of at least 0.197 and at most 0.203 percentage points.

$β_{2}$ : 95% confidence interval [0.171, 0.185]

This means that, holding $X_{1}$ (biking) fixed, we are 95% confident that a one percentage point increase in smoking is associated with an increase in the mean percentage of people with heart disease of at least 0.171 and at most 0.185 percentage points.

Confidence Interval Estimation of the Mean Response at a Given Point

$Y = β_{0} + β_{1} X_{1} + β_{2} X_{2} + ϵ$

where,

$Y$ - percentage of people with heart disease,

$X_{1}$ - percentage of people in each town who bike to work,

$X_{2}$ - percentage of people in each town who smoke

Fitted regression model

regHeart


Call:
lm(formula = heart.disease ~ biking + smoking, data = heart.data)
Coefficients:
(Intercept)       biking      smoking  
    14.9847      -0.2001       0.1783

$\hat{Y} = 14.9847 - 0.2001 X_{1} + 0.1783 X_{2}$ , where $\hat{Y}$ denotes the fitted value.

Confidence Interval Estimation of the Mean Response at a Given Point

Suppose we have an observation with $X_{1} = 30.8$ and $X_{2} = 10.9$, and we would like to find a 95% confidence interval on the mean percentage of people with heart disease.

The fitted value at this point is:

$\hat{Y} = 14.9847 - 0.2001 X_{1} + 0.1783 X_{2}$

$\hat{Y} = 14.9847 - 0.2001 (30.8) + 0.1783 (10.9) \approx 10.765$

(The rounded coefficients give 10.765; predict() uses the unrounded estimates and reports 10.7644.)

A 95% confidence interval on the mean percentage of people with heart disease at this point is:

predict(regHeart, list(biking = 30.8, smoking = 10.9),
interval='confidence', level=0.95)

      fit      lwr      upr
1 10.7644 10.69625 10.83255

Interpretation:

We can be 95% confident that the mean percentage of people with heart disease across all towns with $X_{1}$ (biking) $= 30.8$ and $X_{2}$ (smoking) $= 10.9$ is between 10.70 and 10.83 percent.

Prediction of New Observation

Suppose we want to construct a 95% prediction interval for the percentage of people with heart disease at $X_{1} = 60$ and $X_{2} = 20$ .

predict(regHeart, list(biking = 60, smoking = 20),
interval='predict', level=0.95)

       fit      lwr      upr
1 6.543353 5.255293 7.831413

We can be 95% confident that the percentage of people with heart disease in a town with $X_{1} = 60$ and $X_{2} = 20$ will be between 5.26 and 7.83 percent.

Confidence Interval Estimation of the Mean Response vs. Prediction Interval

Purpose: Illustrate the difference between confidence intervals for mean and prediction intervals.

Prediction of the mean response

What would be the average (mean) response for cases with characteristics $X_{1} = 30.8$ and $X_{2} = 10.9$ ?

predict(regHeart, list(biking = 30.8, smoking = 10.9),
interval='confidence', level=0.95)

      fit      lwr      upr
1 10.7644 10.69625 10.83255

We estimate the mean value of $Y$ over all cases with characteristics $X_{1} = 30.8$ and $X_{2} = 10.9$ .

Prediction of a future value

What is the predicted value of $Y$ for a new case with characteristics $X_{1} = 60$ and $X_{2} = 20$ ?

predict(regHeart, list(biking = 60, smoking = 20),
interval='predict', level=0.95)

       fit      lwr      upr
1 6.543353 5.255293 7.831413

We predict $Y$ for a specific new case that comes from the population with characteristics $X_{1} = 60$ and $X_{2} = 20$ .

Prediction interval for a new response.
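
To compare the two kinds of interval directly, the sketch below computes both at the same point. It uses simulated data (the data frame sim and the point new_pt are hypothetical stand-ins, not the course data set). Both intervals are centred on the same fitted value, but the prediction interval is wider because it also accounts for the error of a single new observation.

```r
# Simulated stand-in for the heart data (hypothetical values, not the course data set)
set.seed(506)
n <- 200
sim <- data.frame(biking = runif(n, 1, 75), smoking = runif(n, 0.5, 30))
sim$heart.disease <- 15 - 0.2 * sim$biking + 0.18 * sim$smoking + rnorm(n, sd = 0.65)
fit <- lm(heart.disease ~ biking + smoking, data = sim)

new_pt <- data.frame(biking = 60, smoking = 20)

# Interval for the MEAN response at new_pt
cint <- predict(fit, newdata = new_pt, interval = "confidence", level = 0.95)
# Interval for a single NEW observation at new_pt
pint <- predict(fit, newdata = new_pt, interval = "prediction", level = 0.95)

cint_width <- cint[1, "upr"] - cint[1, "lwr"]
pint_width <- pint[1, "upr"] - pint[1, "lwr"]

rbind(confidence = cint[1, ], prediction = pint[1, ])
c(confidence = cint_width, prediction = pint_width)
```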

Interpretations

Prediction of the mean response

      fit      lwr      upr
1 10.7644 10.69625 10.83255

We can be 95% confident that the mean percentage of people with heart disease across all towns with $X_{1}$ (biking) $= 30.8$ and $X_{2}$ (smoking) $= 10.9$ is between 10.70 and 10.83 percent.

Prediction of a future value

       fit      lwr      upr
1 6.543353 5.255293 7.831413

We can be 95% confident that the percentage of people with heart disease in a town with $X_{1} = 60$ and $X_{2} = 20$ will be between 5.26 and 7.83 percent.

Prediction of a Set of New Observations

newheartdata <- data.frame(biking = c(30, 40, 40, 60),
                           smoking = c(20, 30, 12, 10))
newheartdata

  biking smoking
1     30      20
2     40      30
3     40      12
4     60      10

predict(regHeart, newdata = newheartdata, interval = "prediction")

        fit       lwr       upr
1 12.547345 11.260464 13.834225
2 12.329353 11.039054 13.619652
3  9.119343  7.832795 10.405891
4  4.760014  3.471742  6.048286
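
When predicting for several new cases, it can help to keep the inputs and the intervals side by side. The sketch below is one way to do that with cbind(), using simulated data (the data frame sim and the object names new_towns and result are hypothetical, not part of the lecture code).

```r
# Simulated stand-in for the heart data (hypothetical values, not the course data set)
set.seed(506)
n <- 200
sim <- data.frame(biking = runif(n, 1, 75), smoking = runif(n, 0.5, 30))
sim$heart.disease <- 15 - 0.2 * sim$biking + 0.18 * sim$smoking + rnorm(n, sd = 0.65)
fit <- lm(heart.disease ~ biking + smoking, data = sim)

new_towns <- data.frame(biking  = c(30, 40, 40, 60),
                        smoking = c(20, 30, 12, 10))

# One row of fit / lwr / upr per row of new_towns
pred <- predict(fit, newdata = new_towns, interval = "prediction", level = 0.95)

# Attach the predictions to the inputs they belong to
result <- cbind(new_towns, pred)
result
```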

Mathematical Formula: Least-squares estimator

$Y = β_{0} + β_{1} X_{1} + β_{2} X_{2} + \ldots + β_{p} X_{p} + ϵ$

$Y = \begin{bmatrix} Y_{1} \\ Y_{2} \\ \vdots \\ Y_{n} \end{bmatrix}$

$X = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix}$

$\hat{β} = (X^{'} X)^{-1} X^{'} Y$
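
The estimator can be checked numerically. The sketch below builds the design matrix by hand for simulated data (the data frame sim is a hypothetical stand-in for heart.data), solves the normal equations, and compares the result with lm(). Solving the linear system with solve(A, b) instead of forming the inverse explicitly is a numerical convenience, not something the formula requires.

```r
# Simulated stand-in for the heart data (hypothetical values, not the course data set)
set.seed(506)
n <- 200
sim <- data.frame(biking = runif(n, 1, 75), smoking = runif(n, 0.5, 30))
sim$heart.disease <- 15 - 0.2 * sim$biking + 0.18 * sim$smoking + rnorm(n, sd = 0.65)
fit <- lm(heart.disease ~ biking + smoking, data = sim)

# Design matrix: a column of ones for the intercept, then one column per regressor
X <- cbind(1, sim$biking, sim$smoking)
y <- sim$heart.disease

# Solve (X'X) beta = X'y, which is equivalent to beta_hat = (X'X)^{-1} X'y
beta_hat <- solve(t(X) %*% X, t(X) %*% y)

cbind(matrix_formula = as.vector(beta_hat), lm = unname(coef(fit)))
```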

Mathematical Formula

Confidence intervals on the regression coefficients

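[ $\hat{β}_{j} - t_{α/2, n-p-1} \sqrt{\hat{σ}^{2} C_{jj}}$ , $\hat{β}_{j} + t_{α/2, n-p-1} \sqrt{\hat{σ}^{2} C_{jj}}$ ]

$C_{jj}$ is the diagonal element of $(X^{'} X)^{-1}$ corresponding to $β_{j}$, and $n - p - 1$ is the residual degrees of freedom (the model has $p$ regressors plus an intercept).

An unbiased estimator of $σ^{2}$ is given by

$\hat{σ}^{2}$ = MSE = $\frac{\sum_{i=1}^{n} (y_{i} - \hat{y}_{i})^{2}}{n - p - 1}$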
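
As a numerical check of these quantities, the sketch below computes MSE and the diagonal of $(X^{'} X)^{-1}$ directly and compares the resulting standard errors with those reported by summary(). It uses simulated data (the data frame sim is a hypothetical stand-in for heart.data), and k denotes the number of columns of the design matrix, that is, the number of regressors plus one for the intercept.

```r
# Simulated stand-in for the heart data (hypothetical values, not the course data set)
set.seed(506)
n <- 200
sim <- data.frame(biking = runif(n, 1, 75), smoking = runif(n, 0.5, 30))
sim$heart.disease <- 15 - 0.2 * sim$biking + 0.18 * sim$smoking + rnorm(n, sd = 0.65)
fit <- lm(heart.disease ~ biking + smoking, data = sim)

X <- model.matrix(fit)            # design matrix, including the intercept column
k <- ncol(X)                      # number of estimated coefficients (regressors + 1)

# MSE: residual sum of squares divided by the residual degrees of freedom
mse <- sum(residuals(fit)^2) / (nrow(X) - k)

# C = (X'X)^{-1}; its diagonal holds the C_jj values
C  <- solve(crossprod(X))
se <- sqrt(mse * diag(C))

cbind(manual = se, summary = summary(fit)$coefficients[, "Std. Error"])
```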

Mathematical Formula

Prediction of the mean response

Confidence interval for the mean response at $x_{01}, x_{02}, \ldots, x_{0p}$:

$E [Y | X_{1} = x_{01}, X_{2} = x_{02}, \ldots, X_{p} = x_{0p}] = μ_{Y | X_{1} = x_{01}, X_{2} = x_{02}, \ldots, X_{p} = x_{0p}}$

Fitted value at $x_{01}, x_{02}, \ldots, x_{0p}$:

$x_{0}^{'} = [1, x_{01}, x_{02}, \ldots, x_{0p}]$

$\hat{y}_{0} = x_{0}^{'} \hat{β}$

[ $\hat{y}_{0} - t_{α/2, n-p-1} \sqrt{\hat{σ}^{2} x_{0}^{'} (X^{'} X)^{-1} x_{0}}$ , $\hat{y}_{0} + t_{α/2, n-p-1} \sqrt{\hat{σ}^{2} x_{0}^{'} (X^{'} X)^{-1} x_{0}}$ ]
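
The interval above can be reproduced step by step. The sketch below evaluates the formula at the point (30.8, 10.9) for a model fitted to simulated data (the data frame sim is a hypothetical stand-in for heart.data, so the numbers differ from the lecture output) and compares it with predict().

```r
# Simulated stand-in for the heart data (hypothetical values, not the course data set)
set.seed(506)
n <- 200
sim <- data.frame(biking = runif(n, 1, 75), smoking = runif(n, 0.5, 30))
sim$heart.disease <- 15 - 0.2 * sim$biking + 0.18 * sim$smoking + rnorm(n, sd = 0.65)
fit <- lm(heart.disease ~ biking + smoking, data = sim)

x0 <- c(1, 30.8, 10.9)                 # leading 1 for the intercept
X  <- model.matrix(fit)
XtX_inv <- solve(crossprod(X))         # (X'X)^{-1}
mse <- sigma(fit)^2                    # estimate of sigma^2

y0_hat  <- sum(x0 * coef(fit))                         # x0' beta_hat
se_mean <- sqrt(mse * drop(t(x0) %*% XtX_inv %*% x0))  # standard error of the mean response
tcrit   <- qt(0.975, df.residual(fit))

manual <- c(fit = y0_hat,
            lwr = y0_hat - tcrit * se_mean,
            upr = y0_hat + tcrit * se_mean)

builtin <- predict(fit, newdata = data.frame(biking = 30.8, smoking = 10.9),
                   interval = "confidence", level = 0.95)

rbind(manual = manual, predict = builtin[1, ])
```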

Mathematical Formula (cont.)

Prediction of a future value

[ $\hat{y}_{0} - t_{α/2, n-p-1} \sqrt{\hat{σ}^{2} (1 + x_{0}^{'} (X^{'} X)^{-1} x_{0})}$ , $\hat{y}_{0} + t_{α/2, n-p-1} \sqrt{\hat{σ}^{2} (1 + x_{0}^{'} (X^{'} X)^{-1} x_{0})}$ ]
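
The extra "1 +" inside the square root, compared with the interval for the mean response, comes from the error of the new observation itself. A short derivation, assuming the new error $ϵ_{0}$ has variance $σ^{2}$ and is independent of the errors in the sample used to compute $\hat{β}$:

```latex
y_{0} = x_{0}^{'} β + ϵ_{0}, \qquad \hat{y}_{0} = x_{0}^{'} \hat{β}

E[y_{0} - \hat{y}_{0}] = 0

Var(y_{0} - \hat{y}_{0}) = Var(ϵ_{0}) + Var(x_{0}^{'} \hat{β})
                         = σ^{2} + σ^{2} x_{0}^{'} (X^{'} X)^{-1} x_{0}
                         = σ^{2} \left( 1 + x_{0}^{'} (X^{'} X)^{-1} x_{0} \right)
```

Replacing $σ^{2}$ with $\hat{σ}^{2}$ gives the standard error used in the prediction interval, which is why a prediction interval is always wider than the confidence interval for the mean response at the same point.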

Acknowledgement

Introduction to Linear Regression Analysis, Douglas C. Montgomery, Elizabeth A. Peck, G. Geoffrey Vining

Dr. Thiyanga S. Talagala
