class: center, middle, inverse, title-slide

# STA 506 2.0 Linear Regression Analysis
## Lecture 10: Confidence Intervals in Multiple Regression
### Dr Thiyanga S. Talagala
### 2020-10-31

---
## Recap

1. Exploratory data analysis (scatterplot, correlation)

2. Fit a regression model

3. Check the validity of the assumptions (residual analysis)

4. Check `\(R^2_{adj}\)`

5. Hypothesis testing: ANOVA - Test the significance of regression

6. Hypothesis testing: t-test - Tests on individual regression coefficients

7. Interpret point estimates of coefficients

8. Compute confidence intervals on the regression coefficients and mean response and interpret the results

9. Prediction of new observations

---
## Recap

1. Exploratory data analysis

2. Fit a regression model

3. Check the validity of the assumptions (residual analysis)

4. Check `\(R^2_{adj}\)`

5. Hypothesis testing: ANOVA - Test the significance of regression

6. Hypothesis testing: t-test - Tests on individual regression coefficients

7. Interpret point estimates of coefficients

**8. Compute confidence intervals on the regression coefficients and mean response and interpret the results**

**9. Prediction of new observations**

---
## Confidence Intervals in Multiple Regression

1. Confidence intervals on the regression coefficients

2. Confidence interval estimation of the mean response at a given point

---
## Confidence Intervals on the Regression Coefficients

To construct confidence intervals for the regression coefficients (`\(\beta_j\)`, `\(j = 0, 1, \ldots, p\)`), we continue to assume that

- the errors `\(\epsilon_i\)` are normally and independently distributed with mean zero and variance `\(\sigma^2\)`.

Hence, before constructing the confidence intervals, you need to check the validity of these assumptions.
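---
### Checking the assumptions: a sketch

The check mentioned above could look roughly like the following. This is a minimal sketch on **simulated** data; the names `sim` and `fit`, the seed, and the true coefficients are assumptions made for illustration and are not the lecture's heart data.

```r
# Simulated data (an assumption for illustration, not the lecture's data set)
set.seed(506)
n  <- 200
x1 <- runif(n, 0, 75)
x2 <- runif(n, 0, 30)
y  <- 15 - 0.2 * x1 + 0.18 * x2 + rnorm(n, sd = 0.65)
sim <- data.frame(y, x1, x2)

fit <- lm(y ~ x1 + x2, data = sim)

# Residuals vs fitted values (linearity, constant variance)
# and normal Q-Q plot of the residuals (normality)
plot(fit, which = 1:2)

# A formal test of normality of the residuals
sw <- shapiro.test(residuals(fit))
sw$p.value
```

A small p-value from `shapiro.test()` would suggest that the normality assumption is doubtful; the plots should be read alongside it rather than replaced by it.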
---
## Confidence Intervals on the Regression Coefficients

Data set

```r
library(tidyverse)
heart.data <- read_csv("heart.data.csv")
heart.data
```

```
# A tibble: 498 x 4
      X1 biking smoking heart.disease
   <dbl>  <dbl>   <dbl>         <dbl>
 1     1  30.8    10.9          11.8 
 2     2  65.1     2.22          2.85
 3     3   1.96   17.6          17.2 
 4     4  44.8     2.80          6.82
 5     5  69.4    16.0           4.06
 6     6  54.4    29.3           9.55
 7     7  49.1     9.06          7.62
 8     8   4.78   12.8          15.9 
 9     9  65.7    12.0           3.07
10    10  35.3    23.3          12.1 
# … with 488 more rows
```

---
## Confidence Intervals on the Regression Coefficients (cont.)

```r
regHeart <- lm(heart.disease ~ biking + smoking, data = heart.data)
regHeart
```

```
Call:
lm(formula = heart.disease ~ biking + smoking, data = heart.data)

Coefficients:
(Intercept)       biking      smoking  
    14.9847      -0.2001       0.1783  
```

Validity of the assumptions: all satisfied, as discussed in Week 8.

### Compute 95% confidence intervals for regression coefficients

```r
confint(regHeart, level=0.95)
```

```
                 2.5 %     97.5 %
(Intercept) 14.8272075 15.1421084
biking      -0.2028166 -0.1974495
smoking      0.1713800  0.1852878
```

---
### Compute 95% confidence intervals for regression coefficients

```r
confint(regHeart, level=0.95)
```

```
                 2.5 %     97.5 %
(Intercept) 14.8272075 15.1421084
biking      -0.2028166 -0.1974495
smoking      0.1713800  0.1852878
```

### Compute 90% confidence intervals for regression coefficients

```r
confint(regHeart, level=0.90)
```

```
                   5 %       95 %
(Intercept) 14.8525973 15.1167186
biking      -0.2023839 -0.1978822
smoking      0.1725014  0.1841665
```

---
### Interpretation of confidence intervals for regression coefficients

```r
confint(regHeart, level=0.95)
```

```
                 2.5 %     97.5 %
(Intercept) 14.8272075 15.1421084
biking      -0.2028166 -0.1974495
smoking      0.1713800  0.1852878
```

---
### Interpretation of confidence intervals for regression coefficients

```
                 2.5 %     97.5 %
(Intercept) 14.8272075 15.1421084
biking      -0.2028166 -0.1974495
smoking      0.1713800  0.1852878
```

**Intercept: 95% Confidence Interval [14.83, 15.14]**

This means that if `\(X_1\)` (`biking`) and `\(X_2\)` (`smoking`) remain at
zero, we are 95% confident that the mean percentage of people with heart disease is between 14.83% and 15.14%.

`\(\beta_1\)`: **95% Confidence Interval [-0.203, -0.197]**

This means that if `\(X_2\)` (`smoking`) remains fixed, we are 95% confident that a one percentage point increase in `biking` is associated with a decrease in the mean percentage of people with heart disease of at least 0.197 and not more than 0.203 percentage points.

`\(\beta_2\)`: **95% Confidence Interval [0.171, 0.185]**

This means that if `\(X_1\)` (`biking`) remains fixed, we are 95% confident that a one percentage point increase in `smoking` is associated with an increase in the mean percentage of people with heart disease of at least 0.171 and not more than 0.185 percentage points.

---
### Confidence Interval Estimation of the Mean Response at a Given Point

`$$Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \epsilon$$`

where

`\(Y\)` - percentage of people with heart disease,

`\(X_1\)` - percentage of people in each town who bike to work,

`\(X_2\)` - percentage of people in each town who smoke.

### Fitted regression model

```r
regHeart
```

```
Call:
lm(formula = heart.disease ~ biking + smoking, data = heart.data)

Coefficients:
(Intercept)       biking      smoking  
    14.9847      -0.2001       0.1783  
```

`\(\hat{Y} = 14.9847 - 0.2001X_1 + 0.1783X_2\)`, where `\(\hat{Y}\)` denotes the fitted values.
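---
### Evaluating the fitted equation: a sketch

The fitted equation above is just the inner product of the coefficient vector with `\((1, X_1, X_2)\)`. The sketch below evaluates the rounded coefficients printed on the slide at one point, and then checks the same idea against `predict()` on a **simulated** data set. The data frame `sim`, the model `fit`, and the true coefficients are assumptions for illustration only.

```r
# Fitted value at biking = 30.8, smoking = 10.9 from the rounded slide coefficients
x0 <- c(1, 30.8, 10.9)                      # leading 1 multiplies the intercept
yhat_slide <- sum(x0 * c(14.9847, -0.2001, 0.1783))
yhat_slide

# Same calculation on simulated data (an assumption, not the lecture's heart data)
set.seed(506)
n <- 200
biking  <- runif(n, 0, 75)
smoking <- runif(n, 0, 30)
heart.disease <- 15 - 0.2 * biking + 0.18 * smoking + rnorm(n, sd = 0.65)
sim <- data.frame(biking, smoking, heart.disease)

fit <- lm(heart.disease ~ biking + smoking, data = sim)

yhat_manual  <- sum(x0 * coef(fit))         # x0' beta-hat
yhat_predict <- predict(fit, newdata = data.frame(biking = 30.8, smoking = 10.9))

# The two routes should agree up to floating-point error
all.equal(yhat_manual, unname(yhat_predict))
```

Working from `coef(fit)` rather than from printed coefficients avoids the small rounding differences that appear when coefficients are copied by hand.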
---
#### Confidence Interval Estimation of the Mean Response at a Given Point

Suppose we are interested in the point `\(X_1 = 30.8\)`, `\(X_2 = 10.9\)` and would like to find a 95% confidence interval on the mean percentage of people with heart disease at that point.

**The fitted value at this point is:**

`\(\hat{Y} = 14.9847 - 0.2001X_1 + 0.1783X_2\)`

`\(\hat{Y} = 14.9847 - 0.2001(30.8) + 0.1783(10.9) \approx 10.765\)`

(R reports 10.7644 below because it uses the unrounded coefficients.)

**A 95% confidence interval on the mean percentage of people with heart disease at this point is:**

```r
predict(regHeart, list(biking = 30.8, smoking = 10.9), interval='confidence', level=0.95)
```

```
      fit      lwr      upr
1 10.7644 10.69625 10.83255
```

**Interpretation:** We can be 95% confident that the **mean percentage of people with heart disease** over all towns with `\(X_1\)` (`biking`) `\(= 30.8\)` and `\(X_2\)` (`smoking`) `\(= 10.9\)` is between 10.70 and 10.83 percent.

---
## Prediction of New Observation

Suppose we want to construct a 95% prediction interval on the percentage of people with heart disease in a town with `\(X_1=60\)` and `\(X_2=20\)`.

```r
predict(regHeart, list(biking = 60, smoking = 20), interval='predict', level=0.95)
```

```
       fit      lwr      upr
1 6.543353 5.255293 7.831413
```

<!--https://stats.stackexchange.com/questions/16493/difference-between-confidence-intervals-and-prediction-intervals-->

We can be 95% confident that the **percentage of people with heart disease** in a town with `\(X_1=60\)` and `\(X_2=20\)` will be between 5.26 and 7.83 percent.

---
class: center, middle, duke-orange

# Confidence Interval Estimation of the
# Mean Response vs. Prediction Interval

Purpose: Illustrate the difference between confidence intervals for the mean response and prediction intervals.

---
### Prediction of the mean response

A 95% confidence interval on the **mean** percentage of people with heart disease at the point `\(X_1=30.8\)`, `\(X_2=10.9\)`.
```r
predict(regHeart, list(biking = 30.8, smoking = 10.9), interval='confidence', level=0.95)
```

```
      fit      lwr      upr
1 10.7644 10.69625 10.83255
```

### Prediction of a future value

Suppose we want to construct a 95% **prediction interval** on the percentage of people with heart disease in a town with `\(X_1=60\)` and `\(X_2=20\)`.

```r
predict(regHeart, list(biking = 60, smoking = 20), interval='predict', level=0.95)
```

```
       fit      lwr      upr
1 6.543353 5.255293 7.831413
```

---
.content-box-yellow[Prediction of the mean response]

What would be the average (mean) response for towns with characteristics `\(X_1=30.8\)` and `\(X_2=10.9\)`?

```r
predict(regHeart, list(biking = 30.8, smoking = 10.9), interval='confidence', level=0.95)
```

```
      fit      lwr      upr
1 10.7644 10.69625 10.83255
```

We estimate the mean value of `\(Y\)` at `\(X_1=30.8\)` and `\(X_2=10.9\)`. This is a confidence interval for the mean response.

.content-box-yellow[Prediction of a future value]

What is the predicted value of `\(Y\)` for a single town with characteristics `\(X_1=60\)` and `\(X_2=20\)`?

```r
predict(regHeart, list(biking = 60, smoking = 20), interval='predict', level=0.95)
```

```
       fit      lwr      upr
1 6.543353 5.255293 7.831413
```

We predict `\(Y\)` for a specific new case that comes from the population with characteristics `\(X_1=60\)` and `\(X_2=20\)`. This is a prediction interval for a new response.

---
### Interpretations

.full-width[.content-box-yellow[Prediction of the mean response]]

```
      fit      lwr      upr
1 10.7644 10.69625 10.83255
```

We can be 95% confident that the **mean percentage of people with heart disease** over all towns with `\(X_1\)` (`biking`) `\(= 30.8\)` and `\(X_2\)` (`smoking`) `\(= 10.9\)` is between 10.70 and 10.83 percent.

.full-width[.content-box-yellow[Prediction of a future value]]

```
       fit      lwr      upr
1 6.543353 5.255293 7.831413
```

We can be 95% confident that the **percentage of people with heart disease** in a town with `\(X_1=60\)` and `\(X_2=20\)` will be between 5.26 and 7.83 percent.
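---
### Confidence vs. prediction interval at the same point: a sketch

The two intervals above were computed at different points, which hides how they differ. The sketch below computes both at the same point on **simulated** data and reproduces them by hand from the standard errors `\(\sqrt{\hat{\sigma}^2 h_0}\)` and `\(\sqrt{\hat{\sigma}^2 (1 + h_0)}\)`, where `\(h_0 = x_0'(X'X)^{-1}x_0\)`. The data frame `sim`, the model `fit`, and all other object names are assumptions for illustration; they are not the lecture's heart data.

```r
# Simulated data (an assumption for illustration)
set.seed(506)
n <- 200
biking  <- runif(n, 0, 75)
smoking <- runif(n, 0, 30)
heart.disease <- 15 - 0.2 * biking + 0.18 * smoking + rnorm(n, sd = 0.65)
sim <- data.frame(biking, smoking, heart.disease)
fit <- lm(heart.disease ~ biking + smoking, data = sim)

new_town <- data.frame(biking = 60, smoking = 20)
conf_int <- predict(fit, new_town, interval = "confidence", level = 0.95)
pred_int <- predict(fit, new_town, interval = "prediction", level = 0.95)

# Same centre, different widths
conf_width <- conf_int[1, "upr"] - conf_int[1, "lwr"]
pred_width <- pred_int[1, "upr"] - pred_int[1, "lwr"]
c(confidence = unname(conf_width), prediction = unname(pred_width))

# By hand: t quantile on the residual degrees of freedom, MSE, and h0
X     <- model.matrix(fit)
x0    <- c(1, 60, 20)
s2    <- sigma(fit)^2                              # MSE
h0    <- drop(t(x0) %*% solve(crossprod(X)) %*% x0)
tcrit <- qt(0.975, df = df.residual(fit))
yhat0 <- sum(x0 * coef(fit))

manual_conf <- yhat0 + c(-1, 1) * tcrit * sqrt(s2 * h0)
manual_pred <- yhat0 + c(-1, 1) * tcrit * sqrt(s2 * (1 + h0))
```

The prediction interval is always the wider of the two, because it adds the variance of a single new error term to the uncertainty in the estimated mean.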
---
## Prediction of a Set of New Observations

```r
newheartdata <- data.frame(biking = c(30, 40, 40, 60), smoking = c(20, 30, 12, 10))
newheartdata
```

```
  biking smoking
1     30      20
2     40      30
3     40      12
4     60      10
```

```r
predict(regHeart, newdata=newheartdata, interval="predict")
```

```
        fit       lwr       upr
1 12.547345 11.260464 13.834225
2 12.329353 11.039054 13.619652
3  9.119343  7.832795 10.405891
4  4.760014  3.471742  6.048286
```

---
## Mathematical Formula: Least-squares estimator

`\(Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \ldots + \beta_pX_p + \epsilon\)`

`$$Y = \begin{bmatrix} Y_{1}\\ Y_{2}\\ \vdots\\ Y_{n} \end{bmatrix}$$`

`$$X = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p}\\ 1 & x_{21} & x_{22} & \cdots & x_{2p}\\ \vdots & \vdots & \vdots & & \vdots\\ 1 & x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix}$$`

`$$\hat{\beta} = (X'X)^{-1}X'Y$$`

---
## Mathematical Formula

.full-width[.content-box-yellow[Confidence intervals on the regression coefficients]]

[ `\(\hat{\beta_j} - t_{\alpha/2, n-p-1} \sqrt{\hat{\sigma}^2C_{jj}}\)`, `\(\hat{\beta_j} + t_{\alpha/2, n-p-1} \sqrt{\hat{\sigma}^2C_{jj}}\)` ]

`\(C_{jj}\)` is the diagonal element of `\((X'X)^{-1}\)` corresponding to `\(\beta_j\)`.

An unbiased estimator of `\(\sigma^2\)` is `\(\hat{\sigma}^2 = MSE = SSE/(n-p-1)\)`. The model has `\(p+1\)` coefficients (`\(\beta_0, \ldots, \beta_p\)`), so the residual degrees of freedom are `\(n-p-1\)`.

---
## Mathematical Formula

.full-width[.content-box-yellow[Prediction of the mean response]]

Confidence interval for the mean response at `\(x_{01}, x_{02}, \ldots, x_{0p}\)`:

`\(E[Y|X_1=x_{01}, X_2=x_{02}, \ldots, X_p=x_{0p}] = \mu_{Y|X_1=x_{01}, X_2=x_{02}, \ldots, X_p=x_{0p}}\)`

Fitted value at `\(x_{01}, x_{02}, \ldots, x_{0p}\)`:

`\(\bf{x_0'}\)` = `\([1, x_{01}, x_{02}, \ldots, x_{0p}]\)`

`\(\hat{y}_0 = \bf{x_0'} \hat{\bf{\beta}}\)`

[ `\(\hat{y_0} - t_{\alpha/2, n-p-1} \sqrt{\hat{\sigma}^2 \bf{x_0'(X'X)^{-1}x_0}}\)`, `\(\hat{y_0} + t_{\alpha/2, n-p-1} \sqrt{\hat{\sigma}^2 \bf{x_0'(X'X)^{-1}x_0}}\)` ]

---
## Mathematical Formula (cont.)
.full-width[.content-box-yellow[Prediction of a future value]]

[ `\(\hat{y_0} - t_{\alpha/2, n-p-1} \sqrt{\hat{\sigma}^2 (1+\bf{x_0'(X'X)^{-1}x_0})}\)`, `\(\hat{y_0} + t_{\alpha/2, n-p-1} \sqrt{\hat{\sigma}^2 (1+\bf{x_0'(X'X)^{-1}x_0})}\)` ]

The extra `\(1\)` inside the square root accounts for the variance of the new observation's own error term, which is why a prediction interval is wider than the confidence interval for the mean response at the same point.

---
class: center, middle

Acknowledgement

Introduction to Linear Regression Analysis, Douglas C. Montgomery, Elizabeth A. Peck, G. Geoffrey Vining

All rights reserved by [Dr. Thiyanga S. Talagala](https://thiyanga.netlify.app/)