The coefficient of multiple correlation, denoted R, is a scalar that is defined as the Pearson correlation coefficient between the predicted and the actual values of the dependent variable in a linear regression model that includes an intercept.
The square of the coefficient of multiple correlation can be computed using the vector $\mathbf{c} = (r_{x_1 y}, r_{x_2 y}, \dots, r_{x_N y})^\top$ of correlations $r_{x_n y}$ between the predictor variables $x_n$ (independent variables) and the target variable $y$ (dependent variable), and the correlation matrix $R_{xx}$ of correlations between the predictor variables. It is given by

$$R^2 = \mathbf{c}^\top R_{xx}^{-1}\,\mathbf{c},$$

where $\mathbf{c}^\top$ is the transpose of $\mathbf{c}$, and $R_{xx}^{-1}$ is the inverse of the matrix $R_{xx}$.
If all the predictor variables are uncorrelated, the matrix $R_{xx}$ is the identity matrix and $R^2$ simply equals $\mathbf{c}^\top \mathbf{c}$, the sum of the squared correlations with the dependent variable. If the predictor variables are correlated among themselves, the inverse of the correlation matrix $R_{xx}$ accounts for this.
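As a numerical sanity check, the identity above can be verified directly. The sketch below is a minimal illustration assuming NumPy and a synthetic data set made up here: it computes $R^2$ once as $\mathbf{c}^\top R_{xx}^{-1}\mathbf{c}$ and once as the squared Pearson correlation between the fitted and actual values of an ordinary least-squares regression with intercept; the two agree up to floating-point error.

```python
# Minimal sketch (NumPy only, synthetic data) of R^2 = c^T R_xx^{-1} c.
import numpy as np

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 3))
X[:, 1] += 0.5 * X[:, 0]                      # make the predictors correlated
y = 2.0 * X[:, 0] - 1.0 * X[:, 2] + rng.normal(size=n)

# Correlations of each predictor with y, and correlations among the predictors.
c = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
R_xx = np.corrcoef(X, rowvar=False)

R2_formula = c @ np.linalg.solve(R_xx, c)     # c^T R_xx^{-1} c

# Cross-check: R is the Pearson correlation between fitted and actual y
# in an OLS regression that includes an intercept.
X1 = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
y_hat = X1 @ beta
R2_regression = np.corrcoef(y_hat, y)[0, 1] ** 2

print(R2_formula, R2_regression)              # the two values agree
```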
The squared coefficient of multiple correlation can also be computed as the fraction of variance of the dependent variable that is explained by the independent variables, which in turn is 1 minus the unexplained fraction. The unexplained fraction can be computed as the sum of squares of residuals—that is, the sum of the squares of the prediction errors—divided by the sum of squares of deviations of the values of the dependent variable from its expected value.
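Written out, with $\hat{y}_i$ denoting the fitted values and $\bar{y}$ the sample mean of the dependent variable (standing in for its expected value), this is

$$R^2 = 1 - \frac{\sum_{i}\left(y_i - \hat{y}_i\right)^2}{\sum_{i}\left(y_i - \bar{y}\right)^2}.$$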
With more than two variables being related to each other, the value of the coefficient of multiple correlation depends on the choice of dependent variable: a regression of $y$ on $x$ and $z$ will in general have a different $R$ than will a regression of $z$ on $x$ and $y$. For example, suppose that in a particular sample the variable $z$ is uncorrelated with both $x$ and $y$, while $x$ and $y$ are linearly related to each other. Then a regression of $z$ on $y$ and $x$ will yield an $R$ of zero, while a regression of $y$ on $x$ and $z$ will yield a strictly positive $R$. This follows since the correlation of $y$ with its best predictor based on $x$ and $z$ is in all cases at least as large as the correlation of $y$ with its best predictor based on $x$ alone, and in this case, with $z$ providing no explanatory power, it will be exactly as large.
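This asymmetry can be illustrated with a small experiment. The sketch below again assumes NumPy, with synthetic data constructed so that $z$ is exactly uncorrelated with $x$ and $y$ in the sample; it prints an $R$ of essentially zero for the regression of $z$ on $y$ and $x$, and a clearly positive $R$ for the regression of $y$ on $x$ and $z$.

```python
# Illustrative sketch of the asymmetry described above (NumPy only, synthetic data).
import numpy as np

def multiple_R(target, predictors):
    """R for an OLS regression of `target` on `predictors` plus an intercept,
    computed as sqrt(1 - SS_res / SS_tot); with an intercept this equals the
    correlation between fitted and actual values."""
    X1 = np.column_stack([np.ones(len(target)), predictors])
    resid = target - X1 @ np.linalg.lstsq(X1, target, rcond=None)[0]
    ss_res = resid @ resid
    ss_tot = ((target - target.mean()) ** 2).sum()
    return np.sqrt(max(1.0 - ss_res / ss_tot, 0.0))

rng = np.random.default_rng(1)
n = 1000
x = rng.normal(size=n)
y = x + 0.5 * rng.normal(size=n)              # y linearly related to x

# Construct z to be exactly uncorrelated with x and y in this sample
# by removing its least-squares projection onto (1, x, y).
z_raw = rng.normal(size=n)
B = np.column_stack([np.ones(n), x, y])
z = z_raw - B @ np.linalg.lstsq(B, z_raw, rcond=None)[0]

print(multiple_R(z, np.column_stack([x, y])))  # ~0: x and y explain nothing about z
print(multiple_R(y, np.column_stack([x, z])))  # strictly positive (≈ |corr(x, y)|)
```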