Posted on August 29, 2017
The following blog article provides an overview of linear regression. It is suitable for those with little to no experience of this type of analysis. It is not a guide on how to conduct your own analysis, but instead serves as a taster of some of the key terms and principles of regression. Further reading resources are provided at the end for those who wish to further their knowledge.
As with regression, there are different types of correlation. The two main types are Pearson’s correlation and Spearman’s correlation. Pearson’s correlation describes a linear relationship between two variables, whereas Spearman’s correlation is concerned with the rank-order of the points, regardless of where exactly they lie.
Furthermore, it is important to distinguish correlation from regression. Correlation measures the strength of association between two variables without assigning them roles, whereas regression models how one variable changes as a function of the other and produces a predictive equation.
A few similarities between correlation and regression should also be noted. Most importantly, neither can affirm causality (2). This is because although a change in x might result in a change in y, it is possible that both variables are related to a third, confounding variable not included in the analysis. Secondly, the square of Pearson’s correlation coefficient (r) is the same value as the R2 in simple linear regression.
In regression, one variable is treated as the dependent (outcome) variable and the other as the independent (predictor) variable; correlation makes no such distinction.
Regression results are given as R2 and a p-value. R2 represents the strength of association, whereas the p-value tests the null hypothesis that there is no association between the variables. Therefore, if the resulting p-value is low (usually defined as < 0.05), we can assume that the relationship between the variables is unlikely to be the result of chance, and so we reject the null hypothesis.
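To make the R2 and p-value outputs concrete, here is a minimal sketch of a simple linear regression using SciPy. The data are fabricated purely for illustration; any real analysis would of course use your own dataset.

```python
import numpy as np
from scipy import stats

# Invented data: y depends linearly on x, plus random noise.
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 50)                 # hypothetical independent variable
y = 1.5 + 0.8 * x + rng.normal(0, 1, 50)   # hypothetical dependent variable

result = stats.linregress(x, y)
r_squared = result.rvalue ** 2   # strength of association
p_value = result.pvalue          # tests the null hypothesis of no association
```

With a clear linear signal like this, the p-value will fall well below 0.05, so the null hypothesis of no association would be rejected.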
Multiple linear regression involves one dependent variable and two or more independent variables.
Multiple linear regression follows the same concept as simple linear regression. However, the difference is that it investigates how a dependent variable changes based on alterations in a combination of multiple independent variables. This reflects a more ‘real-life’ scenario as variables are usually influenced by a number of factors.
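The idea can be sketched in a few lines of code: fitting a model with several independent variables amounts to solving a least-squares problem. The coefficients and data below are invented for illustration.

```python
import numpy as np

# Fabricated data: y depends on two predictors, x1 and x2.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 4.0 + 2.0 * x1 - 3.0 * x2 + rng.normal(0, 0.5, n)

# Design matrix: a column of ones for the intercept, then each predictor.
X = np.column_stack([np.ones(n), x1, x2])
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, b1, b2 = coeffs   # estimates of the true values 4.0, 2.0, -3.0
```

The fitted coefficients land close to the values used to generate the data, which is the behaviour a well-specified multiple regression should show.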
The general process of multiple linear regression is described below; a backwards elimination method is used in this example.
The removal of non-significant variables is important and is most commonly done with either forward selection or backward elimination methods (2). Forward selection starts with no variables in the model and adds the most significant variable at each step, whereas backward elimination starts with all candidate variables in the model and removes the least significant variable at each step. A non-significant variable can be removed if its absence from the model would not significantly decrease the model’s effectiveness; removing it improves the model’s goodness-of-fit. Variables with low p-values (p < 0.05) remain because this implies they are meaningful additions to the model and that changes in them are associated with changes in the dependent variable. A model is finalised when no more variables can be removed without a statistically significant loss of fit.
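The backward elimination loop can be sketched as follows. This is a simplified illustration, not a full implementation: the data, variable names and significance threshold are all invented, and the p-values are computed with a plain ordinary-least-squares fit.

```python
import numpy as np
from scipy import stats

def ols_pvalues(X, y):
    """Fit OLS (with intercept) and return the p-value of each coefficient."""
    Xd = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    n, k = Xd.shape
    sigma2 = resid @ resid / (n - k)           # residual variance
    cov = sigma2 * np.linalg.inv(Xd.T @ Xd)    # covariance of the estimates
    t_stats = beta / np.sqrt(np.diag(cov))
    return 2 * stats.t.sf(np.abs(t_stats), df=n - k)

def backward_eliminate(X, y, names, alpha=0.05):
    """Repeatedly drop the least significant predictor until all remain significant."""
    keep = list(range(X.shape[1]))
    while len(keep) > 1:
        p = ols_pvalues(X[:, keep], y)[1:]     # skip the intercept's p-value
        worst = int(np.argmax(p))
        if p[worst] <= alpha:                  # everything left is significant
            break
        keep.pop(worst)
    return [names[i] for i in keep]

# Fabricated example: x3 has no real effect on y, so it is a candidate for removal.
rng = np.random.default_rng(1)
n = 200
x1, x2, x3 = rng.normal(size=(3, n))
y = 1.0 + 2.0 * x1 - 3.0 * x2 + rng.normal(0, 1, n)

selected = backward_eliminate(np.column_stack([x1, x2, x3]), y, ["x1", "x2", "x3"])
```

The two genuinely influential predictors survive the elimination, matching the finalisation rule described above.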
Residuals are the differences between the observed and predicted values of the dependent variable (y). Residuals are crucial because they allow the model to be validated. They should be calculated for each data point, plotted and then inspected to verify the suitability of the model. The points on a residual plot should be evenly distributed around the horizontal axis at zero. If so, the assumption of linearity holds and the model is suitable for its intended purpose. A bad residual plot shows a clear pattern in the points; in other words, the residual data points are skewed. The following graphs show a good and a bad example of residual plots.
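Computing residuals is straightforward; a minimal sketch on fabricated data is shown below. For a well-specified linear model, the residuals scatter evenly around zero.

```python
import numpy as np

# Invented data with a genuine linear relationship.
rng = np.random.default_rng(7)
x = rng.uniform(0, 10, 100)
y = 3.0 + 2.0 * x + rng.normal(0, 1, 100)

slope, intercept = np.polyfit(x, y, 1)   # least-squares fit, degree 1
predicted = intercept + slope * x
residuals = y - predicted                # observed minus predicted

# In practice you would plot the residuals against the predicted values,
# e.g. plt.scatter(predicted, residuals), and look for any structure.
```

Because the model includes an intercept, the residuals average out to (numerically) zero; it is the shape of their scatter, not their mean, that reveals problems.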
If the residual plot looks less than ideal, data transformations can be conducted. The methods of doing so are beyond the scope of this article, but in short, transforming the data allows non-linear data to be used more effectively with linear regression models. It is important to always state if any data transformations were performed.
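As a hedged illustration of one common transformation: when the dependent variable grows exponentially, taking its logarithm can make the relationship linear. The data below are fabricated purely to show the effect.

```python
import numpy as np

# Invented data: y grows exponentially with x, with multiplicative noise.
rng = np.random.default_rng(3)
x = np.linspace(1, 10, 100)
y = np.exp(0.5 + 0.3 * x) * rng.lognormal(0, 0.1, 100)

# Linear correlation with x, before and after the log transformation.
r_raw = np.corrcoef(x, y)[0, 1]
r_log = np.corrcoef(x, np.log(y))[0, 1]
```

After the transformation, log(y) is a linear function of x plus noise, so the linear correlation is noticeably stronger than on the raw data.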
Once the model is complete, it should be tested on further independent data sets to check its suitability (i.e. data that was not used to construct the model itself). A developed model can be shown to be robust if it is still effective with independent data.
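A minimal sketch of this check, on invented data: fit the model on one portion of the data, then compute R2 on a held-out portion that played no part in the fitting.

```python
import numpy as np

# Fabricated data with a genuine linear relationship.
rng = np.random.default_rng(5)
x = rng.uniform(0, 10, 300)
y = 1.0 + 2.0 * x + rng.normal(0, 1, 300)

# Hold out the last 100 points as an independent test set.
x_train, y_train = x[:200], y[:200]
x_test, y_test = x[200:], y[200:]

slope, intercept = np.polyfit(x_train, y_train, 1)
pred = intercept + slope * x_test

# R^2 on the held-out data: 1 - (residual sum of squares / total sum of squares).
ss_res = np.sum((y_test - pred) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
test_r2 = 1 - ss_res / ss_tot
```

A model that keeps a high R2 on data it never saw is the kind of robustness the paragraph above describes.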
Finally, it is always important to state clearly if there was any missing data. Complex methods do exist to handle missing data sets in linear regression, but these will not be discussed here.
An example of a multiple linear regression model is shown below in the form of both a table and the resultant equation.
Firstly, the table shows all the necessary components of the model. The R2 value is given here; however, it is good practice to report the adjusted R2 value, as it adjusts for the number of independent variables relative to the sample size.
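The standard adjusted R2 formula makes the point concrete: it penalises a model for the number of predictors (k) relative to the sample size (n), so the same raw R2 is discounted more heavily in a small sample.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Example: a raw R^2 of 0.80 with 5 predictors is discounted much more
# when the sample is small.
small_sample = adjusted_r2(0.80, n=20, k=5)
large_sample = adjusted_r2(0.80, n=200, k=5)
```

In this example the small-sample adjusted value drops to roughly 0.73, while the large-sample value stays close to 0.79.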
The equation shows the line of best fit. It expands the y = a + bx formula to y = a + b1x1 + b2x2 + … + bnxn, to account for the multiple independent variables involved.
Poor fit (over-fitting, under-fitting etc.) of a model means it does not serve its intended purpose effectively. There are ways the model can be adjusted to optimise its usefulness.
Multicollinearity exists when two or more of the independent variables in the model are highly correlated with one another. This poses a problem because these related variables offer much the same information to the model. To deal with multicollinearity, one of the variables must be removed (usually the least significant one, i.e. the one with the highest p-value). The Variance Inflation Factor (VIF) is calculated for each independent variable and is used to determine whether multicollinearity is present. Essentially, a high VIF means the variable is largely explained by the other independent variables, whereas a low VIF means it is not. A low VIF is good and indicates no significant multicollinearity. It is important to clarify that no universal cut-off point exists, so determining whether a VIF value is acceptable requires a subjective but educated decision from the researchers conducting the analysis. Regression is an artful science, and so requires informed judgement and experience to optimise a model for its intended purpose.
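The VIF calculation itself is simple: regress each independent variable on all the others and compute VIF = 1 / (1 - R2). The sketch below uses fabricated data in which one predictor is deliberately made almost identical to another to induce multicollinearity.

```python
import numpy as np

def vif(X):
    """Variance Inflation Factor for each column of the predictor matrix X."""
    n, k = X.shape
    factors = []
    for j in range(k):
        # Regress column j on all the other columns (plus an intercept).
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ beta
        r2 = 1 - resid @ resid / np.sum((X[:, j] - X[:, j].mean()) ** 2)
        factors.append(1 / (1 - r2))
    return factors

# Invented predictors: x2 is nearly a copy of x1; x3 is unrelated.
rng = np.random.default_rng(2)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(0, 0.05, n)
x3 = rng.normal(size=n)

vifs = vif(np.column_stack([x1, x2, x3]))
```

The two near-duplicate predictors produce very large VIFs, while the unrelated predictor's VIF stays close to 1, illustrating why one of the correlated pair would be removed.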
Linear regression is a useful method to predict changes in a dependent variable based on alterations in independent variables. It is hoped that this blog can act as a gentle introduction to this type of analysis. For more information about linear regression, the following web resources are recommended: