I have four books about linear least squares regression models. I have no interest in writing another one. But I find that, at whatever level, none of them steps back and explains the underlying philosophy of this horribly misnamed[1], powerful but deceptive technique. It is worth thinking through some basic concepts before getting overwhelmed by complicated equations and model assessments.
We will step back from classifying and counting and return to continuous data: in our case, the spring training data for Blue Zone’s softball team. In this example, we compared food spending with clothing spending.
Figure 1
[Image: scatter plot of food vs. clothing spending – https://www.movedbymetrics.com/wp-content/uploads/2013/07/food-clothing-plot-495×400.png]
It is easy to add a straight line in most software now, but are you still on the straight and narrow if all you wanted was to show correlation? Not really, because the rule that magically draws the line in your spreadsheet or BI software makes several powerful assumptions. When we were looking at correlation, displaying food on the horizontal axis and clothing on the vertical axis didn’t matter. They could be interchanged. The graph could be rotated or stretched, and we would still see a point that corresponded to the combination of values for food and clothing spending for each player. I use the word combination in the mathematical sense of not having any meaningful order. For player 1 the combination would be (food = 58, clothing = 115); writing it down as (clothing = 115, food = 58) doesn’t change the relationship between the variables. But the line implies that clothing is a specific function of food spending.[2] This line becomes the function we call least squares regression.
A function is a rule of assignment. Think of going to the theater and getting an assigned seat. In this type of function there is a one-to-one assignment. Another function is the assignment of a price to your ticket. All the people in your section of the theater are assigned the same price, a many-to-one function. The first question to ask before pasting in a least squares regression line is whether you want to think of the relationship between one variable and another as an assignment. Does one variable precede or give information about the other? Typically these are referred to as the ‘independent’ and ‘dependent’ variables, but I prefer explanatory and response, because ‘independent variable’ is easy to confuse with statistical ‘independence.’
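If a quick sketch in code helps, here are the two kinds of assignment in Python (the ticket and section names are, of course, made up):

```python
# One-to-one assignment: each ticket gets exactly one seat.
seat_of = {"ticket_101": "B12", "ticket_102": "B13", "ticket_103": "B14"}

# Many-to-one assignment: every section maps to a single price,
# so all tickets in a section share that price.
price_of = {"orchestra": 95.00, "mezzanine": 60.00, "balcony": 35.00}

print(seat_of["ticket_102"])   # B13 -- a seat no one else has
print(price_of["balcony"])     # 35.0 -- shared by the whole section
```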
There are two functions at work in a scatter plot with a regression line. The first function assigns the explanatory variable, displayed on the horizontal axis, to the observed value of the response on the vertical axis. There can be several different y values for each level of x. The second function assigns each x value to a single y value on the regression line, often called the predicted value. The observed value minus the predicted value is called the residual. For every observed value there is a residual. (Another function!)
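To make these functions concrete, here is a small Python sketch. Only player 1’s point (food = 58, clothing = 115) comes from the example above; the other points are invented for illustration:

```python
import numpy as np

# Food (x) and clothing (y) spending; player 1 is (58, 115) from the text,
# the remaining points are made up for illustration.
x = np.array([58.0, 40.0, 75.0, 60.0, 90.0])
y = np.array([115.0, 80.0, 150.0, 110.0, 170.0])

# Least squares fit: slope and intercept define the modeled response function.
slope, intercept = np.polyfit(x, y, 1)

predicted = slope * x + intercept   # second function: each x -> a point on the line
residuals = y - predicted           # another function: one residual per observation

for xi, yi, pi, ri in zip(x, y, predicted, residuals):
    print(f"x={xi:5.1f}  observed={yi:6.1f}  predicted={pi:6.1f}  residual={ri:+6.1f}")
```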
The bank data from the February posting showed three graphs with different correlations. As we explained, this is a snapshot of local data in the Northeast, pre-2008, with equity from 0 up to 100% of the bank’s listed home value. By removing points and transforming variables, the correlation increased. Below are the three graphs with a regression fit and equation.
Figure 2 – Graph 1: Equity in Home vs. Years in Home [Image: https://www.movedbymetrics.com/wp-content/uploads/2013/07/graph-1-equity-710×575.png]
Figure 3 – Graph 2: Equity-to-Home-Value Ratio vs. Years in Home [Image: https://www.movedbymetrics.com/wp-content/uploads/2013/07/graph-2-equity-495×400.png]
Figure 4 – Graph 3: Equity-to-Home-Value Ratio vs. Years in Home, Years > 0 and Equity > 0 [Image: https://www.movedbymetrics.com/wp-content/uploads/2013/07/graph-3-equity-495×400.png]
Figure 5 – Functions Applied

| Graph | Explanatory variable | Observed response (a function of the explanatory) | Modeled response function | Residual function |
| Graph 1 | Years in Home | Equity in Home | y = 1396x + 77250 | Equity in Home - (1396x + 77250) |
| Graph 2 | Years in Home | Equity in Home / Home Value | y = 0.0161x + 0.266 | Equity in Home / Home Value - (0.0161x + 0.266) |
| Graph 3 | Years in Home | Equity in Home / Home Value, with Years > 0 and Equity > 0 | y = 0.0230x + 0.261 | Equity in Home / Home Value - (0.0230x + 0.261) |
We are able to obtain a modeled response function for each of the data sets above. However, in Graph 1 the line looks almost flat and offers little insight about years in home and equity; in fact, the slope is not significantly different from 0. By transforming the data, dividing equity by home value, we gain better insight into the relationship. The slope shows an increase of 0.016 in the equity-to-home-value ratio for each year in the home. Removing those with 0 equity increases the slope to 0.023. The model gives a good numerical and visual explanation of how equity builds over time.
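A minimal sketch of the transform-and-refit idea, using a made-up stand-in for the bank data (the column names and values are assumptions, not the actual data set):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the bank data; names and values are invented.
df = pd.DataFrame({
    "years_in_home": [0, 2, 5, 10, 15, 18, 20, 25],
    "equity":        [0, 15000, 40000, 90000, 120000, 150000, 200000, 260000],
    "home_value":    [150000, 160000, 170000, 200000, 210000, 230000, 260000, 280000],
})

# Graph 2 transformation: divide equity by home value.
df["equity_ratio"] = df["equity"] / df["home_value"]

# Graph 3 restriction: keep only Years > 0 and Equity > 0.
sub = df[(df["years_in_home"] > 0) & (df["equity"] > 0)]

# Refit the least squares line on the transformed, restricted data.
slope, intercept = np.polyfit(sub["years_in_home"], sub["equity_ratio"], 1)
print(f"ratio = {slope:.4f} * years + {intercept:.3f}")
```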
We can look at the five subjects who have been in their home for 20 years, circled in the blue loop above. The predicted value, the y value of the line at x = 20, is 0.722. One subject is nearly on the mark at 0.692. Three are above the predicted value and two are below. A more advanced analysis of residuals would give an understanding of how well the model fits and flag any points that are influencing the model or don’t belong.
Figure 6 – Graph 3 Data at Years in Home = 20

| ID | Equity | Home Value | Equity-to-Value Ratio | Predicted | Residual |
| 15 | 53995 | 66660 | 0.810 | 0.722 | 0.088 |
| 116 | 232106 | 267540 | 0.868 | 0.722 | 0.146 |
| 117 | 81365 | 167440 | 0.486 | 0.722 | -0.236 |
| 191 | 116697 | 122240 | 0.955 | 0.722 | 0.233 |
| 277 | 283175 | 409210 | 0.692 | 0.722 | -0.030 |
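The table can be reproduced directly from the Graph 3 equation. One caveat: the rounded coefficients give a predicted value of 0.721 at x = 20, while the table’s 0.722 presumably comes from the unrounded fit:

```python
# Figure 6 recomputed from the Graph 3 line, y = 0.0230x + 0.261.
subjects = {            # ID: (equity, home value), taken from the table above
    15:  (53995, 66660),
    116: (232106, 267540),
    117: (81365, 167440),
    191: (116697, 122240),
    277: (283175, 409210),
}

predicted = 0.0230 * 20 + 0.261   # about 0.72 for everyone at x = 20

for sid, (equity, value) in subjects.items():
    ratio = equity / value
    print(f"ID {sid:3d}: ratio={ratio:.3f}  predicted={predicted:.3f}  "
          f"residual={ratio - predicted:+.3f}")
```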
Connecting regression and correlation
[Image: slope and intercept formulas – https://www.movedbymetrics.com/wp-content/uploads/2013/07/connection-regression.png]
There are many ways to obtain the slope and intercept for the regression line. One formula expresses b1, the slope, as the correlation coefficient of the x’s and y’s times the ratio of their standard deviations: b1 = r_xy(s_y / s_x). The intercept, b0, is obtained by solving with b1 at the averages of the observed x’s and y’s: b0 = ȳ - b1x̄. Of more interest than the correlation r is R², the square of the correlation between the observed y’s and the predicted values. (For simple linear regression, R is the absolute value of r_xy, but this no longer holds as regressions get more complicated.) R² is a common measure of the strength of a regression model, but it should be used in the context of the situation. For example, an R² = 0.81 in a lab could be terrible, but in a market research survey it would be terrific.
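Both formulas, and the connection between r and R², are easy to verify numerically. A sketch with made-up data:

```python
import numpy as np

# Made-up paired data; any x and y arrays of equal length will do.
x = np.array([1.0, 2.0, 4.0, 5.0, 7.0, 9.0])
y = np.array([2.1, 2.9, 4.8, 6.2, 7.1, 9.4])

r_xy = np.corrcoef(x, y)[0, 1]                  # correlation of x and y
b1 = r_xy * y.std(ddof=1) / x.std(ddof=1)       # slope = r * (s_y / s_x)
b0 = y.mean() - b1 * x.mean()                   # line passes through (x-bar, y-bar)

predicted = b0 + b1 * x
R2 = np.corrcoef(y, predicted)[0, 1] ** 2       # square of corr(observed, predicted)

# For simple linear regression, R^2 equals r_xy^2.
print(f"b1={b1:.4f}  b0={b0:.4f}  R^2={R2:.4f}  r_xy^2={r_xy**2:.4f}")
```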
If you would like to talk more about linear regression, contact me at www.directeffects.net.
Have fun.
Georgette
[1] See Wikipedia for Francis Galton’s ‘regression to the mean’ story.
[2] For machine learning people, think of unsupervised and supervised learning.