Linear regression is one of *the* most commonly used statistical modeling techniques. While not as sexy as machine learning algorithms such as neural networks, it is one of the staple methods used throughout industry. Any card-carrying predictive modeler needs to know this one.

I won't go over the theory of linear regression. Instead, I will reference a few resources that do a good job of describing it.

- Wikipedia, of course! https://en.wikipedia.org/wiki/Linear_regression
- https://newonlinecourses.science.psu.edu/stat501/node/251/
- http://www.stat.yale.edu/Courses/1997-98/101/linreg.htm

We will learn a few important things in this simple example, including:

- How to load a standard dataset from sklearn library into a data object (specifically, a data frame)
- How to explore and plot the data we just loaded
- How to fit a linear trend line to the data using OLS

First we import the Boston house prices dataset, and print a description of it so we can examine what is in the data. Remember, in order to execute a 'cell' like the one below, you can 1) click on it and run it using the *Run* button above, or 2) click in the cell and hit Shift+Enter.

In [12]:

```
import pandas as pd
from sklearn.datasets import load_boston
boston = load_boston()
print(boston.data.shape) #get (number of rows, number of columns or 'features')
print(boston.DESCR) #get a description of the dataset
```

In [2]:

```
# Next, we load the data into a 'dataframe' object for easier manipulation, and also print the first few rows in order to examine it
data = pd.DataFrame(boston.data, columns=boston.feature_names)
data.head() #notice that the target variable (MEDV) is not included
```

Out[2]:

In [3]:

```
#sklearn stores the target separately from the feature data, so we add the target variable (MEDV) as a column here
data['MEDV'] = pd.Series(data=boston.target, index=data.index)
data.describe() #get some basic stats on the dataset
```

Out[3]:

In [4]:

```
#Load the independent variables (the x1, x2, etc.) into a dataframe object called 'X'. Similarly for the dependent variable 'Y'
X = data.drop('MEDV', axis = 1) #define independent predictor set (excluding the dependent variable)
Y = data['MEDV'] #define the target values (i.e. the dependent variable)
```

We randomly select a third of our data to be the 'test' dataset. This way we can train our model on 2/3 of the data, and test it on the remainder. Once we are confident that our model is generalizing well (i.e. there is not a HUGE difference in the training/testing performance, or in other words, it is not obviously overfitting), then we can use *all* of our data to train the model.

In [5]:

```
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.33, random_state = 5)
print(X_train.shape)
print(X_test.shape)
print(Y_train.shape)
print(Y_test.shape)
```

In linear regression we *assume* that the relationship between the independent variables (X) and the dependent variable (Y) is *linear*, and then go about finding the linear fit that minimizes the *squared error* between the predicted Y and the actual Y.
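To make the "minimize the squared error" idea concrete, here is a minimal sketch of ordinary least squares on a tiny synthetic dataset (not the Boston data), using `numpy.linalg.lstsq` to solve for the intercept and slope directly:

```
import numpy as np

# Tiny synthetic example: y = 2x + 1, with no noise
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0

# Add a column of ones so the intercept is estimated too
A = np.column_stack([np.ones_like(x), x])

# lstsq finds beta minimizing ||A @ beta - y||^2 (the squared error)
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
print(beta)  # approximately [1., 2.]: intercept 1, slope 2
```

With clean data like this, the least-squares solution recovers the true intercept and slope exactly; sklearn's `LinearRegression` solves the same minimization problem under the hood.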

We now import the LinearRegression method from the *sklearn* library. Note that the process of creating the model involves the very simple command lm.fit(X_train, Y_train). This fits the model and finds the intercept term, $\beta_0$, and the coefficients $\beta_1, \ldots, \beta_p$ (one per feature) that minimize the squared errors.

In [7]:

```
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train,Y_train)
Y_pred_train = lm.predict(X_train) #predictions on training data
Y_pred = lm.predict(X_test) #predictions on testing data
# We plot predicted Y (y-axis) against actual Y (x-axis). Perfect predictions will lie on the diagonal. We see the diagonal trend, suggesting a 'good' fit
import matplotlib.pyplot as plt
plt.scatter(Y_test,Y_pred)
plt.xlabel("Actual Price: $Y_i$")
plt.ylabel(r"Predicted Price: $\hat{Y}_i$")
plt.title(r"Predicted Price vs Actual Price: $Y_i$ vs $\hat{Y}_i$")
plt.show()
```

In [8]:

```
#Let's get the coefficients
print('Intercept term: ',lm.intercept_) # This gives us the intercept term
print('Coefficients: \n',lm.coef_) # This gives us the coefficients (one per feature, 13 in this model)
```
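The raw coefficient array is hard to read on its own; each coefficient can be paired with its feature name using a pandas Series. A sketch of the idea on a small synthetic fit (so it stands alone; the column names here are borrowed from the Boston data purely for illustration):

```
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Small synthetic dataset with named features
rng = np.random.RandomState(0)
X_demo = pd.DataFrame(rng.rand(50, 3), columns=['CRIM', 'RM', 'AGE'])
y_demo = 3.0 * X_demo['RM'] - 1.0 * X_demo['CRIM'] + 0.01 * rng.rand(50)

model = LinearRegression()
model.fit(X_demo, y_demo)

# Index the coefficients by column name for a readable printout
coef_series = pd.Series(model.coef_, index=X_demo.columns)
print(coef_series)
```

In the notebook above, the same trick is `pd.Series(lm.coef_, index=X_train.columns)`, which makes it immediately clear which feature each coefficient belongs to.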

We can inspect a few key error metrics, such as the mean squared error, mean absolute error, and R-squared, by executing the cell below.

In [14]:

```
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
mse = mean_squared_error(Y_test,Y_pred)
print('Mean Squared Error: ',mse)
mae = mean_absolute_error(Y_test,Y_pred)
print('Mean Absolute Error: ',mae)
rsq = r2_score(Y_train,Y_pred_train) #R-Squared on the training data
print('R-square, Training: ',rsq)
rsq = r2_score(Y_test,Y_pred) #R-Squared on the testing data
print('R-square, Testing: ',rsq)
```
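As a sanity check, each of these metrics can be computed by hand from its definition. A sketch on a tiny hard-coded example (not the Boston predictions):

```
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.0, 8.0, 9.5])

# Mean squared error: average of the squared residuals
mse = np.mean((y_true - y_hat) ** 2)

# Mean absolute error: average of the absolute residuals
mae = np.mean(np.abs(y_true - y_hat))

# R-squared: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(mse, mae, r2)  # 0.375 0.5 0.925
```

These hand-computed values match what `mean_squared_error`, `mean_absolute_error`, and `r2_score` return for the same inputs.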

Finally, we can plot the distribution of the prediction errors, below.

In [11]:

```
error = Y_test - Y_pred
plt.hist(error, bins = 25)
plt.xlabel("Prediction Error [MEDV]")
_ = plt.ylabel("Count")
```

If you have ideas on how to improve this post, please let me know: https://predictivemodeler.com/feedback/