Linear regression is one of *the* most commonly used statistical modeling techniques. While not as sexy as machine learning algorithms such as neural networks, it is one of the staple methods used throughout industry. Any card-carrying predictive modeler needs to know this one.

I won't go over the theory of linear regression. Instead, I will reference a few resources that do a good job of describing it.

- Wikipedia, of course! https://en.wikipedia.org/wiki/Linear_regression
- https://newonlinecourses.science.psu.edu/stat501/node/251/
- http://www.stat.yale.edu/Courses/1997-98/101/linreg.htm

We will learn a few important things in this simple example, including:

- How to load a standard dataset from sklearn library into a data object (specifically, a data frame)
- How to explore and plot the data we just loaded
- How to fit a linear trend line to the data using OLS

First we import the Boston house prices dataset, and print a description of it so we can examine what is in the data. Remember, in order to execute a 'cell' like the one below, you can 1) click on it and run it using the *Run* button above, or 2) click in the cell and hit Shift+Enter.

In [12]:

```
import pandas as pd
from sklearn.datasets import load_boston
boston = load_boston()
print(boston.data.shape) #get (number of rows, number of columns or 'features')
print(boston.DESCR) #get a description of the dataset
```

In [2]:

```
# Next, we load the data into a 'dataframe' object for easier manipulation, and also print the first few rows in order to examine it
data = pd.DataFrame(boston.data, columns=boston.feature_names)
data.head() #notice that the target variable (MEDV) is not included
```

Out[2]:

In [3]:

```
#sklearn stores the target separately from the feature data, so we add the target variable (MEDV) as a column here
data['MEDV'] = pd.Series(data=boston.target, index=data.index)
data.describe() #get some basic stats on the dataset
```

Out[3]:

In [4]:

```
#Load the independent variables (the x1, x2, etc.) into a dataframe object called 'X'. Similarly for the dependent variable 'Y'
X = data.drop('MEDV', axis = 1) #define independent predictor set (excluding the dependent variable)
Y = data['MEDV'] #define the target values (i.e. the dependent variable)
```

We randomly select a third of our data to be the 'test' dataset. This way we can train our model on 2/3 of the data, and test it on the remainder. Once we are confident that our model is generalizing well (i.e. there is not a HUGE difference in the training/testing performance, or in other words, it is not obviously overfitting), then we can use *all* of our data to train the model.

In [5]:

```
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.33, random_state = 5)
print(X_train.shape)
print(X_test.shape)
print(Y_train.shape)
print(Y_test.shape)
```

In linear regression we *assume* that the relationship between the independent variables (X) and the dependent variable (Y) is *linear*, and then go about finding the linear fit that minimizes the *squared error* between the predicted Y and the actual Y.
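To make the "minimize the squared error" idea concrete, here is a minimal sketch of ordinary least squares on a tiny synthetic dataset (not the Boston data), using `numpy.linalg.lstsq` to solve for the intercept and slope directly:

```
import numpy as np

# Tiny synthetic example: y = 2x + 1, with no noise
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0

# Add a column of ones so the intercept is estimated too
A = np.column_stack([np.ones_like(x), x])

# lstsq finds beta minimizing ||A @ beta - y||^2 (the squared error)
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
print(beta)  # approximately [1., 2.]: intercept 1, slope 2
```

With clean data like this, the least-squares solution recovers the true intercept and slope exactly; sklearn's `LinearRegression` solves the same minimization problem under the hood.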

We now import the LinearRegression method from the *sklearn* library. Note that the process of creating the model involves the very simple command lm.fit(X_train, Y_train). This fits the model and finds the intercept term, $\beta_0$, and the coefficients $\beta_1, \ldots, \beta_p$ (one per feature) that minimize the squared errors.

In [7]:

```
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train,Y_train)
Y_pred_train = lm.predict(X_train) #predictions on training data
Y_pred = lm.predict(X_test) #predictions on testing data
# We plot predicted Y (y-axis) against actual Y (x-axis). Perfect predictions will lie on the diagonal. We see the diagonal trend, suggesting a 'good' fit
import matplotlib.pyplot as plt
plt.scatter(Y_test,Y_pred)
plt.xlabel("Actual Price: $Y_i$")
plt.ylabel(r"Predicted Price: $\hat{Y}_i$")
plt.title(r"Predicted Price vs Actual Price: $Y_i$ vs $\hat{Y}_i$")
plt.show()
```

In [8]:

```
#Let's get the coefficients
print('Intercept term: ',lm.intercept_) # This gives us the intercept term
print('Coefficients: \n',lm.coef_) # This gives us the coefficients (one per feature, 13 in this model)
```
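The raw coefficient array is hard to read on its own; each coefficient can be paired with its feature name using a pandas Series. A sketch of the idea on a small synthetic fit (so it stands alone; the column names here are borrowed from the Boston data purely for illustration):

```
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Small synthetic dataset with named features
rng = np.random.RandomState(0)
X_demo = pd.DataFrame(rng.rand(50, 3), columns=['CRIM', 'RM', 'AGE'])
y_demo = 3.0 * X_demo['RM'] - 1.0 * X_demo['CRIM'] + 0.01 * rng.rand(50)

model = LinearRegression()
model.fit(X_demo, y_demo)

# Index the coefficients by column name for a readable printout
coef_series = pd.Series(model.coef_, index=X_demo.columns)
print(coef_series)
```

In the notebook above, the same trick is `pd.Series(lm.coef_, index=X_train.columns)`, which makes it immediately clear which feature each coefficient belongs to.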

We can inspect a few key error metrics, such as the mean squared error, mean absolute error, and R-squared, by executing the cell below.

In [14]:

```
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
mse = mean_squared_error(Y_test,Y_pred)
print('Mean Squared Error: ',mse)
mae = mean_absolute_error(Y_test,Y_pred)
print('Mean Absolute Error: ',mae)
rsq = r2_score(Y_train,Y_pred_train) #R-Squared on the training data
print('R-square, Training: ',rsq)
rsq = r2_score(Y_test,Y_pred) #R-Squared on the testing data
print('R-square, Testing: ',rsq)
```
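As a sanity check, each of these metrics can be computed by hand from its definition. A sketch on a tiny hard-coded example (not the Boston predictions):

```
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.0, 8.0, 9.5])

# Mean squared error: average of the squared residuals
mse = np.mean((y_true - y_hat) ** 2)

# Mean absolute error: average of the absolute residuals
mae = np.mean(np.abs(y_true - y_hat))

# R-squared: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(mse, mae, r2)  # 0.375 0.5 0.925
```

These hand-computed values match what `mean_squared_error`, `mean_absolute_error`, and `r2_score` return for the same inputs.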

Finally, we can plot the distribution of the prediction errors, below.

In [11]:

```
error = Y_test - Y_pred
plt.hist(error, bins = 25)
plt.xlabel("Prediction Error [MEDV]")
_ = plt.ylabel("Count")
```

If you have ideas on how to improve this post, please let me know: https://predictivemodeler.com/feedback/