Linear regression is one of the most commonly used statistical modeling techniques. While not as sexy as machine learning algorithms such as neural networks, it is one of the staple methods used throughout industry. Any card-carrying predictive modeler needs to know this one.
I won't go over the theory of linear regression. Instead, I will reference a few resources that do a good job of describing it.
We will learn a few important things in this simple example.
First we import a couple of packages that will help us load the data. The data, as well as this script, is available for download at the bottom of this post. Save the data file pmds001.txt and change the path (i.e. 'D:...') below to wherever you put it. We choose to call our data "sample_data". Remember, in order to execute a 'cell' like the one below, you can 1) click on it and run it using the run button above, or 2) click in the cell and hit Shift+Enter.
import pandas as pd
# Note the r'...' raw string: it keeps the Windows backslashes from being read as escape sequences (e.g. '\T' in '\The Book' would otherwise become a tab)
sample_data = pd.read_csv(r'D:\Data\PredictiveModeler\The Book\Benchmarked Data\pmds001.txt', delimiter='\t')
You can quickly explore the data using '.head()', '.info()', '.describe()', or '.columns'. This is helpful as you get a quick sense of what is in your data. Go ahead and try these options below!
sample_data.head()
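To illustrate what these exploration methods return, here is a minimal sketch on a small hypothetical DataFrame (a stand-in for sample_data, which we assume has the two columns X and Y):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for sample_data: two columns, X and Y
df = pd.DataFrame({"X": np.arange(10), "Y": np.arange(10) ** 2})

print(df.head())     # first five rows
df.info()            # column dtypes and non-null counts (prints directly)
print(df.describe()) # summary statistics: count, mean, std, quartiles
print(df.columns)    # column labels
```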
In the cell below we plot the data in order to visually inspect it quickly. You will note that it is clearly some sort of exponential function. But we are going to fit a linear trendline anyway - because that's what this tutorial is about! In a future post we will go over the improved model fit we get when we try some non-linear modeling techniques.
You will note that we first import a new library, matplotlib, that allows us to create a nice plot of our data.
import matplotlib.pyplot as plt
# We load the independent variable(s) in a data array called 'X' and similarly, the dependent variable into one called 'Y'
X = sample_data[['X']]
Y = sample_data['Y']
# Now, we create a scatter plot from X and Y
plt.scatter(X, Y, color='black')
plt.title("X vs Y")
plt.xlabel("x-axis")
plt.ylabel("y-axis")
plt.show()
In linear regression we assume that the relationship between the independent variables (X) and the dependent variable (Y) is linear, and then go about finding the line that minimizes the squared error between the predicted Y and the actual Y.
$$ {y}_i = \beta_0 + \beta_1 {x}_i + \epsilon_i $$
We now import the LinearRegression method from the sklearn library. Note that the process of fitting the model comes down to the very simple command lm.fit(X,Y). This fits the model, finding the intercept term, $\beta_0$, and the coefficient, $\beta_1$, that minimize the squared errors.
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X,Y)
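As a cross-check on what lm.fit is doing, the same least-squares solution can be computed directly with numpy's lstsq solver. A minimal sketch on hypothetical toy data (not our sample_data):

```python
import numpy as np

# Toy data with an exact linear relationship: y = 2 + 3x
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 + 3.0 * x

# Design matrix with a column of ones for the intercept term
A = np.column_stack([np.ones_like(x), x])

# Solve the least-squares problem: minimize ||A @ beta - y||^2
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
print(beta)  # beta[0] is the intercept (~2.0), beta[1] the slope (~3.0)
```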
We can inspect a few key values, such as the coefficients, the mean squared error, and the $R^2$, by executing the cell below.
from sklearn.metrics import mean_squared_error, r2_score
print('Intercept: ', lm.intercept_) # This gives us the intercept term
print('Coefficients: \n', lm.coef_) # This gives us the coefficients (in the case of this model, just one coefficient)
print('MSE: ', mean_squared_error(Y, lm.predict(X))) # Mean squared error of the fit
print('R^2: ', r2_score(Y, lm.predict(X))) # Proportion of variance explained
Finally, we can overlay the predicted trendline with the actual data. Of course the fit is not great, but it is the best we can do with a linear model that minimizes the squared error.
Yp = lm.predict(X) # We load the predictions into a new data object, Yp
plt.scatter(X, Y, color='black')
plt.plot(X, Yp, color='red', linewidth=1)
plt.title("X vs Y and Yp")
plt.xlabel("X")
plt.ylabel("Y and Yp")
plt.show()
Clearly, a linear model is not a great option for this. Can you spot a different transformation of the Y variable that would be? Given that the Y seemingly increases exponentially relative to X, what happens if we plot log(Y) vs X?
import numpy as np #import a new library that allows us to take logs
logY = np.log10(Y) #create a new data array, logY, that takes log base 10 of Y
plt.scatter(X, logY, color='red') #plot & visually inspect the new relationship
plt.title("X vs logY")
plt.xlabel("X")
plt.ylabel("logY")
plt.show()
Looks linear now! Transformations & normalizations are useful in predictive modeling. We will explore them in future case studies. If you like, you can try fitting a linear model to logY as an exercise.
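If you want to try the exercise, here is one way it might look. This sketch uses hypothetical log-linear data in place of sample_data (so the fit is exact), but the same lines work on the real X and logY from above:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data that is exactly log-linear: Y = 10 ** (0.5 + 0.2 * X)
X = pd.DataFrame({"X": np.arange(1, 11)})
Y = 10 ** (0.5 + 0.2 * X["X"])

# Fit a linear model to log10(Y) instead of Y
logY = np.log10(Y)
lm_log = LinearRegression()
lm_log.fit(X, logY)

print('Intercept: ', lm_log.intercept_)  # ~0.5 for this toy data
print('Coefficients: ', lm_log.coef_)    # ~[0.2] for this toy data

# Back-transform predictions to the original scale: Y ≈ 10 ** (b0 + b1 * X)
Yp = 10 ** lm_log.predict(X)
```

Because the model is fit on the log scale, remember to raise 10 to the predicted values (as in the last line) before comparing against the original Y.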