Looking at a Regression's R-Squared
R-squared measures how closely the data fit the regression line, so the R-squared in a simple regression is related to the correlation between the two variables. In particular, the magnitude of the correlation is the square root of the R-squared and the sign of the correlation is the sign of the regression coefficient.
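The relationship above can be checked by hand. The following is a minimal sketch using only the Python standard library; the toy values of x and y are invented for illustration and are not the exercise's data.

```python
import math

# Toy data, invented for illustration (not the exercise's x and y)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [9.8, 8.1, 6.2, 3.9, 2.0]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Sums of squares and cross-products around the means
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

# Pearson correlation between x and y
correlation = sxy / math.sqrt(sxx * syy)

# Least-squares fit of y = alpha + beta * x
beta = sxy / sxx
alpha = mean_y - beta * mean_x

# R-squared = 1 - (residual sum of squares / total sum of squares)
ss_res = sum((yi - (alpha + beta * xi)) ** 2 for xi, yi in zip(x, y))
r_squared = 1 - ss_res / syy

# Magnitude from the square root of R-squared, sign from the slope
recovered = math.copysign(math.sqrt(r_squared), beta)

print(f"correlation          = {correlation:.4f}")
print(f"R-squared            = {r_squared:.4f}")
print(f"sign(beta)*sqrt(R^2) = {recovered:.4f}")
```

The slope here is negative, so the recovered correlation is negative as well, even though R-squared itself is always non-negative.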
In this exercise, you will start using the statistical package statsmodels, which performs much of the statistical modeling and testing that is found in R and software packages like SAS and MATLAB.
You will take two series, x and y, compute their correlation, and then regress y on x using the function OLS(y,x) in the statsmodels.api library (note that the dependent, or left-hand side, variable y is the first argument). Most linear regressions contain a constant term, which is the intercept (the \(\small \alpha\) in the regression \(\small y_t=\alpha + \beta x_t + \epsilon_t\)). To include a constant using the function OLS(), you need to add a column of 1's to the right-hand side of the regression.
The module statsmodels.api has been imported for you as sm.
This exercise is part of the course
Time Series Analysis in Python
Exercise instructions
- Compute the correlation between x and y using the .corr() method.
- Run a regression:
  - First convert the Series x to a DataFrame dfx.
  - Add a constant using sm.add_constant(), assigning it to dfx1.
  - Regress y on dfx1 using sm.OLS().fit().
- Print out the results of the regression and compare the R-squared with the correlation.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Import the statsmodels module
import statsmodels.api as sm
# Compute correlation of x and y
correlation = ___
print("The correlation between x and y is %4.2f" %(correlation))
# Convert the Series x to a DataFrame and name the column x
dfx = pd.DataFrame(x, columns=['x'])
# Add a constant to the DataFrame dfx
dfx1 = sm.add_constant(___)
# Regress y on dfx1
result = sm.OLS(___, ___).fit()
# Print out the results and look at the relationship between R-squared and the correlation above
print(result.summary())