How to Calculate R^2 with Scikit-Learn
Last Updated :
05 Aug, 2024
The coefficient of determination, denoted as R², is an essential metric in regression analysis. It indicates the extent to which the independent variables account for the variation in the dependent variable.
In this article, we will walk you through calculating R² using Scikit-Learn, a powerful Python library for machine learning.
What is R²?
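R² quantifies the proportion of variance in the dependent variable that can be predicted from the independent variables. A value of 1 indicates that the model explains all the variability, and 0 indicates that it explains none, which is what a model that always predicts the mean of y achieves. For most fitted models R² falls between 0 and 1, but it can be negative when the predictions are worse than simply predicting the mean.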
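The reference points for R² can be checked directly with r2_score. The following is a minimal sketch on a tiny made-up dataset (the arrays are illustrative assumptions, not data from this article): perfect predictions give 1, predicting the mean gives 0, and predictions worse than the mean give a negative value.

```python
import numpy as np
from sklearn.metrics import r2_score

# Tiny illustrative dataset (hypothetical values)
y_true = np.array([1.0, 2.0, 3.0, 4.0])

# Three sets of predictions: perfect, constant mean, and reversed (very poor)
perfect_pred = y_true.copy()
mean_pred = np.full_like(y_true, y_true.mean())
bad_pred = y_true[::-1].copy()

r2_perfect = r2_score(y_true, perfect_pred)
r2_mean = r2_score(y_true, mean_pred)
r2_bad = r2_score(y_true, bad_pred)

print(f"Perfect predictions: {r2_perfect}")
print(f"Mean predictions:    {r2_mean}")
print(f"Reversed predictions: {r2_bad}")
```

The reversed predictions have a residual sum of squares four times the total sum of squares, so R² comes out at 1 - 4 = -3.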
Mathematically, R² is expressed as:
R^2 = 1 - \frac{\text{SS}_{res}}{\text{SS}_{tot}}
Here:
- SS_{res} is the residual sum of squares: the sum of the squared differences between the actual and predicted values.
- SS_{tot} is the total sum of squares: the sum of the squared differences between the actual values and their mean.
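The formula can be implemented in a few lines of NumPy and compared against r2_score. This is a sketch for illustration: the helper name r2_manual and the four sample values are assumptions introduced here, not part of Scikit-Learn or of the examples in this article.

```python
import numpy as np
from sklearn.metrics import r2_score


def r2_manual(y_true, y_pred):
    """Compute R² = 1 - SS_res / SS_tot (hypothetical helper for illustration)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot


# Small illustrative arrays (hypothetical values)
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

manual = r2_manual(y_true, y_pred)
library = r2_score(y_true, y_pred)

print(f"Manual R²:       {manual:.4f}")
print(f"Scikit-Learn R²: {library:.4f}")
```

Here SS_res is 1.5 and SS_tot is 29.1875, so both calculations give roughly 0.9486.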
Calculating R² with Scikit-Learn for Sample Data
Let's go through an example that calculates R² from sample data generated by a simple linear relationship.
Step 1: Import Necessary Libraries
import numpy as np
from sklearn.metrics import r2_score
Step 2: Generate Sample Data
# Generate random data
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
# Use the true underlying line as the prediction (no model is fitted; for demonstration only)
y_pred = 4 + 3 * X
Step 3: Compute R² Using Scikit-Learn
# Flatten the arrays to use in r2_score
y = y.flatten()
y_pred = y_pred.flatten()
# Compute R² using Scikit-Learn
R2_sklearn = r2_score(y, y_pred)
print(f"R² (Scikit-Learn Calculation): {R2_sklearn}")
Complete Code
import numpy as np
from sklearn.metrics import r2_score
# Generate random data
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
# Use the true underlying line as the prediction (no model is fitted; for demonstration only)
y_pred = 4 + 3 * X
# Flatten the arrays to use in r2_score
y = y.flatten()
y_pred = y_pred.flatten()
# Compute R² using Scikit-Learn
R2_sklearn = r2_score(y, y_pred)
print(f"R² (Scikit-Learn Calculation): {R2_sklearn}")
Output:
R² (Scikit-Learn Calculation): 0.7639751938835576
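In practice the predictions usually come from a fitted model rather than from a known formula. The sketch below assumes the same synthetic data as the example above and fits a LinearRegression to it. It also uses the estimator's score method, which returns R² for regressors, as a cross-check on r2_score.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Same synthetic data as in the example above
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = (4 + 3 * X + np.random.randn(100, 1)).ravel()

# Fit a linear model and predict on the same data
model = LinearRegression().fit(X, y)
y_fit = model.predict(X)

r2_fit = r2_score(y, y_fit)                    # R² of the fitted model
r2_from_score = model.score(X, y)              # same quantity via the estimator
r2_assumed = r2_score(y, (4 + 3 * X).ravel())  # R² of the assumed true line

print(f"R² (fitted model):  {r2_fit}")
print(f"R² (model.score):   {r2_from_score}")
print(f"R² (assumed line):  {r2_assumed}")
```

Because ordinary least squares minimizes the residual sum of squares on the data it is fitted to, the fitted model's R² on that data is at least as high as the R² of the assumed line.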
Calculating R² for a Simple Polynomial Regression Problem Using Scikit-Learn
Polynomial regression models the relationship between the independent variable X and the dependent variable y as an n-th degree polynomial. We will compute the R² value for a polynomial regression model in Python.
Step 1: Import Libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
Step 2: Generate Sample Data
We'll create a simple nonlinear dataset:
# Generate random data
np.random.seed(42)
X = 6 * np.random.rand(100, 1) - 3
y = 0.5 * X**2 + X + 2 + np.random.randn(100, 1)
Step 3: Prepare Polynomial Features
Transform the input data to include polynomial features up to the desired degree (e.g., degree 2):
poly_features = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly_features.fit_transform(X)
Step 4: Fit the Polynomial Regression Model
Fit a linear regression model to the polynomial features:
model = LinearRegression()
model.fit(X_poly, y)
y_pred = model.predict(X_poly)
Step 5: Calculate R² Using Scikit-Learn
Compare the fitted values with the actual values using Scikit-Learn's r2_score function:
# Flatten the arrays to use in r2_score
y = y.flatten()
y_pred = y_pred.flatten()
# Compute R² using Scikit-Learn
R2_sklearn = r2_score(y, y_pred)
print(f"R² (Scikit-Learn Calculation): {R2_sklearn}")
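The R² in this step is computed on the same data the model was fitted on, which tends to give an optimistic picture. As a hedged sketch of one common alternative, the code below holds out part of the data with train_test_split and reports R² on both parts. The 25% test size, the random_state value, and the use of make_pipeline are choices made for this illustration, not part of the original walkthrough.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Same synthetic quadratic data as in the walkthrough
np.random.seed(42)
X = 6 * np.random.rand(100, 1) - 3
y = (0.5 * X**2 + X + 2 + np.random.randn(100, 1)).ravel()

# Hold out 25% of the samples for evaluation (illustrative choice)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Chain the polynomial expansion and the linear fit into one estimator
poly_model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
)
poly_model.fit(X_train, y_train)

r2_train = r2_score(y_train, poly_model.predict(X_train))
r2_test = r2_score(y_test, poly_model.predict(X_test))

print(f"R² (training data): {r2_train}")
print(f"R² (held-out data): {r2_test}")
```

Wrapping both steps in a pipeline means the polynomial transform is applied consistently to the training and test inputs.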
Visualizing the Results
It's often helpful to visualize the polynomial regression curve along with the data points:
plt.scatter(X, y, color='blue', label='Actual')
# Sort the values for better plotting
sorted_indices = X.flatten().argsort()
plt.plot(X[sorted_indices], y_pred[sorted_indices], color='red', linewidth=2, label='Predicted')
plt.title('Actual vs Predicted (Polynomial Regression)')
plt.xlabel('X')
plt.ylabel('y')
plt.legend()
plt.show()
Complete Code
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
# Generate random data
np.random.seed(42)
X = 6 * np.random.rand(100, 1) - 3
y = 0.5 * X**2 + X + 2 + np.random.randn(100, 1)
poly_features = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly_features.fit_transform(X)
model = LinearRegression()
model.fit(X_poly, y)
y_pred = model.predict(X_poly)
# Flatten the arrays to use in r2_score
y = y.flatten()
y_pred = y_pred.flatten()
# Compute R² using Scikit-Learn
R2_sklearn = r2_score(y, y_pred)
print(f"R² (Scikit-Learn Calculation): {R2_sklearn}")
plt.scatter(X, y, color='blue', label='Actual')
# Sort the values for better plotting
sorted_indices = X.flatten().argsort()
plt.plot(X[sorted_indices], y_pred[sorted_indices], color='red', linewidth=2, label='Predicted')
plt.title('Actual vs Predicted (Polynomial Regression)')
plt.xlabel('X')
plt.ylabel('y')
plt.legend()
plt.show()
Output:
R² (Scikit-Learn Calculation): 0.8525067519009746
Polynomial Regression Curve
Conclusion
Calculating R² with Scikit-Learn is straightforward: r2_score needs only the actual values and the predicted values, so it works whether the predictions come from a fitted model, as in the polynomial example, or from a known formula, as in the first example. This makes it a convenient way to check how well a set of predictions fits the observed data.