How to Interpret Results of h2o.predict in R

Last Updated : 26 Aug, 2024

H2O.ai develops a machine learning and data science platform, offering both open-source and enterprise-level solutions. The platform runs computations in memory and in parallel across a cluster, which makes it suitable for large-scale data analysis. It provides a suite of tools designed to simplify and accelerate the development of machine learning models.

Key features of H2O.ai

The main features of H2O.ai are listed below.

Scalability: H2O.ai supports distributed computing, which lets it handle large datasets and complex computations efficiently.
Algorithm Variety: It includes a wide range of machine learning algorithms, from traditional models like linear regression and decision trees to advanced techniques like gradient boosting machines and deep learning.
Easy to Use: With user-friendly interfaces and integration with popular programming languages (R, Python, Scala), H2O.ai is accessible to both new and experienced data scientists.
Automatic Machine Learning (AutoML): H2O.ai offers AutoML capabilities, allowing users to build and optimize machine learning models with minimal manual intervention.
Deployment: It supports seamless model deployment, including integration with REST APIs and real-time scoring.

Overview of h2o.predict Function in R

The h2o.predict function in R applies a trained H2O model to new data and returns the predictions as an H2O frame (H2OFrame) with one row for each row of the input. It can be used for classification, regression, or other tasks depending on the type of model. The columns of the output depend on the model and task: a predicted class plus class probabilities for classification, and a single column of predicted values for regression.

h2o.predict(object, newdata)
where:
object: The trained H2O model object from which predictions are to be made.
newdata: The H2O frame (H2OFrame) containing the new data to score. A plain R data frame must first be converted with as.h2o().
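The call pattern above can be tried end to end on a small built-in dataset. The following is a minimal sketch, not part of the walkthrough later in this article: the mtcars data, the choice of wt and hp as predictors, and the new_cars values are assumptions made only for illustration. It needs a working H2O installation (including Java) so that h2o.init() can start a local cluster.

```r
library(h2o)
h2o.init()

# Train a small regression model on a built-in dataset (illustrative only)
cars_hf <- as.h2o(mtcars)
model <- h2o.glm(x = c("wt", "hp"), y = "mpg", training_frame = cars_hf)

# newdata must be an H2OFrame, so convert a plain data.frame first
new_cars <- data.frame(wt = c(2.5, 3.2), hp = c(110, 150))  # made-up values
pred <- h2o.predict(model, newdata = as.h2o(new_cars))

# The result is an H2OFrame with one row per row of newdata
class(pred)
colnames(pred)
nrow(pred)

# Pull the predictions back into a regular R data.frame for further work
pred_df <- as.data.frame(pred)
```

Converting the result with as.data.frame() copies it from the H2O cluster into R memory, which is convenient for small outputs but worth avoiding for very large frames.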

Types of Outputs Based on Model Type

The structure of the returned frame depends on the type of model.

1. Regression Models

Output Type: Continuous values.
Description: For regression tasks, h2o.predict returns predicted continuous values, which represent the model’s estimate of the target variable.
Output Component: A single column, named predict, in the H2O frame with the predicted continuous values.

If the model is trained to predict house prices, the output will be predicted prices. The output can look like:

predict
-----------------
250000
275000
300000
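Predicted values become easier to interpret once they are compared with known outcomes. The sketch below uses the three predicted prices from the example above together with hypothetical actual sale prices (made up for illustration) and computes the residuals, the mean absolute error, and the root mean squared error in base R. With a real model, the predicted vector would come from as.data.frame() applied to the prediction frame.

```r
# Predicted prices from the example output above
predicted <- c(250000, 275000, 300000)

# Hypothetical actual sale prices, made up for illustration
actual <- c(245000, 290000, 298000)

# Residual = actual minus predicted; positive means the model under-predicted
residuals <- actual - predicted

# Two common summaries of the prediction error, in the units of the target
mae  <- mean(abs(residuals))
rmse <- sqrt(mean(residuals^2))

print(residuals)
print(c(MAE = mae, RMSE = rmse))
```

RMSE penalizes large misses more heavily than MAE, so a wide gap between the two suggests that a few predictions are far off.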

2. Classification Models

Output Type: Class labels and/or prediction probabilities.
Description: For classification tasks, h2o.predict provides two types of output:
Class Labels: The most likely class for each observation.
Prediction Probabilities: The probability that the observation belongs to each class.
Output Components: A predict column with the class label, followed by one probability column per class (p0 and p1 for a binary 0/1 response).

For a binary classification model predicting whether an email is spam (1) or not spam (0), the output includes the predicted class and the probability of each class. The output can look like:

predict  p0    p1
-----------------
0        0.95  0.05
1        0.20  0.80
0        0.90  0.10
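The probability columns carry more information than the label alone. As a minimal sketch, the three example rows above are rebuilt here as a plain data.frame. The code checks that the two probabilities in each row sum to 1 and derives the probability of the predicted class. The column name confidence is a hypothetical helper, not something h2o.predict returns.

```r
# The three example rows shown above, as a plain data.frame
pred_df <- data.frame(
  predict = c(0, 1, 0),
  p0 = c(0.95, 0.20, 0.90),
  p1 = c(0.05, 0.80, 0.10)
)

# p0 and p1 are complementary, so each row sums to 1
row_sums <- pred_df$p0 + pred_df$p1

# Probability of the predicted class: a rough measure of how sure the model is
# ("confidence" is a helper column added here, not an H2O output)
pred_df$confidence <- ifelse(pred_df$predict == 1, pred_df$p1, pred_df$p0)

# Show the least confident predictions first
pred_df[order(pred_df$confidence), ]
```

Sorting by this value is a quick way to find the borderline cases that deserve a closer look.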

3. Time-Series Models

Output Type: Forecasted values.
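Description: The open-source H2O-3 package does not include a dedicated time-series algorithm, so forecasting is usually framed as a regression problem on lagged or date-derived features. h2o.predict then returns one forecast for each row of the new data, so the frame passed in must contain a row for every future time point to be forecast.
Output Component: A single predict column in the H2O frame with the forecasted values.

If forecasting monthly sales, the output might include sales predictions for future months. The output can look like:

predict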
-----------------
5000
5200
5400
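One common way to obtain forecasts like these from a general-purpose regression model is to reshape the series into supervised rows, where each row holds the current value and a few previous values (lags). The sketch below does this in base R. The sales figures are made up for illustration and the choice of two lags is an assumption.

```r
# Hypothetical monthly sales history, made up for illustration
sales <- c(4200, 4400, 4500, 4700, 4800, 5000)

# Build a supervised table: each row pairs a month's sales with the
# sales of the previous month (lag1) and the month before that (lag2)
n <- length(sales)
lagged <- data.frame(
  sales = sales[3:n],
  lag1  = sales[2:(n - 1)],
  lag2  = sales[1:(n - 2)]
)

print(lagged)
```

A table like lagged could then be converted with as.h2o() and used as the training frame for a regression model, with sales as the response and lag1 and lag2 as predictors. The first two months are dropped because they have no complete lag history.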

The following steps walk through interpreting the results of h2o.predict in the R programming language.

Step 1: Load the data

First we will load the dataset.

# Load the H2O library and start the H2O cluster
library(h2o)
h2o.init()

# Import the dataset
data_url <- "https://github.com/YBI-Foundation/Dataset/raw/main/Diabetes.csv"
data <- h2o.importFile(data_url)

# Display the first few rows of the dataset
head(data)

Output:

  pregnancies glucose diastolic triceps insulin  bmi   dpf age diabetes
1           6     148        72      35       0 33.6 0.627  50        1
2           1      85        66      29       0 26.6 0.351  31        0
3           8     183        64       0       0 23.3 0.672  32        1
4           1      89        66      23      94 28.1 0.167  21        0
5           0     137        40      35     168 43.1 2.288  33        1
6           5     116        74       0       0 25.6 0.201  30        0

Step 2: Data Preparation

Now we will prepare the dataset.

# Check the structure of the dataset
str(data)

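# Define the response variable (the outcome column is named "diabetes" in this dataset)
response <- "diabetes"

# Convert the 0/1 response to a factor so H2O treats the task as classification
data[, response] <- as.factor(data[, response])

# Use every other column as a predictor
predictors <- setdiff(names(data), response)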

# Split the data into training and test sets
splits <- h2o.splitFrame(data, ratios = 0.8)
train <- splits[[1]]
test <- splits[[2]]

Output:

Class 'H2OFrame' <environment: 0x00000201e281fef0> 
 - attr(*, "op")= chr "Parse"
 - attr(*, "id")= chr "Key_Frame__https___github_com_YBI_Foundation_

 - attr(*, "eval")= logi FALSE
 - attr(*, "nrow")= int 768
 - attr(*, "ncol")= int 9
 - attr(*, "types")=List of 9
  ..$ : chr "int"
  ..$ : chr "int"
  ..$ : chr "int"
  ..$ : chr "int"
  ..$ : chr "int"
  ..$ : chr "real"
  ..$ : chr "real"
  ..$ : chr "int"
  ..$ : chr "int"
 - attr(*, "data")='data.frame':	10 obs. of  9 variables:
  ..$ pregnancies: num  6 1 8 1 0 5 3 10 2 8
  ..$ glucose    : num  148 85 183 89 137 116 78 115 197 125
  ..$ diastolic  : num  72 66 64 66 40 74 50 0 70 96
  ..$ triceps    : num  35 29 0 23 35 0 32 0 45 0
  ..$ insulin    : num  0 0 0 94 168 0 88 0 543 0
  ..$ bmi        : num  33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0
  ..$ dpf        : num  0.627 0.351 0.672 0.167 2.288 ...
  ..$ age        : num  50 31 32 21 33 30 26 29 53 54
  ..$ diabetes   : num  1 0 1 0 1 0 1 0 1 1

Step 3: Train a Logistic Regression Model

Now we will train a logistic regression model and use it to score the test set.

# Train a logistic regression model with the correct response variable
logistic_model <- h2o.glm(x = predictors, y = response, training_frame = train,
                          family = "binomial")

# Make predictions on the test set
logistic_predictions <- h2o.predict(logistic_model, test)

# View the first few rows of the predictions
head(logistic_predictions)

Output:

  predict         p0         p1
1       1 0.20352816 0.79647184
2       0 0.94999001 0.05000999
3       1 0.12371666 0.87628334
4       1 0.44333930 0.55666070
5       1 0.62917778 0.37082222
6       1 0.06512674 0.93487326
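
Row 5 in the output above is labelled 1 even though p1 is only about 0.37. For binomial models H2O does not cut at 0.5. By default it assigns the label using the threshold that maximizes the F1 score on the model's own metrics. The sketch below copies the six rows shown above into a plain data.frame and lists the rows where H2O's label differs from a 0.5 cutoff. The name label_at_half is a hypothetical helper.

```r
# First six predictions, copied from the output above
pred_df <- data.frame(
  predict = c(1, 0, 1, 1, 1, 1),
  p0 = c(0.20352816, 0.94999001, 0.12371666, 0.44333930, 0.62917778, 0.06512674),
  p1 = c(0.79647184, 0.05000999, 0.87628334, 0.55666070, 0.37082222, 0.93487326)
)

# Labels that a plain 0.5 cutoff on p1 would give (helper, not an H2O output)
label_at_half <- as.integer(pred_df$p1 >= 0.5)

# Rows where H2O's label differs from the 0.5 cutoff
differs <- which(pred_df$predict != label_at_half)
pred_df[differs, ]
```

In a live session, the threshold H2O used should be retrievable with h2o.find_threshold_by_max_metric(h2o.performance(logistic_model, train = TRUE), "f1"), assuming the model was trained without a validation frame. When a different cutoff suits the problem better, apply it to the p1 column directly rather than relying on the predict column.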

Step 4: Visualization of the model

Now we will visualize how well the model separates the two classes by plotting an ROC curve with the pROC package.

# Load pROC, which provides roc() (run install.packages("pROC") first if needed)
library(pROC)

# Convert the test response and the predictions to R data frames
test_response <- as.factor(as.data.frame(test)[[response]])
logistic_pred_df <- as.data.frame(logistic_predictions)
probabilities <- logistic_pred_df$p1

# Verify levels of the response variable
print(levels(test_response))

# Calculate and plot the ROC curve
roc_data <- roc(test_response, probabilities)
plot(roc_data, main = "ROC Curve for Logistic Regression")

Output:

[1] "0" "1"
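
An ROC curve is usually summarized by the area under it (AUC): the probability that a randomly chosen positive case receives a higher p1 than a randomly chosen negative case. The sketch below computes this with the rank-sum identity in base R rather than with pROC. It uses the six p1 values from Step 3 and hypothetical true labels, since the article does not show the actual labels for those rows.

```r
# p1 values copied from the Step 3 output
p1 <- c(0.79647184, 0.05000999, 0.87628334, 0.55666070, 0.37082222, 0.93487326)

# Hypothetical true labels for those six rows, made up for illustration
actual <- c(1, 0, 1, 0, 1, 1)

# AUC via the rank-sum (Mann-Whitney) identity:
# sum the ranks of the positive cases, subtract the smallest possible sum,
# and divide by the number of positive-negative pairs
n_pos <- sum(actual == 1)
n_neg <- sum(actual == 0)
auc <- (sum(rank(p1)[actual == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

print(auc)
```

On the real objects from this walkthrough, pROC's auc(roc_data) or h2o.auc(h2o.performance(logistic_model, newdata = test)) should report the same quantity for the full test set. A value of 0.5 corresponds to random ranking and 1 to perfect separation.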

Conclusion

Interpreting the results of h2o.predict comes down to knowing what each column of the returned frame means for the type of model at hand. Key points include extracting the correct probability column (p1 for the positive class), making sure the predicted probabilities and the actual values have the same length and row order, and reading the results in the context of model evaluation.

shalini_chabarwal