Pywedge package for Machine Learning problems
Last Updated: 09 Mar, 2021
When people start learning machine learning and data science, one observation they hear repeatedly is that fitting a machine learning model to a dataset is easy, but preparing the dataset for the task is not. When solving ML problems, we usually have to go through a series of steps before we can find the ML algorithm that best fits our dataset. The major steps are:
- Collection of data: Data can be collected from various real-world sources or created manually.
- Dataset preprocessing: After collecting the raw data, we need to convert it into a meaningful form so that the algorithms can interpret it well. This involves steps such as understanding the data through exploratory data analysis and handling missing values in the dataset (by imputation or manually).
- Feature engineering: Here we apply processes such as converting categorical features into numerical features, standardization, normalization, and feature selection using methods such as the chi-square test or an extra-trees classifier.
- Handling imbalance in the dataset: Sometimes the dataset we collect is highly imbalanced. Fitting a model to such a dataset can give misleading results, because the model tends to be biased towards the most frequently occurring class in the dataset.
- Making baseline models: We fit different ML algorithms to our data and find out which model gives the most accurate results.
- Hyperparameter tuning: After selecting the best of these models, we tune its hyperparameters to increase accuracy by addressing underfitting or overfitting.
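To give a sense of how much code the preprocessing and feature engineering steps above take by hand, here is a minimal sketch using scikit-learn. The toy DataFrame and its column names (temperature, fuel) are made up for illustration and do not come from the dataset used later in this article.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: one numeric column with a gap, one categorical column
df = pd.DataFrame({
    "temperature": [20.5, np.nan, 25.1, 30.2],
    "fuel": ["gas", "coal", "gas", "coal"],
})

# Numeric columns: fill missing values with the mean, then standardize
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])

# Categorical columns: convert categories into one-hot numeric columns
preprocess = ColumnTransformer(
    [
        ("num", numeric_steps, ["temperature"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["fuel"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

features = preprocess.fit_transform(df)
print(features.shape)
```

Even for two columns, the imputation, encoding, and scaling choices each need their own lines of code, and a real dataset multiplies that effort.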
Thus, we have to go through many different steps before getting our desired results. In terms of time, it is commonly estimated that around 80% of the effort goes into preparing the data so that a model can be fit to it, and the remaining 20% into fitting ML algorithms and making predictions. Carrying out all these steps is exhausting, but what if a library could make the task easier?
In this article, we will look at one such open-source Python package, named Pywedge.
What is Pywedge?
Pywedge is an open-source, pip-installable Python package developed by Venkatesh Rengarajan Muthu. It helps automate the task of writing code for data preprocessing, data visualization, feature engineering, handling imbalanced data, building standard baseline models, and hyperparameter tuning, all in an interactive manner.
Features of Pywedge:
- It can make 8 different types of interactive charts, such as scatter plot, pie chart, box plot, bar plot, and histogram.
- It preprocesses data using interactive methods, such as handling missing values, converting categorical features into numerical features, standardization, normalization, and handling class imbalance.
- It automatically fits different ML algorithms to our data and reports the 10 best baseline models.
- We can also apply hyperparameter tuning to our desired model.
Let's use the pywedge library to solve a regression problem in which we have to predict the energy generated by a power plant, using the dataset from Dockship's Power Plant Energy Prediction AI Challenge.
Importing important libraries
Python3
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings("ignore")
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
Loading the training and test dataset:
Python
# Loading testing Data
test_data = pd.read_csv("TEST.csv")
# Loading training Data
data = pd.read_csv("TRAIN.csv")
# Printing the shape of train dataset
data.shape
(8000, 5)
Now, we will check what our dataset looks like using the head() method, and then check some of its information in the subsequent step:
Python
# Displaying the first five rows of the training data
data.head()
From the output of head(), we can infer that our dataset has 5 columns, of which the first four are our features and the last column (PE) is our target.
Python
# Printing column data types and non-null counts
data.info()
Using the info() method, we can see that our dataset has no missing values and that each feature is of type float64.
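The same two facts can also be checked programmatically instead of reading them off the printed output. The sketch below uses a small randomly generated DataFrame as a stand-in for TRAIN.csv. The feature names f1 to f4 are placeholders, and only the target name PE comes from this article.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for TRAIN.csv: four placeholder features plus the PE target
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(8, 5)), columns=["f1", "f2", "f3", "f4", "PE"])

# Number of missing values in each column
missing_per_column = data.isnull().sum()

# True only if every column has dtype float64
all_float = (data.dtypes == "float64").all()

print(missing_per_column)
print(all_float)
```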
Using pywedge library:
Python
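import pywedge as pw
# y is the target column; c names a redundant column (such as an ID) to drop, if any
ppd = pw.Pre_process_data(data, test_data, y='PE', c=None, type="Regression")
# Interactively clean the data; returns processed features, target, and test set
new_X, new_y, new_test = ppd.dataframe_clean()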
We pass the training and test data to pywedge's Pre_process_data to create a Pre_process_data object. This object has a dataframe_clean method, which returns the pre-processed data. The method interactively asks how to convert categorical features into numerical features, and offers a choice of standardization techniques for the dataset.
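For comparison, here is a rough sketch of the standardization step done by hand, assuming standard scaling is the option selected. It uses scikit-learn's StandardScaler on synthetic arrays and is not pywedge's internal code. The scaler is fitted on the training data only and then reused on the test data, so both sets are scaled with the same statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the training and test feature matrices
rng = np.random.default_rng(42)
X_train = rng.normal(loc=50.0, scale=10.0, size=(100, 4))
X_test = rng.normal(loc=50.0, scale=10.0, size=(20, 4))

scaler = StandardScaler()
# Learn the mean and standard deviation from the training data only
X_train_scaled = scaler.fit_transform(X_train)
# Apply the same training statistics to the test data
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0).round(6))
print(X_train_scaled.std(axis=0).round(6))
```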
Preparing baseline models using pywedge:
Next, we assign the preprocessed train and test data and prepare the baseline models:
Python
# Assigning preprocessed data to make train and test data
X_train = new_X
y_train = new_y
X_test = new_test
# calling baseline_model method to prepare all the baseline models
blm = pw.baseline_model(X_train, y_train)
# printing the regression summary
blm.Regression_summary()
[Figure: standard baseline models]
Calling baseline_model creates the object blm, and its Regression_summary() method returns a summary of the fitted models. It shows the top 10 most important features, calculated using an AdaBoost regressor, along with the best baseline models. We can also see how long each algorithm takes to train and make predictions, and the different metrics used to evaluate each model. However, it does not perform any hyperparameter tuning, so the best model can later be fine-tuned to get more accurate results.
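To illustrate what such a baseline comparison involves, here is a hand-written sketch with scikit-learn. It is not pywedge's implementation: the three models, the synthetic data from make_regression, and the RMSE metric are assumptions chosen to keep the example short.

```python
import time

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data with four features, standing in for the real dataset
X, y = make_regression(n_samples=300, n_features=4, noise=5.0, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "LinearRegression": LinearRegression(),
    "RandomForest": RandomForestRegressor(n_estimators=50, random_state=0),
    "AdaBoost": AdaBoostRegressor(random_state=0),
}

results = []
for name, model in models.items():
    start = time.perf_counter()
    model.fit(X_tr, y_tr)            # train
    preds = model.predict(X_val)     # predict on held-out data
    elapsed = time.perf_counter() - start
    rmse = float(np.sqrt(mean_squared_error(y_val, preds)))
    results.append((name, rmse, elapsed))

# Rank the models from lowest to highest validation error
results.sort(key=lambda r: r[1])
for name, rmse, elapsed in results:
    print(f"{name:<18} RMSE={rmse:8.3f}  time={elapsed:.3f}s")

# Feature importances from the fitted AdaBoost regressor
importances = models["AdaBoost"].feature_importances_
print(importances)
```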
Thus, with just a few lines of code, we can quickly find out which machine learning model to use for our problem.
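Since the regression summary does not tune hyperparameters, the chosen model can be fine-tuned afterwards. The sketch below does this with scikit-learn's GridSearchCV rather than with pywedge. The random forest model, the parameter grid, and the synthetic data are all assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data standing in for the preprocessed training set
X, y = make_regression(n_samples=200, n_features=4, noise=5.0, random_state=0)

# Small illustrative grid; a real search would usually cover more values
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    scoring="neg_root_mean_squared_error",  # higher (closer to 0) is better
    cv=3,
)
search.fit(X, y)

print(search.best_params_)
print(-search.best_score_)  # cross-validated RMSE of the best setting
```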