
REF-10-Automated_Machine_Learning_The_New_Wave_of_Machine_Learning

The document discusses the advancements in Automated Machine Learning (AutoML) and its significance in automating the machine learning model development pipeline. It highlights the challenges faced in traditional model training and the need for efficient AutoML tools that can enhance accuracy and reduce human error. The paper also reviews various AutoML frameworks, their components, and applications, concluding with future research directions in the field.


Proceedings of the Second International Conference on Innovative Mechanisms for Industry Applications (ICIMIA 2020)

IEEE Xplore Part Number: CFP20K58-ART; ISBN: 978-1-7281-4167-1

Automated Machine Learning: The New Wave of Machine Learning
Karansingh Chauhan¹, Shreena Jani¹, Dhrumin Thakkar¹, Riddham Dave¹, Jitendra Bhatia¹, Sudeep Tanwar², Mohammad S. Obaidat³ (Fellow of IEEE and Fellow of SCS)
¹ Vishwakarma Government Engineering College, Gujarat Technological University, Ahmedabad, India
² Institute of Technology, Nirma University, Ahmedabad, India
³ Dean of College of Computing and Informatics, University of Sharjah, UAE, King Abdullah II School of IT, University of Jordan, Jordan, and University of Science Ming Chuan University, Taiwan

Abstract—With the explosion in the use of machine learning in various domains, the need for an efficient pipeline for the development of machine learning models has never been more critical. However, the task of forming and training models largely remains traditional, with a dependency on domain experts and time-consuming data manipulation operations, which impedes the development of machine learning models in both academia and industry. This demand motivates a new research area concerned with fitting machine learning models fully automatically, i.e., AutoML. Automated Machine Learning (AutoML) is an end-to-end process that aims at automating this model development pipeline without any external assistance. First, we provide an overview of AutoML. Second, we delve into the individual segments of the AutoML pipeline and cover their approaches in brief. We also provide a case study on the industrial use and impact of AutoML with a focus on practical applicability in a business context. Finally, we conclude with the open research issues and future research directions.

Index Terms—Automated Machine Learning, Artificial Intelligence, Meta Learning, Hyperparameter Optimization

I. INTRODUCTION

Data analysis is a powerful tool for gaining insights on how to improve decision making, business models and even products. This involves the construction and training of a machine learning model, which faces several challenges due to a lack of expert knowledge [1]. These challenges can be overcome by using automated machine learning (AutoML). AutoML refers to the process of studying a traditional machine learning model development pipeline to segment it into modules and automate each of those to accelerate the workflow. With the advent of deeper models, such as the ones used in image processing [2], Natural Language Processing [3], etc., there is an increasing need for tailored models that can be crafted for specific workloads. However, such specific models require immense resources such as high-capacity memory, strong GPUs, domain experts to help during the development, and long wait times during training. The task becomes critical because little work has been done on creating a formal framework for deciding model parameters without the need for trial and error. These nuances emphasize the need for AutoML, where automation can reduce turnaround times and also increase the accuracy of the derived models by removing human errors. In recent years, several tools and models have been proposed in the domain of AutoML. Some of these focus on particular segments of AutoML such as feature engineering or model selection, whereas others attempt to optimize the complete pipeline. These tools have matured enough to compete with human experts in Kaggle competitions and at times have beaten them, showcasing their effectiveness. There is a wide variety of applications based on AutoML, such as autonomic cloud computing [4] [5], intelligent vehicular networks, blockchain [6], and Software Defined Networking [7] [8], among others.

This paper aims at providing an overview of the advances seen in the realm of AutoML in recent years. We focus on individual aspects of AutoML and summarize the improvements achieved in recent years. The motivation of this paper stems from the unavailability of a compact study of the current state of AutoML. While we acknowledge the existence of other surveys [9] [10] [11], they either provide an in-depth understanding of a particular segment of AutoML, provide just an experimental comparison of various tools, or are fixated on deep learning models. The primary contributions of this paper are threefold:

1) We segment the AutoML pipeline into parts and review the contributions in each of these segments.
2) We explore the various state-of-the-art tools currently available for AutoML and evaluate them.
3) We also incorporate the advancements seen in machine learning, which seem to have been overshadowed by deep learning in recent years.

978-1-7281-4167-1/20/$31.00 ©2020 IEEE 205



The rest of the paper is organized as follows. Section I-A describes the problem definition of AutoML. Section II covers the contributions in AutoML, with each subsection reviewing a specific segment. Section III discusses the recent trends and advancements seen in the domain of AutoML. Section IV covers a case study of the use of AutoML in insurance. Section V concludes the paper and provides future directions for the work that needs to be done in AutoML.

A. Automated Machine Learning

AutoML is the process of automating the end-to-end application of machine learning to real-world problems. The problem of AutoML is a combinatorial one, where any proposed algorithm is required to find a suitable combination of operations for each segment of the ML pipeline to minimize the errors. Mathematically, AutoML can be expressed as:

$$ {}^{O_P}C_{O_S} + 2^{N} \cdot G(f_i, f_j)\, P_{N_M} + \sum_{m' \in M} \sum_{r \in R} P(m', r \mid m)\, P\big(r + \gamma \cdot v(m')\big) \tag{1} $$

where,
$O_P$ is the standard pre-defined operation set
$O_S$ indicates the operations selected by the algorithms
$G(f_i, f_j)$ specifies the generator function for creating new features
$N$ represents the number of selected features
$N_M$ is the maximum number of features to be selected

The standard data pre-processing operations are well defined and discussed in Section II-A. While a completely raw data collection cannot be processed with these standard operations, datasets are usually refined to some extent and can work with such operations well. The automation of data pre-processing is defined as a series of actions that are selected ($O_S$) from the standard pre-defined operation set ($O_P$) and performed on the dataset. Feature engineering is performed by selecting relevant features ($2^N$) from the dataset by finding dependent pairs ($(f_i, f_j)$) and using them for generating new features ($G(f_i, f_j)$). Model selection and hyperparameter optimization work on finding the optimal parametric configuration from an infinite search space or on learning it (reinforcement learning) from previous models designed for various tasks. The final term in equation 1 represents the probabilistic reinforcement learning used in recent years for constraining the configuration space.

The solution-space explosion due to exponentials and factorials, as shown in equation 1, is the core issue of AutoML. This explosion causes a high computational expense and voids any accuracy advantage over humans. To address this problem, various research works allow a parameter configuration to granularly adjust the volume of the search space explored by any algorithm. Some works have removed the combination configurations deemed ineffective based on previous experience. Many frameworks are available for AutoML; they are summarized in Table I.

II. RELATED WORK

This section describes the various segments of AutoML as per the taxonomy shown in Figure 1. We present the most notable contributions seen in the domain of AutoML. We compare the various approaches adopted for each individual segment of AutoML.

Fig. 1. Taxonomy of AutoML
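
As a rough illustration of the solution-space explosion discussed in Section I-A, the following Python sketch counts candidate configurations for small inputs. It assumes one plausible reading of the terms of equation 1: unordered selection of pre-processing operations (combinations), subsets of N features (2^N), and ordered selection of generated features (permutations). The function names and sample sizes are illustrative assumptions and do not come from any AutoML framework.

```python
import math


def preprocessing_choices(n_available, n_selected):
    """Unordered ways to pick n_selected operations out of n_available."""
    return math.comb(n_available, n_selected)


def feature_subsets(n_features):
    """Number of distinct subsets of n_features features, i.e. 2^N."""
    return 2 ** n_features


def ordered_feature_picks(n_candidates, n_max):
    """Ordered ways to pick n_max features out of n_candidates."""
    return math.perm(n_candidates, n_max)


# Print how each count grows as the (assumed) problem size n increases.
for n in (5, 10, 20):
    print(n,
          preprocessing_choices(2 * n, n),
          feature_subsets(n),
          ordered_feature_picks(2 * n, n))
```

Even at these toy sizes the counts grow far faster than the inputs, which is consistent with the motivation given above for restricting or pruning the search space.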

TABLE I
SUMMARY OF THE AUTOML FRAMEWORKS

| AutoML Tools | Data source: Structured | Data source: Unstructured | Preprocessing | FE | ML Task: Supervised | ML Task: Unsupervised | Model Selection and HPO Techniques | NAS | Meta-Learning | UI | Authorization |
|---|---|---|---|---|---|---|---|---|---|---|---|
| H2O-AutoML | ✓ | ✗ | ✓ | ✓ | ✓ | ✗ | Random search and Grid search | ✗ | ✓ | ✓ | Open Source |
| DataRobot | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | Random search, Bayesian optimization | ✓ | ✗ | ✓ | Proprietary |
| Cloud AutoML | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | Genetic algorithm, Random search, Bayesian optimization | ✓ | ✓ | ✓ | Proprietary |
| TPOT | ✓ | ✗ | ✗ | ✓ | ✓ | ✗ | Genetic algorithm | ✗ | ✗ | ✗ | Open Source |
| Auto-Keras | ✓ | ✓ | ✗ | ✓ | ✓ | ✗ | Random search, Bayesian optimization | ✓ | ✗ | ✗ | Open Source |
| Auto-Weka | ✓ | ✗ | ✗ | ✓ | ✓ | ✗ | Random search, Bayesian optimization | ✗ | ✓ | ✓ | Open Source |
| ML BOX | ✓ | ✗ | ✓ | ✓ | ✓ | ✗ | SMAC Bayesian optimization | ✗ | ✓ | ✓ | Open Source |
| AutoSklearn | ✓ | ✗ | ✗ | ✓ | ✓ | ✓ | Random search, Bayesian optimization | ✗ | ✓ | ✗ | Open Source |
| Auto-Pytorch | ✓ | ✗ | ✓ | ✓ | ✓ | ✓ | Multi-fidelity optimization, Bayesian optimization (BOHB) | ✓ | ✓ | ✗ | Open Source |
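
Grid search and random search recur in the Model Selection and HPO Techniques column of Table I (both are described in Section II-C). The sketch below contrasts them on a toy problem, assuming a hypothetical `objective` function that stands in for training and validating a model. The search space, parameter names, and trial budget are illustrative assumptions only and are not taken from any of the listed tools.

```python
import itertools
import random

# Hypothetical search space: every name and value here is an illustrative assumption.
SPACE = {"lr": [0.001, 0.01, 0.1, 1.0], "depth": [2, 4, 6, 8]}


def objective(cfg):
    """Toy validation score (higher is better); stands in for training a model."""
    return -((cfg["lr"] - 0.1) ** 2) - 0.01 * (cfg["depth"] - 6) ** 2


def grid_search(space):
    """Evaluate every combination in the grid; return (best config, evaluations)."""
    keys = list(space)
    configs = [dict(zip(keys, vals))
               for vals in itertools.product(*(space[k] for k in keys))]
    return max(configs, key=objective), len(configs)


def random_search(space, n_trials, seed=0):
    """Evaluate n_trials independently sampled configs; return (best config, evaluations)."""
    rng = random.Random(seed)
    configs = [{k: rng.choice(v) for k, v in space.items()}
               for _ in range(n_trials)]
    return max(configs, key=objective), len(configs)


grid_best, grid_evals = grid_search(SPACE)
rand_best, rand_evals = random_search(SPACE, n_trials=5)
print(grid_best, grid_evals)
print(rand_best, rand_evals)
```

Grid search always pays for the full product of the value lists (16 evaluations here), whereas random search spends a fixed budget and can only match the grid optimum if it happens to sample it.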

A. Data preprocessing

Data pre-processing guarantees the delivery of quality data derived from the original dataset. It is an important step because quality data is often unavailable: a large portion of the information generated and stored is usually semi-structured or even unstructured in form. However, even though it is a crucial part of any machine learning pipeline, it is reported to be the least enjoyable part, with authors [12] [13] stating that 60-80% of data scientists find it to be the most mundane and tedious job.

In AutoML, certain data pre-processing operations are hard-coded and then applied to a given dataset in certain combinations such that the overall clarity and usability of the data increase [14]. We have broadly classified these operations into the following categories based on our survey of recent papers, as seen in Figure 2.

Fig. 2. Data pre-processing pipeline

1) Data Imputation: Real-world datasets often contain missing values for various reasons (human error or unavailability of the data). There are two fundamental kinds of missing data as described by [15]: missing at random (MAR) data and missing completely at random (MCAR) data. The randomness of MCAR data is high enough that there is no overall bias towards any particular class, unlike MAR data, which can cause an increase in bias. In data imputation, we deal with inconsistencies such as NaNs, spaces, null values, incorrect data types, etc. These are addressed by replacing such values using one of several methods, such as default value selection, in which every problematic value is removed and a pre-selected value takes its place. Another approach is to use the mean or median of the dataset column [16] to replace any missing value. Some approaches, such as regression imputation, have used the standard deviation and variance to compute the replacement value [17] for a given data column. Some data imputation techniques with lighter time constraints use the successive halving approach in Auto-WEKA [18]. The XGBoost algorithm [19] is also widely used in the TPOT tool [20] and Auto-WEKA [21] for data imputation.

2) Data Balancing: Data imbalance is a condition in which one or more classes in a categorical dataset have more observations than the rest of the classes. Feeding such imbalanced data to a model gives the majority class an unjustified bias [22]. The sample handling approach to data balancing preprocesses the training set to minimize class differences and thereby resolve this issue. Two techniques ordinarily utilized to tackle data imbalance are over-sampling and under-sampling [23], in which data points are repeated or removed, respectively, to maintain class balance across the dataset. Another approach to boost classifier performance in case of an imbalance is ensemble learning derived from Breiman's work [22], in which the existing dataset is augmented to generate more data points in an attempt to increase training data. A further approach for tackling data imbalance is to use cost-sensitive learning [24] instead of changing the dataset. Cost-sensitive learning uses a variable cost of misclassification to counter the bias of an imbalanced class distribution. It is suitable for a highly skewed dataset where certain classes are minorities. In the case of AutoML, tools such as TPOT [25] provide an option in their API to adjust class-specific sensitivity for skewed classes.

3) Data Encoding: To make the data human-readable, the training data is often labelled in words. Data encoding refers to converting the provided feature labels into numerical form to allow machines to interpret them. Some of the common forms of data encoding are ordinal, one-hot, binary, hashing, target encoding, etc. Target encoding is the process of replacing a categorical value with the mean/median/mode of the target variable over the rows that share that category. While other label encodings assign incremental values or binary columns to every label, the values assigned are not representative of any property of the given data. Target encoding assigns a meaningful number to each label, which represents a certain property of the data, such as the fraction of true values in the target variable. H2O.AI is an AutoML tool that makes use of target encoding in its API. Auto-Sklearn uses one-hot encoding [25].

B. Feature Engineering

Features obtained from a dataset are seldom used as-is. We generally perform some operations to generate new features that are well-suited for a given problem. For example, in a housing dataset, we may want to combine the length and breadth of the property to compute its area, which is a better feature and is regularly used in the real-estate domain. However, the creation of new features from existing features is an artful and domain-specific process. AutoML deals with this in three distinct segments:

1) Feature Mining: The dataset generated after pre-processing contains features, some of which are useful for model training, whereas others contribute minimally to the training phase of the machine learning model. Feature mining is responsible for picking out the impactful features from a given dataset. This is done by computing the relevant feature pairs. The measure of relevance is usually defined as the information gain or by measuring the relationship between any feature pair. AutoLearn [26] uses a cosine-transform and measures the euclidean distance on these transforms to determine the feature pairs which correlate.

2) Feature Generation: Feature generation is the process of combining pre-existing features to generate new features. AutoLearn [26] achieves this by performing ridge regression over feature pairs to map the relationships and considers the
newly generated mapping as a relationship. COGNITO [27] uses a series of standard operations over a feature tree to generate new features, as seen in Fig. 3. LFE (Learning Feature Engineering) [28] improves on COGNITO by learning, over previous datasets, the relationship between features and the transforms that generated better accuracy outcomes. These transforms are evaluated for a given dataset by LFE to determine the operations which will lead to an increase in the performance of the machine learning model. Reinforcement learning has also been used by Khurana et al. [29] for generating features. They leveraged the feature tree structure used in COGNITO and incorporated a traversal policy to optimize transform exploration. Reinforcement learning encourages the exploration of transforms that are beneficial to the overall model and also applies a budget constraint. This constraint is needed to prevent the algorithm from performing an exhaustive search over the feature graph.

Fig. 3. An example of the feature tree generated by [27]

3) Feature Selection: Feature generation is an iterative process that leads to an explosion in the total number of features. This is controlled by the selection phase based on the impact of a particular feature on the overall accuracy of the model. This is usually measured either by using a rank function or by measuring the loss of the model when a particular feature is excluded and included. ExploreKit [30] uses a novel ranking function based on a meta-feature classifier to determine which generated features are important. OneBM [31] utilizes the chi-square hypothesis test to determine the features most relevant to the performance of the machine learning model. Information gain has also been used as a criterion for feature selection [26]. The above-described methods set a threshold to select only the most relevant features generated in an AutoML pipeline.

C. Model Selection and Hyperparameter Optimization

The core of any machine learning pipeline is the model used to perform the prediction task. However, a single problem may have multiple model configurations with varying accuracy to tackle it. Hence, it is crucial to determine the most appropriate model keeping in mind the accuracy-execution time tradeoff. Conventionally, domain experts with previous experience approximate a model to be used. This manual task follows an iterative trial-and-error approach for determining the model to be used, as shown in Fig. 4. Once a model is finalized, hyperparameter optimization is again performed manually to generate the final model. AutoML automates these steps to reduce human dependency.

Fig. 4. Hyperparameter optimization by trial and error

Most AutoML tools and methods combine the problems of model selection and hyperparameter optimization into a single problem called the CASH (Combined Algorithm Selection and Hyperparameter optimization) problem. The CASH problem treats model selection and hyperparameter optimization as a single hierarchical hyperparameter optimization problem. At the root level resides a hyperparameter that selects between different learning algorithms or models. At the next level, model-specific hyperparameters are placed, which are optimized to generate the final model. Auto-WEKA [21] and SmartML [32] are some of the tools that consider model selection and hyperparameter optimization as a single problem. The following approaches address the CASH problem:

1) Grid search: Grid search is the traditional approach for the systematic exploration of the hyperparameter configuration space. It is simply a brute-force algorithm that searches through a pre-specified subset of the hyperparameter space of the specific learning algorithm. The algorithm can be parallelized across multiple models with different configurations to accelerate the search. However, due to its brute-force nature, it is a very costly approach: for N different hyperparameters each having just two possible values, there are a total of $2^N$ possible configurations [33].

2) Random Search: To alleviate the exhaustive enumeration of combinations in a grid search, random search chooses random values from the hyperparameter subset independently. By navigating the grid of hyperparameters randomly, one can obtain a performance similar to that of a full grid search. Despite its simplicity, this approach is surprisingly effective. It is also well suited for gradient-free functions with many local minima [34]. Random search can outperform grid search in scenarios where only a small number of hyperparameters affects the final
performance of the model [35].

3) Sequential Model-Based Optimization: Random search and grid search evaluate hyperparameter configurations independently of each other and often end up performing repeated and wasteful computations. To address these shortcomings, Sequential Model-based Optimization (SMBO) [36] was proposed, which uses a combination of regression and Bayesian optimization to select hyperparameters. It sequentially applies the hyperparameters and adjusts their values based on the Bayesian heatmap, which is a probabilistic distribution, as shown in Figure 5. The probabilistic approach of SMBO resolves the scalability issues that were rampant in grid search and random search.

Fig. 5. A comparative example of hyperparameter selection behaviour of various strategies. Notice the selection clustering in the case of SMBO near the high scoring regions

To further improve this Bayesian optimization approach, the work in [37] introduces a deep neural network for global optimization of the hyperparameters. For neural networks, Mendoza et al. [38] introduced Auto-Net, an AutoML tool based on Bayesian optimization for tuning neural networks. The tool uses Stochastic Gradient Descent (SGD) as its optimizer for hyperparameter optimization. The authors have also demonstrated a combined approach of using Auto-Net and Auto-SKLearn to outperform human competitors by a significant margin of 10% in the AutoML Challenge 2018 [18].

The authors in [39] further improve upon the previous approaches with a method that can be generalized across datasets using a transfer learning strategy. They achieve this by constructing a common hyperparameter surface from the previous hyperparameter selection plane and the target model's hyperparameters. SmartML [32] is a meta-learning based framework for automated hyperparameter tuning and selection of ML models. It continuously learns from a given dataset and stores information about the meta-features of all the previously processed datasets to increase performance.

4) Evolutionary optimization: Evolutionary optimization is inspired by biological evolution, which follows survival of the fittest. Such algorithms work by generating random agents, which perform a particular task and are scored on their performance. The agents are evaluated, and a breeding algorithm generates new child agents derived from the best agents in the former generation. This new generation again performs the same given task, and the cycle continues, as shown in Figure 6. For AutoML, evolutionary algorithms are used for tackling hyperparameter optimization by searching the configuration space for a given model.

Fig. 6. An example of the evolutionary algorithm where every generation of best agents breed to generate next generation

Tree-based Pipeline Optimization Tool (TPOT) [40], an AutoML tool, makes use of evolutionary algorithms (genetic programming in particular) and has demonstrated its effectiveness on simulated and real-world genetic datasets. Autostacker [41] is another AutoML architecture that uses evolutionary models to optimize hyperparameter search. It produces candidate pipelines in each generation and evolves itself.

III. DISCUSSION

Even though data pre-processing consumes a large chunk of time in an ML pipeline, it is astonishing how little work has been done to automate it. For data pre-processing, it can be noted that while the existing approaches are adequate for structured and semi-structured data, work still needs to be done to assimilate unstructured data. We suggest the incorporation of data-mining methods, as they can deal with such unformed data. This can allow AutoML pipelines to create models capable of learning from Internet sources. In feature engineering, it should be noted that most methods used until now adhere to supervised learning. However, dataset specificity is high, and therefore AutoML pipelines should be as generic as possible to accommodate diverse datasets. Therefore, a gradual paradigm shift towards unsupervised learning is required to increase the ability of AutoML. To replace domain experts, feature generation should be able to work flexibly (such as through the introduction of non-standard transforms) with the original feature sets. Reinforcement learning is a step in the right direction and needs to be integrated further with feature engineering. Hyperparameter optimization has seen large improvements over the years, especially with the introduction of Bayesian optimization strategies such as
SMBO. However, the use of a continuously integrating meta-learning framework needs to be researched, as its performance gain is high. Transfer learning has also been successfully used in the context of AutoML to show promising results. With the increase in the availability of task-specific pre-trained models, an increase in the usage of transfer learning should be expected.

IV. CASE STUDY: IMPACT OF AUTOML AT GNP SEGUROS, AN INSURANCE COMPANY

The insurance industry usually prefers a data-driven approach to solve business problems. The multi-source data, generated in massive amounts, created the need for machine learning for further analysis and predictions. The significant challenges faced by the industry are the detection of false claims, utilization of unstructured data collected by the marketing and sales teams, automation of transaction and claim processing, and personalization of solutions for customers, among others.

GNP (Grupo Nacional Provincial) is one of the largest insurance companies in Mexico. Like any large and well-established company, GNP is undergoing a profound transformation to modernize its information systems and operations. To achieve this, the company is utilizing cloud resources to centralize the generalized computations [42]. GNP is making significant efforts to organize and utilize all the operational information of the company in a central data lake [43] [44]. To extract value from the data lake, the company has begun to apply machine learning for gaining insights as well as predicting and improving the company's performance based on its domain-specific factors [45]. For such a data-driven approach, a team of highly trained data scientists is required, which is financially taxing. In the earlier stages, the company's data scientists built and trained various models manually and thus achieved moderate accuracy for the prediction problem. To improve accuracy and reduce the amount of time and expense, GNP adopted a tool called AutoML Tables provided by Google Cloud to simplify and speed up the creation of ML models and mitigate the scarcity of highly trained data scientists. The company utilizes the tool to solve problems such as car claim risk, detection of fraudulent healthcare claims, and gender labeling, which are discussed in detail below.

The car claim risk problem is defined as the task of predicting the probability of a car having an accident using the given features/characteristics of the insured car and the owner. The company spends about USD 550 million on car damage claims annually; hence, predicting the risk accurately is the primary intention of the company. To solve the prediction problem, the company trained the model using the AutoML Tables tool, which utilized 21 feature columns and over 1.34 million rows of raw data. The tool creates the model by selecting the most relevant features. In this way, an accuracy of 98.1% was achieved, which was much better than any of their previous manually trained models.

The detection of fraudulent healthcare claims is one of the major problems faced by the company. Annually, about 25000 false claims are filed for reimbursements; hence, to reduce the loss of revenue due to such fraudulent activities, identification of false claims is required. The company utilized the AutoML Tables tool to train the model using features like patient history, hospitals, specific disease of the patient, invoices, the agent who sold the policy, etc. The tool is responsible for the optimal algorithm selection, pre-processing and feature selection. Based on the data provided, an appropriate anomaly detection algorithm is selected, and its hyperparameters are tuned accordingly. Using such a model, an accuracy of 96.64% is achieved for false claim detection. This model outperforms the existing in-house fraud detection model by 20% to 30%.

The labelling of gender is another problem for the company. GNP provides insurance not only to individuals but also to groups of people, for example, a complete family or a company's employee network. In a general scenario, for group insurance, the sales team provides data in CSV (Comma Separated Values) format to the underwriting department. For such a collective insurance policy, the gender value in the data is an essential requirement. However, missing values are frequently encountered in the columns representing gender; hence, the identification of gender based on the person's name is required. To learn the gender of the person based on his/her full name, the company utilizes an AutoML model trained by the Tables tool. The AutoML tool selected the most suitable classification algorithm for the task and tuned the hyperparameters accordingly to achieve an accuracy of 99.2%.

In this particular case, a single AutoML tool was able to tackle three problems head-on and created machine learning models for them with minimal human intervention. This showcases the need as well as the opportunities the domain of AutoML provides, especially in the business sector.

V. CONCLUSION AND FUTURE DIRECTIONS

In this paper, we provide insights to the readers about the various segments of AutoML from a conceptual perspective. Each of these segments has various approaches that have been briefly explained to provide a concise overview. We also discuss the various trends seen in recent years, including suggestions of under-explored research areas which need attention. We also put forward some future directions that can be explored to extend the research in the domain of AutoML. We suggest that research be directed towards a generalized AutoML pipeline, which can accept datasets of a wide range, and that a central meta-learning framework be established that acts as a central brain for approximating the pipelines for all future problem statements.

REFERENCES

[1] Lukas Tuggener, Mohammadreza Amirian, Katharina Rombach, Stefan Lörwald, Anastasia Varlet, Christian Westermann, and Thilo Stadelmann. Automated machine learning in practice: state of the art and recent results. In 2019 6th Swiss Conference on Data Science (SDS), pages 31–36. IEEE, 2019.
[2] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[3] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[4] Avatar Jaykrushna, Pathik Patel, Harshal Trivedi, and Jitendra Bhatia. Linear regression assisted prediction based load balancer for cloud computing. In 2018 IEEE Punecon, pages 1–3. IEEE.
[5] Jitendra Bhatia, Ruchi Mehta, and Madhuri Bhavsar. Variants of software defined network (sdn) based load balancing in cloud computing: A quick review. In International Conference on Future Internet Technologies and Trends, pages 164–173. Springer, 2017.
[6] Ishan Mistry, Sudeep Tanwar, Sudhanshu Tyagi, and Neeraj Kumar. Blockchain for 5g-enabled iot for industrial automation: A systematic review, solutions, and challenges. Mechanical Systems and Signal Processing, 135:106382, 2020.
[7] Jitendra Bhatia, Yash Modi, Sudeep Tanwar, and Madhuri Bhavsar. Software defined vehicular networks: A comprehensive review. International Journal of Communication Systems, 32(12):e4005, 2019.
[8] Jitendra Bhatia, Ridham Dave, Heta Bhayani, Sudeep Tanwar, and Anand Nayyar. Sdn-based real-time urban traffic analysis in vanet environment. Computer Communications, 149:162–175, 2020.
[9] Xin He, Kaiyong Zhao, and Xiaowen Chu. Automl: A survey of the state-of-the-art. arXiv preprint arXiv:1908.00709, 2019.
[10] Radwa Elshawi, Mohamed Maher, and Sherif Sakr. Automated machine learning: State-of-the-art and open challenges. arXiv preprint arXiv:1906.02287, 2019.
[11] Anh Truong, Austin Walters, Jeremy Goodsitt, Keegan Hines, Bayan Bruss, and Reza Farivar. Towards automated machine learning: Evaluation and comparison of automl approaches and tools. arXiv preprint arXiv:1908.05557, 2019.
[12] Shichao Zhang, Chengqi Zhang, and Qiang Yang. Data preparation for data mining. Applied artificial intelligence, 17(5-6):375–381, 2003.
[13] Erhard Rahm and Hong Hai Do. Data cleaning: Problems and current approaches. IEEE Data Eng. Bull., 23(4):3–13, 2000.
[14] Dipali Shete and Sachin Bojewar. Auto approach for extracting relevant data using machine learning. International Journal of Electronics, 6:0, 2019.
[15] Carol M Musil, Camille B Warner, Piyanee Klainin Yobas, and Susan L Jones. A comparison of imputation techniques for handling missing data. Western Journal of Nursing Research, 24(7):815–829, 2002.
[16] RB Kline. Principles and practice of structural equation modeling. 1998. New York: Guilford, 1998.
[17] Joseph F Hair, Rolph E Anderson, Ronald L Tatham, and William C Black. Multivariate data analysis. englewood cliff. New Jersey, USA, 5(3):207–2019, 1998.
[18] Matthias Feurer, Katharina Eggensperger, Stefan Falkner, Marius Lindauer, and Frank Hutter. Practical automated machine learning for the automl challenge 2018. In International Workshop on Automatic Machine Learning at ICML, pages 1189–1232, 2018.
[19] Tianqi Chen and Carlos Guestrin. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining, pages 785–794. ACM, 2016.
[20] Tpot: Skewed classes. https://github.com/EpistasisLab/tpot/blob/v0.9.5/tpot/metrics.py. (Accessed: September 10, 2019).
[21] Chris Thornton, Frank Hutter, Holger H Hoos, and Kevin Leyton-Brown. Auto-weka: Combined selection and hyperparameter optimization of classification algorithms. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 847–855. ACM, 2013.
[22] Leo Breiman. Bagging predictors. Machine learning, 24(2):123–140, 1996.
[23] Chris Drummond, Robert C Holte, et al. C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling. In Workshop on learning from imbalanced datasets II, volume 11, pages 1–8. Citeseer, 2003.
[24] Mohamed Bekkar and Taklit Akrouf Alitouche. Imbalanced data learning approaches.
[25] Matthias Feurer, Aaron Klein, Katharina Eggensperger, Jost Tobias Springenberg, Manuel Blum, and Frank Hutter. Auto-sklearn: Efficient and robust automated machine learning. In Automated Machine Learning, pages 113–134. Springer, 2019.
[26] Ambika Kaul, Saket Maheshwary, and Vikram Pudi. Autolearn: Automated feature generation and selection. In 2017 IEEE International Conference on Data Mining (ICDM), pages 217–226. IEEE, 2017.
[27] Udayan Khurana, Deepak Turaga, Horst Samulowitz, and Srinivasan Parthasarathy. Cognito: Automated feature engineering for supervised learning. In 2016 IEEE 16th International Conference on Data Mining Workshops (ICDMW), pages 1304–1307. IEEE, 2016.
[28] Fatemeh Nargesian, Horst Samulowitz, Udayan Khurana, Elias B Khalil, and Deepak S Turaga. Learning feature engineering for classification. In IJCAI, pages 2529–2535, 2017.
[29] Udayan Khurana, Horst Samulowitz, and Deepak Turaga. Feature engineering for predictive modeling using reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[30] Gilad Katz, Eui Chul Richard Shin, and Dawn Song. Explorekit: Automatic feature generation and selection. In 2016 IEEE 16th International Conference on Data Mining (ICDM), pages 979–984. IEEE, 2016.
[31] Hoang Thanh Lam, Johann-Michael Thiebaut, Mathieu Sinn, Bei Chen, Tiep Mai, and Oznur Alkan. One button machine for automating feature engineering in relational databases. arXiv preprint arXiv:1706.00327, 2017.
[32] Mohamed Maher and Sherif Sakr. Smartml: A meta learning-based framework for automated selection and hyperparameter tuning for machine learning algorithms. In EDBT: 22nd International Conference on Extending Database Technology, 2019.
[33] Steven M LaValle, Michael S Branicky, and Stephen R Lindemann. On the relationship between classical grid search and probabilistic roadmaps. The International Journal of Robotics Research, 23(7-8):673–692, 2004.
[34] Francisco J Solis and Roger J-B Wets. Minimization by random search techniques. Mathematics of operations research, 6(1):19–30, 1981.
[35] James Bergstra and Yoshua Bengio. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(Feb):281–305, 2012.
[36] Frank Hutter, Holger H Hoos, and Kevin Leyton-Brown. Sequential model-based optimization for general algorithm configuration. In International conference on learning and intelligent optimization, pages 507–523. Springer, 2011.
[37] Jasper Snoek, Oren Rippel, Kevin Swersky, Ryan Kiros, Nadathur Satish, Narayanan Sundaram, Mostofa Patwary, Mr Prabhat, and Ryan Adams. Scalable bayesian optimization using deep neural networks. In International conference on machine learning, pages 2171–2180, 2015.
[38] Hector Mendoza, Aaron Klein, Matthias Feurer, Jost Tobias Springenberg, and Frank Hutter. Towards automatically-tuned neural networks. In Workshop on Automatic Machine Learning, pages 58–65, 2016.
[39] Dani Yogatama and Gideon Mann. Efficient transfer learning method for automatic hyperparameter tuning. In Artificial intelligence and statistics, pages 1077–1085, 2014.
[40] Randal S Olson and Jason H Moore. Tpot: A tree-based pipeline optimization tool for automating machine learning. In Automated Machine Learning, pages 151–160. Springer, 2019.
[41] Boyuan Chen, Harvey Wu, Warren Mo, Ishanu Chattopadhyay, and Hod Lipson. Autostacker: A compositional evolutionary learning system. In Proceedings of the Genetic and Evolutionary Computation Conference, pages 402–409. ACM, 2018.
[42] Jitendra Bhatia and Malaram Kumhar. Perspective study on load balancing paradigms in cloud computing. IJCSC, 6(1):112–120, 2015.
[43] Natalia Miloslavskaya and Alexander Tolstoy. Big data, fast data and data lake concepts. Procedia Computer Science, 88:300–305, 2016.
[44] Jitendra Bhagwandas Bhatia. A dynamic model for load balancing in cloud infrastructure. Nirma University Journal of Engineering and Technology (NUJET), 4(1):15, 2015.
[45] Jai Prakash Verma, Sudeep Tanwar, Sanjay Garg, Ishit Gandhi, and Nikita H Bachani. Evaluation of pattern based customized approach for stock market trend prediction with big data and machine learning techniques. International Journal of Business Analytics (IJBAN), 6(3):1–15, 2019.
