Automated Machine Learning: The New Wave of Machine Learning
Abstract—With the explosion in the use of machine learning in various domains, the need for an efficient pipeline for the development of machine learning models has never been more critical. However, the task of forming and training models largely remains traditional, with a dependency on domain experts and time-consuming data manipulation operations, which impedes the development of machine learning models in both academia and industry. This demand advocates a new research era concerned with fitting machine learning models fully automatically, i.e., AutoML. Automated Machine Learning (AutoML) is an end-to-end process that aims at automating this model development pipeline without any external assistance. First, we provide insights into AutoML. Second, we delve into the individual segments of the AutoML pipeline and briefly cover their approaches. We also provide a case study on the industrial use and impact of AutoML with a focus on practical applicability in a business context. Finally, we conclude with the open research issues and future research directions.

Index Terms—Automated Machine Learning, Artificial Intelligence, Meta Learning, Hyperparameter Optimization

I. INTRODUCTION

Data analysis is a powerful tool for learning insights on how to improve decision making, business models and even products. This involves the construction and training of a machine learning model, which faces several challenges due to a lack of expert knowledge [1]. These challenges can be overcome by the field of automated machine learning (AutoML). AutoML refers to the process of studying a traditional machine learning model development pipeline, segmenting it into modules and automating each of them to accelerate the workflow. With the advent of deeper models, such as the ones used in image processing [2], Natural Language Processing [3], etc., there is an increasing need for tailored models that can be crafted for specific workloads. However, such specific models require immense resources such as high-capacity memory, strong GPUs, domain experts to help during development, and long wait times during training. The task becomes critical as little work has been done on creating a formal framework for deciding model parameters without the need for trial and error. These nuances emphasized the need for AutoML, where automation can reduce turnaround times and also increase the accuracy of the derived models by removing human errors. In recent years, several tools and models have been proposed in the domain of AutoML. Some of these focus on particular segments of AutoML such as feature engineering or model selection, whereas others attempt to optimize the complete pipeline. These tools have matured enough to compete with human experts in Kaggle competitions, and at times have beaten them, showcasing their efficacy. There is a wide variety of applications based on AutoML, such as autonomic cloud computing [4] [5], intelligent vehicular networks, blockchain [6], Software Defined Networking [7] [8], among others.

This paper aims at providing an overview of the advances seen in the realm of AutoML in recent years. We focus on individual aspects of AutoML and summarize the improvements achieved in recent years. The motivation for this paper stems from the unavailability of a compact study of the current state of AutoML. While we acknowledge the existence of other surveys [9] [10] [11], they either provide an in-depth understanding of a particular segment of AutoML, offer only an experimental comparison of the various tools used, or are focused solely on deep learning models. The primary contributions of this paper are threefold:

1) We segment the AutoML pipeline into parts and review the contributions in each of these segments.
2) We explore the various state-of-the-art tools currently available for AutoML and evaluate them.
3) We also incorporate the advancements seen in machine learning, which seem to have been overshadowed by deep learning in recent years.
The rest of the paper is organized as follows. Section I-A describes the problem definition of AutoML and covers the contributions in AutoML, with each subsection reviewing a specific segment. Section II discusses the recent trends and advancements seen in the domain of AutoML. Section IV covers a case study on the use of AutoML for insurance. Section V concludes the paper and provides future directions for the work that needs to be done in AutoML.

II. RELATED WORK

This section describes the various segments of AutoML as per the taxonomy shown in Figure II. We present the most notable contributions seen in the domain of AutoML and compare the various approaches adopted for each individual segment.
C_{O_S}^{O_P} + 2^{N} \cdot G(f_i, f_j)\, P_{N_M} + \sum_{m' \in M} \sum_{r \in R} P(m', r \mid m)\, P\left(r + \gamma \cdot v(m')\right)    (1)

where
O_P is the standard pre-defined operation set,
O_S indicates the operations selected by the algorithm,
G(f_i, f_j) specifies the generator function for creating new features,
N represents the number of selected features, and
N_M is the maximum number of features to be selected.
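To make this notation concrete, the short sketch below (an illustration of ours, assuming Python/NumPy; the specific operators are not taken from any surveyed tool) applies a small pre-defined operation set O_P to pairs of base features through a generator G(f_i, f_j):

import numpy as np

# Illustrative stand-in for the pre-defined operation set O_P of Eq. (1);
# the concrete operators are an assumption made for this example.
O_P = {"sum": np.add, "diff": np.subtract, "prod": np.multiply}

def G(f_i, f_j, op):
    # Generator function: creates one new feature from the pair (f_i, f_j).
    return op(f_i, f_j)

X = np.random.rand(100, 3)      # toy data with N = 3 selected base features
N = X.shape[1]

candidates = {}
for i in range(N):
    for j in range(i + 1, N):
        for name, op in O_P.items():
            candidates[f"{name}(f{i},f{j})"] = G(X[:, i], X[:, j], op)

print(len(candidates), "candidate features generated")   # 3 pairs x 3 operations = 9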
[Figure: taxonomy of AutoML, covering preprocessing (structured/unstructured data), feature engineering (FE), model selection (supervised/unsupervised), HPO techniques, meta-learning, and neural architecture search (NAS).]
[Table excerpt, comparison of AutoML tools: H2O-AutoML uses random search and grid search for HPO and is open source; another entry lists multi-fidelity optimization.]
newly generated mapping as a relationship. COGNITO [27] uses a series of standard operations over a feature tree to generate new features, as seen in Fig. II-B2. LFE (Learning Feature Engineering) [28] improves on COGNITO by learning over previous datasets to capture the relationship between features and the transforms that produced better accuracy outcomes. These transforms are evaluated for a given dataset by LFE to determine the operations which will lead to an increase in the performance of the machine learning model. Reinforcement learning has also been used by Khurana et al. [29] for generating features. They leveraged the feature tree structure used in COGNITO and incorporated a traversal policy to optimize transform exploration. Reinforcement learning encourages the exploration of transforms that are beneficial to the overall model and also applies a budget constraint. This constraint is needed to prevent the algorithm from performing an exhaustive search over the feature graph.
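A minimal sketch of this transform-exploration idea is given below; it assumes scikit-learn and a toy dataset, and is not the actual COGNITO, LFE, or Khurana et al. implementation. Candidate unary transforms are scored by cross-validation, kept only if they improve the score, and the search stops once a fixed evaluation budget is exhausted.

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Assumed, simplified set of unary transforms (stand-ins for the standard operations).
transforms = {
    "log": lambda c: np.log1p(np.abs(c)),
    "sqrt": lambda c: np.sqrt(np.abs(c)),
    "square": np.square,
}

def cv_score(features):
    return cross_val_score(LogisticRegression(max_iter=5000), features, y, cv=3).mean()

budget = 20                                   # cap on model evaluations (the budget constraint)
best_X, best_score, evals = X, cv_score(X), 1

for col in range(X.shape[1]):
    for name, t in transforms.items():
        if evals >= budget:
            break
        candidate = np.column_stack([best_X, t(X[:, col])])    # add one generated feature
        s = cv_score(candidate)
        evals += 1
        if s > best_score:                                      # keep only beneficial transforms
            best_X, best_score = candidate, s

print(f"CV accuracy after {evals} evaluations: {best_score:.3f} with {best_X.shape[1]} features")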
tackle it. Hence, it is crucial to determine the most appropriate model keeping in mind the accuracy-execution time tradeoff. Conventionally, domain experts with previous experience approximate a model to be used. This manual task follows an iterative trial-and-error approach for determining the model to be used, as shown in Fig. II-C. Once a model is finalized, the hyperparameter optimization is again performed manually to generate the final model. AutoML automates the above steps to reduce human dependency.
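A bare-bones version of this model-selection step might look as follows (our sketch, assuming scikit-learn; real AutoML systems search far larger spaces and add meta-learning). It cross-validates a handful of candidate estimators and reports both accuracy and fit time, reflecting the accuracy-execution time tradeoff mentioned above.

import time
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)

# Small, assumed pool of candidate models; a real system searches a much larger space.
candidates = {
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(random_state=0),
    "svm_rbf": SVC(),
}

for name, model in candidates.items():
    start = time.time()
    acc = cross_val_score(model, X, y, cv=5).mean()      # accuracy side of the tradeoff
    elapsed = time.time() - start                        # execution-time side of the tradeoff
    print(f"{name:15s} accuracy={acc:.3f} time={elapsed:.2f}s")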
performance of the model [35].

3) Sequential Model-Based Optimization: Random search and grid search evaluate hyperparameters independently of each other and often end up performing repeated and wasteful computations. To improve on their shortcomings, Sequential Model-Based Optimization (SMBO) [36] was proposed, which uses a combination of regression and Bayesian optimization to select hyperparameters. It sequentially applies the hyperparameters and adjusts their values based on the Bayesian heatmap, which is a probabilistic distribution, as shown in Figure II-C3. The probabilistic approach of SMBO resolves the scalability issues that were rampant in grid search and random search.
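The loop below mimics the SMBO idea in a simplified form (our sketch, assuming scikit-learn; a Gaussian-process surrogate and an upper-confidence-bound rule stand in for the surrogate model and acquisition function of actual SMBO implementations such as [36]). The surrogate is refitted to the configurations evaluated so far and proposes the next hyperparameter value to try.

import numpy as np
from sklearn.datasets import load_digits
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

def objective(log_c):
    # Cross-validated accuracy of an SVM for a given log10(C).
    return cross_val_score(SVC(C=10 ** log_c), X, y, cv=3).mean()

space = np.linspace(-3, 3, 200).reshape(-1, 1)      # candidate values of log10(C)
tried = [[-3.0], [3.0]]                             # two initial configurations
scores = [objective(c[0]) for c in tried]

for _ in range(8):                                   # sequential iterations
    surrogate = GaussianProcessRegressor(alpha=1e-6).fit(np.array(tried), np.array(scores))
    mean, std = surrogate.predict(space, return_std=True)
    nxt = float(space[np.argmax(mean + 1.96 * std)][0])   # optimistic (UCB) acquisition
    tried.append([nxt])
    scores.append(objective(nxt))

best = tried[int(np.argmax(scores))][0]
print(f"best log10(C) = {best:.2f}, CV accuracy = {max(scores):.3f}")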
A new generation of candidate configurations is produced from the agents in the former generation. This new generation again performs the same given task, and the cycle continues, as shown in Figure II-C4. For AutoML, evolutionary algorithms are used for tackling hyperparameter optimization by searching the configuration space for a given model.
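A toy evolutionary search over a similar configuration space could look as follows (again our own sketch, assuming scikit-learn, and not any specific tool): each generation keeps the fittest configurations and produces mutated offspring from them.

import random
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_wine(return_X_y=True)
random.seed(0)

def fitness(cfg):
    model = RandomForestClassifier(n_estimators=cfg["trees"], max_depth=cfg["depth"], random_state=0)
    return cross_val_score(model, X, y, cv=3).mean()

def mutate(cfg):
    return {"trees": max(10, cfg["trees"] + random.choice([-20, 20])),
            "depth": max(1, cfg["depth"] + random.choice([-1, 1]))}

# Initial population of random configurations.
population = [{"trees": random.randrange(10, 200), "depth": random.randrange(1, 10)} for _ in range(6)]

for generation in range(3):
    ranked = sorted(population, key=fitness, reverse=True)
    parents = ranked[:2]                                                        # fittest agents survive
    population = parents + [mutate(random.choice(parents)) for _ in range(4)]   # mutated offspring

best = max(population, key=fitness)
print("best configuration:", best, "CV accuracy:", round(fitness(best), 3))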
SMBO. However, the use of a continuously integrating meta-learning framework needs to be researched, as its performance gain is high. Transfer learning has also been successfully used in the context of AutoML and has shown promising results. With the increase in the availability of task-specific pre-trained models, an increase in the usage of transfer learning should be expected.

IV. CASE STUDY: IMPACT OF AUTOML AT GNP SEGUROS, AN INSURANCE COMPANY

The insurance industry usually prefers a data-driven approach to solve business problems. The multi-source data, generated in massive amounts, provoked the need for machine learning for further analysis and predictions. The significant challenges faced by the industry are the detection of false claims, utilization of unstructured data collected by the marketing and sales teams, automation of transaction and claim processing, and personalization of solutions for customers, among others.

GNP (Grupo Nacional Provincial) is one of the largest insurance companies in Mexico. Like any large and well-established company, GNP is undergoing a profound transformation to modernize its information systems and operations. To achieve this, the company is utilizing cloud resources to centralize the generalized computations [42]. GNP is making significant efforts to organize and utilize all the operational information of the company in a central Data Lake [43] [44]. To extract value from the Data Lake, the company has begun to apply machine learning for gaining insights as well as predicting and improving the company's performance based on its domain-specific factors [45]. For such a data-driven approach, a team of highly trained data scientists is required, which is financially taxing. In the earlier stages, the company's data scientists built and trained various models manually and thus achieved moderate accuracy for the prediction problems. To improve accuracy and reduce the amount of time and expense, GNP adopted the AutoML Tables tool provided by Google Cloud to simplify and speed up the creation of ML models and mitigate the scarcity of highly trained data scientists. The company utilizes the tool to solve problems such as car claim risk, detection of fraudulent healthcare claims, and gender labelling, which are discussed in detail below.

The car claim risk problem is defined as the task of predicting the probability of the car having an accident using the given features/characteristics of the insured car and the owner. The company spends about USD 550 million on car damage claims annually; hence, accurately predicting the risk amount is the primary intention of the company. To solve the prediction problem, the company trained the model using the AutoML Tables tool, which utilized the 21 feature columns and over 1.34 million rows of raw data. The tool creates the model by selecting the most relevant features. By this, an accuracy of 98.1% has been achieved, which was much better than any of their previous manually trained models.

The detection of fraudulent healthcare claims is one of the major problems faced by the company. Annually, about 25,000 false claims are filed for reimbursement; hence, to reduce the loss of revenue due to such fraudulent activities, identification of false claims is required. The company utilized the AutoML Tables tool to train the model using features like patient history, hospitals, the specific disease of the patient, invoices, the agent who sold the policy, etc. The tool is responsible for the optimal algorithm selection, pre-processing and feature selection. Based on the data provided, an appropriate anomaly detection algorithm is selected, and its hyperparameters are tuned accordingly. Using such a model, an accuracy of 96.64% is achieved for false claim detection. This model outperforms the existing in-house fraud detection model by 20% to 30%.

The labelling of gender can also be considered one of the problems for the company. GNP provides insurance not only to individuals, but also to collective groups of people, for example, a complete family or a company's employee network. In a general scenario, for group insurance, the sales team provides data in CSV (Comma Separated Values) format to the underwriting department. For such a collective insurance policy, the gender value in the data is an essential requirement. However, missing values are frequently encountered in the columns representing gender features; hence, the identification of gender based on the person's name is required. To learn the gender of the person based on his/her corresponding full name, the company utilizes an AutoML model trained by the Tables tool. The AutoML tool selected the best suitable classification algorithm for the task and optimally tuned the hyperparameters to achieve an accuracy of 99.2%.
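To give a feel for the kind of model such a tool might produce for this task, the sketch below trains a simple character n-gram classifier on full names using scikit-learn; the tiny name list and the pipeline are purely illustrative assumptions and do not represent GNP's actual AutoML Tables model.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy examples; a real deployment would use the insurer's labelled records.
names = ["Maria Garcia", "Jose Hernandez", "Ana Lopez", "Luis Martinez",
         "Carmen Sanchez", "Juan Ramirez", "Lucia Torres", "Carlos Flores"]
genders = ["F", "M", "F", "M", "F", "M", "F", "M"]

# Character n-grams pick up name endings (e.g. "-a", "-os") that correlate with gender.
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
model.fit(names, genders)

print(model.predict(["Gabriela Mendoza", "Ricardo Diaz"]))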
In this particular case, a single AutoML tool was able to tackle three problems head-on and created machine learning models for them with minimal human intervention. This showcases the need for AutoML as well as the opportunities the domain provides, especially in the business sector.
V. CONCLUSION AND FUTURE DIRECTIONS

In this paper, we provide insights to the readers about the various segments of AutoML from a conceptual perspective. Each of these segments has various approaches, which have been briefly explained to provide a concise overview. We also discuss the various trends seen in recent years, including suggestions of under-explored research areas which need attention. We also put forward some future directions that can be explored to extend the research in the domain of AutoML. We suggest that research exploration can be done in the direction of a generalized AutoML pipeline, which can accept a wide range of datasets, and that a central meta-learning framework be established that acts as a central brain for approximating the pipelines for all future problem statements.

REFERENCES

[1] Lukas Tuggener, Mohammadreza Amirian, Katharina Rombach, Stefan Lörwald, Anastasia Varlet, Christian Westermann, and Thilo Stadelmann. Automated machine learning in practice: state of the art and recent results. In 2019 6th Swiss Conference on Data Science (SDS), pages 31–36. IEEE, 2019.
[2] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[3] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[4] Avatar Jaykrushna, Pathik Patel, Harshal Trivedi, and Jitendra Bhatia. Linear regression assisted prediction based load balancer for cloud computing. In 2018 IEEE Punecon, pages 1–3. IEEE.
[5] Jitendra Bhatia, Ruchi Mehta, and Madhuri Bhavsar. Variants of software defined network (SDN) based load balancing in cloud computing: A quick review. In International Conference on Future Internet Technologies and Trends, pages 164–173. Springer, 2017.
[6] Ishan Mistry, Sudeep Tanwar, Sudhanshu Tyagi, and Neeraj Kumar. Blockchain for 5G-enabled IoT for industrial automation: A systematic review, solutions, and challenges. Mechanical Systems and Signal Processing, 135:106382, 2020.
[7] Jitendra Bhatia, Yash Modi, Sudeep Tanwar, and Madhuri Bhavsar. Software defined vehicular networks: A comprehensive review. International Journal of Communication Systems, 32(12):e4005, 2019.
[8] Jitendra Bhatia, Ridham Dave, Heta Bhayani, Sudeep Tanwar, and Anand Nayyar. SDN-based real-time urban traffic analysis in VANET environment. Computer Communications, 149:162–175, 2020.
[9] Xin He, Kaiyong Zhao, and Xiaowen Chu. AutoML: A survey of the state-of-the-art. arXiv preprint arXiv:1908.00709, 2019.
[10] Radwa Elshawi, Mohamed Maher, and Sherif Sakr. Automated machine learning: State-of-the-art and open challenges. arXiv preprint arXiv:1906.02287, 2019.
[11] Anh Truong, Austin Walters, Jeremy Goodsitt, Keegan Hines, Bayan Bruss, and Reza Farivar. Towards automated machine learning: Evaluation and comparison of AutoML approaches and tools. arXiv preprint arXiv:1908.05557, 2019.
[12] Shichao Zhang, Chengqi Zhang, and Qiang Yang. Data preparation for data mining. Applied Artificial Intelligence, 17(5-6):375–381, 2003.
[13] Erhard Rahm and Hong Hai Do. Data cleaning: Problems and current approaches. IEEE Data Eng. Bull., 23(4):3–13, 2000.
[14] Dipali Shete and Sachin Bojewar. Auto approach for extracting relevant data using machine learning. International Journal of Electronics, 6:0, 2019.
[15] Carol M Musil, Camille B Warner, Piyanee Klainin Yobas, and Susan L Jones. A comparison of imputation techniques for handling missing data. Western Journal of Nursing Research, 24(7):815–829, 2002.
[16] RB Kline. Principles and practice of structural equation modeling. New York: Guilford, 1998.
[17] Joseph F Hair, Rolph E Anderson, Ronald L Tatham, and William C Black. Multivariate data analysis. Englewood Cliffs, New Jersey, USA, 5(3):207–2019, 1998.
[18] Matthias Feurer, Katharina Eggensperger, Stefan Falkner, Marius Lindauer, and Frank Hutter. Practical automated machine learning for the AutoML challenge 2018. In International Workshop on Automatic Machine Learning at ICML, pages 1189–1232, 2018.
[19] Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785–794. ACM, 2016.
[20] TPOT: Skewed classes. https://github.com/EpistasisLab/tpot/blob/v0.9.5/tpot/metrics.py. (Accessed: September 10, 2019).
[21] Chris Thornton, Frank Hutter, Holger H Hoos, and Kevin Leyton-Brown. Auto-WEKA: Combined selection and hyperparameter optimization of classification algorithms. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 847–855. ACM, 2013.
[22] Leo Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.
[23] Chris Drummond, Robert C Holte, et al. C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling. In Workshop on Learning from Imbalanced Datasets II, volume 11, pages 1–8. Citeseer, 2003.
[24] Mohamed Bekkar and Taklit Akrouf Alitouche. Imbalanced data learning approaches.
[25] Matthias Feurer, Aaron Klein, Katharina Eggensperger, Jost Tobias Springenberg, Manuel Blum, and Frank Hutter. Auto-sklearn: Efficient and robust automated machine learning. In Automated Machine Learning, pages 113–134. Springer, 2019.
[26] Ambika Kaul, Saket Maheshwary, and Vikram Pudi. AutoLearn: Automated feature generation and selection. In 2017 IEEE International Conference on Data Mining (ICDM), pages 217–226. IEEE, 2017.
[27] Udayan Khurana, Deepak Turaga, Horst Samulowitz, and Srinivasan Parthasrathy. Cognito: Automated feature engineering for supervised learning. In 2016 IEEE 16th International Conference on Data Mining Workshops (ICDMW), pages 1304–1307. IEEE, 2016.
[28] Fatemeh Nargesian, Horst Samulowitz, Udayan Khurana, Elias B Khalil, and Deepak S Turaga. Learning feature engineering for classification. In IJCAI, pages 2529–2535, 2017.
[29] Udayan Khurana, Horst Samulowitz, and Deepak Turaga. Feature engineering for predictive modeling using reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[30] Gilad Katz, Eui Chul Richard Shin, and Dawn Song. ExploreKit: Automatic feature generation and selection. In 2016 IEEE 16th International Conference on Data Mining (ICDM), pages 979–984. IEEE, 2016.
[31] Hoang Thanh Lam, Johann-Michael Thiebaut, Mathieu Sinn, Bei Chen, Tiep Mai, and Oznur Alkan. One button machine for automating feature engineering in relational databases. arXiv preprint arXiv:1706.00327, 2017.
[32] Mohamed Maher and Sherif Sakr. SmartML: A meta learning-based framework for automated selection and hyperparameter tuning for machine learning algorithms. In EDBT: 22nd International Conference on Extending Database Technology, 2019.
[33] Steven M LaValle, Michael S Branicky, and Stephen R Lindemann. On the relationship between classical grid search and probabilistic roadmaps. The International Journal of Robotics Research, 23(7-8):673–692, 2004.
[34] Francisco J Solis and Roger J-B Wets. Minimization by random search techniques. Mathematics of Operations Research, 6(1):19–30, 1981.
[35] James Bergstra and Yoshua Bengio. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(Feb):281–305, 2012.
[36] Frank Hutter, Holger H Hoos, and Kevin Leyton-Brown. Sequential model-based optimization for general algorithm configuration. In International Conference on Learning and Intelligent Optimization, pages 507–523. Springer, 2011.
[37] Jasper Snoek, Oren Rippel, Kevin Swersky, Ryan Kiros, Nadathur Satish, Narayanan Sundaram, Mostofa Patwary, Mr Prabhat, and Ryan Adams. Scalable Bayesian optimization using deep neural networks. In International Conference on Machine Learning, pages 2171–2180, 2015.
[38] Hector Mendoza, Aaron Klein, Matthias Feurer, Jost Tobias Springenberg, and Frank Hutter. Towards automatically-tuned neural networks. In Workshop on Automatic Machine Learning, pages 58–65, 2016.
[39] Dani Yogatama and Gideon Mann. Efficient transfer learning method for automatic hyperparameter tuning. In Artificial Intelligence and Statistics, pages 1077–1085, 2014.
[40] Randal S Olson and Jason H Moore. TPOT: A tree-based pipeline optimization tool for automating machine learning. In Automated Machine Learning, pages 151–160. Springer, 2019.
[41] Boyuan Chen, Harvey Wu, Warren Mo, Ishanu Chattopadhyay, and Hod Lipson. Autostacker: A compositional evolutionary learning system. In Proceedings of the Genetic and Evolutionary Computation Conference, pages 402–409. ACM, 2018.
[42] Jitendra Bhatia and Malaram Kumhar. Perspective study on load balancing paradigms in cloud computing. IJCSC, 6(1):112–120, 2015.
[43] Natalia Miloslavskaya and Alexander Tolstoy. Big data, fast data and data lake concepts. Procedia Computer Science, 88:300–305, 2016.
[44] Jitendra Bhagwandas Bhatia. A dynamic model for load balancing in cloud infrastructure. Nirma University Journal of Engineering and Technology (NUJET), 4(1):15, 2015.
[45] Jai Prakash Verma, Sudeep Tanwar, Sanjay Garg, Ishit Gandhi, and Nikita H Bachani. Evaluation of pattern based customized approach for stock market trend prediction with big data and machine learning techniques. International Journal of Business Analytics (IJBAN), 6(3):1–15, 2019.