

International Journal of Science and Research (IJSR)
ISSN: 2319-7064
SJIF (2020): 7.803

Implementing Cloud-Native Technologies for Big Data Processing: A Case Study with Kubernetes and Airflow

Chandrakanth Lekkala
Email: chan.Lekkala[at]gmail.com

Abstract: This paper presents a case study on the architecture design and implementation details of cloud-native technologies for big data processing, focusing on Apache Airflow and Kubernetes. Cloud-native technologies such as Kubernetes and Airflow are modern solutions intricately connected as essential components within IT infrastructures and are designed specifically for processing and managing data in cloud environments. These technologies leverage the scalability and flexibility of cloud computing to enable efficient and reliable data storage, analysis, and retrieval. Acting as a container orchestrator, Kubernetes efficiently manages a vast number of containers, eliminating the need to explicitly specify the configuration for executing each task. Meanwhile, Airflow serves as the orchestration layer for managing data processing workflows within Kubernetes environments. The findings of this paper underscore the potential of Kubernetes and Airflow in enabling seamless orchestration and management of big data workflows in cloud environments.

Keywords: Cloud-native, Apache Airflow, Kubernetes, Big Data Processing, Orchestration, Stability, Efficiency

1. Introduction

A confluence of economic, social, and technological trends, including the growing ubiquity of wireless broadband access, the widespread adoption of smart devices and infrastructure, and the appeal of social networking, is resulting in the generation of vast streams of data, also known as big data. Big data continues to grow exponentially over time, and the datasets are so huge and complex in variety, volume, and velocity that traditional data management systems cannot process, store, or analyze them effectively [1]. Concurrently, the modern landscape of complex applications – with end users expecting continuous innovation and unparalleled responsiveness – requires organizations' systems to be more strategic and increasingly flexible. The demand for cost-effective, scalable, and reliable approaches prompts the exploration of cloud-native services, such as Kubernetes and Airflow, as a viable alternative. Kubernetes orchestrates containerized applications to run on a cluster of hosts and uses cloud platforms to automate and manage cloud-native applications [2]. Meanwhile, Airflow's rich user interface complements Kubernetes by making it possible to visualize pipelines running in production, troubleshoot issues, and monitor the progress of complex data pipelines. This paper demonstrates the implementation of Kubernetes and Airflow for big data processing, delving into the architecture and deployment strategies of these cloud-native technologies.

2. Literature Review

The proliferation of big data, as reflected in the widespread use of digital devices and mobile subscriptions, has transformed the landscape of modern business, empowering organizations to derive valuable insights and drive innovation. More than 50 billion interconnected devices are estimated to be deployed worldwide in areas such as transport systems, the environment, security, energy control systems, and healthcare [3]. Given that internet penetration exceeds 100% in developing and developed countries within the Organization for Economic Cooperation and Development (OECD) and that wireless broadband penetration is nearly 70% in OECD areas, the sources of big data will continue to grow further [4]. The amount of data traffic generated by mobile devices and sensors has been doubling every year, as illustrated in Figure 1. However, harnessing the potential of big data comes with its own set of challenges, including the need for scalable, reliable, and cost-effective data processing solutions. The literature surrounding cloud-native technologies provides valuable insights into the theoretical foundations and practical implementations of these technologies.


Figure 1: Global IP Data Traffic

1) Cloud-Native Technologies

The term "cloud-native" identifies the practice of developing, deploying, and maintaining modern applications that take advantage of the flexible computing characteristics offered by the cloud environment itself [5]. With that in mind, cloud-native systems are architected and developed using features and capabilities of cloud computing environments such as scalability, built-in resiliency, ease of workload management, and elasticity. Cloud-native technologies equip enterprises with tools for running applications in private, public, and hybrid clouds at massive scale. Characteristics such as microservices architecture, immutable infrastructure, automated scaling, and declarative application programming interfaces (APIs) can be named as the main sources of the scalability and efficiency of cloud-native technology [6]. These characteristics allow servers to provide the same functions to every system, so data processing and storage are performed consistently without being constrained by the limitations or failures of individual devices. Additionally, the unaltered state of cloud-native technologies post-deployment significantly reduces their complexity.

Figure 2: Features of Cloud-Native Technologies

2) Kubernetes

Kubernetes orchestrates container-based workflows and applications to operate on a cluster of hosts. To achieve this aim, Kubernetes facilitates the automated deployment, management, and monitoring of cloud-native applications, regardless of whether they're deployed on public cloud platforms or on-premises infrastructure [7]. The typical architecture of Kubernetes consists of the control plane, nodes, and clusters.
The control plane fulfills two primary roles: (1) serving as the gateway to the Kubernetes API, and (2) overseeing the nodes comprising the cluster. To perform the functions of controlling communications and managing nodes, the control plane uses four primary components: the kube-API server, which exposes the Kubernetes API; etcd, which is used for data storage; the kube-scheduler, which assigns new pods to a node for execution; and the kube-controller-manager, which runs the controllers that govern the state of the cluster [8].

Within Kubernetes, applications run in Pods, the fundamental execution unit. These Pods house the containers and are executed on worker nodes. Nodes consist of three primary elements: the kubelet, acting as an agent that ensures containers are running within a Kubernetes pod; kube-proxy, serving as a network proxy across every node in a cluster; and the container runtime, responsible for container execution [8]. These components collaborate to establish a reliable framework for automating the deployment, scaling, and oversight of container-based applications. Moreover, they ensure optimal resource utilization through the coordination of the kube-scheduler and kube-controller-manager.
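To make these interactions concrete, the following sketch uses the official Kubernetes Python client to submit a minimal Pod to the kube-API server; the scheduler then assigns it to a node, whose kubelet runs the container. The namespace, names, and image are illustrative assumptions rather than details from this case study.

# Minimal sketch using the official Kubernetes Python client (pip install kubernetes).
# The namespace, Pod name, and image below are assumed for illustration only.
from kubernetes import client, config

config.load_kube_config()  # load credentials from the local kubeconfig

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="demo-processor", labels={"app": "batch-demo"}),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="processor",
                image="python:3.9-slim",
                command=["python", "-c", "print('processing one batch')"],
            )
        ],
    ),
)

# The kube-API server persists the desired state in etcd, the kube-scheduler picks a
# worker node, and that node's kubelet pulls the image and starts the container.
client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)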

Figure 3: Components of a Kubernetes Cluster

Figure 4: Simplified view showing how services interact with pod networking in the Kubernetes cluster

3) Airflow

Airflow, in this case Apache Airflow, is a workflow management platform that allows for the monitoring, scheduling, and orchestration of complex data pipelines [9]. Typically, orchestrating data pipelines involves arranging and coordinating the flow of data from various sources. Within Airflow, the data pipelines deliver data sets that are ready for consumption by data science teams, machine learning models, or business intelligence applications that support big data applications. An Airflow installation comprises the following components:
• Scheduler – Responsible for overseeing the progression of tasks and DAGs, it initiates task instances as soon as their prerequisites have been met.
• Webserver – Presents a user interface to trigger, inspect, and debug the behavior of tasks and DAGs.
• DAG file – A DAG (directed acyclic graph) is the core concept of Airflow. It collects tasks together and defines how their dependencies and relationships should run [10].
• Metadata database – The database that the components of Airflow use to store the state of workflows and tasks [10].

Overall, Airflow serves as the orchestration layer for managing data processing workflows. Using the above-mentioned components, Airflow facilitates the creation of complex data pipelines by defining DAGs and orchestrating their execution across distributed systems.

3. Case Study: Implementing Kubernetes and Airflow for Big Data Processing

3.1 Use Case Description

This study focuses on TechShop, a fictitious e-commerce company that operates in a highly competitive market where understanding and meeting ever-evolving consumer preferences is paramount for success. To maintain a competitive edge and remain relevant in the market, TechShop collects data from various sources, including purchase history, social media engagement, website interactions, and demographic information. This information is highly diverse and contains large collections of structured, semi-structured, and unstructured data that are expected to grow exponentially over time. Additionally, the complexity, variety, volume, and velocity of the datasets overwhelm the current data management system, hindering its ability to efficiently process, analyze, and store them. Therefore, TechShop aims to implement a modern, cloud-native solution to process and analyze data effectively. The primary objectives of the cloud-native solution include:
• The solution should be able to handle fluctuations in data volume and velocity without manual intervention.
• The architecture should support iterative development with different analytical models and data processing algorithms.
• The data processing workflows must be fault-tolerant and robust enough to ensure consistent performance.

3.2 Architecture Overview

The proposed architecture comprises three main components: data ingestion, data processing, and data analytics. For data ingestion, raw data and assorted files (purchase history, social media engagement, website interactions, and demographic information) collected from various sources, such as streaming databases and external APIs, will be imported into a single, cloud-based storage medium – Google Cloud Storage (GCS). GCS is the most-suited storage option because it is scalable, durable, and cost-effective, with high availability and low-latency access [5]. The data stored in GCS will then be transformed and stored in a centralized repository. Once ingested, the data will undergo various transformations, which involve aggregations and machine learning algorithms to extract meaningful insights [11]. The processed data will then be analyzed to generate visualizations, reports, and actionable insights for business stakeholders.

Figure 5: Architecture Overview
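To illustrate how these three stages could be expressed as an Airflow pipeline, the sketch below defines a DAG with one task per stage. The DAG id, schedule, and Python callables are hypothetical placeholders; in practice each task would call the actual GCS ingestion, processing, and analytics services.

# Hedged sketch of the ingestion -> processing -> analytics flow as an Airflow DAG.
# Task names and callables are placeholders, not the case study's real implementation.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def ingest_to_gcs():
    print("pull purchase history, clickstream, and social data into GCS")


def transform_data():
    print("run aggregations and ML algorithms over the ingested data")


def publish_insights():
    print("generate reports and dashboards for business stakeholders")


with DAG(
    dag_id="techshop_big_data_pipeline",
    start_date=datetime(2021, 5, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    ingest = PythonOperator(task_id="ingest_to_gcs", python_callable=ingest_to_gcs)
    transform = PythonOperator(task_id="transform_data", python_callable=transform_data)
    analyze = PythonOperator(task_id="publish_insights", python_callable=publish_insights)

    # Dependencies mirror the data ingestion, processing, and analytics stages.
    ingest >> transform >> analyze

The scheduler evaluates this DAG file, the metadata database records task state, and the webserver exposes the pipeline for inspection, matching the components described in the literature review.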

3.3 Kubernetes Deployment

TechShop opts to deploy Kubernetes clusters on Google Cloud Platform – a leading public cloud provider – to leverage its scalability features and managed services. The Kubernetes deployment will encompass the following components:
• Master Node – This node controls and manages a set of nodes. It comprises the following components to help manage worker nodes: the kube-API server, kube-controller-manager, etcd, and kube-scheduler [12]. This node will control the overall Kubernetes cluster, managing resource allocation, scaling, and scheduling.
• Pods – Pods are the smallest deployable units in Kubernetes, encapsulating containers that share network resources and run workloads such as data ingestion services, analytics tools, and processing engines [13].
• Services – Services will provide network access to a set of pods, enabling load balancing and service discovery within the cluster.
• Horizontal Pod Autoscaling (HPA) – HPA automatically adjusts the number of replica pods in a deployment, ensuring responsiveness to workload fluctuations and optimal resource allocation; a minimal sketch of creating such an autoscaler follows this list.
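The sketch below shows how such an autoscaler could be created with the Kubernetes Python client; the deployment name, namespace, replica bounds, and CPU threshold are assumptions for illustration rather than values prescribed by the case study.

# Hedged sketch: creating a Horizontal Pod Autoscaler with the Kubernetes Python client.
# The target deployment, namespace, and thresholds are assumed for illustration.
from kubernetes import client, config

config.load_kube_config()

hpa = client.V1HorizontalPodAutoscaler(
    metadata=client.V1ObjectMeta(name="processing-engine-hpa"),
    spec=client.V1HorizontalPodAutoscalerSpec(
        scale_target_ref=client.V1CrossVersionObjectReference(
            api_version="apps/v1", kind="Deployment", name="processing-engine"
        ),
        min_replicas=2,
        max_replicas=10,
        # Add replicas when average CPU utilization across pods exceeds 70%.
        target_cpu_utilization_percentage=70,
    ),
)

client.AutoscalingV1Api().create_namespaced_horizontal_pod_autoscaler(
    namespace="data-processing", body=hpa
)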

Figure 6: Kubernetes Architecture Diagram

3.4 Airflow Integration

Apache Airflow will be deployed as a set of Kubernetes pods, each representing a different component, including the scheduler, webserver, and worker nodes. The key components of Airflow include DAGs, operators, schedulers, and executors. Apache Airflow DAGs will be written in Python using the Airflow API. Similarly, tasks will be implemented as Python operators, which can execute data processing logic, interact with external sources, and handle data discrepancies. Configuring the Airflow architecture as code (Python) allows for dynamic pipeline generation [14]. This will allow the respective operators to write code that instantiates data pipelines dynamically. In the implementation stage, TechShop will use a distributed Airflow architecture, meaning that the DAG files will be synchronized between all the components that use them – workers, triggerer, and scheduler. The greatest strength of Apache Airflow is its flexibility, offering easy extensibility through its plug-in framework [15]. Additionally, Apache Airflow provides a wide range of integrations for services on various cloud providers.
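One hedged way to express such tasks is the KubernetesPodOperator from Airflow's cncf-kubernetes provider, which lets a DAG launch each processing step as its own pod in the cluster. The image, namespace, and identifiers below are illustrative assumptions, and the exact import path can vary with the provider version.

# Hedged sketch: running an Airflow task as its own Kubernetes pod.
# Image, namespace, and names are assumed; the import path depends on the
# installed version of the apache-airflow-providers-cncf-kubernetes package.
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)

with DAG(
    dag_id="techshop_k8s_tasks",
    start_date=datetime(2021, 5, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    transform = KubernetesPodOperator(
        task_id="transform_purchase_history",
        name="transform-purchase-history",
        namespace="data-processing",
        image="python:3.9-slim",
        cmds=["python", "-c", "print('transforming purchase history')"],
        get_logs=True,                 # stream container logs back to Airflow
        is_delete_operator_pod=True,   # remove the pod once the task completes
    )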

Figure 7: Airflow integration into Kubernetes

TechShop will leverage Kubernetes manifests to define the desired states of the Apache Airflow deployment and manage its lifecycle, as shown in Figure 7.
The Kubernetes Operator will use Python to generate a request that will be processed by the API server. Subsequently, Kubernetes will launch TechShop's pods based on the defined specs, enabling the collected data to be loaded under a single command. Once the tasks are launched, the operators will only need to track logs, gathering them for the scheduler or any other distributed logging service within the Kubernetes cluster. The integration of Kubernetes and Airflow will enable TechShop to leverage the scalability and flexibility of Kubernetes for processing big data workflows. Kubernetes' auto-scaling capabilities will allow the cluster to dynamically adjust resource allocation based on workload demands, ensuring optimal performance during peak periods.

4. Conclusion

In summary, cloud-native technologies involve building, deploying, and managing modern applications that leverage the distributed computing capabilities offered in cloud delivery models. This means that cloud-native services, such as Kubernetes and Airflow, are designed and built to exploit the scale, resiliency, flexibility, and elasticity of cloud computing environments. In this case study, we deploy Kubernetes clusters on the Google Cloud Platform to leverage its scalability and managed services. Meanwhile, Airflow serves as the orchestration layer for managing data processing workflows within the Kubernetes environments. Kubernetes' auto-scaling capabilities ensure optimal resource utilization, while Airflow orchestrates the execution of concurrent workflows.

References

[1] Anagnostopoulos, I., Zeadally, S., & Esposito, E. (2016, February). Handling big data: Research challenges and future directions. The Journal of Supercomputing, 72, 1494-1516. https://link.springer.com/article/10.1007/s11227-016-1677-z
[2] Barika, M., Garg, S., Zomaya, A. Y., Wang, L., Moorsel, A. V., & Ranjan, R. (2019, September). Orchestrating big data analysis workflows in the cloud: Research challenges, survey, and future directions. ACM Computing Surveys (CSUR), 52(5), 1-41. https://doi.org/10.1145/3332301
[3] Hammi, B., Khatoun, R., Zeadally, S., Fayad, A., & Khoukhi, L. (2018, January). IoT technologies for smart cities. IET Networks, 7(1), 1-13. https://doi.org/10.1049/iet-net.2017.0163
[4] Kuss, D. J., & Lopez-Fernandez, O. (2016, March). Internet addiction and problematic Internet use: A systematic review of clinical research. World Journal of Psychiatry, 6(1), 143. https://doi.org/10.5498%2Fwjp.v6.i1.143
[5] Laszewski, T., Arora, K., Farr, E., & Zonooz, P. (2018, August). Cloud Native Architectures: Design high-availability and cost-effective applications for the cloud. Packt Publishing Ltd.
[6] Jakóbczyk, M. T. (2020, February). Cloud-Native Architecture. In Practical Oracle Cloud Infrastructure: Infrastructure as a Service, Autonomous Database, Managed Kubernetes, and Serverless (pp. 487-551).
[7] Toffetti, G., Brunner, S., Blöchlinger, M., Spillner, J., & Bohnert, T. M. (2017, July). Self-managing cloud-native applications: Design, implementation, and experience. Future Generation Computer Systems, 72, 165-179. https://doi.org/10.1016/j.future.2016.09.002
[8] Sayfan, G. (2017, May). Mastering Kubernetes. Packt Publishing Ltd.
[9] Koutoulakis, E. (2020, September). Implementation of a federated workflow execution engine for life sciences through virtualization services. https://apothesis.lib.hmu.gr/bitstream/handle/20.500.12688/9638/KoutoulakisEmmanouil2020.pdf?sequence=1&isAllowed=y
[10] Panhalkar, S. (2019, January). Libflow: A Platform to Schedule and Manage Workflows Using DAGs.
[11] Palazzo, C., Mariello, A., Fiore, S., D'Anca, A., Elia, D., Williams, D. N., & Aloisio, G. (2015, July). A workflow-enabled big data analytics software stack for eScience. In 2015 International Conference on High Performance Computing & Simulation (HPCS) (pp. 545-552). IEEE. https://doi.org/10.1109/HPCSim.2015.7237088
[12] Larsson, L., Tärneberg, W., Klein, C., Elmroth, E., & Kihl, M. (2020, August). Impact of etcd deployment on Kubernetes, Istio, and application performance. Software: Practice and Experience, 50(10), 1986-2007. https://doi.org/10.1002/spe.2885
[13] Calcote, L., & Butcher, Z. (2019, October). Istio: Up and running: Using a service mesh to connect, secure, control, and observe. O'Reilly Media.
[14] Shubha, B., & Prasad, A. (2019, June). Airflow directed acyclic graph. J Signal Process, 5(2). https://core.ac.uk/download/pdf/230490858.pdf
[15] Hummer, W., Muthusamy, V., Rausch, T., Dube, P., El Maghraoui, K., Murthi, A., & Oum, P. (2019, June). ModelOps: Cloud-based lifecycle management for reliable and trusted AI. In 2019 IEEE International Conference on Cloud Engineering (IC2E) (pp. 113-120). IEEE. https://doi.org/10.1109/IC2E.2019.00025

