Implementing Cloud-Native Technologies for Big Data Processing
Chandrakanth Lekkala
Florida Institute of Technology
Abstract: This paper presents a case study on the architecture design and implementation details of cloud-native technologies for big data processing, focusing on Apache Airflow and Kubernetes. Cloud-native technologies such as Kubernetes and Airflow are modern solutions that have become essential components of IT infrastructures and are designed specifically for processing and managing data in cloud environments. They leverage the scalability and flexibility of cloud computing to enable efficient and reliable data storage, analysis, and retrieval. Acting as a container orchestrator, Kubernetes efficiently manages a vast number of containers, eliminating the need to explicitly specify the configuration for executing each task. Airflow, in turn, serves as the orchestration layer for managing data processing workflows within the Kubernetes environment. The findings of this paper underscore the potential of Kubernetes and Airflow in enabling seamless orchestration and management of big data workflows in cloud environments.
Keywords: Cloud-native, Apache Airflow, Kubernetes, Big Data Processing, Orchestration, Stability, Efficiency
Figure 4: Simplified view showing how services interact with pod networking in the Kubernetes cluster
3.3 Kubernetes Deployment

TechShop opts to deploy Kubernetes clusters on Google Cloud Platform – a leading public cloud provider – to leverage its scalability features and managed services. The Kubernetes deployment will encompass the following components:
• Master Node – This node controls and manages the set of worker nodes. To do so, it comprises the Kube-API Server, Kube-Controller-Manager, etcd, and Kube-Scheduler [12]. The master node governs the overall Kubernetes cluster, managing resource allocation, scaling, and scheduling.
• Pods – Pods are the smallest deployable units in Kubernetes, encapsulating containers (such as data ingestion services, analytics tools, and processing engines) with shared network resources [13].
• Services – Services provide network access to a set of pods, enabling load balancing and service discovery within the cluster.
• Horizontal Pod Autoscaling (HPA) – HPA automatically adjusts the number of replica pods in a deployment, ensuring responsiveness to workload fluctuations and optimal resource allocation (a programmatic sketch follows this list).
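As a minimal sketch of how such an HPA could be registered programmatically, the snippet below uses the official Kubernetes Python client. The deployment name, namespace, and scaling thresholds are illustrative assumptions, not values prescribed by the TechShop design.

```python
from kubernetes import client, config

# Assumed names: deployment "ingestion-worker" and namespace "data-platform"
# are placeholders chosen for illustration only.
config.load_kube_config()  # use config.load_incluster_config() inside the cluster

hpa = client.V1HorizontalPodAutoscaler(
    api_version="autoscaling/v1",
    kind="HorizontalPodAutoscaler",
    metadata=client.V1ObjectMeta(name="ingestion-worker-hpa"),
    spec=client.V1HorizontalPodAutoscalerSpec(
        scale_target_ref=client.V1CrossVersionObjectReference(
            api_version="apps/v1", kind="Deployment", name="ingestion-worker"
        ),
        min_replicas=2,   # keep a small baseline for availability
        max_replicas=10,  # cap scale-out to bound cluster cost
        target_cpu_utilization_percentage=70,  # add replicas above ~70% average CPU
    ),
)

client.AutoscalingV1Api().create_namespaced_horizontal_pod_autoscaler(
    namespace="data-platform", body=hpa
)
```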
3.4 Airflow Integration

Apache Airflow will be deployed as a set of Kubernetes pods, each representing a different component, including the scheduler, webserver, and worker nodes. The key components of Airflow include DAGs, operators, schedulers, and executors. Apache Airflow DAGs will be written in Python using the Airflow API. Similarly, tasks will be implemented as Python operators, which can execute data processing logic, interact with external sources, and handle data discrepancies. Configuring the Airflow architecture as code (Python) allows for dynamic pipeline generation [14], enabling the respective operators to write code that instantiates data pipelines dynamically, as sketched below. In the implementation stage, TechShop will use a distributed Airflow architecture, meaning that the DAG files will be synchronized between all the components that use them – the workers, triggerer, and scheduler. The greatest strength of Apache Airflow is its flexibility, offering easy extensibility through its plug-in framework [15]. Additionally, Apache Airflow provides a wide range of integrations for services on various cloud providers.
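The following minimal sketch illustrates this pipelines-as-code pattern: a DAG generates one extraction task per source system from a plain Python list. The DAG name, source list, and the extract helper are hypothetical and only demonstrate the approach described above.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical source systems; in practice this list could be loaded from
# configuration, which is what makes the pipeline generation dynamic.
SOURCES = ["orders", "inventory", "clickstream"]

def extract(source_name: str, **_context):
    # Placeholder for real data processing logic (ingestion, validation, etc.).
    print(f"Extracting data from {source_name}")

with DAG(
    dag_id="techshop_ingestion",      # assumed DAG name
    start_date=datetime(2021, 5, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    for source in SOURCES:
        PythonOperator(
            task_id=f"extract_{source}",
            python_callable=extract,
            op_kwargs={"source_name": source},
        )
```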
TechShop will leverage Kubernetes manifests to define the desired state of the Apache Airflow deployment and manage its lifecycle, as shown in Figure 7. The Kubernetes Operator will use Python to generate a request that will be processed by the Kubernetes API server.
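A hedged sketch of such a request, issued through the KubernetesPodOperator from a DAG like the one above, is shown below. The container image, namespace, and task name are assumed placeholders, and the exact import path varies with the installed Kubernetes provider package version.

```python
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)

# Assumed placeholders: image, namespace, and task name are illustrative and
# not taken from the TechShop configuration. Declare this inside a DAG.
process_batch = KubernetesPodOperator(
    task_id="process_daily_batch",
    name="process-daily-batch",
    namespace="data-platform",
    image="python:3.9-slim",
    cmds=["python", "-c"],
    arguments=["print('running batch processing step')"],
    get_logs=True,                 # stream pod logs back to the Airflow UI
    is_delete_operator_pod=True,   # clean up the pod after the task finishes
)
```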