
School of Computer Science and Information Technology

Programme: BCA
2024-2025

SEMESTER: VI

COURSE NAME:
Activity#2: HADOOP

Submitted by
22BCAR0503
SYED ARSALAN

Date of Submission: 31/03/2025


​ ​ ​ ​

Name of Faculty In-Charge:


Dr. K. Suneetha
Professor & Head
CS and IT
​ ​

EVALUATION CRITERIA
Report Submission (15) | Oral Presentation (05) | Viva (05) | Total (25) | Convert 25 into 15 marks

​ ​ ​ ​ ​ DECLARATION

I declare that Activity-2 has been carried out by me following all ethical practices of Jain
(Deemed-to-be-University) for the partial fulfillment of the General Course of BCA

in the year 2024-2025 (6th Semester).

SYED ARSALAN, 22BCAR0503


​ ​ ​ ​ ​

​ 2 | Page
INDEX

Sl. No. Table of Contents Page No.

1 Introduction 4-5

2 Interface and Installation Steps 6-9

3 Basic Commands and Execution 10-16

4 Case Study diagram or workflow where applicable 17-20

5 Advantages and Disadvantages 20-23

6 Conclusion and Summary 24

7 References 25

3 | Page
INTRODUCTION
Big Data refers to extremely large and complex datasets that cannot be efficiently processed using
traditional data management tools. These datasets originate from various sources, including social
media, sensors, financial transactions, healthcare records, and more.

Characteristics of Big Data (5Vs Model)

1. Volume – Large amounts of data generated every second.

2. Velocity – The speed at which data is generated and processed.

3. Variety – Different formats like structured (databases), semi-structured (JSON, XML), and
unstructured (videos, images, text).

4. Veracity – Data accuracy and reliability.

5. Value – The ability to extract useful insights from data.

Challenges of Big Data

1. Storage and Management – Traditional databases struggle to store and manage vast
amounts of data.

2. Processing Speed – Handling real-time or batch processing efficiently.

3. Data Integration – Combining data from multiple sources with different formats.

4. Security and Privacy – Ensuring data protection against cyber threats.

5. Scalability – Systems must scale efficiently as data grows.

What is Hadoop?

Apache Hadoop is an open-source framework designed for storing and processing large datasets in a
distributed computing environment. It enables organizations to handle vast amounts of data
efficiently and cost-effectively.

Key Components of Hadoop

1. Hadoop Distributed File System (HDFS) – A distributed storage system that breaks data into
chunks and stores it across multiple nodes.

2. MapReduce – A processing model that distributes computation tasks across multiple servers.

3. YARN (Yet Another Resource Negotiator) – Manages resources and schedules tasks in the
Hadoop ecosystem.

4. HBase, Hive, Pig, and Spark (Hadoop Ecosystem) – Additional tools for data querying, real-time processing, and analytics.
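To make the MapReduce programming model concrete, the following minimal Python sketch simulates the map, shuffle, and reduce phases of a word count locally. This is only an illustration of the model's data flow, not code that runs on a Hadoop cluster:

```python
from collections import defaultdict

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in an input line.
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group values by key, as Hadoop does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts emitted for each word.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data needs big tools", "hadoop handles big data"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(shuffle(pairs))
print(counts["big"])   # 3
print(counts["data"])  # 2
```

On a real cluster the map and reduce functions run in parallel on many nodes, but the three-phase structure is the same.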

4 | Page
Why Hadoop?

• Scalability – Handles large-scale data across multiple machines.

• Fault Tolerance – Data replication ensures reliability.

• Cost-Effective – Runs on commodity hardware.

• Flexibility – Supports structured, semi-structured, and unstructured data.

5 | Page
INTERFACE AND INSTALLATION PROCESS

Hadoop does not have a single user-friendly interface like traditional software. Instead, it
provides multiple ways to interact with the system:
1. Command Line Interface (CLI) – Most Hadoop operations are performed using
terminal commands, such as HDFS file management and running MapReduce jobs.

2. Web User Interfaces (Web UI) – Hadoop provides web-based monitoring tools:
o Hadoop ResourceManager UI – Monitors and manages cluster resources.
o HDFS NameNode UI – Tracks file system metadata and block locations.
3. Hadoop Ecosystem Interfaces – Additional tools provide user-friendly interfaces:
o Apache Hive – SQL-like query interface for Hadoop.

o Hue – A web-based UI for Hadoop services.


o Apache Spark UI – Interactive data processing and monitoring tool.
Installing Hadoop on Windows requires additional configurations since Hadoop is designed
to run on Linux. Below is a step-by-step guide to setting up Hadoop on Windows 10/11
(Single Node Cluster).
1. System Requirements
• Operating System: Windows 10/11 (64-bit)
• Java Development Kit (JDK): JDK 8 or later

• Hadoop Version: Latest stable release (e.g., Hadoop 3.3.4)


• RAM: Minimum 8GB recommended
• Storage: At least 50GB free space

2. Install Java JDK


1. Download Java from Oracle JDK or OpenJDK.
2. Install it and set up environment variables:

6 | Page
o Open System Properties → Advanced System Settings → Environment
Variables.
o Under System Variables, create/edit:

▪ JAVA_HOME = C:\Program Files\Java\jdk-8 (or your JDK installation path)
▪ Add %JAVA_HOME%\bin to the Path variable.
3. Verify Java installation:

java -version

3. Download and Extract Hadoop


1. Download Hadoop Binary for Windows from Apache Hadoop Releases.
2. Extract the ZIP file to C:\hadoop (or any preferred location).

4. Install and Configure Hadoop

(A) Configure core-site.xml


1. Navigate to C:\hadoop\etc\hadoop\core-site.xml.
2. Open it in a text editor (Notepad++ or VS Code) and add:
<configuration>
<property>
<name>fs.defaultFS</name>

<value>hdfs://localhost:9000</value>
</property>
</configuration>

7 | Page
(B) Configure hdfs-site.xml
1. Open C:\hadoop\etc\hadoop\hdfs-site.xml.
2. Add the following configuration:

<configuration>
<property>

<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
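As an illustrative sanity check, a Hadoop *-site.xml fragment like the one above can be parsed with Python's standard library to read back a property value. This sketch is not part of the official setup; the helper name is my own:

```python
import xml.etree.ElementTree as ET

CONFIG = """<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>"""

def get_property(xml_text, name):
    """Return the <value> of the named property, or None if absent."""
    root = ET.fromstring(xml_text)
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value")
    return None

print(get_property(CONFIG, "dfs.replication"))  # 1
```

A check like this catches malformed XML or a misspelled property name before the daemons are started, when such mistakes are cheapest to fix.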

(C) Configure Hadoop Environment File


1. Edit C:\hadoop\etc\hadoop\hadoop-env.cmd.

2. Set the Java path:


set JAVA_HOME=C:\Program Files\Java\jdk-8

5. Format the Hadoop Namenode


1. Open Command Prompt as Administrator.
2. Run:


8 | Page
hdfs namenode -format
6. Verify Installation
• Start the Hadoop services (start-dfs.cmd and start-yarn.cmd), then open your browser and check:

o HDFS Web UI: http://localhost:9870/


o YARN Web UI: http://localhost:8088/
• Run the following to check running services:
jps
Expected output:
NameNode
DataNode
ResourceManager
NodeManager

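The expected jps output above can also be checked programmatically. The sketch below is purely illustrative: it parses captured jps text rather than invoking the real command, and the function name is my own:

```python
# Illustrative check of `jps` output: confirms the four Hadoop daemons
# of a single-node cluster are all present in captured text.
REQUIRED = {"NameNode", "DataNode", "ResourceManager", "NodeManager"}

def missing_daemons(jps_output: str) -> set:
    """Return the set of required daemons absent from `jps` output."""
    # Each jps line looks like "<pid> <ProcessName>"; take the name.
    running = {line.split()[-1] for line in jps_output.splitlines() if line.strip()}
    return REQUIRED - running

sample = """4321 NameNode
4510 DataNode
4712 ResourceManager
4899 NodeManager
5021 Jps"""
print(missing_daemons(sample))  # an empty set means all daemons are up
```

If any daemon is missing from the set of running processes, its log file under the Hadoop installation directory is the first place to look.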
9 | Page
COMMANDS AND EXECUTION

[Pages 10–13 contain screenshots of basic HDFS commands and their execution; the key upload steps are reproduced below.]
Steps to Upload a File into HDFS from Local
1. Start Hadoop Services
start-dfs.cmd
start-yarn.cmd
jps

2. Create a Directory in HDFS



14 | Page
hdfs dfs -mkdir /user
hdfs dfs -mkdir /user/yourusername
hdfs dfs -mkdir /user/yourusername/input

3. Upload a File from Local to HDFS



hdfs dfs -put C:\localpath\filename.txt /user/yourusername/input/

4. Verify the File in HDFS


hdfs dfs -ls /user/yourusername/input/
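The upload commands above follow a fixed pattern, so they are easy to assemble from a script. Purely as an illustration (the helper name is hypothetical), the function below builds the same `hdfs dfs -put` argument list from a local path and an HDFS target directory:

```python
def build_put_command(local_path, hdfs_dir):
    # Assemble the argument list for: hdfs dfs -put <local path> <hdfs dir>
    return ["hdfs", "dfs", "-put", local_path, hdfs_dir]

cmd = build_put_command(r"C:\localpath\filename.txt", "/user/yourusername/input/")
print(" ".join(cmd))
```

On a machine with Hadoop installed, such a list could be handed to subprocess.run; passing arguments as a list avoids shell-quoting problems with Windows paths.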

Steps to Upload a Folder into HDFS from Local


1. Start Hadoop Services

start-dfs.cmd
start-yarn.cmd

jps # Verify services


2. Create a Directory in HDFS (If Not Exists)

15 | Page
hdfs dfs -mkdir -p /user/yourusername/input
3. Upload a Folder from Local to HDFS
hdfs dfs -put C:\path\to\local\folder /user/yourusername/input
4. Verify Upload
hdfs dfs -ls /user/yourusername/input

16 | Page
CASE STUDY
Case Study: Hadoop in Healthcare for Patient Data Analysis

1. Introduction

1.1 Overview of Big Data in Healthcare

The healthcare sector generates vast amounts of data daily from electronic
health records (EHRs), medical imaging, wearable devices, and insurance
claims. Managing and analyzing this large volume of data efficiently is crucial
for improving patient care and operational efficiency. Traditional database
systems often struggle to handle such massive and diverse datasets, leading to
delays in decision-making and inefficiencies in healthcare delivery.

1.2 Role of Hadoop in Healthcare

Apache Hadoop, an open-source framework, provides a scalable and cost-effective
solution for handling big data in healthcare. By leveraging Hadoop's
distributed computing model, healthcare institutions can store, process, and
analyze large datasets efficiently. Hadoop enables predictive analytics,
real-time data processing, and machine learning applications that help improve
patient care and reduce costs.

2. Problem Statement
2.1 Challenges in Healthcare Data Management

17 | Page
Hospitals and healthcare institutions face several challenges in managing
patient data:
• Data Volume: Huge amounts of structured and unstructured data from EHRs, medical scans, and IoT devices.
• Data Variety: Different formats, including text, images, and real-time sensor data.
• Processing Speed: Traditional systems struggle to process large datasets quickly.
• Predicting Readmissions: Identifying high-risk patients for early intervention to reduce hospital readmission rates.
2.2 Need for Predictive Analytics
Predicting patient readmission risks is a critical challenge for hospitals.
Readmissions increase healthcare costs and indicate gaps in post-discharge
care. Analyzing historical patient data using Hadoop can help predict which
patients are at high risk of being readmitted and allow proactive intervention.
3. Solution Using Hadoop
3.1 Hadoop-Based Predictive Analytics Model
Hadoop enables efficient storage and processing of vast healthcare datasets.
The process involves:
1. Data Collection: Patient data from EHRs, IoT devices (wearables),
medical imaging, and hospital visit records.
2. Data Storage: Storing structured and unstructured data in Hadoop
Distributed File System (HDFS).
3. Data Processing: Using MapReduce and Apache Spark to clean,
transform, and process data.
4. Machine Learning: Applying predictive analytics to identify patients at
risk of readmission.

18 | Page
5. Visualization & Decision Making: Displaying results in dashboards (e.g.,
Tableau, Power BI) to help healthcare providers make informed
decisions.
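The machine-learning step above (step 4) can be sketched with a toy scoring function. The weights and threshold below are invented for demonstration only and have no clinical meaning; a real pipeline would train a model on historical data in Spark or a similar framework:

```python
# Toy illustration of readmission-risk scoring. Weights and the 0.7
# threshold are made up for demonstration, not clinically derived.
def readmission_risk(age, prior_admissions, chronic_conditions):
    score = 0.01 * age + 0.3 * prior_admissions + 0.25 * chronic_conditions
    return min(score, 1.0)  # cap the score at 1.0

patients = [
    {"id": "P1", "age": 72, "prior_admissions": 3, "chronic_conditions": 2},
    {"id": "P2", "age": 35, "prior_admissions": 0, "chronic_conditions": 0},
]
for p in patients:
    risk = readmission_risk(p["age"], p["prior_admissions"], p["chronic_conditions"])
    flag = "HIGH" if risk >= 0.7 else "low"
    print(p["id"], round(risk, 2), flag)
```

In the Hadoop pipeline described above, a trained model would be applied to the full patient dataset stored in HDFS, and the flagged high-risk patients surfaced in the dashboards of step 5.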

Work Flow:

[The workflow diagram for the Hadoop-based predictive analytics pipeline appears on pages 19–20.]
Advantages of Hadoop
1. Scalability

Hadoop is highly scalable because it distributes data across multiple machines.
As data grows, new nodes can be added easily without significant changes to
the existing infrastructure.
2. Cost-Effective
Since Hadoop is open-source, organizations can use commodity hardware (low-
cost servers) instead of expensive, high-end servers. This makes Hadoop an
affordable solution for big data storage and processing.
3. Fault Tolerance
Hadoop replicates data across multiple nodes. If a node fails, the system
automatically recovers the data from another node, ensuring high availability
and reliability.
4. Fast Data Processing
With its parallel processing capabilities, Hadoop processes large volumes of
data efficiently. MapReduce allows data to be processed in parallel across
multiple nodes, reducing execution time.
5. Flexibility in Data Processing
Hadoop supports structured, semi-structured, and unstructured data,
including text, images, videos, and logs. This makes it ideal for handling diverse
datasets.
6. Wide Adoption and Community Support
Being open-source, Hadoop has a strong developer community, extensive
documentation, and a large number of contributors, making it easy to get
support and updates.

21 | Page
Disadvantages of Hadoop
1. Complexity in Setup and Management
Hadoop requires expertise in Java, Linux, and distributed computing, making it
difficult to install, configure, and manage, especially for beginners.
2. High Latency for Small Data
Hadoop is designed for batch processing and is not ideal for real-time data
analytics. For small datasets, traditional databases perform better with lower
latency.
3. Security Issues
By default, Hadoop lacks built-in security features like authentication and
encryption. It needs external security mechanisms such as Kerberos for secure
access.
4. High Memory and CPU Usage
MapReduce operations require significant computational resources, making
Hadoop inefficient for applications that demand low-latency and high-speed
processing.
5. Inefficiency with Iterative Processing
Hadoop’s MapReduce model is not efficient for iterative machine learning and
real-time data analytics. Frameworks like Apache Spark provide better
performance in such cases.
6. Data Integrity Challenges

22 | Page
Managing large-scale data replication can sometimes lead to data
inconsistencies, requiring additional monitoring and maintenance.

23 | Page
Summary
This report explored the implementation of Hadoop, where we successfully
installed and configured the framework and performed basic commands to
understand its working principles. Hadoop's distributed storage (HDFS) and
MapReduce processing enable efficient handling of large datasets, making it a
powerful tool for big data analytics.
Additionally, a case study on Hadoop in healthcare was conducted, detailing its
workflow and how Hadoop-based predictive analytics enhance patient data
management, readmission prediction, and clinical decision-making. The case
study demonstrated how such systems leverage machine learning and big data
analytics to improve healthcare efficiency while addressing challenges like
data privacy and high computational requirements.
Furthermore, the report covered the advantages and disadvantages of Hadoop,
highlighting its scalability, cost-effectiveness, and fault tolerance, along with
challenges such as complex setup, security issues, and inefficiency in real-time
data processing.

Conclusion
Hadoop remains a fundamental tool in big data analytics, offering a robust
infrastructure for processing massive datasets. Its implementation in
data-driven healthcare proves beneficial in managing and analyzing complex
medical data.
However, the challenges associated with Hadoop, such as security concerns
and inefficiencies in real-time analytics, suggest that organizations must
complement Hadoop with other big data technologies like Apache Spark for
better performance.
The integration of AI and big data in healthcare presents vast opportunities to
enhance patient care, optimize operations, and support medical research.
While AI-driven systems continue to evolve, addressing ethical, regulatory, and
data security challenges remains critical for widespread adoption. The synergy
of Hadoop’s big data capabilities and AI innovations in healthcare will likely play
a significant role in the future of medical advancements.

24 | Page
References
1. Elgendy, N., & Elragal, A. (2014). Big Data Analytics: A Literature Review
   Paper. Lecture Notes in Computer Science, 8557, 214–227.
   https://doi.org/10.1007/978-3-319-08976-8_16

2. Batko, K., & Ślęzak, A. (2022). The use of Big Data Analytics in
   healthcare. Journal of Big Data, 9(3).
   https://doi.org/10.1186/s40537-021-00553-4

3. Elgendy, N., & Elragal, A. (2016). Big Data Analytics in Support of the
   Decision Making Process. Procedia Computer Science.
   https://doi.org/10.1016/j.procs.2016.09.251

25 | Page
