
3161607

Big Data Analytics

A Project Report
On
“Web Based Data Management Using Apache Hive of Health Care Data”
Under subject of
BIG DATA ANALYTICS
BE Semester-VI
(Information Technology Engineering)

Submitted By:

Sr. No. Name of Student Enrollment No.


1. Patel Krupa 190633116001
2. Patel Maitri 190633116003
3. Soni Riya 190633116004

Guided By:
Prof. Pooja Bhatt

Academic Year (2020-2021)

INDEX

No.   Content                          Page No.
1     Introduction                     3-4
2     Big Data Tools and Techniques    4-9
3     Hive Vs. Pig                     10
4     Analysis                         11-13
5     Proposed Work                    14-15
6     Conclusion                       16
7     References                       16

 ABSTRACT:

The Electronic Health Record (EHR) stores valuable information on patient records in digital form. The amount of data in the EHR is increasing due to government mandates and technological innovation. Patient data are recorded using sensors and medical reports. Given huge amounts of heterogeneous data in the EHR, there is a need for effective methods to store and analyze these data for meaningful interpretations. This study focuses on various analysis techniques for analyzing and retrieving required information from big data in the EHR. Many Hive queries are conducted in the Hadoop Distributed File System to extract valuable information. The study also proposes and demonstrates the use of Tableau as a data analysis technique for effectively deducing valuable information in the form of visual graphs.

 Introduction

A digital version of patient medical reports, known as the Electronic Health Record (EHR), provides benefits such as data sharing, easy access to patient health history, and powerful analysis techniques for extracting valuable information from big data in the EHR. Given such benefits, the U.S. government made it mandatory to maintain the EHR for all patients under the American Recovery and Reinvestment Act (ARRA) of 2009. As a result, electronic data in the EHR have become heterogeneous and are rapidly growing. It is therefore difficult for a single system to support such huge amounts of data, and traditional DBMS tools have been shown to be insufficient for handling large volumes of unstructured data. These requirements have led to the use of distributed systems for the efficient processing of big data. The Hadoop Distributed File System (HDFS) is a technique in which big data can be processed in a distributed manner across a number of systems.

This paper presents different SQL-like statements for querying and managing useful information using Apache Hive on top of the HDFS. The paper also proposes the integration of Apache Hive with Tableau as a powerful visualization tool to enable deeper insights into big data from the EHR. The rest of this paper is organized as follows: Section 2 describes important big data tools and techniques for analyzing and extracting useful information and offers details on Apache Hadoop, its architecture, and the role of Apache Hive in big data analytics. Section 3 presents related work on the analysis of big data and its issues. Section 4 describes the proposed work by explaining the dataset and experimental methodology. Section 5 presents the output of various Apache Hive queries and discusses different metrics for measuring the forecast error. It also details the integration of Tableau for better visualization of Apache Hive query results and graphically presents forecasts of high-alert states with high death rates. Finally, Section 6 concludes with some avenues for future research.

 Big Data Tools and Techniques

Big data sources consist of scientific repositories such as the EHR, forecast analysis, and social media repositories such as Facebook and Twitter with various textual, voice, video, and image patterns [2]. Big data analytics combines scientific mining with business intelligence models for deeper data analysis: extracting useful patterns, inferring hidden insights, performing predictive analysis, and filtering huge amounts of information from big data. In addition to explanatory and discovery analytics, big data also offers golden path analysis, a newer technique for analyzing behavioral data on human actions. The big data landscape reflects vast changes from the traditional era of healthcare analytics in the use of advanced data management tools. Big data analytic tools in healthcare domains currently include Apache Sqoop, MongoDB, NoSQL, Apache Flume, Cassandra, Apache Hadoop, MapReduce, and Spark, among many others. This section covers the basics of Apache Hadoop and MapReduce as well as the Hive query interface.

 Apache Hadoop

Apache Hadoop is a powerful distributed software framework configured over servers in massively parallel clusters. It is an open-source distributed system written in Java that can process data across clusters of computers. A conceptual view of the Hadoop architecture is shown in Fig. 1.

Fig. 1. Conceptual view of Hadoop architecture

 The Hadoop architecture consists of four modules: Hadoop Common Runtime, Hadoop YARN, HDFS, and Hadoop MapReduce.

 Hadoop Common Runtime: This includes the Java libraries and utilities required by the other Hadoop modules. These libraries provide access to the Hadoop Distributed File System and basic runtime support for initializing Hadoop start-up modules.

 Hadoop YARN: This provides a layer for job scheduling and cluster resource management.

 HDFS: To optimize data access, the HDFS enables high-throughput access to application data (a short example of interacting with HDFS follows this list).

 Hadoop MapReduce: This is a software framework for processing large datasets in parallel, with two basic operations: Map Task and Reduce Task. Map Task is performed first and must complete before Reduce Task begins.
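
As a concrete illustration of storing data in HDFS before Hive can query it (a minimal sketch; the file name and directory paths here are hypothetical), the hdfs dfs command-line utility copies local files into the distributed file system:

hdfs dfs -mkdir -p /user/cloudera/ehr              # create a directory in HDFS
hdfs dfs -put Health_Data.csv /user/cloudera/ehr/  # copy a local CSV into it
hdfs dfs -ls /user/cloudera/ehr                    # verify the file is stored in HDFS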

 Role of Apache Hive in Big Data Analytics

 Hive is a data warehouse infrastructure built on top of Hadoop that can compile SQL queries into MapReduce jobs and run those jobs on the cluster.
 It is suitable for structured and semi-structured data.
 It is capable of dealing with different storage and file formats.
 It provides HQL, an SQL-like query language (a sample query is shown after this list).

What Hive is not:
 It does not use complex indexes, so it does not respond in seconds.
 However, it scales very well and works with data on the petabyte scale.
 It is not independent; its performance is tied to Hadoop.

 Hive is built on top of Hadoop – think HDFS and MapReduce.
 Hive stores its data in the HDFS.
 Hive compiles SQL queries into MapReduce jobs and runs the jobs on the Hadoop cluster.
 It stores its schema in a database and its processed data in HDFS.
 It is designed for OLAP: we need reports to make operations better, not to conduct the operations themselves.
 We use ETL to populate data in the data warehouse (DW).
 Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides on top of Hadoop to summarize big data and makes querying and analyzing easy.
 Hive was initially developed by Facebook; later, the Apache Software Foundation took it up and developed it further as open source under the name Apache Hive.
 It is used by different companies. For example, Amazon uses it in Amazon Elastic MapReduce.
 It provides an SQL-type language for querying called HiveQL or HQL.
 It is familiar, fast, scalable, and extensible.
 Hive gives an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop. It provides an SQL interface to query data stored in the Hadoop Distributed File System (HDFS) or in Amazon S3 (in the AWS implementation, through an HDFS-like abstraction layer called EMRFS, the Elastic MapReduce File System).
 A data warehouse is a database specific to analysis and reporting purposes: we want reports and summaries, not the live transactional data used to run day-to-day operations.
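
As a minimal sketch of what HQL looks like (the table and column names below are hypothetical, chosen only for illustration), the following creates a table over comma-delimited text files and runs an SQL-like aggregation that Hive compiles into MapReduce jobs:

-- a simple table over comma-delimited text files in HDFS
CREATE TABLE patient_visits (
  patient_id INT,
  state STRING,
  diagnosis STRING,
  visit_year INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- an SQL-like aggregation, executed as a MapReduce job
SELECT state, COUNT(*) AS visit_count
FROM patient_visits
WHERE visit_year >= 2010
GROUP BY state;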
 Hive Architecture:


 Here the User Interface (UI) calls the execute interface of the Driver.
 The Driver creates a session handle for the query and sends the query to the compiler to generate an execution plan.
 The compiler needs the metadata, so it sends a getMetaData request to the Metastore and receives the sendMetaData response.
 The compiler uses this metadata to type-check the expressions in the query. The compiler then generates the plan, which is a DAG of stages, with each stage being either a map/reduce job, a metadata operation, or an operation on HDFS. For map/reduce stages, the plan contains map operator trees and a reduce operator tree (the EXPLAIN example after this list shows such a plan).
 For queries, the execution engine directly reads the contents of the temporary file from HDFS as part of the fetch call from the Driver.
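
The plan can be inspected with Hive's EXPLAIN statement, which prints the DAG of stages without executing the query (a sketch, reusing the hypothetical table from the earlier example):

-- prints stage dependencies and, for map/reduce stages,
-- the map and reduce operator trees described above
EXPLAIN
SELECT state, COUNT(*) AS visit_count
FROM patient_visits
GROUP BY state;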

 User Interface – Hive is data warehouse infrastructure software that mediates the interaction between the user and HDFS.
 Metastore – Hive uses a database server of choice to store the schema or metadata of tables, databases, columns in a table, their data types, and the HDFS mapping.
 HiveQL Process Engine – HiveQL is similar to SQL for querying the schema information in the Metastore. It is one replacement for the traditional approach of writing MapReduce programs directly.
 Execution Engine – The conjunction of the HiveQL Process Engine and MapReduce is the Hive Execution Engine, which processes queries using the MapReduce flavor.
 HDFS or HBase – The Hadoop Distributed File System or HBase is the data storage technique used to store data in the file system, offering extreme scalability (up to 100 PB) and self-healing storage.

 In addition to Hive, the Hadoop ecosystem includes other sub-tools such as Sqoop and Pig. Sqoop provides data migration from a traditional RDBMS schema to the HDFS schema, with import and export operations to and from HDFS and the RDBMS; the programmer can transform RDBMS data definitions into HDFS datasets and vice versa (an illustrative import command follows). Pig is a Hadoop procedural language for constructing one's own scripts for map-reduce operations, much like SQL scripting in RDBMS systems. Pig is useful for more advanced scripting operations, achieving dynamicity through the power of the MapReduce API.
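
A minimal sketch of such a Sqoop import (the JDBC connection string, user name, table, and target directory here are hypothetical):

sqoop import \
  --connect jdbc:mysql://dbhost/ehr \
  --username analyst -P \
  --table patient_records \
  --target-dir /user/cloudera/ehr/patient_records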

 Working of Hive:
 Execute Query: The Hive interface, such as the Command Line or Web UI, sends the query to the Driver (any database driver such as JDBC or ODBC) for execution.
 Get Plan: The Driver takes the help of the query compiler, which parses the query to check the syntax and the query plan or requirements of the query.
 Get Metadata: The compiler sends a metadata request to the Metastore (any database).
 Send Metadata: The Metastore sends the metadata as a response to the compiler.
 Send Plan: The compiler checks the requirements and resends the plan to the Driver. Up to here, the parsing and compiling of the query is complete.
 Execute Plan: The Driver sends the execution plan to the execution engine.
 Execute Job: The execution engine sends the job to the JobTracker, which resides in the Name node, and the JobTracker assigns the job to a TaskTracker, which resides in a Data node. Here, the query executes as a MapReduce job.

 Hive Vs. Pig


 Pig: Pig is used for the analysis of large amounts of data. It is an abstraction over MapReduce and is used to perform all kinds of data manipulation operations in Hadoop. It provides the Pig-Latin language to write code, with many built-in functions such as join and filter. The two parts of Apache Pig are Pig-Latin and the Pig Engine; the Pig Engine converts Pig-Latin scripts into specific map and reduce tasks. Pig's abstraction is at a higher level, so it requires fewer lines of code compared to raw MapReduce.

 Hive: Hive is built on top of Hadoop and is used to process structured data in Hadoop. Hive was developed by Facebook. It provides a querying language frequently known as Hive Query Language. Apache Hive is a data warehouse that provides an SQL-like interface between the user and the Hadoop Distributed File System (HDFS), which integrates with Hadoop.

 Analysis Based on Queries Executed in Hive


 We executed a large number of Hive queries for analysis based on the experimental methodology (a representative query is sketched below).
 While executing the queries, we recorded various factors related to the Hadoop environment, including the number of Map and Reduce tasks, cumulative CPU usage, HDFS Read, and HDFS Write, and analyzed them as described in the next sections.
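
For example, the state-wise death totals underlying the heat map in Fig. 14 can be obtained with a query of the following form against the master_new table built in the Proposed Work section (a sketch; the queries we actually executed may differ in detail):

-- total recorded deaths per state, largest first
SELECT State, SUM(Deaths) AS total_deaths
FROM master_new
GROUP BY State
ORDER BY total_deaths DESC;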

Fig. 14. Heat map for different states corresponding to the number of deaths

 Heat maps are used for insightful visualization. They use color and size to express values, so the user can quickly absorb a wide array of information.
 Heat maps are widely applicable when comparisons are required across large numbers of data points.
 In Fig. 14, the states of California, Texas, Florida, and New York are on high alert, as indicated by the greater brightness of the red color for these states.
 Therefore, all further research focuses on these states.
 With Tableau, the main reasons for deaths in these states and the corresponding numbers of deaths can be inferred through visualization.


Fig. 15. Visualization of reasons for death in California and their percentage distribution

From Fig. 15, the following findings can be summarized:

 Cancer, heart disease, and septicaemia are the main causes of death in California.
 The numbers of deaths from cancer and diabetes have remained constant since 1999.
 Although heart-related diseases represent one of the main causes of death in California, their percentage decreases over time.
 Unexpectedly, Alzheimer's disease and unintentional injuries (accidents) have emerged as causes, increasing the number of deaths.
 The numbers of deaths from septicaemia, suicide, and stroke show a decreasing trend.


Fig. 19. Forecasting for the four states using exponential smoothing

 Forecasts for the four states produced by exponential smoothing are shown in Fig. 19, and the corresponding error values are shown in Table 1. The number of deaths is expected to further increase in California and Florida, indicating a need for appropriate preventive measures. It has also been observed that certain diseases require proper attention, as suggested by their rapid increases over time.
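
For reference, the level-only form of exponential smoothing is the recurrence (a sketch; Tableau's forecasting additionally fits trend and seasonal components and selects the smoothing parameters automatically):

s_t = \alpha x_t + (1 - \alpha) s_{t-1}, \qquad 0 < \alpha \le 1

where x_t is the observed number of deaths in period t, s_t is the smoothed level used as the one-step-ahead forecast, and \alpha controls how quickly older observations are discounted. In Table 1, RMSE and MAE measure the average magnitude of the forecast errors, MASE and MAPE express the errors in scaled and percentage terms for comparison across states, and AIC trades off fit against model complexity.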
 Table 1. Values of various quality metrics for the proposed work

State        RMSE    MAE    MASE   MAPE    AIC
Texas        4,096   3,049  1.23   2.20%   289
New York     2,540   2,192  1.25   1.70%   273
Florida      3,743   2,395  1.16   1.60%   286
California   4,424   3,094  1.01   1.50%   291


 The Proposed Work:

 This paper analyzes healthcare data using Apache Hive. The study uses a number of Hive queries for data analysis.
 For improved data analytics, Apache Hive is integrated with Tableau as a powerful analytics tool. This helps examine the conversion of data into useful information.
 The use of Tableau helps analyze high-alert states with maximum deaths, the major diseases resulting in the most deaths, and deaths corresponding to diseases with decreasing trends.
 Exponential smoothing is used to forecast disease trends.
 This section presents the proposed work and experiments for the analysis of big data in the EHR.
 Experiments
 We conducted a set of experiments to analyze EHR big data. The proposed technique was validated on a dataset from the Centers for Disease Control and Prevention.
 The dataset is publicly available under the Open Data Commons Open Database License (ODbL).
 This dataset provides the numbers of deaths in the United States and their causes.
 The proposed work was based on the following predefined steps:
 Step 1. Connection Establishment: A Beeline connection is established to the Hive server using Java Database Connectivity (JDBC), as depicted in Fig. 3.
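
A typical invocation of this kind is shown below (a sketch assuming HiveServer2 runs locally on its default port 10000; the host, port, and user name are environment-specific assumptions):

beeline -u jdbc:hive2://localhost:10000 -n cloudera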


 Step 2. Downloading the Dataset: The dataset is downloaded from the Internet using a browser in the virtual environment and saved in the Downloads folder as the default location.

 Step 3. Creation of an Empty Table: A staging table is created to hold the downloaded dataset, one raw line per row, using the following command:

create table tempp_master (col_value STRING);

 Step 4. Loading the Dataset: The dataset is loaded from the Downloads folder into the Hadoop environment using the following command:

LOAD DATA LOCAL INPATH '/home/cloudera/Downloads/Health_Dataa.csv'
OVERWRITE INTO TABLE tempp_master;
 Step 5. Creation of the Table: A new table is created with each column having the same datatype as the corresponding column of the original dataset, through the following command:

create table master_new (
  Year INT,
  Cause VARCHAR(100),
  CName STRING,
  State STRING,
  Deaths INT,
  AdjustedDeathRate FLOAT
);
 Step 6. Mapping of Columns: Each column is mapped from the staging table holding the raw dataset to the table created in Step 5 using a regular expression; the pattern '^(?:([^,]*),?){n}' extracts the n-th comma-separated field of each line. The command is as follows:

insert overwrite table master_new
SELECT regexp_extract(col_value, '^(?:([^,]*),?){1}', 1) Year,
       regexp_extract(col_value, '^(?:([^,]*),?){2}', 1) Cause,
       regexp_extract(col_value, '^(?:([^,]*),?){3}', 1) CName,
       regexp_extract(col_value, '^(?:([^,]*),?){4}', 1) State,
       regexp_extract(col_value, '^(?:([^,]*),?){5}', 1) Deaths,
       regexp_extract(col_value, '^(?:([^,]*),?){6}', 1) AdjustedDeathRate
from tempp_master;

 Step 7. Display the Table: To check the properties of the table created in Step 5, now populated with the whole dataset, use the following command; the result is shown in Fig. 4:

show TBLPROPERTIES master_new;
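
As a quick sanity check that the column mapping in Step 6 worked (a sketch; the output depends on the dataset), the first few rows can also be inspected directly:

SELECT * FROM master_new LIMIT 5;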


 Conclusions and Future Research

This study analyzes EHR big data using Hive queries and forecasts high-alert states with maximum death rates. We analyze big data extracted from the given dataset to forecast high-alert states using exponential smoothing and visualize the results using Tableau.

The results suggest an urgent need to address the alarming situation of high numbers of deaths in four states, namely California, Florida, New York, and Texas. The numbers of deaths from major diseases such as cancer and heart disease remain constant, and in some cases, such as Alzheimer's disease and unintentional injuries (accidents), the numbers are increasing.

This study uses exponential smoothing for forecasting purposes; as an extension to this study, machine learning techniques such as neural networks may produce better prediction results.

 References:

 Cloudera QuickStart. (n.d.). Cloudera. Retrieved July 1, 2018, from https://www.cloudera.com/
 Common Statistical Formulas. (n.d.). Statistics Solutions. Retrieved July 1, 2018, from http://www.statisticssolutions.com/common-statistical-formulas/
 Forecasting. (n.d.). Tableau. Retrieved July 1, 2018, from https://onlinehelp.tableau.com/v9.3/public/online/windows/en-us/help.html#forecast_describe.html
 Aggregate Functions. (n.d.). Cloudera. Retrieved July 1, 2018, from https://www.cloudera.com/documentation/enterprise/5-8-x/topics/impala_variance.html
 Tableau. (n.d.). Tableau. Retrieved July 1, 2018, from https://www.tableau.com
