A Project Report
On
“Web Based Data Management Using Apache Hive of Health Care Data”
Under subject of
BIG DATA ANALYTICS
BE Semester-VI
(Information Technology Engineering)
Submitted By:
Guided By:
Prof. Pooja Bhatt
INDEX
1. Introduction
2. Analysis
3. Proposed Work
4. Conclusion
5. Reference
ABSTRACT:
Introduction
Big data sources consist of scientific repositories such as the EHR, forecast
analysis, and social media repositories such as Facebook and Twitter for various
textual, voice, video, and image patterns [2]. Big data analytics combines
scientific mining with business intelligence models for deeper data analysis:
extracting useful patterns, inferring hidden insights, performing predictive
analysis, and filtering huge amounts of information from big data.
In addition to explanatory and discovery analytics, big data also offers golden
path analysis, a newer technique for analyzing behavioral data about human
actions. The big data landscape reflects vast changes from the traditional
era of healthcare analytics in the use of advanced data management tools. Big
data analytic tools in healthcare domains currently include Apache Sqoop,
MongoDB, NoSQL, Apache Flume, Cassandra, Apache Hadoop, MapReduce,
and Spark, among many others. This section introduces the basics of Apache
Hadoop and MapReduce as well as the Hive query interface.
Apache Hive
Hive compiles SQL queries into MapReduce jobs and runs those jobs on the
Hadoop cluster.
It stores the schema in a database and the processed data in HDFS.
It is designed for OLAP.
We need reports to improve operations, not to carry out the operations themselves.
We use ETL to populate data in the data warehouse (DW), as sketched in the
example below.
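As a minimal sketch of this ETL idea (the table and column names here are
hypothetical and are not taken from this report), raw records can be loaded into a
staging table and then transformed into a warehouse table entirely in HiveQL:

-- hypothetical staging table holding raw, comma-separated visit records
CREATE TABLE staging_visits (visit_date STRING, state STRING, cost FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- extract/load step: pull the raw file into the staging table
LOAD DATA LOCAL INPATH '/tmp/visits.csv' INTO TABLE staging_visits;

-- transform step: aggregate the staged rows into a warehouse table
CREATE TABLE dw_state_costs AS
SELECT state, SUM(cost) AS total_cost
FROM staging_visits
GROUP BY state;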
Hive is a data warehouse infrastructure tool to process structured data in
Hadoop.
It resides on top of Hadoop to summarize Big Data, and makes querying
and analyzing easy.
Initially Hive was developed by Facebook; later the Apache Software
Foundation took it up and developed it further as open source under
the name Apache Hive.
It is used by different companies. For example, Amazon uses it in
Amazon Elastic MapReduce.
It provides an SQL-like language for querying called HiveQL or HQL.
It is familiar, fast, scalable, and extensible.
Hive gives an SQL-like interface to query data stored in various
databases and file systems that integrate with Hadoop, such as the Hadoop
Distributed File System (HDFS); on Amazon EMR it can also query data stored
in Amazon S3 through an HDFS-like abstraction layer called EMRFS (the
Elastic MapReduce File System). A sketch of this SQL-on-files idea follows
this list.
A Data Warehouse is a database intended specifically for analysis and reporting
purposes.
We want reports and summaries, not the live transactional data that keeps the
operations running.
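As a minimal sketch of querying files that already sit in HDFS (the file path,
table, and columns here are hypothetical): an external Hive table can be laid over
an HDFS directory, after which ordinary SQL-like queries work on it:

-- SQL on files: the data stays in HDFS, Hive only records the schema
CREATE EXTERNAL TABLE patient_events (event_time STRING, state STRING, cause STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/health/events';

-- summary-style query typical of data warehouse reporting
SELECT state, COUNT(*) AS events
FROM patient_events
GROUP BY state;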
Hive Architecture:
Here the user interface (UI) calls the execute interface of the
Driver.
The driver creates a session handle for the query. Then it sends the
query to the compiler to generate an execution plan.
For queries, the execution engine directly reads the contents of the
temporary results file from HDFS as part of the fetch call issued by the Driver,
as illustrated below.
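Relatedly, for very simple queries Hive can skip MapReduce entirely and serve
the result through a direct fetch from HDFS. A small sketch of this behaviour
(hive.fetch.task.conversion is a standard Hive setting; master_new is the table
created later in the Proposed Work section):

-- allow simple SELECT/filter/LIMIT queries to be served by a fetch task
SET hive.fetch.task.conversion=more;

-- no MapReduce job is launched; rows are read straight from HDFS
SELECT * FROM master_new LIMIT 10;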
Working of Hive:
Execution of Hive:
Execute Query: The Hive interface, such as the command line or Web
UI, sends the query to the Driver (any database driver such as JDBC or
ODBC) for execution.
Get Plan: The driver takes the help of the query compiler, which parses
the query to check the syntax and to build the query plan, i.e., the
requirements of the query.
Get Metadata: The compiler sends a metadata request to the metastore
(any database).
Send Metadata: The metastore sends the metadata as a response to the
compiler.
Send Plan: The compiler checks the requirements and resends the plan to
the driver. Up to here, the parsing and compiling of the query is
complete; the resulting plan can be inspected with EXPLAIN, as shown
after this list.
Execute Plan: The driver sends the execution plan to the execution
engine.
Execute Job: The execution engine sends the job to the JobTracker,
which resides on the NameNode, and the JobTracker assigns the job to a
TaskTracker, which resides on a DataNode. Here, the query is executed
as a MapReduce job.
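To see the plan that the compiler produces for a given query, Hive's EXPLAIN
statement can be used. A sketch (master_new is the table created in the Proposed
Work steps below):

-- print the stages, including MapReduce stages, of the execution plan
EXPLAIN
SELECT State, SUM(Deaths) AS total_deaths
FROM master_new
GROUP BY State;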
Pig: Pig is used for the analysis of large amounts of data. It is an abstraction
over MapReduce. Pig is used to perform all kinds of data manipulation
operations in Hadoop. It provides the Pig Latin language for writing code,
which contains many built-in functions such as join and filter. The two
parts of Apache Pig are Pig Latin and the Pig Engine; the Pig Engine
converts these scripts into specific map and reduce tasks. Pig's
abstraction is at a higher level, so it requires fewer lines of code than raw
MapReduce.
Heat maps are used for insightful visualization. They use color and size to
express values, so the user can quickly obtain a wide array of
information.
Heat maps have wide application when comparisons are required using
large numbers of datasets.
In Fig. 14, the states of California, Texas, Florida, and New York are on
high alert, as indicated by the greater brightness of the red color for
these states.
Therefore, all further research is focused on these states.
Using Tableau's visualization, the main causes of death in these states and
the corresponding numbers of deaths can be inferred.
Fig.15. Visualization of reasons for death in California and its percentage distribution
Forecasts for each state obtained with exponential smoothing (see the
recurrence below) are shown in Fig. 19, and the error values are shown in
Table 1. The number of deaths is expected to
further increase in California and Florida, indicating a need for
appropriate preventive measures. It has also been observed that certain
diseases require proper attention, as suggested by their rapid increases
over time.
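For reference, the simple (single) exponential smoothing recurrence underlying
such forecasts is the standard one; Tableau's forecasting also supports variants
with trend and seasonality, and the smoothing constant is a model parameter, not
a value taken from this report:

S_t = α · y_t + (1 − α) · S_(t−1),   with 0 < α < 1,

where y_t is the observed number of deaths in year t, S_t is the smoothed value,
and the forecast for the next period is taken as S_t.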
Table 1.Values of various quality metrics for proposed work
Step 4. Loading the Dataset: The dataset is loaded from the downloads folder
to the Hadoop environment using the following command:
LOAD DATA LOCAL INPATH
'/home/cloudera/Downloads/Health_Dataa.csv' OVERWRITE INTO
TABLE tempp_master;
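Note that this assumes the staging table tempp_master was created in an earlier
step that is not shown here. Consistent with the regexp_extract(col_value, ...)
calls in Step 6, it would be a single-column table holding each raw CSV row as
one string; an assumed sketch (not the report's original command) is:

-- assumed staging table: one STRING column per raw CSV row
CREATE TABLE tempp_master (col_value STRING)
STORED AS TEXTFILE;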
Step 5. Creation of Table: A new table is created with each column having
the same datatype as columns of the original dataset through the following
command:
CREATE TABLE master_new (Year INT, Cause VARCHAR(100), CName STRING,
State STRING, Deaths INT, AdjustedDeathRate FLOAT);
Step 6. Mapping of columns: Each column is mapped from the table storing
the whole dataset to the table created in Step 5 using a regular expression
based on the following command:
INSERT OVERWRITE TABLE master_new SELECT
regexp_extract(col_value, '^(?:([^,]*),?){1}', 1) Year,
regexp_extract(col_value, '^(?:([^,]*),?){2}', 1) Cause,
regexp_extract(col_value, '^(?:([^,]*),?){3}', 1) CName,
regexp_extract(col_value, '^(?:([^,]*),?){4}', 1) State,
regexp_extract(col_value, '^(?:([^,]*),?){5}', 1) Deaths,
regexp_extract(col_value, '^(?:([^,]*),?){6}', 1) AdjustedDeathRate
FROM tempp_master;
Step 7. Display the Table: To check the properties of the table created in Step
6, with the whole dataset stored in its columns, use the following command; the
result is shown in Fig. 4:
SHOW TBLPROPERTIES master_new;
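Once master_new is populated, the per-state figures used for the heat map and
for the exponential smoothing forecasts described in the Analysis section can be
produced with an aggregation query along the following lines (a sketch; the
exact query behind the report's figures is not shown here):

-- yearly deaths per state: input for the Tableau heat map and the forecasts
SELECT State, Year, SUM(Deaths) AS total_deaths
FROM master_new
GROUP BY State, Year
ORDER BY State, Year;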
This study analyzes EHR big data using Hive queries and identifies high-alert
states with the maximum death rates. The data extracted from the given dataset
are then used to forecast trends for these high-alert states with exponential
smoothing, and the results are visualized in Tableau.
References: