Group 37
Guided By –
Neeta Maitre
Sponsorship and External Guide
Guided by:
Mr. Ajit Aher
Miss Deepa Nagaliker
Mr. Sunil Chawla
Problem Definition
Reading unstructured data inputs and transforming
them into semi-structured or structured form.
Using a data warehouse and visualisation
tools to store and display the
processed data.
Processing the data with an
analytic engine to obtain
insights for better managerial and strategic
decision making.
Displaying the processed data in a dashboard
using various visualization tools.
Motivation
Existing systems include many
online tools that convert:
◦ PDF to text
◦ Excel or CSV file to text
◦ Image to text
These systems offer limited end-to-end
functionality, which motivated us
to design a system that makes better
use of an open-source stack.
Introduction
About 90% of today’s data is
unstructured.
Unstructured data comes in various
forms such as PDFs, images, videos, white
papers, research reports, etc.
The proposed system will help extract
insights for quick decision making,
which every business needs.
The system should be user friendly,
compatible, and easy to enhance
in the future.
Scope
The proposed system can identify
the various forms of data and
process each accordingly.
Implementation of functionality
for better decision making and
for conveying trends in the
business world.
Visualization of data in the form of a
dashboard- pie charts, bar-
Domain Specific Keywords
Decision support systems
Data mining
Text mining
Sentiment analysis
Clustering and classification
Business intelligence
Proposed System
Module wise Description
Login:
◦ Access allowed only to authorized users.
◦ Proper authentication and password
change/recovery facilities.
Data Processing Module:
◦ Asking for the type of data file to be
transformed
◦ Transforming the input file into an intermediate
file, most likely a text file
◦ Storing this file on HDFS
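The first two steps of this module can be sketched in plain Java. This is only an illustration under stated assumptions: the class name, the extension-based rule, and the ".txt" naming of the intermediate file are hypothetical and do not come from the project; the actual conversion and the HDFS upload are left out.

```java
import java.util.Locale;

public class InputTypeDispatch {
    /** Kinds of input named on the Motivation slide: PDF, Excel/CSV, image. */
    public enum InputType { PDF, SPREADSHEET, IMAGE, UNKNOWN }

    /** Guess the input type from the file extension (hypothetical rule). */
    public static InputType detect(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return InputType.UNKNOWN;
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        switch (ext) {
            case "pdf":
                return InputType.PDF;
            case "csv":
            case "xls":
            case "xlsx":
                return InputType.SPREADSHEET;
            case "png":
            case "jpg":
            case "jpeg":
                return InputType.IMAGE;
            default:
                return InputType.UNKNOWN;
        }
    }

    /** Name of the intermediate text file produced for a given input file. */
    public static String intermediateName(String fileName) {
        int dot = fileName.lastIndexOf('.');
        String base = dot > 0 ? fileName.substring(0, dot) : fileName;
        return base + ".txt";
    }

    public static void main(String[] args) {
        String name = "report.PDF";
        // Decide which converter would run, then name the text file it writes.
        System.out.println(detect(name) + " -> " + intermediateName(name));
    }
}
```

A real implementation would call a converter for each type and then copy the resulting text file to HDFS.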
Module wise Description contd…
Data Storing and Analysing Module:
◦ Using PostgreSQL for querying the large
datasets
◦ Analyzing the queried data using an analytic
engine written in R
◦ Storing the analyzed data back in a new
database
Data Visualization Module:
◦ Visualizing the generated data as various
forms of reports or dashboards using
Jaspersoft.
Technology used
PostgreSQL
◦ Full support for outer joins.
◦ Easy processing of sub-queries.
Hadoop HDFS
◦ Storage for large data sets
of different forms.
◦ Use of rmr, rhbase, rhdfs for
integration with R.
Technology Used contd…
Talend ETL
◦ Simple graphical tools
◦ Simple creation of routines and jobs
◦ Cost effective
RStudio
◦ Open-source IDE
◦ Support for built-in functions of R
◦ Easy to integrate with other
software.
Technology used contd…
Jaspersoft
◦ Open-source BI tool
◦ Supports a variety of targets
◦ Generates dynamic content
◦ Can be used in Java EE.
Programming Languages
Java
◦ Talend ETL- Routine or Job
◦ Hadoop
R Programming
◦ RStudio – Data Mining Algorithms
Literature Survey
ETL Tool – Talend, Informatica
◦ A tool for Java programmers
◦ Saves time by generating
code for the user
◦ Can apply multiple algorithms to a record.
◦ Easy to use.
◦ Software is bundled, unlike
Informatica.
Literature Survey contd…
Hadoop HDFS –
◦ For storing large business data
PostgreSQL –
◦ Supports full outer joins, unlike MySQL
◦ Supports a wide variety of languages
◦ Strong support for sub-queries
◦ Less hassle with custom data
types and table inheritance
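To make the full outer join point concrete, the sketch below reproduces in plain Java what such a join returns: every key from either side, with the missing side reported as null. The class name, the sample tables, and the assumption of unique keys on both sides are hypothetical; in the project the join would be written in SQL and executed by PostgreSQL.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

public class FullOuterJoinSketch {
    /** Joins two key->value tables; a side with no match shows up as null. */
    public static List<String> fullOuterJoin(Map<Integer, String> left,
                                             Map<Integer, String> right) {
        // The union of both key sets is what distinguishes a full outer join.
        TreeSet<Integer> keys = new TreeSet<>(left.keySet());
        keys.addAll(right.keySet());
        List<String> rows = new ArrayList<>();
        for (Integer key : keys) {
            rows.add(key + "," + left.get(key) + "," + right.get(key));
        }
        return rows;
    }

    /** Runs the join on small made-up tables. */
    public static List<String> demo() {
        Map<Integer, String> employees = new TreeMap<>();
        employees.put(1, "Asha");
        employees.put(2, "Ravi");
        Map<Integer, String> departments = new TreeMap<>();
        departments.put(2, "Sales");
        departments.put(3, "HR");
        return fullOuterJoin(employees, departments);
    }

    public static void main(String[] args) {
        // Prints: 1,Asha,null then 2,Ravi,Sales then 3,null,HR
        for (String row : demo()) {
            System.out.println(row);
        }
    }
}
```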
Literature Survey contd…
Visualization Tool – Jaspersoft,
Pentaho Report Designer
◦ Heavy focus on reporting and
analysis.
◦ Jaspersoft has a better web user interface
than Pentaho and is easier to use.
◦ Benefits from better marketing,
informational web sites and
documentation.
Conclusions
The proposed system will help
analysts make quick decisions
easily by means of
visualization.
This analysis can be applied
in the following domains:
◦ News Reporters
◦ Politicians
◦ Medicine and Health-care
Time Line
Work Done so far
Installation of the following :
◦ Java
◦ R
◦ RStudio
◦ Talend
◦ Hadoop
Integration of the following:
◦ R-Hadoop
◦ R-PostgreSQL
◦ tFileInputDelimited
◦ tMap
◦ tLogRow
Run the Job to get a text file of the
selected PDF file at the desired location,
which is currently the Job folder.
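The flow through the three components listed above (delimited file input, tMap, tLogRow) can be pictured with a plain-Java sketch. This is not code generated by Talend; the class name, the trimming rule used as the "map" step, and the pipe-separated log format are assumptions made only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class DelimitedPipelineSketch {
    /** Input step: split delimited text into rows of fields. */
    public static List<String[]> readDelimited(String text, String delimiter) {
        List<String[]> rows = new ArrayList<>();
        for (String line : text.split("\\R")) {
            if (!line.trim().isEmpty()) {
                // Quote the delimiter so characters like '|' are taken literally.
                rows.add(line.split(Pattern.quote(delimiter), -1));
            }
        }
        return rows;
    }

    /** Map step: trim whitespace from every field (hypothetical rule). */
    public static List<String[]> mapRows(List<String[]> rows) {
        List<String[]> out = new ArrayList<>();
        for (String[] row : rows) {
            String[] mapped = new String[row.length];
            for (int i = 0; i < row.length; i++) {
                mapped[i] = row[i].trim();
            }
            out.add(mapped);
        }
        return out;
    }

    /** Log step: print each row as one pipe-separated line and return the lines. */
    public static List<String> logRows(List<String[]> rows) {
        List<String> lines = new ArrayList<>();
        for (String[] row : rows) {
            String line = String.join("|", row);
            System.out.println(line);
            lines.add(line);
        }
        return lines;
    }

    /** Chains the three steps in the order the components are connected. */
    public static List<String> run(String text, String delimiter) {
        return logRows(mapRows(readDelimited(text, delimiter)));
    }

    public static void main(String[] args) {
        run("id;name\n1; Asha \n2;Ravi", ";");
    }
}
```

In the Talend job itself these steps are configured graphically and the Java code is generated by the tool.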
Integration of R-Hadoop
Download rJava from:
https://cran.r-project.org/web/packages/rJava/index.html
Open RStudio
Go to: Tools -> Install Packages
Select "Package Archive File" from the "Install from" drop-down
-> browse to the downloaded
tar.gz file -> click Install.
Go to the RStudio console and type the command:
install.packages(c("rJava", "Rcpp", "RJSONIO", "bitops", "digest",
"functional", "stringr", "plyr", "reshape2", "caTools"))