
Group 37

1. The document describes a proposed system for unstructured data mining using open source tools like PostgreSQL, Hadoop, R and Jaspersoft.
2. The system aims to process unstructured data inputs like PDFs and images into structured data for analysis and visualization to help decision making.
3. Work done so far includes installation of tools, integration of R with PostgreSQL and Hadoop for processing and analyzing unstructured data, and using Talend for converting a PDF to text.

Uploaded by

Pooja Ban

WELCOME !!!!

Unstructured Data Mining Using Open Source Stack

Jagruti Wagh B120204420


Ashvini Dukare B120204252
Jidnyasa Gondane B120204260

Guided By –
Neeta Maitre

Sponsorship and External Guide
Principal Global Services Pvt Ltd.

Guided by:
Mr. Ajit Aher
Miss Deepa Nagaliker
Mr. Sunil Chawla
Problem Definition
Reading unstructured data inputs and transforming them into semi-structured or structured form.
Using a data warehouse and visualisation tools to store and display the processed data.
Processing the data with an analytic engine to gain insights for better managerial and strategic decision making.
Displaying the processed data in a dashboard using various visualization tools.
Motivation
Existing systems include many online tools that convert:
◦ PDF to text
◦ Excel or CSV file to text
◦ Image to text
These systems offer limited end-to-end functionality, which motivated us to design a system that makes better use of the open source stack.
Introduction
About 90% of today’s data is
unstructured.
Unstructured data comes in various forms such as PDFs, images, videos, white papers and research reports.
The proposed system will help in getting insights for quick decision making, which every business needs.
The system should be user friendly, compatible and easy to enhance in the future.
Scope
The proposed system can identify the various forms of data and process them accordingly.
Implementation of functionality that supports better decision making and gives an idea of trends in the business world.
Visualization of data in form of
dashboard- pie charts, bar-
Domain Specific Keywords
Decision support systems
Data mining
Text mining
Sentiment analysis
Clustering and classification
Business intelligence
Proposed System
Module wise Description
Login:
◦ Access allowed to only authorized users.
◦ Proper authentication and password
change/recovery facilities
Data Processing Module:
◦ Asks for the type of data file to be transformed
◦ Transforms the input file into an intermediate file, most likely a text file
◦ Stores this file on HDFS
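
The first step of the Data Processing Module, deciding how to transform a file based on its type, can be sketched as below. This is a minimal illustration and not the project's actual code: the function name `conversion_route`, the route labels and the list of supported extensions are all assumptions, chosen to mirror the three conversions named on the Motivation slide.

```r
# Hypothetical helper: choose a conversion route from the file extension.
# The route labels and supported extensions are illustrative assumptions.
conversion_route <- function(path) {
  ext <- tolower(tools::file_ext(path))  # e.g. "report.PDF" -> "pdf"
  switch(ext,
    pdf  = "pdf-to-text",
    csv  = ,                             # empty entries fall through to the next value
    xls  = ,
    xlsx = "spreadsheet-to-text",
    png  = ,
    jpg  = ,
    jpeg = "image-to-text",
    stop("Unsupported input type: ", ext)
  )
}
```

A call such as `conversion_route("report.PDF")` would select the PDF route; the intermediate text file it leads to would then be stored on HDFS as described above.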
Module wise Description contd…
Data Storing and Analysing Module:
◦ Using PostgreSQL for querying the large datasets
◦ Analyzing the queried data using an analytic engine written in R
◦ Storing the analyzed data back in a new database
Data Visualization Module:
◦ Visualizing the data generated in various
forms of reports or dashboards using
JasperSoft.
Technology used
PostgreSQL
◦ Full support for outer-joins
◦ Easy processing of sub-queries.

Hadoop HDFS
◦ Storage functionality for
large data-sets of different forms.
◦ Use of rmr, rhbase, rhdfs for
integration with R.
Technology Used contd…
Talend ETL
◦ Simple graphical tools
◦ Simple creation of routines and jobs
◦ Cost effective

RStudio
◦ Open-source IDE
◦ Support for inbuilt functions of R
◦ Easy to integrate with other software.
Technology used contd…
JasperSoft
◦ Open-source BI tool
◦ Supports variety of targets
◦ Generates dynamic content
◦ Used in Java EE.
Programming Languages
Java
◦ Talend ETL- Routine or Job
◦ Hadoop

R Programming
◦ RStudio – Data Mining Algorithms
Literature Survey
ETL Tool – Talend, Informatica
◦ A tool for Java Programmers
◦ Saves a lot of time by generating code for the user
◦ Multiple algorithms on a record.
◦ Easy to use.
◦ Software is bundled, unlike Informatica.
Literature Survey contd…
Hadoop HDFS-
◦ For storing large business data
PostgreSQL –
◦ Supports full outer joins, unlike MySQL
◦ Supports a wide variety of languages
◦ Strong support for sub-queries
◦ Less hassle with custom data types and table inheritance
Literature Survey contd…
Visualization Tool – Jaspersoft, Pentaho Report Design
◦ Heavy focus on reporting and
analysis.
◦ Better web user interface than
Pentaho and easier to use.
◦ Benefits from better marketing,
informational web sites and
documentation.
Conclusions
The proposed system will help analysts make quick decisions easily by means of visualization.
This analysis can be applied well in the following domains:
◦ News Reporters
◦ Politicians
◦ Medicine and Health-care
Time Line
Work Done so far
Installation of the following :
◦ Java
◦ R
◦ R-Studio
◦ Talend
◦ Hadoop
Integration of the following:
◦ R-Hadoop
◦ R-postgreSQL

PDF-to-text conversion in Talend


Integration of R-PostgreSQL
Installing the RPostgreSQL library in R and connecting to the database created:
We have already installed R, so open a terminal and type R to get to the R prompt.
A. Installing the RPostgreSQL package
> system('gksudo "apt-get -y install postgresql-9.3 libpq-dev"')
> install.packages("RPostgreSQL")  # select 0-Cloud as the CRAN mirror
> library(RPostgreSQL)
# If this command runs without giving any error, the RPostgreSQL package
# has been installed successfully.
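
Once the package loads, connecting to the database and running a query could look roughly like the sketch below. It is only an illustration under stated assumptions: the wrapper name `query_project_db`, the database name `projectdb`, the host, port and user are placeholders that do not come from the slides, and the password is read from the `PGPASSWORD` environment variable rather than written into the script.

```r
# Sketch only: every connection detail below is a placeholder assumption.
query_project_db <- function(sql,
                             dbname = "projectdb",
                             host = "localhost",
                             port = 5432,
                             user = "postgres",
                             password = Sys.getenv("PGPASSWORD")) {
  # Open a connection through the DBI interface that RPostgreSQL implements.
  con <- DBI::dbConnect(RPostgreSQL::PostgreSQL(),
                        dbname = dbname, host = host, port = port,
                        user = user, password = password)
  on.exit(DBI::dbDisconnect(con), add = TRUE)  # release the connection on exit
  DBI::dbGetQuery(con, sql)                    # result is returned as a data frame
}
```

With a running PostgreSQL server, `query_project_db("SELECT version()")` would return a one-row data frame. Keeping the connection inside a function with `on.exit` is one way to make sure it is closed even if the query fails.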
Conversion of PDF to Text using
Talend
We first need to create a Routine containing the Java code that converts PDF to text.
Then create a Job, which is built graphically from components in the Palette.
Three components are required:

◦ tFileInputDelimited
◦ tMap
◦ tLogRow
Run the Job to get a text file of the selected PDF file at the desired location, which is currently the Job folder.
Integration of R-Hadoop
Download rJava from the link:
https://cran.r-project.org/web/packages/rJava/index.html
Open RStudio.
Go to Tools -> Install Packages.
Select "Package Archive File" under "Install from" -> browse to the downloaded tar.gz file -> click Install.
At the RStudio console, run the command:
install.packages(c("rJava", "Rcpp", "RJSONIO", "bitops", "digest",
"functional", "stringr", "plyr", "reshape2", "caTools"))

Download the RHadoop packages rmr2, rhbase and rhdfs from the link:
https://github.com/RevolutionAnalytics/RHadoop/wiki/Downloads
Open RStudio.
Go to Tools -> Install Packages.
Select "Package Archive File" under "Install from" -> browse to the downloaded tar.gz file -> click Install.
We are now ready to operate on files saved in HDFS using R.
Integration of R-Hadoop contd…
Data is read from HDFS into R using the following commands:
Sys.setenv(HADOOP_CMD = "/usr/local/hadoop/bin/hadoop")  # set the environment variable
library(rhdfs)
hdfs.init()                                    # initialise the HDFS connection
f <- hdfs.file("/project/xyz.csv", "r", buffersize = 103)
m <- hdfs.read(f)                              # file contents as a raw vector
txt <- rawToChar(m)                            # raw bytes to a character string
data <- read.table(textConnection(txt), sep = ",")
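
The last two steps (raw bytes to text, then text to a data frame) can be tried without a Hadoop installation. The sketch below uses a small made-up raw vector as a stand-in for what `hdfs.read()` returns; the column names and values are hypothetical sample data, and the header row is an assumption about the file layout.

```r
# Stand-in for the raw vector returned by hdfs.read() (hypothetical sample data).
raw_bytes <- charToRaw("id,score\n1,0.5\n2,0.8\n")

# Convert the bytes to a single character string, as in the slide.
csv_text <- rawToChar(raw_bytes)

# Parse the comma-separated text; header = TRUE assumes the first line names the columns.
parsed <- read.table(text = csv_text, sep = ",", header = TRUE)
```

Here `read.table(text = ...)` is used in place of `textConnection()`; both read from an in-memory string, and the `text` argument avoids leaving an open connection behind.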
References
A Study of Open-Source Data Mining Tools for Forecasting - Nurdatillah Hasim, Norhaidah Abu Haris
Efficiency Evaluation of Open Source ETL Tools - Tim A. Majchrzak, Tobias Jansen, Herbert Kuchen
A Study of the Contributors of PostgreSQL - Daniel M. German
Websites
We referred to the following YouTube videos and websites for these parts:
1. For installing and getting started with PostgreSQL:
◦ https://youtu.be/67XGzdzv9k0
2. For connectivity of R and PostgreSQL:
◦ https://youtu.be/90j5rX6iSGI
http://www.bogotobogo.com/Hadoop/BigData_hadoop_Install_on_ubuntu_single_node_cluster.php
Integration - www.youtube.com
◦ R-Hadoop
◦ R-PostgreSQL
◦ Jaspersoft-PostgreSQL
