
III B.SC., COMPUTER SCIENCE - BIG DATA ANALYTICS

Unit-1: INTRODUCTION TO BIG DATA


Introduction to Big Data: Introduction to Big Data Platform – Challenges of Conventional Systems –
Intelligent data analysis – Nature of Data – Characteristics of Data – Evolution of Big Data – Definition
of Big Data – Challenges with Big Data – Volume, Velocity, Variety – Other Characteristics of Data –
Need for Big Data – Analytic Processes and Tools – Analysis vs Reporting.
Data and Information
 Data are plain facts.
 The word "data" is the plural of "datum."
 Data are facts and statistics stored in, or flowing freely over, a network; generally they are raw and unprocessed.
 When data are processed, organized, structured or presented in a given context so as to make them useful, they are called Information.
For example: when you visit a website, it might store your IP address (that is data) and add a cookie in your browser marking that you visited it (also data); your name is data, and your age is data.
What is Data?
– The quantities, characters, or symbols on which operations are performed by a computer,
– which may be stored and transmitted in the form of electrical signals and
– recorded on magnetic, optical, or mechanical recording media.
3 Actions on Data (see the sketch below)
– Capture
– Transform
– Store
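
A minimal sketch of these three actions in Python; the file names (records.csv, cleaned.csv) and field names are hypothetical, chosen only for illustration:

import csv

# Capture: read raw facts from a source (here, a hypothetical CSV file).
with open("records.csv", newline="") as f:
    raw = list(csv.DictReader(f))

# Transform: clean and structure the raw facts.
cleaned = [{"name": r["name"].strip().title(), "age": int(r["age"])}
           for r in raw]

# Store: persist the processed data for later use.
with open("cleaned.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "age"])
    writer.writeheader()
    writer.writerows(cleaned)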
Big Data
• Big Data may well be the Next Big Thing in the IT world.
• Big data burst upon the scene in the first decade of the 21st century.
• The first organizations to embrace it were online and startup firms.
• Firms like Google, eBay, LinkedIn, and Facebook were built around big data from the beginning.
• Like many new information technologies,
• big data can bring about dramatic cost reductions,
• substantial improvements in the time required to perform a computing task, or
• new product and service offerings.
• Walmart handles more than 1 million customer transactions every hour.
• Facebook handles 40 billion photos from its user base.
• Decoding the human genome originally took 10 years; now it can be achieved in one week.
What is Big Data?
– Big Data is also data, but of a huge size.
– Big Data is a term used to describe a collection of data that is huge in size and yet growing exponentially with time.
– In short, such data is so large and complex that none of the traditional data management tools can store or process it efficiently.
No single definition exists; here is one from Wikipedia:
• Big data is the term for
– a collection of data sets so large and complex that it becomes difficult to process using on-hand
database management tools or traditional data processing applications.
Examples of Bigdata
• Following are some examples of Big Data:
– The New York Stock Exchange generates about one terabyte of new trade data per day.
– Other examples of Big Data generation include
• stock exchanges,
• social media sites,
• jet engines,
• etc.
Types Of Big Data
• Big Data can be found in three forms:
1. Structured
2. Unstructured
3. Semi-structured
What is Structured Data?

• Any data that can be stored, accessed and processed in a fixed format is termed 'structured' data.
• Over time, techniques have been developed for working with such data (where the format is well known in advance) and for deriving value out of it.
• The foreseeable issue today:
– the size of such data has grown to a huge extent, with typical sizes in the range of multiple zettabytes.
• Do you know?
• 10^21 bytes equal 1 zettabyte; since 1 TB = 10^12 bytes, one zettabyte is 10^21 / 10^12 = 10^9 TB, i.e., one billion terabytes form a zettabyte.
– That is why the name Big Data is given; imagine the challenges involved in its storage and processing.
• Do you know?
– Data stored in a relational database management system is one example of 'structured' data.
• An 'Employee' table in a database is an example of Structured Data:
Employee_ID   Employee_Name     Gender   Department   Salary_In_lacs
2365          Rajesh Kulkarni   Male     Finance      650000
3398          Pratibha Joshi    Female   Admin        650000
7465          Shushil Roy       Male     Admin        500000
7500          Shubhojit Das     Male     Finance      500000
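
As a sketch, a structured table like the one above can be created and queried with standard SQL because its format is fixed and known in advance; the snippet below uses Python's built-in sqlite3 module with the table and values from the example:

import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("""CREATE TABLE Employee (
    Employee_ID INTEGER PRIMARY KEY, Employee_Name TEXT,
    Gender TEXT, Department TEXT, Salary_In_lacs INTEGER)""")
conn.executemany("INSERT INTO Employee VALUES (?, ?, ?, ?, ?)", [
    (2365, "Rajesh Kulkarni", "Male", "Finance", 650000),
    (3398, "Pratibha Joshi", "Female", "Admin", 650000),
    (7465, "Shushil Roy", "Male", "Admin", 500000),
    (7500, "Shubhojit Das", "Male", "Finance", 500000)])

# A fixed, well-known schema is what makes querying easy for structured data.
for row in conn.execute("SELECT Department, AVG(Salary_In_lacs) "
                        "FROM Employee GROUP BY Department"):
    print(row)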
Unstructured Data
• Any data whose form or structure is unknown is classified as unstructured data.
• In addition to being huge in size,
– unstructured data poses multiple challenges when it comes to processing it to derive value.
– A typical example of unstructured data is
• a heterogeneous data source containing a combination of simple text files, images, videos, etc.
• Nowadays organizations have a wealth of data available to them, but unfortunately
– they do not know how to derive value from it, since the data is in its raw, unstructured form.
Example of Unstructured data
– The output returned by 'Google Search'
Semi-structured Data
• Semi-structured data can contain both forms of data.
• Semi-structured data appears structured in form,
– but it is not actually defined with, e.g., a table definition as in a relational DBMS.
• An example of semi-structured data is
– data represented in an XML file.
• Personal data stored in an XML file.
<rec>
<name>Prashant Rao</name>
<sex>Male</sex>
<age>35</age>
</rec>
<rec>
<name>Seema R.</name>
<sex>Female</sex>
<age>41</age>
</rec>
<rec>
<name>Satish Mane</name>
<sex>Male</sex>
<age>29</age>
</rec>
<rec>
<name>Subrato Roy</name>
<sex>Male</sex>
<age>26</age>
</rec>
<rec>
<name>Jeremiah J.</name>
<sex>Male</sex>
<age>35</age>
</rec>
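
Because the schema is only loosely defined, such a file is usually handled with a generic parser rather than a fixed table. A minimal sketch using Python's built-in xml.etree.ElementTree module, assuming the <rec> elements above are wrapped in a single root element (say <people>) and saved as a hypothetical people.xml:

import xml.etree.ElementTree as ET

# Assumption: the records are wrapped in one root element, since a file
# with several top-level <rec> elements would not be well-formed XML.
tree = ET.parse("people.xml")
for rec in tree.getroot().findall("rec"):
    name = rec.findtext("name")
    age = int(rec.findtext("age"))
    print(name, age)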
Characteristics of Big Data, or the 3Vs of Big Data
• The three characteristics (3Vs) of Big Data:
1) Volume
 Data quantity
2) Velocity
 Data Speed
3) Variety
 Data Types

Growth of Big Data

Storing Big Data


• Analyzing your data characteristics
– Selecting data sources for analysis
– Eliminating redundant data
– Establishing the role of NoSQL
• Overview of Big Data stores
– Data models: key value, graph, document, column-family
– Hadoop Distributed File System (HDFS)
– HBase
– Hive
Processing Big Data
• Integrating disparate data stores
– Mapping data to the programming framework
– Connecting and extracting data from storage
– Transforming data for processing
– Subdividing data in preparation for Hadoop MapReduce
• Employing Hadoop MapReduce (see the sketch after this list)
– Creating the components of Hadoop MapReduce jobs
– Distributing data processing across server farms
– Executing Hadoop MapReduce jobs
– Monitoring the progress of job flows
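
To illustrate what the components of a Hadoop MapReduce job look like, here is a minimal word-count sketch written for Hadoop Streaming, which lets the mapper and reducer be plain Python scripts that read stdin and write stdout (the script names are hypothetical):

# mapper.py: emit one (word, 1) pair per word in the input split
import sys
for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")

# reducer.py: Hadoop sorts mapper output by key, so all counts for a
# word arrive together and can be summed in a single pass
import sys
from itertools import groupby

pairs = (line.rstrip("\n").split("\t") for line in sys.stdin)
for word, group in groupby(pairs, key=lambda kv: kv[0]):
    print(f"{word}\t{sum(int(count) for _, count in group)}")

A job like this is typically submitted with the hadoop-streaming JAR, pointing its -mapper and -reducer options at the two scripts; the framework then distributes the processing across the server farm and lets you monitor the job's progress.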
Why Big Data?
• The growth of Big Data is driven by:
– Increase of storage capacities
– Increase of processing power
– Availability of data (different data types)
– Every day we create 2.5 quintillion bytes of data; 90% of the data in the world today has been created
in the last two years alone
 Huge storage needs in real-time applications
– Facebook generates 10 TB of data daily
– Twitter generates 7 TB of data daily
– IBM claims 90% of today’s stored data was generated in just the last two years.
Big Data sources
• Users
• Application
• Systems
• Sensors
These sources feed into:
• Large and growing files (Big Data files)
Risks of Big Data
• Being overwhelmed by the data
– You need the right people solving the right problems
• Costs escalating too fast
– It isn't necessary to capture 100% of the data
• Privacy concerns around many sources of big data
– Self-regulation
– Legal regulation
Leading Technology Vendors

Example Vendors          Commonality
IBM – Netezza            • MPP architectures
EMC – Greenplum          • Commodity hardware
Oracle – Exadata         • RDBMS based
                         • Full SQL compliance
BASICS OF BIGDATA PLATFORM
 A Big Data platform is an IT solution which combines several Big Data tools and utilities into one packaged solution for managing and analyzing Big Data.

 It is a type of IT solution that combines the features and capabilities of several big data applications and utilities within a single solution.
 It is an enterprise-class IT platform that enables an organization to develop, deploy, operate and manage a big data infrastructure/environment.

What is Big Data Platform?

 A Big Data Platform is an integrated IT solution for Big Data management which combines several software systems, software tools and hardware to provide an easy-to-use system to enterprises.
 It is a single one-stop solution for all the Big Data needs of an enterprise, irrespective of size and data volume. A Big Data Platform is an enterprise-class IT solution for developing, deploying and managing Big Data.
 There are several open-source and commercial Big Data Platforms on the market, with varied features, which can be used in a Big Data environment.
 A Big Data platform generally consists of big data storage, servers, databases, big data management, business intelligence and other big data management utilities.
 It also supports custom development, querying and integration with other systems.
 The primary benefit of a big data platform is to reduce the complexity of multiple vendors/solutions into one cohesive solution.
Features of Big Data Platform

Here are the most important features of any good Big Data analytics platform:
a) A Big Data platform should be able to accommodate new platforms and tools based on business requirements, because business needs can change due to new technologies or changes in business processes.
b) It should support linear scale-out.
c) It should have the capability for rapid deployment.
d) It should support a variety of data formats.
e) The platform should provide data analysis and reporting tools.
f) It should provide real-time data analysis software.
g) It should have tools for searching through large data sets.
Big data is a term for data sets that are so large or complex that traditional data processing applications
are inadequate.
Challenges include
 Analysis,
 Capture,
 Data Curation,
 Search,
 Sharing,
 Storage,
 Transfer,
 Visualization,
 Querying,
 Updating

List of Big Data Platforms

a) Hadoop
b) Cloudera
c) Amazon Web Services
d) Hortonworks
e) MapR
f) IBM Open Platform
g) Microsoft HDInsight
h) Intel Distribution for Apache Hadoop
i) Datastax Enterprise Analytics
j) Teradata Enterprise Access for Hadoop
k) Pivotal HD

a) Hadoop
What is Hadoop?

 Hadoop is an open-source, Java-based programming framework and server software used to store and analyze data with the help of hundreds or even thousands of commodity servers in a clustered environment.
 Hadoop is designed to store and process large datasets extremely fast and in a fault-tolerant way.
 Hadoop uses HDFS (Hadoop Distributed File System) for storing data on a cluster of commodity computers. If any server goes down, Hadoop knows how to replicate the data, so there is no loss of data even on hardware failure.
 Hadoop is an Apache-sponsored project, and many software packages run on top of the Apache Hadoop system; many of the top commercial Big Data analytics platforms are Hadoop-based.
 Hadoop provides a set of tools and software that form the backbone of a Big Data analytics system.
b) Cloudera
 Cloudera is one of the first commercial Hadoop-based Big Data analytics platforms offering Big Data solutions.
 Its product range includes Cloudera Analytic DB, Cloudera Operational DB, Cloudera Data Science & Engineering and Cloudera Essentials.
 All these products are based on Apache Hadoop and provide real-time processing and analytics of massive data sets.

c) Amazon Web Services


 Amazon offers a Hadoop environment in the cloud as part of its Amazon Web Services package.
 The AWS Hadoop solution is a hosted solution which runs on Amazon's Elastic Compute Cloud (EC2) and Simple Storage Service (S3).
 Enterprises can use Amazon AWS to run their Big Data processing and analytics in the cloud environment.
Website: https://aws.amazon.com/emr/

d) Hortonworks
 Hortonworks uses 100% open-source software without any proprietary software. Hortonworks was the first to integrate support for Apache HCatalog.
 Hortonworks is a Big Data company based in California.
 The company develops and supports applications for Apache Hadoop.

Website: https://hortonworks.com/
e) MapR
 MapR is another Big Data platform, one which uses the Unix file system for handling data.
 It does not use HDFS, and the system is easy to learn for anyone familiar with Unix.
 The solution integrates Hadoop, Spark, and Apache Drill with a real-time data processing feature.
 Website: https://mapr.com

f)IBM Open Platform


 IBM also offers a Big Data platform, based on Hadoop ecosystem software.
 IBM is a well-known company in software and data computing.
It uses the latest Hadoop software and provides the following features (IBM Open Platform features):
 Based on 100% open-source software
 Native support for rolling Hadoop upgrades
 Support for long-running applications within YARN
 Native support for Spark; developers can use Java, Python and Scala to write programs
 The platform includes Ambari, a tool for provisioning, managing and monitoring Apache Hadoop clusters
 IBM Open Platform includes all the software of the Hadoop ecosystem, e.g. HDFS, YARN, MapReduce, Ambari, HBase, Hive, Oozie, Parquet, Parquet Format, Pig, Snappy, Solr, Spark, Sqoop, Zookeeper, OpenJDK, Knox, Slider
 The platform is well supported by the IBM technology team

Website: https://www.ibm.com/analytics/us/en/technology/hadoop/
g) Microsoft HDInsight
 Microsoft HDInsight is also based on a Hadoop distribution; it is a commercial Big Data platform from Microsoft.
 Microsoft is a software giant known for developing the Windows operating system for desktop and server users.
 HDInsight is a major Hadoop distribution offering which runs in the Windows and Azure environments.
 It offers customized, optimized, open-source Hadoop-based analytics clusters which use Spark, Hive, MapReduce, HBase, Storm, Kafka and R Server, running on the Hadoop system in the Windows/Azure environment.

Website: https://azure.microsoft.com/en-in/services/hdinsight/

Open Source Big Data Platform

There are various open-source Big Data platforms which can be used for Big Data handling and data analytics in real-time environments.
Both small and big enterprises can use these tools to manage their enterprise data and get the best value from it.
i) Apache Hadoop

 Apache Hadoop is a Big Data platform and software package which is an Apache-sponsored project.
 Under the Apache Hadoop project, various other software is being developed which runs on top of the Hadoop system to provide enterprise-grade data management and analytics solutions to enterprises.
 Apache Hadoop provides an open-source, distributed file system together with a data processing and analysis engine for analyzing large sets of data.
 Hadoop can run on Windows, Linux and OS X operating systems, but it is mostly used on Ubuntu and other Linux variants.

ii) MapReduce

 The MapReduce engine was originally written by Google; it is the system which enables developers to write programs that can run in parallel on hundreds or even thousands of computer nodes to process vast data sets.
 After processing the job on the different nodes, it combines the results and returns them to the program which executed the MapReduce job.
 This software is platform-independent and runs on top of the Hadoop ecosystem. It can process tremendous amounts of data at very high speed in a Big Data environment. A toy simulation of this flow is sketched below.
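
The data flow described above can be simulated in a few lines of plain Python: map each input split to intermediate pairs, group (shuffle) the pairs by key, then reduce each group. This is a toy single-process sketch of the model, not Hadoop's distributed implementation:

from collections import defaultdict

def map_phase(line):
    return [(word, 1) for word in line.split()]

def reduce_phase(word, counts):
    return (word, sum(counts))

splits = ["big data is big", "data is everywhere"]  # hypothetical input splits

# Shuffle: group all intermediate values by key, as the framework
# does between the map and reduce phases.
groups = defaultdict(list)
for line in splits:
    for key, value in map_phase(line):
        groups[key].append(value)

# Reduce each group and return the combined results to the caller.
results = [reduce_phase(word, counts) for word, counts in groups.items()]
print(results)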
iii) GridGain

 GridGain is another software system for parallel processing of data, just like MapReduce. GridGain is an alternative to Hadoop MapReduce.
 GridGain is used for in-memory data processing and is based on the Apache Ignite framework.
 GridGain is compatible with Hadoop HDFS and runs on top of the Hadoop ecosystem.
 The enterprise version of GridGain can be purchased from the official GridGain website, while the free version can be downloaded from its GitHub repository.
Website: https://www.gridgain.com/
iv) HPCC Systems

 HPCC Systems stands for "high performance computing cluster"; the system is developed by LexisNexis Risk Solutions.
 According to the company, this software is much faster than Hadoop and can be used in the cloud environment.
 HPCC Systems is developed in C++ and compiled into binary code for distribution.
 HPCC Systems is an open-source, massively parallel processing system which is installed on a cluster to process data in real time.
 It requires a Linux operating system and runs on commodity servers connected by a high-speed network.
 It scales from one node to thousands of nodes to provide performance and scalability.
 Website: https://hpccsystems.com/

v) Apache Storm

 Apache Storm is software for real-time computing and distributed processing.


 It is free and open-source software developed at the Apache Software Foundation. It is a real-time, parallel processing engine.
 Apache Storm is highly scalable and fault-tolerant, and it supports almost all programming languages.
Apache Storm can be used in:
 Realtime analytics
 Online machine learning
 Continuous computation
 Distributed RPC
 ETL
 And all other places where real-time processing is required.
Apache Storm is used by Yahoo, Twitter, Spotify, Yelp, Flipboard and many other data giants.
Website: http://storm.apache.org/
vi) Apache Spark

 Apache Spark is software that runs on top of Hadoop and provides APIs for real-time, in-memory processing and analysis of large data sets stored in HDFS.
 It keeps data in memory for faster processing.
 Apache Spark can run programs up to 100 times faster in memory, and 10 times faster on disk, compared to MapReduce.
 Apache Spark exists to speed up the processing and analysis of big data sets in a Big Data environment.
 Apache Spark is being adopted very fast by businesses to analyze their data sets and get real value from their data.
 Website: http://spark.apache.org/
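
A minimal PySpark sketch of the in-memory processing described above; the HDFS input path is hypothetical:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").getOrCreate()
lines = spark.read.text("hdfs:///data/input.txt")  # hypothetical HDFS path
counts = (lines.rdd.flatMap(lambda row: row.value.split())
          .map(lambda word: (word, 1))
          .reduceByKey(lambda a, b: a + b))
counts.cache()  # keep the result in memory for repeated analysis
print(counts.take(10))
spark.stop()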

vii) SAMOA

 SAMOA stands for Scalable Advanced Massive Online Analysis.
 It is a system for mining Big Data streams.
 SAMOA is open-source software distributed on GitHub, and it can also be used as a distributed machine learning framework.
Website: https://github.com/yahoo/samoa
CHALLENGES OF CONVENTIONAL SYSTEMS
Introduction to Conventional Systems
What is a Conventional System?
 A conventional system is a traditional data management system, typically built around a relational database, designed to store and process structured data of limited, controlled size.
 Big data is a huge amount of data which is beyond the processing capacity of conventional database systems to manage and analyze within a specific time interval.

Difference between conventional computing and intelligent computing

 Conventional computing functions logically with a set of rules and calculations, while neural computing can function via images, pictures, and concepts.
 Conventional computing is often unable to manage the variability of data obtained in the real world.
 On the other hand, neural computing, like our own brains, is well suited to situations that have no clear algorithmic solution and can manage noisy, imprecise data. This allows it to excel in those areas that conventional computing often finds difficult.

Comparison of Big Data with Conventional Data


Big Data | Conventional Data
Huge data sets. | Data set size in control.
Unstructured data such as text, video, and audio. | Normally structured data such as numbers and categories, but it can take other forms as well.
Hard-to-perform queries and analysis. | Relatively easy-to-perform queries and analysis.
Needs a new methodology for analysis. | Data analysis can be achieved using conventional methods.
Needs tools such as Hadoop, Hive, HBase, Pig, Sqoop, and so on. | Tools such as SQL, SAS, R, and Excel alone may be sufficient.
Aggregated, sampled, or filtered data. | Raw transactional data.
Used for reporting, basic analysis, and text mining; advanced analytics is only in a starting stage in big data. | Used for reporting, advanced analysis, and predictive modeling.
Analysis needs both programming skills (such as Java) and analytical skills. | Analytical skills are sufficient; advanced analysis tools don't require expert programming skills.
Petabytes/exabytes of data. | Megabytes/gigabytes of data.
Millions/billions of accounts. | Thousands/millions of accounts.
Billions/trillions of transactions. | Millions of transactions.
Generated by big financial institutions, Facebook, Google, Amazon, eBay, Walmart, and so on. | Generated by small enterprises and small banks.

List of challenges of Conventional Systems

The following challenges dominate conventional systems in real-time scenarios:
1) Uncertainty of the data management landscape
2) The Big Data talent gap
3) Getting data into the big data platform
4) Need for synchronization across data sources
5) Getting important insights through the use of Big Data analytics

1) Uncertainty of the Data Management Landscape:

 Because big data is continuously expanding, new companies and technologies are being developed every day.
 A big challenge for companies is to find out which technology works best for them without the introduction of new risks and problems.
2) The Big Data Talent Gap:
 While Big Data is a growing field, there are very few experts available in this field.
 This is because Big Data is a complex field, and people who understand the complexity and intricate nature of this field are few and far between.
3) Getting data into the big data platform:
 Data is increasing every single day. This means that companies have to tackle a limitless amount of data on a regular basis.
 The scale and variety of data available today can overwhelm any data practitioner, which is why it is important to make data accessibility simple and convenient for brand managers and owners.
4) Need for synchronization across data sources:
 As data sets become more diverse, there is a need to incorporate them into an analytical platform.
 If this is ignored, it can create gaps and lead to wrong insights and messages.
5) Getting important insights through the use of Big data analytics:
 It is important that companies gain proper insights from big data analytics, and it is important that the correct department has access to this information.
 A major challenge in big data analytics is bridging this gap in an effective fashion.
Other Challenges of Conventional Systems
Three further challenges that big data faces:
1. Data
2. Process
3. Management
1. Data Challenges
Volume
1. The volume of data, especially machine-generated data, is exploding.
2. Data is growing faster every year, with new sources of data emerging.
3. For example, in the year 2000, 800,000 petabytes (PB) of data were stored in the world, and this was expected to reach 35 zettabytes (ZB) by 2020 (according to IBM).
Social media plays a key role: Twitter generates 7+ terabytes (TB) of data every day. Facebook, 10
TB.
• Mobile devices play a key role as well, as there were an estimated 6 billion mobile phones in 2011.
• The challenge is how to deal with the size of Big Data.

Variety, Combining Multiple Data Sets


• More than 80% of today's information is unstructured, and it is typically too big to manage effectively.
• Today, companies are looking to leverage a lot more data from a wider variety of sources, both inside and outside the organization:
• things like documents, contracts, machine data, sensor data, social media, health records, emails, etc. The list is really endless.
• A lot of this data is unstructured, or has a complex structure that is hard to represent in rows and columns.
2. Processing
 More than 80% of today's information is unstructured and it is typically too big to manage effectively.
 Today, companies are looking to leverage a lot more data from a wider variety of sources both
inside and outside the organization.
 Things like documents, contracts, machine data, sensor data, social media, health records, emails,
etc. The list is endless really.
3. Management
 A lot of this data is unstructured, or has a complex structure that is hard to represent in rows and columns.
Big Data Challenges
– The challenges include capture, curation, storage, search, sharing, transfer,
– analysis, and visualization.
• Big Data is a trend toward larger data sets
• due to the additional information derivable from analysis of a single large set of related data,
– as compared to separate smaller sets with the same total amount of data, allowing correlations to be found to
"spot business trends, determine quality of research, prevent diseases, link legal citations, combat crime, and determine real-time roadway traffic conditions."

Challenges of Big Data


The following are the five most important challenges of Big Data:
a) Meeting the need for speed
 In today's hypercompetitive business environment, companies not only have to find and analyze the relevant data they need, they must find it quickly.
 Visualization helps organizations perform analyses and make decisions much more rapidly, but the challenge is going through the sheer volumes of data and accessing the level of detail needed, all at a high speed.
 The challenge only grows as the degree of granularity increases. One possible solution is hardware: some vendors are using increased memory and powerful parallel processing to crunch large volumes of data extremely quickly.
b) Understanding the data
 It takes a lot of understanding to get data in the right shape so that you can use visualization as part of data analysis.
c) Addressing data quality
 Even if you can find and analyze data quickly and put it in the proper context, the value of the data for decision making will be jeopardized if it is not accurate or timely.
d) Displaying meaningful results
 Plotting points on a graph for analysis becomes difficult when dealing with extremely large amounts of information or a variety of categories of information.
e) Dealing with outliers
 The graphical representations of data made possible by visualization can communicate trends and outliers much faster than tables containing numbers and text.
 Visual analytics enables organizations to take raw data and present it in a meaningful way that generates the most value. However, when used with big data, visualization is bound to lead to some challenges.

INTRODUCTION TO INTELLIGENT DATA ANALYSIS (IDA)


Intelligent Data Analysis (IDA) is one of the hot issues in the field of artificial intelligence and information science.
What is Intelligent Data Analysis (IDA)?

IDA is

… an interdisciplinary study concerned with the effective analysis of data;
… used for extracting useful information from large quantities of online data; extracting desirable knowledge or interesting patterns from existing databases;
 the distillation of information that has been collected, classified, organized, integrated, abstracted and value-added;
 at a level of abstraction higher than the data and the information on which it is based, such that it can be used to deduce new information and new knowledge;
 usually in the context of human expertise used in solving problems.
Goal:
The goal of intelligent data analysis is to extract useful knowledge; the process demands a combination of extraction, analysis, conversion, classification, organization, reasoning, and so on.
Uses / Benefits of IDA
Intelligent Data Analysis provides a forum for the examination of issues related to the research
and applications of Artificial Intelligence techniques in data analysis across a variety of
disciplines and the techniques include (but are not limited to):
The benefit areas are:
 Data Visualization
 Data pre-processing (fusion, editing, transformation, filtering, sampling)
 Data Engineering
 Database mining techniques, tools and applications
 Use of domain knowledge in data analysis
 Big Data applications
 Evolutionary algorithms
 Machine Learning(ML)
 Neural nets
 Fuzzy logic
 Statistical pattern recognition
 Knowledge Filtering and
 Post-processing

Intelligent Data Analysis (IDA)


Why IDA?
 Decision making asks for information and knowledge
 Data processing can provide them
 The multidimensionality of problems calls for methods for adequate and deep data processing and analysis
 Example: an epidemiological study (1970-1990)
 Sample: examinees who died from cardiovascular diseases during the period
 Question: Did they know they were ill?
1 – they were healthy
2 – they were ill (drug treatment, positive clinical and laboratory findings)
Intelligent Data Analysis
Knowledge Acquisition
 The process of eliciting, analyzing, transforming, classifying, organizing and
integrating knowledge and representing that knowledge in a form that can be used in a
computer system.
Knowledge in a domain can be expressed as a number of rules
A Rule :
A formal way of specifying a recommendation, directive, or strategy, expressed as
"IF premise THEN conclusion" or "IF condition THEN action".
How to discover rules hidden in the data?
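
One simple, illustrative way is to count how often each condition co-occurs with each conclusion and keep only the reliable pairs. The tiny data set below is hypothetical (it is not the epidemiological data mentioned above), and the 0.6 confidence threshold is an arbitrary choice:

from collections import Counter

# Hypothetical records of (condition, outcome) pairs.
records = [("smoker", "ill"), ("smoker", "ill"), ("smoker", "healthy"),
           ("non-smoker", "healthy"), ("non-smoker", "healthy"),
           ("non-smoker", "ill")]

cond_counts = Counter(cond for cond, _ in records)
pair_counts = Counter(records)

# Keep rules whose conditional confidence P(conclusion | condition) is high.
for (cond, outcome), n in pair_counts.items():
    confidence = n / cond_counts[cond]
    if confidence >= 0.6:
        print(f"IF {cond} THEN {outcome} (confidence {confidence:.2f})")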

NATURE OF DATA
Data
 Data is a set of values of qualitative or quantitative variables; restated, pieces of data
are individual pieces of information.
 Data is measured, collected and reported, and analyzed, whereupon it can be visualized using
graphs or images.
Properties of Data
For examining the properties of data, we refer to the various definitions of data. These definitions reveal that the following are the properties of data:
a) Amenability of use
b) Clarity
c) Accuracy
d) Essence
e) Aggregation
f) Compression
g) Refinement
a) Amenability of use: From the dictionary meaning of data it is learnt that data are facts used in
deciding something. In short, data are meant to be used as a base for arriving at definitive
conclusions.
b) Clarity: Data are a crystallized presentation. Without clarity, the meaning desired to be
communicated will remain hidden.

c) Accuracy: Data should be real, complete and accurate. Accuracy is thus, an essential
property of data.
d) Essence: Large quantities of data are collected, and they have to be compressed and refined. Data so refined can present the essence, or derived qualitative value, of the matter.
e) Aggregation: Aggregation is cumulating or adding up.
f) Compression: Large amounts of data are always compressed to make them more meaningful; compress data to a manageable size. Graphs and charts are some examples of compressed data.
g) Refinement: Data require processing or refinement. When refined, they are capable of leading
to conclusions or even generalizations. Conclusions can be drawn only when data are processed
or refined.
Types of Data:
 In order to understand the nature of data it is necessary to categorize them into various
types.
 Different categorizations of data are possible.
 The first such categorization may be on the basis of disciplines, e.g., Sciences, Social
Sciences, etc. in which they are generated.
 Within each of these fields, there may be several ways in which data can be categorized into
types.

There are four types of data:

 Nominal
 Ordinal
 Interval
 Ratio
Each offers a unique set of characteristics, which impacts the type of analysis that can be
performed.
The distinction between the four types of scales centers on three different characteristics:

1. The order of responses – whether it matters or not


2. The distance between observations – whether it matters or is interpretable
3. The presence or inclusion of a true zero
Nominal Scales
Nominal scales measure categories and have the following characteristics:

 Order: The order of the responses or observations does not matter.


 Distance: Nominal scales do not hold distance. The distance between a 1 and a 2 is not the same as between a 2 and a 3.
 True Zero: There is no true or real zero. In a nominal scale, zero is uninterpretable.
Appropriate statistics for nominal scales: mode, count, frequencies
Displays: histograms or bar charts
Ordinal Scales
At the risk of providing a tautological definition, ordinal scales measure, well, order. So, our
characteristics for ordinal scales are:

 Order: The order of the responses or observations matters.


Distance: Ordinal scales do not hold distance. The distance between first and second is unknown, as is the distance between first and third, along with all other observations.
 True Zero: There is no true or real zero. An item, observation, or category cannot finish in a zeroth place.
Appropriate statistics for ordinal scales: count, frequencies, mode
Displays: histograms or bar charts
Interval Scales
Interval scales provide insight into the variability of the observations or data.
Classic interval scales are Likert scales (e.g., 1 - strongly agree and 9 - strongly disagree) and
Semantic Differential scales (e.g., 1 - dark and 9 - light).
In an interval scale, users could respond to "I enjoy opening links to the website from a company email" with a response ranging on a scale of values.
The characteristics of interval scales are:

 Order: The order of the responses or observations does matter.


 Distance: Interval scales do offer distance. That is, the distance from 1 to 2 appears the same as from 4 to 5, so we can meaningfully add and subtract values. However, because there is no true zero, ratio statements (e.g., "six is twice as much as three") are not meaningful.
 True Zero: There is no true zero with interval scales. However, data can be rescaled in a manner that contains zero. An interval scale measured from 1 to 9 remains the same as 11 to 19, because we added 10 to all values. Similarly, a 1 to 9 interval scale is the same as a -4 to 4 scale, because we subtracted 5 from all values. Although the new scale contains zero, zero remains uninterpretable, because it only appears in the scale as a result of the transformation.
Appropriate statistics for interval scales: count, frequencies, mode, median, mean, standard
deviation (and variance), skewness, and kurtosis.
Displays: histograms or bar charts, line charts, and scatter plots.
Ratio Scales
Ratio scales appear as interval scales with a true zero.
They have the following characteristics:
 Order: The order of the responses or observations matters.
 Distance: Ratio scales do have an interpretable distance.
 True Zero: There is a true zero.
Income is a classic example of a ratio scale:
 Order is established. We would all prefer $100 to $1!
 Zero dollars means we have no income (or, in accounting terms, our revenue exactly equals our expenses!)
 Distance is interpretable, in that $20 appears as twice $10, and $50 is half of $100.
For the web analyst, the statistics for ratio scales are the same as for interval scales.
Appropriate statistics for ratio scales: count, frequencies, mode, median, mean, standard
deviation (and variance), skewness, and kurtosis.
Displays: histograms or bar charts, line charts, and scatter plots.
The table below summarizes the characteristics of all four types of scales.

                           Nominal   Ordinal   Interval   Ratio
Order Matters              No        Yes       Yes        Yes
Distance Is Interpretable  No        No        Yes        Yes
Zero Exists                No        No        No         Yes
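
As a small sketch, the statistics appropriate to each scale can be computed with Python's built-in statistics module; the sample responses below are hypothetical:

import statistics

nominal = ["red", "blue", "red", "green", "red"]  # categories: mode only
ordinal = [1, 2, 2, 3, 1, 2]                      # ranks: mode, frequencies
interval = [3, 5, 7, 5, 6, 4]                     # e.g. Likert responses
ratio = [100, 20, 10, 50, 40]                     # e.g. income in dollars

print(statistics.mode(nominal))   # mode is valid for all four scales
print(statistics.mode(ordinal))
print(statistics.mean(interval), statistics.stdev(interval))  # interval and up
print(statistics.mean(ratio), statistics.stdev(ratio))
print(max(ratio) / min(ratio))    # ratios are meaningful only on a ratio scale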

CHARACTERISTICS OF DATA
Big Data involves amounts of data so large that they cannot be processed by traditional data storage or processing units. It is used by many multinational companies to process data and run the business of many organizations. The data flow would exceed 150 exabytes per day before replication.
There are five V's of Big Data that explain its characteristics.
5 V's of Big Data
o Volume
o Veracity
o Variety
o Value
o Velocity
Volume
The name Big Data itself is related to its enormous size. Big Data involves vast 'volumes' of data generated from many sources daily, such as business processes, machines, social media platforms, networks, human interactions, and many more.
Each day, Facebook generates approximately a billion messages, records more than 4.5 billion clicks of the "Like" button, and receives more than 350 million new post uploads. Big data technologies can handle large amounts of data.

Variety
Big Data can be structured, unstructured, or semi-structured, collected from different sources. In the past, data was collected only from databases and spreadsheets, but these days data comes in an array of forms: PDFs, emails, audio, social media posts, photos, videos, and so on.

The data is categorized as below:


a. Structured data: Data with a defined schema and all the required columns, stored in tabular form. Structured data is stored in a relational database management system.
b. Semi-structured data: Data whose schema is not appropriately defined, e.g., JSON, XML, CSV, TSV, and email. Unlike the structured data that OLTP (Online Transaction Processing) systems are built to work with, it is not stored in relations, i.e., tables.
c. Unstructured data: All unstructured files, such as log files, audio files, and image files, are included in unstructured data. Some organizations have much data available, but they do not know how to derive value from it, since the data is raw.
d. Quasi-structured data: Textual data with inconsistent formats, which can be formatted with effort, time, and some tools.
Example: web server logs, i.e., log files created and maintained by a server, containing a list of activities.
Veracity
Veracity means how reliable the data is. Since there are many ways to filter or translate data, veracity is about being able to handle and manage data efficiently. Veracity is also essential in business development. For example, consider Facebook posts with hashtags.

Value
Value is an essential characteristic of big data. It is not just any data that we process or store: it is valuable and reliable data that we store, process, and also analyze.

Velocity
Velocity plays an important role compared to the other characteristics. Velocity is the speed at which data is created in real time. It encompasses the speed of incoming data sets, the rate of change, and bursts of activity. A primary aspect of Big Data is to provide demanded data rapidly.
Big data velocity deals with the speed at which data flows from sources like application logs, business processes, networks, social media sites, sensors, mobile devices, etc.

EVOLUTION OF BIG DATA:


The term 'Big Data' has been in use since the early 1990s. John R. Mashey is given the credit for making the term 'Big Data' popular [7]. Big Data is not something completely new or only used in the last two decades. People have been trying to use data analysis and analytics techniques to support their decision-making process for a very long time. The tremendous increase of both structured and unstructured data sets made the task of traditional data analysis very difficult, and this transformed into 'Big Data' in the last decade.
The evolution of Big Data can be classified into three phases, where every phase has its own characteristics and capabilities and has contributed to the contemporary meaning of Big Data.

Phase I: Big Data originates from the domain of database management. It mostly depends on the storage, extraction, and optimization of data that is stored in Relational Database Management Systems (RDBMS).
Database management and data warehousing are the two core components of Big Data in the first phase. This phase provides the foundation for modern data analysis and techniques such as database queries, online analytical processing and standard reporting tools.

Phase II: From the early 2000s, usage of the Internet and the Web started offering unique data collection and data analysis opportunities. Companies such as Yahoo, Amazon and eBay expanded their online stores and started analyzing customer behavior for personalization. The HTTP-based content on the web massively increased the volume of semi-structured and unstructured data.

Organizations now had to find new approaches and storage solutions to deal with these new data types and analyze them effectively. In later years, the growth of social media data aggravated the need for tools, technologies and analytics techniques that were able to extract meaningful information out of this unstructured data.

Phase III: In the past decade, the large-scale usage of smartphones with different internet-based applications has made it possible to analyze behavioral data (such as clicks and search queries) and also location-based data (GPS data). Simultaneously, the rise of sensor-based internet-enabled devices, termed the 'Internet of Things' (IoT), is making millions of TVs, thermostats, wearables and even refrigerators generate zettabytes of data every day. This incredible growth of 'Big Data' has started a race to extract meaningful and valuable information out of these new data sources, giving rise to another new term: 'Big Data Analytics'.
Table 1 gives a summary of the three phases of Big Data.
Phase I: DBMS-based, structured content
1. RDBMS & data warehousing
2. Extract, Transform, Load (ETL)
3. Online Analytical Processing (OLAP)
4. Dashboards & scorecards
5. Data mining & statistical analysis

Phase II: Web-based, unstructured content
1. Information retrieval and extraction
2. Opinion mining
3. Question answering
4. Web analytics and web intelligence
5. Social media analytics
6. Social network analysis
7. Spatial-temporal analysis

Phase III: Mobile and sensor-based content
1. Location-aware analysis
2. Person-centered analysis
3. Context-relevant analysis
4. Mobile visualization
5. Human-computer interaction

CHALLENGES WITH BIG DATA


The challenges in Big Data are the real implementation hurdles. They require immediate attention and need to be handled, because if they are not handled, the technology implementation may fail, which can also lead to unpleasant results. Big Data challenges include storing and analyzing extremely large and fast-growing data.
Some of the Big Data challenges are:
1. Sharing and Accessing Data:
 Perhaps the most frequent challenge in big data efforts is the inaccessibility of data sets from
external sources.
 Sharing data can cause substantial challenges.
 These include the need for inter- and intra-institutional legal documents.
 Accessing data from public repositories leads to multiple difficulties.
 Data needs to be available in an accurate, complete and timely manner, because if data in a company's information system is to be used to make accurate decisions in time, it must be available in this form.
2. Privacy and Security:
 This is another very important challenge with Big Data. It includes sensitive, conceptual, technical as well as legal significance.
 Most organizations are unable to maintain regular checks due to the large amounts of data generated. However, security checks and observation should be performed in real time, because that is most beneficial.
 Some information about a person, when combined with large external data, may lead to facts about that person which may be secret, and which the person might not want anyone else to know.
 Some organizations collect information about people in order to add value to their business. This is done by gaining insights into people's lives that they are unaware of.
3. Analytical Challenges:
 Big data poses some huge analytical challenges, which raise key questions: how to deal with a problem if data volume gets too large?
 Or how to find out the important data points?
 Or how to use data to the best advantage?
 The large amounts of data on which this type of analysis is to be done can be structured (organized data), semi-structured (semi-organized data) or unstructured (unorganized data). There are two techniques through which decision making can be done:
 Either incorporate massive data volumes in the analysis,
 or determine upfront which Big Data is relevant.
4. Technical challenges:
 Quality of data:
 Collecting a large amount of data and storing it comes at a cost. Big companies, business leaders and IT leaders always want large data storage.
 For better results and conclusions, big data focuses on quality data storage rather than having irrelevant data.
 This further raises the question of how it can be ensured that data is relevant, how much data would be enough for decision making, and whether the stored data is accurate or not.
 Fault tolerance:
 Fault tolerance is another technical challenge, and fault-tolerant computing is extremely hard, involving intricate algorithms.
 New technologies like cloud computing and big data intend that whenever a failure occurs, the damage done should stay within an acceptable threshold, that is, the whole task should not have to begin again from scratch.
 Scalability:
 Big data projects can grow and evolve rapidly. The scalability issue of Big Data has led toward cloud computing.
 It leads to various challenges, like how to run and execute various jobs so that the goal of each workload can be achieved cost-effectively.
 It also requires dealing with system failures in an efficient manner. This again raises the big question of what kinds of storage devices are to be used.

NEED FOR BIG DATA


Applications of Big Data
The term Big Data refers to large amounts of complex and unprocessed data. Nowadays, companies use Big Data to make business more informed and to take business decisions, by enabling data scientists, analytical modelers and other professionals to analyse large volumes of transactional data. Big data is the valuable and powerful fuel that drives the large IT industries of the 21st century. Big data is a spreading technology used in every business sector. In this section, we will discuss applications of Big Data.
Travel and Tourism
Travel and tourism are major users of Big Data. Big Data enables us to forecast travel facility requirements at multiple locations, improve business through dynamic pricing, and much more.
Financial and banking sector
The financial and banking sectors use big data technology extensively. Big data analytics helps banks understand customer behaviour on the basis of investment patterns, shopping trends, motivation to invest, and inputs obtained from personal or financial backgrounds.
Healthcare
Big data has started making a massive difference in the healthcare sector, helping medical professionals and healthcare personnel with predictive analytics. It can also enable personalized healthcare for individual patients.
Telecommunication and media
Telecommunications and the multimedia sector are major users of Big Data. Zettabytes of data are generated every day, and handling such large-scale data requires big data technologies.
Government and Military
The government and military also use this technology at high rates. Consider the volume of data the government keeps on record; in the military, a fighter plane needs to process petabytes of data.
Government agencies use Big Data to run many operations: managing utilities, dealing with traffic jams, and tackling the effects of crime like hacking and online fraud.
Aadhaar Card: the government has records of 1.21 billion citizens. This vast data is stored and analyzed to find things like the number of youth in the country, and schemes are built to target the maximum population. Such data cannot be stored in a traditional database, so Big Data analytics tools are used to store and analyze it.
E-commerce
E-commerce is also an application of Big Data. Maintaining relationships with customers is essential for the e-commerce industry. E-commerce websites have many marketing ideas for retailing merchandise to customers, managing transactions, and implementing better, innovative strategies to improve business with Big Data.
o Amazon: Amazon is a tremendous e-commerce website dealing with lots of traffic daily. But when there is a pre-announced sale on Amazon, traffic increases rapidly and may crash the website. So, to handle this type of traffic and data, it uses Big Data. Big Data helps in organizing and analyzing the data for future use.
Social Media
Social media is the largest data generator. Statistics show that around 500+ terabytes of fresh data are generated on social media daily, particularly on Facebook. The data mainly contains videos, photos, message exchanges, etc. A single activity on a social media site generates a lot of stored data, which gets processed when required. Since the stored data is in terabytes (TB), processing takes a lot of time; Big Data technologies are a solution to this problem.
ANALYTIC PROCESSES AND TOOLS
Big Data Analytics is the process of collecting large chunks of structured/unstructured data,
segregating and analyzing it and discovering the patterns and other useful business insights from it.
These days, organizations are realizing the value they get out of big data analytics and hence
they are deploying big data tools and processes to bring more efficiency in their work environment.
Many big data tools and processes are being utilized by companies these days in the
processes of discovering insights and supporting decision making.
Big data processing is a set of techniques or programming models for accessing large-scale data to extract useful information for supporting and providing decisions.
Below is a list of some of the data analytics tools used most in the industry:
 R Programming (Leading Analytics Tool in the industry)
 Python
 Excel
 SAS
 Apache Spark
 Splunk
 RapidMiner
 Tableau Public
 KNIME
There are 6 analytic processes:
1. Deployment
2. Business Understanding
3. Data Exploration
4. Data Preparation
5. Data Modeling
6. Data Evaluation

Step 1: Deployment
• Here we need to:
– plan the deployment, monitoring and maintenance,
– produce a final report and review the project.
• In this phase,
• we deploy the results of the analysis.
• This is also known as reviewing the project.
Step 2: Business Understanding
• Business Understanding
– The very first step consists of business understanding.
– Whenever any requirement occurs, firstly we need to determine the
business objective,
– assess the situation,
– determine data mining goals and then
– produce the project plan as per the requirement.
• Business objectives are defined in this phase.
Step 3: Data Exploration
• This step consists of data understanding.
– For the further process, we need to gather initial data, describe and explore
the data and verify data quality to ensure it contains the data we require.
– Data collected from the various sources is described in terms of its
application and the need for the project in this phase.
– This is also known as data exploration.
• This is necessary to verify the quality of data collected.

Step 4: Data Preparation


• From the data collected in the last step,
– we need to select data as per the need, clean it, construct it to get
useful information and
– then integrate it all.
• Finally, we need to format the data to get the appropriate data.
• Data is selected, cleaned, and integrated into the format finalized for the analysis in
this phase.
Step 5: Data Modeling
• We need to
– select a modeling technique, generate a test design, build a model and assess the model built.
• The data model is built to
– analyze relationships between the various selected objects in the data;
– test cases are built for assessing the model, and the model is tested and implemented on the data in this phase (a minimal sketch follows below).
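
A minimal sketch of this select-a-technique, build, and assess loop, using scikit-learn as an assumed modeling library; the toy feature rows and labels are hypothetical stand-ins for prepared data:

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Hypothetical prepared data: feature rows and class labels.
X = [[25, 50000], [40, 64000], [35, 58000],
     [50, 90000], [23, 42000], [45, 81000]]
y = [0, 1, 0, 1, 0, 1]

# Generate the test design: hold out part of the data for assessment.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

# Select a modeling technique and build the model.
model = DecisionTreeClassifier().fit(X_train, y_train)

# Assess the model on data it has not seen.
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))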
• Big data tools for HPC and supercomputing
– MPI
• Big data tools on clouds
– MapReduce model
– Iterative MapReduce model
– DAG model
– Graph model
– Collective model
• Other BDA tools
– SAS
– R
– Hadoop
Thus, BDA tools are used throughout the development of BDA applications.

ANALYSIS VS REPORTING
Reporting :
 Once data is collected, it will be organized using tools such as graphs and tables.
 The process of organizing this data is called reporting.
 Reporting translates raw data into information.
 Reporting helps companies to monitor their online business and be alerted when data falls
outside of expected ranges.
 Good reporting should raise questions about the business from its end users.
Analysis :
 Analytics is the process of taking the organized data and analyzing it.
 This helps users to gain valuable insights on how businesses can improve their performance.
 Analysis transforms data and information into insights.
 The goal of the analysis is to answer questions by interpreting the data at a deeper level and
providing actionable recommendations.
Conclusion :
 Reporting shows us “what is happening”.
 The analysis focuses on explaining “why it is happening” and “what we can do about it”.
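
To make the distinction concrete, here is a small sketch with hypothetical daily sales figures: the reporting part organizes the data into an informational summary, while the analysis part digs into why a value falls outside the expected range:

import statistics

daily_sales = {"Mon": 120, "Tue": 115, "Wed": 118, "Thu": 42, "Fri": 125}

# Reporting: organize the data into a summary ("what is happening").
print("Weekly total:", sum(daily_sales.values()))
for day, amount in daily_sales.items():
    print(day, amount)

# Analysis: interpret the data at a deeper level ("why is it happening")
# and point toward an actionable recommendation.
mean = statistics.mean(daily_sales.values())
for day, amount in daily_sales.items():
    if amount < 0.5 * mean:
        print(f"{day} is far below the average of {mean:.0f}; investigate "
              f"possible causes (e.g. a checkout outage or stock shortage).")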
What is Analysis?
• The process of exploring data and reports
– in order to extract meaningful insights,
– which can be used to better understand and improve business performance.
• What is Reporting ?
• Reporting is
– “the process of organizing data
– into informational summaries
– in order to monitor how different areas of a business are performing.”

COMPARING ANALYSIS WITH REPORTING

• Reporting is “the process of organizing data into informational summaries in order


to monitor how different areas of a business are performing.”
• Measuring core metrics and presenting them — whether in an email, a
slidedeck, or online dashboard — falls under this category.
• Analytics is “the process of exploring data and reports in order to extract
meaningful insights, which can be used to better understand and improve business
performance.”
• Reporting helps companies to monitor their online business and be alerted to when
data falls outside of expected ranges.
• Good reporting
• should raise questions about the business from its end users.
• The goal of analysis is
• to answer questions by interpreting the data at a deeper level and
providing actionable recommendations.

• A firm may be focused on the general area of analytics (strategy,


implementation, reporting, etc.)
– but not necessarily on the specific aspect of analysis.
• It's almost as if some organizations run out of gas after the initial set-up-related activities and don't make it to the analysis stage.

A reporting activity naturally leads on to an analysis activity.


CONTRAST BETWEEN ANALYSIS AND REPORTING
The basis differences between Analysis and Reporting are as follows:

Analysis                   Reporting
Provides what is needed    Provides what is asked for
Is typically customized    Is typically standardized
Involves a person          Does not involve a person
Is extremely flexible      Is fairly inflexible

• Reporting translates raw data into information.


• Analysis transforms data and information into insights.
• reporting shows you what is happening
• while analysis focuses on explaining why it is happening and what you can do about it.

 Reports are like robots: they monitor and alert you. Analysis is like a parent: it can figure out what is going on (hungry, dirty diaper, no pacifier, teething, tired, ear infection, etc.).
 Reporting and analysis can go hand in hand:
 Reporting provides limited context about what is happening in the data. Context is critical to good analysis.
 Reporting translates raw data into information.
 Reporting usually raises a question: what is happening?
 Analysis transforms the data into insights: why is it happening? What can you do about it?
