Program: B.E Subject Name: Data Science Subject Code: IT-8003 Semester: 8th
Unit III
Introduction of Data Analytics
Data Analytics: Data analytics is the science of examining raw data with the purpose of drawing conclusions about that information. It involves applying an algorithmic or mechanical process to derive insights, for example, running through a number of data sets to look for meaningful correlations between them. It is used in many industries to allow organizations and companies to make better decisions, and to verify or disprove existing theories or models. The focus of data analytics lies in inference, the process of deriving conclusions based solely on what the researcher already knows.
Applications of Data Analysis:
Healthcare: The main challenge for hospitals, as cost pressures tighten, is to treat as many patients as they can efficiently while improving the quality of care.
Travel: Data analytics can optimize the buying experience through mobile/weblog and social media data analysis. Travel sites can gain insights into the customer's desires and preferences.
Gaming: Data analytics helps in collecting data to optimize spend within as well as across games. Game companies gain insight into the likes, dislikes, and relationships of their users.
Energy Management: Most firms are using data analytics for energy management, including
smart-grid management, energy optimization, energy distribution, and building automation
in utility companies.
Core architecture data model (CADM) in enterprise architecture is a logical data model of
information used to describe and build architectures.
Single-tier architecture: The objective of a single-tier data warehouse architecture is to minimize the amount of data stored by removing data redundancy. This architecture is rarely used in practice.
Two-tier architecture: A two-tier architecture physically separates the available sources from the data warehouse. This architecture is not expandable and does not support a large number of end-users. It also has connectivity problems because of network limitations.
Three-tier architecture: This is the most widely used architecture. It consists of a bottom tier, a middle tier, and a top tier.
Bottom Tier: The data warehouse is based on an RDBMS server, which is a central information repository surrounded by key components that make the entire environment functional, manageable, and accessible.
Middle Tier: An OLAP server that presents an abstracted, multidimensional view of the warehouse data to the end user.
Top Tier: The top tier is the front-end client layer: the tools and APIs used to connect to, and get data out of, the data warehouse. It could be query tools, reporting tools, managed query tools, analysis tools, and data mining tools.
Column Oriented Database:
The best example of a column-oriented data store is the HBase database, which is designed from the ground up to provide scalability and partitioning, enabling efficient data structure serialization, storage, and retrieval.
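The row-versus-column storage idea can be sketched in a few lines of Python. This is an illustrative toy layout only, not HBase's actual API; the `ColumnTable` class and its column names are made up:

```python
# Toy column-oriented table: one list per column instead of one dict per
# row. Single-column scans and aggregates touch only that column's data,
# which is the core idea behind column stores such as HBase.
class ColumnTable:
    def __init__(self, columns):
        self.data = {name: [] for name in columns}

    def insert(self, row):
        # row: dict mapping column name -> value; missing columns get None
        for name, values in self.data.items():
            values.append(row.get(name))

    def scan(self, column):
        # Read one column without touching any other column.
        return self.data[column]

table = ColumnTable(["user", "score"])
table.insert({"user": "a", "score": 10})
table.insert({"user": "b", "score": 30})
print(sum(table.scan("score")))  # 40
```

Because each column lives in its own list, an aggregate over `score` never reads the `user` column, which is the property column-oriented stores exploit at scale.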
Shared-Nothing Architecture: A shared-nothing (SN) architecture is a distributed computing architecture in which each node is independent and self-sufficient, with no single point of contention across the system. Scaling such a system may involve partitioning the data across separate databases (assigning different computers to deal with different users or queries), or may require every node to maintain its own copy of the application's data, using some kind of coordination protocol. The partitioning approach is often referred to as database sharding.
There is some doubt about whether a web application with many independent web nodes
but a single, shared database (clustered or otherwise) should be counted as SN. One of the
approaches to achieve SN architecture for stateful applications (which typically maintain
state in a centralized database) is the use of a data grid, also known as distributed caching.
This still leaves the centralized database as a single point of failure.
Shared-nothing architectures have become prevalent in the data warehousing space. There
is much debate as to whether the shared-nothing approach is superior to shared-disk, with
sound arguments presented by both camps. Shared-nothing architectures certainly take
longer to respond to queries that involve joins over large data sets from different partitions
(machines). However, the potential for scaling is huge.
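The partitioning (sharding) approach described above can be sketched with hash-based routing. The node names and the `shard_for` helper below are hypothetical, purely for illustration:

```python
import hashlib

# Hash-based sharding: each key is owned by exactly one node, so the
# nodes share no data and can serve requests independently.
NODES = ["node-0", "node-1", "node-2"]  # hypothetical node names
stores = {node: {} for node in NODES}   # each node's private store

def shard_for(key):
    # Stable hash, so the same key always routes to the same node.
    digest = hashlib.md5(key.encode()).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

def put(key, value):
    stores[shard_for(key)][key] = value

def get(key):
    return stores[shard_for(key)].get(key)

put("user:42", {"name": "Ada"})
print(get("user:42"))  # {'name': 'Ada'}
```

Note that a query joining keys on different nodes needs cross-node communication, which is exactly why shared-nothing systems are slower on joins over large partitioned data sets.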
Massive Parallel Processing: In a massively parallel processing (MPP) system, a large number of processors work on parts of a program in parallel, each with its own operating system and memory, typically in a shared-nothing arrangement. For example, the virtual shared-disk feature of IBM's shared-nothing RS/6000 SP permits higher-level programs, for example the Oracle DBMS, to use this MPP as if it were a shared-disk configuration.
In addition to MPP, the shared-nothing configuration can also be implemented in a cluster of computers where the coupling is limited to a low number of processors, as opposed to the high number that is the case with an MPP. In general, this shared-nothing lightly (or modestly) parallel cluster exhibits characteristics similar to those of an MPP.
The distinction between MPP and a lightly parallel cluster is somewhat blurry. The following table shows a comparison of some salient features for distinguishing the two configurations. The most noticeable feature appears to be the number of connected processors, which is large for an MPP and small for a lightly parallel cluster.
Elastic Scalability:
Scalability has long been a concern, but now it's taking on new dimensions. The Information Age has matured beyond our wildest dreams, and our standards need to evolve with it. Big
data analytics is becoming increasingly intertwined with domains like business intelligence,
customer relationship management, and even diagnostic medicine. Enterprises that want to
expand must incorporate growth-capable IT strategies into their operating plans.
Infrastructure Choices: Companies need flexible infrastructures if they want to use Big Data
to reduce their operating costs, learn more about consumers, and hone their
methodologies. The real question is how to implement IT systems that expand on demand.
Organizations like Oracle and Intel point to the cloud and suggest that firms invest in open-
source tools like Hadoop. For many big data users, the fact that you can purchase appliances
that have already been configured to work within these frameworks might make it much
easier to get started.
Component Integration: It's one thing to implement a data storage or analysis framework
that scales. Scaling the vital connections that deliver information to your system is another
story.
Many such services implicitly use big data analytics to deliver personalized content, but there are countless other applications. There are many different ways to create a system that
garners insights from big data. As thought leaders like Scott Chow of the Blog Starter point
out, however, ensuring that all the parts can grow uniformly is critical to your success.
Problem-Solving Strategies: Not all algorithms are equally proficient at solving the same problems. A programming language that handles limited information with flying colors might crash and burn when it's fed millions of data sets.
Big data demands a bit more planning and foresight, and less plug-and-play, than some other areas of computer science. For example, the R language is made for statistical computing.
When you attempt to develop scalable scripts, however, you run into numerous problems,
like its in-memory operation, potentially inefficient data duplication and lack of support for
parallelism. To put this arguably powerful tool to use in big data environments, you'll need
to adapt your approach and refine your understanding, preferably with the help of data
scientists.
Oversight: Another scalability quandary in big data analytics involves maintaining effective
oversight. While it's relatively easy to watch a process to discover some conclusion or result,
genuine control means also understanding what's happening along the way. As you
scale up, reporting and feedback systems that let you manage individual processes are
critical to ensuring that your projects use resources efficiently.
DataLoader is a generic utility to be used as part of your application's data fetching layer to
provide a simplified and consistent API over various remote data sources such as databases
or web services via batching and caching.
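DataLoader itself is a JavaScript library; the batching-and-caching pattern it implements can be sketched in Python. The `SimpleLoader` class below is a synchronous toy, not DataLoader's real API:

```python
# Minimal sketch of the DataLoader idea: collect the keys requested,
# fetch the missing ones in one batched backend call, and cache results
# so repeated loads of the same key cost nothing.
class SimpleLoader:
    def __init__(self, batch_fn):
        self.batch_fn = batch_fn  # takes a list of keys, returns values
        self.cache = {}

    def load_many(self, keys):
        # Deduplicate and keep only keys not already cached.
        missing = list(dict.fromkeys(k for k in keys if k not in self.cache))
        if missing:
            # One batched call instead of one call per key.
            for key, value in zip(missing, self.batch_fn(missing)):
                self.cache[key] = value
        return [self.cache[k] for k in keys]

calls = []
def fetch_users(ids):
    calls.append(list(ids))        # record each backend round trip
    return [{"id": i} for i in ids]

loader = SimpleLoader(fetch_users)
loader.load_many([1, 2, 1])        # one batched fetch for keys 1 and 2
loader.load_many([2, 3])           # only 3 is fetched; 2 comes from cache
print(calls)  # [[1, 2], [3]]
```

Two round trips serve five loads here; that reduction in backend calls is the whole point of the pattern.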
A data pattern defines the way in which the data collected (semi-structured data) can be
structured, indexed, and made available for searching. One of the primary functions of
creating a data pattern is to specify fields that must be extracted from the data collected.
Fields are name=value pairs that represent a grouping by which your data can be
categorized. The fields that you specify at the time of creating a data pattern are added to
each record in the data indexed, enabling you to both search effectively and carry out
advanced analysis by using search commands. You can also assign a field type (an integer, a string, or a long integer) to each of the fields that you intend to extract.
Assigning a field type enables you to run specific search commands on the fields of a certain
type and perform advanced analysis.
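Extracting name=value pairs as described above can be sketched with a regular expression. The field names and the type mapping below are hypothetical, not taken from any particular product:

```python
import re

# Extract name=value fields from a semi-structured record and apply a
# simple field-type mapping (field names and types here are made up).
FIELD_TYPES = {"status": int, "bytes": int, "user": str}

def extract_fields(record):
    fields = {}
    for name, value in re.findall(r"(\w+)=(\S+)", record):
        cast = FIELD_TYPES.get(name, str)  # default: keep as a string
        fields[name] = cast(value)
    return fields

print(extract_fields("user=alice status=200 bytes=5120"))
# {'user': 'alice', 'status': 200, 'bytes': 5120}
```

Because `status` and `bytes` are typed as integers, numeric search commands (ranges, sums, averages) can operate on them directly rather than comparing strings.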
The Data Patterns tab allows you to configure data patterns that can be used by the data
collectors for collecting data in the specified way. While creating a data collector, it is
important that you select an appropriate data pattern. This is necessary so that the indexed
data looks as you expect, with events categorized into multiple lines (raw event data), fields extracted, and time stamps extracted. The more appropriate the data pattern, the better the chance that your search will be effective.
Phase 3 (Model planning): Phase 3 is model planning, where the team determines the
methods, techniques, and workflow it intends to follow for the subsequent model building
phase. The team explores the data to learn about the relationships between variables and
subsequently selects key variables and the most suitable models.
Phase 4 (Model building): In Phase 4, the team develops datasets for testing, training, and
production purposes. In addition, in this phase the team builds and executes models based
on the work done in the model planning phase. The team also considers whether its existing
tools will suffice for running the models, or if it will need a more robust environment for
executing models and workflows (for example, fast hardware and parallel processing, if
applicable).
K-means Clustering: K-means clustering partitions a dataset into K clusters. Its outputs are:
The centroids of the K clusters, which can be used to label new data
Labels for the training data (each data point is assigned to a single cluster)
The algorithm works as follows:
1. Choose the number of clusters K and initialize K centroids (for example, by picking K data points at random).
2. Assign each data point to its nearest centroid.
3. Recompute each centroid as the mean of the points assigned to it.
4. Repeat steps 2 and 3 until the assignments no longer change (or a maximum number of iterations is reached).
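A minimal sketch of the K-means loop in plain Python, using made-up one-dimensional data:

```python
import random

# Standard K-means on 1-D points (the data below is made up).
def kmeans(points, k, iters=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)            # step 1: initialize
    for _ in range(iters):
        # Step 2: assign each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Step 3: recompute each centroid as its cluster's mean.
        new_centroids = [sum(c) / len(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:           # step 4: converged
            break
        centroids = new_centroids
    labels = [min(range(k), key=lambda i: abs(p - centroids[i]))
              for p in points]
    return centroids, labels

points = [1.0, 1.2, 0.8, 9.0, 9.5, 10.1]
centroids, labels = kmeans(points, k=2)
print(sorted(round(c, 2) for c in centroids))  # [1.0, 9.53]
```

The returned centroids can label new data (assign each new point to its nearest centroid), and `labels` gives the single-cluster assignment for each training point, matching the two outputs listed above.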
Association Rule:
In association rule mining, as the name suggests, association rules are simple if/then statements that help discover relationships between seemingly independent relational databases or other data repositories.
Most machine learning algorithms work with numeric datasets and hence tend to be
mathematical. However, association rule mining is suitable for non-numeric, categorical
data and requires just a little bit more than simple counting.
Association rule mining is a procedure which aims to observe frequently occurring patterns,
correlations, or associations from datasets found in various kinds of databases such as
relational databases, transactional databases, and other forms of repositories.
An association rule has two parts: an antecedent (if) and a consequent (then).
An antecedent is something that's found in data, and a consequent is an item that is found in combination with the antecedent. Have a look at this rule for instance:
If a customer buys bread, he is 70% likely to also buy milk.
In the above association rule, bread is the antecedent and milk is the consequent. Simply put, it can be understood as a retail store's association rule to target its customers better. If the above rule is a result of thorough analysis of some data sets, it can be used not only to improve customer service but also to improve the company's revenue.
Association rules are created by thoroughly analyzing data and looking for frequent if/then
patterns. Then, depending on the following two parameters, the important relationships are
observed:
Support: Support indicates how frequently the if/then itemset appears in the database, that is, the fraction of transactions that contain both the antecedent and the consequent.
Confidence: Confidence indicates how often the rule has been found to be true, that is, the fraction of transactions containing the antecedent that also contain the consequent.
So, in a given transaction with multiple items, Association Rule Mining primarily tries to find
the rules that govern how or why such products/items are often bought together. For
example, peanut butter and jelly are frequently purchased together because a lot of people
like to make PB&J sandwiches.
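Support and confidence can be computed directly from a list of transactions. A minimal sketch, with a made-up transaction list and the bread/milk rule from above:

```python
# Compute support and confidence for the rule "bread -> milk" from a
# small, made-up list of market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "jelly"},
    {"milk", "eggs"},
]

def support(itemset):
    # Fraction of transactions containing every item in the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    # P(consequent | antecedent) = support(both) / support(antecedent)
    return support(antecedent | consequent) / support(antecedent)

print(support({"bread", "milk"}))       # 0.5  (2 of 4 transactions)
print(confidence({"bread"}, {"milk"}))  # 0.666... (2 of 3 bread buyers)
```

Rules whose support and confidence both exceed chosen thresholds are the "important relationships" the mining procedure reports.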