Program: B.E.
Subject Name: Data Science
Subject Code: IT-8003
Semester: 8th
IT 8003 (2): Data Science

Unit III
Introduction to Data Analytics

Data Analytics: Data analytics is the science of examining raw data with the purpose of
drawing conclusions about that information. It involves applying an algorithmic or
mechanical process to derive insights, for example, running through a number of data sets
to look for meaningful correlations between them. It is used in many industries to allow
organizations and companies to make better decisions as well as verify or disprove existing
theories or models. The focus of data analytics lies in inference, the process of deriving
conclusions based solely on what the researcher already knows.
Applications of Data Analysis:
Healthcare: The main challenge for hospitals, as cost pressures tighten, is to treat as many
patients as they can efficiently, while improving the quality of care.
Travel: Data analytics is able to optimize the buying experience through mobile/weblog
and social media data analysis. Travel sites can gain insights into the customer's desires
and preferences.

Gaming: Data analytics helps in collecting data to optimize spend within as well as
across games. Game companies gain insight into the likes, dislikes, and relationships
of their users.

Energy Management: Most firms are using data analytics for energy management, including
smart-grid management, energy optimization, energy distribution, and building automation
in utility companies.

Drivers for Analytics & Core Components of Analytical Data Architecture:

Figure 3.1: Drivers for Analytics

Core architecture data model (CADM) in enterprise architecture is a logical data model of
information used to describe and build architectures.


Data Warehouse Architecture:

Single-tier architecture: The objective of a single layer is to minimize the amount of data
stored by removing data redundancy. This architecture is not frequently used in practice.
Two-tier architecture: A two-layer architecture physically separates the available sources
from the data warehouse. This architecture is not expandable and does not support a large
number of end-users. It also has connectivity problems because of network limitations.
Three-tier architecture: This is the most widely used architecture.

It consists of the Top, Middle and Bottom Tier.


Bottom Tier: The database of the data warehouse serves as the bottom tier. It is usually a
relational database system. Data is cleansed, transformed, and loaded into this layer using
back-end tools.
Middle Tier: The middle tier in a data warehouse is an OLAP server, implemented using
either the ROLAP or MOLAP model. For a user, this application tier presents an abstracted
view of the database. This layer also acts as a mediator between the end-user and the
database.

Top Tier: The top tier is a front-end client layer. It consists of the tools and APIs used to
connect to and get data out of the data warehouse: query tools, reporting tools, managed
query tools, analysis tools, and data mining tools.
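
As a small illustrative sketch (not from the notes) of how a top-tier tool pulls aggregated data out of the warehouse, the following Python snippet runs a reporting query against a hypothetical sales_fact table, with sqlite3 standing in for the warehouse RDBMS:

import sqlite3

# sqlite3 stands in for the warehouse RDBMS; table and column names are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_fact (region TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales_fact VALUES (?, ?, ?)",
    [("North", "A", 120.0), ("North", "B", 80.0), ("South", "A", 200.0)],
)

# A typical top-tier reporting query: total sales per region.
for region, total in conn.execute(
    "SELECT region, SUM(amount) FROM sales_fact GROUP BY region"
):
    print(region, total)

conn.close()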

Figure 3.2: Data Warehouse Architecture

The data warehouse is based on an RDBMS server, which is a central information repository
surrounded by key components that make the entire environment functional, manageable,
and accessible.
Column-Oriented Database:


A column-oriented database stores each column contiguously, i.e., on disk or in memory
each column is stored in sequential blocks.
For analytical queries that perform aggregate operations over a small number of columns,
retrieving data in this format is extremely fast. Because storage is optimized for block
access, storing the values of a column beside each other exploits locality of reference. This
is particularly important on hard disk drives, whose performance characteristics favour
sequential access.
The goal of a columnar database is to efficiently write and read data to and from hard disk
storage in order to speed up the time it takes to return a query.
One of the main benefits of a columnar database is that data can be highly compressed. The
compression permits columnar operations (like MIN, MAX, SUM, COUNT, and AVG) to be
performed very rapidly. Another benefit is that because a column-based DBMS is self-
indexing, it uses less disk space than a relational database management system (RDBMS)
containing the same data.

The best example of a column-oriented data store is the HBase database, which is designed
from the ground up to provide scalability and partitioning to enable efficient data structure
serialization, storage, and retrieval.
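
A minimal sketch of the idea in Python, using NumPy arrays as the contiguous column storage (the tiny table below is invented purely for illustration):

import numpy as np

# Row-oriented view of a tiny table: one record per dict.
rows = [
    {"id": 1, "region": "North", "amount": 120.0},
    {"id": 2, "region": "South", "amount": 200.0},
    {"id": 3, "region": "North", "amount": 80.0},
]

# Column-oriented view: each column lives in its own contiguous array.
ids = np.array([r["id"] for r in rows])
amounts = np.array([r["amount"] for r in rows])

# An aggregate over one column touches only that column's contiguous block,
# which is what makes columnar scans fast for MIN, MAX, SUM, COUNT, and AVG.
print(amounts.sum(), amounts.mean(), amounts.max())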

Parallel vs Distributed Processing


A computer performs tasks according to the instructions provided by humans. Parallel
computing and distributed computing are two computation types. Parallel computing is
used in high-performance computing, such as supercomputer development, while
distributed computing provides data scalability and consistency; Google and Facebook use
distributed computing for data storage. The key difference is that parallel computing
executes multiple tasks using multiple processors simultaneously, while in distributed
computing multiple computers are interconnected via a network to communicate and
collaborate in order to achieve a common goal. Each computer in a distributed system has
its own users and helps to share resources.

Figure 3.3: Parallel and Distributed Computing


Difference between Parallel and Distributed Computing

Definition: Parallel computing is a computation type in which multiple processors execute
multiple tasks simultaneously. Distributed computing is a computation type in which
networked computers communicate and coordinate their work through message passing to
achieve a common goal.
Number of computers required: Parallel computing occurs on one computer; distributed
computing occurs between multiple computers.
Processing mechanism: In parallel computing, multiple processors perform the processing;
in distributed computing, computers rely on message passing.
Synchronization: In parallel computing, all processors share a single master clock for
synchronization; in distributed computing there is no global clock, and synchronization
algorithms are used instead.
Memory: In parallel computing, computers can have shared memory or distributed memory;
in distributed computing, each computer has its own memory.
Usage: Parallel computing is used to increase performance and for scientific computing;
distributed computing is used to share resources and to increase scalability.
Table 3.1: Difference between Parallel & Distributed Computing
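
As a small illustrative sketch (not from the notes) of the parallel side of this comparison, the following Python snippet uses the standard multiprocessing module to run one task across several processors of a single machine; a distributed version would instead spread the work over networked computers exchanging messages:

from multiprocessing import Pool

def square(x):
    # Each worker process handles part of the overall job.
    return x * x

if __name__ == "__main__":
    with Pool(processes=4) as pool:   # several processors, one computer
        results = pool.map(square, range(10))
    print(results)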
Shared-Nothing Architecture:
Shared-nothing architecture (SNA) is a pattern used in distributed computing in which a
system is based on multiple self-sufficient nodes that have their own memory, HDD storage,
and independent input/output interfaces. Each node shares no resources with other nodes,
and a synchronization mechanism ensures that all information is available on at least two
nodes.
Shared-nothing architecture is very popular in web applications, because it provides almost
infinite horizontal scaling that can be made with very inexpensive hardware. It is widely
used by Google, Microsoft and many other companies that need to collect and process
massive sets of data.
One of the good examples of using the SNA architecture is a MySQL cluster. It features a
Network Data Base (NDB) storage engine that automatically distributes MySQL data across
multiple storage nodes and provides great performance in write-heavy applications.
Applications:
Shared-nothing is popular for web development because of its scalability. As Google has
demonstrated, a pure SN system can scale simply by adding nodes in the form of
inexpensive computers, since there is no single bottleneck to slow the system down. Google
calls this sharding. An SN system typically partitions its data among many nodes on different
databases (assigning different computers to deal with different users or queries), or may
require every node to maintain its own copy of the application's data, using some kind of
coordination protocol. This is often referred to as database sharding.
There is some doubt about whether a web application with many independent web nodes
but a single, shared database (clustered or otherwise) should be counted as SN. One of the
approaches to achieve SN architecture for stateful applications (which typically maintain
state in a centralized database) is the use of a data grid, also known as distributed caching.
This still leaves the centralized database as a single point of failure.
Shared-nothing architectures have become prevalent in the data warehousing space. There
is much debate as to whether the shared-nothing approach is superior to shared-disk, with
sound arguments presented by both camps. Shared-nothing architectures certainly take
longer to respond to queries that involve joins over large data sets from different partitions
(machines). However, the potential for scaling is huge.
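
A minimal sketch of the partitioning idea behind sharding, assuming a simple hash-based routing scheme (the node names and shard_for function are hypothetical, not any particular product's API):

import hashlib

# Self-sufficient shared-nothing nodes, each owning its own partition of the data.
NODES = ["node-0", "node-1", "node-2", "node-3"]

def shard_for(user_id: str) -> str:
    # Hash the key and map it to a node; no memory or disk is shared between nodes.
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

for uid in ["alice", "bob", "carol"]:
    print(uid, "->", shard_for(uid))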
Massively Parallel Processing (MPP):

MPP (massively parallel processing) is the coordinated processing of a program by multiple
processors that work on different parts of the program, with each processor using its own
operating system and memory. Typically, MPP processors communicate using some
messaging interface. In some implementations, up to 200 or more processors can work on
the same application. An "interconnect" arrangement of data paths allows messages to be
sent between processors. Typically, the setup for MPP is more complicated, requiring
thought about how to partition a common database among processors and how to assign
work among them. An MPP system is also known as a "loosely coupled" or "shared-nothing"
system.
From a simplistic I/O and memory sharing point of view, there is a distinction between MPP
and SMP architectures. However, as will be discussed later in the article, operating systems
(OSs) or other software layers can mask some of these differences, permitting some
software written for other configurations, such as shared-disk, to be executed on an MPP.

For example, the virtual shared-disk feature of IBM's shared-nothing RS/6000 SP permits
higher-level programs, for example the Oracle DBMS, to use this MPP as if it were a
shared-disk configuration.
In addition to MPP, the shared-nothing configuration can also be implemented in a cluster
of computers where the coupling is limited to a low number of machines, as opposed to the
high number in an MPP. In general, this shared-nothing lightly (or modestly) parallel cluster
exhibits characteristics similar to those of an MPP.
The distinction between MPP and a lightly parallel cluster is somewhat blurry. A comparison
of some salient features can help distinguish the two configurations; the most noticeable
feature is the number of connected processors, which is large for an MPP and small for a
lightly parallel cluster.
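
As an illustrative sketch (not from the notes) of the shared-nothing, message-passing style described above, the following Python snippet gives each worker process its own memory and partition of the work and collects results over a queue:

from multiprocessing import Process, Queue

def worker(partition, out_queue):
    # Each process has its own memory and works only on its own partition,
    # sending its result back as a message instead of sharing state.
    out_queue.put(sum(partition))

if __name__ == "__main__":
    data = list(range(100))
    partitions = [data[i::4] for i in range(4)]   # split the work four ways
    out = Queue()
    procs = [Process(target=worker, args=(p, out)) for p in partitions]
    for p in procs:
        p.start()
    total = sum(out.get() for _ in procs)
    for p in procs:
        p.join()
    print(total)   # 4950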
Elastic Scalability:


Scalability has long been a concern, but now it's taking on new dimensions. The Information
Age has matured beyond our wildest dreams, and our standards need to evolve with it. Big
data analytics is becoming increasingly intertwined with domains like business intelligence,
customer relationship management, and even diagnostic medicine. Enterprises that want to
expand must incorporate growth-capable IT strategies into their operating plans.
Infrastructure Choices: Companies need flexible infrastructures if they want to use Big Data
to reduce their operating costs, learn more about consumers, and hone their
methodologies. The real question is how to implement IT systems that expand on demand.
Organizations like Oracle and Intel point to the cloud and suggest that firms invest in open-
source tools like Hadoop. For many big data users, the fact that you can purchase appliances
that have already been configured to work within these frameworks might make it much
easier to get started.

Component Integration: It's one thing to implement a data storage or analysis framework
that scales. Scaling the vital connections that deliver information to your system is another
story.
Many such systems implicitly use big data analytics to deliver personalized content, but there
are countless other applications. There are many different ways to create a system that
garners insights from big data. As thought leaders like Scott Chow of The Blog Starter point
out, however, ensuring that all the parts can grow uniformly is critical to your success.
Problem-Solving Strategies: Not all algorithms are equally proficient at solving the same
problems. A programming language that parses limited information with flying colors might
crash and burn when it's treated to millions of data sets.
Big data demands a bit more planning and foresight, and less plug-and-play, than some other
areas of computer science. For example, the R language is made for statistical computing.
When you attempt to develop scalable scripts, however, you run into numerous problems,
like its in-memory operation, potentially inefficient data duplication and lack of support for
parallelism. To put this arguably powerful tool to use in big data environments, you'll need
to adapt your approach and refine your understanding, preferably with the help of data
scientists.

Oversight: Another scalability quandary in big data analytics involves maintaining effective
oversight. While it's relatively easy to watch a process to discover some conclusion or result,
genuine control means also understanding what's happening along the way. As you scale up,
reporting and feedback systems that let you manage individual processes are critical to
ensuring that your projects use resources efficiently.

Data Loading Patterns:

DataLoader is a generic utility to be used as part of your application's data fetching layer to
provide a simplified and consistent API over various remote data sources such as databases
or web services via batching and caching.
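
A minimal sketch of the batching-and-caching idea in Python (the SimpleLoader class and fake_batch_fetch function are hypothetical stand-ins, not the API of any particular DataLoader library):

class SimpleLoader:
    """Collects keys, fetches them in one batch, and caches the results."""

    def __init__(self, batch_fetch):
        self.batch_fetch = batch_fetch   # function: list of keys -> list of values
        self.cache = {}
        self.pending = []

    def load(self, key):
        # Defer the fetch; duplicate keys are requested only once.
        if key not in self.cache and key not in self.pending:
            self.pending.append(key)

    def dispatch(self):
        # One batched call to the backend instead of one call per key.
        if self.pending:
            values = self.batch_fetch(self.pending)
            self.cache.update(zip(self.pending, values))
            self.pending = []
        return self.cache

def fake_batch_fetch(keys):
    # Stand-in for a batched database or web-service query.
    return [f"value-for-{k}" for k in keys]

loader = SimpleLoader(fake_batch_fetch)
loader.load("user:1")
loader.load("user:2")
print(loader.dispatch())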


A data pattern defines the way in which the data collected (semi-structured data) can be
structured, indexed, and made available for searching. One of the primary functions of
creating a data pattern is to specify fields that must be extracted from the data collected.
Fields are name=value pairs that represent a grouping by which your data can be
categorized. The fields that you specify at the time of creating a data pattern are added to
each record in the data indexed, enabling you to both search effectively and carry out
advanced analysis by using search commands. You can also assign a field type (an integer, a
string, or a long integer) to each of the fields that you intend to extract. Assigning a field
type enables you to run specific search commands on fields of a certain type and perform
advanced analysis.
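
As a small illustrative sketch (not any specific product's configuration) of extracting name=value fields from a semi-structured record and assigning a field type in Python:

import re

# A hypothetical semi-structured log record.
record = "2021-03-01 10:15:02 level=ERROR user=alice latency_ms=153"

# Extract every name=value pair as a field.
fields = dict(re.findall(r"(\w+)=(\S+)", record))

# Assign a type to a numeric field so it can be analysed numerically.
fields["latency_ms"] = int(fields["latency_ms"])

print(fields)   # {'level': 'ERROR', 'user': 'alice', 'latency_ms': 153}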
The Data Patterns tab allows you to configure data patterns that can be used by the data
collectors for collecting data in the specified way. While creating a data collector, it is
important that you select an appropriate data pattern. This is necessary so that the indexed
data looks as you expected, with events categorized in multiple lines (raw event data), fields
extracted, and time stamps extracted. The more appropriate the data pattern, the greater
the chance that your search will be effective.

Data Analytics Lifecycle:


Phase 1 - Discovery: In Phase 1, the team learns the business domain, including relevant
history such as whether the organization or business unit has attempted similar projects in
the past from which they can learn. The team assesses the resources available to support
the project in terms of people, technology, time, and data. Important activities in this phase
include framing the business problem as an analytics challenge that can be addressed in
subsequent phases and formulating initial hypotheses (IHs) to test and begin learning the
data.
Phase 2 - Data preparation: Phase 2 requires the presence of an analytic sandbox, in which
the team can work with data and perform analytics for the duration of the project. The team
needs to execute extract, load, and transform (ELT) or extract, transform, and load (ETL) to
get data into the sandbox; ELT and ETL are sometimes abbreviated as ETLT. Data should be
transformed in the ETLT process so the team can work with it and analyse it. In this phase,
the team also needs to familiarize itself with the data thoroughly and take steps to
condition the data.
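
A minimal sketch of the ETL step in this phase, assuming a hypothetical raw_orders.csv extract with order_id and amount columns and a sqlite3 database standing in for the analytic sandbox:

import csv
import sqlite3

# Extract: read raw records from the (hypothetical) source extract.
with open("raw_orders.csv", newline="") as f:
    raw_rows = list(csv.DictReader(f))

# Transform: condition the data, e.g. cast types and drop incomplete rows.
clean_rows = [
    (row["order_id"], float(row["amount"]))
    for row in raw_rows
    if row.get("amount")
]

# Load: write the conditioned data into the analytic sandbox.
conn = sqlite3.connect("sandbox.db")
conn.execute("CREATE TABLE IF NOT EXISTS orders (order_id TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", clean_rows)
conn.commit()
conn.close()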

Phase 3 - Model planning: Phase 3 is model planning, where the team determines the
methods, techniques, and workflow it intends to follow for the subsequent model building
phase. The team explores the data to learn about the relationships between variables and
subsequently selects key variables and the most suitable models.
Phase 4 - Model building: In Phase 4, the team develops datasets for testing, training, and
production purposes. In addition, in this phase the team builds and executes models based
on the work done in the model planning phase. The team also considers whether its existing
tools will suffice for running the models, or if it will need a more robust environment for
executing models and workflows (for example, fast hardware and parallel processing, if
applicable).


Phase 5 - Communicate results: In Phase 5, the team, in collaboration with major
stakeholders, determines if the results of the project are a success or a failure based on the
criteria developed in Phase 1. The team should identify key findings, quantify the business
value, and develop a narrative to summarize and convey findings to stakeholders.
Phase 6 - Operationalize: In Phase 6, the team delivers final reports, briefings, code, and
technical documents. In addition, the team may run a pilot project to implement the models
in a production environment.
K-means Clustering:
K-means clustering is a type of unsupervised learning, which is used when you have
unlabelled data (i.e., data without defined categories or groups). The goal of this algorithm
is to find groups in the data, with the number of groups represented by the variable K. The
algorithm works iteratively to assign each data point to one of K groups based on the
features that are provided. Data points are clustered based on feature similarity. The results
of the K-means clustering algorithm are:

 The centroids of the K clusters, which can be used to label new data
 Labels for the training data (each data point is assigned to a single cluster)
The algorithm works as follows:

 First, we initialize k points, called means, randomly.
 We categorize each item to its closest mean and we update that mean's coordinates,
which are the averages of the items categorized in that mean so far.
 We repeat the process for a given number of iterations and, at the end, we have our
clusters.
The above algorithm in pseudocode:

Initialize k means with random values
For a given number of iterations:
    Iterate through items:
        Find the mean closest to the item
        Assign item to mean
        Update mean
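
A minimal runnable sketch of the same algorithm in Python with NumPy, using random initialization from the data points and a fixed number of iterations, matching the pseudocode above:

import numpy as np

def k_means(points, k, iterations=10, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize the k means by picking k distinct data points at random.
    means = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iterations):
        # Assign each item to its closest mean.
        distances = np.linalg.norm(points[:, None, :] - means[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update each mean to the average of the items assigned to it.
        for j in range(k):
            if np.any(labels == j):
                means[j] = points[labels == j].mean(axis=0)
    return means, labels

points = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.1, 4.9]])
centroids, labels = k_means(points, k=2)
print(centroids)   # the cluster centroids, usable to label new data
print(labels)      # cluster label for each training point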

Association Rule:

Association Rule Mining: As the name suggests, association rules are simple if/then
statements that help discover relationships between seemingly unrelated data in relational
databases or other data repositories.


Most machine learning algorithms work with numeric datasets and hence tend to be
mathematical. However, association rule mining is suitable for non-numeric, categorical
data and requires just a little bit more than simple counting.
Association rule mining is a procedure which aims to observe frequently occurring patterns,
correlations, or associations from datasets found in various kinds of databases such as
relational databases, transactional databases, and other forms of repositories.

An association rule has two parts:

an antecedent (if) and

a consequent (then).

An antecedent is something that's found in data, and a consequent is an item that is found
in combination with the antecedent. Have a look at this rule, for instance:
If a customer buys bread, he's 70% likely to buy milk.
In the above association rule, bread is the antecedent and milk is the consequent. Simply
put, it can be understood as a retail store's association rule to target its customers better.
If the above rule is the result of a thorough analysis of some data sets, it can be used not
only to improve customer service but also to improve the company's revenue.
Association rules are created by thoroughly analyzing data and looking for frequent if/then
patterns. Then, depending on the following two parameters, the important relationships are
observed:
Support: Support indicates how frequently the if/then relationship appears in the database.
Confidence: Confidence indicates how often the relationship has been found to be true,
i.e., the fraction of transactions containing the antecedent that also contain the consequent.
So, in a given transaction with multiple items, Association Rule Mining primarily tries to find
the rules that govern how or why such products/items are often bought together. For
example, peanut butter and jelly are frequently purchased together because a lot of people
like to make PB&J sandwiches.
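
A minimal sketch of computing these two measures for the bread-and-milk rule over a handful of made-up transactions (the data is purely illustrative):

# Each transaction is the set of items bought together.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "jelly"},
    {"milk", "eggs"},
]

antecedent, consequent = {"bread"}, {"milk"}

both = sum(1 for t in transactions if antecedent <= t and consequent <= t)
antecedent_count = sum(1 for t in transactions if antecedent <= t)

support = both / len(transactions)    # how often bread and milk appear together
confidence = both / antecedent_count  # how often milk appears when bread does

print(f"support={support:.2f}, confidence={confidence:.2f}")  # support=0.50, confidence=0.67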



We hope you find these notes useful. You can get previous year question papers at
https://qp.rgpvnotes.in.

If you have any queries or you want to submit your study notes, please write to us at
[email protected]
