• Data Analysis:
• Breaking data up into parts and examining those parts to understand their nature, proportion, function, interrelationships, etc.
• A process in which the analyst moves laterally and recursively between three modes: describing
data (profiling, correlation, summarizing), assembling data (scrubbing, translating, synthesizing,
filtering) and creating data (deriving, formulating, simulating).
• It is the process of finding and identifying the meaning of data.
• Importance of IDA:
• Intelligent Data Analysis (IDA) is one of the major topics in artificial intelligence and information science.
• Intelligent data analysis discloses hidden facts that are not known previously and provides
potentially important information or facts from large quantities of data (White, 2008).
• It also helps in making decisions. Drawing mainly on machine learning, artificial intelligence, pattern recognition, records management and visualization technology, IDA helps to obtain useful information, necessary data and interesting models from the large amounts of data available online in order to make the right choices.
• Intelligent data analysis helps to solve problems that are already solved as a matter of routine. If data has been collected for past cases together with the results that were finally achieved, such data can be used to revise and optimize the currently used strategy for arriving at a conclusion.
• In other cases, when a question arises for the first time and only a little knowledge about it is available, data from related situations can help us solve the new problem, or previously unknown relationships can be discovered from the data to gain knowledge in an unfamiliar area.
• Steps Involved In IDA:
• Data analysis is a process that combines extracting data from a data set, analyzing it, classifying it, organizing it, reasoning about it, and so on.
• Data analysis need not necessarily involve arithmetic or statistics. While it is true that analysis often involves one or both, and that many analytical pursuits cannot be handled without them, much of the data analysis that people perform in the course of their work involves mathematics no more complicated than calculating the mean of a set of values.
• The essential activity of analysis is a comparison (of values, patterns, etc.), which can often be done by
simply using our eyes.
• The aim of the analysis is not merely to find interesting information in the data; that is only one vital part of the process (Berthold & Hand, 2003). The aim is to make sense of the data (i.e., to understand what it means) and then to make decisions based on the understanding that is achieved.
• Information in and of itself is not useful; even understanding information in and of itself is not useful. The aim of data analysis is to make better decisions.
• The process of data analysis starts with the collection of data that can add to the solution of any
given problem, and with the organization of that data in some regular form.
• It involves identifying and applying a statistical or deterministic schema or model of the data that
can be manipulated for explanatory or predictive purposes.
• It then involves an interactive or automated solution that explores the structured data in order to
extract information – a solution to the business problem – from the data.
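• To make the collect, organize, model, extract flow above concrete, here is a minimal sketch in Python. The monthly sales figures and the simple trend rule are purely hypothetical assumptions for illustration, not part of the source material.

```python
# A minimal, self-contained sketch of the collect -> organize -> model -> extract
# flow described above. The monthly sales figures below are hypothetical.
from statistics import mean

# 1. Collect: raw observations gathered for the problem at hand
raw = [("2023-01", 120.0), ("2023-02", 135.0), ("2023-03", 128.0),
       ("2023-04", 150.0), ("2023-05", 161.0), ("2023-06", 158.0)]

# 2. Organize: put the data in a regular form (month -> sales)
sales = {month: value for month, value in raw}

# 3. Model: a simple descriptive schema -- average level and overall trend
average_sales = mean(sales.values())
trend = (raw[-1][1] - raw[0][1]) / (len(raw) - 1)   # average change per month

# 4. Extract information that answers the business question
print(f"Average monthly sales: {average_sales:.1f}")
print(f"Average month-over-month change: {trend:+.1f}")
if trend > 0:
    print("Sales are trending upward; plan inventory accordingly.")
```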
Big Data Analytics (Three Types)
• With the flood of data available to businesses regarding their supply chain these days, companies
are turning to analytics solutions to extract meaning from the huge volumes of data to help
improve decision making.
• Big data analytics has reformed the way business is conducted in many ways: it improves decision making, business process management, and so on.
• Business analytics uses data together with other techniques such as information technology, statistics, quantitative methods and various models to provide results.
• Descriptive analytics analyses a database to provide information on the trends of past or current
business events that can help the organization to develop a road map for future actions.
• Descriptive analytics are useful because they allow us to learn from past behaviors, help in determining what is happening at the present time, and show how these might influence future outcomes.
• The vast majority of the statistics we use fall into this category (Think basic arithmetic like sums,
averages, percent changes). Usually, the underlying data is a count, or aggregate of a filtered
column of data to which basic math is applied.
• Descriptive statistics are useful to show things like total stock in inventory, average dollars spent per customer and year-over-year change in sales (see the sketch after this list).
• Common examples of descriptive analytics are reports that provide historical insights regarding the company’s production, financials, operations, sales, inventory and customers.
• You should use Descriptive Analytics when you need to understand at an aggregate level what is
going on in your company, and when you want to summarize and describe different aspects of
your business.
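• The sketch below illustrates the kind of basic descriptive arithmetic mentioned above (sums, averages, year-over-year change) on a handful of made-up transactions; the figures and field names are assumptions for illustration only.

```python
# A minimal sketch of descriptive analytics on hypothetical sales records:
# counts, sums, averages, and a year-over-year change.
from statistics import mean

# Hypothetical (customer_id, year, amount) transactions
transactions = [
    ("c1", 2022, 250.0), ("c2", 2022, 410.0), ("c1", 2022, 90.0),
    ("c1", 2023, 300.0), ("c3", 2023, 520.0), ("c2", 2023, 150.0),
]

# Total sales per year
totals = {}
for _, year, amount in transactions:
    totals[year] = totals.get(year, 0.0) + amount

# Average dollars spent per customer (across all years)
per_customer = {}
for customer, _, amount in transactions:
    per_customer[customer] = per_customer.get(customer, 0.0) + amount
avg_per_customer = mean(per_customer.values())

# Year-over-year change in sales
yoy_change = (totals[2023] - totals[2022]) / totals[2022] * 100

print(f"Total sales by year: {totals}")
print(f"Average spend per customer: {avg_per_customer:.2f}")
print(f"Year-over-year change: {yoy_change:+.1f}%")
```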
• Predictive Analytics: Understanding the future
• Predictive analytics has its roots in the ability to “predict” what might happen.
• Predictive analytics provides companies with actionable insights based on data. Predictive
analytics provides estimates about the likelihood of a future outcome.
• It is important to remember that no statistical algorithm can “predict” the future with
100% certainty. Companies use these statistics to forecast what might happen in the
future. This is because the foundation of predictive analytics is based on probabilities.
• These statistics try to take the data that you have, and fill in the missing data with best guesses.
They combine historical data found in ERP, CRM, HR and POS systems to identify patterns in the
data and apply statistical models and algorithms to capture relationships between various data
sets.
• Companies use predictive statistics and analytics any time they want to look into the future.
• Predictive analytics can be used throughout the organization, from forecasting customer behavior
and purchasing patterns to identifying trends in sales activities. They also help forecast demand
for inputs from the supply chain, operations and inventory.
• One common application most people are familiar with is the use of predictive analytics to
produce a credit score. These scores are used by financial services to determine the probability of
customers making future credit payments on time.
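• As a hedged illustration of the credit-scoring idea above, the sketch below fits a logistic regression to synthetic data and estimates the probability that a hypothetical applicant pays on time. The features, figures and model choice are assumptions, not a real scoring methodology.

```python
# Estimating the probability that a customer pays on time from two made-up
# features (income and credit utilization), using synthetic historical data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic historical data: [annual_income_k, credit_utilization]
X = rng.uniform([20, 0.0], [150, 1.0], size=(500, 2))
# Synthetic label: higher income and lower utilization -> more likely to pay on time
p_on_time = 1 / (1 + np.exp(-(0.03 * X[:, 0] - 4.0 * X[:, 1])))
y = rng.random(500) < p_on_time

model = LogisticRegression(max_iter=1000).fit(X, y)

# Score a new (hypothetical) applicant: income 60k, 70% utilization
applicant = np.array([[60, 0.7]])
print(f"Estimated probability of on-time payment: {model.predict_proba(applicant)[0, 1]:.2f}")
```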
• Prescriptive Analytics: Advise on possible outcomes
• The relatively new field of prescriptive analytics allows users to “prescribe” a number of different
possible actions and guide them towards a solution.
• Prescriptive analytics attempts to quantify the effect of future decisions in order to advise on
possible outcomes before the decisions are actually made.
• At their best, prescriptive analytics predicts not only what will happen, but also why it will
happen, providing recommendations regarding actions that will take advantage of the
predictions.
• These analytics go beyond descriptive and predictive analytics by recommending one or more
possible courses of action. Essentially they predict multiple futures and allow companies to assess
a number of possible outcomes based upon their actions.
• Prescriptive analytics use a combination of techniques and tools such as business rules,
algorithms, machine learning and computational modelling procedures. These techniques are
applied against input from many different data sets including historical and transactional data,
real-time data feeds, and big data.
• Prescriptive analytics are relatively complex to administer, and most companies are not yet using
them in their daily course of business. Larger companies are successfully using prescriptive
analytics to optimize production, scheduling and inventory in the supply chain to make sure they
are delivering the right products at the right time and optimizing the customer experience.
• Prescriptive Analytics should be used any time you need to provide users with advice on what
action to take.
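• The toy sketch below captures the core idea of prescriptive analytics: evaluate a set of candidate actions against a model and recommend the best one. The candidate actions, demand multipliers, prices and costs are invented assumptions; real prescriptive systems combine business rules, optimization and machine learning.

```python
# Score a handful of candidate actions against a simple (made-up) demand model
# and recommend the one with the highest expected profit.

# Hypothetical candidate actions: (discount_rate, expected_demand_multiplier)
candidate_actions = {
    "no_discount":    (0.00, 1.00),
    "small_discount": (0.05, 1.15),
    "big_discount":   (0.15, 1.40),
}

BASE_DEMAND = 1000      # units (assumed)
UNIT_PRICE = 20.0       # dollars (assumed)
UNIT_COST = 12.0        # dollars (assumed)

def expected_profit(discount, demand_multiplier):
    """Profit under one candidate action, using the toy demand model above."""
    price = UNIT_PRICE * (1 - discount)
    demand = BASE_DEMAND * demand_multiplier
    return (price - UNIT_COST) * demand

scores = {name: expected_profit(*params) for name, params in candidate_actions.items()}
best = max(scores, key=scores.get)

for name, profit in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name:15s} expected profit = {profit:8.0f}")
print(f"Recommended action: {best}")
```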
• There is another type, called Diagnostic analytics, that takes descriptive data a step further and provides deeper analysis to answer the question: "Why did this happen?"
• Often, diagnostic analysis is referred to as root cause analysis.
• This includes using processes such as data discovery, data mining, and drill down and drill
through.
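• The small pandas sketch below shows the drill-down idea: starting from a month-level total that dropped, group by region and product to see where the drop is concentrated. The column names and figures are invented for illustration.

```python
# Drill down from an aggregate (revenue per month) to finer groupings
# (region and product) to locate the source of a drop.
import pandas as pd

sales = pd.DataFrame({
    "month":   ["Jan", "Jan", "Jan", "Feb", "Feb", "Feb"],
    "region":  ["North", "South", "North", "North", "South", "North"],
    "product": ["A", "A", "B", "A", "A", "B"],
    "revenue": [100, 120, 80, 60, 118, 82],
})

# Step 1: the descriptive view -- total revenue per month (shows a drop in Feb)
print(sales.groupby("month")["revenue"].sum())

# Step 2: drill down -- which region/product combination explains the drop?
print(sales.groupby(["month", "region", "product"])["revenue"].sum().unstack("month"))
```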
• When analyzing data, we can also categorize Big Data analytics as follows:
✔ Basic analytics
✔ Advanced analytics
✔ Operational analytics
✔ Monetized analytics
• Summary of the four approaches:
Advantages of Big Data Analytics
• Cost Savings: Some big data tools, such as Hadoop and cloud-based analytics, can bring a cost advantage to a company when large amounts of data need to be stored. These tools also help you identify more efficient ways to do business.
• Time Reductions: The rapid speed of tools such as Hadoop and in-memory analysis makes it easy
to identify new data sources, helping businesses analyze data instantly and make quick decisions
based on what they learn.
• New Product Development: By knowing the trends in customer needs and satisfaction through
analytics, you can design products according to customer needs.
• Understand the market conditions: Analyzing big data gives you a better understanding of current market conditions. For example, by analyzing customers’ buying behavior, a company can identify its best-selling products and manufacture products according to that trend, allowing it to outperform its competitors.
• Control online reputation: Big data tools can perform sentiment analysis, so you get feedback about who is talking about your business. If you want to monitor and improve your business’s online presence, big data tools can help.
Challenges of Conventional Systems
• Three major challenges that Big Data faces are as follows:
1. Data or Volume
2. Process
3. Management
• Data or Volume:
• The volume of data, especially machine-generated data, is exploding, and it is growing faster every year as new sources of data emerge.
• For example, in the year 2000, 800,000 petabytes (PB) of data were stored in the world, and this was expected to reach 35 zettabytes (ZB) by 2020 (according to IBM).
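• As a quick back-of-the-envelope check of those figures, the sketch below computes the overall growth factor and the implied compound annual growth rate (taking 1 ZB = 1,000,000 PB); it is only illustrative arithmetic.

```python
# Growth factor and implied compound annual growth rate for the figures above:
# 800,000 PB (0.8 ZB) in 2000 vs. 35 ZB projected for 2020.
ZB_PER_PB = 1 / 1_000_000           # 1 ZB = 1,000,000 PB

data_2000_zb = 800_000 * ZB_PER_PB  # 0.8 ZB
data_2020_zb = 35.0
years = 2020 - 2000

growth_factor = data_2020_zb / data_2000_zb
cagr = growth_factor ** (1 / years) - 1

print(f"Growth factor over {years} years: {growth_factor:.1f}x")
print(f"Implied compound annual growth rate: {cagr:.1%}")
```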
• Processing:
• More than 80% of today’s information is unstructured and it is typically too big to manage
effectively.
• Today, companies are looking to leverage a lot more data from a wider variety of sources both
inside and outside the organization.
• Things like documents, contracts, machine data, sensor data, social media, health records, emails,
etc. The list is endless really.
• Management:
• A lot of this data is unstructured, or has a complex structure that’s hard to represent in rows and
columns.
Relational Database Management Systems - Why can’t we use databases with lots of disks to do large-scale analysis? Why is Hadoop needed?
• Although Hadoop isn’t the first distributed system for data storage and analysis, it has some unique properties that set it apart from other systems that may seem similar.
• Let's find the answer to the above questions by exploring how Hadoop differs from traditional systems like RDBMSs (for example, in seek time, normalization, scaling, etc.).
• First, here are some differences between the two:
• Seek Time:
• A trend today in disk drives is that seek time is improving more slowly than transfer rate.
• Seeking is the process of moving the disk’s head to a particular place on the disk to read or
write data. It characterizes the latency of a disk operation, whereas the transfer rate
corresponds to a disk’s bandwidth.
• If the data access pattern is dominated by seeks, it will take longer to read or write large portions of the dataset than to stream through it, which operates at the transfer rate (a rough comparison is sketched at the end of this section).
• On the other hand, for updating a small proportion of records in a database, a traditional B-Tree
(the data structure used in relational databases, which is limited by the rate at which it can
perform seeks) works well. For updating the majority of a database, a B-Tree is less efficient than
MapReduce, which uses Sort/Merge to rebuild the database.
• An RDBMS is good for point queries or updates, where the dataset has been indexed to deliver
low-latency retrieval and update times of a relatively small amount of data.
• MapReduce suits applications where the data is written once and read many times, whereas a
relational database is good for datasets that are continually updated.
• However, the differences between relational databases and Hadoop systems are blurring.
• Relational databases have started incorporating some of the ideas from Hadoop, and from the
other direction, Hadoop systems such as Hive are becoming more interactive (by moving away
from MapReduce) and adding features like indexes and transactions that make them look more
and more like traditional RDBMSs.
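• The sketch below puts rough numbers on the seek-time vs. transfer-rate point above, comparing a sequential streaming read of a dataset with reading it as many small random accesses. The disk parameters and record size are typical-looking assumptions, not measurements.

```python
# Streaming a 1 TB dataset (transfer-rate-bound) vs. reading it as many small
# random reads (seek-bound). All parameters are assumed, order-of-magnitude values.
SEEK_TIME_S = 0.010          # 10 ms per seek (assumed)
TRANSFER_RATE_BPS = 100e6    # 100 MB/s sustained transfer (assumed)
DATASET_BYTES = 1e12         # 1 TB
RECORD_BYTES = 100e3         # 100 KB per random read (assumed)

# Streaming: limited only by the transfer rate
streaming_s = DATASET_BYTES / TRANSFER_RATE_BPS

# Random access: one seek per record plus the transfer of each record
num_records = DATASET_BYTES / RECORD_BYTES
random_s = num_records * (SEEK_TIME_S + RECORD_BYTES / TRANSFER_RATE_BPS)

print(f"Streaming the whole dataset: ~{streaming_s / 3600:.1f} hours")
print(f"Reading it via random seeks: ~{random_s / 3600:.1f} hours")
```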
• Structure:
• Another difference between Hadoop and an RDBMS is the amount of structure in the datasets on
which they operate.
• Structured data is the realm of the RDBMS.
• Semi-structured data, on the other hand, is looser: for example, a spreadsheet, in which the
structure is the grid of cells, although the cells themselves may hold any form of data.
• Unstructured data does not have any particular internal structure: for example, plain text or
image data.
• Hadoop works well on unstructured or semi-structured data because it is designed to interpret the data at processing time (so-called schema-on-read; a small example follows at the end of this section).
• This provides flexibility and avoids the costly data loading phase of an RDBMS, since in Hadoop it
is just a file copy.
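• Below is a small illustration of schema-on-read: the raw log lines are stored as plain text, and a structure is imposed only when they are processed. The log format and field names are hypothetical.

```python
# Schema-on-read: no structure is enforced when the data is stored; the fields
# are interpreted only at processing time.
raw_log = """\
192.168.0.7 GET /index.html 200 1043
192.168.0.9 GET /missing.png 404 512
192.168.0.7 POST /api/order 200 2301
"""

def parse(line):
    """Interpret one line at processing time (the 'read schema')."""
    host, method, path, status, size = line.split()
    return {"host": host, "method": method, "path": path,
            "status": int(status), "bytes": int(size)}

records = [parse(line) for line in raw_log.splitlines()]

# Now the structured view can be queried, e.g. total bytes served per host
bytes_per_host = {}
for r in records:
    bytes_per_host[r["host"]] = bytes_per_host.get(r["host"], 0) + r["bytes"]
print(bytes_per_host)
```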
• Normalization:
• Relational data is often normalized to retain its integrity and remove redundancy.
• Normalization poses problems for Hadoop processing because it makes reading a record a
nonlocal operation, and one of the central assumptions that Hadoop makes is that it is possible to
perform (high-speed) streaming reads and writes.
• A web server log is a good example of a set of records that is not normalized (for example, the
client hostnames are specified in full each time, even though the same client may appear many
times), and this is one reason that logfiles of all kinds are particularly well suited to analysis with
Hadoop.
(Note that Hadoop can perform joins; it’s just that they are not used as much as in the relational
world.)
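• The sketch below contrasts a denormalized log (the hostname repeated in every record, so a single streaming pass can aggregate with no join) with a normalized version that needs a lookup against a separate host table. All records here are invented.

```python
# Denormalized: hostname repeated in every record -- aggregate in one local pass
denormalized = [
    {"host": "alpha.example.com", "bytes": 1043},
    {"host": "beta.example.com",  "bytes": 512},
    {"host": "alpha.example.com", "bytes": 2301},
]
totals = {}
for rec in denormalized:
    totals[rec["host"]] = totals.get(rec["host"], 0) + rec["bytes"]
print(totals)

# Normalized: records reference a host_id, so resolving names requires a join
hosts = {1: "alpha.example.com", 2: "beta.example.com"}
normalized = [{"host_id": 1, "bytes": 1043},
              {"host_id": 2, "bytes": 512},
              {"host_id": 1, "bytes": 2301}]
joined_totals = {}
for rec in normalized:
    name = hosts[rec["host_id"]]           # the "join" step: a non-local lookup
    joined_totals[name] = joined_totals.get(name, 0) + rec["bytes"]
print(joined_totals)
```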
• Scaling:
• MapReduce—and the other processing models in Hadoop—scales linearly with the size of the
data.
• Data is partitioned, and the functional primitives (like map and reduce) can work in parallel on
separate partitions.
• This means that if you double the size of the input data, a job will run twice as slowly.
• But if you also double the size of the cluster, a job will run as fast as the original one.
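• The sketch below shows the partition, map-in-parallel, reduce shape of this argument with a toy word count over in-memory partitions; it is not Hadoop MapReduce itself, just an illustration of why adding workers in proportion to the data keeps runtime roughly constant.

```python
# Partition the input, run the map function on partitions in parallel, then
# merge the partial results in a reduce step.
from collections import Counter
from concurrent.futures import ProcessPoolExecutor
from functools import reduce

partitions = [
    "big data analytics improves decision making",
    "hadoop scales linearly with the size of the data",
    "map and reduce work in parallel on separate partitions",
]

def map_partition(text):
    """Map step: count words within one partition, independently of the others."""
    return Counter(text.split())

def reduce_counts(a, b):
    """Reduce step: merge the partial counts from two partitions."""
    return a + b

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        partial_counts = list(pool.map(map_partition, partitions))
    total = reduce(reduce_counts, partial_counts, Counter())
    print(total.most_common(5))
```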