Unit V


Data Explosion

The rapid, often exponential increase in the amount of data generated and stored in computing systems, reaching a level where data management becomes difficult, is called a "Data Explosion".

Three characteristics define Big Data: volume, variety, and velocity.

Volume: Big data sets comprise millions of unstructured, low-density data points. Companies that use big data may keep anything from dozens of terabytes to hundreds of petabytes of user data, and cloud computing now gives them access to zettabytes. All data is saved regardless of its apparent importance, because big data specialists argue that the answers to business questions can sometimes lie in unexpected data.

Velocity: Velocity refers to the speed at which big data is generated and applied. Big data is received, analyzed, and interpreted in quick succession to provide the most up-to-date findings. Many big data platforms even record and interpret data in real time.
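
To make velocity concrete, here is a toy Python sketch of streaming aggregation. The event source is simulated (a real system would read from a platform such as Kafka); the point is that each event updates a running statistic the moment it arrives, rather than in a later batch job.

```python
import random
import time

def event_stream(n=5):
    """Simulated real-time event source (a stand-in for a platform like Kafka)."""
    for _ in range(n):
        yield {"value": random.random(), "ts": time.time()}
        time.sleep(0.1)

# The running aggregate is updated the moment each event arrives,
# not in a nightly batch.
count, total = 0, 0.0
for event in event_stream():
    count += 1
    total += event["value"]
    print(f"events={count} running_mean={total / count:.3f}")
```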

Variety: Big data sets contain many different types of data within the same unstructured database. Traditional data management systems use structured relational databases that hold specific data types with set relationships to other data types. Big data analytics programs instead draw on many different types of unstructured data to find correlations across all of them, which often leads to a more complete picture of how each factor is related.
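
As a small illustration of variety, the Python sketch below (with hypothetical column names and records) combines a structured table with semi-structured JSON records whose fields vary per record; it assumes the pandas library is available.

```python
import json
import pandas as pd

# Structured data: fixed columns and types, as in a relational table.
orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "amount": [19.99, 5.00, 42.50],
})

# Semi-structured data: JSON records whose fields can vary per record.
raw_reviews = [
    '{"order_id": 1, "text": "great product", "rating": 5}',
    '{"order_id": 3, "text": "slow shipping"}',  # no rating field
]
reviews = pd.DataFrame([json.loads(r) for r in raw_reviews])

# Joining both sources gives a more complete picture of each order.
combined = orders.merge(reviews, on="order_id", how="left")
print(combined)
```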
Life Cycle of Data Analytics
The data analytics lifecycle was designed to address Big Data problems and data science projects. The process is iterative, reflecting the way real projects unfold. To meet the specific demands of conducting analysis on Big Data, a step-by-step methodology is needed to plan the various tasks associated with acquiring, processing, analysing, and repurposing data.

Phase 1: Discovery -

The data science team learns the business domain and researches the problem.

The team frames the problem to create context and gain understanding.

It identifies the data sources that are needed and accessible to the project.

The team formulates an initial hypothesis that can later be tested against the data.

Phase 2: Data Preparation -

The team investigates options for pre-processing, analysing, and preparing the data before analysis and modelling.

The team performs extract, load, and transform (ELT) operations to bring data into the analytic sandbox.

Data preparation tasks may be repeated, and they do not follow a predetermined sequence.

Some of the tools commonly used for this process include Hadoop, Alpine Miner, and OpenRefine.
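
A minimal data-preparation sketch in Python with pandas, assuming a hypothetical raw file and column names; it loads data into the sandbox, then cleans and stages it. In practice these steps are revisited as new issues surface.

```python
import pandas as pd

# Load raw data into the analytic sandbox (file and columns are hypothetical).
df = pd.read_csv("raw_customers.csv")

# Typical preparation steps, applied and repeated in no fixed order:
df = df.drop_duplicates()                          # remove duplicate rows
df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")
df["age"] = df["age"].fillna(df["age"].median())   # impute missing ages
df = df[df["age"].between(0, 120)]                 # drop implausible values

# Stage the cleaned table for the modelling phases.
df.to_parquet("customers_clean.parquet")
```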

Phase 3: Model Planning -

The team studies the data to discover the relationships between variables, then selects the most significant variables and the most suitable models.

The team also plans how data sets will be divided for training, testing, and production purposes.

The models themselves are built and executed in the next phase, based on the work completed during model planning.

Some of the tools commonly used for this stage are MATLAB and Statistica.
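
As one example of model planning, the sketch below uses pandas to rank numeric variables by their correlation with a 0/1 target column; the file and the target name "churned" are hypothetical carry-overs from the preparation sketch.

```python
import pandas as pd

# Continue from the prepared sandbox table (file and target name are hypothetical).
df = pd.read_parquet("customers_clean.parquet")

# Correlation of each numeric variable with a 0/1 target column.
numeric = df.select_dtypes("number")
correlations = numeric.corr()["churned"].drop("churned").abs()

# Shortlist the most strongly related variables for the model-building phase.
print(correlations.sort_values(ascending=False).head(5))
```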

Phase 4: Model Building -

The team creates datasets for training, testing, and production use.

The team also evaluates whether its current tools are sufficient to run the models or whether a more robust environment is required.

Free or open-source tools include R and PL/R, Octave, and WEKA.

Commercial tools include MATLAB and Statistica.
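
A minimal model-building sketch using scikit-learn, an open-source alternative to the tools listed above; the data here is synthetic stand-in data, so the exact features and model are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-in data; in practice this would come from the prepared sandbox tables.
X, y = make_classification(n_samples=1000, n_features=5, random_state=0)

# Separate datasets for training and testing, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
```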

Phase 5: Communicate Results -

Following the execution of the model, the team evaluates its outcomes against the criteria established for the success or failure of the project.

The team considers how best to present findings and outcomes to the various team members and other stakeholders, taking caveats and assumptions into account.

The team should identify the most important findings, quantify their value to the business, and develop a narrative that summarizes and presents the findings to all stakeholders.
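
Quantifying success or failure criteria might look like the following sketch, which reuses the hypothetical model, X_test, and y_test from the model-building sketch and checks each metric against an illustrative threshold agreed with stakeholders.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# `model`, `X_test`, and `y_test` carry over from the model-building sketch.
y_pred = model.predict(X_test)

# Success criteria agreed with stakeholders; the 0.80 threshold is illustrative.
metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
}
for name, value in metrics.items():
    status = "pass" if value >= 0.80 else "fail"
    print(f"{name}: {value:.3f} ({status})")
```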

Phase 6: Operationalize -

The team communicates the benefits of the project to a wider audience. It sets up a pilot project to deploy the work in a controlled manner before expanding it to the full enterprise of users.

This approach allows the team to learn about the performance and constraints of the model in a small-scale production setting and to make the necessary adjustments before full deployment.

The team also produces the final reports, presentations, and code.

Free or open-source tools used in this phase include Octave, WEKA, SQL, and MADlib.
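
One common way to run a small pilot is to serialize the trained model and expose a simple scoring function. The sketch below uses joblib for this; the model name, file name, and feature values are hypothetical.

```python
import joblib

# Persist the trained model so the pilot deployment can load it.
joblib.dump(model, "churn_model.joblib")  # `model` from the model-building sketch

def score(features):
    """Score a single record in the pilot environment."""
    pilot_model = joblib.load("churn_model.joblib")
    return pilot_model.predict_proba([features])[0, 1]

# Example pilot call with one record's feature values (illustrative).
print(score([0.1, -1.2, 0.5, 2.0, 0.3]))
```

Keeping the scoring path this small makes it easy to monitor the model's behaviour during the pilot and to adjust or retrain before the full rollout.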
