Unit V
A "Data Explosion" is the rapid, exponential growth of data generated from sources such as
social media, mobile devices, sensors, IoT devices, and digital platforms, to the point where
storing and managing that data in computing systems becomes difficult. The major sources
contributing to this explosion include:
Social media: Platforms like Facebook, Twitter, and Instagram generate large volumes of
user-generated data, including text, images, and videos.
IoT Devices and Sensors: Smart devices, wearables, and industrial sensors constantly
produce data in real-time, used in industries from healthcare to manufacturing.
Transactional Data: Financial and retail transactions, including e-commerce, banking, and
point-of-sale systems, generate structured data on purchases and customer behaviour.
Web and Mobile Applications: User interactions on websites and mobile apps produce data
such as clickstreams, browsing history, and app usage patterns.
Machine-generated Data: System logs, network traffic, and automated machine data in
sectors like telecommunications, transportation, and cybersecurity contribute to big data.
Public Data and Open Data: Government records, scientific research data, and open
databases provide data across fields like healthcare, environment, and demographics.
Big data is typically characterized by three V's: volume, velocity, and variety.
1. Volume: This refers to the vast amounts of data generated from various sources. Big data
systems handle terabytes to petabytes of data, making traditional data storage and
processing methods insufficient. The volume aspect emphasizes the scale and capacity
required to manage and analyze such large datasets.
2. Velocity: This represents the speed at which data is generated, collected, and processed.
With real-time data sources like social media, IoT sensors, and financial transactions, big data
needs fast, continuous processing to provide timely insights and decision-making.
3. Variety: Big data comes in diverse formats, including structured data (like databases), semi-
structured data (like JSON files), and unstructured data (like images, videos, and texts). This
variety demands flexible tools and models to integrate and analyze different types of data
effectively and efficiently.
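To make the variety dimension concrete, the short Python sketch below loads a structured CSV table, a semi-structured JSON payload, and a piece of unstructured text. The sample records and field names are purely illustrative assumptions, not data from the text.

```python
# A minimal sketch of the three broad data formats described above.
import io
import json
import pandas as pd

# Structured data: a fixed schema, e.g. rows from a relational table or CSV.
csv_text = "order_id,amount\n1,19.99\n2,5.50\n"
orders = pd.read_csv(io.StringIO(csv_text))

# Semi-structured data: JSON with nested and optional fields.
json_text = '[{"user": "a", "tags": ["sports"]}, {"user": "b"}]'
events = pd.json_normalize(json.loads(json_text))

# Unstructured data: raw text (or images/video) with no predefined schema.
tweet = "Loving the new phone! #happy"
word_count = len(tweet.split())

print(orders.dtypes)   # typed columns inferred from the fixed schema
print(events.columns)  # columns derived from whatever keys appear
print(word_count)      # unstructured data needs custom parsing or NLP
```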
4. What do you mean by the Data Analytics Lifecycle?
The Data Analytics Lifecycle is a systematic approach to managing and executing data analysis
projects, guiding teams from defining objectives to delivering actionable insights. It was designed to
address Big Data problems and the specific demands of conducting analysis on Big Data.
It typically consists of six stages:
1. Discovery: Understanding the business problem, objectives, and resources (such as data,
tools, and team skills) available for analysis. This stage sets the project’s foundation.
2. Data Preparation: Gathering, cleaning, transforming, and exploring data to ensure it’s ready
for analysis. This often involves handling missing values, identifying outliers, and combining
data sources.
3. Model Planning: Determining the analytical techniques, algorithms, and tools to apply.
Teams often use data exploration and hypothesis testing to decide the best modeling
approach.
4. Model Building: Constructing, training, and validating models based on the planned
approach. This involves iterating to refine models for improved accuracy and reliability.
5. Communicate Results: Presenting findings to stakeholders in a clear, actionable format,
typically through visualizations, dashboards, or reports tailored to the audience.
6. Operationalize: Deploying the model or insights into production for real-world application,
which could involve setting up monitoring systems and training teams to ensure long-term
effectiveness.
The phases, along with commonly used tools, are described in more detail below.
1) Discovery: The team studies the business problem, defines the objectives, and assesses the
resources available for analysis, such as data, tools, and team skills. This stage sets the
project's foundation.
2) Data Preparation: In this phase, data is collected, cleaned, and transformed to prepare it for
analysis. This involves handling missing data, removing inconsistencies, combining data from
multiple sources, and conducting exploratory analysis to understand data characteristics and
ensure quality.
Commonly used tools for this phase include Hadoop, Alpine Miner, and OpenRefine.
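As a hedged illustration of these steps, the following Python/pandas sketch fills a missing value, flags an outlier with a simple IQR rule, and joins two sources on a shared key. The tables, column names, and thresholds are assumptions made for the example, not part of the lifecycle itself.

```python
# A minimal data-preparation sketch: missing values, outliers, and joining sources.
import pandas as pd

sales = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "amount": [120.0, None, 95.0, 10_000.0],  # one missing value, one outlier
})
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "region": ["north", "south", None],
})

# Handle missing values: fill numeric gaps with the median, drop unknown regions.
sales["amount"] = sales["amount"].fillna(sales["amount"].median())
customers = customers.dropna(subset=["region"])

# Identify outliers with a simple IQR rule and flag them for review.
q1, q3 = sales["amount"].quantile([0.25, 0.75])
iqr = q3 - q1
sales["is_outlier"] = (sales["amount"] < q1 - 1.5 * iqr) | (sales["amount"] > q3 + 1.5 * iqr)

# Combine data from multiple sources on a shared key.
prepared = sales.merge(customers, on="customer_id", how="inner")
print(prepared)
```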
3) Model Planning: Here, the team identifies suitable analytical methods and algorithms based on
the data and objectives. Techniques like regression, clustering, and classification may be selected
depending on the problem. This phase often includes creating a preliminary model structure and
strategy for testing hypotheses.
Commonly used tools for this stage include MATLAB and Statistica.
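One common way to carry out model planning in practice is to benchmark a few candidate techniques on the same data before committing to one. The Python/scikit-learn sketch below compares a logistic regression and a decision tree using cross-validation; the dataset and the two candidates are illustrative assumptions.

```python
# A minimal model-planning sketch: compare candidate techniques with cross-validation.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=5000),
    "decision_tree": DecisionTreeClassifier(max_depth=5, random_state=0),
}

# Cross-validated accuracy gives a quick, comparable estimate for each
# candidate and guides which approach to carry into model building.
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```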
4) Model Building: This phase involves developing and training models based on the chosen
methods. The team iterates on the model, adjusting parameters and features to improve
accuracy and reliability, and uses training data to refine the model’s predictive capability.
Free or open-source tools: R and PL/R, Octave, and WEKA.
Commercial tools: MATLAB and Statistica.
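The sketch below, again using Python/scikit-learn as an assumed toolchain, shows the core loop of this phase: train on one split, validate on a held-out split, and iterate over a hyperparameter to refine the model. The dataset and the parameter values tried are illustrative.

```python
# A minimal model-building sketch: train, validate, and iterate on a hyperparameter.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=42
)

best_score, best_model = 0.0, None
for n_trees in (50, 100, 200):  # iterate to refine the model
    model = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    model.fit(X_train, y_train)
    score = accuracy_score(y_val, model.predict(X_val))
    if score > best_score:
        best_score, best_model = score, model

print(f"best validation accuracy: {best_score:.3f}")
```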
5) Communication of Results: In this phase, the team presents findings to stakeholders in a clear
and actionable format. The insights are often communicated through visualizations, dashboards,
or reports tailored to the audience’s needs, making complex data understandable. This phase
focuses on ensuring that stakeholders understand how the insights address the original business
problem and how they can use them to make informed decisions.
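As a small, assumed example of turning results into something stakeholders can read, the Python/matplotlib sketch below renders model output as a bar chart that could be dropped into a report or dashboard; the segment names and figures are made up for illustration.

```python
# A minimal results-communication sketch: model output as a simple chart.
import matplotlib.pyplot as plt

segments = ["North", "South", "East", "West"]
predicted_churn_rate = [0.12, 0.27, 0.08, 0.19]  # assumed model output

fig, ax = plt.subplots()
ax.bar(segments, predicted_churn_rate)
ax.set_ylabel("Predicted churn rate")
ax.set_title("Churn risk by region")
fig.savefig("churn_by_region.png")  # attach to a report or dashboard
```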
6) Operationalize: This final phase involves integrating the model, insights, or analytics system into
the business’s operational processes for long-term use. This could include deploying the model
into production, setting up monitoring systems, and establishing maintenance workflows to
ensure its continued accuracy and relevance. Operationalizing also includes training teams and
setting up processes to ensure the model or analytics solution is effectively utilized over time.
Open-source or free tools used in this phase include WEKA, SQL, MADlib, and Octave.
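To illustrate what operationalizing can look like at the smallest scale, the Python sketch below persists a trained model with joblib, reloads it as a production process would, and logs each prediction so the model's behaviour can be monitored over time. The file name, dataset, and logging setup are assumptions for the example.

```python
# A minimal operationalization sketch: persist, reload, and monitor a model.
import logging

import joblib
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

logging.basicConfig(level=logging.INFO)

# Train and persist the model (done once, offline).
X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000).fit(X, y)
joblib.dump(model, "model.joblib")

# In production: reload the artifact and score incoming records,
# logging each prediction so drift and errors can be monitored over time.
deployed = joblib.load("model.joblib")
for record in X[:3]:
    prob = deployed.predict_proba(record.reshape(1, -1))[0, 1]
    logging.info("prediction=%.3f", prob)
```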