1. Descriptive analysis
Descriptive analysis examines data to gain insights into what happened or what is happening in the data
environment. It is characterized by data visualizations such as pie charts, bar charts, line graphs, tables,
or generated narratives. For example, a flight booking service may record data like the number of tickets
booked each day. Descriptive analysis will reveal booking spikes, booking slumps, and high-performing
months for this service.
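To make this concrete, here is a minimal sketch in Python using pandas; the bookings table is a hypothetical stand-in for the flight service's records:

```python
# A minimal descriptive-analysis sketch, assuming a hypothetical bookings
# table with one row per ticket booked.
import pandas as pd

bookings = pd.DataFrame({
    "booking_date": pd.to_datetime(
        ["2024-01-05", "2024-01-17", "2024-05-02", "2024-05-03", "2024-05-21"]
    ),
    "destination": ["Paris", "Tokyo", "Austin", "Austin", "Austin"],
})

# Count tickets booked per month: the kind of summary behind a bar chart
# of booking spikes, slumps, and high-performing months.
monthly = bookings["booking_date"].dt.to_period("M").value_counts().sort_index()
print(monthly)
```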
2. Diagnostic analysis
Diagnostic analysis is a deep-dive or detailed data examination to understand why something happened.
It is characterized by techniques such as drill-down, data discovery, data mining, and correlations.
Multiple data operations and transformations may be performed on a given data set to discover unique
patterns in each of these techniques. For example, the flight service might drill down on a particularly
high-performing month to better understand the booking spike. This may lead to the discovery that
many customers visit a particular city to attend a monthly sporting event.
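A drill-down of this kind might look like the following sketch, again on a hypothetical bookings table; grouping the spike month by destination is one simple way to surface the pattern:

```python
# A minimal diagnostic drill-down, reusing a hypothetical bookings table.
import pandas as pd

bookings = pd.DataFrame({
    "booking_date": pd.to_datetime(
        ["2024-05-02", "2024-05-03", "2024-05-21", "2024-05-22", "2024-05-23"]
    ),
    "destination": ["Austin", "Austin", "Austin", "Berlin", "Austin"],
})

# Drill down into the high-performing month to see which destinations drive
# the spike; a strong concentration hints at something like a local event.
may = bookings[bookings["booking_date"].dt.month == 5]
print(may["destination"].value_counts())
```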
3. Predictive analysis
Predictive analysis uses historical data to make forecasts about data patterns that may occur in
the future. It is characterized by techniques such as machine learning, forecasting, pattern matching,
and predictive modeling. In each of these techniques, computers are trained to recognize patterns in
historical data and project them forward. For example, the flight service team might use data science to predict
flight booking patterns for the coming year at the start of each year. The computer program or
algorithm may look at past data and predict booking spikes for certain destinations in May. Having
anticipated their customer’s future travel requirements, the company could start targeted advertising
for those cities from February.
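As an illustration only, the following sketch fits a simple scikit-learn linear regression to hypothetical monthly booking counts and projects the next year; a real forecasting model would also capture seasonality (such as the May spike) and be validated on held-out data.

```python
# A minimal forecasting sketch with scikit-learn, assuming hypothetical
# monthly booking counts for the past two years.
import numpy as np
from sklearn.linear_model import LinearRegression

# Month index 0..23 and observed bookings with an upward trend plus noise.
months = np.arange(24).reshape(-1, 1)
bookings = 1000 + 25 * months.ravel() + np.random.default_rng(0).normal(0, 30, 24)

model = LinearRegression().fit(months, bookings)

# Predict the next twelve months from the fitted trend.
future = np.arange(24, 36).reshape(-1, 1)
print(model.predict(future).round())
```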
4. Prescriptive analysis
Prescriptive analytics takes predictive data to the next level. It not only predicts what is likely to happen
but also suggests an optimum response to that outcome. It can analyze the potential implications of
different choices and recommend the best course of action. It uses graph analysis, simulation, complex
event processing, neural networks, and recommendation engines from machine learning.
Back to the flight booking example, prescriptive analysis could look at historical marketing campaigns to
maximize the advantage of the upcoming booking spike. A data scientist could project booking
outcomes for different levels of marketing spend on various marketing channels. These data forecasts
would give the flight booking company greater confidence in their marketing decisions.
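A heavily simplified sketch of the prescriptive idea follows: given hypothetical projected outcomes for a few marketing-spend scenarios, pick the one with the best expected return. Real prescriptive systems rely on simulation, optimization, or recommendation engines rather than a lookup like this.

```python
# A minimal prescriptive sketch: hypothetical projected outcomes for several
# marketing-spend scenarios, and a recommendation of the best-returning one.
scenarios = {
    10_000: 1_200,   # spend -> projected extra bookings
    25_000: 2_600,
    50_000: 3_100,
}
ticket_margin = 40   # hypothetical profit per booking

best_spend = max(
    scenarios, key=lambda spend: scenarios[spend] * ticket_margin - spend
)
print(f"Recommended marketing spend: {best_spend}")
```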
As the world enters the era of big data, modern lifestyles produce more and more data at an unparalleled speed through apps, websites, smartphones, and smart devices. Storing and making sense of this data has therefore become a central challenge and concern for modern enterprises. Data science, built on a blend of tools, algorithms, and machine learning principles, has emerged to overcome these challenges, with the primary goal of discovering hidden patterns in raw, unstructured data.
Data science takes a forward-looking, innovative approach, evaluating historical data and comparing it with present data to make better decisions and to predict future behaviour and outcomes. From healthcare and finance to cybersecurity and automobiles, data scientists contribute significantly to breakthroughs across verticals.
Read on to learn more about the different stages in the data science lifecycle that
require different skill sets, tools, and techniques.
Data science is put to work in many industries, including:
Marketing
Healthcare
Defense and Security
Natural Sciences
Engineering
Finance
Insurance
Political Policy
These are just some of the industries that use data to their advantage, but data
science is valuable to a wide range of other fields and organizations. Some of the
benefits of data science for any industry include:
Data science can help businesses understand their customers better so they
can improve customer interactions and tailor their marketing or product
offerings.
Data helps organizations interpret patterns in their operations, which can
highlight areas of success and areas for improvement.
Organizations that use data can be more responsive to customer needs and
desires, helping them stay ahead of their competition.
Data-driven insights are reliable and can be used to inform new initiatives as
well as measure their success.
A Data Science Graduate Certificate teaches valuable skills that are crucial to organizations.
With a firm understanding of the best practices for data management, you’ll know
how to use security measures and strategies to support organizations in the
following scenarios:
Remote Work – The rise of remote work increases the amount of valuable
data being shared across networks. You can help implement measures like
anonymization, access controls, and encryption to minimize threats (a minimal sketch of one such measure follows this list).
Phishing – Phishing accounts for nearly 22% of all data breaches. Fraud data
scientists use scam prevention techniques to keep data safe from phishing
attempts.
Cloud Security – The cloud expands an organization’s capacity to store data
through third-party services. By implementing secure servers and firewalls,
you can help ensure this data stays safe.
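As a minimal sketch of one such measure, the snippet below pseudonymizes a customer identifier with a keyed hash before the record is shared; the key, field names, and record are all hypothetical.

```python
# Pseudonymize a direct identifier with a keyed hash (HMAC-SHA256) so the
# shared record no longer exposes the raw email address. The key and record
# layout here are hypothetical.
import hashlib
import hmac

SECRET_KEY = b"rotate-and-store-this-in-a-secrets-manager"

def pseudonymize(value: str) -> str:
    """Replace a direct identifier with a stable, non-reversible token."""
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()

record = {"email": "customer@example.com", "tickets": 3}
safe_record = {**record, "email": pseudonymize(record["email"])}
print(safe_record)
```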
More than ever, consumers expect their data to be managed safely and securely by
the organizations they trust with it. Using the above strategies and following ethical
data science practices, you can help organizations do just that.
As a data scientist, you’ll also play a crucial role in protecting company data. As you
progress in your career, you’ll become knowledgeable in secure storage techniques
and privacy measures, helping organizations stay in compliance with data
management regulations and preserving brand trust.
The typical data science process consists of six steps through which
you’ll iterate, as shown in figure 2.1.
Figure 2.1 summarizes the data science process and shows the main
steps and actions you’ll take during a project. The following list is a
short introduction; each of the steps will be discussed in greater depth
throughout this chapter.
1. The first step of this process is setting a research goal. The main
purpose here is making sure all the stakeholders understand
the what, how, and why of the project. In every serious project this
will result in a project charter.
2. The second step is retrieving data. You want to have data available for analysis, so this step includes finding suitable data and getting access to it from the data owner. The result is data in its raw form, which usually needs polishing and transformation before it becomes usable.
3. Now that you have the raw data, it’s time to prepare it. This includes transforming the data from a raw form into data that’s directly usable in your models. To achieve this, you’ll detect and correct different kinds of errors in the data, combine data from different data sources, and transform it. If you have successfully completed this step, you can progress to data visualization and modeling.
4. The fourth step is data exploration. The goal of this step is to gain
a deep understanding of the data. You’ll look for patterns,
correlations, and deviations based on visual and descriptive
techniques. The insights you gain from this phase will enable you to
start modeling.
5. The fifth step is model building. Using the clean data and the understanding gained during exploration, you’ll build models aimed at answering the research question, drawing on techniques such as regression or classification.
6. The last step of the data science process is presenting your results and automating the analysis, if needed. One goal of a project is to change a process and/or make better decisions. You may still need to convince the business that your findings will indeed change the business process as expected. This is where you can shine in your influencer role. The importance of this step is more apparent in projects on a strategic and tactical level. Certain projects require you to perform the business process over and over again, so automating the project will save time.
In reality you won’t progress in a linear way from step 1 to step 6.
Often you’ll regress and iterate between the different phases.
Following these six steps pays off in terms of a higher project success
ratio and increased impact of research results. This process ensures
you have a well-defined research plan, a good understanding of the
business question, and clear deliverables before you even start
looking at data. The first steps of your process focus on getting high-
quality data as input for your models. This way your models will
perform better later on. In data science there’s a well-known
saying: Garbage in equals garbage out.
Various software tools such as MATLAB and Power BI, along with programming languages like Python and R, offer a wealth of utility features that make it possible to tackle complex tasks efficiently within tight timeframes.
1. Problem Definition:
Use: Clearly define the problem at hand and establish the objectives of the analysis.
2. Data Collection:
Use: Gather data from diverse sources, perform cleaning, and prepare it for
analysis.
3. Data Exploration:
Use: Explore and visualize the data to understand its structure and uncover patterns, correlations, and anomalies.
4. Data Modeling:
Use: Develop mathematical models and algorithms to solve problems and make
predictions.
5. Evaluation:
Use: Assess the performance and accuracy of the model using relevant metrics.
6. Deployment:
Use: Put the model into production, then continuously monitor its performance and make the updates needed to maintain accuracy.
Wrapping Up
The Data Science Process offers a structured approach to harnessing the power of
data, enabling organisations to derive actionable insights and drive strategic
decision-making. By following this systematic methodology, businesses can
overcome challenges, unlock opportunities, and stay ahead in today’s data-driven
world. The benefits are manifold, from improved decision-making and enhanced
operational efficiency to innovative product development and increased
competitiveness.
To start on a transformative journey into the realm of data science and business
analytics, consider enrolling in the Accelerator Program in Business Analytics and
Data Science at Hero Vired. With a cutting-edge curriculum, expert faculty, and
hands-on learning experiences, this program equips aspiring data professionals with
the skills and knowledge needed to thrive in the dynamic field of data science. Don’t
miss this opportunity to propel your career forward and become a driving force in the
digital age. Join us at Hero Vired and unlock your potential in data science today.
Retrieving Data:
Retrieving the required data is the second phase of a data science project. Sometimes data scientists need to go into the field and design a data collection process, but many companies will have already collected and stored the data, and what they don't have can often be bought from third parties.
• A great deal of high-quality data is freely available for public and commercial use. Data can be stored in various formats, from text files to tables in a database, and it may be internal or external to the organization.
1. Start working on internal data, i.e. data stored within the company
• The data scientist's first step is to verify the internal data: assess the relevance and quality of the data that's readily available in the company. Most companies have a program for maintaining key data, so much of the cleaning work may already be done.
This data can be stored in official data repositories such as databases, data marts,
data warehouses and data lakes maintained by a team of IT professionals.
• A data repository, also known as a data library or data archive, is a general term for a data set that has been isolated so it can be mined for data reporting and analysis. In practice, a data repository is a large database infrastructure: one or more databases that collect, manage and store data sets for data analysis, sharing and reporting.
• The term data repository covers several ways to collect and store data:
a) Data warehouse is a large data repository that aggregates data usually from
multiple sources or segments of a business, without the data being necessarily
related.
b) Data lake is a large data repository that stores unstructured data that is
classified and tagged with metadata.
c) Data marts are subsets of the data repository. These data marts are more
targeted to what the data user needs and easier to use.
d) Metadata repositories store data about data and databases. The metadata explains where the data came from, how it was captured and what it represents.
e) Data cubes are lists of data with three or more dimensions stored as a table (a minimal sketch follows the notes below).
• Advantages and drawbacks: data isolation allows for easier and faster data reporting, but because everything is kept in one place, a single breach can expose more sensitive data than if it were distributed across several locations.
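For item (e), a data-cube-style view can be sketched in pandas by aggregating hypothetical sales facts along three dimensions:

```python
# A minimal data-cube-style view: hypothetical sales facts aggregated along
# three dimensions (region, product, quarter).
import pandas as pd

sales = pd.DataFrame({
    "region":  ["EU", "EU", "US", "US", "US"],
    "product": ["A", "B", "A", "A", "B"],
    "quarter": ["Q1", "Q1", "Q1", "Q2", "Q2"],
    "revenue": [100, 150, 120, 130, 90],
})

cube = sales.pivot_table(
    index=["region", "product"], columns="quarter",
    values="revenue", aggfunc="sum", fill_value=0,
)
print(cube)
```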
2. Look outside the company if internal data is insufficient
• If the required data is not available within the company, turn to organizations that provide such data. For example, Nielsen and GfK provide data for the retail industry. Data scientists can also draw on data from Twitter, LinkedIn and Facebook.
• Government organizations share their data for free with the world. This data can be of excellent quality, depending on the institution that creates and manages it. The information they share covers a broad range of topics, such as the number of accidents or the amount of drug abuse in a certain region, and its demographics.
3. Perform data quality checks to avoid later problems
• Allocate time for data correction and data cleaning: collecting suitable, error-free data is key to the success of a data science project.
• Most of the errors encountered during the data gathering phase are easy to spot, but being too careless will make data scientists spend many hours solving data issues that could have been prevented during data import.
• Data scientists must investigate the data during the import, data preparation and
exploratory phases. The difference is in the goal and the depth of the
investigation.
• During data retrieval, verify whether the data has the right data types and matches the source documents.
• During data preparation, more elaborate checks are performed: check whether any shortcuts were taken when the data was entered, for example in time and date formats.
• During the exploratory phase, the focus shifts to what can be learned from the data. At this point, data scientists assume the data to be clean and look at its statistical properties, such as distributions, correlations and outliers.
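The following sketch shows what these retrieval-, preparation-, and exploration-time checks can look like in pandas, using a small hypothetical orders extract:

```python
import pandas as pd
from io import StringIO

# Hypothetical raw export, standing in for the real source file.
raw_csv = StringIO(
    "order_id,order_date,amount\n"
    "1,2024-03-01,19.99\n"
    "2,03/02/2024,25.00\n"     # inconsistent date format
    "3,2024-03-05,-4.50\n"     # impossible negative amount
)
orders = pd.read_csv(raw_csv)

# Retrieval-time check: are the columns the expected types?
print(orders.dtypes)

# Preparation-time checks: do dates parse, and are there impossible values?
orders["order_date"] = pd.to_datetime(orders["order_date"], errors="coerce", format="%Y-%m-%d")
print("unparseable dates:", orders["order_date"].isna().sum())
print("negative amounts:", (orders["amount"] < 0).sum())

# Exploration-time check: look at distributions and outliers.
print(orders["amount"].describe())
```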
Cleansing:
Data cleansing is the process of finding and removing errors, inconsistencies, duplications,
and missing entries from data to increase data consistency and quality—also known as data
scrubbing or cleaning.
While organizations can be proactive about data quality in the collection stage, it can
still be noisy or dirty. This can be because of a range of problems:
Duplications due to multiple unmatched data sources
Data entry errors with misspellings and inconsistencies
Incomplete data or missing fields
Punctuation errors or non-compliant symbols
Outdated data
Data cleansing addresses these problems and, using a variety of methods, cleans the data and ensures it matches the business rules.
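Here is a minimal cleansing sketch in pandas, covering a few of the problems above (duplicates, inconsistent spelling, missing fields) on a hypothetical customer table:

```python
import pandas as pd

customers = pd.DataFrame({
    "name":  ["Ann Lee", "ann lee", "Bob Roy", None],
    "city":  ["new york", "New York", "Boston ", "Boston"],
    "spend": [120.0, 120.0, None, 80.0],
})

# Standardize spelling and whitespace, drop duplicate records, and fill a
# missing numeric field with a plausible value.
customers["name"] = customers["name"].str.strip().str.title()
customers["city"] = customers["city"].str.strip().str.title()
customers = customers.drop_duplicates(subset=["name", "city"])
customers["spend"] = customers["spend"].fillna(customers["spend"].median())
print(customers)
```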
Integrating:
Data integration refers to the process of bringing together data from multiple
sources across an organization to provide a complete, accurate, and up-to-date
dataset for BI, data analysis and other applications and business processes. It
includes data replication, ingestion and transformation to combine different types
of data into standardized formats to be stored in a target repository such as a data
warehouse, data lake or data lakehouse.
Five Approaches
There are five different approaches, or patterns, to execute data integration:
ETL, ELT, streaming, application integration (API) and data virtualization. To
implement these processes, data engineers, architects and developers can
either manually code an architecture using SQL or, more often, they set up and
manage a data integration tool, which streamlines development and automates
the system.
Together, these approaches sit within a modern data management process, transforming raw data into clean, business-ready information.
1. ETL
An ETL pipeline is a traditional type of data pipeline which converts raw data to match the
target system via three steps: extract, transform and load. Data is transformed in a staging
area before it is loaded into the target repository (typically a data warehouse). This allows
for fast and accurate data analysis in the target system and is most appropriate for small
datasets which require complex transformations.
Change data capture (CDC) is a method of ETL and refers to the process or technology for
identifying and capturing changes made to a database. These changes can then be applied
to another data repository or made available in a format consumable by ETL, EAI, or other
types of data integration tools.
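As a rough illustration of the extract-transform-load order, the sketch below stages a small hypothetical extract in pandas and loads it into a SQLite file standing in for the warehouse; production pipelines would normally use a dedicated integration tool or orchestrator.

```python
import sqlite3
import pandas as pd

# Extract: a small inline frame stands in for the raw source extract.
raw = pd.DataFrame({
    "order_id":   [1, 2, None, 4],
    "order_date": ["2024-05-01", "2024-05-02", "2024-05-02", "2024-05-03"],
    "amount":     [19.991, 25.0, 7.5, 12.345],
})

# Transform: clean and standardize in a staging frame before loading.
staged = raw.dropna(subset=["order_id"]).copy()
staged["amount"] = staged["amount"].round(2)
staged["order_date"] = pd.to_datetime(staged["order_date"])

# Load: a SQLite file stands in for the target data warehouse.
with sqlite3.connect("warehouse.db") as conn:
    staged.to_sql("fact_sales", conn, if_exists="append", index=False)
```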
2. ELT
In the more modern ELT pipeline, the data is immediately loaded and then transformed
within the target system, typically a cloud-based data lake, data warehouse or data
lakehouse. This approach is more appropriate when datasets are large and timeliness is
important, since loading is often quicker. ELT operates on either a micro-batch or a change data capture (CDC) timescale. Micro-batch, or “delta load”, only loads the data modified since the last successful load. CDC, on the other hand, continually loads data as and when it changes on the source.
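The delta-load idea can be sketched as follows; the source table, timestamps, and target are hypothetical, and only rows changed since the last successful load are moved, leaving transformation to happen later inside the target.

```python
import sqlite3
import pandas as pd

# A hypothetical source table with an updated_at column.
src = sqlite3.connect(":memory:")
pd.DataFrame({
    "order_id":   [1, 2, 3],
    "amount":     [19.99, 25.00, 7.50],
    "updated_at": ["2024-05-30 10:00:00", "2024-06-02 09:30:00", "2024-06-03 14:00:00"],
}).to_sql("orders", src, index=False)

last_successful_load = "2024-06-01 00:00:00"   # normally read from load metadata

# Delta load: pull only rows modified since the last successful load and
# land them in the target as-is; transformation happens later, inside the
# target system (the "T" of ELT).
delta = pd.read_sql_query(
    "SELECT * FROM orders WHERE updated_at > ?", src,
    params=(last_successful_load,),
)

with sqlite3.connect("lakehouse.db") as tgt:
    delta.to_sql("raw_orders", tgt, if_exists="append", index=False)
print(f"loaded {len(delta)} changed rows")
```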
3. Data Streaming
Instead of loading data into a new repository in batches, streaming data integration moves
data continuously in real-time from source to target. Modern data integration (DI) platforms
can deliver analytics-ready data into streaming and cloud platforms, data warehouses, and
data lakes.
4. Application Integration
Application integration (API) allows separate applications to work together by moving and
syncing data between them. The most typical use case is to support operational needs
such as ensuring that your HR system has the same data as your finance system.
Therefore, the application integration must provide consistency between the data sets. Also,
these various applications usually have unique APIs for giving and taking data so SaaS
application automation tools can help you create and maintain native API integrations
efficiently and at scale.
5. Data Virtualization
Like streaming, data virtualization also delivers data in real time, but only when it is
requested by a user or application. Still, this can create a unified view of data and makes
data available on demand by virtually combining data from different systems. Virtualization
and streaming are well suited for transactional systems built for high performance queries.
Transforming Data:
Data transformation is a critical part of the data integration process in which raw data is
converted into a unified format or structure. Data transformation ensures compatibility
with target systems and enhances data quality and usability. It is an essential aspect
of data management practices including data wrangling, data analysis and data
warehousing.
While specialists can manually achieve data transformation, the large swaths of data
required to power modern enterprise applications typically require some level
of automation. The tools and technologies deployed through the process of converting
data can be simple or complex.
These more advanced data engineering functions include data normalization, which restructures data and brings values onto a consistent scale, and data enrichment, which supplements existing information with third-party datasets. Typical transformation operations include the following (a minimal Python sketch follows the list):
Data cleaning: Data cleaning improves data quality by rectifying errors and
inconsistencies, such as eliminating duplicate records.
Data aggregation: Data aggregation summarizes data by combining multiple
records into a single value or dataset.
Data normalization: Data normalization standardizes data, bringing all values
into a common scale or format such as numerical values from 1 to 10.
Data encoding: Data encoding converts categorical data into a numerical format,
making it easier to analyze. For instance, data encoding might assign a unique
number to each category of data.
Data enrichment: Data enrichment enhances data by adding relevant
information from external sources, such as third-party demographic data or
relevant metadata.
Data imputation: Data imputation replaces missing data with plausible values.
For instance, it might replace missing values with the median or average value.
Data splitting: Data splitting divides data into subsets for different purposes.
For example, engineers might split a data set to use one for training and one for
testing in machine learning.
Data discretization: In data discretization, data is converted into discrete
buckets or intervals in a process sometimes referred to as binning. As an
example, discretization might be used in a healthcare setting to translate data
like patient age into categories like “infant” or “adult.”
Data generalization: Data generalization abstracts large data sets into a higher-
level or summary form, reducing detail and making the data easier to
understand.
Data visualization: Data visualization represents data graphically, revealing
patterns or insights that might not be immediately obvious.
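Here is that minimal Python sketch, touching several of the operations above on a small hypothetical patient table (imputation, normalization, encoding, discretization, and splitting):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

patients = pd.DataFrame({
    "age":   [2, 35, 67, None, 50],
    "sex":   ["F", "M", "F", "M", "F"],
    "score": [10, 55, 80, 40, 65],
})

patients["age"] = patients["age"].fillna(patients["age"].median())          # imputation
patients["score_norm"] = patients["score"] / patients["score"].max() * 10   # normalization to a 0-10 scale
patients["sex_code"] = patients["sex"].map({"F": 0, "M": 1})                # encoding
patients["age_group"] = pd.cut(patients["age"], bins=[0, 12, 120],
                               labels=["infant/child", "adult"])            # discretization (binning)
train, test = train_test_split(patients, test_size=0.4, random_state=0)     # splitting
print(train)
```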
Exploratory Data Analysis (EDA) is an analysis approach that identifies general patterns
in the data. These patterns include outliers and features of the data that might be
unexpected.
EDA is an important first step in any data analysis. Understanding where outliers occur
and how variables are related can help one design statistical analyses that yield
meaningful results. In biological monitoring data, sites are likely to be affected by
multiple stressors. Thus, initial explorations of stressor correlations are critical before
one attempts to relate stressor variables to biological response variables. EDA can
provide insights into candidate causes that should be included in a causal assessment.
Scatterplots and correlation coefficients can provide useful information on relationships between pairs of variables. However, when analyzing numerous variables, basic methods of multivariate visualization can provide greater insights. Mapping the data is also critical for understanding spatial relationships among samples.
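A minimal EDA sketch along these lines, using hypothetical stressor and response measurements for a handful of sites:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical monitoring data: two stressor variables and one response.
sites = pd.DataFrame({
    "conductivity": [210, 250, 900, 300, 260, 1200, 280],
    "nutrients":    [0.4, 0.5, 2.1, 0.6, 0.5, 2.8, 0.7],
    "taxa_count":   [24, 22, 9, 20, 23, 6, 21],
})

print(sites.corr())       # pairwise correlations, including between stressors
print(sites.describe())   # summary statistics help flag outliers

# Scatterplot of one stressor against the biological response.
sites.plot.scatter(x="conductivity", y="taxa_count")
plt.show()
```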
• To build a model, the data should be clean and its content properly understood. The components of model building include choosing a modeling technique and variables, executing the model, and assessing its performance; model execution is discussed below.
Model Execution
• Various programming languages can be used to implement the model. For model execution, Python provides libraries such as StatsModels and scikit-learn, which implement several of the most popular techniques.
• Linear regression works when we want to predict a numeric value; to classify something, classification models are used, and the k-nearest neighbors method is one of the simplest and most widely used.
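A minimal model-execution sketch with scikit-learn is shown below; it trains a k-nearest neighbors classifier on the library's bundled Iris data set, which serves purely as a stand-in for project data. Swapping in a different estimator requires changing only one line, which is part of why scikit-learn is popular for model execution.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load a bundled data set and hold out a test split for evaluation.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Fit a k-nearest neighbors classifier and check accuracy on the test split.
model = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```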
1. SAS Enterprise Miner: This tool allows users to run predictive and descriptive models based on large volumes of data from across the enterprise.
2. SPSS Modeler: It offers methods to explore and analyze data through a GUI.
4. Alpine Miner: This tool provides a GUI front end for users to develop analytic workflows and interact with Big Data tools and platforms on the back end.
2. Presenting Findings
Data Visualization: Use charts, graphs, and dashboards to make data more digestible. Tools like Tableau, Power BI, or Matplotlib can be helpful (see the sketch after this list).
Storytelling: Frame your findings in a narrative that highlights the problem, the
analysis, and the implications. Use the "Problem-Solution-Impact" framework.
Key Metrics: Focus on key performance indicators (KPIs) that matter to your
audience. Highlight actionable insights rather than overwhelming them with data.
Interactive Presentations: Consider using tools like Shiny (for R) or Dash (for
Python) to create interactive visualizations that allow stakeholders to explore the
data themselves.
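As referenced under Data Visualization above, here is a minimal Matplotlib sketch of a single KPI-focused chart built around hypothetical numbers; the point is one clear message rather than a raw data dump.

```python
import matplotlib.pyplot as plt

# Hypothetical monthly conversion rates (percent) framed around one finding.
months = ["Jan", "Feb", "Mar", "Apr"]
conversion_rate = [2.1, 2.3, 3.0, 3.4]

fig, ax = plt.subplots()
ax.bar(months, conversion_rate)
ax.set_ylabel("Conversion rate (%)")
ax.set_title("Conversion rate rose after the March campaign change")
plt.show()
```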
3. Building Applications
Define Use Cases: Identify specific problems that your findings can solve. This
could be predictive analytics, recommendation systems, or operational dashboards.
Choose the Right Technology Stack: Depending on the application, select appropriate technologies (e.g., Flask/Django for web apps, Streamlit for data apps); a minimal Streamlit sketch follows this list.
Data Integration: Ensure that your application can access and process the
necessary data. This may involve setting up databases or APIs.
User Experience (UX): Design the application with the end-user in mind. Ensure it is
intuitive and meets user needs.
Iterative Development: Use agile methodologies to develop the application in
iterations, allowing for feedback and adjustments along the way.
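As mentioned under the technology stack item, a minimal Streamlit data-app sketch might look like this; the file name, data, and metric are hypothetical, and the app is launched with `streamlit run app.py`.

```python
import pandas as pd
import streamlit as st

st.title("Bookings dashboard")

# Hypothetical monthly booking counts standing in for a real data source.
bookings = pd.DataFrame({
    "month":   ["Jan", "Feb", "Mar", "Apr", "May"],
    "tickets": [980, 1020, 1150, 1210, 1890],
})

# Let the user pick a month and show the matching KPI plus an overview chart.
month = st.selectbox("Month", bookings["month"])
st.metric("Tickets booked", int(bookings.loc[bookings["month"] == month, "tickets"].iloc[0]))
st.bar_chart(bookings.set_index("month"))
```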
5. Communicating Impact
Measure Outcomes: After implementation, measure the impact of your application
on business metrics. This could include increased efficiency, cost savings, or
improved customer satisfaction.
Share Success Stories: Communicate the success of your application through case
studies or presentations to stakeholders, reinforcing the value of data-driven
decision-making.