
Module - 1

What is data science?


Data science is the study of data to extract meaningful insights for business. It is a multidisciplinary
approach that combines principles and practices from the fields of mathematics, statistics, artificial
intelligence, and computer engineering to analyze large amounts of data. This analysis helps data
scientists to ask and answer questions like what happened, why it happened, what will happen, and what
can be done with the results.

Why is data science important?


Data science is important because it combines tools, methods, and technology to generate meaning
from data. Modern organizations are inundated with data; there is a proliferation of devices that can
automatically collect and store information. Online systems and payment portals capture more data in
the fields of e-commerce, medicine, finance, and every other aspect of human life. We have text, audio,
video, and image data available in vast quantities.

History of data science


While the term data science is not new, its meanings and connotations have changed over time. The term first appeared in the 1960s as an alternative name for statistics. In the late 1990s, computer science
professionals formalized the term. A proposed definition for data science saw it as a separate field with
three aspects: data design, collection, and analysis. It still took another decade for the term to be used
outside of academia.

Future of data science


Artificial intelligence and machine learning innovations have made data processing faster and more
efficient. Industry demand has created an ecosystem of courses, degrees, and job positions within the
field of data science. Because of the cross-functional skillset and expertise required, data science shows
strong projected growth over the coming decades.

What is data science used for?


Data science is used to study data in four main ways:

1. Descriptive analysis

Descriptive analysis examines data to gain insights into what happened or what is happening in the data
environment. It is characterized by data visualizations such as pie charts, bar charts, line graphs, tables,
or generated narratives. For example, a flight booking service may record data like the number of tickets
booked each day. Descriptive analysis will reveal booking spikes, booking slumps, and high-performing
months for this service.
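
As a rough illustration of descriptive analysis (not taken from the text above), the following sketch aggregates hypothetical daily ticket counts by month with pandas to reveal spikes, slumps, and high-performing months:

import pandas as pd

# Hypothetical daily booking counts for a flight service (values are made up)
bookings = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=365, freq="D"),
    "tickets": 100,
})
bookings.loc[bookings["date"].dt.month == 5, "tickets"] = 180  # simulate a May spike

# Descriptive summary: total bookings per month plus basic statistics
monthly = bookings.groupby(bookings["date"].dt.to_period("M"))["tickets"].sum()
print(monthly)             # reveals the May spike
print(monthly.describe())  # mean, min, max, etc.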

2. Diagnostic analysis

Diagnostic analysis is a deep-dive or detailed data examination to understand why something happened. It is characterized by techniques such as drill-down, data discovery, data mining, and correlations. Multiple data operations and transformations may be performed on a given data set to discover unique patterns in each of these techniques. For example, the flight service might drill down on a particularly high-performing month to better understand the booking spike. This may lead to the discovery that many customers visit a particular city to attend a monthly sporting event.
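
A minimal sketch of such a drill-down, using a hypothetical ticket-level table (columns and values invented for illustration):

import pandas as pd

# Hypothetical ticket-level data
df = pd.DataFrame({
    "month":       ["May", "May", "May", "June", "June"],
    "destination": ["Paris", "Paris", "Rome", "Paris", "Rome"],
    "tickets":     [320, 280, 90, 110, 95],
})

# Drill down into the high-performing month to see which destination drives the spike
may = df[df["month"] == "May"]
print(may.groupby("destination")["tickets"].sum().sort_values(ascending=False))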

3. Predictive analysis

Predictive analysis uses historical data to make accurate forecasts about data patterns that may occur in the future. It is characterized by techniques such as machine learning, forecasting, pattern matching, and predictive modeling. In each of these techniques, computers are trained to reverse engineer causality connections in the data. For example, the flight service team might use data science to predict flight booking patterns for the coming year at the start of each year. The computer program or algorithm may look at past data and predict booking spikes for certain destinations in May. Having anticipated their customers’ future travel requirements, the company could start targeted advertising for those cities from February.
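
A very simple forecasting sketch, assuming two years of hypothetical monthly totals and scikit-learn; a production model would normally capture seasonality (such as the May spike) explicitly rather than fit a plain trend line:

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical monthly booking totals for the past two years
history = np.array([100, 105, 110, 120, 180, 130, 125, 120, 115, 110, 108, 112,
                    110, 118, 122, 130, 195, 140, 135, 128, 124, 120, 118, 121])
months = np.arange(len(history)).reshape(-1, 1)

# Fit a simple trend model and forecast the next 12 months
model = LinearRegression().fit(months, history)
future = np.arange(len(history), len(history) + 12).reshape(-1, 1)
print(model.predict(future).round(1))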

4. Prescriptive analysis

Prescriptive analytics takes predictive data to the next level. It not only predicts what is likely to happen
but also suggests an optimum response to that outcome. It can analyze the potential implications of
different choices and recommend the best course of action. It uses graph analysis, simulation, complex
event processing, neural networks, and recommendation engines from machine learning.

Back to the flight booking example, prescriptive analysis could look at historical marketing campaigns to
maximize the advantage of the upcoming booking spike. A data scientist could project booking
outcomes for different levels of marketing spend on various marketing channels. These data forecasts
would give the flight booking company greater confidence in their marketing decisions.
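
A toy prescriptive sketch for the same scenario: the response coefficients and booking value below are invented assumptions, but they show how projected outcomes for different spend choices can be compared and an optimum recommended:

import math

channels = {"search": 3.0, "social": 2.2, "email": 1.5}  # assumed effectiveness factors
booking_value = 150.0                                    # assumed revenue per booking

def net_value(effectiveness, spend):
    # Assume diminishing returns: extra bookings grow with the square root of spend
    extra_bookings = effectiveness * math.sqrt(spend)
    return booking_value * extra_bookings - spend

# Evaluate a grid of (channel, spend) choices and recommend the best one
options = [(ch, s) for ch in channels for s in range(5_000, 50_001, 5_000)]
best = max(options, key=lambda o: net_value(channels[o[0]], o[1]))
print("Recommended channel and spend:", best)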

Life Cycle of Data science :

As the world enters the era of big data, modern lifestyles produce more and
more data at an unparalleled speed through apps, websites, smartphones, and
smart devices. Storing and making sense of this data has therefore become a
central challenge for modern enterprises. Data science helps overcome these
challenges: it is a blend of various tools, algorithms, and machine learning
practices whose primary goal is discovering hidden patterns in raw,
unstructured data.
Data science takes a forward-looking, innovative approach, evaluating
historical data and comparing it with present data to make better decisions and
predictions about future behaviour and outcomes. From healthcare and finance to
cybersecurity and automobiles, data scientists contribute significantly to
breakthroughs across verticals.
Read on to learn more about the different stages in the data science lifecycle that
require different skill sets, tools, and techniques.

Data science lifecycle explained


A data science lifecycle is defined as the iterative set of data science steps
required to deliver a project or analysis. There is no one-size-fits-all
lifecycle for data science projects, so you need to determine the one that best
fits your business requirements. Each step in the lifecycle should be performed
carefully; any improper execution will affect the following step and ultimately
the entire process.
Since every data science project and team is unique, every specific data
science lifecycle is different, but the main aim is always the same: to build a
framework and solutions for storing, processing, and analysing data.


Why data science is widely used :


In today’s world of technology and analytics, almost every industry uses data to
some degree. Through becoming a data scientist, you’ll be able to help companies
understand wide ranges of data from multiple sources and gain valuable insights so
they can make strategic and informed business decisions.

Some of the industries that use data science include:

 Marketing
 Healthcare
 Defense and Security
 Natural Sciences
 Engineering
 Finance
 Insurance
 Political Policy
These are just some of the industries that use data to their advantage, but data
science is valuable to a wide range of other fields and organizations. Some of the
benefits of data science for any industry include:

 Data science can help businesses understand their customers better so they
can improve customer interactions and tailor their marketing or product
offerings.
 Data helps organizations interpret patterns in their operations, which can
highlight areas of success and areas for improvement.
 Organizations that use data can be more responsive to customer needs and
desires, helping them stay ahead of their competition.
 Data-driven insights are reliable and can be used to inform new initiatives as
well as measure their success.

With our Data Science Graduate Certificate, you’ll learn valuable skills that are
crucial to organizations, like:

 Applying data science methodologies to data-driven problems


 Extracting knowledge from data to address real-world challenges
 Using data science tools like Python, SQL, and more
 Data mining – Analyzing large databases
 Machine learning
And you’ll learn all of this in a flexible format that meets you where you are in life.
With our course, you’ll have deadlines each week, but you can complete the
coursework around your schedule.

Practices for Securing Data


With the rise of organizations embracing data in today’s world, the risk of data falling
into the wrong hands has also grown. This concern makes data scientists, who are
valued for their ability to protect sensitive data from being compromised, even more
important.

With a firm understanding of the best practices for data management, you’ll know
how to use security measures and strategies to support organizations in the
following scenarios:

 Remote Work – The rise of remote work increases the amount of valuable
data being shared across networks. You can help implement measures like
anonymization, access controls, and encryption to minimize threats.
 Phishing – Phishing accounts for nearly 22% of all data breaches. Fraud data
scientists use scam prevention techniques to keep data safe from phishing
attempts.
 Cloud Security – The cloud expands an organization’s capacity to store data
through third-party services. By implementing secure servers and firewalls,
you can help ensure this data stays safe.
More than ever, consumers expect their data to be managed safely and securely by
the organizations they trust with it. Using the above strategies and following ethical
data science practices, you can help organizations do just that.

Why Companies Rely on Data Scientists


Companies rely on data scientists for a wide range of reasons. With our graduate
certificate in Data Science, you’ll start learning how to help companies gather,
compartmentalize, and manage data. With the knowledge you’ll learn at UTSA, you
can help collect insightful data, which is a coveted tool that organizations will use to
highlight important trends and guide strategy.

Companies will also rely on you, as a data scientist, to represent information in a


clear and straightforward way. This will help ensure that stakeholders across the
organization can access and understand the data, recognize trends, and gather
insights to make informed and strategic business decisions to successfully meet
customer needs.

As a data scientist, you’ll also play a crucial role in protecting company data. As you
progress in your career, you’ll become knowledgeable in secure storage techniques
and privacy measures, helping organizations stay in compliance with data
management regulations and preserving brand trust.

Overview of the Data Science Process :


Following a structured approach to data science helps you to
maximize your chances of success in a data science project at the
lowest cost. It also makes it possible to take up a project as a team,
with each team member focusing on what they do best. Take care,
however: this approach may not be suitable for every type of project
or be the only way to do good data science.

The typical data science process consists of six steps through which
you’ll iterate, as shown in figure 2.1.
Figure 2.1 summarizes the data science process and shows the main
steps and actions you’ll take during a project. The following list is a
short introduction; each of the steps will be discussed in greater depth
throughout this chapter.

1. The first step of this process is setting a research goal. The main
purpose here is making sure all the stakeholders understand
the what, how, and why of the project. In every serious project this
will result in a project charter.

2. The second phase is data retrieval. You want to have data available for analysis, so this step includes finding suitable data and getting access to the data from the data owner. The result is data in its raw form, which probably needs polishing and transformation before it becomes usable.

3. Now that you have the raw data, it’s time to prepare it. This
includes transforming the data from a raw form into data that’s
directly usable in your models. To achieve this, you’ll detect and
correct different kinds of errors in the data, combine data from
different data sources, and transform it. If you have successfully
completed this step, you can progress to data visualization and
modeling.
4. The fourth step is data exploration. The goal of this step is to gain
a deep understanding of the data. You’ll look for patterns,
correlations, and deviations based on visual and descriptive
techniques. The insights you gain from this phase will enable you to
start modeling.

5. Finally, we get to the sexiest part: model building (often referred to as “data modeling” throughout this book). It is now that you attempt to gain the insights or make the predictions stated in your project charter. Now is the time to bring out the heavy guns, but remember research has taught us that often (but not always) a combination of simple models tends to outperform one complicated model. If you’ve done this phase right, you’re almost done.

6. The last step of the data science model is presenting your results
and automating the analysis, if needed. One goal of a project is to
change a process and/or make better decisions. You may still need to
convince the business that your findings will indeed change the
business process as expected. This is where you can shine in your
influencer role. The importance of this step is more apparent in
projects on a strategic and tactical level. Certain projects require you
to perform the business process over and over again, so automating
the project will save time.
In reality you won’t progress in a linear way from step 1 to step 6.
Often you’ll regress and iterate between the different phases.

Following these six steps pays off in terms of a higher project success
ratio and increased impact of research results. This process ensures
you have a well-defined research plan, a good understanding of the
business question, and clear deliverables before you even start
looking at data. The first steps of your process focus on getting high-
quality data as input for your models. This way your models will
perform better later on. In data science there’s a well-known
saying: Garbage in equals garbage out.

Another benefit of following a structured approach is that you work more in prototype mode while you search for the best model. When building a prototype, you’ll probably try multiple models and won’t focus heavily on issues such as program speed or writing code against standards. This allows you to focus on bringing business value instead.

Not every project is initiated by the business itself. Insights learned during analysis or the arrival of new data can spawn new projects. When the data science team generates an idea, work has already been done to make a proposition and find a business sponsor.
Dividing a project into smaller stages also allows employees to work
together as a team. It’s impossible to be a specialist in everything.
You’d need to know how to upload all the data to all the different
databases, find an optimal data scheme that works not only for your
application but also for other projects inside your company, and then
keep track of all the statistical and data-mining techniques, while also
being an expert in presentation tools and business politics. That’s a
hard task, and it’s why more and more companies rely on a team of
specialists rather than trying to find one person who can do it all.

The process we described in this section is best suited for a data science project that contains only a few models. It’s not suited for every type of project. For instance, a project that contains millions of real-time models would need a different approach than the flow we describe here. A beginning data scientist should get a long way following this manner of working, though.

Defining research goals and creating a project charter:

 Spend time understanding the goals and context of your research.
 Continuously ask questions and devise examples until the business expectations are clear.
 Create a project charter outlining:
o Clear research goals
o Project mission and context
o Approach for analysis
o Expected resources
o Proof of project feasibility
o Deliverables and success metrics
o Timeline

Retrieving Data:

 Start with data stored within the company.


 Data may be stored in databases, data marts, data warehouses, or data
lakes.
 Accessing data may require time and adherence to company policies.

Cleansing, Integrating, and Transforming Data:

 Cleaning: Remove errors in data to ensure consistency and accuracy.


 Integrating: Combine data from different sources through joining and
appending operations.
 Transforming: Restructure data to meet model requirements, including
reducing variables and using dummy variables.

Exploratory Data Analysis:

 Take a deep dive into the data to understand its characteristics.


 Utilise graphical techniques such as bar plots, line plots, scatter plots,
histograms, etc., to visualise data and identify patterns.

Building Models:

 Develop models aimed at making predictions, classifying objects, or understanding underlying systems.

Presenting Findings and Building Applications:

 Use soft skills to present results to stakeholders effectively.


 Industrialise the analysis process for repetitive use and integration with other
tools.
Following these steps ensures a systematic approach to data science projects,
leading to meaningful insights and actionable outcomes.

Tools Used in Data Science Process


With time, tools used in the Data Science process have evolved.

Various software tools such as MATLAB and Power BI, along with programming languages like Python and R, offer a plethora of utility features that enable us to tackle complex tasks efficiently within tight timeframes.

[Figure: some of the popular tools in the field of Data Science]

Use and Benefits of Data Science Process


The Data Science Process offers a structured approach to addressing data-related
challenges, providing numerous benefits across various industries. Here’s a closer
look at how businesses leverage each step of the process and its associated
advantages:

1. Problem Definition:

Use: Clearly define the problem at hand and establish the objectives of the analysis.

Benefits:

 Ensures alignment with business goals.


 Helps in setting clear expectations for outcomes.

2. Data Collection:

Use: Gather data from diverse sources, perform cleaning, and prepare it for
analysis.

Benefits:

 Access to comprehensive datasets for analysis.


 Improves data quality and accuracy.

3. Data Exploration:

Use: Explore data to uncover insights, trends, patterns, and relationships.

Benefits:

 Provides valuable insights into data characteristics.


 Identifies potential opportunities and challenges.

4. Data Modeling

Use: Develop mathematical models and algorithms to solve problems and make
predictions.

Benefits:

 Enables predictive analytics and decision-making.


 Enhances understanding of complex data relationships.

5. Evaluation:

Use: Assess the performance and accuracy of the model using relevant metrics.

Benefits:

 Validates the effectiveness of the model.


 Facilitates improvements based on feedback.

6. Deployment:

Use: Implement the model in a production environment for real-time predictions or automated decision-making.
Benefits:

 Enables integration into operational workflows.


 Supports scalable and efficient decision-making processes.

7. Monitoring and Maintenance:

Use: Continuously monitor the model’s performance and make necessary updates to
maintain accuracy.

Benefits:

 Ensures ongoing relevance and reliability of predictions.


 Mitigates risks associated with model degradation.

Overall, the Data Science Process empowers organisations to derive actionable insights from data, make informed decisions, and drive business success. By following this systematic approach, businesses can harness the full potential of their data assets and stay competitive in today’s data-driven landscape.

Issues/Challenges Faced During Data Science Process

Data Quality and Availability:

 Data must be accurate, complete, and consistent to ensure model accuracy.


 Challenges may arise when required data is not readily available or
accessible.

Bias in Data and Algorithms:

 Bias in data due to sampling techniques or measurement errors can impact model accuracy.
 Algorithms may perpetuate societal biases, leading to unfair outcomes.
Model Overfitting and Underfitting:

 Overfitting occurs when a model is overly complex and fails to generalise to new data.
 Underfitting happens when a model is too simple to capture underlying data
relationships effectively.

Model Interpretability:

 Complex models can be challenging to interpret, hindering the explanation of model decisions.
 This lack of interpretability can pose obstacles in making informed business
decisions.

Privacy and Ethical Considerations:

 Collection and analysis of sensitive personal information raise privacy and ethical concerns.
 It’s crucial to ensure responsible and ethical use of data to address these
concerns.

Technical Challenges:

 Technical hurdles like data storage, processing, algorithm selection, and computational scalability may arise.
 Overcoming these challenges requires robust technical expertise and
infrastructure.

Wrapping Up
The Data Science Process offers a structured approach to harnessing the power of
data, enabling organisations to derive actionable insights and drive strategic
decision-making. By following this systematic methodology, businesses can
overcome challenges, unlock opportunities, and stay ahead in today’s data-driven
world. The benefits are manifold, from improved decision-making and enhanced
operational efficiency to innovative product development and increased
competitiveness.
To start on a transformative journey into the realm of data science and business
analytics, consider enrolling in the Accelerator Program in Business Analytics and
Data Science at Hero Vired. With a cutting-edge curriculum, expert faculty, and
hands-on learning experiences, this program equips aspiring data professionals with
the skills and knowledge needed to thrive in the dynamic field of data science. Don’t
miss this opportunity to propel your career forward and become a driving force in the
digital age. Join us at Hero Vired and unlock your potential in data science today.

Retrieving data :

Retrieving required data is the second phase of a data science project. Sometimes data scientists need to go into the field and design a data collection process. Many companies will have already collected and stored the data, and what they don't have can often be bought from third parties.

• Much high-quality data is freely available for public and commercial use. Data can be stored in various formats, such as text files and tables in a database. Data may be internal or external.

1. Start working on internal data, i.e. data stored within the company

• The first step for data scientists is to verify the internal data: assess the relevance and quality of the data that is readily available within the company. Most companies have a program for maintaining key data, so much of the cleaning work may already be done. This data can be stored in official data repositories such as databases, data marts, data warehouses and data lakes maintained by a team of IT professionals.

• A data repository, also known as a data library or data archive, is a general term for a data set isolated to be mined for data reporting and analysis. A data repository may be a large database infrastructure, made up of several databases that collect, manage and store data sets for data analysis, sharing and reporting.

• Data repository can be used to describe several ways to collect and store data:

a) Data warehouse is a large data repository that aggregates data usually from
multiple sources or segments of a business, without the data being necessarily
related.
b) Data lake is a large data repository that stores unstructured data that is
classified and tagged with metadata.

c) Data marts are subsets of the data repository. These data marts are more
targeted to what the data user needs and easier to use.

d) Metadata repositories store data about data and databases. The metadata explains where the data came from, how it was captured and what it represents.

e) Data cubes are lists of data with three or more dimensions stored as a table.

Advantages of data repositories:

i. Data is preserved and archived.

ii. Data isolation allows for easier and faster data reporting.

iii. Database administrators have an easier time tracking problems.

iv. There is value to storing and analyzing data.

Disadvantages of data repositories :

i. Growing data sets could slow down systems.

ii. A system crash could affect all the data.

iii. Unauthorized users can access all sensitive data more easily than if it was
distributed across several locations.

2. Do not be afraid to shop around

• If the required data is not available within the company, seek help from other companies that provide such data. For example, Nielsen and GfK provide data for the retail industry. Data scientists can also draw on data from Twitter, LinkedIn and Facebook.

• Government organizations share their data for free with the world. This data can be of excellent quality, depending on the institution that creates and manages it. The information they share covers a broad range of topics such as the number of accidents or amount of drug abuse in a certain region and its demographics.

3. Perform data quality checks to avoid later problems

• Allocate time for data correction and data cleaning. Collecting suitable, error-free data is key to the success of the data science project.

• Most of the errors encountered during the data-gathering phase are easy to spot, but being too careless will make data scientists spend many hours solving data issues that could have been prevented during data import.

• Data scientists must investigate the data during the import, data preparation and exploratory phases. The difference is in the goal and the depth of the investigation.

• In the data retrieval process, verify whether the data has the right data type and is the same as in the source document.

• In the data preparation process, more elaborate checks are performed; for example, check whether any shortcut was used and whether time and date formats are consistent.

• During the exploratory phase, the data scientist's focus shifts to what can be learned from the data. The data is now assumed to be clean, and the focus is on statistical properties such as distributions, correlations and outliers.
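
Such checks can be partly automated. Below is a minimal sketch, assuming pandas and a hypothetical bookings.csv extract; the expected row count, column names and formats are illustrative only:

import pandas as pd

df = pd.read_csv("bookings.csv")  # hypothetical source extract

# Retrieval-stage checks: right data types and same size as the source document
assert len(df) == 10_000, "row count differs from the source extract"
assert pd.api.types.is_numeric_dtype(df["tickets"]), "tickets should be numeric"

# Preparation-stage checks: consistent date format (raise if anything deviates)
df["date"] = pd.to_datetime(df["date"], format="%Y-%m-%d", errors="raise")

# Exploratory-stage checks: distributions, correlations and outliers
print(df.describe())
print(df.select_dtypes("number").corr())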

Cleansing :

Data cleansing is the process of finding and removing errors, inconsistencies, duplications,
and missing entries from data to increase data consistency and quality—also known as data
scrubbing or cleaning.
While organizations can be proactive about data quality in the collection stage, it can
still be noisy or dirty. This can be because of a range of problems:
 Duplications due to multiple unmatched data sources
 Data entry errors with misspellings and inconsistencies
 Incomplete data or missing fields
 Punctuation errors or non-compliant symbols
 Outdated data

Data cleansing addresses these problems and, using a variety of methods, cleanses the data and ensures it matches the business rules.
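
A small pandas sketch of these cleansing steps on an invented customer table (the duplicate, misspelled and missing entries are fabricated for illustration):

import numpy as np
import pandas as pd

# Hypothetical records showing the problems listed above
df = pd.DataFrame({
    "name":  ["Alice", "alice ", "Bob", None],
    "email": ["a@x.com", "a@x.com", "b@x.com", "c@x.com"],
    "age":   [34, 34, np.nan, 29],
})

df["name"] = df["name"].str.strip().str.title()      # fix inconsistent spelling and case
df = df.drop_duplicates(subset=["name", "email"])    # remove duplicated records
df["age"] = df["age"].fillna(df["age"].median())     # fill missing fields
df = df.dropna(subset=["name"])                      # drop rows missing required values
print(df)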

Data cleansing for big data


Cleaning big data is the biggest challenge many industries face. It is already a
gargantuan volume, and unless systems are put in place now, the problem is only
going to continue to grow. There are a number of ways to potentially manage this
problem, and to be effective and efficient, they must be fully automated, with no
human inputs.
Specialized cleaning tools: These typically deal with a particular domain, mostly
name and address data, or concentrate on duplicate elimination. A number of
commercial tools focus on cleaning this kind of data. These tools extract data, break
it down into the individual elements (such as phone number, address, and name),
validate the address information and zip codes, and then match the data. Once the records are matched, they are merged and presented as one.
Extract Transform and Load (ETL) tools: A large number of organizational tools support the ETL process for data warehouses. This process extracts data from one source, transforms it into another form, and then loads it into the target dataset. The “transform” step is where the cleansing occurs: it removes inconsistencies and errors and detects missing information. Depending on the software, there can be a huge number of cleansing tools within the transform step.
Within these forms, there are also different ways that errors can be detected.

Integrating :

Data integration refers to the process of bringing together data from multiple
sources across an organization to provide a complete, accurate, and up-to-date
dataset for BI, data analysis and other applications and business processes. It
includes data replication, ingestion and transformation to combine different types
of data into standardized formats to be stored in a target repository such as a data
warehouse, data lake or data lakehouse.

Five Approaches
There are five different approaches, or patterns, to execute data integration:
ETL, ELT, streaming, application integration (API) and data virtualization. To
implement these processes, data engineers, architects and developers can
either manually code an architecture using SQL or, more often, they set up and
manage a data integration tool, which streamlines development and automates
the system.

These approaches sit within a modern data management process, transforming raw data into clean, business-ready information.
1. ETL

An ETL pipeline is a traditional type of data pipeline which converts raw data to match the
target system via three steps: extract, transform and load. Data is transformed in a staging
area before it is loaded into the target repository (typically a data warehouse). This allows
for fast and accurate data analysis in the target system and is most appropriate for small
datasets which require complex transformations.
Change data capture (CDC) is a method of ETL and refers to the process or technology for
identifying and capturing changes made to a database. These changes can then be applied
to another data repository or made available in a format consumable by ETL, EAI, or other
types of data integration tools.
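
A minimal ETL sketch in Python, assuming a hypothetical sales_raw.csv source and SQLite as a stand-in for the target warehouse; real pipelines would usually run inside a dedicated integration tool:

import sqlite3
import pandas as pd

# Extract: pull raw records from the (hypothetical) source file
raw = pd.read_csv("sales_raw.csv")

# Transform (staging): cleanse and reshape the data before loading
staged = raw.rename(columns=str.lower)
staged["order_date"] = pd.to_datetime(staged["order_date"])
staged = staged.dropna(subset=["order_id"]).drop_duplicates("order_id")

# Load: write the cleaned data into the target warehouse table
with sqlite3.connect("warehouse.db") as conn:
    staged.to_sql("sales", conn, if_exists="replace", index=False)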

2. ELT

In the more modern ELT pipeline, the data is immediately loaded and then transformed
within the target system, typically a cloud-based data lake, data warehouse or data
lakehouse. This approach is more appropriate when datasets are large and timeliness is
important, since loading is often quicker. ELT operates either on a micro-batch or (CDC)
timescale. Micro-batch, or “delta load”, only loads the data modified since the last
successful load. CDC on the other hand continually loads data as and when it changes on
the source.


3. Data Streaming

Instead of loading data into a new repository in batches, streaming data integration moves
data continuously in real-time from source to target. Modern data integration (DI) platforms
can deliver analytics-ready data into streaming and cloud platforms, data warehouses, and
data lakes.
4. Application Integration

Application integration (API) allows separate applications to work together by moving and
syncing data between them. The most typical use case is to support operational needs
such as ensuring that your HR system has the same data as your finance system.
Therefore, the application integration must provide consistency between the data sets. Also,
these various applications usually have unique APIs for giving and taking data so SaaS
application automation tools can help you create and maintain native API integrations
efficiently and at scale.
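
A hedged sketch of the HR-to-finance example using plain REST calls; the endpoints, field names and lack of authentication are purely illustrative, and real SaaS systems would expose their own APIs:

import requests

HR_API = "https://hr.example.com/api/employees"            # hypothetical endpoint
FINANCE_API = "https://finance.example.com/api/employees"  # hypothetical endpoint

# Pull employee records from the HR system and sync them into the finance system
employees = requests.get(HR_API, timeout=10).json()
for emp in employees:
    payload = {"id": emp["id"], "name": emp["name"], "cost_center": emp["department"]}
    resp = requests.put(f"{FINANCE_API}/{emp['id']}", json=payload, timeout=10)
    resp.raise_for_status()  # fail loudly so the two systems never silently diverge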

[Figure: example of a B2B marketing integration flow]

5. Data Virtualization

Like streaming, data virtualization also delivers data in real time, but only when it is
requested by a user or application. Still, this can create a unified view of data and makes
data available on demand by virtually combining data from different systems. Virtualization
and streaming are well suited for transactional systems built for high performance queries.
Transforming Data :
Data transformation is a critical part of the data integration process in which raw data is
converted into a unified format or structure. Data transformation ensures compatibility
with target systems and enhances data quality and usability. It is an essential aspect
of data management practices including data wrangling, data analysis and data
warehousing.

While specialists can manually achieve data transformation, the large swaths of data
required to power modern enterprise applications typically require some level
of automation. The tools and technologies deployed through the process of converting
data can be simple or complex.

For example, a data transformation might be as straightforward as converting a date field (for example: MM/DD/YY) into another format, or splitting a single Excel column into two. But complex data transformations, which clean and standardize data from multiple disparate sources and consist of multiple workflows, might involve advanced data science skills.
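
Both of those simple transformations can be expressed in a few lines of pandas (column names and values below are invented):

import pandas as pd

df = pd.DataFrame({
    "order_date": ["03/14/24", "07/02/24"],      # MM/DD/YY source format
    "customer":   ["Jane Doe", "John Smith"],     # single column to be split in two
})

# Convert the date field into a proper date type
df["order_date"] = pd.to_datetime(df["order_date"], format="%m/%d/%y").dt.date

# Split one column into two
df[["first_name", "last_name"]] = df["customer"].str.split(" ", n=1, expand=True)
print(df)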

These advanced data engineering functions include data normalization, which defines
relationships between data points; and data enrichment, which supplements existing
information with third-party datasets.

In today’s digital-first global economy, data transformations help organizations harness large volumes of data from different sources to improve service, train machine learning models and deploy big data analytics.

Data transformation use cases


By standardizing datasets and preparing them for subsequent processing, data
transformation makes several crucial enterprise data practices possible. Common
reasons for data transformation in the business world include:
Business intelligence
Organizations transform data for use in business intelligence applications like real-time
dashboards and forecast reports, allowing for data-driven decision-making that takes
vast amounts of information into account.
Data warehousing
Data transformation prepares data for storage and management in a data warehouse
or data lake, facilitating efficient querying and analysis.
Machine learning
Machine learning models require clean, organized data. Ensuring the data is
trustworthy and in the correct format allows organizations to use it for training and
tuning artificial intelligence (AI) tools.
Big data analytics
Before big data can be analyzed for business intelligence, market research or other
applications, it must be collated and formatted appropriately.
Data migration
Moving data from older on-premises systems to modern platforms like a cloud data
warehouse or data lakehouse often involves complex data transformations.

Data transformation process


Data transformations typically follow a structured process to produce usable, valuable
data from its raw form. Common steps in a data transformation process include:
1. Data discovery
During the discovery process, source data is gathered. This process might include
scraping raw data from APIs, an SQL database or internal files in disparate formats. In
identifying and extracting this information, data professionals ensure that the collected
information is comprehensive and relevant to its eventual application. During
discovery, engineers also begin to understand the data’s characteristics and structure in
a process known as data profiling.
2. Data cleaning
Data preparation and cleaning requires identifying and fixing errors, inconsistencies
and inaccuracies in raw data. This step ensures data quality and reliability by removing
duplicates and outliers or handling missing values.
3. Data mapping
Data mapping involves creating a schema or mapping process to guide the
transformation process. During this process, data engineers define how the elements in
the source system corresponds to specific elements in the target format.
4. Code generation
Either using a third-party tool or by generating code internally, during this step an
organization creates the code that will transform the data.
5. Code execution and validation
During this phase, the actual transformation takes place as code is applied to the raw
data. Transformed data is loaded into its target system for further analysis or
processing. The transformed data and data model are then validated to ensure
consistency and correctness.
6. Review
During the review process, data analysts, engineers or end users review the output data,
confirming that it meets requirements.

Types of data transformation


Data scientists and engineers use several distinct techniques throughout the data transformation process. Which tactics are deployed depends entirely on the project and the intended use for the data, though several methods may be used in tandem as part of a complex process; a brief sketch after the list below illustrates a few of them.

 Data cleaning: Data cleaning improves data quality by rectifying errors and
inconsistencies, such as eliminating duplicate records.
 Data aggregation: Data aggregation summarizes data by combining multiple
records into a single value or dataset.
 Data normalization: Data normalization standardizes data, bringing all values
into a common scale or format such as numerical values from 1 to 10.
 Data encoding: Data encoding converts categorical data into a numerical format,
making it easier to analyze. For instance, data encoding might assign a unique
number to each category of data.
 Data enrichment: Data enrichment enhances data by adding relevant
information from external sources, such as third-party demographic data or
relevant metadata.
 Data imputation: Data imputation replaces missing data with plausible values.
For instance, it might replace missing values with the median or average value.
 Data splitting: Data splitting divides data into subsets for different purposes.
For example, engineers might split a data set to use one for training and one for
testing in machine learning.
 Data discretization: In data discretization, data is converted into discrete
buckets or intervals in a process sometimes referred to as binning. As an
example, discretization might be used in a healthcare setting to translate data
like patient age into categories like “infant” or “adult.”
 Data generalization: Data generalization abstracts large data sets into a higher-
level or summary form, reducing detail and making the data easier to
understand.
 Data visualization: Data visualization represents data graphically, revealing
patterns or insights that might not be immediately obvious.
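
A brief pandas sketch of a few of these techniques (imputation, normalization, encoding and discretization) on an invented table:

import pandas as pd

df = pd.DataFrame({
    "age":     [2, 35, 67, None],
    "income":  [0, 42_000, 58_000, 61_000],
    "segment": ["new", "loyal", "loyal", "new"],
})

df["age"] = df["age"].fillna(df["age"].median())            # imputation: fill missing age
df["income_scaled"] = df["income"] / df["income"].max()     # normalization to a 0-1 scale
df = pd.get_dummies(df, columns=["segment"])                # encoding: one-hot categories
df["age_group"] = pd.cut(df["age"], bins=[0, 12, 18, 120],  # discretization into buckets
                         labels=["child", "teen", "adult"])
print(df)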

Exploratory data analysis :

Exploratory Data Analysis (EDA) is an analysis approach that identifies general patterns
in the data. These patterns include outliers and features of the data that might be
unexpected.

EDA is an important first step in any data analysis. Understanding where outliers occur
and how variables are related can help one design statistical analyses that yield
meaningful results. In biological monitoring data, sites are likely to be affected by
multiple stressors. Thus, initial explorations of stressor correlations are critical before
one attempts to relate stressor variables to biological response variables. EDA can
provide insights into candidate causes that should be included in a causal assessment.

Scatterplots and correlation coefficients can provide useful
information on relationships between pairs of variables. However, when analyzing
numerous variables, basic methods of multivariate visualization can provide greater
insights. Mapping data also is critical for understanding spatial relationships among
samples.
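
A minimal EDA sketch with pandas and matplotlib, assuming a hypothetical sites.csv monitoring data set with columns such as nutrient_level and species_count:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("sites.csv")             # hypothetical monitoring data

print(df.describe())                      # summary statistics; flags likely outliers
print(df.select_dtypes("number").corr())  # correlations between stressor variables

# Scatterplot of two variables to inspect their relationship visually
df.plot.scatter(x="nutrient_level", y="species_count")
plt.show()

# Pairwise view of several variables at once (basic multivariate visualization)
pd.plotting.scatter_matrix(df.select_dtypes("number"))
plt.show()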

Build the Models :

• To build the model, the data should be clean and its content properly understood. The components of model building are as follows:

a) Selection of model and variable

b) Execution of model

c) Model diagnostic and model comparison

• Building a model is an iterative process. Most models consist of the following main steps:

1. Selection of a modeling technique and variables to enter in the model

2. Execution of the model

3. Diagnosis and model comparison

Model and Variable Selection


• For this phase, consider model performance and whether the project meets all the requirements to use the model, as well as other factors:

1. Must the model be moved to a production environment and, if so, would it be easy to implement?

2. How difficult is the maintenance on the model: how long will it remain relevant if left untouched?

3. Does the model need to be easy to explain?

Model Execution

• Various programming languages are used for implementing the model. For model execution, Python provides libraries like StatsModels or Scikit-learn. These packages use several of the most popular techniques.

• Coding a model is a nontrivial task in most cases, so having these libraries available can speed up the process. The following are remarks on the output:

a) Model fit: R-squared or adjusted R-squared is used.

b) Predictor variables have a coefficient: For a linear model this is easy to interpret.

c) Predictor significance: Coefficients are great, but sometimes not enough evidence exists to show that the influence is there.

• Linear regression works if we want to predict a value, but to classify something, classification models are used. The k-nearest neighbors method is one of the best-known methods; a short sketch of both is given below.
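
A short scikit-learn sketch of both ideas, using the library's built-in sample data sets so it runs as-is; it reports R-squared and coefficients for a linear regression and accuracy for a k-nearest neighbors classifier:

from sklearn.datasets import load_diabetes, load_iris
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Regression: predict a numeric value and report model fit (R-squared) and coefficients
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
reg = LinearRegression().fit(X_train, y_train)
print("R-squared:", round(reg.score(X_test, y_test), 3))
print("Coefficients:", reg.coef_.round(2))

# Classification: k-nearest neighbors on a labelled data set
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print("Classification accuracy:", round(knn.score(X_test, y_test), 3))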

• The following commercial tools are used:

1. SAS enterprise miner: This tool allows users to run predictive and descriptive
models based on large volumes of data from across the enterprise.

2. SPSS modeler: It offers methods to explore and analyze data through a GUI.

3. Matlab: Provides a high-level language for performing a variety of data analytics, algorithms and data exploration.

4. Alpine miner: This tool provides a GUI front end for users to develop analytic
workflows and interact with Big Data tools and platforms on the back end.

• Open Source tools:


1. R and PL/R: PL/R is a procedural language for PostgreSQL with R.

2. Octave: A free software programming language for computational modeling; it has some of the functionality of Matlab.

3. WEKA: It is a free data mining software package with an analytic workbench. The functions created in WEKA can be executed within Java code.

4. Python is a programming language that provides toolkits for machine learning and analysis.

5. SQL in-database implementations, such as MADlib, provide an alternative to in-memory desktop analytical tools.

Presenting findings and building applications on top of them :

Presenting findings and building applications on top of data science results is a crucial part of the data science workflow. It involves translating complex data analyses into actionable insights and creating tools or applications that leverage these insights for end-users. Here’s a structured approach to effectively present findings and build applications:

1. Understanding Your Audience


 Identify Stakeholders: Know who will be using the findings (executives, technical
teams, end-users).
 Tailor Communication: Adjust the complexity of your presentation based on the
audience's technical expertise.

2. Presenting Findings
 Data Visualization: Use charts, graphs, and dashboards to make data more
digestible. Tools like Tableau, Power BI, or Matplotlib can be helpful.
 Storytelling: Frame your findings in a narrative that highlights the problem, the
analysis, and the implications. Use the "Problem-Solution-Impact" framework.
 Key Metrics: Focus on key performance indicators (KPIs) that matter to your
audience. Highlight actionable insights rather than overwhelming them with data.
 Interactive Presentations: Consider using tools like Shiny (for R) or Dash (for
Python) to create interactive visualizations that allow stakeholders to explore the
data themselves.
3. Building Applications
 Define Use Cases: Identify specific problems that your findings can solve. This
could be predictive analytics, recommendation systems, or operational dashboards.
 Choose the Right Technology Stack: Depending on the application, select
appropriate technologies (e.g., Flask/Django for web apps, Streamlit for data apps).
 Data Integration: Ensure that your application can access and process the
necessary data. This may involve setting up databases or APIs.
 User Experience (UX): Design the application with the end-user in mind. Ensure it is
intuitive and meets user needs.
 Iterative Development: Use agile methodologies to develop the application in
iterations, allowing for feedback and adjustments along the way.
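
As a sketch of the data-app option mentioned above, Streamlit is assumed and the findings are a made-up monthly bookings table; a minimal interactive application saved as app.py and started with "streamlit run app.py" might look like this:

import pandas as pd
import streamlit as st

st.title("Booking Insights Dashboard")

# Hypothetical findings exported by the analysis step
data = pd.DataFrame({"month": ["Jan", "Feb", "Mar", "Apr", "May"],
                     "bookings": [100, 110, 120, 130, 190]})

month = st.selectbox("Month", data["month"])   # simple interactive control
st.metric("Bookings", int(data.loc[data["month"] == month, "bookings"].iloc[0]))
st.bar_chart(data.set_index("month"))          # chart of the full series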

4. Deployment and Maintenance


 Deployment: Use cloud platforms (like AWS, Azure, or Google Cloud) to deploy
your application. Consider containerization (Docker) for easier management.
 Monitoring and Feedback: Implement monitoring tools to track application
performance and user engagement. Gather user feedback for continuous
improvement.
 Documentation: Provide clear documentation for users and developers to
understand how to use and maintain the application.

5. Communicating Impact
 Measure Outcomes: After implementation, measure the impact of your application
on business metrics. This could include increased efficiency, cost savings, or
improved customer satisfaction.
 Share Success Stories: Communicate the success of your application through case
studies or presentations to stakeholders, reinforcing the value of data-driven
decision-making.
