Unit 2 BI & Data Science (1)
Unit 2 BI & Data Science (1)
Data
Data Collection
Data collection is the process of acquiring, extracting, and storing large volumes of data, which may be structured or unstructured (text, video, audio, XML files, records, image files, and so on), for use in later stages of data analysis. In big data analysis, data collection is the initial step, carried out before any patterns or useful information can be found in the data. The data to be analyzed must be collected from valid sources.
Collected data is broadly divided into two types:
1. Primary data
2. Secondary data
1. Primary data:
Raw, original data extracted directly from official sources is known as primary data. This type of data is collected directly through techniques such as questionnaires, interviews, and surveys. The data collected must match the demands and requirements of the target audience on which the analysis is performed; otherwise it becomes a burden during data processing.
A few methods of collecting primary data:
Interview method:
Here, data is collected by interviewing the target audience. The person who asks the questions is called the interviewer and the person who answers is the interviewee. Basic business- or product-related questions are asked, the answers are recorded as notes, audio, or video, and this data is stored for processing. Interviews can be structured or unstructured, and can be conducted as personal or formal interviews by telephone, face to face, email, and so on.
Survey method:
In the survey method, a list of relevant questions is asked and the answers are noted down as text, audio, or video. Surveys can be conducted both online and offline, for example through website forms and email, and the responses are then stored for analysis. Examples include online surveys and polls on social media.
Observation method:
In the observation method, the researcher keenly observes the behaviour and practices of the target audience using a data collection tool and stores the observed data as text, audio, video, or another raw format. The researcher may also pose a few questions to the participants. For example, a group of customers can be observed for their behaviour towards particular products, and the data obtained is then sent for processing.
Projective Technique
Projective data gathering is an indirect interview, used when potential respondents know why
they're being asked questions and hesitate to answer. For instance, someone may be reluctant
to answer questions about their phone service if a cell phone carrier representative poses the
questions. With projective data gathering, the interviewees get an incomplete question, and
they must fill in the rest, using their opinions, feelings, and attitudes.
Delphi Technique.
The Oracle at Delphi, according to Greek mythology, was the high priestess of Apollo’s
temple, who gave advice, prophecies, and counsel. In the realm of data collection, researchers
use the Delphi technique by gathering information from a panel of experts. Each expert
answers questions in their field of specialty, and the replies are consolidated into a single
opinion.
Focus Groups.
Focus groups, like interviews, are a commonly used technique. The group consists of
anywhere from a half-dozen to a dozen people, led by a moderator, brought together to
discuss the issue.
Questionnaires.
Questionnaires are a simple, straightforward data collection method. Respondents get a series of questions, either open-ended or closed-ended, related to the matter at hand.
Experimental method:
The experimental method is the process of collecting data by performing experiments, research, and investigation. The most frequently used experimental designs are CRD, RBD, LSD, and FD.
CRD - A Completely Randomized Design is a simple experimental design used in data analytics that is based on randomization and replication. It is mostly used for comparing treatments in an experiment.
RBD - A Randomized Block Design is an experimental design in which the experiment is divided into small units called blocks. Random experiments are performed within each block and the results are analyzed using a technique known as analysis of variance (ANOVA); a small sketch of such an analysis follows below. RBD originated in the agricultural sector.
LSD - A Latin Square Design is similar to CRD and RBD but arranges the experiment in rows and columns. It is an N x N square with an equal number of rows and columns, in which each letter occurs exactly once in each row and each column, so differences can be found with fewer errors in the experiment. A Sudoku puzzle is a familiar example of a Latin square.
FD - A Factorial Design is an experimental design in which each experiment involves two or more factors, each with several possible values (levels), and trials are performed over the combinations of those factor levels.
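To make the RBD/ANOVA idea concrete, here is a minimal sketch in Python (an illustration only, not part of the source); the blocks, treatments, and yield values are invented, and pandas and statsmodels are assumed to be available.

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Hypothetical randomized block design: 3 blocks x 2 treatments.
data = pd.DataFrame({
    "block":     ["B1", "B1", "B2", "B2", "B3", "B3"],
    "treatment": ["T1", "T2", "T1", "T2", "T1", "T2"],
    "yield_":    [20.1, 23.4, 19.8, 22.9, 21.0, 24.2],
})

# Model the response with block and treatment effects, then run ANOVA.
model = ols("yield_ ~ C(block) + C(treatment)", data=data).fit()
print(sm.stats.anova_lm(model, typ=2))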
2. Secondary data:
Secondary data is data that has already been collected and is reused for another valid purpose. This type of data is derived from previously recorded primary data, and it comes from two types of sources: internal and external.
i. Internal source:
These data can easily be found within the organization, such as market records, sales records, transactions, customer data, accounting resources, etc. The cost and time needed to obtain data from internal sources are low. Examples include:
Financial Statements
Sales Reports
Retailer/Distributor/Dealer Feedback
Customer Personal Information (e.g., name, address, age, contact info)
ii. External source:
These data are collected from outside the organization, for example:
Business Journals
Government Records (e.g., census, tax records, Social Security info)
Trade/Business Magazines
The internet
The following projective and survey techniques are also commonly used to collect primary data:
1. Word Association.
The researcher gives the respondent a set of words and asks them what comes to mind when
they hear each word.
2. Sentence Completion.
Researchers use sentence completion to understand what kind of ideas the respondent has.
This tool involves giving an incomplete sentence and seeing how the interviewee finishes it.
3. Role-Playing.
Respondents are presented with an imaginary situation and asked how they would act or react
if it was real.
4. In-Person Surveys.
The researcher asks questions in person.
5. Online/Web Surveys.
These surveys are easy to accomplish, but some users may be unwilling to answer truthfully,
if at all.
6. Mobile Surveys.
These surveys take advantage of the increasing proliferation of mobile technology. Mobile
collection surveys rely on mobile devices like tablets or smart phones to conduct surveys via
SMS or mobile apps.
7. Phone Surveys.
No researcher can call thousands of people at once, so they need a third party to handle the
chore. However, many people have call screening and won’t answer.
8. Observation.
Sometimes, the simplest method is the best. Researchers who make direct observations
collect data quickly and easily, with little intrusion or third-party bias. Naturally, it’s only
effective in small-scale situations.
Data Management
Data management works symbiotically with process management, ensuring that the actions
teams take are informed by the cleanest, most current data available — which in today’s
world means tracking changes and trends in real-time. Below is a deeper look at the practice,
its benefits and challenges, and best practices for helping your organization get the most out
of its business intelligence.
2. Data stewardship: A data steward does not develop information management policies but
rather deploys and enforces them across the enterprise. As the name implies, a data steward
stands watch over enterprise data collection and movement policies, ensuring practices are
implemented and rules are enforced.
3. Data quality management: If a data steward is a kind of digital sheriff, a data quality
manager might be thought of as his court clerk. Quality management is responsible for
combing through collected data for underlying problems like duplicate records, inconsistent
versions, and more. Data quality managers support the defined data management system.
4. Data security: One of the most important aspects of data management today is security.
Though emergent practices like DevSecOps incorporate security considerations at every level
of application development and data exchange, security specialists are still tasked with
encryption management, preventing unauthorized access, guarding against accidental
movement or deletion, and other frontline concerns.
5. Data governance: Data governance sets the law for an enterprise’s state of information. A
data governance framework is like a constitution that clearly outlines policies for the intake,
flow, and protection of institutional information. Data governors oversee their network of
stewards, quality management professionals, security teams, and other people and data
management processes in pursuit of a governance policy that serves a master data
management approach.
6. Big data management: Big data is the catch-all term used to describe gathering,
analyzing, and using massive amounts of digital information to improve operations. In broad
terms, this area of data management specializes in intake, integrity, and storage of the tide of
raw data that other management teams use to improve operations and security or inform
business intelligence.
7. Data warehousing: Information is the building block of modern business. The sheer
volume of information presents an obvious challenge: What do we do with all these blocks?
Data warehouse management provides and oversees the physical and/or cloud-based
infrastructure used to aggregate raw data and analyze it in-depth to produce business insights.
The unique needs of any organization practicing data management may require a blend of
some or all of these approaches. Familiarity with management areas provides data managers
with the background they need to build solutions customized for their environments.
Once data is under management, it can be mined for informational gold: business
intelligence. This helps business users across the organization in a variety of ways, including
the following:
Smart advertising that targets customers according to their interests and interactions
Holistic security that safeguards critical information
Alignment with relevant compliance standards, saving time and money
Machine learning that grows more environmentally aware over time, powering automatic
and continuous improvement
Reduced operating expenses by restricting use to only the necessary storage and compute
power required for optimal performance
The amount of data can be (at least temporarily) overwhelming. It’s hard to overstate
the volume of data that must come under management in a modern business, so, when
developing systems and processes, be ready to think big. Really big. Specialized third-
party services and apps for integrating big data or providing it as a platform are crucial
allies.
Many organizations silo data. The development team may work from one data set, the
sales team from another, operations from another, and so on. A modern data management
system relies on access to all this information to develop modern business intelligence.
Real-time data platform services help stream and share clean information between teams
from a single, trusted source.
The journey from unstructured data to structured data can be steep. Data often
pours into organizations in an unstructured way. Before it can be used to generate
business intelligence, data preparation has to happen: Data must be organized, de-
duplicated, and otherwise cleaned. Data managers often rely on third-party partnerships to
assist with these processes, using tools designed for on-premises, cloud, or hybrid
environments.
Managing the culture is essential to managing data. All of the processes and systems
in the world won’t do you much good if people don’t know how — and perhaps just as
importantly, why — to use them. By making team members aware of the benefits of data
management (and the potential pitfalls of ignoring it) and fostering the skills of using data
correctly, managers engage team members as essential pieces of the information process.
These and other challenges stand between the old way of doing business and initiatives that
harness the power of data for business intelligence. But with proper planning, practices, and
partners, technologies like accelerated machine learning can turn pinch points into gateways
for deeper business insights and better customer experience.
1. Make a plan
Develop and write a data management plan (DMP). This document charts estimated
data usage, accessibility guidelines, archiving approaches, ownership, and more. A
DMP serves as both a reference and a living record and will be revised as
circumstances change.
Additionally, DMPs present the organization’s overarching strategy for data
management to investors, auditors, and other involved parties — which is an
important insight into a company’s preparedness for the rigors of the modern market.
The best DMPs define granular details, including:
Preferred file formats
Naming conventions
Access parameters for various stakeholders
Backup and archiving processes
Defined partners and the terms and services they provide
Thorough documentation
There are online services that can help create DMPs by providing step-by-step
guidance to creating plans from templates.
2. Store your data
Among the granular details mentioned above, a solid data storage approach is central
to good data management. It begins by determining if your storage needs best suit a
data warehouse or a data lake (or both), and whether the company’s data belongs on-
premises or in the cloud.
Then outline a consistent, and consistently enforced, agreement for naming files,
folders, directories, users, and more. This is a foundational piece of data management,
as these parameters will determine how to store all future data, and inconsistencies
will result in errors and incomplete intelligence.
1. Security and backups. Insecure data is dangerous, so security must be considered at
every layer. Some organizations come under special regulatory burdens like HIPAA,
CIPA, GDPR, and others, which add additional security requirements like periodic
audits. When security fails, the backup plan can be the difference between business
life and death. Traditional models called for three copies of all important data: the
original, the locally stored copy, and a remote copy. But emerging cloud models
include decentralized data duplication, with even more backup options available at an
increasingly affordable cost for storage and transfer.
2. Documentation is key. If it’s important, document it. If the entire team splits the
lottery and runs off to Jamaica, thorough, readable documentation outlining security
and backup procedures will give the next team a fighting chance to pick up where
they left off. Without it, knowledge resides exclusively with holders who may or may
not be part of a long-term data management approach.
Data storage needs to be able to change as fast as the technology demands, so any approach
should be flexible and have a reasonable archiving approach to keep costs manageable.
Big data consists of huge amounts of information that cannot be stored or processed using
traditional data storage mechanisms or processing techniques. It generally consists of three
different variations.
i. Structured data (as its name suggests) has a well-defined structure and follows a
consistent order. This kind of information is designed so that it can be easily accessed
and used by a person or computer. Structured data is usually stored in the well-
defined rows and columns of a table (such as a spreadsheet) and databases —
particularly relational database management systems, or RDBMS.
ii. Semi-structured data exhibits a few of the same properties as structured data, but for
the most part, this kind of information has no definite structure and cannot conform to
the formal rules of data models such as an RDBMS.
iii. Unstructured data possesses no consistent structure across its various forms and
does not obey conventional data models’ formal structural rules. In very few
instances, it may have information related to date and time.
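As a small illustration (not from the source) of the difference between structured and semi-structured data, the sketch below loads a fixed-schema table and a set of JSON records whose fields vary; pandas and the sample values are assumptions.

import json
import pandas as pd

# Structured: fixed rows and columns, as in a spreadsheet or RDBMS table.
structured = pd.DataFrame({"customer_id": [1, 2], "amount": [250.0, 99.9]})

# Semi-structured: JSON records whose fields may differ between entries.
raw = '[{"customer_id": 1, "tags": ["vip"]}, {"customer_id": 2}]'
semi_structured = pd.json_normalize(json.loads(raw))

print(structured.dtypes)
print(semi_structured)   # fields absent from a record show up as NaN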
In line with classical definitions of the concept, big data is generally associated with
three core characteristics:
1. Volume: This trait refers to the immense amounts of information generated every second via social media, cell phones, cars, transactions, connected sensors, images, video, and text. Running to terabytes, petabytes, or even zettabytes, these volumes can only be managed by big data technologies.
2. Variety: Big data arrives in many formats and from many sources, ranging from structured records to semi-structured and unstructured text, images, audio, and video.
3. Velocity: Information is streaming into data repositories at a prodigious rate, and this
characteristic alludes to the speed of data accumulation. It also refers to the speed
with which big data can be processed and analyzed to extract the insights and patterns
it contains. These days, that speed is often real-time.
Beyond “the Three Vs,” current descriptions of big data management also include two other
characteristics, namely:
Veracity: This is the degree of reliability and truth that big data has to offer in terms
of its relevance, cleanliness, and accuracy.
Value: Since the primary aim of big data gathering and analysis is to discover insights
that can inform decision-making and other processes, this characteristic explores the
benefit or otherwise that information and analytics can ultimately produce.
When it comes to technology, organizations have many different types of big data
management solutions to choose from. Vendors offer a variety of standalone or multi-
featured big data management tools, and many organizations use multiple tools. Some of the
most common types of big data management capabilities include the following:
Data migration: moving data from one environment to another, such as moving
data from in-house data centres to the cloud
Data enrichment: improving the quality of data by adding new data sets,
correcting small errors or extrapolating new information from raw data
Master data management (MDM): linking critical enterprise data to one master set that serves as the single source of truth for the organization
Data governance: ensuring the availability, usability, integrity and accuracy of
data
Extract transform load (ETL): moving data from an existing repository into a
database or data warehouse.
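The following is a minimal extract-transform-load (ETL) sketch, offered only as an illustration of the capability listed above; the file name, column names, and SQLite destination are assumptions rather than anything prescribed by the source.

import sqlite3
import pandas as pd

# Extract: read raw data from a flat file (hypothetical orders.csv).
orders = pd.read_csv("orders.csv")

# Transform: cleanse and aggregate.
orders = orders.drop_duplicates()
orders["amount"] = orders["amount"].fillna(0)
daily = orders.groupby("order_date", as_index=False)["amount"].sum()

# Load: write the result into a destination database table.
with sqlite3.connect("warehouse.db") as conn:
    daily.to_sql("daily_sales", conn, if_exists="replace", index=False)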
Organization/Sources of Data
Data organization is the practice of categorizing and classifying data to make it more usable.
Similar to a file folder, where we keep important documents, you’ll need to arrange your data
in the most logical and orderly fashion, so you — and anyone else who accesses it — can
easily find what they’re looking for.
Network data. This type of data is gathered on all kinds of networks, including social
media, information and technological networks, the Internet and mobile networks, etc.
Real-time data. They are produced on online streaming media, such as YouTube,
Twitch, Skype, or Netflix.
Transactional data. They are gathered when a user makes an online purchase
(information on the product, time of purchase, payment methods, etc.)
Geographic data. Location data for humans, vehicles, buildings, natural reserves, and other objects is continuously supplied by satellites.
Natural language data. These data are gathered mostly from voice searches that can
be made on different devices accessing the Internet.
Time series data. This type of data is related to the observation of trends and
phenomena taking place at this very moment and over a period of time, for instance,
global temperatures, mortality rates, pollution levels, etc.
Linked data. These data are based on web technologies such as HTTP, RDF, SPARQL, and URIs, and are meant to enable semantic connections between various databases so that computers can read them and perform semantic queries correctly.
There are different ways to collect big data from users. These are the most popular ones.
1. Asking for it
The majority of firms prefer asking users directly to share their personal information. Users provide these data when creating website accounts or buying online. The minimum information collected usually includes a username and an email address, but some profiles require more details.
2. Cookies and Web Beacons
Cookies and web beacons are two widely used methods of gathering data on users, namely which web pages they visit and when. They provide basic statistics about how a website is used and are mainly intended to personalize your experience with a particular web source rather than to compromise your privacy.
3. Email tracking
Email trackers are meant to give more information on the user actions in the mailbox.
In particular, an email tracker allows detecting when an email was opened. Both
Google and Yahoo use this method to learn their users’ behavioural patterns and
provide personalized advertising.
By tracking data quality, a business can pinpoint potential issues harming quality, and ensure
that shared data is fit to be used for a given purpose.
When collected data fails to meet the company expectations of accuracy, validity,
completeness, and consistency, it can have massive negative impacts on customer service,
employee productivity, and key strategies.
Quality data is key to making accurate, informed decisions. And while all data has some level
of “quality,” a variety of characteristics and factors determines the degree of data quality
(high-quality versus low-quality). Furthermore, different data quality characteristics will
likely be more important to various stakeholders across the organization.
Popular data quality characteristics and dimensions include accuracy, completeness, consistency, validity, timeliness, and uniqueness.
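A quick, illustrative way to check some of these dimensions on a dataset is sketched below; the file name, columns, and validity rule are hypothetical.

import pandas as pd

df = pd.read_csv("customers.csv")                             # hypothetical dataset

completeness = 1 - df.isna().mean()                           # non-missing share per column
duplicate_ids = df.duplicated(subset=["customer_id"]).sum()   # uniqueness check
invalid_ages = ((df["age"] < 0) | (df["age"] > 120)).sum()    # validity rule

print(completeness)
print("duplicate customer_ids:", duplicate_ids)
print("out-of-range ages:", invalid_ages)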
The concept of missing data is implied in the name: it is data that is not captured for a variable in the observation in question. Missing data reduces the statistical power of the analysis and can distort the validity of the results.
Fortunately, there are proven techniques to deal with missing data.
Imputation vs. Removing Data
When dealing with missing data, data scientists can use two primary methods to solve
the error: imputation or the removal of data.
The imputation method develops reasonable guesses for the missing data. It is most useful when the percentage of missing data is low. If the proportion of missing data is too high, the imputed results lack the natural variation needed to produce an effective model.
The other option is to remove data. When dealing with data that is missing at random, related
data can be deleted to reduce bias. Removing data may not be the best option if there are not
enough observations to result in a reliable analysis. In some situations, observation of specific
events or factors may be required.
Before deciding which approach to employ, data scientists must understand why the data is
missing.
Missing at Random (MAR)
Missing at Random means the data is missing relative to the observed data; it is not related to the specific missing values. The data is not missing across all observations but only within sub-samples of the data. The missing data can therefore be predicted from the complete observed data.
Missing Completely at Random (MCAR)
In the MCAR situation, the data is missing across all observations regardless of the expected
value or other variables. Data scientists can compare two sets of data, one with missing
observations and one without. Using a t-test, if there is no difference between the two data
sets, the data is characterized as MCAR.
Data may be missing due to test design, failure in the observations or failure in recording
observations. This type of data is seen as MCAR because the reasons for its absence are
external and not related to the value of the observation.
It is typically safe to remove MCAR data because the results will be unbiased. The test may
not be as powerful, but the results will be reliable.
Missing Not at Random (MNAR)
The MNAR category applies when the missing data has a structure to it. In other words, there
appear to be reasons the data is missing. In a survey, perhaps a specific group of people – say
women ages 45 to 55 – did not answer a question. Like MAR, the data cannot be determined
by the observed data, because the missing information is unknown. Data scientists
must model the missing data to develop an unbiased estimate. Simply removing observations
with missing data could result in a model with bias.
Deletion
There are three primary approaches to handling missing data by deletion: listwise deletion, pairwise deletion, and dropping variables.
Listwise
In this method, all data for an observation that has one or more missing values is deleted, and the analysis is run only on observations that have a complete set of data. If the data set is small, this may be the most efficient way to eliminate those cases from the analysis. However, in most cases the data are not missing completely at random (MCAR); deleting the instances with missing observations can then produce biased parameters and estimates and reduce the statistical power of the analysis.
Pairwise
Pairwise deletion assumes data are missing completely at random (MCAR); every case that has data for the variables involved in a given calculation is used, even if other values are missing for that case. Pairwise deletion allows data scientists to use more of the data. However, the resulting statistics may vary because they are based on different subsets of the data, and the results may be impossible to duplicate with a complete set of data.
Dropping Variables
If a variable is missing for more than about 60% of the observations, it may be wise to discard that variable, provided it is insignificant. A short sketch of both deletion approaches follows below.
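A short pandas sketch of the deletion strategies above (listwise deletion and dropping sparsely observed variables); the toy DataFrame and the 60% threshold are illustrative assumptions.

import pandas as pd

df = pd.DataFrame({"a": [1.0, None, 3.0],
                   "b": [4.0, 5.0, None],
                   "c": [None, None, None]})

# Listwise deletion: keep only observations with a complete set of values.
listwise = df.dropna()

# Dropping variables: discard columns missing in more than 60% of observations.
drop_cols = df.columns[df.isna().mean() > 0.6]
reduced = df.drop(columns=drop_cols)

print(listwise)
print(reduced.columns.tolist())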
Imputation
When data is missing, it may make sense to delete data, as mentioned above. However, that
may not be the most effective option. For example, if too much information is discarded, it
may not be possible to complete a reliable analysis. Or there may be insufficient data to
generate a reliable prediction for observations that have missing data.
Instead of deletion, data scientists have multiple options to impute the value of missing data. Depending on why the data are missing, imputation methods can deliver reasonably reliable results. The following are examples of single imputation methods for replacing missing data.
Mean, Median and Mode
This is one of the most common methods of imputing values when dealing with missing data. In cases where there are a small number of missing observations, data scientists can calculate the mean or median of the existing observations. However, when there are many missing values, mean or median results can cause a loss of variation in the data. This method does not use time-series characteristics and does not depend on the relationship between the variables.
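A minimal sketch of mean and median imputation, assuming pandas and scikit-learn are available; the single 'height' column is invented for illustration.

import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"height": [170.0, None, 165.0, 180.0]})

# With pandas: replace missing values with the column mean.
df["height_mean"] = df["height"].fillna(df["height"].mean())

# With scikit-learn: median imputation, convenient inside modeling pipelines.
imputer = SimpleImputer(strategy="median")
df[["height_median"]] = imputer.fit_transform(df[["height"]])

print(df)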
Another option is to use time-series specific methods when appropriate to impute data. There
are four types of time-series data:
No trend or seasonality.
Trend, but no seasonality.
Seasonality, but no trend.
Both trend and seasonality.
The time series methods of imputation assume the adjacent observations will be like the
missing data. These methods work well when that assumption is valid. However, these
methods won’t always produce reasonable results, particularly in the case of strong
seasonality.
Last Observation Carried Forward (LOCF) & Next Observation Carried Backward
(NOCB)
These options are used to analyze longitudinal repeated measures data, in which follow-up
observations may be missing. In this method, every missing value is replaced with the last
observed value. Longitudinal data track the same instance at different points along a timeline.
This method is easy to understand and implement. However, this method may introduce bias
when data has a visible trend. It assumes the value is unchanged by the missing data.
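In pandas, LOCF and NOCB correspond to forward fill and backward fill; the daily series below is an invented example.

import pandas as pd

s = pd.Series([1.0, None, None, 4.0],
              index=pd.date_range("2023-01-01", periods=4, freq="D"))

locf = s.ffill()   # Last Observation Carried Forward
nocb = s.bfill()   # Next Observation Carried Backward

print(locf)
print(nocb)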
Linear Interpolation
Linear interpolation is often used to approximate a value of some function by using two
known values of that function at other points. This formula can also be understood as a
weighted average. The weights are inversely related to the distance from the end points to the
unknown point. The closer point has more influence than the farther point.
When dealing with missing data, you should use this method in a time series that exhibits a
trend line, but it’s not appropriate for seasonal data.
Seasonal Adjustment with Linear Interpolation
When dealing with data that exhibits both trend and seasonality characteristics, use seasonal adjustment with linear interpolation. First you would perform the seasonal adjustment by
computing a centered moving average or taking the average of multiple averages – say, two
one-year averages – that are offset by one period relative to another. You can then complete
data smoothing with linear interpolation as discussed above.
Multiple Imputation
Multiple imputation is considered a good approach for data sets with a large amount of missing data. Instead of substituting a single value for each missing data point, the missing values are replaced with values that reflect the natural variability and uncertainty of the true values. Using the imputed data, the process is repeated to create multiple imputed data sets. Each set is then analyzed using standard analytical procedures, and the multiple analysis results are combined to produce an overall result.
The various imputations incorporate natural variability into the missing values, which yields a valid statistical inference. Multiple imputation can produce statistically valid results even when there is a small sample size or a large amount of missing data.
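One way (an assumption, since the source names no library) to generate several plausible completed data sets in this spirit is scikit-learn's IterativeImputer with posterior sampling, as sketched below.

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0], [np.nan, 8.0]])

# Different seeds give different plausible imputations; each completed data
# set can be analyzed separately and the results pooled.
imputations = [
    IterativeImputer(sample_posterior=True, random_state=seed).fit_transform(X)
    for seed in range(5)
]
print(imputations[0])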
K Nearest Neighbours
In this method, data scientists choose a distance metric and a number of neighbours k, and the values of the k nearest neighbours are combined to impute an estimate. KNN can impute the most frequent value among the neighbours for categorical data and the mean of the nearest neighbours for numerical data.
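A brief sketch of KNN imputation with scikit-learn's KNNImputer; the number of neighbours, the distance weighting, and the toy matrix are illustrative choices.

import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0]])

imputer = KNNImputer(n_neighbors=2, weights="distance")  # nan-aware Euclidean distance by default
print(imputer.fit_transform(X))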
Data Visualization
Data visualization is the practice of translating information into a visual context, such
as a map or graph, to make data easier for the human brain to understand and pull
insights from.
The main goal of data visualization is to make it easier to identify patterns, trends and
outliers in large data sets. The term is often used interchangeably with others,
including information graphics, information visualization and statistical graphics.
Data visualization is one of the steps of the data science process, which states that
after data has been collected, processed and modelled, it must be visualized for
conclusions to be made.
Data visualization is also an element of the broader data presentation architecture
(DPA) discipline, which aims to identify, locate, manipulate, format and deliver data
in the most efficient way possible.
Data visualization is important for almost every career. It can be used by teachers to
display student test results, by computer scientists exploring advancements
in artificial intelligence (AI) or by executives looking to share information with
stakeholders.
It also plays an important role in big data projects. As businesses accumulated
massive collections of data during the early years of the big data trend, they needed a
way to quickly and easily get an overview of their data. Visualization tools were a
natural fit.
Visualization is central to advanced analytics for similar reasons. When a data
scientist is writing advanced predictive analytics or machine learning (ML)
algorithms, it becomes important to visualize the outputs to monitor results and ensure
that models are performing as intended. This is because visualizations of complex
algorithms are generally easier to interpret than numerical outputs.
Why is data visualization important?
Data visualization provides a quick and effective way to communicate information in a
universal manner using visual information. The practice can also help businesses identify
which factors affect customer behaviour; pinpoint areas that need to be improved or need
more attention; make data more memorable for stakeholders; understand when and where to
place specific products; and predict sales volumes.
Other benefits of data visualization include the following:
the ability to absorb information quickly, improve insights and make faster
decisions;
an increased understanding of the next steps that must be taken to improve the
organization;
an improved ability to maintain the audience's interest with information they can
understand;
an easy distribution of information that increases the opportunity to share insights
with everyone involved;
a reduced dependence on data scientists, since data becomes more accessible and
understandable; and
an increased ability to act on findings quickly and, therefore, achieve success
with greater speed and fewer mistakes.
Data visualization and big data
o The increased popularity of big data and data analysis projects has made visualization
more important than ever.
o Companies are increasingly using machine learning to gather massive amounts of data
that can be difficult and slow to sort through, comprehend and explain.
o Visualization offers a means to speed this up and present information to business
owners and stakeholders in ways they can understand.
o Big data visualization often goes beyond the typical techniques used in normal
visualization, such as pie charts, histograms and corporate graphs. It instead uses
more complex representations, such as heat maps and fever charts.
o Big data visualization requires powerful computer systems to collect raw data, process
it and turn it into graphical representations that humans can use to quickly draw
insights.
Examples of data visualization
In the early days of visualization, the most common visualization technique was using
a Microsoft Excel spreadsheet to transform the information into a table, bar graph or pie
chart. While these visualization methods are still commonly used, more intricate techniques
are now available, including the following:
infographics
bubble clouds
bullet graphs
heat maps
fever charts
time series charts
Line charts. This is one of the most basic and common techniques used. Line charts display
how variables can change over time.
Area charts. This visualization method is a variation of a line chart; it displays multiple
values in a time series -- or a sequence of data collected at consecutive, equally spaced points
in time.
Scatter plots. This technique displays the relationship between two variables. A scatter
plot takes the form of an x- and y-axis with dots to represent data points.
Tree maps. This method shows hierarchical data in a nested format. The size of the
rectangles used for each category is proportional to its percentage of the whole. Treemaps are
best used when multiple categories are present, and the goal is to compare different parts of a
whole.
Population pyramids. This technique uses a stacked bar graph to display the complex social
narrative of a population. It is best used when trying to display the distribution of a
population.
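As a brief illustration of two of the techniques described above, the sketch below draws a line chart and a scatter plot with matplotlib; the monthly figures are invented.

import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr"]
sales = [120, 135, 128, 150]
ad_spend = [10, 14, 12, 18]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.plot(months, sales, marker="o")        # line chart: change over time
ax1.set_title("Monthly sales")
ax2.scatter(ad_spend, sales)               # scatter plot: relationship between two variables
ax2.set_xlabel("Ad spend")
ax2.set_ylabel("Sales")
plt.tight_layout()
plt.show()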
Sales and Marketing: Research from the media agency Magna predicts that half of all global
advertising dollars will be spent online by 2020. As a result, marketing teams must pay close
attention to their sources of web traffic and how their web properties generate revenue. Data
visualization makes it easy to see traffic trends over time as a result of marketing efforts.
Politics: A common use of data visualization in politics is a geographic map that displays the
party each state or district voted for.
Data visualization tools can be used in a variety of ways. The most common use today is as a business intelligence (BI) reporting tool. Users can set up visualization tools to generate automatic dashboards that track company performance across key performance indicators (KPIs) and visually interpret the results.
The generated images may also include interactive capabilities, enabling users to manipulate
them or look more closely into the data for questioning and analysis. Indicators designed to
alert users when data has been updated or when predefined conditions occur can also be
integrated.
Many business departments implement data visualization software to track their own
initiatives. For example, a marketing team might implement the software to monitor the
performance of an email campaign, tracking metrics like open rate, click-through rate and
conversion rate.
As data visualization vendors extend the functionality of these tools, they are increasingly
being used as front ends for more sophisticated big data environments. In this setting, data
visualization software helps data engineers and scientists keep track of data sources and do
basic exploratory analysis of data sets prior to or after more detailed advanced analyses.
The biggest names in the big data tools marketplace include Microsoft, IBM, SAP and
SAS.
Some other vendors offer specialized big data visualization software; popular names in this
market include Tableau, Qlik and Tibco.
While Microsoft Excel continues to be a popular tool for data visualization, others have
been created that provide more sophisticated abilities:
IBM Cognos Analytics
Qlik Sense and QlikView
Microsoft Power BI
Oracle Visual Analyzer
SAP Lumira
SAS Visual Analytics
Tibco Spotfire
Zoho Analytics
D3.js
Jupyter
MicroStrategy
Google Charts
Data Classification
Data classification involves the use of tags and labels to define the data type, its confidentiality, and its integrity. Data is commonly classified into three risk levels, which are widely treated as the industry standard:
Low risk: If data is public and it’s not easy to permanently lose (e.g. recovery is easy), this
data collection and the systems surrounding it are likely a lower risk than others.
Moderate risk: Essentially, this is data that isn't public or is used internally (by your organization and/or partners). However, it is not critical enough to operations, or sensitive enough, to be "high risk." Proprietary operating procedures, cost of goods, and some company documentation may fall into the moderate category.
High risk: Anything remotely sensitive or crucial to operational security goes into the high risk category, as do pieces of data that are extremely hard to recover if lost. All confidential, sensitive, and necessary data falls into the high risk category.
While we’ve looked at mapping data out by type, you should also look to segment your
organization’s data in terms of the level of sensitivity – high, moderate, or low.
The following shows common examples of organizational data which may be classified into
each sensitivity level:
High:
o Personally identifiable information (PII)
o Credit card details (PCI)
o Intellectual property (IP)
o Protected healthcare information (including HIPAA regulated data)
o Financial information
o Employee records
o ITAR materials
o Internal correspondence including confidential data
Moderate:
Low:
o Public websites
o Public directory data
o Publicly available research
o Press releases
o Job advertisements
o Marketing materials
Data Science Project Life Cycle :
Data Science is a multidisciplinary field that uses scientific methods to extract insights from
structured and unstructured data. Data science is such a huge field and concept that’s often
intermingled with other disciplines, but generally, DS unifies statistics, data analysis,
machine learning, and related fields.
Data Science life cycle provides the structure to the development of a data science project.
The lifecycle outlines the major steps, from start to finish, that projects usually follow. Now,
there are various approaches to managing DS projects, amongst which are Cross-industry
standard process for data mining (aka CRISP-DM), process of knowledge discovery in
databases (aka KDD), any proprietary-based custom procedures conjured up by a company,
and a few other simplified processes.
CRISP-DM
CRISP-DM is an open standard process model that describes common approaches used by
data mining scientists. In 2015, it was refined and extended by IBM, which released a new
methodology called Analytics Solutions Unified Method for Data Mining/Predictive
Analytics (aka ASUM-DM).
Suppose, we have a standard DS project (without any industry-specific peculiarities), then the
lifecycle would typically include:
Business understanding
Data acquisition and understanding
Modelling
Deployment
Customer acceptance
The DS project life cycle is an iterative process of research and discovery that provides
guidance on the tasks needed to use predictive models. The goal of this process is to move a
DS project to an engagement end-point by providing means for easier and clearer
communication between teams and customers with a well-defined set of artifacts and
standardized templates to homogenize procedures and avoid misunderstandings.
Business understanding
Before you even embark on a DS project, you need to understand the problem you’re trying
to solve and define the central objectives of your project by identifying the variables to
predict.
Goals:
Identify key variables that will serve as model targets and serve as the metrics for
defining the success of the project
Identify data sources that the business has already access to or need to obtain such
access
Guidelines:
Work with customers and stakeholders to define business problems and formulate questions
that data science needs to answer.
The goal here is to identify the key business variables (aka model targets) that your analysis
needs to predict and the project’s success would be assessed against. For example, the sales
forecasts. This is what needs to be predicted, and at the end of your project, you’ll compare
your predictions to the actual volume of sales.
Define project goals by asking specific, sharp questions that data science techniques can answer.
Business Requirements
The business requirements the analyst creates for a project state what the business needs at a high level. While business requirements are largely textual, they may also include graphs, models, or any combination of these that best serves the project. Effective business requirements require strong strategic thinking, significant input from a project's business owners, and the ability to clearly state the needs of a project at a high level. Good business requirements are:
Verifiable. Just because business requirements state business needs rather than technical specifications does not mean they cannot be demonstrated. Verifiable requirements are specific and objective. A quality control expert must be able to check, for example, that the system accommodates the debit, credit, and PayPal methods specified in the business requirements. He or she could not do so if the requirements were vaguer, e.g., "The system will accommodate appropriate payment methods" ("appropriate" is subject to interpretation).
Unambiguous, stating precisely what problem is being solved. For example, “This
project will be deemed successful if ticket sales increase sufficiently,” is probably too
vague for all stakeholders to agree on its meaning at the project’s end.
Comprehensive, covering every aspect of the business need. Business requirements
are indeed big picture, but they are very thorough big picture. In the aforementioned
example, if the analyst assumed that the developers would know to design a system
that could accommodate many times the number of customers the theatre chain had
seen at one time in the past, but did not explicitly state so in the requirements, the
developers might design a system that could accommodate only 10,000 patrons at any
one time without performance issues.
Remember that business requirements answer the what, not the how, but they are meticulously thorough in describing those whats. No business point is overlooked. At a project's
end, the business requirements should serve as a methodical record of the initial business
problem and the scope of its solution.
Understanding the project objectives and requirements from a domain perspective and then
converting this knowledge into a data science problem definition with a preliminary plan
designed to achieve the objectives. Data science projects are often structured around the
specific needs of an industry sector (as shown below) or even tailored and built for a single
organization. A successful data science project starts from a well defined question or need.
Data Acquisition
o Data recording
o Data storing
o Real-time data visualization
o Post-recording data review
o Data analysis using various mathematical and statistical calculations
o Report generation
Data Preparation
Data preparation is about constructing a dataset from one or more data sources to be used for exploration and modeling. It is a solid practice to start with an initial dataset, to get familiar with the data, discover first insights, and develop a good understanding of any possible data quality issues. Data preparation is often a time-consuming process and is heavily prone to errors. The old saying "garbage in, garbage out" is particularly applicable to data science projects where the gathered data contains many invalid, out-of-range, and missing values. Analyzing data that has not been carefully screened for such problems can produce highly misleading results, so the success of data science projects depends heavily on the quality of the prepared data.
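A small, illustrative data preparation sketch in pandas (file names, columns, and screening rules are assumptions): combining sources, removing duplicates, and screening invalid values before modeling.

import pandas as pd

customers = pd.read_csv("customers.csv")
orders = pd.read_csv("orders.csv")

dataset = orders.merge(customers, on="customer_id", how="left")
dataset = dataset.drop_duplicates()
dataset = dataset[dataset["amount"] >= 0]                       # drop out-of-range values
dataset["order_date"] = pd.to_datetime(dataset["order_date"], errors="coerce")

print(dataset.isna().sum())                                     # remaining quality issues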
Data
Data is information typically the results of measurement (numerical) or counting
(categorical). Variables serve as placeholders for data. There are two types of
variables, numerical and categorical.
A numerical or continuous variable is one that can accept any value within a finite or infinite interval (e.g., height, weight, temperature, blood glucose). There are two types of numerical
data, interval and ratio. Data on an interval scale can be added and subtracted but cannot be
meaningfully multiplied or divided because there is no true zero. For example, we cannot say that
one day is twice as hot as another day. On the other hand, data on a ratio scale has true zero and can
be added, subtracted, multiplied or divided (e.g., weight).
A categorical or discrete variable is one that can accept two or more values (categories). There
are two types of categorical data, nominal and ordinal. Nominal data does not have an intrinsic
ordering in the categories. For example, "gender" with two categories, male and female. In contrast,
ordinal data does have an intrinsic ordering in the categories. For example, "level of energy" with
three orderly categories (low, medium and high).
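The nominal/ordinal distinction can be made explicit in code; the sketch below uses pandas categoricals with the same example categories (gender, level of energy) and is illustrative only.

import pandas as pd

gender = pd.Categorical(["male", "female", "female"])             # nominal: no intrinsic order

energy = pd.Categorical(["low", "high", "medium"],
                        categories=["low", "medium", "high"],
                        ordered=True)                             # ordinal: low < medium < high

print(energy.min(), energy.max())   # ordering is meaningful only for ordinal data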
Dataset
Dataset is a collection of data, usually presented in a tabular form. Each column represents a
particular variable, and each row corresponds to a given member of the data.
In predictive modeling, predictors or attributes are the input variables and target or class
attribute is the output variable whose value is determined by the values of the predictors and
function of the predictive model.
Database
A database collects, stores, and manages information so users can retrieve, add, update, or remove such information. It presents information in tables with rows and columns. A table is referred to as
a relation in the sense that it is a collection of objects of the same type (rows). Data in a table can
be related according to common keys or concepts, and the ability to retrieve related data from
related tables is the basis for the term relational database. A Database Management System
(DBMS) handles the way data is stored, maintained, and retrieved. Most data science toolboxes
connect to databases through ODBC (Open Database Connectivity) or JDBC (Java Database
Connectivity) interfaces.
SQL (Structured Query Language) is a database computer language for managing and
manipulating data in relational database management systems (RDBMS).
SQL Data Definition Language (DDL) permits database tables to be created, altered or deleted. We
can also define indexes (keys), specify links between tables, and impose constraints between
database tables.
SQL Data Manipulation Language (DML) is a language which enables users to access and
manipulate data.
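To make the DDL/DML distinction concrete, here is a minimal sketch using Python's built-in sqlite3 module; the table and column names are illustrative, not taken from the source.

import sqlite3

with sqlite3.connect(":memory:") as conn:
    cur = conn.cursor()

    # DDL: create a table with a primary key.
    cur.execute("""CREATE TABLE customer (
                       id   INTEGER PRIMARY KEY,
                       name TEXT NOT NULL)""")

    # DML: insert, update, and query rows.
    cur.execute("INSERT INTO customer (id, name) VALUES (?, ?)", (1, "Asha"))
    cur.execute("UPDATE customer SET name = ? WHERE id = ?", ("Asha R.", 1))
    print(cur.execute("SELECT id, name FROM customer").fetchall())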
Data extraction provides the ability to extract data from a variety of data sources, such as
flat files, relational databases, streaming data, XML files, and ODBC/JDBC data sources.
Data transformation provides the ability to cleanse, convert, aggregate, merge, and split
data.
Data loading provides the ability to load data into destination databases via update, insert
or delete statements, or in bulk.
Data Exploration
Data Exploration is about describing the data by means of statistical and visualization
techniques. We explore data in order to bring important aspects of that data into focus for
further analysis.
1. Univariate Analysis
2. Bivariate Analysis
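The sketch below illustrates these two kinds of exploration on a tiny, invented DataFrame: a univariate summary of one variable and a bivariate correlation between two.

import pandas as pd

df = pd.DataFrame({"age": [23, 35, 31, 45], "income": [30, 52, 48, 75]})

# Univariate analysis: describe one variable at a time.
print(df["age"].describe())

# Bivariate analysis: relationship between two variables.
print(df[["age", "income"]].corr())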
Modeling
Predictive modeling is the process by which a model is created to predict an outcome. If the outcome
is categorical it is called classification and if the outcome is numerical it is called regression.
Descriptive modeling or clustering is the assignment of observations into clusters so that observations
in the same cluster are similar. Finally, association rules can find interesting associations amongst
observations.
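A compact illustration of the classification/regression distinction with scikit-learn; the iris data set and the particular estimators are assumptions made for the example, not methods prescribed by the source.

from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression, LogisticRegression

X, y = load_iris(return_X_y=True)

# Categorical outcome (species) -> classification.
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Numerical outcome (petal width from the other measurements) -> regression.
reg = LinearRegression().fit(X[:, :3], X[:, 3])

print(clf.predict(X[:2]), reg.predict(X[:2, :3]))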
Model Evaluation
Model Evaluation is an integral part of the model development process. It helps to find the best
model that represents our data and how well the chosen model will work in the future. Evaluating
model performance with the data used for training is not acceptable in data science because it can
easily generate overoptimistic and over fitted models. There are two methods of evaluating models in
data science, Hold-Out and Cross-Validation. To avoid overfitting, both methods use a test set (not
seen by the model) to evaluate model performance.
Hold-Out
In this method, the (usually large) dataset is randomly divided into three subsets:
1. Training set is a subset of the dataset used to build predictive models.
2. Validation set is a subset of the dataset used to assess the performance of model built in the
training phase. It provides a test platform for fine tuning model's parameters and selecting the
best-performing model. Not all modeling algorithms need a validation set.
3. Test set (unseen examples) is a subset of the dataset used to assess the likely future performance of a model. If a model fits the training set much better than it fits the test set, overfitting is probably the cause.
Cross-Validation
When only a limited amount of data is available, to achieve an unbiased estimate of the model
performance we use k-fold cross-validation. In k-fold cross-validation, we divide the data
into k subsets of equal size. We build models k times, each time leaving out one of the subsets from
training and use it as the test set. If k equals the sample size, this is called "leave-one-out".
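A short sketch of both evaluation methods with scikit-learn (the data set, model, split ratio, and k are illustrative choices):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Hold-out: fit on a training subset, evaluate on unseen test examples.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
print("hold-out accuracy:", model.fit(X_train, y_train).score(X_test, y_test))

# k-fold cross-validation with k = 5.
print("cv accuracy:", cross_val_score(model, X, y, cv=5).mean())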
Model Deployment
The concept of deployment in data science refers to the application of a model for prediction on new data. Building a model is generally not the end of the project. Even if the purpose of the model is
to increase knowledge of the data, the knowledge gained will need to be organized and presented in a
way that the customer can use it. Depending on the requirements, the deployment phase can be as
simple as generating a report or as complex as implementing a repeatable data science process. In
many cases, it will be the customer, not the data analyst, who will carry out the deployment steps. For
example, a credit card company may want to deploy a trained model or set of models (e.g., neural
networks, meta-learner) to quickly identify transactions, which have a high probability of being
fraudulent. However, even if the analyst will not carry out the deployment effort it is important for
the customer to understand up front what actions will need to be carried out in order to actually make
use of the created models.
An example of using a data mining tool (Orange) to deploy a decision tree model.
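As another, purely illustrative sketch of deployment, a trained model can be persisted and later loaded to score new data; joblib and the file name are assumptions, not part of the source.

import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
joblib.dump(LogisticRegression(max_iter=1000).fit(X, y), "model.joblib")

# Later, in the production system:
model = joblib.load("model.joblib")
print(model.predict(X[:5]))   # score new, unseen records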
Operations Research
Generally, OR is concerned with obtaining extreme values of some real-world objective functions;
maximum (profit, performance, utility, or yield), minimum (loss, risk, distance, or cost). It
incorporates techniques from mathematical modelling, optimization, and statistical analysis while
emphasising the human-technology interface. However, one of the difficulties in defining OR is that there is a lot of overlap in scientific terminology, and sometimes terms become extremely popular, affecting the landscape of the terminology, e.g. the popularity of vague, broad terms such as AI and Big Data, which work well for marketing but do nothing for the discussion of the research. OR is therefore best illustrated in terms of its related fields, subfields, and the problems it addresses.
There are many process optimization techniques you can use to get started. Here are two examples:
Process mining: This is a group of techniques with a data science approach. Data is taken
from event logs to analyze what team members are doing in a company and what steps they
take to complete a task. This data can then be turned into insights, helping project managers
to spot any roadblocks and optimize their processes.
PDSA: PDSA is an acronym for Plan, Do, Study, Act. It uses a four-stage cyclical model to
improve quality and optimize business processes. Project managers will start by mapping
what achievements they want to accomplish. Next, they will test proposed changes on a small
scale. After this, they will study the results and determine if these changes were effective. If
so, they will implement the changes across the entire business process.
It is good practice for a project manager to take some time to research various process optimization methods before deciding which one is most suited to their business.
Data Science is the deep study of large quantities of data, which involves extracting meaningful insights from raw, structured, and unstructured data. Extracting meaningful insights from large amounts of data requires processing, which is done using statistical techniques and algorithms, scientific methods, different technologies, and so on. Data Science uses various tools and techniques to extract meaningful information from raw data, and it is sometimes described as the future of Artificial Intelligence.
1. In Search Engines
One of the most visible applications of Data Science is in search engines. When we search for something on the internet, we mostly use search engines such as Google, Yahoo, Bing, or DuckDuckGo, and Data Science is used to return and rank results faster.
For example, when we search for "Data Structure and algorithm courses", a GeeksforGeeks course is often among the first links returned. This happens because the GeeksforGeeks website is visited most often for information on data structure courses and computer-related subjects. This analysis is done using Data Science, which surfaces the most visited web links.
2. In Transport
Data Science has also entered the transport field, for example through driverless cars. With the help of driverless cars, it becomes easier to reduce the number of accidents.
For example, in driverless cars the training data is fed into the algorithm, and with the help of Data Science techniques the data is analyzed to determine, for instance, the speed limit on a highway, busy street, or narrow road, and how to handle different situations while driving.
3. In Finance
Data Science plays a key role in financial industries, which constantly face problems of fraud and risk of losses. They therefore need to automate risk-of-loss analysis in order to carry out strategic decisions for the company. Financial industries also use Data Science analytics tools to predict the future, which allows companies to predict customer lifetime value and stock market moves.
For example, in the stock market, Data Science is used to examine past behaviour with past data in order to estimate future outcomes. Data is analyzed in such a way that it becomes possible to predict future stock prices over a set time frame.
4. In E-Commerce
E-commerce websites like Amazon, Flipkart, etc. use Data Science to create a better user experience with personalized recommendations.
For example, when we search for something on an e-commerce website, we get suggestions similar to our past choices, and we also get recommendations based on the most bought, most rated, and most searched products. This is all done with the help of Data Science.
5. In Health Care
In the healthcare industry, Data Science acts as a boon. Data Science is used for:
Detecting Tumor.
Drug discoveries.
Medical Image Analysis.
Virtual Medical Bots.
Genetics and Genomics.
Predictive Modeling for Diagnosis etc.
6. Image Recognition
Data Science is also used in image recognition. For example, when we upload a photo with a friend on Facebook, Facebook suggests tagging who is in the picture. This is done with the help of machine learning and Data Science. When an image is recognized, data analysis is performed on the user's Facebook friends, and if a face in the picture matches another profile, Facebook suggests auto-tagging.
7. Targeted Recommendation
Targeted recommendation is one of the most important applications of Data Science. Whatever a user searches for on the internet, he or she will then see related posts everywhere. For example, suppose I search Google for a mobile phone and then decide to buy it offline. Data Science helps the companies that pay for advertisements for that phone, so everywhere on the internet, on social media, on websites, and in apps, I will see recommendations for the mobile phone I searched for, which nudges me to buy it online.
8. Airline Route Planning
With the help of Data Science, the airline sector is also improving; for example, it becomes easier to predict flight delays. Data Science also helps decide whether to fly directly to the destination or to take a halt in between, for example a flight from Delhi to the U.S.A. can be direct or can stop over before reaching the destination.
9. Data Science in Gaming
In most games where a user plays against a computer opponent, data science concepts are used together with machine learning, so that with the help of past data the computer improves its performance. Many games, such as Chess and EA Sports titles, use Data Science concepts.
10. Medicine and Drug Development
The process of creating a medicine is very difficult and time-consuming and has to be done with full discipline, because it is a matter of someone's life. Without Data Science, it takes a great deal of time, resources, and money to develop a new medicine or drug; with Data Science it becomes easier, because the probability of success can be estimated from biological data and factors. Data science based algorithms can forecast how a compound will react in the human body before lab experiments.
11. In Delivery Logistics
Various logistics companies like DHL, FedEx, etc. make use of Data Science. Data Science helps these companies find the best routes for shipping their products, the best times for delivery, the best modes of transport to reach the destination, and so on.
12. Autocomplete
The autocomplete feature is an important application of Data Science: the user types only a few letters or words and the rest of the line is completed automatically. In Gmail, for example, when we write a formal mail, the data science based autocomplete feature offers an efficient choice to complete the whole line. Autocomplete is also widely used in search engines, social media, and various other apps.