Unit 2 BI & Data Science (1)
Unit 2 BI & Data Science (1)
Data
Data Collection
Data collection is the process of acquiring, extracting, and storing large volumes of data, which may be structured or unstructured (text, video, audio, XML files, records, image files, and so on), for use in later stages of data analysis. In big data analysis, data collection is the initial step, carried out before any patterns or useful information can be found in the data. The data to be analyzed must be collected from valid sources.
Collected data is broadly divided into two types:
1. Primary data
2. Secondary data
1. Primary data:
Raw, original data extracted directly from official sources is known as primary data. This type of data is collected directly through techniques such as questionnaires, interviews, and surveys. The data collected must match the demands and requirements of the target audience on which the analysis is performed; otherwise it becomes a burden during data processing.
A few methods of collecting primary data:
Interview method:
Here, data is collected by interviewing the target audience. The person who asks the questions is called the interviewer and the person who answers is the interviewee. Basic business- or product-related questions are asked, the answers are recorded as notes, audio, or video, and this data is stored for processing. Interviews can be structured or unstructured, and can be conducted as personal or formal interviews by telephone, face to face, email, and so on.
Survey method:
In the survey method, a list of relevant questions is asked and the answers are noted down as text, audio, or video. Surveys can be conducted both online and offline, for example through website forms and email, and the responses are then stored for analysis. Examples include online surveys and polls on social media.
Observation method:
In the observation method, the researcher keenly observes the behaviour and practices of the target audience using a data collection tool and stores the observed data as text, audio, video, or another raw format. The researcher may also pose a few questions to the participants. For example, a group of customers can be observed for their behaviour towards particular products, and the data obtained is then sent for processing.
Projective Technique
Projective data gathering is an indirect interview, used when potential respondents know why
they're being asked questions and hesitate to answer. For instance, someone may be reluctant
to answer questions about their phone service if a cell phone carrier representative poses the
questions. With projective data gathering, the interviewees get an incomplete question, and
they must fill in the rest, using their opinions, feelings, and attitudes.
Delphi Technique.
The Oracle at Delphi, according to Greek mythology, was the high priestess of Apollo’s
temple, who gave advice, prophecies, and counsel. In the realm of data collection, researchers
use the Delphi technique by gathering information from a panel of experts. Each expert
answers questions in their field of specialty, and the replies are consolidated into a single
opinion.
Focus Groups.
Focus groups, like interviews, are a commonly used technique. The group consists of
anywhere from a half-dozen to a dozen people, led by a moderator, brought together to
discuss the issue.
Questionnaires.
Questionnaires are a simple, straightforward data collection method. Respondents get a series of questions, either open-ended or closed-ended, related to the matter at hand.
Experimental method:
The experimental method is the process of collecting data by performing experiments, research, and investigation. The most frequently used experimental designs are CRD, RBD, LSD, and FD.
CRD - A Completely Randomized Design is a simple experimental design used in data analytics that is based on randomization and replication. It is mostly used for comparing treatments in an experiment.
RBD - A Randomized Block Design is an experimental design in which the experiment is divided into small units called blocks. Random experiments are performed within each block and the results are analyzed using a technique known as analysis of variance (ANOVA); a small sketch of such an analysis follows below. RBD originated in the agricultural sector.
LSD - A Latin Square Design is similar to CRD and RBD but arranges the experiment in rows and columns. It is an N x N square with an equal number of rows and columns, in which each letter occurs exactly once in each row and each column, so differences can be found with fewer errors in the experiment. A Sudoku puzzle is a familiar example of a Latin square.
FD - A Factorial Design is an experimental design in which each experiment involves two or more factors, each with several possible values (levels), and trials are performed over the combinations of those factor levels.
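To make the RBD/ANOVA idea concrete, here is a minimal sketch in Python (an illustration only, not part of the source); the blocks, treatments, and yield values are invented, and pandas and statsmodels are assumed to be available.

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Hypothetical randomized block design: 3 blocks x 2 treatments.
data = pd.DataFrame({
    "block":     ["B1", "B1", "B2", "B2", "B3", "B3"],
    "treatment": ["T1", "T2", "T1", "T2", "T1", "T2"],
    "yield_":    [20.1, 23.4, 19.8, 22.9, 21.0, 24.2],
})

# Model the response with block and treatment effects, then run ANOVA.
model = ols("yield_ ~ C(block) + C(treatment)", data=data).fit()
print(sm.stats.anova_lm(model, typ=2))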
2. Secondary data:
Secondary data is data that has already been collected and is reused for another valid purpose. This type of data is derived from previously recorded primary data, and it comes from two types of sources: internal and external.
i. Internal source:
These data can easily be found within the organization, such as market records, sales records, transactions, customer data, accounting resources, etc. The cost and time needed to obtain data from internal sources are low. Examples include:
Financial Statements
Sales Reports
Retailer/Distributor/Dealer Feedback
Customer Personal Information (e.g., name, address, age, contact info)
ii. External source:
These data are collected from outside the organization, for example:
Business Journals
Government Records (e.g., census, tax records, Social Security info)
Trade/Business Magazines
The internet
The following projective and survey techniques are also commonly used to collect primary data:
1. Word Association.
The researcher gives the respondent a set of words and asks them what comes to mind when
they hear each word.
2. Sentence Completion.
Researchers use sentence completion to understand what kind of ideas the respondent has.
This tool involves giving an incomplete sentence and seeing how the interviewee finishes it.
3. Role-Playing.
Respondents are presented with an imaginary situation and asked how they would act or react
if it was real.
4. In-Person Surveys.
The researcher asks questions in person.
5. Online/Web Surveys.
These surveys are easy to accomplish, but some users may be unwilling to answer truthfully,
if at all.
6. Mobile Surveys.
These surveys take advantage of the increasing proliferation of mobile technology. Mobile
collection surveys rely on mobile devices like tablets or smart phones to conduct surveys via
SMS or mobile apps.
7. Phone Surveys.
No researcher can call thousands of people at once, so they need a third party to handle the
chore. However, many people have call screening and won’t answer.
8. Observation.
Sometimes, the simplest method is the best. Researchers who make direct observations
collect data quickly and easily, with little intrusion or third-party bias. Naturally, it’s only
effective in small-scale situations.
Data Management
Data management works symbiotically with process management, ensuring that the actions
teams take are informed by the cleanest, most current data available — which in today’s
world means tracking changes and trends in real-time. Below is a deeper look at the practice,
its benefits and challenges, and best practices for helping your organization get the most out
of its business intelligence.
2. Data stewardship: A data steward does not develop information management policies but
rather deploys and enforces them across the enterprise. As the name implies, a data steward
stands watch over enterprise data collection and movement policies, ensuring practices are
implemented and rules are enforced.
3. Data quality management: If a data steward is a kind of digital sheriff, a data quality
manager might be thought of as his court clerk. Quality management is responsible for
combing through collected data for underlying problems like duplicate records, inconsistent
versions, and more. Data quality managers support the defined data management system.
4. Data security: One of the most important aspects of data management today is security.
Though emergent practices like DevSecOps incorporate security considerations at every level
of application development and data exchange, security specialists are still tasked with
encryption management, preventing unauthorized access, guarding against accidental
movement or deletion, and other frontline concerns.
5. Data governance: Data governance sets the law for an enterprise’s state of information. A
data governance framework is like a constitution that clearly outlines policies for the intake,
flow, and protection of institutional information. Data governors oversee their network of
stewards, quality management professionals, security teams, and other people and data
management processes in pursuit of a governance policy that serves a master data
management approach.
6. Big data management: Big data is the catch-all term used to describe gathering,
analyzing, and using massive amounts of digital information to improve operations. In broad
terms, this area of data management specializes in intake, integrity, and storage of the tide of
raw data that other management teams use to improve operations and security or inform
business intelligence.
7. Data warehousing: Information is the building block of modern business. The sheer
volume of information presents an obvious challenge: What do we do with all these blocks?
Data warehouse management provides and oversees the physical and/or cloud-based
infrastructure used to aggregate raw data and analyze it in-depth to produce business insights.
The unique needs of any organization practicing data management may require a blend of
some or all of these approaches. Familiarity with management areas provides data managers
with the background they need to build solutions customized for their environments.
Once data is under management, it can be mined for informational gold: business
intelligence. This helps business users across the organization in a variety of ways, including
the following:
Smart advertising that targets customers according to their interests and interactions
Holistic security that safeguards critical information
Alignment with relevant compliance standards, saving time and money
Machine learning that grows more environmentally aware over time, powering automatic
and continuous improvement
Reduced operating expenses by restricting use to only the necessary storage and compute
power required for optimal performance
The amount of data can be (at least temporarily) overwhelming. It’s hard to overstate
the volume of data that must come under management in a modern business, so, when
developing systems and processes, be ready to think big. Really big. Specialized third-
party services and apps for integrating big data or providing it as a platform are crucial
allies.
Many organizations silo data. The development team may work from one data set, the
sales team from another, operations from another, and so on. A modern data management
system relies on access to all this information to develop modern business intelligence.
Real-time data platform services help stream and share clean information between teams
from a single, trusted source.
The journey from unstructured data to structured data can be steep. Data often
pours into organizations in an unstructured way. Before it can be used to generate
business intelligence, data preparation has to happen: Data must be organized, de-
duplicated, and otherwise cleaned. Data managers often rely on third-party partnerships to
assist with these processes, using tools designed for on-premises, cloud, or hybrid
environments.
Managing the culture is essential to managing data. All of the processes and systems
in the world won’t do you much good if people don’t know how — and perhaps just as
importantly, why — to use them. By making team members aware of the benefits of data
management (and the potential pitfalls of ignoring it) and fostering the skills of using data
correctly, managers engage team members as essential pieces of the information process.
These and other challenges stand between the old way of doing business and initiatives that
harness the power of data for business intelligence. But with proper planning, practices, and
partners, technologies like accelerated machine learning can turn pinch points into gateways
for deeper business insights and better customer experience.
1. Make a plan
Develop and write a data management plan (DMP). This document charts estimated
data usage, accessibility guidelines, archiving approaches, ownership, and more. A
DMP serves as both a reference and a living record and will be revised as
circumstances change.
Additionally, DMPs present the organization’s overarching strategy for data
management to investors, auditors, and other involved parties — which is an
important insight into a company’s preparedness for the rigors of the modern market.
The best DMPs define granular details, including:
Preferred file formats
Naming conventions
Access parameters for various stakeholders
Backup and archiving processes
Defined partners and the terms and services they provide
Thorough documentation
There are online services that can help create DMPs by providing step-by-step
guidance to creating plans from templates.
2. Store your data
Among the granular details mentioned above, a solid data storage approach is central
to good data management. It begins by determining if your storage needs best suit a
data warehouse or a data lake (or both), and whether the company’s data belongs on-
premises or in the cloud.
Then outline a consistent, and consistently enforced, agreement for naming files,
folders, directories, users, and more. This is a foundational piece of data management,
as these parameters will determine how to store all future data, and inconsistencies
will result in errors and incomplete intelligence.
1. Security and backups. Insecure data is dangerous, so security must be considered at
every layer. Some organizations come under special regulatory burdens like HIPAA,
CIPA, GDPR, and others, which add additional security requirements like periodic
audits. When security fails, the backup plan can be the difference between business
life and death. Traditional models called for three copies of all important data: the
original, the locally stored copy, and a remote copy. But emerging cloud models
include decentralized data duplication, with even more backup options available at an
increasingly affordable cost for storage and transfer.
2. Documentation is key. If it’s important, document it. If the entire team splits the
lottery and runs off to Jamaica, thorough, readable documentation outlining security
and backup procedures will give the next team a fighting chance to pick up where
they left off. Without it, knowledge resides exclusively with holders who may or may
not be part of a long-term data management approach.
Data storage needs to be able to change as fast as the technology demands, so any approach
should be flexible and have a reasonable archiving approach to keep costs manageable.
Big data consists of huge amounts of information that cannot be stored or processed using
traditional data storage mechanisms or processing techniques. It generally consists of three
different variations.
i. Structured data (as its name suggests) has a well-defined structure and follows a
consistent order. This kind of information is designed so that it can be easily accessed
and used by a person or computer. Structured data is usually stored in the well-
defined rows and columns of a table (such as a spreadsheet) and databases —
particularly relational database management systems, or RDBMS.
ii. Semi-structured data exhibits a few of the same properties as structured data, but for
the most part, this kind of information has no definite structure and cannot conform to
the formal rules of data models such as an RDBMS.
iii. Unstructured data possesses no consistent structure across its various forms and
does not obey conventional data models’ formal structural rules. In very few
instances, it may have information related to date and time.
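As a small illustration (not from the source) of the difference between structured and semi-structured data, the sketch below loads a fixed-schema table and a set of JSON records whose fields vary; pandas and the sample values are assumptions.

import json
import pandas as pd

# Structured: fixed rows and columns, as in a spreadsheet or RDBMS table.
structured = pd.DataFrame({"customer_id": [1, 2], "amount": [250.0, 99.9]})

# Semi-structured: JSON records whose fields may differ between entries.
raw = '[{"customer_id": 1, "tags": ["vip"]}, {"customer_id": 2}]'
semi_structured = pd.json_normalize(json.loads(raw))

print(structured.dtypes)
print(semi_structured)   # fields absent from a record show up as NaN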
In line with classical definitions of the concept, big data is generally associated with
three core characteristics:
1. Volume: This trait refers to the immense amounts of information generated every second via social media, cell phones, cars, transactions, connected sensors, images, video, and text. Running to terabytes, petabytes, or even zettabytes, these volumes can only be managed by big data technologies.
2. Variety: Big data arrives in many formats and from many sources, ranging from structured records to semi-structured and unstructured text, images, audio, and video.
3. Velocity: Information is streaming into data repositories at a prodigious rate, and this
characteristic alludes to the speed of data accumulation. It also refers to the speed
with which big data can be processed and analyzed to extract the insights and patterns
it contains. These days, that speed is often real-time.
Beyond “the Three Vs,” current descriptions of big data management also include two other
characteristics, namely:
Veracity: This is the degree of reliability and truth that big data has to offer in terms
of its relevance, cleanliness, and accuracy.
Value: Since the primary aim of big data gathering and analysis is to discover insights
that can inform decision-making and other processes, this characteristic explores the
benefit or otherwise that information and analytics can ultimately produce.
When it comes to technology, organizations have many different types of big data
management solutions to choose from. Vendors offer a variety of standalone or multi-
featured big data management tools, and many organizations use multiple tools. Some of the
most common types of big data management capabilities include the following:
Data migration: moving data from one environment to another, such as moving
data from in-house data centres to the cloud
Data enrichment: improving the quality of data by adding new data sets,
correcting small errors or extrapolating new information from raw data
Master data management (MDM): linking critical enterprise data to one master set that serves as the single source of truth for the organization
Data governance: ensuring the availability, usability, integrity and accuracy of
data
Extract transform load (ETL): moving data from an existing repository into a
database or data warehouse.
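The following is a minimal extract-transform-load (ETL) sketch, offered only as an illustration of the capability listed above; the file name, column names, and SQLite destination are assumptions rather than anything prescribed by the source.

import sqlite3
import pandas as pd

# Extract: read raw data from a flat file (hypothetical orders.csv).
orders = pd.read_csv("orders.csv")

# Transform: cleanse and aggregate.
orders = orders.drop_duplicates()
orders["amount"] = orders["amount"].fillna(0)
daily = orders.groupby("order_date", as_index=False)["amount"].sum()

# Load: write the result into a destination database table.
with sqlite3.connect("warehouse.db") as conn:
    daily.to_sql("daily_sales", conn, if_exists="replace", index=False)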
Organization/Sources of Data
Data organization is the practice of categorizing and classifying data to make it more usable.
Similar to a file folder, where we keep important documents, you’ll need to arrange your data
in the most logical and orderly fashion, so you — and anyone else who accesses it — can
easily find what they’re looking for.
Network data. This type of data is gathered on all kinds of networks, including social
media, information and technological networks, the Internet and mobile networks, etc.
Real-time data. They are produced on online streaming media, such as YouTube,
Twitch, Skype, or Netflix.
Transactional data. They are gathered when a user makes an online purchase
(information on the product, time of purchase, payment methods, etc.)
Geographic data. Location data for humans, vehicles, buildings, natural reserves, and other objects is continuously supplied by satellites.
Natural language data. These data are gathered mostly from voice searches that can
be made on different devices accessing the Internet.
Time series data. This type of data is related to the observation of trends and
phenomena taking place at this very moment and over a period of time, for instance,
global temperatures, mortality rates, pollution levels, etc.
Linked data. These data are based on web technologies such as HTTP, RDF, SPARQL, and URIs, and are meant to enable semantic connections between various databases so that computers can read them and perform semantic queries correctly.
There are different ways to collect big data from users. These are the most popular ones.
1. Asking for it
The majority of firms prefer asking users directly to share their personal information. Users provide these data when creating website accounts or buying online. The minimum information collected usually includes a username and an email address, but some profiles require more details.
2. Cookies and Web Beacons
Cookies and web beacons are two widely used methods of gathering data on users, namely which web pages they visit and when. They provide basic statistics about how a website is used and are mainly intended to personalize your experience with a particular web source rather than to compromise your privacy.
3. Email tracking
Email trackers are meant to give more information on the user actions in the mailbox.
In particular, an email tracker allows detecting when an email was opened. Both
Google and Yahoo use this method to learn their users’ behavioural patterns and
provide personalized advertising.
By tracking data quality, a business can pinpoint potential issues harming quality, and ensure
that shared data is fit to be used for a given purpose.
When collected data fails to meet the company expectations of accuracy, validity,
completeness, and consistency, it can have massive negative impacts on customer service,
employee productivity, and key strategies.
Quality data is key to making accurate, informed decisions. And while all data has some level
of “quality,” a variety of characteristics and factors determines the degree of data quality
(high-quality versus low-quality). Furthermore, different data quality characteristics will
likely be more important to various stakeholders across the organization.
Popular data quality characteristics and dimensions include accuracy, completeness, consistency, validity, timeliness, and uniqueness.
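A quick, illustrative way to check some of these dimensions on a dataset is sketched below; the file name, columns, and validity rule are hypothetical.

import pandas as pd

df = pd.read_csv("customers.csv")                             # hypothetical dataset

completeness = 1 - df.isna().mean()                           # non-missing share per column
duplicate_ids = df.duplicated(subset=["customer_id"]).sum()   # uniqueness check
invalid_ages = ((df["age"] < 0) | (df["age"] > 120)).sum()    # validity rule

print(completeness)
print("duplicate customer_ids:", duplicate_ids)
print("out-of-range ages:", invalid_ages)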
The concept of missing data is implied in the name: it is data that is not captured for a variable in the observation in question. Missing data reduces the statistical power of the analysis and can distort the validity of the results.
Fortunately, there are proven techniques to deal with missing data.
Imputation vs. Removing Data
When dealing with missing data, data scientists can use two primary methods to solve
the error: imputation or the removal of data.
The imputation method develops reasonable guesses for the missing data. It is most useful when the percentage of missing data is low. If the proportion of missing data is too high, the imputed results lack the natural variation needed to produce an effective model.
The other option is to remove data. When dealing with data that is missing at random, related
data can be deleted to reduce bias. Removing data may not be the best option if there are not
enough observations to result in a reliable analysis. In some situations, observation of specific
events or factors may be required.
Before deciding which approach to employ, data scientists must understand why the data is
missing.
Missing at Random (MAR)
Missing at Random means the data is missing relative to the observed data; it is not related to the specific missing values. The data is not missing across all observations but only within sub-samples of the data. The missing data can therefore be predicted from the complete observed data.
Missing Completely at Random (MCAR)
In the MCAR situation, the data is missing across all observations regardless of the expected
value or other variables. Data scientists can compare two sets of data, one with missing
observations and one without. Using a t-test, if there is no difference between the two data
sets, the data is characterized as MCAR.
Data may be missing due to test design, failure in the observations or failure in recording
observations. This type of data is seen as MCAR because the reasons for its absence are
external and not related to the value of the observation.
It is typically safe to remove MCAR data because the results will be unbiased. The test may
not be as powerful, but the results will be reliable.
Missing Not at Random (MNAR)
The MNAR category applies when the missing data has a structure to it. In other words, there
appear to be reasons the data is missing. In a survey, perhaps a specific group of people – say
women ages 45 to 55 – did not answer a question. Like MAR, the data cannot be determined
by the observed data, because the missing information is unknown. Data scientists
must model the missing data to develop an unbiased estimate. Simply removing observations
with missing data could result in a model with bias.
Deletion
There are three primary approaches to handling missing data by deletion: listwise deletion, pairwise deletion, and dropping variables.
Listwise
In this method, all data for an observation that has one or more missing values is deleted, and the analysis is run only on observations that have a complete set of data. If the data set is small, this may be the most efficient way to eliminate those cases from the analysis. However, in most cases the data are not missing completely at random (MCAR); deleting the instances with missing observations can then produce biased parameters and estimates and reduce the statistical power of the analysis.
Pairwise
Pairwise deletion assumes data are missing completely at random (MCAR); every case that has data for the variables involved in a given calculation is used, even if other values are missing for that case. Pairwise deletion allows data scientists to use more of the data. However, the resulting statistics may vary because they are based on different subsets of the data, and the results may be impossible to duplicate with a complete set of data.
Dropping Variables
If a variable is missing for more than about 60% of the observations, it may be wise to discard that variable, provided it is insignificant. A short sketch of both deletion approaches follows below.
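A short pandas sketch of the deletion strategies above (listwise deletion and dropping sparsely observed variables); the toy DataFrame and the 60% threshold are illustrative assumptions.

import pandas as pd

df = pd.DataFrame({"a": [1.0, None, 3.0],
                   "b": [4.0, 5.0, None],
                   "c": [None, None, None]})

# Listwise deletion: keep only observations with a complete set of values.
listwise = df.dropna()

# Dropping variables: discard columns missing in more than 60% of observations.
drop_cols = df.columns[df.isna().mean() > 0.6]
reduced = df.drop(columns=drop_cols)

print(listwise)
print(reduced.columns.tolist())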
Imputation
When data is missing, it may make sense to delete data, as mentioned above. However, that
may not be the most effective option. For example, if too much information is discarded, it
may not be possible to complete a reliable analysis. Or there may be insufficient data to
generate a reliable prediction for observations that have missing data.
Instead of deletion, data scientists have multiple options to impute the value of missing data. Depending on why the data are missing, imputation methods can deliver reasonably reliable results. The following are examples of single imputation methods for replacing missing data.
Mean, Median and Mode
This is one of the most common methods of imputing values when dealing with missing data. In cases where there are a small number of missing observations, data scientists can calculate the mean or median of the existing observations. However, when there are many missing values, mean or median results can cause a loss of variation in the data. This method does not use time-series characteristics and does not depend on the relationship between the variables.
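A minimal sketch of mean and median imputation, assuming pandas and scikit-learn are available; the single 'height' column is invented for illustration.

import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"height": [170.0, None, 165.0, 180.0]})

# With pandas: replace missing values with the column mean.
df["height_mean"] = df["height"].fillna(df["height"].mean())

# With scikit-learn: median imputation, convenient inside modeling pipelines.
imputer = SimpleImputer(strategy="median")
df[["height_median"]] = imputer.fit_transform(df[["height"]])

print(df)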
Another option is to use time-series specific methods when appropriate to impute data. There
are four types of time-series data:
No trend or seasonality.
Trend, but no seasonality.
Seasonality, but no trend.
Both trend and seasonality.
The time series methods of imputation assume the adjacent observations will be like the
missing data. These methods work well when that assumption is valid. However, these
methods won’t always produce reasonable results, particularly in the case of strong
seasonality.
Last Observation Carried Forward (LOCF) & Next Observation Carried Backward
(NOCB)
These options are used to analyze longitudinal repeated measures data, in which follow-up
observations may be missing. In this method, every missing value is replaced with the last
observed value. Longitudinal data track the same instance at different points along a timeline.
This method is easy to understand and implement. However, this method may introduce bias
when data has a visible trend. It assumes the value is unchanged by the missing data.
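In pandas, LOCF and NOCB correspond to forward fill and backward fill; the daily series below is an invented example.

import pandas as pd

s = pd.Series([1.0, None, None, 4.0],
              index=pd.date_range("2023-01-01", periods=4, freq="D"))

locf = s.ffill()   # Last Observation Carried Forward
nocb = s.bfill()   # Next Observation Carried Backward

print(locf)
print(nocb)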
Linear Interpolation
Linear interpolation is often used to approximate a value of some function by using two
known values of that function at other points. This formula can also be understood as a
weighted average. The weights are inversely related to the distance from the end points to the
unknown point. The closer point has more influence than the farther point.
When dealing with missing data, you should use this method in a time series that exhibits a
trend line, but it’s not appropriate for seasonal data.
Seasonal Adjustment with Linear Interpolation
When dealing with data that exhibits both trend and seasonality characteristics, use seasonal adjustment with linear interpolation. First you would perform the seasonal adjustment by
computing a centered moving average or taking the average of multiple averages – say, two
one-year averages – that are offset by one period relative to another. You can then complete
data smoothing with linear interpolation as discussed above.
Multiple Imputation
Multiple imputation is considered a good approach for data sets with a large amount of missing data. Instead of substituting a single value for each missing data point, the missing values are replaced with values that reflect the natural variability and uncertainty of the true values. Using the imputed data, the process is repeated to create multiple imputed data sets. Each set is then analyzed using standard analytical procedures, and the multiple analysis results are combined to produce an overall result.
The various imputations incorporate natural variability into the missing values, which yields a valid statistical inference. Multiple imputation can produce statistically valid results even when there is a small sample size or a large amount of missing data.
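One way (an assumption, since the source names no library) to generate several plausible completed data sets in this spirit is scikit-learn's IterativeImputer with posterior sampling, as sketched below.

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0], [np.nan, 8.0]])

# Different seeds give different plausible imputations; each completed data
# set can be analyzed separately and the results pooled.
imputations = [
    IterativeImputer(sample_posterior=True, random_state=seed).fit_transform(X)
    for seed in range(5)
]
print(imputations[0])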
K Nearest Neighbours
In this method, data scientists choose a distance metric and a number of neighbours k, and the values of the k nearest neighbours are combined to impute an estimate. KNN can impute the most frequent value among the neighbours for categorical data and the mean of the nearest neighbours for numerical data.
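A brief sketch of KNN imputation with scikit-learn's KNNImputer; the number of neighbours, the distance weighting, and the toy matrix are illustrative choices.

import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0]])

imputer = KNNImputer(n_neighbors=2, weights="distance")  # nan-aware Euclidean distance by default
print(imputer.fit_transform(X))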
Data Visualization
Data visualization is the practice of translating information into a visual context, such
as a map or graph, to make data easier for the human brain to understand and pull
insights from.
The main goal of data visualization is to make it easier to identify patterns, trends and
outliers in large data sets. The term is often used interchangeably with others,
including information graphics, information visualization and statistical graphics.
Data visualization is one of the steps of the data science process, which states that
after data has been collected, processed and modelled, it must be visualized for
conclusions to be made.
Data visualization is also an element of the broader data presentation architecture
(DPA) discipline, which aims to identify, locate, manipulate, format and deliver data
in the most efficient way possible.
Data visualization is important for almost every career. It can be used by teachers to
display student test results, by computer scientists exploring advancements
in artificial intelligence (AI) or by executives looking to share information with
stakeholders.
It also plays an important role in big data projects. As businesses accumulated
massive collections of data during the early years of the big data trend, they needed a
way to quickly and easily get an overview of their data. Visualization tools were a
natural fit.
Visualization is central to advanced analytics for similar reasons. When a data
scientist is writing advanced predictive analytics or machine learning (ML)
algorithms, it becomes important to visualize the outputs to monitor results and ensure
that models are performing as intended. This is because visualizations of complex
algorithms are generally easier to interpret than numerical outputs.
Why is data visualization important?
Data visualization provides a quick and effective way to communicate information in a
universal manner using visual information. The practice can also help businesses identify
which factors affect customer behaviour; pinpoint areas that need to be improved or need
more attention; make data more memorable for stakeholders; understand when and where to
place specific products; and predict sales volumes.
Other benefits of data visualization include the following:
the ability to absorb information quickly, improve insights and make faster
decisions;
an increased understanding of the next steps that must be taken to improve the
organization;
an improved ability to maintain the audience's interest with information they can
understand;
an easy distribution of information that increases the opportunity to share insights
with everyone involved;
a reduced dependence on data scientists, since data becomes more accessible and
understandable; and
an increased ability to act on findings quickly and, therefore, achieve success
with greater speed and fewer mistakes.
Data visualization and big data
o The increased popularity of big data and data analysis projects has made visualization
more important than ever.
o Companies are increasingly using machine learning to gather massive amounts of data
that can be difficult and slow to sort through, comprehend and explain.
o Visualization offers a means to speed this up and present information to business
owners and stakeholders in ways they can understand.
o Big data visualization often goes beyond the typical techniques used in normal
visualization, such as pie charts, histograms and corporate graphs. It instead uses
more complex representations, such as heat maps and fever charts.
o Big data visualization requires powerful computer systems to collect raw data, process
it and turn it into graphical representations that humans can use to quickly draw
insights.
Examples of data visualization
In the early days of visualization, the most common visualization technique was using
a Microsoft Excel spreadsheet to transform the information into a table, bar graph or pie
chart. While these visualization methods are still commonly used, more intricate techniques
are now available, including the following:
infographics
bubble clouds
bullet graphs
heat maps
fever charts
time series charts
Line charts. This is one of the most basic and common techniques used. Line charts display
how variables can change over time.
Area charts. This visualization method is a variation of a line chart; it displays multiple
values in a time series -- or a sequence of data collected at consecutive, equally spaced points
in time.
Scatter plots. This technique displays the relationship between two variables. A scatter
plot takes the form of an x- and y-axis with dots to represent data points.
Tree maps. This method shows hierarchical data in a nested format. The size of the
rectangles used for each category is proportional to its percentage of the whole. Treemaps are
best used when multiple categories are present, and the goal is to compare different parts of a
whole.
Population pyramids. This technique uses a stacked bar graph to display the complex social
narrative of a population. It is best used when trying to display the distribution of a
population.
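As a brief illustration of two of the techniques described above, the sketch below draws a line chart and a scatter plot with matplotlib; the monthly figures are invented.

import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr"]
sales = [120, 135, 128, 150]
ad_spend = [10, 14, 12, 18]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.plot(months, sales, marker="o")        # line chart: change over time
ax1.set_title("Monthly sales")
ax2.scatter(ad_spend, sales)               # scatter plot: relationship between two variables
ax2.set_xlabel("Ad spend")
ax2.set_ylabel("Sales")
plt.tight_layout()
plt.show()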
Sales and Marketing: Research from the media agency Magna predicts that half of all global
advertising dollars will be spent online by 2020. As a result, marketing teams must pay close
attention to their sources of web traffic and how their web properties generate revenue. Data
visualization makes it easy to see traffic trends over time as a result of marketing efforts.
Politics: A common use of data visualization in politics is a geographic map that displays the
party each state or district voted for.
Data visualization tools can be used in a variety of ways. The most common use today is as a business intelligence (BI) reporting tool. Users can set up visualization tools to generate automatic dashboards that track company performance across key performance indicators (KPIs) and visually interpret the results.
The generated images may also include interactive capabilities, enabling users to manipulate
them or look more closely into the data for questioning and analysis. Indicators designed to
alert users when data has been updated or when predefined conditions occur can also be
integrated.
Many business departments implement data visualization software to track their own
initiatives. For example, a marketing team might implement the software to monitor the
performance of an email campaign, tracking metrics like open rate, click-through rate and
conversion rate.
As data visualization vendors extend the functionality of these tools, they are increasingly
being used as front ends for more sophisticated big data environments. In this setting, data
visualization software helps data engineers and scientists keep track of data sources and do
basic exploratory analysis of data sets prior to or after more detailed advanced analyses.
The biggest names in the big data tools marketplace include Microsoft, IBM, SAP and
SAS.
Some other vendors offer specialized big data visualization software; popular names in this
market include Tableau, Qlik and Tibco.
While Microsoft Excel continues to be a popular tool for data visualization, others have
been created that provide more sophisticated abilities:
IBM Cognos Analytics
Qlik Sense and QlikView
Microsoft Power BI
Oracle Visual Analyzer
SAP Lumira
SAS Visual Analytics
Tibco Spotfire
Zoho Analytics
D3.js
Jupyter
MicroStrategy
Google Charts
Data Classification
Data classification involves the use of tags and labels to define the data type, its confidentiality, and its integrity. Data is commonly classified into three risk levels, which are widely treated as the industry standard:
Low risk: If data is public and it’s not easy to permanently lose (e.g. recovery is easy), this
data collection and the systems surrounding it are likely a lower risk than others.
Moderate risk: Essentially, this is data that isn't public or is used internally (by your organization and/or partners). However, it is not critical enough to operations, or sensitive enough, to be "high risk." Proprietary operating procedures, cost of goods, and some company documentation may fall into the moderate category.
High risk: Anything remotely sensitive or crucial to operational security goes into the high risk category, as do pieces of data that are extremely hard to recover if lost. All confidential, sensitive, and necessary data falls into the high risk category.
While we’ve looked at mapping data out by type, you should also look to segment your
organization’s data in terms of the level of sensitivity – high, moderate, or low.
The following shows common examples of organizational data which may be classified into
each sensitivity level:
High:
o Personally identifiable information (PII)
o Credit card details (PCI)
o Intellectual property (IP)
o Protected healthcare information (including HIPAA regulated data)
o Financial information
o Employee records
o ITAR materials
o Internal correspondence including confidential data
Moderate:
Low:
o Public websites
o Public directory data
o Publicly available research
o Press releases
o Job advertisements
o Marketing materials
Data Science Project Life Cycle :
Data Science is a multidisciplinary field that uses scientific methods to extract insights from
structured and unstructured data. Data science is such a huge field and concept that’s often
intermingled with other disciplines, but generally, DS unifies statistics, data analysis,
machine learning, and related fields.
Data Science life cycle provides the structure to the development of a data science project.
The lifecycle outlines the major steps, from start to finish, that projects usually follow. Now,
there are various approaches to managing DS projects, amongst which are Cross-industry
standard process for data mining (aka CRISP-DM), process of knowledge discovery in
databases (aka KDD), any proprietary-based custom procedures conjured up by a company,
and a few other simplified processes.
CRISP-DM
CRISP-DM is an open standard process model that describes common approaches used by
data mining scientists. In 2015, it was refined and extended by IBM, which released a new
methodology called Analytics Solutions Unified Method for Data Mining/Predictive
Analytics (aka ASUM-DM).
Suppose, we have a standard DS project (without any industry-specific peculiarities), then the
lifecycle would typically include:
Business understanding
Data acquisition and understanding
Modelling
Deployment
Customer acceptance
The DS project life cycle is an iterative process of research and discovery that provides
guidance on the tasks needed to use predictive models. The goal of this process is to move a
DS project to an engagement end-point by providing means for easier and clearer
communication between teams and customers with a well-defined set of artifacts and
standardized templates to homogenize procedures and avoid misunderstandings.
Business understanding
Before you even embark on a DS project, you need to understand the problem you’re trying
to solve and define the central objectives of your project by identifying the variables to
predict.
Goals:
Identify key variables that will serve as model targets and serve as the metrics for
defining the success of the project
Identify data sources that the business has already access to or need to obtain such
access
Guidelines:
Work with customers and stakeholders to define business problems and formulate questions
that data science needs to answer.
The goal here is to identify the key business variables (aka model targets) that your analysis
needs to predict and the project’s success would be assessed against. For example, the sales
forecasts. This is what needs to be predicted, and at the end of your project, you’ll compare
your predictions to the actual volume of sales.
Define project goals by asking specific, sharp questions that data science techniques can answer.
Business Requirements
The business requirements the analyst creates for a project state what the business needs at a high level. While business requirements are largely textual, they may also include graphs, models, or any combination of these that best serves the project. Effective business requirements require strong strategic thinking, significant input from a project's business owners, and the ability to clearly state the needs of a project at a high level. Good business requirements are:
Verifiable. Just because business requirements state business needs rather than technical specifications does not mean they cannot be demonstrated. Verifiable requirements are specific and objective. A quality control expert must be able to check, for example, that the system accommodates the debit, credit, and PayPal methods specified in the business requirements. He or she could not do so if the requirements were vaguer, e.g., "The system will accommodate appropriate payment methods" ("appropriate" is subject to interpretation).
Unambiguous, stating precisely what problem is being solved. For example, “This
project will be deemed successful if ticket sales increase sufficiently,” is probably too
vague for all stakeholders to agree on its meaning at the project’s end.
Comprehensive, covering every aspect of the business need. Business requirements
are indeed big picture, but they are very thorough big picture. In the aforementioned
example, if the analyst assumed that the developers would know to design a system
that could accommodate many times the number of customers the theatre chain had
seen at one time in the past, but did not explicitly state so in the requirements, the
developers might design a system that could accommodate only 10,000 patrons at any
one time without performance issues.
Remember that business requirements answer the what, not the how, but they are meticulously thorough in describing those whats. No business point is overlooked. At a project's
end, the business requirements should serve as a methodical record of the initial business
problem and the scope of its solution.
Understanding the project objectives and requirements from a domain perspective and then
converting this knowledge into a data science problem definition with a preliminary plan
designed to achieve the objectives. Data science projects are often structured around the
specific needs of an industry sector (as shown below) or even tailored and built for a single
organization. A successful data science project starts from a well defined question or need.
Data Acquisition
o Data recording
o Data storing
o Real-time data visualization
o Post-recording data review
o Data analysis using various mathematical and statistical calculations
o Report generation
Data Preparation
Data preparation is about constructing a dataset from one or more data sources to be used for exploration and modeling. It is a solid practice to start with an initial dataset, to get familiar with the data, discover first insights, and develop a good understanding of any possible data quality issues. Data preparation is often a time-consuming process and is heavily prone to errors. The old saying "garbage in, garbage out" is particularly applicable to data science projects where the gathered data contains many invalid, out-of-range, and missing values. Analyzing data that has not been carefully screened for such problems can produce highly misleading results, so the success of data science projects depends heavily on the quality of the prepared data.
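A small, illustrative data preparation sketch in pandas (file names, columns, and screening rules are assumptions): combining sources, removing duplicates, and screening invalid values before modeling.

import pandas as pd

customers = pd.read_csv("customers.csv")
orders = pd.read_csv("orders.csv")

dataset = orders.merge(customers, on="customer_id", how="left")
dataset = dataset.drop_duplicates()
dataset = dataset[dataset["amount"] >= 0]                       # drop out-of-range values
dataset["order_date"] = pd.to_datetime(dataset["order_date"], errors="coerce")

print(dataset.isna().sum())                                     # remaining quality issues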
Data
Data is information typically the results of measurement (numerical) or counting
(categorical). Variables serve as placeholders for data. There are two types of
variables, numerical and categorical.
A numerical or continuous variable is one that can accept any value within a finite or infinite interval (e.g., height, weight, temperature, blood glucose). There are two types of numerical
data, interval and ratio. Data on an interval scale can be added and subtracted but cannot be
meaningfully multiplied or divided because there is no true zero. For example, we cannot say that
one day is twice as hot as another day. On the other hand, data on a ratio scale has true zero and can
be added, subtracted, multiplied or divided (e.g., weight).
A categorical or discrete variable is one that can accept two or more values (categories). There
are two types of categorical data, nominal and ordinal. Nominal data does not have an intrinsic
ordering in the categories. For example, "gender" with two categories, male and female. In contrast,
ordinal data does have an intrinsic ordering in the categories. For example, "level of energy" with
three orderly categories (low, medium and high).
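The nominal/ordinal distinction can be made explicit in code; the sketch below uses pandas categoricals with the same example categories (gender, level of energy) and is illustrative only.

import pandas as pd

gender = pd.Categorical(["male", "female", "female"])             # nominal: no intrinsic order

energy = pd.Categorical(["low", "high", "medium"],
                        categories=["low", "medium", "high"],
                        ordered=True)                             # ordinal: low < medium < high

print(energy.min(), energy.max())   # ordering is meaningful only for ordinal data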
Dataset
Dataset is a collection of data, usually presented in a tabular form. Each column represents a
particular variable, and each row corresponds to a given member of the data.
In predictive modeling, predictors or attributes are the input variables and target or class
attribute is the output variable whose value is determined by the values of the predictors and
function of the predictive model.
Database
A database collects, stores, and manages information so users can retrieve, add, update, or remove such information. It presents information in tables with rows and columns. A table is referred to as
a relation in the sense that it is a collection of objects of the same type (rows). Data in a table can
be related according to common keys or concepts, and the ability to retrieve related data from
related tables is the basis for the term relational database. A Database Management System
(DBMS) handles the way data is stored, maintained, and retrieved. Most data science toolboxes
connect to databases through ODBC (Open Database Connectivity) or JDBC (Java Database
Connectivity) interfaces.
SQL (Structured Query Language) is a database computer language for managing and
manipulating data in relational database management systems (RDBMS).
SQL Data Definition Language (DDL) permits database tables to be created, altered or deleted. We
can also define indexes (keys), specify links between tables, and impose constraints between
database tables.
SQL Data Manipulation Language (DML) is a language which enables users to access and
manipulate data.
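To make the DDL/DML distinction concrete, here is a minimal sketch using Python's built-in sqlite3 module; the table and column names are illustrative, not taken from the source.

import sqlite3

with sqlite3.connect(":memory:") as conn:
    cur = conn.cursor()

    # DDL: create a table with a primary key.
    cur.execute("""CREATE TABLE customer (
                       id   INTEGER PRIMARY KEY,
                       name TEXT NOT NULL)""")

    # DML: insert, update, and query rows.
    cur.execute("INSERT INTO customer (id, name) VALUES (?, ?)", (1, "Asha"))
    cur.execute("UPDATE customer SET name = ? WHERE id = ?", ("Asha R.", 1))
    print(cur.execute("SELECT id, name FROM customer").fetchall())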
Data extraction provides the ability to extract data from a variety of data sources, such as
flat files, relational databases, streaming data, XML files, and ODBC/JDBC data sources.
Data transformation provides the ability to cleanse, convert, aggregate, merge, and split
data.
Data loading provides the ability to load data into destination databases via update, insert
or delete statements, or in bulk.
Data Exploration
Data Exploration is about describing the data by means of statistical and visualization
techniques. We explore data in order to bring important aspects of that data into focus for
further analysis.
1. Univariate Analysis
2. Bivariate Analysis
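The sketch below illustrates these two kinds of exploration on a tiny, invented DataFrame: a univariate summary of one variable and a bivariate correlation between two.

import pandas as pd

df = pd.DataFrame({"age": [23, 35, 31, 45], "income": [30, 52, 48, 75]})

# Univariate analysis: describe one variable at a time.
print(df["age"].describe())

# Bivariate analysis: relationship between two variables.
print(df[["age", "income"]].corr())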
Modeling
Predictive modeling is the process by which a model is created to predict an outcome. If the outcome
is categorical it is called classification and if the outcome is numerical it is called regression.
Descriptive modeling or clustering is the assignment of observations into clusters so that observations
in the same cluster are similar. Finally, association rules can find interesting associations amongst
observations.
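A compact illustration of the classification/regression distinction with scikit-learn; the iris data set and the particular estimators are assumptions made for the example, not methods prescribed by the source.

from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression, LogisticRegression

X, y = load_iris(return_X_y=True)

# Categorical outcome (species) -> classification.
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Numerical outcome (petal width from the other measurements) -> regression.
reg = LinearRegression().fit(X[:, :3], X[:, 3])

print(clf.predict(X[:2]), reg.predict(X[:2, :3]))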
Model Evaluation
Model Evaluation is an integral part of the model development process. It helps to find the best
model that represents our data and how well the chosen model will work in the future. Evaluating
model performance with the data used for training is not acceptable in data science because it can
easily generate overoptimistic and over fitted models. There are two methods of evaluating models in
data science, Hold-Out and Cross-Validation. To avoid overfitting, both methods use a test set (not
seen by the model) to evaluate model performance.
Hold-Out
In this method, the (usually large) dataset is randomly divided into three subsets:
1. Training set is a subset of the dataset used to build predictive models.
2. Validation set is a subset of the dataset used to assess the performance of model built in the
training phase. It provides a test platform for fine tuning model's parameters and selecting the
best-performing model. Not all modeling algorithms need a validation set.
3. Test set (unseen examples) is a subset of the dataset used to assess the likely future performance of a model. If a model fits the training set much better than it fits the test set, overfitting is probably the cause.
Cross-Validation
When only a limited amount of data is available, to achieve an unbiased estimate of the model
performance we use k-fold cross-validation. In k-fold cross-validation, we divide the data
into k subsets of equal size. We build models k times, each time leaving out one of the subsets from
training and use it as the test set. If k equals the sample size, this is called "leave-one-out".
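A short sketch of both evaluation methods with scikit-learn (the data set, model, split ratio, and k are illustrative choices):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Hold-out: fit on a training subset, evaluate on unseen test examples.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
print("hold-out accuracy:", model.fit(X_train, y_train).score(X_test, y_test))

# k-fold cross-validation with k = 5.
print("cv accuracy:", cross_val_score(model, X, y, cv=5).mean())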
Model Deployment
The concept of deployment in data science refers to the application of a model for prediction on new data. Building a model is generally not the end of the project. Even if the purpose of the model is
to increase knowledge of the data, the knowledge gained will need to be organized and presented in a
way that the customer can use it. Depending on the requirements, the deployment phase can be as
simple as generating a report or as complex as implementing a repeatable data science process. In
many cases, it will be the customer, not the data analyst, who will carry out the deployment steps. For
example, a credit card company may want to deploy a trained model or set of models (e.g., neural
networks, meta-learner) to quickly identify transactions, which have a high probability of being
fraudulent. However, even if the analyst will not carry out the deployment effort it is important for
the customer to understand up front what actions will need to be carried out in order to actually make
use of the created models.
An example of using a data mining tool (Orange) to deploy a decision tree model.
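As another, purely illustrative sketch of deployment, a trained model can be persisted and later loaded to score new data; joblib and the file name are assumptions, not part of the source.

import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
joblib.dump(LogisticRegression(max_iter=1000).fit(X, y), "model.joblib")

# Later, in the production system:
model = joblib.load("model.joblib")
print(model.predict(X[:5]))   # score new, unseen records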
Operations Research
Generally, OR is concerned with obtaining extreme values of some real-world objective functions;
maximum (profit, performance, utility, or yield), minimum (loss, risk, distance, or cost). It
incorporates techniques from mathematical modelling, optimization, and statistical analysis while
emphasising the human-technology interface. However, one of the difficulties in defining OR is that there is a lot of overlap in scientific terminology, and sometimes terms become extremely popular, affecting the landscape of the terminology, e.g. the popularity of vague, broad terms such as AI and Big Data, which work well for marketing but do nothing for the discussion of the research. OR is therefore best illustrated in terms of its related fields, subfields, and the problems it addresses.
There are many process optimization techniques you can use to get started. Here are two examples:
Process mining: This is a group of techniques with a data science approach. Data is taken
from event logs to analyze what team members are doing in a company and what steps they
take to complete a task. This data can then be turned into insights, helping project managers
to spot any roadblocks and optimize their processes.
PDSA: PDSA is an acronym for Plan, Do, Study, Act. It uses a four-stage cyclical model to
improve quality and optimize business processes. Project managers will start by mapping
what achievements they want to accomplish. Next, they will test proposed changes on a small
scale. After this, they will study the results and determine if these changes were effective. If
so, they will implement the changes across the entire business process.
It is good practice for a project manager to take some time to research various process optimization methods before deciding which one is most suited to their business.
Data Science is the deep study of large quantities of data, which involves extracting meaningful insights from raw, structured, and unstructured data. Extracting meaningful insights from large amounts of data requires processing, which is done using statistical techniques and algorithms, scientific methods, different technologies, and so on. Data Science uses various tools and techniques to extract meaningful information from raw data, and it is sometimes described as the future of Artificial Intelligence.
1. In Search Engines
One of the most visible applications of Data Science is in search engines. When we search for something on the internet, we mostly use search engines such as Google, Yahoo, Bing, or DuckDuckGo, and Data Science is used to return and rank results faster.
For example, when we search for "Data Structure and algorithm courses", a GeeksforGeeks course is often among the first links returned. This happens because the GeeksforGeeks website is visited most often for information on data structure courses and computer-related subjects. This analysis is done using Data Science, which surfaces the most visited web links.
2. In Transport
Data Science has also entered the transport field, for example through driverless cars. With the help of driverless cars, it becomes easier to reduce the number of accidents.
For example, in driverless cars the training data is fed into the algorithm, and with the help of Data Science techniques the data is analyzed to determine, for instance, the speed limit on a highway, busy street, or narrow road, and how to handle different situations while driving.
3. In Finance
Data Science plays a key role in financial industries, which constantly face problems of fraud and risk of losses. They therefore need to automate risk-of-loss analysis in order to carry out strategic decisions for the company. Financial industries also use Data Science analytics tools to predict the future, which allows companies to predict customer lifetime value and stock market moves.
For example, in the stock market, Data Science is used to examine past behaviour with past data in order to estimate future outcomes. Data is analyzed in such a way that it becomes possible to predict future stock prices over a set time frame.
4. In E-Commerce
E-commerce websites like Amazon, Flipkart, etc. use Data Science to create a better user experience with personalized recommendations.
For example, when we search for something on an e-commerce website, we get suggestions similar to our past choices, and we also get recommendations based on the most bought, most rated, and most searched products. This is all done with the help of Data Science.
5. In Health Care
In the healthcare industry, Data Science acts as a boon. Data Science is used for:
Detecting Tumor.
Drug discoveries.
Medical Image Analysis.
Virtual Medical Bots.
Genetics and Genomics.
Predictive Modeling for Diagnosis etc.
6. Image Recognition
Data Science is also used in image recognition. For example, when we upload a photo with a friend on Facebook, Facebook suggests tagging who is in the picture. This is done with the help of machine learning and Data Science. When an image is recognized, data analysis is performed on the user's Facebook friends, and if a face in the picture matches another profile, Facebook suggests auto-tagging.
7. Targeted Recommendation
Targeted recommendation is one of the most important applications of Data Science. Whatever a user searches for on the internet, he or she will then see related posts everywhere. For example, suppose I search Google for a mobile phone and then decide to buy it offline. Data Science helps the companies that pay for advertisements for that phone, so everywhere on the internet, on social media, on websites, and in apps, I will see recommendations for the mobile phone I searched for, which nudges me to buy it online.
8. Airline Route Planning
With the help of Data Science, the airline sector is also improving; for example, it becomes easier to predict flight delays. Data Science also helps decide whether to fly directly to the destination or to take a halt in between, for example a flight from Delhi to the U.S.A. can be direct or can stop over before reaching the destination.
9. Data Science in Gaming
In most games where a user plays against a computer opponent, data science concepts are used together with machine learning, so that with the help of past data the computer improves its performance. Many games, such as Chess and EA Sports titles, use Data Science concepts.
10. Medicine and Drug Development
The process of creating a medicine is very difficult and time-consuming and has to be done with full discipline, because it is a matter of someone's life. Without Data Science, it takes a great deal of time, resources, and money to develop a new medicine or drug; with Data Science it becomes easier, because the probability of success can be estimated from biological data and factors. Data science based algorithms can forecast how a compound will react in the human body before lab experiments.
11. In Delivery Logistics
Various logistics companies like DHL, FedEx, etc. make use of Data Science. Data Science helps these companies find the best routes for shipping their products, the best times for delivery, the best modes of transport to reach the destination, and so on.
12. Autocomplete
The autocomplete feature is an important application of Data Science: the user types only a few letters or words and the rest of the line is completed automatically. In Gmail, for example, when we write a formal mail, the data science based autocomplete feature offers an efficient choice to complete the whole line. Autocomplete is also widely used in search engines, social media, and various other apps.