
Unit-1

Data Analytics
Introduction: - Most companies are collecting loads of data all the time, but in its raw form this data
doesn't really mean anything. Data analytics helps you to make sense of the past and to predict future trends
and behaviors; rather than basing your decisions and strategies on guesswork, you're making informed
choices based on what the data is telling you. Armed with the insights drawn from the data, businesses and
organizations are able to develop a much deeper understanding of their audience, their industry, and their
company as a whole and, as a result, are much better equipped to make decisions and plan ahead.

🡪 What is data analytics?


Data analytics is the process of analysing raw data in order to draw out meaningful, actionable insights,
which are then used to inform and drive smart business decisions. A data analyst will extract raw data,
organize it, and then analyse it, transforming it from incomprehensible numbers into coherent, intelligible
information; this whole process is what we call data analytics.
🡺 Ways to use Data Analytics:
∙ Improved Decision Making
∙ Better Customer Service
∙ Efficient Operations
∙ Effective Marketing
🡺 Various Steps involved in Data Analytics.

🡺 Understand the problem: - You need to understand the business problem, define the organizational
goals and plan a workable solution. Also try to find out the key performance indicators (KPIs) and decide
which metrics to track along the way.
🡺 Data Collection: - Data collection is the process of gathering data and information on the variables
identified in the data requirements. Gather the right data from various sources, prioritizing the
information you need most. Data collection usually starts with primary sources, also known as
internal sources; this is typically structured data gathered from CRM software, ERP systems,
marketing automation tools and others. These sources contain information about customers,
finances, gaps in performance, sales, etc. External sources provide both structured and unstructured
data; for example, if you are running sentiment analysis on your brand, you would gather data from
various review websites or social media apps.
🡺 Data Cleaning: - The data collected from various sources is highly likely to contain incomplete
records, duplicates and missing values, so you need to clean the unwanted data to make it ready
for analysis. Remove unwanted, redundant and missing values so that the analysis generates
accurate results. Analytics professionals must identify duplicate data, anomalous data and other
inconsistencies. According to reports, about 60% of data scientists say that most of their time is spent
cleaning data, and 57% of data scientists say it is the least enjoyable part of their work.
🡺 Data Exploration and Analysis: - Once the data is clean and ready, you can go ahead and explore
it using data visualization and business intelligence tools. Data exploration uses data visualization
techniques, business intelligence tools, data mining techniques and predictive modelling to analyse
the data and build models. You can also apply supervised and unsupervised learning algorithms such
as linear regression, logistic regression, decision trees, k-nearest neighbours (k-NN), k-means
clustering and many more to build predictive models that support business decisions (see the short
Python sketch after these steps).
🡺 Interpret the results: - This step is important because it is where a business gains actual value from
the previous four steps. Interpret the results to uncover hidden patterns, future trends and insights.
Run validation checks to confirm that the results answer your original questions; these results can then
be shown to clients and stakeholders for better understanding.
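The following short Python sketch ties the steps above together on a hypothetical sales.csv file (the file name and the columns ad_spend and revenue are assumptions for illustration), using pandas for cleaning and scikit-learn for a simple supervised model:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Data collection: load raw data gathered from an internal source (hypothetical file)
df = pd.read_csv("sales.csv")

# Data cleaning: remove duplicates, fill or drop missing values
df = df.drop_duplicates()
df["ad_spend"] = df["ad_spend"].fillna(df["ad_spend"].median())
df = df.dropna(subset=["revenue"])

# Data exploration: quick summary statistics before modelling
print(df.describe())

# Analysis: fit a simple supervised model (linear regression)
X_train, X_test, y_train, y_test = train_test_split(
    df[["ad_spend"]], df["revenue"], test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)

# Interpret the results: score the model on held-out data
print("R^2 on test data:", model.score(X_test, y_test))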

🡪 Data analytics tools: -


Data Analytics is an important aspect of many organizations nowadays. Real-time data analytics is essential
for the success of a major organization and helps drive decision making. There are many tools that are used
for deriving useful insights from the given data. Some are programming-based and others are
non-programming based. Some of the most popular tools are:

∙ SAS

∙ Microsoft Excel
∙ R Programming Language
∙ Python Programming Language
∙ Power BI (Business Intelligence tool)
∙ Apache Spark
∙ QlikView
∙ Tableau

∙ RapidMiner

∙ KNIME

1. SAS:
∙ Type of tool: Statistical software suite.
∙ Availability: Commercial.
∙ Mostly used for: Business intelligence, multivariate, and predictive analysis.
∙ Pros: Easily accessible, business-focused, good user support.
∙ Cons: High cost, poor graphical representation.
🡪SAS (which stands for Statistical Analysis System) is a popular commercial suite of business intelligence
and data analysis tools. It was first developed in the 1960s and has evolved ever since. Its
main use today is for profiling customers, reporting, data mining, and predictive modeling. Created for an
enterprise market, the software is generally more robust, versatile, and easier for large organizations to use.
This is because they tend to have varying levels of in-house programming expertise.
But as a commercial product, SAS comes with a hefty price tag. Nevertheless, with the cost come benefits: new
modules are regularly added based on customer demand. Although it has fewer of these than, say,
Python libraries, they are highly focused. For instance, it offers modules for specific uses such as anti-money
laundering and analytics for the Internet of Things.
2. Microsoft Excel:
∙ Type of tool: Spreadsheet software.
∙ Availability: Commercial.
∙ Mostly used for: Data wrangling and reporting.
∙ Pros: Widely-used, with lots of useful functions and plug-ins.
∙ Cons: Cost, calculation errors, poor at handling big data.
Excel: the world's best-known spreadsheet software. What's more, it features calculation and graphing
functions that are ideal for data analysis. Whatever your specialism, and no matter what other software you
might need, Excel is a staple in the field. Its invaluable built-in features include pivot tables (for sorting or
totalling data) and form creation tools. It also has a variety of other functions that streamline data
manipulation. For instance, the CONCATENATE function allows you to combine text, numbers, and dates
into a single cell. SUMIF lets you create value totals based on variable criteria, and Excel's search function
makes it easy to isolate specific data.
It has limitations though. For instance, it runs very slowly with big datasets and tends to approximate large
numbers, leading to inaccuracies. Nevertheless, it's an important and powerful data analysis tool, and with
many plug-ins available, you can easily bypass Excel's shortcomings.
3. R Programming Language
∙ Type of tool: Programming language.
∙ Availability: Open-source.
∙ Mostly used for: Statistical analysis and data mining.
∙ Pros: Platform independent, highly compatible, lots of packages.
∙ Cons: Slower, less secure, and more complex to learn than Python.
🡪R, like Python, is a popular open-source programming language. It is commonly used to create
statistical/data analysis software. R's syntax is more complex than Python's and the learning curve is steeper.
However, it was built specifically to deal with heavy statistical computing tasks and is very
popular for data visualization. A bit like Python, R also has a network of freely available code, called CRAN
(the Comprehensive R Archive Network), which offers 10,000+ packages. 🡪R is a leading analytics tool in
the industry and is widely used for statistics and data modeling. It can easily manipulate data and present it
in different ways. It has exceeded SAS in many ways, such as capacity of data, performance and outcome. R
compiles and runs on a wide variety of platforms, viz. UNIX, Windows and macOS.
4. Python Programming Language
∙ Type of tool: Programming language.
∙ Availability: Open-source, with thousands of free libraries.
∙ Used for: Everything from data scraping to analysis and reporting.
∙ Pros: Easy to learn, highly versatile, widely-used.
∙ Cons: Memory intensive; doesn't execute as fast as some other languages.
🡪Python is an object-oriented scripting language which is easy to read, write, and maintain. Plus, it is a free,
open-source tool. It was developed by Guido van Rossum in the late 1980s and supports both functional and
structured programming methods. Python is considered easy to learn, with a syntax often compared to
languages such as JavaScript, Ruby, and PHP. Also, Python has very good machine learning libraries.
🡪Python is also extremely versatile; it has a huge range of resource libraries suited to a variety of different
data analytics tasks. For example, the NumPy and pandas libraries are great for streamlining highly
computational tasks, as well as supporting general data manipulation.
🡪Libraries like Beautiful Soup and Scrapy are used to scrape data from the web, while Matplotlib is
excellent for data visualization and reporting. Python's main drawback is its speed: it is memory intensive
and slower than many languages. In general though, if you're building software from scratch, Python's
benefits far outweigh its drawbacks.
🡪Some of the companies that use Python for data analytics include Instagram, Facebook, Spotify and
Amazon.
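As a small, hedged illustration of the libraries mentioned above (the file scores.csv and its "score" column are assumptions), pandas loads the data, NumPy computes a statistic and Matplotlib produces a chart:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# pandas: read and manipulate tabular data (hypothetical file with a "score" column)
df = pd.read_csv("scores.csv")

# NumPy: fast numerical computation on the underlying array
mean_score = np.mean(df["score"].to_numpy())

# Matplotlib: visualization and reporting
df["score"].hist(bins=20)
plt.axvline(mean_score, color="red", label="mean score")
plt.legend()
plt.savefig("score_distribution.png")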
5. Tableau:
∙ Type of tool: Data visualization tool.
∙ Availability: Commercial.
∙ Mostly used for: Creating data dashboards and worksheets.
∙ Pros: Great visualizations, speed, interactivity, mobile support.
∙ Cons: Poor version control, no data pre-processing.
If you're looking to create interactive visualizations and dashboards without extensive coding expertise,
Tableau is one of the best commercial data analysis tools available. The suite handles large amounts of data
better than many other BI tools, and it is very simple to use. It has a visual drag-and-drop interface (another
definite advantage over many other data analysis tools). However, because it has no scripting layer, there's a
limit to what Tableau can do. For instance, it's not great for pre-processing data or building more complex
calculations.
While it does contain functions for manipulating data, these aren't great. As a rule, you'll need to carry out
scripting functions using Python or R before importing your data into Tableau. But its visualization is pretty
top-notch, making it very popular despite its drawbacks. Furthermore, it's mobile-ready. As a data analyst,
mobility might not be your priority, but it's nice to have if you want to dabble on the move!
6. RapidMiner:
RapidMiner is a powerful integrated data science platform. It is developed by the company of the same name
and performs predictive analysis and other advanced analytics like data mining, text analytics, machine learning
and visual analytics without any programming. RapidMiner can incorporate almost any data source type, including
Access, Excel, Microsoft SQL, Teradata, Oracle, Sybase, IBM DB2, Ingres, MySQL, IBM SPSS, dBASE,
etc. The tool is very powerful in that it can generate analytics based on real-life data transformation settings, i.e.
you can control the formats and data sets for predictive analysis.
Approximately 774 companies use RapidMiner and most of these are US-based. Some of the esteemed
companies on that list include the Boston Consulting Group and Domino's Pizza Inc.
7. KNIME:
∙ Type of tool: Data integration platform.
∙ Availability: Open-source.
∙ Mostly used for: Data mining and machine learning.
∙ Pros: Open-source platform that is great for visually-driven programming.
∙ Cons: Lacks scalability and technical expertise is needed for some functions.
KNIME (Konstanz Information Miner) is an open-source, cloud-based, data integration
platform. It was developed in 2004 by software engineers at the University of Konstanz in Germany. Although first
created for the pharmaceutical industry, KNIME's strength in accruing data from numerous sources into a
single system has driven its application in other areas. These include customer analysis, business
intelligence, and machine learning.
Its main draw (besides being free) is its usability. A drag-and-drop graphical user interface (GUI) makes it
ideal for visual programming. This means users don't need a lot of technical expertise to create data
workflows. While it claims to support the full range of data analytics tasks, in reality its strength lies in data
mining. Though it offers in-depth statistical analysis too, users will benefit from some knowledge of Python
and R. Being open-source, KNIME is very flexible and customizable to an organization's needs, without
heavy costs. This makes it popular with smaller businesses that have limited budgets.
8. Apache Spark: -
Apache Spark is an open-source data analytics engine that processes data in real time and carries out complex
analytics using SQL queries and machine learning algorithms. It supports Spark Streaming for real-time
analytics and Spark SQL for writing SQL queries; it also includes MLlib, a library that serves as a repository
of machine learning algorithms, and GraphX for graph computation.
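A minimal PySpark sketch of the Spark SQL interface described above, assuming a hypothetical events.csv file with an event_date column (Spark Streaming and MLlib are driven from the same SparkSession):

from pyspark.sql import SparkSession

# The SparkSession is the entry point for Spark SQL, MLlib and structured streaming
spark = SparkSession.builder.appName("analytics-demo").getOrCreate()

# Load a data source and register it as a temporary SQL view (hypothetical file)
events = spark.read.csv("events.csv", header=True, inferSchema=True)
events.createOrReplaceTempView("events")

# Run an SQL query with Spark SQL
daily_counts = spark.sql(
    "SELECT event_date, COUNT(*) AS n FROM events GROUP BY event_date")
daily_counts.show()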
9. QlikView
QlikView has many unique features, like patented technology and in-memory data processing, which deliver
results to end users very fast and store the data in the report itself. Data association in QlikView is
maintained automatically, and the data can be compressed to almost 10% of its original size. Data relationships
are visualized using colours: a specific colour is given to related data and another colour to non-related
data.
10. Power BI.
∙ Type of tool: Business analytics suite.
∙ Availability: Commercial software (with a free version available).
∙ Mostly used for: Everything from data visualization to predictive analytics.
∙ Pros: Great data connectivity, regular updates, good visualizations.
∙ Cons: Clunky user interface, rigid formulas, data limits (in the free version).
A relative newcomer to the market of data analytics tools, Power BI began life
as an Excel plug-in but was redeveloped in the early 2010s as a standalone suite of business data analysis
tools. Power BI allows users to create interactive visual reports and dashboards with a minimal learning
curve. Its main selling point is its great data connectivity: it operates seamlessly with Excel (as you'd
expect, being a Microsoft product) but also with text files, SQL Server, and cloud sources like Google and
Facebook analytics.
It also offers strong data visualization but has room for improvement in other areas. For example, it has quite
a bulky user interface, rigid formulas, and its proprietary language (Data Analysis Expressions, or 'DAX')
is not that user-friendly. It does offer several subscription tiers though, including a free one. This is great if you
want to get to grips with the tool, although the free version does have drawbacks, the main limitation being
the low data limit (around 2GB).

🡪Applications of Data Analytics & Environments


1. Retail: To study sales patterns, consumer behaviour, and inventory management, data analytics can be
applied in the retail sector. Data analytics can be used by retailers to make data-driven decisions regarding
what products to stock, how to price them, and how to best organise their stores.
2. Healthcare: Data analytics can be used to evaluate patient data, spot trends in patient health, and create
individualised treatment regimens. Data analytics can be used by healthcare companies to enhance patient
outcomes and lower healthcare expenditures.
3. Finance: In the field of finance, data analytics can be used to evaluate investment data, spot trends in the
financial markets, and make wise investment decisions. Data analytics can be used by financial institutions
to lower risk and boost the performance of investment portfolios.
4. Marketing: By analysing customer data, spotting trends in consumer behaviour, and creating customised
marketing strategies, data analytics can be used in marketing. Data analytics can be used by marketers to
boost the efficiency of their campaigns and their overall impact.
5. Manufacturing: Data analytics can be used to examine production data, spot trends in production
methods, and boost production efficiency in the manufacturing sector. Data analytics can be used by
manufacturers to cut costs and enhance product quality.
6. Logistics or Transportation: The transportation sector can employ data analytics to evaluate logistics data,
spot trends in transportation patterns, and optimize transportation routes. Data analytics can help
transportation businesses cut expenses and speed up delivery times.
7. Banking Sector: Banking institutions gather large volumes of data to derive analytical
insights and make sound financial decisions. They identify probable loan defaulters, estimate customer churn
rates, and detect fraud in transactions, etc.
🡺 Types of Data Analytics: - Now that we have a working definition of data analytics, let us look at the four
types of data analytics:
∙ Predictive (forecasting) 🡪 (What is likely to happen in the future?)
∙ Descriptive (business intelligence and data mining) 🡪 (What happened?)
∙ Prescriptive (optimization and simulation) 🡪 (What is the best course of action?)
∙ Diagnostic analytics 🡪 (Why did it happen?)

Predictive Analytics: Predictive analytics tries to predict what is likely to happen in the future. This is
where data analysts start to come up with actionable, data-driven insights that the company can use to
inform their next steps.
Predictive analytics turns data into valuable, actionable information. It uses data
to determine the probable outcome of an event or the likelihood of a situation occurring. Predictive analytics
draws on a variety of statistical techniques from modeling, machine learning, data mining, and game theory that
analyze current and historical facts to make predictions about future events (see the sketch below). Techniques
that are used for predictive analytics are:

∙ Linear Regression
∙ Time series analysis and forecasting
∙ Data Mining
There are three basic cornerstones of predictive analytics:
∙ Predictive modeling
∙ Decision Analysis and optimization
∙ Transaction profiling
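A hedged sketch of the linear-regression technique listed above, fitting a trend to a few made-up monthly sales figures and forecasting the next month:

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical historical monthly sales figures
sales = np.array([120, 132, 129, 141, 150, 158, 163])
months = np.arange(len(sales)).reshape(-1, 1)

# Predictive modeling: fit a trend line to current and historical facts
model = LinearRegression().fit(months, sales)

# What is likely to happen next month?
next_month = np.array([[len(sales)]])
print("Forecast for next month:", model.predict(next_month)[0])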
🡪 Descriptive Analytics: Descriptive analytics is a simple, surface-level type of analysis that looks at what
has happened in the past. The two main techniques used in descriptive analytics are data aggregation and
data mining: the data analyst first gathers the data and presents it in a summarized format (that's the
aggregation part) and then "mines" the data to discover patterns.
Descriptive analytics looks at data and analyzes past events for insight into how to approach future
events. It looks at past performance and understands it by mining historical data to find the causes of
success or failure in the past. Almost all management reporting, such as sales, marketing, operations, and
finance, uses this type of analysis.
The descriptive model quantifies relationships in data in a way that is often used to classify customers or
prospects into groups. Unlike a predictive model that focuses on predicting the behaviour of a single
customer, Descriptive analytics identifies many different relationships between customer and product.

Common examples of descriptive analytics are company reports that provide historic reviews (see the sketch
after this list), such as:
∙ Data Queries
∙ Reports
∙ Descriptive Statistics
∙ Data dashboards
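A short sketch of the aggregation side of descriptive analytics, summarizing a small made-up orders table with pandas:

import pandas as pd

# Hypothetical historical sales records
orders = pd.DataFrame({
    "region":  ["North", "North", "South", "South", "East"],
    "revenue": [2500, 1800, 3200, 2100, 2700],
})

# Data aggregation: summarize what happened, per region
report = orders.groupby("region")["revenue"].agg(["count", "sum", "mean"])
print(report)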
🡪 Prescriptive Analytics: Prescriptive analytics advises on the actions and decisions that should be taken.
In other words, prescriptive analytics shows you how you can take advantage of the outcomes that have been
predicted. When conducting prescriptive analysis, data analysts will consider a range of possible scenarios
and assess the different actions the company might take.
Prescriptive analytics automatically synthesizes big data, mathematical sciences, business rules, and
machine learning to make a prediction and then suggests decision options to take advantage of that
prediction. Prescriptive analytics goes beyond predicting future outcomes by also suggesting actions that benefit
from the predictions and showing the decision maker the implications of each decision option. Prescriptive
analytics not only anticipates what will happen and when it will happen, but also why it will happen. Further,
prescriptive analytics can suggest decision options on how to take advantage of a future opportunity or
mitigate a future risk, and illustrate the implications of each decision option.
For example, Prescriptive Analytics can benefit healthcare strategic planning by using analytics to leverage
operational and usage data combined with data of external factors such as economic data, population
demography, etc.
🡪 Diagnostic Analytics: While descriptive analytics looks at the "what", diagnostic analytics explores the
"why". When running diagnostic analytics, data analysts will first seek to identify anomalies within the
data, that is, anything that cannot be explained by the data in front of them. For example, if the data shows
that there was a sudden drop in sales for the month of March, the data analyst will need to investigate the
cause.
In this analysis, we generally rely on historical data to answer a question or to find the
solution to a problem. We try to find dependencies and patterns in the historical data of the particular
problem.
For example, companies go for this analysis because it gives great insight into a problem, and they also
keep detailed information at their disposal; otherwise, data collection would have to be repeated for every
individual problem and would be very time-consuming. Common techniques used for diagnostic analytics
(see the sketch after this list) are:
∙ Data discovery
∙ Data mining
∙ Correlations
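A minimal sketch of the correlation technique used in diagnostic analytics, on made-up monthly figures around the March sales drop mentioned above:

import pandas as pd

# Hypothetical monthly figures; the third month shows the anomalous drop in sales
data = pd.DataFrame({
    "sales":    [510, 495, 310, 505, 498],
    "ad_spend": [60, 58, 20, 61, 59],
    "returns":  [12, 11, 13, 12, 10],
})

# Correlations hint at WHY the drop happened (here, sales track ad_spend closely)
print(data.corr()["sales"])

# Flag months that deviate strongly from the typical sales level
z_scores = (data["sales"] - data["sales"].mean()) / data["sales"].std()
print(data[z_scores.abs() > 1.5])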
🡪Benefits of Data Analytics
The ultimate result of all these activities enabled by data analytics is often visible in the bottom line of the
organization.
🡪Reduce costs by streamlining business operations, rightsizing technology spending, improving inventory
management and negotiating better with suppliers.
🡪Speed time-to-market by quickly identifying new product opportunities, enhancing development
processes, enabling faster testing and improving overall quality.
🡪Improve customer satisfaction by better meeting customer needs and giving customer service agents the
tools, training and support they need.
🡪Increase sales by improving product offerings, enhancing marketing efforts and empowering salespeople.
🡪Increase margins by reducing costs and optimizing prices.
🡪Improve the accuracy of forecasts by analyzing historical data and using machine learning to enable
predictive and prescriptive analytics.

🡪The Future of Data Analytics


Cloud computing: - Today, most data analytics is happening in the cloud, and that trend is likely to
increase. Because organizations store much of their data with cloud providers, it makes sense to analyze the data
where it is stored, to minimize costs and take advantage of the scalability and reliability of cloud services.
Artificial intelligence and machine learning: - Many of the most complex forms of data analytics,
including predictive and prescriptive analytics, rely on artificial intelligence and machine learning
capabilities. As these technologies advance, analytics will become even more powerful.
Synthetic data: - Privacy regulations often limit the amount of analytics that organizations can perform
directly on customer data. One of the ways to get around this is with synthetic data, which is anonymous
and usually generated by data models and algorithms.
Multiple analytics solutions and hubs: - Most large enterprises find that no single analytics solution meets
all their needs across the entire organization. Experts say that the most successful companies are likely to be
those who find innovative ways to combine their various analytics solutions and data repositories.

🡪Databases and Types of Data & Variables


🡺Databases: - In a database, a data type refers to the format of data storage that can hold a distinct type
or range of values. When computer programs store data in variables, each variable must be assigned a
distinct data type. Some common data types are: integers, characters, strings, floating-point
numbers and arrays. More specific data types include varchar (variable character) formats, Boolean
values, dates and timestamps.

Some programming languages require the programmer to declare the data type of a variable
before attaching a value to it, while other programming languages can automatically attach a data type to a
variable based on the initial data assigned to it. For example, if a variable is assigned the value
3.75, then the data type attached to the variable is floating point.

Most programming languages enable each variable to store only a single data type. For example, if the
data type attached to the variable is integer, then when you assign string data to the variable, the string data will
be converted to an integer format.

Database applications use data types. Database fields require a distinct type of data to be entered. For
example, a school record for a student may use a string data type for the student's first and last name. The
student's date of birth would be stored in a date format and the student's GPA can be stored as a decimal. By
ensuring that the data types are consistent across multiple records, database applications can easily perform
calculations, comparisons, searching and sorting of fields in different records.

🡪DB (Database) Data Types

Integer – is a whole number that can have a positive, negative or zero value. It cannot be a fraction, nor can it
have decimal places. It is commonly used in programming, especially for counting and incrementing values.
Addition, subtraction and multiplication of two integers results in an integer. But division of two integers may
result in an integer or a decimal. The resulting decimal can be rounded off or truncated to produce an integer.

Character – refers to any number, letter, space or symbol that can be entered in a computer. Each character
occupies one byte of space.

String – is used to represent text. It is composed of a set of characters that can have spaces and numbers.
Strings are enclosed in quotation marks to identify the data as string and not a variable name nor a number.

Floating Point Number – is a number that contains decimals. Numbers that contain fractions are also
considered as floating point numbers.
Array – contains a group of elements which can be of the same data type like an integer or string. It is used
to organise data for easier sorting and searching of related set of values.

Varchar – as the name implies, is a variable character format whose memory storage has variable length. Each
character occupies one byte of space, plus 2 bytes for length information. Note: use Character for data entries
with fixed length, like a phone number; use Varchar for data entries with variable length, like an address.

Boolean – is used for creating true or false statements. To compare values the following operators are being
used: AND, OR, XOR, and NOT.

Boolean Operator   Result   Condition
x AND y            True     If both x and y are True
x AND y            False    If either x or y is False
x OR y             True     If either x or y, or both x and y, are True
x OR y             False    If both x and y are False
x XOR y            True     If only one of x and y is True
x XOR y            False    If x and y are both True or both False
NOT x              True     If x is False
NOT x              False    If x is True

Date, Time and Timestamp – these data types are used to work with data containing dates and times.
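The same ideas can be seen in a short Python sketch; as described above, Python attaches a data type automatically from the value assigned:

# Python infers the data type from the value that is assigned
count = 42                     # integer
grade = 'A'                    # character (a one-character string in Python)
name = "Data Analytics"        # string
gpa = 3.75                     # floating-point number
scores = [78, 85, 91]          # array-like list of integers
passed = True                  # Boolean

# Boolean operators from the table above
x, y = True, False
print(x and y)   # False: both operands must be True
print(x or y)    # True: at least one operand is True
print(x != y)    # True: behaves like XOR (exactly one operand is True)
print(not x)     # False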

🡺Data and variable


We will define data and its importance. There are three related terms: variable, measurement and data.

🡪What is a variable? A variable is a characteristic of any entity being studied that is capable of taking on
different values. Say, for example, X is a variable: it can take any value; it may be 1, it may be 2, or it may be
0, and so on.
🡪What is measurement? Measurement is the standard process used to assign numbers to particular
attributes or characteristics of a variable. For the variable X, you want to substitute some value; to obtain
that value, you have to measure the characteristic of the variable, and that is nothing but measurement.
🡪What is data? Data are recorded measurements. So there is a variable; you measure the phenomenon, and
after measuring the phenomenon you substitute some value for the variable. The particular value the variable
takes is nothing but your data.
🡪 Why is data important? Data helps in making better decisions and in solving problems by
finding the reasons for underperformance. Suppose some company is not performing properly; by collecting
data we can identify the reason for this underperformance. Data also helps one evaluate
performance: what the current performance is, and how it benchmarks against the performance of other
business organizations. After benchmarking, data helps one improve performance. Data also helps one
understand consumers and markets, especially in a marketing context: you can understand who the right
consumers are and what kind of preferences they have in the market.
🡪What is generating so much data? Data can be generated in different ways: by humans, by machines, and
by human-machine combinations. For example, almost everybody now has a Facebook account, a LinkedIn
account, and accounts on various other social network sites. The availability of data is no longer the problem;
data is generated wherever information is created and stored, in structured or unstructured formats.
🡪How does data add value to the business? Assume that, after being gathered from various sources, the data
is stored in a data warehouse. From the data warehouse, the data can then be used for the development of a
data product.
🡪Attributes and their types in data analytics
Attributes: An attribute is a data item that appears as a property of a data entity. Machine learning literature
tends to use the term feature while statisticians prefer the term variable.

Example – Let's consider an example: name, address, email, etc. are the attributes of contact
information. Observed values for a given attribute are termed observations. The type of an attribute is
determined by the set of possible values it can take – nominal, binary, ordinal, or numeric.

Types of Attributes:
Nominal Attributes:
Nominal means "relating to names". The values of a nominal attribute are symbols or names of things. Each
value represents some kind of category, code or state, and so nominal attributes are also referred to as
categorical.
Example – Suppose that skin color and education status are two attributes describing person objects. In
our implementation, possible values for skin color are dark, white, and brown. The attribute education status
can contain the values undergraduate, postgraduate, and matriculate. Both skin color and education status are
nominal attributes.

Binary Attributes: A binary attribute is a category of nominal attribute that contains only two classes: 0 or
1, where 0 typically means that the attribute is absent and 1 means that it is present. Binary attributes are
referred to as Boolean if the two states correspond to true and false.

Example – Given the attribute drinker describing a patient object, 1 specifies that the patient drinks, while 0
specifies that the patient does not. Similarly, suppose the patient undergoes a medical test that has two
possible outcomes.
Ordinal Attributes: An ordinal attribute is an attribute whose possible values have a meaningful
sequence or ranking among them, but the magnitude between successive values is not known.

Example – Suppose that drink size corresponds to the sizes of drinks available at a restaurant. This
ordinal attribute has three possible values: small, medium, and large. The values have a meaningful
sequence (corresponding to increasing drink size); however, we cannot tell from the values how much
bigger, say, a large is than a medium.

Numeric Attributes: A numeric attribute is quantitative; that is, it is a measurable quantity represented by
integer or real values. Numeric attributes can be of two types, as follows: interval-scaled and ratio-scaled.

🡪Let's discuss the two kinds of numeric attributes one by one.

Interval – Scaled Attributes: Interval-scaled attributes are measured on a scale of equal-size units.
The values of interval-scaled attributes have order and can be positive, 0, or negative. Thus, in addition to
providing a ranking of values, such attributes allow us to compare and quantify the difference between
values.

Example – A temperature attribute is interval-scaled. We have different temperature values for every
new day, where each day is an entity. By ordering the values, we obtain a ranking of entities with
respect to temperature. In addition, we can quantify the difference between values; for
example, a temperature of 20 degrees C is five degrees higher than a temperature of 15 degrees C.

Ratio – Scaled Attributes: A ratio-scaled attribute is a numeric attribute with an inherent or fixed
zero point. In addition, the values are ordered, and we can compute the difference between values,
as well as the mean, median, and mode.
Example – The Kelvin (K) temperature scale has what is considered a true zero point. It is the point at
which the particles that make up matter have zero kinetic energy.

Discrete Attribute: A discrete attribute has a finite or countably infinite set of values, which may or may not
be represented as integers. The attributes skin color, drinker, medical report, and drink size each have a
finite number of values, and so are discrete.

Continuous Attribute: A continuous attribute has real numbers as attribute values.

Example – Height, weight, and temperature have real values. In practice, real values can only be represented
and measured using a finite number of digits. Continuous attributes are typically represented as floating-point
variables.
🡪 Data Modelling
Data modelling (also spelled data modeling) is the process of creating a data model for the data to be stored in a
database. This data model is a conceptual representation of data objects, the associations between different
data objects, and the rules.
Data modeling helps in the visual representation of data and enforces business rules, regulatory
compliance, and government policies on the data. Data models ensure consistency in naming conventions,
default values, semantics and security, while ensuring the quality of the data.
🡪Data Models in DBMS
A data model is defined as an abstract model that organizes data description, data semantics, and
consistency constraints of data. The data model emphasizes what data is needed and how it should be
organized, rather than what operations will be performed on the data. A data model is like an architect's
building plan, which helps to build conceptual models and set relationships between data items.
The two types of Data Modeling Techniques are
1. Entity Relationship (E-R) Model
2. UML (Unified Modelling Language)
🡪Why use Data Model?
The primary goals of using a data model are:
∙ Ensures that all data objects required by the database are accurately represented. Omission of data will
lead to creation of faulty reports and produce incorrect results.
∙ A data model helps design the database at the conceptual, physical and logical levels.
∙ The data model structure helps to define the relational tables, primary and foreign keys and stored
procedures.
∙ It provides a clear picture of the base data and can be used by database developers to create a physical
database.
∙ It is also helpful to identify missing and redundant data.
∙ Though the initial creation of the data model is labour- and time-consuming, in the long run it makes IT
infrastructure upgrades and maintenance cheaper and faster.
🡪Types of Data Models in DBMS
Types of Data Models: There are mainly three different types of data models: conceptual data models,
logical data models, and physical data models, and each one has a specific purpose. The data models are
used to represent the data and how it is stored in the database and to set the relationship between data items.
1. Conceptual Data Model: This Data Model defines WHAT the system contains. This model is
typically created by Business stakeholders and Data Architects. The purpose is to organize, scope
and define business concepts and rules.
2. Logical Data Model: Defines HOW the system should be implemented regardless of the DBMS.
This model is typically created by Data Architects and Business Analysts. The purpose is to
develop a technical map of rules and data structures.
3. Physical Data Model: This Data Model describes HOW the system will be implemented using a
specific DBMS system. This model is typically created by DBA and developers. The purpose is
actual implementation of the database.

(Figure: Types of Data Models)


🡪Conceptual Data Model
A Conceptual Data Model is an organized view of database concepts and their relationships. The purpose
of creating a conceptual data model is to establish entities, their attributes, and relationships. In this data
modeling level, there is hardly any detail available on the actual database structure. Business stakeholders
and data architects typically create a conceptual data model.
The three basic tenets of the Conceptual Data Model are
∙ Entity: A real-world thing
∙ Attribute: Characteristics or properties of an entity
∙ Relationship: Dependency or association between two entities
Data model example:
∙ Customer and Product are two entities. Customer number and name are attributes of the Customer
entity
∙ Product name and price are attributes of product entity
∙ Sale is the relationship between the customer and product
(Figure: Conceptual Data Model)
🡪Characteristics of a conceptual data model
∙ Offers organisation-wide coverage of the business concepts.
∙ This type of data model is designed and developed for a business audience.
∙ The conceptual model is developed independently of hardware specifications (like data storage capacity or
location) and software specifications (like DBMS vendor and technology). The focus is to represent data
as a user will see it in the "real world."
Conceptual data models, also known as domain models, create a common vocabulary for all stakeholders by
establishing basic concepts and scope.
🡪Logical Data Model
The Logical Data Model is used to define the structure of data elements and to set relationships between
them. The logical data model adds further information to the conceptual data model elements. The
advantage of using a Logical data model is to provide a foundation to form the base for the Physical model.
However, the modeling structure remains generic.

(Figure: Logical Data Model)


At this data modeling level, no primary or secondary key is defined, and you need to verify and adjust
the connector details that were set earlier for the relationships.
🡪Characteristics of a Logical data model
∙ Describes data needs for a single project but could integrate with other logical data models based on the
scope of the project.
∙ Designed and developed independently from the DBMS.
∙ Data attributes will have datatypes with exact precisions and length.
∙ Normalization is typically applied to the model up to the third normal form (3NF).
🡪Physical Data Model
A Physical Data Model describes a database-specific implementation of the data model. It offers database
abstraction and helps generate the schema. This is because of the richness of meta-data offered by a Physical
Data Model. The physical data model also helps in visualizing database structure by replicating database
column keys, constraints, indexes, triggers, and other RDBMS features.

(Figure: Physical Data Model)


🡪Characteristics of a physical data model:

∙ The physical data model describes the data needed for a single project or application, though it may be
integrated with other physical data models based on project scope.
∙ The data model contains relationships between tables, addressing the cardinality and nullability of
the relationships.
∙ Developed for a specific version of a DBMS, location, data storage or technology to be used in the
project.
∙ Columns should have exact data types, lengths assigned and default values.
∙ Primary and Foreign keys, views, indexes, access profiles, and authorizations, etc. are defined.
🡪Advantages and Disadvantages of Data Model:
Advantages of Data model:
∙ The main goal of designing a data model is to make certain that the data objects offered by the functional
team are represented accurately.
∙ The data model should be detailed enough to be used for building the physical database.
∙ The information in the data model can be used for defining the relationships between tables, primary and
foreign keys, and stored procedures.
∙ The data model helps business teams to communicate within and across organizations.
∙ The data model helps to document data mappings in the ETL process.
∙ It helps to recognize correct sources of data to populate the model.
Disadvantages of Data model:
∙ To develop a data model, one should know the characteristics of the physical data storage.
∙ A navigational data model produces complex application development and management; thus, it
requires knowledge of the underlying physical data.
∙ Even a small change made in the structure may require modification of the entire application.
∙ There is no set data manipulation language in the DBMS.
🡪Data Modelling Techniques
There are various techniques to achieve data modeling successfully, though the basic concepts remain the
same across techniques. Some popular data modeling techniques include Hierarchical, Relational, Network,
Entity-relationship, and Object-oriented.

🡪Hierarchical Technique
The hierarchical data modeling technique follows a tree-like structure where its nodes are sorted in a
particular order. A hierarchy is an arrangement of items represented as "above," "below," or "at the same
level as" each other. The hierarchical data modeling technique was implemented in the IBM Information
Management System (IMS) and was introduced in 1966.
It was a popular concept in a wide variety of fields, including computer science, mathematics, design,
architecture, systematic biology, philosophy, and social sciences. But it is rarely used now due to the
difficulties of retrieving and accessing data.
🡪Relational Technique
The relational data modeling technique is used to describe different relationships between entities, which
reduces the complexity and provides a clear overview. The relational model was first proposed as an
alternative to the hierarchical model by IBM researcher Edgar F. Codd in 1969. It has four different sets of
relations between the entities: one to one, one to many, many to one, and many to many.
🡪Network Technique
The network data modeling technique is a flexible way to represent objects and the underlying relationships
between entities, where the objects are represented inside nodes and the relationship between two nodes is
illustrated as an edge. It was inspired by the hierarchical technique and was originally introduced by Charles
Bachman in 1969.
The network data modeling technique makes it easier to convey complex relationships, as records can be
linked to multiple parent records.
🡪Entity-relationship technique
The entity-relationship (ER) data modeling technique represents entities and relationships between them in a
graphical format consisting of Entities, Attributes, and Relationships. The entities can be anything, such as
an object, a concept, or a piece of data. The entity-relationship data modeling technique was developed for
databases and introduced by Peter Chen in 1976. It is a high-level relational model that is used to define data
elements and relationships in a sophisticated information system.
🡪Object-Oriented Technique
The object-oriented data modeling technique constructs objects based on real-life scenarios, which
are represented as objects. The object-oriented methodologies were introduced in the early 1990s and were
inspired by a large group of leading data scientists.
It is a collection of objects that contain stored values, in which the values are nothing but objects. The
objects have similar functionalities and are linked to other objects.
🡪Data Modeling: An Integrated View
Data modeling is an essential technology for understanding relationships between data sets. The integrated
view of conceptual, logical, and physical data models helps users to understand the information and ensure
the right information is used across an entire enterprise.
Although data modeling can take time to perform effectively, it can save significant time and money by
identifying errors before they occur. Sometimes a small change in structure may require modification of an
entire application.
Some information systems, such as navigational systems, use complex application development and
management that requires advanced data modeling skills. There are many open-source Computer-Aided
Software Engineering (CASE) tools as well as commercial solutions that are widely used for this data modeling
purpose.

🡪Missing Imputations of Data


🡪What Is Data Imputation?
Data imputation is a method for retaining the majority of the dataset's data and information by substituting
missing data with a different value. These methods are employed because it would be impractical to remove
data from a dataset each time. Additionally, doing so would substantially reduce the dataset's size, raising
questions about bias and impairing analysis.
🡪Importance of Data Imputation
Let us see why exactly it is important. We employ imputation since missing data can lead to the following
problems:
∙ Distorts Dataset: Large amounts of missing data can lead to anomalies in the variable distribution, which
can change the relative importance of different categories in the dataset.
∙ Unable to work with the majority of machine learning-related Python libraries: When utilizing ML
libraries (SkLearn is the most popular), mistakes may occur because there is no automatic handling of
these missing data.
∙ Impacts on the Final Model: Missing data may lead to bias in the dataset, which could affect the final
model's analysis.
∙ Desire to restore the entire dataset: This typically occurs when we don't want to lose any (or any more) of
the data in our dataset because all of it is crucial. Additionally, while the dataset is not very large,
eliminating a portion of it could have a substantial effect on the final model.
Data Imputation Techniques
After learning about what data imputation is and its importance, we will now learn about some of the various
data imputation techniques.
These are some of the data imputation techniques that we will be discussing in depth:
🡺 Next or Previous Value
🡺 K Nearest Neighbors
🡺 Maximum or Minimum Value
🡺 Missing Value Prediction
🡺 Most Frequent Value
🡺 Average or Linear Interpolation
🡺 (Rounded) Mean or Moving Average or Median Value
🡺 Fixed Value
1. Next or Previous Value
For time-series data or other ordered data, there are specific imputation techniques. These techniques take into
consideration the dataset's sorted structure, wherein nearby values are likely more comparable than far-off
ones. A common method for imputing incomplete data in a time series is to substitute the next or previous
value in the series for the missing value. This strategy is effective for both nominal
and numerical values.
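A quick pandas sketch of previous/next-value imputation on a hypothetical time series:

import numpy as np
import pandas as pd

# Hypothetical daily readings with gaps
readings = pd.Series([21.0, np.nan, 22.5, np.nan, 23.1],
                     index=pd.date_range("2024-01-01", periods=5))

print(readings.ffill())   # previous value fills the gap (forward fill)
print(readings.bfill())   # next value fills the gap (backward fill)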
2. K Nearest Neighbors
The objective is to find the k nearest examples in the data where the value in the relevant feature is not
absent and then substitute the value of the feature that occurs most frequently in the group.
3. Maximum or Minimum Value
You can use the minimum or maximum of the range as the replacement value for missing values if you
know that the data must fit within a specific range [minimum, maximum] and if you know from the
process of data collection that the measurement instrument stops recording and the signal saturates beyond
one of those boundaries. For instance, if a price cap has been reached in a financial exchange and the
exchange procedure has been halted, the missing price can be substituted with the minimum value of the
exchange boundary.
4. Missing Value Prediction
Another popular method for single imputation is to use a machine learning model to determine the final
imputation value for feature x based on the other features. The model is trained using the values in the
remaining columns, and the rows of feature x without missing values are utilized as the training set.
Depending on the type of feature, we can employ any regression or classification model in this situation.
The trained model is then used to predict the most likely value for each missing value in all samples.
A basic imputation approach, such as the mean value, is used to temporarily impute all missing values when
there is missing data in more than one feature field. Then, one column's values are set back to missing. After
training, the model is used to fill in the missing values of that column. In this manner, a model is trained for
every feature that has a missing value, until all of the missing values have been imputed.
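This round-robin idea is what scikit-learn's IterativeImputer implements (it is still flagged as experimental, hence the extra import); a hedged sketch on a tiny made-up matrix:

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical feature matrix (e.g. age and income) with missing entries
X = np.array([[25.0, 50000.0],
              [32.0, np.nan],
              [np.nan, 61000.0],
              [45.0, 72000.0]])

# Each feature with missing values is modelled from the other features
imputer = IterativeImputer(max_iter=10, random_state=0)
print(imputer.fit_transform(X))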
5. Most Frequent Value
The most frequent value in the column is used to replace the missing values in another popular technique
that is effective for both nominal and numerical features.
6. Average or Linear Interpolation
Average or linear interpolation, which calculates a value between the previous and next available values
and substitutes it for the missing value, is similar to previous/next value imputation but is only applicable to
numerical data. Of course, as with other operations on ordered data, it is crucial to sort the data accurately in
advance, for example, in the case of time-series data, by timestamp.
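In pandas, linear interpolation on a sorted numerical series is a one-liner; a small made-up example:

import numpy as np
import pandas as pd

# Hypothetical ordered price series with gaps
prices = pd.Series([100.0, np.nan, np.nan, 112.0, 115.0])

# Each missing value is estimated between the previous and next known values
print(prices.interpolate(method="linear"))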
7. (Rounded) Mean or Moving Average or Median Value
Mean, rounded mean, or median values are further popular imputation techniques for numerical features. In
this case, the technique replaces the null values with the mean, rounded mean, or median value determined
for that feature across the whole dataset. It is advised to use the median rather than the mean when your
dataset has a significant number of outliers.
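scikit-learn's SimpleImputer covers the mean, median and most-frequent strategies (sections 5 and 7), and KNNImputer from the same module covers the k-nearest-neighbours technique from section 2; a brief sketch on made-up data:

import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

X = np.array([[7.0, 2.0],
              [np.nan, 3.0],
              [9.0, np.nan],
              [8.0, 2.0]])

print(SimpleImputer(strategy="mean").fit_transform(X))           # column mean
print(SimpleImputer(strategy="median").fit_transform(X))         # column median
print(SimpleImputer(strategy="most_frequent").fit_transform(X))  # mode
print(KNNImputer(n_neighbors=2).fit_transform(X))                # k nearest neighbours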
8. Fixed Value
Fixed value imputation is a universal technique that replaces the null data with a fixed value and is
applicable to all data types. You can impute the null values in a survey using "not answered" as an example
of using fixed imputation on nominal features.
Since we have explored single imputation, its importance, and its techniques, let us now learn about
Multiple imputations.
What Is Multiple Imputation?
Single imputation treats an unknown missing value as though it were a true value by substituting a single
value for it [Rubin, 1988]. Single imputation overlooks uncertainty as a result, and it almost invariably
understates variation. This issue is solved by multiple imputations, which account for both within- and
between-imputation uncertainty.
For each missing value, multiple data imputation approaches generate n candidate values. Each of these
n values is a plausible value, and n fresh datasets are produced as though a straightforward
imputation had taken place in each dataset.
In this fashion, a single table with missing values creates n brand-new datasets, which are then individually
analysed using standard techniques. In a subsequent phase, these analyses are combined to produce, or pool,
the results for that dataset (see the sketch after the steps below).
The following steps take place in multiple imputations
Step 1: A collection of n values to be imputed is created for each attribute of a data set record that is
missing a value;
Step 2: Utilizing one of the n replacement ideas produced in the previous item, a statistical analysis is
carried out on each data set;
Step 3: A set of results is created by combining the findings of the various analyses.
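One hedged way to approximate these steps in Python is to run scikit-learn's IterativeImputer several times with sample_posterior=True, so each run draws a different plausible completion; the n analyses are then pooled (the data below is made up for illustration):

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical records: age, body mass index, systolic blood pressure (with gaps)
X = np.array([[62, 27.1, 140.0],
              [55, 31.4, np.nan],
              [71, 24.8, 155.0],
              [48, np.nan, 128.0]])

n_imputations = 5
estimates = []
for seed in range(n_imputations):
    # Step 1: sample_posterior=True draws each imputation from a predictive
    # distribution, so the n completed datasets differ from one another
    completed = IterativeImputer(sample_posterior=True,
                                 random_state=seed).fit_transform(X)
    # Step 2: run the analysis of interest on each completed dataset
    estimates.append(completed[:, 2].mean())

# Step 3: pool the n analyses into a single result (a simple average here)
print("Pooled mean systolic blood pressure:", np.mean(estimates))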
We will now try to understand this in a better way by looking at an example.
Example of Multiple Imputation
A perfect example of Multiple Data Imputation is explained below.
Think about a study where some participants' systolic blood pressure information is missing, such as one
looking at the relationship between systolic blood pressure and the risk of developing coronary heart disease
later on. Age (older patients are more likely to have their systolic blood pressure measured by a doctor),
rising body mass index, and a history of smoking all reduce the likelihood that it is missing.
We can use multiple imputation to estimate the overall association between systolic blood pressure and heart
disease if we assume that data are missing at random and we have systolic blood pressure data on a
representative sample of people within strata of age, body mass index, smoking and coronary heart disease.
Multiple imputation has the potential to increase the reliability of medical studies. However, when using the
multiple imputation process, the user must model the distribution of each variable with missing values in
terms of the observed data. The multiple imputation models must be specified carefully and appropriately in
order for the results to be valid. If at all possible, specialized statistical assistance should be sought before
using multiple imputation as if it were a standard procedure that can be applied at the touch of a button.
