Unit-1 DA
Data Analytics
Introduction: - Most companies collect loads of data all the time but, in its raw form, this data doesn't really mean anything. Data analytics helps you make sense of the past and predict future trends and behaviors; rather than basing your decisions and strategies on guesswork, you make informed choices based on what the data is telling you. Armed with the insights drawn from the data, businesses and organizations develop a much deeper understanding of their audience, their industry, and their company as a whole and, as a result, are much better equipped to make decisions and plan ahead.
🡺 Understand the problem: - You need to understand the business problem, define the organizational goals and plan for a lucrative solution; also identify the key performance indicators and decide which metrics to track along the way.
🡺 Data Collection: - Data collection is the process of gathering data and information on the target variables identified in the data requirements. Gather the right data from various sources and other information based on your priorities. Data collection usually starts with primary sources, also known as internal sources; this is typically structured data gathered from CRM software, ERP systems, marketing automation tools and others. These sources contain information about customers, finances, gaps in sales and so on. From external sources you have both structured and unstructured data, so if you want a sentiment analysis of your brand you would gather data from various review websites or social media apps.
🡺 Data Cleaning: - The data collected from various sources is highly likely to contain incomplete records, duplicates and missing values, so you need to clean the unwanted data to make it ready for analysis. Clean the data to remove unwanted, redundant and missing values and make it ready for analysis so that it generates accurate results. An analytics professional must identify duplicate data, anomalous data and other inconsistencies. According to reports, around 60% of data scientists say that most of their time goes into cleaning data, and 57% name it as the least enjoyable part of their work.
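As a rough illustration of this cleaning step, the sketch below uses pandas on a hypothetical customers.csv file; the file name and column names (customer_id, revenue, email) are assumptions for illustration only.

```python
# Minimal data-cleaning sketch with pandas (hypothetical file and columns).
import pandas as pd

df = pd.read_csv("customers.csv")                   # raw data gathered earlier

df = df.drop_duplicates()                           # drop exact duplicate rows
df = df.dropna(subset=["customer_id"])              # drop rows missing the key identifier
df["revenue"] = df["revenue"].fillna(0)             # fill missing numeric values
df["email"] = df["email"].str.strip().str.lower()   # normalise inconsistent text

df.info()                                           # check types and row counts after cleaning
```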
🡺 Data Exploration and Analysis: - Once the data is clean and ready, you can go ahead and explore it using data visualization and business intelligence tools. Data exploration uses data visualization techniques, business intelligence tools, data mining techniques and predictive modelling to analyse the data and build models. You can also use supervised and unsupervised learning algorithms such as linear regression, logistic regression, decision trees, k-NN, k-means clustering and lots more to build prediction models for making business decisions.
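The algorithms named above are available in libraries such as scikit-learn. The sketch below is only a minimal illustration on synthetic data (the features, target and cluster count are made up), not a prescribed workflow.

```python
# Minimal modelling sketch with scikit-learn on synthetic data.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                    # two synthetic features
y = (X[:, 0] + X[:, 1] > 0).astype(int)          # a simple binary target

# Supervised learning: fit and evaluate a logistic regression classifier.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression().fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))

# Unsupervised learning: group the same observations into clusters.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("cluster sizes:", np.bincount(km.labels_))
```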
🡺 Interpret the results: - This part is important: it is where a business gains actual value from the previous four steps. Interpret the results to find hidden patterns, future trends and insights. You can run validation checks to confirm that the results answer your questions, and these results can then be shown to your clients and stakeholders for better understanding.
Some of the most commonly used data analytics tools are:
∙ SAS
∙ Microsoft Excel
∙ R Programming Language
∙ Python Programming Language
∙ Power BI (Business Intelligence tool)
∙ Apache Spark
∙ QlikView
∙ Tableau
∙ RapidMiner
∙ KNIME
1. SAS:
∙ Type of tool: Statistical software suite.
∙ Availability: Commercial.
∙ Mostly used for: Business intelligence, multivariate, and predictive analysis.
∙ Pros: Easily accessible, business-focused, good user support.
∙ Cons: High cost, poor graphical representation.
🡪SAS (which stands for Statistical Analysis System) is a popular commercial suite of business intelligence
and data analysis tools. It was developed by the SAS Institute in the 1960s and has evolved ever since. Its
main use today is for profiling customers, reporting, data mining, and predictive modeling. Created for an
enterprise market, the software is generally more robust, versatile, and easier for large organizations to use.
This is because they tend to have varying levels of in-house programming expertise.
But as a commercial product, SAS comes with a hefty price tag. Nevertheless, with cost comes benefits; it
regularly has new modules added, based on customer demand. Although it has fewer of these than, say, Python libraries, they are highly focused. For instance, it offers modules for specific uses such as anti-money laundering and analytics for the Internet of Things.
2. Microsoft Excel:
∙ Type of tool: Spreadsheet software.
∙ Availability: Commercial.
∙ Mostly used for: Data wrangling and reporting.
∙ Pros: Widely-used, with lots of useful functions and plug-ins.
∙ Cons: Cost, calculation errors, poor at handling big data.
Excel: the world‘s best-known spreadsheet software. What‘s more, it features calculations and graphing
functions that are ideal for data analysis. Whatever your specialism, and no matter what other software you
might need, Excel is a staple in the field. Its invaluable built-in features include pivot tables (for sorting or
totaling data) and form creation tools. It also has a variety of other functions that streamline data
manipulation. For instance, the CONCATENATE function allows you to combine text, numbers, and dates
into a single cell. SUMIF lets you create value totals based on variable criteria, and Excel‘s search function
makes it easy to isolate specific data.
It has limitations though. For instance, it runs very slowly with big datasets and tends to approximate large numbers, leading to inaccuracies. Nevertheless, it's an important and powerful data analysis tool, and with many plug-ins available, you can easily bypass Excel's shortcomings.
3. R Programming Language
∙ Type of tool: Programming language.
∙ Availability: Open-source.
∙ Mostly used for: Statistical analysis and data mining.
∙ Pros: Platform independent, highly compatible, lots of packages.
∙ Cons: Slower, less secure, and more complex to learn than Python.
🡪R, like Python, is a popular open-source programming language. It is commonly used to create statistical/data analysis software. R's syntax is more complex than Python's and the learning curve is steeper. However, it was built specifically to deal with heavy statistical computing tasks and is very popular for data visualization. A bit like Python, R also has a network of freely available code, called CRAN (the Comprehensive R Archive Network), which offers 10,000+ packages. 🡪R is the leading analytics tool in the industry and is widely used for statistics and data modeling. It can easily manipulate data and present it in different ways. It has exceeded SAS in many ways, such as data capacity, performance and outcomes. R compiles and runs on a wide variety of platforms, viz. UNIX, Windows and macOS.
4. Python Programming Language
∙ Type of tool: Programming language.
∙ Availability: Open-source, with thousands of free libraries.
∙ Used for: Everything from data scraping to analysis and reporting.
∙ Pros: Easy to learn, highly versatile, widely-used.
∙ Cons: Memory intensive—doesn‘t execute as fast as some other languages.
🡪Python is an object-oriented scripting language which is easy to read, write, and maintain. Plus, it is a free
open source tool. It was developed by Guido van Rossum in the late 1980s and supports both functional and
structured programming methods. Python is easy to learn as it is very similar to JavaScript, Ruby, and PHP.
Also, Python has very good machine learning libraries.
🡪Python is also extremely versatile; it has a huge range of resource libraries suited to a variety of different
data analytics tasks. For example, the NumPy and pandas libraries are great for streamlining highly
computational tasks, as well as supporting general data manipulation.
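For instance, a small and purely illustrative pandas/NumPy snippet for general data manipulation might look like the following; the sales figures and column names are invented.

```python
# Small illustration of pandas/NumPy for everyday data manipulation.
import numpy as np
import pandas as pd

sales = pd.DataFrame({
    "region": ["North", "South", "North", "South"],
    "units":  [120, 95, 130, 80],
    "price":  [9.99, 9.99, 10.49, 10.49],
})
sales["revenue"] = sales["units"] * sales["price"]        # vectorised arithmetic
summary = sales.groupby("region")["revenue"].agg(["sum", "mean"])
print(summary)
print("overall revenue:", np.round(sales["revenue"].sum(), 2))
```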
🡪Libraries like Beautiful Soup and Scrapy are used to scrape data from the web, while Matplotlib is
excellent for data visualization and reporting. Python‘s main drawback is its speed—it is memory intensive
and slower than many languages. In general though, if you‘re building software from scratch, Python‘s
benefits far outweigh its drawbacks.
🡪Some of the companies that use Python for data analytics include Instagram, Facebook, Spotify and Amazon.
5. Tableau:
∙ Type of tool: Data visualization tool.
∙ Availability: Commercial.
∙ Mostly used for: Creating data dashboards and worksheets.
∙ Pros: Great visualizations, speed, interactivity, mobile support.
∙ Cons: Poor version control, no data pre-processing.
If you're looking to create interactive visualizations and dashboards without extensive coding expertise, Tableau is one of the best commercial data analysis tools available. The suite handles large amounts of data better than many other BI tools, and it is very simple to use. It has a visual drag-and-drop interface (another definite advantage over many other data analysis tools). However, because it has no scripting layer, there's a limit to what Tableau can do. For instance, it's not great for pre-processing data or building more complex calculations.
While it does contain functions for manipulating data, these aren‘t great. As a rule, you‘ll need to carry out
scripting functions using Python or R before importing your data into Tableau. But its visualization is pretty
top-notch, making it very popular despite its drawbacks. Furthermore, it‘s mobile-ready. As a data analyst,
mobility might not be your priority, but it‘s nice to have if you want to dabble on the move!
6. Rapid Miner:
RapidMiner is a powerful integrated data science platform. It is developed by the company of the same name and performs predictive analysis and other advanced analytics like data mining, text analytics, machine learning and visual analytics without any programming. RapidMiner can incorporate almost any data source type, including Access, Excel, Microsoft SQL, Teradata, Oracle, Sybase, IBM DB2, Ingres, MySQL, IBM SPSS, dBase etc. The tool is very powerful and can generate analytics based on real-life data transformation settings, i.e. you can control the formats and data sets for predictive analysis.
Approximately 774 companies use RapidMiner and most of these are US-based. Some of the esteemed companies on that list include the Boston Consulting Group and Domino's Pizza Inc.
7. KNIME:
∙ Type of tool: Data integration platform.
∙ Availability: Open-source.
∙ Mostly used for: Data mining and machine learning.
∙ Pros: Open-source platform that is great for visually-driven programming.
∙ Cons: Lacks scalability and technical expertise is needed for some functions.
Last on our list is KNIME (Konstanz Information Miner), an open-source, cloud-based, data integration
platform. It was developed in 2004 by software engineers at Konstanz University in Germany. Although first
created for the pharmaceutical industry, KNIME‘s strength in accruing data from numerous sources into a
single system has driven its application in other areas. These include customer analysis, business
intelligence, and machine learning.
Its main draw (besides being free) is its usability. A drag-and-drop graphical user interface (GUI) makes it
ideal for visual programming. This means users don‘t need a lot of technical expertise to create data
workflows. While it claims to support the full range of data analytics tasks, in reality, its strength lies in data
mining. Though it offers in-depth statistical analysis too, users will benefit from some knowledge of Python
and R. Being open-source, KNIME is very flexible and customizable to an organization‘s needs— without
heavy costs. This makes it popular with smaller businesses, who have limited budgets.
8. Apache SPARK: -
Apache Spark is an open-source data analytics engine that processes data in real time and carries out complex analytics using SQL queries and machine learning algorithms. It supports Spark Streaming for real-time analytics and Spark SQL for writing SQL queries; it also includes MLlib, a library of machine learning algorithms, and GraphX for graph computation.
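A minimal sketch of how Spark SQL is used from Python (PySpark) is shown below; the table name and sample rows are assumptions for illustration, and the pyspark package must be installed.

```python
# Minimal PySpark sketch: a tiny in-memory DataFrame queried via Spark SQL.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("analytics-demo").getOrCreate()

df = spark.createDataFrame(
    [("North", 120), ("South", 95), ("North", 130)],
    ["region", "units"],
)
df.createOrReplaceTempView("sales")                 # expose the DataFrame to Spark SQL

spark.sql(
    "SELECT region, SUM(units) AS total_units FROM sales GROUP BY region"
).show()

spark.stop()
```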
9. QlikView
QlikView has many unique features, like patented technology and in-memory data processing, which delivers results to end users very fast and stores the data in the report itself. Data association in QlikView is automatically maintained, and the data can be compressed to almost 10% of its original size. Data relationships are visualized using colours, where a specific colour is given to related data and another colour to non-related data.
10. Power BI.
∙ Type of tool: Business analytics suite.
∙ Availability: Commercial software (with a free version available).
∙ Mostly used for: Everything from data visualization to predictive analytics.
∙ Pros: Great data connectivity, regular updates, good visualizations.
∙ Cons: Clunky user interface, rigid formulas, data limits (in the free version).
At less than a decade old, Power BI is a relative newcomer to the market of data analytics tools. It began life
as an Excel plug-in but was redeveloped in the early 2010s as a standalone suite of business data analysis
tools. Power BI allows users to create interactive visual reports and dashboards, with a minimal learning
curve. Its main selling point is its great data connectivity—it operates seamlessly with Excel (as you‘d
expect, being a Microsoft product) but also text files, SQL server, and cloud sources, like Google and
Facebook analytics.
It also offers strong data visualization but has room for improvement in other areas. For example, it has quite a bulky user interface, rigid formulas, and its proprietary language (Data Analysis Expressions, or 'DAX') is not that user-friendly. It does offer several subscriptions though, including a free one. This is great if you want to get to grips with the tool, although the free version does have drawbacks; the main limitation being the low data limit (around 2GB).
Predictive Analytics: Predictive analytics tries to predict what is likely to happen in the future. This is
where data analysts start to come up with actionable, data-driven insights that the company can use to
inform their next steps.
Predictive analytics turns data into valuable, actionable information. It uses data to determine the probable outcome of an event or the likelihood of a situation occurring. Predictive analytics draws on a variety of statistical techniques from modeling, machine learning, data mining, and game theory that analyze current and historical facts to make predictions about future events (a simple regression-based forecast is sketched after the lists below). Techniques that are used for predictive analytics are:
∙ Linear Regression
∙ Time series analysis and forecasting
∙ Data Mining
There are three basic cornerstones of predictive analytics:
∙ Predictive modeling
∙ Decision Analysis and optimization
∙ Transaction profiling
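As referenced above, here is a small sketch of predictive analytics using linear regression on a time index with NumPy; the monthly sales figures are invented for illustration.

```python
# Simple predictive sketch: fit a linear trend to past sales and forecast ahead.
import numpy as np

months = np.arange(1, 13)                                   # past 12 months
sales = 100 + 5 * months + np.random.default_rng(1).normal(0, 3, 12)

slope, intercept = np.polyfit(months, sales, deg=1)         # fit a trend line
next_month = 13
forecast = slope * next_month + intercept
print(f"forecast for month {next_month}: {forecast:.1f}")
```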
🡪 Descriptive Analytics: Descriptive analytics is a simple, surface-level type of analysis that looks at what has happened in the past. The two main techniques used in descriptive analytics are data aggregation and data mining: the data analyst first gathers the data and presents it in a summarized format (that's the aggregation part) and then "mines" the data to discover patterns.
Descriptive analytics looks at data and analyzes past events for insight into how to approach future events. It looks at past performance and understands it by mining historical data to find the cause of success or failure in the past. Almost all management reporting, such as sales, marketing, operations, and finance, uses this type of analysis.
The descriptive model quantifies relationships in data in a way that is often used to classify customers or
prospects into groups. Unlike a predictive model that focuses on predicting the behaviour of a single
customer, Descriptive analytics identifies many different relationships between customer and product.
Common examples of descriptive analytics are company reports that provide historic reviews, like the following (a small descriptive-statistics sketch follows this list):
∙ Data Queries
∙ Reports
∙ Descriptive Statistics
∙ Data dashboards
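The sketch referenced above: a minimal example of data aggregation and descriptive statistics with pandas. The order data and column names are invented.

```python
# Descriptive analytics sketch: aggregate and summarise historical data.
import pandas as pd

orders = pd.DataFrame({
    "month":   ["Jan", "Jan", "Feb", "Feb", "Mar", "Mar"],
    "channel": ["web", "store", "web", "store", "web", "store"],
    "revenue": [200, 150, 220, 140, 90, 160],
})

# Data aggregation: summarise what happened, per month and channel.
print(orders.pivot_table(values="revenue", index="month",
                         columns="channel", aggfunc="sum"))

# Descriptive statistics: overall distribution of revenue.
print(orders["revenue"].describe())
```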
🡪 Prescriptive Analytics: Prescriptive analytics advises on the actions and decisions that should be taken.
In other words, prescriptive analytics shows you how you can take advantage of the outcomes that have been
predicted. When conducting prescriptive analysis, data analysts will consider a range of possible scenarios
and assess the different actions the company might take.
Prescriptive analytics automatically synthesizes big data, mathematical science, business rules, and machine learning to make a prediction and then suggests decision options to take advantage of the prediction. Prescriptive analytics goes beyond predicting future outcomes by also suggesting actions to benefit from the predictions and showing the decision maker the implications of each decision option. Prescriptive analytics not only anticipates what will happen and when it will happen, but also why it will happen. Further, prescriptive analytics can suggest decision options on how to take advantage of a future opportunity or mitigate a future risk, and illustrate the implications of each decision option.
For example, Prescriptive Analytics can benefit healthcare strategic planning by using analytics to leverage
operational and usage data combined with data of external factors such as economic data, population
demography, etc.
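As a hedged illustration of how a decision option can be suggested under constraints, the sketch below uses linear programming via SciPy; the products, profit figures and capacity limits are entirely made up.

```python
# Prescriptive sketch: choose the best production plan under constraints.
from scipy.optimize import linprog

# Decide how many units of products A and B to make.
# Maximise 40*A + 30*B  ->  minimise the negated objective.
c = [-40, -30]
A_ub = [[1, 1],      # machine hours:  A + B  <= 100
        [2, 1]]      # raw material:  2A + B  <= 150
b_ub = [100, 150]

result = linprog(c, A_ub=A_ub, b_ub=b_ub,
                 bounds=[(0, None), (0, None)], method="highs")
print("optimal plan (units of A, B):", result.x)
print("maximum profit:", -result.fun)
```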
🡪 Diagnostic Analytics: While descriptive analytics looks at the "what", diagnostic analytics explores the "why". When running diagnostic analytics, data analysts will first seek to identify anomalies within the
data—that is, anything that cannot be explained by the data in front of them. For example: If the data shows
that there was a sudden drop in sales for the month of March, the data analyst will need to investigate the
cause.
In this analysis, we generally use historical data over other data to answer any question or for the
solution of any problem. We try to find any dependency and pattern in the historical data of the particular
problem.
For example, companies go for this analysis because it gives great insight into a problem, and they also keep detailed information at their disposal; otherwise data collection may turn out to be separate for every problem and would be very time-consuming. Common techniques used for diagnostic analytics are listed below, followed by a small correlation sketch:
∙ Data discovery
∙ Data mining
∙ Correlations
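The correlation sketch referenced above uses pandas on invented historical figures to show how an analyst might look for factors that move with the anomaly (the sudden sales drop).

```python
# Diagnostic sketch: which factors correlate with the drop in sales?
import pandas as pd

history = pd.DataFrame({
    "sales":        [500, 480, 510, 300, 495],
    "ad_spend":     [50, 48, 52, 20, 49],
    "site_outages": [0, 1, 0, 6, 0],
})

# Correlation of every column with sales, sorted for easy reading.
print(history.corr()["sales"].sort_values())
```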
🡪Benefits of Data Analytics
The ultimate result of all these activities enabled by data analytics is often visible in the bottom line of the
organization.
🡪Reduce costs by streamlining business operations, rightsizing technology spending, improving inventory management and negotiating better terms with suppliers.
🡪Speed time-to-market by quickly identifying new product opportunities, enhancing development
processes, enabling faster testing and improving overall quality.
🡪Improve customer satisfaction by better meeting customer needs and giving customer service agents the
tools, training and support they need.
🡪Increase sales by improving product offerings, enhancing marketing efforts and empowering salespeople.
🡪Increase margins by reducing costs and optimizing prices.
🡪Improve the accuracy of forecasts by analyzing historical data and using machine learning to enable
predictive and prescriptive analytics.
🡪 Data Types
There are programming languages that require the programmer to determine the data type of a variable before attaching a value to it, while some programming languages can automatically attach a data type to a variable based on the initial data assigned to it. For example, if a variable is assigned the value "3.75", then the data type attached to the variable is floating point. (A short Python illustration follows the type definitions below.)
Most of the programming languages enable each variable to store only a single data type. For example, if the
data type attached to the variable is integer, when you assign a string data to the variable, the string data will
be converted to an integer format.
Database applications also use data types. Database fields require a distinct type of data to be entered. For example, a school record for a student may use a string data type for the student's first and last name. The student's date of birth would be stored in a date format and the student's GPA can be stored as a decimal. By ensuring that the data types are consistent across multiple records, database applications can easily perform calculations, comparisons, searching and sorting of fields in different records.
Integer – is a whole number that can have a positive, negative or zero value. It cannot be a fraction nor have decimal places. It is commonly used in programming, especially for incrementing values. Addition, subtraction and multiplication of two integers results in an integer, but division of two integers may result in an integer or a decimal. The resulting decimal can be rounded off or truncated to produce an integer.
Character – refers to any number, letter, space or symbol that can be entered in a computer. Each character
occupies one byte of space.
String – is used to represent text. It is composed of a set of characters that can have spaces and numbers.
Strings are enclosed in quotation marks to identify the data as string and not a variable name nor a number.
Floating Point Number – is a number that contains decimals. Numbers that contain fractions are also
considered as floating point numbers.
Array – contains a group of elements which can be of the same data type like an integer or string. It is used
to organise data for easier sorting and searching of related set of values.
Varchar – as the name implies, is a variable-length character type, as its memory storage has variable length. Each character occupies one byte of space plus 2 bytes for length information. Note: use Character for data entries with fixed length, like phone numbers, and Varchar for data entries with variable length, like addresses.
Boolean – is used for creating true or false statements. To compare values the following operators are being
used: AND, OR, XOR, and NOT.
Date, Time and Timestamp – these data types are used to work with data containing dates and times.
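The Python illustration referenced earlier: it shows how a dynamically typed language attaches a data type based on the assigned value. All variable names and values are arbitrary examples.

```python
# Illustration of common data types inferred from assigned values.
from datetime import date

age      = 21                  # integer
gpa      = 3.75                # floating point number
initial  = "A"                 # single character (a one-character string in Python)
name     = "Ada Lovelace"      # string
grades   = [78, 85, 92]        # array-like list of integers
enrolled = True                # boolean
birthday = date(2004, 5, 17)   # date

for value in (age, gpa, initial, name, grades, enrolled, birthday):
    print(type(value).__name__, "->", value)
```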
🡪What is a variable? A variable is a characteristic of any entity being studied that is capable of taking on different values. For example, if X is the variable, it can take any value: it may be 1, it may be 2, it may be 0, and so on.
🡪What is measurement? Measurement is the standard process used to assign numbers to particular attributes or characteristics of a variable. For that X, you want to substitute some values; to obtain a value, you have to measure the characteristic of the variable, and that is nothing but measurement.
🡪What is data? Data are recorded measurements. There is a variable; you measure the phenomenon, and after measuring the phenomenon you substitute some value for the variable, so the variable takes a particular value, and that value is nothing but your data.
🡪 Why is data important? Data helps in making better decisions and in solving problems by finding the reason for underperformance. Suppose some company is not performing properly; by collecting data we can identify the reason for this underperformance. Data helps one to evaluate current performance, and it can also be used for benchmarking the performance of a business organization; after benchmarking, data helps one to improve that performance. Data also helps one understand consumers and markets, especially in the marketing context: you can understand who the right consumers are and what kind of preferences they have in the market.
🡪What is generating so much data? Data can be generated in different ways: by humans, by machines, and by human-machine combinations. Nowadays everybody has a Facebook account and a LinkedIn account, and we are on various social networking sites, so the availability of data is no longer the problem. Data can be generated anywhere information is created and stored, in structured or unstructured format.
🡪How does data add value to the business? After the data is obtained from various sources, assume it is stored in the form of a data warehouse. From the data warehouse, the data can be used for the development of a data product.
🡪Attributes and its types in data analytics
Attributes: An attribute is a data item that appears as a property of a data entity. Machine learning literature
tends to use the term feature while statisticians prefer the term variable.
Example – name, address, email, etc. are attributes of contact information. Observed values for a given attribute are termed observations. The type of an attribute is determined by the set of feasible values: nominal, binary, ordinal, or numeric.
Types of Attributes:
Nominal Attributes:
Nominal means "relating to names". The values of a nominal attribute are symbols or names of things. Each value represents some kind of category, code or state, and so nominal attributes are also referred to as categorical.
Example – Suppose that skin color and education status are two attributes describing person objects. Possible values for skin color are dark, white, and brown. The attribute education status can contain the values undergraduate, postgraduate, and matriculate. Both skin color and education status are nominal attributes.
Binary Attributes: A binary attribute is a category of nominal attribute that contains only two states: 0 or 1, where 0 often means that the attribute is absent and 1 means that it is present. Binary attributes are referred to as Boolean if the two states correspond to true and false.
Example – Given the attribute drinker describing a patient, 1 specifies that the patient drinks, while 0 specifies that the patient does not. Similarly, suppose the patient undergoes a medical test that has two possible outcomes.
Ordinal Attributes: An ordinal attribute is an attribute whose possible values have a meaningful sequence or ranking among them, but the magnitude between consecutive values is not known.
Example – Suppose that food course corresponds to the variety of dishes available at a restaurant. This ordinal attribute has three possible values: starters, main course, and combo. The values have a meaningful sequence, but we cannot tell from the values alone how much bigger one category is than another.
Numeric Attributes: A numeric attribute is quantitative; that is, it is a measurable quantity represented by integer or real values. Numeric attributes can be of two types: interval-scaled and ratio-scaled.
Interval – Scaled Attributes: Interval-scaled attributes are measured on a scale of equal-size units.
The values of interval-scaled attributes have order and can be positive, 0, or negative. Thus, in addition to
providing a ranking of values, such attributes allow us to compare and quantify the difference between
values.
Example – A temperature attribute is an interval – scaled. We have different temperature values for every
new day, where each day is an entity. By sequencing the values, we obtain an arrangement of entities with
reference to temperature. In addition, we can quantify the difference in the value between values, for
example, a temperature of 20 degrees C is five degrees higher than a temperature of 15 degrees C.
Ratio – Scaled Attributes: A ratio-scaled attribute is a numeric attribute with an inherent or fixed zero point. In addition, the values are ordered, and we can compute the difference between values, as well as the mean, median, and mode.
Example – The Kelvin (K) temperature scale has what is considered a true zero point. It is the point at which the particles that make up matter have zero kinetic energy.
Discrete Attribute: A discrete attribute has a finite or countably infinite set of values, which may or may not appear as integers. The attributes skin color, drinker, medical report, and drink size each have a finite number of values, and so are discrete.
Continuous Attribute: A continuous attribute has real numbers as values. For example, height, weight, and temperature have real values. Real values can only be represented and measured using a finite number of digits. Continuous attributes are typically represented as floating-point variables.
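As a rough illustration, the pandas sketch below encodes nominal, binary, ordinal and numeric attributes in one table; the ordered-category encoding for the ordinal attribute is one possible implementation choice, and all values are invented.

```python
# Attribute types in a pandas DataFrame: nominal, binary, ordinal, numeric.
import pandas as pd

patients = pd.DataFrame({
    "skin_color":  ["dark", "white", "brown"],             # nominal
    "drinker":     [1, 0, 0],                               # binary
    "food_course": ["starters", "main course", "combo"],    # ordinal
    "temperature": [36.5, 38.2, 37.0],                      # numeric (interval-scaled)
})

patients["food_course"] = pd.Categorical(
    patients["food_course"],
    categories=["starters", "main course", "combo"],
    ordered=True,                                           # encode the ranking
)
print(patients.dtypes)
print(patients.sort_values("food_course"))
```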
🡪 Data Modelling
Data modelling (also spelled data modeling) is the process of creating a data model for the data to be stored in a database. This data model is a conceptual representation of data objects, the associations between different data objects, and the rules.
Data modeling helps in the visual representation of data and enforces business rules, regulatory compliance, and government policies on the data. Data models ensure consistency in naming conventions, default values, semantics, and security, while ensuring the quality of the data.
🡪Data Models in DBMS
The data model is defined as an abstract model that organizes data description, data semantics, and consistency constraints of data. The data model emphasizes what data is needed and how it should be organized instead of what operations will be performed on the data. A data model is like an architect's building plan, which helps to build conceptual models and set relationships between data items.
The two types of Data Modeling Techniques are
1. Entity Relationship (E-R) Model
2. UML (Unified Modelling Language)
🡪Why use Data Model?
The primary goals of using a data model are:
∙ Ensures that all data objects required by the database are accurately represented. Omission of data will
lead to creation of faulty reports and produce incorrect results.
∙ A data model helps design the database at the conceptual, physical and logical levels.
∙ Data model structure helps to define the relational tables, primary and foreign keys and stored procedures.
∙ It provides a clear picture of the base data and can be used by database developers to create a physical
database.
∙ It is also helpful to identify missing and redundant data.
∙ Though the initial creation of data model is labor and time consuming, in the long run, it makes your IT
infrastructure upgrade and maintenance cheaper and faster.
🡪Types of Data Models in DBMS
Types of Data Models: There are mainly three different types of data models: conceptual data models,
logical data models, and physical data models, and each one has a specific purpose. The data models are
used to represent the data and how it is stored in the database and to set the relationship between data items.
1. Conceptual Data Model: This Data Model defines WHAT the system contains. This model is
typically created by Business stakeholders and Data Architects. The purpose is to organize, scope
and define business concepts and rules.
2. Logical Data Model: Defines HOW the system should be implemented regardless of the DBMS. This model is typically created by Data Architects and Business Analysts. The purpose is to develop a technical map of rules and data structures.
3. Physical Data Model: This Data Model describes HOW the system will be implemented using a
specific DBMS system. This model is typically created by DBA and developers. The purpose is
actual implementation of the database.
∙ The physical data model describes the data needed for a single project or application, though it may be integrated with other physical data models based on project scope.
∙ The data model contains relationships between tables that address cardinality and nullability of the relationships.
∙ It is developed for a specific version of a DBMS, location, data storage or technology to be used in the project.
∙ Columns should have exact data types, lengths assigned and default values.
∙ Primary and foreign keys, views, indexes, access profiles, and authorizations, etc. are defined. (A small SQL sketch of such a physical model is given below.)
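The SQL sketch referenced above: a hypothetical physical data model (customer and orders tables) expressed as DDL and executed through Python's built-in sqlite3 module. Table and column names are illustrative assumptions.

```python
# Minimal physical data model as SQL DDL, run through sqlite3.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,          -- primary key
    name        VARCHAR(100) NOT NULL,        -- exact type and length
    email       VARCHAR(255) DEFAULT ''       -- default value
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL,
    order_date  DATE,
    FOREIGN KEY (customer_id) REFERENCES customer(customer_id)  -- foreign key
);
CREATE INDEX idx_orders_customer ON orders(customer_id);        -- index
""")
print("tables:", [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")])
conn.close()
```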
🡪Advantages and Disadvantages of Data Model:
Advantages of Data model:
∙ The main goal of designing a data model is to make certain that data objects offered by the functional team are represented accurately.
∙ The data model should be detailed enough to be used for building the physical database.
∙ The information in the data model can be used for defining the relationships between tables, primary and foreign keys, and stored procedures.
∙ A data model helps the business to communicate within and across organizations.
∙ A data model helps to document data mappings in the ETL process.
∙ It helps to recognize correct sources of data to populate the model.
Disadvantages of Data model:
∙ To develop a data model, one should know the characteristics of the physical data storage.
∙ A navigational data model makes application development and management complex, and requires detailed knowledge of how the data is stored.
∙ Even a small change made in the structure requires modification of the entire application.
∙ There is no set data manipulation language in the DBMS.
🡪Data Modelling Techniques
There are various techniques to achieve data modeling successfully, though the basic concepts remain the
same across techniques. Some popular data modeling techniques include Hierarchical, Relational, Network,
Entity-relationship, and Object-oriented.
🡪Hierarchical Technique
The Hierarchical data modeling technique follows a tree-like structure where its nodes are sorted in a
particular order. A hierarchy is an arrangement of items represented as "above", "below", or "at the same level as" each other. The hierarchical data modeling technique was implemented in the IBM Information
Management System (IMS) and was introduced in 1966.
It was a popular concept in a wide variety of fields, including computer science, mathematics, design,
architecture, systematic biology, philosophy, and social sciences. But it is rarely used now due to the
difficulties of retrieving and accessing data.
🡪Relational Technique
The relational data modeling technique is used to describe different relationships between entities, which
reduces the complexity and provides a clear overview. The relational model was first proposed as an
alternative to the hierarchical model by IBM researcher Edgar F. Codd in 1969. It has four different sets of
relations between the entities: one to one, one to many, many to one, and many to many.
🡪Network Technique
The network data modeling technique is a flexible way to represent objects and underlying relationships
between entities, where the objects are represented inside nodes and the relationships between the nodes is
illustrated as an edge. It was inspired by the hierarchical technique and was originally introduced by Charles
Bachman in 1969.
The network data modeling technique makes it easier to convey complex relationships, as records can be linked to multiple parent records.
🡪Entity-relationship technique
The entity-relationship (ER) data modeling technique represents entities and relationships between them in a
graphical format consisting of Entities, Attributes, and Relationships. The entities can be anything, such as
an object, a concept, or a piece of data. The entity-relationship data modeling technique was developed for
databases and introduced by Peter Chen in 1976. It is a high-level relational model that is used to define data
elements and relationships in a sophisticated information system.
🡪Object-Oriented Technique
The object-oriented data modeling technique is a construction of objects based on real-life scenarios, which
are represented as objects. The object-oriented methodologies were introduced in the early 1990s and were
inspired by a large group of leading data scientists.
It is a collection of objects that contain stored values, in which the values are nothing but objects. The
objects have similar functionalities and are linked to other objects.
🡪Data Modeling: An Integrated View
Data modeling is an essential technology for understanding relationships between data sets. The integrated
view of conceptual, logical, and physical data models helps users to understand the information and ensure
the right information is used across an entire enterprise.
Although data modeling can take time to perform effectively, it can save significant time and money by
identifying errors before they occur. Sometimes a small change in structure may require modification of an
entire application.
Some information systems, such as navigational systems, use complex application development and management that requires advanced data modeling skills. There are many open-source Computer-Aided Software Engineering (CASE) tools as well as commercial solutions that are widely used for data modeling.