COURSE TITLE: Computer Applications in Accounting
It is no longer the case that Computing and Accounting should remain separate and distinct academic disciplines. Even though, for specialization purposes, Computing and Accounting remain separate academic disciplines, in application they are very much needed together in solving business problems. Business organizations need not only knowledge in business domains such as Accountancy, Business Management, Human Resource Management and the like, but also the technological know-how to deal with today’s highly dynamic business landscape. It is an undeniable fact that in today’s turbulent business environment, core business knowledge alone is woefully inadequate for providing solutions to business problems. There is a pressing need to integrate core business knowledge such as Accounting with knowledge from other disciplines such as statistics, computing, information technology and others. This is because advanced analyses of the big data generated by business activities are needed. Such deeper analyses of business data go beyond the knowledge of the Accountancy, HRM, and Business Management disciplines, and require the application of knowledge from sister disciplines such as Statistics/Mathematics, Computing, Information Technology, data and information science, and the like.
Compounding this reality is the emergence of the so-called Digital Revolution: along the timeline of the 21st century, it has become possible for virtually every activity under the sun to be digitized into what is called “Big Data”. The concept of “big data” refers to the phenomenon where the volume and velocity of data generated by today’s business organizations surpass the storage and processing capacities of traditional data management systems, thereby requiring new, advanced data management technologies. It is therefore required of today’s business students, and teachers alike, to master the knowledge, skills, tools and technologies of big data management systems. The Accounting curriculum needs to change from the traditional style to involve areas such as statistics, data analytics (using tools like Python, R, Java, and Excel), and big data management technologies such as SQL and NoSQL. To buttress these arguments, consider the following statements from very authoritative sources.
“Accountancy professionals will be increasingly required to broaden their skill-set and take ownership of their personal development if they are to succeed in an ever-changing, dynamic world” (source: The Association of Chartered Certified Accountants (ACCA)).
“The increased availability of data and of analytic tools have opened up a range of opportunities for the finance function. Analytics have facilitated a more forward-looking [accounting and] finance function, enabling the generation of greater insights and the provision of immense value to stakeholders. Leveraging data analytics in decision making provides significant opportunities for the [accounting and] finance function to cement its position as a true strategic partner” (sources: ACCA, and Chartered Accountants Australia and New Zealand).
“Accountants and finance teams have an opportunity to lead their organizations to better business decisions driven
by insights from analysis of big data, rather than following a familiar pattern of reporting on past events” (source:
ACCA).
“The insights from our CEO survey reveal that businesses are preparing for a future that’s different from today. And they expect their talent to adapt. One implication of this rapidly changing business environment is clear—today’s accounting curriculum should be updated to equip students with new skills, especially in technology and data analytics” (source: PricewaterhouseCoopers (PwC)).
“Businesses today—accounting firms among them—are struggling to deal with the high volume of data that
technology has made available. Organizations that can successfully interpret these data and use them to make
crucial business decisions have a competitive advantage. In such a climate, it’s helpful for CPAs to know as much
about the technological tools they use as possible. Just knowing basic accounting and office software programs is no
longer enough: Accountants need to know how to code” (source: American Institute of Certified Public Accountants
(AICPA)).
These quotes are meant to inform you about the global trends in the Accountancy and Finance profession, and to help you take the right direction in your studies and career. It is the dream of the author of this text to equip you with knowledge of technological applications in Accounting. Even though the title of the course is “Computer Applications in Accounting”, the course is not about traditional computing concepts such as the history of computers, computer hardware, and the like. Broadly speaking, this course will introduce you to the big data revolution, big data management technologies, and data analytics.
UNIT ONE
INTERNET OF THINGS, DIGITAL TECHNOLOGY, AND THE BIG DATA REVOLUTION
Introduction
Along the line of time came the advent of computers and information technologies, which tremendously improved the data management capabilities of business organizations. Computing power, combined with advances in information technology and analysis techniques, could capture and process more data, more speedily than ever before, initiating the emergence of what is called “big data”. Despite the emergence of improved technologies to manage large quantities of data, the “magical” nature of big data was soon to be realized. By magical I mean that data has the quality of acting as a magician, a god, or a sorcerer who can reveal all the hidden facts about the universe. Like a supernatural being, you need a medium to engage data so that it reveals hidden secrets to business managers.
Big data is now seen as the most important business asset, and the ability to harness the benefits of data creates a huge competitive advantage for businesses. Big data technologies allow companies and other organizations to use large amounts of data for effective decision making. They allow businesses and other organizations to identify trends, patterns, and associations that would be quite challenging or nearly impossible to find with conventional data-processing solutions. As a result, there is a huge demand for big data professionals. Harnessing the benefits of big data requires knowledge of big data management technologies, such as relational database management systems, and data analytics. In the ensuing sessions we will discuss in detail the concept of big data, and how this phenomenon has become the order of the day.
Big data refers to the phenomenon where data generation has become so massive that traditional data management systems and technologies can no longer handle it. Think of it: it is estimated that every day, more than 3 quintillion bytes of data (3,000,000,000,000,000,000 bytes) are created globally, and more than 90 percent of this staggering figure was created within just the last six years or so. This does not necessarily mean that earlier ages did not produce as much activity; rather, those ages lacked the technologies to readily and seamlessly capture, store, and analyze data. The ability to generate these unimaginable quantities of data in our age is made possible by three key factors, which are discussed in turn below.
Information technology can be defined as the use of digital technology to facilitate the capturing and storing of business data, the processing of data into decision-ready form, and the dissemination of decision-ready information to decision makers.
The advent of, and advances in, digital technology, and the proliferation of digital devices, have made it possible for every bit of business data, financial or non-financial, operational or strategic, to be generated and stored as a digital image on computers and other digital devices. For example, every business transaction can now be seamlessly captured and stored on a computer or some other digital device in the form of an electronic image. Sales of products to customers can be carried out and recorded electronically on computers and mobile devices. Even customers’ likes and preferences can be captured and stored in electronic form. The “likes” you make on Facebook, Instagram, Twitter, and other social media platforms are stored as digital images, creating data to be harnessed for improved decision making. Such an ability to generate digital data on every aspect of business activity is creating big data for businesses.
Figure 1: Digital technology as a factor of the big data revolution
In addition to digital technology, another key factor contributing to this gargantuan data creation is the number of connected digital devices. In 2018, there were approximately 17 billion connected devices worldwide, of which about 7 billion were IoT devices (not counting traditional internet-connected devices such as laptops or smartphones), according to a report published by IoT Analytics. The term “Internet of Things” generally refers to scenarios where network connectivity and computing capability extend to objects, sensors and everyday items not normally considered computers, allowing these devices to generate, exchange and consume data with minimal human intervention. IoT devices include an extraordinary number of objects of all shapes and sizes, such as:
Smart microwaves, which automatically cook your food for the right length of time;
Self-driving cars, whose complex sensors detect objects in their path;
Wearable fitness devices that measure your heart rate and the number of steps you’ve taken that day, then use that information to suggest exercise plans tailored to you.
There are even connected footballs that can track how far and fast they are thrown and record those statistics via an app for future training purposes. The global IoT market is now expected to reach $1.6 billion by 2025. Each of these many devices has the power to generate an unimaginable quantum of data. It is estimated that every day, more than 3 quintillion bytes of data (3,000,000,000,000,000,000 bytes) are created globally, and this figure keeps exploding at an alarming exponential rate. This should give you an idea of how big “big data” is.
The size of big data at any point in time is very difficult to estimate. The population of digital devices across the globe is exploding at an alarming rate, and the number of connected devices globally now far exceeds the human population. Figure 2 shows the world map populated with connected digital devices, to help you picture how the world looks from the perspective of digital devices, while Figure 3 shows a projection of the population of connected devices alongside the population of humans (source: Cisco IBSG, 2011). From Figure 3, you can notice that in all the years before 2010, the global human population outnumbered the global population of digital devices. However, from the year 2010 onwards, the global population of digital devices has outstripped the global human population, and the gap keeps increasing. The number of connected devices per person is also on the rise.
Figure 2: World population from the perspective of connected devices
Figure 3: Populations of connected devices and humans compared
Every year, you probably expect to pay at least a little more money than what you did the
previous year for most products and services. This has been the trend for most commodities and
services. However the opposite has been the case in the computer and communications fields,
especially with regard to the hardware supporting these technologies. For many decades,
hardware costs have fallen rapidly. Every year or two, the capacities of computers (both in terms
of storage and processing) have approximately doubled inexpensively.
This observation applies especially to the amount of primary memory (RAM), secondary storage, and processing speed. Modern computers come with tremendous amounts of storage and computing power. The secondary storage devices (such as solid-state drives (SSDs) and hard disk drives (HDDs)) that come with today’s personal computers are a huge improvement over the past. Computers now come with huge secondary storage capacities at comparatively affordable prices. The storage capacities of personal computers have increased from megabytes to terabytes, while the corresponding cost has only marginally increased. Computers can perform
calculations and make logical decisions phenomenally faster than human beings can. Many of
today’s personal computers can perform billions of calculations in one second—more than a
human can perform in a lifetime. Supercomputers are already performing thousands of trillions
(quadrillions) of instructions per second. For example, IBM has developed the IBM Summit
supercomputer, which can perform over 122 quadrillion calculations per second (122 petaflops).
To put that in perspective, the IBM Summit supercomputer can perform in one second almost 16
million calculations for every person on the planet. And supercomputing upper limits are
growing quickly. Comparing these astronomical increases in computing power with their corresponding costs, it is obvious that computing resources are growing in capacity while becoming less expensive. This remarkable trend is called Moore’s Law, named for the person who identified it in the 1960s, Gordon Moore, co-founder of Intel Corporation. A similar trend applies to developments in the telecommunications and internet sectors: there have been tremendous, inexpensive improvements in information and communication technologies, and greater communications/internet bandwidth now sells for less.
The Big Data revolution is changing every part of our world into a SMARTER one. The basic
idea behind the phrase ‘Big Data’ is that everything we do is increasingly leaving a digital trace
(or data), which we (and others) can use and analyse to become smarter. Today the really
successful companies “pray to” “Big Data” for more insight into business problems. From big
data, companies can now perfectly understand the taste and preferences of their customers,
where they are and, perhaps more importantly, what they are doing and where they are going.
Through big data, successful businesses are able to know what is happening as it is happening,
and they allow that information to guide their strategy and inform their decision-making.
Companies that won’t strategize to take advantage of the big data revolution can hardly succeed.
The Structure (or Forms) of Big Data
For businesses to get the full value of big data they must embrace all structures of Big Data,
structured or otherwise. SMART business occurs when we combine structured data sets with
unstructured data from both internal and external sources to gain more insight for intelligent
decision making. The rest of the session explains in detail the characteristics of these forms of
big data.
Structured Data
Structured data refers to data that fits a predefined data model or is organized in a predetermined
way. Structured data are organized in a defined tabular structure (with columns and rows). The
columns, also called fields, or attributes, are the characteristics of the entity which is being
captured into a database. For example, if you look at a standard customer database the
fields/columns that are defined will include name, address, contact telephone numbers, email
address, etc. of the customer. These fields describe the characteristics of the entity being modeled into a database. Note that each field is specified with a particular data type (number, text, or something else); this requires that the data values entered into these columns strictly conform to the specified data type. Within each field, constraints may also be set. For example, the telephone number field can be set to NOT NULL to require a value to always be provided for that field before a record will be accepted into the database. Similarly, the name field can be set to “NOT NULL” or “required” to make sure that it cannot be left empty. There
are a number of data types and constraints which can be employed to provide consistency and
integrity to the database. The database design can also include drop down menus that limit the
choices of the data that can be entered into a field, thus ensuring consistency of input. Currently,
structured data is managed using Structured Query Language (SQL) – a programming language
originally created by IBM in the 1970s for managing and querying data in relational database
management systems. There are quite a number of database management systems designed to
“speak” the SQL programming language, such as PostgreSQL, MySQL, SQL Server, Oracle,
IBM DB2, etc.
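To make the idea of a predefined structure concrete, the following is a minimal sketch in Python using the built-in sqlite3 module (Python being one of the tools recommended earlier). The Customer table and its columns are illustrative assumptions, not taken from any particular system; the sketch shows how declared data types and a NOT NULL constraint cause a record that breaks the rules to be rejected.

    import sqlite3

    conn = sqlite3.connect(":memory:")  # a temporary, in-memory database
    cur = conn.cursor()

    # Each column is declared with a data type; NOT NULL forces a value to
    # be supplied before a record is accepted into the table.
    cur.execute("""
        CREATE TABLE Customer (
            customer_id INTEGER PRIMARY KEY,
            name        TEXT NOT NULL,
            phone       TEXT NOT NULL,
            email       TEXT
        )
    """)

    # This record satisfies the constraints and is accepted.
    cur.execute(
        "INSERT INTO Customer (name, phone, email) VALUES (?, ?, ?)",
        ("Ama Mensah", "+233-20-000-0000", "ama@example.com"),
    )

    # This record violates NOT NULL on the phone field and is rejected.
    try:
        cur.execute("INSERT INTO Customer (name) VALUES (?)", ("Kofi",))
    except sqlite3.IntegrityError as err:
        print("Record rejected:", err)

    print(cur.execute("SELECT * FROM Customer").fetchall())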
SQL database design rides on the relational model developed by E.F. Codd in the late 1960s and formalized in his 1970 paper “A Relational Model of Data for Large Shared Data Banks” (see http://www.acm.org/classics/nov95/toc.html). The relational model emphasizes data integrity much more than other models of database design; data integrity refers to making sure that the data in the database makes sense at all times. In its basic application, the relational model requires the system being modeled into a database to be decomposed into several
logical “data containers” called entities. An entity can be defined as that component of the
overall system which is perceived or known or inferred to have its own distinct existence. For
example, in modeling a school system into a relational database, you need to break the school
system into entities such as Student, Course, Academic Program, Department, Residence,
Teacher, Contact Details, etc. Each of the entities is represented with a table which stores the
data on the attributes of the entity. These tables are then related (relationships are set among
tables) where necessary. The details of the relational database design principles will be treated in
the course of time. Figure 4 depicts an extract of a school database designed based on the
relational model.
Figure 4: A Representation of the Relational Database Model
Among the entities of the database are the three logical data containers, Student, Residence,
and Fee, with their attributes clearly specified.
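As a small, hedged illustration of how such entities can be related, the Python/SQLite sketch below creates three tables in the spirit of Figure 4. The attribute names (hall_name, full_name, amount) are assumptions made for the example and may differ from those in the actual figure.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # enforce referential integrity

    conn.execute("""
        CREATE TABLE Residence (
            residence_id INTEGER PRIMARY KEY,
            hall_name    TEXT NOT NULL
        )
    """)
    conn.execute("""
        CREATE TABLE Student (
            student_id   INTEGER PRIMARY KEY,
            full_name    TEXT NOT NULL,
            residence_id INTEGER REFERENCES Residence(residence_id)
        )
    """)
    conn.execute("""
        CREATE TABLE Fee (
            fee_id     INTEGER PRIMARY KEY,
            student_id INTEGER NOT NULL REFERENCES Student(student_id),
            amount     REAL NOT NULL
        )
    """)

    conn.execute("INSERT INTO Residence VALUES (1, 'Unity Hall')")
    conn.execute("INSERT INTO Student VALUES (100, 'Ama Mensah', 1)")
    conn.execute("INSERT INTO Fee VALUES (1, 100, 2500.00)")

    # Join the related tables: which hall does each fee-paying student live in?
    rows = conn.execute("""
        SELECT s.full_name, r.hall_name, f.amount
        FROM Fee f
        JOIN Student s   ON s.student_id = f.student_id
        JOIN Residence r ON r.residence_id = s.residence_id
    """).fetchall()
    print(rows)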
Given that the SQL programming language and relational database management systems (RDBMS) are the technologies and systems required for the management of structured big data, Accounting and Finance students cannot afford to go without this critical knowledge and skill set. As part of its recommendations for a contemporary curriculum for accounting courses, PwC, one of the Big Four auditing firms in the world, specifically points to knowledge and skills in SQL, along with a number of computing, programming, analytics, and statistical knowledge and skills. The following is a direct quotation from PwC:
“Universities should infuse analytical exercises into existing curriculum to help students develop data analytics proficiency on top of their core accounting skills. Most schools currently require a class on computing and one or two statistics courses early in the curriculum. But if reformed, these classes could provide all business students with a base level of sophistication around data analytics. Whether accounting students or not, everyone could benefit, as analytics will continue to grow in importance in every business discipline.
We also forecast a significant increase in demand for students with double majors in accounting and information systems. Academic programs that support these double concentrations will be increasingly attractive to both employers and students.
Deep dives into statistics theory are not the primary point. Instead, we suggest the following courses as a tentative outline for providing students a new set of skills. [These] should include:
Basic programming skills using a contemporary coding language such as Python or Java
Core skills in the legacy technologies (Microsoft Excel and Access), especially in teaching the complex power of spreadsheet software
Core skills with both structured and unstructured databases (SQL, MongoDB, Hadoop, etc.)” – (PwC, 2015)
Semi-Structured Data
Semi-structured data is a cross between unstructured and structured data. It is data that may have some structure which can be used for analysis, but lacks the strict structure of a data model. In semi-structured data, tags or other types of markers are used to identify certain elements within the data, but the data does not have the rigid structure of structured data. For example, a Facebook post can be categorized by author, date, length and even sentiment, but its content is generally unstructured.
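To illustrate, the short Python sketch below stores a social-media style post as JSON, a common semi-structured format. The field names (author, date, likes, content) are illustrative assumptions: the tagged fields give the data some structure, while the post body remains free-form text.

    import json

    post = {
        "author": "ama_mensah",
        "date": "2024-05-01",
        "likes": 120,
        "content": "Just visited the new Melcom branch - great prices!"
    }

    serialized = json.dumps(post)      # store or transmit the post as text
    restored = json.loads(serialized)  # parse it back into a Python dictionary

    # The tagged parts can be analysed directly...
    print(restored["author"], restored["likes"])
    # ...but the free-form content needs text-analytics techniques.
    print(len(restored["content"].split()), "words in the post body")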
Unstructured Data
Unstructured data growth is no longer being driven by the usual suspects – documents, presentations, photos, videos and audio. The impetus behind its growth today is sources such as log files produced by computer programs, IoT devices, social media, CCTV, sensors, metadata and even search engine queries.
In the past, unstructured data were not used that much for decision making. They were often
locked away in siloed document management systems making it what's known as dark data,
unavailable for analysis. With the development of big data platforms,
primarily Hadoop clusters, NoSQL databases and the Amazon Simple Storage Service (S3),
unstructured data can now be processed as much as structured data, and be very useful for
decision making. These big data platforms provide the required infrastructure for processing,
storing and managing large volumes of unstructured data. NoSQL (“not only SQL”) databases are databases built on non-relational models; rather than the standard SQL language, each typically provides its own query language or interface for communicating with the database. The two kinds of databases deal with data in different ways: SQL databases structure data in a ‘relational manner’, as discussed previously, while NoSQL databases store data in a ‘non-relational manner’. Just as we have a number of relational database management systems (or SQL database management systems), there are also a number of database management systems built on non-relational or NoSQL models, such as MongoDB, Redis, FaunaDB, CouchDB, Cassandra, Elasticsearch, etc.
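As a brief, hedged illustration of the document-style (non-relational) approach, the sketch below uses the third-party pymongo driver for MongoDB. It assumes pymongo is installed and a MongoDB server is running locally at its default address; the database and collection names are made up for the example.

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    posts = client["shop_db"]["posts"]  # database and collection (assumed names)

    # Unlike rows in a relational table, documents in the same collection do
    # not have to share an identical, predefined set of columns.
    posts.insert_one({"author": "ama", "likes": 120, "content": "Great prices!"})
    posts.insert_one({"author": "kofi", "video_url": "http://example.com/clip"})

    # Query by field, much as you would filter rows in SQL.
    print(posts.find_one({"author": "ama"}))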
Other techniques that play roles in unstructured data analytics include data mining, machine
learning and predictive analytics. Text analytics tools look for patterns, keywords and sentiment
in textual data. At a more advanced level, natural language processing technology is a form of
artificial intelligence that seeks to understand meaning and context in text and human speech,
increasingly with the aid of deep learning algorithms that use neural networks to analyze data.
Newer tools can aggregate, analyze and query all data types to enable greater insight into
corporate data and improved decision-making. Examples include the following: Azure Data
Services, Microsoft Power BI, IBM Cognos Analytics, and Tableau.
These technological developments are massively shaping the role of accounting and how it should be performed in business organizations. Since the accounting function generates and manages the biggest chunk of business data, there is no doubt that accounting professionals are the right people to possess the data management and analysis skills needed to ensure business success. As an Accounting student, these revelations about developments in the profession should guide you in acquiring the skill set necessary to make you relevant for solving business problems. Today’s businesses are hunting for graduates with skills in both accountancy and data science. It would be great to become one of such graduates, by mastering data science skills in addition to your accounting knowledge and skills.
Characteristics of Big Data
The following are the core characteristics of big data. Understanding these characteristics is vital to knowing how big data works and how you can use it. There are primarily five characteristics of big data, although some writers talk about seven. They are Volume, Velocity, Variety, Veracity, and Value. Because each of the characteristics begins with the letter “V”, the five big data characteristics are often referred to as the 5 Vs of big data. The meaning of each of these characteristics is given below.
Volume
The volumes of data generated by modern IT, industrial, healthcare, Internet of Things, and other systems are growing exponentially, driven by the falling costs of data storage and processing architectures and by the need to extract valuable insights from the data to improve business processes, efficiency and service to consumers.
Though there is no fixed threshold for the volume of data to be considered big data, the term is typically used for massive-scale data that is difficult to store, manage and process using traditional databases and data processing architectures.
Velocity
Velocity of data refers to how fast the data is generated. Data generated by certain sources can
arrive at very high velocities, for example, social media data or sensor data. Velocity is another
important characteristic of big data and the primary reason for the exponential growth of data.
High velocity of data results in the volume of accumulated data becoming very large in a short
span of time. Some applications can have strict deadlines for data analysis (such as trading or
online fraud detection) and the data needs to be analyzed in real-time. Specialized tools are
required to ingest such high velocity data into the big data infrastructure and analyze the data in
real-time.
Variety
Variety refers to the forms of the data. Big data comes in different forms such as structured,
unstructured or semi-structured, including text data, image, audio, video and sensor data. Big
data systems need to be flexible enough to handle such a variety of data.
Veracity
Veracity refers to how accurate the data is (data quality). To extract value from the data, the data
needs to be cleaned to remove noise. Data-driven applications can reap the benefits of big data
only when the data is meaningful and accurate. Therefore, cleansing of data is important so that
incorrect and faulty data can be filtered out.
Value
Value of data refers to the usefulness of data for the intended purpose. The end goal of any big
data analytics system is to extract value from the data. The value of the data is also related to the
veracity or accuracy of the data. For some applications value also depends on how fast we are
able to process the data.
There are numerous advantages of Big Data for organizations. Some of the key ones are as
follows:
1. Enhanced Decision-making
Big data implementations can help businesses and organizations make better-informed decisions
in less time. It allows them to use outside intelligence such as search engines and social media
platforms to fine-tune their strategies. Big data can identify trends and patterns that would’ve been invisible otherwise, helping companies avoid errors.
2. Improved Customer Service
Another huge impact big data can have on all industries is in the customer service department. Companies are replacing traditional customer feedback systems with data-driven solutions. Such solutions can analyze customer feedback more efficiently and help companies offer better customer service to consumers.
3. Efficiency Optimization
Organizations use big data to identify the weak areas present within them. Then, they use these
findings to resolve those issues and enhance their operations substantially. For example, Big
Data has substantially helped the manufacturing sector improve its efficiency through IoT and
robotics.
Big Data has also transformed several areas by enabling real-time tracking, such as inventory management, supply chain optimization, anti-money laundering, and fraud detection in banking and finance.
UNIT TWO
DATA ANALYTICS AND DATA MINING
Data Analytics is a broad term that encompasses the processes, technologies, frameworks and
algorithms to extract meaningful insights from data. Raw data in itself does not have a meaning
until it is contextualized and processed into useful information. Analytics is this process of
extracting and creating information from raw data by filtering, processing, categorizing,
condensing and contextualizing the data.
Data analytics increasingly deals with vast amounts of data—mostly unstructured information stored in a wide variety of mediums and formats—and complex data sets collected through fragmented databases over time. It
deals with streaming data, coming at you faster than traditional RDBMS systems can handle. This is also called fast data.
It’s about combining external data with internal data, integrating it and analyzing all data sets together.
Big data analytics aims to answer three domains of questions: (1) what has happened in the past (retrospective analytics), (2) what is happening right now (real-time analytics), and (3) what is about to happen (prospective analytics).
Retrospective analytics can explain and present knowledge about the events of the past, show trends and help find root causes for those events. Real-time analytics shows what is happening right now: it presents situational awareness, raises alarms when data reaches a certain threshold, and sends reminders when a certain rule is satisfied. Prospective analytics presents a view into the future: it attempts to predict what will happen and what the future values of certain variables will be. Table 1 shows the taxonomy of the three analytics questions.
Table 1: Taxonomy of analytics questions – questions about the past, questions about the present, and questions about the future.
The choice of the technologies, algorithms, and frameworks for analytics is driven by the
analytics goals, whether it is to provide retrospective, real-time, or prospective understanding, or
some combination of the three goals described above.
Descriptive Analytics
Descriptive analytics comprises analyzing past data to present it in a summarized form which can
be easily interpreted. Descriptive analytics aims to answer - What has happened? A major
portion of the analytics done today is descriptive analytics, through the use of statistical functions such as counts, maximum, minimum, mean, top-N, and percentages. These statistics help in
describing patterns in the data and present the data in a summarized form. Examples of
descriptive analytics include: computing the total number of likes for a particular post,
computing the average monthly rainfall or finding the average number of visitors per month on a
website. Descriptive analytics is useful to summarize the data.
Businesses use descriptive analytics all the time, whether they are aware of it or not. It’s often called business intelligence. Companies these days have vast amounts of data available to them,
so they would do well to use analytics to interpret this data to help them with decision making. It
helps them to learn from what happened in the past, and enables them to try to accurately predict
what may happen in the future. In other words, analytics helps companies anticipate trends. For
instance, if sales increased in November for the past five years, and declined in January, after the
Christmas rush, then one could predict that the same thing is likely to happen in year six and
prepare for it. Companies could use this to perhaps increase their marketing in January, offering
special offers and other incentives to customers.
Descriptive analytics give insight into what happened in the past (this may include the distant
past, or the recent past, like sales figures for last week). They summarize data that describes these past events and make it simple for people to understand. In a business, for example, we may look at how sales trends have changed from year
to year, or how production costs have escalated.
Descriptive statistics, basically then, is the name given to the analysis of data that helps to show
trends and patterns that emerge from collected information. They are a way of describing the
data in a way that helps us to visualize what the data shows.
This is especially helpful where there has been a lot of information collected. Descriptive
statistics are the actual numbers that are used to describe the information that has been collected
in the business or from, say, a survey. They include statistics like the mean, standard deviation,
variance, range, skewness, maximum, minimum, etc.
As a descriptive analyst, you can make descriptive analytics information easier to understand by presenting big data as graphs, charts, or other pictorial representations. This way, management and employees alike can see what has been happening within the company in the past, and make useful predictions and therefore good decisions for the future.
Descriptive statistics are so called because they help to describe the data which has been collected. They are a way of summarizing big groups of numerical information, by summarizing a sample.
(In this way, descriptive statistics are different from inferential statistics, which uses data to find
out about the population that the data is supposed to represent.)
The two sets of statistics most often used in descriptive statistics are measures of central tendency and measures of dispersion. Central tendency involves the idea that there’s one figure that’s in a way central to the whole set of figures.
Measures of central tendency include the mean, the median, and the mode. They summarize a
whole lot of figures with one single number. This will obviously make it easier for people to
understand and use the data.
Dispersion refers to how spread out the figures are from a central number. Measures of
dispersion include the range, the variance, and the standard deviation. They help us see how the
scores are spread out (whether they are close together or widely spread apart).
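The following minimal Python sketch, using only the standard-library statistics module, computes some of these measures of central tendency and dispersion for a small, made-up list of monthly sales figures.

    import statistics

    monthly_sales = [1200, 1350, 1280, 1900, 1750, 1420,
                     1380, 2100, 1650, 1500, 2300, 2650]

    # Measures of central tendency
    print("mean:  ", statistics.mean(monthly_sales))
    print("median:", statistics.median(monthly_sales))

    # Measures of dispersion
    print("range:   ", max(monthly_sales) - min(monthly_sales))
    print("variance:", statistics.variance(monthly_sales))
    print("std dev: ", statistics.stdev(monthly_sales))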
Inferential Statistics
When research is done on groups of people, usually both descriptive and inferential statistics are
used to analyze results and arrive at conclusions.
Inferential statistics are useful when we just want to use a small sample group to infer things
about a larger population. So, we are trying to come to conclusions that reach past the immediate
data that we actually have on hand.
They can help to assess the impact that various inputs may have on our objectives. For instance,
if we introduce a bonus system for our workers, what might the productivity outcome be?
Inferential statistics can only be used if we have a complete list of the population members from
which we have taken our sample. Also, the sample needs to be big enough.
There are different types of inferential statistics, some of which are fairly easy to interpret. An example is the confidence level. If, say, our confidence level is 98%, it means that we are 98% confident that we can infer the score of the population based on the score of our sample.
Inferential statistics therefore allow us to apply the conclusions from small experimental studies
to larger populations that have actually never been tested experimentally. This means then, that
inferential statistics can only speak in terms of probability, but it is very reliable probability, and
an estimate with a certain measurable confidence level.
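As a hedged illustration of the idea, the short Python sketch below computes an approximate 95% confidence interval for a population mean from a small, made-up sample. The value 1.96 is the usual z-value for 95% confidence; for a sample this small, a t-value would be slightly more precise.

    import math
    import statistics

    sample = [52, 48, 55, 60, 47, 51, 58, 53, 49, 56, 50, 54]

    mean = statistics.mean(sample)
    std_error = statistics.stdev(sample) / math.sqrt(len(sample))

    lower = mean - 1.96 * std_error
    upper = mean + 1.96 * std_error
    print(f"We are roughly 95% confident the population mean lies "
          f"between {lower:.1f} and {upper:.1f}")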
Accounting students are not only required to possess knowledge and skills in statistics but also to have a mastery of the statistical software which can be used to perform descriptive statistics. Software which can perform descriptive analytics includes Microsoft Excel and IBM SPSS.
This text does not intend to reteach statistics to students. It is expected that students will possess, at least, basic statistics knowledge and skills, which will aid their understanding.
Diagnostic Analytics
Diagnostic analytics comprises analysis of past data to diagnose the reasons as to why certain
events happened. Diagnostic analytics aims to answer the question- Why did it happen?
Computational tasks such as linear algebraic computations, general N-body problems, and graph-theoretic computations can be used for diagnostic analytics.
Predictive Analytics
We are now aware of how data and the analysis of it are vital for a company to be able to
function optimally. We are now going to examine another type of data analytics that can help
grow a company - predictive analytics.
Predictive analytics is used to make predictions about unknown future events. It uses many techniques, such as statistical algorithms, data mining, modeling, machine learning and artificial intelligence, to analyze current and past data to make predictions about the future. It
aims to identify the likelihood of future outcomes based on the available current and historical
data. The goal is therefore to go beyond what has happened to provide the best assessment of
what will happen.
More and more companies are beginning to use predictive analytics to gain an advantage over
their competitors. As economic conditions worsen, it provides a way for companies to gain
competitive edge. Predictive analysis has become more accessible today for even smaller
companies, and other low-budget organizations. This is because the volume of easily available
data has grown hugely, computing has become more powerful and more affordable, and the
software has become simpler to use. Therefore, one doesn’t need to be a mathematician to be
able to take advantage of the available technologies. Accounting students must champion the
skill of predictive analytics so as to provide the necessary competitiveness for their organizations.
Predictive analytics has several important applications. Firstly, it can help predict fraud and other criminal activities in businesses, government departments, etc.
Secondly, it can help companies optimize marketing by monitoring the responses and buying
trends of customers.
Thirdly, it can help businesses and organizations improve their way of managing resources
by predicting busy times and ensuring that stock and staff are available during those times.
For instance, hospitals can predict when their busy times of the year are likely to be, and
ensure there will be enough doctors and medicines available over that time.
In this unit, we’re going to examine the various techniques that are used to conduct predictive analysis. The two main groups into which these methods fall are machine learning techniques and regression techniques. They’ll be discussed here in more detail:
Machine Learning Techniques
Machine learning is a method of data analysis that automates the building of analytical models.
It uses algorithms that continuously adapt and learn from data and from previous computations,
thereby allowing them to find information without having to be directly programmed where to
search.
Growing volumes of available data, together with cheaper and more powerful computational
processing have created an unprecedented interest in the use of machine learning. More
affordable data storage has also increased its use.
When it comes to modeling, humans can maybe make a couple of models a week, but machine
learning can create thousands in the same amount of time.
Using this technique, after you make a purchase, online retailers can send you offers almost
instantaneously for other products that may be of interest to you. Banks can give answers
regarding your loan requests almost at once. Insurance companies can deal with your claims as
soon as you submit them. These actions are all driven by machine learning algorithms, as are
more common everyday activities such as web search results and email spam filtering.
Regression Techniques
These techniques form the basis of predictive analytics. They seek to create a mathematical
equation, which will serve as a model to represent the interactions among the different variables.
Depending on the circumstances, different regression techniques can be used for performing
predictive analysis. It can be difficult to select the right one from the vast array available. It’s
important to pick the most suitable one based on the type of independent and dependent variables,
and depending on the characteristics of the available data.
Linear Regression
Linear regression is the most well-known modeling approach. It calculates the relationship
between the dependent and independent variables using a straight line (regression line). It’s
normally in equation form. Linear regression can be used where the dependent variable has an
unlimited range.
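A minimal sketch of the idea in Python follows, using the third-party scikit-learn library (assumed to be installed). The advertising-spend and sales figures are made-up numbers used purely for illustration.

    from sklearn.linear_model import LinearRegression

    # Independent variable (advertising spend) and dependent variable (sales)
    X = [[10], [20], [30], [40], [50]]   # one feature per row
    y = [120, 190, 260, 330, 410]

    model = LinearRegression().fit(X, y)
    print("slope:", model.coef_[0], "intercept:", model.intercept_)

    # Predict sales for a spend level the model has not seen before.
    print("predicted sales at spend = 60:", model.predict([[60]])[0])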
If the dependent variable is discrete, another type of model will have to be used. Discrete, or
qualitative, choice models are mainly used in economics. They are models which describe, and
predict choices between different alternatives. For example, whether to export goods to China or
not, or whether to use sea or air freight to export goods. Unlike other models, which examine “how much”, qualitative choice models look at “which one”.
The techniques logistic regression and probit regression may be used for analyzing discrete
choice.
Logistic Regression
Logistic regression is another much-used modeling approach. It’s used to estimate the probability of an event’s success or failure. It’s mainly used for classification problems, and needs large sample sizes.
Probit Regression
The word probit is formed from the roots probability and unit. It’s a kind of regression where the dependent variable can only have two values, for instance, employed or unemployed. Its
purpose is to appraise the likelihood that a certain observation will fall into one or other of the
categories. In other words, it is used to model binary outcome variables, such as whether or not a
candidate will win an election. Here, the outcome variable is binary: zero or one, win or lose.
The predictor variables may, for example, include how much money was spent on the campaign,
or how much time the candidate spent campaigning. Probit regression is used a great deal in the
field of economics.
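The following hedged Python sketch shows how a binary outcome of this kind can be modelled with logistic regression in scikit-learn (assumed installed). The scenario of whether a customer defaults on a loan, and the single debt-ratio feature used to predict it, are invented for the example.

    from sklearn.linear_model import LogisticRegression

    X = [[20], [25], [30], [35], [40], [45], [50], [55]]  # e.g. debt ratio (%)
    y = [0, 0, 0, 0, 1, 1, 1, 1]                          # 0 = no default, 1 = default

    model = LogisticRegression().fit(X, y)

    # The model outputs a probability of the event (class 1), not a raw value.
    print("P(default) at a 42% ratio: %.2f" % model.predict_proba([[42]])[0][1])
    print("predicted class:", model.predict([[42]])[0])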
Neural Networks
Neural networks are powerful data models capable of capturing and representing complicated
input and output relationships. They are widely used in medical and psychological contexts, as
well as in the financial and engineering worlds. Neural networks are normally used when one is
not aware of the exact relationship between the inputs and the output. These networks are capable of learning the underlying relationship through training. (Training may be supervised, unsupervised, or based on reinforcement learning.)
Neural networks are based on the performance of “intelligent” functions similar to those
performed by the human brain. They’re similar in that, just as the brain amasses knowledge through learning, a neural network stores knowledge (data) inside inter-neuron connections called synaptic weights.
Their advantage is that they can model both linear and non-linear relationships, whereas other
modeling techniques are better with just linear relationships.
An example of their use would be in an optical character recognition application. The document
is scanned, saved as an image, and broken down into single characters. It’s then translated from
image format into binary format, with each 0 and 1 representing a pixel of the single character.
The binary data is then fed into a neural network that can make the association between the
image data and the corresponding numerical value. The output from the neural network is then
translated into text and stored as a file.
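As a small, hedged illustration, the Python sketch below trains a simple neural network on scikit-learn’s built-in digits dataset of 8x8 pixel images, loosely echoing the character-recognition example (scikit-learn is assumed to be installed; the network size and settings are arbitrary choices made for the example).

    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier

    digits = load_digits()
    X_train, X_test, y_train, y_test = train_test_split(
        digits.data, digits.target, test_size=0.25, random_state=0)

    # One hidden layer of 32 neurons; the "synaptic weights" are learned
    # during training rather than programmed by hand.
    net = MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=0)
    net.fit(X_train, y_train)

    print("accuracy on unseen digit images: %.2f" % net.score(X_test, y_test))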
Prescriptive Analytics
While predictive analytics uses prediction models to predict the likely outcome of an event,
prescriptive analytics uses multiple prediction models to predict various outcomes and the best
course of action for each outcome.
For example, prescriptive analytics can be used to prescribe the best medicine for treatment of a
patient based on the outcomes of various medicines for similar patients.
Technology is essential when it comes to analyzing data, so much so that it is the standard to use
computers for data analysis. To have a computer is a great start but it is no use without being
coupled with the appropriate software. Not only do you need special software but you also have
to be able to know how to use it in order to be able to carry out great data analysis.
There are a lot of data analysis software packages out there. Some (e.g. Python and R) are designed for more advanced data analysis than others. Python is one of the most powerful data analytics tools available, and arguably the most widely used. Python is not only entirely free to use but also open-source, meaning that the source code can be changed by programmers if they feel it is appropriate. Python operates differently from many of the software programs we use on a regular basis (such as Microsoft Excel and IBM SPSS). This is because the Python interpreter (the Python software) requires lines of code to be entered as instructions that command the computer to execute them. Performing data analysis with Python therefore requires the user to have knowledge and skill in the Python language. It takes time to learn the syntax and commands of the language, but once you get used to it you will find your data analysis process quite enjoyable. It takes only a few simple steps to download and install Python onto a computer. Downloading and installing is straightforward; it is learning how to use the software that takes some time, but once you have mastered it you may well love being a data analyst forever. An introduction to Python will be provided in future units of this textbook.
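As a small taste of what entering lines of code as instructions looks like, here is a very simple sketch that totals and averages a made-up list of daily sales figures.

    daily_sales = [450.00, 512.50, 389.90, 605.25]  # made-up figures

    total = sum(daily_sales)
    average = total / len(daily_sales)

    print("Total sales:  ", total)
    print("Average sales:", round(average, 2))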
If you can’t perform analytics to make sense of your data, you’ll have trouble improving quality and costs, and
you won’t succeed in the new business environment.
As indicated earlier, we live in a world where vast amounts of data are collected every second. Analyzing such data is an
important need. This session looks at how data mining can meet this need by providing tools to discover knowledge from
data. “We are living in the information age” has become an everyday saying. However, we are actually living in the data
age. Terabytes or petabytes of data pour into our computer networks, the World Wide Web (WWW), and various data
storage devices every day from business, society, science and engineering, medicine, and almost every other aspect of
daily life. A petabyte is a unit of information or computer storage equal to 1 quadrillion bytes, or a thousand terabytes, or 1 million gigabytes.
This explosive growth of available data volume is a result of the computerization of our society and the fast development
of powerful data collection and storage tools. Businesses worldwide generate gigantic data sets, including sales
transactions, stock trading records, product descriptions, sales promotions, company profiles and performance, and
customer feedback. For example, large stores, such as Wal-Mart, handle
hundreds of millions of transactions per week at thousands of branches around the world. In Ghana, businesses such as Melcom, Kasapreko and the like generate large quantities of data from their activities.
Global telecommunication networks carry tens of petabytes of data traffic every day. The medical and health industry
generates tremendous amounts of data from medical records, patient monitoring, and medical imaging. Billions of Web
searches supported by search engines process tens of petabytes of data daily. Web Communities and social media have
become increasingly important data sources, producing digital pictures and videos, blogs, and various kinds of other data.
The list of sources that generate huge amounts of data is endless.
This explosively growing, widely available, and gigantic body of data makes our time truly the data age. Powerful and
versatile tools are badly needed to automatically uncover valuable information from the tremendous amounts of data
and to transform such data into organized knowledge. This necessity has led to the birth of data mining, a relatively young, dynamic, and very promising field. Data mining has made, and will continue to make, great strides in our journey from the data age toward the coming information age.
By definition, data mining, also known as knowledge discovery in data (KDD), is the process of
uncovering patterns and other valuable information from large data sets (Big Data). The growth
of big data and advances in database and data warehousing technologies are assisting companies
in transforming their raw data into useful knowledge.
Using a broad range of techniques, data mining information can be used to increase revenues, cut
costs, improve customer relationships, improve quality, reduce risks and more. The foundation of
data mining comprises some intertwined scientific disciplines, including: statistics (the numeric
study of data relationships), artificial intelligence (human-like intelligence displayed by software
and/or machines) and machine learning (algorithms that can learn from data to make predictions).
Figure 5: The Various Aspects of Data Mining
You’ve seen the staggering numbers – the volume of data produced is doubling every two years.
Unstructured data alone makes up 90 percent of the digital universe. But more information does
not necessarily mean more knowledge.
Data mining provides us with the means of resolving problems and issues in this challenging information age. Data mining benefits include:
1. Discover time-variant associations between products and services to maximize sales and customer value.
2. Identify the most profitable customers, and their preferential needs, to strengthen relationships and maximize sales.
3. Forecast consumption levels of different product types (based on seasonal and environmental conditions) to optimize operations.
4. Discover interesting patterns in the movement of products, especially ones with a short shelf life, in a supply chain.
5. Identify the most profitable customers and provide them with personalized services to maintain their repeat business.
6. Retain valuable employees by identifying and acting on the root causes of attrition.
An example of how data mining turns a large collection of data into knowledge: a search engine (e.g., Google) receives hundreds of millions of queries every day. Each query can be viewed as a transaction where the user describes her or his information need. What novel and useful knowledge can a search engine learn from such a huge collection of queries collected from users over time? Interestingly, some patterns found in user search queries can disclose invaluable knowledge that cannot be obtained by reading individual data items alone. For example, Google’s Flu Trends uses specific search terms as indicators of flu activity. It found a close relationship between the number of people who search for flu-related information and the number of people who actually have flu symptoms. A pattern emerges when all of the search queries related to flu are aggregated. Using aggregated Google search data, Flu Trends can estimate flu activity up to two weeks faster than traditional systems can. This example shows how data mining can turn a large collection of data into knowledge that can help meet a current global challenge.
The data mining process involves a number of steps from data collection, data preparation,
visualization to extraction of valuable information from large data sets. The data mining process
starts with prior knowledge and ends with posterior knowledge, which is the incremental insight
gained about the business via data through the process. The whole data mining process is a
framework to invoke the right questions (Chapman et al., 2000) and guide us through the right
approaches to solve a business problem. It is not meant to be used as a set of rigid rules, but as a
set of iterative, distinct steps that aid in knowledge discovery.
Several years ago, representatives from a diverse array of industries gathered to define the best
practices, or standard process, for data mining. The result of this task was the CRoss-Industry
Standard Process for Data Mining (CRISP-DM). The CRISP-DM process model was based on
direct experience from data mining practitioners, rather than scientists or academics, and
represents a “best practices” model for data mining that was intended to transcend professional
domains. The CRISP-DM process model has been broken down into six steps:
Business understanding,
Data understanding,
Data preparation,
Modeling,
Evaluation, and
Deployment
Business Understanding
Perhaps the most important phase of the data mining process includes gaining an understanding
of the current practices and overall objectives of the project. During the business understanding
phase of the CRISP-DM process, the analyst determines the objectives of the data mining project.
Included in this phase are an identification of the resources available and any associated
constraints, overall goals, and specific metrics that can be used to evaluate the success or failure
of the data mining project.
Data Understanding
The second phase of the CRISP-DM analytical process is the data understanding step. During
this phase, data are collected and the analysts begin to explore and gain familiarity with the data,
including form, content, and structure. Knowledge and understanding of the numeric features
and properties of the data (e.g., categorical versus continuous data) will be important during the
data preparation process and essential to the selection of appropriate statistical tools and
algorithms used during the modeling phase. Finally, it is through this preliminary exploration
that the analyst acquires an understanding of and familiarity with the data that will be used in
subsequent steps to guide the analytical process, including any modeling, evaluation of the
results, and preparing the output and reports.
Data Preparation
After the data have been examined and characterized in a preliminary fashion during the data
understanding stage, the data are then prepared for subsequent mining and analysis. This data
preparation includes any cleaning and re-coding as well as the selection of any necessary training
and test samples. It is also during this stage that any necessary merging or aggregating of data
sets or elements is done. The goal of this step is the creation of the data set that will be used in
the subsequent modeling phase of the process.
Modeling
During the modeling phase of the project, specific modeling algorithms are selected and run on
the data. Selection of the specific algorithms employed in the data mining process is based on the
nature of the question and outputs desired. For example, scoring algorithms or decision tree
models are used to create decision rules based on known categories or relationships that can be
applied to unknown data. Unsupervised learning or clustering techniques are used to uncover
natural patterns or relationships in the data when group membership or category has not been
identified previously.
Evaluation
During the evaluation phase of the data mining project, the models created are reviewed to
determine their accuracy as well as their ability to meet the goals and objectives of the project
identified in the business understanding phase. Put simply, the evaluation phase answers the
question: Is the model accurate, and does it answer the question posed?
Deployment
The deployment phase, which is the final phase, includes the dissemination of the information. The
form of the information can include tables and reports as well as the creation of rule sets or
scoring algorithms that can be applied directly to other data.
Cluster Analysis
Cluster analysis is a statistical method for processing data into groups, or clusters, on the basis of
how closely associated they are. Cluster analysis can be a powerful data-mining tool for any
organization that needs to identify discrete groups of customers, sales transactions, or other types
of behaviors and things. For example, insurance providers use cluster analysis to detect
fraudulent claims, and banks use it for credit scoring.
Regression
Regression is a supervised data mining technique that predicts a continuous-valued
attribute. It analyses the relationship between a target variable (dependent) and its
predictor variable(s) (independent). Regression is an important tool for data analysis that can be
used for predicting the future as well as improving business processes. For example, in terms of
prediction, Regression can be used to predict how many units consumers will purchase of a
product or service. Insurance companies heavily rely on regression analysis to estimate how
many policy holders will be involved in accidents or be victims of burglaries, for example.
In terms of optimization of business processes, a company operating a call center can use
regression to understand the relationship between callers' wait times and the number of complaints. A
factory manager can, for example, build a regression model to understand the relationship
between oven temperature and the shelf life of the cookies baked in those ovens. A fundamental
driver of enhanced productivity in business and rapid economic advancement around the globe
during the 20th century was the frequent use of statistical tools in manufacturing as well as
service industries. Today, managers consider regression an indispensable tool.
Classification Analysis
This analysis is used to retrieve important and relevant information about data, and metadata. It
is used to classify different data in different classes. Classification is similar to clustering in a
way that it also segments data records into different segments called classes. But unlike
clustering, here the data analyst already knows the different classes or clusters. So, in
classification analysis you apply algorithms to decide how new data should be classified.
A classic example of classification analysis is how emails are classified as legitimate or spam.
Email clients such as Outlook, for example, use algorithms to characterize an email as legitimate or spam.
Association Rule Analysis
This refers to the method that helps you identify interesting relations (dependency
modeling) between different variables in large databases. The technique can help you uncover
hidden patterns in the data by identifying variables, and combinations of variables, that occur
together very frequently in the dataset. Association rules are useful for examining and forecasting
customer behavior, and they are highly recommended for retail industry analysis. The technique is
used in shopping (market) basket analysis, product clustering, catalogue design, and store layout.
In IT, programmers use association rules to build programs capable of machine learning.
UNIT THREE
Introduction
It is said that we live in the data age. This is 100% true, and the reason is already known to you
from our previous discussions. The data revolution presents both opportunities and challenges.
Businesses can take advantage of the big data revolution by mastering the technologies which
have emerged to deal with it. Database technologies, in particular, have emerged to support the
handling of big data. In our previous discussions, you were introduced, in passing, to database
technologies such as relational or SQL databases and NoSQL databases. These are the most popular
database technologies which have emerged for the management of big data. In this unit we will dive
into the "pool" of SQL databases. SQL, which stands for Structured Query Language, is a
programming language designed on the relational model.
Objectives
Explain the
The relational model is solidly based on two parts of mathematics: first-order predicate logic and the theory of relations.
This course, however, does not dwell on the theoretical foundations, but rather on the features of the relational model that are
important for database design and use.
The central purpose of the relational model is to provide a framework for designing efficient, anomaly-free databases.
The relational model became necessary when the management of big data with traditional methods became problematic. E. F.
Codd, an IBM engineer and researcher, invented the relational model as a means of efficiently managing big data. It will be
helpful for you to read the original theory of the relational model as written by Dr. E. F. Codd. The principles enshrined in
the relational model are summarized into what is termed normalization. Normalization seeks to eliminate all the
anomalies associated with traditional (also called flat-file) databases. Some of the challenges include anomalies
in inserting, updating, and deleting data.
In addition to the above anomalies, there is also much difficulty and complexity in generating
reports from flat file databases, especially if the report is a little bit complex.
Basically, normalization divides larger tables into smaller tables which store data on specific
subject matter. These tables are then linked together using what is referred to as table
relationships. The purpose of normalization is to eliminate redundant (useless) data and ensure
data is stored logically, so that data management is efficient. There are several stages in the
normalization process, referred to as normal forms. We will discuss the first three of the normal
forms. These are the First Normal Form (1NF), the Second Normal Form (2NF), and the Third
Normal Form (3NF).
The first step to constructing the right database table is to ensure that the data is stored in its first
normal form (1NF). When a table is in its first normal form, searching, filtering and sorting
information is easier. The following are the specific principles of 1NF:
1. Each record must be unique, identified by a primary key.
2. Each column must store atomic (indivisible) values.
3. There should be no repeating columns for the same attribute.
4. There should be no columns with multiple values.
Notice that in table 2, the first and third records have the same values for FirstName and
Surname even though they are different students. In this case, the role of the primary key
is so obvious – it shows that the two students, even though they bear the same first and
last names, are unique by their Studid (the primary key).
Table 2: Student Table With Violation of the Principle of Atomic Column Values
Studid (PK)   StudentName           OtherAttributes
1             Josh Addo
2             Nana Adwoa Sampah
3             Josh Addo
In Table 2, the principle of atomicity of column values is violated. Under column 2, titled
StudentName, the values stored are not atomic in nature. For example, the value “Josh
Addo” is compound in nature, and can be further split into two parts as “Josh” and
“Addo”. To resolve this violation, the StudentName column should be broken into its
separate atomic columns as FirstName, and Surname (or LastName). Table 3 shows the
correction of the violation of the atomicity principles.
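As a quick sketch of how the corrected, atomic design might be declared in SQL (the data types and column lengths here are illustrative assumptions, not taken from the tables above):

-- Student table with atomic columns: StudentName is split into
-- first_name and surname, and studid remains the primary key.
CREATE TABLE student (
    studid     integer PRIMARY KEY,
    first_name varchar(50),
    surname    varchar(50)
);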
In Table 4, the columns Phone1 and Phone2 are columns storing data about the same
attribute (Phone). This violation can be resolved in two main ways:
We can adopt a policy of storing data on only one phone number. In this case the
problem of repeating phone columns will be done away with.
We can also move the repeating columns (phone1, phone2, etc) from the Student table
and form them into an entity/table. This entity could be called contact entity, and could
be populated with multiple phone numbers, as well as other contact details like email,
website, and address. This newly formed entity (contact) is then linked to the student
entity and all other relevant entities. Tables 6 and 7 are the resulting entities after the
violation is resolved.
4. There should be no Columns with Multiple Values
An alternative way of ‘dancing around’ the problem presented in Table 4 is to maintain a
single column and store multiple values, separated by semi-colon or comma (see Table 5
below). This is, as well, a violation of 1NF principle, as the values under the Phone
column are not atomic.
This violation can be resolved in the same way as the problem of repeating columns.
There can be a policy to store only one phone number. Alternatively, the violation can be
resolved by taking out the column with multiple values and forming it into an entity, and
therefore a separate table. This separate table could be called Contact and may contain
phone and other contact data. The Contact table is then made to ‘communicate’ with the
Student table through table relationship (we will talk in detail about relationship later).
Tables 6 and 7 below demonstrate how the above violation is resolved.
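A minimal SQL sketch of this resolution, building on the student table sketched earlier, could look like the following (the table and column names are assumptions for illustration):

-- Repeating phone columns move into their own contact table.
-- Each contact row points back to one student through a foreign key,
-- so a student can have any number of phone numbers, emails, etc.
CREATE TABLE contact (
    contactid     integer PRIMARY KEY,
    studid        integer REFERENCES student (studid),
    contact_type  varchar(20),    -- e.g. 'phone', 'email', 'address'
    contact_value varchar(100)
);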
We already know about the 1NF, so let’s break the requirements of the 2NF down. The main
feature of relational database design is to make different tables hold data on a single subject. In
relational database design, it is wrong to lump multi-purpose data into a single table. This will
create serious problems when updating, deleting, and building reports from such a database. A
table design like the one depicted in Table 8 is not 2NF-normalized since the table stores data on
multiple subjects.
Table 8 above violates 2NF because it is not storing data on a single subject: apart from
student data, it also contains data on students' residence (halls and hostels) as well as the programs
read by students. Table 8 is not a single-purpose table. At least this table is serving three
purposes:
1. storing student-specific data;
2. storing data on residence (halls and hostels); and
3. storing data on the programs read by students.
Like we have said already, relational database tables must store data on only one, specific subject.
The main reason why relational database tables are put in 2NF is to narrow the tables to a
single purpose. Doing so brings clarity to the database design, makes it easier for us to describe
and use the table, and it also helps to eliminate modification anomalies.
For a database table to describe a single subject, you need to make sure that all the
attributes/columns/fields of the table are describing that specific subject matter, and nothing else.
For example, for the student table to be describing students, and therefore serving a single
purpose, all the attributes/columns/fields in the table must be specifically describing the student,
and no other business. Any attribute that is not specifically describing the student entity is an
“alien” attribute and must be removed, and formed into its own entity. When all the attributes are
specifically describing the entity in question, the table is said to be in 2NF.
In relational database terminologies, the primary key of a table represents the table. In other
words the purpose of every table can be identified by its primary key. For example, in Table 8,
the purpose of the table can be identified by the table’s primary key (Studid). The primary key
Studid suggests that the table’s purpose is to store data on students. The primary key Studid also
means that the table should have the single purpose of storing student-specific data. All other
non-key attributes should support this agenda of making the table a single-purpose one
(storing student-specific data). All of those attributes (such as Residence and Program in this
case) which do not support the primary key in making the table a single-purpose table are to be
removed. They belong somewhere else. They need to be formed into tables that store
residence-specific and program-specific data.
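A hedged SQL sketch of this 2NF correction might look like the following (all names and types are illustrative; the point is that residence and program data each get their own table, referenced from the student table):

CREATE TABLE hall (
    hallid    integer PRIMARY KEY,
    hall_name varchar(50)
);

CREATE TABLE program (
    programid    integer PRIMARY KEY,
    program_name varchar(50)
);

-- The student table now stores only student-specific data,
-- plus foreign keys pointing at the "alien" subjects moved out of it.
CREATE TABLE student (
    studid     integer PRIMARY KEY,
    first_name varchar(50),
    surname    varchar(50),
    hallid     integer REFERENCES hall (hallid),
    programid  integer REFERENCES program (programid)
);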
When all the non-key attributes support the primary key in defining a single-purpose table, those
non-key attributes are said to be fully, and functionally dependent on the primary key. When we
talk about attributes or columns being functionally dependent on the primary key, we mean, that
in order to find a particular value stored under any of the non-key attributes/columns, such as the
surname “Oforiwah”, you would have to know the primary key value, “3”, to look up for the
value “Oforiwah”. The primary key of 3 is the true identifier of the third record of Table 8. In
judging whether a non-key attribute/column is dependent on the primary key, ask yourself the
following question:
“Does this attribute/column/field serve to describe what the primary key represents? If you
answer “yes,” then the attribute/column/field is dependent on the primary key and therefore
describes the purpose of the table. If you answer “no”, then that column/field does not describe
the purpose of the table and therefore not dependent on the primary key. Such an
attribute/columns/field should be moved and formed into a different table for its own purpose.
When all the columns relate to the primary key, the columns in combination, serve a common
purpose of defining a single purpose table. In summary, when a table is in second normal form,
that table serves a single purpose of storing data on a single entity, not multiple entities.
Once a table is in second normal form, we are guaranteed that every non-key column/field is
functionally dependent on the primary key. In other words, the table serves a single purpose by
storing data on a single entity. However, in some cases, some non-key attributes/columns/fields
may depend on the primary key through another non-key column. The non-key attribute that
depends on the primary key through another non-key attribute is said to be transitively dependent
on the primary key, meaning that the ‘transitive’ attribute/column cannot depend on the primary
key without the presence of some other non-key column. This is referred to as transitive
dependency. Transitive dependency is a violation of normalization, more specifically, a violation of
third normal form. Table 9 shows an example of a table with transitive dependency.
In functional dependency notation, X ---> Y means that X determines Y (equivalently, Y depends on X). A transitive dependency arises when:
• X ---> Y (Y depends on X)
• Y does not ---> X (X does not depend on Y)
• Y ---> Z (Z depends on Y)
so that, indirectly, X ---> Z (Z depends on X only through Y).
In Table 9, Studid is the unique identifier for Firstname, Surname, and DoB (Studid ---> DoB), and Age is determined by DoB (DoB ---> Age). Age therefore depends on the primary key Studid only transitively, through DoB.
Transitive dependency leaves a trace of a multi-purpose table. 2NF's effort to define a single-purpose
table will be incomplete if transitive attributes/columns are present.
Transitive dependency causes redundant data, update and delete anomalies. Such a situation can
cause inconsistencies and must be resolved. When this transitive dependency is resolved, the
table then moves into 3NF. In summary, a table is in third normal form if:
1. it is in 2NF and
2. it contains only attributes/columns/fields that are non-transitively dependent on the
primary key
The transitive dependency problem (3NF violation) could be resolved by simply removing the
transitive column Age from the student table. This will be enough since we can compute age
from DoB.
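For instance, in PostgreSQL the age can be derived whenever it is needed instead of being stored (a sketch, assuming a student table with a dob column of type date):

-- Age is computed from DoB at query time, so no transitive Age column is stored.
SELECT studid,
       surname,
       dob,
       date_part('year', age(dob)) AS age
FROM student;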
Example 1
Suppose a company wants to store the complete data of each employee, and therefore a table
named employee_records, that looks like Table 11 is created:
Table 10: Database Table In Violation of 3NF
emp_id emp_fname emp_sname emp_postcode emp_region emp_city emp_district
To remove the transitive dependency and make this table comply with 3NF, we have to
break the table into two tables, an Employee table and a Location table, as follows:
By this design the problem of the 3NF violation is eliminated. The two tables are made to communicate
with each other through a table relationship. The linking field is postcode, which is a primary key in the
Location table and a foreign key in the Employee table.
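In SQL, the split described above might be sketched as follows (data types are assumptions; the column names come from the table header shown earlier):

-- Location: postcode determines region, city, and district.
CREATE TABLE location (
    emp_postcode varchar(10) PRIMARY KEY,
    emp_region   varchar(50),
    emp_city     varchar(50),
    emp_district varchar(50)
);

-- Employee: postcode is now only a foreign key into location,
-- so region, city, and district no longer depend on it transitively here.
CREATE TABLE employee (
    emp_id       integer PRIMARY KEY,
    emp_fname    varchar(50),
    emp_sname    varchar(50),
    emp_postcode varchar(10) REFERENCES location (emp_postcode)
);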
Example 2
In this example, we are interested in modeling a project management system into a relational
database. Students are made to work on projects as part of what is required for them to earn a
degree. In Table 13, you can observe that there is a normalization problem. Can you tell what the
problem is? Of course, the table has both 2NF and 3NF normalization problems. A closer look at the
table reveals the following:
The table is not a single purpose one, a 2NF problem. You can clearly notice that the table attempts to
store student data, as well as project data. In a relational database, it is criminal to design a table in
this way, for the reason of all the anomalies we’ve discussed. There is a transitive dependency
problem, a 3NF problem. There appears to be two possible key columns for the table, Studentid and
Project_number. This is part of the reason for the 2NF problem above. Two possible key columns will
automatically create two-tables-in-one. Even though the primary key for the combined table is
Studentid, the Project_number column naturally has unique values which guarantees unique
identification of projects. The presence of the Project_number column means that some non-key
columns will not depend on the Studentid primary key column directly. Actually, the non-key column
Project_name depends on Project_number, and not primarily on the Studentid primary key
column. The Project_name attribute depends on the primary key (Studentid) through the Project_number
attribute. For the table to be in 3NF, there should be only one primary key column and all the
other columns should fully depend on that primary key. No columns should depend on some other
column which is not the primary key. In relation to this example, the columns First_name, Last_name,
Project_number, and Project_name all should fully and exclusively depend on the primary key
(Studentid) for the table to be in 3NF. At the moment, this is not the case. The column
Project_name depends on another non-key column (Project_number).
To resolve the problem and put the table in 3NF, the table must be split into two: a Student table and
a Project table, as follows. By this design, each of the two tables is correctly normalized from 1NF to
3NF. In Table 14 the Student table has one primary key column, Studentid. All the non-key columns in
the table (First_name, and Last_name) are fully and exclusively dependent on the primary key
Studentid. Note that the Project_number column in the Student table is a foreign key column, serving
the purpose of linking the Project table (see Table 15) to the Student table. Table relationships are
discussed in detail in the following session.
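A hedged SQL sketch of the split (types and lengths are assumptions) could be:

CREATE TABLE project (
    project_number integer PRIMARY KEY,
    project_name   varchar(100)
);

-- Every non-key column in student now depends only on studentid;
-- project_number is kept purely as a foreign key link to project.
CREATE TABLE student (
    studentid      integer PRIMARY KEY,
    first_name     varchar(50),
    last_name      varchar(50),
    project_number integer REFERENCES project (project_number)
);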
The concept of primary key has already been discussed extensively in previous
sessions. A table is represented by its primary key. A primary key field contains
values which uniquely identify every record in the table. For there to be a
relationship between any two tables, there should be a copy of the primary
key field of one table in the other table. In the receiving table, the copied key
column is a foreign key column, and its main job is to create a link between the two
tables.
For example, in Table 13, the Studentid column is the primary key column for the
Student table. The Project_number column in the Student table is referenced from the
Project table (see Table 14). In the Project table the Project_number column is a
primary key. This is referenced into the Student table as a foreign key to establish the
basis of connection between the two tables. Deciding on which table receives the
foreign key is largely an issue of practicality and convenience. However, what is
practical and convenient can be relative. For example, it may be more convenient and
practical that when you enter a record in the Student table, you specify which project
is assigned to the student. If this way of capturing the data into the tables is the most
convenient for you, so be it. On the other hand, if it is more convenient
and practical for you to specify which student is responsible for which project after
entering the project details, then that becomes the design. In this latter case, the
foreign key will be placed in the Project table: the primary key of the Student
table (Studentid) will be referenced into the Project table as a foreign key.
The table which has the foreign key is referred to as the child table. For example, in
the student-project database above (see Tables 13 and 14), the child table is the table
named Student. Can you figure out why it is fitting to call the table with the foreign key
the child table? It is because the table is a dependent table, dependent in the sense
that the values of one of its columns (the foreign key column) are fed by a "parent".
This is understandable: if you are fed, then you are a child, and the one who feeds you
is usually your parent. Not surprisingly, the source from which the foreign key
originates is called a parent table. In reference to the student-project database above,
the parent table is the Project table, and you should be able to explain why the Project
table is a parent table.
Remember that in a relational database, each table is made to hold entity-specific data
such that from any one particular table, you cannot find related data (see 2NF above).
For example, in the Student table, you can only find student-specific data. You cannot
find data on students’ fees. Such a design is purposely to eliminate the anomalies
associated with flat file database systems (you can revisit the session on flat file
database anomalies). But definitely, relational database tables do have relationships
between themselves. For example for a school database, Student entity has a
relationship with Course entity, and the relationship is that students read courses or
courses are read by students. In a sales processing system, the Customer entity has a
relationship with the Order entity, and the relationship is that customers place orders
or orders are placed by customers. There are three types of relationships which can exist
between related tables:
one-to-many relationship
many-to-many relationship
one-to-one relationship.
One-To-Many Relationship
The most common type of table relationship in a relational database is the one-to-many relationship. In
a one-to-many relationship, one record in one entity can be referenced by multiple records in another
entity. The Manufacturer and Model tables below show the physical implementation of such a
relationship in a car database.
Manufacturer_id Manufacturer_name
1 Mercedes
2 Honda
3 Kia
4 Toyota
5 Ford
Table 19: Model Table Of The Physical Implementation Of The Car Database
Model_id   Model_name   Year   Manufacturer_id
1          Camry        2008   4
2          Accord       2018   2
3          C-Class      2020   1
4          Focus        2017   5
5          Corolla      2018   4
6          E-Class      2017   1
The one-to-many relationship between the Manufacturer table and
the Model table is more conspicuous in the physical model. For
every one record in the Manufacturer table, there is an associated
multiple records in the Model table. For example, the first record in
the Manufacturer table is Mercedes, with a manufacturer_id of 1. In
the Model table, the manufacturer Mercedes is associated with the
model names C-Class, E-Class, etc. This makes it one manufacturer
(Mercedes) mapping to more than one model (C-Class and E-Class).
On the other hand, for every one record in the Model table, there is
one and only one associated record in the Manufacturer table. For
example, C-class is a model name with the model_id of 3. This is
one record in the model table, and can relate to one and only one
manufacturer (Mercedes) in the Manufacturer table.
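A sketch of how this one-to-many relationship could be declared and populated in SQL (the year column name is an assumption about the third column of the Model table):

CREATE TABLE manufacturer (
    manufacturer_id   integer PRIMARY KEY,
    manufacturer_name varchar(50)
);

-- manufacturer_id in model is a foreign key: many models may point
-- to the same manufacturer, but each model points to exactly one.
CREATE TABLE model (
    model_id        integer PRIMARY KEY,
    model_name      varchar(50),
    model_year      integer,
    manufacturer_id integer REFERENCES manufacturer (manufacturer_id)
);

INSERT INTO manufacturer VALUES (1, 'Mercedes'), (2, 'Honda');
INSERT INTO model VALUES (3, 'C-Class', 2020, 1), (6, 'E-Class', 2017, 1);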
Figure 8: Logical Modeling of the Relationship Between Student and Program Entities
Programid   Program_name
1           Accounting
2           Computer Science
3           Procurement
4           Statistics
From Tables 20 and 21, you can see the physical rendering of
relationship between the Program table and the Student table. It
can be seen that for any one record in the Program table, there is
more than one related record in the Student table. For example, if
we take the Accounting program in the Program table, we can find
at least two matches in the Student table. In terms of the type of
relationship, this is one record from the Program table to many
records in the Student table. On the other hand, one record (one
student) from the Student table can relate to one and only one
record (one program) in the Program table. This ultimately creates
a one-to-many relationship between the Program and Student
tables.
Tables 22 and 23 below demonstrate the physical design of the relationship between the Customer and
Order entities.
Customerid   Customer_name   Phone
1            UCC             0332012348
2            UG              033123249
3            KNUST           033212345
4            UEW             033149857
Orderid   Order_date   Amount   Customerid
1         2021-11-06   20000    2
2         2019-10-02   23000    4
3         2020-01-10   12000    1
4         2021-11-07   30000    2
5         2019-12-23   15000    1
6         2020-03-12   20000    3
A closer look at Tables 22 and 23 reveals that one customer in the Customer table is
related to multiple orders in the Order table. This is true as any particular customer
can buy from a company several times. On the reverse side, one order in the
Order table is related to one and only one customer in the Customer table. No one
order can come from two or more customers.
Many-To-Many Relationship
In this example, we will show an extract of a university database. We are only interested in two
entities which are related in a many-to-many fashion. There can be several entities which bear
many-to-many relationships. However, at this point we will consider the lecturers and the
subjects they teach. The entities involved will be Lecturer and Subject. Let's check how this is a
many-to-many relationship. One lecturer could teach one or many subjects, but one subject
could also be taught by one or many lecturers. This is a many-to-many relationship, and your
logical model should look like this:
Figure 10: Logical Model Of A Many-To-Many Relationship Between Lecturer and Subject
Entities
Many-to-many relationships are not ideal. If we translate this logical model (see Figure 10)
directly into a physical database, the data would be duplicated. For instance, if there’s a lecturer
that teaches six subjects, you would have him or her listed in the table six times, every time for a
different subject. This is quite inefficient. So, how would we resolve this many-to-many
relationship between these two entities? A many-to-many relationship is not implemented as
straightforwardly as the logical model in Figure 10 suggests. A many-to-many relationship is implemented by
introducing a junction/link table into your model. The junction or link table is placed in-between
the two main entities (Lecturer and Subject entities in this case). This design breaks the direct
many-to-many relationship into multiple one-to-many relationships as shown in Figure 11.
As you can see, there’s a new table called Subject_details. Others prefer to name the junction
table by pairing the names of the tables which it is providing the linkage between for
example,insteadof the name lecture_details, the junction table could have been named
Lecturer_Subject. Note that the junction table (Subject_detail) contains the following attributes:
• Lecturerid : A foreign key attribute, which references the Lecturer_id column in the
Lecturer entity.
• subjectid : A foreign key attribute, which references the Subject id attribute in Subject
entity.
The attribute Lecturerid is a foreign key attribute in the Subject_details entity. The same goes
for Subjectid; it is a foreign key attribute in the Subject_details entity. At the same time,
the pair (Lecturerid, Subjectid) is the primary key for the table Subject_details. The
columns Lecturerid and Subjectid, in the junction entity, together form a composite
primary key (i.e. a primary key that consists of two or more attributes). This composite primary key
ensures that a lecturer can be assigned to the same subject only once. Each pair of values
(Lecturerid, Subjectid) can appear in the junction table no more than once. The same goes
for the subjects; each one can be assigned to the same lecturer only once. The composite key ensures
the uniqueness of the attribute combinations.
Tables 24 through 26 depict the physical modeling of the many-to-many relationship between the
Lecturer and Subject tables. Let’s check that the junction table solves the many-to-many
relationship. One lecturer can be allocated only once to the same subject. On the other hand,
one subject can be assigned only once to the same lecturer.
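The junction-table design might be sketched in SQL as follows (table and column names follow the discussion above; data types are assumptions):

CREATE TABLE lecturer (
    lecturerid    integer PRIMARY KEY,
    lecturer_name varchar(100)
);

CREATE TABLE subject (
    subjectid    integer PRIMARY KEY,
    subject_name varchar(100)
);

-- Junction table: the composite primary key (lecturerid, subjectid)
-- guarantees each lecturer-subject pairing can appear only once.
CREATE TABLE subject_details (
    lecturerid integer REFERENCES lecturer (lecturerid),
    subjectid  integer REFERENCES subject (subjectid),
    PRIMARY KEY (lecturerid, subjectid)
);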
In this example, our task is to create a database that will help a company store information about
their suppliers. The database will also contain info on all the products/services ordered from the
suppliers. The logical data model could look something like this:
The relationship between these two entities is again many-to-many. One or many products can
be ordered from one supplier. At the same time, the company can order the same product from
many suppliers, e.g. services from different legal firms, tires from different manufacturers, etc.
How would this logical model look when transformed into a relational database model? Figure
12 depicts the ‘raw form’ of the many-to-many relationship between the two entities. But once
again, it will be wasteful to translate Figure 12 directly into a physical database. As we
discussed in the example above, a many-to-many relationship is implemented by breaking it
into multiple one-to-many relationships, placing a junction table (supplier_product) in between
the entities, as shown in Figure 13.
Figure 13: Implementing Many-To-Many Relationship for Supplier and Product Entities
Once again, instead of a direct many-to-many link, there's a new junction table
named supplier_product. In this implementation the junction table has only two attributes:
• Supplierid : A foreign key attribute which references the Supplier entity.
• Productid : A foreign key attribute which references the Product entity.
Again, the pair (Supplierid, Productid) is the primary key (actually a composite key) of the
supplier_product table. At this level of implementation, our database only tracks suppliers and
the products they supply to us. If we want our database to track orders we make to suppliers, it
would be better to expand the junction table supplier_product a bit, as shown in Figure 14.
Figure 14: Order Entity Serving as Junction Between Supplier and Product Entities
The design looks much better now. First of all, the name of the junction table has changed to
something more descriptive; it’s now named order. Several new attributes have also been added
to the table. It consists of the following:
• Orderid : The ID of this order from the supplier and the table’s primary key (PK).
• Supplierid : The ID of the supplier; references the table supplier.
• Productid : The ID of the product ordered; references the table product.
• Order_date : The date of the order.
• Quantity : The number of items ordered.
• Total_price : The total value of the ordered products.
• Status : The status of the order.
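A hedged SQL sketch of this expanded junction table (the data types are assumptions; note also that order is a reserved word in SQL, so it is quoted here, and in practice a name such as supplier_order avoids the quoting):

CREATE TABLE supplier (
    supplierid    integer PRIMARY KEY,
    supplier_name varchar(100)
);

CREATE TABLE product (
    productid    integer PRIMARY KEY,
    product_name varchar(100)
);

-- The junction table has become an entity in its own right,
-- with its own primary key and descriptive attributes.
CREATE TABLE "order" (
    orderid     integer PRIMARY KEY,
    supplierid  integer REFERENCES supplier (supplierid),
    productid   integer REFERENCES product (productid),
    order_date  date,
    quantity    integer,
    total_price numeric(10, 2),
    status      varchar(20)
);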
Remember that there are two possibilities when creating a junction table. One is that it contains
only foreign keys that reference the other tables, which happens often enough (see Figure 13).
However, sometimes the junction table becomes its own entity, as in Figure 14, where it also
contains other attributes. You should always adapt the model to your needs. Now that you’ve
become so good at this, let’s take a look at one more example!
In this example, we model a book publishing database. The entities involved are:
• Book
• Staff
• Role
The Book entity has the following attributes:
• isbn : The International Standard Book Number, a primary identifier (PI) used for books.
• Book_title : The title of the book.
• issue : The issue (i.e. edition) of the book (e.g. first printing, first edition, etc.).
• date : The date of the issue.
Between the Book and Staff entities, there’s a many-to-many relationship. Let’s check the logic.
One staff member can work on one or many books. One book can be handled by one person
(well, hardly) or by many people. You must comprehend why the relationship between the two
entities is many-to-many.
The Role entity has the following attributes:
• Roleid : The ID of the role; a primary identifier (PI).
• role_name : The name of the role.
• role_description : A description of that role.
Again, there is a many-to-many relationship between the entities Staff and Role. This is the logic
behind the relationship: one staff member can fill one or many roles when working on a book
and that one role can be performed by one or many staff members. Role, in this sense, means
something like author, co-author, editor, proofreader, translator, illustrator, etc. For instance,
the author of one book can also be an illustrator on another book, translator on a third, and
proofreader on a fourth.
The implementation of the many-to-many relationships among the three entities will look like
Figure 16.
Figure 16: Implementing The Many-To-Many Relationships Between Book, Staff, And Role
Entities
This model seems a bit complicated. Until now, we have been dealing with situations
involving only two entities. The modeling in Figure 16, however, has three entities related to
each other in a many-to-many relationship. This kind of relationship, in which three entities
participate, is referred to as a ternary relationship. Here, the junction table again has
a composite primary key that is made up of foreign keys. This time, however, the primary key
consists of three columns, not two.
Let’s analyze the junction table named book_creators. It has three attributes as follows:
• Book_isbn : This is a foreign key which references the Book_isbn (primary key) of the
book entity.
• Staffid : This is a foreign key which references the Staffid (primary key) of the
Staff entity.
• Roleid : This is a foreign key which references the Roleid (primary key) of
the Role entity.
The primary key of the junction table (book_creators) is the unique combination of the
attributes Book_isbn, Staffid, and Roleid.
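A sketch of the ternary junction table in SQL (column types are assumptions) might look like this:

CREATE TABLE book (
    book_isbn  varchar(17) PRIMARY KEY,
    book_title varchar(200)
);

CREATE TABLE staff (
    staffid    integer PRIMARY KEY,
    staff_name varchar(100)
);

CREATE TABLE role (
    roleid           integer PRIMARY KEY,
    role_name        varchar(50),
    role_description varchar(200)
);

-- Ternary junction: the primary key is the combination of three foreign keys,
-- so a given staff member can hold a given role on a given book only once.
CREATE TABLE book_creators (
    book_isbn varchar(17) REFERENCES book (book_isbn),
    staffid   integer REFERENCES staff (staffid),
    roleid    integer REFERENCES role (roleid),
    PRIMARY KEY (book_isbn, staffid, roleid)
);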
One-To-One Relationship
One-to-one relationship is the entity relationship where one record in table A is related to one
and only one record in table B, and on the reverse, one record in table B also relates to one and
only one record in table A. The following are real-life examples of one-to-one relationships:
• Country and capital city relationship: Each country has exactly one capital city. Each
capital city is the capital of exactly one country.
• Person and their fingerprints. Each person has a unique set of fingerprints. Each set of
fingerprints identifies exactly one person.
• Email and user account. For many websites, one email address is associated with exactly
one user account and each user account is identified by its email address.
• Spouse and spouse relationship: In a monogamous marriage, each person has exactly
one spouse. But in a polygamous marriage, the relationship will not be one-to-one.
• User profile and user settings. One user has one set of user settings. One set of user
settings is associated with exactly one user.
For clarity, let’s contrast these examples with relationships that are not one-to-one:
• Country and city relationship: Each city is in exactly one country, but most countries
have many cities. This is a one-to-many relationship.
• Parent and child relationship: Each child has two parents, but each parent can have
many children.
• Employee and manager relationship: Each employee has exactly one immediate
supervisor or manager, but each manager usually supervises many employees.
The one-to-one relationship between country and capital can be denoted like this:
The perpendicular straight lines mean “mandatory”. This diagram shows that it’s mandatory for
a capital to have a country and it’s mandatory for a country to have a capital.
Another possibility is for one or both of the sides of the relationship to be optional. An optional
side is denoted with an open circle. This diagram says that there is a one-to-one relationship
between a person and their fingerprints. A person is mandatory (fingerprints must be assigned to
a person), but fingerprints are optional (a person may have no fingerprints assigned in the
database).
One way to implement a one-to-one relationship in a database is to use the same primary key in
both tables. Rows with the same value in the primary key are related. In this example, France is
a country with the id 1 and its capital city is in the table capital under id 1.
country
id name
1 France
2 Germany
3 Spain
capital
id name
1 Paris
2 Berlin
3 Madrid
Technically, one of the primary keys has to be marked as foreign key, like in this data model:
The primary key in table capital is also a foreign key which references the id column in the
table country. Since capital.id is a primary key, each value in the column is unique, so the
capital can reference at most one country. It also must reference a country – it’s a primary key,
so it cannot be left empty.
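In SQL this shared-primary-key approach could be sketched like this (column lengths are assumptions):

CREATE TABLE country (
    id   integer PRIMARY KEY,
    name varchar(50)
);

-- capital.id is both the primary key and a foreign key into country,
-- which enforces the one-to-one relationship.
CREATE TABLE capital (
    id   integer PRIMARY KEY REFERENCES country (id),
    name varchar(50)
);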
Another way you can implement a one-to-one relationship in a database is to add a new column
and make it a foreign key.
In this example, we add the column country_id to the table capital. The capital with id 1,
Madrid, is associated with country 3, Spain.
country
id name
1 France
2 Germany
3 Spain
capital
id name country_id
1 Madrid 3
2 Berlin 2
3 Paris 1
Technically, the column country_id should be a foreign key referencing the id column in the
table country. Since you want each capital to be associated with exactly one country, you should
make the foreign key column country_id unique.
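A sketch of this second approach, assuming the country table from the previous sketch already exists:

-- country_id is a foreign key; the UNIQUE constraint stops two capitals
-- from pointing at the same country, keeping the relationship one-to-one.
CREATE TABLE capital (
    id         integer PRIMARY KEY,
    name       varchar(50),
    country_id integer UNIQUE REFERENCES country (id)
);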
One-to-one relationships are the least frequent relationship type. One of the reasons for this is
that very few one-to-one relationships exist in real life. Also, most one-to-one relationships are
one-to-one only for some period of time. If your model includes a time component and captures
change history, as is very often the case, you’ll have very few one-to-one relationships.
A monogamous relationship may split up or one of the partners may die. If you model the reality
of monogamous relationships (such as marriages or civil unions) over time, you’ll likely need to
model the fact that they last only for a certain period.
You’d think that a person and their fingerprints never change. But what if the person loses a
finger or the finger is badly burnt? Their fingerprints might change. It’s not a very frequent
scenario; still, in some models, you may need to take this into account.
Even something seemingly as stable as countries and their capitals changes over time. For
example, Bonn used to be the capital of West Germany (Bundesrepublik Deutschland) after
World War II, when Berlin was divided and East Berlin served as the capital of East Germany.
This changed after German reunification; the capital of Germany (Bundesrepublik Deutschland) is now Berlin.
should not take this into account depends on your business reality and the application you’re
working on.
UNIT FOUR
Structured Query Language (SQL) is a flexible language that is used in creating and managing relational databases. It is by far
the most widely used tool for communicating with a relational database. SQL originated in one of IBM's research laboratories, as
did relational database theory. In the early 1970s, as IBM researchers developed early relational DBMS (or RDBMS) systems,
they created a data language to operate on these systems. They named the pre-release version of this language SEQUEL
(Structured English QUEry Language). This relational database language has been developed into what is now called SQL. The
syntax of SQL is a form of structured English, which is where its original name came from. To bring consistency to the application
of the SQL language, the American National Standards Institute (ANSI) has, since the mid-1980s, regulated the SQL language
through standardization. The current standardized SQL by ANSI is SQL:2016.
The SQL language is implemented within a relational database management system (RDBMS). Regardless of the particular
RDBMS (whether PostgreSQL, MySQL, SQLite, or some other one), the SQL language is largely the same except for a few
vendor-specific differences. Since SQL is implemented within an RDBMS, it is best to learn how to set up an SQL database
through an RDBMS. For the purposes of this course, the PostgreSQL RDBMS is adopted. Note that the choice of RDBMS does not
really matter, as the core of the language is the same across all RDBMS platforms. Let us take a few minutes to go through the
installation process of PostgreSQL.
PostgreSQL is a robust database system that can handle very large amounts of data. Here are some
reasons for choosing it for this course:
• It's free and open source.
• It comes with the graphical administration tool pgAdmin, which you can use to manage your database, import and export data, and write queries.

Windows Installation
For Windows, we recommend using the installer provided by the company EnterpriseDB, available at
https://www.enterprisedb.com/software-downloads-postgres/.
Select the latest available 64-bit Windows version of EDB Postgres Standard unless you're using an older PC with 32-bit Windows.
After downloading, the main steps of the installation are as follows (the exact wording of the dialogs may vary slightly between installer versions):
1. Run the installer on your computer. The program will perform a setup task and then walk you through a series of dialog boxes.
2. Accept or change the default installation directory.
3. Select the components to install, including the pgAdmin administration tool.
4. Choose the location to store data. You can choose the default, which is in a "data" subdirectory in the PostgreSQL directory.
5. Choose a strong password for the initial database superuser (postgres) and keep it somewhere safe; you will need it to connect to the server.
6. Select a port number where the server will listen. The default is 5432; if some other application is using that default, the installer may suggest 5433 or another number.
7. Select your locale. Using the default is fine. Then click through the remaining dialog boxes; the installation takes a few minutes.
If you choose to install the optional spatial extensions, the installer will additionally ask you to expand the Spatial Extensions menu and select the version matching your system, to make sure PostGIS and Create spatial database are selected (clicking Next and accepting the default database location), and to answer Yes when asked to register GDAL and to set the POSTGIS_ENABLE_OUTDB_RASTERS environment variable. These spatial add-ons are not required for this course.

After the installation, open pgAdmin and connect to your server:
1. In the object browser, expand the plus sign (+) to the left of the Servers node. Depending on your setup, the server name could be localhost or PostgreSQL.
2. Double-click the server name. Enter the password you chose during installation.
3. Expand the default postgres database.
4. Under postgres, expand the Schemas object, and then expand public. There's a lot here, but for now we'll focus on the location of tables. This is where you can access the tables you create. In the sessions that follow, you'll use this object browser to locate and view your tables and other database objects.
Like any language, SQL has a syntax (the grammatical arrangement of words in sentences) which one must adhere to in order to write
valid SQL code. To master the use of the language is, in effect, to master its syntax. As with learning any
language, it takes persistent practice to master the SQL language. At this juncture, we turn our focus to the building blocks of
the language.
SQL Statements
The SQL language consists of a limited number of statements that perform the main functions of data management. This is
good news, because it means the language is not too complex to learn. Some SQL statements define data, some
manipulate data, some control data, and others retrieve data for reporting. The set of statements for the definition of data
forms what is called the Data Definition Language (DDL). Those statements meant for the manipulation of data form what is called the Data
Manipulation Language (DML). The group of statements for the retrieval of data forms the Data Retrieval Language (DRL) or
Data Query Language (DQL), while those for the control of data form what is called the Data Control Language (DCL).
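As a hedged illustration of the four groups (the table, column, and role names below are invented purely for the example):

-- DDL: define a database object
CREATE TABLE demo (id integer PRIMARY KEY, label varchar(20));

-- DML: put data into the table and change it
INSERT INTO demo (id, label) VALUES (1, 'first row');
UPDATE demo SET label = 'updated row' WHERE id = 1;

-- DQL/DRL: retrieve data for reporting
SELECT id, label FROM demo;

-- DCL: control who may read the data
-- (assumes a role exists, e.g. one created earlier with CREATE ROLE report_user)
GRANT SELECT ON demo TO report_user;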
The Data Definition Language (DDL) is the part of SQL you use to create, change, or destroy the basic elements of a relational
database. Basic elements include tables, views, schemas, catalogs, clusters, and possibly other things as well. In the following
sections, we discuss the containment hierarchy that relates these elements to each other and look at the commands that
operate on these elements. A database is like a big container with sub-containers. The containment hierarchy of a database can
be broken down as follows: a cluster contains catalogs; a catalog contains schemas; a schema contains tables and views; and tables and views contain columns and rows.
You need to worry mostly about tables, as the relational database management system (RDBMS) manages the other
elements by itself. The RDBMS is a software package designed to interpret the SQL language and to handle the total
management of the database. As has been mentioned earlier, there are a number of RDBMSs designed for the management of
relational databases. Just for emphasis, examples are reproduced here as follows: PostgreSQL, MySQL, Microsoft SQL
Server, IBM DB2.
The RDBMS provides the interface for you to enter your DDL codes to create your
database tables and other database elements. This means that you need to download
the RDBMS and install it on your computer. After the installation is complete, you
can go ahead to create your database and the tables for the database. It must be noted
that tables are the building blocks of relational databases, and so you need to master
the art of table design as discussed previously. You should master all your normalization
principles, as these are the principles of table design in a relational database. As best
practice, you need to first of all break the system to be modelled down
into entities. You should know what an entity is by now. Keep in mind the following procedures
when planning your database:
After you complete the design of your database on paper (using ERD) and verify that it is sound, you’re ready to transfer
the design to the computer. The paper design of the database is referred to as logical design. And the translation of the
logical design onto the computer is referred to as the physical design. At this point in time, we will begin the practice of
SQL coding by trying to model a school database. We have chosen a school system because all of you are familiar with
it. In your personal practice, you may want to model a different system into a database. Note that you can model just
any system into a relational database.
The totality of the school system could be broken into the following entities:
• Student
• Course
• Teacher
• Program
• Department
• Fee/Payment
• Hall/Residence
• Guardian
The list can be more than what is provided above. After this exercise of breaking the
system into entities, you need to specify the attributes of the entities you want to keep
data on. For example, for any particular student in the Student table, we need to
capture data on their names, registration numbers, gender, contact details, and others.
In thinking about which attributes to capture about an entity, remember to be guided
by the principles of normalization as we learnt them in unit three. We will be
revisiting specific aspects of the principles of normalization where necessary. If you
find difficulties in coming up with the appropriate attributes of an entity, revisit unit three for
more guidance. In a relational database, each of the entities is represented with a table,
and the attributes you specify for the entities become the column labels/headings for
each table. So this is how, for example, the Student entity will be set up as a table in
the database:
Note that each of the columns stores specific data of a specific type. For example, the
first_name column stores textual data about students' first names, and nothing
else. Some other columns in some tables will store numerical data. For example, the
Payment table will have most of its columns of numerical data types. Before
we provide detailed information on SQL data types, let us see the logical view of the
school database (in an ERD form). The ERD in Figure xyz shows the various entities
(and therefore tables) in the school database, together with the attributes (also known
as fields or columns) of the entities. As we have said already, each of the columns
stores data of a specific type. By type, we mean the type of data, whether
text/string, or numeric, or something else. It is important at this point for us to
discuss the SQL data types which can be declared on fields.
SQL Data Types
SQL databases support a variety of data types. However for this course we will talk about the common SQL data types such as
the following:
» Numerics
» Character/Strings
» Booleans
» Datetimes
We call these general types because some of them contain sub-types. Data types constitute one of the critical
building blocks of relational database design, and therefore we will break the discussion down for easy understanding.
Basically, we will discuss each of the types and their sub-types in some detail as follows:
String Type
String types are general-purpose types suitable for any combination
of text, numbers, and symbols. Character data type has two sub-
types: character(n) [with the short form as char(n)], and character
varying(n) [with the short form as varchar(n)]
char(n): A fixed-length column, where the number of characters is specified by n; values shorter than n are padded with spaces up to the specified length.
varchar(n): A variable-length column, where the maximum number of characters is specified by n. In standard
SQL, it also can be specified using the longer name character varying(n).
Numbers
Number columns hold various types of (you guessed it) numbers,
but that’s not all: they also allow you to perform calculations on
those numbers. That’s an important distinction from numbers you
store as strings in a character column, which can’t be added,
multiplied, divided, or perform any other math operation. Also, as I
discussed in Chapter 2,
Integers
The integer data types are the most common number types you'll find.
The SQL standard provides three integer types: smallint, integer, and bigint.
The difference between the three types is the maximum size of the
numbers they can hold. The table below shows the upper and lower limits
of each, as well as how much storage each requires in bytes (these are the standard PostgreSQL ranges):
Type       Storage size   Range
smallint   2 bytes        -32768 to +32767
integer    4 bytes        -2147483648 to +2147483647
bigint     8 bytes        -9223372036854775808 to +9223372036854775807
Even though it eats up the most storage, bigint will cover just about
any requirement you'll ever have with a number column. Its use is a must
if you're working with numbers larger than about 2.1 billion, but you can
easily make it your go-to default and never worry. On the other hand, if
you're confident your numbers will remain within the integer limit, that type is
a good choice because it uses half the storage of bigint.
When the data values will remain constrained, smallint makes sense:
days of the month or years are good examples. The smallint type uses
half the storage of the integer type.
Decimal Numbers
As opposed to integers, decimals represent a whole number plus a
fraction
number of digits to the left and right of the decimal point, and the
decimal point.
Floating-Point Types
The two floating-point types are real and double precision. The difference
between the two is how much data they store: the real type allows precision
to about six decimal digits, while double precision allows about fifteen.
The database stores a floating-point number in parts representing the digits and
an exponent, so values are stored approximately rather than exactly.
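To tie the types together, here is a hedged sketch of a table that declares one column of each kind discussed above (the table and column names are invented for illustration):

CREATE TABLE data_type_demo (
    code        char(4),         -- fixed-length string, padded to 4 characters
    description varchar(100),    -- variable-length string, up to 100 characters
    small_count smallint,        -- small whole numbers (-32768 to +32767)
    big_count   bigint,          -- very large whole numbers
    amount      numeric(10, 2),  -- fixed-point decimal: 10 digits, 2 after the point
    reading     real,            -- floating-point number (approximate)
    is_active   boolean,         -- true/false values
    recorded_on date             -- calendar dates
);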
In the query editor, type the SQL statement which creates a database, as follows:
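Going by the database name used throughout this unit and the generic syntax described just below, the statement is simply:

CREATE DATABASE my_school;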
This statement creates an SQL database named my_school. Note that below the query
editor is a message about the status of your SQL code, indicating whether it executed correctly or there is an
error. When your statement is right, you get a feedback message like the one in Figure xxx: "CREATE
DATABASE statement returned successfully in 3 secs 925 msec". In summary, the statement you need to
write to tell your RDBMS (in this case PostgreSQL) to create a database is a combination of two SQL
keywords plus the name you want to give to your database. Also note that in SQL, a statement ends
with a semicolon (;), not a full stop. In generic form, the syntax for the CREATE DATABASE statement
is CREATE DATABASE database_name;. Note that SQL keywords in any statement are conventionally written in upper case.
That is the convention in the database community.
ASSIGNMENT
Search for, and familiarize yourself with, SQL keywords on Google.
Creating Tables
A database table looks a lot like a spreadsheet table: a two-dimensional array made up of rows and columns. You can
create a table by using the SQL CREATE TABLE command. Within the command, you specify the name and data type
of each column. This is where you need to apply the knowledge of SQL data types learnt in the previous session. The
generic form of the CREATE TABLE statement goes like this:
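That is, a generic template followed, for concreteness, by a small illustrative example (the example table and its columns are assumptions, not the code from the course figure):

CREATE TABLE table_name (
    column_1 data_type,
    column_2 data_type,
    column_3 data_type
);

-- For example, a simple table for the Department entity:
CREATE TABLE department (
    departmentid    integer PRIMARY KEY,
    department_name varchar(100)
);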
Note: this code could have been written as one line, like this:
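Using the same illustrative department example, the one-line form would be:

CREATE TABLE department (departmentid integer PRIMARY KEY, department_name varchar(100));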
This is an equally correct SQL statement, like the first one. The difference is that the first
one is more readable than the latter. With the first one you are able to see, at a glance, the
columns and their data types. This is not the case with the second style. More
importantly, the database community recommends the first approach. Now let's turn
to our ERD and create the Student table. Open your pgAdmin and type the code you
see in Figure xxx. If your statement runs successfully, you will see a confirmation message below the query editor.
As you can see from the feedback message, the CREATE TABLE statement executed successfully and
the Student table is created in your my_school database. How can you know that? In the object
browser pane of pgAdmin, locate the my_school database.
1. Right-click the my_school database and choose Refresh to refresh the system.
2. Click the dropdown arrow to see all the objects in the my_school database.
3. Expand the Schemas object, then public, and then Tables.
4. Under Tables, you can locate the Student table you just created.
5. If you click Columns under the Student table, you will see all its columns.
You can also use the SELECT statement to view your tables. You type the SELECT statement in the
query editor as follows:
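A minimal form of the query (assuming the table was created with the name student) is the following; the asterisk means "all columns":

SELECT * FROM student;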
Now let’s create the Teacher table. I know by now all of you can create just any table. In your
pgAdmin, type the following codes to create the Teacher table:
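The columns below are an illustrative assumption based on the attributes we have been discussing; adjust them to match your own ERD:

CREATE TABLE teacher (
    teacherid  integer PRIMARY KEY,
    first_name varchar(50),
    surname    varchar(50),
    gender     char(1),
    phone      varchar(20)
);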
We will create two more tables (the Payment table and the Course table) together, and then you will be left
on your own to create the rest.