0% found this document useful (0 votes)

12 views

Computer Application In Accounting

The document discusses the integration of computing and accounting, emphasizing the necessity for accounting professionals to acquire skills in data management and analytics due to the rise of big data in business environments. It highlights the transformation brought about by digital technology and the Internet of Things (IoT), which have enabled unprecedented data generation and necessitated new approaches to data management in accounting education. The text argues for a curriculum update in accounting to include knowledge of big data technologies and analytics tools to prepare students for the evolving demands of the profession.

Uploaded by

gabylastic

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

12 views

Computer Application In Accounting

Uploaded by

gabylastic

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 95

UNIVERSITY OF CAPE COAST

COLLEGE OF DISTANCE EDUCATION

COURSE TITLE:

Computer Applications In Accounting (1)

© COLLEGE OF DISTANCE EDUCATION, UNIVERSITY OF CAPE COAST

CODE PUBLICATIONS, 2019

Preface

It is no more the case that computing and Accounting studies should remain separate and distinct
academic disciplines from each other. Even though for specialization purposes, Computing and
Accounting remain separate academic disciplines, in application they are so much needed
together as in solving business problems. Not only do business organizations need knowledge in
business domains like Accountancy, Business Management, Human Resource Management and
the likes but also the technological know-how in dealing with today’s highly dynamic business
landscape. . It is an undeniable fact that in today’s turbulent business environment, only core
business knowledge is woefully inadequate in providing solutions to business problems. There is
the pressing need for integration between cor business knowlege such Accounting and
knowledge from other disciplines such as statistics, computing, information technology and
others. This is so because there is the need for advanced analyses of the big data generated by
business activities. Such deeper analyses of business data goes beyond the knowledge from
Accountancy, HRM, and Business Management disciplines, and require the application of
knowledge from sister disciplines like Statistics/Mathematics, Computing, Information
Technology, data and information science, and the likes.

Compounding this reality is the emergence of the so-called Digital Revolution, where along the
time line of the 21st century, it is ever possible for every activity under the sun to be digitized
into what is called “Big Data”. The concept of “big data” means the phenomenon where the
volume and velocity of data generation by today’s business organizations surpasses the storage
and processing capacities of traditional data management systems, thereby requiring new,
advanced data management technologies. It is therefore required of today’s business students,
and teachers alike, to master the knowledge, skills, tools and technologies of big data
management systems. Accounting curriculum need to change from the traditional style to
involve areas such as statistics, data analytics (using tools like Python, R, java excel), big data
management technologies such as SQL, NoSQL. To buttress our arguments, check out the
following statements from very authoritative sources.
“Accountancy professionals will be increasingly required to broaden their skill-set and take ownership of their
personal development if they are to succeed in an everchanging, dynamic world” (source: The Association Chartered
Certified Accountants (ACCA)

“The increased availability of data and of analytic tools have opened up a range of opportunities for the finance
function. Analytics have facilitated a more forward-looking [accounting and] finance functions, enabling the
generation of greater insights and the provision of immense value to stakeholders. Leveraging data analytics in
decision making provides significant opportunities for the[accounting and] finance functions to cement its position as
a true strategic partner” (sources: ACCA, and Chartered Accountants Australia and NewZealand).

“Accountants and finance teams have an opportunity to lead their organizations to better business decisions driven
by insights from analysis of big data, rather than following a familiar pattern of reporting on past events” (source:
ACCA).

“The insights fromour CEO survey reveal that businesses are preparing for a future that’s different fromtoday. And
they expect their talent to adapt. One implication of this rapidly changing business environment is clear—today’s
accounting curriculum

should be updated to equip students with new skills, especially in technology and data analytics” (source:
Pricewaterhouse Coopers(PwC)).

“Businesses today—accounting firms among them—are struggling to deal with the high volume of data that
technology has made available. Organizations that can successfully interpret these data and use them to make
crucial business decisions have a competitive advantage. In such a climate, it’s helpful for CPAs to know as much
about the technological tools they use as possible. Just knowing basic accounting and office software programs is no
longer enough: Accountants need to know how to code” (source: American Institute of Certified Public Accountants
(AICPA)).

These quotes are to make you informed about the global trends in the Accountancy and Finance
Profession, and to help you get the right direction in your studies and carrier. It is the dream of
the the author of this text to equip you with the knowlege of technological Applications in
Accounting. Even though the title of the course is “Computer Applications in Accounting”, the
course is not about traditional computing concepts such as the history of computers, computer
hardware, and the likes. This course broadly speaking, will introduce you to the concept of big
data revolution, big data management technologies, and data analytics.

UNIT ONE
INTERNET OF THINGS, DIGITAL TECHNOLOGY, AND THE BIG DATA REVOLUTION

Introduction

Business activities generate outcomes which can be recorded/stored, processed (analyzed,

summarized) into more decion-ready information, and disseminated to information consumers
(both within and outside of the business entity). so many years ago, the recording, storing and
storing of data on business activities were done using “hard” papers and books (you remember
the concept of “books of accounts, day books, and the likes). during those days data capturing,
storage, and processing techologies were so primitive that not all aspects of business data could
be captured, stored and processed. Even though traditional accounting books were intimidatingly
huge in size, they still could not afford to hold the quantum of data generated by business
activities. The use of pens, pencils and other primitive tools to capture data on business activities
is indisputably woefully inadequate. Accountancy, in those ages, seem to have handled the
challenge of big data versus inadequate data management systems by resorting to capturing and
storing only data of financial nature, and also doing basic data analyses. This practice meant that
the accounting function of business organization contributed to business development in a
fractional way, not fully as required.

Along the line of time was the advent of computers and information technologies which
tremendously improved data management capabilities of business organizations. Computing
power, combined with advances in information technology and analyses techniques, could
capture and process more data, more speedily than before, initiating the emergence of called “big
data”. Despite the emergence of improved technologies to manage large quantities of data, the
“magical” nature of big data was soon to be realized. By magical i mean data has the quality of
acting as a magician, a god, or a sorcerer who can reveal all the hidden facts about the universe.
Like a supernatural being, you need a medium(s) to engage data to reveal hidden secretes to
business managers.
Big data is now seen as the most important business asset and the ability to harness the benefits
of data creates a huge competitive advantage for businesses. Big Data technologies allow
companies and other organizations to use large amounts of data for effective decision making. They
allow business and other organizations to identify trends, patterns, and associations that would be
quite challenging or nearly impossible to find with conventional data-processing solutions. As a result,
there’s a huge demand for big data professionals. Harnessing the benefits of big data requires the
knowledge of big data management technologies such as relational database management
systems, and data analytics. In the ensuing sessions we will discuss into detail the concepts of big
data, and how this phenomenon has the order of the day.

What id big data? And how big is big data?

Big refer to the phenomenon where data generation has become so massive that tradition data
management systems and technologies can no longer handle it. Think of it. It is estimated that
every day, more than 3 quintillion bytes of data (3,000,000,000,000,000,000) is created globally,
and more than 90% percent of this gaping figure was created just about 6 years ago. This does
not necessarily mean that all the ages before six years ago did not produce as much data. The
reason is principally because those ages were without the technologies to readily and seamlessly
capture, store, and analyze data. The ability to generate these unimaginable quantities of data in
our age is made possible for three key reasons:

1. The advent and advances in digital technology

2. The proliferation of digital devices (including internet of things (IoT))

3. Improving computing and storage technologies at reducing costs

The advent and Advances in Digital and Information Technologies

The word “digital” is derived from the Latin word digitus, which means finger, one of the oldest
tools for counting. Computers since their creation have always understood the world in numbers
of zeroes and ones (0s and 1s). Digital technology refers to electronic technology that generates,
stores, and processes data in terms of two digits: 1 and 0. Each of these digits is referred to as a
bit (the short form of binary digit) (and a string of bits that a computer can address as a group is
called a byte). In the context of this course, the term technology refers to data management
technology that relies on the use of computing devices/microprocessors and applications that are
dependent on computers such as the Internet. Example of digital devices through which digital
technology is implemented include, but not limited to, desktops, laptops, tablets, servers,
mobile phones, smartphones, and any similar storage device which currently exists.

Information technology can be defined as the use of digital technology to facilitate the
capturing,and storing of business data, processing data into decision-ready form, and
disseminating decision-ready information to decision makers..

The advent and advances in digital technology, and the proliferation of digital devices has made
it possible for every bit of business data, financial or non-financial, operational or strategic, to be
generated and stored into a digital image on computers and other digital device. For example,
every business transaction can now be seamlessly captured and stored on a computer or some
other digital device in form of an electronic image. Sales of products to customers can be carried
out and stored electonically on computers and mobile devices as electronic images. Even
customers’ likes and preferences can be capture and stored in electronic image. The” likes” you
do on Facebook, Instagram, Twitter, and other social media platforms are stored as digital
images, creating data to be harnessed for improved decision making. Such an abililty to generate
digital data on every aspect of business activity is creating big data for businesses.
Figure 1: digital technology as a factor of the big data revolution

The Proliferation of Digital Devices (including Internet of Things (IoT))

In addition to digital technology, one other key factor contributing to the gargantuan data
creation is number of connected digital devices. In 2018, there were approximately 17
billion connected devices worldwide, with the number of IoT devices at 7
billion (not counting traditional internet-connected devices such as laptops or
smartphones), according to a report published by IoT Analytics. The term “Internet of
Things” generally refers to scenarios where network connectivity and computing
capability extend to objects, sensors and everyday items not normally considered
computers, allowing these devices to generate, exchange and consume data with
minimal human intervention. IoT devices include an extraordinary number of objects of
all shapes and sizes such as:
 Smart microwaves, which automatically cook your food for the right length of time,

 Self- driving cars, whose complex sensors detect objects in their path,

 Wearable fitness devices that measure your heart rate and the number of steps
you’ve taken that day, then use that information to suggest exercise plans tailored to
you.

There are even connected footballs that can track how far and fast they are thrown and
record those statistics via an app for future training purposes. The global market is now
expected to reach $1.6 billion by 2025. Each of these many devices has the power to
generate unimaginable quantum of data. It is estimated that every day, more than 3
quintillion bytes of data (3,000,000,000,000,000,000) is created globally, and this keeps
exploding at an alarming exponent. This should give you an idea of how big big data is.
The size of big data at any point in time is very difficult to estimate. The population of
digital devices across the globe is alarmingly exploding. The population
connected devices globally is several millions more than the population of
humans.Figure 2 shows the would map populated with connected digital devices,
tying to make you picture how the world looks like from the perspective of
digital devices, while Figure 3 shows a projection of the population of
connected devices alongside the population of humans (source: Cisco IBSG,
2011). from Figure 3, you can notice that all the years before 2010, global
human population has outnumbered the global population of digital devices.
However, from the year 2010 and onwards, the global population of digital
devices has outstripped the global human population, and the gap keeps
increasing.Connected devices per person is also on the rise.
Figure 2: World population from the perspective of connected devices
Figure 3: Populations of connected devices and humans compared

Moore’s Law and Big Data

Every year, you probably expect to pay at least a little more money than what you did the
previous year for most products and services. This has been the trend for most commodities and
services. However the opposite has been the case in the computer and communications fields,
especially with regard to the hardware supporting these technologies. For many decades,
hardware costs have fallen rapidly. Every year or two, the capacities of computers (both in terms
of storage and processing) have approximately doubled inexpensively.

This observation apply especially to the amount of primary memory (RAM), storage, and
processing spped. Modern computers come with tremendous sizes of storage and computing
power. The amount of secondary storage (such as solid-state drives (SSD) and Hard disk drives
(HDDs) which come with today’s personal computers are a huge improvement over the past.
Computers now come with huge secondary storage sizes at very affordable prices, comparatively.
The storage capacities of personal computers have increase from megabytes terabytes.
Meanwhile the corresponding cost has only marginally increased. Computers can perform
calculations and make logical decisions phenomenally faster than human beings can. Many of
today’s personal computers can perform billions of calculations in one second—more than a
human can perform in a lifetime. Supercomputers are already performing thousands of trillions
(quadrillions) of instructions per second. For example, IBM has developed the IBM Summit
supercomputer, which can perform over 122 quadrillion calculations per second (122 petaflops).
To put that in perspective, the IBM Summit supercomputer can perform in one second almost 16
million calculations for every person on the planet. And supercomputing upper limits are
growing quickly. Comparing these astronomical increases in computing power to their
corresponding costs, it is obvious that computing resources are increasing inexpensively. This
remarkable trend is called the Moore’s Law, named for the person who identified it in the 1960s,
Gordon Moore, co-founder of Intel Corporation. Moore’s law applies to the developments in the
telecommunication and and the internet sectors. There has being a tremendous improvements in
information and communication technologies inexpensively. Bigger communications/internet
bandwidth is now selling cheaper.

Implications of Big Data for the Business World

The Big Data revolution is changing every part of our world into a SMARTER one. The basic
idea behind the phrase ‘Big Data’ is that everything we do is increasingly leaving a digital trace
(or data), which we (and others) can use and analyse to become smarter. Today the really
successful companies “pray to” “Big Data” for more insight into business problems. From big
data, companies can now perfectly understand the taste and preferences of their customers,
where they are and, perhaps more importantly, what they are doing and where they are going.
Through big data, successful businesses are able to know what is happening as it is happening,
and they allow that information to guide their strategy and inform their decision-making.
Companies that won’t strategize to take advantage of the big data revolution can hardly succeed.
The Structure (or Forms) of Big Data

Big Data comes in three main forms:

 Structured data
 Semi-structured data
 Unstructured data

For businesses to get the full value of big data they must embrace all structures of Big Data,
structured or otherwise. SMART business occurs when we combine structured data sets with
unstructured data from both internal and external sources to gain more insight for intelligent
decision making. The rest of the session explains in detail the characteristics of these forms of
big data.

Structured Data

Structured data refers to data that fits a predefined data model or is organized in a predetermined
way. Structured data are organized in a defined tabular structure (with columns and rows). The
columns, also called fields, or attributes, are the characteristics of the entity which is being
captured into a database. For example, if you look at a standard customer database the
fields/columns that are defined will include name, address, contact telephone numbers, email
address, etc. of the customer. These fields describe the chararteristics of the entity being modeled
into a database. Note that each field name is particularly specified for a specific data type
(number, or text, or something else). this require that the data values entered into these columns
must strictly conform to the specified data type. Within each field, constraints may also be set,
for example, the telephone number field can be set to NOT NULL to require value to be always
provided for that field before a record will be accepted into the database. The name field can be
set “NOT NULL” or “required” to make sure that the name name field cannot be empty. There
are a number of data types and constraints which can be employed to provide consistency and
integrity to the database. The database design can also include drop down menus that limit the
choices of the data that can be entered into a field, thus ensuring consistency of input. Currently,
structured data is managed using Structured Query Language (SQL) – a programming language
originally created by IBM in the 1970s for managing and querying data in relational database
management systems. There are quite a number of database management systems designed to
“speak” the SQL programming language, such as PostgreSQL, MySQl, SQL Server, Oracle,
IBM DB2, etc.

SQL database design rides on the relational model developed by E.F. Codd in the late 1960s with
the publication of the paper titled “A Relational Model of Data for Large Shared Data Banks,”
(see http://www.acm.org/classics/nov95/toc.html). The relational model emphasizes data
integrity much more than other models of database design. Referential integrity refers to
making sure that data in the database makes sense at all times.In its basic application, the
relational model requires the system to be modeled into a database to be decomposed into several
logical “data containers” called entities. An entity can be defined as that component of the
overall system which is perceived or known or inferred to have its own distinct existence. For
example, in modeling a school system into a relational database, you need to break the school
system into entities such as Student, Course, Academic Program, Department, Residence,
Teacher, Contact Details, etc. Each of the entities is represented with a table which stores the
data on the attributes of the entity. These tables are then related (relationships are set among
tables) where necessary. The details of the relational database design principles will be treated in
the course of time. Figure 4 depicts an extract of a school database designed based on the
relational model.
Figure 4: A Representation of the Relational Database Model

Among the entities of the database are the three logical data containers, Student, Residence,
and Fee, with their attributes clearly specified.

Given that the SQL programming language, and relational database management systems
(RDBMS) are the technologies and systems required for the management of structured big data,
Accounting and Finance students cannot afford to go without these critical knowledge and skills
set. As part of its recommendation for contemporary curriculum for accounting courses, PwC,
one of the big 5 auditing firms in the world, specifically points to knowledge and skills in SQL,
including a number of computing, programming, analytics, and statistical knowledge and skills.
The following is a direct quotation form PwC:

“Universities should infuse analytical exercises into existing curriculum to help students develop data analytics
proficiency on top of their core accounting skills. Most schools currently require a class on computing and one or
two statistics courses early in the curriculum. But if reformed, these classes could provide all business students with
a base level of sophistication around data analytics. Whether accounting students or not, everyone could benefit, as
analytics will continue to growin importance in every business discipline.

We also forecast a significant increase in demand for students with double majors in accounting and information
systems. Academic programs that support these double concentrations will be increasingly attractive to both
employers and students.

Deep dives into statistics theory are not the primary point. Instead, we suggest the following courses as a tentative
outline for providing students a newset

of skills.

a. Basic computing course: The core competencies

should include:

 Basic programming skills using a contemporary coding language such as Python or Java

 Core skills in the legacy technologies (Microsoft Excel and Access), especially in teaching the complex power of
spreadsheet software Core skills with both structured and unstructured databases (SQL, MongoDB, Hadoop,
etc.)” - (PwC, 2015)

Unstructured and Semi-structured Data

Unstructured and semi-structured data represent all the data which do not exhibit structured
pattern, and therefore can’t be so easily slotted into columns and rows. It is usually text-heavy,
but may also contain data such as dates, numbers and facts or different types of data such as
images. These inconsistencies make it difficult to analyse unstructured data using traditional
computer programs. It is estimated that 80% of business-relevant information originates in
unstructured or semi-structured data. Unstructured data is the fastest growing data around.
It's increasing at a compound annual growth rate of 61% or more. This makes quite
needful for technologies and systems to be developed for the efficient and effective management
of unstructured and semi-structured big data. Examples of unstructured and semi-structured data
include:
 Photos and graphic images
 Videos
 Websites
 Text files or documents such as email, PDF, blogs, social
 media posts, etc.
 PowerPoint presentations.

Semi-structured data is a cross between unstructured and structured. These are data that may
have some structure that can be used for some analysis, but lacks the strict data model structure.
In semi-structured data, tags or other types of markers are used to identify certain elements
within the data, but the data doesn’t have a rigid structure like that of a structured data. For
example, a Facebook post can be categorized by author, date, length and even sentiment but the
content is generally unstructured.

Unstructured data growth is no longer being driven by the usual suspects - documents,
presentations, photos, videos and audios. The impetus behind its growth today is
sources such as log files produced by computer programs, IoT devices, social media,
CCTV, sensors, metadata and even search engine queries.

In the past, unstructured data were not used that much for decision making. They were often
locked away in siloed document management systems making it what's known as dark data,
unavailable for analysis. With the development of big data platforms,
primarily Hadoop clusters, NoSQL databases and the Amazon Simple Storage Service (S3),
unstructured data can now be processed as much as structured data, and be very useful for
decision making. These big data platforms provide the required infrastructure for processing,
storing and managing large volumes of unstructured data. NoSQL databases are non-relational
databases built on a non-relational model, and the programming language used to communicate
with NoSQL databases is the NoSQL language. Both databases deal with data in different
ways, SQL databases structure data in a ‘relational manner’, as discussed previously, and
NoSQL stores data in a ‘non-relational manner. Just as we have a number of relational
database management systems (or SQL database management systems) there are also a
number of of database management systems built on the non-relational or NoSQL model such
as MongoDB, Redis, FaunaDB, CouchDB, Cassandra, Elasticsearch, etc.

Other techniques that play roles in unstructured data analytics include data mining, machine
learning and predictive analytics. Text analytics tools look for patterns, keywords and sentiment
in textual data. At a more advanced level, natural language processing technology is a form of
artificial intelligence that seeks to understand meaning and context in text and human speech,
increasingly with the aid of deep learning algorithms that use neural networks to analyze data.
Newer tools can aggregate, analyze and query all data types to enable greater insight into
corporate data and improved decision-making. Examples include the following: Azure Data
Services, Microsoft Power BI, IBM Cognos Analytics, and Tableau.

These technological developments in today’s arena is massively shaping the role and how of
accounting should be performed in business organizations. Since the accounting function
generates and manages the biggest chunk of business data, there is no doubt that accounting
professional are the right people to possess the appropriate data management and analyses skills
needed to ensure business success. As an Accounting student these revelations about the
developments in the profession is to guide you in the acquisition of skill set necessary to make
you relevant for solving business problems. Today’s businesses are on a hunt for graduates with
both skills in accountancy, and data science. It would be great to become one of such graduates,
by mastering data science skills in addition to your accounting knowledge and skills.
Characteristics of Big Data

Following are the big data core characteristics. Understanding the characteristics of big data is vital
to know how it works and how you can use it. There are primarily 5 characteristics of big data
analytics, even though other talk about 7. They are Volume, Velocity, Variety, veracity, and Value.
Because each of the characteristics begin with the letter “V”, the five big data characteristics are
often referred as the the 5 Vs of big data. The meanings of each of these characteristics are given
below:

Volume

The volumes of data generated by modern IT, industrial, healthcare, Internet of Things, and other
systems is growing exponentially driven by the lowering costs of data storage and processing
architectures and the need to extract valuable insights from the data to improve business
processes, efficiency and service to consumers.

Though there is no fixed threshold for the volume of data to be considered as big data, however,
typically, the term big data is used for massive scale data that is difficult to store, manage and
process using traditional databases and data processing architectures.

Velocity

Velocity of data refers to how fast the data is generated. Data generated by certain sources can
arrive at very high velocities, for example, social media data or sensor data. Velocity is another
important characteristic of big data and the primary reason for the exponential growth of data.
High velocity of data results in the volume of data accumulated to become very large, in short
span of time. Some applications can have strict deadlines for data analysis (such as trading or
online fraud detection) and the data needs to be analyzed in real-time. Specialized tools are
required to ingest such high velocity data into the big data infrastructure and analyze the data in
real-time.

Variety
Variety refers to the forms of the data. Big data comes in different forms such as structured,
unstructured or semi-structured, including text data, image, audio, video and sensor data. Big
data systems need to be flexible enough to handle such variety of data

Veracity

Veracity refers to how accurate is the data (data quality). To extract value from the data, the data
needs to be cleaned to remove noise. Data-driven applications can reap the benefits of big data
only when the data is meaningful and accurate. Therefore, cleansing of data is important so that
incorrect and faulty data can be filtered out.

Value

Value of data refers to the usefulness of data for the intended purpose. The end goal of any big
data analytics system is to extract value from the data. The value of the data is also related to the
veracity or accuracy of the data. For some applications value also depends on how fast we are
able to process the data.

Advantages provided by Big Data

There are numerous advantages of Big Data for organizations. Some of the key ones are as
follows:

1. Enhanced Decision-making

Big data implementations can help businesses and organizations make better-informed decisions
in less time. It allows them to use outside intelligence such as search engines and social media
platforms to fine-tune their strategies. Big data can identify trends and patterns that would’ve
been invisible otherwise, helping companies avoiding errors.

2. Data-driven Customer Service

Another huge impact big data can have on all industries is in the customer service department.
Companies are replacing the traditional customer feedback system with data-driven solutions.
Such solutions can analyze customer feedback more efficiently and help them offer customer
service to the consumers.

3. Efficiency Optimization

Organizations use big data to identify the weak areas present within them. Then, they use these
findings to resolve those issues and enhance their operations substantially. For example, Big
Data has substantially helped the manufacturing sector improve its efficiency through IoT and
robotics.

4. Real-time Decision Making

Big Data has transformed several areas by enabling real-time trackings, such as inventory
management, supply chain optimization, anti-money laundering, and fraud detection in banking
& finance.

UNIT TWO

INTRODUCTION TO BIG DATA ANALYTICS AND BUSINESS INTELLIGENCE

What is Data Analytics

Data Analytics is a broad term that encompasses the processes, technologies, frameworks and
algorithms to extract meaningful insights from data. Raw data in itself does not have a meaning
until it is contextualized and processed into useful information. Analytics is this process of
extracting and creating information from raw data by filtering, processing, categorizing,
condensing and contextualizing the data.
Data analytics increasingly deals with vast amount of data—mostly unstructured information stored in a wide variety of
mediums and formats—and complex data sets collected through fragmented databases during the course of time. It
deals with streaming data, coming at you faster than traditional RDBMS systems can handle. This is also called fast data.
It’s about combining external data with internal data, integrating it and analyzing all data sets together.

Big data analytics aims to answer three domains of questions. These questions explain (1) what has happened in the past
(retrospective analytics), (2) what is happening right now (real-time analysis) and (3) what is about to happen.
Retrospective analytics can explain and present knowledge about the events of the past, show trends and help find root-
causes for those events. The real-time analytics shows what is happening right now. It works to present situational
awareness, alarms

when data reaches certain threshold or send reminders when a certain rule is satisfied. Prospective analytics presents a
view into the future. It attempts to predict what will happen, what are the future values of certain variables. Table 1
shows the taxonomy of the three analytics questions.

Table 1: The Three Main Analytics Questions

Questions about the past Questions about the Questions about the
present future

Retrospective view Real-time view Prospective view

What happened? What is happening now? What will happen next?

Why it happened? Uses real-time data How can I intervene?

Uses historical data Actionable dashboards Uses historical and real-

Delivers static Alerts time data

dashboards Reminders Predictive dashboards

Knowledge-based

dashboards

The choice of the technologies, algorithms, and frameworks for analytics is driven by the
analytics goals, whether it is to provide retrospective, real-time, or prospective understanding, or
some combination of the three goals described above.

Descriptive Analytics

Descriptive analytics comprises analyzing past data to present it in a summarized form which can
be easily interpreted. Descriptive analytics aims to answer - What has happened? A major
portion of analytics done today is descriptive analytics through use of statistical functions such as
counts, maximum, minimum, mean, top-N, percentage, for instance. These statistics help in
describing patterns in the data and present the data in a summarized form. Examples of
descriptive analytics include: computing the total number of likes for a particular post,
computing the average monthly rainfall or finding the average number of visitors per month on a
website. Descriptive analytics is useful to summarize the data.

Businesses use descriptive analytics all the time, whether they are aware of it or not. It’s often
called business intelligence . Companies these days have vast amounts of data available to them,
so they would do well to use analytics to interpret this data to help them with decision making. It
helps them to learn from what happened in the past, and enables them to try to accurately predict
what may happen in the future. In other words, analytics helps companies anticipate trends. For
instance, if sales increased in November for the past five years, and declined in January, after the
Christmas rush, then one could predict that the same thing is likely to happen in year six and
prepare for it. Companies could use this to perhaps increase their marketing in January, offering
special offers and other incentives to customers.

Descriptive analytics give insight into what happened in the past (this may include the distant
past, or the recent past, like sales figures for last week.) They summarize data that describes
these past events and make it simple for people to understand. and make it simple for people to
understand. In a business, for example, we may look at how sales trends have changed from year
to year, or how production costs have escalated.

Descriptive statistics, basically then, is the name given to the analysis of data that helps to show
trends and patterns that emerge from collected information. They are a way of describing the
data in a way that helps us to visualize what the data shows.

This is especially helpful where there has been a lot of information collected. Descriptive
statistics are the actual numbers that are used to describe the information that has been collected
in the business or from, say, a survey. They include statistics like the mean, standard deviation,
variance, range, skewness, maximum, minimum, etc.
As a descriptive analysts you can make descriptive analytics information easier to understand by
analyzing big data into graphs, charts, or some other pictorial representations. This way,
management and employees alike can see what has been happening within the company in the
past, and make useful predictions and therefore good decisions for the future.

Measures in Descriptive Statistics

Descriptive statistics is so-called because they help to describe the data which has been collected.
They are a way of summarizing big groups of numerical information, by summarizing a sample.
(In this way, descriptive statistics are different from inferential statistics, which uses data to find
out about the population that the data is supposed to represent.)

Two sets of statistics are most often used in this descriptive statistics are measures of central
tendency , and measures of dispersion . Central tendency involves the idea that there’s one figure
that’s in a way central to the whole set of figures.

Measures of central tendency include the mean, the median, and the mode. They summarize a
whole lot of figures with one single number. This will obviously make it easier for people to
understand and use the data.

Dispersion refers to how spread out the figures are from a central number. Measures of
dispersion include the range, the variance, and the standard deviation. They help us see how the
scores are spread out (whether they are close together or widely spread apart).

Inferential Statistics

When research is done on groups of people, usually both descriptive and inferential statistics are
used to analyze results and arrive at conclusions.

Inferential statistics are useful when we just want to use a small sample group to infer things
about a larger population. So, we are trying to come to conclusions that reach past the immediate
data that we actually have on hand.

They can help to assess the impact that various inputs may have on our objectives. For instance,
if we introduce a bonus system for our workers, what might the productivity outcome be?
Inferential statistics can only be used if we have a complete list of the population members from
which we have taken our sample. Also, the sample needs to be big enough.

There are different types of inferential statistics, some of which are fairly easy to interpret. An
example is the confidence level. If say, our confidence interval is 98%, it means that we are 98%
confident that we can infer the score of a population based on the score of our sample.

Inferential statistics therefore allow us to apply the conclusions from small experimental studies
to larger populations that have actually never been tested experimentally. This means then, that
inferential statistics can only speak in terms of probability, but it is very reliable probability, and
an estimate with a certain measurable confidence level.

Accounting students are not only required to possess knowledge and skills in statistics but also to
have a mastery over the statistical software which can be used to perform descriptive statistics.
Software which can perform descriptive analytics include Microsoft Excel, IBM SPSS.

This text does not intend to reteach statistics to students. It is expected that students will possess,
at least, basic statistics knowledge and skills which can understanding.

Diagnostic Analytics

Diagnostic analytics comprises analysis of past data to diagnose the reasons as to why certain
events happened. Diagnostic analytics aims to answer the question- Why did it happen?

The computational tasks such as Linear Algebraic Computations, General N-Body Problems, and
Graph-theoretic Computations can be used for diagnostic analytics.

Predictive Analytics

We are now aware of how data and the analysis of it are vital for a company to be able to
function optimally. We are now going to examine another type of data analytics that can help
grow a company - predictive analytics.
Predictive analytics is used to make predictions about unknown future events. It uses many
techniques, such as statistical algorithms, data mining, statistics, modeling, machine learning and
artificial intelligence, to analyze current and past data to make predictions about the future. It
aims to identify the likelihood of future outcomes based on the available current and historical
data. The goal is therefore to go beyond what has happened to provide the best assessment of
what will happen.

More and more companies are beginning to use predictive analytics to gain an advantage over
their competitors. As economic conditions worsen, it provides a way for companies to gain
competitive edge. Predictive analysis has become more accessible today for even smaller
companies, and other low-budget organizations. This is because the volume of easily available
data has grown hugely, computing has become more powerful and more affordable, and the
software has become simpler to use. Therefore, one doesn’t need to be a mathematician to be
able to take advantage of the available technologies. Accounting students must champion the
skill of predictive analytics so as to provide the necessary competitiveness for their organizations.

Predictive analysis is extremely useful for a number of reasons:

 Firstly, it can help predict fraud and other criminal activities in businesses, government
departments, etc.

 Secondly, it can help companies optimize marketing by monitoring the responses and buying
trends of customers.

 Thirdly, it can help businesses and organizations improve their way of managing resources
by predicting busy times and ensuring that stock and staff are available during those times.
For instance, hospitals can predict when their busy times of the year are likely to be, and
ensure there will be enough doctors and medicines available over that time.

Predictive Analysis Methods

In this chapter, we’re going to examine the various techniques that are used to conduct
predictive analysis. The two main pigeonholes into which these methods may be grouped are
machine learning techniques and regression techniques . They’ll be discussed here in more detail:
Machine Learning Techniques

Machine learning is a method of data analysis that automates the building of analytical models.
It uses algorithms that continuously adapt and learn from data and from previous computations,
thereby allowing them to find information without having to be directly programmed where to
search.

Growing volumes of available data, together with cheaper and more powerful computational
processing have created an unprecedented interest in the use of machine learning. More
affordable data storage has also increased its use.

When it comes to modeling, humans can maybe make a couple of models a week, but machine
learning can create thousands in the same amount of time.

Using this technique, after you make a purchase, online retailers can send you offers almost
instantaneously for other products that may be of interest to you. Banks can give answers
regarding your loan requests almost at once. Insurance companies can deal with your claims as
soon as you submit them. These actions are all driven by machine learning algorithms, as are
more common everyday activities such as web search results and email spam filtering.

Regression Techniques

These techniques form the basis of predictive analytics. They seek to create a mathematical
equation, which will serve as a model to represent the interactions among the different variables.
Depending on the circumstances, different regression techniques can be used for performing
predictive analysis. It can be difficult to select the right one from the vast array available. It’s
important to pick the most suitable one based on the type of independent and dependent variables,
and depending on the characteristics of the available data.

Linear Regression

Linear regression is the most well-known modeling approach. It calculates the relationship
between the dependent and independent variables using a straight line (regression line). It’s
normally in equation form. Linear regression can be used where the dependent variable has an
unlimited range.
If the dependent variable is discrete, another type of model will have to be used. Discrete, or
qualitative, choice models are mainly used in economics. They are models which describe, and
predict choices between different alternatives. For example, whether to export goods to China or
not; whether to use shipping or air travel to export goods. Unlike other models, which examine,
“how much”, qualitative choice models look at “which one?”

The techniques logistic regression and probit regression may be used for analyzing discrete
choice.

Logistic Regression

Logistic regression is another much-used modeling approach. It’s used to calculate the
probability of event success and failure. It’s mainly used for classification problems, and needs
large sample sizes.

The Probit Model

The word probit is formed from the roots probability and unit . It’s a kind of regression where
the dependent variable can only have two values, for instance, employed or unemployed. Its
purpose is to appraise the likelihood that a certain observation will fall into one or other of the
categories. In other words, it is used to model binary outcome variables, such as whether or not a
candidate will win an election. Here, the outcome variable is binary: zero or one, win or lose.
The predictor variables may, for example, include how much money was spent on the campaign,
or how much time the candidate spent campaigning. Probit regression is used a great deal in the
field of economics.

Neural Networks

Neural networks are powerful data models capable of capturing and representing complicated
input and output relationships. They are widely used in medical and psychological contexts, as
well as in the financial and engineering worlds. Neural networks are normally used when one is
not aware of the exact relationship between the inputs and the output. These networks are
capable of learning the underlying relationship through training. (This may also be called
supervised training, unsupervised training, and reinforcement learning.)
Neural networks are based on the performance of “intelligent” functions similar to those
performed by the human brain. They’re similar in that, like the brain amasses knowledge through
learning, a neural network stores knowledge (data) inside inter-neuron connections called
synaptic weights.

Their advantage is that they can model both linear and non-linear relationships, whereas other
modeling techniques are better with just linear relationships.

An example of their use would be in an optical character recognition application. The document
is scanned, saved as an image, and broken down into single characters. It’s then translated from
image format into binary format, with each 0 and 1 representing a pixel of the single character.
The binary data is then fed into a neural network that can make the association between the
image data and the corresponding numerical value. The output from the neural network is then
translated into text and stored as a file.

Prescriptive Analytics

While predictive analytics uses prediction models to predict the likely outcome of an event,
prescriptive analytics uses multiple prediction models to predict various outcomes and the best
course of action for each outcome.

Prescriptive analytics aims to answer - What can we do to make it happen?

It seeks to evaluate options and prescribing the best course of action

For example, prescriptive analytics can be used to prescribe the best medicine for treatment of a
patient based on the outcomes of various medicines for similar patients.

General N-Body Problems, Graph-theoretic Computations, Optimization and Alignment

Problems can be used for prescriptive analytics.
Data Analytics Software

Technology is essential when it comes to analyzing data, so much so that it is the standard to use
computers for data analysis. To have a computer is a great start but it is no use without being
coupled with the appropriate software. Not only do you need special software but you also have
to be able to know how to use it in order to be able to carry out great data analysis.

There are a lot of data analysis software out there. Some (e.g. python and R) are designed for
more advanced data analysis than others. Python is one of the most powerful data analytics
software available for data analysis, and arguably, the most widely used piece of data analytics
software. Python is not only entirely free to use but it is also open-source meaning that the source
code can be changed by programmers if they feel it is appropriate. Python operates differently in
comparison to many of the software programs we use on a regular basis (such Microsoft Excel,
and IBM SPSS). This is because the python interpreter (the python software) requires lines of
codes to be entered as instructions to command the computer hardware to execute. Performing
data analysis with python requires the user to have knowledge ans skill in the python language. It
takes time to learn the complete syntax and commands of the python language but once you get
used with it you will find your data analysis process quite enjoyable. It takes a few simple steps
to have Python downloaded and installed onto a computer. Downloading and installing is
straightforward, it is learning how to use the software that will take some time but once you’ve
mastered it you would love to be a data analyst for ever. Introduction to python will be dealt with
in future units in this textbook.

If you can’t perform analytics to make sense of your data, you’ll have trouble improving quality and costs, and
you won’t succeed in the new business environment.

DATA MINING - AND ADVANCED DATA ANALYTICS TECHNIQUE

As indicated earlier, we live in a world where vast amounts of data are collected every second. Analyzing such data is an
important need. This session looks at how data mining can meet this need by providing tools to discover knowledge from
data. “We are living in the information age” has become an everyday saying. However, we are actually living in the data
age. Terabytes or petabytes of data pour into our computer networks, the World Wide Web (WWW), and various data
storage devices every day from business, society, science and engineering, medicine, and almost every other aspect of
daily life. A petabyte is a unit of information or computer storage equal to 1 quadrillion bytes, or a thousand terabytes, or 1 million gigabytes.

This explosive growth of available data volume is a result of the computerization of our society and the fast development
of powerful data collection and storage tools. Businesses worldwide generate gigantic data sets, including sales
transactions, stock trading records, product descriptions, sales promotions, company profiles and performance, and
customer feedback. For example, large stores, such as Wal-Mart, handle

hundreds of millions of transactions per week at thousands of branches around the world. In ghana, businesses such as
Melcom, Kasapreko and the likes generate large quantity of data from their activities.

Global telecommunication networks carry tens of petabytes of data traffic every day. The medical and health industry
generates tremendous amounts of data from medical records, patient monitoring, and medical imaging. Billions of Web
searches supported by search engines process tens of petabytes of data daily. Web Communities and social media have
become increasingly important data sources, producing digital pictures and videos, blogs, and various kinds of other data.
The list of sources that generate huge amounts of data is endless.

This explosively growing, widely available, and gigantic body of data makes our time truly the data age. Powerful and
versatile tools are badly needed to automatically uncover valuable information from the tremendous amounts of data
and to transform such data into organized knowledge. This necessity has led to the birth of data mining, a relatively
younger and dynamic field, but very promising. Data mining has and will continue to make great strides in our journey
from the data age toward the coming information age.

By definition, data mining, also known as knowledge discovery in data (KDD), is the process of
uncovering patterns and other valuable information from large data sets (Big Data). The growth
of big data and advances in database and data warehousing technologies are assisting companies
in transforming their raw data into useful knowledge.

Using a broad range of techniques, data mining information can be used to increase revenues, cut
costs, improve customer relationships, improve quality, reduce risks and more. The foundation of
data mining comprises some intertwined scientific disciplines, including: statistics (the numeric
study of data relationships), artificial intelligence (human-like intelligence displayed by software
and/or machines) and machine learning (algorithms that can learn from data to make predictions).
Figure 5: The Various Aspects of Data Mining

Why is Data Mining Important?

You’ve seen the staggering numbers – the volume of data produced is doubling every two years.
Unstructured data alone makes up 90 percent of the digital universe. But more information does
not necessarily mean more knowledge.

Data mining provides us with the means of resolving problems and issues in this challenging information age. Data
mining benefits include:

Customer relationship management

1. Identify most likely buyers of new products and services.

2. Understand the root causes of customer attrition to improve customer retention.

3. Discover time-variant associations between products and services to maximize sales and customer value.
4. Identify the most profitable customers, and their preferential needs to strengthen relationships and maximize sales.

The retail industry

1. Predict accurate sales volumes at specific inventory levels.

2. Identify sales relationships between different product types (market-basket analysis).

3. Forecast consumption levels of different product types (based on seasonal and environmental conditions) to optimize

logistics and hence maximize revenue.

4. Discover interesting patterns into the movement of products, especially ones with a short shelf life, in a supply chain

by analyzing sensory and RFID data.

Manufacturing and production

1. Predict machinery failures before they occur by using sensory data, which will enable condition-based maintenance.

2. Identify commonalities and anomalies in production systems to optimize manufacturing capacity.

3. Discover novel patterns to identify and improve product quality.

The travel industry (airlines or hotels)

1. Useful to predict sales of different services (seat types in airplanes, type of hotel rooms) to optimally price services to

maximize revenues as a function of yield management.

2. Forecast demand at different locations to better allocate limited organizational resources.

3. Identify the most profitable customers and provide them with personalized services to maintain their repeat

business.

4. Retain valuable employees by identifying and acting on the root causes for attrition.

An example of how data mining turns a large collection of data into knowledge

A search engine (e.g.,

Google) receives hundreds of millions of queries every day. Each query can be viewed

as a transaction where the user describes her or his information need. What novel and
useful knowledge can a search engine learn from such a huge collection of queries col

lected from users over time? Interestingly, some patterns found in user search queries

can disclose invaluable knowledge that cannot be obtained by reading individual data

items alone. For example, Google’s Flu Trends uses specifific search terms as indicators of

flflu activity. It found a close relationship between the number of people who search for

flflu-related information and the number of people who actually have flflu symptoms. A

pattern emerges when all of the search queries related to flflu are aggregated. Using aggre

gated Google search data, Flu Trends can estimate flflu activity up to two weeks faster

than traditional systems can.2 This example shows how data mining can turn a large

collection of data into knowledge that can help meet a current global challenge.

The Data Mining Process

The data mining process involves a number of steps from data collection, data preparation,
visualization to extraction of valuable information from large data sets. The data mining process
starts with prior knowledge and ends with posterior knowledge, which is the incremental insight
gained about the business via data through the process. The whole data mining process is a
framework to invoke the right questions (Chapman et al., 2000) and guide us through the right
approaches to solve a business problem. It is not meant to be used as a set of rigid rules, but as a
set of iterative, distinct steps that aid in knowledge discovery.

The Cross-Industry Standard Process for Data Mining

(CRISP-DM)

Several years ago, representatives from a diverse array of industries gathered to define the best
practices, or standard process, for data mining. The result of this task was the CRoss-Industry
Standard Process for Data Mining (CRISP-DM). The CRISP-DM process model was based on
direct experience from data mining practitioners, rather than scientists or academics, and
represents a “best practices” model for data mining that was intended to transcend professional
domains. The CRISP-DM process model has been broken down into six steps:
 Business understanding,
 Data understanding,
 Data preparation,
 Modeling,
 Evaluation, and
 Deployment

Figure 6: The CRISP Framework for Data Mining

Business Understanding

The second phase of the CRISP-DM analytical process is the data understanding step. During
this phase, data are collected and the analysts begin to explore and gain familiarity with the data,
including form, content, and structure. Knowledge and understanding of the numeric features
and properties of the data (e.g., categorical versus continuous data) will be important during the
data preparation process and essential to the selection of appropriate statistical tools and
algorithms used during the modeling phase. Finally, it is through this preliminary exploration
that the analyst acquires an understanding of and familiarity with the data that will be used in
subsequent steps to guide the analytical process, including any modeling, evaluation of the
results, and preparing the output and reports.

Business Understanding

Perhaps the most important phase of the data mining process includes gaining an understanding
of the current practices and overall objectives of the project. During the business understanding
phase of the CRISP-DM process, the analyst determines the objectives of the data mining project.
Included in this phase are an identification of the resources available and any associated
constraints, overall goals, and specific metrics that can be used to evaluate the success or failure
of the data mining project.

Data Understanding

Data Preparation

After the data have been examined and characterized in a preliminary fashion during the data
understanding stage, the data are then prepared for subsequent mining and analysis. This data
preparation includes any cleaning and re-coding as well as the selection of any necessary training
and test samples. It is also during this stage that any necessary merging or aggregating of data
sets or elements is done. The goal of this step is the creation of the data set that will be used in
the subsequent modeling phase of the process.

Modeling

During the modeling phase of the project, specific modeling algorithms are selected and run on
the data. Selection of the specific algorithms employed in the data mining process is based on the
nature of the question and outputs desired. For example, scoring algorithms or decision tree
models are used to create decision rules based on known categories or relationships that can be
applied to unknown data. Unsupervised learning or clustering techniques are used to uncover
natural patterns or relationships in the data when group membership or category has not been
identified previously.

Evaluation

During the evaluation phase of the data mining project, the models created are reviewed to
determine their accuracy as well as their ability to meet the goals and objectives of the project
identified in the business understanding phase. Put simply, the evaluation phase answers the
question: Is the model accurate, and does it answer the question posed?

Deployment

The deployment phase, which the final phase, includes the dissemination of the information. The
form of the information can include tables and reports as well as the creation of rule sets or
scoring algorithms that can be applied directly to other data.

Some Data Mining Techniques

Cluster Analysis

Cluster analysis is a statistical method for processing data into groups, or clusters, on the basis of
how closely associated they are. Cluster analysis can be a powerful data-mining tool for any
organization that needs to identify discrete groups of customers, sales transactions, or other types
of behaviors and things. For example, insurance providers use cluster analysis to detect
fraudulent claims, and banks use it for credit scoring.

Regression

Regression is a form of supervised data mining technique that tries to predict any continuous
valued attribute. It analyses the relationship between a target variable (dependent) and its
predictor variable(s) (independent). Regression is an important tool for data analysis that can be
used for predicting the future as well as improving business processes. For example, in terms of
prediction, Regression can be used to predict how many units consumers will purchase of a
product or service. Insurance companies heavily rely on regression analysis to estimate how
many policy holders will be involved in accidents or be victims of burglaries, for example.

In terms of optimization of business processes, a company operating a call center can use
regression to know the relationship between wait times of callers and number of complaints. A
factory manager can, for example, build a regression model to understand the relationship
between oven temperature and the shelf life of the cookies baked in those ovens. A fundamental
driver of enhanced productivity in business and rapid economic advancement around the globe
during the 20th century was the frequent use of statistical tools in manufacturing as well as
service industries. Today, managers consider regression an indispensable tool.

Classification Analysis

This analysis is used to retrieve important and relevant information about data, and metadata. It
is used to classify different data in different classes. Classification is similar to clustering in a
way that it also segments data records into different segments called classes. But unlike
clustering, here the data analysts would have the knowledge of different classes or cluster. So, in
classification analysis you would apply algorithms to decide how new data should be classified.
A classic example of classification analysis how emails are classified as legitimate or spam. In
Outlook for example, they use certain algorithms to characterize an email as legitimate or spam.

Anomaly or Outlier Detection

This refers to the observation for data items in a dataset that do not match an expected pattern or
an expected behavior. Anomalies are also known as outliers, novelties, noise, deviations and
exceptions. Often they provide critical and actionable information. An anomaly is an item that
deviates considerably from the common average within a data set or a combination of data.
These types of items are statistically aloof as compared to the rest of the data and hence, it
indicates that something out of the ordinary has happened and requires additional attention. This
technique can be used in a variety of domains, such as intrusion detection, system health
monitoring, fraud detection, fault detection, event detection in sensor networks, and detecting
eco-system disturbances.

Association Rule Learning

It refers to the method that can help you identify some interesting relations (dependency
modeling) between different variables in large databases. This technique can help you unpack
some hidden patterns in the data that can be used to identify variables within the data and the
concurrence of different variables that appear very frequently in the dataset. Association rules are
useful for examining and forecasting customer behavior. It is highly recommended in the retail
industry analysis. This technique is used to determine shopping basket data analysis, product
clustering, catalog design and store layout. In IT, programmers use association rules to build
programs capable of machine learning.

UNIT THREE

INTRODUCTION TO RELATIONAL DATABASE SYSTEMS

Introduction

It is said that we live in the data age. This is 100% true, and the reason is already known to you
from our previous discussions. The data revolution presents both opportunities and challenges.
Businesses can take advantage of the big data reolution by mastering the application of the which
can be used to take advantage of the revolution. Quite a number of technologies have emerged to
deal with the big data revolution. Database technologies have emerged to support the handling of
big data. In our previous discussions, you were introduced, in passing, to database technologies
such as relational or SQL databases and NoSQL databases. These are the most popular database
technologies whic have emerged for the manageent of big data. In this unit we will dive into the
“pool” of SQL database. SQL is a programming language designed on the relational model. SQL
stands for Structured Query Language.

Objectives

After going through this unit, you should be able to

Explain the

The Relational Model for Database Management

The relational model is solidly based on two parts of mathematics: firstorder predicate logic and the theory of relations.
This book, however, does not dwell on the theoretical foundations, but rather on the features of the relational model that is
important for database design and use.

The central purpose of the relational model is to provide a framework for designing of efficient , anomalies-free databases.
The relational model become necessary when the management of big data with trational methods became problematic. E.F.
Codd, and IBM engineer and reseacher invented the relation model as means of efficiently managing big data. It willbe
helpful for you to read the original theory of the relational model as written by Dr. E. F. Codd. The principles enshrined in
the relational model are sumarized into what is termed as normalization. Normalization seeks to eliminate all the
anomalies associated with traditional (also called flat-file) databases. Some of the challenges include anomalies
in

• updating the database (update anomaly),

• deleting records from the database (delete anomaly)
• inserting records into the database (insert anomaly)

In addition to the above anomalies, there is also much difficulty and complexity in generating
reports from flat file databases, especially if the report is a little bit complex.

Relational database management system (RDBMS) is a science of data management which

offers an approach of data management that solves all the above anomalies. The key difference
between flat file data management systems and RDBMS is that RDBMS follows logical
procedures in the design of the database tables which ultimately results in the elimination of the
redundancies and the anomalies mentioned above. In RDBMS, the process of designing database
tables so that the database as a whole becomes free from data redundancies and the anomalies
previously mentioned is referred to as NORMALIZATION.

Basically, normalization divides larger tables into smaller tables which store data on specific
subject matter. These tables are then linked together using what is referred to as table
relationships. The purpose of normalization is to eliminate redundant (useless) data and ensure
data is stored logically, so that data management is efficient. There are several stages in the
normalization process, referred to as normal forms. We will discuss the first three of the normal
forms. These are the First Normal Form (1NF), the Second Normal Form (2NF), and the Third
Normal Form (3NF).

First Normal Form (1NF)

The first step to constructing the right database table is to ensure that the data is stored in its first
normal form (1NF). When a table is in its first normal form, searching, filtering and sorting
information is easier. The following are the specific principles of 1NF:

1. The Concept of a Primary Key as a unique identifier of records.

The table stores information in rows and columns where one or more columns, called the
primary key, uniquely identify each row. In other words, at least one of the
attributes/columns/fields of the table must always contain unique values that work to
uniquely identify each row/record in the table. In Table 2, the first column (titled Studid),
is designated as a primary key which uniquely identifies the rows in the table.

Notice that in table 2, the first and third records have the same values for FirstName and
Surname even though they are different students. In this case, the role of the primary key
is so obvious – it shows that the two students, even though they bear the same first and
last names, are unique by their Studid (the primary key).

2. Each column/attribute/field should contain atomic values.

The second principle of 1NF stresses the concept of atomicity for column values. When a
value is atomic, the value cannot be further subdivided. In Table 2, this principle is
violated. For example, the value “Josh Addo” is not atomic. “Josh Addo” is a compound,
which can be split into their atomic values as “Josh” and “Addo”.

Table 2: Student Table With Violation of Atomicity of the Principle Atomic Column Values
Studid (PK) StudentName OtherAttributes
1 Josh Addo
2 Nana Adwoa Sampah
3 Josh Addo

In Table 2, the principle of atomicity of column values is violated. Under column 2, titled
StudentName, the values stored are not atomic in nature. For example, the value “Josh
Addo” is compound in nature, and can be further split into two parts as “Josh” and
“Addo”. To resolve this violation, the StudentName column should be broken into its
separate atomic columns as FirstName, and Surname (or LastName). Table 3 shows the
correction of the violation of the atomicity principles.

Table 3: Student Table Corrected for Atomicity of Column Values

Studid (PK) FirstName Surname OtherName OtherAttributes
1 Josh Addo
2 Adwoa Sampah Nana
3 Josh Addoew

3. There should be no Repeating Groups of Columns

The third principle 1NF says that there should be no multiple columns storing data about
the same attribute or concept. For example there should be no repeating columns like
Phone1; Phone2; Phone3 etc. or Custome1; Customer2; Customer3 etc. This is because
these multiple columns store data about the same attribute or subject matter: phone or
customer. Table 4 demonstrates how this principle has been violated.
Table 4: Student Table with Repeating Groups of Phone Columns
Studid (PK) FirstName Surname Phone1 Phone2
1 Josh Addo 050 602 8495 024 651 2655
2 Adwoa Sampah 024 434 7140

In Table 4, the columns Phone1, and Phone2 are columns storinf data about the same
attribute (Phone). This violationcan be resolved in two main ways:
 We can adopt a policy of storing data on only one phone number. In this case the
problem of repeating phone columns will be done away with.
 We can also move the repeating columns (phone1, phone2, etc) from the Student table
and form them into an entity/table. This entity could be called contact entity, and could
be populated with multiple phone numbers, as well as other contact details like email,
website, and address. This newly formed entity (contact) is then linked to the student
entity and all other relevant entities. Tables 6 and 7 are the resulting entities after the
violation is resolved.
4. There should be no Columns with Multiple Values
An alternative way of ‘dancing around’ the problem presented in Table 4 is to maintain a
single column and store multiple values, separated by semi-colon or comma (see Table 5
below). This is, as well, a violation of 1NF principle, as the values under the Phone
column are not atomic.

Table 5: Student Table With Multiple Column values

Studid (PK) FirstName Surname Phone
1 Josh Addo 050 602 8495;
024 651 2655
2 Adwoa Sampah 024 434 7140

This violation can be resolved in the same way as the problem of repeating columns.
There ca be a policy to store only one phone number. Alternatively the violation can be
resolved by taking out the column with multiple values and forming it into an entity, and
therefore a separate table. This separate table could be called Contact and may contain
phone and other contact data. The Contact table is then made to ‘communicate’ with the
Student table through table relationship (we will talk in detail about relationship later).
Tables 6 and 7 below demonstrate how the above violation is resolved.

Table 6: Well-Normalized Student Table

Studid (PK) FirstName Surname
1 Josh Addo
2 Adwoa Sampah

Table 7: Contact Table With A Relationship With Student Table

Contid (PK) Phone Email Studid (FK)
1 050 602 8495 1
2 024 651 2655 1
3 024 434 7140 2

Second Normal Form (2NF) Principle

A table is in 2nd Normal Form if:

1. First of all the table is in 1st normal form, and

2. All the non-key columns/attributes/fields are dependent on the table’s primary key.

We already know about the 1NF, so let’s break the requirements of the 2NF down. The main
feature of relational database design is to make different tables hold data on a single subject. In
relational database design, it is wrong to lump multi-purpose data into a single table. This will
create serious problems when updating, deleting, and building reports from such a database. A
table design like the one depicted in Table 8 is not 2NF-normalized since the table stores data on
multiple subjects.

Table 8: Multi-Purpose Table in Violation of 2NF

Studid FirstName Surname Residence Program
1 Josh Addo Casford Bcom (Accounting)
2 Sampah Adwoa Adehye Bcom (Accounting)
3 Duodu Oforiwah Adehye Bcom (Management)

Table 8 above violates 2NF because it is not storing data on a single subject, as apart from
student data, it also contains data on students’ residence (halls and hostels) as well as programs
read by students. Table 8 is not a single-purpose table. At least this table is serving three
purposes:

1. Storing student-specific data

2. Storing data on students’ residence
3. Storing data on programs of study

Like we have said already, relational database tables must store data on only one, specific subject.
The main reason why relational database tables are put in 2NF is to narrow the tables to a
single purpose. Doing so brings clarity to the database design, makes it easier for us to describe
and use the table, and it also helps to eliminate modification anomalies.

For a database table to describe a single subject, you need to make sure that all the
attributes/columns/fields of the table are describing that specific subject matter, and nothing else.
For example, for the student table to be describing students, and therefore serving a single
purpose, all the attributes/columns/fields in the table must be specifically describing the student,
and no other business. Any attribute that is not specifically describing the student entity is an
“alien” attribute and must be removed, and formed into its own entity. When all the attributes are
specifically describing the entity in question, the table is said to be in 2NF.

In relational database terminologies, the primary key of a table represents the table. In other
words the purpose of every table can be identified by its primary key. For example, in Table 8,
the purpose of the table can be identified by the table’s primary key (Studid). The primary key
Studid suggests that the table’s purpose is to store data on students. The primary key Studid also
means that the table should have the single purpose of storing student-specific data. All other
non-key attributes should support this agenda of making the table a single-purpose one
(storing student-specific data). All of those attributes (such as Residence and Program in this
case) which do not support the primary key in making the table a single-purpose table are to be
removed. They belong somewhere else. They need to be formed into tables that store
residence-specific and program-specific data.

When all the non-key attributes support the primary key in defining a single-purpose table, those
non-key attributes are said to be fully, and functionally dependent on the primary key. When we
talk about attributes or columns being functionally dependent on the primary key, we mean, that
in order to find a particular value stored under any of the non-key attributes/columns, such as the
surname “Oforiwah”, you would have to know the primary key value, “3”, to look up for the
value “Oforiwah”. The primary key of 3 is the true identifier of the third record of Table 8. In
judging whether a non-key attribute/column is dependent on the primary key, ask yourself the
following question:

“Does this attribute/column/field serve to describe what the primary key represents? If you
answer “yes,” then the attribute/column/field is dependent on the primary key and therefore
describes the purpose of the table. If you answer “no”, then that column/field does not describe
the purpose of the table and therefore not dependent on the primary key. Such an
attribute/columns/field should be moved and formed into a different table for its own purpose.

When all the columns relate to the primary key, the columns in combination, serve a common
purpose of defining a single purpose table. In summary, when a table is in second normal form,
that table serves a single purpose of storing data on a single entity, not multiple entities.

Third Normal Form (3NF) Principles

Once a table is in second normal form, we are guaranteed that every non-key column/field is
functionally dependent on the primary key. In other words, the table serves a single purpose by
storing data on a single entity. However, in some cases, some non-key attributes/columns/fields
may depend on the primary key through another non-e column. The non-key attribute that
depends on the primary key through another non-key attribute is said to be transitively dependent
on the primary key, meaning that the ‘transitive’ attribute/column cannot depend on the primary
key without the presence of some other non-key column. This is referred to as transitive
dependency. Transitive dependency is violation of normalization, more specifically, violation of
third normal form. Table 9 shows an example of a table with transitive dependency.

Table 9: Student Table With Transitive Column

Studid (PK) FirstName Surname DoB Age
1 Josh Addo 27/12/1985
2 Adwoa Sampah 26/12/2006
3 Patience Oforiwah
A functional dependency is said to be transitive if it is indirectly formed by two functional
dependencies. Mathematically X ---> Z is a transitive dependency if the following three functional
dependencies hold true:

• X--->Y (X depends on Y)
• Y does not --->X (Y does not depend on X)
• Y--->Z (Y depends on Z)

A situation of no transitive relation/dependency looks like below:

X--->Z

Y--->Z

X does not -->Y

Y does not -->X

In Table 9, the following facts can be observed:

Studid is the unique identifier for Firstname, Surname, DoB [DoB depends on Studid (DoB--
>Studid)]

DoB is a unique identifier of Age [Age depends on DoB (Age-->DoB)]

The full mapping looks like this: Age------>DoB--------Studid

In a transitive dependency situation, the values of the transitively dependent

attribute/column/field can be generated by the values of some other attribute/column rather than
the primary key column. For example, in Table 9, the values of the column titled Age could be
accurately generated for every student from, not only the values of the primary key Studid, but
also from the values of a non-key attribute DoB (short for date of birth). In other words, if you
know the DoB of a student, you can accurately know/retrieve that student’s age (Age depends on
DOB). Note that for there to be not transitive dependency, all the non-key attributes should
depend only on the primary key (Studid), and that no part of the non-key attributes should
depend on a non-key attribute. For the Age attribute to depend on DoB is a clear demonstration
of non-key attribute (Age) depending on another non-key attribute (DoB).

Transitive dependency leaves a trace of multi-purpose table. 2NF’s effort to define a single
purpose table will be incomplete if there is the presence of transitive attributes/columns.
Transitive dependency causes redundant data, update and delete anomalies. Such a situation can
cause inconsistencies and must be resolved. When this transitive dependency is resolved, the
table then moves into 3NF. In summary, a table is in third normal form if:

1. it is in 2NF and
2. it contains only attributes/columns/fields that are non-transitively dependent on the
primary key

Transitive dependnecy problem (3NF violation) could be resolved by simply removing the
transitive column Age from the student table. This will be enough since we can compute age
from DoB.

Further example to clarify transitive dependency

Example 1

Suppose a company wants to store the complete data of each employee, and therefore a table
named employee_records, that looks like Table 11 is created:
Table 10: Database Table In Violation of 3NF
emp_id emp_fname emp_sname emp_postcode emp_region emp_city emp_district

1001 John Kwame 282005 CC Cape Coast Cape Coast Metropolitan

1002 Ajeet Musah 222008 NT Tamale Tamale Metropolitan

1006 Lora Abena 282007 AA Juaso Asante Akim South

1101 Lilly Adjetey 292008 GB Ashaiman Ashaiman Municipal

1201 Steve Barima 222999 EA Atimpoku Asuogyaman

In Table 10, emp_region, emp_city and emp_district are all dependent on
emp_postcode (a non-key attribute). This makes non-key attributes (emp_region,
emp_city and emp_district) transitively dependent on the primary key (emp_id). This
violates the rule of 3NF.

To remove the transitive dependency to make this table comply with 3NF, we have to
break the table into two tables into Employee and Location Tables as follows:

Table 11: Employee Table

emp_id emp_fname emp_Surame postcode

1001 John Kwame 282005

1002 Ajeet Musah 222008

1006 Lora Abena 282007

1101 Lilly Adjetey 292008

1201 Steve Barima 222999

Table 12: Location Table

postcode emp_region emp_city emp_district
Cape Coast
282005 CC Cape Coast
Metropolitan
Tamale
222008 NT Tamale
Metropolitan
Asante Akim
282007 AA Juaso
South
Ashaiman
292008 GB Ashaiman
Municipal
222999 EA Atimpoku Asuogyaman

By this design the problem of 3NF violation is eliminated. These two table are made to communicate
with each other through a table relationship. The linking field are postcode as a primary key in the
Location table, and postcode as a foreign key in the Employee table.
Example 2

In this example, we are interested in modeling a project management system into a relational
database. Students are made to work on projects as part of what is required for them to earn a
degree. In Table 13, you can observe that there is a normalization problem. Can you tell what the
problem is? Of course the table has both 2NF and 3NF normalization problems. Closer look at Table
10 reveals the following:

The table is not a single purpose one, a 2NF problem. You can clearly notice that the table attempts to
store student data, as well as project data. In a relational database, it is criminal to design a table in
this way, for the reason of all the anomalies we’ve discussed. There is a transitive dependency
problem, a 3NF problem. There appears to be two possible key columns for the table, Studentid and
Project_number. This is part of the reason for the 2NF problem above. Two possible key columns will
automatically create two-tables-in-one. Even though the primary key for the combined table is
Studentid, the Project_number column naturally has unique values which guarantees unique
identification of projects. The presence of the Project_number column means that some non-key
columns will not depend on the Studentid primary key column directly. Actually, the non-key column,
Project_name depend on the Project_number, and not primarily on the Studentid primary key
column. Project_name attribute depends on the primary key (Studentid) through Project_number
attribute. For the the table to be in 3NF, there should be only one primary key column and all the
other columns should fully depend on that primary key. No columns should depend on some other
column which is not the primary key. In relation to this example, the columns First_name, Last_name,
Project_number, and Project_name all should fully and exclusively depend on the primary key
(Studentid) for the table to be in 3NF. At the moment, this is not the case. The column Project
depends on another non-key column (Project_number).

Table 13: A Table To Track Students And Their Projects

Studentid First_name Last_name Project_number Project_name

SB/ACC/20/0010 Samuel Parry 199 Geo Location

SB/ACC/20/0100 Alfred Andoh 120 Cluster

Exploration

To resolve the problem for the table to be in 3NF, the table must be split into two: Student table and
Project table as follows. By this design, in each of the two tables, is correctly normalized from 1NF to
3NF. In Table 14 the Student table has one primary key column, Studentid. All the non-key columns in
the table (First_name, and Last_name) are fully and exclusively dependent on the primary key
Studentid. Note that the Project_number column in the Student table is a foreign key column, serving
the purpose of linking the Project table (see Table 15) to the Student table. Table relationships are
discussed in detail in the following session.

Table 14: Student Table

Studentid (PK) First_name Last_name Project_number (FK)

SB/ACC/20/0010 Samuel Parry 199

SB/ACC/20/0100 Alfred Andoh 120

Table 15: Project Table

Project_number (PK) Project_name

199 Geo Location

120 Cluster Exploration

Relating Tables with Key Columns

The biggest magic in relational database is the ability of database tables to “talk” to
themselves. This is a huge advantage over the flat file database system. In this session
we will learn how table relationships are set up to enable database tables to
communicate with themselves. For two tables to be related (linked up) to facilitate
communication between themselves, they require a primary key (PK) and a foreign
key (FK).

The concept of primary key has already been discussed extensively in previous
sessions. A table is represented by its primary key. A primary key field contains
values which uniquely identifies every record in the table. For there to be a
relationship between between any two tables, there should be a copy of the primary
key field of one table into the other table. In the receiving table, the copied key
column is a foreign key column, and its main job is the create a link between the two
tables.

For example in Table 13, the Studentid columnn is the primary key column for the
Student table. The Project_number column in the Student table is referenced from the
Project table (see Table 14). In the Project table the Project_number column is a
primary key. This is referenced into the Student table as a foreign key to establish the
basis of connection between the two tables. Deciding on which table receives the
foreign key is largely an issue of practicality and convenience. However, what is
practical and convenient can be relative. For example, it may be more convenient and
practical that when you enter a record in the student table, you specify which project
is assigned to the student. If this way of capturing the data into the tables is the most
is the most convenient for you, so be it. On the other hand, if it is more convenient
and practical for you to specify which student is responsible for which project after
entering the project details, then that becomes the design. In this latter case, the
foreign will be place in the project table. It will be the primary key of the Student
table (Studentid) referenced into the Project table as a foreign key.

Table 16: Student

Studentid (PK) First_name Last_name Project_number (FK)

SB/ACC/20/0010 Samuel Parry 199

SB/ACC/20/0100 Alfred Andoh 120

Table 17: Project

Project_number (PK) Project_name

199 Geo Location

120 Cluster Exploration

The table which has the foreign key is referred to as the child table. For a example in
the student-project database above (see Tables 13 an 14) the child table is the table
named Student. Can you figure out why it is fitting to call the “foreign” key table as
the child table? It is because the table is a dependent table. Dependent in the sense
that the values of one of its columns (the foreign key column) is fed by a “parent”.
this is understandable. If you are fed, then you are a child, and the one who feeds you
is mostly your parents. Not surprisingly, the source from which the foreign key
originates is called a parent table. In reference to the student-project database above,
the parent table is the Project table, and you should be able to explain why the Project
table is a parent table.

Remember that in a relational database, each table is made to hold entity-specific data
such that from any one particular table, you cannot find related data (see 2NF above).
for example, in the Student table, you can only find student-specific data. You cannot
find data on students’ fees. Such a design is purposely to eliminate the anomalies
associated with flat file database systems (you can revisit the session on flat file
database anomalies). But definitely, relational database tables do have relationships
between themselves. For example for a school database, Student entity has a
relationship with Course entity, and the relationship is that students read courses or
courses are read by students. In a sales processing system, the Customer entity as a
relationship with the Order entity, and the relationship is that customers place orders
or orders are placed by customers. There are three types relationships which can exist
between related tables:

 one-to-many relationship

 many-to-many relationship

 one-to-one relationship.

The details of these relationships are discussed as follows:

One-To-Many Relationship

The most common type of table relationship in a relational database is one-to-many relationship. In
one-to-many relationship one record in one entity can be referenced by multiple records in another

entity. Stated differently, In

a one-to-many relationship, a primary key
value in the first table will have multiple matching values in the
second table’s joined column (the joining column is the foreign key
column). For example, consider a database that tracks automobiles.
One table would hold data on automobile manufacturers, with one
row each for Mercedes, Toyota, Ford, Honda, Kia, and so on. A
second table with model names, such as C Class, Camry, Focus, Civic,
Sedona, and Accord, would have several rows matching each row in
the manufacturers’ table.The logical model of the car database is
produced in Figure 7. In Figure 7, the relationship line connecting
the Manufacturer entity and the Model entity reflects a one-to-
many relationship. The relationship line uses the so-called crow’s
leg notation. If you have paid attention to the nature of crow’s leg
you would realize that the le is straight from the top and splits out
into many parts at the toe section. In modeling relationships
between entities, the many part of the crow’s leg notation points to
the entity on the many side of the relationship. In Figure 7, the
many part of the relationship fall on the Model entity.
Figure 7: Logical Model of A Car Database

the physical implementation of the car database, is illustrated in

Tables....
Table 18: Manufacturer Table Of The Physical Implementation Of The Car Database

Manufacturer_id Manufacturer_name

1 Mercedes

2 Honda

3 Kia

4 Toyota

5 Ford

Table 19: Model Table Of The Physical Implementation Of The Car Database

Model_id (PK) Model_name Model_year Manufacturer_id

(FK)

1 Camry 2008 4

2 Accord 2018 2

3 C-Class 2020 1

4 Focus 2017 5

5 Corolla 2018 4

6 E-Class 2017 1
The one-to-many relationship between the Manufacturer table and
the Model table is more conspicuous in the physical model. For
every one record in the Manufacturer table, there is an associated
multiple records in the Model table.for example, the first record in
the Manufacturer table is Mercedes,with a manufacturer_id of 1. In
the Model table, the manufacturer Mercedes is associated with the
model names, C-class , E-class, etc. This makes it one manufacturer
(Mercedes) mapping to more than one model (C-clss, and E-class).
On the other hand, for every one record in the Model table, there is
one and only one associated record in the Manufacturer table. For
example, C-class is a model name with the model_id of 3. This is
one record in the model table, and can relate to one and only one
manufacturer (Mercedes) in the Manufacturer table.

Another example is students and their programs of study. In many

situations, students in universities are enrolled into one and only
one program. In other words, many students can be admitted into
one program, but no one student can be admitted into more than
one program. In such a situation the type of relationship between
the Student entity and the Program entity is one-to-many, modeled
logically as follows:

Figure 8: Logical Modeling of the Relationship Between Student and Program Entities

In a physical model, the Student-Program relationship can be

modeled as follows:

Table 20: Program Table

Programid (PK) Program_name

1 Accounting

2 Computer Science

3 Procurement

4 Statistics

Table 21: Student Table

Studentid First_name Last_name Registration_number Programid

(PK) (FK)

1 Josh Addo Sb/acc/20/0020 1

2 Kofi Addo Sb/acc/20/0001 1

3 Solomon Kwarteng Sb/csc/20/0041 2

4 Thomasina Ntiamoah Sb/csc/20/0014 2

5 Nana-Adwoa Sampa Sb/pcm/20/0011 3

6 Kwabena Ankoma Sb/sta/2010090 4

From Tables 20 and 21, you can see the physical rendering of
relationship between the Program table and the Student table. It
can be seen that for any one record in the Program table, there is
more than one related record in the Student table. For example, if
we take the Accounting program in the Program table, we can find
at least two matches in the Student table. In terms of the type of
relationship, this is one record from the Program table to many
records in the Student table. On the other hand, one record (one
student) from the Student table can relate to one and only one
record (one program) in the Program table. This ultimately creates
a one-to-many relationship between the Program and Student
tables

In a sale processing system, the relationship between the Customer

entity and the Order entity is one-to-many. One record in the
Customer table can relate to many records in the the Order table.
On the reverse side, one record in the Order table can relate to only
one record in the Customer table. This ultimately results in a one-
to-many relationship, with the many part of the relationship falling
on the Order table. The logical modeling of the relationship
between the Customer and Order Tables is shown in Figure 9.

Figure 9: The Logical Relationship Between Customer And Order Tables

Tables demonstrate the physical design of the relationship between the Customer and
Order entities.

Table 22: Customer Table

Customerid (PK) Customer_name Phone

1 UCC 0332012348

2 UG 033123249

3 KNUST 033212345

4 UEW 033149857

Table 23: order Table

Orderid (PK) Order_date Order_amount Comment Customerid (FK)

1 2021-11-6 20000 2

2 2019-10-2 23000 4

3 2020-01-10 12000 1

4 2021-11-7 30000 2
5 2019-12-23 15000 1

6 2020-03-12 20000 3

A closer look at Tables 22 and 23 reveals that one customer in the Customer table is
related to multiple orders in the Order table. This is true as any particular customer
can can buy from a company several times. On the reverse side, one order in the
Order table is related to one and only one customer in the Customer table. No one
order can come from two or more customers.

Many-To-Many Relationship

In a many-to-many relationship, multiple rows/records in table A

will have multiple matching rows in table B. on the reverse side
multiple records/rows in table B will have multiple matching
rows/records in table A. As an example, a table of baseball players
could be joined to a table of field positions. Each player can be
assigned to multiple positions, and each position can be played by
multiple people.

It is important to dedicate some time for modeling many-to-many

relationship as its physical modeling is not as straight forward as the
one-to-many relationship discussed above. Lets now discuss some
examples of situations requiring many-to-many relationship
modeling

Example 1: University Database

In this example, we will show an extract of a university database. We are only interested in two
entities which are related in a many-to-many fashion. There can be several entities which bear
many-to-many relationships. However, at this point we will consider the lecturers and the
subjects they teach. The entities involved will be Lecturer and Subject.Let’s check how this is a
many-to-many relationship. One lecturer could teach one or many subjects, but one subject
could also be taught by one or many professors. This is a many-to-many relationship, and your
logical model should look like this:
Figure 10: Logical Model Of A Many-To-Many Relationship Between Lecturer and Subject
Entities

Many-to-many relationships are not ideal. If we translate this logical model (see Figure 10)
directly into a physical database, the data would be duplicated. For instance, if there’s a lecturer
that teaches six subjects, you would have him or her listed in the table six times, every time for a
different subject. This is quite inefficient. So, how would we resolve this many-to-many
relationship between these two entities? Many-to-many relationship modeling is not done as
straight forward as the logical model in Figure 10. many-to-many relationship is implemented by
introducing a junction/link table into your model. The junction or link table is placed in-between
the two main entities (Lecturer and Subject entities in this case). This design breaks the direct
many-to-many relationship into multiple one-to-many relationships as shown in Figure 11.

Figure 11: Implementation Of Many-To-Many Relationship

As you can see, there’s a new table called Subject_details. Others prefer to name the junction
table by pairing the names of the tables which it is providing the linkage between for
example,insteadof the name lecture_details, the junction table could have been named
Lecturer_Subject. Note that the junction table (Subject_detail) contains the following attributes:

• Lecturerid : A foreign key attribute, which references the Lecturer_id column in the
Lecturer entity.
• subjectid : A foreign key attribute, which references the Subject id attribute in Subject
entity.

The attribute Lecturer _id is a foreign key attribute to the entity Lecture_details. The same goes
with the S ubjectid ; it is a foreign key attribute to the entity Lecture_details. At the same time,
the pair Lecturer id , S ubjectid is the primary key for the table Subject_details. The
columns Lecturer id and S ubjectid, in the Lecturer-Subject entity, together form a composite
primary key (i.e. the primary key consists of two or more attributes). This composite primary key
ensures that a lecturer can be assigned to one subject one time. Each pair of values
(Lecturer id , S ubjectid ) can be in the Lecturer-Subject table no more than once. The same goes
for the subjects; each one can be assigned to one Lecturer one time. The composite key ensures
the uniqueness of the attribute combinations.

Tables 24 through 26 depict the physical modeling of the many-to-relationship between the
Lecturer and Subject tables. Let’s check that the junction table solves the many-to-many
relationship. One lecturer can be allocated only once to the same subject. On the other hand,
one subject can be assigned only once to the same lecturer.

Table 24: Lecturer Table

Lecturerid (PK) First_name Last_name
1 Joshua Addo
2 Isaac Anim
3 Emmanuel Arhin
4 Stephen Asante
5 Peace Ametepe
6 Stephen Winful
7 Patrick Darkwa
8 Evans Frimpong-Manso

Table 25: Subject_Details

Lecturerid (PK, Fk) Subjectid (PK, FK) Credit_hours
1 1 3
1 2 3
1 3 3
1 4 3
7 5 3
7 6 3
4 7 3
8 7 3

Table 26: Subject Table

Subjectid (PK) Subject_code Subject_name
1 ACC204 Computer Applications in Accounting
2 SBU307 Introduction to Cost and Management Accounting
3 ACC403 Audit and Internal Review
4 ACC404 Auditing and Assurance Practice
5 ACC301 Cost and management Accounting I
6 ACC302 Cost and management Accounting II
7 ACC304 Taxation

Example 2: Purchase Processing (Product Ordering) Database

In this example, our task is to create a database that will help a company store information about
their suppliers. The database will also contain info on all the products/services ordered from the
suppliers. The logical data model could look something like this:

Figure 12: Supplier-Product Relationship

The relationship between these two entities is again many-to-many. One or many products can
be ordered from one supplier. At the same time, the company can order the same product from
many suppliers, e.g. services from different legal firms, tires from different manufacturers, etc.
How would this logical model look when transformed into a relational database model? Figure
12 depicts the ‘raw form’ of the many-to-many relationship between the two entities. But once
again, it will be wasteful to directly translate Figure 12 into a physical database. Like we
discussed for example one above, many-to-many relationship are implemented by by breaking it
into multiple one-to-many relationships, by placing a junction table in-between (supplier-product)
the entities as shown in Figure 13.

Figure 13: Implementing Many-To-Many Relationship for Supplier and Product Entities

Once again, instead of many-to-many, there’s a new table that’s automatically been
named supplier_product. In this implementation the junction table has only two attributes:

• supplierid : References the Supplierid column in the supplier table.

• productid : References the productid column in the product table.

Again, the pair supplierid , and productid is the primary key (actually a composite key) of the
supplier-product table. At this level of implementation, our database only tracks suppliers and
the products they supply to us. If we want our database to track orders we make to suppliers, it
would be better to expand the junction table supplier_product a bit, as shown in Figure 14.

Figure 14: Order Entity Serving as Junction Between Supplier and Product Entities

The design looks much better now. First of all, the name of the junction table has changed to
something more descriptive; it’s now named order. Several new attributes have also been added
to the table. It consists of the following:

• Orderid : The ID of this order from the supplier and the table’s primary key (PK).
• Supplierid : The ID of the supplier; references the table supplier.
• Productid : The ID of the product ordered; references the table product.
• Order_date : The date of the order.
• Quantity : The number of items ordered.
• Total_price : The total value of the ordered products.
• Status : The status of the order.

Remember that there are two possibilities when creating a junction table. One is that it contains
only foreign keys that reference the other tables, which happens often enough (see Table 13).
However, sometimes the junction table becomes its own entity, as in Table 14 where it also
contains other attributes. You should always adapt the model to your needs. Now that you’ve
become so good at this, let’s take a look at one more example!

Example 3: Book Publisher Database

In this example, think of yourself as being in the publishing business and you need to maintain a
record of the books you’ve published. Many people are involved in producing a book, so you also
want to have a record of these people and their roles. The logical model could look something
like this:

Figure 15: Logical Model of Book Publishing Database

In Figure 15, there are three entities:

• Book
• Staff
• Role

The book entity contains these attributes:

• isbn : The International Standard Book Number, a primary identifier (PI) used for books.
• Book_title : The title of the book.
• issue : The issue (i.e. edition) of the book (e.g. first printing, first edition, etc.).
• date : The date of the issue.

The next entity is staff:

• Staff_id: The staff member’s unique ID; a primary identifier (PI).

• first_name : The staff member’s first name.
• last_name : The staff member’s last name.
•

Between the Book and Staff entities, there’s a many-to-many relationship. Let’s check the logic.
One staff member can work on one or many books. One book can be handled by one person
(well, hardly) or by many people. You must comprehend why the relationship between the two
entities is many-to-many.

The third entity, Role, consists of the following entities:

• Roleid :
The ID of the role; a primary identifier (PI).
• role_name :
The name of the role.
• role_description : A description of that role.

Again, there is a many-to-many relationship between the entities Staff and Role. This is the logic
behind the relationship: one staff member can fill one or many roles when working on a book
and that one role can be performed by one or many staff members. Role, in this sense, means
something like author, co-author, editor, proofreader, translator, illustrator, etc. For instance,
the author of one book can also be an illustrator on another book, translator on a third, and
proofreader on a fourth.

The implementation of the many-to-many relationships among the three entitites will look like
Figure 16.
Figure 16: Implementing The Many-To-Many Relationships Between Book, Staff, And Role
Entities

This model seems a bit complicated. Until now, we have have been dealing with situations
involving only two entities. This the modeling in Figure 16 rather has three entities related to
each other in a many-t-many relationship. This kind of relationship where three entities
participate are involved is referred to as a ternary relationship. Here, the junction table again has
a composite primary key that is made up of foreign keys. This time, however, the primary key
consists of three columns, not two.

Let’s analyze the junction table named book_creators. It has three attributes as follows:

• Book_isbn : This is a foreign key which references the Book_isbn (primary key) of the
book entity.
• Staffid : This is a foreign key which references the Staffid (primary key) of the
Staff entity.
• Roleid : This is a foreign key which references the the Roleid (primary key) of
the Role entity.

The primary key of the junction table (book_creators) is the unique combination of the
attributes B ook_isbn , S taffid , and R oleid .

One-To-One Relationship

One-to-one relationship is the entity relationship where one record in table A is related to one
and only one record in table B, and on the reverse, one record in table B also relates to one and
only one record in table A. the following are real-life examples of one-to-one relationships:

• Country and capital city relationship: Each country has exactly one capital city. Each
capital city is the capital of exactly one country.
• Person and their fingerprints. Each person has a unique set of fingerprints. Each set of
fingerprints identifies exactly one person.
• Email and user account. For many websites, one email address is associated with exactly
one user account and each user account is identified by its email address.
• Spouse and spouse relationship: In a monogamous marriage, each person has exactly
one spouse. But in a polygamous marriages, the relationship will not be one-to-one.
• User profile and user settings. One user has one set of user settings. One set of user
settings is associated with exactly one user.

For clarity, let’s contrast these examples with relationships that are not one-to-one:

• Country and city relationship: Each city is in exactly one country, but most countries
have many cities. This is one-to-many relationship
• Parent and child relationship: Each child has two parents, but each parent can have
many children.
• Employee and manager relationship: Each employee has exactly one immediate
supervisor or manager, but each manager usually supervises many employees.

Denoting a One-to-One Relationship in an ER Diagram

A one-to-one relationship in an ER diagram is denoted, like all relationships, with a line
connecting the two entities. The “one” cardinality is denoted with a single straight line. (The
“many” cardinality is denoted with a crow’s foot symbol .)

The one-to-one relationship between country and capital can be denoted like this:

The perpendicular straight lines mean “mandatory”. This diagram shows that it’s mandatory for
a capital to have a country and it’s mandatory for a country to have a capital.

Another possibility is for one or both of the sides of the relationship to be optional. An optional
side is denoted with an open circle. This diagram says that there is a one-to-one relationship
between a person and their fingerprints. A person is mandatory (fingerprints must be assigned to
a person), but fingerprints are optional (a person may have no fingerprints assigned in the
database).

One-to-One Relationships in a Physical Database

There are a few ways to implement a one-to-one relationship in a physical database.
Primary Key as Foreign Key

One way to implement a one-to-one relationship in a database is to use the same primary key in
both tables. Rows with the same value in the primary key are related. In this example, France is
a country with the id 1 and its capital city is in the table capital under id 1.

country

id name

1 France

2 Germany

3 Spain

capital

Technically, one of the primary keys has to be marked as foreign key, like in this data model:

The primary key in table capital is also a foreign key which references the id column in the
table country. Since capital.id is a primary key, each value in the column is unique, so the
capital can reference at most one country. It also must reference a country – it’s a primary key,
so it cannot be left empty.

Additional Foreign Key with Unique Constraint

Another way you can implement a one-to-one relationship in a database is to add a new column
and make it a foreign key.

In this example, we add the column country_id in the table capital . The capital with id 1,
Madrid, is associated with country 3, Spain.

country

id name
id name

1 France

2 Germany

3 Spain

capital

id name country_id

1 Madrid 3

2 Berlin 2

3 Paris 1

Technically, the column country_id should be a foreign key referencing the id column in the
table country. Since you want each capital to be associated with exactly one country, you should
make the foreign key column country_id unique.

One-to-One Relationships in Practice

Few One-to-One Relationships Last

One-to-one relationships are the least frequent relationship type. One of the reasons for this is
that very few one-to-one relationships exist in real life. Also, most one-to-one relationships are
one-to-one only for some period of time. If your model includes a time component and captures
change history, as is very often the case, you’ll have very few one-to-one relationships.
A monogamous relationship may split up or one of the partners may die. If you model the reality
of monogamous relationships (such as marriages or civil unions) over time, you’ll likely need to
model the fact that they last only for a certain period.

You’d think that a person and their fingerprints never change. But what if the person loses a
finger or the finger is badly burnt? Their fingerprints might change. It’s not a very frequent
scenario; still, in some models, you may need to take this into account.

Even something seemingly as stable as countries and their capitals change over time. For
example, Bonn used to be the capital of West Germany (Bundesrepublik Deutschland) after
World War II, when Berlin was part of East Germany. This changed after German reunification;
the capital of Germany (Bundesrepublik Deutschland) is now Berlin. Whether you should or
should not take this into account depends on your business reality and the application you’re
working on.

UNIT FOUR

THE STRUCTURED QUERY LANGUAGE

Structured Query Landguage (SQL) is a flexible language that is used in creating and managing relational databases. By far,
it is the widely used tool for communicating with a relational database. SQL originated in one of IBM’s research laboratories, as
did relational database theory. In the early 1970s, as IBM researchers developed early relational DBMS (or RDBMS) systems,
they created a data language to operate on these systems. They named the pre-release version of this language SEQUEL
(Structured English QUEry Language). This relational database language has been developed into what is now called SQL. The
syntax of SQL is a form of structured English, which is where its original name came from. To bring consitency in the application
of the SQL language, the American National Standards Institute (ANSI) has, since mid 1980s, regulated the SQL language
through standardization. The current standardized SQL by ANSI is the SQL: 2016.

The SQL language is implented with a relational database management system (RDBMS). regardless the of the particular
RDBMS (whether PostgeSQL, MySQL, SQL Lite, or some other one)the SQL language is the same except a few vendor-specific
differences. Sinse SQL is implementd with an RDBMS, it is the best thing to learn how to set up and SQL database through an
RDBMS. For the purposes of this course, the PostgreSQL RDBMS is adopted. Note that the choice of an RDBMS does not really
matter as the core of the language is the same across all RDBMS platforms. Let us take a few minutes to go through the
installation process of PostgreSQL.

Why the choice of PostgreSQL

PostgreSQL, or simply Postgres, is a robust database

system that can handle very large amounts of data. Here are some
reasons

PostgreSQL is a great choice to use with this book:

 It’s free.

 It’s available for Windows, macOS, and linux operating systems.

 Its SQL implementation closely follows ANSI standards.

 It’s widely used for analytics and data mining, so finding help
online from peers is easy.

 Its geospatial extension, PostGIS, lets you analyze geometric

data and perform mapping functions.

Installing PostgreSQL for Windows

You’ll start by installing the PostgreSQL database and the graphical

administrative tool pgAdmin, which is software that makes it easy

manage your database, import and export data, and write queries.

Always install the latest available version of PostgreSQL for your

operating

system to ensure that it’s up to date on security patches and new

features. For

this book, I’ll assume you’re using version 13 or later.

Windows Installation
For Windows, I recommend using the installer provided by the
company

EnterpriseDB, which offers support and services for PostgreSQL

users.

EnterpriseDB’s package bundles PostgreSQL with pgAdmin. To get

the software, visit https://www.enterprisedb.com/ and create a

free account. Then go to the downloads page at

https://www.enterprisedb.com/software-downloads-postgres/.
Select the latest available 64-bit Windows version of EDB Postgres
Standard unless you’re using an older PC with 32-bit Windows.
After

you download the installer, follow these steps:

Right-click the installer and select Run as administrator. Answer

Yes to the question about allowing the program to make changes to

your computer. The program will perform a setup task and then

present an initial welcome screen. Click through it.

Choose your installation directory, accepting the default.

On the Select Components screen,

select

the boxes to install

PostgreSQL Server, the pgAdmin

tool,

Stack Builder, and

Command Line Tools.

Choose the location to store data. You can choose the default,
which
is in a “data” subdirectory in the PostgreSQL directory.

Choose a password. PostgreSQL

is robust

with security and

permissions. This password is for

the initial

database superuser

account, which is called postgres.

Select a port number where the server will listen. Unless you have

another database or application using it, the default of 5432 should

fine. If you have another version of PostgreSQL already installed or

some other application is using that default, the value might be 5433

or another number, which is also okay.

Select your locale. Using the default is fine. Then click through the

summary screen to begin the installation, which will take several

minutes.

8. When the installation is done, you’ll be asked whether you want

launch EnterpriseDB’s Stack Builder to obtain additional packages.

Select the box and click Finish.

9. When Stack Builder launches, choose the PostgreSQL installation
on the drop-down menu and click Next. A list of additional

applications should download.

10.

Expand the Spatial Extensions menu and select either the 32-bit or

64-bit version of PostGIS Bundle for the version of Postgres you

installed. Also, expand the Add-ons, tools and utilities menu and

select EDB Language Pack, which installs support for programming

languages including Python. Click through several times; you’ll need

to wait while the installer downloads the additional components.

11.

When installation files have been downloaded, click Next to install

both components. For PostGIS, you’ll need to agree to the license

terms; click through until you’re asked to Choose Components.

Make sure PostGIS and Create spatial database are selected. Click

Next, accept the default database location, and click Next again.

12.

Enter your database password when prompted and continue

through

the prompts to finish installing PostGIS.

13. Answer Yes when asked to register GDAL. Also, answer Yes to
the

questions

about setting POSTGIS_ENABLED_DRIVERS and

enabling

the

POSTGIS_ENABLE_OUTDB_RASTERS

environment variable

When finished, a PostgreSQL folder that contains shortcuts and

links

to documentation should be on your Windows Start menu.

Working with pgAdmin

Before you can start writing code, you’ll need to become familiar
with pgAdmin, which is the administration and management tool
for

PostgreSQL. It’s free, but don’t underestimate its performance. In

fact,

pgAdmin is a full-featured tool similar to tools for purchase, such as

Microsoft’s SQL Server Management Studio, in its capability to let

you

control multiple aspects of server operations. It includes a graphical

interface for configuring and administrating your PostgreSQL server

and

databases, and—most appropriately for this book—offers a SQL

query

tool for writing, testing, and saving queries.

The left vertical pane displays an object browser where you can
view

available servers, databases, users, and other objects. Across the

top of the screen is a collection of menu items, and below those are
tabs to display

various aspects of database objects and performance.

Next, use the following steps to connect to the default database:

In the object browser, expand the plus sign (+) to the left of the

Servers node to show the default server.

Depending on your

operating system, the default server name

could be localhost or

PostgreSQL x, where x is the Postgres version number.

Double-click the server name. Enter the password you chose during

installation if prompted. A brief message appears while pgAdmin is

establishing a connection. When you’re connected, several new

object items should display under the server name.

Expand Databases and then expand the default database postgres.

Under postgres, expand the Schemas object, and then expand public.

Your object browser pane should look similar to Figure 2

This collection of objects defines every feature of your database

server.

There’s a lot here, but for now we’ll focus on the location of tables.
To

view a table’s structure or perform actions on it with pgAdmin, this

where you can access the table. In Chapter 1, you’ll use this
browser to

create a new database and leave the default postgres as is.

In addition, pgAdmin includes a Query Tool, which is where you

write

and execute code. To open the Query Tool, in pgAdmin’s object

browser, click once on any database to highlight it, and then select
Tools ▸ Query Tool. The Query Tool
has two panes: one for writing queries and one for output.

Like any language, SQL has a syntax (the grammatical arrangement of words in sentences) which one must adhere to in writing
valid SQL codes. To master the use of the language actually is to master the syntax of the language. Like the learning of any
language, it persistent practice for you to master the SQL language. At this juncture, we turn our focus to the building blocks of
the language.
SQL Statements
The SQL language consists of a limited number of statements that perform three main functions of data management. This is
good news due to the fact that the language is not too complex to learn. Some group of SQL statements define data, some
manipulate data, some control data, and others retrieve data for reporting. The set of statements for the definition of data
form what is called Data Definition Language (DDL). Those statements meant for manipulation of data for what is called Data
Manipulation Language (DML). The group of statements for the retrieval of data form the Data Retrieval Language (DRL) or
Data Query Language (DQL), while those for the control of data form what is called Data Control Language.

Data Definition Language

The Data Definition Language (DDL) is the part of SQL you use to create, change, or destroy the basic elements of a relational
database. Basic elements include tables, views, schemas, catalogs, clusters, and possibly other things as well. In the following
sections, we discuss the containment hierarchy that relates these elements to each other and look at the commands that
operate on these elements. A database is like a big container with sub-containers. The containment hierarchy of a database can
be broken down as follows:

1. The database is the biggest container, and contains catalogs

2. Catalogs contain schemas

3. Schemas contain tables and views

4. Tables contain columns and rows.

You need to worry mostly about table as the relational database management system (RDBMS) manages the other
elements by itself. The RDBMS is a software pachage which is designed to interpret the SQL language and for the total
management of the database. As has been mention earlier, there are a number RDBMSs designed for the management of
relational databases. Just for emphasis, examples are reproduced here as follows: PostgreSQL, MySQL, Microsoft SQL
Server, IBM DB2

The RDBMS provides the interface for you to enter your DDL codes to create your
database tables and other database elements. This means that you need to download
the RDBMS and install it on your computer. After the installation is complete, you
can go ahead to create your database and the tables for the database. It must be noted
that tables are the building blocks of relational databases, and so you need to master
the are table design as discussed previously. You should master all your normalization
principles as these are the principles of table design in a relational database. As best
practice, you need to first of all break down the system to be modelled into a database
into entities. You should know what an entity is by now. Keep in mind the following procedures
when planning your database:

1. Identify all tables.

2. Define the columns that each table must contain.

3. Give each table a primary key that you can guarantee is unique. (see discuss on primary keys in Chapters 3)
4. Make sure that every table in the database has at least one column in common with (at least) one other table in
the database (this is what we refer to as foreign key). These foreign keys columns serve as logical links that
enable you to relate information in one table to the corresponding information in another table.
5. Put each table in third normal form (3NF) or better, to ensure the prevention of insertion, deletion, and update
anomalies. (see normalization in Chapter 3)

After you complete the design of your database on paper (using ERD) and verify that it is sound, you’re ready to transfer
the design to the computer. The paper design of the database is referred to as logical design. And the translation of the
logical design onto the computer is referred to as the physical design. At this point in time, we will begin the practice of
SQL coding by trying to model a school database. We have chosen a school system because all of you are familiar with
it. In your personal practice, you may want to model a different system into a database. Note that you can model just
any system into a relational database.

Modeling a School System into A Database

The Logical Design

The first task on the journey of creating a relational database is to break the system
you are modelling into logical data containers called entities. It is always a good idea
to create the logical design of the database before physically creating it. The logical
design simply means the architectural design of the database on paper or computer,
usually using an entity relationship diagram. There some computer application such as
Microsoft visio, draw io and others, which you can use to do your ERD. You can use
any of these tools to draw a very nice entity relationship diagram as shown in Figure
xyz

The totality of the school system could be broken into the following entities:
• Student
• Course
• Teacher
• Program
• Department
• Fee/Payment
• Hall/Residence
• Guardian

The list can be more than what is provided above. After this exercise of breaking the
system into entities, you need to specify the attributes of the entities you want to keep
data on. For example, for any particular student in the student table, we need to keep
capture data on their names, registration numbers, Gender, contact details, and others.
In thinking about which attributes to capture about an entity, remember to be guided
by the principles of normalization as we learnt them in unit three. We will be
revisiting specific aspects of the principles of normalization where necessary. If you
find difficulties in coming up with the appropriate attributes of, revisit unit three for
more guidance. In a relational database, each of the entities is represented with a table,
and the attributes you specify for the entities become the column labels/headings for
each table. So this is how, for example, the Student entity will be set up as a table in
the database:

Studid First_name Last_name Registration_number Gender Phone Email

Note that each of the columns store specific data of specific type. For example, the
first_name column stores textual data about only students’ first names, and nothing
else. Some other columns for some tables will store numerical data. For example the
Payment table will have most of its columns to be of the numerical data types. Before
we provide detailed information on SQL data types, let us see the logical view of the
school database (in an ERD form). The ERD in Figure xyz shows the various entities
(and therefore table) in the school database, together with the attributes (also known
as fields of columns) of the entities. As we have said already, each of the columns
stores a specific data of specific type. By type, we mean the type data, whether
text/string, or numeric, or something else. It important at this point in time for us to
discuss SQL data types which can be declared on fields.
SQL Data Types

SQL databases support a variety of data types. However for this course we will talk about the common SQL data types such as
the following:

» Numerics

» Character/Strings

» Booleans

» Datetimes
We call these general types because inside some of the above types can be sub-types. Data type constitutes one of the critical
building blocks of relational database design, and therefore we have to break the discussion down for easy understanding.
Basically we will discuss each of the types and their sub-types in some detail as follows:

String Type
String types are general-purpose types suitable for any combination
of text, numbers, and symbols. Character data type has two sub-
types: character(n) [with the short form as char(n)], and character
varying(n) [with the short form as varchar(n)]

char(n)

A fixed-length column where the character length is specified by n.

A column set at char(20) stores 20 characters per row regardless of
how many characters you insert. If you insert fewer than 20
characters in any row, PostgreSQL pads the rest of that column with
spaces. This type, which is part of standard SQL, also can be
specified with the

longer name character(n). Nowadays, char(n) is used infrequently and is

mainly a remnant of legacy computer systems.

varchar(n)

A variable-length column where the maximum length is specified by

n. If you insert fewer characters than the maximum, PostgreSQL will
not store extra spaces. For example, the string blue will take four
spaces, whereas the string 123 will take three. In large databases,
this practice saves considerable space. This type, included in
standard

SQL, also can be specified using the longer name character varying(n)

Numbers
Number columns hold various types of (you guessed it) numbers,
but that’s not all: they also allow you to perform calculations on
those numbers. That’s an important distinction from numbers you
store as strings in a character column, which can’t be added,
multiplied, divided, or perform any other math operation. Also, as I
discussed in Chapter 2,

numbers stored as characters sort differently than numbers stored

as numbers, arranging in text rather than numerical order. So, if
you’re doing math or the numeric order is important, use number
types. The SQL number types include:

Integers

Whole numbers, both positive and negative

Fixed-point and floating-point

Two formats of fractions of whole numbers

We’ll look at each type separately.

Integers
The integer data types are the most common number types you’ll
find

when exploring data in a SQL database. Think of all the places

integers

appear in life: your street or apartment number, the serial number

your refrigerator, the number on a raffle ticket. These are whole

numbers,

both positive and negative, including zero.

The SQL standard provides three integer types: smallint, integer, and

bigint.
The difference between the three types is the maximum size
of the

numbers they can hold. Table 3-1 shows the upper and lower limits
of each, as well as how much storage each requires in bytes.

Data Storage Range

type size

smallint 2 bytes −32768 to +32767

integer 4 bytes −2147483648 to

+2147483647

bigint 8 bytes −9223372036854775808

+9223372036854775807
Even though it eats up the most storage, bigint will cover just about

any requirement you’ll ever have with a number column. Its use is a
must

if you’re working with numbers larger than about 2.1 billion, but
you can

easily make it your go-to default and never worry. On the other
hand, if

you’re confident numbers will remain within the integer limit, that
type is

a good choice because it doesn’t consume as much space as bigint (a

concern when dealing with millions of data rows).

When the data values will remain constrained, smallint makes sense:

days of the month or years are good examples. The smallint type will
use

half the storage as integer, so it’s a smart database design decision if

the

column values will always fit within its range.

Decimal Numbers
As opposed to integers, decimals represent a whole number plus a
fraction

of a whole number; the fraction is represented by digits following a

decimal point. In a SQL database, they’re handled by fixed-point and

floating-point data types.

Fixed-Point Numbers
The fixed-point type, also called the arbitrary precision type, is

numeric(precision,scale). You give the argument precision as the maximum

number of digits to the left and right of the decimal point, and the

argument scale as the number of digits allowable on the right of the

decimal point.

Floating-Point Types
The two floating-point types are real and double precision. The difference

between the two is how much data they store. The real type allows

precision to six decimal digits, and double precision to 15 decimal points

precision, both of which include the number of digits on both sides

of the

point. These floating-point types are also called variable-precision

types.

The database stores the number in parts representing the digits and
an

exponent—the location where the decimal point belongs. So, unlike

numeric,where we specify fixed precision and scale, the decimal

point in a

given column can “float” depending on the number.

Dates and Times
This is essential for storytelling with data, because the issue of
when something occurred is usually as valuable a question as who,
what, or how many were involved. PostgreSQL’s date and time
support includes the four major data types shown in Table 3-4.

Data type Storage size Description Range

timestamp 8 bytes Date and time 4713 BC to

294276 AD

date 4 bytes Date (no time) 4713 BC to

5874897 AD

time 8 bytes Time (no date) 00:00:00 to

24:00:00

interval 16 bytes Time interval +/−

178,000,000
years

CREATING YOUR FIRST DATABASE AND

TABLE
The PostgreSQL program you downloaded in the is a database
management system, a software package that allows you to define,
manage, and query databases. For this course, we will creat a
database for the school system described in unit three. For the sake
of proximity, the ERD for the school system is reproduced here.The
ERD represents the logical view of the school database.
Creating the school database

Every database need to be given a name. Let’s name our

database my_school. Now open pgAdmin and click the
dropdown of the Tool menu. Choose Query Tool. You should
see somethig similar to the Figure

In the query editor, type the SQl statement which creates a database as follows:

This statement perfectly creates an SQL database named my_school. Note that below the query
editor is a message about the status of your SQL code, whether it executed correctly or there is an
error. When your statament is right, you get feedback message like the one in Figure xxx: “CREATE
DATABASE statement returned successfully in 3 secs 925 msec”. In summary the statement you need to
write to tell your RDBMS (in this case PostgreSQL) to create a database is a combination of two SQL
key words plus the name you want to give to your database. Also not that in SQL, a statement ends
with a semicolon(;), not fool-stop. In a generic form, the syntax for the CREATE DATABASE statement
is CREATE DATABASE database_name. Note that SQL key words in any statement are written in upper case.
That is the convention the database community.

ASSIGNMENT

Search for, and familiarize yourself with, SQL key words on google

Creating Tables
A database table looks a lot like a spreadsheet table: a two-dimensional array made up of rows and columns. You can
create a table by using the SQL CREATE TABLE command. Within the command, you specify the name and data type
of each column.This is where you need to apply the knowledge of SQL data types learnt in the previous session. The
generic form of the CREATE TABLE statement goes like this:

CREATE TABLE table_name (

Column1 datatype,
Column2 datatype,
Column3 datatype,
Columnn datatype
);

Note: this code could have been written as one line like

CREATE TABLE table_name (Column1 datatype,Column2 datatype,Column3 datatype, Columnn datatype);

This is equally correct SQL statement like the first one . the difference is that the first
one is more readable that the latter. With the first one you are able to see at glance, the
columns and their data types. This is not the case with the second style. More
importantly, the database community recommends the first approach. Now let’s turn
to our ERD and create the Student table. Open your pgAdmin and type the codes you
see in Figure xxx. If your statement runs successfully,
As you can see from the feedback message, the CREATE TABLE statement executed successfully and
the Student table is created in your my_school database. How can you know that? In the object
browser pane of pgAdmin, locate the my_school database.

1. Right click the my_school database and choose refresh to refresh the system.

2. Click the dropdown arrow to see all the objects in the my_school database.

3. Locate schemas, and under schemas, locate tables.

4. Under tables, you can locate the Student table you just created.

5. If you click the columns under tables, you will see all the columns.
You can also use the SELECT statement to view your tables. You type the SELECT statement in the
query editor as follows:
Now let’s create the Teacher table. I know by now all of you can create just any table. In your
pgAdmin, type the following codes to create the Teacher table:

We will create two more tables (Payment table and Course table) together and then you will be left
on your own to do create the rest.

EMRAM Acute Stage-7 Preparatory Guide 2021v1
No ratings yet
EMRAM Acute Stage-7 Preparatory Guide 2021v1
44 pages
The Impact of Big Data On Accounting and Auditing
No ratings yet
The Impact of Big Data On Accounting and Auditing
14 pages
49 AI-tools
100% (5)
49 AI-tools
67 pages
BAN 702 Module 1 Assignment: Types of Data, Big Data and Artificial Intelligence
No ratings yet
BAN 702 Module 1 Assignment: Types of Data, Big Data and Artificial Intelligence
5 pages
Gartner Data Analytics Australia Augmented Analytics 2018
100% (2)
Gartner Data Analytics Australia Augmented Analytics 2018
36 pages
Turban - Information Technology For Management 9e
No ratings yet
Turban - Information Technology For Management 9e
43 pages
RA Maturity List
No ratings yet
RA Maturity List
30 pages
Technology Challenges in Accounting & Finance (IV)
No ratings yet
Technology Challenges in Accounting & Finance (IV)
15 pages
Big data, data analytics and artificial intelligence in accounting An overview.
No ratings yet
Big data, data analytics and artificial intelligence in accounting An overview.
34 pages
Data_Analytics_for_Accounting - Lecture 1
No ratings yet
Data_Analytics_for_Accounting - Lecture 1
36 pages
Bab 12 Budaya Organisasi
No ratings yet
Bab 12 Budaya Organisasi
6 pages
1.the Value of Big Data in An Accounting Firm
No ratings yet
1.the Value of Big Data in An Accounting Firm
4 pages
Big Data: Understanding How Data Powers Big Business
From Everand
Big Data: Understanding How Data Powers Big Business
Bill Schmarzo
2/5 (1)
Business Technology - 1
No ratings yet
Business Technology - 1
33 pages
The Impact of Financial Technology (Fintech)
No ratings yet
The Impact of Financial Technology (Fintech)
31 pages
The Evolution of Technology For The Accounting Profession
No ratings yet
The Evolution of Technology For The Accounting Profession
8 pages
Big Data: Opportunities and challenges
From Everand
Big Data: Opportunities and challenges
BCS, The Chartered Institute for IT
No ratings yet
Big Data, Data Analytics and Artificial Intelligence in Accounting: An
No ratings yet
Big Data, Data Analytics and Artificial Intelligence in Accounting: An
35 pages
Data Science
From Everand
Data Science
Chloe Martin
No ratings yet
Module 10 - Big Data
No ratings yet
Module 10 - Big Data
36 pages
Data Analytics for Accountants Canadian Accountants
No ratings yet
Data Analytics for Accountants Canadian Accountants
7 pages
Accounting Analytics 1
No ratings yet
Accounting Analytics 1
44 pages
Materi Webminar Bigdata-Indra
No ratings yet
Materi Webminar Bigdata-Indra
32 pages
The Impact of Financial Technology (Fintech)
No ratings yet
The Impact of Financial Technology (Fintech)
29 pages
P5 - Chapter 5 THE IMPACT OF INFORMATION TECHNOLOGY
No ratings yet
P5 - Chapter 5 THE IMPACT OF INFORMATION TECHNOLOGY
3 pages
Monetizing Data: From Raw Signals to Revenue Streams
From Everand
Monetizing Data: From Raw Signals to Revenue Streams
Jeremy Hold
No ratings yet
Ref1.Richardson2023.DataAnalytics for Accounting - Chapt 1-2
No ratings yet
Ref1.Richardson2023.DataAnalytics for Accounting - Chapt 1-2
197 pages
1.2 Rapid Change in Accountantâ ™s Role
No ratings yet
1.2 Rapid Change in Accountantâ ™s Role
13 pages
AFA - Module 5 - 1
No ratings yet
AFA - Module 5 - 1
15 pages
Jawaharlal Nehru Technologial University Anantapur Ananthapuramu, Ap, 515002
No ratings yet
Jawaharlal Nehru Technologial University Anantapur Ananthapuramu, Ap, 515002
13 pages
What Every Manager Should Know About Big Data and Data Science
From Everand
What Every Manager Should Know About Big Data and Data Science
Lars Nielsen
No ratings yet
Blog On IT
No ratings yet
Blog On IT
14 pages
FAR3 Group Assignment
No ratings yet
FAR3 Group Assignment
11 pages
Marketing Program Expense C2
No ratings yet
Marketing Program Expense C2
1 page
Business Impact of Digital Transformation Technologies
From Everand
Business Impact of Digital Transformation Technologies
SADANAND PUJARI
No ratings yet
2024_The Improvement of Accounting Work Efficiency and Quality Through Big Data Technology
No ratings yet
2024_The Improvement of Accounting Work Efficiency and Quality Through Big Data Technology
7 pages
Understanding The Impact of Technology in Audit and Finance
No ratings yet
Understanding The Impact of Technology in Audit and Finance
17 pages
Digitisation Big Data and The Transformation of Accounting Information
100% (1)
Digitisation Big Data and The Transformation of Accounting Information
23 pages
Information Technology in The Accounting Curriculum
No ratings yet
Information Technology in The Accounting Curriculum
4 pages
Tomorrows Accountant
No ratings yet
Tomorrows Accountant
1 page
Digital Strategy: Boost Your Business with Big Data and Data Science
From Everand
Digital Strategy: Boost Your Business with Big Data and Data Science
Quick Solutions
No ratings yet
Challenges and Trends For The Incorporation of Big Data in The Accounting Profession
No ratings yet
Challenges and Trends For The Incorporation of Big Data in The Accounting Profession
9 pages
Data Analytics-Unit1 Notes
No ratings yet
Data Analytics-Unit1 Notes
30 pages
Data Analytics with Python: Data Analytics in Python Using Pandas
From Everand
Data Analytics with Python: Data Analytics in Python Using Pandas
Frank Millstein
3/5 (1)
Big Data, Data Analytics and External Auditing
No ratings yet
Big Data, Data Analytics and External Auditing
9 pages
The Future of Accounting and Accounting Information Systems[1]
No ratings yet
The Future of Accounting and Accounting Information Systems[1]
12 pages
Article b2
No ratings yet
Article b2
4 pages
Accounts PSDA 1
No ratings yet
Accounts PSDA 1
20 pages
The Power of Big Data in Accounting
No ratings yet
The Power of Big Data in Accounting
17 pages
Analitik Data Akuntansi Kelompok - Ak4406
No ratings yet
Analitik Data Akuntansi Kelompok - Ak4406
5 pages
Business @ the Speed of Thought (Review and Analysis of Gates' Book)
From Everand
Business @ the Speed of Thought (Review and Analysis of Gates' Book)
BusinessNews Publishing
5/5 (2)
Financial Accounting Tally Workbook: Submitted By-Rushil Submitted To - Ms. Anita ROLL NO. 15321
No ratings yet
Financial Accounting Tally Workbook: Submitted By-Rushil Submitted To - Ms. Anita ROLL NO. 15321
4 pages
My BAAssignmentrony
No ratings yet
My BAAssignmentrony
10 pages
SI08_2019b
No ratings yet
SI08_2019b
15 pages
79b8af_e1ed9eeb78e8426495de641a18c414f5
No ratings yet
79b8af_e1ed9eeb78e8426495de641a18c414f5
5 pages
(eBook PDF) Accounting Information Systems 3rd Editiondownload
No ratings yet
(eBook PDF) Accounting Information Systems 3rd Editiondownload
50 pages
Activity 2
No ratings yet
Activity 2
3 pages
From Data To Decisions: Driving Performance in the Age of Analytics
From Everand
From Data To Decisions: Driving Performance in the Age of Analytics
Babatunde Yusuf
No ratings yet
Minding the Machines: Building and Leading Data Science and Analytics Teams
From Everand
Minding the Machines: Building and Leading Data Science and Analytics Teams
Jeremy Adamson
No ratings yet
Advance Financial Accounting Full Notes
100% (1)
Advance Financial Accounting Full Notes
129 pages
Subject Outline Contemporary Business Issues
100% (1)
Subject Outline Contemporary Business Issues
6 pages
Understanding The Impact of Technology in Audit and Finance
No ratings yet
Understanding The Impact of Technology in Audit and Finance
16 pages
G12 It Unit 2
No ratings yet
G12 It Unit 2
30 pages
Articule2 - La Tecnología en El Ámbito Contable
No ratings yet
Articule2 - La Tecnología en El Ámbito Contable
5 pages
2021 ACA Syllabus Handbook - Advanced
No ratings yet
2021 ACA Syllabus Handbook - Advanced
59 pages
Namrata Hirawat_2024MBA262 CV Latest
No ratings yet
Namrata Hirawat_2024MBA262 CV Latest
1 page
CB Insights - Global Unicorn Club - 2019
No ratings yet
CB Insights - Global Unicorn Club - 2019
30 pages
DT - Roadmap-Solomon EGBI - FINAL
No ratings yet
DT - Roadmap-Solomon EGBI - FINAL
23 pages
Dikshita SIP BI
No ratings yet
Dikshita SIP BI
37 pages
Moving Toward Profitable Growth: The E-Grocery Challenge
No ratings yet
Moving Toward Profitable Growth: The E-Grocery Challenge
8 pages
Artificial Intelligence Machine Learning 101 t109 - r4
No ratings yet
Artificial Intelligence Machine Learning 101 t109 - r4
6 pages
E - 20240606 FAQ Universal Journal Extensibility
No ratings yet
E - 20240606 FAQ Universal Journal Extensibility
9 pages
M.Tech.: Data Science & Engineering
No ratings yet
M.Tech.: Data Science & Engineering
20 pages
Marketing Analytic
No ratings yet
Marketing Analytic
34 pages
Ricardo Correia_ Dominyka Venciute - AI Innovation in Services Marketing-Business Science Reference (2024)
No ratings yet
Ricardo Correia_ Dominyka Venciute - AI Innovation in Services Marketing-Business Science Reference (2024)
314 pages
Website_Fees_B&E_International
No ratings yet
Website_Fees_B&E_International
21 pages
Digitalization and The Challenges For The Accounting Profession
No ratings yet
Digitalization and The Challenges For The Accounting Profession
10 pages
CBA Batch 5 Profile Book 25JAN18 PDF
No ratings yet
CBA Batch 5 Profile Book 25JAN18 PDF
62 pages
Business Intelligence Assignment
No ratings yet
Business Intelligence Assignment
21 pages
Big Data in The Construction Industry A Review of Present Status, Opportunities, and Future Trends
100% (3)
Big Data in The Construction Industry A Review of Present Status, Opportunities, and Future Trends
42 pages
Industry 4.0 - Data Focus CoEP
No ratings yet
Industry 4.0 - Data Focus CoEP
26 pages
Centurysoft Company Profile
No ratings yet
Centurysoft Company Profile
5 pages
KAVYA.L
No ratings yet
KAVYA.L
44 pages
BP Presentation
No ratings yet
BP Presentation
19 pages
Customer-Journey Map
No ratings yet
Customer-Journey Map
7 pages
iot_mcq
No ratings yet
iot_mcq
43 pages
DS - RMT 021 - Ardifa Rizky Saputra
No ratings yet
DS - RMT 021 - Ardifa Rizky Saputra
1 page
Mini Project II Instructions Segmentation and Regression
0% (1)
Mini Project II Instructions Segmentation and Regression
6 pages
Adoption of artificial intelligence in banking services- an empirical analysis
No ratings yet
Adoption of artificial intelligence in banking services- an empirical analysis
31 pages
Business Intelligence
No ratings yet
Business Intelligence
18 pages