Week 12 - Reading Materials 1 (1)
1 – INTRODUCTION TO DATABASES
This chapter discusses the basic concepts needed to understand and use simple
databases. While the spreadsheet’s power lies in its ability to analyze data and
relate values by creating formulas that reference other cells, a database
management system (DBMS) is designed to relate groups of information and to
store, retrieve, and manipulate that information in an efficient manner.
DEFINING A DATABASE:
A RELATIONAL DBMS
Each DBMS is based on a database model that defines the way the information
should be organized and accessed. The three most commonly used models are the
hierarchical, network, and relational. Of these the most flexible is the relational
model, which is what we will be discussing in this chapter.
Figure 1
A relational database may contain more than one table and these tables may
themselves be related to each other. The example in Figure 1 contains a second
table with information about each doctor. Notice that each doctor is identified by
a unique DoctorID which can be related to the DoctorID on the patient table in
each patient record.
In addition to tables, most modern DBMSs include other objects that allow
the user to store, retrieve, and manipulate data. In MS Access these objects include
the following:
Forms – structures for displaying data that allow a user to view information from
and input information in one or more objects (tables, queries, etc.).
Reports – structures for written output of data which again allow one to combine
information from one or more objects and view both details and summaries.
The diagram in Figure 2 represents part of an Order Entry and Inventory control
system. The system includes forms, queries, and reports for data entry and
retrieval. The tables store information regarding product inventory, vendors,
customers and orders.
Figure 2 (diagram labels: Forms, Program Modules, Tables)
The flow of information in the database for a typical order that might be phoned in
by a customer may be as follows:
1. The order entry clerk enters the order into an Order Transaction form.
2. Once the order is input, a predefined program module takes this data
and enters it into the appropriate tables (e.g., the Order Details table and the
Customer Accounts table).
3. A Daily Pick List report is printed for the forklift operator in the closest
warehouse, and a customer invoice report is printed to be included with the
shipment.
4. The inventory table is updated at this warehouse with this reduction in
quantity; if insufficient inventory remains, an order report is automatically
emailed to the supplier to order more inventory.
As you can see, a company’s supply chain system using DBMS software is an
extremely valuable tool in modern day business. This course will discuss some
simple database concepts as well as how to design and query a simple database.
The mechanics of setting up tables, reports, and forms are covered in the course
textbook. Setting up program modules/macros is beyond the scope of this course.
The basic component of an Access database is the table. All other objects are
based on the structure and data within the tables. Each table is organized into a
specified set of ordered categories, or fields. Figure 3 is part of a table named
Customers. The fields in the Customers table include SSN, First Name, Last Name,
Address, City, State, and Postal Code. Related information is input into the table as
records. Each record contains related values for each table field. For example,
the first record in this table contains Jane Doe’s SSN, her first name, her last
name, her address, her city, her state, and her postal code, in that order. Jane Doe’s
record does not contain John Black’s SSN or Mary Park’s postal code. In addition,
the third piece of data in any record in this table will always be the Last Name, as
the order of values in each record is the same as the order of the fields in the table.
Figure 3
In Access 2010, the Fields ribbon allows the user not only to view and input data
but also to add new fields and to specify field properties. Figure 4 illustrates the
Access window with the Accounts table open and the Fields ribbon visible. An
excellent overview of the Access interface is given in the course textbook at the
beginning of chapter 1.
Figure 4 (callouts: Fields ribbon; Views button; Navigation Pane, which lists
database objects; Record Selection buttons; View buttons)
FIELD PROPERTIES
Field size: The number of characters for text or the precision of numbers. e.g.,
numbers can be integer, long integer, single precision, double precision,
decimal, byte, etc.
Format: For numbers, the format specifies display properties such as currency
style, scientific notation, etc.
Figure 5
Specifying field properties is advantageous because it makes the table both easier
to use and more efficient. For example, each time a new record is created, Access
allocates memory (bytes) based on the field size specified. That is, an amount of
memory the size of the field is set aside, whether it is needed or not. If a specific
text field only requires three characters and the size is specified as 50, each record
will waste the space of 47 characters. If each character takes two bytes of memory,
94 bytes of storage space would be wasted per record. A large database with
100,000 records would be wasting 9.4 million bytes!
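The arithmetic above can be checked with a few lines of Python. This is only a sketch under the assumptions stated in the text (fixed-size allocation and two bytes per character); the function name wasted_bytes is hypothetical.

```python
# Hypothetical helper; mirrors the arithmetic in the paragraph above and
# assumes every record reserves the full field size at 2 bytes per character.
def wasted_bytes(field_size, chars_needed, bytes_per_char, record_count):
    """Return the bytes reserved but never used across all records."""
    unused_chars = field_size - chars_needed
    return unused_chars * bytes_per_char * record_count

per_record = wasted_bytes(field_size=50, chars_needed=3, bytes_per_char=2, record_count=1)
total = wasted_bytes(field_size=50, chars_needed=3, bytes_per_char=2, record_count=100_000)

print(per_record)  # 94
print(total)       # 9400000
```

Changing field_size to 3 in the calls above brings both results to zero, which is the point of choosing a field size that fits the data.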
A social security number consists of 9 digits. What field type would be best suited
to store this data? Using a number gives the user the ability to perform arithmetic
calculations, while text does not. Will it ever be necessary to perform arithmetic
calculations on these values? Probably not. So a number type is not needed, but
can it be used?
Consider the social security number 003278343. If this value is typed into a
Number field, what value will be stored? Try it and you’ll find that the value
3278343 is displayed – the leading zeros are discarded. Does that matter? In the
case of social security numbers, this matters greatly. The user will not want to
have to “add” the zeros to print out a person’s data in a report. If the value was
stored as text, the zeros would remain part of the data stored.
Thus the best field choice for a social security number is a text field. This same
logic applies to zip codes and even phone numbers.
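The same effect can be seen outside Access. The short Python sketch below is an illustration only (Python stands in for the Number and Text field types); it stores the example value once as a number and once as text.

```python
ssn_typed = "003278343"      # what the user types

as_number = int(ssn_typed)   # stored as a number: leading zeros are lost
as_text = ssn_typed          # stored as text: every character is kept

print(as_number)  # 3278343
print(as_text)    # 003278343

# To print the number correctly, the zeros would have to be "added" back.
padded = str(as_number).zfill(9)
print(padded)     # 003278343
```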
Imagine a large bank with over 100,000 accounts; can a person’s last name alone
be used to identify the contents of their bank account? Is it possible that two
customers have the same last name, or even that one customer has multiple
accounts? If such a customer made a deposit to their account how do we know
which account to use?
Obviously this is a very realistic situation and must be taken into consideration
when designing a database. To solve this problem, database designers include a
field in tables that uniquely identifies each record. The Customers table presented
in Figure 3 contains a unique social security number, SSN. Since no two people
have the same social security number, this can be used to uniquely identify a
person/record in the table. A field that uniquely identifies a record is referred to as
a primary key field. The primary key cannot be blank, nor can it contain any
duplicate values (that is, no two records can have the same value for the primary
key field).
Will SSN always be a good primary key field to use? Not necessarily; it will
depend on the situation. Would an SSN uniquely identify a bank account? If a
customer can have multiple accounts (e.g., one for savings and one for
checking) then the SSN is not a unique identifier. To solve this problem the bank
may use a unique account number, as seen in Figure 4.
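The primary key rules can be tried out in code. The sketch below uses SQLite, through Python's standard sqlite3 module, as a stand-in for Access; the field names AcctNum and AcctType and the second SSN are hypothetical and are not taken from the figures.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# AcctNum is the primary key: it may not be blank and may not repeat.
conn.execute(
    "CREATE TABLE Accounts ("
    "AcctNum TEXT PRIMARY KEY NOT NULL, SSN TEXT, AcctType TEXT)"
)

# One customer (one SSN) may hold two accounts with different numbers.
conn.execute("INSERT INTO Accounts VALUES ('256887', '003278343', 'savings')")
conn.execute("INSERT INTO Accounts VALUES ('654887', '003278343', 'checking')")

# A second record with an existing account number is refused.
try:
    conn.execute("INSERT INTO Accounts VALUES ('256887', '111223333', 'savings')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(duplicate_rejected)  # True
```

Had SSN been chosen as the primary key instead, the second insert would have been the one refused, which is why the account number is the better key for this table.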
A single-field primary key is not always necessary; not every table will have a
single-field primary key or any primary key at all. However, every relationship
between tables must have some field or combination of fields that uniquely
identifies records in one of the tables. Otherwise, for example, it could not be
established exactly which transaction goes with which account. When using a
combination of fields as a key, the additional fields are known as secondary,
tertiary, etc. fields. For example, a bank customer might be uniquely identified by
combining last name with first name, birth date, and phone number. In this course
we will use single primary key fields to uniquely identify records.
In documentation a table is normally listed by its name followed by an ordered list
of fields in parentheses. The primary key is underlined. The Accounts table would
be written in this notation as follows:
Tables are structured into records of related data organized into ordered fields.
Records are uniquely identified by a primary key field. By uniquely defining a field
such as an account number we can find data corresponding to that account
number that may reside in other tables, such as transactions made on that
account. In this section we will explore relating data between tables.
DEFINITION: FOREIGN KEYS
Figure 6
Figure 7 contains a modified version of the second table that includes the account
number instead of name. Can each deposit be uniquely matched to a single
account using the account number as the foreign key? If the account number is the
primary key of the first table, the transactions on the second table can be related to a
specific account. Thus, a foreign key must be a primary key on at least one of
the tables for a relationship to be valid.
Figure 7
This type of relationship where many values from one table (many deposits) can
match to a single value on the related table (one acct#) is referred to as a Many to
One relationship (or a One to Many relationship). An equally valid relationship
would be a One to One relationship where each record of one table corresponds to
at most one record in the second table. A One to One relationship occurs when
the foreign key is the primary key on both tables.
The following rules define what is required for a relationship between two tables to
be valid:
1. The foreign key must be a primary key on at least one of the tables.
2. The field types for the foreign key field must be the same on both tables.
3. The information being related must be the same.
The first requirement has already been discussed. What about the second
requirement: what does it mean that the field types must be the same? Consider the
foreign key in the previous example. What field type is the acct# field? The
designer had a choice of using Number or Text. Either would have worked, though
one may have been more efficient than the other. What matters is consistency
between the two tables. If the acct# field is an Integer number on the first table
and Text on the second table, Access will not be able to match the foreign key
records.
Why should it make a difference which data type is specified? After all, you can’t
see the type of the field when looking at the information in the datasheet view.
However, remember the data is being stored in memory of a computer as a series
of high/low level electrical charges that we express as zeros and ones. The text
representation for the digit 1 may be a series of 32 zeros and ones while the
Integer representation for the number 1 may be a series of 16 zeros and ones.
These two values are NOT EQUAL and thus the computer will not recognize the two
values as matching.
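A small Python sketch makes the mismatch concrete. It is an illustration only: the 2-byte integer layout and the UTF-16 text encoding are assumptions chosen for the example, not a description of how Access stores its fields.

```python
acct_as_number = 1234    # account number held in a Number field
acct_as_text = "1234"    # the same digits held in a Text field

# The two values look alike on screen but do not compare as equal.
print(acct_as_number == acct_as_text)  # False

# Assumed layouts: a 2-byte integer versus 2 bytes per text character.
number_bytes = acct_as_number.to_bytes(2, "big")
text_bytes = acct_as_text.encode("utf-16-le")
print(len(number_bytes), len(text_bytes))  # 2 8

# The values match only after one is converted to the other's type.
print(int(acct_as_text) == acct_as_number)  # True
```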
What about the third requirement for a valid foreign key, what does “information
being related must be the same” mean? Consider two tables, where each has
account numbers. Table 1 contains the account numbers at First City Bank and
Table 2 contains the account numbers at Union Trust bank. While the fields may have
the same name and type, there is no relationship between account 1234 at First
City Bank and account 1234 at Union Trust bank. Relating the records between
these two tables based on account number would be meaningless.
In the following example another table has been added to the database to keep
track of withdrawals. The schema of the table is Withdrawals(Acct#, Amount). Can
the Withdrawals table be related to the Deposits table from the previous example?
Deposits                 Withdrawals
Acct#    Amount          Acct#    Amount
256887   $50       ?     256777   $25
256887   $75             656887   $100
654887   $32             256887   $25
Figure 8
The fields Acct# (table Deposits) and Acct# (table Withdrawals) both represent a
customer’s account and we can assume they both are specified as the same data
type. Yet in neither table is this field a primary key. There may be many instances of
an account number on the deposits table (e.g., 256887) and many instances of that
same account number on the withdrawals table. However, a deposit does not
correspond to a withdrawal (and vice versa). Thus, the account number would not
be a valid foreign key. This type of relationship is referred to as a Many to Many
relationship.
Can we relate the tables using the Amount fields? While the Amount fields have
the same name, they represent different information. On the deposits table
Amount is the money into an account. On the withdrawals table Amount is the
money out of an account; it does not make sense to relate these fields. In addition,
it is possible that two transactions contain the same value in the Amount field, so
neither of these fields is a primary key.
But certainly it makes sense that somehow the deposits into an account are related
to the withdrawals from that account. To solve this dilemma databases are
designed with intermediate tables, in this case the Accounts table. In the Accounts
table the account number is the primary key and can be related to both the
Deposits table and the Withdrawals table, as seen in Figure 9. This changes the
relationships so that there are now two Many to One relationships. The
relationship among three such tables is referred to as a Many to One to Many
relationship.
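The Many to One to Many arrangement can be sketched with the data from Figure 8. As before, SQLite (via Python's sqlite3 module) stands in for Access, and the field name AcctNum is hypothetical. The sketch also assumes that every account number appearing in Figure 8 has a record in the Accounts table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Accounts    (AcctNum TEXT PRIMARY KEY NOT NULL);
    CREATE TABLE Deposits    (AcctNum TEXT REFERENCES Accounts(AcctNum), Amount INTEGER);
    CREATE TABLE Withdrawals (AcctNum TEXT REFERENCES Accounts(AcctNum), Amount INTEGER);
""")

deposits = [("256887", 50), ("256887", 75), ("654887", 32)]
withdrawals = [("256777", 25), ("656887", 100), ("256887", 25)]
# Assumption: one Accounts record per account number seen in Figure 8.
accounts = sorted({acct for acct, _ in deposits + withdrawals})

conn.executemany("INSERT INTO Accounts VALUES (?)", [(a,) for a in accounts])
conn.executemany("INSERT INTO Deposits VALUES (?, ?)", deposits)
conn.executemany("INSERT INTO Withdrawals VALUES (?, ?)", withdrawals)

# Each account (the "one" side) is related separately to its many deposits
# and to its many withdrawals.
summary = conn.execute("""
    SELECT a.AcctNum,
           (SELECT COALESCE(SUM(d.Amount), 0) FROM Deposits d
             WHERE d.AcctNum = a.AcctNum),
           (SELECT COALESCE(SUM(w.Amount), 0) FROM Withdrawals w
             WHERE w.AcctNum = a.AcctNum)
    FROM Accounts a
    ORDER BY a.AcctNum
""").fetchall()

for acct, total_in, total_out in summary:
    print(acct, total_in, total_out)
```

Account 256887, for example, shows $125 in deposits and $25 in withdrawals, a result that could not be obtained reliably by relating Deposits directly to Withdrawals.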
Figure 9
Figure 10
Figure 11
When the relationships view is first launched in a new database it will be blank.
Each table must be individually added from the Show Table box, as seen in Figure
12. To open the Show Table dialog box, right-click anywhere in the relationships
window. Once the box is open, click on the Tables tab (or Query tab for a query),
select the name of the table to be added, and then click the Add button (or
double-click on the table name). Repeat the process to add additional tables.
Figure 13 illustrates the Relationship window with the Accounts and Transactions
tables added but not yet related. To relate tables follow these three steps:
One of the main reasons for using DBMS software is the ability to quickly and
easily locate specific data or sets of data. Within an Access table, the Filter and
Sort tools serve this purpose. To understand how a
database finds records, we will also briefly explore the concept of search routines
and indexing.
The filter tool can be used in the datasheet view to display selected records of a
table. A filter allows us to specify criteria in a field or fields and show only those
records that meet the criteria. Using the filter tool we can list only those people
who live in Columbus, those people whose last name is Jones, or even only people
whose last name is Jones and live in Columbus. The mechanics of setting up filters
from the datasheet view of a table can be found in any of the step-by-step
instructions in the course text. The Filter tools can be found in the Sort & Filter
group of the Home tab (Figure 14). They include:
This ability to find records that meet a specific set of criteria will be greatly
extended in the next chapter using a database query.
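The idea behind a filter can be sketched in Python. The records below are hypothetical (only some of the names come from the Customers example), and each filter simply keeps the records that meet the stated criteria.

```python
# Hypothetical records; the cities are invented for illustration.
customers = [
    {"first": "Jane", "last": "Doe",   "city": "Columbus"},
    {"first": "John", "last": "Black", "city": "Dayton"},
    {"first": "Mary", "last": "Park",  "city": "Columbus"},
    {"first": "Ann",  "last": "Jones", "city": "Columbus"},
    {"first": "Bob",  "last": "Jones", "city": "Toledo"},
]

# One criterion on one field.
in_columbus = [c for c in customers if c["city"] == "Columbus"]
named_jones = [c for c in customers if c["last"] == "Jones"]

# Criteria on two fields at once.
jones_in_columbus = [
    c for c in customers if c["last"] == "Jones" and c["city"] == "Columbus"
]

print(len(in_columbus), len(named_jones), len(jones_in_columbus))  # 3 2 1
```

As with a filter in the datasheet view, the original list is untouched; each result is only a selected view of it.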
SORTING TABLES
From the Datasheet Table View in Access, tables can also be sorted. Using the
sort tool, select a field and a sort type (ascending or descending). The records will
be temporarily rearranged based on this order. Sorts can be performed by
clicking on the field to be sorted and then selecting either the ascending or
descending sort buttons in the Sort & Filter group of the Home tab. The buttons
for sort-ascending and sort-descending look like this:
If the table is not saved using the Save button, the table will revert back to the
original record order when reopened. Sorting is an efficient tool for helping to
retrieve specific records. More advanced sorts using multiple sort keys can be
done using a query.
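Sorting can be sketched the same way. In the Python example below (an illustration with names from the Customers example), the sort key is the last name, and the original order of the records is left unchanged, much like an unsaved sort in the datasheet view.

```python
# (last name, first name) pairs in their original, unsorted order.
customers = [
    ("Doe", "Jane"),
    ("Black", "John"),
    ("Park", "Mary"),
]

# Sort on the Last Name field, ascending and descending.
ascending = sorted(customers, key=lambda rec: rec[0])
descending = sorted(customers, key=lambda rec: rec[0], reverse=True)

print([last for last, _ in ascending])   # ['Black', 'Doe', 'Park']
print([last for last, _ in descending])  # ['Park', 'Doe', 'Black']
print(customers[0])                      # ('Doe', 'Jane')
```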
INDEXING TABLES
A DBMS can also index your files to create a
cross reference to the table records based on a specific sorting method. Since data
is usually stored on magnetic disks in a linear fashion, similar to music on a tape,
file indexing combined with search schemes makes it more efficient for the
computer to retrieve records, especially in databases with millions of records.
Several indices can be set up for the same table, allowing for efficient searching on
a variety of fields. For example, the bank can search by account number or by last
name, depending on the information the customer has provided. These searches
can always be done whether or not a table is indexed, but searches over a large
number of records are more efficient when using indices.
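One simple way to picture an index is as a sorted list of (field value, record position) pairs kept alongside the table. The Python sketch below follows that picture; it is not how Access implements indices, and the names build_index and lookup, like the sample records, are hypothetical.

```python
from bisect import bisect_left

# The table itself stays in its original (unsorted) order.
accounts = [
    {"acct": "256887", "last": "Doe"},
    {"acct": "654887", "last": "Black"},
    {"acct": "256777", "last": "Doe"},
]

def build_index(table, field):
    """Return (value, position) pairs sorted by value: a cross reference."""
    return sorted((row[field], pos) for pos, row in enumerate(table))

def lookup(index, key):
    """Return the positions of all records whose indexed field equals key."""
    i = bisect_left(index, (key,))  # jump to the first entry for this key
    positions = []
    while i < len(index) and index[i][0] == key:
        positions.append(index[i][1])
        i += 1
    return positions

# Two indices on the same table, one per search field.
by_acct = build_index(accounts, "acct")
by_last = build_index(accounts, "last")

print(lookup(by_acct, "654887"))  # [1]
print(lookup(by_last, "Doe"))     # [0, 2]
print(lookup(by_last, "Park"))    # []
```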
As previously mentioned, one of the reasons to sort tables is to allow for more
efficient data retrieval. For example, imagine a dictionary that had words listed
randomly. In order to find a specific word one needs to systematically go through
each word, one by one, until the desired word is found. This is known as a linear
search. On average, a linear search will have to look at about half of the items in a
table in order to find a specific piece of data: it might be the first
word in this randomly organized dictionary, but then again it might be the last.
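A linear search is short enough to write out. The Python sketch below (the word list and the name linear_search are hypothetical) also counts how many items were examined, so the best and worst cases described above can be seen directly.

```python
def linear_search(items, target):
    """Return (position, items examined); position is None if not found."""
    for examined, item in enumerate(items, start=1):
        if item == target:
            return examined - 1, examined
    return None, len(items)

# A tiny "dictionary" with its words in random order.
words = ["pear", "apple", "kiwi", "fig", "plum"]

print(linear_search(words, "pear"))   # (0, 1)    best case: first word
print(linear_search(words, "plum"))   # (4, 5)    worst case: last word
print(linear_search(words, "grape"))  # (None, 5) not in the list

# Average number of items examined over every word in the list.
average = sum(linear_search(words, w)[1] for w in words) / len(words)
print(average)  # 3.0
```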
Why is efficient data handling so important? First let us understand how data
is stored and retrieved by a DBMS like MS Access. Recall that when working with
an Excel spreadsheet, the entire file is loaded from the disk drive onto the
computer’s RAM (random access memory). So working with spreadsheet data is
usually extremely fast, but the size of a file is limited by the RAM of the computer.
In fact, one notices a significant slowdown in operation speed as the workbook file
increases in size.
In contrast, most relational databases do not load all of the tables, queries, reports,
etc. directly into RAM. They load only the table of contents of the objects. To
process information from one or more objects, just those objects are loaded into
RAM. Thus, a DBMS can handle much larger quantities of data. In fact, many
large database systems have millions of records.
When running a DBMS, the computer is not just processing information but
continually retrieving and writing data to and from secondary memory (usually a
magnetic hard drive). While a computer’s RAM can process information at very
fast speeds, searching for specific information on a disk drive and retrieving and/or
writing to the drive is a much more time consuming process. If a file is stored in
random order, as with the un-alphabetized dictionary, it will require the computer
to look at the disk many more times to find the information that we want than if the
file was sorted. Consequently, computer scientists are interested in how to search
for information more efficiently.
There are many different search schemes that can be used with indexed files to
speed up retrieval of information. Most of us are familiar with the
alphabetical sort routine that divides textual information into 26 groups based on
the first letter of each item, and then further subdivides each group by the second
letter, etc. The search routine to retrieve information from an alphabetical list,
such as a dictionary, is to identify the first letter of the text and match it to the
correct group and then continue doing this with the second letter and so on until a
match has been identified.
A similarly efficient scheme which can be used with numerical data is known as a
binary search routine. A binary search routine is much more efficient than a
linear search in finding information. In Figure 15 records have been sorted by the
indexed field, ID#, in ascending order. To find the record for ID# 606147775 using
a binary search routine the computer would do the following:
Figure 15
1. Find the record at the midpoint of the list and compare its value with the
search value. If they match, the record has been found.
2. If the value is greater than the middle record, ignore all records from the
beginning of the list to the midpoint. The new list will contain only records
from the midpoint until the end of the list. Set a new midpoint for this new
list and begin again at step 1.
3. If the value is less than the middle record, ignore all records from the midpoint
to the end of the list. The new list will contain only records from the beginning
of the list until the midpoint. Set a new midpoint for this new list and begin
again at step 1.
4. This process will continue until a match is found or all records have been
searched (in which case the value does not appear in the list).
The midpoint of this list is 606147775. Since this midpoint now matches our
search value the desired record has been found.
This search only looked at three different records in the table. A linear search on
average would have looked at 19/2 or 9.5 records. This is a significant
improvement.
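The binary search routine can be sketched in Python as follows. The 19 ID numbers are hypothetical stand-ins for the records in Figure 15; only 606147775 comes from the text. This version uses the common formulation in which the midpoint itself is dropped from the new list once it has been compared.

```python
def binary_search(sorted_ids, target):
    """Return (position, records examined); position is None if not found."""
    low, high = 0, len(sorted_ids) - 1
    examined = 0
    while low <= high:
        mid = (low + high) // 2          # midpoint of the current list
        examined += 1
        if sorted_ids[mid] == target:
            return mid, examined
        if target < sorted_ids[mid]:
            high = mid - 1               # keep only the lower half
        else:
            low = mid + 1                # keep only the upper half
    return None, examined

# Hypothetical ID numbers, already sorted in ascending order (19 records).
ids = [
    101223344, 145667788, 202334455, 256887123, 301445566,
    345778899, 402556677, 456889900, 501667788, 545990011,
    590112233, 606147775, 650334455, 701556677, 745778899,
    802990011, 856112233, 901334455, 945556677,
]

print(binary_search(ids, 606147775))  # (11, 3)
```

With this particular list the record is found on the third look, matching the count in the text; a different arrangement of IDs could need up to five looks for 19 records.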
To illustrate the significant difference between linear and binary search routines,
consider a situation where instead of 19 records, the list had a million records. If
the list isn’t sorted by the value we’re searching for, in the worst case we would
have to look at all one million records. If the list is sorted and we use a binary
search, we would only have to look at about twenty, because each look halves the
remaining list and 2^20 is just over one million. If the list had 10 million
records and we could use a binary search, the worst case is still only about
twenty-four records.
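The worst case for a binary search can be counted by repeatedly halving the list until nothing is left. The Python sketch below does this for the standard form of the algorithm, in which the midpoint is discarded after each comparison; the function name is hypothetical.

```python
def worst_case_binary_looks(record_count):
    """Count how many times a list of this size can be halved before it is empty."""
    looks = 0
    while record_count > 0:
        looks += 1
        record_count //= 2   # each look discards at least half of the list
    return looks

for size in (19, 1_000_000, 10_000_000):
    print(size, worst_case_binary_looks(size))
# 19 5
# 1000000 20
# 10000000 24
```

A tenfold increase in the number of records adds only a handful of looks, whereas the worst case for a linear search grows tenfold.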
A binary search is only one of many different methods computer scientists use to
improve the efficiency of retrieving data. There are computer science courses
devoted solely to this topic. This discussion is only meant to provide you an
appreciation of the processes involved and an insight into the importance and
complexity of the topic.
The design of a complex database management system can take weeks, months or
even years to complete, involving thousands of man-hours of effort by a team of
computer scientists and management. You may someday be part of one of these
teams, or you may just be trying to create a small database to keep track of a guest
list for a large party. Regardless of the size and complexity of your database, there
are several things you must consider before creating a database. As with a
spreadsheet, the critical step in designing an effective database is to plan it. Think
about the following:
What data objects are present? Customers and account transactions are each
table objects in our sample database.
How is the data related? In our sample database, we have related these objects
by a foreign key field (SSN).
What information will be generated from the data? Will we need to design
queries and/or reports to list all accounts for owners who live in Columbus,
or summarize transactions by account?
When setting up even the simplest of tables there are several factors to consider:
Obviously this list can be greatly expanded. In fact there are both undergraduate
and graduate courses devoted to learning how to design, build and maintain
databases. But the list should give you some idea as to the types of things that
need to be considered.