This text is about relational databases from the point of view of a software developer. It contains just remarks and not strict guidelines. Thoughts of what to do and avoid to do.
Table types
A database table may belong to one of the following categories, in regard to the nature of data and the number of rows it may have: Master, LookUp, Transaction and Correlation table.
- Master. A
CUSTOMER or MATERIAL table is considered to be a master table. A master table does not records historical data. It functions merely as a registry of main entities in an application. Other tables may have foreign keys to it. A master table usually has a lot of columns and many thousand rows. A master table may contain foreign keys to other master or lookup tables.
- Lookup. A
COUNTRY, STATE, MEASURE_UNIT or OCCUPATION table is considered to be a lookup table. A lookup table does not records historical data. It functions merely as a registry of secondary entities in an application. Other tables may have foreign keys to it. A lookup table usually has a few columns and, at most, a few hundred rows. Usually has ID and NAME or ID, CODE and NAME columns. A perfect lookup table contains no foreign keys to other tables.
- Transaction. A transaction table records transactions of master tables. In many cases it contains a
datetime column, e.g. ENTRY_DATE. Transaction tables sometimes are called historical tables or even trade tables. A transaction table can easily have millions of rows. An ORDERS or TRADE table is considered to be a transaction table. Transaction data very often require two or even more tables in a master-detail relationship forming a table tree. For instance ORDERS and ORDER_LINES, where the ORDERS, the master transaction table, contains information regarding the Customer, the Date of transaction etc. while the ORDER_LINES, the detail transaction table, records information regarding goods, quantities and prices and an ORDER_ID foreign key to the ORDERS table.
- Correlation. A correlation table correlates two, or even more, master tables. Usually a correlation table records only
IDs from those master tables an nothing more. A correlation table is used in recording one-to-one, one-to-many and many-to-many relationships. For instance a CAR and a DRIVER master table may require a CAR_DRIVER correlation table, having a CAR_ID and a DRIVER_ID foreign key columns to the respective tables. A car driver may have many cars in his responsibility.
Regarding business logic, a database application may logically divided into business modules. Each module, say Customer or Store or Sales module, uses a set of related tables. For instance an imaginary Sales module may use the CUSTOMER, MATERIAL, TRADE and TRADE_LINES tables.
Normalization
Normalization is a term used in database programming to describe the techniques involved in designing tables in order to minimize duplication of information.
For instance, in a CUSTOMER table it’s not wise to have an OCCUPATION_NAME string column. Instead you use an OCCUPATION_ID foreign key column pointing to the ID primary key column of the OCCUPATION table. The same stands true for an ORDER_LINES transaction table and a MATERIAL master table. The ORDER_LINES should have a MATERIAL_ID, pointing to MATERIAL table, and not a MATERIAL_NAME column.
In short, a bit of information is stored once and only once and in a certain table. Any other table that wants access to that bit of information it just maintains a reference to it, utilizing a foreign key column. That’s called normalization.
Normalization is not a panacea. It is used in a so called Production database or Operational database, that is a database used in the daily processing of transactions. A database used by an ERP system is such a database.
Normalization is not used in a Warehouse database where the primary use is for reporting, data analysis and data mining operations. That kind of databases contain non-normalized data.
You may check entries such as Database Normalization, Data Warehouse and Data Mining at wikipedia.
But regarding Normalization, please, don’t spend much time reading those texts, unless you’ re a student looking reference material for the next semester. Conventional wisdom is quite enough.
Have a COUNTRY table with just an ID, CODE and NAME. In all other tables where Country information is needed, a COUNTRY_ID foreign key referencing the COUNTRY.ID column is enough.
Primary keys
I follow the next rule:
Every table in a database has
- a single column as its Primary Key
- named ID (not COUNTRYID or countryid or country_id, just ID)
- of data type integer or GUID string
- it is unique (that is it uniquely identifies its row)
- it has no business meaning at all (which is very very important)
- it never changes or re-used
- and the application user almost never sees it.
A Primary Key should consist of a single column.
Avoid Composite Primary Keys, sometimes called Compound Keys, that is Primary Keys consisting of two or more columns. Single column Primary Keys are easier to maintain and locate in a search or filter.
There are cases though where a composite Primary Key may be used, as is the case of correlation tables. Even in this case though I find it easier to have a single column as Primary Key and use a Unique Constraint on the correlated columns in order to guarantee the uniqueness of their combination.
I have not yet found a case where a single column Primary Key is bad.
The data type of a Primary Key column may be an integer or a GUID string.
Prefer GUID strings although integers are easier to human eye. Having Primary Keys with GUID strings is easier to migrate data from one system to another.
Integer Primary Keys come in two forms:
- auto-increment columns, such as MS SQL
identity columns or MySql AUTO_INCREMENT columns.
- or unique number generators as in Interbase/Firebird and Oracle.
Primary Keys sometimes are called Object Identifiers or OIDs.
Triggers and Stored Procedures
No stored procedures.
No triggers.
Stored procedures and triggers tie an application to a certain RDBMS such as Oracle, MS SQL, etc. because each RDBMS has its own stored procedure and trigger language dialect.
Besides that, using stored procedures and triggers distributes and fragments the business logic code. Business logic code should be in one place. And this place is the application source code.
Searching the web for “are stored procedures faster than queries?” results in a lot of debates.
There is a rumor floating around that stored procedures are executed in a fraction of time than normal SQL statements do, and that’s because the database server pre-compiles the stored procedure code.
I’m not quite sure about that rumor, but I have to say that a decent database server should retain execution plans for all SQL statements in its internal cache, and not just for stored procedures. And if it doesn’t then I consider that as a major flaw.
I admit though that my opinion lacks the proper validity.
To force an RDBMS to generate and use an execution plan, in subsequent calls, use the exact same SQL statement more than once. The secret to achieve this is to use parameterized SQL statements. Something like “select * from CUSTOMER where ID = :ID“. From a database server point of view this statement looks much better than the usual “select * from CUSTOMER where ID = 1234“.
Every SQL statement needs an execution plan. The first time a SQL statement is executed, be it a stored procedure, a trigger, a SELECT, or any other statement, the RDBMS generates an execution plan. And then executes the statement. The next time it just re-uses the previously generated plan. Thus in next executions the statement is executed faster than the first time.
Views
All RDBMS I know off use the same Statement in creating a view.
create view VIEW_NAME as
select
...
from
...
The problem with the different RDBMS language dialect is not present here.
But again, as with stored procedures and triggers, using views distributes and fragments the business logic code. Business logic code should be in one place. And this place is the application source code.
Constraints and Referential Integrity
not null and unique constraints are necessary and should be part of the create table statement.
check constraints, e.g. check (Age >= 21), enforce a rule very different from a not null or unique rule. Prefer enforcing these check rules in the application code.
Referential Integrity dictates that in order to add a value in the CUSTOMER.COUNTRY_ID column, this value must exist in the COUNTRY.ID column.
The rule here is: a reference must be valid.
Declarative Referential Integrity is a term denoting that RDBMS is responsible in enforcing the rule of integrity.
This is done using a declaration, known as foreign key constraint, either when creating the table or later using an alter table statement.
create table Customer (
ID integer not null primary key,
COUNTRY_ID integer not null
...
foreign key(COUNTRY_ID) references COUNTRY(ID)
)
There is another way though in enforcing the integrity rule: use application code to enforce it.
Which is better depends on many factors of whom the most important is to know what you’re doing and why.
There are systems that do not use declarative referential integrity. Microsoft’s Navision/Business Central is one of them. Navision/Business Central enforces referential integrity using application code.
One thing is for sure: it is easier to migrate data from one system to another when there is no declarative referential integrity.
Indexes
An index helps an RDBMS to search, filter and sort the data faster. Without an index it has to do a full table scan when executing a query. So indexes are very important.
Deciding what to index, and what not, is an art and a science at the same time. What is good for an application it may be bad for another.
In deciding what to index there are no strict rules. Just some remarks and guidelines:
- avoid over indexing. Every
INSERT, UPDATE or DELETE statement forces the RDBMS not only to update the data, but the indexes too. Having many indexes will slow execution
- index data are updated whenever data are updated
- primary keys and unique constraints are maintained by RDBMS using an index
- indexes take part in the generation of an execution plan, so contradicting indexes may confuse the RDBMS optimizer
- prefer single-column indexes over multi-column indexes
- in multi-column indexes place first the columns with the fewer values, e.g.
MARRIED, NAME
- in an ERP or similar application, with a filtering UI, it is not possible to know what the user decides to place in the query filter
- learn about
index selectivity
- learn what
index statistics means to your RDBMS
- search the web for
SQL Indexing best practices
Maintenance and database health
Here are some remarks and guidelines
- Most RDBMS come with a number of maintenance and health check tools. Check what is available and how to use it.
- Reindex frequently. How frequently depends on the system.
- Check database health frequently
- Schedule frequent backups
- With RDBMS having a
Full Recovery Model, such as MS SQL, examine whether the Simple is better for the case
- Create and use maintenance time-scheduled jobs using tools such as Windows Task Scheduler or Linux Crontab.