Fundamentals of Relational Database Design
Outline
Definitions
Selecting a DBMS
Selecting an application layer
Relational Design
Planning
A very few words about Replication
Space
Definitions
Instance: A database instance, or simply an instance, is made up of the background processes needed by the database software. These processes usually include a process monitor, session monitor, lock monitor, etc. They vary from database vendor to database vendor.
A SCHEMA IS NOT A DATABASE, AND A DATABASE IS NOT A SCHEMA.
A database instance controls 0 or more databases.
A database contains 0 or more database application schemas.
A database application schema is the set of database objects that apply to a specific application. These objects are relational in nature and are related to each other within a database to serve a specific function, for example payroll, purchasing, calibration, trigger, etc.
A database application schema is not a database. Usually several schemas coexist in a database.
A database application is the code base that manipulates and retrieves the data stored in the database application schema.
Table: a set of columns that contain data. In the old days, a table was called a file.
Row: the set of column values in a table that make up one record.
Index: an object that allows for fast retrieval of table rows. Every primary key and foreign key should have an index for retrieval speed.
Primary key, often designated pk: 1 or more columns in a table that make each record unique.
Foreign key, often designated fk: a column common to 2 tables that defines the relationship between those 2 tables. Foreign keys are either mandatory or optional. Mandatory forces a child to have a parent by creating a not null column in the child table. Optional allows a child to exist without a parent by allowing a nullable column in the child table (not a common circumstance).
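The mandatory and optional foreign keys described above can be sketched as follows. This is a minimal illustration using Python's built-in sqlite3 module rather than any database named in these slides; the table names (parent, mandatory_child, optional_child) and the helper try_insert are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE parent (
    parent_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL
);
-- Mandatory relationship: the fk column is NOT NULL, so every child has a parent.
CREATE TABLE mandatory_child (
    child_id  INTEGER PRIMARY KEY,
    parent_id INTEGER NOT NULL REFERENCES parent (parent_id)
);
-- Optional relationship: the fk column is nullable, so a child may have no parent.
CREATE TABLE optional_child (
    child_id  INTEGER PRIMARY KEY,
    parent_id INTEGER REFERENCES parent (parent_id)
);
-- Index the foreign keys for retrieval speed.
CREATE INDEX mandatory_child_fk_idx ON mandatory_child (parent_id);
CREATE INDEX optional_child_fk_idx ON optional_child (parent_id);
""")


def try_insert(sql):
    """Return True if the insert is accepted, False if a constraint rejects it."""
    try:
        conn.execute(sql)
        return True
    except sqlite3.IntegrityError:
        return False


conn.execute("INSERT INTO parent (parent_id, name) VALUES (1, 'payroll')")

# A parentless child is rejected when the relationship is mandatory...
orphan_in_mandatory = try_insert("INSERT INTO mandatory_child (parent_id) VALUES (NULL)")
# ...and accepted when the relationship is optional.
orphan_in_optional = try_insert("INSERT INTO optional_child (parent_id) VALUES (NULL)")
# Either way, a child may not point at a parent that does not exist.
dangling_reference = try_insert("INSERT INTO optional_child (parent_id) VALUES (999)")
```

Here orphan_in_mandatory and dangling_reference come back False, while orphan_in_optional comes back True.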
ER Example
[Figure: ER diagram. Entity STATUS with attributes # STAT_ID, o STATUS_NAME, * CREATE_DATE, * CREATE_USER, ... Relationship labels: creates, describes.]
A view is a selective presentation of the structure of, and data in, one or more tables (or other views). A view is a virtual table, having predefined columns and joins to one or more tables, reflecting a specific facet of information.
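As a rough sketch of a view as a virtual table with predefined columns and joins, the following uses Python's built-in sqlite3 module. The status table borrows column names from the ER example; the run table, the sample rows, and the view name good_runs are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE status (
    stat_id     INTEGER PRIMARY KEY,
    status_name TEXT
);
CREATE TABLE run (
    run_id      INTEGER PRIMARY KEY,
    stat_id     INTEGER NOT NULL REFERENCES status (stat_id),
    create_user TEXT NOT NULL
);
INSERT INTO status (stat_id, status_name) VALUES (1, 'good'), (2, 'bad');
INSERT INTO run (run_id, stat_id, create_user)
    VALUES (10, 1, 'alice'), (11, 2, 'bob'), (12, 1, 'carol');

-- The view stores no data of its own: it is a saved join plus a filter.
CREATE VIEW good_runs AS
    SELECT r.run_id, r.create_user, s.status_name
    FROM run r
    JOIN status s ON s.stat_id = r.stat_id
    WHERE s.status_name = 'good';
""")

# Users query the view exactly as they would a table.
good = conn.execute(
    "SELECT run_id, create_user FROM good_runs ORDER BY run_id"
).fetchall()
```

With the sample rows above, good holds the two runs whose status is 'good' (run_id 10 and 12).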
Definitions Cont.
Mission Critical Applications
An application is defined as mission critical, IMHO, if:
1. There are legal implications or financial loss to the institution if the data is lost or unavailable.
2. There are safety issues if the data is lost or unavailable.
3. No data loss can be tolerated.
4. Uptime must be maximized (98%+).
Definitions Cont.
Large, very large, or a lot
It seems odd, but "large" is hard to define. VLDB is an acronym for very large database. Its definition varies depending on the database software one selects. Very large normally indicates data that is reaching the capacity limits of the database software, or data that requires extraordinary measures for operations such as backup, recovery, storage, etc.
Definitions Cont.
Commercial databases do not have a practical limit on the size of the load. The issues will be backup strategies for large databases. Freeware does limit the size of the databases and the number of users. Documentation on these issues varies widely, from the freeware sites to the user sites. MySQL supposedly can support 8 TB and 100 users. However, you will find arguments on the user lists that these numbers cannot be met.
Selecting a DBMS
Many options, many decisions: planning, costs, criticality. For lots of good information, please refer to the URLs on the last slides, which include many examples of people choosing a product.
Direct access to the database layer? (Probably should be avoided.)
Are you replicating? How? Where? With what?
There are no utilities that will port data from 1 database to another (e.g., Postgres to MySQL). If database portability is a requirement, independent code must be written to satisfy it.
Application maintenance issues
People: availability, working with users as a team, talent, and turnover? (Historically a huge issue.)
A known or common language?
Freeware? Bug fixes and patches: are they important and timely? Documentation?
Set standards, procedures, and code reviews, making sure the documentation exists and is clear.
Is the application flexible enough to easily accommodate business rule changes that mandate modifications?
The availability of an ER diagram at this stage is invaluable. We consider it a must-have.
There are no utilities to port data from 1 type of db to another.
Relational Design
The design of the application schema will determine the usability and query ability of the application. Done incorrectly, the application and users will suffer until someone else is forced to rewrite it.
CPU (d0ora2)
A database can accommodate 1 or more instances.
What is a schema?
It is:
Tables (columns/datatypes) having constraints (not null, unique, foreign & primary keys)
Triggers
Indexes, etc.
Accounts
Privileges & roles
Server-side processes
It is not:
The environment (servers, OS)
The results of queries, i.e., objects
Application code
One implements a schema by running scripts. These scripts can be run against multiple servers and should be archived.
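A minimal sketch of implementing a schema from a single script and running it against several servers, using Python's built-in sqlite3 module with in-memory databases standing in for real servers. The script contents, the server names, and the helper functions are hypothetical; in practice the script would be a file kept in the code library.

```python
import sqlite3

# Hypothetical schema script; in practice an archived, versioned file.
SCHEMA_SCRIPT = """
CREATE TABLE status (
    stat_id     INTEGER PRIMARY KEY,
    status_name TEXT,
    create_date TEXT NOT NULL,
    create_user TEXT NOT NULL
);
CREATE INDEX status_name_idx ON status (status_name);
"""


def implement_schema(conn):
    """Run the schema script against one database connection."""
    conn.executescript(SCHEMA_SCRIPT)


def schema_objects(conn):
    """List the (type, name) of every object the script created."""
    rows = conn.execute(
        "SELECT type, name FROM sqlite_master "
        "WHERE name NOT LIKE 'sqlite_%' ORDER BY type, name"
    )
    return rows.fetchall()


# In-memory databases stand in for separate servers.
servers = {
    name: sqlite3.connect(":memory:")
    for name in ("development", "integration", "production")
}
for conn in servers.values():
    implement_schema(conn)

# Because the same script ran everywhere, every server ends up identical.
inventories = {name: schema_objects(conn) for name, conn in servers.items()}
```

Since the same script produces the same objects on every server, comparing the inventories is a cheap check that development, integration, and production have not drifted apart.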
Do not design your schema around your favorite query. A relational design will enable all queries to be speedy, not only your favorite.
Don't design the schema around your narrow view of the application. Get other users involved from the start; ask for input and review.
Create a relational structure, not a hierarchical structure. The ER diagram does not necessarily need to resemble a tree or a circle. It is the logical building of relationships between data. Relationships flow between subsets of data. The look of the resulting ER diagram is not a standard by which one can judge the quality of the design.
Do not create 1 huge table to hold 99% of the data. We have seen a table with 1100+ columns: unusable, unqueryable. It required an entire application rewrite that took over a year and made 80 tables from the 1 table.
Do not create separate schemas for the same application or for functions within an application.
Use indices and constraints. This is a MUST!
Using a timestamp as the primary key assumes that no other record will be inserted within the same second. In practice this was not the case, and an insert operation failed. Use database-generated sequences as primary keys, with a NON-UNIQUE index on the timestamp.
A table with more than 900 columns: such a design causes chaining, since a record will not fit in one block. One record spans many blocks, hence chaining, hence bad performance.
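The timestamp-key failure and the recommended fix can be sketched as follows, using Python's built-in sqlite3 module. SQLite's AUTOINCREMENT stands in for a database sequence here, and the table names (event_by_time, event_by_seq) are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Risky design: the timestamp itself is the primary key.
CREATE TABLE event_by_time (
    tstamp  TEXT PRIMARY KEY,
    payload TEXT
);
-- Recommended design: a database-generated key plus a NON-UNIQUE timestamp index.
CREATE TABLE event_by_seq (
    event_id INTEGER PRIMARY KEY AUTOINCREMENT,
    tstamp   TEXT NOT NULL,
    payload  TEXT
);
CREATE INDEX event_by_seq_tstamp_idx ON event_by_seq (tstamp);
""")

same_second = "2004-02-01 12:00:00"

# Two records arriving within the same second collide on the timestamp key.
conn.execute("INSERT INTO event_by_time VALUES (?, ?)", (same_second, "first"))
try:
    conn.execute("INSERT INTO event_by_time VALUES (?, ?)", (same_second, "second"))
    timestamp_key_collided = False
except sqlite3.IntegrityError:
    timestamp_key_collided = True

# With a generated key, both records are stored and remain searchable by time.
for payload in ("first", "second"):
    conn.execute(
        "INSERT INTO event_by_seq (tstamp, payload) VALUES (?, ?)",
        (same_second, payload),
    )
generated_ids = [
    row[0] for row in conn.execute("SELECT event_id FROM event_by_seq ORDER BY event_id")
]
```

The second insert into event_by_time is rejected, while event_by_seq accepts both rows and hands out distinct keys.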
Do not let the application control a generated sequence. We have seen locking issues and duplicate-value issues when the application increments the sequence. Have the database increment, lock, and constrain the sequence/primary key. That is why databases have sequence mechanisms; use them.
Use indices! An Atlas table with 200,000 rows halted during a query. The reason? No indices. After a primary key index was added, query response was instantaneous. Indices are not wasted space!
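To illustrate why a missing index hurts, the sketch below asks SQLite (via Python's built-in sqlite3 module) for its query plan before and after an index is added. The hits table, its columns, and the index name are hypothetical, and the exact wording of the plan text differs between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hits (hit_id INTEGER, channel INTEGER, value REAL)")
conn.executemany(
    "INSERT INTO hits VALUES (?, ?, ?)",
    ((i, i % 100, i * 0.5) for i in range(1000)),
)


def plan(sql):
    """Return the human-readable detail column of SQLite's query plan."""
    return " ".join(str(row[-1]) for row in conn.execute("EXPLAIN QUERY PLAN " + sql))


query = "SELECT * FROM hits WHERE hit_id = 42"

# Without an index, the database has to scan every row of the table.
plan_before = plan(query)

# After adding a unique index on the key column, it can seek straight to the row.
conn.execute("CREATE UNIQUE INDEX hits_pk_idx ON hits (hit_id)")
plan_after = plan(query)
```

plan_before reports a full table scan, while plan_after names hits_pk_idx; on a table with hundreds of thousands of rows that is the difference between a stalled query and an immediate answer.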
[Figure: ER diagram. Entities: CHILD (# CHILD_ID), A (# A_ID), B (# B_ID), C (# C_ID), D (# D_ID), E (# E_ID), F (# F_ID).]
[Figure: ER diagram. Entities: H (# H_ID), G2 (# G2_ID), G2H2, H2 (# H2_ID), I (# I_ID), J (# J_ID), I2 (# I2_ID), I2J2, J2 (# J2_ID). Relationship labels: define, owned by, map to, relate to.]
[Figure: ER diagram. Entities: L (# L_ID), M (# M_ID), N (# N_ID), O (# O_ID), P (# P_ID). Relationship labels: define, relate to.]
CALIB_TYPE
# CALIB_TYPE_ID * DESCRIPTION
Calibration type might have 3 rows: drift, pedestal, and gain. This is a parent table.
Each calibration record will be defined by drift, pedestal, or gain, in addition to start and end times. This is a child table.
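The parent/child pair described above can be sketched as follows, using Python's built-in sqlite3 module. The column names CALIB_TYPE_ID, DESCRIPTION, CALIBRATION_ID, TSTART, and TEND come from the diagrams; the foreign key column on the child, the index name, and the sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces fks only when enabled
conn.executescript("""
-- Parent: one row per kind of calibration.
CREATE TABLE calib_type (
    calib_type_id INTEGER PRIMARY KEY,
    description   TEXT NOT NULL
);
-- Child: one table for ALL calibration records, whatever their type.
CREATE TABLE calibration (
    calibration_id INTEGER PRIMARY KEY,
    calib_type_id  INTEGER NOT NULL REFERENCES calib_type (calib_type_id),
    tstart         TEXT NOT NULL,
    tend           TEXT
);
CREATE INDEX calibration_type_idx ON calibration (calib_type_id);

INSERT INTO calib_type (calib_type_id, description)
    VALUES (1, 'drift'), (2, 'pedestal'), (3, 'gain');
INSERT INTO calibration (calib_type_id, tstart, tend)
    VALUES (1, '2004-01-01', '2004-01-31'),
           (2, '2004-01-01', NULL),
           (3, '2004-02-01', NULL);
""")

# One query covers every calibration type; no per-type tables or per-type code.
per_type = conn.execute("""
    SELECT t.description, COUNT(*)
    FROM calibration c
    JOIN calib_type t ON t.calib_type_id = c.calib_type_id
    GROUP BY t.description
    ORDER BY t.description
""").fetchall()
```

Adding a fourth kind of calibration in this design is a single new row in calib_type, not a new table with its own code to write, test, and maintain.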
[Figure: ER diagram. Entity PEDESTAL_CALIB (# PEDESTAL_CALIB_ID, * TSTART, o TEND). Relationship labels: define, relate to.]
You have now created 3 different children, all reporting the same information, when 1 child would suffice. Code will now have to be written, tested, and maintained for 4 tables instead of 2.
[Figure: ER diagram. Entities: CALIBRATION(2) (# CALIBRATION_ID, * TSTART, o TEND), CALIBRATION(3) (# CALIBRATION_ID, * TSTART, o TEND), PEDESTAL_CALIB (# PEDESTAL_CALIB_ID, * TSTART, o TEND), DRIFT_CALIB (# DRIFT_CALIB_ID, * TSTART, o TEND), GAIN_CALIB (# GAIN_CALIB_ID, * TSTART, o TEND). Relationship labels: defines, relate to.]
Now you have created 3 different applications, using 6 tables. All of which could be managed with 2 tables. Extra code, extra testing, extra maintenance.
[Figure: entity CALIB_TYPE (# CALIB_TYPE_ID, * DESCRIPTION).]
An entity relationship diagram
The ability to create the DDL (data definition language) needed
The ability to project disk space usage
DDL in a format that allows you to enter the code into a code library (CVS) and to run it against your database
Planning Overall
What do I need to plan for? People, hardware, software, obsolescence, maintenance, emergencies.
How far out do I need to plan? Initially 2-4 years.
How often do I need to review the plans? Annually.
What if my plan fails or looks undoable? Nip it in the bud, be proactive, come up with options.
Planning Overall
Disk space requirements. In my experience, all the WAGs (wild guesses) fall short of what is needed. It is hard to predict the number of rows in a table. It would be easier if we knew the amount and results of the science ahead of time! Remember: 10x what you think the data will take.
Hardware requirements. Experience tells us that the database machine should serve 1 master, the database, and nothing else (if it is a large or mission-critical database). Ideally the only accounts will be root, a database monitor user, and a database user (oracle, for example). No Apache, no log file areas, no applications, etc.
Planning Overall
Growth and obsolescence. Plan for 3-4 years before needing to replace hardware. Hardware and software become obsolete. New or upgraded software gives additional functionality that you will want or need.
Maintenance. Do you change the oil in your car? Plan on 1 morning per month of downtime for care of the hardware and software. Security patches could mandate additional stoppages. I cannot stress enough how important this is. Firewalling will not protect you from bugs and obsolescence. If the downtime is not needed, it will not be taken. Planning maintenance time is as important as planning to buy disks.
Planning Maintenance
Database and operating system software need upgrades. One always hopes to get onto a stable version of something and never upgrade. That is a fallacy. Major version upgrades provide needed and new functionality. Bug patches and security patches are a never-ending fact of life.
Planning Failover
Yikes, we are down! Everyone always wants 24x7 scheduled uptime, until they see the cost. Make anyone who insists on real 100% uptime justify it (and pay for it?). 98-99% uptime can be realized at a much lower cost. Uptime requirements will influence, possibly dictate, database choices, hardware choices, and FTE requirements.
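To put the uptime percentages above in concrete terms, here is a small arithmetic sketch in Python. It assumes a 365-day year; the function name is hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, assuming a non-leap year


def downtime_hours_per_year(uptime_fraction):
    """Hours per year the database may be down at a given uptime fraction."""
    return (1.0 - uptime_fraction) * HOURS_PER_YEAR


downtime_at_98 = downtime_hours_per_year(0.98)   # about 175 hours, roughly 7.3 days
downtime_at_99 = downtime_hours_per_year(0.99)   # about 88 hours, roughly 3.7 days
downtime_at_100 = downtime_hours_per_year(1.00)  # no downtime at all, planned or not
```

At 98-99% there is room for the scheduled monthly maintenance morning and a few unplanned outages; at 100% there is room for neither, which is where the cost comes from.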
Planning Failover
The cheapest method of addressing a failure is proactive planning. Make sure your database and database software are backed up. Unless you are using a commercial database with roll forward recovery, assume you will lose all DML since your last backup if you need to recover. This should dictate your backup schedule. Do not forget tape backups as a catastrophic recovery method. Practice recovery on your integration and development databases. Practice different scenarios: delete a datafile, delete the entire database.
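A small recovery drill illustrating the point about losing DML since the last backup, using the backup method of Python's built-in sqlite3 connections. In-memory databases stand in for the production database and the backup media; the event table and the helper row_count are hypothetical, and no roll forward recovery is modeled.

```python
import sqlite3


def row_count(conn):
    """Number of rows currently in the event table."""
    return conn.execute("SELECT COUNT(*) FROM event").fetchone()[0]


production = sqlite3.connect(":memory:")
production.execute("CREATE TABLE event (event_id INTEGER PRIMARY KEY, payload TEXT)")
production.executemany("INSERT INTO event (payload) VALUES (?)", [("a",), ("b",)])
production.commit()

# Scheduled backup: copy the database as it stands right now.
backup_copy = sqlite3.connect(":memory:")
production.backup(backup_copy)

# DML that arrives after the backup was taken.
production.execute("INSERT INTO event (payload) VALUES ('c')")
production.commit()
rows_before_failure = row_count(production)

# Simulated failure: production is gone, so restore from the backup copy.
production.close()
recovered = sqlite3.connect(":memory:")
backup_copy.backup(recovered)
rows_after_recovery = row_count(recovered)
```

The recovered database holds only the rows that existed at backup time; the row inserted afterward is lost, which is why the acceptable amount of lost DML should set the backup schedule.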
Replication
Replication is the process of copying and maintaining database objects in multiple databases that make up a distributed database system. Replication can improve the performance and protect the availability of applications because alternate data access options exist.
Replication Cont.
Oracle supports 3 types of replication: READ ONLY snapshots (materialized views), Advanced Replication, and Streams-based replication. Streams allows DDL modifications made to the master to be replicated automatically. Streams can be configured uni-directionally (a single source and one or more targets) or master to master, where updates can happen on any participating database. Advanced Replication also supports master to master, but Streams-based replication is recommended.
READ ONLY snapshot replication from a Sun box to Sun and Linux boxes is being done in CDF. When a replica is under maintenance, there is failover to another replica. The replicas stay up and running in read-only mode if the master is down for maintenance.
Replication cont.
Oracle master to master replication allows for updates on both the master and replica sides. Master to master is a complex, high-maintenance form of replication. It seems to be the 1st option the unwitting opt for. Both CERN and Fermi DBAs have requested firm justification before considering this type of replication request. Every link in a multi-master setup would be required to be fully staffed, as downtime will be critical.
Replication cont.
1. Disk space for archives. If the receiving site is down for an extended period of time, the source db must have enough space to hold the archive logs; otherwise one has to reinstantiate the replication. The reasonable downtime for a target depends on the archive area being generated on the source. Space, space, and more space.
2. Conflict resolution. In master to master, conflict resolution may be a challenge. Rules should be well defined to resolve the data conflicts.
3. Design of the data model. If primary keys are populated by sequences, there is a high chance of the sequences overlapping, which will cause integrity constraint violations. The data model should be designed very carefully.
4. DB support. In master to master replication, all master sites should be in 24x7 support mode. Otherwise, syncing up the data will be a challenge, or may lead to reinstantiation of the replication. Reinstantiation is not an unplug-and-play type of situation.
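The overlapping-sequence problem from the data model item above can be sketched as follows, using Python's built-in sqlite3 module with two in-memory databases standing in for two master sites. The run table and the helpers are hypothetical. The interleaved numbering at the end is one common mitigation, shown as an assumption; it is not something these slides prescribe.

```python
import sqlite3

DDL = "CREATE TABLE run (run_id INTEGER PRIMARY KEY, site TEXT NOT NULL)"


def make_master():
    """Create one master site with its own, independent key generation."""
    conn = sqlite3.connect(":memory:")
    conn.execute(DDL)
    return conn


site_a, site_b = make_master(), make_master()

# Each master generates keys independently, so both hand out run_id 1.
site_a.execute("INSERT INTO run (site) VALUES ('A')")
site_b.execute("INSERT INTO run (site) VALUES ('B')")

# Replicating B's row to A now violates A's primary key constraint.
row_from_b = site_b.execute("SELECT run_id, site FROM run").fetchone()
try:
    site_a.execute("INSERT INTO run (run_id, site) VALUES (?, ?)", row_from_b)
    replication_conflict = False
except sqlite3.IntegrityError:
    replication_conflict = True


def site_sequence(site_index, site_count, how_many):
    """Assumed mitigation: site k of n issues keys k+1, k+1+n, k+1+2n, ..."""
    return [site_index + 1 + step * site_count for step in range(how_many)]


ids_for_a = site_sequence(0, 2, 3)  # odd keys
ids_for_b = site_sequence(1, 2, 3)  # even keys
```

With independent sequences the replicated row is rejected, whereas the interleaved key sets never intersect, so rows from either site can be applied at the other.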
Freeware Replication
MySQL has replication in the last stable version (3.23.32; v4.1 is out). It is master-slave replication using a binary log of operations on the server side. It is possible to build star or chain type structures.
There is a PostgreSQL replication tool. We have not tested it yet.
Lost in Space
Space is the 1 area consistently underestimated in every application I have seen. IMHO, initial data volume estimates were consistently undersized by a factor of 2 or 3. For example, RunII events were estimated at 1 billion rows. This estimate was surpassed in Feb. 2004. We will probably end up with 4-5 billion event rows. That is a lot of disk space. Disk hardware becomes unsupported and obsolete in what seems to be the blink of an eye.
[Figure: disk layout. Data and its mirror, index and its mirror, backup, replication.]
Good rule of thumb: You need 10x the disk to hold a given amount of data in an RDB.
Operate in 2-year cycles: have the first 2 years of storage available on day 1. Evaluate growth at the end of year 1 and begin preparing for the next 2 years.
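The 10x rule of thumb can be turned into a quick provisioning estimate. This is a back-of-the-envelope sketch in Python; the function name and the 200-byte average row size are assumptions chosen only to make the arithmetic concrete.

```python
DISK_MULTIPLIER = 10  # rule of thumb: 10x the raw data volume
BYTES_PER_GB = 10**9  # decimal gigabytes


def disk_to_provision_gb(rows, bytes_per_row, multiplier=DISK_MULTIPLIER):
    """Disk to provision, in GB, for a table of the given size."""
    raw_bytes = rows * bytes_per_row
    return raw_bytes * multiplier / BYTES_PER_GB


# 1 billion rows at an assumed 200 bytes per row is 200 GB of raw data,
# so the rule of thumb calls for 2000 GB of disk.
estimate_gb = disk_to_provision_gb(1_000_000_000, 200)
```

Because row counts tend to be underestimated, it is worth rerunning the estimate with 2 to 3 times the expected rows before buying hardware.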
Additional References
**WARNING: some of these may be database specific.
Intro to database design: http://www.cc.gatech.edu/classes/AY2000/cs4400_spring/cs4400a/
Intro to Oracle tutorial: http://w2.syronex.com/jmr/edu/db/
Evolutionary Database Design: http://www.martinfowler.com/articles/evodb.html (mentions 1 dba for atlas)
SQL course: http://sqlcourse.com/
Additional References
***Highly recommended reading, db comparatives: http://wwwcss.fnal.gov/dsg/external/freeware/
DB infrastructure standard, support levels, etc. for Fermi computing: http://wwwcss.fnal.gov/dsg/external/oracle_admin/
Additional References
Oracle Designer tutorial: http://wwwcss.fnal.gov/dsg/internal/ora_adm/index.htm#designer (choose Oracle Designer tutorial or Oracle Designer Short Cuts and Lessons Learned)
BTeV-specific additional information: http://wwwcss.fnal.gov/dsg/external/BTeV/index.html