Introduction To DBMS: What Is A Database?
As the name suggests, a database management system consists of two parts: 1. Database and 2. Management System
What is a Database?
To find out what a database is, we have to start from data, which is the basic building block of any DBMS.

Data: Facts, figures, statistics etc. having no particular meaning (e.g. 1, ABC, 19 etc).

Record: Collection of related data items. In the above example the three data items had no meaning by themselves, but if we organize them in the following way, then they collectively represent meaningful information:

Roll  Name  Age
1     ABC   19

Table or Relation: Collection of related records:

Roll  Name  Age
1     ABC   19
2     DEF   22
3     XYZ   28
The columns of this relation are called Fields, Attributes or Domains. The rows are called Tuples or Records.

Database: Collection of related relations. Consider the following collection of tables:

T1
Roll  Name  Age
1     ABC   19
2     DEF   22
3     XYZ   28

T2
Roll  Address
1     KOL
2     DEL
3     MUM

T3
Roll  Year
1     I
2     II
3     I

T4
Year  Hostel
I     H1
II    H2
We now have a collection of 4 tables. They can be called a related collection because we can clearly find out that there are some common attributes existing in a selected pair of tables. Because of these common attributes we may combine the data of two or more tables together to find out the complete details of a student. Questions like "Which hostel does the youngest student live in?" can be answered now, although Age and Hostel attributes are in different tables. In a database, data is organized strictly in row and column format. The rows are called Tuples or Records. The data items within one row may belong to different data types. On the other hand, the columns are often called Domains or Attributes. All the data items within a single attribute are of the same data type.
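To make the idea concrete, here is a minimal sketch of this four-table example using Python's built-in sqlite3 module; the join answers the question about the youngest student's hostel. The table and column names follow the sample data above, and only the code itself is illustrative.

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE T1 (Roll INTEGER, Name TEXT, Age INTEGER);
CREATE TABLE T3 (Roll INTEGER, Year TEXT);
CREATE TABLE T4 (Year TEXT, Hostel TEXT);
""")
cur.executemany("INSERT INTO T1 VALUES (?, ?, ?)",
                [(1, "ABC", 19), (2, "DEF", 22), (3, "XYZ", 28)])
cur.executemany("INSERT INTO T3 VALUES (?, ?)",
                [(1, "I"), (2, "II"), (3, "I")])
cur.executemany("INSERT INTO T4 VALUES (?, ?)",
                [("I", "H1"), ("II", "H2")])

# Age lives in T1 and Hostel in T4; the common attributes Roll and Year
# let us combine the tables to answer the question.
row = cur.execute("""
    SELECT T4.Hostel
    FROM T1
    JOIN T3 ON T1.Roll = T3.Roll
    JOIN T4 ON T3.Year = T4.Year
    ORDER BY T1.Age
    LIMIT 1""").fetchone()
print(row[0])   # -> H1
conn.close()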
The word schema means arrangement: how we want to arrange the things that we have to store. Three different schemas are used in a DBMS, seen from three different levels of abstraction.

The lowest level, called the Internal or Physical schema, deals with the description of how raw data items (like 1, ABC, KOL, H2 etc.) are stored in physical storage (hard disk, CD, tape drive etc.). It also describes the data type of these data items, the size of the items in the storage media, the location (physical address) of the items in the storage device and so on. This schema is useful for database application developers and database administrators.

The middle level is known as the Conceptual or Logical schema and deals with the structure of the entire database. Please note that at this level we are no longer interested in the raw data items; we are interested in the structure of the database. This means we want to know the attributes of each table, the common attributes in different tables that allow them to be combined, what kind of data can be entered into these attributes, and so on. The conceptual or logical schema is very useful for database administrators, whose responsibility is to maintain the entire database.

The highest level of abstraction is the External or View schema. This is targeted at the end users. An end user does not need to know everything about the structure of the entire database, only the amount of detail he/she needs to work with. We may not want the end user to be confused by an astounding amount of detail by letting him/her look at the entire database, or we may not allow this for reasons of security, where sensitive information must remain hidden from unwanted persons. The database administrator may therefore create custom-made tables, keeping in mind the specific needs of each user. These tables are also known as virtual tables, because they have no separate physical existence; they are created dynamically for the users at runtime. For example, in the sample database we created earlier, suppose we have a special officer whose responsibility is to keep in touch with the parents of any under-aged student living in the hostels. That officer does not need to know any detail except the Roll, Name, Address and Age. The database administrator may create a virtual table with only these four attributes, for the use of this officer alone.
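As a small illustration of such a virtual table, the sketch below (assuming a STUDENT table that carries the attributes mentioned above, with one sample row) creates the officer's view with sqlite3; the view has no physical existence of its own and is evaluated at runtime.

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""CREATE TABLE STUDENT
               (Roll INTEGER, Name TEXT, Address TEXT, Age INTEGER,
                Year TEXT, Hostel TEXT)""")
cur.execute("INSERT INTO STUDENT VALUES (1, 'ABC', 'KOL', 19, 'I', 'H1')")

# The virtual table: only the four attributes the officer needs.
cur.execute("""CREATE VIEW HostelOfficerView AS
               SELECT Roll, Name, Address, Age FROM STUDENT""")
print(cur.execute("SELECT * FROM HostelOfficerView").fetchall())
conn.close()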
Data Independence
This brings us to our next topic: data independence. It is the property of the database which ensures that if we make any change at any level of schema, the schema immediately above it requires minimal or no change. What does this mean? We know that in a building, each floor stands on the floor below it. If we change the design of any one floor, e.g. extending the width of a room by demolishing its western wall, it is likely that the design of the floors above will have to be changed as well. As a result, one change needed on one particular floor means changing the design of every floor above it until we reach the top, with an increase in time, cost and labour. Would life not be easier if the change could be contained in one floor only? Data independence is the answer: it removes the need for the additional work of adopting a single change into all the levels above. Data independence can be classified into the following two types:
1. Physical Data Independence: This means that for any change made in the physical schema, the need to change the logical schema is minimal. This is the easier of the two to achieve in practice. Let us explain with an example. Say you have bought an Audio CD of a recently released film and one of your friends has bought an Audio Cassette of the same film. If we consider the physical schema, they are entirely different. The first is a digital recording on an optical medium, where random access is possible. The second is a magnetic recording on a magnetic medium, with strictly sequential access. However, how this difference is reflected in the logical schema is very interesting. For the music tracks, the logical schema for both the CD and the Cassette is the title card printed on their back. We have information like track number, name of the song, name of the artist and duration of the track, all of which are identical for both the CD and the Cassette. We can clearly say that we have achieved physical data independence here.
2. Logical Data Independence: This means that for any change made in the logical schema, the need to change the external schema is minimal. As we shall see, this is a little more difficult to achieve. Let us explain with an example. Suppose the CD you have bought contains 6 songs, and some of your friends are interested in copying some of those songs (the ones they liked in the film) into their favorite collections. One friend wants songs 1, 2, 4, 5, 6, another wants 1, 3, 4, 5 and another wants 1, 2, 3, 6. Each of these collections can be compared to a view schema for that friend. Now, by some mishap, a scratch has appeared on the CD and you cannot extract song 3. Obviously, you will have to ask the friends who have song 3 in their proposed collections to alter their views by deleting song 3 as well.
Database Administrator
The Database Administrator, better known as the DBA, is the person (or a group of persons) responsible for the well-being of the database management system. S/he has the following functions and responsibilities regarding database management:
1. Definition of the schema, i.e. the architecture of the three levels of data abstraction and data independence.
2. Modification of the defined schema as and when required.
3. Definition of the storage structure and the access method of the data stored, i.e. sequential, indexed or direct.
4. Creating new user-ids, passwords etc., and also creating the access permissions that each user can or cannot enjoy. The DBA is responsible for creating user roles, which are collections of permissions (like read, write etc.) granted to and restricted for a class of users. S/he can also grant additional permissions to, and/or revoke existing permissions from, a user if need be.
5. Defining the integrity constraints for the database to ensure that the data entered conform to some rules, thereby increasing the reliability of the data (see the sketch after this list).
6. Creating a security mechanism to prevent unauthorized access and accidental or intentional mishandling of data that can cause a security threat.
7. Creating a backup and recovery policy. This is essential because in case of a failure the database must be able to revive itself to its complete functionality with no loss of data, as if the failure had never occurred. It is essential to keep regular backups of the data so that if the system fails, all data up to the point of failure will be available from stable storage. Only the data gathered during the failure would have to be fed to the database again to recover it to a healthy status.
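As an illustration of point 5, the following sketch declares integrity constraints on a hypothetical STUDENT table with sqlite3, so that a row violating the rules is rejected; the column names and the age rule are only examples.

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""
CREATE TABLE STUDENT (
    Roll INTEGER PRIMARY KEY,                   -- must be unique, never NULL
    Name TEXT NOT NULL,                         -- a student must have a name
    Age  INTEGER CHECK (Age BETWEEN 5 AND 100)  -- reject impossible ages
)""")

cur.execute("INSERT INTO STUDENT VALUES (1, 'ABC', 19)")      # accepted
try:
    cur.execute("INSERT INTO STUDENT VALUES (2, 'DEF', -4)")  # rejected
except sqlite3.IntegrityError as err:
    print("Rejected by integrity constraint:", err)
conn.close()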
Advantages and Disadvantages of DBMS

1. Reduction of Redundancy: Redundancy is the problem of storing the same data item in more than one place. Redundancy creates several problems like requiring extra storage space, entering the same data more than once during data insertion, and deleting data from more than one place during deletion. Anomalies may occur in the database if insertion, deletion etc. are not done properly.
2. Sharing of Data: In paper-based record keeping, data cannot be shared among many users. But in a computerized DBMS, many users can share the same database if they are connected via a network.
3. Data Integrity: We can maintain data integrity by specifying integrity constraints, which are rules and restrictions about what kind of data may be entered or manipulated within the database. This increases the reliability of the database, as it can be guaranteed that no wrong data can exist within the database at any point of time.
4. Data Security: We can restrict certain people from accessing the database, or allow them to see only certain portions of the database while blocking sensitive information. This is not easily possible in paper-based record keeping.

However, there can be a few disadvantages of using a DBMS. They are as follows:
1. As a DBMS needs computers, we have to invest a good amount in acquiring the hardware, software, installation facilities and training of users.
2. We have to keep regular backups because a failure can occur at any time. Taking a backup is a lengthy process and the computer system may not be able to perform other jobs at this time.
3. While the data security system is a boon of using a DBMS, it must be very robust. If someone can bypass the security system then the database becomes open to any kind of mishandling.
... with a reasonable amount of detail.
2. Producing ERD: The ERD or Entity Relationship Diagram is a diagrammatic representation of the description we have gathered about the system.
3. Designing the database: From the ERD we have created, it is very easy to determine the tables, the attributes which the tables must contain and the relationships among these tables.
4. Normalization: This is a process of removing different kinds of impurities from the tables we have just created in the above step.
Well, fine. Up to this point the ERD shows how the boy and the ice cream are related. Now, every boy must have a name, address, phone number etc., and every ice cream has a manufacturer, flavor, price etc. Without these the diagram is not complete. The items we have just mentioned are known as attributes, and they must be incorporated in the ERD as connected ovals.
But can only entities have attributes? Certainly not. Relationships can have their attributes too. These attributes do not tell us anything more about the boy or the ice cream, but they provide additional information about the relationship between the boy and the ice cream.
Step 3

We are almost complete now. If you look carefully, we now have defined structures for at least three tables like the following:

Boy: Name, Address, Phone
Ice Cream: Manufacturer, Flavor, Price
Eats: Date, Time
However, this is still not a working database, because by definition a database should be a collection of related tables. To make them connected, the tables must have some common attributes. If we choose the attribute Name of the Boy table to play the role of the common attribute, then the revised structures of the above tables become the following:

Boy: Name, Address, Phone
Ice Cream: Manufacturer, Flavor, Price, Name
Eats: Date, Time, Name

This is as complete as it can be. We now have information about the boy, about the ice cream he has eaten and about the date and time when the eating was done.
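The sketch below creates the three tables with sqlite3 and joins them through the common attribute Name; the sample values (the boy's name, the manufacturer, the date) are purely illustrative.

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE Boy      (Name TEXT, Address TEXT, Phone TEXT);
CREATE TABLE IceCream (Manufacturer TEXT, Flavor TEXT, Price REAL, Name TEXT);
CREATE TABLE Eats     (Date TEXT, Time TEXT, Name TEXT);
""")
cur.execute("INSERT INTO Boy      VALUES ('Ravi', 'KOL', '9830000000')")
cur.execute("INSERT INTO IceCream VALUES ('Amul', 'Vanilla', 30.0, 'Ravi')")
cur.execute("INSERT INTO Eats     VALUES ('2024-05-01', '17:30', 'Ravi')")

# The common attribute Name ties the three tables together.
rows = cur.execute("""
    SELECT Boy.Name, IceCream.Flavor, Eats.Date, Eats.Time
    FROM Boy
    JOIN IceCream ON IceCream.Name = Boy.Name
    JOIN Eats     ON Eats.Name     = Boy.Name""").fetchall()
print(rows)
conn.close()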
Cardinality of Relationship
While creating a relationship between two entities, we may often have to face the cardinality problem. This simply means how many entities of the first set are related to how many entities of the second set. Cardinality can be of the following types.

One-to-One: Only one entity of the first set is related to only one entity of the second set. E.g. A teacher teaches a student. Only one teacher is teaching only one student.

One-to-Many: Only one entity of the first set is related to multiple entities of the second set. E.g. A teacher teaches students. Only one teacher is teaching many students.

Many-to-One: Multiple entities of the first set are related to only one entity of the second set. E.g. Teachers teach a student. Many teachers are teaching only one student.

Many-to-Many: Multiple entities of the first set are related to multiple entities of the second set. E.g. Teachers teach students. In any school or college many teachers are teaching many students. This can be considered as a two-way one-to-many relationship.
In this discussion we have not included the attributes, but you can understand that attributes can be attached to any of these relationships without any problem if we want to.
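In a relational implementation, the many-to-many case is commonly handled exactly as described above: as two one-to-many relationships meeting in a separate linking table that stores one row per (teacher, student) pair. A minimal sketch with illustrative names:

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE Teacher (TeacherId INTEGER PRIMARY KEY, TName TEXT);
CREATE TABLE Student (StudentId INTEGER PRIMARY KEY, SName TEXT);
CREATE TABLE Teaches (TeacherId INTEGER, StudentId INTEGER);  -- linking table
""")
# One teacher appears with many students, and one student with many teachers.
cur.executemany("INSERT INTO Teaches VALUES (?, ?)",
                [(1, 1), (1, 2), (2, 1)])
print(cur.execute("SELECT COUNT(*) FROM Teaches").fetchone()[0])  # -> 3
conn.close()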
Super Key or Candidate Key: An attribute (or set of attributes) of a table that can uniquely identify a row in that table. Such keys contain unique values and can never contain NULL values; a candidate key is a minimal super key, i.e. one with no unnecessary attributes. There can be more than one candidate key in a table, e.g. within a STUDENT table Roll and Mobile No. can both serve to uniquely identify a student.

Primary Key: One of the candidate keys, chosen to be the identifying key for the entire table. E.g. although there are two candidate keys in the STUDENT table, the college would obviously use Roll as the primary key of the table.

Alternate Key: A candidate key which is not chosen as the primary key of the table. It is named so because, although not the primary key, it can still identify a row.

Composite Key: Sometimes one attribute is not enough to uniquely identify a row. E.g. in a single class Roll is enough to find a student, but in the entire school merely searching by the Roll is not enough, because there could be 10 classes in the school and each of them may contain a roll no. 5. To uniquely identify the student we have to say something like class VII, roll no. 5. So a combination of two or more attributes is used to create a unique combination of values, such as Class + Roll.

Foreign Key: Sometimes we may have to work with a table that does not have a primary key of its own. To identify its rows, we have to use the primary key of a related table. Such a copy of another related table's primary key is called a foreign key.
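The sketch below shows how these key types might be declared in SQL (via sqlite3): Roll as the chosen primary key, Mobile as an alternate key that stays unique, Class + Roll as a composite key in a school-wide table, and a foreign key referencing STUDENT. Table and column names beyond those mentioned in the text are illustrative.

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("PRAGMA foreign_keys = ON")   # SQLite enforces foreign keys only when asked
cur.executescript("""
CREATE TABLE STUDENT (
    Roll   INTEGER PRIMARY KEY,   -- the candidate key chosen as primary key
    Mobile TEXT UNIQUE,           -- alternate key: still identifies a row
    Name   TEXT
);
CREATE TABLE SCHOOL_STUDENT (
    Class TEXT,
    Roll  INTEGER,
    Name  TEXT,
    PRIMARY KEY (Class, Roll)     -- composite key: Class + Roll together
);
CREATE TABLE MARKS (
    Roll  INTEGER REFERENCES STUDENT(Roll),  -- foreign key: copy of STUDENT's primary key
    Marks INTEGER
);
""")
conn.close()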
In the ERD, the weak entity itself and the relationship linking a strong and weak entity must have a double border.
Relational DBMS

This is our subject of study. A DBMS is relational if the data is organized into relations, that is, tables. In an RDBMS, all data are stored in the well-known row-column format.

Hierarchical DBMS

In an HDBMS, data is organized in a tree-like manner. There is a parent-child relationship among the data items, and the data model is very suitable for representing one-to-many relationships. To access the data items, some kind of tree-traversal technique is used, such as preorder traversal. Because the HDBMS is built on the one-to-many model, we face a little difficulty in organizing a hierarchical database into row-column format. For example, consider the following hierarchical database that shows four employees (E01, E02, E03 and E04) belonging to the same department D1.
There are two ways to represent the above one-to-many information in a relation that is built in a one-to-one fashion. The first is called Replication, where the department id is replicated a number of times in the table, like the following:

Dept-Id  Employee Code
D1       E01
D1       E02
D1       E03
D1       E04
Replication makes the same data item redundant and is an inefficient way to store data. A better way is to use a technique called the Virtual Record. While using this, the repeating data item is not used in the table. It is kept at a separate place. The table, instead of containing the repeating information, contains a pointer to that place where the data item is stored.
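A rough sketch of the Virtual Record idea in plain Python: the department record is stored once, and each employee record keeps only a reference (pointer) to that single copy instead of repeating it.

# The department record is stored in one place only.
dept_d1 = {"dept_id": "D1"}

# Each employee record holds a pointer (reference) to that single copy,
# instead of repeating the department data in every row.
employees = [
    {"emp_code": "E01", "dept": dept_d1},
    {"emp_code": "E02", "dept": dept_d1},
    {"emp_code": "E03", "dept": dept_d1},
    {"emp_code": "E04", "dept": dept_d1},
]

# All four employees resolve to the same department object: no replication.
print(all(e["dept"] is dept_d1 for e in employees))   # -> True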
This organization saves a lot of space as data is not made redundant.

Network DBMS

The NDBMS is built primarily on a one-to-many relationship, but one in which a strict parent-child representation among the data items cannot be ensured. This may happen in any real-world situation where any entity can be linked to any entity. The NDBMS was proposed by a group of theorists known as the Database Task Group (DBTG). According to them: In an NDBMS, all entities are called Records and all relationships are called Sets. The record from which the relationship starts is called the Owner Record and the record where it ends is called the Member Record. The relationship or set is strictly one-to-many.
In case we need to represent a many-to-many relationship, an interesting thing happens. In an NDBMS, Owner and Member can only have a one-to-many relationship, so we have to introduce a third common record with which both the Owner and the Member can have one-to-many relationships. Using this common record, the Owner and Member become linked by an effective many-to-many relationship. Suppose we have to represent the statement "Teachers teach students". We have to introduce a third record, say CLASS, to which both the teacher and the student can have a one-to-many relationship. Using the class in the middle, the teacher and the student are linked by a virtual many-to-many relationship.
Normalization
While designing a database out of an entity-relationship model, the main problem existing in that raw database is redundancy. Redundancy is storing the same data item in more than one place. Redundancy creates several problems like the following:
1. Extra storage space: storing the same data in many places takes a large amount of disk space.
2. Entering the same data more than once during data insertion.
3. Deleting data from more than one place during deletion.
4. Modifying data in more than one place.
5. Anomalies may occur in the database if insertion, deletion, modification etc. are not done properly. This creates inconsistency and unreliability in the database.
To solve this problem, the raw database needs to be normalized. This is a step by step process of removing different kinds of redundancy and anomaly at each step. At each step a specific rule is followed to remove specific kind of impurity in order to give the database a slim and clean look.
[Sample unnormalized table omitted: each Emp-Id (E01, E02, E03) has a variable number of rows, with values such as employee names (BB, CC), bank codes (B01, B02) and bank names (SBI, UTI).]
In the sample table above, there are multiple occurrences of rows under each key Emp-Id. Although considered to be the primary key, Emp-Id cannot give us the unique identification facility for any single row. Further, each primary key points to a variable length record (3 for E01, 2 for E02 and 4 for E03).
As you can see now, each row contains a unique combination of values. Unlike in UNF, this relation contains only atomic values, i.e. the rows cannot be further decomposed, so the relation is now in 1NF.
After removing that portion into another relation, we store a lesser amount of data in two relations without any loss of information. There is also a significant reduction in redundancy.
1. The candidate keys are composite.
2. There is more than one candidate key in the relation.
3. There are some common attributes in the relation.
Consider, as an example, the following relation:

Professor Code  Department   Head of Dept.  Percent Time
P1              Physics      Ghosh          50
P1              Mathematics  Krishnan       50
P2              Chemistry    Rao            25
P2              Physics      Ghosh          75
P3              Mathematics  Krishnan       100

It is assumed that:
1. A professor can work in more than one department.
2. The percentage of the time he spends in each department is given.
3. Each department has only one Head of Department.
The given relation is in 3NF. Observe, however, that the names of Dept. and Head of Dept. are duplicated. Further, if Professor P2 resigns, rows 3 and 4 are deleted and we lose the information that Rao is the Head of Department of Chemistry. The normalization of the relation is done by creating a new relation for Dept. and Head of Dept. and deleting Head of Dept. from the given relation. The normalized relations are shown in the following.

Professor Code  Department   Percent Time
P1              Physics      50
P1              Mathematics  50
P2              Chemistry    25
P2              Physics      75
P3              Mathematics  100
Department   Head of Dept.
Physics      Ghosh
Mathematics  Krishnan
Chemistry    Rao

See the dependency diagrams for these new relations.
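The decomposition above can be carried out mechanically in SQL; the sketch below (sqlite3, with shortened column names) derives the two normalized relations from the original table using SELECT DISTINCT, so no information is lost and the duplication of Head of Dept. disappears.

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE ProfOld (ProfCode TEXT, Dept TEXT, Head TEXT, PercentTime INTEGER)")
cur.executemany("INSERT INTO ProfOld VALUES (?, ?, ?, ?)", [
    ("P1", "Physics",     "Ghosh",    50),
    ("P1", "Mathematics", "Krishnan", 50),
    ("P2", "Chemistry",   "Rao",      25),
    ("P2", "Physics",     "Ghosh",    75),
    ("P3", "Mathematics", "Krishnan", 100),
])

# Each normalized relation is a projection of the original one.
cur.execute("CREATE TABLE Professor  AS SELECT ProfCode, Dept, PercentTime FROM ProfOld")
cur.execute("CREATE TABLE Department AS SELECT DISTINCT Dept, Head FROM ProfOld")

print(cur.execute("SELECT * FROM Department").fetchall())
# prints the three (Department, Head) pairs, e.g. [('Physics', 'Ghosh'), ...]
conn.close()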
A multi-valued dependency exists here because all the attributes depend upon each other and yet none of them is a primary key having unique values. Consider the following relation:

Vendor Code  Item Code  Project No.
V1           I1         P1
V1           I2         P1
V1           I1         P3
V1           I2         P3
V2           I2         P1
V2           I3         P1
V3           I1         P2
V3           I1         P3
The given relation has a number of problems. For example:
1. If vendor V1 has to supply to project P2, but the item is not yet decided, then a row with a blank for item code has to be introduced.
2. The information about item I1 is stored twice for vendor V3.

Observe that the relation given is in 3NF and also in BCNF. It still has the problems mentioned above. The problems are reduced by expressing this relation as two relations in the Fourth Normal Form (4NF). A relation is in 4NF if it has no more than one independent multi-valued dependency, or one independent multi-valued dependency with a functional dependency. The table can be expressed as the two 4NF relations given below. The fact that vendors are capable of supplying certain items and that they are assigned to supply for some projects is independently specified in the 4NF relations.

Vendor-Supply
Vendor Code  Item Code
V1           I1
V1           I2
V2           I2
V2           I3
V3           I1

Vendor-Project
Vendor Code  Project No.
V1           P1
V1           P3
V2           P1
V3           P2
V3           P3
Let us finally summarize the normalization steps we have discussed so far.

All Relations -> 1NF: Eliminate variable-length records. Remove multi-attribute lines in the table.
1NF -> 2NF: Remove dependency of non-key attributes on part of a multi-attribute key.
2NF -> 3NF: Remove dependency of non-key attributes on other non-key attributes.
3NF -> BCNF: Remove dependency of an attribute of a multi-attribute key on an attribute of another (overlapping) multi-attribute key.
BCNF -> 4NF: Remove more than one independent multi-valued dependency from the relation by splitting the relation.
4NF -> 5NF: Add one relation relating attributes with a multi-valued dependency.