Distributed Database System (KCA045)

The document covers the fundamentals of Distributed Database Systems (DDBS), including architecture, design strategies, and the benefits of distributed data processing such as scalability, fault tolerance, and performance. It outlines the challenges faced in distributed systems, including data consistency, network latency, and system complexity, while providing best practices to mitigate these issues. Additionally, it describes various types of distributed databases, fragmentation methods, and replication strategies to optimize data management across multiple locations.

Uploaded by Shalini Verma

MASTER OF COMPUTER APPLICATION MCA IV semester

KCA045: Distributed Database Systems

Unit-1

Introduction: Distributed Data Processing, Distributed Database System, Promises of DDBSs, Problem
areas. Distributed DBMS Architecture: Architectural Models for Distributed DBMS,
DDBMS Architecture. Distributed Database Design: Alternative Design Strategies, Distribution Design
issues, Fragmentation, Allocation

Unit-2

Query processing and decomposition: Query processing objectives,


characterization of query processors, layers of query processing, query
decomposition, localization of distributed data. Distributed query Optimization:
Query optimization, centralized query optimization, distributed query
optimization algorithms.

Introduction:

As the volume and complexity of data continue to surge, traditional data processing methods face serious challenges. Companies trying to
extract valuable insights from vast data sets need efficient and scalable processing capabilities to be able to make impactful decisions at
scale. One of these capabilities is distributed data processing.

What Is Distributed Data Processing?


Distributed data processing refers to the approach of handling and analyzing data across multiple interconnected devices or nodes. In
contrast to centralized data processing, where all data operations occur on a single, powerful system, distributed processing decentralizes
these tasks across a network of computers. This method leverages the collective computing power of interconnected devices, enabling
parallel processing and faster data analysis.
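As a minimal illustration of this idea, the sketch below splits one computation into chunks, has each (simulated) node process its chunk independently, and combines the partial results. In a real system each call to node_sum would run on a separate machine; the function names here are illustrative.

```python
# Sketch of distributed processing: split the workload, process each
# chunk independently (one call = one simulated node), then combine.

def split_into_chunks(data, n_nodes):
    """Divide the workload so each node gets a roughly equal share."""
    return [data[i::n_nodes] for i in range(n_nodes)]

def node_sum(chunk):
    """Work done independently on one node."""
    return sum(chunk)

def distributed_sum(data, n_nodes=4):
    partials = [node_sum(chunk) for chunk in split_into_chunks(data, n_nodes)]
    return sum(partials)  # combine step

print(distributed_sum(range(1, 101)))  # 5050, same answer as a single node
```

The combine step is cheap here (a sum of partial sums); for more complex analyses this is where frameworks spend most of their coordination effort.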

Benefits of Distributed Data Processing


The benefits of distributed data processing include:

Scalability
One of the primary advantages of distributed data processing is scalability. As data volumes grow, organizations can expand their processing
capabilities by adding more nodes to the network. This scalability ensures that the system can handle increasing workloads without a
significant drop in performance, providing a flexible and adaptive solution to the challenges posed by big data.
Fault Tolerance
Distributed data processing systems inherently offer improved fault tolerance compared to centralized systems. In a distributed environment,
if one node fails, the remaining nodes can continue processing data, reducing the risk of a complete system failure. This resilience is crucial
for maintaining uninterrupted data operations in mission-critical applications.

Performance
Parallel processing, a key feature of distributed data processing, contributes to enhanced performance. By breaking down complex tasks into
smaller subtasks distributed across nodes, the system can process data more quickly and efficiently. This results in reduced processing
times and improved overall performance, enabling organizations to derive insights from data in a timely manner.

Efficient Handling of Large Volumes of Data


In the era of big data, efficiently handling large volumes of data is a paramount concern for organizations. Distributed data processing excels
in this aspect by employing data partitioning strategies. Large data sets are divided into smaller, more manageable segments, and each
segment is processed independently across distributed nodes.
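The partitioning strategy described above can be sketched as follows: records are assigned to nodes by hashing a key, so each node ends up holding a smaller, independently processable segment. The record fields and node count below are illustrative.

```python
# Hash partitioning sketch: each record is routed to exactly one node
# by hashing its partition key.

def partition(records, key, n_nodes):
    """Split records into n_nodes segments by hashing the partition key."""
    segments = {i: [] for i in range(n_nodes)}
    for rec in records:
        node = hash(rec[key]) % n_nodes  # deterministic within one run
        segments[node].append(rec)
    return segments

orders = [{"id": i, "amount": i * 10} for i in range(8)]
segments = partition(orders, "id", n_nodes=3)
# Every record lands in exactly one segment, and nothing is lost.
assert sum(len(s) for s in segments.values()) == len(orders)
```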

Challenges and Considerations of Distributed Data Processing


The shift toward distributed data processing has ushered in a new era of scalability and performance, but it's not without its challenges. As
organizations increasingly adopt distributed systems to handle vast and complex data sets, they must grapple with a range of considerations
to ensure seamless operations.

These challenges include:

Data Consistency
Maintaining data consistency across distributed nodes poses a significant challenge in distributed data processing. In a decentralized
environment, where data is processed simultaneously across multiple nodes, ensuring that all nodes have access to the most recent and
accurate data becomes complex.

Tips and best practices:

 Implement distributed databases that support strong consistency models, ensuring that all nodes see the same version of the
data.
 Leverage techniques like two-phase commit protocols to synchronize changes across distributed nodes.
 Consider eventual consistency models for scenarios where immediate consistency is not critical, allowing for flexibility in trade-offs
between consistency and availability.
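The two-phase commit protocol mentioned above can be sketched as a toy model: participants first vote in a prepare phase, and only if every vote is "yes" does the coordinator tell all of them to commit. Real implementations add logging, timeouts, and recovery, all omitted here; the class and method names are illustrative.

```python
# Toy two-phase commit: phase 1 collects votes, phase 2 applies the
# global decision (commit only if everyone voted yes).

class Participant:
    def __init__(self, name, can_commit=True):
        self.name, self.can_commit = name, can_commit
        self.state = "idle"

    def prepare(self):  # phase 1: vote yes/no
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):   # phase 2: apply the change
        self.state = "committed"

    def abort(self):    # phase 2: roll back
        self.state = "aborted"

def two_phase_commit(participants):
    if all(p.prepare() for p in participants):  # all must vote yes
        for p in participants:
            p.commit()
        return "committed"
    for p in participants:  # a single "no" vote aborts everyone
        p.abort()
    return "aborted"
```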

Network Latency
Network latency, the delay in data transmission over a network, is a critical consideration in distributed data processing. As nodes
communicate and share data, the time it takes for information to traverse the network can impact the overall performance of the system.

Tips and best practices:

 Optimize network configurations to minimize latency, including the use of high-speed connections and efficient routing.
 Leverage data partitioning strategies to reduce the need for frequent communication between nodes, minimizing the impact of
latency.
 Implement caching mechanisms to store frequently accessed data locally, reducing the reliance on network communication for
repetitive tasks.

System Complexity
The inherent complexity of distributed systems poses a challenge for organizations adopting distributed data processing. Coordinating tasks,
managing nodes, and ensuring fault tolerance in a decentralized environment requires a nuanced understanding of system intricacies.

Tips and best practices:

 Embrace containerization and orchestration tools, such as Docker and Kubernetes, to streamline the deployment and
management of distributed applications.
 Implement comprehensive monitoring and logging systems to track the performance and health of distributed nodes, facilitating
timely identification and resolution of issues.
 Invest in employee training and education to equip the team with the necessary skills to navigate the complexities of distributed
data processing.
Ensuring Data Security
Distributed data processing introduces additional considerations for data security. With data distributed across nodes, organizations must
implement robust measures to protect sensitive information from potential threats and unauthorized access.

Tips and best practices:

 Encrypt data both in transit and at rest to safeguard it from interception or unauthorized access.
 Implement access control mechanisms to restrict data access based on user roles and permissions.
 Regularly audit and update security protocols to stay ahead of emerging threats and vulnerabilities.
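The access-control tip can be illustrated with a simple role-to-permission table; the role and action names below are made up for the example.

```python
# Role-based access control sketch: every data access is checked
# against the permissions explicitly granted to the caller's role.

ROLE_PERMISSIONS = {
    "analyst": {"read"},
    "admin": {"read", "write", "delete"},
}

def is_allowed(role, action):
    """Return True only if the role explicitly grants the action."""
    return action in ROLE_PERMISSIONS.get(role, set())

assert is_allowed("admin", "delete")
assert not is_allowed("analyst", "write")
assert not is_allowed("guest", "read")  # unknown roles get nothing
```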

A distributed database system


A distributed database system is spread across several locations with distinct physical components. This can be necessary when users from all over the world need to access the same database. The system must be managed so that, to its users, it appears to be a single database.

Types:

1. Homogeneous Database: A homogeneous database stores data uniformly across all locations. All sites utilize the same operating system, database management system, and data structures, so they are simple to manage.

2. Heterogeneous Database: In a heterogeneous distributed database, different sites may employ different software and schemas, which can cause issues with queries and transactions. One site may not even be aware of the existence of the other sites. Different machines may use different operating systems and database applications, and they may even employ different data models. Translations are therefore necessary for communication between sites.

Distributed data storage allows data to be stored across several sites in two ways:

1. Replication - In this approach, the entire relation is stored redundantly at two or more sites. If the entire database is available at every site, it is a fully redundant database. Replication means the system maintains several copies of the data. This has advantages, since it makes more data available at many locations and allows query requests to be processed in parallel. But there are drawbacks as well: data must be updated often, and every change made at one site must be propagated to every site that stores that relation, otherwise the copies become inconsistent. This creates a great deal of overhead. Moreover, concurrency control becomes far more complicated, since concurrent access must now be coordinated across several sites.
2. Fragmentation - In this approach, a relation is divided into smaller fragments, and each fragment is stored at the sites where it is needed. The fragments must be created in a way that allows the original relation to be reconstructed, so that no data is lost. Because fragmentation does not duplicate data, consistency is not a concern.

Relations can be fragmented in one of two ways:

o Horizontal fragmentation splits the relation into groups of tuples by rows, with each tuple allocated to at
least one fragment.
o Vertical fragmentation, also known as splitting by columns, divides a relation's schema into smaller schemas. A
common candidate key must be present in each fragment in order to guarantee a lossless join.

Uses of distributed databases

o Corporate management information systems.
o Multimedia applications.
o Hotel chains, military command systems, etc.
o Production control systems.

Characteristics of distributed databases


Generally speaking, distributed databases have the following characteristics:

o Location independence
o Distributed query processing
o Distributed transaction management
o Hardware independence
o Operating system and network independence
o Transaction transparency

Benefits of distributed databases

Using distributed databases has a lot of benefits.

o Because distributed databases support modular development, a system can be expanded by adding new computers and local data at a
new site and connecting them seamlessly to the rest of the distributed system.
o In a centralized database, a failure results in a total shutdown of the system. A distributed database system, however,
continues to operate at reduced performance when a component fails, until the issue is resolved.

Distributed database examples

o Apache Ignite, Apache Cassandra, Apache HBase, Couchbase Server, Amazon SimpleDB, Clusterpoint, and FoundationDB are
just a few examples of the numerous distributed databases available.
o Large data sets may be stored and processed with Apache Ignite across node clusters. GridGain Systems released Ignite as
open source in 2014, and it was later approved into the Apache Incubator program. RAM serves as the database's primary
processing and storage layer in Apache Ignite.
o Apache Cassandra supports clusters that span several locations and has its own query language, the Cassandra
Query Language (CQL). Replication strategies in Cassandra can also be customized.
o Apache HBase offers a fault-tolerant mechanism to store huge amounts of sparse data on top of the Hadoop Distributed File
System. Moreover, it offers per-column Bloom filters, in-memory execution, and compression. Although Apache Phoenix offers a
SQL layer for HBase, HBase is not meant to replace SQL databases.
o An interactive application that serves several concurrent users by producing, storing, retrieving, aggregating, altering, and
displaying data is best served by Couchbase Server, a NoSQL software package. Scalable key value and JSON document
access is provided by Couchbase Server to satisfy these various application demands.
o Along with Amazon S3 and Amazon Elastic Compute Cloud, Amazon SimpleDB is utilised as a web service. Developers may
request and store data with Amazon SimpleDB with a minimum of database maintenance and administrative work.

The promises of a distributed database system (DDBS) include:


 Transparent management
The system manages data distribution and replication transparently to the user
 Reliable access
The system provides reliable access to data through distributed transactions
 Improved performance
The system improves performance by storing data closer to where it is needed
 Easier expansion
The system makes it easier to expand the system
 Increased reliability and availability
The system provides higher availability and reliability than a centralized database system
Key problem areas in Distributed Database Management Systems (DDBMS) include:

data consistency and integrity, distributed query processing, concurrency control, data replication and fragmentation,
network communication latency, security and access control, scalability, fault tolerance and recovery, heterogeneity and
interoperability, and managing complex transactions across multiple sites. In essence, ensuring data accuracy across
distributed locations while managing complex operations over a network is challenging.

A Distributed Database Management System (DDBMS) architecture refers to the design of how data is
distributed across multiple computer sites in a network, with the system managing this distribution
transparently to users, allowing them to access the data as if it were stored in a single, centralized
database.

Key points about DDBMS architecture:


 Data Distribution:
The primary aspect of DDBMS architecture is how data is physically partitioned and replicated across different sites within
the network, allowing for parallel processing and improved performance based on data locality.
 Schema Levels:
External Schema: Represents the specific view of data that individual users or
applications see.

Global Conceptual Schema: Represents the overall logical view of the database,
providing a unified view of all data across different sites.
Local Conceptual Schema: Defines the logical structure of data at each individual site.
Local Internal Schema: Describes the physical storage details of data at each site.

Common DDBMS Architectural Models:


 Client-Server Architecture:
The most widely used model, where clients send requests to a dedicated server which handles data access and
processing, providing a clear separation of concerns.

 Peer-to-peer (Collaborating Server) Architecture:


Each node in the network acts as both a client and a server, allowing direct communication between peers and sharing
resources dynamically.

 Hierarchical Architecture:
A tree-like structure where nodes at higher levels manage the data access of nodes at lower levels, often used for
distributed systems with a clear hierarchy.

Important Considerations in DDBMS Architecture:


 Data Fragmentation:
Dividing data into smaller units (fragments) to distribute across different sites.
 Replication:
Duplicating data on multiple sites to enhance availability and performance in case of failures.
 Transaction Management:
Ensuring consistency and data integrity across distributed transactions, managing distributed locks and commit protocols.
 Query Optimization:
Developing strategies to efficiently execute queries that involve data from multiple sites, considering network latency and
data distribution.
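The cost-based reasoning behind distributed query optimization can be sketched as follows: each candidate plan's cost is estimated as local processing plus network transfer, and the cheapest plan wins. The cost weights and site names below are invented for illustration.

```python
# Toy cost model for a distributed join: processing is cheap, shipping
# rows over the network is expensive, so join where the big relation is.

def plan_cost(local_rows, shipped_rows, transfer_cost_per_row=5, cpu_cost_per_row=1):
    """Estimated cost = local processing + network transfer."""
    return local_rows * cpu_cost_per_row + shipped_rows * transfer_cost_per_row

def choose_join_site(rows_at_a, rows_at_b):
    """Compare executing the join at site A vs. site B; ship the other operand."""
    cost_at_a = plan_cost(rows_at_a + rows_at_b, shipped_rows=rows_at_b)
    cost_at_b = plan_cost(rows_at_a + rows_at_b, shipped_rows=rows_at_a)
    return ("A", cost_at_a) if cost_at_a <= cost_at_b else ("B", cost_at_b)

site, cost = choose_join_site(rows_at_a=100_000, rows_at_b=500)
assert site == "A"  # shipping 500 rows is far cheaper than shipping 100,000
```

Real optimizers use far richer statistics (selectivities, indexes, current network load), but the shape of the decision is the same.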

In Distributed Database Design, selecting an appropriate design strategy is crucial to ensure


optimal system performance, data availability, and scalability. Below are some of the common
alternative design strategies:
1. Top-Down Design

 Description: Begins with a global schema and divides it into fragments, which are
allocated to different sites in the network.
 Steps:
o Define the global requirements and conceptual schema.
o Fragment the database based on usage patterns.
o Allocate fragments to various sites.
 Advantages: Ensures data consistency and completeness from the outset.
 Disadvantages: Complex and time-consuming for large systems.

2. Bottom-Up Design

 Description: Starts by analyzing existing local databases and integrating them into a
distributed system.
 Steps:
o Analyze local schema and data requirements.
o Identify common data entities and relationships.
o Integrate schemas into a global conceptual schema.
 Advantages: Simplifies integration when existing databases are already in place.
 Disadvantages: Potential inconsistency in the global schema due to different local design
approaches.

3. Mixed Design (Hybrid Approach)

 Description: Combines the top-down and bottom-up approaches, balancing structure and
flexibility.
 Steps:
o Analyze global requirements for critical data.
o Incorporate local databases and refine the design incrementally.
 Advantages: Provides a balanced approach, leveraging the strengths of both top-down
and bottom-up methods.
 Disadvantages: Can be complex and require careful coordination.

4. Centralized Design with Distribution at a Later Stage

 Description: Initially designs the database as a centralized system and then distributes
data across multiple sites.
 Advantages: Simple initial design and efficient testing of centralized functionality.
 Disadvantages: May require significant redesign for distribution, leading to potential
delays.

5. Fragmentation Strategies in Distributed Design

 Fragmentation is often necessary regardless of the chosen design strategy:


o Horizontal Fragmentation: Dividing a table into subsets of rows.
o Vertical Fragmentation: Dividing a table into subsets of columns.
o Mixed Fragmentation: Combination of both horizontal and vertical
fragmentation.

6. Replication Strategies

 Full Replication: Entire database replicated at each site.


 Partial Replication: Only essential fragments are replicated at some sites.
 Advantages: Improves data availability and fault tolerance.
 Disadvantages: Increases storage and synchronization complexity.

7. Allocation Strategies

 Allocating fragments across sites optimally:


o Centralized Allocation: All fragments stored at a single site.
o Decentralized Allocation: Fragments are distributed across multiple sites.
o Criteria for Allocation: Minimized response time, load balancing, and reduced
communication costs.

Fragmentation
Fragmentation is the task of dividing a table into a set of smaller tables. The subsets of the table are called fragments. Fragmentation
can be of three types: horizontal, vertical, and hybrid (combination of horizontal and vertical). Horizontal fragmentation can further be
classified into two techniques: primary horizontal fragmentation and derived horizontal fragmentation.

Fragmentation should be done in a way that allows the original table to be reconstructed from the fragments whenever
required. This requirement is called "reconstructiveness."
Rules of fragmentation:
a) Completeness: every data item in the original relation must appear in at least one of the fragments.
b) Reconstruction: it must be possible to rebuild the original relation from its fragments.
c) Disjointness: the fragments are mutually exclusive, meaning that a particular data record or tuple belongs to
only one fragment and not multiple fragments.
(Example:
If a table Customers is horizontally fragmented into Customers_North and Customers_South, disjointness ensures that no customer
record appears in both fragments; each customer belongs to either the North or South fragment, but not both.)
d) Allocation alternatives: fragments may be allocated with replication or without replication (non-replicated).
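The completeness and disjointness rules, using the Customers_North/Customers_South example above, can be checked in a short sketch (the rows here are invented sample data):

```python
# Horizontal fragmentation by region, with checks for completeness
# (the fragments together hold every row) and disjointness (no row
# appears in both fragments).

customers = [
    {"id": 1, "region": "North"}, {"id": 2, "region": "South"},
    {"id": 3, "region": "North"},
]

customers_north = [c for c in customers if c["region"] == "North"]
customers_south = [c for c in customers if c["region"] == "South"]

# Completeness / reconstruction: the union rebuilds the full table.
assert sorted(c["id"] for c in customers_north + customers_south) == [1, 2, 3]

# Disjointness: no customer id appears in both fragments.
north_ids = {c["id"] for c in customers_north}
south_ids = {c["id"] for c in customers_south}
assert north_ids.isdisjoint(south_ids)
```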

Advantages of Fragmentation

Since data is stored close to the site of usage, the efficiency of the database system is increased. Local query optimization techniques are
sufficient for most queries, since data is locally available. Since irrelevant data is not available at the sites, the security and privacy of the
database system can be maintained.

Disadvantages of Fragmentation

When data from different fragments is required, access speeds may be very low. In the case of recursive fragmentation, the job of
reconstruction will need expensive techniques. A lack of backup copies of data at different sites may render the database ineffective in
case of failure of a site.

Vertical Fragmentation

In vertical fragmentation, the fields or columns of a table are grouped into fragments. In order to maintain reconstructiveness, each
fragment should contain the primary key field(s) of the table. Vertical fragmentation can be used to enforce privacy of data.
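A minimal sketch of this idea, assuming an illustrative primary key column "id": each vertical fragment carries the key, so the original rows can be rebuilt with a lossless join, and the sensitive column can live in its own fragment.

```python
# Vertical fragmentation sketch: split columns into two fragments, each
# keeping the primary key, then reconstruct losslessly by joining.

employees = [
    {"id": 1, "name": "Asha", "salary": 50000},
    {"id": 2, "name": "Ravi", "salary": 60000},
]

# Fragment 1 holds public columns, fragment 2 the private ones.
frag_public  = [{"id": e["id"], "name": e["name"]} for e in employees]
frag_private = [{"id": e["id"], "salary": e["salary"]} for e in employees]

def reconstruct(f1, f2):
    """Lossless join on the shared primary key."""
    by_id = {r["id"]: r for r in f2}
    return [{**r, **by_id[r["id"]]} for r in f1]

assert reconstruct(frag_public, frag_private) == employees
```

Keeping salary in its own fragment is how vertical fragmentation can enforce the privacy goal mentioned above: the site holding frag_public never stores salaries.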

Horizontal Fragmentation

Horizontal fragmentation groups the tuples of a table according to the values of one or more fields. Horizontal fragmentation should
also conform to the rule of reconstructiveness. Each horizontal fragment must have all the columns of the original base table.

Hybrid Fragmentation

In hybrid fragmentation, a combination of horizontal and vertical fragmentation techniques is used. This is the most flexible
fragmentation technique, since it generates fragments with minimal extraneous information. However, reconstruction of the original table
is often an expensive task.

Hybrid fragmentation can be done in two alternative ways:

o First generate a set of horizontal fragments, then generate vertical fragments from one or more of the horizontal fragments.
o First generate a set of vertical fragments, then generate horizontal fragments from one or more of the vertical fragments.
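The first of these two orderings can be sketched as follows (table, column, and region names are illustrative):

```python
# Hybrid fragmentation sketch: horizontal split by region first, then
# a vertical split of one horizontal piece, with the key kept in each
# vertical fragment so rows stay reconstructible.

accounts = [
    {"id": 1, "region": "East", "balance": 100},
    {"id": 2, "region": "West", "balance": 200},
]

# Step 1: horizontal fragment (a subset of rows).
east = [a for a in accounts if a["region"] == "East"]

# Step 2: vertical fragments of the East piece (subsets of columns).
east_keys     = [{"id": a["id"], "region": a["region"]} for a in east]
east_balances = [{"id": a["id"], "balance": a["balance"]} for a in east]

assert east_keys == [{"id": 1, "region": "East"}]
assert east_balances == [{"id": 1, "balance": 100}]
```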

Data Allocation
Data allocation is the process of deciding where exactly to store the data in the database. It also involves deciding which data has
to be stored at which particular location. The three main types of data allocation are centralized, partitioned, and replicated.
Centralized: the entire database is stored at a single site; no data distribution occurs.
Partitioned: the database is divided into fragments that are stored at several sites.
Replicated: copies of the database are stored at different sites so data can be accessed locally.
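As a toy illustration of allocation criteria such as reduced communication cost, the sketch below places each fragment at the site that accesses it most often; the fragment names and access counts are invented for the example.

```python
# "Best fit" allocation sketch: put each fragment where it is used most,
# so the bulk of accesses stay local instead of crossing the network.

access_counts = {  # fragment -> {site: accesses per day}
    "orders_eu": {"paris": 900, "tokyo": 50},
    "orders_jp": {"paris": 40, "tokyo": 700},
}

def allocate(counts):
    """Assign each fragment to the site with the highest access count."""
    return {frag: max(sites, key=sites.get) for frag, sites in counts.items()}

assert allocate(access_counts) == {"orders_eu": "paris", "orders_jp": "tokyo"}
```

Real allocation also weighs storage limits, load balancing, and replication, which this single-criterion rule ignores.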
