
Contents

1 Introduction and Problem Statement

2 Conceptual Foundations

3 Objective and Research Questions

4 Related Work

5 Preliminary Outline

References

1 Introduction and Problem Statement

In today's big data and cloud computing world, distributed databases and systems are quickly
becoming the standard rather than the exception [1][2]. These systems consist of multiple nodes,
often located in different parts of the world, that cooperate to store and process data in a
decentralized fashion. A key motivation for this distribution is enhancing reliability and
performance through replication. Replication stores multiple copies of the same data on
different nodes in the system, significantly improving data availability and system reliability
[3][4]. However, one of the most critical challenges in these systems is ensuring that the data is
consistent and that all nodes have the same view of the database. Inconsistencies can lead to data
loss, corruption, and inaccurate results. Therefore, consistency models and algorithms are
needed to guarantee the integrity and consistency of data, especially in the face of replication in
distributed databases and systems.
Several consistency models, such as strong consistency, weak consistency, eventual consistency,
and causal consistency, amongst others, have been presented [5]. Each of these models strikes a
different compromise between availability, performance, and consistency. Meanwhile, several
algorithms for enforcing these models have been developed, including distributed locking,
two-phase commit, Paxos, and Raft. These algorithms address different facets of consistency,
such as atomicity, durability, and consistency level. The choice of a suitable model and
algorithm largely depends on the specific application's requirements and the nature of the
distributed system.

Problem Statement:

Despite significant progress in distributed systems, several critical challenges remain,
particularly in ensuring data consistency and in implementing robust consistency models and algorithms.
These challenges arise from the inherent trade-offs between consistency, availability, and
performance. The selection of the most suitable consistency model and algorithm for a specific
application can be a complex task, given the nuances of these trade-offs. The rapid evolution
and expansion of distributed systems, the growing demand for geo-replication, and the
increasing complexity of applications further compound these challenges. Moreover, factors
such as network latency, the potential failure of individual nodes, and data replication issues can
significantly affect these algorithms' performance.

These challenges can be addressed by applying best practices such as consensus algorithms,
ACID properties, distributed lock managers, data partitioning, two-phase commit protocols, and
the appropriate use of data replication. However, the practical application and integration of
these strategies in a real-world distributed system environment present a complex problem,
owing to the dynamic nature of these environments and the need to balance efficiency,
reliability, and data integrity.

Consequently, this thesis addresses the problem of ensuring data consistency in distributed
systems amidst these challenges. Specifically, the focus will be on improving data consistency in
geo-replicated distributed systems. The inherent complexity of such systems and the unique
challenges of geo-replication, such as increased network latency, nuanced conflict resolution,
and specific data replication strategies, make this a crucial area of study. This focus allows for
an in-depth exploration of the methods and mechanisms that can enhance data consistency
across geographically dispersed nodes in a distributed system, thereby contributing to the
reliability and efficiency of these increasingly prevalent systems.

2 Conceptual Foundations

Data consistency in a distributed system is critical to ensure that all nodes reflect the same data
state. Consistency is maintained through several techniques, such as data replication, which
enhances fault tolerance, reliability, and accessibility by maintaining redundancy among
software and hardware components. Data replication can also bring data closer to end users,
reducing latency and improving the user experience [6].

Distributed systems employ a variety of consistency models: sets of rules specifying the
ordering and visibility of updates across the different nodes in the system. These models provide
a contract between the system and the programmer, promising that if operations on memory
follow the rules defined by the model, the results of these operations will be predictable and
consistent [7].

Strong consistency models, such as strict and sequential, are the most rigorous. In strict
consistency, a write to a variable by any processor must be seen instantaneously by all
processors. This model is the most rigid, and while it guarantees that a programmer will always
receive the expected result, its practical relevance is limited due to the impossibility of
instantaneous message exchange. In sequential consistency, a write to a variable does not need
to be seen instantaneously, but writes by different processors must be seen in the same order by
all processors.
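
To make the contrast concrete, the following minimal Python sketch models sequential consistency; all names, such as Sequencer and Replica, are illustrative rather than taken from any cited system. A single sequencer assigns every write a position in one global log, and each replica applies writes strictly in that order, even though a replica may lag behind. Strict consistency would additionally require every write to be visible everywhere at the instant it is issued, which no real system can achieve.

```python
# Minimal sketch of sequential consistency (illustrative names only):
# a single sequencer assigns one global order to all writes, and every
# replica applies writes in that order. Replicas may lag (reads can be
# stale), but no replica ever observes writes out of order.

class Sequencer:
    def __init__(self):
        self.log = []                      # globally ordered write log

    def append(self, key, value):
        self.log.append((key, value))      # one total order for all writers

class Replica:
    def __init__(self, sequencer):
        self.seq = sequencer
        self.applied = 0                   # how far this replica has caught up
        self.store = {}

    def sync(self):
        # Apply pending writes in the global order, never reordering them.
        for key, value in self.seq.log[self.applied:]:
            self.store[key] = value
        self.applied = len(self.seq.log)

    def read(self, key):
        return self.store.get(key)

seq = Sequencer()
r1, r2 = Replica(seq), Replica(seq)
seq.append("x", 1)
seq.append("x", 2)
r1.sync()
print(r1.read("x"))   # -> 2: r1 has applied both writes, in order
print(r2.read("x"))   # -> None: r2 lags, but can never see 2 before 1
```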

Maintaining data consistency and integrity in distributed databases involves following certain
best practices. These include using a consensus algorithm to ensure all nodes agree on the
system's state, implementing ACID properties for transaction reliability and consistency, using a
distributed lock manager to control access to data, implementing data partitioning for workload
distribution and data accessibility, implementing a two-phase commit protocol for transaction
agreement, and using replication to ensure data availability across nodes [8].
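
To illustrate the two-phase commit step from the list above, the following is a minimal sketch assuming in-memory participants with hypothetical prepare, commit, and abort methods; a production implementation would also need durable decision logging and timeout handling to survive coordinator crashes.

```python
# Minimal two-phase commit sketch (illustrative names, in-memory only):
# the coordinator commits only if every participant votes "yes" in the
# prepare phase; a single "no" vote aborts the transaction everywhere.

class Participant:
    def __init__(self, name, healthy=True):
        self.name, self.healthy = name, healthy
        self.state = "init"

    def prepare(self):                     # phase 1: vote yes/no
        self.state = "prepared" if self.healthy else "vote-abort"
        return self.healthy

    def commit(self):                      # phase 2a: apply the transaction
        self.state = "committed"

    def abort(self):                       # phase 2b: roll back
        self.state = "aborted"

def two_phase_commit(participants):
    if all(p.prepare() for p in participants):   # phase 1: collect votes
        for p in participants:                   # phase 2: commit everywhere
            p.commit()
        return "committed"
    for p in participants:                       # any "no" vote aborts all
        p.abort()
    return "aborted"

nodes = [Participant("eu-west"), Participant("us-east", healthy=False)]
print(two_phase_commit(nodes))   # -> "aborted": one failed vote aborts all
```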

Distributed system and database management technologies: As the number of replicated data
stores grows, the two distinct but closely linked fields of distributed systems and databases
increasingly converge.
The results of each community's efforts should be shared with the other. A concern shared by
all stakeholders is ensuring that all copies of a database are kept adequately up to date.
Historically, the distributed systems field relied on outdated consensus algorithms or looked to
weaker consistency models for many of its methodologies, whereas database systems focused
chiefly on two-phase commit protocols and eager update protocols. At the same time, the
database community considered other ACID properties, such as those requiring the combination
of commit processing with concurrency control protocols and recovery mechanisms. However,
the work of the two groups has grown increasingly entangled over the past ten years, notably
with the introduction of real-world implementations of the Paxos consensus algorithm and of
file replication in storage systems to maintain availability [9]. One remaining issue is that the
work done by the different groups continues to rest on slightly different assumptions about
failure and correctness models, assumptions often so intricate that even the most seasoned
specialists may struggle to differentiate between them.

Consequently, these differing assumptions may lead to radically different approaches to the
same problem. In the realm of databases, for example, the separation of the concepts of
atomicity, isolation, and durability presents numerous opportunities for optimization. Still, it
also adds another layer of complexity to the process of determining which algorithms are
appropriate for which circumstances. Bridging this gap, so that each community understands the
other and the ramifications of its correctness and failure models, remains challenging.

Conflict resolution issues in highly scalable systems: Over the past several years, many
products and enterprises have adopted conflict-free replicated data types (CRDTs) to meet
high-availability requirements in the face of concurrent data changes. Recent advancements in
collaborative editing systems have the potential to allow hundreds of people to collaborate on a
shared document or data item with little impact on the system's performance. Programming
approaches, static analyses, and related tools have also been proposed for using CRDTs safely
when eventual consistency alone is insufficient to protect application invariants.
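
To illustrate why CRDTs need no coordination, the following is a minimal sketch of a grow-only counter (G-Counter), one of the simplest CRDTs. The class and method names are our own, and production implementations must handle considerably more (delta propagation, membership changes, and so on).

```python
# Sketch of a grow-only counter (G-Counter): each replica increments
# only its own slot, and merging takes the element-wise maximum, so
# concurrent updates converge without coordination and the order in
# which merges happen does not matter.

class GCounter:
    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}                      # replica id -> local count

    def increment(self, amount=1):
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        # Element-wise max is commutative, associative, and idempotent,
        # which is exactly what makes the type conflict-free.
        for rid, n in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), n)

a, b = GCounter("A"), GCounter("B")
a.increment(3)          # concurrent updates on different replicas
b.increment(2)
a.merge(b)
b.merge(a)
print(a.value(), b.value())   # -> 5 5: both replicas converge
```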

Distributed system programming paradigms: Despite several microservice composition and
scalability issues, the use of microservices as a standard technique for constructing large-scale
distributed systems has become increasingly prevalent [10]. Concepts such as actor-based and
data-flow programming have been put forward for debate. Design for testability and test
frameworks are critical to providing trustworthy services, but applying them well still demands
considerable experience. Further advances in programming models and in their theoretical
underpinnings should help simplify this challenging process and support programmers in
building safe systems in the future.

3 Objective and Research Questions

This study aims to analyze the consistency models and algorithms used in distributed databases
and systems and to understand the relevance of these models and algorithms in maintaining the
consistency, availability, and scalability of data in distributed systems. The primary objective is
to determine the difficulties and restrictions connected with the various consistency models and
algorithms and then to suggest solutions that can potentially enhance the overall performance of
distributed systems.

Research questions:

Research Question 1: What are consistency models and algorithms in distributed systems and
databases? What are the different types available in each?

Research Question 2: How do consistency models and algorithms impact the performance of
distributed systems and databases? What are the challenges and limitations associated with
various consistency models and algorithms?

Research Question 3: What are the recommended solutions in the literature to overcome the
constraints and limitations of consistency models and algorithms in distributed systems and
databases? How do these solutions affect scalability and availability?

4 Related Work

Numerous studies have explored consistency models, algorithms, and approaches to ensure data
consistency in distributed databases and systems. These works contribute to understanding the
trade-offs between consistency, availability, and performance in distributed environments. The
following overview highlights some notable research efforts in this area:

One notable study by Diogo M. et al. [11] presents a comprehensive comparison of various
consistency models, including strong consistency, weak consistency, eventual consistency, and
causal consistency. The authors analyse each model's benefits, limitations, and practical
implications, shedding light on their impact on system performance and data integrity. They
evaluate the consistency guarantees these models provide, considering factors such as
synchronization overhead, latency, and fault tolerance. The study also explores real-world
applications of different consistency models and discusses the challenges of maintaining
consistency in distributed databases and systems.

Siswandi Agung Hidayat et al. [12] focus on the evaluation and performance analysis of
different consensus algorithms, such as Paxos and Raft. The study investigates how these
algorithms affect distributed systems' data consistency, availability, and fault tolerance. The
authors provide insights into the strengths and weaknesses of each algorithm, considering
factors like network latency and failure scenarios. They conduct experiments using a simulated
distributed system to evaluate the performance of the consensus algorithms under various
workload conditions and system configurations. The findings highlight the impact of consensus
algorithms on data consistency and the trade-offs between performance and fault tolerance.

Similarly, Ouri Wolfson [13] conducts a comparative study of transaction management
protocols in distributed databases, including two-phase commit and three-phase commit. The
research examines the implications of these protocols on data consistency and concurrency
control, discussing the challenges and optimizations associated with each approach. The author
analyses the performance characteristics of each protocol, such as latency and scalability, and
evaluate their effectiveness in ensuring transactional consistency across distributed systems. The
study also considers scenarios with high transactional workloads and network latency to assess
the protocols' performance under realistic conditions.

Furthermore, a survey by Sashi Tarun et al. [14] presents an overview of various techniques and
strategies for data replication in distributed databases. The authors discuss replication models,
such as primary-backup and multi-primary replication, and explore their impact on data
consistency and availability. The survey covers different replication strategies, including
synchronous and asynchronous replication, and examines their trade-offs regarding consistency
guarantees, performance, and fault tolerance. Additionally, the authors discuss conflict
resolution mechanisms employed in replicated databases to handle conflicts arising from
concurrent updates.
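
One widely used conflict resolution mechanism in this setting is last-writer-wins (LWW) reconciliation. The following sketch is illustrative rather than taken from the survey: replica states map keys to (timestamp, value) pairs, and timestamps are logical (counter, replica id) pairs; systems that use wall-clock timestamps must additionally contend with clock skew.

```python
# Sketch of last-writer-wins (LWW) reconciliation between two replica
# states of the form key -> (timestamp, value). Timestamps here are
# logical (counter, replica_id) pairs, which compare lexicographically.

def lww_merge(local, remote):
    """Keep, for each key, the version carrying the larger timestamp."""
    merged = dict(local)
    for key, (ts, value) in remote.items():
        if key not in merged or ts > merged[key][0]:
            merged[key] = (ts, value)
    return merged

# Two replicas updated the same key concurrently; B's write is "later".
replica_a = {"cart": ((4, "A"), ["book"])}
replica_b = {"cart": ((5, "B"), ["book", "pen"])}
print(lww_merge(replica_a, replica_b))
# -> {'cart': ((5, 'B'), ['book', 'pen'])}: the later write wins, and
#    A's concurrent update is silently discarded (the known cost of LWW)
```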

In replicated distributed databases, data is evaluated to determine whether it is up-to-date and
consistent under both the single primary update strategy and the moving primary update
strategy. The results of a quantitative investigation of the race conditions to which the system is
exposed are used to assess the data's currency and consistency, expressed in terms of
probability. The frequency of updates and the degree of transactional activity at a site are
closely connected with the severity of the race condition, and these characteristics may affect
how timely and dependable the data is. Both homogeneous and heterogeneous systems are
examined. In robust systems, it is often necessary to perform dynamic reconfiguration of a site,
which may be required for various reasons, including a previous failure at that site.

A further fundamental constraint is the CAP theorem, whose proof shows that a system cannot
ensure (strong) consistency, availability, and partition tolerance all at the same time. The term
"causal consistency" refers to one of the weak consistency models that may be employed in
distributed systems to improve partition tolerance while retaining availability; under this
paradigm, data may be read from any system partition. In this body of work, the authors
provided a technique for automatically determining whether the executions of distributed or
concurrent systems are causally consistent [15]. They reduced the difficulty of evaluating
whether an execution is causally consistent by translating the question into Datalog queries,
making the task simpler to manage.

The reduction is based on a precise characterization of the executions that violate causal
consistency, namely the presence of cycles in suitably defined relations between the operations
occurring in those executions. The authors demonstrated the benefits of the proposed method by
integrating it into a distributed database testing tool and executing several tests on genuine case
studies, which confirmed the method's efficacy [16].
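
The heart of the reduction is therefore a cycle check over relations between operations. The following toy sketch shows only that final step, plain depth-first-search cycle detection over a directed graph of operations; the cited work derives the edges of such a graph through Datalog rules, which the sketch does not reproduce.

```python
# Toy illustration of the cycle-based characterization: an execution
# violates causal consistency iff certain derived relations over its
# operations contain a cycle. Here we merely detect a cycle in a
# directed graph of operations using depth-first search.

def has_cycle(edges):
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
        graph.setdefault(dst, [])
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {node: WHITE for node in graph}

    def dfs(node):
        color[node] = GRAY                  # node is on the current path
        for nxt in graph[node]:
            if color[nxt] == GRAY:          # back edge: cycle found
                return True
            if color[nxt] == WHITE and dfs(nxt):
                return True
        color[node] = BLACK                 # fully explored, no cycle here
        return False

    return any(color[n] == WHITE and dfs(n) for n in graph)

# w1 -> r1 -> w2 -> r2 -> w1 closes a cycle, flagging a violation:
edges = [("w1", "r1"), ("r1", "w2"), ("w2", "r2"), ("r2", "w1")]
print(has_cycle(edges))   # -> True
```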

Another article presents "Coo," a framework for checking the consistency of transactional
databases. Coo's specific technical contributions are as follows. First, Coo employs a partial
order pair (POP) graph, which offers more expressiveness for modelling schedules: it accounts
for stateful operations, such as Commit and Abort, and is more descriptive about transaction
conflicts. Under this formulation, full consistency corresponds to a POP graph without a cycle.
Second, Coo can generate inconsistent test cases from POP cycles. These test cases can be used
to investigate database consistency in a manner that is precise (they detect a wide variety of
anomalies), accessible (they are based on SQL), and inexpensive (the check only needs to run
once and takes a few minutes). The authors tested Coo on eleven separate databases, some
distributed and others centralized, at each of the isolation levels each database supports.
According to the study's results, databases do not entirely comply with the ANSI SQL standard
(for example, Oracle claimed to be serializable but exhibited inconsistencies in particular
scenarios), and each database has its own implementation methodologies and concurrency
control behaviours (for example, PostgreSQL, MySQL, and SQL Server all handle Repeatable
Read in quite different ways). Coo thus helps bridge the gap in understanding between isolation
levels, enabling the discovery of more complicated and comprehensive inconsistent
behaviours [17].

Optimization can enhance data currency and consistency significantly; improvements of 30% to
100% are not uncommon, and further gains are possible. The approaches for configuration
optimization covered in that article are all simple to grasp and use. Furthermore, the technique
provides a mathematical rationale for several intuitively recognized truths regarding the
consistency and timeliness of the data [18].

Using simulation, another study examines three distinct scheduling methods for real-world
distributed transactional systems. The investigation aims to find a solution that satisfies both the
requirement for serializability and the need for timeliness; consequently, the scheduling
algorithms must provide both concurrency management and on-time job completion. The first
algorithm achieves this via consensus, while the other approaches rely on tokens to provide
globally consistent orderings. The second stage of each technique schedules operations
according to non-preemptive deadline and earliest-start-time policies. The outcomes of the
simulations for each method, together with the effects of the many different system components,
are reported, and the distinctions between these tactics are analysed in the final part of the
study [19].
Distributed databases are kept in many locations, frequently spanning multiple geographical
areas. They are an exciting subject for academic research because they introduce a whole new
set of difficulties, one of which is keeping the database in a consistent state at all times. Because
numerous users access the database simultaneously, there is reason to be concerned about its
consistency and integrity. Various viable solutions, notably lock-based and timestamp-based
schemes, are reviewed in a further article, which also analyses these tactics in light of additional
considerations [20].
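
To make the timestamp-based approach concrete, the following is a minimal sketch of the classic basic timestamp-ordering rule, under which reads and writes that arrive "too late" relative to an item's recorded timestamps are rejected; the class below is illustrative and not drawn from the cited article.

```python
# Minimal sketch of basic timestamp-ordering (TO) concurrency control:
# each transaction carries a timestamp, and an operation arriving "too
# late" relative to the item's recorded read/write timestamps is
# rejected (in a real system the transaction restarts with a new one).

class TimestampOrdering:
    def __init__(self):
        self.read_ts = {}    # item -> largest timestamp that read it
        self.write_ts = {}   # item -> largest timestamp that wrote it

    def read(self, ts, item):
        if ts < self.write_ts.get(item, 0):
            return False                  # a younger txn already wrote
        self.read_ts[item] = max(self.read_ts.get(item, 0), ts)
        return True

    def write(self, ts, item):
        if ts < self.read_ts.get(item, 0) or ts < self.write_ts.get(item, 0):
            return False                  # the write arrives too late
        self.write_ts[item] = ts
        return True

cc = TimestampOrdering()
print(cc.write(ts=1, item="x"))   # True
print(cc.read(ts=3, item="x"))    # True
print(cc.write(ts=2, item="x"))   # False: txn 3 already read item x
```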

5 Preliminary Outline

Introduction
a. Background and motivation
b. Problem statement
c. Objectives of the study
Understanding Consistency
a. Overview of distributed databases and systems
b. Different types of consistency models (e.g., strong, weak, eventual, causal)
c. How consistency models impact data reliability and accuracy
Existing Research
a. Overview of previous studies on consistency models and algorithms
b. Comparison of different approaches and their pros and cons
c. Evaluation of performance and scalability trade-offs
Challenges and Limitations
Recommendations and Solutions
Evaluation and Performance Analysis
Conclusion
a. Summary of key findings and contributions
b. Implications of the research
c. Future research directions

References

[1] Hongwei, Duan & Ligetu, Bi. (2021). Research on Distributed Storage Technology of
Database Big Data Based on Cloud Computing. Journal of Physics: Conference Series. 1982.
012195. 10.1088/1742-6596/1982/1/012195.
[2] N, Shahana. (2022). Impact and Implications of Big Data Analytics in Cloud Computing
Platforms. International Journal for Research in Applied Science and Engineering Technology.
10. 4661-4666. 10.22214/ijraset.2022.43407.
[3] Waseem, Q., Sofiah Wan Din, W. I., Alshamrani, S. S., Alharbi, A., & Nazir, A. (2021,
March 12). Quantitative Analysis and Performance Evaluation of Target-Oriented Replication
Strategies in Cloud Computing. MDPI. https://doi.org/10.3390/electronics10060672
[4] Gill, S., Varshney, H., Chola, A., & Chhabra, M. (2022, May 6). Data Replication in
Distributed Systems: The Best Guide 101. Hevo.
https://hevodata.com/learn/data-replication-in-distributed-system/
[5] McLaren, K & Burnett, RA & Goodlad, John & Howatson, S & Lang, S & Lee, F &
Lessells, A & Ogston, Simon & Robertson, AJ & Simpson, J & Smith, G & Tavadia, H &
Group, F. (2003). Consistency of histopathological reporting of laryngeal dysplasia.
Histopathology. 37. 460 - 463. 10.1046/j.1365-2559.2000.00998.x.
[6] Steen, M. V., & Tanenbaum, A. S. (2016, August 16). A brief introduction to distributed
systems - Computing. SpringerLink. https://doi.org/10.1007/s00607-016-0508-7
[7] Using artificial intelligence and data fusion for environmental monitoring: A review and
future perspectives. (2022, June 25). ScienceDirect. https://doi.org/10.1016/j.inffus.2022.06.003
[8] Zhu, C., Li, J., Zhong, Z., Yue, C., & Zhang, M. (2023, April 24). A Survey on the
Integration of Blockchains and Databases - Data Science and Engineering. SpringerLink.
https://doi.org/10.1007/s41019-023-00212-z
[9] Sharfuddin, Mohammed & Ragunathan, Thirumalaisamy. (2022). Improving Performance of
Cloud Storage Systems Using Support-Based Replication Algorithm. ECTI Transactions on
Computer and Information Technology (ECTI-CIT). 17. 14-26.
10.37936/ecti-cit.2023171.247333.
[10] Pham, Van-Nam & Hossain, Md. Delowar & Lee, Ga-Won & Huh, Eui-nam. (2023).
Efficient Data Delivery Scheme for Large-Scale Microservices in Distributed Cloud
Environment. Applied Sciences. 13. 886. 10.3390/app13020886.
[11] Diogo, M., Cabral, B., & Bernardino, J. (2019, February 14). Consistency Models of
NoSQL Databases. MDPI. https://doi.org/10.3390/fi11020043
[12] Performance Comparison and Analysis of Paxos, Raft, and PBFT Using NS3. (n.d.). IEEE
Conference Publication, IEEE Xplore. https://ieeexplore.ieee.org/document/9975938
[13] Wolfson, O. (2005, January 1). A comparative analysis of two-phase-commit protocols.
SpringerLink. https://doi.org/10.1007/3-540-53507-1_84
[14] A Review on Fragmentation, Allocation and Replication in Distributed Database Systems.
(n.d.). IEEE Conference Publication, IEEE Xplore.
https://ieeexplore.ieee.org/abstract/document/9004233
[15] Computing Systems, IEEE International Workshop. 257. 10.1109/FTDCS.1997.644735.
[16] Zennou, Rachid & Biswas, Ranadeep & Bouajjani, Ahmed & Enea, Constantin & Erradi,
Mohamed. (2021). Checking causal consistency of distributed databases. Computing. 104. 1-21.
10.1007/s00607-021-00911-3.
[17] Li, Haixiang & Chen, Yuxing & Li, Xiaoyan. (2022). Coo: Consistency Check for
Transactional Databases. 10.48550/arXiv.2206.14602.
[18] Leung, Clement & Wolfenden, K. (1985). Analysis and Optimisation of Data Currency and
Consistency in Replicated Distributed Databases. Comput. J. 28. 518-523.
10.1093/comjnl/28.5.518.
[19] Gammar, Sonia & Kamoun, Farouk. (1997). A comparison of scheduling algorithms for
real-time distributed transactional systems. Future Trends of Distributed Computing Systems,
IEEE International Workshop. 257. 10.1109/FTDCS.1997.644735.
[20] Nalawala, Husen & Shah, Jaymin & Agrawal, Smita & Oza, Parita. (2023). Concurrency
Control in Distributed Database Systems: An In-Depth Analysis. 10.1007/978-981-19-1142-2_17.
