Contents
2 Conceptual Foundations 2
4 Related Work 5
5 Preliminary Outline 10
References 11
2. Problem Statement:
Despite the significant progress in distributed systems, several critical challenges remain,
particularly in maintaining data consistency and in implementing robust consistency models and algorithms.
These challenges arise from the inherent trade-offs between consistency, availability, and
performance. The selection of the most suitable consistency model and algorithm for a specific
application can be a complex task, given the nuances of these trade-offs. The rapid evolution
and expansion of distributed systems, the growing demand for geo-replication, and the
increasing complexity of applications further compound these challenges. Moreover, factors
such as network latency, the potential failure of individual nodes, and data replication issues can
significantly affect these algorithms' performance.
These challenges can be addressed by utilizing best practices such as consensus algorithms,
implementing ACID properties, distributed lock managers, data partitioning, two-phase commit
protocols, and appropriate use of data replication. However, the practical application and
integration of these strategies in a real-world distributed system environment present a complex
problem due to the dynamic nature of these environments and the need for balancing efficiency,
reliability, and data integrity.
Consequently, this thesis addresses the problem of ensuring data consistency in distributed
systems amidst these challenges. Specifically, the focus will be on improving data consistency in
geo-replicated distributed systems. The inherent complexity of such systems and the unique
challenges of geo-replication - such as increased network latency, nuanced conflict resolution,
and specific data replication strategies - make this a crucial study area. This focus will allow for
an in-depth exploration of the methods and mechanisms that can enhance data consistency
across geographically dispersed nodes in a distributed system, thereby contributing to the
reliability and efficiency of these increasingly prevalent systems.
2 Conceptual Foundations
Data consistency in a distributed system is critical to ensure that all nodes in the system reflect
the same data state. This consistency is maintained through several techniques, such as data
replication, which enhances fault tolerance, reliability, and accessibility by maintaining
redundancy among software and hardware components. Data replication can also bring the data
closer to end users, reducing latency and improving the user experience [6].
Distributed systems employ a variety of consistency models, which are sets of rules specifying
the ordering and visibility of updates across different nodes in the system. These models provide
a contract between the system and the programmer, promising that if operations on memory
follow the rules defined by the model, the results of these operations will be predictable and
consistent [7].
Strong consistency models, such as strict and sequential consistency, are the most rigorous. In strict
consistency, a write to a variable by any processor must be seen instantaneously by all
processors. This model is the most rigid, and while it guarantees that a programmer will always
receive the expected result, its practical relevance is limited due to the impossibility of
instantaneous message exchange. In sequential consistency, a write to a variable does not need
to be seen instantaneously, but writes by different processors must be seen in the same order by
all processors.
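As a toy illustration of the sequential-consistency requirement, the sketch below (with hypothetical processor histories) checks whether all processors observed the writes to a variable in the same order; the full definition is stricter, since each processor's own program order must also be respected.

```python
# Illustrative sketch: sequential consistency requires every processor to
# observe writes in one agreed total order (histories here are hypothetical).

def same_write_order(observations):
    """Return True if every processor saw the writes in an identical order."""
    orders = list(observations.values())
    return all(order == orders[0] for order in orders[1:])

# Each processor's observed sequence of writes to a shared variable x:
consistent = {"P1": ["x=1", "x=2"], "P2": ["x=1", "x=2"]}
inconsistent = {"P1": ["x=1", "x=2"], "P2": ["x=2", "x=1"]}

print(same_write_order(consistent))    # True: a legal sequential execution
print(same_write_order(inconsistent))  # False: violates sequential consistency
```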
Maintaining data consistency and integrity in distributed databases involves following certain
best practices. These include using a consensus algorithm to ensure all nodes agree on the
system's state, implementing ACID properties for transaction reliability and consistency, using a
distributed lock manager to control access to data, implementing data partitioning for workload
distribution and data accessibility, implementing a two-phase commit protocol for transaction
agreement, and using replication to ensure data availability across nodes [8].
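Of these practices, the two-phase commit protocol is the simplest to sketch. The following minimal illustration uses assumed class and method names; real implementations add timeouts, write-ahead logging, and coordinator recovery.

```python
# Minimal two-phase commit sketch (illustrative names; real systems add
# timeouts, durable logging, and failure recovery).

class Participant:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self):            # Phase 1: vote yes/no on the transaction
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def finish(self, commit):     # Phase 2: apply the coordinator's decision
        self.state = "committed" if commit else "aborted"

def two_phase_commit(participants):
    # Phase 1: the coordinator collects votes from every participant.
    votes = [p.prepare() for p in participants]
    decision = all(votes)
    # Phase 2: the global commit/abort decision is broadcast to everyone.
    for p in participants:
        p.finish(decision)
    return decision

nodes = [Participant("A"), Participant("B"), Participant("C", can_commit=False)]
print(two_phase_commit(nodes))  # False: one "no" vote aborts the transaction
```

A single "no" vote in the prepare phase forces a global abort, which is exactly how the protocol preserves atomicity across nodes.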
Different correctness and failure models may consequently lead to radically different approaches. In
the realm of databases, for example, the separation of the concepts of atomicity, isolation, and
durability presents numerous opportunities for optimization, but it also adds another layer of
complexity to the process of determining which algorithms are appropriate for various types of
circumstances. Bridging this gap in mutual understanding, and in the ramifications of
correctness and failure models, continues to be challenging.
Distributed Systems Group 3
Conflict resolution in highly scalable systems: Over the past several years, many
products and enterprises have adopted conflict-free replicated data types (CRDTs) to meet
high-availability requirements in the face of concurrent data changes. Recent
advances in collaborative editing systems can allow hundreds of people to
collaborate on a shared document or data item with little impact on the system's
performance. Work in this area has also covered programming approaches, static analyses,
and related tools for using CRDTs safely when eventual consistency alone is insufficient
to protect application invariants.
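The convergence property that makes CRDTs attractive is easy to see in a grow-only counter (G-Counter), a standard introductory CRDT; the sketch below is illustrative rather than any specific product's implementation.

```python
# Sketch of a grow-only counter (G-Counter), one of the simplest CRDTs.
# Each replica increments only its own slot; merge takes element-wise maxima,
# so concurrent updates converge without coordination.

class GCounter:
    def __init__(self, replica_id, n_replicas):
        self.id = replica_id
        self.counts = [0] * n_replicas

    def increment(self):
        self.counts[self.id] += 1

    def merge(self, other):
        self.counts = [max(a, b) for a, b in zip(self.counts, other.counts)]

    def value(self):
        return sum(self.counts)

a, b = GCounter(0, 2), GCounter(1, 2)
a.increment(); a.increment()   # replica 0 counts two events
b.increment()                  # replica 1 counts one event concurrently
a.merge(b); b.merge(a)         # anti-entropy exchange, in either order
print(a.value(), b.value())    # both replicas converge to 3
```

Because the merge is commutative, associative, and idempotent, replicas can exchange state in any order, any number of times, and still agree.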
Research questions:
Research Question 1: What are consistency models and algorithms in distributed systems and
databases? What are the different types available in each?
Research Question 2: How do consistency models and algorithms impact the performance of
distributed systems and databases? What are the challenges and limitations associated with
various consistency models and algorithms?
Research Question 3: What are the recommended solutions in the literature to overcome the
constraints and limitations of consistency models and algorithms in distributed systems and
databases? How do these solutions affect scalability and availability?
4 Related Work
Numerous studies have explored consistency models, algorithms, and approaches to ensure data
consistency in distributed databases and systems. These works contribute to understanding the
trade-offs between consistency, availability, and performance in distributed environments. The
following overview highlights some notable research efforts in this area:
One notable study by Diogo M. et al. [11] presents a comprehensive comparison of various
consistency models, including strong consistency, weak consistency, eventual consistency, and
causal consistency. The authors analyse each model's benefits, limitations, and practical
implications, shedding light on their impact on system performance and data integrity. They
evaluate the consistency guarantees these models provide, considering factors such as
synchronization overhead, latency, and fault tolerance. The study also explores real-world
applications of different consistency models and discusses the challenges of maintaining
consistency in distributed databases and systems.
Siswandi Agung Hidayat et al. [12] focus on the evaluation and performance analysis of
different consensus algorithms, such as Paxos and Raft. The study investigates how these
algorithms affect distributed systems' data consistency, availability, and fault tolerance. The
authors provide insights into the strengths and weaknesses of each algorithm, considering
factors like network latency and failure scenarios. They conduct experiments using a simulated
distributed system to evaluate the performance of the consensus algorithms under various
workload conditions and system configurations. The findings highlight the impact of consensus
algorithms on data consistency and the trade-offs between performance and fault tolerance.
Similarly, Ouri Wolfson [13] conducts a comparative study of transaction management
protocols in distributed databases, including two-phase commit and three-phase commit. The
research examines the implications of these protocols on data consistency and concurrency
control, discussing the challenges and optimizations associated with each approach. The author
analyses the performance characteristics of each protocol, such as latency and scalability, and
evaluates their effectiveness in ensuring transactional consistency across distributed systems. The
study also considers scenarios with high transactional workloads and network latency to assess
the protocols' performance under realistic conditions.
Furthermore, a survey by Sashi Tarun et al. [14] presents an overview of various techniques and
strategies for data replication in distributed databases. The authors discuss replication models,
such as primary-backup and multi-primary replication, and explore their impact on data
consistency and availability. The survey covers different replication strategies, including
synchronous and asynchronous replication, and examines their trade-offs regarding consistency
guarantees, performance, and fault tolerance. Additionally, the authors discuss conflict
resolution mechanisms employed in replicated databases to handle conflicts arising from
concurrent updates.
This trade-off is captured by the CAP theorem. The proof of the theorem shows that a system cannot ensure
(strong) consistency, availability, and partition tolerance all at the same time. Causal
consistency is one of the weaker consistency models that may be employed in distributed
systems to improve partition tolerance while retaining availability; under this model, data may
be read from any partition of the system. In this body of work, the authors provided a technique for
automatically determining whether the executions of distributed or concurrent systems are
causally consistent [15]. They reduced the difficulty of evaluating whether an execution is causally
consistent by translating it into queries over a data log, which makes the task simpler to manage.
The reduction is based on a precise characterization of the executions that violate causal consistency,
namely the presence of cycles in suitably defined relations between the operations that occur in
these executions; such executions contradict causality because they do not proceed rationally
from one action to the next. The authors demonstrated the benefits of the proposed method by
integrating it into a distributed database testing tool and executing several tests on
genuine case studies, which confirmed the method's efficacy [16].
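The happens-before relation underlying such causal-consistency checks is commonly tracked with vector clocks; the following sketch is a generic illustration of that relation, not the cited tool's actual reduction.

```python
# Illustrative vector-clock sketch: causal consistency hinges on the
# happens-before relation, which vector clocks capture. The events and
# clock values below are assumptions for demonstration.

def happens_before(vc_a, vc_b):
    """True if the event stamped vc_a causally precedes the one stamped vc_b."""
    return all(a <= b for a, b in zip(vc_a, vc_b)) and vc_a != vc_b

def concurrent(vc_a, vc_b):
    """True if neither event causally precedes the other."""
    return not happens_before(vc_a, vc_b) and not happens_before(vc_b, vc_a)

w1 = [1, 0]  # write seen only by replica 0
w2 = [1, 1]  # write issued after replica 1 observed w1
w3 = [0, 1]  # write issued with no knowledge of w1

print(happens_before(w1, w2))  # True: w2 causally depends on w1
print(concurrent(w1, w3))      # True: no causal order between w1 and w3
```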
Another article presents "Coo," a framework for checking the consistency of transactional
databases. Coo's specific technical contributions are as follows. First, Coo
employs a partial order pair (POP) graph, which gives more expressiveness
for modelling schedules: it accounts for stateful operations such as Commit and
Abort, and it describes transaction conflicts in more detail. In Coo's terms, a schedule is fully
consistent when its POP graph contains no cycle. Second, Coo can
generate inconsistent test cases based on POP cycles. These test cases may be used to investigate
database consistency in a manner that is precise (they can detect a wide variety of anomalies),
accessible (they are based on SQL), and inexpensive (the check runs once and
takes only a few minutes). The authors tested Coo on eleven separate databases, some distributed
and others centralized, at each of the isolation levels each database allows. According
to the study's results, databases do not entirely comply with the ANSI SQL standard (for
example, Oracle claimed to be serializable but exhibited inconsistencies in particular
scenarios). Furthermore, each database has its own implementation methodologies and
concurrency control behaviours (for example, PostgreSQL, MySQL, and SQL Server all
handle Repeatable Read in quite different ways). Coo thus helps bridge the gap between
coarse isolation levels, enabling the discovery of more complicated and comprehensive
inconsistent behaviours [17].
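The acyclicity condition at the heart of such checks can be sketched with an ordinary cycle test over a dependency graph; the edges below are illustrative and do not reproduce Coo's actual POP construction.

```python
# Hedged sketch of the acyclicity test behind conflict-graph-style consistency
# checks: a schedule is consistent in this sense iff its dependency graph is
# acyclic. Transaction names and edges are assumptions for demonstration.

def has_cycle(graph):
    """Detect a cycle in a directed graph given as {node: [successors]}."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {v: WHITE for v in graph}

    def visit(v):
        color[v] = GRAY
        for w in graph.get(v, []):
            if color.get(w, WHITE) == GRAY:      # back edge: cycle found
                return True
            if color.get(w, WHITE) == WHITE and visit(w):
                return True
        color[v] = BLACK
        return False

    return any(color[v] == WHITE and visit(v) for v in graph)

acyclic = {"T1": ["T2"], "T2": ["T3"], "T3": []}      # admits a serial order
cyclic = {"T1": ["T2"], "T2": ["T3"], "T3": ["T1"]}   # T3 -> T1 closes a cycle
print(has_cycle(acyclic), has_cycle(cyclic))  # False True
```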
Configuration optimization can enhance data currency and consistency significantly; improvements
of 30% to 100% are not uncommon, and further gains are sometimes possible. The approaches for
configuration optimization covered in that article are all simple to grasp and
apply. Furthermore, the technique provides a mathematical rationale for several
intuitively recognized truths regarding the consistency and timeliness of the data [18].
Another study uses simulation to examine three distinct scheduling methods for distributed transactional
systems in the real world. The investigation aims to find a solution that satisfies both the necessity
for serializability and the need for timeliness; consequently, the scheduling algorithms must provide
both concurrency management and on-time job completion. The first algorithm achieves this via
consensus, while the other approaches rely on tokens to offer globally consistent orderings.
Activities are then scheduled using policies ranging from non-preemptive earliest deadline to
earliest start time. The simulation results for each method, together with the effects of the many
different system components, are given, and the authors examine the distinctions between these
strategies in the final part of their study [19].
Distributed databases are stored in many locations, frequently spanning multiple geographical
areas. This is an interesting subject for academic research since it introduces a whole new set of
difficulties, one of which is keeping the database in a consistent state at all times.
Because numerous users access the database simultaneously, there is reason to be
concerned about its consistency and integrity, and this issue must be addressed. Various viable
solutions, notably lock-based and timestamp-based schemes, are reviewed in that article,
and the authors analyse these strategies while considering additional factors [20].
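The timestamp-based approach mentioned above can be sketched with the basic timestamp-ordering rule, under which an operation with an older timestamp than a conflicting, already-performed younger operation is rejected; all names here are illustrative, not a specific system's API.

```python
# Minimal timestamp-ordering sketch (illustrative): each data item remembers
# the largest read/write timestamps seen so far; an operation arriving with an
# older timestamp than a conflicting later one is rejected, and its
# transaction would restart with a fresh timestamp.

class Item:
    def __init__(self):
        self.read_ts = 0
        self.write_ts = 0

def try_read(item, ts):
    if ts < item.write_ts:           # a younger transaction already wrote
        return False                  # abort/restart the reader
    item.read_ts = max(item.read_ts, ts)
    return True

def try_write(item, ts):
    if ts < item.read_ts or ts < item.write_ts:
        return False                  # would invalidate a younger operation
    item.write_ts = ts
    return True

x = Item()
print(try_write(x, ts=5))   # True: first write to x
print(try_read(x, ts=3))    # False: reader is older than the committed write
print(try_read(x, ts=7))    # True: a newer reader proceeds
```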
5 Preliminary Outline
Introduction
a. Background and motivation
b. Problem statement
c. Objectives of the study
Understanding Consistency
a. Overview of distributed databases and systems
b. Different types of consistency models (e.g., strong, weak, eventual, causal)
c. How consistency models impact data reliability and accuracy
Existing Research
a. Overview of previous studies on consistency models and algorithms
b. Comparison of different approaches and their pros and cons
c. Evaluation of performance and scalability trade-offs
Challenges and Limitations
Recommendations and Solutions
Evaluation and Performance Analysis
Conclusion
a. Summary of key findings and contributions
b. Implications of the research
c. Future research directions
References
[1] Hongwei, Duan & Ligetu, Bi. (2021). Research on Distributed Storage Technology of
Database Big Data Based on Cloud Computing. Journal of Physics: Conference Series. 1982.
012195. 10.1088/1742-6596/1982/1/012195.
[2] N, Shahana. (2022). Impact and Implications of Big Data Analytics in Cloud Computing
Platforms. International Journal for Research in Applied Science and Engineering Technology.
10. 4661-4666. 10.22214/ijraset.2022.43407.
[3] Waseem, Q., Sofiah Wan Din, W. I., Alshamrani, S. S., Alharbi, A., & Nazir, A. (2021,
March 12). Quantitative Analysis and Performance Evaluation of Target-Oriented Replication
Strategies in Cloud Computing. MDPI. https://doi.org/10.3390/electronics10060672
[4] Gill, S., Varshney, H., Chola, A., & Chhabra, M. (2022, May 6). Data Replication in
Distributed Systems: The Best Guide 101 - Learn | Hevo. Learn | Hevo.
https://hevodata.com/learn/data-replication-in-distributed-system/
[5] McLaren, K & Burnett, RA & Goodlad, John & Howatson, S & Lang, S & Lee, F &
Lessells, A & Ogston, Simon & Robertson, AJ & Simpson, J & Smith, G & Tavadia, H &
Group, F. (2003). Consistency of histopathological reporting of laryngeal dysplasia.
Histopathology. 37. 460 - 463. 10.1046/j.1365-2559.2000.00998.x.
[6] Steen, M. V., & Tanenbaum, A. S. (2016, August 16). A brief introduction to distributed
systems - Computing. SpringerLink. https://doi.org/10.1007/s00607-016-0508-7
[7] Using artificial intelligence and data fusion for environmental monitoring: A review and
future perspectives. (2022, June 25). Using Artificial Intelligence and Data Fusion for
Environmental Monitoring: A Review and Future Perspectives - ScienceDirect.
https://doi.org/10.1016/j.inffus.2022.06.003
[8] Zhu, C., Li, J., Zhong, Z., Yue, C., & Zhang, M. (2023, April 24). A Survey on the
Integration of Blockchains and Databases - Data Science and Engineering. SpringerLink.
https://doi.org/10.1007/s41019-023-00212-z
[9] Sharfuddin, Mohammed & Ragunathan, Thirumalaisamy. (2022). Improving Performance of
Cloud Storage Systems Using Support-Based Replication Algorithm. ECTI Transactions on
Computer and Information Technology (ECTI-CIT). 17. 14-26. 10.37936/recti-
cit.2023171.247333.
[10] Pham, Van-Nam & Hossain, Md. Delowar & Lee, Ga-Won & Huh, Eui-nam. (2023).
Efficient Data Delivery Scheme for Large-Scale Microservices in Distributed Cloud
Environment. Applied Sciences. 13. 886. 10.3390/app13020886.
[11] Diogo, M., Cabral, B., & Bernardino, J. (2019, February 14). Consistency Models of
NoSQL Databases. MDPI. https://doi.org/10.3390/fi11020043
[12] Performance Comparison and Analysis of Paxos, Raft, and PBFT Using NS3. (n.d.).
Performance Comparison and Analysis of Paxos, Raft, and PBFT Using NS3 | IEEE Conference
Publication | IEEE Xplore. https://ieeexplore.ieee.org/document/9975938
[13] Wolfson, O. (2005, January 1). A comparative analysis of two-phase-commit protocols. A
Comparative Analysis of Two-phase-commit Protocols | SpringerLink.
https://doi.org/10.1007/3-540-53507-1_84
[14] A Review on Fragmentation, Allocation and Replication in Distributed Database Systems.
(n.d.). A Review on Fragmentation, Allocation and Replication in Distributed Database Systems
| IEEE Conference Publication | IEEE Xplore.
https://ieeexplore.ieee.org/abstract/document/9004233
[15] Computing Systems, IEEE International Workshop. 257. 10.1109/FTDCS.1997.644735.
[16] Zennou, Rachid & Biswas, Ranadeep & Bouajjani, Ahmed & Enea, Constantin & Erradi,
Mohamed. (2021). Checking causal consistency of distributed databases. Computing. 104. 1-21.
10.1007/s00607-021-00911-3.
[17] Li, Haixiang & Chen, Yuxing & Li, Xiaoyan. (2022). Coo: Consistency Check for
Transactional Databases. 10.48550/arXiv.2206.14602.
[18] Leung, Clement & Wolfenden, K. (1985). Analysis and Optimisation of Data Currency and