ASS#5 - RAID, FAULT TOLERANCE, RELIABILITY AND HPC (Autosaved) .DCB

The document discusses RAID (Redundant Array of Independent Disks) technology, highlighting its advantages such as data redundancy, fault tolerance, and improved performance. It also covers fault-tolerant systems, intelligent storage systems, and the differences between computer clusters and grid computing. Additionally, it provides definitions and calculations related to reliability, availability, and failure probabilities in various configurations.

Uploaded by

zizohossam06

Name ‫عمار دمحم محي عباس‬

Dr. ‫دمحم وفائي‬


Subject distributed
College Modern academy for engineering and
technology
ASS#5 _ RAID, FAULT TOLERANT SYSTEMS, AND CLUSTER COMPUTING

1. What Are the Advantages of RAID?


A:
Data Redundancy: One of the primary advantages of RAID is data redundancy. Different RAID
levels, such as RAID 1 (mirroring) and RAID 5 (parity), provide mechanisms to duplicate or
distribute redundant copies of data across multiple drives. This redundancy helps protect against
data loss due to drive failures.
Fault Tolerance: RAID configurations with redundancy are designed to provide fault tolerance,
ensuring that the failure of a disk drive does not result in data loss. RAID 1 (as a two-drive
mirror) and RAID 5 can withstand the failure of one drive, and RAID 6 can withstand the failure of two drives.
Improved Performance: Some RAID levels, such as RAID 0 (striping), are designed to improve
performance by striping data across multiple drives. This allows for parallel read and write
operations, enhancing overall data transfer rates. RAID 10, a combination of mirroring and
striping, provides both redundancy and improved performance.
Increased Storage Capacity: RAID configurations, especially those involving striping (e.g.,
RAID 0), can increase the overall storage capacity by combining the capacities of multiple
drives. This allows users to create larger logical volumes than what a single drive can provide.
Data Integrity: RAID configurations that use parity, such as RAID 5 and RAID 6, store parity
information alongside the data. Parity allows the array to reconstruct the contents of a failed
drive and to detect inconsistencies between data and parity during a consistency check.
Hot Swapping: Many RAID implementations support hot swapping, allowing administrators to
replace a failed drive with a new one without shutting down the system. This feature contributes
to increased system availability and reduced downtime.
Cost-Effective Solutions: Depending on the RAID level chosen, RAID can provide a cost-
effective solution for achieving specific storage goals. For instance, RAID 1 provides data
redundancy at the expense of storage capacity, while RAID 5 balances redundancy and capacity
more efficiently.
Ease of Recovery: In the event of a drive failure, RAID configurations with redundancy allow
for easier data recovery. Data can be reconstructed from the remaining drives, minimizing the
risk of permanent data loss.
RAID Controllers: Dedicated RAID controllers provide hardware-based RAID management,
offloading processing tasks from the host system's CPU. This can lead to improved system
performance, especially in high-demand environments.
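The capacity trade-offs between RAID levels mentioned above can be illustrated with a small sketch. The function name `usable_capacity` is hypothetical and not part of any RAID tool; it assumes identical drives and treats RAID 1 as a mirror in which every drive holds the same data.

```python
# Minimum number of drives per RAID level (standard minimums).
MIN_DRIVES = {"0": 2, "1": 2, "5": 3, "6": 4, "10": 4}

def usable_capacity(level: str, drives: int, size_tb: float) -> float:
    """Usable capacity in TB for `drives` identical drives (illustrative helper)."""
    if level not in MIN_DRIVES:
        raise ValueError(f"unsupported RAID level: {level}")
    if drives < MIN_DRIVES[level]:
        raise ValueError(f"RAID {level} needs at least {MIN_DRIVES[level]} drives")
    if level == "0":      # striping only: every drive holds data, no redundancy
        return drives * size_tb
    if level == "1":      # mirroring: all drives hold the same data
        return size_tb
    if level == "5":      # one drive's worth of capacity goes to parity
        return (drives - 1) * size_tb
    if level == "6":      # two drives' worth of capacity goes to parity
        return (drives - 2) * size_tb
    if drives % 2:        # RAID 10: striped mirrors need an even drive count
        raise ValueError("RAID 10 needs an even number of drives")
    return drives / 2 * size_tb

for level in ("0", "5", "6", "10"):
    print(level, usable_capacity(level, 4, 2.0))
```

With four 2 TB drives, the sketch shows RAID 5 keeping more usable space than RAID 6 or RAID 10, at the cost of tolerating only one failed drive.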
2. What Is a RAID Array?
A: RAID (Redundant Array of Independent Disks) is a storage technology that combines multiple
physical disk drives into a single logical unit.
RAID provides various advantages, primarily focused on improving data reliability,
performance, and availability.
3. Compare between Disk mirroring and Disk striping
A:
4. Draw a diagram of RAID 01 (striping and mirroring) and explain its applications?
A:

5. Draw a diagram of RAID 5 and explain its applications?


A:

RAID 5 requires at least three disks, but it is often recommended to use at least five disks for
performance reasons.
RAID 5 arrays are generally considered to be a poor choice for use on write-intensive systems
because of the performance impact associated with writing parity information.
When a disk does fail, it can take a long time to rebuild a RAID 5 array.
Performance is usually degraded during the rebuild time, and the array is vulnerable to an
additional disk failure until the rebuild is complete.
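The parity mechanism behind RAID 5 can be sketched with XOR. This is a simplified illustration, not a real RAID implementation: it handles a single stripe, keeps the parity in one place (RAID 5 rotates parity across the disks), and the helper name `xor_blocks` is hypothetical.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

data = [b"AAAA", b"BBBB", b"CCCC"]   # three data blocks of one stripe
parity = xor_blocks(data)            # parity block stored on a fourth disk

# Simulate losing the disk that held data[1] and rebuild it from the survivors.
rebuilt = xor_blocks([data[0], data[2], parity])
print(rebuilt == data[1])
```

Rebuilding one block requires reading every surviving block in the stripe, which is one reason a rebuild is slow and degrades performance while it runs.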
6. Explain the components of an intelligent storage system and mention its advantages?
A:
An intelligent storage system consists of four key components: Front-end, Cache, Back-end, and
Physical Disks.
An I/O request received from the compute system at the front-end port is processed through
cache and the back end to enable storage and retrieval of data from the physical disk.

A read request can be serviced directly from cache if the requested data is found in cache.
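The read path described above can be sketched as follows. This is a minimal model under stated assumptions: the class name `IntelligentStorage` is hypothetical, a dictionary stands in for the back end and physical disks, and the cache is unbounded (a real array would evict entries).

```python
class IntelligentStorage:
    """Toy model: front-end read requests go through cache, then the back end."""

    def __init__(self, disk):
        self.disk = disk      # stands in for the back end and physical disks
        self.cache = {}       # block number -> data held in cache memory
        self.hits = 0
        self.misses = 0

    def read(self, block):
        if block in self.cache:          # read hit: served directly from cache
            self.hits += 1
            return self.cache[block]
        self.misses += 1                 # read miss: fetch from disk via the back end
        data = self.disk[block]
        self.cache[block] = data         # keep a copy for later requests
        return data

storage = IntelligentStorage({0: b"boot", 1: b"data"})
first = storage.read(1)    # miss, goes to disk
second = storage.read(1)   # hit, served from cache
print(storage.hits, storage.misses)
```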
Business applications require high levels of performance, availability, security, and scalability.
A disk drive is a core element of storage that governs the performance of any storage system.
Some of the older disk array technologies could not overcome performance constraints due to the
limitations of a disk drive and its mechanical components.
RAID technology made an important contribution to enhancing storage performance and
reliability, but disk drives, even with a RAID implementation, could not meet the performance
requirements of today’s applications.
With advancements in technology, intelligent storage systems have evolved.
These intelligent storage systems are feature-rich RAID arrays that provide highly optimized I/O
processing capabilities.
These storage systems are configured with a large amount of memory (called cache) and multiple
I/O paths and use sophisticated algorithms to meet the requirements of performance-sensitive
applications.
These arrays have an operating environment that intelligently and optimally handles the
management, allocation, and utilization of storage resources.

7. Explain the advantages of fault-tolerant systems?


A:
Availability: Availability is defined as the property where the system is readily available for its
use at any time.
Reliability: Reliability is defined as the property where the system can work continuously
without any failure. It is generally defined as the ability of a product or service to perform as
expected over time. Formally, it is the probability that a product, piece of equipment, or
system performs its intended function for a stated period of time under specified operating
conditions.
Dependability is a measure of both the availability and reliability of a service.
Safety: Safety is defined as the property where the system can remain safe from unauthorized
access even if any failure occurs.
Maintainability: Maintainability describes how easily and quickly a failed node or system can
be repaired. Formally, it is the probability that a system or product can be retained in, or,
once it has failed, restored to, operating condition in a specified amount of time.
Increased reliability: By reducing the likelihood and potential impact of system failures, fault
tolerance boosts the reliability of your assets.
Reduced downtime: Automated fault detection and recovery systems ensure that backup
resources can be used to reduce unexpected downtime and minimize its direct and indirect costs.
More secure data: Fault-tolerant systems can reduce the risk of critical data loss or corruption
by storing crucial information in backup locations and responding in the event of data breaches
or hardware failures.
Enhanced performance: Ensuring workloads are distributed for maximum efficiency, fault-
tolerant systems can reduce bottlenecks to improve overall system performance.
Fault-tolerant systems contribute to a more resilient organization and play an important role in
business continuity as a whole.

8. Explain Fault tolerance techniques


A:
1. Fault Detection: Fault detection is the first phase, in which the distributed computing system
(DCS) is monitored continuously and its outcomes are compared with the expected output. Any
faults identified during monitoring are reported. Faults can occur for various reasons, e.g.
hardware failure, network failure, and software issues. The main aim of the first phase is to
detect faults as soon as they occur so that the assigned work is not delayed.
2. Fault Diagnosis: Fault diagnosis is the process in which the fault identified in the first phase
is diagnosed properly in order to find its root cause and nature. Fault diagnosis can be done
manually by the administrator or by using automated techniques in order
to solve the fault and perform the given task.
3. Evidence Generation: Evidence generation is defined as the process where the report of the
fault is prepared based on the diagnosis done in 2nd phase. This report involves the details of the
causes of the fault, the nature of faults, the solutions that can be used for fixing, and other
alternatives and preventions that need to be considered.
4. Assessment: Assessment is the process where the damages caused by the faults are analyzed.
It can be determined with the help of messages that are being passed from the component that
has encountered the fault. Based on the assessment further decisions are made.
5. Recovery: Recovery is the process whose aim is to make the system fault free and restore it
to a correct state, using forward recovery or backward (backup) recovery. Common recovery
techniques such as reconfiguration and resynchronization can be used.
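Two of the phases above, detection by comparing outcomes with the expected output and recovery by reconfiguration, can be sketched as follows. The function names, node names, and the rule that a missing result counts as a fault are all illustrative assumptions, not a standard API.

```python
def detect_faults(expected, observed):
    """Return the nodes whose outcome is missing or differs from the expected output."""
    return sorted(node for node, want in expected.items()
                  if observed.get(node) != want)

def reconfigure(active_nodes, faulty):
    """Recovery by reconfiguration: drop faulty nodes from the active set."""
    return [node for node in active_nodes if node not in faulty]

expected = {"node-a": 4, "node-b": 4, "node-c": 4}
observed = {"node-a": 4, "node-b": 5}          # node-b is wrong, node-c did not answer

faulty = detect_faults(expected, observed)
remaining = reconfigure(list(expected), faulty)
print(faulty, remaining)
```

Diagnosis, evidence generation, and assessment would sit between these two calls; they are omitted here because they depend on the system being monitored.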

9. Compare between hardware and software redundancy?


A:
10. Explain the Fault-tolerant steps?
A:

11. Explain the Fault-tolerant tools?


A:
12. Explain Fault tolerant of Stateless system Hot/Standby
A:

13. Explain Fault tolerant of Stateful system Hot/Standby


A:

14. Define: Failure, Error, and Fault?


A:
Fault: A fault is a defect or flaw in a hardware or software component that can cause an error.
(Fault tolerance, by contrast, is an approach to building systems able to withstand and mitigate
adverse events and operating conditions in order to dependably continue delivering the level of
service expected by the users of the system.)
Error: An error occurs when a fault causes an incorrect state or operation in the system.
Failure: A failure is the inability of the system (for example, a RAID system) to perform its
intended function due to unresolved faults and errors.

15. Define: Reliability (T), availability, Mean time between failure (MTBF), and Mean
Time to Repair (MTTR)?
A:
Reliability: Reliability is defined as the property where the system can work continuously
without any failure. Generally defined as the ability of a product or service to perform as
expected over time. Formally defined as the probability that a product, piece of equipment, or
system performs its intended function for a stated period of time under specified operating
conditions.
Availability: Availability is defined as the property where the system is readily available for its
use at any time.
MTBF: Mean Time Between Failures (MTBF), as the name suggests, is the average time
between failure of hardware modules. It is the average time a manufacturer estimates before a
failure occurs in a hardware module.
MTTR: Mean Time To Repair (MTTR) is the average time taken to repair a failed hardware module.
In an operational system, repair generally means replacing the hardware module.

16. Consider a static webpage. During an observation window of 24 hours, the service sustains 3
periods of downtime. The first outage takes 15 minutes to repair, the second lasts 30 minutes, and
the third 1 hour. Calculate the MTTR, MTBF, and the availability of the site.
A:
Downtime 1 (DT1) = 15 min, Downtime 2 (DT2) = 30 min, Downtime 3 (DT3) = 60 min
Observation window = 24 hrs = 24 x 60 = 1440 min
Downtime DT = DT1 + DT2 + DT3 = 15 + 30 + 60 = 105 min
MTTR = DT / No of failures = 105/3 = 35 min
MTBF = (Observation period - DT) / No of failures = (1440 - 105)/3 = 1335/3 = 445 min
A = MTBF/(MTBF + MTTR) = 445/(445 + 35) = 445/480 = 0.927
MTTR = 35 min
MTBF = 445 min
A = 92.7%
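The arithmetic for this question can be checked with a few lines of code. This is a minimal sketch that assumes the 24-hour window is expressed in minutes (24 x 60 = 1440) and that MTBF is taken as total uptime divided by the number of failures; the variable names are illustrative.

```python
outages_min = [15, 30, 60]          # the three downtime periods, in minutes
window_min = 24 * 60                # 24-hour observation window in minutes

downtime = sum(outages_min)                     # total downtime
failures = len(outages_min)
mttr = downtime / failures                      # mean time to repair
mtbf = (window_min - downtime) / failures       # mean uptime between failures
availability = mtbf / (mtbf + mttr)

print(f"MTTR = {mttr:.0f} min, MTBF = {mtbf:.0f} min, A = {availability:.1%}")
```

Because MTBF + MTTR here equals the window divided by the number of failures, the availability is the same as uptime divided by total time (1335/1440).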

17. Calculate the resultant reliability of two components connected in series if the probability of
failures are F1 = 0.1 and F2 = 0.2
A:
R=R1*R2=(1-F1)*(1-F2)=0.9*0.8=0.72
The probability of failure of the series system is 1 – 0.72 = 0.28, which is higher than the failure probability of either component (F1 = 0.1, F2 = 0.2).

18. Calculate the reliability of a system with three elements with R1 = 0.9, R2 = 0.8, and R3 =
0.5 if a) they are connected in series b) they are connected in parallel
[Figure: elements 1, 2, and 3 connected in series, and elements 1, 2, and 3 connected in parallel]
a) in series:
R = R1*R2*R3 = 0.9*0.8*0.5 = 0.36

b) in parallel:
F = (1-R1)*(1-R2)*(1-R3) = 0.1*0.2*0.5 = 0.01
R = 1 - F = 1 - 0.01 = 0.99
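The series and parallel rules can be written as two small helpers. This is a sketch that assumes independent components, as in the questions here; the function names `series` and `parallel` are illustrative.

```python
from math import prod

def series(reliabilities):
    """Series system: works only if every component works."""
    return prod(reliabilities)

def parallel(reliabilities):
    """Parallel system: fails only if every component fails."""
    return 1 - prod(1 - r for r in reliabilities)

rs = [0.9, 0.8, 0.5]
print(f"series:   {series(rs):.2f}")
print(f"parallel: {parallel(rs):.2f}")
```

A series arrangement is never more reliable than its weakest element, while a parallel arrangement is never less reliable than its strongest one.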

19. A system consists of three parallel components with probabilities of failure F1 = 0.08, F2 = 0.20,
and F3 = 0.20. Calculate the resultant probability of failure (F) and of failure-free operation (R).
Assume that the components are independent.
A:
In parallel systems:
F = F1 × F2 × F3 = 0.08 × 0.20 × 0.20 = 0.0032.
R = 1 – F = 1 – 0.0032= 0.9968.

20. Calculate the resultant probability of failure (F) and failure-free operation (R) for a combined
series-parallel system. Assume that the components are independent. The failure probabilities of
individual elements are: F1 = 0.08, F2 = 0.30, F3 = 0.20, and F4 = 0.10.
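[Figure: element 1 in series with a parallel group consisting of element 4 and the series pair of elements 2 and 3]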

A:
First, the reliability of elements 2 and 3 in series is calculated: R2–3 = R2 × R3 = (1 – F2) ×
(1 – F3) = (1 – 0.3) × (1 – 0.2) = 0.7 × 0.8 = 0.56.
The probability of failure is complementary to reliability, so that F2–3 = 1 – R2–3 = 1 – 0.56
= 0.44.
Then, the failure probability of this 2–3 group arranged in parallel with element 4 is obtained as
F4,2–3 = F4 × F2–3 = 0.10 × 0.44 = 0.044.
The resultant reliability of the whole system is obtained as the reliability of component 1 in
series with the subsystem 4,2–3. Here, the reliabilities must be multiplied.
The resultant reliability thus is R = R1 × R4,2–3 = (1 – F1) × (1 – F4,2–3) = (1 – 0.08) × (1 –
0.044) = 0.92 × 0.956 = 0.87952.
The resultant probability of failure is F = 1 – R = 1 – 0.87952 = 0.12048 ≈ 0.12.
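The same combined calculation can be reproduced by composing series and parallel helpers. This is a sketch that assumes independent components and the layout used in the solution: elements 2 and 3 in series, that pair in parallel with element 4, and the result in series with element 1. The helper names are illustrative.

```python
from math import prod

def series(reliabilities):
    """Series system: reliabilities multiply."""
    return prod(reliabilities)

def parallel(reliabilities):
    """Parallel system: failure probabilities multiply."""
    return 1 - prod(1 - r for r in reliabilities)

F1, F2, F3, F4 = 0.08, 0.30, 0.20, 0.10

r_23 = series([1 - F2, 1 - F3])        # elements 2 and 3 in series
r_4_23 = parallel([1 - F4, r_23])      # that pair in parallel with element 4
R = series([1 - F1, r_4_23])           # whole system: element 1 in series
F = 1 - R

print(f"R = {R:.5f}, F = {F:.5f}")
```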
21. Define cluster computing?
A:
A computer cluster is a group of linked computers that work together closely, in many respects
forming a single computer.

22. Explain how a computer cluster works?


A:
The nodes of a cluster work cooperatively as a single integrated computing resource.
Clusters vary in size but share a common framework.
A cluster typically has one or two head/root nodes and a larger number of computing nodes.
The head/root/master node is where users log in, compile code, assign tasks, coordinate jobs, and
monitor traffic across all nodes.
The computing nodes handle the heavy computation.
They execute tasks, follow instructions, and function collectively as a powerful single system.
Tasks move automatically from the head node to the computing nodes, and scheduling tools can
help distribute the workload.
Fast, reliable, and low-latency networks are crucial for supporting parallel operations in a cluster.
Each cluster node has one or more CPUs, typically with multiple cores.
For instance, if each node has two processors with 16 cores each, the node has 32 cores in total.
This means that one computing node can perform 32 tasks simultaneously, even though a
single core could do each job by itself.
A cluster with 3 such nodes can perform 96 tasks simultaneously.
An HPC system typically consists of 16–64 computers, each with one–four processors. Each
processor has two–four cores, resulting in a total of 64–256 cores.
The ability to perform more tasks using a cluster is the power of computer clusters.
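The core-count example above is a simple product, sketched here under the assumption that every node is identical and that one task runs per core; the function name `total_cores` is illustrative.

```python
def total_cores(nodes: int, processors_per_node: int, cores_per_processor: int) -> int:
    """Number of tasks a cluster of identical nodes can run at once, one task per core."""
    return nodes * processors_per_node * cores_per_processor

print(total_cores(1, 2, 16))   # one node with two 16-core processors
print(total_cores(3, 2, 16))   # three such nodes
```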
23. What is the difference between a computer cluster and grid computers?
A:
A grid is a network of independent computers distributed across various locations, working
together to achieve a common goal, whereas a cluster is a group of closely linked computers that
acts as a single system.
Grid computing involves numerous parallel computations that happen independently, so processors
do not need to communicate with each other.
Grid computing is heterogeneous, with each node handling a different task, while cluster
computing is homogeneous, with nodes performing the same task.
The aim of grid computing is to enable coordinated resource sharing and problem solving in
dynamic, multi-institutional virtual organizations.
In grid computing, large-scale science and engineering is done through the interaction of people,
computing resources, information systems, and instruments, all of which are geographically
dispersed.
The overall motivation for grids is to facilitate the routine interaction of these resources in order
to support large-scale science and engineering.
Grid computing is a form of distributed computing in which a virtual supercomputer is
composed of a cluster of networked, loosely coupled computers.
Grid computing enables the sharing, selection, and aggregation of geographically distributed
autonomous resources dynamically at runtime, depending on factors such as their
availability, capability, performance, cost, and users’ quality-of-service requirements.

24. What are the benefits of a computer cluster?


A:
1. High Performance: Clusters offer better performance than mainframe computer networks.
2. Easy to manage: Cluster computing is manageable and easy to implement.
3. Scalable: Resources can be added to the cluster as needed.
4. Expandability: Computer clusters can be expanded easily by adding additional computers to
the network. Cluster computing can combine several additional resources or networks with the
existing computer system.
5. Availability: When one node fails, the other nodes remain active and act as a proxy for the
failed node, which enhances availability.
6. Flexibility: A cluster can be upgraded to a higher specification, or additional nodes can be added.

25. What are the types of cluster computing?


A:

26. Why is Cluster Computing important?


A:
1. High Performance: Clusters offer better performance than mainframe computer networks.
2. Easy to manage: Cluster computing is manageable and easy to implement.
3. Scalable: Resources can be added to the cluster as needed.
4. Expandability: Computer clusters can be expanded easily by adding additional computers to
the network. Cluster computing can combine several additional resources or networks with the
existing computer system.
5. Availability: When one node fails, the other nodes remain active and act as a proxy for the
failed node, which enhances availability.
6. Flexibility: A cluster can be upgraded to a higher specification, or additional nodes can be added.
27. Mention Components of a Cluster Computer and explain the functions of the head/root
and computing nodes?
A:
High Performance Computers like Servers, PCs, Workstations etc.
Microkernel-based operating systems.
High speed networks or switches like Gigabit Ethernet.
NICs (Network Interface Cards)
Fast Communication Protocols and Services
Cluster middleware, which spans hardware, OS kernels, applications, and subsystems.
Parallel Programming Environment Tools (compilers, parallel virtual machines etc.).
Sequential and Parallel applications
The cluster middleware offers the illusion of a unified system image (a single system image).
The head/root/master node is where users log in, compile code, assign tasks, coordinate jobs, and
monitor traffic across all nodes.
The head/root/master node has three responsibilities:
Define the resource requirements for the given tasks.
Set the proper environment for work.
Specify those tasks and carry them out as shell commands.
Cluster Nodes= computing and head/root/master nodes
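The head node's task-assignment role can be sketched with a simple round-robin policy. This is an assumption made for illustration only: real workload schedulers also take the resource requirements and current load of each node into account, and the function and node names below are hypothetical.

```python
from itertools import cycle

def assign_tasks(tasks, compute_nodes):
    """Head-node sketch: hand tasks to the computing nodes in round-robin order."""
    if not compute_nodes:
        raise ValueError("at least one computing node is required")
    plan = {node: [] for node in compute_nodes}
    # cycle() repeats the node list; zip() stops when the tasks run out.
    for task, node in zip(tasks, cycle(compute_nodes)):
        plan[node].append(task)
    return plan

plan = assign_tasks(["t1", "t2", "t3", "t4", "t5"], ["node1", "node2"])
print(plan)
```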

28. Explain High performance (HP) clusters


A:
29. Explain Load-balancing clusters
A:

30. Explain High Availability (HA) Clusters


A:
