Chapter 4 Cloud Computing Technology
1. Virtualization -
Virtualization allows multiple virtual machines to share the hardware of a single physical computer. Its key components and features include:
1. Hypervisor: The hypervisor is the core software that enables virtualization. It is responsible for
managing and allocating the physical resources of the host computer to the virtual machines.
2. Host Machine: The host machine is the physical computer that runs the hypervisor and hosts the
virtual machines.
3. Virtual Machines (VMs): Virtual machines are the software representations of physical
computers. They act as independent entities with their own operating systems, applications, and
resources.
4. Guest Operating Systems: Each virtual machine runs its own guest operating system, which can
be different from the host operating system.
5. Resource Allocation: The hypervisor dynamically allocates hardware resources, such as CPU,
memory, storage, and network, to the virtual machines based on their requirements.
6. Isolation: Virtualization provides isolation between VMs, allowing them to run independently
without interfering with each other. This enables better security and stability.
7. Snapshots: Virtualization allows the creation of snapshots, which capture the state of a virtual
machine at a specific point in time. Snapshots can be used for backup, rollback, or testing purposes.
8. Migration and High Availability: Virtual machines can be easily migrated between different
physical hosts without interruption, providing flexibility and high availability. This is known as live
migration.
9. Consolidation: Virtualization enables the consolidation of multiple physical servers into a single
host, which reduces hardware costs, power consumption, and data center footprint.
10. Scalability: Virtualization provides scalability by allowing the addition or removal of virtual
machines based on demand, without the need for additional physical hardware.
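As a concrete illustration of how a hypervisor exposes the resources it has allocated to each VM, here is a minimal sketch using the libvirt Python bindings. It assumes the `libvirt-python` package is installed and a local QEMU/KVM hypervisor is running; the connection URI is an example and libvirt is only one of several possible management interfaces.

```python
import libvirt

# Connect to the local QEMU/KVM hypervisor (URI is an assumption; adjust as needed).
conn = libvirt.open("qemu:///system")

# Enumerate every defined virtual machine and report the resources the
# hypervisor has allocated to it.
for dom in conn.listAllDomains():
    state, max_mem_kib, mem_kib, vcpus, cpu_time_ns = dom.info()
    print(f"{dom.name()}: vCPUs={vcpus}, memory={mem_kib // 1024} MiB, "
          f"active={dom.isActive() == 1}")

conn.close()
```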
2. Types of Virtualization -
There are several types of virtualization, including:
1. Server virtualization: This involves partitioning a physical server into multiple virtual servers,
allowing each virtual server to run its own operating system and applications. This increases
resource utilization and allows for better management and flexibility.
2. Desktop virtualization: This involves running a desktop operating system and applications on a
virtual machine, hosted on a centralized server. Users can access their virtual desktops remotely,
allowing for flexible work environments and centralized management.
3. Network virtualization: This involves abstracting network resources, such as switches, routers,
and firewalls, into virtual entities. This allows for more efficient use of network resources and
enables the creation of virtual networks that can be easily managed and configured.
4. Storage virtualization: This involves abstracting physical storage devices into virtual storage
pools that can be allocated to different systems as needed. This allows for improved storage
utilization, better performance, and simplified storage management.
5. Application virtualization: This involves encapsulating applications into virtual containers, allowing them to run independently of the underlying operating system. This makes applications easier to deploy and manage and helps prevent conflicts between different applications (see the container sketch below).
These are just a few examples of virtualization technologies, and there are many more specific
implementations and variations depending on the specific needs and requirements of the
organization or individual using virtualization.
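As a small, hedged illustration of application virtualization with containers (item 5 above), the following sketch uses the Docker SDK for Python. It assumes a local Docker daemon is running and the `docker` package is installed; the `alpine` image is only an example.

```python
import docker

# Connect to the local Docker daemon (assumes Docker is installed and running).
client = docker.from_env()

# Run a throwaway container: the application inside sees its own isolated
# filesystem and process space, independent of what is installed on the host.
output = client.containers.run("alpine:latest", "echo hello from a container",
                               remove=True)
print(output.decode().strip())
```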
Virtualization can also be classified by the level at which the guest interacts with the virtualization layer:
1. Full virtualization: At this level, the entire hardware platform is virtualized, allowing multiple operating systems and applications to run on a single physical server without any modifications. Each virtual machine (VM) has its own virtualized hardware resources, including CPU, memory, storage, and network interfaces. Examples of full virtualization technologies include VMware ESXi and Microsoft Hyper-V.
2. Para-virtualization: This level of virtualization requires modifications to the guest operating
systems to be aware of the virtualization layer. The guest operating systems, known as para-virtual
machines (PVMs), communicate directly with the hypervisor or virtual machine monitor (VMM) to
optimize performance and resource utilization. Para-virtualization can provide better performance
compared to full virtualization but requires more effort to modify the guest operating systems.
Examples of para-virtualization technologies include Xen and Oracle VM Server for x86.
Each level of virtualization offers different trade-offs in terms of performance, flexibility, and
management overhead. The choice of implementation level depends on the specific requirements
and goals of the virtualization deployment.
Representative technologies for several categories of virtualization include:
1. **Desktop Virtualization:**
- **Virtual Desktop Infrastructure (VDI):** Users interact with virtual desktops hosted on
servers. This can be useful for centralized management and security. VMware Horizon and Citrix
Virtual Apps and Desktops are examples.
2. **Storage Virtualization:**
- **Storage Area Network (SAN) Virtualization:** Combines physical storage resources into a
single storage pool. This allows for efficient storage allocation and management. Examples include
EMC VMAX and IBM SAN Volume Controller.
- **Network-Attached Storage (NAS) Virtualization:** Abstracts multiple physical network
storage devices into a single logical storage unit.
3. **Network Virtualization:**
- **Software-Defined Networking (SDN):** Separates the control plane from the data plane,
allowing for programmable network management. OpenFlow is a protocol often used in SDN.
- **Network Function Virtualization (NFV):** Virtualizes network functions traditionally
performed by dedicated hardware appliances. Examples include virtual routers and firewalls.
4. **Application Virtualization:**
- **Application Streaming:** Allows applications to be delivered on-demand to end-user
devices. Microsoft App-V is an example of an application streaming solution.
- **Container Orchestration:** Manages the deployment, scaling, and operation of
containerized applications. Kubernetes is a widely used container orchestration platform.
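As an illustrative sketch of interacting with a container orchestrator, the snippet below uses the official Kubernetes Python client to list the pods a cluster is managing. It assumes the `kubernetes` package is installed and a kubeconfig for an existing cluster (e.g. `~/.kube/config`) is available.

```python
from kubernetes import client, config

# Load cluster credentials from the local kubeconfig file.
config.load_kube_config()
v1 = client.CoreV1Api()

# Ask the orchestrator for every pod it is currently managing, in all namespaces.
pods = v1.list_pod_for_all_namespaces(watch=False)
for pod in pods.items:
    print(f"{pod.metadata.namespace}/{pod.metadata.name}: {pod.status.phase}")
```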
CPU virtualization is typically achieved through software called a hypervisor or virtual machine
monitor (VMM), which creates and manages the virtual environment for each VM. The hypervisor
abstracts the underlying physical CPU and allows multiple VMs to share the CPU resources,
including processing power, memory, and I/O devices.
Two main approaches are used to virtualize the CPU:
1. Full virtualization: In this approach, the hypervisor presents a virtual CPU to each VM that behaves as if it were a real physical CPU. The guest operating systems running on the VMs are unaware that they are running in a virtualized environment. The hypervisor intercepts and translates privileged instructions from the VMs into equivalent operations on the physical CPU.
2. Para-virtualization: Here, the guest operating systems are modified to be aware of the
virtualized environment. The guest OSes communicate directly with the hypervisor to optimize
performance and avoid the overhead of instruction translation. This requires modification of the
guest OS kernel to work with the hypervisor.
CPU virtualization offers several benefits:
1. Consolidation: Multiple VMs can run on a single physical CPU, maximizing resource utilization and reducing the need for additional hardware.
2. Isolation: Each VM runs in its own isolated environment, ensuring that one VM cannot interfere
with the operation of others. This enhances security and stability.
3. Flexibility: VMs can be easily created, copied, and migrated between physical hosts, allowing
for dynamic allocation of resources and efficient workload management.
4. High availability: In case of hardware failures, VMs can be automatically moved to another
physical host to maintain service availability.
5. Efficient resource allocation: CPU resources can be dynamically allocated and adjusted based
on workload demands, ensuring optimal performance.
Overall, CPU virtualization brings flexibility, efficiency, and scalability to the use of hardware resources, enabling organizations to maximize the benefits of their computing infrastructure.
Virtualization of memory builds on the concept of virtual memory: each process is given its own virtual address space rather than direct access to physical memory. The operating system is responsible for managing this virtualized memory. It maps the virtual addresses used by a process to the physical addresses in the actual memory, and this mapping is stored in a data structure called the page table.
Virtual memory gives processes the illusion of having far more memory than is physically available. When a process accesses a virtual address that is not currently mapped to physical memory, a page fault occurs; the operating system handles it by retrieving the required data from disk and bringing it into physical memory.
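To make the mapping concrete, here is a small self-contained sketch of virtual-to-physical address translation. It assumes a simplified single-level page table with 4 KiB pages and made-up frame numbers; real MMUs use multi-level page tables and hardware TLBs.

```python
PAGE_SIZE = 4096  # bytes per page (4 KiB)

# Hypothetical page table: virtual page number -> physical frame number
page_table = {0: 5, 1: 9, 2: 1}

def translate(virtual_address: int) -> int:
    vpn = virtual_address // PAGE_SIZE       # virtual page number
    offset = virtual_address % PAGE_SIZE     # offset within the page
    if vpn not in page_table:
        # In a real OS this access would raise a page fault, and the missing
        # page would be loaded from disk into a free physical frame.
        raise LookupError(f"page fault: virtual page {vpn} not in physical memory")
    frame = page_table[vpn]
    return frame * PAGE_SIZE + offset

print(hex(translate(0x1234)))   # virtual page 1, offset 0x234 -> frame 9 -> 0x9234
```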
Virtual memory also provides protection and isolation between processes. Each process has its own
virtual address space, so a process cannot access or modify the memory of another process.
8. I/O Devices and OS - Virtualization of I/O devices and the operating system (OS)
refers to the process of creating virtual representations of hardware devices and the OS itself,
allowing multiple virtual machines (VMs) to share and access resources.
Virtualizing I/O devices involves the abstraction of physical devices, such as network cards, storage
controllers, or graphics cards, into virtual devices that are presented to VMs. This allows multiple
VMs to share and access the same physical device, eliminating the need for dedicated hardware for
each VM. Virtual I/O devices are created and managed by a hypervisor or virtual machine monitor
(VMM).
There are different approaches to virtualizing I/O devices. One common method is called device
emulation, where the hypervisor emulates the behavior of a physical device and presents it to the
VM. This allows the VM to interact with the virtual device as if it were a physical device. However,
device emulation can be less efficient than other methods.
Another approach is device passthrough or direct I/O, where the hypervisor allows a VM to directly
access a physical device without any intermediate emulation. This provides better performance but
limits the VM's mobility and requires specific hardware support.
Virtualizing the OS involves partitioning the physical hardware resources into virtual instances,
each running a separate instance of the OS. This allows multiple OS instances to run simultaneously
on the same physical machine. Each virtual machine has its own dedicated OS, including kernel,
device drivers, and user space. The hypervisor manages the allocation of resources and handles the
interactions between VMs and physical hardware.
Virtualizing the OS provides benefits such as improved resource utilization, isolation, and
flexibility. It allows running multiple operating systems on a single physical machine, consolidating
hardware resources and reducing costs. Virtualized OS instances can be easily created, migrated,
and managed, providing flexibility in scaling and maintaining systems.
In summary, virtualization of I/O devices and the operating system enables efficient sharing and
utilization of hardware resources, allowing multiple VMs to run concurrently on a single physical
machine. This technology has revolutionized the IT industry, providing a foundation for cloud
computing, data centers, and virtual desktop infrastructure, among other applications.
9. Data-Center Automation - Here are some key benefits of using virtualization for data-center automation:
1. Server Consolidation: Virtualization enables the consolidation of multiple physical servers into a
single physical server with multiple virtual servers. This reduces the number of physical servers
required, saving space, power, cooling, and overall cost.
2. Increased Hardware Utilization: By running multiple virtual servers on a single physical server,
hardware resources can be utilized more efficiently. This leads to higher resource utilization rates
and avoids underutilization or overprovisioning of servers.
3. Easy Scalability: Virtualization simplifies the process of scaling up or down as business needs
change. New virtual servers can be quickly provisioned or decommissioned, allowing for greater
flexibility and agility in adapting to changing demands.
4. Testing and Development Environment: Virtualization facilitates the creation of isolated testing
and development environments. Multiple virtual servers can be set up to run different operating
systems or software configurations, enabling easier software testing, development, and
troubleshooting.
10. MapReduce - MapReduce is a programming model for processing large datasets in parallel across a distributed cluster. In the map stage, the input data is divided into chunks and a map function is applied to each chunk, producing a set of intermediate key-value pairs. In the reduce stage, the intermediate results are combined and aggregated by applying a specified reduce function. The reduce function takes in the intermediate key-value pairs and produces the final output, which is typically a condensed version of the input data.
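The classic word-count example illustrates the two stages. The following is a minimal, single-machine sketch of the MapReduce flow in plain Python; a real framework such as Hadoop would run the map and reduce calls in parallel across many nodes.

```python
from collections import defaultdict

def map_fn(line):
    # Map stage: emit an intermediate (word, 1) pair for every word in the line.
    for word in line.split():
        yield (word.lower(), 1)

def reduce_fn(word, counts):
    # Reduce stage: aggregate all counts for one key into a final result.
    return (word, sum(counts))

def mapreduce(lines):
    # Shuffle/sort: group intermediate pairs by key before reducing.
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            groups[key].append(value)
    return [reduce_fn(key, values) for key, values in sorted(groups.items())]

print(mapreduce(["the quick brown fox", "the lazy dog"]))
# [('brown', 1), ('dog', 1), ('fox', 1), ('lazy', 1), ('quick', 1), ('the', 2)]
```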
One of the key advantages of MapReduce is its ability to scale horizontally by distributing the
processing tasks across multiple nodes. This allows for efficient parallel processing and enables the
handling of large datasets that would be impossible to process on a single machine. Additionally,
MapReduce provides fault tolerance by automatically handling node failures and rerouting tasks to
other nodes.
MapReduce has become a popular approach for big data processing and is widely used in various
industries, including web search, social media analytics, and machine learning. The framework has
also influenced the development of other distributed data processing systems, such as Apache
Hadoop and Apache Spark.
11. GFS - GFS virtualization refers to the use of Google File System (GFS) for virtual machine
(VM) storage in virtualization environments. GFS is a distributed file system developed by Google
that provides scalable and reliable storage for large amounts of data.
In virtualization, GFS can be used as the underlying storage system for virtual machine images,
snapshots, and other virtual machine files. By leveraging GFS, virtualization platforms can benefit
from its scalability, fault-tolerance, and high-performance capabilities.
Using GFS for virtualization can offer several advantages, such as:
1. Scalability: GFS is designed to handle large amounts of data, making it suitable for storing VM
images and other virtualization files.
2. Reliability: GFS has built-in mechanisms for data replication and fault tolerance, ensuring that
virtual machine data is protected against hardware failures and data corruption.
3. High Performance: GFS is optimized for sequential read and write operations, which can improve
the overall performance of virtual machines running on the virtualization platform.
Overall, GFS virtualization enables a more efficient and reliable virtualization environment by
leveraging the capabilities of Google File System.
12. HDFS – HDFS virtualization typically refers to deploying the Hadoop Distributed File System (HDFS) within a virtualized environment, leveraging virtualization technologies such as VMware, KVM (Kernel-based Virtual Machine), or Microsoft Hyper-V.
Using Hadoop Distributed File System (HDFS) within virtualization environments presents both
benefits and considerations. Here are some points to consider:
1. Resource Efficiency: Virtualization allows for better resource utilization by consolidating
multiple HDFS instances onto a single physical server. This can lead to cost savings by
reducing the number of physical machines needed to host HDFS clusters.
2. Scalability: Virtualization platforms offer scalability features such as dynamic resource
allocation and live migration, which can be beneficial for scaling HDFS clusters based on
changing workload demands.
3. Isolation: Virtualization provides isolation between different HDFS instances running on the
same physical hardware, reducing the risk of interference and conflicts between them.
4. Flexibility: Virtualization enables easier experimentation and testing of different Hadoop
configurations and setups without the need for additional physical hardware.
5. High Availability: Virtualization platforms often include features such as high availability
(HA) and fault tolerance, which can improve the reliability and resilience of HDFS
deployments by providing mechanisms for automatic failover and recovery.
6. Performance Overhead: Running HDFS within a virtualized environment can introduce
performance overhead due to factors such as virtualization layer processing and resource
contention. It's essential to carefully tune and optimize the virtualization environment to
minimize this overhead.
7. Storage Performance: Virtualized storage solutions may introduce latency or bottlenecks
compared to direct-attached storage (DAS) or network-attached storage (NAS)
configurations. Storage virtualization technologies such as VMware vSAN or storage area
networks (SANs) can help mitigate these issues.
8. Networking Considerations: Virtualized HDFS deployments require efficient networking
to ensure optimal data transfer rates and low latency between nodes. Network virtualization
technologies like VMware NSX or software-defined networking (SDN) can assist in
optimizing network performance.
9. Security: Virtualization introduces additional layers of complexity to the security landscape,
requiring attention to factors such as hypervisor security, network segmentation, and access
controls to safeguard HDFS data and infrastructure.
10. Management Overhead: Managing virtualized HDFS environments involves additional
tasks such as VM provisioning, monitoring, and maintenance. Implementing automation and
orchestration tools can streamline these management tasks and improve operational
efficiency.
11. Licensing Considerations: Depending on the virtualization platform used, there may be
licensing costs associated with deploying HDFS in a virtualized environment. It's important
to consider these costs alongside the potential benefits of virtualization.
12. Compatibility and Support: Ensure that the virtualization platform chosen is compatible
with the Hadoop ecosystem components and is supported by the vendors providing Hadoop
distributions and virtualization software.
In summary, while virtualization can offer numerous benefits for deploying and managing HDFS
clusters, it's essential to carefully evaluate factors such as performance, security, and management
overhead to ensure a successful deployment.
Hadoop is an open-source framework designed for distributed storage and processing of large
data sets using a cluster of commodity hardware. Hadoop consists of two main components:
the Hadoop Distributed File System (HDFS) for storage and the MapReduce programming
model for processing.
1. Hadoop Distributed File System (HDFS):
• Overview: HDFS is a distributed file system that provides high-throughput access to
application data. It is designed to store and manage large amounts of data across
multiple nodes in a Hadoop cluster.
• Key Features:
• Distributed Storage: Data is distributed across multiple nodes in the cluster
to ensure fault tolerance and scalability.
• Block-based Storage: Large files are divided into fixed-size blocks (typically
128 MB or 256 MB) and distributed across the cluster.
• Replication: Each block is replicated to multiple nodes (the default replication factor is three) to provide fault tolerance. If a node goes down, data can be retrieved from its replicas (see the storage sketch below).
• Master-Slave Architecture: The HDFS cluster consists of a single
NameNode (master) that manages the metadata and multiple DataNodes
(slaves) that store the actual data.
2. Hadoop Framework:
• MapReduce: Hadoop uses the MapReduce programming model for processing large
datasets in parallel across a distributed cluster. It involves two main phases - Map
and Reduce.
• Map Phase: Input data is divided into smaller chunks, and a map function is
applied to each chunk, producing a set of intermediate key-value pairs.
• Shuffle and Sort Phase: Intermediate results are shuffled and sorted based
on keys to group related data together.
• Reduce Phase: The reduce function is applied to each group of intermediate
data, producing the final output.
• YARN (Yet Another Resource Negotiator): YARN is the resource management
layer in Hadoop that manages and schedules resources in the cluster. It allows
multiple applications to share resources efficiently.
3. Ecosystem Components:
• Hadoop has a rich ecosystem of additional components and tools for various tasks,
including data storage, processing, and analysis. Some examples include:
• Hive: A data warehousing and SQL-like query language for Hadoop.
• Pig: A high-level platform for creating MapReduce programs used for data
analysis.
• HBase: A NoSQL database that runs on top of HDFS and provides real-time
read/write access to large datasets.
• Spark: A fast and general-purpose cluster computing framework that can be
used as an alternative to MapReduce.
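Using the figures from the HDFS overview above (128 MiB blocks, a default replication factor of three), the following back-of-the-envelope sketch estimates how a large file is split into blocks and how much raw cluster capacity it consumes. The 1 TiB file size is only an example.

```python
import math

BLOCK_SIZE = 128 * 1024**2     # 128 MiB per HDFS block
REPLICATION = 3                # each block stored on three DataNodes by default

def hdfs_footprint(file_size_bytes):
    blocks = math.ceil(file_size_bytes / BLOCK_SIZE)
    # Upper bound: the final block only occupies its actual size on disk.
    raw_storage = blocks * BLOCK_SIZE * REPLICATION
    return blocks, raw_storage

blocks, raw = hdfs_footprint(1 * 1024**4)   # a 1 TiB file
print(f"blocks: {blocks}, raw capacity consumed (upper bound): {raw / 1024**4:.1f} TiB")
# blocks: 8192, raw capacity consumed (upper bound): 3.0 TiB
```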
Hadoop and its ecosystem are widely used in the industry for big data processing and analytics due
to their scalability, fault tolerance, and cost-effectiveness on commodity hardware. However, the
technology landscape is evolving, and other distributed computing frameworks like Apache Spark
are gaining popularity for certain use cases.