
Architectural and Design Issues in the General Parallel File System

May 12, 2002

IBM Research Lab in Haifa

Benny Mandler - [email protected]

Agenda

- What is GPFS? - a file system for deep computing
- GPFS uses
- General architecture
- How does GPFS meet its challenges - architectural issues:
  performance, scalability, high availability, concurrency control

Scalable Parallel Computing


RS/6000 SP Scalable Parallel Computer
- 1-512 nodes connected by a high-speed switch
- 1-16 CPUs per node (Power2 or PowerPC)
- >1 TB disk per node
- 500 MB/s full duplex per switch port

Scalable parallel computing enables I/O-intensive applications:
- Deep computing - simulation, seismic analysis, data mining
- Server consolidation - aggregating file and web servers onto a centrally-managed machine
- Streaming video and audio for multimedia presentation
- Scalable object store for large digital libraries, web servers, databases, ...

What is GPFS?

GPFS addresses SP I/O requirements


High Performance - multiple GB/s to/from a single file
- concurrent reads and writes, parallel data access - within a file and across files
- fully parallel access to both file data and metadata
- client caching enabled by distributed locking
- wide striping, large data blocks, prefetch

Scalability
- scales up to 512 nodes (N-way SMP): storage nodes, file system nodes, disks, adapters, ...

High Availability
- fault tolerance via logging, replication, and RAID support; survives node and disk failures

Uniform access via shared disks - single-image file system

High capacity - multiple TB per file system, 100s of GB per file

Standards compliant (X/Open 4.0 "POSIX") with minor exceptions
What is GPFS?

GPFS vs. local and distributed file systems on the SP2

Native AIX File System (JFS)
- No file sharing - an application can only access files on its own node
- Applications must do their own data partitioning

DCE Distributed File System (follow-on to AFS)
- Application nodes (DCE clients) share files on a server node
- The switch is used as a fast LAN
- Coarse-grained (file or segment level) parallelism
- The server node is a performance and capacity bottleneck

GPFS Parallel File System


- GPFS file systems are striped across multiple disks on multiple storage nodes (see the striping sketch below)
- Independent GPFS instances run on each application node
- GPFS instances use storage nodes as "block servers" - all instances can access all disks
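To illustrate the striping idea, here is a minimal sketch in Python of a round-robin block-placement rule (the function name, the 256 KB constant, and the exact placement formula are illustrative assumptions, not GPFS internals):

    # Minimal sketch of wide striping: successive file blocks land on
    # successive disks, so a sequential read or write spreads its I/O
    # across all disks. Names and the exact rule are assumptions.
    BLOCK_SIZE = 256 * 1024                       # 256 KB, the GPFS default block size

    def block_location(block_index, num_disks):
        """Map a file block index to (disk number, byte offset on that disk)."""
        disk = block_index % num_disks            # round-robin across disks
        offset = (block_index // num_disks) * BLOCK_SIZE
        return disk, offset

    # A 1 MB sequential read touches four blocks on four different disks,
    # so the four disk reads can be issued in parallel.
    for i in range(4):
        print(i, block_location(i, num_disks=4))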

Tokyo Video on Demand Trial


- Video on demand for a new "borough" of Tokyo
- Applications: movies, news, karaoke, education, ...
- Video distribution via hybrid fiber/coax
- Trial "live" since June '96
- Currently 500 subscribers

- 6 Mbit/sec MPEG video streams
- 100 simultaneous viewers (75 MB/sec)
- 200 hours of video on line (700 GB)
- 12-node SP-2 (7 distribution nodes, 5 storage nodes)

Engineering Design
Major aircraft manufacturer

- Using GPFS to store CATIA designs and structural modeling data
- GPFS allows all nodes to share designs and models
- Using CATIA for large designs, Elfini for structural modeling and analysis
- SP used for modeling/analysis

GPFS uses

Shared Disks - Virtual Shared Disk architecture

- File systems consist of one or more shared disks
- An individual disk can contain data, metadata, or both
- Each disk is assigned to a failure group
- Data and metadata are striped to balance load and maximize parallelism

Recoverable Virtual Shared Disk for accessing disk storage


- Disks are physically attached to SP nodes
- VSD allows clients to access disks over the SP switch
- The VSD client looks like a disk device driver on the client node
- The VSD server executes I/O requests on the storage node
- VSD supports JBOD or RAID volumes, fencing, and multipathing (where the physical hardware permits)

GPFS only assumes a conventional block I/O interface
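As a concrete reading of that assumption, here is a minimal sketch of a "conventional block I/O interface" (the class and method names are illustrative; they are not the actual VSD API):

    # Sketch of the block-level interface GPFS assumes from the disk layer:
    # whole-block reads and writes, nothing richer. Names are illustrative.
    class BlockDevice:
        def __init__(self, num_blocks, block_size=256 * 1024):
            self.block_size = block_size
            self.num_blocks = num_blocks
            self.blocks = {}                      # sparse in-memory stand-in for a disk

        def read_block(self, block_no):
            """Return one block; unwritten blocks read back as zeros."""
            return self.blocks.get(block_no, b"\0" * self.block_size)

        def write_block(self, block_no, data):
            """Write one full block."""
            assert len(data) == self.block_size
            self.blocks[block_no] = data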


General architecture

GPFS Architecture Overview

Implications of Shared Disk Model
- All data and metadata reside on globally accessible disks (VSD)
- All access to permanent data goes through the disk I/O interface
- Distributed protocols, e.g. distributed locking, coordinate disk access from multiple nodes
- Fine-grained locking allows parallel access by multiple clients
- Logging and shadowing restore consistency after node failures

Implications of Large Scale


- Support for up to 4096 disks of up to 1 TB each (4 petabytes); the largest system in production is 75 TB
- Failure detection and recovery protocols handle node failures
- Replication and/or RAID protect against disk and storage node failures
- On-line dynamic reconfiguration (add, delete, replace disks and nodes; rebalance the file system)

General architecture

GPFS Architecture - Node Roles


Three types of nodes: file system, storage, and manager
Each node can perform any of these functions.

File system nodes
- run user programs and read/write data to/from storage nodes
- implement the virtual file system interface
- cooperate with manager nodes to perform metadata operations

Manager nodes (one per file system)


- global lock manager
- recovery manager
- global allocation manager
- quota manager
- file metadata manager
- admin services
- fail-over

Storage nodes
- implement the block I/O interface
- shared access from file system and manager nodes
- interact with manager nodes for recovery (e.g. fencing)
- file data and metadata are striped across multiple disks on multiple storage nodes
General architecture

GPFS Software Structure

(figure: diagram of the GPFS software structure, not reproduced)

General architecture

Disk Data Structures: Files

- Large block size allows efficient use of disk bandwidth
- Fragments reduce space overhead for small files
- No designated "mirror", no fixed placement function:
  - flexible replication (e.g. replicate only metadata, or only important files)
  - dynamic reconfiguration: data can migrate block by block
- Multi-level indirect blocks
  - each disk address: a list of pointers to replicas
  - each pointer: disk id + sector number
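A minimal sketch of the addressing structures described above, i.e. a disk address as a list of replica pointers, each holding a disk id and sector number (the type names and field layout are assumptions for illustration, not the real GPFS on-disk format):

    # Each block address is a list of replica pointers (disk id + sector),
    # so replication needs no fixed "mirror" disk or placement function.
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class ReplicaPointer:
        disk_id: int          # which shared disk holds this copy
        sector: int           # starting sector of the block on that disk

    @dataclass
    class DiskAddress:
        replicas: List[ReplicaPointer]   # one pointer per replica (length 1 if unreplicated)

    @dataclass
    class Inode:
        size: int
        direct: List[DiskAddress]        # addresses of the first data blocks
        indirect: List[DiskAddress]      # addresses of indirect blocks, which hold further DiskAddress entries

    # Two replicas of one data block, on different disks:
    addr = DiskAddress(replicas=[ReplicaPointer(disk_id=0, sector=8192),
                                 ReplicaPointer(disk_id=3, sector=40960)])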

General architecture

Large File Block Size

Conventional file systems store data in small blocks to pack data more densely. GPFS uses large blocks (256 KB default) to optimize disk transfer speed.

(figure: disk throughput in MB/sec (0-7) versus I/O transfer size in KB (128, 256, 384, 512, 640, 768, 896, 1024); larger transfers yield higher throughput)
Performance

Parallelism and consistency

- Distributed locking - acquire the appropriate lock for every operation; used for updates to user data
- Centralized management - conflicting operations are forwarded to a designated node; used for file metadata
- Distributed locking + centralized hints - used for space allocation
- Central coordinator - used for configuration changes

I/O slowdown effects


Additional I/O activity rather than token server overload

Parallel File Access From Multiple Nodes


- GPFS allows parallel applications on multiple nodes to access non-overlapping ranges of a single file with no conflict
- Global locking serializes access to overlapping ranges of a file
- Global locking is based on "tokens", which convey access rights to an object (e.g. a file) or a subset of an object (e.g. a byte range); see the conflict-test sketch below
- Tokens can be held across file system operations, enabling coherent data caching in clients
- Cached data is discarded or written to disk when its token is revoked
- Performance optimizations: required/desired ranges, metanode, data shipping, special token modes for file-size operations
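A minimal sketch of the conflict test behind byte-range tokens: two requests conflict only if their ranges overlap and at least one is a write (the class and function names are illustrative, not the GPFS token manager interface):

    from dataclasses import dataclass

    @dataclass
    class RangeToken:
        node: str
        start: int            # first byte covered by the token
        end: int              # last byte covered (inclusive)
        write: bool           # True for a write token, False for a read token

    def conflicts(a: RangeToken, b: RangeToken) -> bool:
        overlap = a.start <= b.end and b.start <= a.end
        return overlap and (a.write or b.write)

    # Non-overlapping writers on two nodes proceed in parallel ...
    print(conflicts(RangeToken("n1", 0, 4095, True), RangeToken("n2", 4096, 8191, True)))   # False
    # ... while an overlapping reader forces the writer's token to be revoked.
    print(conflicts(RangeToken("n1", 0, 4095, True), RangeToken("n2", 1024, 2047, False)))  # True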

Performance

Deep Prefetch for High Throughput

- GPFS stripes successive blocks across successive disks
- Disk I/O for sequential reads and writes is done in parallel
- GPFS measures application "think time", disk throughput, and cache state to automatically determine the optimal degree of parallelism
- Prefetch algorithms now recognize strided and reverse-sequential access
- Accepts hints
- Write-behind policy

Example: an application reading at 15 MB/sec from disks that each deliver 5 MB/sec needs three I/Os executed in parallel; a sketch of this sizing rule follows.
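A minimal sketch of that sizing rule: keep roughly consumption-rate / per-disk-throughput reads in flight, capped by the number of disks. This is a deliberate simplification; the real GPFS algorithm also folds in measured think time and cache state, and the names are illustrative:

    import math

    def prefetch_depth(app_rate_mb_s, per_disk_mb_s, num_disks):
        """How many block reads to keep in flight for one sequential stream."""
        needed = math.ceil(app_rate_mb_s / per_disk_mb_s)
        return min(needed, num_disks)        # cannot use more disks than exist

    # The example above: a 15 MB/sec consumer on 5 MB/sec disks -> 3 parallel I/Os.
    print(prefetch_depth(15, 5, num_disks=8))    # 3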


Performance

GPFS Throughput Scaling for Non-cached Files
- Hardware: Power2 wide nodes, SSA disks
- Experiment: sequential read/write from a large number of GPFS nodes to a varying number of storage nodes
- Result: throughput increases nearly linearly with the number of storage nodes
- Bottlenecks:
  - the Microchannel limits node throughput to 50 MB/s
  - system throughput is limited by the available storage nodes

Scalability

Disk Data Structures: Allocation map

Segmented Block Allocation Map:
- Each segment contains bits representing blocks on all disks
- Each segment is a separately lockable unit
- Minimizes contention for the allocation map when writing files on multiple nodes
- The allocation manager service provides hints on which segments to try

Similar: inode allocation map
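A minimal sketch of the segmented allocation map described above: each segment covers some blocks on every disk and has its own lock, so writers steered to different segments do not contend (class and method names are assumptions for illustration):

    import threading

    class AllocSegment:
        def __init__(self, num_disks, blocks_per_segment_per_disk):
            self.lock = threading.Lock()     # each segment is a separately lockable unit
            # one free-bit per (disk, block) covered by this segment
            self.free = [[True] * blocks_per_segment_per_disk for _ in range(num_disks)]

        def allocate(self, disk):
            """Grab one free block on the given disk, or return None if the segment is full there."""
            with self.lock:
                for block, is_free in enumerate(self.free[disk]):
                    if is_free:
                        self.free[disk][block] = False
                        return block
            return None

    # The allocation manager hints different nodes toward different segments,
    # so concurrent writers rarely contend on the same segment lock.
    segments = [AllocSegment(num_disks=4, blocks_per_segment_per_disk=1024) for _ in range(8)]
    print(segments[3].allocate(disk=1))      # 0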


Scalability

High Availability - Logging and Recovery

Problem: detect and fix file system inconsistencies after a failure of one or more nodes
- All updates that could leave inconsistencies if left uncompleted are logged
- Write-ahead logging policy: the log record is forced to disk before the dirty metadata is written
- Redo log: replaying all log records at recovery time restores file system consistency

Logged updates:
- I/O to replicated data
- directory operations (create, delete, move, ...)
- allocation map changes

Other techniques:
- ordered writes
- shadowing
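A minimal sketch of the write-ahead redo logging described above: the log record is forced to disk before the dirty metadata it protects, so replaying the log restores consistency after a crash (the file name, record format, and method names are illustrative assumptions):

    import json, os

    class RedoLog:
        def __init__(self, path):
            self.f = open(path, "a+")

        def log_update(self, record):
            """Append a redo record and force it to stable storage (the write-ahead rule)."""
            self.f.write(json.dumps(record) + "\n")
            self.f.flush()
            os.fsync(self.f.fileno())        # record is on disk before the metadata write happens

        def replay(self, apply):
            """At recovery time, reapply every logged update."""
            self.f.seek(0)
            for line in self.f:
                apply(json.loads(line))

    log = RedoLog("gpfs_sketch.log")
    log.log_update({"op": "create", "dir_inode": 17, "name": "a.out", "inode": 42})
    # ... only now is it safe to write the dirty directory block itself ...
    log.replay(print)                        # recovery pass: redo all logged updates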

High Availability

Node Failure Recovery

Application node failure:
- the force-on-steal policy ensures that all changes visible to other nodes have been written to disk and will not be lost
- all potential inconsistencies are protected by a token and are logged
- the file system manager runs log recovery on behalf of the failed node
- after successful log recovery, tokens held by the failed node are released
- actions taken: restore metadata being updated by the failed node to a consistent state and release resources held by the failed node

File system manager failure:


- a new node is appointed to take over
- the new file system manager restores volatile state by querying the other nodes
- the new file system manager may have to undo or finish a partially completed configuration change (e.g. add/delete disk)

Storage node failure:
- dual-attached disk: use the alternate path (VSD)
- single-attached disk: treat it as a disk failure

High Availability

Handling Disk Failures

When a disk failure is detected
- The node that detects the failure informs the file system manager
- The file system manager updates the configuration data to mark the failed disk as "down" (quorum algorithm)

While a disk is down
- Read one / write all available copies
- A "missing update" bit is set in the inode of modified files

When/if the disk recovers
- The file system manager searches the inode file for missing-update bits
- All data and metadata of files with missing updates are copied back to the recovering disk (one file at a time, using the normal locking protocol)
- Until missing-update recovery is complete, data on the recovering disk is treated as write-only

Unrecoverable disk failure
- The failed disk is deleted from the configuration or replaced by a new one
- New replicas are created on the replacement or on other disks
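A minimal sketch of the "read one / write all available copies" policy with a missing-update marker, as described above (the data structures and names are illustrative assumptions):

    disks = {0: {}, 1: {}}        # block store per disk id
    down = {1}                    # disk 1 is currently marked "down"
    missing_update = set()        # inodes modified while one of their disks was down

    def write_block(inode, data, replicas):
        """Write to every available replica; note a missing update if any replica was skipped."""
        wrote_all = True
        for disk_id, sector in replicas:
            if disk_id in down:
                wrote_all = False             # this copy is now stale
            else:
                disks[disk_id][sector] = data
        if not wrote_all:
            missing_update.add(inode)         # recovery will re-copy this file's blocks

    def read_block(replicas):
        """Read from any available replica."""
        for disk_id, sector in replicas:
            if disk_id not in down:
                return disks[disk_id].get(sector)
        raise IOError("no available replica")

    write_block(inode=42, data=b"x", replicas=[(0, 100), (1, 100)])
    print(read_block([(0, 100), (1, 100)]), missing_update)    # b'x' {42}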

Cache Management


(figure: cache management structure - the total cache is split into a general pool and several block-size pools; each pool keeps sequential/random statistics, a clock list, and optimal/total sizes; the general pool also supports merge and re-map)

- Balance pools dynamically according to usage patterns
- Avoid fragmentation - internal and external
- Unified steal
- Periodic re-balancing

Epilogue

- Used on six of the ten most powerful supercomputers in the world, including the largest (ASCI White)
- Installed at several hundred customer sites, on clusters ranging from a few nodes with less than a TB of disk up to 512 nodes with 140 TB of disk in 2 file systems
- IP rich - ~20 filed patents
- State-of-the-art TeraSort:
  - world record of 17 minutes
  - using a 488-node SP: 432 file system nodes and 56 storage nodes (604e, 332 MHz)
  - 6 TB of total disk space

References
- GPFS home page: http://www.haifa.il.ibm.com/projects/storage/gpfs.html
- FAST 2002: http://www.usenix.org/events/fast/schmuck.html
- TeraSort: http://www.almaden.ibm.com/cs/gpfs-spsort.html
- Tiger Shark: http://www.research.ibm.com/journal/rd/422/haskin.html
