General Parallel File System
Agenda
- What is GPFS? A file system for deep computing
- GPFS uses
- General architecture
- How does GPFS meet its challenges - architectural issues:
  - performance
  - scalability
  - high availability
  - concurrency control
Scalable parallel computing enables I/O-intensive applications:
- Deep computing - simulation, seismic analysis, data mining
- Server consolidation - aggregating file and web servers onto a centrally-managed machine
- Streaming video and audio for multimedia presentation
- Scalable object store for large digital libraries, web servers, databases, ...
What is GPFS?
Scalability
Scales up to 512 nodes (N-way SMP): storage nodes, file system nodes, disks, adapters...
High Availability
- Fault tolerance via logging, replication, and RAID support; survives node and disk failures
Uniform access
- Via shared disks - single-image file system
High capacity
- Multiple TB per file system, 100s of GB per file
Standards compliant
- X/Open 4.0 ("POSIX") with minor exceptions
What is GPFS?
Native AIX File System (JFS)
- No file sharing - an application can only access files on its own node
- Applications must do their own data partitioning
DCE Distributed File System (follow-on to AFS)
- Application nodes (DCE clients) share files on a server node
- Switch is used as a fast LAN
- Coarse-grained (file or segment level) parallelism
- Server node is a performance and capacity bottleneck
- 6 Mbit/sec MPEG video streams
- 100 simultaneous viewers (75 MB/sec; see the arithmetic below)
- 200 hours of video online (700 GB)
- 12-node SP-2 (7 distribution, 5 storage)
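The quoted aggregate rate is simply the stream count times the per-stream bit rate, converted to bytes:

$$100 \times 6\ \mathrm{Mbit/s} = 600\ \mathrm{Mbit/s} = 75\ \mathrm{MB/s}$$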
Engineering Design
Major aircraft manufacturer
- Using GPFS to store CATIA designs and structural modeling data
- GPFS allows all nodes to share designs and models
- Using CATIA for large designs, Elfini for structural modeling and analysis
- SP used for modeling/analysis
GPFS uses
File systems consist of one or more shared disks
- An individual disk can contain data, metadata, or both
- Each disk is assigned to a failure group
- Data and metadata are striped to balance load and maximize parallelism
Disks are physically attached to SP nodes
- VSD (Virtual Shared Disk) allows clients to access disks over the SP switch
- The VSD client looks like a disk device driver on the client node (see the sketch below)
- The VSD server executes I/O requests on the storage node
- VSD supports JBOD or RAID volumes, fencing, and multipathing (where the physical hardware permits)
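The client/server split behind VSD can be pictured with a small sketch (Python, illustration only; the file path, block size, and class names are invented for the example, and the real VSD runs in the kernel over the SP switch): the client exposes block read/write calls that are shipped to the storage node, where the server performs the actual I/O.

```python
# Minimal sketch of the VSD client/server split (illustration only).
import os

BLOCK_SIZE = 4096  # hypothetical sector-aligned request size

class VsdServer:
    """Runs on the storage node; performs the actual disk I/O."""
    def __init__(self, path):
        # A plain file stands in for the physical disk in this sketch.
        self.fd = os.open(path, os.O_RDWR | os.O_CREAT)

    def handle(self, op, block_no, data=None):
        os.lseek(self.fd, block_no * BLOCK_SIZE, os.SEEK_SET)
        if op == "read":
            return os.read(self.fd, BLOCK_SIZE)
        os.write(self.fd, data)

class VsdClient:
    """Runs on a file system node; looks like a local disk device driver."""
    def __init__(self, server):
        # In GPFS the request would travel over the SP switch; here we call
        # the server object directly to keep the sketch self-contained.
        self.server = server

    def read_block(self, block_no):
        return self.server.handle("read", block_no)

    def write_block(self, block_no, data):
        self.server.handle("write", block_no, data)

# Usage: every client node sees the same blocks through its VSD client.
server = VsdServer("/tmp/vsd_disk.img")
client = VsdClient(server)
client.write_block(0, b"x" * BLOCK_SIZE)
assert client.read_block(0)[:1] == b"x"
```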
Implications of the shared disk model
- All data and metadata reside on globally accessible disks (VSD)
- All access to permanent data goes through the disk I/O interface
- Distributed protocols, e.g. distributed locking, coordinate disk access from multiple nodes
- Fine-grained locking allows parallel access by multiple clients
- Logging and shadowing restore consistency after node failures
General architecture
File system nodes
- run user programs, read/write data to/from storage nodes
- implement the virtual file system interface
- cooperate with manager nodes to perform metadata operations
Storage nodes
- implement the block I/O interface
- shared access from file system and manager nodes
- interact with manager nodes for recovery (e.g. fencing)
- file data and metadata are striped across multiple disks on multiple storage nodes
General architecture
General architecture
- Large block size allows efficient use of disk bandwidth
- Fragments reduce space overhead for small files
- No designated "mirror", no fixed placement function:
  - Flexible replication (e.g., replicate only metadata, or only important files)
  - Dynamic reconfiguration: data can migrate block-by-block
- Multi-level indirect blocks (sketched below)
  - Each disk address: list of pointers to replicas
  - Each pointer: disk id + sector number
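A rough picture of these addressing structures, with made-up field names (this is not the actual GPFS inode layout): each logical block of a file maps to a disk address, which is simply a list of replica pointers, and large files reach their addresses through one or more levels of indirect blocks.

```python
# Sketch of the addressing scheme described above (field names invented).
from dataclasses import dataclass, field
from typing import List

@dataclass
class ReplicaPointer:
    disk_id: int     # which shared disk holds this copy
    sector_no: int   # where on that disk the block starts

# A "disk address" is a list of replica pointers: one per copy of the block.
DiskAddress = List[ReplicaPointer]

@dataclass
class IndirectBlock:
    # Each entry is either a data-block address or, one level up,
    # the address of another indirect block.
    entries: List[DiskAddress] = field(default_factory=list)

@dataclass
class Inode:
    direct: List[DiskAddress] = field(default_factory=list)      # small files
    indirect: List[IndirectBlock] = field(default_factory=list)  # large files

# Example: block 0 of a file replicated on disks 3 and 7.
inode = Inode(direct=[[ReplicaPointer(disk_id=3, sector_no=4096),
                       ReplicaPointer(disk_id=7, sector_no=8192)]])
```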
General architecture
Conventional file systems store data in small blocks to pack data more densely. GPFS uses large blocks (256 KB default) to optimize disk transfer speed.
[Chart: disk throughput (0-7 MB/sec) vs. block size (128 KB to 1024 KB)]
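The benefit of large blocks comes from amortizing per-I/O positioning time over more data. As a rough model (the seek time and bandwidth numbers are illustrative, not measurements from the slide):

$$t(B) = t_{\mathrm{seek}} + \frac{B}{r}, \qquad \mathrm{throughput}(B) = \frac{B}{t(B)}$$

With, say, $t_{\mathrm{seek}} = 10\,\mathrm{ms}$ and $r = 10\,\mathrm{MB/s}$, a 4 KB block achieves about $0.4\,\mathrm{MB/s}$ while a 256 KB block achieves about $7\,\mathrm{MB/s}$, which is why the curve above climbs steeply with block size.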
Performance
- Distributed locking - acquire an appropriate lock for every operation; used for updates to user data (see the sketch below)
- Centralized management - conflicting operations forwarded to a designated node; used for file metadata
- Distributed locking + centralized hints - used for space allocation
- Central coordinator - used for configuration changes
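One way to picture the distributed-locking approach is a central token server that grants lock tokens which nodes then cache and reuse locally until another node needs a conflicting one (a simplified sketch; class and method names are invented, and the real protocol handles byte ranges, lock modes, and revocation callbacks):

```python
# Simplified token-based locking sketch (single-threaded, invented API).
class TokenServer:
    """Grants tokens; knows which node currently holds each one."""
    def __init__(self):
        self.holder = {}          # token name -> holding node

    def acquire(self, node, name):
        owner = self.holder.get(name)
        if owner is not None and owner != node:
            # Conflicting holder: in a real system the server would ask the
            # owner to flush dirty data and give the token up (a "steal").
            owner.relinquish(name)
        self.holder[name] = node
        return name

class Node:
    """A file system node that caches tokens it has been granted."""
    def __init__(self, server):
        self.server = server
        self.tokens = set()

    def lock(self, name):
        # Reuse a cached token without talking to the server if possible.
        if name not in self.tokens:
            self.tokens.add(self.server.acquire(self, name))

    def relinquish(self, name):
        self.tokens.discard(name)

# Two nodes updating the same file take turns holding its token.
server = TokenServer()
a, b = Node(server), Node(server)
a.lock("file42")          # server grants the token to node a
a.lock("file42")          # cached: no server traffic
b.lock("file42")          # server steals the token from a, grants it to b
assert "file42" in b.tokens and "file42" not in a.tokens
```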
Performance
- GPFS stripes successive blocks across successive disks (see the striping sketch below)
- Disk I/O for sequential reads and writes is done in parallel
- GPFS measures application "think time", disk throughput, and cache state to automatically determine the optimal degree of parallelism
- Prefetch algorithms now recognize strided and reverse-sequential access
- Accepts hints
- Write-behind policy
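Striping itself is just a round-robin mapping from a file's block index to a disk, which is what lets a sequential run be issued to several disks at once (a minimal sketch with invented names; real placement also accounts for replication and failure groups):

```python
# Round-robin striping sketch: block i of a file lands on disk i mod N.
from concurrent.futures import ThreadPoolExecutor

NUM_DISKS = 4
BLOCK_SIZE = 256 * 1024  # 256 KB, the default block size mentioned above

def disk_for_block(block_index: int) -> int:
    """Map a file block index to the disk that stores it."""
    return block_index % NUM_DISKS

def read_block(disk: int, block_index: int) -> bytes:
    # Placeholder for a real VSD read; returns dummy data in this sketch.
    return bytes(BLOCK_SIZE)

def sequential_read(first_block: int, count: int) -> list:
    """Issue the reads for a sequential run in parallel across the disks."""
    with ThreadPoolExecutor(max_workers=NUM_DISKS) as pool:
        futures = [pool.submit(read_block, disk_for_block(b), b)
                   for b in range(first_block, first_block + count)]
        return [f.result() for f in futures]

# Reading blocks 0..7 touches all four disks twice, in parallel.
data = sequential_read(0, 8)
assert len(data) == 8
```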
GPFS Throughput Scaling for Non-cached Files
- Hardware: Power2 wide nodes, SSA disks
- Experiment: sequential read/write from a large number of GPFS nodes to a varying number of storage nodes
- Result: throughput increases nearly linearly with the number of storage nodes
- Bottlenecks:
  - microchannel limits node throughput to 50 MB/s
  - system throughput limited by available storage nodes
Scalability
The block allocation map is divided into segments:
- Each segment contains bits representing blocks on all disks
- Each segment is a separately lockable unit
- Minimizes contention for the allocation map when writing files on multiple nodes
- Allocation manager service provides hints on which segments to try (a sketch follows below)
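A toy version of such a segmented allocation map (names and sizes invented): each node can allocate out of its own segment under that segment's lock, so concurrent writers rarely contend, yet any single segment still covers blocks on every disk.

```python
# Toy segmented allocation map: one lock per segment, blocks from all disks
# interleaved into every segment (sizes and names are invented).
import threading

NUM_DISKS = 4
BLOCKS_PER_DISK = 1024
NUM_SEGMENTS = 8

class Segment:
    """One lockable slice of the allocation map, covering blocks on all disks."""
    def __init__(self, seg_no):
        self.lock = threading.Lock()
        # Free-block "bits" for every disk that fall into this segment, so a
        # node holding one segment can still stripe new blocks across disks.
        self.free = {(d, b) for d in range(NUM_DISKS)
                     for b in range(seg_no, BLOCKS_PER_DISK, NUM_SEGMENTS)}

    def allocate(self, disk):
        with self.lock:
            choice = next((blk for blk in self.free if blk[0] == disk), None)
            if choice is not None:
                self.free.discard(choice)
            return choice

class AllocationMap:
    def __init__(self):
        self.segments = [Segment(i) for i in range(NUM_SEGMENTS)]

    def allocate(self, disk, hint):
        # 'hint' stands in for the allocation manager's advice about which
        # segment this node should try first, keeping nodes out of each
        # other's way.
        for i in range(NUM_SEGMENTS):
            blk = self.segments[(hint + i) % NUM_SEGMENTS].allocate(disk)
            if blk is not None:
                return blk
        return None

amap = AllocationMap()
print(amap.allocate(disk=2, hint=5))   # e.g. (2, 5): a block of disk 2 from segment 5
```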
Problem: detect/fix file system inconsistencies after a failure of one or more nodes
- All updates that may leave inconsistencies if uncompleted are logged
- Write-ahead logging policy: the log record is forced to disk before the dirty metadata is written (a sketch follows below)
- Redo log: replaying all log records at recovery time restores file system consistency
Logged updates:
- I/O to replicated data
- directory operations (create, delete, move, ...)
- allocation map changes
Other techniques:
- ordered writes
- shadowing
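The write-ahead rule is easy to state in code (a minimal sketch with an in-memory metadata store and an invented record format; the real GPFS logs are per-node and record metadata updates only): force the log record to stable storage, then apply the update, and on recovery redo every logged record.

```python
# Minimal write-ahead / redo logging sketch (invented record format).
import json, os

class MetadataStore:
    """Stands in for on-disk metadata in this sketch."""
    def __init__(self):
        self.data = {}

    def apply(self, record):
        self.data[record["key"]] = record["value"]

class RedoLog:
    def __init__(self, path):
        self.path = path

    def append(self, record):
        # Write-ahead policy: the record reaches stable storage (fsync)
        # before the corresponding dirty metadata may be written.
        with open(self.path, "a") as f:
            f.write(json.dumps(record) + "\n")
            f.flush()
            os.fsync(f.fileno())

    def replay(self, store):
        # Redo: re-applying every record is idempotent, so a crash at any
        # point between logging and applying leaves the store recoverable.
        if not os.path.exists(self.path):
            return
        with open(self.path) as f:
            for line in f:
                store.apply(json.loads(line))

log = RedoLog("/tmp/gpfs_sketch.log")
store = MetadataStore()
record = {"key": "inode42.size", "value": 1048576}
log.append(record)     # 1. force the log record to disk
store.apply(record)    # 2. only then update the metadata
log.replay(store)      # after a crash, replaying restores consistency
```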
High Availability
Application node failure:
- force-on-steal policy ensures that all changes visible to other nodes have been written to disk and will not be lost
- all potential inconsistencies are protected by a token and are logged
- file system manager runs log recovery on behalf of the failed node
- after successful log recovery, tokens held by the failed node are released
- actions taken: restore metadata being updated by the failed node to a consistent state, release resources held by the failed node
File system manager failure:
- a new node is appointed to take over
- the new file system manager restores volatile state by querying other nodes
- the new file system manager may have to undo or finish a partially completed configuration change (e.g., add/delete disk)
Storage node failure:
- Dual-attached disk: use alternate path (VSD)
- Single-attached disk: treat as a disk failure
High Availability
When a disk failure is detected
- The node that detects the failure informs the file system manager
- The file system manager updates the configuration data to mark the failed disk as "down" (quorum algorithm)
While a disk is down
- Read one / write all available copies (see the sketch below)
- "Missing update" bit set in the inode of modified files
When the failed disk comes back
- File system manager searches the inode file for missing update bits
- All data & metadata of files with missing updates are copied back to the recovering disk (one file at a time, normal locking protocol)
- Until missing update recovery is complete, data on the recovering disk is treated as write-only
If the disk does not recover
- Failed disk is deleted from the configuration or replaced by a new one
- New replicas are created on the replacement or on other disks
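The "read one / write all available" rule plus the missing-update bit can be sketched in a few lines (disk objects, field names, and the recovery helper are invented for illustration; the real mechanism works per block under the normal locking protocol):

```python
# Sketch of "read one / write all available" replication with a
# missing-update flag (disk objects and field names are invented).
class Disk:
    def __init__(self, disk_id):
        self.disk_id = disk_id
        self.up = True
        self.blocks = {}

class ReplicatedFile:
    def __init__(self, disks):
        self.disks = disks            # the disks holding this file's replicas
        self.missing_update = False   # would live in the inode

    def write(self, block_no, data):
        wrote_everywhere = True
        for disk in self.disks:
            if disk.up:
                disk.blocks[block_no] = data
            else:
                wrote_everywhere = False
        if not wrote_everywhere:
            # A replica was skipped: remember that this file needs
            # to be copied back when the disk recovers.
            self.missing_update = True

    def read(self, block_no):
        for disk in self.disks:
            if disk.up and block_no in disk.blocks:
                return disk.blocks[block_no]
        raise IOError("no available replica")

    def recover(self, recovering_disk):
        # Copy current data back to the recovering disk, then clear the bit.
        if self.missing_update:
            for block_no in self.read_all_blocks():
                recovering_disk.blocks[block_no] = self.read(block_no)
            self.missing_update = False

    def read_all_blocks(self):
        return {b for d in self.disks if d.up for b in d.blocks}

d1, d2 = Disk(1), Disk(2)
f = ReplicatedFile([d1, d2])
d2.up = False                 # disk 2 goes down
f.write(0, b"new data")       # write-all-available sets the missing-update bit
d2.up = True                  # disk 2 comes back
f.recover(d2)                 # copy data back, clear the bit
assert d2.blocks[0] == b"new data" and not f.missing_update
```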
Cache Management
[Diagram: total cache divided into a general pool and several block-size pools; each pool keeps a clock list and sequential/random statistics, with optimal and total sizes tracked per pool; the general pool additionally supports merge and re-map]
- Balance dynamically according to usage patterns
- Avoid fragmentation - internal and external
- Unified steal (see the clock-list sketch below)
- Periodic re-balancing
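The "clock list" each pool keeps is the classic second-chance eviction scheme. A minimal sketch of one pool follows (names invented; the real pagepool additionally steals across pools and re-balances their sizes based on the sequential/random statistics above):

```python
# Second-chance ("clock") buffer list for one cache pool (names invented).
class ClockPool:
    def __init__(self, capacity):
        self.frames = [None] * capacity   # each frame: [block_id, ref_bit, data]
        self.hand = 0

    def access(self, block_id, load):
        # Hit: give the buffer its second chance and return the cached data.
        for frame in self.frames:
            if frame is not None and frame[0] == block_id:
                frame[1] = True
                return frame[2]
        # Miss: advance the clock hand past recently referenced buffers,
        # clearing their bits, until a victim frame is found.
        while self.frames[self.hand] is not None and self.frames[self.hand][1]:
            self.frames[self.hand][1] = False
            self.hand = (self.hand + 1) % len(self.frames)
        data = load(block_id)
        self.frames[self.hand] = [block_id, True, data]
        self.hand = (self.hand + 1) % len(self.frames)
        return data

pool = ClockPool(capacity=3)
for blk in ["a", "b", "c"]:
    pool.access(blk, lambda b: b.encode())
pool.access("d", lambda b: b.encode())   # all bits were set: "a" is evicted
pool.access("b", lambda b: b.encode())   # hit: "b" earns a second chance
pool.access("e", lambda b: b.encode())   # "b" survives, "c" is evicted instead
assert {f[0] for f in pool.frames} == {"d", "b", "e"}
```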
Epilogue
- Used on six of the ten most powerful supercomputers in the world, including the largest (ASCI White)
- Installed at several hundred customer sites, on clusters ranging from a few nodes with less than a TB of disk up to 512 nodes with 140 TB of disk in 2 file systems
- IP rich - ~20 filed patents
- State of the art
TeraSort
- world record of 17 minutes
- using a 488-node SP: 432 file system and 56 storage nodes (604e 332 MHz)
- total 6 TB of disk space
References
- GPFS home page: http://www.haifa.il.ibm.com/projects/storage/gpfs.html
- FAST 2002: http://www.usenix.org/events/fast/schmuck.html
- TeraSort: http://www.almaden.ibm.com/cs/gpfs-spsort.html
- Tiger Shark: http://www.research.ibm.com/journal/rd/422/haskin.html