Accelerating AI and Analytics with IBM watsonx.data and IBM Storage Scale
redp5743
Kedar Karmarkar
Chinmaya Mishra
Qais Noorshams
Gero Schmidt
Anna Greim
Dietmar Fischer
Data and AI
Hybrid Cloud
Redpaper
IBM Redbooks
November 2024
REDP-5743-00
Note: Before using this information and the product it supports, read the information in “Notices” on page v.
Notices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .v
Trademarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
Authors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
How you can become a published author, too! . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
Comments welcome. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
Stay connected to IBM Redbooks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
Chapter 1. Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Evolution of Data Analytics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Introduction to IBM watsonx.data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Typical use cases for IBM watsonx.data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4 IBM Storage Scale . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.5 IBM Storage Scale S3 Data Access . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.6 Storage Abstraction and acceleration with IBM Storage Scale AFM . . . . . . . . . . . . . . . 9
Chapter 5. Monitoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
5.1 Monitoring watsonx.data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
5.2 Monitoring IBM Storage Scale S3 service. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
Related publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
IBM Redbooks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
Online resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
Help from IBM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
Notices
This information was developed for products and services offered in the US. This material might be available
from IBM in other languages. However, you may be required to own a copy of the product or product version in
that language in order to access it.
IBM may not offer the products, services, or features discussed in this document in other countries. Consult
your local IBM representative for information on the products and services currently available in your area. Any
reference to an IBM product, program, or service is not intended to state or imply that only that IBM product,
program, or service may be used. Any functionally equivalent product, program, or service that does not
infringe any IBM intellectual property right may be used instead. However, it is the user’s responsibility to
evaluate and verify the operation of any non-IBM product, program, or service.
IBM may have patents or pending patent applications covering subject matter described in this document. The
furnishing of this document does not grant you any license to these patents. You can send license inquiries, in
writing, to:
IBM Director of Licensing, IBM Corporation, North Castle Drive, MD-NC119, Armonk, NY 10504-1785, US
This information could include technical inaccuracies or typographical errors. Changes are periodically made
to the information herein; these changes will be incorporated in new editions of the publication. IBM may make
improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time
without notice.
Any references in this information to non-IBM websites are provided for convenience only and do not in any
manner serve as an endorsement of those websites. The materials at those websites are not part of the
materials for this IBM product and use of those websites is at your own risk.
IBM may use or distribute any of the information you provide in any way it believes appropriate without
incurring any obligation to you.
The performance data and client examples cited are presented for illustrative purposes only. Actual
performance results may vary depending on specific configurations and operating conditions.
Information concerning non-IBM products was obtained from the suppliers of those products, their published
announcements or other publicly available sources. IBM has not tested those products and cannot confirm the
accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the
capabilities of non-IBM products should be addressed to the suppliers of those products.
Statements regarding IBM’s future direction or intent are subject to change or withdrawal without notice, and
represent goals and objectives only.
This information contains examples of data and reports used in daily business operations. To illustrate them
as completely as possible, the examples include the names of individuals, companies, brands, and products.
All of these names are fictitious and any similarity to actual people or business enterprises is entirely
coincidental.
COPYRIGHT LICENSE:
This information contains sample application programs in source language, which illustrate programming
techniques on various operating platforms. You may copy, modify, and distribute these sample programs in
any form without payment to IBM, for the purposes of developing, using, marketing or distributing application
programs conforming to the application programming interface for the operating platform for which the sample
programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore,
cannot guarantee or imply reliability, serviceability, or function of these programs. The sample programs are
provided “AS IS”, without warranty of any kind. IBM shall not be liable for any damages arising out of your use
of the sample programs.
The following terms are trademarks or registered trademarks of International Business Machines Corporation,
and might also be trademarks or registered trademarks in other countries.
Db2®, IBM®, IBM Cloud®, IBM Cloud Pak®, IBM Elastic Storage®, IBM Spectrum®, Redbooks®, Redbooks (logo)®, Resilient®
OpenShift and Red Hat are trademarks or registered trademarks of Red Hat, Inc. or its subsidiaries in the United States and other countries.
Other company, product, or service names may be trademarks or service marks of others.
Preface
This IBM® Redpaper describes an IBM Data & artificial intelligence (AI) solution for using
IBM watsonx.data together with IBM Storage Scale. The paper showcases how
IBM watsonx.data applications can benefit from the enterprise storage features and functions
offered by IBM Storage Scale.
IBM Storage Scale is software-defined, high performance, scalable file and object storage
that enables organizations to build a Global Data Platform for AI, high-performance
computing (HPC), advanced analytics, and other demanding workloads.
IBM watsonx.data and IBM Storage Scale can be a powerful combination for building a
scalable and cost-effective data lakehouse solution. This IBM Redbooks® publication delves
into how IBM Storage Scale's robust storage capabilities and IBM watsonx.data's advanced
analytics features come together to build a powerful data and AI platform. This platform
empowers you to unlock valuable insights from your data and make data-driven decisions.
This further helps organizations to expand from AI pilot projects to full-scale production
systems by providing the right tools, platforms and software-defined storage on which to run it
all.
Authors
This paper was produced by a team of specialists from around the world working with the IBM
Redbooks, Tucson Center.
Kedar Karmarkar is a Development Architect with the IBM Storage Scale development team
and has contributed to Data Caching, Scale containerization, AI solutions and Storage Scale
Development adoption teams. Kedar has over 25 years of infrastructure software and storage development experience in management and architect roles. Prior to joining IBM, Kedar led the development of network-attached storage (NAS), block-level virtualization, replication, systems, and storage management products. Kedar has a Bachelor of Engineering (Computer Science) degree from the University of Pune, India.
Qais Noorshams is an IBM Software Engineer in the Big Data & Analytics team within the
IBM Storage Scale organization. Since he joined IBM Germany in 2015, he has held various
technical and leadership positions in international software development projects. He is a
certified Expert Developer, IBM Recognized Speaker, and IBM Recognized Teacher. His track
record includes authoring more than 15 granted patents, more than 15 peer-reviewed
publications, and various IBM-published newsletter and blog articles. He holds a PhD degree
in Computer Science (Dr.-Ing.) from Karlsruhe Institute of Technology (Germany).
Gero Schmidt is a Software Engineer at IBM Germany R&D GmbH in the Big Data &
Analytics team of the IBM Storage Scale development organization. He joined IBM Germany
in 2001 as presales technical support engineer for enterprise storage solutions. He has
co-authored multiple IBM Redbooks® and has been a frequent speaker at IBM international
conferences. In 2015 he joined the storage research group at the IBM Almaden Research
Center in California, USA, where he worked on IBM Storage Scale, compression of genomic
data in next generation sequencing pipelines and the development of a cloud-native backup
solution for containerized applications in Kubernetes/Red Hat OpenShift. He holds a degree
in Physics (Dipl.-Phys.) from the Braunschweig University of Technology in Germany.
Anna Greim is the scrum master of the IBM Storage Scale Big Data and Analytics team.
Anna has been with IBM for more than 12 years working as a software developer and enjoys
working with a talented team and as part of that team developing great products and
solutions.
Dietmar Fischer is the Manager of the IBM Storage Scale Big Data and Analytics team.
Dietmar has been with IBM for more than 25 years and has held several positions within the
IBM Storage development organization including software test, development, project
management, and management. Dietmar has a strong technical computer science
background and enjoys developing great solutions with a team of very talented experts.
Thanks to the following people for their contributions to this project:
Gopikrishnan Varadarajulu, Rohan Pednekar, Kevin Shen, Ted Hoover, Khanh Ngo,
Gregory Kishi, Rene Orozco Martinez, Boda Devi Manikanta, Hariharan Ashokan,
T K Narayanan, Sujith PS, Shafeek M, Renu Rajagopal, Prasad Kulkarni, Madhu Thorat,
Pravin Ranjan, Rajan Mishra
Larry Coyne
IBM Redbooks, Tucson Center
How you can become a published author, too!
Residencies run from two to six weeks in length, and you can participate either in person or
as a remote resident working from your home base.
Find out more about the residency program, browse the residency index, and apply online at:
ibm.com/redbooks/residencies.html
Comments welcome
Your comments are important to us!
We want our papers to be as helpful as possible. Send us your comments about this paper or
other IBM Redbooks publications in one of the following ways:
Use the online Contact us review Redbooks form found at:
ibm.com/redbooks
Send your comments in an email to:
[email protected]
Mail your comments to:
IBM Corporation, IBM Redbooks
Dept. HYTD Mail Station P099
2455 South Road
Poughkeepsie, NY 12601-5400
Chapter 1. Overview
This IBM Redpaper describes the IBM solution for using IBM Storage Scale as enterprise
storage with IBM watsonx.data. The paper showcases how IBM watsonx.data applications
can benefit from the enterprise storage features and functions offered by
IBM Storage Scale.
This chapter provides an overview of IBM watsonx.data and IBM Storage Scale along with
key features and use cases for these products. If you are already familiar with
IBM watsonx.data and IBM Storage Scale, you may skip this chapter.
However, building a data warehouse (DW) comes with a high up-front cost, and scaling a DW is expensive in terms of both compute and storage. Moreover, DWs only work with structured data.
Moving data warehouses to the cloud does not solve the problem - it comes with vendor
lock-in, sometimes with even higher costs, and with limited machine learning/AI use cases.
These limitations lead to the concept of data lakes, offering higher scalability and flexibility.
Based on a scale-out architecture created on commodity servers, data lakes can store and
process massive volumes of data in its original form - structured or unstructured. Adopters of
data lakes looked to Hive, Impala and Spark together with Hadoop Distributed File System
(HDFS) storage to simplify data engineering, real-time analytics, predictive analytics and
machine learning tasks. Data lakes are also typically less expensive than data warehouses.
Since then, the AWS S3 API has become established as a de facto standard for accessing unstructured data as objects. More and more enterprises are integrating S3 as the data access protocol of choice in their data workflows. However, the processing layers in data lakes (for example, Hive) are not well equipped to handle S3-based storage, even as S3-based cloud object stores have become ubiquitous.
These limitations gave rise to an emerging architecture called the data lakehouse, which combines
the flexibility of a data lake with the performance of a data warehouse. Lakehouse solutions
provide a high-performance query engine over low-cost object storage in conjunction with a
data governance layer. Data lakehouses are based around open-standard object storage and
enable multiple analytics/AI workloads to operate on the same data simultaneously without
requiring the data to be duplicated or transformed.
A key benefit of data lakehouses is that they address the needs both of traditional data
warehouse analysts who curate and publish data for business intelligence and reporting
purposes as well as those of data scientists and engineers who run more complex data
analysis and processing workloads.
Enterprise AI does not work in isolation, but is typically part of a larger data pipeline. Starting
from ingesting and acquiring data from various sources, data is curated, de-duplicated and
cleansed in various ETL stages followed by further processing, all within a lakehouse, before
being fed to AI-based systems for training or inference. The quality and accuracy of the data are paramount for the effectiveness of AI; therefore, the importance of a modern, integrated lakehouse for the adoption of enterprise AI cannot be overstated.
Lakehouses are a step in that direction. However, fundamental challenges still remain:
1. First-generation lakehouses are still limited in their ability to address cost and complexity challenges. They are usually single query engines set up to support limited workloads, for example Business Intelligence (BI) or Machine Learning (ML).
2. Moreover, first-generation lakehouses typically deploy over cloud only, with no support for multi-cloud or hybrid cloud deployments.
3. Minimal governance and metadata capabilities to deploy across the entire ecosystem
remains an issue. The challenge is to bring analytics to data where it is generated and
resides. Customers are looking for an easier migration path to a modern lakehouse with
no migration or delayed migration of data and metadata.
4. And all this needs to be achieved while maintaining robust data governance and security policies in place, even as the usage and users of data become more varied than ever before.
To mitigate these issues, IBM designed the watsonx.data platform, positioning it as a modern
Lakehouse platform to help organizations manage efficient use of their data. It is part of the
broader IBM watsonx platform, an enterprise-ready AI and data platform designed to
accelerate the adoption of enterprise AI. The watsonx family comprises three platforms:
1. watsonx.ai - for generative AI and machine learning
2. watsonx.data - a next-generation data lakehouse built on open architecture and open data
formats
3. watsonx.governance - to enable AI workflows that are built with responsibility,
transparency, and explainability.
Users can store their enterprise data within watsonx.data and make that data accessible directly for AI and BI. They can also attach existing enterprise data sources spread across cloud and on-premises environments to watsonx.data, which helps to reduce data duplication and the cost of storing data in multiple places.
Built on the foundation of the IBM Cloud Pak® for Data AI and data platform, watsonx.data
integrates seamlessly with existing data and data fabric services within the platform. This
integration accelerates and simplifies the process of scaling AI workloads across enterprises.
The following components provide the foundation of IBM watsonx.data architecture (see
Figure 1-1 on page 5):
Query engines: The IBM watsonx.data platform natively includes Presto and Spark as the query engines. Presto is a highly elastic and scalable distributed query engine designed to handle modern data formats.
IBM watsonx.data query engines are fully modular and can be dynamically scaled
to meet workload demands and concurrency. The engines can be attached to
internal or external data stores in a plug-and-play configuration to access enterprise
data in an open table format.
Milvus VectorDB: Milvus is a vector database that stores, indexes, and manages embedding vectors used for similarity search and retrieval-augmented generation (RAG). It is designed to support embedding similarity search, primarily for AI inferencing applications. Milvus is included in IBM watsonx.data as a service.
Metadata and Governance service: The metastore included with watsonx.data is based upon the open-source Apache Hive Metastore (HMS). The metadata service
enables the query engines to know the location, format, and read capabilities of the
data. The metastore essentially manages table schemas as well as where to find
them in object storage.
Data catalogs operate within the purview of the metadata and governance layer.
Data catalogs assist query engines with finding the correct data and deliver
semantic information for policies and rules specific to a particular data store. A data
catalog is created specifically to a data store when it is registered with watsonx.data
and is managed by the HMS metadata service. The supported catalog types
include Apache Iceberg, Hive, Apache Hudi or Delta Lake at the time of this writing.
IBM watsonx.data integrates with IBM Knowledge Catalog (IKC) and Apache
Ranger for policy-based governance and administration of data and metadata. The
policy engine enables users to define and enforce rules for data protection.
One key aspect supporting the open architecture of IBM watsonx.data lakehouse is its
support for open table formats such as Apache Iceberg. As a vendor agnostic open table
format, Apache Iceberg allows different engines to access the same data at the same time,
thereby enabling data sharing across multiple repositories (for example, data warehouses
and data lakes). This allows using new technology with old data through metadata integration, and allows users to migrate data and workloads at their own pace. Open formats and standards ensure interoperability with future technology stacks.
As an example of the data sharing aspect, an IBM Db2® Warehouse has the option to read from and write to a cloud bucket using open formats such as Parquet and Iceberg. The bucket metadata (table schemas and others) can be exposed to watsonx.data using the 'Metadata Sync' feature provided by Apache Iceberg. This allows for seamless integration and sharing of data between IBM Db2 Warehouse and IBM watsonx.data without the need for data duplication or additional ETL operations, while allowing some of the workloads to be offloaded to IBM watsonx.data.
IBM watsonx.data is delivered as containerized software as part of the IBM Cloud Pak® for Data (CP4D) software bundle. IBM watsonx.data can be deployed on premises or across multiple clouds, and is also available as a managed service on AWS. An entry-level developer version is also available if you want to try it out.
Figure 1-1 IBM watsonx.data architecture
Data Lake modernization
Accelerate modernization of your Data Lake with Apache Iceberg and object store.
Replace or augment legacy Hadoop data lakes with an open data lakehouse and gain better performance, security, and governance, without migration or ETL. Decoupling compute and storage enables independent scalability and lower costs.
Real-time analytics and BI
Combine data from existing sources with new data to unlock new, faster insights
without the cost and complexity of duplicating and moving data and metadata
across different environments.
Streamline data engineering
Reduce data pipelines, simplify data transformation, and enrich data for
consumption using Spark, SQL, Python, or an AI-infused conversational interface.
Prepare Data for AI
Collect, curate and prepare data efficiently for use by AI with Spark and Milvus
vector database. Build, train, tune, deploy, and monitor AI/ML models with trusted
and governed data in IBM watsonx.data and ensure compliance with lineage and
reproducibility of data used for AI. Integrated vectorized embedding capabilities in
Milvus enable Retrieval Augmented Generation (RAG) use cases at scale across
large sets of trusted, governed data.
Generative AI-powered data insights
Leverage generative AI infused in watsonx.data to find, augment, and visualize data
and unlock new data insights through a conversational interface - no SQL required.
Figure 1-2 shows the positioning of watsonx.data within the IBM watsonx ecosystem.
1.4 IBM Storage Scale
IBM Storage Scale (formerly known as IBM Spectrum Scale or IBM General Parallel File System (GPFS)) is industry-leading IBM storage software for file and object storage. It
can be deployed as a software-defined storage solution that effectively meets the demands of
AI, big data, analytics, and HPC workloads. It has market-leading performance and scalability,
and a wealth of sophisticated data management capabilities.
IBM Storage Scale System (formerly known as the IBM Elastic Storage® Server or ESS) is a
fully integrated and tested IBM Storage Scale building block (Appliance) that provides
enterprise-grade performance, reliability, availability, and serviceability. It is an optimal way to deploy IBM Storage Scale storage for most IBM Storage Scale use cases. Alternatively, as a true software-defined storage (SDS) solution, customers can choose to deploy IBM Storage Scale over commodity servers (such as x86-based, storage-rich servers), whether in the customer's on-premises environment or on a public cloud.
The ever-growing volume of data, the multiple varieties and formats of data, and data silos dispersed across on-premises and private/public clouds add to the overall complexity and cost of building a modern lakehouse solution for the AI age. Enterprises who have invested in
traditional data warehouses and data lakes are looking to simplify and modernize their
applications. They are looking to integrate and unify the dispersed data sources for better
data visibility, reducing duplication and controlling costs. These data sources could be cloud
object storage, HDFS, or even databases.
With IBM Storage Scale, customers can build a highly scalable Global Data Platform for their
Lakehouse environments, offering higher performance, cost advantage and superior data
management capabilities. IBM Storage Scale becomes the storage layer for the Lakehouse.
Data may reside within Scale or be virtualized into Scale from any cloud, from any edge or
from any legacy data silos, whether object, file or HDFS format. Data may be orchestrated to
IBM Storage Scale to minimize the time to results. The Global Data Platform, powered by
IBM Storage Scale offers the following differentiated data services. See Figure 1-3 on page 8.
Data Access Services
With a rich set of data protocols, IBM Storage Scale Data Access Services provide unified
and shared file and object access to any unstructured data stored anywhere across an
organization. The data access services are “multi-lingual”, meaning some applications can
create and access data with a certain protocol, and others may require access to the
same data with a different protocol at the same time.
Storage Abstraction and Acceleration Services
IBM's global data platform provides high-performance data access from where the data
resides. By leveraging the IBM Storage Scale's Active File Management (AFM), it can
abstract and virtualize remote data sources dispersed across the enterprise to be
managed under a common storage namespace and accelerate them for high-performance
data access.
Data Management Services
IBM Storage Scale provides comprehensive information lifecycle management services
including a highly flexible policy engine that allows customers to define rules for optimizing
the storage of their unstructured data. These services transparently move data to the
appropriate tier of storage, optimizing both cost and performance based on an
organization's retention, archiving and data governance policies.
Data Resiliency Services
Data Resiliency Services provides comprehensive tools and capabilities to identify and
detect threats to protect an organization's data, and provide essential response and
recovery capabilities when security breaches occur. These data resilience services align
with all aspects of the NIST security framework, from practicing cyber hygiene before an
event, all the way through detection, response, and recovery.
Figure 1-3 IBM Storage Scale, a Global Data Platform for Unstructured Data
For more information about IBM Storage Scale, see IBM Storage Scale.
For more information about IBM Storage Scale System (appliance), see IBM Storage Scale
System.
The Cluster Export Services (CES) infrastructure in IBM Storage Scale manages the
following aspects for the Data Access services, including the S3 service:
Managing high availability (HA): The participating nodes are designated as CES nodes or
protocol nodes. A set of floating IP addresses, called CES address pool (CES IP Pool), is
defined and distributed among the CES nodes. As nodes enter and leave the
IBM Storage Scale cluster, the addresses in the pool can be redistributed among the CES
nodes to provide high availability. Higher-level application nodes access the S3 service
over one or more of these floating IPs, which are assigned to active protocol nodes and
moved to an inactive node in case of a failover.
Monitoring the health of these protocols and raising events or alerts during failures
Managing the floating IPs (CES IPs) that are used for accessing these protocols, including failover and failback of these addresses, which might be triggered by any protocol node failures
The new High Performance S3 service is based on Red Hat NooBaa but does not require a Red Hat OpenShift environment to be deployed, which further simplifies the S3 service
deployment and architecture. This simplified architecture allows for both containerized and
non-containerized S3 applications to access the IBM Storage Scale S3 service.
The High Performance S3 service is optimized for multi-protocol data access to enable
workflows which access the same instance of data using S3 and other access protocols. S3
objects are mapped to files and buckets are mapped to directories within IBM Storage Scale
and vice versa.
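As a simple illustration of this mapping (the endpoint, credentials, bucket name, and paths below are placeholders rather than values from this environment), an object uploaded through the S3 interface appears as a regular file under the bucket's backing directory:
$ AWS_ACCESS_KEY_ID=<access key> AWS_SECRET_ACCESS_KEY=<secret key> \
  aws --endpoint https://<CES-IP>:<port> s3 cp sales.csv s3://<bucket name>/sales.csv
# The same data is immediately visible as a file in the backing directory
$ ls -l <filesystem path of the bucket>/sales.csv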
IBM Storage Scale AFM virtualizes remote S3 buckets at the fileset level; cache relationships are created per fileset. Multiple such cache relationships can exist per file system, corresponding to remote buckets on various public clouds. Once the cache relationships are created, the remote S3 buckets appear as local buckets under the IBM Storage Scale file system, under a common storage namespace. This eliminates the need for data copies and greatly eases the management of those dispersed storage systems.
AFM uses user-defined intelligent policies to accelerate data access including automatic
eviction of data.
Customers looking for an integrated compute and storage infrastructure solution for IBM watsonx.data in an appliance form factor may consider an IBM Fusion HCI-based solution.
Depending on the customer's use case, IBM Storage Scale can be leveraged in this solution
in either of the following two ways, or a combination of both:
1. As a high-performance enterprise storage and as the primary object storage layer for the
Lakehouse solution. The data buckets reside locally on the IBM Storage Scale file system
itself.
2. As a persistent cache and storage acceleration layer for accessing remote object stores
globally dispersed across various clouds, data centers and locations.
For dispersed buckets, AFM abstracts and accelerates them in a way that these external
buckets appear as local buckets residing on the IBM Storage Scale file system itself.
High-performance object access is delivered with intelligent caching service provided by
AFM.
The S3 service then exposes the buckets (local or accelerated) to IBM watsonx.data for
attachment to a query engine such as Presto or Spark.
This solution paves the way for complete separation of compute and storage, which comes with the benefit of being able to manage, operate, scale, and grow the compute (Red Hat OpenShift/IBM watsonx.data) and storage (IBM Storage Scale) layers completely independently of each other. The storage and the S3 service stay outside OpenShift and are accessed through the S3 protocol from the compute layer in a plug-and-play configuration.
When processing engines within watsonx.data access data, the request reaches the
IBM Storage Scale S3 service over the S3 protocol. The S3 service interacts with the
IBM Storage Scale client on the same node, which then hands off the I/O to IBM Storage
Scale server nodes using the NSD protocol. If the bucket happens to be remote and has not
been cached already onto the IBM Storage Scale file system, AFM gateways are engaged to
access the remote object bucket. All this happens transparently without the IBM watsonx.data
applications having to know details of the remote object bucket itself.
The following top benefits are realized by using IBM Storage Scale with IBM watsonx.data
Lakehouse:
Storage abstraction and virtualization, eliminate silos
Leveraging AFM, IBM Storage Scale can virtualize and abstract dispersed storages
(islands) all over the enterprise and make them available under a common namespace. A
single global namespace delivers a consistent, seamless experience for new or existing
storage, making it easier to manage them from a single window of control. It reduces
unnecessary data copies and improves efficiency, security and governance. Data may be
virtualized and orchestrated into Scale from any cloud, from any edge or from any legacy
data silos, whether object, file or HDFS format, thereby minimizing the time to result.
Accelerated storage where performance matters
IBM Storage Scale AFM performs as a tier 1 data caching service, performing automatic,
transparent caching of back-end storage systems. It provides a high-performance
persistent storage cache, together with low capacity requirements. This has the effect of accelerating data queries and improving economics by fronting lower-performance storage. With watsonx.data, a 5-15x improvement in query performance can be seen.
Collapse layers and simplify data integration with multi-protocol data access
IBM Storage Scale has the most comprehensive support for data access protocols. It
supports data access by using S3, NFS, SMB, POSIX, HDFS and GPUDirect. This feature
eliminates the need to maintain separate copies of the same data for traditional applications, analytics, and AI, and enables globally dispersed teams to collaborate on data regardless of protocol, location, or format.
While S3 is a must for lakehouses, multi-protocols provide the flexibility to ingest or access
data from various legacy data sources. For example,
– Data can be ingested into a bucket using NFS, and the same data is instantly available for processing by watsonx.data engines via S3.
– Data may be curated and cleansed via Spark in IBM watsonx.data for AI model training
or inference purposes. The curated data may be made available to AI workflows
through POSIX and GPUDirect for highest performance access.
This facilitates in-place analytics and simplifies the complexity of enterprise-wide data
workflows starting from data cleansing all the way to AI.
A Lakehouse optimized for AI
IBM Storage Scale, with its rich set of data access protocols, provides a unified data platform for analytics and AI, reduces costs, and simplifies data workflows. As a high-performance storage platform, it minimizes the cost of training AI models by delivering a faster time to solution, which matters because GPU resources are expensive. GPUDirect Storage (GDS) offers high-bandwidth, low-latency performance to train generative AI models faster, and the platform provides a landing zone for high-speed data ingest into AI training jobs.
Note: The Granite series of Generative AI models shipped with IBM watsonx.ai were
trained with large datasets residing on IBM Storage Scale.
Lower costs
– The IBM Storage Scale System provides much higher storage density than the competition. This translates into cost savings in terms of power, cooling, and rack space needed in the data center. For customers requiring higher storage capacity and a growth outlook, this can lower the Total Cost of Ownership (TCO) significantly over the years.
Key advantages of IBM Storage Scale-based data cache are the following:
It operates as a shared data cache, available to all the engines in IBM watsonx.data,
whereas Alluxio is only available to Presto.
The shared data cache is available to all the worker nodes in IBM watsonx.data at any
given time.
It provides a persistent data cache even for newly provisioned engines and survives
engine restarts.
Shared data cache is available to multiple protocols and not just to S3.
Note: Many public cloud providers charge their customers data egress costs for moving data out of the cloud. Therefore, having the data cached locally using AFM provides cost savings on data egress. Storage acceleration also reduces the contention for bandwidth.
Figure 2-2 AFM aggregating dispersed storages under a common storage namespace
Here are some key use cases of IBM Storage Scale AFM for IBM watsonx.data:
1. IBM watsonx.data is running on premises and the applications (Presto/Spark) require high-performance access to data stored in S3 buckets in a public cloud or in a different data center location. AFM transparently caches the data from its home location and accelerates the storage performance.
2. IBM watsonx.data is deployed on a public cloud. However, you prefer to keep your enterprise data on premises for security or regulatory reasons. In these scenarios, AFM can transparently cache the data from the on-premises location and accelerate storage performance.
3. IBM watsonx.data is running on premises, accessing S3 data sources that reside on premises as well. However, the storage performance from these data stores is not adequate to meet
your query SLAs. In these scenarios, AFM can be used to accelerate
IBM watsonx.data queries by fronting lower performance storage.
Table 3-1 IBM watsonx.data with IBM Storage Scale architecture components
Product name Version
watsonx.data 2.0.2
Architecture x86_64
3.2 Planning
This section describes planning for IBM Storage Scale.
Plan for IBM Storage Scale for capacity, performance, and storage abstraction (AFM) and for
advanced features such as Storage Tiering.
The storage capacity planning for the IBM Storage Scale cluster depends upon the
customer's use case, whether Scale is being used as the primary storage for the S3 buckets
or only for storage acceleration, or both.
For capacity planning as primary storage, take into account your current storage requirements and the projected year-over-year (YoY) growth, together with the combined storage bandwidth offered by the system, to plan for optimum performance. See Configuring and tuning your system for GPFS and Parameters for performance tuning and optimization to tune your cluster for optimum performance.
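As a hedged illustration only (the values shown are placeholders and must be derived from the tuning guides above rather than taken as recommendations), cluster-wide tuning parameters are typically adjusted with the mmchconfig command:
# mmchconfig pagepool=32G,workerThreads=512 -N <node class> -i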
For storage acceleration, each bucket being accelerated maps to one AFM gateway node. In
an I/O heavy production system with multiple accelerated buckets, it may be worthwhile to
configure two or more AFM gateway nodes, for optimum performance and high availability
(HA).
It is also recommended that AFM filesets backing remote S3 buckets are configured over a fast storage tier, such as a storage pool configured over NVMe disks.
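One way to achieve this is a file placement policy rule that directs new files in the AFM fileset to a fast storage pool. The pool and fileset names below are assumptions for illustration; adapt them to your configuration and apply the policy with mmchpolicy:
# cat place-afm-on-nvme.pol
RULE 'afm-on-nvme' SET POOL 'nvme-pool' FOR FILESET ('ibmcos-bucket')
RULE 'default' SET POOL 'system'
# mmchpolicy <Device> place-afm-on-nvme.pol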
configured as a Protocol cluster, unless the original IBM Storage Scale System itself can be
upgraded to 5.2.1.
As is the case with other CES-enabled protocols, the S3 service is configured on designated
protocol nodes. There are two possible architectures to deploy the S3 service:
The S3 Protocol nodes can be part of the Scale (server) cluster itself. This may be the
preferred approach for software-defined Scale environments including deployments on
public clouds.
Otherwise, the S3 Protocol nodes can be part of a separate Scale (client) cluster and access the Scale file system using remote mount access, as described in the IBM Storage Scale documentation topic Mounting a remote GPFS file system.
In this configuration, the Scale server cluster (such as the IBM Storage Scale System or ESS) grants the Scale client cluster permission to access the file systems that it owns. The Scale client cluster remotely mounts the file systems and operates as a Protocol cluster for the application layer above. For very small or testing environments, the value of this additional
administration effort might not become easily apparent. For production environments,
however, this approach has some distinct advantages.
– The Scale client cluster environment can be individually scaled, managed and
upgraded, for example, to take advantage of new S3 service versions and
improvements delivered with new IBM Storage Scale releases, while incurring minimal or no changes to the existing IBM Storage Scale System infrastructure.
– A higher level of storage isolation and multi-tenancy at the storage level can be accomplished by constraining the IBM Storage Scale client clusters to access only the designated IBM Storage Scale file systems or filesets, if the security policies demand it. For example, in an organization with multiple lines of business or departments, a dedicated Scale client cluster can be assigned to each such department while keeping a common storage back end.
– The Scale client cluster and the Scale server can run different versions of Scale
software, if needed.
– See Figure 4-1 on page 24, which shows the three-tiered deployment for the IBM watsonx.data with IBM Storage Scale architecture.
Network Planning
The performance of the IBM watsonx.data IBM Storage Scale solution depends on the
network provisioned for communication between watsonx.data and IBM Storage Scale. In
case of a multi-tiered architecture defined in the prior section, it's important to plan for
separate networks and network interfaces for:
The network between the OpenShift cluster and the S3 Protocol nodes.
The network between the S3 Protocol cluster and the IBM Storage Scale server (NSD) cluster, including the network between the Storage Scale server nodes themselves.
For production workloads, the following hardware configuration is recommended for the Red Hat OpenShift based worker nodes running IBM watsonx.data:
Raw cores = 64
System memory (GB) = 1920
Local storage = 300 GB
If AFM-based storage acceleration is needed, determine how large the persistent storage
cache must be. This depends on various factors including:
The total number of filesets (for example, buckets) being cached
Total size of the data on remote S3 storage, and size of the buckets that need to be
cached at some point in time, based on access patterns by watsonx.data applications.
Amount of data that needs to be read locally by watsonx.data applications during a short
span of time to avoid multiple round-trip reads/writes to the remote S3 storage
Type of caching used: whether it is read-only cache or read-write cache
Table 3-2 provides guidance on how to size the storage capacity to be configured for the
cache. This is defined as a percentage of the actual storage size of the dispersed storage
(remote/local) that is being accelerated.
Table 3-2 Sizing guidance for persistent storage cache for AFM-based storage acceleration
Size of remote S3 data source Cache size (as % of the remote storage size)
10 TB or smaller 30%
10 TB to 1 PB 20%
1 PB or larger 10%
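For example, applying this guidance, a 500 TB remote S3 data source calls for roughly 100 TB (20%) of local cache capacity, whereas a 5 TB data source needs only about 1.5 TB (30%).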
Figure 4-1 Three tiered deployment for IBM watsonx.data with IBM Storage Scale architecture
Example 4-1 shows the configuration of the Scale client cluster used as an example for this paper.
Example 4-1 Configurations of the Scale client cluster used in this Redpaper
[root@fscc-sr650-46 ~]# mmlscluster
The CES IPs are configured and assigned to the nodes of the Scale client cluster. In Example 4-2, two CES IPs are set up.
Example 4-3 shows that the Scale client cluster has mounted two file systems: one for data access, and the other acting as the CES shared root.
Example 4-3 Listing the remote mounted filesystems in the Scale client cluster
[root@fscc-sr650-46 ~]# mmremotefs show all
Local Name Remote Name Cluster name Mount Point Mount Options Automount Drive Priority
essData essData ess3k5.bda.scale.com /gpfs/essData rw no - 0
essCesRoot essCesRoot ess3k5.bda.scale.com /gpfs/essCesRoot rw no - 0
The Scale server cluster owns the file systems and is described in Example 4-4.
Example 4-4 Configurations of the Scale server cluster used in this Redpaper
[root@fscc-sr650-36 ~]# mmlscluster
The Scale client cluster, containing the S3 service, is used to configure S3 access for
watsonx.data. This process is explained more elaborately over the next sections. First, an S3
account is needed, which can be created after the corresponding user and group have been
defined within the operating system. For this account, an S3 bucket is created afterward. The combination of CES IP and port, account credentials (access key and secret key), and bucket name is required to define the connection from watsonx.data.
Note: Reverse DNS lookup needs to be available for all CES IPs. The CES IPs must be
unique and cannot be cluster node IPs.
Configure the CES shared root file system, which is used for configuration and
administration of the CES protocols using:
# ./spectrumscale config protocols -f essCesRoot -m /gpfs/essCesRoot
Note: It is recommended that the CES shared root is a separate file system. The CES
shared root needs to be at least 4 GB.
Review the configuration and perform a precheck of the deployment using
# ./spectrumscale node list
# ./spectrumscale deploy --precheck
Finally, deploy the changes using:
# ./spectrumscale deploy
Verify that the S3 services are up and running on the designated S3 protocol nodes:
# mmces service list -a
node46s.bda.scale.ibm.com: S3 is running
node47s.bda.scale.ibm.com: S3 is running
node48s.bda.scale.ibm.com: S3 is running
To view the S3 protocol configuration, run:
# mms3 config list
Example 4-5 shows the default configuration of the S3 service.
Filesets can have their own defined quotas for data and inodes. The owning fileset becomes
an attribute of each file for enforcing IBM Storage Scale based policies (such as automated
tiering and placement, encryption, compression) as needed. Each fileset mounts at a regular
directory path (called JunctionPath) within the Scale file system. A regular S3 bucket may be
defined over the mount path.
Example 4-6 shows how to create an independent fileset and create a S3 bucket on top of it.
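As a hedged sketch of this step (the file system name essData and fileset name watsonx are taken from this paper's environment; verify the exact options against the mmcrfileset and mmlinkfileset documentation), an independent fileset is typically created and linked as follows:
# Create an independent fileset with its own inode space
# mmcrfileset essData watsonx --inode-space new
# Link the fileset at its junction path within the file system
# mmlinkfileset essData watsonx -J /gpfs/essData/watsonx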
Then proceed to create a S3 bucket over the fileset's mount path (directory), described in the
following section. In the above example, the mount path is /gpfs/essData/watsonx/ which
would be the default bucket path (--newBucketsPath) for our S3 account.
Alternatively, a bucket could be configured over a pre-existing directory. For example, a bucket
could be configured over the mount point directory of an IBM Storage Scale fileset, so that the
fileset becomes the backing storage of the S3 bucket.
The steps to create an S3 bucket are shown in the following command listings.
Create an S3 account first, associating the account with a system user, where <uid> and <gid> are the POSIX UID and GID associated with the S3 account. These parameters do not need to be passed if an account name is passed; the account name should be a valid system user name.
<Path> is a filesystem absolute path, which will act as a base path for S3 buckets created
using S3 API. This path can be overridden for buckets created with the mms3 bucket create
command.
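A hedged sketch of the account-creation command follows; the option names reflect the parameters described above and should be confirmed against the mms3 account create documentation for your release:
# mms3 account create <S3 account-name> --uid <uid> --gid <gid> --newBucketsPath <Path>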
To view the details associated with this S3 account including its AWS access credentials, run:
# mms3 account list <S3 account-name>
Then, create one or more S3 buckets, corresponding to this S3 account. There are two ways
a bucket can be created.
1. Using the mms3 command:
# mms3 bucket create <S3 bucket-name> --accountName <S3 account-name>
--filesystemPath <Path>
Where <S3 account-name> is the name of account which should be used for the bucket.
<Path> is the filesystem absolute path including the directory for the bucket, which is to be
used for bucket creation. This could be different than the default bucket path
(--newBucketsPath) configured for the S3 account.
The command will create a new directory with system path <Path> which corresponds to
the S3 bucket.
2. Using the S3 API, for example the “aws” S3 client as shown in Example 4-7.
Name     New Buckets Path        Uid   Gid   Access Key               Secret Key
-------  ----------------------  ----  ----  -----------------------  -------------------------
watsonx  /gpfs/ess3k54/watsonx/  2002  100   <AWS_ACCESS_KEY_ID>      <AWS_SECRET_ACCESS_KEY>
Create an S3 bucket named "b-watsonx" under the default bucket path:
# mms3 bucket create b-watsonx --accountName watsonx --filesystemPath /gpfs/essData/watsonx/b-watsonx
Create an S3 bucket named "b-watsonx2" not under the default bucket path:
# mms3 bucket create b-watsonx2 --accountName watsonx --filesystemPath /gpfs/essData/b-watsonx2
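For the S3 API path (option 2 above), a bucket can also be created with the aws client by using the access key and secret key shown in the account listing. The bucket name, CES IP, and port below are illustrative placeholders rather than values verified for this environment:
$ AWS_ACCESS_KEY_ID=<Access Key> AWS_SECRET_ACCESS_KEY=<Secret Key> \
  aws --endpoint https://<CES-IP>:<port> s3 mb s3://b-watsonx3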
Follow these instructions for configuring storage acceleration over remote buckets:
To start with, designate one or more nodes as AFM nodes. To designate a node as an
AFM node, first ensure that the node has the AFM rpm (gpfs.afm.cos.*) installed, and the
node has necessary connectivity to the remote cloud object S3 endpoint. Then run:
# mmchnode --gateway -N <AFM node hostname>
Get the AWS access key ID and secret key for your remote bucket instance. For example,
if using IBM COS, navigate to cloud.ibm.com → Instances → Storage → Service
Credentials Tab → expand on down arrow. Get the details from cos_hmac_keys.
Log in to an AFM gateway node.
Create the access keys in AFM corresponding to the remote object bucket.
# mmafmcoskeys bucket[:{[Region@]Server|ExportMap}] set {<access key> <secret
key> | --keyfile filePath}
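For example, for the IBM COS bucket used later in this section, the command could look like the following; the endpoint shown is an assumed us-south IBM COS endpoint and must be replaced with your own:
# mmafmcoskeys chm-cos-s3-bucket:s3.us-south.cloud-object-storage.appdomain.cloud set <access key> <secret key>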
Create an AFM relationship for the remote S3 bucket as shown in Example 4-8.
Note the --dir parameter passed to the command. This is done to ensure that the fileset
is created under the S3 “New Buckets Path” (from the command “mms3 account list”).
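As a hedged sketch of such a relationship (the endpoint, caching mode, and option names are assumptions; verify them against the mmafmcosconfig documentation for your release):
# mmafmcosconfig essData ibmcos-bucket \
    --endpoint https://s3.us-south.cloud-object-storage.appdomain.cloud \
    --bucket chm-cos-s3-bucket --mode iw --object-fs \
    --dir /gpfs/essData/watsonx/ibmcos-bucket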
To see the newly created fileset, run:
# mmlsfileset <Device>
To see the relationship of the fileset with the remote bucket, run:
# mmafmctl <Device> getstate
where <Device> is the name of the Storage Scale file system.
Create a S3 bucket over the fileset's mount path. Change the ownership of the directory to
that of the account corresponding to the S3 bucket.
# chown <s3 account user>:<s3 account group> <fileset mount path>
Create the S3 bucket pointing to that directory:
# mms3 bucket create <bucket-name> --accountName <S3 account name> --filesystemPath <fileset mount path>/<bucket-name>
Example:
In this example, there is a remote S3 bucket named “chm-cos-s3-bucket” residing on
IBM Cloud Object Storage (IBMCOS). The following steps illustrate creating a virtual/accelerated S3 bucket named "b-watsonx-cos" corresponding to the IBMCOS bucket.
Note: The output shows the AFM gateway node(s). The Cache State should be “Active” to
indicate that the storage acceleration is working properly.
Now create a S3 bucket over the directory /gpfs/essData/watsonx/ibmcos-bucket:
# chown watsonx:users ibmcos-bucket/
# mms3 bucket create b-watsonx-cos --accountName watsonx --filesystemPath
/gpfs/essData/watsonx/ibmcos-bucket
Starting to create bucket with name b-watsonx-cos
Note: The directory '/gpfs/essData/watsonx/ibmcos-bucket' for bucket already
exists. Skipping update of ownership and the setting of permissions of the
directory for the user with uid:gid=2002:100
Bucket b-watsonx-cos created successfully
4.4 Define IBM Storage Scale S3 buckets to IBM watsonx.data
Install IBM watsonx.data using the documented procedure at Installing watsonx.data.
Use the following procedure to register an IBM Storage Scale S3 bucket to your watsonx.data
instance as externally managed storage and associate a catalog for the bucket. This catalog
serves as the query interface for watsonx.data for the data stored within the bucket. See
Figure 4-2.
Figure 4-2 watsonx.data panel for adding an IBM Storage Scale component
Create an engine instance such as Presto and associate the catalog to that engine. This will
make the S3 bucket discoverable through the catalog. Then continue to create schemas and
tables under the storage catalog, as shown in the following command:
create schema <catalog name>.<schema name> with (location = 's3a://<bucket name>/<directory for schema>');
For example, if the name of the bucket is b-watsonx and the catalog name is c_scale, create a
schema named “schema1” by running:
create schema c_scale.schema1 with (location = 's3a://b-watsonx/schema1');
This creates a directory called schema1 under the bucket's filesystem path in
IBM Storage Scale. Data for all tables created under this schema would reside underneath
this directory.
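The result can be verified on the storage side as well; assuming the bucket path used earlier in this chapter, the new schema directory is visible both through POSIX and through S3 (credentials and endpoint are placeholders), and both listings should show the schema1 directory (prefix):
# ls /gpfs/essData/watsonx/b-watsonx/
$ AWS_ACCESS_KEY_ID=<Access Key> AWS_SECRET_ACCESS_KEY=<Secret Key> \
  aws --endpoint https://<CES-IP>:<port> s3 ls s3://b-watsonx/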
Figure 4-3 Sample view of the watsonx.data Infrastructure manager GUI
Chapter 5. Monitoring
Since IBM watsonx.data runs within a Red Hat OpenShift cluster, you can use all the
standard monitoring features available within Red Hat OpenShift to monitor watsonx.data
projects or namespaces. In addition, you can use the monitoring capabilities natively available within watsonx.data and within IBM Storage Scale.
Within the Red Hat OpenShift cluster, the URL route can be queried - while passing the
namespace of the watsonx.data installation “-n ${PROJECT_CPD_INST_OPERANDS}" - as
follows:
# oc get route -n ${PROJECT_CPD_INST_OPERANDS} | grep presto
ibm-lh-lakehouse-presto-01-presto-svc
ibm-lh-lakehouse-presto-01-presto-svc-cpd-instance-test.apps.ocp4x.scale.ibm.com
ibm-lh-lakehouse-presto-01-presto-svc 8443 reencrypt
None
The monitoring endpoint is specific for a S3 server, for example, in an environment of multiple
S3 protocol nodes each node exposes such an endpoint. This allows for fine-grained
monitoring and analysis, for example, when multiple S3 nodes are actively in use, the
monitoring can show if the load is evenly balanced across the nodes.
The S3 monitoring endpoint is available at http://<host>:7004/metrics/nsfs_stats, so the
data can be queried directly, for example:
# curl http://10.10.1.121:7004/metrics/nsfs_stats
{"nsfs_counters":{"noobaa_nsfs_io_read_count":0,"noobaa_nsfs_io_write_count":1,
"noobaa_nsfs_io_read_bytes":0,"noobaa_nsfs_io_write_bytes":4},"op_stats_counter
s":{"noobaa_nsfs_op_upload_object_count":1,"noobaa_nsfs_op_upload_object_error_
count":0}}
As the result is JSON, the output can be further parsed with a JSON parser to be
post-processed, for example for scripting and automating purposes. Obtaining the write count
in the above output is as simple as querying the respective field, for example:
# curl -s http://10.10.1.121:7004/metrics/nsfs_stats | jq -r
'.nsfs_counters.noobaa_nsfs_io_write_count'
1
These probes and metrics can be exploited with monitoring tools like Grafana to get an
overview of the system and can be further extended into integrated monitoring frameworks to
build a more complex analysis pipeline.
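As a small scripting sketch that builds on the commands above (the host names are the S3 protocol nodes from this paper's example environment, and the metrics endpoint is assumed to be reachable on each node), the write counters of all protocol nodes can be collected in one pass:
$ for node in node46s.bda.scale.ibm.com node47s.bda.scale.ibm.com node48s.bda.scale.ibm.com; do
    echo -n "$node write count: "
    curl -s http://$node:7004/metrics/nsfs_stats | jq -r '.nsfs_counters.noobaa_nsfs_io_write_count'
  done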
Follow the instructions outlined in Simplified setup: Using SKLM with a self-signed certificate
to enable encryption for the IBM Storage Scale cluster.
Follow Part 1 of the document: “Installing and configuring SKLM” to set up crypto servers.
They serve as key managers for IBM Storage Scale nodes.
Then follow Part 2: Configuring the Scale cluster for encryption.
Once encryption has been enabled for the fileset, define a S3 bucket on the fileset's mount
point directory using the regular procedure.
A sample of the Subject Alternative Name (SAN) file as stated in the above documentation is
shown in Example 6-1, containing a CES IP and the corresponding DNS name.
[req_distinguished_name]
CN = localhost
[req_ext]
subjectAltName = DNS:localhost,DNS:cesip1.bda.scale.ibm.com,IP:10.10.1.121
Note: Remove the newline/line-break characters from the actual certificate content.
The oc patch command restarts the compute engines. Wait until the restart is complete and
continue to register IBM Storage Scale buckets using a secure (HTTPS) endpoint.
Once the table is defined, it is possible to view the existing data with SQL queries in Presto.
Users may ingest more data files to the same directory via NFS/POSIX or may even update the existing data files (for example, update or append records) outside of Presto, and the same
would be reflected in any subsequent SQL queries run from Presto. This feature can be
leveraged to realize complex workflows within the enterprise data pipeline.
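A minimal sketch of this flow follows; the staging path, schema, and table names are hypothetical:
# Ingest an additional data file into the table's directory over POSIX
$ cp /staging/new_orders.parquet /gpfs/essData/watsonx/b-watsonx/schema1/orders/
# A subsequent Presto query against the table (for example,
# select count(*) from c_scale.schema1.orders;) reflects the newly added file.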
6.3.2 Example B. Data sharing at a S3 bucket level
In another example of data sharing within a bucket, a bucket can be defined that belongs to
an ingest job and the same bucket can be made available for another S3 account, for example
as read only to process the data in the bucket. This can be achieved using typical S3 bucket
policies and the s3api put-bucket-policy command, for example as the owner of the bucket
activate a policy using:
aws --endpoint https://<CES-IP>:<port> s3api put-bucket-policy --bucket <bucket
name> --policy file://<path-to-file>
Allowing access to a bucket can be configured using the prior command with the following
example policy.
$ cat policyReadWrite.json
{
"Version":"2012-10-17",
"Statement":[{
"Sid":"policyReadWrite",
"Effect":"Allow",
"Principal": { "AWS": "userReadWrite" },
"Action":["s3:*"],
"Resource":"*"}]
}
Allow read-only access with the following example policy:
$ cat policyReadOnly.json
{
"Version":"2012-10-17",
"Statement":[{
"Sid":"policyReadOnly",
"Effect":"Allow",
"Principal": { "AWS": "userReadOnly" },
"Action":["s3:GetObject", "s3:ListBucket"],
"Resource":"*"}]
}
Taking these steps together, the following shows the complete flow. As an example, users userMain and userReadOnly are created with their respective S3 accounts. For simplicity, define the following command aliases:
# access/secret as provided while creating the S3 accounts
$ alias s3uMain='AWS_ACCESS_KEY_ID=access... AWS_SECRET_ACCESS_KEY=secret...
aws --endpoint https://10.10.1.121:6443 s3'
$ alias s3uReadOnly='AWS_ACCESS_KEY_ID=access...
AWS_SECRET_ACCESS_KEY=secret... aws --endpoint https://10.10.1.121:6443 s3'
The commands can be run from virtually any system and as any console user; the access key and secret key combination defines the S3 account user for the commands that are executed.
On the system, there is a bucket for the main user as shown in Example 6-2 on page 46.
Name
------
b-userMain
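Continuing the flow as a hedged sketch (the access keys are those of the respective accounts, and the exact response to denied operations depends on the S3 service configuration), the bucket owner applies the read-only policy, after which the second account can list the bucket but not write to it:
# Applied with the credentials of the bucket owner (userMain)
$ AWS_ACCESS_KEY_ID=access... AWS_SECRET_ACCESS_KEY=secret... \
  aws --endpoint https://10.10.1.121:6443 s3api put-bucket-policy \
  --bucket b-userMain --policy file://policyReadOnly.json
# The read-only account can now list objects in the bucket ...
$ s3uReadOnly ls s3://b-userMain
# ... while write attempts are expected to be denied
$ s3uReadOnly cp local.txt s3://b-userMain/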
This example illustrates a simple scenario of data sharing at the level of IBM Storage Scale S3 buckets.
Related publications
The publications listed in this section are considered particularly suitable for a more detailed
discussion of the topics covered in this paper.
IBM Redbooks
The following IBM Redbooks publications provide additional information about the topic in this
document. Note that some publications referenced in this list might be available in softcopy
only. For the current online list of IBM Storage Scale Redbooks, see the IBM Redbooks website (ibm.com/redbooks).
IBM Storage Scale System Introduction Guide, REDP-5729
IBM Hybrid Solution for Scalable Data Solutions using IBM Spectrum Scale, REDP-5549
IBM Spectrum Scale and IBM Elastic Storage System Network Guide, REDP-5484
Accelerating IBM watsonx.data with IBM Fusion HCI, REDP-5720
You can search for, view, download or order these documents and other Redbooks,
Redpapers, Web Docs, draft and additional materials, at the following website:
ibm.com/redbooks
Online resources
These websites are also relevant as further information sources:
IBM watsonx.data
https://www.ibm.com/products/watsonx-data
Product Documentation for IBM watsonx.data
https://www.ibm.com/docs/en/watsonx/watsonxdata
IBM Storage Scale
https://www.ibm.com/products/storage-scale
Product Documentation for IBM Storage Scale
https://www.ibm.com/docs/en/storage-scale
Product Documentation for IBM Storage Scale System
https://www.ibm.com/docs/en/storage-scale-system
How to sync externally managed Iceberg tables with the catalog integration in
watsonx.data (blog)
Back cover
REDP-5743-00
ISBN 0738461881
Printed in U.S.A.
ibm.com/redbooks