In the fast-moving world of artificial intelligence and machine learning (AI/ML), everything seems to revolve around data. Entire careers are built around data. Data engineers, data analysts and data scientists are just a few roles that gather, synthesize, process, analyze and use data to help AI/ML solve real-world problems. As the volume of data continues to grow exponentially, the management of data becomes critical and a major challenge.
Challenges in data management
Data management brings many challenges in the AI/ML space, such as:
- Governance and compliance: Data governance often must meet the requirements of state or federal compliance policies for traceability, privacy and audit
- Data lineage: Tracking all variations or modifications to data is critical in replicating complex AI/ML workflows and outputs
- Data management: Managing the lifecycle of data, how users interface with it and its storage requirements as well as using the best tooling helps maintain costs and productivity
- Knowledge and expertise: There are hundreds of AI/ML tools available to data science engineers, and many new tools are added to the AI/ML industry every day. Introducing tools that have a familiar workflow—and that are easy to understand and consume—allows engineers to focus on their business goals, rather than on learning new tools
The combination of Red Hat OpenShift AI as an AI/ML platform and data version control from lakeFS help alleviate these challenges.
Data versioning with lakeFS
The most prevalent way developers manage source code today is through Git and the use of tools such as GitHub and GitLabs. There are many other tools available for use, but most have a similar workflow. Git, however, is not intended for objects, such as large data files, tarballs, container images or AI/ML models. For those file types, object storage is commonly used, often through the use of an object storage solution that offers an Amazon S3 interface. OpenShift AI has built-in support for interfacing with S3-accessible object storage.
Imagine being able to manage AI/ML data, models, pipeline artifacts and other large object files in a Git-like manner, either through a web console or an API. lakeFS serves as a S3 Gateway to many different object storage solutions, including Amazon S3, Azure Blob Storage, Google Cloud Storage, Red Hat OpenShift Data Foundation, MinIO and many more. Even better, it can be easily added between the OpenShift AI cluster and an existing object storage solution with very few changes to the environment. lakeFS can be run locally in the OpenShift cluster or OpenShift AI can connect to lakeFS in another on-premise environment, public cloud or private cloud.
With lakeFS, data engineers can now create new repositories for their AI/ML data and models, create branches, make changes, merge changes and track the entire lineage of data. It offers a single, familiar interface regardless of where the data or models are stored.
Try out lakeFS with OpenShift AI
Red Hat has worked closely with the lakeFS team at Treeverse to validate the integration of Red Hat OpenShift AI with lakeFS. We replicated the fraud detection demo found within the OpenShift AI documentation and adapted it to insert lakeFS in between OpenShift AI and a local instance of MinIO. With this validation complete, the lakeFS team has announced support for running lakeFS on an OpenShift cluster and its integration with OpenShift AI. Be sure to check out the Accelerating AI Innovation with lakeFS and OpenShift AI blog on the lakeFS site.
Follow the instructions on how to get your OpenShift AI environment up with lakeFS and MinIO and perform the fraud detection demo with the changes outlined in the instructions. You’ll get to test pulling data from lakeFS, storing data via lakeFS, saving trained models to lakeFS, pulling models to serve from lakeFS and even exploring how OpenShift AI pipelines use lakeFS for artifact storage.
Happy data versioning!
关于作者
Sean has been (back) at Red Hat since 2020 working with strategic Red Hat ecosystem partners to co-create integrated product solutions and get them to market.
产品
工具
试用购买与出售
沟通
关于红帽
我们是世界领先的企业开源解决方案供应商,提供包括 Linux、云、容器和 Kubernetes。我们致力于提供经过安全强化的解决方案,从核心数据中心到网络边缘,让企业能够更轻松地跨平台和环境运营。