Infrastructure

OpenConnect: LinkedIn’s next-generation AI pipeline ecosystem

LinkedIn is powered by an enormous volume of structured and unstructured data - from member profiles and job postings to Feed interactions and real-time event streams. But data alone doesn’t create a personalized, valuable, or trusted experience. Under the surface, it takes layers of sophisticated engineering and infrastructure to process that data securely and apply AI to transform it into smart, helpful, and trustworthy features for over 1.2 billion members worldwide.

To make this possible, we operate thousands of AI pipelines every day - processing petabytes of data to support a wide range of AI applications. These pipelines must be robust, scalable, and intuitive for developing AI models, while also flexible enough to support rapid experimentation. Given their critical role, we are always looking for ways to make these pipelines faster, more efficient, more reliable, and easier to work with.

That’s why we’ve spent the past two years building OpenConnect - a next-generation AI pipeline ecosystem that reduces launch times from over 14 minutes to under 30 seconds and cuts failure detection time by 80%. With improved scalability and resilience, OpenConnect supports more than 100k executions monthly across multiple data centers and CPU/GPU SKUs, powering 100% of LinkedIn’s AI workloads - including recommendation systems and generative AI use cases.

In this blog, we’ll explore the challenges we faced with our legacy pipeline system and how OpenConnect helped us overcome them. We’ll also share some interesting design decisions and architecture details that make OpenConnect a foundational part of LinkedIn’s AI infrastructure.

Figure 1: OpenConnect vs ProML: Key performance metrics

Challenges with LinkedIn’s previous AI pipeline platform

Before OpenConnect, LinkedIn’s AI workloads relied on our legacy ProML ecosystem. Despite its past success, ProML struggled to meet evolving AI and pipeline demands around scalability and flexibility. Notably, we began to see challenges with:

  • Reusability and extensibility: Verticals ran independent pipelines because their feature datasets and modeling pipelines were unique to their needs. However, modern AI applications involve substantial reuse (e.g., shared embeddings, shared datasets, multi-task learning), which means AI pipelines should support cross-functional, policy-aligned reusability from the outset.
  • Slow iteration and maintenance: Tangled setups with heavy dependencies meant users spent more than 10 minutes just to tweak a parameter or a line of code for a new ML experiment.
  • Rerun experience: The lack of automated or smart retries forced users to manually rerun tasks, repeating unnecessary steps and slowing down experimentation while wasting compute resources.
  • Scalability and robustness: The system was inefficient and prone to build-time and runtime errors when handling LinkedIn’s petabyte-scale data and diverse computational demands across multi-cluster environments, compromising reliability.

Overall architecture of OpenConnect

To address these challenges and apply our learnings from ProML, we revamped our AI pipeline ecosystem as OpenConnect. The architecture, illustrated in the diagram below, breaks down each layer and its role in the ecosystem. It is designed to support LinkedIn’s diverse AI use cases while ensuring scalability, flexibility, efficiency, and trust, including security and compliance.

Figure 2: LinkedIn AI training platform overview

A key aspect of this architecture is the authorship and execution environment, where users define pythonic workflows using the OpenConnect library. This enables seamless integration into the pipeline ecosystem. Below is an example of a simple workflow for MNIST training:

Figure 3: OpenConnect DSL
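The figure above shows the internal OpenConnect DSL. As a rough point of reference, the sketch below approximates the same shape of workflow using the open-source Flyte Python SDK that OpenConnect builds on; the task names and placeholder bodies are illustrative, not the actual OpenConnect API.

from flytekit import task, workflow

@task
def import_mnist(split: str) -> str:
    # Placeholder: would download or locate the MNIST split and return its path.
    return f"/datasets/mnist/{split}"

@task
def train_model(train_path: str, epochs: int) -> str:
    # Placeholder: would run the actual PyTorch/TensorFlow training loop.
    return f"/models/mnist-{epochs}ep"

@task
def evaluate_model(model_path: str, test_path: str) -> float:
    # Placeholder: would compute accuracy on the held-out split.
    return 0.0

@workflow
def mnist_training_workflow(epochs: int = 5) -> float:
    train_path = import_mnist(split="train")
    test_path = import_mnist(split="test")
    model_path = train_model(train_path=train_path, epochs=epochs)
    return evaluate_model(model_path=model_path, test_path=test_path)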

How we addressed these challenges

Reusability and extensibility

Within LinkedIn, it is typical for AI platform engineers and AI engineers to work together to develop and maintain a comprehensive set of standard and custom components, centralized in our Reusable Component Hub. While just a sampling, these components include:

  • Data collection and labeling: Importer, Data Analyzer, Transformer, Extender
  • Model training and tuning: TensorFlow Trainer, PyTorch Trainer, Ray Tuner
  • Model evaluation: Model Analyzer, Modifier, Publisher
  • Offline inference: TensorFlow Inference, PyTorch Inference, Spark Inference, Ray Inference

Teams can publish their custom components, which then become discoverable by other teams. This infrastructure empowers teams across the organization to adopt and integrate these components into their workflows with minimal effort, enhancing flexibility, accelerating development, and promoting broad collaboration.

Figure 4: OpenConnect Reusable Component Hub diagram

Slow iteration and maintenance

Component dependency management

Our previous framework required a full rebuild for every code and configuration change, with users spending 10–15 minutes compiling workflows due to large codebases, complex dependencies, and the difficulty of uploading to remote clusters. This slowed down the whole iteration process and directly limited the number of experiments an individual could launch.

When developing OpenConnect, we tackled this problem by applying two core principles:

  • Decoupling component dependencies: Isolating dependencies ensures that workflows aren’t bogged down by unnecessary or conflicting dependencies.
  • Caching dependencies with Docker or manifests: Pre-built Docker images or manifests reduce redundant builds, enabling faster iteration and deployment.

The diagram below illustrates how these principles are implemented in OpenConnect, tailored to the needs of two key user groups: Component Consumers (who utilize workflows and models) and Component Producers (who develop and maintain components). It maps out the flow from user inputs and orchestration to dependency management and component development, providing a clear bridge between our strategy and its practical application for these users.

Figure 5: OpenConnect dependencies management

Component Consumer (AI Vertical MLEs)

AI engineers can launch workflows directly from their development IDEs. For example, they can trigger a workflow using the CLI command:

mldev run my_training_workflow

The workflow is then submitted to Flyte, which dispatches jobs to our on-prem compute cluster. These jobs include Spark/Java/Hadoop tasks or TF/PyTorch/Ray tasks, each relying on dependencies predefined in the component’s manifest.

Component Producer (AI Infra and AI Vertical Team)

Component producers develop component code and publish it via GitHub Actions. These components are registered in Flyte, with each component’s dependencies encapsulated in a self-contained manifest, such as a Docker image or dependency file.
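To illustrate the dependency-encapsulation idea, open-source Flyte lets a task pin its runtime to a pre-built container image, which is the closest public analogue to OpenConnect’s manifests. The sketch below is a minimal approximation; the image name and component are illustrative, and LinkedIn’s internal manifest format differs.

from flytekit import task

@task(container_image="registry.example.com/ai/pytorch-trainer:2024.05")
def pytorch_trainer(train_data: str, epochs: int = 5) -> str:
    # All heavyweight dependencies (CUDA, PyTorch, internal libraries) live in
    # the pre-built image, so consumers can launch the component without
    # rebuilding or re-uploading anything.
    return "/models/output"  # placeholder for the real training logic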

With the above design, our users can now iterate rapidly by launching workflows in seconds, as dependencies are pre-cached in Jar manifests or Docker images, eliminating the need for lengthy rebuilds and uploads for each experiment.

To further improve debugging, we also developed VS Code integrations with FlyteInteractive, enabling users to perform code inspection, profiling, and benchmarking to address challenges like AUC drops in distributed training and bottlenecks in transformer models.
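FlyteInteractive is available as an open-source flytekit plugin. Assuming the flytekitplugins-flyteinteractive package, attaching an embedded VS Code server to a task looks roughly like the sketch below; the task body is a placeholder.

from flytekit import task, workflow
from flytekitplugins.flyteinteractive import vscode

@task
@vscode
def debug_training_step(learning_rate: float) -> float:
    # While this task runs, connect to the exposed VS Code server to inspect
    # variables, profile hot loops, or benchmark a suspect model block.
    return learning_rate * 2  # placeholder for the real training logic

@workflow
def debug_workflow(learning_rate: float = 0.01) -> float:
    return debug_training_step(learning_rate=learning_rate)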

Rerun experience

Efficient reruns are critical for accelerating experimentation and debugging in machine learning workflows. OpenConnect enhances the rerun experience through two key mechanisms: output caching and partial rerun.

Output caching

Output caching stores intermediate results to eliminate redundant computations for internal ML engineers who belong to the same trusted group and have the necessary access permissions. As shown in the diagram, when a workflow is rerun, cached outputs are reused, significantly reducing execution time. For example:

  • Initially, the Data Importer and Data Row Manipulator steps are cached with durations of 2 hours and 3 hours, respectively, while the Model Trainer step fails after 2 hours.
  • On rerun, the system reuses the cached results for the Data Importer and Data Row Manipulator steps, re-executing only the failed Model Trainer step and the downstream Model Analyzer step.

This approach minimizes computational overhead, ensuring faster iterations for users.

Figure 6: Execution of a workflow with output caching
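In open-source Flyte, the closest public analogue is per-task output caching, where outputs are memoized and keyed on the task’s inputs and a cache version. The sketch below mirrors the workflow in the figure; the task names, bodies, and durations are illustrative, and LinkedIn’s internal, permission-aware caching adds policy on top.

from flytekit import task, workflow

@task(cache=True, cache_version="1.0")
def data_importer(date: str) -> str:
    # Expensive import (roughly 2 hours); the output is cached, keyed on the
    # inputs and cache_version, so an identical rerun reuses it.
    return f"/data/imported/{date}"

@task(cache=True, cache_version="1.0")
def data_row_manipulator(dataset: str) -> str:
    # Expensive transformation (roughly 3 hours); also served from cache on rerun.
    return dataset + "/transformed"

@task
def model_trainer(dataset: str) -> str:
    # Not cached: this is the step that failed and re-executes on rerun.
    return "/models/latest"

@workflow
def training_workflow(date: str) -> str:
    dataset = data_row_manipulator(dataset=data_importer(date=date))
    return model_trainer(dataset=dataset)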

Partial rerun

Partial rerun enables targeted re-execution of specific workflow segments, streamlining debugging. As illustrated:

  • The workflow skips the unaffected steps (Data Importer and Data Row Manipulator) and re-executes only the necessary segments, starting from the Model Trainer step where the issue was identified.
  • Steps marked as "skipped" (Data Importer and Data Row Manipulator) are bypassed, allowing the workflow to focus on the problematic section.

By isolating and re-running only the relevant parts, partial rerun reduces debugging time and resource usage, empowering users to iterate more efficiently.

Figure 7: Execution of a workflow with partial rerun

This feature is only available in LinkedIn’s internal Flyte and has not been open sourced.

Scalability and robustness 

Multi-region, multi-cluster global scheduling

To ease fleet management, improve cost efficiency, and reduce resource fragmentation, OpenConnect operates a multi-cluster, multi-region Kubernetes setup in which each cluster can host different types and amounts of GPU resources. This poses challenges for traditional workflow orchestration systems, which require workflows to be registered and managed separately with multiple control planes. To address this, we leverage Flyte’s natively decoupled control plane (FlyteAdmin) and data plane (FlytePropeller) architecture, which allows our users to register their workflows once against a single control plane and execute them on any cluster. On top of this setup, we have built a global scheduler that enables smart routing between clusters and namespaces based on heuristics and policies such as data locality and resource usage. As a critical component for the scalability and robustness of our system, the global scheduler required deliberate design considerations.

Figure 8: Flyte multi-region routing setup
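The global scheduler itself is internal, but the flavor of its routing heuristics can be sketched in a few lines: filter clusters on hard constraints (GPU SKU, free capacity), then prefer data locality and spare headroom. Everything below - the fields, weights, and scoring - is an illustrative assumption, not the production policy.

from dataclasses import dataclass

@dataclass
class Cluster:
    name: str
    region: str
    gpu_sku: str
    free_gpus: int

def pick_cluster(clusters: list[Cluster], data_region: str,
                 required_sku: str, required_gpus: int) -> Cluster:
    def score(c: Cluster) -> float:
        # Hard constraints: wrong SKU or insufficient capacity rules a cluster out.
        if c.gpu_sku != required_sku or c.free_gpus < required_gpus:
            return float("-inf")
        # Soft preferences: co-locate with the data, then favor spare capacity.
        locality_bonus = 10.0 if c.region == data_region else 0.0
        return locality_bonus + c.free_gpus / 100.0

    best = max(clusters, key=score)
    if score(best) == float("-inf"):
        raise RuntimeError("No eligible cluster for this workload")
    return best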

Disruption readiness

When managing large computing clusters, it is common for nodes to be put into maintenance mode for OS updates and hardware checkups. Ensuring resilience in the face of such disruptions is critical. We have implemented a disruption readiness system that actively checkpoints training parameters, epoch state, and data pointers. When a node is scheduled for maintenance, we notify the associated training job, triggering an all-reduce operation to coordinate checkpointing across all workers. After a configurable grace period, the job is safely shut down. Disrupted jobs exit with a special error code, prompting Flyte to retry the job on a new set of nodes. This mechanism has reduced training job failures due to infrastructure disruptions by 90%, significantly reducing the operational overhead of node maintenance at scale.

Figure 9: Job disruption workflow
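A simplified, stand-alone sketch of the worker-side pattern: treat the maintenance notification as a signal, checkpoint within the grace period, and exit with a dedicated code so the orchestrator retries the job elsewhere. The exit code, checkpoint contents, and single-process structure are illustrative assumptions; the real system coordinates checkpointing across all workers via an all-reduce.

import signal
import sys
import time

DISRUPTION_EXIT_CODE = 143  # assumed convention for "disrupted, please retry"
current_step = 0

def save_checkpoint(step: int) -> None:
    # Placeholder: persist model parameters, optimizer/epoch state, and data
    # pointers so training can resume from this point on new nodes.
    print(f"checkpoint saved at step {step}")

def handle_disruption(signum, frame) -> None:
    # Maintenance notification arrived: checkpoint within the grace period,
    # then exit with the special code that prompts Flyte to retry the job.
    save_checkpoint(current_step)
    sys.exit(DISRUPTION_EXIT_CODE)

signal.signal(signal.SIGTERM, handle_disruption)

if __name__ == "__main__":
    while True:
        current_step += 1
        time.sleep(0.1)  # placeholder for one training step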

Kubernetes training platform integration

LinkedIn is migrating its entire computing infrastructure to Kubernetes. OpenConnect is designed to be Kubernetes-native and integrates seamlessly into LinkedIn’s Kubernetes training platform, supporting diverse training frameworks such as Horovod, TFJob, and PyTorchJob. The Kubernetes training platform is responsible for:

  • Scheduling optimization: We leverage the Volcano scheduler with gang scheduling and bin-packing strategies to minimize resource fragmentation and improve cluster utilization. This has led to a 36% reduction in training job failures caused by capacity constraints. On top of Volcano, we’ve implemented a queue-based prioritization system to ensure higher-priority jobs are scheduled first.
  • Job lifecycle management: This includes provisioning required resources such as certificates and HDFS/NFS access, and performing garbage collection of Kubernetes resources upon job completion.
  • Observability: We aggregate logs and metrics across various job types to provide unified visibility through OpenConnect. Users can view not only container logs, but also GPU utilization metrics and relevant Kubernetes events in a centralized dashboard.

Figure 10: High-level workflow of LinkedIn’s K8s training platform

Custom extensions

Flyte Data Service: For graph neural networks, we extended OpenConnect with Flyte Data Service, built on Flyte’s Agent framework. It enables on-demand, simultaneous execution of graph services alongside training and inference at scale, streamlining the ML user experience while fitting cleanly into the rest of the ecosystem.

Incremental training: We introduced the Flyte Operator within LinkedIn’s internal Airflow system to support incremental training workflows that trigger automatically when data is ready. This work enabled LinkedIn’s first production incremental learning model, and the move to incremental learning delivered substantial business impact across multiple verticals.
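LinkedIn’s Flyte Operator and internal Airflow setup are not public. As a rough stand-in, the sketch below uses Airflow’s TaskFlow API together with flytekit’s FlyteRemote client to launch a registered workflow once new data is detected; the project, domain, and workflow names are illustrative assumptions.

from datetime import datetime

from airflow.decorators import dag, task
from flytekit.configuration import Config
from flytekit.remote import FlyteRemote

@dag(schedule="@hourly", start_date=datetime(2024, 1, 1), catchup=False)
def incremental_training():

    @task
    def latest_ready_partition() -> str:
        # Placeholder: look up the newest feature partition that has landed.
        return "2024-01-01-00"

    @task
    def launch_training(partition: str) -> None:
        # Kick off the registered Flyte workflow for the new data slice.
        remote = FlyteRemote(Config.auto(), default_project="ai-training",
                             default_domain="production")
        wf = remote.fetch_workflow(name="incremental_training_workflow")
        remote.execute(wf, inputs={"partition": partition})

    launch_training(latest_ready_partition())

incremental_training()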

Future considerations

  1. AI-assisted workflow authoring with MCP: Potential to leverage GitHub Copilot-style tools and internally fine-tuned models to streamline workflow creation, integrating with internal services via the Model Context Protocol (MCP) to simplify the user experience.
  2. Implement persistent agent pods: Flyte launches a new pod for each task, and the number of tasks can grow quite large for a complex pipeline. Our data shows that many of these tasks are lightweight and share the same environment. We plan to introduce reusable Task Pods to minimize the overhead of repeated pod creation and deletion, improving efficiency and reducing resource consumption.
  3. Reinforcement learning (RL): Develop a robust RL workflow system with intelligent scheduling, automated checkpointing, and optimized training and rollout processes to support the complex demands of RL-based applications, ensuring scalability and fault tolerance.

OpenConnect's impact and looking ahead

OpenConnect represents a transformative leap forward for LinkedIn’s AI infrastructure, addressing the scalability, efficiency, and usability challenges of our legacy ProML ecosystem. By reducing launch times by more than 20x, cutting failure detection time by 80%, and supporting over 100k monthly executions across diverse workloads, OpenConnect has become the backbone of our AI-driven platform, powering personalized and trustworthy experiences for over 1.2 billion members. The adoption of decoupled dependencies, caching mechanisms, and robust multi-cluster orchestration has not only accelerated development cycles but also fostered greater collaboration across teams through reusable components and custom extensions.

Looking ahead, with plans to integrate AI-assisted authoring, persistent agent pods, and reinforcement learning workflows, OpenConnect is poised to continue evolving as a cutting-edge solution. We are excited to see its impact grow as we open-source select features and invite the broader community to contribute. The success of OpenConnect is a testament to the dedication of the AI Platform team and our commitment to innovation, and we look forward to sharing more milestones in the future.

Acknowledgements

This multi-year journey began with user feedback on existing infrastructure, evolving into the OpenConnect platform and migrating all workloads. Key contributors include Ankit Goyal, Biao He, Yubo Wang, Wenye Zhang, and Chen Xie, alongside the OpenConnect Team: Lingyu Lyu, Clint Zhang, Yue Shang, Santosh Jha, Harsh Jain, Umer Ahmad, Shuying Liang, Chaitra Hegde, Nithila Ilangovan, Manish Khanna, Maneesh Varshney, Byron Hsu (alumni), Melody Lui (alumni), and Keqiu Hu (alumni).

The project received support from sibling teams, including Chen Zhu, Daniel Tang, Tao Huang, Yujie Ai, Jenny Zhang, Lijuan Zhang, Jonathan Hung, Haowen Ning, Yang Pei, Xiaohan Huang, Frank Gu, Shihao Wang, Weiyu Yen, Zhuo Zhi, Youmin Han, Mark Zhao, Martin Au-Yeung, Richard Li, Vaibhav Jindal, Ethan Lyu, Valentine Lin, Tommy Li, Yanning Chen, Yitong Zhou, Gaurav Misra, Haoyue Tang, Shivam Sahni (alumni) from AI Platform Team and Shardool S, Binyao Jiang, Sandeep Dhillon, Vinayak Agarwal, Yizhou Luo, Sally Ou, Leo Sun, Thomas Huang, Qilin Xu, Ye Zhou from the Data Processing Platform team.

We also thank the management team, including Animesh Singh, Kapil Surlaker, Kamal D., Lenisha Gandhi, Raghu Hiremagalur, Erran Berger, Deepak Agarwal, Ya Xu (alumni) for providing significant and sustained support.

Special thanks to the Union AI team including Ketan Umare, Yee Tong, Kevin Su, David Espejo, Haytham Abuelfutuh, John Votta for dedicated guidance and support for Flyte during our journey.

Last but not least, many thanks to the reviewers of this blog post: Maneesh Varshney, and the LinkedIn Editorial team: Benito Leyva for your reviews and suggestions.