Data Integration Tools for Linux


Browse free open source Data Integration tools and projects for Linux below. Use the toggles on the left to filter open source Data Integration tools by OS, license, language, programming language, and project status.

  • 1
    Pentaho

    Pentaho offers a comprehensive data integration and analytics platform.

    Pentaho couples data integration with business analytics in a modern platform that makes it easy to access, visualize, and explore the data that impacts business results. Use it as a full suite or as individual components, accessible on-premise, in the cloud, or on the go (mobile). Pentaho enables IT and developers to access and integrate data from any source and deliver it to your applications, all from within an intuitive, easy-to-use graphical tool. The Pentaho Enterprise Edition free trial can be obtained from https://pentaho.com/download/
    Downloads: 693 This Week
  • 2
    Pentaho Data Integration

    Pentaho Data Integration (ETL), a.k.a. Kettle

    Pentaho Data Integration uses the Maven framework. The project distribution archive is produced under the assemblies module; the codebase covers the core implementation, database dialogs, the user interface, the PDI engine, PDI engine extensions, PDI core plugins, and integration tests. Maven 3+ and Java JDK 1.8 are prerequisites. Using the Pentaho checkstyle format (run mvn checkstyle:check and review the report) and writing working unit tests helps ensure that pull requests for bugs and improvements are processed quickly. In addition to the unit tests, there are integration tests that exercise cross-module operation.
    Downloads: 56 This Week
  • 3
    Airbyte

    Data integration platform for ELT pipelines from APIs, databases

    We believe that only an open-source solution to data movement can cover the long tail of data sources while empowering data engineers to customize existing connectors. Our ultimate vision is to help you move data from any source to any destination. Airbyte already provides the largest catalog of 300+ connectors for APIs, databases, data warehouses, and data lakes. Moving critical data with Airbyte is as easy and reliable as flipping a switch. Our teams process more than 300 billion rows each month for ambitious businesses of all sizes. Enable your data engineering teams to focus on projects that are more valuable to your business. Building and maintaining custom connectors has become 5x easier with Airbyte. With an average response time of 10 minutes or less and a customer satisfaction score of 96/100, our team is ready to support your data integration journey all over the world.
    Downloads: 13 This Week
  • 4
    nango

    A single API for all your integrations.

    Nango is a single API to interact with all other external APIs. It should be the only API you need to integrate into your app. Nango is an open-source solution for integrating third-party APIs with applications, simplifying API authentication, data syncing, and management.
    Downloads: 6 This Week
  • 5
    Apache Hudi

    Upserts, Deletes And Incremental Processing on Big Data

    Apache Hudi (pronounced Hoodie) stands for Hadoop Upserts Deletes and Incrementals. Hudi manages the storage of large analytical datasets on DFS (cloud stores, HDFS, or any Hadoop FileSystem compatible storage). Apache Hudi is a transactional data lake platform that brings database and data warehouse capabilities to the data lake. Hudi reimagines slow, old-school batch data processing with a powerful incremental processing framework for low-latency, minute-level analytics. Hudi provides efficient upserts by consistently mapping a given hoodie key (record key + partition path) to a file id via an indexing mechanism. This mapping between record key and file group/file id never changes once the first version of a record has been written to a file; in short, the mapped file group contains all versions of a group of records. A sketch of an upsert write follows this entry.
    Downloads: 4 This Week
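
    The upsert behaviour described above is easiest to see in a short PySpark write. The sketch below follows the options documented in the Hudi quickstart; the table name, columns, and local base path are hypothetical, and the Hudi Spark bundle must be supplied separately (for example via --packages).

```python
# Hypothetical upsert sketch against a local Hudi table (Hudi Spark bundle required).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hudi-upsert-sketch")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

# Hypothetical trip records: uuid is the record key, city the partition path, ts the precombine field.
rows = [
    ("trip-1", "sf", "2024-01-01 10:00:00", 9.5),
    ("trip-2", "nyc", "2024-01-01 11:00:00", 3.2),
]
df = spark.createDataFrame(rows, ["uuid", "city", "ts", "fare"])

hudi_options = {
    "hoodie.table.name": "trips",
    "hoodie.datasource.write.recordkey.field": "uuid",      # record-key half of the hoodie key
    "hoodie.datasource.write.partitionpath.field": "city",  # partition-path half of the hoodie key
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.datasource.write.operation": "upsert",
}

# The first run creates the table; later runs with changed rows update them in place
# rather than appending duplicates, because the hoodie key maps to the same file group.
df.write.format("hudi").options(**hudi_options).mode("append").save("/tmp/hudi/trips")
```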
  • 6
    Hetionet

    Hetionet: an integrative network of disease

    Hetionet is a hetnet (a network with multiple node and edge, i.e. relationship, types) which encodes biology. The hetnet was designed for Project Rephetio, which aims to systematically identify why drugs work and to predict new therapeutic uses for existing drugs. The JSON and Neo4j formats contain node and edge properties, including licensing information, which are absent from the TSV and matrix formats; the recommended formats are therefore JSON and Neo4j. Our hetio package for Python reads the JSON format, which is otherwise a simple but new format. The Neo4j graph database has an established and thriving ecosystem; however, if you would like to access Hetionet without Neo4j, we suggest the JSON format (see the sketch after this entry). The matrix format refers to HetMat archives, which store edge adjacency matrices on disk. Additional usage information is available at the corresponding download locations.
    Downloads: 3 This Week
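
    Because the JSON export is the recommended route when Neo4j is not in play, a small standard-library script can confirm its shape before loading it anywhere heavier. The sketch assumes the bz2-compressed JSON exposes top-level "nodes" and "edges" lists with a "kind" field per element, and the filename is hypothetical; verify both against the downloaded file.

```python
# Inspect the Hetionet JSON export without Neo4j or the hetio package.
import bz2
import json
from collections import Counter

# Hypothetical local filename; download the JSON build of Hetionet first.
with bz2.open("hetionet-v1.0.json.bz2", "rt") as f:
    hetnet = json.load(f)

# Tally node and edge types (assumed to live under "nodes"/"edges" with a "kind" key).
node_kinds = Counter(node["kind"] for node in hetnet["nodes"])
edge_kinds = Counter(edge["kind"] for edge in hetnet["edges"])

print("node types:", node_kinds.most_common(5))
print("edge types:", edge_kinds.most_common(5))
```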
  • 7
    Open Source Data Quality and Profiling

    World's first open source data quality & data preparation project

    This project is dedicated to open source data quality and data preparation solutions. Data quality includes profiling, filtering, governance, similarity checks, data enrichment and alteration, real-time alerting, basket analysis, bubble-chart warehouse validation, single customer view, etc., as defined by strategy. The project is developing a high-performance integrated data management platform that will seamlessly handle Data Integration, Data Profiling, Data Quality, Data Preparation, Dummy Data Creation, Metadata Discovery, Anomaly Discovery, Data Cleansing, Reporting, and Analytics. It also has Hadoop (big data) support to move files to/from a Hadoop grid and to create, load, and profile Hive tables. This project is also known as "Aggregate Profiler". A RESTful API for this project is being built (beta version) at https://sourceforge.net/projects/restful-api-for-osdq/ and an Apache Spark based data quality module is being built at https://sourceforge.net/projects/apache-spark-osdq/
    Downloads: 6 This Week
  • 8
    ChunJun

    A data integration framework

    ChunJun is a distributed data integration framework, currently based on Apache Flink. It was initially known as FlinkX and was renamed ChunJun on February 22, 2022. It can realize data synchronization and calculation between various heterogeneous data sources, and it has been deployed and running stably in thousands of companies so far. Built on the real-time computing engine Flink, it supports configuring tasks with JSON templates or SQL scripts, with the SQL scripts compatible with Flink SQL syntax. It supports synchronization and calculation across more than 20 heterogeneous data sources such as MySQL, Oracle, SQL Server, Hive, and Kudu. It is easy to extend and highly flexible: newly added data source plugins can integrate with existing ones instantly, and plugin developers do not need to care about the code logic of other plugins.
    Downloads: 1 This Week
  • 9
    Jitsu

    Jitsu is an open-source Segment alternative

    Jitsu is a fully scriptable data ingestion engine for modern data teams. Set up a real-time data pipeline in minutes, not days. Installing Jitsu is a matter of selecting your framework and adding a few lines of code to your app. Jitsu is built to be framework-agnostic, so regardless of your stack, we have a solution that will work for your team. Connect a data warehouse (Snowflake, ClickHouse, BigQuery, S3, Redshift, or Postgres) and query your data instantly. Jitsu can either stream data in real time or send it in micro-batches (up to once a minute). Apply any transformation with Jitsu: just write JavaScript code right in the UI to do anything with incoming data. And yes, the code editor supports code completion, debugging, and much more; it feels like a full-featured IDE.
    Downloads: 1 This Week
  • 10
    KubeRay

    A toolkit to run Ray applications on Kubernetes

    KubeRay is a powerful, open source Kubernetes operator that simplifies the deployment and management of Ray applications on Kubernetes. It offers several key components. KubeRay core is the official, fully maintained component of KubeRay; it provides three custom resource definitions: RayCluster, RayJob, and RayService. These resources are designed to help you run a wide range of workloads with ease (see the sketch after this entry).
    Downloads: 1 This Week
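
    As a rough illustration of the RayCluster resource mentioned above, the sketch below submits a minimal cluster through the official Kubernetes Python client. It assumes the KubeRay operator is already installed and that the ray.io/v1 RayCluster fields match the current KubeRay documentation; the cluster name, namespace, and container image are hypothetical. Applying the equivalent YAML with kubectl works just as well; the point is that a RayCluster is an ordinary custom resource the operator reconciles into head and worker pods.

```python
# Hypothetical sketch: create a minimal RayCluster custom resource via the Kubernetes Python client.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running inside a pod

ray_cluster = {
    "apiVersion": "ray.io/v1",
    "kind": "RayCluster",
    "metadata": {"name": "demo-cluster"},
    "spec": {
        "headGroupSpec": {
            "rayStartParams": {},
            "template": {
                "spec": {"containers": [{"name": "ray-head", "image": "rayproject/ray:2.9.0"}]}
            },
        },
        "workerGroupSpecs": [
            {
                "groupName": "workers",
                "replicas": 2,
                "minReplicas": 1,
                "maxReplicas": 4,
                "rayStartParams": {},
                "template": {
                    "spec": {"containers": [{"name": "ray-worker", "image": "rayproject/ray:2.9.0"}]}
                },
            }
        ],
    },
}

# Hand the custom resource to the KubeRay operator for reconciliation.
client.CustomObjectsApi().create_namespaced_custom_object(
    group="ray.io", version="v1", namespace="default", plural="rayclusters", body=ray_cluster
)
```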
  • 11
    Nest Manager

    NST Manager (SmartThings)

    Nest Manager is a community SmartThings solution that integrates Nest devices—thermostats, Protects, and cameras—into the SmartThings ecosystem via a comprehensive SmartApp and device handlers. It offers a unified dashboard, rich device tiles, and automation hooks so users can monitor and control temperature, modes, and alerts alongside other smart home devices. The project emphasizes usability with guided setup flows, status summaries, and in-app diagnostics to help troubleshoot connectivity or permission issues. It exposes detailed attributes and commands, enabling powerful rules and scenes that coordinate Nest with sensors, presence, and schedules in SmartThings. Historical and environmental data can be surfaced to support energy-aware automations and notifications. For advanced users, it provides granular preferences to tune polling, event verbosity, and safety behaviors, turning SmartThings into a capable hub for Nest-centric homes.
    Downloads: 1 This Week
  • 12
    PHPCI

    PHPCI is a free and open source continuous integration tool

    PHPCI is a continuous integration (CI) server designed specifically for PHP applications. It automates tasks such as testing, code quality checks, and deployment, helping developers maintain code consistency and detect issues early. PHPCI supports various plugins and tools, including PHPUnit, PHPMD, and Codeception, making it highly customizable for different project needs.
    Downloads: 1 This Week
  • 13
    CloverDX

    Design, automate, operate and publish data pipelines at scale

    Please visit www.cloverdx.com for the latest product versions. CloverDX is a data integration platform that can be used to transform, map, and manipulate data in batch and near-real-time modes. It supports various input/output formats (CSV, FIXLEN, Excel, XML, JSON, Parquet, Avro, EDI/X12, HL7, COBOL, Lotus, etc.) and connects to RDBMS, JMS, Kafka, SOAP, REST, LDAP, S3, HTTP, FTP, ZIP, and TAR. CloverDX offers 100+ specialized components, which can be further extended by creating "macros" (subgraphs) and libraries that are shareable with third parties. Simple data manipulation jobs can be created visually; more complex business logic can be implemented in Clover's domain-specific language CTL, in Java, or in languages like Python or JavaScript. Through its DataServices functionality, it lets you quickly turn data pipelines into REST API endpoints. The platform makes it easy to scale your data jobs across multiple cores or nodes/machines, supports Docker/Kubernetes deployments, and offers AWS/Azure images in their respective marketplaces.
    Downloads: 7 This Week
  • 14
    XAware Data Integration Project

    Create XML and JSON data services from any data source

    Create services to integrate applications and move data of any type. Build data views across DBMS, SOAP, HTTP/REST, Salesforce, SAP, Microsoft, SharePoint, text, LDAP, and FTP sources to read, write, and transfer data. Includes an Eclipse-based designer and a run-time engine.
    Downloads: 7 This Week
  • 15

    COMA Community Edition

    Schema Matching Solution for Data Integration

    COMA CE is the community edition of the well-established COMA project developed at the University of Leipzig. It comprises the parsers, matcher library, matching framework, and a sample GUI for tests and evaluations. COMA was initiated at the database chair of the University of Leipzig in 2002 and has received much positive feedback ever since. It excels thanks to numerous matching strategies, which can be combined into large matching workflows and which enable reliable match results between different kinds of schemas.
    Downloads: 4 This Week
  • 16
    Metl ETL Data Integration

    Simple message-based, web-based ETL integration

    Metl is a simple, web-based ETL tool that allows for data integrations including databases, files, messaging, and web services. It supports RDBMS, SOAP, HTTP, FTP, SFTP, XML, FIXLEN, CSV, JSON, ZIP, and more. Metl implements scheduled integration tasks without the need for custom coding or heavy infrastructure. It can be deployed in the cloud or in an internal data center, and it was built to allow developers to extend it with custom components.
    Downloads: 4 This Week
  • 17
    KETL(tm) is a production-ready ETL platform. The engine is built upon an open, multi-threaded, XML-based architecture. KETL is designed to assist in the development and deployment of data integration efforts which require ETL and scheduling.
    Downloads: 3 This Week
  • 18

    PDI Data Vault framework

    Data Vault loading automation using Pentaho Data Integration.

    A metadata-driven 'tool' to automate loading a designed Data Vault. It consists of a set of Pentaho Data Integration and database objects. The Virtual Machine (VMware) is a 64-bit Ubuntu Server 14.04, with MySQL (Percona Server) and PostgreSQL 9.4 as the database flavours and PDI version 5.2 CE. NB: the directory version_2.4 contains the most recent Virtual Machine; its readme.txt contains info about that VM.
    Downloads: 2 This Week
  • 19
    Daffodil Replicator is a powerful Open Source Java tool for data integration, data migration and data protection in real time. It allows bi-directional data replication and synchronization between homogeneous / heterogeneous databases including Oracle, M
    Downloads: 1 This Week
  • 20
    Templates for integrating the data structures of Compiere, Openbravo, or ADempiere for all kinds of Pentaho Data Integration processes. Later on we plan to migrate these to Talend too.
    Downloads: 1 This Week
  • 21
    The BioDataServer is a database integration system. It implements a mediator-wrapper architecture and offers a SQL interface. The data integration is based on a user-defined integrated schema and adapters that wrap any kind of data source.
    Downloads: 1 This Week
  • 22
    N-Browse is a client-server package for interactive visualization of network data with heterogeneous types of links, intended for ease of use and designed using a generic database schema for data integration and visualization.
    Downloads: 1 This Week
  • 23
    ODI / OWB ETL / ELT data warehousing and data integration.
    Downloads: 1 This Week
  • 24
    OPENSUITE is an integration platform to enable process data integration between independently developed business applications. The OPENSUITE integration platform takes advantage of SOA best integration practices to supply the middleware layer functionality.
    Downloads: 1 This Week
  • 25

    Pytente

    A Computational Tool for Patent Analysis and Retrieval

    Pytente is an advanced solution for automating the collection, storage, and processing of bibliographic patent data. The tool was designed to simplify the collection of large volumes of data from open-access repositories. Pytente ensures structured storage of the information, along with validation and elimination of duplicate records. Among the many features the tool provides, the highlights are customized extraction of data subsets and the ability to run semantic searches over the stored dataset without the need to craft logical search expressions.
    Downloads: 1 This Week