A comprehensive data lineage tracking system that captures lineage from Python jobs using OpenLineage, stores it in Marquez, and provides a live lineage graph visualization.

This project demonstrates a complete data lineage solution with:
- Marquez: OpenLineage backend for storing and querying lineage metadata
- OpenLineage SDK: Python job lineage tracking
- PostgreSQL: Metadata storage backend
- Docker Compose: Easy local development setup

How it works:
- Python Jobs: Use the OpenLineage SDK to emit lineage events describing data transformations
- Marquez: Stores and indexes all lineage metadata in PostgreSQL
- Web UI: Provides interactive visualization of lineage graphs
- API: Allows programmatic querying of lineage data

Key components:
- LineageEmitter: Python class that handles emission of lineage events (sketched below)
- Marquez Configuration: Custom configuration for database connectivity
- Docker Services: Orchestrated services for easy deployment
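
The actual `LineageEmitter` lives in the job scripts; as a rough sketch of what such a wrapper typically does with the `openlineage-python` client (the class shape, producer URI, and method names here are illustrative assumptions, not the repository's code):

```python
# Sketch of a LineageEmitter-style wrapper around the openlineage-python
# client. Names are illustrative; see lineage/python_jobs/emit_lineage.py
# for the real implementation.
import uuid
from datetime import datetime, timezone

from openlineage.client import OpenLineageClient
from openlineage.client.run import Dataset, Job, Run, RunEvent, RunState

PRODUCER = "https://example.com/data-lineage-audit"  # placeholder producer URI


class LineageEmitter:
    def __init__(self, url: str = "http://localhost:5002",
                 namespace: str = "data-lineage-audit"):
        self.client = OpenLineageClient(url=url)  # points at the Marquez API
        self.namespace = namespace

    def emit_run(self, job_name: str, inputs: list[str], outputs: list[str]) -> None:
        """Emit a START and a COMPLETE event for one job run."""
        run = Run(runId=str(uuid.uuid4()))
        job = Job(namespace=self.namespace, name=job_name)
        in_ds = [Dataset(namespace=self.namespace, name=n) for n in inputs]
        out_ds = [Dataset(namespace=self.namespace, name=n) for n in outputs]
        for state in (RunState.START, RunState.COMPLETE):
            self.client.emit(RunEvent(
                eventType=state,
                eventTime=datetime.now(timezone.utc).isoformat(),
                run=run, job=job, producer=PRODUCER,
                inputs=in_ds, outputs=out_ds,
            ))


# Usage, mirroring the customer-data example job below:
LineageEmitter().emit_run(
    "customer_data_processing",
    inputs=["raw_customers"], outputs=["processed_customers"],
)
```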
The project includes eight example jobs that demonstrate different data lineage scenarios:

1. Customer Data Processing (`emit_lineage.py`) - Simple ETL job processing customer data
   - Input: raw_customers → Output: processed_customers
2. Order Processing (`order_processing.py`) - Complex job with multiple inputs and outputs (see the sketch after this list)
   - Inputs: raw_orders, customer_master, product_catalog
   - Outputs: enriched_orders, order_summary
3. Financial Data Processing (`financial_processing.py`) - Financial transactions with currency conversion
   - Inputs: raw_transactions, account_master, exchange_rates
   - Outputs: processed_transactions, daily_account_summary, fraud_indicators
4. Data Quality Monitoring (`data_quality_monitoring.py`) - Monitors data quality across multiple datasets
   - Inputs: raw_customers, raw_orders, raw_transactions
   - Outputs: data_quality_report, data_lineage_summary, quality_alerts
5. Real-time Analytics Pipeline (`real_time_analytics.py`) - Streaming data processing for user analytics
   - Inputs: user_events_stream, user_profiles, product_catalog
   - Outputs: real_time_user_analytics, trending_content, personalization_models
6. Machine Learning Pipeline (`ml_pipeline.py`) - ML model training and prediction pipeline
   - Inputs: training_data, feature_store, model_config
   - Outputs: trained_model, model_predictions, model_metrics, feature_importance
7. Data Lake Ingestion (`data_lake_ingestion.py`) - Multi-source data ingestion into a data lake
   - Inputs: external_api_data, log_files, sensor_data, social_media_feeds
   - Outputs: raw_data_lake, structured_data_lake, data_lake_metadata, data_lineage_tracking
8. Compliance & Governance (`compliance_governance.py`) - Compliance monitoring and governance tracking
   - Inputs: sensitive_data_inventory, data_access_logs, regulatory_requirements, data_lineage_metadata
   - Outputs: compliance_report, data_privacy_assessment, governance_dashboard, audit_trail
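
The multi-input jobs share the same event shape. As a hedged sketch (not the repository's actual code), here is what an `order_processing`-style COMPLETE event could look like, with an optional schema facet attached to one output; the field names in the facet are illustrative assumptions, not the job's real schema:

```python
# Sketch: one COMPLETE event with several inputs and outputs, plus a schema
# facet on one output. Field names are illustrative assumptions.
import uuid
from datetime import datetime, timezone

from openlineage.client import OpenLineageClient
from openlineage.client.facet import SchemaDatasetFacet, SchemaField
from openlineage.client.run import Dataset, Job, Run, RunEvent, RunState

NS = "data-lineage-audit"
client = OpenLineageClient(url="http://localhost:5002")

inputs = [Dataset(NS, name) for name in
          ("raw_orders", "customer_master", "product_catalog")]
outputs = [
    Dataset(NS, "enriched_orders", facets={
        # A schema facet lets Marquez display the dataset's fields in the UI
        "schema": SchemaDatasetFacet(fields=[
            SchemaField(name="order_id", type="STRING"),
            SchemaField(name="customer_id", type="STRING"),
            SchemaField(name="total_amount", type="DECIMAL"),
        ]),
    }),
    Dataset(NS, "order_summary"),
]

client.emit(RunEvent(
    eventType=RunState.COMPLETE,
    eventTime=datetime.now(timezone.utc).isoformat(),
    run=Run(runId=str(uuid.uuid4())),
    job=Job(namespace=NS, name="order_processing"),
    producer="https://example.com/data-lineage-audit",  # placeholder producer URI
    inputs=inputs,
    outputs=outputs,
))
```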
Prerequisites:
- Docker Desktop installed and running
- Python 3.13+ (for local development)

1. Clone and set up:
   ```bash
   git clone <your-repo-url>
   cd data-lineage-audit
   cp env.example .env
   ```
2. Start all services:
   ```bash
   docker compose up -d
   ```
3. Wait for services to be ready (about 30 seconds; a polling alternative is sketched after these steps):
   ```bash
   docker compose ps
   ```
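
If the fixed 30-second wait in step 3 feels fragile, a small standard-library script can poll the Marquez API until it responds. This is a sketch, not part of the repository; the endpoint matches the service list below:

```python
# Poll the Marquez API until it responds, instead of sleeping a fixed 30s.
import time
import urllib.error
import urllib.request

URL = "http://localhost:5002/api/v1/namespaces"


def wait_for_marquez(timeout: float = 120.0, interval: float = 2.0) -> None:
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(URL, timeout=5) as resp:
                if resp.status == 200:
                    print("Marquez is ready")
                    return
        except (urllib.error.URLError, OSError):
            pass  # not up yet; keep polling
        time.sleep(interval)
    raise TimeoutError(f"Marquez did not become ready within {timeout}s")


if __name__ == "__main__":
    wait_for_marquez()
```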
To manage the stack:
```bash
# Start all services
docker compose up -d

# Check service status
docker compose ps

# View logs if needed
docker compose logs marquez
docker compose logs marquez-web
```

To run the lineage jobs:
```bash
# Navigate to the Python jobs directory
cd lineage/python_jobs

# Activate the virtual environment
source venv/bin/activate

# Run individual jobs
python emit_lineage.py             # Customer data processing
python order_processing.py         # Order processing with multiple inputs
python financial_processing.py     # Financial data with currency conversion
python data_quality_monitoring.py  # Data quality monitoring
python real_time_analytics.py      # Real-time analytics pipeline
python ml_pipeline.py              # Machine learning pipeline
python data_lake_ingestion.py      # Data lake ingestion
python compliance_governance.py    # Compliance & governance

# Or run all jobs at once
cd ../..
python run_all_jobs.py
```

Once the stack is running, the services are available at:
- Marquez UI: http://localhost:3000
- Marquez API: http://localhost:5002
- Marquez Admin: http://localhost:5003
To verify the deployment:
```bash
# Check all containers are healthy
docker compose ps

# Test the Marquez API
curl http://localhost:5002/api/v1/namespaces

# Test the Marquez UI
curl http://localhost:3000
```

To emit lineage and confirm it was stored:
```bash
# Run the lineage job
cd lineage/python_jobs
source venv/bin/activate
python emit_lineage.py

# Verify data was stored
curl http://localhost:5002/api/v1/jobs
curl http://localhost:5002/api/v1/namespaces/data-lineage-audit/datasets
```

To run the test suite:
```bash
# Run all tests
cd lineage/python_jobs
source venv/bin/activate
pytest

# Run specific test files
pytest lineage/tests/test_emit_lineage.py -v
```

To explore the lineage:
- Web Interface: Visit http://localhost:3000 to see interactive lineage graphs
- API Queries: Use curl or your preferred HTTP client to query the Marquez API (a Python sketch follows below)
- Job Details: View specific job runs and their lineage relationships
- Dataset Schemas: Examine dataset schemas and field-level lineage
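
For the API-queries route, here is a minimal Python equivalent of the curl commands shown further below, using only the standard library. The response envelope keys (`namespaces`, `datasets`, `runs`) are assumptions based on the Marquez v1 API; adjust if your version differs:

```python
# Programmatic equivalent of the curl queries in this README.
# Response envelope keys are assumed from the Marquez v1 API.
import json
import urllib.request

BASE = "http://localhost:5002/api/v1"


def get(path: str) -> dict:
    with urllib.request.urlopen(f"{BASE}{path}") as resp:
        return json.load(resp)


# List namespaces, then walk one job's datasets and runs
namespaces = get("/namespaces")
print([ns["name"] for ns in namespaces["namespaces"]])

datasets = get("/namespaces/data-lineage-audit/datasets")
for ds in datasets["datasets"]:
    print(ds["name"])

runs = get("/namespaces/data-lineage-audit/jobs/customer_data_processing/runs")
print(f"{len(runs['runs'])} recorded runs")
```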
The repository layout:
```
data-lineage-audit/
├── README.md               # This file
├── docker-compose.yml      # Service orchestration
├── marquez-config.yml      # Marquez configuration
├── env.example             # Environment variables template
├── lineage/
│   ├── python_jobs/        # Python lineage jobs
│   │   ├── emit_lineage.py    # Main lineage emission script
│   │   ├── requirements.txt   # Python dependencies
│   │   └── venv/              # Virtual environment
│   └── tests/              # Test suite
└── docs/                   # Documentation
```

The services and their ports:
| Service | Port | Description |
|---|---|---|
| Marquez UI | 3000 | Lineage visualization interface |
| Marquez API | 5002 | OpenLineage backend API |
| Marquez Admin | 5003 | Marquez admin interface |
| PostgreSQL | 5432 | Metadata storage |
Example API queries:
```bash
# List all jobs
curl http://localhost:5002/api/v1/jobs

# List datasets in a namespace
curl http://localhost:5002/api/v1/namespaces/data-lineage-audit/datasets

# Get specific job details
curl http://localhost:5002/api/v1/namespaces/data-lineage-audit/jobs/customer_data_processing

# Get job runs
curl http://localhost:5002/api/v1/namespaces/data-lineage-audit/jobs/customer_data_processing/runs
```

Troubleshooting:

Services won't start:
```bash
# Check Docker is running
docker --version

# Check service status
docker compose ps

# View logs
docker compose logs marquez
```

Lineage jobs fail:
```bash
# Verify Marquez is accessible
curl http://localhost:5002/api/v1/namespaces

# Check the Python environment
cd lineage/python_jobs
source venv/bin/activate
python -c "import openlineage.client; print('OpenLineage client available')"
```

Web UI not accessible:
```bash
# Check if the Marquez UI container is running
docker compose ps marquez-web

# Check UI logs
docker compose logs marquez-web
```

To contribute:
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests
- Submit a pull request
MIT License - see LICENSE file for details.