
Dataiku Factory - MCP Tool Suite

This is a fork of hhobin/dataiku_factory that attempts to adapt it to the Kiro AI IDE.


A comprehensive Model Context Protocol (MCP) tool suite for Dataiku DSS integration. This project gives AI IDEs direct access to Dataiku DSS for managing recipes, datasets, and scenarios.

🚀 Quick Start

Prerequisites

  • Python 3.11+
  • Dataiku DSS instance with API access
  • Valid DSS API key

Installation

# Clone and setup
git clone <repository-url>
cd dataiku_factory

# Run installation script
./install.sh

# Or install manually:
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install -e .

Configuration

  1. Copy the environment template:

     cp .env.sample .env

  2. Configure your DSS connection in .env:

     DSS_HOST=https://your-dss-instance.com:10000
     DSS_API_KEY=your-api-key-here
     DSS_INSECURE_TLS=true  # Only if using self-signed certificates

  3. Test your connection:

     python scripts/mcp_server.py --help

Kiro IDE Integration

Register the MCP server:

    "dataiku-factory": {
      "command": ".venv\\Scripts\\python.exe",
      "args": [
        "scripts\\mcp_server.py"
      ],
      "env": {
        "DSS_HOST": "https://you_dss_host",
        "DSS_API_KEY": "you_dss_key",
        "DSS_INSECURE_TLS": "false"
      },
      "disabled": false,
      "autoApprove": [
        "get_project_flow",
        "search_project_objects",
        "get_dataset_sample",
        "get_dataset_schema",
        "get_scenario_logs",
        "get_recent_runs",
        "get_recipe_code"
      ]
    }
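
This block is a single server entry; in current Kiro builds it typically nests under the mcpServers key of Kiro's MCP settings file (for example .kiro/settings/mcp.json in your workspace; the exact location may vary by Kiro version):

{
  "mcpServers": {
    "dataiku-factory": { ... }
  }
}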

📚 MCP Tool Catalog

Core Recipe Management Tools

| Tool | Description | Key Parameters |
| --- | --- | --- |
| create_recipe | Create new recipe | project_key, recipe_type, recipe_name, inputs, outputs, code |
| update_recipe | Update existing recipe | project_key, recipe_name, **kwargs |
| delete_recipe | Delete recipe | project_key, recipe_name |
| run_recipe | Execute recipe | project_key, recipe_name, build_mode |
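
As a sketch of the calling convention (project and recipe names are illustrative, and the build_mode value is an assumption mirroring the mode values shown for build_dataset below):

run_recipe(
    project_key="ANALYTICS_PROJECT",
    recipe_name="data_cleaner",
    build_mode="RECURSIVE_BUILD"  # assumed value, by analogy with build_dataset's mode
)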

Core Dataset Management Tools

| Tool | Description | Key Parameters |
| --- | --- | --- |
| create_dataset | Create new dataset | project_key, dataset_name, dataset_type, params |
| update_dataset | Update dataset settings | project_key, dataset_name, **kwargs |
| delete_dataset | Delete dataset | project_key, dataset_name, drop_data |
| build_dataset | Build dataset | project_key, dataset_name, mode, partition |
| inspect_dataset_schema | Get dataset schema | project_key, dataset_name |
| check_dataset_metrics | Get dataset metrics | project_key, dataset_name |
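
For example, a managed dataset might be created and its schema inspected like this (all names are illustrative; params is left empty because its exact shape depends on the dataset type):

create_dataset(
    project_key="ANALYTICS_PROJECT",
    dataset_name="clean_data",
    dataset_type="managed",
    params={}  # type-specific settings; assumed empty for a default managed dataset
)

inspect_dataset_schema(
    project_key="ANALYTICS_PROJECT",
    dataset_name="clean_data"
)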

Core Scenario Management Tools

| Tool | Description | Key Parameters |
| --- | --- | --- |
| create_scenario | Create new scenario | project_key, scenario_name, scenario_type, definition |
| update_scenario | Update scenario settings | project_key, scenario_id, **kwargs |
| delete_scenario | Delete scenario | project_key, scenario_id |
| add_scenario_trigger | Add trigger to scenario | project_key, scenario_id, trigger_type, **params |
| remove_scenario_trigger | Remove scenario trigger | project_key, scenario_id, trigger_idx |
| run_scenario | Execute scenario | project_key, scenario_id |
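
A hypothetical create-and-run sequence (the definition payload shape is an assumption; see Supported Scenario Types under the API Reference):

create_scenario(
    project_key="DATA_PIPELINE",
    scenario_name="daily_etl",
    scenario_type="step_based",
    definition={"steps": []}  # assumed shape; the real step schema is DSS-specific
)

run_scenario(
    project_key="DATA_PIPELINE",
    scenario_id="daily_etl"
)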

🔧 Advanced Scenario Management Tools

| Tool | Description | Key Parameters |
| --- | --- | --- |
| get_scenario_logs | Get detailed run logs and error messages | project_key, scenario_id, run_id |
| get_scenario_steps | Get step configuration including Python code | project_key, scenario_id |
| clone_scenario | Clone scenario with modifications | project_key, source_scenario_id, new_scenario_name, modifications |
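
For instance, cloning a scenario into a deactivated staging copy might look like this (the modifications keys are illustrative, not a documented schema):

clone_scenario(
    project_key="DATA_PIPELINE",
    source_scenario_id="daily_etl",
    new_scenario_name="daily_etl_staging",
    modifications={"active": False}  # illustrative key, not a documented schema
)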

💻 Code and Recipe Development Tools

| Tool | Description | Key Parameters |
| --- | --- | --- |
| get_recipe_code | Extract actual Python/SQL code from recipes | project_key, recipe_name |
| validate_recipe_syntax | Validate Python/SQL syntax before running | project_key, recipe_name, code |
| test_recipe_dry_run | Test recipe logic without execution | project_key, recipe_name, sample_rows |
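
A sketch of a dry run over a small sample (all values are illustrative):

test_recipe_dry_run(
    project_key="ML_PROJECT",
    recipe_name="customer_segmentation",
    sample_rows=100  # illustrative sample size
)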

πŸ—ΊοΈ Project Exploration Tools

| Tool | Description | Key Parameters |
| --- | --- | --- |
| get_project_flow | Get complete data flow/pipeline structure | project_key |
| search_project_objects | Search datasets, recipes, scenarios by pattern | project_key, search_term, object_types |
| get_dataset_sample | Get sample data from datasets | project_key, dataset_name, rows, columns |

βš™οΈ Environment and Configuration Tools

| Tool | Description | Key Parameters |
| --- | --- | --- |
| get_code_environments | List available Python/R environments | project_key |
| get_project_variables | Get project-level variables and secrets | project_key |
| get_connections | List available data connections | project_key |
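
These three are typically the first calls when orienting in an unfamiliar project; a sketch with an illustrative project key:

get_code_environments(project_key="ANALYTICS_PROJECT")
get_project_variables(project_key="ANALYTICS_PROJECT")
get_connections(project_key="ANALYTICS_PROJECT")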

πŸ” Debugging and Monitoring Tools

| Tool | Description | Key Parameters |
| --- | --- | --- |
| get_recent_runs | Get recent run history across scenarios/recipes | project_key, limit, status_filter |
| get_job_details | Get detailed job execution information | project_key, job_id |
| cancel_running_jobs | Cancel running jobs/scenarios | project_key, job_ids |
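
cancel_running_jobs complements the two read-only tools above; a hypothetical call (the job ID is illustrative):

cancel_running_jobs(
    project_key="DATA_PIPELINE",
    job_ids=["job_12345"]  # illustrative job ID
)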

🚀 Productivity Tools

| Tool | Description | Key Parameters |
| --- | --- | --- |
| duplicate_project_structure | Copy project structure to new project | source_project_key, target_project_key, include_data |
| export_project_config | Export project configuration as JSON/YAML | project_key, format |
| batch_update_objects | Update multiple objects with similar changes | project_key, object_type, pattern, updates |
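
batch_update_objects is the one tool here without a walkthrough in the usage examples below; a hedged sketch (the pattern syntax and updates shape are assumptions):

batch_update_objects(
    project_key="ANALYTICS_PROJECT",
    object_type="datasets",
    pattern="customer_*",           # assumed glob-style pattern
    updates={"tags": ["reviewed"]}  # assumed updates shape
)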

Total: 34 Tools (16 core + 18 advanced)

🔧 Usage Examples

Core Operations

Creating a Python Recipe

# Via Claude Code chat:
"""
Create a python recipe called "data_cleaner" that takes "raw_data" as input 
and outputs "clean_data" in project "ANALYTICS_PROJECT"
"""

# This translates to:
create_recipe(
    project_key="ANALYTICS_PROJECT",
    recipe_type="python",
    recipe_name="data_cleaner",
    inputs=["raw_data"],
    outputs=[{"name": "clean_data", "new": True, "connection": "filesystem_managed"}],
    code="""
import dataiku
import pandas as pd
df = dataiku.Dataset("raw_data").get_dataframe()
# Add your cleaning logic here
df_clean = df.dropna()
dataiku.Dataset("clean_data").write_with_schema(df_clean)
"""
)

Building a Dataset

# Via Claude Code chat:
"""
Build the dataset "user_analytics" in project "BI" with recursive build mode
"""

# This translates to:
build_dataset(
    project_key="BI",
    dataset_name="user_analytics",
    mode="RECURSIVE_BUILD"
)

Adding a Daily Scenario Trigger

# Via Claude Code chat:
"""
Add a daily trigger to scenario "daily_etl" that runs at 6:00 AM UTC
"""

# This translates to:
add_scenario_trigger(
    project_key="DATA_PIPELINE",
    scenario_id="daily_etl",
    trigger_type="daily",
    hour=6,
    minute=0,
    timezone="UTC"
)

Advanced Operations

Getting Scenario Logs for Failed Runs

# Via Claude Code chat:
"""
Show me the logs for the latest failed run of scenario "data_processing"
"""

# This translates to:
get_scenario_logs(
    project_key="ANALYTICS_PROJECT",
    scenario_id="data_processing"
)

Extracting and Validating Recipe Code

# Via Claude Code chat:
"""
Extract the code from recipe "customer_segmentation" and validate its syntax
"""

# This translates to:
get_recipe_code(
    project_key="ML_PROJECT",
    recipe_name="customer_segmentation"
)

validate_recipe_syntax(
    project_key="ML_PROJECT",
    recipe_name="customer_segmentation"
)

Exploring Project Structure

# Via Claude Code chat:
"""
Show me the complete data flow for project "SALES_ANALYTICS" and find all datasets containing "customer"
"""

# This translates to:
get_project_flow(
    project_key="SALES_ANALYTICS"
)

search_project_objects(
    project_key="SALES_ANALYTICS",
    search_term="customer",
    object_types=["datasets", "recipes", "scenarios"]
)

Getting Sample Data

# Via Claude Code chat:
"""
Get a sample of 500 rows from dataset "transactions" showing only customer_id and amount columns
"""

# This translates to:
get_dataset_sample(
    project_key="FINANCE_PROJECT",
    dataset_name="transactions",
    rows=500,
    columns=["customer_id", "amount"]
)

Monitoring and Debugging

# Via Claude Code chat:
"""
Show me the recent failed runs in project "DATA_PIPELINE" and get details for any failed jobs
"""

# This translates to:
get_recent_runs(
    project_key="DATA_PIPELINE",
    limit=20,
    status_filter="FAILED"
)

get_job_details(
    project_key="DATA_PIPELINE",
    job_id="job_12345"
)

Productivity Operations

# Via Claude Code chat:
"""
Export the configuration of project "TEMPLATE_PROJECT" as YAML and duplicate its structure to "NEW_PROJECT"
"""

# This translates to:
export_project_config(
    project_key="TEMPLATE_PROJECT",
    format="yaml"
)

duplicate_project_structure(
    source_project_key="TEMPLATE_PROJECT",
    target_project_key="NEW_PROJECT",
    include_data=False
)

πŸ—οΈ Architecture

dataiku_factory/
β”œβ”€β”€ dataiku_mcp/
β”‚   β”œβ”€β”€ __init__.py
β”‚   β”œβ”€β”€ client.py          # DSS client wrapper
β”‚   β”œβ”€β”€ server.py          # MCP server implementation
β”‚   └── tools/
β”‚       β”œβ”€β”€ recipes.py              # Recipe management tools
β”‚       β”œβ”€β”€ datasets.py             # Dataset management tools
β”‚       β”œβ”€β”€ scenarios.py            # Scenario management tools
β”‚       β”œβ”€β”€ advanced_scenarios.py   # Advanced scenario tools
β”‚       β”œβ”€β”€ code_development.py     # Code development tools
β”‚       β”œβ”€β”€ project_exploration.py  # Project exploration tools
β”‚       β”œβ”€β”€ environment_config.py   # Environment configuration
β”‚       β”œβ”€β”€ monitoring_debug.py     # Monitoring & debugging
β”‚       └── productivity.py         # Productivity tools
β”œβ”€β”€ scripts/
β”‚   └── mcp_server.py      # MCP server entrypoint
β”œβ”€β”€ install.sh             # Installation script
β”œβ”€β”€ README.md
β”œβ”€β”€ pyproject.toml
└── .env.sample

🔒 Security

  • API Key Protection: Store API keys in environment variables, never in code
  • SSL Configuration: Support for self-signed certificates with DSS_INSECURE_TLS=true
  • Permission Validation: All operations respect DSS user permissions
  • Error Handling: Sensitive information is not exposed in error messages

📈 Monitoring

The MCP server provides logging for monitoring:

# Run with verbose logging
python scripts/mcp_server.py --verbose

# Check logs for debugging
tail -f dataiku_mcp.log

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/amazing-feature
  3. Commit changes: git commit -m 'Add amazing feature'
  4. Push to branch: git push origin feature/amazing-feature
  5. Open a Pull Request

Development Setup

# Install development dependencies
pip install -e ".[dev]"

# Run code formatting
black dataiku_mcp/ scripts/
ruff check dataiku_mcp/ scripts/

πŸ“ API Reference

Supported Recipe Types

  • Code recipes: python, r, sql, pyspark, scala, shell
  • Visual recipes: grouping, join, sync, split, distinct, sort, topn

Supported Dataset Types

  • Managed datasets: managed (default filesystem storage)
  • Filesystem datasets: filesystem (custom paths)
  • SQL datasets: sql (database tables)
  • Cloud datasets: s3, gcs, azure
  • Upload datasets: uploaded (CSV uploads)
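
For instance, creating a SQL dataset might look like this (the params keys are assumptions; check your DSS connection settings for the actual schema):

create_dataset(
    project_key="FINANCE_PROJECT",
    dataset_name="customers_sql",
    dataset_type="sql",
    params={
        "connection": "my_postgres",  # assumed key: name of an existing DSS connection
        "table": "customers"          # assumed key: target table
    }
)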

Supported Scenario Types

  • Step-based scenarios: step_based (visual workflow)
  • Custom Python scenarios: custom_python (Python code)

Trigger Types

  • Periodic: periodic (every X minutes)
  • Hourly: hourly (specific minutes past hour)
  • Daily: daily (specific time daily)
  • Monthly: monthly (specific day/time monthly)
  • Dataset: dataset (on dataset changes)
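
As a sketch, an hourly trigger might be added like this (the minute parameter is an assumption, by analogy with the daily trigger example above):

add_scenario_trigger(
    project_key="DATA_PIPELINE",
    scenario_id="daily_etl",
    trigger_type="hourly",
    minute=30  # assumed parameter, by analogy with hour/minute on the daily trigger
)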

πŸ› Troubleshooting

Common Issues

  1. Connection refused: Check DSS_HOST and ensure DSS is running
  2. SSL certificate errors: Set DSS_INSECURE_TLS=true for self-signed certificates
  3. API key invalid: Verify API key in DSS admin panel
  4. Permission denied: Ensure API key has required project permissions

Debug Mode

Enable debug logging:

python scripts/mcp_server.py --verbose

Connection Testing

Starting the server with verbose logging also serves as a connection test against your DSS instance:

python scripts/mcp_server.py --verbose

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

πŸ™ Acknowledgments


Ready to enhance your Dataiku workflows with AI assistance! 🚀
