Markio is an API-first service that converts documents and web content into Markdown/structured text, with:
- Sync parsing endpoints (`/v1/parse_*`, `/v1/parse_file`, `/v1/parse_url`)
- Async task queue with retry/cancel/pause/resume (`/v1/tasks/*`)
- Optional Redis-backed queue/state/cache
- Vue 3 console at `/console`
- Local SDK + CLI for direct integration

This repository is currently in alpha (0.1.0) and focuses on practical parsing workflows over heavy platform features.
- Unified parse contract with `parsed_content`, `parser`, `source_type`, `request_id`, and `duration_ms`
- Broad format coverage: Office files, PDF, HTML, EPUB, image OCR, URL, FASTA, GenBank
- Queue observability: task stats, queue health, dashboard, per-task processing latency
- Operational safety: upload size limits, strict output directory guard, consistent JSON error model, request ID tracing, rate limiting
- Flexible deployment: local Python, Docker Compose, optional Redis backend
- Developer ergonomics: typed FastAPI routes, SDK/CLI, and comprehensive pytest suite
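
The unified parse contract mentioned in the list above can be mirrored on the client side as a small typed record. The following is a minimal sketch, not Markio's own model: it assumes the five fields arrive as top-level keys of the JSON response body, and the field types (`str` for text fields, a number for `duration_ms`) are guesses. `ParseResult` and `parse_result_from_json` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Any, Mapping

# Field names come from the contract listed above; the types are assumptions.
CONTRACT_FIELDS = ("parsed_content", "parser", "source_type", "request_id", "duration_ms")


@dataclass(frozen=True)
class ParseResult:
    """Hypothetical client-side view of a parse response."""

    parsed_content: str
    parser: str
    source_type: str
    request_id: str
    duration_ms: float


def parse_result_from_json(payload: Mapping[str, Any]) -> ParseResult:
    """Build a ParseResult, failing loudly if a contract field is missing."""
    missing = [name for name in CONTRACT_FIELDS if name not in payload]
    if missing:
        raise ValueError(f"response is missing contract fields: {missing}")
    return ParseResult(
        parsed_content=str(payload["parsed_content"]),
        parser=str(payload["parser"]),
        source_type=str(payload["source_type"]),
        request_id=str(payload["request_id"]),
        duration_ms=float(payload["duration_ms"]),
    )
```

Validating the response up front keeps a missing field from surfacing later as a confusing `KeyError`, and the `request_id` stays available for correlating with server logs.
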
```mermaid
flowchart LR
  A["Clients (API / CLI / SDK / Console)"] --> B["FastAPI App"]
  B --> C["Sync Parse Routers"]
  B --> D["Async Task Router"]
  C --> E["Parser Registry + Guards"]
  E --> F["Docling / MinerU Parsers"]
  D --> G["Task Manager (Memory or Redis)"]
  G --> F
  G --> H["Redis Cache / Task Store (optional)"]
  B --> I["Middlewares (trace, rate-limit, gzip, cors)"]
```

| Type | Extensions / Source | Dedicated Endpoint | Supported by `/v1/parse_file` |
|---|---|---|---|
| PDF | `.pdf` | `/v1/parse_pdf_file` | ✅ |
| Word | `.doc`, `.docx` | `/v1/parse_doc_file`, `/v1/parse_docx_file` | ✅ |
| PowerPoint | `.ppt`, `.pptx` | `/v1/parse_ppt_file`, `/v1/parse_pptx_file` | ✅ |
| Excel | `.xlsx` | `/v1/parse_xlsx_file` | ✅ |
| HTML File | `.html`, `.htm` | `/v1/parse_html_file` | ✅ |
| EPUB | `.epub` | `/v1/parse_epub_file` | ✅ |
| Image OCR | `.png`, `.jpg`, `.jpeg` | `/v1/parse_image_file` | ✅ |
| URL | `http(s)://...` | `/v1/parse_url` | ❌ |
| FASTA | `.fasta`, `.fa`, `.fna`, `.faa`, `.ffn`, `.fsa`, `.fas`, `.txt` | `/v1/parse_fasta_file` | ❌ |
| GenBank | `.gb`, `.gbk`, `.genbank`, `.gbff`, `.txt` | `/v1/parse_genbank_file` | ❌ |

- Python 3.11+
- `uv` (recommended)
- Node.js 18+ (for frontend development)
- Optional: Docker + Docker Compose
- Optional: Redis (`TASK_QUEUE_BACKEND=redis` + `REDIS_ENABLED=true`)
- Optional: LibreOffice (`.doc` and `.ppt` conversion support)

```bash
git clone https://github.com/Tendo33/markio.git
cd markio
uv sync
uv pip install -e .
cp .env.example .env
python markio/main.py
```

Open:
- API docs: http://localhost:8000/docs
- Console: http://localhost:8000/console
- Health: http://localhost:8000/healthz

Docker Compose:
```bash
docker compose up -d
```

Frontend development:
```bash
cd frontend
npm install
npm run dev
```

Parse a file:
```bash
curl -X POST "http://localhost:8000/v1/parse_file" \
  -F "file=@./sample.docx"
```

Parse a URL:
```bash
curl -X POST "http://localhost:8000/v1/parse_url?url=https://example.com"
```

Async task queue:
```bash
# submit
curl -X POST "http://localhost:8000/v1/tasks/submit" \
  -F "file=@./sample.pdf" \
  -F "parse_method=auto" \
  -F "lang=ch" \
  -F "priority=5"
# list
curl "http://localhost:8000/v1/tasks?page=1&page_size=20"
# dashboard
curl "http://localhost:8000/v1/tasks/dashboard"
```

`task_id` is expected to be a 32-char lowercase hex string.
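
A client can check this format before calling a task endpoint, which avoids a round trip for an obviously malformed ID. The sketch below only checks the shape stated above. `is_valid_task_id` is a hypothetical helper, and how Markio actually generates or validates task IDs is not specified here.

```python
import re

# 32 characters, each a lowercase hexadecimal digit.
TASK_ID_PATTERN = re.compile(r"[0-9a-f]{32}")


def is_valid_task_id(task_id: str) -> bool:
    """Return True if task_id is exactly 32 lowercase hex characters."""
    return TASK_ID_PATTERN.fullmatch(task_id) is not None
```

Python's `uuid.uuid4().hex` has the same shape, which is handy for generating test IDs.
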
Base prefix: `/v1`

Parse endpoints:
- `POST /parse_file` (extension-based dispatch)
- `POST /parse_pdf_file`
- `POST /parse_doc_file`
- `POST /parse_docx_file`
- `POST /parse_ppt_file`
- `POST /parse_pptx_file`
- `POST /parse_xlsx_file`
- `POST /parse_html_file`
- `POST /parse_epub_file`
- `POST /parse_image_file`
- `POST /parse_url`
- `POST /parse_fasta_file`
- `POST /parse_genbank_file`

Task endpoints:
- `POST /tasks/submit`
- `GET /tasks`
- `GET /tasks/stats`
- `GET /tasks/queue`
- `GET /tasks/dashboard`
- `GET /tasks/{task_id}`
- `POST /tasks/queue/pause`
- `POST /tasks/queue/resume`
- `POST /tasks/{task_id}/cancel`
- `POST /tasks/{task_id}/retry`

System endpoints (served at the root, without the `/v1` prefix):
- `GET /healthz`
- `GET /readyz`
- `GET /` (redirect to `/docs`)
- `GET /console` (frontend static app / fallback page)

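
When calling these endpoints from code, it helps to build URLs in one place and to percent-encode the `url` query parameter of `/parse_url`, since a target URL containing `&` or `?` would otherwise be split into separate query parameters. The sketch below uses only the standard library. `parse_url_request` and `task_action_url` are hypothetical helper names, and the base address is the local default used in the quick-start commands.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:8000"  # local default from the quick-start commands
API_PREFIX = "/v1"


def parse_url_request(target: str, base_url: str = BASE_URL) -> str:
    """Build the POST /v1/parse_url address with the target safely encoded."""
    query = urlencode({"url": target})
    return f"{base_url}{API_PREFIX}/parse_url?{query}"


def task_action_url(task_id: str, action: str, base_url: str = BASE_URL) -> str:
    """Build the address for POST /v1/tasks/{task_id}/cancel or /retry."""
    if action not in {"cancel", "retry"}:
        raise ValueError(f"unsupported task action: {action!r}")
    return f"{base_url}{API_PREFIX}/tasks/{task_id}/{action}"
```

The returned strings can be passed to any HTTP client; both helpers only format addresses and make no network calls.
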
After an editable install, the CLI entry point is available as `markio`.
```bash
markio pdf ./sample.pdf --method auto
markio docx ./sample.docx --save
markio image ./sample.png
```

Python SDK example:
```python
import asyncio

from markio.sdk.markio_sdk import MarkioSDK


async def main():
    sdk = MarkioSDK(output_dir="outputs")
    result = await sdk.parse_pdf("sample.pdf", parse_method="auto")
    # Print the first 500 characters of the parsed content
    print(result["content"][:500])


asyncio.run(main())
```

More:
- CLI guide: docs/cli_usage.md
- SDK guide: docs/sdk_usage.md
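
Because the SDK call in the example above is a coroutine, several files can be parsed concurrently with a cap on how many run at once. The following is a sketch under assumptions: it relies only on the `parse_pdf(path, parse_method=...)` call shown above, and it assumes that one SDK instance can be used from several tasks at the same time, which is not confirmed here. `parse_many` and the `PdfParser` protocol are hypothetical names.

```python
import asyncio
from typing import Any, Protocol, Sequence


class PdfParser(Protocol):
    """Anything exposing the parse_pdf coroutine used in the SDK example."""

    async def parse_pdf(self, path: str, parse_method: str = "auto") -> dict[str, Any]: ...


async def parse_many(
    sdk: PdfParser, paths: Sequence[str], max_concurrency: int = 2
) -> list[dict[str, Any]]:
    """Parse several PDFs, running at most max_concurrency parses at once."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def parse_one(path: str) -> dict[str, Any]:
        async with semaphore:  # wait for a free slot before parsing
            return await sdk.parse_pdf(path, parse_method="auto")

    # gather returns results in the same order as the input paths
    return list(await asyncio.gather(*(parse_one(p) for p in paths)))
```

With the real SDK this would be called as `asyncio.run(parse_many(MarkioSDK(output_dir="outputs"), ["a.pdf", "b.pdf"]))`. A small `max_concurrency` is a cautious starting point, since PDF parsing can be heavy on memory and GPU.
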
Core settings come from environment variables (.env, see .env.example).
| Variable | Default | Notes |
|---|---|---|
| `PDF_PARSE_ENGINE` | `pipeline` | `pipeline`, `vlm-vllm-engine`, `vlm-vllm-client` |
| `MINERU_DEVICE_MODE` | `cuda` | `cuda`, `cpu`, `mps` |
| `REDIS_ENABLED` | `false` | Enables Redis cache and Redis task backend |
| `TASK_QUEUE_BACKEND` | `memory` | `memory` or `redis` |
| `TASK_WORKER_COUNT` | `2` | Background workers |
| `TASK_MAX_UPLOAD_SIZE_BYTES` | `52428800` | Upload cap (413 on overflow) |
| `TASK_MAX_AUTO_RETRIES` | `0` | Auto-retry limit |
| `TASK_PROCESSING_TIMEOUT_SECONDS` | `0` | Requeue timeout for processing tasks |
| `RATE_LIMIT_ENABLED` | `true` | Lightweight per-IP + route limiter |
| `ENABLE_MCP` | `false` | Mount MCP endpoints/tools |

Redis details: docs/REDIS_INTEGRATION.md
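
To illustrate how the queue-related variables and defaults in the table fit together, the sketch below reads a subset of them from an environment mapping. It is not Markio's settings loader. `QueueSettings`, `load_queue_settings`, and the accepted boolean spellings are assumptions made for illustration; only the variable names and default values come from the table. The default upload cap of `52428800` bytes equals 50 MiB.

```python
import os
from dataclasses import dataclass
from typing import Mapping


def _as_bool(value: str) -> bool:
    # Assumption: these spellings count as "true"; the real parser may differ.
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class QueueSettings:
    """Hypothetical container for a subset of the variables in the table."""

    redis_enabled: bool
    task_queue_backend: str
    task_worker_count: int
    task_max_upload_size_bytes: int


def load_queue_settings(env: Mapping[str, str] = os.environ) -> QueueSettings:
    """Read queue settings from env, falling back to the documented defaults."""
    backend = env.get("TASK_QUEUE_BACKEND", "memory")
    if backend not in {"memory", "redis"}:
        raise ValueError(f"TASK_QUEUE_BACKEND must be 'memory' or 'redis', got {backend!r}")
    return QueueSettings(
        redis_enabled=_as_bool(env.get("REDIS_ENABLED", "false")),
        task_queue_backend=backend,
        task_worker_count=int(env.get("TASK_WORKER_COUNT", "2")),
        task_max_upload_size_bytes=int(env.get("TASK_MAX_UPLOAD_SIZE_BYTES", "52428800")),
    )
```

Passing a plain dict instead of `os.environ` makes the loader easy to exercise in tests without touching the process environment.
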
```text
markio/
├── markio/        # FastAPI app, routers, parsers, services, SDK/CLI
├── frontend/      # Vue 3 + Vite console
├── tests/         # pytest suites and fixtures
├── docs/          # usage docs and design plans
├── scripts/       # helper scripts
├── data/ logs/ outputs/
├── compose.yaml
└── .env.example
```

```bash
# default suite (excludes live tests by marker)
uv run pytest
# tests requiring external running service
uv run pytest -m live
```

- CLI: docs/cli_usage.md
- SDK: docs/sdk_usage.md
- Console frontend: docs/console_frontend.md
- Biological parsing: docs/biological_data_parsing.md
- Redis integration: docs/REDIS_INTEGRATION.md
- Project license: MIT
- Frontend third-party notice: frontend/THIRD_PARTY_NOTICES.md
