FastAPI + Vue app that extracts structured research metadata from PDF papers: DOI/title, data/code availability statements, and cleaned/validated data/code links. Use it via a simple REST API or the bundled web UI.
- Accurate DOI/title (no guessing), availability statements, and link repair/deduplication
- Single-file synchronous mode (no database required)
- Batch jobs and CSV export when MongoDB is available
- Works with local Ollama or any OpenAI‑compatible endpoint
- Python 3.10+
- Node.js 18+
- Optional: MongoDB 6+ (for background jobs/batch)
- Optional: Ollama with `nomic-embed-text` for local embeddings (`ollama pull nomic-embed-text`)
- Backend (choose one)
  - Conda: `./setup_conda.sh && conda activate ecoopen-llm`
  - venv: `python -m venv .venv && source .venv/bin/activate && pip install -r requirements.txt`
- Configure (optional): `cp .env.example .env` and set `AGENT_BASE_URL`/`AGENT_MODEL` (and `JWT_SECRET` for production). For embeddings, choose `EMBEDDINGS_BACKEND=endpoint` with `AGENT_EMBED_MODEL=<id>` to use the same OpenAI-compatible server for embeddings, or `EMBEDDINGS_BACKEND=ollama` with `OLLAMA_HOST` and `OLLAMA_EMBED_MODEL` (e.g., `nomic-embed-text`). See the sample `.env` after this list.
- API: `./run_api.sh` → http://localhost:8000
- Frontend: `cd frontend && npm install && npm run dev` → http://localhost:5173
  - To point the UI at a custom API URL, create `frontend/.env` with `VITE_API_BASE=http://localhost:8000`
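A minimal `.env` sketch, assuming a local Ollama server. The variable names come from this README; every value is a placeholder to adapt:

```bash
# .env — example values only
AGENT_BASE_URL=http://localhost:11434/v1   # any OpenAI-compatible endpoint works
AGENT_MODEL=llama3.1                       # chat model ID served by that endpoint
JWT_SECRET=change-me-in-production         # set a strong secret for production

# Embeddings: reuse the same endpoint...
#EMBEDDINGS_BACKEND=endpoint
#AGENT_EMBED_MODEL=<embedding-model-id>
# ...or call Ollama directly:
EMBEDDINGS_BACKEND=ollama
OLLAMA_HOST=http://localhost:11434
OLLAMA_EMBED_MODEL=nomic-embed-text
```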
- One‑shot deploy with systemd + nginx: `./deploy.sh` (builds the frontend, installs backend deps, reloads services)
- Flags (usage example below):
  - `SKIP_FRONTEND=1`: skip building the frontend
  - `INSTALL_BACKEND=0`: don't (re)install Python deps
  - `DISABLE_CONFLICTING_SITES=0`: keep other nginx sites
  - `PUBLIC_BASE=https://ecoopen.sciom.net`: override the public health URL
- Verify:
  - Local: `curl -s -o /dev/null -w '%{http_code}\n' http://127.0.0.1:3290/health` → 200
  - Public: `curl -s -o /dev/null -w '%{http_code}\n' "$PUBLIC_BASE/api/health"` → 200
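For example, a quick redeploy that skips the frontend build and keeps the current Python environment (assuming the flags are read from the environment, as their `KEY=value` form suggests):

```bash
SKIP_FRONTEND=1 INSTALL_BACKEND=0 ./deploy.sh
```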
All analyze endpoints require auth. Register or log in to obtain a bearer token.
Password requirements: at least 8 characters, including both letters and numbers. Registration requires `password_confirm` to match `password`.
```bash
# Register (requires password_confirm)
curl -sS -X POST -H 'Content-Type: application/json' \
  -d '{"email":"[email protected]","password":"Abcdef12","password_confirm":"Abcdef12"}' \
  http://localhost:8000/auth/register

# Login → TOKEN
TOKEN=$(curl -sS -X POST -H 'Content-Type: application/json' \
  -d '{"email":"[email protected]","password":"Abcdef12"}' \
  http://localhost:8000/auth/login | jq -r .access_token)

# Current user and admin flag
curl -sS -H "Authorization: Bearer $TOKEN" http://localhost:8000/auth/me
# → { "id": "...", "email": "[email protected]", "is_admin": false }
```

Admin emails are configured via `ADMIN_EMAILS` in the environment or `.env` (comma‑separated addresses).
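For example (hypothetical addresses):

```bash
ADMIN_EMAILS=admin@example.org,ops@example.org
```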
Frontend tip: the UI auto‑logs in after a successful registration and syncs admin/user ID state from `/auth/me`.
Analyze a single PDF synchronously (no MongoDB needed):
```bash
curl -sS -X POST \
  -H "Authorization: Bearer $TOKEN" \
  -F "file=@/path/to/paper.pdf;type=application/pdf" \
  "http://localhost:8000/analyze?mode=sync"
```

The response includes `title`, `title_source` (origin of the extracted title: `heuristic`|`llm`|`enriched`), `doi`, availability statements, and normalized `data_links`/`code_links`.
For batch processing and CSV export, run MongoDB and use the UI or `POST /analyze/batch` (sketch below).
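A sketch of a batch submission. The endpoint comes from this README, but the multipart field name (`files`) is an assumption; check FastAPI's interactive docs (typically at `/docs`) for the actual schema:

```bash
# "files" is an assumed field name; verify against the API docs
curl -sS -X POST \
  -H "Authorization: Bearer $TOKEN" \
  -F "files=@/path/to/paper1.pdf;type=application/pdf" \
  -F "files=@/path/to/paper2.pdf;type=application/pdf" \
  http://localhost:8000/analyze/batch
```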
- `ENABLE_TITLE_ENRICHMENT` (default: true): improves title detection by merging first-page lines; no network.
- `ENABLE_LINK_VERIFICATION` (default: true): normalizes and deduplicates links; no network.
- `ENABLE_DOI_VERIFICATION` (default: false): verifies the extracted DOI against Crossref and adjusts confidence using title similarity. If no DOI is present or verification fails, a title-based Crossref search can supply a DOI that matches the extracted title. Knobs: `DOI_HTTP_TIMEOUT_SECONDS`, `DOI_CACHE_TTL` (see the example after this list).
- `ENABLE_TITLE_LLM_PREFERRED` (default: false): prefer LLM-based title extraction over the heuristic; when DOI verification is enabled, the title is checked against Crossref and can drive DOI reconciliation.
- Optional knobs: `ENRICHMENT_HTTP_TIMEOUT_SECONDS`, `ENRICHMENT_MAX_CONCURRENCY`, `ENRICHMENT_CONTACT_EMAIL` (reserved for future network enrichers).
- To include detailed extraction diagnostics in API responses, set `EXPOSE_AVAILABILITY_DEBUG=true`.
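For example, enabling Crossref DOI verification with explicit knob values (the values here are illustrative, not project defaults):

```bash
# .env — example values only
ENABLE_DOI_VERIFICATION=true
DOI_HTTP_TIMEOUT_SECONDS=5   # per-request timeout for Crossref lookups
DOI_CACHE_TTL=3600           # cache verified results for an hour (seconds assumed)
```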
Run the test suite to verify functionality:
```bash
# Unit tests (no external services required)
pytest tests/ -v -m "not integration"

# Full workflow test (requires Ollama and LLM endpoint)
pytest tests/test_workflow_full.py -v -s
```

The full workflow test assesses extraction accuracy and performance on example PDFs. See `tests/README.md` for details.
- Missing embeddings (Ollama): `ollama pull nomic-embed-text`
- No database? Use `mode=sync` for single-file analysis.
- LLM endpoint down: the service falls back to deterministic extraction when possible.
Contributions are welcome via issues and PRs. This is an open‑source project; see the repository’s license file for details.