EcoOpen — PDF Scientific Data Extractor

FastAPI + Vue app that extracts structured research metadata from PDF papers: DOI/title, data/code availability statements, and cleaned/validated data/code links. Use it via a simple REST API or the bundled web UI.

Highlights

  • Accurate DOI/title extraction (no guessing), availability-statement detection, and link repair/deduplication
  • Single-file synchronous mode (no database required)
  • Batch jobs and CSV export when MongoDB is available
  • Works with local Ollama or any OpenAI‑compatible endpoint

Requirements

  • Python 3.10+
  • Node.js 18+
  • Optional: MongoDB 6+ (for background jobs/batch)
  • Optional: Ollama with nomic-embed-text for local embeddings (ollama pull nomic-embed-text)

Install

  • Backend (choose one)
    • Conda: ./setup_conda.sh && conda activate ecoopen-llm
    • venv: python -m venv .venv && source .venv/bin/activate && pip install -r requirements.txt
  • Configure (optional): cp .env.example .env and set AGENT_BASE_URL/AGENT_MODEL (and JWT_SECRET for production); a sample .env follows this list. For embeddings choose:
    • EMBEDDINGS_BACKEND=endpoint with AGENT_EMBED_MODEL=<id> to use the same OpenAI-compatible server for embeddings, or
    • EMBEDDINGS_BACKEND=ollama with OLLAMA_HOST and OLLAMA_EMBED_MODEL (e.g., nomic-embed-text).
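
A minimal .env sketch (values are illustrative placeholders; the variable names are the ones listed above):

# Example .env (adjust values to your setup)
AGENT_BASE_URL=http://localhost:11434/v1   # any OpenAI-compatible endpoint
AGENT_MODEL=llama3.1                       # model id served by that endpoint
JWT_SECRET=change-me-in-production

# Embeddings, option A: same OpenAI-compatible server
# EMBEDDINGS_BACKEND=endpoint
# AGENT_EMBED_MODEL=<id>

# Embeddings, option B: local Ollama
EMBEDDINGS_BACKEND=ollama
OLLAMA_HOST=http://localhost:11434
OLLAMA_EMBED_MODEL=nomic-embed-text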

Run
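
The exact entrypoints aren't documented here, so the following is a sketch of a typical local run for a FastAPI + Vue layout; the module path app.main:app and the frontend/ directory are assumptions, so adjust to the actual project structure:

# Backend API (port 8000 matches the examples below; module path is an assumption)
uvicorn app.main:app --reload --port 8000

# Frontend dev server (assumes the Vue app lives in frontend/)
cd frontend && npm install && npm run dev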

Deploy

  • One‑shot deploy with systemd + nginx:
    • ./deploy.sh (builds frontend, installs backend deps, reloads services)
    • Flags:
      • SKIP_FRONTEND=1 skip building the frontend
      • INSTALL_BACKEND=0 don’t (re)install Python deps
      • DISABLE_CONFLICTING_SITES=0 keep other nginx sites
      • PUBLIC_BASE=https://ecoopen.sciom.net override the public health-check URL
  • Verify:
    • Local: curl -s -o /dev/null -w '%{http_code}\n' http://127.0.0.1:3290/health (expect 200)
    • Public: curl -s -o /dev/null -w '%{http_code}\n' "$PUBLIC_BASE/api/health" (expect 200)

Authenticate

All analyze endpoints require auth. Register or login to obtain a bearer token.

Password requirements: at least 8 characters, including both letters and numbers. Registration requires password_confirm to match password.

# Register (requires password_confirm)
curl -sS -X POST -H 'Content-Type: application/json' \
  -d '{"email":"[email protected]","password":"Abcdef12","password_confirm":"Abcdef12"}' \
  http://localhost:8000/auth/register

# Login → TOKEN
TOKEN=$(curl -sS -X POST -H 'Content-Type: application/json' \
  -d '{"email":"[email protected]","password":"Abcdef12"}' \
  http://localhost:8000/auth/login | jq -r .access_token)

# Current user and admin flag
curl -sS -H "Authorization: Bearer $TOKEN" http://localhost:8000/auth/me
# → { "id": "...", "email": "[email protected]", "is_admin": false }

Admin emails are configured via ADMIN_EMAILS in environment or .env (comma‑separated addresses).
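
For example:

ADMIN_EMAILS=[email protected],[email protected]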

Frontend tip: the UI auto‑logs in after a successful registration and syncs admin/user id state from /auth/me.

Quick analyze

Analyze a single PDF synchronously (no MongoDB needed):

curl -sS -X POST \
  -H "Authorization: Bearer $TOKEN" \
  -F "file=@/path/to/paper.pdf;type=application/pdf" \
  "http://localhost:8000/analyze?mode=sync"

Response includes title, title_source (origin of extracted title: heuristic|llm|enriched), doi, availability statements, and normalized data_links/code_links.
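
An illustrative response shape (values are made up, and the exact keys for the availability statements are assumptions; the field names called out above are the ones to rely on):

{
  "title": "Example Paper Title",
  "title_source": "heuristic",
  "doi": "10.1234/example.5678",
  "data_availability": "Data are available at ...",
  "code_availability": "Code is available at ...",
  "data_links": ["https://doi.org/10.5061/dryad.example"],
  "code_links": ["https://github.com/example/repo"]
}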

For batch processing and CSV export, run MongoDB and use the UI or POST /analyze/batch.
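
A sketch of a batch submission (the multipart field name "files" is an assumption; check the API docs for the actual contract):

curl -sS -X POST \
  -H "Authorization: Bearer $TOKEN" \
  -F "files=@paper1.pdf;type=application/pdf" \
  -F "files=@paper2.pdf;type=application/pdf" \
  http://localhost:8000/analyze/batch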

Enrichment & Debug

  • ENABLE_TITLE_ENRICHMENT (default: true): improves title detection by merging first-page lines; no network.
  • ENABLE_LINK_VERIFICATION (default: true): normalizes and deduplicates links; no network.
  • ENABLE_DOI_VERIFICATION (default: false): verifies extracted DOI against Crossref and adjusts confidence using title similarity. If no DOI is present or verification fails, a title-based Crossref search can supply a DOI that matches the extracted title. Knobs: DOI_HTTP_TIMEOUT_SECONDS, DOI_CACHE_TTL.
  • ENABLE_TITLE_LLM_PREFERRED (default: false): prefer LLM-based title extraction over heuristic; when DOI verification is enabled, the title is checked against Crossref and can drive DOI reconciliation.
  • Optional knobs: ENRICHMENT_HTTP_TIMEOUT_SECONDS, ENRICHMENT_MAX_CONCURRENCY, ENRICHMENT_CONTACT_EMAIL (reserved for future network enrichers).
  • To include detailed extraction diagnostics in API responses, set EXPOSE_AVAILABILITY_DEBUG=true.
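
For example, to turn on Crossref DOI verification and include extraction diagnostics in responses (the timeout value is illustrative):

# .env
ENABLE_DOI_VERIFICATION=true
DOI_HTTP_TIMEOUT_SECONDS=10
EXPOSE_AVAILABILITY_DEBUG=true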

Testing

Run the test suite to verify functionality:

# Unit tests (no external services required)
pytest tests/ -v -m "not integration"

# Full workflow test (requires Ollama and LLM endpoint)
pytest tests/test_workflow_full.py -v -s

The full workflow test assesses extraction accuracy and performance on example PDFs. See tests/README.md for details.

Troubleshooting

  • Missing embeddings (Ollama): ollama pull nomic-embed-text
  • No database? Use mode=sync for single-file analysis
  • LLM endpoint down: service falls back to deterministic extraction when possible

Contributing & License

Contributions are welcome via issues and PRs. This is an open‑source project; see the repository’s license file for details.
