[Bug]: document_metadata missing for most chunks in large dataset #11533

@gileslloyd

Self Checks

  • I have searched for existing issues, including closed ones.
  • I confirm that I am using English to submit this report (Language Policy).
  • Non-English title submissions will be closed directly (Language Policy).
  • Please do not modify this template :) and fill in all the required fields.

RAGFlow workspace code commit ID

cfdcceb

RAGFlow image version

v0.22.1

Other environment information

Actual behavior

When using the ragflow_retrieval MCP tool with a dataset containing more than 30 documents, chunks from documents 31+ do not contain document_metadata. This is because the metadata is fetched from the document metadata cache. That cache is built using the list documents endpoint with the default pagination parameters of page=1, page_size=30:

docs_res = self._get(f"/datasets/{dataset_id}/documents")

page = int(q.get("page", 1))

As a result, metadata is only ever available for the first 30 documents. Even if the logic were changed to fetch every page when building the metadata cache, there is also a hard limit of 128 documents.

_MAX_DOCUMENT_CACHE = 128

if len(self._dataset_metadata_cache) > self._MAX_DATASET_CACHE:
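One possible direction, sketched here under assumptions rather than taken from the RAGFlow code: walk every page of the list documents endpoint when building the cache, stopping once a page comes back shorter than page_size. The names fetch_all_documents, build_metadata_cache, get_page and the meta_fields key are hypothetical stand-ins for illustration; get_page plays the role of a call such as self._get with explicit page and page_size parameters.

```python
from typing import Callable, Dict, List


def fetch_all_documents(
    get_page: Callable[[int, int], List[dict]],
    page_size: int = 30,
) -> List[dict]:
    """Collect documents from every page, not only page 1.

    get_page(page, page_size) is a hypothetical callable that returns the
    list of documents for one page of the list documents endpoint.
    """
    docs: List[dict] = []
    page = 1
    while True:
        batch = get_page(page, page_size)
        docs.extend(batch)
        # A short (or empty) page means there is nothing left to fetch.
        if len(batch) < page_size:
            break
        page += 1
    return docs


def build_metadata_cache(docs: List[dict]) -> Dict[str, dict]:
    """Map document id -> metadata. The 'meta_fields' key is an assumption."""
    return {d["id"]: d.get("meta_fields", {}) for d in docs}


def fake_get_page(page: int, page_size: int) -> List[dict]:
    """In-memory stand-in for the API: a dataset with 65 documents."""
    all_docs = [{"id": f"doc-{i}", "meta_fields": {"n": i}} for i in range(65)]
    start = (page - 1) * page_size
    return all_docs[start:start + page_size]


cache = build_metadata_cache(fetch_all_documents(fake_get_page))
```

With the fake backend above, the cache covers all 65 documents instead of only the first 30. Any size cap on the cache, such as the 128-document limit mentioned above, would presumably also need to be raised or replaced with a fetch-on-miss fallback for this to help on larger datasets.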

Expected behavior

When using the ragflow_retrieval tool, every chunk should include the relevant document_metadata.

Steps to reproduce

1. Create a dataset.
2. Upload more than 30 documents.
3. Use the ragflow_retrieval MCP tool with a query that returns chunks from one of the later documents in the set.
4. Check the response to verify that those chunks do not include a document_metadata property.
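The check in step 4 can be automated with a small helper. This is a minimal sketch that assumes the tool response has already been parsed into a list of chunk dictionaries, each carrying a document_id key; that shape and the helper name chunks_missing_metadata are assumptions made for illustration.

```python
from typing import Iterable, List


def chunks_missing_metadata(chunks: Iterable[dict]) -> List[str]:
    """Return the document ids of chunks that have no document_metadata key."""
    return [
        c.get("document_id", "<unknown>")
        for c in chunks
        if "document_metadata" not in c
    ]


# Hypothetical parsed response: one chunk with metadata, one without.
sample_chunks = [
    {"document_id": "doc-1", "document_metadata": {"author": "a"}},
    {"document_id": "doc-31"},
]
missing = chunks_missing_metadata(sample_chunks)
```

A non-empty result for chunks that come from later documents in the dataset would confirm the behavior described above.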

Additional information

No response

Metadata

    Labels

    🐞 bug: Something isn't working, pull request that fix bug.
