Question about whether the text segmentation delimiter takes effect in dataset chunking. #8609
Replies: 2 comments 1 reply
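Hi @BillGaici20230401! I'm Dosu and I'm helping the ragflow team. RAGFlow's chunking order is indeed: split on the delimiter first (e.g. [...] If you want each [...] Multi-character delimiters (e.g. [...] In short, RAGFlow's chunking mechanism splits on the delimiter first and merges afterwards, and the delimiter is not kept in the chunks. The chunk size and the delimiter need to be tuned together to fit the actual structure of your text. If you have further special requirements, you can explore the customization options in parser_config or a finer-grained chunking template.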
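To make the "split on the delimiter first, merge afterwards" behavior concrete, here is a minimal sketch in Python. It is not RAGFlow's actual code: the function name `split_then_merge`, the `max_units` parameter, and the whitespace word count used in place of a real tokenizer are all assumptions made for illustration. It also assumes that a piece larger than the budget is kept whole rather than split further.

```python
def split_then_merge(text: str, delimiter: str = "##", max_units: int = 512) -> list[str]:
    """Hypothetical helper: split on the delimiter, then greedily merge neighbours."""
    # Step 1: split on the delimiter; the delimiter itself is dropped here.
    pieces = [p.strip() for p in text.split(delimiter) if p.strip()]

    chunks: list[str] = []
    current: list[str] = []
    current_size = 0
    for piece in pieces:
        # Whitespace word count is only a stand-in for a real token count.
        size = len(piece.split())
        # Step 2: close the current chunk when this piece would exceed the budget.
        if current and current_size + size > max_units:
            chunks.append("\n".join(current))
            current, current_size = [], 0
        # An oversized piece is kept whole in this sketch (assumption).
        current.append(piece)
        current_size += size
    if current:
        chunks.append("\n".join(current))
    return chunks


sample = "alpha beta ## gamma delta ## epsilon"
print(split_then_merge(sample, max_units=512))  # short pieces end up merged together
print(split_then_merge(sample, max_units=1))    # every delimited piece stays separate
```

Under these assumptions, a large budget merges several short delimited sections into one chunk with the delimiter gone, which matches the symptom described in the question, while a budget smaller than the typical section keeps each section separate and still leaves the few large sections intact.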
@BillGaici20230401 Hi, please check our language policy. We'll temporarily close this thread for now, but feel free to reopen it or submit a new one in English. We'd love to help! Thank you for your understanding!
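I have a question about creating knowledge base chunks from a text file. I set the text segmentation delimiter to ## and the chunk size to 512. In the uploaded text file, a ## marker appears every few lines. In the parsing result, chunks do break at ## markers, but not at every one: many chunks contain two ##-delimited sections. Such a chunk holds two sections that I expected to be separate, and the ## delimiter between them has been removed. The description of the delimiter setting says the text is split by the delimiter first and then by size (512). It looks to me as if it is split by size (512) first and then by the delimiter. Is that the case? Thanks!
I wanted to reduce the chunk size, but a few of my expected chunks are fairly large, while most do not exceed 256. What is the best way to chunk in this situation?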