fix: handle dict output in SmolInstructDataset.load for SIDER subset #2443

Open
octo-patch wants to merge 1 commit into open-compass:main from octo-patch:fix/issue-2440-smolinstruct-sider-dict-output

@octo-patch

Fixes #2440

Problem

The property_prediction-sider subset of SMolInstruct stores the output field as a multi-label dict (one entry per side-effect category) rather than a plain string like every other subset. When the Hugging Face datasets library, which is backed by Arrow, builds a Dataset from rows that mix string and struct-typed outputs, it raises:

TypeError: Couldn't cast array of type struct<Hepatobiliary disorders: string, ...> to string

Additionally, the existing mini_set sampling code uses random without importing it, so loading with mini_set=True would raise a NameError.

Solution

In opencompass/datasets/smolinstruct.py:

  1. Before appending each row to raw_data, check whether the output field is a dict; if so, serialize it to a JSON string. This ensures Dataset.from_list() always receives a uniform string-typed column regardless of the subset.
  2. Add the missing import json and import random statements.
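Step 1 can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the helper name `normalize_output` and the sample rows are hypothetical, and it assumes each row is a plain dict with an `output` key.

```python
import json


def normalize_output(row):
    """Return a copy of row whose 'output' value is a string when it was a dict."""
    output = row.get("output")
    if isinstance(output, dict):
        # Multi-label outputs (as in the SIDER subset) become a JSON string,
        # so the column has one type across all rows.
        row = {**row, "output": json.dumps(output)}
    return row


# Hypothetical sample rows: one dict-valued output, one plain-string output.
rows = [
    {"input": "CCO", "output": {"Hepatobiliary disorders": "Yes"}},
    {"input": "CCN", "output": "No"},
]
raw_data = [normalize_output(r) for r in rows]
```

Because the dict is serialized rather than flattened, downstream code can recover the per-category labels with `json.loads` if it needs them.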

Testing

The fix can be verified by running the pp-acc evaluation config that includes the SIDER subset:

opencompass config.py --reuse

where config.py imports mini_pp_acc_datasets_0shot_instruct (which includes PP-SIDER-0shot-instruct-mini). Previously this raised a TypeError; with this fix the dataset loads cleanly.
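A lighter sanity check, independent of the full evaluation run, is to confirm that the output column has a single type before the rows reach Dataset.from_list(). The sketch below is an assumption about how one might do that; `output_types` and the sample rows are hypothetical and not part of OpenCompass.

```python
def output_types(rows):
    """Return the set of Python type names found in the 'output' field."""
    return {type(row.get("output")).__name__ for row in rows}


# Hypothetical rows before the fix: string and dict outputs are mixed.
mixed = [
    {"output": "Yes"},
    {"output": {"Hepatobiliary disorders": "Yes"}},
]

# Hypothetical rows after the fix: the dict has been serialized to JSON.
uniform = [
    {"output": "Yes"},
    {"output": '{"Hepatobiliary disorders": "Yes"}'},
]
```

If `output_types` reports more than one type name, the rows would produce a mixed-type column of the kind described in the Problem section.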

…ixes open-compass#2440)

The property_prediction-sider subset stores the `output` field as a
multi-label dict (one key per side-effect category) rather than a plain
string like every other SMolInstruct subset.  When this is mixed with
string-output rows in a single HuggingFace Dataset, Arrow raises:

  TypeError: Couldn't cast array of type struct<...> to string

Two related fixes in opencompass/datasets/smolinstruct.py:
1. Serialize any dict `output` to a JSON string before appending to
   raw_data, so Dataset.from_list() always sees a uniform string column.
2. Add the missing `import json` and `import random` statements (`random`
   was already used by the mini_set sampling code but never imported).

Co-authored-by: Octopus <liyuan851277048@icloud.com>

Development

Successfully merging this pull request may close these issues.

[Bug] SmolInstruct -- Unexpected data format when loading subset property_prediction-sider.jsonl