fix: handle dict output in SmolInstructDataset.load for SIDER subset #2443
Open
octo-patch wants to merge 1 commit into open-compass:main
…ixes open-compass#2440)

The property_prediction-sider subset stores the `output` field as a multi-label dict (one key per side-effect category) rather than a plain string like every other SMolInstruct subset. When this is mixed with string-output rows in a single HuggingFace Dataset, Arrow raises:

    TypeError: Couldn't cast array of type struct<...> to string

Two related fixes in opencompass/datasets/smolinstruct.py:

1. Serialize any dict `output` to a JSON string before appending to raw_data, so Dataset.from_list() always sees a uniform string column.
2. Add the missing `import json` and `import random` statements (`random` was already used by the mini_set sampling code but never imported).

Co-authored-by: Octopus <liyuan851277048@icloud.com>
Fixes #2440
Problem
The `property_prediction-sider` subset of SMolInstruct stores the `output` field as a multi-label dict (one entry per side-effect category) rather than a plain string like every other subset. When Arrow tries to build a HuggingFace Dataset from rows that mix string and struct-typed outputs, it raises `TypeError: Couldn't cast array of type struct<...> to string`.

Additionally, `random` was used by the existing `mini_set` sampling code but was never imported, which causes a `NameError` whenever `mini_set=True`.

Solution
In `opencompass/datasets/smolinstruct.py`:

1. Before appending a row to `raw_data`, check whether the `output` field is a dict; if so, serialize it to a JSON string. This ensures `Dataset.from_list()` always receives a uniform string-typed column regardless of the subset.
2. Add the missing `import json` and `import random` statements.

Testing
The fix can be verified by running the pp-acc evaluation config that includes the SIDER subset, using a `config.py` that imports `mini_pp_acc_datasets_0shot_instruct` (which includes `PP-SIDER-0shot-instruct-mini`). Previously this raised a `TypeError`; with this fix the dataset loads cleanly.
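
For reviewers, the serialization step described under Solution can be sketched roughly as follows. This is a minimal standalone illustration, not the actual diff: the helper name `normalize_output`, the `input` field, and the sample rows (including the side-effect labels) are hypothetical, and only the `output` field and the `raw_data` list come from the PR description.

```python
import json


def normalize_output(item: dict) -> dict:
    """Return a copy of `item` whose `output` field is always a string.

    Hypothetical helper: the real change lives inside
    SmolInstructDataset.load and may be structured differently.
    """
    output = item["output"]
    if isinstance(output, dict):
        # Multi-label rows (as in the SIDER subset) become a JSON string,
        # so every row exposes the same column type.
        output = json.dumps(output)
    return {**item, "output": output}


# Made-up sample rows: one plain-string output, one multi-label dict output.
rows = [
    {"input": "CCO", "output": "Yes"},
    {"input": "CCN", "output": {"side_effect_a": True, "side_effect_b": False}},
]

raw_data = [normalize_output(row) for row in rows]

# Every `output` is now a string, which is the uniform column type the
# description says Dataset.from_list() needs.
assert all(isinstance(row["output"], str) for row in raw_data)
```

One consequence of this approach is that any consumer of the SIDER rows that needs the individual labels has to call `json.loads` on `output` to recover the dict.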