Skip to content

fix: handle all test_range types in SizePartitioner.get_cost#2438

Open
octo-patch wants to merge 1 commit intoopen-compass:mainfrom
octo-patch:fix/issue-2430-size-partitioner-test-range
Open

fix: handle all test_range types in SizePartitioner.get_cost#2438
octo-patch wants to merge 1 commit intoopen-compass:mainfrom
octo-patch:fix/issue-2430-size-partitioner-test-range

Conversation

@octo-patch
Copy link
Copy Markdown

Fixes #2430

Problem

SizePartitioner.get_cost() computed actual_size using an eval trick:

actual_size = eval('len(range(self.dataset_size[dataset_abbr]))'
                   f'{test_range})')

This constructs expressions like len(range(N)[:100]) and relies on test_range being a string slice. However, test_range can also be None, an int, or a float (all documented in DatasetReader), which causes a SyntaxError:

  • Nonelen(range(N))None) — syntax error
  • int 100len(range(N))100) — syntax error
  • float 0.5len(range(N))0.5) — syntax error

Solution

Add _get_actual_size(test_range, total_size) that mirrors the branching logic already present in load_partial_dataset() (in icl_dataset_reader.py) without loading the dataset into memory:

test_range type Behaviour
None / empty string full dataset size
int / float outside (0, total_size) full dataset size
float in (0, 1) int(test_range * total_size)
int in (0, total_size) int(test_range)
str (e.g. "[:100]") len(index_list{test_range})

Both eval call-sites in get_cost() are replaced with calls to this helper.

Testing

Manually verified with representative inputs:

p._get_actual_size(None, 1000)      # 1000
p._get_actual_size('', 1000)        # 1000
p._get_actual_size(100, 1000)       # 100
p._get_actual_size(0.5, 1000)       # 500
p._get_actual_size('[:100]', 1000)  # 100
p._get_actual_size('[100:200]', 1000) # 100

…pen-compass#2430)

The previous implementation used an eval trick that only worked for string
test_range values (e.g. "[:100]"). Passing an int, float, or None caused a
SyntaxError because the expression became invalid Python.

Introduce _get_actual_size() that mirrors the slicing logic already used in
load_partial_dataset(), handling None/empty-string, int, float (<1 as ratio,
>=1 as count), and string slice notation correctly.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] SizePartitioner is not handling test_range properly

2 participants