More easily support creating reproducible archives with shutil #120036

@ncoghlan

Proposal:

It would be handy if the shutil module provided a convenient way to opt in to the build artifact reproducibility features described in https://reproducible-builds.org/docs/archives/

Such an addition would likely make more sense as a new shutil.make_reproducible_archive function, rather than trying to shoehorn the new functionality into the existing shutil.make_archive API.

The specific problem that prompted this feature idea was hitting this traceback when trying to pass owner=0 and group=0 to shutil.make_archive:

Traceback (most recent call last):
[snip application details]
  File "/home/acoghlan/...[snip]...", line 97, in create_archive
    archive_with_extension = shutil.make_archive(
                             ^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.12/shutil.py", line 1188, in make_archive
    filename = func(base_name, base_dir, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.12/shutil.py", line 992, in _make_tarball
    uid = _get_uid(owner)
          ^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.12/shutil.py", line 941, in _get_uid
    result = getpwnam(name)
             ^^^^^^^^^^^^^^
TypeError: getpwnam() argument must be str, not int

tarfile itself does support setting numeric owner and group IDs (via addfile and the filter option on add), but the high level shutil wrapper assumes the owner and group will always be given as names that can be looked up on the current system; it doesn't allow them to be specified numerically.
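For illustration, a minimal sketch of the tarfile route using the filter option on add. The names zero_owner and make_tar_with_numeric_owner are hypothetical, chosen only for this example; they are not part of shutil or tarfile.

```python
import tarfile


def zero_owner(info: tarfile.TarInfo) -> tarfile.TarInfo:
    """Filter for TarFile.add(): record numeric owner/group 0 and no names."""
    info.uid = 0
    info.gid = 0
    info.uname = ""
    info.gname = ""
    return info


def make_tar_with_numeric_owner(source_dir, archive_path):
    """Hypothetical helper: archive source_dir with every entry owned by 0:0."""
    with tarfile.open(archive_path, "w") as tar:
        # The filter is applied to each TarInfo before it is written.
        tar.add(source_dir, arcname=".", filter=zero_owner)
    return archive_path
```

Because the filter receives a TarInfo object, the numeric IDs are assigned directly and no name lookup against the local user database is involved.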

While supporting numeric uids and gids in the high level API would be mildly helpful, it isn't necessarily the most useful way to address the limitation. The only value anyone is ever likely to pass numerically is 0 (which can be worked around on many systems by passing "root" as a symbolic name), and their actual goal is to indicate that the archive is intended to be a reproducible build artifact, so they actively don't want to include environmental details that are specific to that particular invocation.
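The "root" workaround mentioned above might look like the following sketch. The wrapper name make_root_owned_archive is hypothetical. It assumes the build host has a user and a group that are both named "root"; where a name cannot be resolved, shutil leaves that field as read from the filesystem.

```python
import shutil


def make_root_owned_archive(base_name, root_dir):
    """Hypothetical wrapper: pass owner and group by name rather than by number."""
    # shutil resolves these names via the pwd and grp modules when available.
    return shutil.make_archive(
        base_name,
        "gztar",
        root_dir=root_dir,
        owner="root",
        group="root",
    )
```

This only addresses ownership. Timestamps and other details of the build environment are still recorded in the archive.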

As things are now, it isn't a massive burden to copy-and-paste the _make_tarball code from shutil.py and adapt it for build artifact creation purposes. However, I think there genuinely are two very different use cases for archive creation: backups, where you want to reproduce the original environment as faithfully as possible, and build artifacts, which you want to make as portable and build system independent as possible. There's potentially merit in offering a separate high level API for the case that isn't as well served by the existing one.
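To make the build artifact use case concrete, here is a rough sketch of what an adapted helper could look like. The name make_reproducible_tar_gz and its signature are assumptions for illustration, not an existing or proposed shutil API. It covers only entry order, ownership, and timestamps.

```python
import gzip
import os
import tarfile


def make_reproducible_tar_gz(base_name, root_dir, mtime=0):
    """Hypothetical helper: write base_name + '.tar.gz' from root_dir
    with sorted entries, owner/group 0, and a fixed timestamp."""
    archive_path = base_name + ".tar.gz"

    def normalize(info):
        # Drop details that depend on the machine or time of the build.
        info.uid = info.gid = 0
        info.uname = info.gname = ""
        info.mtime = mtime
        return info

    # Collect (archive name, filesystem path) pairs and sort by archive name
    # so the entry order does not depend on directory listing order.
    entries = []
    for dirpath, dirnames, filenames in os.walk(root_dir):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            arcname = os.path.relpath(path, root_dir).replace(os.sep, "/")
            entries.append((arcname, path))
    entries.sort()

    # filename="" and mtime=0 keep the gzip header free of the output
    # file name and of the current time.
    with open(archive_path, "wb") as raw, \
            gzip.GzipFile(filename="", mode="wb", fileobj=raw, mtime=0) as gz, \
            tarfile.open(fileobj=gz, mode="w", format=tarfile.PAX_FORMAT) as tar:
        for arcname, path in entries:
            # recursive=False because every entry is listed explicitly.
            tar.add(path, arcname=arcname, recursive=False, filter=normalize)
    return archive_path
```

Running the helper twice over the same tree should give byte-identical output on the same Python and zlib versions, even if file modification times changed between runs. File permissions are passed through unchanged in this sketch.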

Has this already been discussed elsewhere?

No response given

Links to previous discussion of this feature:

No response

Labels: stdlib (Standard Library Python modules in the Lib/ directory), type-feature (A feature request or enhancement)