Niels Möller nisse@lysator.liu.se writes:
- The tarball embeds some username info. Some
portability/reproducibility TAR_OPTIONS for inspiration:
export TAR_OPTIONS = --owner=0 --group=0 --numeric-owner --sort=name --mode=go+u,go-w --format=ustar
Trimming the tar metadata a little seems like an easy improvement. When I looked at this in a different project, I ended up with the invocation here: https://git.glasklar.is/system-transparency/core/system-transparency/-/blob/...
    tar --exclude .git --exclude .gitmodules --sort=name --format=posix \
        --pax-option=exthdr.name=%d/PaxHeaders/%f \
        --pax-option=delete=atime,delete=ctime \
        --clamp-mtime --mtime="./$(basename "${DIST_DIR}")/${LATEST_COMPONENT}" \
        --numeric-owner --owner=0 --group=0 \
        --mode=go+u,go-w \
        -cf - "$(basename "${DIST_DIR}")" ) | gzip --no-name --best > "${DIST_DIR}.tar.gz"
based mostly on the section on reproducibility in the GNU tar manual. Is there some reason to prefer format ustar over posix?
I tried --format=posix for some time (also inspired by the tar manual), but realized it did not provide anything important that ustar did not already provide. I also found the ./PaxHeaders/ virtual sub-directory in --format=posix archives annoying:
https://lists.gnu.org/archive/html/bug-gnulib/2025-01/msg00233.html
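To illustrate the ustar choice, here is a minimal sketch (not taken from any project's build scripts) that packs a throwaway tree with the flags from the TAR_OPTIONS line above and lists the result. The demo-1.0 tree and the workdir variable are made up for the example, and GNU tar is assumed, since --sort=name is a GNU extension:

```shell
#!/bin/sh
# Sketch: pack a throwaway tree as ustar with normalized ownership and modes.
set -eu

workdir=$(mktemp -d)                     # hypothetical scratch location
mkdir -p "$workdir/demo-1.0/src"
echo 'hello' > "$workdir/demo-1.0/README"
echo 'int x;' > "$workdir/demo-1.0/src/x.c"

# Same flags as the TAR_OPTIONS line quoted earlier, passed explicitly.
tar -C "$workdir" \
    --owner=0 --group=0 --numeric-owner \
    --sort=name --mode=go+u,go-w --format=ustar \
    -cf "$workdir/demo-1.0.tar" demo-1.0

# List members with numeric ids; every entry should show 0/0 as owner/group.
listing=$(tar --numeric-owner -tvf "$workdir/demo-1.0.tar")
printf '%s\n' "$listing"
```

The listing should contain only the real files and directories, with no PaxHeaders entries. Timestamps are deliberately left alone here, so two runs at different times can still differ.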
Using gzip with --rsyncable can dramatically improve some situations (e.g., transfer or storage of multiple nettle-*.tar.gz tarballs with block-based deduping), but it makes the compressed file 2-3% larger.
If I remember correctly, fiddling with file timestamps was also rather important to get a reproducible tar file. But it may be a good first step to fix the non-timestamp metadata?
It is indeed trickier than everyone would want this to be... small incremental improvements like avoiding uid/gid help though.
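For the timestamp part, a rough sketch of one possible approach, assuming GNU tar: force every member's mtime to a fixed value and tell gzip not to store a name or timestamp. The SOURCE_DATE_EPOCH default, the pack helper, and the demo-1.0 tree are assumptions for illustration, not something from nettle's build system:

```shell
#!/bin/sh
# Sketch: build the same tarball twice and check that the bytes match.
set -eu

# Assumed fixed timestamp; a real release would derive it from, e.g., the last commit.
SOURCE_DATE_EPOCH=${SOURCE_DATE_EPOCH:-1700000000}

workdir=$(mktemp -d)                     # hypothetical scratch location
mkdir -p "$workdir/demo-1.0"
echo 'hello' > "$workdir/demo-1.0/README"

pack() {
    # $1: output file. All member mtimes are forced to the fixed timestamp,
    # and gzip --no-name omits the original file name and timestamp.
    tar -C "$workdir" \
        --owner=0 --group=0 --numeric-owner \
        --sort=name --mode=go+u,go-w --format=ustar \
        --mtime="@${SOURCE_DATE_EPOCH}" \
        -cf - demo-1.0 | gzip --no-name --best > "$1"
}

pack "$workdir/a.tar.gz"
# Simulate a later rebuild by changing the file's on-disk mtime.
touch -t 200102030405.06 "$workdir/demo-1.0/README"
pack "$workdir/b.tar.gz"

if cmp -s "$workdir/a.tar.gz" "$workdir/b.tar.gz"; then
    result=identical
else
    result=different
fi
echo "$result"
```

Dropping the --mtime line from pack should make the two archives differ, which is one way to see how much of the remaining variation comes from timestamps alone.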
/Simon