
GNU Guix: Adventures on the quest for long-term reproducible deployment

Rebuilding software five years later, how hard can it be? It can’t be
that hard, especially when you pride yourself on having a tool that
can travel in time and that does a good job at ensuring reproducible
builds, right?

In hindsight, we can tell you: it’s more challenging than it
seems. Users attempting to travel 5 years back with guix time-machine
are (or were) unavoidably going to hit bumps on the road—a real
problem because that’s one of the use cases Guix aims to support well,
in particular in a reproducible research context.

In this post, we look at some of the challenges we face while traveling
back, how we are overcoming them, and the issues that remain open.

The vision

First of all, one clarification: Guix aims to support time travel, but
we’re talking of a time scale measured in years, not decades. We
know all too well that this is already very ambitious—it’s something
that probably nobody except Nix and Guix is even
trying. More importantly, software deployment at the scale of decades
calls for very different, more radical techniques; it’s the work of
archivists.

Concretely, Guix 1.0.0 was released in 2019, and our goal is to allow
users to travel as far back as 1.0.0 and redeploy software from there,
as in this example:

$ guix time-machine -q --commit=v1.0.0 -- \
     environment --ad-hoc python2 -- python
guile: warning: failed to install locale
Python 2.7.15 (default, Jan  1 1970, 00:00:01) 
[GCC 5.5.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>>

(The command above uses guix environment, the predecessor of guix shell,
which didn’t exist back then.)
It’s only 5 years ago but it’s pretty much remote history on the scale
of software evolution—in this case, that history comprises major
changes in Guix itself and in Guile.
How well does such a command work? Well, it depends.

The project has two build farms; bordeaux.guix.gnu.org has been
keeping substitutes (pre-built binaries) of everything it built since
roughly 2021, while ci.guix.gnu.org keeps substitutes for roughly two
years, though there is currently no guarantee on the duration for
which substitutes are retained.
Time traveling to a period where substitutes are available is
fine: you end up downloading lots of binaries, but that’s OK, you rather
quickly have your software environment at hand.
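Substitute coverage for a given revision can be checked with guix weather. An illustrative invocation (it assumes guix is installed and the build farm is reachable):

```shell
# Report how many of a package's store items have pre-built binaries
# available from a given build farm.
guix weather emacs-minimal --substitute-urls=https://ci.guix.gnu.org
```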

Bumps on the build road

Things get more complicated when targeting a period in time for which
substitutes are no longer available, as was the case for v1.0.0 above.
(And really, we should assume that substitutes won’t remain available
forever: fellow NixOS hackers recently had to seriously consider
trimming their 20-year-long history of substitutes because the costs
are not sustainable.)

Apart from the long build times, the first problem that arises in the
absence of substitutes is source code unavailability. I’ll spare you
the details for this post—that problem alone would deserve a book.
Suffice to say that we’re lucky that we started working on integrating
Guix with Software Heritage years ago, and that there has been great
progress over the last couple of years to get closer to full package
source code archival (more precisely: 94% of the source code of
packages available in Guix in January 2024 is archived, versus 72% of
the packages available in May 2019).

So what happens when you run the time-machine command above? It
brings you to May 2019, a time for which none of the official build
farms had substitutes until a few days ago. Ideally, thanks to
isolated build environments, you’d build things for hours or days, and
in the end all those binaries would be here just as they were 5 years
ago. In practice though, there are several problems that isolation as
currently implemented does not address.

Screenshot of movie “Safety Last!” with Harold Lloyd hanging from a clock on a building’s façade.

Among those, the most frequent problem is time traps: software build
processes that fail after a certain date (these are also referred to as
“time bombs” but we’ve had enough of these and would rather call for a
ceasefire). This plagues a handful of packages out of almost 30,000 but
unfortunately we’re talking about packages deep in the dependency graph.
Here are some examples:

  • OpenSSL unit tests fail after a certain date because some of the
    X.509 certificates they use have expired.
  • GnuTLS had similar issues; newer versions rely on datefudge to
    fake the date while running the tests and thus avoid that problem
    altogether.
  • Python 2.7, found in Guix 1.0.0, also had that problem with its
    TLS-related tests.
  • OpenJDK would fail to build at some point with this interesting
    message: Error: time is more than 10 years from present:
    1388527200000 (the build system would consider that its data about
    currencies is likely outdated after 10 years).
  • Libgit2, a dependency of Guix, had (has?) time-dependent tests.
  • MariaDB tests started failing in 2019.

Someone traveling to v1.0.0 will hit several of these, preventing
guix time-machine from completing. A serious bummer, especially to
those who’ve come to Guix from the perspective of making their
research workflow reproducible.

Time traps are the main road block, but there’s more! In rare cases,
there’s software influenced by kernel details not controlled by the
build daemon. And in a handful of cases, but important ones, builds
might fail when performed on certain CPUs; we’re aware of at least two
such cases.

Neither time traps nor those obscure hardware-related issues can be
avoided with the isolation mechanism currently used by the build daemon.
This harms time traveling when substitutes are unavailable. Giving up
is not in the ethos of this project though.

Where to go from here?

There are really two open questions here:

  1. How can we tell which packages need to be “fixed”, and how:
    building at a specific date, on a specific CPU?
  2. How can we keep those aspects of the build environment (time, CPU
    variant) under control?

Let’s start with #2. Before looking for a solution, it’s worth
remembering where we come from. The build daemon runs build processes
with a separate root file system, under dedicated user IDs, and in
separate Linux namespaces, thereby minimizing interference with the
rest of the system and ensuring a well-defined build environment.
This technique was implemented by Eelco Dolstra for Nix in 2007 (with
namespace support added in 2012), at a time when the word “container”
had to do with boats and before “Docker” became the name of a software
tool. In short, the approach consists in controlling the build
environment in every detail (it’s at odds with the strategy that
consists in achieving reproducible builds in spite of high build
environment variability).
That these are mere processes with a bunch of bind mounts makes this
approach inexpensive and appealing.
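For illustration only, here is roughly what that style of isolation looks like when expressed with util-linux tools; the path is a placeholder and the real daemon sets all of this up via system calls rather than these commands:

```shell
# Fresh mount, UTS, IPC, network and PID namespaces around a chroot:
# the build sees its own file system root and no network (requires root).
unshare --mount --uts --ipc --net --pid --fork \
  chroot /path/to/build-root /bin/sh -c 'exec ./configure && make'
```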

Realizing we’d also want to control the build environment’s date,
we naturally turn to Linux namespaces to address that—Dolstra, Löh, and
Pierron already suggested something along these lines in the conclusion
of their 2010 Journal of Functional Programming paper. Turns out
there is now a time namespace.
Unfortunately it’s limited to the CLOCK_MONOTONIC and CLOCK_BOOTTIME
clocks; the manual page states:

Note that time namespaces do not virtualize the CLOCK_REALTIME
clock. Virtualization of this clock was avoided for reasons of
complexity and overhead within the kernel.
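For what time namespaces do cover, the interface is reachable from the shell (assuming util-linux ≥ 2.38, a kernel built with CONFIG_TIME_NS, and root privileges); this sketch shifts the boot clock by roughly ten years while the wall clock stays untouched, which is precisely the limitation at hand:

```shell
# Offset CLOCK_BOOTTIME by ~10 years inside a new time namespace;
# CLOCK_REALTIME (what `date` shows) is unaffected.
sudo unshare --time --boottime $((10 * 365 * 24 * 3600)) \
  sh -c 'cat /proc/self/timens_offsets; date'
```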

I hear you say: What about datefudge and libfaketime?

These rely on the LD_PRELOAD environment variable to trick the dynamic
linker into pre-loading a library that provides symbols such as
gettimeofday and clock_gettime. This is a fine approach in some
cases, but it’s too fragile and too intrusive when targeting arbitrary
build processes.

That leaves us with essentially one viable option: virtual machines
(VMs). The full-system QEMU lets you specify the initial real-time
clock of the VM with the -rtc flag, which is exactly what we need
(“user-land” QEMU such as qemu-x86_64 does not support it). And of
course, it lets you specify the CPU model to emulate.
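Concretely, the relevant knobs look like this; the disk image name and the memory/CPU figures are made up for illustration:

```shell
# Full-system QEMU with a pinned initial wall clock and an older CPU model.
# "clock=vm" keeps guest time independent of the host clock.
qemu-system-x86_64 \
  -cpu Skylake-Client \
  -smp 4 -m 2048 \
  -rtc base=2020-01-01T00:00:00,clock=vm \
  -drive file=build-vm.qcow2
```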

News from the past

Now, the question is: where does the VM fit? The author considered
writing a package transformation that would change a package such that
it’s built in a well-defined VM. However, that wouldn’t really help:
this option didn’t exist in past revisions, and it would lead to a
different build anyway from the perspective of the daemon—a different
derivation.

The best strategy appeared to be offloading: the build daemon can
offload builds to different machines over SSH; we just need to let it
send builds to a suitably-configured VM. To do that, we can reuse some
of the machinery initially developed for childhurds that takes care of
setting up offloading to the VM: creating substitute signing keys and
SSH keys, exchanging secret key material between the host and the
guest, and so on.

The end result is a service for Guix System users that can be
configured in a few lines:

(use-modules (gnu services virtualization))

(operating-system
  ;; …
  (services (append (list (service virtual-build-machine-service-type))
                    %base-services)))

The default setting above provides a 4-core VM whose initial date is
January 2020, emulating a Skylake CPU from that time—the right setup for
someone willing to reproduce old binaries. You can check the
configuration like this:

$ sudo herd configuration build-vm
CPU: Skylake-Client
number of CPU cores: 4
memory size: 2048 MiB
initial date: Wed Jan 01 00:00:00Z 2020

To enable offloading to that VM, one has to explicitly start it, like
so:

$ sudo herd start build-vm

From there on, every native build is offloaded to the VM. The key part
is that with almost no configuration, you get everything set up to build
packages “in the past”. It’s a Guix System-only solution; if you run
Guix on another distro, you can set up a similar build VM, but you’ll
have to go through by hand the cumbersome process that is taken care of
automatically here.

Of course it’s possible to choose different configuration parameters:

(service virtual-build-machine-service-type
         (virtual-build-machine
          (date (make-date 0 0 00 00 01 10 2017 0)) ;further back in time
          (cpu "Westmere")
          (cpu-count 16)
          (memory-size (* 8 1024))
          (auto-start? #t)))

With a build VM with its date set to January 2020, we have been able to
rebuild Guix and its dependencies along with a bunch of packages such as
emacs-minimal from v1.0.0, overcoming all the time traps and other
challenges described earlier. As a side effect, substitutes
are now available from ci.guix.gnu.org so you can even try this at
home without having to rebuild the world:

$ guix time-machine -q --commit=v1.0.0 -- build emacs-minimal --dry-run
guile: warning: failed to install locale
substitute: updating substitutes from 'https://ci.guix.gnu.org'... 100.0%
38.5 MB would be downloaded:
   /gnu/store/53dnj0gmy5qxa4cbqpzq0fl2gcg55jpk-emacs-minimal-26.2

For the fun of it, we went as far as v0.16.0, released in December
2018:

guix time-machine -q --commit=v0.16.0 -- \
  environment --ad-hoc vim -- vim --version

This is the furthest we can go, since channels and the underlying
mechanisms that make time travel possible did not exist before that
date.

There’s one “interesting” case we stumbled upon in that process: in
OpenSSL 1.1.1g (released April 2020 and packaged in December 2020),
some of the test certificates are not valid before April 2020, so the
build VM needs to have its clock set to May 2020 or thereabouts.
Booting the build VM with a different date can be done without
reconfiguring the system:

$ sudo herd stop build-vm
$ sudo herd start build-vm -- -rtc base=2020-05-01T00:00:00

The -rtc … flags are passed straight to QEMU, which is handy when
exploring workarounds…

The time-travel continuous integration jobset has been set up to
check that we can, at any time, travel back to one of the past releases.
This at least ensures that Guix itself and its dependencies have
substitutes available at ci.guix.gnu.org.

Reproducible research workflows reproduced

Incidentally, this effort rebuilding 5-year-old packages has allowed us
to fix embarrassing problems: software that accompanies research papers
that followed our reproducibility guidelines could no longer be
deployed, at least not without this clock-twiddling effort.

It’s good news that we can now re-deploy these 5-year-old software
environments with minimum hassle; it’s bad news that holding this
promise took extra effort.

The ability to reproduce the environment of software that accompanies
research work should not be considered a mundanity or an exercise that’s
“overkill”.
The ability to rerun, inspect, and modify software is the natural
extension of the scientific method. Without a companion reproducible
software environment, research papers are merely the advertisement of
scholarship, to paraphrase Jon Claerbout.

The future

The astute reader surely noticed that we didn’t answer question #1
above:

How can we tell which packages need to be “fixed”, and how: building
at a specific date, on a specific CPU?

It’s a fact that Guix so far lacks information about the date, kernel,
or CPU model that should be used to build a given package.
Derivations purposefully lack that information on the grounds that it
cannot be enforced in user land and is rarely necessary—which is true,
but “rarely” is not the same as “never”, as we saw. Should we create a
catalog of date, CPU, and/or kernel annotations for packages found in
past revisions? Should we define, for the long term, an
all-encompassing derivation format? If we did and effectively required
virtual build machines, what would that mean from a bootstrapping
standpoint?

Here’s another option: build packages in VMs running in the year 2100,
say, and on a baseline CPU. We don’t need to require all users to set
up a virtual build machine—that would be impractical. It may be enough
to set up the project build farms so they build everything that way.
This would allow us to catch time traps and year 2038 bugs before they
bite.

Before we can do that, the virtual-build-machine service needs to be
optimized. Right now, offloading to build VMs is as heavyweight as
offloading to a separate physical build machine: data is transferred
back and forth over SSH over TCP/IP. The first step will be to run SSH
over a paravirtualized transport instead, such as AF_VSOCK sockets.
Another avenue would be to make /gnu/store in the guest VM an overlay
over the host store, so that inputs do not need to be transferred and
copied.
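As a sketch of what the vsock route could look like (the guest context ID, user name, and host name are illustrative; this is not the implemented service):

```shell
# Guest side: QEMU exposes a vsock device with context ID 3
# (the "..." stands for the VM's other flags).
qemu-system-x86_64 ... -device vhost-vsock-pci,guest-cid=3

# Host side: tunnel SSH over AF_VSOCK instead of TCP/IP, e.g. with
# socat (VSOCK-CONNECT requires socat >= 1.7.4).
ssh -o ProxyCommand='socat - VSOCK-CONNECT:3:22' builder@build-vm
```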

Until then, happy software (re)deployment!

Acknowledgments

Thanks to Simon Tournier for insightful comments on a previous version
of this post.


Source: Planet GNU
