March 4, 2024 | by Arround The Web

GNU Guix: Identifying software

What does it take to “identify software”? How can we tell what software
is running on a machine to determine, for example, what security
vulnerabilities might affect it?

In October 2023, the US Cybersecurity and Infrastructure Security Agency
(CISA) published a white paper entitled Software Identification Ecosystem
Option Analysis that looks at existing options to address these questions.
The publication was followed by a request for comments; our comment as
Guix developers did not arrive in time to be published, but we’d like
to share it here.

Software identification for cybersecurity purposes is a crucial topic,
as the white paper explains in its introduction:

Effective vulnerability management requires software to be trackable
in a way that allows correlation with other information such as known
vulnerabilities […]. This correlation is only possible when different
cybersecurity professionals know they are talking about the same
software.

The Common Platform Enumeration (CPE) standard has been designed to fill
that role; it is used to identify software as part of the well-known
Common Vulnerabilities and Exposures (CVE) process. But CPE is showing
its limits as an extrinsic identification mechanism: the human-readable
identifiers chosen by CPE fail to capture the complexity of what
“software” is.

We think functional software deployment as implemented by Nix and Guix,
coupled with the source code identification work carried out by Software
Heritage, provides a unique perspective on these matters.

On Software Identification

The Software Identification Ecosystem Option Analysis white paper
released by CISA in October 2023 studies options towards the definition
of a software identification ecosystem that can be used across the
complete, global software space for all key cybersecurity use cases.

Our experience lies in the design and development of
GNU Guix, a package manager, software deployment
tool, and GNU/Linux distribution, which emphasizes three key elements:
reproducibility, provenance tracking, and auditability. We explain
in the following sections our approach and how it relates to the goal
stated in the aforementioned white paper.

Guix produces binary artifacts of varying complexity from source code:
package binaries, application bundles (container images to be consumed
by Docker and related tools), system installations, system bundles
(container and virtual machine images).

All these artifacts qualify as “software” and so does source code. Some
of this “software” comes from well-identified upstream packages,
sometimes with modifications added downstream by packagers (patches);
binary artifacts themselves are the byproduct of a build process where
the package manager uses other binary artifacts it previously built
(compilers, libraries, etc.) along with more source code (the package
definition) to build them. How can one identify “software” in that
sense?

Software is dual: it exists in source form and in binary,
machine-executable form. The latter is the outcome of a complex
computational process taking source code and intermediary binaries as
input.

Our thesis can be summarized as follows:

We consider that the requirements for source code identifiers differ
from the requirements to identify binary artifacts.

Our view, embodied in GNU Guix, is that:

Source code can be identified in an unambiguous and
distributed fashion through inherent identifiers such as
cryptographic hashes.

Binary artifacts, instead, need to be the byproduct of a
comprehensive and verifiable build process itself available as
source code.

In the next sections, to clarify the context of this statement, we show
how Guix identifies source code, how it defines the source-to-binary
path and ensures its verifiability, and how it provides provenance
tracking.

Source Code Identification

Guix includes package definitions for almost 30,000 packages. Each
package definition identifies its origin: its “main” source code as well
as patches. The origin is content-addressed: it includes a SHA256
cryptographic hash of the code (an inherent identifier), along with a
primary URL to download it.

Since source is content-addressed, the URL can be thought of as a hint.
Indeed, we connected Guix to the Software
Heritage source code archive: when
source code vanishes from its original URL, Guix falls back to
downloading it from the archive. This is made possible thanks to the use
of inherent (or intrinsic) identifiers both by Guix and Software
Heritage.
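
The fallback logic described above can be sketched in a few lines of Python. This is only an illustration of content-addressed fetching, not Guix's actual downloader: the function name fetch_verified and the three stand-in "locations" are hypothetical, and a hex-encoded SHA-256 digest is assumed for simplicity.

```python
import hashlib
from typing import Callable, Iterable, Optional


def fetch_verified(expected_sha256: str,
                   fetchers: Iterable[Callable[[], Optional[bytes]]]) -> bytes:
    """Return the first payload whose SHA-256 matches expected_sha256.

    Each fetcher stands in for one download location; it returns None
    when the content is no longer available there.
    """
    for fetch in fetchers:
        data = fetch()
        if data is None:
            continue  # this location no longer serves the source
        if hashlib.sha256(data).hexdigest() == expected_sha256:
            return data  # content matches: where it came from is irrelevant
    raise LookupError("no location returned content matching the hash")


# Hypothetical stand-ins for network downloads.
source = b"int main(void) { return 0; }\n"
expected = hashlib.sha256(source).hexdigest()


def upstream_url() -> Optional[bytes]:
    return None  # the original URL has vanished


def tampered_mirror() -> Optional[bytes]:
    return b"int main(void) { return 1; }\n"  # wrong content, rejected


def archive() -> Optional[bytes]:
    return source  # the archive still holds the expected bytes


result = fetch_verified(expected, [upstream_url, tampered_mirror, archive])
```

Because the hash alone decides whether a download is accepted, adding or removing locations never changes which source code is identified.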

More information can be found in this 2019 blog post and in the
documents of the Software Hash Identifiers (SWHID) working group.
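
To give a feel for what an intrinsic identifier looks like, here is a minimal sketch that computes a SWHID for a single file's content, assuming the content identifier is the Git-compatible SHA-1 of the blob (a "blob <length>" header, a NUL byte, then the data). The function name is ours; note that Guix origins themselves carry a SHA256 hash rather than a SWHID.

```python
import hashlib


def swhid_for_content(data: bytes) -> str:
    """Compute a content SWHID, assuming the Git blob hashing convention."""
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    digest = hashlib.sha1(header + data).hexdigest()
    return "swh:1:cnt:" + digest


empty_file_id = swhid_for_content(b"")
```

The identifier depends only on the bytes of the file, so anyone holding the same content derives the same identifier without consulting a registry.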

Reproducible Builds

Guix provides a verifiable path from source code to binaries by
ensuring reproducible builds. To
achieve that, Guix builds upon the pioneering research work of Eelco
Dolstra that led to the design of the Nix package
manager, with which it shares the same conceptual
foundation.

Namely, Guix relies on hermetic builds: builds are performed in
isolated environments that contain nothing but explicitly-declared
dependencies—where a “dependency” can be the output of another build
process or source code, including build scripts and patches.

An implication is that builds can be verified independently. For
instance, for a given version of Guix, guix build gcc
should produce the exact same binary, bit-for-bit. To facilitate
independent verification, guix challenge gcc compares the
binary artifacts of the GNU Compiler Collection (GCC) as built and
published by different parties. Users can also compare to a local build
with guix build gcc --check.
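
Conceptually, this kind of independent verification amounts to comparing hashes of what different parties built. The sketch below is not how guix challenge is implemented; the builder names and artifact bytes are made up, and it only shows the comparison step.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List


def compare_builds(artifacts: Dict[str, bytes]) -> Dict[str, List[str]]:
    """Group builders by the SHA-256 of the artifact each one published."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for builder, payload in artifacts.items():
        groups[hashlib.sha256(payload).hexdigest()].append(builder)
    return dict(groups)


def is_reproducible(artifacts: Dict[str, bytes]) -> bool:
    """A build is bit-for-bit reproducible when every party agrees."""
    return len(compare_builds(artifacts)) == 1


# Hypothetical artifacts published by two build farms and built locally.
matching = {
    "build-farm-a": b"\x7fELF-gcc",
    "build-farm-b": b"\x7fELF-gcc",
    "local": b"\x7fELF-gcc",
}
mismatch = dict(matching, **{"build-farm-b": b"\x7fELF-gcc-modified"})

all_agree = is_reproducible(matching)
some_disagree = not is_reproducible(mismatch)
```

A disagreement does not say which party is wrong, only that the build is either non-deterministic or that one of the artifacts was not produced from the declared inputs.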

As with Nix, build processes are identified by derivations, which are
low-level, content-addressed build instructions; derivations may refer
to other derivations and to source code. For instance,
/gnu/store/c9fqrmabz5nrm2arqqg4ha8jzmv0kc2f-gcc-11.3.0.drv
uniquely identifies the derivation to build a specific variant of
version 11.3.0 of the GNU Compiler Collection (GCC). Changing the
package definition (patches being applied, build flags, set of
dependencies), or similarly changing one of the packages it depends
on, leads to a different derivation. More information can be found in
Eelco Dolstra's PhD thesis.
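
The way a change propagates through derivations can be illustrated with a toy model. The sketch below is not the hashing scheme used by Guix or Nix: the Derivation class, its fields, the JSON serialization, and the glibc version and hash values are all assumptions chosen for illustration.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Derivation:
    """Toy build instruction, identified by a hash of everything it uses."""
    name: str
    version: str
    source_sha256: str
    patches: Tuple[str, ...] = ()   # hashes of the patches applied
    flags: Tuple[str, ...] = ()     # build flags
    inputs: Tuple["Derivation", ...] = ()

    def digest(self) -> str:
        record = {
            "name": self.name,
            "version": self.version,
            "source": self.source_sha256,
            "patches": list(self.patches),
            "flags": list(self.flags),
            # Referring to inputs by digest makes the identifier recursive.
            "inputs": sorted(d.digest() for d in self.inputs),
        }
        canonical = json.dumps(record, sort_keys=True).encode("utf-8")
        return hashlib.sha256(canonical).hexdigest()

    def file_name(self) -> str:
        return f"{self.digest()[:32]}-{self.name}-{self.version}.drv"


# Two variants of a dependency that differ only by one patch (made-up values).
libc_a = Derivation("glibc", "2.35", "aa" * 32)
libc_b = Derivation("glibc", "2.35", "aa" * 32, patches=("cc" * 32,))

# The same compiler definition built against each variant.
gcc_a = Derivation("gcc", "11.3.0", "bb" * 32, inputs=(libc_a,))
gcc_b = Derivation("gcc", "11.3.0", "bb" * 32, inputs=(libc_b,))
```

Here gcc_a and gcc_b share the name and version gcc 11.3.0, yet their digests differ because one of their inputs was patched; a name/version pair alone cannot tell them apart.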

Derivations form a graph that captures the entirety of the build
processes leading to a binary artifact. In contrast, mere package
name/version pairs such as gcc 11.3.0 fail to capture the breadth and
depth of the elements that lead to a binary artifact. This is a
shortcoming of systems such as the Common Platform Enumeration (CPE)
standard: it fails to express whether a vulnerability that applies to
gcc 11.3.0 applies to it regardless of how it was built,
patched, and configured, or whether certain conditions are required.

Full-Source Bootstrap

Reproducible builds alone cannot ensure the source-to-binary
correspondence: the compiler could contain a backdoor, as demonstrated
by Ken Thompson in Reflections on Trusting Trust. To address that,
Guix goes further by implementing so-called full-source bootstrap:
for the first time, literally every package in the distribution is built
from source code, starting from a very small binary seed.
This gives an unprecedented level of transparency, allowing code to be
audited at all levels, and improving robustness against the
“trusting-trust attack” described by Ken Thompson.

The European Union recognized the importance of this work through an
NLnet Privacy & Trust Enhancing Technologies (NGI0 PET)
grant allocated in
2021 to Jan Nieuwenhuizen to further work on full-source bootstrap in
GNU Guix, GNU Mes, and related projects, followed by another
grant in 2022 to expand
support to the Arm and RISC-V CPU architectures.

Provenance Tracking

We define provenance tracking as the ability to map a binary artifact
back to its complete corresponding source. Provenance tracking is
necessary to allow the recipient of a binary artifact to access the
corresponding source code and to verify the source/binary correspondence
if they wish to do so.

The guix pack command can be used to build, for instance, container
images. Running
guix pack -f docker python --save-provenance produces a
self-describing Docker image containing the binaries of Python and its
run-time dependencies. The image is self-describing because the
--save-provenance flag leads to the inclusion of a
manifest that describes which revision of Guix was used to produce
this binary. A third party can retrieve this revision of Guix and from
there view the entire build dependency graph of Python, view its source
code and any patches that were applied, and do the same recursively for
its dependencies.

To summarize, capturing the revision of Guix that was used is all it
takes to reproduce a specific binary artifact. This is illustrated by
the time-machine command. The example below deploys, at any time on any
machine, the specific build artifact of the python package as it was
defined in this Guix commit:

guix time-machine -q --commit=d3c3922a8f5d50855165941e19a204d32469006f \
  -- install python

In other words, because Guix itself defines how artifacts are built,
the revision of the Guix source, coupled with the package name,
unambiguously identifies the package’s binary artifact. As
scientists, we build on this property to achieve reproducible research
workflows, as explained in this 2022 article in Nature Scientific Data;
as engineers, we value this property to analyze the systems we are
running and determine which known vulnerabilities and bugs apply.

Again, a software bill of materials (SBOM) written as a mere list of
package name/version pairs would fail to capture as much information.
The Artifact Dependency Graph (ADG) of
OmniBOR, while less ambiguous, falls short in
two ways: it is too fine-grained for typical cybersecurity applications
(at the level of individual source files), and it only captures the
alleged source/binary correspondence of individual files but not the
process to go from source to binary.

Conclusions

Inherent identifiers lend themselves well to unambiguous source code
identification, as demonstrated by Software Heritage, Guix, and Nix.

However, we believe binary artifacts should instead be treated as the
result of a computational process; it is that process that needs to be
fully captured to support independent verification of the
source/binary correspondence. For cybersecurity purposes, recipients
of a binary artifact must be able to map it back to its source code
(provenance tracking), with the additional guarantee that they must be
able to reproduce the entire build process to verify the source/binary
correspondence (reproducible builds and full-source bootstrap). As
long as binary artifacts result from a reproducible build process,
itself described as source code, identifying binary artifacts boils
down to identifying the source code of their build process.

These ideas are developed in the 2022 scientific paper Building a
Secure Software Supply Chain with GNU Guix.

Source: Planet GNU
