Andy Wingo: a simple hdr histogram

December 10, 2023 | by Arround The Web | No comments

Andy Wingo: a simple hdr histogram

Good evening! This evening, a note on high-dynamic-range (HDR)
histograms.

problem

How should one record garbage collector pause times?

A few options present themselves: you could just record the total pause
time. Or, also record the total number of collections, which allows you
to compute the average. Maximum is easy enough, too. But then you
might also want the median or the p90 or the p99, and these percentile
values are more gnarly: you either need to record all the pause times,
which can itself become a memory leak, or you need to approximate via a
histogram.

Let’s assume that you decide on the histogram approach. How should you
compute the bins? It would be nice to have microsecond accuracy on the
small end, but if you bin by microsecond you could end up having
millions of bins, which is not what you want. On the other end you
might have multi-second GC pauses, and you definitely want to be able to
record those values.

Consider, though, that it’s not so important to have microsecond
precision for a 2-second pause. This points in a direction of wanting
bins that are relatively close to each other, but whose absolute
separation can vary depending on whether we are measuring microseconds
or milliseconds. You want approximately uniform precision over a high
dynamic range.

logarithmic binning

The basic observation is that you should be able to make a histogram
that gives you, say, 3 significant figures on measured values. Such a
histogram would count anything between 1230 and 1240 in the same bin,
and similarly for 12300 and 12400. The gap between bins increases as
the number of digits grows.

Of course computers prefer base-2 numbers over base-10, so let’s do
that. Say we are measuring nanoseconds, and the maximum number of
seconds we expect is 100 or so. There are about 2³⁰
nanoseconds in a second, and 100 is a little less than 2⁷, so
that gives us a range of 37 bits. Let’s say we want a precision of 4
significant base-2 digits, or 4 bits; then we will have one set of
2⁴ bins for 10-bit values, another for 11-bit values, and
so-on, for a total of 37 × 2⁴ bins, or 592 bins. If we use a
32-bit integer count per bin, such a histogram would be 2.5kB or so,
which I think is acceptable.

Say you go to compute the bin for a value. Firstly, note that there are
some values that do not have 4 significant bits: if you record a
measurement of 1 nanosecond, presumably that is just 1 significant
figure. These are like the
denormals in
floating-point numbers. Let’s just say that recording a value val in
[0, 2⁴-1] goes to bin val.

If val is 2⁴ or more, then we compute the major and
minor components. The major component is the number of bits needed to
represent val, minus the 4 precision bits. We can define it like this
in C, assuming that val is a 64-bit value:

#define max_value_bits 37
#define precision 4
uint64_t major = 64ULL - __builtin_clzl(val) - precision;

The 64 - __builtin_clzl(val) gives us the ceiling of the base-2
logarithm of the value. And actually, to take into account the denormal
case, we do this instead:

uint64_t major = val < (1ULL << precision)
  ? 0ULL
  : 64ULL - __builtin_clzl(val) - precision;

Then to get the minor component, we right-shift val by major bits,
unless it is a denormal, in which case the minor component is the value
itself:

uint64_t minor = val < (1 << precision)
  ? val
  : (val >> (major - 1ULL)) & ((1ULL << precision) - 1ULL);

Then the histogram bucket for a given value can be computed directly:

uint64_t idx = (major << precision) | minor;

Of course, we would prefer to bound our storage, hence the consideration
about 37 total bits in 100 seconds of nanoseconds. So let’s do that,
and record any out-of-bounds value in the special last bucket,
indicating that we need to expand our histogram next time:

if (idx >= (max_value_bits << precision))
  idx = max_value_bits << precision;

The histogram itself can then be defined simply as having enough buckets
for all major and minor components in range, plus one for overflow:

struct histogram {
  uint32_t buckets[(max_value_bits << precision) + 1];
};

Getting back the lower bound for a bucket is similarly simple, again
with a case for denormals:

uint64_t major = idx >> precision;
uint64_t minor = idx & ((1ULL << precision) - 1ULL);
uint64_t min_val = major
  ? ((1ULL << precision) | minor) << (major - 1ULL)
  : minor;

y u no library

How many lines of code does something need to be before you will include
it as a library instead of re-implementing? If I am honest with myself,
there is no such limit, as far as code goes at least: only a limit of
time. I am happy to re-implement anything; time is my only enemy. But
strategically speaking, time too is the fulcrum: if I can save time by
re-implementing over integrating a library, I will certainly hack.

The canonical library in this domain is
HdrHistogram. But even
the C port is thousands of lines of code! A histogram should not take
that much code! Hence this blog post today. I think what we have above is sufficient. HdrHistogram’s documentation speaks
in terms of base-10 digits of precision, but that is easily translated
to base-2 digits: if you want 0.1% precision, then in base-2 you’ll need
10 bits; no problem.

I should of course mention that HdrHistogram includes an API that compensates for coordinated
omission, but I think such an API is straigtforward to build on
top of the basics.

My code, for what it is worth, and which may indeed be buggy, is over
here.
But don’t re-use it: write your own. It could be much nicer in C++ or
Rust, or any other language.

Finally, I would note that somehow this feels very basic; surely there
is prior art? I feel like in 2003, Google would have had a better
answer than today; alack. Pointers
appreciated to other references, and if you find them, do tell me more about your search strategy, because mine is inadequate. Until then, gram you later!

Source: Planet GNU