TESID design rationale

Here’s some analysis of the reasoning behind TESID and specific design choices.

TESID is made up of three parts:

  1. Encryption,
  2. String encoding, and
  3. A convention for reducing occupation of the ID space and attaching a discriminant for types or other such details.

(I’ve numbered these items in their encoding order, but I’m going to discuss them in this altered order. Things would get boring if list item numbers always had to be in order. [Yet another of Markdown’s terrible decisions: write list items numbered 2, 3 and 1, and it’ll ignore your clearly‐declared intent and use 1, 2 and 3. Better hope your Markdown engine allows you to write <li value=2> and such.])

See the algorithms page for a basic overview of it all. This file explains why things are the way they are.

2. 5peck: Speck, but with five-based word sizes and just one key

Encryption is done with some new instances of the Speck cipher family, which I have called 5peck (because it amused me). 5peck is pronounced “fpeck”, not to be confused with “ſpeck”. (This also amuses me.) In places where an identifier is required that cannot start with a numeral, I use fpeck.

There are two particular properties of 5peck ciphers:

  1. The word size n is a multiple of five, rather than a multiple of eight as in Speck; and
  2. A single 128-bit key is expanded once and shared by all of the word sizes, rather than each cipher having its own key expansion.

5peck ciphers don’t get names of their own (unlike Speck32/64, Speck128/128, &c.). Wouldn’t want them getting self-important or anything.

In the reference implementation, all 5peck ciphers use a 64-bit machine word, with appropriate rotation and wrapping addition and subtraction via masking, but all that is implementation detail.

For key scheduling: a 128-bit key is expanded with n=64, m=2, T=30, α=9, β=2. Encryption takes the n least significant bits of each round key.
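
To make that concrete, here’s a minimal sketch of that expansion, assuming the standard Speck key schedule with those parameters; the function name and the choice of which half of the 128-bit key becomes which key word are illustrative, not necessarily what the reference implementation does.

```rust
/// Sketch only: expand a 128-bit key with the standard Speck key schedule,
/// n=64, m=2, T=30, α=9, β=2. The split of the key into two u64 key words is
/// an assumed convention.
fn expand_key(key: u128) -> [u64; 30] {
    const T: usize = 30;
    const ALPHA: u32 = 9;
    const BETA: u32 = 2;
    let mut k = [0u64; T];
    k[0] = key as u64;              // first key word
    let mut l = (key >> 64) as u64; // second key word
    for i in 0..T - 1 {
        // l[i+1] = (k[i] + ROR(l[i], α)) ⊕ i;  k[i+1] = ROL(k[i], β) ⊕ l[i+1]
        l = k[i].wrapping_add(l.rotate_right(ALPHA)) ^ (i as u64);
        k[i + 1] = k[i].rotate_left(BETA) ^ l;
    }
    k
}
```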

5peck is defined for word sizes n={10, 15, 20, 25, 30, 35, 40, 45, 50}. In all cases, T=30, α=9, β=2, and m is not applicable.
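
Under the same caveat, here’s a sketch of encrypting one 2n-bit block with those round keys, showing the masking down to n-bit words and the truncation of each round key to its n least significant bits (decryption just inverts the rounds in reverse order):

```rust
/// Sketch only: encrypt a 2n-bit block (two n-bit halves carried in u64s)
/// with the round keys from expand_key above. Names are illustrative.
fn fpeck_encrypt(n: u32, round_keys: &[u64; 30], mut x: u64, mut y: u64) -> (u64, u64) {
    const ALPHA: u32 = 9;
    const BETA: u32 = 2;
    let mask = (1u64 << n) - 1;
    // n-bit rotations emulated on a 64-bit machine word, masked back to n bits.
    let ror = |v: u64, r: u32| ((v >> r) | (v << (n - r))) & mask;
    let rol = |v: u64, r: u32| ((v << r) | (v >> (n - r))) & mask;
    for &k in round_keys {
        // Standard Speck round: x = (ROR(x, α) + y) ⊕ k; y = ROL(y, β) ⊕ x,
        // with the round key truncated to its n least significant bits.
        x = (ror(x, ALPHA).wrapping_add(y) & mask) ^ (k & mask);
        y = rol(y, BETA) ^ x;
    }
    (x, y)
}
```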

The notable omission from the sequence is n=5, which would produce 1024 two-character TESIDs. I omitted it because it’s not that useful, I’d need to support a different value of at least α, and I was concerned about its integrity, given how extremely it deviates from the pattern I see in the Speck paper. (To reach 128-bit security, I think it’d need at least a few more rounds, T=30 yielding only 150 bits in its expanded key.) So instead I start at n=10 and four-character TESIDs.

Justification for the single expanded key and T=30

Probably the most significant deviation from Speck is the 128-bit key used all round: does it or can it lead to 128-bit security, even at n=10? And how many rounds do we need?

These are the numbers of rounds for the defined Speck instances:

  Speck32/64: T=22 (n=16, m=4);
  Speck48/72: T=22 and Speck48/96: T=23 (n=24, m=3 and 4);
  Speck64/96: T=26 and Speck64/128: T=27 (n=32, m=3 and 4);
  Speck96/96: T=28 and Speck96/144: T=29 (n=48, m=2 and 3);
  Speck128/128: T=32, Speck128/192: T=33 and Speck128/256: T=34 (n=64, m=2, 3 and 4).

I’m going to generalise and call this T = w + m, where w is, very approximately, a linear function of n. n=24 is a mild outlier from my predictions: I’d have expected w=20, not w=19. And assuming that w is linearly correlated with n is hubris anyway.

The fact that m is added rather than multiplied shows that security is not particularly strongly proportional to the number of round key bits (T × n in total): Speck128/128 has 16 bits of round key per bit of security (32 × 64 ÷ 128), but Speck128/256 has only 8.5 bits of round key per bit of security, and Speck32/64 has 5.5 bits of round key per bit of security. You clearly can’t take this too far, as there’s an obvious hard lower bound of 1, and imperfections in diffusion will mean that the real lower bound is higher. (Skipping to the end state: for n=10 I end up with 2.34 bits of round key per claimed bit of security (30 × 10 ÷ 128), which is not a huge amount of margin, but still sufficient, I think. n=5 would have been 1.17, which I suspect would be insufficient.)

(I’m treating the number of bits of security as equivalent to the number of bits of key, because that is the intent of the cipher.)

Well then, the number of rounds. Or actually, the key expansion technique, that’s important too.

Originally, I expanded the key for each value of n, by consuming one word of the key until it was all gone (plus however many extra zero bits were needed at the end); this led to m=ceil(128 / n). I also tentatively chose T=30, leading to values of w from 17 (for n=10) to 27 (for n=50). This seemed fine for a while, but I wasn’t enthusiastic about having to store 2160 bytes of expanded keys (as I was using u64 for all n), and n=10’s m=13 was rather a lot.

Then I thought of just expanding it once, with n=64, m=2, and during encryption taking the n bits of each round key. The purpose of key expansion is to diffuse the key bits into the round keys, so this shouldn’t reduce security in any way. And it gets the whole key into the mix sooner, so it should lead to increased security in the round keys for small values of n. This effectively reduces the very high values of m for small values of n, so that my w is higher for them.

This suggests that I could probably reduce the number of rounds for smaller n. A nice simple solution here is T = 20 + n/5, which stays higher than Speck, but by much less. The most important one to consider here is n=10: this takes it down to T=22, meaning 220 bits of round key. This is heading too low for my liking. With this approach to key scheduling, I suspect you may need more rounds to fit this much security in.

(This is why I eliminated n=5: it’s hard to get such high security out of such a block size with this type of cipher; even n=10 may be pushing it.)

And really, performance is sufficiently excellent that it’s not worth fussing about reducing rounds by 25%. My Rust benchmark of encrypting all n=10 values (2²⁰ of them, around a million) shows it taking under half a nanosecond for each one (under 0.5ms for all) on my laptop, and this will hold for all block sizes, since all are using u64 machine words with masking. So I’ll just leave it a flat 30 rounds, which I’m fairly confident is sufficient for 128-bit security at n=10, and significant overkill for n=15 to n=40 or so. I’ve run the numbers in a few different ways and am as confident in the results as I could hope to be as a crypto layman.

Justification for α = 9, β = 2

The Speck designers’ choice of rotation values is explained in https://eprint.iacr.org/2017/560.pdf#page=8: they started with (7, 2), which worked well, but switched to (8, 3) for better x86 performance (SIMD byte shuffles), it still being good enough everywhere except Speck32, where (8, 3) was too weak on a 16-bit word and they didn’t want to increase the number of rounds to compensate, so it kept (7, 2). They acknowledge that there are other values like (9, 2) that are better in a vacuum, but note that the attacks such values handle better aren’t the security-limiting ones, and that their performance is worse on some platforms.

In 5peck, word sizes are multiples of 5, so the SIMD byte-shuffle argument fails, as do the other performance arguments. Intuition suggested to me that α − β = 5 might suffer slightly; some basic cryptanalysis concurred and suggested (9, 2) works well. (I use that for the n=64, m=2 key expansion too.) In related news, it looks to me from some basic CryptoSMT analysis like multiples of five for n fare better than multiples of eight. Not sure if I’m barking up the wrong tree or if that’s a real effect. If you happen to be a cryptographer, I’d quite like to know more, for curiosity’s sake.

Other remarks

I do not know if my key reuse across ciphers weakens anything. In principle, I don’t believe it should, especially given that a particular value will only ever be encrypted with one value of n.

Remember that even if I can figure out how to nudge cryptographers’ tools along, I am not a professional cryptographer and really don’t know what I’m doing. But I do think that the cryptography covered here is sound and robust, to approximately 128 bits of security, which is frankly pretty extreme overkill for ID scrambling. Even if I have gravely miscalculated on n=10, I don’t think there’s any way that it could reduce it below even 80 bits, which would still be significant overkill for the applicable threat model.

I’ve used quite a bit of logic and reasoning here, which I imagine is a bad sign in cryptographic circles, where it’s better to use concrete measurements. I invite analysis by experts whose fancy is tickled.

If I’ve scared you in all of this, I would like to reassure you: I’ve been handling the cryptography carefully just out of principle, and in the real world it doesn’t actually matter, because in the intended use case of this library, literally no currently‐known or hypothesised attack will ever be viable. Your IDs are safe.

3. Base-32 encoding

I have chosen this alphabet:

23456789abcdefghijkmnpqrstuvwxyz

That is, ASCII digits and lowercase letters, minus the following characters: 0, 1, l and o (the visually confusable pairs 0/o and 1/l).

For some purposes it might be nice to remove more characters still.

But with bit-level block sizes (and without messing around with more esoteric format-preserving encryption, which I did try), a power-of-two base yields the best results for performance, code size and compactness of representation, and I didn’t want to go down to 16. So base-32 it is, meaning five bits per character.
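
As an illustration of what that means in code (the names here are mine, not the library’s API), encoding an already-encrypted block is just a matter of peeling off five bits per character:

```rust
/// Sketch only: encode an already-encrypted block to the alphabet above, at a
/// fixed character count (block size ÷ 5).
const ALPHABET: &[u8; 32] = b"23456789abcdefghijkmnpqrstuvwxyz";

fn encode(mut value: u128, chars: usize) -> String {
    let mut out = vec![0u8; chars];
    for slot in out.iter_mut().rev() {
        // Five bits per character, least significant bits at the end of the string.
        *slot = ALPHABET[(value & 0x1f) as usize];
        value >>= 5;
    }
    String::from_utf8(out).unwrap()
}
```

So `encode(block, 4)` for a 20-bit block, `encode(block, 6)` for a 30-bit one, and so on; decoding is the reverse table lookup.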

An interesting consequence of the technical choices is that TESIDs are an even number of characters. This means that you could even provide an optional check digit, if it took your fancy: even-length string means no check digit, odd-length means check digit. If you want a check digit, I suggest the Damm algorithm with an order-32 table. At this time, I offer no assistance in this matter.

When thinking of ID lengths: every five bits is one character, or more practically (because of the block sizes used) every ten bits two characters. (Speck can only support even block sizes, so I couldn’t support odd-length strings without compromising my design a bit, or finding another cipher as good that could support odd block sizes, which I failed to do, though I did toy with Hasty Pudding for a while.)
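
A length calculation consistent with those tiers looks something like this (a sketch only; the function name is illustrative):

```rust
/// Sketch only: how many characters the TESID for a given (post-sparsity)
/// value will have, following the tiers described above: 20-bit block ⇒ 4
/// characters, 30-bit ⇒ 6, … 100-bit ⇒ 20.
fn tesid_length(value: u128) -> usize {
    let mut n: u32 = 10; // 5peck word size; the block is 2n bits, i.e. 2n/5 characters
    while value >= 1u128 << (2 * n) {
        n += 5;
        assert!(n <= 50, "value too large: needs more than 100 bits");
    }
    (2 * n / 5) as usize
}
```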

One final obvious question: why base-32, why not a different base?

If designing with a different alphabet, different block sizes would make sense. If encoding to base b and padding to length p, the largest block size that can be used is ceil(log₂(bᵖ − bᵖ⁻¹)) (this formula includes rezeroing the block range, which is useful for some values of b and p).
For string lengths p = {1, 2, 3, …}, these are the block sizes up to 64 bits (though remember that TESID supports up to 100 bits, and 128 would be nice where it’s very convenient, which of these it is only for base-16):
base-16 yields block sizes 4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44, 48, 52, 56, 60, 64;
base-20 yields block sizes 5, 9, 13, 18, 22, 26, 31, 35, 39, 44, 48, 52, 57, 61, 65;
base-25 yields block sizes 5, 10, 14, 19, 24, 28, 33, 38, 42, 47, 52, 56, 61, 65;
base-32 yields block sizes 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65;
base-58 yields block sizes 6, 12, 18, 24, 30, 36, 41, 47, 53, 59, 65;
base-64 yields block sizes 6, 12, 18, 24, 30, 36, 42, 48, 54, 60, 66.
But with Speck, block sizes must be multiples of two, so any odd-numbered block sizes must be reduced by one. (That is, a five-character string in base-32 could only go up to 2²⁴, not 2²⁵.)
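
Here’s a throwaway sketch, not part of TESID, that reproduces those lists from the formula above; the function name and the floating-point shortcut are mine:

```rust
/// Sketch only: raw block sizes ceil(log₂(bᵖ − bᵖ⁻¹)) for p = 1, 2, 3, …
/// up to max_bits. For Speck, odd results must then be reduced by one.
fn block_sizes(base: f64, max_bits: u32) -> Vec<u32> {
    let mut sizes = Vec::new();
    for p in 1.. {
        let bits = (base.powi(p) - base.powi(p - 1)).log2().ceil() as u32;
        if bits > max_bits {
            break;
        }
        sizes.push(bits);
    }
    sizes
}

// block_sizes(32.0, 65) → [5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65]
```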

To my mind, the interesting variants of TESID here are:

Supporting odd-length strings does disclose more epochs, but it doesn’t matter, especially not when you can tweak all that with⸺

1. Sparsity and discrimination

This is a simple extra step that can be added before encryption, in order to make valid IDs harder to guess, and to avoid overlap between IDs of different types.

id × sparsity + discriminant

There are two purposes for sparsity:

  1. Reducing the occupation of the ID space. If you don’t do this, then by the time you have a million or so IDs assigned (all the ones up to 2²⁰), every four-character string made up of base-32 characters will correspond to a valid ID. This can be undesirable, for example if you would like to use IDs as authentication, or if you just plain don’t want valid IDs to be guessable.

    (“Using IDs as authentication”: that is, having the ID is sufficient to access and/or mutate an entity. This means things like magic login links, and unlisted videos on a site like YouTube.)

    Given a 64-bit ID and 64-bit discriminant, the largest value that can safely be used for sparsity is 2³⁶ − 1. I suggest using powers of two (assigning a certain number of bits), but you don’t have to. If you allot 36 bits of sparsity (the most that can safely be done with 64-bit IDs, as 36 + 64 = 100 which is the largest block size supported by the 5peck ciphers, though in practice nothing will fill even 50 bits of that 64-bit ID, so with care you can go higher than 36 if you want), then only one in every 68 billion possible IDs will be valid, and it’s very probably reasonable to use IDs as authentication. With 36 bits of sparseness, you’d get 16 8-character IDs, then 16368 10-character IDs, then 16 million 12-character IDs, &c.

  2. Avoiding ID reuse across types or tenants, by adding a discriminant. Without this, a given ID number will produce the same TESID, which could lead to type confusion, allow discovery of valid IDs of different types (or on different tenants, or whatever), or disclose details about types’ sequences. With this applied correctly and consistently, TESIDs will effectively represent the encryption of the (type, id) tuple, instead of just the ID number. Because of this, a decoder can also determine which type an ID pertains to.

If you are going to use TESID for more than one type, I recommend adding bits for a type discriminant.
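
As a concrete sketch of the whole convention (the constants and names here are illustrative choices, not anything the library mandates): the value that actually gets encrypted, and how a decoder splits a decrypted value back apart.

```rust
/// Sketch only: the value that gets encrypted for, say, a “user” ID, and how a
/// decoder splits a decrypted value back apart. The constants are example choices.
const SPARSITY: u128 = 1 << 10;     // 10 bits of sparsity ⇒ every ID two characters longer
const USER_DISCRIMINANT: u128 = 1;  // per-type value, kept smaller than SPARSITY

fn expand(id: u64) -> u128 {
    id as u128 * SPARSITY + USER_DISCRIMINANT
}

/// Returns (id, discriminant); a type-aware decoder would reject any value
/// whose discriminant isn’t the one it expects.
fn split(value: u128) -> (u128, u128) {
    (value / SPARSITY, value % SPARSITY)
}
```

With these example values, ID 0 expands to 1, ID 5 expands to 5121, and a decoded 5121 splits back into (5, 1).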

As for how much sparsity to add, decide that yourself. Every bit you add increases the amortised ID length by 0.2 characters. (Or, every ten bits you add increases the ID length by exactly 2 characters.)

Here’s a reference table for some possible values, choosing powers of two (though you’re not limited to it):

| Sparsity | Bits | Extra chars | № 4-char IDs | № 6-char IDs | № 8-char IDs | № 10-char IDs |
|---|---|---|---|---|---|---|
| 1 | 0 | 0 | ~1 million | ~1 billion | ~1 trillion | ~1 quadrillion |
| 32 | 5 | 1 | 32768 | ~33 million | ~34 billion | ~35 trillion |
| 256 | 8 | 1.6 | 4096 | ~4 million | ~4 billion | ~4 trillion |
| 1024 | 10 | 2 | 1024 | ~1 million | ~1 billion | ~1 trillion |
| 65536 | 16 | 3.2 | 16 | 16368 | ~16 million | ~17 billion |
| 4294967296 | 32 | 6.4 | 1 | 0 | 255 | 261888 |
| 68719476735 | 35.99… | 7.2 | 1 | 0 | 16 | 16368 |

(See also the interactive table of ranges.)

It’s worth mentioning why very large sparsity still nets a single four-character TESID here: ID 0. Remember the step in the algorithm, id × sparsity + discriminant: zero times a large number is still zero. In order to avoid having that one four-character ID, you can either skip ID 0, or apply a large discriminant to counteract it (e.g. for sparsity of 2³², apply a discriminant of at least 2³⁰ to take you into the 8-character range, with 2³² the simplest value if you’re not needing to perform split decodes).

Values above 8 are not particularly likely to be useful for type discrimination, but could be for other forms of discrimination (e.g. shared tenantry), or for pure sparseness (especially for authenticatory IDs).

The discriminant should generally be smaller than sparsity; otherwise the ID reuse problem will occur, and type-unaware decoding will no longer be able to retrieve the discriminant in full. But this is not checked, partly for reasons of efficiency, and partly because an overly-large discriminant can be used legitimately, as a simple offset to lengthen IDs (though sparsity and skipping ID 0 is probably generally a better tool).

I’ve introduced the concept of typed TESID coders to the libraries. TESID offers no assistance in maintaining discriminant values for your types; that you must maintain yourself.

It’s possible to achieve these effects in the database, by adjusting the start (discriminant) and step (sparseness) of the sequence. TESID provides this functionality anyway for three reasons:

  1. Most databases don’t provide the tools to readily adjust sequence stride;
  2. Your database is probably limited to 63 bits, but TESID performs these operations in 100-bit space, allowing sparseness and discrimination even on implausibly large ID values; and
  3. You may be applying TESID to an existing sequence.

I identify three main parameter value groups:

(When I say “small sparsity”, I suggest values like 32, 256 or 1024. You don’t have to line up with bits, but I mildly prefer to.)


Sequence ideas: