More about TESID

A miscellaneous bag of information about TESID, the supporting research, and other related discussions.

Background

People use two main approaches for storing and presenting IDs:

  • sequential integers, such as an auto‐incrementing database column;
  • random values, such as UUIDs.

More generally, these two approaches are sequential versus random.

The trouble with sequential is that it requires that allocation be done in one place. This is not always possible or desirable. (And if it’s not, you won’t be able to use TESID directly, though it may still be food for thought.)

The trouble with random is that it has to be quite long, or the chance of collisions becomes unacceptably high. (I’m talking mostly about UUIDv4 here; it is possible to combine other elements like time and machine ID to twiddle the scope of collisions and possibly even reduce the number of bits required, but it’ll still be fundamentally more than sequential.) Depending on what you’re doing, you can definitely reduce the number of bits with care; and NanoID is a well‐thought‐out scheme that makes this ergonomic while giving human‐friendly compact IDs. But still, they’ll necessarily be comparatively large.

My goal in TESID was to use sequential integers under the hood (because the projects I designed it for don’t need distributed ID allocation), but transform these nice small numbers into comparatively small, random‐looking strings.

This is not a novel idea: people have been doing things like this for a long time. But it’s not talked about much, and no one seems to have canonicalised a good pattern or library for it: almost every library I found had very significant security problems, and the remainder were too limited for my liking.

Goals

I declare four goals:

  1. Computed from a numeric ID. No random generation with probabilistic collision handling: I want to store numbers, preferably but not necessarily sequential.

  2. Secure. I want there to be absolutely no feasible way of determining a sequence from string IDs, without possessing the key. This leads in the direction of using a reputable block cipher, though other options may be satisfactory. (Randomly‐generated strings would satisfy this goal, though they’d fail goal 1.)

    As examples of what I will not accept:

    • Hashids, although one of the best‐advertised schemes, is stunningly insecure, so that for a large fraction of apps it’s almost trivially reversible;

    • Optimus uses a fairly weak LCG for encryption, which would be bad enough for me to prefer to disqualify it even though attacks would generally be unviable, but then uses it improperly (discussed later), so that you will often be able to partially decode IDs.

  3. Human‐friendly IDs. This is fuzzy, but includes elements like:

    • IDs should be words, for text selection purposes, so that you can double‐click and get the ID, the whole ID, and nothing but the ID. Presuming ASCII, this limits you to letters, numbers and underscores (63 characters).

    • Case‐sensitivity is not ideal, because it takes more effort to type, and is much more error‐prone for speaking aloud.

    • Characters that can be easily confused should be avoided. This includes well‐known things like 1/I/l and O/0 for their visual similarity, but also includes things like o/0 for their audial similarity, because both are commonly pronounced “oh”.

      Note also that the preference is to avoid all members of a confusable group, not to leave one behind, because even though you could accept all options and convert them automatically, this still tends to lead to confusion among humans, since they will probably expect the difference to matter.

    • You want to be able to avoid undesirable patterns (swear words and such).

    A completely different approach that could also work is using hyphenated sequences of words, possibly in patterns like adjective-noun. Things like jubilant-umbrella, unglued-tumbling-bolshevik, nappers-torpidly-nonsecular-battleground. [The second and third were generated with the aid of random.choice(open('/usr/share/dict/words').readlines()).] Trickier to handle well, but not unreasonable in some cases.

  4. Short strings. Strings should not be only of one fixed length, because this consistently means that either your IDs are longer than they need to be, or you’ll run out of IDs at some point and have to do something about it.

    There are two points of interplay here with security:

    • If you use optimally‐short IDs, then after some time, valid IDs will be trivially guessable, which is undesirable for some applications.

    • By increasing the length of ID strings over time, you are disclosing an epoch: a short ID must be older than a longer ID. I find this tiny information leak almost always acceptable.

    In both cases, sparsity (discussed later) allows you to manage the matter.
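The hyphenated‐words approach mentioned under goal 3 is easy to sketch. Here is a minimal illustration with small embedded word lists (the real thing would want large, curated lists—the note above drew on /usr/share/dict/words); the names ADJECTIVES, NOUNS and word_id are invented for this sketch:

```python
import secrets

# Small embedded lists for illustration only; a real deployment would use
# large curated word lists.
ADJECTIVES = ["jubilant", "unglued", "torpid", "brisk", "mellow", "quirky"]
NOUNS = ["umbrella", "bolshevik", "battleground", "lantern", "walrus", "comet"]

def word_id(extra_adjectives=0):
    # secrets.choice rather than random.choice, so that IDs aren't
    # predictable from PRNG state.
    parts = [secrets.choice(ADJECTIVES) for _ in range(1 + extra_adjectives)]
    parts.append(secrets.choice(NOUNS))
    return "-".join(parts)
```

word_id() yields things like jubilant-walrus; word_id(1) adds another adjective. Collision handling (or a guarantee of uniqueness) remains up to you, which is part of what makes this approach trickier to handle well.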

In my survey of public discussions, libraries and techniques to achieve this, I only found short strings (goal 4) being produced by random generation (typically failing goal 1, and also requiring the addition of collision handling) or by non‐cryptographic scrambling (failing goal 2).

I made TESID because I was dissatisfied with this situation. TESID satisfies all four goals fully, with the caveat that avoiding undesirable patterns properly is not just plug‐and‐play, requiring extra effort. This is discussed later.

I did also find a couple of references to randomness on smaller (e.g. 64-bit) integers with manual rather than probabilistic collision detection, which is an approach that can satisfy all goals, minus the preference of sequentiality in goal 1, at the cost of requiring collision checking and with the benefit of not retaining a cryptographic key. This is an approach that I think has promise, but needs to be run as a database extension rather than in application space for best results; I have in mind to experiment with a kind of variant of TESID that works that way.

Achieving short strings with a sequence and real encryption

Suppose you use a block cipher to encrypt IDs: it’s fixed‐width encryption. If you want to support 32‐bit IDs, you’ll want to use a cipher with a 32‐bit block size, and then every number in the range [0, 2³²) will encrypt to a different number in the same range.

To support short strings with a block cipher for encryption, you need to use multiple block ciphers, with different block sizes. In the old days you might have used Skip32 and Skipjack, or now I’d suggest Speck32/64 and Speck64/128, for 32‐bit and 64‐bit IDs.

This leads to one problem: the numerical overlap problem. Your 32‐bit cipher supports the range [0, 2³²), and your 64‐bit cipher [0, 2⁶⁴)—​but you only want the latter for [2³², 2⁶⁴). There is some number that the 32‐bit cipher will encrypt to zero, and there is probably (odds of about four billion to one) some number in [2³², 2⁶⁴) that the 64‐bit cipher will also encrypt to zero.

Therefore, you need the string encoding to show which cipher was employed. The simplest way of doing this is to pad to a certain length. In a base‐32 alphabet, a 32‐bit value needs up to 6.4 characters, so just pad them all to 7 characters; using my base‐32 alphabet, the value that encrypted to zero would thus be stringified to 2222222. In the same alphabet, a 64‐bit value needs up to 12.8 characters, so just pad them all to 13 characters (0 → 2222222222222). Now you can determine which range to use while decoding.

(For correctness, you also need to check while decoding that a 64‐bit value doesn’t decode to a number below 2³²: such a value should have been 7 characters long, not 13. Otherwise, if a malicious user found an overly long encoding of an ID, you’d have an object being referred to by two different IDs.)
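A minimal sketch of this length‐tagging scheme, using the base‐32 alphabet this page assumes (starting at 2, omitting 0/1/l/o) and XOR with a fixed key as a stand‐in for the real ciphers—XOR is a permutation of the domain, which is all the sketch needs; a real implementation would use something like Speck32/64 and Speck64/128:

```python
ALPHABET = "23456789abcdefghijkmnpqrstuvwxyz"  # assumed 32-char alphabet

def to_base32(n, length):
    chars = []
    for _ in range(length):
        chars.append(ALPHABET[n & 31])
        n >>= 5
    return "".join(reversed(chars))

def from_base32(s):
    n = 0
    for ch in s:
        n = (n << 5) | ALPHABET.index(ch)
    return n

# Stand-ins for real block ciphers: cryptographically useless, but
# permutations of their domains, which is all this sketch needs.
enc32 = dec32 = lambda x: x ^ 0x5A5A5A5A
enc64 = dec64 = lambda x: x ^ 0x5A5A5A5A5A5A5A5A

def encode(n):
    if n < 2**32:
        return to_base32(enc32(n), 7)    # 32 bits -> up to 6.4 chars, pad to 7
    elif n < 2**64:
        return to_base32(enc64(n), 13)   # 64 bits -> up to 12.8 chars, pad to 13
    raise ValueError("id out of range")

def decode(s):
    if len(s) == 7:
        return dec32(from_base32(s))
    elif len(s) == 13:
        n = dec64(from_base32(s))
        if n < 2**32:
            # Would have been encoded as 7 characters: reject, so that no
            # object can be referred to by two different ID strings.
            raise ValueError("over-long encoding")
        return n
    raise ValueError("bad length")
```

The length of the string selects the cipher, and the range check in decode closes the overly‐long‐encoding hole described above.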

There does exist a neat solution to this problem that doesn’t require storing any kind of tag: use ciphers with disjoint ranges. For example: instead of two ciphers over [0, 2³²) and [0, 2⁶⁴), choose a second cipher that operates over [2³², 2⁶⁴) directly. If you’re interested in this, look into format‐preserving encryption; the quick summary is that the only practical and well‐vetted options are patent‐encumbered until 2029.


Well then, TESID: I first implemented it with 32‐ and 64‐bit block sizes, but the IDs weren’t as short as I liked, and I wasn’t quite happy with the keying situation (I was using disjoint 64‐bit and 128‐bit keys, as drawing the 64‐bit from the start of the 128‐bit would weaken it to 65 bits for brute force purposes), so I went a little off the beaten track, and devised new instances of the Speck cipher with different parameters. So now I have 20‐, 30‐, 40‐, 50‐, 60‐, 70‐, 80‐, 90‐ and 100‐bit block sizes, producing 4‐, 6‐, 8‐, 10‐, 12‐, 14‐, 16‐, 18‐ and 20‐character strings.

(If you’re wondering why TESID string lengths are all even, it’s because Speck requires that block sizes be an even number of bits, and it’s most convenient to only support even string lengths. The design rationale page explains a bit more.)
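For a feel of what parameterising Speck by word size looks like, here is a sketch of the standard Speck structure with the word size as a parameter. It uses the common rotation amounts (8, 3) and a two‐word key schedule; these are not TESID’s actual custom parameters (the official Speck32/64 itself uses (7, 2)), just an illustration of the structure:

```python
def speck_cipher(word_bits, key_words, rounds):
    mask = (1 << word_bits) - 1
    a, b = 8, 3   # standard rotations for most Speck sizes (Speck32/64: 7, 2)

    def ror(x, r): return ((x >> r) | (x << (word_bits - r))) & mask
    def rol(x, r): return ((x << r) | (x >> (word_bits - r))) & mask

    def round_enc(x, y, k):
        x = ((ror(x, a) + y) & mask) ^ k
        y = rol(y, b) ^ x
        return x, y

    def round_dec(x, y, k):
        y = ror(x ^ y, b)
        x = rol(((x ^ k) - y) & mask, a)
        return x, y

    # Key schedule: the round function applied to the key words themselves,
    # with the round number as the round key (two-word keys only, here).
    ks = [key_words[0]]
    l = list(key_words[1:])
    for i in range(rounds - 1):
        l_next, k_next = round_enc(l[i], ks[i], i)
        l.append(l_next)
        ks.append(k_next)

    def encrypt(x, y):
        for k in ks:
            x, y = round_enc(x, y, k)
        return x, y

    def decrypt(x, y):
        for k in reversed(ks):
            x, y = round_dec(x, y, k)
        return x, y

    return encrypt, decrypt
```

speck_cipher(10, [0x123, 0x2AB], 22) gives a toy 20‐bit block cipher: encrypt and decrypt each take a pair of 10‐bit words, and decrypt(*encrypt(x, y)) round‐trips.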

The Table of Ranges
Range          Length   Speck n   ~№ values
[0, 2²⁰)       4        10        1 million
[2²⁰, 2³⁰)     6        15        1 billion
[2³⁰, 2⁴⁰)     8        20        1 trillion
[2⁴⁰, 2⁵⁰)     10       25        1 quadrillion
[2⁵⁰, 2⁶⁰)     12       30        ⋮
[2⁶⁰, 2⁷⁰)     14       35
[2⁷⁰, 2⁸⁰)     16       40
[2⁸⁰, 2⁹⁰)     18       45
[2⁹⁰, 2¹⁰⁰)    20       50

Solving guessability and TESID reuse across types

As outlined so far, there are two potential problems [There’s also the problem of undesirable patterns, but I’ll address that later.] with TESID for some applications:

  1. Once you have enough IDs, they become trivially guessable. Once you’ve assigned a little over a million IDs [The 2²⁰ IDs from 0 to 1048575.], all sequences of four characters from the base‐32 alphabet refer to valid IDs. (They’re in a scrambled order, but they’re all valid.)

    This can be undesirable, because it allows errors to occur undetected, and allows enumeration (out of order, but still enumeration).

  2. TESIDs are computed only from the numeric ID, so if you have entities of two different types with the same numeric ID, they’ll get the same TESID. (I call this TESID reuse across types.)

    This can be undesirable, because it allows errors to occur undetected, and allows finding valid IDs by looking at the IDs of another type.

(These are not concerns for all applications. Most of the time, valid IDs being guessable isn’t an issue, and enumerability is even desirable in some cases—​though if you want enumerability, you should almost certainly use the numeric ID directly, and not TESID. And if you don’t care about guessability and don’t have much chance for type confusion, then you may also not care about TESID reuse across types.)

I went through a number of different possibilities before settling on a delightfully simple design, so delightful and simple that I wove it into the libraries:

id → id × sparsity + discriminant

It’s that simple, and solves both of the problems, and more.

As an example, suppose sparsity = 1000 (and no discriminant, i.e. discriminant = 0). This means that your ID sequence {0, 1, 2, …} becomes {0, 1000, 2000, …}. Then the number is encrypted, scrambling these small values within the range [0, 1048575]. Due to this sparsity factor, the smallest block size will only be used for IDs 0–1048, rather than 0–1048575. Thus, only 1049 out of the possible 1048576 four‐character values will be valid TESIDs (0.1%), uniformly randomly distributed because of the encryption. If someone’s trying to guess an ID, each string that they try has only one chance in a thousand of being valid. [But do remember that each guess is 1⁄1000; it takes 693 guesses to exceed a 50% probability of finding at least one valid: log(½) ÷ log(1 − 1⁄1000) ≈ 692.8]

And then discrimination: suppose we used that sequence for one type, but now we have a second type and we want it to have completely separate TESIDs with no overlap. For this new type, we use discriminant = 1. Now, its ID sequence {0, 1, 2, …} becomes {1, 1001, 2001, …}. Again, we’ll end up with only 0.1% of potential TESIDs being valid for this type, a different 0.1% from the first type.

In this example, we can also perform what I call a split decode: we decrypt the number and divide integrally by sparsity to get the ID, and understand the remainder to be the discriminant. Thus, the value that decrypts to 3000 is understood to be {ID 3, discriminant 0}; and 3001 {ID 3, discriminant 1}. This can be useful for error reporting: “you gave me an ID of type A, but I need an ID of type B”. (Mind you, it can also be worthwhile adding a type marker to your IDs in some cases, e.g. prefix it with a capital letter (e.g. T) or a lowercase word and underscore (e.g. typename_). It’s not possible to determine the discriminant from a TESID without decrypting it.)
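The arithmetic around the encryption step, as a sketch (the encryption itself is elided; SPARSITY = 1000 as in the example above, and the function names are invented here):

```python
SPARSITY = 1000

def sparsify(id, discriminant=0):
    # Applied to the ID before encryption.
    assert 0 <= discriminant < SPARSITY
    return id * SPARSITY + discriminant

def split_decode(value):
    # Applied after decryption: one divmod recovers both parts.
    return divmod(value, SPARSITY)
```

sparsify(3) is 3000 and sparsify(3, 1) is 3001; split_decode(3001) gives (3, 1), matching the example above. Only one in every SPARSITY post‐decryption values belongs to any given discriminant, which is where the one‐in‐a‐thousand guessing odds come from.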

This is all stuff that can in theory be done at the database layer, but databases and related tooling tend to have poor support for stride in sequences, and not great support for a sequence offset either. Plus there’s also the likely 2⁶³ − 1 limit there. Given how compact and self‐contained a notion it is, and how much more convenient, I reckoned it warranted deep inclusion in TESID.

I’ve written more about this topic in the design rationale page. If you’re interested in doing anything beyond small type discrimination, I recommend reading that document, as I’ve gone through some of the possible applications and considerations.

Interactive Table of Ranges for seeing the effects of sparsity and discrimination


The Table of Ranges above considered the range of the ID to encrypt. This form of the table shows ranges for the original ID, before sparsification and discrimination, to give you an idea of how long your IDs will be under various sparsity and discriminant values.

The sparsity‐and‐discriminant‐aware Table of Ranges

Length   Speck n   № values   Lowest ID   Highest ID
4        10
6        15
8        20
10       25
12       30
14       35
16       40
18       45
20       50

(The № values, Lowest ID and Highest ID columns are computed interactively from the chosen sparsity and discriminant.)

On the security of the cryptography

I’m using Speck, which is believed cryptographically‐secure, but with different parameters that I have chosen myself, and an altered key expansion technique, to make life easier when encrypting with diverse block sizes. I have taken care so that I believe the result is cryptographically secure to approximately 128 bits of security, which is very significant overkill for the purpose; but I am not a professional cryptographer, so you should be sceptical of my claims.

I have written more in the design rationale page, including some reasonably detailed [For a layman.] analysis of the Speck parameters and how I chose them. I invite cryptographers to analyse my work.

On other kinds of weaknesses:

My personal, moderately‐informed opinion is that, in the absence of algorithmic weaknesses in Speck, there will be no attack more effective than brute‐force, which is completely infeasible.

If I’ve scared you so that you’re reconsidering using TESID: this is an unconventional application of cryptography, in a context where I think literally every kind of cryptographic attack I know of is not possible. I’m only being meticulous [For a layman.] out of principle. For where TESID is to be used, I believe the encryption to be completely invincible.

What I want you to understand is that this is Real Encryption of the type that governments stopped you from exporting back in the ’90s, not the totally fake stuff that Hashids uses, or the accidentally extremely weak stuff of Optimus.

The encryption key is permanent

There is one last security concern that I must mention here, the one practical weak spot in the entire scheme.

The most important weakness of all real encryption schemes is the key: if someone gets the key, it’s all over. Therefore, the goal in most places where encryption is used is to change keys regularly, to minimise what is compromised by compromise of the key. In transport usage, perfect forward secrecy is preferred, so that each session is encrypted with a different key.

TESID has only one key, and it’s permanent. You can’t rotate keys without breaking all existing IDs, which is unlikely to be acceptable.

This is a risk in using TESIDs as authentication [That is, granting access to an entity to anyone who knows its TESID, but not publicly sharing that TESID.]: if someone has the key, they can trivially encode, decode and enumerate TESIDs.

As an example of this risk, take YouTube. Credible sources say that their video IDs used to use base64 encoding of 3DES encryption of 64‐bit integer keys, and even stored the key in the source code in multiple places. This was a curiously risky state of affairs, given that they do use IDs as authentication in unlisted videos, so with the encryption key (easily obtained if you could access the code—it wasn’t even a production environment secret) you could easily enumerate video IDs and rapidly find unlisted videos, which could potentially contain secrets. It took them until 2016 (around ten years) before they switched to randomly‐generated IDs, which are secure against enumeration (presuming the random number generator is secure). That’s a long time for them to consider the risk not all that important. From the fact that they locked down old unlisted videos in an orderly fashion six months after the algorithm change with a month’s notice, I conclude that they monitored the situation and concluded that no one was abusing the key—the access patterns would be very easy to spot.

You must determine how you rate this risk. I don’t think this is a great concern for most deployments, but weigh it against your desire for IDs‐as‐authentication, especially for long‐lived IDs.

The concept of error detection

This is something not implemented in TESID, but which you could add fairly easily.

If you valued the ability to statically detect errors without full decoding, you could add a check character using the Damm algorithm using an order‐32 table and the same base‐32 alphabet. If you really wanted to, you could even support an optional check character, since TESIDs are all even in length, and your check character would make the length odd.

It’s just a thought, really, one that I was favouring earlier on; but once I added the concept of sparsity I don’t think it’s so useful, as you can easily make it so that the vast majority of single‐character errors won’t produce valid TESIDs. However, there is something nice about guaranteeing that all single‐character errors and adjacent transposition errors will be detected. Anyway, just thought I’d mention it.

Comparisons

Hashids

Hashids is fairly well‐known, with implementations in many languages. In its earlier days, it made some claims about security, but they have thankfully dropped those now, because Hashids’ security is awful, bad almost beyond my wildest imagination. Treat it as easily‐reversible obfuscation only.

Hashids generates non‐sequential strings from integers, or even sequences of integers. (At first I couldn’t think of any use case for this feature, but after the issue of TESID reuse across types was pointed out to me in early testing, and I implemented sparsity and discriminant, I realised that might be a situation where encoding a sequence of integers could make sense. But overall, I’m going to ignore the feature here as it doesn’t really add anything.)

Hashids doesn’t encrypt IDs at all, but instead shuffles the alphabet. When encoding a number, it rotates the alphabet by that number, recording that as the first character of the output, and then converts the number to a string using this rotated alphabet.

This is… look, this is just so bad it’s basically negligence. [As for the claims it used to make about security, I can’t honestly see any way that they weren’t bald‐faced lies. It’s either that or staggering and fairly implausible incompetence.] A significant fraction of ten‐year‐olds, and not a few five‐year‐olds, would spot the problem: if you can get a system using Hashids to emit sequential IDs (which is very common), it literally tells you its alphabet. If the alphabet starts 5N6y2rl…, 0 maps to a string that starts with 5, 1 to a string starting with N, 2 → 6…, 3 → y…, &c. In its default configuration, all you need is 44 sequential IDs, and you’ve got the key to completely reverse Hashids. And that’s hardly the only way perfectly normal usage patterns will lead to it disclosing its scrambled alphabet.
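To make the attack concrete, here is a deliberately simplified model of the scheme just described—not real Hashids, which also has separators and a salted shuffle, and needs around 44 sequential IDs rather than 31:

```python
import random

ALPHABET_SIZE = 31
# The secret shuffled alphabet plays the role of the key.
secret = list("abcdefghjkmnpqrstuvwxyz23456789")
random.Random("the key").shuffle(secret)

def toy_encode(n):
    # Rotate the secret alphabet by n, emit the rotation's first character
    # (the "lottery" character), then write n in the rotated alphabet.
    r = n % ALPHABET_SIZE
    rotated = secret[r:] + secret[:r]
    body = ""
    while True:
        body = rotated[n % ALPHABET_SIZE] + body
        n //= ALPHABET_SIZE
        if n == 0:
            break
    return rotated[0] + body

# The attack: toy_encode(n)[0] is secret[n % 31], so the encodings of 31
# sequential IDs spell out the entire secret alphabet, i.e. the key.
recovered = [toy_encode(n)[0] for n in range(ALPHABET_SIZE)]
assert recovered == secret
```

With the alphabet recovered, decoding arbitrary IDs is then trivial: the “encryption” was only ever a substitution you’ve now been handed.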

Before I understood what Hashids was actually doing, I was inclined to give it credit for avoiding English curse words, but ignoring its ability to encode a sequence of numbers into a hashid, their technique is no different from just removing the chosen 14 characters from the alphabet.

Concerning the strings themselves, ignoring any aspects of weakness: Hashids in its default configuration produces strings of length 2 + floor(log₄₄ n), for each number in the sequence of numbers (typically one). This is shorter than TESID at times and for extremely large values, but most of the time TESID is one character shorter or the same length. A type discriminant, done as a separate number, will cost you two characters in Hashids, more than in TESID with sparsity below 1024. In Hashids you can set a minimum length, which is similar to what sparsity/discriminant can do in TESID. Add to all this that TESID doesn’t mix alphabetic cases, and TESIDs are clearly more user‐friendly.

With all said and done, I strongly recommend against using Hashids in any situation. It’s just bad, and gives an extremely false impression of security. If you care to avoid curse words and the likes, see my suggestions below; even the simple suggestions will do about as good a job as Hashids, though in a slightly different way.

Optimus

Optimus (first implemented for PHP, and there’s a Go port too) is a library for quickly encrypting IDs. It uses a technique that’s about as fast as possible, but unfortunately has chosen poor parameters that render it cryptographically very weak (though Hashids is still worse).

Optimus uses two keys, a prime‐key (a large prime number in the encryption domain) and a xor‐key (a random value in the encryption domain). The inverse of the prime‐key must also be provided for decryption. It also takes a parameter bits which controls the encryption domain, which defaults to 31 (because of PHP’s use of signed integers), meaning the cipher operates on values in the range [0, 2³¹).

Optimus does not stringify the encrypted IDs: you’re still dealing with integers. This means that you’ll normally get a ten‐digit ID.

This is the algorithm for encrypting id:

encrypted‐id = ((id × prime‐key) mod 2^bits) xor xor‐key
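In Python, with illustrative parameters (not from any real deployment), the whole scheme and its inverse look like this; decryption just needs the modular inverse of prime‐key, which any odd number has modulo a power of two:

```python
BITS = 31
MOD = 1 << BITS

# Illustrative values only. Optimus wants prime-key to be a large prime;
# 1000000007 is prime, and in any case any odd value is invertible mod 2^31.
prime_key = 1000000007
xor_key = 0x5A3C1E7
inverse_key = pow(prime_key, -1, MOD)   # modular inverse (Python 3.8+)

def optimus_encrypt(n):
    return ((n * prime_key) % MOD) ^ xor_key

def optimus_decrypt(e):
    return ((e ^ xor_key) * inverse_key) % MOD
```

Round‐tripping works for any id in [0, 2³¹); the weakness is in how little the multiplication scrambles, not in invertibility.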

Skipping the XOR part, which adds no cryptographic complexity, this is the algorithm of a Linear Congruential Generator with m a power of 2 and c = 0, stepping the generator forwards one place to encrypt, and backwards one place to decrypt.

LCGs in general have some known weaknesses (read the rest of that Wikipedia article), but that specific class of LCGs is particularly bad for its weaknesses. The most catastrophic one here is that they work terribly with even seeds, meaning that half the IDs you encrypt are barely encrypted at all. I believe [But I’m not a professional cryptographer and also haven’t tried what I suggest here.] that with fewer than twenty sequential or randomly‐but‐evenly‐spaced IDs, you may be able to calculate at least the last few bits of xor‐key, and several bits of prime‐key’s entropy. And it won’t take all that much more to chip away at the rest of the keys.

To demonstrate the badness of this class of LCGs, try bits = 5 and prime‐key = 29, which corresponds to lcg(a=29, c=0, m=32): the cycles of this LCG are [[0], [1, 29, 9, 5, 17, 13, 25, 21], [2, 26, 18, 10], [3, 23, 27, 15, 19, 7, 11, 31], [4, 20], [6, 14, 22, 30], [8], [12, 28], [16], [24]]. 0, 8, 16 and 24 encrypt to themselves; 4 and 20, and 12 and 28, to each other; &c., and you can clearly see how much worse the even numbers are than the odd. The only redeeming feature of the parameters chosen is the use of a prime a, which is more likely to produce results as good as you can get given c = 0 and m a power of 2 (though try lcg(a=31, c=0, m=32) if you dare); but compare it even to what that page describes about an LCG with well‐chosen Hull–Dobell Theorem parameters, with its single cycle and better distribution, and… yeah, this one’s very, very bad.
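You can reproduce those cycles with a few lines:

```python
def lcg_cycles(a, m):
    # Partition [0, m) into the cycles of x -> a*x % m (a odd, so this is a
    # permutation of the domain).
    seen, cycles = set(), []
    for start in range(m):
        if start in seen:
            continue
        cycle, x = [], start
        while x not in seen:
            seen.add(x)
            cycle.append(x)
            x = a * x % m
        cycles.append(cycle)
    return cycles
```

lcg_cycles(29, 32) yields the ten cycles listed above; lcg_cycles(31, 32) is the “if you dare” case, where a ≡ −1 (mod 32) means every step is just negation: fifteen two‐element cycles plus the fixed points 0 and 16.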

I myself started TESID with the LCG approach in mind, but the description of the difficulty of finding good parameters put me off it—​it’s clearly possible to do a fairly good job with LCGs (with known but generally acceptable weaknesses, for this application), but frightfully easy to make a mess of it. This is why I decided to go with block ciphers—​they’re somewhat more computationally expensive, but much more reliable. I used RC5 first (which officially supports 32‐bit and 64‐bit block sizes, unlike such as Skipjack, where Skip32 is a variant added after the fact, like my custom Speck word sizes that I now use), then switched to Speck at some point because its algorithm looked prettier, and I no longer had to choose key length and how many rounds to do. And it turned out Speck was faster than RC5 with its recommended parameters, too. But seriously, my reasons for switching from RC5 to Speck were very flimsy and came down more to vibes. I think I was vindicated once I switched to custom word sizes, though.

UUID

I’m focusing on techniques for encoding sequential identifiers, which UUID isn’t, but I include this comparison for completeness and convenient reference, since I intend TESID to be a good replacement for UUIDs in certain places they’re commonly used.

Précis:

In terms of my goals, UUID categorically fails goals 1 (computed from a numeric ID) and 4 (short strings), and doesn’t fare well at goal 3 (human-friendly IDs).

In more detail:

UUIDs are 128‐bit values. In the best‐case scenario, a proper UUID data type, they’ll take 128 bits of space (16 bytes) in your database. But it’s common to end up storing them in canonical textual representation, where they’ll probably take at least 320 bits (40 bytes). For presentation to users, canonical textual representation is most common, which is 36 characters long (32 hexadecimal characters, four hyphens). Other representations are possible; the shortest reasonable one is base‐64, which is still quite unwieldy at 22 characters.

The strength of UUIDs is that they’re suitable for distributed allocation. If you use UUIDv4, you just generate 122 bits [Six bits are reserved for indicating version 4 and variant 1. It’s not unheard of to just generate a 128‐bit value and produce a nominally‐invalid UUID.] of randomness, and use that, so you can do it on different machines with no trouble. There is a slight risk of collision, but it’s generally negligible.

There are also other versions of UUIDs that take other pieces of information, such as timestamp or a node MAC address. This conveniently stores that information that you could potentially have wanted to store anyway. If exposed publicly, this can allow identification of nodes or times (which may be desirable or undesirable).

But all up, UUID is heavily based on randomness.

Also worthy of mention here is ULID, which uses a 48‐bit timestamp (milliseconds) and 80 bits of randomness (choosing a new random value every millisecond, and just incrementing the random value within the millisecond—​so it’s not perfect for distributed use, though the risk of collisions or overflow is probably still negligible), and a (26‐character) base‐32 string encoding (Crockford’s alphabet, and uppercase at that, which is unwise). A deliberate feature of ULID is that it is lexicographically sortable. The draft UUIDv7 is similar, with 48 bits of timestamp, 6 bits of version and variant, and 74 bits of randomness (but without the intra‐millisecond incrementing behaviour).
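For reference, packing a ULID‐style value is simple enough to sketch: 48‐bit millisecond timestamp in the most significant place, 80 bits of randomness, Crockford base‐32. (This sketch omits ULID’s intra‐millisecond increment, and the function name is invented here.)

```python
import secrets
import time

CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"   # Crockford's base-32 alphabet

def ulid():
    # 48-bit millisecond timestamp, then 80 bits of randomness.
    value = (int(time.time() * 1000) & ((1 << 48) - 1)) << 80 | secrets.randbits(80)
    chars = []
    for _ in range(26):            # 26 chars x 5 bits = 130 bits >= 128
        chars.append(CROCKFORD[value & 31])
        value >>= 5
    return "".join(reversed(chars))
```

Because the timestamp leads, ULIDs generated in different milliseconds sort lexicographically by creation time, which is the deliberate feature mentioned above.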

On the database side, TESIDs are integers. They will probably take 64 bits (8 bytes) to store—​much less than UUIDs. In encoded textual form, they’re also much shorter than the canonical textual representation of UUIDs—​typically 4–8 characters, and as long as 20 characters in the most extreme case with extraordinarily high IDs and sparsity.

TESID is not immediately suitable for distributed allocation, and something like UUIDv4 is not possible; but something more reminiscent of UUIDv1 or the drafts UUIDv6–8 is readily possible, by using techniques similar to sparsity and discriminant, either on the ID in the database, or as part of TESID encoding. Care will be required about number sizes, if doing it in the ID column in a database: if your database only supports signed 64‐bit integers, a 48‐bit timestamp would leave you with only 15 bits for the ID—32,768 [My phone number ends in 32768. It was the largest power of two available.] possible IDs. It may be wiser to use a compound (integer, integer) key and join them in application code, though you may need to be careful about TESID’s 100‐bit limit.

It’s also worth pointing out a meaningful difference in convention here: things like UUIDs tend to put the “data” at the start, and the randomness at the end, because this tends to produce better results in databases, allowing better locality and sharding behaviour. But in TESID you do it the other way round, with the “data” at the least significant bits, so that the numeric values are as small as possible, yielding strings that are as short as possible. (Also, the locality and sharding reasons don’t generally apply, as the idea is that sparsity and discrimination are applied when needed, not stored that way.) If you joined a 64‐bit ID and a 48‐bit timestamp with the timestamp in the most significant place, UUID‐style, then you get a 105‐bit value for the next decade and a half (21 characters in base‐32, though TESID actually doesn’t go that high), then 106‐bit for the next seventy years, 107‐bit for the next 140 years, and so on. If you put the timestamp in the least significant place, then the value is only ceil(48 + log₂ id) bits: in TESID terms, you’ll get four 10‐character IDs, then around four thousand 12‐character IDs, around four million 14‐character IDs, and so on.
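The bit‐length arithmetic in that paragraph can be checked mechanically; tesid_chars here just rounds the bit length up to the next 10‐bit block and doubles it, per the Table of Ranges (both helper names are invented for this sketch):

```python
def plain_base32_chars(v):
    # ceil(bit_length / 5): characters in an unpadded base-32 rendering.
    return -(-v.bit_length() // 5)

def tesid_chars(v):
    # TESID block sizes go up in 10-bit steps, two characters per step.
    return 2 * -(-v.bit_length() // 10)

t = 1 << 40   # a millisecond timestamp in [2**40, 2**41), i.e. roughly 2004-2039

# Timestamp in the most significant place, UUID-style: big for every ID.
msb = (t << 64) | 123
assert msb.bit_length() == 105 and plain_base32_chars(msb) == 21

# Timestamp in the least significant place: the value grows with the ID.
lsb = (3 << 48) | t
assert lsb.bit_length() == 50 and tesid_chars(lsb) == 10
```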

TESIDs deliberately minimise information leakage: thanks to the use of encryption, the only leak is the length of the string, which discloses an epoch.

All up, TESID is grounded in deterministic techniques. There are no collisions. Strings are as short as possible.

Generation is theoretically not significantly different; if anything, it’s a little easier, as you’re more likely to be able to lean on the database to handle things. But in practice, if you want to avoid bad patterns in the thorough rather than probabilistic way, TESIDs can be harder to work with for ID allocation, because you may have to fight against what your tooling wants to do, and in a stateful manner (you need to know where in the ID sequence you’re up to), whereas with UUIDs you probably already had to perform ID allocation yourself, but at least the ID allocation function was effectively stateless.

As for performance: neither UUID nor TESID are at all likely to be a bottleneck. With UUIDs, you’re working with larger values in both database and application, almost certainly doing a string allocation in application, and possibly having a variable‐width column in the database (which tends to slightly harm performance). For TESID, assuming the Rust implementation which is the most efficient, you’re working with smaller values and need no string allocation; and a synthetic benchmark on my AMD Ryzen 5800HS laptop (2021) shows TESID encoding and decoding each taking well under 100ns. At a guess, it’s probably much of a muchness; I think microbenchmarks would probably generally favour TESID over UUID, especially if you’re using a database without a real UUID type, but I doubt that realistic benchmarks would distinguish.

Others

Read Methods for obfuscating sequential IDs by Jiayu Yi. It’s a great collection of links and descriptions of things in this space. It doesn’t cover the variable‐width stuff that I’ve made a key feature of TESID, but it’s an excellent survey, and if you’ve been reading this page of mine, you probably want to read that one too. The library Presents provided by that author at the end is cryptographically sound, using real 64‐bit block ciphers and stringifying with a base‐62 alphabet with no leading padding, for strings of up to 11 characters. (That is, 95% [(2⁶⁴ − 62¹⁰) ÷ 2⁶⁴] of IDs will be 11 characters long.)

Considerations when choosing TESID

I reckon that TESID is generally better than the alternatives that people tend to use, but it’s not without issues either, and there are a few things you should be aware of before choosing to use TESID.

Terminology and spelling

TESID stands for Textualised Encrypted Sequential Identifiers. I pronounce it as the two‐syllable /tɛsəd/.

I have called the process of converting a number to a string encoding, not encrypting; and its inverse decoding, not decrypting. Where a type exists for containing the expanded key with methods to perform the encoding and decoding, this type is called a coder.

In libraries, a TESID coder type is likely to be named TesidCoder or TESIDCoder.

There may also be variants like TypedTesidCoder or TypedTESIDCoder, for more convenient handling of sparsity and discriminant parameters.

Here and in the docs, I mostly call the result of encoding an ID a TESID (a textualised encrypted sequential identifier), but in real deployments you’ll probably normally just call it an ID.

The problem of undesirable patterns (swear words and such)

Because it uses encryption (which is equivalent to randomness) and a reasonably large alphabet, TESID is susceptible to one problem that may be deemed significant: you can end up with undesirable patterns in the generated strings. For example, if you have allocated all the IDs from 0 to 2²⁰, then all four‐letter words containing neither l nor o will be valid TESIDs. As a benign example, with the key 000102030405060708090a0b0c0d0e0f, the ID 921366 encodes to “rust”, and 336828232012 to “sumthing”.

This is not a novel problem: anything that generates random values will be susceptible. However, TESID is a little more heavily affected than things that generate long random strings, because here the undesirable pattern is much more likely to be the entire ID, rather than just four characters somewhere in the middle of a much longer string. (Though you’re more likely to get such undesirable patterns appearing somewhere in the longer strings.)

TESID deliberately does not try to mitigate this, because I don’t like the cost of naive mitigation (generally involving a much shorter alphabet, such as starting with removing the remaining vowels).

If this concerns you, here are four suggestions: two easy and probabilistic, one more difficult but thorough, and one that forks and changes the library:

  1. The easiest solution, recommended for general use: use sparsity, to reduce the probability of an undesired pattern occurring. With sparsity 256, for example, the probability of each four‐letter word occurring is only 1⁄256, around 0.4%. (Though if you use type discrimination, it increases again proportionally.) Maybe that’s enough for you.

  2. Skip four‐letter IDs and start at six, where the patterns are less likely to occur and much less likely to be the whole string, by beginning your ID sequence at 2²⁰ ÷ sparsity (without sparsity, that means starting at 1048576; but then realise that by the time you get to 2³⁰, a billion‐odd IDs, about 99.9% of six‐character IDs will be in use).

  3. The best solution, recommended if you’re willing to put the effort in: before using an ID, encode it and scan for undesirable patterns, and skip to the next if you decide you don’t like it. [This pattern is commonly seen when generating random IDs, since it’s easy there as you were already generating them in application code. nanoid-good is one example.] This is admittedly likely to be more bothersome to arrange, as you might have been offloading the sequence to your database, but it’s really the only way to do a good job of avoiding such patterns.

    (I say this despite the alternative of a smaller, vowel‐free alphabet: using such a smaller alphabet introduces problems of its own in human‐friendliness (there’s just something off about string IDs that never include vowels, for starters), and this approach also allows more flexible filtering for truly superior human‐friendliness, such as disallowing same‐character repetition, which is error‐prone in dictation.)

    You’ll need to either do the checking in the database (not a bad idea, frankly; you could probably move all the TESID stuff there quite beneficially), or with an extra round trip to the database on any inserts. The in‐database technique will vary by database, and ORMs and the like may not be fond of you. It’s certainly harder than just “here, use this library in your code, and keep using your database’s standard auto‐incrementing primary key”, so I’m deferring any sort of official recommendation for now, though I will be pursuing it as an optional extra.

    Just one hint: for SQLite, it may be best to use INTEGER PRIMARY KEY AUTOINCREMENT for both its automatic sequence maintenance and its ROWID optimisation, and mess with sqlite_sequence before each INSERT. But this does run the risk of forgetting to pump the sequence before an insert.

    I intend to write a library for filtering potentially bad patterns at some point. I have in mind preferring broad rules like “no more than three letters in a row” and “the same character must not occur three times in a row”, rather than specific bad‐word matching (though if the rules were as blunt as these suggestions, you might still want a filter for certain three‐character sequences). Rules as aggressive as these disqualify a large fraction of IDs as the string gets long, over 90%, so more nuance at greater lengths might be warranted, perhaps considering vowel and consonant patterns. It’s not like you end up running out of IDs (1% of a zillion is still more like a zillion than like zero), but ID allocation would slow down, heading towards a microsecond. (Run the Rust rejection_rates example for a demonstration.)

  4. Fork TESID with a different, smaller alphabet. (I’d suggest base‐16, as its half‐byte alignment makes it the easiest to switch to, and it has nice properties at the boundaries too; but I wouldn’t use the hexadecimal alphabet, as it feels programmery and mildly encourages numeric treatment, which would break the occasional ID. Instead, optimise for avoiding undesirable patterns and minimising confusables of any kind; spending just a minute whittling it down once, I ended up with 346789cdghjkprtz. But as I hinted above, I find such an alphabet mildly disconcerting, though I can’t clearly explain why: it’s just too small in the wrong way, and the lack of vowels is somehow unpleasant.)
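Suggestion 3 above can be sketched concretely. The following Python sketch is illustrative only: acceptable implements the two broad rules just mentioned, and toy_encode is a hypothetical stand‐in for a real TESID coder (it is not TESID encryption, just something deterministic to drive the loop).

```python
import hashlib

def acceptable(tesid: str) -> bool:
    """Broad rejection rules: no more than three letters in a row,
    and the same character must not occur three times in a row."""
    letter_run = 0
    for c in tesid:
        letter_run = letter_run + 1 if c.isalpha() else 0
        if letter_run > 3:
            return False
    return not any(tesid[i] == tesid[i + 1] == tesid[i + 2]
                   for i in range(len(tesid) - 2))

def allocate(next_id: int, encode) -> tuple[int, str]:
    """Advance past IDs whose encodings fail the filter.
    `encode` is any int -> str encoder (here, a hypothetical stand-in)."""
    while not acceptable(encode(next_id)):
        next_id += 1
    return next_id, encode(next_id)

def toy_encode(n: int) -> str:
    # NOT real TESID encryption: a deterministic toy encoder for demonstration.
    return hashlib.sha256(n.to_bytes(8, "big")).hexdigest()[:4]

used_id, tesid = allocate(0, toy_encode)
assert acceptable(tesid)
```

In a real deployment the filter would run before committing an insert, skipping the sequence forward using one of the database techniques described in suggestion 3.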
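And the SQLite hint about sqlite_sequence can be demonstrated with the standard sqlite3 module. This is a sketch of the mechanism only (the TESID encoding and filtering are elided), and the table and column names are made up for the example:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE item (id INTEGER PRIMARY KEY AUTOINCREMENT, body TEXT)")
con.execute("INSERT INTO item (body) VALUES ('first')")  # allocated id 1

# Suppose IDs 2 through 41 encoded to strings we rejected: pump the
# sequence so that the next INSERT allocates id 42 instead.
con.execute("UPDATE sqlite_sequence SET seq = 41 WHERE name = 'item'")

con.execute("INSERT INTO item (body) VALUES ('second')")
print(con.execute("SELECT max(id) FROM item").fetchone()[0])  # → 42
```

Note that the sqlite_sequence row for a table is only created on the first insert into it, so this UPDATE would do nothing on a freshly created, empty table.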

Further reading

Two more pages: