Background
People use two main approaches for storing and presenting IDs:
- Sequential integers. Compact, sorted, performant. But it exposes the
  sequence, which allows enumeration and comparison and estimation of the size
  of a system, each of which could be useful or undesirable, depending on the
  case.
- UUIDs. 128 bits long, generally random. Can be handy for distributed ID
  allocation. Long and unwieldy: 32 characters in hexadecimal, 22 in Base64
  (or any base 57–68). But quite popular due to good tooling support.
More generally, these two approaches are sequential versus random.
The trouble with sequential is that it requires that allocation be done in one
place. This is not always possible or desirable. (And if it's not, you won't be
able to use TESID directly, though it may still be food for thought.)
The trouble with random is that it has to be quite long, or the chance of
collisions becomes unacceptably high. (I'm talking mostly about UUIDv4 here; it
is possible to combine other elements like time and machine ID to twiddle the
scope of collisions and possibly even reduce the number of bits required, but
it'll still be fundamentally more than sequential.) Depending on what you're
doing, you can definitely reduce the number of bits with care; and NanoID is
a well-thought-out scheme that makes this ergonomic while giving human-friendly
compact IDs. But still, they'll necessarily be comparatively large.
My goal in TESID was to use sequential integers under the hood (because the
projects I designed it for don't need distributed ID allocation), but transform
these nice small numbers into comparatively small, random-looking strings.
This is not a novel idea: people have been doing things like this for a long time.
But it's not talked about much, and no one seems to have canonicalised a good
pattern or library for it: almost every library I found had had very significant
security problems, and the remainder were too limited for my liking.
Goals
I declare four goals:
- Computed from a numeric ID.
  No random generation with probabilistic collision handling;
  I want to store numbers, preferably but not necessarily sequential.
- Secure.
  I want there to be absolutely no feasible way of determining a sequence from string IDs
  without possessing the key.
  This leads in the direction of using a reputable block cipher,
  though other options may be satisfactory.
  (Randomly-generated strings would satisfy this goal, though they'd fail goal 1.)
  As examples of what I will not accept:
  - Hashids, although one of the best-advertised schemes, is stunningly insecure,
    so that for a large fraction of apps it's almost trivially reversible;
  - Optimus uses a fairly weak LCG for encryption,
    which would be bad enough for me to prefer to disqualify it
    even though attacks would generally be unviable,
    but then uses it improperly (discussed later),
    so that you will often be able to partially decode IDs.
- Human-friendly IDs. This is fuzzy, but includes elements like:
  - IDs should be words, for text selection purposes,
    so that you can double-click and get the ID, the whole ID, and nothing but the ID.
    Presuming ASCII, this limits you to letters, numbers and underscores (63 characters).
  - Case-sensitivity is not ideal, because it takes more effort to type,
    and is much more error-prone for speaking aloud.
  - Characters that can be easily confused should be avoided.
    This includes well-known things like 1/I/l and O/0 for their visual similarity,
    but also includes things like o/0 for their aural similarity,
    because both are commonly pronounced "oh".
    Note also that the preference is to avoid all members of a confusable group,
    not to leave one behind,
    because even though you could accept all options and convert them automatically,
    this still tends to lead to confusion among humans,
    since they will probably expect the difference to matter.
  - You want to be able to avoid undesirable patterns (swear words and such).

  A completely different approach that could also work is using hyphenated sequences of words,
  possibly in patterns like adjective-noun.
  Things like jubilant-umbrella, unglued-tumbling-bolshevik, nappers-torpidly-nonsecular-battleground.
  [The second and third were generated with the aid of random.choice(open('/usr/share/dict/words').readlines()).]
  Trickier to handle well, but not unreasonable in some cases.
- Short strings.
  Strings should not be only of one fixed length,
  because this consistently means that either your IDs are longer than they need to be,
  or you'll run out of IDs at some point and have to do something about it.
  There are two points of interplay here with security:
  - If you use optimally-short IDs, then after some time,
    valid IDs will be trivially guessable,
    which is undesirable for some applications.
  - By increasing the length of ID strings over time,
    you are disclosing an epoch:
    a short ID must be older than a longer ID.
    I find this tiny information leak almost always acceptable.

  In both cases, sparsity (discussed later) allows you to manage the matter.
In my survey of public discussions, libraries and techniques to achieve this,
I only found short strings (goal 4) being produced by random generation
(typically failing goal 1, and also requiring the addition of collision handling)
or by non-cryptographic scrambling (failing goal 2).
I made TESID because I was dissatisfied with this situation.
TESID satisfies all four goals fully,
with the caveat that avoiding undesirable patterns properly is not just plug-and-play,
requiring extra effort. This is discussed later.
I did also find a couple of references to randomness on smaller (e.g. 64-bit)
integers with manual rather than probabilistic collision detection,
which is an approach that can satisfy all goals, minus the preference of sequentiality in goal 1,
at the cost of requiring collision checking and with the benefit of not retaining a cryptographic key.
This is an approach that I think has promise,
but needs to be run as a database extension rather than in application space for best results;
I have in mind to experiment with a kind of variant of TESID that works that way.
Achieving short strings with a sequence and real encryption
Suppose you use a block cipher to encrypt IDs:
it's fixed-width encryption.
If you want to support 32-bit IDs, you'll want to use a cipher with a 32-bit block size,
and then every number in the range [0, 2³²) will encrypt to a different number in the same range.
To support short strings with a block cipher for encryption,
you need to use multiple block ciphers, with different block sizes.
In the old days you might have used Skip32 and Skipjack,
or now I'd suggest Speck32/64 and Speck64/128,
for 32-bit and 64-bit IDs.
This leads to one problem: the numerical overlap problem.
Your 32-bit cipher supports the range [0, 2³²),
and your 64-bit cipher [0, 2⁶⁴), but you only want it for [2³², 2⁶⁴).
There is some number that will be encrypted to zero with the 32-bit cipher,
and there is probably (about four billion to one)
some number that will be encrypted with the 64-bit cipher to zero.
Therefore, you need the string encoding to show which cipher was employed.
The simplest way of doing this is to pad to a certain length.
In a base-32 alphabet, a 32-bit value needs up to 6.4 characters,
so just pad them all to 7 characters; using my base-32 alphabet,
the value that encrypted to zero would thus be stringified to 2222222.
In the same alphabet, a 64-bit value needs up to 12.8 characters,
so just pad them all to 13 characters (0 → 2222222222222).
Now you can determine which range to use while decoding.
(For correctness,
you also need to check while decoding that a 64-bit value doesn't decode to a number below 2³²:
such a value should have been 7 characters long, not 13.
Otherwise, if a malicious user found an overly long encoding of an ID,
you'd have an object being referred to by two different IDs.)
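To make that concrete, here's a little sketch of the pad-and-dispatch-by-length idea (plain Python; the two "ciphers" are identity stand-ins for real 32-bit and 64-bit block ciphers, and the alphabet is only illustrative, not necessarily the one I actually use):

```python
# Sketch of the two-cipher, pad-to-a-length-per-cipher scheme described above.
# The "ciphers" here are identity placeholders standing in for real block
# ciphers (e.g. Speck32/64 and Speck64/128).

ALPHABET = "23456789abcdefghijkmnpqrstuvwxyz"  # illustrative base-32 alphabet, zero digit '2'

def stringify(n: int, length: int) -> str:
    chars = []
    for _ in range(length):
        n, digit = divmod(n, 32)
        chars.append(ALPHABET[digit])
    return "".join(reversed(chars))            # zero pads to e.g. "2222222"

def destringify(s: str) -> int:
    n = 0
    for c in s:
        n = n * 32 + ALPHABET.index(c)
    return n

def encrypt32(n): return n   # stand-in for a 32-bit block cipher
def decrypt32(n): return n
def encrypt64(n): return n   # stand-in for a 64-bit block cipher
def decrypt64(n): return n

def encode(id: int) -> str:
    if id < 2**32:
        return stringify(encrypt32(id), 7)     # 32-bit values pad to 7 characters
    elif id < 2**64:
        return stringify(encrypt64(id), 13)    # 64-bit values pad to 13 characters
    raise ValueError("ID too large")

def decode(s: str) -> int:
    if len(s) == 7:
        return decrypt32(destringify(s))
    elif len(s) == 13:
        id = decrypt64(destringify(s))
        if id < 2**32:
            # Such an ID should have been encoded as 7 characters: reject the alias.
            raise ValueError("overlong encoding")
        return id
    raise ValueError("bad length")
```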
There does exist a neat solution to this problem that doesn't require storing any kind of tag:
use ciphers with disjoint ranges.
For example: instead of using two ciphers with ranges [0, 2³²) and [0, 2⁶⁴),
make the second use the range [2³², 2⁶⁴).
If you're interested in this, look into format-preserving encryption,
but the quick summary is that the only practical and well-vetted options are patent-encumbered until 2029.
Well then, TESID: I first implemented it with 32- and 64-bit block sizes,
but the IDs weren't as short as I liked,
and I wasn't quite happy with the keying situation
(I was using disjoint 64-bit and 128-bit keys,
as drawing the 64-bit from the start of the 128-bit would weaken it to 65 bits for brute force purposes),
so I went a little off the beaten track,
and devised new instances of the Speck cipher with different parameters.
So now I have 20-, 30-, 40-, 50-, 60-, 70-, 80-, 90- and 100-bit block sizes,
producing 4-, 6-, 8-, 10-, 12-, 14-, 16-, 18- and 20-character strings.
(If you're wondering why TESID string lengths are all even,
it's because Speck requires that block sizes be an even number of bits,
and it's most convenient to only support even string lengths.
The design rationale page explains a bit more.)
The Table of Ranges
| Range | Length | Speck n | № values |
| --- | --- | --- | --- |
| [0, 2²⁰) | 4 | 10 | 1 million |
| [2²⁰, 2³⁰) | 6 | 15 | 1 billion |
| [2³⁰, 2⁴⁰) | 8 | 20 | 1 trillion |
| [2⁴⁰, 2⁵⁰) | 10 | 25 | 1 quadrillion |
| [2⁵⁰, 2⁶⁰) | 12 | 30 | ⋮ |
| [2⁶⁰, 2⁷⁰) | 14 | 35 | |
| [2⁷⁰, 2⁸⁰) | 16 | 40 | |
| [2⁸⁰, 2⁹⁰) | 18 | 45 | |
| [2⁹⁰, 2¹⁰⁰) | 20 | 50 | |
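To make the table concrete, here's a small sketch (plain Python, not taken from any TESID library) of how an ID's magnitude selects the block size and string length:

```python
# Sketch: pick the block size and string length for a value, per the Table of
# Ranges above. Block sizes run from 20 to 100 bits in steps of 10, and a
# base-32 character carries 5 bits; Speck's word size n is half the block size.

def block_and_length(value: int) -> tuple[int, int]:
    for block_bits in range(20, 101, 10):
        if value < 2**block_bits:
            return block_bits, block_bits // 5
    raise ValueError("value out of range (TESID tops out at 100 bits)")

assert block_and_length(0) == (20, 4)
assert block_and_length(2**20) == (30, 6)        # first value needing 6 characters
assert block_and_length(2**100 - 1) == (100, 20)
```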
Solving guessability and TESID reuse across types
As outlined so far, there are two potential problems [There's also the problem of undesirable patterns,
but I'll address that later.] with TESID for some applications:
- Once you have enough IDs, they become trivially guessable.
  Once you've assigned a little over a million IDs [The 2²⁰ IDs from 0 to 1048575.],
  all sequences of four characters from the base-32 alphabet refer to valid IDs.
  (They're in a scrambled order, but they're all valid.)
  This can be undesirable,
  because it allows errors to occur undetected,
  and allows enumeration (out of order, but still enumeration).
- TESIDs are computed only from the numeric ID,
  so if you have entities of two different types with the same numeric ID,
  they'll get the same TESID. (I call this TESID reuse across types.)
  This can be undesirable,
  because it allows errors to occur undetected,
  and allows finding valid IDs by looking at the IDs of another type.

(These are not concerns for all applications.
Most of the time, valid IDs being guessable isn't an issue,
and enumerability is even desirable in some cases, though if you want enumerability,
you should almost certainly use the numeric ID directly, and not TESID.
And if you don't care about guessability and don't have much chance for type confusion,
then you may also not care about TESID reuse across types.)
I went through a number of different possibilities before settling on a delightfully simple design,
so delightful and simple that I wove it into the libraries:
id → id × sparsity + discriminant
- Before encryption, multiply the ID by a sparsity factor, and add a discriminant.
- After decryption, subtract the discriminant, and divide the ID by the sparsity factor,
  failing if the subtraction goes negative or the division leaves a remainder.

It's that simple, and solves both of the problems, and more.
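As a sketch of that pair of operations (plain Python; the encryption step in between is omitted entirely):

```python
# Sketch of the sparsity + discriminant transformation described above.
# Encrypt the sparsified value; decrypt before desparsifying.

def sparsify(id: int, sparsity: int = 1, discriminant: int = 0) -> int:
    return id * sparsity + discriminant

def desparsify(value: int, sparsity: int = 1, discriminant: int = 0) -> int:
    value -= discriminant
    if value < 0:
        raise ValueError("not a valid ID (discriminant mismatch)")
    id, remainder = divmod(value, sparsity)
    if remainder != 0:
        raise ValueError("not a valid ID (sparsity mismatch)")
    return id

assert sparsify(3, sparsity=1000, discriminant=1) == 3001
assert desparsify(3001, sparsity=1000, discriminant=1) == 3
```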
As an example, suppose sparsity = 1000 (and no discriminant, i.e. discriminant = 0).
This means that your ID sequence {0, 1, 2, …} becomes {0, 1000, 2000, …}.
Then the number is encrypted, scrambling these small values in the range [0, 1048575].
Due to this sparsity factor, the smallest block size will only be used for IDs 0–1048, rather than 0–1048575.
Thus, only 1049 out of the possible 1048576 four-character values will be valid TESIDs (0.1%),
uniformly randomly distributed because of the encryption.
If someone's trying to guess an ID,
each string they try has only one chance in a thousand of being valid. [But do remember that each guess is 1/1000;
it takes 693 guesses to exceed a 50% probability of finding at least one valid ID:
log(½) ÷ log(1 − 1/1000) ≈ 692.8]
And then discrimination: suppose we used that sequence for one type,
but now we have a second type and we want it to have completely separate TESIDs with no overlap.
For this new type, we use discriminant = 1.
Now, its ID sequence {0, 1, 2, …} becomes {1, 1001, 2001, …}.
Again, we'll end up with only 0.1% of potential TESIDs being valid for this type,
a different 0.1% from the first type.
In this example, we can also perform what I call a split decode:
we decrypt the number and divide integrally by sparsity to get the ID,
and understand the remainder to be the discriminant.
Thus, the value that decrypts to 3000 is understood to be {ID 3, discriminant 0};
and 3001 {ID 3, discriminant 1}.
This can be useful for error reporting: "you gave me an ID of type A, but I need an ID of type B".
(Mind you, it can also be worthwhile adding a type marker to your IDs in some cases,
e.g. prefix it with a capital letter (e.g. T) or a lowercase word and underscore (e.g. typename_).
It's not possible to determine the discriminant from a TESID without decrypting it.)
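In code, a split decode is just a divmod on the decrypted value; as a sketch:

```python
# Sketch of a split decode: recover both the ID and the discriminant.
# The caller interprets the remainder as the discriminant.

def split_decode(decrypted: int, sparsity: int) -> tuple[int, int]:
    id, discriminant = divmod(decrypted, sparsity)
    return id, discriminant

assert split_decode(3000, 1000) == (3, 0)   # {ID 3, discriminant 0}
assert split_decode(3001, 1000) == (3, 1)   # {ID 3, discriminant 1}
```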
This is all stuff that can in theory be done at the database layer,
but databases and related tooling tend to have poor support for stride in sequences,
and not great support for a sequence offset either.
Plus there's also the likely 2⁶³ − 1 limit there.
Given how compact and self-contained a notion it is,
and how much more convenient,
I reckoned it warranted deep inclusion in TESID.
I've written more about this topic in the design rationale page.
If you're interested in doing anything beyond small type discrimination,
I recommend reading that document,
as I've gone through some of the possible applications and considerations.
Interactive Table of Ranges for seeing the effects of sparsity and discrimination
The Table of Ranges above considered the range of the ID to encrypt.
This form of the table shows ranges for the original ID, before sparsification and discrimination,
to give you an idea of how long your IDs will be under various sparsity and discriminant values.
 
On the security of the cryptography
I'm using Speck, which is believed to be cryptographically secure,
but with different parameters that I have chosen myself,
and an altered key expansion technique,
to make life easier when encrypting with diverse block sizes.
I have taken care so that I believe the result is cryptographically secure to approximately 128 bits of security,
which is very significant overkill for the purpose;
but I am not a professional cryptographer,
so you should be sceptical of my claims.
I have written more in the design rationale page,
including some reasonably detailed [For a layman.] analysis of the Speck parameters and how I chose them.
I invite cryptographers to analyse my work.
On other kinds of weaknesses:
- Some of the block sizes are quite small, which would render them quite vulnerable to birthday attacks if used in modes like CBC, [See, for example, the Sweet32 attacks, where something
  like a hundred billion repetitions of the same partially-known,
  partially-unknown data allows breaking an otherwise-secure cipher;
  or RC4 NOMORE, which exploits a weakness in the cipher to need only about a billion repetitions.] but those attacks are inapplicable here, since we're encoding
  single blocks, effectively in ECB mode.
- Although TESIDs appear random,
  it is often possible to infer properties of the underlying IDs relative to one another.
  Certain operations may assign adjacent IDs, for example;
  and if you pair IDs with creation times you obtain an order for their TESIDs.
  This doesn't normally take you to the point of known-cleartext,
  but if there were a viable known-cleartext attack,
  it could possibly help a little.
  (In practice, those sorts of attacks require very large numbers of encrypted values,
  and TESID's ECB-like mode is largely just not susceptible.)
My personal, moderately-informed opinion is that, in the absence of algorithmic
weaknesses in Speck, there will be no attack more effective than brute force,
which is completely infeasible.
If I've scared you so that you're reconsidering using TESID:
this is an unconventional application of cryptography,
in a context where I think literally every kind of cryptographic attack I know of is not possible.
I'm only being meticulous [For a layman.] out of principle.
For where TESID is to be used, I believe the encryption to be completely invincible.
What I want you to understand is that this is Real Encryption
of the type that governments stopped you from exporting back in the '90s,
not the totally fake stuff that Hashids uses,
or the accidentally extremely weak stuff of Optimus.
The encryption key is permanent
There is one last security concern that I must mention here,
the one practical weak spot in the entire scheme.
The most important weakness of all real encryption schemes is the key:
if someone gets the key, it's all over.
Therefore, the goal in most places where encryption is used is to change keys regularly,
to minimise what is compromised by compromise of the key.
In transport usage, perfect forward secrecy is preferred, so that each session is encrypted with a different key.
TESID has only one key, and it's permanent.
You can't rotate keys without breaking all existing IDs,
which is unlikely to be acceptable.
This is a risk in using TESIDs as authentication [That is, granting access to an entity to anyone who knows its TESID, but not publicly sharing that TESID.]:
if someone has the key, they can trivially encode, decode and enumerate TESIDs.
As an example of this risk, take YouTube.
Credible sources say that their video IDs used to use base64 encoding of 3DES encryption of 64-bit integer keys,
and even stored the key in the source code in multiple places.
This was a curiously risky state of affairs,
given that they do use IDs as authentication in unlisted videos,
so with the encryption key (easily obtained if you could access the code; it wasn't even a production environment secret)
you could easily enumerate video IDs and rapidly find unlisted videos,
which could potentially contain secrets.
It took them until 2016 (around ten years) before they switched to randomly-generated IDs,
which are secure against enumeration (presuming the random number generator is secure).
That's a long time for them to consider the risk not all that important.
From the fact that they locked down old unlisted videos in an orderly fashion six months after the algorithm change with a month's notice,
I conclude that they monitored the situation and concluded that no one was abusing the key; the access patterns would be very easy to spot.
You must determine how you rate this risk.
I don't think this is a great concern for most deployments,
but weigh it against your desire for IDs-as-authentication, especially for long-lived IDs.
The concept of error detection
This is something not implemented in TESID,
but which you could add fairly easily.
If you valued the ability to statically detect errors without full decoding,
you could add a check character using the Damm algorithm
with an order-32 table and the same base-32 alphabet.
If you really wanted to, you could even support an optional check character,
since TESIDs are all even in length,
and your check character would make the length odd.
It's just a thought, really, one that I was favouring earlier on,
but since adding the concept of sparsity I don't think it's so useful,
as you can easily make it so that the vast majority of single-character errors won't produce valid TESIDs.
However, there is something nice about guaranteeing that all
single-character errors and adjacent transposition errors will be detected.
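To make the idea concrete, here's one possible sketch of such a check character (my own illustration, not part of TESID or its libraries): it's Damm-style, using a totally anti-symmetric quasigroup built from GF(2⁵) as x ∗ y = 2x ⊕ y rather than a literal order-32 lookup table, and the alphabet shown is illustrative rather than necessarily TESID's.

```python
# Sketch of a Damm-style check character over a base-32 alphabet.
# The quasigroup is x * y = (2·x in GF(2^5)) XOR y, which is totally
# anti-symmetric, so single-character errors and adjacent transpositions
# should always be detected. Alphabet is illustrative only.

ALPHABET = "23456789abcdefghijkmnpqrstuvwxyz"

def _mul2(x: int) -> int:
    # Multiply by 2 in GF(2^5), reduction polynomial x^5 + x^2 + 1.
    x <<= 1
    if x & 0b100000:
        x ^= 0b100101
    return x & 0b11111

def _step(interim: int, digit: int) -> int:
    return _mul2(interim) ^ digit

def check_char(s: str) -> str:
    interim = 0
    for c in s:
        interim = _step(interim, ALPHABET.index(c))
    # Choose d so that interim * d == 0, i.e. d = 2·interim.
    return ALPHABET[_mul2(interim)]

def is_valid(s_with_check: str) -> bool:
    interim = 0
    for c in s_with_check:
        interim = _step(interim, ALPHABET.index(c))
    return interim == 0

tesid = "w2f9"                                 # a four-character ID (made up)
protected = tesid + check_char(tesid)
assert is_valid(protected)
assert not is_valid("w2g9" + protected[-1])    # single-character error detected
assert not is_valid("2wf9" + protected[-1])    # adjacent transposition detected
```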
Anyway, just thought I'd mention it.
Comparisons
Hashids
Hashids is fairly well-known, with implementations in
many languages. In its earlier days, it made some claims about security, but
they have thankfully dropped those now, because Hashids' security is awful,
bad almost beyond my wildest imagination. Treat it as easily-reversible
obfuscation only.
Hashids generates non-sequential strings from integers, or even sequences of integers.
(At first I couldn't think of any use case for this feature,
but once the issue of TESID reuse across types was pointed out to me in early testing,
and once I implemented sparsity and discriminant,
I came to suspect that might be a situation where encoding a sequence of integers makes sense.
But overall, I'm going to ignore the feature here as it doesn't really add anything.)
Hashids doesn't encrypt IDs at all, but instead shuffles the alphabet.
When encoding a number, it rotates the alphabet by that number,
recording that as the first character of the output,
and then converts the number to a string using this rotated alphabet.
This is… look, this is just so bad it's basically negligence. [As for the claims it used to make about security, I can't honestly see any way that they weren't bald-faced lies.
It's either that or staggering and fairly implausible incompetence.] A significant fraction of ten-year-olds, and not a few five-year-olds,
would spot the problem:
if you can get a system using Hashids to emit sequential IDs (which is very common),
it literally tells you its alphabet.
If the alphabet starts 5N6y2rl…,
0 maps to a string that starts with 5,
1 to a string starting with N,
2 → 6…,
3 → y…, &c.
In its default configuration, all you need is 44 sequential IDs, and you've got
the key to completely reverse Hashids.
And that's hardly the only way perfectly normal usage patterns will lead to it disclosing its scrambled alphabet.
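As a toy model of that structure (emphatically not the real Hashids algorithm, just the shape described above, with a made-up alphabet), you can see how sequential IDs hand over the shuffled alphabet:

```python
import random

# Toy model of the weakness described above; NOT the real Hashids algorithm.
# The first output character is taken from the secret shuffled alphabet at
# position (n mod len(alphabet)); the alphabet contents don't matter here.

secret_alphabet = list("abcdefghjkmnpqrstuvwxyz23456789")
random.shuffle(secret_alphabet)              # this shuffle is the "key"

def toy_encode(n: int) -> str:
    k = n % len(secret_alphabet)
    lead = secret_alphabet[k]
    rotated = secret_alphabet[k:] + secret_alphabet[:k]
    digits = ""
    while True:
        n, d = divmod(n, len(rotated))
        digits = rotated[d] + digits
        if n == 0:
            break
    return lead + digits

# An attacker who obtains encodings of sequential IDs reads the shuffled
# alphabet straight off the first characters:
recovered = [toy_encode(i)[0] for i in range(len(secret_alphabet))]
assert recovered == secret_alphabet
```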
Before I understood what Hashids was actually doing,
I was inclined to give it credit for avoiding English curse words,
but ignoring its ability to encode a sequence of numbers into a hashid,
their technique is no different from just removing the chosen 14 characters from the alphabet.
Concerning the strings themselves, ignoring any aspects of weakness:
Hashids in its default configuration produces strings of length 2 + floor(log₄₄ n),
for each number in the sequence of numbers (typically one).
This is shorter than TESID at times and for extremely large values,
but most of the time TESID is one character shorter or the same length.
As for a type discriminant, done as a separate number that'll cost you two characters in Hashids,
more than in TESID with sparsity below 1024.
In Hashids you can set a minimum length, which is similar to what sparsity/discriminant can do in TESID.
Add to all this that TESID doesn't mix alphabetic cases, and TESIDs are clearly more user-friendly.
With all said and done, I strongly recommend against using Hashids in any
situation. It's just bad, and gives an extremely false impression of security.
If you care to avoid curse words and the likes,
see my suggestions below;
even the simple suggestions will do about as good a job as Hashids,
though in a slightly different way.
Optimus
Optimus
(first implemented for PHP, and there's a Go port too)
is a library for quickly encrypting IDs.
It uses a technique that's about as fast as possible,
but unfortunately has chosen poor parameters that render it cryptographically very weak
(though Hashids is still worse).
Optimus uses two keys, a prime-key (a large prime number in the encryption domain)
and a xor-key (a random value in the encryption domain).
The inverse of the prime-key must also be provided for decryption.
It also takes a parameter bits which controls the encryption domain,
which defaults to 31 (because of PHP's use of signed integers),
meaning the cipher operates on values in the range [0, 2³¹).
Optimus does not stringify the encrypted IDs: you're still dealing with integers.
This means that you'll normally get a ten-digit ID.
This is the algorithm for encrypting id:
encrypted-id = ((id × prime-key) mod 2^bits) xor xor-key
Skipping the XOR part, which adds no cryptographic complexity,
this is the algorithm of a Linear Congruential Generator of the form m a power of 2, c = 0,
stepping the generator forwards one place to encrypt, and backwards one place to decrypt.
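A sketch of that scheme (plain Python, not the Optimus library; the key values are made up for illustration):

```python
# Sketch of the Optimus-style LCG encryption described above.
# Keys here are made-up illustrative values, not defaults of the library.

BITS = 31
PRIME_KEY = 1000000007                        # a prime below 2^31
PRIME_INVERSE = pow(PRIME_KEY, -1, 2**BITS)   # modular inverse, used to decrypt
XOR_KEY = 1234567890                          # an arbitrary value below 2^31

def encrypt(id: int) -> int:
    return ((id * PRIME_KEY) % 2**BITS) ^ XOR_KEY

def decrypt(encrypted: int) -> int:
    return ((encrypted ^ XOR_KEY) * PRIME_INVERSE) % 2**BITS

assert decrypt(encrypt(123456)) == 123456
```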
LCGs in general have some known weaknesses (read the rest of that Wikipedia article),
but that specific class of LCGs is particularly bad for its weaknesses.
The most catastrophic one here is that they work terribly with even seeds,
meaning that half the IDs you encrypt are barely encrypted at all. I believe [But I'm not a professional cryptographer and also haven't tried what I suggest here.] that with fewer than twenty sequential or randomly-but-evenly-spaced IDs,
you may be able to calculate at least the last few bits of the xor-key,
and several bits of the prime-key's entropy.
And it won't take all that much more to chip away at the rest of the keys.
To demonstrate the badness of this class of LCGs,
try bits = 5 and prime-key = 29,
which corresponds to lcg(a=29, c=0, m=32):
the cycles of this LCG are [[0], [1, 29, 9, 5, 17, 13, 25, 21], [2, 26, 18, 10], [3, 23, 27, 15, 19, 7, 11, 31], [4, 20], [6, 14, 22, 30], [8], [12, 28], [16], [24]].
0, 8, 16 and 24 encrypt to themselves.
4 and 20, and 12 and 28, to each other.
&c., and you can clearly see how much worse the even numbers are than odd.
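You can reproduce those cycles with a few lines of Python:

```python
# Compute the cycles of lcg(a=29, c=0, m=32), as quoted above.
# With a odd and m a power of 2, x -> a*x mod m is a permutation,
# so every element lies on a pure cycle.

def lcg_cycles(a: int, m: int) -> list[list[int]]:
    seen, cycles = set(), []
    for start in range(m):
        if start in seen:
            continue
        cycle, x = [], start
        while x not in seen:
            seen.add(x)
            cycle.append(x)
            x = (a * x) % m
        cycles.append(cycle)
    return cycles

print(lcg_cycles(29, 32))
# [[0], [1, 29, 9, 5, 17, 13, 25, 21], [2, 26, 18, 10], ...]
```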
The only redeeming feature of the parameters chosen is the use of a prime a,
which is more likely to produce results as good as you can get given c = 0 and m a power of 2
(though try lcg(a=31, c=0, m=32) if you dare);
but compare it even to what that page describes about an LCG with well-chosen Hull–Dobell Theorem parameters,
with its single cycle and better distribution, and… yeah, this one's very, very bad.
I myself started TESID with the LCG approach in mind,
but the description of the difficulty of finding good parameters put me off it: it's clearly possible to do a fairly good job with LCGs
(with known but generally acceptable weaknesses, for this application),
but frightfully easy to make a mess of it.
This is why I decided to go with block ciphers: they're somewhat more computationally expensive,
but much more reliable. I used RC5 first (which officially supports 32-bit and 64-bit block sizes,
unlike such as Skipjack, where Skip32 is a variant added after the fact,
like my custom Speck word sizes that I now use),
then switched to Speck at some point because its algorithm looked prettier,
and I no longer had to choose key length and how many rounds to do.
And it turned out Speck was faster than RC5 with its recommended parameters, too.
But seriously, my reasons for switching from RC5 to Speck were very flimsy and came down more to vibes.
I think I was vindicated once I switched to custom word sizes, though.
UUID
I'm focusing on techniques for encoding sequential identifiers, which UUID isn't,
but I include this comparison for completeness and convenient reference,
since I intend TESID to be a good replacement for UUIDs in certain places they're commonly used.
Précis:
- UUIDs are good for distributed ID allocation, TESIDs for centralised.
- UUIDs are long and unwieldy, TESIDs are short (unless you deliberately make them long) and human-friendly.
- If using versions of UUID other than 4, UUIDs leak information that TESIDs don't.
- Various of the things other versions of UUIDs do can actually be done with TESID too.

In terms of my goals, UUID categorically fails goals 1 (computed from a numeric ID) and 4 (short strings),
and doesn't fare well at goal 3 (human-friendly IDs).
In more detail:
UUIDs are 128-bit values.
In the best-case scenario, a proper UUID data type,
they'll take 128 bits of space (16 bytes) in your database.
But it's common to end up storing them in canonical textual representation,
where they'll probably take at least 320 bits (40 bytes).
For presentation to users, canonical textual representation is most common,
which is 36 characters long (32 hexadecimal characters, four hyphens).
Other representations are possible; the shortest reasonable one is base-64,
which is still quite unwieldy at 22 characters.
The strength of UUIDs is that they're suitable for distributed allocation.
If you use UUIDv4, you just generate 122 bits [Six bits are reserved for indicating version 4 and variant 1.
It's not unheard of to just generate a 128-bit value and produce a nominally-invalid UUID.] of randomness,
and use that, so you can do it on different machines with no trouble.
There is a slight risk of collision, but it's generally negligible.
There are also other versions of UUIDs that take other pieces of information,
such as a timestamp or a node MAC address.
This conveniently stores information that you could potentially have wanted to store anyway.
If exposed publicly, this can allow identification of nodes or times
(which may be desirable or undesirable).
But all up, UUID is heavily based on randomness.
Also worthy of mention here is ULID,
which uses a 48-bit timestamp (milliseconds) and 80 bits of randomness
(choosing a new random value every millisecond,
and just incrementing the random value within the millisecond, so it's not perfect for distributed use,
though the risk of collisions or overflow is probably still negligible),
and a (26-character) base-32 string encoding (Crockford's alphabet, and uppercase at that, which is unwise).
A deliberate feature of ULID is that it is lexicographically sortable.
The draft UUIDv7 is similar,
with 48 bits of timestamp, 6 bits of version and variant,
and 74 bits of randomness (but without the intra-millisecond incrementing behaviour).
On the database side, TESIDs are integers.
They will probably take 64 bits (8 bytes) to store, which is much less than UUIDs.
In encoded textual form, they're also much shorter than the canonical textual representation of UUIDs: typically 4–8 characters,
and as long as 20 characters in the most extreme case with extraordinarily high IDs and sparsity.
TESID is not immediately suitable for distributed allocation,
and something like UUIDv4 is not possible;
but something more reminiscent of UUIDv1 or the drafts UUIDv6–8 is readily possible,
by using techniques similar to sparsity and discriminant,
either on the ID in the database, or as part of TESID encoding.
Care will be required about number sizes, if doing it in the ID column in a database:
if your database only supports signed 64-bit integers,
a 48-bit timestamp would leave you with only 15 bits for the ID: 32,768 [My phone number ends in 32768. It was the largest power of two available.] possible IDs.
It may be wiser to use a compound (integer, integer) key and join them in application code,
though you may need to be careful about TESID's 100-bit limit.
It's also worth pointing out a meaningful difference in convention here:
things like UUIDs tend to put the "data" at the start,
and the randomness at the end,
because this tends to produce better results in databases,
allowing better locality and sharding behaviour.
But in TESID you do it the other way round,
with the "data" at the least significant bits,
so that the numeric values are as small as possible,
yielding strings that are as short as possible.
(Also, the locality and sharding reasons don't generally apply,
as the idea is that sparsity and discrimination are applied when needed, not stored that way.)
If you joined a 64-bit ID and a 48-bit timestamp with the timestamp in the most significant place, UUID-style,
then you get a 105-bit value for the next decade and a half (21 characters in base-32, though TESID actually doesn't go that high),
then 106-bit for the next seventy years, 107-bit for the next 140 years, and so on.
If you put the timestamp in the least significant place, then the value is only ceil(48 + log₂ id) bits:
in TESID terms, you'll get four 10-character IDs,
then 35 quadrillion or so 12-character IDs, even more squillions of 14-character IDs, and so on.
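A quick sketch of that bit-length arithmetic (Python; the particular ID value is arbitrary):

```python
import time

# Compare the two orderings described above: timestamp in the most
# significant place (UUID-style) versus in the least significant place.

id = 12345
timestamp_ms = int(time.time() * 1000)   # a 48-bit millisecond timestamp suffices until ~10889 CE

uuid_style = (timestamp_ms << 64) | id   # timestamp in the high bits
tesid_style = (id << 48) | timestamp_ms  # timestamp in the low bits, keeping the value small

print(uuid_style.bit_length())    # ~105 bits today
print(tesid_style.bit_length())   # 48 + id.bit_length() = 62 bits for this id
```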
TESIDs deliberately minimise information leakage: thanks to the use of encryption,
the only leak is the length of the string, which discloses an epoch.
All up, TESID is grounded in deterministic techniques.
There are no collisions. Strings are as short as possible.
Generation is theoretically not significantly different;
if anything, it's a little easier, as you're more likely to be able to lean on the database to handle things.
But in practice, if you want to avoid bad patterns in the thorough rather than probabilistic way,
TESIDs can be harder to work with for ID allocation,
because you may have to fight against what your tooling wants to do,
and in a stateful manner (you need to know where in the ID sequence you're up to),
whereas with UUIDs you probably already had to perform ID allocation yourself,
but at least the ID allocation function was effectively stateless.
As for performance: neither UUID nor TESID is at all likely to be a bottleneck.
With UUIDs, you're working with larger values in both database and application,
almost certainly doing a string allocation in application,
and possibly having a variable-width column in the database (which tends to slightly harm performance).
For TESID, assuming the Rust implementation, which is the most efficient,
you're working with smaller values and need no string allocation;
and a synthetic benchmark on my AMD Ryzen 5800HS laptop (2021) shows TESID encoding and decoding each taking well under 100ns.
At a guess, it's probably much of a muchness;
I think microbenchmarks would probably generally favour TESID over UUID,
especially if you're using a database without a real UUID type,
but I doubt that realistic benchmarks would distinguish them.
Others
Read Methods for obfuscating sequential IDs by Jiayu Yi.
Itâs a great collection of links and descriptions of things in this space.
It doesn't cover the variable-width stuff that I've made a key feature of TESID,
but it's an excellent survey, and if you've been reading this page of mine,
you probably want to read that one too.
The library Presents provided by that author at the end is cryptographically sound,
using real 64-bit block ciphers and stringifying with a base-62 alphabet with no leading padding,
for strings of up to 11 characters. (That is, 95% [(2⁶⁴ − 62¹⁰) ÷ 2⁶⁴] of IDs will be 11 characters long.)
Considerations when choosing TESID
I reckon that TESID is generally better than the alternatives that people tend to use,
but it's not without issues either, and there are a few things you should be aware of before choosing to use TESID.
- In naive usage, TESID is particularly susceptible to generating IDs containing undesirable patterns (swear words and such).
  This is a problem that I have deliberately left partially unsolved out of the box;
  there are ready solutions that I find reasonable,
  but the most thorough do require extra effort.
  You will want to think about this before using TESID.
- You will have to store the encryption key somewhere.
  (By contrast, full-randomness approaches have nothing to store.)
  The simplest technique for this is hard-coding it in the source code,
  which may be suitable for single-deployment instances
  (that is, where you're not distributing the code to others,
  in which case you'd probably want each instance to have its own key).
  It feels icky to store such a secret in the code,
  but YouTube got away with doing this until 2017,
  and I must admit that it is sometimes a pragmatic approach.
  Some other options are environment variables, config files,
  somewhere in the database,
  and a full-on secrets management tool.
  All of these may be reasonable.
  It probably depends most on what you're using already,
  since you probably have other secret values already.
- You cannot change the encryption key:
  it's permanent, and if it leaks, IDs become trivially enumerable.
  If you use IDs as authentication, this is a risk to weigh.
Terminology and spelling
TESID stands for Textualised Encrypted Sequential Identifiers.
I pronounce it as the two-syllable /tɛsɪd/.
I have called the process of converting a number to a string encoding, not
encrypting; and its inverse decoding, not decrypting. Where a type exists for
containing the expanded key with methods to perform the encoding and decoding,
this type is called a coder.
In libraries, a TESID coder type is likely to be named TesidCoder or TESIDCoder.
There may also be variants like TypedTesidCoder or TypedTESIDCoder,
for more convenient handling of sparsity and discriminant parameters.
Here and in the docs, I mostly call the result of encoding an ID a TESID
(a textualised encrypted sequential identifier),
but in real deployments you'll probably normally just call it an ID.
The problem of undesirable patterns (swear words and such)
Because it uses encryption (which is equivalent to randomness) and a reasonably large alphabet,
TESID is susceptible to one problem that may be deemed significant:
you can end up with undesirable patterns in the generated strings.
For example, if you have allocated all the IDs from 0 to 2²⁰,
then all four-letter words containing neither l nor o will be valid TESIDs.
As a benign example, with the key 000102030405060708090a0b0c0d0e0f,
the ID 921366 encodes to "rust",
and 336828232012 to "sumthing".
This is not a novel problem:
anything that generates random values will be susceptible.
However, TESID is a little more heavily affected than things that generate long random strings,
because here the undesirable pattern is much more likely to be the entire ID,
rather than just four characters somewhere in the middle of a much longer string.
(Though you're more likely to get such undesirable patterns appearing somewhere in the longer strings.)
TESID deliberately does not try to mitigate this,
because I don't like the cost of naive mitigation
(generally involving a much shorter alphabet,
such as starting with removing the remaining vowels).
If this concerns you, here are four suggestions:
two easy and probabilistic,
one more difficult but thorough,
and one forking and changing the library:
- The easiest solution, recommended for general use:
  use sparsity, to reduce the probability of an undesired pattern occurring.
  With sparsity 256, for example,
  the probability of each four-letter word occurring is only 1/256, around 0.4%.
  (Though if you use type discrimination, it increases again proportionally.)
  Maybe that's enough for you.
- Skip four-character IDs and start at six,
  where the patterns are less likely to occur and much less likely to be the whole string,
  by beginning your ID sequence at 2²⁰ ÷ sparsity
  (if not sparsening, that means starting at 1048576;
  but if not sparsening, realise that by the time you get to 2³⁰ (a billion-odd IDs),
  about 99.9% of six-character IDs will be in use).
- The best solution, recommended if you're willing to put the effort in:
  before using an ID, encode it and scan for undesirable patterns,
  and skip to the next if you decide you don't like it. [This pattern is commonly seen when generating random IDs,
  since it's easy there as you were already generating them in application code.
  nanoid-good is one example.] (There's a sketch of this loop after this list.)
  This is admittedly likely to be more bothersome to arrange,
  as you might have been offloading the sequence to your database,
  but it's really the only way to do a good job of avoiding such patterns.
  (I say this despite the alternative of a smaller, vowel-free alphabet,
  as using such a smaller alphabet introduces problems of its own in human-friendliness
  (there's just something off about string IDs that never include vowels, for starters),
  and also this approach allows more flexible filtering for truly superior human-friendliness,
  such as disallowing same-character repetition, which is error-prone in dictation.)
  You'll need to either do the checking in the database
  (not a bad idea, frankly: you could probably move all the TESID stuff there quite beneficially),
  or with an extra round trip to the database on any inserts.
  The in-database technique will vary by database,
  and ORMs and the likes may not be fond of you.
  It's certainly harder than just "here, use this library in your code,
  and keep using your database's standard auto-incrementing primary key",
  so I'm deferring any sort of official recommendation for now,
  though I will be pursuing it as an optional extra.
  Just one hint: for SQLite, it may be best to use INTEGER PRIMARY KEY AUTOINCREMENT
  for both its automatic sequence maintenance and its ROWID optimisation,
  and mess with sqlite_sequence before each INSERT.
  But this does run the risk of forgetting to pump the sequence check.
  I intend to write a library for filtering potentially bad patterns at some point.
  I have in mind preferring broad rules like "no more than three letters in a row"
  and "the same character must not occur three times in a row",
  rather than specific bad-word matching
  (though if the rules were as blunt as these suggestions,
  you might still want a filter for certain three-character sequences).
  Such aggressive rules as these disqualify a large fraction of IDs as the string gets long,
  over 90%, so more nuance at greater lengths might be warranted,
  perhaps considering vowel and consonant patterns,
  though it's not like you end up running out of IDs
  (1% of a zillion is still more like a zillion than like zero), but ID allocation would slow down, heading towards a microsecond.
  (Run the Rust rejection_rates example for a demonstration.)
- Fork TESID with a different, smaller alphabet.
  (I'd suggest base-16, as its half-byte alignment makes it the easiest to switch to,
  and it has nice properties at the boundaries too;
  but I wouldn't use the hexadecimal alphabet,
  as it feels programmery and also mildly encourages numeric treatment, which would break the occasional ID;
  instead I'd optimise for avoiding undesirable patterns and minimising confusables of any kind.
  Spending just a minute whittling it down once,
  I ended up with 346789cdghjkprtz.
  But as I hinted above, I find such an alphabet as this mildly disconcerting,
  though I can't clearly explain why.
  It's just too small in the wrong way, and the lack of vowels is somehow unpleasant.)
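As promised in the third suggestion, here's a sketch of the skip-and-filter loop (plain Python; encode stands in for whatever TESID encoder you use, and the rules are the blunt ones mentioned there, not anything shipped with TESID):

```python
import re

# Sketch of suggestion 3: before using an ID, encode it, scan the string
# against some blunt rules, and skip to the next ID if it looks undesirable.

LETTER_RUN = re.compile(r"[a-z]{4}")    # no more than three letters in a row
TRIPLE_RUN = re.compile(r"(.)\1\1")     # same character three times in a row

def looks_undesirable(s: str) -> bool:
    return bool(LETTER_RUN.search(s) or TRIPLE_RUN.search(s))

def allocate(next_id: int, encode) -> tuple[int, str]:
    """Return (id actually used, its encoding), skipping undesirable ones."""
    while True:
        s = encode(next_id)
        if not looks_undesirable(s):
            return next_id, s
        next_id += 1
```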
 
Further reading
Two more pages:
- Algorithms: prose descriptions of the TESID encoding and decoding algorithms, including the pieces that make them up.
- Design rationale: some analysis of the reasoning behind TESID and specific design choices.