What kind of Primary Key should you be using in your application?

Why Obfuscate?

Using sequential ids can leak information about your applications.

  • Estimated counts - If all of your entities are auto incrementing a competitor could estimate how many users or how much content you have. This isn’t a huge deal and could be negated by setting an arbitrary initial auto increment value.
  • Parameter Injection - If all of your urls are formatted /user/{userId} anyone can easily just change the id in the url. Ideally you have authorization in place to prevent sensitive data from being accessed. This is just an extra precaution.
  • Automated Scraping - Using parameter injection it would be easy to write a script from 0 to some max id and scrape all data from your website without obfuscated ids. Obfuscated ids won’t prevent this if all ids are public somehow but will make them work a little harder.

k-sortable ids

k-sortable ids are ids that can be roughly sorted so they order themselves by the time they were created. This could be a potential problem, since attacked may then be able to guess the next ID. See https://www.intruder.io/research/in-guid-we-trust

Note: All monotonically increasing (auto-increment, k-sortable), and timestamp-based ids share the security issues with Cuid. V4 UUIDs and GUIDs are also insecure because it’s possible to predict future values of many random algorithms, and many of them are biased, leading to increased probability of collision. Likewise, UUID V6-V8 are also insecure because they leak information which could be used to exploit systems or violate user privacy. Here are some example exploits:

ID Types

Auto Incrementing Sequential Integer

Easy and cheap

May give away information if it’s presented in the frontend application.

See German Tank Problem.

UUID/GUID - Universally Unique IDentifier

UUID

ex: [9EE44A7C-0485-4AF4-8F72-2CFE80043A4A]

A UUID is 128 bits long, and can guarantee uniqueness across space and time.

This specification defines a Uniform Resource Name namespace for UUIDs (Universally Unique IDentifier), also known as GUIDs (Globally Unique IDentifier).

What’s the difference between GUIDs and UUIDs?

GUID is Microsoft’s implementation of the UUID standard.

Cons

UUID can be suboptimal for many uses-cases because:

  • doesn’t cluster well, produces terrible locality, and no insight as to when it was generated
  • It isn’t the most character efficient way of encoding 128 bits of randomness
  • UUID v1/v2 is impractical in many environments, as it requires access to a unique, stable MAC address
  • UUID v3/v5 requires a unique seed and produces randomly distributed IDs, which can cause fragmentation in many data structures
  • UUID v4 provides no other information than randomness which can cause fragmentation in many data structures
  • Sections are hard to parse for a human and are UUID version dependent. If we can’t directly interpret each section, the dashes on the string representation add no value
  • If we only care about an unique variant/version, we wouldn’t need to encode the rest
  • Having to support more variants and versions, the parsing and generation is slower than it could be
  • Version 4, being fully random, produces fragmentation in many data structures, and aren’t sortable in a meaningful way
  • Version 4 are slow to generate, requiring 122 bits of good quality entropy
  • It’s just not the most efficient way to encode 128 bits into a string. For example, base64 without padding would require 22 characters, not 36

CUID/CUID2

CUID2

Secure, collision-resistant ids optimized for horizontal scaling and performance. Next generation UUIDs.

Cuid2 is not good for: Sequential ids (see the note on K-sortable ids, below) High performance tight loops, such as render loops (if you don’t need cross-host unique ids or security, consider a simple counter for this use-case, or try Ulid or NanoId).

import { createId } from '@paralleldrive/cuid2';

const ids = [
  createId(), // 'tz4a98xxat96iws9zmbrgj3a'
  createId(), // 'pfh0haxfpzowht3oi213cqos'
  createId(), // 'nc6bzmkmd014706rfda898to'
];

https://github.com/paralleldrive/cuid2

CUID (Deprecated)

ULID - Universally Unique Lexicographically Sortable Identifier`

[01ARZ3NDEKTSV4RRFFQ69G5FAV]

Features

  • 128-bit compatibility with UUID
  • 1.21e+24 unique ULIDs per millisecond
  • Lexicographically sortable!
  • Canonically encoded as a 26 character string, as opposed to the 36 character UUID
  • Uses Crockford’s base32 for better efficiency and readability (5 bits per character)
  • Case insensitive
  • No special characters (URL safe)
  • Monotonic sort order (correctly detects and handles the same millisecond)

Components

  • timestamp
  • randomness

Resources

Hashids

Hashids is a small open-source library that generates short, unique, non-sequential ids from numbers.

It converts numbers like 347 into strings like “yr8”, or array of numbers like [27, 986] into “3kTMd”.

You can also decode those ids back. This is useful in bundling several parameters into one or simply using them as short UIDs.

Nanoid

Optimus

Mongo ObjectID

[507f1f77bcf86cd799439011]

Format:

  • a 4-byte value representing the seconds since the Unix epoch,
  • a 5-byte random value, and
  • a 3-byte counter, starting with a random value.

Resources

xid

###[9m4e2mr0ui3e8a215n4g]

Xid is using Mongo Object ID algorithm to generate globally unique ids with a different serialization (base64) to make it shorter when transported as a string

  • 4-byte value representing the seconds since the Unix epoch,
  • 3-byte machine identifier,
  • 2-byte process id, and
  • 3-byte counter, starting with a random value.

Resources

KSUID - K-Sortable Unique IDentifier

[0ujsszwN8NRY24YaXiTIE2VWDTS]

KSUID is for K-Sortable Unique IDentifier. It’s a way to generate globally unique IDs similar to RFC 4122 UUIDs, but contain a time component so they can be "roughly" sorted by time of creation. The remainder of the KSUID is randomly generated bytes.

  • Sortable by Timestamp
  • No Coordination Required
  • Lexographically Sortable, Portable Representations

KSUIDs are 20-bytes: a 32-bit unsigned integer UTC timestamp and a 128-bit randomly generated payload. The timestamp uses big-endian encoding, to allow lexicographic sorting. The timestamp epoch is adjusted to March 5th, 2014, providing over 100 years of useful life starting at UNIX epoch + 14e8. The payload uses a cryptographically-strong pseudorandom number generator.

The string representation is fixed at 27-characters encoded using a base62 encoding that also sorts lexicographically.

Resources

Twitter Snowflake

To generate the roughly-sorted 64 bit ids in an uncoordinated manner, we settled on a composition of: timestamp, worker number and sequence number.

Sequence numbers are per-thread and worker numbers are chosen at startup via zookeeper (though that’s overridable via a config file).

Firebase PushIDs

[-JhLeOlGIEjaIOFHR0xd]

Push IDs are string identifiers that are generated client-side. They are a combination of a timestamp and some random bits. The timestamp ensures they are ordered chronologically, and the random bits ensure that each ID is unique, even if thousands of people are creating push IDs at the same time.

A push ID contains 120 bits of information. The first 48 bits are a timestamp, which both reduces the chance of collision and allows consecutively created push IDs to sort chronologically. The timestamp is followed by 72 bits of randomness, which ensures that even two people creating push IDs at the exact same millisecond are extremely unlikely to generate identical IDs. One caveat to the randomness is that in order to preserve chronological ordering if a client creates multiple push IDs in the same millisecond, we just ‘increment’ the random bits by one.

To turn our 120 bits of information (timestamp + randomness) into an ID that can be used as a Firebase key, we basically base64 encode it into ASCII characters, but we use a modified base64 alphabet that ensures the IDs will still sort correctly when ordered lexicographically (since Firebase keys are ordered lexicographically).

There’s one caveat to the chronological ordering of push IDs. Since the timestamp is generated client-side, it’s at the mercy of the client’s local clock, which could be incorrect. We make an attempt to compensate for these ‘skewed’ clocks to some degree. When a Firebase client establishes a connection, the Firebase server sends an accurate timestamp to the client so it can use that to correct its incorrect clock when generating push IDs.

Resources

Instagram

Instagram:

We’ve delegated ID creation to each table inside each shard, by using PL/PGSQL, Postgres’ internal programming language, and Postgres’ existing auto-increment functionality.

Each of our IDs consists of:

  • 41 bits for time in milliseconds (gives us 41 years of IDs with a custom epoch)
  • 13 bits that represent the logical shard ID
  • 10 bits that represent an auto-incrementing sequence, modulus 1024. This means we can generate 1024 >IDs, per shard, per millisecond

Resources


Related Notes