Deterministic unique ID generation

This document describes a way of creating universally unique IDs in a deterministic manner.

Problem description

Consider loading the same type of data from several external sources into a database. Each source provides its own unique IDs, but the number series overlap; for example, customer IDs arriving from two different CRM systems. We would like to be able to distinguish customer 001 in the first system from customer 001 in the second. Typically, a random UUID (universally unique ID, sometimes called a globally unique ID or GUID) is generated as a new ID for each entry. This allows each external data record to be uniquely identified, but introduces a dependency on the generated ID: if we lost the data, it would not be possible to recreate it from the sources, as the regenerated IDs would be new random values.

The UUID standard

The UUID format is standardized (RFC 4122), 128 bits long, and comes in different versions (“versions” in the sense of “flavors” rather than “generations”). Version 4 is randomly generated, whereas versions 3 and 5 are called “namespace-name” UUIDs. These versions can be used to generate a UUID deterministically from a name.

There are a few standardized namespaces, one of which is called the “DNS namespace”. A namespace is itself identified by a UUID; for the DNS namespace, that UUID is 6ba7b810-9dad-11d1-80b4-00c04fd430c8.

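A namespace-name UUID is created by applying the following operation to the bytes of the namespace UUID followed by the name, then overwriting a few bits (the version and variant fields) and encoding the result as a string: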

hash(Namespace_DNS + "domain.example.com")

Versions 3 and 5 are almost identical; the only difference is that version 3 uses MD5 and version 5 uses SHA-1 as the hashing algorithm. Since SHA-1 produces 160-bit hash values, the output is truncated to the first 128 bits. The security vulnerabilities of these algorithms in some cryptographic applications are not relevant to UUID generation.
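
To make the difference between the two versions concrete, here is a minimal sketch in Node.js. The function name nameBasedUuid is my own, hypothetical choice and is not part of any standard library; the sketch only assumes the built-in crypto module and follows the operation described above.

```javascript
const crypto = require('crypto');

// Hypothetical helper (name chosen for illustration only).
// version 3 hashes with MD5, version 5 hashes with SHA-1.
function nameBasedUuid(version, namespace, name) {
  if (version !== 3 && version !== 5) {
    throw new Error('Only versions 3 and 5 are name-based');
  }
  const algorithm = version === 3 ? 'md5' : 'sha1';
  const ns = Buffer.from(namespace.replace(/-/g, ''), 'hex');
  const digest = crypto.createHash(algorithm).update(ns).update(name, 'utf8').digest();

  // MD5 is already 16 octets long; SHA-1 is cut down from 20 to 16 octets.
  const bytes = digest.subarray(0, 16);
  bytes[6] = (bytes[6] & 0x0f) | (version << 4); // version in the four msb of octet 6
  bytes[8] = (bytes[8] & 0x3f) | 0x80;           // variant in the two msb of octet 8

  const hex = bytes.toString('hex');
  return [hex.slice(0, 8), hex.slice(8, 12), hex.slice(12, 16),
          hex.slice(16, 20), hex.slice(20)].join('-');
}

const NAMESPACE_DNS = '6ba7b810-9dad-11d1-80b4-00c04fd430c8';
console.log('v3: ' + nameBasedUuid(3, NAMESPACE_DNS, 'domain.example.com'));
console.log('v5: ' + nameBasedUuid(5, NAMESPACE_DNS, 'domain.example.com'));
```

The same name gives two different UUIDs, and the first digit of the third group (3 or 5) shows which version was used.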

The solution

Create a pseudo-DNS record using the following format:

local_id.country_code.source.entity.example.com

  • Local ID (local_id) is derived from the source data.
  • Country code (country_code) is the ISO 3166-1 alpha-2 two-letter country code for the data source, e.g. “se”, “fi” or “de”.
  • Source is the name of the source data system, e.g. “crm” or “erp”.
  • Entity is a descriptive name of the entity, e.g. “customer” or “contract”.
  • The remainder (example.com) is the domain of your organization.

The local ID is the hexadecimal string representation of the SHA-1 hash of any combination of data fields that, when concatenated and encoded as a UTF-8 string, uniquely identifies the record in the source data (often just a single numerical field that is used as the primary key). The reason for hashing is that we would like to create a compliant DNS string, so we can only use the characters a-z, 0-9 and hyphen (“-”), and must keep the total length to a maximum of 253 characters.
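
As a rough sketch of how the local ID and the pseudo-DNS record could be put together in Node.js (the helper names makeLocalId and makePseudoDnsRecord are hypothetical, chosen here for illustration only):

```javascript
const crypto = require('crypto');

// Hypothetical helper: hash the identifying field(s) into a DNS-safe label.
// Node.js returns the hex digest in lower case.
function makeLocalId(sourceKey) {
  return crypto.createHash('sha1').update(sourceKey, 'utf8').digest('hex');
}

// Hypothetical helper: local_id.country_code.source.entity.domain
function makePseudoDnsRecord(sourceKey, countryCode, source, entity, domain) {
  return [makeLocalId(sourceKey), countryCode, source, entity, domain]
    .join('.')
    .toLowerCase();
}

const record = makePseudoDnsRecord('12345', 'se', 'mycrm', 'customer', 'example.com');
console.log(record);
console.log(record.length <= 253); // true
```

A SHA-1 hash is always 40 hexadecimal characters long, so the local ID has a fixed, predictable length no matter how long the source fields are.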

Next, generate a version 5 UUID using the DNS namespace together with the pseudo-DNS record.

Example

Assume the following set of properties:

  • Local id = 12345
  • Country code = se
  • Source = mycrm
  • Entity = customer
  • Domain = example.com

The UUID is computed by:

  1. Create the id: sha1("12345".toUTF8()).toHexString().toLowerCase(), producing the value 8cb2237d0679ca88db6464eac60da96345513964
  2. Create the pseudo-DNS record: 8cb2237d0679ca88db6464eac60da96345513964.se.mycrm.customer.example.com
  3. Create the version 5 Namespace_DNS UUID: NamespaceUuid("6ba7b810-9dad-11d1-80b4-00c04fd430c8", "8cb2237d0679ca88db6464eac60da96345513964.se.mycrm.customer.example.com"), resulting in the UUID f1e2ec1a-d04f-5ad0-af06-d8a3eb545607.

The output of the last step is the deterministically generated ID, which is both unique and possible to regenerate from the source data if needed.

A little more on local ID’s

The local ID may not always be explicitly available. In these cases, it is often desirable to find a combination of fixed attributes that will uniquely identify the record (sometimes described as a “natural key”).

For example, let’s say we want to extract user data from a multi-tenant system where the login name is unique within a tenant. A natural key could be produced by concatenating the login name with the tenant name. To support any type of character and avoid ambiguity, always use the same encoding (I recommend UTF-8) for the strings. For the login name “foo” in tenant “bar”, a local, unique ID can be generated by sha1(("foo" + "bar").toUTF8()).toHexString().toLowerCase(), or described stepwise:

  1. Concatenate strings “foo” and “bar”, encode as UTF-8
  2. Compute the SHA-1 hash of the encoded string
  3. Convert the hash to hexadecimal string representation in lower case.
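
The three steps above can be sketched in Node.js as follows. The function name naturalKeyId is hypothetical, and the parts are joined without any separator, exactly as in the “foo” and “bar” example:

```javascript
const crypto = require('crypto');

// Hypothetical helper: the parts must always be passed in the same order.
function naturalKeyId(parts) {
  // Step 1: concatenate the strings and encode as UTF-8
  const encoded = Buffer.from(parts.join(''), 'utf8');
  // Steps 2 and 3: SHA-1 hash, returned as a lower-case hex string
  return crypto.createHash('sha1').update(encoded).digest('hex');
}

const id = naturalKeyId(['foo', 'bar']);
console.log(id);
console.log(id === naturalKeyId(['foo', 'bar'])); // true, the ID is reproducible
console.log(id === naturalKeyId(['bar', 'foo'])); // false, the order matters
```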

Notes

Here are some final notes that should be considered.

  • UUIDs are specified to be case-insensitive, so always store them in one consistent case (all upper or all lower) and convert to the same case when comparing.
  • ISO 3166-1 alpha-2 codes are not completely identical to the country codes used as top-level domains; for example, “gb” is the ISO 3166-1 alpha-2 code for the United Kingdom, whereas its top-level domain is “uk”.
  • Always construct the pseudo-DNS name using lower case letters only.
  • Always use UTF-8 encoding of the source ID as input to the SHA-1 hash function.
  • Always create valid DNS strings (a-z, 0-9 and hyphens, max 253 characters long).
  • Always use a hexadecimal string encoding of the hashed local ID.
  • Always concatenate multiple parts of a natural key in the same order.
  • There is an online UUID generator able to generate version 5 UUIDs at: https://www.webtools.services/uuid-generator
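
A couple of these notes can be turned into small checks. The following is only a sketch: the function names are hypothetical, and the limit of 63 characters per label is the general DNS limit, which I have added here as an assumption on top of the rules listed above.

```javascript
// Hypothetical check: lower case a-z, 0-9 and hyphens only, max 253 characters.
// The 63-character limit per label is an extra assumption (the general DNS limit).
function isValidPseudoDnsName(name) {
  if (name.length === 0 || name.length > 253) return false;
  return name.split('.').every((label) => /^[a-z0-9-]{1,63}$/.test(label));
}

// Hypothetical helper: compare two UUID strings regardless of case.
function uuidsEqual(a, b) {
  return a.toLowerCase() === b.toLowerCase();
}

console.log(isValidPseudoDnsName(
  '8cb2237d0679ca88db6464eac60da96345513964.se.mycrm.customer.example.com')); // true
console.log(isValidPseudoDnsName('Customer_1.se.mycrm.customer.example.com')); // false
console.log(uuidsEqual('F1E2EC1A-D04F-5AD0-AF06-D8A3EB545607',
                       'f1e2ec1a-d04f-5ad0-af06-d8a3eb545607')); // true
```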

Code examples

The following code examples are written for Node.js and do not require any additional packages to be installed. Just save the code to a local file, e.g. uuid-test.js, and run it with: node uuid-test.js.

const crypto = require('crypto');

const local_key = 'AX3614-9112';
const pseudo_domain = '.se.mycrm.customer.example.com';
// Local ID: lower case hex SHA-1 of the UTF-8 encoded key
const key_hash = crypto.createHash('sha1').update(local_key, 'utf8').digest('hex');
const pseudo_fqdn = key_hash + pseudo_domain;
const ns = Buffer.from('6ba7b8109dad11d180b400c04fd430c8', 'hex'); // DNS ns UUID
// Hash the namespace bytes followed by the name
const raw_bytes = crypto.createHash('sha1').update(ns).update(pseudo_fqdn, 'utf8').digest();
raw_bytes[6] &= 0x0f;   // four msb in octet 6
raw_bytes[6] |= 5 << 4; // is set to version
raw_bytes[8] &= 0x3f;   // two msb in octet 8
raw_bytes[8] |= 0x80;   // is set to variant
const hex = raw_bytes.slice(0, 16).toString('hex'); // truncate to 16 octets
const uuid = hex.substring(0, 8) + '-' + hex.substring(8, 12) + '-' +
             hex.substring(12, 16) + '-' + hex.substring(16, 20) + '-' +
             hex.substring(20, 32);
console.log(uuid);

The output when run should be:

36e1e5b7-0012-53f9-b31c-6a7aa14e078c

The second code example is a little more generalized and has a lot of comments in the code, but it basically does the same thing:

const crypto = require('crypto');

//  RFC 4122, Appendix C - Some Name Space IDs
const nameSpace_DNS = '6ba7b810-9dad-11d1-80b4-00c04fd430c8';
const nameSpace_URL = '6ba7b811-9dad-11d1-80b4-00c04fd430c8';
const nameSpace_OID = '6ba7b812-9dad-11d1-80b4-00c04fd430c8';
const nameSpace_X500 = '6ba7b814-9dad-11d1-80b4-00c04fd430c8';

/**
 * Create a hex encoded SHA-1 checksum from plaintext
 *
 * Arguments
 *   plaintext: The string that will be hashed after being UTF-8 encoded
 *
 * Returns
 *   The SHA-1 checksum as a hex encoded string  
 */
function makeShaHex(plaintext) {
  const shasum = crypto.createHash('sha1');
  shasum.update(plaintext, 'utf8');
  return shasum.digest('hex');
}

/**
 * Convert raw bytes to the standard UUID string format
 *
 * Arguments
 *   buffer: A Buffer object containing the raw bytes of the UUID
 *
 * Returns
 *   The standard string format representation of the UUID  
 */
function uuidFormat(buffer) {
  const hex_bytes = buffer.toString('hex');
  const formatted = hex_bytes.substring(0, 8) + '-' + hex_bytes.substring(8, 12) + '-' +
    hex_bytes.substring(12, 16) + '-' + hex_bytes.substring(16, 20) + '-' +
    hex_bytes.substring(20, 32);
  return formatted;
}

/**
 * Convert the standard UUID string format to raw bytes
 *
 * Arguments
 *   uuid: The standard string format representation of the UUID
 *
 * Returns
 *   A Buffer object containing the raw bytes of the UUID
 */
function uuidToBuffer(uuid) {
  return Buffer.from(uuid.replace(/-/g, ''), 'hex');
}

/**
 * Create a version 5 (NameSpace-Name) UUID as a raw bytes Buffer
 *
 * Arguments
 *   nameSpace: the desired name space as a standard UUID formatted string
 *   name: the string representation of a value in the name space
 *
 * Returns
 *   A Buffer with the bytes of the UUID  
 */
function nameSpaceNameShaUuid(nameSpace, name) {
  const ns = uuidToBuffer(nameSpace);
  const shasum = crypto.createHash('sha1');
  shasum.update(ns);
  shasum.update(name, 'utf8');
  let raw_bytes = shasum.digest();

  // Set the four most significant bits (bits 12 through 15) of the
  // time_hi_and_version field to the appropriate 4-bit version number.
  // Here, we are using SHA-1, which corresponds to version 5.
  raw_bytes[6] &= 0x0f;
  raw_bytes[6] |= 5 << 4;

  // Set the two most significant bits (bits 6 and 7) of
  // the clock_seq_hi_and_reserved to zero and one, respectively.
  raw_bytes[8] &= 0x3f;
  raw_bytes[8] |= 0x80;

  // Keep the first 16 octets
  raw_bytes = raw_bytes.slice(0, 16);

  return raw_bytes;
}

const local_key = 'AX3614-9112';
console.log('Local key:          ' + local_key);
const local_key_hash = makeShaHex(local_key);
console.log('Local key SHA-1:    ' + local_key_hash);
const domain = '.se.mycrm.customer.example.com';
const pseudo_dns = local_key_hash + domain;
console.log('Pseudo DNS:         ' + pseudo_dns);
const uuid = nameSpaceNameShaUuid(nameSpace_DNS, pseudo_dns);
console.log('Corresponding UUID: ' + uuidFormat(uuid));

The output when run:

Local key:          AX3614-9112
Local key SHA-1:    41268e494e18318c0f74a5aba979ebf53552edff
Pseudo DNS:         41268e494e18318c0f74a5aba979ebf53552edff.se.mycrm.customer.example.com
Corresponding UUID: 36e1e5b7-0012-53f9-b31c-6a7aa14e078c
