ADR-0005: Flexible entity IDs, with 8-character hex as the recommended default

Status

Accepted

Context

Every entity in a GLX archive (persons, events, places, sources, etc.) needs a unique identifier. Several patterns were considered:

  • UUID (36 chars, 550e8400-e29b-41d4-a716-...): globally unique with a negligible collision risk, but long and visually noisy in a YAML file.
  • Auto-incrementing integers: compact, but they require a central counter, which conflicts with ADR-0004. Two researchers working on parallel branches would both pick person-42.
  • Short random hex (e.g., 8 characters = 2³² ≈ 4.3 billion possible values): compact, collisions are rare at archive scale, and no central counter is needed.
  • Descriptive strings (john-smith-1850, leeds-parish): readable, but they require authors to know the entity's "real name", which may be private or unknown.

A second consideration is privacy in shared diffs. When a researcher shares a commit or issue snippet that references only entity IDs, descriptive IDs leak real names. Random hex IDs do not.

A third is filesystem layout. In multi-file archives, filenames are derived from entity IDs (strings.ToLower(entityID) + ".glx"). IDs therefore need to be filename-safe on Windows, macOS, and Linux — which rules out colons, slashes, and most punctuation.
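The derivation rule quoted above can be sketched as a small Go function. The name fileNameFor is hypothetical and chosen for illustration; only the expression strings.ToLower(entityID) + ".glx" comes from this document.

```go
package main

import (
	"fmt"
	"strings"
)

// fileNameFor is a hypothetical helper that applies the rule quoted in the
// text: lowercase the entity ID and append the ".glx" extension.
func fileNameFor(entityID string) string {
	return strings.ToLower(entityID) + ".glx"
}

func main() {
	// Hex IDs are already lowercase, so they pass through unchanged.
	fmt.Println(fileNameFor("a1b2c3d4"))
	// Mixed-case descriptive IDs are folded to lowercase.
	fmt.Println(fileNameFor("Leeds-Parish"))
}
```

Because the function depends only on the ID, writing the same entity twice always yields the same filename.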

Decision

The specification (see Archive Organization — ID Format Standards) constrains entity IDs to 1–64 characters, alphanumeric (a–z, A–Z, 0–9) and hyphens only. The spec is the source of truth for the full ID grammar and uniqueness scope.
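As a rough illustration, the constraint summarized above (1 to 64 characters, ASCII letters, digits, and hyphens) fits in a single regular expression. This is a sketch of the summary only: the specification may impose further rules that this pattern does not capture, and the names idPattern and isValidID are assumptions made for the example.

```go
package main

import (
	"fmt"
	"regexp"
)

// idPattern encodes only the summary given in this ADR: 1 to 64 characters
// drawn from a-z, A-Z, 0-9 and the hyphen. The full grammar lives in the spec.
var idPattern = regexp.MustCompile(`^[A-Za-z0-9-]{1,64}$`)

// isValidID is a hypothetical helper that reports whether id matches idPattern.
func isValidID(id string) bool {
	return idPattern.MatchString(id)
}

func main() {
	for _, id := range []string{"a1b2c3d4", "john-smith-1850", "person:42", ""} {
		fmt.Printf("%q valid: %v\n", id, isValidID(id))
	}
}
```

Colons, slashes, and other punctuation fail the check, which matches the filename-safety concern raised in the Context section.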

Within those constraints, two patterns are explicitly documented and recommended:

  • 8-character random hex (2³² ≈ 4.3 billion possible values) — the default pattern generated by tooling. Preferred when privacy matters (IDs leak no names in diffs) or when the entity has no good natural name.
  • Descriptive strings — preferred when readability of diffs and file listings matters more than privacy. The specification explicitly states both formats produce equally stable filenames; in either case, the only way to collide is to reuse the same ID (compared case-insensitively, so Person-A and person-a are treated as the same ID).
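A minimal sketch of how tooling might produce the 8-character hex pattern: read 4 random bytes (32 bits) and hex-encode them. How the actual GLX tooling generates IDs is not described here; the function name newHexID and the choice of crypto/rand are assumptions made for illustration.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// newHexID is a hypothetical generator: 4 random bytes give 32 bits of
// entropy, which hex-encode to exactly 8 lowercase characters.
func newHexID() (string, error) {
	var b [4]byte
	if _, err := rand.Read(b[:]); err != nil {
		return "", err
	}
	return hex.EncodeToString(b[:]), nil
}

func main() {
	id, err := newHexID()
	if err != nil {
		panic(err)
	}
	fmt.Println(id) // a different 8-character value on each run
}
```

No shared state is involved, so two branches can call such a generator independently without coordinating.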

Consequences

Positive

  • Authors pick the pattern that fits their workflow. Privacy-sensitive research uses hex; collaborative research on public figures (historical biography, genealogical fiction like the Westeros example) uses descriptive IDs.
  • 8-character hex is short enough to read and compare at a glance, unlike a full UUID.
  • No central counter means parallel branches can generate IDs without coordination. The birthday bound for a 2³² space is √(2³²) = 65,536 IDs: at that count the chance of at least one collision is roughly 39%, and it is about 1% at around 9,300 IDs. That scale is comfortably above the size of a typical individual-researcher archive. Archives that approach it can use longer IDs (the 64-character limit leaves ample headroom).
  • Filename derivation is deterministic from the ID, so committing an entity twice does not churn the git history with random renames.
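The birthday-bound reasoning above can be checked numerically with the standard approximation p ≈ 1 − exp(−n(n−1) / (2N)), where n is the number of IDs and N the size of the ID space. The sketch below assumes IDs are drawn uniformly at random; collisionProbability is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"math"
)

// collisionProbability approximates the chance that at least two of n
// uniformly random IDs coincide in a space of the given size, using
// p ≈ 1 - exp(-n(n-1) / (2*space)).
func collisionProbability(n, space float64) float64 {
	// -Expm1(x) equals 1 - exp(x) and stays accurate for tiny probabilities.
	return -math.Expm1(-n * (n - 1) / (2 * space))
}

func main() {
	space := math.Exp2(32) // 8 hex characters
	for _, n := range []float64{1000, 10000, 65536} {
		fmt.Printf("%6.0f IDs: p = %.4f\n", n, collisionProbability(n, space))
	}
}
```

Under this approximation the probability is about 1.2% at 10,000 IDs and about 39% at 65,536 IDs. Passing math.Exp2(64) as the space models a 16-character hex ID.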

Negative

  • Because the format is flexible, different archives use different conventions. Tooling that spans archives must not assume any specific pattern.
  • Descriptive IDs can become stale: person-smith-young may stop describing the entity once its data is updated, but the ID is not renamed to match. Authors who care about that should use hex.
  • Lowercase-only filenames mean Person-A and person-a collide; the serializer catches this at write time, but contributors need to be aware of it.
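A write-time check for case-insensitive collisions could look roughly like the following. This is not the serializer's actual code, which is not shown in this ADR; firstCaseCollision is a hypothetical name, and the sketch only illustrates the comparison rule.

```go
package main

import (
	"fmt"
	"strings"
)

// firstCaseCollision is a hypothetical check. It returns the first pair of
// IDs that share a lowercase form (exact duplicates included), or ok == false
// when every ID maps to a distinct lowercase key.
func firstCaseCollision(ids []string) (first, second string, ok bool) {
	seen := make(map[string]string, len(ids))
	for _, id := range ids {
		key := strings.ToLower(id)
		if prev, dup := seen[key]; dup {
			return prev, id, true
		}
		seen[key] = id
	}
	return "", "", false
}

func main() {
	a, b, ok := firstCaseCollision([]string{"Person-A", "a1b2c3d4", "person-a"})
	fmt.Println(a, b, ok)
}
```

Keying the map on the lowercase form mirrors the filename derivation, so two IDs are flagged exactly when they would share a file.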

Licensed under Apache License 2.0