Data Quality & Identity Resolution

This article explains how identity resolution works within the Untitled platform, what impacts data quality (DQ), and how to think about trade-offs when activating records.

Internet-scale identity data is inherently probabilistic at the margins. While our system is built to maintain high confidence levels, understanding how and why mismatches occur is critical to evaluating performance properly.

The goal is not theoretical perfection. The goal is maximizing reliable match rates while maintaining strong confidence thresholds that support positive unit economics in marketing activation.

What Is Identity Resolution?

Identity resolution is the process of connecting anonymous website visitors to a real individual or business profile.

Standalone websites typically identify only 1–2% of traffic using first-party cookies or user-submitted data. Untitled expands that capability by leveraging:

  • A distributed Identity Graph

  • Deterministic validation signals

  • Ongoing data feedback loops

This allows for identification of a materially larger percentage of anonymous traffic, provided sufficient signal density exists.


Expected Accuracy Levels

Across production deployments, identity matching accuracy typically falls in the 90–95% range.

Important context:

  • Internet-scale datasets are never 100% perfect.

  • A small percentage of mismatches is expected in any large identity system.

  • Accuracy improves with traffic scale and engagement frequency.

  • Small filtered exports can appear less accurate than overall system averages.

Accuracy is influenced by signal density, not just matching logic.


What Impacts Data Quality?

Several variables influence match confidence:

1. Traffic Volume

The following environments may exhibit more variability in match confidence:

  • Websites with fewer than ~1,000 monthly unique visitors

  • Newly launched properties

  • QA or dev environments

Performance becomes materially more stable above ~5,000 monthly uniques.

2. Session Frequency

Accuracy improves logarithmically with engagement frequency.

More repeat sessions and longer engagement windows create stronger triangulation between:

  • IP address

  • Device

  • Browser

  • Hashed email identifier

More signals = higher confidence.
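One simple way to picture this triangulation is an additive score over the four signals listed above. The weights and the idea of a flat sum are invented for this example; the platform's real scoring logic is not described here.

```python
# Illustrative only: a toy additive confidence score over the four
# triangulation signals. Weights are assumptions, not real parameters.
SIGNAL_WEIGHTS = {
    "hashed_email": 0.50,  # strongest deterministic signal
    "ip_address": 0.20,
    "device": 0.15,
    "browser": 0.15,
}

def match_confidence(observed_signals: set[str]) -> float:
    """Sum the weights of observed signals; more signals -> higher score."""
    return sum(w for s, w in SIGNAL_WEIGHTS.items() if s in observed_signals)

# A repeat visitor seen on a consistent IP, device, and browser with a
# hashed email scores far higher than a single-signal session.
full = match_confidence({"hashed_email", "ip_address", "device", "browser"})
partial = match_confidence({"ip_address"})
```

The point of the sketch is the shape, not the numbers: each additional corroborating signal raises confidence, which is why repeat sessions and longer engagement windows matter.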

3. Filtering & Export Size

Applying heavy demographic or firmographic filters and then exporting extremely small subsets can create the perception of lower accuracy. Small samples amplify anomalies.
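The small-sample effect is just sampling variance: the standard error of an observed accuracy rate shrinks with the square root of the export size. Assuming a true accuracy of 93% (within the 90–95% range above), compare a 25-record export with a 5,000-record one:

```python
# Why small exports look noisier: standard error of an observed
# accuracy rate is sqrt(p * (1 - p) / n), which shrinks with sqrt(n).
import math

def accuracy_std_error(p: float, n: int) -> float:
    """Standard error of the observed accuracy over n sampled records."""
    return math.sqrt(p * (1 - p) / n)

p = 0.93  # assumed true match accuracy for the example
small = accuracy_std_error(p, 25)     # ~0.051: +/- 5-point swings are routine
large = accuracy_std_error(p, 5000)   # ~0.004: observed rate tracks the true rate
```

In other words, a heavily filtered 25-record export showing 85% accuracy is entirely consistent with a system running at 93% overall.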


Deterministic Matching Today

The current system relies primarily on deterministic matching and rule engines.

This approach:

  • Performs strongly in high-traffic environments

  • Avoids excessive modeling assumptions

  • Prioritizes observed validation over speculative inference

However, in low-signal environments, deterministic systems can produce what we call extraneous linkages — technically valid but lower-confidence matches formed from partial signals.

We are actively rolling out probabilistic signal filtering upgrades to reduce false positives and improve confidence scoring when data is ambiguous.
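Conceptually, that kind of filtering means holding low-confidence candidates back from activation rather than exporting them. The sketch below illustrates the idea; the field names and the 0.8 threshold are assumptions for the example, not product parameters.

```python
# Hedged sketch of confidence-based filtering: candidate matches below a
# confidence floor are withheld from activation. Threshold is invented.
CONFIDENCE_FLOOR = 0.8

def filter_for_activation(candidates: list[dict]) -> list[dict]:
    """Keep only matches confident enough to justify activation spend."""
    return [c for c in candidates if c["confidence"] >= CONFIDENCE_FLOOR]

candidates = [
    {"profile_id": "p-001", "confidence": 0.95},  # strong multi-signal match
    {"profile_id": "p-002", "confidence": 0.55},  # extraneous linkage from partial signals
]
activated = filter_for_activation(candidates)
```

The trade-off is deliberate: a higher floor sacrifices some match volume to keep activation spend concentrated on records most likely to be correct.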


The Feedback Loop Effect

The highest level of accuracy occurs when:

  1. Records are activated in marketing channels

  2. Users convert or transact

  3. First-party data is fed back into the platform

As activation increases, accuracy for that specific property improves.

This creates a reinforcing cycle:

More activation → More signal → Better resolution → Higher confidence


The Strategic Perspective

When evaluating data quality, the relevant question is not: “Is this 100% perfect?”

The relevant questions are:

  • Is the majority of the dataset reliable?

  • Is the cost of activation justified by the confidence level?

  • Are we applying appropriate filters for higher-cost channels?

  • Are we leveraging intent signals to increase precision?

Identity resolution at scale is about disciplined confidence management and unit economics. The following articles in this section dive deeper into this topic.
