# Why Data Quality May Vary

If you are seeing occasional mismatches in a QA environment or on a low-traffic site, this is expected and does not typically reflect production performance. Across live deployments with **sufficient traffic**, matching accuracy generally falls in the **90–95% range**. Identity resolution is signal-driven: when signal density is low, variability increases; when signal density is high, accuracy stabilizes.

The sections below explain why.

***

## The Core Principle: Signal Density Drives Confidence

Identity resolution relies on triangulating multiple deterministic signals, including:

* IP address
* Device and browser characteristics
* Cookie identifiers
* Hashed Email Address (HEM)
* Historical validation signals

The more frequently a visitor is observed, the more confidently those signals align.

**Accuracy improves:**

* Proportionally with site scale
* Logarithmically with session frequency

More traffic and repeat engagement create stronger validation reinforcement.

***

## Why Low-Traffic Sites Show More Variability

Low-traffic sites typically have:

* Fewer than \~1,000 monthly unique visitors
* Limited repeat sessions
* Inconsistent engagement

Under these conditions, the system has fewer opportunities to validate identity associations, and deterministic matching may occasionally produce what we refer to as extraneous linkages.

### Extraneous Linkages

<figure><img src="/files/MsK8A45y47dAn1br8DOf" alt=""><figcaption></figcaption></figure>

Extraneous linkages occur when:

* Partial signals are technically valid
* Multiple possible matches exist
* There are insufficient signals to confidently discriminate

A match is produced, but confidence is lower than in high-volume environments.
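The ambiguity behind an extraneous linkage can be illustrated with a toy example. The sketch below is not the production matching logic. It assumes a hypothetical scoring rule (count how many observed signals agree with each candidate profile) and hypothetical names such as `rank_candidates` and `is_ambiguous`, purely to show how two candidates can look equally valid when few signals are available.

```python
def rank_candidates(observed: dict, candidates: dict) -> list:
    """Return (candidate_id, agreeing_signal_count) pairs, best first.

    Hypothetical scoring rule: one point per observed signal whose value
    equals the value stored on the candidate profile.
    """
    scores = []
    for candidate_id, profile in candidates.items():
        agree = sum(1 for key, value in observed.items() if profile.get(key) == value)
        scores.append((candidate_id, agree))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


def is_ambiguous(ranked: list, min_margin: int = 1) -> bool:
    """True when the top candidate does not beat the runner-up by min_margin."""
    if len(ranked) < 2:
        return False
    return ranked[0][1] - ranked[1][1] < min_margin


# Illustrative profiles only: two people behind the same office IP and device type.
candidates = {
    "person_a": {"ip": "203.0.113.7", "device": "mac-chrome", "cookie": "c-123"},
    "person_b": {"ip": "203.0.113.7", "device": "mac-chrome", "cookie": "c-999"},
}

# Low-signal visit: only IP and device were observed, so both candidates tie.
sparse_visit = {"ip": "203.0.113.7", "device": "mac-chrome"}

# Higher-signal visit: a cookie identifier is also observed and breaks the tie.
rich_visit = {"ip": "203.0.113.7", "device": "mac-chrome", "cookie": "c-123"}

sparse_ranked = rank_candidates(sparse_visit, candidates)
rich_ranked = rank_candidates(rich_visit, candidates)
```

With the sparse visit, both candidates agree on every observed signal, so any match returned is technically valid but weakly supported. Once the additional signal is observed, `person_a` is clearly ahead and the ambiguity disappears.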

Performance becomes materially more stable above \~5,000 monthly uniques.

***

## Why QA & Dev Environments Amplify Issues

QA and Dev environments often:

* Have extremely low traffic
* Generate sporadic tag fires
* Lack repeat behavioral patterns
* Sit behind restricted access

Because identity resolution improves with repeated observation, these environments can exaggerate anomalies. QA is effectively a worst-case scenario for identity resolution performance. Issues observed in QA frequently do not persist in production.

***

## Why Small Exports Can Appear Worse

Another common scenario involves:

* Applying multiple demographic or firmographic filters
* Exporting a small subset
* Manually validating a handful of records

Small samples can amplify perceived inaccuracies.

Example: A system performing at 93% accuracy still mismatches roughly 7% of records.\
In a small export of 20 records, that works out to about 1.4 expected mismatches, so finding 1–2 is normal even though it may appear disproportionate and alarming.
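The arithmetic behind this example can be sketched with a simple binomial model. This is an illustration only: it assumes every record is mismatched independently and with the same probability, which real exports may not satisfy, and the function names are hypothetical.

```python
from math import comb


def expected_mismatches(n_records: int, accuracy: float) -> float:
    """Average number of mismatches in an export of n_records."""
    return n_records * (1.0 - accuracy)


def prob_exactly(n_records: int, accuracy: float, k: int) -> float:
    """Binomial probability of exactly k mismatches (independence assumed)."""
    p = 1.0 - accuracy
    return comb(n_records, k) * p**k * (1.0 - p) ** (n_records - k)


def prob_at_least(n_records: int, accuracy: float, k: int) -> float:
    """Probability of k or more mismatches in the export."""
    return 1.0 - sum(prob_exactly(n_records, accuracy, i) for i in range(k))


# The example from the text: 20 records at 93% accuracy.
avg = expected_mismatches(20, 0.93)
p_one_or_more = prob_at_least(20, 0.93, 1)
p_two_or_more = prob_at_least(20, 0.93, 2)
```

Under these assumptions, a 20-record export from a 93%-accurate system contains at least one mismatch roughly 77% of the time and two or more roughly 41% of the time, so a couple of bad records in a small sample says little about overall accuracy.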

Manual validation using Google or LinkedIn searches is not always definitive ground truth, especially for:

* Private companies
* Mid-market businesses
* Individuals with limited public presence

What an internet-scale dataset contains is not the same as what is visible through public search.

***

## Deterministic Matching in Low-Signal Environments

The current system relies primarily on deterministic and rule-based matching.

**Strengths:**

* High confidence in observed validation
* Reduced speculative modeling

**Limitation:**

* Deterministic logic requires sufficient signal density

In low-volume settings, ambiguity increases. We are rolling out probabilistic signal filtering upgrades designed to:

* Reduce false positives
* Improve confidence scoring
* Return only records that meet stronger certainty thresholds

This may slightly reduce attribute fill in low-traffic environments but will improve precision.
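The trade-off between attribute fill and precision can be illustrated with a small sketch. Everything here is an assumption made for illustration: the `confidence` field, the threshold values, and the sample records are hypothetical and do not describe the actual scoring used by the system.

```python
from dataclasses import dataclass


@dataclass
class Match:
    record_id: str
    confidence: float  # hypothetical score in [0, 1]
    correct: bool      # only known in a labeled evaluation set


def filter_by_confidence(matches: list, threshold: float) -> list:
    """Keep only matches whose confidence meets the threshold."""
    return [m for m in matches if m.confidence >= threshold]


def fill_rate(kept: list, total: int) -> float:
    """Share of all candidate records that are still returned."""
    return len(kept) / total if total else 0.0


def precision(kept: list) -> float:
    """Share of returned records that are correct."""
    return sum(m.correct for m in kept) / len(kept) if kept else 0.0


# Made-up evaluation records, for illustration only.
matches = [
    Match("a", 0.95, True),
    Match("b", 0.90, True),
    Match("c", 0.60, False),
    Match("d", 0.55, True),
    Match("e", 0.40, False),
]

loose = filter_by_confidence(matches, 0.0)   # return everything
strict = filter_by_confidence(matches, 0.8)  # return only high-confidence matches
```

In this toy data, raising the threshold drops fill from 5 of 5 records to 2 of 5 while precision rises from 60% to 100%, which is the same direction of trade-off described above.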

***

## When Should You Investigate Further?

**Data quality concerns merit deeper review if:**

* Anomalies persist in high-traffic production environments
* Issues are observed consistently across large datasets
* Activation performance materially underperforms expectations

**Isolated anomalies are expected in:**

* QA environments
* Newly launched sites
* Extremely small exports (1–2 dozen records)

***

## Practical Safeguards

To increase confidence and reduce perceived variability:

1. Evaluate production traffic rather than QA
2. Allow sufficient traffic volume before benchmarking accuracy
3. Use Buyer Intent filters (Hot/Warm), **detailed in the next article**
4. Apply additional validation for higher-cost activation channels
5. Leverage activation feedback loops to improve local accuracy

Identity resolution strengthens over time with traffic and engagement.


