Diagnose data discrepancies with Experimentation Events Export
Query Experimentation Events Export (E3) data with SQL or PySpark to diagnose 10%+ visitor or conversion discrepancies between Optimizely and your analytics tool.
Use this guide to inspect the raw Experimentation Events Export (E3) data when your Optimizely Experimentation counts diverge from another analytics tool. Reach for it when the visitor or conversion gap exceeds 10%, or when your Monthly Active User (MAU) usage is higher than your unique-visitor estimate.
Prerequisites
Before you start, complete the Troubleshoot data discrepancies between Optimizely and your analytics tool checklist in the support documentation. The checks in this guide assume that you have ruled out the basic causes the checklist covers and that you have access to your Experimentation Events Export data.
When to investigate at the E3 level
Investigate E3 data in the following cases:
- Your visitor or conversion counts differ from another analytics tool by more than 10%.
- Your MAU usage is higher than your unique-visitor estimate from another analytics tool.
- A single experiment, domain, or audience is generating disproportionate traffic.
Note: To analyze MAUs, combine the `decisions` and `events` tables and dedupe on `visitor_id`. Otherwise, you undercount any visitor who only triggered events.
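As a minimal illustration of that dedupe, the following sketch uses an in-memory SQLite stand-in for simplified `decisions` and `events` tables (the schema and visitor IDs are illustrative, not the real E3 layout):

```python
import sqlite3

# In-memory stand-ins for simplified decisions and events tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE decisions (visitor_id TEXT);
CREATE TABLE events (visitor_id TEXT);
INSERT INTO decisions VALUES ('v1'), ('v2');
INSERT INTO events VALUES ('v2'), ('v3');  -- v3 only triggered events
""")

# UNION (not UNION ALL) dedupes visitor_ids across both tables, so the
# events-only visitor v3 still counts toward MAUs.
mau = conn.execute("""
SELECT COUNT(*) FROM (
    SELECT visitor_id FROM decisions
    UNION
    SELECT visitor_id FROM events
)
""").fetchone()[0]
print(mau)  # 3; counting decisions alone would report 2
```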
Compare like with like
Optimizely Experimentation is user- and event-based, so session-based tools report more visitors than a user-based segmentation in Optimizely. The difference can look like a discrepancy on its own.
Before you query, confirm the following:
- The date range, URLs, and audience match between the two sources.
- The other tool reports on users, not sessions.
- In Google Analytics 4 (GA4), apply a user-scoped segment.
Verify timing
Check whether the two tools run on the same trigger conditions. Mismatched timing is a common source of "missing" visitors in one tool.
- Confirm the other tool's tracking runs at the same time as Optimizely's. A cookie wall or consent banner may block one and not the other.
- For client-side events, confirm that no redirect cancels the network request mid-flight.
Inspect visitor IDs
A misconfigured visitor ID is the most common cause of MAU overages. Confirm the following:
- The visitor ID cookie has the correct expiry and domain.
- Your bring-your-own-ID (BYOID) configuration loads before Optimizely Web Experimentation fires.
- The same person resolves to the same ID across sessions.
To find anomalies, list a sample of visitor_id values from the decisions table.
```sql
SELECT visitor_id
FROM decisions
WHERE is_holdback = false
LIMIT 100000;
```

If the other analytics tool has its own visitor ID, attach it to Optimizely as an attribute. Compare the two ID spaces to see which source generates more IDs per visitor. Share the joined list back so the other team can check the raw data on their side.
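One way to compare the two ID spaces, assuming you already have a joined sample of (Optimizely ID, other-tool ID) pairs; the IDs below are hypothetical:

```python
from collections import defaultdict

# Hypothetical joined sample: (Optimizely visitor ID, other tool's ID) pairs,
# for example collected by attaching the other tool's ID as an attribute.
pairs = [
    ("opt-a", "ga-1"),
    ("opt-b", "ga-1"),  # same person, two Optimizely IDs: possible cookie issue
    ("opt-c", "ga-2"),
]

# Count how many Optimizely IDs back each ID in the other tool.
opt_ids_per_other = defaultdict(set)
for opt_id, other_id in pairs:
    opt_ids_per_other[other_id].add(opt_id)

# Other-tool IDs backed by several Optimizely IDs suggest Optimizely generates
# more IDs per visitor, for example from a cookie expiry or BYOID problem.
suspects = {k: v for k, v in opt_ids_per_other.items() if len(v) > 1}
print({k: sorted(v) for k, v in suspects.items()})  # {'ga-1': ['opt-a', 'opt-b']}
```

Run the same count in the other direction to see whether the other tool is the one generating extra IDs.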
Inspect referers
The `referer` column records the page each event fired on. Use it to confirm that Optimizely Experimentation runs only on URLs that the other analytics tool's report covers.
```sql
SELECT
    referer,
    COUNT(DISTINCT visitor_id) AS unique_visitor_count
FROM decisions
WHERE is_holdback = false
GROUP BY referer
ORDER BY unique_visitor_count DESC;
```

Look for the following:
- Domains that the other report does not cover.
- Internal Quality Assurance (QA) or staging environments to filter out.
Inspect experiments
Look for an experiment generating disproportionate traffic. A single misconfigured component can skew totals across the whole account.
```sql
SELECT
    experiment_id,
    COUNT(DISTINCT visitor_id) AS unique_visitor_count
FROM decisions
WHERE is_holdback = false
GROUP BY experiment_id
ORDER BY unique_visitor_count DESC;
```

Look for the following:
- A single experiment with disproportionate traffic, which may point to a misconfigured component or activation.
- Excessive decision events from one experiment.
- An experiment that changes the visitor ID mid-session.
Inspect user agents
Use user_agent to find bot or pre-render traffic that the other tool may already filter out.
```sql
SELECT
    user_agent,
    COUNT(DISTINCT visitor_id) AS unique_visitor_count
FROM decisions
WHERE is_holdback = false
GROUP BY user_agent
ORDER BY unique_visitor_count DESC;
```

If you find odd user agents, do the following:
- Confirm bot filtering is turned on, and that the `$opt_user_agent` attribute is sent for server-side SDKs. See How to filter out bots.
- Add custom bot filtering in a wrapper around the `decide` and `track` methods, or in a custom event dispatcher.
- Configure the `holdEvents` script in project JavaScript to block specific user agents (Optimizely Web Experimentation only).
The other tool's bot filtering may differ from Optimizely's. Ask that team to replicate Optimizely's filtering on their side so the two counts are comparable.
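A minimal sketch of such a wrapper, assuming a client object with an SDK-style `track` method; the method signature and the deny-list pattern are illustrative, so adapt both to your SDK:

```python
import re

# Illustrative deny-list; extend it with patterns your user-agent audit surfaces.
BOT_PATTERN = re.compile(r"bot|crawler|spider|python-requests|curl", re.IGNORECASE)

def is_bot(user_agent):
    """Return True for user agents that should not reach Optimizely."""
    return bool(BOT_PATTERN.search(user_agent or ""))

class FilteredClient:
    """Drops track calls from bot user agents before they reach the SDK.

    `client` is assumed to expose an SDK-style track method; adapt the
    signature (and add a matching decide wrapper) for your SDK.
    """
    def __init__(self, client):
        self._client = client

    def track(self, event_key, user_id, user_agent, attributes=None):
        if is_bot(user_agent):
            return None  # skipped: keeps bot traffic out of E3 and your results
        return self._client.track(event_key, user_id, attributes)

print(is_bot("python-requests/2.31"))           # True
print(is_bot("Mozilla/5.0 (Windows NT 10.0)"))  # False
```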
Inspect IP addresses
Use the user_ip column to find traffic from a single source that is inflating one tool's counts. This catches bots, internal testing rigs, and visitor-ID misconfigurations that map many IDs to one machine.
```sql
SELECT
    user_ip,
    COUNT(DISTINCT visitor_id) AS unique_visitor_count
FROM decisions
WHERE is_holdback = false
GROUP BY user_ip
ORDER BY unique_visitor_count DESC;
```

Look for the following:
- IPs tied to more than one visitor ID, which may indicate a visitor-ID configuration error.
- IPs tied to many visitor IDs, which usually indicates bot traffic, internal testing, or an automation rig.
Filter problem IPs at the Content Delivery Network (CDN) level, or skip them in your SDK wrapper. For server-side SDKs, send the visitor's IP as an attribute. The SDK's default user_ip is your server's IP, not your visitor's.
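As a sketch of forwarding the visitor's IP, this helper builds an attributes dictionary to pass on server-side SDK calls; the `ip_address` attribute name and the header handling are assumptions to adapt to your project and infrastructure:

```python
# Sketch: derive the visitor's IP (not the server's) to send as a custom
# attribute on server-side SDK calls. The "ip_address" attribute name is an
# example; use whichever attribute you registered in your Optimizely project.
def visitor_attributes(request_headers, remote_addr):
    # Behind a proxy or load balancer the real client IP is usually the first
    # entry in X-Forwarded-For; otherwise fall back to the socket address.
    forwarded = request_headers.get("X-Forwarded-For", "")
    client_ip = forwarded.split(",")[0].strip() if forwarded else remote_addr
    return {"ip_address": client_ip}

print(visitor_attributes({"X-Forwarded-For": "203.0.113.7, 10.0.0.1"}, "10.0.0.1"))
# {'ip_address': '203.0.113.7'}
print(visitor_attributes({}, "198.51.100.2"))  # {'ip_address': '198.51.100.2'}
```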
Inspect timestamps
Use the timestamp column to find event batches that landed outside the date range your other tool covers. Delayed batches are a common cause of day-over-day discrepancies.
```sql
SELECT timestamp
FROM decisions
WHERE is_holdback = false
LIMIT 100000;
```

Look for batches whose timestamps fall outside the date range the other tool covers. A delayed batch can land in Optimizely on a different day than in the other tool.
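A quick way to surface such batches, assuming epoch-millisecond timestamps and an illustrative comparison range; adjust the conversion if your export stores a native timestamp type:

```python
from datetime import datetime, timezone

# Hypothetical event timestamps, assumed to be epoch milliseconds.
timestamps_ms = [1704067200000, 1704070800000, 1704153600000, 1704412800000]

# The date range the other tool's report covers (illustrative values).
range_start = datetime(2024, 1, 1, tzinfo=timezone.utc).date()
range_end = datetime(2024, 1, 2, tzinfo=timezone.utc).date()

# Bucket each event into a UTC day and flag anything outside the compared
# range: these are candidates for delayed batches that landed on different
# days in the two tools.
out_of_range = [
    ts for ts in timestamps_ms
    if not range_start
    <= datetime.fromtimestamp(ts / 1000, tz=timezone.utc).date()
    <= range_end
]
print(out_of_range)  # [1704412800000], a batch that landed on 2024-01-05
```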
Inspect attributes
For server-side SDKs, attribute analysis works only for attributes your team already sends. To analyze attributes column by column, explode the `attributes` array in SQL, Python, or PySpark. See the PySpark example.
Inspect events
If your conversion counts disagree, work from the broadest comparison inward.
- Confirm that overall event totals match across tools, ignoring experiments and segmentation. If they do not, the discrepancy is in your event implementation.
- Check page activation and how segmentation is configured in the other tool. If overall totals match but the results page is off, the issue is in one of these.
- Confirm that all attributes are included on each event sent through the Events API. Missing attributes can drop an event from your segmented results.
- Confirm that no redirect cancels the event request mid-flight.
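As a quick first pass on the broadest comparison, you can flag per-event gaps above 10% between the two tools' exports; the event names and totals below are hypothetical:

```python
# Hypothetical per-event unique-visitor totals exported from each tool.
optimizely_totals = {"add_to_cart": 1000, "purchase": 480}
other_tool_totals = {"add_to_cart": 1030, "purchase": 610}

# Flag events whose totals differ by more than 10%: those are where to focus
# the event-implementation checks.
flagged = {}
for name in sorted(optimizely_totals.keys() | other_tool_totals.keys()):
    a = optimizely_totals.get(name, 0)
    b = other_tool_totals.get(name, 0)
    gap = abs(a - b) / max(a, b, 1)  # guard against both totals being zero
    if gap > 0.10:
        flagged[name] = round(gap, 2)
print(flagged)  # {'purchase': 0.21}
```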
To compare event totals across tools, group your events table by event name and count distinct visitors.
```sql
SELECT
    event_name,
    COUNT(DISTINCT visitor_id) AS unique_visitor_count
FROM events
GROUP BY event_name
ORDER BY unique_visitor_count DESC;
```

Snowflake example queries
The following queries assume `e3_decisions` and `e3_conversions` tables and parameter values for `account_id`, `experiment_id`, `event_id`, `start_date`, and `end_date`. Adjust the table names and parameters for Athena; some queries do not translate directly to Athena or PySpark.
Total conversions
```sql
SELECT d.variation_id, COUNT(c.visitor_id)
FROM e3_decisions d, e3_conversions c
WHERE d.visitor_id = c.visitor_id
  AND c.date >= d.date
  AND d.account_id = {account_id}
  AND c.account_id = {account_id}
  AND d.date BETWEEN '{start_date}' AND '{end_date}'
  AND c.date BETWEEN '{start_date}' AND '{end_date}'
  AND d.experiment_id = '{experiment_id}'
  AND c.entity_id = '{event_id}'
GROUP BY 1
ORDER BY 1;
```

Unique conversions
```sql
SELECT d.variation_id, COUNT(DISTINCT c.visitor_id)
FROM e3_decisions d, e3_conversions c
WHERE d.visitor_id = c.visitor_id
  AND c.timestamp >= d.timestamp
  AND d.account_id = {account_id}
  AND c.account_id = {account_id}
  AND d.date BETWEEN '{start_date}' AND '{end_date}'
  AND c.date BETWEEN '{start_date}' AND '{end_date}'
  AND d.experiment_id = '{experiment_id}'
  AND c.entity_id = '{event_id}'
GROUP BY 1
ORDER BY 1;
```

Sum of revenue
```sql
SELECT d.variation_id, SUM(c.revenue)
FROM e3_decisions d, e3_conversions c
WHERE d.visitor_id = c.visitor_id
  AND c.timestamp >= d.timestamp
  AND d.account_id = {account_id}
  AND c.account_id = {account_id}
  AND d.date BETWEEN '{start_date}' AND '{end_date}'
  AND c.date BETWEEN '{start_date}' AND '{end_date}'
  AND d.experiment_id = '{experiment_id}'
  AND c.entity_id = '{event_id}'
GROUP BY 1
ORDER BY 1;
```

Total conversions filtered by an attribute (decisions only)
```sql
SELECT d.variation_id, COUNT(c.visitor_id)
FROM e3_decisions d, e3_conversions c, LATERAL FLATTEN(d.attributes) da
WHERE d.visitor_id = c.visitor_id
  AND c.timestamp >= d.timestamp
  AND d.account_id = {account_id}
  AND c.account_id = {account_id}
  AND d.date BETWEEN '{start_date}' AND '{end_date}'
  AND c.date BETWEEN '{start_date}' AND '{end_date}'
  AND d.experiment_id = '{experiment_id}'
  AND c.entity_id = '{event_id}'
  -- Depending on the attribute, you may not need both type and name
  AND da.value:type = '{attribute_type}'
  AND da.value:name = '{attribute_name}'
  AND da.value:value = '{attribute_value}'
GROUP BY 1
ORDER BY 1;
```

Total conversions filtered by an attribute (decisions or conversions)
```sql
SELECT d.variation_id, COUNT(c.visitor_id)
FROM e3_decisions d, e3_conversions c,
  LATERAL FLATTEN(d.attributes) da,
  LATERAL FLATTEN(c.attributes) ca
WHERE d.visitor_id = c.visitor_id
  AND c.timestamp >= d.timestamp
  AND d.account_id = {account_id}
  AND c.account_id = {account_id}
  AND d.date BETWEEN '{start_date}' AND '{end_date}'
  AND c.date BETWEEN '{start_date}' AND '{end_date}'
  AND d.experiment_id = '{experiment_id}'
  AND c.entity_id = '{event_id}'
  AND (
    (da.value:type = '{attribute_type}' AND da.value:name = '{attribute_name}' AND da.value:value = '{attribute_value}')
    OR
    (ca.value:type = '{attribute_type}' AND ca.value:name = '{attribute_name}' AND ca.value:value = '{attribute_value}')
  )
GROUP BY 1
ORDER BY 1;
```

Total conversions filtered by IP
```sql
SELECT d.variation_id, COUNT(c.visitor_id)
FROM e3_decisions d, e3_conversions c
WHERE d.visitor_id = c.visitor_id
  AND c.timestamp >= d.timestamp
  AND d.account_id = {account_id}
  AND c.account_id = {account_id}
  AND d.date BETWEEN '{start_date}' AND '{end_date}'
  AND c.date BETWEEN '{start_date}' AND '{end_date}'
  AND d.experiment_id = '{experiment_id}'
  AND c.entity_id = '{event_id}'
  AND (d.user_ip = '{ip_address}' OR c.user_ip = '{ip_address}')
GROUP BY 1
ORDER BY 1;
```

Daily MAUs for a specific domain
```sql
SELECT date, COUNT(DISTINCT visitor_id) AS distinct_count
FROM (
    SELECT visitor_id, date
    FROM e3_decisions
    WHERE account_id = {account_id}
      AND date BETWEEN '{start_date}' AND '{end_date}'
      AND referer LIKE '%YOUR_DOMAIN%'
    UNION
    SELECT visitor_id, date
    FROM e3_conversions
    WHERE account_id = {account_id}
      AND date BETWEEN '{start_date}' AND '{end_date}'
      AND referer LIKE '%YOUR_DOMAIN%'
) AS combined_tables
GROUP BY date
ORDER BY date;
```

PySpark example
The following script reads Parquet data and joins the event and decision datasets. It explodes nested attributes into columns and exports the result to an Excel file.
```python
import os

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode, first

spark = SparkSession.builder.appName("example").getOrCreate()
base_data_dir = os.environ.get("filepath")


def read_data(path, view_name):
    """Read parquet data from `path` and register a Spark temp view."""
    spark.read.parquet(path).createOrReplaceTempView(view_name)


read_data(os.path.join(base_data_dir, "type=decisions"), "decisions")
read_data(os.path.join(base_data_dir, "type=events"), "events")

df = spark.sql("""
    SELECT c.uuid, c.timestamp, c.attributes, c.process_timestamp,
           c.visitor_id, c.entity_id, c.user_ip, c.user_agent,
           c.referer, c.event_name, c.revenue, c.value, c.client_engine
    FROM decisions d, events c
    WHERE d.visitor_id = c.visitor_id
      AND c.timestamp >= d.timestamp
      AND d.is_holdback = false
""")

# Explode nested attributes into one row per attribute, then pivot each
# attribute name into its own column.
df_exploded = df.withColumn("attributes", explode("attributes"))
df_pivoted = (
    df_exploded
    .select(
        col("uuid"),
        col("timestamp"),
        col("process_timestamp"),
        col("visitor_id"),
        col("user_ip"),
        col("user_agent"),
        col("referer"),
        col("client_engine"),
        col("attributes.name").alias("attributes_name"),
        col("attributes.value").alias("attributes_value"),
    )
    .groupBy(
        "uuid", "timestamp", "process_timestamp", "visitor_id",
        "user_ip", "user_agent", "referer", "client_engine",
    )
    .pivot("attributes_name")
    .agg(first("attributes_value"))
)

df_pandas = df_pivoted.toPandas()
df_pandas.to_excel("export.xlsx", index=False)
```

Other known causes
Work through these less-common causes after you have ruled out the previous checks. Each one can produce a 10%+ discrepancy on its own.
- The visitor ID in the other analytics tool is not device-based.
- The other tool uses an aliasing mechanism, like GA blended mode.
- The other tool applies IP filtering or bot filtering that Optimizely is not also configured to apply.
- Optimizely's bot list filters out HTTP libraries used by some server-side SDKs and the Events API. If Optimizely filters out your traffic, either turn off bot filtering or send a common browser user agent.
Related resources
- Get started with the Experimentation Events Export
- Troubleshoot data discrepancies between Optimizely and your analytics tool
- Discrepancies in third-party data
- Monitor monthly active users (MAUs)