The availability of features may depend on your plan type. Contact your Customer Success Manager if you have any questions.

Diagnose data discrepancies with Experimentation Events Export

Query Experimentation Events Export (E3) data with SQL or PySpark to diagnose 10%+ visitor or conversion discrepancies between Optimizely and your analytics tool.

Use this guide to inspect the raw Experimentation Events Export (E3) data when your Optimizely Experimentation counts diverge from another analytics tool. Use it when the visitor or conversion gap is above 10%, or when your Monthly Active User (MAU) usage is higher than your unique-visitor estimate.

Prerequisites

Before you start, complete the Troubleshoot data discrepancies between Optimizely and your analytics tool checklist in the support documentation. The checks in this guide assume that you have ruled out the basic causes the checklist covers and that you have access to your Experimentation Events Export data.

When to investigate at the E3 level

Investigate E3 data in the following cases:

  • Your visitor or conversion counts differ from another analytics tool by more than 10%.
  • Your MAU usage is higher than your unique-visitor estimate from another analytics tool.
  • A single experiment, domain, or audience is generating disproportionate traffic.
📘

Note

To analyze MAUs, combine the decisions and events tables and dedupe on visitor_id. Otherwise, you undercount any visitor who only triggered events.
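
A minimal sketch of that union, using the decisions and events table names from the queries in this guide (the Daily MAUs query in the Snowflake example queries section shows a date-bounded, domain-filtered version):

SELECT COUNT(DISTINCT visitor_id) AS mau_estimate
FROM (
  SELECT visitor_id FROM decisions
  UNION
  SELECT visitor_id FROM events
) AS combined_tables;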

Compare like with like

Optimizely Experimentation is user- and event-based. A session-based tool counts sessions rather than users, so it reports higher numbers than a user-based segmentation in Optimizely, and that difference alone can look like a discrepancy.

Before you query, confirm the following:

  • The date range, URLs, and audience match between the two sources.
  • The other tool reports on users, not sessions.
  • In Google Analytics 4 (GA4), apply a user-scoped segment.

Verify timing

Check whether the two tools run on the same trigger conditions. Mismatched timing is a common source of "missing" visitors in one tool.

  • Confirm the other tool runs at the same time as Optimizely. A cookie wall or consent banner may block one and not the other.
  • For client-side events, confirm that no redirect cancels the network request mid-flight.

Inspect visitor IDs

A misconfigured visitor ID is the most common cause of MAU overages. Confirm the following:

  • The visitor ID cookie has the correct expiry and domain.
  • Your bring-your-own-ID (BYOID) configuration loads before Optimizely Web Experimentation fires.
  • The same person resolves to the same ID across sessions.

To find anomalies, list a sample of visitor_id values from the decisions table.

SELECT visitor_id
FROM decisions
WHERE is_holdback = false
LIMIT 100000;

If the other analytics tool has its own visitor ID, attach it to Optimizely as an attribute. Compare the two ID spaces to see which source generates more IDs per visitor. Share the joined list back so the other team can check the raw data on their side.
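
For example, if the other tool's ID is attached as an attribute, a Snowflake-style sketch like the following flags external IDs that map to more than one Optimizely visitor_id. The attribute name ga_client_id is hypothetical; substitute whatever name your implementation sends.

SELECT
  da.value:value::string AS external_id,
  COUNT(DISTINCT d.visitor_id) AS optimizely_id_count
FROM decisions d, LATERAL FLATTEN(d.attributes) da
WHERE d.is_holdback = false
  AND da.value:name = 'ga_client_id'
GROUP BY 1
HAVING COUNT(DISTINCT d.visitor_id) > 1
ORDER BY optimizely_id_count DESC;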

Inspect referers

The referer column records the page each event fired on. Use it to confirm that Optimizely Experimentation runs only on URLs that the other analytics tool's report covers.

SELECT
  referer,
  COUNT(DISTINCT visitor_id) AS unique_visitor_count
FROM decisions
WHERE is_holdback = false
GROUP BY referer
ORDER BY unique_visitor_count DESC;

Look for the following:

  • Domains that the other report does not cover.
  • Internal Quality Assurance (QA) or staging environments to filter out (see the sketch below this list).
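
As a follow-up, you could exclude internal or staging hostnames before recounting. The domains below are placeholders; substitute your own.

SELECT
  referer,
  COUNT(DISTINCT visitor_id) AS unique_visitor_count
FROM decisions
WHERE is_holdback = false
  AND referer NOT LIKE '%staging.example.com%'
  AND referer NOT LIKE '%qa.example.com%'
GROUP BY referer
ORDER BY unique_visitor_count DESC;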

Inspect experiments

Look for an experiment generating disproportionate traffic. A single misconfigured component can skew totals across the whole account.

SELECT
  experiment_id,
  COUNT(DISTINCT visitor_id) AS unique_visitor_count
FROM decisions
WHERE is_holdback = false
GROUP BY experiment_id
ORDER BY unique_visitor_count DESC;

Look for the following:

  • A single experiment with disproportionate traffic, which may point to a misconfigured component or activation.
  • Excessive decision events from one experiment (see the sketch after this list).
  • An experiment that changes the visitor ID mid-session.
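
To spot excessive decision events, one option is to compare total decisions with unique visitors per experiment; a high ratio can indicate the same visitors being bucketed repeatedly.

SELECT
  experiment_id,
  COUNT(*) AS decision_count,
  COUNT(DISTINCT visitor_id) AS unique_visitor_count,
  COUNT(*) / COUNT(DISTINCT visitor_id) AS decisions_per_visitor
FROM decisions
WHERE is_holdback = false
GROUP BY experiment_id
ORDER BY decisions_per_visitor DESC;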

Inspect user agents

Use user_agent to find bot or pre-render traffic that the other tool may already filter out.

SELECT
  user_agent,
  COUNT(DISTINCT visitor_id) AS unique_visitor_count
FROM decisions
WHERE is_holdback = false
GROUP BY user_agent
ORDER BY unique_visitor_count DESC;

If you find odd user agents, do the following:

  • Confirm bot filtering is turned on, and that the $opt_user_agent attribute is sent for server-side SDKs. See How to filter out bots.
  • Add custom bot filtering in a wrapper around the decide and track methods, or in a custom event dispatcher.
  • Configure the holdEvents script in project JavaScript to block specific user agents (Optimizely Web Experimentation only).

Your other tool's bot filtering may differ from Optimizely's, so ask that team to replicate Optimizely's filtering on their side.
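
To estimate how much obvious bot traffic reaches Optimizely before any filtering, you could search user_agent for common automation strings. The patterns below are illustrative, not exhaustive, and ILIKE is Snowflake syntax (use LOWER with LIKE elsewhere).

SELECT
  user_agent,
  COUNT(DISTINCT visitor_id) AS unique_visitor_count
FROM decisions
WHERE is_holdback = false
  AND (user_agent ILIKE '%bot%'
    OR user_agent ILIKE '%spider%'
    OR user_agent ILIKE '%headless%'
    OR user_agent ILIKE '%python%')
GROUP BY user_agent
ORDER BY unique_visitor_count DESC;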

Inspect IP addresses

Use the user_ip column to find traffic from a single source that is inflating one tool's counts. This catches bots, internal testing rigs, and visitor-ID misconfigurations that map many IDs to one machine.

SELECT
  user_ip,
  COUNT(DISTINCT visitor_id) AS unique_visitor_count
FROM decisions
WHERE is_holdback = false
GROUP BY user_ip
ORDER BY unique_visitor_count DESC;

Look for the following:

  • IPs tied to more than one visitor ID, which may indicate a visitor-ID configuration error.
  • IPs tied to many visitor IDs, which usually indicates bot traffic, internal testing, or an automation rig.

Filter problem IPs at the Content Delivery Network (CDN) level, or skip them in your SDK wrapper. For server-side SDKs, send the visitor's IP as an attribute, because the default user_ip is your server's IP, not your visitor's.

Inspect timestamps

Use the timestamp column to find event batches that landed outside the date range your other tool covers. Delayed batches are a common cause of day-over-day discrepancies.

SELECT timestamp
FROM decisions
WHERE is_holdback = false
LIMIT 100000;

Look for batches whose timestamps fall outside the date range the other tool covers. A delayed batch can land in Optimizely on a different day than in the other tool.
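
To make the comparison easier, you can group decisions by day and compare the daily counts with the other tool's report. This sketch assumes timestamp is stored as a timestamp type; adjust the cast if your export stores epoch milliseconds.

SELECT
  CAST(timestamp AS DATE) AS event_date,
  COUNT(*) AS decision_count,
  COUNT(DISTINCT visitor_id) AS unique_visitor_count
FROM decisions
WHERE is_holdback = false
GROUP BY CAST(timestamp AS DATE)
ORDER BY event_date;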

Inspect attributes

For server-side SDKs, attribute analysis only works for attributes your team already sends. To analyze attributes column by column, explode the attributes array in SQL, Python, or PySpark. See PySpark example.
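
For example, in Snowflake you can flatten the array and count unique visitors per attribute name and value. This is a sketch using the same FLATTEN pattern as the Snowflake example queries below.

SELECT
  da.value:name::string  AS attribute_name,
  da.value:value::string AS attribute_value,
  COUNT(DISTINCT d.visitor_id) AS unique_visitor_count
FROM decisions d, LATERAL FLATTEN(d.attributes) da
WHERE d.is_holdback = false
GROUP BY 1, 2
ORDER BY unique_visitor_count DESC;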

Inspect events

If your conversion counts disagree, work from the broadest comparison inward.

  1. Confirm that overall event totals match across tools, ignoring experiments and segmentation. If they do not, the discrepancy is in your event implementation.
  2. Check page activation and how segmentation is configured in the other tool. If overall totals match but the results page is off, the issue is in one of these.
  3. Confirm that all attributes are included on each event sent through the Events API. Missing attributes can drop an event from your segmented results.
  4. Confirm that no redirect cancels the event request mid-flight.

To compare event totals across tools, group your events table by event name and count distinct visitors.

SELECT
  event_name,
  COUNT(DISTINCT visitor_id) AS unique_visitor_count
FROM events
GROUP BY event_name
ORDER BY unique_visitor_count DESC;

Snowflake example queries

The following queries assume the e3_decisions and e3_conversions tables and parameter values for account_id, experiment_id, event_id, start_date, and end_date. Adjust table names and parameters for Athena. Some queries do not translate directly to Athena or PySpark.

Total conversions

SELECT d.variation_id, COUNT(c.visitor_id)
FROM e3_decisions d, e3_conversions c
WHERE d.visitor_id = c.visitor_id
  AND c.timestamp >= d.timestamp
  AND d.account_id = {account_id}
  AND c.account_id = {account_id}
  AND d.date BETWEEN '{start_date}' AND '{end_date}'
  AND c.date BETWEEN '{start_date}' AND '{end_date}'
  AND d.experiment_id = '{experiment_id}'
  AND c.entity_id = '{event_id}'
GROUP BY 1
ORDER BY 1;

Unique conversions

SELECT d.variation_id, COUNT(DISTINCT c.visitor_id)
FROM e3_decisions d, e3_conversions c
WHERE d.visitor_id = c.visitor_id
  AND c.timestamp >= d.timestamp
  AND d.account_id = {account_id}
  AND c.account_id = {account_id}
  AND d.date BETWEEN '{start_date}' AND '{end_date}'
  AND c.date BETWEEN '{start_date}' AND '{end_date}'
  AND d.experiment_id = '{experiment_id}'
  AND c.entity_id = '{event_id}'
GROUP BY 1
ORDER BY 1;

Sum of revenue

SELECT d.variation_id, SUM(c.revenue)
FROM e3_decisions d, e3_conversions c
WHERE d.visitor_id = c.visitor_id
  AND c.timestamp >= d.timestamp
  AND d.account_id = {account_id}
  AND c.account_id = {account_id}
  AND d.date BETWEEN '{start_date}' AND '{end_date}'
  AND c.date BETWEEN '{start_date}' AND '{end_date}'
  AND d.experiment_id = '{experiment_id}'
  AND c.entity_id = '{event_id}'
GROUP BY 1
ORDER BY 1;

Total conversions filtered by an attribute (decisions only)

SELECT d.variation_id, COUNT(c.visitor_id)
FROM e3_decisions d, e3_conversions c, LATERAL FLATTEN(d.attributes) da
WHERE d.visitor_id = c.visitor_id
  AND c.timestamp >= d.timestamp
  AND d.account_id = {account_id}
  AND c.account_id = {account_id}
  AND d.date BETWEEN '{start_date}' AND '{end_date}'
  AND c.date BETWEEN '{start_date}' AND '{end_date}'
  AND d.experiment_id = '{experiment_id}'
  AND c.entity_id = '{event_id}'
  -- Depending on the attribute, you may not need both type and name
  AND da.value:type = '{attribute_type}'
  AND da.value:name = '{attribute_name}'
  AND da.value:value = '{attribute_value}'
GROUP BY 1
ORDER BY 1;

Total conversions filtered by an attribute (decisions or conversions)

SELECT d.variation_id, COUNT(c.visitor_id)
FROM e3_decisions d, e3_conversions c,
     LATERAL FLATTEN(d.attributes) da,
     LATERAL FLATTEN(c.attributes) ca
WHERE d.visitor_id = c.visitor_id
  AND c.timestamp >= d.timestamp
  AND d.account_id = {account_id}
  AND c.account_id = {account_id}
  AND d.date BETWEEN '{start_date}' AND '{end_date}'
  AND c.date BETWEEN '{start_date}' AND '{end_date}'
  AND d.experiment_id = '{experiment_id}'
  AND c.entity_id = '{event_id}'
  AND (
        (da.value:type = '{attribute_type}' AND da.value:name = '{attribute_name}' AND da.value:value = '{attribute_value}')
        OR
        (ca.value:type = '{attribute_type}' AND ca.value:name = '{attribute_name}' AND ca.value:value = '{attribute_value}')
      )
GROUP BY 1
ORDER BY 1;

Total conversions filtered by IP

SELECT d.variation_id, COUNT(c.visitor_id)
FROM e3_decisions d, e3_conversions c
WHERE d.visitor_id = c.visitor_id
  AND c.timestamp >= d.timestamp
  AND d.account_id = {account_id}
  AND c.account_id = {account_id}
  AND d.date BETWEEN '{start_date}' AND '{end_date}'
  AND c.date BETWEEN '{start_date}' AND '{end_date}'
  AND d.experiment_id = '{experiment_id}'
  AND c.entity_id = '{event_id}'
  AND (d.user_ip = '{ip_address}' OR c.user_ip = '{ip_address}')
GROUP BY 1
ORDER BY 1;

Daily MAUs for a specific domain

SELECT date, COUNT(DISTINCT visitor_id) AS distinct_count
FROM (
  SELECT visitor_id, date
  FROM e3_decisions
  WHERE account_id = {account_id}
    AND date BETWEEN '{start_date}' AND '{end_date}'
    AND referer LIKE '%YOUR_DOMAIN%'
  UNION
  SELECT visitor_id, date
  FROM e3_conversions
  WHERE account_id = {account_id}
    AND date BETWEEN '{start_date}' AND '{end_date}'
    AND referer LIKE '%YOUR_DOMAIN%'
) AS combined_tables
GROUP BY date
ORDER BY date;
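
Monthly MAUs for a specific domain

If you need a monthly rollup to compare against your MAU allowance, the same pattern works with DATE_TRUNC. This is a sketch following the daily query above.

SELECT DATE_TRUNC('month', date) AS month, COUNT(DISTINCT visitor_id) AS distinct_count
FROM (
  SELECT visitor_id, date
  FROM e3_decisions
  WHERE account_id = {account_id}
    AND date BETWEEN '{start_date}' AND '{end_date}'
    AND referer LIKE '%YOUR_DOMAIN%'
  UNION
  SELECT visitor_id, date
  FROM e3_conversions
  WHERE account_id = {account_id}
    AND date BETWEEN '{start_date}' AND '{end_date}'
    AND referer LIKE '%YOUR_DOMAIN%'
) AS combined_tables
GROUP BY 1
ORDER BY 1;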

PySpark example

The following script reads parquet data and joins the event and decision datasets. It explodes nested attributes into columns and exports the result to an Excel file.

import os
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode, first

spark = SparkSession.builder.appName("example").getOrCreate()
base_data_dir = os.environ.get("filepath")

def read_data(path, view_name):
    """Read parquet data from `path` and register a Spark temp view."""
    spark.read.parquet(path).createOrReplaceTempView(view_name)

read_data(os.path.join(base_data_dir, "type=decisions"), "decisions")
read_data(os.path.join(base_data_dir, "type=events"), "events")

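# Join events (aliased c) to decisions (d) for the same visitor where the
# event occurred at or after the decision, excluding holdback traffic.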
df = spark.sql("""
    SELECT c.uuid, c.timestamp, c.attributes, c.process_timestamp,
           c.visitor_id, c.entity_id, c.user_ip, c.user_agent,
           c.referer, c.event_name, c.revenue, c.value, c.client_engine
    FROM decisions d, events c
    WHERE d.visitor_id = c.visitor_id
      AND c.timestamp >= d.timestamp
      AND d.is_holdback = false
""")

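# Explode the nested attributes array into one row per attribute entry.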
df_exploded = df.withColumn("attributes", explode("attributes"))

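# Pivot attribute name/value pairs into one column per attribute,
# keeping one row per event uuid.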
df_pivoted = (
    df_exploded
    .select(
        col("uuid"),
        col("timestamp"),
        col("process_timestamp"),
        col("visitor_id"),
        col("user_ip"),
        col("user_agent"),
        col("referer"),
        col("client_engine"),
        col("attributes.name").alias("attributes_name"),
        col("attributes.value").alias("attributes_value"),
    )
    .groupBy(
        "uuid", "timestamp", "process_timestamp", "visitor_id",
        "user_ip", "user_agent", "referer", "client_engine",
    )
    .pivot("attributes_name")
    .agg(first("attributes_value"))
)

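# Convert to pandas and write the result to an Excel file (requires openpyxl).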
df_pandas = df_pivoted.toPandas()
df_pandas.to_excel("export.xlsx", index=False)

Other known causes

Work through these less-common causes after the previous checks come up clean. Each one can produce a 10%+ discrepancy on its own.

  • The visitor ID in the other analytics tool is not device-based.
  • The other tool uses an aliasing mechanism, like GA blended mode.
  • The other tool applies IP filtering or bot filtering that Optimizely is not also configured to apply.
  • Optimizely's bot list filters out HTTP libraries used by some server-side SDKs and the Events API. If Optimizely filters out your traffic, either turn off bot filtering or send a common browser user agent.

Related resources