
If you’re exporting GA4 data to BigQuery, you already know the power of raw event-level data. But raw data is only as good as your implementation. A misconfigured tag, a broken consent flow, or a silent SPA tracking issue can silently corrupt months of analytics — and you won’t notice until someone asks „why don’t the numbers add up?”
This guide provides a systematic, SQL-driven audit framework for GA4 data in BigQuery. Every query uses a rolling 90-day window and is designed to be copied, adapted to your project, and run immediately. The audit covers six critical areas: export health, user identity, consent management, SPA tracking integrity, cross-platform consistency, and traffic source attribution.
Note: Throughout this article, replace
your_project.your_dataset.events_*with your actual BigQuery GA4 export table path. All queries use a dynamic 90-day date range.
Before diving into specific implementation areas, verify that your BigQuery export itself is healthy. Missing tables, unexpected event volumes, or broken intraday exports can invalidate every downstream analysis.
This query checks whether you have an export table for every day in the last 90 days and flags days with suspiciously low event counts.
SELECT
PARSE_DATE('%Y%m%d', event_date) AS date,
COUNT(*) AS total_events,
COUNT(DISTINCT user_pseudo_id) AS unique_users,
COUNT(DISTINCT
CONCAT(user_pseudo_id, '-',
CAST((SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id') AS STRING))
) AS unique_sessions
FROM `your_project.your_dataset.events_*`
WHERE _TABLE_SUFFIX BETWEEN
FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY))
AND FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
GROUP BY date
ORDER BY date ASC;
What to look for: Gaps in dates (missing exports), days with near-zero events, or sudden spikes/drops that indicate tracking outages or tag misfires.
Verify that all expected events are being collected and that no rogue or misspelled event names have crept in.[3]
SELECT
event_name,
COUNT(*) AS event_count,
COUNT(DISTINCT user_pseudo_id) AS users_triggering,
ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER(), 2) AS pct_of_total
FROM `your_project.your_dataset.events_*`
WHERE _TABLE_SUFFIX BETWEEN
FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY))
AND FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
GROUP BY event_name
ORDER BY event_count DESC;
What to look for: Unexpected event names (typos, test events left in production), missing critical events like purchase or generate_lead, and the ratio between page_view and session_start events.
GA4 BigQuery exports — especially those involving Firebase — are known to contain 1-2% duplicate events caused by network retransmission on the client side.
WITH duplicates AS (
SELECT
event_name,
event_timestamp,
user_pseudo_id,
COUNT(*) AS occurrence_count
FROM `your_project.your_dataset.events_*`
WHERE _TABLE_SUFFIX BETWEEN
FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY))
AND FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
GROUP BY event_name, event_timestamp, user_pseudo_id
HAVING COUNT(*) > 1
)
SELECT
occurrence_count AS duplicate_level,
COUNT(*) AS number_of_cases,
SUM(occurrence_count - 1) AS excess_rows
FROM duplicates
GROUP BY duplicate_level
ORDER BY duplicate_level;
What to look for: A small percentage of duplicates (1-2%) is normal for Firebase/app data. If web-only properties show significant duplication, investigate GTM tag firing rules or Measurement Protocol misuse.
The user_id field in GA4 BigQuery is your authenticated identifier — the bridge between anonymous device-level tracking and real business users. If it’s misconfigured, cross-device stitching, LTV analysis, and CRM joins all break down.
This measures what percentage of your events carry a user_id, which directly indicates how well your authentication-based identity strategy is working.
SELECT
COUNT(*) AS total_events,
COUNTIF(user_id IS NOT NULL AND user_id != '') AS events_with_user_id,
COUNTIF(user_id IS NULL OR user_id = '') AS events_without_user_id,
ROUND(COUNTIF(user_id IS NOT NULL AND user_id != '') * 100.0 / COUNT(*), 2) AS user_id_coverage_pct
FROM `your_project.your_dataset.events_*`
WHERE _TABLE_SUFFIX BETWEEN
FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY))
AND FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY));
What to look for: If your site has authentication, coverage should be meaningful for post-login events. A site-wide rate below 5% on a platform with login functionality suggests the user_id is not being passed to the dataLayer or GA4 configuration tag correctly.
Not all events should have a user_id. Pre-login events like first_visit naturally won’t. This query helps you pinpoint which events are missing it unexpectedly.
SELECT
event_name,
COUNT(*) AS total_events,
COUNTIF(user_id IS NOT NULL AND user_id != '') AS with_user_id,
ROUND(COUNTIF(user_id IS NOT NULL AND user_id != '') * 100.0 / COUNT(*), 2) AS coverage_pct
FROM `your_project.your_dataset.events_*`
WHERE _TABLE_SUFFIX BETWEEN
FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY))
AND FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
GROUP BY event_name
ORDER BY total_events DESC;
What to look for: Post-authentication events (purchase, add_to_cart, login) should have near-100% user_id coverage. If login events themselves lack user_id, the dataLayer push is firing too late or not at all.[7]
A single user_id mapping to many user_pseudo_id values is normal (cross-device). But a single user_pseudo_id mapping to many user_id values is a red flag — it means multiple real users are sharing a device identifier, or the user_id is being set incorrectly (e.g., set to a static test value).[6]
-- Check: how many user_pseudo_ids per user_id (expected: multiple = cross-device)
SELECT
user_id,
COUNT(DISTINCT user_pseudo_id) AS pseudo_id_count
FROM `your_project.your_dataset.events_*`
WHERE _TABLE_SUFFIX BETWEEN
FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY))
AND FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
AND user_id IS NOT NULL
AND user_id != ''
GROUP BY user_id
ORDER BY pseudo_id_count DESC
LIMIT 50;
-- Check: how many user_ids per user_pseudo_id (expected: 1, red flag if > 1)
SELECT
user_pseudo_id,
COUNT(DISTINCT user_id) AS user_id_count
FROM `your_project.your_dataset.events_*`
WHERE _TABLE_SUFFIX BETWEEN
FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY))
AND FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
AND user_id IS NOT NULL
AND user_id != ''
GROUP BY user_pseudo_id
HAVING COUNT(DISTINCT user_id) > 1
ORDER BY user_id_count DESC
LIMIT 50;
What to look for: The second query should return very few rows. If one user_pseudo_id has 10+ different user_id values, you likely have a shared/kiosk device or a bug where user_id isn’t being cleared on logout.[6]
Ensure that user_id values look like real identifiers and not test data, PII, or garbled strings.
SELECT
user_id,
COUNT(*) AS event_count
FROM `your_project.your_dataset.events_*`
WHERE _TABLE_SUFFIX BETWEEN
FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY))
AND FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
AND user_id IS NOT NULL
AND user_id != ''
GROUP BY user_id
ORDER BY event_count DESC
LIMIT 100;
What to look for: Values like USER_ID, undefined, null, test, email addresses (PII leak!), or a single user_id appearing on an impossibly high number of events. Any PII in the user_id field is a GDPR/privacy violation that needs immediate remediation.
Consent Mode v2 is now mandatory for EU-based properties that want to use Google Ads features. Your Consent Management Platform (CMP) must correctly signal analytics_storage, ads_storage, ad_user_data, and ad_personalization to GA4. In BigQuery, the privacy_info record captures these consent signals at the event level.
This is the foundational query: determine whether consent mode is not implemented, basic, or advanced.[10]
SELECT
privacy_info.analytics_storage,
privacy_info.ads_storage,
COUNT(*) AS events_total,
COUNT(DISTINCT user_pseudo_id) AS users,
COUNT(DISTINCT
CONCAT(user_pseudo_id, '-',
CAST((SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id') AS STRING))
) AS sessions
FROM `your_project.your_dataset.events_*`
WHERE _TABLE_SUFFIX BETWEEN
FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY))
AND FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
GROUP BY
privacy_info.analytics_storage,
privacy_info.ads_storage
ORDER BY events_total DESC;
Interpreting the results:
| Scenario | analytics_storage | ads_storage | Meaning |
|---|---|---|---|
| No consent mode | All null | All null | CMP not integrated with GA4[8] |
| Basic consent mode | Only Yes | Yes or No | Data only collected when user consents[11] |
| Advanced consent mode | Yes, No, or null | Yes, No, or null | Cookieless pings sent even when denied[10] |
Track whether your consent rates are stable or if a CMP update introduced a regression.
SELECT
PARSE_DATE('%Y%m%d', event_date) AS date,
COUNTIF(privacy_info.analytics_storage = 'Yes') AS consented_events,
COUNTIF(privacy_info.analytics_storage = 'No') AS denied_events,
COUNTIF(privacy_info.analytics_storage IS NULL) AS unknown_events,
COUNT(*) AS total_events,
ROUND(COUNTIF(privacy_info.analytics_storage = 'Yes') * 100.0 / COUNT(*), 2) AS consent_rate_pct
FROM `your_project.your_dataset.events_*`
WHERE _TABLE_SUFFIX BETWEEN
FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY))
AND FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
GROUP BY date
ORDER BY date ASC;
What to look for: A sudden drop in consent rate often signals a CMP misconfiguration rather than a change in user behavior. Hybrid CMP remnants from a migration (e.g., legacy OneTrust dataLayer events still firing alongside a new setup) are a common culprit.
In advanced consent mode, denied events are sent as „cookieless pings” without user_pseudo_id or ga_session_id. This query measures their impact on your data.
SELECT
privacy_info.analytics_storage,
COUNT(*) AS event_count,
COUNTIF(user_pseudo_id IS NULL OR user_pseudo_id = '') AS events_without_pseudo_id,
COUNTIF(
(SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id') IS NULL
) AS events_without_session_id,
ROUND(COUNTIF(user_pseudo_id IS NULL OR user_pseudo_id = '') * 100.0 / COUNT(*), 2) AS pct_no_identity
FROM `your_project.your_dataset.events_*`
WHERE _TABLE_SUFFIX BETWEEN
FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY))
AND FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
GROUP BY privacy_info.analytics_storage;
What to look for: Events with analytics_storage = 'No' should show near-100% missing user_pseudo_id. If events with analytics_storage = 'Yes' are also missing identity fields, there’s a deeper problem with your tag configuration.
Verify that ads_storage and analytics_storage signals behave logically together.
SELECT
privacy_info.analytics_storage AS analytics_consent,
privacy_info.ads_storage AS ads_consent,
privacy_info.uses_transient_token,
COUNT(*) AS events,
ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER(), 2) AS pct_of_total
FROM `your_project.your_dataset.events_*`
WHERE _TABLE_SUFFIX BETWEEN
FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY))
AND FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
GROUP BY
privacy_info.analytics_storage,
privacy_info.ads_storage,
privacy_info.uses_transient_token
ORDER BY events DESC;
What to look for: Common valid combinations are Yes/Yes, Yes/No, No/No. A combination like No/Yes (denying analytics but allowing ads) is technically possible but very unusual and may indicate a CMP signal ordering issue.
Single Page Applications (SPAs) are notorious for breaking GA4’s automatic tracking. GA4 relies on browser History API changes to detect page navigations in SPAs, but this mechanism frequently fails due to race conditions, asynchronous content loading, and incorrect History API usage.
Every session should have at least one session_start and at least one page_view. A significant mismatch indicates tracking problems.
WITH session_events AS (
SELECT
CONCAT(user_pseudo_id, '-',
CAST((SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id') AS STRING)
) AS session_key,
COUNTIF(event_name = 'session_start') AS session_starts,
COUNTIF(event_name = 'page_view') AS page_views,
COUNTIF(event_name = 'screen_view') AS screen_views
FROM `your_project.your_dataset.events_*`
WHERE _TABLE_SUFFIX BETWEEN
FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY))
AND FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
AND user_pseudo_id IS NOT NULL
GROUP BY session_key
)
SELECT
CASE
WHEN session_starts = 0 AND page_views = 0 THEN 'No session_start, No page_view'
WHEN session_starts = 0 AND page_views > 0 THEN 'No session_start, Has page_view'
WHEN session_starts > 0 AND page_views = 0 THEN 'Has session_start, No page_view'
WHEN session_starts > 1 THEN 'Multiple session_starts'
WHEN session_starts = 1 AND page_views >= 1 THEN 'Healthy (1 start, 1+ views)'
END AS session_health,
COUNT(*) AS session_count,
ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER(), 2) AS pct
FROM session_events
GROUP BY session_health
ORDER BY session_count DESC;
What to look for: Sessions without session_start can occur when events are sent via the Measurement Protocol without including ga_session_id. Sessions without page_view may indicate SPA tracking failures where the History API change doesn’t fire.
SPAs frequently produce „(not set)” landing pages when the initial page_view event fails to fire or fires before the page content has loaded.
SELECT
(SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'page_location') AS landing_page,
COUNT(*) AS page_view_count
FROM `your_project.your_dataset.events_*`
WHERE _TABLE_SUFFIX BETWEEN
FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY))
AND FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
AND event_name = 'page_view'
AND (SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'entrances') = 1
GROUP BY landing_page
ORDER BY page_view_count DESC
LIMIT 50;
What to look for: If (not set) or null appears as a top landing page, your SPA isn’t firing page_view correctly on initial load. This is one of the most common GA4 SPA issues.
SPAs can fire duplicate page_view events when both Enhanced Measurement’s history change listener and a custom GTM trigger both respond to the same navigation.
WITH session_pages AS (
SELECT
CONCAT(user_pseudo_id, '-',
CAST((SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id') AS STRING)
) AS session_key,
(SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'page_location') AS page_location,
event_timestamp,
COUNT(*) AS fire_count
FROM `your_project.your_dataset.events_*`
WHERE _TABLE_SUFFIX BETWEEN
FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY))
AND FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
AND event_name = 'page_view'
AND user_pseudo_id IS NOT NULL
GROUP BY session_key, page_location, event_timestamp
HAVING COUNT(*) > 1
)
SELECT
fire_count AS duplicate_level,
COUNT(*) AS occurrences
FROM session_pages
GROUP BY fire_count
ORDER BY fire_count;
What to look for: Any results here mean a page_view fired multiple times for the same page at the same timestamp within a session. This typically happens on SPAs when Enhanced Measurement’s „Page changes based on browser history events” is enabled alongside a custom page_view trigger in GTM.
Compare page depth between web and app. On SPAs, artificially low page views per session often indicate missed navigations.
WITH session_views AS (
SELECT
platform,
CONCAT(user_pseudo_id, '-',
CAST((SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id') AS STRING)
) AS session_key,
COUNTIF(event_name IN ('page_view', 'screen_view')) AS views
FROM `your_project.your_dataset.events_*`
WHERE _TABLE_SUFFIX BETWEEN
FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY))
AND FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
AND user_pseudo_id IS NOT NULL
GROUP BY platform, session_key
)
SELECT
platform,
COUNT(*) AS total_sessions,
ROUND(AVG(views), 2) AS avg_views_per_session,
ROUND(APPROX_QUANTILES(views, 100)[OFFSET(50)], 2) AS median_views_per_session
FROM session_views
GROUP BY platform
ORDER BY total_sessions DESC;
What to look for: If the WEB platform shows an average of ~1.0 page views per session on a multi-page or SPA site, the history change tracking is almost certainly broken.
When a GA4 property collects data from both a website and mobile apps, the data flows into the same BigQuery tables but uses different event names, parameters, and identity mechanisms. Verifying consistency across platforms is critical.
A basic but essential check: are all your data streams (WEB, IOS, ANDROID) actively sending data?
SELECT
platform,
stream_id,
COUNT(*) AS total_events,
COUNT(DISTINCT user_pseudo_id) AS unique_users,
COUNT(DISTINCT event_name) AS unique_event_names,
MIN(PARSE_DATE('%Y%m%d', event_date)) AS first_event_date,
MAX(PARSE_DATE('%Y%m%d', event_date)) AS last_event_date
FROM `your_project.your_dataset.events_*`
WHERE _TABLE_SUFFIX BETWEEN
FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY))
AND FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
GROUP BY platform, stream_id
ORDER BY total_events DESC;
What to look for: Missing platforms (e.g., IOS shows no data), large discrepancies between iOS and Android volumes that don’t match your user base distribution, or gaps in the date range for a specific platform.
Events that should be tracked on all platforms (like login, purchase, sign_up) sometimes only fire on one.
SELECT
event_name,
COUNTIF(platform = 'WEB') AS web_events,
COUNTIF(platform = 'IOS') AS ios_events,
COUNTIF(platform = 'ANDROID') AS android_events,
COUNT(*) AS total
FROM `your_project.your_dataset.events_*`
WHERE _TABLE_SUFFIX BETWEEN
FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY))
AND FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
GROUP BY event_name
ORDER BY total DESC;
What to look for: Business-critical events with zero count on one platform. Remember that page_view is web-only and screen_view is app-only by design — that’s not a bug. But custom events like add_to_cart or begin_checkout should appear across all platforms if the functionality exists.
Check whether the user_id implementation is consistent across all platforms.
SELECT
platform,
COUNT(*) AS total_events,
COUNTIF(user_id IS NOT NULL AND user_id != '') AS events_with_user_id,
ROUND(COUNTIF(user_id IS NOT NULL AND user_id != '') * 100.0 / COUNT(*), 2) AS user_id_coverage_pct
FROM `your_project.your_dataset.events_*`
WHERE _TABLE_SUFFIX BETWEEN
FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY))
AND FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
GROUP BY platform
ORDER BY total_events DESC;
What to look for: If your Android app shows 40% user_id coverage but iOS shows 5%, the iOS implementation is likely broken. Both apps should have similar authentication patterns and therefore similar coverage rates.
Compare session engagement and duration across platforms to spot data collection anomalies rather than genuine user behavior differences.
WITH session_data AS (
SELECT
platform,
CONCAT(user_pseudo_id, '-',
CAST((SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id') AS STRING)
) AS session_key,
MAX(COALESCE(
(SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'session_engaged'),
CAST((SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'session_engaged') AS STRING)
)) AS is_engaged,
SUM(COALESCE(
(SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'engagement_time_msec'), 0
)) AS total_engagement_ms
FROM `your_project.your_dataset.events_*`
WHERE _TABLE_SUFFIX BETWEEN
FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY))
AND FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
AND user_pseudo_id IS NOT NULL
GROUP BY platform, session_key
)
SELECT
platform,
COUNT(*) AS total_sessions,
COUNTIF(is_engaged = '1') AS engaged_sessions,
ROUND(COUNTIF(is_engaged = '1') * 100.0 / COUNT(*), 2) AS engagement_rate_pct,
ROUND(AVG(total_engagement_ms) / 1000, 2) AS avg_engagement_sec
FROM session_data
GROUP BY platform
ORDER BY total_sessions DESC;
What to look for: Wildly different engagement rates between iOS and Android (e.g., 80% vs 20%) usually indicate an implementation issue rather than real user behavior. Also note that web and app platforms may differ in how session_engaged is tracked (string vs integer).
An abnormally high proportion of direct traffic is one of the most common indicators of tracking problems. GA4 uses a „last non-direct click” attribution model, meaning it looks back up to 90 days to find a non-direct source before attributing a session to direct.
The core question: is your direct traffic proportion suspiciously high?
WITH session_sources AS (
SELECT
CONCAT(user_pseudo_id, '-',
CAST((SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id') AS STRING)
) AS session_key,
COALESCE(
(SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'medium'),
'(none)'
) AS session_medium,
COALESCE(
(SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'source'),
'(direct)'
) AS session_source
FROM `your_project.your_dataset.events_*`
WHERE _TABLE_SUFFIX BETWEEN
FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY))
AND FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
AND event_name = 'session_start'
AND user_pseudo_id IS NOT NULL
),
classified AS (
SELECT
session_key,
session_source,
session_medium,
CASE
WHEN session_medium IN ('(none)', '(not set)') OR session_medium IS NULL THEN 'Direct'
ELSE 'Non-Direct'
END AS traffic_type
FROM session_sources
)
SELECT
traffic_type,
COUNT(DISTINCT session_key) AS sessions,
ROUND(COUNT(DISTINCT session_key) * 100.0 /
SUM(COUNT(DISTINCT session_key)) OVER(), 2) AS pct_of_total
FROM classified
GROUP BY traffic_type
ORDER BY sessions DESC;
What to look for: Direct traffic above 40-50% of total sessions often signals broken UTM parameters, missing referrer data, HTTP-to-HTTPS migration issues, or app traffic being misattributed. For most websites, 20-30% direct is a healthy baseline.
For properties using the newer BigQuery export schema (post-2024), the session_traffic_source_last_click record provides the most accurate session-level attribution.
SELECT
COALESCE(
session_traffic_source_last_click.cross_channel_campaign.source,
traffic_source.source,
'(direct)'
) AS source,
COALESCE(
session_traffic_source_last_click.cross_channel_campaign.medium,
traffic_source.medium,
'(none)'
) AS medium,
COALESCE(
session_traffic_source_last_click.cross_channel_campaign.campaign_name,
traffic_source.name,
'(not set)'
) AS campaign,
COUNT(DISTINCT
CONCAT(user_pseudo_id, '-',
CAST((SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id') AS STRING))
) AS sessions,
COUNT(DISTINCT user_pseudo_id) AS users
FROM `your_project.your_dataset.events_*`
WHERE _TABLE_SUFFIX BETWEEN
FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY))
AND FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
GROUP BY source, medium, campaign
ORDER BY sessions DESC
LIMIT 50;
What to look for: (not set) values in source/medium, which indicate attribution data loss. Also look for the known GA4 BigQuery export bug where paid search (gclid) events are misattributed as organic or direct.
Build a custom default channel grouping to match GA4’s UI reporting and compare.
WITH session_attribution AS (
SELECT
CONCAT(user_pseudo_id, '-',
CAST((SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id') AS STRING)
) AS session_key,
COALESCE(
session_traffic_source_last_click.cross_channel_campaign.source,
'(direct)'
) AS source,
COALESCE(
session_traffic_source_last_click.cross_channel_campaign.medium,
'(none)'
) AS medium
FROM `your_project.your_dataset.events_*`
WHERE _TABLE_SUFFIX BETWEEN
FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY))
AND FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
AND user_pseudo_id IS NOT NULL
),
with_channel AS (
SELECT
session_key,
source,
medium,
CASE
WHEN source = '(direct)' AND medium IN ('(none)', '(not set)') THEN 'Direct'
WHEN medium = 'organic' THEN 'Organic Search'
WHEN medium IN ('cpc', 'ppc', 'paidsearch') THEN 'Paid Search'
WHEN medium IN ('display', 'banner', 'cpm') THEN 'Display'
WHEN medium = 'email' THEN 'Email'
WHEN medium = 'referral' THEN 'Referral'
WHEN medium IN ('social', 'social-network', 'social-media', 'sm', 'social network') THEN 'Organic Social'
WHEN medium IN ('paid_social', 'paidsocial') THEN 'Paid Social'
WHEN medium = 'affiliate' THEN 'Affiliates'
WHEN medium IN ('video', 'youtube') THEN 'Video'
ELSE 'Other'
END AS channel_group
FROM session_attribution
)
SELECT
channel_group,
COUNT(DISTINCT session_key) AS sessions,
ROUND(COUNT(DISTINCT session_key) * 100.0 /
SUM(COUNT(DISTINCT session_key)) OVER(), 2) AS pct_of_total
FROM with_channel
GROUP BY channel_group
ORDER BY sessions DESC;
What to look for: Compare these percentages to GA4’s UI Traffic Acquisition report. Large discrepancies (>10%) between BigQuery and the GA4 UI often stem from differences in how consent mode and reporting identity affect each system.
The collected_traffic_source field provides un-sessionized, raw attribution data. This is the most granular way to diagnose attribution issues.
SELECT
collected_traffic_source.manual_source AS utm_source,
collected_traffic_source.manual_medium AS utm_medium,
collected_traffic_source.manual_campaign_name AS utm_campaign,
CASE WHEN collected_traffic_source.gclid IS NOT NULL THEN 'Yes' ELSE 'No' END AS has_gclid,
COUNT(*) AS events,
COUNT(DISTINCT user_pseudo_id) AS users
FROM `your_project.your_dataset.events_*`
WHERE _TABLE_SUFFIX BETWEEN
FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY))
AND FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
AND (collected_traffic_source.manual_source IS NOT NULL
OR collected_traffic_source.gclid IS NOT NULL)
GROUP BY utm_source, utm_medium, utm_campaign, has_gclid
ORDER BY events DESC
LIMIT 50;
What to look for: Events with a gclid but null manual_source indicate the known GA4 BigQuery export bug where Google Ads clicks lose their attribution. Sessions with gclid should be attributed to google / cpc.
Beyond the six core areas above, these supplementary queries catch issues that are easy to overlook but can significantly impact data quality.
GA4 event parameters can hold string, integer, float, or double values. If the same parameter key switches types across events, downstream queries will silently return nulls.
SELECT
ep.key AS param_key,
COUNTIF(ep.value.string_value IS NOT NULL) AS string_count,
COUNTIF(ep.value.int_value IS NOT NULL) AS int_count,
COUNTIF(ep.value.float_value IS NOT NULL) AS float_count,
COUNTIF(ep.value.double_value IS NOT NULL) AS double_count
FROM `your_project.your_dataset.events_*`,
UNNEST(event_params) AS ep
WHERE _TABLE_SUFFIX BETWEEN
FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY))
AND FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
GROUP BY param_key
HAVING
(COUNTIF(ep.value.string_value IS NOT NULL) > 0) +
(COUNTIF(ep.value.int_value IS NOT NULL) > 0) +
(COUNTIF(ep.value.float_value IS NOT NULL) > 0) +
(COUNTIF(ep.value.double_value IS NOT NULL) > 0) > 1
ORDER BY param_key;
What to look for: Parameters that appear in multiple data type columns. For example, a value parameter that’s sometimes a string and sometimes an integer means different tags or platforms are sending it inconsistently.
If the property tracks purchases, verify that every purchase event has a valid transaction_id.
SELECT
COUNT(*) AS total_purchase_events,
COUNTIF(
(SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'transaction_id')
IS NULL OR
(SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'transaction_id') = ''
) AS missing_transaction_id,
COUNT(DISTINCT
(SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'transaction_id')
) AS unique_transaction_ids
FROM `your_project.your_dataset.events_*`
WHERE _TABLE_SUFFIX BETWEEN
FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY))
AND FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
AND event_name = 'purchase';
What to look for: Missing transaction_id values make deduplication impossible and inflate revenue reporting. If total_purchase_events is much larger than unique_transaction_ids, you have duplicate purchase tracking.
Verify that intraday tables are being created and that data latency is within acceptable bounds.
SELECT
MAX(TIMESTAMP_MICROS(event_timestamp)) AS latest_event,
TIMESTAMP_DIFF(CURRENT_TIMESTAMP(),
MAX(TIMESTAMP_MICROS(event_timestamp)), HOUR) AS hours_since_latest
FROM `your_project.your_dataset.events_intraday_*`
WHERE _TABLE_SUFFIX = FORMAT_DATE('%Y%m%d', CURRENT_DATE());
What to look for: If hours_since_latest exceeds 4-6 hours during business hours, your streaming export may be broken or the intraday table is not being created.
This audit framework is designed to be executed quarterly, or after any significant implementation change such as a CMP migration, GTM container update, or new data stream addition.
The queries in this guide are starting points. Every GA4 implementation is unique — adapt the thresholds, add custom event checks, and extend the channel grouping logic to match your specific marketing stack. The goal isn’t just to find problems; it’s to build confidence that the data driving your business decisions is accurate.
Mail: marcin [@] milowski.eu lub skontaktuj się przez Facebook messenger