What 60,000 Data Points Actually Means
When we say we monitor 60,000+ data points daily, that number breaks down into specific, countable inputs:
| Pipeline | Data Points/Day | What It Collects |
|---|---|---|
| gads-sync | ~53,000 | Campaign metrics, keyword performance, search terms, ad group stats, budget pacing, Quality Scores, auction insights across 24 Google Ads accounts |
| webopt-data-sync | ~2,000 | GA4 sessions, bounce rates, page performance, Google Search Console impressions, clicks, positions, PageSpeed scores |
| semrush-data-sync | ~2,500 | Keyword positions, visibility scores, competitor movements, backlink profiles, site audit findings, domain authority metrics |
| Budget monitoring | ~3,744 | Spend snapshots every 30 minutes across all managed accounts (24 accounts x 48 checks x ~3.25 metrics per check) |
| Playwright health checks | 168 | Site status across 21 websites, 8 categories each: fatal errors, broken CSS, broken images, JS crashes, failed requests, cache config, nav accessibility, visual regressions |
That total isn't a marketing number. It's the sum of database rows written per day across the collection processes above. Each row represents a specific metric at a specific time for a specific client. We can query the exact count on any given day.
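The arithmetic is checkable. A minimal sketch that sums the per-pipeline figures from the table above:

```python
# Approximate data points written per day, per pipeline (figures from the table above).
daily_counts = {
    "gads-sync": 53_000,
    "webopt-data-sync": 2_000,
    "semrush-data-sync": 2_500,
    "budget-monitoring": 24 * 48 * 3.25,  # 24 accounts x 48 checks x ~3.25 metrics
    "playwright-health": 21 * 8,          # 21 sites x 8 check categories
}

total = sum(daily_counts.values())
print(f"{total:,.0f}")  # roughly 61,412 -- comfortably above 60,000
```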
The Four Pipelines
Each pipeline runs independently. If one fails, the others keep collecting. This isn't a single fragile process that takes everything down when something breaks.
Pipeline 1: gads-sync (Every 6 Hours)
This is the largest pipeline by volume. It connects to each of our 24 managed Google Ads accounts and pulls campaign-level metrics, keyword performance data, search term reports, ad group statistics, Quality Score components, and auction insights. The sync runs every six hours, producing four snapshots per day per account. That means spend, conversion, and performance data is never more than six hours old.
The dataset count is large because Google Ads exposes data at multiple levels of granularity. A single account with 5 campaigns, 20 ad groups, and 200 keywords generates campaign-level rows, ad-group-level rows, keyword-level rows, and search-term-level rows. Multiply that across 24 accounts and four sync cycles per day, and 53,000 data points is a conservative count.
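A rough sketch of how those counts compound, using the illustrative account shape from the paragraph above (the search-term figure is an assumed placeholder; real counts vary widely by account):

```python
# Illustrative per-account shape from the text; search_terms is an assumed figure.
campaigns, ad_groups, keywords = 5, 20, 200
search_terms = 300  # hypothetical: varies widely by account

rows_per_sync = campaigns + ad_groups + keywords + search_terms  # one row per entity
accounts, syncs_per_day = 24, 4

daily_rows = rows_per_sync * accounts * syncs_per_day
print(daily_rows)  # 525 rows/sync x 24 accounts x 4 syncs = 50,400
```

Even with these modest assumptions the estimate lands in the same ballpark as the ~53,000 figure, which is why that count reads as conservative.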
What it catches: a campaign that suddenly stops converting, a keyword whose cost-per-click spikes beyond historical norms, a search term report showing irrelevant queries eating budget, a competitor entering an auction and driving up impression share competition. These patterns surface in the data hours after they happen, not days. A wasteful search term that has cost $50 by the time our sync flags it would have cost $200 or more by the time a weekly check caught it.
What a break looks like: if the Google Ads API returns errors or an account's OAuth token expires, the pipeline logs the failure and retries on the next cycle. No data is lost from other accounts; the failure is isolated. We've designed each pipeline to treat accounts as independent units. An authentication problem with one account doesn't block or delay data collection for the other 23.
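The isolation pattern described above can be sketched in a few lines. Everything here is a hypothetical stand-in for illustration (account IDs, the `sync_account` function, and the simulated failure), not the production pipeline:

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("gads-sync")

def sync_account(account_id: str) -> int:
    """Hypothetical stand-in: pull and write one account's metrics, return row count."""
    if account_id == "acct-07":  # simulate an expired OAuth token
        raise PermissionError("OAuth token expired")
    return 2200  # pretend rows written

def run_sync(account_ids: list[str]) -> dict[str, str]:
    """Each account is an independent unit: one failure never blocks the rest."""
    results = {}
    for acct in account_ids:
        try:
            rows = sync_account(acct)
            results[acct] = f"ok ({rows} rows)"
        except Exception as exc:
            log.warning("%s failed: %s -- will retry next cycle", acct, exc)
            results[acct] = "failed"  # logged, retried on the next 6-hour cycle
    return results

print(run_sync(["acct-01", "acct-07", "acct-12"]))
```

The key design choice is that the `try`/`except` sits inside the per-account loop, so an authentication error is recorded and skipped rather than aborting the run.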
Pipeline 2: webopt-data-sync (Daily)
This pipeline pulls organic performance data from three sources per client: GA4 traffic metrics (sessions, engaged sessions, bounce rates, page-level performance), Google Search Console data (impressions, clicks, average position, and click-through rates broken down by query and page), and PageSpeed scores (Core Web Vitals: Largest Contentful Paint, Cumulative Layout Shift, Interaction to Next Paint, plus overall Lighthouse scores).
It runs once daily because these metrics don't change hour-to-hour the way ad spend does. Organic search positions shift gradually, and GA4 processes data with a delay that makes intraday polling unnecessary.
What it catches: a sudden drop in organic traffic that signals an indexation problem or algorithm update impact, a page that lost its featured snippet to a competitor, a PageSpeed regression caused by a new third-party script or unoptimized hero image, a Search Console coverage error that's preventing new pages from being indexed.
The daily cadence also creates a clean time series. When a client asks "when did our traffic start dropping?" we can point to the exact day, not an approximate week. That precision matters for root cause analysis. A traffic drop that coincides with a plugin update on the same day tells a different story than a gradual decline over two weeks.
What a break looks like: GA4 or Search Console API authentication failures are caught and logged. The pipeline reports which clients succeeded and which failed, so we can address access issues without losing data from functioning accounts. A common cause is Google Workspace permission changes on the client side. When someone on the client's team changes their GA4 property permissions, our service account loses access. The pipeline flags it, and we resolve the access issue, typically within a business day.
Pipeline 3: semrush-data-sync (Daily/Weekly/Monthly)
SEMrush data operates on three cadences because different metrics change at different rates:
- Daily: Position tracking visibility scores and keyword rankings for 1,000+ tracked keywords
- Weekly: Backlink profile changes, referring domain counts, site audit health scores
- Monthly: Domain analytics, authority score trends, competitive landscape shifts
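The three tiers above can be represented as a simple schedule map. Dataset names and the weekly/monthly trigger conventions here are hypothetical illustrations, not the actual sync configuration:

```python
from datetime import date

# Hypothetical dataset-to-cadence map mirroring the three tiers above.
CADENCE = {
    "position_tracking": "daily",
    "keyword_rankings": "daily",
    "backlinks": "weekly",
    "site_audit": "weekly",
    "domain_analytics": "monthly",
    "authority_score": "monthly",
}

def due_today(dataset: str, today: date) -> bool:
    """Daily runs every day; weekly on Mondays; monthly on the 1st (assumed convention)."""
    cadence = CADENCE[dataset]
    if cadence == "daily":
        return True
    if cadence == "weekly":
        return today.weekday() == 0  # Monday
    return today.day == 1            # first of the month

print([d for d in CADENCE if due_today(d, date(2025, 9, 1))])
```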
What it catches: a competitor launching a content campaign (visible in their position changes before their traffic impact shows up), a sudden spike in toxic backlinks, a site audit regression from a CMS update.
What a break looks like: SEMrush API rate limits occasionally throttle the sync. The pipeline backs off and retries. Individual dataset failures don't block other datasets.
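The back-off-and-retry behavior can be sketched like this. The `flaky_fetch` stand-in and the delay schedule are illustrative assumptions, not the SEMrush client's actual values:

```python
import time

def fetch_with_backoff(fetch, max_attempts=4, base_delay=2.0):
    """Retry on rate-limit errors with exponential back-off (illustrative delays)."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except RuntimeError:  # stand-in for an API rate-limit error
            if attempt == max_attempts - 1:
                raise  # exhausted: surface the failure for this dataset only
            time.sleep(base_delay * (2 ** attempt))  # 2s, 4s, 8s, ...

# Demo: a fetch that gets rate-limited twice, then succeeds.
attempts = {"n": 0}
def flaky_fetch():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("rate limited")
    return {"dataset": "backlinks", "rows": 120}

print(fetch_with_backoff(flaky_fetch, base_delay=0.01))
```

Because each dataset is fetched through its own retry wrapper, a throttled backlinks pull never delays the position-tracking data.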
Pipeline 4: Budget Monitoring (Every 30 Minutes)
This is the most time-sensitive pipeline. Every 30 minutes, it checks spend levels across all managed Google Ads accounts against their monthly budgets.
What it catches: an account pacing to overspend its monthly budget. Alerts fire at 110% of the expected daily run rate, giving us time to adjust bids or pause campaigns before the overspend becomes significant.
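The pacing math behind the 110% threshold, as a minimal sketch (the budget figures are hypothetical, and this assumes straight-line pacing across the month):

```python
def pacing_alert(month_to_date_spend: float, monthly_budget: float,
                 day_of_month: int, days_in_month: int,
                 threshold: float = 1.10) -> bool:
    """Alert when spend exceeds 110% of the straight-line expected run rate."""
    expected = monthly_budget * (day_of_month / days_in_month)
    return month_to_date_spend > threshold * expected

# Hypothetical account: $3,000/month budget, day 10 of a 30-day month.
# Expected spend so far: $1,000; alert threshold: $1,100.
print(pacing_alert(1_150, 3_000, 10, 30))  # spend is ahead of pace
```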
What a break looks like: if the monitoring script fails, the team is notified of the monitoring failure itself. The system monitors itself. We treat a monitoring gap the same as a detected problem, because a gap in monitoring is itself a risk that needs immediate resolution.
One Database, Zero Conflicting Numbers
Every pipeline writes to the same PostgreSQL database. We call it Thor. It holds 80+ tables, each with a defined schema, and every table traces back to a specific data source.
This solves the most common reporting problem in marketing: conflicting numbers. When GA4 shows one traffic number, the SEMrush dashboard shows another, and last week's report used a third, nobody knows what's real. With Thor, there's one number for each metric, and it came from one source at one time.
We enforce three authorities, each responsible for different questions:
| Authority | What It Answers | Examples |
|---|---|---|
| Thor | How is performance trending? What are the numbers? | Traffic, conversions, keyword positions, ad spend, PageSpeed scores |
| CRM | Who are our clients? What services do they have? | Client roster, active services, contact information, billing |
| Strategy docs | What are we trying to achieve? | Keyword targets, competitive positioning, quarterly goals |
These three never overlap. Thor doesn't store client contact information. The CRM doesn't store keyword rankings. Strategy docs don't contain raw performance data. When you need an answer, there's exactly one place to look.
Read more about why this separation matters in Why We Built a Single Source of Truth for Client Data.
The Traceability Chain
Every number in a client report follows a five-step chain back to its origin:
- Brief - The strategy document that set the objective and defined what metrics to track
- Data layer - The tag, event, or API endpoint that generates the raw data
- Structured table - The Thor database table where the pipeline writes the processed data
- Raw snapshot - The original API response, stored as write-once JSONB, preserved indefinitely
- Report run - The specific report generation that pulled the number and presented it
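The five-step chain above can be modeled as linked records, each pointing at its upstream source. All identifiers and field names here are hypothetical illustrations, not the actual Thor schema:

```python
# Hypothetical records keyed by chain step; "source" points one step upstream.
chain = {
    "report_run":       {"id": "run-2024-03", "source": "structured_table"},
    "structured_table": {"id": "gads_campaign_daily", "source": "raw_snapshot"},
    "raw_snapshot":     {"id": "snap-8841", "source": "data_layer"},
    "data_layer":       {"id": "ads-api:campaign_report", "source": "brief"},
    "brief":            {"id": "strategy-2024-q1", "source": None},
}

def trace_back(start: str) -> list[str]:
    """Walk backward from a report run to the brief that defined the metric."""
    path, step = [], start
    while step is not None:
        record = chain[step]
        path.append(f"{step} ({record['id']})")
        step = record["source"]
    return path

for hop in trace_back("report_run"):
    print(hop)
```

The walk always terminates at the brief, which is the point: every number has exactly one upstream path, so "where did this come from?" is a lookup, not an investigation.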
If a client asks "where did this number come from?" we can walk backward from the report to the raw API response. If the number looks wrong, we can compare the structured table against the raw snapshot to see if a processing error introduced a discrepancy. If the raw snapshot itself looks wrong, we can re-pull from the source API and compare.
This isn't theoretical. We've used the traceability chain to debug discrepancies between Google Ads reported conversions and our internal tracking. The raw snapshots showed a conversion action had been reconfigured mid-month, splitting what was one metric into two. Without the snapshot archive, that would have looked like a conversion drop. With it, we could see exactly what changed and when.
Another example: a client's GA4 property showed a traffic spike that didn't correspond to any campaign change. Walking the traceability chain backward, we found the spike coincided with a referral spam domain hitting the site. The raw Search Console snapshot from that day showed no corresponding increase in organic impressions, which confirmed the traffic was not from search. Without the chain, the team might have spent hours investigating a "growth" signal that was actually noise.
Why This Matters for Your Business
The infrastructure described above isn't something clients interact with directly. But it shapes every interaction they do have with us.
Problems caught in hours, not weeks. A conversion tracking break that goes undetected for 30 days means 30 days of ad spend optimized against the wrong signal. Our pipelines catch that within 6 hours. A site error that makes your business look broken on mobile gets caught at the next morning's health check, not when a customer mentions it. Read about specific failure scenarios in What Breaks at 2 AM.
Numbers never conflict. When your monthly report says organic traffic grew 12%, that number came from one place and we can prove it. There's no "well, GA4 shows this but Search Console shows that" ambiguity. One source, one number, one truth.
Historical data is preserved. When we want to compare this March to last March, the data is there. When we want to see how a specific keyword's position has trended over 12 months, the daily snapshots are there. When a new team member picks up an account, they can see the full history, not just last month.
Reports are reproducible. If you run the same report query against the same date range twice, you get the same answer. Data doesn't get overwritten or aged out. The raw snapshots are write-once and permanent.
This infrastructure is what separates reactive marketing management (waiting for something to go wrong, then scrambling) from proactive marketing management (catching problems early, having the data to diagnose them, and having the history to put them in context). The monthly fee doesn't just pay for someone to look at dashboards. It pays for the system that makes those dashboards trustworthy.
What About AI Visibility?
The monitoring landscape is changing. Traditional SEO tracked positions in Google's organic results. Now businesses appear in AI-generated answers across multiple platforms: Google AI Overviews, ChatGPT, Perplexity, Claude, Gemini, Copilot, You.com, Phind, and Meta AI.
We're building this monitoring into our pipeline infrastructure because AI citation is becoming a traffic and credibility signal that matters. When an AI platform cites your business in a response, that's visibility you can't get from traditional ranking. When it stops citing you, that's a loss you need to understand. The same pipeline architecture that monitors 60,000 data points from traditional sources extends naturally to AI visibility tracking.
The principle remains the same: measure it consistently, store it centrally, and make it traceable. Whether the data comes from a Google Ads API or an AI platform's citation pattern, the infrastructure handles it the same way.
Explore our SEO services to see how this infrastructure supports ongoing optimization work.