The Problem With Scattered Data
Here's the reality at most agencies. The Google Ads manager has their data in the Google Ads interface and maybe an exported spreadsheet. The SEO team has their data in SEMrush dashboards and Google Search Console. The web team has their data in GA4. The account manager has a slide deck from two weeks ago with numbers that may or may not match any of the above.
A client asks: "Is our organic traffic up or down?" The SEO team checks GA4 and says up 8%. The account manager checks the last report and says up 12%. The discrepancy exists because they're looking at different date ranges, different filters, or different comparison periods. Neither number is wrong, exactly. But the client has now lost confidence in both.
This problem compounds with scale. At 5 clients, you can keep track of which spreadsheet has the latest numbers. At 33 clients across Google Ads, SEO, and web optimization, scattered data becomes a structural risk. Someone will report a wrong number. Someone will miss a trend because they were looking at the wrong dashboard. Someone will make a strategy recommendation based on stale data.
The problem isn't people being careless. It's that scattered data systems make correctness impossible to guarantee. When the same metric can be queried in three different tools with three different default date ranges and three different attribution models, discrepancies aren't bugs. They're a feature of the architecture. The only way to eliminate them is to change the architecture.
We built Thor to make this problem impossible.
Three Authorities, Zero Overlap
The first design decision was defining clear boundaries. Not everything goes in the database. Three systems each own specific types of information, and their responsibilities don't overlap.
| Authority | What It Answers | What It Does NOT Answer |
|---|---|---|
| Thor (PostgreSQL database) | How is performance trending? What are the numbers? What happened on this date? | Who is this client? What services do they pay for? What's the strategic direction? |
| CRM | Who are our clients? What services are active? When was the last communication? | What are the traffic numbers? How is the campaign performing? |
| Strategy documents | What are we trying to achieve? What keywords are we targeting? What's the competitive positioning? | What are the actual performance metrics? Who's the billing contact? |
This separation is enforced, not suggested. When someone needs performance data, they query Thor. When someone needs client context, they check the CRM. When someone needs strategic direction, they read the strategy document. There's exactly one place to look for any given question.
The benefit isn't just organizational tidiness. It eliminates an entire category of errors: the error that comes from updating data in one place but not another. If traffic data only lives in Thor, you can't have a situation where the database says one thing and a spreadsheet says another. There is no spreadsheet.
Four Pipelines, One Destination
Thor receives data from four independent pipelines, each with its own schedule, its own failure handling, and its own data domain.
gads-sync: Google Ads Data (Every 6 Hours)
This pipeline connects to all 24 managed Google Ads accounts and pulls campaign metrics, keyword performance data, search term reports, ad group statistics, Quality Score components, and auction insights. The sync runs every six hours, producing roughly 53,000 data points per day.
The pipeline handles each account independently. If one account's API call fails (expired token, temporary API error), the other 23 accounts still sync successfully. Failed accounts are logged and retried on the next cycle.
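The per-account isolation described above can be sketched as follows. This is a minimal illustration of the pattern, not Thor's actual implementation; `fetch_account_metrics` and the account IDs are hypothetical stand-ins.

```python
# Sketch of per-account failure isolation: one account's API failure
# never blocks the rest, and failures are recorded for the next cycle.
def sync_all_accounts(account_ids, fetch_account_metrics):
    """Sync each account independently; return the accounts that failed."""
    failed = []
    for account_id in account_ids:
        try:
            rows = fetch_account_metrics(account_id)
            # ... write `rows` to Thor's reporting tables here ...
        except Exception as exc:  # expired token, transient API error, etc.
            failed.append((account_id, str(exc)))  # logged, retried next cycle
    return failed

# Example: three accounts, one raises; the other two still sync.
def fake_fetch(account_id):
    if account_id == "acct-2":
        raise RuntimeError("token expired")
    return [{"account": account_id, "clicks": 10}]

failures = sync_all_accounts(["acct-1", "acct-2", "acct-3"], fake_fetch)
print(failures)  # [('acct-2', 'token expired')]
```

The key design property is that the `try`/`except` sits inside the loop, so the blast radius of any single failure is exactly one account.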
webopt-data-sync: Organic Performance (Daily)
This pipeline pulls GA4 traffic metrics (sessions, bounce rates, page-level performance), Google Search Console data (impressions, clicks, average position by query and page), and PageSpeed scores (Core Web Vitals, Lighthouse scores). It runs once daily because these metrics don't change at the granularity that would justify more frequent pulls.
Each data source within the pipeline operates independently. A GA4 authentication failure doesn't prevent Search Console data from syncing. This independence is a deliberate design choice: partial data is better than no data, and a failure in one source shouldn't cascade into a gap across all sources.
semrush-data-sync: SEO Intelligence (Mixed Schedule)
SEMrush data syncs on three cadences matched to how quickly each metric type changes:
- Daily: Position tracking for 1,000+ keywords, visibility scores, competitor rank movements
- Weekly: Backlink profile changes, referring domain counts and quality, site audit health scores and issue counts
- Monthly: Domain analytics, authority score trends, organic traffic estimates, competitive landscape analysis
Budget Monitoring (Every 30 Minutes)
The highest-frequency pipeline. Every 30 minutes, it checks current spend levels across all managed Google Ads accounts against their monthly budgets. This produces roughly 3,744 data points per day (24 accounts × 48 checks per day × multiple metrics per check).
All four pipelines write to Thor. Each pipeline has its own set of tables, its own schema, and its own data lifecycle. But they all share the same database instance, which means cross-pipeline queries are straightforward. Want to compare organic traffic trends against paid spend changes? It's a SQL join, not a spreadsheet merge. Want to see if a keyword position improvement in SEMrush data correlates with a Google Search Console traffic increase for the same page? That's a query against two tables with a shared client identifier and date column.
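The keyword-position-versus-traffic query described above can be sketched like this. The table and column names are illustrative assumptions, not Thor's actual schema, and `sqlite3` stands in for PostgreSQL so the sketch runs anywhere; the SQL itself is the point.

```python
import sqlite3

# Two pipelines, two tables, one shared client identifier and date column.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE semrush_positions (client_id TEXT, date TEXT, page TEXT, position REAL);
CREATE TABLE gsc_traffic       (client_id TEXT, date TEXT, page TEXT, clicks INTEGER);
INSERT INTO semrush_positions VALUES ('acme', '2025-01-01', '/pricing', 8.0),
                                     ('2025-02-01'||'', '2025-02-01', '/pricing', 3.0);
""")
# Fix the second row's client_id (kept simple and explicit):
db.execute("DELETE FROM semrush_positions WHERE client_id != 'acme'")
db.executemany("INSERT INTO semrush_positions VALUES (?, ?, ?, ?)",
               [("acme", "2025-02-01", "/pricing", 3.0)])
db.executemany("INSERT INTO gsc_traffic VALUES (?, ?, ?, ?)",
               [("acme", "2025-01-01", "/pricing", 120),
                ("acme", "2025-02-01", "/pricing", 410)])

# One SQL join answers "did the position improvement line up with more clicks?"
rows = db.execute("""
    SELECT s.date, s.position, g.clicks
    FROM semrush_positions s
    JOIN gsc_traffic g
      ON g.client_id = s.client_id AND g.date = s.date AND g.page = s.page
    WHERE s.client_id = 'acme'
    ORDER BY s.date
""").fetchall()
print(rows)  # [('2025-01-01', 8.0, 120), ('2025-02-01', 3.0, 410)]
```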
This cross-pipeline analysis is where the real strategic value emerges. Individual data sources tell you what happened within their domain. Combined data tells you why.
The Raw Snapshot Archive
This is the piece that makes Thor more than a reporting database.
Every time a pipeline pulls data from an API, it stores two things: the processed, structured data that goes into the normal reporting tables, and the raw API response as a write-once JSONB record in the raw_snapshots table.
The raw snapshot is never modified after it's written. It preserves the exact response the API returned at that moment. This creates an audit trail that goes all the way back to the source.
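A write-once table like `raw_snapshots` can be enforced at the database layer, not just by convention. The sketch below is an assumption about how such a table could look: PostgreSQL would store the payload as JSONB and could use a similar trigger, while `sqlite3` (used here so the sketch runs anywhere) stores it as TEXT.

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE raw_snapshots (
    id INTEGER PRIMARY KEY,
    source TEXT NOT NULL,
    pulled_at TEXT NOT NULL,
    payload TEXT NOT NULL        -- raw API response, stored verbatim
);
-- Enforce "never modified after it's written" in the database itself.
CREATE TRIGGER raw_snapshots_immutable
BEFORE UPDATE ON raw_snapshots
BEGIN
    SELECT RAISE(ABORT, 'raw_snapshots is write-once');
END;
""")

response = {"campaign": "brand", "clicks": 42}   # pretend API response
db.execute("INSERT INTO raw_snapshots (source, pulled_at, payload) VALUES (?, ?, ?)",
           ("gads-sync", "2025-01-15T06:00:00Z", json.dumps(response)))

try:
    db.execute("UPDATE raw_snapshots SET payload = '{}'")
except sqlite3.IntegrityError as exc:
    print(exc)  # raw_snapshots is write-once
```

Moving the immutability rule into a trigger means no bug in pipeline code can silently rewrite history.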
Why this matters:
Debugging discrepancies. When a number in a report doesn't match what a client sees in their own Google Ads dashboard, we can pull the raw snapshot from the date in question and compare it against the structured data. If the structured data differs from the raw snapshot, we have a processing bug to fix. If the raw snapshot matches our report, the discrepancy is in the dashboard's date range or filter settings. Either way, we can resolve it with evidence, not guesswork.
Detecting retroactive changes. Google Ads can retroactively adjust conversion numbers as attribution data comes in. If we pull campaign data on the 15th and again on the 30th, the numbers for the same date range may differ. With raw snapshots from both pulls, we can see exactly what changed and understand why the month-end report doesn't match the mid-month check. This is expected behavior, but without the snapshot archive, it looks like someone changed the numbers.
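Comparing two snapshots of the same date range reduces to a structured diff. The snapshot shape below is illustrative, not Thor's actual record format.

```python
# Sketch: diff two raw snapshots of the same date range to surface
# retroactive attribution adjustments.
def snapshot_diff(earlier, later, key="date"):
    """Return {date: (old_row, new_row)} for rows that changed between pulls."""
    old = {row[key]: row for row in earlier}
    changes = {}
    for row in later:
        prev = old.get(row[key])
        if prev is not None and prev != row:
            changes[row[key]] = (prev, row)
    return changes

pull_jan15 = [{"date": "2025-01-10", "conversions": 14}]
pull_jan30 = [{"date": "2025-01-10", "conversions": 17}]  # attribution caught up

print(snapshot_diff(pull_jan15, pull_jan30))
# {'2025-01-10': ({'date': '2025-01-10', 'conversions': 14},
#                 {'date': '2025-01-10', 'conversions': 17})}
```

The output is evidence: the mid-month and month-end numbers differ because the source itself revised them, not because anyone edited a report.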
Historical context for new team members. When someone new takes over an account, they can see not just the current state but the full history. Not just the processed metrics, but the raw data that produced them. This matters when trying to understand why a strategy change was made six months ago. The data that informed the decision is preserved, not just the result.
What This Means for Clients
The infrastructure described above is invisible to clients in their daily interactions with us. They see reports, strategy documents, and recommendations. But the quality of all those outputs depends on the data underneath them.
Numbers never conflict. When a monthly report says cost per acquisition decreased 15%, that number came from one query against one database. There's no possibility of a different team member running a different query and getting a different answer. The number is the number.
Problems surface faster. Because four pipelines are writing data continuously, a sudden change in any metric gets noticed quickly. A conversion drop, a traffic spike, a budget anomaly. These appear in the data within hours, not in the monthly report review weeks later. Read about specific failure scenarios in What Breaks at 2 AM.
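One simple way an anomaly check like this could look is a z-score test against recent history. This is a minimal sketch of the idea, not Thor's actual detection logic, and the conversion numbers are made up for illustration.

```python
import statistics

def is_anomaly(history, latest, z_threshold=3.0):
    """Flag a value more than z_threshold standard deviations from the
    recent mean (a deliberately minimal sketch)."""
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return latest != mean
    return abs(latest - mean) / stdev > z_threshold

daily_conversions = [21, 19, 23, 20, 22, 18, 21]
print(is_anomaly(daily_conversions, 20))  # False: within normal variation
print(is_anomaly(daily_conversions, 4))   # True: likely tracking breakage
```

Because the pipelines write continuously, a check like this can run within hours of the drop rather than surfacing at the monthly report review.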
Historical data is always available. We don't age out data. We don't overwrite last month's numbers with this month's. When a client asks how this quarter compares to the same quarter last year, the data is there. When a strategy review needs to reference the baseline from the start of the engagement, the baseline is there.
Reports are reproducible. Run the same query with the same parameters twice and you get the same answer. This sounds obvious, but it's not the norm in environments where reports are built from dashboard screenshots and spreadsheet exports. Those artifacts change every time the underlying data refreshes. Our reports are queries, and queries against immutable data produce deterministic results.
Why Competitors Can't Easily Replicate This
Building a centralized data infrastructure isn't a feature you bolt on. It's a foundational decision that shapes how everything else works.
Custom sync engines. Each pipeline is a purpose-built synchronization engine. gads-sync understands the Google Ads API's pagination, rate limits, and data freshness characteristics. semrush-data-sync handles SEMrush's three different cadences and API-specific quirks. These aren't off-the-shelf connectors. They were built for this specific purpose, tested against real data at real scale, and refined over months of production use.
Unified schema. The 80+ tables in Thor follow a consistent schema design. Client identifiers match across tables. Date fields use the same conventions. Metric definitions are standardized. This consistency is what makes cross-pipeline queries possible, and it's the result of deliberate design work, not a natural outcome of connecting APIs to a database.
Months of historical backfill. When we add a new data source or a new metric, we backfill historical data where possible. This means the database contains not just current data but a usable history from the start of each engagement. Backfilling takes time and careful validation. It's not something you do once; it's an ongoing investment as data sources evolve and API schemas change.
Engineering investment. The monitoring infrastructure, the pipeline error handling, the raw snapshot archive, the anomaly detection, the budget alerting: each of these represents significant engineering work. An agency that decides today to build this capability is months away from having it operational, and years away from having the historical depth that makes it valuable.
This isn't an argument that technology is a substitute for strategy or execution. It's an observation that reliable data infrastructure makes strategy and execution measurably better. Every recommendation is grounded in data you can verify. Every report is reproducible. Every problem is diagnosable. Those properties come from the architecture, and the architecture takes time to build.
The Decision to Build vs. Buy
We considered third-party data aggregation tools before building Thor. Platforms like Supermetrics, Funnel.io, and Agency Analytics offer data connectors that pull from multiple sources. We chose to build our own for specific reasons:
Schema control. Third-party tools normalize data into their own schemas, which means you lose source-specific nuance. Google Ads search term reports and SEMrush keyword position data have different semantics even when they both contain "keyword" and "position" columns. Controlling the schema lets us preserve those distinctions.
Raw snapshot preservation. No off-the-shelf tool we evaluated offered write-once raw response archival. They process and normalize the data, which is useful for reporting but eliminates the audit trail back to the original API response.
Pipeline independence. Most aggregation tools run all data pulls as a single scheduled job. If the job fails, everything fails. Our architecture isolates each pipeline and each client within each pipeline, so failures are contained and specific.
Custom alerting logic. Budget pacing alerts at 110% of expected daily spend, conversion volume anomaly detection, cross-source discrepancy flagging: these are business-specific rules that don't map cleanly to generic alerting frameworks.
The tradeoff is real: we maintain the sync engines, handle API changes, and manage the database infrastructure ourselves. That's ongoing engineering work. But the result is a data foundation that's exactly what we need, not an approximation of it.
Learn how all this data comes together in our daily monitoring in How We Monitor 60,000 Data Points a Day. Learn more about our team and explore our full service offerings.