Best practices to optimize the query costs in BigQuery

I wanted to write this post to help those who are starting to worry about their increasing BigQuery costs. You've probably been using Firebase or Google Analytics for their powerful and free services. Later, you realized the UI isn't advanced enough to answer all your questions about your data. And as your company grew, you needed to combine data from different sources in a single report or dashboard for a complete view of user behavior across channels and devices. In one common use case, you want to combine product, marketing, and CRM data to understand whether the conversion rate from email to in-app purchase events is higher than the conversion rate from ads to in-app purchase events.

Answering this or similar questions requires a different approach to storing and consuming data. Like many other successful, data-driven companies, you decided to collect all your company data into a data warehouse and run your analyses on top of it. You ended up using GA4's or Firebase's BigQuery link to load all the data into BigQuery, Google's cloud data warehouse solution.
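Before optimizing anything, it helps to know what a query will cost before running it. BigQuery's on-demand model bills by bytes scanned, so a simple estimator goes a long way. The sketch below assumes the on-demand rate of $6.25 per TiB (check Google's current pricing page, as rates change); the helper name is my own:

```python
# Hypothetical helper: estimate the on-demand cost of a BigQuery query
# from its bytes-processed figure (e.g. the number a dry run reports).
# The $6.25-per-TiB rate is an assumption; verify against current pricing.

TIB = 1024 ** 4
PRICE_PER_TIB_USD = 6.25  # assumed on-demand rate

def estimate_query_cost(bytes_processed: int) -> float:
    """Return the approximate on-demand cost in USD for a query scan."""
    return (bytes_processed / TIB) * PRICE_PER_TIB_USD

# A query scanning 500 GiB of table data:
cost = estimate_query_cost(500 * 1024 ** 3)
print(f"${cost:.2f}")  # prints "$3.05"
```

In practice you would feed this the `total_bytes_processed` value from a dry run (the `google-cloud-bigquery` client exposes this via `QueryJobConfig(dry_run=True)`), which lets you reject expensive queries before they execute.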

OLAP in the modern data stack

Burak Kabakci

Core Team

OLAP is almost 30 years old and has been used in many different contexts over the last 10 years. It's a huge market, and there are many enterprise software vendors in the space. At its core, OLAP means pre-aggregating your raw data so that queries hit small summary tables instead of the raw data, and it's not for everyone. If you're working for a startup that doesn't have terabytes of data, you probably don't need OLAP because you can efficiently run ad-hoc queries on your raw data (also known as fact tables). However, when you have petabytes of data and dozens of people who rely on it to make decisions, you don't want to make them wait 10 minutes for a simple query. Let's break down OLAP into two categories:
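The core OLAP idea can be shown in a few lines: aggregate the fact table once, then answer repeated queries from the much smaller rollup. The event rows and dimension names below are made up for illustration:

```python
from collections import defaultdict

# Minimal sketch of pre-aggregation: build a rollup of a raw fact table
# once, then serve queries from the rollup instead of scanning raw rows.
# The rows and column names here are invented for illustration.

fact_rows = [
    {"date": "2021-06-01", "channel": "email", "revenue": 20.0},
    {"date": "2021-06-01", "channel": "ads",   "revenue": 35.0},
    {"date": "2021-06-01", "channel": "email", "revenue": 15.0},
    {"date": "2021-06-02", "channel": "ads",   "revenue": 10.0},
]

def build_rollup(rows):
    """Aggregate revenue by (date, channel) -- done once, ahead of time."""
    rollup = defaultdict(float)
    for row in rows:
        rollup[(row["date"], row["channel"])] += row["revenue"]
    return dict(rollup)

rollup = build_rollup(fact_rows)

# Dashboard queries now read the tiny rollup, not the raw fact table:
print(rollup[("2021-06-01", "email")])  # prints 35.0
```

A real OLAP engine does the same thing at scale, maintaining many such rollups (cubes) across dimensions so that interactive queries return in milliseconds regardless of raw data volume.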

Common Metric Types

Burak Kabakci

Core Team

We've been working on building the first open-source metrics store over the last few months. We worked closely with many data teams from different industries and markets to help them define their metrics as code. While doing so, we identified three types of metric definitions that are widely adopted by data teams. Let me explain what these three common metric types are and how they are defined in BI tools today:

Introducing Metriql: Open-source metrics store

Burak Kabakci

Core Team

Introduction

Today, most data-driven companies successfully establish the data warehouse as the single source of truth for their data, collecting all of it through ETL jobs. However, ETL is only one part of the whole process. Companies want to collect and own their data because they want to consume it in various ways to grow their business, whether by building analytical dashboards for stakeholders or building an ML application. To consume the data via BI or data tools, companies need to model their data and define business metrics and KPIs so stakeholders can agree on the definition of the data. Establishing a single source of truth for this metadata is very difficult because tools and teams consume the data in different ways. This topic has also been addressed by many data folks recently: Ben wrote The missing piece of the modern data stack, Basecase came up with the Headless BI concept, and Airbnb introduced Minerva.
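To make "metrics as code" concrete, here is a hypothetical sketch of the idea: one versioned metric definition that every downstream tool reads, instead of each BI tool re-implementing its own. The `Metric` class, its fields, and `to_sql` are all invented for illustration; they are not metriql's actual API (metriql definitions live in dbt-style YAML):

```python
from dataclasses import dataclass

# Hypothetical "metric as code" definition: name, source dataset,
# aggregation, and filters captured once, in version control, so every
# consumer (dashboard, notebook, ML job) derives the same SQL from it.

@dataclass(frozen=True)
class Metric:
    name: str
    dataset: str
    aggregation: str
    column: str
    filters: tuple = ()

    def to_sql(self) -> str:
        """Render the metric as a standalone SQL query."""
        where = f" WHERE {' AND '.join(self.filters)}" if self.filters else ""
        return (f"SELECT {self.aggregation}({self.column}) AS {self.name} "
                f"FROM {self.dataset}{where}")

total_revenue = Metric(
    name="total_revenue",
    dataset="fct_orders",
    aggregation="SUM",
    column="amount",
    filters=("status = 'complete'",),
)

print(total_revenue.to_sql())
# SELECT SUM(amount) AS total_revenue FROM fct_orders WHERE status = 'complete'
```

Because the definition is a single immutable object rather than SQL scattered across dashboards, every stakeholder who asks for `total_revenue` gets the same number, which is the agreement on definitions the paragraph above describes.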