Say hello to Graph Compose 🎉
How do you build data-driven products? We're responsible for implementing analytics libraries in our applications and websites, but how do we get these data back into our products?
And what would we build if we could?
I've been living in two worlds, and I'd never noticed the subtle context switch between the world of analytics / data engineering and the world of programming / software engineering.
All too often I'd change hats: from building data pipelines and answering business questions with SQL queries, over to writing full-stack applications where I'd simply implement analytics libraries and move on to other concerns like UI and feature development.
Then it occurred to me. There's a fundamental disconnect between the developers implementing analytics and the people who end up using the data. Sure, it's great for Business Analysts and Product Managers to have dashboards, and for Data Scientists to build machine learning models.
But what about regular developers / programmers / software engineers?
What if we could use these analytics data in the products that generate the data?
Or perhaps a more exciting question:
What new products could we build from the data we're already collecting?
What is Graph Compose?
Graph Compose helps developers build data-driven products. It's a GraphQL API wrapping the analytics data you already have in your data warehouse.
Let's look at a basic example. Say you want to get a list of all events sent to your analytics from October onwards:
# this query is simply sent as the body of an HTTP POST request
query RecentEvents {
  events(where: { date: { after: "2019-10-01" } }) {
    results {
      event
    }
  }
}
And you'll get back:
{
  "data": {
    "events": {
      "results": [
        {
          "event": "Clicked Card Promo"
        },
        {
          "event": "Homepage Newsletter Signup"
        }
      ]
    }
  }
}
Which is much nicer than the underlying SQL query:
WITH view AS (
  SELECT * EXCEPT (ROW_NUMBER)
  FROM (
    SELECT
      *,
      ROW_NUMBER() OVER (PARTITION BY id ORDER BY loaded_at DESC) AS ROW_NUMBER
    FROM `some-dataset.some_table.tracks`
    WHERE _PARTITIONDATE > '2019-10-01')
  WHERE ROW_NUMBER = 1
),
base_query AS (
  SELECT event_text
  FROM view
  WHERE TIMESTAMP_TRUNC(timestamp, DAY, 'UTC') > '2019-10-01'
  GROUP BY event_text
)
SELECT event_text
FROM base_query
LIMIT 100
Any programmer with a bit of SQL knowledge might wonder why we couldn't just run the final base_query and be done with it. That question starts to reveal the complexity of building an API around analytics data.
For an API to be general-purpose (that is, flexible enough to answer many kinds of questions), it needs to translate a beautifully simple GraphQL query into underlying database queries that are reasonably modular and scalable. This turns out to be quite tricky when datasets run into the gigabytes, terabytes, or even petabytes. We need to consider concepts like table partitioning, window functions, and combining aggregations into single queries.
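To make that last point concrete, here's a minimal sketch in TypeScript of what combining aggregations can look like. Everything here is illustrative: the names are hypothetical and this isn't Graph Compose's actual query compiler, but it shows why one partition-filtered scan beats one round trip per question.

interface Aggregation {
  name: string  // alias for the result column
  expr: string  // a SQL aggregate expression
}

// Compile several requested aggregations into ONE table scan, instead of
// issuing one query (and one full scan) per aggregation. The partition
// filter keeps the bytes processed proportional to the date range.
function buildCombinedQuery(table: string, after: string, aggs: Aggregation[]): string {
  const selectList = aggs.map(a => `${a.expr} AS ${a.name}`).join(',\n  ')
  return `SELECT\n  ${selectList}\nFROM \`${table}\`\nWHERE _PARTITIONDATE > '${after}'`
}

console.log(buildCombinedQuery('some-dataset.some_table.tracks', '2019-10-01', [
  { name: 'total_events', expr: 'COUNT(*)' },
  { name: 'newsletter_signups', expr: "COUNTIF(event_text = 'Homepage Newsletter Signup')" },
]))

Both aggregations come back from a single scan of the table, rather than two.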
Back to our original question: why don't programmers use analytics data to build products?
The simplest answer: it's bloody hard.
If you know the exact question you need to answer (and you're unlikely to ever ask a variation of it), you might get away with wrapping a single SQL query in a REST API, adding some safety and monitoring, and being done with it. But just like writing applications without an ORM over a database, you quickly discover that you're spending most of your time transforming data structures and managing database I/O.
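If you haven't lived that, here's a rough sketch of the hand-rolled approach using Express and node-postgres. The endpoint, table, and column names are illustrative only; notice how much of the code is validation, I/O, and reshaping rather than the actual question.

import express from 'express'
import { Pool } from 'pg'

const app = express()
const pool = new Pool() // connection settings come from the PG* environment variables

// One endpoint per question. Any variation on the question means another
// endpoint, with its own validation, its own SQL, and its own reshaping.
app.get('/recent-events', async (req, res) => {
  const after = String(req.query.after ?? '')
  if (!/^\d{4}-\d{2}-\d{2}$/.test(after)) {
    return res.status(400).json({ error: 'after must be YYYY-MM-DD' })
  }
  try {
    const { rows } = await pool.query(
      'SELECT event_text FROM tracks WHERE timestamp > $1 GROUP BY event_text LIMIT 100',
      [after]
    )
    // Reshape the rows into whatever structure the frontend expects
    res.json({ results: rows.map(r => ({ event: r.event_text })) })
  } catch (err) {
    res.status(500).json({ error: 'query failed' })
  }
})

app.listen(3000)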
You might like to think of Graph Compose as an ORM for your analytics.
Why GraphQL?
If you haven't worked with a GraphQL API before, it might not yet be clear what it brings to the table. There are hundreds of articles about GraphQL around the web and plenty of conference talks on YouTube. If you're learning it for the first time, I suggest starting by finding someone from your own programming community on YouTube, because they'll make the abstract concepts more concrete.
I'll try again: Why GraphQL for analytics?
I've been wondering about this for a few years now. I actually expected a product like Graph Compose to be built a couple of years back, but it never came. So I set out to build it myself. I'll be writing about this in more detail in other articles, so let's keep it to bullet points:
- The way data engineers think about building analytics pipelines and datasets doesn't map to the world of objects we use when writing applications.
- It's a typed API, with comments. You can play around, incrementally build your query as you refine the question you're asking, and know exactly what shape and type of data to expect.
- It's just an HTTPS request. Any server or browser can send the request without language-specific libraries (see the sketch after this list).
- The same query interface can be used across analytics systems and data warehouses. No matter how you capture or store your data, we make the query interface intuitive for application development.
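To show what "just an HTTPS request" means in practice, here's the RecentEvents query from earlier sent with plain fetch. The endpoint URL and the API key are placeholders, and the auth scheme is an assumption for the sake of the sketch.

// The GraphQL query from earlier, as a plain string
const query = `
  query RecentEvents {
    events(where: { date: { after: "2019-10-01" } }) {
      results {
        event
      }
    }
  }
`

// GraphQL requests are conventionally sent as a JSON body of { query, variables }
const response = await fetch('https://api.example.com/graphql', { // placeholder URL
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    Authorization: 'Bearer YOUR_API_KEY', // placeholder credential
  },
  body: JSON.stringify({ query }),
})

const { data } = await response.json()
console.log(data.events.results) // [{ event: 'Clicked Card Promo' }, ...]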
Building Graph Compose isn't necessarily hard, but it's incredibly complex. We'll be documenting our journey on this blog as the product evolves.
Let's get specific
For our launch, Graph Compose integrates with Segment. If you're not yet familiar with Segment, it's one of the most important tools in a high-functioning analytics stack. In fact, it might be one of the only tools you use. Put simply, you write something like this:
analytics.track('Homepage Newsletter Signup')
...and the event is sent (with any extra metadata) to hundreds of different analytics tools like Google Analytics and Mixpanel, along with other tools like Stripe and Mailchimp. Most importantly, it's sent to your own data warehouse.
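For a sense of what that extra metadata looks like: the second argument to track is a plain object of properties. The property names below are made up for illustration; Segment passes whatever you include through to each destination and into your warehouse.

// Assumes Segment's analytics.js snippet is already installed on the page.
// These property names are illustrative, not required by Segment.
analytics.track('Homepage Newsletter Signup', {
  plan: 'free',
  referrer: document.referrer,
})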
That's right: you own your raw data. It's yours to query in any dashboarding tool, feed into machine learning models, or, in our case, build products with via Graph Compose.
I love BigQuery as a data warehouse, but Redshift, Snowflake, and even plain old Postgres will work too. They all have their quirks, but leave that to us to figure out.
You send data to Segment, they load it into a data warehouse, Graph Compose is your API to get it back into your product.
Segment is only the start for us. We have plans to add new data sources in the coming year.
Use cases I'm excited about
My career has spanned several industries: accounting, finance, energy engineering, retail, and software. Within software I've worked on projects in almost every sector. But when it comes to building Graph Compose, I can put my hand on my heart and say that I'm genuinely uncertain about who will find it most valuable.
Each of these sectors has ripe opportunities to build data-driven products from terabytes or petabytes of ever-growing analytics data. I'm excited to see what developers will build and the problems they'll solve with Graph Compose.
We can do more than make marketing and advertising work better. How about building products that improve sustainable energy supplies, feed hungry people efficiently, identify at-risk patients quickly, and reduce income inequality? We hope to support the people behind these ambitious products.
Once upon a time it was hard to collect, transform, and store data. That's now easy. In fact, it's so easy that we've gone bonkers over the last ten years, tracking everything we can imagine without any real plan for how to use it. Whilst I'm excited about machine learning and AI, I believe it's a cop-out to 'leave it to the machines to figure out'. We can build products that solve real problems today, using the skills we already have.
Our capability to solve hard problems is limited by our ability to ask the right questions of our data. We want to help with that too.
This is just the beginning.