How Substreams Unlock High-Speed, Composable Indexing

StreamingFast
10 min read · Oct 13, 2023

Substreams and Firehose enable a Shared Intelligence Layer. Fancy words, but what does it mean?

Blockchains hold a lot of valuable, yet opaque data. By shifting the transformation of data lower down in the stack, and doing so in a shared and open fashion, Substreams unlocks rich and actionable information for all to use. At speeds never before seen. There’s no longer a need to compromise your tech stack to fit with your chosen data provider.

Substreams can sink data anywhere you need, from wherever you want.

Alexandre Bourget gave this talk at Messari Mainnet 2023, offering a high-level overview of how Firehose and Substreams are changing how developers consume blockchain data.

Transcript

I’m Alex, the CTO at StreamingFast. I’m also a pianist, a data scientist, and a software architect. I’m the father of nine beautiful children. I love designing and crafting software, which I’ve done since I was 12. And I met Satoshi in 2013, and since then I’ve gathered all of that blockchain knowledge. Today I’m the CTO of StreamingFast, a Montreal-based company that is one of the core developer teams of The Graph.

And our goal and mission is to build the technology for a world of open data, with that shared intelligence layer, in a low-cost and high-performance package. What I mean by a shared intelligence layer is crowd-sourced knowledge from people and groups, turned into code. And when data is more open, creativity sparks from all ends of the world, and often from places we wouldn’t expect.

So we joined The Graph after a stunt we pulled. I don’t know if you guys know subgraphs. Raise your hand if you know subgraphs. Oh okay, cool, you guys are good! We took a very popular, yet very slow-to-sync subgraph. It took two months to sync, and we did it in 10 hours. How? Using the tech I’m going to present today. And since then, The Graph has decided to make these fundamental pieces of technology the underpinning of its high-performance and multi-chain strategy.

So these are Firehose and Substreams. Firehose is a means to extract data out of blockchains. And Substreams is a massively-parallelized transformation engine over Firehose data.

Let’s start with that one here, Firehose. At StreamingFast we’ve been thinking hard about all these indexing problems from first principles. We needed a robust extraction layer to get the data we needed, so we could properly index it and make it useful. And for that, we designed the Firehose. We were up against issues like cost. We didn’t want to run bulky nodes that take a lot of RAM and disk space — you know, dedicated SSDs and high-performance servers. That was complex and costly. The goal was to get to the data inside; we didn’t want to deal with these things. It was too complex.

Also reliability. We wanted something more reliable, with fewer moving parts. There’s a risk of inconsistency when you have many such nodes and you need to query them, but they are not all synced to the same point. So we also wanted consistency across the ecosystem. Some people have their nodes with JSON-RPC, and you need to query them like crazy. Some tout billions of queries, but it makes things very cumbersome. Others have websocket streams, but they’re not always reliable across reorganizations, and things like that. And some built indexing frameworks, reinventing the wheel for their own particular stack.

So all this introduced big costs, low reliability, and a lack of consistent performance. We wanted to do something simple. One of the core insights of Firehose is the use of flat files — flat files paired with a streaming-first engine, so we get the best of both history and real time. Flat files are the cheapest thing there is. It’s a little ridiculous to be talking about flat files, but I feel it’s the place. They’re cheaper than running processes and programs; there’s nothing simpler or cheaper. So we went to the ground level here, in terms of resources.

A small anecdote. The folks at The Graph, who today use the Firehose to feed their indexing system, saw a 90% reduction in cost, because they could stop querying huge, bulky nodes and have data pushed to them instead. Much faster, lower latency, and much cheaper.

So one thing that is common to all blockchains is that they produce data that people want to use — you know, it’s a database, and half the equation of a database is reading the data. We wanted a good, open data format for every chain, something unified. We chose Google Protobuf definitions for the widest compatibility, and with them we could conceive of the best data models for each chain. And that helps with simplicity, unifying the stack.
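
As an illustration only — not the actual Firehose schema — here is roughly what such a protobuf-style block model looks like when expressed as prost-annotated Rust types. The field names and tags are assumptions for the example; the real per-chain definitions carry far more detail:

```rust
// Illustrative only: a simplified, prost-annotated block model in the spirit
// of the Firehose protobuf definitions. The real per-chain models are much
// richer; field names and tags here are assumptions for the example.
#[derive(Clone, PartialEq, ::prost::Message)]
pub struct Block {
    #[prost(uint64, tag = "1")]
    pub number: u64,
    #[prost(bytes = "vec", tag = "2")]
    pub hash: Vec<u8>,
    #[prost(uint64, tag = "3")]
    pub timestamp: u64,
    #[prost(message, repeated, tag = "4")]
    pub transactions: Vec<Transaction>,
}

#[derive(Clone, PartialEq, ::prost::Message)]
pub struct Transaction {
    #[prost(bytes = "vec", tag = "1")]
    pub hash: Vec<u8>,
    #[prost(bytes = "vec", tag = "2")]
    pub from: Vec<u8>,
    #[prost(bytes = "vec", tag = "3")]
    pub to: Vec<u8>,
}
```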

And with regards to latency, the Firehose equalizes all chains, and we chose fast over sometimes-unreliable. All chains, when they are Firehose-enabled, become push-first: they push the data with the lowest latency. The moment the transaction occurs, you get it.

And regarding reliability, we addressed that by lowering the reliance on nodes, and also by making things simpler. If things are simpler, they are more reliable. You know, storage — S3 and the like, Google buckets — is one of the most optimized things on Earth. There are so many tiers; you can drive your costs down by putting data in Iceland and such. And one last very important insight: the whole stack was made blockchain-aware, with a cursor that allows for pristine navigation across reorganizations. So all those cases are really taken care of here.
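
Here is a conceptual sketch of that cursor pattern from a consumer’s point of view. The `FirehoseClient`, `BlockMessage`, and `Step` names are hypothetical stand-ins, not the real Firehose gRPC messages; the point is the pattern of persisting the cursor with every block you apply, and resuming from it, reorg-safe, after a restart:

```rust
// Conceptual sketch only: `FirehoseClient`, `BlockMessage` and `Step` are
// hypothetical stand-ins, not the real Firehose gRPC types. The pattern is
// what matters: persist the cursor with each block and resume from it.
enum Step {
    New,  // a newly produced (or forward-replayed) block
    Undo, // a block being rolled back because of a reorganization
}

struct BlockMessage {
    number: u64,
    step: Step,
    cursor: String,
}

trait FirehoseClient {
    // Returns the next message after the given cursor (blocking, simplified).
    fn next(&mut self, resume_cursor: &str) -> BlockMessage;
}

fn consume(client: &mut dyn FirehoseClient, mut saved_cursor: String) {
    loop {
        let msg = client.next(&saved_cursor);
        match msg.step {
            Step::New => apply_block(msg.number),   // forward-apply into your store
            Step::Undo => revert_block(msg.number), // roll back a reorged-out block
        }
        // Persist the cursor atomically with the write above, so a restart
        // resumes exactly where the stream left off.
        saved_cursor = msg.cursor;
    }
}

fn apply_block(_number: u64) { /* write to your store */ }
fn revert_block(_number: u64) { /* undo the write */ }
```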

And all that lays the groundwork for massively parallelized operations, which we’ll see now with Substreams. So Substreams is a massively-parallelizable, general-purpose transformation engine over Firehose data — and therefore over any blockchain data underneath. It works in batch mode for processing history and backtesting, in case you’re running things like trading operations, and also in real time with low latency, so there’s no compromise. That level of the stack doesn’t have a compromise; we made it the most solid that we could.
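
To make that concrete, here is a minimal sketch of what a Substreams map module can look like in Rust. It assumes the `substreams` and `substreams-ethereum` crates; the module name and the `BlockStats` output type are illustrative only — in a real package the output message is generated from a .proto file and the module is declared in a substreams.yaml manifest.

```rust
// Minimal sketch of a Substreams map module, assuming the `substreams` and
// `substreams-ethereum` crates. Names are illustrative, not a reference.
use substreams::errors::Error;
use substreams_ethereum::pb::eth::v2::Block;

#[derive(Clone, PartialEq, ::prost::Message)]
pub struct BlockStats {
    #[prost(uint64, tag = "1")]
    pub number: u64,
    #[prost(uint64, tag = "2")]
    pub transaction_count: u64,
}

#[substreams::handlers::map]
fn map_block_stats(block: Block) -> Result<BlockStats, Error> {
    // A pure function of the Firehose block: extract just what we need.
    Ok(BlockStats {
        number: block.number,
        transaction_count: block.transaction_traces.len() as u64,
    })
}
```

Because the module is a pure function of the block it receives, the engine is free to split history into ranges and run them in parallel — that’s where the massive parallelization comes from.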

Once we have a reliable and fast source of data with Firehose, here’s what we wanted to address. We’ve seen so many reinventions of the wheel. The teams of each blockchain protocol roll out their own indexing technology. How many wrote something to write into PostgreSQL directly from their node? They all do that. But it’s a lot of duplication, with inconsistent behavior. And these are not Lego blocks, these are different Lego brands, each not fitting with one another — the holes are not the same size. So that produces inconsistent experiences for folks trying to make sense of the open data that blockchains produce. I don’t like that.

So another place where we didn’t want to reinvent the wheel is the understanding of contracts. When you want to understand contracts, you start digging, you go to that obscure land of data, and you try to decode it. And multiple people do that — that’s the layer I mean when I say “the shared intelligence layer”: understanding what’s happening on chain. The problem is twofold: either teams start digging into the smart contracts themselves, interpreting the data again and again, often with quality issues because they’re not perfectly knowledgeable about the protocols and all of the math involved, or they use providers that do it for them. Providers that give you intelligence about Uniswap V3 or whatever. But those providers capture that intelligence into their proprietary stack, and you can only consume it in the way they provide it to you. So subgraphs are one technology that captures that crowd-sourced knowledge outside of proprietary stacks. You can use it locally, and you can share it around. That’s a nice abstraction. But it also constrains you to a certain model of consumption.

With Substreams, we want to solve that by bringing that intelligence layer lower down the stack, so you can decide where you’re going to send that data and not have to write it again. So the write is abstracted away — and we’re going to get to that.

Another problem is performance. Performance is crazy in this space, as history is constantly growing. Of course history grows. And if you don’t design for that early on, you end up with technologies that are linear. I remember a provider that was down for three weeks. Three weeks of downtime kills any company. And they were reprocessing their chain linearly. That was such an absurd situation for me. I said, “never on Earth would we want to tolerate that. I want all our systems to sync full networks in 30 minutes. Let’s design for that.” That’s what Substreams is: it allows you to parallelize operations, crunch data, and reprocess.

Firehose too, by the way — everything I’ve described was designed for extremely high speed and fast iteration. Because one of the other issues is slow dev cycles. If you have a slow dev cycle, your product is dead. Remember that “Build-Measure-Learn” loop in a startup? You need to learn as fast as possible. If you can’t iterate, what do you learn? You try something, and in two weeks you might see the result. That’s crazy. So we wanted to solve for that too.

Everywhere in technology, people only stop reinventing the wheel when they find just the right abstraction — one that gives them the power to do what they want, without getting in the way with things they don’t want. We worked hard to find the simplest and best abstraction for maximum performance and composability — flat files and streaming-first, rich data models. Part of that abstraction was also its place in the stack, where the knowledge of all of these contracts gets aggregated. If you put it in a downstream SQL store, you lose the streaming ability. If your engine was made to send data to Slack, you lose other capabilities. So if we can put that shared knowledge earlier in the stack, then we can reuse it in multiple fashions.

Also, Firehose and Substreams are out in the open — runnable by you locally, by whichever provider you want, or on The Graph Network with its pool of Indexers that will run your workloads for you. They’ll offer you that service, but you are still free. To achieve truly open data, that shared intelligence layer must not be locked into proprietary frameworks like Dune and friends, and other things like that.

Maybe the most appealing feature is performance. You can process large histories in minutes: it splits the work and runs it in parallel. And the time you invest to craft these things pays off across all these benefits — there’s no compromise. You get the lowest latency on the market. You can imagine tent poles that you interlock: when the blockchain produces something, it pushes it down to Firehose and Substreams. It’s all done in a push fashion, and the nodes race together to get you the data as fast as possible. There’s nothing more that can be done in terms of latency there.

It also has great introspection capabilities. Often people hit data-quality issues with a provider. With Substreams you can go directly into that place in the middle of the processing chain and analyze all the data bits, before needing to reprocess and load your Postgres database yet another time.

So it is the first general-purpose, programmable indexing stack for blockchain data. It works across protocols and blockchains with unparalleled performance. Here’s a quick look at how that works in reality — this is that shared knowledge.

You have someone, the author, in red. He understands Uniswap and builds those blocks. Then you have the Sushi guys (another team) building intelligence about Sushi, and you merge them together. And the junction of all these things is clean, clear data models — well-defined data, not request-response, not complex node operations. And so you can have another guy, a trader, who takes that collective knowledge and fuses it for his own purpose, feeding it into his trading operation, combining it with OpenSea trades and whatnot.
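
As a hedged sketch of that composition — with made-up module and message names, since real packages use their own generated protobuf types and wire the dependencies in the substreams.yaml manifest — a downstream module merging two upstream outputs could look like this:

```rust
// Hedged sketch: a downstream map module that merges the outputs of two
// upstream modules (say, Uniswap and Sushi swap extractors written by other
// teams). Module names and message shapes are assumptions for illustration.
use substreams::errors::Error;

#[derive(Clone, PartialEq, ::prost::Message)]
pub struct Swaps {
    #[prost(message, repeated, tag = "1")]
    pub swaps: Vec<Swap>,
}

#[derive(Clone, PartialEq, ::prost::Message)]
pub struct Swap {
    #[prost(string, tag = "1")]
    pub pool: String,
    #[prost(string, tag = "2")]
    pub amount_usd: String,
}

#[substreams::handlers::map]
fn map_all_dex_swaps(uniswap: Swaps, sushi: Swaps) -> Result<Swaps, Error> {
    // Concatenate the two upstream streams into one unified view that a
    // trader (or any other consumer) can feed into their own pipeline.
    let mut all = uniswap.swaps;
    all.extend(sushi.swaps);
    Ok(Swaps { swaps: all })
}
```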

So where does that data go? This is all streaming in real time, with extremely fast historical reprocessing. And that puts us in front of a new problem. In the last years we’ve seen a tremendous amount of development in the world of data science. There’s a new database that crops up each week, for all I know. So many tools, so many use cases. And that is perhaps not so much a problem as an opportunity. Because new developers — and we want new developers in this space — come with their own experiences, knowledge, and expectations. They want to use those powerful engines. And we want to make it as simple as possible to use them, without sacrificing the composability and that shared intelligence layer. That’s why Substreams sits at the right place in the stack to leverage all of that maximally.

And the solution is to sink the data everywhere. Substreams is the abstraction for the knowledge, and then we develop the tools around it. I’ve seen so many protocols write these natively too: this protocol will have a plugin for Kafka, that protocol will have a plugin for MongoDB, and whatnot. But if, as a Layer 1, you integrate the Firehose, you get all these crazy things once and for all, along with the community of people developing the shared intelligence. That’s a good value proposition for your folks.

The first sink was subgraphs, and Substreams today can power subgraphs, and you can deploy them to the network. So you will have folks who will host that for you, and you can benefit from the network effects of The Graph. It’s an end-to-end solution. We want to bring all of these things to The Graph Network. And you can deploy processes so that other folks can run them for you. But we keep that refined intelligence layer.

So we’re writing a boatload of sinks. One example I have: a guy who is an accountant started writing a Substreams module to feed data into Google Sheets. He was tired of copy-pasting from Etherscan, so now he has a bit of code that understands the chain’s transactions and sinks them over there. Talk about reusability. So we’re really targeting any custom programming language and any data system that supports gRPC, which is a wide array.

Now that you know all of that, if you want to take part in that revolution of open data and shared intelligence, ask your preferred chain if they are Firehose-enabled. If not, ask them to work with us to add that thin layer, so that they, and their whole community, benefit from the rest of the stack. And if you manage a dapp or a smart contract, encode your knowledge into Substreams modules. Make that intelligence available to others, so people can consume it in any way, once and for all. Keep an eye on The Graph, because we’re going to bring all of these data services in the world to The Graph Network. Thank you all.


StreamingFast

StreamingFast is a protocol infrastructure company that provides a massively scalable architecture for streaming blockchain data.