Substreams: Massively Faster Indexing Performance for Subgraphs
Note: This post was created by The Graph and originally posted on The Graph Blog
The Graph ecosystem has grown substantially over the past year, with five Core Developer teams now working full-time to enhance The Graph’s indexing and query capabilities for the world. StreamingFast, the first additional team to join as a Core Dev after Edge & Node, brings both an incredible pool of talent and powerful technology to further the protocol. One of the most exciting innovations is coming to fruition soon: substreams.
StreamingFast (formerly dfuse) was founded in 2018, providing high performance, cross-chain centralized indexing services. Interactions with the Edge & Node team convinced StreamingFast that decentralization is the most effective and scalable way to build for the future. Subsequently, StreamingFast accepted a grant from The Graph Foundation and, in June 2021, joined as a Graph Core Developer team to work full-time on The Graph ecosystem. This decentralized version of M&A was the first of its kind (but not the last).
In joining as a Graph Core Developer team, StreamingFast brought Firehose, a high-performance method of ingesting data from blockchains, and started integrating it into The Graph. At that time, an extremely complex subgraph could take weeks to sync, creating friction for developers building on The Graph. StreamingFast created a prototype called Sparkle, which helped decrease sync time on that subgraph from weeks to around six hours. Now, StreamingFast has evolved Sparkle’s capabilities and created substreams that can scale across all subgraphs on all chains.
How Substreams Work
RPC-based Subgraphs have a linear indexing model for processing blockchain data (i.e. they process events one at a time, in order). They do so via polling API calls to Ethereum clients. Firehose technology replaces those polling API calls with a stream of data utilizing a push model and sending data to the indexing node faster. This helps increase the speed of syncing and indexing.
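The difference between the two models can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `PollingClient`, `block_stream`, and the block shape are hypothetical stand-ins, not the actual RPC or Firehose interfaces. The polling indexer issues one request per block (plus one to discover it has reached the head), while the streaming indexer simply consumes blocks as they are pushed to it.

```python
from typing import Callable, Iterator, List, Optional

Block = dict  # hypothetical block shape: {"number": int, "events": [...]}


def make_chain(n_blocks: int) -> List[Block]:
    """Build a tiny in-memory chain to index (illustrative data only)."""
    return [{"number": i, "events": [f"evt-{i}"]} for i in range(n_blocks)]


class PollingClient:
    """Stand-in for an RPC node: every block costs one request."""

    def __init__(self, chain: List[Block]) -> None:
        self.chain = chain
        self.requests = 0

    def get_block(self, number: int) -> Optional[Block]:
        self.requests += 1
        return self.chain[number] if number < len(self.chain) else None


def index_by_polling(client: PollingClient, handler: Callable[[Block], None]) -> None:
    """Pull model: ask for block N, wait, handle it, then ask for N + 1."""
    number = 0
    while True:
        block = client.get_block(number)
        if block is None:  # reached the chain head
            break
        handler(block)
        number += 1


def block_stream(chain: List[Block]) -> Iterator[Block]:
    """Stand-in for a push-style stream: blocks arrive without being requested."""
    for block in chain:
        yield block


def index_by_stream(stream: Iterator[Block], handler: Callable[[Block], None]) -> None:
    """Push model: the indexer only consumes whatever the stream delivers."""
    for block in stream:
        handler(block)
```

Both indexers see the same blocks in the same order; what changes is who drives the flow of data and how many round trips are needed.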
Substreams take things even further by enabling massively parallelized streaming data. Substreams can be combined and aggregated in powerful new ways to feed data into subgraphs or end-user applications in a fraction of the time. With substream parallelization, some subgraphs could sync more than 100x faster.
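As a rough sketch of what parallelization means here, the block range can be split into segments that are processed independently and then merged. Everything in this example is an assumption made for illustration: the function names, the segment size, and the placeholder workload are not part of the substreams implementation. A thread pool is used only to show the structure; in CPython it would not speed up CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple

Segment = Tuple[int, int]  # half-open block range [lo, hi)


def split_range(start: int, stop: int, size: int) -> List[Segment]:
    """Cut [start, stop) into consecutive segments of at most `size` blocks."""
    return [(lo, min(lo + size, stop)) for lo in range(start, stop, size)]


def process_segment(segment: Segment) -> int:
    """Placeholder workload: pretend block n contains n % 3 events."""
    lo, hi = segment
    return sum(n % 3 for n in range(lo, hi))


def process_linear(start: int, stop: int) -> int:
    """Linear model: walk every block in order, one at a time."""
    return process_segment((start, stop))


def process_parallel(start: int, stop: int, size: int, workers: int = 4) -> int:
    """Parallel model: process segments independently, then merge the partials."""
    segments = split_range(start, stop, size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in segment order, so merging is deterministic.
        partials = list(pool.map(process_segment, segments))
    return sum(partials)
```

The two functions return the same total for the same range. The parallel version only works this simply because each segment's result does not depend on the segments before it.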
With substreams, the data pipeline can be broken down into four stages:
- Extract (via Firehose)
- Transform (via Substreams and Subgraphs)
- Load (to the PostgreSQL database)
- Query (serving queries to users)
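The four stages above can be sketched as a toy pipeline. This is a minimal illustration, not how The Graph implements them: the hard-coded blocks stand in for Firehose, the `transfers` table and all function names are hypothetical, and an in-memory SQLite database is swapped in for the PostgreSQL database so the example is self-contained.

```python
import sqlite3
from typing import Iterable, Iterator, Tuple

Row = Tuple[int, str, int]  # (block_number, sender, amount)


def extract() -> Iterator[dict]:
    """Extract: yield raw blocks (hard-coded here in place of Firehose)."""
    yield {"number": 1, "transfers": [("alice", "bob", 5), ("bob", "carol", 2)]}
    yield {"number": 2, "transfers": [("alice", "carol", 3)]}


def transform(blocks: Iterable[dict]) -> Iterator[Row]:
    """Transform: reduce each raw block to the rows the application cares about."""
    for block in blocks:
        for sender, _recipient, amount in block["transfers"]:
            yield (block["number"], sender, amount)


def load(rows: Iterable[Row], db: sqlite3.Connection) -> None:
    """Load: write the transformed rows into the database."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS transfers (block INTEGER, sender TEXT, amount INTEGER)"
    )
    db.executemany("INSERT INTO transfers VALUES (?, ?, ?)", rows)
    db.commit()


def query(db: sqlite3.Connection, sender: str) -> int:
    """Query: serve a read, here the total amount sent by one address."""
    row = db.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM transfers WHERE sender = ?", (sender,)
    ).fetchone()
    return row[0]


def run_pipeline() -> sqlite3.Connection:
    """Chain Extract -> Transform -> Load and return the database to query."""
    db = sqlite3.connect(":memory:")
    load(transform(extract()), db)
    return db
```

Calling `query(run_pipeline(), "alice")` sums the two transfers Alice sent. Only the Load stage touches the database; the earlier stages are plain data transformations.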
The first transformation via substreams allows lighter weight parallelized computation and composability that many subgraphs can benefit from.
To illustrate: for large DEXes, which need to find pairs for any given trade, a substream model enables individual small modules to work simultaneously on pairs, reserve extractors, prices, volume aggregation, and other key metrics. If a developer bases their work on existing substreams, they can take the DEX prices and create a module to average all DEX prices across an ecosystem.
Substream modules don’t go through PostgreSQL. Developers can leverage and adapt existing modules, allowing end users to take advantage of composability without paying a performance penalty for indexing.
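A rough sketch of that composition, under stated assumptions: the swap records, the module names, and the way modules pass data to each other are all invented for illustration and do not reflect the real substreams module interface. The point is only that a new module (`average_price_module`) can be built on the output of an existing one (`prices_module`) without reprocessing the raw swaps.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple

# Hypothetical swap records; real DEX events carry far more detail.
SWAPS: List[dict] = [
    {"dex": "dex_a", "pair": "ETH/USDC", "amount_in": 2.0, "amount_out": 4000.0},
    {"dex": "dex_b", "pair": "ETH/USDC", "amount_in": 1.0, "amount_out": 2100.0},
    {"dex": "dex_a", "pair": "WBTC/USDC", "amount_in": 0.5, "amount_out": 15000.0},
]


def prices_module(swaps: List[dict]) -> Dict[Tuple[str, str], float]:
    """Existing module: latest price per (dex, pair), taken from swap amounts."""
    prices: Dict[Tuple[str, str], float] = {}
    for swap in swaps:
        prices[(swap["dex"], swap["pair"])] = swap["amount_out"] / swap["amount_in"]
    return prices


def volume_module(swaps: List[dict]) -> Dict[str, float]:
    """Existing module: total output volume per pair, independent of prices."""
    volume: Dict[str, float] = defaultdict(float)
    for swap in swaps:
        volume[swap["pair"]] += swap["amount_out"]
    return dict(volume)


def average_price_module(prices: Dict[Tuple[str, str], float]) -> Dict[str, float]:
    """New module: consumes the prices module's output and averages across DEXes."""
    by_pair: Dict[str, List[float]] = defaultdict(list)
    for (_dex, pair), price in prices.items():
        by_pair[pair].append(price)
    return {pair: mean(values) for pair, values in by_pair.items()}
```

Here `average_price_module(prices_module(SWAPS))` yields one averaged price per pair, and `volume_module` could run alongside it because neither depends on the other.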
After the Extraction and Transformation stages, substreams can be composed in an infinite number of ways, enabling another module to populate into a subgraph, all before Load operations.
As opposed to linear historical data processing, substream data can be processed in parallel and cached. This allows for the fastest possible insertion into the PostgreSQL database, cutting sync time from days or weeks to mere hours.
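The caching half of that claim can be sketched as follows. This is a simplified illustration under assumptions: `SegmentCache`, its key scheme of (module name, block range), and the `count_events` workload are hypothetical and say nothing about how substreams actually stores its results. The idea shown is that once a module's output for a block range has been computed, later runs can reuse it instead of reprocessing history.

```python
from typing import Callable, Dict, List, Tuple

Segment = Tuple[int, int]  # half-open block range [lo, hi)


class SegmentCache:
    """Remember each module's output per block range so it is computed once."""

    def __init__(self) -> None:
        self._store: Dict[Tuple[str, Segment], int] = {}
        self.hits = 0
        self.misses = 0

    def get_or_compute(
        self, module_name: str, segment: Segment, compute: Callable[[Segment], int]
    ) -> int:
        key = (module_name, segment)
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = compute(segment)
        return self._store[key]


def count_events(segment: Segment) -> int:
    """Placeholder module: pretend every block holds exactly one event."""
    lo, hi = segment
    return hi - lo


def run(cache: SegmentCache, segments: List[Segment]) -> int:
    """Sum a module's output over segments, reusing cached results where possible."""
    return sum(cache.get_or_compute("count_events", s, count_events) for s in segments)
```

Running the same segments twice computes each one only on the first pass; extending the range later only costs the newly added segment.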
This all serves as a benefit to developers. Developers need to build subgraphs and should be able to iterate on those subgraphs as fast as possible, maximizing developer productivity. Developers will be able to iterate upon existing modules and reuse the most efficient processes (such as in the DEX example), improving incrementally without needing to rebuild a new subgraph. They will be able to observe data and add to their database as required. The speed and data composability of subgraphs and substreams, pulling data through Firehose, will make The Graph the fastest and most efficient way to get data from blockchains.
This is the power of open-source data composability via The Graph: a hivemind of developers building composable data across a global ecosystem. Centralized services cannot compete.
Current Stage in the Process
An initial implementation of substreams has been built and is being tested. The core devs are working with a small group of developers to improve the software. Keep an eye out for announcements on availability for developers.
Thanks to all the core dev teams that have worked on this (special shoutout to StreamingFast!). We can’t wait for developers to experience the radically faster indexing performance enabled by substreams.
About The Graph
The Graph is the indexing and query layer of web3. Developers build and publish open APIs, called subgraphs, that applications can query using GraphQL. The Graph currently supports indexing data from 32 different networks including Ethereum, NEAR, Arbitrum, Optimism, Polygon, Avalanche, Celo, Fantom, Moonbeam, IPFS, and PoA, with more networks coming soon. To date, over 38,000 subgraphs have been deployed on the hosted service, and subgraphs can now be deployed directly on the network. Over 28,000 developers have built subgraphs for applications such as Uniswap, Synthetix, KnownOrigin, Art Blocks, Gnosis, Balancer, Livepeer, DAOstack, Audius, Decentraland, and many others.
The Graph Network’s self-service experience for developers launched in July 2021; since then, over 232 subgraphs have migrated to the Network, with 161+ Indexers serving subgraph queries, 8,600+ Delegators, and 2,300+ Curators. More than 3 million GRT has been signaled to date, with an average of 15K GRT per subgraph.
If you are a developer building a web3 application, you can use subgraphs for indexing and querying data from blockchains. The Graph allows applications to efficiently and performantly present data in a UI and allows other developers to use your subgraph too! You can deploy a subgraph to the network using the newly launched Subgraph Studio or query existing subgraphs that are in the Graph Explorer. The Graph would love to welcome you as an Indexer, Curator, and/or Delegator on The Graph’s mainnet. Join The Graph community by introducing yourself in The Graph Discord for technical discussions, join The Graph’s Telegram chat, and follow The Graph on Twitter, LinkedIn, Instagram, Facebook, Reddit, and Medium! The Graph’s developers and members of the community are always eager to chat with you, and The Graph ecosystem has a growing community of developers who support each other.
The Graph Foundation oversees The Graph Network. The Graph Foundation is overseen by the Technical Council. Edge & Node, StreamingFast, Figment, Semiotic and The Guild are five of the many organizations within The Graph ecosystem.
StreamingFast is a web3 builder and investor. As a core developer on The Graph, it excels at building massively scalable open-source software for processing and indexing blockchain data. Founded by a team of serial tech entrepreneurs, the company has deep expertise in large scale data science. Its core innovation, the Firehose, is a files-based and streaming-first approach to processing blockchain data that enables high performance indexing on high throughput chains.
Originally published at https://thegraph.com on June 2, 2022.