Less Terrible Ethereum Indexing

Date: September 9, 2024
Author: Kaylee George

by Kaylee and DC

Wait, why is this hard?

On the surface, a blockchain is the ideal system-of-record for serving data. You have a single canonical history. It’s mostly append-only. It has precise semantics and everything is immutable. Perfect for caching and computing derived data. Easy!

In practice, Ethereum is a surprisingly challenging system of record. Say you’re building a “hello world” wallet. You give it an account like vitalik.eth and all it does is:

  • Shows the current balance, in ETH
  • Shows recent ETH transfers

Simple! But building this so that it’s fast and correct is frankly harder than it should be.

First things first, you’ll need to track the chain. You can run a node yourself (you’ll need an EL like Geth or Reth, a CL like Prysm, ~1TB disk space each, at least a day to sync, and don’t forget monitoring and alerting if you want it to be reliable) or you can use a hosted node API like Alchemy, Quicknode, Chainstack, or LlamaRPC. We’ve run into at least one serious reliability issue, cost surprise, or outright correctness bug with each of these. You can work around this by using two different nodes with fallback. In any case, the node will tell you any account’s current ETH balance. But it won’t give you recent transfers… that would be too easy!

Instead, you’ll need your own DB and a node that supports trace_block. That’s because ETH transfers do not emit event logs: you need to trace the call tree of every transaction in order to find them. Tracing tends to be slow and expensive. In our experience, if you want full trace history without running your own archive node, only Chainstack will let you do it in less than a week and for less than $1000.
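To illustrate, here’s a sketch of pulling ETH transfers out of a trace_block response. The field names follow the standard Parity-style trace schema (served by Reth, Erigon, and some hosted providers); the filtering rules and simplified types are our own illustration, not a complete implementation:

```typescript
// Sketch: find ETH transfers in a trace_block response (Parity-style schema).
interface Trace {
  type: string; // "call", "create", "suicide", "reward", ...
  action: { from?: string; to?: string; value?: string; callType?: string };
  error?: string; // present if this call frame reverted
  transactionHash?: string;
}

// A call frame moves ETH if it carries nonzero value and did not revert.
// Caveats a real indexer must handle: a revert in an *ancestor* frame also
// undoes the transfer (walk traceAddress to check), and selfdestruct
// ("suicide") frames can move ETH too. Both omitted here for brevity.
function ethTransfers(traces: Trace[]) {
  return traces
    .filter((t) => t.type === "call" && !t.error)
    .filter((t) => t.action.value !== undefined && BigInt(t.action.value) > 0n)
    .map((t) => ({
      from: t.action.from!,
      to: t.action.to!,
      wei: BigInt(t.action.value!),
      txHash: t.transactionHash,
    }));
}
```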

Finally, you’ll need an indexer to write chain data into your DB. You can roll your own (don’t forget to handle reorgs!) or use an existing one such as The Graph, Ponder, or Shovel. Any of these will be writing a lot of data… imagine a Postgres instance with ~1TB of trace rows. Index carefully.

Now, after building all that, you can ship the hello world UI!


…but this is just L1. Ethereum runs on rollups now. If you want to show an account’s true balance, multiply the above by all major chains.

None of this is “deep tech” or conceptually hard, but it’s all operationally heavier and slower than it should be.

After evaluating many indexers, we think we’ve found a way to make indexing lighter & faster. In this post, we present:

  1. Background. The evolution of our indexer setup.
  2. Lessons learned. Our current indexer setup & workarounds.
  3. Proposal for a better way to index. This would’ve let us go from 0 → 1 much faster.

1. Background

First, a quick history of what we tried:

Daimo Indexer v0

  • Base only, USDC only. Easy mode: single contract, yes events, no tracing.
  • RPCs + in-memory. Indexer loaded direct from RPC. No DB.

We remember this version fondly because it was really simple to operate and had no major issues. Eventually, though, the amount of history we were loading on startup got too large. Time to add a DB…

Daimo Indexer v1

  • Base only, all ERC-20 transfers. Still easy-ish mode.
  • Shovel + Postgres DB. Shovel is a great open-source indexer that uses RPCs under the hood. We love the simplicity: you just define a mapping from (event log) to (DB table) and it… shovels from RPC to DB.

Initially, this worked well for us.

But filtering transfers based on a dynamic list of addresses didn’t work for us (the underlying JSON RPC access patterns were inefficient). So we removed the filter. RPC issue solved, but now, instead of writing just Daimo-related transfers into our database, we had Shovel writing all ERC-20 transfers into our DB. Cue DB issues…
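For context, the filter itself is trivial to express locally once you have the logs. Here’s a sketch with simplified types (illustrative, not our actual code); the Transfer topic hash is the standard keccak256 of `Transfer(address,address,uint256)`:

```typescript
// Sketch: pull all ERC-20 Transfer logs, then filter against a dynamic
// address set locally instead of asking the RPC provider to do it.
const TRANSFER_TOPIC =
  "0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef";

interface Log {
  address: string; // token contract
  topics: string[]; // [topic0, from (32-byte padded), to (32-byte padded)]
  data: string; // amount, one 32-byte word
}

// Indexed address topics are 32-byte words; the address is the last 20 bytes.
function topicToAddress(topic: string): string {
  return ("0x" + topic.slice(26)).toLowerCase();
}

function relevantTransfers(logs: Log[], tracked: Set<string>) {
  return logs
    .filter((l) => l.topics[0] === TRANSFER_TOPIC && l.topics.length === 3)
    .map((l) => ({
      token: l.address.toLowerCase(),
      from: topicToAddress(l.topics[1]),
      to: topicToAddress(l.topics[2]),
      amount: BigInt(l.data),
    }))
    .filter((t) => tracked.has(t.from) || tracked.has(t.to));
}
```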

Daimo Indexer v2

  • All major EVM chains, all ETH & ERC20 transfers. Hard mode: tracing required.
  • Shovel + “JumboDB” + API DB.

This is when our problems became much worse. First, we had some terrible DB reliability issues.

We solved our DB issues by splitting raw chain data out into a dedicated “JumboDB”, with a separate process extracting the small subset of events relevant to Daimo.

Our indexing setup before.

Eventually, this failed too: JumboDB became too jumbo. For example, handling reorgs requires deleting just a few blocks at the tip, but with our billion-row Postgres DB even this became problematic: queries would time out and Shovel would halt completely. The fact that Polygon regularly has 10+ block reorgs didn’t help (more on that below).

A Polygon reorg delete query that caused the indexer to halt. The table was correctly indexed such that reorgs would delete rows only via a primary-key range query. Despite this, pg would just freeze intermittently past some size.

After poking around at some other indexers, we decided that none of them fully met our needs. Many are closed-source, don’t support all the chains we need, or don’t support arbitrary filtering.

So we built our own.

💡

Big thanks to Ryan at Index Supply. He was very helpful in debugging these issues. The problems we ran into were largely RPC or DB related. We recommend Shovel (and an exciting new service Index Supply is cooking) for many dapps.

Daimo Indexer v3 (present-day)

  • All major EVM chains, all ETH & ERC20 transfers. Still hard mode.
  • RPCs + API DB. Our own indexer that fetches and filters directly from RPC in one go.

We wanted to make our setup as lightweight and streamlined as possible. Back to the basics:

  1. Use RPCs to directly fetch all log events and traces that we care about (including all transfers, userops, and custom Daimo events) for a given block or range of blocks
  2. Filter through everything we fetched on-demand using some custom logic
  3. Only store what’s relevant to our DB
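The loop above can be sketched as follows, with fetch/filter/store injected so the control flow is visible. Function names and types are illustrative, not our actual code:

```typescript
// Minimal sketch of the v3 indexing loop: fetch, filter, store.
interface BlockData {
  blockNum: number;
  logs: unknown[];
  traces: unknown[];
}

async function indexRange(
  from: number,
  to: number,
  fetchBlock: (n: number) => Promise<BlockData>, // 1. RPC fetch, logs + traces
  filter: (b: BlockData) => unknown[], // 2. arbitrary custom filtering logic
  store: (blockNum: number, rows: unknown[]) => Promise<void> // 3. write DB
): Promise<void> {
  for (let n = from; n <= to; n++) {
    const block = await fetchBlock(n);
    const rows = filter(block); // process everything, keep a tiny fraction
    await store(n, rows); // only relevant rows ever reach the DB
  }
}
```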

With this setup, we still process tons of logs and traces but only write a tiny fraction of that data. Much cleaner. The main tricky part is reorg handling.

Our indexing setup after. Much cleaner.

In hindsight, maybe this setup was obvious: fewer moving parts, much smaller DB, no third-party dependencies. But in the moment, it kind of wasn’t. We were hesitant to build our own: indexing is step 0 for any crypto app, so surely an existing, tried-and-true solution should have worked. Turns out we could do better.

It’s also nice that we have full control over our indexer now. If something’s not working, we have full autonomy to go fix it ourselves. Our custom filtering logic is easier to manage.

In particular, we do something we anticipate many rollup-native apps will have to do: we index a home chain that defines which accounts we care about + all transfers on many chains for those accounts. (In the future, the “home chain” might be a keystore rollup.) We index transfers to and from Daimo accounts on all other major chains, so those indexers can’t run ahead of the home chain. No existing indexer we’ve seen supports this, but it’s straightforward to do in a purpose-built filtering indexer.
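A sketch of that constraint, using a hypothetical gate keyed on block timestamps (the name and mechanism are illustrative, not our actual implementation):

```typescript
// Sketch: foreign-chain indexers must not run ahead of the home chain,
// because the home chain defines which accounts we track.
class HomeChainGate {
  private homeTimestamp = 0; // latest home-chain block timestamp indexed

  // Called by the home-chain indexer as it advances.
  advanceHome(ts: number): void {
    this.homeTimestamp = Math.max(this.homeTimestamp, ts);
  }

  // A foreign-chain block is safe to filter once the home chain has caught
  // up to its timestamp: the tracked account set as of that moment is known.
  canIndex(foreignBlockTimestamp: number): boolean {
    return foreignBlockTimestamp <= this.homeTimestamp;
  }
}
```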

2. Lessons learned

  • Chain issues. L1 + OP Stack are pretty clean; the rest have their own unique challenges.
  • RPC issues. A review of RPC providers (plus one issue that might be a Geth bug).

Goal-state indexing

Indexers tend to be complex and heavy (e.g. The Graph, Polygon’s Chain Indexer Framework). Newer, simpler options include Ponder and Shovel. There are still some problems we’ve identified with the current state of indexing, but luckily, these problems have solutions. For example:

  • Problem: writing entire firehose from chain to DB.
  • Solution: process the firehose, but filter before writing. Filtering requires arbitrary logic + potentially cross-chain logic; in our case, it cannot be done with just fixed filter parameters.

  • Problem: re-executing blocks adds latency + cost.
  • Solution: execute block once → output logs and events immediately. Reth+exex is a very promising pattern.

Above all, existing indexers work around the considerable limitations of the Ethereum JSON RPC API. For dapp indexers, it’s mostly acceptable. Take an app like Polymarket: here, you care about a few Ethereum event signatures from your own contracts, on a single chain. JSON RPC-based filtering, using eth_getLogs etc., works OK for this use case.

For payments, wallets, and other use cases where you need real-time asset transfers, current indexers (and the JSON RPC API) work poorly.

3. Better indexing / a modest proposal

Our ideal, goal-state data provider looks something like this:

  1. The data provider serves a simple, efficient API for block data. For example, this could use gRPC + compressed protobufs. Our indexer has some way to subscribe to each chain ID we’re tracking.

  2. For each new block, we promptly receive (block header, events, traces), ideally within 100ms of when each block lands.

    Our indexer can process this data with arbitrary logic. For example, we might send push notifications straight from the indexer.

    Then, we filter and write to DB. Doing the filtering ourselves (rather than trying to tell a remote node provider how to filter) is far simpler and more flexible. It costs slightly more bandwidth, but this is a tiny cost, especially if using an efficient serialization.

    To handle reorgs, simply delete data at ≥ the block number of the newly received block.

  3. Finally, the provider serves historic data in exactly the same format for fast backfill. This is static data. There is no reason we should have to query old data via a slow JSON API a few blocks at a time, and no reason the node provider backend should have to painstakingly re-execute old blocks!

    Instead, what we want is something like an S3 bucket or even a torrent file, with a bit of scaffolding to rapidly run our indexer over historic block data.
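The reorg rule is simple enough to state in a few lines. A sketch against an in-memory stand-in for the DB (illustrative, not production code):

```typescript
// Sketch: "delete data at >= the block number of the newly received block."
// On a reorg the provider re-sends replacement blocks from the fork point,
// so dropping everything at or above the incoming height is sufficient.
interface IndexedBlock {
  num: number;
  hash: string;
  parentHash: string;
}

function applyBlock(chain: IndexedBlock[], incoming: IndexedBlock): IndexedBlock[] {
  // Happy path: this filter deletes nothing and we just append.
  return [...chain.filter((b) => b.num < incoming.num), incoming];
}
```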

The Reth exex system is very close to this and accomplishes 1 and 2 better than anything else we’ve seen. There are already some promising indexer experiments using Reth this way. The main missing pieces for us:

  • We’d love an external API so we don’t directly have to run nodes ourselves.
  • Reth doesn’t support a few chains we need yet, including Arbitrum.
  • Finally, we still need #3 for fast backfill.
  • Backfilling e.g. Base from scratch on a server with fast internet should take <1 hour, not the many hours it takes to sync a fresh Reth node, or the several days it takes to call trace_block since genesis via Chainstack JSON RPC.

Conclusion

“Simplify, then add lightness.” — Colin Chapman

Ethereum UX depends directly on indexer quality. For the Ethereum ecosystem to feel snappy and reliable, we need rock-solid indexer infrastructure.

Many of the underlying challenges date to the very start of Ethereum a decade ago. For example, the fact that ETH transfers require tracing, or some of the design decisions that make the JSON API inefficient.

But we think they’re on the cusp of being solved now, which is exciting.

Big thanks to Ansgar, Georgios, Alexey & Ryan for reading drafts of this post.

If these kinds of problems are interesting to you, we’re hiring. Email us at founders@daimo.com