On decentralised database management systems

A little background: I’ve mostly done full-stack web app development in my career, but I took a detour in 2014-ish to work at a startup called FoundationDB, doing feature development on distributed database systems with ACID properties. (The technology still exists; the company was acquired by Apple in 2015 and the project was later re-open-sourced, and can be found here.)

Recently, I’ve been working myself up to creating decentralized apps on Polkadot for fun side projects, and started thinking about web app development generally vis-à-vis blockchain and what the constraints are. Usually when I start working on a side project, I start with the data model and then do a little guesswork about what kind of DBMS I will want for it. Often the answer has fallen somewhere between a relational database and a KV-store, or some combination of those ideas. I’ve even seen companies use Elasticsearch as a primary database. For Substrate, it’s a mixed bag depending on whether you want to go with a smart contract-based app or build your own collator. I imagine if you built your own collator/parachain then you could model the data it manages any way you like, but there’s no sort of traditional relational database built on top of this technology yet (right?).

Recently, I’ve seen TPS measurements over 100k for Kusama during the spammening with only a fraction of the “cores” in use. But finality still takes on the order of seconds, so effectively writes take as long as finality takes (?). If I remember it right, FoundationDB was able to do millions of TPS with a mixed load of reads and writes on “commodity hardware” with ACID properties using a distributed (but not decentralized) network of nodes with a variety of roles that changed over time depending on network conditions (which ultimately meant that outages could occur randomly and the system would still be able to handle ACID transactions until the bitter end). I remember being impressed by that, but this is different from operating in an environment where some of the nodes could be bad actors, and I recognize that this is the innovation of blockchain and what makes those systems decentralized rather than merely distributed. It’s “easy” to get millions of TPS when trust isn’t a concern. (edit: an overview of the FDB technology can be found here for anyone interested)

That said, one of the things that FoundationDB was trying to achieve was to separate the storage from the data model. In other words, they had independent server processes that acted as layers (SQL, Document, Tuple, etc.) and interacted over TCP with the distributed KV-store that served as a foundation (hence the name). The Document layer, for example, was a server process that acted as a drop-in replacement for MongoDB: you could interact with it over Mongo’s wire protocol using the existing CLI or other client libraries, but ultimately those reads and writes were made to the KV-store with ACID transactions, something Mongo couldn’t do at the time (multi-document transactions only arrived later, in MongoDB 4.0).
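To make the “layers” idea concrete, here is a minimal sketch of a document layer sitting on top of an ordered KV-store. The names `OrderedKV` and `DocumentLayer` are hypothetical stand-ins invented for illustration; this is not the FoundationDB API, and it leaves out transactions and the wire protocol entirely. It only shows the mapping from documents to tuple keys.

```python
import json


class OrderedKV:
    """In-memory stand-in for an ordered key-value store (hypothetical, not a real client)."""

    def __init__(self):
        self._data = {}

    def set(self, key: tuple, value: bytes) -> None:
        self._data[key] = value

    def get_range(self, prefix: tuple):
        """Yield (key, value) pairs whose key starts with `prefix`, in key order."""
        for key in sorted(self._data):
            if key[:len(prefix)] == prefix:
                yield key, self._data[key]


class DocumentLayer:
    """Stores each document field under the tuple key (collection, doc_id, field)."""

    def __init__(self, kv: OrderedKV):
        self.kv = kv

    def insert(self, collection: str, doc_id: str, doc: dict) -> None:
        for field, value in doc.items():
            self.kv.set((collection, doc_id, field), json.dumps(value).encode())

    def find_one(self, collection: str, doc_id: str):
        # A single range read over the (collection, doc_id) prefix rebuilds the document.
        doc = {key[2]: json.loads(raw)
               for key, raw in self.kv.get_range((collection, doc_id))}
        return doc or None
```

The point of the design is that the layer holds no state of its own: everything lives in the KV-store, so the same store could back a SQL layer or a tuple layer side by side.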

So I’ve been thinking about this “layers” idea and how it could relate to Polkadot/Substrate/JAM. What I’ve seen for data management in the blockchain world is mostly IPFS and its ilk, which is more or less a replacement for S3, but not a database. And I keep coming back to the same question – is blockchain a database? It’s data stored in a linked-list-like structure, and the writes are basically ACID, but what’s missing, I guess, is all the indexing that makes querying faster. If I imagine a situation where we simply treat the data on-chain as the single source of truth, and then just create indexes to make that data more accessible/queryable, is that all that’s required for a traditional full-stack app experience? Would it be possible to create a layered architecture like FoundationDB’s over top of blockchain, where the blockchain is the KV-store? And then developers could just use existing tools (psql, mongo, etc.) to create apps on top of the blockchain?
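As a rough sketch of “chain as source of truth, indexes on the side”, the snippet below replays a list of blocks into an in-memory SQLite table so the data can be queried with ordinary SQL. The block shape (`number`, `events`, `from`, `to`, `amount`) is an assumption made up for illustration and does not match any real Substrate event format; a real indexer would read finalized blocks from a node instead of a hard-coded list.

```python
import sqlite3

# Hypothetical finalized blocks, each with a number and a list of transfer events.
BLOCKS = [
    {"number": 1, "events": [{"from": "alice", "to": "bob", "amount": 5}]},
    {"number": 2, "events": [{"from": "bob", "to": "carol", "amount": 2},
                             {"from": "alice", "to": "carol", "amount": 1}]},
]


def build_index(blocks):
    """Replay blocks into a relational table with a secondary index on recipient."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE transfers (block INTEGER, sender TEXT, recipient TEXT, amount INTEGER)"
    )
    db.execute("CREATE INDEX by_recipient ON transfers (recipient)")
    for block in blocks:
        db.executemany(
            "INSERT INTO transfers VALUES (?, ?, ?, ?)",
            [(block["number"], e["from"], e["to"], e["amount"]) for e in block["events"]],
        )
    db.commit()
    return db


def received_total(db, recipient):
    """Total amount received by `recipient`, or 0 if it never appears."""
    row = db.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM transfers WHERE recipient = ?",
        (recipient,),
    ).fetchone()
    return row[0]
```

Because the table is derived purely from the blocks, it can be dropped and rebuilt at any time; the chain stays authoritative and the index is only a cache.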

If any of this sounds uninformed, it’s because it is. I’m still a beginner when it comes to the technical side of blockchain and Polkadot/Substrate. I keep meaning to dive deeper, but I have had lots of distractions. Also, my role at FoundationDB didn’t require me to know that much about distributed systems. I was mainly involved with translating the Mongo wire protocol and document layer correctness testing while I was there.

It is absolutely doable, but it is costly. The other potential issue is privacy. Usually in a database, most of the data is private.

An example where that would be useful: to get data from the head of the chain you can use a light client. For history, you need to go to an archive node if you want information per block, but if you want information per address then you have to go to a block explorer. You need to trust them to have done the job properly and not to have missed a block, for example. Having a rollup / parachain where you could get inverted indexes would be nice.
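For illustration, a per-address inverted index can be sketched as a map from address to the block numbers that touch it, together with a check for skipped blocks. The block shape (`number`, `addresses`) is hypothetical, and the gap check only covers what the indexer itself saw; it does not remove the need to trust whoever runs it.

```python
from collections import defaultdict


def build_address_index(blocks):
    """Return (index, missing).

    index:   address -> sorted list of block numbers mentioning that address
    missing: block numbers absent between the lowest and highest block seen
    """
    index = defaultdict(list)
    seen = set()
    for block in blocks:
        seen.add(block["number"])
        # Deduplicate so an address appearing twice in a block is listed once.
        for address in set(block["addresses"]):
            index[address].append(block["number"])
    for numbers in index.values():
        numbers.sort()
    if seen:
        missing = sorted(set(range(min(seen), max(seen) + 1)) - seen)
    else:
        missing = []
    return dict(index), missing
```

A non-empty `missing` list is exactly the “missed a block” failure mentioned above, which is why an index that cannot show its coverage is hard to rely on.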

Since you cannot delete from the blockchain or recompact data, you would need to think carefully about how to design indexes if you do not want the size to get large (and, again, costly).

If you think more about it and write a proposal I would be happy to review it.

Thank you for your reply and offer to review a proposal. Can you point me to examples of successful proposals, documentation about the proposal process, etc.? I’m still very new to the community.

In reply to some of the issues you brought up, I was thinking that the indexes wouldn’t need to live on the blockchain, but could be independent services, perhaps even centralized services, that would spend considerable upfront cost indexing the chain looking for data under a certain namespace (address maybe, but could also be arbitrary data that would act as keys in a potentially long-lived but non-permanent KV-store built on the side). In that way, you could also control privacy to a certain extent just by encrypting the data – my understanding is that encryption only holds for some timespan after it’s used, and would eventually be compromised by future technological advances.

But upon reflection, I realize that spoofed indexes (coming from centralized database mirrors) presented as truth could actually be designed to coerce an address-holder to make inadvertent but real changes to onchain state to their detriment.
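One common way to guard against a spoofed index is to have the index return an inclusion proof that the client checks against a root hash it already trusts (for example, one obtained through a light client). The sketch below uses a plain binary Merkle tree over SHA-256 purely for illustration; that is an assumption, not how Substrate commits to state, and the function names are made up. `merkle_root` expects a non-empty list of leaves.

```python
import hashlib


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(leaves):
    """Root of a binary Merkle tree; the last node is duplicated on odd-sized levels."""
    level = [h(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


def merkle_proof(leaves, index):
    """Sibling hashes from leaf to root, each tagged with whether the sibling is on the left."""
    level = [h(leaf) for leaf in leaves]
    proof = []
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        sibling = index ^ 1  # neighbour within the same pair
        proof.append((level[sibling], sibling < index))
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        index //= 2
    return proof


def verify(leaf, proof, root):
    """Recompute the root from a leaf and its proof, then compare with the trusted root."""
    node = h(leaf)
    for sibling, sibling_is_left in proof:
        node = h(sibling + node) if sibling_is_left else h(node + sibling)
    return node == root
```

With something like this, a centralized index can still answer queries quickly, but a forged record fails `verify` as long as the client got the root from a source it trusts. What it does not cover is omission: a proof shows a record is on-chain, not that the index returned every matching record.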

Perhaps this thing “built on the side” would really just be a kind of rollup / parachain, and building the indexes wouldn’t be necessary if you could just learn them from other nodes, and also the data would be considerably less if, for example, this weren’t just a single parachain offering a mongo dev experience, let’s say, but rather a library for startups to use to stand up their own parachain that would act as a mongo emulator (or toggle a cluster of modeling layer types for use on what is essentially a private chain).

Cautionary Note:
The ideas canvassed here reflect a point of view, and an end state, that has authoritatively been described as “fundamentally mistaken”

Here the exception probably proves the rule that your instinct is correct. The exception is when your parachain adopts a consensus protocol that leverages the data model, such as Sui’s address-owned objects. Here you would be implementing your own validation/consensus protocol as an insurance policy for the day the relay chain fails, which current token designs mean is guaranteed to happen. While it is nigh impossible to predict when, it is possible to predict the triggering event: token designs that don’t have the current token vulnerabilities.

So the constraint on your choices would be: use a data model that can survive the relay chain failure.

Blockchains are definitely in their infancy - no chain has been able to persuade a regulator to sign off on their various claims (certain misrepresentations – i.e. not just puffery – aside)… but the outlines of an end state are visible and they are trivial, as you have worked out: these are just distributed data stores, or even a network of VMs with a network shared drive. Both work as a practical mental model.

With that end state visible, you can find that some supporting tech can be reasonably well advanced. Some that come to mind

Relax. Everyone here is struggling, and way out of their comfort zone. Especially the PhDs - if they weren’t, we would already have the SEC publicly blessing DOT.

“The more I learn the more I realize how little I understand.” Source…

That low-level DB tech above is probably mature enough for you to add value?

The only tip I would offer is don’t assume the end state is one blockchain to rule them all
… or even a winner-take-most end state

Better, in my current view, to work from the premise that there will be a multitude of interchangeable blockchains (relay chains in Substrate parlance, DOTs in Polkadot parlance).