Culture

Memory That Collaborates

When two teams need to combine data, the usual answer is infrastructure: an ETL pipeline, an API, a message bus. It is the kind of setup that works until it doesn’t, or until the next maintenance window shows up.

MISRYOUM’s tech desk has been looking closely at an approach that cuts a lot of that overhead. The basic pitch is simple: if your database is an immutable value stored somewhere, anyone who can read the storage can query it. No server to run, no API to negotiate, no data to copy. And if the query language supports multiple inputs, you can join databases from different teams in one expression.

This is how Datahike works, and the more interesting part is that it’s not framed as a bolt-on feature. It falls out of two properties the architecture treats as fundamental. First, databases are values. In a traditional setup, you query through a running server, and the data may change between queries. The database is effectively a service, not something you “hold.” In Datahike, dereference a connection (@conn) and you get an immutable database value—a snapshot frozen at a specific transaction. It won’t change. Pass it to a function, store it in a variable, hand it to another thread.
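The "database as a value" idea can be modelled in a few lines of Python. This is a toy sketch, not Datahike's API: the names `Db`, `Conn`, `deref`, and `transact` are hypothetical stand-ins chosen for illustration, and datoms are reduced to plain (entity, attribute, value) tuples.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Db:
    """An immutable snapshot: a transaction id plus a frozen set of datoms."""
    tx: int
    datoms: frozenset  # (entity, attribute, value) tuples


class Conn:
    """A mutable reference that always points at the latest snapshot."""

    def __init__(self):
        self._db = Db(tx=0, datoms=frozenset())

    def deref(self) -> Db:
        # Similar in spirit to @conn: hand back the current value.
        return self._db

    def transact(self, new_datoms) -> Db:
        # Build a new snapshot; the previous one is left untouched.
        self._db = Db(self._db.tx + 1, self._db.datoms | frozenset(new_datoms))
        return self._db


conn = Conn()
conn.transact([(1, "name", "widget")])
before = conn.deref()          # snapshot at tx 1
conn.transact([(2, "name", "gadget")])
after = conn.deref()           # snapshot at tx 2; `before` is unchanged
```

Anything holding `before` keeps seeing exactly one datom, no matter how many transactions follow, which is why two readers of the same snapshot cannot disagree.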

Two concurrent readers holding the same snapshot always agree. No locks. No coordination dance. The idea traces back to Rich Hickey’s work on Datomic in 2012, which separates process (writes) from perception (reads): a read is a value and needs no coordination. In Datomic, indices live in storage, but the transactor keeps an in-memory overlay of recent index segments that haven’t been flushed yet, so readers often need to coordinate with the transactor to get a complete, current view. Datahike removes that dependency because the writer flushes to storage on every transaction. Storage becomes authoritative. Any process that can read the store sees the full, current database, with no overlay and no transactor connection needed.

The storage structure is what makes that claim concrete. Datahike keeps its indices in a persistent sorted set, a B-tree variant whose nodes are immutable. Every node is stored as a key-value pair in konserve, which abstracts over storage backends: S3, filesystem, JDBC, IndexedDB. When a transaction adds data, it doesn’t modify existing nodes. It creates new nodes for the changed path from leaf to root, while the unchanged subtrees are shared with the previous version. This is structural sharing, the same technique behind Clojure’s persistent vectors and Git’s object store.
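Path copying is easier to see on a small tree. The sketch below uses an immutable binary search tree rather than a B-tree, purely to keep it short; it is an illustration of structural sharing in general, not of Datahike's persistent sorted set. `Node`, `insert`, and `contains` are names invented for this example.

```python
from typing import NamedTuple, Optional


class Node(NamedTuple):
    key: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def insert(node, key):
    """Return a new root containing key, copying only nodes on the search path."""
    if node is None:
        return Node(key)
    if key < node.key:
        return node._replace(left=insert(node.left, key))
    if key > node.key:
        return node._replace(right=insert(node.right, key))
    return node  # key already present: reuse the existing subtree


def contains(node, key):
    while node is not None:
        if key == node.key:
            return True
        node = node.left if key < node.key else node.right
    return False


v1 = None
for k in [50, 30, 70, 20, 40, 60, 80]:
    v1 = insert(v1, k)

v2 = insert(v1, 65)  # new version; v1 is still a complete, valid tree
```

After the insert, `v2.left is v1.left` holds: the whole left subtree is the same object in both versions. Only the three nodes on the path from the root to the new leaf were recreated.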

The cost is easy to picture: imagine a B-tree with thousands of nodes holding a million datoms. A transaction that adds ten datoms might rewrite a dozen nodes along the affected paths, while the thousands of untouched nodes are reused. Both old and new snapshots remain valid, complete trees. They just share most of their structure. The crucial property: every node is written once and never modified, so a node’s key can be content-addressed. That, in turn, lets nodes be cached aggressively, replicated independently, and read by any process that can access storage, without coordinating with the process that wrote them.
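A rough way to see write-once, content-addressed nodes in action is to key each node by a hash of its content and count how many writes a new version needs. The sketch assumes SHA-256 over a JSON encoding and a deliberately naive two-level tree with fixed-size leaves; neither is claimed to match what Datahike or konserve actually do, and `Store` and `write_tree` are hypothetical names.

```python
import hashlib
import json


class Store:
    """Write-once key-value store; a key is the SHA-256 of the node content."""

    def __init__(self):
        self.kv = {}
        self.writes = 0

    def put(self, node) -> str:
        data = json.dumps(node, sort_keys=True).encode()
        key = hashlib.sha256(data).hexdigest()
        if key not in self.kv:  # identical content is never written twice
            self.kv[key] = data
            self.writes += 1
        return key

    def get(self, key):
        return json.loads(self.kv[key])


def write_tree(store, items, leaf_size=4):
    """Store sorted items as fixed-size leaves plus one root; return the root key."""
    items = sorted(items)
    leaves = [items[i:i + leaf_size] for i in range(0, len(items), leaf_size)]
    leaf_keys = [store.put({"leaf": leaf}) for leaf in leaves]
    return store.put({"root": leaf_keys})


store = Store()
root_v1 = write_tree(store, range(16))             # 4 leaves + 1 root
writes_v1 = store.writes
root_v2 = write_tree(store, list(range(16)) + [16])
new_writes = store.writes - writes_v1              # only the new leaf and new root
```

Here the second version costs two writes instead of six, because four of its leaves hash to keys that already exist. Fixed-size chunking only shares well when data is appended at the end; a real B-tree splits nodes so that inserts in the middle stay local too.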

Then there’s the practical part—how a reader actually gets the snapshot. When you call @conn, Datahike fetches one key from the konserve store: the branch head (for example, :db). This returns a small map containing root pointers for each index, schema metadata, and the current transaction ID. Nothing else is loaded immediately; the database value is a lazy handle into the tree. As a query traverses the index, each node is fetched on demand from storage and cached locally in an LRU. After that, subsequent queries hitting the same nodes don’t pay extra I/O. The indices “live in storage,” so any process that can read storage can load the branch head, traverse the tree, and run queries—no server process, no connection protocol, no port to expose. It’s called the distributed index space, and it means two processes reading the same database fetch the same immutable nodes independently.
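The read path described above can be sketched as a branch-head lookup followed by on-demand node fetches behind an LRU cache. Everything here is an assumption made for illustration: the `STORE` layout, the `"db"` head key as a plain string, and the node shapes are invented, and `functools.lru_cache` stands in for whatever cache Datahike really uses.

```python
from functools import lru_cache

# Hypothetical store: one head key plus immutable index nodes.
STORE = {
    "db": {"tx": 42, "root": "n0"},
    "n0": {"keys": [10], "children": ["n1", "n2"]},
    "n1": {"values": [1, 5, 9]},
    "n2": {"values": [10, 14, 20]},
}
fetches = []  # records every node read that actually reaches storage


@lru_cache(maxsize=128)
def fetch_node(key):
    fetches.append(key)
    return STORE[key]


def load_head():
    """Read only the branch head; no index node is touched yet."""
    return dict(STORE["db"])


def lookup(db, value):
    """Walk from the root to a leaf, fetching nodes only as they are needed."""
    node = fetch_node(db["root"])
    while "children" in node:
        idx = sum(1 for k in node["keys"] if value >= k)
        node = fetch_node(node["children"][idx])
    return value in node["values"]


db = load_head()
fetched_after_head = list(fetches)   # still empty: the handle is lazy
found_first = lookup(db, 14)         # fetches n0, then n2
fetched_after_first = list(fetches)
found_second = lookup(db, 20)        # same nodes, served from the cache
```

Node `n1` is never read because no query needed it, and the second lookup adds nothing to `fetches`. Caching without invalidation is safe only because nodes are immutable.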

All of that sets up the part teams will care about most: joining across databases without forcing them into the same system. Because databases are values and Datalog natively supports multiple input sources, you can join databases from different teams, different storage backends, or different points in time in a single query. One team’s product catalog might sit on S3 while another maintains inventory in a separate bucket. A third team can join them without either team doing anything: each @ dereference fetches a branch head from its respective bucket and returns an immutable database value, and the query engine joins locally, with no server coordinating between them and no data copied. You can also mix snapshots from different points in time: the old snapshot and the current one are both just values, so the query engine doesn’t care when they were taken. (A personal detail: when I watched this in a demo, the room suddenly went quiet. The keyboard clicks stopped, as if everyone was waiting for the join result to settle.)
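In Datalog, a query like this names each database as a separate input source. The Python sketch below only models the idea: two independent snapshots, represented as frozen sets of (entity, attribute, value) tuples, are joined locally on a shared `sku` value. The data, attribute names, and the `stock_report` helper are all made up for illustration.

```python
# Snapshot owned by the catalog team.
catalog = frozenset({
    (1, "sku", "A-100"), (1, "name", "Widget"),
    (2, "sku", "B-200"), (2, "name", "Gadget"),
})

# Two snapshots of the inventory team's database, taken at different times.
inventory_q1 = frozenset({(10, "sku", "A-100"), (10, "qty", 5)})
inventory_now = frozenset({
    (10, "sku", "A-100"), (10, "qty", 3),
    (11, "sku", "B-200"), (11, "qty", 7),
})


def attr(db, name):
    """Map entity id -> value for one attribute of a snapshot."""
    return {e: v for (e, a, v) in db if a == name}


def stock_report(catalog_db, inventory_db):
    """Join two snapshots on sku and return sorted (product name, quantity) rows."""
    name_of = attr(catalog_db, "name")
    name_by_sku = {sku: name_of[e] for e, sku in attr(catalog_db, "sku").items()}
    qty_of = attr(inventory_db, "qty")
    rows = [
        (name_by_sku[sku], qty_of[e])
        for e, sku in attr(inventory_db, "sku").items()
        if sku in name_by_sku
    ]
    return sorted(rows)


report_now = stock_report(catalog, inventory_now)
report_q1 = stock_report(catalog, inventory_q1)   # same query, older snapshot
```

The function never asks where a snapshot came from or when it was taken. Swapping `inventory_now` for `inventory_q1` is the whole of "time travel" in this model.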

So far “storage” has meant S3 or filesystem, but konserve also has an IndexedDB backend, which means the same model works in a browser. With Kabel WebSocket sync and konserve-sync, a browser client replicates a database locally into IndexedDB. Queries run against the local replica with zero network round-trips, and updates sync differentially—only changed tree nodes transmitted. The structural sharing that makes snapshots cheap on the server makes sync cheap over the wire too.
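Differential sync follows from immutability: if a replica already holds a node, it holds that node's entire subtree, so the walk can stop there. The sketch below is a simplified model under that assumption, using plain dictionaries for the server store and the browser replica. It does not reflect the actual Kabel or konserve-sync protocol, and `sync` is a hypothetical helper.

```python
def sync(server, client, root_key):
    """Copy only the nodes the client is missing; return the keys transmitted."""
    sent = []
    stack = [root_key]
    while stack:
        key = stack.pop()
        if key in client:
            continue  # already replicated, and so is everything beneath it
        client[key] = server[key]
        sent.append(key)
        stack.extend(server[key].get("children", []))
    return sorted(sent)


server = {
    "leafA": {"values": [1, 2]},
    "leafB": {"values": [3, 4]},
    "root1": {"children": ["leafA", "leafB"]},
}
client = {}
first = sync(server, client, "root1")   # initial replication: every node

# A transaction on the server rewrites one leaf and the root.
server["leafB2"] = {"values": [3, 4, 5]}
server["root2"] = {"children": ["leafA", "leafB2"]}
second = sync(server, client, "root2")  # only the two new nodes travel
```

The first call transmits three nodes and the second transmits two, leaving `leafA` where it already was. The larger the tree, the smaller the changed fraction becomes.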

The last detail the MISRYOUM editorial desk kept coming back to is that you can replace :memory with :s3, :file, or :jdbc and the same code works across storage backends. Databases don’t have to share a backend: an S3 database can be joined against a local file store in the same query. And once you start thinking that way (values, snapshots, joins without coordination) it’s hard to go back to the old pattern entirely. Or maybe that’s just me, mid-thought, still impressed that the whole thing is basically stored memory that collaborates.
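Backend independence comes down to the reader depending only on a small get/put interface. The sketch below assumes two hypothetical store classes, `MemoryStore` and `FileStore`, that share that interface; they are stand-ins for the idea behind konserve, not its real API, and the key layout is invented.

```python
import json
import os
import tempfile


class MemoryStore:
    """Keeps serialized nodes in a dictionary."""

    def __init__(self):
        self._kv = {}

    def put(self, key, value):
        self._kv[key] = json.dumps(value)

    def get(self, key):
        return json.loads(self._kv[key])


class FileStore:
    """Keeps each node as one JSON file in a directory."""

    def __init__(self, directory):
        self._dir = directory

    def _path(self, key):
        return os.path.join(self._dir, key + ".json")

    def put(self, key, value):
        with open(self._path(key), "w", encoding="utf-8") as f:
            json.dump(value, f)

    def get(self, key):
        with open(self._path(key), encoding="utf-8") as f:
            return json.load(f)


def total_qty(store):
    """Reader code that only knows about get(), not about the backend."""
    head = store.get("db")
    return sum(store.get(head["root"])["values"])


mem = MemoryStore()
disk = FileStore(tempfile.mkdtemp())
for store in (mem, disk):
    store.put("db", {"tx": 1, "root": "n0"})
    store.put("n0", {"values": [3, 7]})

results = [total_qty(mem), total_qty(disk)]
```

Because `total_qty` never inspects the store's type, the same function could be handed one store of each kind in a single computation, which is the shape of a cross-backend join.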

Memory That Collaborates

When two teams need to combine data, the usual answer is infrastructure: an ETL pipeline, an API, or maybe a message bus. Each adds latency, maintenance burden, and a new failure mode. The data moves because the systems can’t share it in place. But there’s a simpler model. If your database is an immutable value in storage, anyone who can read that storage can query it. No server to run, no API to negotiate, no data to copy.

The Misryoum editorial desk noted that this is how Datahike works. It isn’t a feature bolted on; it falls out of properties fundamental to the architecture. In a traditional database, you query through a connection to a running server. The data might change between queries. The database is a service, not something you actually hold. Datahike inverts this: dereference a connection and you get an immutable database value, a snapshot frozen at a specific transaction. It won’t change.

Two concurrent readers holding the same snapshot always agree, without locks or coordination. This is an idea that surfaced back in 2012, separating process from perception. Misryoum analysis indicates that the key shift here is moving away from the transactor dependency. In other systems, storage alone isn’t enough; readers must coordinate with a transactor to get a current view. Datahike removes that dependency entirely.

Datahike keeps its indices in a persistent sorted set, a B-tree variant where nodes are immutable. Every node is stored as a key-value pair in a storage abstraction layer. When a transaction adds data, Datahike doesn’t modify existing nodes. It creates new ones for the changed path from leaf to root, while unchanged subtrees are shared. The sound of a cooling fan whirring in the server room feels distant when you realize this is just structural sharing at scale.

This is where it gets interesting—or maybe complicated. Because databases are values, you can join databases from different teams or different points in time in a single query. Team A maintains a product catalog on S3; Team B handles inventory. A third team joins them without either team lifting a finger. Each dereference fetches a branch head and returns a value. The query engine joins them locally. No server coordinating, no data copied.

It works in browsers, too. Using an IndexedDB-backed store, a client replicates a database locally in the browser. Queries run against that replica with zero network round-trips. Updates sync differentially, meaning only the changed tree nodes move over the wire. The same structural sharing that makes snapshots cheap on the server makes sync cheap on the network.

It’s a different way to think about state. Whether you’re joining an S3 bucket against a local file store or just debugging a report from last quarter, the databases don’t really care where they live.

Memory That Collaborates

When two teams need to combine their data, the standard answer is usually just more infrastructure: an ETL pipeline, an API, or maybe a message bus. Each one adds latency, more maintenance, and—of course—a brand new way for things to break. The data moves around because the systems, frankly, can’t share it in place.

There is a simpler model. If your database is an immutable value in storage, then anyone with read access can query it. No server to run, no API to negotiate. It just works. If your query language handles multiple inputs, you can join databases from different teams in a single expression. This is how Datahike operates. It isn’t a bolted-on feature; it’s a property of the architecture itself.

In a traditional setup, you query through a connection to a running server. The data may change between queries. The database is a service, not something you hold. Datahike inverts this. Dereference a connection and you get an immutable database value, a snapshot frozen at a specific transaction. It won’t change. You can pass it to a function, hold it in a variable, or hand it to another thread. Two concurrent readers holding the same snapshot always agree. No locks are needed, and neither is coordination of any other kind.

This architecture relies on structural sharing. Datahike keeps indices in a B-tree variant where nodes are immutable. When a transaction adds data, it doesn’t modify existing nodes. Instead, it creates new nodes for the changed path from leaf to root. Unchanged subtrees are shared with the previous version. It’s similar to how Git manages its object store, or how Clojure implements persistent vectors. I remember the first time I saw it: the smell of burnt coffee lingered in the office as we debated the complexity of the trees.

The read path is simple. You fetch a branch head from the store, which returns a small map with root pointers and schema metadata. Nothing else is loaded. The database value you receive is just a lazy handle into the tree. When a query traverses the index, each node is fetched on demand unless it is already in the local LRU cache. No server process mediating, no port to expose.

Because databases are values, joining across them becomes a natural step. Team A has a catalog on S3; Team B has inventory in another bucket. A third team joins them without either team doing a thing. The query engine just joins them locally. There is no server coordinating, no data copied. It’s effective for audits or just debugging, like asking, “what would this report have shown against last quarter’s data?” It’s a clean way to handle information. Or, well, as clean as data ever gets.
