Modern servers rely on many CPU cores and complex memory hierarchies, making scalability primarily a concurrency problem rather than a raw compute problem.
In-memory data stores like Redis and Memcached sit on the hot path of modern systems, where architectural decisions directly affect latency and throughput.
Dragonfly is not “Redis but faster.”
It is a fundamentally different in-memory data store built around a concurrency model designed for modern multi-core machines. Understanding Dragonfly requires looking at its architecture before its performance.
This post starts a series on Dragonfly internals by focusing on the architectural principles that shape the system, beginning with the problem it is designed to solve.
Scaling today is no longer about making a single operation faster; it is about using many CPU cores without forcing them to constantly coordinate.
Clock speeds have plateaued, core counts keep rising, and memory access is increasingly non-uniform. In this environment, shared state becomes the dominant cost. Cache coherence traffic, lock contention, and cross-core communication quickly outweigh the cost of the actual work being done.
Most traditional designs respond by either protecting shared data with locks or avoiding concurrency entirely through a single execution thread. Both approaches simplify correctness, but they place a hard limit on scalability. At high concurrency, systems spend more time coordinating than doing useful work. Dragonfly takes a more opinionated stance: contention is not something to be managed at runtime, but something to be designed away. By partitioning data and execution to minimize sharing, Dragonfly allows concurrency to scale with core count rather than fight against it—an approach that naturally leads to its thread-per-core and shared-nothing architecture.
Redis is single-threaded only for command execution.
The process is not single-core,
and its memory is not touched by only one core.
Single Threaded != Single Core
Systems that rely heavily on shared mutable state—such as multi-threaded caches with shared hash tables—incur all three costs as concurrency increases.
At the center of Dragonfly’s architecture is a deliberate choice: one worker thread per CPU core.
Each worker thread is pinned to a specific core and runs continuously. Instead of treating CPU cores as interchangeable execution resources, Dragonfly treats each core as a long-lived execution context with a clearly defined role.
- The number of worker threads matches the number of CPU cores.
- Each worker thread runs on a dedicated core.
- Requests are dispatched to a specific worker thread, not pulled from a shared pool.
Once a request reaches a worker thread, it is executed serially on that core. There is no concurrent execution within a worker thread and no migration of work during execution.
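As a rough illustration of this thread-per-core setup, the minimal sketch below (not Dragonfly's actual code) spawns one worker per CPU core and pins each worker to its own core using Linux thread affinity; the event loop each worker would run is omitted.

```cpp
// Minimal sketch: one worker thread per core, each pinned to its own CPU.
// Illustrative only; not taken from Dragonfly's source.
#include <pthread.h>
#include <sched.h>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
  unsigned num_cores = std::thread::hardware_concurrency();
  std::vector<std::thread> workers;

  for (unsigned core = 0; core < num_cores; ++core) {
    workers.emplace_back([core] {
      // Pin this thread to a single core; it never migrates afterwards.
      cpu_set_t set;
      CPU_ZERO(&set);
      CPU_SET(core, &set);
      pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

      // In a real system, this thread would now run its own event loop,
      // serving only the requests routed to it.
      std::printf("worker pinned to core %u\n", core);
    });
  }
  for (auto& t : workers) t.join();
}
```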
So essentially, each core in Dragonfly behaves like a miniature Redis instance: it follows a single-threaded model, executes commands serially, and completely owns its portion of the dataset.
A shared-nothing architecture is a distributed design in which each node operates independently and does not interfere with any other node. The data owned by a node is not shared, and that node has exclusive access to its memory.
Building on its thread-per-core execution model, Dragonfly adopts a shared-nothing architecture. In Dragonfly, each node refers to a single core.
Each thread owns its data and operates on it independently. There are no shared hash tables, no shared memory structures in the hot path, and no coordination between cores during normal request execution.
Since a core owns a well-defined portion of the keyspace, only that core is allowed to read or modify the data it owns. This ownership model avoids the synchronization problems that arise when multiple threads need concurrent access to a shared resource.
In short, instead of relying on locks or atomic operations to maintain correctness, Dragonfly relies on isolation, and threads rarely need to coordinate.
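A tiny sketch of this ownership model, with made-up names (Shard, OwnerOf) and a fixed shard count rather than Dragonfly's real data structures: each shard owns a private map, and a key's owning shard is derived from its hash, so only that shard's thread ever touches the entry.

```cpp
// Illustrative only: per-shard ownership of the keyspace.
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

struct Shard {
  // Owned exclusively by one worker thread; never accessed by other threads.
  std::unordered_map<std::string, std::string> data;
};

std::vector<Shard> shards(16);  // typically one shard per core

size_t OwnerOf(const std::string& key) {
  // Any stable hash works for the sketch; the real routing is more involved.
  return std::hash<std::string>{}(key) % shards.size();
}

// Must be called only from the thread that owns shard OwnerOf(key).
void Set(const std::string& key, const std::string& value) {
  shards[OwnerOf(key)].data[key] = value;
}
```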
Consider how a single read request flows through the system:

1. The request is received by an I/O thread, which parses the command and hashes the key.
2. Based on the hash, the request is asynchronously routed to the shard that owns the key.
3. The shard's worker thread processes the request serially as part of its normal execution loop.
4. The thread looks up the key in its shard-owned data structures.
5. If the key exists and is valid, a value is produced; otherwise, a nil response is generated.
6. The result is passed back to an I/O thread, which formats and sends the response to the client.
7. The request completes.
From the shard’s perspective, the entire operation is local and sequential. No other core participates in the execution of this request, and no shared mutable state is involved in the hot path.
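To make the hop concrete, here is a hypothetical sketch of how an I/O thread might hand a GET to the owning shard. Dragonfly's real implementation uses lock-free cross-thread queues and fibers; the mutex-guarded queue and the names below (ShardThread, Post, Get) are simplifications for illustration.

```cpp
#include <atomic>
#include <cstdio>
#include <deque>
#include <functional>
#include <future>
#include <mutex>
#include <optional>
#include <string>
#include <thread>
#include <unordered_map>

struct ShardThread {
  std::unordered_map<std::string, std::string> data;  // shard-owned
  std::deque<std::function<void()>> tasks;
  std::mutex mu;  // stands in for a lock-free queue in a real system

  void Post(std::function<void()> fn) {
    std::lock_guard<std::mutex> lk(mu);
    tasks.push_back(std::move(fn));
  }

  // Called only from the shard's own thread: drain and run tasks serially.
  void RunPending() {
    for (;;) {
      std::function<void()> fn;
      {
        std::lock_guard<std::mutex> lk(mu);
        if (tasks.empty()) return;
        fn = std::move(tasks.front());
        tasks.pop_front();
      }
      fn();  // no other thread ever touches `data`
    }
  }
};

// Runs on an I/O thread: route a GET to the owning shard and wait for it.
std::optional<std::string> Get(ShardThread& owner, const std::string& key) {
  std::promise<std::optional<std::string>> done;
  auto fut = done.get_future();
  owner.Post([&owner, &key, &done] {
    auto it = owner.data.find(key);
    if (it == owner.data.end()) done.set_value(std::nullopt);
    else done.set_value(it->second);
  });
  return fut.get();  // the response is then formatted and sent to the client
}

int main() {
  ShardThread shard;
  shard.data["greeting"] = "hello";

  std::atomic<bool> stop{false};
  std::thread worker([&] {  // the shard's dedicated worker thread
    while (!stop) shard.RunPending();
  });

  auto value = Get(shard, "greeting");
  std::printf("GET greeting -> %s\n", value ? value->c_str() : "(nil)");

  stop = true;
  worker.join();
}
```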
The inspiration for Dragonfly can be traced to two main sources:
- Redis being single-threaded on a multi-core machine
- ScyllaDB using a shared-nothing architecture
Roman Gershman experimented with three variants:
- Redis with single-threaded I/O
- Redis with multi-threaded I/O (command execution still on a single thread)
- midi-redis, which uses multi-threaded command execution
In pipeline mode (when multiple commands are sent in the same request), the single-threaded Redis variant performed well, while multi-threaded Redis (multiple I/O threads) fell short on maximum throughput because the server spent time coordinating threads. The midi-redis variant, on the other hand, scaled roughly 10x and spent less time on I/O thread coordination.
This analysis highlighted an important economic and architectural insight: vertical scaling matters. A single large machine is often more efficient than coordinating many smaller ones (read the original post for proof and analogy), provided the software can fully exploit available cores. This naturally leads to designs that scale within a node, not just across nodes.
At the same time, shard-per-core, shared-nothing architectures—successfully used in systems like ScyllaDB—demonstrated a practical way to scale on multi-core machines. Dragonfly applies these lessons directly: rather than extending Redis incrementally, it adopts explicit ownership and isolation as foundational design principles.
| | Dragonfly | Redis | Memcached |
|---|---|---|---|
| Threading Model | One thread per core; each thread runs its own event loop. Background threads are multi-threaded (?). | A single thread for command execution with a single event loop. | One listener thread and N worker threads (each running its own event loop); requests are allocated round-robin across worker threads. Background tasks use separate pools: LRU, statistics, memory allocators, etc. |
| Memory Sharing | Each core has its own memory, which is not shared with other cores. | Global memory available to the single thread. | A global hash table shared by all threads, including background threads. |
| Locking Mechanism | Does not use locks in its command execution event loop. It needs locks for cross-shard communication (e.g. multi-key commands), background persistence, shared metadata, statistics, etc. | Does not use locks in its command execution event loop. It uses locks for I/O threads, background persistence, shared metadata, statistics, etc. | Each item's hash maps to a bucket, and a lock is allocated per bucket. All locks are stored in a lock hash table whose size is proportional to the number of worker threads. Keys belonging to different buckets can be locked in parallel. |
| Lock Contention | None for command execution, since no locks are taken; non-command tasks may contend, but only rarely, since those tasks are infrequent. | None for command execution, since no locks are taken; non-command tasks may contend, but only rarely, since those tasks are infrequent. | Yes, but minimized with lock buckets; parallel requests to keys hashing to the same lock bucket contend for the lock. |
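As a rough sketch of the Memcached-style per-bucket locking described above (illustrative code, not the actual Memcached source; LockTable and Touch are made-up names): keys hash to a slot in a fixed-size lock table, so operations on keys in different slots proceed in parallel, while keys that collide on the same slot contend for one mutex.

```cpp
// Illustrative per-bucket locking: one mutex guards all keys hashing to a slot.
#include <functional>
#include <mutex>
#include <string>
#include <vector>

class LockTable {
 public:
  explicit LockTable(size_t n) : locks_(n) {}

  std::mutex& ForKey(const std::string& key) {
    return locks_[std::hash<std::string>{}(key) % locks_.size()];
  }

 private:
  std::vector<std::mutex> locks_;  // sized relative to the worker-thread count
};

LockTable item_locks(1024);

void Touch(const std::string& key) {
  // Keys in different slots lock different mutexes and do not contend.
  std::lock_guard<std::mutex> guard(item_locks.ForKey(key));
  // ... read or mutate the item in the shared hash table ...
}
```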
We will have a separate post that dives deeper into the concurrency model of each cache.
At this point, the core architectural shape of Dragonfly should be clear. Thread-per-core execution, shared-nothing ownership, and explicit request routing work together to minimize contention by design. Concurrency is not managed through synchronization, but through isolation and ownership.
This perspective also reframes how to think about locks. Rather than asking whether a system uses locks, the more meaningful question becomes where locks are necessary and why. In Dragonfly, most command execution does not require locks because concurrent mutation is structurally impossible within a shard. Where coordination is unavoidable—such as cross-shard operations or global metadata—locking is contained and explicit.
This is in contrast with systems like Redis, where a single-threaded event loop avoids locks by avoiding concurrency altogether. Both approaches achieve correctness and predictability, but they arrive there through fundamentally different tradeoffs.