NSDb is a time series database, thus it is optimized for handling time series data, i.e. a series of multi dimensional data points indexed by time.
A time series record, called Bit
in the NSDb nomenclature, is a structure composed of:
- a
timestamp
, a long field that represents the record time instant - a
value
, a numerical value ( INT, BIGINT or DECIMAL), that represents the Bit main characteristic, the number that identifies the main bit feature at a given timestamp. dimensions
, set of key-value pairs. Keys are strings and can be considered as the dimension name; dimension's value can be a STRING, an INT, a BIGINT or a DECIMAL and tells us that the value referred to a key. Dimensions are secondary Bit characteristic and can be used to filter or aggregate results.
If we want to model the number of accesses to emergency rooms in a country (or in a state or a region) we can use a time series record (a Bit
) with the following features:
- access time as the timestamp
- the number of patients entered in the ER for triage as the
Value
- some additional information useful for filtering and aggregating as
dimensions
- the hospital name
- the patients' priority
- the hospital coordinates (geohash)
A series of time series records (bits) is called metric
.
NSDb provides two levels of containers for metrics: database
and namespace
.
A bit belongs to a namespace, which belongs to a database.
To declare the database and the namespace is not required in a sql statement, but it is necessary for the outer context (e.g. rest apis, java apis or cli)
The following properties can be configured for a metric. This configuration must be performed before inserting bits and cannot be updated.
- shard interval: the time interval that determines the shard length.
- retention: the time interval in which data are kept into the indices.
Given the time series nature of NSDb, it would be improper to talk about a metric's schema because each bit can have different dimensions. Anyway, it is important to point out that bits of the same metric must be homogeneous to each other. The following rules must be followed by a series of Bits:
- value type must be the same (e.g. DECIMAL or INT)
- dimensions can be present or not, but if present their type must be the same among bits; e.g. dimension
d
can be present inbit1
and not inbit2
but if a bit contains the same dimension with a different type (STRING vs INT for example) it will be rejected.
Given the above preconditions, it is possible to infer and retrieve a schema for a metric, which will be composed of:
- a long timestamp
- a numeric value
- the union of each bit's dimensions
NSDb supports a custom SQL like query language. For more information, the full doc could be examined.
Essentially, dimensions and timestamp can be used to compose where conditions or group by clauses and value for aggregation clauses.
Reusing the ER example explained above, in order to select accesses with WHITE priority and sort results by the time we can run this query:
select * from accesses where priority = WHITE order by timestamp desc
while for retrieving the sum of accesses grouped by priority this query can be used
select sum(value) from accesses group by priority
Perhaps the most interesting feature of NSDb is a special case of data streaming based on a publish-subscribe pattern.
A user needs to subscribe by performing the following steps:
- open a WebSocket
- send a specific json message providing a database, a namespace and a valid query
NSDb will return the query results, the Historical data
first, then, every time a record that fulfills the query provided is inserted, it will be sent to the subscriber's websocket.
If we consider NSDb as a Source and the subscriber as a Sink, this publish-subscribe mechanism can be interpreted as a simple streaming pipeline.
NSDb strongly relies on akka, the implementation of the Actor Model on the JVM.
This model gives the right abstraction to build high traffic, concurrent and distributed systems. Using it makes easy to handle concurrency, and also provides an abstraction for creating and maintaining a cluster of actors (e.g. it's not necessary to implement gossip and anti-entropy protocols).
NSDb takes full advantage of Apache lucene™, thanks to which it is possible to index bits and run high performance searches. It also provides many primitives useful to optimize indexes and related queries to fit time series requirements.
In addition to Lucene, the native file system is used for writing a commit log, used to keep track of all the writes and deletions happened in the system; it is essential in case of disaster recovery. In fact, all the log entries are written before being performed, so that it is possible to replay them in case of failure.
In order to maximize write throughput and also to guarantee a real time events delivery, It has been decided to adopt a custom consistency
strategy.
After the write operation has been logged in the commit log, it is accumulated into an internal buffer which is periodically flushed into the index, and in parallel it is checked against all the subscribed query and published to all the suitable subscribers.
This means, that from the subscriber point of view, the effects of an insert operation will be observable in real time, while for ordinary users, NSDb behaves like a eventually consistent
system, i.e. records are stored only after a configurable interval.
One of the most relevant properties of a time series database is that data continuously comes, will be always inserted and never updated, and given the nature of the data itself, that is the fact that it's naturally ordered by time, leads us to adopt a sharding strategy called Natural Time Sharding.
Users can configure a sharding interval
sharding {
interval = 15s //15m 15d 1y are valid values
}
according to which, shards will be created.