Cassandra Tuning

These are some of the settings we use to store 100M metrics on our cluster.

First, you'll need Cassandra 3.11, and if you have a large number of metrics, you may want to add this patch: https://github.com/criteo-forks/cassandra/commit/1e54aae96dff438cb80cc4ce258ac1dc706236db

It's a good idea to create two separate clusters: one for metadata and one for data. This is because the metadata cluster uses indexes (SASI or Lucene) and becomes slower as you add nodes.

We use the following settings:

max_heap_size: 24G (https://issues.apache.org/jira/browse/CASSANDRA-8150)
G1 GC with biased_locking disabled, max_gc_pause_millis set to 500, max_parallel_gc_threads set to the number of threads of the machine, max_conc_gc_threads set to 5
-Dcassandra.allow_unsafe_aggressive_sstable_expiration=true
-Dio.netty.eventLoopThreads=12 to reduce the CPU used by the event loops by batching calls to epoll()
-Dcassandra.netty_flush_delay_nanoseconds=0 (same)
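
The GC settings above are given by option name rather than as JVM flags. As a rough sketch, assuming those names map onto the standard HotSpot flags shown here (the mapping and the helper name `g1_jvm_flags` are our assumptions, not something the settings list states), the flags could be generated like this:

```python
import os


def g1_jvm_flags(cpu_threads=None):
    """Return G1 flags matching the settings listed above.

    cpu_threads defaults to the number of hardware threads reported
    by the operating system (hypothetical helper, for illustration).
    """
    threads = cpu_threads or os.cpu_count() or 1
    return [
        "-XX:+UseG1GC",
        "-XX:-UseBiasedLocking",             # biased_locking disabled
        "-XX:MaxGCPauseMillis=500",          # max_gc_pause_millis
        f"-XX:ParallelGCThreads={threads}",  # max_parallel_gc_threads
        "-XX:ConcGCThreads=5",               # max_conc_gc_threads
    ]


# Print one flag per line, e.g. to paste into jvm.options.
print("\n".join(g1_jvm_flags()))
```

Check the output against the JVM version you actually run before using it.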

In cassandra.yaml:

      # Do more writes.
      concurrent_writes: 128
      native_transport_max_threads: 128
      # We have a single 'logical' disk, but more physical disks.
      memtable_flush_writers: 8
      # Make sure we don't write to disk too much. This drastically
      # reduces the number of compactions, as we don't write tens of SSTables per minute.
      memtable_cleanup_threshold: 0.2
      # We do not require compression for our workload.
      internode_compression: none
      # Restore default.
      inter_dc_tcp_nodelay: false
      # Increase coalescing window to reduce packet number.
      otc_coalescing_window_us: 10000
      # Try to reduce GC pressure.
      # http://www.datastax.com/dev/blog/off-heap-memtables-in-cassandra-2-1
      memtable_allocation_type: 'offheap_objects'
      # Try to make good use of the cache.
      file_cache_size_in_mb: 4000
      disk_optimization_strategy: spinning
      # TODO: try this
      # buffer_pool_use_heap_if_exhausted: false
      # Make hints faster
      max_hints_delivery_threads: 4
      # The speed is divided by the number of nodes internally, try to bump it
      # a little bit. We currently know that 4_000 works but 40_000 is too much.
      # This should really be made more dynamic.
      # It makes the decommission easier (because hints are stored during
      # decom). Don't put it too high, it would break the service.
      hinted_handoff_throttle_in_kb: 4000
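
To get a feel for the "divided by the number of nodes" comment on hinted_handoff_throttle_in_kb, here is a small sketch. It assumes the throttle is split across the other nodes in the cluster (cluster size minus one), which is our reading of the cassandra.yaml documentation. The cluster sizes and the function name are hypothetical.

```python
def per_thread_hint_rate_kb(throttle_kb, cluster_size):
    """Approximate hint delivery rate in KB/s per delivery thread.

    Assumes the configured throttle is divided by the number of
    other nodes in the cluster, with a floor of one node.
    """
    other_nodes = max(cluster_size - 1, 1)
    return throttle_kb / other_nodes


# 4000 is the value from the config above; cluster sizes are made up.
for nodes in (2, 11, 41):
    rate = per_thread_hint_rate_kb(4000, nodes)
    print(f"{nodes} nodes: {rate:.0f} KB/s per delivery thread")
```

Under this assumption the effective rate shrinks quickly as the cluster grows, which is why the default value can make hint replay very slow on large clusters.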

For the biggraphite keyspace, we set durable_writes = false and instead set memtable_flush_period_in_ms = 900000 on the tables. This puts us at risk of some data loss if multiple nodes disappear at the same time, but it greatly improves write performance.
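
As an illustration, the keyspace and table changes could be expressed as CQL statements along these lines. This is a sketch only: the table name `datapoints_example` and the helper `tuning_statements` are hypothetical, and the statements should be checked against your schema before you run them (for example through cqlsh). 900000 ms is 15 minutes.

```python
from typing import List


def tuning_statements(keyspace: str, tables: List[str], flush_period_ms: int) -> List[str]:
    """Build the CQL statements for the settings described above."""
    # Skip the commit log for this keyspace.
    statements = [f"ALTER KEYSPACE {keyspace} WITH durable_writes = false;"]
    # Flush memtables periodically so unflushed data stays bounded in time.
    for table in tables:
        statements.append(
            f"ALTER TABLE {keyspace}.{table} "
            f"WITH memtable_flush_period_in_ms = {flush_period_ms};"
        )
    return statements


# 'datapoints_example' is a placeholder table name.
for statement in tuning_statements("biggraphite", ["datapoints_example"], 900000):
    print(statement)
```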
