
Cassandra Gotchas: Dealing with Tombstones and Wide Partitions

Avoid common Cassandra performance killers like deletion tombstones, huge partitions, and secondary index misuse.

Sachin Sarawgi·April 20, 2026·3 min read
#cassandra #databases #performance #distributed-systems


Cassandra is built for extreme availability and write throughput, but its log-structured (LSM-tree) storage model, in which writes are appended to immutable SSTables rather than updating data in place, introduces behaviors that can catch developers off guard. Here are the most common pitfalls.

1. The Tombstone Trap

In Cassandra, deleting data doesn't remove it from disk immediately. Instead, Cassandra writes a tombstone marker that shadows the old data until compaction purges both, once gc_grace_seconds has elapsed.

  • The Pitfall: Frequent deletes or overwrites of the same rows. On a read, Cassandra must scan past every tombstone sitting between it and the live data. With thousands of tombstones per query, read latency explodes, Cassandra logs tombstone warnings, and the read can fail outright with a TombstoneOverwhelmingException once the failure threshold is crossed.
  • The Solution: Design your data model so you rarely delete. If you must, keep the volume low and tune gc_grace_seconds and your compaction strategy so tombstones are purged promptly. Prefer TTLs over manual deletes; for time-series data, pair TTLs with TimeWindowCompactionStrategy so entire expired SSTables can be dropped at once.
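To build intuition, here is a purely illustrative Python sketch, not Cassandra's real storage engine, of why a read over a tombstone-heavy row does extra work, and how a TTL lets cells expire without an explicit delete. Cell and read_live are made-up names for the simulation.

```python
# Purely illustrative sketch of LSM-style reads (not Cassandra's real engine):
# a read must skip every tombstoned or TTL-expired cell to reach live data.

class Cell:
    def __init__(self, value, tombstone=False, ttl_expiry=None):
        self.value = value
        self.tombstone = tombstone    # explicit DELETE marker
        self.ttl_expiry = ttl_expiry  # TTL'd cells expire with no manual delete

    def is_live(self, now):
        if self.tombstone:
            return False
        if self.ttl_expiry is not None and now >= self.ttl_expiry:
            return False
        return True

def read_live(cells, now):
    """Return live values plus how many dead cells the scan had to skip."""
    live, skipped = [], 0
    for cell in cells:
        if cell.is_live(now):
            live.append(cell.value)
        else:
            skipped += 1
    return live, skipped

# A row deleted and rewritten many times: every read pays for the tombstones.
cells = [Cell(i, tombstone=(i % 2 == 0)) for i in range(10)]
values, skipped = read_live(cells, now=100)
print(values, skipped)  # only odd values are live; 5 tombstones were scanned
```

The point of the toy model: the dead cells cost scan time on every single read until compaction removes them, which is why delete-heavy workloads degrade gradually rather than failing fast.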

2. Huge Partitions

Cassandra distributes data across the cluster based on the Partition Key.

  • The Pitfall: Storing too much data under a single partition key (e.g., all events for a single customer over ten years). Partitions beyond roughly 100 MB-200 MB cause memory pressure during compaction, long GC pauses, and hotspots on the replicas that own them.
  • The Solution: Bucket your partitions. Instead of partition_key = customer_id, use partition_key = (customer_id, month) so each partition stays small and bounded.
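A bucketed key can be derived in application code. The sketch below assumes a monthly bucket; bucketed_partition_key is a hypothetical helper, and the commented CQL shows the table shape it implies.

```python
from datetime import datetime, timezone

# Hypothetical helper: derive a (customer_id, month) composite partition key
# so no single customer's partition grows without bound. The matching table
# might look like:
#   CREATE TABLE events (
#       customer_id text, month text, event_time timestamp, payload text,
#       PRIMARY KEY ((customer_id, month), event_time));
def bucketed_partition_key(customer_id: str, event_time: datetime) -> tuple:
    return (customer_id, event_time.strftime("%Y-%m"))

key = bucketed_partition_key("cust-42", datetime(2026, 4, 20, tzinfo=timezone.utc))
print(key)  # ('cust-42', '2026-04')
```

Pick the bucket size from your write rate: the goal is that even your busiest customer's hottest bucket stays well under the 100 MB guideline.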

3. The Secondary Index Scam

Cassandra lets you create secondary indexes on columns outside the primary key.

  • The Pitfall: Relying on secondary indexes, especially on high-cardinality columns, for queries that don't also restrict the partition key. The index is stored locally on each node, so the coordinator must fan the query out to every node in the cluster, and latency grows with cluster size.
  • The Solution: Don't build your main query paths on secondary indexes. Create a Materialized View (still flagged experimental in recent Cassandra releases) or maintain your own denormalized lookup table for each secondary query pattern.
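The fan-out cost can be sketched with a toy hash ring. This uses CRC32 instead of Cassandra's Murmur3 partitioner, and the node names are invented; only the contrast in how many nodes each query type touches is the point.

```python
import zlib

NODES = ["node-a", "node-b", "node-c", "node-d"]

def node_for_partition(partition_key: str) -> str:
    # A partition-key query is routed straight to the token owner.
    return NODES[zlib.crc32(partition_key.encode()) % len(NODES)]

def nodes_for_index_query() -> list:
    # A secondary-index query can't be routed by token: matching rows
    # could live anywhere, so the coordinator must ask every node.
    return list(NODES)

print(node_for_partition("cust-42"))  # exactly one node contacted
print(nodes_for_index_query())        # all four nodes contacted
```

One lookup touches a constant number of replicas; the index query touches the whole cluster, so adding nodes makes it slower, not faster.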

4. "SELECT *" and Large Rows

  • The Pitfall: Selecting all columns when you only need one or two. A row's columns may be spread across multiple SSTables, so fetching everything means merging more data from more files and doing more disk I/O.
  • The Solution: Always specify the columns you need. Watch your row size, too: a single row with hundreds of large columns can bottleneck your throughput.
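A back-of-the-envelope illustration (sizes and column names are invented) of why projecting only the columns you need moves far fewer bytes:

```python
# Toy row with two small columns and two large ones.
row = {"id": "u1", "name": "Ada", "bio": "x" * 10_000, "avatar": b"\x00" * 50_000}

def bytes_fetched(row, columns=None):
    # columns=None mimics SELECT *; a column list mimics an explicit projection.
    selected = row if columns is None else {c: row[c] for c in columns}
    return sum(len(v) for v in selected.values())

print(bytes_fetched(row))                  # SELECT *: every byte of every column
print(bytes_fetched(row, ["id", "name"]))  # explicit projection: a few bytes
```

The wide blob columns dominate the transfer even when the caller only wanted the id and name.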

5. Over-reliance on ALLOW FILTERING

  • The Pitfall: Using ALLOW FILTERING to bypass query restrictions. It forces Cassandra to scan whole partitions (or the whole table) across nodes and discard non-matching rows at the coordinator, which is highly inefficient and gets worse as the data grows.
  • The Solution: Design your tables around your queries. If you routinely filter on a column, it belongs in the primary key, as a partition or clustering column, or in a dedicated query table.
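Query-first design often means writing the same data to two tables, one per access path. Here is a minimal in-memory sketch; the table and column names are hypothetical, and in real Cassandra you would keep the two tables in step with a logged batch or a materialized view.

```python
users_by_id = {}     # models a table with PRIMARY KEY (user_id)
users_by_email = {}  # models a table with PRIMARY KEY (email)

def insert_user(user_id, email, name):
    row = {"user_id": user_id, "email": email, "name": name}
    # Application-side dual write: one insert per query table.
    users_by_id[user_id] = row
    users_by_email[email] = row

insert_user("u1", "ada@example.com", "Ada")

# Both access paths are now direct partition-key lookups, no ALLOW FILTERING:
print(users_by_id["u1"]["name"])                     # Ada
print(users_by_email["ada@example.com"]["user_id"])  # u1
```

Disk is cheap in Cassandra's design philosophy; duplicating the row buys you a constant-cost read on every access path.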

Summary

Cassandra performance comes down to partition health and query-first schema design. Keep tombstones rare, keep partitions small, and model one table per query pattern, and you'll preserve the low, predictable latencies Cassandra is known for.


Recommended Resources

Designing Data-Intensive Applications (Best Seller)

The definitive guide to building scalable, reliable distributed systems by Martin Kleppmann.

Kafka: The Definitive Guide (Editor's Pick)

Real-time data and stream processing by Confluent engineers.

Apache Kafka Series on Udemy

Hands-on Kafka course covering producers, consumers, Kafka Streams, and Connect.



Written by

Sachin Sarawgi

Engineering Manager and backend engineer with 10+ years building distributed systems across fintech, enterprise SaaS, and startups. CodeSprintPro is where I write practical guides on system design, Java, Kafka, databases, AI infrastructure, and production reliability.
