Cassandra Gotchas: Managing Distributed Scale
Cassandra is built for extreme availability, but its append-only, log-structured (LSM-tree) storage model introduces specific behaviors that can catch developers off guard. Here are the most common Cassandra pitfalls.
1. The Tombstone Trap
In Cassandra, deleting data doesn't actually remove it from disk immediately. Instead, it writes a tombstone marker that shadows the old data until compaction eventually purges both.
- The Pitfall: Frequent deletes or updates to the same row. When you read, Cassandra must scan through all these tombstones to find the "live" data. With thousands of tombstones per read, latency explodes, and once the tombstone failure threshold is crossed the read is aborted with a `TombstoneOverwhelmingException`.
- The Solution: Avoid frequent deletes. If you must delete, keep the volume low, or tune `gc_grace_seconds` and your compaction strategy. Use TTLs instead of manual deletes whenever possible.
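As a minimal sketch of the TTL approach: a row written with `USING TTL` simply expires, so no application-issued `DELETE` (and no explicit tombstone write) is ever needed. The table and column names here are hypothetical.

```python
# Sketch: prefer TTL-expiring writes over explicit DELETEs.
# Table/column names (session_by_user, user_id, token) are hypothetical.

def insert_with_ttl(table: str, columns: list[str], ttl_seconds: int) -> str:
    """Build a CQL INSERT whose row expires on its own instead of
    requiring a later DELETE (which would write a tombstone)."""
    cols = ", ".join(columns)
    placeholders = ", ".join("?" for _ in columns)
    return (f"INSERT INTO {table} ({cols}) "
            f"VALUES ({placeholders}) USING TTL {ttl_seconds}")

# A session row that disappears after 24 hours:
stmt = insert_with_ttl("session_by_user", ["user_id", "token"], 86400)
print(stmt)
```

Note that expired TTL cells still become tombstones internally, but they age out on a predictable schedule rather than piling up from ad-hoc delete traffic.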
2. Huge Partitions
Cassandra distributes data across the cluster based on the Partition Key.
- The Pitfall: Storing too much data under a single partition key (e.g., all events for a single customer over 10 years). Partitions larger than 100MB-200MB lead to memory pressure during compaction and long GC pauses.
- The Solution: Use bucketing. Instead of `partition_key = customer_id`, use `partition_key = (customer_id, month)` to ensure partitions stay small and manageable.
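The bucketing idea can be sketched in a few lines: derive a composite partition key from the customer id plus a time bucket, so one customer's events are spread over many bounded partitions. The month granularity is an assumption; pick a bucket size that keeps each partition well under the 100MB danger zone.

```python
from datetime import datetime

def bucketed_key(customer_id: str, event_time: datetime) -> tuple[str, str]:
    """Return (customer_id, 'YYYY-MM'): the month bucket caps how much
    data any single partition can accumulate."""
    return (customer_id, event_time.strftime("%Y-%m"))

# Events from different months land in different partitions:
print(bucketed_key("cust-42", datetime(2024, 7, 15)))   # ('cust-42', '2024-07')
print(bucketed_key("cust-42", datetime(2024, 8, 2)))    # ('cust-42', '2024-08')
```

The trade-off is that reads spanning multiple buckets must query several partitions, so size the bucket to match your most common query window.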
3. The Secondary Index Scam
Cassandra allows you to create secondary indexes on non-partition columns.
- The Pitfall: Using secondary indexes on high-cardinality data. Unlike relational databases, a secondary index in Cassandra requires the coordinator to contact every node in the cluster to satisfy the query, destroying performance.
- The Solution: Don't use secondary indexes for high-cardinality data. Instead, create a Materialized View (bearing in mind that these are flagged experimental in recent Cassandra releases) or manually maintain a mapping table to support your secondary query patterns.
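A hand-maintained mapping table boils down to dual writes: every write to the main table is paired with a write to a lookup table keyed by the secondary attribute, so the secondary query hits exactly one partition. This is a sketch with hypothetical table names (`users_by_id`, `users_by_email`); in production you would issue the pair atomically, e.g. in a logged batch.

```python
# Sketch: maintain a lookup ("mapping") table by hand instead of a
# high-cardinality secondary index. Table names are hypothetical.

def dual_write_statements(user_id: str, email: str) -> list[str]:
    """Write the same fact twice: once keyed by user_id (main table),
    once keyed by email (lookup table for 'find user by email')."""
    return [
        f"INSERT INTO users_by_id (user_id, email) VALUES ('{user_id}', '{email}')",
        f"INSERT INTO users_by_email (email, user_id) VALUES ('{email}', '{user_id}')",
    ]

for stmt in dual_write_statements("u1", "a@example.com"):
    print(stmt)
```

Reading by email now means one single-partition lookup in `users_by_email`, instead of the coordinator fanning out to every node.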
4. `SELECT *` and Large Rows
- The Pitfall: Selecting all columns when you only need one or two. In Cassandra, data for a row might be spread across multiple SSTables. Fetching everything requires more I/O.
- The Solution: Always specify the columns you need. Be aware of your row size; if a single row has hundreds of large columns, it can bottleneck your throughput.
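As a small sketch of the habit being recommended: build queries that project only the columns you need, rather than defaulting to `*`. The table and column names below are hypothetical.

```python
# Sketch: always project explicit columns; "*" forces Cassandra to
# materialize every column, possibly from several SSTables.

def select_columns(table: str, columns: list[str], key_col: str) -> str:
    """Build a single-partition SELECT over an explicit column list."""
    cols = ", ".join(columns) if columns else "*"
    return f"SELECT {cols} FROM {table} WHERE {key_col} = ?"

# Fetch just the two columns the caller needs:
narrow = select_columns("events_by_customer", ["event_time", "status"], "customer_id")
print(narrow)
```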
5. Over-reliance on `ALLOW FILTERING`
- The Pitfall: Using `ALLOW FILTERING` to bypass query restrictions. This forces Cassandra to scan data across nodes and filter it at the coordinator level, which is highly inefficient.
- The Solution: Design your schema for your queries. If you need to filter by a column, it should probably be part of your clustering key.
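To make "design your schema for your queries" concrete, here is a sketch of a query-first table (names are hypothetical): the column you filter on, `order_date`, is a clustering column, so range queries within a partition need no `ALLOW FILTERING` at all.

```python
# Sketch: put the filtered column in the clustering key so the query
# is a single-partition slice. Table/column names are hypothetical.

CREATE_TABLE = """
CREATE TABLE orders_by_customer (
    customer_id text,
    order_date  date,
    order_id    uuid,
    total       decimal,
    PRIMARY KEY ((customer_id), order_date, order_id)
)"""

# Filtering on the clustering column is an ordered slice of one
# partition -- no cluster-wide scan, no ALLOW FILTERING:
QUERY = ("SELECT order_id, total FROM orders_by_customer "
         "WHERE customer_id = ? AND order_date >= ?")

print(QUERY)
```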
Summary
Cassandra performance is all about partition health and query-first schema design. By avoiding tombstone buildup and keeping your partitions small, you can maintain the low single-digit-millisecond read latencies that Cassandra is known for.
