When Kafka's disk gets full, the broker can get stuck and we end up dropping all incoming events. To mitigate this, we can tune Kafka's log retention settings to free up space. There are two configs we can set; both act as lower bounds, i.e. data is not deleted before it exceeds the configured time or size:
- time: `log.retention.ms` / `log.retention.hours` at the broker level, `retention.ms` per topic (Kafka docs)
- bytes: `log.retention.bytes` at the broker level, `retention.bytes` per topic (Kafka docs)
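A minimal sketch of what the broker-level settings look like in `server.properties` (values shown are the Kafka defaults, for illustration only, not what we run today):

```properties
# Time-based retention: segments whose newest record is older than this are
# eligible for deletion. log.retention.ms takes precedence over
# log.retention.minutes and log.retention.hours if set.
log.retention.hours=168

# Size-based retention: once a partition's log exceeds this many bytes, the
# oldest segments become eligible for deletion. -1 means no size limit.
log.retention.bytes=-1
```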
Note that the retention check loop runs every 5 minutes by default (`log.retention.check.interval.ms`); we could make it more frequent, but probably don't need to.
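If we ever do want retention to kick in faster, this is the knob, shown here at its default:

```properties
# How often the broker scans for log segments eligible for deletion.
# Default is 300000 ms (5 minutes).
log.retention.check.interval.ms=300000
```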
We want to minimize the probability of the Kafka disk filling up and events being lost, while also maximizing disk usage so retention is as long as possible and we have data to recover from in case something breaks in ingestion. We therefore suggest setting the time limit relatively low (2h or 24h) and the size limit to roughly 90% of the volume size.
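A hypothetical sizing example, assuming a 500 GiB data volume and ~12 partition replicas hosted on the broker (both numbers are made up for illustration). Keep in mind that `log.retention.bytes` is enforced per partition, so the per-partition cap is the 90% target divided by the partition count:

```properties
# 24h time-based retention.
log.retention.ms=86400000

# ~90% of a 500 GiB volume spread over ~12 partition replicas:
# 500 GiB * 0.9 / 12 ≈ 37.5 GiB per partition.
log.retention.bytes=40265318400
```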