We just made a change to our caching service to speed up the site. We use a caching database called Redis to handle our caching needs (for example, if you go to your inventory we keep a saved copy of it in our cache DB, so that when you come back it doesn't have to do a full lookup again).
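The lookup pattern described above can be sketched in a few lines. This is a toy cache-aside example using a plain dict in place of Redis; `fetch_inventory_from_db` is a hypothetical stand-in for the real (slow) lookup, not our actual code.

```python
# Cache-aside sketch: check the cache first, fall back to the "database",
# and save the result so the next visit skips the full lookup.
cache = {}  # stands in for Redis

def fetch_inventory_from_db(user_id):
    # Pretend this is an expensive database query.
    return {"user": user_id, "items": ["hat", "scarf"]}

def get_inventory(user_id):
    key = f"inventory:{user_id}"
    if key in cache:                              # cache hit: fast path
        return cache[key]
    value = fetch_inventory_from_db(user_id)      # cache miss: do the work
    cache[key] = value                            # save it for next time
    return value
```

The second call to `get_inventory` for the same user returns straight from the dict without touching the "database".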
By default, Redis will not remove items from the cache until they expire, which is a great default for either a very small site or a huge site that can afford to keep adding Redis nodes. There is some complicated math here, but each node you add contributes roughly 20% more capacity (the nodes know about each other, and the cache is spread across them), so that isn't a great solution for us.
Today we set the eviction policy to allkeys-lru, which means that when Redis hits its memory limit it will start removing the least-recently-used cache keys, even ones that haven't expired yet (for example, we cache your wardrobe for 3 days), to make space.
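Here's a toy sketch of the LRU idea behind allkeys-lru: whenever the cache is full, the key that was touched longest ago gets dropped first. This is just an illustration with an `OrderedDict`, not how Redis actually implements it (Redis uses an approximate LRU based on sampling).

```python
from collections import OrderedDict

class LRUCache:
    """Toy model of allkeys-lru: when full, evict the least-recently-used key."""

    def __init__(self, max_items):
        self.max_items = max_items
        self.data = OrderedDict()  # insertion/access order tracks recency

    def set(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)       # mark as recently used
        self.data[key] = value
        if len(self.data) > self.max_items:
            self.data.popitem(last=False)    # evict the oldest entry

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)           # reading also counts as "use"
        return self.data[key]

cache = LRUCache(max_items=2)
cache.set("a", 1)
cache.set("b", 2)
cache.get("a")     # touch "a" so it is recently used
cache.set("c", 3)  # cache is full: "b" (least recently used) gets evicted
```

After that last `set`, looking up `"b"` returns `None` while `"a"` and `"c"` are still cached.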
Why did this just become an issue?
When we made our final server move a few months ago, we made choices preparing for the future. Our Redis instance hit its maximum (many, many GB) last night while still attempting to add to the cache. It took us months to hit that limit, and by evicting older keys we've freed up about 50% more headroom.
More importantly, we've been investing in our monitoring and reporting, so we now have monitoring set up for this particular failure mode, and I will get alerts via email and Slack if we start to approach this boundary again.
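The check behind that alert can be as simple as comparing used memory against the configured limit. A minimal sketch, assuming an 80% alert threshold (our actual threshold isn't stated here) and byte counts like the `used_memory` and `maxmemory` fields a Redis client reports from the INFO command:

```python
ALERT_THRESHOLD = 0.80  # assumed: warn when usage reaches 80% of maxmemory

def should_alert(used_memory, maxmemory, threshold=ALERT_THRESHOLD):
    """Return True if cache usage is approaching the configured limit.

    used_memory and maxmemory are byte counts, e.g. from the
    used_memory and maxmemory fields of Redis's INFO output.
    """
    if maxmemory == 0:  # in Redis, maxmemory=0 means "no limit"
        return False
    return used_memory / maxmemory >= threshold

# e.g. 9 GB used of a 10 GB limit is past 80%, so it's time to alert
should_alert(9 * 1024**3, 10 * 1024**3)
```

A real deployment would run this on a schedule and fan the result out to email and Slack; that wiring is omitted here.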
TLDR: Our cache filled up and didn't know how to get itself out of that situation, so it took manual intervention. We have changed the eviction policy and have monitoring for this case going forward.
💖 ✨ 🤗