[Early October, 2024]
A casual day of my life. Dropping kids to their schools, walking, jogging, taking shower having breakfast and going to office. Sudden break in the routine.
Cring cring (phone is ringing)
Me: Hello!
My Team Member: Bhaiaaaa. Application is getting crashed.
Me: Which application?
My Team Member:(Customer name) app.
Me: Call the app team.
My Team Member: Called already. It's Database Server... CPU is 100%
Me: (Missed a heartbeat) Ummh what happened?
My Team Member: Better you login to the server...
...
My Team Member: Hello! Can you hear me?
Me: (Does there any change in app or DB end? Does the subscriber base increase recently? Just back to the reality after poking). Well, I am checking!
After hours of investigations, I pinned point the culprit query which was sucking up the CPU.
[Mid August, 2024]
Phone call between me and my customer.
My Customer: Bhai, we are integrating your API with our new platforms.
Me: Ok. No issue. By the way, what would be the TPS?
My Customer: Don't worry bhai, your platform can cater it.
Me: Ok then proceed.
[Early October, 2024]
Phone call between me and my customer.
Me: What did you do last night after 2:00 AM??? Why there is hundreds of API calling per second???
My Customer: What!? Nothing! Let me check.
Me: CONTROL the API calling immediately. We are going to implement the threshold at our end.
My Customer: Ok, I am doing it for temporarily. Other team perhaps did the activity last night. This is a CXO level project. You need to serve full traffic. I know you can deliver this load. bla bla (Typical local customer voice!)
Me: Implement threshold at the API calling immediately. Allow some time, I will get back with solutions.
My Customer: Okay bhai I am doing it.
------------
I called an emergency meeting with the Developers, DevOps and DBA team. I asked the team to implement API calling threshold at web server. The Database back to the normal state. Now, I asked my team to check common query performance check list.
- Index - Ok
- Index Fragmentation - Within the threshold value
- Execution plan - Ok
- Query Store - Does the culprit query took any new execution plan? No
- Resource bottleneck - CPU
- Opportunity to optimize at App level - Yes
My full team convinced that rather than increasing computing power, if data is kept in memory cache and served then we can immediately resolve the issue. Team selected Redis as caching server. We implemented the new solutions in 2-3 days. Serving API data from cache rather than database. Team tested and informed customer to resume full traffic.
Voila! Redis resolved my pain. There was no additional load on the DB. System was stable same as before. Everyone became happy.
My life returned to its casual state again!