Incident with our background job processor

Incident Report for LeadSimple

Postmortem

We use caching broadly throughout our application. Caching lets us temporarily store data that would otherwise need to be retrieved over and over again. For example, we cache whether an account is active, which avoids checking with Stripe - our payment provider - every time we need that information. Among other things, this temporary storage helps keep the application quick and responsive.
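The caching described above can be sketched roughly as follows. This is a minimal illustration and not our production code: `TTLCache`, `FakePaymentProvider`, `account_is_active`, the key format, and the expiry time are all hypothetical names and values chosen for the example, and the cache here is an in-memory dictionary standing in for a real caching server.

```python
import time
from typing import Callable, Dict, Tuple


class TTLCache:
    """Tiny in-memory cache with per-entry expiry (stand-in for a cache server)."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        # key -> (expiry time, cached value)
        self._entries: Dict[str, Tuple[float, bool]] = {}

    def fetch(self, key: str, loader: Callable[[], bool]) -> bool:
        """Return the cached value; call loader() only on a miss or after expiry."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]
        value = loader()
        self._entries[key] = (now + self._ttl, value)
        return value


class FakePaymentProvider:
    """Hypothetical stand-in for a slow remote lookup; counts how often it is hit."""

    def __init__(self) -> None:
        self.calls = 0

    def is_active(self, account_id: str) -> bool:
        self.calls += 1
        return True


def account_is_active(cache: TTLCache, provider: FakePaymentProvider,
                      account_id: str) -> bool:
    # Only reach out to the provider when the cache has no fresh answer.
    return cache.fetch(f"account_active:{account_id}",
                       lambda: provider.is_active(account_id))
```

With this shape, repeated checks for the same account within the expiry window are answered from the cache, and the provider is contacted again only once the entry has expired.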

Caching requires a server for temporary data storage, and we currently use the same server for both background job processing and UI/UX speed-ups. This shared server is the root cause of the errors in the web application.

Our caching server went down because of a script we ran to fix a problem with our background jobs. Because the web app depends on that same server, and the server was entirely unavailable, the web app experienced outages as well.

After 12 minutes, we were able to unblock our job processor infrastructure, and the app came back online.

Our Engineering team will be working on separating the caching that’s used for background jobs from the caching that’s used for UI/UX.
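To illustrate why this separation matters, here is a small sketch of the difference between a shared store and separate stores. It is an assumption-laden toy model, not a description of our infrastructure: `InMemoryStore`, `App`, `StoreUnavailable`, and the `available` flag are hypothetical and exist only to simulate an outage in memory.

```python
from typing import Dict, List, Optional


class StoreUnavailable(Exception):
    """Raised when a store cannot be reached."""


class InMemoryStore:
    """Hypothetical stand-in for a networked key-value server."""

    def __init__(self) -> None:
        self.available = True  # flip to False to simulate an outage
        self._values: Dict[str, str] = {}
        self._queue: List[str] = []

    def _require_available(self) -> None:
        if not self.available:
            raise StoreUnavailable("store is down")

    def get(self, key: str) -> Optional[str]:
        self._require_available()
        return self._values.get(key)

    def set(self, key: str, value: str) -> None:
        self._require_available()
        self._values[key] = value

    def push(self, job: str) -> None:
        self._require_available()
        self._queue.append(job)


class App:
    def __init__(self, cache_store: InMemoryStore,
                 job_store: InMemoryStore) -> None:
        self.cache_store = cache_store
        self.job_store = job_store

    def load_page(self, account_id: str) -> str:
        # Web requests touch only the cache store.
        key = f"page:{account_id}"
        cached = self.cache_store.get(key)
        if cached is None:
            cached = f"dashboard for {account_id}"
            self.cache_store.set(key, cached)
        return cached

    def enqueue(self, job: str) -> None:
        # Background work touches only the job store.
        self.job_store.push(job)


# Shared setup: one outage affects both page loads and jobs.
shared = InMemoryStore()
shared_app = App(cache_store=shared, job_store=shared)

# Separated setup: a job-store outage leaves page loads working.
split_app = App(cache_store=InMemoryStore(), job_store=InMemoryStore())
```

In the shared setup, marking the single store unavailable makes `load_page` raise `StoreUnavailable`. In the separated setup, marking only `split_app.job_store` unavailable makes `enqueue` fail while `load_page` keeps working.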

Posted Jan 29, 2024 - 12:00 PST

Resolved

During an investigation of an unrelated bug, we caused our background job processor to block, which prevented us from running any background jobs. This also affected our main website because we rely heavily on caching, which uses the same underlying infrastructure as our background job processor.

Posted Jan 29, 2024 - 10:16 PST

This incident affected: App, Outbound SMS, Outbound Phone Calls, Notifications, Lead Imports, Enterprise Dashboard, REST API, Integrations, and Zapier.