Incident summary
Date: Tuesday, [Date]
Duration: 4 hours 17 minutes (03:14 – 07:31)
Impact: Authentication service unavailable. All users unable to log in. Approximately 2,400 failed login attempts during the outage window. Three enterprise customers contacted support before internal detection.
Root cause: TLS certificate on the authentication API endpoint expired at 03:14. The certificate was not in the monitored inventory. Renewal alerts were sent to a distribution list that had been disabled six months prior during a team restructure.
Severity: P1
Timeline
18 months prior: Authentication service migrated from monolith to separate microservice. New subdomain created (auth-api.yourdomain.com). Certificate obtained manually through CA portal. Added to a monitoring spreadsheet maintained by the engineer who led the migration.
12 months prior: Engineer who led the migration transfers to a different team. Spreadsheet is not handed over. Certificate monitoring responsibility is effectively unassigned.
45 days prior: CA sends renewal reminder to the email address specified at certificate issuance — a team distribution list that was decommissioned during a reorganisation. The email bounces. No one receives it.
14 days prior: CA sends second reminder. Same result.
03:14: Certificate expires. Authentication service begins returning TLS errors to all clients.
03:14 – 05:40: No internal alerts fire. Synthetic monitoring covers the main website and three primary API endpoints. The auth-api subdomain is not in the synthetic monitoring configuration.
05:40: First support ticket created by an enterprise customer reporting "login not working." Ticket assigned to support queue with standard priority.
05:58: Second enterprise customer emails their account manager directly. Account manager escalates internally.
06:03: On-call engineer paged. Begins investigation.
06:11: Root cause identified — TLS certificate expiry confirmed using openssl s_client. The certificate's notAfter timestamp was in the past.
06:11 – 07:20: Emergency certificate renewal process. CA portal login credentials not in password manager — stored locally on the laptop of the engineer who originally set up the service. Emergency contact to retrieve credentials. Manual domain validation. Certificate issuance. Certificate installation and service restart.
07:31: Authentication service confirmed operational. All-clear communicated to affected customers.
Total duration: 4 hours 17 minutes.
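For reference, the 06:11 openssl s_client check has a scriptable equivalent. The sketch below is ours, not part of the incident tooling — the hostname is the template placeholder and the function names are illustrative. It fetches a live certificate's notAfter over TLS and computes days to expiry using only the Python standard library:

```python
import socket
import ssl
from datetime import datetime, timezone

def cert_not_after(host, port=443):
    """Fetch the peer certificate over TLS and return its notAfter as UTC.

    Note: if the certificate has already expired, the handshake itself
    fails certificate verification -- which is itself the answer.
    """
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    # getpeercert() renders dates like 'Jun  1 12:00:00 2025 GMT'
    parsed = datetime.strptime(cert["notAfter"], "%b %d %H:%M:%S %Y %Z")
    return parsed.replace(tzinfo=timezone.utc)

def days_remaining(not_after, now=None):
    """Whole days between now and expiry; negative means already expired."""
    now = now or datetime.now(timezone.utc)
    return (not_after - now).days
```

Run against auth-api.yourdomain.com at 03:15, cert_not_after would not have returned a date at all — it would have raised a verification error, because for an expired certificate the handshake failure is the diagnostic.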
Contributing factors
1. Certificate not in managed inventory. The auth-api certificate was in a personal spreadsheet, not the team's certificate management system. When the responsible engineer moved teams, the certificate became effectively untracked.
2. Renewal alerts routed to a defunct email address. The certificate was issued using a team email address that was decommissioned during a reorganisation. CA renewal reminders had nowhere to go.
3. Monitoring gap on the affected subdomain. The synthetic monitoring configuration was created before the authentication microservice was split out. The new subdomain was never added to the monitoring scope.
4. Credential management failure. The CA portal credentials were stored locally rather than in the shared credential store. This extended the incident by over an hour while credentials were retrieved.
5. No certificate inventory ownership process. There was no process requiring that certificate ownership be transferred when engineers moved teams or left the organisation. Ownership became implicit and then absent.
What was not a contributing factor
It is worth noting what did not cause this incident:
- The certificate itself was valid and properly configured when it was first issued.
- The renewal process, once initiated, worked correctly.
- The team's response once the incident was detected was competent and reasonably fast.
The failure was entirely in the processes around certificate lifecycle management — not in the certificate technology itself. This is the pattern in almost all certificate expiry incidents: the certificate does what it is supposed to do, and the failure is in the human and process systems meant to manage it.
Action items
Immediate (within 48 hours):
- Audit all certificates across all subdomains using CT log enumeration. Produce a complete list of every certificate associated with the organisation's domains.
- Add all discovered certificates to the central certificate management system with assigned owners.
- Add auth-api and all other missing subdomains to the synthetic monitoring configuration.
- Move CA portal credentials to the shared credential store.
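The CT enumeration in the first item can be bootstrapped from crt.sh, which at the time of writing exposes a JSON endpoint (https://crt.sh/?q=%25.yourdomain.com&output=json). A sketch of the deduplication step — the field names follow crt.sh's JSON output, but the sample records below are illustrative, not real data:

```python
import json

def hostnames_from_ct(records):
    """Collapse crt.sh JSON records into a sorted list of unique hostnames.

    Each record's name_value field holds newline-separated SANs; wildcard
    entries are kept as-is so they can be reviewed by hand.
    """
    names = set()
    for rec in records:
        for name in rec.get("name_value", "").splitlines():
            name = name.strip().lower()
            if name:
                names.add(name)
    return sorted(names)

# Illustrative records in crt.sh's JSON shape (real responses carry more fields).
sample = json.loads("""[
  {"common_name": "auth-api.yourdomain.com",
   "name_value": "auth-api.yourdomain.com",
   "not_after": "2025-06-01T00:00:00"},
  {"common_name": "www.yourdomain.com",
   "name_value": "www.yourdomain.com\\nyourdomain.com",
   "not_after": "2025-09-15T00:00:00"}
]""")

print(hostnames_from_ct(sample))
# ['auth-api.yourdomain.com', 'www.yourdomain.com', 'yourdomain.com']
```

The point of the audit is exactly the output above: subdomains like auth-api that someone issued a certificate for but that no inventory tracks.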
Short-term (within 2 weeks):
- Implement certificate monitoring with alerts at 90, 60, 30, and 14 days before expiry. Route alerts to functional team addresses, not individual or transient distribution lists.
- Define and document the process for certificate ownership transfer when engineers change teams.
- Review all certificates for renewal reminder contact addresses. Update any that point to individual addresses or defunct distribution lists.
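The 90/60/30/14-day ladder from the first short-term item reduces to a small pure function. A sketch (the function name and the already_sent bookkeeping are our additions):

```python
THRESHOLDS = (90, 60, 30, 14)  # days before expiry, per the action item

def due_alerts(days_left, already_sent=frozenset(), thresholds=THRESHOLDS):
    """Return the thresholds that should fire now.

    A threshold fires once days_left has dropped to or below it; the
    already_sent set keeps a restarted checker from re-paging on
    thresholds it has handled before.
    """
    return [t for t in thresholds if days_left <= t and t not in already_sent]

# A certificate 45 days from expiry has crossed the 90- and 60-day marks.
print(due_alerts(45))            # [90, 60]
print(due_alerts(45, {90, 60}))  # [] -- nothing new to send
print(due_alerts(10))            # [90, 60, 30, 14]
```

Whatever delivers these alerts should target the functional team addresses the action item calls for — the routing, not the arithmetic, is what failed in this incident.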
Medium-term (within 6 weeks):
- Evaluate ACME automation for all certificates where it is technically feasible. Target: eliminate manual renewal for all production certificates.
- Implement continuous CT monitoring so that new certificates issued for the organisation's domains are detected automatically and added to the managed inventory.
- Run a tabletop exercise simulating certificate expiry on a different critical service to validate that the response process works before it is needed in a real incident.
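The continuous CT monitoring item is, at its core, a recurring set difference between what CT logs show and what the inventory tracks. A minimal sketch, with hypothetical host lists standing in for the real feeds:

```python
def unmanaged_hosts(ct_hostnames, inventory_hostnames):
    """Hostnames seen in CT logs but absent from the managed inventory.

    Anything this returns is a certificate nobody is watching -- exactly
    the gap that let the auth-api certificate expire unnoticed.
    """
    return sorted(set(ct_hostnames) - set(inventory_hostnames))

ct_seen = ["www.yourdomain.com", "api.yourdomain.com", "auth-api.yourdomain.com"]
managed = ["www.yourdomain.com", "api.yourdomain.com"]

print(unmanaged_hosts(ct_seen, managed))  # ['auth-api.yourdomain.com']
```

Run on a schedule, a non-empty result becomes an alert in its own right: a new certificate appeared for the organisation's domains and has no assigned owner.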
Using this template for your own incidents
A good postmortem is blameless — it focuses on systemic failures, not individual mistakes. The engineer who did not hand over the spreadsheet was following no particular process, because no process existed. The answer is not to blame the individual but to build the process that makes the individual's departure irrelevant to certificate continuity.
The questions worth asking in any certificate expiry postmortem:
- How did this certificate end up outside the managed inventory?
- Who was responsible for monitoring this certificate — and did they know they were?
- What alerts were configured? Where did they go? Did anyone receive them?
- How long between expiry and detection? Why was detection not faster?
- What extended the resolution time? Were credentials, access, or procedures unavailable?
- What process change makes this specific failure path impossible in the future?
CertControl addresses the systemic gaps that make certificate expiry incidents recurring rather than one-time: continuous inventory from CT logs, monitoring with functional alert routing, ACME automation where possible, and reporting that makes the certificate estate visible to the people responsible for it.