Incident summary
Date: Tuesday, [Date]
Duration: 4 hours 17 minutes (03:14 – 07:31)
Impact: Authentication service unavailable. All users unable to log in. Approximately 2,400 failed login attempts during the outage window. Three enterprise customers contacted support before internal detection.
Root cause: TLS certificate on the authentication API endpoint expired at 03:14. The certificate was not in the monitored inventory. Renewal alerts were sent to a distribution list that had been disabled six months prior during a team restructure.
Severity: P1
Timeline
18 months prior: Authentication service migrated from monolith to separate microservice. New subdomain created (auth-api.yourdomain.com). Certificate obtained manually through CA portal. Added to a monitoring spreadsheet maintained by the engineer who led the migration.
12 months prior: Engineer who led the migration transfers to a different team. Spreadsheet is not handed over. Certificate monitoring responsibility is effectively unassigned.
45 days prior: CA sends renewal reminder to the email address specified at certificate issuance — a team distribution list that was decommissioned during a reorganisation. The email bounces. No one receives it.
14 days prior: CA sends second reminder. Same result.
03:14: Certificate expires. Authentication service begins returning TLS errors to all clients.
03:14 – 05:40: No internal alerts fire. Synthetic monitoring covers the main website and three primary API endpoints. The auth-api subdomain is not in the synthetic monitoring configuration.
05:40: First support ticket created by an enterprise customer reporting "login not working." Ticket assigned to support queue with standard priority.
05:58: Second enterprise customer emails their account manager directly. Account manager escalates internally.
06:03: On-call engineer paged. Begins investigation.
06:11: Root cause identified — TLS certificate expiry confirmed using openssl s_client. The certificate's notAfter timestamp was in the past.
06:11 – 07:20: Emergency certificate renewal process. CA portal login credentials not in password manager — stored locally on the laptop of the engineer who originally set up the service. Emergency contact to retrieve credentials. Manual domain validation. Certificate issuance. Certificate installation and service restart.
07:31: Authentication service confirmed operational. All-clear communicated to affected customers.
Total duration: 4 hours 17 minutes.
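For reference, the 06:11 openssl s_client check has a scriptable equivalent. The sketch below is ours, not part of the incident tooling — the hostname is the template placeholder and the function names are illustrative. It fetches a live certificate's notAfter over TLS and computes days to expiry using only the Python standard library:

```python
import socket
import ssl
from datetime import datetime, timezone

def cert_not_after(host, port=443):
    """Fetch the peer certificate over TLS and return its notAfter as UTC.

    Note: if the certificate has already expired, the handshake itself
    fails certificate verification -- which is itself the answer.
    """
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    # getpeercert() renders dates like 'Jun  1 12:00:00 2025 GMT'
    parsed = datetime.strptime(cert["notAfter"], "%b %d %H:%M:%S %Y %Z")
    return parsed.replace(tzinfo=timezone.utc)

def days_remaining(not_after, now=None):
    """Whole days between now and expiry; negative means already expired."""
    now = now or datetime.now(timezone.utc)
    return (not_after - now).days
```

Run against auth-api.yourdomain.com at 03:15, cert_not_after would not have returned a date at all — it would have raised a verification error, because for an expired certificate the handshake failure is the diagnostic.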
Contributing factors
1. Certificate not in managed inventory. The auth-api certificate was in a personal spreadsheet, not the team's certificate management system. When the responsible engineer moved teams, the certificate became effectively untracked.
2. Renewal alerts routed to a defunct email address. The certificate was issued using a team email address that was decommissioned during a reorganisation. CA renewal reminders had nowhere to go.
3. Monitoring gap on the affected subdomain. The synthetic monitoring configuration was created before the authentication microservice was split out. The new subdomain was never added to the monitoring scope.
4. Credential management failure. The CA portal credentials were stored locally rather than in the shared credential store. This extended the incident by over an hour while credentials were retrieved.
5. No certificate inventory ownership process. There was no process requiring that certificate ownership be transferred when engineers moved teams or left the organisation. Ownership became implicit and then absent.
What was not a contributing factor
It is worth noting what did not cause this incident:
- The certificate itself was valid and properly configured when it was first issued.
- The renewal process, once initiated, worked correctly.
- The team's response once the incident was detected was competent and reasonably fast.
The failure was entirely in the processes around certificate lifecycle management — not in the certificate technology itself. This is the pattern in almost all certificate expiry incidents: the certificate does what it is supposed to do, and the failure is in the human and process systems meant to manage it.
Action items
Immediate (within 48 hours):
- Audit all certificates across all subdomains using CT log enumeration. Produce a complete list of every certificate associated with the organisation's domains.
- Add all discovered certificates to the central certificate management system with assigned owners.
- Add auth-api and all other missing subdomains to the synthetic monitoring configuration.
- Move CA portal credentials to the shared credential store.
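The CT enumeration in the first item can be bootstrapped from crt.sh, which at the time of writing exposes a JSON endpoint (https://crt.sh/?q=%25.yourdomain.com&output=json). A sketch of the deduplication step — the field names follow crt.sh's JSON output, but the sample records below are illustrative, not real data:

```python
import json

def hostnames_from_ct(records):
    """Collapse crt.sh JSON records into a sorted list of unique hostnames.

    Each record's name_value field holds newline-separated SANs; wildcard
    entries are kept as-is so they can be reviewed by hand.
    """
    names = set()
    for rec in records:
        for name in rec.get("name_value", "").splitlines():
            name = name.strip().lower()
            if name:
                names.add(name)
    return sorted(names)

# Illustrative records in crt.sh's JSON shape (real responses carry more fields).
sample = json.loads("""[
  {"common_name": "auth-api.yourdomain.com",
   "name_value": "auth-api.yourdomain.com",
   "not_after": "2025-06-01T00:00:00"},
  {"common_name": "www.yourdomain.com",
   "name_value": "www.yourdomain.com\\nyourdomain.com",
   "not_after": "2025-09-15T00:00:00"}
]""")

print(hostnames_from_ct(sample))
# ['auth-api.yourdomain.com', 'www.yourdomain.com', 'yourdomain.com']
```

The point of the audit is exactly the output above: subdomains like auth-api that someone issued a certificate for but that no inventory tracks.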
Short-term (within 2 weeks):
- Implement certificate monitoring with alerts at 90, 60, 30, and 14 days before expiry. Route alerts to functional team addresses, not individual or transient distribution lists.
- Define and document the process for certificate ownership transfer when engineers change teams.
- Review all certificates for renewal reminder contact addresses. Update any that point to individual addresses or defunct distribution lists.
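The 90/60/30/14-day ladder from the first short-term item reduces to a small pure function. A sketch (the function name and the already_sent bookkeeping are our additions):

```python
THRESHOLDS = (90, 60, 30, 14)  # days before expiry, per the action item

def due_alerts(days_left, already_sent=frozenset(), thresholds=THRESHOLDS):
    """Return the thresholds that should fire now.

    A threshold fires once days_left has dropped to or below it; the
    already_sent set keeps a restarted checker from re-paging on
    thresholds it has handled before.
    """
    return [t for t in thresholds if days_left <= t and t not in already_sent]

# A certificate 45 days from expiry has crossed the 90- and 60-day marks.
print(due_alerts(45))            # [90, 60]
print(due_alerts(45, {90, 60}))  # [] -- nothing new to send
print(due_alerts(10))            # [90, 60, 30, 14]
```

Whatever delivers these alerts should target the functional team addresses the action item calls for — the routing, not the arithmetic, is what failed in this incident.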
Medium-term (within 6 weeks):
- Evaluate ACME automation for all certificates where it is technically feasible. Target: eliminate manual renewal for all production certificates.
- Implement continuous CT monitoring so that new certificates issued for the organisation's domains are detected automatically and added to the managed inventory.
- Run a tabletop exercise simulating certificate expiry on a different critical service to validate that the response process works before it is needed in a real incident.
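The continuous CT monitoring item is, at its core, a recurring set difference between what CT logs show and what the inventory tracks. A minimal sketch, with hypothetical host lists standing in for the real feeds:

```python
def unmanaged_hosts(ct_hostnames, inventory_hostnames):
    """Hostnames seen in CT logs but absent from the managed inventory.

    Anything this returns is a certificate nobody is watching -- exactly
    the gap that let the auth-api certificate expire unnoticed.
    """
    return sorted(set(ct_hostnames) - set(inventory_hostnames))

ct_seen = ["www.yourdomain.com", "api.yourdomain.com", "auth-api.yourdomain.com"]
managed = ["www.yourdomain.com", "api.yourdomain.com"]

print(unmanaged_hosts(ct_seen, managed))  # ['auth-api.yourdomain.com']
```

Run on a schedule, a non-empty result becomes an alert in its own right: a new certificate appeared for the organisation's domains and has no assigned owner.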
Using this template for your own incidents
A good postmortem is blameless — it focuses on systemic failures, not individual mistakes. The engineer who did not hand over the spreadsheet was following no particular process, because no process existed. The answer is not to blame the individual but to build the process that makes the individual's departure irrelevant to certificate continuity.
The questions worth asking in any certificate expiry postmortem:
- How did this certificate end up outside the managed inventory?
- Who was responsible for monitoring this certificate — and did they know they were?
- What alerts were configured? Where did they go? Did anyone receive them?
- How long between expiry and detection? Why was detection not faster?
- What extended the resolution time? Were credentials, access, or procedures unavailable?
- What process change makes this specific failure path impossible in the future?
CertControl addresses the systemic gaps that make certificate expiry incidents recurring rather than one-time: continuous inventory from CT logs, monitoring with functional alert routing, ACME automation where possible, and reporting that makes the certificate estate visible to the people responsible for it.