← Week 3: Integration + failure modes + fit analysis

Day 15: Failure Mode Analysis — What Breaks and When

Phase 5 · September 9, 2026


Agenda (2–3 hours)

  • Study (60 min): ACM PCA failure modes; CRL expiry; HSM unavailability
  • Analyze (60 min): Map failure modes to your provisioning service
  • Write (45 min): acm-pca-design.md §5.1–5.2

Failure Mode Framework

For each component, ask three questions:

  1. What fails? (the specific failure event)
  2. What is the impact? (which operations break; for how long)
  3. What is the recovery procedure? (how to restore service)

Components to analyze:

  • ACM PCA CA (managed by AWS)
  • CRL distribution (S3 bucket, HTTPS endpoint)
  • OCSP responder (AWS-managed endpoint)
  • Your provisioning service (Lambda + DynamoDB)
  • The trust anchor (root CA cert)
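
One way to keep the three questions consistent across components is to record each failure mode as a small structured object and render it as a row of the §5.1 table. The sketch below is only an illustration: the `FailureMode` class and its `to_row` helper are hypothetical names, not part of any AWS tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    """One failure mode, answering the three framework questions."""
    component: str
    failure: str          # 1. What fails?
    impact: str           # 2. What is the impact?
    recovery: str         # 3. What is the recovery procedure?
    mitigation: str = ""  # optional: what reduces likelihood or impact

    def to_row(self) -> str:
        """Render as a markdown table row in the §5.1 column order."""
        cells = (self.component, self.failure, self.impact,
                 self.recovery, self.mitigation)
        return "| " + " | ".join(cells) + " |"


crl = FailureMode(
    component="CRL endpoint",
    failure="S3 inaccessible",
    impact="Cert rejection",
    recovery="Fix bucket ACL",
    mitigation="S3 access alarm",
)
print(crl.to_row())
```

Keeping the entries as data makes it easy to spot a component that has an impact listed but no recovery procedure.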

Failure Mode 1: ACM PCA Service Unavailable

What fails: AWS ACM PCA service outage (rare but possible).

Impact:

  • IssueCertificate API calls fail → provisioning service cannot issue new certs
  • Existing certs are unaffected (they were already issued; CA unavailability doesn't invalidate them)
  • CRL updates may be delayed → relying parties see stale CRL
  • OCSP responses may be delayed or unavailable

Recovery:

  • Wait for AWS to restore service (SLA: 99.9% monthly)
  • If sustained: use previously issued certs; queue new issuance requests
  • Mitigation: your provisioning service should cache recently issued certs in DynamoDB so retry is idempotent (idempotency token pattern from Day 11)
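
The caching mitigation can be sketched as follows. This is a minimal illustration rather than the Day 11 implementation: the dict stands in for the DynamoDB table, `issue` stands in for the call that wraps IssueCertificate and GetCertificate, and deriving the token from a hash of the device ID and CSR is an assumption made here so that a retry produces the same token.

```python
import hashlib
from typing import Callable, Dict


def idempotency_token(device_id: str, csr_pem: str) -> str:
    """Derive a stable token so a retry of the same request maps to the same key.

    Truncated to 36 characters on the assumption that the token field
    has a 36-character limit.
    """
    digest = hashlib.sha256(f"{device_id}\n{csr_pem}".encode()).hexdigest()
    return digest[:36]


def issue_once(device_id: str, csr_pem: str,
               cache: Dict[str, str],
               issue: Callable[[str, str], str]) -> str:
    """Return the cached cert for this request, issuing only on a cache miss.

    `cache` is a stand-in for the DynamoDB table; `issue(csr_pem, token)`
    is a hypothetical wrapper around the CA call.
    """
    token = idempotency_token(device_id, csr_pem)
    if token in cache:
        return cache[token]
    cert_pem = issue(csr_pem, token)
    cache[token] = cert_pem
    return cert_pem
```

With this shape, a device that retries during a partial outage gets back the certificate from the first successful attempt instead of triggering a second issuance.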

Failure Mode 2: CRL Expiry

What fails: The CRL's nextUpdate timestamp passes without a new CRL being published.

Impact:

  • Relying parties with strict CRL checking will reject all certs issued by this CA
  • "Hard-fail" clients refuse connections; "soft-fail" clients continue (security risk)
  • Operational: your services may start rejecting each other's certs

When it happens:

  • ACM PCA S3 bucket access is blocked (bucket policy change, ACL, deletion)
  • Cascading failure from another change
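  • Time to impact: relying parties start rejecting once the last published CRL passes its nextUpdate, which is at most the CRL validity period (default 7 days; you configured this in Day 12)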

Recovery:

  • Fix S3 bucket access; ACM PCA will republish CRL automatically
  • Mitigation: monitor s3:GetObject on CRL bucket; alarm if CRL URL returns 403/404
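
An HTTP status check catches a blocked bucket, but it does not catch a CRL that is still being served and simply is not being refreshed. A freshness check on nextUpdate covers that case. The sketch below assumes the nextUpdate value has already been parsed out of the CRL (fetching and parsing are out of scope here); the function name and the 2-day warning margin are illustrative choices.

```python
from datetime import datetime, timedelta, timezone


def crl_status(next_update: datetime, now: datetime,
               warn_margin: timedelta = timedelta(days=2)) -> str:
    """Classify CRL freshness from its nextUpdate timestamp.

    "expired"  : nextUpdate has passed; hard-fail clients will reject certs
    "expiring" : within warn_margin of nextUpdate; time to page someone
    "ok"       : a fresh CRL is expected well before nextUpdate
    """
    if now >= next_update:
        return "expired"
    if next_update - now <= warn_margin:
        return "expiring"
    return "ok"


now = datetime.now(timezone.utc)
print(crl_status(now + timedelta(days=5), now))
```

Alarming on "expiring" rather than "expired" leaves time to fix bucket access before any relying party starts rejecting certs.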

Failure Mode 3: Root CA Certificate Expiry

What fails: The root CA cert's notAfter passes.

Impact:

  • All certs in the hierarchy become untrusted (chain validation fails at the root)
  • TLS connections fail; device provisioning fails; everything fails
  • This is the most catastrophic failure mode

Prevention:

  • 15-year root CA validity → monitor with a 2-year warning alarm
  • Alert on expiry: aws acm-pca describe-certificate-authority → NotAfter field
  • Plan a root CA rotation long before expiry (1–2 year lead time)
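
The 2-year warning can be expressed as a small check over the describe-certificate-authority response. This is a sketch under assumptions: it expects the response shape returned by boto3's `describe_certificate_authority` (a `CertificateAuthority` dict with a timezone-aware `NotAfter`), and the function name and `WARNING_LEAD` constant are made up for illustration. The AWS call itself is left to the caller so the logic can be exercised without credentials.

```python
from datetime import datetime, timedelta, timezone

# 2-year warning lead time from the prevention list (approximated as 730 days).
WARNING_LEAD = timedelta(days=2 * 365)


def root_expiry_alarm(describe_response: dict, now: datetime) -> bool:
    """Return True when the CA cert is inside the warning window (or expired).

    `describe_response` is assumed to be the dict returned by
    boto3.client("acm-pca").describe_certificate_authority(
        CertificateAuthorityArn=...)
    """
    not_after = describe_response["CertificateAuthority"]["NotAfter"]
    return not_after - now <= WARNING_LEAD


# Example with a hand-built response instead of a live AWS call.
now = datetime.now(timezone.utc)
sample = {"CertificateAuthority": {"NotAfter": now + timedelta(days=400)}}
print(root_expiry_alarm(sample, now))
```

Running this on a schedule and publishing the result as a metric is one way to feed the CA expiry alarm listed in §5.2.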

Recovery:

  • Establish a new root CA
  • Cross-sign: new root signs old root (trust both temporarily)
  • Migrate trust bundles everywhere
  • This is a multi-month operation — do not let it become urgent

Failure Mode 4: Issuance CA Key Compromise

What fails: The private key of an issuance CA is compromised.

Impact:

  • An attacker can issue arbitrary certificates trusted by your hierarchy
  • Relying parties cannot distinguish forged certs from real ones
  • All certs issued by this CA must be considered suspect

Recovery procedure:

  1. Revoke the issuance CA cert (revocation is performed by its issuer, the subordinate CA)
  2. Generate an audit report: list all certs issued by the compromised CA
  3. Revoke all issued certs (or let them expire if TTL is short)
  4. Create a new issuance CA under the same subordinate
  5. Re-issue certs for all devices/services through the new CA
  6. Update trust bundles everywhere to remove the old CA

Why three tiers help: the blast radius is the issuance CA, not the root.
The subordinate CA can sign a new issuance CA without touching the root.
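
Step 3 of the procedure involves a choice per certificate: revoke it, or let it expire because its remaining lifetime is short. A sketch of that triage is below. The input is assumed to be (serial, notAfter) pairs taken from the audit report in step 2, and the 1-day `short_ttl` threshold is an arbitrary illustration; the right cutoff depends on how long you are willing to tolerate a suspect cert.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, List, Tuple


def triage(certs: Iterable[Tuple[str, datetime]], now: datetime,
           short_ttl: timedelta = timedelta(days=1)
           ) -> Tuple[List[str], List[str]]:
    """Split certs from a compromised CA into (revoke, let_expire) serial lists.

    Already-expired certs are skipped: they are no longer valid anyway.
    """
    revoke: List[str] = []
    let_expire: List[str] = []
    for serial, not_after in certs:
        if not_after <= now:
            continue  # already expired, nothing to do
        if not_after - now <= short_ttl:
            let_expire.append(serial)
        else:
            revoke.append(serial)
    return revoke, let_expire
```

The shorter the device cert lifetime, the more of the list falls into the second bucket, which is one practical argument for short-lived certs.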


Failure Mode 5: Provisioning Service Unavailable

What fails: Your Lambda / service tier is down.

Impact:

  • New cert requests fail → devices cannot provision
  • Existing certs are unaffected (already issued)
  • Devices that need to rotate their certs before expiry will fail to do so

Mitigation:

  • Short-lived certs (90 days for devices): start rotation 30 days before expiry
    → the provisioning service must be reachable at least once during that 30-day window, so an outage shorter than 30 days does not let any cert expire
  • Long-lived certs (1 year): rotation failure is less urgent but still matters
  • Alert on DynamoDB: certs expiring in < 30 days with no rotation attempt
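
The DynamoDB alert in the last bullet boils down to a filter over cert records. The sketch below shows that filter on plain dicts; the record fields (`device_id`, `not_after`, `last_rotation_attempt`) are a hypothetical schema, not the table layout from earlier days, and the 30-day window matches the rotation window above.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, List, Optional, TypedDict


class CertRecord(TypedDict):
    device_id: str
    not_after: datetime
    last_rotation_attempt: Optional[datetime]  # None if never attempted


def overdue_rotations(records: Iterable[CertRecord], now: datetime,
                      window: timedelta = timedelta(days=30)) -> List[str]:
    """Return device IDs whose cert is inside the rotation window
    (or already expired) with no rotation attempt since the window opened."""
    overdue: List[str] = []
    for rec in records:
        window_start = rec["not_after"] - window
        if now < window_start:
            continue  # rotation window has not opened yet
        attempt = rec["last_rotation_attempt"]
        if attempt is None or attempt < window_start:
            overdue.append(rec["device_id"])
    return overdue
```

A non-empty result during a provisioning outage tells you which devices are at risk if the outage outlasts their remaining validity.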

Failure Mode 6: Lambda + ACM PCA Latency

What fails: IssueCertificate + GetCertificate takes 2–5 seconds.
Lambda has a 15-minute maximum timeout, but client timeouts may be shorter.

Impact:

  • Device provisioning request times out if the client has a 3-second timeout
  • Retry amplifies: multiple provisioning requests for the same device

Mitigation:

  • Use idempotency tokens (Day 11) — retries return the same cert
  • Decouple issuance from the synchronous request:
    POST /provision → accept request → write to SQS → return 202 Accepted
    SQS consumer → call ACM PCA → write cert to DynamoDB
    GET /provision/{id} → poll DynamoDB for result
    
    This async pattern makes ACM PCA latency invisible to the device.
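
The three steps of the async pattern can be sketched with in-memory stand-ins: a deque in place of SQS and a dict in place of DynamoDB. All function names here are hypothetical, and `issue` stands in for whatever wraps the ACM PCA calls. The point is only to show the control flow: the POST returns immediately, the slow CA call happens in the consumer, and the GET reads stored state.

```python
import uuid
from collections import deque
from typing import Callable, Deque, Dict, Tuple

queue: Deque[Tuple[str, str]] = deque()  # stand-in for SQS
results: Dict[str, dict] = {}            # stand-in for DynamoDB


def post_provision(csr_pem: str) -> Tuple[int, dict]:
    """POST /provision: record the request, enqueue it, return 202 at once."""
    request_id = str(uuid.uuid4())
    results[request_id] = {"status": "PENDING"}
    queue.append((request_id, csr_pem))
    return 202, {"id": request_id}


def consume_one(issue: Callable[[str], str]) -> None:
    """Queue consumer: make the slow CA call and store the outcome."""
    request_id, csr_pem = queue.popleft()
    try:
        results[request_id] = {"status": "ISSUED", "cert": issue(csr_pem)}
    except Exception as exc:
        results[request_id] = {"status": "FAILED", "error": str(exc)}


def get_provision(request_id: str) -> Tuple[int, dict]:
    """GET /provision/{id}: poll for the stored result."""
    record = results.get(request_id)
    if record is None:
        return 404, {}
    return 200, record
```

A device would call `post_provision`, then poll `get_provision` with its own backoff until the status leaves PENDING. In a real deployment the consumer would also need the idempotency token so that queue redelivery does not issue a second cert.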

Challenge Assignment

Complete acm-pca-design.md §5.1 and §5.2:

## §5. Failure Mode Analysis

### 5.1 Failure Mode Table
| Component | Failure | Impact | Recovery | Mitigation |
|-----------|---------|--------|----------|-----------|
| ACM PCA service | Outage | No new issuance | Wait for AWS | Idempotent retry queue |
| CRL endpoint | S3 inaccessible | Cert rejection | Fix bucket ACL | S3 access alarm |
| Root CA cert | Expiry | Total PKI failure | Root rotation | 2-year warning alarm |
| Issuance CA key | Compromise | Forged certs | Revoke + re-issue | Audit report |
| Provisioning service | Lambda outage | No provisioning | Restore service | 30-day rotation window |

### 5.2 Monitoring and Alerting Design
<CloudWatch alarms: CRL URL, CA expiry, cert expiry distribution, issuance errors>

Resources

  • ACM PCA SLA: aws.amazon.com/privateca/sla
  • ACM PCA monitoring: docs.aws.amazon.com/privateca/latest/userguide/PcaCloudWatch.html
  • CRL monitoring: download the CRL from its distribution point (the S3 bucket or CRL URL) and check nextUpdate, e.g. openssl crl -inform DER -noout -nextupdate -in <crl-file>
  • Phase 4, Day 15 (SPIFFE + Lambda failure modes) — many patterns carry over