← Week 2: Implementation

Day 12: Retry Logic and DLQ Handling

Phase 7 · Oct 4, 2026

Agenda (2–3 hours)

  • Implement (90 min): Write the retry policy (exponential backoff, jitter, max attempts); implement the DLQ processor Lambda
  • Test (60 min): Inject task failures; verify each task is received exactly 5 times (the initial attempt plus 4 retries) before reaching the DLQ; verify the DLQ processor marks tasks DEAD
  • Review (30 min): Verify the retry count is sourced from the SQS ApproximateReceiveCount attribute, not stored in DynamoDB

Retry Policy

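use std::time::Duration;

// Assumes the `aws-sdk-sqs` and `rand` crates.
use aws_sdk_sqs::types::{Message, MessageSystemAttributeName};

/// Must match `maxReceiveCount` in the queue's redrive policy.
const MAX_RETRIES: u32 = 5;

/// Returns true while the message is still within its receive budget.
pub fn should_retry(msg: &Message) -> bool {
    // SQS maintains this counter, but only returns it if the attribute is
    // requested in ReceiveMessage. A missing value is treated as the first receive.
    let receive_count: u32 = msg
        .attributes
        .as_ref()
        .and_then(|a| a.get(&MessageSystemAttributeName::ApproximateReceiveCount))
        .and_then(|v| v.parse().ok())
        .unwrap_or(1);

    receive_count <= MAX_RETRIES
}

/// Delay before the next attempt; `attempt` is 1-based.
pub fn backoff_duration(attempt: u32) -> Duration {
    // Clamp the exponent so an unexpectedly large attempt cannot overflow.
    let base = Duration::from_secs(2u64.pow(attempt.min(MAX_RETRIES)));
    // Up to 1s of random jitter so failed tasks do not all retry at once.
    let jitter = Duration::from_millis(rand::random::<u64>() % 1000);
    base + jitter  // 2s, 4s, 8s, 16s, 32s (+ jitter)
}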

The queue-level SQS redrive policy moves a message to the DLQ once it has been received maxReceiveCount = 5 times without being deleted.
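
As a worked example of the schedule, the sketch below computes the base delays and their total. It is an illustration under assumptions, not the course implementation: `base_delay`, `backoff_with_jitter`, and `total_base_wait` are hypothetical names, the jitter is passed in as a parameter so the result is deterministic, and the exponent cap of 16 is an arbitrary guard against overflow.

```rust
use std::time::Duration;

/// Base delay for a 1-based attempt number: 2s, 4s, 8s, ...
/// The exponent is clamped (hypothetical cap) so the power cannot overflow.
fn base_delay(attempt: u32) -> Duration {
    Duration::from_secs(2u64.pow(attempt.min(16)))
}

/// Backoff with the jitter supplied by the caller (reduced to 0..1000 ms).
fn backoff_with_jitter(attempt: u32, jitter_ms: u64) -> Duration {
    base_delay(attempt) + Duration::from_millis(jitter_ms % 1000)
}

/// Sum of the base delays over `max_attempts` failures, jitter excluded.
fn total_base_wait(max_attempts: u32) -> Duration {
    (1..=max_attempts).map(base_delay).sum()
}

fn main() {
    for attempt in 1..=5 {
        println!("attempt {attempt}: wait {:?}", base_delay(attempt));
    }
    println!("total base wait: {:?}", total_base_wait(5));
    println!("attempt 3 with 250 ms jitter: {:?}", backoff_with_jitter(3, 250));
}
```

If each delay is applied after the corresponding failed receive, a task that always fails waits roughly 62 seconds in total (plus jitter and processing time) before it lands in the DLQ.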

DLQ Processor Lambda

// `db` and `next_seq` are assumed to be in scope; their setup is not shown.
async fn handler(event: LambdaEvent<SqsEvent>) -> Result<(), Error> {
    for record in event.payload.records {
        let task_id = extract_task_id(&record)?;
        let error   = extract_last_error(&record);

        // Mark the task DEAD in DynamoDB, conditional on its current status.
        // Best effort: the update is skipped if the task is already DEAD
        // or was still PROCESSING.
        db.update_task_status(
            &task_id,
            TaskStatus::Dead,
            TaskStatus::Failed,  // expected current status
        ).await.ok();

        // Clone so `error` is still available for the log line below.
        db.append_event(&task_id, TaskEvent::Dead { error: error.clone() }, next_seq).await?;

        counter!("tasks_dead_total").increment(1);
        tracing::error!(task_id=%task_id, error=%error, "task sent to DLQ");
    }
    Ok(())
}
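
Because the same DLQ message can be delivered more than once, each write the processor makes should be safe to repeat. The sketch below shows one way to get that property, using an in-memory stand-in for the task table. Everything here is hypothetical and for illustration only: `TaskStore`, `mark_dead`, `record_dead_event`, and `process_dlq_record` are not part of the course code, and keeping one DEAD event per task is an assumption.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq)]
enum TaskStatus {
    Processing,
    Failed,
    Dead,
}

/// Hypothetical in-memory stand-in for the task table and event log.
#[derive(Default)]
struct TaskStore {
    status: HashMap<String, TaskStatus>,
    dead_events: HashMap<String, String>, // task_id -> recorded error
}

impl TaskStore {
    /// Transition to Dead unless the task is unknown or already Dead.
    /// Returns true only when this call changed the status.
    fn mark_dead(&mut self, task_id: &str) -> bool {
        match self.status.get_mut(task_id) {
            None | Some(TaskStatus::Dead) => false,
            Some(status) => {
                *status = TaskStatus::Dead;
                true
            }
        }
    }

    /// Insert-if-absent: a second Dead event for the same task is ignored.
    fn record_dead_event(&mut self, task_id: &str, error: &str) -> bool {
        if self.dead_events.contains_key(task_id) {
            return false;
        }
        self.dead_events.insert(task_id.to_string(), error.to_string());
        true
    }
}

/// Both steps are individually idempotent, so replaying a record is safe.
fn process_dlq_record(store: &mut TaskStore, task_id: &str, error: &str) {
    store.mark_dead(task_id);
    store.record_dead_event(task_id, error);
}

fn main() {
    let mut store = TaskStore::default();
    store.status.insert("task-1".to_string(), TaskStatus::Failed);
    store.status.insert("task-2".to_string(), TaskStatus::Processing);

    // Simulate a duplicate delivery of the same DLQ message.
    process_dlq_record(&mut store, "task-1", "timeout");
    process_dlq_record(&mut store, "task-1", "timeout");

    println!("task-1 status: {:?}", store.status["task-1"]);
    println!("dead events recorded: {}", store.dead_events.len());
}
```

Against a real DynamoDB table, the two steps would likely map to conditional writes, so that a replay after a crash at any point converges on the same final state.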

SQS Redrive Policy

{
  "deadLetterTargetArn": "arn:aws:sqs:us-east-1:123:task-queue-dlq",
  "maxReceiveCount": 5
}

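Once a message has been received 5 times without being deleted, SQS moves it to the DLQ on the next receive attempt instead of delivering it again.
The DLQ is configured with the maximum retention period of 14 days, which gives operators time to investigate.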
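
To make the receive-count rule concrete, here is a small in-memory model of a queue with a redrive policy, driven by a consumer that always fails. This is a simplified sketch, not SQS itself: `SimQueue`, `nack`, and `deliveries_until_dlq` are hypothetical names, and visibility timeouts and backoff delays are left out.

```rust
use std::collections::VecDeque;

/// Hypothetical in-memory model of a queue with a redrive policy.
struct SimQueue {
    visible: VecDeque<(String, u32)>, // (body, receive count so far)
    dlq: Vec<String>,
    max_receive_count: u32,
}

impl SimQueue {
    fn new(max_receive_count: u32) -> Self {
        SimQueue { visible: VecDeque::new(), dlq: Vec::new(), max_receive_count }
    }

    fn send(&mut self, body: &str) {
        self.visible.push_back((body.to_string(), 0));
    }

    /// Model a receive: a message that has already used up its receive
    /// budget is moved to the DLQ instead of being delivered again.
    fn receive(&mut self) -> Option<(String, u32)> {
        while let Some((body, count)) = self.visible.pop_front() {
            if count >= self.max_receive_count {
                self.dlq.push(body);
                continue;
            }
            return Some((body, count + 1));
        }
        None
    }

    /// Model a failed handler: the message becomes visible again.
    fn nack(&mut self, body: String, count: u32) {
        self.visible.push_back((body, count));
    }
}

/// Run an always-failing consumer against one message.
/// Returns (number of deliveries, number of messages in the DLQ).
fn deliveries_until_dlq(max_receive_count: u32) -> (u32, usize) {
    let mut queue = SimQueue::new(max_receive_count);
    queue.send("task-1");
    let mut deliveries = 0;
    while let Some((body, count)) = queue.receive() {
        deliveries += 1;
        queue.nack(body, count);
    }
    (deliveries, queue.dlq.len())
}

fn main() {
    let (deliveries, dead) = deliveries_until_dlq(5);
    println!("deliveries: {deliveries}, messages in DLQ: {dead}");
}
```

The same shape works as a failure-injection test against the real queue: fail every attempt, count handler invocations, and check that the message then appears in the DLQ.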

Re-drive from DLQ back to the main queue:

aws sqs start-message-move-task \
  --source-arn arn:aws:sqs:us-east-1:123:task-queue-dlq \
  --destination-arn arn:aws:sqs:us-east-1:123:task-queue

Key Takeaways

  • The ApproximateReceiveCount attribute is the canonical retry counter, so there is no need to store a count in DynamoDB
  • The SQS redrive policy handles DLQ routing automatically at the queue level
  • The DLQ processor should be idempotent: the same message may be delivered again if the processor crashes before the message is deleted
  • Jitter on backoff spreads retries over time, which prevents a thundering herd during outages

Tomorrow: instrumentation — adding full OTel traces, metrics, and structured logs to the implementation.