How to Debug and Fix AWS Lambda Cold Starts
Your Lambda function responds in 50ms most of the time. But every few minutes, a request takes 2-3 seconds. Users notice. Your monitoring dashboard shows latency spikes that look like a heartbeat monitor. The culprit: cold starts.
A cold start happens when AWS needs to spin up a fresh execution environment for your Lambda function -- downloading your code, starting the runtime, and running your initialization logic. This article shows you how to measure cold starts, identify the bottleneck, and fix them.
Step 1: Confirm You Have a Cold Start Problem
Not every slow Lambda invocation is a cold start. Before optimizing, confirm the issue using CloudWatch Logs.
Every Lambda invocation logs a REPORT line. Cold starts include an extra field called Init Duration:
REPORT RequestId: abc-123
Duration: 45.12 ms
Billed Duration: 46 ms
Memory Size: 256 MB
Max Memory Used: 89 MB
Init Duration: 1823.45 ms <-- This only appears on cold starts
If you see Init Duration, that invocation was a cold start. The value tells you exactly how long initialization took.
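If you are scripting your own log analysis, the Init Duration field can be pulled out of a raw REPORT line with a regular expression. A minimal sketch (the sample lines below are hypothetical):

```python
import re

# Matches the "Init Duration" field that Lambda appends to REPORT
# lines on cold starts; warm invocations have no such field.
INIT_RE = re.compile(r"Init Duration:\s*([\d.]+)\s*ms")

def init_duration_ms(report_line):
    """Return init duration in ms, or None for a warm invocation."""
    m = INIT_RE.search(report_line)
    return float(m.group(1)) if m else None

cold = "REPORT RequestId: abc-123 Duration: 45.12 ms Init Duration: 1823.45 ms"
warm = "REPORT RequestId: def-456 Duration: 44.80 ms"
print(init_duration_ms(cold))  # 1823.45
print(init_duration_ms(warm))  # None
```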
Query Cold Starts with CloudWatch Logs Insights
Go to CloudWatch > Logs Insights, select your Lambda function's log group, and run:
filter @type = "REPORT"
| fields @requestId, @duration, @initDuration, @memorySize, @maxMemoryUsed
| filter ispresent(@initDuration)
| stats count() as coldStarts,
avg(@initDuration) as avgColdStart,
max(@initDuration) as maxColdStart,
pct(@initDuration, 95) as p95ColdStart
by bin(1h)
| sort by bin(1h) desc
This gives you cold start count and average/p95 init duration per hour. If your avgColdStart is under 200ms on a lightweight runtime like Node.js or Python, you probably don't need to optimize. If it's over 500ms, keep reading.
Enable Lambda Insights for Deeper Visibility
For ongoing monitoring, enable CloudWatch Lambda Insights on your function:
# Use the latest LambdaInsightsExtension layer version for your region
aws lambda update-function-configuration \
    --function-name my-function \
    --layers arn:aws:lambda:us-east-1:580247275435:layer:LambdaInsightsExtension:49
Lambda Insights tracks cold start frequency, init duration, and memory usage as first-class metrics you can alarm on.
Step 2: Identify the Bottleneck
Cold start time breaks down into three phases:
- Environment setup (~100-200ms) -- AWS downloads your code and starts the runtime. You can't control this directly.
- Runtime initialization -- The language runtime starts up. Varies by language: Python ~200ms, Node.js ~150ms, Java ~1-3s, .NET ~400ms.
- Your init code -- Code that runs outside the handler: imports, database connections, SDK client creation. This is where most of the time goes.
To figure out where your time is spent, add timing logs to your initialization:
import time
import json
start = time.time()
import boto3 # Heavy import
print(f"boto3 import: {(time.time() - start) * 1000:.0f}ms")
start = time.time()
dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('my-table')
print(f"DynamoDB client init: {(time.time() - start) * 1000:.0f}ms")
def handler(event, context):
# Your handler code here
item = table.get_item(Key={'id': event['id']})
return {
'statusCode': 200,
'body': json.dumps(item['Item'], default=str)
}
Check the CloudWatch logs for a cold start invocation. You'll see exactly which import or initialization step is eating your time.
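If the timing logs show one client dominating init, and that client is only needed on some code paths, you can defer its construction until first use so it stays off the cold start path entirely. A sketch of the pattern (`get_s3_client` and the `needs_s3` event field are illustrative names, not part of any API):

```python
import functools

@functools.lru_cache(maxsize=1)
def get_s3_client():
    """Built on first call, cached for the lifetime of the environment.

    Cold starts no longer pay for this; the first invocation that
    actually needs S3 pays instead.
    """
    import boto3  # deferred import: stays off the cold start path
    return boto3.client("s3")

def handler(event, context):
    if event.get("needs_s3"):
        s3 = get_s3_client()  # constructed here, exactly once
        ...
    return {"statusCode": 200}
```

The trade-off: the first request down that path absorbs the init cost, so use this only for clients that are genuinely optional per invocation.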
Step 3: Reduce Package Size
Trimming your deployment package is the single most impactful optimization you can make without changing architecture. AWS has to download your deployment package on every cold start, so a smaller package means a faster download.
Before: An 8MB deployment package with every dependency bundled.
After: A 2MB package with only what you need.
Practical steps:
# Check what's taking space
du -sh node_modules/* | sort -rh | head -20
# For Node.js: use esbuild to tree-shake and bundle
npx esbuild src/handler.ts --bundle --platform=node --target=node20 \
--outfile=dist/handler.js --minify --external:@aws-sdk/*
# For Python: exclude unnecessary files
pip install -r requirements.txt -t ./package \
--no-cache-dir \
--only-binary=:all:
# Remove test files, docs, and type stubs
cd package
find . -type d -name "tests" -exec rm -rf {} +
find . -type d -name "__pycache__" -exec rm -rf {} +
find . -name "*.pyc" -delete
find . -name "*.pyi" -delete
Key tip: AWS SDK v3 for JavaScript and boto3 for Python are included in the Lambda runtime. Don't bundle them unless you need a specific version.
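The `du` one-liner above works anywhere you have a shell; a portable Python equivalent that ranks the heaviest subdirectories of a package directory looks like this (the `./package` path matches the pip example above):

```python
import os

def dir_sizes(root):
    """Return {subdir: total_bytes} for each immediate child of root."""
    sizes = {}
    for entry in os.scandir(root):
        if entry.is_dir(follow_symlinks=False):
            total = 0
            for dirpath, _dirs, files in os.walk(entry.path):
                for name in files:
                    total += os.path.getsize(os.path.join(dirpath, name))
            sizes[entry.name] = total
    return sizes

if os.path.isdir("./package"):
    # Rank the heaviest dependencies, largest first
    for name, size in sorted(dir_sizes("./package").items(),
                             key=lambda kv: -kv[1])[:20]:
        print(f"{size / 1e6:7.1f} MB  {name}")
```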
Step 4: Switch to ARM64 (Graviton)
Switching to ARM64 (Graviton) typically shaves both cold start and execution time, and AWS quotes up to 34% better price performance for Graviton2 functions. Architecture is a property of the deployment package, so you set it when uploading code:
aws lambda update-function-code \
    --function-name my-function \
    --zip-file fileb://function.zip \
    --architectures arm64
Or in your infrastructure-as-code (e.g., AWS SAM):
Resources:
MyFunction:
Type: AWS::Serverless::Function
Properties:
Handler: handler.handler
Runtime: python3.12
Architectures:
- arm64
MemorySize: 512
ARM64 Lambda functions also cost 20% less per ms of compute. There's no reason not to switch unless you depend on x86-only native binaries.
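The 20% figure comes straight from the published duration prices. A quick back-of-the-envelope check, using the us-east-1 prices at the time of writing (check the Lambda pricing page for your region; the workload numbers are illustrative):

```python
# Published us-east-1 duration prices (USD per GB-second) at the
# time of writing -- verify against the current Lambda pricing page.
X86_PER_GB_S = 0.0000166667
ARM_PER_GB_S = 0.0000133334

def monthly_duration_cost(invocations, avg_ms, memory_mb, price_per_gb_s):
    """Duration cost only; the per-request charge is the same on both."""
    gb_seconds = invocations * (avg_ms / 1000) * (memory_mb / 1024)
    return gb_seconds * price_per_gb_s

# 10M invocations/month, 100 ms average, 512 MB
x86 = monthly_duration_cost(10_000_000, 100, 512, X86_PER_GB_S)
arm = monthly_duration_cost(10_000_000, 100, 512, ARM_PER_GB_S)
print(f"x86: ${x86:.2f}  arm64: ${arm:.2f}  savings: {1 - arm / x86:.0%}")
```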
Step 5: Use SnapStart (Python, Java, .NET)
SnapStart takes a snapshot of your function's initialized execution environment. On subsequent cold starts, Lambda restores from the snapshot instead of re-running initialization. The result: cold starts typically drop from seconds to a few hundred milliseconds.
SnapStart supports Java 11 and newer, Python 3.12 and newer (added November 2024), and .NET 8 and newer.
Enable it for a Python function:
aws lambda update-function-configuration \
--function-name my-function \
--snap-start ApplyOn=PublishedVersions
# You must publish a version to activate SnapStart
aws lambda publish-version --function-name my-function
Important constraints:
- SnapStart works on published versions only, not $LATEST
- It does not work with provisioned concurrency (choose one or the other)
- Ephemeral storage must be 512 MB or less
- Network connections established during init may be stale after restore -- reinitialize them
- For Python and .NET, SnapStart adds charges for snapshot caching and restoration (Java SnapStart has no extra charge)
For Python, use the runtime hooks from the snapshot-restore-py library (included in the managed Python runtimes) to run code after snapshot restoration:
import json

import boto3
from snapshot_restore_py import register_after_restore

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('my-table')

@register_after_restore
def reinitialize():
    """Runs after SnapStart restores from snapshot."""
    # Re-establish connections that may have gone stale
    global dynamodb, table
    dynamodb = boto3.resource('dynamodb')
    table = dynamodb.Table('my-table')
def handler(event, context):
item = table.get_item(Key={'id': event['id']})
return {
'statusCode': 200,
'body': json.dumps(item['Item'], default=str)
}
Step 6: Provisioned Concurrency (When You Need Zero Cold Starts)
If your use case demands absolutely zero cold starts -- payment processing, real-time APIs, latency-sensitive endpoints -- use provisioned concurrency. It keeps a pool of pre-initialized environments always warm.
aws lambda put-provisioned-concurrency-config \
--function-name my-function \
--qualifier my-alias \
--provisioned-concurrent-executions 10
This keeps 10 warm instances ready at all times. Cost: you pay for these instances whether they're handling requests or not.
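How many warm instances do you actually need? Required concurrency follows Little's law: concurrent executions ≈ request rate × average duration. A sketch for sizing the pool, with headroom for bursts (the 30% default is an assumption, not an AWS recommendation):

```python
import math

def provisioned_concurrency(rps, avg_duration_s, headroom=0.3):
    """Estimate provisioned concurrency for steady traffic.

    Little's law: in-flight requests = arrival rate * time in system.
    headroom pads the estimate for bursts above the average rate.
    """
    steady = rps * avg_duration_s
    return math.ceil(steady * (1 + headroom))

# 200 req/s at 120 ms average -> 24 concurrent, 32 with 30% headroom
print(provisioned_concurrency(200, 0.120))  # 32
```

Requests above the provisioned pool still execute; they just fall back to on-demand instances and may hit a cold start, which is the spillover discussed below.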
Use Application Auto Scaling to adjust provisioned concurrency based on traffic patterns:
# Register the scalable target
aws application-autoscaling register-scalable-target \
--service-namespace lambda \
--resource-id function:my-function:my-alias \
--scalable-dimension lambda:function:ProvisionedConcurrency \
--min-capacity 5 \
--max-capacity 50
# Create a target tracking policy
aws application-autoscaling put-scaling-policy \
--service-namespace lambda \
--resource-id function:my-function:my-alias \
--scalable-dimension lambda:function:ProvisionedConcurrency \
--policy-name lambda-scaling-policy \
--policy-type TargetTrackingScaling \
--target-tracking-scaling-policy-configuration '{
"TargetValue": 0.7,
"PredefinedMetricSpecification": {
"PredefinedMetricType": "LambdaProvisionedConcurrencyUtilization"
}
}'
Monitor spillover (requests that hit a cold start despite provisioned concurrency) with:
-- CloudWatch Logs Insights
filter @type = "REPORT"
| stats sum(strcontains(@message, "Init Duration")) as coldStarts,
count(*) as totalInvocations
by bin(5m)
If coldStarts is consistently non-zero, increase your provisioned concurrency minimum.
Decision Guide
| Situation | Solution | Cold Start Reduction | Cost Impact |
|---|---|---|---|
| Quick win, any runtime | ARM64 + smaller package | 30-50% | Saves 20% |
| Python/Java/.NET workloads | SnapStart | 70-90% | Minimal |
| Zero tolerance for cold starts | Provisioned concurrency | ~100% (watch for spillover) | $$$ (always-on) |
| Predictable traffic patterns | Provisioned + auto scaling | ~100% | $$ (scales with traffic) |
Key Takeaways
Start with the free optimizations: switch to ARM64, trim your deployment package, and move heavy initialization outside the handler. If cold starts are still a problem, enable SnapStart for Python/Java/.NET functions -- it's the best cost-to-performance ratio. Reserve provisioned concurrency for the critical paths where even a single cold start is unacceptable. Always measure with CloudWatch Logs Insights before and after changes so you know what actually worked.