How to Debug and Fix AWS Lambda Cold Starts
Your Lambda function responds in 50ms most of the time. But every few minutes, a request takes 2-3 seconds. Users notice. Your monitoring dashboard shows latency spikes that look like a heartbeat monitor. The culprit: cold starts.
A cold start happens when AWS needs to spin up a fresh execution environment for your Lambda function -- downloading your code, starting the runtime, and running your initialization logic. This article shows you how to measure cold starts, identify the bottleneck, and fix them.
Step 1: Confirm You Have a Cold Start Problem
Not every slow Lambda invocation is a cold start. Before optimizing, confirm the issue using CloudWatch Logs.
Every Lambda invocation logs a REPORT line. Cold starts include an extra field called Init Duration:
REPORT RequestId: abc-123
Duration: 45.12 ms
Billed Duration: 46 ms
Memory Size: 256 MB
Max Memory Used: 89 MB
Init Duration: 1823.45 ms <-- This only appears on cold starts
If you see Init Duration, that invocation was a cold start. The value tells you exactly how long initialization took.
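If you are scripting your own log analysis, the Init Duration field can be pulled out of a raw REPORT line with a regular expression. A minimal sketch (the sample lines below are hypothetical):

```python
import re

# Matches the "Init Duration" field that Lambda appends to REPORT
# lines on cold starts; warm invocations have no such field.
INIT_RE = re.compile(r"Init Duration:\s*([\d.]+)\s*ms")

def init_duration_ms(report_line):
    """Return init duration in ms, or None for a warm invocation."""
    m = INIT_RE.search(report_line)
    return float(m.group(1)) if m else None

cold = "REPORT RequestId: abc-123 Duration: 45.12 ms Init Duration: 1823.45 ms"
warm = "REPORT RequestId: def-456 Duration: 44.80 ms"
print(init_duration_ms(cold))  # 1823.45
print(init_duration_ms(warm))  # None
```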
Query Cold Starts with CloudWatch Logs Insights
Go to CloudWatch > Logs Insights, select your Lambda function's log group, and run:
filter @type = "REPORT"
| fields @requestId, @duration, @initDuration, @memorySize, @maxMemoryUsed
| filter ispresent(@initDuration)
| stats count() as coldStarts,
avg(@initDuration) as avgColdStart,
max(@initDuration) as maxColdStart,
pct(@initDuration, 95) as p95ColdStart
by bin(1h)
| sort by bin(1h) desc
This gives you cold start count and average/p95 init duration per hour. If your avgColdStart is under 200ms on a lightweight runtime like Node.js or Python, you probably don't need to optimize. If it's over 500ms, keep reading.
Enable Lambda Insights for Deeper Visibility
For ongoing monitoring, enable CloudWatch Lambda Insights on your function:
# Use the latest LambdaInsightsExtension layer version for your region
aws lambda update-function-configuration \
    --function-name my-function \
    --layers arn:aws:lambda:us-east-1:580247275435:layer:LambdaInsightsExtension:49
Lambda Insights tracks cold start frequency, init duration, and memory usage as first-class metrics you can alarm on.
Step 2: Identify the Bottleneck
Cold start time breaks down into three phases:
- Environment setup (~100-200ms) -- AWS downloads your code and starts the runtime. You can't control this directly.
- Runtime initialization -- The language runtime starts up. Varies by language: Python ~200ms, Node.js ~150ms, Java ~1-3s, .NET ~400ms.
- Your init code -- Code that runs outside the handler: imports, database connections, SDK client creation. This is where most of the time goes.
To figure out where your time is spent, add timing logs to your initialization:
import time
import json
start = time.time()
import boto3 # Heavy import
print(f"boto3 import: {(time.time() - start) * 1000:.0f}ms")
start = time.time()
dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('my-table')
print(f"DynamoDB client init: {(time.time() - start) * 1000:.0f}ms")
def handler(event, context):
# Your handler code here
item = table.get_item(Key={'id': event['id']})
return {
'statusCode': 200,
'body': json.dumps(item['Item'], default=str)
}
Check the CloudWatch logs for a cold start invocation. You'll see exactly which import or initialization step is eating your time.
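If the timing logs show one client dominating init, and that client is only needed on some code paths, you can defer its construction until first use so it stays off the cold start path entirely. A sketch of the pattern (`get_s3_client` and the `needs_s3` event field are illustrative names, not part of any API):

```python
import functools

@functools.lru_cache(maxsize=1)
def get_s3_client():
    """Built on first call, cached for the lifetime of the environment.

    Cold starts no longer pay for this; the first invocation that
    actually needs S3 pays instead.
    """
    import boto3  # deferred import: stays off the cold start path
    return boto3.client("s3")

def handler(event, context):
    if event.get("needs_s3"):
        s3 = get_s3_client()  # constructed here, exactly once
        ...
    return {"statusCode": 200}
```

The trade-off: the first request down that path absorbs the init cost, so use this only for clients that are genuinely optional per invocation.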
Step 3: Reduce Package Size
Trimming your deployment package is the single most impactful optimization you can make without changing architecture. AWS has to download your deployment package on every cold start, so a smaller package means a faster download.
Before: An 8MB deployment package with every dependency bundled.
After: A 2MB package with only what you need.
Practical steps:
# Check what's taking space
du -sh node_modules/* | sort -rh | head -20
# For Node.js: use esbuild to tree-shake and bundle
npx esbuild src/handler.ts --bundle --platform=node --target=node20 \
--outfile=dist/handler.js --minify --external:@aws-sdk/*
# For Python: exclude unnecessary files
pip install -r requirements.txt -t ./package \
--no-cache-dir \
--only-binary=:all:
# Remove test files, docs, and type stubs
cd package
find . -type d -name "tests" -exec rm -rf {} +
find . -type d -name "__pycache__" -exec rm -rf {} +
find . -name "*.pyc" -delete
find . -name "*.pyi" -delete
Key tip: AWS SDK v3 for JavaScript and boto3 for Python are included in the Lambda runtime. Don't bundle them unless you need a specific version.
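The `du` one-liner above works anywhere you have a shell; a portable Python equivalent that ranks the heaviest subdirectories of a package directory looks like this (the `./package` path matches the pip example above):

```python
import os

def dir_sizes(root):
    """Return {subdir: total_bytes} for each immediate child of root."""
    sizes = {}
    for entry in os.scandir(root):
        if entry.is_dir(follow_symlinks=False):
            total = 0
            for dirpath, _dirs, files in os.walk(entry.path):
                for name in files:
                    total += os.path.getsize(os.path.join(dirpath, name))
            sizes[entry.name] = total
    return sizes

if os.path.isdir("./package"):
    # Rank the heaviest dependencies, largest first
    for name, size in sorted(dir_sizes("./package").items(),
                             key=lambda kv: -kv[1])[:20]:
        print(f"{size / 1e6:7.1f} MB  {name}")
```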
Step 4: Switch to ARM64 (Graviton)
Switching to ARM64 (Graviton) typically shaves both cold start and execution time, and AWS quotes up to 34% better price performance for Graviton2 functions. Architecture is a property of the deployment package, so you set it when uploading code:
aws lambda update-function-code \
    --function-name my-function \
    --zip-file fileb://function.zip \
    --architectures arm64
Or in your infrastructure-as-code (e.g., AWS SAM):
Resources:
MyFunction:
Type: AWS::Serverless::Function
Properties:
Handler: handler.handler
Runtime: python3.12
Architectures:
- arm64
MemorySize: 512
ARM64 Lambda functions also cost 20% less per ms of compute. There's no reason not to switch unless you depend on x86-only native binaries.
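The 20% figure comes straight from the published duration prices. A quick back-of-the-envelope check, using the us-east-1 prices at the time of writing (check the Lambda pricing page for your region; the workload numbers are illustrative):

```python
# Published us-east-1 duration prices (USD per GB-second) at the
# time of writing -- verify against the current Lambda pricing page.
X86_PER_GB_S = 0.0000166667
ARM_PER_GB_S = 0.0000133334

def monthly_duration_cost(invocations, avg_ms, memory_mb, price_per_gb_s):
    """Duration cost only; the per-request charge is the same on both."""
    gb_seconds = invocations * (avg_ms / 1000) * (memory_mb / 1024)
    return gb_seconds * price_per_gb_s

# 10M invocations/month, 100 ms average, 512 MB
x86 = monthly_duration_cost(10_000_000, 100, 512, X86_PER_GB_S)
arm = monthly_duration_cost(10_000_000, 100, 512, ARM_PER_GB_S)
print(f"x86: ${x86:.2f}  arm64: ${arm:.2f}  savings: {1 - arm / x86:.0%}")
```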
Step 5: Use SnapStart (Python, Java, .NET)
SnapStart takes a snapshot of your function's initialized execution environment. On subsequent cold starts, Lambda restores from the snapshot instead of re-running initialization. The result: cold starts typically drop from seconds to a few hundred milliseconds.
SnapStart supports Java 11 and newer, Python 3.12 and newer (added November 2024), and .NET 8 and newer.
Enable it for a Python function:
aws lambda update-function-configuration \
--function-name my-function \
--snap-start ApplyOn=PublishedVersions
# You must publish a version to activate SnapStart
aws lambda publish-version --function-name my-function
Important constraints:
- SnapStart works on published versions only, not $LATEST
- It does not work with provisioned concurrency (choose one or the other)
- Ephemeral storage must be 512 MB or less
- Network connections established during init may be stale after restore -- reinitialize them
- For Python and .NET, SnapStart adds charges for snapshot caching and restoration (Java SnapStart has no extra charge)
For Python, use the runtime hooks from the snapshot-restore-py library (included in the managed Python runtimes) to run code after snapshot restoration:
import json

import boto3
from snapshot_restore_py import register_after_restore

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('my-table')

@register_after_restore
def reinitialize():
    """Runs after SnapStart restores from snapshot."""
    # Re-establish connections that may have gone stale
    global dynamodb, table
    dynamodb = boto3.resource('dynamodb')
    table = dynamodb.Table('my-table')
def handler(event, context):
item = table.get_item(Key={'id': event['id']})
return {
'statusCode': 200,
'body': json.dumps(item['Item'], default=str)
}
Step 6: Provisioned Concurrency (When You Need Zero Cold Starts)
If your use case demands absolutely zero cold starts -- payment processing, real-time APIs, latency-sensitive endpoints -- use provisioned concurrency. It keeps a pool of pre-initialized environments always warm.
aws lambda put-provisioned-concurrency-config \
--function-name my-function \
--qualifier my-alias \
--provisioned-concurrent-executions 10
This keeps 10 warm instances ready at all times. Cost: you pay for these instances whether they're handling requests or not.
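How many warm instances do you actually need? Required concurrency follows Little's law: concurrent executions ≈ request rate × average duration. A sketch for sizing the pool, with headroom for bursts (the 30% default is an assumption, not an AWS recommendation):

```python
import math

def provisioned_concurrency(rps, avg_duration_s, headroom=0.3):
    """Estimate provisioned concurrency for steady traffic.

    Little's law: in-flight requests = arrival rate * time in system.
    headroom pads the estimate for bursts above the average rate.
    """
    steady = rps * avg_duration_s
    return math.ceil(steady * (1 + headroom))

# 200 req/s at 120 ms average -> 24 concurrent, 32 with 30% headroom
print(provisioned_concurrency(200, 0.120))  # 32
```

Requests above the provisioned pool still execute; they just fall back to on-demand instances and may hit a cold start, which is the spillover discussed below.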
Use Application Auto Scaling to adjust provisioned concurrency based on traffic patterns:
# Register the scalable target
aws application-autoscaling register-scalable-target \
--service-namespace lambda \
--resource-id function:my-function:my-alias \
--scalable-dimension lambda:function:ProvisionedConcurrency \
--min-capacity 5 \
--max-capacity 50
# Create a target tracking policy
aws application-autoscaling put-scaling-policy \
--service-namespace lambda \
--resource-id function:my-function:my-alias \
--scalable-dimension lambda:function:ProvisionedConcurrency \
--policy-name lambda-scaling-policy \
--policy-type TargetTrackingScaling \
--target-tracking-scaling-policy-configuration '{
"TargetValue": 0.7,
"PredefinedMetricSpecification": {
"PredefinedMetricType": "LambdaProvisionedConcurrencyUtilization"
}
}'
Monitor spillover (requests that hit a cold start despite provisioned concurrency) with:
-- CloudWatch Logs Insights
filter @type = "REPORT"
| stats sum(strcontains(@message, "Init Duration")) as coldStarts,
count(*) as totalInvocations
by bin(5m)
If coldStarts is consistently non-zero, increase your provisioned concurrency minimum.
Decision Guide
| Situation | Solution | Cold Start Reduction | Cost Impact |
|---|---|---|---|
| Quick win, any runtime | ARM64 + smaller package | 30-50% | Saves 20% |
| Python/Java/.NET workloads | SnapStart | 70-90% | Minimal |
| Zero tolerance for cold starts | Provisioned concurrency | ~100% (watch for spillover) | $$$ (always-on) |
| Predictable traffic patterns | Provisioned + auto scaling | ~100% | $$ (scales with traffic) |
Key Takeaways
Start with the free optimizations: switch to ARM64, trim your deployment package, and move heavy initialization outside the handler. If cold starts are still a problem, enable SnapStart for Python/Java/.NET functions -- it's the best cost-to-performance ratio. Reserve provisioned concurrency for the critical paths where even a single cold start is unacceptable. Always measure with CloudWatch Logs Insights before and after changes so you know what actually worked.