When (Not If) Containers Misbehave

In my previous article, Fargate + Lambda are better together, I introduced the concept of a hybrid architecture where I can route traffic between ECS Fargate and Lambda based on real-time conditions.

The most common objection I hear is:

Why not just use containers for everything?

In this article, I try to explain why "containers are cheaper" is an oversimplification.

The Hidden Cost Equation

Before going into details, let's look at the costs (assuming I did the calculations correctly):

Assuming typical production capacity:

Daily capacity: ~2M requests/day per task
Monthly capacity: ~60M requests/month per task

Monthly Cost Comparison:

Monthly Requests	Tasks Needed	Lambda Cost	Fargate Cost	Winner
1M	1	$2.17	$56.92	Lambda 96% cheaper
10M	1	$21.67	$56.92	Lambda 62% cheaper
25M	1	$54.17	$56.92	Lambda 5% cheaper
50M	1	$108.33	$56.92	Fargate 47% cheaper
65M	2	$140.83	$113.84	Fargate 19% cheaper
100M	2	$216.67	$113.84	Fargate 47% cheaper
130M	3	$281.67	$170.76	Fargate 39% cheaper

Daily Cost Comparison:

Daily Requests	Tasks Needed	Lambda Cost	Fargate Cost	Winner
1M	1	$2.17	$1.90	Fargate 12% cheaper
10M	5	$21.67	$9.50	Fargate 56% cheaper
25M	12	$54.17	$22.80	Fargate 58% cheaper
50M	24	$108.33	$45.60	Fargate 58% cheaper
65M	31	$140.83	$58.90	Fargate 58% cheaper
100M	47	$216.67	$89.30	Fargate 59% cheaper
130M	61	$281.67	$115.90	Fargate 59% cheaper

The tables show that containers become cheaper somewhere between 25M and 50M requests/month (Fargate is clearly ahead by 50M), or from about 1M requests/day.
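As a rough sketch, the monthly table can be reproduced in a few lines of Python. The unit prices here are assumptions back-derived from the table itself (about $2.1667 per million Lambda requests, $56.92 per always-on task per month), not AWS list prices, and the function names are only illustrative.

```python
import math

# Assumptions back-derived from the monthly table, not official AWS pricing.
LAMBDA_USD_PER_MILLION = 2.1667        # $216.67 / 100M requests
FARGATE_USD_PER_TASK_MONTH = 56.92     # one task running all month
TASK_CAPACITY_PER_MONTH = 60_000_000   # ~60M requests/month per task


def lambda_monthly_cost(requests: int) -> float:
    """Lambda cost grows linearly with the number of requests."""
    return requests / 1_000_000 * LAMBDA_USD_PER_MILLION


def fargate_monthly_cost(requests: int) -> float:
    """Fargate cost grows in steps: one more task per 60M requests."""
    tasks = max(1, math.ceil(requests / TASK_CAPACITY_PER_MONTH))
    return tasks * FARGATE_USD_PER_TASK_MONTH


def break_even_requests() -> float:
    """Monthly requests at which a single task costs the same as Lambda."""
    return FARGATE_USD_PER_TASK_MONTH / LAMBDA_USD_PER_MILLION * 1_000_000


if __name__ == "__main__":
    for requests in (1_000_000, 25_000_000, 65_000_000, 130_000_000):
        print(
            f"{requests / 1e6:.0f}M",
            f"lambda=${lambda_monthly_cost(requests):.2f}",
            f"fargate=${fargate_monthly_cost(requests):.2f}",
        )
    print(f"break-even: {break_even_requests() / 1e6:.1f}M requests/month")
```

Under these assumptions the single-task break-even sits at roughly 26M requests/month. The step function also explains why Fargate's advantage shrinks right after a new task is added (65M and 130M in the table).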

If I factor in the human cost, based on European senior developer rates (Stack Overflow 2024 Survey: €75K average ≈ €36/hour):

Engineering Effort	Time Investment	Cost
Load testing and optimization	2 weeks (80 hours)	€2,880
Traffic controller development	2 weeks (80 hours)	€2,880
Production fixes and fine-tuning	1 week (40 hours)	€1,440

Total Fargate human cost: €7,200

With Lambda, I skip most of this complexity:

No failure scenario testing: AWS handles it all for me
No traffic controller: There is no need to invent one
No production fine-tuning: It just works out of the box

On Fargate, I counted 14 operational scenarios, and I need to simulate and prepare for each failure mode, while Lambda eliminates all of them through abstraction.

Even when Fargate appears cheaper on infrastructure costs (around 50M+ requests/month or 1M+ requests/day), the €7,200 engineering investment only pays for itself after many months.

Payback Calculation Formula:

Monthly Savings (USD) = (Lambda Daily Cost - Fargate Daily Cost) × 30 days
Payback Period (months) = €7,200 ÷ Monthly Savings (treating €1 ≈ $1 for simplicity)

Examples:
• 10M requests/day: ($21.67 - $9.50) × 30 = $365.10/month
  Payback = €7,200 ÷ $365.10 = 19.7 months

• 25M requests/day: ($54.17 - $22.80) × 30 = $941.10/month  
  Payback = €7,200 ÷ $941.10 = 7.6 months
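The same formula, written as a small Python helper. This is only a sketch: the function name is illustrative, and like the formula it treats EUR and USD as 1:1, which is an assumption rather than a real exchange rate.

```python
from typing import Optional

ENGINEERING_COST = 7_200.0  # EUR: 200 hours x 36 EUR/hour, from the effort table


def payback_months(
    lambda_daily_usd: float,
    fargate_daily_usd: float,
    upfront_cost: float = ENGINEERING_COST,
    days_per_month: int = 30,
) -> Optional[float]:
    """Months until infrastructure savings cover the upfront engineering cost.

    Returns None when Fargate is not cheaper, because the cost is never recovered.
    EUR and USD are treated as 1:1 (assumption).
    """
    monthly_savings = (lambda_daily_usd - fargate_daily_usd) * days_per_month
    if monthly_savings <= 0:
        return None
    return upfront_cost / monthly_savings


if __name__ == "__main__":
    # Daily costs taken from the daily comparison table.
    for label, lam, far in (("10M/day", 21.67, 9.50), ("50M/day", 108.33, 45.60)):
        print(label, payback_months(lam, far))
```

Returning None for the "never pays back" case keeps callers from treating a negative or infinite number as a valid payback period.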

Human Cost Payback:

Traffic Volume	Daily Savings	Monthly Savings	Payback Period
1M requests/month	N/A	-$54.75	Never (Lambda cheaper)
10M requests/day	$12.17	$365.10	19.7 months
25M requests/day	$31.37	$941.10	7.6 months
50M requests/day	$62.73	$1,881.90	3.8 months

Based on the table, the €7,200 human cost takes anywhere from 7-8 months at extremely high traffic (25M+ requests/day) to multiple years at moderate traffic levels to pay for itself, assuming:

No major architectural changes
No scaling issues requiring rework
No additional failure modes discovered

What Makes Fargate "Complex"

Let's see what it takes to run containers in production:

  ECSCluster:
    Type: AWS::ECS::Cluster
    Properties:
      ClusterName: !Sub "${AWS::StackName}-cluster"
      CapacityProviders:
        - FARGATE
        - FARGATE_SPOT
      DefaultCapacityProviderStrategy:
        - CapacityProvider: FARGATE
          Weight: 2  # 40% on-demand for stability
        - CapacityProvider: FARGATE_SPOT
          Weight: 3  # 60% spot for cost savings
      ClusterSettings:
        - Name: containerInsights
          Value: enhanced

Without Amazon ECS Container Insights enabled, I would not be able to monitor the metrics that signal impending downtime, such as:

CPU
Memory
RunningTaskCount

Most teams start with what I call the 'monolithic service trap' - one big ECS service that seems simple until it isn't.

  MyService:
    Type: AWS::ECS::Service
    Properties:
      ServiceName: !Sub "${AWS::StackName}-my-service"
      Cluster: !Ref ECSCluster
      TaskDefinition: !Ref MyTaskDefinition
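      LaunchType: FARGATE  # Note: an explicit LaunchType bypasses the cluster's DefaultCapacityProviderStrategy (no FARGATE_SPOT); omit it to use that strategy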
      PlatformVersion: LATEST
      DeploymentConfiguration:
        Strategy: ROLLING
        MaximumPercent: 200          # Allow 100% extra capacity during deployment  
        MinimumHealthyPercent: 100   # Keep all current tasks running until new ones are healthy
        DeploymentCircuitBreaker:
          Enable: true
          Rollback: true
      DeploymentController:
        Type: ECS
      NetworkConfiguration:
        AwsvpcConfiguration:
          AssignPublicIp: DISABLED
          SecurityGroups:
            - !Ref ECSSecurityGroupId
          Subnets: !FindInMap [!Ref StageName, !Ref "AWS::Region", PrivateSubnets]
      LoadBalancers:
        - ContainerName: my-container
          ContainerPort: 3000
          TargetGroupArn:
            Fn::ImportValue:
              !Sub "${ALBStackName}-My-ECS-TargetGroup-Arn"
      HealthCheckGracePeriodSeconds: 60

  MyTaskDefinition:
    Type: AWS::ECS::TaskDefinition
    Properties:
      Family: !Sub "${AWS::StackName}-my-task-definition"
      NetworkMode: awsvpc
      RequiresCompatibilities:
        - FARGATE
      Cpu: 2048
      Memory: 4096
      ExecutionRoleArn: !GetAtt ECSExecutionRole.Arn
      TaskRoleArn: !GetAtt ECSTaskRole.Arn
      ContainerDefinitions:
        - Name: my-container
          Image: !Ref MyImageUri
          Essential: true
          PortMappings:
            - ContainerPort: 3000
              Protocol: tcp
          Environment:
            - Name: AWS_REGION
              Value: !Ref "AWS::Region"
            - Name: LOG_LEVEL
              Value: !FindInMap [LogLevel, !Ref StageName, level]
            - Name: NODE_OPTIONS
              Value: "--max-old-space-size=3072"  # Use 75% of 4GB memory for heap
      ....

The problems with this approach are:

Single Point of Failure: One bad deployment kills everything
All-or-Nothing Updates: I can't update one part without affecting the rest
Blast Radius: One misbehaving endpoint path can bring everything down

I like to organise my architecture based on specific parameters, and they can vary based on the application. I am working on an international streaming application, so I could potentially split my services into:

country (maybe it is better to have ECS per country)
type of users
type of subscriptions
type of platform

These are only examples, but splitting along such lines has some obvious advantages:

Fault Isolation
Independent Deployments
Smaller Blast Radius
Resource Isolation
Independent Capacity
Observability
Scalability

A single monolithic service creates availability and scalability bottlenecks, where the failure of one component can impact the entire system. The opposite extreme, excessive fragmentation, brings its own overhead. When services are correctly isolated, a bug or outage in one service cannot directly impact other services. This isolation is particularly valuable when services have different reliability requirements or face varying load patterns, as failures remain localised instead of bringing everything down.

A lesson I have learned is that service decomposition becomes economically justified when different components have significantly different resource and traffic patterns. Each component then scales independently according to its actual needs. Without this separation, the entire system must be provisioned for peak load across all components, often resulting in overprovisioning and wasted resources.
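To make the overprovisioning argument concrete, here is a deliberately crude model in Python. The component names and numbers are hypothetical, and the model assumes that in a monolith every task must be sized for the most demanding component, which is a simplification rather than a measured result.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str
    vcpu_per_task: float  # task size this component needs
    tasks_at_peak: int    # tasks needed at its own peak


def monolith_vcpu(components: list[Component]) -> float:
    """Every task serves every path, so each task gets the largest size."""
    largest = max(c.vcpu_per_task for c in components)
    return largest * sum(c.tasks_at_peak for c in components)


def split_vcpu(components: list[Component]) -> float:
    """Each component is sized and scaled on its own."""
    return sum(c.vcpu_per_task * c.tasks_at_peak for c in components)


if __name__ == "__main__":
    # Hypothetical components, for illustration only.
    parts = [
        Component("playback", vcpu_per_task=2.0, tasks_at_peak=4),
        Component("catalog", vcpu_per_task=0.5, tasks_at_peak=6),
    ]
    print("monolith vCPU:", monolith_vcpu(parts))
    print("split vCPU:", split_vcpu(parts))
```

With these made-up figures the monolith provisions 20 vCPU against 11 vCPU for the split services. The gap closes as the components' resource profiles become more similar, which is why decomposition only pays off when they differ significantly.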

I want to highlight some important configurations for AWS::ECS::Service and AWS::ECS::TaskDefinition:

Deployment Strategy

While AWS ECS supports blue/green deployments through CodeDeploy integration, this feature does not work when ALB target groups dynamically route traffic between ECS and Lambda functions. Blue/green deployments require dedicated target groups for each deployment stage, which conflicts with my setup. The rolling deployment strategy provides zero-downtime deployments while maintaining compatibility with my hybrid traffic management system.
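For the rolling strategy, MinimumHealthyPercent and MaximumPercent translate into a floor and a ceiling on the number of running tasks during a deployment. The sketch below follows the documented ECS rounding (minimum rounded up, maximum rounded down); the function name is illustrative.

```python
def rolling_deployment_bounds(
    desired_count: int,
    minimum_healthy_percent: int = 100,
    maximum_percent: int = 200,
) -> tuple[int, int]:
    """Return (min_running, max_running) tasks allowed during a rolling deployment.

    Integer arithmetic avoids floating-point rounding surprises.
    """
    # Ceiling division: the healthy floor is rounded up.
    min_running = (desired_count * minimum_healthy_percent + 99) // 100
    # Floor division: the capacity ceiling is rounded down.
    max_running = desired_count * maximum_percent // 100
    return min_running, max_running


if __name__ == "__main__":
    # The settings used in this article: 100% minimum healthy, 200% maximum.
    print(rolling_deployment_bounds(4))
    # A tighter, hypothetical configuration for comparison.
    print(rolling_deployment_bounds(3, 50, 150))
```

With 100/200, a service with 4 desired tasks never drops below 4 running tasks and may briefly run up to 8, so the deployment temporarily doubles the Fargate bill for that service in exchange for zero downtime.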

Resource Allocation: CPU and Memory Sizing

Cpu: 2048    # 2 vCPU - Increased from 1024 due to load testing findings
Memory: 4096 # 4GB - Increased from 2048 due to memory pressure and GC issues

The resource allocation reflects the results of load testing. Initially, I ran tasks with 1 vCPU and 2GB RAM, but that resulted in cascading performance issues during traffic spikes:

CPU Saturation
Memory Pressure

Fargate Cold Start

As Vlad Ionescu pointed out in his post, Fargate takes its time to scale up. From what I can see, it takes up to 5 minutes. Again thanks to load testing, I found that newly launched tasks were receiving full traffic as soon as they registered with the ALB target group and were considered 'healthy', before they had completed their initialisation routines, which contributed to 502/503 errors.

To mitigate this, I added an extra configuration:

  MyServiceSTargetGroup:
    Type: AWS::ElasticLoadBalancingV2::TargetGroup
    Properties:
      TargetGroupAttributes:
        - Key: slow_start.duration_seconds
          Value: "60"
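Slow start makes the ALB ramp up a new target's share of requests linearly over the configured duration instead of sending it a full share at once. The Python below is a simplified model of that ramp, not the ALB's actual algorithm: it assumes every warm target has weight 1 and the new target's weight grows linearly from 0 to 1.

```python
def slow_start_factor(seconds_since_healthy: float, duration_seconds: float = 60.0) -> float:
    """Linear ramp from 0.0 to 1.0 over the slow start window."""
    if duration_seconds <= 0:
        return 1.0  # slow start disabled: full share immediately
    return max(0.0, min(1.0, seconds_since_healthy / duration_seconds))


def new_target_share(
    warm_targets: int,
    seconds_since_healthy: float,
    duration_seconds: float = 60.0,
) -> float:
    """Approximate fraction of traffic sent to one new target."""
    factor = slow_start_factor(seconds_since_healthy, duration_seconds)
    total_weight = warm_targets + factor
    return factor / total_weight if total_weight > 0 else 0.0


if __name__ == "__main__":
    # One new task joining three warm tasks, with the 60-second window used here.
    for t in (0, 15, 30, 60):
        print(f"t={t:>2}s share={new_target_share(3, t):.3f}")
```

In this model the new task starts with no traffic and only reaches its fair quarter of the load after 60 seconds, which gives the runtime time to finish initialising.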

In the end, by increasing resource allocation to 2 vCPU/4GB and implementing a 60-second ALB slow start period, I achieved:

Reduced Error Rates: The combination eliminated the 502/503 error spikes during scaling events
Better Cost Efficiency: While individual tasks cost more, we need fewer of them and experience fewer scaling thrashes

The 14 potential pitfalls

Lambda fully delegates operations to AWS. It feels like magic.

ECS Fargate has its complexity, but this is the price of control.

Here are all the cases I found while testing:

Failure Category	What Can Go Wrong	Real-World Impact
:fire: Emergency Failures	Zero running tasks available	:rotating_light: Complete service outage - immediate Lambda failover required
:warning: Capacity Issues	Down to minimum viable tasks	:chart_with_downwards_trend: Gradual failure prevention - start shifting to Lambda
:skull: Critical Capacity	Only one task left running	:bomb: Last chance before outage - emergency traffic routing
:rocket: Traffic Overload	Requests exceed Lambda capacity	:arrow_up: Lambda can't handle load - need ECS scaling
:boom: Traffic Spike	Sudden burst traffic overwhelms system	:zap: Viral content or DDoS - hybrid ECS+Lambda mode
:ocean: Sustained Load	High traffic persists for extended period	:moneybag: Cost optimization - shift to ECS-heavy hybrid mode
:sleeping: Low Traffic	Traffic drops below cost-effective threshold	:money_with_wings: ECS waste - switch back to Lambda
:computer: CPU Exhaustion	CPU utilization reaches warning threshold	:hot_face: Resource competition - enable Lambda overflow
:red_circle: CPU Emergency	CPU utilization hits critical levels	:fire_engine: Cascade failure risk - emergency Lambda routing
:floppy_disk: Memory Pressure	Memory usage approaches warning limits	:exploding_head: Memory leaks detected - Lambda overflow
:bangbang: Memory Emergency	Memory usage hits critical threshold	:sos: OOM kills imminent - immediate failover
:arrow_down: Scale-In Problems	Auto-scaling fails to reduce capacity efficiently	:dollar: Idle tasks burning money - gradual scale-down
:skull: Spot Interruptions	AWS reclaims spot instances	:cloud_with_lightning: Multiple tasks lost suddenly
:dart: Path-Specific Issues	Individual service failures	:construction: Blast radius contained to one service
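A few of the rows above can be expressed as a simple decision function. This is a minimal sketch covering only the capacity, CPU and memory scenarios; the thresholds and names are hypothetical, and it is not the Traffic Controller implementation.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    FAILOVER = "route all traffic to Lambda"
    EMERGENCY_ROUTING = "emergency routing to Lambda"
    SHIFT_TO_LAMBDA = "start shifting traffic to Lambda"
    LAMBDA_OVERFLOW = "send overflow traffic to Lambda"
    STAY = "keep current routing"


@dataclass(frozen=True)
class Snapshot:
    running_tasks: int
    min_tasks: int
    cpu_percent: float
    memory_percent: float


# Hypothetical thresholds, for illustration only.
CPU_WARNING, CPU_CRITICAL = 70.0, 90.0
MEMORY_WARNING, MEMORY_CRITICAL = 75.0, 90.0


def decide(s: Snapshot) -> Action:
    """Check the most severe conditions first, then fall through to milder ones."""
    if s.running_tasks == 0 or s.memory_percent >= MEMORY_CRITICAL:
        return Action.FAILOVER
    if s.running_tasks == 1 or s.cpu_percent >= CPU_CRITICAL:
        return Action.EMERGENCY_ROUTING
    if s.running_tasks <= s.min_tasks:
        return Action.SHIFT_TO_LAMBDA
    if s.cpu_percent >= CPU_WARNING or s.memory_percent >= MEMORY_WARNING:
        return Action.LAMBDA_OVERFLOW
    return Action.STAY


if __name__ == "__main__":
    print(decide(Snapshot(running_tasks=4, min_tasks=2, cpu_percent=75.0, memory_percent=40.0)))
```

Ordering the checks by severity matters: a task at 95% memory should trigger failover even if the task count still looks healthy.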

Notice how many scenarios demand "Lambda failover" or "Lambda overflow" as solutions. This is because Lambda has a unique strength: it can add capacity within seconds, scaling to thousands of concurrent execution environments far faster than Fargate can launch new tasks.

This remarkable ability comes at a cost, as I pay more because AWS manages everything for me. But when Fargate faces any of these 14 challenges, Lambda's instant scalability becomes priceless.

Zero capacity planning becomes an asset when you need emergency capacity instantly
Automatic scaling becomes crucial when Fargate scaling can't keep up with traffic spikes
No infrastructure management means no infrastructure bottlenecks during a crisis
Isolation per request ensures consistent performance regardless of load

Performance trade-offs

Fargate Performance Degrades Under Load: Fargate latency increases as more resources are consumed within the same task. A task handling 1 request performs very differently from one handling thousands simultaneously.
Lambda Maintains Consistent Performance: Lambda stays consistent because it handles 1 request per execution environment
Different Performance Profiles: Fargate excels at sustained moderate load (p95 latency: 30ms vs 70ms for Lambda), while Lambda excels at consistent performance regardless of system load
Cost vs Performance Trade-off: You pay Lambda's higher price for guaranteed isolation and instant scalability
Failure Recovery: Lambda's instant provisioning makes it ideal for emergency scenarios when Fargate fails

Fargate performs better in stable conditions but degrades under stress, and its total cost of ownership is high enough that its complexity pays off only in the long term. Lambda performs consistently at a higher cost per request, but its operational simplicity and faster delivery cycles compensate for the higher price.

It's Not About Cost, It's About Risk

After running this hybrid architecture in production, here's what I've learned:

Lambda is not more expensive
Containers are not cheaper
Hybrid architectures give you the best of both worlds

What's Next: Traffic Controller

In the next article, I'll share the Traffic Controller implementation that makes this hybrid magic possible. The Traffic Controller is a component that dynamically routes traffic between Lambda and ECS based on real-time traffic patterns: it monitors CloudWatch alarms, understands task-aware load distribution, and makes routing decisions.
