When (Not If) Containers Misbehave

In my previous article, "Fargate + Lambda are better together", I introduced a hybrid architecture that routes traffic between ECS Fargate and Lambda based on real-time conditions.

The most common objection I hear is:

  • Why not just use containers for everything?

In this article, I try to explain why "containers are cheaper" is an oversimplification.

The Hidden Cost Equation

Before I go into details, let's address the costs (hopefully I did the calculation correctly):

Assuming typical production capacity:

  • Daily capacity: ~2M requests/day per task

  • Monthly capacity: ~60M requests/month per task

Monthly Cost Comparison:

| Monthly Requests | Tasks Needed | Lambda Cost | Fargate Cost | Winner |
|---|---|---|---|---|
| 1M | 1 | $2.17 | $56.92 | Lambda 96% cheaper |
| 10M | 1 | $21.67 | $56.92 | Lambda 62% cheaper |
| 25M | 1 | $54.17 | $56.92 | Lambda 5% cheaper |
| 50M | 1 | $108.33 | $56.92 | Fargate 47% cheaper |
| 65M | 2 | $140.83 | $113.84 | Fargate 19% cheaper |
| 100M | 2 | $216.67 | $113.84 | Fargate 47% cheaper |
| 130M | 3 | $281.67 | $170.76 | Fargate 39% cheaper |

Daily Cost Comparison:

| Daily Requests | Tasks Needed | Lambda Cost | Fargate Cost | Winner |
|---|---|---|---|---|
| 1M | 1 | $2.17 | $1.90 | Fargate 12% cheaper |
| 10M | 5 | $21.67 | $9.50 | Fargate 56% cheaper |
| 25M | 12 | $54.17 | $22.80 | Fargate 58% cheaper |
| 50M | 24 | $108.33 | $45.60 | Fargate 58% cheaper |
| 65M | 31 | $140.83 | $58.90 | Fargate 58% cheaper |
| 100M | 47 | $216.67 | $89.30 | Fargate 59% cheaper |
| 130M | 61 | $281.67 | $115.90 | Fargate 59% cheaper |

The tables show that containers become cheaper somewhere between 25M and 50M requests/month, or from about 1M requests/day.
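
The monthly table can be reproduced with a few lines of Python. This is a minimal sketch, not the calculation used to build the table: the per-million Lambda rate is back-calculated from the table rows (about $2.17 per 1M requests), the task price and capacity are the figures quoted above, and all function and constant names are illustrative.

```python
import math

# Figures read off the monthly table; treat them as the article's
# assumptions, not as current AWS list prices.
LAMBDA_USD_PER_MILLION = 130 / 60        # ~= $2.17 per 1M requests (back-calculated)
FARGATE_USD_PER_TASK_MONTH = 56.92       # one task running all month
REQUESTS_PER_TASK_MONTH = 60_000_000     # ~60M requests/month per task


def tasks_needed(monthly_requests: int) -> int:
    """Number of Fargate tasks needed, with at least one always running."""
    return max(1, math.ceil(monthly_requests / REQUESTS_PER_TASK_MONTH))


def lambda_cost(monthly_requests: int) -> float:
    """Lambda cost grows linearly with the request count."""
    return monthly_requests / 1_000_000 * LAMBDA_USD_PER_MILLION


def fargate_cost(monthly_requests: int) -> float:
    """Fargate cost grows in steps, one step per extra task."""
    return tasks_needed(monthly_requests) * FARGATE_USD_PER_TASK_MONTH


def break_even_requests() -> float:
    """Monthly requests at which a single task costs the same as Lambda."""
    return FARGATE_USD_PER_TASK_MONTH / LAMBDA_USD_PER_MILLION * 1_000_000


if __name__ == "__main__":
    for millions in (1, 10, 25, 50, 65, 100, 130):
        n = millions * 1_000_000
        print(f"{millions}M: Lambda ${lambda_cost(n):.2f}, Fargate ${fargate_cost(n):.2f}")
```

With these inputs, the break-even point for a single task lands at roughly 26M requests/month, which is why the 25M row is almost a tie.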

Then there is the human cost. Based on European senior developer rates (Stack Overflow 2024 Survey: €75K average, roughly €36/hour):

| Engineering Effort | Time Investment | Cost |
|---|---|---|
| Load testing and optimization | 2 weeks (80 hours) | €2,880 |
| Traffic controller development | 2 weeks (80 hours) | €2,880 |
| Production fixes and fine-tuning | 1 week (40 hours) | €1,440 |

Total Fargate human cost: €7,200

With Lambda, I skip most of this complexity:

  • No failure scenario testing: AWS handles it all for me

  • No traffic controller: No need to invent it

  • No production fine-tuning: It just works out of the box

With Fargate, I count 14 operational scenarios (listed later in this article), and I need to simulate and prepare for each failure mode. Lambda removes that burden by abstracting the infrastructure away.

Even when Fargate appears cheaper on infrastructure costs (around 50M+ requests/month or 1M+ requests/day), the €7,200 engineering investment only pays for itself after many months.

Payback Calculation Formula:

Monthly Savings = (Lambda Daily Cost - Fargate Daily Cost) × 30 days
Payback Period (months) = €7,200 ÷ Monthly Savings (treating €1 ≈ $1 for simplicity)

Examples:
• 10M requests/day: ($21.67 - $9.50) × 30 = $365.10/month
  Payback = €7,200 ÷ $365.10 = 19.7 months

• 25M requests/day: ($54.17 - $22.80) × 30 = $941.10/month
  Payback = €7,200 ÷ $941.10 = 7.7 months

Human Cost Payback:

| Traffic Volume | Daily Savings | Monthly Savings | Payback Period |
|---|---|---|---|
| 1M requests/month | N/A | -$54.75 | Never (Lambda cheaper) |
| 10M requests/day | $12.17 | $365.10 | 19.7 months |
| 25M requests/day | $31.37 | $941.10 | 7.7 months |
| 50M requests/day | $62.73 | $1,881.90 | 3.8 months |

Based on the table, the €7,200 engineering investment pays for itself in 7-8 months at extremely high traffic (25M+ requests/day) and takes multiple years at moderate traffic levels, assuming:

  • No major architectural changes

  • No scaling issues requiring rework

  • No additional failure modes discovered
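
The payback formula is easy to turn into a small helper for trying other traffic levels. A minimal sketch under the same simplifications as the examples above: euros and dollars are treated as 1:1, a month is 30 days, and the function name and defaults are illustrative.

```python
from typing import Optional

ENGINEERING_COST = 7_200   # EUR, treated as 1:1 with USD as in the examples above
DAYS_PER_MONTH = 30


def payback_months(lambda_daily_usd: float,
                   fargate_daily_usd: float,
                   engineering_cost: float = ENGINEERING_COST) -> Optional[float]:
    """Months needed for Fargate's infrastructure savings to repay the
    engineering effort, or None when Fargate does not save anything."""
    monthly_savings = (lambda_daily_usd - fargate_daily_usd) * DAYS_PER_MONTH
    if monthly_savings <= 0:
        return None
    return engineering_cost / monthly_savings


if __name__ == "__main__":
    # Daily costs taken from the daily comparison table.
    for label, lam, far in (("1M/day", 2.17, 1.90),
                            ("10M/day", 21.67, 9.50),
                            ("25M/day", 54.17, 22.80),
                            ("50M/day", 108.33, 45.60)):
        print(label, payback_months(lam, far))
```

Fed with the 1M requests/day row ($2.17 vs $1.90), the same formula gives close to 890 months, which shows how slowly the investment is repaid at lower volumes.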

What Makes Fargate "Complex"

Let's see what it takes to run containers in production:

  ECSCluster:
    Type: AWS::ECS::Cluster
    Properties:
      ClusterName: !Sub "${AWS::StackName}-cluster"
      CapacityProviders:
        - FARGATE
        - FARGATE_SPOT
      DefaultCapacityProviderStrategy:
        - CapacityProvider: FARGATE
          Weight: 2  # 40% on-demand for stability
        - CapacityProvider: FARGATE_SPOT
          Weight: 3  # 60% spot for cost savings
      ClusterSettings:
        - Name: containerInsights
          Value: enhanced
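
The percentages in the comments follow directly from the weights: each provider receives its weight divided by the sum of all weights. A minimal sketch of that arithmetic (the function name is illustrative, and it ignores the optional Base setting, which this cluster does not use):

```python
from typing import Dict


def capacity_shares(weights: Dict[str, int]) -> Dict[str, float]:
    """Convert capacity provider weights into the fraction of tasks
    each provider is expected to receive."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return {name: weight / total for name, weight in weights.items()}


if __name__ == "__main__":
    print(capacity_shares({"FARGATE": 2, "FARGATE_SPOT": 3}))
```

Changing the weights to 1 and 1 would give an even split, and 1 and 0 would keep everything on on-demand capacity.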

With Amazon ECS Container Insights enabled, I get the task-level metrics I need to spot problems before they cause downtime, such as:

  • CPU

  • Memory

  • RunningTaskCount

Most teams start with what I call the 'monolithic service trap' - one big ECS service that seems simple until it isn't.

  MyService:
    Type: AWS::ECS::Service
    Properties:
      ServiceName: !Sub "${AWS::StackName}-my-service"
      Cluster: !Ref ECSCluster
      TaskDefinition: !Ref MyTaskDefinition
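      # Note: when LaunchType is set, ECS does not apply the cluster's
      # DefaultCapacityProviderStrategy; omit it to get the FARGATE/FARGATE_SPOT split.
      LaunchType: FARGATE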
      PlatformVersion: LATEST
      DeploymentConfiguration:
        Strategy: ROLLING
        MaximumPercent: 200          # Allow 100% extra capacity during deployment  
        MinimumHealthyPercent: 100   # Keep all current tasks running until new ones are healthy
        DeploymentCircuitBreaker:
          Enable: true
          Rollback: true
      DeploymentController:
        Type: ECS
      NetworkConfiguration:
        AwsvpcConfiguration:
          AssignPublicIp: DISABLED
          SecurityGroups:
            - !Ref ECSSecurityGroupId
          Subnets: !FindInMap [!Ref StageName, !Ref "AWS::Region", PrivateSubnets]
      LoadBalancers:
        - ContainerName: my-container
          ContainerPort: 3000
          TargetGroupArn:
            Fn::ImportValue:
              !Sub "${ALBStackName}-My-ECS-TargetGroup-Arn"
      HealthCheckGracePeriodSeconds: 60

  MyTaskDefinition:
    Type: AWS::ECS::TaskDefinition
    Properties:
      Family: !Sub "${AWS::StackName}-my-task-definition"
      NetworkMode: awsvpc
      RequiresCompatibilities:
        - FARGATE
      Cpu: 2048
      Memory: 4096
      ExecutionRoleArn: !GetAtt ECSExecutionRole.Arn
      TaskRoleArn: !GetAtt ECSTaskRole.Arn
      ContainerDefinitions:
        - Name: my-container
          Image: !Ref MyImageUri
          Essential: true
          PortMappings:
            - ContainerPort: 3000
              Protocol: tcp
          Environment:
            - Name: AWS_REGION
              Value: !Ref "AWS::Region"
            - Name: LOG_LEVEL
              Value: !FindInMap [LogLevel, !Ref StageName, level]
            - Name: NODE_OPTIONS
              Value: "--max-old-space-size=3072"  # Use 75% of 4GB memory for heap
      ....

The problems with this approach are:

  • Single Point of Failure: One bad deployment kills everything

  • All-or-Nothing Updates: Can't update one part without affecting the rest

  • Blast Radius: One misbehaving endpoint path can bring everything down

I like to organise my architecture based on specific parameters, and they can vary based on the application. I am working on an international streaming application, so I could potentially split my services into:

  • country (maybe it is better to have ECS per country)

  • type of users

  • type of subscriptions

  • type of platform

These are only examples, but splitting along such lines has some clear advantages:

  • Fault Isolation

  • Independent Deployments

  • Smaller Blast Radius

  • Resource Isolation

  • Independent Capacity

  • Observability

  • Scalability

A single monolithic service creates availability and scalability bottlenecks, where the failure of one component can impact the entire system. At the opposite extreme lies excessive fragmentation. When services are correctly isolated, a bug or outage in one service cannot directly impact the others. This isolation is particularly valuable when services have different reliability requirements or face varying load patterns, as failures remain localised instead of bringing everything down.

A lesson I have learned is that service decomposition becomes economically justified when components have significantly different resource and traffic patterns. Each component then scales independently according to its actual needs. Without this separation, the entire system must be provisioned for the peak load across all components, which often results in overprovisioning and wasted resources.

I want to highlight some important configurations for AWS::ECS::Service and AWS::ECS::TaskDefinition:

Deployment Strategy

While AWS ECS supports blue/green deployments through CodeDeploy integration, this feature does not work when ALB target groups dynamically route traffic between ECS and Lambda functions. Blue/green deployments require dedicated target groups for each deployment stage, which conflicts with my setup. The rolling deployment strategy provides zero-downtime deployments while maintaining compatibility with my hybrid traffic management system.
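
To see what the rolling settings in the service definition (MaximumPercent: 200, MinimumHealthyPercent: 100) allow during a deployment, the task-count bounds can be worked out as below. This is a minimal sketch with an illustrative function name; the rounding (minimum rounded up, maximum rounded down) follows the behaviour described in the ECS documentation and is worth double-checking for your own setup.

```python
from typing import Tuple


def rolling_deployment_bounds(desired_count: int,
                              minimum_healthy_percent: int = 100,
                              maximum_percent: int = 200) -> Tuple[int, int]:
    """Return (minimum running tasks, maximum total tasks) that ECS
    keeps to while a rolling deployment is in progress."""
    # Ceiling division for the lower bound, floor division for the upper one.
    min_running = -(-desired_count * minimum_healthy_percent // 100)
    max_total = desired_count * maximum_percent // 100
    return min_running, max_total


if __name__ == "__main__":
    print(rolling_deployment_bounds(4))           # settings used in this article
    print(rolling_deployment_bounds(3, 50, 150))  # a tighter, cheaper alternative
```

With 4 desired tasks and the 100/200 settings, all 4 old tasks stay up while up to 4 new ones start alongside them, which is what gives the zero-downtime behaviour at the price of temporarily doubled capacity.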

Resource Allocation: CPU and Memory Sizing

Cpu: 2048    # 2 vCPU - Increased from 1024 due to load testing findings
Memory: 4096 # 4GB - Increased from 2048 due to memory pressure and GC issues

The resource allocation reflects the load testing results. My initial configuration (1 vCPU, 2GB RAM) led to cascading performance issues during traffic spikes:

  • CPU Saturation

  • Memory Pressure
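
Fargate only accepts certain CPU and memory pairings, so a sizing change like this one is worth validating before deployment. The sketch below encodes the pairings up to 4 vCPU as documented for Fargate at the time of writing (larger sizes exist but are omitted, and the list should be verified against the current AWS documentation). The helper names are hypothetical, and the heap helper simply reproduces the 75% rule used for NODE_OPTIONS in the task definition.

```python
# CPU units -> allowed memory values in MiB (sizes up to 4 vCPU only).
FARGATE_MEMORY_MIB = {
    256: (512, 1024, 2048),
    512: tuple(range(1024, 4096 + 1, 1024)),
    1024: tuple(range(2048, 8192 + 1, 1024)),
    2048: tuple(range(4096, 16384 + 1, 1024)),
    4096: tuple(range(8192, 30720 + 1, 1024)),
}


def is_valid_fargate_size(cpu_units: int, memory_mib: int) -> bool:
    """True when the CPU/memory pair is one of the combinations listed above."""
    return memory_mib in FARGATE_MEMORY_MIB.get(cpu_units, ())


def node_heap_mb(memory_mib: int, fraction: float = 0.75) -> int:
    """Heap size for --max-old-space-size, leaving the rest of the task
    memory for the stack, buffers, and other non-heap usage."""
    return int(memory_mib * fraction)


if __name__ == "__main__":
    print(is_valid_fargate_size(1024, 2048))  # initial configuration
    print(is_valid_fargate_size(2048, 4096))  # configuration after load testing
    print(node_heap_mb(4096))
```

Note that 4096 MiB is the smallest memory size listed for 2048 CPU units, so moving to 2 vCPU implies at least 4GB of memory anyway.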

Fargate Cold Start

As Vlad Ionescu pointed out in his post, Fargate takes its time to scale up. From what I can see, it takes up to 5 minutes. Load testing also revealed that newly launched tasks received full traffic as soon as they were registered with the ALB target group and marked 'healthy', before they had completed their initialisation routines, which contributed to 502/503 errors.

To mitigate this, I added an extra configuration:

  MyServiceSTargetGroup:
    Type: AWS::ElasticLoadBalancingV2::TargetGroup
    Properties:
      TargetGroupAttributes:
        - Key: slow_start.duration_seconds
          Value: "60"
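
Roughly speaking, slow start ramps a newly registered target's share of requests up linearly over the configured duration instead of sending it a full share at once. The function below is a simplified model of that ramp for building intuition, not the ALB's actual algorithm, and its name and parameters are illustrative.

```python
def slow_start_share(seconds_since_healthy: float,
                     duration_seconds: float = 60.0) -> float:
    """Fraction of a full traffic share that a new target receives,
    assuming a purely linear ramp over the slow start window."""
    if duration_seconds <= 0:
        return 1.0  # slow start disabled: full share immediately
    ratio = seconds_since_healthy / duration_seconds
    return max(0.0, min(1.0, ratio))


if __name__ == "__main__":
    for t in (0, 15, 30, 60, 120):
        print(t, slow_start_share(t))
```

Under this model a task that became healthy 30 seconds ago receives about half of its eventual share, which gives it time to finish warming up before taking full load.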

In the end, by increasing resource allocation to 2 vCPU/4GB and implementing a 60-second ALB slow start period, I achieved:

  • Reduced Error Rates: The combination eliminated the 502/503 error spikes during scaling events

  • Better Cost Efficiency: While individual tasks cost more, we need fewer of them and experience fewer scaling thrashes

The 14 potential pitfalls

Lambda is a full delegation to AWS. It is magic.

ECS Fargate has its complexity, but this is the price of control.

Here are all the cases I found while testing:

| Failure Category | What Can Go Wrong | Real-World Impact |
|---|---|---|
| 🔥 Emergency Failures | Zero running tasks available | 🚨 Complete service outage - immediate Lambda failover required |
| ⚠️ Capacity Issues | Down to minimum viable tasks | 📉 Gradual failure prevention - start shifting to Lambda |
| 💀 Critical Capacity | Only one task left running | 💣 Last chance before outage - emergency traffic routing |
| 🚀 Traffic Overload | Requests exceed Lambda capacity | ⬆️ Lambda can't handle load - need ECS scaling |
| 💥 Traffic Spike | Sudden burst traffic overwhelms system | ⚡ Viral content or DDoS - hybrid ECS+Lambda mode |
| 🌊 Sustained Load | High traffic persists for extended period | 💰 Cost optimization - shift to ECS-heavy hybrid mode |
| 😴 Low Traffic | Traffic drops below cost-effective threshold | 💸 ECS waste - switch back to Lambda |
| 💻 CPU Exhaustion | CPU utilization reaches warning threshold | 🥵 Resource competition - enable Lambda overflow |
| 🔴 CPU Emergency | CPU utilization hits critical levels | 🚒 Cascade failure risk - emergency Lambda routing |
| 💾 Memory Pressure | Memory usage approaches warning limits | 🤯 Memory leaks detected - Lambda overflow |
| ‼️ Memory Emergency | Memory usage hits critical threshold | 🆘 OOM kills imminent - immediate failover |
| ⬇️ Scale-In Problems | Auto-scaling fails to reduce capacity efficiently | 💵 Idle tasks burning money - gradual scale-down |
| 💀 Spot Interruptions | AWS reclaims spot instances | 🌩️ Multiple tasks lost suddenly |
| 🎯 Path-Specific Issues | Individual service failures | 🚧 Blast radius contained to one service |
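
Several rows of the table boil down to comparing a few metrics against warning and critical thresholds. The sketch below illustrates that idea for the capacity, CPU, and memory rows only. It is not the Traffic Controller covered in the follow-up article: every threshold, name, and mode label here is a hypothetical placeholder.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceMetrics:
    running_tasks: int
    cpu_percent: float
    memory_percent: float


# Hypothetical thresholds, for illustration only.
MIN_VIABLE_TASKS = 2
CPU_WARNING, CPU_CRITICAL = 70.0, 90.0
MEMORY_WARNING, MEMORY_CRITICAL = 75.0, 90.0


def routing_mode(m: ServiceMetrics) -> str:
    """Map current ECS metrics to a coarse routing decision."""
    # Emergency rows: no tasks, a single task, or critical CPU/memory.
    if m.running_tasks <= 1:
        return "lambda_failover"
    if m.cpu_percent >= CPU_CRITICAL or m.memory_percent >= MEMORY_CRITICAL:
        return "lambda_failover"
    # Warning rows: at minimum capacity or approaching resource limits.
    if (m.running_tasks <= MIN_VIABLE_TASKS
            or m.cpu_percent >= CPU_WARNING
            or m.memory_percent >= MEMORY_WARNING):
        return "lambda_overflow"
    return "ecs"


if __name__ == "__main__":
    print(routing_mode(ServiceMetrics(running_tasks=4, cpu_percent=75, memory_percent=40)))
```

Checking the emergency conditions first means a critical signal is never masked by a milder warning raised at the same time.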

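Notice how many scenarios demand "Lambda failover" or "Lambda overflow" as solutions. This is because Lambda has a unique strength: it can add execution environments within seconds and scale to thousands of concurrent invocations (within the account's concurrency limits), while a new Fargate task can take minutes to become available.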

This remarkable ability comes at a cost, as I pay more because AWS manages everything for me. But when Fargate faces any of these 14 challenges, Lambda's instant scalability becomes priceless.

  • Zero capacity planning becomes an asset when you need emergency capacity instantly

  • Automatic scaling becomes crucial when Fargate scaling can't keep up with traffic spikes

  • No infrastructure management means no infrastructure bottlenecks during a crisis

  • Isolation per request ensures consistent performance regardless of load

Performance trade-offs

  1. Fargate Performance Degrades Under Load: Fargate latency increases as more resources are consumed within the same task. A task handling 1 request performs very differently from one handling thousands simultaneously

  2. Lambda Maintains Consistent Performance: Lambda stays consistent because it's 1 request per execution environment

  3. Different Performance Profiles: Fargate excels at sustained moderate load (p95: 30ms on Fargate vs 70ms on Lambda), while Lambda excels at consistent performance regardless of system load

  4. Cost vs Performance Trade-off: You pay Lambda's higher price for guaranteed isolation and instant scalability

  5. Failure Recovery: Lambda's instant provisioning makes it ideal for emergency scenarios when Fargate fails
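
Because Lambda serves one request per execution environment at a time, the number of environments needed can be estimated as requests per second multiplied by the average request duration. A minimal sketch of that estimate: the function name is illustrative, traffic is assumed to be spread evenly over the day (real peaks will be higher), and the 70 ms duration is borrowed from the p95 figure above purely as a placeholder.

```python
SECONDS_PER_DAY = 86_400


def required_concurrency(requests_per_day: float,
                         avg_duration_seconds: float) -> float:
    """Average number of concurrent Lambda execution environments
    needed to serve an evenly spread daily request volume."""
    requests_per_second = requests_per_day / SECONDS_PER_DAY
    return requests_per_second * avg_duration_seconds


if __name__ == "__main__":
    # 25M requests/day at an assumed 70 ms per request.
    print(required_concurrency(25_000_000, 0.07))
```

At 25M requests/day and 70 ms per request this comes to about 20 concurrent environments on average, so the figure that matters for capacity is the peak rate, not the daily total.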

Fargate performs better in stable conditions but degrades under stress, and its total cost of ownership is high enough that its complexity pays off only in the long term. Lambda performs consistently at a higher cost per request, and that higher price buys operational simplicity and faster delivery cycles.

It's Not About Cost, It's About Risk

After running this hybrid architecture in production, here's what I've learned:

  • Lambda is not more expensive

  • Containers are not cheaper

  • Hybrid architectures give you the best of both worlds

What's Next: Traffic Controller

In the next article, I'll share the Traffic Controller implementation that makes this hybrid magic possible. The Traffic Controller is a component that dynamically routes traffic between Lambda and ECS based on real-time traffic patterns: it monitors CloudWatch alarms, distributes load according to the number of running tasks, and makes routing decisions.