After last week's EBS-related outage, Amazon has started proactively educating its customers about architecting for the cloud. This was a great first presentation in the series.
Services are inherently fault tolerant
HA systems: AZs, Elastic IPs, EBS
Budget vs. time to recovery
AWS hierarchy
- Region
- Availability Zone (AZ)
- EC2 instance
Storage
- Ephemeral storage (attached to the instance; lives and dies with it)
- EBS (snapshots are stored in S3)
EC2
- AMIs (configured instance snapshots)
  - Can drive Auto Scaling groups
- Auto Scaling
  - CloudWatch: react to changing conditions, e.g. number of requests per second
  - Or keep a fixed number of instances alive (if one goes away, Auto Scaling will replace it)
- Elastic Load Balancing
  - Spread traffic across instances
  - Remove unhealthy instances from rotation
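
A rough sketch of the last two behaviors in the notes above (keeping a fixed number of instances alive, and taking unhealthy instances out of rotation), written in plain Python with no AWS calls. The `Instance` class and the `launch`, `in_rotation` and `reconcile` helpers are hypothetical names for illustration, not part of any AWS API.

```python
from dataclasses import dataclass
from itertools import count

_ids = count(1)  # hypothetical instance-id generator


@dataclass
class Instance:
    instance_id: str
    healthy: bool = True


def launch() -> Instance:
    """Stand-in for launching a fresh instance from an AMI."""
    return Instance(f"i-{next(_ids):04d}")


def in_rotation(fleet):
    """Load balancer view: only healthy instances receive traffic."""
    return [i for i in fleet if i.healthy]


def reconcile(fleet, desired):
    """Auto Scaling view: drop unhealthy instances, launch replacements."""
    survivors = in_rotation(fleet)
    while len(survivors) < desired:
        survivors.append(launch())
    return survivors


fleet = [launch() for _ in range(3)]
fleet[1].healthy = False              # one instance fails its health check
serving = in_rotation(fleet)          # traffic now goes to the other two
fleet = reconcile(fleet, desired=3)   # the group is brought back to three
print(len(serving), len(fleet))
```

In a real deployment both loops run continuously and independently: the load balancer reacts within seconds, while the replacement instance takes longer to boot.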
Availability Zones
- Distinct locations
- Recommended: spread the app across 2 or more zones
- Each AZ should get a full stack
- App continues to function if 1 AZ becomes unreachable
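
The "full stack in each AZ" rule can be checked mechanically. The sketch below is a minimal illustration, assuming hypothetical tier names and example zone names: it places instances round-robin across zones and then asks whether every tier survives the loss of any single zone.

```python
from collections import Counter
from itertools import cycle


def spread(tiers, zones, per_tier):
    """Place `per_tier` instances of every tier, round-robin across zones."""
    placement = []
    for tier in tiers:
        zone_cycle = cycle(zones)
        for _ in range(per_tier):
            placement.append((tier, next(zone_cycle)))
    return placement


def survives_loss_of(placement, lost_zone, tiers):
    """True if every tier still has an instance outside `lost_zone`."""
    remaining = Counter(tier for tier, zone in placement if zone != lost_zone)
    return all(remaining[tier] > 0 for tier in tiers)


tiers = ["web", "app", "db"]              # hypothetical tier names
zones = ["us-east-1a", "us-east-1b"]      # example zone names
placement = spread(tiers, zones, per_tier=2)
print(all(survives_loss_of(placement, z, tiers) for z in zones))
```

Running the same check with a single zone fails, which is the point of the recommendation.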
Design for failure
- No single point of failure
- Assume everything fails
- Design a recovery process
- High availability is expensive (as a business, decide how much you need)
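
To make the "how much do you need" decision concrete, here is a back-of-the-envelope sketch. The availability targets are illustrative, not figures from the presentation, and `combined_availability` assumes replicas fail independently, which real outages do not always respect.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours_per_year(availability):
    """Hours of downtime per year allowed by an availability target."""
    return (1.0 - availability) * HOURS_PER_YEAR


def combined_availability(single, replicas):
    """Availability of redundant copies, assuming independent failures."""
    return 1.0 - (1.0 - single) ** replicas


for target in (0.99, 0.999, 0.9999):
    hours = downtime_hours_per_year(target)
    print(f"{target:.2%} allows {hours:.2f} hours of downtime per year")

print(f"two 99% replicas: {combined_availability(0.99, 2):.4%}")
```

Each extra "nine" cuts the allowed downtime by a factor of ten, and usually costs more redundant capacity to achieve.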
Architecture
- Use Elastic IPs
- Multiple AZs
- Replicate data across AZs
- Real-time monitoring (e.g. CloudWatch)
- Use EBS (take snapshots, which are stored in S3)
Bootstrapping instances
- "Who am I and what is my role?" (the question a new server asks when it spins up)
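
One way a new server can answer that question is to read a small payload handed to it at launch. The sketch below assumes a hypothetical `key=value` payload format and a hypothetical role-to-packages table. Fetching the payload (on EC2, from the instance's user data) is left out so the example runs anywhere.

```python
def parse_user_data(payload):
    """Parse 'key=value' lines into a dict, skipping blanks and comments."""
    settings = {}
    for line in payload.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


# Hypothetical mapping from role to the packages that role needs.
ROLE_PACKAGES = {
    "app": ["nginx", "thin", "ruby-1.9.2"],
    "worker": ["ruby-1.9.2"],
}


def bootstrap_plan(payload):
    """Decide what a freshly booted server should install for its role."""
    settings = parse_user_data(payload)
    role = settings.get("role")
    if role not in ROLE_PACKAGES:
        raise ValueError(f"unknown or missing role: {role!r}")
    return role, ROLE_PACKAGES[role]


role, packages = bootstrap_plan("# set at launch\nrole=app\nenv=production\n")
print(role, packages)
```

Failing loudly on an unknown role is a deliberate choice here: a server that cannot tell what it is should not join the fleet.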
Enable dynamic configuration
- Use Auto Scaling
- Use Elastic Load Balancing on each tier
System-level abstractions (high-level descriptions of roles in the system)
- E.g. app server: a machine running Nginx, Thin, Ruby 1.9.2
- Use Puppet or Chef to prepare (CloudFormation)
- Build the entire stack from a template
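
As an illustration of building a stack from a template, the sketch below generates a CloudFormation-style JSON document with one instance per zone. The `build_template` helper, the resource names and the AMI id are placeholders, and the output has not been validated against CloudFormation itself.

```python
import json


def build_template(role, image_id, instance_type, zones):
    """Return a CloudFormation-style template with one instance per zone."""
    resources = {}
    for index, zone in enumerate(zones, start=1):
        resources[f"{role.capitalize()}Server{index}"] = {
            "Type": "AWS::EC2::Instance",
            "Properties": {
                "ImageId": image_id,
                "InstanceType": instance_type,
                "AvailabilityZone": zone,
            },
        }
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": f"{role} tier, one instance per Availability Zone",
        "Resources": resources,
    }


template = build_template(
    role="app",
    image_id="ami-00000000",          # placeholder, not a real AMI
    instance_type="m1.small",
    zones=["us-east-1a", "us-east-1b"],
)
print(json.dumps(template, indent=2))
```

Because the template is generated from a role description, adding a zone or a tier is a one-line change instead of a manual rebuild.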
Netflix Chaos Monkey
- They test the resilience of their production system by randomly killing worker processes
- What happens?
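
A toy simulation of the idea, not Netflix's actual tool: kill one worker at random and check whether enough capacity remains. The `chaos_round` helper, the worker names and the `min_healthy` threshold are all hypothetical.

```python
import random


def chaos_round(workers, min_healthy, rng):
    """Kill one random worker; report the victim and whether capacity holds."""
    victim = rng.choice(sorted(workers))
    survivors = workers - {victim}
    return victim, survivors, len(survivors) >= min_healthy


rng = random.Random(42)                          # seeded so runs repeat
workers = {"worker-1", "worker-2", "worker-3"}   # hypothetical names
victim, workers, ok = chaos_round(workers, min_healthy=2, rng=rng)
print(f"killed {victim}; still serving: {ok}")
```

With three workers and a minimum of two, the first kill is survivable. A second kill without a replacement is not, which is what the exercise is meant to expose.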