
Case Studies: Performance Tuning



Case Study 1

A financial analytics company runs a mission-critical data processing pipeline on AWS that ingests market data from multiple exchanges. The application uses an Auto Scaling group of c5.2xlarge EC2 instances behind an Application Load Balancer. During market open hours (9:30 AM to 4:00 PM EST), the system experiences predictable traffic spikes, but the current scaling configuration takes 8-12 minutes to provision new instances, causing CPU utilization to reach 95% and request latency to spike from 200ms to 3,500ms. The application requires no changes to instance user data or configuration between launches. Historical data shows the spike occurs within 2 minutes of market open, and capacity needs increase by exactly 40 instances during this period. The company needs to eliminate the latency spikes while minimizing costs.

What combination of actions will most effectively reduce scaling response time and prevent performance degradation?

  1. Implement scheduled scaling to pre-scale capacity 15 minutes before market open, create a warm pool of 40 stopped instances configured for hibernate, and reduce Auto Scaling health check grace period to 60 seconds
  2. Create a warm pool with 40 stopped instances, enable instance hibernation for the warm pool, configure scheduled scaling to move instances from the warm pool to the Auto Scaling group 10 minutes before market open, and use predictive scaling alongside target tracking
  3. Switch to a step scaling policy with aggressive thresholds at 60% CPU, pre-warm the Application Load Balancer by contacting AWS Support, create custom AMIs with the application pre-installed, and implement scheduled scaling to add capacity at 9:20 AM EST
  4. Implement predictive scaling based on the recurring daily pattern, create a warm pool of 40 running instances maintained continuously, configure target tracking scaling at 70% CPU, and enable Application Load Balancer connection draining

Answer & Explanation

Correct Answer: 2 - Warm pool with hibernation, scheduled scaling, and predictive scaling

Why this is correct: This solution addresses the 8-12 minute scaling delay through multiple complementary mechanisms. The warm pool with 40 stopped instances eliminates boot time (a stopped instance returns to service in well under a minute, versus several minutes for a fresh launch from an AMI). Hibernation preserves the in-memory application state, further reducing startup time. Scheduled scaling proactively moves these pre-initialized instances into service before the spike occurs, and predictive scaling uses machine learning to forecast the recurring daily pattern, providing additional capacity adjustments. This combination eliminates the performance degradation while minimizing costs (stopped instances incur only EBS volume charges, not compute charges).

Why the other options are wrong:

  • Option 1: While it includes scheduled scaling and a warm pool, reducing the health check grace period to 60 seconds is counterproductive-it would cause instances to be terminated before they complete initialization, not speed up scaling. The warm pool concept is correct, but the health check modification creates instability rather than solving the latency problem.
  • Option 3: Step scaling with aggressive thresholds doesn't solve the fundamental 8-12 minute launch delay-it only triggers scaling faster, but instances still take the same time to provision. Pre-warming the ALB helps with connection handling but doesn't address compute capacity delays. Custom AMIs help somewhat but don't achieve the sub-minute response time needed for a 2-minute spike window.
  • Option 4: Maintaining 40 running instances continuously would work for performance but violates the cost-minimization constraint-running instances incur full compute charges 24/7 when they're only needed 6.5 hours per day. This represents approximately 270% higher cost than stopped instances in a warm pool for the same capacity reserve.

Key Insight: The critical distinction is understanding that warm pools with stopped instances provide near-instant scaling (seconds) versus traditional Auto Scaling launch times (minutes), and that hibernation further optimizes by preserving application state. The exam tests whether candidates recognize that stopped instances in warm pools dramatically reduce time-to-service compared to launching from AMIs, while maintaining cost efficiency compared to running instances continuously.
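The cost comparison in the Option 3 critique above can be sanity-checked with back-of-envelope arithmetic. A minimal sketch (it ignores the small EBS-only charge for stopped warm-pool instances; the 6.5-hour window comes from the scenario's market hours):

```python
# Compute-hours for the 40-instance reserve: always-on vs. market hours only.
HOURS_NEEDED = 6.5   # 9:30 AM - 4:00 PM EST, from the scenario
HOURS_PER_DAY = 24

extra_pct = (HOURS_PER_DAY - HOURS_NEEDED) / HOURS_NEEDED * 100
print(f"Always-on reserve uses ~{extra_pct:.0f}% more compute-hours")  # ~269%
```

This matches the "approximately 270% higher cost" figure cited for Option 4.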

Case Study 2

An e-commerce platform processes product catalog searches using Amazon OpenSearch Service (formerly Elasticsearch Service). The cluster consists of three m5.large.search data nodes and three dedicated master nodes. Query performance has degraded over six months as the product catalog grew from 2 million to 8 million items. The current average query latency is 850ms, with p99 latency reaching 3.2 seconds. JVM memory pressure on data nodes averages 78%, and the cluster has experienced two split-brain incidents in the past month despite having three dedicated masters. The development team reports that 65% of queries search across all product attributes (title, description, brand, specifications) and return paginated results. Index size is now 240 GB with a single index and five primary shards, each with one replica. The company requires query latency under 200ms for p95 and wants to avoid cluster instability.

What is the MOST effective architectural change to improve query performance and cluster stability?

  1. Increase data nodes to m5.xlarge.search instances, increase the number of primary shards from 5 to 15, implement index lifecycle management to delete old product data, and add three additional dedicated master nodes to prevent split-brain
  2. Implement a multi-index strategy separating frequently queried fields into a separate smaller index, increase data nodes to six m5.large.search instances, configure UltraWarm nodes for older product data, and implement query result caching at the application layer
  3. Increase data nodes to m5.2xlarge.search instances to reduce JVM pressure, implement field-level filtering to index only searchable attributes, configure cross-cluster replication for read scaling, and enable slow query logging to identify problematic queries
  4. Scale horizontally to six m5.large.search data nodes, implement index rollover when indices exceed 50 GB, configure index templates with appropriate shard sizing (30-50 GB per shard), reduce replica count to zero during indexing operations, and implement request throttling

Answer & Explanation

Correct Answer: 2 - Multi-index strategy, horizontal scaling, UltraWarm for older data, and application-layer caching

Why this is correct: This solution addresses multiple performance bottlenecks systematically. The multi-index strategy is critical-separating frequently queried fields (title, brand) into a smaller, faster index dramatically reduces query latency because OpenSearch searches smaller data sets faster, even if subsequent enrichment queries are needed. Increasing data nodes horizontally distributes query load and shard operations. UltraWarm nodes move older, less-frequently accessed product data to cost-effective storage while keeping hot data performant. Application-layer caching prevents repeated identical queries from hitting OpenSearch. This combination directly addresses the 850ms latency issue through query optimization (smaller search corpus) and cluster capacity (more nodes), while the three existing dedicated masters are sufficient for stability-split-brain incidents indicate network or configuration issues, not insufficient master count.

Why the other options are wrong:

  • Option 1: Increasing shards from 5 to 15 for a 240 GB index creates an anti-pattern-this results in 16 GB shards (240 GB ÷ 15), which is far below the recommended 30-50 GB per shard, causing excessive overhead and actually degrading performance. Adding more master nodes beyond three provides no benefit-three masters already provide quorum; split-brain issues indicate network partitioning or configuration problems, not insufficient masters. Deleting old product data may violate business requirements for catalog completeness.
  • Option 3: While vertical scaling to m5.2xlarge would reduce JVM pressure, it doesn't address the fundamental query inefficiency-searching 8 million products across all fields remains expensive regardless of instance size. Cross-cluster replication is designed for disaster recovery and geographic distribution, not read scaling within a single cluster, and adds significant complexity and cost without solving the query latency problem. Field-level filtering helps but isn't as effective as multi-index separation for the query patterns described.
  • Option 4: The 240 GB index with 5 shards yields 48 GB per shard, which is already appropriately sized-rollover and re-sharding don't address the core problem. Reducing replica count to zero during indexing improves write performance but creates significant risk (no redundancy) and doesn't solve query latency issues. Request throttling limits load but doesn't improve performance-it simply rejects requests, degrading user experience rather than improving it.

Key Insight: The key differentiator is recognizing that query performance in OpenSearch is fundamentally about reducing the corpus being searched. Multi-index strategies that separate frequently queried fields from the full product catalog provide order-of-magnitude improvements because OpenSearch can search a 10 GB index of titles/brands far faster than a 240 GB index with all attributes, even if a secondary enrichment query is needed. Candidates who focus solely on infrastructure scaling (bigger instances, more shards) miss the architectural optimization opportunity.
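The shard-sizing argument above is pure arithmetic; a quick check of both layouts against the 30-50 GB per-shard guideline:

```python
# Shard size for a 240 GB index under the current and proposed layouts.
INDEX_SIZE_GB = 240
GUIDELINE = (30, 50)  # recommended GB per primary shard

def shard_size_gb(primary_shards):
    return INDEX_SIZE_GB / primary_shards

current = shard_size_gb(5)    # 48.0 GB: within the guideline
proposed = shard_size_gb(15)  # 16.0 GB: oversharded (Option 1's anti-pattern)
print(current, proposed)
```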

Case Study 3

A healthcare SaaS provider operates a multi-tenant application serving 3,200 medical practices. The application uses Amazon RDS for PostgreSQL (db.r5.4xlarge) with 10,000 provisioned IOPS. Each morning between 7:00 AM and 9:00 AM across various time zones, practices synchronize patient schedules, causing database CPU to reach 92% and read latency to spike from 8ms to 340ms. The RDS instance has 47 TB of free storage, but Performance Insights shows that 78% of database time during peak hours is consumed by three specific queries: patient schedule lookups, appointment conflict checks, and provider availability searches. All three queries join the appointments table (180 million rows) with the providers table (250,000 rows) and filter by date ranges and practice_id. Existing indexes on created_at and practice_id show low utilization. The application cannot be refactored to change query patterns, and the company requires a solution deployable within two weeks that doesn't require application code changes.

Which solution will provide the MOST significant performance improvement while meeting all constraints?

  1. Create a read replica in each availability zone for read scaling, implement RDS Proxy with connection pooling to reduce connection overhead, modify the application configuration to route read queries to replicas, and upgrade to db.r5.8xlarge for additional CPU capacity
  2. Analyze the three problematic queries using query execution plans, create composite indexes on (practice_id, appointment_date, provider_id), implement table partitioning on the appointments table by practice_id ranges, and enable query plan caching in the PostgreSQL configuration
  3. Migrate to Amazon Aurora PostgreSQL to leverage read replicas with Aurora Auto Scaling, implement Aurora's query cache, configure parallel query execution, and enable Performance Insights automatic tuning recommendations
  4. Create Amazon ElastiCache for Redis cluster, implement a caching layer in the application for the three expensive queries with 15-minute TTL, configure lazy loading with cache-aside pattern, and enable automatic failover for cache availability

Answer & Explanation

Correct Answer: 2 - Query execution analysis, composite indexes, table partitioning, and query plan caching

Why this is correct: This solution directly addresses the root cause identified in Performance Insights-inefficient query execution. The composite indexes on (practice_id, appointment_date, provider_id) align precisely with the query patterns described (filtering by practice_id and date ranges, joining with providers), enabling index-only scans or dramatically reducing rows scanned. Table partitioning by practice_id creates smaller, practice-specific partitions that queries can target directly, reducing scan overhead for the 180-million-row appointments table. Plan reuse (PostgreSQL caches execution plans for prepared statements) avoids re-planning the same three queries on every execution. These are all database-level optimizations requiring no application code changes and deployable within days via maintenance windows. This addresses the 78% of database time consumed by these three queries, providing the most direct performance improvement.

Why the other options are wrong:

  • Option 1: While read replicas and RDS Proxy provide scalability benefits, they require application code changes to route reads to replicas-the application must be modified to use different database endpoints or connection strings, violating the "no application code changes" constraint. Additionally, read replicas don't solve the fundamental inefficiency of the queries themselves-they just distribute the expensive operations across more instances. Upgrading to db.r5.8xlarge doubles CPU but doesn't address query inefficiency and costs significantly more.
  • Option 3: Migration to Aurora PostgreSQL is a major architectural change requiring significant testing, data migration, endpoint changes, and application validation-far exceeding the two-week deployment constraint. While Aurora offers benefits, the migration timeline and risk make this infeasible. Aurora's parallel query pushes processing down to the Aurora storage layer and benefits only certain analytic scans; it is unavailable until the migration completes, so it cannot help within the two-week window. The scenario describes optimization opportunities achievable without migration.
  • Option 4: Implementing ElastiCache requires substantial application code changes to integrate caching logic, handle cache misses, manage cache invalidation, and implement the cache-aside pattern-directly violating the "no application code changes" constraint. While caching can improve performance, it doesn't address the underlying query inefficiency and adds operational complexity. Cache invalidation for scheduling data that changes frequently (appointments, conflicts, availability) is particularly challenging.

Key Insight: The critical distinction is recognizing that database-level optimizations (indexes, partitioning) can be implemented without application changes, while solutions involving read replicas, caching layers, or database migrations all require application modifications. Candidates must read the constraint "no application code changes" carefully and eliminate solutions that require endpoint changes, routing logic, or integration code, even if those solutions would technically work in different circumstances.
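As an illustration of why the composite index fits the query shape, here is a minimal sketch using SQLite (Python stdlib) as a stand-in for PostgreSQL; the table and column names mirror the scenario, but the schema is invented for the demo:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""CREATE TABLE appointments (
    practice_id INTEGER, appointment_date TEXT,
    provider_id INTEGER, notes TEXT)""")
# Composite index matching the filter and join columns from the scenario
cur.execute("""CREATE INDEX idx_practice_date_provider
    ON appointments (practice_id, appointment_date, provider_id)""")

# Equality on practice_id + range on appointment_date, selecting provider_id:
# the index alone can answer the query (a covering / index-only scan).
plan = cur.execute("""EXPLAIN QUERY PLAN
    SELECT provider_id FROM appointments
    WHERE practice_id = 42
      AND appointment_date BETWEEN '2025-01-01' AND '2025-01-07'""").fetchall()
print(plan[0][-1])  # e.g. "SEARCH ... USING COVERING INDEX idx_practice_date_provider ..."
```

In PostgreSQL the equivalent check is `EXPLAIN`, where the goal is an Index Only Scan on the new composite index.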

Case Study 4

A video streaming platform uses Amazon CloudFront to deliver content to 12 million users globally. The origin is an Amazon S3 bucket in us-east-1 containing 480,000 video files totaling 8.4 PB. The most popular 2,000 videos account for 65% of all requests, while the remaining 478,000 videos are requested infrequently. CloudFront access logs show that the cache hit ratio has declined from 89% to 61% over the past quarter as the catalog expanded. Users in APAC regions (Singapore, Tokyo, Sydney) report average initial buffering times of 4.8 seconds, compared to 1.2 seconds for users in North America. Analysis shows that 40% of APAC requests result in CloudFront origin fetches. The platform uses default CloudFront caching behaviors with TTL of 86400 seconds. Video files range from 800 MB to 12 GB, with average size of 2.4 GB. The company has a fixed CDN budget and cannot increase CloudFront costs, but needs to improve APAC performance and overall cache efficiency.

What combination of optimizations will MOST effectively improve cache hit ratio and APAC performance without increasing costs? (Select TWO)

  1. Enable CloudFront Origin Shield in us-east-1 to consolidate origin requests and improve cache hit ratio across all edge locations, reducing redundant origin fetches for the same content from different edge locations
  2. Create regional S3 buckets in ap-southeast-1, ap-northeast-1, and ap-southeast-2 using S3 Cross-Region Replication, configure CloudFront to use these as origins for corresponding regions, and implement origin failover with origin groups
  3. Implement CloudFront cache key normalization to remove unnecessary query strings and headers that cause cache fragmentation, increase TTL to 604800 seconds for video content, and configure custom cache behaviors that prioritize caching the 2,000 most popular videos
  4. Configure CloudFront with Lambda@Edge origin-facing functions to compress video metadata responses, enable automatic object compression, implement HTTP/3 and QUIC protocols, and configure custom SSL certificates for faster TLS handshake
  5. Enable CloudFront real-time logs to identify cache-miss patterns, configure origin connection timeout and keepalive settings for better performance, implement signed URLs with shorter expiration times to reduce cache pollution from unauthorized requests

Answer & Explanation

Correct Answer: 1 and 3 - Origin Shield with cache key normalization and TTL optimization

Why these are correct: Option 1 (Origin Shield) directly addresses the declining cache hit ratio and APAC performance issues for only a modest incremental per-request charge. Origin Shield acts as a centralized caching layer between CloudFront edge locations and the S3 origin. When multiple edge locations (particularly in APAC) request the same video, Origin Shield serves it from its cache rather than each edge location independently fetching from S3 in us-east-1. This dramatically reduces origin fetch latency for APAC users (repeat APAC requests for the same video are served from Origin Shield's cache instead of each triggering a long-haul fetch back to us-east-1) and improves cache efficiency. Option 3 addresses cache fragmentation-if requests include varying query strings or headers that don't affect content (tracking parameters, session IDs), they create duplicate cache entries for identical content. Normalizing cache keys consolidates these into single cache entries. Increasing TTL reduces cache expiration for static video content, and custom behaviors for popular videos ensure they remain cached. Both solutions work within existing infrastructure, requiring no additional services or data replication, thus maintaining the fixed budget constraint.

Why the other options are wrong:

  • Option 2: Creating regional S3 buckets and implementing Cross-Region Replication for 8.4 PB of data incurs substantial additional costs-S3 storage costs in multiple regions, plus CRR data transfer charges (typically $0.02 per GB), which for 8.4 PB would be approximately $172,000 for initial replication alone, plus ongoing storage costs tripled across three additional regions. This massively violates the "cannot increase CloudFront costs" and "fixed CDN budget" constraints. While it would improve performance, it's financially infeasible.
  • Option 4: Lambda@Edge execution costs would increase expenses, not maintain them. Automatic compression doesn't apply to video files (already compressed), and compressing metadata provides negligible benefit. HTTP/3 in CloudFront is a simple distribution setting rather than a performance lever for this workload, and custom SSL certificates don't meaningfully impact performance for large video downloads where transfer time dominates TLS handshake time. This option adds cost without addressing the fundamental cache efficiency problem.
  • Option 5: Real-time logs incur additional costs for log delivery to Kinesis Data Streams. Origin connection tuning provides marginal benefits that don't address the fundamental cache miss problem. Shorter signed URL expiration times wouldn't reduce cache pollution-they'd actually increase origin requests as cached content expires more frequently, worsening performance and potentially increasing costs. This option misdiagnoses the problem.

Key Insight: The exam tests understanding that Origin Shield is specifically designed for scenarios with geographically distributed edge locations requesting the same content from a single origin-it collapses redundant origin requests into a single fetch. Candidates must recognize that multi-region data replication, while effective for performance, fundamentally conflicts with fixed-cost constraints due to storage and transfer costs, whereas Origin Shield (a relatively low-cost CloudFront feature) solves the same problem within budget. Cache key normalization is often overlooked but is critical when cache hit ratio declines-fragmented cache keys are a common cause of declining cache efficiency as applications evolve.
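The replication-cost claim in the Option 2 critique is easy to reproduce. A sketch using the $0.02/GB transfer price quoted in the explanation and decimal units (1 PB = 1,000,000 GB); AWS's actual billing units shift the figure slightly, which is where the ~$172,000 estimate comes from:

```python
CATALOG_PB = 8.4
TRANSFER_PRICE_PER_GB = 0.02  # assumed CRR inter-region rate from the text

catalog_gb = CATALOG_PB * 1_000_000
initial_transfer_cost = catalog_gb * TRANSFER_PRICE_PER_GB
print(f"${initial_transfer_cost:,.0f} per destination region")  # ~$168,000
```

Replicating to all three APAC destination regions would roughly triple that, before counting the tripled ongoing storage cost.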

Case Study 5

A financial services company runs a real-time fraud detection system that processes 45,000 credit card transactions per second during peak hours. The architecture uses Amazon Kinesis Data Streams with 200 shards, AWS Lambda functions for fraud analysis (average execution time 280ms, p99 of 850ms), and Amazon DynamoDB for storing fraud scores and transaction history. DynamoDB tables use on-demand capacity mode. CloudWatch metrics show Lambda throttling errors increasing from 0.2% to 8.7% during peaks, and DynamoDB WriteThrottleEvents occurring at a rate of 1,200 per minute. The Lambda functions are configured with 1,024 MB memory, batch size of 100 records, and default concurrency limits. The DynamoDB table has no provisioned capacity (on-demand mode), and previously handled 35,000 transactions per second without issues. The fraud detection logic requires processing transactions within 600ms of arrival to meet SLA requirements. Recent transaction volume growth of 30% has caused end-to-end latency to reach 2,400ms during peaks, with 12% of transactions missing the SLA.

What is the MOST LIKELY root cause of the performance degradation, and what is the most appropriate solution?

  1. Lambda concurrency limits are being exceeded during peaks; request a service quota increase for concurrent executions in the AWS account from the default 1,000 to 5,000, and configure reserved concurrency of 3,000 for the fraud detection functions
  2. DynamoDB on-demand capacity has hit account-level throttling limits; switch to provisioned capacity mode with Auto Scaling configured for 60,000 write capacity units and 20,000 read capacity units, with target utilization of 70%
  3. Lambda batch size of 100 records combined with 280ms execution time creates insufficient parallelism for 45,000 TPS throughput; reduce batch size to 10 records to increase Lambda invocation parallelism, and increase Lambda memory to 2,048 MB to reduce execution time
  4. Kinesis Data Streams with 200 shards cannot support 45,000 TPS at the current processing rate; increase shard count to 450 shards to distribute load, enable enhanced fan-out for Lambda consumers, and implement exponential backoff retry logic in Lambda functions

Answer & Explanation

Correct Answer: 3 - Reduce batch size to increase parallelism and increase Lambda memory

Why this is correct: The root cause is insufficient Lambda invocation parallelism created by the batch size of 100 records. At 45,000 transactions per second, Kinesis delivers 450 batches per second to Lambda (45,000 ÷ 100). With 280ms average execution time, each Lambda function can process approximately 3.5 batches per second (1,000ms ÷ 280ms). To process 450 batches per second requires approximately 129 concurrent Lambda executions (450 ÷ 3.5)-well within the 1,000 default concurrency limit, so concurrency isn't the primary issue. However, with p99 latency at 850ms, slower executions create a backlog. Reducing batch size to 10 records increases the invocation rate to 4,500 per second and cuts per-invocation processing time from 280ms toward sub-100ms (fewer records per invocation); because a Kinesis event source caps Lambda at one concurrent invocation per shard by default, raising the event source mapping's parallelization factor alongside the smaller batch size lets those additional invocations actually run in parallel. This directly addresses the 2,400ms end-to-end latency. Increasing memory to 2,048 MB proportionally increases CPU allocation, reducing execution time. The combination creates the parallelism needed for 45,000 TPS within the 600ms SLA.

Why the other options are wrong:

  • Option 1: Lambda concurrency limits aren't the bottleneck. The throttling errors at 8.7% suggest some concurrency pressure, but the math shows that 129 concurrent executions should suffice for the throughput with current batch size-far below the 1,000 default limit. The real issue is that batch size creates insufficient parallelism and per-batch processing time is too high. Requesting 5,000 concurrent executions is premature without addressing the architectural inefficiency. Reserved concurrency of 3,000 is excessive and would actually reduce available concurrency for other functions in the account.
  • Option 2: While DynamoDB shows WriteThrottleEvents, on-demand capacity automatically scales to accommodate load-1,200 throttle events per minute (20 per second) with 45,000 TPS suggests only 0.04% write throttling, not systemic failure. On-demand mode can handle "double your previous peak" instantly and scales beyond that within 30 minutes. The scenario states the table "previously handled 35,000 transactions per second"-a 30% increase to 45,000 TPS is well within on-demand's doubling capability. Switching to provisioned capacity adds operational overhead and doesn't address the Lambda processing bottleneck causing the 2,400ms latency. The DynamoDB throttling is a symptom, not the root cause.
  • Option 4: Kinesis Data Streams with 200 shards can ingest 200,000 records per second (1,000 records/second and 1 MB/s per shard for writes) and serve 400 MB/s of read throughput (2 MB/s per shard)-far exceeding 45,000 TPS requirements. The scenario provides no indication of Kinesis-level throttling or shard iterator errors. Increasing shards to 450 is unnecessary and costly. Enhanced fan-out improves consumer throughput but doesn't solve the Lambda processing bottleneck. The problem isn't data ingestion-it's Lambda's ability to process batches fast enough, which is a function of batch size and execution time, not shard count.

Key Insight: The critical insight is understanding the relationship between Kinesis batch size, Lambda concurrency, and throughput. Candidates must calculate effective parallelism: (records per second ÷ batch size) × average execution time = required concurrency. When this calculation shows parallelism is insufficient for the throughput requirement, reducing batch size increases parallel invocations. This is a common anti-pattern-using large batch sizes to reduce Lambda invocation costs, which works at low throughput but creates latency problems at high throughput. The exam tests whether candidates can identify that throttling in downstream services (DynamoDB) may be a symptom of upstream processing bottlenecks (Lambda) rather than the root cause.
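The concurrency arithmetic in the explanation follows Little's Law (in-flight work = arrival rate × service time); reproducing it with the scenario's numbers:

```python
TPS = 45_000          # transactions per second at peak
BATCH_SIZE = 100      # records per Lambda invocation
EXEC_SECONDS = 0.28   # average Lambda execution time

batches_per_sec = TPS / BATCH_SIZE                 # 450 invocations/s arriving
required_concurrency = batches_per_sec * EXEC_SECONDS
print(round(required_concurrency))                 # ~126; the text's ~129 rounds 1000/280 to 3.5

# DynamoDB throttling as a fraction of traffic: a symptom, not the root cause.
throttle_pct = (1_200 / 60) / TPS * 100
print(f"{throttle_pct:.3f}% of writes throttled")  # ~0.044%
```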

Case Study 6

A global logistics company operates a package tracking system serving 140 countries. The application runs on Amazon ECS with Fargate, using Application Load Balancer for traffic distribution. The backend queries Amazon Aurora PostgreSQL (db.r6g.4xlarge) for shipment status. The company has observed that tracking queries from users in Europe (specifically UK, Germany, France) experience average latency of 420ms, while users in the same AWS region (eu-west-1, where the application runs) experience 45ms latency. The application uses a third-party geolocation service API that adds approximately 180ms to each request for address validation. Network analysis shows that 85% of the latency difference is introduced before requests reach the Application Load Balancer. The company has implemented CloudFront with default caching policies, but tracking queries include unique tracking numbers in URL paths, preventing effective caching. The application must return real-time shipment status and cannot serve stale data. The company requires a solution that reduces latency for end users in Europe without application code changes.

What is the most effective solution to reduce user-perceived latency?

  1. Enable AWS Global Accelerator in front of the Application Load Balancer, routing user traffic through AWS's global network to reduce internet latency, and configure health checks to route traffic to the nearest healthy endpoint
  2. Deploy additional ECS clusters with Aurora read replicas in multiple European regions (eu-central-1, eu-west-2, eu-west-3), implement Route 53 latency-based routing to direct users to the nearest regional deployment, and configure Aurora Global Database for cross-region replication
  3. Implement CloudFront with Lambda@Edge origin request functions to cache geolocation API responses for identical addresses, configure CloudFront to forward tracking number query strings to origin while caching other elements, and enable Origin Shield in eu-west-1
  4. Migrate the geolocation service API calls to Amazon Location Service to reduce third-party API latency, implement CloudFront cache behaviors with custom TTL for static assets, and configure ALB target group stickiness to reduce connection establishment overhead

Answer & Explanation

Correct Answer: 1 - AWS Global Accelerator in front of ALB

Why this is correct: The scenario explicitly states that 85% of latency difference (approximately 320ms of the 375ms difference between 420ms and 45ms) is introduced before requests reach the ALB-meaning the latency is in internet transit, not application processing. AWS Global Accelerator routes user traffic from edge locations through AWS's private global network directly to the ALB in eu-west-1, bypassing congested internet paths, middle-mile latency, and routing inefficiencies. This directly addresses the internet transit latency without requiring application changes, regional deployments, or code modifications. Global Accelerator maintains persistent connections to the origin, reducing TCP handshake overhead. The 180ms geolocation API delay affects all users equally and isn't the differential latency source. Global Accelerator provides 20-50% latency reduction for international traffic, which would reduce the 420ms to roughly 210-340ms, substantially improving user experience while requiring only infrastructure configuration (no application changes).

Why the other options are wrong:

  • Option 2: While multi-region deployment would reduce latency, it violates the "no application code changes" constraint: the application would need modifications to handle region-specific Aurora endpoints, manage read replica lag (which conflicts with the "real-time status" requirement), and potentially implement cross-region write logic. Aurora Global Database replication lag is typically under 1 second but isn't zero, which is unacceptable for real-time tracking. This approach is also operationally complex and expensive, requiring duplicate infrastructure across multiple regions when the problem is internet transit latency, not application processing latency (evidenced by the 45ms in-region performance).
  • Option 3: The scenario explicitly states that tracking queries include unique tracking numbers that prevent effective caching, so CloudFront provides no benefit for the primary use case (shipment status queries). Lambda@Edge caching of geolocation responses helps with the 180ms geolocation delay, but that delay affects all users equally; it doesn't explain the 375ms differential between European and local users. Origin Shield optimizes origin requests but doesn't reduce user-to-CloudFront latency, which is where the problem exists (before reaching the ALB). This solution addresses the wrong bottleneck.
  • Option 4: Migrating to Amazon Location Service might reduce the 180ms geolocation API latency, but that latency is consistent for all users; it doesn't explain why European users experience 375ms more latency than local users. The scenario states the differential is introduced before requests reach the ALB, not during application processing. Caching static assets helps marginally but doesn't address the fundamental internet transit latency for API requests. ALB stickiness reduces connection overhead slightly but doesn't solve cross-continent internet latency. This solution addresses symptoms rather than the root cause identified in the scenario.

Key Insight: The key differentiator is recognizing where latency is introduced in the request path: before the load balancer versus during application processing. The phrase "85% of latency is introduced before requests reach the ALB" is the critical clue that internet transit is the problem, not application performance, database queries, or API calls. Global Accelerator specifically solves internet transit latency by moving traffic onto AWS's private network. Candidates who focus on caching or application optimization miss this fundamental diagnosis. Understanding where in the request path latency occurs determines which AWS service appropriately addresses it.
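To make the correct option concrete, here is a minimal sketch of the request payloads an operator might pass to boto3's `globalaccelerator` client (`create_accelerator`, `create_listener`, `create_endpoint_group`). Every ARN and name below is a hypothetical placeholder, not a value from the scenario.

```python
# Hedged sketch: Global Accelerator in front of an existing ALB.
# All ARNs/names are hypothetical; in practice each dict would be
# passed as **kwargs to the corresponding boto3 globalaccelerator call.

def accelerator_request(name: str) -> dict:
    # Global Accelerator hands out two static anycast IPs; user traffic
    # enters AWS's private backbone at the nearest edge location.
    return {"Name": name, "IpAddressType": "IPV4", "Enabled": True}

def listener_request(accelerator_arn: str) -> dict:
    # Terminate HTTPS traffic on 443 and carry it over AWS's network.
    return {
        "AcceleratorArn": accelerator_arn,
        "Protocol": "TCP",
        "PortRanges": [{"FromPort": 443, "ToPort": 443}],
    }

def endpoint_group_request(listener_arn: str, alb_arn: str) -> dict:
    # The endpoint group targets the unchanged ALB in eu-west-1; no
    # application code changes or regional duplication required.
    return {
        "ListenerArn": listener_arn,
        "EndpointGroupRegion": "eu-west-1",
        "EndpointConfigurations": [
            {"EndpointId": alb_arn, "Weight": 128,
             "ClientIPPreservationEnabled": True},
        ],
    }

req = endpoint_group_request(
    "arn:aws:globalaccelerator::123456789012:accelerator/EXAMPLE/listener/abc",
    "arn:aws:elasticloadbalancing:eu-west-1:123456789012:"
    "loadbalancer/app/tracking-alb/0123456789abcdef",
)
print(req["EndpointGroupRegion"])
```

Clients then resolve the accelerator's static IPs via DNS; no change to the ALB, Aurora, or application code is involved, which is exactly why this option satisfies the constraint.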

Case Study 7

A media company hosts a news website that publishes breaking news articles. The site receives 5,000 requests per second during normal operation, spiking to 180,000 requests per second when major breaking news occurs, with 90% of spike traffic concentrated on a single article URL. The architecture uses Amazon CloudFront backed by an Application Load Balancer and an Auto Scaling group of EC2 instances running a Node.js application that queries Amazon DynamoDB for article content. During the last major news event, the website experienced severe degradation: CloudFront reported elevated origin errors (HTTP 502/503), EC2 Auto Scaling successfully launched 200 additional instances within 6 minutes, but users continued experiencing errors for 18 minutes. CloudWatch Logs showed that the ALB was rejecting connections with "503 Service Unavailable" despite EC2 instances being healthy and CPU utilization at only 40%. The DynamoDB table is configured with on-demand capacity and showed no throttling. The company needs to prevent this failure pattern during the next breaking news event, which could occur at any time.

What was the MOST LIKELY cause of the continued failures despite successful Auto Scaling, and what is the most appropriate preventive solution?

  1. The Application Load Balancer has a surge queue limit that was exceeded during rapid traffic spike; enable ALB pre-warming by contacting AWS Support before anticipated major news events, configure target group slow-start mode with 120-second duration to gradually increase traffic to new instances, and implement CloudFront origin timeouts to prevent overwhelming the ALB
  2. CloudFront collapsed multiple simultaneous requests for the same URL into single origin requests, overwhelming individual ALB targets with massive connection counts; implement DynamoDB Accelerator (DAX) between the application and DynamoDB to cache article content, increase Node.js connection pool limits, and configure ALB connection draining to 5 seconds
  3. The 90% of traffic concentrated on a single article URL created cache misses in CloudFront for viewer-specific variations; implement CloudFront cache key normalization to remove viewer-specific headers, increase CloudFront TTL to 300 seconds for article content, configure DynamoDB Global Tables for improved read performance, and enable CloudFront real-time logs
  4. The Application Load Balancer has a default burst capacity but cannot instantly scale to 36x traffic increase (5,000 to 180,000 RPS); contact AWS Support to pre-warm the ALB to expected peak traffic levels, implement CloudFront with appropriate cache TTL to reduce origin requests, and configure WAF rate limiting to prevent excessive origin traffic during spikes

Answer & Explanation

Correct Answer: 4 - ALB requires pre-warming for extreme traffic spikes, CloudFront caching to reduce origin load, and WAF rate limiting

Why this is correct: Application Load Balancers scale automatically to handle increased traffic, but that scaling is gradual and optimized for typical traffic patterns. A 36x traffic increase within minutes (5,000 to 180,000 RPS) exceeds the ALB's ability to scale instantly; as a rule of thumb, an un-warmed load balancer comfortably absorbs traffic growth of roughly 50% every 3-5 minutes, not an order-of-magnitude jump. The 503 errors despite healthy instances and low CPU indicate the ALB itself was the bottleneck, not the EC2 layer. Pre-warming involves contacting AWS Support to scale ALB capacity in advance of expected traffic spikes, which is the standard approach for known high-traffic events. Proper CloudFront caching with an appropriate TTL reduces origin requests dramatically: if 90% of traffic targets a single article URL, CloudFront should serve the vast majority of requests from cache rather than forwarding them to the origin. WAF rate limiting protects against traffic spikes that exceed ALB capacity. This combination addresses both the immediate bottleneck (ALB scaling) and the underlying problem (excessive origin requests for cacheable content).

Why the other options are wrong:

  • Option 1: The surge queue (and its 1,024-request depth) is a Classic Load Balancer concept; ALB does not expose a surge queue metric, and with 200 healthy instances launched, per-target queuing is unlikely to explain 18 minutes of continued failures. Slow-start mode helps gradually introduce new targets but doesn't address the fundamental scaling limitation when traffic increases 36x in minutes. CloudFront origin timeouts don't prevent overwhelming the ALB; they just cause requests to fail faster. This option doesn't address the ALB's inability to scale instantly to extreme traffic spikes, which is the root cause evidenced by the ALB rejecting connections despite healthy targets.
  • Option 2: CloudFront's request collapsing consolidates duplicate in-flight requests for the same cacheable object into fewer origin requests, which helps rather than harms origin load; it does not apply to uncacheable dynamic content and would not overwhelm individual ALB targets. The scenario states EC2 instances were at 40% CPU, indicating they weren't overwhelmed. DAX adds caching but doesn't solve the ALB bottleneck (requests still reach the ALB even if DynamoDB responds faster). Connection pool tuning and draining adjustments are optimizations that don't address the core issue: the ALB couldn't accept 180,000 RPS without pre-warming.
  • Option 3: This option misdiagnoses the problem. If CloudFront were experiencing cache misses due to viewer-specific variations, cache key normalization would help, but the scenario describes "90% of spike traffic concentrated on a single article URL", which should produce high cache hit rates, not misses, assuming a proper TTL. DynamoDB Global Tables are for multi-region replication and are irrelevant here since there's no indication of regional issues. Real-time logs don't prevent failures. The fundamental issue is that origin requests reached the ALB at 180,000 RPS, exceeding its un-warmed capacity, regardless of caching effectiveness.

Key Insight: The critical insight is recognizing that the ALB, while auto-scaling, cannot instantly scale to extreme traffic spikes; it requires pre-warming for predictable events or proper CloudFront caching to keep most traffic from reaching the origin. The scenario provides key diagnostic clues: "ALB was rejecting connections" and "EC2 instances healthy with 40% CPU" point to the ALB as the bottleneck, not the application tier. Candidates must understand that every AWS service has scaling characteristics, and even managed services like ALB have limits on instantaneous scaling velocity. The combination of preventive measures (pre-warming for known events) and architectural solutions (CloudFront caching to reduce origin load) addresses both the immediate and the structural issues.
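The caching arithmetic behind this answer can be sketched in a few lines. The hit ratios below are illustrative assumptions, not figures from the scenario; the point is that with one URL dominating traffic, even modest cache effectiveness collapses origin load to something an un-warmed ALB can absorb.

```python
def origin_rps(total_rps: float, hot_fraction: float,
               hot_hit_ratio: float, tail_hit_ratio: float) -> float:
    """Estimate the origin (ALB) request rate behind CloudFront.

    Only cache misses reach the origin: misses on the single hot
    article plus misses on the long tail of other URLs.
    """
    hot_misses = total_rps * hot_fraction * (1 - hot_hit_ratio)
    tail_misses = total_rps * (1 - hot_fraction) * (1 - tail_hit_ratio)
    return hot_misses + tail_misses

# 180,000 RPS spike with 90% on one article URL. If CloudFront serves
# the hot article at an assumed 99.9% hit ratio (a short TTL suffices
# when a single object dominates) and the tail at an assumed 80%:
print(round(origin_rps(180_000, 0.90, 0.999, 0.80)))  # → 3762
```

Under these assumptions the origin sees roughly 3,800 RPS instead of 180,000, below the normal-operation baseline of 5,000 RPS, which is why the TTL fix plus pre-warming addresses both halves of the problem.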

Case Study 8

An online gaming company runs a mobile game with 2.8 million daily active users. The game uses Amazon API Gateway REST APIs with AWS Lambda functions for game logic and Amazon ElastiCache for Redis (cluster mode enabled) with 5 shards for player session state and leaderboard data. Player latency requirements are strict: API responses must complete within 150ms to maintain gameplay fluidity. CloudWatch metrics show that API Gateway p99 latency is 280ms, with p50 at 95ms. Detailed analysis reveals that 15% of API calls experience "cold starts" with Lambda functions taking 1,200-1,400ms to initialize due to a large dependency package (85 MB) containing game physics libraries. The Lambda functions are configured with 512 MB memory, 30-second timeout, and no provisioned concurrency. The ElastiCache cluster shows average CPU utilization of 12% and memory utilization of 34%. The game experiences uneven traffic patterns with strong regional clustering: APAC users dominate 6pm-10pm local time, European users 7pm-11pm, and North American users 8pm-midnight. The company cannot tolerate the 15% of requests experiencing >1,200ms latency.

Which solution most cost-effectively eliminates the cold start latency issue while meeting the 150ms response time requirement?

  1. Increase Lambda memory allocation to 3,008 MB to proportionally increase CPU and reduce initialization time, implement Lambda SnapStart to eliminate cold starts by pre-initializing function snapshots, reduce dependency package size by extracting physics libraries to Lambda Layers shared across functions
  2. Configure provisioned concurrency of 500 concurrent executions across all Lambda functions to eliminate cold starts during peak hours, implement Application Auto Scaling to adjust provisioned concurrency based on CloudWatch metrics for scheduled scaling with regional traffic patterns
  3. Migrate Lambda functions to container images using AWS Lambda with container support, optimize the container image using multi-stage builds to reduce size, implement Amazon ECS with Fargate Spot for cost efficiency, and use Application Load Balancer instead of API Gateway
  4. Decompose the 85 MB dependency package into Lambda Layers for shared libraries, implement Lambda SnapStart for instant initialization, configure reserved concurrency of 1,000 executions to prevent throttling, and enable API Gateway caching with 60-second TTL for GET requests

Answer & Explanation

Correct Answer: 2 - Provisioned concurrency with Application Auto Scaling for scheduled scaling

Why this is correct: Provisioned concurrency directly eliminates cold starts by keeping Lambda execution environments pre-initialized and ready to respond within milliseconds, addressing the 1,200-1,400ms initialization problem. Application Auto Scaling with scheduled scaling aligns provisioned concurrency with the described regional traffic patterns: scaling up before the 6pm APAC, 7pm European, and 8pm North American peaks, then scaling down during off-peak hours. This maintains the 150ms response time requirement while minimizing costs (provisioned concurrency is charged for configured capacity and duration, not per invocation, so scheduled scaling is cheaper than running provisioned concurrency 24/7). With 15% of requests suffering cold starts, provisioned concurrency for peak hours is justified. The uneven regional traffic pattern makes scheduled scaling particularly cost-effective: provisioned concurrency runs only during the high-traffic periods when cold starts would occur, not continuously.

Why the other options are wrong:

  • Option 1: Increasing memory to 3,008 MB increases CPU proportionally (approximately 6x from 512 MB) and might reduce initialization time from 1,200ms to perhaps 400-600ms, but this still exceeds the 150ms requirement and doesn't eliminate cold starts; it only makes them faster. Lambda SnapStart requires specific runtime support (it launched for Java 11+ on Corretto); the scenario describes a generic Lambda use case without specifying a supported runtime, so SnapStart may not apply. Lambda Layers help with organization but don't significantly reduce initialization time (dependencies still load into the execution environment). This solution reduces but doesn't eliminate cold start latency.
  • Option 3: Migrating to container images and ECS Fargate is a massive architectural change that eliminates Lambda's serverless benefits (automatic scaling, per-millisecond billing, no infrastructure management) and introduces container orchestration complexity. Fargate Spot offers cost savings but has interruption risk unsuitable for real-time gaming. Container images still experience cold starts when scaling from zero or scaling up. Replacing API Gateway with ALB requires significant application changes (authentication, request validation, throttling must be reimplemented). This solution is operationally complex, doesn't guarantee elimination of initialization delays, and loses serverless benefits for marginal cost optimization that could be achieved with Lambda scheduled scaling.
  • Option 4: Lambda Layers help dependency management but, as noted, don't significantly reduce initialization time; the physics libraries still must load into memory during function initialization. Lambda SnapStart has runtime limitations (Java-specific). Reserved concurrency limits maximum concurrent executions but doesn't keep execution environments pre-initialized; it prevents throttling but doesn't eliminate cold starts. API Gateway caching with a 60-second TTL helps for repeated identical requests, but gaming APIs typically involve player-specific state and actions that can't be cached (leaderboard queries might be cacheable, but game logic APIs generally aren't). This option doesn't effectively address the cold start problem.

Key Insight: The key distinction is understanding that provisioned concurrency is specifically designed to eliminate Lambda cold starts by pre-initializing execution environments, while other optimizations (increased memory, Lambda Layers, container images) reduce but don't eliminate initialization time. The scenario's regional traffic clustering is a critical clue: it makes scheduled scaling of provisioned concurrency cost-effective because capacity isn't needed 24/7. Candidates must recognize that when cold start latency exceeds requirements (1,200ms vs. the 150ms requirement), optimization isn't sufficient; architectural solutions like provisioned concurrency are necessary. The cost-effectiveness constraint steers away from continuous provisioned concurrency toward scheduled scaling aligned with traffic patterns.
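As a hedged sketch of what the correct option looks like in practice, the payloads below mirror the Application Auto Scaling API for Lambda provisioned concurrency (boto3's `application-autoscaling` client: `register_scalable_target` and `put_scheduled_action`). The function name, alias, capacities, and cron times are hypothetical.

```python
# Hypothetical alias; provisioned concurrency must target a published
# alias or version, never $LATEST.
RESOURCE_ID = "function:game-logic:live"
DIMENSION = "lambda:function:ProvisionedConcurrency"

def scalable_target() -> dict:
    # Register the alias with Application Auto Scaling
    # (kwargs for register_scalable_target).
    return {
        "ServiceNamespace": "lambda",
        "ResourceId": RESOURCE_ID,
        "ScalableDimension": DIMENSION,
        "MinCapacity": 10,    # assumed off-peak floor
        "MaxCapacity": 500,   # assumed peak-hour ceiling
    }

def scheduled_action(name: str, cron: str, minimum: int, maximum: int) -> dict:
    # One action per regional peak (kwargs for put_scheduled_action);
    # scheduled-action cron expressions are evaluated in UTC.
    return {
        "ServiceNamespace": "lambda",
        "ResourceId": RESOURCE_ID,
        "ScalableDimension": DIMENSION,
        "ScheduledActionName": name,
        "Schedule": cron,
        "ScalableTargetAction": {"MinCapacity": minimum,
                                 "MaxCapacity": maximum},
    }

# Scale up shortly before the APAC evening peak and back down after it;
# real deployments would add matching pairs for Europe and North America.
actions = [
    scheduled_action("apac-peak-up", "cron(45 9 * * ? *)", 400, 500),
    scheduled_action("apac-peak-down", "cron(15 14 * * ? *)", 10, 500),
]
print(len(actions))
```

A target-tracking policy on the `ProvisionedConcurrencyUtilization` metric can be layered on top so capacity also follows unplanned demand between the scheduled steps.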

Case Study 9

A SaaS company provides business intelligence dashboards to 4,500 enterprise customers. Each customer dashboard queries customer-specific data from Amazon Redshift (dc2.8xlarge cluster with 6 nodes), displaying charts and metrics. Dashboard queries are complex, involving 8-15 table joins across fact and dimension tables, with average query execution time of 4.8 seconds. During business hours (8am-6pm across time zones), concurrent dashboard users average 1,200, generating approximately 800 concurrent queries to Redshift. The company has implemented WLM (Workload Management) with 5 queues, allocated 20% memory to each queue. Recent performance degradation shows query wait times increasing from 2.1 seconds to 18.7 seconds during peak hours (10am-2pm EST). Analysis of Redshift system tables shows that WLM queue wait time accounts for 82% of total query latency, and queries are evenly distributed across the 5 WLM queues. The cluster CPU averages 68%, disk space is at 42% utilization, and no disk-based queries are occurring. Each customer's data is isolated in separate schemas with identical table structures. Query patterns are consistent across customers, and 60% of dashboard queries request data from the most recent 30 days, while the database contains 5 years of historical data.

What architectural change will most significantly reduce query latency while optimizing costs?

  1. Increase WLM queue count from 5 to 15 to accommodate higher concurrency, allocate memory proportionally across queues, implement short query acceleration (SQA) to prioritize fast queries, and enable automatic WLM to dynamically manage queue resources
  2. Implement materialized views for the most common dashboard queries aggregating recent 30-day data, configure automatic refresh on schedule, create additional sort keys on frequently filtered columns, and implement result caching at the application layer with 5-minute TTL
  3. Migrate from dc2.8xlarge cluster to RA3.4xlarge cluster with managed storage to separate compute and storage, enable Redshift concurrency scaling to automatically add transient cluster capacity during peak hours, and implement workload management with automatic WLM for dynamic resource allocation
  4. Implement table partitioning by date with monthly partitions to isolate recent data, configure distribution keys on customer_id for co-location of related data, increase cluster size to 10 nodes to provide additional query processing capacity, and implement Redshift Spectrum for historical data beyond 1 year

Answer & Explanation

Correct Answer: 3 - Migrate to RA3 with managed storage, enable concurrency scaling, and implement automatic WLM

Why this is correct: The scenario identifies that 82% of query latency is WLM queue wait time with 800 concurrent queries: a classic concurrency bottleneck. Redshift concurrency scaling automatically adds transient cluster capacity (additional clusters) during high-concurrency periods to handle read queries, eliminating queue wait time and directly addressing the 82% of latency spent waiting in queues. RA3 instances with managed storage separate compute from storage, enabling concurrency scaling to add compute capacity without duplicating storage (RA3 shares managed storage across the primary and concurrency scaling clusters). Automatic WLM dynamically allocates memory and concurrency slots based on workload, optimizing resource utilization better than the static 5-queue configuration. This solution is cost-optimized because concurrency scaling clusters operate only during peak hours (10am-2pm) when needed, with per-second billing, and the RA3 migration enables this architecture. CPU at 68% indicates the primary cluster has adequate processing capacity during non-peak hours, so the issue is peak concurrency, not baseline capacity; concurrency scaling addresses exactly this pattern.

Why the other options are wrong:

  • Option 1: Increasing WLM queues from 5 to 15 doesn't add query processing capacity; it just subdivides the same resources into more queues. With memory split across 15 queues instead of 5, each queue gets less memory (6.7% vs. 20%), potentially worsening performance by forcing queries to run with less memory and spill to disk. Short query acceleration (SQA) helps fast queries bypass queues but doesn't increase overall concurrency capacity; 800 concurrent queries still exceed the cluster's concurrency capability (typical dc2.8xlarge clusters handle 50-150 concurrent queries comfortably). Automatic WLM helps resource allocation but doesn't add capacity. This option rearranges resources without addressing the fundamental concurrency limit.
  • Option 2: Materialized views can accelerate specific aggregation queries and help with the 60% of queries targeting recent 30-day data, but they don't address the fundamental concurrency limit: 800 concurrent users still generate 800 concurrent queries, even if those queries run faster against materialized views. Faster individual query execution (reducing from 4.8s to perhaps 2s) doesn't eliminate the 18.7s queue wait time when queries are waiting for concurrency slots. Result caching at the application layer helps if users repeatedly run identical queries, but dashboard users typically view different metrics or time ranges, limiting cache effectiveness. This solution improves individual query performance but doesn't solve the concurrency bottleneck identified in the scenario (82% of latency is wait time, not execution time).
  • Option 4: Table partitioning by date can improve query performance by pruning partitions, but Redshift doesn't have native table partitioning; it uses distribution and sort keys. The suggestion to "partition by date" likely means implementing date-based sort keys, which helps queries filtering by date but doesn't address concurrency limits. Increasing cluster size from 6 to 10 nodes adds capacity but raises costs by 67% continuously, even during off-peak hours when the capacity isn't needed (costly compared to concurrency scaling, which operates on demand). Redshift Spectrum for historical data helps with storage costs but adds query complexity and doesn't address peak-hour concurrency for recent-data queries. This solution is expensive and doesn't specifically target the queue wait time problem.

Key Insight: The critical distinction is recognizing that WLM queue wait time indicates a concurrency bottleneck (too many simultaneous queries for the available concurrency slots), not a query performance problem. Redshift concurrency scaling is designed for exactly this scenario: temporarily adding query processing capacity during peak periods. The phrase "82% of total query latency is WLM queue wait time" is the diagnostic key: queries execute reasonably fast (4.8s) but spend most of their time waiting for available concurrency slots. Candidates must differentiate between query performance optimization (materialized views, sort keys, faster execution) and concurrency capacity (concurrency scaling, adding clusters). On cost, concurrency scaling operates only during peaks with per-second billing, whereas a permanently larger cluster is paid for around the clock, making concurrency scaling the most cost-effective solution.
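Concurrency scaling and automatic WLM are switched on through a cluster parameter group. The sketch below builds the kwargs you might pass to boto3's `redshift.modify_cluster_parameter_group`; the parameter group name and cluster cap are hypothetical, and the exact shape of the `wlm_json_configuration` value is an assumption based on the automatic-WLM format, so verify it against the Redshift documentation before use.

```python
import json

def parameter_group_update() -> dict:
    # Assumed automatic-WLM queue definition with concurrency scaling
    # set to "auto", plus a cap on how many transient clusters Redshift
    # may add during peaks.
    wlm = [{"auto_wlm": True, "concurrency_scaling": "auto"}]
    return {
        "ParameterGroupName": "bi-dashboards-pg",  # hypothetical name
        "Parameters": [
            {"ParameterName": "wlm_json_configuration",
             "ParameterValue": json.dumps(wlm)},
            {"ParameterName": "max_concurrency_scaling_clusters",
             "ParameterValue": "4"},  # assumed peak-hour cap
        ],
    }

update = parameter_group_update()
queues = json.loads(update["Parameters"][0]["ParameterValue"])
print(queues[0]["concurrency_scaling"])  # → auto
```

Changing `wlm_json_configuration` requires a cluster reboot to take effect, and the `ConcurrencyScalingActivity`/`ConcurrencyScalingSeconds` CloudWatch metrics confirm whether transient clusters are actually absorbing the peak-hour queue wait.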

Case Study 10

A logistics company operates a fleet management system tracking 25,000 delivery vehicles globally. Each vehicle transmits GPS location, speed, fuel level, and engine diagnostics every 10 seconds to AWS IoT Core, generating approximately 2.5 million messages per minute. The system uses IoT Rules to route data to Amazon Timestream for time-series storage and analysis. Fleet managers run real-time dashboard queries showing vehicle locations, route efficiency, and predictive maintenance alerts. Timestream query performance has degraded significantly: queries that previously returned results in 1.2 seconds now take 14-22 seconds. The Timestream database contains 14 months of data (approximately 18 TB), with the most recent 30 days representing 40% of query volume. Queries typically filter by vehicle_id, time range (usually last 24-48 hours), and geographic region. The Timestream table uses default memory store retention of 24 hours and magnetic store for older data. Query analysis shows that 70% of queries access data from the last 7 days, 20% access 8-30 days, and 10% access historical data beyond 30 days. The company requires dashboard query response times under 3 seconds and has observed that queries against memory store data complete in under 2 seconds, while queries requiring magnetic store access take 12-20 seconds.

What configuration change will most effectively improve query performance to meet the 3-second requirement?

  1. Increase Timestream memory store retention from 24 hours to 7 days to keep frequently accessed data in memory store, implement query result caching at the application layer with 60-second TTL, and create Timestream scheduled queries to pre-aggregate common metrics
  2. Implement Timestream multi-measure records to consolidate GPS location, speed, fuel, and diagnostics into single records reducing query complexity, enable query performance insights, partition data by geographic region using separate tables, and configure parallel query execution
  3. Migrate historical data beyond 90 days to Amazon S3 with Parquet format for cost optimization, implement AWS Glue for data cataloging, use Amazon Athena for historical queries, and maintain Timestream for recent data with 90-day retention in magnetic store
  4. Increase memory store retention to 30 days to maintain all frequently accessed data in memory store, implement Amazon ElastiCache for Redis to cache query results with 5-minute TTL, configure DynamoDB Accelerator for additional caching layer, and enable Timestream data compression

Answer & Explanation

Correct Answer: 1 - Increase memory store retention to 7 days, implement application-layer caching, and create scheduled queries for pre-aggregation

Why this is correct: The scenario clearly demonstrates that memory store queries complete in under 2 seconds (meeting the 3-second requirement) while magnetic store queries take 12-20 seconds (failing it). Since 70% of queries access data from the last 7 days, increasing memory store retention from 24 hours to 7 days ensures those queries hit the fast memory store instead of the slower magnetic store; this single change brings the majority of queries into the sub-2-second range. Application-layer caching with a 60-second TTL addresses dashboard refresh patterns: fleet managers typically view dashboards continuously with periodic refreshes, so caching identical queries for 60 seconds reduces Timestream load without serving stale data (vehicle locations update every 10 seconds, but a 60-second dashboard refresh is acceptable for a fleet overview). Timestream scheduled queries pre-aggregate common metrics (average speed by route, fuel consumption by vehicle) that dashboards frequently request, providing near-instant results for aggregated views. This combination meets the performance requirement cost-effectively: memory store is more expensive than magnetic store but far cheaper than over-engineering with additional caching layers.

Why the other options are wrong:

  • Option 2: Multi-measure records are a data modeling optimization that improves ingestion efficiency and can slightly improve query performance by reducing rows scanned, but they don't address the fundamental memory store vs. magnetic store latency difference (2 seconds vs. 12-20 seconds). Query performance insights help identify slow queries but don't improve performance. Partitioning data into separate tables by geographic region fragments the dataset and complicates queries that need cross-region analysis (e.g., fleet-wide metrics), while providing minimal performance benefit; Timestream's time-based partitioning already optimizes time-range queries. Timestream also doesn't offer configurable "parallel query execution"; query optimization is automatic. This option doesn't address the core issue: 70% of queries are hitting magnetic store when they should hit memory store.
  • Option 3: Migrating historical data to S3/Athena creates a tiered architecture that adds complexity and doesn't solve the immediate problem; queries for data from the last 7 days (70% of queries) are slow because that data sits in magnetic store, not because data older than 90 days exists. Moving old data to S3 doesn't accelerate recent-data queries, and Athena queries against S3 are typically slower than Timestream magnetic store queries for time-series data. This solution addresses data lifecycle management and cost optimization for very old data but doesn't improve the 14-22 second query times for recent data.
  • Option 4: Increasing memory store retention to 30 days would improve performance for 90% of queries (70% + 20% accessing the last 30 days) but is unnecessarily expensive; memory store costs approximately 10x more than magnetic store. The scenario shows that 70% of queries access the last 7 days, so extending to 30 days adds significant cost for marginal benefit (covering an additional 20% of queries). Implementing both ElastiCache and DynamoDB Accelerator (DAX) is over-engineering: two separate caching layers add operational complexity, cost, and cache consistency challenges, and DAX accelerates DynamoDB specifically, not Timestream. Timestream data compression is automatic, not configurable. This option solves the problem but at excessive cost and complexity compared to the more targeted 7-day memory store retention.

Key Insight: The critical insight is understanding Timestream's two-tier storage architecture: memory store for recent, frequently accessed data with fast query performance, and magnetic store for historical data with slower but cost-effective storage. The performance cliff between memory store (under 2 seconds) and magnetic store (12-20 seconds) is dramatic. When query patterns show 70% of queries accessing a specific time range (the last 7 days), tuning memory store retention to match that access pattern provides maximum performance improvement at optimal cost. Candidates must recognize that the solution isn't necessarily "more memory store is better"; it's "align memory store retention with access patterns." The 7-day retention targets the 70% of frequent queries without overprovisioning to 30 days, demonstrating cost-aware performance optimization. Understanding service-specific storage tiers and their performance characteristics separates strong candidates from those who default to generic caching layers.
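The retention change itself is a single API call. A minimal sketch of the `UpdateTable` payload for boto3's `timestream-write` client follows; the database and table names are hypothetical, and 420 days for magnetic store is an assumption that roughly covers the scenario's 14 months of history (168 hours = 7 days).

```python
def retention_update(database: str, table: str,
                     memory_hours: int = 168,
                     magnetic_days: int = 420) -> dict:
    # 7 days of memory store covers the 70% of queries that hit the
    # last week; magnetic store retains the rest of the 14-month history.
    return {
        "DatabaseName": database,
        "TableName": table,
        "RetentionProperties": {
            "MemoryStoreRetentionPeriodInHours": memory_hours,
            "MagneticStoreRetentionPeriodInDays": magnetic_days,
        },
    }

# In practice: boto3.client("timestream-write").update_table(**req)
req = retention_update("fleet", "vehicle_telemetry")
print(req["RetentionProperties"]["MemoryStoreRetentionPeriodInHours"])  # → 168
```

Note that extending memory store retention only applies to newly ingested data; data already in magnetic store is not moved back, so the latency improvement phases in over the following week.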

The document Case Studies: Performance Tuning is a part of the AWS Solutions Architect Course AWS Solutions Architect: Professional Level.