A financial analytics company runs a mission-critical data processing pipeline on AWS that ingests market data from multiple exchanges. The application uses an Auto Scaling group of c5.2xlarge EC2 instances behind an Application Load Balancer. During market open hours (9:30 AM to 4:00 PM EST), the system experiences predictable traffic spikes, but the current scaling configuration takes 8-12 minutes to provision new instances, causing CPU utilization to reach 95% and request latency to spike from 200ms to 3,500ms. The application requires no changes to instance user data or configuration between launches. Historical data shows the spike occurs within 2 minutes of market open, and capacity needs increase by exactly 40 instances during this period. The company needs to eliminate the latency spikes while minimizing costs.
What combination of actions will most effectively reduce scaling response time and prevent performance degradation?
Correct Answer: 2 - Warm pool with hibernation, scheduled scaling, and predictive scaling
Why this is correct: This solution addresses the 8-12 minute scaling delay through multiple complementary mechanisms. The warm pool of 40 pre-initialized instances eliminates boot time: a warm-pool instance enters service in seconds, versus the minutes required to launch and configure a fresh instance. Keeping the pool hibernated additionally preserves in-memory application state, further reducing startup time. Scheduled scaling proactively moves these pre-initialized instances into service before the spike occurs, and predictive scaling uses machine learning to forecast the recurring daily pattern, providing additional capacity adjustments. This combination eliminates the performance degradation while minimizing costs: stopped or hibernated instances incur only EBS volume charges, not compute charges.
Why the other options are wrong:
Key Insight: The critical distinction is understanding that warm pools with stopped instances provide near-instant scaling (seconds) versus traditional Auto Scaling launch times (minutes), and that hibernation further optimizes by preserving application state. The exam tests whether candidates recognize that stopped instances in warm pools dramatically reduce time-to-service compared to launching from AMIs, while maintaining cost efficiency compared to running instances continuously.
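The warm pool and scheduled scaling described above could be configured roughly as follows. This is a hedged sketch expressed as boto3-style Auto Scaling request payloads (`put_warm_pool` and `put_scheduled_update_group_action`); the group name, baseline capacity, and exact schedule are assumptions, not values from the scenario beyond the 40-instance spike.

```python
# Hypothetical request payloads for the warm-pool-plus-scheduled-scaling setup.
warm_pool = {
    "AutoScalingGroupName": "market-data-asg",  # hypothetical group name
    "MinSize": 40,                  # keep 40 pre-initialized instances ready
    "PoolState": "Hibernated",      # preserve in-memory state; only EBS is billed
}

scheduled_action = {
    "AutoScalingGroupName": "market-data-asg",
    "ScheduledActionName": "market-open-scale-out",
    "Recurrence": "15 14 * * MON-FRI",  # 9:15 AM EST (14:15 UTC), ahead of the 9:30 spike
    "DesiredCapacity": 60,              # hypothetical baseline of 20 plus the 40 spike instances
}
```

Because the warm-pool instances are already initialized, the scheduled action only has to transition them into service, which is what collapses the 8-12 minute delay to seconds.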
An e-commerce platform processes product catalog searches using Amazon OpenSearch Service (formerly Elasticsearch Service). The cluster consists of three m5.large.search data nodes and three dedicated master nodes. Query performance has degraded over six months as the product catalog grew from 2 million to 8 million items. The current average query latency is 850ms, with p99 latency reaching 3.2 seconds. JVM memory pressure on data nodes averages 78%, and the cluster has experienced two split-brain incidents in the past month despite having three dedicated masters. The development team reports that 65% of queries search across all product attributes (title, description, brand, specifications) and return paginated results. Index size is now 240 GB with a single index and five primary shards, each with one replica. The company requires query latency under 200ms for p95 and wants to avoid cluster instability.
What is the MOST effective architectural change to improve query performance and cluster stability?
Correct Answer: 2 - Multi-index strategy, horizontal scaling, UltraWarm for older data, and application-layer caching
Why this is correct: This solution addresses multiple performance bottlenecks systematically. The multi-index strategy is critical: separating frequently queried fields (title, brand) into a smaller, faster index dramatically reduces query latency because OpenSearch searches smaller data sets faster, even if subsequent enrichment queries are needed. Increasing data nodes horizontally distributes query load and shard operations. UltraWarm nodes move older, less-frequently accessed product data to cost-effective storage while keeping hot data performant. Application-layer caching prevents repeated identical queries from hitting OpenSearch. This combination directly addresses the 850ms latency issue through query optimization (smaller search corpus) and cluster capacity (more nodes), while the three existing dedicated masters are sufficient for stability; the split-brain incidents indicate network or configuration issues, not insufficient master count.
Why the other options are wrong:
Key Insight: The key differentiator is recognizing that query performance in OpenSearch is fundamentally about reducing the corpus being searched. Multi-index strategies that separate frequently queried fields from the full product catalog provide order-of-magnitude improvements because OpenSearch can search a 10 GB index of titles/brands far faster than a 240 GB index with all attributes, even if a secondary enrichment query is needed. Candidates who focus solely on infrastructure scaling (bigger instances, more shards) miss the architectural optimization opportunity.
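The multi-index pattern described above can be sketched as follows: a slim "hot" index holding only the frequently queried fields, searched first, with results enriched from the full catalog index by product id. The index name, field set, and query are hypothetical, expressed as OpenSearch-style JSON bodies in Python dicts.

```python
# Hypothetical mapping for a slim hot index containing only high-traffic fields.
hot_index_mapping = {
    "mappings": {
        "properties": {
            "product_id": {"type": "keyword"},
            "title": {"type": "text"},
            "brand": {"type": "keyword"},
        }
    }
}

# The search runs against the small index only; a 10 GB corpus of titles and
# brands scans far faster than the 240 GB full-attribute index.
search_request = {
    "query": {
        "multi_match": {
            "query": "wireless headphones",   # illustrative user query
            "fields": ["title", "brand"],
        }
    },
    "_source": ["product_id"],  # return ids only; enrich from the full index afterwards
    "size": 20,
}
```

The second round-trip to fetch full documents by id is a cheap keyword lookup, which is why the two-step pattern still comes out far ahead of one scan over every attribute.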
A healthcare SaaS provider operates a multi-tenant application serving 3,200 medical practices. The application uses Amazon RDS for PostgreSQL (db.r5.4xlarge) with 10,000 provisioned IOPS. Each morning between 7:00 AM and 9:00 AM across various time zones, practices synchronize patient schedules, causing database CPU to reach 92% and read latency to spike from 8ms to 340ms. The RDS instance has 47 TB of free storage, but Performance Insights shows that 78% of database time during peak hours is consumed by three specific queries: patient schedule lookups, appointment conflict checks, and provider availability searches. All three queries join the appointments table (180 million rows) with the providers table (250,000 rows) and filter by date ranges and practice_id. Existing indexes on created_at and practice_id show low utilization. The application cannot be refactored to change query patterns, and the company requires a solution deployable within two weeks that doesn't require application code changes.
Which solution will provide the MOST significant performance improvement while meeting all constraints?
Correct Answer: 2 - Query execution analysis, composite indexes, table partitioning, and query plan caching
Why this is correct: This solution directly addresses the root cause identified in Performance Insights: inefficient query execution. The composite indexes on (practice_id, appointment_date, provider_id) align precisely with the query patterns described (filtering by practice_id and date ranges, joining with providers), enabling index-only scans or dramatically reducing rows scanned. Table partitioning by practice_id creates smaller, practice-specific partitions that queries can target directly, reducing scan overhead for the 180-million-row appointments table. Query plan caching ensures PostgreSQL reuses optimal execution plans. These are all database-level optimizations requiring no application code changes and deployable within days via maintenance windows. This addresses the 78% of database time consumed by these three queries, providing the most direct performance improvement.
Why the other options are wrong:
Key Insight: The critical distinction is recognizing that database-level optimizations (indexes, partitioning) can be implemented without application changes, while solutions involving read replicas, caching layers, or database migrations all require application modifications. Candidates must read the constraint "no application code changes" carefully and eliminate solutions that require endpoint changes, routing logic, or integration code, even if those solutions would technically work in different circumstances.
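The database-level changes described above could look roughly like the following PostgreSQL DDL. Table and column names are inferred from the scenario and are assumptions; note that partitioning an existing 180-million-row table requires creating a new partitioned table and migrating data into it, which fits the two-week window but needs planning.

```python
# Hypothetical DDL, held as strings for illustration.
# CONCURRENTLY avoids blocking writes during index creation.
composite_index_ddl = """
CREATE INDEX CONCURRENTLY idx_appt_practice_date_provider
    ON appointments (practice_id, appointment_date, provider_id);
"""

# Hash partitioning on practice_id (illustrative modulus) lets the planner
# prune each query down to a single practice's partition.
partitioned_table_ddl = """
CREATE TABLE appointments_partitioned (
    LIKE appointments INCLUDING DEFAULTS
) PARTITION BY HASH (practice_id);

CREATE TABLE appointments_p0 PARTITION OF appointments_partitioned
    FOR VALUES WITH (MODULUS 16, REMAINDER 0);
"""
```

The composite index column order matters: practice_id first (equality filter), then appointment_date (range filter), then provider_id (join column), matching how PostgreSQL can use a multi-column B-tree.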
A video streaming platform uses Amazon CloudFront to deliver content to 12 million users globally. The origin is an Amazon S3 bucket in us-east-1 containing 480,000 video files totaling 8.4 PB. The most popular 2,000 videos account for 65% of all requests, while the remaining 478,000 videos are requested infrequently. CloudFront access logs show that the cache hit ratio has declined from 89% to 61% over the past quarter as the catalog expanded. Users in APAC regions (Singapore, Tokyo, Sydney) report average initial buffering times of 4.8 seconds, compared to 1.2 seconds for users in North America. Analysis shows that 40% of APAC requests result in CloudFront origin fetches. The platform uses default CloudFront caching behaviors with TTL of 86400 seconds. Video files range from 800 MB to 12 GB, with average size of 2.4 GB. The company has a fixed CDN budget and cannot increase CloudFront costs, but needs to improve APAC performance and overall cache efficiency.
What combination of optimizations will MOST effectively improve cache hit ratio and APAC performance without increasing costs? (Select TWO)
Correct Answer: 1 and 3 - Origin Shield with cache key normalization and TTL optimization
Why these are correct: Option 1 (Origin Shield) directly addresses the declining cache hit ratio and APAC performance issues without adding new infrastructure. Origin Shield acts as a centralized caching layer between CloudFront edge locations and the S3 origin. When multiple edge locations (particularly in APAC) request the same video, Origin Shield serves it from its cache rather than each edge location independently fetching from S3 in us-east-1. This dramatically reduces origin fetch latency for APAC users (subsequent APAC requests get sub-second responses from Origin Shield rather than 4.8-second trans-Pacific fetches back to us-east-1) and improves cache efficiency. Option 3 addresses cache fragmentation: if requests include varying query strings or headers that don't affect content (tracking parameters, session IDs), they create duplicate cache entries for identical content. Normalizing cache keys consolidates these into single cache entries. Increasing TTL reduces cache expiration for static video content, and custom behaviors for popular videos ensure they remain cached. Both solutions work within existing infrastructure, requiring no additional services or data replication, thus maintaining the fixed budget constraint.
Why the other options are wrong:
Key Insight: The exam tests understanding that Origin Shield is specifically designed for scenarios with geographically distributed edge locations requesting the same content from a single origin: it collapses redundant origin requests into a single fetch. Candidates must recognize that multi-region data replication, while effective for performance, fundamentally conflicts with fixed-cost constraints due to storage and transfer costs, whereas Origin Shield (a relatively low-cost CloudFront feature) solves the same problem within budget. Cache key normalization is often overlooked but is critical when cache hit ratio declines; fragmented cache keys are a common cause of declining cache efficiency as applications evolve.
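The two selected fixes could be expressed roughly as the following fragments of a CloudFront distribution config and cache policy. Field names follow the CloudFront API shape; the bucket domain and TTL values are assumptions.

```python
# Hypothetical origin fragment: Origin Shield placed in the origin's own region.
origin_fragment = {
    "DomainName": "video-origin.s3.us-east-1.amazonaws.com",  # hypothetical bucket
    "OriginShield": {
        "Enabled": True,
        "OriginShieldRegion": "us-east-1",  # co-locate with the S3 origin
    },
}

# Hypothetical cache policy fragment: normalized cache key (no query strings,
# headers, or cookies in the key) plus a longer TTL for immutable video files.
cache_policy_fragment = {
    "MinTTL": 86400,
    "DefaultTTL": 604800,  # 7 days: published video files rarely change
    "ParametersInCacheKeyAndForwardedToOrigin": {
        "EnableAcceptEncodingGzip": False,  # video is already compressed
        "QueryStringsConfig": {"QueryStringBehavior": "none"},  # drop tracking params
        "HeadersConfig": {"HeaderBehavior": "none"},
        "CookiesConfig": {"CookieBehavior": "none"},
    },
}
```

With query strings excluded from the key, `/video.mp4?utm_source=a` and `/video.mp4?utm_source=b` resolve to one cache entry instead of two, which is exactly the fragmentation fix the explanation describes.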
A financial services company runs a real-time fraud detection system that processes 45,000 credit card transactions per second during peak hours. The architecture uses Amazon Kinesis Data Streams with 200 shards, AWS Lambda functions for fraud analysis (average execution time 280ms, p99 of 850ms), and Amazon DynamoDB for storing fraud scores and transaction history. DynamoDB tables use on-demand capacity mode. CloudWatch metrics show Lambda throttling errors increasing from 0.2% to 8.7% during peaks, and DynamoDB WriteThrottleEvents occurring at a rate of 1,200 per minute. The Lambda functions are configured with 1,024 MB memory, batch size of 100 records, and default concurrency limits. The DynamoDB table has no provisioned capacity (on-demand mode), and previously handled 35,000 transactions per second without issues. The fraud detection logic requires processing transactions within 600ms of arrival to meet SLA requirements. Recent transaction volume growth of 30% has caused end-to-end latency to reach 2,400ms during peaks, with 12% of transactions missing the SLA.
What is the MOST LIKELY root cause of the performance degradation, and what is the most appropriate solution?
Correct Answer: 3 - Reduce batch size to increase parallelism and increase Lambda memory
Why this is correct: The root cause is insufficient Lambda invocation parallelism created by the batch size of 100 records. At 45,000 transactions per second, Kinesis delivers 450 batches per second to Lambda (45,000 ÷ 100). With 280ms average execution time, each Lambda function can process approximately 3.5 batches per second (1,000ms ÷ 280ms). To process 450 batches per second requires approximately 129 concurrent Lambda executions (450 ÷ 3.5), which is well within the 1,000 default concurrency limit, so concurrency isn't the primary issue. However, with p99 latency at 850ms, slower executions create a backlog. Reducing batch size to 10 records increases the invocation rate to 4,500 per second, distributing work across more parallel Lambda functions and reducing per-invocation processing time from 280ms toward sub-100ms (fewer records per invocation). This directly addresses the 2,400ms end-to-end latency. Increasing memory to 2,048 MB proportionally increases CPU allocation, reducing execution time. The combination creates the parallelism needed for 45,000 TPS within the 600ms SLA.
Why the other options are wrong:
Key Insight: The critical insight is understanding the relationship between Kinesis batch size, Lambda concurrency, and throughput. Candidates must calculate effective parallelism: (records per second ÷ batch size) × average execution time = required concurrency. When this calculation shows parallelism is insufficient for the throughput requirement, reducing batch size increases parallel invocations. This is a common anti-pattern: using large batch sizes to reduce Lambda invocation costs works at low throughput but creates latency problems at high throughput. The exam tests whether candidates can identify that throttling in downstream services (DynamoDB) may be a symptom of upstream processing bottlenecks (Lambda) rather than the root cause.
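The back-of-envelope formula above can be run directly. The 60ms execution time for small batches is an assumption (fewer records per invocation); note also that with a Kinesis event source, actual concurrency is bounded by shard count × parallelization factor, so very small batches may additionally need a parallelization factor above the default of 1.

```python
# required concurrency = (records per second / batch size) * execution time
RECORDS_PER_SEC = 45_000

def required_concurrency(batch_size: int, exec_time_sec: float) -> float:
    batches_per_sec = RECORDS_PER_SEC / batch_size
    return batches_per_sec * exec_time_sec

# Current config: batches of 100 at 280 ms average execution time.
print(required_concurrency(100, 0.280))  # 126.0 concurrent executions

# Proposed: batches of 10, assuming execution drops to ~60 ms per invocation.
print(required_concurrency(10, 0.060))   # 270.0 concurrent executions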
A global logistics company operates a package tracking system serving 140 countries. The application runs on Amazon ECS with Fargate, using Application Load Balancer for traffic distribution. The backend queries Amazon Aurora PostgreSQL (db.r6g.4xlarge) for shipment status. The company has observed that tracking queries from users in Europe (specifically UK, Germany, France) experience average latency of 420ms, while users in the same AWS region (eu-west-1, where the application runs) experience 45ms latency. The application uses a third-party geolocation service API that adds approximately 180ms to each request for address validation. Network analysis shows that 85% of the latency difference is introduced before requests reach the Application Load Balancer. The company has implemented CloudFront with default caching policies, but tracking queries include unique tracking numbers in URL paths, preventing effective caching. The application must return real-time shipment status and cannot serve stale data. The company requires a solution that reduces latency for end users in Europe without application code changes.
What is the most effective solution to reduce user-perceived latency?
Correct Answer: 1 - AWS Global Accelerator in front of ALB
Why this is correct: The scenario explicitly states that 85% of the latency difference (approximately 320ms of the 375ms gap between 420ms and 45ms) is introduced before requests reach the ALB, meaning the latency is in internet transit, not application processing. AWS Global Accelerator routes user traffic from edge locations through AWS's private global network directly to the ALB in eu-west-1, bypassing congested internet paths, middle-mile latency, and routing inefficiencies. This directly addresses the internet transit latency without requiring application changes, regional deployments, or code modifications. Global Accelerator maintains persistent connections to the origin, reducing TCP handshake overhead. The 180ms geolocation API delay affects all users equally and isn't the differential latency source. Global Accelerator can provide roughly 20-50% latency reduction for international traffic, which would bring the 420ms down to approximately 250-300ms, substantially improving user experience while requiring only infrastructure configuration (no application changes).
Why the other options are wrong:
Key Insight: The key differentiator is recognizing where latency is introduced in the request path: before the load balancer versus during application processing. The phrase "85% of latency is introduced before requests reach the ALB" is the critical clue that internet transit is the problem, not application performance, database queries, or API calls. Global Accelerator specifically solves internet transit latency by moving traffic onto AWS's private network. Candidates who focus on caching or application optimization miss this fundamental diagnosis. Understanding where in the request path latency occurs determines which AWS service appropriately addresses it.
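Placing Global Accelerator in front of the existing ALB is a pure infrastructure change, sketched below as boto3-style request payloads (`create_accelerator`, `create_listener`, `create_endpoint_group`). The names and the placeholder ALB ARN are assumptions.

```python
# Hypothetical Global Accelerator setup in front of the eu-west-1 ALB.
accelerator = {
    "Name": "tracking-accelerator",  # hypothetical name
    "IpAddressType": "IPV4",
    "Enabled": True,
}

listener = {
    "Protocol": "TCP",
    "PortRanges": [{"FromPort": 443, "ToPort": 443}],  # HTTPS traffic only
}

endpoint_group = {
    "EndpointGroupRegion": "eu-west-1",  # same region as the ECS/ALB stack
    "EndpointConfigurations": [
        {
            # Placeholder ARN for the existing Application Load Balancer.
            "EndpointId": "arn:aws:elasticloadbalancing:eu-west-1:123456789012:loadbalancer/app/tracking-alb/abc123",
            "ClientIPPreservationEnabled": True,  # keep real client IPs for address validation
        }
    ],
}
```

Users then hit the accelerator's anycast IPs at the nearest edge location, and traffic rides AWS's backbone to the ALB; nothing about the application, the ALB, or the ECS services changes.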
A media company hosts a news website that publishes breaking news articles. The site receives 5,000 requests per second during normal operation, spiking to 180,000 requests per second when major breaking news occurs, with 90% of spike traffic concentrated on a single article URL. The architecture uses Amazon CloudFront backed by an Application Load Balancer and an Auto Scaling group of EC2 instances running a Node.js application that queries Amazon DynamoDB for article content. During the last major news event, the website experienced severe degradation: CloudFront reported elevated origin errors (HTTP 502/503), EC2 Auto Scaling successfully launched 200 additional instances within 6 minutes, but users continued experiencing errors for 18 minutes. CloudWatch Logs showed that the ALB was rejecting connections with "503 Service Unavailable" despite EC2 instances being healthy and CPU utilization at only 40%. The DynamoDB table is configured with on-demand capacity and showed no throttling. The company needs to prevent this failure pattern during the next breaking news event, which could occur at any time.
What was the MOST LIKELY cause of the continued failures despite successful Auto Scaling, and what is the most appropriate preventive solution?
Correct Answer: 4 - ALB requires pre-warming for extreme traffic spikes, CloudFront caching to reduce origin load, and WAF rate limiting
Why this is correct: Application Load Balancers automatically scale to handle increased traffic, but this scaling is gradual and optimized for typical traffic patterns. A 36x traffic increase within minutes (5,000 to 180,000 RPS) exceeds ALB's ability to scale instantly; ALBs can typically absorb traffic increases of roughly 50% every 3-5 minutes. The 503 errors despite healthy instances and low CPU indicate the ALB itself was the bottleneck, not the EC2 layer. Pre-warming involves contacting AWS Support to scale ALB capacity in advance of expected traffic spikes, which is the standard solution for known high-traffic events. Implementing proper CloudFront caching with appropriate TTLs reduces origin requests dramatically: if 90% of traffic is for a single article URL, CloudFront should serve the vast majority from cache rather than forwarding to the origin. WAF rate limiting protects against traffic spikes exceeding ALB capacity. This combination addresses both the immediate bottleneck (ALB scaling) and reduces the underlying problem (excessive origin requests for cacheable content).
Why the other options are wrong:
Key Insight: The critical insight is recognizing that ALB, while auto-scaling, cannot instantly scale to extreme traffic spikes; it requires pre-warming for predictable events or proper CloudFront caching to prevent most traffic from reaching the origin. The scenario provides key diagnostic clues: "ALB was rejecting connections" and "EC2 instances healthy with 40% CPU" point to the ALB as the bottleneck, not the application tier. Candidates must understand that each AWS service has its own scaling characteristics, and even managed services like ALB have limits on instantaneous scaling velocity. The combination of preventive measures (pre-warming for known events) and architectural solutions (CloudFront caching to reduce origin load) addresses both immediate and structural issues.
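The caching and rate-limiting pieces of the preventive solution can be sketched as follows. Even a 30-second TTL on article pages means one origin fetch can serve roughly 5.4 million edge requests during a 180,000 RPS spike on a single URL. The path pattern, TTLs, and rate limit are assumptions, shown as CloudFront-behavior and WAF-rule fragments.

```python
# Hypothetical CloudFront cache behavior: breaking-news pages tolerate
# ~30 seconds of staleness, which keeps almost all spike traffic at the edge.
article_cache_behavior = {
    "PathPattern": "/articles/*",  # hypothetical URL layout
    "MinTTL": 0,
    "DefaultTTL": 30,
    "MaxTTL": 60,
}

# Hypothetical WAF rate-based rule as a backstop against abusive clients.
waf_rate_rule = {
    "Name": "per-ip-rate-limit",
    "Priority": 1,
    "Statement": {
        # Limit is evaluated per source IP over a rolling 5-minute window.
        "RateBasedStatement": {"Limit": 2000, "AggregateKeyType": "IP"},
    },
    "Action": {"Block": {}},
}
```

The cache behavior is the structural fix (origin traffic collapses regardless of spike size); the WAF rule and ALB pre-warming cover whatever still reaches the load balancer.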
An online gaming company runs a mobile game with 2.8 million daily active users. The game uses Amazon API Gateway REST APIs with AWS Lambda functions for game logic and Amazon ElastiCache for Redis (cluster mode enabled) with 5 shards for player session state and leaderboard data. Player latency requirements are strict: API responses must complete within 150ms to maintain gameplay fluidity. CloudWatch metrics show that API Gateway p99 latency is 280ms, with p50 at 95ms. Detailed analysis reveals that 15% of API calls experience "cold starts" with Lambda functions taking 1,200-1,400ms to initialize due to a large dependency package (85 MB) containing game physics libraries. The Lambda functions are configured with 512 MB memory, 30-second timeout, and no provisioned concurrency. The ElastiCache cluster shows average CPU utilization of 12% and memory utilization of 34%. The game experiences uneven traffic patterns with strong regional clustering: APAC users dominate 6pm-10pm local time, European users 7pm-11pm, and North American users 8pm-midnight. The company cannot tolerate the 15% of requests experiencing >1,200ms latency.
Which solution most cost-effectively eliminates the cold start latency issue while meeting the 150ms response time requirement?
Correct Answer: 2 - Provisioned concurrency with Application Auto Scaling for scheduled scaling
Why this is correct: Provisioned concurrency directly eliminates cold starts by keeping Lambda execution environments pre-initialized and ready to respond within milliseconds, addressing the 1,200-1,400ms initialization problem. Application Auto Scaling with scheduled scaling aligns provisioned concurrency with the described regional traffic patterns: scaling up provisioned concurrency before 6pm APAC, 7pm Europe, and 8pm North America, then scaling down during off-peak hours. This maintains the 150ms response time requirement while minimizing costs (provisioned concurrency is charged for configured capacity and duration, not per invocation, so scheduled scaling reduces costs compared to 24/7 provisioned concurrency). With 15% cold starts affecting user experience, provisioned concurrency for peak hours is justified. The uneven regional traffic pattern makes scheduled scaling particularly cost-effective: provisioned concurrency runs only during high-traffic periods when cold starts would occur, not continuously.
Why the other options are wrong:
Key Insight: The key distinction is understanding that provisioned concurrency is specifically designed to eliminate Lambda cold starts by pre-initializing execution environments, while other optimizations (increased memory, Lambda layers, container images) reduce but don't eliminate initialization time. The scenario's regional traffic clustering is a critical clue: it makes scheduled scaling of provisioned concurrency cost-effective because capacity isn't needed 24/7. Candidates must recognize that when cold start latency far exceeds the requirement (1,200ms vs. 150ms), optimization isn't sufficient; architectural solutions like provisioned concurrency are necessary. The cost-effectiveness constraint steers away from continuous provisioned concurrency toward scheduled scaling aligned with traffic patterns.
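Scheduled scaling of provisioned concurrency is configured through Application Auto Scaling, sketched below as boto3-style payloads (`register_scalable_target` and `put_scheduled_action`). The function alias, capacity numbers, and cron time are assumptions keyed to the regional peaks described above.

```python
# Hypothetical scalable target: provisioned concurrency on a Lambda alias
# (provisioned concurrency must target an alias or published version).
scalable_target = {
    "ServiceNamespace": "lambda",
    "ResourceId": "function:game-logic:live",  # hypothetical function:alias
    "ScalableDimension": "lambda:function:ProvisionedConcurrency",
    "MinCapacity": 10,
    "MaxCapacity": 500,
}

# Hypothetical scheduled action: raise capacity ahead of the APAC evening peak.
apac_scale_up = {
    "ServiceNamespace": "lambda",
    "ScheduledActionName": "apac-evening-peak",
    "ResourceId": "function:game-logic:live",
    "ScalableDimension": "lambda:function:ProvisionedConcurrency",
    "Schedule": "cron(0 9 * * ? *)",  # ~6 PM JST expressed in UTC (illustrative)
    "ScalableTargetAction": {"MinCapacity": 300, "MaxCapacity": 500},
}
```

Matching scale-up actions before the European and North American windows, plus scale-down actions after each, would complete the schedule; outside those windows only the small baseline capacity is billed.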
A SaaS company provides business intelligence dashboards to 4,500 enterprise customers. Each customer dashboard queries customer-specific data from Amazon Redshift (dc2.8xlarge cluster with 6 nodes), displaying charts and metrics. Dashboard queries are complex, involving 8-15 table joins across fact and dimension tables, with average query execution time of 4.8 seconds. During business hours (8am-6pm across time zones), concurrent dashboard users average 1,200, generating approximately 800 concurrent queries to Redshift. The company has implemented WLM (Workload Management) with 5 queues, allocated 20% memory to each queue. Recent performance degradation shows query wait times increasing from 2.1 seconds to 18.7 seconds during peak hours (10am-2pm EST). Analysis of Redshift system tables shows that WLM queue wait time accounts for 82% of total query latency, and queries are evenly distributed across the 5 WLM queues. The cluster CPU averages 68%, disk space is at 42% utilization, and no disk-based queries are occurring. Each customer's data is isolated in separate schemas with identical table structures. Query patterns are consistent across customers, and 60% of dashboard queries request data from the most recent 30 days, while the database contains 5 years of historical data.
What architectural change will most significantly reduce query latency while optimizing costs?
Correct Answer: 3 - Migrate to RA3 with managed storage, enable concurrency scaling, and implement automatic WLM
Why this is correct: The scenario identifies that 82% of query latency is WLM queue wait time with 800 concurrent queries: this is a classic concurrency bottleneck. Redshift concurrency scaling automatically adds transient cluster capacity (additional clusters) during high-concurrency periods to handle read queries, eliminating queue wait time. This directly addresses the 82% of latency spent waiting in queues. RA3 instances with managed storage separate compute from storage, enabling concurrency scaling to add compute capacity without duplicating storage (RA3 shares managed storage across the primary and concurrency scaling clusters). Automatic WLM dynamically allocates memory and concurrency slots based on workload, optimizing resource utilization better than the static 5-queue configuration. This solution is cost-optimized because concurrency scaling clusters operate only during peak hours (10am-2pm) when needed, with per-second billing. The RA3 migration enables this architecture. CPU at 68% indicates the primary cluster has adequate processing capacity during non-peak hours, so the issue is peak concurrency, not baseline capacity; concurrency scaling addresses exactly this pattern.
Why the other options are wrong:
Key Insight: The critical distinction is recognizing that WLM queue wait time indicates a concurrency bottleneck (too many simultaneous queries for available concurrency slots), not a query performance problem. Redshift concurrency scaling is specifically designed for this scenario: temporarily adding query processing capacity during peak periods. The phrase "82% of total query latency is WLM queue wait time" is the diagnostic key: it means queries execute reasonably fast (4.8s) but spend most of their time waiting for available concurrency slots. Candidates must differentiate between query performance optimization (materialized views, sort keys, faster execution) and concurrency capacity (concurrency scaling, adding clusters). The cost-optimization aspect (concurrency scaling operates only during peaks with per-second billing, versus permanently increasing cluster size) makes concurrency scaling the most cost-effective solution.
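The WLM changes above are applied through the cluster's parameter group. The sketch below is illustrative of the documented `wlm_json_configuration` and `max_concurrency_scaling_clusters` settings, not a verified parameter file; the parameter group name and the cluster cap are assumptions.

```python
import json

# Hypothetical WLM configuration: automatic WLM with concurrency scaling
# enabled, replacing the static five-queue / 20%-memory-each setup.
wlm_config = [
    {
        "auto_wlm": True,               # Redshift manages memory and slot allocation
        "concurrency_scaling": "auto",  # burst read queries to transient clusters
    }
]

parameter_update = {
    "ParameterGroupName": "bi-dashboards",  # hypothetical parameter group
    "Parameters": [
        {
            "ParameterName": "wlm_json_configuration",
            "ParameterValue": json.dumps(wlm_config),
        },
        {
            # Cost guardrail: cap how many transient clusters can spin up at once.
            "ParameterName": "max_concurrency_scaling_clusters",
            "ParameterValue": "5",
        },
    ],
}
```

With this in place, queries that would previously have waited in a full queue are routed to a concurrency scaling cluster instead, which is what attacks the 82% queue-wait share directly.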
A logistics company operates a fleet management system tracking 25,000 delivery vehicles globally. Each vehicle transmits GPS location, speed, fuel level, and engine diagnostics every 10 seconds to AWS IoT Core, generating approximately 2.5 million messages per minute. The system uses IoT Rules to route data to Amazon Timestream for time-series storage and analysis. Fleet managers run real-time dashboard queries showing vehicle locations, route efficiency, and predictive maintenance alerts. Timestream query performance has degraded significantly: queries that previously returned results in 1.2 seconds now take 14-22 seconds. The Timestream database contains 14 months of data (approximately 18 TB), with the most recent 30 days representing 40% of query volume. Queries typically filter by vehicle_id, time range (usually last 24-48 hours), and geographic region. The Timestream table uses default memory store retention of 24 hours and magnetic store for older data. Query analysis shows that 70% of queries access data from the last 7 days, 20% access 8-30 days, and 10% access historical data beyond 30 days. The company requires dashboard query response times under 3 seconds and has observed that queries against memory store data complete in under 2 seconds, while queries requiring magnetic store access take 12-20 seconds.
What configuration change will most effectively improve query performance to meet the 3-second requirement?
Correct Answer: 1 - Increase memory store retention to 7 days, implement application-layer caching, and create scheduled queries for pre-aggregation
Why this is correct: The scenario clearly demonstrates that memory store queries complete in under 2 seconds (meeting the 3-second requirement) while magnetic store queries take 12-20 seconds (failing it). Since 70% of queries access data from the last 7 days, increasing memory store retention from 24 hours to 7 days ensures that 70% of queries hit the fast memory store instead of the slower magnetic store. This single change brings the majority of queries into the sub-2-second range. Application-layer caching with a 60-second TTL addresses dashboard refresh patterns: fleet managers typically view dashboards continuously with periodic refreshes, so caching identical queries for 60 seconds reduces Timestream load without serving meaningfully stale data (vehicle locations update every 10 seconds, but a 60-second dashboard refresh is acceptable for a fleet overview). Timestream scheduled queries pre-aggregate common metrics (average speed by route, fuel consumption by vehicle) that dashboards frequently request, providing near-instant results for aggregated views. This combination addresses the performance requirement cost-effectively: memory store is more expensive than magnetic store, but far cheaper than over-engineering with additional caching layers.
Why the other options are wrong:
Key Insight: The critical insight is understanding Timestream's two-tier storage architecture: memory store for recent, frequently accessed data with fast query performance, and magnetic store for historical data with slower but cost-effective storage. The performance cliff between memory store (under 2 seconds) and magnetic store (12-20 seconds) is dramatic. When query patterns show 70% of queries accessing a specific time range (the last 7 days), tuning memory store retention to match that access pattern provides maximum performance improvement at optimal cost. Candidates must recognize that the solution isn't "more memory store is better"; it's "align memory store retention with access patterns." The 7-day retention targets the 70% of frequent queries without overprovisioning to 30 days, demonstrating cost-aware performance optimization. Understanding service-specific storage tiers and their performance characteristics separates strong candidates from those who default to generic caching layers.
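The retention change described above is a single table-level setting, sketched here as a boto3-style `timestream-write` `update_table` payload following the `RetentionProperties` shape. The database and table names are assumptions.

```python
# Hypothetical retention update: align memory store with the 7-day hot window
# while keeping ~14 months of history in the magnetic store.
retention_update = {
    "DatabaseName": "fleet-telemetry",   # hypothetical database
    "TableName": "vehicle-metrics",      # hypothetical table
    "RetentionProperties": {
        "MemoryStoreRetentionPeriodInHours": 7 * 24,   # 7 days covers 70% of queries
        "MagneticStoreRetentionPeriodInDays": 430,     # ~14 months of history
    },
}
```

The 7 × 24 = 168-hour value is deliberate: it matches the observed access pattern rather than the 30-day window, which is the cost-aware tuning the key insight emphasizes.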