How RallyReel Stayed Online While AWS Took Down the Internet
A thermal event in AWS us-east-1 took down Coinbase, Suno, Canva, and over 150 other services. RallyReel didn't go down. Here's a deeper look at the architecture that kept us running.
On May 7, 2026, a thermal event at an AWS data center in Northern Virginia took down over 150 cloud services. Coinbase halted trading. Supabase went dark. AI music app Suno went down as well. Canva, Reddit, Perplexity, HubSpot — the list kept growing. Services were degraded or completely down for hours.
RallyReel was not one of them.
The Architecture That Kept Us Running
We generate AI videos using GPU compute, one of the most resource-intensive workloads you can run today. Most AI platforms are either wrappers or fully dependent on a single serverless GPU provider, which means that when their provider has a bad day, the app becomes unavailable.
That's not how we roll.
A few months ago, after an unexpected capacity crunch at our main compute provider, we rapidly designed and deployed a provider-agnostic compute layer. Instead of routing every generation task to one provider, we built a tiered fallback system across three independent compute backends:
- Primary GPU cloud — our first stop for most workloads, with dedicated capacity for different video types
- Secondary GPU cloud — a separate provider with no shared infrastructure with our primary
- On-premises server — hardware we own and control; zero dependency on any cloud provider
Every video generation task carries state that tracks its current compute provider. If a task fails or times out, our system automatically submits it to the next provider in the chain, with no human intervention required. The task resets and keeps going.
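Conceptually, the routing logic looks something like the sketch below. This is a simplified illustration, not our production code: the provider names, the `submit` callable, and the task fields are stand-ins.

```python
import time
from typing import Callable

# Illustrative names: these stand in for our real provider integrations.
PROVIDER_CHAIN = ["primary_gpu_cloud", "secondary_gpu_cloud", "on_prem"]


class ProviderError(Exception):
    """Raised when a backend fails or times out on a task."""


def run_with_fallback(task: dict, submit: Callable[[str, dict], str]) -> str:
    """Walk the provider chain until one backend completes the task.

    The task carries its current provider as state, so a retry after a
    worker crash resumes from the right position in the chain instead
    of starting over at the primary.
    """
    start = PROVIDER_CHAIN.index(task.get("provider", PROVIDER_CHAIN[0]))
    for provider in PROVIDER_CHAIN[start:]:
        task["provider"] = provider              # persist position in the chain
        task["attempt_started_at"] = time.time()
        try:
            return submit(provider, task)        # blocks until done, or raises
        except ProviderError:
            continue                             # fall through to the next backend
    raise RuntimeError(f"all providers exhausted for task {task['id']}")
```

A real implementation would add per-provider timeouts and durable persistence, but the core idea is just this: the task's own state drives the routing.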
This architecture can add real cost on our side, particularly when we run into the “noisy neighbour” problem: GPU nodes that are technically available but suffer significantly reduced CPU, memory, or network throughput because of co-tenants. Memory-heavy or CPU-heavy workloads can slow down or time out unexpectedly on such nodes, and every timeout triggers a failover we still pay for.
We're no strangers to raising these issues with our cloud providers and escalating directly to L2 and L3 support teams when needed. We have a deep understanding of GPU scheduling behavior and know that degraded nodes can be difficult — but not impossible — to trace. We're actively working with our vendors on improving transparency and surfacing deeper infrastructure-level insights into why certain nodes underperform despite being priced identically to healthy ones.
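Tracing a degraded node is easier with a canary in hand. The sketch below is a hypothetical health probe, not our production tooling or any vendor's API: it runs a short bandwidth benchmark and a fixed-size matmul when a node is acquired, and rejects nodes that fall below tuned thresholds. It assumes a CUDA-capable PyTorch install on the node.

```python
import time

# Hypothetical thresholds; real values would be tuned per GPU type and workload.
MIN_COPY_GIBPS = 5.0       # host-to-device bandwidth floor
MAX_MATMUL_SECONDS = 2.0   # ceiling for the fixed-size matmul canary


def node_is_healthy() -> bool:
    """Run a quick canary on a freshly acquired node before accepting work.

    A node can report as available while CPU, memory bandwidth, or the
    PCIe/network path are degraded by co-tenants; a one-to-two-second
    benchmark on acquisition catches most of it.
    """
    import torch  # assumes a CUDA-capable PyTorch install on the node

    # Canary 1: host-to-device copy bandwidth (catches PCIe / memory contention).
    buf = torch.empty(256 * 1024 * 1024, dtype=torch.uint8)  # 256 MiB, pageable
    t0 = time.perf_counter()
    buf.cuda()
    torch.cuda.synchronize()
    if 0.25 / (time.perf_counter() - t0) < MIN_COPY_GIBPS:
        return False

    # Canary 2: fixed-size matmuls (catch thermal throttling / SM contention).
    a = torch.randn(4096, 4096, device="cuda")
    t0 = time.perf_counter()
    for _ in range(8):
        _ = a @ a
    torch.cuda.synchronize()
    return time.perf_counter() - t0 < MAX_MATMUL_SECONDS
```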
The on-premises server at the end of the compute chain is what makes us different. When cloud providers run into datacenter failures, as they did here, our own hardware keeps running without interruption, still generating Reels and Commercials, just at a more measured pace. Reels may take a few minutes longer to generate, and sudden bursts of demand can slow throughput significantly. Eventual completion, however, is effectively guaranteed for every queued job on our platform.
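That completion guarantee comes from the queue design rather than from hope: jobs are durable, and failures re-enqueue with backoff instead of being dropped. A minimal sketch, assuming a durable queue with hypothetical `pop_ready`/`push` semantics:

```python
import time


def drain_queue(queue, process, base_delay_s: float = 30.0) -> None:
    """Keep retrying every queued job until it completes.

    Jobs are never dropped: a failed job is re-enqueued with exponential
    backoff, so during a capacity crunch throughput slows but every job
    still finishes once capacity returns. `queue` and `process` are
    stand-ins for a durable queue backend and the generation pipeline.
    """
    while True:
        job = queue.pop_ready()    # blocks until a job whose delay has elapsed
        try:
            process(job)           # e.g. run_with_fallback from the earlier sketch
        except Exception:
            job["retries"] = job.get("retries", 0) + 1
            # Exponential backoff, capped at an hour between attempts.
            job["not_before"] = time.time() + min(
                base_delay_s * 2 ** job["retries"], 3600
            )
            queue.push(job)        # re-enqueue; the job stays durable
```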
What Actually Happened Yesterday
During the outage window, GPU providers with AWS dependencies in a particular Availability Zone started reporting degraded capacity. Our system detected tasks failing to schedule and automatically routed them to the next provider in the chain. For some tasks, that meant falling all the way through to our on-premises server.
Users generating Storyboard Commercials on May 7 and 8 experienced normal wait times. Reels, and especially Multi-Shot Commercials, saw noticeably longer wait times, as some workloads were rerouted all the way to our on-premises infrastructure. We did not observe any job failures due to this capacity constraint, a significant improvement over the transient capacity crunch we experienced earlier this year.
Why It Matters for Dealers
Auto dealers operate on tight timelines. A salesperson uploading photos before a weekend push can't afford platform delays — the content needs to be ready quickly so listings stay fresh and inventory keeps moving.
That's the standard we hold ourselves to. RallyReel is built to be infrastructure you don't have to think about: you upload a photo, you get a video, and you move on with your day. The system needs to perform consistently, regardless of what's happening upstream in any given cloud region. This becomes even more important as we onboard more enterprise users and scale the volume of Reels and Commercials we generate.
The Bigger Picture
Building for resilience has been a core architectural discipline at RallyReel since day one — every new feature, video model, and infrastructure change is evaluated against a simple question: what happens when something breaks? Because in the real world — and especially with the rise of vibe-coding and increased outages even among the most reputable cloud providers — something always does.
RallyReel's mission is to be the world's leading one-click car commercial platform. That only works if we're available every time a dealer, marketing agency, or independent seller needs us — not just when everything upstream happens to be stable.
Yesterday was one of those tests, and we passed.