# Is Fly.io Good for Production? Incident Logs Say No

> Fly.io has logged 609+ incidents since 2022, averaging 13 per month. Here is what their own infra-log says about Fly.io for production workloads in 2026.
- **Author**: harsh-kanani
- **Published**: 2026-06-13
- **Modified**: 2026-06-13
- **Category**: Alternatives
- **URL**: https://kuberns.com/blogs/is-fly-io-good-for-production/

---

Fly.io is a capable platform, but whether it is reliable enough for production, with real users depending on it, is a more complicated question once you look at the public record. IsDown has tracked 609 incidents on Fly.io since June 2022, averaging 13 per month. In May 2026 alone, there were incidents on 20 out of 31 days.

Fly.io publishes every internal incident in a public infra-log, which is genuinely transparent, but it also means the reliability record is fully visible. Teams evaluating Fly.io for [production workloads that cannot afford downtime](https://kuberns.com/blogs/zero-downtime-deployment/) deserve the full picture before committing.

This post pulls directly from Fly.io's own infra-log, their public status history, and IsDown's four-year monitoring data. By the end, you will know exactly which workloads Fly.io handles well, where the platform has a documented reliability problem, and what developers are choosing instead in 2026.

**TL;DR: Is Fly.io good for production?**

- 609 documented incidents since June 2022, averaging 13 per month
- May 2026: incidents on 20 of 31 days
- Recurring issues trace back to Consul and Corrosion, Fly.io's internal state systems
- No contractual SLA on any standard plan
- Fly Postgres is self-managed: backups, failover, and upgrades are your responsibility
- Scale-to-zero cold starts affect latency-sensitive production APIs
- Kuberns deploys the same apps from GitHub in one click, with no Dockerfile, no fly.toml, and no infra to manage

## What Fly.io's Own Incident Logs Actually Show

![Fly.io incident history and outage data from infra-log](https://kuberns-blogs.s3.ap-south-1.amazonaws.com/fly-io-incident-history.png)

Fly.io maintains a public infra-log at fly.io/infra-log that documents internal incidents, including ones that never make it to their public status page. This level of transparency is rare in the industry. It is also a detailed record of what breaks, how often, and why.

According to IsDown, which has monitored Fly.io continuously since June 2022, the platform has recorded 609 outages and incidents over four years. The average resolution time when Fly.io goes down is 296 minutes, just under five hours, based on IsDown's historical data.

| Period | Days with Issues | Notable Incidents |
|--------|-----------------|-------------------|
| June 2026 (first 13 days) | 7 of 13 | 6PN mapping failures, ORD network instability, SIN 500 errors, IAD Managed Postgres failovers |
| May 2026 | 20 of 31 | West coast proxies overloaded, deploy billing bug, GRU edge crash, ARN capacity shortage |
| April 2026 | 13 of 30 | Various degraded performance events |
| March 2026 | 10 of 17 | Multitenant Consul outage affecting Unmanaged Postgres and LiteFS |
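
The headline figures are easy to sanity-check. The sketch below recomputes the monthly average and the share of affected days from the numbers quoted in this post. It assumes a 48-month tracking window (June 2022 through June 2026), and the function and variable names are illustrative; they are not part of any IsDown or Fly.io API.

```python
# Figures quoted in this post. The 48-month window is an assumption
# (June 2022 through June 2026).
TOTAL_INCIDENTS = 609
MONTHS_TRACKED = 48
AVG_RESOLUTION_MINUTES = 296

# (period, days with issues, days counted), copied as-is from the table above.
PERIODS = [
    ("June 2026 (first 13 days)", 7, 13),
    ("May 2026", 20, 31),
    ("April 2026", 13, 30),
    ("March 2026", 10, 17),
]


def incidents_per_month(total: int, months: int) -> float:
    """Average number of incidents per month over the tracked window."""
    return total / months


def share_of_days(days_with_issues: int, days_counted: int) -> float:
    """Percentage of counted days that had at least one reported issue."""
    return 100 * days_with_issues / days_counted


average = incidents_per_month(TOTAL_INCIDENTS, MONTHS_TRACKED)
print(f"Average: {average:.1f} incidents/month")
print(f"Average resolution: {AVG_RESOLUTION_MINUTES / 60:.1f} hours")
for period, bad_days, counted in PERIODS:
    print(f"{period}: {share_of_days(bad_days, counted):.0f}% of days had issues")
```

Under that 48-month assumption the average comes to roughly 12.7 incidents per month, which rounds to the 13 quoted above.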

Here are the key incidents in detail from recent months:

**June 4, 2026: Stale 6PN mappings broke private networking.** Fly.io's private network (6PN) uses WireGuard to connect all Machines. A conflict between their legacy and new stable-address systems left stale routing entries pointing to hosts that no longer existed. The issue affected Consul clusters first, then spread to customer Machines when a spike of rebalancing migrations occurred. Fly.io's own postmortem described it as "an unexpected failure mode" that took considerable time to triage.

**May 30, 2026: Deploys blocked by a billing validation error.** A mis-ordered deployment of a new Corrosion schema caused all organization updates to fail to propagate. Apps that had just added payment methods or credits received a "billing information required" error and could not deploy. The fix required reverting the GraphQL API change and manually backfilling missing data in Corrosion.

**May 28, 2026: West coast edge proxies overloaded.** Fly.io's fly-proxy load balancer and Corrosion state system interacted under high load in a way that triggered Airtime, their built-in defense mechanism, but still caused elevated error rates for users on west coast infrastructure.

**March 2026: Multitenant Consul cluster degraded.** A failed Consul node caused issues with LiteFS primary node selection and Unmanaged Postgres for versions 14.x and older. While the impact was described as limited to legacy products, it is a direct example of how Consul instability flows into database reliability.

> Curious how this compares to Railway's incident record? The [Railway production reliability breakdown](https://kuberns.com/blogs/is-railway-good-for-production/) covers five major Railway incidents from the same period, including an 8-hour full platform outage.

## The Recurring Problem: Consul and Corrosion

![Fly.io Consul and Corrosion architecture failure pattern](https://kuberns-blogs.s3.ap-south-1.amazonaws.com/fly-io-consul-corrosion-issues.png)

Almost every significant Fly.io incident in 2025 and 2026 traces back to the same two systems: Consul and Corrosion.

Consul is HashiCorp's distributed key-value store, which Fly.io uses to manage configuration, coordinate Postgres primary selection, handle LiteFS dynamic leases, and store machine state. Corrosion is Fly.io's newer distributed SQLite-backed replacement, designed to propagate state changes across their global infrastructure. The two systems now coexist, which is part of the problem.

Fly.io's own infra-log describes the current situation plainly: legacy and new systems overlap in ways that create unexpected interactions. The June 4 incident happened specifically because old-style 6PN DNAT rules, stored in Corrosion, were still being applied to Machines that had already migrated to the new stable address system. The conflict was invisible until a spike of migrations triggered it simultaneously across many hosts.

This is not a one-off bug. It is the expected consequence of running two versions of a critical infrastructure system in parallel while migrating between them. Fly.io is transparent about this: their postmortems consistently end with a list of follow-up work items to drain the legacy system and make the new one more resilient. That work is ongoing.

For production teams, the practical implication is that Fly.io's networking, Postgres, and LiteFS reliability are all tied to the health of systems that Fly.io themselves describe as still maturing. When Consul or Corrosion has an issue, it does not stay isolated.

> Before committing to Fly.io, it is worth understanding [what Fly.io actually is and how it works](https://kuberns.com/blogs/what-is-flyio/), including the infrastructure decisions that create these dependencies.

## Fly.io Production Limitations You Should Know

![Fly.io production limitations and constraints overview](https://kuberns-blogs.s3.ap-south-1.amazonaws.com/fly-io-production-limitations.png)

Beyond the incident history, several platform-level constraints are worth understanding before running production workloads on Fly.io.

**No contractual SLA on any plan:** Fly.io does not publish an SLA on their pricing page. There is no uptime guarantee, no compensation policy, and no contractual recovery window on any standard plan. If your app goes down, you have no formal recourse.
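
To make the missing SLA concrete, it helps to see how little downtime a typical uptime target allows. The sketch below is plain arithmetic. The targets are common industry tiers chosen for illustration, not figures offered by Fly.io or any other provider, and treating one average-length incident as a full outage for your app is a worst-case assumption, since many incidents are partial or regional.

```python
MINUTES_PER_DAY = 24 * 60


def downtime_budget_minutes(uptime_percent: float, days: int = 30) -> float:
    """Minutes of downtime allowed in a window of `days` at a given uptime target."""
    total_minutes = days * MINUTES_PER_DAY
    return total_minutes * (100 - uptime_percent) / 100


def implied_uptime_percent(downtime_minutes: float, days: int = 30) -> float:
    """Uptime percentage implied by a given amount of downtime in the window."""
    total_minutes = days * MINUTES_PER_DAY
    return 100 * (1 - downtime_minutes / total_minutes)


# Illustrative targets only; no provider in this post is known to offer these.
for target in (99.0, 99.9, 99.99):
    budget = downtime_budget_minutes(target)
    print(f"{target}% uptime allows {budget:.1f} min of downtime per 30 days")

# One incident of average length (296 minutes, the IsDown figure quoted above),
# treated here as a full outage for the affected app.
print(f"296 min of downtime implies {implied_uptime_percent(296):.2f}% uptime")
```

A 99.9% target leaves about 43 minutes per 30-day month, so a single incident lasting the 296-minute average would exceed it several times over if it took your app fully offline.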

**Every app requires a Dockerfile and fly.toml:** Fly.io does not auto-detect your stack. Before your first deploy, you need a Dockerfile and a fly.toml configuration file. For multi-service apps, each service has its own config to maintain. Teams without Docker experience face a real barrier to entry, and the cognitive overhead scales with the number of services.

**Scale-to-zero cold starts in production:** Fly.io supports scale-to-zero, which stops idle Machines and restarts them when traffic arrives. The cold start delay is typically several seconds. For production APIs where response time matters, this is a direct user-facing performance issue. Disabling scale-to-zero keeps Machines running but adds to your monthly bill.
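
Rather than rely on a rule of thumb, you can measure the cold start on your own app. The following is a minimal sketch: it times a series of requests and reports how much slower the first one is than the warm ones. The `fetch` callable and the example URL in the comment are placeholders you would supply, and the result is only meaningful if the app has been idle long enough for its Machines to stop before the first call.

```python
import statistics
import time
from typing import Callable, List


def measure_latencies(fetch: Callable[[], object], samples: int = 5) -> List[float]:
    """Call `fetch` repeatedly and return each call's duration in seconds."""
    if samples < 2:
        raise ValueError("need at least two samples: one cold, one warm")
    timings = []
    for _ in range(samples):
        start = time.perf_counter()
        fetch()
        timings.append(time.perf_counter() - start)
    return timings


def cold_start_penalty(timings: List[float]) -> float:
    """First-request latency minus the median of the remaining (warm) requests."""
    return timings[0] - statistics.median(timings[1:])


# Example usage against a hypothetical health endpoint (placeholder URL):
#
#   import urllib.request
#   url = "https://your-app.example/health"
#   timings = measure_latencies(lambda: urllib.request.urlopen(url, timeout=30).read())
#   print(f"cold start penalty: {cold_start_penalty(timings):.2f}s")
```

Using the median of the warm requests keeps one slow outlier from hiding or exaggerating the penalty.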

**Fly Postgres is self-managed:** Fly Postgres, which the infra-log calls Unmanaged Postgres, runs as a Machine with a persistent volume. You are responsible for backups, failover configuration, version upgrades, and operational maintenance. Fly.io provides tooling through fly postgres commands, but the responsibility is yours. Consul incidents, as the March 2026 entry shows, directly affect Unmanaged Postgres clusters. Fly.io's separate Managed Postgres product appears in the incident table too, with failovers in IAD in June 2026.
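
Owning backups means scripting them yourself. The sketch below shows one way to do that with the standard pg_dump client; it is not Fly.io's tooling. It assumes pg_dump is installed where the script runs, that the database is reachable from there, and that the connection string and backup directory (both placeholders) are supplied by you. A dump kept on the same host as the database is not an off-site backup, so copying the file elsewhere is left as a further step.

```python
import subprocess
from datetime import datetime, timezone
from pathlib import Path
from typing import List, Optional


def build_pg_dump_command(
    database_url: str, backup_dir: Path, now: Optional[datetime] = None
) -> List[str]:
    """Return a pg_dump command that writes a timestamped custom-format dump."""
    now = now or datetime.now(timezone.utc)
    target = backup_dir / f"backup-{now:%Y%m%dT%H%M%SZ}.dump"
    return [
        "pg_dump",
        "--format=custom",  # compressed archive, restorable with pg_restore
        f"--file={target}",
        f"--dbname={database_url}",
    ]


def run_backup(database_url: str, backup_dir: Path) -> None:
    """Run pg_dump and raise CalledProcessError if it exits non-zero."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(build_pg_dump_command(database_url, backup_dir), check=True)


# Example (placeholder connection string):
#   run_backup("postgres://user:password@localhost:5432/app", Path("backups"))
```

Scheduling, retention, and restore testing still have to be built around this, which is the operational work the paragraph above refers to.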

**Region-specific incidents with no automatic failover:** Fly.io incidents in 2026 have hit ORD, SIN, IAD, GRU, LAX, and ARN as isolated events. Fly.io does not automatically route your traffic to a healthy region when the region your app runs in has an issue. You get the incident your region gets, and you wait for Fly.io to resolve it.
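
If a regional incident is a risk you need to absorb yourself, one option is a health-checked fallback in whatever layer you control, such as a client, a gateway, or a monitoring job. The sketch below illustrates the idea only. It assumes you already run copies of the app in more than one region or provider and that each exposes a health URL; the endpoint names are hypothetical, and this is not a description of how Fly.io routes traffic.

```python
import urllib.request
from typing import Callable, Iterable, Optional


def http_probe(url: str, timeout: float = 3.0) -> bool:
    """Treat any 2xx response from the health URL as healthy."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return 200 <= response.status < 300


def first_healthy(
    endpoints: Iterable[str], is_healthy: Callable[[str], bool] = http_probe
) -> Optional[str]:
    """Return the first endpoint whose probe succeeds, in priority order."""
    for endpoint in endpoints:
        try:
            if is_healthy(endpoint):
                return endpoint
        except Exception:
            # A probe that raises (timeout, DNS failure, HTTP error) counts as unhealthy.
            continue
    return None


# Hypothetical health URLs, listed in order of preference:
#   first_healthy([
#       "https://app-ord.example/health",
#       "https://app-ams.example/health",
#   ])
```

Passing the probe in as a parameter keeps the selection logic testable without network access.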

**Fly.io pricing complexity:** Every resource is metered separately: Machines, volumes, bandwidth, and Postgres all bill independently. [Fly.io's pricing structure](https://kuberns.com/blogs/flyio-pricing/) can produce unexpected bills as your app grows, especially once you add databases, replicas, and persistent volumes.

> Teams evaluating their options often compare [Fly.io against Render and Kuberns side by side](https://kuberns.com/blogs/flyio-vs-render-vs-kuberns-ai/) before deciding. The differences in managed services and pricing predictability matter a lot at production scale.

## Is Fly.io Good for Production? The Honest Verdict

Fly.io is a serious platform built by engineers who care about infrastructure. The public infra-log alone puts them ahead of most competitors on transparency. But transparency is not the same as reliability, and the production picture is mixed.

| Use Case | Fly.io | Notes |
|----------|--------|-------|
| Side projects and personal apps | Suitable | Low cost, global regions, good DX for experienced devs |
| Startups with paying users | Risky | No SLA, recurring Consul/Corrosion incidents, requires ops expertise |
| Teams needing a contractual SLA | Not suitable | No SLA published on any standard plan |
| Postgres-heavy production apps | Caution required | Self-managed, affected by Consul incidents, manual backup and failover |
| Latency-sensitive APIs | Caution required | Scale-to-zero cold starts hurt response times |
| Vibe-coded apps needing fast deployment | Not ideal | Requires Dockerfile and fly.toml - no zero-config path |

[![Deploy your app on Kuberns without Dockerfiles or infra config](https://kuberns-blogs.s3.ap-south-1.amazonaws.com/deploy-on-kuberns-bannner6.png)](https://dashboard.kuberns.com)

## Where Developers Are Deploying Their Projects in 2026

![Developers choosing deployment platforms in 2026](https://kuberns-blogs.s3.ap-south-1.amazonaws.com/where-developers-deploy-their-projects.png)

Developers moving away from Fly.io are choosing Kuberns. Not because it is the newest name in the space, but because it solves the exact problem Fly.io creates: too much configuration, too much infrastructure ownership, and too many incidents traced back to systems you never asked to manage.

The shift is especially visible among developers building with AI coding tools. Cursor, Bolt, Windsurf, Lovable, and Replit users are shipping production-ready apps faster than ever, but they are not infrastructure engineers. They do not want to write a Dockerfile, debug a fly.toml, or wait on a Consul cluster recovery. They want to push code and have it live. Kuberns is where [those vibe-coded apps go after the build is done](https://kuberns.com/blogs/after-vibe-coding-deploy-your-app/).

Kuberns connects to your GitHub repository, detects your stack automatically, and deploys your app with SSL, autoscaling, and process management handled out of the box. No server to provision. No config file to maintain. No incident postmortem to read on a Monday morning.

> If you built your app in Cursor and need a production home, [deploying your Cursor project to Kuberns](https://kuberns.com/blogs/deploy-cursor-website-on-kuberns/) takes minutes with no config required.

### Why Kuberns Is the Preferred Choice Among Developers

![Kuberns agentic AI deployment platform dashboard](https://kuberns-blogs.s3.ap-south-1.amazonaws.com/kuberns-home-page-new.png)

Kuberns is an Agentic AI deployment platform built specifically for this moment, where the code is already written and deployment should not be another project.

Here is what Kuberns offers that no other platform in this category matches:

**Agentic AI stack detection:** Kuberns reads your repository and configures your deployment automatically. It identifies your framework, runtime, dependencies, and build commands without you writing a single config file. No Dockerfile, no fly.toml, no buildpack selection.

**Zero OS maintenance:** There is no server to patch, no Consul cluster to monitor, no Corrosion bugs to wait on. Kuberns handles the entire infrastructure layer. You push code; the platform handles the rest.

**Automatic SSL on every deploy:** SSL is provisioned and renewed automatically for every app and every custom domain. There is no Certbot to install, no certificate renewal cron job, and no configuration required.

**No PM2 and no process manager:** Kuberns manages your application process natively. Your Node.js, Python, Go, or any other runtime stays running without you configuring a process manager, writing systemd units, or debugging why your app died after an SSH session closed.

**Autoscaling without cold starts:** Kuberns scales your app up and down based on traffic automatically. Unlike Fly.io's scale-to-zero which introduces cold start delays, Kuberns handles scaling in a way that does not penalize your users with slow first responses.

**Unified dashboard for everything:** Logs, environment variables, custom domains, deploy history, and scaling controls all live in one dashboard. No CLI required for day-to-day operations.

**GitHub-native CI/CD:** Connect your repository, set your environment variables, and click Deploy. Every push to your main branch deploys automatically. The entire flow from code to live URL takes minutes.

**Built on AWS:** Kuberns runs on AWS infrastructure, not a shared control plane with a Consul dependency. The reliability foundation is the same infrastructure that powers the most critical production workloads on the internet.

Developers who built their apps in Bolt, Windsurf, or Lovable and need a production deployment that does not require a DevOps background are choosing Kuberns because the platform matches the way they actually work. The [shift from vibe coding to AI-powered deployment](https://kuberns.com/blogs/from-bolt-vibe-coding-to-ai-powered-deployment/) is exactly what Kuberns was built for.

> Small teams that need reliability without infrastructure overhead can also see how Kuberns compares among the [best-fit deployment platforms for small dev teams](https://kuberns.com/blogs/best-deployment-platform-small-dev-teams/).

## Conclusion

Fly.io is not a bad platform. It is built by people who care deeply about infrastructure, and their public infra-log is more honest than anything most competitors publish. But "is Fly.io reliable for production" has a specific answer when you look at the data: 609 incidents since June 2022, 13 per month on average, incidents on 20 of 31 days in May 2026, and a recurring architectural challenge in Consul and Corrosion that their own engineers are still working through.

For side projects and internal tools, Fly.io is a reasonable choice if you have the DevOps skills to manage it. For production applications with real users, paying customers, and any uptime requirement, the incident frequency and the absence of a contractual SLA make it a risky default.

The developers who need Fly.io's level of control, and have the infrastructure expertise to use it well, will continue to choose it. But most developers building production apps in 2026 are not in that category. They built their app with AI tools and need a deployment platform that matches that pace.

For those teams, Fly.io's reliability record and configuration overhead are real blockers. Kuberns handles [what Fly.io's production workload limitations](https://kuberns.com/blogs/what-is-vibe-deployment/) make difficult: zero-config deployment, automatic SSL, process management, and AWS-backed reliability, without any of the infrastructure work.

If you are ready to deploy without the overhead, [connect your GitHub repo on Kuberns](https://dashboard.kuberns.com) and go live in minutes.

[![Deploy on Kuberns with one click - no Dockerfile, no fly.toml, no config](https://kuberns-blogs.s3.ap-south-1.amazonaws.com/CTA_banner.png)](https://dashboard.kuberns.com)

## Frequently Asked Questions

### Is Fly.io good for production workloads?

Fly.io can handle production workloads but comes with real reliability risks. IsDown has tracked 609 incidents since June 2022, averaging 13 per month. Fly.io's own infra-log documents recurring issues with Consul and Corrosion, the internal systems that power app routing, Postgres, and private networking. Teams with strict uptime requirements should evaluate the incident history before committing.

### How many incidents has Fly.io had?

According to IsDown, which has monitored Fly.io continuously since June 2022, Fly.io has had 609 documented incidents over four years, averaging 13 per month. In May 2026, there were incidents on 20 out of 31 days. In June 2026, 7 of the first 13 days had reported issues.

### Does Fly.io have an SLA?

Fly.io does not publish a contractual SLA for standard plans. There is no public uptime guarantee or compensation policy on their pricing page. Teams that need a guaranteed recovery window or uptime commitment will not find one on Fly.io's standard offering.

### What is the Consul and Corrosion problem on Fly.io?

Consul is the HashiCorp distributed configuration and state management system that Fly.io relies on. Corrosion is Fly.io's newer replacement. Almost every major Fly.io incident in 2025 and 2026 traces back to one of these systems: Consul cluster degradation, Corrosion schema mismatches, and stale 6PN mappings that depend on Corrosion for routing. It is a recurring architectural issue they are actively working to resolve.

### Is Fly.io Postgres reliable for production?

Fly Postgres is not a fully managed database. It runs as a Fly Machine with a persistent volume, and you are responsible for backups, failover, and version upgrades. Consul incidents directly affect unmanaged Postgres clusters. For production databases with real users, this requires operational expertise that many teams do not have.

### Does Fly.io scale-to-zero hurt production apps?

Yes. Fly.io's scale-to-zero feature stops Machines when idle and restarts them on the next request, introducing cold start delays of several seconds. For production APIs where response time matters, this directly affects user experience and is not suitable for latency-sensitive workloads.

### Are all Fly.io regions equally reliable?

No. Fly.io incidents in 2026 show region-specific failures across ORD (Chicago), SIN (Singapore), IAD (Virginia), GRU (São Paulo), LAX (Los Angeles), and ARN (Stockholm). Fly.io does not provide automatic cross-region failover by default. An incident in the region where your app runs will affect your users until Fly.io resolves it.

### Is Fly.io better than Railway for production?

Fly.io and Railway both have significant reliability issues for production. Railway had an 8-hour full platform outage in May 2026. Fly.io has more frequent but typically shorter incidents spread across regions. Neither offers a contractual SLA on standard plans. Fly.io gives more infrastructure control; Railway offers a simpler interface. Neither is ideal for teams that cannot tolerate unplanned downtime.

### What is the best Fly.io alternative for production?

Kuberns is a strong Fly.io alternative for production. It is built on AWS, uses Agentic AI to configure deployments automatically, and handles SSL, process management, and autoscaling without any manual configuration. There is no Dockerfile required, no fly.toml to maintain, and no Consul dependency to worry about.

### Is Kuberns better than Fly.io for production?

For teams that need reliability without DevOps overhead, Kuberns is a better fit than Fly.io. Kuberns deploys from GitHub in one click, provisions SSL automatically, handles autoscaling, and runs on AWS infrastructure. You get production-grade deployment without managing Dockerfiles, fly.toml configs, Consul clusters, or server-level maintenance.
