This page contains press release content distributed by XPR Media. Members of the editorial and news staff of the USA TODAY Network were not involved in the creation of this content.

Clockwork.io Introduces A New Class of Fault Tolerance to End Failure-Driven GPU Waste in AI Training

New TorchPass solution addresses a multimillion-dollar challenge in AI infrastructure; uses Live GPU Migration to keep large-scale AI training running through hardware failures instead of forcing costly restarts

PALO ALTO, CA / ACCESS Newswire / March 11, 2026 / Clockwork.io, the leader in Software-Driven AI Fabrics (a programmable, vendor-neutral software layer that optimizes large-scale GPU clusters for real-time observability, fault tolerance, and deterministic performance), today announced the general availability of TorchPass Workload Fault Tolerance. This new class of software-driven fault tolerance eliminates one of the most costly failure modes in large-scale AI training: catastrophic job restarts caused by infrastructure faults.

Delivered as a core capability of the Clockwork.io FleetIQ platform, TorchPass applies the principles of Software-Driven AI Fabrics to distributed training, using Live GPU Migration to allow workloads to continue running through GPU failures, network disruptions, driver bugs, and even full node crashes, without checkpoint restarts or lost progress.

“Companies are investing billions in next-gen chips, yet the cost of running distributed AI jobs remains grossly inflated because the ecosystem has accepted failure as a constant,” said Suresh Vasudevan, CEO of Clockwork.io. “We built TorchPass to fundamentally reject that premise. Instead of treating failure as inevitable and restarting after the fact, TorchPass makes infrastructure faults invisible to the workload: training continues through failures transparently, in software. For a typical 2,048-GPU deployment, that translates into over $6 million a year in recovered compute. This is what our Software-Driven AI Fabric approach was designed to deliver: fault-tolerant AI infrastructure.”

Dylan Patel, Founder and CEO of SemiAnalysis, agreed that large-scale training jobs are limited by interruptions.

“As Blackwell clusters roll out with an NVL72 domain, and we look to the future with Rubin Ultra’s NVL576 domain, the idea that a single GPU error or network link flap can take down an entire run is totally unacceptable,” said Patel. “TorchPass solves a huge challenge with cluster reliability: it provides transparent failover and live workload migration that keeps MFU high, which in turn drives better GPU economics.”

Why AI Training Fails at Scale

Distributed AI training remains one of the most failure-prone workloads in modern infrastructure. As cluster sizes grow, fragility increases sharply. Research from Meta FAIR shows that mean time to failure drops to 7.9 hours in a 1,024-GPU cluster and to just 1.8 hours at 16,384 GPUs. For most large AI-focused enterprises and AI clouds, failure-driven restarts are therefore inevitable, making them a major barrier to scaling AI’s impact.
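To put the quoted mean-time-to-failure figures in perspective, the short Python sketch below converts them into expected failures per day. It assumes a constant failure rate (exponentially distributed time between failures), which is a simplifying assumption and not something the cited research is quoted as stating here; the function names are illustrative.

```python
import math

# MTTF figures quoted in the text (hours), keyed by cluster size in GPUs.
MTTF_HOURS = {1_024: 7.9, 16_384: 1.8}


def expected_failures_per_day(mttf_hours: float) -> float:
    """Average failures in 24 hours, assuming a constant failure rate."""
    return 24.0 / mttf_hours


def prob_failure_within(hours: float, mttf_hours: float) -> float:
    """Chance of at least one failure within `hours`.

    Assumes exponentially distributed time between failures (an assumption).
    """
    return 1.0 - math.exp(-hours / mttf_hours)


for gpus, mttf in MTTF_HOURS.items():
    per_day = expected_failures_per_day(mttf)
    p_hour = prob_failure_within(1.0, mttf)
    print(f"{gpus:>6} GPUs: ~{per_day:.1f} failures/day, "
          f"P(failure within 1 h) ~ {p_hour:.0%}")
```

Under these assumptions, a 1,024-GPU cluster sees roughly three failures a day and a 16,384-GPU cluster roughly thirteen.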

Each failure forces training jobs to roll back to the most recent checkpoint, discarding minutes or hours of completed work and wasting additional time on manual intervention, reprovisioning resources and restarting training. These restarts silently cap GPU utilization, making reliability one of the largest hidden costs in AI infrastructure.

TorchPass tackles this problem proactively, resolving costly AI workload failures before the job stops or needs to restart. Vital for enterprises running large AI workloads and for AI clouds alike, TorchPass dramatically improves workload reliability and cluster utilization. For AI clouds, which can now address impacted GPUs while preserving the training run as planned, this translates into better customer SLAs and stronger overall economics, improving their ability to protect margin and deliver new models sooner.

“Managing compute output across large-scale GPU clusters is vital to ensuring we’re delivering reliable capacity to our customers. By using TorchPass we have the support of a company that focuses on resilience like it is a core business function: it replaces any specific failing GPU and keeps the rest of the job moving, rather than making one small problem impact our large-scale operations,” said David Power, CTO of Nscale. “In our evaluation, Live GPU Migration preserved both run continuity and throughput under real fault conditions, which is exactly what you need to deliver predictable time-to-train and a better customer experience at scale.”

How Live GPU Migration Works: Reliability Without Restart

TorchPass performs transparent, in-flight migration of impacted training ranks to spare resources when failures occur. TorchPass typically completes recovery in approximately three minutes while the training process continues uninterrupted.

It supports resilience across three failure scenarios:

  • Unplanned migration, handling sudden events such as kernel crashes, power failures, or GPU faults by reconstructing state from healthy replicas

  • Pre-emptive migration, triggered by early warning signals such as rising temperatures or ECC memory errors, enabling controlled migration before a hard failure

  • Planned migration, enabling maintenance, patching, and workload rebalancing without interrupting training
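The three scenarios above can be read as a mapping from trigger to migration type. The Python sketch below illustrates that mapping only; `MigrationKind`, `TRIGGERS`, `classify`, and the trigger strings are hypothetical names chosen for illustration and are not part of any TorchPass or FleetIQ API.

```python
from enum import Enum


class MigrationKind(Enum):
    UNPLANNED = "unplanned"      # sudden failure; state rebuilt from healthy replicas
    PREEMPTIVE = "pre-emptive"   # early warning; controlled move before a hard failure
    PLANNED = "planned"          # operator-initiated; no fault involved


# Hypothetical trigger names, grouped as the list above describes them.
TRIGGERS = {
    "kernel_crash": MigrationKind.UNPLANNED,
    "power_failure": MigrationKind.UNPLANNED,
    "gpu_fault": MigrationKind.UNPLANNED,
    "rising_temperature": MigrationKind.PREEMPTIVE,
    "ecc_memory_errors": MigrationKind.PREEMPTIVE,
    "maintenance": MigrationKind.PLANNED,
    "patching": MigrationKind.PLANNED,
    "workload_rebalancing": MigrationKind.PLANNED,
}


def classify(trigger: str) -> MigrationKind:
    """Return the migration type for a trigger, or raise on unknown input."""
    try:
        return TRIGGERS[trigger]
    except KeyError:
        raise ValueError(f"unknown trigger: {trigger!r}") from None


print(classify("ecc_memory_errors").value)
```

An explicit table keeps the categorization auditable; a real system would presumably derive these decisions from telemetry rather than fixed strings.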

This approach reduces wasted training progress by 95%, cutting lost time from approximately three hours per day to under ten minutes in a 1,024-GPU cluster.
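As a quick check on the figures above, the Python sketch below works through the arithmetic. The GPU-hours estimate assumes the whole 1,024-GPU cluster makes no progress during lost time, which is an illustrative assumption and not a figure from Clockwork.io.

```python
def reduction_percent(before_minutes: float, after_minutes: float) -> float:
    """Percentage reduction in lost time per day."""
    return 100.0 * (before_minutes - after_minutes) / before_minutes


def gpu_hours_recovered_per_day(gpus: int, before_minutes: float,
                                after_minutes: float) -> float:
    """GPU-hours recovered daily, assuming every GPU is stalled during lost time."""
    return gpus * (before_minutes - after_minutes) / 60.0


before = 3 * 60   # approximately three hours of lost time per day, in minutes
after = 10        # upper bound quoted: under ten minutes per day

print(f"reduction: {reduction_percent(before, after):.1f}%")
print(f"recovered: {gpu_hours_recovered_per_day(1_024, before, after):.0f} GPU-hours/day")
```

Going from 180 to 10 minutes is a reduction of about 94.4%, in line with the rounded 95% figure; anything under ten minutes pushes the reduction higher.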

Jordan Nanos, Member of Technical Staff and lead author of ClusterMAX (SemiAnalysis’ independent benchmark for large-scale AI training), stress-tested Clockwork.io TorchPass and found it delivered leading performance and efficiency for large-scale distributed training, enabling users to reduce checkpointing overhead in training. He shared the following results:

“In our testing, Clockwork.io TorchPass delivered the fastest and most efficient fault-tolerant performance for a gpt-oss-120B training run. We used TorchTitan on a Kubernetes cluster with 64x H200 GPUs. During our testing we measured job completion time (JCT) and Model FLOPs Utilization (MFU) against a standard approach (checkpoint-restart) and the leading open-source fault-tolerant training framework (TorchFT). We simulated multiple hardware failures on the cluster in order to stress test the fault-tolerant training frameworks.

When compared to checkpoint-restart, TorchPass was significantly faster to recover from failures. This reduced overall JCT and maintained high MFU. And when compared to TorchFT, TorchPass had a significantly higher MFU. This reduced overall JCT while also maintaining an equal time to recover from failures.

Using TorchPass also has a downstream effect where it provides users with an opportunity to reduce or even remove checkpointing from their training code. This means larger effective batch sizes, lower risk of out of memory errors (OOMs), and less time spent thinking about storage. For a research organization, this can ultimately mean a faster time to reach their training objective,” concluded Nanos.
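For readers unfamiliar with the metric, Model FLOPs Utilization is commonly defined as the model FLOPs a job actually sustains divided by the hardware's theoretical peak. The Python sketch below shows that ratio; every input value is a placeholder for illustration, not a measurement from the SemiAnalysis test, and the function name is hypothetical.

```python
def model_flops_utilization(tokens_per_second: float,
                            flops_per_token: float,
                            num_gpus: int,
                            peak_flops_per_gpu: float) -> float:
    """MFU = achieved model FLOP/s divided by aggregate peak FLOP/s."""
    achieved = tokens_per_second * flops_per_token
    peak = num_gpus * peak_flops_per_gpu
    return achieved / peak


# Placeholder inputs only; none of these numbers come from the benchmark.
mfu = model_flops_utilization(
    tokens_per_second=1_000.0,
    flops_per_token=1e9,
    num_gpus=1,
    peak_flops_per_gpu=1e13,
)
print(f"MFU = {mfu:.0%}")
```

Because time spent stalled or recomputing lost steps contributes no useful tokens, any recovery mechanism that shortens stalls raises average tokens per second and therefore MFU over the run.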

Measurable Business Impact from Software-Driven Fault-Tolerance

For customers operating large AI clusters, the impact is immediate and measurable. In a typical 2,048-GPU H200 deployment, TorchPass Workload Fault Tolerance delivers over $6 million in annual savings by preventing wasted compute.

These savings come from eliminating hundreds of thousands of GPU-hours that would otherwise be lost to failure-driven restarts, cascading retries, and idle recovery time. By keeping training jobs running through infrastructure faults instead of restarting them, TorchPass converts lost GPU time into productive training, significantly improving the return on GPU investments that today often operate at just 30-50% of theoretical performance.

Enabling the Next Generation of AI Infrastructure

By making reliability a software-defined capability rather than a hardware constraint, TorchPass provides the operational confidence required to deploy next-generation, tightly coupled systems such as NVIDIA GB200 and GB300 NVL72 and future rack-scale systems, where dense architectures amplify the cost of even small failures.

TorchPass builds on Clockwork.io’s prior release of Network Fault Tolerance, which applies the same Software-Driven AI Fabric principles to network resilience by transparently rerouting traffic around link failures.

Together, these capabilities form Clockwork.io’s Software-Driven AI Fabric, a vendor-neutral software layer spanning network, compute, and storage. As modern AI workloads run on tightly coupled clusters where hundreds or thousands of processors must operate in coordinated lockstep, infrastructure behaves as a single system, where reliability and performance directly determine overall efficiency. By managing this complexity in software, Clockwork.io enables operators to run heterogeneous AI infrastructure as a unified platform, maintaining high utilization, predictable performance, and resilience while preserving the flexibility to evolve hardware and improve the economics of large-scale AI deployments.

To learn more about the launch of TorchPass, visit the Clockwork.io team in person at NVIDIA GTC, March 16-19, Booth #205, or visit https://clockwork.io.

About Clockwork.io
Clockwork.io pioneers Software-Driven AI Fabrics™, delivering a programmable software layer that makes large-scale AI clusters observable, deterministic, and resilient by design to drive continuous workload progress and peak cluster utilization. Its FleetIQ platform enables enterprises to train, deploy, and serve the world’s most demanding AI workloads faster, more reliably, and at lower cost. Companies including Uber, Wells Fargo, DCAI, Nebius, Nscale, and White Fiber trust Clockwork.io to power their AI infrastructure. Learn more at www.clockwork.io.

Media Contact
Dana Trismen
clockwork@unshakablemarketinggroup.com
650-269-7478

SOURCE: Clockwork

