Removing friction from Amazon SageMaker AI development
When we launched Amazon SageMaker AI in 2017, we had a clear mission: put machine learning in the hands of any developer, irrespective of their skill level. We wanted infrastructure engineers who were “total noobs in machine learning” to be able to achieve meaningful results in a week. To remove the roadblocks that kept ML accessible only to a select few with deep expertise.
Eight years later, that mission has evolved. Today’s ML builders aren’t just training simple models—they’re building generative AI applications that require massive compute, complex infrastructure, and sophisticated tooling. The problems have gotten harder, but our mission remains the same: eliminate the undifferentiated heavy lifting so builders can focus on what matters most.

In the last year, I’ve met with customers who are doing incredible work with generative AI—training massive models, fine-tuning for specific use cases, building applications that would have seemed like science fiction just a few years ago. But in these conversations, I hear about the same frustrations. The workarounds. The impossible choices. The time lost to what should be solved problems.

A few weeks ago, we released several capabilities that address these friction points: securely enabling remote connections to SageMaker AI, comprehensive observability for large-scale model development, deploying models on your existing HyperPod compute, and training resilience for Kubernetes workloads. Let me walk you through them.
The workaround tax
Here’s a problem I didn’t expect to still be dealing with in 2025—developers having to choose between their preferred development environment and access to powerful compute.
I spoke with a customer who described what they called the “SSH workaround tax”—the time and complexity cost of trying to connect their local development tools to SageMaker AI compute. They’d built an elaborate system of SSH tunnels and port forwarding that worked, sort of, until it didn’t. When we moved from SageMaker Studio Classic to the latest version of Studio, their workaround broke entirely. They had to make a choice: abandon their carefully customized VS Code setups, with all their extensions and workflows, or lose access to the compute they needed for their ML workloads.
Builders shouldn’t have to choose between their development tools and cloud compute. It’s like being forced to choose between having electricity and having running water in your house—both are essential, and the choice itself is the problem.
The technical challenge was interesting. SageMaker Studio spaces are isolated managed environments with their own security model and lifecycle. How do you securely tunnel IDE connections through AWS infrastructure without exposing credentials or requiring customers to become networking experts? The solution needed to work for different types of users—some who wanted one-click access directly from SageMaker Studio, others who preferred to start their day in their local IDE and manage all their spaces from there. We needed to improve on the work that was done for SageMaker SSH Helper.
So, we built a new StartSession API that creates secure connections specifically for SageMaker AI spaces, establishing SSH-over-SSM tunnels through AWS Systems Manager that maintain all of SageMaker AI’s security boundaries while providing seamless access. For VS Code users coming from Studio, the authentication context carries over automatically. For those who want their local IDE as the primary entry point, administrators can provide local credentials that work through the AWS Toolkit VS Code plug-in. And most importantly, the system handles network interruptions gracefully and automatically reconnects, because we know builders hate losing their work when connections drop.
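To make the mechanics a bit more concrete, here is a minimal sketch of the SSH-over-SSM pattern using the Systems Manager start_session call and the standard AWS-StartSSHSession document. The target identifier for a Studio space is a placeholder, and the SageMaker-specific StartSession API may differ in shape; treat this as an illustration of the tunneling pattern, not the exact call the AWS Toolkit makes on your behalf.

```python
import boto3

# A minimal sketch of the SSH-over-SSM pattern, using the Systems Manager
# start_session call and the standard AWS-StartSSHSession document. The
# target identifier for a SageMaker Studio space is a placeholder; the new
# SageMaker-specific StartSession API may differ in shape.
ssm = boto3.client("ssm")

session = ssm.start_session(
    Target="sagemaker-space-example-target",  # hypothetical space identifier
    DocumentName="AWS-StartSSHSession",       # SSM document that tunnels SSH
    Parameters={"portNumber": ["22"]},
)

# The returned SessionId, StreamUrl, and TokenValue are handed to the
# session-manager-plugin, which the AWS Toolkit drives for you, to open the
# encrypted tunnel your local VS Code connects through.
print(session["SessionId"], session["StreamUrl"])
```

The appeal of this pattern is that the tunnel rides entirely on AWS-managed infrastructure, so there are no inbound ports to open and no long-lived SSH keys to distribute.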
This addressed the number one feature request for SageMaker AI, but as we dug deeper into what was slowing down ML teams, we discovered that the same pattern was playing out at an even larger scale in the infrastructure that supports model training itself.
The observability paradox
The second problem is what I call the “observability paradox”: the very system designed to prevent problems becomes a source of problems itself.
When you’re running training, fine-tuning, or inference jobs across hundreds or thousands of GPUs, failures are inevitable. Hardware overheats. Network connections drop. Memory gets corrupted. The question isn’t whether problems will occur—it’s whether you’ll detect them before they cascade into catastrophic failures that waste days of expensive compute time.
To monitor these massive clusters, teams deploy observability systems that collect metrics from every GPU, every network interface, every storage device. But the monitoring system itself becomes a performance bottleneck. Self-managed collectors hit CPU limitations and can’t keep up with the scale. Monitoring agents fill up disk space, causing the very training failures they’re meant to prevent.
I’ve seen teams running foundation model training on hundreds of instances experience cascading failures that could have been prevented. A few overheating GPUs start thermal throttling, slowing down the entire distributed training job. Network interfaces begin dropping packets under increased load. What should be a minor hardware issue becomes a multi-day investigation across fragmented monitoring systems, while expensive compute sits idle.
When something does go wrong, data scientists become detectives, piecing together clues across fragmented tools—CloudWatch for containers, custom dashboards for GPUs, network monitors for interconnects. Each tool shows a piece of the puzzle, but correlating them manually takes days.
This was one of those situations where we saw customers doing work that had nothing to do with the actual business problems they were trying to solve. So we asked ourselves: how do you build observability infrastructure that scales with massive AI workloads without becoming the bottleneck it’s meant to prevent?
The solution we built rethinks observability architecture from the ground up. Instead of single-threaded collectors struggling to process metrics from thousands of GPUs, we implemented auto-scaling collectors that grow and shrink with the workload. The system automatically correlates high-cardinality metrics generated within HyperPod using algorithms designed for massive-scale time-series data. It detects not just binary failures, but what we call grey failures—partial, intermittent problems that are hard to detect but slowly degrade performance. Think GPUs that automatically slow down due to overheating, or network interfaces dropping packets under load. And you get all of this out of the box, in a single dashboard based on our lessons learned training GPU clusters at scale—with no configuration required.
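To make “grey failure” concrete, here is a small illustrative sketch of the basic idea: compare each GPU’s throughput against its own recent baseline and flag sustained shortfalls that never show up as hard errors. This is not HyperPod’s actual algorithm; the class, window, and threshold are hypothetical.

```python
from collections import deque
from statistics import mean

# Illustrative sketch of grey-failure detection: flag a GPU whose throughput
# drifts well below its own recent baseline even though it never hard-fails.
# Not HyperPod's actual algorithm; the window and threshold are made up.
class GreyFailureDetector:
    def __init__(self, window: int = 60, degradation_threshold: float = 0.85):
        self.window = window                    # samples kept per GPU
        self.threshold = degradation_threshold  # acceptable fraction of baseline
        self.history = {}                       # gpu_id -> deque of samples

    def observe(self, gpu_id: str, samples_per_sec: float) -> bool:
        """Record a throughput sample; return True if this GPU looks degraded."""
        window = self.history.setdefault(gpu_id, deque(maxlen=self.window))
        # Only judge against a full baseline of prior samples.
        baseline = mean(window) if len(window) == self.window else None
        window.append(samples_per_sec)
        if baseline is None:
            return False
        # A sustained shortfall suggests thermal throttling, packet loss, or
        # another partial failure worth surfacing before the job degrades.
        return samples_per_sec < self.threshold * baseline
```

The design point is that each GPU is compared against its own history rather than a fleet-wide constant, which is what lets gradual, device-specific degradation stand out from normal variation.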
Teams that used to spend days detecting, investigating, and remediating task performance issues now identify root causes in minutes. Instead of reactive troubleshooting after failures, they get proactive alerts when performance starts to degrade.
The compound effect
What strikes me about these problems is how they compound in ways that aren’t immediately obvious. The SSH workaround tax doesn’t just cost time—it discourages the kind of rapid experimentation that leads to breakthroughs. When setting up your development environment takes hours instead of minutes, you’re less likely to try that new approach or test that different architecture.
The observability paradox creates a similar psychological barrier. When infrastructure problems take days to diagnose, teams become conservative. They stick with smaller, safer experiments rather than pushing the boundaries of what’s possible. They over-provision resources to avoid failures instead of optimizing for efficiency. The infrastructure friction becomes innovation friction.
But these aren’t the only friction points we’ve been working to eliminate. In my experience building distributed systems at scale, one of the most persistent challenges has been the artificial boundaries we create between different phases of the machine learning lifecycle. Organizations maintain separate infrastructure for training models and serving them in production—a pattern that made sense when these workloads had fundamentally different characteristics, but one that has become increasingly inefficient as both converge on similar compute requirements. With SageMaker HyperPod’s new model deployment capabilities, we’re eliminating this boundary entirely: you can train your foundation models on a cluster and immediately deploy them on the same infrastructure, maximizing resource utilization while reducing the operational complexity of managing multiple environments.
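As a rough illustration only, the sketch below uses long-standing SageMaker endpoint calls in boto3 to show the kind of deployment step you can now run against the same compute you trained on. The HyperPod-specific wiring (for example, cluster-backed endpoint configuration or the in-cluster deployment path) is not shown, and every name here is a placeholder; check the HyperPod documentation for the exact flow.

```python
import boto3

sm = boto3.client("sagemaker")

# Placeholder names, image, and role; the HyperPod-specific compute
# configuration is intentionally omitted here.
sm.create_model(
    ModelName="my-finetuned-model",
    PrimaryContainer={
        "Image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-inference-image:latest",
        "ModelDataUrl": "s3://my-bucket/checkpoints/model.tar.gz",
    },
    ExecutionRoleArn="arn:aws:iam::123456789012:role/MySageMakerRole",
)

sm.create_endpoint_config(
    EndpointConfigName="my-finetuned-model-config",
    ProductionVariants=[{
        "VariantName": "primary",
        "ModelName": "my-finetuned-model",
        "InstanceType": "ml.g5.12xlarge",
        "InitialInstanceCount": 1,
    }],
)

sm.create_endpoint(
    EndpointName="my-finetuned-model-endpoint",
    EndpointConfigName="my-finetuned-model-config",
)
```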
For teams using Kubernetes, we’ve added a HyperPod training operator that brings significant improvements to fault recovery. When failures occur, it restarts only the affected resources rather than the entire job. The operator also monitors for common training issues such as stalled batches and non-numeric loss values. Teams can define custom recovery policies through straightforward YAML configurations. These capabilities dramatically reduce both resource waste and operational overhead.
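To give a feel for the failure signals involved, here is an illustrative sketch of those two checks written as in-loop Python. The operator performs this monitoring for you and exposes the policy through YAML; the threshold and helper below are hypothetical, not its implementation or configuration syntax.

```python
import math
import time

# Illustrative only: the kinds of failure signals the HyperPod training
# operator watches for, shown as in-loop checks. The operator handles this
# for you; the threshold and helper below are hypothetical.
STALL_TIMEOUT_SECONDS = 300  # assumed cutoff for calling a batch "stalled"

def check_batch_health(loss: float, batch_started_at: float) -> None:
    """Raise if the loss is non-numeric or the current batch has stalled."""
    if math.isnan(loss) or math.isinf(loss):
        raise RuntimeError("Non-numeric loss detected; recover from the last checkpoint")
    if time.monotonic() - batch_started_at > STALL_TIMEOUT_SECONDS:
        raise RuntimeError("Batch appears stalled; restart only the affected workers")
```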
These updates—securely enabling remote connections, auto-scaling observability collectors, seamlessly deploying models from training environments, and improving fault recovery—work together to address the friction points that prevent builders from focusing on what matters most: building better AI applications. When you remove these friction points, you don’t just make existing workflows faster; you enable entirely new ways of working.
This continues the evolution of our original SageMaker AI vision. Each step forward gets us closer to the goal of putting machine learning in the hands of any developer, with as little undifferentiated heavy lifting as possible.
Now, go build!