Breaking The AI Infra Monopoly With Rust - Tracel AI

This is my interview with Nathaniel Simard of Tracel AI. This interview is part of our recent efforts to make sense of the software engineering landscape in the age of AI. In this interview, Nathaniel and I spent a lot of time talking about the major structural issues in the GPU programming market and how Tracel plans to address them with their Rust-based stack. It is hard to overstate how big the problems that Tracel is tackling are, and it was exciting to talk to Nathaniel so early in the journey. To see jobs available at this and other cool Rust companies, check out our extensive Rust job board.

Want to advertise here? Reach out! filtra@filtra.io

Listen On Apple Podcasts | Watch On YouTube | Listen On Spotify | Listen With RSS

Drew: I think it would be interesting to start by discussing what kernel languages are and the fragmentation that exists within that ecosystem. Could you speak to the specific problems this causes for AI model developers?

Nathaniel: GPU programming is difficult; you can't simply use languages like Java or TypeScript. You have to use specialized languages designed for the GPU. While some, like CUDA, are similar to C++, you cannot compile them with standard tools like GCC. Instead, you have to navigate a fragmented ecosystem of shader and kernel languages, including Metal, Vulkan, HIP, and WGSL, the shader language of WebGPU.

Nathaniel: To leverage a GPU, you typically write your main application in a standard language like Python or C++ for the CPU side, and then embed these specific GPU kernels directly into that application.

Drew: Right, perfect. CUDA is probably what most people are familiar with because it is talked about as one of Nvidia's primary advantages. But, the entire ecosystem is incredibly fragmented. Can you explain why that's a problem for developers to have all these different languages?

Nathaniel: For CPUs, we are accustomed to using general-purpose languages like Rust, Python, or JavaScript. Deploying them is straightforward; you can run the same application on Intel or AMD CPUs, and even hardware from ten years ago works fine. Most of the portability is handled by just-in-time runtimes, like the Java runtime, or compilers like LLVM. Essentially, you write your application once and deploy it to any hardware.

Nathaniel: For GPUs, it is much more difficult. While you can use a portable language like WebGPU, you won't be able to leverage the specific hardware instructions required for maximum performance. To get those benefits, you must use hardware-specific languages: CUDA for NVIDIA, Metal for Apple Silicon, or Vulkan for Android. And, Vulkan itself encompasses multiple GPU languages. So, writing multi-platform GPU code is incredibly challenging. This fragmentation is the biggest hurdle for developers today.

Drew: So, if a developer wants to run the same model on different GPUs, they effectively have to rewrite those kernels for each platform, right?

Nathaniel: Yes, especially across different vendors. If you want to deploy a model to an NVIDIA GPU and then to an AMD GPU, the code is not portable. While kernels are generally portable across different models within the NVIDIA platform, you won't achieve optimal performance because the algorithms must be tailor-made for the specific hardware. You have to understand the underlying hardware properties to truly optimize your kernel.

Drew: This is fascinating. One thing that just occurred to me: there are chips that aren't GPUs, like Google's TPUs and other AI-specific hardware. Are those fundamentally different, or is the process the same? Do you write kernels for those chips as well?

Nathaniel: Many of those specific processors aren't necessarily programmable in the traditional sense. NPUs, for example, are quite difficult to program because they lack the shader languages found in GPUs.

Nathaniel: There are exceptions; Google TPUs use Pallas, a Python DSL, to program the hardware. Typically, however, you aren't writing assembly; you are configuring a proprietary compiler. You provide a high-level algorithm—like a matrix multiplication—and the compiler generates the code for you.

Nathaniel: When I say these are "non-programmable" chips, I mean you can't program them directly. While the concepts are similar, a TPU has its own specific workflow, and you cannot use a single platform to write code for all these different hardware types.

Drew: That sounds incredibly messy. That’s why I was so excited to learn about what your team at Tracel is building. Specifically, you have a project called CubeCL that seems to address these exact pain points. Can you explain what CubeCL is and how it solves these problems?

Nathaniel: CubeCL is essentially a Rust extension that allows you to leverage Rust to write code for GPUs or any AI accelerator. While we don't support every accelerator yet, our goal is to target TPUs alongside GPUs. CubeCL is strictly optimized for numerical computing performance using a just-in-time (JIT) compiler.

Nathaniel: We avoid ahead-of-time compilation because precompiling every kernel variation for every piece of hardware would result in gigabytes of data. Instead, the JIT compiler generates code based on specific hardware properties detected at runtime. It can account for the number of streaming multiprocessors on a GPU, the warp or "plane" size, or the number of available CPU cores and threads.

Nathaniel: The primary benefit of CubeCL is that you can inspect the current hardware and adapt your algorithm during the JIT compilation process. You write your code once, yet you maintain access to low-level hardware properties. This allows you to fine-tune your kernels for the specific hardware in use, regardless of the underlying shader language.

Drew: This is really interesting. Let me make sure I understand correctly: with CubeCL, you write everything in Rust and compile your application, but the specific GPU logic isn't actually compiled until runtime. Is that right?

Nathaniel: Yeah, so suppose you're deploying a model and need to execute a matrix multiplication, which is a very common algorithm. At runtime, when you first execute a matrix multiplication, the algorithm fetches the hardware properties to determine the optimal tile decomposition. It checks for tensor core availability, their specific sizes, and the supported precision to find the best possible configuration for your current setup.
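
As a rough illustration of that runtime step, here is a minimal plain-Rust sketch of picking a tile configuration from queried device properties. The `DeviceProps` and `pick_tile_config` names, and the thresholds, are hypothetical for this example, not CubeCL's actual API:

```rust
// Illustrative mock of a JIT-time decision: inspect hardware properties at
// runtime and derive a matmul tile configuration from them.

#[derive(Debug, Clone, Copy)]
struct DeviceProps {
    has_tensor_cores: bool,
    plane_size: u32,       // warp/wavefront width, e.g. 32 on NVIDIA
    shared_mem_bytes: u32, // shared memory available per workgroup
}

#[derive(Debug, PartialEq)]
struct TileConfig {
    tile_m: u32,
    tile_n: u32,
    use_tensor_cores: bool,
}

fn pick_tile_config(p: DeviceProps) -> TileConfig {
    // Larger tiles only pay off if shared memory can actually hold them.
    let tile = if p.shared_mem_bytes >= 64 * 1024 { 128 } else { 64 };
    TileConfig {
        tile_m: tile,
        tile_n: (tile / 32) * p.plane_size, // keep the tile a multiple of the plane size
        use_tensor_cores: p.has_tensor_cores,
    }
}

fn main() {
    let gpu = DeviceProps { has_tensor_cores: true, plane_size: 32, shared_mem_bytes: 96 * 1024 };
    println!("{:?}", pick_tile_config(gpu));
}
```

The real system makes far richer decisions (precision, double buffering, and so on), but the shape is the same: query first, then specialize.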

Drew: That's interesting. Would you describe what you've created as a sort of compiler?

Nathaniel: Yes, it is both a compiler and a runtime.

Nathaniel: You no longer have to manage your own execution queue or memory allocation strategy; we handle those because they aren't portable. We built easy-to-use abstractions for both the runtime and compilation so you can simply write your algorithm and execute it. We manage all the portability and performance issues for you.

Drew: That’s very interesting. How does CubeCL guarantee it outputs the most optimal kernels for each piece of hardware? Is that even possible?

Nathaniel: You can’t guarantee that; I don't think anyone can. However, you can write custom kernels with CubeCL. In a typical use case, if you are optimizing an algorithm for specific CUDA GPUs in your data center, you might hard-code certain "magic numbers." This includes determining the number and size of tiles, selecting precisions, or deciding whether to implement double buffering.

Nathaniel: You make those optimization decisions for your specific GPU, and you can write that same algorithm with CubeCL. It will output the same code you would have written in CUDA. The difference is that CubeCL allows you to parameterize the algorithm—adjusting block sizes, tiling dimensions, and double-buffering—to perform auto-tuning. By running micro-benchmarks at runtime, the system identifies the superior configuration for your current hardware. That is how you write portable, optimal kernels.

Drew: That’s interesting. I’m not familiar with auto-tuning. Is that a technique people are already using?

Nathaniel: Yes, it is quite a popular technique. You take your algorithm, parameterize it with certain hyperparameters, and then perform a search to find the best configuration for your specific problem size, tensor dimensions, and hardware. You score these configurations by running micro-benchmarks, ranking the different algorithmic variations, and choosing the optimal one.
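
As a toy version of this, the sketch below micro-benchmarks two interchangeable "kernel" variants (plain summation functions, purely illustrative) and keeps the fastest. A real auto-tuner would cache the winner per problem shape and device:

```rust
use std::time::Instant;

// Illustrative auto-tuning sketch: time each candidate variant on the actual
// input and pick the one that benchmarks fastest on this machine.

fn sum_naive(data: &[f32]) -> f32 {
    data.iter().sum()
}

fn sum_chunked(data: &[f32]) -> f32 {
    // Fixed-size chunks give the optimizer more instruction-level parallelism;
    // chunk size is exactly the kind of hyperparameter an auto-tuner explores.
    data.chunks(8).map(|c| c.iter().sum::<f32>()).sum()
}

fn autotune<'a>(candidates: &[(&'a str, fn(&[f32]) -> f32)], data: &[f32]) -> &'a str {
    candidates
        .iter()
        .map(|(name, f)| {
            let start = Instant::now();
            for _ in 0..10 {
                std::hint::black_box(f(data)); // keep the call from being optimized away
            }
            (start.elapsed(), *name)
        })
        .min_by_key(|(t, _)| *t)
        .map(|(_, name)| name)
        .unwrap()
}

fn main() {
    let data: Vec<f32> = (0..100_000).map(|i| i as f32 * 0.001).collect();
    let best = autotune(
        &[("naive", sum_naive as fn(&[f32]) -> f32), ("chunked", sum_chunked)],
        &data,
    );
    println!("fastest variant: {best}");
}
```

Which variant wins depends on the machine, which is precisely the point: the decision cannot be made ahead of time.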

Drew: Okay, so that’s really important.

Nathaniel: Yes, and this is also why just-in-time compilation is so useful; you cannot be certain which configuration is the fastest until you actually execute it on the hardware.

Drew: That is fascinating; I hadn't considered that. This is all very impressive. So, CubeCL is one core piece of what you're building, but there's another component that addresses a major problem in the AI space: the disparity between inference code used in the training context versus the deployed context. Often, because the hardware differs, the code has to change. Could you explain that problem in more detail?

Nathaniel: Suppose you are building an image classification model. You will likely train it in the cloud using NVIDIA GPUs, but you might want to deploy it to a robot, a camera for object detection, a car, or even a laptop. In that scenario, you have to write training code for NVIDIA, and then reimplement the same model for every different hardware target, which is a painful process.

Nathaniel: Current solutions like ONNX, a serialization format, allow you to take your Python training code and load it into runtimes optimized for other systems, such as a CPU. However, the translation is imperfect. Pre-processing and post-processing steps often don't work the same way, forcing you to rewrite them for every platform. Our goal with Burn is to let you write your training code, model logic, and processing steps once. You can then use that exact same code for both training and deployment because our optimizations work for both scenarios.

Drew: That seems like a major issue. Beyond the extra work of rewriting code, are there also situations where the model behaves differently in the training context versus the deployment context?

Nathaniel: Yes, translations are often imperfect. Numerical stability varies across different frameworks and runtimes, so it is much safer to deploy the exact same code you wrote for training during inference. This ensures the model behaves as expected, since you already validated it during experimentation.

Nathaniel: Looking forward, we want to enable continual learning, where models continuously learn from private data while deployed on a device like a laptop. Current tools aren't built for this; you can't easily deploy PyTorch everywhere because Python doesn't run well in production on a mobile phone, a camera, or a robot. By building in Rust, we eliminate the gap between "prototype" and "production-ready" code. Everything you write with Burn is production-ready from the start.

Drew: That is incredible. I hadn't considered the continual learning aspect of this, but I know it’s a potentially massive shift for the future of AI. It’s fascinating that Burn could support that. It seems like using Rust is what actually enables you to solve these problems in this specific way—is that right? How does the language itself play into that?

Nathaniel: It would be extremely difficult to achieve this with Python. Because Rust doesn't have a garbage collector, we can leverage its ownership rules to perform optimizations that are impossible otherwise. At runtime, we know the exact lifetime of every variable. If we know a tensor is being used for the last time, we can reuse that memory in place or fuse tensors together.

Nathaniel: Many tensors in your model won't even be materialized in memory because our just-in-time compiler optimizes them away. This is hard in Python because variables are only cleaned up during a garbage collector pass. Even in C++, variables are typically only cleaned up when they go out of scope. Rust is finer-grained; we know the moment a variable is used for the last time. We leverage that to support dynamic graphs and shapes while maintaining optimizations that are usually only available in static graphs or "compile mode" in the Python world.
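
That ownership argument can be made concrete in a few lines of plain Rust. This mimics the reasoning, not Burn's actual API: when a tensor (here just a `Vec<f32>`) is passed by value, the callee provably holds its last use and can mutate the buffer in place instead of allocating:

```rust
// Ownership-driven memory reuse, sketched with a Vec standing in for a tensor.

fn relu_owned(mut t: Vec<f32>) -> Vec<f32> {
    // We own `t`: it has no other users, so we can overwrite its memory in
    // place with zero new allocations.
    for x in t.iter_mut() {
        *x = x.max(0.0);
    }
    t
}

fn relu_shared(t: &[f32]) -> Vec<f32> {
    // Someone else may still read `t`, so we must materialize a fresh buffer.
    t.iter().map(|x| x.max(0.0)).collect()
}

fn main() {
    let t = vec![-1.0f32, 2.0, -3.0];
    let before = t.as_ptr();
    let out = relu_owned(t); // `t` is moved: this was its last use
    assert_eq!(before, out.as_ptr()); // same allocation, reused in place
    println!("{:?} (buffer reused: {})", out, before == out.as_ptr());
}
```

A garbage-collected language cannot make this distinction at the call site, because nothing in the type system says whether another reference to the tensor still exists.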

Drew: It’s exciting to see Rust’s features enabling fundamental problem-solving like this.

Drew: So, playing devil's advocate for a moment: the vast majority of AI researchers use Python. Part of that is the established ecosystem, but it’s also because Python is easy to write and prototype. Rust, conversely, has a reputation for being difficult. Is it an uphill battle to convince model developers and researchers to adopt Rust? How do you respond to that?

Nathaniel: I don't think Rust is actually that hard to learn. In one of my experiences building a recommendation engine, I spent most of my time in Python trying to optimize slow code, fighting runtime exceptions, and managing shady multi-processing. While writing Python is easy, writing reliable, production-ready Python is quite difficult. You often spend more time debugging Python than you would writing Rust, where the compiler eliminates much of that struggle.

Nathaniel: Furthermore, LLMs make it easier than ever to get up to speed. If you encounter a compiler error, you can ask a chatbot to explain the fundamental issue and suggest a fix. This significantly reduces the friction of transitioning from Python to Rust.

Nathaniel: It is true that the AI ecosystem is centered around Python. If your goal is to simply import a model, modify a few lines, and test an idea, Python is a valuable tool for leveraging that existing work. However, the Rust ecosystem is growing, with more models and algorithms being implemented every day. If you are building something truly new where you can't rely on existing libraries, Rust is the better choice because you will spend far less time debugging.

Drew: That makes sense; you answered that very well. The idea of using LLMs to get up to speed in a new language is fascinating. I’m still processing the full implications, but I think that shift is going to fundamentally change a lot of programming.

Nathaniel: LLMs are also excellent at translation. My native language is French, and it is now incredibly easy to translate English job postings or other documents into French using AI. The same applies to code. You can provide Python code, translate it to Rust, and there is a high probability it will work.

Nathaniel: The boundaries between programming languages and the costs of migration are decreasing. At RustConf last year, Microsoft announced plans to rewrite their C++ codebase in Rust. They are targeting an incredible volume—I believe around one million lines of code per developer per month—because the translation is AI-assisted. That is a significant factor to consider.

Drew: That’s amazing. A year or two ago, I saw a request for proposals from the US Department of Defense for a system to translate their legacy C++ to Rust. At the time, I didn't think LLMs were quite ready for that, but now I feel like they could actually do it.

Nathaniel: They are probably not there yet in terms of doing it fully on their own, but with a human expert in the loop to review the code and guide the model, it speeds up the translation process significantly.

Drew: That is a fascinating consideration for your business, especially regarding the challenge of getting people to rewrite their workloads in Rust. It may not actually be a major hurdle, and it seems like it will become less of a problem over time.

Nathaniel: Translating complex logic and critical safety code is a significant challenge where AI currently struggles. For low-level programming, AI models aren't quite there yet; they frequently make errors regarding memory management and performance. That is why I don't expect to see the Unreal Engine fully rewritten in Rust anytime soon. Their C++ codebase is massive and the logic is incredibly complex.

Nathaniel: It is a similar story with PyTorch, which is primarily C++ and Python; I doubt they will be able to rewrite those systems in another language. However, smaller Python models consisting of a few thousand lines are quite easy to translate. This is because high-level code is far easier to migrate than fundamental low-level code.

Drew: Right. So the code being rewritten into Burn primarily consists of models—things like setting up matrices. Do you have a sense of whether LLMs are actually proficient at writing that type of code? I imagine there is significantly less of that in their training sets compared to something like web applications.

Nathaniel: I’m not entirely sure, but I suspect LLMs are quite proficient in that area because GitHub is full of scripts, experimental code, and model definitions. There are plenty of examples available, ranging from small model architectures to production-level projects, so I don't think LLMs will struggle much with that type of code.

Drew: I find what you're doing exciting because the industry has barely scratched the surface of on-device AI and edge computing. Running smaller models on devices is going to be a major trend. The combination of building Burn on top of Rust seems to offer a lot here, especially since you don't have to rewrite inference logic for every target device. Is this an area your team is particularly excited about?

Nathaniel: Absolutely. On-device AI is the future because it is likely the most energy-efficient way to deploy these models. While hosting a single massive chatbot in the cloud is better than not having one at all, eventually, the choice of AI will depend on the complexity of the task. Because cloud processing consumes so much electricity, I expect we will see a shift toward efficiency; a simple request might be processed directly on your phone, while only the most complex tasks are routed to the cloud. Deploying more AI at the edge is simply the most efficient path forward.

Drew: Exactly. It feels like a lot of our devices already have the necessary chips, but they aren't being fully utilized. Before we move on too much, I want to ask another question about the advantages of Rust relative to this on-device problem.

Drew: Does Rust itself simplify cross-platform deployment because it is so cross-platform itself?

Nathaniel: On the CPU side, Rust is remarkably portable; you can compile it to WebAssembly for browsers or even deploy it to microcontrollers without an operating system. This level of portability is difficult to achieve with Python. While you can target these platforms with C or C++ since they share the same LLVM compiler backend, Rust makes the process significantly easier because of Cargo. Managing cross-compilation in C++ often requires complex CMake or Make scripts, whereas Cargo handles the build system and package management seamlessly. With CubeCL, we are extending this portability to accelerated code—such as GPUs and AI accelerators—using a just-in-time compiler, though that specific capability isn't native to standard Rust.

Drew: That makes sense. What about the numerical computing aspect? Does Rust offer any specific advantages for that?

Nathaniel: I wouldn't say Rust’s syntax offers massive inherent benefits for numerical computing, but its macro system and metaprogramming capabilities are extremely useful for writing high-performance kernels. We rely heavily on procedural macros to ensure our algorithms run as fast as possible. While C++ uses templates and const expressions, Rust’s procedural macros allow us to effectively simulate the compile-time metaprogramming approach found in Zig. So, while these features aren't exclusive to numerical computing, they were essential for building the infrastructure that makes high-performance numerical computing possible in Burn.
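
A small, hypothetical sketch of that style of metaprogramming, using a declarative `macro_rules!` macro for brevity (CubeCL itself uses procedural macros, which must live in their own crate):

```rust
// Generate size-specialized functions at compile time. Because the length is a
// compile-time constant in each generated variant, the compiler can fully
// unroll and vectorize the loop.

macro_rules! gen_dot {
    ($name:ident, $n:expr) => {
        fn $name(a: &[f32; $n], b: &[f32; $n]) -> f32 {
            let mut acc = 0.0;
            for i in 0..$n {
                acc += a[i] * b[i];
            }
            acc
        }
    };
}

gen_dot!(dot4, 4);
gen_dot!(dot8, 8);

fn main() {
    println!("dot4 = {}", dot4(&[1.0; 4], &[2.0; 4])); // 1*2 summed four times = 8
    println!("dot8 = {}", dot8(&[1.0; 8], &[0.5; 8]));
}
```

The same pattern, scaled up with procedural macros, is how a kernel can be stamped out in many hardware-specific variants from one definition.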

Drew: That’s fascinating. We’ve been diving deep into the technical side, so I want to zoom out for a moment. Both Burn and CubeCL are open-source projects, yet Tracel is a commercial entity. This naturally raises questions about your business model. What does the path to revenue look like for the company?

Nathaniel: We are developing a cloud platform called Burn Central to help users train and deploy their models. While it hasn't been fully announced, it isn't exactly a secret. The platform allows you to deploy models for training, monitor experiments, and track analytics for models deployed on-device.

Nathaniel: Our strategy is to provide a service that is entirely complementary to our open-source work. We aren't creating any closed-source extensions to the core of Burn or CubeCL; those projects will remain open source as they should be. Instead, we’re building a cloud layer on top of those projects to help businesses leverage the codebase we’ve built. It’s a straightforward model similar to many other successful open-source companies.

Drew: It sounds like the classic open-core model. That’s exciting. What do you expect will be the primary draw for users to join the platform?

Nathaniel: It’s all about time savings. Many of our users aren't infrastructure engineers, and rebuilding AWS deployment scripts or CI/CD pipelines from scratch is a massive undertaking. By using our services instead of building everything themselves, they’ll be much better off.

Nathaniel: There are several key drivers: reliability, speed of execution, and cost-effectiveness. Ultimately, using our platform should be cheaper and more dependable than trying to build and maintain that infrastructure in-house.

Drew: Are you planning to build features that specifically support the transition from training models to deploying them onto devices, similar to what we discussed earlier?

Nathaniel: We developed our own model packaging system so that your model remains independent of its deployment target. Typically, deploying cloud software requires you to build a custom Docker image and manage the operating system and driver dependencies yourself. We abstract those layers entirely. You simply deploy the core model, and we automate the packaging for every target device.

Nathaniel: This automation ensures the model runs at optimal speed without requiring you to become an expert in DevOps or MLOps. You can stay focused on the machine learning, while we handle the operational complexity.

Drew: That’s very cool. I think that could be a very compelling product for a lot of companies.

Nathaniel: I would use it. I wish a tool like this had existed when I was training and deploying machine learning models before I started working on Tracel.

Drew: I have another question about the platform: who is your target customer? I imagine it isn't the major labs like OpenAI, xAI, or Anthropic. Are you targeting the tier below them, or do you have a different audience in mind?

Nathaniel: I can't predict the future, but our current focus is supporting our open-source users. If you are already using Burn for prototypes and are considering moving to production, our platform is likely the best path forward. We are prioritizing our open-source community first.

Nathaniel: In the future, we may explore specific industry verticals and expand in those directions. Of course, if members of our community happen to be from OpenAI or Anthropic, they are more than welcome to use the platform as well.

Drew: Since you mentioned the open-source community, what does adoption look like there? Have you seen people building noteworthy projects with Burn?

Nathaniel: We have a lot of open-source users building interesting projects. While many are developing private models, which is perfectly fine, there are several noteworthy public initiatives. For instance, some researchers at DeepMind have used Burn for research projects, and others are using it for privacy-centric products.

Nathaniel: Most of these users have specific deployment requirements; they want to run their own models across a variety of different devices. This is where Rust’s on-device capabilities become a key differentiator for Burn. They can train a model once and then deploy it seamlessly to any hardware target.

Drew: I'm excited for you guys. On that note, have you had any major company milestones lately, such as product breakthroughs, revenue achievements, or fundraising?

Nathaniel: We recently released a closed alpha for Burn Central. If you'd like access, you can email us for a key. We've been iterating with early users to squash bugs and refine features since its launch last month. We also shipped the Burn 0.20 release earlier this year.

Nathaniel: Our biggest milestone, however, was adding CPU support to CubeCL. Initially, CubeCL was a WebGPU-only runtime, but we’ve since expanded to CUDA, ROCm, Metal, and Vulkan. By adding CPU support, we proved we can achieve state-of-the-art performance using the same kernel across different backends.

Nathaniel: This allows you to write an algorithm once for almost any type of accelerator. I’m not aware of any other numerical computing language that handles this as effectively as CubeCL. While we’re still iterating to perfect every kernel, the proof of concept is solid.

Drew: Tell me a little bit about the team. How many people do you have working with you now?

Nathaniel: There are eight of us.

Drew: Is the team mostly technical? I imagine it is.

Nathaniel: Yes, it remains a very technical team.

Drew: How did you all come together? How did the company start?

Nathaniel: It started as a pet project of mine while I was working at a startup. I was facing all the problems I’m trying to solve now, so I knew them well.

Nathaniel: I co-founded Tracel with a close friend I met on the first day of my undergraduate degree. From there, I brought in friends from my master's program, and the team grew with more people along the way.

Drew: Are you guys hiring right now?

Nathaniel: Yeah, yeah, we are hiring.

Drew: Obviously our audience is lots of people looking to build with Rust. What specifically are you looking for in the engineers you hire?

Nathaniel: We're looking for technical people who are comfortable reasoning from first principles. Since we're building low-level code, you need to understand how CPUs and GPUs function to design effective algorithms. We aren't necessarily looking for the most experienced person in the room, but rather someone who excels at problem-solving from the ground up.

Nathaniel: Currently, we are focused on hiring within Canada for financial and administrative reasons. If you're based in Canada and this sounds like a fit, you can easily submit your resume through our website.

Drew: That's a great opportunity for any Canadian engineer who fits that description. Is the team based in-person in Quebec City?

Nathaniel: We have an office in Quebec City, but we also have team members working remotely from Montreal and other locations.

Drew: Both options are on the table then. To give us a sense of what it's like to work there, could you describe an interesting problem your team has tackled recently?

Nathaniel: We're currently optimizing FlashAttention for LLMs, writing the entire algorithm in CubeCL. Our goal is to run at optimal speeds across any GPU or CPU using the same core algorithm. We’re also tackling Rust’s long compilation times, specifically for Burn, to reduce the friction between experiments.

Nathaniel: Additionally, I’m working on device communication and optimizing the lazy execution of tensor operations. This involves transitioning from a multi-threaded, read-only space to a mutable state where we can perform auto-optimizations—essentially asynchronous channel optimization. We're also diving into various memory management optimizations.

Drew: It sounds like there are plenty of interesting problems to sink your teeth into. Regarding the compilation optimization piece—since I know that’s a common pain point—have you learned any tips or tricks that you think others should know?

Nathaniel: The amount of code you generate is actually the biggest factor. If you're generating 15 megabytes of LLVM IR, it's going to take longer than if you generate less. Using generics extensively can drastically increase compilation time, similar to templates in C++, so you have to be careful with them.

Nathaniel: While Rust’s zero-cost abstractions with generics are great, it's important to know when not to use them. In areas where performance isn't critical, opting out of generics can save significant compilation time.
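
A minimal sketch of that trade-off, with hypothetical types: the generic path is monomorphized once per activation type, while the `dyn` path compiles a single copy at the cost of a virtual call:

```rust
// Generics vs. dynamic dispatch as a compile-time lever.

trait Activation {
    fn forward(&self, x: f32) -> f32;
}

struct Relu;
impl Activation for Relu {
    fn forward(&self, x: f32) -> f32 {
        x.max(0.0)
    }
}

struct Sigmoid;
impl Activation for Sigmoid {
    fn forward(&self, x: f32) -> f32 {
        1.0 / (1.0 + (-x).exp())
    }
}

// Hot path: monomorphized per type, inlinable, zero runtime overhead,
// but every new Activation type adds another compiled copy.
fn apply_generic<A: Activation>(act: &A, xs: &mut [f32]) {
    for x in xs {
        *x = act.forward(*x);
    }
}

// Cold path (e.g. model setup): one compiled copy regardless of how many
// activations exist, which keeps generated LLVM IR, and build times, down.
fn apply_dyn(act: &dyn Activation, xs: &mut [f32]) {
    for x in xs {
        *x = act.forward(*x);
    }
}

fn main() {
    let mut xs = [-1.0f32, 1.0];
    apply_generic(&Relu, &mut xs);
    println!("relu: {:?}", xs);

    let mut ys = [0.0f32];
    apply_dyn(&Sigmoid, &mut ys);
    println!("sigmoid(0) = {}", ys[0]); // 0.5
}
```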

Nathaniel: Another key aspect is how you structure your project, specifically how you split your crates and manage dependencies. I recently learned that avoiding "leaking" dependencies into your public types can drastically improve incremental build times, as the Rust compiler can better isolate changes. Those are the two main things to look out for.

Drew: That’s helpful. Is there anything unique you’d point out about the company culture?

Nathaniel: We have a unique dynamic; we’re a team that laughs a lot and genuinely enjoys working together. Beyond the culture, we aren't afraid to tackle incredibly difficult problems. I think it’s vital to maintain a "beginner’s mind" throughout your career. When you encounter a challenge, you should have the drive to learn what's necessary and solve it quickly. That lack of fear when learning new things defines who we are as a team.

Drew: Humor is essential. When you're tackling difficult technical problems, being able to laugh makes a significant difference in the team's ability to maintain momentum.

Nathaniel: We don't let the difficulty get us down. Some people might get depressed because the work is so hard or things aren't working, but we take the opposite approach. When things get tough, we laugh about it.

Drew: Awesome. Visualizing you all joking around makes me wonder—do you primarily speak English or French in the office?

Nathaniel: Most of the time we speak French, though we use English as well. In Quebec City, especially among computer science people, it’s really a blend of both—we speak "Franglish."

Drew: Oh right, you're using all the English terminology for packages and technical concepts anyway. Since you're based in Quebec City, I’m curious about the local tech ecosystem. I don't hear about a lot of startups coming out of that area, even though I know there is a significant amount of technical talent there.

Nathaniel: Quebec City actually has a significant technical population. With a population nearing one million, it isn't a small town. While it doesn't match the scale of Montreal or Toronto, it has a substantial talent pool. We have a strong presence in traditional engineering fields like optical, mechanical, and electrical engineering. Many of these professionals eventually transition into software, where they tend to excel at low-level programming. There is definitely a growing startup scene here as well.

Nathaniel: The ecosystem is actually quite strong.

Drew: Are your investors primarily based in the Quebec City area as well, or are they from all over the world?

Nathaniel: Our investors are from all over the world, though they are primarily based in the US and Canada.

Drew: Okay, so primarily North America.

Drew: Well, those were all the questions I had for you today. I really enjoyed learning more about what you're building. Thank you Nathaniel.

Nathaniel: Thank you.


Know someone we should interview? Let us know: filtra@filtra.io
