2024-04 edition of AI & the web

Nathaniel Simard on Burn, WebGPU & WASM developments, Llama 3 and Phi 3 releases

tl;dr

Nathaniel Simard on Burn & Tracel AI | WebGPU gains further browser support | WASM gets better tooling | Llama 3 & Phi 3 released | MusicGen comes to the browser

AI conversations

Nathaniel Simard on Burn

Nathaniel is the founder and CEO of the startup Tracel AI and the creator of the Burn ML framework. On Friday we interviewed him about Burn, his company, and the state of AI on the web.

Jan: Nathaniel, we're excited to have you as our guest today! To kick off our conversation, could you please introduce yourself and share a bit about your background and what got you into ML?

Nathaniel: Sure, I'm the creator of Burn, a deep learning framework written in Rust, and the founder of Tracel AI. I started coding in my first year of university, where I was studying mechanical engineering, but I quickly switched to software engineering since I instantly fell in love with programming. I then explored different facets of the field, from backend to frontend development, and decided to start my career as a consultant focused on software quality. After some time, I wanted to go deeper into AI, since I was always interested in the process of learning, so I enrolled in a master's degree at MILA.

J: According to GitHub, you started working on Burn in the summer of 2022. Since then it has already earned over 7,000 GitHub stars. What was your initial motivation to develop a new ML framework?

N: I always had a side project going on, for fun mostly and to learn new things. I wanted to explore asynchronous and sparse neural network architectures, where each sub-network can learn and interact with other sub-networks asynchronously and independently. I wasn't able to actually create something useful because I needed fine control over the gradients and the concurrency primitives, which is not easily done with Python and PyTorch. At the same time, I was working on machine translation models at my current job, and it was quite painful to put models into production. I decided to switch my side project to a new deep learning framework, with more flexibility regarding gradients and concurrency primitives as well as being more reliable and easier to deploy on any system.

J: Amazing to see that this was born out of a side project! I feel the struggle with concurrency in Python, and this is something where Rust really shines. What are some of the other key features that make Burn special?

N: I think there are two things that really set Burn apart. First, almost all neural network structures are generic over the backend. The goal is that you can ship your model with almost no dependency, and anybody can run it on their hardware with the most appropriate backend, even embedded devices without an operating system. Second, Burn really tries to push the boundaries of what is possible in terms of performance and flexibility. It offers a fully eager API, but also operation fusion and other optimizations that are normally only found in static graph frameworks. The objective is that you don't have to choose between portability, flexibility, and performance; you can have it all!
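
To make the "generic over the backend" point concrete, here is a minimal sketch in Rust (row_center is just an illustrative helper; the type paths and method names are from memory and may differ slightly between Burn versions). The same function compiles unchanged against the ndarray CPU backend, LibTorch, or the wgpu/WebGPU backend:

    use burn::tensor::{backend::Backend, Tensor};

    // A tiny, backend-agnostic building block: `B` can be the ndarray CPU
    // backend, the tch (LibTorch) backend, the wgpu/WebGPU backend, etc.
    fn row_center<B: Backend>(x: Tensor<B, 2>) -> Tensor<B, 2> {
        // mean_dim keeps the reduced dimension, so the result broadcasts
        // over the rows of `x`.
        let mean = x.clone().mean_dim(1);
        x - mean
    }

Choosing the backend is then a single type alias in the final binary, e.g. `type B = burn::backend::NdArray;` for a CPU build (behind the corresponding feature flag).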

J: Let's dig a bit deeper into that. From a developer's point of view, how would Burn simplify my model code in comparison to, let's say, PyTorch?

N: It really depends on what you are building. For training, it's trivial to set up multi-GPU data parallelism on a single node; there's no need to spawn multiple processes that execute the same program and synchronize tensors at a low level. You can simply send the model to multiple threads with their associated devices and collect the gradients from each thread before updating the reference model. It simplifies metrics logging, optimization, everything really. On top of that, it works with any hardware. For general use, you don't have to worry about tensor layout or whether a tensor is contiguous; we take care of that part so that you can focus on the modeling. The tensor API is fully functional and stateless, meaning that each operation returns a new tensor, even operations that semantically modify a tensor, like slice_assign. It may seem wasteful not to reuse tensors, but in fact we reuse the tensor's data automatically, so users don't have to think about it. For instance, there is no in-place operation available in Burn; everywhere we can do an in-place operation or reuse a tensor's buffer, we do. Long story short, we provide clean APIs that automatically handle a lot of the complexity of building models while performing all sorts of optimizations.
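
As a small illustration of the stateless tensor API described above, here is a sketch (zero_first_rows is an illustrative helper; signatures are from memory, so check the Burn docs for the exact API): slice_assign consumes its input and returns a new tensor, and Burn decides internally whether the buffer can be reused.

    use burn::tensor::{backend::Backend, Tensor};

    // Returns a copy of `x` whose first `n` rows are zeroed. Nothing is
    // mutated from the caller's point of view; Burn reuses the underlying
    // buffer in place when it is safe to do so.
    fn zero_first_rows<B: Backend>(x: Tensor<B, 2>, n: usize) -> Tensor<B, 2> {
        let [_rows, cols] = x.dims();
        let zeros = Tensor::zeros([n, cols], &x.device());
        x.slice_assign([0..n, 0..cols], zeros)
    }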

J: One of the recent exciting developments in Burn was a new backend with JIT kernel fusion. You have written a very insightful explanation of it on your blog (which I highly recommend our readers check out). Can you tell us in a nutshell how it works and what performance gains you could achieve with it?

N: Sure, the way fusion works in Burn is via a backend decorator that captures all tensor operations into a high-level representation without any data involved. Then the goal is to find potential optimizations on that representation, which for now are mostly fusing element-wise operations together. The performance gains depend on the model you are building and how many custom element-wise functions you are using. For the transformer provided in Burn, I think it's around a 20% speedup, but this will improve over time. The biggest benefit is that you don't have to do many optimizations yourself and write custom GPU kernels; the compiler does that for you.
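
In code, enabling fusion is (roughly) a matter of wrapping the backend type in the decorator. Treat the following as a sketch; the exact import path and feature flags may differ, so check the current Burn documentation:

    use burn::backend::{wgpu::Wgpu, Fusion};

    // The Fusion decorator records tensor operations into a high-level
    // representation and fuses chains of element-wise operations into single
    // GPU kernels, while the user-facing API stays fully eager.
    type MyBackend = Fusion<Wgpu>;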

J: Burn also runs in the browser (see e.g. the MNIST demo: https://burn.dev/demo/). What is the current state of browser support in Burn, and what are your plans for the web?

N: I think Burn has really great support for the web; we even support WebGPU! We currently don't have specific plans; it should just work like any other WebAssembly application written in Rust. We might add more examples in the future based on community feedback.
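
For readers who want to try this themselves, a typical setup is a thin wasm-bindgen wrapper around a Burn model compiled for the wasm32 target. The skeleton below is hypothetical (the model type, its loading, and its forward pass are only indicated in comments); it just shows the shape of glue code a demo like the one linked above might use:

    // Hypothetical skeleton: the Burn model itself is omitted and only
    // sketched in comments; wasm-bindgen exposes the wrapper to JavaScript.
    use wasm_bindgen::prelude::*;

    #[wasm_bindgen]
    pub struct MnistDemo {
        // model: Model<burn::backend::NdArray>, // or the wgpu backend when WebGPU is available
    }

    #[wasm_bindgen]
    impl MnistDemo {
        #[wasm_bindgen(constructor)]
        pub fn new() -> MnistDemo {
            // The model weights would be loaded here, e.g. from a record
            // embedded in the wasm binary.
            MnistDemo {}
        }

        /// Classify a 28x28 grayscale image passed from JavaScript as 784 floats.
        pub fn infer(&self, input: Vec<f32>) -> Vec<f32> {
            // let output = self.model.forward(to_tensor(&input));
            // ...then convert `output` back into a Vec<f32> for JavaScript.
            input // placeholder so the skeleton compiles
        }
    }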

J: What new features can we expect in Burn throughout this year?

N: First, we are working on adding more JIT runtimes, so you can expect even better hardware support than we currently have. But the biggest features are going to be quantization and distributed training/inference. The focus will then be on improving and optimizing every part of Burn.

J: Distributed training/inference sounds exciting. Can you already share some more information on that, or is it still in the planning stage?

N: We are going to implement it as a remote backend, generic over a protocol. Therefore, you'll be able to use it with any backend and even test your distributed algorithms locally.

J: Let's talk about another topic. You founded Tracel AI last summer together with your colleague Louis. Where do you stand now, and what are your plans for Tracel AI?

N: I have big plans for Tracel AI. The goal is to improve the current state of the AI infrastructure landscape, starting with Burn. We are going to support the community built around Burn with complementary products in the future. Hopefully, we can make a positive impact in the field and help spread AI further.

J: A glance into the crystal ball: What big trends do you think we will see in the next five years in AI in general and in Web AI specifically?

N: I think models will need to become way more efficient; the framework is only one part of it, but models should be able to adjust their compute and memory requirements based on the difficulty of the task. Models could learn to increase their operational efficiency. This could create a funny dynamic where the more you train a network, the cheaper it is to deploy, but also to fine-tune. Applied to the web, you could fine-tune a model at a much bigger memory cost so that the model itself prunes some neurons to run efficiently on users' hardware. In terms of applications, I think we will see a lot of on-device AI models accelerated by NPUs with privacy in mind. Probably the most important application will be in robotics, where AI will have a physical presence and impact in our world.

J: Nathaniel, thank you so much for joining us today and sharing your perspectives. We really appreciate your time and wish you all the best for your future!

N: Thanks a lot for having me!

Latest developments

WebGPU support

The WebGL + WebGPU meetup on the state of WebGPU was held in March 2024, with some interesting demos featuring WebGPU (though focused more on graphics than on AI).

The most interesting piece of information to take away from this meetup was the current WebGPU implementation status: WebGPU is now available in Safari Technology Preview for testing, and Firefox plans to ship it in a release build by the end of this year. Exciting times ahead for GPU-accelerated AI in the browser!

WASM(I/O)

Developments in WASM are especially interesting for AI in the browser, since almost all web ML frameworks (ONNX Runtime Web, TensorFlow.js, Ratchet, MLC's WebLLM, Burn) either use a WASM backend when no GPU is available or are entirely written in C++ or Rust and then compiled down to WASM.

In March, the WASM I/O conference was held in Barcelona, with many captivating talks. One to mention in particular was Google's "WASM @ Google" talk, covering WASM support in Chrome, the TensorFlow.js WASM backend, and other insights and demos.

Some of the highlights:

  • Google is working on tools that perform optimisations like tree-shaking or code splitting. This is great because WASM is notorious for having large binary sizes and therefore slow initial loading times.

  • Garbage collection support in WASM, aka WasmGC, is going to ship in November, which might enable garbage-collected programming languages to build for the browser.

  • Better support for SIMD in WASM, which e.g. powers the real-time background blur in Google Meet.

Llama 3

Obviously, I couldn't miss what is probably the most anticipated model release of the year: Llama 3. Read the full story here: Introducing Meta Llama 3: The most capable openly available LLM to date.

Dwarkesh Patel had a very insightful interview with Mark Zuckerberg on Llama 3. Remarkably, Mark also mentioned plans to train a smaller Llama 3 model in the realm of 2B parameters, which would be much more usable in a browser environment.

Phi 3

Fresh from the oven: Microsoft just released Phi-3, a tiny yet powerful 3.8B-parameter model, which we will probably soon see in web frameworks like transformers.js or Ratchet.

wllama

A new WASM build of llama.cpp appeared that seems to perform really well (see MiniSearch below for a showcase).

Ratchet

Our last episode's interviewee Christopher is making headway with Ratchet, which now runs Phi-2 at around 24 tok/sec (on an M3 Max). While there is still some way to go to make it usable on lower-end devices, it is very interesting that this runs with only a 15% performance loss in Chrome compared to running natively. Chrome seems to be doing a great job of supporting WebGPU.

Showcases

Victor built MiniSearch, which leverages the aforementioned wllama and uses a RAG pipeline to retrieve live search results. Given that this runs fully in the browser and doesn't use WebGPU, it is surprisingly fast. Check it out here.

MusicGen Web

Joshua added MusicGen support to transformers.js and built a Hugging Face Space to experiment with it.

Enzo took that to the next level by creating a whole jukebox running right in the browser.

Upcoming conferences

CHANGELOG.md

On our own behalf

Are you working on a cool project with AI in the browser that you’d like to share? I’d be especially interested in art projects using AI in the browser. Send us an email at [email protected].