Taking on Anthropic's Public Performance Engineering Interview Challenge
I stumbled across an optimization challenge Anthropic released from their performance engineering team's interview. When I first opened it, my reaction was a lot like other people's: something along the lines of, "I have no idea what is going on here or what these words are."
Usually I would have looked around a little more, then closed it and forgotten about it forever. But the README had some interesting extra information: the goal was to beat the AI, which told me that LLMs knew how to solve this problem. Suddenly this seemingly impossible task looked doable, if I could prompt an AI chatbot to help me out. By the time I got tired of working on it, I had reached 1,474 cycles, roughly a 100x improvement over the baseline and just barely past the team's bar for hiring (1,487 cycles).
This also ended up being an interesting experiment in using AI as a teacher. My workflow was a lot of question-and-answer back and forth, and some of the things I did to get to 1,474 cycles were ideas none of the LLMs I tried came up with.
Understanding The Problem
Getting to the point where I even understood the problem correctly took a while, and I had a few false starts where I thought I had a grasp on the scope of the work, only to realize after hacking on it that I didn't quite understand it.
The problem starts with a program whose instructions run sequentially: they do some arithmetic, load values from memory, and at the end check that the computed results are correct. The goal is to make it as fast as possible, where "fast" means using the fewest cycles. This goal confused the LLMs a bit; they would suggest optimizations that might make sense on a real CPU but didn't apply to the virtual machine and the way it counted cycles.
The logic being optimized takes a set of inputs, has each input traverse a binary tree while applying some math operations to it, and eventually stores the final values.
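To make that concrete, here is a rough sketch of the kind of loop I'm describing. The names and the mix() arithmetic are placeholders of my own, not the challenge's actual code:

```python
# A rough, hypothetical sketch of the scalar reference logic; names and the
# mix() arithmetic are placeholders, not the challenge's actual code.

def mix(value, node):
    # stand-in for whatever per-node arithmetic the real kernel applies
    return (value * 31 + node) & 0xFFFFFFFF

def run_reference(inputs, tree, rounds):
    results = []
    for value in inputs:
        idx = 0                          # each input starts at the tree's root
        for _ in range(rounds):
            node = tree[idx]             # one scalar load per level visited
            value = mix(value, node)
            if value & 1:                # pick a child based on the mixed value
                idx = 2 * idx + 2        # right child
            else:
                idx = 2 * idx + 1        # left child
            if idx >= len(tree):         # the traversal wraps back to the root
                idx = 0
        results.append(value)
    return results
```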
Another thing that threw me off was the type of processor being emulated. I hadn't run into this architecture before; Very Long Instruction Word (VLIW) designs seem to be more prevalent in areas of computing I don't work in.
Actually Optimizing It
I went through a series of optimizations, slowly learning more about what could even be done with this problem, and ended up with a final solution that completely rewrote everything from my earlier iterations.
The first, most obvious optimization was to use SIMD instead of processing inputs one by one. I wasn't completely new to this; I had some awareness of how SIMD works and why it would help. This brought the cycle count down to about 25,000, which felt like a lot but still left a long way to go.
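A minimal sketch of what that change looks like, assuming an 8-lane vector width; the lane count and helpers here are my guesses, not the VM's real API:

```python
# A minimal SIMD-style sketch, assuming 8 lanes. The per-lane tree indexes
# diverge, so the node fetch stays a per-lane (gather-like) access.
LANES = 8

def mix_vec(values, nodes):
    # elementwise version of the scalar mix() above
    return [(v * 31 + n) & 0xFFFFFFFF for v, n in zip(values, nodes)]

def run_simd(inputs, tree, rounds):
    results = []
    for start in range(0, len(inputs), LANES):
        values = list(inputs[start:start + LANES])
        idxs = [0] * len(values)
        for _ in range(rounds):
            nodes = [tree[i] for i in idxs]          # non-contiguous: no vector load
            values = mix_vec(values, nodes)
            idxs = [2 * i + (2 if v & 1 else 1) for i, v in zip(idxs, values)]
            idxs = [i if i < len(tree) else 0 for i in idxs]   # wrap to the root
        results.extend(values)
    return results
```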
After that I started looking at reducing instruction count in general. One optimization turned the child-index selection into just a couple of instructions. It barely moved the cycle count, which made me realize I was going to have to make pretty big changes to how the whole thing ran to get anywhere near the goal.
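The idea behind that index-selection change, as I understand it: with a flat array layout the children of node i sit at 2*i + 1 and 2*i + 2, so the comparison result can be added straight into the index instead of branching. A tiny illustration (my own, not the exact instructions I ended up with):

```python
# Branchless child selection: the left/right decision folds into the index math.
def next_index(idx, value):
    return 2 * idx + 1 + (value & 1)     # +0 for left, +1 for right, no branch
```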
This was the point where I started trying to take advantage of the instruction batching the VLIW architecture allows. I began with some hand-batched instructions, then wrote a simple automatic batching function, which barely helped at this stage and wouldn't pay off until later. I redid instruction batching maybe three times and eventually ended up with an instruction scheduler that tracked memory accesses and dependencies. The VM allowed 12 scalar arithmetic instructions and 6 vector arithmetic instructions to run in the same cycle, and the scheduler would also break a vector instruction into multiple scalar instructions when it determined that packed more computation into the batch.
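For a sense of what I mean by an instruction scheduler, here is a heavily simplified sketch assuming the slot counts above. The `Op` structure and everything else is my own stand-in: it assumes each register is written only once, and the real scheduler also had to respect load/store ordering.

```python
# A simplified, hypothetical greedy scheduler for VLIW-style bundles.
from dataclasses import dataclass, field

@dataclass
class Op:
    kind: str                          # "alu", "valu", or "load"
    dst: str                           # register written
    srcs: list = field(default_factory=list)   # registers read

SLOTS = {"alu": 12, "valu": 6, "load": 2}      # per-cycle slot limits

def _ready(src, produced_by, ready_at, cycle):
    if src not in produced_by:         # live-in registers are ready from the start
        return True
    return src in ready_at and ready_at[src] <= cycle

def schedule(ops):
    """Greedily pack ops (given in program order) into cycles, respecting
    slot limits and read-after-write dependencies."""
    produced_by = {op.dst: op for op in ops}   # registers written inside this block
    ready_at = {}                              # register -> cycle its value is available
    bundles, remaining, cycle = [], list(ops), 0
    while remaining:
        used = {k: 0 for k in SLOTS}
        bundle, deferred = [], []
        for op in remaining:
            deps_ok = all(_ready(s, produced_by, ready_at, cycle) for s in op.srcs)
            if deps_ok and used[op.kind] < SLOTS[op.kind]:
                used[op.kind] += 1
                bundle.append(op)
                ready_at[op.dst] = cycle + 1   # result usable starting next cycle
            else:
                deferred.append(op)
        bundles.append(bundle)
        remaining, cycle = deferred, cycle + 1
    return bundles
```

The slot-splitting trick I mentioned would sit on top of this: if the vector slots are full but scalar slots are empty, a vector op can be rewritten as several scalar ops before scheduling.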
The next big optimization was pipelining, where each batch goes through stages of instructions. I iterated on pipelining a lot; this was where most of the big wins were for me. Pipelining was the concept I needed to really pack instructions together and get the most out of the instruction scheduler.
The way it worked was: one batch of inputs would have stage one's instructions applied, and while that batch moved on to stage two, the next batch would start stage one. The instruction batching from before lets the second batch's stage-one instructions run in the same cycles as the first batch's stage-two instructions. Pipelining across batches and rounds let me line up stages to fill the open slots under each cycle's instruction limits. From my first attempt at pipelining to my last iteration, I went from about 20,000 cycles down to 2,700, and the pipelining I guided Claude Code to write got the count down into the 1,500s.
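A toy version of the interleaving, assuming a hypothetical `stage_ops(batch, stage)` helper that emits the ops for one batch at one stage (the real pipeline also had to carry values between rounds):

```python
# Software-pipelining sketch: interleave stages of different batches so the
# scheduler above can pack them into the same cycles.
def pipeline(num_batches, num_stages, stage_ops):
    ops = []
    # at "time step" t, batch b is in stage (t - b); batches enter one step apart
    for t in range(num_batches + num_stages - 1):
        for b in range(num_batches):
            stage = t - b
            if 0 <= stage < num_stages:
                ops.extend(stage_ops(b, stage))
    return ops   # hand the interleaved ops to schedule()

# With 3 batches and 2 stages the steps look like:
#   t=0: b0/s0
#   t=1: b0/s1, b1/s0
#   t=2:        b1/s1, b2/s0
#   t=3:               b2/s1
```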
The binary tree ended up being the most difficult thing to deal with. Since node indexes were computed while the code was executing, there was no way to preload arbitrary nodes, and the indexes weren't contiguous, so SIMD loads couldn't be used. I was stuck bottlenecked on the scalar load instructions that fetched tree node values out of memory for use in the computations. With 16 rounds over 32 batches of 8, where each batch needed 8 loads per round and only 2 loads could run per cycle, the lower bound was 2,048 cycles.
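Spelled out, the arithmetic behind that lower bound is:

```python
# The load-bound lower bound, spelled out
rounds, batches, lanes = 16, 32, 8
loads_per_cycle = 2
total_loads = rounds * batches * lanes        # 4096 scalar tree loads
lower_bound = total_loads // loads_per_cycle  # 2048 cycles even with full load slots
```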
The best approach I found was to avoid the memory accesses in the first place. There were three tricks to avoid needing load instructions for levels 0, 1, and 2 of the tree, and since the traversal wraps around back to the beginning, levels 0, 1, and 2 each get run twice. This ties back to pipelining: the rounds without load instructions could run at the same time as other rounds using instruction batching.
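As a rough illustration of why the shallow levels don't need loads (my own example, not the exact tricks I used): levels 0 through 2 contain only 1 + 2 + 4 = 7 nodes, so those values can be preloaded into registers once and then chosen with select-style operations instead of per-traversal loads.

```python
# Illustration only: shallow-level node values kept in registers and chosen
# with selects, so no load instruction is needed during the traversal.

def level1_node(idx, r1, r2):
    # idx is 1 or 2 here; one conditional select replaces a load
    return r1 if idx == 1 else r2

def level2_node(idx, r3, r4, r5, r6):
    # idx is 3..6; a couple of selects replace a load
    lo = r3 if idx == 3 else r4
    hi = r5 if idx == 5 else r6
    return lo if idx < 5 else hi
```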
Working With AI on This Problem
The AI chatbots were pretty helpful in the beginning. They found a lot of small optimizations and helped with the overall architecture. Iterating was simple: I would ask how the code could be improved, implement the suggestion, and go back and forth getting feedback and asking for more. As the cycle count got lower, the AI started making more mistakes and getting confused, or possibly just not explaining things well and confusing me. This actually ended up helping, though. I had to question what it said, understand the code and the problem a little better, and correct its misunderstandings. I felt like I was learning a lot and grew confident in the decisions I was making while designing my little kernel. It was also a bit frustrating, because I was starting to get stuck even though the README suggested the AI was able to figure this out.
At around 2,700 cycles I was getting nowhere with my "paste into the chat" Q&A workflow and switched to having Claude Code give it a try. I took what I guess is a pretty naive approach: basically saying "this is where I'm at, here is the problem, go try to figure it out," and checking on it every so often. Claude Code got stuck here too and couldn't improve on its own. I assume the setup behind the README's results had a better prompt or feedback loop.
I had to come up with a couple of insights on my own and drive the AI to do exactly what I wanted; I couldn't count on it to discover the solution, and it didn't seem to already know it the way I had assumed from reading the README results. Once I figured out solutions, I prompted the AI to implement some pretty big rewrites: a better pipelining loop and the full instruction scheduler with dependency tracking and VALU unpacking. That got me to 1,514 cycles. The AI got one line wrong (the optimal batch size), which I tweaked to reach my final count of 1,474 cycles. I'm not sure why it got that wrong.
Working through this with AI was a pretty interesting experience. I barely used Google. Usually when I finish a tricky problem I have a couple of browser windows full of tabs across both of my monitors; this time there were just a couple of chat logs. The AI ended up being a decent teacher, as long as I stayed critical of what it said. It explained solutions well early in the optimization process, when they were typical patterns, and it quickly wrote the final code when I was getting a little tired of the problem. Overall I learned a lot from struggling with this and felt pretty good about the solution I arrived at in the end.
