Breaking Bend: Benchmarking the HVM

Well, this did not end up going how I had hoped.

A couple of months ago a couple of YouTubers that I watch from time to time were talking about a new programming language called Bend, that made a very bold claim: write your code in this language, and it will be parallelised automatically. You don't need to mess around with threads, or cuda, or anything else to take advantage of all your cores. Just write your code in this language and with the magic of interaction combinators you'll get the power of all of your cores, be it all 24 on your CPU, or the thousands on your GPU. I was intrigued, and I wanted to see if it would live up to the hype. So I decided to write some code to test it out and compare it to another concurrency model.

Since this was a language I hadn't used before I wanted to make this a fair fight and put it up against another language I hadn't used before. I picked Go, because I wanted to see if learning a new concurrency model (Goroutines) was going to be easier than learning Bend.

Next I needed to pick what I was going to code up. I wanted to pick something that would need a lot of mathematical operations and would lend itself to be parallelised, but would put a fair bit of strain on the hardware. I flicked through a linear algebra textbook and picked out calculating the determinant of a matrix using the cofactors method. This has O(n!) runtime, so for a small amount of data I could put quite a bit of load on the interpreters and hardware.

The last thing of note is that because Bend is still a new language, it's libraries are not as mature as Go's, so I wanted to remove any file I/O. To this end, I wrote some code that would allow me to generate the Go and Bend files with the data built in, at with whatever size of matrix I wanted randomly generated.

Developer Experience.

The following is as of bend-lang 0.2.36 and HVM 2.0.21

Higher Order Co claims that Bend is "Python like". Now, I don't call myself a Python developer. I have coded in a bunch of languages, but at the moment it's mainly Java (please send all hatemail to my X/Twitter account please). But I've worked with Python enough to say with confidence that this is only like Python in that it'll look like Python in a slide deck that you're showing to a VC whose sum total of programming is some Pascal once back in the 90s on their Dad's computer. When you try to actually use this, it's nothing like Python.

For a start, its type system is nothing like Python. One of Python's headline features is duck typing. But in Bend you cannot access the elements of an object at will. You have to either have to put the object whose elements you want to access inside a "match" block, or you need to "open" it first. This often results in having to create a function just so you can have another match block to access an element (because you can't nest a match block inside a match block). Why not just use open to gain access to the elements? Because open only works if the object only has one constructor, and lists have two constructors.

Which brings me nicely to my second problem: the lack of arrays. Everything in Bend is either a primitive or an object that's made up of other objects. So, it can do linked lists and trees, but not arrays. There is some syntactic sugar on top of Lists to make them look like arrays, but they are linked lists under the hood. It doesn't even really have strings, those too are linked lists of characters with some syntactic sugar.

The number types available in Bend are also pretty odd. They're all 24bit, which often results in integer overflow errors. This meant in the end that I had to reduce the max values of the elements in the matrices I was generating so as to reduce the probability of hitting this error. I'm going to speculate that they are all 24bits because that's the native integer size on most graphics cards.

As for Go, I initially wrote two implementations: one parallelised with Goroutines, and one not. This felt a lot more like a standard system programming language, but when I came to run it I found that the parallel version was actually slower than the single threaded version. There is a section on this at the end, but the tl;dr is that, although lightweight Goroutines are not cost free, and if you end up creating a lot of them they will actually slow your program down.

Performance

For reference, I tested this out on an older gaming rig of mine. The CPU is AMD FX-4130 Quad-Core running at 3.8Ghz and it has 24GB of RAM, and my GPU is too old to run the GPU version of the code (yes, I need new hardware). I got the runtimes by simply running each version of the code with the "time" command in bash.

Below is a graph of time in milliseconds that it took for the code to run matrices of various sizes. Note I've had to make the scale logarithmic so you can see the Go times.

For the smaller matrices the bend code performs better than the Go code. This is probably because at these small sizes in both cases the CPU is probably doing more work interpreting the code than running it. Bend is a smaller language than Go, so its interpreter will run faster.

But when we get beyond a 7x7 that we begin to see a difference. By the time we are at an 11x11 matrix the times for Bend are truly awful. iBend comes in at just over 2hrs for the single threaded version and 20 minutes for the multi core version. Meanwhile go_single comes in at 22 seconds, for the single core version, and 46s for go_multi1 and 4s for go_multi2. Higher Order Co acknowledges they are bad, but I wasn't expecting this bad. Moving to multiple CPU cores does increase the performance proportionally with the number of cores, but even the single threaded version of the Go code beats it on my (admittedly small) machine. Maybe once you throw the thousands of cores a GPU has at Bend things might be different. It would be interesting to see the difference between Bend and raw CUDA code run on the smae GPU.

So, why is this? What makes bend so damn slow? I suspect it comes down to the way Bend stores data.The fact that the bend code is working with linked lists makes constructing the sub-matrices for calculating the cofactors and indexing into the arrays take so long. Instead of indexing into an array taking O(1) time, it takes O(i) (where i is the index) time. Add up all of those additional operations over the course of processing the matrices, and the runtime blows up.

So, what's up with go_multi1 being slower than go_single?

In short, there were too many goroutines. With the final matrix which is 11x11, 39916800 of them were being created. But anything over 4 parallel threads of execution will not really speed the program up much (unless some threads have blocking operations, which is not the case here).

To try and reduce the number of Goroutines I uploaded the code for go_multi1 to ChartGPT and asked it to optimise the code to reduce the number of Goroutines. It gave many standard answers, like batching up the work, but none of the actual code it produced was any faster, in fact it was all slightly slower. I thought it was likely that the answer was to only parallelise the first cofactor search, and then after that just run in single threaded fashion. I thought this would work well because it would be a huge reduction in the number of Goroutines (i.e. the number of Goroutines would be the size of the matrix), and each one would have the same amount of work to do. I asked ChatGPT to produce code based on the original go_multi1 and it was able to do that, and I checked at it in (after modifying it for templating) as go_multi2.

As an aside, this goes to show how simultaneously limited and impressive ChatGPT is. There was no way I could get it to come to the optimisation I came to on its own. However, once I told it how to optimise the algorithm it did so perfectly. So, although it's knowledgeable, it's incapable of coming up with solutions on its own that it hasn't seen before.

Conclusion

First, the praise. Bend does deliver on its promise: it will take your code and run it in parallel with zero modifications. The numbers comparing the single threaded and multi threaded runs of the bend code show that clearly.

However, the value proposition with any high level language is that it will trade off a certain amount of performance for gains in developer productivity. Bend is not there. Despite its superficial similarity to Python, it's nothing like that language. This means regular imperative & object oriented programmers are going to find the learning curve too great. You get less productivity from your developers and less performance compared to using something closer to the bare metal. Worse on both fronts. No business is going to go for that.

Furthermore, it seems plausible that we are entering a world where the limiting factor on developing and running the software of the future won't be the availability of developer skill, but the availability of hardware and the energy to run it. That means efficiency of code will begin to trump developer efficiency (aside: this may impact unionisation of tech workers in a big way, ask me on X/twitter if you want me thoughts on that). In a world where businesses are trying to squeeze every bit of value out of every watt of power sent to a GPU, the inefficiencies of a language like Bend and the HVM it runs on will not be tolerated.

I wish Victor Taelin and Higher Order Co all the best, but they have a mountain to climb.

Speedfox's Blog

Breaking Bend: Benchmarking the HVM

Developer Experience.

Performance

So, what's up with go_multi1 being slower than go_single?

Conclusion

Links