
Prompted in part by Apple’s paper about the limits of large language models (“The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity”), I spent some time playing with Tower of Hanoi. It’s a problem I solved some 50 years ago when I was in college, and I haven’t felt the desire or need to revisit it since. Now, of course, “We Can Haz AI,” and all that means. Naturally, I didn’t want to write the code myself. I confess, I don’t like recursive solutions. But there was Qwen3-30B, a “reasoning model” with 30 billion parameters that I can run on my laptop. I had no doubt that Qwen could generate a good Tower program, but I thought it would be fun to see what happened.
First, I asked Qwen if it was familiar with the Tower of Hanoi problem. Of course it was. After it explained the game, I asked it to write a Python program to solve it, with the number of disks taken from the command line. Fine: the result looks a lot like the program I remember writing in college (except that was way, way before Python; I think I used a dialect of PL/I). I ran it, and it worked perfectly.
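For readers who haven’t seen it, here is a minimal sketch of the kind of program I mean. This is not Qwen’s actual output, which I haven’t reproduced here; the function name `hanoi`, the peg labels, and the default of three disks are my own assumptions for illustration. It’s the classic recursive approach: move the smaller disks out of the way, move the biggest disk, then move the smaller disks back on top of it.

```python
import sys


def hanoi(n, source="A", target="C", spare="B"):
    """Yield (disk, from_peg, to_peg) moves that transfer n disks from source to target."""
    if n == 0:
        return
    # Move the n-1 smaller disks out of the way, onto the spare peg.
    yield from hanoi(n - 1, source, spare, target)
    # Move the largest remaining disk directly to the target peg.
    yield (n, source, target)
    # Move the n-1 smaller disks from the spare peg onto the target peg.
    yield from hanoi(n - 1, spare, target, source)


def main(argv):
    # The number of disks comes from the command line (assumed default: 3).
    n = int(argv[1]) if len(argv) > 1 else 3
    for disk, src, dst in hanoi(n):
        print(f"Move disk {disk} from {src} to {dst}")


if __name__ == "__main__":
    main(sys.argv)
```

Saved under a name of your choosing and run with a disk count as its only argument, it prints one move per line. A puzzle with n disks takes 2**n - 1 moves, so the output gets long quickly.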
The output was a bit awkward (just a list of moves), so I asked it to animate it on the terminal. The terminal animation wasn’t really satisfactory, so after a couple of tries, I asked it to try a graphical animation. I didn’t give it any more information than that. It generated another program, using Python’s tkinter library. And again, this worked perfectly. It generated a nice visualization—except that when I watched the animation, I realized that it had solved the problem upside down! Large disks were on top of smaller disks, not vice versa. I want to be clear—the solution was absolutely correct; in addition to inverting the towers, it inverted the rule about moving disks, so that it was never putting a smaller disk on top of a larger one. If you stacked the disks in a pyramid (the “normal” way) and made the same moves, you’d get the correct result. Symmetry FTW.
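If you want to convince yourself of the symmetry, here’s a small sketch that replays one list of moves under both conventions. Everything in it is my own illustration rather than anything Qwen produced: the `replay` function, the `inverted` flag, and the numbering of pegs as 0, 1, and 2 are all assumptions. In the normal setup, the smallest disk starts on top and a disk may only land on a larger one; in the inverted setup, the largest disk starts on top and a disk may only land on a smaller one.

```python
def hanoi(n, source=0, target=2, spare=1):
    """Yield (from_peg, to_peg) moves for the standard recursive solution."""
    if n == 0:
        return
    yield from hanoi(n - 1, source, spare, target)
    yield (source, target)
    yield from hanoi(n - 1, spare, target, source)


def replay(n, moves, inverted=False):
    """Return True if every move is legal and the whole stack ends up on peg 2."""
    # Each peg is a list whose last element is the top disk.
    # Normal: largest disk (n) at the bottom. Inverted: smallest disk (1) at the bottom.
    start = list(range(1, n + 1)) if inverted else list(range(n, 0, -1))
    pegs = [list(start), [], []]
    for src, dst in moves:
        if not pegs[src]:
            return False  # nothing to move from an empty peg
        disk = pegs[src].pop()
        if pegs[dst]:
            top = pegs[dst][-1]
            # Normal rule: the moved disk must be smaller than the one it lands on.
            # Inverted rule: the moved disk must be larger than the one it lands on.
            legal = disk > top if inverted else disk < top
            if not legal:
                return False
        pegs[dst].append(disk)
    return pegs[0] == [] and pegs[1] == [] and pegs[2] == start


moves = list(hanoi(4))
print(replay(4, moves), replay(4, moves, inverted=True))
```

The same moves pass both checks because relabeling disk k as disk n + 1 - k turns one setup into the other; the moves themselves only name pegs, not disks.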
So I told Qwen that the solution was upside down and asked it to fix it. It thought for a long time and eventually told me that I must be looking at the visualization the wrong way. Perhaps it thought I should stand on my head? Proving, if nothing else, that LLMs can be assholes too. Just like 10x programmers. Maybe that’s an argument for AGI?
Seriously, there’s a point here. It is certainly important to research the limits of artificial intelligence. It’s definitely interesting that reasoning LLMs tended to abandon problems that required too much reasoning and were most successful at problems that only required a moderate reasoning budget. Interesting, but is that surprising? Very hard problems are very hard problems for a reason: They are very hard. And most humans behave the same way: We give up (or look up the answer) when faced with a problem too hard for us to solve.
But we must also think about what we mean by “reasoning.” I had no doubt that Qwen could solve Tower of Hanoi. After all, solutions must be in hundreds of GitHub repos, Stack Overflow questions, and online tutorials. Do I, as a user, care the least little bit if Qwen looks up the solution in an external source? No, I don’t, as long as the output is correct. Do I think this means that Qwen is not “reasoning”? Ignoring all the anthropomorphism that we’re stuck with, no. If a reasonable and reasoning human is asked to solve a difficult problem, what do we do? We try to look up a process for solving the problem. We verify that the process is correct. And we use that process in our solution. If computers are relevant, we’ll use them, rather than solving with pencil and paper. Why should we expect anything different from LLMs? If someone told me that I had to solve Tower of Hanoi with 15 disks (32,767 moves), I’m sure I’d get lost somewhere between the beginning and end, even though I know the algorithm. But I wouldn’t even think of listing the moves by hand; I’d write a program (like the one Qwen generated) and have it dump out the moves. Laziness is a virtue; that’s something Larry Wall (creator of Perl) taught us. That’s reasoning: It’s as much about looking for the easy solution as it is doing the hard work.
A blog post I read recently reported something similar. Someone asked OpenAI’s o3 to solve a classic chess problem by Paul Morphy (probably the greatest chess player of the 19th century). The AI realized that its attempts to solve the problem were incorrect, so it looked up the answer online, used that as its reply, and gave a good explanation of why the answer was correct. This is a perfectly reasonable way to solve the problem. The LLM experiences no joy, no validation, in solving a difficult chess problem; it doesn’t feel a sense of accomplishment. It’s just supplying an answer. While it’s not the kind of reasoning that AI researchers want to see, looking up the answer online and explaining why the answer is correct is a great demonstration of human-like reasoning. Maybe this isn’t “reasoning” from a researcher’s perspective, but it’s certainly problem-solving. It represents a chain of thought in which the model decides that it can’t solve the problem on its own, so it looks up the answer online. And when I’m using AI, problem-solving is what I’m after.
I want to make it clear that I’m not a convert to the cult of AGI. I don’t consider myself a skeptic either; I’m a nonbeliever, and that’s different. We can’t talk about general intelligence meaningfully if we can’t define what “intelligence” means. The hegemony of the technorati has us chasing after problem-solving metrics, as if “intelligence” could be represented by a number. It’s all Asimov until you need to run benchmarks—then it’s reduced to numbers. If we know anything about intelligence, we know it’s not represented by a vector of benchmark results testing the ability to solve hard problems.
But if AI isn’t the embodiment of some kind of undefinable intelligence, it’s still the greatest engineering project of the 21st century. The ability to synthesize human language correctly is a major achievement, as is the ability to emulate human reasoning—and “emulation” is a fair description of what it’s doing. AI’s detractors ignore—bizarrely, in my opinion—its tremendous utility, as if citing examples where AI generates incorrect or grossly inappropriate output means that it’s useless. That isn’t the case—but it does require thinking carefully about AI’s limitations. Programming with AI assistance will certainly require more attention to debugging, testing, and software design—all themes that we’ve been watching carefully over the past few years, and that we’re talking about in our AI Codecon conferences. Applications like detecting fraud in welfare applications may have to be scrapped or put on hold, as the city of Amsterdam found out, until we can build AI systems that are free from bias. Building bias-free systems is likely to be much harder than solving difficult problems in mathematics. It’s a problem that might not be solvable—we humans certainly haven’t solved it. Either worrying about or breathlessly anticipating AGI achieves little, except for diverting attention away from both useful applications of AI and real harms caused by AI.