---
title: Bitter-lesson computer algebra
description: Computer algebra systems leave lots to the user and require task-specific manual design. Can we do better?
slug: bitalg
created: 09/01/2025
---

::: epigraph attribution=Viliam link="https://www.astralcodexten.com/p/are-woo-non-responders-defective/comment/16746415"
Who does math these days anyway? If you keep saying it, GPT-5 will learn it as a fact, and if someone asks it to design an ancestor simulation, it will include it among its rules.
:::

Computer algebra systems have been one of the great successes of symbolic AI. While the real world contains infinite detail and illegible nuance which is hard to capture explicitly, a mathematical object is no more and no less than its definition. But they're still limited: mathematicians usually have to design new theory and/or go to significant manual effort for every new class of object they want a CAS to operate on[^1], and significant manual direction is still required to complete complex computations. In machine learning, it's often been possible to [substitute search and learning for domain expertise](http://www.incompleteideas.net/IncIdeas/BitterLesson.html), with very good results. Is this applicable here?

The core of a computer algebra system is a rewrite rule engine: this takes an expression and applies substitutions to it. For example, applying the rules `x * 1 = x` and `a * (b+c) = a * b + a * c` (parsed using standard operator precedence) to `a * (x + 1)` yields `a * x + a`. [Mathematica](https://reference.wolfram.com/language/guide/RulesAndPatterns.html) is the best-known example of this, [ExpReduce](https://github.com/corywalker/expreduce) is an open-source implementation, and [osmarkscalculator](/osmarkscalculator/) is a janky prototype by me.

The problem with this is that equalities are two-way. Consider the rule `a^b#Num#Gte[b, 2] = a*a^(b-1)`[^3], which is how osmarkscalculator expands powers (`(x+1)^3` becomes `(x+1)*(x+1)^2` becomes `(x+1)*(x+1)*(x+1)`, which expands to `1+2*x+3*x^2+x^3+x`[^2] by distributivity): this is always *true* (as long as powers are well-behaved), but it's not always *useful*. It has the effect of maximally expanding expressions, which is a possible canonical form but often harder to read and worse for e.g. division and solving polynomials. Putting this rule and its inverse into the same ruleset leads to an infinite loop. This is traditionally solved by having separate rulesets for each direction, with separate operations to apply them (e.g. separate `Factor` and `Expand` operations), but that offloads the problem to the user and means every other operation has to explicitly decide on its own canonical forms and convert inputs into them.

In principle, we can fix this by building a "make this look like this" operation which tries to apply transformations in either direction to make one expression fit a pattern, reducing the problem to "merely" a search problem. The search problem is quite hard, however. Efficient search relies on a distance heuristic: [A* search](https://en.wikipedia.org/wiki/A*_search_algorithm) is effective for problems like pathfinding over grids because Euclidean distance lower-bounds the number of tiles stepped through, and fast [approximate nearest neighbour](https://en.wikipedia.org/wiki/Nearest_neighbor_search) algorithms rely on (roughly) the transitivity of "closeness".
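To make the shape of this concrete, here is a minimal sketch of the "make this look like this" operation as best-first search over single rewrite steps. The term encoding, rule set and names are toy placeholders, the heuristic is a crude string-similarity score rather than anything principled, and none of this is osmarkscalculator's actual implementation:

```python
# A toy "make this look like this" search: best-first search over single
# rewrite steps, with a placeholder string-similarity heuristic.
import heapq
import itertools
from difflib import SequenceMatcher

# Terms are atoms (strings/ints) or tuples (operator, arg, arg).
# Pattern variables are strings starting with "?".
RULES = [
    (("*", "?a", ("+", "?b", "?c")), ("+", ("*", "?a", "?b"), ("*", "?a", "?c"))),  # distribute
    (("+", ("*", "?a", "?b"), ("*", "?a", "?c")), ("*", "?a", ("+", "?b", "?c"))),  # factor (the inverse!)
    (("*", "?x", 1), "?x"),
    (("+", "?x", 0), "?x"),
]

def match(pattern, term, bindings):
    """Match `pattern` against `term`, returning extended bindings or None."""
    if isinstance(pattern, str) and pattern.startswith("?"):
        if pattern in bindings:
            return bindings if bindings[pattern] == term else None
        return {**bindings, pattern: term}
    if isinstance(pattern, tuple):
        if not isinstance(term, tuple) or len(term) != len(pattern):
            return None
        for p, t in zip(pattern, term):
            bindings = match(p, t, bindings)
            if bindings is None:
                return None
        return bindings
    return bindings if pattern == term else None

def substitute(template, bindings):
    if isinstance(template, str) and template.startswith("?"):
        return bindings[template]
    if isinstance(template, tuple):
        return tuple(substitute(part, bindings) for part in template)
    return template

def rewrites(term):
    """All terms reachable from `term` by one rule application anywhere."""
    for lhs, rhs in RULES:
        bindings = match(lhs, term, {})
        if bindings is not None:
            yield substitute(rhs, bindings)
    if isinstance(term, tuple):
        for i, arg in enumerate(term[1:], start=1):
            for new_arg in rewrites(arg):
                yield term[:i] + (new_arg,) + term[i + 1:]

def heuristic(term, goal):
    # Placeholder "distance": string dissimilarity of the printed terms.
    # This is exactly the sort of heuristic argued below to be inadequate.
    return 1 - SequenceMatcher(None, str(term), str(goal)).ratio()

def search(start, goal, max_steps=10_000):
    """Best-first search: try to rewrite `start` into `goal`."""
    counter = itertools.count()  # tie-breaker so heapq never compares terms
    frontier = [(heuristic(start, goal), next(counter), start, [start])]
    seen = {start}
    for _ in range(max_steps):
        if not frontier:
            break
        _, _, term, path = heapq.heappop(frontier)
        if term == goal:
            return path
        for nxt in rewrites(term):
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(frontier, (heuristic(nxt, goal), next(counter), nxt, path + [nxt]))
    return None

# Rewrite a*(x+1) into a*x + a, despite the distribute/factor rules being inverses.
print(search(("*", "a", ("+", "x", 1)), ("+", ("*", "a", "x"), "a")))
```

The rule set deliberately contains distributivity in both directions: blind exhaustive rewriting would loop forever, whereas goal-directed search with a visited set terminates - the catch is entirely in how good the heuristic is.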
Available theory doesn't provide a natural, formal heuristic for expression distance: various [string edit distances](https://en.wikipedia.org/wiki/Levenshtein_distance) are available, but they aren't really valid, as they're unaware of what rewrite rules exist, and directly computing an edit distance based on general rewrite rules is essentially the same as the search problem. Without this, we're limited to brute force.

There is a well-developed [theory](https://en.wikipedia.org/wiki/Confluence_(abstract_rewriting)) of rewrite systems, which has produced the [Knuth-Bendix](https://en.wikipedia.org/wiki/Knuth%E2%80%93Bendix_completion_algorithm) algorithm: this turns a set of equations (bidirectional rewrite rules) into a set of one-directional rewrite rules which produce the same result applied in any order. It is, however, a *semi-decision* algorithm: completion is not always possible, and where it isn't, the algorithm never halts. This line of research [has also shown](https://en.wikipedia.org/wiki/Word_problem_(mathematics)) that for some possible rewrite systems, it is not even possible to tell whether two expressions are equivalent (reachable via rewrites). This doesn't imply that every useful rewrite system is problematic, but we [do know](https://en.wikipedia.org/wiki/Richardson%27s_theorem) that equality is already undecidable for fairly simple expressions over the real numbers, and halting-problem-like behaviour has generally ruined our fun in all reasonably powerful systems.

If a natural heuristic isn't available, we must simply create one, by using deep learning to approximate the functions we wish we could have. The approach used to significant success in [AlphaProof](https://deepmind.google/discover/blog/ai-solves-imo-problems-at-silver-medal-level/) is, according to the scant public descriptions, based on [AlphaZero](https://arxiv.org/abs/1712.01815). It works over formal proofs (in [Lean](https://lean-lang.org/)) rather than expression rewrites, and bootstraps a policy neural net (which generates a distribution over next steps) and a value neural net (which rates probability of success) by using MCTS to "amplify" the policy. This was, apparently, quite computationally expensive, and is perhaps not very practical for cheap home use.

However, our potential design space is broader than this. Lean is a complex and heavy programming language, so executing a Lean program is expensive compared to a neural net forward pass, and AlphaProof has to use a relatively large neural net to match up inference time and proof validation time[^4]. This may not be true of a well-written symbolic expression manipulator. There's a clever and little-known program called [RIES](https://www.mrob.com/pub/ries/index.html), which finds equations given their solutions (the opposite of a normal numerical evaluator). Using caching (roughly) and a heavily optimized expression generator/evaluator which essentially brute-forces short [RPN](https://en.wikipedia.org/wiki/Reverse_Polish_notation) programs sorted by complexity, it can generate and evaluate several billion expressions a second (and differentiate top candidates). While a flexible pattern matcher would have to be slower, it is likely possible to outperform a standard CAS with explicit design for performance and externalization of ease-of-use features.
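RIES itself matches two half-expressions against each other and is heavily optimized C; purely as an illustration of the enumerate-and-evaluate idea (the token set, target and tolerance here are invented for the example, and this is not RIES's actual algorithm), a one-sided toy version looks something like this:

```python
# Toy RIES-style brute force: enumerate RPN programs from short to long,
# evaluate each, and keep those whose value lands near a target constant.
import itertools
import math

TARGET = 2.50662827  # a constant we suspect has a short closed form (it's sqrt(2*pi))
CONSTANTS = {"1": 1.0, "2": 2.0, "3": 3.0, "pi": math.pi, "e": math.e}
UNARY = {"sqrt": math.sqrt, "ln": math.log}
BINARY = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
          "*": lambda a, b: a * b, "/": lambda a, b: a / b,
          "^": lambda a, b: a ** b}
TOKENS = list(CONSTANTS) + list(UNARY) + list(BINARY)

def evaluate(program):
    """Evaluate an RPN token sequence; None if it's malformed or numerically invalid."""
    stack = []
    try:
        for tok in program:
            if tok in CONSTANTS:
                stack.append(CONSTANTS[tok])
            elif tok in UNARY:
                stack.append(UNARY[tok](stack.pop()))
            else:
                b, a = stack.pop(), stack.pop()
                stack.append(BINARY[tok](a, b))
    except (IndexError, ValueError, TypeError, OverflowError, ZeroDivisionError):
        return None
    result = stack[0] if len(stack) == 1 else None
    return result if isinstance(result, float) else None

# Enumerate programs from simplest upwards (ordering by length is a crude
# stand-in for RIES's complexity scoring) and collect the close matches.
matches = []
for length in range(1, 5):
    for program in itertools.product(TOKENS, repeat=length):
        value = evaluate(program)
        if value is not None and abs(value - TARGET) < 1e-4:
            matches.append((abs(value - TARGET), " ".join(program), value))

for error, program, value in sorted(matches)[:5]:
    print(f"{program} = {value} (off by {error:.2e})")
```

Even this naive Python loop gets through its few tens of thousands of candidate programs near-instantly and recovers `2 pi * sqrt`; the engineering in RIES is about pushing the same basic idea to billions of candidates per second.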
Larger-scale search with a worse neural-net heuristic trades expensive GPU time for cheaper CPU time and might still be competitive, as Stockfish is in chess versus more heavily neural-net-driven engines like [LC0](https://lczero.org/) (though this is down to properties of chess's search trees and is not, to my knowledge, true in e.g. Go).

There has also been some work in the other direction - more neural net use and less search. [Deep Learning for Symbolic Mathematics](https://arxiv.org/abs/1912.01412) trains standard sequence-to-sequence models to do integration and solve ODEs with apparently good results (compared with traditional computer algebra systems). However, it relies on external methods to generate good training datasets (e.g. for integrals, differentiating random expressions and integration by parts). [PatternBoost](https://arxiv.org/abs/2411.00566) constructs extremal objects by repeatedly retraining an autoregressive model on the best of its outputs augmented with greedy search, and was good enough to solve an open problem despite little compute, somewhat cursed tokenization, and treating the problem as unconditionally sampling from "reasonably good" examples. Some [recent work in reinforcement learning](https://arxiv.org/abs/2408.05804) has also demonstrated directed exploration emerging from a simple contrastive algorithm, which could be useful - PatternBoost-style construction could be reformulated as an RL problem amenable to this[^5], and a contrastive goal-expression/current-expression-and-rule encoder is almost exactly the distance heuristic I mentioned earlier.

Despite these possible applications, and many successes in research, nothing like this has made it into generally usable tools, which are either based on purely symbolic heuristics or general-purpose LLMs trained slightly for this. It should be possible to build better general-purpose symbolic manipulation primitives, closer to the ideal of a system which only needs to be given definitions to work, and to do better in specific tasks like integration. I intend to work on this at some point, but am currently busy with other projects. Please tell me if you're interested in any of this!

[^1]: For instance, the best system I'm aware of for integration is [Rubi](https://rulebasedintegration.org/), which contains 6700 handcrafted rules. Mathematica contains [hundreds of pages](https://reference.wolfram.com/language/tutorial/SomeNotesOnInternalImplementation.html) of code for many features. Adding support for something like generating function manipulation or computing limits or unnesting nth roots to CASes is often enough to warrant a PhD thesis.

[^2]: Yes, this isn't actually the fully expanded expression. osmarkscalculator has some problems.

[^3]: `#Num#Gte[b, 2]` is a bit of a hack to stop it infinitely recursing into negative powers.

[^4]: I am not certain of this because Google doesn't document it, but a similar paper by [Facebook](https://arxiv.org/abs/2205.11491) uses 48 CPU cores per A100 GPU with bigger models than are usually used for MCTS, and I assume Google has *some* constraints.

[^5]: There are some plausible technical issues, like usefully defining the goal state encoder when your only knowledge of the goal is that it should have a bigger value by some fitness metric, but I think these are surmountable.