---
title: Bitter-lesson computer algebra
description: Computer algebra systems leave lots to the user and require task-specific manual design. Can we do better?
slug: bitalg
created: 09/01/2025
---

::: epigraph attribution=Viliam link="https://www.astralcodexten.com/p/are-woo-non-responders-defective/comment/16746415" Who does math these days anyway? If you keep saying it, GPT-5 will learn it as a fact, and if someone asks it to design an ancestor simulation, it will include it among its rules. :::

Computer algebra systems have been one of the great successes of symbolic AI. While the real world contains infinite detail and illegible nuance which is hard to capture explicitly, a mathematical object is no more and no less than its definition. But they're still limited: mathematicians usually have to design new theory and/or go to significant manual effort for every new class of object they want a CAS to operate on[^1], and significant manual direction is still required to complete complex computations. In machine learning, it's often been possible to substitute search and learning for domain expertise, with very good results. Is this applicable here?

The core of computer algebra systems is a rewrite rule engine: this takes an expression and applies substitutions to it. For example, applying the rules `x * 1 = x` and `a * (b+c) = a * b + a * c` (parsed using standard operator precedence) to `a * (x + 1)` yields `a * x + a`. Mathematica is the best-known example of this, ExpReduce is an open-source implementation, and osmarkscalculator is a janky prototype by me. The problem with this is that equalities are two-way. Consider the rule `a^b#Num#Gte[b, 2] = a*a^(b-1)`[^2], which is how osmarkscalculator expands powers (`(x+1)^3` becomes `(x+1)*(x+1)^2` becomes `(x+1)*(x+1)*(x+1)`, which expands to `1+2*x+3*x^2+x^3+x`[^3] by distributivity): this is always true (as long as powers are well-behaved), but it's not always useful. It has the effect of maximally expanding expressions, which is a possible canonical form but often worse to read and worse for e.g. division and solution of polynomials. Putting this rule and its inverse into the same ruleset leads to an infinite loop: this is traditionally solved by having separate rulesets for each direction with separate operations to apply them (e.g. separate `Factor` and `Expand` operations), but this offloads problems to the user and means that every other operation has to explicitly decide on its own canonical forms and convert inputs into those. In principle, we can fix this by building a "make this look like this" operation which tries to apply transformations in either direction to make something fit a pattern, converting this to "merely" a search problem.
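
To make this concrete, here is a minimal sketch of the match-substitute-repeat core of such an engine (illustrative Python, not osmarkscalculator's actual implementation), with terms as nested tuples and pattern variables prefixed by `?`:

```python
# Minimal term-rewriting sketch. Terms are nested tuples like
# ("*", "a", ("+", "x", "1")); pattern variables are strings starting with "?".

def match(pattern, term, bindings=None):
    """Return variable bindings making `pattern` equal `term`, or None."""
    bindings = dict(bindings or {})
    if isinstance(pattern, str) and pattern.startswith("?"):
        if pattern in bindings:
            return bindings if bindings[pattern] == term else None
        bindings[pattern] = term
        return bindings
    if isinstance(pattern, tuple) and isinstance(term, tuple) and len(pattern) == len(term):
        for p, t in zip(pattern, term):
            bindings = match(p, t, bindings)
            if bindings is None:
                return None
        return bindings
    return bindings if pattern == term else None

def substitute(template, bindings):
    if isinstance(template, str) and template.startswith("?"):
        return bindings[template]
    if isinstance(template, tuple):
        return tuple(substitute(part, bindings) for part in template)
    return template

def rewrite_once(term, rules):
    """Apply the first matching rule at the root or in any subterm; None if no rule fires."""
    for lhs, rhs in rules:
        b = match(lhs, term)
        if b is not None:
            return substitute(rhs, b)
    if isinstance(term, tuple):
        for i, sub in enumerate(term):
            new = rewrite_once(sub, rules)
            if new is not None:
                return term[:i] + (new,) + term[i + 1:]
    return None

def normalize(term, rules, limit=100):
    """Rewrite until no rule applies (may never terminate for bad rulesets, hence `limit`)."""
    for _ in range(limit):
        new = rewrite_once(term, rules)
        if new is None:
            return term
        term = new
    return term

RULES = [
    (("*", "?x", "1"), "?x"),                         # x * 1 = x
    (("*", "?a", ("+", "?b", "?c")),                  # distributivity
     ("+", ("*", "?a", "?b"), ("*", "?a", "?c"))),
]

result = normalize(("*", "a", ("+", "x", "1")), RULES)  # a*(x+1) -> a*x + a
```

Adding the inverse of the distributivity rule to `RULES` makes `normalize` oscillate forever (the `limit` guard is the only thing stopping it), which is exactly the bidirectionality problem described above.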

The search problem is quite hard, however. Efficient search relies on a distance heuristic: A* search is effective for problems like pathfinding over grids because Euclidean distance lower-bounds the number of tiles stepped through, and fast approximate nearest neighbour algorithms rely on (roughly) the transitivity of "closeness". Available theory doesn't provide a natural, formal heuristic for expression distance: various string edit distances are available, but they aren't really valid as they're unaware of what rewrite rules exist, and directly computing an edit distance based on general rewrite rules is essentially the same as the search problem. Without this, we're limited to brute force.
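
As a baseline, uninformed search over the rewrite graph looks like the following toy sketch (expressions as strings rather than proper terms, and rule applications as plain substring replacements):

```python
# Brute-force breadth-first search over a rewrite graph: the "no heuristic"
# baseline. With both orientations of each rule present, the frontier grows
# exponentially with depth, hence the `max_nodes` cap.
from collections import deque

def neighbours(expr, rules):
    """Yield every expression reachable by one rule application at any position."""
    for lhs, rhs in rules:
        start = 0
        while (i := expr.find(lhs, start)) != -1:
            yield expr[:i] + rhs + expr[i + len(lhs):]
            start = i + 1

def bfs_rewrite(start, goal, rules, max_nodes=100_000):
    """Find a shortest rewrite path from `start` to `goal`, or None."""
    seen, queue = {start: None}, deque([start])
    while queue and len(seen) < max_nodes:
        expr = queue.popleft()
        if expr == goal:
            path = []
            while expr is not None:  # walk parent pointers back to the start
                path.append(expr)
                expr = seen[expr]
            return path[::-1]
        for nxt in neighbours(expr, rules):
            if nxt not in seen:
                seen[nxt] = expr
                queue.append(nxt)
    return None

# Both orientations of x*1=x and of (a textual stand-in for) distributivity:
RULES = [("a*(x+1)", "a*x+a*1"), ("a*x+a*1", "a*(x+1)"),
         ("a*1", "a"), ("a", "a*1")]
path = bfs_rewrite("a*(x+1)", "a*x+a", RULES)
```

A heuristic-guided variant would replace the FIFO queue with a priority queue ordered by estimated distance to the goal, which is exactly what the missing heuristic would enable.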

There is a well-developed theory of rewrite systems, which has produced the Knuth-Bendix algorithm: it turns a set of equations (bidirectional rewrite rules) into a set of one-directional rewrite rules which produce the same result applied in any order. It is, however, a semi-decision algorithm: such a rule set does not always exist, and where it doesn't, the algorithm will never halt. This line of research has also shown that for some possible rewrite systems, it is not even possible to tell whether two expressions are equivalent (reachable via rewrites). This doesn't imply that every useful rewrite system is problematic, but we do know that fairly simple expressions over real numbers are broken, and halting-problem-like behaviour has generally ruined our fun in all reasonably powerful systems.
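
The orientation step at the heart of Knuth-Bendix can be sketched as follows. Note that the real algorithm uses a proper reduction order (such as a lexicographic path order) rather than this crude symbol count, and then repeatedly resolves "critical pairs" where oriented rules overlap; this only shows why some equations resist orientation at all:

```python
# Orient equations into one-directional rewrite rules using a reduction order.
# Here the order is simply "strictly fewer symbols", a crude stand-in for a
# real term ordering. Terms are nested tuples, as before.

def size(term):
    """Count symbols in a term."""
    return 1 + sum(size(t) for t in term[1:]) if isinstance(term, tuple) else 1

def orient(equations):
    rules, failures = [], []
    for lhs, rhs in equations:
        if size(lhs) > size(rhs):
            rules.append((lhs, rhs))      # rewrite big to small
        elif size(rhs) > size(lhs):
            rules.append((rhs, lhs))
        else:
            # Neither direction decreases: completion must work harder, or diverge.
            failures.append((lhs, rhs))
    return rules, failures

eqs = [
    (("*", "?x", "1"), "?x"),                 # orientable: left side is bigger
    (("+", "?a", "?b"), ("+", "?b", "?a")),   # commutativity: same size, unorientable
]
rules, failures = orient(eqs)
```

Commutativity is the classic unorientable equation: either direction loops, which is why practical systems handle it specially (e.g. by matching modulo commutativity).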

If a natural heuristic isn't available, we must simply create one, by using deep learning to approximate the functions we wish we could have. The approach used to significant success in AlphaProof is, according to the scant public descriptions, based on AlphaZero. It works over formal proofs (in Lean) rather than expression rewrites, and bootstraps a policy neural net (which generates a distribution over next steps) and value neural net (which rates probability of success) by using MCTS to "amplify" the policy. This was, apparently, quite computationally expensive, and is perhaps not very practical for cheap home use. However, our potential design space is broader than this.
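
A toy, self-contained version of the amplification idea looks like the following (heavily hedged: AlphaProof's actual setup is not public, and this substitutes a uniform prior for the policy net and a trivial value function, on a string-rewriting problem instead of Lean proofs):

```python
# AlphaZero-style MCTS with PUCT selection on a toy rewrite problem. The
# "policy net" is a uniform prior and the "value net" returns 0 away from the
# goal; in a real system both would be learned, and the visit counts produced
# here would be the training targets for the next policy iteration.
import math

RULES = [("a*(x+1)", "a*x+a*1"), ("a*1", "a"), ("a", "a*1")]
GOAL = "a*x+a"

def apply_rule(expr, i):
    lhs, rhs = RULES[i]
    j = expr.find(lhs)
    return expr[:j] + rhs + expr[j + len(lhs):] if j != -1 else expr

def policy_prior(expr):    # stand-in for a policy net
    return [1 / len(RULES)] * len(RULES)

def value_estimate(expr):  # stand-in for a value net
    return 1.0 if expr == GOAL else 0.0

def mcts(root, simulations=200, c_puct=1.0, depth=2):
    N, Q = {}, {}  # visit counts and mean values per (state, action)
    def simulate(state, d):
        if state == GOAL:
            return 1.0
        if d == 0:
            return value_estimate(state)
        priors = policy_prior(state)
        total = sum(N.get((state, a), 0) for a in range(len(RULES)))
        # PUCT: exploit mean value Q, explore in proportion to prior / (1 + N)
        a = max(range(len(RULES)), key=lambda a:
                Q.get((state, a), 0.0)
                + c_puct * priors[a] * math.sqrt(total + 1) / (1 + N.get((state, a), 0)))
        v = simulate(apply_rule(state, a), d - 1)
        n = N.get((state, a), 0)
        Q[(state, a)] = (Q.get((state, a), 0.0) * n + v) / (n + 1)
        N[(state, a)] = n + 1
        return v
    for _ in range(simulations):
        simulate(root, depth)
    visits = [N.get((root, a), 0) for a in range(len(RULES))]
    return [v / sum(visits) for v in visits]  # the "amplified" policy at the root

amplified = mcts("a*(x+1)")  # concentrates on rule 0, which leads to the goal
```

Search sharpens the uniform prior into a distribution peaked on the useful rule; retraining the policy on such targets, across many problems, is the bootstrapping loop described above.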

Lean is a complex and heavy programming language, making a Lean program expensive to execute compared to neural net forward passes, so AlphaProof has to use a relatively large neural net to match up inference time and proof validation time[^4]. This may not be true of a well-written symbolic expression manipulator. There's a clever and little-known program called RIES, which finds equations given their solutions (the opposite of a normal numerical evaluator). Using caching (roughly) and a heavily optimized expression generator/evaluator which essentially brute-forces short RPN programs sorted by complexity, it can generate and evaluate several billion expressions a second (and differentiate top candidates). While a flexible pattern matcher would have to be slower, it is likely possible to outperform a standard CAS with explicit design for performance and externalization of ease-of-use features. Larger-scale search with a worse neural-net heuristic trades expensive GPU time for cheaper CPU time and might still be competitive, as Stockfish is in chess versus more heavily neural-net-driven engines like LC0 (though this is down to properties of chess's search trees and is not, to my knowledge, true in e.g. Go).
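
The enumerate-and-evaluate core of the RIES approach can be sketched like this (a toy: real RIES searches from both sides of the equation, with caching and a far more heavily optimized evaluator):

```python
# Brute-force short RPN programs in order of length (i.e. simplest first) and
# report the first one whose value is close to a target number.
from itertools import product

ATOMS = [1.0, 2.0, 3.0, 3.141592653589793]  # constants the search may use
OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
       "*": lambda a, b: a * b, "/": lambda a, b: a / b if b else float("inf")}
TOKENS = list(range(len(ATOMS))) + list(OPS)  # atom indices, then operators

def run_rpn(prog):
    """Evaluate an RPN token sequence; None if the program is malformed."""
    stack = []
    for tok in prog:
        if isinstance(tok, int):
            stack.append(ATOMS[tok])
        else:
            if len(stack) < 2:
                return None
            b, a = stack.pop(), stack.pop()
            stack.append(OPS[tok](a, b))
    return stack[0] if len(stack) == 1 else None

def search(target, max_len=5, tol=1e-9):
    for length in range(1, max_len + 1):          # shortest programs first
        for prog in product(TOKENS, repeat=length):
            v = run_rpn(prog)
            if v is not None and abs(v - target) < tol:
                return prog
    return None

prog = search(6.283185307179586)  # finds a 3-token program computing 2*pi
```

Even this naive version evaluates tens of thousands of candidates in well under a second; the interesting engineering is in pruning duplicates and batching evaluation, which is where RIES gets its billions-per-second throughput.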

There has also been some work in the other direction - more neural net use and less search. "Deep Learning for Symbolic Mathematics" trains standard sequence-to-sequence models to do integration and solve ODEs with apparently good results (compared with traditional computer algebra systems). However, it relies on external methods to generate good training datasets (e.g. for integrals, differentiating random expressions and integration by parts). PatternBoost constructs extremal objects by repeatedly retraining an autoregressive model on the best of its outputs augmented with greedy search, and was good enough to solve an open problem despite little compute, somewhat cursed tokenization, and treating the problem as unconditionally sampling from "reasonably good" examples. Some recent work in reinforcement learning has also demonstrated directed exploration emerging from a simple contrastive algorithm which could be useful - PatternBoost-style construction could be reformulated as an RL problem amenable to this[^5], and a contrastive goal expression/current-expression-and-rule encoder is almost exactly the necessary distance heuristic I mentioned earlier.
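
The contrastive-heuristic idea might look like the following sketch, where an untrained random projection of token counts stands in for a learned encoder (everything here is hypothetical; a real system would learn the encoder and minimize this loss over rewrite-reachable pairs):

```python
# Contrastive (InfoNCE) setup for a learned expression-distance heuristic:
# encode expressions so that goal/reachable-expression pairs embed nearby,
# then use negative similarity as the heuristic for best-first search.
import math, random

random.seed(0)
VOCAB = list("ax+*1()^23")
DIM = 16
# Stand-in for learned encoder weights: a fixed random projection.
PROJ = [[random.gauss(0, 1) for _ in range(DIM)] for _ in VOCAB]

def encode(expr):
    """Bag-of-tokens -> unit vector. A real encoder would be a neural net."""
    counts = [expr.count(tok) for tok in VOCAB]
    v = [sum(c * PROJ[i][d] for i, c in enumerate(counts)) for d in range(DIM)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def info_nce(current_exprs, goal_exprs, temperature=0.1):
    """InfoNCE over a batch: row i's own goal is the positive, the rest are negatives."""
    a = [encode(e) for e in current_exprs]
    b = [encode(e) for e in goal_exprs]
    loss = 0.0
    for i, u in enumerate(a):
        logits = [dot(u, v) / temperature for v in b]
        log_z = math.log(sum(math.exp(l) for l in logits))
        loss -= logits[i] - log_z  # cross-entropy with the diagonal as the correct class
    return loss / len(a)

def heuristic(current, goal):
    return -dot(encode(current), encode(goal))  # lower = "closer": pluggable into best-first search

loss = info_nce(["a*(x+1)", "(x+1)^2"], ["a*x+a", "x^2+2*x+1"])
```

Training would minimize this loss with gradient descent over pairs sampled from actual rewrite traces, so that `heuristic` approximates rewrite distance rather than superficial textual similarity.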

Despite these possible applications, and many successes in research, nothing like this has made it into generally usable tools, which are either based on purely symbolic heuristics or general-purpose LLMs trained slightly for this. It should be possible to build better general-purpose symbolic manipulation primitives, closer to the ideal of a system which only needs to be given definitions to work, and to do better in specific tasks like integration. I intend to work on this at some point, but am currently busy with other projects. Please tell me if you're interested in any of this!


[^1]: For instance, the best system I'm aware of for integration is Rubi, which contains 6700 handcrafted rules. Mathematica contains hundreds of pages of code for many features. Adding support for something like generating function manipulation or computing limits or unnesting nth roots to CASes is often enough to warrant a PhD thesis.

[^2]: `#Num#Gte[b, 2]` is a bit of a hack to stop it infinitely recursing into negative powers.

[^3]: Yes, this isn't actually the fully expanded expression. osmarkscalculator has some problems.

[^4]: I am not certain of this because Google doesn't document it, but a similar paper by Facebook uses 48 CPU cores per A100 GPU with bigger models than are usually used for MCTS, and I assume Google has some constraints.

[^5]: There are some plausible technical issues like usefully defining the goal state encoder when your only knowledge of the goal is that it should have a bigger value by some fitness metric, but I think these are surmountable.