website/tag-systems.md at 925f10ff48e0f9f40d3cab6ec13f60caa7c1bdb3

osmarks/website

mirror of https://github.com/osmarks/website synced 2025-09-06 12:27:56 +00:00

osmarks 925f10ff48 new post and tweaks

2025-07-20 22:54:48 +01:00

8.7 KiB

| title | description | slug | created | tags |
| --- | --- | --- | --- | --- |
| Replacing tag systems | Also note-taking applications again, embeddings, and blog organization. | tag | 20/07/2025 | opinion |

Tag systems are now one of the most popular ways we organize objects (on computers, mostly), after flat named files and hierarchies, presumably because they're flexible, easy to understand and easy to implement. Even this blog has tags, although they aren't widely exposed. However, there are maintenance challenges: without foreknowledge of what you'll write later, it's easy to add overspecific tags which only ever contain one entry, or incautiously add a tag which contains (almost) every entry. In my own instance of Minoteaur, which I use for personal journals and notes on various topics, one tag (#journal) contains 90% of pages (2194), with other tags having occurrences in the tens at most and usually fewer.
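The two failure modes described above (tags with a single entry, and tags covering almost everything) are easy to detect after the fact. A minimal sketch, assuming pages are available as a mapping from page name to a set of tags; the function name `audit_tags`, the 50% threshold and the sample data are illustrative assumptions, not anything Minoteaur provides:

```python
from collections import Counter

def audit_tags(pages, broad_fraction=0.5):
    """Return (singleton_tags, broad_tags) for a mapping of page name -> set of tags.

    broad_fraction is an arbitrary illustrative threshold, not a recommended value.
    """
    counts = Counter(tag for tags in pages.values() for tag in set(tags))
    total = len(pages)
    # Overspecific: the tag only ever contains one entry.
    singletons = sorted(t for t, n in counts.items() if n == 1)
    # Overbroad: the tag contains at least broad_fraction of all pages.
    broad = sorted(t for t, n in counts.items() if total and n / total >= broad_fraction)
    return singletons, broad

# Hypothetical sample data.
pages = {
    "2025-07-18": {"journal"},
    "2025-07-19": {"journal"},
    "2025-07-20": {"journal", "blog-ideas"},
    "Group theory": {"maths"},
    "Ring theory": {"maths"},
}
singletons, broad = audit_tags(pages)
print(singletons)  # ['blog-ideas']
print(broad)       # ['journal']
```

This only reports the problem; deciding whether to merge, split or delete the flagged tags is still the manual maintenance work.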

This might not be bad, even if it feels inelegant¹, except for the fact that I almost never search by any of them. The natural and obvious commonalities between pages are often things like what field their content belongs to or what topics they talk about. But I've never had a good reason to read, for instance, everything I have ever written down about abstract algebra, and I don't think this would be a desirable way to arrange a blog either. To fix this, we need to disentangle the several functions of tags: surfacing related pages or posts, describing them to a reader who has found them another way, and augmenting searches. These have different requirements:

- Finding related content using tags benefits from many small tags (as long as they contain more than one entry), and/or bigger but compositional tags. Having a small set of (near-)orthogonal big tags is probably difficult, though (Zipf's law). Small tags are hard to keep track of, and are useless if relevant entries are tagged wrong.
- Description using tags is a strange use case, since most places which can fit tags can also fit a free text description. Tags can, however, allow for a more standardized, easily-skimmable vocabulary than free text. This blog currently uses them as more of a category system: pages are accent-colored in a few places based on the first tag with an associated color², and pages can be associated with each other by links or being part of the same series.
- Using tags in search requires that the user can predict what tags their desired result has before seeing that result³. This requires that the tags be predictable, regular and specific, and tailored to the kind of search queries which happen. Fanfic tagging schemes circa 2010 are a good example of this working well. Minoteaur has a structured data system (separate from "tags" per se, since I didn't work out a clean way to unify them) where pages have a list of key/value pairs associated with them and searches can filter on these. It works well for simple things, but I ran into the usual tagging issues of inconsistency in labelling things, not being sure how to split and organize things and difficulty refactoring.
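The key/value filtering described in the last point can be sketched roughly as follows. This is not Minoteaur's actual implementation; the `search` function, the page names and the keys are all hypothetical. The last page shows how a single inconsistent label silently drops a result:

```python
def matches(metadata, filters):
    """True if every (key, value) filter appears among the page's key/value pairs."""
    return all((key, value) in metadata for key, value in filters.items())

def search(pages, filters):
    """Return the names of pages whose metadata satisfies all filters, sorted."""
    return sorted(name for name, metadata in pages.items() if matches(metadata, filters))

# Hypothetical pages, each with a list of key/value pairs.
pages = {
    "Book A": [("type", "book"), ("status", "read")],
    "Book B": [("type", "book"), ("status", "reading")],
    "Film C": [("type", "film"), ("status", "read")],
    "Book D": [("type", "Book"), ("status", "read")],  # inconsistent capitalisation
}

print(search(pages, {"type": "book", "status": "read"}))  # ['Book A']
```

"Book D" should arguably match but does not, because exact matching on free-form values depends on every page having been labelled consistently.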

A major trend in note-taking applications, at least when I was designing Minoteaur around 2019, was trying to replace the first function of tags (related content) with links. In principle, backlinks, and the system of "unlinked references" (the software shows non-link uses of a page's title elsewhere), give you related content without manual tagging; in practice, pages do not naturally link to everything relevant, so you have to either write links manually as you would tags, not care very much about finding related things, or look through transitive connections. There's a further problem: linking requires notes to be organized into separate pages with unique names. Without the hierarchical organization lost in shiny modern applications, namespace collisions are hard to fix; more importantly, it's hard to divide some forms of notes into sensibly, uniquely named "pages" in the first place⁴⁵. Tagwiki attempts to fix this from the other direction by dropping page names entirely and only having tags, but I think this creates more problems: canonical names are helpful for addressing, being almost entirely dependent on tags makes their shortcomings worse, and bucketing by time has to be special-cased because of this insufficiency. A general structured data system would fix the special-casing, but not the rest.
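For concreteness, here is a rough sketch of how backlinks and "unlinked references" might be computed. It assumes wiki-style `[[Page Title]]` links (optionally `[[Page Title|label]]`), which is an assumption about syntax rather than a description of any particular application, and the page contents are made up:

```python
import re

# Assumed link syntax: [[Title]] or [[Title|label]].
LINK = re.compile(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]")

def references(title, pages):
    """Split other pages into those linking to `title` and those merely mentioning it."""
    mention = re.compile(r"\b" + re.escape(title) + r"\b", re.IGNORECASE)
    linked, unlinked = [], []
    for name, text in pages.items():
        if name == title:
            continue
        targets = {m.group(1).strip().lower() for m in LINK.finditer(text)}
        plain = LINK.sub(" ", text)  # strip links so they are not counted as mentions
        if title.lower() in targets:
            linked.append(name)
        elif mention.search(plain):
            unlinked.append(name)
    return sorted(linked), sorted(unlinked)

# Hypothetical pages.
pages = {
    "Group": "A group is a set with an associative operation, identity and inverses.",
    "Ring": "A ring is an abelian [[Group]] with a second operation.",
    "Lagrange's theorem": "The order of a subgroup divides the order of the group.",
    "Journal 2025-07-20": "Went for a walk.",
}

linked, unlinked = references("Group", pages)
print(linked)    # ['Ring']
print(unlinked)  # ["Lagrange's theorem"]
```

The title match is purely textual, so a page that uses "group" in an unrelated sense would also show up as an unlinked reference, which is the namespace problem in miniature.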

There is a school of thought which would suggest dealing with this by having separate software for journals, taking notes on reading, remembering how software works, etc. I like questionable solutions to the general versions of problems, and natural language processing is more advanced than when competing notes software was designed, and thus I have discovered/invented a new concept, preemptively named "Minoteaur 9". It would be built on a collapsible tree of bullet points like Dynalist, with metadata assignable to any bullet point. Rather than requiring tags or links, the software would allow highlighting phrases then jumping to relevant bullet points based on a combination of distance in the tree, text-embedding-based similarity, keyword matching and a classifier to determine how much something is an "authoritative definition" as opposed to a mere co-occurring reference. Search (and perhaps a sidebar) could then be used to highlight these co-occurring references or match metadata. To mitigate metadata bitrot, nodes' metadata would fit centrally defined schemas rather than being added ad-hoc.
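As an illustration only, the combined ranking could look something like a weighted sum of the four signals. Everything here is hypothetical: the `Candidate` fields, the weights, the toy two-dimensional embeddings and the `definition_score` values stand in for a real embedding model and classifier, and nothing like this has been implemented.

```python
import math
import re
from dataclasses import dataclass

@dataclass
class Candidate:
    text: str
    tree_distance: int       # edges between the highlighted bullet and this one
    embedding: list[float]   # assumed output of some text embedding model
    definition_score: float  # assumed "authoritative definition" classifier output in [0, 1]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def keyword_overlap(query, text):
    """Fraction of query words which also appear in the text."""
    q = set(re.findall(r"\w+", query.lower()))
    t = set(re.findall(r"\w+", text.lower()))
    return len(q & t) / len(q) if q else 0.0

def score(query, query_embedding, c, weights=(0.2, 0.4, 0.2, 0.2)):
    """Weighted sum of proximity, embedding similarity, keywords and definition-ness."""
    w_tree, w_embed, w_kw, w_def = weights  # arbitrary illustrative weights
    proximity = 1.0 / (1.0 + c.tree_distance)
    return (w_tree * proximity
            + w_embed * cosine(query_embedding, c.embedding)
            + w_kw * keyword_overlap(query, c.text)
            + w_def * c.definition_score)

query, query_embedding = "normal subgroup", [1.0, 0.0]
candidates = [
    Candidate("Quotient by a normal subgroup, see above", 1, [0.7, 0.3], 0.1),
    Candidate("A normal subgroup is invariant under conjugation", 6, [0.9, 0.1], 0.9),
    Candidate("Shopping list", 10, [0.0, 1.0], 0.0),
]
ranked = sorted(candidates, key=lambda c: score(query, query_embedding, c), reverse=True)
for c in ranked:
    print(f"{score(query, query_embedding, c):.2f}  {c.text}")
```

With these particular weights the distant definition outranks the nearby passing reference, but small changes to the weights flip that ordering, so a real version would need tuning.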

I'm not likely to implement this until/unless the cost of software goes down a lot, since my existing setup is mostly workable⁶, but I think this is an unexplored point in the space of notes.

For blogs, and external-facing writing generally, readers expect more linearity and structure, so novel free structures are probably not wise, and it's not as useful that the system might help find novel connections (as you can do this in your main notes and write them up). There is still a much cleaner way for blogs to do related pages than tags: text embeddings on post contents, which allow unsupervised (by you) clustering and similarity, without the risk of having to manually overhaul your tag ontology. For some reason⁷, almost nobody does this. Most people's photograph organization also seems to be built around albums (categories, i.e. primitive tags or primitive hierarchies) and basic automated tagging, rather than the obviously superior technologies of Meme Search Engine⁸. Perhaps in time people will migrate.
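A sketch of what embedding-based related posts might look like. To stay self-contained, `embed` here is a placeholder (a hashed bag of words), not a real text embedding model; in practice it would be swapped for a call to whatever model is available. The slugs and post texts are made up.

```python
import math
import re
import zlib

DIM = 1024  # arbitrary size for the placeholder vectors

def embed(text):
    """Placeholder embedding: hashed bag of words, L2-normalised. Swap in a real model."""
    vec = [0.0] * DIM
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        vec[zlib.crc32(word.encode()) % DIM] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec

def related(slug, posts, k=3):
    """Return up to k other slugs, ordered by cosine similarity to `slug`."""
    vectors = {s: embed(text) for s, text in posts.items()}
    target = vectors[slug]
    # Vectors are normalised, so the dot product is the cosine similarity.
    scored = [(sum(a * b for a, b in zip(target, v)), s)
              for s, v in vectors.items() if s != slug]
    return [s for _, s in sorted(scored, reverse=True)[:k]]

# Hypothetical posts; real input would be the full post contents.
posts = {
    "tags-post": "tags notes search embeddings organization",
    "notes-post": "notes software search links tags",
    "gpu-post": "gpu pricing memory bandwidth hardware",
}

print(related("tags-post", posts, k=1))
```

Nothing in this needs a manually maintained ontology: adding a post just means embedding it and recomputing similarities, which is cheap at blog scale.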

::: captioned src=/assets/images/internal_meme_search.png
The slightly outdated copy of Meme Search Engine I run on my photos and screenshots directory.
:::

1. It also doesn't maximize information per tag - you would want exactly 50% of pages to have each tag for that - but this isn't a good goal.
2. I think coloring for categorical information is underappreciated in modern designs, though it can be hard to design color palettes. I allocated tag colors manually here, but I don't know a nice way to color tags in bigger systems so that tags are mostly unambiguous. A co-occurrence approach like word2vec to give similar tags similar colors may be good, though it might be better to allocate them so that colors are reused only on tags which almost never appear together.
3. This is a general issue with search. A paper from 2022 describes a funny workaround: use a high-powered LLM to imagine what the result looks like and then search for items similar to that. Slightly relatedly, this works by generating n-grams which should be in the result and searching for those.
4. In my case this is maths notes: it's ugly to put an entire topic into one long page, but wildly impractical and annoying to split every proof, corollary and note into a separate page.
5. To some extent this is the case for blog posts: I have many minor things to say, but only publish them when they can fit into a particularly witty microblog snippet or when I can form a coherent long post from them.
6. Also, the ranking has many degrees of freedom and would be hard to tune.
7. I think it's because the engineering is at least nontrivial, and blog software is mostly settled by now, with new development often coming from "small web"/open source people who have been polarized against AI. I don't implement it because this website has only 40 blog posts.
8. I think big tech photo software does offer basic semantic search now, but with primitive small models.
