mirror of https://github.com/osmarks/website synced 2025-09-02 18:57:55 +00:00

new post and tweaks

This commit is contained in:
osmarks
2025-07-20 22:54:48 +01:00
parent 9a651933e4
commit 925f10ff48
10 changed files with 193 additions and 19 deletions

Binary file not shown (new image, 381 KiB).

@@ -35,7 +35,7 @@ Relatedly, power and interconnectivity difficulties with PCIe cards have led to
Consumers have mostly been stuck with [Gigabit Ethernet](https://en.wikipedia.org/wiki/Gigabit_Ethernet) for decades[^3]. 10 Gigabit Ethernet was available soon afterward but lacks consumer adoption: I think this is because of a combination of expensive and power-hungry initial products, more stringent cabling requirements, and lack of demand (because internet connections were slow and LANs became less important). A decade later, the more modest 2.5GbE and 5GbE standards were released, designed for cheaper implementations and cables[^4]. These have eventually been used in desktop motherboards and higher-end consumer networking equipment[^5].
Servers, being in environments where fast core networks and backhaul are more easily available and having more use for high throughput, moved more rapidly to 10, 40, 100(/25)[^6], 200, 400 and 800Gbps, with 1.6Tbps Ethernet now being standardized. The highest speeds are mostly for AI and specialized network applications, since most code is bad and is limited by some combination of CPU time and external API latency. Optimized code which does not do much work can handle [millions of HTTP requests a second](https://www.techempower.com/benchmarks/#section=data-r23&test=json) on [28 outdated cores](https://github.com/TechEmpower/FrameworkBenchmarks/wiki/Project-Information-Environment), and with kernel bypass and hardware-offloaded cryptography [DPDK](https://www.dpdk.org/) can push 100 million: most software is not like that, and has to do more work per byte sent/received.
Servers, being in environments where fast core networks and backhaul are more easily available and having more use for high throughput, moved rapidly to 10, 40, 100(/25)[^6], 200, 400 and 800Gbps, with 1.6Tbps Ethernet now being standardized. The highest speeds are mostly for AI and specialized network applications, since most code is bad and is limited by some combination of CPU time and external API latency. Optimized code which does not do much work can handle [millions of HTTP requests a second](https://www.techempower.com/benchmarks/#section=data-r23&test=json) on [28 outdated cores](https://github.com/TechEmpower/FrameworkBenchmarks/wiki/Project-Information-Environment), and with kernel bypass and hardware-offloaded cryptography [DPDK](https://www.dpdk.org/) can push 100 million: most software is not like that, and has to do more work per byte sent/received.
Energy per bit transferred is scaling down slower than data rates are scaling up, so high-performance switches are having to move to [co-packaged optics](https://www.servethehome.com/hands-on-with-the-intel-co-packaged-optics-and-silicon-photonics-switch/) and similar technologies.


@@ -176,7 +176,7 @@ Well, I mean, yes idealism, yes the dignity of pure research, yes the pursuit of
### Schild's Ladder
[Schild's Ladder](https://www.goodreads.com/book/show/156780.Schild_s_Ladder) may just be Greg Egan showing off cool physics ideas without much in the way of plot beyond this, but they *are* very cool. Egan also manages to pull off an actually-futuristic future society and world. I also enjoyed Greg Egan's many short story collections, [Luminous](https://www.goodreads.com/book/show/156782.Luminous)/[Oceanic](https://www.goodreads.com/book/show/6741362-oceanic)/[Instantiation](https://www.goodreads.com/book/show/50641444-instantiation) - I think he does better at these.
[Schild's Ladder](https://www.goodreads.com/book/show/156780.Schild_s_Ladder) may just be Greg Egan showing off cool physics ideas without much in the way of plot beyond this, but they *are* very cool. Egan also manages to pull off an actually-futuristic future society and world[^6]. I also enjoyed Greg Egan's many short story collections, [Luminous](https://www.goodreads.com/book/show/156782.Luminous)/[Oceanic](https://www.goodreads.com/book/show/6741362-oceanic)/[Instantiation](https://www.goodreads.com/book/show/50641444-instantiation) - I think he does better at these.
### Sufficiently Advanced Magic
@@ -382,3 +382,5 @@ You can suggest other possibly-good stuff in the comments and I may add it to an
[^4]: Using the word "dimension" this way is terrible but we are stuck with it. I might have to retroactively eliminate whoever came up with it.
[^5]: I believe the reviewers saying it's novel and exciting have just never read anything vaguely transhumanist ever.
[^6]: He does have a tendency to write every character, including nominally-alien aliens, as unrealistically reasonable, cosmopolitan, smart and well-educated.

blog/tag-systems.md Normal file

@@ -0,0 +1,42 @@
---
title: Replacing tag systems
description: Also note-taking applications again, embeddings, and blog organization.
slug: tag
created: 20/07/2025
tags: ["opinion"]
---
Tag systems are now one of the most popular ways we organize objects (on computers, mostly), after flat named files and hierarchies, presumably because they're flexible, easy to understand and easy to implement. Even this blog has tags, although they aren't widely exposed. However, there are maintenance challenges: without foreknowledge of what you'll write later, it's easy to add overspecific tags which only ever contain one entry, or incautiously add a tag which contains (almost) every entry. In my own instance of [Minoteaur](/minoteaur/), which I use for personal journals and notes on various topics, one tag (`#journal`) contains 90% of pages (2194), with other tags having occurrences in the tens at most and usually fewer.
This might not be bad, even if it feels inelegant[^1], except for the fact that I almost never search by any of them. The natural and obvious commonalities between pages are often things like what field their content belongs to or what topics they talk about. But I've never had a good reason to read, for instance, everything I have ever written down about abstract algebra, and I don't think this would be a desirable way to arrange a blog either. To fix this, we need to disentangle the several functions of tags: surfacing related pages or posts, describing them to a reader who has found them another way, and augmenting searches. These have different requirements:
* Finding related content using tags benefits from many small tags (as long as they contain more than one entry), and/or bigger but compositional tags. Having a small set of (near-)orthogonal big tags is probably difficult, though (Zipf's law). Small tags are hard to keep track of, and are useless if relevant entries are tagged wrong.
* Description using tags is a strange use case, since most places which can fit tags can also fit a free text description. Tags can, however, allow for a more standardized, easily-skimmable vocabulary than free text. This blog currently uses them as more of a category system: pages are accent-colored in a few places based on the first tag with an associated color[^3], and pages can be associated with each other by links or being part of the same series.
* Using tags in search requires that the user can predict what tags their desired result has before seeing that result[^2]. This requires that the tags be predictable, regular and specific, and tailored to the kind of search queries which happen. [Fanfic tagging schemes circa 2010](https://idlewords.com/talks/fan_is_a_tool_using_animal.htm) are a good example of this working well. Minoteaur has a structured data system (separate from "tags" per se, since I didn't work out a clean way to unify them) where pages have a list of key/value pairs associated with them and searches can filter on these. It works well for simple things, but I ran into the usual tagging issues of inconsistency in labelling things, not being sure how to split and organize things and [difficulty refactoring](https://gwern.net/design#future-tag-features).
A major trend in note-taking applications, at least when I was designing Minoteaur around 2019, was trying to replace the first function of tags (related content) with links. In principle, backlinks, and the system of "unlinked references" (the software shows non-link uses of a page's title elsewhere), give you related content without manual tagging; in practice, pages do not naturally link to everything relevant, so you have to either write links manually as though they were tags, give up on finding related things, or trawl through transitive connections. There's a further problem: linking requires notes to be organized into separate pages with unique names. Since shiny modern applications have dropped hierarchical organization, namespace collisions are hard to fix; more importantly, it's hard to divide some forms of notes into sensibly, uniquely named "pages" in the first place[^4][^5]. [Tagwiki](https://github.com/dpc/tagwiki) attempts to fix this from the other direction by dropping page names entirely and only having tags, but I think this creates more problems: canonical names are helpful for addressing, being almost entirely dependent on tags makes their shortcomings worse, and bucketing by time is special-cased in due to this insufficiency. A general structured data system would fix the special-casing, but not the rest.
There is a school of thought which would suggest dealing with this by having separate software for journals, taking notes on reading, remembering how software works, etc. I like questionable solutions to the general versions of problems, and natural language processing is more advanced than it was when competing notes software was designed, so I have discovered/invented a new concept, preemptively named "Minoteaur 9". It would be built on a collapsible tree of bullet points like [Dynalist](https://dynalist.io/), with metadata assignable to any bullet point. Rather than requiring tags or links, the software would allow highlighting phrases then jumping to relevant bullet points based on a combination of distance in the tree, text-embedding-based similarity, keyword matching and a classifier to determine how much something is an "authoritative definition" as opposed to a mere co-occurring reference. Search (and perhaps a sidebar) could then be used to highlight these co-occurring references or match metadata. To mitigate metadata bitrot, nodes' metadata would fit centrally defined schemas rather than being added ad-hoc.
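As a sketch of how such a ranking could mix its signals (every weight and function name here is an invented placeholder for illustration, not anything Minoteaur actually does):

```python
import numpy as np

def tree_distance(path_a, path_b):
    """Edges between two bullet points, each identified by its path of
    child indices from the root, e.g. (0, 2, 1)."""
    common = 0
    for x, y in zip(path_a, path_b):
        if x != y:
            break
        common += 1
    return (len(path_a) - common) + (len(path_b) - common)

def relevance(query_vec, node_vec, query_words, node_words, dist):
    """Mix embedding similarity, keyword overlap and tree distance.
    Embedding vectors are assumed unit-norm; the weights are arbitrary
    placeholders."""
    similarity = float(np.dot(query_vec, node_vec))
    overlap = len(query_words & node_words) / max(len(query_words), 1)
    return 0.6 * similarity + 0.3 * overlap - 0.1 * dist
```

The arbitrary weights are exactly the degrees of freedom that would be hard to tune, and a real version would also need the authoritative-definition classifier as a fourth signal.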
I'm not likely to implement this until/unless the cost of software goes down a lot, since my existing setup is mostly workable[^7], but I think this is an unexplored point in the space of notes.
For blogs, and external-facing writing generally, readers expect more linearity and structure, so novel free structures are probably not wise, and it's not as useful that the system might help find novel connections (as you can do this in your main notes and write them up). There is still a much cleaner way for blogs to do related pages than tags: [text embeddings](https://cameronharwick.com/writing/related-posts-in-wordpress-with-vector-embedding/) on post contents, which allow unsupervised (by you) clustering and similarity, and without the risk of having to manually overhaul your tag ontology. For some reason[^8], almost nobody does this. Most people's photograph organization also seems to be built around albums (categories, i.e. primitive tags or primitive hierarchies) and basic automated tagging, rather than the obviously superior technologies of [Meme Search Engine](https://github.com/osmarks/meme-search-engine)[^6]. Perhaps in time people will migrate.
::: captioned src=/assets/images/internal_meme_search.png
The slightly outdated copy of Meme Search Engine I run on my photos and screenshots directory.
:::
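A minimal sketch of embedding-based related posts: the hashed bag-of-words below is an invented stand-in for a real text-embedding model, which you would swap in for actual use; the rest is just cosine similarity over post contents.

```python
import numpy as np

def embed(text, dim=512):
    """Toy stand-in for a real embedding model: a hashed bag of words,
    normalized so dot products are cosine similarities."""
    v = np.zeros(dim)
    for word in text.lower().split():
        v[hash(word) % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

def related_posts(posts, k=2):
    """Map each post title to its k most similar other posts by cosine
    similarity of content embeddings."""
    titles = list(posts)
    vecs = np.stack([embed(body) for body in posts.values()])
    sims = vecs @ vecs.T             # cosine similarity (unit-norm vectors)
    np.fill_diagonal(sims, -np.inf)  # a post is not related to itself
    return {title: [titles[j] for j in np.argsort(-sims[i])[:k]]
            for i, title in enumerate(titles)}
```

No tag ontology is involved: adding a post changes its neighbors automatically, which is the point.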
[^1]: It also doesn't maximize information per tag - you would want exactly 50% of pages to have each tag for that - but this isn't a good goal.
[^2]: This is a general issue with search. A [paper from 2022](https://arxiv.org/abs/2212.10496) describes a funny workaround: use a high-powered LLM to imagine what the result looks like and then search for items similar to that. Slightly relatedly, [this](https://arxiv.org/abs/2204.10628) works by generating n-grams which should be in the result and searching for those.
[^3]: I think coloring for categorical information is underappreciated in modern designs, though it can be hard to design color palettes. I allocated tag colors manually here, but I don't know a nice way to color tags in bigger systems so that tags are mostly unambiguous. A co-occurrence approach like [word2vec](https://en.wikipedia.org/wiki/Word2vec) to give similar tags similar colors may be good, though it might be better to allocate them so that colors are reused only on tags which almost never appear together.
[^4]: In my case this is maths notes: it's ugly to put an entire topic into one long page, but wildly impractical and annoying to split every proof, corollary and note into a separate page.
[^5]: To some extent this is the case for blog posts: I have many minor things to say, but only publish them when they can fit into a particularly witty microblog snippet or form a coherent long post from them.
[^6]: I think big tech photo software does offer basic semantic search now, but with primitive small models.
[^7]: Also, the ranking has many degrees of freedom and would be hard to tune.
[^8]: I think it's because the engineering is at least nontrivial, and blog software is mostly settled by now, with new development often coming from "small web"/open source people who have been polarized against AI. I don't implement it because this website has only 40 blog posts.
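The color allocation suggested in footnote [^3] (reusing colors only on tags which almost never appear together) amounts to greedy coloring of the tag co-occurrence graph; a rough sketch, with all names invented:

```python
from collections import defaultdict

def color_tags(tagged_pages):
    """Greedy coloring of the tag co-occurrence graph: tags appearing on
    the same page never share a color, so a color index is only reused on
    tags which never occur together. A real system might instead threshold
    on co-occurrence counts ("almost never" rather than "never")."""
    cooccur = defaultdict(set)
    for tags in tagged_pages:
        for tag in tags:
            cooccur[tag] |= set(tags) - {tag}
    colors = {}
    # color the most-connected tags first so they get the low indices
    for tag in sorted(cooccur, key=lambda t: (-len(cooccur[t]), t)):
        used = {colors[n] for n in cooccur[tag] if n in colors}
        colors[tag] = next(c for c in range(len(colors) + 1) if c not in used)
    return colors
```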


@@ -4330,5 +4330,125 @@
"date": null,
"website": "GitHub",
"auto": true
},
"https://en.wikipedia.org/wiki/List_of_tallest_buildings_and_structures_in_London": {
"excerpt": "City of London skyline in 2024",
"title": "List of tallest buildings and structures in London",
"author": "Contributors to Wikimedia projects",
"date": "2003-02-07T14:52:09Z",
"website": "Wikimedia Foundation, Inc.",
"auto": true
},
"https://www.london.gov.uk/who-we-are/what-london-assembly-does/questions-mayor/find-an-answer/major-road-network": {
"excerpt": "Questions to the Mayor: Please provide details of the roads in London that form part of the Major Road Network (MRN). Please include the name of the road, the length of the road within London, which highway authority is responsible for the road and the percentage of the roads in London that form part of the MRN?",
"title": "Major Road Network",
"author": null,
"date": null,
"website": "London City Hall",
"auto": true
},
"https://www.london.gov.uk/who-we-are/what-london-assembly-does/questions-mayor/find-an-answer/average-distance-travelled-person-mode-london": {
"excerpt": "Questions to the Mayor: Using data showing travel in London by mode of transport, could you provide an estimate of the average distance in kilometres travelled per person by mode in each year since 2016?",
"title": "Average distance travelled by person per mode in London",
"author": null,
"date": null,
"website": "London City Hall",
"auto": true
},
"https://www.businessinsider.com/the-8-fastest-elevators-in-the-world-2013-1?op=1": {
"excerpt": "Check out the eight fastest elevators worldwide, showcasing engineering marvels that redefine vertical transportation.",
"title": "Asian Skyscrapers Dominate A New List Of The World's Fastest Elevators",
"author": "Megan Willett-Wei",
"date": "2013-01-23T17:07:46Z",
"website": "Business Insider",
"auto": true
},
"https://www.wired.com/story/thyssenkrupp-multi-maglev-elevator/": {
"excerpt": "ThyssenKrupp's Multi elevator that can travel horizontally, diagonally as well as vertically",
"title": "The Wonkavator is real: ThyssenKrupp unveils its maglev elevator that 'runs like the Tube'",
"author": "Bonnie Christian",
"date": "2017-06-22T06:00:01.000-04:00",
"website": "WIRED",
"auto": true
},
"https://vitalik.eth.limo/general/2023/04/14/traveltime.html": {
"excerpt": "Dark Mode Toggle",
"title": "Travel time ~= 750 * distance ^ 0.6",
"author": null,
"date": null,
"website": null,
"auto": true
},
"https://idlewords.com/talks/fan_is_a_tool_using_animal.htm": {
"excerpt": "In 1967, Gene Roddenberry launched a TV show that had a massive cultural impact. While it wasnt a hit during its original run, it kindled the imagination in a way few other television programs had. The story of an attractive, pan-ethnic crew roaming the galaxy, solving moral dilemmas in tight uniforms, had a powerful appeal.",
"title": "Fan Is A Tool-Using Animal—dConstruct Conference Talk",
"author": null,
"date": null,
"website": null,
"auto": true
},
"https://arxiv.org/abs/2212.10496": {
"excerpt": "While dense retrieval has been shown effective and efficient across tasks and languages, it remains difficult to create effective fully zero-shot dense retrieval systems when no relevance label is available. In this paper, we recognize the difficulty of zero-shot learning and encoding relevance. Instead, we propose to pivot through Hypothetical Document Embeddings~(HyDE). Given a query, HyDE first zero-shot instructs an instruction-following language model (e.g. InstructGPT) to generate a hypothetical document. The document captures relevance patterns but is unreal and may contain false details. Then, an unsupervised contrastively learned encoder~(e.g. Contriever) encodes the document into an embedding vector. This vector identifies a neighborhood in the corpus embedding space, where similar real documents are retrieved based on vector similarity. This second step ground the generated document to the actual corpus, with the encoder's dense bottleneck filtering out the incorrect details. Our experiments show that HyDE significantly outperforms the state-of-the-art unsupervised dense retriever Contriever and shows strong performance comparable to fine-tuned retrievers, across various tasks (e.g. web search, QA, fact verification) and languages~(e.g. sw, ko, ja).",
"title": "Precise Zero-Shot Dense Retrieval without Relevance Labels",
"author": "[Submitted on 20 Dec 2022]",
"date": null,
"website": "arXiv.org",
"auto": true
},
"https://arxiv.org/abs/2204.10628": {
"excerpt": "Knowledge-intensive language tasks require NLP systems to both provide the correct answer and retrieve supporting evidence for it in a given corpus. Autoregressive language models are emerging as the de-facto standard for generating answers, with newer and more powerful systems emerging at an astonishing pace. In this paper we argue that all this (and future) progress can be directly applied to the retrieval problem with minimal intervention to the models' architecture. Previous work has explored ways to partition the search space into hierarchical structures and retrieve documents by autoregressively generating their unique identifier. In this work we propose an alternative that doesn't force any structure in the search space: using all ngrams in a passage as its possible identifiers. This setup allows us to use an autoregressive model to generate and score distinctive ngrams, that are then mapped to full passages through an efficient data structure. Empirically, we show this not only outperforms prior autoregressive approaches but also leads to an average improvement of at least 10 points over more established retrieval solutions for passage-level retrieval on the KILT benchmark, establishing new state-of-the-art downstream performance on some datasets, while using a considerably lighter memory footprint than competing systems. Code and pre-trained models at https://github.com/facebookresearch/SEAL.",
"title": "Autoregressive Search Engines: Generating Substrings as Document Identifiers",
"author": "[Submitted on 22 Apr 2022]",
"date": null,
"website": "arXiv.org",
"auto": true
},
"https://github.com/amoffat/supertag": {
"excerpt": "A tag-based filesystem. Contribute to amoffat/supertag development by creating an account on GitHub.",
"title": "GitHub - amoffat/supertag: A tag-based filesystem",
"author": "amoffat",
"date": null,
"website": "GitHub",
"auto": true
},
"https://gwern.net/design": {
"excerpt": "Meta page describing Gwern.net, the self-documenting websites implementation and experiments for better semantic zoom of hypertext; technical decisions using Markdown and static hosting.",
"title": "Design Of This Website",
"author": "Gwern",
"date": null,
"website": null,
"auto": true
},
"https://en.wikipedia.org/wiki/Word2vec": {
"excerpt": "Word2vec is a technique in natural language processing (NLP) for obtaining vector representations of words. These vectors capture information about the meaning of the word based on the surrounding words. The word2vec algorithm estimates these representations by modeling text in a large corpus. Once trained, such a model can detect synonymous words or suggest additional words for a partial sentence. Word2vec was developed by Tomáš Mikolov, Kai Chen, Greg Corrado, Ilya Sutskever and Jeff Dean at Google, and published in 2013.[1][2]",
"title": "Word2vec",
"author": "Contributors to Wikimedia projects",
"date": "2015-08-14T22:22:48Z",
"website": "Wikimedia Foundation, Inc.",
"auto": true
},
"https://github.com/dpc/tagwiki": {
"excerpt": "A wiki in which you link to pages by specifing hashtags they contain. - dpc/tagwiki",
"title": "GitHub - dpc/tagwiki: A wiki in which you link to pages by specifing hashtags they contain.",
"author": "dpc",
"date": null,
"website": "GitHub",
"auto": true
},
"https://dynalist.io/": {
"excerpt": "Dynalist lets you organize your ideas and tasks in simple lists. It's powerful, yet easy to use. Try the live demo now, no need to sign up.",
"title": "Home - Dynalist",
"author": null,
"date": null,
"website": null,
"auto": true
},
"https://cameronharwick.com/writing/related-posts-in-wordpress-with-vector-embedding/": {
"excerpt": "Since my 10 year old related posts plugin cant even be downloaded anymore because of a security vulnerability, I figure its time to bring…",
"title": "Related Posts in WordPress with Vector Embedding",
"author": null,
"date": null,
"website": null,
"auto": true
}
}

package-lock.json generated

@@ -16,7 +16,7 @@
"@vscode/markdown-it-katex": "^1.1.0",
"axios": "^1.5.0",
"base64-js": "^1.5.1",
"better-sqlite3": "^11.5.0",
"better-sqlite3": "^11.10.0",
"binary-fuse-filter": "^1.0.0",
"chalk": "^4.1.0",
"css-select": "^5.1.0",
@@ -968,9 +968,9 @@
}
},
"node_modules/better-sqlite3": {
"version": "11.5.0",
"resolved": "https://registry.npmjs.org/better-sqlite3/-/better-sqlite3-11.5.0.tgz",
"integrity": "sha512-e/6eggfOutzoK0JWiU36jsisdWoHOfN9iWiW/SieKvb7SAa6aGNmBM/UKyp+/wWSXpLlWNN8tCPwoDNPhzUvuQ==",
"version": "11.10.0",
"resolved": "https://registry.npmjs.org/better-sqlite3/-/better-sqlite3-11.10.0.tgz",
"integrity": "sha512-EwhOpyXiOEL/lKzHz9AW1msWFNzGc/z+LzeB3/jnFJpxu+th2yqvzsSWas1v9jgs9+xiXJcD5A8CJxAG2TaghQ==",
"hasInstallScript": true,
"license": "MIT",
"dependencies": {
@@ -4595,9 +4595,9 @@
"integrity": "sha512-Wjss+Bc674ZABPr+SCKWTqA4V1pyYFhzDTjNBJy4jdmgOv0oGIGXeKBRJyINwP5tIy+iIZD9SfgZpztduzQ5QA=="
},
"better-sqlite3": {
"version": "11.5.0",
"resolved": "https://registry.npmjs.org/better-sqlite3/-/better-sqlite3-11.5.0.tgz",
"integrity": "sha512-e/6eggfOutzoK0JWiU36jsisdWoHOfN9iWiW/SieKvb7SAa6aGNmBM/UKyp+/wWSXpLlWNN8tCPwoDNPhzUvuQ==",
"version": "11.10.0",
"resolved": "https://registry.npmjs.org/better-sqlite3/-/better-sqlite3-11.10.0.tgz",
"integrity": "sha512-EwhOpyXiOEL/lKzHz9AW1msWFNzGc/z+LzeB3/jnFJpxu+th2yqvzsSWas1v9jgs9+xiXJcD5A8CJxAG2TaghQ==",
"requires": {
"bindings": "^1.5.0",
"prebuild-install": "^7.1.1"


@@ -11,7 +11,7 @@
"@vscode/markdown-it-katex": "^1.1.0",
"axios": "^1.5.0",
"base64-js": "^1.5.1",
"better-sqlite3": "^11.5.0",
"better-sqlite3": "^11.10.0",
"binary-fuse-filter": "^1.0.0",
"chalk": "^4.1.0",
"css-select": "^5.1.0",


@@ -673,5 +673,14 @@
},
"https://www.nvidia.com/content/dam/en-zz/Solutions/networking/ethernet-adapters/connectX-6-dx-datasheet.pdf": {
title: "NVIDIA ConnectX-6 DX Ethernet SmartNIC"
},
"https://content.knightfrank.com/research/2207/documents/en/nlaknight-frank-london-tall-buildings-survey-2021-7962.pdf": {
title: "London Tall Buildings Survey",
author: "Knight Frank"
},
"https://www.pnas.org/doi/10.1073/pnas.0610172104": {
title: "Growth, innovation, scaling, and the pace of life in cities",
date: "2007-04-24",
author: ["Luís M. A. Bettencourt", "José Lobo", "Dirk Helbing", "Geoffrey B. West"]
}
}


@@ -127,7 +127,7 @@ summary h1, summary h2
main
max-width: calc(100% - 2 * $content-margin)
width: $content-width
text-align: justify
text-align: left
margin-left: $content-margin
margin-right: $content-margin
&.fullwidth
@@ -204,8 +204,8 @@ button, select, input, textarea, .textarea
background: lightgray
border: 1px solid black
padding: 1em
margin-bottom: 0.5em
margin-top: 0.5em
margin-bottom: 16px
margin-top: 16px
box-sizing: border-box
> a
display: block
@@ -215,8 +215,8 @@ button, select, input, textarea, .textarea
margin: 0
blockquote
padding-left: 0.8rem
border-left: 0.4rem solid black
padding-left: 0.6rem
border-left: 6px solid black
margin-left: 0.2rem
.wider, pre
@@ -246,8 +246,6 @@ blockquote
margin: 0
.footnotes-sep
display: none
.footnotes-list
text-align: justify
@media (max-width: calc(4 * $content-margin + $content-width + $sidenotes-width))
// minwidth 1-pane layout
.sidenotes
@@ -342,7 +340,7 @@ $hl-border: 3px
color: white
blockquote
border-left: 0.4rem solid white
border-left: 6px solid white
button, select, input, textarea, .textarea
background: #333
@@ -514,3 +512,6 @@ textarea.squiggle
.smallcaps
font-variant: small-caps
font-size: 1.05em
.header
border-top: solid 6px var(--stripe, transparent)


@@ -52,7 +52,7 @@ html(lang="en")
block nav-items
.sidenotes-container
main(class=!haveSidenotes ? "fullwidth" : "")
.header
.header(style=accentColor ? `--stripe: ${accentColor}` : "")
h1.page-title= title
block under-title
if !internal