epigraphs and new post

This commit is contained in:
osmarks 2024-04-22 19:19:53 +01:00
parent 20488d93c8
commit f5be4dded8
19 changed files with 185 additions and 22 deletions

BIN  assets/images/cdf.png (new binary file, 51 KiB)

BIN  (filename not shown) (new binary file, 98 KiB)

BIN  assets/images/meme_eval.png (new binary file, 35 KiB)

BIN  assets/images/meme_roc.png (new binary file, 29 KiB)

BIN  (filename not shown) (new binary file, 1.7 MiB)

View File

@ -5,6 +5,10 @@ updated: 12/09/2023
description: Powerful search tools as externalized cognition, and how mine work.
slug: maghammer
---
::: epigraph attribution="Deus Ex"
The need to be observed and understood was once satisfied by God. Now we can implement the same functionality with data-mining algorithms.
:::
I have had this setup in various bits and pieces for a while, but some people expressed interest in its capabilities and apparently haven't built similar things and/or weren't aware of technologies in this space, so I thought I would run through what I mean by "personal data warehouse" and "externalized cognition" and why they're important, how my implementation works, and other similar work.
## What?

71
blog/meme-thresher.md Normal file
View File

@ -0,0 +1,71 @@
---
title: "MemeThresher: efficient semiautomated meme acquisition with AI"
description: Absurd technical solutions for problems which did not particularly need solving are one of life's greatest joys.
slug: memethresher
created: 22/04/2024
---
::: epigraph attribution=AI_WAIFU
I think what you need to do is spend a day in the data labeling mines.
:::
One common complaint about modern AI is that it takes away valuable aspects of the human experience and replaces them with impersonal engineering. I like impersonal engineering, and since 2021 (prior, in fact, to Meme Search Engine) I have been working on automatically determining whether memes are good, with a view to filtering the stream of memes from Reddit down to the good ones. Unfortunately, at the time I lacked the knowledge, computing power and good enough pretrained models to pull it off. Now I have at least some of those things, and it *works*. Somewhat.
Reddit is still the primary source for memes, or at least the kind of somewhat technical memes I like. Despite the [API changes](https://en.wikipedia.org/wiki/2023_Reddit_API_controversy) last year, Reddit still has, for my purposes, an excellently usable API compared to its competitors[^1] - some [very simple code](https://github.com/osmarks/meme-search-engine/blob/master/meme-rater/crawler.py) can crawl a [multireddit](https://support.reddithelp.com/hc/en-us/articles/360043043412-What-is-a-custom-feed-and-how-do-I-make-one) and download metadata and images from a custom set of subreddits. I gathered 12000 images from subreddits I knew contained some decent memes.
## Modelling
The core of MemeThresher is the meme rating model, which determines how good a meme is. Some of the people I talked to said this was "subjective", which is true, but this isn't really an issue - whether a meme aligns with my preferences is an objectively answerable question, if quite a hard one due to the complexity of those preferences and the difficulty of conveying them compactly. I consulted `#off-topic`[^2] members on how to do this, and got the suggestion of using a [Bradley-Terry model](https://en.wikipedia.org/wiki/Bradley%E2%80%93Terry_model). A Bradley-Terry model fits scalar scores to pairwise comparisons (strictly, to win probabilities; I just set these to 0.95 or 0.05 depending on which meme was preferred) - you've probably interacted with one in the form of the Elo rating system in chess. I used [SigLIP](https://twitter.com/giffmana/status/1707327094982672666) embeddings as the input to the model, as it's already deployed for Meme Search Engine and is scarily capable. I ensemble several models trained on the same data (in a different order and with different initialization) so that I can compute the variance of their predictions on the same pairs and find what the model is most uncertain about[^3].
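In outline, the scorer and loss look something like the following sketch (illustrative, not the actual `meme-rater` code; the embedding dimension, hidden size and ensemble size are assumptions):
```python
# Minimal sketch of a Bradley-Terry-style scorer on precomputed SigLIP embeddings.
import torch
from torch import nn

class Scorer(nn.Module):
    def __init__(self, dim=1152, hidden=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, emb):
        # emb: (batch, dim) image embeddings -> (batch,) scalar meme scores
        return self.mlp(emb).squeeze(-1)

def bradley_terry_loss(scorer, emb_a, emb_b, target):
    # target = P(A beats B), e.g. 0.95/0.05 for a human preference.
    # Under Bradley-Terry, P(A beats B) = sigmoid(score_A - score_B),
    # so this is just binary cross-entropy on the score difference.
    logits = scorer(emb_a) - scorer(emb_b)
    return nn.functional.binary_cross_entropy_with_logits(logits, target)

# Several independently initialized scorers, trained on reshuffled copies of
# the same pairs, form the ensemble whose disagreement estimates uncertainty.
ensemble = [Scorer() for _ in range(4)]
```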
## The data labelling mines
Even with a powerful pretrained model providing prior knowledge about the world and about memes, making this work at all needs a decent quantity of training data. I wrote a [custom script](https://github.com/osmarks/meme-search-engine/blob/master/meme-rater/rater_server.py) to allow rapidly comparing pairs of memes. This uses the existing Meme Search Engine infrastructure to rapidly load and embed images and has a simple keyboard interface for fast rating. It was slightly annoying, but I got about a thousand labelled meme pairs to start training. Perhaps less data would have worked, but it did not take that long (I think an hour or two) and there were enough issues as things stood.
## Staring at loss curves for several hours
It has been said that neural nets really want to learn - even if you make mistakes in the architecture or training code, a lot of the time they will train anyway, just less well than they should. However, they don't necessarily learn what you *want*. With just a thousand data points to train on, and a nontrivial [model](https://github.com/osmarks/meme-search-engine/blob/master/meme-rater/model.py) (an ensemble of <span class="hoverdefn" title="multi-layer perceptrons">MLPs</span>), it would immediately overfit after a single epoch. I tried various regularization schemes - increasing weight decay a lot, dropout, and even cutting the model down as far as linear regression - but these either resulted in the model underfitting or overfitting with a slightly stranger loss curve.
Fortunately, I had sort of planned for this. The existing pairs of memes I rated were randomly selected from the dataset, and were as a result often quite low-signal, telling the model things it "already knew" from other examples. None of this is particularly novel[^4]: active learning - inferring what data would be most useful to a model so you can gather and train on it - is widely studied. I used the least bad checkpoint to [find the highest-variance pairs](https://github.com/osmarks/meme-search-engine/blob/master/meme-rater/active_learning.py) out of a large random set, then manually rated those - they were given a separate label so I could place them at the end of training (this ultimately seemed to make the model worse) and keep a separate validation set (helpful). I also looked into [finding the highest-gradient pairs](https://github.com/osmarks/meme-search-engine/blob/master/meme-rater/al2.py), but it turns out this isn't what I want: they are in some sense the most "surprising", and the most surprising pairs were apparently cases where something already clear to the model was overturned by a new data point, not marginal cases it could nevertheless learn from.
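The variance-based selection itself is simple; roughly (again an illustrative sketch rather than the real `active_learning.py`, reusing the ensemble from the sketch above):
```python
# Score every candidate pair with every ensemble member and hand-label the
# pairs the ensemble disagrees about most.
import torch

def select_uncertain_pairs(ensemble, emb_a, emb_b, k=256):
    # emb_a, emb_b: (n_pairs, dim) embeddings of each side of the candidate pairs.
    with torch.no_grad():
        probs = torch.stack([
            torch.sigmoid(m(emb_a) - m(emb_b)) for m in ensemble
        ])  # (n_models, n_pairs) predicted P(A beats B)
    variance = probs.var(dim=0)      # per-pair disagreement across the ensemble
    return variance.topk(k).indices  # indices of the pairs to label next
```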
The new data seemingly made the model slightly less prone to overfitting, but, "excitingly", it would achieve the lowest validation loss on the early set and on the new set at different steps. I was not able to resolve this, so I just eyeballed the curve to find a checkpoint which struck a reasonable balance between them, ran another round of active learning and manual labelling, and trained on that. The new pairs were very hard for the model to learn at all, which makes sense - many of them were *very* marginal (I didn't include a tie option and didn't want to add one at a late stage, but it would perhaps have been appropriate). At some point I decided that I did not want to try and wrangle the model into working any longer and picked a good-enough checkpoint to use.
::: captioned src=/assets/images/losscurve_final.png
Ultimately, I picked step 3000 from this run.
:::
I ran a final evaluation on it: I scored my whole meme dataset with the model, took 25 memes at various percentiles of score, and manually determined how many were good enough for meme library inclusion. The result is not particularly desirable: ideally *all* the suitable memes would sit to the right of some score threshold and everything to the left of it would be bad. On the advice of my friend, I also plotted a ROC curve, as well as another plot I don't know a standard name for (the cumulative probability of a meme being good as a function of score), using a new dataset in which I manually considered 150 random memes individually.
::: captioned src=/assets/images/meme_eval.png
It is sort of annoying to see the culmination of three days of work as a boring `matplotlib` graph rather than something cooler.
:::
::: captioned src=/assets/images/meme_roc.png
AUROC 0.801.
:::
::: captioned src=/assets/images/cdf.png
I actually plotted this by accident while badly misunderstanding how ROC curves work.
:::
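For concreteness, curves like these can be produced with something along these lines (a sketch only; `scores` and `labels` are placeholders standing in for the model's scores and my manual labels on the 150-meme evaluation set, and the last plot is one plausible reading of the "no standard name" curve):
```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Placeholder data in lieu of the real scores and good/bad labels.
rng = np.random.default_rng(0)
scores = rng.normal(size=150)                             # model scores
labels = (scores + rng.normal(size=150) > 0).astype(int)  # 1 = good meme

fpr, tpr, thresholds = roc_curve(labels, scores)          # ROC curve points
print("AUROC:", roc_auc_score(labels, scores))

# Cumulative fraction of good memes among everything scoring at or below
# each threshold, as the threshold rises through the sorted scores.
order = np.argsort(scores)
cum_good = np.cumsum(labels[order]) / np.arange(1, len(labels) + 1)
```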
## Deployment
The ultimate purpose of the software and model is of course to find good memes with less effort on my part, not to produce dubiously good ROC curves. Handily, this was mostly doable with my existing code - Meme Search Engine to embed images in bulk, a slight modification of the dataset crawler to get new images, a [new frontend](https://github.com/osmarks/meme-search-engine/blob/master/meme-rater/library_processing_server.py) to allow me to select and label memes for library inclusion, and a [short script](https://github.com/osmarks/meme-search-engine/blob/master/meme-rater/meme_pipeline.py) to orchestrate it and run the scoring model.
## Improving the model
Since this is a blog post and not an actual research paper, I can say that I have no idea which parts of this were necessary or suboptimal and declare the model good enough without spending time running ablations, nor do I have to release my [folder](https://datasets.osmarks.net/projects-formerly-codenamed-radius-tyrian/) of poorly organized run logs and models[^5]. However, I do have a few ideas for interested readers which might improve its capabilities:
* Use a bigger model than mine and regularize it better.
* Do proper hyperparameter sweeps rather than the brief grid search I did midway through over learning rates and weight decay values.
* Use a better pretrained image encoder (I don't know if there *are* any, at least for this use case) or use the pre-pooling output rather than just the embeddings.
* Shorter-iteration-cycle active learning: I did this very coarsely, doing only two rounds with 256 images at once, but I think it should be practical to retrain the model and get a new set of high-signal pairs much more frequently, since the model is very small (~50M parameters). This might allow faster learning with fewer samples.
* Rather than training a model directly to predict scores and then winrates, train a single model to predict the winrate between a pair given both embeddings, then either use that directly or use it to generate synthetic data for a scalar-score model (see the sketch below).
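A minimal sketch of that pairwise variant (the architecture and sizes here are purely illustrative):
```python
# Predict P(A beats B) directly from both embeddings rather than assigning
# each meme an independent scalar score.
import torch
from torch import nn

class PairwiseRater(nn.Module):
    def __init__(self, dim=1152, hidden=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * dim, hidden), nn.GELU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, emb_a, emb_b):
        # Note: this isn't forced to be antisymmetric; averaging the
        # predictions for (A, B) and 1 - (B, A) is one way to fix that.
        logits = self.mlp(torch.cat([emb_a, emb_b], dim=-1)).squeeze(-1)
        return torch.sigmoid(logits)  # P(A beats B)
```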
Thanks to AI_WAIFU, RossM, Claude-3 Opus and LLaMA-3-70B for their help with dataset gathering methodology, architectural suggestions, plotting code/naming things and the web frontends respectively!
[^1]: This is more because theirs are awful than because Reddit's is especially good.
[^2]: You know which one.
[^3]: In retrospect, it would probably also have been fine to take pairs where the predicted winrate was about 0.5.
[^4]: Someone has, in fact, built a [similar tool](https://github.com/RossM/image-rater/tree/dev) which they used to find the best possible catgirl, but I think they do logistic regression (insufficiently powerful for my requirements) and understanding and rearranging their code to my requirements would have been slower than writing it myself.
[^5]: Actually, most research wouldn't do this because research norms are terrible.

View File

@ -4,6 +4,10 @@ description: The history of the feared note-taking application.
created: 06/06/2023
updated: 28/08/2023
---
::: epigraph attribution="Xe Iaso" link=https://xeiaso.net/blog/GraphicalEmoji/
This code is free as in mattress. If you decide to use it, it's your problem.
:::
If you've talked to me, you've probably heard of Minoteaur.
It was conceptualized in 2019, when I determined that it was a good idea to take notes on things in a structured way, looked at all existing software, and was (as usual, since all software is terrible; I will probably write about this at some point) vaguely unsatisfied by it.
I don't actually remember the exact details, since I don't have notes on this which I can find, but this was around the time when Roam Research leaked into the tech noösphere, and I was interested in and generally agreed with the ideas of graph-structured note-taking applications, with easy and fast flat organization.

View File

@ -5,6 +5,9 @@ created: 25/02/2024
updated: 14/04/2024
slug: mlrig
---
::: epigraph attribution=@jckarter link=https://twitter.com/jckarter/status/1441441401439358988
Programmers love talking about the “bare metal”, when in fact the logic board is composed primarily of plastics and silicon oxides.
:::
## Summary

View File

@ -5,7 +5,6 @@ slug: nemc
updated: 09/02/2020
created: 14/08/2018
---
Imagine that some politician said "Cars are an important part of our modern economy. Many jobs involve cars. Therefore, all students must learn how to build cars; we will be adding it to the curriculum."
This does (to me, anyway) seem pretty ridiculous on its own.
*Now* imagine that the students are being taught to make small car models out of card, whilst being told that this is actually how it's always done (this sort of falls apart, since with the car thing it's easy for the average person to tell; most people can't really distinguish these things with programming).
@ -30,7 +29,7 @@ I have an alternative list of things to teach which I think might actually be re
This doesn't really sound as fancy as teaching "the new literacy", but it seems like a better place to start for helping people be able to interact with modern computer systems.
## Update (09/02/2020 CE)
## Update (2020-02-09)
Having shown someone this post, they've suggested to me that Scratch is more about teaching some level of computational thinking-type skills - learning how to express intentions in a structured way and being precise/specific - than actually teaching *programming*, regardless of how it's marketed.
This does seem pretty sensible, actually.

View File

@ -5,6 +5,10 @@ created: 08/07/2021
updated: 19/08/2021
slug: osbill
---
::: epigraph attribution="Malcolm Turnbull"
The laws of Australia prevail in Australia, I can assure you of that. The laws of mathematics are very commendable, but the only law that applies in Australia is the law of Australia.
:::
I recently found out that the UK government aims to introduce the "[Online Safety Bill](https://www.gov.uk/government/publications/draft-online-safety-bill)" and read about it somewhat (no, I have not actually read much of the (draft) bill itself; it is 145 pages with 146 of impact assessments and 123 of explanatory notes, and so out of reach of all but very dedicated lawyers) and, as someone on the internet, it seems very harmful. This has already been detailed quite extensively and probably better than I [can](https://techcrunch.com/2021/05/12/uk-publishes-draft-online-safety-bill/) [manage](https://www.openrightsgroup.org/blog/access-denied-service-blocking-in-the-online-safety-bill/) [elsewhere](https://matrix.org/blog/2021/05/19/how-the-u-ks-online-safety-bill-threatens-matrix), so I'll just summarize my issues relatively quickly.
Firstly, it appears to impose an unreasonable amount of requirements on essentially every internet service (technically mine, too, due to the comments box!): risk-assessments, probably age verification (age is somewhat sensitive information which it would not be good for all websites to have to collect), fees for companies of some size (I think this is just set by OFCOM), and, more generally, removing "harmful content", on pain of being blocked/sanctioned/fined. Not *illegal* content, just "content that is harmful to children/adults" (as defined on page 50 or so). The bill is claimed to deal with the excesses of Facebook and other large companies, and they certainly have problems, but this affects much more than that (and doesn't seem to address their main problems (misaligned incentives with users causing optimization for engagement over all else, privacy violations, monopolistic behaviour) much).

View File

@ -5,6 +5,10 @@ created: 24/09/2023
updated: 15/01/2024
slug: opinion
---
::: epigraph attribution=pyrartha
If you aren't slandering those who disagree even slightly with you at every opportunity and talking about how you want to beat them up, do you even hold any political beliefs?
:::
This may sound strange coming from someone whose website contains things which are clearly [political opinions](/osbill/); I am being [hypocritical](https://www.overcomingbias.com/p/homo-hipocritushtml)/didn't notice/have updated my views since that/am writing hyperbolically or ironically to make a point/do not require myself to have self-consistent beliefs (select your favourite option). Regardless, I think that holding, forming and in various ways acting on political opinions is somewhere between unnecessary and significantly net harmful. I apologize in advance for not using concrete examples for anything in this post, but those would be political opinions.
## Importance, Tractability, Neglectedness

View File

@ -4,6 +4,10 @@ created: 02/07/2023
description: Why programming education isn't very good, and my thoughts on AI code generation.
slug: progedu
---
::: epigraph attribution="Randall Munroe" link=https://xkcd.com/2030/
Don't trust voting software and don't listen to anyone who tells you it's safe. I don't quite know how to put this, but our entire field is bad at what we do, and if you rely on us, everyone will die.
:::
It seems to be fairly well-known (or at least widely believed amongst the people I regularly talk to about this) that most people are not very good at writing code, even those who really "should" be because of having (theoretically) been taught to (see e.g. <https://web.archive.org/web/20150624150215/http://blog.codinghorror.com/why-cant-programmers-program/>). Why is this? In this article, I will describe my wild guesses.
General criticisms of formal education have [already been done](https://en.wikipedia.org/wiki/The_Case_Against_Education), probably better than I can manage to do. I was originally going to write about how the incentives of the system are not particularly concerned with testing people in accurate ways, but rather easy and standardizable ways, and the easiest and most standardizable ways are to ask about irrelevant surface details rather than testing skill. But this isn't actually true: automated testing of code to solve problems is scalable enough that things like [Project Euler](https://projecteuler.net/) and [Leetcode](https://leetcode.com/) can test vast numbers of people without human intervention, and it should generally be *less* effort to do this than to manually process written exams. It does seem to be the case that programming education tends to preferentially test bad proxies for actual skill, but the causality probably doesn't flow from testing methods.

View File

@ -4,6 +4,10 @@ description: RSAPI and the rest of my infrastructure.
created: 27/03/2024
slug: srsapi
---
::: epigraph attribution=@creatine_cycle link=https://twitter.com/creatine_cycle/status/1661455402033369088
Transhumanism is attractive until you have seen how software is built.
:::
The original [Site tech stack](/stack/) article (updated since release somewhat as hardware has improved and software been replaced) covers the basic workings of the public-facing website. However, I run *other* things, some of which are interesting to talk about! I have a number of services for personal use running on the same infrastructure, and several non-web-facing but public services. Here's the latest edition of the handy diagram I made in Graphviz:
<details>

View File

@ -4,6 +4,10 @@ description: Learn about how osmarks.net works internally! Spoiler warning if yo
created: 24/02/2022
updated: 11/05/2023
---
::: epigraph attribution="Rupert Goodwins"
If you can't stand the heat, get out of the server room.
:::
As you may know, osmarks.net is a website, served from computers which are believed to exist. But have you ever wondered exactly how it's all set up? If not, you may turn elsewhere and live in ignorance. Otherwise, continue reading.
Many similar personal sites are hosted on free static site services or various cloud platforms, but mine actually runs on a physical server. This was originally done because of my general distrust of SaaS/cloud platforms, a desire to learn about Linux administration, and a desire to run some non-web things, but now it's necessary to run the full range of weird components which have become important to the website. ~~The hardware has remained the same since early 2019, before I actually had a public site, apart from the addition of more disk capacity and a spare GPU for occasional machine learning workloads - I am using an old HP ML110 G7 tower server. Despite limited RAM and CPU power compared to contemporary rackmount models, it was cheap, has continued to work amazingly reliably, and is much more power-efficient than those would have been. It mostly only runs at about 5% CPU load and 2GB of RAM in use anyway, so it's not been an issue.~~ Due to the increasing compute demands of internal workloads, among other things, it has now been replaced with a custom build using a consumer Ryzen CPU. This has massively increased performance thanks to the CPU's much better IPC, clocks and core count, the 16x increase in RAM, and actually having an SSD[^2].

View File

@ -87,22 +87,48 @@ const renderContainer = (tokens, idx) => {
opening = false
}
const m = tokens[idx].info.trim().split(" ");
const blockType = m[0]
const blockType = m.shift()
const options = {}
for (const arg of m.slice(1)) {
const [k, v] = arg.split("=", 2)
options[k] = v ?? true
// Parse space-separated key=value options; values may be double-quoted
// to contain spaces, e.g. attribution="Deus Ex".
let inQuotes, k, v, arg = false
while (arg = m.shift()) {
const wasInQuotes = inQuotes
// A trailing quote closes the value currently being read.
const closesQuote = arg[arg.length - 1] == '"'
if (closesQuote) {
arg = arg.slice(0, -1)
inQuotes = false
}
if (wasInQuotes) {
// Continuation of a quoted value: append to the previous key's value.
options[k] += " " + arg
} else {
[k, v] = arg.split("=", 2)
if (v && v[0] == '"') {
// Opening quote: strip it; stay in quoted mode unless this token
// also closed the quote (a single-word quoted value).
inQuotes = !closesQuote
v = v.slice(1)
}
// Bare flags (no "=") are treated as boolean true.
options[k] = v ?? true
}
}
if (opening) {
if (blockType === "captioned") {
const link = `<a href="${md.utils.escapeHtml(options.src)}">`
return `<div class="${options.wide ? "caption wider" : "caption"}">${options.link ? link : ""}<img src="${md.utils.escapeHtml(options.src)}">${options.link ? "</a>" : ""}`
} else if (blockType === "epigraph") {
return `<div class="epigraph"><div>`
}
} else {
if (blockType === "captioned") {
return `</div>`
} else if (blockType === "epigraph") {
let ret = `</div></div>`
if (options.attribution) {
let inner = md.utils.escapeHtml(options.attribution)
if (options.link) {
inner = `<a href="${md.utils.escapeHtml(options.link)}">${inner}</a>`
}
ret = `<div class="attribution">${md.utils.escapeHtml("— ") + inner}</div>` + ret
}
return ret
}
}
throw new Error(`unrecognized blockType ${blockType}`)
@ -226,7 +252,7 @@ const processBlog = async () => {
}, processContent: renderMarkdown })
})
console.log(chalk.yellow(`${Object.keys(blog).length} blog entries`))
globalData.blog = addGuids(R.filter(x => !x.draft, R.sortBy(x => x.updated ? -x.updated.valueOf() : 0, R.values(blog))))
globalData.blog = addGuids(R.filter(x => !x.draft && !x.internal, R.sortBy(x => x.updated ? -x.updated.valueOf() : 0, R.values(blog))))
}
const processErrorPages = () => {
@ -296,7 +322,7 @@ const runOpenring = async () => {
const cached = readCache("openring", 60*60*1000)
if (cached) { globalData.openring = cached; return }
// wildly unsafe but only runs on input from me anyway
const arg = `./openring -n6 ${globalData.feeds.map(x => '-s "' + x + '"').join(" ")} < openring.html`
const arg = `./openring -n6 ${globalData.feeds.map(x => '-s "' + x + '"').join(" ")} < ./src/openring.html`
console.log(chalk.keyword("orange")("Openring:") + " " + arg)
const out = await util.promisify(childProcess.exec)(arg)
console.log(chalk.keyword("orange")("Openring:") + "\n" + out.stderr.trim())
@ -386,7 +412,7 @@ const doImages = async () => {
}
const avif = await writeFormat("avif", ".avif", "avifenc", ["-s", "0", "-q", "20"], " 2x")
const avifc = await writeFormat("avif-compact", ".c.avif", path.join(srcDir, "avif_compact.sh"), [])
const jpeg = await writeFormat("jpeg-scaled", ".jpg", "_fallback", "convert", ["-resize", "25%", "-format", "jpeg"])
const jpeg = await writeFormat("jpeg-scaled", ".jpg", "convert", ["-resize", "25%", "-format", "jpeg"])
globalData.images[stripped] = [
["image/avif", `${avifc}, ${avif} 2x`],
["_fallback", jpeg]

View File

@ -297,7 +297,7 @@ $hl-border: 3px
blockquote
border-left: 0.4rem solid white
.sidenotes img
.sidenotes img, .footnotes img
width: 100%
max-width: 15em
display: block
@ -316,3 +316,37 @@ table
padding: 0.4em
th
white-space: nowrap
.epigraph
position: relative
padding-left: 3rem
padding-right: 3rem
padding-top: 1rem
padding-bottom: 1rem
> div
font-style: italic
.attribution
text-indent: 2rem
text-align: right
font-style: normal
> div::after
content: "\201D"
right: 0rem
bottom: 0rem
> div::before
content: "\201C"
left: 0rem
top: 0rem
> div::before, > div::after
display: block
position: absolute
font-size: 2.5em
// TODO
#comments-wrapper textarea
width: calc(100% - 0.5em) !important

View File

@ -48,17 +48,18 @@ html(lang="en")
.header
h1.page-title= title
block under-title
h3.deemph
if updated
span= `Updated ${renderDate(updated)}`
if created || wordCount
span= " / "
if created
span= `Created ${renderDate(created)}`
if wordCount
span= " / "
if wordCount
span= `${metricPrefix(wordCount, "")} words`
if !internal
h3.deemph
if updated
span= `Updated ${renderDate(updated)}`
if created || wordCount
span= " / "
if created
span= `Created ${renderDate(created)}`
if wordCount
span= " / "
if wordCount
span= `${metricPrefix(wordCount, "")} words`
if description
em.description!= description
block content
@ -67,5 +68,6 @@ html(lang="en")
.sidenotes
if comments !== "off"
main(class=!haveSidenotes ? "fullwidth isso" : "isso")
main.isso
h2 Comments
section(id="comments-wrapper")