mirror of https://github.com/osmarks/website synced 2025-09-10 22:36:01 +00:00

emphasis blocks

osmarks
2025-01-24 15:17:41 +00:00
parent d44443289d
commit 9d9a78a950
4 changed files with 21 additions and 6 deletions


@@ -6,14 +6,16 @@ created: 24/01/2025
series_index: 3
slug: memescale
---
::: emphasis
Try the new search system [here](https://nooscope.osmarks.net/). I don't intend to replace the existing [Meme Search Engine](https://mse.osmarks.net/), as its more curated dataset is more useful to me for most applications.
:::
::: epigraph attribution="Brian Eno"
Be the first person to not do something that no one else has ever thought of not doing before.
:::
Computers are very fast. It is easy to forget this when they routinely behave so slowly, and now that many engineers are working on heavily abstracted cloud systems, but even my slightly outdated laptop is in principle capable of executing 15 billion instructions per core in each second it wastes stuttering and doing nothing in particular. People will sometimes talk about how their system has to serve "millions of requests a day", but a day is about 10<sup>5</sup> seconds, and the problem of serving tens of queries a second on much worse hardware than we have now was solved decades ago. The situation is even sillier for GPUs - every consumer GPU is roughly as fast as entire 1990s supercomputers[^1] and they mostly get used to shade triangles for games. In the spirit of [Production Twitter on One Machine](https://thume.ca/2023/01/02/one-machine-twitter/), [Command-line Tools can be 235x Faster than your Hadoop Cluster](https://adamdrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html) and projects like [Marginalia](https://search.marginalia.nu/), I have assembled what I believe to be a competitively sized image dataset and search system on my one ["server"](/stack/)[^2] by carefully avoiding work.
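To make the "millions of requests a day" arithmetic concrete, here is a minimal sketch that converts daily totals into average per-second rates. The function name and the sample daily totals are illustrative assumptions, not figures from any real system, and an average says nothing about peak load.

```python
# Back-of-the-envelope conversion from "requests per day" to "requests per second".
# The daily totals below are hypothetical examples, not measurements.

SECONDS_PER_DAY = 24 * 60 * 60  # 86,400, i.e. roughly 10^5


def requests_per_second(requests_per_day: float) -> float:
    """Average request rate implied by a daily total (ignores peaks)."""
    return requests_per_day / SECONDS_PER_DAY


if __name__ == "__main__":
    for daily in (1_000_000, 5_000_000, 10_000_000):
        rate = requests_per_second(daily)
        print(f"{daily:>10,} requests/day is about {rate:.1f} requests/s")
```

One million requests a day averages out to under 12 per second, and even ten million is a little over a hundred per second, which is the "tens of queries a second" regime described above.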
## Scraping
The concept for this project was developed in May, when I was pondering how to get more memes and a more general collection without the existing semimanual curation systems, particularly in order to provide open-domain image search. [MemeThresher](/memethresher/)'s crawler pulls from a small set of subreddits, and it seemed plausible that I could just switch it to `r/all`[^3] to get a decent sample of recent data. However, after their IPO and/or some manager realizing unreasonably late that people might be willing to pay for unstructured text data now, Reddit [does not want you](https://support.reddithelp.com/hc/en-us/articles/26410290525844-Public-Content-Policy) to scrape much, and this consistently cut off after a few thousand items. Conveniently, however, in the [words](https://www.reddit.com/r/reddit4researchers/comments/1co0mqa/our_plans_for_researchers_on_reddit/) of Reddit's CTO: