mirror of https://github.com/osmarks/website, synced 2025-04-10 12:26:39 +00:00
copyedit
This commit is contained in:
parent f583c2fd9a
commit 3ce96ec7d4
@@ -12,7 +12,7 @@ Keep loading, nothing here is hallowed <br />
Come see the world as it oughta be
:::
-[Heavenbanning](https://x.com/nearcyan/status/1532076277947330561) - [first proposed](https://news.ycombinator.com/item?id=25522518) as far back as late 2020, but popularized in 2022[^1] - is an alternative to shadowbanning (hiding users' social media comments from other people without telling them) in which users see only encouraging, sycophantic LLM-generated replies to their posts (which remain hidden from other humans). This, and similar technologies like the [Automated Persuasion Network](https://osmarks.net/stuff/apn.pdf), raise concerning questions about ethics, the role of stated preferences versus revealed preferences and the relevance of humans in the future. But for predicting adoption, the relevant question is not whether it's ethical - it's whether it's profitable.
+[Heavenbanning](https://x.com/nearcyan/status/1532076277947330561) - [first proposed](https://news.ycombinator.com/item?id=25522518) as far back as late 2020, but popularized in 2022[^1] - is an alternative to shadowbanning (hiding users' social media comments from other people without telling them) in which users see only encouraging, sycophantic LLM-generated replies to their posts (which remain hidden from other humans). This, and similar technologies like the [Automated Persuasion Network](https://osmarks.net/stuff/apn.pdf), raise concerning questions about ethics, the role of stated preferences versus revealed preferences and the relevance of humans in the future - questions other people have spilled many bits writing about already, and which I don't think will be the main reasons for any (lack of) adoption. To know when and how it will be used, what we need to know is when and how it will be profitable.
The purpose of heavenbanning, to the platform implementing it, is to mollify users who would otherwise be banned from the platform and leave, keeping their engagement up without disrupting other users' experience. An obvious question is whether shadowbanning - or slightly softer but functionally similar forms like downweighting in algorithmic recommenders - is common enough for this to matter. Most of the available academic research is qualitative surveys of how users feel about (thinking they are) being shadowbanned, and the quantitative data is still primarily self-reports, since social media platforms are opaque about their moderation. According to [a PDF](https://files.osf.io/v1/resources/xcz2t/providers/osfstorage/628662af52d1723f1080bc21?action=download&direct&version=1) describing a survey of 1000 users, the perceived shadowbanning rate is about 10%. [This paper](https://arxiv.org/abs/2012.05101) claims a shadowban rate of ~2% on Twitter (and a variety of slightly different shadowban mechanisms), based on scraping.
@@ -26,7 +26,7 @@ The [unsanctioned datasets distributed via BitTorrent](https://academictorrents.
This may be unintuitive, since "all the images" was, based on my early estimates, about 250 million. Assuming a (slightly pessimistic) 1MB per image, I certainly don't have 250TB of storage. Usable thumbnails would occupy perhaps 50kB each with the best available compression, which would have been very costly to apply, but 12TB is still more than I have free. The trick is that it wasn't necessary to store any of that[^4]: to do search, only the embedding vectors, occupying about 2kB each, are needed (as well as some metadata for practicality). Prior work like [img2dataset](https://github.com/rom1504/img2dataset) retained resized images for later embedding: I avoided this by implementing the entire system as a monolithic minimal-buffering pipeline going straight from URLs to image buffers to embeddings to a very large compressed blob on disk, with backpressure to clamp download speed to the rate necessary to feed the GPU.
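To make the pipeline shape concrete, here is a minimal sketch (not the actual `reddit_dump.rs` code; the channel sizes, the 512-dimensional embeddings and the placeholder stages are assumptions) of a bounded-channel pipeline in which a slow embedding stage throttles the downloaders, so full image buffers are dropped as soon as they are embedded and never need to be stored:

```rust
// Sketch only: a three-stage pipeline with backpressure via bounded channels.
// [dependencies] tokio = { version = "1", features = ["full"] }
use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    // Small channel bounds: when the embedding stage falls behind, `send`
    // blocks, which in turn slows the downloaders to match GPU throughput.
    let (img_tx, mut img_rx) = mpsc::channel::<Vec<u8>>(64);
    let (emb_tx, mut emb_rx) = mpsc::channel::<Vec<f32>>(64);

    // Stage 1: downloader (placeholder loop standing in for HTTP fetches).
    let downloader = tokio::spawn(async move {
        for _ in 0..1_000 {
            let image = vec![0u8; 50_000]; // stand-in for a fetched, resized image
            if img_tx.send(image).await.is_err() {
                break;
            }
        }
    });

    // Stage 2: embedding (stand-in for the batched GPU inference call).
    let embedder = tokio::spawn(async move {
        while let Some(image) = img_rx.recv().await {
            let embedding = vec![0.0f32; 512]; // ~2kB per image; dimension is an assumption
            drop(image); // the full image buffer dies here; only the embedding survives
            if emb_tx.send(embedding).await.is_err() {
                break;
            }
        }
    });

    // Stage 3: writer, which would append embeddings (plus metadata) to one
    // large compressed blob on disk; here it just counts them.
    let mut written = 0usize;
    while let Some(_embedding) = emb_rx.recv().await {
        written += 1;
    }
    let _ = tokio::join!(downloader, embedder);
    println!("wrote {written} embeddings");
}
```

Bounded channels give the backpressure for free: when the GPU stage or the writer stalls, every upstream `send` blocks, which is what clamps download speed to embedding throughput.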
-I spent a day or two [implementing](https://github.com/osmarks/meme-search-engine/blob/master/src/reddit_dump.rs) this, with a mode to randomly sample a small fraction of the images for initial testing. This revealed some bottlenecks - notably, the inference server was slower than it theoretically could be and substantially CPU-hungry - which I was able to partly fix by [hackily rewriting](https://github.com/osmarks/meme-search-engine/blob/master/aitemplate/model.py) the model using [AITemplate](https://github.com/facebookincubator/AITemplate). I had anticipated running close to network bandwidth limits, but with my GPU fully loaded and the inference server improved I only hit 200Mbps down at first; a surprising and more binding limit was the CPU-based image preprocessing code, which I "fixed" by compromising image quality very slightly. I also had to increase a lot of resource limits (file descriptors and local DNS caching) to handle the unreasonable amount of parallel downloads. This more or less worked, but more detailed calculations showed that I'd need a month of runtme and significant additional storage for a full run, and the electricity/SSD costs were nontrivial so the project was shelved.
+I spent a day or two [implementing](https://github.com/osmarks/meme-search-engine/blob/master/src/reddit_dump.rs) this, with a mode to randomly sample a small fraction of the images for initial testing. This revealed some bottlenecks - notably, the inference server was slower than it theoretically could be and substantially CPU-hungry - which I was able to partly fix by [hackily rewriting](https://github.com/osmarks/meme-search-engine/blob/master/aitemplate/model.py) the model using [AITemplate](https://github.com/facebookincubator/AITemplate). I had anticipated running close to network bandwidth limits, but with my GPU fully loaded and the inference server improved I only hit 200Mbps down at first; a surprising and more binding limit was the CPU-based image preprocessing code, which I "fixed" by compromising image quality very slightly. I also had to increase a lot of resource limits (file descriptors and local DNS caching) to handle the unreasonable amount of parallel downloads. This more or less worked, but more detailed calculations showed that I'd need a month of runtime and significant additional storage for a full run, and the electricity/SSD costs were nontrivial so the project was shelved.
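The post does not show how those limits were raised; as one hypothetical approach, a process can lift its own file descriptor cap at startup instead of relying on the shell's `ulimit`, as in this sketch using the `libc` crate (the helper name is illustrative):

```rust
// Hypothetical helper (not from the post): raise this process's soft
// RLIMIT_NOFILE to the hard limit so tens of thousands of concurrent
// downloads don't fail with EMFILE.
fn raise_fd_limit() -> std::io::Result<libc::rlim_t> {
    let mut lim = libc::rlimit { rlim_cur: 0, rlim_max: 0 };
    // SAFETY: plain libc calls reading/writing a valid stack-allocated struct.
    if unsafe { libc::getrlimit(libc::RLIMIT_NOFILE, &mut lim) } != 0 {
        return Err(std::io::Error::last_os_error());
    }
    lim.rlim_cur = lim.rlim_max; // an unprivileged process may go up to the hard limit
    if unsafe { libc::setrlimit(libc::RLIMIT_NOFILE, &lim) } != 0 {
        return Err(std::io::Error::last_os_error());
    }
    Ok(lim.rlim_cur)
}
```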
Recently, some reprioritization and requiring a lot of additional storage anyway resulted in me resurrecting the project from the archives. I had to make a few final tweaks to integrate it with the metrics system, reduce network traffic by making it ignore probably-non-image URLs earlier, log some data I was missing and (slightly) handle links to things like Imgur galleries. After an early issue with miswritten concurrency code leading to records being read in the wrong order such that it would not correctly recover from a restart, it ran very smoothly for a few days. There were, however, several unexplained discontinuities in the metrics, as well as some gradual changes over time which resulted in me using far too much CPU time. I had to actually think about optimization.
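As a sketch of that ordering problem (assuming, as the text implies but does not state, that restart recovery works by skipping the prefix of the input dump already processed): with concurrent workers, records finish out of order, so the value that is safe to persist is the highest contiguous index completed rather than a plain count of completions.

```rust
// Illustrative only: track the highest contiguous record index completed, so a
// restart can safely skip exactly that prefix of the input.
use std::collections::BTreeSet;

struct ResumePoint {
    next_expected: u64,               // every record below this index is done
    done_out_of_order: BTreeSet<u64>, // records finished ahead of the watermark
}

impl ResumePoint {
    fn new(resume_from: u64) -> Self {
        Self { next_expected: resume_from, done_out_of_order: BTreeSet::new() }
    }

    /// Record that `idx` finished; returns the value safe to persist as the
    /// resume point. Persisting a plain completion count instead would skip
    /// unprocessed records after a restart whenever completions were reordered.
    fn complete(&mut self, idx: u64) -> u64 {
        self.done_out_of_order.insert(idx);
        while self.done_out_of_order.remove(&self.next_expected) {
            self.next_expected += 1;
        }
        self.next_expected
    }
}
```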
@@ -54,7 +54,7 @@ block content
each entry in openring
!= entry
-//iframe(src="https://george.gh0.pw/embed.cgi?gollark", style="border:none;width:100%;height:50px", title="Acquiesce to GEORGE.")
+iframe(src="https://george.gh0.pw/embed.cgi?gollark", style="border:none;width:100%;height:50px", title="Acquiesce to GEORGE.")
block under-title
h2= name