From 7fc08caf214dad177b0b80d30ccc420387538d4d Mon Sep 17 00:00:00 2001
From: osmarks <osmarks@mycorrhiza>
Date: Sat, 8 Mar 2025 22:42:39 +0000
Subject: [PATCH] =?UTF-8?q?Edit=20=E2=80=98osmarks.net=5Fweb=5Fsearch=5Fpl?=
 =?UTF-8?q?an=5F(secret)=E2=80=99?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 osmarks.net_web_search_plan_(secret).myco | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/osmarks.net_web_search_plan_(secret).myco b/osmarks.net_web_search_plan_(secret).myco
index 685f3f5..3fd6010 100644
--- a/osmarks.net_web_search_plan_(secret).myco
+++ b/osmarks.net_web_search_plan_(secret).myco
@@ -11,7 +11,9 @@ The job of a search engine is to retrieve useful information for users. This is
 = Information sources
 
 * Anna's Archive is ~0.5PB. This contains a substantial fraction of books and papers. These are plausibly higher-quality than the general internet.
-* We do need general internet data for breadth of knowledge etc. This runs to PB (Common Crawl etc). Apparently billions of pages per month.
+* {We do need general internet data for breadth of knowledge etc. This runs to PB (Common Crawl etc.). Common Crawl apparently captures billions of pages per month.
+* Common Crawl doesn't even get PDFs because they're complicated to process! We need those.
+}
 * {There is lots of alpha in weird corners of Twitter and also Discord. It would be useful to scrape these, though people would complain.
 * Also IRC, but that's logged worse.
 }