From 0726eb9ea755f1390d4a63f678e9579ca8b16f62 Mon Sep 17 00:00:00 2001
From: osmarks
Date: Fri, 7 Mar 2025 14:45:49 +0000
Subject: [PATCH] =?UTF-8?q?Edit=20=E2=80=98osmarks.net=5Fweb=5Fsearch=5Fpl?=
 =?UTF-8?q?an=5F(secret)=E2=80=99?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 osmarks.net_web_search_plan_(secret).myco | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/osmarks.net_web_search_plan_(secret).myco b/osmarks.net_web_search_plan_(secret).myco
index 9d77870..6e0489a 100644
--- a/osmarks.net_web_search_plan_(secret).myco
+++ b/osmarks.net_web_search_plan_(secret).myco
@@ -12,7 +12,9 @@ The job of a search engine is to retrieve useful information for users. This is
 
 * Anna's Archive is ~0.5PB. This contains a substantial fraction of books and papers. These are plausibly higher-quality than the general internet.
 * We do need general internet data for breadth of knowledge etc. This runs to PB (Common Crawl etc). Apparently billions of pages per month.
-* There is lots of alpha in weird corners of Twitter and also Discord. It would be useful to scrape these, though people would complain.
+* {There is lots of alpha in weird corners of Twitter and also Discord. It would be useful to scrape these, though people would complain.
+* Also IRC, but that's logged worse.
+}
 * Images, PDFs, etc contain useful knowledge which hasn't been integrated properly into most things. We need* these.
 
 = Indexing