From 7c0349ecc0c1496eb23f5a6475b469b9d02ec554 Mon Sep 17 00:00:00 2001
From: osmarks
Date: Tue, 18 Mar 2025 12:46:46 +0000
Subject: [PATCH] =?UTF-8?q?Edit=20=E2=80=98osmarks.net=5Fweb=5Fsearch=5Fpl?= =?UTF-8?q?an=5F(secret)=E2=80=99?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 osmarks.net_web_search_plan_(secret).myco | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/osmarks.net_web_search_plan_(secret).myco b/osmarks.net_web_search_plan_(secret).myco
index 6abd190..d825df4 100644
--- a/osmarks.net_web_search_plan_(secret).myco
+++ b/osmarks.net_web_search_plan_(secret).myco
@@ -11,7 +11,9 @@ The job of a search engine is to retrieve useful information for users. This is
 
 = Information sources
 * Anna's Archive is ~0.5PB. This contains a substantial fraction of books and papers. These are plausibly higher-quality than the general internet.
-* We do need general internet data for breadth of knowledge etc. This runs to PB (Common Crawl etc). Apparently billions of pages per month.
+* {We do need general internet data for breadth of knowledge etc. This runs to PB (Common Crawl etc). Apparently billions of pages per month.
+* It is possible that scraping can't be done by new entrants. Much of the web is useless so this is "fine", but Reddit still has some knowledge to it, as do obscure blogs.
+}
 * {There is lots of alpha in weird corners of Twitter and also Discord. It would be useful to scrape these, though people would complain.
 * Also IRC, but that's logged worse.
 }