diff --git a/osmarks.net_web_search_plan_(secret).myco b/osmarks.net_web_search_plan_(secret).myco
index 685f3f5..3fd6010 100644
--- a/osmarks.net_web_search_plan_(secret).myco
+++ b/osmarks.net_web_search_plan_(secret).myco
@@ -11,7 +11,9 @@ The job of a search engine is to retrieve useful information for users. This is
 
 = Information sources
 * Anna's Archive is ~0.5PB. This contains a substantial fraction of books and papers. These are plausibly higher-quality than the general internet.
-* We do need general internet data for breadth of knowledge etc. This runs to PB (Common Crawl etc). Apparently billions of pages per month.
+* {We do need general internet data for breadth of knowledge etc. This runs to PB (Common Crawl etc). Apparently billions of pages per month.
+* Common Crawl doesn't even get PDFs because they're complicated to process! We need those.
+}
 * {There is lots of alpha in weird corners of Twitter and also Discord. It would be useful to scrape these, though people would complain.
 * Also IRC, but that's logged worse.
 }