Edit ‘osmarks.net_web_search_plan_(secret)’

2025-03-08 22:42:39 +00:00
parent 7106a5edc9
commit 7fc08caf21
1 changed file with 3 additions and 1 deletion
@@ -11,7 +11,9 @@ The job of a search engine is to retrieve useful information for users. This is
 = Information sources

 * Anna's Archive is ~0.5PB. This contains a substantial fraction of books and papers. These are plausibly higher-quality than the general internet.
-* We do need general internet data for breadth of knowledge etc. This runs to PB (Common Crawl etc). Apparently billions of pages per month.
+* {We do need general internet data for breadth of knowledge etc. This runs to PB (Common Crawl etc). Apparently billions of pages per month.
+* Common Crawl doesn't even get PDFs because they're complicated to process! We need those.
+}
 * {There is lots of alpha in weird corners of Twitter and also Discord. It would be useful to scrape these, though people would complain.
 * Also IRC, but that's logged worse.
 }