Edit ‘osmarks.net_web_search_plan’

osmarks
2025-04-30 22:52:41 +00:00
committed by wikimind
parent cabc6ab1b7
commit 03cecf841d

@@ -10,7 +10,7 @@ The job of a search engine is to retrieve useful information for users. This is
= Information sources
-* Anna's Archive is ~0.5PB. This contains a substantial fraction of books and papers. These are plausibly higher-quality than the general internet.
+* Anna's Archive is ~0.5PB (unclear). This contains a substantial fraction of books and papers. These are plausibly higher-quality than the general internet.
* {We do need general internet data for breadth of knowledge etc. This runs to PB (Common Crawl etc). Apparently billions of pages per month.
* It is possible that scraping can't be done by new entrants. Much of the web is useless so this is "fine", but Reddit still has some knowledge to it, as do obscure blogs.
}