Edit ‘osmarks.net_web_search_plan’
@@ -10,7 +10,7 @@ The job of a search engine is to retrieve useful information for users. This is
= Information sources
-* Anna's Archive is ~0.5PB. This contains a substantial fraction of books and papers. These are plausibly higher-quality than the general internet.
+* Anna's Archive is ~0.5PB (unclear). This contains a substantial fraction of books and papers. These are plausibly higher-quality than the general internet.
* {We do need general internet data for breadth of knowledge etc. This runs to PB (Common Crawl etc). Apparently billions of pages per month.
* It is possible that scraping can't be done by new entrants. Much of the web is useless so this is "fine", but Reddit still has some knowledge to it, as do obscure blogs.
}