Edit ‘osmarks.net_web_search_plan_(secret)’

osmarks 2025-03-08 22:44:03 +00:00 committed by wikimind
parent 7fc08caf21
commit d843150ac7

@ -11,13 +11,13 @@ The job of a search engine is to retrieve useful information for users. This is
= Information sources
* Anna's Archive is ~0.5PB. This contains a substantial fraction of books and papers. These are plausibly higher-quality than the general internet.
* {We do need general internet data for breadth of knowledge etc. This runs to PB (Common Crawl etc). Apparently billions of pages per month.
* Common Crawl doesn't even get PDFs because they're complicated to process! We need those.
}
* We do need general internet data for breadth of knowledge etc. This runs to PB (Common Crawl etc). Apparently billions of pages per month.
* {There is lots of alpha in weird corners of Twitter and also Discord. It would be useful to scrape these, though people would complain.
* Also IRC, but that's logged worse.
}
* Images, PDFs, etc contain useful knowledge which hasn't been integrated properly into most things. We need* these.
* {Images, PDFs, etc contain useful knowledge which hasn't been integrated properly into most things. We need* these.
* Common Crawl doesn't even get PDFs because they're complicated to process!
}
= Indexing