Edit ‘osmarks.net_web_search_plan_(secret)’

2025-03-10 08:57:43 +00:00
parent 9c9c94e4ac
commit dd4164fb8c
1 changed files with 1 additions and 1 deletions
--- a/osmarks.net_web_search_plan_(secret).myco
+++ b/osmarks.net_web_search_plan_(secret).myco
@@ -19,6 +19,7 @@ The job of a search engine is to retrieve useful information for users. This is
 * Common Crawl doesn't even get PDFs because they're complicated to process!
 * Obscure papers, product user manuals, shiny reports from organizations.
 }
+* So much tacit knowledge is in videos. Oh no. Maybe we can get away with an autotranscriber and frame extraction.

 = Indexing

@@ -31,7 +32,6 @@ The job of a search engine is to retrieve useful information for users. This is
 * "Links" aren't actually trivial. Would need to do substantial work to e.g. find reference targets in poorly digitized papers.
 * We need OCR to understand PDFs and images properly (even with a native multimodal encoder OCR is probably necessary for training). For some reason there are no good open-source solutions. This could maybe be fixed with a synthetic data approach (generate corrupted documents, train on those).
 * "Documents" can be quite long and we want to be able to find things in e.g. a book with ~paragraph granularity whilst still understanding the context of the book. Consider [[https://arxiv.org/abs/2004.12832]], hierarchical systems? It would be somewhat cursed, but could index entire book as one vector then postprocess-select paragraphs.
-* So much tacit knowledge is in videos. Oh no. Maybe we can get away with an autotranscriber and frame extraction.

 = Filtering