diff --git a/osmarks.net_web_search_plan.myco b/osmarks.net_web_search_plan.myco
index 02f0f1c..6ca5221 100644
--- a/osmarks.net_web_search_plan.myco
+++ b/osmarks.net_web_search_plan.myco
@@ -36,6 +36,7 @@ The job of a search engine is to retrieve useful information for users. This is
 * We need OCR to understand PDFs and images properly (even with a native multimodal encoder OCR is probably necessary for traning). For some reason there are no good open-source solutions. This could maybe be fixed with a synthetic data approach (generate corrupted documents, train on those).
 * {"Documents" can be quite long and we want to be able to find things in e.g. a book with ~paragraph granularity whilst still understanding the context of the book. Consider [[https://arxiv.org/abs/2004.12832]], hierarchical systems? It would be somewhat cursed, but could index entire book as one vector then postprocess-select paragraphs.
 * (Modern)ColBERT late interaction w/ pooling.
+* https://gwern.net/tree-embedding
 }
 
 = Filtering