From cb0904a74d3f848bd924ae265ad3e8948f13587d Mon Sep 17 00:00:00 2001
From: osmarks
Date: Mon, 5 May 2025 17:09:34 +0000
Subject: [PATCH] =?UTF-8?q?Edit=20=E2=80=98osmarks.net=5Fweb=5Fsearch=5Fpl?=
 =?UTF-8?q?an=E2=80=99?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 osmarks.net_web_search_plan.myco | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/osmarks.net_web_search_plan.myco b/osmarks.net_web_search_plan.myco
index de9ce96..b9b1383 100644
--- a/osmarks.net_web_search_plan.myco
+++ b/osmarks.net_web_search_plan.myco
@@ -33,7 +33,9 @@ The job of a search engine is to retrieve useful information for users. This is
 }
 * "Links" aren't actually trivial. Would need to do substantial work to e.g. find reference targets in poorly digitized papers.
 * We need OCR to understand PDFs and images properly (even with a native multimodal encoder OCR is probably necessary for traning). For some reason there are no good open-source solutions. This could maybe be fixed with a synthetic data approach (generate corrupted documents, train on those).
-* "Documents" can be quite long and we want to be able to find things in e.g. a book with ~paragraph granularity whilst still understanding the context of the book. Consider [[https://arxiv.org/abs/2004.12832]], hierarchical systems? It would be somewhat cursed, but could index entire book as one vector then postprocess-select paragraphs.
+* {"Documents" can be quite long and we want to be able to find things in e.g. a book with ~paragraph granularity whilst still understanding the context of the book. Consider [[https://arxiv.org/abs/2004.12832]], hierarchical systems? It would be somewhat cursed, but could index entire book as one vector then postprocess-select paragraphs.
+* (Modern)ColBERT late interaction w/ pooling.
+}

 = Filtering
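
Reviewer note on the added "late interaction w/ pooling" bullet: a minimal sketch of what this might look like, not the plan's actual design. It assumes ColBERT-style MaxSim scoring (each query token takes its best-matching document vector and the maxima are summed) and, since the note does not say which pooling is meant, substitutes simple fixed-window mean pooling to cut the number of stored vectors per paragraph. The function names, the `window` parameter and the toy 2-D vectors are all hypothetical; real embeddings would come from an encoder.

```python
import math


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def pool_tokens(token_vecs, window):
    """Mean-pool consecutive groups of `window` token vectors, then renormalize.

    Assumption: fixed windows stand in for whatever pooling the plan intends.
    """
    pooled = []
    for start in range(0, len(token_vecs), window):
        group = token_vecs[start:start + window]
        mean = [sum(col) / len(group) for col in zip(*group)]
        pooled.append(normalize(mean))
    return pooled


def maxsim_score(query_vecs, doc_vecs):
    """Late interaction: sum over query tokens of the max similarity to any doc vector."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Toy 2-D "token embeddings" (hypothetical data, purely for illustration).
query = [normalize([1.0, 0.0]), normalize([0.0, 1.0])]
paragraph_a = [normalize(v) for v in [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]]
paragraph_b = [normalize(v) for v in [[1.0, 0.0], [1.0, 0.1], [0.9, 0.0], [1.0, 0.2]]]

# Halve the number of stored vectors per paragraph, then score.
pooled_a = pool_tokens(paragraph_a, window=2)
pooled_b = pool_tokens(paragraph_b, window=2)
score_a = maxsim_score(query, pooled_a)
score_b = maxsim_score(query, pooled_b)
```

Here `paragraph_a` covers both query directions while `paragraph_b` covers only one, so `score_a` comes out higher even after pooling. Whether pooling preserves enough signal at book scale is an open question this sketch does not answer.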