Edit ‘osmarks.net_web_search_plan’

This commit is contained in:
osmarks
2025-05-05 17:09:34 +00:00
committed by wikimind
parent 06edf9ab3a
commit cb0904a74d

@@ -33,7 +33,9 @@ The job of a search engine is to retrieve useful information for users. This is
}
* "Links" aren't actually trivial. Would need to do substantial work to e.g. find reference targets in poorly digitized papers.
* We need OCR to understand PDFs and images properly (even with a native multimodal encoder, OCR is probably necessary for training). For some reason there are no good open-source solutions. This could maybe be fixed with a synthetic data approach (generate corrupted documents, train on those).
* "Documents" can be quite long and we want to be able to find things in e.g. a book with ~paragraph granularity whilst still understanding the context of the book. Consider [[https://arxiv.org/abs/2004.12832]], hierarchical systems? It would be somewhat cursed, but could index entire book as one vector then postprocess-select paragraphs.
* {"Documents" can be quite long and we want to be able to find things in e.g. a book with ~paragraph granularity whilst still understanding the context of the book. Consider [[https://arxiv.org/abs/2004.12832]], hierarchical systems? It would be somewhat cursed, but could index entire book as one vector then postprocess-select paragraphs.
* (Modern)ColBERT late interaction w/ pooling.
}
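A rough sketch of what "late interaction w/ pooling" could look like, assuming each document is already encoded as a matrix of L2-normalised per-token vectors. The function names (`normalize`, `maxsim_score`, `pool_tokens`) and the fixed-size mean pooling are illustrative assumptions, not anything this plan commits to; real token pooling would more likely cluster similar token vectors than average consecutive ones.

```python
import numpy as np


def normalize(x):
    # L2-normalise along the last axis so dot products are cosine similarities.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def maxsim_score(query_vecs, doc_vecs):
    # ColBERT-style late interaction: for each query token, take its best
    # match among the document's token vectors, then sum over query tokens.
    # query_vecs: (q, d), doc_vecs: (n, d), both assumed L2-normalised.
    sims = query_vecs @ doc_vecs.T  # (q, n) cosine similarities
    return float(sims.max(axis=1).sum())


def pool_tokens(doc_vecs, factor):
    # Hypothetical pooling step: mean-pool consecutive runs of `factor`
    # token vectors and renormalise, shrinking the index by roughly `factor`x.
    chunks = [
        doc_vecs[i:i + factor].mean(axis=0)
        for i in range(0, len(doc_vecs), factor)
    ]
    return normalize(np.stack(chunks))
```

For paragraph-granularity retrieval within a book, one option under the same assumptions is to score each paragraph's pooled token matrix separately with `maxsim_score` and keep the best-scoring paragraphs.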
= Filtering