--- title: "Maghammer: My personal data warehouse" created: 28/08/2023 updated: 12/09/2023 description: Powerful search tools as externalized cognition, and how mine work. slug: maghammer --- I have had this setup in various bits and pieces for a while, but some people expressed interest in its capabilities and apparently haven't built similar things and/or weren't aware of technologies in this space, so I thought I would run through what I mean by "personal data warehouse" and "externalized cognition" and why they're important, how my implementation works, and other similar work. ## What? Firstly, "personal data warehouse". There are a lot of names and a lot of implementations, but the general idea is a system that I can use to centrally store and query personally relevant data from various sources. Mine is mostly focused on text search but is configured so that it can (and does, though not as much) work with other things. Proprietary OSes and cloud platforms are now trying to offer this sort of thing, but not very hard. My implementation runs locally on my [server](/stack/), importing data from various sources and building full-text indices. Here are some other notable ones: * [Stephen Wolfram's personal analytics](https://writings.stephenwolfram.com/2012/03/the-personal-analytics-of-my-life/) - he doesn't describe much of the implementation but does have an impressively wide range of data. * [Dogsheep](https://dogsheep.github.io/) - inspired by Wolfram (the name is a pun), and the basis for a lot of Maghammer. * [Recoll](https://www.lesbonscomptes.com/recoll/pages/index-recoll.html) - a powerful file indexer I also used in the past. * [Rewind](https://www.rewind.ai/) - a shinier more modern commercial tool (specifically for MacOS...) based on the somewhat weird approach of constantly recording audio and screen content. * [Monocle](https://thesephist.com/posts/monocle/) - built, apparently, to learn a new programming language, but it seems like it works well enough. You'll note that not all of these projects make any attempt to work on non-text data, which is a reasonable choice, since these concerns are somewhat separable. I personally care about handling my quantitative data too, especially since some of it comes from the same sources, and designed accordingly. ## Why? Why do I want this? Because human memory is very, very bad. My (declarative) memory is much better than average, but falls very far short of recording everything I read and hear, or even just the source of it[^1]. According to [Landauer, 1986](https://onlinelibrary.wiley.com/doi/pdf/10.1207/s15516709cog1004_4)'s estimates, the amount of retrievable information accumulated by a person over a lifetime is less than a gigabyte, or <0.05% of my server's disk space[^5]. There's also distortion in remembered material which is hard to correct for. Information is simplified in ways that lose detail, reframed or just changed as your other beliefs change, merged with other memories, or edited for social reasons. Throughout human history, even before writing, the solution to this has been externalization of cognitive processing: other tiers in the memory hierarchy with more capacity and worse performance. While it would obviously be [advantageous](/rote/) to be able to remember everything directly, just as it would be great to have arbitrarily large amounts of fast SRAM to feed our CPUs, tradeoffs are forced by reality. 
Oral tradition and culture were the first implementations, shifting information from one unreliable human mind to several so that there was at least some redundancy. Writing made for greater robustness, but the slowness of writing and copying (and, for a long time, the expense of hardware) was limiting. Printing allowed mass dissemination of media but didn't make recording much easier for the individual. Now, the ridiculous and mostly underexploited power of contemporary computers makes it possible to literally record (and search) everything you ever read at trivial cost, as well as making lookups fast enough to integrate them more tightly into workflows. Roam Research popularized the idea of notes as a "second brain"[^2], but it's usually the case that the things you want to know are not ones you thought to explicitly write down and organize.

More concretely, I frequently read interesting papers or blog posts or articles which I later remember in some other context - perhaps they came up in a conversation and I wanted to send someone a link, or a new project needs a technology I recall there being good content on. Without good archiving, I would have to remember exactly where I saw it (implausible) or use a standard, public search engine and hope it will actually pull up the document I need. Maghammer (mostly) stores these and allows me to find them in a few seconds (fast enough for interactive online conversations, and not that much slower than Firefox's omnibox history search) as long as I can remember enough keywords. It's also nice to be able to conveniently find old shell commands for strange things I had to do in the past, or look up sections in books (though my current implementation isn't ideal for this).

## How?

I've gone through a lot of implementations, but they are all based on the general principle of avoiding excessive effort by reusing existing tools where practical and focusing on the most important functionality over minor details. Initially, I just archived browser history with [a custom script](https://github.com/osmarks/random-stuff/blob/master/histretention.py) and stored [SingleFile](https://addons.mozilla.org/en-US/firefox/addon/single-file/) HTML pages and documents, with the expectation that I would set up search other than `grep` later. I did in fact eventually (November 2021) set up Recoll (indexing) and [Recoll WE](https://www.lesbonscomptes.com/recoll/faqsandhowtos/IndexWebHistory) (to store all pages rather than just selected ones - or, I suppose, all of the ones which aren't rendered purely by client-side logic), and they continued to work decently for some time. As usually happens with software, I got dissatisfied with it for various somewhat arbitrary reasons and prototyped rewrites. These were not really complete enough to go anywhere (some of them reimplemented an entire search engine for no particular reason, one worked okay but would have been irritating to design a UI for, one works for the limited scope of indexing Calibre but doesn't do anything else), so I continued to use Recoll until March 2023, when I found [Datasette](https://datasette.io/) and [the author's work on personal search engines](https://datasette.substack.com/p/dogsheep-personal-analytics-with) and realized that this was probably the most viable path to a more personalized system.
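The core of that approach is pleasantly boring: importer scripts write rows into SQLite tables, SQLite's built-in full-text search handles the indexing, and Datasette serves the resulting database files. As a rough illustration, here is a minimal sketch in the style of the Dogsheep tools using [sqlite-utils](https://sqlite-utils.datasette.io/) - the table and column names are made up, and my actual importers are more complicated than this:

```python
# Hypothetical minimal importer: write rows into SQLite and maintain a
# full-text search index over them, so Datasette can browse and search it.
import sqlite_utils

db = sqlite_utils.Database("archive.db")

rows = [
    {"id": 1, "url": "https://example.com/post", "title": "Example post",
     "text": "Full extracted text goes here.", "timestamp": 1693180800},
]

# Upsert on the primary key so a nightly batch job can safely be rerun.
db["documents"].upsert_all(rows, pk="id")

# Create the FTS index once; the triggers keep it in sync with later writes.
if not db["documents"].detect_fts():
    db["documents"].enable_fts(["title", "text"], create_triggers=True)

# Query it back - Datasette's web UI exposes the same full-text search.
for row in db["documents"].search("example"):
    print(row["url"], row["title"])
```

Datasette itself just gets pointed at the resulting `.db` files and provides table browsing, filtering and arbitrary SQL queries on top.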
My setup is of course different from theirs, so I wrote some different importer scripts to organize data nicely in SQLite and build full-text search indices, and an increasingly complicated custom plugin to do a few minor UI tweaks (rendering timestamp columns, fixing foreign keys on single-row view pages, doing links) and reimplement something like [datasette-search-all](https://github.com/simonw/datasette-search-all/) (which provides a global search bar and nicer search UI). Currently, I have custom scripts to import this data, which are run nightly as a batch job:

* Anki cards from [anki-sync-server](https://github.com/ankicommunity/anki-sync-server/)'s database - just the text content, because the schema is weird enough that I didn't want to try and work out how anything else was stored.
* Unorganized text/HTML/PDF files in my archives folder.
* Books (EPUB) stored in Calibre - overall metadata and chapter full text.
* Media files in my archive folder (all videos I've watched recently) - format, various metadata fields, and full extracted subtitles with full-text search.
* I've now added [WhisperX](https://github.com/m-bain/whisperX/) autotranscription on all files with bad/nonexistent subtitles. While it struggles with music more than Whisper itself, its use of batched inference and voice activity detection meant that I got ~100x realtime speed on average when processing all my files (after a patch to fix the awfully slow alignment algorithm).
* [Miniflux](/rssgood/) RSS feed entries.
* [Minoteaur](/minoteaur/) notes, files and structured data. I don't have links indexed, since SQLite isn't much of a graph database[^6], and since my importer reads directly off the Minoteaur database, extracting them would have meant writing a Markdown parser, which would have been annoying.
* RCLWE web history (including the `circache` holding indexed pages from my former Recoll install).

There are also some other datasets handled differently, because the tools I use for those happened to already use SQLite somewhere and had reasonably usable formats. Specifically, [Gadgetbridge](https://www.gadgetbridge.org/) data from my smartwatch is copied off my phone and accessible in Datasette, [Atuin](https://github.com/ellie/atuin)'s local shell history database is symlinked in, Firefox history comes from [my script](https://github.com/osmarks/random-stuff/blob/master/histretention.py) on my laptop rather than the nightly serverside batch job, and I also connected my Calibre library database, though I don't actually use that. 13GB of storage is used in total.

This is some of what the UI looks like - it is much like a standard Datasette install with a few extra UI elements and some style tweaks I made: