`ark`: A Personal Archive System, Part 3: The Store — Where the Archive Actually Lives

09 Jun 2026 » 10 min read » Filed under: Technology & Gadgets

That last post. What can I say. Four thousand plus words? Did I really do that to you? It won’t happen again. If I could do it over, I would[1]. When I set out to (very tentatively) write this series of posts on ark, I intended it as a way of showcasing something fun that I built, something that turned out to be the thing I had always wanted to build. I forgot to bring the fun to that last post and it reads (to me) like one big “look what I can do!” flail. And this piece! This piece started out in the same direction, only worse. It was mired in technical detail. At one point, in the first draft, I wrote:

This is naturally a more technical piece than most that I write, given the nature of what I am describing. I’ll do my best to smooth out those rough edges, but know that I am aware this isn’t for my usual audience.

Lazy, lazy, lazy. And not the right intention. Fortunately, I caught on to what I was doing when only a thousand words or so had been set down. Pauciloquy[2] is called for and there is still (barely) enough room to recover. And so here we are, about to talk about ark’s store, or, where’d all those files go? Where to begin…?

The Dreaded Org Chart

I’ve been unsatisfied with the hierarchical structure of file systems since I ran that very first catalog command on an Apple ][ e. I’ve lived with them ever since, an accumulating succession of decades that have done little to quench the burning dissatisfaction with the way files are stored on computers. It sometimes seems like a large part of my avocation in technology has been a desperate search for a way out of the rigid hierarchy[3]. “Tear down the wall!”

The search took me to Evernote with its notebooks and tags, and then to Obsidian with its org chart and tags. But Obsidian introduced, to me at least, the notion of a graph: files are the nodes, and the links between them are the edges. It is a powerful idea, more powerful than I realized. And so when I approached this hobby project and was considering the design, I had two strong ideas in mind:

  1. Get me off this org chart!
  2. How might I take advantage of graphs?

Simple Requirements

In my limited imagination, there are two poles on the file storage spectrum[4]: a simple listing of files and a graph of files where every file points to every other file. ark is designed to be as close to the simple listing of files as possible. My requirements were, therefore, simple:

  1. There should only ever be one of each item. Preventing duplicates makes things easier to find. I can’t tell you how many times I have found three different copies of a Word document or photo on my computer.
  2. Items in the archive must be described separately from the files themselves. File systems provide the bare minimum capacity for describing a file. An archive is more than a file system so I need a way of describing those files to make finding them as easy as possible.
  3. The archive only stores finished products. Working documents, working files don’t get into the archive until they are finished.

With these requirements in hand, I set about meeting each one. To ensure that each item in the archive is unique, it gets a unique file name based on its digital DNA. The exact sequence of bytes that makes up a file can be “hashed” into a number that is, for all practical purposes, unique to that sequence of bytes. ark uses SHA-256 for its hashing mechanism. That number becomes not only the name of the file, but its identifier in ark’s database. What it means in practice is that if I bring an exact copy of a file into ark that already exists, it doesn’t get added a second time; it is simply ignored in favor of the copy that is already in the system.
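Here is a minimal sketch of that idea in Python. It is not ark’s actual code: the `Catalog` class and `sha256_of` helper are hypothetical names, and an in-memory dictionary stands in for the real database. It only shows how a SHA-256 digest can double as an identifier and a duplicate check.

```python
import hashlib
import io


def sha256_of(stream, chunk_size=65536):
    """Return the hex SHA-256 digest of a binary stream, read in chunks."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


class Catalog:
    """Hypothetical in-memory stand-in for the database of known items."""

    def __init__(self):
        self.items = {}  # digest -> title of the first copy seen

    def add(self, stream, title):
        """Return (doc_id, added); added is False for an exact duplicate."""
        doc_id = sha256_of(stream)
        if doc_id in self.items:
            return doc_id, False  # same bytes already archived: ignore
        self.items[doc_id] = title
        return doc_id, True


catalog = Catalog()
first_id, first_added = catalog.add(io.BytesIO(b"Dear diary..."), "diary.txt")
second_id, second_added = catalog.add(io.BytesIO(b"Dear diary..."), "diary copy.txt")
print(first_id == second_id, first_added, second_added)
```

Two files with different names but identical bytes produce the same digest, so the second one is recognized as a copy and skipped.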

To describe the files in the system, ark uses a SQLite database. This allows ark to have full-text search and semantic search capabilities without running a database server. The SQLite database is just another file on my computer. True, SQLite is not designed to be a multi-user database, but ark is not designed to be a multi-user application, so we’re all good here. All of the meta-data needed to describe a file is stored in the database. That meta-data breaks down into five categories:

  1. Classification (doc type, series, sub-series, format, sensitivity, priority, etc.)
  2. Provenance (date authored, record origin, original source path, physical location, etc.)
  3. Identification (doc ID, sha256, title, store path)
  4. Quality (OCR status, OCR quality)
  5. Content (full-text search content, LLM summary, embeddings)

Finally, ark uses a “copy-in” strategy for files. That means that the source file is left untouched, wherever it comes from, and a copy of the file is brought into the archive. Because of this, and the other requirements I listed above, I can store all of the files in a flat structure within ark’s store. After all, I never need to know the actual file name. I just need to be able to describe what I am looking for and the database takes care of the rest.
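A copy-in step into a flat store might look something like the sketch below. The `copy_in` function and the directory layout are assumptions made for illustration; the real store may name or arrange files differently.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def copy_in(source: Path, store: Path) -> Path:
    """Copy source into a flat store, named by its SHA-256 digest.

    The source file is never moved or modified. Reading the whole file
    into memory is fine for a sketch; large files would be hashed in chunks.
    """
    digest = hashlib.sha256(source.read_bytes()).hexdigest()
    target = store / digest  # flat layout: no subfolders (an assumption)
    if not target.exists():
        shutil.copy2(source, target)
    return target


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    store = root / "store"
    store.mkdir()

    original = root / "letter.txt"
    original.write_bytes(b"Dear Grandma, ...")
    duplicate = root / "letter (copy).txt"
    duplicate.write_bytes(b"Dear Grandma, ...")

    first = copy_in(original, store)
    second = copy_in(duplicate, store)

    same_target = first == second
    files_in_store = len(list(store.iterdir()))
    source_untouched = original.exists()
    print(same_target, files_in_store, source_untouched)
```

Both source files survive where they were, and the store ends up holding a single managed copy.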

Connections

In your standard OS, files sit there on a file system completely unconnected. But in a personal archive, people are first-class citizens. So in addition to the database storing information about files, it stores information about the people in those files. And since it is the connections that make an archive like this come alive, ark supports three kinds of connections:

  1. Document-to-document: one email is linked to another as a “reply-to”. A receipt is linked to an invoice. Documents can link to one another.
  2. Document-to-person: Who wrote it, received it, who’s mentioned, who’s the subject. The same person may show up in lots of roles across documents. This is the single most-queried table in the archive. Anything person-aware reads from here.
  3. Person-to-person: Friends, family, colleagues, including date ranges. A handyman who works on your house may retire, and someone else takes over. The connections capture it all.
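To make the three kinds of connections concrete, here is a rough sketch using Python’s built-in `sqlite3` module. The table names, column names, and sample rows are all invented for illustration and are not ark’s real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE documents (doc_id TEXT PRIMARY KEY, title TEXT, authored TEXT);
    CREATE TABLE people (person_id INTEGER PRIMARY KEY, name TEXT);
    -- 1. Document-to-document edges (reply-to, receipt-for, ...)
    CREATE TABLE doc_links (from_doc TEXT, to_doc TEXT, relation TEXT);
    -- 2. Document-to-person edges (author, recipient, subject, mentioned)
    CREATE TABLE doc_people (doc_id TEXT, person_id INTEGER, role TEXT);
    -- 3. Person-to-person edges, with an optional date range
    CREATE TABLE person_links (
        from_person INTEGER, to_person INTEGER, relation TEXT,
        start_date TEXT, end_date TEXT
    );
    """
)
conn.executemany(
    "INSERT INTO documents VALUES (?, ?, ?)",
    [
        ("a1", "Prom photos", "2026-05-19"),
        ("b2", "Invoice", "2024-01-10"),
        ("c3", "Letter", "2026-01-15"),
        ("d4", "Receipt", "2024-01-12"),
    ],
)
conn.executemany("INSERT INTO people VALUES (?, ?)", [(1, "Sam"), (2, "Pat")])
conn.executemany(
    "INSERT INTO doc_people VALUES (?, ?, ?)",
    [("a1", 1, "subject"), ("c3", 1, "mentioned"),
     ("b2", 2, "recipient"), ("c3", 2, "author")],
)
conn.execute("INSERT INTO doc_links VALUES ('d4', 'b2', 'receipt-for')")
conn.execute("INSERT INTO person_links VALUES (2, 1, 'parent', '2008-01-01', NULL)")


def documents_about(name, since):
    """Every document connected to a person on or after a date, any role."""
    return conn.execute(
        """
        SELECT d.title, dp.role
        FROM doc_people AS dp
        JOIN documents AS d ON d.doc_id = dp.doc_id
        JOIN people AS p ON p.person_id = dp.person_id
        WHERE p.name = ? AND d.authored >= ?
        ORDER BY d.authored
        """,
        (name, since),
    ).fetchall()


results = documents_about("Sam", "2025-01-01")
print(results)
```

The query never mentions a folder. It walks the document-to-person edges, which is why where something was filed stops mattering.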

Here’s where things get fun! In an org chart, the answer to “show me everything about my son from the last two years” depends on remembering which folder you filed each thing in. In a graph, it doesn’t. I asked Claude Desktop[5] that exact question and got back this:

Theater & Performance — The dominant thread. He’d been active in community theater for a while: directing plays, performing in musicals (the creepy Wolf in Into the Woods), and preparing for the role of Edwin Drood at his high school. He’s a baritone/tenor in the choral group and was being recruited for an a cappella group. By May 2026, a choral concert I attended. Then the big news: he and a friend auditioned for Guys and Dolls — a six-hour audition — and he was offered the lead role of Nathan Detroit. Around 200 lines.

Prom — May 19, 2026. He went with a friend. Dinner first, then photos at the Jefferson Memorial via Uber Black, then the prom itself. I left the car at the venue so he could drive home. He got home around 10pm, changed, headed to a post-prom party, was home just after 1am. I noted he “seemed very grown up about the whole thing.”

Memorial Day 2026 — At our friends’ house. He “christened the pool as he does every year.”

College Prep — From a family Zoom in late May: he’s working on the Common Application and asking for recommendation letters. Summer plans still up in the air — SAT prep, camp counselor work, or a possible Europe trip.

Lawn — An October 2025 iMessage asks whether he can cut the lawn that weekend.

Rutgers campus tour — Mentioned in passing in the Zoom notes: I drove by my childhood New Jersey house “after a Rutgers campus tour with him.”

Cotillion — Per a January 2026 letter, he was working as an instructor — teaching students to dance and how to conduct themselves in society.

Driving — Also from that January letter, the dry observation that his having a license “does wonders for the nervous system.”

One question; every connected thing, pulled together by edges the archive had been quietly maintaining while I wasn’t looking. Letters, iMessages, Zoom call notes, photos, calendar entries, diaries, social media: formats I don’t normally think of as connected, returning one coherent picture of my kid over a two-year span[6]. The org chart could have stored all of these. Only the graph could have answered the question. (More on how that LLM call works under the hood, including bundles, edges, and redaction, in the next post.)

Two Ways to Organize

In looking at how archivists tend to organize archives, a 4-tiered, um, hierarchy (sorry!) emerged as a trend:

  1. Series (biographical, correspondence, writings, research, professional, financial, legal, medical, etc.).
  2. Sub-series (fixed categories that fall underneath each of the series).
  3. File. A collection of items in a series/sub-series in a physical archive.
  4. Item. The thing itself.

In ark we have hard mappings to three of the four: series, sub-series, and the item itself.

Series and sub-series are categories that form a controlled vocabulary. But I find it useful to have user-curated groupings as well. While ark can use tags, I created something called a “collection” which is a curated grouping named after the reason that the items are grouped together. For example #2026-tax-documents, or #2019-house-purchase, or #vacation-in-the-golden-age-notes. Documents can, of course, have a series and sub-series, be tagged, and be members of one or more collections. The nice thing about collections is that they can be used as input for other ark commands. For instance:

ark bundle '#vacation-in-the-golden-age-notes' | ark task summarize

which will create a bundle of all of the items in the #vacation-in-the-golden-age-notes collection and then use an LLM to summarize the entire bundle.

Bottom line: a collection is a list of items in the archive with context.
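As a rough illustration, a collection can be modeled as a named, ordered list of item identifiers plus the reason the items belong together. The `Collection` class and `bundle` function below are hypothetical and only mimic the shape of the idea, not what `ark bundle` really does.

```python
from dataclasses import dataclass, field


@dataclass
class Collection:
    """A curated grouping: a name, the reason it exists, and its members."""

    name: str
    reason: str
    doc_ids: list = field(default_factory=list)

    def add(self, doc_id):
        # A document can sit in many collections, but only once in each.
        if doc_id not in self.doc_ids:
            self.doc_ids.append(doc_id)


def bundle(collection, texts):
    """Join the members' text into one blob that a later step could summarize.

    texts maps doc_id -> extracted text; members with no text are skipped.
    """
    parts = [f"# {collection.name}: {collection.reason}"]
    parts += [texts[doc_id] for doc_id in collection.doc_ids if doc_id in texts]
    return "\n\n".join(parts)


taxes = Collection("#2026-tax-documents", "everything needed to file for 2026")
taxes.add("a1")
taxes.add("b2")
taxes.add("a1")  # adding the same item again changes nothing

blob = bundle(taxes, {"a1": "W-2 summary", "b2": "Mortgage interest statement"})
print(blob)
```

The point of keeping the reason alongside the list is that the grouping explains itself when it is handed to another command.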

Some Things Aren’t Documents

Most things in ark are documents: emails, PDFs, photos, diary entries, Office documents, text files. But several things in the archive aren’t documents; instead, they have their own dedicated database tables. These include reading events (to manage my reading list), health data from Apple Health and Fitbit, location data pulled off photos and extracted from other sources like diary entries, and more.

I’ll write about each of these later on in this series. For the store, it is useful to know that these live alongside the document model and follow the same rules: they are addressable with a unique SHA-256 identifier, auditable, and integrated with ark’s core command set (although they sometimes have commands of their own).

Two Things That Touch Everything

Two things in ark’s data model don’t sit in any one layer; they sit over all of them:

  1. Sensitivity: This is set on every item that comes into the archive. The archive treats sensitivity as a query filter, not a display hint. Items that are sensitive are automatically routed through different code paths. For example, if I use Claude Desktop to ask a question and the result includes data marked “restricted”, that data is kept on the local machine, so it never reaches Claude Desktop. If the data is marked “sensitive”, the sensitive information is stripped and replaced with “[REDACTED]” before anything is sent off the local machine. The same rules apply everywhere data might leave the local machine.
  2. Annotations: I have a lot to say about things (pauciloquy goes only so far). I engineered the annotations layer to sit atop everything in the archive. This way, I can add notes and comments to a document, a book record, a watch event, an Apple Health record, a person — anything in the archive can be annotated. Those annotations are searchable, and they are surfaced most commonly when looking at a document in the archive. This allows me to add context without touching the original item.
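The sensitivity rules above can be sketched as a single filter that every outbound result passes through. Everything here is an assumption made for illustration: the `prepare_for_export` name, the “normal” level, and especially the pattern used to decide what counts as sensitive text (a US Social Security number shape), since the post doesn’t say how ark identifies it.

```python
import re

# Assumed stand-in for "sensitive information": anything shaped like an SSN.
SSN_LIKE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def prepare_for_export(items):
    """Apply sensitivity rules before any item leaves the local machine."""
    outbound = []
    for item in items:
        level = item["sensitivity"]
        if level == "restricted":
            continue  # restricted items never leave the machine
        text = item["text"]
        if level == "sensitive":
            text = SSN_LIKE.sub("[REDACTED]", text)
        outbound.append({"title": item["title"], "text": text})
    return outbound


results = [
    {"title": "Lawn", "sensitivity": "normal", "text": "Can you cut the lawn?"},
    {"title": "Tax form", "sensitivity": "sensitive", "text": "SSN 123-45-6789 on file"},
    {"title": "Medical", "sensitivity": "restricted", "text": "Lab results"},
]
safe = prepare_for_export(results)
print(safe)
```

Putting the check at the export boundary, rather than in each display routine, is what makes sensitivity a filter instead of a hint.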

Conclusion

This design keeps ark entirely self-contained on my local machine. The sensitivity layer ensures that documents that shouldn’t leave the machine don’t. SQLite handles full-text and semantic search. And so far, this scales well. As of this writing, my store is 125 GB not counting the SQLite database which adds another 9 GB:

=== ark store stats ===
Strategy:      copy
Store path:    /Users/[username]/.local/share/ark/store
Store size:    125.0 GB (390,749 files on disk)
In store (DB): 365,774 document(s)
Index-only:    330,680 document(s) (no managed copy)
Total docs:    696,454

Type                   Total  In store  Index-only
──────────────────── ─────── ───────── ───────────
email                281,740   272,068       9,672
browser_visit        106,616         0     106,616
image                 85,415    77,182       8,233
imessage              41,267         0      41,267
tweet                 27,599         0      27,599
watch_event           26,149         0      26,149
calendar_event        21,564         0      21,564
facebook-post         19,234         0      19,234
cli_command           15,189         0      15,189
note                  12,914         7      12,907
health_day             9,411         0       9,411
pdf                    8,400     8,400           0
blog_comment           8,040         0       8,040
blog_post              7,477         0       7,477
purchase               5,813         0       5,813
music_play             5,194         0       5,194
attachment             4,302     4,302           0
git_commit             2,261         0       2,261
office                 2,078     2,078           0
reading_finished       1,547         0       1,547
book                   1,257         0       1,257
text                   1,114     1,114           0
diary_entry              804       612         192
reminder                 373         0         373
action_item              209         0         209
playlist                 124         0         124
video                     96         0          96
review                    83         0          83
weather_snapshot          52         0          52
blog_page                 38         0          38
subscription              27         0          27
message                   22         0          22
reading_started           19         0          19
code_file                 11        11           0
outbox_draft               8         0           8
conversation               3         0           3
day_summary                3         0           3
timeline_event             1         0           1

You’ll notice the “Index-only” column has some big numbers. Some doc types — browser history, iMessages, tweets — live as index entries pointing at source databases or cloud accounts. The original bytes aren’t worth duplicating, so ark keeps the metadata and content for search but doesn’t manage a separate copy.

That’s nearly 700,000 items in ark. Most of these items were ingested automatically into ark from a variety of sources. I’ll talk about “ingestion at scale” next time.


Notes:
  1. I suppose I could do it over, but a long-standing tenet of this blog is that I make mistakes in public, and learn from them. It does no one any good to erase those mistakes and pretend they don’t exist.
  2. Today’s A.Word.A.Day word. This is the first mailing list I ever subscribed to. According to `ark` my first message from the list came on February 2, 1997. I think I’ve subscribed longer, but I may have been deleting that email. (I did delete email early on, not thinking about future value.) I still enjoy reading it each day it arrives in my inbox. I wonder how many people out there have subscribed to, and actively read, a mailing list for just about 30 years now?
  3. That word is just too hard to type. I’m going to stick with “org chart” going forward if you don’t mind.
  4. I’m speaking here of a file as an atomic unit of measure. I know that files are made up of bytes, but for the purposes of a workable archive, the file is the base unit.
  5. Recall from the previous post how `ark` commands are exposed to LLMs via an MCP server. This is an example of that in action. Here, Claude Desktop is taking my natural language question and turning it into a series of `ark` queries (this is the agentic model) to get back the best possible answer.
  6. Keep in mind I’ve only been using `ark` for a few months so many of these connections were “discovered” by the ingestion process (more on this next time). Over time, this should become a richer source of linkages.


5 responses to “`ark`: A Personal Archive System, Part 3: The Store — Where the Archive Actually Lives”

  1. I actually liked the last post, but I’m generally in favour of longer posts. I have a “the bigger the better” mindset regarding stuff like this haha. It was great to get a feeling on how to use ark.

    For this post I expected a more technical version, with the database schema laid out. That would have been interesting in my opinion.
    What’s the reason for some emails being stored outside ark and others not? I can see why your browser history and commit history are better stored outside. But why are some item types only partially imported?

    I had to google the word “pauciloquy” and one of the results was “A.Word.A.Day”. Funny that this is where you got it from. Didn’t know about that site or rather that mailing list. I’m using a widget on my phone that shows a daily word, but it gives little context unless I open the app. And even then I have to do another tap to get to the full dictionary entry. I think I’ll keep it (as it still exposes me to new words and makes me pretend to learn something). But maybe I’ll check out that mailing list too. Thanks for the recommendation!
    On that note: I used to save the sign-up confirmation mail of every service I signed up for. I still have a folder of almost 300 mails going back to 2012. Unfortunately I recently went through the emails I already archived offline and deleted a bunch of messages I deemed useless, including some sign-up confirmations. Inspired by your post I’ll keep everything from now on (almost everything).

    1. Jan, I hear you! I started down that road of highly technical, and then realized most of my regular readers would not be interested. I tried to blend my usual style with a slightly technical bent. What I may do when I finish the batch is write a more technical appendix for folks like you.

      I tried to explain (poorly, it seems) the distinction between documents and “events” in ark. Documents live in the store and have a corresponding record in the database. Things that are events like browser history, CLI history, health “events”, reading “events”, etc. don’t have actual documents. So those just live in the database without a corresponding document in the store. The closest analog to an academic archive might be logbooks stored in the archive.

      The reason there are about 9,000 emails in the index but not the store has to do with the evolution of ark. Early on, ark was only storing the full text of messages in the database and not pulling in a corresponding .eml file. These are some residual messages from that ingestion. I hadn’t noticed that there were still emails in the index only. Fortunately, I built ark with the tools for handling situations like this. There is a command:

      ark backfill email-store

      which will fix this.

      It was just a coincidence that “pauciloquy” was yesterday’s word-of-the-day, and given that last post, I couldn’t resist the irony. I also thought it was cool that ark could tell me when I first started subscribing to that list.

      On the next post, about bulk ingestion, you’ll see how the volume of stuff in ark combined with the ever-growing web of links between documents and people, really make the archive come alive in interesting ways. So keeping stuff for all these years has proven useful finally!

  2. Mark Plutowski

    Traditional archivist: “How do I ensure this survives for 100 years?”

    Jamie: “How can I make my archive useful to me and my kids tomorrow and next year, as well as 100 years from now?”

    I’ve always thought that people who journal or archive without benefiting from it in the present day are missing out.

    1. Mark, I’m not sure these are mutually exclusive. I’d argue that future durability of an archive is a result of its present utility. I want to preserve my archive for the future because I find it so useful today. My models for this were folks like Isaac Asimov, who maintained a diary for decades and which served him in good stead (especially when writing his 2-volume autobiography in the late 1970s); John and John Quincy Adams, whose diaries and papers made for important references in their time, and are just as valuable now as research tools; and ditto Ralph Waldo Emerson with his notebooks.

      I’d be dishonest if I didn’t admit that ego plays a role in future posterity, but now that I’ve accumulated the critical mass in one place with the capabilities I’ve built, I use my archive every day! It’s incredibly useful, to say nothing of fun!

  3. Mark Plutowski

    several things in the archive aren’t documents; instead, they have their own dedicated database tables. These include reading events (to manage my reading list), health data from Apple Health and FitBit.

    I ran into this exact design fork myself with high-volume telemetry sources like Apple Health, Strava, AutoSleep, medical, brokerage, bank, purchases, etc:

    Do I normalize everything now, or preserve flexibility and interpret later?

    I saw three options: (A) a universal telemetry table, (B) dedicated tables per datatype, or (C) leave fitness/health/financial data as files. I went with C.

    The main reason was that ingesting into first-class tables becomes an ETL and schema design rabbit hole: either design a universal schema flexible enough to accommodate current and future data types, or maintain a growing collection of specialized tables. Once the data lives in the database, I’d also need to build dedicated tool functions so the AI assistant could query it.

    Instead, I treat telemetry data more like documents. I leave CSVs from apps like Strava, Amazon, & E*Trade as-is, and partition the large Apple Health XML export into month-based JSONL files. The retrieval step of my RAG loads the relevant files and lets the AI assistant do the analysis directly.

    By deferring interpretation to the AI layer I lose the ability to do deterministic SQL queries, but the tradeoff felt worth it: much simpler ingestion, no schema maintenance burden, and no need to build specialized query infrastructure for every new data source.

    That said, I still wonder what I may have lost by going this route. For journals, emails, images, and unstructured documents (all of which do live in first-class tables), I’ve built specialized analyses like word clouds and term trends, and filtered searches that take advantage of the data structure (e.g., email sender, image location). With health/fitness/financial data, I opted for simplicity and delegated more of the analysis to the LLM.

    I’m still curious whether that was the right tradeoff. I wonder whether I sacrificed capabilities I haven’t yet discovered I’ll need.
