`ark`: A Personal Archive System, Part 3: The Store — Where the Archive Actually Lives

That last post. What can I say. Four thousand plus words? Did I really do that to you? It won’t happen again. If I could do it over, I would¹. When I set out to (very tentatively) write this series of posts on ark, I intended it as a way of showcasing something fun that I built that actually turned out to be the thing that I always wanted to build. I forgot to bring the fun to that last post and it reads (to me) like one big “look what I can do!” flail. And this piece! This piece started out in the same direction, only worse. It was mired in technical detail. At one point, in the first draft, I wrote:

This is naturally a more technical piece than most that I write, given the nature of what I am describing. I’ll do my best to smooth out those rough edges, but know that I am aware this isn’t for my usual audience.

Lazy, lazy, lazy. And not the right intention. Fortunately, I caught on to what I was doing when only a thousand words or so had been set down. Pauciloquy² is called for and there is still (barely) enough to recover. And so here we are, about to talk about ark’s store, or, where’d all those files go? Where to begin…?

The Dreaded Org Chart

I’ve been unsatisfied with the hierarchical structure of file systems since I ran that very first catalog command on an Apple ][ e. I’ve lived with them ever since, an accumulating succession of decades that have done little to quench the burning dissatisfaction with the way files are stored on computers. It sometimes seems like a large part of my avocation in technology has been a desperate search for a way out of the rigid hierarchy³. “Tear down the wall!”

The search took me to Evernote with its notebooks and tags, and then to Obsidian with its org chart and tags. But Obsidian introduced, to me at least, the notion of a graph: that is, links between files, where each link forms an edge between two nodes. It is a powerful idea, more powerful than I realized. And so when I approached this hobby project and was considering the design, I had two strong ideas in mind:

Get me off this org chart!
How might I take advantage of graphs?

Simple Requirements

In my limited imagination, there are two poles on the file storage spectrum⁴: a simple listing of files and a graph of files where every file points to every other file. ark is designed to be as close to the simple listing of files as possible. My requirements were, therefore, simple:

There should only ever be one of each item. Preventing duplicates makes things easier to find. I can’t tell you how many times I have found three different copies of a Word document or photo on my computer.
Items in the archive must be described separately from the files themselves. File systems provide the bare minimum capacity for describing a file. An archive is more than a file system so I need a way of describing those files to make finding them as easy as possible.
The archive only stores finished products. Working documents, working files don’t get into the archive until they are finished.

With these requirements in hand, I set about meeting each one. To ensure that each item in the archive is unique, it gets a unique file name based on its digital DNA. The exact sequence of bytes that makes up a file can be “hashed” into a number that is, for all practical purposes, unique to that sequence of bytes. ark uses SHA-256 as its hashing mechanism. That number becomes not only the name of the file, but also its identifier in ark’s database. What it means in practice is that if I bring in an exact copy of a file that already exists in ark, it doesn’t get added a second time; it is simply ignored in favor of the copy that is already in the system.
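To make the idea concrete, here is a minimal sketch of content-based identity in Python. It is not ark's actual code; the function names `content_id` and `is_duplicate` are made up for illustration, and the only thing taken from the post is that the identifier is the SHA-256 of the file's bytes.

```python
import hashlib
from pathlib import Path


def content_id(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file's bytes, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read in 1 MiB chunks so large files never sit fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_duplicate(path: Path, known_ids: set[str]) -> bool:
    """True if a file with exactly the same bytes has been seen before."""
    return content_id(path) in known_ids
```

Two files with the same bytes get the same identifier no matter what they are named or where they live, which is what makes the "only ever one of each item" requirement cheap to enforce.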

To describe the files in the system, ark uses a SQLite database. This allows ark to have full-text search and semantic search capabilities without running a database server. The SQLite database is just another file on my computer. True, SQLite is not designed to be a multi-user database, but ark is not designed to be a multi-user application, so we’re all good here. All of the meta-data needed to describe a file is stored in the database. That meta-data breaks down into five categories:

Classification (doc type, series, sub-series, format, sensitivity, priority, etc.)
Provenance (date authored, record origin, original source path, physical location, etc.)
Identification (doc ID, sha256, title, store path)
Quality (OCR status, OCR quality)
Content (full-text search content, LLM summary, embeddings)

Finally, ark uses a “copy-in” strategy for files. That means that the source file is left untouched, wherever it comes from, and a copy of the file is brought into the archive. Because of this, and the other requirements I listed above, I can store all of the files in a flat structure within ark’s store. After all, I never need to know the actual file name. I just need to be able to describe what I am looking for and the database takes care of the rest.
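A copy-in step along these lines might look like the sketch below. Everything specific in it is an assumption made for illustration: the `copy_in` function, the four-column `documents` table, and using the file stem as a title are not taken from ark, whose real schema carries far more metadata.

```python
import hashlib
import shutil
import sqlite3
from pathlib import Path


def copy_in(source: Path, store_dir: Path, db: sqlite3.Connection) -> tuple[str, bool]:
    """Copy `source` into a flat store, naming the copy by its SHA-256.

    Returns (sha256, added). The source file is never moved or modified.
    """
    sha256 = hashlib.sha256(source.read_bytes()).hexdigest()
    # Hypothetical, deliberately tiny table: identification plus provenance.
    db.execute(
        "CREATE TABLE IF NOT EXISTS documents ("
        " sha256 TEXT PRIMARY KEY,"
        " title TEXT,"
        " original_path TEXT,"
        " store_path TEXT)"
    )
    store_path = store_dir / sha256  # flat layout: no folders, no original name
    cursor = db.execute(
        "INSERT OR IGNORE INTO documents VALUES (?, ?, ?, ?)",
        (sha256, source.stem, str(source), str(store_path)),
    )
    added = cursor.rowcount == 1  # 0 means the hash was already in the table
    if added:
        store_dir.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source, store_path)
    db.commit()
    return sha256, added
```

Because the primary key is the hash, a second copy of the same bytes is ignored by the database before anything is written to disk, and the store directory never needs sub-folders.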

Connections

In your standard OS, files sit there on a file system completely unconnected. But in a personal archive, people are first-class citizens. So in addition to the database storing information about files, it stores information about the people in those files. And since it is the connections that make an archive like this come alive, ark supports three kinds of connections:

Document-to-document: one email is linked to another as a “reply-to”. A receipt is linked to an invoice. Documents can link to one another.
Document-to-person: Who wrote it, received it, who’s mentioned, who’s the subject. The same person may show up in lots of roles across documents. This is the single most-queried table in the archive. Anything person-aware reads from here.
Person-to-person: Friends, family, colleagues, including date ranges. A handyman who works on your house may retire, and someone else takes over. The connections capture it all.
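The three kinds of connections can be sketched as three link tables in SQLite. This is only an illustration under stated assumptions: apart from `documents` and `document_relationships`, which the author names in the comments, every table, column, and function name here (`people`, `document_people`, `person_relationships`, `open_archive`, `documents_about`) is hypothetical.

```python
import sqlite3

SCHEMA = """
CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT, doc_type TEXT, date_authored TEXT);
CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT);
-- document-to-document edges, e.g. kind = 'reply-to'
CREATE TABLE document_relationships (from_doc INTEGER, to_doc INTEGER, kind TEXT);
-- document-to-person edges; role is author, recipient, mentioned, subject, ...
CREATE TABLE document_people (doc_id INTEGER, person_id INTEGER, role TEXT);
-- person-to-person edges with an optional date range
CREATE TABLE person_relationships (
    person_a INTEGER, person_b INTEGER, kind TEXT, start_date TEXT, end_date TEXT
);
"""


def open_archive() -> sqlite3.Connection:
    """Create an empty in-memory database with the sketch schema."""
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    return db


def documents_about(db: sqlite3.Connection, name: str, since: str) -> list[tuple]:
    """Every document linked to a person, in any role, on or after `since`."""
    rows = db.execute(
        """
        SELECT DISTINCT d.date_authored, d.doc_type, d.title
        FROM documents d
        JOIN document_people dp ON dp.doc_id = d.id
        JOIN people p ON p.id = dp.person_id
        WHERE p.name = ? AND d.date_authored >= ?
        ORDER BY d.date_authored
        """,
        (name, since),
    )
    return rows.fetchall()
```

The query never mentions a folder or a file name. It follows edges from a person to documents, which is why the same question works across emails, photos, and messages alike.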

Here’s where things get fun! In an org chart, the answer to “show me everything about my son from the last two years” depends on remembering which folder you filed each thing in. In a graph, it doesn’t. I asked Claude Desktop⁵ that exact question and got back this:

Theater & Performance — The dominant thread. He’d been active in community theater for a while: directing plays, performing in musicals (the creepy Wolf in Into the Woods), and preparing for the role of Edwin Drood at his high school. He’s a baritone/tenor in the choral group and was being recruited for an a cappella group. By May 2026, a choral concert I attended. Then the big news: he and a friend auditioned for Guys and Dolls — a six-hour audition — and he was offered the lead role of Nathan Detroit. Around 200 lines.

Prom — May 19, 2026. He went with a friend. Dinner first, then photos at the Jefferson Memorial via Uber Black, then the prom itself. I left the car at the venue so he could drive home. He got home around 10pm, changed, headed to a post-prom party, was home just after 1am. I noted he “seemed very grown up about the whole thing.”

Memorial Day 2026 — At our friends’ house. He “christened the pool as he does every year.”

College Prep — From a family Zoom in late May: he’s working on the Common Application and asking for recommendation letters. Summer plans still up in the air — SAT prep, camp counselor work, or a possible Europe trip.

Lawn — An October 2025 iMessage asks whether he can cut the lawn that weekend.

Rutgers campus tour — Mentioned in passing in the Zoom notes: I drove by my childhood New Jersey house “after a Rutgers campus tour with him.”

Cotillion — Per a January 2026 letter, he was working as an instructor — teaching students to dance and how to conduct themselves in society.

Driving — Also from that January letter, the dry observation that his having a license “does wonders for the nervous system.”

One question; every connected thing — pulled together by edges the archive had been quietly maintaining while I wasn’t looking. Letters, iMessages, Zoom call notes, photos, calendar entries, diaries, social media — formats I don’t normally think of as connected, returning one coherent picture of my kid over a two-year span⁶. The org chart could have stored these. Only the graph could have answered them. (More on how that LLM call works under the hood — bundles, edges, redaction — next post.)

Two Ways to Organize

In looking at how archivists tend to organize archives, a 4-tiered, um, hierarchy (sorry!) emerged as a trend:

Series (biographical, correspondence, writings, research, professional, financial, legal, medical, etc.).
Sub-series (fixed categories that fall underneath each of the series).
File. A collection of items in a series/sub-series in a physical archive.
Item. The thing itself.

In ark we have hard mappings to three of the four: series, sub-series, and the item itself.

Series and sub-series are categories that form a controlled vocabulary. But I find it useful to have user-curated groupings as well. While ark can use tags, I created something called a “collection”, which is a curated grouping named after the reason that the items are grouped together. For example, #2026-tax-documents, #2019-house-purchase, or #vacation-in-the-golden-age-notes. Documents can, of course, have a series and sub-series, be tagged, and be members of one or more collections. The nice thing about collections is that they can be used as input for other ark commands. For instance:

ark bundle '#vacation-in-the-golden-age-notes' | ark task summarize

which will create a bundle of all of the items in the #vacation-in-the-golden-age-notes collection and then use an LLM to summarize the entire bundle.

Bottom line: a collection is a list of items in the archive with context.
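As a rough illustration of "a list of items with context", a collection can be modeled as a named row that records its reason plus a membership table. The table names, column names, and helper functions below are assumptions made for this sketch, not ark's real schema.

```python
import sqlite3

SCHEMA = """
CREATE TABLE collections (name TEXT PRIMARY KEY, reason TEXT);
CREATE TABLE collection_items (
    collection TEXT REFERENCES collections(name),
    doc_id TEXT,
    PRIMARY KEY (collection, doc_id)
);
"""


def add_to_collection(db: sqlite3.Connection, name: str, doc_id: str, reason: str = "") -> None:
    """Create the collection if needed, then add the document to it once."""
    db.execute("INSERT OR IGNORE INTO collections VALUES (?, ?)", (name, reason))
    db.execute("INSERT OR IGNORE INTO collection_items VALUES (?, ?)", (name, doc_id))


def items_in(db: sqlite3.Connection, name: str) -> list[str]:
    """Return the document IDs in a collection, in a stable order."""
    rows = db.execute(
        "SELECT doc_id FROM collection_items WHERE collection = ? ORDER BY doc_id",
        (name,),
    )
    return [row[0] for row in rows]
```

A document can belong to any number of collections, and a list like the one `items_in` returns is the kind of thing a bundling step would take as input.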

Some Things Aren’t Documents

Most things in ark are documents: emails, PDFs, photos, diary entries, Office documents, text files. But several things in the archive aren’t documents; instead, they have their own dedicated database tables. These include reading events (to manage my reading list), health data from Apple Health and Fitbit, location data pulled off photos and extracted from other sources like diary entries, and more.

I’ll write about each of these later on in this series. For the store, it is useful to know that these live alongside the document model and follow the same rules: they are addressable with a unique sha256 identifier, auditable, and integrated with ark’s core command set (although they sometimes have commands of their own).

Two Things That Touch Everything

Two things in ark’s data model don’t sit in any one layer; they sit over all of them:

Sensitivity: This is set on every item that comes into the archive. The archive treats sensitivity as a query filter, not a display hint. Items that are sensitive are automatically routed through different code paths. For example, if I use Claude Desktop to ask a question and the result includes sensitive data, the code path prevents “restricted” data from leaving the local machine so that it never gets to Claude Desktop. If the data is marked “sensitive”, any sensitive information is stripped and replaced with “[REDACTED]” before being sent off the local machine. This is true everywhere data might be exfiltrated off the local machine.
Annotations: I have a lot to say about things (pauciloquy goes only so far). I engineered the annotations layer to sit atop everything in the archive. This way, I can add notes and comments to a document, a book record, a watch event, an Apple Health record, a person — anything in the archive can be annotated. Those annotations are searchable, and they are surfaced most commonly when looking at a document in the archive. This allows me to add context without touching the original item.
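The sensitivity rule described above (restricted items never leave, sensitive items leave only after redaction) can be sketched as a filter applied before anything is sent off the machine. This is a guess at the shape, not ark's implementation: the `Item` class, the "normal" tier name, the `prepare_for_export` function, and the single regex standing in for "sensitive information" are all assumptions.

```python
import re
from dataclasses import dataclass, replace

# Hypothetical pattern list. The post does not say how ark decides what to
# redact; this one matches text shaped like a US Social Security number.
REDACT_PATTERNS = [re.compile(r"\b\d{3}-\d{2}-\d{4}\b")]


@dataclass(frozen=True)
class Item:
    title: str
    text: str
    sensitivity: str  # assumed tiers: "normal", "sensitive", "restricted"


def prepare_for_export(items: list[Item]) -> list[Item]:
    """Return only what may leave the local machine, redacted where needed."""
    outgoing = []
    for item in items:
        if item.sensitivity == "restricted":
            continue  # restricted items are dropped entirely
        if item.sensitivity == "sensitive":
            text = item.text
            for pattern in REDACT_PATTERNS:
                text = pattern.sub("[REDACTED]", text)
            item = replace(item, text=text)
        outgoing.append(item)
    return outgoing
```

Treating sensitivity as a filter on the result set, rather than a flag the display layer may or may not honor, means every path that sends data elsewhere can go through one function like this.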

Conclusion

This design keeps ark entirely self-contained on my local machine. The sensitivity layer ensures that documents that shouldn’t leave the machine don’t. SQLite handles full-text and semantic search. And so far, this scales well. As of this writing, my store is 125 GB, not counting the SQLite database, which adds another 9 GB:

			
=== ark store stats ===
  Strategy:       copy
  Store path:     /Users/[username]/.local/share/ark/store
  Store size:     125.0 GB  (390,749 files on disk)
  In store (DB):  365,774 document(s)
  Index-only:     330,680 document(s)  (no managed copy)
  Total docs:     696,454
  Type                    Total   In store   Index-only
  ────────────────────  ───────  ─────────  ───────────
  email                 281,740    272,068        9,672
  browser_visit         106,616          0      106,616
  image                  85,415     77,182        8,233
  imessage               41,267          0       41,267
  tweet                  27,599          0       27,599
  watch_event            26,149          0       26,149
  calendar_event         21,564          0       21,564
  facebook-post          19,234          0       19,234
  cli_command            15,189          0       15,189
  note                   12,914          7       12,907
  health_day              9,411          0        9,411
  pdf                     8,400      8,400            0
  blog_comment            8,040          0        8,040
  blog_post               7,477          0        7,477
  purchase                5,813          0        5,813
  music_play              5,194          0        5,194
  attachment              4,302      4,302            0
  git_commit              2,261          0        2,261
  office                  2,078      2,078            0
  reading_finished        1,547          0        1,547
  book                    1,257          0        1,257
  text                    1,114      1,114            0
  diary_entry               804        612          192
  reminder                  373          0          373
  action_item               209          0          209
  playlist                  124          0          124
  video                      96          0           96
  review                     83          0           83
  weather_snapshot           52          0           52
  blog_page                  38          0           38
  subscription               27          0           27
  message                    22          0           22
  reading_started            19          0           19
  code_file                  11         11            0
  outbox_draft                8          0            8
  conversation                3          0            3
  day_summary                 3          0            3
  timeline_event              1          0            1

		

You’ll notice the “Index-only” column has some big numbers. Some doc types — browser history, iMessages, tweets — live as index entries pointing at source databases or cloud accounts. The original bytes aren’t worth duplicating, so ark keeps the metadata and content for search but doesn’t manage a separate copy.

That’s nearly 700,000 items in ark. Most of these items were ingested automatically into ark from a variety of sources. I’ll talk about “ingestion at scale” next time.

Notes:

¹ I suppose I could do it over, but a long-standing tenet of this blog is that I make mistakes in public, and learn from them. It does no one any good to erase those mistakes and pretend they don’t exist.
² Today’s A.Word.A.Day word. This is the first mailing list I ever subscribed to. According to `ark` my first message from the list came on February 2, 1997. I think I’ve subscribed longer, but I may have been deleting that email. (I did delete email early on, not thinking about future value.) I still enjoy reading it each day it arrives in my inbox. I wonder how many people out there have subscribed to, and actively read, a mailing list for just about 30 years now?
³ That word is just too hard to type. I’m going to stick with “org chart” going forward if you don’t mind.
⁴ I’m speaking here of a file as an atomic unit of measure. I know that files are made up of bytes, but for the purposes of a workable archive, the file is the base unit.
⁵ Recall from the previous post how `ark` commands are exposed to LLMs via an MCP server. This is an example of that in action. Here, Claude Desktop is taking my natural language question and turning it into a series of `ark` queries (this is the agentic model) to get back the best possible answer.
⁶ Keep in mind I’ve only been using `ark` for a few months so many of these connections were “discovered” by the ingestion process (more on this next time). Over time, this should become a richer source of linkages.

Tagged as essay

8 responses to “`ark`: A Personal Archive System, Part 3: The Store — Where the Archive Actually Lives”

Jan

June 10, 2026

I actually liked the last post, but I’m generally in favour of longer posts. I have a “the bigger the better” mindset regarding stuff like this haha. It was great to get a feeling on how to use ark.

For this post I expected a more technical version, with the database schema laid out. That would have been interesting in my opinion.
What’s the reason for some emails being stored outside ark and others not? I can see why your browser history and commit history are better stored outside. But why are some item types only partially imported?

I had to google the word “pauciloquy” and one of the results was “A.Word.A.Day”. Funny that this is where you got it from. Didn’t know about that site or rather that mailing list. I’m using a widget on my phone that shows a daily word, but it gives little context unless I open the app. And even then I have to do another tap to get to the full dictionary entry. I think I’ll keep it (as it still exposes me to new words and makes me pretend to learn something). But maybe I’ll check out that mailing list too. Thanks for the recommendation!
On that note: I used to save the sign-up confirmation mail of every service I signed up for. I still have a folder of almost 300 mails going back to 2012. Unfortunately I recently went through the emails I already archived offline and deleted a bunch of messages I deemed useless, including some sign-up confirmations. Inspired by your post I’ll keep everything from now on (almost everything).

1. Jamie Todd Rubin
  
  June 10, 2026
  
  Jan, I hear you! I started down that road of highly technical, and then realized most of my regular readers would not be interested. I tried to blend my usual style with a slightly technical bent. What I may do when I finish the batch is write a more technical appendix for folks like you.
  
  I tried to explain (poorly, it seems) the distinction between documents and “events” in ark. Documents live in the store and have a corresponding record in the database. Things that are events like browser history, CLI history, health “events”, reading “events”, etc. don’t have actual documents. So those just live in the database without a corresponding document in the store. The closest analog to an academic archive might be logbooks stored in the archive.
  
  The reason there are about 9,000 emails in the index but not the store has to do with the evolution of ark. Early on, ark was only storing the full text of messages in the database and not pulling in a corresponding .eml file. These are some residual messages from that ingestion. I hadn’t noticed that there were still emails in the index only. Fortunately, I built ark with the tools for handling situations like this. There is a command:
  
  ark backfill email-store
  
  which will fix this.
  
  It was just a coincidence that “pauciloquy” was yesterday’s word-of-the-day, and given that last post, I couldn’t resist the irony. I also thought it was cool that ark could tell me when I first started subscribing to that list.
  
  On the next post, about bulk ingestion, you’ll see how the volume of stuff in ark, combined with the ever-growing web of links between documents and people, really makes the archive come alive in interesting ways. So keeping stuff for all these years has finally proven useful!
  
  1. Jan
    
    June 23, 2026
    
    Thanks for the explanation! I got the difference between files and events, my main confusion stemmed from the fact that only a part of the emails are stored in the archive and others are only referenced. I understand the reason now.
    I guess the same is true for images and diary entries? Some of them were not imported as a file in the beginning? Would you be able to backfill those types with a command too?
    
    Also I hope you are doing well! Will there be another post this week? Or are you going to skip another week?
    
    1. Jamie Todd Rubin
      
      June 23, 2026
      
      Jan, there is a suite of ark backfill functions for this very purpose.
      
      usage: ark backfill <subcommand> [args]

      Subcommands:
        exif               Backfill EXIF metadata (GPS, camera, timestamp) on images
        email-store        Store original .eml files for emails ingested from mbox
        annotation-links   Create mentioned-in links for [[doc:...]] refs in annotations
        email-bodies       Re-extract body text for HTML-only emails (fixes meta/link bug)
        ingested-sha256    Populate the ingested_sha256 column for existing store records
        reading-format     Copy missing format from paired reading_started to reading_finished events
        format             Detect and update inferred file format for stored documents
        transcribe         Vision-transcribe PDFs whose text layer scores poorly (#135)
      
      In this particular email case, you would use ark backfill email-store to pull the missing .eml files into the archive store, and not just the database.
      
      Sorry for the delay on the posts! I’ve been busy with work and family. I’ve got the next post about 40% written, I was hoping to get it out today but with my current schedule it will likely be tomorrow or Thursday. But it is coming!
      
Mark Plutowski

June 10, 2026

Traditional archivist: “How do I ensure this survives for 100 years?”

Jamie: “How can I make my archive useful to me and my kids tomorrow and next year, as well as 100 years from now?”

I’ve always thought that people who journal or archive without benefiting from it in the present day are missing out.

1. Jamie Todd Rubin
  
  June 11, 2026
  
  Mark, I’m not sure these are mutually exclusive. I’d argue that future durability of an archive is a result of its present utility. I want to preserve my archive for the future because I find it so useful today. My models for this were folks like Isaac Asimov, who maintained a diary for decades and which served him in good stead (especially when writing his 2-volume autobiography in the late 1970s); John and John Quincy Adams, whose diaries and papers made for important references in their time, and are just as valuable now as research tools; and ditto Ralph Waldo Emerson with his notebooks.
  
  I’d be dishonest if I didn’t admit that ego plays a role in future posterity, but now that I’ve accumulated the critical mass in one place with the capabilities I’ve built, I use my archive every day! It’s incredibly useful, to say nothing of fun!
  
Mark Plutowski

June 13, 2026

several things in the archive aren’t documents; instead, they have their own dedicated database tables. These include reading events (to manage my reading list), health data from Apple Health and FitBit.

I ran into this exact design fork myself with high-volume telemetry sources like Apple Health, Strava, AutoSleep, medical, brokerage, bank, purchases, etc:

Do I normalize everything now, or preserve flexibility and interpret later?

I saw three options: (A) a universal telemetry table, (B) dedicated tables per datatype, or (C) leave fitness/health/financial data as files. I went with C.

The main reason was that ingesting into first-class tables becomes an ETL and schema design rabbit hole: either design a universal schema flexible enough to accommodate current and future data types, or maintain a growing collection of specialized tables. Once the data lives in the database, I’d also need to build dedicated tool functions so the AI assistant could query it.

Instead, I treat telemetry data more like documents. I leave CSVs from apps like Strava, Amazon, & E*Trade as-is, and partition the large Apple Health XML export into month-based JSONL files. The retrieval step of my RAG loads the relevant files and lets the AI assistant do the analysis directly.

By deferring interpretation to the AI layer I lose the ability to do deterministic SQL queries, but the tradeoff felt worth it: much simpler ingestion, no schema maintenance burden, and no need to build specialized query infrastructure for every new data source.

That said, I still wonder what I may have lost by going this route. For journals, emails, images, and unstructured documents (all of which do live in first-class tables), I’ve built specialized analyses like word clouds and term trends, and filtered searches that take advantage of the data structure (e.g., email sender, image location). With health/fitness/financial data, I opted for simplicity and delegated more of the analysis to the LLM.

I’m still curious whether that was the right tradeoff. I wonder whether I sacrificed capabilities I haven’t yet discovered I’ll need.

1. Jamie Todd Rubin
  
  June 20, 2026
  
  Mark, your A/B/C frame is exactly the fork I hit, and I landed on a fourth option you don’t list — call it “everything is a document, but with structured doc_type values.”
  
  In ark, the documents table is the central store. Health daily aggregates, reading start/finish events, weather snapshots, Apple Music plays, browser history, git commits, even iMessage exchanges all live as rows in that table, each with a distinct doc_type and rich JSON in the metadata column. My Apple Health daily aggregate becomes one document per day with doc_type='health_day' and metadata = {steps, sleep_hours, resting_hr, weight, ...}. A book “started reading” is a reading_started document linked to the book document via a reading-of relationship. Same row shape, different semantics by type.
  
  The win is that I get most of your B without paying for it. Everything in documents inherits FTS5 over fts_content, vector embeddings via sqlite-vec, sensitivity-tier filtering, and the same document_relationships graph used by every other doc. My MCP ark_search tool sees health, email, and PDFs identically — no specialized “query health” function needed. ark_search "sleep last week" Just Works because each health_day doc emits a one-line fts_content summary like "2026-06-19 — 7h12m sleep, 8423 steps, 62 bpm resting HR, 174 lbs" that’s both human-readable in the bundle and FTS-indexable.
  
  When I want deterministic SQL, I lean on SQLite’s JSON1: SELECT date_authored, json_extract(metadata, '$.sleep_hours') FROM documents WHERE doc_type='health_day' AND date_authored >= '2026-06-01'. So I get the relational queries you mention without baking column-per-metric into a schema. Adding a new telemetry source is a new doc_type + an ingest adapter — no migrations, no growing forest of tables. The fitbit collector and the Apple Health collector both write doc_type='health_day' with the same metadata keys; downstream code doesn’t care which source.
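For readers who want to try the JSON1 pattern, here is a small self-contained sketch using Python's built-in sqlite3 module. The three-column table is a stand-in for the much larger documents table, the rows are invented sample data, and the `sleep_since` helper is hypothetical; only the query shape comes from the comment above. It assumes the SQLite build bundled with Python includes the JSON functions, which is the case for recent versions.

```python
import json
import sqlite3


def sleep_since(db: sqlite3.Connection, since: str) -> list[tuple]:
    """Return (date, sleep_hours) for health_day documents on or after `since`."""
    rows = db.execute(
        "SELECT date_authored, json_extract(metadata, '$.sleep_hours') "
        "FROM documents "
        "WHERE doc_type = 'health_day' AND date_authored >= ? "
        "ORDER BY date_authored",
        (since,),
    )
    return rows.fetchall()


db = sqlite3.connect(":memory:")
# Stand-in table: one row per document, metrics kept as JSON text.
db.execute("CREATE TABLE documents (doc_type TEXT, date_authored TEXT, metadata TEXT)")
sample = [
    ("health_day", "2026-05-30", {"steps": 9100, "sleep_hours": 8.0}),
    ("health_day", "2026-06-18", {"steps": 7012, "sleep_hours": 6.5}),
    ("health_day", "2026-06-19", {"steps": 8423, "sleep_hours": 7.25}),
    ("email", "2026-06-19", {"subject": "hello"}),
]
db.executemany(
    "INSERT INTO documents VALUES (?, ?, ?)",
    [(doc_type, day, json.dumps(meta)) for doc_type, day, meta in sample],
)
```

Calling `sleep_since(db, "2026-06-01")` returns one row per matching day. Adding a new metric later only means adding a key to the JSON, with no change to the table.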
  
  What I do give up vs. your option B with proper relational tables: type-safe joins on numeric fields, query-planner optimizations on metric values, and somewhat-less-ergonomic aggregations (you’re always reaching through json_extract). For ad-hoc “average resting HR by month” it’s fine; for “window function over my last 1000 steps records” the JSON path gets gnarly. I haven’t built those analyses yet, so I haven’t actually paid the cost — but I expect I’d push the hot ones into a SQL view or a materialized aggregate before going full column-per-metric.
  
  What I do give up vs. your option C: simplicity of ingest. Each source needs an adapter that commits to an fts_content shape useful but not overwhelming for the search index. For high-volume telemetry that’s real upfront work. I’d argue you bought the same complexity later, just upstream of your RAG step.
  
  Net read: I think your “I wonder if I sacrificed capabilities” question is the right one to be asking. I’d flip it though — you’ve kept maximum flexibility because the source-of-truth files stay around. You can move any single source to a hybrid like mine, or to a full relational schema, at any time. Being able to defer that decision per-source is a stronger position than being committed to one approach across the whole archive.
  