sources.toml Reference

The sources.toml file defines which data sources the indexer processes. It’s located at data/config/sources.toml.

A minimal configuration looks like this:

[defaults]
schema_version = "1.0"
data_dir = "data/fixtures"   # or "data/sources" for production

[[sources]]
name = "pinboard"
enabled = true
db_path = "pinboard/chronicle/chronicle.db"
category = "reading"

[[sources.queries]]
table = "events"
entry_type = "bookmark"
action_type = "BookmarkAction"
object_type = "WebPage"
sql = """
SELECT
  source_id as external_id,
  json_extract(object, '$.name') as title,
  json_extract(object, '$.description') as content,
  end_time as occurred_at,
  json_extract(object, '$.url') as url
FROM events
WHERE type = 'BookmarkAction'
"""

The [defaults] section accepts these fields:

Field          | Required | Description
schema_version | No       | Config format version (currently "1.0")
data_dir       | No       | Base directory for source databases. Defaults to data/sources. Use data/fixtures for testing.

Each [[sources]] block defines a data source.

Field              | Required | Description
name               | Yes      | Unique identifier for the source
enabled            | No       | Whether to process this source. Defaults to true
db_path            | Yes      | Path to SQLite database, relative to data_dir
category           | No       | Category for filtering. See categories below
default_visibility | No       | Visibility for items without explicit visibility. Options: public, unlisted, private, secret

The category field accepts these values:

Category     | Use for
reading      | Bookmarks, highlights, RSS
music        | Listening history, scrobbles
social       | Posts, messages, replies
comms        | Email, chat, DMs
productivity | Tasks, time tracking
browse       | Browser history, searches
notes        | Notes, documents
photos       | Images, screenshots
location     | Check-ins, GPS logs
code         | Commits, issues, PRs
ai           | AI conversations
curation     | Collections, boards
video        | Watch history
calendar     | Events, meetings

Each source can have one or more [[sources.queries]] blocks.

Field       | Required | Description
table       | No       | Table name for validation (optional)
entry_type  | Yes      | Type of entry (e.g., "bookmark", "listen", "note")
action_type | No       | Schema.org action type (e.g., "BookmarkAction")
object_type | No       | Schema.org object type (e.g., "WebPage")
sql         | Yes      | SQL query to extract events

Your SQL query should return these columns:

Column      | Required | Description
external_id | Yes      | Unique ID within the source. Also accepts source_id or id
title       | Yes      | Display title
content     | No       | Full text for search indexing
occurred_at | Yes      | When the event happened. Also accepts timestamp, created_at, date, etc.
url         | No       | Link to original item
visibility  | No       | Item visibility level
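
For example, a query for a hypothetical likes table (the table and column names here are illustrative, not part of the default configuration) could map its columns and derive per-item visibility like this:

SELECT
  id as external_id,
  note as title,
  note as content,
  favorited_at as occurred_at,
  permalink as url,
  CASE WHEN is_private = 1 THEN 'private' ELSE 'public' END as visibility
FROM likes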

The indexer automatically parses many timestamp formats:

  • Unix timestamps: 1704067200 (seconds) or 1704067200000 (milliseconds)
  • ISO 8601: 2024-01-01T00:00:00Z
  • Date strings: 2024-01-01
  • Various formats: Jan 1, 2024, 01/01/2024, etc.

If parsing fails, the event uses the current time with a warning.
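
For instance, assuming the source table stores created_at as ISO 8601 text (a hypothetical column), either of these SELECT expressions yields a valid occurred_at:

created_at as occurred_at                  -- parsed as an ISO 8601 string
strftime('%s', created_at) as occurred_at  -- pre-converted to a Unix timestamp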

For high-volume sources (like time tracking), you can enable aggregation:

[[sources]]
name = "timing"
enabled = true
db_path = "timing/chronicle/chronicle.db"

[sources.aggregation]
strategy = "daily"
key_fields = ["title"]

Available strategies:

  • daily — Group events by date + key fields (e.g., 2.2M records → ~50K daily aggregates)
  • hourly — Group events by hour + key fields
  • session — Group events with gaps less than time_bucket seconds (default 30 minutes)

The [sources.aggregation] block accepts these fields:

Field       | Description
strategy    | Aggregation period: daily, hourly, session, or none
key_fields  | Fields to group by
time_bucket | For session strategy: gap threshold in seconds (default 1800 = 30 min)
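
As a sketch, a session-based configuration might look like this (the source name, database path, and 10-minute gap are illustrative assumptions, not defaults):

[[sources]]
name = "browser"
enabled = true
db_path = "browser/history.db"

[sources.aggregation]
strategy = "session"
key_fields = ["title"]
time_bucket = 600  # merge events separated by less than 10 minutes into one session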

The default configuration includes the following sources:

  • pinboard — Bookmarks from Pinboard
  • readwise — Highlights from Readwise
  • spotify — Listening history from Spotify
  • lastfm — Scrobbles from Last.fm
  • apple-podcasts — Podcast episodes
  • imessage — iMessage conversations
  • linkedin — LinkedIn messages
  • things — Tasks from Things app
  • timing — App usage from Timing
  • safari — Safari browsing history
  • chrome — Chrome history (from Google Takeout)
  • google-search — Search queries
  • apple-notes — Notes from Apple Notes
  • notion — Pages from Notion
  • apple-photos — Photos with metadata
  • foursquare — Check-ins
  • github — GitHub activity
  • claude — Claude conversation exports
  • arena — Are.na blocks and channels
  • twitter-* — Twitter/X archive (supports multiple accounts)
  • youtube — YouTube watch history
  • gcal — Google Calendar events

Here’s a complete example for adding a custom notes database:

[[sources]]
name = "my-notes"
enabled = true
db_path = "my-notes/notes.db"
category = "notes"
default_visibility = "private"

[[sources.queries]]
table = "notes"
entry_type = "note"
action_type = "CreateAction"
object_type = "NoteDigitalDocument"
sql = """
SELECT
  id as external_id,
  title,
  body as content,
  strftime('%s', created_at) as occurred_at,
  NULL as url
FROM notes
WHERE deleted_at IS NULL
ORDER BY created_at DESC
"""

Key points:

  • Use strftime('%s', ...) to convert datetime columns to Unix timestamps
  • Filter out deleted items in the WHERE clause
  • Return NULL for optional columns you don’t have
  • Set default_visibility to control indexing

Check your configuration before indexing:

cd packages/otso-indexer
cargo run --release -- validate

This checks:

  • TOML syntax
  • Required fields
  • Table existence
  • SQL syntax (via EXPLAIN)

Other useful commands:

# Build all enabled sources
cargo run --release -- build
# Build a specific source
cargo run --release -- build --source pinboard
# Rebuild search index only (skip event store)
cargo run --release -- build --meili-only
# Show statistics
cargo run --release -- stats
# Validate configuration
cargo run --release -- validate
# Full rebuild from event store
cargo run --release -- rebuild