ArchiveBox/archivebox/Architecture.md

# ArchiveBox UI

## Page: Getting Started

### What do you want to capture?

- Save some URLs now -> [Add page]
    - Paste some URLs to archive now
    - Upload a file containing URLs (bookmarks.html export, RSS.xml feed, markdown file, word doc, PDF, etc.)
    - Pull in URLs to archive from a remote location (e.g. RSS feed URL, remote TXT file, JSON file, etc.)

- Import URLs from a browser -> [Import page]
    - Desktop: Get the ArchiveBox Chrome/Firefox extension
    - Mobile: Get the ArchiveBox iOS App / Android App
    - Upload a bookmarks.html export file
    - Upload a browser_history.sqlite3 export file

- Import URLs from a 3rd party bookmarking service -> [Sync page]
    - Pocket
    - Pinboard
    - Instapaper
    - Wallabag
    - Zapier, N8N, IFTTT, etc.
    - Upload a bookmarks.html export, bookmarks.json, RSS, etc. file

- Archive URLs on a schedule -> [Schedule page]

- Archive an entire website -> [Crawl page]
    - What starting URL/domain?
    - How deep?
    - Follow links to external domains?
    - Follow links to parent URLs?
    - Maximum number of pages to save?
    - Maximum number of requests/minute?

- Crawl for URLs with a search engine and save automatically
    - 
- Some URLs on a schedule
- Save an entire website (e.g. `https://example.com`)
- Save results matching a search query (e.g. "site:example.com")
- Save a social media feed (e.g. `https://x.com/user/1234567890`)

--------------------------------------------------------------------------------

### Crawls App

- Archive an entire website -> [Crawl page]
    - What are the seed URLs?
    - How many hops to follow?
    - Follow links to external domains?
    - Follow links to parent URLs?
    - Maximum number of pages to save?
    - Maximum number of requests/minute?


--------------------------------------------------------------------------------

### Scheduler App


- Archive URLs on a schedule -> [Schedule page]
    - What URL(s)?
    - How often?
    - Do you want to discard old snapshots after x amount of time?
    - Any filter rules?
    - Want to be notified when changes are detected -> redirect[Alerts app/create new alert(crawl=self)]


* Choose Schedule check for new URLs: Schedule.objects.get(pk=xyz)
    - 1 minute
    - 5 minutes
    - 1 hour
    - 1 day

    * Choose Destination Crawl to archive URLs using : Crawl.objects.get(pk=xyz)
        - Tags
        - Persona
        - Created By ID
        - Config
        - Filters
            - URL patterns to include
            - URL patterns to exclude
            - ONLY_NEW= Ignore URLs if already saved once / save URL each time it appears / only save is last save > x time ago


--------------------------------------------------------------------------------

### Sources App (For managing sources that ArchiveBox pulls URLs in from)

- Add a new source to pull URLs in from (WIZARD)
    - Choose URI:
        - [x] Web UI
        - [x] CLI
        - Local filesystem path (directory to monitor for new files containing URLs)
        - Remote URL (RSS/JSON/XML feed)
        - Chrome browser profile sync (login using gmail to pull bookmarks/history)
        - Pocket, Pinboard, Instapaper, Wallabag, etc.
        - Zapier, N8N, IFTTT, etc.
        - Local server filesystem path (directory to monitor for new files containing URLs)
        - Google drive (directory to monitor for new files containing URLs)
        - Remote server FTP/SFTP/SCP path (directory to monitor for new files containing URLs)
        - AWS/S3/B2/GCP bucket (directory to monitor for new files containing URLs)
        - XBrowserSync (login to pull bookmarks)
    - Choose extractor
        - auto
        - RSS
        - Pocket
        - etc.
    - Specify extra Config, e.g.
        - credentials
        - extractor tuning options (e.g. verify_ssl, cookies, etc.)

- Provide credentials for the source
    - API Key
    - Username / Password
    - OAuth

--------------------------------------------------------------------------------

### Alerts App

- Create a new alert, choose condition
    - Get notified when a site goes down (<x% success ratio for Snapshots)
    - Get notified when a site changes visually more than x% (screenshot diff)
    - Get notified when a site's text content changes more than x% (text diff)
    - Get notified when a keyword appears
    - Get notified when a keyword dissapears
    - When an AI prompt returns some result

- Choose alert threshold:
    - any condition is met
    - all conditions are met
    - condition is met for x% of URLs
    - condition is met for x% of time

- Choose how to notify: (List[AlertDestination])
    - maximum alert frequency
    - destination type: email / Slack / Webhook / Google Sheet / logfile
    - destination info:
        - email address(es)
        - Slack channel
        - Webhook URL

- Choose scope:
    - Choose ArchiveResult scope (extractors): (a query that returns ArchiveResult.objects QuerySet)
        - All extractors
        - Only screenshots
        - Only readability / mercury text
        - Only video
        - Only html
        - Only headers

    - Choose Snapshot scope (URL): (a query that returns Snapshot.objects QuerySet)
        - All domains
        - Specific domain
        - All domains in a tag
        - All domains in a tag category
        - All URLs matching a certain regex pattern

    - Choose crawl scope: (a query that returns Crawl.objects QuerySet)
        - All crawls
        - Specific crawls
        - crawls by a certain user
        - crawls using a certain persona


class AlertDestination(models.Model):
    destination_type: [email, slack, webhook, google_sheet, local logfile, b2/s3/gcp bucket, etc.]
    maximum_frequency
    filter_rules
    credentials
    alert_template: JINJA2 json/text template that gets populated with alert contents
add architecture mockup 2024-10-15 08:03:17 +00:00			`# ArchiveBox UI`

			`## Page: Getting Started`

			`### What do you want to capture?`

			`- Save some URLs now -> [Add page]`
			`- Paste some URLs to archive now`
			`- Upload a file containing URLs (bookmarks.html export, RSS.xml feed, markdown file, word doc, PDF, etc.)`
			`- Pull in URLs to archive from a remote location (e.g. RSS feed URL, remote TXT file, JSON file, etc.)`

			`- Import URLs from a browser -> [Import page]`
			`- Desktop: Get the ArchiveBox Chrome/Firefox extension`
			`- Mobile: Get the ArchiveBox iOS App / Android App`
			`- Upload a bookmarks.html export file`
			`- Upload a browser_history.sqlite3 export file`

			`- Import URLs from a 3rd party bookmarking service -> [Sync page]`
			`- Pocket`
			`- Pinboard`
			`- Instapaper`
			`- Wallabag`
			`- Zapier, N8N, IFTTT, etc.`
			`- Upload a bookmarks.html export, bookmarks.json, RSS, etc. file`

			`- Archive URLs on a schedule -> [Schedule page]`

			`- Archive an entire website -> [Crawl page]`
			`- What starting URL/domain?`
			`- How deep?`
			`- Follow links to external domains?`
			`- Follow links to parent URLs?`
			`- Maximum number of pages to save?`
			`- Maximum number of requests/minute?`

			`- Crawl for URLs with a search engine and save automatically`
			`-`
			`- Some URLs on a schedule`
			- Save an entire website (e.g. `https://example.com`)
			`- Save results matching a search query (e.g. "site:example.com")`
			- Save a social media feed (e.g. `https://x.com/user/1234567890`)

			`--------------------------------------------------------------------------------`

			`### Crawls App`

			`- Archive an entire website -> [Crawl page]`
			`- What are the seed URLs?`
			`- How many hops to follow?`
			`- Follow links to external domains?`
			`- Follow links to parent URLs?`
			`- Maximum number of pages to save?`
			`- Maximum number of requests/minute?`


			`--------------------------------------------------------------------------------`

			`### Scheduler App`


			`- Archive URLs on a schedule -> [Schedule page]`
			`- What URL(s)?`
			`- How often?`
			`- Do you want to discard old snapshots after x amount of time?`
			`- Any filter rules?`
			`- Want to be notified when changes are detected -> redirect[Alerts app/create new alert(crawl=self)]`


			`* Choose Schedule check for new URLs: Schedule.objects.get(pk=xyz)`
			`- 1 minute`
			`- 5 minutes`
			`- 1 hour`
			`- 1 day`

			`* Choose Destination Crawl to archive URLs using : Crawl.objects.get(pk=xyz)`
			`- Tags`
			`- Persona`
			`- Created By ID`
			`- Config`
			`- Filters`
			`- URL patterns to include`
			`- URL patterns to exclude`
			`- ONLY_NEW= Ignore URLs if already saved once / save URL each time it appears / only save is last save > x time ago`


			`--------------------------------------------------------------------------------`

			`### Sources App (For managing sources that ArchiveBox pulls URLs in from)`

			`- Add a new source to pull URLs in from (WIZARD)`
			`- Choose URI:`
			`- [x] Web UI`
			`- [x] CLI`
			`- Local filesystem path (directory to monitor for new files containing URLs)`
			`- Remote URL (RSS/JSON/XML feed)`
			`- Chrome browser profile sync (login using gmail to pull bookmarks/history)`
			`- Pocket, Pinboard, Instapaper, Wallabag, etc.`
			`- Zapier, N8N, IFTTT, etc.`
			`- Local server filesystem path (directory to monitor for new files containing URLs)`
			`- Google drive (directory to monitor for new files containing URLs)`
			`- Remote server FTP/SFTP/SCP path (directory to monitor for new files containing URLs)`
			`- AWS/S3/B2/GCP bucket (directory to monitor for new files containing URLs)`
			`- XBrowserSync (login to pull bookmarks)`
			`- Choose extractor`
			`- auto`
			`- RSS`
			`- Pocket`
			`- etc.`
			`- Specify extra Config, e.g.`
			`- credentials`
			`- extractor tuning options (e.g. verify_ssl, cookies, etc.)`

			`- Provide credentials for the source`
			`- API Key`
			`- Username / Password`
			`- OAuth`

			`--------------------------------------------------------------------------------`

			`### Alerts App`

			`- Create a new alert, choose condition`
			`- Get notified when a site goes down (<x% success ratio for Snapshots)`
			`- Get notified when a site changes visually more than x% (screenshot diff)`
			`- Get notified when a site's text content changes more than x% (text diff)`
			`- Get notified when a keyword appears`
			`- Get notified when a keyword dissapears`
			`- When an AI prompt returns some result`

			`- Choose alert threshold:`
			`- any condition is met`
			`- all conditions are met`
			`- condition is met for x% of URLs`
			`- condition is met for x% of time`

			`- Choose how to notify: (List[AlertDestination])`
			`- maximum alert frequency`
			`- destination type: email / Slack / Webhook / Google Sheet / logfile`
			`- destination info:`
			`- email address(es)`
			`- Slack channel`
			`- Webhook URL`

			`- Choose scope:`
			`- Choose ArchiveResult scope (extractors): (a query that returns ArchiveResult.objects QuerySet)`
			`- All extractors`
			`- Only screenshots`
			`- Only readability / mercury text`
			`- Only video`
			`- Only html`
			`- Only headers`

			`- Choose Snapshot scope (URL): (a query that returns Snapshot.objects QuerySet)`
			`- All domains`
			`- Specific domain`
			`- All domains in a tag`
			`- All domains in a tag category`
			`- All URLs matching a certain regex pattern`

			`- Choose crawl scope: (a query that returns Crawl.objects QuerySet)`
			`- All crawls`
			`- Specific crawls`
			`- crawls by a certain user`
			`- crawls using a certain persona`


			`class AlertDestination(models.Model):`
			`destination_type: [email, slack, webhook, google_sheet, local logfile, b2/s3/gcp bucket, etc.]`
			`maximum_frequency`
			`filter_rules`
			`credentials`
			`alert_template: JINJA2 json/text template that gets populated with alert contents`