mirror of
https://github.com/ArchiveBox/ArchiveBox
synced 2024-11-25 13:40:20 +00:00
# ArchiveBox UI

## Page: Getting Started

### What do you want to capture?

- Save some URLs now -> [Add page]
    - Paste some URLs to archive now
    - Upload a file containing URLs (bookmarks.html export, RSS.xml feed, markdown file, word doc, PDF, etc.)
    - Pull in URLs to archive from a remote location (e.g. RSS feed URL, remote TXT file, JSON file, etc.)

- Import URLs from a browser -> [Import page]
    - Desktop: Get the ArchiveBox Chrome/Firefox extension
    - Mobile: Get the ArchiveBox iOS App / Android App
    - Upload a bookmarks.html export file
    - Upload a browser_history.sqlite3 export file

- Import URLs from a 3rd party bookmarking service -> [Sync page]
    - Pocket
    - Pinboard
    - Instapaper
    - Wallabag
    - Zapier, N8N, IFTTT, etc.
    - Upload a bookmarks.html export, bookmarks.json, RSS, etc. file

- Archive URLs on a schedule -> [Schedule page]

- Archive an entire website -> [Crawl page]
    - What starting URL/domain?
    - How deep?
    - Follow links to external domains?
    - Follow links to parent URLs?
    - Maximum number of pages to save?
    - Maximum number of requests/minute?

- Crawl for URLs with a search engine and save automatically
-
- Some URLs on a schedule
- Save an entire website (e.g. `https://example.com`)
- Save results matching a search query (e.g. "site:example.com")
- Save a social media feed (e.g. `https://x.com/user/1234567890`)

--------------------------------------------------------------------------------

### Crawls App

- Archive an entire website -> [Crawl page]
    - What are the seed URLs?
    - How many hops to follow?
    - Follow links to external domains?
    - Follow links to parent URLs?
    - Maximum number of pages to save?
    - Maximum number of requests/minute?
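
A minimal sketch of how the crawl questions above could map onto a scope check. `CrawlConfig` and `should_follow` are hypothetical names for illustration, not existing ArchiveBox APIs, and the requests/minute limit is only recorded here, not enforced.

```python
from dataclasses import dataclass
from typing import List
from urllib.parse import urlparse


@dataclass
class CrawlConfig:
    """Hypothetical container for the answers to the crawl questions."""
    seed_urls: List[str]
    max_depth: int = 1                  # how many hops to follow
    follow_external: bool = False       # follow links to external domains?
    follow_parents: bool = False        # follow links to parent URLs?
    max_pages: int = 100                # maximum number of pages to save
    max_requests_per_minute: int = 60   # recorded only, not enforced in this sketch


def should_follow(config: CrawlConfig, seed_url: str, link: str,
                  depth: int, pages_saved: int) -> bool:
    """Return True if `link`, found `depth` hops from `seed_url`, is in scope."""
    if depth > config.max_depth or pages_saved >= config.max_pages:
        return False
    seed, target = urlparse(seed_url), urlparse(link)
    if target.scheme not in ("http", "https"):
        return False
    if target.netloc != seed.netloc:
        return config.follow_external
    # Same domain: treat a link as "below" the seed if its path starts with
    # the seed's directory; anything else counts as a parent/sibling URL.
    seed_dir = (seed.path or "/").rsplit("/", 1)[0] + "/"
    target_path = target.path or "/"
    return target_path.startswith(seed_dir) or config.follow_parents
```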

--------------------------------------------------------------------------------

### Scheduler App

- Archive URLs on a schedule -> [Schedule page]
    - What URL(s)?
    - How often?
    - Do you want to discard old snapshots after x amount of time?
    - Any filter rules?
    - Want to be notified when changes are detected? -> redirect to [Alerts app / create new alert (crawl=self)]

* Choose Schedule to check for new URLs: `Schedule.objects.get(pk=xyz)`
    - 1 minute
    - 5 minutes
    - 1 hour
    - 1 day

* Choose Destination Crawl to archive URLs with: `Crawl.objects.get(pk=xyz)`
    - Tags
    - Persona
    - Created By ID
    - Config
    - Filters
        - URL patterns to include
        - URL patterns to exclude
        - ONLY_NEW= Ignore URLs if already saved once / save URL each time it appears / only save if last save > x time ago
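
The interval choices and the three `ONLY_NEW` behaviours above could be sketched roughly as follows. All names here (`SCHEDULE_INTERVALS`, `OnlyNew`, `is_due`, `should_archive`) are assumptions for illustration; the outline does not define how these are stored or evaluated.

```python
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional

# Interval choices from the list above, keyed by hypothetical labels.
SCHEDULE_INTERVALS = {
    "1 minute": timedelta(minutes=1),
    "5 minutes": timedelta(minutes=5),
    "1 hour": timedelta(hours=1),
    "1 day": timedelta(days=1),
}


class OnlyNew(Enum):
    SKIP_IF_SAVED = "skip_if_saved"   # ignore URLs if already saved once
    ALWAYS_SAVE = "always_save"       # save the URL each time it appears
    SAVE_IF_STALE = "save_if_stale"   # only save if last save > max_age ago


def is_due(last_run: Optional[datetime], interval: str, now: datetime) -> bool:
    """A schedule is due if it never ran or its interval has elapsed."""
    return last_run is None or now - last_run >= SCHEDULE_INTERVALS[interval]


def should_archive(policy: OnlyNew, last_saved: Optional[datetime],
                   now: datetime, max_age: timedelta = timedelta(days=1)) -> bool:
    """Decide whether a URL seen again should be archived under `policy`."""
    if last_saved is None or policy is OnlyNew.ALWAYS_SAVE:
        return True
    if policy is OnlyNew.SKIP_IF_SAVED:
        return False
    return now - last_saved > max_age
```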

--------------------------------------------------------------------------------

### Sources App (For managing sources that ArchiveBox pulls URLs in from)

- Add a new source to pull URLs in from (WIZARD)
    - Choose URI:
        - [x] Web UI
        - [x] CLI
        - Local filesystem path (directory to monitor for new files containing URLs)
        - Remote URL (RSS/JSON/XML feed)
        - Chrome browser profile sync (log in using Gmail to pull bookmarks/history)
        - Pocket, Pinboard, Instapaper, Wallabag, etc.
        - Zapier, N8N, IFTTT, etc.
        - Local server filesystem path (directory to monitor for new files containing URLs)
        - Google Drive (directory to monitor for new files containing URLs)
        - Remote server FTP/SFTP/SCP path (directory to monitor for new files containing URLs)
        - AWS/S3/B2/GCP bucket (directory to monitor for new files containing URLs)
        - XBrowserSync (log in to pull bookmarks)
    - Choose extractor
        - auto
        - RSS
        - Pocket
        - etc.
    - Specify extra Config, e.g.
        - credentials
        - extractor tuning options (e.g. verify_ssl, cookies, etc.)

- Provide credentials for the source
    - API Key
    - Username / Password
    - OAuth
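
As a rough illustration of the "Choose URI" step, a wizard could guess the source kind from the URI scheme before asking for an extractor and credentials. `classify_source` and its return labels are hypothetical, and the scheme-to-kind mapping is an assumption; service integrations (Pocket, Zapier, etc.) would need their own handling.

```python
from urllib.parse import urlparse


def classify_source(uri: str) -> str:
    """Guess a source kind from the URI scheme (labels are illustrative only)."""
    scheme = urlparse(uri).scheme.lower()
    if scheme in ("http", "https"):
        return "remote_url"        # e.g. RSS/JSON/XML feed
    if scheme in ("ftp", "sftp", "scp"):
        return "remote_server"     # directory on a remote server
    if scheme in ("s3", "b2", "gs"):
        return "bucket"            # object storage bucket
    if scheme in ("", "file"):
        return "local_path"        # directory on the local filesystem
    return "unknown"
```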

--------------------------------------------------------------------------------

### Alerts App

- Create a new alert, choose condition
    - Get notified when a site goes down (<x% success ratio for Snapshots)
    - Get notified when a site changes visually more than x% (screenshot diff)
    - Get notified when a site's text content changes more than x% (text diff)
    - Get notified when a keyword appears
    - Get notified when a keyword disappears
    - When an AI prompt returns some result

- Choose alert threshold:
    - any condition is met
    - all conditions are met
    - condition is met for x% of URLs
    - condition is met for x% of time

- Choose how to notify: (List[AlertDestination])
    - maximum alert frequency
    - destination type: email / Slack / Webhook / Google Sheet / logfile
    - destination info:
        - email address(es)
        - Slack channel
        - Webhook URL

- Choose scope:
    - Choose ArchiveResult scope (extractors): (a query that returns ArchiveResult.objects QuerySet)
        - All extractors
        - Only screenshots
        - Only readability / mercury text
        - Only video
        - Only html
        - Only headers

    - Choose Snapshot scope (URL): (a query that returns Snapshot.objects QuerySet)
        - All domains
        - Specific domain
        - All domains in a tag
        - All domains in a tag category
        - All URLs matching a certain regex pattern

    - Choose crawl scope: (a query that returns Crawl.objects QuerySet)
        - All crawls
        - Specific crawls
        - crawls by a certain user
        - crawls using a certain persona
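
The text-diff and keyword conditions, and the any / all / x% thresholds listed above, could be evaluated along these lines. This is only a sketch: the function names are hypothetical, and using `difflib.SequenceMatcher` as the measure of "changes more than x%" is an assumption, not something the outline specifies.

```python
import difflib
from typing import Sequence


def text_change_percent(old: str, new: str) -> float:
    """Percent of text that changed, as 100 * (1 - similarity ratio)."""
    ratio = difflib.SequenceMatcher(None, old, new).ratio()
    return (1.0 - ratio) * 100.0


def keyword_appeared(old: str, new: str, keyword: str) -> bool:
    return keyword not in old and keyword in new


def keyword_disappeared(old: str, new: str, keyword: str) -> bool:
    return keyword in old and keyword not in new


def threshold_met(results: Sequence[bool], mode: str, min_percent: float = 100.0) -> bool:
    """Combine per-condition (or per-URL) results using the chosen threshold mode."""
    if not results:
        return False
    if mode == "any":
        return any(results)
    if mode == "all":
        return all(results)
    if mode == "percent":
        # e.g. "condition is met for x% of URLs"
        return 100.0 * sum(results) / len(results) >= min_percent
    raise ValueError(f"unknown threshold mode: {mode!r}")
```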

class AlertDestination(models.Model):
    destination_type: [email, slack, webhook, google_sheet, local logfile, b2/s3/gcp bucket, etc.]
    maximum_frequency
    filter_rules
    credentials
    alert_template: JINJA2 json/text template that gets populated with alert contents
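
A standalone sketch of how `maximum_frequency` and `alert_template` might interact. It is a plain dataclass rather than a Django model, `AlertDestinationSketch` and `try_send` are hypothetical names, and the standard library's `string.Template` stands in for the Jinja2 template named above so the example runs without extra dependencies.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from string import Template
from typing import Optional


@dataclass
class AlertDestinationSketch:
    destination_type: str                 # e.g. "email", "slack", "webhook"
    maximum_frequency: timedelta          # minimum gap between two alerts
    alert_template: str                   # string.Template syntax, standing in for Jinja2
    last_sent: Optional[datetime] = None

    def render(self, **context) -> str:
        """Populate the template with the alert contents."""
        return Template(self.alert_template).substitute(**context)

    def try_send(self, now: datetime, **context) -> Optional[str]:
        """Return the rendered alert, or None if it would exceed maximum_frequency."""
        if self.last_sent is not None and now - self.last_sent < self.maximum_frequency:
            return None
        self.last_sent = now
        return self.render(**context)
```

Here `try_send` only renders the message; actual delivery (SMTP, Slack API, webhook POST) and the `filter_rules` and `credentials` fields are left out of the sketch.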