From 092e0b6dfa362dd95d3f3143dd0a5e47af4d170f Mon Sep 17 00:00:00 2001
From: Nick Sweeting
Date: Thu, 25 Jan 2024 08:08:36 -0800
Subject: [PATCH 1/2] Update README.md

---
 README.md | 366 +++++++++++++++++++++++++++++------------------------
 1 file changed, 199 insertions(+), 167 deletions(-)

diff --git a/README.md b/README.md
index 625ca8d5..85d42cee 100644
--- a/README.md
+++ b/README.md
@@ -70,31 +70,50 @@ The goal is to sleep soundly knowing the part of the internet you care about wil
-**📦  Get ArchiveBox with `docker` / `apt` / `brew` / `pip3` / `nix` / etc. ([see Quickstart below](#quickstart)).**
+**📦  Install ArchiveBox using your preferred method: `docker` / `apt` / `brew` / `pip3` / `nix` / etc. ([see Quickstart below](#quickstart)).**
-```bash
-# Get ArchiveBox with Docker or Docker Compose (recommended)
+
Quick reference   ⤵️ +
+
# Get ArchiveBox with Docker Compose (recommended)
+curl -O 'https://raw.githubusercontent.com/ArchiveBox/ArchiveBox/dev/docker-compose.yml'
+docker compose up
+
+
# Or use it as a single Docker container
 docker run -v $PWD/data:/data -p 8000:8000 -it archivebox/archivebox
-
-# Or install with your preferred package manager (see Quickstart below for apt, brew, and more)
+
+
# Or install with your preferred package manager (see Quickstart below for apt, brew, and more)
 pip install archivebox
+
+
# Or use the optional auto setup script to install it
+curl -sSL 'https://get.archivebox.io' | sh
+
+
+
+**🔢 Create a new directory to hold your data**
+```bash
+mkdir ~/archivebox; cd ~/archivebox
+archivebox init --setup  # or: setup config via docker-compose.yml
-# Or use the optional auto setup script to install it
-curl -sSL 'https://get.archivebox.io' | sh
 ```
-**🔢 Example usage: adding links to archive.**
+**🔢 Next steps: start the ArchiveBox server to use the Web UI**
+```bash
+archivebox server 0.0.0.0:8000  # or: docker compose up
+```
+Then open http://localhost:8000 to see it ➡️
+
+**🔢 Or use the CLI to archive links & manage your collection**
 ```bash
 archivebox add 'https://example.com'  # add URLs one at a time
 archivebox add < ~/Downloads/bookmarks.json  # or pipe in URLs in any text-based format
 archivebox schedule --every=day --depth=1 https://example.com/rss.xml  # or auto-import URLs regularly on a schedule
+# or: docker compose run archivebox add ... etc. ...
+
+archivebox list 'https://example.com'  # use the CLI commands (--help for more)
+ls ./archive/*/index.json  # or browse your Snapshots via the filesystem
 ```
-**🔢 Example usage: viewing the archived content.**
-```bash
-archivebox server 0.0.0.0:8000  # use the interactive web UI
-archivebox list 'https://example.com'  # use the CLI commands (--help for more)
-ls ./archive/*/index.json  # or browse directly via the filesystem
-```
+


@@ -214,6 +233,38 @@ See "Against curl | sh as a
 #### 🛠  Package Manager Setup
+
+
+
+Pip pip (macOS/Linux/BSD) +
+
    + +
  1. Install Python >= v3.10 and Node >= v18 on your system (if not already installed).
  2. +
  3. Install the ArchiveBox package using pip3. +
    pip3 install archivebox
    +
    +
  4. +
  5. Create a new empty directory and initialize your collection (can be anywhere). +
    mkdir ~/archivebox && cd ~/archivebox
    +archivebox init --setup
    +# install any missing extras like wget/git/ripgrep/etc. manually as needed
    +
    +
  6. +
  7. Optional: Start the server then login to the Web UI http://127.0.0.1:8000 ⇢ Admin. +
    archivebox server 0.0.0.0:8000
    +# completely optional, CLI can always be used without running a server
    +# archivebox [subcommand] [--args]
    +
    +
  8. +
+ +See below for more usage examples using the CLI, Web UI, or filesystem/SQL/Python to manage your archive.
+See the pip-archivebox repo for more details about this distribution. +

+
+ +
aptitude apt (Ubuntu/Debian)
@@ -276,35 +327,6 @@ See the homebr

-
-Pip pip (macOS/Linux/BSD) -
-
    - -
  1. Install Python >= v3.9 and Node >= v18 on your system (if not already installed).
  2. -
  3. Install the ArchiveBox package using pip3. -
    pip3 install archivebox
    -
    -
  4. -
  5. Create a new empty directory and initialize your collection (can be anywhere). -
    mkdir ~/archivebox && cd ~/archivebox
    -archivebox init --setup
    -# install any missing extras like wget/git/ripgrep/etc. manually as needed
    -
    -
  6. -
  7. Optional: Start the server then login to the Web UI http://127.0.0.1:8000 ⇢ Admin. -
    archivebox server 0.0.0.0:8000
    -# completely optional, CLI can always be used without running a server
    -# archivebox [subcommand] [--args]
    -
    -
  8. -
- -See below for more usage examples using the CLI, Web UI, or filesystem/SQL/Python to manage your archive.
-See the pip-archivebox repo for more details about this distribution. -

-
-
Arch pacman / FreeBSD pkg / Nix nix (Arch/FreeBSD/NixOS/more)
@@ -343,7 +365,7 @@ See below for usage examples using the CLI, W
✨ Alpha (contributors wanted!): for more info, see the: Electron ArchiveBox repo. -
+
@@ -424,117 +446,119 @@ mkdir -p ~/archivebox/data  # create a new data dir anywhere
 cd ~/archivebox/data  # IMPORTANT: cd into the directory
 # archivebox [subcommand] [--args]
+archivebox help
+# or
+docker compose run archivebox help
 ```
+#### ArchiveBox Subcommands
+
+- `archivebox` `help`/`version` to see the list of available subcommands and currently installed version info
+- `archivebox` `setup`/`init`/`config`/`status`/`manage` to administer your collection
+- `archivebox` `add`/`schedule`/`remove`/`update`/`list`/`shell`/`oneshot` to manage Snapshots in the archive
+- `archivebox` `schedule` to pull in fresh URLs regularly from [bookmarks/history/Pocket/Pinboard/RSS/etc.](#input-formats)
+
+
+
+curl sh automatic setup script CLI Usage Examples (non-Docker) +
+

+archivebox init --setup      # safe to run init multiple times (also how you update versions)
+archivebox version           # get archivebox version info + check dependencies
+archivebox help              # get list of archivebox subcommands that can be run
+archivebox add --depth=1 'https://news.ycombinator.com'
+
+
+ +
+ +
+Docker Docker Compose CLI Usage Examples +
+

+# make sure you have `docker-compose.yml` from the Quickstart instructions first
+docker compose run archivebox init --setup
+docker compose run archivebox version
+docker compose run archivebox help
+docker compose run archivebox add --depth=1 'https://news.ycombinator.com'
+# to start webserver: docker compose up
+
+
+ +
+ +
+Docker Docker CLI Usage Examples +
+

+docker run -v $PWD:/data -it archivebox/archivebox init --setup
+docker run -v $PWD:/data -it archivebox/archivebox version
+docker run -v $PWD:/data -it archivebox/archivebox help
+docker run -v $PWD:/data -it archivebox/archivebox add --depth=1 'https://news.ycombinator.com'
+# to start webserver: docker run -v $PWD:/data -it -p 8000:8000 archivebox/archivebox
+
+
+ +
+ +
+🗄  SQL/Python/Filesystem Usage +

+sqlite3 ./index.sqlite3    # run SQL queries on your index
+archivebox shell           # explore the Python API in a REPL
+ls ./archive/*/index.html  # or inspect snapshots on the filesystem
+
+
+ + +
+ +
+🖥  Web UI Usage +

+# Start the server on bare metal (pip/apt/brew/etc):
+archivebox manage createsuperuser              # create a new admin user via CLI
+archivebox server 0.0.0.0:8000                 # start the server
+
+# Or with Docker Compose:
+nano docker-compose.yml  # setup initial ADMIN_USERNAME & ADMIN_PASSWORD
+docker compose up  # start the server
+
+# Or with a Docker container:
+docker run -v $PWD:/data -it archivebox/archivebox archivebox manage createsuperuser
+docker run -v $PWD:/data -it -p 8000:8000 archivebox/archivebox
+
+ +
Optional: Change permissions to allow non-logged-in users
+ +

+# OPTIONAL
+archivebox config --set PUBLIC_ADD_VIEW=True   # allow guests to submit URLs 
+archivebox config --set PUBLIC_SNAPSHOTS=True  # allow guests to see snapshot content
+archivebox config --set PUBLIC_INDEX=True      # allow guests to see list of all snapshots
+
+# restart the server to apply any config changes
+
+
+ +
+
+
 > [!TIP]
 > Whether in Docker or not, ArchiveBox commands all work the same way, and can be used in tandem to access the same data directory.
 > For example, you can run the Web UI in Docker Compose, and run one-off commands on host with `pip`-installed ArchiveBox or in Docker interchangeably.
-Expand to show examples...
+Expand to show comparison...

-docker compose up -d                                      # start the Web UI server in the background
-docker compose run archivebox add 'https://example.com'   # add a test URL to snapshot w/ Docker Compose
-
-archivebox list 'https://example.com'                     # fetch it with pip-installed archivebox on the host
-docker compose run archivebox list 'https://example.com'                       # or w/ Docker Compose
-docker run -it -v $PWD:/data archivebox/archivebox list 'https://example.com'  # or w/ Docker, all equivalent
+archivebox add --depth=1 'https://example.com'                     # add a URL with pip-installed archivebox on the host
+docker compose run archivebox add --depth=1 'https://example.com'                       # or w/ Docker Compose
+docker run -it -v $PWD:/data archivebox/archivebox add --depth=1 'https://example.com'  # or w/ Docker, all equivalent
 
-
-##### Bare Metal Usage (`pip`/`apt`/`brew`/etc.) - -
-
-Click to expand... -
- -

-archivebox init --setup      # safe to run init multiple times (also how you update versions)
-archivebox version           # get archivebox version info and more
-archivebox add --depth=1 'https://news.ycombinator.com'
-
- -
-
- -##### Docker Compose Usage - -
-
-Click to expand... -
- -

-# make sure you have `docker-compose.yml` from the Quickstart instructions first
-docker compose run archivebox init --setup
-docker compose run archivebox version
-docker compose run archivebox add --depth=1 'https://news.ycombinator.com'
-
- -
-
- -##### Docker Usage - -
-
-Click to expand... -
- -

-docker run -v $PWD:/data -it archivebox/archivebox init --setup
-docker run -v $PWD:/data -it archivebox/archivebox version
-
- -
-
-#### Next Steps
-
-- `archivebox help/version` to see the list of available subcommands and currently installed version info
-- `archivebox setup/init/config/status/manage` to administer your collection
-- `archivebox add/schedule/remove/update/list/shell/oneshot` to manage Snapshots in the archive
-- `archivebox schedule` to pull in fresh URLs regularly from [bookmarks/history/Pocket/Pinboard/RSS/etc.](#input-formats)
-
-
-#### 🖥  Web UI Usage
-
-##### Start the Web Server
-```bash
-# Bare metal (pip/apt/brew/etc):
-archivebox server 0.0.0.0:8000  # open http://127.0.0.1:8000 to view it
-
-# Docker Compose:
-docker compose up
-
-# Docker:
-docker run -v $PWD:/data -it -p 8000:8000 archivebox/archivebox
-```
-
-##### Allow Public Access or Create an Admin User
-```bash
-archivebox manage createsuperuser  # create a new admin username & pass
-# OR  # OR
-archivebox config --set PUBLIC_ADD_VIEW=True  # allow guests to submit URLs
-archivebox config --set PUBLIC_SNAPSHOTS=True  # allow guests to see snapshot content
-archivebox config --set PUBLIC_INDEX=True  # allow guests to see list of all snapshots
-
-# restart the server to apply any config changes
-```
-
-*Docker hint:* Set the [`ADMIN_USERNAME` & `ADMIN_PASSWORD`)](https://github.com/ArchiveBox/ArchiveBox/wiki/Configuration#admin_username--admin_password) env variables to auto-create an admin user on first-run.
-
-#### 🗄  SQL/Python/Filesystem Usage
-
-```bash
-sqlite3 ./index.sqlite3  # run SQL queries on your index
-archivebox shell  # explore the Python API in a REPL
-ls ./archive/*/index.html  # or inspect snapshots on the filesystem
-```
@@ -555,25 +579,28 @@ ls ./archive/*/index.html  # or inspect snapshots on the filesystem
 ---
-lego +lego

 # Overview
-## Input Formats
+
-ArchiveBox supports many input formats for URLs, including Pocket & Pinboard exports, Browser bookmarks, Browser history, plain text, HTML, markdown, and more!
+## Input Formats: How to pass URLs into ArchiveBox for saving
-*Click these links for instructions on how to prepare your links from these sources:*
+- The official ArchiveBox Browser Extension (provides realtime archiving from Chrome/Chromium/Firefox browsers)
+
+- Manual imports of URLs from RSS, JSON, CSV, TXT, SQL, HTML, Markdown, or [any other text-based format...](https://github.com/ArchiveBox/ArchiveBox/wiki/Usage#Import-a-list-of-URLs-from-a-text-file)
+
+- [MITM Proxy](https://mitmproxy.org/) archiving with [`archivebox-proxy`](https://github.com/ArchiveBox/archivebox-proxy) ([realtime archiving](https://github.com/ArchiveBox/ArchiveBox/issues/577) of all traffic from any device going through the proxy)
+
+- Exported [browser history](https://github.com/ArchiveBox/ArchiveBox/wiki/Quickstart#2-get-your-list-of-urls-to-archive) or [browser bookmarks](https://github.com/ArchiveBox/ArchiveBox/wiki/Quickstart#2-get-your-list-of-urls-to-archive) (see instructions for: [Chrome](https://support.google.com/chrome/answer/96816?hl=en), [Firefox](https://support.mozilla.org/en-US/kb/export-firefox-bookmarks-to-backup-or-transfer), [Safari](https://github.com/ArchiveBox/ArchiveBox/assets/511499/24ad068e-0fa6-41f4-a7ff-4c26fc91f71a), [IE](https://support.microsoft.com/en-us/help/211089/how-to-import-and-export-the-internet-explorer-favorites-folder-to-a-32-bit-version-of-windows), [Opera](https://help.opera.com/en/latest/features/#bookmarks:~:text=Click%20the%20import/-,export%20button,-on%20the%20bottom), [and more...](https://github.com/ArchiveBox/ArchiveBox/wiki/Quickstart#2-get-your-list-of-urls-to-archive))
+
+- Links from [Pocket](https://getpocket.com/export), [Pinboard](https://pinboard.in/export/), [Instapaper](https://www.instapaper.com/user), [Shaarli](https://shaarli.readthedocs.io/en/master/Usage/#importexport), [Delicious](https://www.groovypost.com/howto/howto/export-delicious-bookmarks-xml/), [Reddit Saved](https://github.com/csu/export-saved-reddit), [Wallabag](https://doc.wallabag.org/en/user/import/wallabagv2.html), [Unmark.it](http://help.unmark.it/import-export), [OneTab](https://www.addictivetips.com/web/onetab-save-close-all-chrome-tabs-to-restore-export-or-import/), [Firefox Sync](https://github.com/ArchiveBox/ArchiveBox/issues/648), [and more...](https://github.com/ArchiveBox/ArchiveBox/wiki/Quickstart#2-get-your-list-of-urls-to-archive)
-- TXT, RSS, XML, JSON, CSV, SQL, HTML, Markdown, or [any other text-based format...](https://github.com/ArchiveBox/ArchiveBox/wiki/Usage#Import-a-list-of-URLs-from-a-text-file)
-- [Browser history](https://github.com/ArchiveBox/ArchiveBox/wiki/Quickstart#2-get-your-list-of-urls-to-archive) or [browser bookmarks](https://github.com/ArchiveBox/ArchiveBox/wiki/Quickstart#2-get-your-list-of-urls-to-archive) (see instructions for: [Chrome](https://support.google.com/chrome/answer/96816?hl=en), [Firefox](https://support.mozilla.org/en-US/kb/export-firefox-bookmarks-to-backup-or-transfer), [Safari](https://github.com/ArchiveBox/ArchiveBox/assets/511499/24ad068e-0fa6-41f4-a7ff-4c26fc91f71a), [IE](https://support.microsoft.com/en-us/help/211089/how-to-import-and-export-the-internet-explorer-favorites-folder-to-a-32-bit-version-of-windows), [Opera](https://help.opera.com/en/latest/features/#bookmarks:~:text=Click%20the%20import/-,export%20button,-on%20the%20bottom), [and more...](https://github.com/ArchiveBox/ArchiveBox/wiki/Quickstart#2-get-your-list-of-urls-to-archive))
-- Browser extension [`archivebox-exporter`](https://github.com/ArchiveBox/archivebox-extension) (realtime archiving from Chrome/Chromium/Firefox)
-- [Pocket](https://getpocket.com/export), [Pinboard](https://pinboard.in/export/), [Instapaper](https://www.instapaper.com/user), [Shaarli](https://shaarli.readthedocs.io/en/master/Usage/#importexport), [Delicious](https://www.groovypost.com/howto/howto/export-delicious-bookmarks-xml/), [Reddit Saved](https://github.com/csu/export-saved-reddit), [Wallabag](https://doc.wallabag.org/en/user/import/wallabagv2.html), [Unmark.it](http://help.unmark.it/import-export), [OneTab](https://www.addictivetips.com/web/onetab-save-close-all-chrome-tabs-to-restore-export-or-import/), [Firefox Sync](https://github.com/ArchiveBox/ArchiveBox/issues/648), [and more...](https://github.com/ArchiveBox/ArchiveBox/wiki/Quickstart#2-get-your-list-of-urls-to-archive)
-- Proxy archiving with [`archivebox-proxy`](https://github.com/ArchiveBox/archivebox-proxy) ([realtime archiving](https://github.com/ArchiveBox/ArchiveBox/issues/577) of all traffic from any browser or device)
@@ -599,13 +626,17 @@ It also includes a built-in scheduled import feature with `archivebox schedule`
-## Output Formats
+
+
+
+## Output Formats: What ArchiveBox saves for each URL
+
 Inside each Snapshot folder, ArchiveBox saves these different types of extractor outputs as plain files:
-`./archive/TIMESTAMP/*`
+`./archive/{Snapshot.id}/`
 - **Index:** `index.html` & `index.json` HTML and JSON index files containing metadata and details
 - **Title**, **Favicon**, **Headers** Response headers, site favicon, and parsed site title
@@ -644,29 +675,27 @@ env CHROME_BINARY=chromium archivebox ...  # run with a one-off config
 These methods also work the same way when run inside Docker, see the Docker Configuration wiki page for details.
-**The config loading logic with all the options defined is here: [`archivebox/config.py`](https://github.com/ArchiveBox/ArchiveBox/blob/dev/archivebox/config.py).**
+The configuration is documented here: **[Configuration Wiki](https://github.com/ArchiveBox/ArchiveBox/wiki/Configuration)**, and loaded here: [`archivebox/config.py`](https://github.com/ArchiveBox/ArchiveBox/blob/dev/archivebox/config.py).
-Most options are also documented on the **[Configuration Wiki page](https://github.com/ArchiveBox/ArchiveBox/wiki/Configuration)**.
-
-#### Most Common Options to Tweak
-
-```bash
+
+Most Common Options to Tweak +

 # e.g. archivebox config --set TIMEOUT=120
-
+
 TIMEOUT=120  # default: 60  add more seconds on slower networks
 CHECK_SSL_VALIDITY=True  # default: False  True = allow saving URLs w/ bad SSL
 SAVE_ARCHIVE_DOT_ORG=False  # default: True  False = disable Archive.org saving
 MAX_MEDIA_SIZE=1500m  # default: 750m  raise/lower youtubedl output size
-
+
 PUBLIC_INDEX=True  # default: True  whether anon users can view index
 PUBLIC_SNAPSHOTS=True  # default: True  whether anon users can view pages
 PUBLIC_ADD_VIEW=False  # default: False  whether anon users can add new URLs
-
+
 CHROME_USER_AGENT="Mozilla/5.0 ..."  # change these to get around bot blocking
 WGET_USER_AGENT="Mozilla/5.0 ..."
 CURL_USER_AGENT="Mozilla/5.0 ..."
-```
-
+
+

## Dependencies @@ -772,7 +801,7 @@ Each snapshot subfolder ./archive/TIMESTAMP/ includes a static
@@ -781,14 +810,17 @@ You can export the main index to browse it statically as plain HTML files in a f
 > *NOTE: These exports are not paginated, exporting many URLs or the entire archive at once may be slow. Use the filtering CLI flags on the `archivebox list` command to export specific Snapshots or ranges.*
-```bash
+```bash|
+# do a one-off single URL archive without needing a data dir initialized
+archivebox oneshot 'https://example.com'
+
 # archivebox list --help
 archivebox list --html --with-headers > index.html  # export to static html table
 archivebox list --json --with-headers > index.json  # export to json blob
 archivebox list --csv=timestamp,url,title > index.csv  # export to csv spreadsheet
 # (if using Docker Compose, add the -T flag when piping)
-# docker compose run -T archivebox list --html --filter-type=search snozzberries > index.json
+# docker compose run -T archivebox list --html 'https://example.com' > index.json
 ```
 The paths in the static exports are relative, make sure to keep them next to your `./archive` folder when backing them up or viewing them.

From 51f2382407f72fc2c2327026ecd0d6b8e5f22188 Mon Sep 17 00:00:00 2001
From: Nick Sweeting
Date: Thu, 25 Jan 2024 22:30:04 -0800
Subject: [PATCH 2/2] Update README.md

---
 README.md | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 85d42cee..d357c550 100644
--- a/README.md
+++ b/README.md
@@ -532,7 +532,6 @@ docker run -v $PWD:/data -it -p 8000:8000 archivebox/archivebox
Optional: Change permissions to allow non-logged-in users

-# OPTIONAL
 archivebox config --set PUBLIC_ADD_VIEW=True   # allow guests to submit URLs 
 archivebox config --set PUBLIC_SNAPSHOTS=True  # allow guests to see snapshot content
 archivebox config --set PUBLIC_INDEX=True      # allow guests to see list of all snapshots
@@ -677,10 +676,12 @@ env CHROME_BINARY=chromium archivebox ...       # run with a one-off config
 
 The configuration is documented here: **[Configuration Wiki](https://github.com/ArchiveBox/ArchiveBox/wiki/Configuration)**, and loaded here: [`archivebox/config.py`](https://github.com/ArchiveBox/ArchiveBox/blob/dev/archivebox/config.py).
 
+
 
-Most Common Options to Tweak +Expand to see the most common options to tweak...

 # e.g. archivebox config --set TIMEOUT=120
+# or   docker compose run archivebox config --set TIMEOUT=120
 
 TIMEOUT=120  # default: 60  add more seconds on slower networks
 CHECK_SSL_VALIDITY=True  # default: False  True = allow saving URLs w/ bad SSL