From bd19b794e563beb72d2afabebadf87ebb92c2fc2 Mon Sep 17 00:00:00 2001 From: Nick Sweeting Date: Tue, 30 Jan 2024 01:01:16 -0800 Subject: [PATCH] copy readme from dev --- README.md | 594 +++++++++++++++++++++++++++--------------------------- 1 file changed, 300 insertions(+), 294 deletions(-) diff --git a/README.md b/README.md index 61c143e9..5ded344a 100644 --- a/README.md +++ b/README.md @@ -23,39 +23,28 @@ curl -sSL 'https://get.archivebox.io' | sh # (or see pip/brew/Docker instruct Without active preservation effort, everything on the internet eventually disappears or degrades. Archive.org does a great job as a free central archive, but they require all archives to be public, and they can't save every type of content. -*ArchiveBox is an open source tool that helps you archive web content on your own (or privately within an organization): save copies of browser bookmarks, preserve evidence for legal cases, backup photos from FB / Insta / Flickr, download your media from YT / Soundcloud / etc., snapshot research papers & academic citations, and more...* +*ArchiveBox is an open source tool that helps organizations and individuals archive web content and retain control over their data: save copies of browser bookmarks, preserve evidence for legal cases, backup photos from FB / Insta / Flickr, download your media from YT / Soundcloud / etc., snapshot research papers & academic citations, and more...* -> ➡️ *Use ArchiveBox as a [command-line package](#quickstart) and/or [self-hosted web app](#quickstart) on Linux, macOS, or in [Docker](#quickstart).* +> ➡️ *Use ArchiveBox on [Linux](#quickstart)/[macOS](#quickstart)/[Windows](#quickstart)/[Docker](#quickstart) as a [CLI tool](#usage), [self-hosted Web App](https://github.com/ArchiveBox/ArchiveBox/wiki/Publishing-Your-Archive), [`pip` library](https://github.com/ArchiveBox/ArchiveBox/wiki/Usage#python-shell-usage), or [one-off command](#static-archive-exporting).*
-📥 **You can feed ArchiveBox URLs one at a time, or schedule regular imports** from browser bookmarks or history, feeds like RSS, bookmark services like Pocket/Pinboard, and more. See input formats for a full list. +📥 **You can feed ArchiveBox URLs one at a time, or schedule regular imports** from your bookmarks or history, social media feeds or RSS, link-saving services like Pocket/Pinboard, our [Browser Extension](https://chromewebstore.google.com/detail/archivebox-exporter/habonpimjphpdnmcfkaockjnffodikoj), and more. See Input Formats for a full list. snapshot detail page -💾 **It saves snapshots of the URLs you feed it in several redundant formats.** +**It saves snapshots of the URLs you feed it in several redundant formats.** It also detects any content featured *inside* each webpage & extracts it out into a folder: -- `HTML/Generic websites -> HTML, PDF, PNG, WARC, Singlefile` -- `YouTube/SoundCloud/etc. -> MP3/MP4 + subtitles, description, thumbnail` -- `News articles -> article body TXT + title, author, featured images` -- `Github/Gitlab/etc. links -> git cloned source code` -- *[and more...](#output-formats)* +- 🌐 **HTML**/**Any websites** ➡️ `original HTML+CSS+JS`, `singlefile HTML`, `screenshot PNG`, `PDF`, `WARC`, ... +- 🎥 **Social Media**/**News** ➡️ `post content TXT`, `comments`, `title`, `author`, `images` +- 🎬 **YouTube**/**SoundCloud**/etc. ➡️ `MP3/MP4`s, `subtitles`, `metadata`, `thumbnail`, ... +- 💾 **GitHub**/**GitLab**/etc. links ➡️ `clone of GIT source code`, `README`, `images`, ... +- ✨ *and more, see [Output Formats](#output-formats) below...* -It uses normal filesystem folders to organize archives (no complicated proprietary formats), and offers a CLI + web UI. +It uses [standard tools](#dependencies) like Chrome, `wget`, & `yt-dlp`, and stores data in ordinary [files & folders](#archive-layout) (no complex proprietary formats). 
--- -🏛️ ArchiveBox is used by many *[professionals](https://zulip.archivebox.io/#narrow/stream/167-enterprise/topic/welcome/near/1191102) and [hobbyists](https://zulip.archivebox.io/#narrow/stream/158-development)* who save content off the web, for example: - -- **Individuals:** - `backing up browser bookmarks/history`, `saving FB/Insta/etc. content`, `shopping lists` -- **Journalists:** - `crawling and collecting research`, `preserving quoted material`, `fact-checking and review` -- **Lawyers:** - `evidence collection`, `hashing & integrity verifying`, `search, tagging, & review` -- **Researchers:** - `collecting AI training sets`, `feeding analysis / web crawling pipelines` - The goal is to sleep soundly knowing the part of the internet you care about will be automatically preserved in durable, easily accessible formats [for decades](#background--motivation) after it goes down.
@@ -70,32 +59,45 @@ The goal is to sleep soundly knowing the part of the internet you care about wil
-**📦  Get ArchiveBox with `docker` / `apt` / `brew` / `pip3` / `nix` / etc. ([see Quickstart below](#quickstart)).** +**📦  Install ArchiveBox using your preferred method: `docker` / `pip` / `apt` / `brew` / etc. ([see full Quickstart below](#quickstart)).** -```bash -# Get ArchiveBox with Docker Compose (recommended) or Docker -curl -sSL 'https://docker-compose.archivebox.io' > docker-compose.yml -docker pull archivebox/archivebox -# Or install with your preferred package manager (see Quickstart below for apt, brew, and more) +
Expand for quick copy-pastable install commands...   ⤵️ +
+
mkdir ~/archivebox; cd ~/archivebox    # create a dir somewhere for your archivebox data
+
+# Option A: Get ArchiveBox with Docker Compose (recommended): +curl -sSL 'https://docker-compose.archivebox.io' > docker-compose.yml # edit options in this file as-needed +docker compose run archivebox init --setup +# docker compose run archivebox add 'https://example.com' +# docker compose run archivebox help +# docker compose up +
+
+# Option B: Or use it as a plain Docker container: +docker run -it -v $PWD:/data archivebox/archivebox init --setup +# docker run -it -v $PWD:/data archivebox/archivebox add 'https://example.com' +# docker run -it -v $PWD:/data archivebox/archivebox help +# docker run -it -v $PWD:/data -p 8000:8000 archivebox/archivebox +
+
+# Option C: Or install it with your preferred pkg manager (see Quickstart below for apt, brew, and more) pip install archivebox - -# Or use the optional auto setup script to install it +archivebox init --setup +# archivebox add 'https://example.com' +# archivebox help +# archivebox server 0.0.0.0:8000 +
+
+# Option D: Or use the optional auto setup script to install it curl -sSL 'https://get.archivebox.io' | sh -``` +
+
+Open http://localhost:8000 to see your server's Web UI ➡️ +
+
-**🔢 Example usage: adding links to archive.** -```bash -archivebox add 'https://example.com' # add URLs one at a time -archivebox add < ~/Downloads/bookmarks.json # or pipe in URLs in any text-based format -archivebox schedule --every=day --depth=1 https://example.com/rss.xml # or auto-import URLs regularly on a schedule -``` -**🔢 Example usage: viewing the archived content.** -```bash -archivebox server 0.0.0.0:8000 # use the interactive web UI -archivebox list 'https://example.com' # use the CLI commands (--help for more) -ls ./archive/*/index.json # or browse directly via the filesystem -```


@@ -123,12 +125,23 @@ ls ./archive/*/index.json # or browse directly via the filesyste ## 🤝 Professional Integration -*[Contact us](https://zulip.archivebox.io/#narrow/stream/167-enterprise/topic/welcome/near/1191102) if your institution/org wants to use ArchiveBox professionally.* +ArchiveBox is free for everyone to self-host, but we also provide support, security review, and custom integrations to help NGOs, governments, and other organizations [run ArchiveBox professionally](https://zulip.archivebox.io/#narrow/stream/167-enterprise/topic/welcome/near/1191102): -- setup & support, team permissioning, hashing, audit logging, backups, custom archiving etc. -- for **individuals**, **NGOs**, **academia**, **governments**, **journalism**, **law**, and more... +- 🗞️ **Journalists:** + `crawling and collecting research`, `preserving quoted material`, `fact-checking and review` +- ⚖️ **Lawyers:** + `collecting & preserving evidence`, `hashing / integrity checking / chain-of-custody`, `tagging & review` +- 🔬 **Researchers:** + `analyzing social media trends`, `collecting LLM training data`, `crawling to feed other pipelines` +- 👩🏽 **Individuals:** + `saving legacy social media / memoirs`, `preserving portfolios / resume`, `backing up news articles` -*We are a 501(c)(3) nonprofit and all our work goes towards supporting open-source development.* +> ***[Contact our team](https://zulip.archivebox.io/#narrow/stream/167-enterprise/topic/welcome/near/1191102)** if your institution/org wants to use ArchiveBox professionally.* +> +> - setup & support, team permissioning, hashing, audit logging, backups, custom archiving etc. +> - for **individuals**, **NGOs**, **academia**, **governments**, **journalism**, **law**, and more... + +*We are a 🏛️ 501(c)(3) nonprofit and all our work goes towards supporting open-source development.*
@@ -137,6 +150,8 @@ ls ./archive/*/index.json # or browse directly via the filesyste grassgrass
+ + # Quickstart **🖥  Supported OSs:** Linux/BSD, macOS, Windows (Docker)   **👾  CPUs:** `amd64` (`x86_64`), `arm64` (`arm8`), `arm7` (raspi>=3)
@@ -146,7 +161,7 @@ ls ./archive/*/index.json # or browse directly via the filesyste #### ✳️  Easy Setup -
+
Docker docker-compose (macOS/Linux/Windows)   👈  recommended   (click to expand)
👍 Docker Compose is recommended for the easiest install/update UX + best security + all the extras out-of-the-box. @@ -155,9 +170,10 @@ ls ./archive/*/index.json # or browse directly via the filesyste
  • Install Docker on your system (if not already installed).
  • Download the docker-compose.yml file into a new empty directory (can be anywhere).
    mkdir ~/archivebox && cd ~/archivebox
    -curl -O 'https://raw.githubusercontent.com/ArchiveBox/ArchiveBox/dev/docker-compose.yml'
    +# Read and edit docker-compose.yml options as-needed after downloading
    +curl -sSL 'https://docker-compose.archivebox.io' > docker-compose.yml
     
  • -
  • Run the initial setup and create an admin user. +
  • Run the initial setup to create an admin user (or set ADMIN_USER/PASS in docker-compose.yml)
    docker compose run archivebox init --setup
     
  • Next steps: Start the server then login to the Web UI http://127.0.0.1:8000 ⇢ Admin. @@ -187,6 +203,7 @@ docker run -v $PWD:/data -it archivebox/archivebox init --setup
    docker run -v $PWD:/data -p 8000:8000 archivebox/archivebox
     # completely optional, CLI can always be used without running a server
     # docker run -v $PWD:/data -it [subcommand] [--args]
    +docker run -v $PWD:/data -it archivebox/archivebox help
     
  • @@ -216,8 +233,41 @@ See "Against curl | sh as a #### 🛠  Package Manager Setup + +
    -aptitude apt (Ubuntu/Debian) +Pip pip (macOS/Linux/BSD) +
    +
      + +
    1. Install Python >= v3.10 and Node >= v18 on your system (if not already installed).
    2. +
    3. Install the ArchiveBox package using pip3 (or pipx). +
      pip3 install archivebox
      +
      +
    4. +
    5. Create a new empty directory and initialize your collection (can be anywhere). +
      mkdir ~/archivebox && cd ~/archivebox
      +archivebox init --setup
      +# install any missing extras like wget/git/ripgrep/etc. manually as needed
      +
      +
    6. +
    7. Optional: Start the server then login to the Web UI http://127.0.0.1:8000 ⇢ Admin. +
      archivebox server 0.0.0.0:8000
      +# completely optional, CLI can always be used without running a server
      +# archivebox [subcommand] [--args]
      +archivebox help
      +
      +
    8. +
    + +See below for more usage examples using the CLI, Web UI, or filesystem/SQL/Python to manage your archive.
    +See the pip-archivebox repo for more details about this distribution. +

    +
    + + +
    +aptitude apt (Ubuntu/Debian/etc.)
    1. Add the ArchiveBox repository to your sources.
      @@ -241,6 +291,7 @@ archivebox init --setup # if any problems, install with pip instead
      archivebox server 0.0.0.0:8000
       # completely optional, CLI can always be used without running a server
       # archivebox [subcommand] [--args]
      +archivebox help
       
    @@ -251,7 +302,7 @@ See the debian-a
    -homebrew brew (macOS) +homebrew brew (macOS only)
    1. Install Homebrew on your system (if not already installed).
    2. @@ -269,6 +320,7 @@ archivebox init --setup # if any problems, install with pip instead
      archivebox server 0.0.0.0:8000
       # completely optional, CLI can always be used without running a server
       # archivebox [subcommand] [--args]
      +archivebox help
       
    @@ -278,35 +330,6 @@ See the homebr

    -
    -Pip pip (macOS/Linux/BSD) -
    -
      - -
    1. Install Python >= v3.9 and Node >= v18 on your system (if not already installed).
    2. -
    3. Install the ArchiveBox package using pip3. -
      pip3 install archivebox
      -
      -
    4. -
    5. Create a new empty directory and initialize your collection (can be anywhere). -
      mkdir ~/archivebox && cd ~/archivebox
      -archivebox init --setup
      -# install any missing extras like wget/git/ripgrep/etc. manually as needed
      -
      -
    6. -
    7. Optional: Start the server then login to the Web UI http://127.0.0.1:8000 ⇢ Admin. -
      archivebox server 0.0.0.0:8000
      -# completely optional, CLI can always be used without running a server
      -# archivebox [subcommand] [--args]
      -
      -
    8. -
    - -See below for more usage examples using the CLI, Web UI, or filesystem/SQL/Python to manage your archive.
    -See the pip-archivebox repo for more details about this distribution. -

    -
    -
    Arch pacman / FreeBSD pkg / Nix nix (Arch/FreeBSD/NixOS/more)
    @@ -345,7 +368,7 @@ See below for usage examples using the CLI, W
    ✨ Alpha (contributors wanted!): for more info, see the: Electron ArchiveBox repo. -
    +
    @@ -419,124 +442,133 @@ For more discussion on managed and paid hosting options see here: -docker compose up -d # start the Web UI server in the background -docker compose run archivebox add 'https://example.com' # add a test URL to snapshot w/ Docker Compose - -archivebox list 'https://example.com' # fetch it with pip-installed archivebox on the host -docker compose run archivebox list 'https://example.com' # or w/ Docker Compose -docker run -it -v $PWD:/data archivebox/archivebox list 'https://example.com' # or w/ Docker, all equivalent - - -
    +curl sh automatic setup script CLI Usage Examples (non-Docker)
    - -##### Bare Metal Usage (`pip`/`apt`/`brew`/etc.) - -
    -
    -Click to expand... -
    -
    
     archivebox init --setup      # safe to run init multiple times (also how you update versions)
    -archivebox version           # get archivebox version info and more
    +archivebox version           # get archivebox version info + check dependencies
    +archivebox help              # get list of archivebox subcommands that can be run
     archivebox add --depth=1 'https://news.ycombinator.com'
     
    -
    -
    - -##### Docker Compose Usage
    +
    -Click to expand... +Docker Docker Compose CLI Usage Examples
    -
    
     # make sure you have `docker-compose.yml` from the Quickstart instructions first
     docker compose run archivebox init --setup
     docker compose run archivebox version
    +docker compose run archivebox help
     docker compose run archivebox add --depth=1 'https://news.ycombinator.com'
    +# to start webserver: docker compose up
     
    -
    -
    - -##### Docker Usage
    +
    -Click to expand... +Docker Docker CLI Usage Examples
    -
    
     docker run -v $PWD:/data -it archivebox/archivebox init --setup
     docker run -v $PWD:/data -it archivebox/archivebox version
    +docker run -v $PWD:/data -it archivebox/archivebox help
    +docker run -v $PWD:/data -it archivebox/archivebox add --depth=1 'https://news.ycombinator.com'
    +# to start webserver: docker run -v $PWD:/data -it -p 8000:8000 archivebox/archivebox
    +
    +
    + +
    + +
    +🗄  SQL/Python/Filesystem Usage +
    
    +archivebox shell           # explore the Python library API in a REPL
    +sqlite3 ./index.sqlite3    # run SQL queries directly on your index
    +ls ./archive/*/index.html  # or inspect snapshot data directly on the filesystem
    +
    +
    + + +
    + +
    +🖥  Web UI Usage +
    
    +# Start the server on bare metal (pip/apt/brew/etc):
    +archivebox manage createsuperuser              # create a new admin user via CLI
    +archivebox server 0.0.0.0:8000                 # start the server
    +
    +# Or with Docker Compose: +nano docker-compose.yml # setup initial ADMIN_USERNAME & ADMIN_PASSWORD +docker compose up # start the server +
    +# Or with a Docker container: +docker run -v $PWD:/data -it archivebox/archivebox archivebox manage createsuperuser +docker run -v $PWD:/data -it -p 8000:8000 archivebox/archivebox +
    + +Open
    http://localhost:8000 to see your server's Web UI ➡️ +
    +Optional: Change permissions to allow non-logged-in users + +
    
    +archivebox config --set PUBLIC_ADD_VIEW=True   # allow guests to submit URLs 
    +archivebox config --set PUBLIC_SNAPSHOTS=True  # allow guests to see snapshot content
    +archivebox config --set PUBLIC_INDEX=True      # allow guests to see list of all snapshots
    +# or
    +docker compose run archivebox config --set ...
    +
    +# restart the server to apply any config changes
    +
    +
    + +
    +
    + +> [!TIP] +> Whether in Docker or not, ArchiveBox commands work the same way, and can be used to access the same data on-disk. +> For example, you could run the Web UI in Docker Compose, and run one-off commands with `pip`-installed ArchiveBox. + +
    +Expand to show comparison...
    + +
    
    +archivebox add --depth=1 'https://example.com'                     # add a URL with pip-installed archivebox on the host
    +docker compose run archivebox add --depth=1 'https://example.com'                       # or w/ Docker Compose
    +docker run -it -v $PWD:/data archivebox/archivebox add --depth=1 'https://example.com'  # or w/ Docker, all equivalent
     
    -
    -#### Next Steps - -- `archivebox help/version` to see the list of available subcommands and currently installed version info -- `archivebox setup/init/config/status/manage` to administer your collection -- `archivebox add/schedule/remove/update/list/shell/oneshot` to manage Snapshots in the archive -- `archivebox schedule` to pull in fresh URLs regularly from [bookmarks/history/Pocket/Pinboard/RSS/etc.](#input-formats) - - -#### 🖥  Web UI Usage - -##### Start the Web Server -```bash -# Bare metal (pip/apt/brew/etc): -archivebox server 0.0.0.0:8000 # open http://127.0.0.1:8000 to view it - -# Docker Compose: -docker compose up - -# Docker: -docker run -v $PWD:/data -it -p 8000:8000 archivebox/archivebox -``` - -##### Allow Public Access or Create an Admin User -```bash -archivebox manage createsuperuser # create a new admin username & pass -# OR # OR -archivebox config --set PUBLIC_ADD_VIEW=True # allow guests to submit URLs -archivebox config --set PUBLIC_SNAPSHOTS=True # allow guests to see snapshot content -archivebox config --set PUBLIC_INDEX=True # allow guests to see list of all snapshots - -# restart the server to apply any config changes -``` - -*Docker hint:* Set the [`ADMIN_USERNAME` & `ADMIN_PASSWORD`)](https://github.com/ArchiveBox/ArchiveBox/wiki/Configuration#admin_username--admin_password) env variables to auto-create an admin user on first-run. - -#### 🗄  SQL/Python/Filesystem Usage - -```bash -sqlite3 ./index.sqlite3 # run SQL queries on your index -archivebox shell # explore the Python API in a REPL -ls ./archive/*/index.html # or inspect snapshots on the filesystem -```
    @@ -557,25 +589,28 @@ ls ./archive/*/index.html # or inspect snapshots on the filesystem ---
    -lego +lego

    # Overview -## Input Formats + -ArchiveBox supports many input formats for URLs, including Pocket & Pinboard exports, Browser bookmarks, Browser history, plain text, HTML, markdown, and more! +## Input Formats: How to pass URLs into ArchiveBox for saving -*Click these links for instructions on how to prepare your links from these sources:* +- The official ArchiveBox Browser Extension (provides realtime archiving from Chrome/Chromium/Firefox browsers) + +- Manual imports of URLs from RSS, JSON, CSV, TXT, SQL, HTML, Markdown, or [any other text-based format...](https://github.com/ArchiveBox/ArchiveBox/wiki/Usage#Import-a-list-of-URLs-from-a-text-file) + +- [MITM Proxy](https://mitmproxy.org/) archiving with [`archivebox-proxy`](https://github.com/ArchiveBox/archivebox-proxy) ([realtime archiving](https://github.com/ArchiveBox/ArchiveBox/issues/577) of all traffic from any device going through the proxy) + +- Exported [browser history](https://github.com/ArchiveBox/ArchiveBox/wiki/Quickstart#2-get-your-list-of-urls-to-archive) or [browser bookmarks](https://github.com/ArchiveBox/ArchiveBox/wiki/Quickstart#2-get-your-list-of-urls-to-archive) (see instructions for: [Chrome](https://support.google.com/chrome/answer/96816?hl=en), [Firefox](https://support.mozilla.org/en-US/kb/export-firefox-bookmarks-to-backup-or-transfer), [Safari](https://github.com/ArchiveBox/ArchiveBox/assets/511499/24ad068e-0fa6-41f4-a7ff-4c26fc91f71a), [IE](https://support.microsoft.com/en-us/help/211089/how-to-import-and-export-the-internet-explorer-favorites-folder-to-a-32-bit-version-of-windows), [Opera](https://help.opera.com/en/latest/features/#bookmarks:~:text=Click%20the%20import/-,export%20button,-on%20the%20bottom), [and more...](https://github.com/ArchiveBox/ArchiveBox/wiki/Quickstart#2-get-your-list-of-urls-to-archive)) + +- Links from [Pocket](https://getpocket.com/export), [Pinboard](https://pinboard.in/export/), [Instapaper](https://www.instapaper.com/user), 
[Shaarli](https://shaarli.readthedocs.io/en/master/Usage/#importexport), [Delicious](https://www.groovypost.com/howto/howto/export-delicious-bookmarks-xml/), [Reddit Saved](https://github.com/csu/export-saved-reddit), [Wallabag](https://doc.wallabag.org/en/user/import/wallabagv2.html), [Unmark.it](http://help.unmark.it/import-export), [OneTab](https://www.addictivetips.com/web/onetab-save-close-all-chrome-tabs-to-restore-export-or-import/), [Firefox Sync](https://github.com/ArchiveBox/ArchiveBox/issues/648), [and more...](https://github.com/ArchiveBox/ArchiveBox/wiki/Quickstart#2-get-your-list-of-urls-to-archive) -- TXT, RSS, XML, JSON, CSV, SQL, HTML, Markdown, or [any other text-based format...](https://github.com/ArchiveBox/ArchiveBox/wiki/Usage#Import-a-list-of-URLs-from-a-text-file) -- [Browser history](https://github.com/ArchiveBox/ArchiveBox/wiki/Quickstart#2-get-your-list-of-urls-to-archive) or [browser bookmarks](https://github.com/ArchiveBox/ArchiveBox/wiki/Quickstart#2-get-your-list-of-urls-to-archive) (see instructions for: [Chrome](https://support.google.com/chrome/answer/96816?hl=en), [Firefox](https://support.mozilla.org/en-US/kb/export-firefox-bookmarks-to-backup-or-transfer), [Safari](https://github.com/ArchiveBox/ArchiveBox/assets/511499/24ad068e-0fa6-41f4-a7ff-4c26fc91f71a), [IE](https://support.microsoft.com/en-us/help/211089/how-to-import-and-export-the-internet-explorer-favorites-folder-to-a-32-bit-version-of-windows), [Opera](https://help.opera.com/en/latest/features/#bookmarks:~:text=Click%20the%20import/-,export%20button,-on%20the%20bottom), [and more...](https://github.com/ArchiveBox/ArchiveBox/wiki/Quickstart#2-get-your-list-of-urls-to-archive)) -- Browser extension [`archivebox-exporter`](https://github.com/ArchiveBox/archivebox-extension) (realtime archiving from Chrome/Chromium/Firefox) -- [Pocket](https://getpocket.com/export), [Pinboard](https://pinboard.in/export/), [Instapaper](https://www.instapaper.com/user), 
[Shaarli](https://shaarli.readthedocs.io/en/master/Usage/#importexport), [Delicious](https://www.groovypost.com/howto/howto/export-delicious-bookmarks-xml/), [Reddit Saved](https://github.com/csu/export-saved-reddit), [Wallabag](https://doc.wallabag.org/en/user/import/wallabagv2.html), [Unmark.it](http://help.unmark.it/import-export), [OneTab](https://www.addictivetips.com/web/onetab-save-close-all-chrome-tabs-to-restore-export-or-import/), [Firefox Sync](https://github.com/ArchiveBox/ArchiveBox/issues/648), [and more...](https://github.com/ArchiveBox/ArchiveBox/wiki/Quickstart#2-get-your-list-of-urls-to-archive) -- Proxy archiving with [`archivebox-proxy`](https://github.com/ArchiveBox/archivebox-proxy) ([realtime archiving](https://github.com/ArchiveBox/ArchiveBox/issues/577) of all traffic from any browser or device) @@ -601,30 +636,41 @@ It also includes a built-in scheduled import feature with `archivebox schedule`
    -## Output Formats -Inside each Snapshot folder, ArchiveBox saves these different types of extractor outputs as plain files: + + +## Output Formats: What ArchiveBox saves for each URL -`./archive/TIMESTAMP/*` -- **Index:** `index.html` & `index.json` HTML and JSON index files containing metadata and details -- **Title**, **Favicon**, **Headers** Response headers, site favicon, and parsed site title -- **SingleFile:** `singlefile.html` HTML snapshot rendered with headless Chrome using SingleFile -- **Wget Clone:** `example.com/page-name.html` wget clone of the site with `warc/TIMESTAMP.gz` -- Chrome Headless - - **PDF:** `output.pdf` Printed PDF of site using headless chrome - - **Screenshot:** `screenshot.png` 1440x900 screenshot of site using headless chrome - - **DOM Dump:** `output.html` DOM Dump of the HTML after rendering using headless chrome -- **Article Text:** `article.html/json` Article text extraction using Readability & Mercury -- **Archive.org Permalink:** `archive.org.txt` A link to the saved site on archive.org -- **Audio & Video:** `media/` all audio/video files + playlists, including subtitles & metadata with youtube-dl (or yt-dlp) -- **Source Code:** `git/` clone of any repository found on GitHub, Bitbucket, or GitLab links -- _More coming soon! See the [Roadmap](https://github.com/ArchiveBox/ArchiveBox/wiki/Roadmap)..._ +For each web page added, ArchiveBox creates a Snapshot folder and preserves its content as ordinary files inside the folder (e.g. HTML, PDF, PNG, JSON, etc.). -It does everything out-of-the-box by default, but you can disable or tweak [individual archive methods](https://github.com/ArchiveBox/ArchiveBox/wiki/Configuration) via environment variables / config. +It uses all available methods out-of-the-box, but you can disable extractors and fine-tune the [configuration](https://github.com/ArchiveBox/ArchiveBox/wiki/Configuration) as-needed. +
    +
    +Expand to see the full list of ways ArchiveBox saves each page... + + +./archive/{Snapshot.id}/
    +
      +
    • Index: index.html & index.json HTML and JSON index files containing metadata and details
    • +
    • Title, Favicon, Headers Response headers, site favicon, and parsed site title
    • +
    • SingleFile: singlefile.html HTML snapshot rendered with headless Chrome using SingleFile
    • +
    • Wget Clone: example.com/page-name.html wget clone of the site with warc/TIMESTAMP.gz
    • +
    • Chrome Headless
        +
      • PDF: output.pdf Printed PDF of site using headless chrome
      • +
      • Screenshot: screenshot.png 1440x900 screenshot of site using headless chrome
      • +
      • DOM Dump: output.html DOM Dump of the HTML after rendering using headless chrome
      • +
    • +
    • Article Text: article.html/json Article text extraction using Readability & Mercury
    • +
    • Archive.org Permalink: archive.org.txt A link to the saved site on archive.org
    • +
    • Audio & Video: media/ all audio/video files + playlists, including subtitles & metadata with youtube-dl (or yt-dlp)
    • +
    • Source Code: git/ clone of any repository found on GitHub, Bitbucket, or GitLab links
    • +
    • More coming soon! See the Roadmap...
    • +
    +

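
To see which of the outputs listed above were actually produced for one Snapshot, a small shell helper can check for the documented file names. This is only a sketch: `snapshot_outputs` is a hypothetical function (not an ArchiveBox command), and it assumes the file and folder names shown in the list above. The Wget clone is left out because its path depends on the archived site's domain.

```bash
# snapshot_outputs DIR
# Report which documented extractor outputs exist inside one Snapshot folder.
# (Hypothetical helper for illustration; not part of the archivebox CLI.)
snapshot_outputs() {
    local dir="$1" name
    # File and folder names are taken from the Output Formats list above.
    for name in index.html index.json singlefile.html output.pdf \
                screenshot.png output.html archive.org.txt media git; do
        if [ -e "$dir/$name" ]; then
            printf 'present  %s\n' "$name"
        else
            printf 'missing  %s\n' "$name"
        fi
    done
}
```

Run it from inside a data folder as `snapshot_outputs ./archive/<snapshot-folder>`; it prints one `present` or `missing` line per output type.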
    ## Configuration @@ -632,52 +678,56 @@ It does everything out-of-the-box by default, but you can disable or tweak [indi ArchiveBox can be configured via environment variables, by using the `archivebox config` CLI, or by editing `./ArchiveBox.conf` directly. - -```bash -archivebox config # view the entire config +
    +
    +Expand to see examples... +
    archivebox config                               # view the entire config
     archivebox config --get CHROME_BINARY           # view a specific value
    -
    +
    archivebox config --set CHROME_BINARY=chromium # persist a config using CLI # OR echo CHROME_BINARY=chromium >> ArchiveBox.conf # persist a config using file # OR env CHROME_BINARY=chromium archivebox ... # run with a one-off config -``` +
    +These methods also work the same way when run inside Docker, see the Docker Configuration wiki page for details. +

    -These methods also work the same way when run inside Docker, see the Docker Configuration wiki page for details. +The configuration is documented here: **[Configuration Wiki](https://github.com/ArchiveBox/ArchiveBox/wiki/Configuration)**, and loaded here: [`archivebox/config.py`](https://github.com/ArchiveBox/ArchiveBox/blob/dev/archivebox/config.py). -**The config loading logic with all the options defined is here: [`archivebox/config.py`](https://github.com/ArchiveBox/ArchiveBox/blob/dev/archivebox/config.py).** - -Most options are also documented on the **[Configuration Wiki page](https://github.com/ArchiveBox/ArchiveBox/wiki/Configuration)**. - -#### Most Common Options to Tweak - -```bash + +
    +Expand to see the most common options to tweak... +
    
     # e.g. archivebox config --set TIMEOUT=120
    -
    +# or   docker compose run archivebox config --set TIMEOUT=120
    +
TIMEOUT=120 # default: 60 add more seconds on slower networks CHECK_SSL_VALIDITY=False # default: True False = allow saving URLs w/ bad SSL SAVE_ARCHIVE_DOT_ORG=False # default: True False = disable Archive.org saving MAX_MEDIA_SIZE=1500m # default: 750m raise/lower youtubedl output size - +
    PUBLIC_INDEX=True # default: True whether anon users can view index PUBLIC_SNAPSHOTS=True # default: True whether anon users can view pages PUBLIC_ADD_VIEW=False # default: False whether anon users can add new URLs - +
    CHROME_USER_AGENT="Mozilla/5.0 ..." # change these to get around bot blocking WGET_USER_AGENT="Mozilla/5.0 ..." CURL_USER_AGENT="Mozilla/5.0 ..." -``` - +
    +

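
Appending to the config file with `echo KEY=VALUE >> ArchiveBox.conf` leaves any older line for the same key in place. The sketch below replaces the existing line instead of stacking a second one. `conf_set` and `conf_get` are hypothetical helpers (not part of ArchiveBox), and they assume the file holds plain `KEY=VALUE` lines as in the `echo` example; `archivebox config --set` remains the CLI route for persisting an option.

```bash
# conf_set FILE KEY VALUE
# Replace any existing KEY=... line in FILE, or append one if KEY is absent.
# (Hypothetical helper; assumes plain KEY=VALUE lines.)
conf_set() {
    local file="$1" key="$2" value="$3"
    local tmp="${file}.tmp"
    touch "$file"
    # Keep every line except earlier assignments of KEY.
    # grep exits non-zero when no lines remain, which is fine here.
    grep -v "^${key}=" "$file" > "$tmp" || true
    printf '%s=%s\n' "$key" "$value" >> "$tmp"
    mv "$tmp" "$file"
}

# conf_get FILE KEY
# Print the value assigned to KEY in FILE (prints nothing if KEY is unset).
conf_get() {
    local file="$1" key="$2"
    sed -n "s/^${key}=//p" "$file" | tail -n 1
}
```

For example, `conf_set ArchiveBox.conf TIMEOUT 120` followed by `conf_get ArchiveBox.conf TIMEOUT` prints `120`, and the file contains a single `TIMEOUT=` line no matter how many times the value is changed.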
    ## Dependencies -To achieve high-fidelity archives in as many situations as possible, ArchiveBox depends on a variety of 3rd-party tools that specialize in extracting different types of content. +To achieve high-fidelity archives in as many situations as possible, ArchiveBox depends on a variety of 3rd-party libraries and tools that specialize in extracting different types of content. + +> Under-the-hood, ArchiveBox uses [Django](https://www.djangoproject.com/start/overview/) to power its [Web UI](https://github.com/ArchiveBox/ArchiveBox/wiki/Usage#ui-usage) and [SQLite](https://www.sqlite.org/locrsf.html) + the filesystem to provide [fast & durable metadata storage](https://www.sqlite.org/locrsf.html) w/ [deterministic upgrades](https://stackoverflow.com/a/39976321/2156113). ArchiveBox bundles industry-standard tools like [Google Chrome](https://github.com/ArchiveBox/ArchiveBox/wiki/Chromium-Install), [`wget`, `yt-dlp`, `readability`, etc.](#dependencies) internally, and its operation can be [tuned, secured, and extended](https://github.com/ArchiveBox/ArchiveBox/wiki/Configuration) as-needed for many different applications. +
    -Expand to learn more about ArchiveBox's dependencies...
    +Expand to learn more about ArchiveBox's internals & dependencies...
     > *TIP: For better security, easier updating, and to avoid polluting your host system with extra dependencies, **it is strongly recommended to use the [⭐️ official Docker image](https://github.com/ArchiveBox/ArchiveBox/wiki/Docker)** with everything pre-installed for the best experience.*
    @@ -724,14 +774,13 @@ Installing directly on **Windows without Docker or WSL/WSL2/Cygwin is not offici
     ## Archive Layout
    -All of ArchiveBox's state (including the SQLite DB, archived assets, config, logs, etc.) is stored in a single folder called the "ArchiveBox Data Folder".
    -Data folders can be created anywhere (`~/archivebox` or `$PWD/data` as seen in our examples), and you can create more than one for different collections.
    +All of ArchiveBox's state (SQLite DB, archived assets, config, logs, etc.) is stored in a single folder called the "ArchiveBox Data Folder".
    Expand to learn more about the layout of Archivebox's data on-disk...
    - +Data folders can be created anywhere (`~/archivebox` or `$PWD/data` as seen in our examples), and you can create as many data folders as you want to hold different collections. All archivebox CLI commands are designed to be run from inside an ArchiveBox data folder, starting with archivebox init to initialize a new collection inside an empty directory.
    mkdir ~/archivebox && cd ~/archivebox   # just an example, can be anywhere
    @@ -774,7 +823,7 @@ Each snapshot subfolder ./archive/TIMESTAMP/ includes a static 
     
    @@ -783,14 +832,17 @@ You can export the main index to browse it statically as plain HTML files in a f
     > *NOTE: These exports are not paginated, exporting many URLs or the entire archive at once may be slow. Use the filtering CLI flags on the `archivebox list` command to export specific Snapshots or ranges.*
    -```bash
    +```bash|
    +# do a one-off single URL archive without needing a data dir initialized
    +archivebox oneshot 'https://example.com'
    +
     # archivebox list --help
     archivebox list --html --with-headers > index.html     # export to static html table
     archivebox list --json --with-headers > index.json     # export to json blob
     archivebox list --csv=timestamp,url,title > index.csv  # export to csv spreadsheet
     # (if using Docker Compose, add the -T flag when piping)
    -# docker compose run -T archivebox list --html --filter-type=search snozzberries > index.json
    +# docker compose run -T archivebox list --html 'https://example.com' > index.html
     ```
     The paths in the static exports are relative, make sure to keep them next to your `./archive` folder when backing them up or viewing them.
    @@ -806,8 +858,6 @@ The paths in the static exports are relative, make sure to keep them next to you
    ---- -
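The static export commands above can be wrapped into a small script. This is a minimal sketch under stated assumptions: `export_indexes` and the `ARCHIVEBOX_BIN` override are hypothetical names invented for this example (not ArchiveBox features), and only the `archivebox list` flags are taken from the commands above.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper: write the HTML, JSON, and CSV index exports into one
# directory. ARCHIVEBOX_BIN is an assumed override for the command to invoke.
export_indexes() {
    local out_dir="${1:-.}"
    local ab="${ARCHIVEBOX_BIN:-archivebox}"

    mkdir -p "$out_dir" || return 1

    # same flags as the README examples; stop at the first failed export
    "$ab" list --html --with-headers > "$out_dir/index.html" || return 1
    "$ab" list --json --with-headers > "$out_dir/index.json" || return 1
    "$ab" list --csv=timestamp,url,title > "$out_dir/index.csv" || return 1
}
```

Calling `export_indexes .` from inside a data folder would write the three index files next to `./archive`, which keeps the relative paths in the exports valid.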
    security graphic
    @@ -823,7 +873,7 @@ If you're importing pages with private content or URLs containing secret tokens
    -Click to expand...
    +Expand to learn about privacy, permissions, and user accounts...
     ```bash
    @@ -838,6 +888,7 @@ archivebox config --set SAVE_ARCHIVE_DOT_ORG=False # disable saving all URLs in
     archivebox config --set PUBLIC_INDEX=False
     archivebox config --set PUBLIC_SNAPSHOTS=False
     archivebox config --set PUBLIC_ADD_VIEW=False
    +archivebox manage createsuperuser

     # if extra paranoid or anti-Google:
     archivebox config --set SAVE_FAVICON=False          # disable favicon fetching (it calls a Google API passing the URL's domain part only)
    @@ -867,7 +918,7 @@ Be aware that malicious archived JS can access the contents of other pages in yo
    -Click to expand...
    +Expand to see risks and mitigations...
     ```bash
    @@ -903,7 +954,7 @@ For various reasons, many large sites (Reddit, Twitter, Cloudflare, etc.) active
    -Click to expand...
    +Click to learn how to set up user agents, cookies, and site logins...
    @@ -926,7 +977,7 @@ ArchiveBox appends a hash with the current date `https://example.com#2020-10-24`
    -Click to expand...
    +Click to learn how the `Re-Snapshot` feature works...
    @@ -954,12 +1005,11 @@ Improved support for saving multiple snapshots of a single URL without this hash
     ### Storage Requirements
    -Because ArchiveBox is designed to ingest a large volume of URLs with multiple copies of each URL stored by different 3rd-party tools, it can be quite disk-space intensive.
    -There also also some special requirements when using filesystems like NFS/SMB/FUSE.
    +Because ArchiveBox is designed to ingest a large volume of URLs with multiple copies of each URL stored by different 3rd-party tools, it can be quite disk-space intensive. There are also some special requirements when using filesystems like NFS/SMB/FUSE.
    -Click to expand...
    +Click to learn more about ArchiveBox's filesystem and hosting requirements...
    @@ -1030,10 +1080,6 @@ If using Docker or NFS/SMB/FUSE for the `data/archive/` folder, you may need to

    - ---- - -
    paisley graphic @@ -1047,7 +1093,7 @@ ArchiveBox aims to enable more of the internet to be saved from deterioration by
    -Click to read more...
    +Click to read more about why archiving is important and how to do it ethically...
    @@ -1082,7 +1128,7 @@ A variety of open and closed-source archiving projects exist, but few provide a
    -Click to read more...
    +Click to read about how we differ from other centralized archiving services and open source tools...
    ArchiveBox tries to be a robust, set-and-forget archiving solution suitable for archiving RSS feeds, bookmarks, or your entire browsing history (beware, it may be too big to store), including private/authenticated content that you wouldn't otherwise share with a centralized service. @@ -1111,33 +1157,21 @@ ArchiveBox is neither the highest fidelity nor the simplest tool available for s
    -
    -
    -dependencies graphic -
    + ## Internet Archiving Ecosystem - -Our Community Wiki page serves as an index of the broader web archiving community. - -
      -
    • See where archivists hang out online
    • -
    • Explore other open-source tools for your web archiving needs
    • -
    • Learn which organizations are the big players in the web archiving space
    • -
    -
    -Explore our index of web archiving software, blogs, and communities around the world...
    +Our Community Wiki strives to be a comprehensive index of the broader web archiving community...
     - [Community Wiki](https://github.com/ArchiveBox/ArchiveBox/wiki/Web-Archiving-Community)
    -  - [The Master Lists](https://github.com/ArchiveBox/ArchiveBox/wiki/Web-Archiving-Community#the-master-lists)
    -    _Community-maintained indexes of archiving tools and institutions._
       - [Web Archiving Software](https://github.com/ArchiveBox/ArchiveBox/wiki/Web-Archiving-Community#web-archiving-projects)
    -    _Open source tools and projects in the internet archiving space._
    +    _List of ArchiveBox alternatives and open source projects in the internet archiving space._
    +  - [Awesome-Web-Archiving Lists](https://github.com/ArchiveBox/ArchiveBox/wiki/Web-Archiving-Community#the-master-lists)
    +    _Community-maintained indexes of archiving tools and institutions like `iipc/awesome-web-archiving`._
       - [Reading List](https://github.com/ArchiveBox/ArchiveBox/wiki/Web-Archiving-Community#reading-list)
         _Articles, posts, and blogs relevant to ArchiveBox and web archiving in general._
       - [Communities](https://github.com/ArchiveBox/ArchiveBox/wiki/Web-Archiving-Community#communities)
    @@ -1154,11 +1188,8 @@ Our Community Wiki page serves as an index of the broader web archiving communit
     > ✨ **[Hire the team that built Archivebox](https://zulip.archivebox.io/#narrow/stream/167-enterprise/topic/welcome/near/1191102) to work on your project.** ([@ArchiveBoxApp](https://twitter.com/ArchiveBoxApp))
    -(We also offer general software consulting across many industries)
    -
    ----
    documentation graphic @@ -1333,28 +1364,19 @@ archivebox init --setup
    -#### Run the linters +#### Run the linters / tests
    Click to expand... ```bash ./bin/lint.sh -``` -(uses `flake8` and `mypy`) - -
    - -#### Run the integration tests - -
    Click to expand... - -```bash ./bin/test.sh ``` -(uses `pytest -s`) +(uses `flake8`, `mypy`, and `pytest -s`)
    + #### Make migrations or enter a django shell
     Click to expand...
    @@ -1449,47 +1471,31 @@ Extractors take the URL of a page to archive, write their output to the filesyst
     ## Further Reading
    -- Home: [ArchiveBox.io](https://archivebox.io)
    -- Demo: [Demo.ArchiveBox.io](https://demo.archivebox.io)
    -- Docs: [Docs.ArchiveBox.io](https://docs.archivebox.io)
    -- Releases: [Github.com/ArchiveBox/ArchiveBox/releases](https://github.com/ArchiveBox/ArchiveBox/releases)
    -- Wiki: [Github.com/ArchiveBox/ArchiveBox/wiki](https://github.com/ArchiveBox/ArchiveBox/wiki)
    -- Issues: [Github.com/ArchiveBox/ArchiveBox/issues](https://github.com/ArchiveBox/ArchiveBox/issues)
    -- Discussions: [Github.com/ArchiveBox/ArchiveBox/discussions](https://github.com/ArchiveBox/ArchiveBox/discussions)
    -- Community Chat: [Zulip Chat (preferred)](https://zulip.archivebox.io) or [Matrix Chat (old)](https://app.element.io/#/room/#archivebox:matrix.org)
    +
    +
    +- [ArchiveBox.io Homepage](https://archivebox.io) / [Source Code (Github)](https://github.com/ArchiveBox/ArchiveBox) / [Demo Server](https://demo.archivebox.io)
    +- [Documentation Wiki](https://github.com/ArchiveBox/ArchiveBox/wiki) / [API Reference Docs](https://docs.archivebox.io) / [Changelog](https://github.com/ArchiveBox/ArchiveBox/releases)
    +- [Bug Tracker](https://github.com/ArchiveBox/ArchiveBox/issues) / [Discussions](https://github.com/ArchiveBox/ArchiveBox/discussions) / [Community Chat Forum (Zulip)](https://zulip.archivebox.io)
     - Social Media: [Twitter](https://twitter.com/ArchiveBoxApp), [LinkedIn](https://www.linkedin.com/company/archivebox/), [YouTube](https://www.youtube.com/@ArchiveBoxApp), [Alternative.to](https://alternativeto.net/software/archivebox/about/), [Reddit](https://www.reddit.com/r/ArchiveBox/)
    -- Donations: [Github.com/ArchiveBox/ArchiveBox/wiki/Donations](https://github.com/ArchiveBox/ArchiveBox/wiki/Donations)
     ---
    +
    +🏛️ Contact us for professional support 💬


    - -
    - -This project is maintained mostly in my spare time with the help from generous contributors. - - -

    - -**🏛️ [Contact us for professional support](https://docs.sweeting.me/s/archivebox-consulting-services) 💬** - -
    -     - - -
    -ArchiveBox operates as a US 501(c)(3) nonprofit, donations are tax-deductible.
    (fiscally sponsored by HCB EIN: 81-2908499)

    - -(网站存档 / 爬虫) - - - - -
    -
    -✨ Have spare CPU/disk/bandwidth and want to help the world?
    Check out our Good Karma Kit...
    +   +   +
    +ArchiveBox operates as a US 501(c)(3) nonprofit (sponsored by HCB); donations are tax-deductible. +

    +  +  +
    +ArchiveBox was started by Nick Sweeting in 2017, and has grown steadily with help from our amazing contributors. +
    +✨ Have spare CPU/disk/bandwidth after all your 网站存档爬 and want to help the world?
    Check out our Good Karma Kit...