From ba851b17a69e59cf909359cdfde0d99808e0bab6 Mon Sep 17 00:00:00 2001
From: Nick Sweeting
Date: Tue, 30 Jan 2024 02:20:38 -0800
Subject: [PATCH] more README html-ifying

---
 README.md | 113 ++++++++++++++++++++++++++++--------------------------
 1 file changed, 59 insertions(+), 54 deletions(-)

diff --git a/README.md b/README.md
index 6d2f6c62..d3c0b16f 100644
--- a/README.md
+++ b/README.md
@@ -36,7 +36,7 @@ Without active preservation effort, everything on the internet eventually dissap
 **It saves snapshots of the URLs you feed it in several redundant formats.**
 It also detects any content featured *inside* pages & extracts it out into a folder:
-- 🌐 **HTML**/**Any websites** ➡️ `original HTML+CSS+JS`, `singlefile HTML`, `screenshot PNG`, `PDF`, `WARC`, `article text MD`, `headers JSON`, `title`, `favicon`, ...
+- 🌐 **HTML**/**Any websites** ➡️ `original HTML+CSS+JS`, `singlefile HTML`, `screenshot PNG`, `PDF`, `WARC`, `title`, `article text`, `favicon`, `headers`, ...
 - 🎥 **Social Media**/**News** ➡️ `post content TXT`, `comments`, `title`, `author`, `images`
 - 🎬 **YouTube**/**SoundCloud**/etc. ➡️ `MP3/MP4`s, `subtitles`, `metadata`, `thumbnail`, ...
 - 💾 **Github**/**Gitlab**/etc. links ➡️ `clone of GIT source code`, `README`, `images`, ...
@@ -134,7 +134,7 @@ ArchiveBox is free for everyone to self-host, but we also provide support, secur
 - ⚖️ **Lawyers:**
   `collecting & preserving evidence`, `detecting changes`, `tagging & review`
 - 🔬 **Researchers:**
-  `analyzing social media trends`, `getting LLM training sets`, `crawling pipelines`
+  `analyzing social media trends`, `getting LLM training data`, `crawling pipelines`
 - 👩🏽 **Individuals:**
   `saving bookmarks`, `preserving portfolio content`, `legacy / memoirs archival`
@@ -471,8 +471,8 @@ docker compose run archivebox help
 curl sh automatic setup script CLI Usage Examples (non-Docker)

-# make sure you have pip-installed ArchiveBox and it's available in your $PATH first
-
+# make sure you have pip-installed ArchiveBox and it's available in your $PATH first  
+
 # archivebox [subcommand] [--args]
 archivebox init --setup      # safe to run init multiple times (also how you update versions)
 archivebox version           # get archivebox version info + check dependencies
@@ -488,7 +488,7 @@ archivebox add --depth=1 'https://news.ycombinator.com'

 # make sure you have `docker-compose.yml` from the Quickstart instructions first
-
+
 # docker compose run archivebox [subcommand] [--args]
 docker compose run archivebox init --setup
 docker compose run archivebox version
@@ -505,7 +505,7 @@ docker compose run archivebox add --depth=1 'https://news.ycombinator.com'

 # make sure you create and cd into a new empty directory first  
-
+
 # docker run -it -v $PWD:/data archivebox/archivebox [subcommand] [--args]
 docker run -v $PWD:/data -it archivebox/archivebox init --setup
 docker run -v $PWD:/data -it archivebox/archivebox version
@@ -610,19 +610,19 @@ docker run -it -v $PWD:/data archivebox/archivebox add --depth=1 'https://exampl
 ## Input Formats: How to pass URLs into ArchiveBox for saving
-- The official ArchiveBox Browser Extension
-  Provides realtime archiving of browsing history or selected pages from Chrome/Chromium/Firefox browsers
+- From the official ArchiveBox Browser Extension
+  Provides realtime archiving of browsing history or selected pages from Chrome/Chromium/Firefox browsers.
-- Manual imports of URLs from RSS, JSON, CSV, TXT, SQL, HTML, Markdown, etc. files
-  ArchiveBox supports ingesting URLs in [any text-based format...](https://github.com/ArchiveBox/ArchiveBox/wiki/Usage#Import-a-list-of-URLs-from-a-text-file)
+- From manual imports of URLs from RSS, JSON, CSV, TXT, SQL, HTML, Markdown, etc. files
+  ArchiveBox supports ingesting URLs in [any text-based format](https://github.com/ArchiveBox/ArchiveBox/wiki/Usage#Import-a-list-of-URLs-from-a-text-file).
-- Manually exported [browser history](https://github.com/ArchiveBox/ArchiveBox/wiki/Quickstart#2-get-your-list-of-urls-to-archive) or [browser bookmarks](https://github.com/ArchiveBox/ArchiveBox/wiki/Quickstart#2-get-your-list-of-urls-to-archive) from any browser
+- From manually exported [browser history](https://github.com/ArchiveBox/ArchiveBox/wiki/Quickstart#2-get-your-list-of-urls-to-archive) or [browser bookmarks](https://github.com/ArchiveBox/ArchiveBox/wiki/Quickstart#2-get-your-list-of-urls-to-archive) (in Netscape format)
   See instructions for: Chrome, Firefox, Safari, IE, Opera, and more...
-- [MITM Proxy](https://mitmproxy.org/) archiving with [`archivebox-proxy`](https://github.com/ArchiveBox/archivebox-proxy)
+- From URLs visited through a [MITM Proxy](https://mitmproxy.org/) with [`archivebox-proxy`](https://github.com/ArchiveBox/archivebox-proxy)
   Provides [realtime archiving](https://github.com/ArchiveBox/ArchiveBox/issues/577) of all traffic from any device going through the proxy.
-- Links from bookmarking services or social media (e.g. Twitter bookmarks, Reddit saved posts, etc.)
+- From bookmarking services or social media (e.g. Twitter bookmarks, Reddit saved posts, etc.)
   See instructions for: Pocket, Pinboard, Instapaper, Shaarli, Delicious, Reddit Saved, Wallabag, Unmark.it, OneTab, Firefox Sync, and more...
@@ -743,44 +743,47 @@ ArchiveBox bundles industry-standard tools like [Google Chrome](https://github.c
Expand to learn more about ArchiveBox's internals & dependencies...
-> *TIP: For better security, easier updating, and to avoid polluting your host system with extra dependencies, **it is strongly recommended to use the [⭐️ official Docker image](https://github.com/ArchiveBox/ArchiveBox/wiki/Docker)** with everything pre-installed for the best experience.*
+
+

TIP: For better security, easier updating, and to avoid polluting your host system with extra dependencies, it is strongly recommended to use the ⭐️ official Docker image with everything pre-installed for the best experience.

+
 These optional dependencies used for archiving sites include:
-archivebox --version CLI output screenshot showing dependencies installed
+archivebox --version CLI output screenshot showing dependencies installed
+
+  • chromium / chrome (for screenshots, PDF, DOM HTML, and headless JS scripts)
+  • node & npm (for readability, mercury, and singlefile)
+  • wget (for plain HTML, static files, and WARC saving)
+  • curl (for fetching headers, favicon, and posting to Archive.org)
+  • yt-dlp or youtube-dl (for audio, video, and subtitles)
+  • git (for cloning git repos)
+  • singlefile (for saving into a self-contained html file)
+  • postlight/parser (for discussion threads, forums, and articles)
+  • readability (for articles and long text content)
+  • and more as we grow...
+
-- `chromium` / `chrome` (for screenshots, PDF, DOM HTML, and headless JS scripts)
-- `node` & `npm` (for readability, mercury, and singlefile)
-- `wget` (for plain HTML, static files, and WARC saving)
-- `curl` (for fetching headers, favicon, and posting to Archive.org)
-- `yt-dlp` or `youtube-dl` (for audio, video, and subtitles)
-- `git` (for cloning git repos)
-- `singlefile` (for saving into a self-contained html file)
-- `postlight/parser` (for discussion threads, forums, and articles)
-- `readability` (for articles and long text content)
-- and more as we grow...
-
-You don't need to install every dependency to use ArchiveBox. ArchiveBox will automatically disable extractors that rely on dependencies that aren't installed, based on what is configured and available in your `$PATH`.
-
+You don't need to install every dependency to use ArchiveBox. ArchiveBox will automatically disable extractors that rely on dependencies that aren't installed, based on what is configured and available in your $PATH.
+
 If not using Docker, make sure to keep the dependencies up-to-date yourself and check that ArchiveBox isn't reporting any incompatibility with the versions you install.
-```bash
-# install python3 and archivebox with your system package manager
+# install python3 and archivebox with your system package manager
 # apt/brew/pip/etc install ... (see Quickstart instructions above)
-
+
 archivebox setup       # auto install all the extractors and extras
 archivebox --version   # see info and check validity of installed dependencies
-```
+
+
+Installing directly on Windows without Docker or WSL/WSL2/Cygwin is not officially supported (I cannot respond to Windows support tickets), but some advanced users have reported getting it working.
-Installing directly on **Windows without Docker or WSL/WSL2/Cygwin is not officially supported** (I cannot respond to Windows support tickets), but some advanced users have reported getting it working.
-
-#### Learn More
-
-- https://github.com/ArchiveBox/ArchiveBox/wiki/Install#dependencies
-- https://github.com/ArchiveBox/ArchiveBox/wiki/Chromium-Install
-- https://github.com/ArchiveBox/ArchiveBox/wiki/Upgrading-or-Merging-Archives
-- https://github.com/ArchiveBox/ArchiveBox/wiki/Troubleshooting#installing
+

Learn More

+
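
The dependency paragraph in this hunk says extractors are disabled when their tools are not available in `$PATH`. A rough way to preview which tools a shell can currently find is sketched below. It is only an illustration of the idea, not ArchiveBox's own detection logic: the `check_deps` helper is hypothetical, and the binary names passed to it are assumptions based on the dependency list (actual binary names vary by platform).

```bash
# Hypothetical helper: report which of the given binaries are discoverable on $PATH.
check_deps() {
    for bin in "$@"; do
        # `command -v` is the portable way to test whether a command can be found
        if command -v "$bin" >/dev/null 2>&1; then
            printf 'found    %s\n' "$bin"
        else
            printf 'missing  %s\n' "$bin"
        fi
    done
}

# Assumed binary names, taken from the dependency list in the README
check_deps wget curl git node npm yt-dlp chromium
```

For the authoritative view, `archivebox --version` (shown in the README snippet above) reports the dependencies ArchiveBox itself detected.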

@@ -948,8 +951,8 @@ https://127.0.0.1:8000/archive/*
-

NOTE: Only the wget & dom extractor methods execute archived JS when viewing snapshots; all other archive methods produce static output that does not execute JS on viewing.
-If you are worried about these issues ^ you should disable these extractors using archivebox config --set SAVE_WGET=False SAVE_DOM=False.

+

NOTE: Only the wget & dom extractor methods execute archived JS when viewing snapshots; all other archive methods produce static output that does not execute JS on viewing.
+If you are worried about these issues ^ you should disable these extractors using:
archivebox config --set SAVE_WGET=False SAVE_DOM=False.

Learn More

@@ -1007,13 +1010,14 @@ archivebox add 'https://example.com#2020-10-25'
 The Re-Snapshot button in the Admin UI is a shortcut for this hash-date multi-snapshotting workaround.
-Improved support for saving multiple snapshots of a single URL without this hash-date workaround will be [added eventually](https://github.com/ArchiveBox/ArchiveBox/issues/179) (along with the ability to view diffs of the changes between runs).
+Improved support for saving multiple snapshots of a single URL without this hash-date workaround will be added eventually (along with the ability to view diffs of the changes between runs).
-#### Learn More
-
-- https://github.com/ArchiveBox/ArchiveBox/issues/179
-- https://github.com/ArchiveBox/ArchiveBox/wiki/Usage#explanation-of-buttons-in-the-web-ui---admin-snapshots-list
+

Learn More

+
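
The hash-date workaround in this hunk appends a date as a URL fragment so the same page can be added again as a separate snapshot. A minimal sketch of building such a URL is shown below. The `dated_url` helper is hypothetical and assumes the input URL has no fragment of its own.

```bash
# Hypothetical helper: append a date fragment to a URL for re-snapshotting.
# $1 = URL (assumed to have no existing #fragment)
# $2 = optional date; defaults to today in YYYY-MM-DD form
dated_url() {
    printf '%s#%s\n' "$1" "${2:-$(date +%Y-%m-%d)}"
}

dated_url 'https://example.com' '2020-10-25'   # prints https://example.com#2020-10-25
```

The result can then be passed to the command shown in the hunk header, for example `archivebox add "$(dated_url 'https://example.com')"`.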
@@ -1036,14 +1040,15 @@ Disk usage can be reduced by using a compressed/deduplicated filesystem like ZFS
 If using Docker or NFS/SMB/FUSE for the `data/archive/` folder, you may need to set [`PUID` & `PGID`](https://github.com/ArchiveBox/ArchiveBox/wiki/Configuration#puid--pgid) and [disable `root_squash`](https://github.com/ArchiveBox/ArchiveBox/issues/1304) on your fileshare server.
-#### Learn More
-
-- https://github.com/ArchiveBox/ArchiveBox/wiki/Usage#Disk-Layout
-- https://github.com/ArchiveBox/ArchiveBox/wiki/Security-Overview#output-folder
-- https://github.com/ArchiveBox/ArchiveBox/wiki/Usage#large-archives
-- https://github.com/ArchiveBox/ArchiveBox/wiki/Configuration#puid--pgid
-- https://github.com/ArchiveBox/ArchiveBox/wiki/Security-Overview#do-not-run-as-root
+

Learn More

+