Update README.md

Nick Sweeting 2024-01-05 17:20:09 -08:00 committed by GitHub
parent a232b45b61
commit e43babb7ac


@@ -620,9 +620,9 @@ Installing directly on **Windows without Docker or WSL/WSL2/Cygwin is not offici
- https://github.com/ArchiveBox/ArchiveBox/wiki/Troubleshooting#installing
</details>
<br/>
## Archive Layout
All of ArchiveBox's state (including the SQLite DB, archived assets, config, logs, etc.) is stored in a single folder called the "ArchiveBox Data Folder".
@@ -633,6 +633,7 @@ Data folders can be created anywhere (`~/archivebox` or `$PWD/data` as seen in o
<summary><i>Expand to learn more about the layout of ArchiveBox's data on-disk...</i></summary>
<br/>
All `archivebox` CLI commands are designed to be run from inside an ArchiveBox data folder, starting with `archivebox init` to initialize a new collection inside an empty directory.
```bash
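# a minimal sketch of the workflow described above (the folder path is illustrative):
mkdir -p ~/archivebox && cd ~/archivebox   # create and enter an empty data folder
archivebox init                            # initialize a new collection here
archivebox add 'https://example.com'       # subsequent commands run from inside it
```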
@@ -671,10 +672,11 @@ Each snapshot subfolder `./archive/<timestamp>/` includes a static `index.json`
- https://github.com/ArchiveBox/ArchiveBox/wiki/Publishing-Your-Archive
- https://github.com/ArchiveBox/ArchiveBox/wiki/Upgrading-or-Merging-Archives
</details>
<br/>
## Static Archive Exporting
You can export the main index to browse it statically as plain HTML files in a folder (without needing to run a server).
@@ -684,6 +686,7 @@ You can export the main index to browse it statically as plain HTML files in a f
<summary><i>Expand to learn how to export your ArchiveBox collection...</i></summary>
<br/>
> *NOTE: These exports are not paginated; exporting many URLs or the entire archive at once may be slow.*
> *Use the filtering CLI flags on the `archivebox list` command to export specific Snapshots or ranges.*
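For example, something along these lines exports the index in each format (a sketch; flags per the `archivebox list` docs, but double-check `archivebox list --help` on your version):

```bash
archivebox list --html --with-headers > index.html   # static, browsable HTML index
archivebox list --json --with-headers > index.json   # machine-readable JSON index
archivebox list --csv=timestamp,url,title > index.csv
```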
@@ -707,15 +710,16 @@ The paths in the static exports are relative, make sure to keep them next to you
</details>
<br/>
---
<div align="center">
<img src="https://docs.monadical.com/uploads/upload_b6900afc422ae699bfefa2dcda3306f3.png" width="100%" alt="security graphic"/>
</div>
## Caveats
### Archiving Private Content
@@ -758,6 +762,7 @@ archivebox config --set CHROME_BINARY=chromium # ensure it's using Chromium
- https://github.com/ArchiveBox/ArchiveBox/wiki/Configuration#chrome_user_data_dir
- https://github.com/ArchiveBox/ArchiveBox/wiki/Configuration#cookies_file
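For example, to archive content that requires a login, you might point ArchiveBox at a pre-authenticated browser profile and cookies file (a sketch; the option names come from the Configuration pages linked above, the paths are illustrative):

```bash
archivebox config --set CHROME_USER_DATA_DIR=/path/to/chrome/profile
archivebox config --set COOKIES_FILE=/path/to/cookies.txt
```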
</details>
<br/>
@@ -766,6 +771,7 @@ archivebox config --set CHROME_BINARY=chromium # ensure it's using Chromium
Be aware that malicious archived JS can access the contents of other pages in your archive when viewed. Because the Web UI serves all viewed snapshots from a single domain, they share a request context and **typical CSRF/CORS/XSS/CSP protections do not work to prevent cross-site request attacks**. See the [Security Overview](https://github.com/ArchiveBox/ArchiveBox/wiki/Security-Overview#stealth-mode) page and [Issue #239](https://github.com/ArchiveBox/ArchiveBox/issues/239) for more details.
<br/>
<details>
<summary><i>Click to expand...</i></summary>
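<br/>
If the archive doesn't need to be shared publicly, one blunt mitigation is to disable anonymous access entirely (a sketch; assuming the standard `PUBLIC_*` config options available in recent ArchiveBox versions):

```bash
archivebox config --set PUBLIC_INDEX=False      # hide the index from logged-out users
archivebox config --set PUBLIC_SNAPSHOTS=False  # require login to view snapshots
archivebox config --set PUBLIC_ADD_VIEW=False   # require login to submit new URLs
```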
@@ -797,6 +803,7 @@ The admin UI is also served from the same origin as replayed JS, so malicious pa
</details>
<br/>
### Working Around Sites that Block Archiving
For various reasons, many large sites (Reddit, Twitter, Cloudflare, etc.) actively block archiving or bots in general. There are a number of approaches to work around this.
@@ -806,6 +813,7 @@ For various reasons, many large sites (Reddit, Twitter, Cloudflare, etc.) active
<summary><i>Click to expand...</i></summary>
<br/>
- Set [`CHROME_USER_AGENT`, `WGET_USER_AGENT`, `CURL_USER_AGENT`](https://github.com/ArchiveBox/ArchiveBox/wiki/Configuration#curl_user_agent) to impersonate a real browser (instead of an ArchiveBox bot)
- Set up a logged-in browser session for archiving using [`CHROME_DATA_DIR` & `COOKIES_FILE`](https://github.com/ArchiveBox/ArchiveBox/wiki/Chromium-Install#setting-up-a-chromium-user-profile)
- Rewrite your URLs before archiving to swap in an alternative frontend that's more bot-friendly, e.g. as in the sketch below
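As one illustration (a hypothetical sketch: `nitter.net` stands in for whichever alternative frontend you choose), URLs can be rewritten on the way in before being piped to `archivebox add`:

```bash
echo 'https://twitter.com/ArchiveBoxApp' \
  | sed 's|//twitter\.com|//nitter.net|' \
  | archivebox add
```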
@@ -822,11 +830,13 @@ In the future we plan on adding support for running JS scripts during archiving
ArchiveBox appends a hash with the current date `https://example.com#2020-10-24` to differentiate when a single URL is archived multiple times.
<br/>
<details>
<summary><i>Click to expand...</i></summary>
<br/>
Because ArchiveBox uniquely identifies snapshots by URL, it must use a workaround to take multiple snapshots of the same URL (otherwise they would show up as a single Snapshot entry). It makes the URLs of repeated snapshots unique by adding a hash with the archive date at the end:
```bash
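# a sketch of the date-hash workaround described above:
archivebox add 'https://example.com#2020-10-24'
archivebox add 'https://example.com#2020-10-25'   # same URL, but a distinct Snapshot
```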
@@ -848,6 +858,7 @@ Improved support for saving multiple snapshots of a single URL without this hash
</details>
<br/>
### Storage Requirements
Because ArchiveBox is designed to ingest a large volume of URLs with multiple copies of each URL stored by different 3rd-party tools, it can be quite disk-space intensive.
@@ -858,6 +869,7 @@ There are also some special requirements when using filesystems like NFS/SMB/FU
<summary><i>Click to expand...</i></summary>
<br/>
**ArchiveBox can use anywhere from ~1gb to ~50gb per 1000 articles**, mostly depending on whether you're saving audio & video using `SAVE_MEDIA=True` and whether you lower `MEDIA_MAX_SIZE=750mb`.
Disk usage can be reduced by using a compressed/deduplicated filesystem like ZFS/BTRFS, or by turning off extractor methods you don't need. You can also deduplicate content with a tool like [fdupes](https://github.com/adrianlopezroche/fdupes) or [rdfind](https://github.com/pauldreik/rdfind). **Don't store large collections on older filesystems like EXT3/FAT** as they may not be able to handle more than 50k directory entries in the `archive/` folder. **Try to keep the `index.sqlite3` file on a local drive (not a network mount)** or SSD for maximum performance; the `archive/` folder, however, can be on a network mount or slower HDD.
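For example, to rein in disk usage you could disable media extraction or cap its size (a sketch; `SAVE_MEDIA` and `MEDIA_MAX_SIZE` are the options named above, the values here are illustrative):

```bash
archivebox config --set SAVE_MEDIA=False       # skip audio/video downloads entirely
# ...or keep media but cap the size of each download:
archivebox config --set MEDIA_MAX_SIZE=250mb
```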
@@ -878,10 +890,13 @@ If using Docker or NFS/SMB/FUSE for the `data/archive/` folder, you may need to
</details>
<br/>
---
<br/>
## Screenshots
<div align="center" width="80%">
@@ -922,23 +937,27 @@ If using Docker or NFS/SMB/FUSE for the `data/archive/` folder, you may need to
</div>
<br/>
---
<br/>
<br/>
<div align="center">
<img src="https://github.com/ArchiveBox/ArchiveBox/assets/511499/ca85432e-a2df-40c6-968f-51a1ef99b24e" width="100%" alt="paisley graphic">
</div>
# Background & Motivation
ArchiveBox aims to enable more of the internet to be saved from deterioration by empowering people to self-host their own archives. The intent is for all the web content you care about to be viewable with common software in 50-100 years without needing to run ArchiveBox or other specialized software to replay it.
<br/>
<details>
<summary><i>Click to read more...</i></summary>
<br/>
Vast treasure troves of knowledge are lost every day on the internet to link rot. As a society, we have an imperative to preserve some important parts of that treasure, just like we preserve our books, paintings, and music in physical libraries long after the originals go out of print or fade into obscurity.
Whether it's to resist censorship by saving articles before they get taken down or edited, or just to save a collection of early-2010s Flash games you love to play, having the tools to archive internet content enables you to save the stuff you care most about before it disappears.
@@ -948,14 +967,17 @@ Whether it's to resist censorship by saving articles before they get taken down
<sup><i>Image from <a href="https://perma.cc/">Perma.cc</a>...</i><br/></sup>
</div>
The balance between permanence and the ephemeral nature of content on the internet is part of what makes it beautiful. I don't think everything should be preserved in an automated fashion (making all content permanent and never removable), but I do think people should be able to decide for themselves and effectively archive specific content that they care about.
Because modern websites are complicated and often rely on dynamic content,
ArchiveBox archives the sites in **several different formats** beyond what public archiving services like Archive.org/Archive.is save. Using multiple methods and the market-dominant browser to execute JS ensures we can save even the most complex, finicky websites in at least a few high-quality, long-term data formats.
</details>
<br/>
## Comparison to Other Projects
<img src="https://github.com/ArchiveBox/ArchiveBox/assets/511499/4cac62a9-e8fb-425b-85a3-ca644aa6dd42" width="5%" align="right" alt="comparison"/>