# Pocket Stream Archive
(Your own personal Wayback Machine)
Save an archived copy of all websites you star using Pocket, indexed in an HTML file. Powered by the new [headless](https://developers.google.com/web/updates/2017/04/headless-chrome) Google Chrome and good ol' `wget`.
![](screenshot.png)
## Quickstart
**Runtime:** I've found it takes about an hour to download 1,000 articles, and they take up roughly 1 GB.
Those numbers are from running it on my 4-core i5 machine with 50 Mbps down. YMMV.
**Dependencies:** Google Chrome headless, wget
```bash
brew install Caskroom/versions/google-chrome-canary
brew install wget
# OR on linux
wget -q -O - https://dl-ssl.google.com/linux/linux_signing_key.pub | sudo apt-key add -
sudo sh -c 'echo "deb [arch=amd64] http://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google-chrome.list'
sudo apt update; sudo apt install google-chrome-beta
```
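Headless mode only landed in Chrome 59, which is why the canary/beta channels are used above. A quick way to check you have a new enough build (binary names vary by platform):

```bash
# on Linux, after adding the Google repo above
google-chrome --version
# on macOS, the Canary binary lives inside the app bundle
"/Applications/Google Chrome Canary.app/Contents/MacOS/Google Chrome Canary" --version
```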
**Archiving:**
1. Download your Pocket export file `ril_export.html` from https://getpocket.com/export
2. Download this repo `git clone https://github.com/pirate/pocket-archive-stream`
3. `cd pocket-archive-stream/`
4. `./archive.py ~/Downloads/ril_export.html`
It produces a folder `pocket/` containing an `index.html` and archived copies of all the sites,
organized by timestamp. For each site it saves:
- a wget clone of the site, e.g. `en.wikipedia.org/wiki/Example.html` (with `.html` appended if not already present)
- `screenshot.png` 1440x900 screenshot of the site using headless Chrome
- `output.pdf` printed PDF of the site using headless Chrome
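For reference, these commands sketch roughly what `archive.py` drives for each site; the exact flags and invocations live in `archive.py`, so treat them as illustrative:

```bash
URL='https://en.wikipedia.org/wiki/Example'
# fetch the page plus the assets it needs, appending .html where it's missing
wget --page-requisites --adjust-extension --convert-links "$URL"
# headless Chrome writes screenshot.png and output.pdf to the current directory
google-chrome --headless --disable-gpu --window-size=1440,900 --screenshot "$URL"
google-chrome --headless --disable-gpu --print-to-pdf "$URL"
```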
You can tweak parameters like screenshot size, file paths, timeouts, etc. in `archive.py`.
You can also tweak the output HTML index in `index_template.html`. It just uses Python
format strings (not a proper templating engine like Jinja2), which is why the CSS braces are doubled as `{{...}}`.
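To see why the doubling is needed: Python's `str.format` treats single braces as replacement fields, so literal CSS braces must be escaped by doubling (a toy example, not the actual template):

```bash
python3 -c 'print("body {{ margin: 0 }} {title}".format(title="Example"))'
# prints: body { margin: 0 } Example
```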
## Publishing Your Archive
The Pocket archive is suitable for serving from your personal server: upload the
archive to `/var/www/pocket` and allow people to access your saved copies of sites.
Just stick this in your nginx config to properly serve the wget-archived sites (the `$uri.html` fallback matches the `.html` extension wget appends):
```nginx
location /pocket/ {
    alias /var/www/pocket/;
    try_files $uri $uri/ $uri.html =404;
}
```
Make sure you're not running any content as CGI or PHP; you only want to serve static files!
URLs look like: `https://sweeting.me/pocket/archive/1493350273/en.wikipedia.org/wiki/Dining_philosophers_problem`
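A quick way to verify the config resolves the `.html` fallback correctly (URL from the example above):

```bash
curl -I 'https://sweeting.me/pocket/archive/1493350273/en.wikipedia.org/wiki/Dining_philosophers_problem'
# expect an HTTP 200 once try_files finds the matching .html file
```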
## Info
This is basically an open-source version of [Pocket Premium](https://getpocket.com/premium) (which you should consider paying for!).
I got tired of sites I saved going offline or changing their URLs, so I started
archiving a copy of them locally, similar to the Wayback Machine provided
by [archive.org](https://archive.org).
Now I can rest soundly knowing important articles and resources I like won't disappear off the internet.
Here's my published archive as an example: [sweeting.me/pocket](https://home.sweeting.me/pocket).
## Security WARNING
Hosting other people's site content has security implications for your domain. Make sure you understand
the dangers of serving other people's CSS & JS files from your domain. It's best to put this on a domain
of its own to slightly mitigate CSRF attacks.
It might also be prudent to disallow your archive in your `robots.txt` so that search engines don't index
the content on your domain.
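For example, assuming the archive is served under `/pocket/` as in the nginx snippet above, a `robots.txt` at the domain root could read:

```
User-agent: *
Disallow: /pocket/
```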
## TODO
- body text extraction using [fathom](https://hacks.mozilla.org/2017/04/fathom-a-framework-for-understanding-web-pages/)
- auto-tagging based on important extracted words
- audio & video archiving with `youtube-dl`
- full-text indexing with elasticsearch
- video closed-caption downloading for full-text indexing video content
- automatic text summaries of articles with a summarization library
- feature image extraction
- http support (from my https-only domain)