mirror of
https://github.com/ArchiveBox/ArchiveBox
synced 2024-11-14 00:17:08 +00:00
Merge pull request #1026 from turian/feature/kludge-984-UTF8-bug
commit
8a96563169
10 changed files with 43 additions and 18 deletions
3
.gitignore
vendored
@@ -24,3 +24,6 @@ data1/
 data2/
 data3/
 output/
+
+# vim
+*.sw?
@@ -1,5 +1,5 @@
 # This is the Dockerfile for ArchiveBox, it bundles the following dependencies:
-# python3, ArchiveBox, curl, wget, git, chromium, youtube-dl, single-file
+# python3, ArchiveBox, curl, wget, git, chromium, youtube-dl, yt-dlp, single-file
 # Usage:
 # git submodule update --init --recursive
 # git pull --recurse-submodules
@@ -87,7 +87,7 @@ ls ./archive/*/index.json # or browse directly via the filesyste
 - [**Free & open source**](https://github.com/ArchiveBox/ArchiveBox/blob/master/LICENSE), doesn't require signing up online, stores all data locally
 - [**Powerful, intuitive command line interface**](https://github.com/ArchiveBox/ArchiveBox/wiki/Usage#CLI-Usage) with [modular optional dependencies](#dependencies)
 - [**Comprehensive documentation**](https://github.com/ArchiveBox/ArchiveBox/wiki), [active development](https://github.com/ArchiveBox/ArchiveBox/wiki/Roadmap), and [rich community](https://github.com/ArchiveBox/ArchiveBox/wiki/Web-Archiving-Community)
-- [**Extracts a wide variety of content out-of-the-box**](https://github.com/ArchiveBox/ArchiveBox/issues/51): [media (youtube-dl), articles (readability), code (git), etc.](#output-formats)
+- [**Extracts a wide variety of content out-of-the-box**](https://github.com/ArchiveBox/ArchiveBox/issues/51): [media (youtube-dl or yt-dlp), articles (readability), code (git), etc.](#output-formats)
 - [**Supports scheduled/realtime importing**](https://github.com/ArchiveBox/ArchiveBox/wiki/Scheduled-Archiving) from [many types of sources](#input-formats)
 - [**Uses standard, durable, long-term formats**](#saves-lots-of-useful-stuff-for-each-imported-link) like HTML, JSON, PDF, PNG, and WARC
 - [**Usable as a oneshot CLI**](https://github.com/ArchiveBox/ArchiveBox/wiki/Usage#CLI-Usage), [**self-hosted web UI**](https://github.com/ArchiveBox/ArchiveBox/wiki/Usage#UI-Usage), [Python API](https://docs.archivebox.io/en/latest/modules.html) (BETA), [REST API](https://github.com/ArchiveBox/ArchiveBox/issues/496) (ALPHA), or [desktop app](https://github.com/ArchiveBox/electron-archivebox) (ALPHA)
@@ -469,7 +469,7 @@ Inside each Snapshot folder, ArchiveBox save these different types of extractor
 - **DOM Dump:** `output.html` DOM Dump of the HTML after rendering using headless chrome
 - **Article Text:** `article.html/json` Article text extraction using Readability & Mercury
 - **Archive.org Permalink:** `archive.org.txt` A link to the saved site on archive.org
-- **Audio & Video:** `media/` all audio/video files + playlists, including subtitles & metadata with youtube-dl
+- **Audio & Video:** `media/` all audio/video files + playlists, including subtitles & metadata with youtube-dl (or yt-dlp)
 - **Source Code:** `git/` clone of any repository found on GitHub, Bitbucket, or GitLab links
 - _More coming soon! See the [Roadmap](https://github.com/ArchiveBox/ArchiveBox/wiki/Roadmap)..._
 
@@ -529,7 +529,7 @@ To achieve high fidelity archives in as many situations as possible, ArchiveBox
 - `node` & `npm` (for readability, mercury, and singlefile)
 - `wget` (for plain HTML, static files, and WARC saving)
 - `curl` (for fetching headers, favicon, and posting to Archive.org)
-- `youtube-dl` (for audio, video, and subtitles)
+- `youtube-dl` or `yt-dlp` (for audio, video, and subtitles)
 - `git` (for cloning git repos)
 - and more as we grow...
 
@@ -144,12 +144,19 @@ CONFIG_SCHEMA: Dict[str, ConfigDefaultDict] = {
 '--no-call-home',
 '--write-sub',
 '--all-subs',
-'--write-auto-sub',
+# There are too many of these and youtube
+# throttles you with HTTP error 429
+#'--write-auto-subs',
 '--convert-subs=srt',
 '--yes-playlist',
 '--continue',
-'--ignore-errors',
+# This flag doesn't exist in youtube-dl
+# only in yt-dlp
 '--no-abort-on-error',
+# --ignore-errors must come AFTER
+# --no-abort-on-error
+# https://github.com/yt-dlp/yt-dlp/issues/4914
+'--ignore-errors',
 '--geo-bypass',
 '--add-metadata',
 '--max-filesize={}'.format(c['MEDIA_MAX_SIZE']),
@@ -203,7 +210,8 @@ CONFIG_SCHEMA: Dict[str, ConfigDefaultDict] = {
 'SINGLEFILE_BINARY': {'type': str, 'default': lambda c: bin_path('single-file')},
 'READABILITY_BINARY': {'type': str, 'default': lambda c: bin_path('readability-extractor')},
 'MERCURY_BINARY': {'type': str, 'default': lambda c: bin_path('mercury-parser')},
-'YOUTUBEDL_BINARY': {'type': str, 'default': 'youtube-dl'},
+#'YOUTUBEDL_BINARY': {'type': str, 'default': 'youtube-dl'},
+'YOUTUBEDL_BINARY': {'type': str, 'default': 'yt-dlp'},
 'NODE_BINARY': {'type': str, 'default': 'node'},
 'RIPGREP_BINARY': {'type': str, 'default': 'rg'},
 'CHROME_BINARY': {'type': str, 'default': None},
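The comments in the `-144,12 +144,19` hunk record an ordering constraint: `--ignore-errors` has to come after `--no-abort-on-error` (see the linked yt-dlp issue 4914). A constraint like this is easy to break when the list is edited later, so a small guard may help. The following is a minimal sketch under stated assumptions, not ArchiveBox code: `build_media_args` and `ordering_is_valid` are hypothetical names, and only the flag strings are taken from the diff.

```python
from typing import List


def build_media_args(max_size: str) -> List[str]:
    """Hypothetical helper: return the flags from the hunk in the order the diff uses."""
    return [
        '--no-call-home',
        '--write-sub',
        '--all-subs',
        '--convert-subs=srt',
        '--yes-playlist',
        '--continue',
        '--no-abort-on-error',
        # Per the comment in the diff, this flag has to follow --no-abort-on-error.
        '--ignore-errors',
        '--geo-bypass',
        '--add-metadata',
        '--max-filesize={}'.format(max_size),
    ]


def ordering_is_valid(args: List[str]) -> bool:
    """Return True when --ignore-errors appears after --no-abort-on-error."""
    return args.index('--ignore-errors') > args.index('--no-abort-on-error')
```

A check such as `assert ordering_is_valid(build_media_args('750m'))` could live in a test, so that a later reordering of the list fails loudly instead of silently changing downloader behaviour.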
@@ -1,6 +1,7 @@
 __package__ = 'archivebox.extractors'
 
 import os
+import sys
 from pathlib import Path
 
 from typing import Optional, List, Iterable, Union
@@ -137,14 +138,16 @@ def archive_link(link: Link, overwrite: bool=False, methods: Optional[Iterable[s
 link.url,
 )) from e
 """
 # Instead, use the kludgy workaround from
 # https://github.com/ArchiveBox/ArchiveBox/issues/984#issuecomment-1150541627
 with open(ERROR_LOG, "a", encoding='utf-8') as f:
 command = ' '.join(sys.argv)
 ts = datetime.now(timezone.utc).strftime('%Y-%m-%d__%H:%M:%S')
-f.write(("\n" + 'Exception in archive_methods.save_{}(Link(url={}))'.format(
+f.write(("\n" + 'Exception in archive_methods.save_{}(Link(url={})) command={}; ts={}'.format(
 method_name,
 link.url,
+command,
+ts
 ) + "\n"))
 #f.write(f"\n> {command}; ts={ts} version={config['VERSION']} docker={config['IN_DOCKER']} is_tty={config['IS_TTY']}\n")
 
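The `-137,14 +138,16` hunk extends the error-log entry with the invoking command line and a UTC timestamp. To see what one entry looks like, the formatting can be pulled out into a pure function. This is a sketch only: `format_error_entry` is a hypothetical name, and the argument list and clock are passed in (instead of reading `sys.argv` and `datetime.now`) so the result is reproducible.

```python
from datetime import datetime, timezone
from typing import List


def format_error_entry(method_name: str, url: str, argv: List[str], now: datetime) -> str:
    """Hypothetical helper: build one log entry in the format used by the diff."""
    command = ' '.join(argv)
    # Same timestamp layout as the diff: date and time joined by a double underscore.
    ts = now.strftime('%Y-%m-%d__%H:%M:%S')
    return "\n" + 'Exception in archive_methods.save_{}(Link(url={})) command={}; ts={}'.format(
        method_name,
        url,
        command,
        ts,
    ) + "\n"


entry = format_error_entry(
    'media',
    'https://example.com',
    ['archivebox', 'add'],
    datetime(2022, 6, 9, 1, 2, 3, tzinfo=timezone.utc),
)
```

Keeping the formatting separate from the file write also makes it straightforward to check that unusual characters in `url` survive the UTF-8 encoded append to the log.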
@@ -33,7 +33,7 @@ def should_save_media(link: Link, out_dir: Optional[Path]=None, overwrite: Optio
 
 @enforce_types
 def save_media(link: Link, out_dir: Optional[Path]=None, timeout: int=MEDIA_TIMEOUT) -> ArchiveResult:
-"""Download playlists or individual video, audio, and subtitles using youtube-dl"""
+"""Download playlists or individual video, audio, and subtitles using youtube-dl or yt-dlp"""
 
 out_dir = out_dir or Path(link.link_dir)
 output: ArchiveOutput = 'media'
@@ -61,7 +61,7 @@ def save_media(link: Link, out_dir: Optional[Path]=None, timeout: int=MEDIA_TIME
 pass
 else:
 hints = (
-'Got youtube-dl response code: {}.'.format(result.returncode),
+'Got youtube-dl (or yt-dlp) response code: {}.'.format(result.returncode),
 *result.stderr.decode().split('\n'),
 )
 raise ArchiveError('Failed to save media', hints)
@@ -72,8 +72,18 @@ def save_media(link: Link, out_dir: Optional[Path]=None, timeout: int=MEDIA_TIME
 timer.end()
 
 # add video description and subtitles to full-text index
+# Let's try a few different
 index_texts = [
-text_file.read_text(encoding='utf-8').strip()
+# errors:
+# * 'strict' to raise a ValueError exception if there is an
+# encoding error. The default value of None has the same effect.
+# * 'ignore' ignores errors. Note that ignoring encoding errors
+# can lead to data loss.
+# * 'xmlcharrefreplace' is only supported when writing to a
+# file. Characters not supported by the encoding are replaced with
+# the appropriate XML character reference &#nnn;.
+# There are a few more options described in https://docs.python.org/3/library/functions.html#open
+text_file.read_text(encoding='utf-8', errors='xmlcharrefreplace').strip()
 for text_file in (
 *output_path.glob('*.description'),
 *output_path.glob('*.srt'),
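The comments added in the `-72,8 +72,18` hunk note that `'xmlcharrefreplace'` is only supported when writing, yet the new call passes it to `read_text`, which forwards `errors` to the decoder. The sketch below uses plain `bytes.decode` and a hypothetical helper `describe_decode` (neither is ArchiveBox code) to compare the handlers on a byte string that is not valid UTF-8. In CPython, `'xmlcharrefreplace'` only takes effect once an undecodable byte is hit, and at that point it raises `TypeError` rather than substituting anything, so `'replace'` or `'backslashreplace'` may be closer to what the change intends.

```python
def describe_decode(raw: bytes, errors: str) -> str:
    """Hypothetical helper: decode as UTF-8 and report either the text or the exception type."""
    try:
        return 'ok: ' + raw.decode('utf-8', errors=errors)
    except (UnicodeDecodeError, TypeError) as exc:
        return 'raised ' + type(exc).__name__


# 0xFF can never appear in valid UTF-8, so it forces the error handler to run.
bad = b'caption \xff end'

results = {
    handler: describe_decode(bad, handler)
    for handler in ('strict', 'ignore', 'replace', 'backslashreplace', 'xmlcharrefreplace')
}
```

Inspecting `results` shows which handlers keep the surrounding text. Well-formed input decodes the same way under every handler, which would explain why the problem only surfaces on files with stray bytes.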
@@ -91,9 +91,9 @@ echo " This is a helper script which installs the ArchiveBox dependencies on
 echo " You may be prompted for a sudo password in order to install the following:"
 echo ""
 echo " - archivebox"
 echo " - python3, pip, nodejs, npm (languages used by ArchiveBox, and its extractor modules)"
-echo " - curl, wget, git, youtube-dl (used for extracting title, favicon, git, media, and more)"
+echo " - curl, wget, git, youtube-dl, yt-dlp (used for extracting title, favicon, git, media, and more)"
 echo " - chromium (skips this if any Chrome/Chromium version is already installed)"
 echo ""
 echo " If you'd rather install these manually as-needed, you can find detailed documentation here:"
 echo " https://github.com/ArchiveBox/ArchiveBox/wiki/Install"
@@ -115,7 +115,7 @@ if which apt-get > /dev/null; then
 fi
 echo
 echo "[+] Installing ArchiveBox system dependencies using apt..."
-sudo apt-get install -y git python3 python3-pip python3-distutils wget curl youtube-dl ffmpeg git nodejs npm ripgrep
+sudo apt-get install -y git python3 python3-pip python3-distutils wget curl youtube-dl yt-dlp ffmpeg git nodejs npm ripgrep
 sudo apt-get install -y libgtk2.0-0 libgtk-3-0 libnotify-dev libgconf-2-4 libnss3 libxss1 libasound2 libxtst6 xauth xvfb libgbm-dev || sudo apt-get install -y chromium || sudo apt-get install -y chromium-browser || true
 sudo apt-get install -y archivebox
 sudo apt-get --only-upgrade install -y archivebox
@@ -55,7 +55,7 @@
 # CURL_BINARY = curl
 # GIT_BINARY = git
 # WGET_BINARY = wget
-# YOUTUBEDL_BINARY = youtube-dl
+# YOUTUBEDL_BINARY = yt-dlp
 # CHROME_BINARY = chromium
 
 # CHROME_USER_DATA_DIR="~/.config/google-chrome/Default"
1
setup.py
@@ -42,6 +42,7 @@ INSTALL_REQUIRES = [
 "django-extensions>=3.0.3",
 "dateparser>=1.0.0",
 "youtube-dl>=2021.04.17",
+"yt-dlp>=2021.4.11",
 "python-crontab>=2.5.1",
 "croniter>=0.3.34",
 "w3lib>=1.22.0",
@@ -5,7 +5,7 @@ Package3: archivebox
 Suite: focal
 Suite3: focal
 Build-Depends: debhelper, dh-python, python3-all, python3-pip, python3-setuptools, python3-wheel, python3-stdeb
-Depends3: nodejs, wget, curl, git, ffmpeg, youtube-dl, python3-all, python3-pip, python3-setuptools, python3-croniter, python3-crontab, python3-dateparser, python3-django, python3-django-extensions, python3-django-jsonfield, python3-mypy-extensions, python3-requests, python3-w3lib, ripgrep
+Depends3: nodejs, wget, curl, git, ffmpeg, youtube-dl, yt-dlp, python3-all, python3-pip, python3-setuptools, python3-croniter, python3-crontab, python3-dateparser, python3-django, python3-django-extensions, python3-django-jsonfield, python3-mypy-extensions, python3-requests, python3-w3lib, ripgrep
 X-Python3-Version: >= 3.7
 XS-Python-Version: >= 3.7
 Setup-Env-Vars: DEB_BUILD_OPTIONS=nocheck