Added yt-dlp everywhere

Joseph Turian 2022-09-12 20:34:02 +00:00
parent e41f313fa3
commit f5f7aff3b4
10 changed files with 18 additions and 11 deletions

3
.gitignore vendored

@ -24,3 +24,6 @@ data1/
 data2/
 data3/
 output/
+# vim
+*.sw?


@ -1,5 +1,5 @@
 # This is the Dockerfile for ArchiveBox, it bundles the following dependencies:
-# python3, ArchiveBox, curl, wget, git, chromium, youtube-dl, single-file
+# python3, ArchiveBox, curl, wget, git, chromium, youtube-dl, yt-dlp, single-file
 # Usage:
 # docker build . -t archivebox --no-cache
 # docker run -v "$PWD/data":/data archivebox init


@ -87,7 +87,7 @@ ls ./archive/*/index.json # or browse directly via the filesyste
 - [**Free & open source**](https://github.com/ArchiveBox/ArchiveBox/blob/master/LICENSE), doesn't require signing up online, stores all data locally
 - [**Powerful, intuitive command line interface**](https://github.com/ArchiveBox/ArchiveBox/wiki/Usage#CLI-Usage) with [modular optional dependencies](#dependencies)
 - [**Comprehensive documentation**](https://github.com/ArchiveBox/ArchiveBox/wiki), [active development](https://github.com/ArchiveBox/ArchiveBox/wiki/Roadmap), and [rich community](https://github.com/ArchiveBox/ArchiveBox/wiki/Web-Archiving-Community)
-- [**Extracts a wide variety of content out-of-the-box**](https://github.com/ArchiveBox/ArchiveBox/issues/51): [media (youtube-dl), articles (readability), code (git), etc.](#output-formats)
+- [**Extracts a wide variety of content out-of-the-box**](https://github.com/ArchiveBox/ArchiveBox/issues/51): [media (youtube-dl or yt-dlp), articles (readability), code (git), etc.](#output-formats)
 - [**Supports scheduled/realtime importing**](https://github.com/ArchiveBox/ArchiveBox/wiki/Scheduled-Archiving) from [many types of sources](#input-formats)
 - [**Uses standard, durable, long-term formats**](#saves-lots-of-useful-stuff-for-each-imported-link) like HTML, JSON, PDF, PNG, and WARC
 - [**Usable as a oneshot CLI**](https://github.com/ArchiveBox/ArchiveBox/wiki/Usage#CLI-Usage), [**self-hosted web UI**](https://github.com/ArchiveBox/ArchiveBox/wiki/Usage#UI-Usage), [Python API](https://docs.archivebox.io/en/latest/modules.html) (BETA), [REST API](https://github.com/ArchiveBox/ArchiveBox/issues/496) (ALPHA), or [desktop app](https://github.com/ArchiveBox/electron-archivebox) (ALPHA)
@ -469,7 +469,7 @@ Inside each Snapshot folder, ArchiveBox save these different types of extractor
 - **DOM Dump:** `output.html` DOM Dump of the HTML after rendering using headless chrome
 - **Article Text:** `article.html/json` Article text extraction using Readability & Mercury
 - **Archive.org Permalink:** `archive.org.txt` A link to the saved site on archive.org
-- **Audio & Video:** `media/` all audio/video files + playlists, including subtitles & metadata with youtube-dl
+- **Audio & Video:** `media/` all audio/video files + playlists, including subtitles & metadata with youtube-dl (or yt-dlp)
 - **Source Code:** `git/` clone of any repository found on GitHub, Bitbucket, or GitLab links
 - _More coming soon! See the [Roadmap](https://github.com/ArchiveBox/ArchiveBox/wiki/Roadmap)..._
@ -529,7 +529,7 @@ To achieve high fidelity archives in as many situations as possible, ArchiveBox
 - `node` & `npm` (for readability, mercury, and singlefile)
 - `wget` (for plain HTML, static files, and WARC saving)
 - `curl` (for fetching headers, favicon, and posting to Archive.org)
-- `youtube-dl` (for audio, video, and subtitles)
+- `youtube-dl` or `yt-dlp` (for audio, video, and subtitles)
 - `git` (for cloning git repos)
 - and more as we grow...


@ -203,7 +203,8 @@ CONFIG_SCHEMA: Dict[str, ConfigDefaultDict] = {
 'SINGLEFILE_BINARY': {'type': str, 'default': lambda c: bin_path('single-file')},
 'READABILITY_BINARY': {'type': str, 'default': lambda c: bin_path('readability-extractor')},
 'MERCURY_BINARY': {'type': str, 'default': lambda c: bin_path('mercury-parser')},
-'YOUTUBEDL_BINARY': {'type': str, 'default': 'youtube-dl'},
+#'YOUTUBEDL_BINARY': {'type': str, 'default': 'youtube-dl'},
+'YOUTUBEDL_BINARY': {'type': str, 'default': 'yt-dlp'},
 'NODE_BINARY': {'type': str, 'default': 'node'},
 'RIPGREP_BINARY': {'type': str, 'default': 'rg'},
 'CHROME_BINARY': {'type': str, 'default': None},
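
Review note: the hunk above hard-codes `yt-dlp` as the new default for `YOUTUBEDL_BINARY`, while the README hunks describe the dependency as "youtube-dl or yt-dlp". A minimal sketch of one way to honor either binary is shown below. It is not part of this commit; `resolve_media_binary` and `MEDIA_BINARY_CANDIDATES` are hypothetical names, and the preference order is an assumption.

```python
import shutil
from typing import Callable, Iterable, Optional

# Hypothetical preference order for this sketch; the commit itself simply
# sets the default to 'yt-dlp' with no fallback.
MEDIA_BINARY_CANDIDATES = ('yt-dlp', 'youtube-dl')


def resolve_media_binary(
    candidates: Iterable[str] = MEDIA_BINARY_CANDIDATES,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[str]:
    """Return the path of the first candidate found on PATH, or None."""
    for name in candidates:
        path = which(name)  # shutil.which returns None when the name is not found
        if path:
            return path
    return None
```

The `which` parameter exists only so the lookup can be replaced in tests; called with no arguments, the function consults the real `PATH`.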


@ -1,6 +1,7 @@
 __package__ = 'archivebox.extractors'
 import os
+import sys
 from pathlib import Path
 from typing import Optional, List, Iterable, Union


@ -72,6 +72,7 @@ def save_media(link: Link, out_dir: Optional[Path]=None, timeout: int=MEDIA_TIME
 timer.end()
 # add video description and subtitles to full-text index
+# Let's try a few different
 index_texts = [
 text_file.read_text(encoding='utf-8').strip()
 for text_file in (


@ -92,7 +92,7 @@ echo " You may be prompted for a sudo password in order to install the follow
 echo ""
 echo " - archivebox"
 echo " - python3, pip, nodejs, npm (languages used by ArchiveBox, and its extractor modules)"
-echo " - curl, wget, git, youtube-dl (used for extracting title, favicon, git, media, and more)"
+echo " - curl, wget, git, youtube-dl, yt-dlp (used for extracting title, favicon, git, media, and more)"
 echo " - chromium (skips this if any Chrome/Chromium version is already installed)"
 echo ""
 echo " If you'd rather install these manually as-needed, you can find detailed documentation here:"
@ -115,7 +115,7 @@ if which apt-get > /dev/null; then
 fi
 echo
 echo "[+] Installing ArchiveBox system dependencies using apt..."
-sudo apt-get install -y git python3 python3-pip python3-distutils wget curl youtube-dl ffmpeg git nodejs npm ripgrep
+sudo apt-get install -y git python3 python3-pip python3-distutils wget curl youtube-dl yt-dlp ffmpeg git nodejs npm ripgrep
 sudo apt-get install -y libgtk2.0-0 libgtk-3-0 libnotify-dev libgconf-2-4 libnss3 libxss1 libasound2 libxtst6 xauth xvfb libgbm-dev || sudo apt-get install -y chromium || sudo apt-get install -y chromium-browser || true
 sudo apt-get install -y archivebox
 sudo apt-get --only-upgrade install -y archivebox


@ -55,7 +55,7 @@
 # CURL_BINARY = curl
 # GIT_BINARY = git
 # WGET_BINARY = wget
-# YOUTUBEDL_BINARY = youtube-dl
+# YOUTUBEDL_BINARY = yt-dlp
 # CHROME_BINARY = chromium
 # CHROME_USER_DATA_DIR="~/.config/google-chrome/Default"
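
Review note: in the template above the `YOUTUBEDL_BINARY` line stays commented out, so it only documents the default and a user has to uncomment it to override. The sketch below illustrates that behavior with the standard-library `configparser`. It is not ArchiveBox's actual config loader; the function name `read_youtubedl_binary` and the section name `DEPENDENCY_CONFIG` are assumptions made for the example.

```python
import configparser

# Default introduced by this commit (see the CONFIG_SCHEMA hunk).
DEFAULT_YOUTUBEDL_BINARY = 'yt-dlp'


def read_youtubedl_binary(conf_text: str, section: str = 'DEPENDENCY_CONFIG') -> str:
    """Return YOUTUBEDL_BINARY from INI text, or the default if it is unset.

    The section name is an assumption for this sketch.
    """
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep option names case-sensitive (default lowercases them)
    parser.read_string(conf_text)
    # Lines starting with '#' are comments, so a commented-out key counts as unset.
    return parser.get(section, 'YOUTUBEDL_BINARY', fallback=DEFAULT_YOUTUBEDL_BINARY)
```

With the key commented out the function returns `yt-dlp`; with an uncommented `YOUTUBEDL_BINARY = youtube-dl` it returns the override.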


@ -42,6 +42,7 @@ INSTALL_REQUIRES = [
 "django-extensions>=3.0.3",
 "dateparser>=1.0.0",
 "youtube-dl>=2021.04.17",
+"yt-dlp>=2021.4.11",
 "python-crontab>=2.5.1",
 "croniter>=0.3.34",
 "w3lib>=1.22.0",


@ -5,7 +5,7 @@ Package3: archivebox
 Suite: focal
 Suite3: focal
 Build-Depends: debhelper, dh-python, python3-all, python3-pip, python3-setuptools, python3-wheel, python3-stdeb
-Depends3: nodejs, wget, curl, git, ffmpeg, youtube-dl, python3-all, python3-pip, python3-setuptools, python3-croniter, python3-crontab, python3-dateparser, python3-django, python3-django-extensions, python3-django-jsonfield, python3-mypy-extensions, python3-requests, python3-w3lib, ripgrep
+Depends3: nodejs, wget, curl, git, ffmpeg, youtube-dl, yt-dlp, python3-all, python3-pip, python3-setuptools, python3-croniter, python3-crontab, python3-dateparser, python3-django, python3-django-extensions, python3-django-jsonfield, python3-mypy-extensions, python3-requests, python3-w3lib, ripgrep
 X-Python3-Version: >= 3.7
 XS-Python-Version: >= 3.7
 Setup-Env-Vars: DEB_BUILD_OPTIONS=nocheck