Nick Sweeting
0285aa52a0
config and attr access improvements
2024-08-20 18:31:21 -07:00
Nick Sweeting
774ce3fda7
fix singlefile extractor exception when result is none
2024-05-17 20:12:18 -07:00
Nick Sweeting
0420662174
switch everywhere to use Snapshot.pk and ArchiveResult.pk instead of id
2024-05-13 05:12:12 -07:00
Nick Sweeting
457c42bf84
load EXTRACTORS dynamically using importlib.import_module
2024-05-11 22:28:59 -07:00
Nick Sweeting
4c5a3fba8b
more fixes for wget_output_path
2024-05-07 05:38:29 -07:00
Nick Sweeting
9b21ce490e
add workaround logic to catch paths that are too long or contain unprintable characters
2024-05-07 05:03:23 -07:00
Nick Sweeting
f770bba3cf
fix OSError 36 caused by checking for path that is too long to exist
2024-05-07 04:12:07 -07:00
Nick Sweeting
b4c3aa5097
Merge branch 'main' into dev
2024-03-26 15:01:36 -07:00
Ben Muthalaly
f4deb97f59
Add ARGS and EXTRA_ARGS for Mercury extractor
2024-03-05 21:15:38 -06:00
Ben Muthalaly
d8cf09c21e
Remove unnecessary variable length args for dedupe
2024-03-05 21:13:45 -06:00
Naomi Phillips
a729480b75
Add COOKIES_FILE support for singlefile extractor
2024-03-03 02:32:46 -05:00
Ben Muthalaly
d74ddd42ae
Flip dedupe precedence order
2024-03-01 14:50:32 -06:00
Ben Muthalaly
ab8f395e0a
Add YOUTUBEDL_EXTRA_ARGS
2024-02-23 15:40:31 -06:00
Ben Muthalaly
4e69d2c9e1
Add EXTRA_*_ARGS for wget, curl, and singlefile
2024-02-22 23:04:11 -06:00
Nick Sweeting
8b9bc3dec8
minor fixes
2024-02-22 04:50:22 -08:00
Nick Sweeting
6a4e568d1b
new archivebox update speed improvements
2024-02-22 04:50:22 -08:00
Nick Sweeting
0a25495520
add fallback to check wget output dir with port stripped
2024-01-19 03:47:38 -08:00
Nick Sweeting
c1fd2cfa42
tag URLs immediately once added instead of waiting until archival completes
2024-01-03 20:31:46 -08:00
Nick Sweeting
db2984e47b
prefer dom dump to singlefile for generating readability output
2024-01-03 20:11:06 -08:00
Nick Sweeting
78d942ac22
show more detail in readability error messages
2024-01-03 20:09:31 -08:00
Nick Sweeting
5b07a1126c
add comment about why DOM is preferred over singlefile for readability parsing
2024-01-03 19:09:24 -08:00
Nick Sweeting
2c54e55697
prefer dom dump to singlefile for generating readability output
2024-01-02 19:50:56 -08:00
Nick Sweeting
f0033f75d0
config.py lint fixes
2023-11-14 02:07:35 -08:00
Nick Sweeting
a680724367
Merge branch 'dev' into search_index_extract_html_text
2023-10-27 23:09:28 -07:00
Ross Williams
310b4d1242
Add htmltotext extractor
Saves HTML text nodes and selected element attributes in
`htmltotext.txt` for each Snapshot. Primarily intended to be used
for search indexing.
2023-10-23 21:42:32 -04:00
Nick Sweeting
63ad43f46c
Merge branch 'dev' into method_allow_deny
2023-10-20 04:25:44 -07:00
Nick Sweeting
82d8662c74
add more readability error output
2023-10-20 04:14:28 -07:00
Ben Muthalaly
77917e9b55
Fix HTML title parsing bugs.
This slightly modifies the HTML_TITLE_REGEX to fix two parsing errors.
The first occurred when title tags were empty (e.g. "<title></title>")
which was parsed as "</title". The second occurred when titles were a
single character (e.g. "<title>A</title>") which was not matched by the
regex, and so would fall back to link.base_url.
Now when tags are empty, it falls back to link.base_url, and single
character titles are parsed correctly.
The way the regex works now is still a bit wonky for some edge cases.
I couldn't find any cases of incorrect behavior, but it still might be
worth reworking more completely for robustness.
2023-10-09 02:00:01 -05:00
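The two edge cases named in commit 77917e9b55 (an empty title tag and a single-character title) can be illustrated with a small sketch. The pattern and the `extract_title` helper below are hypothetical and written only for illustration; they are not the project's actual HTML_TITLE_REGEX or title extractor, and the `fallback` argument stands in for link.base_url.

```python
import re

# Hypothetical pattern for illustration only; NOT the project's HTML_TITLE_REGEX.
# The lazy group (.*?) may match zero characters, so an empty tag still matches
# and yields an empty string instead of swallowing part of the closing tag.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title\s*>", re.IGNORECASE | re.DOTALL)


def extract_title(html: str, fallback: str) -> str:
    """Return the stripped <title> text, or `fallback` if it is missing or empty."""
    match = TITLE_RE.search(html)
    title = match.group(1).strip() if match else ""
    # An empty or whitespace-only title falls back (e.g. to the link's base URL).
    return title or fallback


if __name__ == "__main__":
    print(extract_title("<title></title>", "example.com"))
    print(extract_title("<title>A</title>", "example.com"))
```

Treating an empty match as "no title" keeps the fallback logic in one place, so the regex itself does not need to reject empty tags.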
spresse1
603ce7ec10
After a timeout, chrome will leave behind a SingletonLock, which prevents future instances of chrome from starting. When an extractor fails due to a timeout, remove this file.
2023-08-28 17:27:03 +02:00
Ross Williams
2076474252
Drop use of TypeAlias to maintain Python 3.9 compat
TypeAlias annotation was introduced in Python 3.10, and is not strictly
necessary. Drop use of it to maintain Python 3.9 compatibility.
2023-08-02 10:56:48 -04:00
Ross Williams
b44f7e68b1
Add URL-specific method allow/deny lists
Allows enabling only allow-listed extractors or disabling specific
deny-listed extractors for a regular expression matched against an added
site's URL.
2023-08-02 09:36:40 -04:00
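One way to picture the URL-specific allow/deny lists from commit b44f7e68b1 is the sketch below. Everything in it is an assumption made for illustration: the names `ALL_METHODS`, `ALLOWLIST`, `DENYLIST`, and `methods_for_url`, the example patterns, and the exact semantics (a matching allow-list pattern restricts the run to the listed methods, a matching deny-list pattern removes the listed methods). The project's real configuration keys and behavior may differ.

```python
import re
from typing import Dict, List, Optional, Set

# Hypothetical extractor names and patterns, for illustration only.
ALL_METHODS: List[str] = ["title", "favicon", "wget", "singlefile", "media"]

ALLOWLIST: Dict[str, Set[str]] = {
    r"^https?://(www\.)?youtube\.com/": {"title", "media"},
}
DENYLIST: Dict[str, Set[str]] = {
    r"\.pdf$": {"singlefile", "media"},
}


def methods_for_url(
    url: str,
    methods: List[str] = ALL_METHODS,
    allowlist: Dict[str, Set[str]] = ALLOWLIST,
    denylist: Dict[str, Set[str]] = DENYLIST,
) -> List[str]:
    """Return the methods to run for `url`, preserving the order of `methods`."""
    allowed: Optional[Set[str]] = None  # None means no allow-list pattern matched
    for pattern, names in allowlist.items():
        if re.search(pattern, url):
            allowed = set(names) if allowed is None else allowed | names

    denied: Set[str] = set()
    for pattern, names in denylist.items():
        if re.search(pattern, url):
            denied |= names

    return [
        m for m in methods
        if (allowed is None or m in allowed) and m not in denied
    ]


if __name__ == "__main__":
    print(methods_for_url("https://www.youtube.com/watch?v=abc"))
    print(methods_for_url("https://example.com/paper.pdf"))
```

In this sketch a URL that matches no pattern runs every method, and the deny list is applied after the allow list, so a denied method is removed even when it is also allow-listed.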
Sascha Ißbrücker
7bf4f40da0
just use out_dir
2023-05-29 10:03:49 +02:00
Sascha Ißbrücker
40c122515a
fix: make oneshot command return successful exit code
2023-05-29 10:01:27 +02:00
Micah R Ledbetter
1e50ca243e
Add FAVICON_PROVIDER option for custom favicon service
2023-05-05 20:42:36 -05:00
ふぁ
d77c770c47
add CHROME_TIMEOUT args
Signed-off-by: ふぁ <yuki@yuki0311.com>
2023-03-14 20:29:41 +09:00
Nick Sweeting
9599845b56
ensure DOM HTML dump is non-zero length file when retrying
2023-03-13 10:49:26 +00:00
Nick Sweeting
0cbeeb4346
Merge pull request #1021 from renaisun/dev
2023-01-09 18:17:39 -08:00
Joseph Turian
07de4a79a1
Merge branch 'dev' into feature/kludge-984-UTF8-bug
2022-12-20 11:39:01 +01:00
Joseph Turian
081a12b079
Add ts
2022-09-12 21:32:47 +00:00
Joseph Turian
daef48e59b
flake8
2022-09-12 21:31:33 +00:00
Joseph Turian
983f485cc0
flake8
2022-09-12 21:29:43 +00:00
Joseph Turian
b864c38d9e
Don't be strict on unicode errors
2022-09-12 20:40:45 +00:00
Joseph Turian
dba423a568
A few more youtube-dl tweaks
2022-09-12 20:36:23 +00:00
Joseph Turian
f5f7aff3b4
Added yt-dlp everywhere
2022-09-12 20:34:02 +00:00
renaisun
0ea955b3ed
add a missing comma
2022-09-12 09:08:28 +08:00
notevenaperson
40659b5e9d
singlefile.py: Code to ensure options are deduplicated
2022-09-12 09:08:28 +08:00
Joseph Turian
2b58cce43f
Attempted to warn on #984 and #1014
2022-09-11 12:19:16 +02:00
renaisun
8899fe0b92
Add SINGLEFILE_ARGS to control single-file arguments
2022-06-09 14:35:48 +08:00
Nick Sweeting
950b5cbbb6
Merge pull request #924 from prnake/dev
improve title extractor
2022-05-09 18:38:12 -07:00
Nick Sweeting
57df65f28f
use yt-dlp for media archiving instead of youtube-dl
2022-04-21 07:11:35 -07:00