Nick Sweeting
f0033f75d0
config.py lint fixes
2023-11-14 02:07:35 -08:00
Nick Sweeting
a680724367
Merge branch 'dev' into search_index_extract_html_text
2023-10-27 23:09:28 -07:00
Ross Williams
310b4d1242
Add htmltotext extractor
Saves HTML text nodes and selected element attributes in
`htmltotext.txt` for each Snapshot. Primarily intended to be used
for search indexing.
2023-10-23 21:42:32 -04:00
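The idea in the commit above, collecting text nodes plus a few attribute values for search indexing, can be sketched with the standard-library `html.parser`. This is a minimal illustration, not the extractor's actual code: the names `TextExtractor`, `html_to_text`, `INDEXED_ATTRS`, and `SKIPPED_TAGS` are hypothetical, and the commit does not say which attributes are selected, so `alt` and `title` are an assumption.

```python
from html.parser import HTMLParser

# Assumption: the commit does not list the selected attributes; these are placeholders.
INDEXED_ATTRS = {"alt", "title"}
# Assumption: script and style bodies are not useful for a search index.
SKIPPED_TAGS = {"script", "style"}


class TextExtractor(HTMLParser):
    """Collects visible text nodes and selected attribute values."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1
        # Keep attribute values that are worth indexing.
        for name, value in attrs:
            if name in INDEXED_ATTRS and value:
                self.chunks.append(value.strip())

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Ignore whitespace-only nodes and anything inside skipped tags.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html: str) -> str:
    """Return one extracted chunk per line, suitable for writing to a text file."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

The result could then be written to a per-Snapshot file such as `htmltotext.txt`; how the real extractor writes its output is not shown in this log.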
Nick Sweeting
63ad43f46c
Merge branch 'dev' into method_allow_deny
2023-10-20 04:25:44 -07:00
Nick Sweeting
82d8662c74
add more readability error output
2023-10-20 04:14:28 -07:00
Ben Muthalaly
77917e9b55
Fix HTML title parsing bugs.
This slightly modifies the HTML_TITLE_REGEX to fix two parsing errors.
The first occurred when title tags were empty (e.g. "<title></title>")
which was parsed as "</title". The second occurred when titles were a
single character (e.g. "<title>A</title>") which was not matched by the
regex, and so would fall back to link.base_url.
Now when tags are empty, it falls back to link.base_url, and single
character titles are parsed correctly.
The way the regex works now is still a bit wonky for some edge cases.
I couldn't find any cases of incorrect behavior, but it still might be
worth reworking more completely for robustness.
2023-10-09 02:00:01 -05:00
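The two cases described in the commit above (empty title tags and single-character titles) can be illustrated with a small sketch. The pattern below is not the project's `HTML_TITLE_REGEX`, whose text does not appear in this log; `TITLE_RE` and `parse_title` are hypothetical names, and `fallback` stands in for `link.base_url`.

```python
import re

# Assumption: an illustrative pattern, not the project's HTML_TITLE_REGEX.
# The non-greedy group may match an empty string, so "<title></title>"
# yields "" instead of swallowing part of the closing tag.
TITLE_RE = re.compile(
    r"<title(?:\s[^>]*)?>(.*?)</title\s*>",
    re.IGNORECASE | re.DOTALL,
)


def parse_title(html: str, fallback: str) -> str:
    """Return the page title, or `fallback` when it is missing or empty."""
    match = TITLE_RE.search(html)
    if match:
        # Collapse internal whitespace and newlines into single spaces.
        title = " ".join(match.group(1).split())
        if title:
            return title
    return fallback
```

For example, `parse_title("<title></title>", "example.com")` falls back to `"example.com"`, while `parse_title("<title>A</title>", "example.com")` returns `"A"`.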
spresse1
603ce7ec10
After a timeout, Chrome leaves behind a SingletonLock file, which prevents future Chrome instances from starting. When an extractor fails due to a timeout, remove this file.
2023-08-28 17:27:03 +02:00
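A minimal sketch of the cleanup described in the commit above, assuming the lock file lives at `<user data dir>/SingletonLock` and that the browser is launched through `subprocess.run`. The function name `run_chrome_with_cleanup` and its parameters are hypothetical and do not come from the repository.

```python
import subprocess
from pathlib import Path


def run_chrome_with_cleanup(cmd, user_data_dir, timeout):
    """Run a Chrome command; on timeout, remove the stale SingletonLock.

    Assumption: the lock sits directly inside the Chrome user data directory.
    """
    lock = Path(user_data_dir) / "SingletonLock"
    try:
        return subprocess.run(cmd, capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        # unlink() removes the entry itself, even if it is a dangling symlink;
        # missing_ok avoids an error when no lock was left behind.
        lock.unlink(missing_ok=True)
        raise
```

Re-raising keeps the timeout visible to the caller, so the extractor can still be reported as failed while the next Chrome launch is no longer blocked.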
Ross Williams
2076474252
Drop use of TypeAlias to maintain Python 3.9 compat
TypeAlias annotation was introduced in Python 3.10, and is not strictly
necessary. Drop use of it to maintain Python 3.9 compatibility.
2023-08-02 10:56:48 -04:00
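The compatibility point in the commit above can be shown in a few lines: an explicit `TypeAlias` annotation requires Python 3.10, whereas a plain assignment is already a valid type alias on 3.9. The alias names and the helper below are hypothetical, chosen only for illustration.

```python
from typing import Dict, List

# Python 3.10+ spelling (the import fails on 3.9):
#     from typing import TypeAlias
#     MethodList: TypeAlias = List[str]

# Python 3.9-compatible spelling: a plain assignment is an implicit type alias.
MethodList = List[str]                    # hypothetical alias name
MethodsByPattern = Dict[str, MethodList]  # hypothetical alias name


def count_methods(rules: MethodsByPattern) -> int:
    """Count the extractor methods listed across all URL patterns."""
    return sum(len(methods) for methods in rules.values())
```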
Ross Williams
b44f7e68b1
Add URL-specific method allow/deny lists
Allows enabling only allow-listed extractors or disabling specific
deny-listed extractors for a regular expression matched against an added
site's URL.
2023-08-02 09:36:40 -04:00
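One way the URL-specific allow/deny behaviour described above could work is sketched below. This is an assumption-laden illustration, not the project's implementation: `select_methods` and its parameters are hypothetical, both lists are modelled as mappings from a regular expression to extractor names, and giving the allow list precedence over the deny list is a design choice made here, not something the commit states.

```python
import re
from typing import Dict, List


def select_methods(
    url: str,
    all_methods: List[str],
    allowlist: Dict[str, List[str]],
    denylist: Dict[str, List[str]],
) -> List[str]:
    """Pick the extractors to run for `url`, preserving the original order."""
    # Assumption: the first matching allow-list pattern wins outright.
    for pattern, allowed in allowlist.items():
        if re.search(pattern, url):
            return [m for m in all_methods if m in allowed]

    # Otherwise drop every method named by any matching deny-list pattern.
    denied = set()
    for pattern, methods in denylist.items():
        if re.search(pattern, url):
            denied.update(methods)
    return [m for m in all_methods if m not in denied]
```

With `allowlist = {r"youtube\.com": ["media", "title"]}`, a YouTube URL would run only those two extractors; with `denylist = {r"\.pdf$": ["media"]}`, a URL ending in `.pdf` would run everything except `media`.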
Sascha Ißbrücker
7bf4f40da0
just use out_dir
2023-05-29 10:03:49 +02:00
Sascha Ißbrücker
40c122515a
fix: make oneshot command return successful exit code
2023-05-29 10:01:27 +02:00
Micah R Ledbetter
1e50ca243e
Add FAVICON_PROVIDER option for custom favicon service
2023-05-05 20:42:36 -05:00
ふぁ
d77c770c47
add CHROME_TIMEOUT args
Signed-off-by: ふぁ <yuki@yuki0311.com>
2023-03-14 20:29:41 +09:00
Nick Sweeting
9599845b56
ensure DOM HTML dump is non-zero length file when retrying
2023-03-13 10:49:26 +00:00
Nick Sweeting
0cbeeb4346
Merge pull request #1021 from renaisun/dev
2023-01-09 18:17:39 -08:00
Joseph Turian
07de4a79a1
Merge branch 'dev' into feature/kludge-984-UTF8-bug
2022-12-20 11:39:01 +01:00
Joseph Turian
081a12b079
Add ts
2022-09-12 21:32:47 +00:00
Joseph Turian
daef48e59b
flake8
2022-09-12 21:31:33 +00:00
Joseph Turian
983f485cc0
flake8
2022-09-12 21:29:43 +00:00
Joseph Turian
b864c38d9e
Don't be strict on unicode errors
2022-09-12 20:40:45 +00:00
Joseph Turian
dba423a568
A few more youtube-dl tweaks
2022-09-12 20:36:23 +00:00
Joseph Turian
f5f7aff3b4
Added yt-dlp everywhere
2022-09-12 20:34:02 +00:00
renaisun
0ea955b3ed
add a missing comma
2022-09-12 09:08:28 +08:00
notevenaperson
40659b5e9d
singlefile.py: Code to ensure options are deduplicated
2022-09-12 09:08:28 +08:00
Joseph Turian
2b58cce43f
Attempted to warn on #984 and #1014
2022-09-11 12:19:16 +02:00
renaisun
8899fe0b92
Add SINGLEFILE_ARGS to control single-file arguments
2022-06-09 14:35:48 +08:00
Nick Sweeting
950b5cbbb6
Merge pull request #924 from prnake/dev
improve title extractor
2022-05-09 18:38:12 -07:00
Nick Sweeting
57df65f28f
use yt-dlp for media archiving instead of youtube-dl
2022-04-21 07:11:35 -07:00
prnake
011bd104cb
remove unused import
2022-02-09 10:48:51 +08:00
papersnake
de8e22efb7
improve title extractor
2022-02-08 23:17:52 +08:00
Nick Sweeting
4715ace7dd
ignore BaseException lgtm errors
2021-05-31 20:59:05 -04:00
Nick Sweeting
eb4d3bca9d
Update readability.py
2021-05-13 00:13:32 -04:00
Nick Sweeting
62078a77f8
show run duration after each archived link in cli output
2021-04-10 07:52:01 -04:00
Nick Sweeting
193df5c8d3
add video subtitles and description to full-text index
2021-04-10 07:22:20 -04:00
Nick Sweeting
a9986f1f05
add timezone support, tons of CSS and layout improvements, more detailed snapshot admin form info, ability to sort by recently updated, better grid view styling, better table layouts, better dark mode support
2021-04-10 04:21:36 -04:00
Nick Sweeting
bd6d9c165b
enforce utf8 on literally all file operations because windows sucks
2021-03-27 01:16:29 -04:00
Nick Sweeting
084cf7ff51
add more explanation about snapshot.save timestamp bump
2021-02-17 13:34:46 -05:00
Nick Sweeting
acb932ba12
improve readability and mercury error handling and fix output path to be relative
2021-02-16 15:53:11 -05:00
Nick Sweeting
c95698e608
bump Snapshot.updated time after each extractor, change extractor order
2021-02-16 15:52:18 -05:00
Nick Sweeting
d0f8a5e710
change mercury atomic_write output order
2021-02-16 06:19:16 -05:00
Nick Sweeting
7d0f5653c3
fix lgtm alerts
2021-02-01 02:27:24 -05:00
Nick Sweeting
04c951cdd5
fix alerts
2021-02-01 02:22:02 -05:00
Nick Sweeting
846c966c4d
use globbing to find wget output path
2021-01-30 22:02:39 -05:00
Nick Sweeting
e6fa16e13a
only chmod wget output if it exists
2021-01-30 22:02:11 -05:00
Nick Sweeting
385daf9af8
save the url as title for staticfiles or non html files
2021-01-30 22:01:49 -05:00
Nick Sweeting
b9b1c3d9e8
fix singlefile output path not relative
2021-01-30 20:44:49 -05:00
Nick Sweeting
d6de04a83a
fix lgtm errors
2021-01-30 06:07:35 -05:00
Nick Sweeting
c2aaa41c76
fix missing str path
2021-01-30 01:25:08 -05:00
Nick Sweeting
15e58bd366
fix using os.path calls on pathlib paths
2021-01-27 11:27:40 -05:00
Nick Sweeting
9764a8ed9b
check for non html files from wget
2021-01-25 18:15:16 -05:00