Nick Sweeting
de2ab43f7f
switch .is_dir and .exists for os.access to avoid PermissionError on startup
2024-10-08 03:02:34 -07:00
Nick Sweeting
cf1ea8f80f
improve config loading of TMP_DIR, LIB_DIR, move to separate files
2024-10-07 23:45:11 -07:00
Nick Sweeting
94123ca68c
fix archive_dot_org response parsing bytes vs str bug
2024-10-01 00:18:38 -07:00
Nick Sweeting
18474f452b
move config out of legacy files and improve version output
2024-09-30 23:52:00 -07:00
Nick Sweeting
d21bc86075
finish migrating almost all config to new system
2024-09-30 23:21:34 -07:00
Nick Sweeting
69522da4bb
move wget and mercury into plugins
2024-09-30 21:43:45 -07:00
Nick Sweeting
363a499289
move util.py into misc folder
2024-09-30 17:25:15 -07:00
Nick Sweeting
dfca4b13b2
move system.py into misc folder
2024-09-30 17:13:55 -07:00
Nick Sweeting
3e5b6ddeae
move config into dedicated global app
2024-09-30 15:59:05 -07:00
Nick Sweeting
bb65b2dbec
move almost all config into new archivebox.CONSTANTS
2024-09-25 05:10:09 -07:00
Nick Sweeting
a5ffd4e9d3
move pdf, screenshot, dom, singlefile, and ytdlp extractor config to new plugin system
2024-09-25 00:42:26 -07:00
Nick Sweeting
ee5bec6a10
flip link_archive exception throw order so real exception is easier to read at the bottom
2024-09-25 00:39:49 -07:00
Nick Sweeting
c9c163efed
begin migrating search backends to new plugin system
2024-09-24 02:13:01 -07:00
Nick Sweeting
52386d9c16
run all blocking commands in background threads and show nice UI messages as confirmation
2024-09-06 02:54:22 -07:00
Nick Sweeting
cbf2a8fdc3
rename datetime fields to _at, massively improve ABID generation safety and determinism
2024-09-04 23:42:36 -07:00
Nick Sweeting
d0fefc0279
add chunk_size=500 to more iterator calls
2024-08-27 19:28:00 -07:00
Nick Sweeting
24fe958ff3
massively improve Snapshot admin list view query performance
2024-08-26 20:16:43 -07:00
Nick Sweeting
9b1659c72f
make created_by_id autoapply to any ArchiveResults created under Snapshot
2024-08-20 19:43:07 -07:00
Nick Sweeting
0285aa52a0
config and attr access improvements
2024-08-20 18:31:21 -07:00
Nick Sweeting
774ce3fda7
fix singlefile extractor exception when result is none
2024-05-17 20:12:18 -07:00
Nick Sweeting
0420662174
switch everywhere to use Snapshot.pk and ArchiveResult.pk instead of id
2024-05-13 05:12:12 -07:00
Nick Sweeting
457c42bf84
load EXTRACTORS dynamically using importlib.import_module
2024-05-11 22:28:59 -07:00
Nick Sweeting
4c5a3fba8b
more fixes for wget_output_path
2024-05-07 05:38:29 -07:00
Nick Sweeting
9b21ce490e
add workaround logic to catch paths that are too long or contain unprintable characters
2024-05-07 05:03:23 -07:00
Nick Sweeting
f770bba3cf
fix OSError 36 caused by checking for path that is too long to exist
2024-05-07 04:12:07 -07:00
Nick Sweeting
b4c3aa5097
Merge branch 'main' into dev
2024-03-26 15:01:36 -07:00
Ben Muthalaly
f4deb97f59
Add ARGS and EXTRA_ARGS for Mercury extractor
2024-03-05 21:15:38 -06:00
Ben Muthalaly
d8cf09c21e
Remove unnecessary variable length args for dedupe
2024-03-05 21:13:45 -06:00
Naomi Phillips
a729480b75
Add COOKIES_FILE support for singlefile extractor
2024-03-03 02:32:46 -05:00
Ben Muthalaly
d74ddd42ae
Flip dedupe precedence order
2024-03-01 14:50:32 -06:00
Ben Muthalaly
ab8f395e0a
Add YOUTUBEDL_EXTRA_ARGS
2024-02-23 15:40:31 -06:00
Ben Muthalaly
4e69d2c9e1
Add EXTRA_*_ARGS for wget, curl, and singlefile
2024-02-22 23:04:11 -06:00
Nick Sweeting
8b9bc3dec8
minor fixes
2024-02-22 04:50:22 -08:00
Nick Sweeting
6a4e568d1b
new archivebox update speed improvements
2024-02-22 04:50:22 -08:00
Nick Sweeting
0a25495520
add fallback to check wget output dir with port stripped
2024-01-19 03:47:38 -08:00
Nick Sweeting
c1fd2cfa42
tag URLs immediately once added instead of waiting until archival completes
2024-01-03 20:31:46 -08:00
Nick Sweeting
db2984e47b
prefer dom dump to singlefile for generating readability output
2024-01-03 20:11:06 -08:00
Nick Sweeting
78d942ac22
show more detail in readability error messages
2024-01-03 20:09:31 -08:00
Nick Sweeting
5b07a1126c
add comment about why DOM is preferred over singlefile for readability parsing
2024-01-03 19:09:24 -08:00
Nick Sweeting
2c54e55697
prefer dom dump to singlefile for generating readability output
2024-01-02 19:50:56 -08:00
Nick Sweeting
f0033f75d0
config.py lint fixes
2023-11-14 02:07:35 -08:00
Nick Sweeting
a680724367
Merge branch 'dev' into search_index_extract_html_text
2023-10-27 23:09:28 -07:00
Ross Williams
310b4d1242
Add htmltotext extractor
Saves HTML text nodes and selected element attributes in
`htmltotext.txt` for each Snapshot. Primarily intended to be used
for search indexing.
2023-10-23 21:42:32 -04:00
Nick Sweeting
63ad43f46c
Merge branch 'dev' into method_allow_deny
2023-10-20 04:25:44 -07:00
Nick Sweeting
82d8662c74
add more readability error output
2023-10-20 04:14:28 -07:00
Ben Muthalaly
77917e9b55
Fix HTML title parsing bugs.
This slightly modifies the HTML_TITLE_REGEX to fix two parsing errors.
The first occurred when title tags were empty (e.g. "<title></title>")
which was parsed as "</title". The second occurred when titles were a
single character (e.g. "<title>A</title>") which was not matched by the
regex, and so would fall back to link.base_url.
Now when tags are empty, it falls back to link.base_url, and single
character titles are parsed correctly.
The way the regex works now is still a bit wonky for some edge cases.
I couldn't find any cases of incorrect behavior, but it still might be
worth reworking more completely for robustness.
2023-10-09 02:00:01 -05:00
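The two cases described in the commit above (an empty title tag and a single-character title) can be illustrated with a minimal sketch. The pattern `TITLE_RE` and the helper `extract_title` are hypothetical names chosen for illustration and are not the project's actual HTML_TITLE_REGEX; the `fallback` argument stands in for link.base_url.

```python
import re

# Hypothetical pattern for illustration only; NOT the project's HTML_TITLE_REGEX.
# The lazy group (.*?) may match zero characters, so an empty tag yields "".
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title\s*>", re.IGNORECASE | re.DOTALL)


def extract_title(html: str, fallback: str) -> str:
    """Return the page title, or `fallback` when the title is missing or empty."""
    match = TITLE_RE.search(html)
    title = match.group(1).strip() if match else ""
    # An empty string is falsy, so empty titles fall through to the fallback.
    return title or fallback
```

With this sketch, an empty tag falls back to the supplied value, while a one-character title is returned unchanged.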
spresse1
603ce7ec10
After a timeout, Chrome will leave behind a SingletonLock file, which prevents future instances of Chrome from starting. When an extractor fails due to a timeout, remove this file.
2023-08-28 17:27:03 +02:00
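The cleanup described in the commit above might look roughly like the following sketch. The function names `remove_chrome_lock` and `run_chrome`, and the assumption that the lock lives directly inside the Chrome user data directory passed in as `user_data_dir`, are illustrative and not taken from the project's code.

```python
import subprocess
from pathlib import Path


def remove_chrome_lock(user_data_dir) -> bool:
    """Delete a leftover SingletonLock, returning True if one was removed."""
    lock = Path(user_data_dir) / "SingletonLock"
    # The lock may be a dangling symlink, for which exists() is False,
    # so check is_symlink() as well.
    if lock.is_symlink() or lock.exists():
        lock.unlink()
        return True
    return False


def run_chrome(cmd, timeout, user_data_dir):
    """Run a Chrome command; on timeout, clear the stale lock and re-raise."""
    try:
        return subprocess.run(cmd, timeout=timeout, capture_output=True)
    except subprocess.TimeoutExpired:
        remove_chrome_lock(user_data_dir)
        raise
```

Re-raising after the cleanup keeps the timeout visible to the caller, so the extractor is still reported as failed.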
Ross Williams
2076474252
Drop use of TypeAlias to maintain Python 3.9 compat
TypeAlias annotation was introduced in Python 3.10, and is not strictly
necessary. Drop use of it to maintain Python 3.9 compatibility.
2023-08-02 10:56:48 -04:00
Ross Williams
b44f7e68b1
Add URL-specific method allow/deny lists
Allows enabling only allow-listed extractors or disabling specific
deny-listed extractors for a regular expression matched against an added
site's URL.
2023-08-02 09:36:40 -04:00
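One way to picture the allow/deny behavior described in the commit above is the sketch below. The function name `methods_for_url` and the shape of the two mappings (regular expression to list of extractor names) are assumptions made for illustration; the project's real configuration format and precedence rules may differ.

```python
import re
from typing import Dict, List


def methods_for_url(
    url: str,
    methods: List[str],
    allowlist: Dict[str, List[str]],
    denylist: Dict[str, List[str]],
) -> List[str]:
    """Filter extractor names for `url` using hypothetical regex-keyed lists."""
    # Assumption: the first matching allow-list pattern wins and permits
    # only the extractors it names.
    for pattern, allowed in allowlist.items():
        if re.search(pattern, url):
            return [m for m in methods if m in allowed]

    # Otherwise, drop every extractor named by any matching deny-list pattern.
    denied = set()
    for pattern, names in denylist.items():
        if re.search(pattern, url):
            denied.update(names)
    return [m for m in methods if m not in denied]
```

In this sketch a URL that matches no pattern keeps the full list of extractors.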
Sascha Ißbrücker
7bf4f40da0
just use out_dir
2023-05-29 10:03:49 +02:00