Commit graph

196 commits

Author SHA1 Message Date
Nick Sweeting
de2ab43f7f
switch .is_dir and .exists for os.access to avoid PermissionError on startup 2024-10-08 03:02:34 -07:00
Nick Sweeting
cf1ea8f80f
improve config loading of TMP_DIR, LIB_DIR, move to separate files 2024-10-07 23:45:11 -07:00
Nick Sweeting
94123ca68c
fix archive_dot_org response parsing bytes vs str bug 2024-10-01 00:18:38 -07:00
Nick Sweeting
18474f452b
move config out of legacy files and improve version output 2024-09-30 23:52:00 -07:00
Nick Sweeting
d21bc86075
finish migrating almost all config to new system 2024-09-30 23:21:34 -07:00
Nick Sweeting
69522da4bb
move wget and mercury into plugins 2024-09-30 21:43:45 -07:00
Nick Sweeting
363a499289
move util.py into misc folder 2024-09-30 17:25:15 -07:00
Nick Sweeting
dfca4b13b2
move system.py into misc folder 2024-09-30 17:13:55 -07:00
Nick Sweeting
3e5b6ddeae
move config into dedicated global app 2024-09-30 15:59:05 -07:00
Nick Sweeting
bb65b2dbec
move almost all config into new archivebox.CONSTANTS 2024-09-25 05:10:09 -07:00
Nick Sweeting
a5ffd4e9d3
move pdf, screenshot, dom, singlefile, and ytdlp extractor config to new plugin system 2024-09-25 00:42:26 -07:00
Nick Sweeting
ee5bec6a10
flip link_archive exception throw order so real exception is easier to read at the bottom 2024-09-25 00:39:49 -07:00
Nick Sweeting
c9c163efed
begin migrating search backends to new plugin system 2024-09-24 02:13:01 -07:00
Nick Sweeting
52386d9c16
run all blocking commands in background threads and show nice UI messages as confirmation 2024-09-06 02:54:22 -07:00
Nick Sweeting
cbf2a8fdc3
rename datetime fields to _at, massively improve ABID generation safety and determinism 2024-09-04 23:42:36 -07:00
Nick Sweeting
d0fefc0279
add chunk_size=500 to more iterator calls 2024-08-27 19:28:00 -07:00
Nick Sweeting
24fe958ff3
massively improve Snapshot admin list view query performance 2024-08-26 20:16:43 -07:00
Nick Sweeting
9b1659c72f
make created_by_id autoapply to any ArchiveResults created under Snapshot 2024-08-20 19:43:07 -07:00
Nick Sweeting
0285aa52a0
config and attr access improvements 2024-08-20 18:31:21 -07:00
Nick Sweeting
774ce3fda7
fix singlefile extractor exception when result is none 2024-05-17 20:12:18 -07:00
Nick Sweeting
0420662174
switch everywhere to use Snapshot.pk and ArchiveResult.pk instead of id 2024-05-13 05:12:12 -07:00
Nick Sweeting
457c42bf84
load EXTRACTORS dynamically using importlib.import_module 2024-05-11 22:28:59 -07:00
Nick Sweeting
4c5a3fba8b
more fixes for wget_output_path 2024-05-07 05:38:29 -07:00
Nick Sweeting
9b21ce490e
add workaround logic to catch paths that are too long or contain unprintable characters 2024-05-07 05:03:23 -07:00
Nick Sweeting
f770bba3cf
fix OSError 36 caused by checking for path that is too long to exist 2024-05-07 04:12:07 -07:00
Nick Sweeting
b4c3aa5097 Merge branch 'main' into dev 2024-03-26 15:01:36 -07:00
Ben Muthalaly
f4deb97f59 Add ARGS and EXTRA_ARGS for Mercury extractor 2024-03-05 21:15:38 -06:00
Ben Muthalaly
d8cf09c21e Remove unnecessary variable length args for dedupe 2024-03-05 21:13:45 -06:00
Naomi Phillips
a729480b75
Add COOKIES_FILE support for singlefile extractor 2024-03-03 02:32:46 -05:00
Ben Muthalaly
d74ddd42ae Flip dedupe precedence order 2024-03-01 14:50:32 -06:00
Ben Muthalaly
ab8f395e0a Add YOUTUBEDL_EXTRA_ARGS 2024-02-23 15:40:31 -06:00
Ben Muthalaly
4e69d2c9e1 Add EXTRA_*_ARGS for wget, curl, and singlefile 2024-02-22 23:04:11 -06:00
Nick Sweeting
8b9bc3dec8 minor fixes 2024-02-22 04:50:22 -08:00
Nick Sweeting
6a4e568d1b new archivebox update speed improvements 2024-02-22 04:50:22 -08:00
Nick Sweeting
0a25495520 add fallback to check wget output dir with port stripped 2024-01-19 03:47:38 -08:00
Nick Sweeting
c1fd2cfa42 tag URLs immediately once added instead of waiting until archival completes 2024-01-03 20:31:46 -08:00
Nick Sweeting
db2984e47b prefer dom dump to singlefile for generating readability output 2024-01-03 20:11:06 -08:00
Nick Sweeting
78d942ac22 show more detail in readability error messages 2024-01-03 20:09:31 -08:00
Nick Sweeting
5b07a1126c add comment about why DOM is preferred over singlefile for readability parsing 2024-01-03 19:09:24 -08:00
Nick Sweeting
2c54e55697 prefer dom dump to singlefile for generating readability output 2024-01-02 19:50:56 -08:00
Nick Sweeting
f0033f75d0 config.py lint fixes 2023-11-14 02:07:35 -08:00
Nick Sweeting
a680724367
Merge branch 'dev' into search_index_extract_html_text 2023-10-27 23:09:28 -07:00
Ross Williams
310b4d1242 Add htmltotext extractor
Saves HTML text nodes and selected element attributes in
`htmltotext.txt` for each Snapshot. Primarily intended to be used
for search indexing.
2023-10-23 21:42:32 -04:00
Nick Sweeting
63ad43f46c
Merge branch 'dev' into method_allow_deny 2023-10-20 04:25:44 -07:00
Nick Sweeting
82d8662c74 add more readability error output 2023-10-20 04:14:28 -07:00
Ben Muthalaly
77917e9b55 Fix HTML title parsing bugs.
This slightly modifies the HTML_TITLE_REGEX to fix two parsing errors.
The first occurred when title tags were empty (e.g. "<title></title>")
which was parsed as "</title". The second occurred when titles were a
single character (e.g. "<title>A</title>") which was not matched by the
regex, and so would fall back to link.base_url.

Now when tags are empty, it falls back to link.base_url, and single
character titles are parsed correctly.

The way the regex works now is still a bit wonky for some edge cases.
I couldn't find any cases of incorrect behavior, but it still might be
worth reworking more completely for robustness.
2023-10-09 02:00:01 -05:00
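The two edge cases described in commit 77917e9b55 can be illustrated with a small sketch. The pattern `TITLE_RE` and the helper `parse_title` below are hypothetical and are not the project's `HTML_TITLE_REGEX`; they only show one way an empty title can fall back to a default while a single-character title still parses.

```python
import re

# Hypothetical pattern for illustration only; NOT the project's HTML_TITLE_REGEX.
# The lazy group (.*?) may match zero characters, so "<title></title>" yields ""
# rather than swallowing part of the closing tag.
TITLE_RE = re.compile(
    r"<title(?:\s[^>]*)?>(.*?)</title\s*>",
    re.IGNORECASE | re.DOTALL,
)


def parse_title(html: str, fallback: str) -> str:
    """Return the trimmed <title> text, or `fallback` if it is missing or empty."""
    match = TITLE_RE.search(html)
    title = match.group(1).strip() if match else ""
    return title or fallback
```

With this sketch, `parse_title("<title></title>", "example.com")` falls back to `"example.com"`, and `parse_title("<title>A</title>", "example.com")` returns `"A"`.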
spresse1
603ce7ec10 After a timeout, chrome will leave behind a SingletonLock, which prevents future instances of chrome from starting. When an extractor fails due to a timeout, remove this file. 2023-08-28 17:27:03 +02:00
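The cleanup described in commit 603ce7ec10 might look roughly like the sketch below. The function name `run_with_lock_cleanup` and its parameters are assumptions made for illustration, not the project's actual extractor code; the only detail taken from the commit message is that a `SingletonLock` file is removed after a timeout.

```python
import subprocess
from pathlib import Path
from typing import List


def run_with_lock_cleanup(
    cmd: List[str], user_data_dir: Path, timeout: float
) -> subprocess.CompletedProcess:
    """Run `cmd`; if it times out, delete a stale SingletonLock and re-raise.

    Hypothetical helper: the name and the `user_data_dir` argument are assumed.
    """
    try:
        return subprocess.run(cmd, capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        # missing_ok=True (Python 3.8+) avoids an error if no lock was left behind.
        (user_data_dir / "SingletonLock").unlink(missing_ok=True)
        raise
```

Re-raising keeps the timeout visible to the caller, so the extractor can still be recorded as failed while the next Chrome launch is no longer blocked by the leftover lock.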
Ross Williams
2076474252 Drop use of TypeAlias to maintain Python 3.9 compat
TypeAlias annotation was introduced in Python 3.10, and is not strictly
necessary. Drop use of it to maintain Python 3.9 compatibility.
2023-08-02 10:56:48 -04:00
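As a minimal sketch of the change in commit 2076474252: a plain assignment already acts as an implicit type alias, so the explicit `TypeAlias` annotation can be dropped without changing behavior. The alias `Headers` and the function `first_value` are made-up names for illustration.

```python
from typing import Dict, List

# On Python 3.10+ this could be annotated as:
#     Headers: TypeAlias = Dict[str, List[str]]
# A plain assignment is an implicit alias and also works on Python 3.9.
Headers = Dict[str, List[str]]  # hypothetical alias name


def first_value(headers: Headers, name: str) -> str:
    """Return the first value stored under `name`, or "" if there is none."""
    values = headers.get(name, [])
    return values[0] if values else ""
```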
Ross Williams
b44f7e68b1 Add URL-specific method allow/deny lists
Allows enabling only allow-listed extractors or disabling specific
deny-listed extractors for a regular expression matched against an added
site's URL.
2023-08-02 09:36:40 -04:00
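One possible reading of the allow/deny behavior in commit b44f7e68b1 is sketched below. The function `select_methods`, the dict-of-patterns shape, and the rule for combining the two lists are all assumptions for illustration; the project's real configuration format and precedence may differ.

```python
import re
from typing import Dict, List, Optional, Set


def select_methods(
    url: str,
    all_methods: List[str],
    allowlist: Dict[str, List[str]],
    denylist: Dict[str, List[str]],
) -> List[str]:
    """Pick extractor methods for `url` (hypothetical semantics).

    Each dict maps a regex pattern to a list of method names. If any allowlist
    pattern matches, only the methods it lists may run; methods listed under a
    matching denylist pattern are then removed.
    """
    allowed: Optional[Set[str]] = None
    for pattern, methods in allowlist.items():
        if re.search(pattern, url):
            allowed = (allowed or set()) | set(methods)

    denied: Set[str] = set()
    for pattern, methods in denylist.items():
        if re.search(pattern, url):
            denied |= set(methods)

    return [
        m for m in all_methods
        if (allowed is None or m in allowed) and m not in denied
    ]
```

For example, with an allowlist of `{r"youtube\.com": ["media"]}`, a YouTube URL would run only the `media` method, while URLs matching no pattern keep the full method list.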
Sascha Ißbrücker
7bf4f40da0 just use out_dir 2023-05-29 10:03:49 +02:00