ArchiveBox

mirror of https://github.com/ArchiveBox/ArchiveBox synced 2024-11-10 14:44:18 +00:00

Author	SHA1	Message	Date
Nick Sweeting	774ce3fda7	fix singlefile extractor exception when result is none	2024-05-17 20:12:18 -07:00
Nick Sweeting	0420662174	switch everywhere to use Snapshot.pk and ArchiveResult.pk instead of id	2024-05-13 05:12:12 -07:00
Nick Sweeting	457c42bf84	load EXTRACTORS dynamically using importlib.import_module	2024-05-11 22:28:59 -07:00
Nick Sweeting	4c5a3fba8b	more fixes for wget_output_path	2024-05-07 05:38:29 -07:00
Nick Sweeting	9b21ce490e	add workaround logic to catch paths that are too long or contain unprintable characters	2024-05-07 05:03:23 -07:00
Nick Sweeting	f770bba3cf	fix OSError 36 caused by checking for path that is too long to exist	2024-05-07 04:12:07 -07:00
Nick Sweeting	b4c3aa5097	Merge branch 'main' into dev	2024-03-26 15:01:36 -07:00
Ben Muthalaly	f4deb97f59	Add `ARGS` and `EXTRA_ARGS` for Mercury extractor	2024-03-05 21:15:38 -06:00
Ben Muthalaly	d8cf09c21e	Remove unnecessary variable length args for dedupe	2024-03-05 21:13:45 -06:00
Naomi Phillips	a729480b75	Add COOKIES_FILE support for singlefile extractor	2024-03-03 02:32:46 -05:00
Ben Muthalaly	d74ddd42ae	Flip dedupe precedence order	2024-03-01 14:50:32 -06:00
Ben Muthalaly	ab8f395e0a	Add `YOUTUBEDL_EXTRA_ARGS`	2024-02-23 15:40:31 -06:00
Ben Muthalaly	4e69d2c9e1	Add `EXTRA_*_ARGS` for wget, curl, and singlefile	2024-02-22 23:04:11 -06:00
Nick Sweeting	8b9bc3dec8	minor fixes	2024-02-22 04:50:22 -08:00
Nick Sweeting	6a4e568d1b	new archivebox update speed improvements	2024-02-22 04:50:22 -08:00
Nick Sweeting	0a25495520	add fallback to check wget output dir with port stripped	2024-01-19 03:47:38 -08:00
Nick Sweeting	c1fd2cfa42	tag URLs immediately once added instead of waiting until archival completes	2024-01-03 20:31:46 -08:00
Nick Sweeting	db2984e47b	prefer dom dump to singlefile for generating readability output	2024-01-03 20:11:06 -08:00
Nick Sweeting	78d942ac22	show more detail in readability error messages	2024-01-03 20:09:31 -08:00
Nick Sweeting	5b07a1126c	add comment about why DOM is preferred over singlefile for readability parsing	2024-01-03 19:09:24 -08:00
Nick Sweeting	2c54e55697	prefer dom dump to singlefile for generating readability output	2024-01-02 19:50:56 -08:00
Nick Sweeting	f0033f75d0	config.py lint fixes	2023-11-14 02:07:35 -08:00
Nick Sweeting	a680724367	Merge branch 'dev' into search_index_extract_html_text	2023-10-27 23:09:28 -07:00
Ross Williams	310b4d1242	Add htmltotext extractor. Saves HTML text nodes and selected element attributes in `htmltotext.txt` for each Snapshot. Primarily intended to be used for search indexing.	2023-10-23 21:42:32 -04:00
Nick Sweeting	63ad43f46c	Merge branch 'dev' into method_allow_deny	2023-10-20 04:25:44 -07:00
Nick Sweeting	82d8662c74	add more readability error output	2023-10-20 04:14:28 -07:00
Ben Muthalaly	77917e9b55	Fix HTML title parsing bugs. This slightly modifies the HTML_TITLE_REGEX to fix two parsing errors. The first occurred when title tags were empty (e.g. "<title></title>") which was parsed as "</title". The second occurred when titles were a single character (e.g. "<title>A</title>") which was not matched by the regex, and so would fall back to link.base_url. Now when tags are empty, it falls back to link.base_url, and single character titles are parsed correctly. The way the regex works now is still a bit wonky for some edge cases. I couldn't find any cases of incorrect behavior, but it still might be worth reworking more completely for robustness.	2023-10-09 02:00:01 -05:00
spresse1	603ce7ec10	After a timeout, chrome will leave behind a SingletonLock, which prevents future instances of chrome from starting. When an extractor fails due to a timeout, remove this file.	2023-08-28 17:27:03 +02:00
Ross Williams	2076474252	Drop use of TypeAlias to maintain Python 3.9 compat. TypeAlias annotation was introduced in Python 3.10, and is not strictly necessary. Drop use of it to maintain Python 3.9 compatibility.	2023-08-02 10:56:48 -04:00
Ross Williams	b44f7e68b1	Add URL-specific method allow/deny lists. Allows enabling only allow-listed extractors or disabling specific deny-listed extractors for a regular expression matched against an added site's URL.	2023-08-02 09:36:40 -04:00
Sascha Ißbrücker	7bf4f40da0	just use out_dir	2023-05-29 10:03:49 +02:00
Sascha Ißbrücker	40c122515a	fix: make oneshot command return successful exit code	2023-05-29 10:01:27 +02:00
Micah R Ledbetter	1e50ca243e	Add FAVICON_PROVIDER option for custom favicon service	2023-05-05 20:42:36 -05:00
ふぁ	d77c770c47	add CHROME_TIMEOUT args. Signed-off-by: ふぁ <yuki@yuki0311.com>	2023-03-14 20:29:41 +09:00
Nick Sweeting	9599845b56	ensure DOM HTML dump is non-zero length file when retrying	2023-03-13 10:49:26 +00:00
Nick Sweeting	0cbeeb4346	Merge pull request #1021 from renaisun/dev	2023-01-09 18:17:39 -08:00
Joseph Turian	07de4a79a1	Merge branch 'dev' into feature/kludge-984-UTF8-bug	2022-12-20 11:39:01 +01:00
Joseph Turian	081a12b079	Add ts	2022-09-12 21:32:47 +00:00
Joseph Turian	daef48e59b	flake8	2022-09-12 21:31:33 +00:00
Joseph Turian	983f485cc0	flake8	2022-09-12 21:29:43 +00:00
Joseph Turian	b864c38d9e	Don't be strict on unicode errors	2022-09-12 20:40:45 +00:00
Joseph Turian	dba423a568	A few more youtube-dl tweaks	2022-09-12 20:36:23 +00:00
Joseph Turian	f5f7aff3b4	Added yt-dlp everywhere	2022-09-12 20:34:02 +00:00
renaisun	0ea955b3ed	add a missing comma	2022-09-12 09:08:28 +08:00
notevenaperson	40659b5e9d	singlefile.py: Code to ensure options are deduplicated	2022-09-12 09:08:28 +08:00
Joseph Turian	2b58cce43f	Attempted to warn on #984 and #1014	2022-09-11 12:19:16 +02:00
renaisun	8899fe0b92	Add SINGLEFILE_ARGS to control single-file arguments	2022-06-09 14:35:48 +08:00
Nick Sweeting	950b5cbbb6	Merge pull request #924 from prnake/dev: improve title extractor	2022-05-09 18:38:12 -07:00
Nick Sweeting	57df65f28f	use yt-dlp for media archiving instead of youtube-dl	2022-04-21 07:11:35 -07:00
prnake	011bd104cb	remove unused import	2022-02-09 10:48:51 +08:00

177 commits