ArchiveBox

mirror of https://github.com/ArchiveBox/ArchiveBox synced 2024-11-10 06:34:16 +00:00

Author	SHA1	Message	Date
Nick Sweeting	a680724367	Merge branch 'dev' into search_index_extract_html_text	2023-10-27 23:09:28 -07:00
Ross Williams	310b4d1242	Add htmltotext extractor. Saves HTML text nodes and selected element attributes in `htmltotext.txt` for each Snapshot. Primarily intended to be used for search indexing.	2023-10-23 21:42:32 -04:00
Ross Williams	b44f7e68b1	Add URL-specific method allow/deny lists. Allows enabling only allow-listed extractors or disabling specific deny-listed extractors for a regular expression matched against an added site's URL.	2023-08-02 09:36:40 -04:00
Nick Sweeting	bd6d9c165b	enforce utf8 on literally all file operations because windows sucks	2021-03-27 01:16:29 -04:00
Cristian	62ed11a5ca	fix: Improve headers handling	2020-09-24 12:55:51 -05:00
Angel Rey	ee6caca3ca	Added more asserts	2020-09-23 11:07:00 -05:00
Angel Rey	1cce786d6d	Added test headers extractor	2020-09-23 11:07:00 -05:00
ttimasdf	e3329be291	tests: add test for mercury-parser	2020-09-22 18:44:12 -05:00
Cristian	cc0fa747ce	feat: Add options to ease management of node related extractors	2020-08-18 10:34:28 -05:00
Cristian	2a68af1b94	tests: Add readability tests	2020-08-11 11:15:15 -05:00
Cristian	5429096c30	tests: Add mechanism to avoid using extractors that we are not testing	2020-08-04 08:42:30 -05:00
Nick Sweeting	5b6eb5e4ad	make filenames consistent with program name	2020-08-03 13:23:05 -05:00
Cristian	37df00a08b	tests: Add basic singlefile test	2020-08-03 13:22:36 -05:00
Cristian	e6c571beb2	fix: Remove title from extractors for oneshot	2020-07-31 10:24:58 -05:00
Cristian	23e6803f02	fix: Add change to calculate wget folder when there is a port present	2020-07-17 16:55:56 -05:00

15 commits