Nick Sweeting
a680724367
Merge branch 'dev' into search_index_extract_html_text
2023-10-27 23:09:28 -07:00
Ross Williams
310b4d1242
Add htmltotext extractor
Saves HTML text nodes and selected element attributes in
`htmltotext.txt` for each Snapshot. Primarily intended to be used
for search indexing.
2023-10-23 21:42:32 -04:00
Ross Williams
b44f7e68b1
Add URL-specific method allow/deny lists
Allows enabling only allow-listed extractors or disabling specific
deny-listed extractors for a regular expression matched against an added
site's URL.
2023-08-02 09:36:40 -04:00
Nick Sweeting
bd6d9c165b
enforce utf8 on literally all file operations because windows sucks
2021-03-27 01:16:29 -04:00
Cristian
62ed11a5ca
fix: Improve headers handling
2020-09-24 12:55:51 -05:00
Angel Rey
ee6caca3ca
Added more asserts
2020-09-23 11:07:00 -05:00
Angel Rey
1cce786d6d
Added test headers extractor
2020-09-23 11:07:00 -05:00
ttimasdf
e3329be291
tests: add test for mercury-parser
2020-09-22 18:44:12 -05:00
Cristian
cc0fa747ce
feat: Add options to ease management of node related extractors
2020-08-18 10:34:28 -05:00
Cristian
2a68af1b94
tests: Add readability tests
2020-08-11 11:15:15 -05:00
Cristian
5429096c30
tests: Add mechanism to avoid using extractors that we are not testing
2020-08-04 08:42:30 -05:00
Nick Sweeting
5b6eb5e4ad
make filenames consistent with program name
2020-08-03 13:23:05 -05:00
Cristian
37df00a08b
tests: Add basic singlefile test
2020-08-03 13:22:36 -05:00
Cristian
e6c571beb2
fix: Remove title from extractors for oneshot
2020-07-31 10:24:58 -05:00
Cristian
23e6803f02
fix: Add change to calculate wget folder when there is a port present
2020-07-17 16:55:56 -05:00