ArchiveBox/archivebox/extractors
Ross Williams 310b4d1242 Add htmltotext extractor
Saves HTML text nodes and selected element attributes in
`htmltotext.txt` for each Snapshot. Primarily intended to be used
for search indexing.
2023-10-23 21:42:32 -04:00
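The commit description above can be illustrated with a minimal sketch of text extraction using only the standard library. This is not the code in `htmltotext.py`: the names `TextCollector`, `html_to_text`, and `save_text`, and the choice of skipped tags and kept attributes, are assumptions made for illustration. Only the output filename `htmltotext.txt` comes from the commit message.

```python
from html.parser import HTMLParser
from pathlib import Path

# Assumptions for this sketch; the real extractor may choose differently.
SKIP_TAGS = {"script", "style"}        # contents are not human-readable text
KEEP_ATTRS = {"alt", "title", "href"}  # "selected element attributes"


class TextCollector(HTMLParser):
    """Collects text nodes and a few attribute values, in document order."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
            return
        if self._skip_depth:
            return
        for name, value in attrs:
            cleaned = value.strip() if value else ""
            if name in KEEP_ATTRS and cleaned:
                self.chunks.append(cleaned)

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth:
            return
        text = data.strip()
        if text:
            self.chunks.append(text)


def html_to_text(html: str) -> str:
    """Return one extracted chunk per line."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def save_text(snapshot_dir, html: str) -> Path:
    """Write the extracted text to htmltotext.txt inside snapshot_dir."""
    out_path = Path(snapshot_dir) / "htmltotext.txt"
    out_path.write_text(html_to_text(html), encoding="utf-8")
    return out_path
```

Writing one chunk per line keeps the file easy to feed to a line-oriented search indexer, which matches the stated purpose of the extractor.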
__init__.py Add htmltotext extractor 2023-10-23 21:42:32 -04:00
archive_org.py enforce utf8 on literally all file operations because windows sucks 2021-03-27 01:16:29 -04:00
dom.py After a timeout, chrome will leave behind a SingletonLock, which prevents future instances of chrome from starting. When an extractor fails due to a timeout, remove this file. 2023-08-28 17:27:03 +02:00
favicon.py Add FAVICON_PROVIDER option for custom favicon service 2023-05-05 20:42:36 -05:00
git.py Refactor should_save_extractor methods to accept overwrite parameter 2021-01-21 15:56:32 -06:00
headers.py Refactor should_save_extractor methods to accept overwrite parameter 2021-01-21 15:56:32 -06:00
htmltotext.py Add htmltotext extractor 2023-10-23 21:42:32 -04:00
media.py Don't be strict on unicode errors 2022-09-12 20:40:45 +00:00
mercury.py improve readability and mercury error handling and fix output path to be relative 2021-02-16 15:53:11 -05:00
pdf.py After a timeout, chrome will leave behind a SingletonLock, which prevents future instances of chrome from starting. When an extractor fails due to a timeout, remove this file. 2023-08-28 17:27:03 +02:00
readability.py remove unused import 2022-02-09 10:48:51 +08:00
screenshot.py After a timeout, chrome will leave behind a SingletonLock, which prevents future instances of chrome from starting. When an extractor fails due to a timeout, remove this file. 2023-08-28 17:27:03 +02:00
singlefile.py add CHROME_TIMEOUT args 2023-03-14 20:29:41 +09:00
title.py Fix HTML title parsing bugs. 2023-10-09 02:00:01 -05:00
wget.py add timezone support, tons of CSS and layout improvements, more detailed snapshot admin form info, ability to sort by recently updated, better grid view styling, better table layouts, better dark mode support 2021-04-10 04:21:36 -04:00
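
The commit message shared by `dom.py`, `pdf.py`, and `screenshot.py` describes removing Chrome's leftover `SingletonLock` after a timeout. A minimal sketch of that idea follows. The helper name `run_with_lock_cleanup` and its parameters are hypothetical, and the assumption that the lock file sits directly inside the Chrome user data directory is not confirmed by the listing.

```python
import subprocess
from pathlib import Path


def run_with_lock_cleanup(cmd, user_data_dir, timeout):
    """Run cmd; if it times out, remove a stale SingletonLock and re-raise.

    Hypothetical helper: assumes the lock lives at <user_data_dir>/SingletonLock.
    """
    try:
        return subprocess.run(cmd, capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        lock_path = Path(user_data_dir) / "SingletonLock"
        # The lock may be a dangling symlink, for which Path.exists() returns
        # False, so unlink unconditionally and ignore a missing file instead.
        lock_path.unlink(missing_ok=True)
        raise
```

Re-raising keeps the timeout visible to the caller, so the extractor can still record the failure while the next Chrome instance is free to start.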