Nick Sweeting
0420662174
switch everywhere to use Snapshot.pk and ArchiveResult.pk instead of id
2024-05-13 05:12:12 -07:00
Nick Sweeting
457c42bf84
load EXTRACTORS dynamically using importlib.import_module
2024-05-11 22:28:59 -07:00
Nick Sweeting
8b9bc3dec8
minor fixes
2024-02-22 04:50:22 -08:00
Nick Sweeting
6a4e568d1b
new archivebox update speed improvements
2024-02-22 04:50:22 -08:00
Nick Sweeting
f0033f75d0
config.py lint fixes
2023-11-14 02:07:35 -08:00
Nick Sweeting
a680724367
Merge branch 'dev' into search_index_extract_html_text
2023-10-27 23:09:28 -07:00
Ross Williams
310b4d1242
Add htmltotext extractor
Saves HTML text nodes and selected element attributes in
`htmltotext.txt` for each Snapshot. Primarily intended to be used
for search indexing.
2023-10-23 21:42:32 -04:00
Ross Williams
2076474252
Drop use of TypeAlias to maintain Python 3.9 compat
TypeAlias annotation was introduced in Python 3.10, and is not strictly
necessary. Drop use of it to maintain Python 3.9 compatibility.
2023-08-02 10:56:48 -04:00
Ross Williams
b44f7e68b1
Add URL-specific method allow/deny lists
Allows enabling only allow-listed extractors or disabling specific
deny-listed extractors for a regular expression matched against an added
site's URL.
2023-08-02 09:36:40 -04:00
Sascha Ißbrücker
7bf4f40da0
just use out_dir
2023-05-29 10:03:49 +02:00
Sascha Ißbrücker
40c122515a
fix: make oneshot command return successful exit code
2023-05-29 10:01:27 +02:00
Joseph Turian
07de4a79a1
Merge branch 'dev' into feature/kludge-984-UTF8-bug
2022-12-20 11:39:01 +01:00
Joseph Turian
081a12b079
Add ts
2022-09-12 21:32:47 +00:00
Joseph Turian
daef48e59b
flake8
2022-09-12 21:31:33 +00:00
Joseph Turian
983f485cc0
flake8
2022-09-12 21:29:43 +00:00
Joseph Turian
f5f7aff3b4
Added yt-dlp everywhere
2022-09-12 20:34:02 +00:00
Joseph Turian
2b58cce43f
Attempted to warn on #984 and #1014
2022-09-11 12:19:16 +02:00
papersnake
de8e22efb7
improve title extractor
2022-02-08 23:17:52 +08:00
Nick Sweeting
4715ace7dd
ignore BaseException lgtm errors
2021-05-31 20:59:05 -04:00
Nick Sweeting
62078a77f8
show run duration after each archived link in cli output
2021-04-10 07:52:01 -04:00
Nick Sweeting
a9986f1f05
add timezone support, tons of CSS and layout improvements, more detailed snapshot admin form info, ability to sort by recently updated, better grid view styling, better table layouts, better dark mode support
2021-04-10 04:21:36 -04:00
Nick Sweeting
084cf7ff51
add more explanation about snapshot.save timestamp bump
2021-02-17 13:34:46 -05:00
Nick Sweeting
c95698e608
bump Snapshot.updated time after each extractor, change extractor order
2021-02-16 15:52:18 -05:00
Dan Arnfield
5420903102
Refactor should_save_extractor methods to accept overwrite parameter
2021-01-21 15:56:32 -06:00
Cristian
275ad22db7
refactor: Remove skip_index from archive related functions
2020-12-08 18:42:25 -05:00
Cristian
f6c73f9aeb
fix: Issue with oneshot command
2020-12-08 18:42:25 -05:00
JDC
7903db6dfb
Add ArchiveResult Manager and sorted indexable filter
2020-12-06 01:13:39 +02:00
JDC
b1f70b2197
Initial implementation
2020-12-06 01:12:45 +02:00
Cristian
33182fd53c
fix: Add missing assignment
2020-11-04 15:07:45 -05:00
Cristian
d064a3eeff
fix: Handle case when update tries to re-add a link that is not in the sql index
2020-11-04 15:02:54 -05:00
Cristian
f292cface2
fix: Add condition for oneshot when archiving links
2020-11-04 14:40:44 -05:00
Cristian
4484491fb7
feat: Create ArchiveResult after finishing an extractor process
2020-11-04 11:22:55 -05:00
Angel Rey
ce71747538
replaced os.path in init extractors
2020-10-02 15:46:39 -05:00
Cristian
7d3767b882
fix: oneshot command not running extractors
2020-09-24 12:56:16 -05:00
Angel Rey
852e3c9cff
Added headers extractor
2020-09-23 11:07:00 -05:00
ttimasdf
357b677363
fix: add mercury-parser to extractors list
2020-09-22 18:44:12 -05:00
Cristian
b18bbf8874
test: Fix tests post-rebase
2020-09-17 09:09:52 -05:00
Cristian
50f3f16203
lint: Remove unused import
2020-09-15 08:05:46 -05:00
Cristian
0a83392cbf
fix: Replace any typing with Union[Iterable[Link], QuerySet] in archive_links
2020-09-15 08:05:46 -05:00
Cristian
018bd91745
refactor: Remove get_iter lambda from archive_links
2020-09-15 08:05:46 -05:00
Cristian
01fb44fd40
refactor: Change archive_links check to focus on queryset, so it allows other iterables and not just lists
2020-09-15 08:05:46 -05:00
Cristian
fe9604a772
feat: Add tests for remove command
2020-09-15 08:05:46 -05:00
Cristian
be520d137a
feat: Refactor add method to use querysets
2020-09-15 08:05:46 -05:00
Cristian
874403e667
feat: Remove patch_main_index
2020-09-15 08:05:46 -05:00
Cristian
31343c1367
feat: Update extractors and add command to use sql index as source of truth
2020-09-15 08:05:46 -05:00
Nick Sweeting
e87f1d57a3
fix linters
2020-08-18 09:22:12 -04:00
Nick Sweeting
c9b3bab84d
fix pull title not working
2020-08-18 08:49:26 -04:00
Nick Sweeting
b0c0a676f8
re-enable readability and singlefile by default now that it's less noisy
2020-08-18 08:29:46 -04:00
Nick Sweeting
d7d53cfb12
don't show skipped extractors to reduce visual noise
2020-08-18 08:13:35 -04:00
Nick Sweeting
b681a477ae
add overwrite flag to add command to force re-archiving
2020-08-18 04:37:54 -04:00