jim winstead
ccabda4c7d
Handle list of tags in JSON, and be more clever about comma vs. space
2024-02-28 17:38:49 -08:00
jim winstead
178e676e0f
Fix JSON parser by not always mangling the input
Rather than assuming the JSON file we are parsing has junk at the beginning
(which maybe only used to happen?), try parsing it as-is first, and then fall
back to trying again after skipping the first line.
Fixes #1347
2024-02-27 14:48:19 -08:00
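The fallback described in 178e676e0f can be sketched roughly as follows. This is a minimal illustration, not the project's actual parser: the function name `parse_json_with_fallback` is hypothetical, and it assumes any junk is confined to the first line of the input.

```python
import json


def parse_json_with_fallback(text: str):
    """Parse text as JSON; on failure, retry once without the first line.

    Hypothetical helper for illustration only.
    """
    try:
        # First attempt: parse the input exactly as given.
        return json.loads(text)
    except json.JSONDecodeError:
        # Assumption: leading junk occupies only the first line.
        # partition() returns everything after the first newline
        # (an empty string if there is no newline at all).
        _, _, rest = text.partition("\n")
        # If this also fails, the JSONDecodeError propagates to the caller.
        return json.loads(rest)
```

Trying the unmodified input first means well-formed files are never mangled, and the first-line skip only applies when the plain parse has already failed.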
Nick Sweeting
a680724367
Merge branch 'dev' into search_index_extract_html_text
2023-10-27 23:09:28 -07:00
Ross Williams
310b4d1242
Add htmltotext extractor
Saves HTML text nodes and selected element attributes in
`htmltotext.txt` for each Snapshot. Primarily intended to be used
for search indexing.
2023-10-23 21:42:32 -04:00
Ross Williams
b44f7e68b1
Add URL-specific method allow/deny lists
Allows enabling only allow-listed extractors or disabling specific
deny-listed extractors for a regular expression matched against an added
site's URL.
2023-08-02 09:36:40 -04:00
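One way to picture the allow/deny behaviour described in b44f7e68b1 is the sketch below. It is an assumption-laden illustration, not the project's code: the function `select_methods`, the rule shape (a dict mapping a regex pattern to a list of method names), and the method names in the usage example are all hypothetical.

```python
import re


def select_methods(url, all_methods, allow_rules, deny_rules):
    """Return the extractor method names that should run for url.

    allow_rules and deny_rules are hypothetical dicts of the form
    {regex_pattern: [method_name, ...]}.
    """
    methods = list(all_methods)
    # An allow rule whose pattern matches the URL keeps only the listed methods.
    for pattern, allowed in allow_rules.items():
        if re.search(pattern, url):
            methods = [m for m in methods if m in allowed]
    # A deny rule whose pattern matches the URL drops the listed methods.
    for pattern, denied in deny_rules.items():
        if re.search(pattern, url):
            methods = [m for m in methods if m not in denied]
    return methods
```

For example, with `all_methods = ["title", "wget", "media"]` and a deny rule `{r"example\.com": ["media"]}`, a URL on `example.com` would run only `title` and `wget`, while URLs that match no rule keep the full list.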
Sascha Ißbrücker
40c122515a
fix: make oneshot command return successful exit code
2023-05-29 10:01:27 +02:00
Nick Sweeting
9f1470cf03
fix output permissions tests
2021-05-31 20:57:46 -04:00
Nick Sweeting
eef9adbfcb
fix select invalid test
2021-04-03 15:50:48 -04:00
Nick Sweeting
354b4627ed
fix tests
2021-03-30 23:39:15 -04:00
Nick Sweeting
bd6d9c165b
enforce utf8 on literally all file operations because windows sucks
2021-03-27 01:16:29 -04:00
Nick Sweeting
33df9c1ebe
fix after and before in remove tests
2021-02-18 06:21:44 -05:00
Nick Sweeting
4f5bb3776c
fix sql err
2021-02-18 05:51:53 -05:00
Nick Sweeting
46a4197514
fix tests
2021-02-18 04:26:56 -05:00
Cristian
e82161a768
refactor: Remove setup_django from search
2020-12-11 16:43:48 -05:00
Nick Sweeting
e03d17c208
test extract flag on oneshot
2020-12-11 16:49:18 +02:00
Cristian
f6c73f9aeb
fix: Issue with oneshot command
2020-12-08 18:42:25 -05:00
Nick Sweeting
1b22f8eeef
Merge pull request #515 from cdvv7788/POC-setup-django-on-init
2020-11-27 23:56:37 -05:00
Nick Sweeting
efe3027797
Merge branch 'master' into archive-result
2020-11-27 23:18:11 -05:00
Nick Sweeting
0e2ccbc10d
update urls to new repo path
2020-11-23 02:06:46 -05:00
Nick Sweeting
fdd4effc92
Merge pull request #535 from cdvv7788/extractors-flag
2020-11-13 14:53:17 -05:00
JDC
b1dbfcb73f
Add test remove tag filter
2020-11-13 14:17:12 -05:00
Cristian
44eede96e5
feat: Add extract flag to add command
2020-11-13 09:24:34 -05:00
Cristian
33182fd53c
fix: Add missing assignment
2020-11-04 15:07:45 -05:00
Cristian
d064a3eeff
fix: Handle case when update tries to re-add a link that is not in the sql index
2020-11-04 15:02:54 -05:00
Cristian
e7e33ea7a5
tests: Add tests for several different ways to extract the title
2020-10-30 08:04:26 -05:00
Cristian
f6ce1de882
fix: archivebox version was being called as root
2020-10-27 09:15:14 -05:00
Cristian
a6bee5f111
feat: Move setup_django to an inner module
2020-10-26 08:02:04 -05:00
Cristian
e1d0b8bce7
feat: Initialize django at the beginning
2020-10-26 07:45:21 -05:00
Cristian
ae1484b8bf
feat: Remove index.json and index.html generation from the regular process
2020-10-23 06:45:56 -05:00
Cristian Vargas
a850b4a9d9
Merge branch 'master' into tags
2020-10-20 08:23:25 -05:00
Cristian
62c78e1d10
refactor: Remove django-taggit and replace it with a local tags setup
2020-10-12 13:47:03 -05:00
Angel Rey
73418836f8
Replaced os.path in server.py
2020-10-02 15:46:39 -05:00
Angel Rey
62c9028212
Improved tags
2020-09-24 15:34:23 -05:00
Cristian
0158efb1d0
test: Improve oneshot test
2020-09-24 12:56:16 -05:00
Cristian
62ed11a5ca
fix: Improve headers handling
2020-09-24 12:55:51 -05:00
Angel Rey
ee6caca3ca
Added more asserts
2020-09-23 11:07:00 -05:00
Angel Rey
1cce786d6d
Added test headers extractor
2020-09-23 11:07:00 -05:00
Cristian
46b9e3d536
fix: Fix mercury extractor test
2020-09-23 10:34:05 -05:00
ttimasdf
e3329be291
tests: add test for mercury-parser
2020-09-22 18:44:12 -05:00
Cristian
fa622d3e14
refactor: Replace --index with --with-headers in the list command to make it more explicit. Change it so it affects the csv output too.
2020-09-15 08:05:46 -05:00
Cristian
2aa8d69b72
fix: Save history in main index (to mimic previous behaviour)
2020-09-15 08:05:46 -05:00
Cristian
7e9d195d13
feat: Update `list` command to sort using sqlite
2020-09-15 08:05:46 -05:00
Cristian
f55153eab3
feat: Update `update` command to work with querysets
2020-09-15 08:05:46 -05:00
Cristian
dafa1dd63c
tests: Add tests for before and after flags in remove command
2020-09-15 08:05:46 -05:00
Cristian
fe9604a772
feat: Add tests for remove command
2020-09-15 08:05:46 -05:00
Cristian
be0dff8126
feat: Add tests to refactored init command
2020-09-15 08:05:46 -05:00
Cristian
a77d6dc235
feat: list command fails when --index is used without --json or --html
2020-09-15 08:05:46 -05:00
Cristian
885ff50449
feat: Add html export to list command
2020-09-15 08:05:46 -05:00
Cristian
aab8f96520
feat: Add flag to list command to support index like output
2020-09-15 08:05:46 -05:00
Cristian
cc0fa747ce
feat: Add options to ease management of node related extractors
2020-08-18 10:34:28 -05:00