Nick Sweeting
a5ffd4e9d3
move pdf, screenshot, dom, singlefile, and ytdlp extractor config to new plugin system
2024-09-25 00:42:26 -07:00
Nick Sweeting
cbf2a8fdc3
rename datetime fields to _at, massively improve ABID generation safety and determinism
2024-09-04 23:42:36 -07:00
Nick Sweeting
6c4f3fc83a
fix chrome headless=new arg
2024-08-26 20:15:36 -07:00
Nick Sweeting
6ffa710bb3
fix headers: Elapsed timedelta is not JSON-serializable
2024-08-26 20:15:22 -07:00
Nick Sweeting
2c2d034d6d
move to new vendoring fallback logic
2024-08-23 02:01:02 -07:00
Nick Sweeting
38ca5c3228
add extra info to headers.json
2024-08-22 17:57:40 -07:00
Nick Sweeting
6a6ae7468e
fix lint errors
2024-04-25 21:36:11 -07:00
Nick Sweeting
75153252dc
big overhaul of REST API, split into auth, core, and cli methods
2024-04-25 03:56:22 -07:00
Nick Sweeting
6cb357e76c
fix fix_url_from_markdown assertion to be valid url
2024-04-24 19:41:11 -07:00
Nick Sweeting
128419f991
expand comment about markdown url trailing paren trimming
2024-04-24 17:50:18 -07:00
Nick Sweeting
beb3932d80
replace uses of URL_REGEX with find_all_urls to handle markdown better
2024-04-24 17:45:45 -07:00
Nick Sweeting
98c5e69203
bump lockfiles
2024-04-24 14:38:21 -07:00
Nick Sweeting
17f40f3ada
Merge branch 'dev' into fix-URL_REGEX
2024-04-23 19:53:58 -07:00
Nick Sweeting
c6f8a33a63
Update util.py
2024-04-23 19:53:18 -07:00
longzai
e4dc2701ef
fix URL_REGEX 2
2024-04-11 15:51:55 +08:00
longzai
4ae765ec27
fix the URL_REGEX used in generic_html parsers
...
Signed-off-by: longzai <437172242@qq.com>
2024-04-08 04:53:05 +08:00
Nick Sweeting
c5bb99dce1
explicitly use Default profile inside user data dir
2024-03-18 14:40:40 -07:00
Nick Sweeting
ca2c484a8e
Add _EXTRA_ARGS
for various extractors ( #1360 )
2024-03-14 01:55:09 -07:00
Ben Muthalaly
d8cf09c21e
Remove unnecessary variable length args for dedupe
2024-03-05 21:13:45 -06:00
Ben Muthalaly
4686da91e6
Fix cookies being set incorrectly
2024-03-05 01:48:35 -06:00
Ben Muthalaly
d74ddd42ae
Flip dedupe precedence order
2024-03-01 14:50:32 -06:00
Ben Muthalaly
68326a60ee
Add cookies file to http request in download_url
2024-02-27 15:30:31 -06:00
Ben Muthalaly
4d9c5a7b4b
Add CHROME_EXTRA_ARGS
...
Also fix `YOUTUBEDL_EXTRA_ARGS`.
2024-02-23 18:40:03 -06:00
Ben Muthalaly
4e69d2c9e1
Add EXTRA_*_ARGS
for wget, curl, and singlefile
2024-02-22 23:04:11 -06:00
Nick Sweeting
6a4e568d1b
new archivebox update speed improvements
2024-02-22 04:50:22 -08:00
Nick Sweeting
8c07b7e127
disable automatic chrome selfupdating
2024-01-11 19:51:27 -08:00
Nick Sweeting
6184f659dc
improve window size chrome cli handling
2024-01-11 19:02:46 -08:00
spresse1
603ce7ec10
After a timeout, chrome will leave behind a SingletonLock, which prevents future instances of chrome from starting. When an extractor fails due to a timeout, remove this file.
2023-08-28 17:27:03 +02:00
Ross Williams
c039ef05b3
Fix hyphen placement in util.URL_REGEX
...
Incorrect hyphen placement in `URL_REGEX` was allowing it to match more
characters than intended. In a regex character class, an unescaped literal
hyphen must appear as the first or last character in the class, or it will be
interpreted as the delimiter of a range of characters.
The issue fixed here caused the range of characters from `[$-_]` to
be treated as valid URL characters, instead of the intended set of three
characters `[-_$]`. The incorrect range interpretation inadvertently
included most ASCII punctuation, most importantly the angle brackets,
square brackets, and single quote that the expression uses
to mark the end of a match.
This causes the expression to match a URL that has a "hostname" portion
beginning with one of the intended "stop parsing" characters. For
example:
```
https://<b>www</b>.example.com/ # MATCHES but should not
https://[for example] # MATCHES but should not
scheme='https://' # MATCHES, including final quote, but should not
```
Some test cases have been added to the `URL_REGEX` assert in
archivebox.parsers to cover this possibility.
2023-08-08 15:24:16 -04:00
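The hyphen-placement issue described in commit c039ef05b3 can be illustrated with a minimal sketch. The two patterns below are simplified, hypothetical character classes, not the actual `URL_REGEX` from `util.py`; the helper `matched_chars` is likewise an assumption made for illustration.

```python
import re

# Hypothetical, simplified character classes (NOT the real URL_REGEX).
RANGE_CLASS = re.compile(r'[$-_]')    # hyphen between $ and _ : a range from 0x24 to 0x5F
LITERAL_CLASS = re.compile(r'[-_$]')  # hyphen first: three literal characters


def matched_chars(pattern):
    """Return the printable ASCII characters that a one-character pattern matches."""
    return ''.join(
        chr(code) for code in range(0x20, 0x7F) if pattern.fullmatch(chr(code))
    )


range_chars = matched_chars(RANGE_CLASS)
literal_chars = matched_chars(LITERAL_CLASS)

# The range form accepts the "stop parsing" characters named in the commit message.
for stop_char in "<>[]'":
    print(repr(stop_char), stop_char in range_chars, stop_char in literal_chars)
```

The range form covers every code point from `$` through `_`, which takes in the digits, the uppercase letters, and most ASCII punctuation, while the literal form matches only the three intended characters.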
Ross Williams
d0e65eba7f
More reliably detect Google Chrome version number
...
Previous method was splitting on the first whitespace, and missing the
version number when it appeared as `"Google Chrome 115.0.234.2342"`
instead of, e.g., `"Chromium 115.0.234.8283"`.
This commit changes the version detection to regex search for
whitespace, then one or more digits followed by a period, then at least
one more digit. Only the first sequence of digits is captured. Unless
Chrome radically changes their version numbering, this should capture
the first group of digits after the reported browser name, which would
be the major version.
2023-07-31 15:34:58 -04:00
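The version detection described in commit d0e65eba7f can be sketched as follows. The pattern is reconstructed from the prose of the commit message (whitespace, one or more digits, a period, at least one more digit, with only the first digit run captured); the names `MAJOR_VERSION` and `major_version` are hypothetical and the actual code in the repository may differ.

```python
import re

# Assumed pattern, rebuilt from the commit message's description:
# whitespace, then digits (captured), then a period, then at least one digit.
MAJOR_VERSION = re.compile(r'\s(\d+)\.\d')


def major_version(version_output):
    """Return the major version as a string, or None if no version is found."""
    match = MAJOR_VERSION.search(version_output)
    return match.group(1) if match else None


for sample in ("Google Chrome 115.0.234.2342", "Chromium 115.0.234.8283"):
    print(sample, "->", major_version(sample))
```

Searching for the pattern anywhere in the string, rather than splitting on the first whitespace, means the number of words in the browser name no longer matters.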
ふぁ
44a5a5ed7e
explicitly specify --headless=new
...
Signed-off-by: ふぁ <yuki@yuki0311.com>
2023-03-17 19:30:14 +09:00
ふぁ
d77c770c47
add CHROME_TIMEOUT args
...
Signed-off-by: ふぁ <yuki@yuki0311.com>
2023-03-14 20:29:41 +09:00
Nick Sweeting
606fa397a4
disable passing timeout arg to chrome because v111 is crashing when passed
2023-03-13 10:50:18 +00:00
Nick Sweeting
1f1c70a8b1
remove --single-process from chrome args and add some rendering optimization args
2023-03-13 10:49:57 +00:00
Nick Sweeting
49faec8f6d
add no-zygote and single-process args to try and prevent orphan chrome processes after exit
2021-05-13 05:04:23 -04:00
Nick Sweeting
9f05cf8283
virtual-time-budget doesnt work with some chrome stuff
2021-04-10 08:04:59 -04:00
Nick Sweeting
0c321a06d0
hide scrollbars in screenshots
2021-04-10 05:45:19 -04:00
Nick Sweeting
a9986f1f05
add timezone support, tons of CSS and layout improvements, more detailed snapshot admin form info, ability to sort by recently updated, better grid view styling, better table layouts, better dark mode support
2021-04-10 04:21:36 -04:00
Nick Sweeting
5a9f27204a
dont use chrome when its not available on windows systems
2021-04-05 23:33:08 -04:00
Nick Sweeting
3e26ae4a66
support finding multiple urls as substrings in text
2021-03-27 04:30:40 -04:00
Nick Sweeting
c089501073
add response status code to headers.json
2021-01-30 20:44:49 -05:00
Nick Sweeting
a0a79cead8
move utils and vendored libs into subfolders
2020-12-06 02:01:18 +02:00
Nick Sweeting
104553489f
remove redundant utils file
2020-11-28 02:12:27 -05:00
Nick Sweeting
83693a5c03
add packaging setup with stdeb for debian and apt
...
vendor the base32_crockford lib
add build script for Debian packages
2020-11-23 16:57:05 -05:00
Nick Sweeting
c47398851b
nicer timeout hints
2020-10-31 07:57:11 -04:00
Cristian
62ed11a5ca
fix: Improve headers handling
2020-09-24 12:55:51 -05:00
Angel Rey
f0915a56aa
Replaced get method
2020-09-24 12:55:51 -05:00
Angel Rey
a8a8fd14ac
Fixed indent headers.json
2020-09-23 11:07:00 -05:00
Angel Rey
852e3c9cff
Added headers extractor
2020-09-23 11:07:00 -05:00
Cristian
b18bbf8874
test: Fix tests post-rebase
2020-09-17 09:09:52 -05:00