Ben Muthalaly
4e69d2c9e1
Add EXTRA_*_ARGS
for wget, curl, and singlefile
2024-02-22 23:04:11 -06:00
Nick Sweeting
6a4e568d1b
new archivebox update speed improvements
2024-02-22 04:50:22 -08:00
Nick Sweeting
8c07b7e127
disable automatic chrome selfupdating
2024-01-11 19:51:27 -08:00
Nick Sweeting
6184f659dc
improve window size chrome cli handling
2024-01-11 19:02:46 -08:00
spresse1
603ce7ec10
After a timeout, chrome will leave behind a SingletonLock, which prevents future instances of chrome from starting. When an extractor fails due to a timeout, remove this file.
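The cleanup described above can be sketched as follows. This is a minimal illustration, not the code from this commit: the helper name `remove_singleton_lock` and the idea of passing the Chrome user data directory as an argument are assumptions made for the example. Only the `SingletonLock` file name comes from the commit message.

```python
import tempfile
from pathlib import Path

LOCK_NAME = "SingletonLock"  # file name taken from the commit message


def remove_singleton_lock(user_data_dir) -> bool:
    """Hypothetical helper: delete a stale SingletonLock, return True if one was removed."""
    lock = Path(user_data_dir) / LOCK_NAME
    # The lock may be a symlink whose target no longer exists; exists() follows
    # symlinks and would report False for it, so check is_symlink() as well.
    if lock.is_symlink() or lock.exists():
        lock.unlink()
        return True
    return False


# Demo against a throwaway directory standing in for the Chrome profile dir.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / LOCK_NAME).touch()
    removed = remove_singleton_lock(tmp)
    still_there = (Path(tmp) / LOCK_NAME).exists()
```

A caller would presumably invoke such a helper only in the timeout branch of an extractor, so that a lock held by a Chrome instance that is still running is left alone.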
2023-08-28 17:27:03 +02:00
Ross Williams
c039ef05b3
Fix hyphen placement in util.URL_REGEX
...
Incorrect hyphen placement in `URL_REGEX` was allowing it to match more
characters than intended. In a regex character class, a literal hyphen
must appear as the first or last character in the class (or be escaped),
or it will be interpreted as the delimiter of a range of characters.
The issue fixed here caused the range of characters `[$-_]` to
be treated as valid URL characters, instead of the intended set of three
characters `[-_$]`. The incorrect range interpretation inadvertently
included most ASCII punctuation, most importantly the angle brackets,
square brackets, and single quote that the expression uses
to mark the end of a match.
This causes the expression to match a URL that has a "hostname" portion
beginning with one of the intended "stop parsing" characters. For
example:
```
https://<b>www</b>.example.com/ # MATCHES but should not
https://[for example] # MATCHES but should not
scheme='https://' # MATCHES, including final quote, but should not
```
Some test cases have been added to the `URL_REGEX` assert in
archivebox.parsers to cover this possibility.
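The difference between the two character classes can be checked directly. The sketch below uses only the two bracket expressions quoted in this message, not the full `URL_REGEX`; the helper `matched_chars` and the `STOP_CHARS` string are names introduced for illustration.

```python
import re

# Simplified stand-ins for the character class inside URL_REGEX (assumption:
# the real expression is longer; only the hyphen placement is reproduced here).
BUGGY = re.compile(r"[$-_]")  # hyphen in the middle: range from '$' (0x24) to '_' (0x5F)
FIXED = re.compile(r"[-_$]")  # hyphen first: the three literals '-', '_', '$'

# Characters the expression relies on to mark the end of a match.
STOP_CHARS = "<>[]'"


def matched_chars(pattern, candidates):
    """Return the characters in `candidates` that the character class accepts."""
    return [ch for ch in candidates if pattern.fullmatch(ch)]


buggy_hits = matched_chars(BUGGY, STOP_CHARS)  # all five stop characters fall in the range
fixed_hits = matched_chars(FIXED, STOP_CHARS)  # none of them are accepted
print(buggy_hits, fixed_hits)
```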
2023-08-08 15:24:16 -04:00
Ross Williams
d0e65eba7f
More reliably detect Google Chrome version number
...
Previous method was splitting on the first whitespace, and missing the
version number when it appeared as `"Google Chrome 115.0.234.2342"`
instead of, e.g., `"Chromium 115.0.234.8283"`.
This commit changes the version detection to regex search for
whitespace, then one or more digits followed by a period, then at least
one more digit. Only the first sequence of digits is captured. Unless
Chrome radically changes their version numbering, this should capture
the first group of digits after the reported browser name, which would
be the major version.
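A minimal sketch of the search described above, assuming the pattern `\s(\d+)\.\d` as one way to express "whitespace, one or more digits, a period, at least one more digit". The exact expression and the function name `major_version` are illustrative and not taken from the commit.

```python
import re
from typing import Optional

# Whitespace, a captured run of digits, a period, then at least one more digit.
MAJOR_VERSION_RE = re.compile(r"\s(\d+)\.\d")


def major_version(version_output: str) -> Optional[str]:
    """Return the first digit group after the browser name, or None if absent."""
    match = MAJOR_VERSION_RE.search(version_output)
    return match.group(1) if match else None


print(major_version("Google Chrome 115.0.234.2342"))
print(major_version("Chromium 115.0.234.8283"))
```

Because the pattern searches rather than splits, it does not depend on how many words the browser name contains.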
2023-07-31 15:34:58 -04:00
ふぁ
44a5a5ed7e
explicitly specify --headless=new
...
Signed-off-by: ふぁ <yuki@yuki0311.com>
2023-03-17 19:30:14 +09:00
ふぁ
d77c770c47
add CHROME_TIMEOUT args
...
Signed-off-by: ふぁ <yuki@yuki0311.com>
2023-03-14 20:29:41 +09:00
Nick Sweeting
606fa397a4
disable passing timeout arg to chrome because v111 is crashing when passed
2023-03-13 10:50:18 +00:00
Nick Sweeting
1f1c70a8b1
remove --single-process from chrome args and add some rendering optimization args
2023-03-13 10:49:57 +00:00
Nick Sweeting
49faec8f6d
add no-zygote and single-process args to try and prevent orphan chrome processes after exit
2021-05-13 05:04:23 -04:00
Nick Sweeting
9f05cf8283
virtual-time-budget doesn't work with some chrome stuff
2021-04-10 08:04:59 -04:00
Nick Sweeting
0c321a06d0
hide scrollbars in screenshots
2021-04-10 05:45:19 -04:00
Nick Sweeting
a9986f1f05
add timezone support, tons of CSS and layout improvements, more detailed snapshot admin form info, ability to sort by recently updated, better grid view styling, better table layouts, better dark mode support
2021-04-10 04:21:36 -04:00
Nick Sweeting
5a9f27204a
don't use chrome when it's not available on windows systems
2021-04-05 23:33:08 -04:00
Nick Sweeting
3e26ae4a66
support finding multiple urls as substrings in text
2021-03-27 04:30:40 -04:00
Nick Sweeting
c089501073
add response status code to headers.json
2021-01-30 20:44:49 -05:00
Nick Sweeting
a0a79cead8
move utils and vendored libs into subfolders
2020-12-06 02:01:18 +02:00
Nick Sweeting
104553489f
remove redundant utils file
2020-11-28 02:12:27 -05:00
Nick Sweeting
83693a5c03
add packaging setup with stdeb for debian and apt
...
vendor the base32_crockford lib
add build script for debian packages
2020-11-23 16:57:05 -05:00
Nick Sweeting
c47398851b
nicer timeout hints
2020-10-31 07:57:11 -04:00
Cristian
62ed11a5ca
fix: Improve headers handling
2020-09-24 12:55:51 -05:00
Angel Rey
f0915a56aa
Replaced get method
2020-09-24 12:55:51 -05:00
Angel Rey
a8a8fd14ac
Fixed indent headers.json
2020-09-23 11:07:00 -05:00
Angel Rey
852e3c9cff
Added headers extractor
2020-09-23 11:07:00 -05:00
Cristian
b18bbf8874
test: Fix tests post-rebase
2020-09-17 09:09:52 -05:00
apkallum
008769d296
add support for Paths in json encoder
2020-09-17 09:09:52 -05:00
Nick Sweeting
3658153cf8
fix url parsing through quotes
2020-08-18 08:04:57 -04:00
Cristian
d0d2991c69
fix: Change import that was not working
2020-07-31 12:15:00 -05:00
Cristian
6006b4f93b
refactor: Organize code to remove flake8 issues
2020-07-24 12:25:25 -05:00
Cristian
949f78aa65
fix: Use w3lib to improve the encoding extraction
2020-07-22 10:24:08 -05:00
Nick Sweeting
8cb530230c
fix docker SHM limited to 64mb chrome crash
2020-07-21 23:39:21 -04:00
apkallum
b7785c4138
use dateparser for parsing, let it handle error
2020-07-16 19:38:38 -04:00
Nick Sweeting
dfb83b4f27
add AttributeDict
2020-07-13 11:24:49 -04:00
Cristian
528fc8f1f6
fix: Improve encoding detection for rss+xml content types
2020-07-02 12:11:23 -05:00
Nick Sweeting
3ec97e5528
fix git conflict committed by accident
2020-07-02 03:22:37 -04:00
Nick Sweeting
8840ad72bb
remove circular import possibilities
2020-07-02 03:13:35 -04:00
Cristian
c971e00c9c
feat: Add stdout from process to the template
2020-07-01 12:23:59 -05:00
Nick Sweeting
c415420f33
improve sort columns and UI placeholders
2020-06-30 06:41:48 -04:00
Nick Sweeting
9f440c2cf8
use requests.get to fetch and decode instead of urllib
2020-06-30 05:55:54 -04:00
Nick Sweeting
cb67b09f9d
Merge branch 'master' into django
2020-06-25 21:30:29 -04:00
michael.bub
c79ce2b1f5
guess encoding via chardet if available
2020-02-15 13:58:07 +01:00
Mashiat Sarker Shakkhar
0bb216ce02
util.py: Use dateparser to parse date strings.
2019-09-10 23:51:09 -04:00
Nick Sweeting
500534f4be
fix missing comma in staticfile extensions list
2019-05-02 15:17:16 -04:00
Nick Sweeting
95007d9137
split up utils into separate files
2019-04-30 23:13:04 -04:00
Nick Sweeting
1b8abc0961
move everything out of legacy folder
2019-04-27 17:26:24 -04:00
Drewry Pope
332a32f4f9
Resolve 3 typos in util.py
2019-04-20 02:59:44 -05:00
Nick Sweeting
27708152d2
wip initial django setup
2019-04-02 16:36:41 -04:00
Nick Sweeting
f1075f2c7d
fix links index
2019-03-30 23:43:53 -04:00