mirror of
https://github.com/ArchiveBox/ArchiveBox
synced 2025-02-17 05:48:24 +00:00
Incorrect hyphen placement in `URL_REGEX` was allowing it to match more characters than intended. In a regex character class, an unescaped literal hyphen must appear as the first or last character in the class; anywhere else it is interpreted as the delimiter of a range of characters.

The issue fixed here caused the range of characters `[$-_]` to be treated as valid URL characters, instead of the intended set of three characters `[-_$]`. The incorrect range interpretation inadvertently included digits, uppercase letters, and most ASCII punctuation, most importantly the angle brackets, square brackets, and single quote that the expression uses to mark the end of a match.

As a result, the expression matched a URL whose "hostname" portion begins with one of the intended "stop parsing" characters. For example:

```
https://<b>www</b>.example.com/    # MATCHES but should not
https://[for example]              # MATCHES but should not
scheme='https://'                  # MATCHES, including final quote, but should not
```

Some test cases have been added to the `URL_REGEX` assert in `archivebox.parsers` to cover this possibility.
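The difference between the two character classes can be checked directly. The following is a minimal sketch using Python's `re` module; the two patterns are simplified stand-ins for the relevant fragment and are not the actual `URL_REGEX`, and the helper name `matched_chars` is hypothetical.

```python
import re

# Simplified stand-ins for the character-class fragment (assumption:
# these are NOT the real URL_REGEX, only the part the fix concerns).
RANGE_CLASS = re.compile(r"[$-_]")    # hyphen in the middle: range '$' (0x24) to '_' (0x5F)
LITERAL_CLASS = re.compile(r"[-_$]")  # hyphen first: three literal characters


def matched_chars(pattern):
    """Return every printable ASCII character that the one-character pattern matches."""
    return "".join(chr(c) for c in range(0x20, 0x7F) if pattern.fullmatch(chr(c)))


range_chars = matched_chars(RANGE_CLASS)
literal_chars = matched_chars(LITERAL_CLASS)

print(len(range_chars), repr(range_chars))
print(len(literal_chars), repr(literal_chars))

# The characters meant to end a match are all inside the accidental range.
for stop_char in "<>[]'":
    print(repr(stop_char), stop_char in range_chars, stop_char in literal_chars)
```

Enumerating the matched characters rather than testing a handful of URLs makes the size of the accidental range visible: the misplaced hyphen turns a three-character set into every code point between `$` and `_`.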

| Name |
|---|
| `__init__.py` |
| `generic_html.py` |
| `generic_json.py` |
| `generic_rss.py` |
| `generic_txt.py` |
| `medium_rss.py` |
| `netscape_html.py` |
| `pinboard_rss.py` |
| `pocket_api.py` |
| `pocket_html.py` |
| `shaarli_rss.py` |
| `url_list.py` |
| `wallabag_atom.py` |