Fix hyphen placement in util.URL_REGEX

Incorrect hyphen placement in `URL_REGEX` was allowing it to match more characters than intended. In a regex character class, an unescaped hyphen is literal only when it is the first or last character in the class; anywhere else it is interpreted as the delimiter of a range of characters. The issue fixed here caused the range of characters `[$-_]` to be treated as valid URL characters, instead of the intended set of three characters `[-_$]`.

The incorrect range interpretation inadvertently included most ASCII punctuation, most importantly the angle brackets, square brackets, and single quote that the expression uses to mark the end of a match. This caused the expression to match a URL whose "hostname" portion begins with one of the intended "stop parsing" characters. For example:

```
https://<b>www</b>.example.com/  # MATCHES but should not
https://[for example]            # MATCHES but should not
scheme='https://'                # MATCHES, including final quote, but should not
```

Some test cases have been added to the `URL_REGEX` assert in archivebox.parsers to cover this possibility.
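The difference between the two character classes can be checked in isolation. The following is a minimal sketch, not code from the repository; the names `range_class`, `literal_class`, and `probe` are made up for illustration.

```python
import re

# '$-_' in the middle of a class is a range from '$' (0x24) to '_' (0x5F).
range_class = re.compile(r'[$-_]')
# With the hyphen first, the class contains exactly '-', '_' and '$'.
literal_class = re.compile(r'[-_$]')

# The "stop parsing" characters plus the three intended symbols.
probe = "<>[]'\"-_$"

range_hits = [ch for ch in probe if range_class.fullmatch(ch)]
literal_hits = [ch for ch in probe if literal_class.fullmatch(ch)]

print(range_hits)    # ['<', '>', '[', ']', "'", '-', '_', '$']
print(literal_hits)  # ['-', '_', '$']
```

The double quote (0x22) sorts before `$`, so it is the only stop character in `probe` that the range does not accept.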
2024-11-22 20:23:12 +00:00 · 2023-08-08 15:09:11 -04:00 · 2023-08-08 15:09:11 -04:00 · c039ef05b3
commit c039ef05b3
parent b773041952
2 changed files with 5 additions and 1 deletions
--- a/archivebox/parsers/__init__.py
+++ b/archivebox/parsers/__init__.py
@@ -233,6 +233,10 @@ _test_url_strs = {
    'https://example.com/?what=1#how-about-this=1&2%20baf': 1,
    'https://example.com?what=1#how-about-this=1&2%20baf': 1,
    '<test>http://example7.com</test>': 1,
+    'https://<test>': 0,
+    'https://[test]': 0,
+    'http://"test"': 0,
+    'http://\'test\'': 0,
    '[https://example8.com/what/is/this.php?what=1]': 1,
    '[and http://example9.com?what=1&other=3#and-thing=2]': 1,
    '<what>https://example10.com#and-thing=2 "</about>': 1,
--- a/archivebox/util.py
+++ b/archivebox/util.py
@@ -59,7 +59,7 @@ URL_REGEX = re.compile(
    r'(?=('
    r'http[s]?://'                    # start matching from allowed schemes
    r'(?:[a-zA-Z]|[0-9]'              # followed by allowed alphanum characters
-    r'|[$-_@.&+]|[!*\(\),]'           #    or allowed symbols
+    r'|[-_$@.&+!*\(\),]'           #    or allowed symbols (keep hyphen first to match literal hyphen)
    r'|(?:%[0-9a-fA-F][0-9a-fA-F]))'  #    or allowed unicode bytes
    r'[^\]\[\(\)<>"\'\s]+'          # stop parsing at these symbols
    r'))',