Nick Sweeting
db2984e47b
prefer dom dump to singlefile for generating readability output
2024-01-03 20:11:06 -08:00
Ben Muthalaly
77917e9b55
Fix HTML title parsing bugs.
This slightly modifies the HTML_TITLE_REGEX to fix two parsing errors.
The first occurred when title tags were empty (e.g. "<title></title>")
which was parsed as "</title". The second occurred when titles were a
single character (e.g. "<title>A</title>") which was not matched by the
regex, and so would fall back to link.base_url.
Now when tags are empty, it falls back to link.base_url, and single
character titles are parsed correctly.
The way the regex works now is still a bit wonky for some edge cases.
I couldn't find any cases of incorrect behavior, but it still might be
worth reworking more completely for robustness.
2023-10-09 02:00:01 -05:00
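The two cases described in the commit above (an empty title tag and a single-character title) can be illustrated with a minimal sketch. The pattern and the `extract_title` helper below are hypothetical and written only for illustration; they are not the project's actual HTML_TITLE_REGEX or its title extractor, and `base_url` stands in for link.base_url.

```python
import re

# Hypothetical pattern for illustration only; NOT the project's HTML_TITLE_REGEX.
# The lazy group (.*?) may match zero characters, so an empty tag yields ""
# rather than capturing part of the closing tag.
TITLE_PATTERN = re.compile(
    r"<title\b[^>]*>(.*?)</title\s*>",
    re.IGNORECASE | re.DOTALL,
)


def extract_title(html: str, base_url: str) -> str:
    """Return the page title, or base_url when no usable title is found.

    Hypothetical helper: base_url plays the role of link.base_url.
    """
    match = TITLE_PATTERN.search(html)
    if match:
        # Collapse runs of whitespace (including newlines) into single spaces.
        title = " ".join(match.group(1).split())
        if title:
            return title
    # Empty or missing title: fall back to the URL.
    return base_url
```

With this sketch, `extract_title("<title></title>", url)` falls back to `url`, and `extract_title("<title>A</title>", url)` returns the single character. A regex remains a fragile way to parse HTML, which is consistent with the commit's note that a more complete rework may be worthwhile.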
papersnake
de8e22efb7
improve title extractor
2022-02-08 23:17:52 +08:00
Nick Sweeting
04c951cdd5
fix alerts
2021-02-01 02:22:02 -05:00
Nick Sweeting
385daf9af8
save the url as title for staticfiles or non html files
2021-01-30 22:01:49 -05:00
Dan Arnfield
5420903102
Refactor should_save_extractor methods to accept overwrite parameter
2021-01-21 15:56:32 -06:00
Cristian
81d766aba1
refactor: Remove setup_django from title.py
2020-12-11 16:03:50 -05:00
Cristian
e7e33ea7a5
tests: Add tests for several different ways to extract the title
2020-10-30 08:04:26 -05:00
Nick Sweeting
f727ece7b3
add regex fallback back to title parser
2020-10-30 04:57:31 -04:00
Nick Sweeting
79bef1384e
Merge pull request #493 from ttimasdf/feat-ogtitle
Feature: add og:title metadata as alternative title
2020-10-30 04:51:14 -04:00
Cristian
c12fe0e3d7
feat: Use CURL_ARGS on title extractor
2020-10-22 08:46:16 -05:00
ttimasdf
eda3836dee
feat: add og:title metadata as alternative title
2020-09-27 12:54:52 +08:00
Cristian
b18bbf8874
test: Fix tests post-rebase
2020-09-17 09:09:52 -05:00
Nick Sweeting
032c2458de
add missing setup_django import
2020-07-28 05:58:13 -04:00
Nick Sweeting
55a237a435
also set snapshot title inside of fetch_title directly
2020-07-28 05:56:34 -04:00
Nick Sweeting
273059f054
accept gzipped responses when using curl
2020-07-28 05:55:54 -04:00
Cristian
a5550b2105
fix: Rename logging folder to avoid naming conflicts (and circular import issues)
2020-07-22 11:02:13 -05:00
Cristian
f4d1b5121e
refactor: Move logging.py to main module to avoid circular import issues
2020-07-17 18:00:04 -05:00
Nick Sweeting
5c2bbe7efe
bugfixes
2020-06-25 22:14:40 -04:00
Nick Sweeting
95007d9137
split up utils into separate files
2019-04-30 23:13:04 -04:00
Nick Sweeting
1b8abc0961
move everything out of legacy folder
2019-04-27 17:26:24 -04:00