Commit graph

10 commits

Sascha Ißbrücker
4220ea0b4c
Fix website loader content encoding detection (#482)
2023-05-30 22:04:54 +02:00
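
The title alone does not show the mechanism, so here is an illustrative sketch of content encoding detection in a website loader, not the actual fix: trust a charset the server declares explicitly in Content-Type, otherwise detect the encoding from the raw bytes. The decode_response helper is hypothetical.

```python
# Illustrative sketch of content encoding detection, not the actual fix:
# prefer a charset the server declares explicitly, otherwise detect the
# encoding from the raw bytes.
import requests
from charset_normalizer import from_bytes

def decode_response(response: requests.Response) -> str:
    content_type = response.headers.get("Content-Type", "")
    # requests derives response.encoding from the Content-Type header
    if response.encoding and "charset=" in content_type:
        return response.content.decode(response.encoding, errors="replace")
    # No declared charset: guess the encoding from the bytes themselves
    best_match = from_bytes(response.content).best()
    return str(best_match) if best_match else response.content.decode("utf-8", errors="replace")
```
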
Sascha Ißbrücker
30da1880a5
Cache website metadata to avoid duplicate scraping (#401)
* Cache website metadata to avoid duplicate scraping

* Fix test setup
2023-01-20 22:28:44 +01:00
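
A minimal sketch of what caching website metadata can look like, assuming an in-memory LRU cache keyed by URL; the class and function names are illustrative, not linkding's actual API.

```python
# Minimal sketch of metadata caching, assuming an in-memory LRU cache
# keyed by URL; names are illustrative, not linkding's actual API.
from dataclasses import dataclass
from functools import lru_cache

import requests
from bs4 import BeautifulSoup

@dataclass(frozen=True)
class WebsiteMetadata:
    url: str
    title: str | None
    description: str | None

@lru_cache(maxsize=1000)
def load_website_metadata(url: str) -> WebsiteMetadata:
    # The expensive scrape runs at most once per URL while cached
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, "html.parser")
    title = soup.title.string.strip() if soup.title and soup.title.string else None
    description_tag = soup.find("meta", attrs={"name": "description"})
    description = description_tag.get("content") if description_tag else None
    return WebsiteMetadata(url=url, title=title, description=description)
```

A production version would also bound the cache's lifetime and avoid caching failed scrapes; the commit's actual strategy may differ.
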
Sascha Ißbrücker
43d52642a6
Fix website loader test
2023-01-14 12:26:04 +01:00
Sascha Ißbrücker
4f9170c48d
Improve website loader logging
2023-01-14 11:24:09 +01:00
Luca
c2d8cde86b
Trim website metadata title and description (#383)
* feat: trim fetched metadata placeholders

* feat: implement trimming server-side

* Add website loader tests

* Address review comments

Co-authored-by: Sascha Ißbrücker <sascha.issbruecker@gmail.com>
2023-01-12 21:06:36 +01:00
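
As the commit title suggests, server-side trimming plausibly amounts to stripping leading and trailing whitespace from the scraped values; a sketch with an illustrative trim helper:

```python
# Sketch of server-side trimming of scraped metadata, assuming it
# amounts to stripping surrounding whitespace; helper name illustrative.
def trim(value: str | None) -> str | None:
    return value.strip() if value else None

title = trim("\n  Example Domain  ")  # -> "Example Domain"
description = trim(None)              # -> None
```
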
Sascha Ißbrücker
2fd7704816
Limit document size for website scraper (#354)
Limits the size of scraped HTML documents to prevent out-of-memory errors. The scraper stops reading from the response when it encounters the closing head tag, or once the read content exceeds a maximum size limit.

Fixes #345
2022-10-07 21:18:18 +02:00
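
The commit description maps directly onto a streaming read. A sketch under those assumptions; the limit, chunk size, and function name are illustrative, not linkding's actual values:

```python
# Sketch of the size-limited scrape described above: stream the response
# and stop at the closing head tag or a maximum size. Constants assumed.
import requests

MAX_CONTENT_LIMIT = 5_000_000  # bytes, assumed limit
CHUNK_SIZE = 50_000

def load_page(url: str) -> bytes:
    content = b""
    with requests.get(url, timeout=10, stream=True) as response:
        for chunk in response.iter_content(chunk_size=CHUNK_SIZE):
            content += chunk
            # Stop reading once the document head is complete...
            if b"</head>" in content:
                break
            # ...or once the read content exceeds the size limit
            if len(content) > MAX_CONTENT_LIMIT:
                break
    return content
```

Returning bytes rather than text leaves encoding detection to a later step, which matches how the loader handles charsets elsewhere in this history.
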
Dustin Blackman
b53bd9f112
Bump waybackpy to 3.0.6 (#281)
* Fix wayback

* Fix tests

* Reuse user agent from website loader

Co-authored-by: Sascha Ißbrücker <sascha.issbruecker@gmail.com>
2022-07-03 06:26:16 +02:00
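
For reference, creating a snapshot with the waybackpy 3.x save API looks roughly like this; the user agent string is a stand-in for the one shared with the website loader, and the function name is illustrative:

```python
# Hedged example of the waybackpy 3.x save API, reusing a browser-like
# user agent as the commit mentions; user agent value assumed.
from waybackpy import WaybackMachineSaveAPI

USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36"  # assumed value

def create_web_archive_snapshot(url: str) -> str:
    save_api = WaybackMachineSaveAPI(url, USER_AGENT)
    # Triggers a "Save Page Now" request and returns the archive URL
    return save_api.save()
```
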
Sascha Ißbrücker
e08bf9fd03
Fake request headers to reduce bot detection (#263)
Co-authored-by: Sascha Ißbrücker <sascha.issbruecker@gmail.com>
2022-05-21 13:25:32 +02:00
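
A sketch of the technique named in the title: send browser-like headers so scraping requests look less like bot traffic. The exact header values here are illustrative, not the ones this commit adds.

```python
# Sketch of browser-like request headers to reduce bot detection;
# header values are illustrative.
import requests

FAKE_REQUEST_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
}

response = requests.get("https://example.com", headers=FAKE_REQUEST_HEADERS, timeout=10)
```
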
Taku Izumi
937858cf58
Fix website scraper decoding content incorrectly (#126)
* Avoid stall on web scraping

This patch fixes a stall during web scraping.
I encountered a stall (scraping that never ends) when adding
a bookmark for a certain site.
Adding a timeout parameter to the requests.get() call
avoids this case.

Signed-off-by: Taku Izumi <admin@orz-style.com>

* Avoid character corruption when scraping some Japanese sites

This patch fixes character corruption when scraping some Japanese
sites. To avoid it, I use r.content instead of r.text in the
load_page function.

The character corruption is likely an encoding problem: r.text
decodes the data as Unicode text, so if the scraped site's charset
is not what requests assumes, the characters get corrupted.
r.content returns the raw bytes, which avoids the decoding problem.

Signed-off-by: Taku Izumi <admin@orz-style.com>

* use charset_normalizer to determine response encoding

Co-authored-by: Taku Izumi <admin@orz-style.com>
Co-authored-by: Sascha Ißbrücker <sascha.issbruecker@googlemail.com>
2021-08-25 10:16:23 +02:00
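
Taken together, the three steps in this commit amount to roughly the following load_page shape: a request timeout, raw bytes (r.content) instead of pre-decoded text (r.text), and charset_normalizer to determine the encoding. A sketch, with an assumed timeout value:

```python
# Sketch combining the commit's three changes: a timeout, raw bytes, and
# charset_normalizer-based encoding detection. Timeout value assumed.
import requests
from charset_normalizer import from_bytes

def load_page(url: str) -> str:
    r = requests.get(url, timeout=10)  # prevents the scrape from stalling
    # Detect the encoding from the raw bytes instead of trusting r.text,
    # which mis-decodes sites whose charset requests guesses wrongly
    best_match = from_bytes(r.content).best()
    return str(best_match) if best_match else r.content.decode("utf-8", errors="replace")
```
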
Sascha Ißbrücker
e07da529f1
Preview website title + description in bookmark form

Fix unnecessary selects when rendering bookmarks
2019-07-02 01:28:02 +02:00
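
The "unnecessary selects" note points at the classic N+1 query problem in Django. A hedged example of the usual fix; the model and field names (Bookmark, owner, tags) are assumed for illustration, not taken from this change:

```python
# Hedged Django example of removing redundant SELECTs when rendering a
# list; model and field names assumed, not taken from this change.
from bookmarks.models import Bookmark  # assumed app and model path

# Without select_related/prefetch_related, accessing bookmark.owner or
# bookmark.tags while rendering issues extra queries per bookmark (N+1)
bookmarks = Bookmark.objects.select_related("owner").prefetch_related("tags")
```
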