Fix website scraper decoding content incorrectly (#126)

* Avoid stall on web scraping

This patch fixes a stall during web scraping.
I encountered a stall (scraping never finishes) when adding
a bookmark for certain sites.
Adding a timeout parameter to the requests.get() call avoids
this case.
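
For illustration, a minimal sketch of the approach; the timeout value
matches the diff below, while the rest is just example scaffolding:

    import requests

    def load_page(url: str):
        # If the server stops responding for more than 10 seconds, requests
        # raises a Timeout exception instead of blocking forever
        r = requests.get(url, timeout=10)
        return r.text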

Signed-off-by: Taku Izumi <admin@orz-style.com>

* Avoid character corruption when scraping some Japanese sites

This patch fixes character corruption when scraping some Japanese
sites. To avoid the corruption, I use r.content instead of r.text
in the load_page function.

I think the corruption is an encoding problem: r.text decodes the
response with the encoding that requests assumes, so if the site's
actual charset does not match that assumption, the decoded text
gets corrupted. r.content returns the raw bytes instead, which lets
us avoid the decoding problem.
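
As a rough illustration of the difference (the URL and the EUC-JP
encoding below are hypothetical):

    import requests

    r = requests.get("https://example.jp/", timeout=10)  # hypothetical page encoded in EUC-JP
    # r.text decodes with the charset requests picked (the header value, or a
    # fallback such as ISO-8859-1 when the header omits one), which garbles
    # pages whose real encoding differs from that guess
    garbled = r.text
    # r.content is the raw bytes, so they can be decoded once the real
    # encoding is known
    recovered = r.content.decode("euc_jp", errors="replace")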

Signed-off-by: Taku Izumi <admin@orz-style.com>

* Use charset_normalizer to determine response encoding

Co-authored-by: Taku Izumi <admin@orz-style.com>
Co-authored-by: Sascha Ißbrücker <sascha.issbruecker@googlemail.com>
Taku Izumi 2021-08-25 17:16:23 +09:00 committed by GitHub
parent 8047ba6c63
commit 937858cf58

@@ -2,6 +2,7 @@ from dataclasses import dataclass
 
 import requests
 from bs4 import BeautifulSoup
+from charset_normalizer import from_bytes
 
 
 @dataclass
@@ -33,5 +34,11 @@ def load_website_metadata(url: str):
 
 
 def load_page(url: str):
-    r = requests.get(url)
-    return r.text
+    r = requests.get(url, timeout=10)
+
+    # Use charset_normalizer to determine encoding that best matches the response content
+    # Several sites seem to specify the response encoding incorrectly, so we ignore it and use custom logic instead
+    # This is different from Response.text which does respect the encoding specified in the response first,
+    # before trying to determine one
+    results = from_bytes(r.content)
+    return str(results.best())
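
For context, a minimal sketch of how the decoded page might be consumed
afterwards; the parser choice and the title extraction are assumptions,
not part of this diff:

    from bs4 import BeautifulSoup

    html = load_page("https://example.com")
    # BeautifulSoup is already imported in this module per the first hunk
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.string if soup.title else None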