Fix website scraper decoding content incorrectly (#126)

* Avoid stall on web scraping

This patch fixes a stall during web scraping.
I encountered a stall (scraping never finishes) when adding
a bookmark for certain sites.
Adding a timeout parameter to the requests.get() call avoids
this case.
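
For illustration, a minimal sketch of the approach; the timeout value
matches the diff below, while the rest is just example scaffolding:

    import requests

    def load_page(url: str):
        # If the server stops responding for more than 10 seconds, requests
        # raises a Timeout exception instead of blocking forever
        r = requests.get(url, timeout=10)
        return r.text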

Signed-off-by: Taku Izumi <admin@orz-style.com>

* Avoid character corruption when scraping some Japanese sites

This patch fixes character corruption when scraping some Japanese
sites. To avoid the corruption, I use r.content instead of r.text
in the load_page function.

I think the corruption is an encoding problem: r.text decodes the
response with the encoding that requests assumes, so if the site's
actual charset does not match that assumption, the decoded text
gets corrupted. r.content returns the raw bytes instead, which lets
us avoid the decoding problem.
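
As a rough illustration of the difference (the URL and the EUC-JP
encoding below are hypothetical):

    import requests

    r = requests.get("https://example.jp/", timeout=10)  # hypothetical page encoded in EUC-JP
    # r.text decodes with the charset requests picked (the header value, or a
    # fallback such as ISO-8859-1 when the header omits one), which garbles
    # pages whose real encoding differs from that guess
    garbled = r.text
    # r.content is the raw bytes, so they can be decoded once the real
    # encoding is known
    recovered = r.content.decode("euc_jp", errors="replace")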

Signed-off-by: Taku Izumi <admin@orz-style.com>

* Use charset_normalizer to determine response encoding

Co-authored-by: Taku Izumi <admin@orz-style.com>
Co-authored-by: Sascha Ißbrücker <sascha.issbruecker@googlemail.com>
Taku Izumi 2021-08-25 17:16:23 +09:00 committed by GitHub
parent 8047ba6c63
commit 937858cf58

@@ -2,6 +2,7 @@ from dataclasses import dataclass
 
 import requests
 from bs4 import BeautifulSoup
+from charset_normalizer import from_bytes
 
 
 @dataclass
@@ -33,5 +34,11 @@ def load_website_metadata(url: str):
 
 
 def load_page(url: str):
-    r = requests.get(url)
-    return r.text
+    r = requests.get(url, timeout=10)
+
+    # Use charset_normalizer to determine encoding that best matches the response content
+    # Several sites seem to specify the response encoding incorrectly, so we ignore it and use custom logic instead
+    # This is different from Response.text which does respect the encoding specified in the response first,
+    # before trying to determine one
+    results = from_bytes(r.content)
+    return str(results.best())
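
For context, a minimal sketch of how the decoded page might be consumed
afterwards; the parser choice and the title extraction are assumptions,
not part of this diff:

    from bs4 import BeautifulSoup

    html = load_page("https://example.com")
    # BeautifulSoup is already imported in this module per the first hunk
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.string if soup.title else None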