linkding/bookmarks/services/website_loader.py

import logging
from dataclasses import dataclass
from functools import lru_cache
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup
from charset_normalizer import from_bytes
from django.utils import timezone

logger = logging.getLogger(__name__)


@dataclass
class WebsiteMetadata:
    url: str
    title: str | None
    description: str | None
    preview_image: str | None

    def to_dict(self):
        return {
            "url": self.url,
            "title": self.title,
            "description": self.description,
            "preview_image": self.preview_image,
        }


# Caching metadata avoids scraping again when saving bookmarks, in case the
# metadata was already scraped to show preview values in the bookmark form
@lru_cache(maxsize=10)
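def load_website_metadata(url: str) -> WebsiteMetadata:
    """Scrape the title, description and preview image URL of a page.

    Fields that are missing or could not be scraped are returned as None.
    """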
    title = None
    description = None
    preview_image = None
    try:
        start = timezone.now()
        page_text = load_page(url)
        end = timezone.now()
        logger.debug(f"Load duration: {end - start}")

        start = timezone.now()
        soup = BeautifulSoup(page_text, "html.parser")

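        # soup.title.string is None when <title> is empty or contains nested tags
        title = (
            soup.title.string.strip()
            if soup.title is not None and soup.title.string
            else None
        )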
        # Use .get() so a meta tag without a content attribute doesn't raise KeyError
        description_tag = soup.find("meta", attrs={"name": "description"})
        description = (
            description_tag["content"].strip()
            if description_tag and description_tag.get("content")
            else None
        )

        # Fall back to the Open Graph description
        if not description:
            description_tag = soup.find("meta", attrs={"property": "og:description"})
            description = (
                description_tag["content"].strip()
                if description_tag and description_tag.get("content")
                else None
            )

        image_tag = soup.find("meta", attrs={"property": "og:image"})
        preview_image = (
            image_tag["content"].strip()
            if image_tag and image_tag.get("content")
            else None
        )
        if (
            preview_image
            and not preview_image.startswith("http://")
            and not preview_image.startswith("https://")
        ):
            preview_image = urljoin(url, preview_image)

        end = timezone.now()
        logger.debug(f"Parsing duration: {end - start}")
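    except Exception:
        # Scraping is best effort: log the failure and return whatever was parsed.
        # A return inside finally would also swallow KeyboardInterrupt and SystemExit.
        logger.exception(f"Failed to load website metadata: {url}")

    return WebsiteMetadata(
        url=url, title=title, description=description, preview_image=preview_image
    )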


CHUNK_SIZE = 50 * 1024
MAX_CONTENT_LIMIT = 5000 * 1024


def load_page(url: str):
    headers = fake_request_headers()
    size = 0
    content = None
    iteration = 0
    # Use a with block to ensure the response gets closed even if it's only read partially
    with requests.get(url, timeout=10, headers=headers, stream=True) as r:
        for chunk in r.iter_content(chunk_size=CHUNK_SIZE):
            size += len(chunk)
            iteration = iteration + 1
            if content is None:
                content = chunk
            else:
                content = content + chunk

            logger.debug(
                f"Loaded chunk (iteration={iteration}, total={size / 1024:.1f} KiB)"
            )

            # Stop reading once the closing head tag has been read
            end_of_head = b"</head>"
            if end_of_head in content:
                logger.debug(f"Found closing head tag after {size} bytes")
                content = content.split(end_of_head)[0] + end_of_head
                break
            # Stop reading if we exceed limit
            if size > MAX_CONTENT_LIMIT:
                logger.debug(f"Cancel reading document after {size} bytes")
                break
        if hasattr(r, "_content_consumed"):
            logger.debug(f"Request consumed: {r._content_consumed}")

    # Use charset_normalizer to determine the encoding that best matches the
    # response content. Several sites seem to specify the response encoding
    # incorrectly, so we ignore it and detect the encoding ourselves. This
    # differs from Response.text, which respects the encoding specified in the
    # response first, before trying to determine one.
    # from_bytes requires bytes, and best() returns None if nothing matches.
    best_match = from_bytes(content or b"").best()
    return str(best_match) if best_match is not None else ""


DEFAULT_USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.0.0 Safari/537.36"


def fake_request_headers():
    return {
        "Accept": "text/html,application/xhtml+xml,application/xml",
        "Accept-Encoding": "gzip, deflate",
        "Dnt": "1",
        "Upgrade-Insecure-Requests": "1",
        "User-Agent": DEFAULT_USER_AGENT,
    }
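
The early-exit logic in `load_page` can be exercised without a network connection. Below is a minimal sketch, assuming a hypothetical helper `read_until_head` that accepts any iterable of byte chunks in place of `r.iter_content(...)`. The helper name and its `limit` parameter are illustrative and not part of linkding.

```python
from typing import Iterable

END_OF_HEAD = b"</head>"


def read_until_head(chunks: Iterable[bytes], limit: int = 5000 * 1024) -> bytes:
    """Accumulate chunks until </head> is seen or more than `limit` bytes were read.

    Hypothetical stand-in for the loop in load_page; not part of linkding.
    """
    content = b""
    for chunk in chunks:
        content += chunk
        # Search the accumulated buffer so a tag split across two chunks is still found
        if END_OF_HEAD in content:
            return content.split(END_OF_HEAD)[0] + END_OF_HEAD
        # Give up once the size limit is exceeded
        if len(content) > limit:
            break
    return content


# The closing head tag is split across the two chunks
chunks = [
    b"<html><head><title>Hi</title></he",
    b"ad><body>large body...</body></html>",
]
print(read_until_head(chunks))
```

Searching the accumulated buffer rather than only the latest chunk costs a rescan per iteration, but it keeps the check correct when `</head>` straddles a chunk boundary.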
Blame history for this file (one entry per commit, oldest first):

2019-07-01 23:28:02 +00:00  Preview website title + description in bookmark form. Fix unnecessary selects when rendering bookmarks.
2021-08-25 08:16:23 +00:00  Fix website scraper decoding content incorrectly (#126). Adds a timeout to requests.get() to avoid stalls while scraping, reads r.content instead of r.text to avoid character corruption on some Japanese sites, and uses charset_normalizer to determine the response encoding. Co-authored-by: Taku Izumi, Sascha Ißbrücker.
2022-05-21 11:25:32 +00:00  Fake request headers to reduce bot detection (#263). Co-authored-by: Sascha Ißbrücker.
2022-07-03 04:26:16 +00:00  Bump waybackpy to 3.0.6 (#281). Fix wayback, fix tests, reuse user agent from website loader. Co-authored-by: Sascha Ißbrücker.
2022-10-07 19:18:18 +00:00  Limit document size for website scraper (#354). Limits the size of scraped HTML documents to prevent out of memory errors. The scraper will stop reading from the response when it encounters the closing head tag, or if the read content's size exceeds a max limit. Fixes #345.
2023-01-12 20:06:36 +00:00  Trim website metadata title and description (#383). Trim fetched metadata placeholders, implement trimming server-side, add website loader tests. Co-authored-by: Sascha Ißbrücker.
2023-01-14 10:24:09 +00:00  Improve website loader logging.
2023-01-20 21:28:44 +00:00  Cache website metadata to avoid duplicate scraping (#401). Also fixes test setup.
2023-05-30 20:04:54 +00:00  Fix website loader content encoding detection (#482).
2024-01-27 09:28:46 +00:00  Support Open Graph description (#602). Support extracting the description from the meta og:description property, support pytest for running tests, add test. Co-authored-by: Sascha Ißbrücker.
2024-01-27 10:29:16 +00:00  Add black code formatter.
2024-05-07 16:58:52 +00:00  Add support for bookmark thumbnails (#721). Co-authored-by: Sascha Ißbrücker.
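
The relative-URL handling for `og:image` in `load_website_metadata` relies on `urllib.parse.urljoin`. The sketch below isolates that step in a hypothetical helper, `resolve_preview_image`, which is not part of linkding; the example.com URLs are placeholders.

```python
from typing import Optional
from urllib.parse import urljoin


def resolve_preview_image(page_url: str, image_url: Optional[str]) -> Optional[str]:
    """Resolve a possibly relative og:image value against the page URL.

    Hypothetical helper mirroring the check in load_website_metadata.
    """
    if image_url and not image_url.startswith(("http://", "https://")):
        return urljoin(page_url, image_url)
    return image_url


page = "https://example.com/blog/post"
print(resolve_preview_image(page, "/img/a.png"))  # root-relative path
print(resolve_preview_image(page, "img/a.png"))  # path relative to /blog/
print(resolve_preview_image(page, "//cdn.example.com/a.png"))  # protocol-relative
```

`str.startswith` accepts a tuple of prefixes, so the two scheme checks collapse into a single call.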