ArchiveBox

mirror of https://github.com/ArchiveBox/ArchiveBox synced 2024-11-29 23:50:22 +00:00

Author	SHA1	Message	Date
Ross Williams	b6a20c962a	Extract text from singlefile.html when indexing singlefile.html contains a lot of large strings in the form of `data:` URLs, which can be unnecessarily stored in full-text indices. Also, large chunks of JavaScript shouldn't be indexed, either, as they pollute search results for searches about JS functions, etc. This commit takes a blanket approach of parsing singlefile.html as it is read and only outputting text and selected textual attributes (like `alt`) for indexing.	2023-10-12 13:06:35 -04:00
Ben Muthalaly	77917e9b55	Fix HTML title parsing bugs. This slightly modifies the HTML_TITLE_REGEX to fix two parsing errors. The first occurred when title tags were empty (e.g. "<title></title>") which was parsed as "</title". The second occurred when titles were a single character (e.g. "<title>A</title>") which was not matched by the regex, and so would fall back to link.base_url. Now when tags are empty, it falls back to link.base_url, and single character titles are parsed correctly. The way the regex works now is still a bit wonky for some edge cases. I couldn't find any cases of incorrect behavior, but it still might be worth reworking more completely for robustness.	2023-10-09 02:00:01 -05:00
Nick Sweeting	5c1a14e4f2	ignore errors while getting system user name	2023-09-14 03:39:44 -07:00
Nick Sweeting	ffe2968e4f	improve some comments	2023-09-14 02:41:27 -07:00
Nick Sweeting	f809efce4d	Merge pull request #996 from barthalion/dev	2023-09-03 21:40:49 -07:00
Nick Sweeting	aaca74f6a8	only start parsing json after the first open brace	2023-09-03 21:40:12 -07:00
Nick Sweeting	cd9f228b2f	Merge pull request #1214 from DanielBatteryStapler/DanielBatteryStapler-patch-1	2023-09-03 21:25:12 -07:00
Nick Sweeting	16d278fbdb	Merge pull request #1168 from mAAdhaTTah/add-readwise-reader	2023-09-03 21:24:49 -07:00
Nick Sweeting	110a22ee32	Merge branch 'dev' into DanielBatteryStapler-patch-1	2023-08-31 15:20:46 -07:00
Nick Sweeting	73a5f74d38	update default YOUTUBEDL_ARGS to fix subs and filesize	2023-08-31 15:17:45 -07:00
Nick Sweeting	86366d5640	Update logging_util.py to fix generator subscripting error	2023-08-31 15:12:43 -07:00
spresse1	603ce7ec10	After a timeout, chrome will leave behind a SingletonLock, which prevents future instances of chrome from starting. When an extractor fails due to a timeout, remove this file.	2023-08-28 17:27:03 +02:00
root	23f086aa40	add LDAP support	2023-08-17 19:51:02 -05:00
DanielBatteryStapler	94dacc49c7	Fix archive_org icon "exists"	2023-08-15 23:49:54 -04:00
Ross Williams	c039ef05b3	Fix hyphen placement in util.URL_REGEX Incorrect hyphen placement in `URL_REGEX` was allowing it to match more characters than intended. In a regex character class, a literal hyphen can only appear as the first character in the class, or it will be interpreted as the delimiter of a range of characters. The issue fixed here caused the range of characters from `[$-_]` to be treated as valid URL characters, instead of the intended set of three characters `[-_$]`. The incorrect range interpretation inadvertently included most ASCII punctuation, most importantly the angle brackets, square brackets, and single quote that the expression uses to mark the end of a match. This causes the expression to match a URL that has a "hostname" portion beginning with one of the intended "stop parsing" characters. For example: ``` https://<b>www</b>.example.com/ # MATCHES but should not https://[for example] # MATCHES but should not scheme='https://' # MATCHES, including final quote, but should not ``` Some test cases have been added to the `URL_REGEX` assert in archivebox.parsers to cover this possibility.	2023-08-08 15:24:16 -04:00
Nick Sweeting	b773041952	Merge pull request #1199 from overhacked/chrome_version_detection_fix	2023-08-01 10:14:18 -07:00
Ross Williams	d0e65eba7f	More reliably detect Google Chrome version number Previous method was splitting on the first whitespace, and missing the version number when it appeared as `"Google Chrome 115.0.234.2342"` instead of, e.g., `"Chromium 115.0.234.8283"`. This commit changes the version detection to regex search for whitespace, then one or more digits followed by a period, then at least one more digit. Only the first sequence of digits is captured. Unless Chrome radically changes their version numbering, this should capture the first group of digits after the reported browser name, which would be the major version.	2023-07-31 15:34:58 -04:00
Ross Williams	9d9872d325	bin_version means to modify, not replace environ the `bin_version` function means to modify the environment, not replace it entirely. Fixes bugs that occur when it wipes out the PATH environment variable, such as when running in a virtual environment.	2023-07-31 11:36:34 -04:00
mAAdhaTTah	181501fd36	Add Readwise Reader API parser Implemented similar to the Pocket API.	2023-07-02 11:20:58 -04:00
Sascha Ißbrücker	7bf4f40da0	just use out_dir	2023-05-29 10:03:49 +02:00
Sascha Ißbrücker	40c122515a	fix: make oneshot command return successful exit code	2023-05-29 10:01:27 +02:00
Micah R Ledbetter	1e50ca243e	Add FAVICON_PROVIDER option for custom favicon service	2023-05-05 20:42:36 -05:00
David Calano	f48e48e6da	Fix for Issue #1008 - Added missing decode() when setting pkg_path variable	2023-03-29 01:48:12 -04:00
Tom Ryder	53af810ff8	Add missing closing quote to style attribute	2023-03-27 10:54:04 +13:00
ふぁ	44a5a5ed7e	add explicitly specify --headless=new Signed-off-by: ふぁ <yuki@yuki0311.com>	2023-03-17 19:30:14 +09:00
Nick Sweeting	9f42a3bf29	fix whitespace	2023-03-15 16:01:02 -07:00
ふぁ	d77c770c47	add CHROME_TIMEOUT args Signed-off-by: ふぁ <yuki@yuki0311.com>	2023-03-14 20:29:41 +09:00
Nick Sweeting	606fa397a4	disable passing timeout arg to chrome because v111 is crashing when passed	2023-03-13 10:50:18 +00:00
Nick Sweeting	1f1c70a8b1	remove --single-process from chrome args and add some rendering optimization args	2023-03-13 10:49:57 +00:00
Nick Sweeting	9599845b56	ensure DOM HTML dump is non-zero length file when retrying	2023-03-13 10:49:26 +00:00
Nick Sweeting	dca69933eb	Update archivebox/config.py Co-authored-by: dugite-code <dugite-code@users.noreply.github.com>	2023-01-09 18:22:01 -08:00
Nick Sweeting	2538b170c7	Merge branch 'dev' into feat/reverse-proxy-auth	2023-01-09 18:20:45 -08:00
Nick Sweeting	0cbeeb4346	Merge pull request #1021 from renaisun/dev	2023-01-09 18:17:39 -08:00
Joseph Turian	07de4a79a1	Merge branch 'dev' into feature/kludge-984-UTF8-bug	2022-12-20 11:39:01 +01:00
Nick Sweeting	e114b1f6dc	Merge pull request #1027 from turian/feature/migrations-0021_auto_20220914_0934.py	2022-11-27 19:28:55 -08:00
SnZ	2db830c6a8	Method typo? Fixes '[Errno 2] No such file or directory' error during add	2022-11-20 01:51:16 +01:00
Joseph Turian	a26a91d09f	Merge branch 'feature/migrations-0021_auto_20220914_0934.py' into feature/kludge-984-UTF8-bug	2022-09-14 09:44:55 +00:00
Joseph Turian	22d8e57637	Add missing migration 0021	2022-09-14 09:36:17 +00:00
Joseph Turian	30947aeb07	yt-dlp flag cleanup	2022-09-14 06:29:57 +02:00
Joseph Turian	f729bbe122	yt-dlp fixes	2022-09-14 06:27:58 +02:00
Joseph Turian	081a12b079	Add ts	2022-09-12 21:32:47 +00:00
Joseph Turian	daef48e59b	flake8	2022-09-12 21:31:33 +00:00
Joseph Turian	983f485cc0	flake8	2022-09-12 21:29:43 +00:00
Joseph Turian	b864c38d9e	Don't be strict on unicode errors	2022-09-12 20:40:45 +00:00
Joseph Turian	dba423a568	A few more youtube-dl tweaks	2022-09-12 20:36:23 +00:00
Joseph Turian	f5f7aff3b4	Added yt-dlp everywhere	2022-09-12 20:34:02 +00:00
renaisun	0ea955b3ed	add a missing comma	2022-09-12 09:08:28 +08:00
notevenaperson	40659b5e9d	singlefile.py: Code to ensure options are deduplicated	2022-09-12 09:08:28 +08:00
Joseph Turian	2b58cce43f	Attempted to warn on #984 and #1014	2022-09-11 12:19:16 +02:00
Bartłomiej Piotrowski	eb97fd427b	Skip first line of the "JSON" file ArchiveBox moves the file to parse to the sources directory and adds the original filename at the top, making the file invalid.	2022-07-05 10:56:40 +02:00
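
The hyphen-placement issue described in commit c039ef05b3 can be illustrated with a minimal sketch. The two patterns below are simplified, hypothetical character classes, not ArchiveBox's actual `URL_REGEX`; they only show how `[$-_]` is read as a range while `[-_$]` is read as three literal characters.

```python
import re

# Hypothetical, simplified patterns (NOT the real URL_REGEX); they only
# illustrate how hyphen placement changes the meaning of a character class.
RANGE_CLASS = re.compile(r'[$-_]')    # range from '$' (0x24) through '_' (0x5F)
LITERAL_CLASS = re.compile(r'[-_$]')  # exactly three characters: '-', '_', '$'


def matched_chars(pattern, candidates):
    """Return the candidate characters that the pattern matches in full."""
    return [ch for ch in candidates if pattern.fullmatch(ch)]


# The "stop parsing" characters named in the commit message.
stop_chars = ['<', '>', '[', ']', "'"]

range_hits = matched_chars(RANGE_CLASS, stop_chars)
literal_hits = matched_chars(LITERAL_CLASS, stop_chars)

print("range class matches:  ", range_hits)
print("literal class matches:", literal_hits)
```

All five stop characters fall between `$` and `_` in ASCII, so the range form accepts them, while the literal form accepts none of them.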

1260 commits
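
The version detection described in commit d0e65eba7f (whitespace, then one or more digits, a period, and at least one more digit, capturing only the first digit group) might look roughly like the sketch below. The pattern and the `major_version` helper are assumptions written from the commit message, not the code actually committed.

```python
import re

# Assumed pattern reconstructed from the commit description; the name
# MAJOR_VERSION_RE and the helper below are hypothetical.
MAJOR_VERSION_RE = re.compile(r'\s(\d+)\.\d')


def major_version(version_output):
    """Return the first digit group after the browser name, or None."""
    match = MAJOR_VERSION_RE.search(version_output)
    return match.group(1) if match else None


print(major_version("Google Chrome 115.0.234.2342"))
print(major_version("Chromium 115.0.234.8283"))
```

Searching for the pattern, rather than splitting on the first whitespace, means a two-word browser name such as "Google Chrome" no longer hides the version number.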