ArchiveBox

mirror of https://github.com/ArchiveBox/ArchiveBox synced 2024-11-23 12:43:10 +00:00

Author	SHA1	Message	Date
Ross Williams	1e604a1352	sqlite search: clean up errors and type-checking Clean up error handling, and report a better error message on search and flush if FTS5 tables haven't yet been created. Add some mypy comments to clean up type-checking errors.	2023-10-16 14:31:52 -04:00
Ross Williams	adb9f0ecc9	sqlite search: Rename `connection` to `get_connection` `connection` could cause confusion with `django.db.connection` and `get_connection` is a better callable name.	2023-10-16 13:23:26 -04:00
Ross Williams	e0f8eeeaa7	Improve search.backends.sqlite retry logic Retry with table creation should fail if it is attempted for a second time.	2023-10-16 13:23:26 -04:00
Ross Williams	8fe5faf4d0	Introduce SQLite FTS5-powered search backend Use SQLite's FTS5 extension to power full-text search without any additional dependencies. FTS5 was introduced in SQLite 3.9.0, [released][1] in 2015 so should be available on most SQLite installations at this point in time. [1]: https://www.sqlite.org/changes.html#version_3_9_0	2023-10-16 13:23:26 -04:00
Ross Williams	c53ec45a29	WIP: add sqlite search backend boilerplate	2023-10-16 13:23:26 -04:00
Nick Sweeting	d7b883b049	fix broken link	2023-10-12 00:22:47 -07:00
Nick Sweeting	dcef217e5e	Merge pull request #1242 from benmuth/fix-titles-with-empty-tag	2023-10-09 21:39:26 -07:00
Ben Muthalaly	77917e9b55	Fix HTML title parsing bugs. This slightly modifies the HTML_TITLE_REGEX to fix two parsing errors. The first occurred when title tags were empty (e.g. "<title></title>") which was parsed as "</title". The second occurred when titles were a single character (e.g. "<title>A</title>") which was not matched by the regex, and so would fall back to link.base_url. Now when tags are empty, it falls back to link.base_url, and single character titles are parsed correctly. The way the regex works now is still a bit wonky for some edge cases. I couldn't find any cases of incorrect behavior, but it still might be worth reworking more completely for robustness.	2023-10-09 02:00:01 -05:00
Nick Sweeting	4950cee3b6	Merge pull request #1229 from jamesob/dev	2023-09-18 12:19:00 -07:00
James O'Beirne	b28b3b7e67	README: update outdated links Most frustratingly, the outdated docker-compose link prompts users to download an older version of the docker-compose.yml file, which packages a broken YouTube retrieval method.	2023-09-18 10:37:35 -04:00
Nick Sweeting	5c1a14e4f2	ignore errors while getting system user name	2023-09-14 03:39:44 -07:00
Nick Sweeting	ffe2968e4f	improve some comments	2023-09-14 02:41:27 -07:00
Nick Sweeting	f809efce4d	Merge pull request #996 from barthalion/dev	2023-09-03 21:40:49 -07:00
Nick Sweeting	aaca74f6a8	only start parsing json after the first open brace	2023-09-03 21:40:12 -07:00
Nick Sweeting	cd9f228b2f	Merge pull request #1214 from DanielBatteryStapler/DanielBatteryStapler-patch-1	2023-09-03 21:25:12 -07:00
Nick Sweeting	16d278fbdb	Merge pull request #1168 from mAAdhaTTah/add-readwise-reader	2023-09-03 21:24:49 -07:00
Nick Sweeting	110a22ee32	Merge branch 'dev' into DanielBatteryStapler-patch-1	2023-08-31 15:20:46 -07:00
Nick Sweeting	73a5f74d38	update default YOUTUBEDL_ARGS to fix subs and filesize	2023-08-31 15:17:45 -07:00
Nick Sweeting	86366d5640	Update logging_util.py to fix generator subscripting error	2023-08-31 15:12:43 -07:00
Nick Sweeting	a837f870af	Merge pull request #1221 from spresse1/update-singlefile	2023-08-29 17:04:04 -07:00
spresse1	c8597a7fa1	Update singlefile to the latest version and switch it to single-file-cli. Unfortunately, this requires a rewrite of NPM dependency files.	2023-08-29 20:28:48 +02:00
Nick Sweeting	62fb56354b	Merge pull request #1219 from spresse1/chrome-cleanup	2023-08-28 20:02:22 -07:00
spresse1	603ce7ec10	After a timeout, chrome will leave behind a SingletonLock, which prevents future instances of chrome from starting. When an extractor fails due to a timeout, remove this file.	2023-08-28 17:27:03 +02:00
Nick Sweeting	0b6064b7dd	Update docker_entrypoint.sh to use /bin/bash	2023-08-22 16:35:43 -07:00
root	23f086aa40	add LDAP support	2023-08-17 19:51:02 -05:00
Nick Sweeting	00ecf57b0f	Merge pull request #1211 from DanielBatteryStapler/DanielBatteryStapler-patch-1	2023-08-15 21:34:53 -07:00
DanielBatteryStapler	94dacc49c7	Fix archive_org icon "exists"	2023-08-15 23:49:54 -04:00
Nick Sweeting	68e936e7c2	Merge pull request #1186 from ArchiveBox/dependabot/npm_and_yarn/word-wrap-1.2.4	2023-08-13 16:43:09 -07:00
Nick Sweeting	a7d7644dca	Merge pull request #1205 from overhacked/fix_url_regex_hyphen	2023-08-09 14:20:07 -07:00
Ross Williams	c039ef05b3	Fix hyphen placement in util.URL_REGEX Incorrect hyphen placement in `URL_REGEX` was allowing it to match more characters than intended. In a regex character class, a literal hyphen can only appear as the first character in the class, or it will be interpreted as the delimiter of a range of characters. The issue fixed here caused the range of characters from `[$-_]` to be treated as valid URL characters, instead of the intended set of three characters `[-_$]`. The incorrect range interpretation inadvertently included most ASCII punctuation, most importantly the angle brackets, square brackets, and single quote that the expression uses to mark the end of a match. This causes the expression to match a URL that has a "hostname" portion beginning with one of the intended "stop parsing" characters. For example: ``` https://<b>www</b>.example.com/ # MATCHES but should not https://[for example] # MATCHES but should not scheme='https://' # MATCHES, including final quote, but should not ``` Some test cases have been added to the `URL_REGEX` assert in archivebox.parsers to cover this possibility.	2023-08-08 15:24:16 -04:00
Nick Sweeting	b773041952	Merge pull request #1199 from overhacked/chrome_version_detection_fix	2023-08-01 10:14:18 -07:00
Nick Sweeting	5b7ecfc872	Merge pull request #1197 from overhacked/bin_version_env_fix	2023-08-01 10:12:59 -07:00
Ross Williams	d0e65eba7f	More reliably detect Google Chrome version number Previous method was splitting on the first whitespace, and missing the version number when it appeared as `"Google Chrome 115.0.234.2342"` instead of, e.g., `"Chromium 115.0.234.8283"`. This commit changes the version detection to regex search for whitespace, then one or more digits followed by a period, then at least one more digit. Only the first sequence of digits is captured. Unless Chrome radically changes their version numbering, this should capture the first group of digits after the reported browser name, which would be the major version.	2023-07-31 15:34:58 -04:00
Ross Williams	9d9872d325	bin_version means to modify, not replace environ the `bin_version` function means to modify the environment, not replace it entirely. Fixes bugs that occur when it wipes out the PATH environment variable, such as when running in a virtual environment.	2023-07-31 11:36:34 -04:00
Nick Sweeting	3e5e9c7a41	Merge pull request #1194 from wogong/patch-1	2023-07-29 09:29:34 -07:00
Zhen	3e9e221232	Fix Instapaper export link in README.md Original link to Instapaper export `https://www.instapaper.com/user/export` is broken: `405: Method Not Allowed`.	2023-07-28 07:58:58 +08:00
dependabot[bot]	0bf739b736	Bump word-wrap from 1.2.3 to 1.2.4 Bumps [word-wrap](https://github.com/jonschlinkert/word-wrap) from 1.2.3 to 1.2.4. - [Release notes](https://github.com/jonschlinkert/word-wrap/releases) - [Commits](https://github.com/jonschlinkert/word-wrap/compare/1.2.3...1.2.4) --- updated-dependencies: - dependency-name: word-wrap dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com>	2023-07-19 02:03:00 +00:00
Nick Sweeting	40ddd33602	Update README.md	2023-07-07 09:37:42 -07:00
mAAdhaTTah	181501fd36	Add Readwise Reader API parser Implemented similar to the Pocket API.	2023-07-02 11:20:58 -04:00
Nick Sweeting	0d26538a4b	Update README.md example commands to use new docker compose format	2023-06-13 17:46:32 -07:00
Nick Sweeting	37d238cd31	Update README.md	2023-06-13 17:43:40 -07:00
Nick Sweeting	2580f76a2e	Update README.md	2023-06-13 17:41:08 -07:00
Nick Sweeting	571131d5f3	Update README.md to simplify intro instructions	2023-06-13 17:35:00 -07:00
Nick Sweeting	733dbfa1f3	Update scheduler to persist single shared crontab via volume instead of requiring separate container for each job	2023-06-13 17:13:55 -07:00
Nick Sweeting	58d784cdd8	limit nginx config to only serve archive directory instead of main data folder root	2023-06-13 16:43:37 -07:00
Nick Sweeting	0e0b06bef1	Merge pull request #1159 from ArchiveBox/pirate-patch-1	2023-06-13 05:49:58 -07:00
Nick Sweeting	406e2b681d	Update docker-compose.yml scheduled task image and container name	2023-06-13 05:49:22 -07:00
Nick Sweeting	347b8d977d	Merge pull request #1154 from sissbruecker/fix/oneshot_exit_code	2023-05-30 17:34:25 -07:00
Sascha Ißbrücker	7bf4f40da0	just use out_dir	2023-05-29 10:03:49 +02:00
Sascha Ißbrücker	40c122515a	fix: make oneshot command return successful exit code	2023-05-29 10:01:27 +02:00
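
The hyphen-placement bug described in commit c039ef05b3 can be reproduced in isolation. The sketch below is a simplified illustration, not ArchiveBox's actual `URL_REGEX`: the two one-character classes and the helper `matched_chars` are hypothetical names chosen for this example. It shows that `[$-_]` is read as the ASCII range from `$` to `_`, which covers the "stop parsing" characters, while `[-_$]` matches only the three literal characters. (In Python's `re`, a hyphen placed last in the class or escaped as `\-` is also treated as a literal.)

```python
import re

# Hypothetical, simplified character classes; NOT the real URL_REGEX.
RANGE_CLASS = re.compile(r'[$-_]')    # hyphen between '$' and '_' defines a range
LITERAL_CLASS = re.compile(r'[-_$]')  # leading hyphen is a literal character


def matched_chars(pattern, candidates):
    """Return the candidate characters that the one-character class matches."""
    return [ch for ch in candidates if pattern.fullmatch(ch)]


# Characters the commit message says should end a URL match.
stop_chars = ['<', '>', '[', ']', "'"]

print(matched_chars(RANGE_CLASS, stop_chars))    # every stop character falls inside the range
print(matched_chars(LITERAL_CLASS, stop_chars))  # none of them match the literal set
```

Listing the hyphen first keeps the class readable and avoids depending on escape rules that differ slightly between regex dialects.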


3088 commits
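
The version-detection approach described in commit d0e65eba7f (search for whitespace, then digits, a period, and at least one more digit, capturing only the first digit group) might look roughly like the following. This is a sketch built from the commit message alone: the pattern `MAJOR_VERSION` and the function `major_version` are assumed names, and the code in the repository may differ.

```python
import re
from typing import Optional

# Assumed pattern reconstructed from the commit description; the real code may differ.
# \s      whitespace before the version number
# (\d+)   first run of digits, captured as the major version
# \.\d    a period followed by at least one more digit
MAJOR_VERSION = re.compile(r'\s(\d+)\.\d')


def major_version(version_output: str) -> Optional[str]:
    """Return the major version found in a browser's version string, if any."""
    match = MAJOR_VERSION.search(version_output)
    return match.group(1) if match else None


for output in ("Google Chrome 115.0.234.2342", "Chromium 115.0.234.8283"):
    print(output, "->", major_version(output))
```

Unlike splitting on the first whitespace, this does not depend on how many words the browser name contains.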