The previous implementation of targeted file scanning pulled patches out of commit data, which didn't work for binary files (because GitHub doesn't return patches for them). This PR changes the system to always just download the requested file and scan it, which means we get binary file support.
* Fixed the checks for local exported data
* Fixed the check for local export files
* Fixed the check for local export files
* Fixed the check for local export files
* Merge branch 'main' into th-899-postman-panic-issue
* minor changes in the tests
* test update
* test
The GitHub source generates chunks for targeted scans differently than it does for "normal" scans. One difference was the presence of leading + and - characters, which can interfere with detection in some cases.
There is a scenario in which results filtration is known to cause problems, and this PR disables it in that scenario. (It should cause problems more generally, but lacking any concrete cases of that, I want to tread lightly.)
We have identified some cases in which the results "cleaning" logic (the logic that eliminates superfluous results) should not run. In order to allow this, we need to expose the cleaning logic to the engine. This PR does so by doing these things:
- Create a CustomResultsCleaner interface that can be implemented by detectors that want to use custom cleaning logic
- Implement this interface for the aws and awssessionkey detectors (and remove their previous invocation of their custom cleaning logic)
- Modify the engine to invoke this logic (conditionally)
This PR also removes the "custom" cleaning logic for the opsgenie, razorpay, and twilio detectors, because it was added erroneously.
This is an alternative implementation of #3233.
Many analyzer tests require GCP credentials, which the community does
not have access to. It's best to ignore these tests, which would
otherwise immediately fail for unrelated community contributions.
If a detector ignores the configured timeout it is probably because of I/O blocking, which degrades the efficiency of the detector worker pool when it happens a lot. In the worst case, a detector that fully hangs will zombify its worker, causing really bad performance problems. When this happens, we don't really have a good way to notice other than seeing scan throughput drop suspiciously. This PR adds explicit logging when detection takes longer than it should so we have a better chance of catching this.
(This problem theoretically can spring up anywhere, in any worker, but the detector fleet is vast, uses network I/O, and is implemented by a much larger group of people, so this sort of problem is much more likely to slip into detector implementations than anywhere else in the codebase. We could generalize this mechanism, but I don't want to make that investment before seeing if this smaller change captures the information we need.)
The GitHub source currently applies its authentication configuration as the first step of enumeration. This is incompatible with both targeted scans and scan job reports, and also means that authentication logic has to be duplicated into the validation flow. This PR moves it into Init so that it's available to targeted scans and, eventually, unit-specific scans. This also allows us to remove the copy of the old logic that was in Validate.
As part of the work I've also cleaned up the integration test suite. (Several of them were apparently disabled back when they ran on every push, but now that we're not doing that, we can re-enable them.)