* Add UnitHook and NoopHook implementations
The UnitHook tracks metrics per unit of a job, and emits them on a
channel once finished. It should work even if the Source does not
support source units.
* Refactor channel to use an LRU cache instead
An LRU cache has a more favorable failure mode than the channel. With
the channel, if the consumer stopped consuming metrics, scanning would
block. With the LRU cache, metrics will be dropped when space runs out
and a log message emitted.
* Fix bug in chunker that surfaces with a flaky passed in io.Reader
The chunker was previously expecting the passed in io.Reader to always
successfully read a full buffer of data, however it's valid for a Reader
to return less data than requested. When this happens, the chunker would
peek the same data that it then reads in the next iteration of the loop,
causing the same data to be scanned twice.
Co-authored-by: ahrav <ahravdutta02@gmail.com>
* Fix EOF error check
* Use io.ReadFull in Chunker
---------
Co-authored-by: ahrav <ahravdutta02@gmail.com>
This PR updates the S3 source to use explicitly configured credentials if they're available and follow the normal AWS credentials waterfall if they're not. This is irrespective of whether role assumption is configured. This changes the previous behavior, which was to use waterfall credentials only if role assumption was configured and explicitly configured credentials only when it was not.
* added PR and Issue body scanning; adjusted CLI args to fit
* removed print statement from debugging
* removed exclude-commits; adjusted CLI flags
* minor changes to match main branch
* fixing logic
* updating README for --issues and --prs
* Add ability to dynamically scale concurrently running sources
Refactor SourceManager to use a counting semaphore to allow for
dymanically changing limits. This complicated `Wait() error` which needs
to return the first error encountered. We previously got that for free
using `errgroup.Group`, however now we need to handle that ourselves.
`Wait()` needs to return an error for use in the engine to set the
correct exit code.
* Group third party imports together
The previous implementation used int64 for both, which can be mixed up
easily. Using distinct types adds a layer of type safety checked by the
compiler.
This PR implements validation of Gitlab source configuration.
I was hoping to be able to unify more of the implementation of Validate and Chunks, but there was more divergence than I expected. Specifically, Chunks handles a fair number of Gitlab errors that aren't configuration errors (e.g. "Gitlab returned a repo with an unparseable URL"). Accommodating these in the Validate code path felt wrong, and I wasn't able to create a common code path that could accommodate both Validate and Chunks without looking awful.
* Refactor SourceManager to remove Enrollment
Initializing the Source will be the responsibility of the caller. The
SourceManager exposes a GetIDs method for getting a source and job ID.
* Update tests
* Update engine usage
* Update apiClient interface to have one GetIDs method
* Update SourceManager usage in engine
This PR unifies some code paths within the S3 source. This is being done to better support a future implementation of S3 source validation; less code that runs means less code to validate. The logical change is to move the handling of "role-less" operation down the call tree, which allows for a single code path for more of the S3 code.
This PR also fixes a bug that would occur in the (rare) case that the source couldn't create a regional S3 client. Before, an error would be logged, but it would be followed by a panic. Now the bucket in question is skipped.
The source manager initialization function was defined as `sourceID`
followed by `jobID`, while the source initialization function is the
reverse. This is confusing and easy to mix up since the parameters are
the same type.
This commit adds a test to make sure the source manager initializes in
the correct order, but it doesn't prevent the library user to make the
same mistake. We may want to consider using different types.
* add exportable validate function for github
* update validator
* use the context
* gate to prevent panic
* wrap error with context
* wrap error with context for basic auth and unauth
* add role assumption for s3 source
* refactor role assumption to repeatable string
user can pass array of roles to assume
* refactor s3 chunks to handle passed roleARNs
* add role-session name
use timestamp to make dynamic
* add docstring for rolearn strings()
* make sure role ars are passed into source
* refactor role assumption functionality
break s3 bucket scanning into sep. function
* add log check on assume role
* fix role iteration
- Make sure s3 struct is populated with roles
- add separate new client instantiation for role-based access
- iterates through each role
* add comment
* protobuf revert for merge
* re-run make proto
* lint cleanup
* cleanup TODOs
* drop redundant switch case in assumerole client
* use less verbose 'ctx' designator
* breakout functionality from Chunks
- separate functions for:
- enumerating buckets to scan
- scanning objects within the buckets
* remake protobuf defs
* allow scan to continue on single bucket err
* add readme docs
* minor fixups
With the introduction of the SourceManager, the chunks channel became
private and read-only. This provides a method to write chunks into the
channel as we transition away from needing to do that.
* added functionality to scan docker images with digests instead of tags
* cleaned import statement
* added unit test for baseAndTag parsing + remote digest scan
* Add common chunker.
* add comment.
* use better config name.
* Add common chunk reader to s3.
* Add common chunk reader to git, gcs, circleci.
* revert gcs.
* revert gcs.
* fix chunker.
* revert gcs.
* update cancellablewrite.
* revert impl.
* update to remove totalsize.
* Fix my goof.
* Use unified struct in chunkreader.
* return err instead of logging and returning.
* rename error to err.
* only send single ChunkResult even if there is an error and chunkBytes.
* fix logic.
* Add SourceManager to Engine struct
* Update Engine methods to use the SourceManager
* Fix GCS test
The original was testing that `Init()` errors weren't surfaced in
`Finish()`, but the `SourceManager` changed that behavior.
* JobProgress race fixes
* Add contextual values
* Remove unused code
* Add debug logs
* Rename WithConcurrency to WithConcurrentSources
* Always forward chunks to the output chunks channel
* feat: initial support for bare repositories
* feat: use concatenation instead of formatting and os.Getenv instead of os.Environ
Signed-off-by: Savely Krasovsky <savely@krasovs.ky>
* fix: go-git update with pre-receive hooks fix
Signed-off-by: Savely Krasovsky <savely@krasovs.ky>
* fix: remove info about pre-receive hook from README.md for now
Signed-off-by: Savely Krasovsky <savely@krasovs.ky>
* fix: don't scan staged while using --bare option, fixes to make it work with the latest master
Signed-off-by: Savely Krasovsky <savely@krasovs.ky>
* fix: small refactor according to #1518
Signed-off-by: Savely Krasovsky <savely@krasovs.ky>
---------
Signed-off-by: Savely Krasovsky <savely@krasovs.ky>
* Refactor git source to allow ScanOptions and use source in engine
Refactor the Chunks method of the git Source to call out to two helper
methods: scanRepos and scanDirs which scans s.conn.Repositories and
s.conn.Directories respectively. The only notable change in behavior is
that a credential is no longer necessary if there are no
s.conn.Repositories to scan.
* Preserve ScanGit functionality of not cleaning up temporary files
* Support fatal errors in job reports
* WIP: JobReporter and JobInspector
* WIP: JobReportHook and JobReportRef
* Add ChunkError type and asyncRun helper method
* Rename JobReport to JobProgress
* Return a closed channel from Done when the JobProgress is nil
* Comment catchFirstFatal function
#1454 modified one of the Github enumeration code paths in a way that broke an integration test by causing one client's transport to be used for the construction of a different client, causing authentication failures. This saves the original transport for use, fixing the test.
* Miscellaneous SourceManager updates
* Own the chunks channel instead of accepting it as an input
* Add Chunks and Wait methods
* Fix bug in Enroll so it actually returns the handle
* Add context.Context parameter to the SourceInitFunc type
* Add SourceManager tests for Run and Wait methods
* Rename man variables to mgr
* Implement SourceManager basics
* Rename identifiers and add a default headlessAPI implementation
* Rewrite to use SourceInitFunc
* Update variable name to accurately reflect its value
* issue comment scanning
* save progress
* test
* test for pr comment and issue comment
* add pagination support
* linter stuff
* make linter happy
* remove debug log
* readd logging
* github issue resolved
* var const block and handle rate limit
* remove magic number
* make gitURLParse a public function to use more generally
* fix test bug
* make comment scanning OPT-IN
* Add CancellableWrite helper function
* Create SourceUnitEnumerator interface and EnumerationResult struct
* Implement SourceUnitEnumerator for the filesystem Source
* Omit explicit zero values
* Exit with non-zero exit code on chunk source error
* Exit with a non-zero exit code whenever we hit an error getting
chunks. Previously the error would be logged but trufflehog would exit
with a 0 (success) status code.
* fix gcs test
---------
Co-authored-by: Dustin Decker <dustin@trufflesec.com>
Co-authored-by: ahrav <ahravdutta02@gmail.com>
* Implement CommonSourceUnitUnmarshaller
* Add SourceUnitUnmarshaller to all sources using
All sources, with the exception of git, will use the CommonSourceUnit as
they only contain a single type of unit to scan.
* Fix method comments to adhere to Go's style guide
* Add Validator interface and example
* Close sockets and improve error messages
* Remove duplicate error
* Use var declaration so err slice can be nil
* Fix worktree scan by setting EnableDotGitCommonDir
* Change `PlainOpenOptions` to set `EnableDotGitCommonDir` to true.
In every current usage of this function, it is on an already-cloned
repository, so it should always be valid to have this set. By doing
so, it should fix some issues with worktrees.
* Remove unused go.mod replace directives
* Remove replace directives for libraries that are not in use.
* Add in-memory caching lib, used by the GCS source.
* Use cache for tracking progress for the GCS source.
* fix merge issue.
* fix merge issue.
* fix test.
* Fix static check.
* Add test for NewWithData.
* Use cache for tracking progress for the GCS source.
* fix merge issue.
* fix merge issue.
* fix test.
* update comment.
* update comments.
* Use cache for tracking progress for the GCS source.
* fix merge issue.
* fix merge issue.
* fix test.
* remove unused dep.
* address comments.
* Add exists method.
* Use cache for tracking progress for the GCS source.
* fix merge issue.
* fix merge issue.
* fix test.
* rebase.
* fix test.
* Use cache for tracking progress for the GCS source.
* fix merge issue.
* fix merge issue.
* fix test.
* rebase.
* rebase.
* split encode resume by comma.
* Use a persistable cache.
* fix merge.
* fix merge.
* Add progress as part of the cache given it will be the persistence layer.
* Add test for making sure the cache doesn't persist when the increment value is not met.
* fix tests.
* Resolve#1167 by adding support for the AWS_SESSION_TOKEN environment variable and adding a --session-token cli arg
* fix error message
---------
Co-authored-by: Dustin Decker <dustin@trufflesec.com>
This will concatenate all errors together into a single string. When
possible, it would be better to log the actual errors slice to take
advantage of structured logging.
* Rename directories to paths
* Generate protos
* Add file scanning support to filesystem source
* Add directories back to filesystem proto
* Generate protos
* Combine paths and directories from in source
* Add filesystem filter
* Address comments
* Adding missing flags to Readme
* Use retryableHttpClient by default for GitHub
* Adding repoUrl for scanning time log
* Use WithField instead of WithFields
* Updating README with lasted --help output
* Allow using a glob for include list.
* Update command flag.
* Make comment more clear.
* update comment.
* Allow scanning repo and org at the same time.