The source manager attaches some context keys, but in certain circumstances, they're already present, resulting in duplicate keys. This PR changes the attachment to be conditional. It also adds some new log messages to track source startup progress.
* add tempfile creation
- break PID retrieval into sep. function
* add tmpfile cleanup func
* add file cleanup to main cleanup func
* refactor file logic to only return name string
* add temp buffer naming to gcs
* add temp buffer naming to s3
* add temp buffer naming to filesystem
* add temp buffer naming to git
* consolidate cleanup functions
- have single function handle both files and dirs
- remove interface (not needed with a single func implementation)
- change calls to `New(...)` to reflect config implementation
- simplify automation in main.go
- update disk-buffer-reader dependency
* integrate changes from pr #2133
* merge main
* checkout from main to revert conflict issues
* re-add buffer logic to git
* interface no longer needed
* move string format to global const
---------
Co-authored-by: Ahrav Dutta <ahrav.dutta@trufflesec.com>
ChunkReporter is more flexible and will allow code reuse for unit
chunking. ChanReporter was added as a way to maintain the original
channel functionality, so this PR should not alter existing behavior.
* [chore] Fix SourceManager flaky test
Sorting by EndTime is not deterministic, however sorting by StartTime
should be. StartTime is set in a goroutine that's limited by
WithConcurrentUnits, so it should happen in the order that the units are
received.
* Sort by unit ID
* Add TravisCI source
* update test to use sourcestest
* Remove jobPage loop
ListByBuild does not support pagination, so this was infinitely
repeating. https://developer.travis-ci.com/resource/jobs#find
* Continue chunking on error
* review updates
* update readme
---------
Co-authored-by: Miccah Castorina <m.castorina93@gmail.com>
* adds func to get scannerPIDs
* add cleanup and call to get pids
* move pid handling to git module
* remove PID logic from main
* refactor testing code to handle different exec name
* cleanup linting errors
* add better logging, fix dir if clause
* some PR fixups
* mod fixup
* add interfaces for helper funcs
* refactor cleanup into main, getPID into git
* lint and test fixups, remove fail on n<2 pids
* simplify pid sorting
* use filepath.Join
* use Args[0] for exec name, fix logger
* formatting fixup
* move functionality into cleantemp pkg
* go mod fixup
* remove redundant testing comment
* fix go.sum issues
* add 15m ticker loop for cleanup
* enclose ticker in function for goroutine defer
fix cleantemp interface
* make time more readable
* add check for non-local Trufflehog PIDs
* allow deletion even if no non-local pids found
* bundle initial cleanup into runCleanup func
* add explicit regex check for tempdir format
* Add UnitHook and NoopHook implementations
The UnitHook tracks metrics per unit of a job, and emits them on a
channel once finished. It should work even if the Source does not
support source units.
* Refactor channel to use an LRU cache instead
An LRU cache has a more favorable failure mode than the channel. With
the channel, if the consumer stopped consuming metrics, scanning would
block. With the LRU cache, metrics will be dropped when space runs out
and a log message emitted.
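The failure-mode difference can be sketched with a small bounded cache. This is a minimal illustration using only the standard library, not the hook's actual implementation: `boundedCache`, `entry`, and the `onEvict` callback are hypothetical names, and the sketch is not safe for concurrent use. The point is that `Add` never blocks; when the cache is full, the oldest entry is dropped and the callback (standing in for the log message) fires.

```go
package main

import (
	"container/list"
	"fmt"
)

// entry is one cached metric, keyed by a hypothetical unit name.
type entry struct {
	key   string
	value int
}

// boundedCache keeps at most capacity entries and evicts the least recently
// added or updated one when full. Add never blocks the caller.
type boundedCache struct {
	capacity int
	order    *list.List               // front = most recent
	items    map[string]*list.Element // key -> element in order
	onEvict  func(key string)         // called for each dropped entry
}

func newBoundedCache(capacity int, onEvict func(key string)) *boundedCache {
	return &boundedCache{
		capacity: capacity,
		order:    list.New(),
		items:    make(map[string]*list.Element),
		onEvict:  onEvict,
	}
}

// Add inserts or updates a key, evicting the oldest entry if over capacity.
func (c *boundedCache) Add(key string, value int) {
	if el, ok := c.items[key]; ok {
		el.Value.(*entry).value = value
		c.order.MoveToFront(el)
		return
	}
	c.items[key] = c.order.PushFront(&entry{key: key, value: value})
	if c.order.Len() > c.capacity {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		dropped := oldest.Value.(*entry).key
		delete(c.items, dropped)
		if c.onEvict != nil {
			c.onEvict(dropped)
		}
	}
}

func main() {
	cache := newBoundedCache(2, func(key string) {
		fmt.Println("dropped metrics for", key)
	})
	for i, unit := range []string{"unit-1", "unit-2", "unit-3"} {
		cache.Add(unit, i)
	}
	fmt.Println(len(cache.items))
}
```

With a channel in place of the cache, the third `Add` would instead block until a consumer read from it.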
* Fix bug in chunker that surfaces with a flaky passed-in io.Reader
The chunker was previously expecting the passed-in io.Reader to always
successfully read a full buffer of data, however it's valid for a Reader
to return less data than requested. When this happened, the chunker would
peek the same data that it then read in the next iteration of the loop,
causing the same data to be scanned twice.
Co-authored-by: ahrav <ahravdutta02@gmail.com>
* Fix EOF error check
* Use io.ReadFull in Chunker
---------
Co-authored-by: ahrav <ahravdutta02@gmail.com>
This PR updates the S3 source to use explicitly configured credentials if they're available and follow the normal AWS credentials waterfall if they're not. This is irrespective of whether role assumption is configured. This changes the previous behavior, which was to use waterfall credentials only if role assumption was configured and explicitly configured credentials only when it was not.
* added PR and Issue body scanning; adjusted CLI args to fit
* removed print statement from debugging
* removed exclude-commits; adjusted CLI flags
* minor changes to match main branch
* fixing logic
* updating README for --issues and --prs
* Add ability to dynamically scale concurrently running sources
Refactor SourceManager to use a counting semaphore to allow for
dynamically changing limits. This complicates `Wait() error`, which needs
to return the first error encountered. We previously got that for free
using `errgroup.Group`, however now we need to handle that ourselves.
`Wait()` needs to return an error for use in the engine to set the
correct exit code.
* Group third party imports together
The previous implementation used int64 for both, which can be mixed up
easily. Using distinct types adds a layer of type safety checked by the
compiler.
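As a rough illustration of the type-safety argument, assuming the two values are a source ID and a job ID: `SourceID`, `JobID`, and `initSource` below are hypothetical names chosen for the sketch and may not match the project's identifiers.

```go
package main

import "fmt"

// SourceID and JobID share an underlying type but are distinct to the
// compiler, so they cannot be swapped without an explicit conversion.
type SourceID int64
type JobID int64

// initSource stands in for any function that takes both identifiers.
func initSource(jobID JobID, sourceID SourceID) string {
	return fmt.Sprintf("job=%d source=%d", jobID, sourceID)
}

func main() {
	var (
		src SourceID = 7
		job JobID    = 42
	)
	fmt.Println(initSource(job, src))
	// initSource(src, job) would not compile: the argument types do not
	// match the parameter types, so the mix-up is caught at build time.
}
```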
This PR implements validation of Gitlab source configuration.
I was hoping to be able to unify more of the implementation of Validate and Chunks, but there was more divergence than I expected. Specifically, Chunks handles a fair number of Gitlab errors that aren't configuration errors (e.g. "Gitlab returned a repo with an unparseable URL"). Accommodating these in the Validate code path felt wrong, and I wasn't able to create a common code path that could accommodate both Validate and Chunks without looking awful.
* Refactor SourceManager to remove Enrollment
Initializing the Source will be the responsibility of the caller. The
SourceManager exposes a GetIDs method for getting a source and job ID.
* Update tests
* Update engine usage
* Update apiClient interface to have one GetIDs method
* Update SourceManager usage in engine
This PR unifies some code paths within the S3 source. This is being done to better support a future implementation of S3 source validation; less code that runs means less code to validate. The logical change is to move the handling of "role-less" operation down the call tree, which allows for a single code path for more of the S3 code.
This PR also fixes a bug that would occur in the (rare) case that the source couldn't create a regional S3 client. Before, an error would be logged, but it would be followed by a panic. Now the bucket in question is skipped.
The source manager initialization function was defined as `sourceID`
followed by `jobID`, while the source initialization function is the
reverse. This is confusing and easy to mix up since the parameters are
the same type.
This commit adds a test to make sure the source manager initializes in
the correct order, but it doesn't prevent library users from making the
same mistake. We may want to consider using different types.
* add exportable validate function for github
* update validator
* use the context
* gate to prevent panic
* wrap error with context
* wrap error with context for basic auth and unauth
* add role assumption for s3 source
* refactor role assumption to repeatable string
user can pass array of roles to assume
* refactor s3 chunks to handle passed roleARNs
* add role-session name
use timestamp to make dynamic
* add docstring for rolearn strings()
* make sure role ARNs are passed into source
* refactor role assumption functionality
break s3 bucket scanning into sep. function
* add log check on assume role
* fix role iteration
- Make sure s3 struct is populated with roles
- add separate new client instantiation for role-based access
- iterates through each role
* add comment
* protobuf revert for merge
* re-run make proto
* lint cleanup
* cleanup TODOs
* drop redundant switch case in assumerole client
* use less verbose 'ctx' designator
* breakout functionality from Chunks
- separate functions for:
  - enumerating buckets to scan
  - scanning objects within the buckets
* remake protobuf defs
* allow scan to continue on single bucket err
* add readme docs
* minor fixups
With the introduction of the SourceManager, the chunks channel became
private and read-only. This provides a method to write chunks into the
channel as we transition away from needing to do that.
* added functionality to scan docker images with digests instead of tags
* cleaned import statement
* added unit test for baseAndTag parsing + remote digest scan
* Add common chunker.
* add comment.
* use better config name.
* Add common chunk reader to s3.
* Add common chunk reader to git, gcs, circleci.
* revert gcs.
* revert gcs.
* fix chunker.
* revert gcs.
* update cancellablewrite.
* revert impl.
* update to remove totalsize.
* Fix my goof.
* Use unified struct in chunkreader.
* return err instead of logging and returning.
* rename error to err.
* only send single ChunkResult even if there is an error and chunkBytes.
* fix logic.
* Add SourceManager to Engine struct
* Update Engine methods to use the SourceManager
* Fix GCS test
The original was testing that `Init()` errors weren't surfaced in
`Finish()`, but the `SourceManager` changed that behavior.
* JobProgress race fixes
* Add contextual values
* Remove unused code
* Add debug logs
* Rename WithConcurrency to WithConcurrentSources
* Always forward chunks to the output chunks channel
* feat: initial support for bare repositories
* feat: use concatenation instead of formatting and os.Getenv instead of os.Environ
Signed-off-by: Savely Krasovsky <savely@krasovs.ky>
* fix: go-git update with pre-receive hooks fix
Signed-off-by: Savely Krasovsky <savely@krasovs.ky>
* fix: remove info about pre-receive hook from README.md for now
Signed-off-by: Savely Krasovsky <savely@krasovs.ky>
* fix: don't scan staged while using --bare option, fixes to make it work with the latest master
Signed-off-by: Savely Krasovsky <savely@krasovs.ky>
* fix: small refactor according to #1518
Signed-off-by: Savely Krasovsky <savely@krasovs.ky>
---------
Signed-off-by: Savely Krasovsky <savely@krasovs.ky>
* Refactor git source to allow ScanOptions and use source in engine
Refactor the Chunks method of the git Source to call out to two helper
methods: scanRepos and scanDirs which scans s.conn.Repositories and
s.conn.Directories respectively. The only notable change in behavior is
that a credential is no longer necessary if there are no
s.conn.Repositories to scan.
* Preserve ScanGit functionality of not cleaning up temporary files
* Support fatal errors in job reports
* WIP: JobReporter and JobInspector
* WIP: JobReportHook and JobReportRef
* Add ChunkError type and asyncRun helper method
* Rename JobReport to JobProgress
* Return a closed channel from Done when the JobProgress is nil
* Comment catchFirstFatal function
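The nil-receiver behavior of `Done` mentioned above is a common Go idiom. A minimal sketch follows, with `progress` as a hypothetical stand-in for the JobProgress type: returning an already-closed channel lets callers wait on a nil value without blocking forever.

```go
package main

import "fmt"

// progress is a hypothetical stand-in for a job-progress type.
type progress struct {
	done chan struct{}
}

// Done returns a channel that is closed when the job finishes. A nil
// *progress is treated as already finished.
func (p *progress) Done() <-chan struct{} {
	if p == nil {
		ch := make(chan struct{})
		close(ch)
		return ch
	}
	return p.done
}

func main() {
	var p *progress // nil
	<-p.Done()      // returns immediately because the channel is closed
	fmt.Println("nil progress reports done")
}
```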
#1454 modified one of the Github enumeration code paths in a way that broke an integration test: one client's transport was used for the construction of a different client, which caused authentication failures. This PR saves the original transport for use, fixing the test.
* Miscellaneous SourceManager updates
* Own the chunks channel instead of accepting it as an input
* Add Chunks and Wait methods
* Fix bug in Enroll so it actually returns the handle
* Add context.Context parameter to the SourceInitFunc type
* Add SourceManager tests for Run and Wait methods
* Rename man variables to mgr
* Implement SourceManager basics
* Rename identifiers and add a default headlessAPI implementation
* Rewrite to use SourceInitFunc
* Update variable name to accurately reflect its value
* issue comment scanning
* save progress
* test
* test for pr comment and issue comment
* add pagination support
* linter stuff
* make linter happy
* remove debug log
* re-add logging
* github issue resolved
* var const block and handle rate limit
* remove magic number
* make gitURLParse a public function to use more generally
* fix test bug
* make comment scanning OPT-IN
* Add CancellableWrite helper function
* Create SourceUnitEnumerator interface and EnumerationResult struct
* Implement SourceUnitEnumerator for the filesystem Source
* Omit explicit zero values
* Exit with non-zero exit code on chunk source error
* Exit with a non-zero exit code whenever we hit an error getting
chunks. Previously the error would be logged but trufflehog would exit
with a 0 (success) status code.
* fix gcs test
---------
Co-authored-by: Dustin Decker <dustin@trufflesec.com>
Co-authored-by: ahrav <ahravdutta02@gmail.com>
* Implement CommonSourceUnitUnmarshaller
* Add SourceUnitUnmarshaller to all sources using
All sources, with the exception of git, will use the CommonSourceUnit as
they only contain a single type of unit to scan.
* Fix method comments to adhere to Go's style guide
* Add Validator interface and example
* Close sockets and improve error messages
* Remove duplicate error
* Use var declaration so err slice can be nil
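The `var` declaration point can be shown in a few lines. `collect` is a hypothetical function for this sketch: a slice declared with `var` stays nil until something is appended, so callers can treat a nil result as "no errors".

```go
package main

import "fmt"

// collect returns nil when there is nothing to report, because errs is
// declared with var and never assigned unless an error is appended.
func collect(fail bool) []error {
	var errs []error
	if fail {
		errs = append(errs, fmt.Errorf("invalid config"))
	}
	return errs
}

func main() {
	fmt.Println(collect(false) == nil)
	fmt.Println(len(collect(true)))
}
```

Declaring the slice as `errs := []error{}` would instead return a non-nil empty slice, which compares unequal to nil.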
* Fix worktree scan by setting EnableDotGitCommonDir
* Change `PlainOpenOptions` to set `EnableDotGitCommonDir` to true.
In every current usage of this function, it is on an already-cloned
repository, so it should always be valid to have this set. By doing
so, it should fix some issues with worktrees.
* Remove unused go.mod replace directives
* Remove replace directives for libraries that are not in use.