Commit graph

161 commits

Author SHA1 Message Date
Dustin Decker
8999eab89d
Add central feature flags (#3264)
* Add central feature flags

* use atomic

* tidy
2024-09-03 15:54:41 -07:00
ahrav
55fe05d0b4
fix dep versions (#3106) 2024-07-26 17:44:23 -07:00
ahrav
f865482025
[feat] - Streamlined File Handling with BufferedReaderSeeker (#3041)
* Streaming file handling.

* cleanup

* update tests

* lint

* defer close on input io.ReadCloser's

* fix seek bug

* fix hanging

* clarify errors

* update

* address comments

* revert

* update

* address

* add check to prevent seek without buffering

* revet

* revert

* update comment to make buffer usage more clear
2024-07-17 13:52:18 -07:00
Richard Gomez
3c20b000e1
fix(git): set GIT_DIR based on ScanOptions.Bare (#3004) 2024-06-24 07:37:45 -07:00
Zachary Rice
d5b9157d2b
clone more refs (#2988) 2024-06-20 09:40:03 -05:00
Richard Gomez
40fa304a3a
feat(git): improve scan logging (#2923) 2024-06-06 05:12:59 -04:00
Richard Gomez
4d2c8c6e11
refactor(github): improve wiki err handling (#2917) 2024-06-05 08:06:01 -04:00
Miccah
c86b423c61
[chore] Always log git repositories being scanned (#2909) 2024-06-03 18:02:34 -07:00
ahrav
896e6e7c66
upgrade github dep (#2858) 2024-05-16 14:35:08 -07:00
ahrav
ead9dd5748
[refactor] - Create separate handler for non-archive data (#2825)
* Remove specialized handler and archive struct and restructure handlers pkg.

* Refactor RPM archive handlers to use a library instead of shelling out

* make rpm handling context aware

* update test

* Refactor AR/deb archive handler to use an existing library instead of shelling out

* Update tests

* Handle non-archive data within the DefaultHandler

* make structs and methods private

* Remove non-archive data handling within sources

* add max size check

* add filename and size to context kvp

* move skip file check and is binary check before opening file

* fix test

* preserve existing funcitonality of not handling non-archive files in HandleFile

* Handle non-archive data within the DefaultHandler

* rebase

* Remove non-archive data handling within sources

* Adjust check for rpm/deb archive type

* add additional deb mime type

* add gzip

* move diskbuffered rereader setup into handler pkg

* remove DiskBuffereReader creation logic within sources

* update comment

* move rewind closer

* reduce log verbosity

* add metrics for file handling

* add metrics for errors

* make defaultBufferSize a const

* add metrics for file handling

* add metrics for errors

* fix tests

* add metrics for max archive depth and skipped files

* update error

* skip symlinks and dirs

* update err

* Address incompatible reader to openArchive

* remove nil check

* fix err assignment

* Allow git cat-file blob to complete before trying to handle the file

* wrap compReader with DiskbufferReader

* Allow git cat-file blob to complete before trying to handle the file

* updates

* use buffer writer

* update

* refactor

* update context pkg

* revert stuff

* update test

* fix test

* remove

* use correct reader

* add metrics for file handling

* add metrics for errors

* fix tests

* rebase

* add metrics for errors

* add metrics for max archive depth and skipped files

* update error

* skip symlinks and dirs

* update err

* fix err assignment

* rebase

* remove

* Update write method in contentWriter interface

* Add bufferReadSeekCloser

* update name

* update comment

* fix lint

* Remove specialized handler and archive struct and restructure handlers pkg.

* Refactor RPM archive handlers to use a library instead of shelling out

* make rpm handling context aware

* update test

* Refactor AR/deb archive handler to use an existing library instead of shelling out

* Update tests

* add max size check

* add filename and size to context kvp

* move skip file check and is binary check before opening file

* fix test

* preserve existing funcitonality of not handling non-archive files in HandleFile

* Handle non-archive data within the DefaultHandler

* rebase

* Remove non-archive data handling within sources

* Handle non-archive data within the DefaultHandler

* add gzip

* move diskbuffered rereader setup into handler pkg

* remove DiskBuffereReader creation logic within sources

* update comment

* move rewind closer

* reduce log verbosity

* make defaultBufferSize a const

* add metrics for file handling

* add metrics for errors

* fix tests

* add metrics for max archive depth and skipped files

* update error

* skip symlinks and dirs

* update err

* Address incompatible reader to openArchive

* remove nil check

* fix err assignment

* wrap compReader with DiskbufferReader

* Allow git cat-file blob to complete before trying to handle the file

* updates

* use buffer writer

* update

* refactor

* update context pkg

* revert stuff

* update test

* remove

* rebase

* go mod tidy

* lint check

* update metric to ms

* update metric

* update comments

* dont use ptr

* update

* fix

* Remove specialized handler and archive struct and restructure handlers pkg.

* Refactor RPM archive handlers to use a library instead of shelling out

* make rpm handling context aware

* update test

* Refactor AR/deb archive handler to use an existing library instead of shelling out

* Update tests

* add max size check

* add filename and size to context kvp

* move skip file check and is binary check before opening file

* fix test

* preserve existing funcitonality of not handling non-archive files in HandleFile

* Adjust check for rpm/deb archive type

* add additional deb mime type

* update comment

* go mod tidy

* update go mod

* Add a buffered file reader

* update comments

* use Buffered File Readder

* return buffer

* update

* fix

* return

* go mod tidy

* merge

* use a shared pool

* use sync.Once

* reorganzie

* remove unused code

* fix double init

* fix stuff

* nil check

* reduce allocations

* updates

* update metrics

* updates

* reset buffer instead of putting it back

* skip binaries

* skip

* concurrently process diffs

* close chan

* concurrently enumerate orgs

* increase workers

* ignore pbix and vsdx files

* add metrics for gitparse's Diffchan

* fix metric

* update metrics

* update

* fix checks

* fix

* inc

* update

* reduce

* Create workers to handle binary files

* modify workers

* updates

* add check

* delete code

* use custom reader

* rename struct

* add nonarchive handler

* fix break

* add comments

* add tests

* refactor

* remove log

* do not scan rpm links

* simplify

* rename var

* rename

* fix benchmark

* add buffer

* buffer

* buffer

* handle panic

* merge main

* merge main

* add recover

* revert stuff

* revert

* revert to using reader

* fixes

* remove

* update

* fixes

* linter

* fix test

* fix comment

* update field name

* fix
2024-05-15 13:40:16 -07:00
ahrav
570cec7565
[refactor] - Refactor Archive Handling Logic (#2703)
* Remove specialized handler and archive struct and restructure handlers pkg.

* Refactor RPM archive handlers to use a library instead of shelling out

* make rpm handling context aware

* update test

* Refactor AR/deb archive handler to use an existing library instead of shelling out

* Update tests

* add max size check

* add filename and size to context kvp

* move skip file check and is binary check before opening file

* fix test

* preserve existing funcitonality of not handling non-archive files in HandleFile

* Adjust check for rpm/deb archive type

* add additional deb mime type

* update comment

* Remove specialized handler and archive struct and restructure handlers pkg.

* Refactor RPM archive handlers to use a library instead of shelling out

* make rpm handling context aware

* update test

* Refactor AR/deb archive handler to use an existing library instead of shelling out

* Update tests

* add max size check

* add filename and size to context kvp

* move skip file check and is binary check before opening file

* fix test

* preserve existing funcitonality of not handling non-archive files in HandleFile

* Adjust check for rpm/deb archive type

* add additional deb mime type

* update comment

* go mod tidy

* update go mod

* go mod tidy

* add comment

* update max depth check to >

* go mod tidy

* rename

* [refactor] - Refactor Archive Handling Logic - Part 4: Non-Archive Data Handling and Cleanup (#2704)

* Handle non-archive data within the DefaultHandler

* make structs and methods private

* Remove non-archive data handling within sources

* Handle non-archive data within the DefaultHandler

* rebase

* Remove non-archive data handling within sources

* add gzip

* move diskbuffered rereader setup into handler pkg

* remove DiskBuffereReader creation logic within sources

* move rewind closer

* reduce log verbosity

* make defaultBufferSize a const

* use correct reader

* address comments

* update test

* [feat] - Add Prometheus Metrics for File Handlers (#2705)

* add metrics for file handling

* add metrics for errors

* add metrics for file handling

* add metrics for errors

* fix tests

* add metrics for max archive depth and skipped files

* update error

* skip symlinks and dirs

* update err

* fix err assignment

* add metrics for file handling

* add metrics for errors

* fix tests

* rebase

* add metrics for errors

* add metrics for max archive depth and skipped files

* update error

* skip symlinks and dirs

* update err

* fix err assignment

* rebase

* remove

* update metric to ms

* update comments

* address comments

* reduce indentations

* add metrics for archive depth

* [bug] - Enhanced Archive Handling to Address Interface Constraints (#2710)

* add metrics for file handling

* add metrics for errors

* add metrics for file handling

* add metrics for errors

* fix tests

* add metrics for max archive depth and skipped files

* update error

* skip symlinks and dirs

* update err

* Address incompatible reader to openArchive

* remove nil check

* fix err assignment

* wrap compReader with DiskbufferReader

* add metrics for file handling

* add metrics for errors

* fix tests

* rebase

* add metrics for errors

* add metrics for max archive depth and skipped files

* update error

* skip symlinks and dirs

* update err

* fix err assignment

* rebase

* remove

* update metric to ms

* update comments

* address comments

* reduce indentations

* replace diskbuffereader with bufferedfilereader

* updtes

* add metric back

* [bug] -  Fix bug and simplify git cat-file command execution and output handling (#2719)

* add metrics for file handling

* add metrics for errors

* add metrics for file handling

* add metrics for errors

* fix tests

* add metrics for max archive depth and skipped files

* update error

* skip symlinks and dirs

* update err

* Address incompatible reader to openArchive

* remove nil check

* fix err assignment

* Allow git cat-file blob to complete before trying to handle the file

* wrap compReader with DiskbufferReader

* Allow git cat-file blob to complete before trying to handle the file

* updates

* revert stuff

* update test

* remove

* add metrics for file handling

* add metrics for errors

* fix tests

* rebase

* add metrics for errors

* add metrics for max archive depth and skipped files

* update error

* skip symlinks and dirs

* update err

* fix err assignment

* rebase

* remove

* update metric to ms

* update comments

* address comments

* reduce indentations

* inline
2024-05-10 11:36:06 -07:00
Richard Gomez
13bd783d2d
test(git): change length of chunks (#2767)
This fixes one missed test in #2754 (comment).

The number of chunks doubled because each commit now has metadata + data.
2024-04-30 08:34:12 -04:00
Richard Gomez
11e5febeee
feat(git): scan commit metadata (#2754)
This is a follow-up to #2713 that fixes the strange test error.

As suspected, the failure was caused by additional diffs not being included in the test's expected data.
2024-04-29 16:58:45 -04:00
Cody Rose
11452e8a57
Revert "feat(git): scan commit metadata (#2713)" (#2747)
This reverts commit 81a9c813a1.
2024-04-25 10:56:48 -04:00
Richard Gomez
81a9c813a1
feat(git): scan commit metadata (#2713)
This fixes #2683. It scans the commit author, committer (which is typically GitHub <noreply@github.com> for GitHub, but can be different), and message.

It also scans Git notes.
2024-04-25 10:13:09 -04:00
ahrav
a8132839f8
[chore] - update go-github dep manually (#2664)
* update go-github dep

* remove commented out line
2024-04-03 19:19:14 -07:00
Cody Rose
b7f08db1ef
Redact secret in git command output (#2539)
When we fail to clone a git repository we log the command output to help with diagnosis. However, this output can include credentials in certain cases (such as certain errors associated with redirects). We don't want to log credentials when this happens.
2024-03-06 11:51:35 -05:00
Miccah
c60443891b
Add Display method to SourceUnit and Kind member to the CommonSourceUnit (#2450)
* Add Display method to SourceUnit and Kind member to the CommonSourceUnit

* Make SourceUnitID return the ID and a kind

These two values together uniquely represent a unit.
2024-02-20 11:24:13 -08:00
ahrav
5290023c2d
use read full (#2474) 2024-02-20 07:21:16 -08:00
ahrav
e8006f1bee
2396 since commit stopped working (#2402)
* Ensure we handle commits with no diffs correctly.

* cleanup

* add nil check

* address comments

* move comment

* revert

* add comment
2024-02-13 07:21:22 -08:00
Richard Gomez
b3ff12d1e9
Fix handling of GitHub ratelimit information (#2041)
This is a follow-up to #1912, which used the headers from the response to determine rate-limiting information, instead of using the values from RateLimitError.Rate. Although that logic seemed solid, I discovered that it did not work in some circumstances. This lead to the "unexpected" path more often than intended, and periodic instances where requests would be made before the ratelimit was refreshed.
2024-02-07 09:11:12 -05:00
ahrav
7b492a690a
[feat] - use diff chan (#2387)
* use diff chan

* address comments

* add comment

* address comments

* use old ordering

* add correct author line

* Add required *Commit arg to newDiff

* address comments
2024-02-06 10:06:10 -08:00
Miccah
01c9ac7b59
Fix binary file hanging bug in git sources (#2388)
Waiting for the sub-command will block until all of `stdout` has been
read. In some cases, we return early due to failed chunking without
reading all of the data, and thus, get stuck waiting for the command to
finish. Closing the pipe will ensure `Wait` does not block on that I/O.
2024-02-05 15:28:49 -08:00
ahrav
135cc3eb69
[fixup] - correctly use the buffered file writer (#2373)
* correctly use the buffered file writer

* use value from source

* reorder fields

* use only the DetectorKey as a map field

* address comments and use factory function

* fix optional params

* remove commented out code
2024-02-05 10:43:55 -08:00
Richard Gomez
8e90c4e669
Scan GitHub wikis #2233 2024-01-31 10:52:24 -05:00
ahrav
9867ce8eb8
Allow for configuring the buffered file writer (#2319)
* Write large diffs to tmp files

* address comments

* Move bufferedfilewriter to own pkg

* update test

* swallow write err

* use buffer pool

* use size vs len

* use interface

* fix test

* update comments

* fix test

* Allow for configuring the buffered file writer

* remove unused

* add missing method

* remove

* remove unused

* move parser and commit struct closer to where they are used

* linter change

* fix snifftest

* address comments

* add more kvp pairs to error

* fix test

* update

* add back missing metadata fields

* address comments

* remove bufferedfile writer

* fix

* address comments

* use unint8

* update interface

* adjust interface

* fix tests

* make linter happy

* fix finalize

* address comments

* update test

* address comments

* lint

* remove guard

* fix test

* fix

* add TODO

* fix tests
2024-01-30 12:51:58 -08:00
ahrav
7c59ff95d5
[feat] - tmp file diffs (#2306)
* Write large diffs to tmp files

* address comments

* Move bufferedfilewriter to own pkg

* update test

* swallow write err

* use buffer pool

* use size vs len

* use interface

* fix test

* update comments

* fix test

* remove unused

* remove

* remove unused

* move parser and commit struct closer to where they are used

* linter change

* add more kvp pairs to error

* fix test

* update

* address comments

* remove bufferedfile writer

* address comments

* adjust interface

* fix finalize

* address comments

* lint

* remove guard

* fix

* add TODO
2024-01-30 12:30:51 -08:00
ahrav
a1dc660f41
[fixup ] - Allow ssh cloning with AWS Code Commit (#2307) 2024-01-16 11:55:17 -08:00
Richard Gomez
241e153dfb
fix(gitparse): handle fromFileLine edge case (#2206) 2024-01-04 14:53:08 -08:00
Bill Rich
78d8dd3abf
Add handlerOpts back (#2258) 2023-12-22 12:11:59 -08:00
Bill Rich
ceff786db4
Skip all binaries (#2256)
* Skip all binaries

* Remove noop

* Drop handlerOpts
2023-12-22 12:01:07 -08:00
Dustin Decker
7d93adc1d0
Add skip archive support (#2257) 2023-12-22 11:55:23 -08:00
ahrav
39f0310f1f
[fixup] - Refactor to Pass Reader for Binary Diffs and Archived Data; Optimize /tmp Directory Cleanup (#2253) 2023-12-22 07:41:54 -08:00
ahrav
5c6ce693c1
[feat] - Make skipping binaries configurable (#2226)
* Make skipping binaries configurable

* remove ioutil

* fix

* address comments

* address comments

* use multi-reader

* remove print

* use const

* fix test

* fix my stupidness
2023-12-15 11:46:27 -08:00
Mike Vanbuskirk
53f060a08e
Add disk buffer tempfile cleanup (#2130)
* add tempfile creation

- break PID retrieval into sep. function

* add tmpfile cleanup func

* add file cleanup to main cleanup func

* refactor file logic to only return name string

* add temp buffer naming to gcs

* add temp buffer naming to s3

* add temp buffer naming to filesystem

* add temp buffer naming to git

* consolidate cleanup functions

- have single function handle both files and dirs
- remove interface(not needed with a single func implementation)
- change calls to `New(...)` to reflect config implementation
- simplify automation in main.go
- update disk-buffer-reader dependency

* integrate changes from pr #2133

* merge main

* checkout from main to revert conflict issues

* re-add buffer logic to git

* interface no longer needed

* move string format to global const

---------

Co-authored-by: Ahrav Dutta <ahrav.dutta@trufflesec.com>
2023-12-11 18:31:50 -05:00
Richard Gomez
d1a2d9e832
chore: propagate log context to handlers (#2191) 2023-12-10 10:30:11 -08:00
ahrav
4b31b39d6b
[chore] - Refactor common code into a separate function (#2179)
* Refactor common code into a separate function

* rename vars

* make sure to set the scanOptions fields

* address comments
2023-12-08 08:44:35 -08:00
ahrav
b75991850a
[chore] - Compile regex once (#2176)
* move regex compilation out of the fxn

* missed a spot

* merge main
2023-12-07 07:26:27 -08:00
ahrav
e6bc7f4451
remove unnecessary Git cmd check (#2175) 2023-12-06 13:38:34 -08:00
ahrav
cb81f7d11a
[feat] - Remove go-git dependency (#2174)
* remove use of go-git for binary files

* fix it

* use limit reader

* fix comment

* fix test

* address comments

* address comments

* address comments
2023-12-06 13:38:01 -08:00
ahrav
13da76d357
skip files we can't scan (#2170) 2023-12-04 13:37:11 -08:00
ahrav
279f915799
[chore] - fix error comparisons (#2142)
* fix error comparisons

* fix imports
2023-12-01 08:32:41 -08:00
ahrav
d334b3075e
move all Git setup into Init method (#2105)
* add proto fields for git

* add uri to proto

* move all git setup into Init method

* fix logic for when to use repoPath
2023-11-16 13:59:53 -08:00
Miccah
9d6bc8c504
Refactor git source to support scanning units (#2083) 2023-11-01 09:52:58 -07:00
Miccah
52600a897a
[chore] Replace chunks channel with ChunkReporter in git based sources (#2082)
ChunkReporter is more flexible and will allow code reuse for unit
chunking. ChanReporter was added as a way to maintain the original
channel functionality, so this PR should not alter existing behavior.
2023-11-01 09:22:44 -07:00
Mike Vanbuskirk
4636dc08f6
Add temp directory management (#1878)
* adds func to get scannerPIDs

* add cleanup and call to get pids

* move pid handling to git module

* remove PID logic from main

* refactor testing code to handle different exec name

* cleanup linting errors

* add better logging, fix dir if clause

* some PR fixups

* mod fixup

* add interfaces for helper funcs

* refactor cleanup into main, getPID into git

* lint and test fixups, remove fail on n<2 pids

* simplify pid sorting

* use filepath.Join

* use Args[0] for exec name, fix logger

* formatting fixup

* move functionality into cleantemp pkg

* go mod fixup

* remove redundant testing comment

* fix go.sum issues

* add 15m ticker loop for cleanup

* enclose ticker in function for goroutine defer

fix cleantemp interface

* make time more readable

* add check for non-local Trufflehog PIDs

* allow deletion even if no non-local pids found

* bundle intial cleanup into runCleanup func

* add explicit regex check for tempdir format
2023-10-26 12:28:56 -04:00
Bill Rich
c5efa870ff
Use latest dbr (#1955) 2023-10-24 07:52:49 -07:00
ahrav
c4bc8fc7fa
[bug] - correctly check err (#1824)
* correctly check err.

* address comments.

* update.

* add comment.

* update comment.
2023-09-27 15:52:07 -07:00
Miccah
dbcb888063
Update Source interface to use SourceID and JobID types (#1774)
The previous implementation used int64 for both, which can be mixed up
easily. Using distinct types adds a layer of type safety checked by the
compiler.
2023-09-14 11:28:24 -07:00
Cody Rose
1155ee2736
Implement Gitlab source validation (#1765)
This PR implements validation of Gitlab source configuration.

I was hoping to be able to unify more of the implementation of Validate and Chunks, but there was more divergence than I expected. Specifically, Chunks handles a fair number of Gitlab errors that aren't configuration errors (e.g. "Gitlab returned a repo with an unparseable URL"). Accommodating these in the Validate code path felt wrong, and I wasn't able to create a common code path that could accommodate both Validate and Chunks without looking awful.
2023-09-13 11:51:12 -04:00