Commit graph

308 commits

Author SHA1 Message Date
Miccah
78219a27b3
Call Finish in SourceManager after the semaphore is released (#2121) 2023-11-24 13:22:08 -08:00
Richard Gomez
024aa056b9
chore(github): add a newline between titles and bodies (#2124) 2023-11-23 16:14:28 -08:00
Richard Gomez
1f502fd42c
feat(github): scan issue & pr titles (#1899) 2023-11-22 19:15:27 -08:00
Dustin Decker
75e869faff
Fix forks and repos counter, add metric for orgs enumerated (#2118) 2023-11-21 08:52:33 -08:00
Miccah
39a603d2dc
[chore] Add JSON tags to job metrics (#2114) 2023-11-16 17:08:33 -08:00
ahrav
d334b3075e
move all Git setup into Init method (#2105)
* add proto fields for git

* add uri to proto

* move all git setup into Init method

* fix logic for when to use repoPath
2023-11-16 13:59:53 -08:00
Miccah
9d6bc8c504
Refactor git source to support scanning units (#2083) 2023-11-01 09:52:58 -07:00
Miccah
52600a897a
[chore] Replace chunks channel with ChunkReporter in git based sources (#2082)
ChunkReporter is more flexible and will allow code reuse for unit
chunking. ChanReporter was added as a way to maintain the original
channel functionality, so this PR should not alter existing behavior.
2023-11-01 09:22:44 -07:00
ahrav
95e0090bc2
[chore] - correctly handle input shorter than 512 bytes (#2077)
* correctly handle input shorter than 512 bytes

* add tests

* reorder tests

* add another test case

* update test

* address comment
2023-10-31 16:42:42 -07:00
Miccah
57203a56cd
[chore] Fix SourceManager flaky test (#2059)
* [chore] Fix SourceManager flaky test

Sorting by EndTime is not deterministic, however sorting by StartTime
should be. StartTime is set in a goroutine that's limited by
WithConcurrentUnits, so it should happen in order that the units are
received.

* Sort by unit ID
2023-10-30 19:16:55 -07:00
Dustin Decker
05fae156e1
Add TravisCI source (#1877)
* Add TravisCI source

* update test to use sourcestest

* Remove jobPage loop

ListByBuild does not support pagination, so this was infinitely
repeating. https://developer.travis-ci.com/resource/jobs#find

* Continue chunking on error

* review updates

* update readme

---------

Co-authored-by: Miccah Castorina <m.castorina93@gmail.com>
2023-10-30 07:28:25 -07:00
Mike Vanbuskirk
4636dc08f6
Add temp directory management (#1878)
* adds func to get scannerPIDs

* add cleanup and call to get pids

* move pid handling to git module

* remove PID logic from main

* refactor testing code to handle different exec name

* cleanup linting errors

* add better logging, fix dir if clause

* some PR fixups

* mod fixup

* add interfaces for helper funcs

* refactor cleanup into main, getPID into git

* lint and test fixups, remove fail on n<2 pids

* simplify pid sorting

* use filepath.Join

* use Args[0] for exec name, fix logger

* formatting fixup

* move functionality into cleantemp pkg

* go mod fixup

* remove redundant testing comment

* fix go.sum issues

* add 15m ticker loop for cleanup

* enclose ticker in function for goroutine defer

fix cleantemp interface

* make time more readable

* add check for non-local Trufflehog PIDs

* allow deletion even if no non-local pids found

* bundle intial cleanup into runCleanup func

* add explicit regex check for tempdir format
2023-10-26 12:28:56 -04:00
Bill Rich
c5efa870ff
Use latest dbr (#1955) 2023-10-24 07:52:49 -07:00
Miccah
0b16142d4f
Add UnitHook and NoopHook implementations (#1930)
* Add UnitHook and NoopHook implementations

The UnitHook tracks metrics per unit of a job, and emits them on a
channel once finished. It should work even if the Source does not
support source units.

* Refactor channel to use an LRU cache instead

An LRU cache has a more favorable failure mode than the channel. With
the channel, if the consumer stopped consuming metrics, scanning would
block. With the LRU cache, metrics will be dropped when space runs out
and a log message emitted.
2023-10-23 14:27:01 -07:00
Miccah
b8724e87e6
Use the configured include repositories in the GitHub filter (#1926) 2023-10-20 19:03:28 -07:00
Richard Gomez
3acc65b2fb
chore(github): reduce comment log verbosity (#1922) 2023-10-20 16:16:38 -07:00
Cody Rose
7ac7fa8728
Move Github comments check to fix a test #1927 2023-10-19 19:23:55 -04:00
Richard Gomez
4b821e9732
Handle secondary GitHub ratelimits (#1912)
* fix(github): reduce visibility-related api calls

* fix(github): handle secondary ratelimits
2023-10-19 14:54:45 -04:00
Miccah
758344711a
Export ChunkError fields and add ErrorsFor convenience method (#1920) 2023-10-19 08:46:49 -07:00
Richard Gomez
6ea3a7da4a
fix(github): normalize repo cache (#1897) 2023-10-17 15:07:47 -07:00
Miccah
03dc7cb68d
[chore] Add SourceUnitEnumChunker filesystem tests (#1873)
* [chore] Add SourceUnitEnumChunker filesystem tests

* Ensure reported units are exactly what is expected
2023-10-16 10:42:18 -07:00
Miccah
f09bce3f75
[chore] Fix flaky TestJobProgressElapsedTime (#1872) 2023-10-06 17:05:05 -07:00
ahrav
3d2490ca80
use Repositories field from conn. (#1860) 2023-10-04 13:56:02 -07:00
Miccah
0d451aa806
Fix bug in chunker that surfaces with a flaky passed in io.Reader (#1838)
* Fix bug in chunker that surfaces with a flaky passed in io.Reader

The chunker was previously expecting the passed in io.Reader to always
successfully read a full buffer of data, however it's valid for a Reader
to return less data than requested. When this happens, the chunker would
peek the same data that it then reads in the next iteration of the loop,
causing the same data to be scanned twice.

Co-authored-by: ahrav <ahravdutta02@gmail.com>

* Fix EOF error check

* Use io.ReadFull in Chunker

---------

Co-authored-by: ahrav <ahravdutta02@gmail.com>
2023-10-02 09:38:23 -07:00
ahrav
c4bc8fc7fa
[bug] - correctly check err (#1824)
* correctly check err.

* address comments.

* update.

* add comment.

* update comment.
2023-09-27 15:52:07 -07:00
Cody Rose
e9efed85c2
Use S3 credentials waterfall (#1823)
This PR updates the S3 source to use explicitly configured credentials if they're available and follow the normal AWS credentials waterfall if they're not. This is irrespective of whether role assumption is configured. This changes the previous behavior, which was to use waterfall credentials only if role assumption was configured and explicitly configured credentials only when it was not.
2023-09-27 16:57:47 -04:00
joeleonjr
699547b7d3
consolidated pr and issue descr/comment flags (#1827) 2023-09-27 15:54:02 -04:00
ahrav
bf47fd69bb
Github partial scan (#1804)
* Add ability for targetted partial scans of Github.

* update comment.

* add more tests.

* add additiional test.

* address comments.
2023-09-26 12:38:33 -07:00
joeleonjr
1e42dae734
added PR and Issue body scanning (#1816)
* added PR and Issue body scanning; adjusted CLI args to fit

* removed print statement from debugging

* removed exclude-commits; adjusted CLI flags

* minor changes to match main branch

* fixing logic

* updating README for --issues and --prs
2023-09-26 12:25:48 -04:00
āh̳̕mͭͭͨͩ̐e̘ͬ́͋ͬ̊̓͂d
62b2195502
Adding new function SetProgressOngoing to be used when the source does not yet know how many items it is scanning and does not want to display a percentage complete. (#1802)
Co-Authored-By: @mcastorina
2023-09-21 13:26:10 -04:00
Miccah
efa404942a
Add ability to dynamically scale concurrently running sources (#1790)
* Add ability to dynamically scale concurrently running sources

Refactor SourceManager to use a counting semaphore to allow for
dymanically changing limits. This complicated `Wait() error` which needs
to return the first error encountered. We previously got that for free
using `errgroup.Group`, however now we need to handle that ourselves.
`Wait()` needs to return an error for use in the engine to set the
correct exit code.

* Group third party imports together
2023-09-20 16:49:56 -07:00
ahrav
22876f8381
replace interface{} with any. (#1771) 2023-09-15 04:35:15 -07:00
Miccah
dbcb888063
Update Source interface to use SourceID and JobID types (#1774)
The previous implementation used int64 for both, which can be mixed up
easily. Using distinct types adds a layer of type safety checked by the
compiler.
2023-09-14 11:28:24 -07:00
Cody Rose
1155ee2736
Implement Gitlab source validation (#1765)
This PR implements validation of Gitlab source configuration.

I was hoping to be able to unify more of the implementation of Validate and Chunks, but there was more divergence than I expected. Specifically, Chunks handles a fair number of Gitlab errors that aren't configuration errors (e.g. "Gitlab returned a repo with an unparseable URL"). Accommodating these in the Validate code path felt wrong, and I wasn't able to create a common code path that could accommodate both Validate and Chunks without looking awful.
2023-09-13 11:51:12 -04:00
Miccah
72b6a9ec6b
Add a SourceType constant to all source packages (#1768) 2023-09-12 17:23:25 -07:00
Miccah
be4d0bcb41
Refactor SourceManager to remove Enrollment (#1740)
* Refactor SourceManager to remove Enrollment

Initializing the Source will be the responsibility of the caller. The
SourceManager exposes a GetIDs method for getting a source and job ID.

* Update tests

* Update engine usage

* Update apiClient interface to have one GetIDs method

* Update SourceManager usage in engine
2023-09-12 16:58:38 -07:00
Mike Vanbuskirk
de540652cb
verbosity updates to s3 source (#1750) 2023-09-11 14:53:43 -05:00
ahrav
2a9f34962d
Add optional param to Chunks (#1747)
* Add interface for targeted chunking.

* use optional args.

* update Chunks method signature.

* update tests.

* fix test.

* update QueryCriteria type.
2023-09-07 09:03:37 -07:00
ahrav
abb131e502
[chore] - update Docker source (#1708)
* Add concurrency and common chunker.

* lint.

* address comments.
2023-09-05 07:40:38 -07:00
Cody Rose
afe708519b
Validate S3 source (#1715)
This PR adds S3 source validation. This is accomplished by factoring out common "bucket visiting" logic to be used by both scanning and validation.
2023-09-05 10:18:58 -04:00
Cody Rose
a2c0abbfd6
Unify S3 client creation logic (#1657)
This PR unifies some code paths within the S3 source. This is being done to better support a future implementation of S3 source validation; less code that runs means less code to validate. The logical change is to move the handling of "role-less" operation down the call tree, which allows for a single code path for more of the S3 code.

This PR also fixes a bug that would occur in the (rare) case that the source couldn't create a regional S3 client. Before, an error would be logged, but it would be followed by a panic. Now the bucket in question is skipped.
2023-08-30 17:49:37 -04:00
Miccah
522b2fab29
Add a cancel cause to job cancellation (#1728) 2023-08-30 12:00:44 -07:00
Miccah
7ba880f47a
Add AvailableCapacity method to SourceManager (#1665) 2023-08-29 12:36:44 -07:00
ahrav
2b1b1b5ad0
Add jobID to chunk. (#1721) 2023-08-29 12:02:30 -07:00
ahrav
c51e8f8af5
buffer channel. (#1718) 2023-08-28 18:08:31 -07:00
ahrav
0932ea224b
[chore] - Prevent nil deref panic (#1709) 2023-08-26 20:39:50 -07:00
Miccah
5eb776cd61
Support cancelling a run from a JobProgressRef (#1663) 2023-08-25 10:43:33 -07:00
Cody Rose
33eed42e17
Test S3 role assumption (#1655)
This PR adds a test of the S3 role assumption functionality. It currently only tests role assumption within a single account.
2023-08-25 11:30:08 -04:00
Miccah
61977412df
Add SourceName to JobProgressRef (#1664) 2023-08-25 07:48:25 -07:00
ahrav
4f4a79f62b
Support azure git links (#1662)
* Support azure git links.

* update comment.

* update test names.
2023-08-24 14:36:52 -07:00