Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clamav sidecar proof of concept #81

Closed
wants to merge 5 commits into from

Conversation

rahearn
Copy link

@rahearn rahearn commented Dec 4, 2020

Description of change

This installs clamav in our app container, and ensures that virus file definitions are kept up to date. It does not include running the clamd daemon in the background, so scans only happen by manually calling clamscan with a file or directory.

Pros to this approach:

  • not much for us to maintain going forward, does not use a custom buildpack
  • will satisfy controls around automated updating of virus definition files, no matter how often we deploy or restart the app
  • could potentially be expanded to cover scanning app files, if we get pushback that we need to do that as well

Downsides:

  • need to figure out how we're going to call clamscan on uploaded files
  • uses a ton of memory compared to the app by itself. 1.5G of RAM was the bare minimum to not crash out of scanning the src directory. 1G was the minimum to just get the freshclam process to complete

Other notes:
This requires version 7 of the cf-cli tool. That's easily installed on macs (brew install cf-cli@7), but I didn't look into other operating systems.

How to test

This app is deployed as tta-smarthub-ryan to the tta-transient-ryan space. Run

  • cf target -s tta-transient-ryan
  • cf ssh tta-smarthub-ryan
  • /tmp/lifecycle/shell
  • clamscan -d $CLAMAV_DATA_DIR src will run a scan of the src directory
vcap@15bdae76-905e-49d1-782a-142e:~$ clamscan -d $CLAMAV_DATA_DIR src/

----------- SCAN SUMMARY -----------
Known viruses: 8946438
Engine version: 0.102.4
Scanned directories: 1
Scanned files: 0
Infected files: 0
Data scanned: 0.00 MB
Data read: 0.00 MB (ratio 0.00:1)
Time: 26.048 sec (0 m 26 s)

Issue(s)

Checklist
Not ready to merge, so deleting the checklist.

@rahearn rahearn marked this pull request as draft December 4, 2020 21:00
@rahearn
Copy link
Author

rahearn commented Dec 7, 2020

Got some feedback from other 18f engineers on this as well. They suggested the possibility of setting up another cloud.gov app that exposes a clamav API.

https://blog.theodo.com/2017/11/implement-antivirus-api-10-min/ describes the general approach, and there are reports of similar setups being successfully implemented in cloud.gov

Pros to this approach:

  • memory overhead limited to a single clamav app instance, rather than having to absorb it on every instance
  • calling out to scan a file is a REST call rather then shelling out to the system, thereby avoiding a potential vulnerability point of access

Cons to this approach:

  • Will not allow for scanning the app container (probably not something we need to do anyway)

@kryswisnaskas
Copy link
Collaborator

kryswisnaskas commented Dec 8, 2020

Nice. And it looks like clamav is one of the top free solutions for virus scanning.

I haven't tested it yet, but had a couple of questions:

----------- SCAN SUMMARY -----------
Known viruses: 8946438
Engine version: 0.102.4
Scanned directories: 1
Scanned files: 0
Infected files: 0
Data scanned: 0.00 MB
Data read: 0.00 MB (ratio 0.00:1)
Time: 26.048 sec (0 m 26 s)

How should we read this scan summary? Did it actually scan any data? (0 MB, 0 files).

Do we have any idea how big the uploaded files will be?
I am a bit concerned about the time it took (that could be due to not understanding what it actually scanned).

I took a look at the architecture behind setting up another cloud.gov app exposing the clamav API. It definitely is an interesting approach. I think the advantage in that scenario would be that we could call the service via a REST api from our backend. While there are ways to call a shell command from nodejs, using a REST api seems like a better interface.
We would need to figure out how long it would take to scan one / several files though, both by calling a shell command vs. posting to the REST api.

The downside in addition to the "cons" listed above, could be introducing more complex setup, potentially bringing in more points of failure.

@kryswisnaskas
Copy link
Collaborator

kryswisnaskas commented Dec 8, 2020

memory overhead limited to a single clamav app instance, rather than having to absorb it on every instance

This would be an important advantage

@rahearn
Copy link
Author

rahearn commented Dec 8, 2020

Did it actually scan any data? (0 MB, 0 files)

Oops, no, needed a -r flag.

vcap@a1165ea4-15df-418f-4ff4-31f9:~$ clamscan -d $CLAMAV_DATA_DIR -r build

----------- SCAN SUMMARY -----------
Known viruses: 8947852
Engine version: 0.102.4
Scanned directories: 15
Scanned files: 151
Infected files: 0
Data scanned: 10.37 MB
Data read: 4.63 MB (ratio 2.24:1)
Time: 22.072 sec (0 m 22 s)

I am a bit concerned about the time it took

Given the difference between this scan and the original one, looks like the vast majority of the time is startup overhead. That would be mitigated by running clamd in the background and clamdscan instead of clamscan.

I'm definitely leaning towards the separate app/REST api for our use rather than running clamd as a sidecar.

@rahearn
Copy link
Author

rahearn commented Dec 10, 2020

Decision made: we will run a separate ClamAV app w/ REST API. Work tracked here: HHS#203

@rahearn rahearn closed this Dec 10, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants