
Clair Postgres database storage space issue #2170

Open
Mkori1 opened this issue Nov 12, 2024 · 7 comments

Mkori1 commented Nov 12, 2024

Description of Problem / Feature Request

Postgres storage: 1800 GiB
The Clair Postgres database fills its storage over the span of about a week, and then roughly 1 TB of space is suddenly freed. This happens every week: storage fills up and is then freed all at once.
[Screenshot: 2024-11-12 at 12 49 46 PM]

Garbage collection is running daily.

I need to understand why the storage keeps filling up and why GC is not able to reclaim the space.

Environment:
clair:4.7.2
Combo Mode

config.yaml:

introspection_addr: :8089
http_listen_addr: :8080
log_level: debug
indexer:
  connstring: XXX
  scanlock_retry: 10
  layer_scan_concurrency: 5
  migrations: true
matcher:
  connstring: XXX
  migrations: true
  update_retention: 2
  period:
matchers:
  names:
    - alpine
    - aws
    - debian
    - oracle
    - photon
    - python
    - rhel
    - suse
    - ubuntu
updaters:
  sets:
    - alpine
    - aws
    - debian
    - oracle
    - photon
    - pyupio
    - rhel
    - suse
    - ubuntu
notifier:
  connstring: XXX
  delivery_interval: 1m
  poll_interval: 5m
  migrations: true
auth:
  psk:
    key: XXX
    iss: ["quay"]

tracing and metrics:

trace:
  name: "jaeger"
  probability: 1
  jaeger:
    agent:
      endpoint: "localhost:6831"
    service_name: "clair"
metrics:
  name: "prometheus"

@frostmar (Contributor)

Storage used by Postgres "deleted" rows only becomes available for reuse after a Postgres VACUUM has run.
Maybe check your database's VACUUM / AUTOVACUUM settings.
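
A quick, Clair-agnostic way to see whether autovacuum is keeping up is to look at dead-tuple counts and the last (auto)vacuum times per table, plus the autovacuum settings themselves. A minimal sketch against the Clair database:

-- Tables with the most dead tuples, and when they were last vacuumed
SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;

-- Current autovacuum-related settings
SELECT name, setting
FROM pg_settings
WHERE name LIKE 'autovacuum%';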

Mkori1 (Author) commented Nov 13, 2024

What does this mean in the log?
2024-11-11 11:19:26 UTC::@:[26410]:LOG: automatic vacuum of table "ClairBackend.public.vuln": index scans: 1
pages: 0 removed, 2801169 remain, 13106 skipped due to pins, 545696 skipped frozen
tuples: 96092 removed, 12632497 remain, 831982 are dead but not yet removable, oldest xmin: 934978812
buffer usage: 6392796 hits, 3018733 misses, 1438233 dirtied
avg read rate: 41.457 MB/s, avg write rate: 19.751 MB/s
system usage: CPU: user: 79.75 s, system: 24.19 s, elapsed: 568.88 s
WAL usage: 2086644 records, 1408460 full page images, 1823645535 bytes

Mkori1 (Author) commented Nov 13, 2024

Does automatic vacuum of table "ClairBackend.public.vuln": index scans: 1 relate to GC starting in Clair?

Also, does the statement below mean that the DELETE is not happening?
2024-11-13 15:49:17 UTC:10.116.14.176(14922):root@ClairBackend:[24095]:ERROR: canceling statement due to user request
2024-11-13 15:49:17 UTC:10.116.14.176(14922):root@ClairBackend:[24095]:STATEMENT: DELETE FROM layer WHERE NOT EXISTS (SELECT FROM manifest_layer WHERE manifest_layer.layer_id = layer.id);

@frostmar (Contributor)

The concept of VACUUM / AUTOVACUUM in Postgres is important here (and has nothing to do with Clair). Take a look at the docs, maybe starting at https://www.postgresql.org/docs/current/routine-vacuuming.html
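
Regarding the "831982 are dead but not yet removable, oldest xmin" line in your log: dead rows can only be reclaimed once no open transaction can still see them, so a long-running or idle-in-transaction session holds back the oldest xmin and blocks reclamation even when autovacuum runs. A sketch for spotting such sessions (standard Postgres catalog views, nothing Clair-specific):

-- Backends whose snapshots can hold back the oldest xmin, oldest first
SELECT pid, state, backend_xmin, xact_start,
       now() - xact_start AS xact_age,
       left(query, 60) AS query
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY xact_start
LIMIT 10;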

Mkori1 (Author) commented Nov 13, 2024

Do we have any setting related to timeouts? I see errors in the log when the method is DELETE:
{"level":"debug","request_id":"da40e9d64c5747d9","component":"httptransport/New","code":500,"error":"unable to delete manifest: timeout: context canceled","time":"2024-11-13T17:49:34Z","message":"http error response"}
{"level":"info","component":"httptransport/New","request_id":"da40e9d64c5747d9","remote_addr":"100.123.110.151:55158","method":"DELETE","request_uri":"/indexer/api/v1/index_report/sha256:542c27dfe5f967b5bde01c11918871d309d30642a47848b9e13829647814b930","status":500,"duration":600097.463138,"time":"2024-11-13T17:49:34Z","message":"handled HTTP request"}

crozzy (Collaborator) commented Nov 13, 2024

So, excessive latency when deleting manifests was identified as an issue and dealt with here; I believe the earliest version the fix is available in is v4.7.3.

The change includes creating an index, which you can choose to pre-create concurrently if you have a larger instance, using the clairctl admin sub-command: https://github.com/quay/clair/blob/main/cmd/clairctl/admin.go#L42.
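
To illustrate what "pre-create concurrently" means in Postgres terms, a hypothetical sketch follows. The index name and column are assumptions for illustration only, not necessarily the exact index the linked change creates; the clairctl admin sub-command is the authoritative way to do this.

-- Hypothetical example only: CONCURRENTLY builds the index without holding
-- a long exclusive lock on the table, at the cost of a slower build.
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_manifest_layer_layer_id
    ON manifest_layer (layer_id);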

crozzy (Collaborator) commented Nov 13, 2024

I'm not sure if this is related to the DB storage issue (i.e. the VACUUM isn't running for some reason because these DELETEs are in flight for a long time), but I echo what @frostmar said: checking how your DB is configured to (AUTO)VACUUM is likely going to be very useful.
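
If the big vuln table turns out to be accumulating dead tuples faster than the default thresholds trigger autovacuum, one common knob is a per-table scale factor. The values below are purely illustrative, not a recommendation for your instance:

-- Example only: trigger autovacuum on the vuln table at ~2% dead tuples
-- instead of the default 20% (autovacuum_vacuum_scale_factor = 0.2).
ALTER TABLE vuln SET (autovacuum_vacuum_scale_factor = 0.02);
ALTER TABLE vuln SET (autovacuum_analyze_scale_factor = 0.02);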
