provider: persistent storage reporting should accurately reflect available Ceph space #146

Open
andy108369 opened this issue Nov 16, 2023 · 3 comments
Labels: P1, repo/provider (Akash provider-services repo issues)

andy108369 commented Nov 16, 2023

The current reporting of persistent storage available space by the provider, based on Ceph's MAX AVAIL, is not accurate.

This is due to Ceph's MAX AVAIL being a dynamic value that represents MAX - USED, and it decreases as storage is used. Consequently, this results in the provider sometimes reporting less available space than actually exists.

A key point of confusion arises with Kubernetes' PV (Persistent Volume) system. In Kubernetes, when a PV or PVC (Persistent Volume Claim) is created, it doesn't immediately reserve physical space in Ceph. Therefore, Ceph's MAX AVAIL doesn't change upon the creation of these volumes, leading to a discrepancy. It's only when data is actually written to these volumes that Ceph's MAX AVAIL decreases accordingly.
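To see the thin-provisioning behaviour directly, here is a minimal sketch (assuming a Rook-Ceph backed storage class named beta3, as Akash providers typically use): create a PVC and compare ceph df before and after; MAX AVAIL only moves once data is actually written into the volume.

$ cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: thin-prov-test
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: beta3
  resources:
    requests:
      storage: 10Gi
EOF
$ kubectl -n rook-ceph exec -i $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') -- ceph df
# MAX AVAIL is unchanged: the 10Gi claim reserves no Ceph space until data is written.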

To provide a more accurate view of the available space, the provider should modify its display metrics.
Instead of relying on Ceph's MAX AVAIL, it should calculate the actual available space as [Total MAX space of Ceph] - [Reserved space in K8s (PV/PVC)].
Here, Total MAX space of Ceph should be considered the entire storage capacity of the Ceph cluster, without deducting Ceph's USED amount (as Ceph's MAX AVAIL currently does) or the space reserved by Kubernetes PV/PVC.
This approach will give a more realistic representation of the available storage, accounting for the Kubernetes-reserved space.
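A rough shell sketch of that calculation (a sketch only: it assumes the rook-ceph toolbox is deployed, takes the replica count from the akash-deployments pool size as discussed in the comments below, and sums the PVC requests of the 45-character lease-ID namespaces; JSON field names are as of recent Ceph releases):

$ TOOLS=$(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}')
$ REPLICAS=$(kubectl -n rook-ceph exec -i $TOOLS -- ceph osd pool get akash-deployments size -f json | jq -r .size)
$ TOTAL_GIB=$(kubectl -n rook-ceph exec -i $TOOLS -- ceph df -f json | jq -r '.stats.total_bytes / 1024 / 1024 / 1024')
$ CLAIMED_GIB=$(kubectl get pvc -A | grep -E '[a-z0-9]{45}' | awk '{print $5}' | sed 's/i//g' | numfmt --from=iec | awk '{s+=$1} END {print s/1024/1024/1024}')
$ echo "$TOTAL_GIB / $REPLICAS - $CLAIMED_GIB" | bc -l   # proposed available persistent storage, GiB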

NOTE: Ceph's USED is STORED x No_Replicas, which means the reported available persistent storage can easily go negative once more than half of the space gets written to persistent storage (with two replicas), or a quarter of it (with three replicas). See the example from the Hurricane provider below (two replicas).


Tested Provider / Inventory Operator Versions

  • 0.4.6
  • 0.4.7
  • 0.4.8-rc0

Scenario Illustration

  1. Initial Provider View (before deployment)

provider has 1 OSD (disk of 32Gi)

$ provider_info.sh provider.provider-02.sandbox-01.aksh.pw
type       cpu    gpu  ram                 ephemeral           persistent
used       0      0    0                   0                   0
pending    0      0    0                   0                   0
available  14.65  2    30.632808685302734  174.28484315704554  30.370879160240293
node       14.65  2    30.632808685302734  174.28484315704554  N/A
  2. After Creating a Deployment with 10Gi Persistent Storage
  • Provider still shows MAX AVAIL from Ceph;
  • Briefly, the available persistent storage dropped by 10Gi but quickly reverted (during bid / bid acceptance / manifest sending; so I presume some internal akash-provider mechanics)
  3. Provider View Post-Deployment
$ provider_info.sh provider.provider-02.sandbox-01.aksh.pw
type       cpu    gpu  ram                 ephemeral           persistent
used       1      1    2                   5                   10
pending    0      0    0                   0                   0
available  13.65  1    28.632808685302734  169.28484315704554  30.0556707251817
node       13.65  1    28.632808685302734  169.28484315704554  N/A

ceph df also reports MAX AVAIL => 30Gi.

  4. Writing 9Gi of Data to PV
dd if=/dev/urandom bs=1M count=9216 of=1
  5. Views After Writing 9Gi of Data
  • Ceph View: Shows MAX AVAIL as MAX - USED.
$ kubectl -n rook-ceph exec -i $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') -- ceph df
--- RAW STORAGE ---
CLASS    SIZE   AVAIL     USED  RAW USED  %RAW USED
hdd    32 GiB  23 GiB  9.1 GiB   9.1 GiB      28.40
TOTAL  32 GiB  23 GiB  9.1 GiB   9.1 GiB      28.40
 
--- POOLS ---
POOL               ID  PGS   STORED  OBJECTS     USED  %USED  MAX AVAIL
akash-nodes         1   32     19 B        1    4 KiB      0     21 GiB
akash-deployments   2   32  9.0 GiB    2.33k  9.0 GiB  29.71     21 GiB
.mgr                3    1  449 KiB        2  452 KiB      0     21 GiB
  • Provider View: Reflects MAX AVAIL aligned with Ceph's calculation MAX - USED, i.e. (21 Gi - 9 Gi = 12 Gi).
$ provider_info.sh provider.provider-02.sandbox-01.aksh.pw
type       cpu    gpu  ram                 ephemeral           persistent
used       1      1    2                   5                   10
pending    0      0    0                   0                   0
available  13.65  1    28.632808685302734  169.28484315704554  12.302638040855527
node       13.65  1    28.632808685302734  169.28484315704554  N/A

andy108369 commented Nov 16, 2023

Hurricane provider

This is also the reason the Hurricane provider reports -567 Gi of persistent storage available:

Akash Provider currently reports MAX AVAIL - USED => 393 - 960 => -567 Gi of available persistent storage.

Clarification: USED is STORED x No_Replicas here, i.e. 480 x 2 = 960 Gi (the server has 2x 931.5G disks with two OSDs of 465.8G each).
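The STORED vs USED relationship can also be read straight from ceph df in JSON form (a sketch; pool stat field names may differ slightly between Ceph releases):

$ kubectl -n rook-ceph exec -i $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') -- ceph df -f json | jq '.pools[] | select(.name == "akash-deployments") | {stored: .stats.stored, bytes_used: .stats.bytes_used, max_avail: .stats.max_avail}'
# with a 2-replica pool, bytes_used comes out at roughly stored * 2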

$ kubectl -n rook-ceph exec -i $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') -- ceph df
--- RAW STORAGE ---
CLASS     SIZE    AVAIL     USED  RAW USED  %RAW USED
hdd    1.8 TiB  898 GiB  965 GiB   965 GiB      51.79
TOTAL  1.8 TiB  898 GiB  965 GiB   965 GiB      51.79
 
--- POOLS ---
POOL               ID  PGS   STORED  OBJECTS     USED  %USED  MAX AVAIL
.mgr                1    1  449 KiB        2  904 KiB      0    393 GiB
akash-deployments   2  256  480 GiB  123.31k  960 GiB  54.98    393 GiB

$ provider_info.sh provider.hurricane.akash.pub
type       cpu     gpu  ram                ephemeral           persistent
used       58.6    0    169.5              746.5               550
pending    0       0    0                  0                   0
available  34.295  1    4.681840896606445  1062.2646561246365  -567.2483718525618
node       34.295  1    4.681840896606445  1062.2646561246365  N/A

Ceph config - 2 replicas:

$ kubectl -n rook-ceph exec -i $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') -- ceph osd pool get akash-deployments all
size: 2
min_size: 2
...

PVC
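The one-liner below sums the requested capacity of every Akash deployment PVC (their namespaces are the 45-character lease IDs) and prints the total in GiB: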

$ kubectl get pvc -A |grep -E '[a-z0-9]{45}' | awk '{print $5}' | sed 's/i//g' | numfmt --from=iec | tr '\n' '+' | sed 's/+$/\n/g' | bc -l | numfmt --to-unit=$((1024**3))
550

The provider should have calculated its available persistent storage as (AVAIL / REPLICAS) - CLAIMED_BY_PVC = (898/2) - 550 = -101 GiB, meaning it has already been over-provisioned.
It is still running well though, because the PVCs aren't 100% filled yet:

Filesystem                         Size  Used Avail Use% Mounted on
/dev/rbd0                          492G  193G  299G  40% /root/.osmosisd
/dev/rbd1                           49G   44K   49G   1% /root

FWIW, the Used space reported by df doesn't mean much here, because even after files are removed from these rbd devices the data still occupies space on disk (data remanence); only the inode (or its visibility flag) gets removed.

$ kubectl -n rook-ceph exec -ti $(kubectl -n rook-ceph get pods -l app=rook-ceph-tools -o name) -- bash

[root@rook-ceph-tools-846b5c845b-qrf7c /]# ceph osd pool ls
.mgr
akash-deployments

[root@rook-ceph-tools-846b5c845b-qrf7c /]# rbd pool stats akash-deployments
Total Images: 2
Total Snapshots: 0
Provisioned Size: 550 GiB

[root@rook-ceph-tools-846b5c845b-qrf7c /]# rbd -p akash-deployments ls
csi-vol-5cb9281e-ed6e-4a9a-85d4-196a91aa1863
csi-vol-dfc5aaaf-577f-4705-9d20-e361a5b6d657

[root@rook-ceph-tools-846b5c845b-qrf7c /]# rbd -p akash-deployments disk-usage csi-vol-dfc5aaaf-577f-4705-9d20-e361a5b6d657
warning: fast-diff map is not enabled for csi-vol-dfc5aaaf-577f-4705-9d20-e361a5b6d657. operation may be slow.
NAME                                          PROVISIONED  USED  
csi-vol-dfc5aaaf-577f-4705-9d20-e361a5b6d657       50 GiB  17 GiB

[root@rook-ceph-tools-846b5c845b-qrf7c /]# rbd -p akash-deployments disk-usage csi-vol-5cb9281e-ed6e-4a9a-85d4-196a91aa1863
warning: fast-diff map is not enabled for csi-vol-5cb9281e-ed6e-4a9a-85d4-196a91aa1863. operation may be slow.
NAME                                          PROVISIONED  USED   
csi-vol-5cb9281e-ed6e-4a9a-85d4-196a91aa1863      500 GiB  469 GiB

469 + 17 = 486 GiB is the disk space actually used by these two PVCs (6 GiB more than when I ran ceph df above, since the apps keep writing data).
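For a quick total across all images in the pool, something like the following works (a sketch: it parses the human-readable rbd disk-usage output and assumes only MiB/GiB values appear):

$ kubectl -n rook-ceph exec -i $(kubectl -n rook-ceph get pods -l app=rook-ceph-tools -o name) -- rbd -p akash-deployments disk-usage 2>/dev/null | awk '$1 ~ /^csi-vol-/ {p += ($3=="MiB") ? $2/1024 : $2; u += ($5=="MiB") ? $4/1024 : $4} END {printf "provisioned: %.1f GiB, used: %.1f GiB\n", p, u}'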

useful ceph commands

$ kubectl -n rook-ceph exec -i $(kubectl -n rook-ceph get pods -l app=rook-ceph-tools -o name) -- sh -c 'ceph osd pool ls | while read POOL; do echo "=== pool: $POOL ==="; rbd -p "$POOL" ls | while read VOL; do rbd -p "$POOL" disk-usage "$VOL"; done; done'

=== pool: .mgr ===
=== pool: akash-deployments ===
warning: fast-diff map is not enabled for csi-vol-5cb9281e-ed6e-4a9a-85d4-196a91aa1863. operation may be slow.
NAME                                          PROVISIONED  USED   
csi-vol-5cb9281e-ed6e-4a9a-85d4-196a91aa1863      500 GiB  469 GiB
...
$ kubectl -n rook-ceph exec -i $(kubectl -n rook-ceph get pods -l app=rook-ceph-tools -o name) -- sh -c 'ceph osd pool ls | while read POOL; do echo "=== pool: $POOL ==="; rbd -p "$POOL" ls | while read VOL; do ceph osd map "$POOL" "$VOL"; done; done'
=== pool: .mgr ===
=== pool: akash-deployments ===
osdmap e2192 pool 'akash-deployments' (2) object 'csi-vol-5cb9281e-ed6e-4a9a-85d4-196a91aa1863' -> pg 2.d2caa6be (2.be) -> up ([3,1], p3) acting ([3,1], p3)
...

Europlots provider

Akash Provider calculates the available persistent storage as MAX AVAIL - USED => (9.3 - 1.1) * 1024 = 8396.8 GiB, which matches the available persistent storage the provider reports.

However, the provider should in fact have (AVAIL / REPLICAS) - CLAIMED_BY_PVC => (20*1024/2) - 2420 => 7820 GiB (or 7.64 TiB) of available space.

# kubectl -n akash-services get pods -o custom-columns='NAME:.metadata.name,IMAGE:.spec.containers[*].image'
NAME                                        IMAGE
akash-hostname-operator-54854db4c5-c6wvl    ghcr.io/akash-network/provider:0.4.6
akash-inventory-operator-5ff867f6d9-cvx28   ghcr.io/akash-network/provider:0.4.6
akash-ip-operator-79cc857f7b-fj8hd          ghcr.io/akash-network/provider:0.4.6
akash-node-1-0                              ghcr.io/akash-network/node:0.26.2
akash-provider-0                            ghcr.io/akash-network/provider:0.4.7
# kubectl -n rook-ceph exec -i $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') -- ceph df
--- RAW STORAGE ---
CLASS    SIZE   AVAIL     USED  RAW USED  %RAW USED
ssd    21 TiB  20 TiB  1.2 TiB   1.2 TiB       5.55
TOTAL  21 TiB  20 TiB  1.2 TiB   1.2 TiB       5.55

--- POOLS ---
POOL               ID  PGS   STORED  OBJECTS     USED  %USED  MAX AVAIL
.mgr                1    1  6.6 MiB        3   20 MiB      0    6.2 TiB
akash-nodes         2   32     19 B        1    8 KiB      0    9.3 TiB
akash-deployments   3  512  588 GiB  152.59k  1.1 TiB   5.79    9.3 TiB
$ provider_info.sh provider.europlots.com
type       cpu      gpu  ram                 ephemeral           persistent
used       98.15    1    285.83948681596667  1550.8649163246155  2414.4313225746155
pending    0        0    0                   0                   0
available  283.585  1    650.6061556553468   10326.47616339475   8387.475747092627
node       136.015  1    307.38518168684095  4418.74621364288    N/A
node       122.015  0    309.32234382629395  5785.588328568265   N/A
node       25.555   0    33.898630142211914  122.141621183604    N/A
  • the Ceph cluster is configured with 2 replicas for the objects
# kubectl -n rook-ceph exec -i $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') -- ceph osd pool get akash-deployments all
size: 2
min_size: 2
...
  • the space actually reserved by PVCs is 2420 GiB (or 2.4 TiB):
$ cat message\ \(7\).txt | grep -E '[a-z0-9]{45}'
01pil6i48e91fr0k3jlhdakid6sg2q2g2dm2muk343gc8   node-certs-node-0                                      Bound    pvc-cee47c63-a161-48b1-a819-c77df64ab195   100Mi      RWO            beta3          82d
01pil6i48e91fr0k3jlhdakid6sg2q2g2dm2muk343gc8   postgres-data-postgres-0                               Bound    pvc-26535f51-7205-4c78-b24f-6af3ef30aed9   5Gi        RWO            beta3          82d
18mh9uqn9n92nn165jveibikldaoksook0ioug62e0ejo   db-wordpress-db-db-0                                   Bound    pvc-6fc61cc9-828c-4312-a880-03a7f5755bcd   1Gi        RWO            beta3          46d
18mh9uqn9n92nn165jveibikldaoksook0ioug62e0ejo   wordpress-wordpress-data-wordpress-0                   Bound    pvc-95c7733d-4f3d-40f6-93bb-5ae723860603   1Gi        RWO            beta3          46d
...
...

$ cat message\ \(7\).txt | grep -E '[a-z0-9]{45}' | awk '{print $5}' | sed 's/i//g' | numfmt --from=iec | tr '\n' '+' | sed 's/+$/\n/g' | bc -l | numfmt --to-unit=$((1024**3))
2420

Europlots uses 6 x 3.84 TiB disks (Ceph: 2 replicas), which gives (6*3.84)/2 = 11.52 TiB of usable persistent storage space.

Ceph reports 20 TiB as AVAIL, which corresponds to 6*3.84 = 23.04 TiB minus overhead (exact disk sizes / Ceph metadata).

Taking the number of replicas into account: (20*1024)/2 = 10240 GiB is the amount of disk space the cluster has to offer for deployments.

AVAIL - PVC => 10240 - 2420 = 7820 GiB is the space the Ceph cluster can actually offer.
However, the provider reports more based on the previously mentioned formula, MAX AVAIL - USED => (9.3 - 1.1) * 1024 = 8396.8 GiB. Because of that, the provider can over-allocate persistent storage to its deployments, as shown in the Hurricane provider example above.
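As a quick bc check of the two formulas side by side:

$ echo '(9.3 - 1.1) * 1024' | bc -l       # current:  MAX AVAIL - USED        => 8396.8 GiB
$ echo '(20 * 1024) / 2 - 2420' | bc -l   # proposed: AVAIL / REPLICAS - PVC  => 7820 GiB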

UPDATE:

Here is the actual used space by the PVC on Europlots:

NAME                                          PROVISIONED  USED
csi-vol-1f2185ec-846d-11ee-8000-be1f38662678     1000 GiB  15 GiB
csi-vol-23ba52fa-3a22-11ee-8d82-6aa17b00d99b       16 GiB  348 MiB
csi-vol-23bae704-3a22-11ee-8d82-6aa17b00d99b       16 GiB  500 MiB
csi-vol-2b2c2567-4c63-11ed-8eaa-ce3b8929bf79       20 GiB  19 GiB
csi-vol-2f98a86f-52d1-11ee-8000-be1f38662678        1 GiB  176 MiB
csi-vol-2f9bd2a8-52d1-11ee-8000-be1f38662678        2 GiB  460 MiB
csi-vol-34de1928-4c05-11ee-8000-be1f38662678        3 GiB  36 MiB
csi-vol-72ac18c1-435f-11ee-8d82-6aa17b00d99b      100 MiB  28 MiB
csi-vol-72b1851c-435f-11ee-8d82-6aa17b00d99b        5 GiB  52 MiB
csi-vol-78845050-72f6-11ee-8000-be1f38662678      500 GiB  26 GiB
csi-vol-8f07e2f8-7822-11ee-8000-be1f38662678      250 GiB  137 GiB
csi-vol-b042d2fe-6a0f-11ee-8000-be1f38662678      512 MiB  20 MiB
csi-vol-d5cb3fc3-5fd5-11ee-8000-be1f38662678        1 GiB  296 MiB
csi-vol-d5cdee4c-5fd5-11ee-8000-be1f38662678        1 GiB  968 MiB
csi-vol-e754da41-6ce2-11ee-8000-be1f38662678      954 MiB  64 MiB
csi-vol-f911f06a-b62b-11ed-82f4-c22bed72523b       10 GiB  1.3 GiB
csi-vol-fb244441-0dff-11ee-8425-5af3b7a33171        1 GiB  344 MiB
csi-vol-fb286227-0dff-11ee-8425-5af3b7a33171        2 GiB  1.5 GiB
csi-vol-fe21fa44-4d7e-11ee-8000-be1f38662678      600 GiB  393 GiB

PROVISIONED: 1000+16+16+20+1+2+3+(100/1024)+5+500+250+(512/1024)+1+1+(954/1024)+10+1+2+600 = 2429 GiB
USED: 15+(348/1024)+(500/1024)+19+(176/1024)+(460/1024)+(36/1024)+(28/1024)+(52/1024)+26+137+(20/1024)+(296/1024)+(968/1024)+(64/1024)+1.3+(344/1024)+1.5+393 = 596 GiB

That explains the discrepancy between the actually available disk space of 7820 GiB (AVAIL - PVC) and what the provider reports, 8396.8 GiB (MAX AVAIL - USED).

[provider_reported_avail - used_by_PVC] => 8396.8 - 596 = 7800.8 GiB total available space (about 20 GiB less than above, as that space was consumed during the hour or two it took me to update this comment).


andy108369 commented Jul 30, 2024

Yet another observation from H100 Oblivus

This is mainly just for the record so we have more raw data to work with.

H100 Oblivus reports a negative -1232.58 GiB of beta3 storage available.

arno@x1:~$ provider_info2.sh provider.h100.mon.obl.akash.pub
PROVIDER INFO
BALANCE: 5094.264848
"hostname"                         "address"
"provider.h100.mon.obl.akash.pub"  "akash1g7az2pus6atgeufgttlcnl0wzlzwd0lrsy6d7s"

Total/Available/Used (t/a/u) per node:
"name"   "cpu(t/a/u)"           "gpu(t/a/u)"  "mem(t/a/u GiB)"          "ephemeral(t/a/u GiB)"
"node1"  "252/3.88/248.12"      "8/4/4"       "1417.21/483.86/933.35"   "5756.74/4282.74/1474"
"node2"  "252/230.95/21.05"     "8/0/8"       "1417.21/1272.29/144.92"  "5756.74/5656.74/100"
"node3"  "252/118.645/133.355"  "8/0/8"       "1417.21/1267.09/150.13"  "5756.74/5655.74/101"
"node4"  "252/121.675/130.325"  "8/0/8"       "1417.21/1273.44/143.77"  "5756.74/5656.74/100"
"node5"  "252/119.675/132.325"  "8/0/8"       "1417.21/1271.3/145.92"   "5756.74/5656.74/100"
"node6"  "252/184.975/67.025"   "8/0/8"       "1417.21/1273.27/143.94"  "5756.74/5656.74/100"
"node7"  "252/233.675/18.325"   "8/0/8"       "1417.21/1273.44/143.77"  "5756.74/5656.74/100"
"node8"  "252/230.225/21.775"   "8/0/8"       "1417.21/1279.54/137.68"  "5756.74/5656.74/100"

ACTIVE TOTAL:
"cpu(cores)"  "gpu"  "mem(GiB)"  "ephemeral(GiB)"  "beta1(GiB)"  "beta2(GiB)"  "beta3(GiB)"
741           60     1793.46     2175              0             0             7636.32

PERSISTENT STORAGE:
"storage class"  "available space(GiB)"
"beta3"          -1232.58

PENDING TOTAL:
"cpu(cores)"  "gpu"  "mem(GiB)"  "ephemeral(GiB)"  "beta1(GiB)"  "beta2(GiB)"  "beta3(GiB)"
  • looking at Ceph

Ceph is configured with 3 replicas, host as the failure domain, and 1 OSD per host.
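Those settings can be confirmed with the standard Ceph commands (a sketch):

$ TOOLS=$(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}')
$ kubectl -n rook-ceph exec -i $TOOLS -- ceph osd pool get akash-deployments size         # replica count (3 here)
$ kubectl -n rook-ceph exec -i $TOOLS -- ceph osd pool get akash-deployments crush_rule   # CRUSH rule used by the pool
$ kubectl -n rook-ceph exec -i $TOOLS -- ceph osd crush rule dump                         # "type": "host" in the chooseleaf step means host is the failure domain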

arno@x1:~$ kubectl -n rook-ceph exec -i $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') -- ceph df
--- RAW STORAGE ---
CLASS    SIZE   AVAIL    USED  RAW USED  %RAW USED
hdd    32 TiB  21 TiB  11 TiB    11 TiB      33.60
TOTAL  32 TiB  21 TiB  11 TiB    11 TiB      33.60
 
--- POOLS ---
POOL               ID  PGS   STORED  OBJECTS     USED  %USED  MAX AVAIL
akash-deployments   1  256  3.6 TiB  941.52k   11 TiB  36.36    6.3 TiB
.mgr                2    1  449 KiB        2  1.3 MiB      0    6.3 TiB

arno@x1:~$ kubectl -n rook-ceph exec -i $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') -- ceph osd df
ID  CLASS  WEIGHT   REWEIGHT  SIZE    RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS
 0    hdd  4.00000   1.00000   4 TiB  1.5 TiB  1.5 TiB   18 KiB  4.3 GiB  2.5 TiB  36.37  1.08  105      up
 2    hdd  4.00000   1.00000   4 TiB  1.4 TiB  1.4 TiB   23 KiB  4.4 GiB  2.6 TiB  35.32  1.05  101      up
 1    hdd  4.00000   1.00000   4 TiB  1.3 TiB  1.3 TiB   16 KiB  4.2 GiB  2.7 TiB  31.86  0.95   92      up
 3    hdd  4.00000   1.00000   4 TiB  1.3 TiB  1.3 TiB   39 KiB  4.1 GiB  2.7 TiB  31.91  0.95   91      up
 4    hdd  4.00000   1.00000   4 TiB  1.3 TiB  1.3 TiB   53 KiB  4.4 GiB  2.7 TiB  32.56  0.97   93      up
 5    hdd  4.00000   1.00000   4 TiB  1.3 TiB  1.3 TiB   18 KiB  3.4 GiB  2.7 TiB  32.50  0.97   94      up
 6    hdd  4.00000   1.00000   4 TiB  1.3 TiB  1.3 TiB   34 KiB  4.2 GiB  2.7 TiB  32.63  0.97   93      up
 7    hdd  4.00000   1.00000   4 TiB  1.4 TiB  1.4 TiB   27 KiB  4.4 GiB  2.6 TiB  35.67  1.06  102      up
                       TOTAL  32 TiB   11 TiB   11 TiB  232 KiB   33 GiB   21 TiB  33.60                   
MIN/MAX VAR: 0.95/1.08  STDDEV: 1.73

arno@x1:~$ kubectl -n rook-ceph exec -i $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') -- ceph osd tree
ID   CLASS  WEIGHT    TYPE NAME       STATUS  REWEIGHT  PRI-AFF
 -1         32.00000  root default                             
 -7          4.00000      host node1                           
  0    hdd   4.00000          osd.0       up   1.00000  1.00000
 -9          4.00000      host node2                           
  2    hdd   4.00000          osd.2       up   1.00000  1.00000
 -5          4.00000      host node3                           
  1    hdd   4.00000          osd.1       up   1.00000  1.00000
 -3          4.00000      host node4                           
  3    hdd   4.00000          osd.3       up   1.00000  1.00000
-11          4.00000      host node5                           
  4    hdd   4.00000          osd.4       up   1.00000  1.00000
-15          4.00000      host node6                           
  5    hdd   4.00000          osd.5       up   1.00000  1.00000
-13          4.00000      host node7                           
  6    hdd   4.00000          osd.6       up   1.00000  1.00000
-17          4.00000      host node8                           
  7    hdd   4.00000          osd.7       up   1.00000  1.00000
  • provisioned PVs
arno@x1:~$ kubectl get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                                                                        STORAGECLASS   REASON   AGE
pvc-0cbb2a54-5b67-4c3c-b264-a9892692daaf   1000Gi     RWO            Delete           Bound    ai28v42g2f44s5so2fk5ri52it17gk2kjarupg6f55j8i/vllm-data-vllm-0               beta3                   14h
pvc-1d28bcbf-015f-4391-b37b-5936bc9ba7dc   1000Gi     RWO            Delete           Bound    6n1bqbbssllrrcscemcbgs2843tijmjmth5pqqteu9is6/rayhead-data-rayhead-0         beta3                   6h16m
pvc-27061a34-c00e-4afd-9739-6a942c6ee9cc   1000Gi     RWO            Delete           Bound    83a7jr508haaatk0lhe44menv6ubl4rj8j1ohdvkbvq5g/vllm-data-vllm-0               beta3                   6d7h
pvc-38ab8ea9-ade1-4fec-82da-1ca50e36c042   1000Gi     RWO            Delete           Bound    v8j8pcqft9jml55aiva3f502tjlb22p3r8e0b2b33p316/vllm-data-vllm-0               beta3                   14h
pvc-501cec9c-0dde-48f3-a622-e8b365bfd375   5Gi        RWO            Delete           Bound    hindsb9l1cvqif74f47n1avob20vptbmcd7rn4tvjpn54/mongo-data-mongo-0             beta3                   75d
pvc-566eeee6-5aa7-4891-9231-ece077a336c2   1000Gi     RWO            Delete           Bound    6n1bqbbssllrrcscemcbgs2843tijmjmth5pqqteu9is6/rayworker-data-rayworker-0     beta3                   6h16m
pvc-8b5e9eb9-eaa9-40d9-a07e-5289c293fa5d   932Gi      RWO            Delete           Bound    a7mhlcdufef0pf0l3gdlbusmbu7vnevd6j33k6ha12c08/app-workspace-app-0            beta3                   4h46m
pvc-8c369565-94ce-45ef-874c-63c98fc98239   1000Gi     RWO            Delete           Bound    6n1bqbbssllrrcscemcbgs2843tijmjmth5pqqteu9is6/rayworker2-data-rayworker2-0   beta3                   6h16m
pvc-c363d884-0a8d-454c-93ea-372b74796fd9   700Gi      RWO            Delete           Bound    6cvfuea7bav11k5kt64fpp82hlqvkbcusbvi5e8hjj8po/vllm-data-vllm-0               beta3                   26h
  • they are only provisioned in the akash-deployments pool as expected
$ kubectl -n rook-ceph exec -ti $(kubectl -n rook-ceph get pods -l app=rook-ceph-tools -o name) -- bash
bash-4.4$ ceph osd pool ls | while read POOL; do echo "=== pool: $POOL ==="; rbd -p "$POOL" ls | while read VOL; do ceph osd map "$POOL" "$VOL"; done; done
=== pool: akash-deployments ===
osdmap e1924 pool 'akash-deployments' (1) object 'csi-vol-0917364e-cab9-4f69-8760-e044fe3bd0c2' -> pg 1.b76e9bfe (1.fe) -> up ([7,3,5], p7) acting ([7,3,5], p7)
osdmap e1924 pool 'akash-deployments' (1) object 'csi-vol-2d1ae187-9c70-446d-b194-0d9ba475f767' -> pg 1.7fee6553 (1.53) -> up ([0,2,4], p0) acting ([0,2,4], p0)
osdmap e1924 pool 'akash-deployments' (1) object 'csi-vol-322cb85a-05a8-49a6-9e00-594ebe57f6a5' -> pg 1.5826478b (1.8b) -> up ([5,2,4], p5) acting ([5,2,4], p5)
osdmap e1924 pool 'akash-deployments' (1) object 'csi-vol-478e7d21-b4a8-4877-b7bd-fda496a79070' -> pg 1.38a66ea4 (1.a4) -> up ([5,0,4], p5) acting ([5,0,4], p5)
osdmap e1924 pool 'akash-deployments' (1) object 'csi-vol-54731da9-53ea-40fc-a0dc-66d993f7c23d' -> pg 1.b753587e (1.7e) -> up ([5,2,3], p5) acting ([5,2,3], p5)
osdmap e1924 pool 'akash-deployments' (1) object 'csi-vol-6cc55a35-706d-42bf-8942-90dc0a9a0fa3' -> pg 1.c80396e8 (1.e8) -> up ([6,2,0], p6) acting ([6,2,0], p6)
osdmap e1924 pool 'akash-deployments' (1) object 'csi-vol-a47f9fc1-3e5e-48c0-91c3-8897a0074fe1' -> pg 1.34bb8a5b (1.5b) -> up ([6,4,7], p6) acting ([6,4,7], p6)
osdmap e1924 pool 'akash-deployments' (1) object 'csi-vol-bac73230-3cdc-42a8-80c8-f889590d48c0' -> pg 1.5526e1c (1.1c) -> up ([7,4,3], p7) acting ([7,4,3], p7)
osdmap e1924 pool 'akash-deployments' (1) object 'csi-vol-d221961b-f0f9-4488-8e19-90915f3dd18f' -> pg 1.6065c863 (1.63) -> up ([4,6,7], p4) acting ([4,6,7], p4)
=== pool: .mgr ===
bash-4.4$ 
  • Provisioned vs Actual disk usage
arno@x1:~$ kubectl -n rook-ceph exec -ti $(kubectl -n rook-ceph get pods -l app=rook-ceph-tools -o name) -- bash
bash-4.4$ ceph osd pool ls
akash-deployments
.mgr
bash-4.4$ rbd pool stats akash-deployments
Total Images: 9
Total Snapshots: 0
Provisioned Size: 7.5 TiB
bash-4.4$ rbd -p akash-deployments ls | xargs -I@ rbd -p akash-deployments disk-usage '@'
NAME                                          PROVISIONED  USED   
csi-vol-0917364e-cab9-4f69-8760-e044fe3bd0c2     1000 GiB  464 GiB
csi-vol-2d1ae187-9c70-446d-b194-0d9ba475f767     1000 GiB  221 GiB
csi-vol-322cb85a-05a8-49a6-9e00-594ebe57f6a5     1000 GiB  767 GiB
csi-vol-478e7d21-b4a8-4877-b7bd-fda496a79070     1000 GiB  221 GiB
csi-vol-54731da9-53ea-40fc-a0dc-66d993f7c23d     1000 GiB  767 GiB
csi-vol-6cc55a35-706d-42bf-8942-90dc0a9a0fa3      700 GiB  464 GiB
csi-vol-a47f9fc1-3e5e-48c0-91c3-8897a0074fe1        5 GiB  4.4 GiB
csi-vol-bac73230-3cdc-42a8-80c8-f889590d48c0      932 GiB  2.2 GiB
csi-vol-d221961b-f0f9-4488-8e19-90915f3dd18f     1000 GiB  767 GiB

Total Provisioned: (6*1000)+932+700+5 = 7637 GiB or 7.46 TiB
Total USED: 464+221+767+221+767+464+4.4+2.2+767 = 3677.6 GiB or 3.6 TiB, which matches STORED in the ceph df output for the akash-deployments pool.

Yet the provider reports a negative -1232.58 GiB of beta3 storage available, and 7636.32 GiB active.

  • provider status endpoint results (8443/status)
$ curl -k https://provider.h100.mon.obl.akash.pub:8443/status
{"cluster":{"leases":9,"inventory":{"active":[{"cpu":16000,"gpu":8,"memory":137438953472,"storage_ephemeral":107374182400,"storage":{"beta3":751619276800}},{"cpu":384000,"gpu":24,"memory":412316860416,"storage_ephemeral":322122547200,"storage":{"beta3":3221225472000,"ram":32212254720}},{"cpu":64000,"gpu":8,"memory":137438953472,"storage_ephemeral":107374182400,"storage":{"beta3":1073741824000,"ram":10737418240}},{"cpu":150000,"gpu":1,"memory":386547056640,"storage_ephemeral":483183820800,"storage":{"beta3":1000000000000,"ram":10737418240}},{"cpu":16000,"gpu":8,"memory":137438953472,"storage_ephemeral":107374182400,"storage":{"beta3":1073741824000,"ram":10737418240}},{"cpu":2000,"gpu":0,"memory":6442450944,"storage_ephemeral":1073741824,"storage":{"beta3":5368709120}},{"cpu":31000,"gpu":1,"memory":190215195136,"storage_ephemeral":549755813888,"storage":{"ram":8589934592}},{"cpu":16000,"gpu":8,"memory":137438953472,"storage_ephemeral":107374182400,"storage":{"beta3":1073741824000,"ram":10737418240}},{"cpu":31000,"gpu":1,"memory":190215195136,"storage_ephemeral":274877906944,"storage":{"ram":8589934592}}],"available":{"nodes":[{"name":"node1","allocatable":{"cpu":252000,"gpu":8,"memory":1521721561088,"storage_ephemeral":6181247667821},"available":{"cpu":34880,"gpu":5,"memory":718349052928,"storage_ephemeral":4873430126189}},{"name":"node2","allocatable":{"cpu":252000,"gpu":8,"memory":1521721569280,"storage_ephemeral":6181247667821},"available":{"cpu":230950,"gpu":0,"memory":1366109708288,"storage_ephemeral":6073873485421}},{"name":"node3","allocatable":{"cpu":252000,"gpu":8,"memory":1521721634816,"storage_ephemeral":6181247667821},"available":{"cpu":118645,"gpu":0,"memory":1360523429888,"storage_ephemeral":6072799743597}},{"name":"node4","allocatable":{"cpu":252000,"gpu":8,"memory":1521721589760,"storage_ephemeral":6181247667821},"available":{"cpu":121675,"gpu":0,"memory":1367350700032,"storage_ephemeral":6073873485421}},{"name":"node5","allocatable":{"cpu":252000,"gpu":8,"memory":1521721577472,"storage_ephemeral":6181247667821},"available":{"cpu":119675,"gpu":0,"memory":1365045917696,"storage_ephemeral":6073873485421}},{"name":"node6","allocatable":{"cpu":252000,"gpu":8,"memory":1521721597952,"storage_ephemeral":6181247667821},"available":{"cpu":184975,"gpu":0,"memory":1367164061696,"storage_ephemeral":6073873485421}},{"name":"node7","allocatable":{"cpu":252000,"gpu":8,"memory":1521721602048,"storage_ephemeral":6181247667821},"available":{"cpu":233675,"gpu":0,"memory":1367350712320,"storage_ephemeral":6073873485421}},{"name":"node8","allocatable":{"cpu":252000,"gpu":8,"memory":1521721602048,"storage_ephemeral":6181247667821},"available":{"cpu":230225,"gpu":0,"memory":1373891729408,"storage_ephemeral":6073873485421}}],"storage":[{"class":"beta3","size":-1323522588672}]}}},"bidengine":{"orders":0},"manifest":{"deployments":1},"cluster_public_hostname":"provider.h100.mon.obl.akash.pub","address":"akash1g7az2pus6atgeufgttlcnl0wzlzwd0lrsy6d7s"}
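The relevant number can be pulled out of that blob with jq, e.g.:

$ curl -sk https://provider.h100.mon.obl.akash.pub:8443/status | jq '.cluster.inventory.available.storage[] | {class, available_GiB: (.size / 1073741824)}'
# -1323522588672 bytes / 2^30 ≈ -1232.6 GiB, matching the negative beta3 figure above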

@SGC41

SGC41 commented Aug 7, 2024

Noticed this a while back and have been tracking it.
My reported persistent storage keeps slowly dropping; after adding more capacity recently I saw nearly a 100 GiB drop per day in reported free persistent capacity. Another curious thing I noticed when checking my provider API while I was upgrading a server that holds one of the replicated Ceph OSDs:

the capacity shown went off the charts.


ACTIVE TOTAL:
"cpu(cores)"  "gpu"  "mem(GiB)"  "ephemeral(GiB)"  "beta1(GiB)"  "beta2(GiB)"  "beta3(GiB)"
801.5         6      366.44      608               0             0             1900

PERSISTENT STORAGE:
"storage class"  "available space(GiB)"
"beta3"          1497.12

Another day:

ACTIVE TOTAL:
"cpu(cores)"  "gpu"  "mem(GiB)"  "ephemeral(GiB)"  "beta1(GiB)"  "beta2(GiB)"  "beta3(GiB)"
807.7         7      388.79      1609.17           0             0             1900

PERSISTENT STORAGE:
"storage class"  "available space(GiB)"
"beta3"          1299.05

This was when one of the Ceph OSD nodes was down:


ACTIVE TOTAL:
"cpu(cores)"  "gpu"  "mem(GiB)"  "ephemeral(GiB)"  "beta1(GiB)"  "beta2(GiB)"  "beta3(GiB)"
807.7         7      388.79      1609.17           0             0             1900

PERSISTENT STORAGE:
"storage class"  "available space(GiB)"
"beta3"          4497.81

Afterwards, when the node was back online, it went back to the former value.
Or sort of... it does seem to report a bit more free capacity now.

ACTIVE TOTAL:
"cpu(cores)"  "gpu"  "mem(GiB)"  "ephemeral(GiB)"  "beta1(GiB)"  "beta2(GiB)"  "beta3(GiB)"
801.7         6      381.34      1539.32           0             0             1900

PERSISTENT STORAGE:
"storage class"  "available space(GiB)"
"beta3"          1483.29

Not sure what it should be reporting exactly, but there are currently 4200 GiB allocated per Ceph node, minus mons and other Ceph overhead.

It would be interesting to know whether the negative capacity reported by H100 Oblivus affects leases from Console.
Does it stop persistent storage leases from being deployed on a provider reporting negative persistent storage?

Let me know if there is anything I can help with to troubleshoot this issue.
