
Training with data generated by Isaac Sim Replicator #381

Open
doruksonmez opened this issue Aug 19, 2024 · 19 comments
@doruksonmez

doruksonmez commented Aug 19, 2024

Hi, first of all, sorry for the long post. I'm currently experimenting with training DOPE to fully understand the project and its theoretical background. First, I generated synthetic data using BlenderProc, since NVISII is no longer supported by newer Python versions. However, I missed the point that the command below should be run 5 times, as mentioned in #361 (comment); data generation took so long that I ended up with only ~1600 images.

./run_blenderproc_datagen.py --nb_runs 5 --nb_frames 1000 --path_single_obj /isaac_dope_retrain/Popcorn/google_16k/textured.obj --nb_objects 6 --distractors_folder ../nvisii_data_gen/google_scanned_models --nb_distractors 10 --backgrounds_folder ../dome_hdri_haven/

Then I followed NVIDIA's official documentation on generating a synthetic dataset using Isaac Sim Replicator and generated ~7200 images for the 036_wood_block class, which is from the YCB dataset. Then I separated 202 images for testing and ran the debug tool on this split as follows:

python3 debug.py --data ../isaac_data/test/
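(For reference, a minimal sketch of how such a split could be made — this is a hypothetical helper, not a repo script; it assumes paired NNNNNN.png / NNNNNN.json files and the directory names used above:)

# Hypothetical helper (not from the repo): move 202 random image/annotation
# pairs out of the training directory into a test split.
import random
import shutil
from pathlib import Path

src = Path("../isaac_data/isaac_data")  # assumed training data directory
dst = Path("../isaac_data/test")
dst.mkdir(parents=True, exist_ok=True)

stems = sorted(p.stem for p in src.glob("*.png"))
for stem in random.sample(stems, 202):
    shutil.move(str(src / f"{stem}.png"), str(dst / f"{stem}.png"))
    shutil.move(str(src / f"{stem}.json"), str(dst / f"{stem}.json"))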

Here are the debug results:
[debug overlay images: 000114, 000151, 000105]

My first question: do these debug results look correct for the object? (It looks like there is no connection to the 8th point.)

Then I started DOPE training with the train/train.py script, using a batch size of 16 for 200 epochs:
python3 -m torch.distributed.launch --nproc_per_node=1 train.py --batchsize 16 -e 200 --data ../isaac_data/isaac_data/ --object _36_wood_block

Here is my training log, TensorBoard, and belief maps so far:

Loading Model...
ready to train!
Train Epoch: 1 [0/7000 (0%)] 	Loss: 0.032431039959192 	Local Rank: 0
Train Epoch: 1 [1600/7000 (23%)] 	Loss: 0.005280021112412 	Local Rank: 0
Train Epoch: 1 [3200/7000 (46%)] 	Loss: 0.003795965109020 	Local Rank: 0
Train Epoch: 1 [4800/7000 (68%)] 	Loss: 0.005852747708559 	Local Rank: 0
Train Epoch: 1 [6400/7000 (91%)] 	Loss: 0.005139055196196 	Local Rank: 0
NaN or Inf found in input tensor.
Train Epoch: 2 [0/7000 (0%)] 	Loss: 0.005350864026695 	Local Rank: 0
Train Epoch: 2 [1600/7000 (23%)] 	Loss: 0.004227774683386 	Local Rank: 0
Train Epoch: 2 [3200/7000 (46%)] 	Loss: 0.005079013295472 	Local Rank: 0
Train Epoch: 2 [4800/7000 (68%)] 	Loss: 0.005090656690300 	Local Rank: 0
Train Epoch: 2 [6400/7000 (91%)] 	Loss: 0.004829996265471 	Local Rank: 0
NaN or Inf found in input tensor.
Train Epoch: 3 [0/7000 (0%)] 	Loss: 0.004584960173815 	Local Rank: 0
Train Epoch: 3 [1600/7000 (23%)] 	Loss: 0.005996774882078 	Local Rank: 0
Train Epoch: 3 [3200/7000 (46%)] 	Loss: 0.005501705221832 	Local Rank: 0
Train Epoch: 3 [4800/7000 (68%)] 	Loss: 0.004319729283452 	Local Rank: 0
Train Epoch: 3 [6400/7000 (91%)] 	Loss: 0.003838054370135 	Local Rank: 0
NaN or Inf found in input tensor.
Train Epoch: 4 [0/7000 (0%)] 	Loss: 0.004935909062624 	Local Rank: 0
Train Epoch: 4 [1600/7000 (23%)] 	Loss: 0.005632530432194 	Local Rank: 0
Train Epoch: 4 [3200/7000 (46%)] 	Loss: 0.003991267178208 	Local Rank: 0
Train Epoch: 4 [4800/7000 (68%)] 	Loss: 0.005610936786979 	Local Rank: 0
Train Epoch: 4 [6400/7000 (91%)] 	Loss: 0.005505917128175 	Local Rank: 0
NaN or Inf found in input tensor.
Train Epoch: 5 [0/7000 (0%)] 	Loss: 0.004688881803304 	Local Rank: 0
Train Epoch: 5 [1600/7000 (23%)] 	Loss: 0.004800515249372 	Local Rank: 0
Train Epoch: 5 [3200/7000 (46%)] 	Loss: 0.004549864679575 	Local Rank: 0
Train Epoch: 5 [4800/7000 (68%)] 	Loss: 0.004274855367839 	Local Rank: 0
Train Epoch: 5 [6400/7000 (91%)] 	Loss: 0.004714360460639 	Local Rank: 0
.
.
.
NaN or Inf found in input tensor.
Train Epoch: 35 [0/7000 (0%)] 	Loss: 0.003139317035675 	Local Rank: 0
Train Epoch: 35 [1600/7000 (23%)] 	Loss: 0.003259100019932 	Local Rank: 0
Train Epoch: 35 [3200/7000 (46%)] 	Loss: 0.003899197326973 	Local Rank: 0
Train Epoch: 35 [4800/7000 (68%)] 	Loss: 0.003065461292863 	Local Rank: 0
Train Epoch: 35 [6400/7000 (91%)] 	Loss: 0.002900267252699 	Local Rank: 0
NaN or Inf found in input tensor.
Train Epoch: 36 [0/7000 (0%)] 	Loss: 0.002934988122433 	Local Rank: 0
Train Epoch: 36 [1600/7000 (23%)] 	Loss: 0.002790168859065 	Local Rank: 0
Train Epoch: 36 [3200/7000 (46%)] 	Loss: 0.002845479408279 	Local Rank: 0
Train Epoch: 36 [4800/7000 (68%)] 	Loss: 0.003665541065857 	Local Rank: 0
Train Epoch: 36 [6400/7000 (91%)] 	Loss: 0.003706640098244 	Local Rank: 0
NaN or Inf found in input tensor.
Train Epoch: 37 [0/7000 (0%)] 	Loss: 0.003936767578125 	Local Rank: 0
Train Epoch: 37 [1600/7000 (23%)] 	Loss: 0.003334288951010 	Local Rank: 0
Train Epoch: 37 [3200/7000 (46%)] 	Loss: 0.003531043883413 	Local Rank: 0
Train Epoch: 37 [4800/7000 (68%)] 	Loss: 0.003663829993457 	Local Rank: 0
Train Epoch: 37 [6400/7000 (91%)] 	Loss: 0.003290888853371 	Local Rank: 0
NaN or Inf found in input tensor.
Train Epoch: 38 [0/7000 (0%)] 	Loss: 0.004055183846503 	Local Rank: 0
Train Epoch: 38 [1600/7000 (23%)] 	Loss: 0.003013886744156 	Local Rank: 0
Train Epoch: 38 [3200/7000 (46%)] 	Loss: 0.003433866892010 	Local Rank: 0
Train Epoch: 38 [4800/7000 (68%)] 	Loss: 0.003090128302574 	Local Rank: 0
Train Epoch: 38 [6400/7000 (91%)] 	Loss: 0.003046812955290 	Local Rank: 0
NaN or Inf found in input tensor.
Train Epoch: 39 [0/7000 (0%)] 	Loss: 0.003284086473286 	Local Rank: 0
Train Epoch: 39 [1600/7000 (23%)] 	Loss: 0.003326341975480 	Local Rank: 0
Train Epoch: 39 [3200/7000 (46%)] 	Loss: 0.003423219081014 	Local Rank: 0
Train Epoch: 39 [4800/7000 (68%)] 	Loss: 0.002622122410685 	Local Rank: 0
Train Epoch: 39 [6400/7000 (91%)] 	Loss: 0.003288988722488 	Local Rank: 0
NaN or Inf found in input tensor.
Train Epoch: 40 [0/7000 (0%)] 	Loss: 0.003445768961683 	Local Rank: 0
Train Epoch: 40 [1600/7000 (23%)] 	Loss: 0.002639355603606 	Local Rank: 0
Train Epoch: 40 [3200/7000 (46%)] 	Loss: 0.002910927170888 	Local Rank: 0
Train Epoch: 40 [4800/7000 (68%)] 	Loss: 0.003899353090674 	Local Rank: 0
Train Epoch: 40 [6400/7000 (91%)] 	Loss: 0.003307382576168 	Local Rank: 0

[Screenshots: TensorBoard loss curves and belief maps]

My second question: do these loss values / TensorBoard curves / belief maps indicate a healthy training process? (I'm asking because training takes a long time on an RTX 4070 Ti, and if something is wrong I would like to fix it and restart the training.)

My third question: I have noticed some differences between the NVISII/BlenderProc data and the Isaac Sim Replicator data. Would these differences affect my training or inference process? Are there any other steps that should be applied to the Isaac data before or after training to get the expected results?

Sample data for cracker:

{
  "camera_data": {},
  "objects": [
    {
      "class": "cracker",
      "visibility": 1.0,
      "location": [
        6.001536846160889,
        -4.993739604949951,
        86.97349548339844
      ],
      "quaternion_xyzw": [
        0.05684935301542282,
        0.32983407378196716,
        -0.32128748297691345,
        0.8858622908592224
      ],
      "projected_cuboid": [
        [
          294.026123046875,
          258.42864990234375
        ],
        [
          407.443115234375,
          274.3418884277344
        ],
        [
          363.9100646972656,
          112.13758850097656
        ],
        [
          240.95428466796875,
          112.85871887207031
        ],
        [
          264.0873718261719,
          293.51092529296875
        ],
        [
          379.4239501953125,
          314.3736877441406
        ],
        [
          329.1734924316406,
          149.2497100830078
        ],
        [
          204.38331604003906,
          145.52359008789062
        ],
        [
          309.0063171386719,
          211.89468383789062
        ]
      ]
    }
  ]
}

Wood block data from Isaac Sim Replicator:

{
    "camera_data": {
        "intrinsics": {
            "fx": 768.0,
            "fy": 768.0,
            "cx": 256.0,
            "cy": 256.0
        },
        "camera_view_matrix": [
            [
                1.0,
                0.0,
                0.0,
                0.0
            ],
            [
                0.0,
                -1.0,
                -0.0,
                0.0
            ],
            [
                0.0,
                0.0,
                -1.0,
                0.0
            ],
            [
                -0.0,
                0.0,
                0.0,
                1.0
            ]
        ],
        "camera_projection_matrix": [
            [
                3.0,
                0.0,
                0.0,
                0.0
            ],
            [
                0.0,
                3.0,
                0.0,
                0.0
            ],
            [
                0.0,
                0.0,
                0.0,
                -1.0
            ],
            [
                0.0,
                0.0,
                0.01,
                0.0
            ]
        ]
    },
    "keypoint_order": [
        "Center",
        "LDB",
        "LDF",
        "LUB",
        "LUF",
        "RDB",
        "RDF",
        "RUB",
        "RUF"
    ],
    "objects": [
        {
            "prim_path": "/World/_36_wood_block_0/_36_wood_block",
            "visibility": 0.973,
            "location": [
                -0.02757798135280609,
                0.12349912524223328,
                1.191649079322815
            ],
            "quaternion_xyzw": [
                0.2647941019297759,
                0.9311522685504322,
                -0.13284147100297986,
                0.21258570022076217
            ],
            "projected_cuboid": [
                [
                    238,
                    336
                ],
                [
                    308,
                    370
                ],
                [
                    286,
                    383
                ],
                [
                    233,
                    258
                ],
                [
                    217,
                    277
                ],
                [
                    262,
                    402
                ],
                [
                    243,
                    413
                ],
                [
                    186,
                    284
                ],
                [
                    172,
                    303
                ]
            ],
            "truncation_ratio": 0.0,
            "class": "_36_wood_block"
        }
    ]
}
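(As a quick consistency check, the annotated Center keypoint in the Isaac sample can be reproduced from the intrinsics and the object location with a plain pinhole projection — a minimal sketch with the values copied from the JSON above; the +y-down image convention is inferred from this one sample rather than documented:)

# Sanity-check sketch: project the wood block's 3D location with the
# pinhole intrinsics from the sample above; the result should match the
# first projected_cuboid entry ("Center", per keypoint_order).
fx = fy = 768.0
cx, cy = 256.0, 256.0
x, y, z = -0.02757798135280609, 0.12349912524223328, 1.191649079322815

u = fx * x / z + cx  # ~238.2
v = fy * y / z + cy  # ~335.6
print(round(u), round(v))  # 238 336 -> matches the annotated Center [238, 336]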

Thanks in advance.

@TontonTremblay
Collaborator

@nv-jeff Maybe Jeff could help you; I never used Isaac Replicator, sorry.

@doruksonmez
Author

@TontonTremblay Thanks for your answer. I would really appreciate it if @nv-jeff could also chime in. In the meantime, do you have any comment on the ongoing training (loss/maps etc.)?

@TontonTremblay
Collaborator

I think you would want the loss around 0.001; I'm not sure, I would need to check other issue threads. Also, maybe your object has symmetries. Check the section on generating data with symmetries.

@TontonTremblay
Collaborator

Ah, you are doing the YCB wood block; yeah, there are symmetries on that object. You will have to define them.

@doruksonmez
Author

I got you, but this object was not generated by NVISII, so where am I supposed to define model_info.json? I just have .pngs and .jsons in my dataset, and I don't see any option for it in train.py either. Is this supposed to be in Isaac Sim's data generation config?
This is my dope_config.yml file within the Isaac Sim Replicator directory:

---
# Default rendering parameters
CONFIG:
  renderer: RayTracedLighting
  headless: false
  width: 512
  height: 512

# prim_type is determined by the usd file. 
# To determine, open the usd file in Isaac Sim and see the prim path. If you load it in /World, the path will be /World/<prim_type>
OBJECTS_TO_GENERATE:
#- { part_name: 003_cracker_box, num: 1, prim_type: _03_cracker_box }
#- { part_name: 035_power_drill, num: 1, prim_type: _35_power_drill }
- { part_name: 036_wood_block, num: 1, prim_type: _36_wood_block }


# Maximum force component to apply to objects to keep them in motion
FORCE_RANGE: 30

# Camera Intrinsics
WIDTH: 512
HEIGHT: 512
F_X: 768
F_Y: 768
pixel_size: 0.04 # in mm 

# Number of sphere lights added to the scene
NUM_LIGHTS: 6

# Minimum and maximum distances of objects away from the camera (along the optical axis)
# MIN_DISTANCE: 1.0
MIN_DISTANCE: 0.4
# MAX_DISTANCE: 2.0
MAX_DISTANCE: 1.4

# Rotation of camera rig with respect to world frame, expressed as XYZ euler angles
CAMERA_RIG_ROTATION:
- 0
- 0
- 0

# Rotation of camera with respect to camera rig, expressed as XYZ euler angles. Please note that in this example, we
# define poses with respect to the camera rig instead of the camera prim. By using the rig's frame as a surrogate for
# the camera's frame, we effectively change the coordinate system of the camera. When
# CAMERA_RIG_ROTATION = np.array([0, 0, 0]) and CAMERA_ROTATION = np.array([0, 0, 0]), this corresponds to the default
# Isaac-Sim camera coordinate system of -z out the face of the camera, +x to the right, and +y up. When
# CAMERA_RIG_ROTATION = np.array([0, 0, 0]) and CAMERA_ROTATION = np.array([180, 0, 0]), this corresponds to
# the YCB Video Dataset camera coordinate system of +z out the face of the camera, +x to the right, and +y down.
CAMERA_ROTATION:
- 180
- 0
- 0

# Minimum and maximum XYZ euler angles for the part being trained on to be rotated, with respect to the camera rig
MIN_ROTATION_RANGE:
- -180
- -90
- -180

# Minimum and maximum XYZ euler angles for the part being trained on to be rotated, with respect to the camera rig
MAX_ROTATION_RANGE:
- 180
- 90
- 180

# How close the center of the part being trained on is allowed to be to the edge of the screen
FRACTION_TO_SCREEN_EDGE: 0.9

# MESH and DOME datasets
SHAPE_SCALE:
- 0.05
- 0.05
- 0.05
SHAPE_MASS: 1
OBJECT_SCALE:
- 1
- 1
- 1
OBJECT_MASS: 1

TRAIN_PART_SCALE: # Scale for the training objects 
- 1
- 1
- 1

# Asset paths 
DISTRACTOR_ASSET_PATH: /Isaac/Props/YCB/Axis_Aligned/
TRAIN_ASSET_PATH: /Isaac/Props/YCB/Axis_Aligned/
DOME_TEXTURE_PATH: /NVIDIA/Assets/Skies/ 

# MESH dataset
NUM_MESH_SHAPES: 400
NUM_MESH_OBJECTS: 150
MESH_FRACTION_GLASS: 0.15
MESH_FILENAMES:
- 002_master_chef_can
- 004_sugar_box
- 005_tomato_soup_can
- 006_mustard_bottle
- 007_tuna_fish_can
- 008_pudding_box
- 009_gelatin_box
- 010_potted_meat_can
- 011_banana
- 019_pitcher_base
- 021_bleach_cleanser
- 024_bowl
- 025_mug
- 035_power_drill
- 036_wood_block
- 037_scissors
- 040_large_marker
- 051_large_clamp
- 052_extra_large_clamp
- 061_foam_brick

# DOME dataset
NUM_DOME_SHAPES: 30
NUM_DOME_OBJECTS: 20
DOME_FRACTION_GLASS: 0.2
DOME_TEXTURES:
- Clear/evening_road_01_4k
- Clear/kloppenheim_02_4k
- Clear/mealie_road_4k
- Clear/noon_grass_4k
- Clear/qwantani_4k
- Clear/signal_hill_sunrise_4k
- Clear/sunflowers_4k
- Clear/syferfontein_18d_clear_4k
- Clear/venice_sunset_4k
- Clear/white_cliff_top_4k
- Cloudy/abandoned_parking_4k
- Cloudy/champagne_castle_1_4k
- Cloudy/evening_road_01_4k
- Cloudy/kloofendal_48d_partly_cloudy_4k
- Cloudy/lakeside_4k
- Cloudy/sunflowers_4k
- Cloudy/table_mountain_1_4k
- Evening/evening_road_01_4k
- Indoor/adams_place_bridge_4k
- Indoor/autoshop_01_4k
- Indoor/bathroom_4k
- Indoor/carpentry_shop_01_4k
- Indoor/en_suite_4k
- Indoor/entrance_hall_4k
- Indoor/hospital_room_4k
- Indoor/hotel_room_4k
- Indoor/lebombo_4k
- Indoor/old_bus_depot_4k
- Indoor/small_empty_house_4k
- Indoor/studio_small_04_4k
- Indoor/surgery_4k
- Indoor/vulture_hide_4k
- Indoor/wooden_lounge_4k
- Night/kloppenheim_02_4k
- Night/moonlit_golf_4k
- Storm/approaching_storm_4k

@doruksonmez
Author

@TontonTremblay @nv-jeff, my training just completed. Even though the loss value looks okay, I cannot get any inference results with inference/inference.py. What could be the reason?

Training output:
[terminal screenshot of the loss output]

TensorBoard:
[TensorBoard screenshot]

Inference:
[inference result screenshot]

@TontonTremblay
Collaborator

Can you share the belief maps? I think the flag is show_beliefs, or something like that.

@doruksonmez
Author

The only thing I could find related to show_beliefs was the --showbelief flag in train2/inference.py. I tried to run it like:
python3 inference.py --data ../isaac_data/test/ --showbelief

but it throws:

Traceback (most recent call last):
  File "/isaac_dope_retrain/Deep_Object_Pose/train2/inference.py", line 22, in <module>
    from cuboid import Cuboid3d
ModuleNotFoundError: No module named 'cuboid'

So I have changed

import sys 
sys.path.append("inference")

to

import sys 
sys.path.append("../common")

but this time it throws:
[screenshot of the traceback]

I also set the following fields in train2/config_inference/config_pose.yaml:

weights: {
    '_36_wood_block':"/isaac_dope_retrain/Deep_Object_Pose/train/output_wood_block/weights/net_epoch_99.pth"
}

architectures: {
    '_36_wood_block':"dope"
}

dimensions: {
    "_36_wood_block": [9.6024150848388672,19.130100250244141,5.824894905090332]
}

class_ids: {
    '_36_wood_block': 37
}

draw_colors: {
    "_36_wood_block": [217,12, 232]  # magenta
}

How can I get belief map images? Sorry, I could not find any explanation for this in the repo. Thank you for your support.
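(For context, a hypothetical sketch of one generic way to dump belief maps from a DOPE-style output — this is not a utility from the repo; it assumes a [9, H, W] belief tensor taken from the network's last stage, and the names are placeholders:)

# Hypothetical sketch: save 9 belief maps as a 3x3 image grid.
import torch
import torchvision.utils as vutils

def save_belief_grid(belief: torch.Tensor, path: str = "belief_maps.png"):
    maps = belief.unsqueeze(1)  # [9, H, W] -> [9, 1, H, W], one map per grid cell
    grid = vutils.make_grid(maps, nrow=3, normalize=True, scale_each=True)
    vutils.save_image(grid, path)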

@doruksonmez
Author

I see this part in train/train.py, but I can't find any images related to it; somehow it doesn't seem to save them:
[screenshot of the belief-map saving code in train/train.py]

@doruksonmez
Author

I have a view in TensorBoard for epoch 176. Would it be useful for you, @TontonTremblay?
[TensorBoard belief-map screenshot]

@TontonTremblay
Collaborator

Yeah, it looks like you have symmetries. See how, for some points, it does not know which one is which? You need to add them to your annotations. Read this: https://github.com/NVlabs/Deep_Object_Pose/blob/master/data_generation/readme.md#handling-objects-with-symmetries
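(For illustration only — a hypothetical model_info.json fragment, assuming the BOP-style layout of flattened 4x4 transforms that the linked readme builds on. This example encodes a single 180° rotation about the z axis; the real entries for 036_wood_block must be derived from its actual geometry:)

{
  "symmetries_discrete": [
    [-1.0,  0.0, 0.0, 0.0,
      0.0, -1.0, 0.0, 0.0,
      0.0,  0.0, 1.0, 0.0,
      0.0,  0.0, 0.0, 1.0]
  ]
}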

@doruksonmez
Author

@TontonTremblay thanks for your answer. Before your reply, I wanted to use Ketchup as a reference point to see whether I could get the expected result. I generated ~3900 images using BlenderProc. What do you think about my belief maps at around epoch 165?
[belief-map screenshot]

@TontonTremblay
Collaborator

TontonTremblay commented Aug 21, 2024 via email

@doruksonmez
Author

doruksonmez commented Aug 22, 2024

I think I'm getting somewhere with this, but my inference results are not good. I was able to run train2/inference.py after a small modification to detector.py line 496, which previously failed with an error:

# Original lines (these raise an error: torch.min(belief) without a dim
# argument returns a 0-dim tensor, so indexing it with [0] fails):
#belief -= float(torch.min(belief)[0].data.cpu().numpy())
#belief /= float(torch.max(belief)[0].data.cpu().numpy())

# Fixed normalization:
belief -= float(torch.min(belief).data.cpu().numpy())
belief /= float(torch.max(belief).data.cpu().numpy())
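(The distinction behind that fix, as a quick sketch: the reduction form of torch.min returns a (values, indices) tuple, while the global form returns a single 0-dim tensor, which the original [0] indexing seems to have assumed:)

import torch

belief = torch.rand(50, 50)
torch.min(belief)         # 0-dim tensor holding the global minimum
torch.min(belief, dim=0)  # named tuple (values, indices), one entry per column
# Indexing the 0-dim result with [0] raises
# "IndexError: invalid index of a 0-dim tensor", hence the change above.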

Here are a couple of belief maps from the actual train2/inference.py script, together with the pose estimation results:
[images: file_12 / file_12_belief, file_25 / file_25_belief]

In some of the images, there is no result at all:
[images: file_28 / file_28_belief]

I've already decreased the threshold to 0.1, and still at most one detection appears in the output. What might be the issue here?
[images: file_1390 / file_1390_belief]

@TontonTremblay
Collaborator

TontonTremblay commented Aug 22, 2024 via email

@doruksonmez
Author

I used BlenderProc for this Ketchup dataset. Here is the picture with one Ketchup bottle, followed by its belief map:
[images: 000000 / 000000_belief]

@doruksonmez
Author

doruksonmez commented Aug 22, 2024

And this one is without distractors:
[images: 000000 / 000000_belief]

@TontonTremblay
Collaborator

TontonTremblay commented Aug 22, 2024 via email

@doruksonmez
Author

@TontonTremblay Okay, many thanks for clarifying. @nv-jeff, could we please get your help with this issue? Thanks in advance.
