[Question] Advice on training pick policy using allegro hand/arm #797

CreativeNick opened this issue Jan 10, 2025 · 0 comments

I'm trying to write a pick policy for an Allegro hand/arm robot to pick up a randomly selected and randomly positioned YCB object, and I was wondering if I could get some feedback on potential optimizations. My current implementation uses RGBD + state observations and PPO to learn the manipulation task.

I haven't been getting successful results, as shown in the videos below.

In the 40th video, the robot is at least able to grasp the object, but not yet lift it:

40.mp4

In the 62nd video (the last checkpoint), the robot is still unable to pick up the object and just awkwardly pushes it across the table:

62.mp4

My CNN processes the RGB and depth information together. I'm using standard convolutional layers with ReLU activations, but I'm not sure this architecture is optimal for depth processing. I opened a separate issue about that here: #794

import torch
import torch.nn as nn

class NatureCNN(nn.Module):
    def __init__(self, sample_obs):
        super().__init__()
        extractors = {}
        self.out_features = 0
        feature_size = 256

        if "rgbd" in sample_obs:
            # idea (not implemented yet): split the last dimension
            # (first 3 channels for RGB, last channel for depth)

            # CNN for RGBD input
            cnn = nn.Sequential(
                nn.Conv2d(
                    in_channels=16,  # full stacked RGBD channel count
                    out_channels=32,
                    kernel_size=8,
                    stride=4,
                    padding=0,
                ),
                nn.ReLU(),
                nn.Conv2d(
                    in_channels=32, out_channels=64, kernel_size=4, stride=2, padding=0
                ),
                nn.ReLU(),
                nn.Conv2d(
                    in_channels=64, out_channels=64, kernel_size=3, stride=1, padding=0
                ),
                nn.ReLU(),
                nn.Flatten(),
            )

            # compute the flattened RGBD feature dimension from a sample observation
            with torch.no_grad():
                n_flatten = cnn(sample_obs["rgbd"].float().permute(0, 3, 1, 2).cpu()).shape[1]
            fc = nn.Sequential(nn.Linear(n_flatten, feature_size), nn.ReLU())
            extractors["rgbd"] = nn.Sequential(cnn, fc)
            self.out_features += feature_size

        if "state" in sample_obs:
            # for state data we simply pass it through a single linear layer
            state_size = sample_obs["state"].shape[-1]
            extractors["state"] = nn.Linear(state_size, 256)
            self.out_features += 256

        ...
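
One alternative I've considered but haven't tested is splitting the RGB and depth channels into two separate conv stacks and concatenating the flattened features. Rough sketch below; the channel-splitting logic assumes the rgbd tensor is (B, H, W, C) with every group of 4 channels being 3 RGB + 1 depth per camera, which may not match my actual observation layout:

# Untested sketch of a two-stream RGBD encoder (assumes 3 RGB + 1 depth per camera)
class TwoStreamCNN(nn.Module):
    def __init__(self, sample_obs, feature_size=256):
        super().__init__()
        rgbd = sample_obs["rgbd"].float().permute(0, 3, 1, 2)  # (B, C, H, W)
        num_channels = rgbd.shape[1]
        self.rgb_idx = [i for i in range(num_channels) if i % 4 != 3]    # RGB channels
        self.depth_idx = [i for i in range(num_channels) if i % 4 == 3]  # depth channels

        def conv_stack(in_channels):
            return nn.Sequential(
                nn.Conv2d(in_channels, 32, kernel_size=8, stride=4),
                nn.ReLU(),
                nn.Conv2d(32, 64, kernel_size=4, stride=2),
                nn.ReLU(),
                nn.Conv2d(64, 64, kernel_size=3, stride=1),
                nn.ReLU(),
                nn.Flatten(),
            )

        self.rgb_cnn = conv_stack(len(self.rgb_idx))
        self.depth_cnn = conv_stack(len(self.depth_idx))
        # infer the flattened feature size from the sample observation
        with torch.no_grad():
            n_flatten = (
                self.rgb_cnn(rgbd[:, self.rgb_idx].cpu()).shape[1]
                + self.depth_cnn(rgbd[:, self.depth_idx].cpu()).shape[1]
            )
        self.fc = nn.Sequential(nn.Linear(n_flatten, feature_size), nn.ReLU())

    def forward(self, rgbd):
        x = rgbd.float().permute(0, 3, 1, 2)
        rgb_feat = self.rgb_cnn(x[:, self.rgb_idx])
        depth_feat = self.depth_cnn(x[:, self.depth_idx])
        return self.fc(torch.cat([rgb_feat, depth_feat], dim=1))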

My reward function combines several components covering lifting, finger control, and distance to the object. It is supposed to incentivize lifting the object above the table (lift_reward), maintaining a proper finger spread (finger_spread_reward), reducing the distance between the fingertips and the target object (hand_close_reward), centering the hand's approach (center_close_reward), and reaching a sufficient lift height (height_reward).

def compute_dense_reward(self, obs: Any, action: np.ndarray, info: Dict):

    ...

    # core reward component: object height above the table minus a 0.07 m offset
    lift_reward = ycb_xyz[:, 2] - self.table_height - 0.07
    total_reward += lift_reward * 15

    # finger control reward: encourage the fingertips to spread apart
    finger_positions = torch.stack([link.pose.p for link in self.right_hand_link], dim=1)
    finger_spread = torch.std(finger_positions, dim=1).mean(dim=-1)
    finger_spread_reward = torch.clamp(finger_spread * 5.0, 0, 1.0)
    total_reward += finger_spread_reward

    # distance-based rewards: bring the hand and its fingertip center close to the object
    hand_close_reward = torch.exp(-2.0 * right_hand_ycb_distance).mean(dim=-1)
    total_reward += hand_close_reward * 2.0

    center_close_reward = torch.exp(-2.0 * right_hand_tip_center_distance)
    # note: hand_close_reward is added again here, on top of the 2.0-weighted term above
    total_reward += hand_close_reward + center_close_reward

    # height-based reward: only active once the object is above the table surface
    height_diff = ycb_xyz[:, 2] - self.table_height
    height_reward = torch.where(height_diff > 0, height_diff * 10.0, torch.zeros_like(height_diff))
    total_reward += height_reward

    total_reward = total_reward.clamp(-self.max_reward, self.max_reward)

    # override the dense reward on terminal outcomes
    total_reward[info["success"]] = self.max_reward
    total_reward[info["fail"]] = -self.max_reward / 4

    ...
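
To see which of these terms actually dominates during training, I've been thinking of logging each component separately, roughly like this (debugging sketch, not in my training code yet; it assumes the same tensors computed above and that entries written into info end up in my logs):

    # Debugging sketch: record the batch-mean of each reward component so the
    # relative scales are visible. hand_close_reward shows up with total weight
    # 3.0 because it is added twice above (once with weight 2.0, once with 1.0).
    reward_terms = {
        "lift": lift_reward * 15,
        "finger_spread": finger_spread_reward,
        "hand_close": hand_close_reward * 3.0,
        "center_close": center_close_reward,
        "height": height_reward,
    }
    for name, term in reward_terms.items():
        info[f"reward/{name}"] = term.mean().item()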

Below is my evaluate function, where success requires the object's z-position to reach at least 1.25 m (an absolute world height, not an offset from the table's surface). The episode fails if either any hand link drops below the table surface plus a 0.02 m margin (a proxy for the fingers colliding with the table) or the object falls more than 0.02 m below the table surface. The function assigns sparse rewards: +5.0 (max_reward) for success and -1.25 (max_reward / 4) for failure.

def evaluate(self):
    if self.initialized:
        # gather the z-coordinates of all right-hand links and fingertip links
        right_hand_link_z = torch.concat(
            [
                link.pose.p[..., 2].unsqueeze(1)
                for link in self.right_hand_link + self.right_hand_tip_link
            ],
            dim=1,
        )

        # failure condition 1: any hand link drops below the table surface + 2 cm margin
        fail_collision_table = (right_hand_link_z < self.table_height + 0.02).any(dim=1)

        # failure condition 2: object falls more than 2 cm below the table surface
        fail_ycb_fall = self.ycb_object.pose.p[:, 2] < self.table_height - 0.02
        fail = fail_collision_table | fail_ycb_fall
    else:
        # environment not fully initialized yet, so nothing can fail
        fail = torch.zeros_like(self.ycb_object.pose.p[:, 0], dtype=torch.bool)

    # success condition: object's world z-position reaches 1.25 m (absolute, not relative to the table)
    success = self.ycb_object.pose.p[:, 2] >= 1.25

    # sparse reward assignment
    reward = torch.zeros_like(self.ycb_object.pose.p[:, 0], device=self.device)
    reward[success] = self.max_reward  # +5.0 for success
    reward[fail] = -self.max_reward / 4  # -1.25 for failure

    ...
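
For comparison, if I wanted the success check to be relative to the table surface (closer to how I originally described it) rather than an absolute world height, it would look something like this sketch, where the 0.25 m lift offset is just an example value and not something from my current code:

    # Sketch of a table-relative success check (not what I currently use)
    lift_offset = 0.25  # example value
    success = self.ycb_object.pose.p[:, 2] >= self.table_height + lift_offset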

Due to limited VRAM (8192 MiB total), I trained with total_timesteps=10,000,000, num_envs=64, num_eval_envs=2, num_steps=100, update_epochs=4, and the reconfiguration and eval reconfiguration frequencies both set to None. Enabling a reconfiguration frequency significantly increased my VRAM usage: every few reconfigurations, GPU memory usage would spike by another 500-1000 MiB. I'm not sure whether this is the memory leak described in #467.
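
For context, here is the rollout arithmetic these settings imply (just to show the data scale per PPO update):

# Quick arithmetic on my settings (not part of the training script)
num_envs = 64
num_steps = 100
total_timesteps = 10_000_000

transitions_per_rollout = num_envs * num_steps                   # 6,400 transitions per rollout
num_ppo_iterations = total_timesteps // transitions_per_rollout  # 1,562 rollout/update cycles
print(transitions_per_rollout, num_ppo_iterations)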

I'm mainly concerned about my reward structure, particularly compute_dense_reward in my ycb_env.py file, as well as my policy architecture for RGBD + state processing in the NatureCNN __init__ in my ppo_rgbd.py file. I'm still learning, so any feedback would be greatly appreciated!

Thank you!
