[Question] Advice on training pick policy using allegro hand/arm #797

CreativeNick opened this issue Jan 10, 2025 · 0 comments

I'm trying to write a pick policy for an Allegro hand/arm robot to pick up a randomly selected and randomly positioned YCB object, and I was wondering if I could get some feedback on potential optimizations. My current implementation uses RGBD + state observations and PPO to learn the manipulation task.

I haven't been getting successful results, as shown in the videos below.

In the 40th video, the robot is at least able to grasp the object, but not yet lift it:

40.mp4

In the 62nd video (the last checkpoint), the robot is still unable to pick up the object and just awkwardly pushes it across the table:

62.mp4

My CNN processes the RGB and depth information together. I'm using standard convolutional layers with ReLU activations, but I'm not sure this architecture is optimal for depth processing. I opened a separate issue about that here: #794

import torch
import torch.nn as nn

class NatureCNN(nn.Module):
    def __init__(self, sample_obs):
        super().__init__()
        extractors = {}
        self.out_features = 0
        feature_size = 256

        if "rgbd" in sample_obs:
            # idea (not implemented yet): split the last dimension
            # (first 3 channels for RGB, last channel for depth)

            # CNN for RGBD input
            cnn = nn.Sequential(
                nn.Conv2d(
                    in_channels=16,  # full stacked RGBD channel count
                    out_channels=32,
                    kernel_size=8,
                    stride=4,
                    padding=0,
                ),
                nn.ReLU(),
                nn.Conv2d(
                    in_channels=32, out_channels=64, kernel_size=4, stride=2, padding=0
                ),
                nn.ReLU(),
                nn.Conv2d(
                    in_channels=64, out_channels=64, kernel_size=3, stride=1, padding=0
                ),
                nn.ReLU(),
                nn.Flatten(),
            )

            # compute the flattened RGBD feature dimension from a sample observation
            with torch.no_grad():
                n_flatten = cnn(sample_obs["rgbd"].float().permute(0, 3, 1, 2).cpu()).shape[1]
            fc = nn.Sequential(nn.Linear(n_flatten, feature_size), nn.ReLU())
            extractors["rgbd"] = nn.Sequential(cnn, fc)
            self.out_features += feature_size

        if "state" in sample_obs:
            # for state data we simply pass it through a single linear layer
            state_size = sample_obs["state"].shape[-1]
            extractors["state"] = nn.Linear(state_size, 256)
            self.out_features += 256

        ...
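
One alternative I've considered but haven't tested is splitting the RGB and depth channels into two separate conv stacks and concatenating the flattened features. Rough sketch below; the channel-splitting logic assumes the rgbd tensor is (B, H, W, C) with every group of 4 channels being 3 RGB + 1 depth per camera, which may not match my actual observation layout:

# Untested sketch of a two-stream RGBD encoder (assumes 3 RGB + 1 depth per camera)
class TwoStreamCNN(nn.Module):
    def __init__(self, sample_obs, feature_size=256):
        super().__init__()
        rgbd = sample_obs["rgbd"].float().permute(0, 3, 1, 2)  # (B, C, H, W)
        num_channels = rgbd.shape[1]
        self.rgb_idx = [i for i in range(num_channels) if i % 4 != 3]    # RGB channels
        self.depth_idx = [i for i in range(num_channels) if i % 4 == 3]  # depth channels

        def conv_stack(in_channels):
            return nn.Sequential(
                nn.Conv2d(in_channels, 32, kernel_size=8, stride=4),
                nn.ReLU(),
                nn.Conv2d(32, 64, kernel_size=4, stride=2),
                nn.ReLU(),
                nn.Conv2d(64, 64, kernel_size=3, stride=1),
                nn.ReLU(),
                nn.Flatten(),
            )

        self.rgb_cnn = conv_stack(len(self.rgb_idx))
        self.depth_cnn = conv_stack(len(self.depth_idx))
        # infer the flattened feature size from the sample observation
        with torch.no_grad():
            n_flatten = (
                self.rgb_cnn(rgbd[:, self.rgb_idx].cpu()).shape[1]
                + self.depth_cnn(rgbd[:, self.depth_idx].cpu()).shape[1]
            )
        self.fc = nn.Sequential(nn.Linear(n_flatten, feature_size), nn.ReLU())

    def forward(self, rgbd):
        x = rgbd.float().permute(0, 3, 1, 2)
        rgb_feat = self.rgb_cnn(x[:, self.rgb_idx])
        depth_feat = self.depth_cnn(x[:, self.depth_idx])
        return self.fc(torch.cat([rgb_feat, depth_feat], dim=1))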

My reward function combines several components covering lifting, finger control, and distance to the object. It is supposed to incentivize lifting the object above the table (lift_reward), maintaining a proper finger spread (finger_spread_reward), reducing the distance between the fingertips and the target object (hand_close_reward), centering the hand's approach (center_close_reward), and reaching a sufficient lift height (height_reward).

def compute_dense_reward(self, obs: Any, action: np.ndarray, info: Dict):

    ...

    # core reward component: object height above the table minus a 0.07 m offset
    lift_reward = ycb_xyz[:, 2] - self.table_height - 0.07
    total_reward += lift_reward * 15

    # finger control reward: encourage the fingertips to spread apart
    finger_positions = torch.stack([link.pose.p for link in self.right_hand_link], dim=1)
    finger_spread = torch.std(finger_positions, dim=1).mean(dim=-1)
    finger_spread_reward = torch.clamp(finger_spread * 5.0, 0, 1.0)
    total_reward += finger_spread_reward

    # distance-based rewards: bring the hand and its fingertip center close to the object
    hand_close_reward = torch.exp(-2.0 * right_hand_ycb_distance).mean(dim=-1)
    total_reward += hand_close_reward * 2.0

    center_close_reward = torch.exp(-2.0 * right_hand_tip_center_distance)
    # note: hand_close_reward is added again here, on top of the 2.0-weighted term above
    total_reward += hand_close_reward + center_close_reward

    # height-based reward: only active once the object is above the table surface
    height_diff = ycb_xyz[:, 2] - self.table_height
    height_reward = torch.where(height_diff > 0, height_diff * 10.0, torch.zeros_like(height_diff))
    total_reward += height_reward

    total_reward = total_reward.clamp(-self.max_reward, self.max_reward)

    # override the dense reward on terminal outcomes
    total_reward[info["success"]] = self.max_reward
    total_reward[info["fail"]] = -self.max_reward / 4

    ...
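
To see which of these terms actually dominates during training, I've been thinking of logging each component separately, roughly like this (debugging sketch, not in my training code yet; it assumes the same tensors computed above and that entries written into info end up in my logs):

    # Debugging sketch: record the batch-mean of each reward component so the
    # relative scales are visible. hand_close_reward shows up with total weight
    # 3.0 because it is added twice above (once with weight 2.0, once with 1.0).
    reward_terms = {
        "lift": lift_reward * 15,
        "finger_spread": finger_spread_reward,
        "hand_close": hand_close_reward * 3.0,
        "center_close": center_close_reward,
        "height": height_reward,
    }
    for name, term in reward_terms.items():
        info[f"reward/{name}"] = term.mean().item()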

Below is my evaluate function, where success requires the object's z-position to reach at least 1.25 m (an absolute world height, not an offset from the table's surface). The episode fails if either any hand link drops below the table surface plus a 0.02 m margin (a proxy for the fingers colliding with the table) or the object falls more than 0.02 m below the table surface. The function assigns sparse rewards: +5.0 (max_reward) for success and -1.25 (max_reward / 4) for failure.

def evaluate(self):
    if self.initialized:
        # gather the z-coordinates of all right-hand links and fingertip links
        right_hand_link_z = torch.concat(
            [
                link.pose.p[..., 2].unsqueeze(1)
                for link in self.right_hand_link + self.right_hand_tip_link
            ],
            dim=1,
        )

        # failure condition 1: any hand link drops below the table surface + 2 cm margin
        fail_collision_table = (right_hand_link_z < self.table_height + 0.02).any(dim=1)

        # failure condition 2: object falls more than 2 cm below the table surface
        fail_ycb_fall = self.ycb_object.pose.p[:, 2] < self.table_height - 0.02
        fail = fail_collision_table | fail_ycb_fall
    else:
        # environment not fully initialized yet, so nothing can fail
        fail = torch.zeros_like(self.ycb_object.pose.p[:, 0], dtype=torch.bool)

    # success condition: object's world z-position reaches 1.25 m (absolute, not relative to the table)
    success = self.ycb_object.pose.p[:, 2] >= 1.25

    # sparse reward assignment
    reward = torch.zeros_like(self.ycb_object.pose.p[:, 0], device=self.device)
    reward[success] = self.max_reward  # +5.0 for success
    reward[fail] = -self.max_reward / 4  # -1.25 for failure

    ...
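
For comparison, if I wanted the success check to be relative to the table surface (closer to how I originally described it) rather than an absolute world height, it would look something like this sketch, where the 0.25 m lift offset is just an example value and not something from my current code:

    # Sketch of a table-relative success check (not what I currently use)
    lift_offset = 0.25  # example value
    success = self.ycb_object.pose.p[:, 2] >= self.table_height + lift_offset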

Due to limited VRAM (8192 MiB total), I trained with total_timesteps=10,000,000, num_envs=64, num_eval_envs=2, num_steps=100, update_epochs=4, and the reconfiguration and eval reconfiguration frequencies both set to None. Enabling a reconfiguration frequency significantly increased my VRAM usage: every few reconfigurations, GPU memory usage would spike by another 500-1000 MiB. I'm not sure whether this is the memory leak described in #467.
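
For context, here is the rollout arithmetic these settings imply (just to show the data scale per PPO update):

# Quick arithmetic on my settings (not part of the training script)
num_envs = 64
num_steps = 100
total_timesteps = 10_000_000

transitions_per_rollout = num_envs * num_steps                   # 6,400 transitions per rollout
num_ppo_iterations = total_timesteps // transitions_per_rollout  # 1,562 rollout/update cycles
print(transitions_per_rollout, num_ppo_iterations)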

I'm mainly concerned about my reward structure, particularly compute_dense_reward in my ycb_env.py file, as well as my policy architecture for RGBD + state processing in the NatureCNN __init__ in my ppo_rgbd.py file. I'm still learning, so any feedback would be greatly appreciated!

Thank you!
