<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<link rel="stylesheet" href="assets/css/main.css">
<title>One-Shot Transfer of Long-Horizon Extrinsic Manipulation Through Contact Retargeting</title>
</head>
<body>
<div id="title_slide">
<div class="title_left">
<h1>One-Shot Transfer of Long-Horizon Extrinsic Manipulation Through Contact Retargeting</h1>
<div class="author-container">
<div class="author-name"><a href="" target="_blank">Albert Wu<sup>1</sup></a></div>
<div class="author-name"><a href="https://cs.stanford.edu/~rcwang/" target="_blank">Ruocheng Wang<sup>1</sup></a></div>
<div class="author-name"><a href="https://ericcsr.github.io" target="">Sirui Chen<sup>1</sup></a></div>
<div class="author-name"><a href="https://clemense.github.io" target="_blamk">Clemens Eppner<sup>2</sup></a></div>
<div class="author-name"><a href="https://profiles.stanford.edu/c-karen-liu" target="_blank">C. Karen Liu<sup>1</sup></a></div>
</div>
<div class="affiliation-container">
<div class="affiliation"><sup>1</sup>Stanford University, <sup>2</sup>NVIDIA</div>
</div>
</div>
</div>
<!-- <div class="affiliation">
<p><img src="assets/logos/SUSig-red.png" style="height: 50px"></p>
</div> -->
<div class="button-container">
<!-- <a href="assets/extrinsic_manip_paper.pdf" target="_blank" class="button"><i class="fa-light fa-file"></i> PDF</a> -->
<a href="https://arxiv.org/abs/2404.07468" target="_blank" class="button"><i class="ai ai-arxiv"></i> arXiv</a>
<!-- <a href="" target="_blank" class="button"><i class="fa-light fa-film"></i> Video</a>
<a href="https://arxiv.org/abs/2403.07788" target="_blank" class="button"><i class="fa-brands fa-x-twitter"></i> tl;dr</a> -->
<a href="https://youtu.be/K3iu3qO02m4?feature=shared" target="_blank" class="button"><i class="fa-light fa-film"></i> Video</a>
<a href="https://github.com/Stanford-TML/extrinsic_manipulation" target="_blank" class="button"><i class="fa-light fa-code"></i> Code</a>
<!-- <a href="https://drive.google.com/drive/folders/1VG8Dz_f5tfjf8w7tBG1Y2AAZ2gNv-RjT?usp=sharing" target="_blank" class="button"><i class="fa-light fa-face-smiling-hands"></i> Data</a>
<a href="https://docs.google.com/document/d/1ANxSA_PctkqFf3xqAkyktgBgDWEbrFK7b1OnJe54ltw/edit?usp=sharing" target="_blank" class="button"><i class="fa-light fa-robot-astromech"></i> Hardware</a> -->
</div>
<br>
<div class="slideshow-container">
<div class="video_container">
<video autoplay muted playsinline loop controls preload="metadata" width="100%">
<source src="assets/videos/video_1min.mp4" type="video/mp4">
</video>
</div>
</div>
<br>
<div id="abstract">
<h1>Abstract</h1>
<p>
Extrinsic manipulation, the use of environment contacts to achieve manipulation objectives, enables strategies that are otherwise impossible with a parallel jaw gripper.
However, orchestrating a long-horizon sequence of contact interactions between the robot, object, and environment is notoriously challenging due to scene diversity,
the large action space, and difficult contact dynamics. We observe that most extrinsic manipulations are combinations of short-horizon primitives,
each of which depends strongly on initializing from a desirable contact configuration to succeed.
Therefore, we propose to generalize one extrinsic manipulation trajectory to diverse objects and environments by retargeting contact requirements.
We prepare a single library of robust short-horizon, goal-conditioned primitive policies, and design a framework to compose state constraints stemming from the contact specifications of each primitive.
Given a test scene and a single demo prescribing the primitive sequence, our method enforces the state constraints on the test scene and finds intermediate goal states using inverse kinematics.
The goals are then tracked by the primitive policies. Using a 7+1 DoF robotic arm-gripper system, we achieved an overall success rate of 80.5% on hardware
over 4 long-horizon extrinsic manipulation tasks, each with up to 4 primitives. Our experiments cover 10 objects and 6 environment configurations.
We further show empirically that our method admits a wide range of demonstrations, and that contact retargeting is indeed the key to successfully combining primitives for long-horizon extrinsic manipulation.
</p>
</div>
<hr class="rounded">
<!-- <div id="video">
<h1>DexCap: A Portable Hand Motion Capture System</h1>
<br>
<br>
</div> -->
<div id="overview"> <!-- This is a legacy misnomer and is just the body of the website-->
<h1>Pipeline summary</h1>
<p>
We prepare a primitive library and define each primitive’s contact requirements offline.
Given a single demonstration, we identify the relative transforms between the initial and final object states of the primitives.
The transforms are first directly applied to the test scene initial object state via the <i>remap_x</i> subroutine.
The outputs are states that are unlikely to satisfy the contact requirements of the primitives in the test scene.
We then perform <i>retarget_x</i>, which modifies these states to satisfy the environment-object contact requirements of the primitives.
The outputs of <i>retarget_x</i> are the intermediate goals for each primitive.
Next, we run <i>retarget_q</i>, which finds the robot configuration that satisfies the contact requirements of the subsequent primitive.
We thereby obtain a sequence of intermediate goals and robot configurations in the test scene.
Finally, we execute the primitive policies using the intermediate goals and robot configurations to achieve the task in the test scene.
Please refer to our paper for more details.
</p>
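<p>
For concreteness, the following is a minimal Python sketch of this retargeting loop, assuming object states are represented as 4x4 homogeneous transforms.
The helpers <i>solve_object_ik</i> and <i>solve_robot_ik</i>, as well as the attribute names on each primitive, are illustrative placeholders rather than identifiers from the released code.
</p>
<pre><code>
def retarget_demo(demo_transforms, primitives, test_scene, x_init):
    """demo_transforms[i]: relative transform between the initial and final object
    states of primitive i in the demonstration (world-frame convention assumed).
    Returns the intermediate goals and robot start configurations for the test scene."""
    goals, robot_configs = [], []
    x = x_init                               # test-scene initial object state
    for T, prim in zip(demo_transforms, primitives):
        x_remap = T @ x                      # remap_x: apply the demo transform directly
        x_goal = solve_object_ik(x_remap, prim.env_contacts, test_scene)   # retarget_x
        q_start = solve_robot_ik(x, prim.robot_contacts, test_scene)       # retarget_q
        goals.append(x_goal)
        robot_configs.append(q_start)
        x = x_goal                           # the next primitive starts from this goal
    return goals, robot_configs
</code></pre>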
<div class="video_container">
<video autoplay muted playsinline loop controls preload="metadata" width="100%">
<source src="assets/videos/method_animation.mp4" type="video/mp4" alt="method oberview">
</video>
<div class="caption">
<p>Summary of our pipeline.
</p>
</div>
</div>
<br>
<hr class="rounded">
<h1>Primitive and contact retargeting implementation</h1>
<p>
Four primitives, "push," "pull," "pivot," and "grasp," are implemented in this project.
Each primitive is a short-horizon, goal-conditioned policy that takes in the current state and a goal state and outputs a sequence of actions.
The initial and goal states are states that satisfy the contact requirements of the primitive.
Here we summarize the primitives and their contact requirements.
Please refer to our paper and code for implementation details.
</p>
<div id="primitive_summary">
<div class="image_container" >
<img src="assets/images/primitive_summary.png" alt="summary of primitives">
<div class="caption">
<p>Summary of the primitives implemented in this project.</p>
</div>
</div>
</div>
<h2>Push primitive</h2>
<p>
The push primitive is a single reinforcement learning based policy trained in <a href="https://github.com/NVIDIA-Omniverse/IsaacGymEnvs"><cite>Isaac Gym</cite></a>.
The policy is trained to push any of the 7 standard objects and 3 short objects from any initial pose to any goal pose in the workspace.
We explicitly inform the policy of the object tested using one-hot encoding.
<br><br>
The robot-object contact is implicitly enforced by the policy, so there is no need to initialize the robot in a specific contact configuration.
This showcases the flexibility of our pipeline in handling the diverse contact requirements of each primitive.
This design choice also allows emergent behaviors in which the policy switches the robot-object contact to correct for tracking errors.
<br><br>
Other than requiring the object to be on the ground, there are no environment-object contact requirements.
</p>
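<p>
For illustration, one possible way to append the one-hot object encoding to the policy observation is sketched below.
The observation layout, its components, and the function name are assumptions, not details taken from the released training code.
</p>
<pre><code>
import numpy as np

NUM_OBJECTS = 10   # 7 standard objects + 3 short objects

def push_policy_observation(object_id, robot_state, object_pose, goal_pose):
    # A one-hot indicator tells the single policy which object it is pushing.
    one_hot = np.zeros(NUM_OBJECTS)
    one_hot[object_id] = 1.0
    return np.concatenate([robot_state, object_pose, goal_pose, one_hot])
</code></pre>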
<h2>Pull primitive</h2>
<p>
The pull primitive is a two-stage, hand-designed open loop policy that leverages operational space control (OSC).
Starting from an initial robot configuration where the robot is in the vicinity of the top of the object,
the robot first closes its gripper and moves downward toward the object to ensure contact is established.
The robot then moves the gripper horizontally to pull the object toward the goal position.
<br><br>
The robot-object contact required by the pull primitive is for the gripper to be approximately in contact with the top of the object.
To satisfy this, we compute the top rectangle of the object's bounding box and move the robot to the center of the rectangle.
<br><br>
Other than requiring the object to be on the ground, there are no environment-object contact requirements.
</p>
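<p>
A minimal sketch of how the pull targets could be computed from the object's bounding box is given below; the function and parameter names are illustrative and are not taken from the released code.
</p>
<pre><code>
import numpy as np

def pull_waypoints(top_face_corners, goal_xy, press_depth=0.01):
    """top_face_corners: (4, 3) corners of the bounding box's top rectangle, world frame."""
    corners = np.asarray(top_face_corners)
    top_center = corners.mean(axis=0)
    # Stage 1: close the gripper and press down on the center of the top rectangle
    # to establish the robot-object contact.
    press_target = top_center - np.array([0.0, 0.0, press_depth])
    # Stage 2: move horizontally toward the goal at the same height to drag the object.
    pull_target = np.array([goal_xy[0], goal_xy[1], press_target[2]])
    return press_target, pull_target
</code></pre>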
<h2>Pivot primitive</h2>
<p>
The pivot primitive is a three-stage, hand-designed feedback policy that uses OSC.
A lower edge of the object is required to be in contact with the wall and orthogonal to the wall normal.
The gripper fingertips are in contact with the object on the opposite side of the wall-object contact.
The robot first pushes the object toward the wall to establish contact using OSC.
Next, the gripper follows a <a href="https://en.wikipedia.org/wiki/Trammel_of_Archimedes">Trammel of Archimedes</a> path
whose parameters are given by the bounding box dimensions of the object. A contact force is maintained
in an impedance control manner by commanding a fixed tracking error in the radial direction of the arc.
Once the object has been pivoted, the robot breaks contact and clears the object by lifting the gripper upwards.
The pose of the object is tracked to ensure the object is pivoted by the correct angle of approximately π/2.
<br><br>
The robot-object contact is implemented as an intersection of two constraints in <a href="https://drake.mit.edu"><cite>Drake</cite></a>:
the distances between the object and the fingertips are zero;
the fingertips are within a cone that is centered at the object’s geometric center, has the wall’s normal as its axis, and has a half-angle of π/6.
<br><br>
The environment-object contact is implemented using the bounding box of the object.
Of the 4 bounding-box vertices that are closest to the wall, the 2 lowest ones must be on the wall,
and the distance between the wall and the object must be 0.
</p>
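<p>
Since a point on a trammel of Archimedes traces an ellipse, the fingertip targets can be generated as a quarter-ellipse as sketched below.
This is a simplified 2D sketch: the mapping from bounding-box dimensions to the semi-axes and the size of the radial offset are assumptions, not values from the released code.
</p>
<pre><code>
import numpy as np

def pivot_fingertip_path(a, b, n_waypoints=20, radial_offset=0.01):
    """Quarter-ellipse fingertip targets in the vertical plane orthogonal to the pivot
    edge, with the origin at that edge. The semi-axes a and b come from the object's
    bounding-box dimensions."""
    waypoints = []
    for theta in np.linspace(0.0, np.pi / 2.0, n_waypoints):
        x = a * np.cos(theta)   # horizontal distance from the pivot edge
        z = b * np.sin(theta)   # height above the ground
        # Bias the target slightly toward the pivot edge: the fixed radial tracking
        # error makes the OSC controller maintain a contact force on the object.
        r = np.hypot(x, z)
        scale = (r - radial_offset) / max(r, 1e-6)
        waypoints.append((scale * x, scale * z))
    return waypoints
</code></pre>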
<h2>Grasp primitive</h2>
<p>
The grasp primitive is a hand-designed policy that uses OSC.
The primitive begins with the gripper fully open above the object.
First, the robot descends to slot the object between the gripper fingers.
Due to potential pose estimation error, a hand-designed wiggling motion is performed to increase the success rate.
After the object is between the fingers, the gripper is closed and the object is lifted up.
<br><br>
To find the robot configuration that satisfies the contact requirements of the grasp primitive,
we compute the bounding box of the object's projection onto the ground plane.
The two fingers are aligned with the short side of the box and centered at the box's center.
The initial gripper height is set to the height of the object's bounding box plus a small tolerance.
<br><br>
Other than requiring the object to be on the ground, there are no environment-object contact requirements.
</p>
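<p>
A sketch of this computation is shown below. The inputs describe the ground-projected bounding box, and all names and the clearance value are illustrative assumptions rather than details of the released code.
</p>
<pre><code>
import numpy as np

def grasp_start_pose(footprint_center_xy, long_axis_yaw, box_height, clearance=0.02):
    """footprint_center_xy: center of the object's ground-projected bounding box.
    long_axis_yaw: yaw of the footprint's long side. box_height: bounding-box height."""
    # Close across the thinner dimension: the finger-to-finger axis is aligned with
    # the short side of the footprint, which is orthogonal to its long axis.
    finger_axis_yaw = long_axis_yaw + np.pi / 2.0
    # Start just above the object so the descent slots it between the open fingers.
    position = np.array([footprint_center_xy[0], footprint_center_xy[1],
                         box_height + clearance])
    return position, finger_axis_yaw
</code></pre>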
<br>
<hr class="rounded">
<h1>System setup</h1>
<h2>Object set</h2>
<div class="image_container">
<img src="assets/images/objects.jpg" alt="object set photo">
<div class="caption">
<p>
Objects used in this project. Standard objects are tested on all tasks.
Short and impossible objects are used for additional "occluded grasping" experiments.
</p>
</div>
</div>
<div class="image_container">
<img src="assets/images/object_properties.png" alt="object properties">
<div class="caption">
<p>Mass and approximate dimensions of the objects used in the experiments.</p>
</div>
</div>
<!-- <div class="video_container">
<img src="assets/images/object_properties.png" alt="object properties">
</div> -->
<h2>Pose estimation pipeline</h2>
<p>
Our pipeline takes in an RGB image, a prespecified text description, and a textured mesh of the object.
It outputs the 6D pose of the object.
To obtain a pose estimate from scratch, we perform the following steps:
<ol>
<li>The prespecified text description of the object is given to
<a href="https://huggingface.co/docs/transformers/model_doc/owlvit"><cite>OWL-ViT</cite></a>
to obtain a bounding box of the object.</li>
<li>The bounding box is given to
<a href="https://segment-anything.com/"><cite>Segment Anything</cite></a>
to produce a segmentation mask of the object.</li>
<li><a href="https://megapose6d.github.io/"><cite>Megapose</cite></a>
uses the segmented object to produce an initial pose estimate.</li>
<li>Subsequent pose tracking is done using only the "refiner" component of <a href="https://megapose6d.github.io/"><cite>Megapose</cite></a>. The last estimated pose is used as the initial guess.</li>
</ol>
</p>
<p>
Steps 1-3 are only run when a guess of the object pose is unavailable, i.e., at pipeline initialization or when the object is lost.
On our workstation with an Intel i9-13900K CPU and an NVIDIA GeForce RTX 4090 GPU, steps 1-3 typically take a few seconds to complete,
and step 4 runs at a frame rate of 8-12 Hz.
The pipeline automatically detects when the object is lost using the Megapose refiner's "pose score".
If the score is too low, the entire pipeline is rerun.
</p>
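<p>
The loop below sketches how these steps fit together. The helper functions are hypothetical wrappers around OWL-ViT, Segment Anything, and Megapose rather than real library calls, and the score threshold is an assumed value; the pipeline only requires that a score that is too low triggers re-initialization.
</p>
<pre><code>
POSE_SCORE_THRESHOLD = 0.5   # assumed value

def track_object(rgb_frames, text_description, textured_mesh):
    pose = None
    for rgb in rgb_frames:
        if pose is None:
            box = detect_object(rgb, text_description)             # step 1: OWL-ViT
            mask = segment_object(rgb, box)                        # step 2: Segment Anything
            pose, score = estimate_pose(rgb, mask, textured_mesh)  # step 3: Megapose
        else:
            pose, score = refine_pose(rgb, pose, textured_mesh)    # step 4: refiner only
        if score >= POSE_SCORE_THRESHOLD:
            yield pose                                             # steady state: 8-12 Hz
        else:
            pose = None                                            # object lost: rerun steps 1-3
            yield None
</code></pre>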
<div class="video_container" id="pose_estimation">
<video autoplay muted playsinline loop controls preload="metadata" width="100%">
<source src="assets/videos/vision_pipeline_video.mp4" type="video/mp4" alt="pipeline overview">
</video>
<div class="caption">
<p>
Pose estimation pipeline output at 1x speed.
Step 3 outputs are shown in blue.
Step 4 outputs are shown in green.
</p>
</div>
</div>
<!-- <div class="allegrofail">
<div class="video_container">
<video autoplay muted playsinline loop controls preload="metadata">
<source src="assets/system.mp4" type="video/mp4">
</video>
<video autoplay muted playsinline loop controls preload="metadata">
<source src="assets/system.mp4" type="video/mp4">
</video>
<video autoplay muted playsinline loop controls preload="metadata">
<source src="assets/system.mp4" type="video/mp4">
</video>
<video autoplay muted playsinline loop controls preload="metadata">
<source src="assets/system.mp4" type="video/mp4">
</video>
</div>
</div> -->
<br>
<hr class="rounded">
<h1>Results</h1>
<!-- <h2>Overview</h2> -->
<p>
We evaluate our framework on 4 real-world extrinsic manipulation tasks: "obstacle avoidance,"
"object storage," "occluded grasping," and "object retrieval."
Various environments are used for the demonstrations and tests to showcase our method's robustness against environment changes.
All demonstrations are collected on cracker.
Every task is evaluated on the 7 standard objects, each with 5 trials.
Additionally, occluded grasping is evaluated on the 3 short objects with an extra "pull" step.
<br><br>
Our method achieved an overall success rate of <b>80.5%</b> (<b>81.7%</b> for standard objects).
Despite not being tailored to "occluded grasping," we outperformed the 2022 deep-reinforcement-learning method
<a href="https://sites.google.com/view/grasp-ungraspable"><cite>Learning to Grasp the Ungraspable with Emergent Extrinsic Dexterity</cite></a>,
both when the initial object state is against the wall (<b>88.6%</b> vs. 78%) and away from it (<b>77.1%</b> vs. 56%).
<br><br>
To show that our method is agnostic to the specific demonstration, we collected demos for grasping on oat and the 3 impossible objects, which are unlikely to be graspable by the robot.
We then retargeted all demos onto cracker from 5 different initial poses and achieved a 100% success rate across 20 trials.
</p>
<div class="image_container" >
<img src="assets/images/results_main.png" alt="Main results">
<div class="caption">
<p>Summary of experiments on 7 standard objects.</p>
</div>
</div>
<div id="additional_grasping">
<div class="image_container" >
<img src="assets/images/short_object_results.png" alt="Additional occluded grasping results">
<div class="caption">
<p>Summary of additional "occluded grasping" experiments.</p>
</div>
</div>
</div>
<p>
Below we show one successful instance of each task-object combination. <b>All videos are at 1x speed.</b>
</p>
<h2>Obstacle avoidance</h2>
<p><i>Push</i> the object forward, switch contact and <i>push</i> again to avoid the obstacle.</p>
<div class="taskvideos">
<div class="video_container">
<video muted playsinline loop controls preload="metadata">
<source src="assets/videos/720p/demo_avoidance.mp4" type="video/mp4">
</video>
<div class="caption">
<p>Demonstration</p>
</div>
</div>
<div class="video_container">
<video muted playsinline loop controls preload="metadata">
<source src="assets/videos/720p/push_push_cereal.mp4" type="video/mp4">
</video>
<div class="caption">
<p>Cereal</p>
</div>
</div>
<div class="video_container">
<video muted playsinline loop controls preload="metadata">
<source src="assets/videos/720p/push_push_cocoa.mp4" type="video/mp4">
</video>
<div class="caption">
<p>Cocoa</p>
</div>
</div>
<div class="video_container">
<video muted playsinline loop controls preload="metadata">
<source src="assets/videos/720p/push_push_cracker.mp4" type="video/mp4">
</video>
<div class="caption">
<p>Cracker</p>
</div>
</div>
</div>
<div class="taskvideos">
<div class="video_container">
<video muted playsinline loop controls preload="metadata">
<source src="assets/videos/720p/push_push_flapjack.mp4" type="video/mp4">
</video>
<div class="caption">
<p>Flapjack</p>
</div>
</div>
<div class="video_container">
<video muted playsinline loop controls preload="metadata">
<source src="assets/videos/720p/push_push_oat.mp4" type="video/mp4">
</video>
<div class="caption">
<p>Oat</p>
</div>
</div>
<div class="video_container">
<video muted playsinline loop controls preload="metadata">
<source src="assets/videos/720p/push_push_seasoning.mp4" type="video/mp4">
</video>
<div class="caption">
<p>Seasoning</p>
</div>
</div>
<div class="video_container">
<video muted playsinline loop controls preload="metadata">
<source src="assets/videos/720p/push_push_wafer.mp4" type="video/mp4">
</video>
<div class="caption">
<p>Wafer</p>
</div>
</div>
</div>
<h2>Object storage</h2>
<p>
<i>Push</i> an object toward the wall, <i>pivot</i> to align with an opening between the wall and the object, then <i>pull</i> it into the opening for storage.
</p>
<div class="taskvideos">
<div class="video_container">
<video muted playsinline loop controls preload="metadata">
<source src="assets/videos/720p/demo_storage.mp4" type="video/mp4">
</video>
<div class="caption">
<p>Demonstration</p>
</div>
</div>
<div class="video_container">
<video muted playsinline loop controls preload="metadata">
<source src="assets/videos/720p/push_pivot_pull_cereal.mp4" type="video/mp4">
</video>
<div class="caption">
<p>Cereal</p>
</div>
</div>
<div class="video_container">
<video muted playsinline loop controls preload="metadata">
<source src="assets/videos/720p/push_pivot_pull_cocoa.mp4" type="video/mp4">
</video>
<div class="caption">
<p>Cocoa</p>
</div>
</div>
<div class="video_container">
<video muted playsinline loop controls preload="metadata">
<source src="assets/videos/720p/push_pivot_pull_cracker.mp4" type="video/mp4">
</video>
<div class="caption">
<p>Cracker</p>
</div>
</div>
</div>
<div class="taskvideos">
<div class="video_container">
<video muted playsinline loop controls preload="metadata">
<source src="assets/videos/720p/push_pivot_pull_flapjack.mp4" type="video/mp4">
</video>
<div class="caption">
<p>Flapjack</p>
</div>
</div>
<div class="video_container">
<video muted playsinline loop controls preload="metadata">
<source src="assets/videos/720p/push_pivot_pull_oat.mp4" type="video/mp4">
</video>
<div class="caption">
<p>Oat</p>
</div>
</div>
<div class="video_container">
<video muted playsinline loop controls preload="metadata">
<source src="assets/videos/720p/push_pivot_pull_seasoning.mp4" type="video/mp4">
</video>
<div class="caption">
<p>Seasoning</p>
</div>
</div>
<div class="video_container">
<video muted playsinline loop controls preload="metadata">
<source src="assets/videos/720p/push_pivot_pull_wafer.mp4" type="video/mp4">
</video>
<div class="caption">
<p>Wafer</p>
</div>
</div>
</div>
<h2>Occluded grasping</h2>
<p>
<i>Push</i> the object in an ungraspable pose toward the wall, <i>pivot</i> it to expose a graspable edge, and <i>grasp</i> it.
</p>
<div class="taskvideos">
<div class="video_container">
<video muted playsinline loop controls preload="metadata">
<source src="assets/videos/720p/demo_grasping.mp4" type="video/mp4">
</video>
<div class="caption">
<p>Demonstration</p>
</div>
</div>
<div class="video_container">
<video muted playsinline loop controls preload="metadata">
<source src="assets/videos/720p/push_pivot_grasp_cereal.mp4" type="video/mp4">
</video>
<div class="caption">
<p>Cereal</p>
</div>
</div>
<div class="video_container">
<video muted playsinline loop controls preload="metadata">
<source src="assets/videos/720p/push_pivot_grasp_cocoa.mp4" type="video/mp4">
</video>
<div class="caption">
<p>Cocoa</p>
</div>
</div>
<div class="video_container">
<video muted playsinline loop controls preload="metadata">
<source src="assets/videos/720p/push_pivot_grasp_cracker.mp4" type="video/mp4">
</video>
<div class="caption">
<p>Cracker</p>
</div>
</div>
</div>
<div class="taskvideos">
<div class="video_container">
<video muted playsinline loop controls preload="metadata">
<source src="assets/videos/720p/push_pivot_grasp_flapjack.mp4" type="video/mp4">
</video>
<div class="caption">
<p>Flapjack</p>
</div>
</div>
<div class="video_container">
<video muted playsinline loop controls preload="metadata">
<source src="assets/videos/720p/push_pivot_grasp_oat.mp4" type="video/mp4">
</video>
<div class="caption">
<p>Oat</p>
</div>
</div>
<div class="video_container">
<video muted playsinline loop controls preload="metadata">
<source src="assets/videos/720p/push_pivot_grasp_seasoning.mp4" type="video/mp4">
</video>
<div class="caption">
<p>Seasoning</p>
</div>
</div>
<div class="video_container">
<video muted playsinline loop controls preload="metadata">
<source src="assets/videos/720p/push_pivot_grasp_wafer.mp4" type="video/mp4">
</video>
<div class="caption">
<p>Wafer</p>
</div>
</div>
</div>
<h2>Occluded grasping (short objects)</h2>
<p>
<i>Push</i> the object in an ungraspable pose toward the wall, <i>pivot</i> it to expose a graspable edge, <i>pull</i> to create space between the wall and the object for inserting the gripper,
and <i>grasp</i> it.
</p>
<div class="taskvideos">
<div class="video_container">
<video muted playsinline loop controls preload="metadata">
<source src="assets/videos/720p/demo_short_grasp.mp4" type="video/mp4">
</video>
<div class="caption">
<p>Demonstration</p>
</div>
</div>
<div class="video_container">
<video muted playsinline loop controls preload="metadata">
<source src="assets/videos/720p/push_pivot_pull_grasp_camera.mp4" type="video/mp4">
</video>
<div class="caption">
<p>Camera</p>
</div>
</div>
<div class="video_container">
<video muted playsinline loop controls preload="metadata">
<source src="assets/videos/720p/push_pivot_pull_grasp_meat.mp4" type="video/mp4">
</video>
<div class="caption">
<p>Meat</p>
</div>
</div>
<div class="video_container">
<video muted playsinline loop controls preload="metadata">
<source src="assets/videos/720p/push_pivot_pull_grasp_onion.mp4" type="video/mp4">
</video>
<div class="caption">
<p>Onion</p>
</div>
</div>
</div>
<h2>Object retrieval</h2>
<p>
<i>Pull</i> the object from between two obstacles, <i>push</i> toward the wall, <i>pivot</i> it to expose a graspable edge, and <i>grasp</i> it.
</p>
<div class="taskvideos">
<div class="video_container">
<video muted playsinline loop controls preload="metadata">
<source src="assets/videos/720p/demo_retrieval.mp4" type="video/mp4">
</video>
<div class="caption">
<p>Demonstration</p>
</div>
</div>
<div class="video_container">
<video muted playsinline loop controls preload="metadata">
<source src="assets/videos/720p/pull_push_pivot_grasp_cereal.mp4" type="video/mp4">
</video>
<div class="caption">
<p>Cereal</p>
</div>
</div>
<div class="video_container">
<video muted playsinline loop controls preload="metadata">
<source src="assets/videos/720p/pull_push_pivot_grasp_cocoa.mp4" type="video/mp4">
</video>
<div class="caption">
<p>Cocoa</p>
</div>
</div>
<div class="video_container">
<video muted playsinline loop controls preload="metadata">
<source src="assets/videos/720p/pull_push_pivot_grasp_cracker.mp4" type="video/mp4">
</video>
<div class="caption">
<p>Cracker</p>
</div>
</div>
</div>
<div class="taskvideos">
<div class="video_container">
<video muted playsinline loop controls preload="metadata">
<source src="assets/videos/720p/pull_push_pivot_grasp_flapjack.mp4" type="video/mp4">
</video>
<div class="caption">
<p>Flapjack</p>
</div>
</div>
<div class="video_container">
<video muted playsinline loop controls preload="metadata">
<source src="assets/videos/720p/pull_push_pivot_grasp_oat.mp4" type="video/mp4">
</video>
<div class="caption">
<p>Oat</p>
</div>
</div>
<div class="video_container">
<video muted playsinline loop controls preload="metadata">
<source src="assets/videos/720p/pull_push_pivot_grasp_seasoning.mp4" type="video/mp4">
</video>
<div class="caption">
<p>Seasoning</p>
</div>
</div>
<div class="video_container">
<video muted playsinline loop controls preload="metadata">
<source src="assets/videos/720p/pull_push_pivot_grasp_wafer.mp4" type="video/mp4">
</video>
<div class="caption">
<p>Wafer</p>
</div>
</div>
</div>
<br>
<hr class="rounded">
<h1>Conclusion</h1>
<p>
This work presents a framework for generalizing long-horizon extrinsic manipulation from a single demonstration.
Our method retargets the demonstration trajectory to the test scene by enforcing contact constraints with IK at every contact switch.
The retargeted trajectory is then tracked with a sequence of short-horizon policies for each contact configuration.
Our method achieved an overall success rate of 81.7% on real-world objects over 4 challenging long-horizon extrinsic manipulation tasks.
Additional experiments show that contact retargeting is crucial to successfully transferring such long-horizon plans, and that a wide range of demonstrations can be successfully retargeted with our pipeline.
</p>
<h1>BibTeX</h1>
<p class="bibtex">
Coming soon
</p>
</div>
<footer class="footer">
<div class="w-container">
<p>
Website template adapted from <a href="https://github.com/nerfies/nerfies.github.io">NeRFies</a>, <a href="https://peract.github.io/">PerAct</a>, and <a href="https://dex-cap.github.io/">DexCap</a>.
</p>
<!-- <div class="columns is-centered">
<div class="column">
<div class="content has-text-centered">
</div>
</div>
</div> -->
</div>
</footer>
</body>
<script src="assets/js/full_screen_video.js"></script>
<script src="assets/js/carousel.js"></script>
</html>