<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<link rel="stylesheet" href="assets/css/main.css">
<title>One-Shot Transfer of Long-Horizon Extrinsic Manipulation Through Contact Retargeting</title>
</head>
<body>
<div id="title_slide">
<div class="title_left">
<h1>One-Shot Transfer of Long-Horizon Extrinsic Manipulation Through Contact Retargeting</h1>
<div class="author-container">
<div class="author-name"><a href="" target="_blank">Albert Wu<sup>1</sup></a></div>
<div class="author-name"><a href="https://cs.stanford.edu/~rcwang/" target="_blank">Ruocheng Wang<sup>1</sup></a></div>
<div class="author-name"><a href="https://ericcsr.github.io" target="">Sirui Chen<sup>1</sup></a></div>
<div class="author-name"><a href="https://clemense.github.io" target="_blamk">Clemens Eppner<sup>2</sup></a></div>
<div class="author-name"><a href="https://profiles.stanford.edu/c-karen-liu" target="_blank">C. Karen Liu<sup>1</sup></a></div>
</div>
<div class="affiliation-container">
<div class="affiliation"><sup>1</sup>Stanford University, <sup>2</sup>NVIDIA</div>
</div>
</div>
</div>
<!-- <div class="affiliation">
<p><img src="assets/logos/SUSig-red.png" style="height: 50px"></p>
</div> -->
<div class="button-container">
<!-- <a href="assets/extrinsic_manip_paper.pdf" target="_blank" class="button"><i class="fa-light fa-file"></i> PDF</a> -->
<a href="https://arxiv.org/abs/2404.07468" target="_blank" class="button"><i class="ai ai-arxiv"></i> arXiv</a>
<!-- <a href="" target="_blank" class="button"><i class="fa-light fa-film"></i> Video</a>
<a href="https://arxiv.org/abs/2403.07788" target="_blank" class="button"><i class="fa-brands fa-x-twitter"></i> tl;dr</a> -->
<a href="https://youtu.be/K3iu3qO02m4?feature=shared" target="_blank" class="button"><i class="fa-light fa-film"></i> Video</a>
<a href="https://github.com/Stanford-TML/extrinsic_manipulation" target="_blank" class="button"><i class="fa-light fa-code"></i> Code</a>
<!-- <a href="https://drive.google.com/drive/folders/1VG8Dz_f5tfjf8w7tBG1Y2AAZ2gNv-RjT?usp=sharing" target="_blank" class="button"><i class="fa-light fa-face-smiling-hands"></i> Data</a>
<a href="https://docs.google.com/document/d/1ANxSA_PctkqFf3xqAkyktgBgDWEbrFK7b1OnJe54ltw/edit?usp=sharing" target="_blank" class="button"><i class="fa-light fa-robot-astromech"></i> Hardware</a> -->
</div>
<br>
<div class="slideshow-container">
<div class="video_container">
<video autoplay muted playsinline loop controls preload="metadata" width="100%">
<source src="assets/videos/video_1min.mp4" type="video/mp4">
</video>
</div>
</div>
<br>
<div id="abstract">
<h1>Abstract</h1>
<p>
Extrinsic manipulation, the use of environment contacts to achieve manipulation objectives, enables strategies that are otherwise impossible with a parallel jaw gripper.
However, orchestrating a long-horizon sequence of contact interactions between the robot, object, and environment is notoriously challenging due to scene diversity,
the large action space, and difficult contact dynamics. We observe that most extrinsic manipulations are combinations of short-horizon primitives,
each of which depends strongly on initializing from a desirable contact configuration to succeed.
Therefore, we propose to generalize one extrinsic manipulation trajectory to diverse objects and environments by retargeting contact requirements.
We prepare a single library of robust short-horizon, goal-conditioned primitive policies, and design a framework to compose state constraints stemming from the contact specifications of each primitive.
Given a test scene and a single demo prescribing the primitive sequence, our method enforces the state constraints on the test scene and finds intermediate goal states using inverse kinematics.
The goals are then tracked by the primitive policies. Using a 7+1 DoF robotic arm-gripper system, we achieved an overall success rate of 80.5% on hardware
over 4 long-horizon extrinsic manipulation tasks, each with up to 4 primitives. Our experiments cover 10 objects and 6 environment configurations.
We further show empirically that our method admits a wide range of demonstrations, and that contact retargeting is indeed the key to successfully combining primitives for long-horizon extrinsic manipulation.
</p>
</div>
<hr class="rounded">
<!-- <div id="video">
<h1>DexCap: A Portable Hand Motion Capture System</h1>
<br>
<br>
</div> -->
<div id="overview"> <!-- This is a legacy misnomer and is just the body of the website-->
<h1>Pipeline summary</h1>
<p>
We prepare a primitive library and define each primitive’s contact requirements offline.
Given a single demonstration, we identify the relative transforms between the initial and final object states of the primitives.
The transforms are first directly applied to the test scene initial object state via the <i>remap_x</i> subroutine.
The outputs are states that are unlikely to satisfy the contact requirements of the primitives in the test scene.
We then perform <i>retarget_x</i>, which modifies these states to satisfy the environment-object contact requirements of the primitives.
The outputs of <i>retarget_x</i> are the intermediate goals for each primitive.
Next, we run <i>retarget_q</i>, which finds the robot configuration that satisfies the contact requirements of the subsequent primitive.
We thereby obtain a sequence of intermediate goals and robot configurations in the test scene.
Finally, we execute the primitive policies using the intermediate goals and robot configurations to achieve the task in the test scene.
Please refer to our paper for more details.
</p>
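<p>
For concreteness, the following is a minimal Python sketch of this retargeting loop, assuming object states are represented as 4x4 homogeneous transforms.
The helpers <i>solve_object_ik</i> and <i>solve_robot_ik</i>, as well as the attribute names on each primitive, are illustrative placeholders rather than identifiers from the released code.
</p>
<pre><code>
def retarget_demo(demo_transforms, primitives, test_scene, x_init):
    """demo_transforms[i]: relative transform between the initial and final object
    states of primitive i in the demonstration (world-frame convention assumed).
    Returns the intermediate goals and robot start configurations for the test scene."""
    goals, robot_configs = [], []
    x = x_init                               # test-scene initial object state
    for T, prim in zip(demo_transforms, primitives):
        x_remap = T @ x                      # remap_x: apply the demo transform directly
        x_goal = solve_object_ik(x_remap, prim.env_contacts, test_scene)   # retarget_x
        q_start = solve_robot_ik(x, prim.robot_contacts, test_scene)       # retarget_q
        goals.append(x_goal)
        robot_configs.append(q_start)
        x = x_goal                           # the next primitive starts from this goal
    return goals, robot_configs
</code></pre>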
<div class="video_container">
<video autoplay muted playsinline loop controls preload="metadata" width="100%">
<source src="assets/videos/method_animation.mp4" type="video/mp4" alt="method oberview">
</video>
<div class="caption">
<p>Summary of our pipeline.
</p>
</div>
</div>
<br>
<hr class="rounded">
<h1>Primitive and contact retargeting implementation</h1>
<p>
Four primitives, "push," "pull," "pivot," and "grasp," are implemented in this project.
Each primitive is a short-horizon, goal-conditioned policy that takes in the current state and a goal state and outputs a sequence of actions.
The initial and goal states are states that satisfy the contact requirements of the primitive.
Here we summarize the primitives and their contact requirements.
Please refer to our paper and code for implementation details.
</p>
<div id="primitive_summary">
<div class="image_container" >
<img src="assets/images/primitive_summary.png" alt="summary of primitives">
<div class="caption">
<p>Summary of the primitives implemented in this project.</p>
</div>
</div>
</div>
<h2>Push primitive</h2>
<p>
The push primitive is a single reinforcement learning based policy trained in <a href="https://github.com/NVIDIA-Omniverse/IsaacGymEnvs"><cite>Isaac Gym</cite></a>.
The policy is trained to push any of the 7 standard objects and 3 short objects from any initial pose to any goal pose in the workspace.
We explicitly inform the policy of the object tested using one-hot encoding.
<br><br>
The robot-object contact is implicitly enforced by the policy, so there is no need to initialize the robot in a specific contact configuration.
This showcases the flexibility of our pipeline in handling the diverse contact requirements of each primitive.
This design choice also allows emergent behaviors in which the policy switches the robot-object contact to correct for tracking errors.
<br><br>
Other than requiring the object to be on the ground, there are no environment-object contact requirements.
</p>
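<p>
For illustration, one possible way to append the one-hot object encoding to the policy observation is sketched below.
The observation layout, its components, and the function name are assumptions, not details taken from the released training code.
</p>
<pre><code>
import numpy as np

NUM_OBJECTS = 10   # 7 standard objects + 3 short objects

def push_policy_observation(object_id, robot_state, object_pose, goal_pose):
    # A one-hot indicator tells the single policy which object it is pushing.
    one_hot = np.zeros(NUM_OBJECTS)
    one_hot[object_id] = 1.0
    return np.concatenate([robot_state, object_pose, goal_pose, one_hot])
</code></pre>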
<h2>Pull primitive</h2>
<p>
The pull primitive is a two-stage, hand-designed open loop policy that leverages operational space control (OSC).
Starting from an initial robot configuration where the robot is in the vicinity of the top of the object,
the robot first closes its gripper and moves downward toward the object to ensure contact is established.
The robot then moves the gripper horizontally to pull the object toward the goal position.
<br><br>
The robot-object contact required by the pull primitive is for the gripper to be approximately in contact with the top of the object.
To satisfy this, we compute the top rectangle of the object's bounding box and move the robot to the center of the rectangle.
<br><br>
Other than requiring the object to be on the ground, there are no environment-object contact requirements.
</p>
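<p>
A minimal sketch of how the pull targets could be computed from the object's bounding box is given below; the function and parameter names are illustrative and are not taken from the released code.
</p>
<pre><code>
import numpy as np

def pull_waypoints(top_face_corners, goal_xy, press_depth=0.01):
    """top_face_corners: (4, 3) corners of the bounding box's top rectangle, world frame."""
    corners = np.asarray(top_face_corners)
    top_center = corners.mean(axis=0)
    # Stage 1: close the gripper and press down on the center of the top rectangle
    # to establish the robot-object contact.
    press_target = top_center - np.array([0.0, 0.0, press_depth])
    # Stage 2: move horizontally toward the goal at the same height to drag the object.
    pull_target = np.array([goal_xy[0], goal_xy[1], press_target[2]])
    return press_target, pull_target
</code></pre>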
<h2>Pivot primitive</h2>
<p>
The pivot primitive is a three-stage, hand-designed feedback policy that uses OSC.
A lower edge of the object is required to be in contact with the wall and orthogonal to the wall normal.
The gripper fingertips are in contact with the object on the opposite side of the wall-object contact.
The robot first pushes the object toward the wall to establish contact using OSC.
Next, the gripper follows a <a href="https://en.wikipedia.org/wiki/Trammel_of_Archimedes">Trammel of Archimedes</a> path
whose parameters are given by the bounding box dimensions of the object. A contact force is maintained
in an impedance control manner by commanding a fixed tracking error in the radial direction of the arc.
Once the object has been pivoted, the robot breaks contact and clears the object by lifting the gripper upwards.
The pose of the object is tracked to ensure the object is pivoted by the correct angle of approximately π/2.
<br><br>
The robot-object contact is implemented as an intersection of two constraints in <a href="https://drake.mit.edu"><cite>Drake</cite></a>:
the distances between the object and the fingertips are zero;
the fingertips are within a cone that is centered at the object’s geometric center, has the wall’s normal as its axis, and has a half-angle of π/6.
<br><br>
The environment-object contact is implemented using the bounding box of the object.
Of the 4 bounding-box vertices that are closest to the wall, the 2 lowest ones must be on the wall,
and the distance between the wall and the object must be 0.
</p>
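<p>
Since a point on a trammel of Archimedes traces an ellipse, the fingertip targets can be generated as a quarter-ellipse as sketched below.
This is a simplified 2D sketch: the mapping from bounding-box dimensions to the semi-axes and the size of the radial offset are assumptions, not values from the released code.
</p>
<pre><code>
import numpy as np

def pivot_fingertip_path(a, b, n_waypoints=20, radial_offset=0.01):
    """Quarter-ellipse fingertip targets in the vertical plane orthogonal to the pivot
    edge, with the origin at that edge. The semi-axes a and b come from the object's
    bounding-box dimensions."""
    waypoints = []
    for theta in np.linspace(0.0, np.pi / 2.0, n_waypoints):
        x = a * np.cos(theta)   # horizontal distance from the pivot edge
        z = b * np.sin(theta)   # height above the ground
        # Bias the target slightly toward the pivot edge: the fixed radial tracking
        # error makes the OSC controller maintain a contact force on the object.
        r = np.hypot(x, z)
        scale = (r - radial_offset) / max(r, 1e-6)
        waypoints.append((scale * x, scale * z))
    return waypoints
</code></pre>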
<h2>Grasp primitive</h2>
<p>
The grasp primitive is a hand-designed policy that uses OSC.
The primitive begins with the gripper fully open above the object.
First, the robot descends to slot the object between the gripper fingers.
Due to potential pose estimation error, a hand-designed wiggling motion is performed to increase the success rate.
After the object is between the fingers, the gripper is closed and the object is lifted up.
<br><br>
To find the robot configuration that satisfies the contact requirements of the grasp primitive,
we compute the bounding box of the object's projection onto the ground plane.
The two fingers are aligned with the short side of the box and centered at the box's center.
The initial gripper height is set to the height of the object's bounding box plus a small tolerance.
<br><br>
Other than requiring the object to be on the ground, there are no environment-object contact requirements.
</p>
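<p>
A sketch of this computation is shown below. The inputs describe the ground-projected bounding box, and all names and the clearance value are illustrative assumptions rather than details of the released code.
</p>
<pre><code>
import numpy as np

def grasp_start_pose(footprint_center_xy, long_axis_yaw, box_height, clearance=0.02):
    """footprint_center_xy: center of the object's ground-projected bounding box.
    long_axis_yaw: yaw of the footprint's long side. box_height: bounding-box height."""
    # Close across the thinner dimension: the finger-to-finger axis is aligned with
    # the short side of the footprint, which is orthogonal to its long axis.
    finger_axis_yaw = long_axis_yaw + np.pi / 2.0
    # Start just above the object so the descent slots it between the open fingers.
    position = np.array([footprint_center_xy[0], footprint_center_xy[1],
                         box_height + clearance])
    return position, finger_axis_yaw
</code></pre>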
<br>
<hr class="rounded">
<h1>System setup</h1>
<h2>Object set</h2>
<div class="image_container">
<img src="assets/images/objects.jpg" alt="object set photo">
<div class="caption">
<p>
Objects used in this project. Standard objects are tested on all tasks.
Short and impossible objects are used for additional "occluded grasping" experiments.
</p>
</div>
</div>
<div class="image_container">
<img src="assets/images/object_properties.png" alt="object properties">
<div class="caption">
<p>Mass and approximate dimensions of the objects used in the experiments.</p>
</div>
</div>
<!-- <div class="video_container">
<img src="assets/images/object_properties.png" alt="object properties">
</div> -->
<h2>Pose estimation pipeline</h2>
<p>
Our pipeline takes in an RGB image, a prespecified text description, and a textured mesh of the object.
It outputs the 6D pose of the object.
To obtain a pose estimate from scratch, we perform the following steps:
<ol>
<li>The prespecified text description of the object is given to
<a href="https://huggingface.co/docs/transformers/model_doc/owlvit"><cite>OWL-ViT</cite></a>
to obtain a bounding box of the object.</li>
<li>The bounding box is given to
<a href="https://segment-anything.com/"><cite>Segment Anything</cite></a>
to produce a segmentation mask of the object.</li>
<li><a href="https://megapose6d.github.io/"><cite>Megapose</cite></a>
uses the segmented object to produce an initial pose estimate.</li>
<li>Subsequent pose tracking is done using only the "refiner" component of <a href="https://megapose6d.github.io/"><cite>Megapose</cite></a>. The last estimated pose is used as the initial guess.</li>
</ol>
</p>
<p>
Steps 1-3 are only run when a guess of the object pose is unavailable, i.e., at pipeline initialization or when the object is lost.
On our workstation with an Intel i9-13900K CPU and an NVIDIA GeForce RTX 4090 GPU, steps 1-3 typically take a few seconds to complete,
and step 4 runs at a frame rate of 8-12 Hz.
The pipeline automatically detects when the object is lost using the Megapose refiner's "pose score".
If the score is too low, the entire pipeline is rerun.
</p>
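<p>
The loop below sketches how these steps fit together. The helper functions are hypothetical wrappers around OWL-ViT, Segment Anything, and Megapose rather than real library calls, and the score threshold is an assumed value; the pipeline only requires that a score that is too low triggers re-initialization.
</p>
<pre><code>
POSE_SCORE_THRESHOLD = 0.5   # assumed value

def track_object(rgb_frames, text_description, textured_mesh):
    pose = None
    for rgb in rgb_frames:
        if pose is None:
            box = detect_object(rgb, text_description)             # step 1: OWL-ViT
            mask = segment_object(rgb, box)                        # step 2: Segment Anything
            pose, score = estimate_pose(rgb, mask, textured_mesh)  # step 3: Megapose
        else:
            pose, score = refine_pose(rgb, pose, textured_mesh)    # step 4: refiner only
        if score >= POSE_SCORE_THRESHOLD:
            yield pose                                             # steady state: 8-12 Hz
        else:
            pose = None                                            # object lost: rerun steps 1-3
            yield None
</code></pre>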
<div class="video_container" id="pose_estimation">
<video autoplay muted playsinline loop controls preload="metadata" width="100%">
<source src="assets/videos/vision_pipeline_video.mp4" type="video/mp4" alt="pipeline overview">
</video>
<div class="caption">
<p>
Pose estimation pipeline output at 1x speed.
Step 3 outputs are shown in blue.
Step 4 outputs are shown in green.
</p>
</div>
</div>
<!-- <div class="allegrofail">
<div class="video_container">
<video autoplay muted playsinline loop controls preload="metadata">
<source src="assets/system.mp4" type="video/mp4">
</video>
<video autoplay muted playsinline loop controls preload="metadata">
<source src="assets/system.mp4" type="video/mp4">
</video>
<video autoplay muted playsinline loop controls preload="metadata">
<source src="assets/system.mp4" type="video/mp4">
</video>
<video autoplay muted playsinline loop controls preload="metadata">
<source src="assets/system.mp4" type="video/mp4">
</video>
</div>
</div> -->
<br>
<hr class="rounded">
<h1>Results</h1>
<!-- <h2>Overview</h2> -->
<p>
We evaluate our framework on 4 real-world extrinsic manipulation tasks: "obstacle avoidance,"
"object storage," "occluded grasping," and "object retrieval."
Various environments are used for the demonstrations and tests to showcase our method's robustness against environment changes.
All demonstrations are collected on cracker.
Every task is evaluated on the 7 standard objects, each with 5 trials.
Additionally, occluded grasping is evaluated on the 3 short objects with an extra "pull" step.
<br><br>
Our method achieved an overall success rate of <b>80.5%</b> (<b>81.7%</b> for standard objects).
Despite not being tailored to "occluded grasping," we outperformed the 2022 deep-reinforcement-learning method
<a href="https://sites.google.com/view/grasp-ungraspable"><cite>Learning to Grasp the Ungraspable with Emergent Extrinsic Dexterity</cite></a>,
both when the initial object state is against the wall (<b>88.6%</b> vs. 78%) and away from it (<b>77.1%</b> vs. 56%).
<br><br>
To show that our method is agnostic to the specific demonstration, we collected demos for grasping on oat and the 3 impossible objects, which are unlikely to be graspable by the robot.
We then retargeted all demos onto cracker from 5 different initial poses and achieved a 100% success rate across 20 trials.
</p>
<div class="image_container" >
<img src="assets/images/results_main.png" alt="Main results">
<div class="caption">
<p>Summary of experiments on 7 standard objects.</p>
</div>
</div>
<div id="additional_grasping">
<div class="image_container" >
<img src="assets/images/short_object_results.png" alt="Additional occluded grasping results">
<div class="caption">
<p>Summary of additional "occluded grasping" experiments.</p>
</div>
</div>
</div>
<p>
Below we show one successful instance of each task-object combination. <b>All videos are at 1x speed.</b>
</p>
<h2>Obstacle avoidance</h2>
<p><i>Push</i> the object forward, switch contact and <i>push</i> again to avoid the obstacle.</p>
<div class="taskvideos">
<div class="video_container">
<video muted playsinline loop controls preload="metadata">
<source src="assets/videos/720p/demo_avoidance.mp4" type="video/mp4">
</video>
<div class="caption">
<p>Demonstration</p>
</div>
</div>
<div class="video_container">
<video muted playsinline loop controls preload="metadata">
<source src="assets/videos/720p/push_push_cereal.mp4" type="video/mp4">
</video>
<div class="caption">
<p>Cereal</p>
</div>
</div>
<div class="video_container">
<video muted playsinline loop controls preload="metadata">
<source src="assets/videos/720p/push_push_cocoa.mp4" type="video/mp4">
</video>
<div class="caption">
<p>Cocoa</p>
</div>
</div>
<div class="video_container">
<video muted playsinline loop controls preload="metadata">
<source src="assets/videos/720p/push_push_cracker.mp4" type="video/mp4">
</video>
<div class="caption">
<p>Cracker</p>
</div>
</div>
</div>
<div class="taskvideos">
<div class="video_container">
<video muted playsinline loop controls preload="metadata">
<source src="assets/videos/720p/push_push_flapjack.mp4" type="video/mp4">
</video>
<div class="caption">
<p>Flapjack</p>
</div>
</div>
<div class="video_container">
<video muted playsinline loop controls preload="metadata">
<source src="assets/videos/720p/push_push_oat.mp4" type="video/mp4">
</video>
<div class="caption">
<p>Oat</p>
</div>
</div>
<div class="video_container">
<video muted playsinline loop controls preload="metadata">
<source src="assets/videos/720p/push_push_seasoning.mp4" type="video/mp4">
</video>
<div class="caption">
<p>Seasoning</p>
</div>
</div>
<div class="video_container">
<video muted playsinline loop controls preload="metadata">
<source src="assets/videos/720p/push_push_wafer.mp4" type="video/mp4">
</video>
<div class="caption">
<p>Wafer</p>
</div>
</div>
</div>
<h2>Object storage</h2>
<p>
<i>Push</i> an object toward the wall, <i>pivot</i> to align with an opening between the wall and the object, then <i>pull</i> it into the opening for storage.
</p>
<div class="taskvideos">
<div class="video_container">
<video muted playsinline loop controls preload="metadata">
<source src="assets/videos/720p/demo_storage.mp4" type="video/mp4">
</video>
<div class="caption">
<p>Demonstration</p>
</div>
</div>
<div class="video_container">
<video muted playsinline loop controls preload="metadata">
<source src="assets/videos/720p/push_pivot_pull_cereal.mp4" type="video/mp4">
</video>
<div class="caption">
<p>Cereal</p>
</div>
</div>
<div class="video_container">
<video muted playsinline loop controls preload="metadata">
<source src="assets/videos/720p/push_pivot_pull_cocoa.mp4" type="video/mp4">
</video>
<div class="caption">
<p>Cocoa</p>
</div>
</div>
<div class="video_container">
<video muted playsinline loop controls preload="metadata">
<source src="assets/videos/720p/push_pivot_pull_cracker.mp4" type="video/mp4">
</video>
<div class="caption">
<p>Cracker</p>
</div>
</div>
</div>
<div class="taskvideos">
<div class="video_container">
<video muted playsinline loop controls preload="metadata">
<source src="assets/videos/720p/push_pivot_pull_flapjack.mp4" type="video/mp4">
</video>
<div class="caption">
<p>Flapjack</p>
</div>
</div>
<div class="video_container">
<video muted playsinline loop controls preload="metadata">
<source src="assets/videos/720p/push_pivot_pull_oat.mp4" type="video/mp4">
</video>
<div class="caption">
<p>Oat</p>
</div>
</div>
<div class="video_container">
<video muted playsinline loop controls preload="metadata">
<source src="assets/videos/720p/push_pivot_pull_seasoning.mp4" type="video/mp4">
</video>
<div class="caption">
<p>Seasoning</p>
</div>
</div>
<div class="video_container">
<video muted playsinline loop controls preload="metadata">
<source src="assets/videos/720p/push_pivot_pull_wafer.mp4" type="video/mp4">
</video>
<div class="caption">
<p>Wafer</p>
</div>
</div>
</div>
<h2>Occluded grasping</h2>
<p>
<i>Push</i> the object in an ungraspable pose toward the wall, <i>pivot</i> it to expose a graspable edge, and <i>grasp</i> it.
</p>
<div class="taskvideos">
<div class="video_container">
<video muted playsinline loop controls preload="metadata">
<source src="assets/videos/720p/demo_grasping.mp4" type="video/mp4">
</video>
<div class="caption">
<p>Demonstration</p>
</div>
</div>
<div class="video_container">
<video muted playsinline loop controls preload="metadata">
<source src="assets/videos/720p/push_pivot_grasp_cereal.mp4" type="video/mp4">
</video>
<div class="caption">
<p>Cereal</p>
</div>
</div>
<div class="video_container">
<video muted playsinline loop controls preload="metadata">
<source src="assets/videos/720p/push_pivot_grasp_cocoa.mp4" type="video/mp4">
</video>
<div class="caption">
<p>Cocoa</p>
</div>
</div>
<div class="video_container">
<video muted playsinline loop controls preload="metadata">
<source src="assets/videos/720p/push_pivot_grasp_cracker.mp4" type="video/mp4">
</video>
<div class="caption">
<p>Cracker</p>
</div>
</div>
</div>
<div class="taskvideos">
<div class="video_container">
<video muted playsinline loop controls preload="metadata">
<source src="assets/videos/720p/push_pivot_grasp_flapjack.mp4" type="video/mp4">
</video>
<div class="caption">
<p>Flapjack</p>
</div>
</div>
<div class="video_container">
<video muted playsinline loop controls preload="metadata">
<source src="assets/videos/720p/push_pivot_grasp_oat.mp4" type="video/mp4">
</video>
<div class="caption">
<p>Oat</p>
</div>
</div>
<div class="video_container">
<video muted playsinline loop controls preload="metadata">
<source src="assets/videos/720p/push_pivot_grasp_seasoning.mp4" type="video/mp4">
</video>
<div class="caption">
<p>Seasoning</p>
</div>
</div>
<div class="video_container">
<video muted playsinline loop controls preload="metadata">
<source src="assets/videos/720p/push_pivot_grasp_wafer.mp4" type="video/mp4">
</video>
<div class="caption">
<p>Wafer</p>
</div>
</div>
</div>
<h2>Occluded grasping (short objects)</h2>
<p>
<i>Push</i> the object in an ungraspable pose toward the wall, <i>pivot</i> it to expose a graspable edge, <i>pull</i> to create space between the wall and the object for inserting the gripper,
and <i>grasp</i> it.
</p>
<div class="taskvideos">
<div class="video_container">
<video muted playsinline loop controls preload="metadata">
<source src="assets/videos/720p/demo_short_grasp.mp4" type="video/mp4">
</video>
<div class="caption">
<p>Demonstration</p>
</div>
</div>
<div class="video_container">
<video muted playsinline loop controls preload="metadata">
<source src="assets/videos/720p/push_pivot_pull_grasp_camera.mp4" type="video/mp4">
</video>
<div class="caption">
<p>Camera</p>
</div>
</div>
<div class="video_container">
<video muted playsinline loop controls preload="metadata">
<source src="assets/videos/720p/push_pivot_pull_grasp_meat.mp4" type="video/mp4">
</video>
<div class="caption">
<p>Meat</p>
</div>
</div>
<div class="video_container">
<video muted playsinline loop controls preload="metadata">
<source src="assets/videos/720p/push_pivot_pull_grasp_onion.mp4" type="video/mp4">
</video>
<div class="caption">
<p>Onion</p>
</div>
</div>
</div>
<h2>Object retrieval</h2>
<p>
<i>Pull</i> the object from between two obstacles, <i>push</i> toward the wall, <i>pivot</i> it to expose a graspable edge, and <i>grasp</i> it.
</p>
<div class="taskvideos">
<div class="video_container">
<video muted playsinline loop controls preload="metadata">
<source src="assets/videos/720p/demo_retrieval.mp4" type="video/mp4">
</video>
<div class="caption">
<p>Demonstration</p>
</div>
</div>
<div class="video_container">
<video muted playsinline loop controls preload="metadata">
<source src="assets/videos/720p/pull_push_pivot_grasp_cereal.mp4" type="video/mp4">
</video>
<div class="caption">
<p>Cereal</p>
</div>
</div>
<div class="video_container">
<video muted playsinline loop controls preload="metadata">
<source src="assets/videos/720p/pull_push_pivot_grasp_cocoa.mp4" type="video/mp4">
</video>
<div class="caption">
<p>Cocoa</p>
</div>
</div>
<div class="video_container">
<video muted playsinline loop controls preload="metadata">
<source src="assets/videos/720p/pull_push_pivot_grasp_cracker.mp4" type="video/mp4">
</video>
<div class="caption">
<p>Cracker</p>
</div>
</div>
</div>
<div class="taskvideos">
<div class="video_container">
<video muted playsinline loop controls preload="metadata">
<source src="assets/videos/720p/pull_push_pivot_grasp_flapjack.mp4" type="video/mp4">
</video>
<div class="caption">
<p>Flapjack</p>
</div>
</div>
<div class="video_container">
<video muted playsinline loop controls preload="metadata">
<source src="assets/videos/720p/pull_push_pivot_grasp_oat.mp4" type="video/mp4">
</video>
<div class="caption">
<p>Oat</p>
</div>
</div>
<div class="video_container">
<video muted playsinline loop controls preload="metadata">
<source src="assets/videos/720p/pull_push_pivot_grasp_seasoning.mp4" type="video/mp4">
</video>
<div class="caption">
<p>Seasoning</p>
</div>
</div>
<div class="video_container">
<video muted playsinline loop controls preload="metadata">
<source src="assets/videos/720p/pull_push_pivot_grasp_wafer.mp4" type="video/mp4">
</video>
<div class="caption">
<p>Wafer</p>
</div>
</div>
</div>
<br>
<hr class="rounded">
<h1>Conclusion</h1>
<p>
This work presents a framework for generalizing long-horizon extrinsic manipulation from a single demonstration.
Our method retargets the demonstration trajectory to the test scene by enforcing contact constraints with IK at every contact switch.
The retargeted trajectory is then tracked with a sequence of short-horizon policies for each contact configuration.
Our method achieved an overall success rate of 81.7% on real-world objects over 4 challenging long-horizon extrinsic manipulation tasks.
Additional experiments show that contact retargeting is crucial to successfully transferring such long-horizon plans, and that a wide range of demonstrations can be successfully retargeted with our pipeline.
</p>
<h1>BibTeX</h1>
<p class="bibtex">
Coming soon
</p>
</div>
<footer class="footer">
<div class="w-container">
<p>
Website template adapted from <a href="https://github.com/nerfies/nerfies.github.io">NeRFies</a>, <a href="https://peract.github.io/">PerAct</a>, and <a href="https://dex-cap.github.io/">DexCap</a>.
</p>
<!-- <div class="columns is-centered">
<div class="column">
<div class="content has-text-centered">
</div>
</div>
</div> -->
</div>
</footer>
</body>
<script src="assets/js/full_screen_video.js"></script>
<script src="assets/js/carousel.js"></script>
</html>