This directory contains code for data generation (both images and associated JSON files) for training DOPE. We provide two variations of this code, one that uses NVISII for rendering, and another that uses Blenderproc. The data produced by these pipelines are compatible with each other, and can be combined into a single dataset without issues.
We highly recommend a GPU with RTX capabilities, as ray tracing can be costly on a non-RTX GPU.
You will need to download HDRI maps to illuminate the scene. These are freely available from Poly Haven. For testing purposes, you can download a single one here:
```
wget https://www.dropbox.com/s/na3vo8rca7feoiq/teatro_massimo_2k.hdr
mv teatro_massimo_2k.hdr dome_hdri_haven/
```
In addition to your object of interest, we recommend adding "distractor" objects to the scene. These are unrelated objects that serve to enrich the training data, provide occlusion, etc. Our data generation scripts use the Google scanned objects dataset, which can be downloaded automatically with the following:
```
python download_google_scanned_objects.py
```
This section contains a description of the generated dataset format.
All coordinate systems (world, model, camera) are right-handed.
The world coordinate system is X forward, Y left and Z up.
The model coordinate system is ultimately defined by the mesh. However, if the model is supposed to appear "naturally upright" in its neutral orientation in the world frame, the Z axis should point up when the object is standing upright, the X axis should point from the "natural backside" of the model towards the front, the Y axis should point left, and the origin should coincide with the center of the 3D bounding box of the object model.
The camera coordinate system is the same as in OpenCV with X right, Y down and Z into the image away from the viewer.
All pixel coordinates (U, V) have the origin at the top left corner of the image, with U going right and V going down.
All length measurements are in meters.
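As a concrete illustration of these conventions, here is a minimal sketch (not part of the generation scripts) of how a point given in the camera coordinate system projects to pixel coordinates; the simple pinhole model and the focal length / principal point values are assumptions chosen only for illustration:

```python
def project_to_pixels(point_cam, fx, fy, cx, cy):
    """Project a 3D point given in the camera frame (X right, Y down,
    Z pointing into the image) to pixel coordinates (U right, V down,
    origin at the top-left corner), assuming a simple pinhole model."""
    x, y, z = point_cam
    u = fx * x / z + cx
    v = fy * y / z + cy
    return u, v

# A point 0.5 m in front of the camera, slightly to the right of and below
# the optical axis (all lengths in meters); the intrinsics are made-up values.
print(project_to_pixels((0.05, 0.02, 0.5), fx=600.0, fy=600.0, cx=320.0, cy=240.0))
```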
The indices of the 3D bounding cuboid are in the order shown in the sketch below (0..7), with the object being in its neutral orientation (X axis pointing forward, Y left, Z up).
The order of the indices is the same as in the NVIDIA Deep learning Dataset Synthesizer (NDDS) and `nvdu_viz` from the NVIDIA Dataset Utilities.
```
   (m) 3 +-----------------+ 0 (b)
        /                 /|
       /                 / |
(m) 2 +-----------------+ 1| (b)
      |                 |  |
      |      ^ z        |  |
      |      |          |  |
      | y <--x          |  |
      |      |          |  |
(y) 7    + - - - - - - -|- + 4 (g)
      | /               | /
      |/                |/
(y) 6 +-----------------+ 5 (g)
```
Debug markers for the cuboid corners can be rendered using the `--debug` option, with (b) = blue, (m) = magenta, (g) = green, (y) = yellow and the centroid being white.
Each generated image is accompanied by a JSON file. This JSON file must contain the following fields to be used for training:
- `objects`: an array containing one entry for each object instance, with:
  - `class`: class name. This name is referred to in configuration files.
  - `location` and `quaternion_xyzw`: position and orientation of the object in the camera coordinate system
  - `projected_cuboid`: 2D coordinates (in pixels) of the projection of the vertices of the 3D bounding cuboid, plus the centroid. See the above section "Projected Cuboid Corners" for more detail.
  - `visibility`: the visible fraction of the object silhouette (= `px_count_visib` / `px_count_all`). Note that if NVISII is used, the object may still not be fully visible when `visibility == 1.0`, because it may extend beyond the borders of the image.
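For illustration, here is a small sketch of how these fields might be read back from a generated file; the file name `00000.json` is only an assumption, as the actual names depend on your generation run:

```python
import json

with open("00000.json", "r") as f:   # hypothetical file name
    annotation = json.load(f)

for obj in annotation["objects"]:
    print("class:           ", obj["class"])
    print("location (m):    ", obj["location"])
    print("quaternion_xyzw: ", obj["quaternion_xyzw"])
    print("visibility:      ", obj["visibility"])
    # The projected cuboid corners plus the centroid, each as (u, v) in pixels.
    print("projected_cuboid:", obj["projected_cuboid"])
```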
These fields are not required for training, but are used for debugging and numerical evaluation of the results. We recommend generating this data if possible.
- `camera_data`
  - `camera_view_matrix`: 4×4 transformation matrix from the world to the camera coordinate system
  - `height` and `width`: dimensions of the image in pixels
  - `intrinsics`: the camera intrinsics
- `objects`
  - `local_cuboid`: 3D coordinates of the vertices of the 3D bounding cuboid (in meters); currently always `null`
  - `local_to_world_matrix`: 4×4 transformation matrix from the object to the world coordinate system
  - `name`: a unique string that identifies the object instance internally
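As an example of how these extra fields can be used for a quick sanity check, the sketch below re-projects each object's `location` with the camera intrinsics and compares it against the projected centroid. It assumes that `intrinsics` is stored as a dictionary with `fx`, `fy`, `cx` and `cy` entries and that the centroid is the last point in `projected_cuboid`; verify both assumptions against your generated files before relying on this.

```python
import json

with open("00000.json", "r") as f:   # hypothetical file name
    annotation = json.load(f)

# Assumed layout of the intrinsics; check your generated JSON files.
intr = annotation["camera_data"]["intrinsics"]

for obj in annotation["objects"]:
    x, y, z = obj["location"]                   # camera frame, meters
    u = intr["fx"] * x / z + intr["cx"]
    v = intr["fy"] * y / z + intr["cy"]
    u_ref, v_ref = obj["projected_cuboid"][-1]  # centroid, assumed to be last
    print(f"{obj['class']}: re-projected ({u:.1f}, {v:.1f}) "
          f"vs stored centroid ({u_ref:.1f}, {v_ref:.1f})")
```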
If your object has any rotational symmetries, they have to be handled specially.
Here is a video that demonstrates what happens with a rotationally symmetric object if you do not specify the symmetries:
cylinder_nosym.mp4
As you can see on the left side of that video, the cuboid corners (visualized as small colored spheres) rotate with the object. Because the object has a rotational symmetry, two frames that are pixel-wise identical end up having different cuboid corners. Since the cuboid corners are what DOPE is trained on, this will cause the training to fail.
The right side of the video shows the same object with a debug texture to demonstrate the "real" pose of the object. If your real object actually has a texture like this, it does not have any rotational symmetries in our sense, because two images where the cuboid corners are in different places will also not be pixel-wise identical due to the texture. Also, you only need to deal with rotational symmetries, not mirror symmetries for the same reason.
To handle symmetries, you need to add a `model_info.json` file (see the `models_with_symmetries` folder for examples). Here is the `model_info.json` file for the cylinder object:
```json
{
    "symmetries_discrete": [[1,  0,  0, 0,
                             0, -1,  0, 0,
                             0,  0, -1, 0,
                             0,  0,  0, 1]],
    "symmetries_continuous": [{"axis": [0, 0, 1], "offset": [0, 0, 0]}],
    "align_axes": [{"object": [0, 1, 0], "camera": [0, 0, 1]}, {"object": [0, 0, 1], "camera": [0, 1, 0]}]
}
```
As you can see, we have specified one discrete symmetry (rotating the object by 180° around the x axis) and one continuous symmetry (rotating around the z axis). Also, we have to specify how to align the axes. With the `align_axes` specified as above, the algorithm will:
1. Discretize `symmetries_continuous` into 64 discrete rotations.
2. Combine all discrete and continuous symmetries into one set of complete symmetry transformations (a rough sketch of these steps follows this list).
3. Find the combined symmetry transformation such that, when the object is rotated by that transformation,
   - the y axis of the object (`"object": [0, 1, 0]`) has the best alignment (smallest angle) with the z axis of the camera (`"camera": [0, 0, 1]`);
   - if there are multiple equally good such transformations, it will choose the one where the z axis of the object (`"object": [0, 0, 1]`) has the best alignment with the y axis of the camera (`"camera": [0, 1, 0]`).
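To make these steps more concrete, here is a rough sketch of the idea for the cylinder example above. This is not the actual implementation used by the data generation scripts; the matrix combination order, the placeholder pose and the use of plain 3×3 rotation matrices are simplifying assumptions for illustration:

```python
import numpy as np

def rot_z(angle):
    """3×3 rotation matrix around the z axis."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

# Discrete symmetries of the cylinder: identity and a 180° rotation around x.
discrete = [np.eye(3), np.diag([1.0, -1.0, -1.0])]

# Step 1: discretize the continuous symmetry around z into 64 rotations.
continuous = [rot_z(2.0 * np.pi * k / 64) for k in range(64)]

# Step 2: combine them into one set of symmetry transformations.
symmetries = [d @ c for d in discrete for c in continuous]

# Step 3 (illustrative): given an object-to-camera rotation R_co, pick the
# symmetry whose rotated object y axis aligns best with the camera z axis.
# (The tie-breaking with the second align_axes entry is omitted here.)
R_co = np.eye(3)                      # placeholder pose, for illustration only
obj_axis = np.array([0.0, 1.0, 0.0])  # "object": [0, 1, 0]
cam_axis = np.array([0.0, 0.0, 1.0])  # "camera": [0, 0, 1]
best = max(symmetries, key=lambda S: cam_axis @ (R_co @ S @ obj_axis))
print(best.round(3))
```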
See the documentation of the object and camera coordinate systems above.
With this `model_info.json` file, the result is the following:
cylinder_sym.mp4
As another example, here is a rather unusual object that has a 60° rotational symmetry around the z axis. The `model_info.json` file looks like this:
```json
{
    "symmetries_discrete": [[ 0.5,   -0.866, 0, 0,
                              0.866,  0.5,   0, 0,
                              0,      0,     1, 0,
                              0,      0,     0, 1],
                            [-0.5,   -0.866, 0, 0,
                              0.866, -0.5,   0, 0,
                              0,      0,     1, 0,
                              0,      0,     0, 1],
                            [-1,  0, 0, 0,
                              0, -1, 0, 0,
                              0,  0, 1, 0,
                              0,  0, 0, 1],
                            [-0.5,    0.866, 0, 0,
                             -0.866, -0.5,   0, 0,
                              0,      0,     1, 0,
                              0,      0,     0, 1],
                            [ 0.5,    0.866, 0, 0,
                             -0.866,  0.5,   0, 0,
                              0,      0,     1, 0,
                              0,      0,     0, 1]],
    "align_axes": [{"object": [0, 1, 0], "camera": [0, 0, 1]}]
}
```
The transformation matrices have been computed like this:
```python
from math import sin, cos, pi

# Print each 4×4 rotation matrix around the z axis as a flattened row-major list.
for yaw_degree in [60, 120, 180, 240, 300]:
    yaw = yaw_degree / 180 * pi
    print([cos(yaw), -sin(yaw), 0, 0,
           sin(yaw),  cos(yaw), 0, 0,
           0, 0, 1, 0,
           0, 0, 0, 1])
```
The resulting symmetry-corrected output looks like this:
hex_screw.mp4
This symmetry handling scheme allows the data generation script to compute consistent cuboid corners for most rotations of the object. Note however that there are object rotations where the cuboid corners become unstable and "flip over" to a different symmetry transformation. For the cylinder object, this is when the camera looks at the top or bottom of the cylinder (not shown in the video above). For the hex screw object, this is also when the camera looks at the top or bottom or when the rotation is close to the 60° boundary between two transformations (this can be seen in the video). Rotations within a few degrees of the "flipping over" rotation will not be handled well by the trained network. Unfortunately, this cannot be easily avoided.
Further note that specifying symmetries also improves the recognition results for "almost-symmetrical" objects, where there are only minor non-symmetrical parts, such as most of the objects from the T-LESS dataset.