Nowadays there are hundreds of machine learning models which can perform instance segmentation on 2D images, even without fine-tuning. However, when it comes to segmentation of 3D point clouds, the choice of models becomes significantly narrower, and most of them require fine-tuning in order to work properly on custom data. Since instance segmentation on 2D images is relatively easily accessible nowadays, transferring segmentation masks from 2D to 3D space can significantly speed up 3D point cloud labeling for instance segmentation tasks.
In this tutorial we will learn how to create segmentation masks for a 3D point cloud using segmentation masks on a 2D photo context image and camera calibration data. We will take a point cloud, a photo context image and camera calibration parameters from the KITTI dataset as an example, but this approach can be generalized to any data. Supervisely's 3D Point Cloud labeling tool and Image labeling tool will be used for working with the point cloud and the photo context image respectively.
The main steps of this tutorial are the following:
prepare input data: a 3D point cloud, a reference image labeled with segmentation masks, KITTI's sensor calibration files
project LiDAR 3D points on 2D reference image
get LiDAR point projections located inside masks on photo context image
create segmentation masks for 3D point cloud
Everything you need to reproduce this tutorial is on GitHub: source code, Dockerfile, demo data.
Input data overview: 3D point cloud, photo context image with 2D masks, sensor calibration parameters
As mentioned above, the input data will be taken from the KITTI dataset. You can import point clouds from the KITTI dataset to your Supervisely account using the Import KITTI 3D app. We will take one point cloud as an example:
In the KITTI dataset, every point cloud has 4 reference images: 2 from grayscale cameras (left / right) and 2 from color cameras (left / right). We will take the image from the left color camera as an example and label it with 2D segmentation masks:
Now let's check the sensor calibration parameters. The KITTI dataset provides several sensor calibration files:
calib_cam_to_cam.txt - contains matrices for camera-to-camera calibration
calib_velo_to_cam.txt - contains matrices for velodyne-to-camera registration
The file for camera-to-camera calibration contains the following data (source - KITTI README):
S_xx: 1x2 size of image xx before rectification
K_xx: 3x3 calibration matrix of camera xx before rectification
D_xx: 1x5 distortion vector of camera xx before rectification
R_xx: 3x3 rotation matrix of camera xx (extrinsic)
T_xx: 3x1 translation vector of camera xx (extrinsic)
S_rect_xx: 1x2 size of image xx after rectification
R_rect_xx: 3x3 rectifying rotation to make image planes co-planar
P_rect_xx: 3x4 projection matrix after rectification
For our task, we will need only P_rect_xx, R_rect_xx, R_xx and T_xx matrices.
The file for velodyne-to-camera registration contains the following data:
R: 3x3 rotation matrix
T: 3x1 translation vector
This data represents the velodyne coordinate frame in camera coordinates. We will need the rotation matrix and the translation vector in order to transform points in velodyne coordinates into the camera coordinate system.
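In homogeneous coordinates (matching how T_velo_ref0 is assembled in the code later in this tutorial), this rigid transform can be written as:

$$
T_{velo \rightarrow cam_0} =
\begin{bmatrix}
R & T \\
0_{1 \times 3} & 1
\end{bmatrix},
\qquad
\tilde{x}_{cam_0} = T_{velo \rightarrow cam_0} \, \tilde{x}_{velo}
$$

where $\tilde{x}$ denotes a 3D point extended with a trailing 1.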
Environment preparation and libraries import
To run the code provided in this tutorial, you will need a few Python modules: supervisely, open3d and plotly. For convenience, you can use the Dockerfile that ships with the tutorial's GitHub repository.
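The code snippets below also assume a common set of imports and an authenticated Supervisely API client, roughly like the sketch below (the save_dir location and the exact import paths for the point cloud annotation classes are assumptions and may differ between SDK versions):

import os

import numpy as np
import open3d as o3d
import matplotlib.pyplot as plt
from PIL import Image

import supervisely as sly
from supervisely.geometry.pointcloud import Pointcloud
from supervisely.project.project_type import ProjectType
from supervisely.pointcloud_annotation.pointcloud_object_collection import PointcloudObjectCollection
from supervisely.pointcloud_annotation.pointcloud_tag_collection import PointcloudTagCollection

# authenticate using the SERVER_ADDRESS and API_TOKEN environment variables
api = sly.Api.from_env()

# local folder for downloaded and generated files
save_dir = "../tutorial_data/output"
os.makedirs(save_dir, exist_ok=True)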
Download photo context image and visualize mask annotations
Let's download the photo context image and its annotations from the Supervisely platform to local storage and visualize the result. We will use the Supervisely Python SDK for this purpose; you can find more tutorials on how to use the Supervisely Python SDK here.
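A minimal sketch of this step is shown below (the photo_context_project_id and photo_context_image_id values are placeholders to replace with your own IDs, and drawing the preview with ann.draw_pretty is just one convenient option):

# ids of the labeled photo context image on the platform (placeholders)
photo_context_project_id = 12345  # replace with your project id
photo_context_image_id = 67890  # replace with your image id

# download the image as a numpy array
photo_context_image_np = api.image.download_np(photo_context_image_id)

# download the annotation and deserialize it
project_meta = sly.ProjectMeta.from_json(api.project.get_meta(photo_context_project_id))
ann_json = api.annotation.download(photo_context_image_id).annotation
ann = sly.Annotation.from_json(ann_json, project_meta)

# draw 2D masks over the image and save the preview
preview = photo_context_image_np.copy()
ann.draw_pretty(preview, thickness=3)
preview_path = os.path.join(save_dir, "photo_context_with_masks.png")
sly.image.write(preview_path, preview)

# display the result
plt.axis("off")
plt.imshow(Image.open(preview_path));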
The next step is to download the input point cloud to local storage:
# define point cloud id and download it to local storage
pcd_id = 32384623
local_pcd_path = os.path.join(save_dir, "lidar_data.pcd")
api.pointcloud.download_path(pcd_id, local_pcd_path)

# display point cloud on interactive plot
visualize_pcd(local_pcd_path)
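The visualize_pcd helper comes from the tutorial's repository; a rough sketch of what such a helper can do with open3d and plotly (an assumed implementation, not the repository's exact code) looks like this:

import numpy as np
import open3d as o3d
import plotly.graph_objects as go


def visualize_pcd(pcd_path, max_points=100000):
    """Show a point cloud on an interactive 3D scatter plot."""
    pcd = o3d.io.read_point_cloud(pcd_path, format="pcd")
    points = np.asarray(pcd.points)
    # subsample very large clouds so the plot stays responsive
    if len(points) > max_points:
        points = points[np.random.choice(len(points), max_points, replace=False)]
    fig = go.Figure(
        data=[
            go.Scatter3d(
                x=points[:, 0],
                y=points[:, 1],
                z=points[:, 2],
                mode="markers",
                marker=dict(size=1, color=points[:, 2], colorscale="Viridis"),
            )
        ]
    )
    fig.update_layout(scene_aspectmode="data")
    fig.show()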
Get sensor calibration parameters
The KITTI paper describes the transformation from LiDAR to camera i as shown below, where each transformation matrix has been converted to its homogeneous representation. The only difference here is that we have changed the notation and added the transformation to the desired camera reference.
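From the KITTI devkit README, a homogeneous LiDAR point $\tilde{x}$ is projected into the i-th camera image as:

$$
\tilde{y} = P_{rect}^{(i)} \, R_{rect}^{(0)} \, T_{velo}^{cam} \, \tilde{x}
$$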
For convenience, we will denote the transformation from LiDAR to camera i as Isaac Berrios proposed in his sensor fusion tutorial:
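A reconstruction of that notation, using the same matrix names as the code later in this section, is:

$$
T_{velo \rightarrow cam_i} = P_{rect}^{(i)} \, R_{rect}^{(i)} \, T_{cam_0 \rightarrow cam_i} \, T_{velo \rightarrow cam_0}
$$

which in code becomes T_velo_to_cam = P_rect @ R_rect @ T_ref_to_target @ T_velo_ref0.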
Where:
T_velo_ref0 (LiDAR to camera reference) transforms a 3D point relative to the LiDAR into a 3D point relative to the reference camera (camera 0).
T_ref_to_target is the rigid body transformation from camera 0 to camera i.
R_rect maps camera i to the rectified camera i reference.
P_rect maps the rectified camera i to the 2D camera i (u, v, z) coordinate space.
T_velo_to_cam is the full transformation from 3D LiDAR space to the 2D camera i (u, v, z) coordinate space.
Here (u, v, z) are the final camera coordinates after the rectification and projection transforms. In order to transform from homogeneous image coordinates to true (u, v, z) image coordinates, we need to normalize the first two components by the depth:
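Written out (this matches the normalization performed in the code below, where only the first two rows are divided by the depth):

$$
\tilde{y} = T_{velo \rightarrow cam_i} \, \tilde{x} =
\begin{bmatrix} \tilde{u} \\ \tilde{v} \\ \tilde{z} \end{bmatrix},
\qquad
u = \frac{\tilde{u}}{\tilde{z}}, \quad
v = \frac{\tilde{v}}{\tilde{z}}, \quad
z = \tilde{z}
$$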
# define target camera number
camera_number = 2

# read calib_cam_to_cam.txt
calib_cam_to_cam_file = open("../tutorial_data/calib_cam_to_cam.txt")
calib_cam_to_cam_dict = {}
for line in calib_cam_to_cam_file.readlines():
    key, value = line.split(": ")
    calib_cam_to_cam_dict[key] = value.strip()
calib_cam_to_cam_file.close()

# read calib_velo_to_cam.txt
calib_velo_to_cam_file = open("../tutorial_data/calib_velo_to_cam.txt")
calib_velo_to_cam_dict = {}
for line in calib_velo_to_cam_file.readlines():
    key, value = line.split(": ")
    calib_velo_to_cam_dict[key] = value.strip()
calib_velo_to_cam_file.close()

# get projection matrices
P_rect = calib_cam_to_cam_dict[f"P_rect_0{camera_number}"]
P_rect = np.array([float(x) for x in P_rect.split(" ")]).reshape((3, 4))

# get rectified rotation matrices
R_rect = calib_cam_to_cam_dict[f"R_rect_0{camera_number}"]
R_rect = np.array([float(x) for x in R_rect.split(" ")]).reshape((3, 3))

# add (0, 0, 0) translation and convert to homogeneous coordinates
R_rect = np.insert(R_rect, 3, values=[0, 0, 0], axis=0)
R_rect = np.insert(R_rect, 3, values=[0, 0, 0, 1], axis=1)

# get rotation matrix from reference camera to target camera
R = calib_cam_to_cam_dict[f"R_0{camera_number}"]
R = np.array([float(x) for x in R.split(" ")]).reshape((3, 3))

# get translation vector from reference camera to target camera
t = calib_cam_to_cam_dict[f"T_0{camera_number}"]
t = np.array([float(x) for x in t.split(" ")]).reshape((3, 1))

# get reference camera to target camera rigid body transformation in homogeneous coordinates
T_ref_to_target = np.insert(np.hstack((R, t)), 3, values=[0, 0, 0, 1], axis=0)

# get lidar to camera reference transformation
R_velo = np.array([float(x) for x in calib_velo_to_cam_dict["R"].split(" ")]).reshape((3, 3))
t_velo = np.array([float(x) for x in calib_velo_to_cam_dict["T"].split(" ")])[:, None]
T_velo_ref0 = np.vstack((np.hstack((R_velo, t_velo)), np.array([0, 0, 0, 1])))
Project LiDAR 3D points on 2D photo context image
Now that we have all the necessary transformation matrices, we can obtain the matrix that transforms 3D LiDAR points into 2D camera coordinates.
# transform from velo (LiDAR) to target camera
T_velo_to_cam = P_rect @ R_rect @ T_ref_to_target @ T_velo_ref0

# get lidar points
pcd = o3d.io.read_point_cloud(local_pcd_path, format="pcd")
pcd_points = np.asarray(pcd.points)
xyzw = np.insert(pcd_points, 3, 1, axis=1).T

# get 3D points projections on photo context image
projections = T_velo_to_cam @ xyzw
projections[:2] /= projections[2, :]

# draw point projections on image and display result
save_path = os.path.join(save_dir, "projections.png")
draw_projections_on_image(projections.copy(), photo_context_image_np.copy(), save_path)
plt.axis("off")
plt.imshow(Image.open(save_path));
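The draw_projections_on_image helper is part of the tutorial's repository; a simplified sketch of such a helper (its signature and preprocessing steps here are assumptions inferred from how it is called in this tutorial) could look like this:

import matplotlib.pyplot as plt
import numpy as np


def draw_projections_on_image(projections, image, save_path, preprocess=True, indexes=None):
    """Scatter (u, v) point projections over the image, colored by depth, and save the result."""
    u, v, z = projections
    if indexes is not None:
        u, v, z = u[indexes], v[indexes], z[indexes]
    if preprocess:
        # keep only points in front of the camera and inside the image bounds
        img_h, img_w, _ = image.shape
        keep = (z > 0) & (u >= 0) & (u < img_w) & (v >= 0) & (v < img_h)
        u, v, z = u[keep], v[keep], z[keep]
    plt.figure(figsize=(12, 4))
    plt.imshow(image)
    plt.scatter(u, v, c=z, cmap="jet", s=1)
    plt.axis("off")
    plt.savefig(save_path, bbox_inches="tight", pad_inches=0)
    plt.close()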
Get LiDAR point projections located inside masks on photo context image
The next step after getting the LiDAR point projections is to find the projections located inside the masks on the photo context image (in our case, the point projections which lie on the two cars in front of the ego vehicle).
# download photo context image annotation
project_meta_json = api.project.get_meta(photo_context_project_id)
project_meta = sly.ProjectMeta.from_json(project_meta_json)
ann_json = api.annotation.download(photo_context_image_id).annotation
ann = sly.Annotation.from_json(ann_json, project_meta)

# get binary masks
labels = ann.labels
bitmap = np.zeros(photo_context_image_np.shape, dtype=np.uint8)
for label in labels:
    geometry = label.geometry
    geometry.draw(bitmap=bitmap, color=[1, 1, 1])
bitmap = bitmap[:, :, :2]  # to 2d

# get indexes of projections which are located inside masks
u, v, z = projections
inside_masks = []
img_h, img_w, _ = photo_context_image_np.shape
for idx in range(len(pcd.points)):
    point = np.array([int(u[idx]), int(v[idx])])
    if (point[0] <= 0 or point[0] >= img_w) or (point[1] <= 0 or point[1] >= img_h):
        continue
    else:
        if np.all(bitmap[point[1], point[0]] == 1):
            inside_masks.append(idx)
print(f"{len(inside_masks)} point projections are located inside masks")

# display point projections located inside masks
masked_projections_save_path = os.path.join(save_dir, "masked_projections.png")
draw_projections_on_image(
    projections.copy(),
    photo_context_image_np.copy(),
    masked_projections_save_path,
    preprocess=False,
    indexes=inside_masks,
)
plt.axis("off")
plt.imshow(Image.open(masked_projections_save_path));
Create 3D point cloud segmentation masks
Since the 3D points and their projections share the same indexing, we can apply the indexes of the projections located inside the 2D masks to the original LiDAR points and get the masked part of the point cloud.
visualize_3d_masks(local_pcd_path, inside_masks)
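Like the other helpers, visualize_3d_masks lives in the tutorial's repository; a rough sketch (an assumed implementation) that highlights the masked points on a plotly scatter plot might look like this:

import numpy as np
import open3d as o3d
import plotly.graph_objects as go


def visualize_3d_masks(pcd_path, mask_indexes):
    """Display the point cloud with masked points highlighted in red."""
    pcd = o3d.io.read_point_cloud(pcd_path, format="pcd")
    points = np.asarray(pcd.points)
    colors = np.full(len(points), "gray", dtype=object)
    colors[mask_indexes] = "red"
    fig = go.Figure(
        data=[
            go.Scatter3d(
                x=points[:, 0],
                y=points[:, 1],
                z=points[:, 2],
                mode="markers",
                marker=dict(size=1, color=colors.tolist()),
            )
        ]
    )
    fig.update_layout(scene_aspectmode="data")
    fig.show()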
As you can see, not only the points of the target objects (cars) were segmented: there are also points in front of and behind the target objects whose projections fall onto the target objects in the 2D image. This scenario was quite predictable, since points in 3D space can have similar x and y coordinates but different z coordinates: their projections will be almost the same, while in 3D space these points can be located in completely different parts of the point cloud.
In order to handle this corner case, we will apply some postprocessing to the masked LiDAR points and filter out the points which do not belong to our target objects.
When it comes to postprocessing 2D-to-3D segmentation transfer results, in most cases the points which belong to the target objects (in our case, cars) will be in the majority, while other points (in our case, points of the surrounding environment which we don't want to segment) will be in the minority. Additionally, the points of the target objects and the points of the surrounding environment are most likely to be distant from each other.
What can we do with this information? We can apply a clustering algorithm to the masked part of the point cloud and find the N biggest clusters, where N equals the number of target objects / segmentation masks (in our case, 2). Since "target" points and "noise" points are likely to be distant from each other, and "target" points are likely to have higher density than other points, there is a high probability that the points of the target objects will belong to the biggest clusters. Let's apply the DBScan clustering algorithm and test this hypothesis:
# cluster masked part of point cloud
pcd = o3d.io.read_point_cloud(local_pcd_path, format="pcd")
masked_pcd = pcd.select_by_index(inside_masks)
cluster_labels = np.array(masked_pcd.cluster_dbscan(eps=1.5, min_points=100))
clusters, counts = np.unique(cluster_labels, return_counts=True)

# we recommend to set number of clusters equal to number of segmentation masks on photo context image
n_biggest_clusters = len(ann.labels)
biggest_clusters = clusters[np.argsort(counts)][-n_biggest_clusters:]
biggest_cluster_indexes = []
for idx, label in enumerate(cluster_labels):
    if label in biggest_clusters:
        biggest_cluster_indexes.append(idx)
inside_masks_processed = [inside_masks[idx] for idx in biggest_cluster_indexes]
print(f"{len(inside_masks) - len(inside_masks_processed)} points were filtered using DBScan clustering")

# display point filtering results
visualize_3d_masks(local_pcd_path, inside_masks_processed)
As you can see, in our case DBScan perfectly filtered out the points which do not belong to the target objects.
Upload created 3D segmentation mask to Supervisely platform
Now that we have the indexes of the target objects' points in 3D space, we can create a 3D point cloud segmentation annotation and upload it to the Supervisely platform. We will create a new point cloud project and upload the input point cloud, the masked photo context image and the 3D segmentation mask to it.
# create output project
pcd_project = api.project.create(
    657,
    "Segmented point cloud",
    change_name_if_conflict=True,
    type=ProjectType.POINT_CLOUDS,
)
pcd_dataset = api.dataset.create(pcd_project.id, "ds_0", change_name_if_conflict=True)

# upload point cloud to output project
pcd_info = api.pointcloud.upload_path(pcd_dataset.id, name="scene.pcd", path=local_pcd_path)

# upload related image to output project
related_image_hash = api.pointcloud.upload_related_image(masked_photo_context_path)

# create dict with camera info
extrinsic_matrix = T_velo_ref0[:3, :4]
extrinsic_matrix = extrinsic_matrix.flatten().tolist()
intrinsic_matrix = calib_cam_to_cam_dict["P_rect_00"]
intrinsic_matrix = np.array(intrinsic_matrix.split(" "), dtype=np.float32).reshape(3, 4)
intrinsic_matrix = intrinsic_matrix[:3, :3].flatten().tolist()
cam_info = {
    "deviceId": "cam_0",
    "sensorsData": {
        "extrinsicMatrix": extrinsic_matrix,
        "intrinsicMatrix": intrinsic_matrix,
    },
}

# create dict with image info
related_image_info = {
    "entityId": pcd_info.id,
    "name": "img_0.png",
    "hash": related_image_hash,
    "meta": cam_info,
}

# upload related image info
api.pointcloud.add_related_images([related_image_info])

# upload point cloud segmentation mask to the platform
pcd_project_meta = sly.ProjectMeta.from_json(api.project.get_meta(pcd_project.id))
if not pcd_project_meta.get_obj_class("mask"):
    pcd_project_meta = pcd_project_meta.add_obj_class(sly.ObjClass("mask", Pointcloud))
    api.project.update_meta(pcd_project.id, pcd_project_meta.to_json())
ann_info = api.pointcloud.annotation.download(pcd_id)
pcd_objects = []
pcd_figures = []
geometry = Pointcloud(inside_masks_processed)
pcd_object = sly.PointcloudObject(pcd_project_meta.get_obj_class("mask"))
pcd_figure = sly.PointcloudFigure(pcd_object, geometry)
pcd_objects.append(pcd_object)
pcd_figures.append(pcd_figure)
pcd_objects = PointcloudObjectCollection(pcd_objects)
result_ann = sly.PointcloudAnnotation(
    pcd_objects, pcd_figures, PointcloudTagCollection([])
)
api.pointcloud.annotation.append(pcd_info.id, result_ann)
Conclusion
In this tutorial, we used 2D mask guidance, sensor calibration matrices and the DBScan clustering algorithm to transfer 2D segmentation masks to 3D space. This approach can be useful when there is a need for fast 3D point cloud labeling for instance segmentation tasks. For example, we can apply a YOLO11 instance segmentation model to each photo context image in the KITTI dataset and transfer the 2D masks to 3D space using sensor calibration data: this would allow us to create a huge labeled dataset for 3D instance segmentation without having to manually draw segmentation masks in every point cloud. Alternative ways of labeling 3D point clouds will be covered in future tutorials.
Acknowledgement
This tutorial is based on the great work by Isaac Berrios.