LiMA - Project Page

¹Nanjing University of Aeronautics and Astronautics
²National University of Singapore
³Zhejiang University
⁴Nanjing University of Posts and Telecommunications
⁵SKL-TI

Abstract

LiDAR representation learning aims to extract rich structural and semantic information from large-scale, readily available datasets, reducing reliance on costly human annotations. However, existing LiDAR representation strategies often overlook the inherent spatiotemporal cues in LiDAR sequences, limiting their effectiveness. In this work, we propose LiMA, a novel long-term image-to-LiDAR Memory Aggregation framework that explicitly captures longer range temporal correlations to enhance LiDAR representation learning. LiMA comprises three key components: 1) a Cross-View Aggregation module that aligns and fuses overlapping regions across neighboring camera views, constructing a more unified and redundancy-free memory bank; 2) a Long-Term Feature Propagation mechanism that efficiently aligns and integrates multi-frame image features, reinforcing temporal coherence during LiDAR representation learning; and 3) a Cross-Sequence Memory Alignment strategy that enforces consistency across driving sequences, improving generalization to unseen environments. LiMA maintains high pretraining efficiency and incurs no additional computational overhead during downstream tasks. Extensive experiments on mainstream LiDAR-based perception benchmarks demonstrate that LiMA significantly improves both LiDAR semantic segmentation and 3D object detection. We hope this work inspires more effective pretraining paradigms for autonomous driving. The code will be made publicly accessible for future research.

Resource

Paper

Code

HighLight

Figure 1. Illustrative examples of image-to-LiDAR pretraining paradigms. (a) Spatial Alignment aligns LiDAR features with corresponding image features in the spatial domain without considering temporal consistency. (b) Short-Term methods propagate LiDAR features frame by frame, ensuring feature consistency across neighboring frames but fail to capture long-term dependencies. (c) Our approach leverages Long-Term image sequences to enrich LiDAR representations, enabling a more comprehensive understanding of long-range dependencies and motion patterns.

Long-Term image-to-LiDAR Memory Aggregation
In this work, we present LiMA, a simple yet effective framework that distills rich motion patterns from longer sequences into LiDAR representations, enhancing the robustness and generalizability in LiDAR data pretraining. LiMA comprises three key components: (1) Cross-View Aggregation, which unifies multi-view image features to construct a high-quality memory bank; (2) Long-Term Feature Propagation, which enforces temporal consistency module, which enforces temporal consistency and propagates long-term motion cues; and (3) Cross-Sequence Memory Alignment, which improves robustness by aligning long-term features across diverse driving contexts.
By integrating these three components, LiMA effectively captures both spatial and temporal correlations within and across driving sequences. Our framework ensures high pretraining efficiency with an increased number of frames, requiring no more than 20 hours for pretraining. Furthermore, LiMA introduces no additional computational overhead during downstream tasks, ensuring efficient deployment.

Figure 2. Overview of the LiMA framework. At each timestamp t, multi-view image features are first extracted and unified through the Cross-View Aggregation module. A Memory Bank is introduced to maintain unified image features from the past k frames, enabling temporal feature propagation and fusion to capture Long-Term Motion Patterns. The enriched temporal features are distilled into LiDAR representation, enabling Cross-Modal Learning. The memory bank is continuously updated in a first-in, first-out (FIFO) manner, ensuring effective feature propagation and refinement for future frames.

Figure 3. Cosine similarity between a query point (marked as the red dot) and: (1) image features, and (2) LiDAR point features projected onto the image. Colors range from red (indicating high similarity) to blue (indicating low similarity). Best viewed in colors.

Table 1. Comparisons of state-of-the-art LiDAR pretraining methods pretrained on nuScenes and fine-tuned on nuScenes, SemanticKITTI, and Waymo Open datasets, respectively, with specific data portions. LP denotes linear probing with frozen backbones. All scores are given in percentage (%).

Method	Venue	Backbone (2D)	Backbone (3D)	Frames	nuScenes						KITTI	Waymo
Method	Venue	Backbone (2D)	Backbone (3D)	Frames	LP	1%	5%	10%	25%	Full	1%	1%

Random	-	-	-	-	8.10	30.30	47.84	56.15	65.48	74.66	39.50	39.41
SLidR	CVPR'22	ResNet-50	MinkUNet-34	1	38.80	38.30	52.49	59.84	66.91	74.79	44.60	47.12
ST-SLidR	CVPR'23	ResNet-50	MinkUNet-34	1	40.48	40.75	54.69	60.75	67.70	75.14	44.72	44.93
TriCC	CVPR'23	ResNet-50	MinkUNet-34	2	38.00	41.20	54.10	60.40	67.60	75.60	45.90	-
Seal	NeurIPS'23	ResNet-50	MinkUNet-34	2	44.95	45.84	55.64	62.97	68.41	75.60	46.63	49.34
CSC	CVPR'24	ResNet-50	MinkUNet-34	1	46.00	47.00	57.00	63.30	68.60	75.70	47.20	-
HVDistill	IJCV'24	ResNet-50	MinkUNet-34	1	39.50	42.70	56.60	62.90	69.30	76.60	49.70	-

SLidR	CVPR'22	ViT-S	MinkUNet-34	1	44.70	41.16	53.65	61.47	66.71	74.20	44.67	47.57
Seal	NeurIPS'23	ViT-S	MinkUNet-34	2	45.16	44.27	55.13	62.46	67.64	75.58	46.51	48.67
SuperFlow	ECCV'24	ViT-S	MinkUNet-34	3	46.44	47.81	59.44	64.47	69.20	76.54	47.97	49.94
ScaLR	CVPR'24	ViT-S	MinkUNet-34	1	49.66	45.89	56.52	61.07	65.79	73.39	46.06	47.67
LiMA	Ours	ViT-S	MinkUNet-34	6	54.76	48.75	60.83	65.41	69.31	76.94	49.28	50.23

SLidR	CVPR'22	ViT-B	MinkUNet-34	1	45.35	41.64	55.83	62.68	67.61	74.98	45.50	48.32
Seal	NeurIPS'23	ViT-B	MinkUNet-34	2	46.59	45.98	57.15	62.79	68.18	75.41	47.24	48.91
SuperFlow	ECCV'24	ViT-B	MinkUNet-34	3	47.66	48.09	59.66	64.52	69.79	76.57	48.40	50.20
ScaLR	CVPR'24	ViT-B	MinkUNet-34	1	51.90	48.90	57.69	62.88	66.85	74.15	47.77	49.38
LiMA	Ours	ViT-B	MinkUNet-34	6	56.65	51.29	61.11	65.62	70.43	76.91	50.44	51.35

SLidR	CVPR'22	ViT-L	MinkUNet-34	1	45.70	42.77	57.45	63.20	68.13	75.51	47.01	48.60
Seal	NeurIPS'23	ViT-L	MinkUNet-34	2	46.81	46.27	58.14	63.27	68.67	75.66	47.55	50.02
SuperFlow	ECCV'24	ViT-L	MinkUNet-34	3	48.01	49.95	60.72	65.09	70.01	77.19	49.07	50.67
ScaLR	CVPR'24	ViT-L	MinkUNet-34	1	51.77	49.13	58.36	62.75	66.80	74.16	48.64	49.72
LiMA	Ours	ViT-L	MinkUNet-34	6	56.67	53.22	62.46	66.00	70.59	77.23	52.29	51.19

Table 2. Domain generalization study of different LiDAR pretraining methods pretrained on the nuScenes dataset and fine-tuned on a collection of seven different semantic segmentation datasets, respectively, with specific data portions. All scores are given in percentage (%).

Method	Venue	ScriKITTI		Rellis-3D		SemPOSS		SemSTF		SynLiDAR		DAPS-3D		Synth4D
Method	Venue	1%	10%	1%	10%	Half	Full	Half	Full	1%	10%	Half	Full	1%	10%

Random	-	23.81	47.60	38.46	53.60	46.26	54.12	48.03	48.15	19.89	44.74	74.32	79.38	20.22	66.87
SLidR	CVPR'22	39.60	50.45	49.75	54.57	51.56	55.36	52.01	54.35	42.05	47.84	81.00	85.40	63.10	62.67
Seal	NeurIPS'23	40.64	52.77	51.09	55.03	53.26	56.89	53.46	55.36	43.58	49.26	81.88	85.90	64.50	66.96
SuperFlow	ECCV'24	42.70	54.00	52.83	55.71	54.41	57.33	54.72	56.57	44.85	51.38	82.43	86.21	65.31	69.43
ScaLR	CVPR'24	40.64	52.39	52.53	55.57	53.65	56.86	54.06	55.96	44.42	51.96	81.92	85.58	64.36	67.44
LiMA	Ours	45.90	55.13	55.62	57.15	55.05	57.81	55.45	56.70	46.66	52.32	83.11	86.63	66.04	70.19

Table 3. Comparisons of state-of-the-art LiDAR pretraining methods pretrained and fine-tuned on the nuScenes dataset with specific data portions. All detection methods employ CenterPoint or SECOND as the 3D object detection backbone.

Method	Venue	nuScenes
		5%		10%		20%
		mAP	NDS	mAP	NDS	mAP	NDS
Backbone: VoxelNet + CenterPoint
Random	-	38.0	44.3	46.9	55.5	50.2	59.7
PointContrast	ECCV'20	39.8	45.1	47.7	56.0	-	-
GCC-3D	ICCV'21	41.1	46.8	48.4	56.7	-	-
SLidR	CVPR'22	43.3	52.4	47.5	56.8	50.4	59.9
TriCC	CVPR'23	44.6	54.4	48.9	58.1	50.9	60.3
CSC	CVPR'24	45.3	54.2	49.3	58.3	51.9	61.3
ScaLR	CVPR'24	44.3	53.3	48.2	57.1	50.7	60.8
LiMA	Ours	46.5	56.4	50.1	59.6	52.3	62.3
Backbone: VoxelNet + SECOND
Random	-	35.8	45.9	39.0	51.2	43.1	55.7
SLidR	CVPR'22	36.6	48.1	39.8	52.1	44.2	56.3
TriCC	CVPR'23	37.8	50.0	41.4	53.5	45.5	58.7
CSC	CVPR'24	38.2	49.4	42.5	54.8	45.6	58.1
ScaLR	CVPR'24	37.3	48.7	41.4	53.5	45.5	58.6
LiMA	Ours	39.4	50.1	43.2	55.3	46.0	59.5

Figure 4. Qualitative assessments of state-of-the-art methods, pretrained on nuScenes and fine-tuned on nuScenes with 1% annotations. The error maps depict correct and incorrect predictions in gray and red, respectively. Best viewed in colors.

Figure 5. Qualitative assessments of object detection, pretrained on nuScenes and fine-tuned on nuScenes with 5% annotations. The groundtruth/predicted results are highlighted with blue/red boxes, respectively. Best viewed in colors.

BibteX

      @inproceedings{xu2025lima,
        title = {Beyond One Shot, Beyond One Perspective: Cross-View and Long-Horizon Distillation for Better LiDAR Representations},
        author = {Xu, Xiang and Kong, Lingdong and Wang, Song and Zhou, Chuanwei and Liu, Qingshan},
        booktitle = {IEEE/CVF International Conference on Computer Vision},
        year = {2025}
      }