We introduce ScanBot, a novel dataset designed for instruction-conditioned, high-precision surface scanning in robotic systems. In contrast to existing robot learning datasets that focus on coarse tasks such as grasping, navigation, or dialogue, ScanBot targets the high-precision demands of industrial laser scanning, where sub-millimeter path continuity and parameter stability are critical. The dataset covers laser scanning trajectories executed by a robot across 12 diverse objects and 6 task types, including full-surface scans, geometry-focused regions, spatially referenced parts, functionally relevant structures, defect inspection, and comparative analysis. Each scan is guided by a natural language instruction and paired with synchronized RGB, depth, and laser profiles, as well as robot pose and joint states. Despite recent progress, existing vision-language-action (VLA) models still fail to generate stable scanning trajectories under fine-grained instructions and real-world precision demands. To investigate this limitation, we benchmark a range of multimodal large language models (MLLMs) across the full perception–planning–execution loop, revealing persistent challenges in instruction-following under realistic constraints.
The 12 objects and their annotated scan regions are:

- blue gpu: top surface, connector, fan, hdmi, memory chips
- green gpu with big fan: top surface, components, connector, fan, hdmi
- green gpu with small fan: top surface, blue capacitor, connector, fan, hdmi
- red gpu: top surface, connector, fan, hdmi, power component
- ram: top surface, 4 memory chips, 5 memory chips, lower connector, upper connector
- wifi card: top surface, antenna ports, connector, pin header, wireless chip
- cube 1: top surface, dropped corner, embedded circle, embedded triangle
- cube 2: top surface, big bump, big circle hole, missing corners, scratch, small bump, small circle hole
- red cylinder: top surface, bump, hole, hole array, ring groove, scratch
- white cylinder: top surface, bump, hole, hole array, ring groove, scratch
- black triangle: top surface
- white triangle: top surface
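Because each sample pairs a language instruction with synchronized RGB, depth, laser profiles, and robot state, a record-style container is a natural way to work with the data. The sketch below is one possible representation, not the released API: the field names, array shapes, and the trajectory_length helper are illustrative assumptions, and the actual schema of the dataset may differ.

```python
# A minimal sketch of how one ScanBot sample might be represented in code.
# Field names, shapes, and types below are assumptions for illustration;
# consult the dataset release for the actual schema.
from dataclasses import dataclass
import numpy as np


@dataclass
class ScanBotSample:
    """One instruction-conditioned scan, pairing language with sensor streams."""
    instruction: str            # natural language scan instruction
    task_type: str              # one of the 6 task types, e.g. "defect inspection"
    object_id: str              # e.g. "blue gpu", "cube 1"
    region: str                 # annotated scan region, e.g. "top surface"
    rgb: np.ndarray             # (T, H, W, 3) synchronized RGB frames
    depth: np.ndarray           # (T, H, W) depth maps
    laser_profiles: np.ndarray  # (T, P) laser scanner profiles per timestep
    ee_poses: np.ndarray        # (T, 7) end-effector poses (xyz + quaternion)
    joint_states: np.ndarray    # (T, J) robot joint positions


def trajectory_length(sample: ScanBotSample) -> float:
    """Total Cartesian path length of the scan trajectory, in the pose units."""
    xyz = sample.ee_poses[:, :3]
    return float(np.linalg.norm(np.diff(xyz, axis=0), axis=1).sum())
```

Keeping the end-effector poses and laser profiles time-aligned in a single record makes it straightforward to relate path-level metrics, such as the Cartesian path length above, to the instruction, object, and region being scanned.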