Continuous recording tells you what happened. Not which fault you are looking at.

A robot stops in the field. You pull yesterday's bag, scrub the timeline, eyeball forty topics until something looks wrong. Hours later you have a guess.

The bag is necessary. It is not enough. Without classification, every replay starts from "what changed at this timestamp" instead of "what fault am I debugging." That is the workflow that scales badly the moment you have more than one robot.

This article walks through a real fault on a Yahboomcar M3 Pro: the scan1 LIDAR covered mid-pickup, classified by ros2_medkit the moment it crossed the threshold, then ingested into Mosaico and queried three different ways in 30 seconds. For the SOVD foundation that ros2_medkit builds on, see the SOVD for ROS 2 introduction.

The fault: scan1 LIDAR covered mid-pickup

The robot ran an AprilTag pick-and-place loop. Two LIDAR streams (/scan0 from the rear SLAMTEC, /scan1 from the front), an IMU, an Orbbec RGB camera, a 6-DoF arm on a Mecanum base, plus battery and odometry. About thirty topics, ten of them load-bearing.

m3pro_diagnostics enforced thresholds: valid_ranges < 300 on either scan, battery < 10.5 V, IMU out of range. Standard REP-107 /diagnostics output, nothing custom.
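To make that threshold concrete, here is roughly what such a check looks like as a diagnostic_updater task. A minimal sketch, not the m3pro_diagnostics source: the node and task names are invented, and valid_ranges is assumed to mean finite returns inside the sensor's rated range.

import math

import rclpy
from rclpy.node import Node
from sensor_msgs.msg import LaserScan
from diagnostic_msgs.msg import DiagnosticStatus
from diagnostic_updater import Updater

class Scan1Coverage(Node):
    def __init__(self):
        super().__init__("scan1_coverage_check")   # hypothetical node name
        self.valid = 0
        self.create_subscription(LaserScan, "/scan1", self.on_scan, 10)
        self.updater = Updater(self)               # publishes /diagnostics at 1 Hz
        self.updater.setHardwareID("m3pro-01")
        self.updater.add("scan1 coverage", self.check)

    def on_scan(self, msg):
        # Count returns that are finite and inside the rated range.
        self.valid = sum(1 for r in msg.ranges
                         if math.isfinite(r) and msg.range_min <= r <= msg.range_max)

    def check(self, stat):
        level = DiagnosticStatus.ERROR if self.valid < 300 else DiagnosticStatus.OK
        stat.summary(level, f"valid_ranges={self.valid}")
        return stat

def main():
    rclpy.init()
    rclpy.spin(Scan1Coverage())

if __name__ == "__main__":
    main()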

At T+14.9s a hand entered scan1's field of view. Valid ranges dropped from ~410 to 141, well under the 300 threshold. m3pro_diagnostics flipped to ERROR. medkit debounced briefly, then promoted it to a confirmed fault.

Cross-topic timeline of the fault: scan1 collapses, scan0 stable, robot stationary

What medkit captured

Two things happened in the gateway, milliseconds apart.

First, the fault became a first-class object - fault_code: LIDAR_SCAN1, state: critical, severity, first/last/active occurrence, debounce metadata. Not a free-text log line that scrolls past, but a structured row keyed by fault_code that any other process can subscribe to over SOVD's REST or Server-Sent Events surface.
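To see what first-class buys you: reading that row back is one GET. A sketch with the gateway address and exact resource path assumed (SOVD exposes faults as a sub-resource of an entity); only the field names come from the record above.

import requests

# Hypothetical host, port, component id, and response envelope;
# fault_code/state/severity are the fields described above.
resp = requests.get("http://medkit:8080/components/m3pro/faults")
resp.raise_for_status()
for fault in resp.json().get("items", []):
    print(fault["fault_code"], fault["state"], fault["severity"])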

Second, the ring buffer flushed. medkit keeps a sliding window in memory across whatever topics you whitelist; the demo is configured for 15 seconds before the trigger and 10 seconds after (the gateway defaults are 5 s / 1 s and tunable per deployment). When a fault is confirmed, the buffer becomes a .mcap snapshot.
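The mechanics are roughly a time-bounded queue: evict anything older than the pre-trigger horizon, and on confirmation keep collecting until the post-trigger horizon closes, then write the lot. A toy sketch of that logic (medkit's buffer lives inside the gateway; every name here is illustrative):

import time
from collections import deque

PRE_S, POST_S = 15.0, 10.0            # this demo's window; gateway defaults are 5 s / 1 s
buffer = deque()                       # (monotonic_stamp, topic, raw_message)

def on_message(topic, raw):
    now = time.monotonic()
    buffer.append((now, topic, raw))
    while buffer and buffer[0][0] < now - PRE_S:
        buffer.popleft()               # evict beyond the pre-trigger horizon

def on_fault_confirmed(write_mcap_snapshot):
    time.sleep(POST_S)                 # let the post-trigger window fill
    write_mcap_snapshot(list(buffer))  # one .mcap, one fault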

For this run the snapshot was 414 MB across ten topics: both LIDAR scans, IMU, command velocity, odometry, battery, the camera image, and three robot-specific topics (/diagnostics, /PosInfo, /arm6_joints). One file, one fault. Tagged with fault_code in the metadata so it travels with the diagnostic record.

A separate safety_bridge was already subscribed to /faults/stream. When LIDAR_SCAN1 confirmed, it started publishing zero-velocity cmd_vel at 50 Hz over a five-second rolling window. The robot stopped within two seconds.
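The override is publisher arithmetic: a 50 Hz stream of zero twists outruns a 10 Hz drive node. A minimal rclpy sketch of that behavior, not the demo's safety_bridge source:

import rclpy
from rclpy.node import Node
from geometry_msgs.msg import Twist

class ZeroVelOverride(Node):
    """Flood /cmd_vel with zeros for a fixed window once a fault confirms."""

    def __init__(self):
        super().__init__("zero_vel_override")      # hypothetical node name
        self.pub = self.create_publisher(Twist, "/cmd_vel", 10)
        self.deadline = self.get_clock().now().nanoseconds + 5_000_000_000  # 5 s
        self.create_timer(1.0 / 50.0, self.tick)   # 50 Hz, vs the drive node's 10 Hz

    def tick(self):
        if self.get_clock().now().nanoseconds < self.deadline:
            self.pub.publish(Twist())              # all-zero twist: stop

def main():
    rclpy.init()
    rclpy.spin(ZeroVelOverride())

if __name__ == "__main__":
    main()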

One Python call into Mosaico

medkit's job ends at the snapshot. The next question is what you do with the bag once you have ten of them, then a hundred, then a fleet's worth. A folder of .mcap files plus ad-hoc Python is the usual answer, and it does not scale.

Mosaico is a forensic catalog for robot recordings - Apache Arrow Flight on the wire, Parquet underneath, indexed by metadata you control. Their Python SDK turns a bag into a queryable sequence in one call:

from mosaicolabs.ros_bridge.injector import RosbagInjector, ROSInjectionConfig
 
RosbagInjector(ROSInjectionConfig(
    file_path="fault_LIDAR_SCAN1_2026-05-03T14-31-09.mcap",
    sequence_name="m3pro-fault-LIDAR_SCAN1-001",
    metadata={
        "robot_id": "m3pro-01",
        "fault_code": "LIDAR_SCAN1",
        "severity": "critical",
        "captured_at": "2026-05-03T14:31:09Z",
    },
    host="mosaicod",
    port=6726,
)).run()

Sensor topics (LaserScan, Imu, NavSatFix, Image) get adapters out of the box. After the call the fault record is a sequence in the catalog, addressable by metadata, queryable by topic, joinable across the fleet.

Three queries

Once the bag is in the catalog, the questions you ask change shape. Instead of opening a viewer and scrubbing, you write expressions.

Three Mosaico SDK queries on the ingested fault: inventory, structural, content

Q1: did the robot move, or did the sensor fail?

The IMU showed ~1g vertical with no impact spike. cmd_vel ramped before the fault, then zeroed. The robot was driving normally and was not colliding with anything. Sensor failure, not a physical event.

Q2: was scan0 also affected, or just scan1?

scan0 stayed at ~430 valid ranges throughout the window. The fault was localized to scan1 - not a global LIDAR-driver glitch, not a USB bus dropout. Front sensor blocked, rear sensor fine.

Q3: did safety override engage?

After confirmation, safety_bridge published zero-velocity at 50 Hz. The AprilTag drive node kept pushing 10 Hz commands. The faster publisher won. cmd_vel settled and the robot stopped within two seconds.

Q2 is where the catalog earns its keep. With a folder of bags and ad-hoc Python you would open both files, align two streams on timestamp, count valid ranges per bin, then diff. An afternoon of plumbing for one answer. With Mosaico's catalog queried by fault_code, one expression filters by topic and a second reads the ranges. Thirty seconds.
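For contrast, the ad-hoc version of Q2 looks something like the sketch below: read the snapshot with the mcap-ros2-support reader, bin valid-range counts per second for each scan topic, then diff. The file and topic names are this run's; the binning is exactly the plumbing the catalog replaces.

import math
from collections import defaultdict
from mcap_ros2.reader import read_ros2_messages

def valid_per_second(path, topic):
    # Mean count of finite, in-range returns per one-second bin.
    bins = defaultdict(list)
    for m in read_ros2_messages(path, topics=[topic]):
        scan = m.ros_msg
        valid = sum(1 for r in scan.ranges
                    if math.isfinite(r) and scan.range_min <= r <= scan.range_max)
        bins[m.log_time_ns // 1_000_000_000].append(valid)
    return {t: sum(v) / len(v) for t, v in sorted(bins.items())}

bag = "fault_LIDAR_SCAN1_2026-05-03T14-31-09.mcap"
front = valid_per_second(bag, "/scan1")
rear = valid_per_second(bag, "/scan0")
for t in front:
    print(t, f"scan1={front[t]:.0f}", f"scan0={rear.get(t, float('nan')):.0f}")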

Catalog the fault, not the day

The numbers fall out of recording the right window:

| Metric | Continuous | Fault-driven (this setup) |
| --- | --- | --- |
| Per-event capture | 1.5 TB / day | 414 MB / fault record |
| Daily volume at 10 faults/day | 1.5 TB | 4.14 GB |
| Reduction vs continuous | baseline | ~360x |

The compression ratio is not the point. The point is that the catalog grows by event, not by clock. Search is "show me every LIDAR_SCAN1 from this fleet last week," not "show me a 1.5 TB folder and good luck."

ros2_medkit and Mosaico architecture

ros2_medkit and Mosaico architecture: M3 Pro robot to fault tracking to MCAP record to Apache Arrow Flight catalog to query

m3pro_diagnostics is stock REP-107. ros2_medkit is the open-source SOVD gateway - it watches /diagnostics, classifies faults, owns the ring buffer, exposes the REST + SSE surface. mosaicod is the upstream Mosaico daemon running as an unmodified Docker image. A small bridge subscribes to /faults/stream, downloads the snapshot, and calls RosbagInjector. About 150 lines of Python.
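The bridge's core loop fits on one page. A sketch under stated assumptions: the medkit host, the snapshot-download path, and every event field beyond fault_code and severity are invented here; the /faults/stream subscription and the RosbagInjector call are the ones described above.

import json

import requests
from mosaicolabs.ros_bridge.injector import RosbagInjector, ROSInjectionConfig

MEDKIT = "http://medkit:8080"    # hypothetical gateway address

def ingest(fault):
    name = fault["snapshot"]     # assumed event field carrying the .mcap name
    with requests.get(f"{MEDKIT}/snapshots/{name}", stream=True) as r:
        r.raise_for_status()
        with open(name, "wb") as f:
            for chunk in r.iter_content(1 << 20):
                f.write(chunk)
    RosbagInjector(ROSInjectionConfig(
        file_path=name,
        sequence_name=f"m3pro-fault-{fault['fault_code']}",
        metadata={"fault_code": fault["fault_code"],
                  "severity": fault["severity"]},
        host="mosaicod",
        port=6726,
    )).run()

# Minimal SSE consumer: confirmed faults arrive as `data:` events.
with requests.get(f"{MEDKIT}/faults/stream", stream=True) as events:
    for line in events.iter_lines():
        if line and line.startswith(b"data:"):
            ingest(json.loads(line[len(b"data:"):]))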

Nothing on the robot changes. No custom topics, no extra payload, no new SDK to learn. The diagnostic stack already publishes the right messages; medkit gives them structure; Mosaico gives them a search surface.

Get started

The diagnostic gateway (ros2_medkit) is open source under Apache 2.0. The full demo - architecture, notebook with the three queries, the bridge that wires medkit to Mosaico - lives in selfpatch_demos/demos/mosaico_integration. Run docker compose up, trigger the fault, watch the snapshot land in the catalog, run the queries.

If you are running ROS 2 today and your fault story still ends in tail -f and a folder of bags, the SOVD-based pattern shown here is the workflow change worth piloting first. For the broader picture, see the SOVD for ROS 2 introduction and the VDA 5050 + SOVD article. Get in touch if you want help wiring fault forensics into your fleet.