Vision-Guided Differentiable Physics for Robotic Manipulation

A presentation-style robotics case study that connects RGB observations, robot state, temporal Transformer prediction, differentiable rollout losses, and Isaac Sim Franka data to learn multi-step manipulation behavior from visual context.

Robotics · 2025 · Project Lead / Research Engineer
Robotics, Robot Learning, Differentiable Physics, Vision-Guided Control, Isaac Sim, Franka Manipulation, Transformer, 3D Gaussian Splatting, System Identification

Highlights

  • Built an end-to-end RGB plus robot-state learning pipeline for Franka manipulation, predicting 8-step future joint deltas and end-effector trajectories from 4 context frames/states.
  • Implemented a TemporalVisionFrankaPolicy with a CNN image encoder, state projection, Transformer temporal fusion, joint-delta rollout head, and end-effector prediction head.
  • Created an Isaac Sim data workflow with brightly lit multi-episode collection, merged JSONL indexes, episode-level validation, continuity filters, image preprocessing, and modality ablations.
  • Trained the advanced run for 100 epochs and reached a best validation terminal end-effector distance of 1.67 cm with a 1.37 cm mean trajectory end-effector distance across 976 validation windows.
  • Validated that visual context contributes to the model's predictions: zeroing RGB increased terminal end-effector error by about 1.49x, while removing state/cube context caused order-of-magnitude degradation.

Key metrics

  • Terminal EE error: 1.67 cm. Best validation terminal end-effector distance, at epoch 97.
  • Mean EE error: 1.37 cm. Validation trajectory-average end-effector distance.
  • Temporal window: 4 -> 8. 4 context frames/states used to predict an 8-step future rollout.
  • Validation windows: 976. Windows used in the saved modality-ablation evaluation.
  • Vision ablation: 1.49x. Terminal error increase when RGB context is zeroed.
  • Model backbone: 2L / 4H. 2-layer, 4-head Transformer temporal encoder.

Media

Project cover summarizing the portfolio case study: RGB observations, robot state, temporal policy learning, differentiable rollouts, and end-effector target prediction.
System architecture: visual observations are encoded, fused with state context, passed through a differentiable physics/simulation loop, and optimized through end-to-end gradient flow.
Visual feature extraction concept: RGB observations and depth-like structure support prediction of physical properties such as mass, friction, and restitution for downstream physics reasoning.
Optimization progression: the differentiable loop moves from an initial configuration through gradient-guided updates toward a target manipulation configuration.
Presentation-style summary of the advanced Isaac Franka run, including context horizon, best end-effector errors, validation windows, and the training/evaluation pipeline.
Training curve from the advanced run: train terminal EE error, validation terminal EE error, and validation mean EE error converge toward centimeter-level tracking behavior.
Predicted vs target end-effector rollout: the model predicts future x, y, and z end-effector coordinates over the 8-step horizon.
Ablation summary: the full model is strongest; removing RGB increases terminal error, while removing robot/state context causes much larger degradation.
Representative final context frame from the saved Isaac Franka evaluation sample, showing the kind of visual input used by the policy.

Tech stack

Python, PyTorch, NVIDIA Isaac Sim, NVIDIA Warp, 3D Gaussian Splatting, Transformer Encoder, CNN Vision Encoder, Franka Panda, JSONL Datasets, YAML Configs, Matplotlib

Objective

The goal of this project is to make robotic manipulation models more physically grounded by tying visual observations to differentiable state prediction and rollout losses. Instead of treating perception and control as separate blocks, the system asks whether an RGB-conditioned policy can learn multi-step robot motion while remaining inspectable through simulation metrics and ablations.

The final portfolio version presents the project as a research-style report page: it explains the robotics problem, shows the architecture, documents the Isaac Sim data workflow, visualizes training behavior, and summarizes ablation evidence from the saved run artifacts.

Problem and motivation

Contact-rich manipulation is hard because the policy must reason about geometry, object state, physical interaction, robot kinematics, and the future consequences of actions. Pure image models can learn correlations, but they often obscure whether a prediction rests on visual evidence, robot state, or dataset shortcuts.

This project focuses on the bridge between visual learning and physics-aware prediction. The code base starts with runnable differentiable-physics demos, then extends the idea toward Isaac Sim Franka manipulation with temporal context, multi-step supervision, continuity filtering, and modality ablations.

System architecture

  • Input: a temporal context of 4 RGB frames plus robot joint positions, end-effector position, cube/object position, target position, and a normalized time token.
  • Vision branch: a lightweight CNN encoder converts each frame into a compact visual feature vector that can be trained without external pretrained dependencies.
  • State branch: robot and scene state are projected into the same embedding space as the vision features.
  • Temporal fusion: a 2-layer, 4-head Transformer encoder fuses the context window and produces a sequence-level summary.
  • Prediction heads: the model predicts bounded future joint deltas for rollout integration and future end-effector positions for trajectory-level supervision.

Isaac Sim data and sequence pipeline

The project includes an Isaac Sim Franka workflow for collecting RGB frames and robot/cube/end-effector trajectories, converting episodes into training indexes, merging multiple episodes, and inspecting sequence quality before training.

A major engineering improvement was treating sequence quality as a first-class issue. The dataset loader splits merged JSONL indexes back into episode trajectories, supports episode-level train/validation splits, and rejects windows with impossible end-effector or joint jumps so the temporal model is not trained across reset boundaries.
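As a rough illustration of the loader behavior described above, the sketch below regroups merged index records into episodes, splits at the episode level, and drops windows that contain a discontinuity. It is not the repository's loader: the record fields (`episode_id`, `step`, `ee_pos`, `joints`), the function names, and the jump thresholds are all hypothetical placeholders.

```python
import math
import random
from collections import defaultdict


def split_episodes(records):
    """Group merged index records back into per-episode lists ordered by step."""
    episodes = defaultdict(list)
    for rec in records:
        episodes[rec["episode_id"]].append(rec)
    return {eid: sorted(recs, key=lambda r: r["step"]) for eid, recs in episodes.items()}


def episode_level_split(episode_ids, val_fraction=0.2, seed=0):
    """Assign whole episodes to train or validation so no window spans both."""
    ids = sorted(episode_ids)
    random.Random(seed).shuffle(ids)
    n_val = max(1, round(len(ids) * val_fraction))
    return ids[n_val:], ids[:n_val]


def is_continuous(window, max_ee_jump, max_joint_jump):
    """Return False if any step-to-step jump exceeds the (assumed) thresholds."""
    for prev, cur in zip(window, window[1:]):
        if math.dist(prev["ee_pos"], cur["ee_pos"]) > max_ee_jump:
            return False
        joint_jump = max(abs(a - b) for a, b in zip(prev["joints"], cur["joints"]))
        if joint_jump > max_joint_jump:
            return False
    return True


def make_windows(episode, context=4, horizon=8, stride=1,
                 max_ee_jump=0.05, max_joint_jump=0.5):
    """Slide a (context + horizon) window over one episode; keep continuous ones."""
    total = context + horizon
    windows = []
    for start in range(0, len(episode) - total + 1, stride):
        window = episode[start:start + total]
        if is_continuous(window, max_ee_jump, max_joint_jump):
            windows.append((window[:context], window[context:]))
    return windows
```

Filtering on the full context-plus-horizon span, rather than on the context alone, is what keeps a reset boundary out of the multi-step targets as well as the inputs.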

  • Bright collection workflow: stronger lighting, camera look-at setup, visual randomization, debug previews, and percentile image preprocessing for under-exposed frames.
  • Sequence settings: image size 128, batch size 8, context length 4, horizon 8, stride 1, validation fraction 0.2, and episode-level split mode.
  • Continuity filters: reject windows with excessive end-effector jumps or joint-space jumps before constructing multi-step targets.
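The percentile image preprocessing mentioned in the list above could look roughly like the following NumPy sketch. The percentile bounds, the dark-pixel threshold, and both function names are assumptions chosen for illustration, not values taken from the project configuration.

```python
import numpy as np


def percentile_stretch(image, low=1.0, high=99.0, eps=1e-6):
    """Rescale intensities so the [low, high] percentiles map onto [0, 1]."""
    img = image.astype(np.float32)
    lo, hi = (float(v) for v in np.percentile(img, [low, high]))
    if hi - lo < eps:
        # Flat frame: nothing to stretch, return zeros rather than divide by ~0.
        return np.zeros_like(img)
    return np.clip((img - lo) / (hi - lo), 0.0, 1.0).astype(np.float32)


def dark_fraction(image_uint8, threshold=0.1):
    """Fraction of pixels whose [0, 1]-scaled intensity falls below `threshold`."""
    return float((image_uint8.astype(np.float32) / 255.0 < threshold).mean())
```

A diagnostic such as `dark_fraction` can flag under-exposed frames before training, while the stretch gives the CNN a more usable dynamic range on frames that are kept.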

Temporal vision model

The core model is a TemporalVisionFrankaPolicy. It encodes each context frame using a CNN, combines those visual features with projected state vectors, adds learned positional embeddings, and passes the context through a Transformer encoder. The last token summary drives two prediction heads: one for future joint deltas and one for future end-effector coordinates.

The architecture is intentionally practical: the image encoder is compact, the Transformer is modest enough to train on project-scale data, and the prediction heads expose interpretable quantities that can be plotted and compared against future trajectories.
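A minimal PyTorch sketch of this kind of model is shown below. Only the overall structure (CNN encoder, state projection, learned positional embeddings, 2-layer 4-head Transformer, two heads) comes from the description above. Everything else is an assumption for illustration: the class name suffix, layer widths, the 17-dimensional state (7 joints + 3 end-effector + 3 cube + 3 target + 1 time token), fusing vision and state by addition, and bounding deltas with a scaled `tanh`.

```python
import torch
import torch.nn as nn


class TemporalVisionFrankaPolicySketch(nn.Module):
    """Illustrative stand-in; layer sizes and fusion choices are assumptions."""

    def __init__(self, state_dim=17, num_joints=7, context=4, horizon=8,
                 d_model=128, max_delta=0.1):
        super().__init__()
        self.horizon, self.num_joints, self.max_delta = horizon, num_joints, max_delta
        # Lightweight CNN trained from scratch (no pretrained weights).
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, d_model),
        )
        self.state_proj = nn.Linear(state_dim, d_model)
        self.pos_embed = nn.Parameter(torch.zeros(1, context, d_model))
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=4, dim_feedforward=4 * d_model, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=2)
        self.delta_head = nn.Linear(d_model, horizon * num_joints)
        self.ee_head = nn.Linear(d_model, horizon * 3)

    def forward(self, images, states):
        # images: (B, T, 3, H, W); states: (B, T, state_dim)
        b, t = images.shape[:2]
        vis = self.image_encoder(images.flatten(0, 1)).view(b, t, -1)
        # Assumed fusion: sum of vision features, projected state, and position.
        tokens = vis + self.state_proj(states) + self.pos_embed[:, :t]
        summary = self.temporal(tokens)[:, -1]  # last-token summary
        # Assumed bounding: tanh scaled to +/- max_delta radians per step.
        deltas = self.max_delta * torch.tanh(self.delta_head(summary))
        ee = self.ee_head(summary)
        return deltas.view(b, self.horizon, self.num_joints), ee.view(b, self.horizon, 3)
```

With a batch of 4 context frames at 128x128, the two outputs have shapes `(B, 8, 7)` for joint deltas and `(B, 8, 3)` for end-effector positions.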

Differentiable physics and rollout losses

The training objective combines multiple physical consistency signals instead of relying on one scalar loss. Predicted joint deltas are integrated into future joint trajectories, end-effector predictions are compared across the horizon, terminal end-effector error receives extra weight, smoothness discourages jittery actions, and joint-limit regularization keeps predictions physically plausible.
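One way to combine the terms listed above is sketched here in PyTorch. The weights, the function name `rollout_loss`, and the exact form of each term (squared error on joints, Euclidean distance on end-effector positions, hinge penalty on joint limits) are assumptions; the source only states which signals are combined, not how.

```python
import torch


def rollout_loss(q_last, pred_deltas, pred_ee, target_q, target_ee, q_min, q_max,
                 w_joint=1.0, w_ee=1.0, w_terminal=2.0, w_smooth=0.1, w_limit=0.1):
    """Weighted sum of rollout terms; all weights are illustrative placeholders.

    q_last: (B, J) last observed joints; pred_deltas, target_q: (B, H, J);
    pred_ee, target_ee: (B, H, 3); q_min, q_max: (J,) joint limits.
    """
    # Integrate predicted deltas forward from the last observed joint state.
    pred_q = q_last.unsqueeze(1) + torch.cumsum(pred_deltas, dim=1)
    joint_term = torch.mean((pred_q - target_q) ** 2)

    ee_dist = torch.linalg.norm(pred_ee - target_ee, dim=-1)  # (B, H)
    ee_term = ee_dist.mean()              # error across the whole horizon
    terminal_term = ee_dist[:, -1].mean() # extra weight on the final step

    # Penalise step-to-step changes in the deltas to discourage jitter.
    smooth_term = torch.mean((pred_deltas[:, 1:] - pred_deltas[:, :-1]) ** 2)
    # Hinge penalty that is zero while the rollout stays inside the limits.
    limit_term = (torch.relu(q_min - pred_q) + torch.relu(pred_q - q_max)).mean()

    total = (w_joint * joint_term + w_ee * ee_term + w_terminal * terminal_term
             + w_smooth * smooth_term + w_limit * limit_term)
    parts = {"joint": joint_term.item(), "ee": ee_term.item(),
             "terminal": terminal_term.item(), "smooth": smooth_term.item(),
             "limit": limit_term.item()}
    return total, parts
```

Returning the per-term values alongside the total makes it straightforward to log each consistency signal separately.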

The repository also includes differentiable physics scaffolding beyond the Franka sequence model: a PyTorch Gaussian splatting renderer for visual-loss wiring, an optional NVIDIA Warp planar-arm engine, and local proxy robot-arm training scripts that provide a fast path before heavier simulator integration.
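To make the idea of a fast local proxy concrete, here is a small PyTorch sketch of a differentiable planar arm: forward kinematics written with tensor ops, then plain gradient descent on joint angles to reach a target. It is an illustrative toy under assumed link lengths and step size, not the repository's Warp engine or its proxy training scripts.

```python
import torch


def planar_fk(q, link_lengths):
    """End-effector (x, y) of a planar serial arm; differentiable w.r.t. q."""
    angles = torch.cumsum(q, dim=-1)  # absolute angle of each link
    x = (link_lengths * torch.cos(angles)).sum(-1)
    y = (link_lengths * torch.sin(angles)).sum(-1)
    return torch.stack([x, y], dim=-1)


def reach_target(target, link_lengths, steps=500, lr=0.05):
    """Gradient descent on joint angles to minimise squared reach error."""
    q = torch.zeros(len(link_lengths), requires_grad=True)
    for _ in range(steps):
        loss = torch.sum((planar_fk(q, link_lengths) - target) ** 2)
        (grad,) = torch.autograd.grad(loss, q)
        with torch.no_grad():
            q -= lr * grad  # gradients flow through the kinematics
    final_dist = torch.linalg.norm(planar_fk(q.detach(), link_lengths) - target).item()
    return q.detach(), final_dist
```

For example, `reach_target(torch.tensor([1.0, 1.0]), torch.tensor([1.0, 1.0]))` drives a two-link arm with unit links from the fully extended pose toward the point (1, 1).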

Training results

The saved advanced Isaac Franka bright-many run trained for 100 epochs. The best validation checkpoint occurred at epoch 97, reaching 1.67 cm terminal end-effector distance and 1.37 cm mean end-effector distance across the future trajectory. The final epoch remained close, ending at 1.71 cm terminal end-effector distance and 1.39 cm mean end-effector distance.
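The two reported metrics can be computed along the following lines. This is a sketch that assumes positions are stored in metres and that distances are averaged uniformly over validation windows; the function names are hypothetical.

```python
import numpy as np


def ee_distance_metrics(pred_ee, target_ee):
    """Return (terminal_cm, mean_cm) for arrays of shape (N, H, 3) in metres."""
    dist = np.linalg.norm(pred_ee - target_ee, axis=-1)  # (N, H) per-step distance
    terminal_cm = 100.0 * dist[:, -1].mean()  # last step of the horizon only
    mean_cm = 100.0 * dist.mean()             # average over the whole trajectory
    return float(terminal_cm), float(mean_cm)


def best_epoch(val_terminal_cm):
    """1-indexed epoch with the lowest validation terminal distance."""
    return int(np.argmin(val_terminal_cm)) + 1
```

Selecting the checkpoint on terminal distance rather than training loss matches how the best epoch is reported above.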

The loss curve shows the model moving from a high initial terminal error into a stable centimeter-level validation regime. The rollout plot compares predicted and target end-effector x/y/z coordinates over the future horizon and exposes where the model tracks well versus where longer-horizon transitions remain difficult.

Ablation findings

The ablation run evaluated 976 validation windows. The full model achieved a 1.67 cm terminal end-effector distance. Zeroing RGB context increased terminal error to 2.49 cm (about 1.49x), showing that visual information contributes to the policy even though robot state is still very informative.
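A zero-input modality ablation of this kind can be sketched in a framework-agnostic way as below. `predict_fn`, the `state_slices` layout, and the metre units are assumptions for illustration; the actual evaluation script may organise its inputs differently.

```python
import numpy as np


def terminal_error_cm(predict_fn, images, states, target_ee):
    """Mean terminal end-effector distance in cm (inputs assumed in metres)."""
    pred_ee = predict_fn(images, states)  # expected shape (N, H, 3)
    return float(100.0 * np.linalg.norm(pred_ee[:, -1] - target_ee[:, -1], axis=-1).mean())


def modality_ablation(predict_fn, images, states, target_ee, state_slices):
    """Zero one modality at a time and report (error_cm, ratio_to_full)."""
    errors = {"full": terminal_error_cm(predict_fn, images, states, target_ee)}
    errors["no_rgb"] = terminal_error_cm(
        predict_fn, np.zeros_like(images), states, target_ee)
    for name, sl in state_slices.items():
        ablated = states.copy()
        ablated[..., sl] = 0.0  # zero only this block of the state vector
        errors[f"no_{name}"] = terminal_error_cm(predict_fn, images, ablated, target_ee)
    return {k: (v, v / errors["full"]) for k, v in errors.items()}
```

Applied to the reported numbers, the RGB ratio is 2.49 / 1.67 ≈ 1.49.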

The stronger ablations show that the task is not solvable from vision alone in the current formulation. Removing joint state, end-effector/cube state, or the full state vector increases terminal error by more than an order of magnitude. This is a useful engineering signal: the next version should make action/state conditioning explicit and collect more diverse visual episodes before claiming strong vision-only generalization.

Engineering contribution

  • Implemented the temporal sequence dataset, model, training loop, metrics logging, checkpointing, rollout visualization, and modality-ablation evaluation workflow.
  • Added Isaac Sim data-collection utilities, merged-index tooling, path-repair scripts, dark-frame diagnostics, and generated-data hygiene so the repository stays usable as experiments grow.
  • Documented practical lessons around lighting, camera placement, episode-level validation, reset-boundary filtering, and the difference between a visually attractive demo and a learnable sequence dataset.
  • Packaged the project page with a cover image, architecture visual, visual-property concept figure, optimization progression image, training curve, rollout plot, ablation chart, and concise research-style narrative.

Limitations and next steps

  • The current best results are from simulator-generated Franka sequences, not a physical robot deployment.
  • The model still relies heavily on robot state, so the next iteration should add richer visual diversity, multi-camera inputs, and action-conditioned prediction.
  • The differentiable FK prior is scaffolded but should be calibrated from the actual Isaac/URDF transforms before enabling stronger FK consistency losses.
  • Future work should add closed-loop Isaac replay, action-conditioned rollouts, 50-100 brighter randomized episodes, production-grade 3DGS rendering, and real-robot validation.
