Vision-Guided Differentiable Physics for Robotic Manipulation

Highlights

Built an end-to-end RGB plus robot-state learning pipeline for Franka manipulation, predicting 8-step future joint deltas and end-effector trajectories from 4 context frames/states.
Implemented a TemporalVisionFrankaPolicy with a CNN image encoder, state projection, Transformer temporal fusion, joint-delta rollout head, and end-effector prediction head.
Created an Isaac Sim data workflow with brighter multi-episode collection, merged JSONL indexes, episode-level validation, continuity filters, image preprocessing, and modality ablations.
Trained the advanced run for 100 epochs and reached a best validation terminal end-effector distance of 1.67 cm with a 1.37 cm mean trajectory end-effector distance across 976 validation windows.
Validated that visual context contributes to the model: removing RGB increased terminal end-effector error by about 1.49x, while removing state/cube context caused order-of-magnitude degradation.

Key metrics

Terminal EE error: 1.67 cm (best validation terminal end-effector distance, at epoch 97)
Mean EE error: 1.37 cm (validation trajectory-average end-effector distance)
Temporal window: 4 -> 8 (4 context frames/states used to predict an 8-step future rollout)
Validation windows: 976 (windows used in the saved modality-ablation evaluation)
Vision ablation: 1.49x (terminal error increase when RGB context is zeroed)
Model backbone: 2L / 4H (2-layer, 4-head Transformer temporal encoder)

Media

Project cover summarizing the portfolio case study: RGB observations, robot state, temporal policy learning, differentiable rollouts, and end-effector target prediction.

System architecture: visual observations are encoded, fused with state context, passed through a differentiable physics/simulation loop, and optimized through end-to-end gradient flow.

Visual feature extraction concept: RGB observations and depth-like structure support prediction of physical properties such as mass, friction, and restitution for downstream physics reasoning.

Optimization progression: the differentiable loop moves from an initial configuration through gradient-guided updates toward a target manipulation configuration.

Presentation-style summary of the advanced Isaac Franka run, including context horizon, best end-effector errors, validation windows, and the training/evaluation pipeline.

Training curve from the advanced run: train terminal EE error, validation terminal EE error, and validation mean EE error converge toward centimeter-level tracking behavior.

Predicted vs target end-effector rollout: the model predicts future x, y, and z end-effector coordinates over the 8-step horizon.

Ablation summary: the full model is strongest; removing RGB increases terminal error, while removing robot/state context causes much larger degradation.

Representative final context frame from the saved Isaac Franka evaluation sample, showing the kind of visual input used by the policy.

Tech stack

Python, PyTorch, NVIDIA Isaac Sim, NVIDIA Warp, 3D Gaussian Splatting, Transformer Encoder, CNN Vision Encoder, Franka Panda, JSONL Datasets, YAML Configs, Matplotlib

Objective

The goal of this project is to make robotic manipulation models more physically grounded by tying visual observations to differentiable state prediction and rollout losses. Instead of treating perception and control as separate blocks, the system asks whether an RGB-conditioned policy can learn multi-step robot motion while remaining inspectable through simulation metrics and ablations.

The final portfolio version presents the project as a research-style report page: it explains the robotics problem, shows the architecture, documents the Isaac Sim data workflow, visualizes training behavior, and summarizes ablation evidence from the saved run artifacts.

Problem and motivation

Contact-rich manipulation is hard because the policy must reason about geometry, object state, physical interaction, robot kinematics, and future consequences of actions. Pure image models can learn correlations, but they often hide whether the model is using visual evidence, robot state, or dataset shortcuts.

This project focuses on the bridge between visual learning and physics-aware prediction. The code base starts with runnable differentiable-physics demos, then extends the idea toward Isaac Sim Franka manipulation with temporal context, multi-step supervision, continuity filtering, and modality ablations.

System architecture

Input: a temporal context of 4 RGB frames plus robot joint positions, end-effector position, cube/object position, target position, and a normalized time token.
Vision branch: a lightweight CNN encoder converts each frame into a compact visual feature vector that can be trained without external pretrained dependencies.
State branch: robot and scene state are projected into the same embedding space as the vision features.
Temporal fusion: a 2-layer, 4-head Transformer encoder fuses the context window and produces a sequence-level summary.
Prediction heads: the model predicts bounded future joint deltas for rollout integration and future end-effector positions for trajectory-level supervision.

Isaac Sim data and sequence pipeline

The project includes an Isaac Sim Franka workflow for collecting RGB frames and robot/cube/end-effector trajectories, converting episodes into training indexes, merging multiple episodes, and inspecting sequence quality before training.

A major engineering improvement was treating sequence quality as a first-class issue. The dataset loader splits merged JSONL indexes back into episode trajectories, supports episode-level train/validation splits, and rejects windows with impossible end-effector or joint jumps so the temporal model is not trained across reset boundaries.
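The episode-level split described above can be sketched roughly as follows. This is a minimal illustration, not the repository's loader: the field names `episode_id` and `step` and the function name `split_by_episode` are assumptions, since the actual index schema is not shown here. The point is that whole episodes, not individual frames, are assigned to train or validation.

```python
import json
import random
from collections import defaultdict


def split_by_episode(jsonl_lines, val_fraction=0.2, seed=0):
    """Group merged index records by episode, then split whole episodes.

    Assumes each JSONL record carries hypothetical "episode_id" and "step" fields.
    """
    episodes = defaultdict(list)
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in a merged index
        record = json.loads(line)
        episodes[record["episode_id"]].append(record)

    # Restore temporal order inside each episode.
    for records in episodes.values():
        records.sort(key=lambda r: r["step"])

    # Shuffle episode ids deterministically and hold out a fraction of them.
    ids = sorted(episodes)
    random.Random(seed).shuffle(ids)
    n_val = max(1, round(len(ids) * val_fraction)) if len(ids) > 1 else 0
    val_ids = set(ids[:n_val])

    train = {i: episodes[i] for i in ids if i not in val_ids}
    val = {i: episodes[i] for i in ids if i in val_ids}
    return train, val
```

Splitting by episode keeps frames from one trajectory out of both sets at once, which is what makes the validation numbers meaningful for a temporal model.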

Bright collection workflow: stronger lighting, camera look-at setup, visual randomization, debug previews, and percentile image preprocessing for under-exposed frames.
Sequence settings: image size 128, batch size 8, context length 4, horizon 8, stride 1, validation fraction 0.2, and episode-level split mode.
Continuity filters: reject windows with excessive end-effector jumps or joint-space jumps before constructing multi-step targets.
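One way the window construction and continuity filters could look, using the context length 4, horizon 8, and stride 1 listed above. The thresholds (`max_ee_jump`, `max_joint_jump`), the record keys (`ee_pos`, `joints`), and the function name are illustrative assumptions; the values used in the actual run are not stated here.

```python
import math


def build_windows(episode, context_len=4, horizon=8, stride=1,
                  max_ee_jump=0.05, max_joint_jump=0.5):
    """Slice one episode into (context, future) windows, dropping discontinuous ones.

    `episode` is a time-ordered list of dicts with hypothetical keys
    "ee_pos" (x, y, z) and "joints" (joint positions). The thresholds are
    placeholder values, not the ones used in the original run.
    """
    def is_continuous(a, b):
        ee_jump = math.dist(a["ee_pos"], b["ee_pos"])
        joint_jump = max(abs(x - y) for x, y in zip(a["joints"], b["joints"]))
        return ee_jump <= max_ee_jump and joint_jump <= max_joint_jump

    windows = []
    total = context_len + horizon
    for start in range(0, len(episode) - total + 1, stride):
        frames = episode[start:start + total]
        # Reject the window if any consecutive pair looks like a reset or teleport.
        if all(is_continuous(a, b) for a, b in zip(frames, frames[1:])):
            windows.append((frames[:context_len], frames[context_len:]))
    return windows
```

Checking every consecutive pair inside the 12-step span means a single reset boundary removes all windows that straddle it, rather than only the window that starts there.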

Temporal vision model

The core model is a TemporalVisionFrankaPolicy. It encodes each context frame using a CNN, combines those visual features with projected state vectors, adds learned positional embeddings, and passes the context through a Transformer encoder. The encoder output at the last context token serves as the sequence summary and drives two prediction heads: one for future joint deltas and one for future end-effector coordinates.

The architecture is intentionally practical: the image encoder is compact, the Transformer is modest enough to train on project-scale data, and the prediction heads expose interpretable quantities that can be plotted and compared against future trajectories.
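A compact PyTorch sketch of a model with this shape is given below. It is not the project's TemporalVisionFrankaPolicy: the class name, layer widths, state dimension (7 joints + 3 end-effector + 3 cube + 3 target + 1 time token = 17), additive fusion of vision and state features, and the tanh bound on joint deltas are all assumptions chosen to match the description, not details confirmed by the source.

```python
import torch
import torch.nn as nn


class TemporalVisionFrankaPolicySketch(nn.Module):
    """Illustrative stand-in; sizes and fusion choices are assumptions."""

    def __init__(self, state_dim=17, d_model=128, context_len=4,
                 horizon=8, num_joints=7, max_delta=0.1):
        super().__init__()
        self.horizon, self.num_joints, self.max_delta = horizon, num_joints, max_delta
        # Lightweight CNN trained from scratch (no pretrained weights).
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, d_model),
        )
        self.state_proj = nn.Linear(state_dim, d_model)
        self.pos_embed = nn.Parameter(torch.zeros(1, context_len, d_model))
        layer = nn.TransformerEncoderLayer(
            d_model, nhead=4, dim_feedforward=256, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=2)
        self.delta_head = nn.Linear(d_model, horizon * num_joints)
        self.ee_head = nn.Linear(d_model, horizon * 3)

    def forward(self, images, states):
        # images: (B, T, 3, H, W); states: (B, T, state_dim)
        b, t = images.shape[:2]
        vis = self.image_encoder(images.flatten(0, 1)).view(b, t, -1)
        # Assumed fusion: sum vision and state embeddings, add positions.
        tokens = vis + self.state_proj(states) + self.pos_embed[:, :t]
        summary = self.temporal(tokens)[:, -1]  # last-token summary
        # Assumed bounding: tanh keeps each per-step delta within max_delta.
        deltas = self.max_delta * torch.tanh(self.delta_head(summary))
        deltas = deltas.view(b, self.horizon, self.num_joints)
        ee = self.ee_head(summary).view(b, self.horizon, 3)
        return deltas, ee
```

Calling the sketch with a batch of shape (B, 4, 3, 128, 128) images and (B, 4, 17) states returns joint deltas of shape (B, 8, 7) and end-effector positions of shape (B, 8, 3).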

Differentiable physics and rollout losses

The training objective combines multiple physical consistency signals instead of relying on one scalar loss. Predicted joint deltas are integrated into future joint trajectories, end-effector predictions are compared across the horizon, terminal end-effector error receives extra weight, smoothness discourages jittery actions, and joint-limit regularization keeps predictions physically plausible.
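The combined objective could be assembled along these lines. The function name, the loss weights, and the exact form of each term (mean squared error for joints, Euclidean distance for the end effector, hinge-style joint-limit penalty) are assumptions for illustration; the source names the terms but not their formulas or weights.

```python
import torch


def rollout_loss(pred_deltas, pred_ee, last_joints, target_joints, target_ee,
                 joint_lower, joint_upper,
                 w_joint=1.0, w_ee=1.0, w_terminal=2.0, w_smooth=0.1, w_limit=0.1):
    """Weighted sum of rollout terms; all weights here are placeholder values.

    pred_deltas, target_joints: (B, H, J); pred_ee, target_ee: (B, H, 3)
    last_joints: (B, J); joint_lower, joint_upper: (J,)
    """
    # Integrate predicted deltas from the last observed joint configuration.
    pred_joints = last_joints.unsqueeze(1) + torch.cumsum(pred_deltas, dim=1)
    joint_loss = torch.mean((pred_joints - target_joints) ** 2)

    # Per-step Euclidean end-effector error across the horizon: (B, H).
    ee_dist = torch.linalg.norm(pred_ee - target_ee, dim=-1)
    ee_loss = ee_dist.mean()
    terminal_loss = ee_dist[:, -1].mean()  # extra weight on the final step

    # Penalize step-to-step changes in the deltas to discourage jitter.
    smooth_loss = torch.mean((pred_deltas[:, 1:] - pred_deltas[:, :-1]) ** 2)

    # Penalize only the portion of the rollout that leaves the joint limits.
    limit_loss = (torch.relu(joint_lower - pred_joints)
                  + torch.relu(pred_joints - joint_upper)).mean()

    total = (w_joint * joint_loss + w_ee * ee_loss + w_terminal * terminal_loss
             + w_smooth * smooth_loss + w_limit * limit_loss)
    parts = {"joint": joint_loss, "ee": ee_loss, "terminal": terminal_loss,
             "smooth": smooth_loss, "limit": limit_loss}
    return total, parts
```

Returning the individual terms alongside the total makes it easy to log each signal separately and see which one dominates during training.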

The repository also includes differentiable physics scaffolding beyond the Franka sequence model: a PyTorch Gaussian splatting renderer for visual-loss wiring, an optional NVIDIA Warp planar-arm engine, and local proxy robot-arm training scripts that provide a fast path before heavier simulator integration.

Training results

The saved advanced Isaac Franka bright-many run trained for 100 epochs. The best validation checkpoint occurred at epoch 97, reaching 1.67 cm terminal end-effector distance and 1.37 cm mean end-effector distance across the future trajectory. The final epoch remained close, ending at 1.71 cm terminal end-effector distance and 1.39 cm mean end-effector distance.
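For reference, the two reported metrics can be computed as sketched below. The sketch assumes positions are stored in metres and that both metrics are averaged over validation windows; the function name and input layout are hypothetical, and the project's own metric code may differ in detail.

```python
import math


def ee_distance_metrics_cm(pred_ee, target_ee):
    """Terminal and trajectory-mean end-effector distance, in centimetres.

    pred_ee, target_ee: lists of windows, each a list of (x, y, z) points
    over the prediction horizon. Positions are assumed to be in metres.
    """
    terminal, mean_traj = [], []
    for pred, target in zip(pred_ee, target_ee):
        dists = [math.dist(p, t) for p, t in zip(pred, target)]
        terminal.append(dists[-1])                  # error at the last horizon step
        mean_traj.append(sum(dists) / len(dists))   # error averaged over the horizon
    n = len(terminal)
    return {
        "terminal_cm": 100.0 * sum(terminal) / n,
        "mean_cm": 100.0 * sum(mean_traj) / n,
    }
```

Under these definitions the terminal metric looks only at the last of the 8 predicted steps, while the mean metric averages all 8, which is consistent with the mean error being lower when error grows with horizon.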

The loss curve shows the model moving from a high initial terminal error into a stable centimeter-level validation regime. The rollout plot compares predicted and target end-effector x/y/z coordinates over the future horizon and exposes where the model tracks well versus where longer-horizon transitions remain difficult.

Ablation findings

The ablation run evaluated 976 validation windows. The full model achieved a 1.67 cm terminal end-effector distance. Zeroing RGB context increased terminal error to 2.49 cm, showing that visual information contributes to the policy even though robot state is still very informative.
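The zeroing ablation and the reported 1.49x figure can be illustrated with a small sketch. The helper names and the sample layout (a dict of nested lists keyed by modality, such as `rgb` and `joints`) are assumptions; the real evaluation presumably operates on tensors, but the idea is the same: replace one modality with zeros, re-evaluate, and compare against the full model.

```python
def zero_modality(sample, keys):
    """Return a copy of `sample` with the named inputs replaced by zeros.

    `sample` is a dict of (possibly nested) lists of floats; the key names
    are hypothetical and stand in for the model's input modalities.
    """
    def zeros_like(value):
        if isinstance(value, (list, tuple)):
            return [zeros_like(v) for v in value]
        return 0.0

    return {k: (zeros_like(v) if k in keys else v) for k, v in sample.items()}


def relative_increase(ablated_cm, full_cm):
    """Ratio of ablated error to full-model error."""
    return ablated_cm / full_cm


# Reported values: 2.49 cm with RGB zeroed versus 1.67 cm for the full model.
print(round(relative_increase(2.49, 1.67), 2))  # 1.49
```

Zeroing keeps the input shapes intact so the same checkpoint can be evaluated without retraining, at the cost of feeding the model inputs it never saw during training.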

The stronger ablations show that the task is not solvable from vision alone in the current formulation. Removing joint state, end-effector/cube state, or the full state vector increases terminal error by more than an order of magnitude, which is a useful engineering signal: the next version should make action/state conditioning explicit and collect more diverse visual episodes before claiming strong vision-only generalization.

Engineering contribution

Implemented the temporal sequence dataset, model, training loop, metrics logging, checkpointing, rollout visualization, and modality-ablation evaluation workflow.
Added Isaac Sim data-collection utilities, merged-index tooling, path-repair scripts, dark-frame diagnostics, and generated-data hygiene so the repository stays usable as experiments grow.
Documented practical lessons around lighting, camera placement, episode-level validation, reset-boundary filtering, and the difference between a visually attractive demo and a learnable sequence dataset.
Packaged the project page with a cover image, architecture visual, visual-property concept figure, optimization progression image, training curve, rollout plot, ablation chart, and concise research-style narrative.

Limitations and next steps

The current best results are from simulator-generated Franka sequences, not a physical robot deployment.
The model still relies heavily on robot state, so the next iteration should add richer visual diversity, multi-camera inputs, and action-conditioned prediction.
The differentiable FK prior is scaffolded but should be calibrated from the actual Isaac/URDF transforms before enabling stronger FK consistency losses.
Future work should add closed-loop Isaac replay, action-conditioned rollouts, 50-100 brighter randomized episodes, production-grade 3DGS rendering, and real-robot validation.

Related projects

Automated Goalie: Ping Pong Ball Trajectory Prediction System

Robotics · 2025

Featured

A closed-loop robot-learning prototype that detects a ping pong ball, estimates its 3D motion, predicts the landing point, and rotates a servo-driven blocker in real time on a Raspberry Pi-based hardware setup.

Computer Vision, Trajectory Prediction, Embedded Systems, Real-Time Robotics

S&P 500 Deep Learning Forecasting System

Financial ML · 2025

Featured

A research-grade forecasting system that evaluates Temporal Fusion Transformers against LSTM and ARIMAX baselines for S&P 500 return prediction using mixed-frequency market and macroeconomic data, then extends TFT with regime-aware attention and interpretability diagnostics.

Financial ML, Time-Series Forecasting, Temporal Fusion Transformer, Regime-Aware Attention

Autonomous Drone Navigation System

Robotics · 2025

Featured

A vision-based landing system where a Parrot Mambo drone tracks a moving line-follower robot, stays aligned using image feedback, and executes a timed descent onto the platform.

Autonomous Drones, Computer Vision, Visual Servoing, Moving Platform Landing

Event-Based Star Tracking for Spacecraft Attitude Estimation

Space Autonomy · 2026

Featured

A Speed-Aware EBS-EKF research prototype for event-camera star tracking that improves low-light spacecraft attitude estimation by making centroid correction depend on both brightness and image-plane speed.

Event-Based Vision, Space Autonomy, State Estimation, Star Tracking
