Unsupervised Learning for Physical Interaction through Video Prediction

Links:

- PDF of the paper, and arXiv landing page

- Full Dataset (including training and test sets)

- Code for training the models

Robot pushing evaluation

Below are example video predictions from various models in our evaluation on the robot interaction dataset. The ground truth video shows all ten time steps, whereas all other videos show only the 8 generated time steps (conditioned on only the first two ground truth images and all of the robot actions).
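The sketch below illustrates this protocol: the model is warmed up on the two ground-truth context frames and then fed its own predictions for the remaining 8 steps, with the robot action supplied at every step. The `predict_rollout` helper and the `model.initial_state()` / `model.step()` interface are hypothetical names used only for illustration; the actual training and sampling code is in the linked repository.

```python
import numpy as np

def predict_rollout(model, context_frames, actions, num_predictions=8):
    """Condition on ground-truth frames, then roll the model out on its own output.

    context_frames: (2, H, W, 3) array  -- the two conditioning images
    actions:        (T, action_dim) array -- the robot action at every time step
    Returns a (num_predictions, H, W, 3) array of generated frames.
    """
    state = model.initial_state()
    pred = None
    # Warm-up: feed the ground-truth context frames; only the last prediction
    # (the first generated frame) is kept.
    for t, frame in enumerate(context_frames):
        pred, state = model.step(frame, actions[t], state)
    predictions = [pred]
    # Autoregressive phase: feed the model its own previous prediction,
    # together with the recorded action for that time step.
    for t in range(len(context_frames), len(context_frames) + num_predictions - 1):
        pred, state = model.step(pred, actions[t], state)
        predictions.append(pred)
    return np.stack(predictions)
```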

Qualitative comparison across models - seen objects

ground truth // CDNA // ConvLSTM, with skip // FF multiscale [14] // FC LSTM [17]

Qualitative comparison across models - novel objects

ground truth // CDNA // ConvLSTM, with skip // FF multiscale [14] // FC LSTM [17]

Note how the ConvLSTM model predicts motion less accurately than the CDNA model and degrades the background (e.g., the left edge of the table).

Changing the action

CDNA, novel objects

0x action // 0.5x action // 1x action // 1.5x action
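The rows above use the same context frames and the same recorded action sequence, with the actions multiplied by each scale factor before being fed to the model. A minimal sketch of that comparison, reusing the hypothetical `predict_rollout` helper from the earlier sketch:

```python
# Generate one predicted video per action scale; `model`, `context_frames`,
# and `actions` are assumed to be loaded as in the earlier sketch.
for scale in (0.0, 0.5, 1.0, 1.5):
    scaled_actions = scale * actions  # scale the commanded robot motion
    video = predict_rollout(model, context_frames, scaled_actions)
    # `video` holds the 8 predicted frames for this scale, e.g. for the rows above.
```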

Randomly-sampled predictions, novel objects

ground truth // CDNA


Visualized masks

CDNA, seen objects, masks 0 (background), 2, and 8

CDNA, novel objects, masks 0 (background), 2, and 8
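The masks shown above are the compositing masks the CDNA model predicts alongside its transformation kernels: a channel-wise softmax produces one mask per candidate image, and the next frame is the mask-weighted sum of those candidates. The numpy sketch below is a simplified illustration of that compositing step, assuming the background candidate (mask 0) copies pixels from the previous frame; the function and argument names are illustrative, not the repository's API.

```python
import numpy as np
from scipy.signal import convolve2d

def cdna_composite(prev_frame, kernels, mask_logits):
    """Compose the next frame from CDNA-style transformations and masks.

    prev_frame:  (H, W, 3) previous image
    kernels:     (num_masks - 1, k, k) normalized convolution kernels,
                 one per object transformation
    mask_logits: (H, W, num_masks) unnormalized mask logits; channel 0 is the
                 background mask visualized above
    """
    # Channel-wise softmax so the masks sum to one at every pixel.
    masks = np.exp(mask_logits - mask_logits.max(axis=-1, keepdims=True))
    masks /= masks.sum(axis=-1, keepdims=True)

    # Candidate images: channel 0 copies the previous frame (static background);
    # the remaining candidates are the previous frame convolved with each kernel.
    candidates = [prev_frame]
    for kern in kernels:
        transformed = np.stack(
            [convolve2d(prev_frame[..., c], kern, mode="same") for c in range(3)],
            axis=-1)
        candidates.append(transformed)

    # Composite: each mask selects where its candidate contributes.
    next_frame = sum(masks[..., i:i + 1] * cand
                     for i, cand in enumerate(candidates))
    return next_frame
```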

Human3.6M evaluation

Below are example video predictions from various models in our evaluation on the Human3.6M dataset, with a held-out human subject. The ground truth video shows ten ground truth time steps, whereas all other videos show the 10 generated time steps (conditioned on only the first ten ground truth images, which are not shown).

Sample video predictions

ground truth // DNA // FF multiscale [14] // FC LSTM [17]