Learning to Generate Time-Lapse Videos Using Multi-Stage Dynamic Generative Adversarial Networks

Wei Xiong1,2 Wenhan Luo1 Lin Ma1 Wei Liu1 Jiebo Luo2

1Tencent AI Lab 2University of Rochester

Abstract

Given a photo taken outdoors, can we predict the immediate future, e.g., how the clouds in the sky will move? We address this problem with a two-stage approach based on generative adversarial networks (GANs) for generating realistic, high-resolution time-lapse videos. Given the first frame, our model learns to generate long-term future frames. The first stage generates videos with realistic content for each frame. The second stage refines the video generated by the first stage, enforcing it to be closer to real videos with regard to motion dynamics. To further encourage vivid motion in the final generated video, a Gram matrix is employed to model the motion more precisely. We build a large-scale time-lapse dataset and test our approach on it. Using our model, we are able to generate realistic videos of up to 128×128 resolution for 32 frames. Quantitative and qualitative experimental results demonstrate the superiority of our model over state-of-the-art models.
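As a concrete illustration of the Gram-matrix motion modeling mentioned above, below is a minimal PyTorch sketch. It assumes motion features come from intermediate network activations of video frames; the function names (gram_matrix, gram_loss) and the L1 distance are our own choices for illustration, not necessarily the paper's exact formulation.

import torch

def gram_matrix(features):
    # features: (B, C, N), where N flattens the spatial/temporal dims.
    # The Gram matrix captures pairwise channel correlations, which
    # serve as a statistic of motion patterns.
    b, c, n = features.size()
    return torch.bmm(features, features.transpose(1, 2)) / (c * n)

def gram_loss(fake_feats, real_feats):
    # Penalize the distance between the motion statistics of the
    # generated and the real video features (L1 distance assumed here).
    return torch.abs(gram_matrix(fake_feats) - gram_matrix(real_feats)).mean()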

Left: Videos generated given the first frame. Right: Corresponding ground-truth videos.



Network Architecture

The overall architecture of the MD-GAN model. The input image is first duplicated to 32 frames and fed to the generator G1 of the Base-Net, which produces a video Y1. The discriminator D1 then distinguishes the real video Y from Y1. Following the Base-Net, the Refine-Net takes the output video of G1 as input and generates a more realistic video Y2. The discriminator D2 is updated with an adversarial ranking loss to push Y2 (the output of the Refine-Net) closer to real videos.
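To make the two-stage pipeline concrete, here is a minimal PyTorch sketch of the data flow and the adversarial ranking idea. The toy Generator and Discriminator below are placeholders for the paper's 3D convolutional encoder-decoder networks, and the margin-based L1 feature distance is an assumed instantiation of the ranking loss, not the paper's exact formulation.

import torch
import torch.nn as nn

class Generator(nn.Module):
    # Toy stand-in for G1 (Base-Net) / G2 (Refine-Net), which in the
    # paper are 3D convolutional encoder-decoder networks.
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Conv3d(3, 3, 3, padding=1), nn.Tanh())

    def forward(self, video):  # video: (B, C, T, H, W)
        return self.net(video)

class Discriminator(nn.Module):
    # Toy stand-in for D1 / D2; returns features used for ranking.
    def __init__(self):
        super().__init__()
        self.net = nn.Conv3d(3, 8, 3, padding=1)

    def forward(self, video):
        return self.net(video)

def ranking_loss(feat_y1, feat_y2, feat_real, margin=1.0):
    # Push the refined video Y2 closer to the real video Y than the
    # coarse video Y1 is, in D2's feature space (L1 distance assumed).
    d_refined = torch.abs(feat_y2 - feat_real).mean()
    d_coarse = torch.abs(feat_y1 - feat_real).mean()
    return torch.relu(d_refined - d_coarse + margin)

first_frame = torch.randn(1, 3, 128, 128)
x = first_frame.unsqueeze(2).repeat(1, 1, 32, 1, 1)  # duplicate to 32 frames

g1, g2, d2 = Generator(), Generator(), Discriminator()
y1 = g1(x)                 # Stage I: realistic per-frame content
y2 = g2(y1)                # Stage II: refined motion dynamics
y = torch.randn_like(y2)   # a real video clip would be used in training
loss_rank = ranking_loss(d2(y1).detach(), d2(y2), d2(y))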


Results

The generated video frames by Stage I (left) and Stage II (right) given the same first frame. We show exemplar frames 1, 8, 16, 24, and 32. Red circles indicate the locations and areas where obvious movements take place between adjacent frames. More and larger circles are observed in the Stage II frames, indicating more vivid motions generated by the Refine-Net.

Citation

@inproceedings{xiong2018learning,
  title={Learning to Generate Time-Lapse Videos Using Multi-Stage Dynamic Generative Adversarial Networks},
  author={Xiong, Wei and Luo, Wenhan and Ma, Lin and Liu, Wei and Luo, Jiebo},
  booktitle={Computer Vision and Pattern Recognition (CVPR), 2018 IEEE Conference on},
  year={2018}
}

Paper

The download link is here.

Dataset

The download link is here. Note that the dataset is large; please make sure you have a stable connection before downloading.

Code

Check the repo to access the code.