Learning to Generate Time-Lapse Videos Using Multi-Stage Dynamic Generative Adversarial Networks

Wei Xiong1,2 Wenhan Luo1 Lin Ma1 Wei Liu1 Jiebo Luo2

1Tencent AI Lab 2University of Rochester

Abstract

Given a photo taken outdoors, can we predict the immediate future, e.g., how the clouds in the sky will move? We address this problem with a two-stage approach based on generative adversarial networks (GANs) for generating realistic, high-resolution time-lapse videos. Given the first frame, our model learns to generate long-term future frames. The first stage generates videos with realistic content for each frame. The second stage refines the video generated by the first stage, enforcing it to be closer to real videos with regard to motion dynamics. To further encourage vivid motion in the final generated video, a Gram matrix is employed to model the motion more precisely. We build a large-scale time-lapse dataset and test our approach on it. Using our model, we are able to generate realistic videos of up to 128×128 resolution for 32 frames. Quantitative and qualitative experimental results demonstrate the superiority of our model over state-of-the-art models.
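As a concrete illustration of the Gram-matrix motion modeling mentioned above, below is a minimal PyTorch sketch. It assumes motion features come from intermediate network activations of video frames; the function names (gram_matrix, gram_loss) and the L1 distance are our own choices for illustration, not necessarily the paper's exact formulation.

import torch

def gram_matrix(features):
    # features: (B, C, N), where N flattens the spatial/temporal dims.
    # The Gram matrix captures pairwise channel correlations, which
    # serve as a statistic of motion patterns.
    b, c, n = features.size()
    return torch.bmm(features, features.transpose(1, 2)) / (c * n)

def gram_loss(fake_feats, real_feats):
    # Penalize the distance between the motion statistics of the
    # generated and the real video features (L1 distance assumed here).
    return torch.abs(gram_matrix(fake_feats) - gram_matrix(real_feats)).mean()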

Left: Videos generated given the first frame. Right: Corresponding ground-truth videos.



Network Architecture

The overall architecture of the MD-GAN model. The input image is first duplicated to 32 frames and fed to the generator G1 of the Base-Net, which produces a video Y1. The discriminator D1 then distinguishes the real video Y from Y1. Following the Base-Net, the Refine-Net takes the output video of G1 as input and generates a more realistic video Y2. The discriminator D2 is updated with an adversarial ranking loss to push Y2 (the output of the Refine-Net) closer to real videos.
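To make the two-stage pipeline concrete, here is a minimal PyTorch sketch of the data flow and the adversarial ranking idea. The toy Generator and Discriminator below are placeholders for the paper's 3D convolutional encoder-decoder networks, and the margin-based L1 feature distance is an assumed instantiation of the ranking loss, not the paper's exact formulation.

import torch
import torch.nn as nn

class Generator(nn.Module):
    # Toy stand-in for G1 (Base-Net) / G2 (Refine-Net), which in the
    # paper are 3D convolutional encoder-decoder networks.
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Conv3d(3, 3, 3, padding=1), nn.Tanh())

    def forward(self, video):  # video: (B, C, T, H, W)
        return self.net(video)

class Discriminator(nn.Module):
    # Toy stand-in for D1 / D2; returns features used for ranking.
    def __init__(self):
        super().__init__()
        self.net = nn.Conv3d(3, 8, 3, padding=1)

    def forward(self, video):
        return self.net(video)

def ranking_loss(feat_y1, feat_y2, feat_real, margin=1.0):
    # Push the refined video Y2 closer to the real video Y than the
    # coarse video Y1 is, in D2's feature space (L1 distance assumed).
    d_refined = torch.abs(feat_y2 - feat_real).mean()
    d_coarse = torch.abs(feat_y1 - feat_real).mean()
    return torch.relu(d_refined - d_coarse + margin)

first_frame = torch.randn(1, 3, 128, 128)
x = first_frame.unsqueeze(2).repeat(1, 1, 32, 1, 1)  # duplicate to 32 frames

g1, g2, d2 = Generator(), Generator(), Discriminator()
y1 = g1(x)                 # Stage I: realistic per-frame content
y2 = g2(y1)                # Stage II: refined motion dynamics
y = torch.randn_like(y2)   # a real video clip would be used in training
loss_rank = ranking_loss(d2(y1).detach(), d2(y2), d2(y))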


Results

The generated video frames by Stage I (left) and Stage II (right) given the same first frame. We show exemplar frames 1, 8, 16, 24, and 32. Red circles indicate the locations and areas where obvious movements take place between adjacent frames. More and larger circles are observed in the Stage II frames, indicating more vivid motions generated by the Refine-Net.

Citation

@inproceedings{xiong2018learning,
  title={Learning to Generate Time-Lapse Videos Using Multi-Stage Dynamic Generative Adversarial Networks},
  author={Xiong, Wei and Luo, Wenhan and Ma, Lin and Liu, Wei and Luo, Jiebo},
  booktitle={Computer Vision and Pattern Recognition (CVPR), 2018 IEEE Conference on},
  year={2018}
}

Paper

The download link is here.

Dataset

The download link is here. Note that the dataset is large; please make sure you have a stable connection before downloading.

Code

Check the repo to access the code.