Synthetic Videos on the Double: VideoGPT is an efficient generative AI system for video.
Using a neural network to generate realistic videos takes a lot of computation. New work performs the task efficiently enough to run on a beefy personal computer.
 
    Using a neural network to generate realistic videos takes a lot of computation. New work performs the task efficiently enough to run on a beefy personal computer.
What’s new: Wilson Yan, Yunzhi Zhang, and colleagues at UC Berkeley developed VideoGPT, a system that combines image generation with image compression to produce novel videos.
Key insight: It takes less computation to learn from compressed image representations than full-fledged image representations.
How it works: VideoGPT comprises a VQ-VAE (a 3D convolutional neural network that consists of an encoder, an embedding, and a decoder) and an image generator based on iGPT. The authors trained the models sequentially on BAIR Robot Pushing (clips of a robot arm manipulating various objects) and other datasets.
- VQ-VAE’s encoder learned to compress representations of the input video (16x64x64) into smaller representations (8x32x32) where each value is a vector. In the process, it learned an embedding whose vectors encoded information across multiple frames.
- VQ-VAE replaced each vector in the smaller representations with the closest value in the learned embedding, and the decoder learned to reproduce the original frames from these modified representations.
- After training VQ-VAE, the authors used the encoder to compress a video from the training set. They trained iGPT, given a flattened 1D sequence of representations, to generate the next representation by choosing vectors from the learned embedding.
- To generate video, VideoGPT passed a random representation to iGPT, concatenated its output to the input, passed the result back to iGPT, and so on for a fixed number of iterations. VQ-VAE’s decoder converted the concatenated representations into a video.
Results: The authors evaluated VideoGPT’s performance using Fréchet Video Distance (FVD), a measure of the distance between representations of generated output and training examples (lower is better). The system achieved 103.3 FVD after training on eight GPUs. The state-of-the-art Video Transformer achieved 94 FVD after training on 128 TPUs (roughly equivalent to several hundred GPUs).
Why it matters: Using VQ-VAE to compress and decompress video is not new, but this work shows how it can be used to cut the computation budget for computer vision tasks.
We’re thinking: Setting aside video generation, better video compression is potentially transformative given that most internet traffic is video. The compressed representations in this work, which are tuned to a specific, sometimes narrow training set, may be well suited to imagery from security or baby cams.
 
             
             
            