Friday, 19 May 2023

CogVideo

Text-to-video generation. This is the official repo for the ICLR 2023 paper "CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers".

News! The demo for CogVideo is available!

It is also integrated into Hugging Face Spaces 🤗 using Gradio. Try out the web demo on Hugging Face Spaces.

News! The code and model for text-to-video generation are now available! Currently only simplified Chinese input is supported.

Citation

@article{hong2022cogvideo,
  title={CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers},
  author={Hong, Wenyi and Ding, Ming and Zheng, Wendi and Liu, Xinghan and Tang, Jie},
  journal={arXiv preprint arXiv:2205.15868},
  year={2022}
}

Web Demo

The demo for CogVideo is at https://models.aminer.cn/cogvideo/, where you can get hands-on experience with text-to-video generation. The original input is in Chinese.

Generated Samples

Video samples generated by CogVideo. The actual text inputs are in Chinese. Each sample is a 4-second clip of 32 frames, and here we sample 9 frames uniformly for display purposes.

Intro images

More samples

CogVideo is able to generate relatively high-frame-rate videos. A 4-second clip of 32 frames is shown below.

High-frame-rate sample

Getting Started

Setup

  • Hardware: Linux servers with NVIDIA A100s are recommended, but it is also possible to run the pretrained models with smaller --max-inference-batch-size and --batch-size, or to train smaller models, on less powerful GPUs.
  • Environment: install dependencies via pip install -r requirements.txt.
  • LocalAttention: Make sure you have CUDA installed and compile the local attention kernel.

pip install git+https://github.com/Sleepychord/Image-Local-Attention
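
If you are unsure whether the CUDA toolchain is visible before compiling the kernel, a quick sanity check such as the following can save a failed build (a minimal sketch; the exact CUDA and PyTorch versions depend on your environment):

nvcc --version
python -c "import torch; print(torch.cuda.is_available())"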

Download

Our code will automatically download or detect the models in the path defined by the environment variable SAT_HOME. You can also manually download CogVideo-Stage1, CogVideo-Stage2 and CogView2-dsr and place them under SAT_HOME (in folders named cogvideo-stage1, cogvideo-stage2 and cogview2-dsr).
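
For a manual download, a directory layout along these lines is what the code expects (a hedged sketch; /path/to/sat_models is only an illustrative placeholder):

export SAT_HOME=/path/to/sat_models
mkdir -p $SAT_HOME
# After downloading, SAT_HOME should contain:
#   $SAT_HOME/cogvideo-stage1
#   $SAT_HOME/cogvideo-stage2
#   $SAT_HOME/cogview2-dsr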

Text-to-Video Generation

./script/inference_cogvideo_pipeline.sh

The arguments most useful for inference are listed below; see the example invocation after the list.

  • --input-source [path or "interactive"]. The path of an input file with one query per line. A CLI will be launched when "interactive" is used.
  • --output-path [path]. The folder in which to store the results.
  • --batch-size [int]. The number of samples to generate per query.
  • --max-inference-batch-size [int]. Maximum batch size per forward pass. Reduce it if you run out of memory.
  • --stage1-max-inference-batch-size [int]. Maximum batch size per forward pass in Stage 1. Reduce it if you run out of memory.
  • --both-stages. Run Stage 1 and Stage 2 sequentially.
  • --use-guidance-stage1. Use classifier-free guidance in Stage 1; this is strongly recommended for better results.

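As an illustration, a non-interactive run over a prompt file might look like the following (a hedged sketch: depending on your checkout you may need to edit these arguments inside the script rather than pass them on the command line, and input.txt and ./output are placeholder paths):

# input.txt contains one simplified-Chinese prompt per line
export SAT_HOME=/path/to/sat_models
./script/inference_cogvideo_pipeline.sh \
    --input-source input.txt \
    --output-path ./output \
    --both-stages \
    --use-guidance-stage1 \
    --batch-size 4 \
    --max-inference-batch-size 2
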
It is recommended to set the environment variable SAT_HOME to specify the path where the downloaded models are stored.

Currently only Chinese input is supported.

from https://github.com/THUDM/CogVideo