
A text-to-video model is a machine learning model that takes a natural language description as input and produces a video relevant to the input text.[1] Recent advancements in generating high-quality, text-conditioned videos have largely been driven by the development of video diffusion models.[2]
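As a minimal illustration of this text-in, video-out interface, the sketch below calls a text-to-video diffusion checkpoint through the Hugging Face diffusers library. The checkpoint name, sampler settings, and output layout are assumptions that vary across library versions and models, so treat this as an illustrative sketch rather than a reference to any particular system described in this article.

```python
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

# Illustrative sketch: the checkpoint ID below is one publicly released
# text-to-video diffusion model; any compatible checkpoint could be swapped in.
pipe = DiffusionPipeline.from_pretrained(
    "damo-vilab/text-to-video-ms-1.7b",
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe = pipe.to("cuda")  # generation is impractical without a GPU

# Natural-language description in, a short clip of frames out.
result = pipe(
    "A panda playing guitar on a mountain top",
    num_inference_steps=25,
    num_frames=16,
)

# result.frames holds the generated frames (exact layout is version-dependent);
# export_to_video assembles them into an MP4 file.
export_to_video(result.frames[0], "panda.mp4")
```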

A video generated using OpenAI's Sora text-to-video model, from the prompt: "A stylish woman walks down a Tokyo street filled with warm glowing neon and animated city signage. She wears a black leather jacket, a long red dress, and black boots, and carries a black purse. She wears sunglasses and red lipstick. She walks confidently and casually. The street is damp and reflective, creating a mirror effect of the colorful lights. Many pedestrians walk about."

Models


Several text-to-video models have been released, including open-source ones. CogVideo is an early text-to-video model with 9.4 billion parameters; the code for its demo version is available on GitHub.[3] Meta Platforms has a partial text-to-video[note 1] model called "Make-A-Video".[4][5][6] Google Brain released a research paper introducing Imagen Video, a text-to-video model built on a 3D U-Net.[7][8][9][10][11]
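Imagen Video's 3D U-Net hints at the main architectural shift from image to video diffusion: the network also convolves over time. The toy PyTorch block below is an illustrative stand-in, not Imagen Video's actual architecture; it shows how video is handled as a 5-D tensor whose extra dimension indexes frames.

```python
import torch
import torch.nn as nn

# Toy sketch of a space-time convolution block: a 3D U-Net extends the 2D U-Net
# used in image diffusion by convolving over a time axis as well, so a video is
# processed as a 5-D tensor (batch, channels, frames, height, width).
block = nn.Sequential(
    nn.Conv3d(3, 16, kernel_size=(3, 3, 3), padding=1),  # joint space-time convolution
    nn.SiLU(),
    nn.Conv3d(16, 3, kernel_size=(3, 3, 3), padding=1),
)

video = torch.randn(1, 3, 8, 64, 64)  # 8 RGB frames of 64x64
out = block(video)
print(out.shape)  # torch.Size([1, 3, 8, 64, 64]): temporal structure is preserved
```

Sharing the same kernel across neighbouring frames is what lets such models propagate appearance information over time, in contrast to running an image model independently on each frame.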

In March 2023, Alibaba researchers published VideoFusion, a research paper applying many of the principles of latent image diffusion models to video generation.[12][13] Services such as Kaiber and Reemix have since adopted similar approaches in their own video-generation products.
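To make that idea concrete, the following toy loop applies the standard DDPM denoising update from latent image diffusion to a latent that carries an extra frame axis. The noise predictor is a randomly initialised stand-in for a text-conditioned 3D U-Net, so this is a shape-level sketch of the principle, not the VideoFusion method itself.

```python
import torch

torch.manual_seed(0)
channels, frames, h, w = 4, 8, 32, 32  # latent video: (batch, channels, frames, height, width)
steps = 50
betas = torch.linspace(1e-4, 0.02, steps)  # standard DDPM noise schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

# Stand-in for a text-conditioned 3D U-Net noise predictor (assumption: untrained).
noise_pred_net = torch.nn.Conv3d(channels, channels, kernel_size=3, padding=1)

z = torch.randn(1, channels, frames, h, w)  # start from pure Gaussian noise in latent space
with torch.no_grad():
    for t in reversed(range(steps)):
        eps = noise_pred_net(z)  # predicted noise at step t (text conditioning omitted here)
        # DDPM posterior mean: remove the predicted noise component.
        z = (z - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        if t > 0:
            z = z + torch.sqrt(betas[t]) * torch.randn_like(z)  # re-inject noise except at the final step

# A real system would now decode z frame-by-frame with a VAE decoder into pixels.
print(z.shape)  # torch.Size([1, 4, 8, 32, 32])
```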

Matthias Niessner and Lourdes Agapito, at the AI company Synthesia, work on 3D neural rendering techniques that synthesise realistic video using 2D and 3D neural representations of shape, appearance, and motion, aiming at controllable video synthesis of avatars.[14]

Alternative approaches also exist: Text2Video-Zero, for example, adapts a pretrained text-to-image diffusion model to produce video without training on video data.[15]


Footnotes

  1. ^ It can also generate videos from images, interpolate video between two given images, and produce variations of existing videos.

References

  1. ^ Artificial Intelligence Index Report 2023 (PDF) (Report). Stanford Institute for Human-Centered Artificial Intelligence. p. 98. Multiple high quality text-to-video models, AI systems that can generate video clips from prompted text, were released in 2022.
  2. ^ Melnik, Andrew; Ljubljanac, Michal; Lu, Cong; Yan, Qi; Ren, Weiming; Ritter, Helge (2024-05-06). "Video Diffusion Models: A Survey". arXiv:2405.03150 [cs.CV].
  3. ^ CogVideo (GitHub repository), THUDM, 2022-10-12. Retrieved 2022-10-12.
  4. ^ Davies, Teli (2022-09-29). "Make-A-Video: Meta AI's New Model For Text-To-Video Generation". Weights & Biases. Retrieved 2022-10-12.
  5. ^ Monge, Jim Clyde (2022-08-03). "This AI Can Create Video From Text Prompt". Medium. Retrieved 2022-10-12.
  6. ^ "Meta's Make-A-Video AI creates videos from text". www.fonearena.com. Retrieved 2022-10-12.
  7. ^ "google: Google takes on Meta, introduces own video-generating AI". The Economic Times. 6 October 2022. Retrieved 2022-10-12.
  8. ^ Monge, Jim Clyde (2022-08-03). "This AI Can Create Video From Text Prompt". Medium. Retrieved 2022-10-12.
  9. ^ "Nuh-uh, Meta, we can do text-to-video AI, too, says Google". www.theregister.com. Retrieved 2022-10-12.
  10. ^ "Papers with Code - See, Plan, Predict: Language-guided Cognitive Planning with Video Prediction". paperswithcode.com. Retrieved 2022-10-12.
  11. ^ "Papers with Code - Text-driven Video Prediction". paperswithcode.com. Retrieved 2022-10-12.
  12. ^ "Home - DAMO Academy". damo.alibaba.com. Retrieved 2023-08-12.
  13. ^ Luo, Zhengxiong; Chen, Dayou; Zhang, Yingya; Huang, Yan; Wang, Liang; Shen, Yujun; Zhao, Deli; Zhou, Jingren; Tan, Tieniu (2023). "VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation". arXiv:2303.08320 [cs.CV].
  14. ^ "Text to Speech for Videos". Retrieved 2023-10-17.
  15. ^ Text2Video-Zero, Picsart AI Research (PAIR), 2023-08-12. Retrieved 2023-08-12.