OpenAI launches Sora: text to video AI

16th February 2024
Harry Fowle

OpenAI has unveiled a new tool, dubbed Sora, which can generate videos directly from text prompts.

This groundbreaking model can produce realistic footage up to a minute long, closely following the user's instructions on both subject matter and style. Remarkably, Sora can not only generate videos from scratch based on textual descriptions, but can also animate still images and extend existing footage with new material.

According to a company blog post, the underlying objective behind Sora's development is to advance AI's comprehension and simulation of the physical world in motion, with the aim of training models that help solve real-world problems requiring interaction with the physical environment. Among the initial examples showcased by the company, one particularly striking video was generated from the prompt: “A movie trailer featuring the adventures of the 30-year-old spaceman wearing a red wool knitted motorcycle helmet, blue sky, salt desert, cinematic style, shot on 35mm film, vivid colours.”

In a cautious approach to responsible usage, OpenAI has granted a select group of researchers and video creators access to Sora. These experts are tasked with "red teaming" the product, a process designed to probe its vulnerability to misuse and to violations of OpenAI’s terms of service, which strictly prohibit content involving extreme violence, sexual content, hateful imagery, celebrity likenesses, or the intellectual property of others.

Despite the limited access, OpenAI’s CEO, Sam Altman, engaged with the broader public by responding to user prompts on X (formerly Twitter) with video clips purportedly created by Sora. Notably, these videos carry a watermark to signify their AI-generated origin.

The debut of Sora follows the successful launches of OpenAI's still-image generator, DALL-E, in 2021, and the generative AI chatbot ChatGPT in November 2022, which rapidly amassed 100 million users. While other AI firms have ventured into video generation, their outputs have been limited to brief clips that often bear little relevance to the provided prompts. Competitors such as Google and Meta have announced generative video tools of their own, though these projects have yet to reach the public.

OpenAI has been discreet about the volume of footage used to train Sora and the origins of the training videos, telling the New York Times only that the dataset comprises both publicly available videos and copyrighted videos licensed from their owners. This aligns with the firm's history of training its generative AI tools on extensive datasets scraped from the internet, a practice that has led to multiple lawsuits alleging copyright infringement.

Sora's text-to-video generation methodology

While OpenAI has not disclosed the intricate details of Sora's operational framework, insights can be drawn from the general principles of AI-driven video generation. Such models typically rely on deep learning, using neural networks to interpret text prompts and translate them into visual content.

One established approach to generative video is the generative adversarial network (GAN), which has been pivotal in the field of generative AI. GANs involve two neural networks, a generator and a discriminator, working in tandem to produce increasingly realistic outputs. The generator creates videos based on text prompts, while the discriminator evaluates their authenticity against real footage. Through iterative training, the generator learns to produce ever more convincing videos that can fool the discriminator, enhancing the realism and fidelity of the generated content. It is worth noting, however, that OpenAI's own technical report describes Sora not as a GAN but as a diffusion model built on a transformer architecture.
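
To make the adversarial idea concrete, here is a minimal training-loop sketch in PyTorch. It is a toy on random placeholder data: the network sizes, hyperparameters, and flat-vector "samples" are illustrative assumptions, not Sora's actual architecture or training code.

```python
# Toy GAN training loop (PyTorch). Illustrative only: random placeholder
# "real" data and flat vectors instead of video frames.
import torch
import torch.nn as nn

torch.manual_seed(0)
latent_dim, data_dim, batch = 16, 64, 32

# Generator: maps a random noise vector to a fake sample.
G = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                  nn.Linear(128, data_dim))
# Discriminator: outputs a logit scoring how "real" a sample looks.
D = nn.Sequential(nn.Linear(data_dim, 128), nn.ReLU(),
                  nn.Linear(128, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

for step in range(200):
    real = torch.randn(batch, data_dim)   # stand-in for real footage
    fake = G(torch.randn(batch, latent_dim))

    # 1) Train the discriminator to separate real from fake samples.
    opt_d.zero_grad()
    d_loss = (bce(D(real), torch.ones(batch, 1)) +
              bce(D(fake.detach()), torch.zeros(batch, 1)))
    d_loss.backward()
    opt_d.step()

    # 2) Train the generator to fool the just-updated discriminator.
    opt_g.zero_grad()
    g_loss = bce(D(fake), torch.ones(batch, 1))
    g_loss.backward()
    opt_g.step()
```

In a real text-to-video GAN, the generator would additionally be conditioned on an encoded prompt and would emit frame tensors rather than flat vectors.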

Additionally, Sora draws on transformer models, which have demonstrated remarkable success in understanding and generating natural language. By leveraging transformers, Sora can grasp the nuances of a text prompt, enabling it to generate videos that accurately reflect the described scenes, actions, and styles.
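
As a rough illustration of that first step, the sketch below encodes a prompt into per-token conditioning vectors with a small transformer encoder in PyTorch. The hash-based tokenizer, vocabulary size, and model dimensions are illustrative assumptions, not Sora's actual components.

```python
# Sketch: encoding a text prompt into conditioning vectors with a small
# transformer encoder (PyTorch).
import torch
import torch.nn as nn

vocab_size, d_model, max_len = 1000, 128, 32

embed = nn.Embedding(vocab_size, d_model)   # token embeddings
pos = nn.Embedding(max_len, d_model)        # learned position embeddings
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4,
                                   batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)

def encode_prompt(prompt: str) -> torch.Tensor:
    """Map a prompt to a (1, seq_len, d_model) tensor of context vectors."""
    # Toy tokenizer: hash each word into a fixed-size vocabulary
    # (not deterministic across runs; fine for a sketch).
    ids = [hash(word) % vocab_size for word in prompt.lower().split()]
    tokens = torch.tensor(ids[:max_len]).unsqueeze(0)        # (1, seq)
    positions = torch.arange(tokens.size(1)).unsqueeze(0)    # (1, seq)
    return encoder(embed(tokens) + pos(positions))

cond = encode_prompt("a spaceman in a red wool helmet, cinematic style")
print(cond.shape)  # torch.Size([1, 9, 128])
```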

The integration of these technologies, combined with a vast dataset of videos, equips Sora to understand complex narrative structures, visual aesthetics, and dynamic movement, enabling it to generate videos that are not only realistic but also rich in detail and creativity.
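
Putting the pieces together, a text-to-video pipeline steers a video generator with the encoded prompt. The toy module below maps a pooled prompt summary to a small video tensor; the shapes and the mean-pooling step are assumptions for illustration, as Sora's real pipeline is undisclosed.

```python
# Sketch of the end-to-end idea: prompt conditioning drives a generator
# that emits a (frames, height, width, channels) video tensor.
import torch
import torch.nn as nn

d_cond, frames, height, width, channels = 128, 16, 64, 64, 3

class TinyTextToVideo(nn.Module):
    def __init__(self):
        super().__init__()
        self.to_video = nn.Linear(d_cond, frames * height * width * channels)

    def forward(self, cond: torch.Tensor) -> torch.Tensor:
        summary = cond.mean(dim=1)        # pool per-token vectors: (1, d_cond)
        video = self.to_video(summary)    # (1, frames*height*width*channels)
        return video.view(-1, frames, height, width, channels)

model = TinyTextToVideo()
# Stand-in for the encoded prompt from the previous sketch (9 tokens).
video = model(torch.randn(1, 9, d_cond))
print(video.shape)  # torch.Size([1, 16, 64, 64, 3])
```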

As AI continues to evolve, tools like Sora represent significant milestones in bridging the gap between textual descriptions and visual storytelling, offering promising avenues for creative expression, educational content, and beyond. Could this be the start of a whole new creative AI era?
