Open-Sora Unveils Major Upgrade: Embracing Open Source with Single-Shot 16-Second Video Generation and 720p Resolution

Open-Sora has been quietly updated in the open-source community. It now supports single-shot video generation up to 16 seconds long at a maximum resolution of 720p, and handles text-to-image, text-to-video, image-to-video, video-to-video, and infinitely long video generation at any aspect ratio. Let's try it out.
Generate a landscape video for posting on YouTube.

Then generate a portrait video for TikTok.

Plus, with the ability to create 16-second long single-shot videos, now everyone can indulge in their screenwriting fantasies.
How to get started? Check out the GitHub repository:
What's even cooler is that Open-Sora remains fully open-source, encompassing the latest model architecture, the most recent model weights, training details for various durations, resolutions, aspect ratios, and frame rates, a complete workflow for data collection and preprocessing, demo examples, and comprehensive tutorials for getting started.

Comprehensive Interpretation of the Open-Sora Technical Report

Overview of New Features

We have officially released the new Open-Sora technical report on GitHub [1]. This update includes the following key features:
  • Support for long video generation;
  • Video generation with a maximum resolution of up to 720p;
  • Support for text-to-image, text-to-video, image-to-video, video-to-video, and infinitely long video generation requirements for any aspect ratio;
  • A more stable model architecture design has been proposed, capable of training with various durations/resolutions/aspect ratios/frame rates;
  • The latest complete automatic data processing pipeline has been open-sourced.

Spatio-Temporal Diffusion Transformer Model ST-DiT-2

We, the Colossal-AI team, have made key improvements to the STDiT architecture used in the previous Open-Sora release, with the aim of improving the model's training stability and overall performance. Because the task is a form of sequence prediction, we adopted best practices from large language models (LLMs), replacing the sinusoidal positional encoding in all attention layers with the more efficient Rotary Position Embedding (RoPE). Additionally, following the SD3 model architecture, we introduced QK-normalization to improve the stability of half-precision training. To support training with multiple resolutions, aspect ratios, and frame rates, our ST-DiT-2 architecture automatically scales positional embeddings and processes inputs of different sizes.
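To illustrate why RoPE suits attention over sequences, here is a minimal pure-Python sketch. It is not the Open-Sora implementation: the names `rope_rotate` and `dot` and the base of 10000 are assumptions taken from the common RoPE formulation. Each consecutive pair of features is rotated by a position-dependent angle, so the query-key dot product depends only on the relative offset between the two positions.

```python
import math

def rope_rotate(vec, position, base=10000.0):
    """Rotate consecutive feature pairs of `vec` by angles proportional to `position`."""
    dim = len(vec)
    assert dim % 2 == 0, "RoPE needs an even feature dimension"
    out = []
    for i in range(0, dim, 2):
        # Each pair gets its own frequency, as in the common RoPE formulation.
        theta = position * base ** (-i / dim)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * cos_t - y * sin_t, x * sin_t + y * cos_t])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# The same relative offset (3) at different absolute positions gives the same score.
score_a = dot(rope_rotate(q, 5), rope_rotate(k, 2))
score_b = dot(rope_rotate(q, 105), rope_rotate(k, 102))
```

Because the rotation preserves vector norms and encodes only relative position, the same weights can in principle be applied to sequences of different lengths, which fits the goal of training on varied durations.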


Multi-Stage Training

According to the new Open-Sora technical report, Open-Sora employs a multi-stage training approach in which each stage continues training from the weights of the previous stage. Compared to single-stage training, introducing data in steps reaches the goal of high-quality video generation more efficiently.
In the first stage, training uses mostly 144p videos, mixed with images and videos at 240p and 480p, and lasts approximately one week for a total of 81k steps. The second stage raises the resolution of most video data to 240p and 480p and trains for one day, reaching 22k steps. The third stage moves to 480p and 720p, completing 4k steps in one day. The entire multi-stage training process takes about 9 days and improves the quality of video generation in multiple dimensions compared to Open-Sora 1.0.
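The schedule above can be written down as a small table to check the totals. This is only a bookkeeping sketch: the `Stage` class and its field names are hypothetical, and "one week" is taken as 7 days.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Stage:
    name: str
    main_resolutions: tuple  # resolutions used for most video data in this stage
    steps: int
    days: int  # approximate duration quoted in the text

stages = [
    Stage("stage 1", ("144p",), 81_000, 7),
    Stage("stage 2", ("240p", "480p"), 22_000, 1),
    Stage("stage 3", ("480p", "720p"), 4_000, 1),
]

total_steps = sum(s.steps for s in stages)
total_days = sum(s.days for s in stages)
print(f"{total_steps} steps over about {total_days} days")
```

The sum confirms the roughly 9-day total, and shows that the low-resolution first stage accounts for most of the training steps.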


Unified Image-to-Video/Video-to-Video Framework

Leveraging the characteristics of the Transformer, we can easily extend the DiT architecture to support both image-to-video and video-to-video tasks. We propose a masking strategy to support conditioning on images and videos. By setting different masks, we can support a variety of generation tasks, including image-to-video, looped video, video extension, autoregressive video generation, video stitching, video editing, and frame insertion.


[Figure: Masking Strategy for Supporting Image and Video Conditioning]
Inspired by the UL2 method [2], we introduced a random masking strategy during training. Specifically, frames are randomly selected and unmasked during training, including but not limited to the first frame, the first k frames, the last k frames, and any random k frames. Our earlier experiments with Open-Sora showed that applying the masking strategy with 50% probability lets the model learn to handle image conditioning in only a small number of steps. In this version of Open-Sora, we therefore pretrain from scratch with the masking strategy.
Additionally, we provide a detailed guide for configuring the masking strategy at inference time, where a five-number tuple offers great flexibility and control in defining the masking strategy.
[Figure: Masking Strategy Configuration Instructions]
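The random masking used during training can be sketched as follows. This is an illustrative sketch rather than the Open-Sora code: the function name `sample_frame_mask`, the `max_k` cap, and the uniform choice among the four strategies are all assumptions. Only the 50% probability and the four unmasking patterns come from the description above.

```python
import random

def sample_frame_mask(num_frames, rng, mask_prob=0.5, max_k=4):
    """Return a list of booleans; True marks a frame that is unmasked (given as a condition)."""
    unmasked = [False] * num_frames
    if rng.random() >= mask_prob:
        return unmasked  # masking strategy not applied: no conditioning frames
    k = rng.randint(1, min(max_k, num_frames))
    strategy = rng.choice(["first_frame", "first_k", "last_k", "random_k"])
    if strategy == "first_frame":
        indices = [0]
    elif strategy == "first_k":
        indices = range(k)
    elif strategy == "last_k":
        indices = range(num_frames - k, num_frames)
    else:
        indices = rng.sample(range(num_frames), k)
    for i in indices:
        unmasked[i] = True
    return unmasked

rng = random.Random(0)
masks = [sample_frame_mask(16, rng) for _ in range(1000)]
conditioned = sum(any(m) for m in masks)  # samples with at least one unmasked frame
```

With `mask_prob=0.5`, roughly half of the sampled training examples carry conditioning frames and the rest remain plain text-to-video samples.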

Support for Multi-Duration/Resolution/Aspect Ratio/FPS Training

In the technical report for OpenAI's Sora [3], OpenAI notes that training on videos at their original resolution, aspect ratio, and length increases sampling flexibility and improves framing and composition. We therefore propose a bucket training strategy.
How is it implemented? A bucket is a triplet of (resolution, number of frames, aspect ratio). We predefine a series of aspect ratios for each resolution to cover the most common video aspect ratios. At the beginning of each training epoch, we shuffle the dataset and allocate samples to buckets based on their characteristics. Specifically, we place each sample into a bucket whose resolution and frame count are less than or equal to those of the video.
[Figure: Open-Sora Bucket Strategy]
To reduce the computational resources required, we introduced two attributes, keep_prob and batch_size, for each (resolution, number of frames) pair, which lowers the computational cost and enables multi-stage training. In this way, we can control the number of samples in different buckets and balance the GPU load by searching for an optimal batch size for each bucket. We provide a detailed explanation of this in our technical report, and interested individuals can read more about it on GitHub:
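A minimal sketch of the bucket assignment described above might look like the following. The resolution table, pixel counts, aspect-ratio list, and the specific `keep_prob` and `batch_size` values are hypothetical, as is the function name `assign_bucket`. The sketch also assumes that larger buckets are tried first and that a sample rejected by `keep_prob` falls through to the next smaller bucket; the technical report is the authority on the actual rules.

```python
import random

# Hypothetical pixel counts per resolution name.
PIXELS = {"480p": 480 * 854, "240p": 240 * 426, "144p": 144 * 256}
# Hypothetical settings per (resolution, number of frames): (keep_prob, batch_size).
BUCKET_CONFIG = {
    ("480p", 16): (0.5, 8),
    ("240p", 64): (0.5, 8),
    ("240p", 16): (1.0, 32),
    ("144p", 16): (1.0, 64),
}
ASPECT_RATIOS = [9 / 16, 3 / 4, 1.0, 4 / 3, 16 / 9]  # width / height

def assign_bucket(height, width, num_frames, rng):
    """Return (resolution, frames, aspect_ratio), or None if no bucket accepts the sample."""
    pixels = height * width
    # Try larger buckets first (assumption), by pixel count and then frame count.
    candidates = sorted(BUCKET_CONFIG, key=lambda b: (PIXELS[b[0]], b[1]), reverse=True)
    for res, frames in candidates:
        if PIXELS[res] > pixels or frames > num_frames:
            continue  # the bucket must not exceed the video in resolution or length
        keep_prob, _batch_size = BUCKET_CONFIG[(res, frames)]
        if rng.random() < keep_prob:
            ratio = min(ASPECT_RATIOS, key=lambda r: abs(r - width / height))
            return (res, frames, ratio)
    return None

rng = random.Random(0)
landscape = assign_bucket(720, 1280, 100, rng)
too_small = assign_bucket(100, 100, 10, rng)
```

Lowering `keep_prob` for expensive buckets sends more samples to cheaper ones, while `batch_size` would be looked up per bucket when forming batches so that each batch contains samples of a single shape.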

Data Collection and Preprocessing Pipeline

We also provide a detailed guide for the data collection and processing phase. During the development of Open-Sora 1.0, we realized that both the quantity and the quality of data are crucial for training a high-performance model, so we have been committed to expanding and optimizing the dataset. We have established an automated data processing pipeline that follows the approach of Stable Video Diffusion (SVD), covering scene segmentation, caption processing, diverse scoring and filtering, as well as the management system and conventions for the dataset. We have also shared the data processing scripts with the open-source community. Interested developers can use these resources, together with the technical report and code, to efficiently process and optimize their own datasets.


[Figure: Open-Sora Data Processing Pipeline]
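As a rough illustration of the scoring-and-filtering stage, the sketch below keeps only clips that have a caption and meet every minimum score. It is not taken from the released scripts: the `Clip` class, the score names `aesthetic` and `motion`, and the thresholds are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Clip:
    path: str
    caption: str
    scores: dict  # score name -> value, assumed to come from separate scoring steps

def filter_clips(clips, min_scores):
    """Keep clips that have a non-empty caption and meet every minimum score."""
    kept = []
    for clip in clips:
        if not clip.caption.strip():
            continue  # drop clips whose captioning step produced nothing
        if all(clip.scores.get(name, float("-inf")) >= threshold
               for name, threshold in min_scores.items()):
            kept.append(clip)
    return kept

clips = [
    Clip("a.mp4", "a snowy pine forest", {"aesthetic": 5.8, "motion": 0.4}),
    Clip("b.mp4", "", {"aesthetic": 6.1, "motion": 0.5}),
    Clip("c.mp4", "city street at night", {"aesthetic": 3.2, "motion": 0.6}),
]
kept = filter_clips(clips, {"aesthetic": 4.5, "motion": 0.1})
```

Keeping the thresholds in a single dictionary makes it straightforward to tighten or relax the filters for different training stages.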

Comprehensive Performance Evaluation of Open-Sora

Video Generation Showcase

The most eye-catching feature of Open-Sora is its ability to capture the scenes in your mind and transform them into touching dynamic videos through the medium of written description. Images and imaginations that once flickered briefly in your thoughts can now be permanently recorded and shared with others. Here, we are merely breaking the ice by generating several videos using Open-Sora, sparking infinite creative possibilities.
For instance, we attempted to generate a video of a tour through a forest in the winter. The snow had just fallen, and the pine trees were laden with a blanket of pure white snow. The dark green needles of the pines contrasted sharply with the white snowflakes, creating a well-defined and layered scene.

Or perhaps, you might envision a tranquil night where you find yourself within a dark forest depicted in countless fairy tales, with the profound lake shimmering under the radiant glow of a star-filled sky.
An aerial view of a bustling island at night is even more enchanting, with the warm yellow lights and ribbon-like blue sea instantly transporting one into a leisurely holiday mood.
The hustle and bustle of the city, with its high-rise buildings and street-side shops still illuminated late into the night, presents a distinct charm of its own.
Beyond landscapes, Open-Sora is also capable of bringing to life a variety of natural creatures. Whether it's a brightly colored little flower,
or a chameleon slowly turning its head, Open-Sora can generate videos that are quite lifelike.

We have also provided a multitude of generated videos for your reference, encompassing a diverse range of content, resolutions, aspect ratios, and durations.


With just a simple prompt, Open-Sora can generate short videos at multiple resolutions, removing bottlenecks from the creative process.





We can also feed Open-Sora a static image to generate a short video.
Open-Sora can also ingeniously connect two static images, and with a gentle tap on the video below, it will take you through the transformation of light and shadow from afternoon to dusk. Each frame is a poetic ode to time.


To edit an existing video, a simple command is enough: for example, transforming an originally bright forest into a scene of heavy snowfall.
Open-Sora can also generate high-definition images.
The model weights for Open-Sora have been fully and freely released to the open-source community, so you might as well download them and give it a try. Since we also support video stitching, you can create a narrative short film for free and bring your creativity to life. Weight download address:

Current Limitations and Future Plans

Although we have made good progress in replicating a Sora-like video generation model, the generated videos still have room for improvement in several respects, including noise during generation, lack of temporal consistency, poor quality of human figure generation, and low aesthetic scores. We will prioritize resolving these issues in the next version, with the aim of reaching a higher standard of video generation. Stay tuned!
[2] Tay, Yi, et al. "UL2: Unifying Language Learning Paradigms." arXiv preprint arXiv:2205.05131 (2022).