
Open-Sora from HPC-AI Tech Team Continues Open Source: Generate Any 16-Second 720p HD Video with One Click, Model Weights Ready to Use

The Open-Sora team at HPC-AI Tech has made groundbreaking progress in the quality and generation speed of 720p HD text-to-video, enabling seamless production of high-quality short films in any style. Even more exciting, we have decided to continue fully open-sourcing our work, bringing further impact to the open-source community.
With our model weights, you can generate a variety of stunning short films, such as an intimate encounter between waves and seashells, or the mysterious depths of a forest landscape.


The rendering of portrait characters is also remarkably lifelike.


It can also accurately render cyberpunk style, instantly infusing the short films with a strong sense of futuristic and technological atmosphere.



Moreover, it can generate lively and engaging animated scenes, delivering highly expressive visual experiences.


Even movie-grade shot production is handled effortlessly. Smooth zoom effects, for example, lend a film professional-grade polish.



It also helps filmmakers create realistic movie scenes.



The model's outstanding performance unveils vast prospects in the field of video generation. Our model weights and training code have been fully open-sourced, and interested readers can visit our GitHub repository for more details. GitHub address:


The Open-Source Warrior of Text-to-Video

LambdaLabs, the Silicon Valley unicorn, has built a digital LEGO universe on top of our previously open-sourced Open-Sora model weights, offering LEGO enthusiasts the ultimate creative experience.
We deeply understand the accelerating effect of open source on breakthroughs in text-to-video technology. We not only continue to open-source model weights but also share our technical roadmap on GitHub, empowering every enthusiast to become a master of large text-to-video models, no longer just passive onlookers. (Report Address:


Unlocking the Secrets of Text-to-Video

In this technical report, we delve into the key aspects of training this model. Building on the previous version, we introduce a Video Compression Network, enhanced diffusion model algorithms, and increased controllability, and we use more data to train a 1.1B-parameter diffusion generation model.
In this "era of computing power supremacy," we understand the two major pain points of video model training: the significant consumption of computational resources and the high standards for model output quality. With a streamlined yet effective approach, we have successfully struck a balance between cost and quality.
We propose an innovative Video Compression Network (VAE) that compresses video in both the spatial and temporal dimensions. Specifically, we apply an 8x8 compression in the spatial dimension followed by an additional 4x compression in the temporal dimension. This strategy avoids the loss of video smoothness that frame subsampling causes, while significantly reducing training cost, achieving a dual optimization of cost and quality.
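To make the compression ratios concrete, here is a minimal sketch (not the actual Open-Sora network) that computes the latent shape produced by an 8x8 spatial and 4x temporal compression. The helper name, stride handling, and the 24 fps assumption are all illustrative; the real VAE's padding and striding rules may differ.

```python
def latent_shape(frames, height, width, t_stride=4, s_stride=8):
    """Latent shape after a spatio-temporal VAE that compresses
    8x8 spatially and 4x temporally (strides are illustrative)."""
    return (frames // t_stride, height // s_stride, width // s_stride)

# A 16-second 720p clip at an assumed 24 fps: 384 frames of 720x1280.
print(latent_shape(384, 720, 1280))  # (96, 90, 160)
```

The combined factor is 4 * 8 * 8 = 256x fewer latent elements than raw pixels, which is where the training-cost savings come from.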


Video Compression Network Architecture
Stable Diffusion 3 significantly improves the quality of image and video generation by adopting rectified flow in place of DDPM. Although the rectified-flow training code for SD3 has not been publicly released, we provide a complete training solution based on the SD3 research findings, including:
  • Simple and easy-to-use rectified flow training
  • Logit-norm time-step sampling for accelerated training
  • Time-step sampling based on resolution and video length
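The first two items above can be sketched in a few lines. This is a minimal NumPy illustration of logit-normal timestep sampling and the rectified-flow interpolation/velocity target, not Open-Sora's training code; the batch size, latent dimension, and velocity convention (noise minus data) are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def logit_normal_timesteps(batch, mean=0.0, std=1.0):
    # Logit-normal sampling concentrates timesteps near the middle
    # of [0, 1], which SD3 reports accelerates rectified-flow training.
    return 1.0 / (1.0 + np.exp(-rng.normal(mean, std, size=batch)))

def rectified_flow_pair(x0, t):
    # Linearly interpolate between data x0 and Gaussian noise;
    # the regression target is the constant velocity (noise - x0).
    noise = rng.standard_normal(x0.shape)
    t = t.reshape(-1, *([1] * (x0.ndim - 1)))  # broadcast over latent dims
    x_t = (1.0 - t) * x0 + t * noise
    return x_t, noise - x0

# Hypothetical batch of 4 latents, each 16-dimensional.
x0 = rng.standard_normal((4, 16))
t = logit_normal_timesteps(4)
x_t, velocity_target = rectified_flow_pair(x0, t)
# The training loss would be mean((model(x_t, t) - velocity_target) ** 2).
```

Resolution- and length-dependent timestep sampling would additionally shift the `mean` parameter per sample, biasing harder (larger) inputs toward noisier timesteps.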
Through the integration of these techniques, we not only accelerate the model training speed but also significantly reduce waiting times during the inference stage, ensuring a smooth user experience. Additionally, this training solution supports outputting multiple video aspect ratios during inference, meeting the diverse needs of video content creators in various scenarios and providing them with richer creative tools.
In the report, we also disclose more core details of model training, including practical techniques for data cleaning and model fine-tuning, and we describe a more comprehensive evaluation system built to ensure model robustness and generalization. We also provide an easily deployable Gradio application that supports adjusting motion scores, aesthetic scores, and camera-movement parameters; it even allows one-click automatic prompt rewriting with GPT-4o and supports Chinese input. If you can't resist getting hands-on, click here for more details.


Empowering Innovation: Breaking Barriers with Open Source

Since the release of OpenAI's Sora, industry expectations for Sora's openness have been sky-high, but the reality has been a continued waiting game. Our open-source initiative injects powerful vitality into the innovation and development of text-to-video technology. "Give a man a fish, and you feed him for a day. Teach a man to fish, and you feed him for a lifetime." Visit our GitHub repository to effortlessly access model weights and the complete training code, transforming users from passive content consumers into active content creators. This transformation unlocks new skills for enterprise users to develop text-to-video applications autonomously, expanding the application scenarios of text-to-video technology exponentially, whether for creating immersive games, innovative advertisements, or producing blockbuster films.

We anticipate this spark to ignite a blaze of innovation throughout the text-to-video field, unleashing a momentum that spreads far and wide.
Finally, here is our open-source link: