What is Stable Cascade?

Stable Cascade is a new diffusion model that generates images from text descriptions. It is developed by Stability AI, the company behind Stable Diffusion, and is known for being faster, more affordable, and potentially easier to use than previous models such as Stable Diffusion XL (SDXL).

How does Stable Cascade work?

Stable Cascade differs from the Stable Diffusion series in that it is built from three interconnected models: Stages A, B, and C. This structure, based on the Würstchen architecture, compresses images hierarchically, which lets the model achieve high-quality results while working in a highly compact latent space.

Stable Cascade architecture overview. [1]

Here's a breakdown of how these components interact:

Stage C, the Latent Generator phase, converts the user's input into compact 24x24 latents, which are then passed to the Latent Decoder phase, made up of Stages A and B. Stages A and B play a role similar to the VAE in Stable Diffusion, which compresses images, but they achieve a much higher level of compression.

Separating the process of generating text-based conditions (Stage C) from the conversion back to high-resolution imagery (Stages A & B) not only enhances flexibility but also dramatically reduces the resources needed for training or fine-tuning.
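
The compression described above can be made concrete with a bit of arithmetic. The sketch below assumes a 1024x1024 image (the resolution used in the tests later in this article) and compares the 24x24 latents of Stage C with the 8x per-side downsampling of the Stable Diffusion VAE. It counts spatial positions only and ignores the number of latent channels, so treat it as a rough illustration rather than an exact size comparison.

```python
# Assumption: a 1024x1024 image, the resolution used in this article's tests.
IMAGE_SIDE = 1024

# Stable Cascade: Stage C works on 24x24 latents (from the text above).
CASCADE_LATENT_SIDE = 24

# Stable Diffusion / SDXL: the VAE downsamples each side by a factor of 8.
SD_VAE_FACTOR = 8


def spatial_compression(image_side: int, latent_side: int) -> float:
    """Return how many image pixels map to one latent position along one side."""
    return image_side / latent_side


cascade_factor = spatial_compression(IMAGE_SIDE, CASCADE_LATENT_SIDE)
sd_latent_side = IMAGE_SIDE // SD_VAE_FACTOR
sd_factor = spatial_compression(IMAGE_SIDE, sd_latent_side)

# How many more latent positions an SD-style latent has than a Stage C latent.
positions_ratio = (sd_latent_side ** 2) / (CASCADE_LATENT_SIDE ** 2)

print(f"Stable Cascade: {IMAGE_SIDE}px -> {CASCADE_LATENT_SIDE}px per side "
      f"(factor {cascade_factor:.1f})")
print(f"SD-style VAE:   {IMAGE_SIDE}px -> {sd_latent_side}px per side "
      f"(factor {sd_factor:.1f})")
print(f"Latent positions: {positions_ratio:.1f}x fewer in Stable Cascade")
```

Fewer latent positions mean less work per denoising step for the text-conditional stage, which is one plausible reason for the speed and training-cost claims discussed in this article.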

Advantages of Stable Cascade Compared to SDXL

This section compares Stable Cascade with SDXL, both capable text-to-image models. The goal is to show where Stable Cascade stands out in terms of performance, efficiency, and its ability to produce high-quality images from text prompts.

Higher Image Quality

According to Stability AI's research, Stable Cascade follows prompts slightly more closely than SDXL. Our own evaluations point the same way: Stable Cascade produced images that matched the requested scenes more accurately, particularly for realistic portraits and landscapes.

This fidelity and prompt alignment make Stable Cascade a good fit for high-quality visual content that has to meet specific creative demands.

Stable Cascade vs SDXL: aesthetic quality. [1]

Prompt: RAW photo, subject, 8k uhd, dslr, a portrait of young woman, black hair, blue eyes, in a train station
Prompt: RAW photo, subject, 8k uhd, dslr, soft lighting, high quality, clearly face, a lifelike portrait of a weathered traveler, their face telling stories of adventures and experiences
Prompt: RAW photo, subject, 8k uhd, dslr, soft lighting, high quality, clearly face, an expressive portrait of a musician lost in the magic of their music, capturing their passion

Prompt: endless rows of lavender stretching into the horizon, soft warm glow of a setting sun, the rolling rows of purple lavender create a sense of infinite beauty
Prompt: enchanted winter forest, soft diffuse light on a snow-filled day, serene nature scene, the forest is illuminated by the snow

Better Prompt Alignment

Stability AI's findings also show that Stable Cascade follows prompts slightly more precisely than SDXL.

Stable Cascade vs SDXL: prompt alignment.

Our own testing supports this observation. Both models perform closely, but Stable Cascade has a slight edge in prompt adherence. This is most visible when the generated images are compared side by side: Stable Cascade's outputs capture more of the nuances of the request (e.g., 'soundwaves forming a heart').

This gain in prompt fidelity makes Stable Cascade a noteworthy option for visual content that must adhere strictly to user specifications.

Prompt: A group of people happy working around a table, realistic, 4K
Prompt: pictorial mark logo of a retro vinyl record with soundwaves forming a heart

Text generation

When it comes to rendering text inside images, Stable Cascade significantly surpasses SDXL. This is evident in the examples below, whose prompts ask for the word 'Ikomia' to appear in the image: Stable Cascade handles the requested text noticeably better.

Prompt: a portrait photo of a 25-year old man, glasses, smiling, holding a sign 'Ikomia'
Prompt: abstract mark logo of old a purple glowing computer screen with written 'Ikomia' in orange

Faster image generation

According to a study by Stability AI, Stable Cascade generates images more than twice as fast as SDXL.

In our own testing (see the generated images in the previous sections), we generated 1024x1024 images with the following settings, chosen to get the best results from each model:

- SDXL: 45 steps plus an additional 15 refinement steps, 22 seconds in total.
- Stable Cascade: 30 steps plus 20 steps, just 12 seconds in total.

This speed gain makes Stable Cascade well suited to applications that need rapid image generation without sacrificing quality.
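
To put these timings in perspective, the short sketch below computes the overall speedup and the average time per step from the figures reported above. The numbers come from a single test in this article on hardware that is not specified here, so the result (roughly 1.8x) is only indicative and should not be read as a benchmark.

```python
# Timings reported in this article for one 1024x1024 image.
runs = {
    "SDXL": {"steps": 45 + 15, "seconds": 22.0},
    "Stable Cascade": {"steps": 30 + 20, "seconds": 12.0},
}


def seconds_per_step(run: dict) -> float:
    """Average wall-clock time per denoising step for one run."""
    return run["seconds"] / run["steps"]


# Overall speedup of Stable Cascade relative to SDXL in this test.
speedup = runs["SDXL"]["seconds"] / runs["Stable Cascade"]["seconds"]

for name, run in runs.items():
    print(f"{name}: {run['steps']} steps, {seconds_per_step(run):.2f} s/step")
print(f"Speedup: {speedup:.2f}x")
```

Note that Stable Cascade is faster both because it uses fewer steps in this setup and because each step takes less time on average.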

Easier to fine-tune

Although we haven't tested it directly, Stable Cascade's architecture is designed for efficient customization and fine-tuning, particularly of Stage C. This is intended to make it simpler to adapt the model to specific artistic styles or to add ControlNet functionality. It contrasts with the more complex training requirements of models like SDXL and makes Stable Cascade a potentially more accessible platform for creative and technical modifications.

This streamlined approach could significantly ease model training. By concentrating adjustments on Stage C, it is possible to achieve up to a 16-fold reduction in training costs compared with similar efforts on Stable Diffusion models, making custom styles and functionalities both more accessible and more cost-effective.

Easily run Stable Cascade

Setup

With the Ikomia API, you can generate images with Stable Cascade in just a few lines of code.

To get started, you need to install the API in a virtual environment [2].


pip install ikomia

Run Stable Cascade with a few lines of code

You can also directly open the notebook we have prepared.

Note: The Stable Cascade algorithm requires 17 GB of VRAM to run.

Go to notebook

Go to Colab


from ikomia.dataprocess.workflow import Workflow
from ikomia.utils.displayIO import display


# Init your workflow
wf = Workflow()

# Add the Stable Cascade algorithm
algo = wf.add_task(name="infer_stable_cascade", auto_connect=False)

# Set the generation parameters (all values are passed as strings)
algo.set_parameters({
    'prompt': "a picture of a chubby Dzungarian hamster in an adventurer's hat, standing on two feet, holding a camera with his paws, travels in the mountains, climbing equipment, clouds, macro photography",
    'negative_prompt': '',
    'prior_num_inference_steps': '20',
    'prior_guidance_scale': '4.0',
    'num_inference_steps': '30',
    'guidance_scale': '0.0',
    'seed': '142753564',
    'width': '1024',
    'height': '1024',
    'num_images_per_prompt': '1'
})

# Generate your image
wf.run()

# Display the image
display(algo.get_output(0).get_image())

Image of a chubby Dzungarian hamster generated with Stable Cascade
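
The example above passes every parameter value as a string. If you prefer to keep typed values (integers, floats) in your own code, a small helper can convert them just before the call. The function `to_string_params` below is a hypothetical convenience for illustration and is not part of the Ikomia API; the parameter names and values are the ones used in this article.

```python
from typing import Any, Dict


def to_string_params(params: Dict[str, Any]) -> Dict[str, str]:
    """Convert typed values to the string form used in the example above.

    Hypothetical helper, not part of the Ikomia API.
    """
    return {name: str(value) for name, value in params.items()}


# Typed parameters, using names and values that appear in this article.
typed_params = {
    "prompt": "Anthropomorphic cat dressed as a pilot",
    "negative_prompt": "",
    "prior_num_inference_steps": 20,
    "prior_guidance_scale": 4.0,
    "num_inference_steps": 30,
    "guidance_scale": 0.0,
    "seed": -1,
    "width": 1024,
    "height": 1024,
    "num_images_per_prompt": 1,
}

string_params = to_string_params(typed_params)
print(string_params)

# With the workflow from the example above, the result could then be passed on:
# algo.set_parameters(string_params)
```

Keeping typed values on your side makes it easier to validate or sweep over settings (for example, several step counts) before handing them to the workflow.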

List of parameters:

- prompt (str) - default 'Anthropomorphic cat dressed as a pilot': Text prompt to guide the image generation.
- negative_prompt (str, optional) - default '': Text prompt describing what the image generation should avoid. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
- prior_num_inference_steps (int) - default '20': Number of timesteps for the prior (Stage C).
- prior_guidance_scale (float) - default '4.0': A higher guidance scale encourages the model to generate images closely linked to the text prompt, usually at the expense of image quality (minimum: 1; maximum: 20).
- num_inference_steps (int) - default '30': Number of timesteps for the decoder (Stage B).
- guidance_scale (float) - default '0.0': A higher guidance scale encourages the model to generate images closely linked to the text prompt, usually at the expense of image quality (minimum: 1; maximum: 20).
- height (int) - default '1024': The height in pixels of the generated image.
- width (int) - default '1024': The width in pixels of the generated image.
- num_images_per_prompt (int) - default '1': Number of generated image(s).
- seed (int) - default '-1': Seed value. '-1' generates a random number between 0 and 191965535.
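
The seed behaviour described in the list can be sketched as follows. This is only an illustration of the documented rule ('-1' means a random seed in the stated range, any other value is used as given); how the algorithm implements it internally is not shown in this article, and `resolve_seed` is a hypothetical name.

```python
import random

# Upper bound for random seeds, as quoted in the parameter list above.
MAX_RANDOM_SEED = 191965535


def resolve_seed(seed: int) -> int:
    """Return the seed unchanged, or draw a random one when seed is -1.

    Hypothetical helper that mirrors the documented behaviour.
    """
    if seed == -1:
        return random.randint(0, MAX_RANDOM_SEED)
    return seed


# A fixed seed is kept as-is, which makes a run reproducible.
print(resolve_seed(142753564))

# -1 yields a different seed on each call, so each run gives a new image.
print(resolve_seed(-1))
```

In practice, this means you should set an explicit seed, as in the hamster example, whenever you want to regenerate the same image with the same parameters.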

Resources

Browse the Ikomia HUB to play with more diffusion models, such as:

- SDXL and SDXL Turbo

- The Kandinsky series, including text-to-image, image-to-image, ControlNet & more

For more information on how to use the API, see the Ikomia documentation. It is designed to help you get the most out of the API.

Ikomia STUDIO complements the ecosystem by offering a no-code, visual approach to image processing, reflecting the API's features in an accessible interface.
