Qwen-Image-Layered: Decompose Images into Editable RGBA Layers


1/2/2026

Tags: Qwen Image · Layer Decomposition · RGBA · Image Editing

AI image editing is getting better fast, but a frustrating limitation keeps showing up in real work: most images are still treated as one flat raster. Every object, shadow, highlight, texture, and piece of text is “entangled” into a single canvas. When you ask a model to edit one part, you often get collateral damage—colors drift, textures change, edges get reinterpreted, and text mutates.

Professional design tools avoid this problem with layered representations. If the headline is its own layer, you can move it, recolor it, or replace it without repainting the whole poster. If a product cutout is its own layer, you can resize it without destroying the background.

Qwen-Image-Layered brings this idea into generative workflows. Instead of only producing a final RGB image, it can decompose a given image into multiple transparent RGBA layers, so you can edit the scene in a way that’s closer to “design operations” than “regeneration”.

Want to try it now? Open the Layered tool →

Layered decomposition example: input image of a sneaker with a SALE sticker, decomposed into separate RGBA layers for background, shoe, and sticker

What does “layer decomposition” mean?

In a typical layered workflow, you’re not just cutting the image into rectangles. Each output layer is a full-size image with transparency:

  • RGB pixels contain the visible content that belongs to that layer.
  • Alpha pixels define where the layer is transparent.

This distinction matters. A crop can’t be composited cleanly; an RGBA layer can. When you have a stack of RGBA layers, you can reorder, move, scale, or recolor them and then re-compose them into a final image—often while preserving the rest of the scene.
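Re-composition is plain alpha blending. As a minimal sketch (using Pillow, which is not part of the model itself), a stack of same-size RGBA layers can be flattened bottom-to-top like this:

```python
from PIL import Image

def compose_layers(layers):
    """Alpha-composite a bottom-to-top stack of same-size RGBA images."""
    canvas = Image.new("RGBA", layers[0].size, (0, 0, 0, 0))
    for layer in layers:
        canvas = Image.alpha_composite(canvas, layer.convert("RGBA"))
    return canvas

# Toy example: an opaque red background under a half-transparent blue square
background = Image.new("RGBA", (64, 64), (255, 0, 0, 255))
overlay = Image.new("RGBA", (64, 64), (0, 0, 0, 0))
overlay.paste((0, 0, 255, 128), (16, 16, 48, 48))  # fill a region with color
flattened = compose_layers([background, overlay])
```

Because each layer carries its own alpha, you can reorder or transform one layer and re-run the composite without touching the others, which is exactly what a crop cannot do.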

In practice, layers often correspond to things like:

  • Foreground subject vs background
  • Text elements vs graphics
  • Large objects separated from smaller accessories
  • Regions that need independent editing (logos, stickers, signage)

The exact layer semantics can vary by image, but the goal is consistent: separate the scene into parts that can be manipulated independently.

Why layers beat “editing a flat raster”

If you’ve used diffusion-based editing tools, you’ve probably seen a recurring trade-off:

  • Push the model hard enough to make the change you want → it changes other things too.
  • Constrain the model enough to preserve everything else → it fails to change what you want.

That’s not only a prompt or parameter tuning issue. It’s largely a representation problem. A raster image doesn’t explicitly separate “what belongs to the product” from “what belongs to the background” or “what belongs to the headline”. Everything overlaps in pixel space.

A layered representation makes a simple but powerful promise: the thing you want to change is physically isolated. That isolation reduces the blast radius of edits and makes it easier to iterate—especially when you’re doing many small revisions.

Inherent editability (what the paper is aiming for)

The Qwen-Image-Layered paper describes the concept of inherent editability: editing becomes more reliable because the image is represented as multiple RGBA layers instead of a single entangled canvas.

This is especially valuable for production use cases where consistency matters more than “surprising creativity”:

  • Marketing creatives: update a badge color, move a headline, tweak a product variant
  • E-commerce: keep lighting/composition stable while swapping items or adjusting labels
  • Localization: change text while keeping layout and style consistent across regions
  • Design iteration: do 10 small revisions without gradually degrading the whole image

In short: layers turn many edits into “graphic design moves” (transform, recolor, replace) instead of “regenerate and hope”.

A high-level look at how it works (without the math)

At a high level, Qwen-Image-Layered is an end-to-end diffusion model that maps:

one RGB image → multiple RGBA images

Two practical challenges show up immediately:

  1. RGB vs RGBA mismatch: the model needs a unified way to represent both RGB inputs and RGBA outputs.
  2. Variable number of layers: images don’t naturally decompose into the same number of layers every time.
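In tensor terms, both challenges can be pictured with a toy shape sketch (the sizes below are illustrative, not the model's actual dimensions):

```python
import numpy as np

H, W = 256, 256   # illustrative spatial resolution
N = 4             # number of output layers: varies per image (challenge 2)

rgb_input = np.zeros((H, W, 3), dtype=np.uint8)       # one RGB image in
rgba_layers = np.zeros((N, H, W, 4), dtype=np.uint8)  # N RGBA images out

# Challenge 1 in miniature: inputs have 3 channels, outputs have 4,
# so the model needs one latent space that covers both (the RGBA-VAE's job).
```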

The paper introduces three key components to address this (terminology from the authors):

  • RGBA-VAE: unifies latent representations for RGB and RGBA images.
  • VLD-MMDiT architecture: supports variable-length layer decomposition, so the model can output different numbers of layers.
  • Multi-stage training: adapts a pretrained image generator into a multi-layer decomposer.

Another key point is the dataset problem: high-quality “layered ground truth” isn’t abundant on the open web. The authors describe building a pipeline to extract and annotate multilayer training data from Photoshop documents (PSD), which helps train the model to produce cleaner decompositions.

You don’t need to implement any of this yourself to use the model—but these details help explain why layer decomposition is not “just segmentation”. The output is designed for compositing and downstream editing.

Prompting: treat it like a caption, not a layer controller

One subtle but important note from the official release is that the text prompt is intended to describe the overall content of the input image. It’s not designed to explicitly control what each layer contains.

That means:

  • Good prompt: a concise description of the whole scene (“a red sneaker on a white background with a circular SALE sticker”).
  • Less reliable prompt: trying to assign per-layer meanings (“layer 1 is background, layer 2 is shoe, layer 3 is sticker”).

The release also points out a useful trick: you can include partially occluded elements in the description (for example, text that is partially hidden behind an object). This can sometimes help the decomposition remain semantically coherent.

Choosing the number of layers (a practical guide)

There’s no single best number of layers. Think of it as choosing the granularity of control:

  • 3–4 layers: faster, usually cleaner; best for simple portraits, product photos, minimal scenes
  • 5–8 layers: better separation in complex scenes; sometimes more fragmented
  • More layers: useful if you plan to edit many components independently, but you may see diminishing returns

A simple workflow that works well:

  1. Start at 4 layers.
  2. If the subject and background are merged in a way that blocks your edits, try 6–8 layers.
  3. If important objects are scattered across many tiny layers (hard to manage), drop back down.

How to get better decompositions

Layer decomposition is not one-size-fits-all, but a few patterns help:

  • Increase layers for busy scenes: crowds, cluttered rooms, complex posters, many objects.
  • Prefer fewer layers for clean assets: studio product photos, single-subject portraits.
  • Use more inference steps for stability: especially if edges look unstable or transparency boundaries are noisy.
  • Keep the prompt global: describe the image as a whole; don’t try to micro-control each layer.

The official release also notes that the released weights are fine-tuned for image → multi-RGBA decomposition. While the model supports text-conditioned inference, text-to-layered generation is more limited than standard text-to-image tasks.

A practical workflow: decompose → edit → re-compose

Layer decomposition becomes most valuable when it sits inside a repeatable workflow:

  1. Decompose the image into RGBA layers.
  2. Edit only the layer you want to change (recolor, move, replace, remove).
  3. Re-compose the layers into a final image.
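Steps 2 and 3 can be sketched with Pillow; the synthetic "sticker" below stands in for a real decomposed layer, so only the helper functions carry over to actual outputs:

```python
from PIL import Image, ImageDraw

def recolor_layer(layer, rgb):
    """Replace a layer's RGB content with a flat color, keeping its alpha mask."""
    alpha = layer.getchannel("A")
    solid = Image.new("RGBA", layer.size, rgb + (255,))
    solid.putalpha(alpha)
    return solid

def recompose(layers):
    """Alpha-composite layers bottom-to-top into one RGBA image."""
    canvas = Image.new("RGBA", layers[0].size, (0, 0, 0, 0))
    for layer in layers:
        canvas = Image.alpha_composite(canvas, layer)
    return canvas

# Synthetic stand-ins: a white background and a red circular "sticker" layer
background = Image.new("RGBA", (64, 64), (255, 255, 255, 255))
sticker = Image.new("RGBA", (64, 64), (0, 0, 0, 0))
ImageDraw.Draw(sticker).ellipse((16, 16, 48, 48), fill=(255, 0, 0, 255))

# Step 2: edit only the sticker layer; step 3: recompose
sticker = recolor_layer(sticker, (0, 128, 255))
final = recompose([background, sticker])
```

Note that the background pixels are untouched by construction: the edit's blast radius is exactly the sticker's alpha mask.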

This approach is powerful because each layer becomes a portable design asset:

  • You can export a ZIP of layers and hand them to a designer.
  • You can build variants (different products, different text, different colors) without starting from scratch.
  • You can “freeze” the parts that must stay consistent and iterate on only what changes.
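Packaging the stack as a ZIP of PNGs needs only Pillow and the standard library; this sketch uses illustrative file names and toy layers:

```python
import io
import zipfile
from PIL import Image

def export_layers_zip(layers, zip_path):
    """Write each RGBA layer as a numbered PNG inside a ZIP for hand-off."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for i, layer in enumerate(layers):
            buf = io.BytesIO()
            layer.save(buf, format="PNG")  # PNG preserves the alpha channel
            zf.writestr(f"layer_{i:02d}.png", buf.getvalue())

# Two toy layers standing in for a real decomposition
stack = [Image.new("RGBA", (32, 32), (255, 0, 0, 255)),
         Image.new("RGBA", (32, 32), (0, 0, 255, 128))]
export_layers_zip(stack, "layers.zip")
```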

The official repository also demonstrates exporting layers into formats used by real workflows (for example, PPTX/PSD), making it easier to integrate layered outputs into existing creative pipelines.

Where it shines (real-world scenarios)

1) Product and brand creatives

When you’re iterating on ads, the first draft is easy—revision #12 is hard. Layered decomposition helps you keep:

  • Background style and lighting
  • Composition and perspective
  • Decorative elements

while still changing the product variant, a sticker, or a logo with much higher consistency.

2) Posters and typography-heavy images

Text is fragile in image editing. If the decomposition isolates a text block into one or two layers, you can:

  • Recolor it (brand palette changes)
  • Replace it (new slogan, new language)
  • Move/scale it (layout iteration)

Even when text isn’t perfectly isolated, decomposition often reduces how much of the background has to be touched.

3) Character edits with higher consistency

Layer isolation can help preserve identity and context by limiting changes to the appropriate layer. If you want to change clothing color, accessories, or one foreground object, layers often make it easier to keep everything else stable.

Limitations and gotchas (setting expectations)

Layer decomposition is a hard problem. A few things you should expect:

  • Objects can split across layers: complex items may be distributed across multiple layers.
  • Soft boundaries are tricky: hair, fur, smoke, transparent glass, motion blur.
  • Some scenes are inherently entangled: reflections, heavy patterns, extreme lighting.
  • Prompts aren’t per-layer controls: the prompt works best as an overall caption.

The right way to judge the result is simple: does it give you a layer stack that makes your intended edit easier and more consistent than editing a flat raster? If yes, it’s already delivering real value.

Try it on our site

If you want to test the workflow quickly, use our Layered tool:

  • Upload an image
  • Choose the number of layers
  • Generate and download the RGBA stack

Open the Layered tool →
