Sumit Chatterjee

Research · 2024 — present

HDR Image Generation from Diffusion Models

A standard diffusion model outputs an 8-bit tone-mapped image with a peak value of 1.0. A VFX pipeline expects a 16-bit scene-referred linear EXR with peak values that can exceed 100 — sometimes far more — for any pixel containing a sun, a specular highlight, or a practical light source. The gap between the two is not a tone-mapping problem. It is a structural one. This page documents how I closed it on top of Qwen-Image-2512: first by routing scene-referred HDR through LogC4 as an encoding trick on top of the stock model, then by retraining the VAE and the MMDiT to generate directly in linear space.

01 — The problem

Modern diffusion models — Qwen-Image, FLUX, SDXL, the open Stable Diffusion lineage — are trained on display-referred 8-bit imagery. Every photograph in their training set has already been through a camera response curve, a tone map, and an sRGB gamma encode. By the time the model sees a pixel, the information about how bright the sun was has been irreversibly compressed into a code value of 255. The model has no notion that a specular highlight at f/22 contains 600 times more light than a midtone grey card.

This is fine for posters, for illustration, for everything diffusion is currently used for. It is unusable for VFX. A compositor working in Nuke needs:

  • Scene-referred linear values. Not gamma-encoded, not tone-mapped. The pixel value is a measurement of light.
  • A working dynamic range above 1.0. Highlights must continue to carry information into the hundreds, so that exposure adjustments, lens flares, motion blur, and depth-of-field computations behave physically.
  • A 16-bit float container (OpenEXR), not 8-bit integer.

The naïve fix — generate a normal image, then expose-up the highlights in post — does not work. There is nothing there to recover. The model never had access to the unclipped scene.

The interesting question, then, is whether you can teach a model that was trained in display-referred space to produce scene-referred output, without throwing away the billion-image prior it has already learned.

The model does not know how bright the sun is. The training set deleted that information at capture time. The work is figuring out how to give it back without breaking what the model already does well.

02 — Color science: LogC4 as a trojan horse

The first working version of this project did not retrain the model at all. It exploited a property of cinema log-encoding curves that is, in retrospect, obvious: a log curve maps a wide linear range into a narrow code-value range that looks like a normal image. If you can convince the model to generate the log-encoded representation, decoding it gets you HDR for free.

I chose ARRI LogC4 specifically, not LogC3 or S-Log3 or any of the other cinema curves. LogC4 was introduced for the ALEXA 35 and is designed around enormous highlight headroom — a LogC4 code value of 1.0 decodes to a scene-referred linear value of roughly 470. This is precisely the range a generative model needs to be taught to "see" if the goal is real cinema-grade highlight handling.

The encoding itself is a piecewise function. Above a small linear toe near black, it is a scaled and offset base-2 logarithm of the linear input, with constants drawn from the ARRI specification. Below the toe, it falls back to a linear ramp so that the function remains continuous and well-behaved through the noise floor. The decode is the algebraic inverse. What matters for the diffusion pipeline is that the function is monotonic, smooth, and reversible — and that it concentrates information about high-luminance pixels in the upper half of the 0–1 output range, where a model trained on standard imagery has the most representational capacity to spend.
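For reference, a minimal NumPy sketch of the curve pair. The constants are transcribed from the ARRI LogC4 specification as it appears in open-source colour libraries; treat them as an assumption and verify against the official document before relying on them. Decoding a code value of 1.0 with these constants lands near 470 linear, the headroom figure quoted above.

```python
import numpy as np

# LogC4 constants (transcribed from the public ARRI specification; verify
# against the official document before production use).
A = (2.0**18 - 16.0) / 117.45
B = (1023.0 - 95.0) / 1023.0
C = 95.0 / 1023.0
S = (7.0 * np.log(2.0) * 2.0 ** (7.0 - 14.0 * C / B)) / (A * B)
T = (2.0 ** (14.0 * (-C / B) + 6.0) - 64.0) / A  # toe threshold (slightly below 0)

def logc4_encode(x):
    """Scene-referred linear -> LogC4 code values (0-1 for in-range input)."""
    x = np.asarray(x, dtype=np.float64)
    log_part = (np.log2(A * np.maximum(x, T) + 64.0) - 6.0) / 14.0 * B + C
    lin_part = (x - T) / S  # linear ramp through the noise floor
    return np.where(x >= T, log_part, lin_part)

def logc4_decode(y):
    """LogC4 code values -> scene-referred linear; algebraic inverse of encode."""
    y = np.asarray(y, dtype=np.float64)
    log_part = (2.0 ** (14.0 * (y - C) / B + 6.0) - 64.0) / A
    lin_part = y * S + T
    return np.where(y >= 0.0, log_part, lin_part)
```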

The pipeline becomes a three-step pre- and post-processing wrapper around an otherwise stock model. Take the HDR EXR dataset and pass every frame through the LogC4 encode; the result, to a human and to a diffusion model, looks like a flat, slightly low-contrast normal image. Train a LoRA on Qwen-Image-2512 against this encoded dataset — the model learns what it perceives as a new visual style, without any awareness that the style is in fact a cinema log curve. At inference time, generate normally, normalize the output to the 0–1 range, and apply the LogC4 decode to recover scene-referred linear values. Write to OpenEXR.
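At inference time the wrapper is a few lines. In the sketch below, generate stands in for whatever LoRA-patched Qwen-Image pipeline produces the flat-looking output, logc4_decode is the function from the previous sketch, and the EXR writer uses the standard OpenEXR Python bindings; the names are illustrative, not the project's actual API.

```python
import numpy as np
import OpenEXR
import Imath

def write_linear_exr(path, rgb):
    """Write an HxWx3 float array as a 16-bit (half-float) scene-linear EXR."""
    height, width, _ = rgb.shape
    header = OpenEXR.Header(width, height)
    half = Imath.Channel(Imath.PixelType(Imath.PixelType.HALF))
    header["channels"] = {c: half for c in "RGB"}
    exr = OpenEXR.OutputFile(path, header)
    exr.writePixels({c: rgb[..., i].astype(np.float16).tobytes()
                     for i, c in enumerate("RGB")})
    exr.close()

# Hypothetical pipeline call: returns an HxWx3 float array in [0, 1] that
# looks flat because the LoRA has learned to emit LogC4-encoded content.
code_values = generate("low sun over wet asphalt, anamorphic flare")
linear = logc4_decode(np.clip(code_values, 0.0, 1.0))  # back to scene-linear
write_linear_exr("out.exr", linear.astype(np.float32))
```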

The result of this first pass surprised me. Peak values landed in the 200–300 range on cleanly-generated highlights — not the 500-plus I would later achieve with native linear training, but enough to drop straight into a VFX comp and behave correctly under exposure adjustment. The first viable output was a synthetic sunset: the sun in the log-decoded EXR was still clipping the Nuke viewer with the exposure pulled down four stops. That was the moment the project stopped being an experiment.

Before · Stock Qwen-Image (8-bit, peak 1.0)
After · LogC4 LoRA (decoded to linear, peak ~200–300)

Why log-space is not the final answer

LogC4 is a useful trojan horse but it has structural problems as a target representation for a generative model:

  • The model denoises in log space. Diffusion noise schedules are calibrated for perceptually-uniform 8-bit imagery. In log space, what looks like uniform Gaussian noise is, post-decode, non-uniform noise that is heavily biased toward highlights. Subtle differences in log values become large differences in linear values once you decode. (The sketch after this list makes the asymmetry concrete.)
  • Quantization compounds. The decode amplifies any quantization or smoothing the model applied to bright regions. You see this as faint banding around suns and bright lights — the model's natural tendency to smooth uncertain regions becomes posterization once exposed back to linear.
  • The VAE wasn't trained for this. Qwen-Image's VAE compresses 0–1 RGB into a 16-channel latent at 8×8 spatial reduction, with structural assumptions about gamma-encoded display-referred content. Asking it to encode and decode log-curve content is asking it to operate slightly outside its training distribution. It mostly works. It does not work cleanly.
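To make the first point concrete, a toy computation using the logc4_decode sketch from section 02: the same small perturbation in log space decodes to a vastly larger linear error in a highlight than in a midtone.

```python
# Equal +/-0.01 perturbations in LogC4 space, compared after decode.
for y in (0.40, 0.90):  # a midtone code value vs. a highlight code value
    spread = logc4_decode(y + 0.01) - logc4_decode(y - 0.01)
    print(f"code {y:.2f}: linear {float(logc4_decode(y)):8.2f}, "
          f"+/-0.01 in log decodes to a spread of {float(spread):7.2f}")
# The highlight's spread is roughly two orders of magnitude larger:
# uniform noise in log space is highlight-biased noise in linear space.
```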

The log approach is the right move when you have a single GPU and a weekend. It is not the right move when the goal is production-grade output that will sit in an ILM comp without a colorist re-grading the highlights.

03 — Phase two: native linear generation

The second version of this project removes the encode/decode roundtrip entirely. The model generates in scene-referred linear space natively, end-to-end. There is no LogC4 involved at inference time. The peak values come from the model itself.

Doing this required two architectural changes to Qwen-Image: one in the VAE that defines the latent space, and one in the MMDiT that learns to denoise inside it.

Refitting the VAE

The vanilla Qwen-Image VAE is a 16-channel autoencoder — a 28-layer encoder and an 11-layer decoder, both operating at an 8×8 spatial reduction — trained on display-referred RGB in the 0–1 range. Feeding it linear values above 1.0 produces unpredictable behavior: sometimes saturation, sometimes garbage, never reliable reconstruction. The 16-channel latent has more representational capacity than older four-channel autoencoders, but capacity is not the same as suitability — the channels are still spent on a 0–1 prior.

The fix was a full VAE fine-tune on HDR EXR data with a custom normalization layer prepended to the encoder and inverted at the decoder output. The normalization is not a fixed function; it is a learned per-channel scale and offset that maps the empirical distribution of scene-referred linear values from the training set into the dynamic range the latent expects. The relevant detail: I doubled the latent channel count from 16 to 32, allowing the latent to carry dual representations of each pixel — one for the standard tone-mapped variant, one for the linear scene-referred variant. The decoder learns to reconstruct both simultaneously from a shared 32-channel code, sharing structural information across the two heads.
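A minimal PyTorch sketch of the two ideas, with module names of my own invention; the page specifies a learned per-channel scale and offset plus a shared 32-channel latent decoded through two reconstruction heads, and everything else here is illustrative.

```python
import torch
import torch.nn as nn

class LearnedRangeNorm(nn.Module):
    """Learned per-channel affine map prepended to the encoder (and inverted
    at the decoder output) that squeezes the empirical scene-linear range of
    the training set into the range the pretrained VAE expects."""
    def __init__(self, channels=3):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(1, channels, 1, 1))
        self.offset = nn.Parameter(torch.zeros(1, channels, 1, 1))

    def forward(self, x):
        return self.scale * x + self.offset

    def inverse(self, y):
        return (y - self.offset) / self.scale

class DualHeadDecoder(nn.Module):
    """Shared trunk over the 32-channel latent with two 1x1 output heads:
    a display-referred tonemap and a scene-linear reconstruction. Both heads
    read the same features, so the two outputs are aligned by construction."""
    def __init__(self, trunk, feat_channels):
        super().__init__()
        self.trunk = trunk
        self.to_tonemap = nn.Conv2d(feat_channels, 3, kernel_size=1)
        self.to_linear = nn.Conv2d(feat_channels, 3, kernel_size=1)

    def forward(self, z32):
        feats = self.trunk(z32)
        tonemap = torch.sigmoid(self.to_tonemap(feats))  # bounded 0-1
        linear = self.to_linear(feats)                   # unbounded, HDR
        return tonemap, linear
```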

This dual-head approach turned out to be important for an unrelated reason. Pixel-shift between tonemap and linear was a quiet bug in the LogC4 pipeline — every pass through the VAE introduced sub-pixel drift between the two representations that became visible in tight comp work. By reconstructing both representations from a single forward pass on a shared latent, the two outputs are pixel-perfect aligned by construction. This solved a problem I had not originally been trying to solve. It is the second time on this project that the right architectural choice quietly fixed a downstream issue I had been working around.

Refitting the MMDiT

Qwen-Image's denoiser is a Multimodal Diffusion Transformer — a 60-layer, ~20B-parameter MMDiT trained with rectified flow matching and a logit-normal timestep schedule, conditioned on a 7B Qwen2.5-VL multimodal LLM as text encoder. Once the VAE was outputting a latent that legitimately represented HDR linear data, the MMDiT had to be taught to denoise into the new 32-channel distribution rather than the original 16-channel one.

I trained a LoRA on the MMDiT using rectified flow matching, consistent with Qwen-Image's pretraining objective and timestep distribution. The adapter was attached to all attention projections — Q/K/V on both the image and text streams of the MMDiT cross-modal blocks — at rank 64. Loss was computed in the new latent space, with an additional weighting term applied to high-luminance regions so that errors in the brightest parts of the image were not drowned out by the much larger area of midtone content — a known pathology of standard diffusion losses on HDR data.
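A sketch of the objective under common rectified-flow conventions (x_t = (1 − t)·x₀ + t·ε, velocity target ε − x₀); the exact parameterization Qwen-Image uses may differ, and the luminance weighting below is an illustrative choice, since the page does not specify the formula.

```python
import torch
import torch.nn.functional as F

def luminance_weights(linear_rgb, latent_hw, alpha=4.0):
    """Per-pixel weights from scene-linear luminance, downsampled to latent
    resolution. The 1 + alpha * normalized-log-luminance form is illustrative."""
    lum = torch.log1p(linear_rgb.mean(dim=1, keepdim=True).clamp(min=0.0))
    weights = 1.0 + alpha * lum / lum.amax().clamp(min=1e-6)
    return F.interpolate(weights, size=latent_hw, mode="area")

def highlight_weighted_rf_loss(model, z0, cond, weights):
    """Rectified-flow velocity loss over 32-channel latents, with errors in
    high-luminance regions up-weighted so suns and speculars are not drowned
    out by the much larger midtone area."""
    t = torch.sigmoid(torch.randn(z0.shape[0], device=z0.device))  # logit-normal
    t_ = t.view(-1, 1, 1, 1)
    eps = torch.randn_like(z0)
    x_t = (1.0 - t_) * z0 + t_ * eps
    v_target = eps - z0
    v_pred = model(x_t, t, cond)
    per_pixel = (v_pred - v_target).pow(2).mean(dim=1, keepdim=True)
    return (weights * per_pixel).mean()
```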

The first projection in the MMDiT, which lifts the 16-channel VAE latent into the transformer's hidden dimension, also had to be widened to accept the new 32-channel input. The weights for the new input channels were initialized as a copy of the corresponding original 16-channel weights, so that fine-tuning starts from a sensible prior rather than from random initialization.
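The widening itself is mechanical. A sketch, assuming the latent is patchified with the channel dimension as the leading factor of the projection's input features (the real layout may differ):

```python
import torch
import torch.nn as nn

@torch.no_grad()
def widen_input_proj(old, old_ch=16, new_ch=32):
    """Widen the patch-embedding Linear from a 16- to a 32-channel latent,
    copying the original channels' weights into the new channels so that
    fine-tuning starts from the pretrained prior rather than from random."""
    pixels_per_patch = old.in_features // old_ch
    new = nn.Linear(new_ch * pixels_per_patch, old.out_features,
                    bias=old.bias is not None)
    old_w = old.weight.view(old.out_features, old_ch, pixels_per_patch)
    new_w = new.weight.view(old.out_features, new_ch, pixels_per_patch)
    new_w[:, :old_ch].copy_(old_w)   # original 16 channels: exact copy
    new_w[:, old_ch:].copy_(old_w)   # new 16 channels: copied initialization
    if old.bias is not None:
        new.bias.copy_(old.bias)
    return new
```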

Both stages — the VAE fine-tune and the MMDiT LoRA — were trained locally on a single RTX 5090 in my own workstation. End-to-end wall-clock time was approximately twenty-six hours. No cloud compute was used at any point.

04 — Results

A single text prompt now produces a scene-referred linear 16-bit OpenEXR with peak pixel values exceeding 500 on naturally-bright content — suns, specular highlights, practical lights. The output drops directly into a Foundry Nuke comp as a production-usable plate. No tone-mapping is required. No re-exposure. No separate highlight-recovery step.

Eight outputs from the linear-space model, all rendered at four exposure stops in Nuke. Highlights continue to carry information past the display range.

Concretely, against the LogC4 baseline:

Metric                            | LogC4 LoRA        | Native linear
Peak linear value (typical)       | ~200–300          | >500
Highlight banding (perceptual)    | Visible at -2 EV  | None to -4 EV
Tonemap/linear pixel alignment    | Sub-pixel drift   | Pixel-perfect
Inference time (1024², RTX 5090)  | ~9s               | ~11s
Container                         | 16-bit linear EXR | 16-bit linear EXR

The two-second inference penalty for the linear-space model is the cost of the heavier VAE. It has not been a problem in practice.

05 — Built tooling

The model alone is not enough — a compositor needs it inside Nuke or ComfyUI without scripting. Three pieces of tooling close that loop. A custom inference pipeline handles the HDR-fitted VAE and writes correctly-formatted linear EXR with proper OCIO metadata. A set of ComfyUI nodes wraps the pipeline with dedicated components for HDR VAE loading, OCIO-aware preview, and EXR write with bit-depth and channel-naming controls. An MCP server exposes the whole stack to Claude Code, Cursor, and Kilo, so an AI coding agent can request HDR images programmatically as part of a larger automated VFX workflow.
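As one example of the glue, here is roughly what an EXR-write node looks like under ComfyUI's custom-node conventions. This is a hypothetical minimal version, not the released node; write_linear_exr is the helper sketched in section 02, and the real nodes add bit-depth, channel-naming, and OCIO controls.

```python
import numpy as np

class SaveLinearEXR:
    """Minimal ComfyUI output node: write an IMAGE tensor as scene-linear EXR."""

    @classmethod
    def INPUT_TYPES(cls):
        return {"required": {
            "image": ("IMAGE",),  # ComfyUI IMAGE: [B, H, W, C] float tensor
            "path": ("STRING", {"default": "output.exr"}),
        }}

    RETURN_TYPES = ()
    OUTPUT_NODE = True
    FUNCTION = "save"
    CATEGORY = "image/hdr"

    def save(self, image, path):
        # Deliberately no 0-1 clamp: the point is preserving values above
        # display white all the way to disk.
        write_linear_exr(path, image[0].cpu().numpy().astype(np.float32))
        return ()

NODE_CLASS_MAPPINGS = {"SaveLinearEXR": SaveLinearEXR}
```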

The Nuke-style nodes for ComfyUI are documented separately under Foundry Nuke Nodes for ComfyUI.

06 — Reflection

The training set is small. ~9,000 EXR files is two orders of magnitude below what foundation-model-scale fine-tuning would prefer, and the model occasionally betrays this — synthetic content involving uncommon lighting setups (interior practicals, low-key dramatic key lights) is less reliable than outdoor daylight scenarios. The next phase is dataset expansion, ideally with curated cinema-grade HDR captures rather than synthesized renders.

The VAE is also still imperfect. Very high-frequency detail in the brightest regions — the inside of a sun, the filament of a practical bulb — is reconstructed with slightly less fidelity than midtone content, which I attribute to the latent allocating capacity by perceptual rather than physical importance. There is a research thread here about training a VAE with a luminance-aware bottleneck that I expect to pursue.

What surprised me most was how much of the work was color science rather than machine learning. The architectural changes to Qwen-Image were straightforward once the target representation was correct. Most of the meaningful decisions — LogC4 versus alternatives, the dual-head VAE, the widened input projection, the highlight-weighted loss, the OCIO metadata in the output — came from the compositing side of my background, not the ML side. The model, in the end, is a function approximator. The work is in deciding what function you want it to approximate.

The work continues. The open-source release is staged: the LogC4 LoRA and ComfyUI workflow are public; the linear-space VAE and MMDiT LoRA will follow once the dataset attribution is finalized.



Next case study

Foundry Nuke Nodes for ComfyUI

Bringing professional compositing nodes into the AI image-generation graph.