Enlightening Photographic Style Transfer with a Self-Supervised Photographic Embedding

Photographic style — the nuanced play of lightness, color, and tone a photographer crafts — is easy for the eye to read, yet invisible to most image embeddings. We present PETAL (Photographic Embedding for Transfer with an Adaptive LUT): we learn a continuous photographic embedding by self-supervision, and use it to drive a lightweight adaptive neural LUT that transfers style faithfully, with no test-time optimization.

The intuition

A photograph's style lives between the words we have for it.

Ask a photographer what makes an image theirs and they will point to exposure, contrast, the warmth of the light, a particular tint in the shadows. These are continuous and perceptual. But the embeddings we train models with are built for semantics — "a dog", "a beach" — and supervised by discrete text. They simply do not have an axis for "slightly dimmer" or "a touch more magenta in the highlights."

Style is continuous

A change in exposure can be far subtler than "dim" → "slightly dim". Discrete text labels cannot reflect the small but perceptible shifts that define a style.

Semantics ≠ style

CLIP [1], DINO [2] and friends excel at "what is in the picture", but they cluster by content, not by how it was lit and graded. They are blind to the photographer's hand.

Prior transfer over/under-shoots

Color-only and LUT methods stay clean but miss complex looks; deep-feature methods distort texture. The gap is a style-aware feature.

Teaser: reference, content and results comparison

The task. Transfer the photographic style of a reference (left) onto a content image while preserving its structure and texture. With its photographic embedding, PETAL robustly renders diverse looks — high-contrast monochrome, low-key portraits, saturated skies — where baselines produce unrealistic results or miss the style entirely.

The approach

First learn a photographic embedding. Then condition the transfer on it.

Stage 1

A ViT that reads tone and color in Lab

The image is converted to CIE-Lab to decouple luminance from chromaticity. A ViT combines a [CLS] token's global descriptor with pooled patch features, while luminance and chromaticity histograms are injected through cross-attention to supply global tonal and color statistics.
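
To make the Lab step concrete, here is a minimal sketch of converting sRGB pixels to CIE-Lab and building a normalized luminance histogram. It assumes sRGB input in [0, 1] and a D65 white point, which the text above does not specify; the function names and the bin count are hypothetical, and the chromaticity histogram (over a and b) would be built analogously.

```python
def srgb_to_lab(r, g, b):
    """Convert one sRGB pixel (components in [0, 1]) to CIE-Lab.

    Assumption: D65 white point and the standard sRGB transfer curve.
    """
    def to_linear(c):
        # Undo the sRGB gamma encoding.
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

    rl, gl, bl = to_linear(r), to_linear(g), to_linear(b)

    # Linear sRGB -> CIE XYZ (D65).
    x = 0.4124564 * rl + 0.3575761 * gl + 0.1804375 * bl
    y = 0.2126729 * rl + 0.7151522 * gl + 0.0721750 * bl
    z = 0.0193339 * rl + 0.1191920 * gl + 0.9503041 * bl

    # D65 reference white.
    xn, yn, zn = 0.95047, 1.0, 1.08883

    def f(t):
        d = 6.0 / 29.0
        return t ** (1.0 / 3.0) if t > d ** 3 else t / (3.0 * d * d) + 4.0 / 29.0

    fx, fy, fz = f(x / xn), f(y / yn), f(z / zn)
    return 116.0 * fy - 16.0, 500.0 * (fx - fy), 200.0 * (fy - fz)


def luminance_histogram(pixels, bins=8):
    """Normalized histogram of L over an iterable of sRGB pixels.

    `bins` is a hypothetical choice for illustration; L is taken to span [0, 100].
    """
    counts = [0] * bins
    total = 0
    for r, g, b in pixels:
        lum, _, _ = srgb_to_lab(r, g, b)
        idx = min(max(int(lum / 100.0 * bins), 0), bins - 1)
        counts[idx] += 1
        total += 1
    return [c / total for c in counts]
```

A histogram like this summarizes global tone independently of where pixels sit in the frame, which is the kind of statistic the cross-attention branch is described as supplying.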

It is trained without labels: overlapping local views form positives, and Photographic Style Augmentation applies opposite, differentiable edits of the same patch to mint hard negatives. Intra- and inter-sample losses with stop-gradient, plus a histogram-reconstruction term, shape the space without collapse.
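
As an illustration of the "opposite edits" idea, the sketch below produces two contrary views of one patch by pushing exposure up and down by the same amount. This is only an assumed stand-in: the text does not list the actual edits used by Photographic Style Augmentation, and the function names, the exposure-in-stops parameterization, and the clipping are all hypothetical.

```python
def adjust_exposure(patch, stops):
    """Scale linear-light values by 2**stops, clipping to [0, 1].

    `patch` is a 2D list of linear luminance values (an assumption for
    illustration; the real augmentation operates on full images).
    """
    gain = 2.0 ** stops
    return [[min(1.0, v * gain) for v in row] for row in patch]


def contrary_views(patch, stops=0.5):
    """Return (brighter, darker) views of the same patch.

    Both views share content but differ in style, so in a contrastive
    setup they can act as hard negatives for each other.
    """
    return adjust_exposure(patch, +stops), adjust_exposure(patch, -stops)
```

The two views differ only in the edited attribute, so a content-driven embedding would place them together while a style-aware one has to pull them apart.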

Training pipeline. Each crop is augmented into two photographically contrary views (a). A ViT in CIE-Lab space, guided by luminance/chromaticity histograms through cross-attention (b–c), extracts the embedding; intra-sample, inter-sample and reconstruction losses shape the space without labels (d).

t-SNE of the photographic embedding vs baselines

Does it actually learn style? A t-SNE check on FiveK-Concept, a probe set we build from the MIT-Adobe FiveK [3] photographs. We take each scene and re-edit it along a single photographic concept direction (green/magenta tint, warm/cold temperature, exposure, contrast, B&W), so every variant of a scene shares identical content and differs only in one style axis. If an embedding captures style, the same edit applied across scenes should land together, with each concept forming its own cluster (one marker shape per concept). Only PETAL's embedding (a) separates them this way; the semantic backbones (b–j) cluster by scene content instead and collapse the concepts together.

Stage 2

An embedding-conditioned neural LUT

With the embedding network M frozen, two MLPs map the reference−content embedding difference into an affine shift on the content's per-pixel mean and variance. This is an AdaIN [4]-like re-normalization applied by a 1×1-conv encoder/decoder, so spatial mixing is avoided and local texture is preserved exactly.

Because the output at each pixel depends only on that pixel's color (in Lab) and position, the transfer network is a 5D (Lab+xy) neural LUT: fast, texture-preserving, and reducing to near-identity when the reference and content already share a style. No test-time optimization.

Transfer network. With the embedding network M frozen, two MLPs map the reference–content embedding difference into an affine shift on the content's per-pixel statistics. The 1×1 convolutions avoid spatial mixing, so local texture is preserved exactly.

μ̃ = μ(x_c) + g_μ(M(I_s)) − g_μ(M(I_c))
Î = D( μ̃ + Norm(x_c) ⊙ σ̃ )

Reduces to identity when style(I_s) ≈ style(I_c).
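
The identity property can be checked on a toy version of the re-normalization. The sketch below is not the trained network: `g_mu` and `g_sigma` are hypothetical linear stand-ins for the two MLPs, the encoder and decoder D are taken as identity, statistics are computed over a single feature vector, and the form of σ̃ is assumed to mirror the μ̃ equation because it is not spelled out above.

```python
import math


def mean_std(x, eps=1e-8):
    """Mean and (eps-stabilized) standard deviation of a feature vector."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return mu, math.sqrt(var + eps)


def g_mu(e):
    # Hypothetical stand-in for the mean-shift MLP.
    return 0.5 * sum(e)


def g_sigma(e):
    # Hypothetical stand-in for the scale-shift MLP.
    return 0.1 * sum(abs(v) for v in e)


def restyle(x_c, e_s, e_c):
    """Apply the shifted statistics to normalized content features.

    x_c: content features; e_s, e_c: reference and content embeddings.
    """
    mu, sigma = mean_std(x_c)
    mu_t = mu + g_mu(e_s) - g_mu(e_c)
    # Assumption: sigma-tilde mirrors the mu-tilde equation.
    sigma_t = sigma + g_sigma(e_s) - g_sigma(e_c)
    return [mu_t + (v - mu) / sigma * sigma_t for v in x_c]
```

When `e_s` equals `e_c` the two shifts cancel, so the features come back unchanged; any difference between the embeddings moves only the mean and scale, never the relative ordering of the normalized values.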

Inside the embedding

A space you can move through.

Because the photographic style lives in a continuous embedding, we can interpolate between two references: the output style glides smoothly from one to the other while the content stays fixed. Re-applying the same reference over many rounds barely changes the image, confirming the transfer is near-identity for matching styles.

Style interpolation. Each step linearly blends the conditioning embedding from Reference A toward Reference B; the photographic style of the output follows continuously, while the content is preserved throughout.
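
The linear blend described in this caption amounts to the following. This is a minimal sketch in which embeddings are plain lists of floats and `lerp_embedding` is a hypothetical helper name; the blended vector would then be used as the conditioning embedding in place of a single reference.

```python
def lerp_embedding(e_a, e_b, t):
    """Linearly blend two embeddings: t=0 gives e_a, t=1 gives e_b."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    if len(e_a) != len(e_b):
        raise ValueError("embeddings must have the same length")
    return [(1.0 - t) * a + t * b for a, b in zip(e_a, e_b)]


# Five evenly spaced conditioning embeddings between two toy references.
steps = [lerp_embedding([0.0, 2.0], [4.0, 0.0], i / 4) for i in range(5)]
```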

Sequential editing. Feeding each output back as the next input, the content and target style stay stable across rounds — re-applying the same reference yields only minimal change.

Benchmarks

State-of-the-art on retrieval and transfer.

PETAL's embedding wins photographic-style retrieval against general, style and fine-tuned baselines; its transfer wins reference-based fidelity and human preference — at the second-fastest runtime.

Table 1 · Photographic-style retrieval (%, higher is better)

Recall@1, mAP and F1@5 across three benchmarks. Bold blue marks the best in each column, underline the second best. ∗ = fine-tuned on our data with our objectives.
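
For readers unfamiliar with the headline metric, here is a minimal sketch of Recall@1 for style retrieval. It assumes a leave-one-out protocol with cosine similarity, where each embedding queries all the others and scores a hit if its nearest neighbor carries the same style label; the benchmarks' actual query/gallery splits may differ, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def recall_at_1(embeddings, labels):
    """Percentage of queries whose nearest other item shares their label."""
    hits = 0
    n = len(embeddings)
    for i in range(n):
        # Nearest neighbor among all items except the query itself.
        best = max(
            (j for j in range(n) if j != i),
            key=lambda j: cosine(embeddings[i], embeddings[j]),
        )
        hits += labels[best] == labels[i]
    return 100.0 * hits / n
```

An embedding that clusters by style scores high under this measure, while one that clusters by scene content scores low when the same scene appears under several styles.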

Can we re-use the current image embedding network design? No. Even strong backbones (DINOv3 [7], CLIP [1]) fine-tuned on the very same data and objectives still trail PETAL's R@1 by ~15 points on PPR10K [18]. This is evidence that capturing photographic style needs the dedicated Lab-histogram design, not just more data.

Table 2 · Photographic style transfer & user study

Reference-based metrics on PPR10K [18] and PST50 [19], plus a 34-participant user study. ↑ higher is better, ↓ lower is better. Bold blue = best, underline = second best.

User study: (i) content preservation · (ii) style consistency · (iii) overall visual quality, scored 1–5. PETAL ranks first on style consistency and content preservation.