Photographic style — the nuanced play of lightness, color, and tone a photographer crafts — is easy for the eye to read, yet invisible to most image embeddings. We present PETAL (Photographic Embedding for Transfer with an Adaptive LUT): we learn a continuous photographic embedding by self-supervision, and use it to drive a lightweight adaptive neural LUT that transfers style faithfully, with no test-time optimization.
Ask a photographer what makes an image theirs and they will point to exposure, contrast, the warmth of the light, a particular tint in the shadows. These are continuous and perceptual. But the embeddings we train models with are built for semantics — "a dog", "a beach" — and supervised by discrete text. They simply do not have an axis for "slightly dimmer" or "a touch more magenta in the highlights."
A change in exposure can be far subtler than "dim" → "slightly dim". Discrete text labels cannot reflect the small but perceptible shifts that define a style.
CLIP¹, DINO² and friends excel at "what is in the picture", but cluster by content, not by how it was lit and graded. They are blind to the photographer's hand.
Color-only and LUT methods stay clean but miss complex looks; deep-feature methods distort texture. The gap is a style-aware feature.
The image is converted to CIE-Lab to decouple luminance from chromaticity. A ViT combines a [CLS] token's global descriptor with pooled patch features, while luminance and chromaticity histograms are injected through cross-attention to supply global tonal and color statistics.
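As a rough illustration of the Lab decoupling and the luminance histogram described above, here is a minimal standard-library sketch. It assumes sRGB input in [0, 1] and a D65 white point; the function names `srgb_to_lab` and `luminance_histogram`, the bin count, and the hard binning are illustrative choices, not details taken from PETAL. A chromaticity histogram could be built the same way over the (a, b) plane.

```python
def _linearize(c):
    """Undo the sRGB transfer curve for one channel value in [0, 1]."""
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def _f(t):
    """CIE Lab companding function."""
    delta = 6.0 / 29.0
    if t > delta ** 3:
        return t ** (1.0 / 3.0)
    return t / (3.0 * delta ** 2) + 4.0 / 29.0


def srgb_to_lab(r, g, b):
    """Convert one sRGB pixel (values in [0, 1]) to CIE Lab, assuming D65."""
    r, g, b = _linearize(r), _linearize(g), _linearize(b)
    x = 0.4124564 * r + 0.3575761 * g + 0.1804375 * b
    y = 0.2126729 * r + 0.7151522 * g + 0.0721750 * b
    z = 0.0193339 * r + 0.1191920 * g + 0.9503041 * b
    fx, fy, fz = _f(x / 0.95047), _f(y / 1.0), _f(z / 1.08883)
    lightness = 116.0 * fy - 16.0   # luminance axis
    a = 500.0 * (fx - fy)           # green-red axis
    b_axis = 200.0 * (fy - fz)      # blue-yellow axis
    return lightness, a, b_axis


def luminance_histogram(pixels, bins=8):
    """Normalized histogram of L over a non-empty list of sRGB pixels."""
    counts = [0] * bins
    for r, g, b in pixels:
        lightness, _, _ = srgb_to_lab(r, g, b)
        idx = min(bins - 1, max(0, int(lightness / 100.0 * bins)))
        counts[idx] += 1
    return [c / len(pixels) for c in counts]
```

Hard binning like this is not differentiable; a trainable pipeline would more plausibly use a soft (kernel-weighted) histogram, which is left out here for brevity.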
It is trained without labels: overlapping local views form positives, and Photographic Style Augmentation applies opposite, differentiable edits of the same patch to mint hard negatives. Intra- and inter-sample losses with stop-gradient, plus a histogram-reconstruction term, shape the space without collapse.
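The idea of applying opposite edits to the same patch can be sketched with a single edit family. The example below assumes an exposure change measured in stops and applied in linear light; the actual set of edits in Photographic Style Augmentation is not listed here, and `adjust_exposure` and `opposite_views` are hypothetical names.

```python
def srgb_to_linear(c):
    """sRGB channel value in [0, 1] to linear light."""
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def linear_to_srgb(c):
    """Linear-light channel value in [0, 1] back to sRGB."""
    return 12.92 * c if c <= 0.0031308 else 1.055 * c ** (1.0 / 2.4) - 0.055


def adjust_exposure(patch, stops):
    """Scale linear-light intensity by 2**stops, clipping at white.

    The gain is smooth in `stops`; only the clip at 1.0 is not.
    """
    gain = 2.0 ** stops
    edited = []
    for rgb in patch:
        edited.append(tuple(
            linear_to_srgb(min(1.0, srgb_to_linear(c) * gain)) for c in rgb
        ))
    return edited


def opposite_views(patch, stops):
    """Return a brighter and a darker view of the same patch.

    Both views share content exactly and differ only in the edit, so an
    embedding that separates them must encode the edit, not the content.
    """
    return adjust_exposure(patch, +stops), adjust_exposure(patch, -stops)


patch = [(0.2, 0.3, 0.4), (0.5, 0.5, 0.5)]
brighter, darker = opposite_views(patch, stops=0.5)
```

Because the two views depict identical content, they work as hard negatives for each other: nothing but the photographic edit distinguishes them.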
With the embedding network 𝓜 frozen, two MLPs map the reference−content embedding difference into an affine shift on the content's per-pixel mean and variance. This AdaIN⁴-like re-normalization is applied by a 1×1-conv encoder/decoder, so spatial mixing is avoided and local texture is preserved exactly.
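One way to read the re-normalization above is as a per-pixel operation over feature channels. The sketch below makes that reading explicit and rests on several assumptions: the statistics are taken across the channels of a single pixel's feature vector, the two MLP outputs are stood in for by plain numbers (`d_mean`, `d_log_std`), and the scale is parameterized in log space. None of these choices is confirmed by the text.

```python
import math


def channel_stats(feat):
    """Mean and (population) standard deviation of one pixel's features."""
    mu = sum(feat) / len(feat)
    var = sum((v - mu) ** 2 for v in feat) / len(feat)
    return mu, math.sqrt(var)


def renormalize_pixel(feat, d_mean=0.0, d_log_std=0.0, eps=1e-8):
    """Shift the mean and rescale the spread of one pixel's feature vector.

    d_mean and d_log_std are placeholders for what the two MLPs would
    predict from the reference-minus-content embedding difference.
    """
    mu = sum(feat) / len(feat)
    var = sum((v - mu) ** 2 for v in feat) / len(feat)
    sigma = math.sqrt(var + eps)
    new_mu = mu + d_mean
    new_sigma = sigma * math.exp(d_log_std)
    # Normalize, then re-apply the shifted statistics (AdaIN-style).
    return [(v - mu) / sigma * new_sigma + new_mu for v in feat]


feat = [1.0, 2.0, 4.0]
unchanged = renormalize_pixel(feat)              # zero shift: identity
warmer = renormalize_pixel(feat, d_mean=1.0)     # mean moves by +1
punchier = renormalize_pixel(feat, d_log_std=math.log(2.0))  # spread doubles
```

Under this reading the operation never looks at neighbouring pixels, and a zero shift returns the input unchanged, which is consistent with the texture-preservation and near-identity behaviour described in the text.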
Because the transfer module depends only on pixel color (in Lab) and position, it acts as a 5D (Lab+xy) neural LUT: fast, texture-preserving, and near-identity when the reference and content already share a style. It needs no test-time optimization.
A set of comparisons spanning portraits, landscapes, architecture and interiors. Each scene pairs the original content with the result of every method, and its style reference is shown beside the comparison. The full panel below shows every method side by side.
Because the photographic style lives in a continuous embedding, we can interpolate between two references: the output style glides smoothly from Reference A to Reference B while the content stays fixed. Re-applying the same reference over many rounds barely changes the image, confirming the transfer is near-identity for matching styles.
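Interpolating between two references amounts to blending their embeddings before they drive the transfer. The sketch below assumes plain linear interpolation; the page does not say which scheme is used (a spherical blend would be another option), and `lerp_embedding` is a hypothetical helper.

```python
def lerp_embedding(e_a, e_b, t):
    """Blend two style embeddings; t=0 gives e_a, t=1 gives e_b."""
    if len(e_a) != len(e_b):
        raise ValueError("embeddings must have the same dimension")
    return [(1.0 - t) * a + t * b for a, b in zip(e_a, e_b)]


# Toy 3-D embeddings standing in for Reference A and Reference B.
ref_a = [0.0, 2.0, -1.0]
ref_b = [2.0, 0.0, 1.0]

# A walk from A to B in five steps; each blended embedding would be
# passed to the transfer module in place of a single reference.
walk = [lerp_embedding(ref_a, ref_b, i / 4) for i in range(5)]
```

The content image is never touched during the walk; only the embedding fed to the transfer changes.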
Using references retrieved by the photographic embedding from professional bodies of work, PETAL lends each content image a photographer's signature grade. In each tile, the original content is on the left and the PETAL result is on the right.
PETAL's embedding wins photographic-style retrieval against general, style and fine-tuned baselines; its transfer wins reference-based fidelity and human preference — at the second-fastest runtime.
Recall@1, mAP and F1@5 across three benchmarks. Bold blue marks the best in each column, underline the second best. ∗ = fine-tuned on our data with our objectives.
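For readers unfamiliar with the retrieval metrics, here is a minimal sketch of Recall@1 and mAP over a labeled gallery. It assumes cosine similarity, a leave-one-out protocol in which every image queries all the others, and labels that mark images sharing a style; the benchmarks' actual protocol may differ, and all function names are illustrative. F1@5 is omitted.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_gallery(query_idx, embeddings):
    """Indices of all other items, most similar to the query first."""
    others = [i for i in range(len(embeddings)) if i != query_idx]
    query = embeddings[query_idx]
    return sorted(others, key=lambda i: cosine(query, embeddings[i]),
                  reverse=True)


def recall_at_1(embeddings, labels):
    """Fraction of queries whose nearest neighbour shares their label."""
    hits = 0
    for q in range(len(embeddings)):
        top = rank_gallery(q, embeddings)[0]
        hits += labels[top] == labels[q]
    return hits / len(embeddings)


def mean_average_precision(embeddings, labels):
    """Mean over queries of the average precision at each relevant hit."""
    aps = []
    for q in range(len(embeddings)):
        found, precisions = 0, []
        for rank, i in enumerate(rank_gallery(q, embeddings), start=1):
            if labels[i] == labels[q]:
                found += 1
                precisions.append(found / rank)
        if precisions:  # skip queries with no relevant item in the gallery
            aps.append(sum(precisions) / len(precisions))
    return sum(aps) / len(aps)


embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = ["a", "a", "b", "b"]
r1 = recall_at_1(embeddings, labels)
m_ap = mean_average_precision(embeddings, labels)
```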
Can we reuse an existing image-embedding network design? No. Even strong backbones (DINOv3⁷, CLIP¹) fine-tuned on the very same data and objectives still trail PETAL's R@1 by ~15 points on PPR10K¹⁸, evidence that capturing photographic style needs the dedicated Lab-histogram design, not just more data.
Reference-based metrics on PPR10K¹⁸ and PST50¹⁹, plus a 34-participant user study. ↑ higher is better, ↓ lower is better. Bold blue = best, underline = second best.
User study: (i) content preservation · (ii) style consistency · (iii) overall visual quality, scored 1–5. PETAL ranks first on style consistency and content preservation.
Numbered by first appearance above; superscript markers on method, dataset and baseline names link here.
@inproceedings{zhu2026photographic,
title = {Enlightening Photographic Style Transfer with a Self-Supervised Photographic Embedding},
author = {Zhu, Chengxuan and Fang, Jiacong and Weng, Shuchen and Lyu, Youwei and Tang, Jiajun and Fan, Qingnan and Xu, Chao and Shi, Boxin},
booktitle = {Proceedings of the European Conference on Computer Vision},
year = {2026}
}