Product photography is expensive. A shoot with a photographer, location, and model easily costs a full day rate. So we tested: can you place a real physical product into AI-generated lifestyle scenes that are good enough for social media and concept presentations?
The short answer: yes — with caveats you need to understand.
The long answer follows. After 50+ generated images, 4 testing rounds, and an automated 3-agent pipeline, we have a clear picture of what works and what doesn't.
AI-generated product photo — device in foreground on sideboard, yoga scene in background
Foreground placement with 3 reference images: product, proportions, and scene all land correctly
The Setup
- Model: Flux 2 Dev (FP8 mixed precision) with Mistral 3 Small multimodal text encoder
- Method: Reference latent conditioning via FluxKontextMultiReferenceLatentMethod
- Hardware: RTX 5090 with 32 GB VRAM, ComfyUI with `--lowvram`
- Goal: Place a real product naturally into lifestyle scenes
The Core Technique: Reference Latent Conditioning
The key difference from regular text-to-image: we don't just describe the product in the prompt — we feed actual product photos through the VAE encoder and chain them as latent conditioning. The model then "knows" what the product actually looks like.
LoadImage → FluxKontextImageScale → VAEEncode → ReferenceLatent (chained)
→ FluxKontextMultiReferenceLatentMethod → KSampler
This sounds technical, but the effect is significant: instead of a generic "white tube," the model renders the actual product with correct proportions and details.
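The chain above can be expressed as a ComfyUI API-format workflow fragment. A minimal sketch, assuming the node class names from the chain; the exact input names and output indices are assumptions and should be verified against your ComfyUI build:

```python
# Sketch: building the chained reference-latent conditioning as a
# ComfyUI API-format workflow dict. Node class names come from the
# chain above; input/output wiring is an assumption to verify.

def build_reference_chain(ref_images, prompt_node="6", model_node="4"):
    """Chain one LoadImage -> scale -> VAEEncode -> ReferenceLatent
    group per reference image, each feeding the next's conditioning."""
    workflow = {}
    node_id = 100
    prev_conditioning = [prompt_node, 0]  # start from the text conditioning
    for image in ref_images:
        load_id, scale_id, encode_id, ref_id = (str(node_id + i) for i in range(4))
        workflow[load_id] = {"class_type": "LoadImage",
                             "inputs": {"image": image}}
        workflow[scale_id] = {"class_type": "FluxKontextImageScale",
                              "inputs": {"image": [load_id, 0]}}
        workflow[encode_id] = {"class_type": "VAEEncode",
                               "inputs": {"pixels": [scale_id, 0],
                                          "vae": [model_node, 2]}}
        workflow[ref_id] = {"class_type": "ReferenceLatent",
                            "inputs": {"conditioning": prev_conditioning,
                                       "latent": [encode_id, 0]}}
        prev_conditioning = [ref_id, 0]  # chain: each ref extends the last
        node_id += 4
    return workflow, prev_conditioning

wf, cond = build_reference_chain(["front.png", "side.png", "top.png"])
```

The returned `cond` handle is what you wire into `FluxKontextMultiReferenceLatentMethod` and then the KSampler.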
The Two-Block Prompt Structure
Every successful prompt has two clearly separated blocks:
Block 1 — Product description: Exact dimensions with human-scale comparisons ("40cm long, about the size of a large thermos"), material/color, distinctive features described subtly, orientation constraints, anti-duplicate rules ("Only ONE device"), and anti-artifact rules ("No dust, no haze").
Block 2 — Scene description: Environment, lighting, person/action, a single specific surface for placement, camera settings (lens, aperture, DOF), and a style anchor ("Photorealistic, editorial lifestyle photography").
This structure is non-negotiable. Mixing both blocks produces inconsistent results.
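As a sketch, the two-block structure can be enforced in code so the blocks never bleed into each other. The wording fragments below are illustrative examples, not the exact prompts from our tests:

```python
# Minimal sketch of the two-block prompt structure: product block and
# scene block are kept separate and only joined at the end.

def build_prompt(product_block: str, scene_block: str) -> str:
    """Join the two blocks with a blank line, product first."""
    return "\n\n".join(b.strip() for b in (product_block, scene_block))

product = """40cm long white tube, about the size of a large thermos.
Only ONE device. No dust, no haze."""

scene = """Cozy living room at dusk, warm lamp light.
Device sits ON the nightstand. 50mm lens, f/1.8, shallow depth of field.
Photorealistic, editorial lifestyle photography."""

prompt = build_prompt(product, scene)
```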
What Works Reliably: Product in Foreground
Foreground placement with 3-4 reference images is the sweet spot. The model renders the product prominently with accurate shape, proportions, and details. People can be in the background or to the side.
AI-generated lifestyle product photo — device on side table, warm evening mood with couple on sofa
Another foreground hit: warm light, natural placement, people in background
Optimal foreground settings:
| Parameter | Value |
|---|---|
| Reference images | 3-4 from different angles |
| Lens in prompt | 50mm (most natural proportions) |
| Placement | "Prominently visible in the foreground" |
| Surface | Name ONE specific surface ("sits ON the nightstand") |
| Steps | 16-20 (sweet spot) |
| Sampler | euler, simple scheduler |
| CFG | 1.0 |
| Aperture in prompt | f/1.4 – f/1.8 |
Success rate: roughly 80% usable results. That's extremely good for generative AI.
The Hard Problem: Product in Background
This is where it gets interesting — and frustrating. The product should appear naturally in the background while a person is the main subject. But the entire reference latent conditioning method is designed to reproduce the referenced object. It wants to show the product prominently.
We tested 4 approaches across 36+ images:
| Approach | Method | Result |
|---|---|---|
| A: Standard + "FAR behind" | Product block first, 2 refs, strong distance language | Product still lands in foreground |
| B: Scene-first | Full scene description first, product mentioned casually at end | Best result. Product appears in background ~40% of the time |
| C: Aggressive minimizing | "MAIN SUBJECT" / "BACKGROUND", "5% of frame" | Too aggressive — product either vanishes or becomes an unrecognizable blob |
| D: Casual mention | No product block at all, product as "just another kitchen item" | Mixed — sometimes perfect, sometimes wrong size |
Winner: Scene-first (approach B) with just 1 reference image.
The logic: fewer reference images = less "show this object" pressure. And when the scene is described first, the model prioritizes context.
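The foreground/background split boils down to two knobs: reference count and block order. A small sketch of presets reflecting these findings (the names and structure are my own, not part of our pipeline):

```python
# Sketch: per-placement presets based on the test results above.
# "scene_first" flips the prompt block order (approach B).

PLACEMENT_PRESETS = {
    "foreground": {"num_refs": 4, "scene_first": False,
                   "placement_hint": "prominently visible in the foreground"},
    "background": {"num_refs": 1, "scene_first": True,
                   "placement_hint": "about the size of a large water bottle"},
}

def preset_for(mode: str) -> dict:
    """Look up a placement preset, failing loudly on unknown modes."""
    if mode not in PLACEMENT_PRESETS:
        raise ValueError(f"unknown placement mode: {mode}")
    return PLACEMENT_PRESETS[mode]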
AI-generated product photo — device subtly in background on shelf, mother and child playing in foreground
Background placement with scene-first prompt and 1 reference image — the device blends in naturally
AI-generated product photo — device rendered standing upright in kitchen instead of lying horizontal
The orientation problem: the cylindrical device gets rendered standing instead of lying flat
Hard Rules We Learned the Hard Way
After dozens of failed generations, a clear set of no-gos emerged:
- Never mention mirrors — guaranteed product duplicate (reflection = second product)
- Never name two surfaces — "on the shelf or the counter" = product on both
- Never say "LED ring" or "glowing" — model overemphasizes luminous features. Use "small subtle blue accent" instead
- Never use 35mm lens — barrel distortion warps product proportions
- Never go below 16 steps — faces become soft, haze appears everywhere
- Never use inpainting to add the product — ghosting and morphing at every denoise level
- Fewer refs for background (1), more for foreground (3-4)
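Most of these rules can be caught before wasting a generation. A quick prompt linter as a sketch; the keyword patterns are paraphrased from the rules above and worth extending with your own failure cases:

```python
# Sketch: lint a prompt against the no-go rules above before generating.
# Patterns are paraphrased approximations, not an exhaustive check.
import re

RULES = [
    (r"\bmirror", "mirrors cause product duplicates"),
    (r"\b(glowing|LED ring)\b", "luminous features get overemphasized"),
    (r"\b35mm\b", "35mm lens distorts product proportions"),
    (r"\b(on|at) the \w+ or (the )?\w+", "naming two surfaces duplicates the product"),
]

def lint_prompt(prompt: str) -> list[str]:
    """Return one warning per no-go rule the prompt violates."""
    return [msg for pattern, msg in RULES
            if re.search(pattern, prompt, flags=re.IGNORECASE)]

warnings = lint_prompt("Device on the shelf or the counter, LED ring glowing")
```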
The Orientation Problem
Even when the product is a horizontal cylinder that sits flat in a cradle, the model frequently renders it standing upright — like a speaker or thermos. Reference image choice is crucial:
- A 3/4 angle photo (where the cylinder appears somewhat vertical in perspective) → model biases toward upright placement
- A clearly horizontal photo (side view, lying flat) → helps but doesn't guarantee horizontal output
- Explicit language like "LYING FLAT ON ITS SIDE, NOT standing upright, like a rolling pin" → helps but isn't foolproof
The Size Calibration Problem
For background placement, the product tends to come out either too big or too small:
- "tiny" / "very small" / "5% of frame" → product disappears
- "large" / prominent language → product dominates the scene
- Real-world size anchors work best: "about the size of a large water bottle" gives the model a concrete reference
- Never use percentage-based size language — it's unpredictable
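Since real-world anchors beat percentages, it's worth generating the anchor phrase from the product's actual dimensions. A sketch with illustrative reference objects and thresholds (not from our tests):

```python
# Sketch: map a real product length to a human-scale anchor phrase.
# The reference objects and cm thresholds are illustrative assumptions.

SIZE_ANCHORS = [
    (12, "about the size of a soda can"),
    (25, "about the size of a wine bottle"),
    (45, "about the size of a large water bottle"),
    (70, "about the size of a baguette"),
]

def size_anchor(length_cm: float) -> str:
    """Return the smallest anchor phrase that covers the given length."""
    for max_cm, phrase in SIZE_ANCHORS:
        if length_cm <= max_cm:
            return phrase
    return "about the size of a small suitcase"

anchor = size_anchor(40)  # for the 40cm device
```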
Text on Products: Forget It
Flux 2 Dev can't render readable text on products. Logos become random letter-like shapes. This is a model limitation, not fixable through prompting.
For your workflow, this means: AI product photos are perfect for mood boards, social media concepts, and internal presentations. For final print assets with visible logos, you still need real photos or compositing.
The Automated Pipeline
We built a 3-agent system that scales the process:
- Creative Director: Generates shot briefs from brand research, adjusts based on feedback
- Producer: Builds ComfyUI workflows, manages reference images, generates images
- Photo Judge: Uses Claude's vision capabilities to score product accuracy, proportions, scene quality, and placement naturalness
Pass threshold: all scores ≥ 6 out of 10, average ≥ 6.5. The judge ignores text garbling (known limitation).
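The pass criterion is simple enough to state as a function: every score at least 6, average at least 6.5. The score names below are illustrative:

```python
# The judge's pass threshold as code: min score >= 6, average >= 6.5.

def judge_passes(scores: dict[str, float],
                 min_each: float = 6.0, min_avg: float = 6.5) -> bool:
    """True if every score clears min_each and the mean clears min_avg."""
    values = list(scores.values())
    return min(values) >= min_each and sum(values) / len(values) >= min_avg

result = judge_passes({"product_accuracy": 7, "proportions": 6,
                       "scene_quality": 8, "placement": 7})
```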
The result: instead of manually scrolling through hundreds of images, the pipeline automatically filters and delivers only the usable results.
Optimal Settings Cheat Sheet
| Parameter | Value |
|---|---|
| Steps | 16 (fast iteration) or 20 (final quality) |
| Sampler | euler, simple scheduler |
| CFG | 1.0 |
| Denoise | 1.0 |
| Resolution | 768×1344 (9:16) or 832×1024 (4:5) |
| Refs (foreground) | 3-4 images |
| Refs (background) | 1 image |
| Lens in prompt | 50mm (natural) or 85mm (compressed bg) |
| Aperture in prompt | f/1.4 – f/1.8 |
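The sampler rows of the cheat sheet translate directly into KSampler inputs. A sketch using the standard KSampler input names, which you should verify against your ComfyUI version:

```python
# Sketch: the cheat-sheet sampler settings as a dict for a ComfyUI
# API-format KSampler node. Input names follow the standard KSampler
# node; verify against your ComfyUI build.

KSAMPLER_SETTINGS = {
    "steps": 20,            # 16 for fast iteration, 20 for final quality
    "cfg": 1.0,
    "sampler_name": "euler",
    "scheduler": "simple",
    "denoise": 1.0,
}

RESOLUTIONS = {"9:16": (768, 1344), "4:5": (832, 1024)}
```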
Verdict: Usable, But Not a Replacement
AI product photography with Flux 2 Dev isn't a replacement for professional shoots — but it's a powerful tool for quick concepts, social media content, and internal presentations.
Foreground placement is production-ready now. Background placement needs 2-3x more generations but delivers usable results with the right technique.
The biggest advantage: everything runs locally, zero API costs, full control over the workflow. And with the automated pipeline, what used to be a manual trial-and-error process becomes a scalable system.
Want to test AI-generated product photos for your business? I help with setup — from hardware to ComfyUI to automated pipelines. Let's talk →