Image & Video Generation Workflows 2026


Written by Josh Ausmus · Updated April 2026


Structured Prompt Formula

Use this order every time. It puts the subject first, followed by style, composition, lighting, and technical constraints, so the most important element leads the prompt.

[Subject] - main focus with specific details (age, clothing, expression, pose).
[Style] - artistic approach or reference (photorealistic, cinematic, technical illustration).
[Composition] - framing and layout (close-up three-quarter view, rule of thirds, wide establishing shot).
[Lighting] - specific sources and quality (golden hour sidelight, dramatic rim lighting with soft fill, volumetric god rays).
[Technical] - quality boosters, parameters, and aspect ratio (sharp focus, fine detail, 8k, --hd for MJ).
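The formula above can be sketched as a small helper that always emits the five slots in the same order. This is a minimal illustration, not part of any model's API: `PromptParts` and `build_prompt` are hypothetical names, and joining the slots with commas is an assumption based on the example prompts in this reference.

```python
from dataclasses import dataclass


@dataclass
class PromptParts:
    """The five formula slots, in the order the formula lists them."""
    subject: str
    style: str
    composition: str
    lighting: str
    technical: str = ""


def build_prompt(parts: PromptParts, flags: str = "") -> str:
    """Join the non-empty slots with commas, then append platform flags."""
    slots = [parts.subject, parts.style, parts.composition,
             parts.lighting, parts.technical]
    body = ", ".join(s.strip() for s in slots if s.strip())
    return f"{body} {flags.strip()}".strip()


parts = PromptParts(
    subject="Vintage 1970s muscle car parked on desert highway at sunset",
    style="cinematic film still",
    composition="rule of thirds composition with dramatic sky",
    lighting="warm orange sidelight and long shadows",
    technical="photorealistic chrome reflections high detail",
)
prompt = build_prompt(parts, flags="--ar 3:2")
print(prompt)
```

Keeping the slots as separate fields makes it easy to change one of them (for example, lighting) without retyping the rest of the prompt.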

Model Comparison

Model	Resolution	Speed	Best For	Gotchas
FLUX.2 dev (32B)	Up to 4MP, any ratio	Seconds on a high-end GPU (klein 4B variant is sub-second on consumer GPUs)	Photorealism, multi-reference editing, complex scenes	The 32B variant needs serious VRAM without quantization. Prompt adherence is strong but needs explicit technical terms.
Midjourney V8 Alpha	Native 2K with --hd	5x faster than the prior version	Artistic coherence, aesthetics, style references	Alpha means parameters may shift. --hd costs more GPU time. Still favors artistic interpretation over literal rendering.
Grok (Aurora)	~1K-2K range	Fast, integrated	Unfiltered concepts, text rendering, quick iteration	Tied to X platform access tiers. Less granular control than local FLUX.
Ideogram (latest)	High-res, strong text support	Seconds	Typography, posters, logos, readable text in scene	Excels at text but can over-saturate colors on complex prompts. Great for marketing mockups.
DALL-E (current)	Strong detail, integrated with ChatGPT	Fast via ChatGPT	Natural language prompts, safe iterative brainstorming	Heavier safety filters. Prompt understanding improved but less precise on edge cases than FLUX or Ideogram.

Aspect Ratio and Resolution Cheat Sheet

Ratio	Pixel Examples (approx)	Use Case
1:1	1024x1024 or 1536x1536	Icons, avatars, square social posts
3:2	1536x1024 or 2304x1536	Photography, product shots
16:9	1920x1080 or 2560x1440	Video thumbnails, cinematic stills
9:16	1080x1920 or 1440x2560	Mobile stories, vertical video frames
4:5	1024x1280 or 1536x1920	Instagram portraits
--hd (MJ V8)	Native ~2K base	When you need higher detail without separate upscaler

Keep both dimensions even and close to the resolutions the model was trained on. FLUX handles any ratio well up to 4MP total pixels.
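To pick dimensions for an arbitrary ratio under a pixel budget, a small calculation helps. The sketch below is illustrative only: `fit_dimensions` is a hypothetical helper, the 4MP budget is treated as 4,000,000 pixels, and snapping both sides to multiples of 16 is an assumption (a common convention for diffusion models, not a documented requirement of any model listed here).

```python
import math


def fit_dimensions(ratio_w: int, ratio_h: int,
                   max_pixels: int = 4_000_000,
                   multiple: int = 16) -> tuple[int, int]:
    """Largest (width, height) with the exact ratio that fits max_pixels.

    Both sides are kept at multiples of `multiple` (an assumed convention).
    """
    if ratio_w <= 0 or ratio_h <= 0:
        raise ValueError("ratio terms must be positive")
    g = math.gcd(ratio_w, ratio_h)
    rw, rh = ratio_w // g, ratio_h // g
    # Area of the smallest image with this ratio and side granularity.
    unit_area = (rw * multiple) * (rh * multiple)
    # Largest integer scale k such that k*k*unit_area <= max_pixels.
    k = math.isqrt(max_pixels // unit_area)
    if k == 0:
        raise ValueError("pixel budget too small for this ratio")
    return rw * multiple * k, rh * multiple * k


print(fit_dimensions(16, 9))   # (2560, 1440)
print(fit_dimensions(3, 2))    # (2448, 1632)
```

For 16:9 this lands on the same 2560x1440 figure listed in the table; other ratios may come out slightly larger than the table's examples because the table values are approximate.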

Prompts That Work

Copy these directly. They follow the formula and produce clean results across the listed models.

Cyberpunk street vendor selling neon ramen, rainy night market, dynamic low angle composition with rain reflections, harsh neon pink and cyan lighting with volumetric fog, photorealistic fine details 8k --ar 16:9
Female engineer in her 30s debugging a PCB at a cluttered workbench, technical documentary style, tight medium shot focused on hands and schematic, cool white LED task lighting with warm desk lamp fill, sharp focus macro details realistic
Minimalist Scandinavian kitchen interior at dawn, clean architectural photography style, wide establishing shot with leading lines from counter to window, soft natural window light and subtle god rays, 4k clean lines no clutter
Vintage 1970s muscle car parked on desert highway at sunset, cinematic film still, rule of thirds composition with dramatic sky, warm orange sidelight and long shadows, photorealistic chrome reflections high detail --ar 3:2
Detailed technical cutaway diagram of a modern electric motor showing windings and magnets, engineering illustration style with exploded view elements, orthographic top and side composition, even studio lighting with clear labels and shadows, vector crisp lines
Portrait of an older male blacksmith hammering hot metal, gritty workshop atmosphere, close-up three-quarter view with flying sparks, dramatic single source forge glow lighting with rim light on face, hyperrealistic skin texture and sweat details 8k
Futuristic floating city skyline at dusk with flying vehicles, sci-fi matte painting style, wide horizontal composition showing scale, cool blue ambient light with warm window accents, intricate architecture sharp focus cinematic
Cozy reading nook in an attic with bookshelves and rainy window, warm hygge illustration style mixed with photoreal, intimate close composition, soft warm lamplight contrasting cool blue window light, detailed textures comfortable mood

These produce the described scenes when fed into the matching model. Tweak the technical section for specific platform parameters.
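One way to tweak the technical section per platform is to separate trailing `--` flags from the descriptive text. The sketch below assumes flags always come at the end of the prompt, as in the examples above, and assumes only Midjourney reads them. The function names and the model labels ("midjourney", "flux") are hypothetical.

```python
def split_flags(prompt: str) -> tuple[str, str]:
    """Split a prompt into (descriptive text, trailing "--" flags)."""
    text, sep, rest = prompt.partition(" --")
    flags = ("--" + rest).strip() if sep else ""
    return text.strip(), flags


def for_model(prompt: str, model: str) -> str:
    """Keep flags for Midjourney; drop them for every other model."""
    text, flags = split_flags(prompt)
    if model == "midjourney":
        return f"{text} {flags}".strip()
    return text


source = "Cyberpunk street vendor selling neon ramen, rainy night market --ar 16:9"
print(for_model(source, "midjourney"))
print(for_model(source, "flux"))
```

For models that take the aspect ratio as a separate setting, the stripped flag value can be passed through whatever size or ratio control that tool exposes.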

Consistent Characters Workflow

Generate one strong base image first using a detailed character prompt. Save it.

For subsequent images, paste the base image URL or upload it as a reference (FLUX multi-reference, MJ character reference, Ideogram remix).

Add specific new instructions for pose, clothing, and environment while keeping the core descriptors (face shape, hair, eye color, build) in the subject section.

Iterate by varying lighting and composition only. Limit changes to one or two variables per generation. Use the same seed when available.

FLUX multi-reference handles this best when a scene has more than one character. Midjourney V8 style references and --cref work reliably for a single-subject series. Test on a small batch. The reference pulls the face and build. Your text prompt controls everything else.
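The "change one or two variables per generation" rule can be sketched as a variant generator that holds the subject fixed and alters a single slot at a time. This is an illustration under assumptions: the slot names mirror the prompt formula, `one_at_a_time` and `render` are hypothetical helpers, and the character descriptors are made up for the example. Reference images and seeds are not modeled here because each tool exposes them differently.

```python
ORDER = ("subject", "style", "composition", "lighting", "technical")


def one_at_a_time(base: dict[str, str],
                  options: dict[str, list[str]]) -> list[dict[str, str]]:
    """Return copies of base, each differing from it in exactly one slot."""
    variants = []
    for slot, values in options.items():
        for value in values:
            if value != base[slot]:
                variants.append({**base, slot: value})
    return variants


def render(slots: dict[str, str]) -> str:
    """Join the slots in formula order, skipping empty ones."""
    return ", ".join(slots[k] for k in ORDER if slots.get(k))


base = {
    # Core descriptors stay in the subject slot for every image.
    "subject": "woman in her 30s, oval face, short black hair, green eyes, slim build",
    "style": "cinematic film still",
    "composition": "close-up three-quarter view",
    "lighting": "golden hour sidelight",
    "technical": "sharp focus",
}
options = {
    "composition": ["wide establishing shot"],
    "lighting": ["dramatic rim lighting with soft fill", "golden hour sidelight"],
}
variants = one_at_a_time(base, options)
for v in variants:
    print(render(v))
```

Each printed prompt differs from the base in one slot only, which makes it easier to tell which change caused a drift in the character's appearance.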

How Models Interpret Prompts Differently

FLUX.2 follows technical descriptions literally. It handles complex spatial relationships and physics better than others. Add exact terms like "hex color #A020F0" or "512-point detail" and it listens. It shines on multi-subject scenes.[1]

Midjourney V8 Alpha leans artistic. It interprets mood and aesthetic references strongly. It adds pleasing choices when your prompt is vague. Use style references and moodboards for control. The new --hd mode changes the base resolution and coherence.[2]

Ideogram treats text as a first-class citizen. Put readable signage or labels in quotes. It renders legible typography where others smear letters. It works great for posters but can push colors hard.

Grok and DALL-E handle natural language conversationally. They fill gaps with reasonable assumptions. This helps rapid iteration in chat but reduces precision on highly technical scenes. Grok feels less filtered.

The structured formula above works across all of them. The difference shows in which parts they emphasize or ignore. Test the same prompt on two models and you'll see the gap immediately. If the output misses the mark, add or remove weight from the technical section first.

For video, use image-to-video pipelines on tools like Kling, Runway, or Google Veo 3.1. Start with a strong FLUX or MJ still as the keyframe. Prompt motion separately. Short clips only. The image models still drive the quality.