Introduction
While most creators are still using simple text prompts with Sora2, a powerful technique has emerged that gives you director-level control over your AI-generated videos. This JSON-based prompting method transforms Sora2 from a basic text-to-video tool into a virtual film production suite.
This guide breaks down the exact structure that's producing the highest-quality, most consistent results in the Sora2 community.
Why JSON prompting works better
Traditional approach:
"A street interviewer talks to a guy dressed as Batman in Times Square at night"
JSON-structured approach: a complete scene specification with technical parameters.
- Consistency: a JSON structure spells out each element of the scene, so Sora2 has less to guess
- Technical control: Specify camera settings, resolution, frame rates
- Character persistence: Define specific roles and appearances that stay consistent
- Scene architecture: Build complex multi-beat narratives
- Reproducibility: Tweak parameters without starting from scratch
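To make the reproducibility point concrete, here is a minimal Python sketch. It assumes you keep the scene as a dictionary and serialize it to JSON text before pasting it into Sora2. The helper name `with_changes` and the sample values are illustrative only; nothing here calls a Sora2 API.

```python
import copy
import json

# A trimmed-down scene specification using this guide's field names.
base_spec = {
    "prompt": {
        "title": "Times Square Batman",
        "setting": {"location": "Times Square", "time": "late night"},
        "look": "gritty, photoreal, HDR",
    },
    "params": {"fps": 30, "seed": 102},
}


def with_changes(spec, **param_overrides):
    """Return a copy of spec with some params replaced; the original is untouched."""
    variant = copy.deepcopy(spec)
    variant["params"].update(param_overrides)
    return variant


# Same scene, different seed and frame rate.
variant = with_changes(base_spec, seed=103, fps=24)
print(json.dumps(variant, indent=2))
```

Because the creative part of the prompt is copied unchanged, each variant differs from the base only in the parameters you chose to override.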
The complete JSON structure breakdown
1. The prompt object: your creative blueprint
"prompt": {
"title": "Your Scene Title",
"setting": { },
"cast": [ ],
"props": [ ],
"camera": { },
"beats": [ ],
"look": "",
"audio_direction": ""
}
This is your narrative container: where you define what happens, who's involved, and how it looks.
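If you draft prompts by hand, a quick completeness check can catch a forgotten section. The sketch below assumes the eight keys shown above are the full set you want in every prompt; the function name `missing_prompt_keys` is a hypothetical helper, not part of any Sora2 tooling.

```python
import json

# Keys of the "prompt" object as listed in this guide.
PROMPT_KEYS = [
    "title", "setting", "cast", "props",
    "camera", "beats", "look", "audio_direction",
]


def missing_prompt_keys(prompt):
    """Return the guide's prompt keys that are absent from the given dict."""
    return [key for key in PROMPT_KEYS if key not in prompt]


# A half-finished draft, parsed from JSON text.
draft = json.loads('{"title": "Your Scene Title", "setting": {}, "cast": [], "beats": []}')
print(missing_prompt_keys(draft))
```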
1.1 Setting: establishing your world
"setting": {
"location": "Times Square",
"time": "late night",
"vibe": "tourist chaos, LED billboards, random Spider-Man posing for selfies"
}
Purpose: grounds your scene in a specific place and atmosphere.
- Location: Be specific. "Brooklyn Bridge pedestrian walkway" beats "a bridge".
- Time: Include time of day and lighting conditions ("golden hour", "overcast afternoon")
- Vibe: This is your world-building space. Add environmental details, crowd behavior, weather, energy level
// Intimate setting
"setting": {
"location": "cramped Tokyo ramen shop",
"time": "2 AM",
"vibe": "steam rising, lone salary worker, flickering neon 'OPEN' sign"
}
// Epic setting
"setting": {
"location": "Icelandic black sand beach",
"time": "storm rolling in at dusk",
"vibe": "dramatic waves, distant lighthouse, ominous clouds"
}
1.2 Cast: defining your characters
"cast": [
{
"handle": "@gauravsinghbisen",
"role": "interviewer",
"demeanor": "mock-serious",
"wardrobe": "dark jacket, lav mic"
},
{
"id": "subject",
"role": "interviewee",
"demeanor": "dead-serious delusion",
"wardrobe": "cheap Batman costume, mask half falling off"
}
]
Purpose: creates consistent, distinct characters with clear visual and behavioral traits.
- handle/id: Unique identifier (use @ handles for consistency across projects)
- role: Their function in the scene
- demeanor: HOW they act (this is crucial for Sora2's understanding)
- wardrobe: Specific costume and clothing details
Pro tips:
- Use opposing demeanors for dynamic tension ("calm professional" vs "frantic conspiracy theorist")
- Include age ranges if important
- Add physical traits for distinction: a build, a hairstyle
- For multiple shots, keep the same handle or id to maintain character consistency
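Since character consistency depends on identifiers, it can help to check a cast list before generating. This is a minimal sketch, assuming each cast member is a dictionary carrying either a `handle` or an `id` as in the example above; `cast_identifier` and `check_cast` are hypothetical helper names.

```python
from collections import Counter


def cast_identifier(member):
    """Return the member's 'handle' if present, otherwise its 'id' (None if neither)."""
    return member.get("handle") or member.get("id")


def check_cast(cast):
    """Return a list of problems: members without an identifier, and duplicated identifiers."""
    problems = []
    identifiers = [cast_identifier(member) for member in cast]
    for index, identifier in enumerate(identifiers):
        if not identifier:
            problems.append(f"cast[{index}] has no handle or id")
    counts = Counter(identifier for identifier in identifiers if identifier)
    for identifier, count in counts.items():
        if count > 1:
            problems.append(f"{identifier!r} is used by {count} cast members")
    return problems


cast = [
    {"handle": "@gauravsinghbisen", "role": "interviewer"},
    {"id": "subject", "role": "interviewee"},
    {"id": "subject", "role": "bystander"},  # duplicate identifier
    {"role": "heckler"},                     # no identifier at all
]
for problem in check_cast(cast):
    print(problem)
```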
1.3 Props: the devil's in the details
"props": [
{
"item": "handheld microphone",
"branding": "generic"
},
{
"item": "crumpled plastic batarang",
"branding": "toy store"
}
]
Props tell Sora2 what kind of scene you're creating. A "professional boom mic" reads differently than "smartphone on a selfie stick."
Branding options: "generic" (no visible logos), "toy store" (cheap, plastic look), "professional" (high-end appearance), "vintage" (aged, retro aesthetic).
1.4 Camera: your virtual cinematography
"camera": {
"rig": "handheld camcorder",
"framing": "punch-ins on facial expressions",
"lens": "35mm, f/2.8",
"style": "documentary with meme-style zooms"
}
Rig options:
- handheld camcorder: shaky, intimate, documentary feel
- steadicam: smooth tracking shots
- tripod: static, stable, professional
- drone: aerial perspective
- gimbal: fluid, cinematic movement
- shoulder-mounted: news and documentary style
Framing techniques:
- tight close-ups for emotional intensity
- punch-ins on facial expressions for reality TV style
- wide establishing shots for scene setting
- dutch angle for disorientation and tension
- over-the-shoulder for conversation dynamics
- tracking shot for following movement
Lens specifications:
- 24mm f/1.4: wide, shallow depth of field, cinematic
- 35mm f/2.8: documentary standard, natural perspective
- 50mm f/1.8: portrait, subject isolation
- 85mm f/1.2: tight portraits, creamy bokeh
- 14mm f/2.8: ultra-wide, dramatic
Style presets: "documentary with meme-style zooms", "cinematic noir", "vintage VHS", "music video, saturated colors", "horror, found footage".
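If you reuse the same camera vocabulary across many prompts, a small builder keeps the wording uniform. The sketch below assumes you want to restrict `rig` to the six options listed above and format `lens` the way the example does; `RIG_FEEL` and `make_camera` are hypothetical names, and Sora2 itself imposes no such restriction.

```python
# Rig options and the feel this guide associates with each.
RIG_FEEL = {
    "handheld camcorder": "shaky, intimate, documentary feel",
    "steadicam": "smooth tracking shots",
    "tripod": "static, stable, professional",
    "drone": "aerial perspective",
    "gimbal": "fluid, cinematic movement",
    "shoulder-mounted": "news and documentary style",
}


def make_camera(rig, framing, focal_mm, aperture, style):
    """Build a camera object, rejecting rigs that are not in the guide's list."""
    if rig not in RIG_FEEL:
        raise ValueError(f"unknown rig: {rig!r}")
    return {
        "rig": rig,
        "framing": framing,
        "lens": f"{focal_mm}mm, f/{aperture}",
        "style": style,
    }


camera = make_camera(
    "handheld camcorder",
    "punch-ins on facial expressions",
    35,
    2.8,
    "documentary with meme-style zooms",
)
print(camera["lens"])
```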
1.5 Beats: your scene's timeline
"beats": [ "subject insists he protects NYC from pigeons; interviewer keeps roasting him", "crowd starts chanting 'Not My Batman'", "button: @gauravsinghbisen whispers 'Where's Rachel?' into mic; hard cut" ]
In screenwriting, a beat is a moment of action or emotional shift. For Sora2, beats structure your video's progression.
- Opening beat: establish the situation
- Development beats: build tension, comedy, or drama
- Button/payoff: strong ending moment
Formatting tips: use semicolons to separate multiple actions within a beat, specify character actions with their handle or id, include emotional cues ("nervously," "triumphantly"), add timing markers ("slowly," "suddenly," "after a long pause").
Comedy example:
"beats": [ "host asks 'What's your secret talent?'; guest confidently says 'I can talk to plants'", "host blinks in silence; camera zooms on uncomfortable expression", "guest starts arguing with a potted fern; host backs away slowly" ]
Drama example:
"beats": [ "detective shows witness a photo; witness's face drops", "witness whispers 'I haven't seen her in twenty years'", "detective leans forward; camera pushes in on witness's trembling hands" ]
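The semicolon convention from the formatting tips also makes beats easy to inspect before you submit them. This is a rough sketch that splits each beat into its micro-actions; it assumes no semicolons appear inside quoted dialogue, and `split_beat` is a hypothetical helper name.

```python
def split_beat(beat):
    """Split one beat string into its semicolon-separated micro-actions."""
    return [action.strip() for action in beat.split(";") if action.strip()]


beats = [
    "detective shows witness a photo; witness's face drops",
    "witness whispers 'I haven't seen her in twenty years'",
    "detective leans forward; camera pushes in on witness's trembling hands",
]

# Print an outline so an overloaded beat stands out at a glance.
for number, beat in enumerate(beats, start=1):
    actions = split_beat(beat)
    print(f"beat {number}: {len(actions)} action(s)")
    for action in actions:
        print(f"  - {action}")
```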
1.6 Look: your visual aesthetic
"look": "gritty, photoreal, HDR"
Popular look combinations:
- "gritty, photoreal, HDR": street documentary, modern realism
- "dreamy, soft focus, pastel colors": romantic, nostalgic
- "high contrast, noir, shadows": mystery, thriller
- "vibrant, saturated, pop art": music video, advertisement
- "desaturated, cold tones, clinical": sci-fi, dystopian
- "warm golden hour, film grain": indie film, heartfelt
- "neon-lit, cyberpunk, reflections": futuristic, urban
1.7 Audio direction: the forgotten element
"audio_direction": "include street noise, muffled laughs from bystanders; ensure perfect lip sync with natural dialogue timing"
Sora2 can generate audio alongside video. Clear audio direction helps with realistic ambient sound, lip-sync timing, environmental acoustics, and sound design elements.
2. The params object: technical specifications
"params": {
"width": 3840,
"height": 2160,
"fps": 30,
"style_preset": "documentary-photoreal",
"enable_hdr": true,
"motion_blur": true,
"guidance": 6.5,
"seed": 102
}
Resolution:
- 3840 x 2160: 4K Ultra HD, highest quality
- 1920 x 1080: Full HD, standard quality, faster generation
- 1080 x 1920: vertical 9:16 for Stories and TikTok
- 1080 x 1080: square for feed posts
Frame rates: 24 for cinematic film look, 30 for standard smooth motion, 60 for ultra-smooth sports and gaming.
Guidance (1-10): 3-5 gives more creative freedom, 6-7 is the balanced recommended start, 8-10 means strict adherence to the prompt with less creativity.
Seed: any integer. Same seed plus same prompt equals reproducible results. Change the seed for variations on the same prompt.
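The ranges above are easy to check in code before you paste a prompt. This sketch assumes the `params` values are limited to what this guide describes (fps of 24, 30, or 60; guidance from 1 to 10; an integer seed); `aspect_ratio` and `check_params` are hypothetical helpers, not an official validator.

```python
from math import gcd


def aspect_ratio(width, height):
    """Reduce width x height to its simplest ratio, e.g. 3840 x 2160 gives 16:9."""
    divisor = gcd(width, height)
    return f"{width // divisor}:{height // divisor}"


def check_params(params):
    """Return a list of problems measured against the ranges given in this guide."""
    problems = []
    if params.get("fps") not in (24, 30, 60):
        problems.append("fps should be 24, 30, or 60")
    guidance = params.get("guidance")
    if not isinstance(guidance, (int, float)) or not 1 <= guidance <= 10:
        problems.append("guidance should be a number from 1 to 10")
    if not isinstance(params.get("seed"), int):
        problems.append("seed should be an integer")
    return problems


params = {"width": 1080, "height": 1920, "fps": 30, "guidance": 6.5, "seed": 102}
print(aspect_ratio(params["width"], params["height"]))
print(check_params(params))
```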
3. Negatives: what to avoid
"negatives": ["cartoonish", "polished cosplay", "lip-sync drift"]
Common negative prompts by category:
- Visual quality: blurry, pixelated, overexposed, color banding, artifacts
- Style avoidance: cartoonish, anime style, CGI-looking, painting-like
- Technical problems: lip-sync drift, warped faces, extra fingers, distorted proportions
- Aesthetic unwanteds: polished cosplay, oversaturated colors, unwanted lens flare
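One way to reuse these categories is to keep them in a lookup table and assemble the `negatives` array per project. The sketch below is an assumption about workflow, not a Sora2 feature; `NEGATIVES_BY_CATEGORY` and `build_negatives` are hypothetical names.

```python
# Negative terms grouped as in the list above.
NEGATIVES_BY_CATEGORY = {
    "visual quality": ["blurry", "pixelated", "overexposed", "color banding", "artifacts"],
    "style avoidance": ["cartoonish", "anime style", "CGI-looking", "painting-like"],
    "technical problems": ["lip-sync drift", "warped faces", "extra fingers", "distorted proportions"],
    "aesthetic unwanteds": ["polished cosplay", "oversaturated colors", "unwanted lens flare"],
}


def build_negatives(categories, extra=()):
    """Collect terms from the chosen categories plus extras, dropping duplicates in order."""
    terms = []
    for category in categories:
        terms.extend(NEGATIVES_BY_CATEGORY[category])
    terms.extend(extra)
    # dict.fromkeys keeps the first occurrence of each term and preserves order.
    return list(dict.fromkeys(terms))


negatives = build_negatives(
    ["style avoidance", "technical problems"],
    extra=["cartoonish", "low energy"],
)
print(negatives)
```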
Complete template: viral social media content
{
"prompt": {
"title": "[Catchy Hook Title]",
"setting": {
"location": "[Recognizable place]",
"time": "[Current/trendy time]",
"vibe": "[High energy description]"
},
"cast": [
{
"handle": "@[creator_name]",
"role": "content creator",
"demeanor": "charismatic, direct-to-camera",
"wardrobe": "[trendy outfit]"
}
],
"props": [
{"item": "smartphone on tripod", "branding": "iPhone"}
],
"camera": {
"rig": "tripod",
"framing": "tight close-up, centered",
"lens": "28mm, f/2.2",
"style": "bright, punchy, meme-ready zooms"
},
"beats": [
"hook: @[creator] looks directly at camera and says '[attention grabber]'",
"[quick demonstration or reveal]",
"button: [memorable ending line]; freeze frame on reaction"
],
"look": "vibrant, saturated, crisp HDR",
"audio_direction": "clear voiceover; trending audio in background; perfect lip sync"
},
"params": {
"width": 1080,
"height": 1920,
"fps": 30,
"style_preset": "documentary-photoreal",
"enable_hdr": true,
"motion_blur": false,
"guidance": 6,
"seed": 123
},
"negatives": ["low energy", "dim lighting", "complex background"]
}
Common mistakes to avoid
- Vague descriptions. "A person talks" loses to "dead-serious delusional Batman impersonator insists he protects NYC."
- Missing camera specs. Always include rig, lens, and style.
- Overloading beats. Three or four clear moments beat ten different actions.
- Ignoring negatives. Explicitly state what to avoid.
- Wrong resolution for platform. Vertical platforms need 1080 x 1920.
- Inconsistent character IDs. Keep the same handle or id across a sequence.
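Some of these mistakes can be caught mechanically. Here is a minimal lint sketch covering three of them (missing camera specs, overloaded beats, and absent negatives). It assumes the full specification is a dictionary shaped like the template above; `lint_spec` is a hypothetical helper, and the four-beat limit simply mirrors the advice in the list.

```python
def lint_spec(spec):
    """Return warnings for the listed mistakes that can be checked mechanically."""
    warnings = []
    prompt = spec.get("prompt", {})

    # Missing camera specs: rig, lens, and style should all be present.
    camera = prompt.get("camera", {})
    for field in ("rig", "lens", "style"):
        if not camera.get(field):
            warnings.append(f"camera is missing '{field}'")

    # Overloaded beats: the guide suggests three or four clear moments.
    if len(prompt.get("beats", [])) > 4:
        warnings.append("more than four beats; consider trimming")

    # Ignored negatives: an empty or absent list is flagged.
    if not spec.get("negatives"):
        warnings.append("no negatives listed")
    return warnings


draft = {
    "prompt": {
        "camera": {"rig": "tripod"},
        "beats": ["one", "two", "three", "four", "five"],
    },
    "params": {"width": 1080, "height": 1920},
}
for warning in lint_spec(draft):
    print(warning)
```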
Troubleshooting guide
- Characters look different between shots: use consistent handle or id values plus the same seed.
- Lip sync is off: add "ensure perfect lip sync with natural dialogue timing" to audio_direction.
- Scene looks too AI-generated: add "artificial", "CGI-like", "overly smooth" to negatives and set motion_blur to true.
- Not enough action: expand your beats with semicolon-separated micro-actions.
- Colors are flat: set enable_hdr to true and specify color grading in the look field.
- Too much prompt drift: increase guidance from 6.5 to 7.5 or 8.
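Two of these fixes are simple enough to script. The sketch below assumes the specification is a dictionary with `prompt` and `params` sections as in this guide; `fix_lip_sync` and `reduce_drift` are hypothetical helpers, and the cap of 10 follows the guidance range given earlier.

```python
import copy

LIP_SYNC_PHRASE = "ensure perfect lip sync with natural dialogue timing"


def fix_lip_sync(spec):
    """Return a copy whose audio_direction includes the lip-sync phrase."""
    fixed = copy.deepcopy(spec)
    audio = fixed["prompt"].get("audio_direction", "")
    if LIP_SYNC_PHRASE not in audio:
        fixed["prompt"]["audio_direction"] = (
            f"{audio}; {LIP_SYNC_PHRASE}" if audio else LIP_SYNC_PHRASE
        )
    return fixed


def reduce_drift(spec, step=1.0):
    """Return a copy with guidance raised by step, capped at 10."""
    fixed = copy.deepcopy(spec)
    current = fixed["params"].get("guidance", 6.5)
    fixed["params"]["guidance"] = min(10, current + step)
    return fixed


spec = {
    "prompt": {"audio_direction": "include street noise"},
    "params": {"guidance": 6.5, "seed": 102},
}
fixed = reduce_drift(fix_lip_sync(spec))
print(fixed["prompt"]["audio_direction"])
print(fixed["params"]["guidance"])
```

Both helpers return copies, so the original specification stays available for comparison.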
Final thoughts
This JSON prompting method transforms Sora2 from a text-to-video tool into a virtual production studio. You're not just describing a video, you're directing it.
- Specificity beats generality
- Technical parameters matter as much as creative description
- Characters need clear, consistent identifiers
- Beats structure your narrative arc
- Negatives prevent common issues
The difference between amateur and professional AI video generation isn't the tool, it's how you communicate with it. This JSON structure is that language. It is the same discipline behind every video in my AI showcase, and the kind of system I set up for brands as a generative AI consultant.