VIBE: Visual Instruction Based Editor
Project Page | Paper on arXiv | GitHub | Space | VIBE-Image-Edit-DistilledCFG
VIBE is an open-source framework for text-guided image editing. It pairs the efficient Sana1.5-1.6B diffusion model with the visual understanding of Qwen3-VL-2B-Instruct to deliver fast, high-quality, instruction-based image editing.
We also provide a faster, CFG-distilled version of this model available at VIBE-Image-Edit-DistilledCFG.
Model Details
- Name: VIBE
- Task: Text-Guided Image Editing
- Architecture:
  - Diffusion Backbone: Sana1.5 (1.6B parameters) with Linear Attention.
  - Condition Encoder: Qwen3-VL (2B parameters) for multimodal understanding.
- Framework: Built on diffusers and transformers.
- Model precision: torch.bfloat16 (BF16)
- Model resolution: The model is designed to edit images up to 2048px, with multi-scale height and width.
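Because the model targets images up to 2048px with variable height and width, a larger input may need downscaling before editing. The following is a minimal sketch under two assumptions that this card does not state: that the limit applies to the longer side, and that each side should be a multiple of 32. `fit_resolution` is a hypothetical helper and is not part of the vibe library.

```python
def fit_resolution(width: int, height: int,
                   max_side: int = 2048, multiple: int = 32) -> tuple[int, int]:
    """Return (width, height) shrunk so the longer side fits within max_side.

    Assumptions (not from the model card): the 2048px limit applies to the
    longer side, and each side is rounded down to a multiple of `multiple`.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    longest = max(width, height)
    # Only ever shrink; integer math avoids floating-point rounding surprises.
    if longest > max_side:
        width = width * max_side // longest
        height = height * max_side // longest
    # Round each side down to the assumed multiple, but never below it.
    new_w = max(multiple, width // multiple * multiple)
    new_h = max(multiple, height // multiple * multiple)
    return new_w, new_h
```

With Pillow, whose `Image.size` is a `(width, height)` tuple, the result could be passed directly to `image.resize(fit_resolution(*image.size))`. Rounding down slightly changes the aspect ratio, so cropping may be preferable when exact proportions matter.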
Features
- Text-Guided Editing: Edit images using natural language instructions (e.g., "Add a cat on the sofa").
- Compact & Efficient: Combines a 1.6B parameter diffusion model with a 2B parameter encoder for a lightweight footprint.
- High-Speed Inference: Utilizes Sana1.5's linear attention mechanism for rapid generation.
- Multimodal Understanding: Qwen3-VL ensures strong alignment between visual content and text instructions.
- Text-to-Image Support: Also generates images from text prompts alone.
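To make the "lightweight footprint" claim concrete, the weight memory can be estimated from the parameter counts and the BF16 precision listed above. This is a rough sketch only: it uses the nominal 1.6B and 2B figures from this card, and it ignores activations, any other model components, and framework overhead, so actual GPU usage will be higher.

```python
BYTES_PER_BF16 = 2  # bfloat16 stores each parameter in 16 bits


def weight_memory_gib(num_params: float, bytes_per_param: int = BYTES_PER_BF16) -> float:
    """Approximate memory needed to hold the weights alone, in GiB."""
    return num_params * bytes_per_param / 1024**3


diffusion_params = 1.6e9  # nominal Sana1.5 size from the card
encoder_params = 2.0e9    # nominal Qwen3-VL size from the card

total_gib = weight_memory_gib(diffusion_params + encoder_params)
print(f"Approximate BF16 weight memory: {total_gib:.1f} GiB")  # prints 6.7 GiB
```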
Inference Requirements
- vibe library:
pip install git+https://github.com/ai-forever/VIBE
- requirements for the vibe library:
pip install transformers==4.57.1 torchvision==0.21.0 torch==2.6.0 diffusers==0.33.1 loguru==0.7.3
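Since the requirements are pinned to exact versions, it can help to confirm that the active environment matches them before loading the model. The sketch below uses only the standard library's `importlib.metadata`; `find_mismatches` is a hypothetical helper, and treating local build tags such as `+cu124` as a match is an assumption of this example.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the pip command above.
PINNED = {
    "transformers": "4.57.1",
    "torchvision": "0.21.0",
    "torch": "2.6.0",
    "diffusers": "0.33.1",
    "loguru": "0.7.3",
}


def find_mismatches(pins, get_version=version):
    """Return {package: installed_version_or_None} for packages that differ from pins."""
    mismatches = {}
    for name, wanted in pins.items():
        try:
            installed = get_version(name)
        except PackageNotFoundError:
            installed = None
        # Assumption: a local build tag like "2.6.0+cu124" still counts as a match.
        if installed is None or installed.split("+")[0] != wanted:
            mismatches[name] = installed
    return mismatches


if __name__ == "__main__":
    for name, installed in find_mismatches(PINNED).items():
        print(f"{name}: expected {PINNED[name]}, found {installed or 'not installed'}")
```

The `get_version` parameter exists only so the check can be exercised without the packages installed; in normal use the default lookup is sufficient.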
Quick start
from io import BytesIO
import requests
from huggingface_hub import snapshot_download
from PIL import Image
from vibe.editor import ImageEditor
# Download the model weights from the Hugging Face Hub
model_path = snapshot_download(
    repo_id="iitolstykh/VIBE-Image-Edit",
    repo_type="model",
)
# Load the model with the default sampling settings
editor = ImageEditor(
    checkpoint_path=model_path,
    image_guidance_scale=1.2,
    guidance_scale=4.5,
    num_inference_steps=20,
    device="cuda:0",
)
# Download a test image and fail early if the request did not succeed
resp = requests.get('https://image.civitai.com/xG1nkqKTMzGDvpLrqFT7WA/3f58a82a-b4b4-40c3-a318-43f9350fcd02/original=true,quality=90/115610275.jpeg', timeout=30)
resp.raise_for_status()
# Converting to RGB is a precaution added here, in case the source has another mode
image = Image.open(BytesIO(resp.content)).convert("RGB")
# Generate the edited image (the call returns a list; take the first result)
edited_image = editor.generate_edited_image(
    instruction="let this case swim in the river",
    conditioning_image=image,
    num_images_per_prompt=1,
)[0]
edited_image.save("edited_image.jpg", quality=100)
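To try several instructions on the same source image, the quick start call can be wrapped in a small loop. This sketch relies only on the `generate_edited_image` call and keyword arguments shown above; `edit_with_instructions` is a hypothetical helper, and the example instructions in the usage comment are illustrative.

```python
def edit_with_instructions(editor, image, instructions):
    """Apply each instruction to the same source image.

    `editor` is expected to behave like the `ImageEditor` in the quick start:
    `generate_edited_image` returns a list of images.
    """
    results = {}
    for instruction in instructions:
        outputs = editor.generate_edited_image(
            instruction=instruction,
            conditioning_image=image,
            num_images_per_prompt=1,
        )
        results[instruction] = outputs[0]  # keep the single generated image
    return results


# Usage, assuming `editor` and `image` from the quick start:
# results = edit_with_instructions(editor, image, ["make it snowy", "add a red hat"])
# for i, edited in enumerate(results.values()):
#     edited.save(f"edited_{i}.jpg")
```

Each instruction is applied independently to the original image, so edits do not accumulate; chaining edits would instead require feeding each output back in as the next conditioning image.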
T2I Examples
(Seed: 666) Prompt: Portrait of an old wise man with a long white beard surrounded by books and candles
Comparison with SANA1.5_1.6B_1024px
Prompt: Generate an interior of a rustic cabin workshop during winter evening. The viewpoint is from the doorway, showing a workbench with tools, wood shavings on the floor, and a cast-iron stove glowing softly. Place shelves with jars of nails, coils of rope, and folded blankets. Through a small window, show snow falling and pine trees in the twilight. Add warm lamplight creating soft gradients and a gentle vignette. Include a person in a thick sweater sanding a wooden object at the bench, but keep the person small in frame
Prompt: Generate an ancient jungle temple ruin partially covered in moss and vines, with a waterfall cascading nearby into a shallow pool. Show broken stone steps, carved patterns that are abstract, and damp surfaces with realistic moss detail. Add mist, shafts of sunlight through leaves, and small floating insects. Include a human explorer in the mid-ground, small in frame, wearing a backpack. Lush, cinematic realism.
Prompt: Create a science-fiction interior of a space greenhouse module with hydroponic racks, glowing grow lights, and condensation on transparent walls. Plants include leafy greens and flowering specimens. Tools and tablets have UI elements. Add soft floating dust or microgravity droplets. Clean, detailed, plausible sci-fi aesthetic.
Prompt: Beautiful tropical beach with guinea pig swimming in the water and human drinking wine
Prompt: Create a cinematic, rainy night scene in a narrow backstreet of an old downtown area. The camera is at street level, slightly tilted upward, emphasizing wet cobblestones reflecting neon-like colored lights without readable text. Show a small ramen stall with steam rising from pots, hanging paper lanterns that are blank or patterned (no letters), and a couple of stools under a simple awning. Add puddles, scattered trash like crumpled paper, and subtle mist. Include a passerby in the mid-ground seen from behind wearing a hooded jacket and carrying an umbrella, face not visible. Use a moody color palette of deep blues and warm oranges, with soft bokeh highlights and realistic rain streaks
Prompt: Depict a volcanic lava field at twilight with cooled black rock, glowing cracks of magma in the distance, and heat shimmer. The sky is darkening with faint stars emerging. Add thin smoke plumes and red-orange reflections on nearby rocks. Cinematic realism, dramatic contrast
Prompt: Portrait from back of a young woman dressed in Victorian attire standing in an ancient library filled with mirrors and stained glass windows, softly illuminated by sunlight streaming through
License
This project is built upon SANA. Please refer to the original SANA license for usage terms: SANA License
Citation
If you use this model in your research or applications, please acknowledge the original projects:
- SANA 1.5: Efficient Scaling of Training-Time and Inference-Time Compute in Linear Diffusion Transformer
- Qwen3-VL
@misc{vibe2026,
  Author = {Grigorii Alekseenko and Aleksandr Gordeev and Irina Tolstykh and Bulat Suleimanov and Vladimir Dokholyan and Georgii Fedorov and Sergey Yakubson and Aleksandra Tsybina and Mikhail Chernyshov and Maksim Kuprashevich},
  Title = {VIBE: Visual Instruction Based Editor},
  Year = {2026},
  Eprint = {arXiv:2601.02242},
}








