Qwen2-VL

Instruct

Image

Video

Qwen2-VL is the latest iteration of our Qwen-VL model, delivering state-of-the-art performance on visual understanding benchmarks such as MathVista, DocVQA, and MTVQA. It processes both images and videos, enabling tasks such as question answering, dialog, and content creation. With its reasoning and decision-making capabilities, Qwen2-VL can operate devices such as mobile phones and robots based on visual and textual inputs. It also offers multilingual support, understanding text within images in English, Chinese, European languages, Japanese, Korean, Arabic, Vietnamese, and more.

For instructions on accessing this model or initializing it via API, please refer to our docs.

Configuration

For more details about this model, visit the model's page on Hugging Face.

NVIDIA L40S x 1

Context sets the maximum number of tokens the model can process at once. Smaller values improve speed but risk truncating the input. Adjust it to balance performance against the length of input you need.
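To make the trade-off concrete, here is a minimal sketch of what truncation to a context limit can look like. It is an illustration only: the function name `fit_to_context`, the `reserve_for_output` parameter, and the choice to keep the most recent tokens are all assumptions, not a description of how this service truncates input. Plain integers stand in for real token IDs.

```python
def fit_to_context(token_ids, context_length, reserve_for_output=0):
    """Trim a token sequence so it fits in a context window.

    Hypothetical helper: keeps the most recent tokens and drops the
    oldest ones. A real deployment may truncate differently.
    """
    # Tokens available for input after setting aside room for the reply.
    budget = context_length - reserve_for_output
    if budget <= 0:
        raise ValueError("context_length must exceed reserve_for_output")

    truncated = len(token_ids) > budget
    # Keep the tail of the sequence when it does not fit.
    kept = list(token_ids[-budget:]) if truncated else list(token_ids)
    return kept, truncated


if __name__ == "__main__":
    prompt = list(range(10))  # stand-in for 10 token IDs
    kept, truncated = fit_to_context(prompt, context_length=8, reserve_for_output=2)
    print(len(kept), truncated)
```

With a context of 8 and 2 tokens reserved for output, only 6 of the 10 input tokens survive, which is the kind of loss a setting that is too small can cause.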