Quantized inference
Cost-efficient GPUs
2x faster
Increase inference speed with modern quantization techniques such as GPTQ and FP8.
9x cheaper
Lower your cost per token by running Llama 70B instead of OpenAI's GPT-4o.
Select your model and immediately access it via API. Models come with an OpenAI-compatible endpoint, allowing you to use popular tools like LangChain right away.
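For instance, here is a minimal sketch using the official openai Python client; the base URL and model name below are placeholders, not actual values from the platform:

```python
# Minimal sketch: calling an OpenAI-compatible endpoint with the
# official openai Python client (pip install openai).
# The base_url and model name are placeholders, not real values.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.example.com/v1",  # hypothetical endpoint URL
    api_key="YOUR_API_KEY",                 # your provider API key
)

response = client.chat.completions.create(
    model="your-deployed-model",            # placeholder model identifier
    messages=[{"role": "user", "content": "Summarize GDPR in one sentence."}],
)
print(response.choices[0].message.content)
```

Because the endpoint follows the OpenAI API schema, the same base URL and key can also be plugged into tools such as LangChain's ChatOpenAI integration.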
GDPR compliant
ISO 27001 certified hosting
Dedicated inference offers exclusive access to a specific model, ensuring that you are the sole user of the underlying compute resources. Unlike serverless options, you won't have to share these resources with other users, guaranteeing consistent performance.
This makes it particularly well suited for applications that demand low latency, sustain heavy workloads, or run batch-processing jobs.
The model is quantized using a methodology that balances a slight reduction in accuracy with significant reductions in computing demand and energy consumption. This approach provides a range of variants, from compact and economical options to larger, near-lossless alternatives, catering to various use cases and preferences.
Quantization therefore offers a compelling way to reduce inference costs while preserving quality.
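To illustrate the basic idea (a simplified stand-in, not the production method), here is a minimal sketch of symmetric round-to-nearest int8 weight quantization, showing how weights can be stored compactly and dequantized with only a small approximation error:

```python
# Minimal sketch of symmetric round-to-nearest int8 weight quantization.
# Illustrative only; production methods such as GPTQ use more
# sophisticated, error-compensating procedures.
import numpy as np

def quantize_int8(w: np.ndarray):
    """Map float weights to int8 plus a per-tensor scale."""
    scale = np.abs(w).max() / 127.0          # largest magnitude maps to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights."""
    return q.astype(np.float32) * scale

w = np.random.randn(4096).astype(np.float32)
q, scale = quantize_int8(w)
error = np.abs(w - dequantize(q, scale)).mean()
print(f"storage: {w.nbytes} B -> {q.nbytes} B, mean abs error: {error:.5f}")
```

Storing weights in 8 bits instead of 16 halves memory traffic, which is where most of the speed and cost savings come from.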
Is my data stored or used for training?
No, none of your data is stored or used for training.
Your models are hosted on ISO 27001 certified infrastructure with state-of-the-art GPUs. The infrastructure is located within Europe and operates in compliance with the GDPR.
We use EleutherAI's LM Evaluation Harness to assess model quality across a wide range of tasks, allowing you to compare models against each other and get a comprehensive picture of their performance.
However, for practical reasons, we may subsample larger evaluation datasets to 1,000 entries, which may lead to slight variations when compared to other public evaluations.
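As a sketch of how such an evaluation can be reproduced with the harness's Python API (the model name is a placeholder, and the limit argument performs the subsampling described above):

```python
# Sketch: evaluating a Hugging Face model with EleutherAI's
# lm-evaluation-harness (pip install lm-eval). The model name is a
# placeholder; limit=1000 subsamples each task to 1,000 entries.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                   # Hugging Face backend
    model_args="pretrained=your-org/your-model",  # placeholder model
    tasks=["hellaswag", "arc_challenge"],
    limit=1000,
)
print(results["results"])
```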
Enterprise
Prefer to run your own fine-tuned model or need an auto-scaling setup? Connect with our experts for tailored LLM inference solutions; they'll guide you to a setup that fits your needs.
Support
Reach out to customer support for assistance with general inquiries or specific concerns.