4 sec.
Instant provisioning for the world's fastest deployments.
45k tokens/s
Throughput for models sized up to 14B parameters.
50% lower costs
Compared to conventional cloud providers.
import requests
from transformers import AutoTokenizer
from langchain_openai import ChatOpenAI
from cortecs_py import Cortecs  # Cortecs Python client; the import path may differ by version

client = Cortecs()

# Choose a model that supports a large context length
model_name = "cortecs/DeepSeek-R1-Distill-Qwen-32B-FP8-Dynamic"
book = requests.get("https://www.gutenberg.org/cache/epub/5200/pg5200.txt").text
question = "Based on the provided text, who is Gregor Samsa?"

tokenizer = AutoTokenizer.from_pretrained(model_name)
len_tokenized_book = len(tokenizer.encode(book))  # ~32k tokens

# Add ~1k tokens of headroom for the question and the model's output
cag_context_length = len_tokenized_book + 1000

# Provision a dedicated instance sized to the required context length
instance = client.ensure_instance(model_name, context_length=cag_context_length)

llm = ChatOpenAI(model_name=model_name, base_url=instance.base_url)
llm.invoke(book + f"\n{question}")
All models include an OpenAI-compatible endpoint, so you can seamlessly use the OpenAI clients you're already familiar with.
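For example, a minimal sketch that points the standard OpenAI client at a running instance; the api_key value is a placeholder, and model_name and instance are reused from the example above:

from openai import OpenAI

# Talk to the OpenAI-compatible endpoint exposed by the dedicated instance.
# The api_key below is a placeholder; supply your own credentials.
openai_client = OpenAI(base_url=instance.base_url, api_key="YOUR_API_KEY")

response = openai_client.chat.completions.create(
    model=model_name,
    messages=[{"role": "user", "content": "Who is Gregor Samsa?"}],
)
print(response.choices[0].message.content)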
Use an API to start and stop your models, with resources seamlessly allocated in the background.
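A minimal sketch of that lifecycle, reusing the Cortecs client from the example above; the stop call is a hypothetical method name used for illustration and may differ in the actual client API:

client = Cortecs()

# Start (or reuse) a dedicated instance; resources are allocated in the background.
instance = client.ensure_instance(model_name)

# ... run your workload against instance.base_url ...

# Shut the instance down when you are done.
# NOTE: "stop_instance" is a hypothetical name; check the client documentation
# for the actual call.
client.stop_instance(instance)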
CAG allows the context length to be adjusted dynamically, balancing efficiency and relevance by reusing the cached context across requests instead of recomputing it each time.
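Building on the example above, a sketch of sizing the context to the document at hand before provisioning; short_report is a hypothetical, much shorter input:

# Size the context to the current document instead of the model's maximum.
short_report = "..."  # hypothetical input document
required_context = len(tokenizer.encode(short_report)) + 1000
instance = client.ensure_instance(model_name, context_length=required_context)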
GDPR compliant
TLS encryption
On-demand or dedicated deployments are ideal for applications requiring reliable, low-latency performance or handling heavy workloads. They provide exclusive access to a model and its compute resources, so performance remains consistent without competing traffic from other users.
This approach is particularly effective for high-demand tasks like batch processing or cache-augmented generation (CAG).
We support any language model on Hugging Face. Please post your request to our Discord channel.
No, none of your data is stored or used for training.
Instant provisioning provides access to dedicated language models without the usual setup delays. By using a warm start, users can reach their dedicated endpoint immediately, even for large models with billions of parameters. This on-demand availability ensures rapid performance without waiting for initialization.
Enterprise
Prefer to run your own fine-tuned model? Connect with our experts for tailored LLM inference solutions.
Support
Reach out to customer support for assistance with general inquiries or specific concerns.