The age of hyper-personalized software
Why I run local LLMs to power a multimodal event crawler
Run Claude Code with your own local MiniMax-M2.1 model using vLLM's native Anthropic API endpoint support.
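A minimal sketch of that wiring, assuming vLLM's Anthropic-compatible /v1/messages route and a MiniMaxAI/MiniMax-M2.1 model path (both the path and the flags are illustrative assumptions, not the article's exact command):

```bash
# Serve the model across both GPUs. Model path and flags are assumptions
# for illustration, not copied from the article.
vllm serve MiniMaxAI/MiniMax-M2.1 \
  --tensor-parallel-size 2 \
  --served-model-name minimax-m2.1

# Point Claude Code at the local server; recent vLLM builds expose an
# Anthropic-compatible /v1/messages route alongside the OpenAI-style ones.
export ANTHROPIC_BASE_URL=http://localhost:8000
export ANTHROPIC_AUTH_TOKEN=local-dummy-key   # any non-empty value for a local server
export ANTHROPIC_MODEL=minimax-m2.1
claude
```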
Step-by-step vLLM stable/nightly install on Ubuntu 24.04 for a dual RTX Pro 6000 setup (2×96 GB), a model download workflow, and a fix for tp=2 hangs (IOMMU). Includes tested serve commands for Devstral 123B, GLM-4.5/4.6V, Qwen3 235B, MiniMax-M2, and gpt-oss-120b.
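The IOMMU fix usually comes down to a kernel boot parameter. A hedged sketch for Ubuntu 24.04 with GRUB; whether iommu=pt (passthrough) or disabling the IOMMU outright is the right move depends on the platform, so treat this as an assumption rather than the article's verbatim fix:

```bash
# Symptom: vllm serve with --tensor-parallel-size 2 hangs at NCCL init,
# because peer-to-peer DMA between the two GPUs is blocked by the IOMMU.
# One common fix: put the IOMMU into passthrough mode at boot.
sudo sed -i 's/GRUB_CMDLINE_LINUX_DEFAULT="/&iommu=pt /' /etc/default/grub
sudo update-grub
sudo reboot

# After rebooting, a tp=2 launch along these lines should come up cleanly
# (model and flags illustrative, not the article's tested command):
vllm serve MiniMaxAI/MiniMax-M2 --tensor-parallel-size 2
```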
A practical guide to injecting new knowledge into LLMs through fine-tuning, using Q&A pairs generated from documentation.
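The core data artifact in that workflow is a file of Q&A pairs in chat format. A minimal sketch of what two training records might look like; the field names follow the common OpenAI-style chat schema, which is an assumption here, not necessarily the article's exact layout:

```bash
# Build a tiny chat-format JSONL training file from Q&A pairs derived
# from documentation (schema is an assumed OpenAI-style chat format).
cat > train.jsonl <<'EOF'
{"messages": [{"role": "user", "content": "What does --tensor-parallel-size do?"}, {"role": "assistant", "content": "It shards the model's weights across that many GPUs so one model can span multiple cards."}]}
{"messages": [{"role": "user", "content": "Which engine serves GGUF files?"}, {"role": "assistant", "content": "llama.cpp, via its llama-server frontend."}]}
EOF
```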
Today marks three years since ChatGPT launched. In this short article I reflect on how far LLMs have come in that time, from getting early access to GPT-4 to now running open models that surpass it, and I share two graphs: one illustrating the progress in open-weight models, the other the increasingly close race between OpenAI, Google, and Anthropic (with Google currently in the lead).
A guide to running large language models locally: hardware options, inference engines (vLLM, SGLang, llama.cpp), quantization techniques, and user interfaces.
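As a taste of one path through that stack, here is a hedged example: a 4-bit quantized GGUF model served by llama.cpp's llama-server (the model file is a placeholder, not one from the guide):

```bash
# Serve a 4-bit quantized GGUF with llama.cpp's llama-server:
# -c sets the context window, -ngl 99 offloads all layers to the GPU.
# The model path is a placeholder for illustration.
llama-server -m ./models/qwen3-32b-q4_k_m.gguf -c 8192 -ngl 99 --port 8080
# Web UI at http://localhost:8080, OpenAI-compatible API at /v1/chat/completions.
```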
From fine-tunes to founder stacks, the center of gravity is moving east.
How a small draft model can speed up LLM inference by 1.82× without sacrificing quality: benchmarking Qwen3-32B with speculative decoding.
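A hedged sketch of the setup: the draft model proposes a few tokens per step and the target model verifies them in a single forward pass, so accepted tokens always match what the target would have sampled. The flag syntax follows vLLM's --speculative-config JSON form, which has changed across versions, and the draft model here is an assumption rather than the article's benchmarked pair:

```bash
# Target model Qwen3-32B with a small draft model proposing 5 tokens per step.
# Quality is preserved because every draft token is verified against the
# target model's own distribution before being accepted.
vllm serve Qwen/Qwen3-32B \
  --speculative-config '{"model": "Qwen/Qwen3-0.6B", "num_speculative_tokens": 5}'
```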
A practical guide to renting GPUs for running open-weight LLMs with control, privacy, and flexibility.
Learn how to set up vLLM with GPT-OSS's built-in tools and expose them through LibreChat.
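A hedged sketch of the serving side, based on vLLM's public gpt-oss recipe at the time of writing; the --tool-server flag and the LibreChat wiring are assumptions from those docs, not verified against the article:

```bash
# Serve gpt-oss-120b with its built-in tools enabled via vLLM's demo tool
# server (--tool-server demo is taken from vLLM's gpt-oss recipe and may
# have changed; check the current docs).
vllm serve openai/gpt-oss-120b --tool-server demo
# LibreChat then connects to it as a custom OpenAI-compatible endpoint at
# http://localhost:8000/v1, configured in librechat.yaml.
```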