Infrastructure Expertise
Local AI Model Serving
I run local LLMs for privacy-sensitive VFX work, custom captioning pipelines, and development workflows where API latency or cost isn't acceptable.
Tools I Use
vLLM
Production inference for my captioning pipeline. Serving Gemma 3-12B with PagedAttention for high-throughput batch processing.
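In practice this is vLLM's offline batch API: load the model once, then push prompts through in bulk so PagedAttention keeps the GPU saturated. A minimal sketch, with placeholder prompts (the real pipeline attaches image inputs alongside them via vLLM's multimodal support):

```python
from vllm import LLM, SamplingParams

# Load once; vLLM handles batching and KV-cache paging internally.
llm = LLM(model="google/gemma-3-12b-it", gpu_memory_utilization=0.90)
params = SamplingParams(temperature=0.2, max_tokens=128)

# Placeholder prompts; the captioning pipeline pairs each one with an image.
prompts = [f"Write a one-sentence caption for render frame {i}." for i in range(1_000)]

for output in llm.generate(prompts, params):
    print(output.outputs[0].text.strip())
```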
Ollama
Quick local testing and experimentation. I use it daily for trying new models before committing to heavier setups.
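A quick test is usually just a few lines against the local daemon; a sketch with the Ollama Python client (the model tag is only an example):

```python
import ollama  # Python client; assumes the Ollama daemon is running locally

# Throwaway smoke test of a freshly pulled model before committing to a heavier setup.
response = ollama.chat(
    model="qwen2.5-coder:7b",  # example tag; swap in whatever was just pulled
    messages=[{"role": "user", "content": "Reverse a string in Python, one line."}],
)
print(response["message"]["content"])
```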
LM Studio
Sharing AI capabilities with non-technical colleagues. The GUI makes it easy for artists to interact with local models.
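LM Studio also runs an OpenAI-compatible local server, so the same models can be scripted when an artist workflow needs it; a rough sketch assuming the default port and a placeholder model name:

```python
from openai import OpenAI

# LM Studio's local server speaks the OpenAI API; 1234 is its default port.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

reply = client.chat.completions.create(
    model="local-model",  # LM Studio routes this to whichever model is loaded in the GUI
    messages=[{"role": "user", "content": "Summarise these shot notes in plain language."}],
)
print(reply.choices[0].message.content)
```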
My Setup
RTX 5090 32GB Configuration
What I Run Locally
- ✓ Gemma 3-12B for captioning (vLLM)
- ✓ Llama 3.1 70B Q4 for complex reasoning
- ✓ Qwen2.5-Coder for local code assistance
- ✓ Vision models for image analysis
Why Local Matters
- → VFX studios can't send unreleased content to cloud APIs
- → No rate limits for batch processing thousands of images
- → No network latency during development iteration
- → Fine-tuned models can't be hosted externally
Quantization Knowledge
I regularly work with different quantization formats depending on the use case:
- AWQ (vLLM production)
- GGUF (Ollama/llama.cpp)
- GPTQ (legacy models)
- bitsandbytes NF4 (QLoRA training)
- FP16/BF16 (full precision)
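As an example of the training-side format, NF4 is just a quantization config passed at model load time; a short sketch with Hugging Face Transformers and bitsandbytes (the model ID is a placeholder):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# NF4 quantization via bitsandbytes, the same config a QLoRA run starts from.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,  # second quantization pass over the quant constants
)

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder; any causal LM loads the same way
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```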
Technology Stack
vLLM, Ollama, LM Studio, llama.cpp, CUDA, AWQ, GPTQ, GGUF, FlashAttention, Docker
Expertise by Sumit Chatterjee
Industrial Light & Magic, Sydney