Infrastructure Expertise

Local AI Model Serving

I run local LLMs for privacy-sensitive VFX work, custom captioning pipelines, and development workflows where API latency or cost isn't acceptable.

Tools I Use

vLLM

Production inference for my captioning pipeline. Serving Gemma 3-12B with PagedAttention for high-throughput batch processing.
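
A minimal sketch of what offline batch inference with vLLM looks like; the model ID, prompts, and sampling settings here are illustrative rather than the exact production pipeline, and the real captioning path feeds images in rather than plain text:

  from vllm import LLM, SamplingParams

  # PagedAttention is vLLM's default KV-cache manager; no extra flag needed.
  llm = LLM(model="google/gemma-3-12b-it")
  params = SamplingParams(temperature=0.2, max_tokens=128)

  # Prompts are submitted as one batch so vLLM can schedule them for throughput
  # (the shot names are placeholders).
  prompts = [f"Write a one-line caption for shot {name}." for name in ["plate_001", "plate_002"]]
  for out in llm.generate(prompts, params):
      print(out.outputs[0].text.strip())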

Ollama

Quick local testing and experimentation. I use it daily for trying new models before committing to heavier setups.
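
A quick sketch of how a scripted test against the local Ollama server might look (assumes the official ollama Python package and an already-pulled model; the model name is illustrative):

  import ollama  # official Python client for the local Ollama server

  response = ollama.chat(
      model="llama3.1",  # placeholder; use whichever model has been pulled
      messages=[{"role": "user", "content": "Summarize PagedAttention in two sentences."}],
  )
  print(response["message"]["content"])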

LM Studio

Sharing AI capabilities with non-technical colleagues. The GUI makes it easy for artists to interact with local models.
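
LM Studio also exposes an OpenAI-compatible local server, which helps when an artist-facing tool needs to call the same model programmatically. A hedged sketch, assuming the default port 1234 and whatever model is loaded in the GUI:

  from openai import OpenAI

  # LM Studio's local server speaks the OpenAI API; the key is not validated.
  client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")
  reply = client.chat.completions.create(
      model="local-model",  # placeholder; LM Studio serves the model loaded in the GUI
      messages=[{"role": "user", "content": "Explain GGUF quantization in plain terms."}],
  )
  print(reply.choices[0].message.content)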

My Setup

RTX 5090 32GB Configuration

What I Run Locally

  • Gemma 3-12B for captioning (vLLM)
  • Llama 3.1 70B Q4 for complex reasoning
  • Qwen2.5-Coder for local code assistance
  • Vision models for image analysis

Why Local Matters

  • VFX studios can't send unreleased content to cloud APIs
  • No rate limits for batch processing thousands of images
  • No network latency during development iteration
  • Fine-tuned models can't be hosted externally

Quantization Knowledge

I regularly work with different quantization formats depending on the use case:

  • AWQ (vLLM production)
  • GGUF (Ollama/llama.cpp)
  • GPTQ (legacy models)
  • bitsandbytes NF4 (QLoRA training)
  • FP16/BF16 (full precision)
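
As one concrete example, a bitsandbytes NF4 load for QLoRA-style training typically looks like the sketch below (the model ID and exact settings are illustrative and vary by run):

  import torch
  from transformers import AutoModelForCausalLM, BitsAndBytesConfig

  # 4-bit NF4 quantization with double quantization, computing in bfloat16.
  bnb_config = BitsAndBytesConfig(
      load_in_4bit=True,
      bnb_4bit_quant_type="nf4",
      bnb_4bit_use_double_quant=True,
      bnb_4bit_compute_dtype=torch.bfloat16,
  )
  model = AutoModelForCausalLM.from_pretrained(
      "meta-llama/Llama-3.1-8B",  # illustrative model ID
      quantization_config=bnb_config,
      device_map="auto",
  )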

Technology Stack

vLLM, Ollama, LM Studio, llama.cpp, CUDA, AWQ, GPTQ, GGUF, FlashAttention, Docker

Expertise by Sumit Chatterjee

Industrial Light & Magic, Sydney
