Infrastructure Expertise
Local AI Model Serving
I run local LLMs for privacy-sensitive VFX work, custom captioning pipelines, and development workflows where API latency or cost isn't acceptable.
Tools I Use
vLLM
Production inference for my captioning pipeline. Serving Gemma 3-12B with PagedAttention for high-throughput batch processing.
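In practice this is vLLM's offline batch API: load the model once, then push prompts through in bulk so PagedAttention keeps the GPU saturated. A minimal sketch, with placeholder prompts (the real pipeline attaches image inputs alongside them via vLLM's multimodal support):

```python
from vllm import LLM, SamplingParams

# Load once; vLLM handles batching and KV-cache paging internally.
llm = LLM(model="google/gemma-3-12b-it", gpu_memory_utilization=0.90)
params = SamplingParams(temperature=0.2, max_tokens=128)

# Placeholder prompts; the captioning pipeline pairs each one with an image.
prompts = [f"Write a one-sentence caption for render frame {i}." for i in range(1_000)]

for output in llm.generate(prompts, params):
    print(output.outputs[0].text.strip())
```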
Ollama
Quick local testing and experimentation. I use it daily for trying new models before committing to heavier setups.
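A quick test is usually just a few lines against the local daemon; a sketch with the Ollama Python client (the model tag is only an example):

```python
import ollama  # Python client; assumes the Ollama daemon is running locally

# Throwaway smoke test of a freshly pulled model before committing to a heavier setup.
response = ollama.chat(
    model="qwen2.5-coder:7b",  # example tag; swap in whatever was just pulled
    messages=[{"role": "user", "content": "Reverse a string in Python, one line."}],
)
print(response["message"]["content"])
```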
LM Studio
Sharing AI capabilities with non-technical colleagues. The GUI makes it easy for artists to interact with local models.
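LM Studio also runs an OpenAI-compatible local server, so the same models can be scripted when an artist workflow needs it; a rough sketch assuming the default port and a placeholder model name:

```python
from openai import OpenAI

# LM Studio's local server speaks the OpenAI API; 1234 is its default port.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

reply = client.chat.completions.create(
    model="local-model",  # LM Studio routes this to whichever model is loaded in the GUI
    messages=[{"role": "user", "content": "Summarise these shot notes in plain language."}],
)
print(reply.choices[0].message.content)
```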
My Setup
RTX 5090 32GB Configuration
What I Run Locally
- ✓ Gemma 3-12B for captioning (vLLM)
- ✓ Llama 3.1 70B Q4 for complex reasoning
- ✓ Qwen2.5-Coder for local code assistance
- ✓ Vision models for image analysis
Why Local Matters
- → VFX studios can't send unreleased content to cloud APIs
- → No rate limits for batch processing thousands of images
- → No network latency during development iteration
- → Fine-tuned models can't be hosted externally
Quantization Knowledge
I regularly work with different quantization formats depending on the use case:
- AWQ (vLLM production)
- GGUF (Ollama/llama.cpp)
- GPTQ (legacy models)
- bitsandbytes NF4 (QLoRA training)
- FP16/BF16 (full precision)
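As an example of the training-side format, NF4 is just a quantization config passed at model load time; a short sketch with Hugging Face Transformers and bitsandbytes (the model ID is a placeholder):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# NF4 quantization via bitsandbytes, the same config a QLoRA run starts from.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,  # second quantization pass over the quant constants
)

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder; any causal LM loads the same way
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```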
Technology Stack
vLLM, Ollama, LM Studio, llama.cpp, CUDA, AWQ, GPTQ, GGUF, FlashAttention, Docker
Expertise by Sumit Chatterjee
Industrial Light & Magic, Sydney