  • Why my INT4 and INT8 KV cache quantization gave bitwise-identical perplexity

    When the standard sliding-window perplexity test produced numbers identical to fifteen decimal places, the methodology was the bug, not the quantization.

    7 min read   ·   May 04, 2026

    2026   ·   llm-inference   quantization   kv-cache   benchmarking   ·   machine-learning

  • T4 GPU + Llama: why your attention OOMs at 16K and the one-line fix

    A walk through why PyTorch's SDPA falls through to a memory-blowup path on Turing GPUs, and what FlexAttention does differently.

    15 min read   ·   May 03, 2026

    2026   ·   llm-inference   pytorch   gpu   ·   machine-learning
