Surprisingly, standard self-attention implementations are bottlenecked not by compute but by inefficient memory access patterns.
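
To make the memory claim concrete, here is a rough back-of-the-envelope sketch in Python. It assumes a naive, unfused forward pass for a single head that writes the N x N score matrix and the N x N softmax output to GPU memory and reads each back once. The function name `attention_hbm_traffic` and the example sizes are illustrative assumptions, and the count ignores caching, masking, and dropout.

```python
def attention_hbm_traffic(seq_len, head_dim, bytes_per_elem=2):
    """Rough per-head byte counts for a naive (unfused) attention forward pass."""
    # Q, K and V are each read once and the output O is written once:
    # 4 * N * d elements in total.
    io_bytes = 4 * seq_len * head_dim * bytes_per_elem
    # The N x N score matrix is written, then re-read by softmax; the N x N
    # probability matrix is written, then re-read for the product with V:
    # 4 * N * N elements in total.
    intermediate_bytes = 4 * seq_len * seq_len * bytes_per_elem
    return io_bytes, intermediate_bytes


# Hypothetical sizes: 4096 tokens, head dimension 64, 2-byte (fp16) elements.
io_bytes, intermediate_bytes = attention_hbm_traffic(seq_len=4096, head_dim=64)
print(f"Q/K/V/O traffic:     {io_bytes / 2**20:.0f} MiB")
print(f"N x N intermediates: {intermediate_bytes / 2**20:.0f} MiB")
print(f"ratio:               {intermediate_bytes // io_bytes}x")
```

Under these assumptions the intermediate matrices account for N/d times more traffic than the inputs and output combined (64x for the sizes above), which is the traffic a fused, tiled kernel tries to avoid.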
Very neat and clean explanation of FlashAttention for beginners.