Q1: Why is tiled matrix multiplication faster than non-tiled?
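A1: Tiling stages blocks of the input matrices in fast on-chip shared memory, so each value fetched from global memory is reused by many threads instead of being fetched again. This increases data locality and reduces global memory traffic: with T x T tiles, the number of global loads drops by roughly a factor of T.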
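
The reuse argument can be checked with a small CPU-side sketch. This is not GPU code: it only simulates the access pattern in plain Python and counts how many values are read from the "global" matrices. The function names, the square-matrix assumption, and the load counter are all illustrative assumptions, not part of the original notes.

```python
def matmul_naive(A, B):
    """Non-tiled product: every multiply reads A and B from 'global memory'."""
    n = len(A)
    loads = 0
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = 0.0
            for k in range(n):
                acc += A[i][k] * B[k][j]
                loads += 2  # one read of A, one read of B
            C[i][j] = acc
    return C, loads


def matmul_tiled(A, B, tile):
    """Tiled product: copy a tile once, then reuse it for tile*tile outputs."""
    n = len(A)
    assert n % tile == 0, "sketch assumes n is a multiple of the tile size"
    loads = 0
    C = [[0.0] * n for _ in range(n)]
    for bi in range(0, n, tile):
        for bj in range(0, n, tile):
            for bk in range(0, n, tile):
                # Stand-in for shared memory: each element is read once per tile.
                a_tile = [[A[bi + i][bk + k] for k in range(tile)] for i in range(tile)]
                b_tile = [[B[bk + k][bj + j] for j in range(tile)] for k in range(tile)]
                loads += 2 * tile * tile
                for i in range(tile):
                    for j in range(tile):
                        acc = C[bi + i][bj + j]
                        for k in range(tile):
                            acc += a_tile[i][k] * b_tile[k][j]  # no global reads here
                        C[bi + i][bj + j] = acc
    return C, loads
```

For n x n inputs the naive version performs 2n^3 global reads, while the tiled version performs 2n^3 / tile, which is where the factor-of-T saving comes from. On a real GPU the tile size is bounded by the shared memory available per block.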
Q2: Why is online softmax faster than naive softmax?
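A2: Online softmax computes the running maximum and the normalization sum in a single fused pass instead of two separate passes. Compared with the numerically stable (safe) three-pass softmax, this cuts memory accesses per element from 4 (3 loads + 1 store) to 3 (2 loads + 1 store).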
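
A minimal sketch of the two variants, assuming a 1-D list of finite floats. The function names are illustrative, and each loop over `x` stands in for one pass over memory; the comments mark where the loads and the store would occur.

```python
import math


def safe_softmax(x):
    """Three passes over x: max, sum of exponentials, normalize."""
    m = max(x)                                # pass 1: load x
    d = sum(math.exp(v - m) for v in x)       # pass 2: load x
    return [math.exp(v - m) / d for v in x]   # pass 3: load x, store y


def online_softmax(x):
    """Two passes over x: fused max and sum, then normalize."""
    m = float("-inf")  # running maximum
    d = 0.0            # running sum, always expressed relative to m
    for v in x:                               # pass 1: load x
        m_new = max(m, v)
        # Rescale the old sum to the new maximum before adding the new term.
        d = d * math.exp(m - m_new) + math.exp(v - m_new)
        m = m_new
    return [math.exp(v - m) / d for v in x]   # pass 2: load x, store y
```

Both functions return the same values up to floating-point rounding. The rescaling factor `exp(m - m_new)` is what allows the sum to be accumulated before the final maximum is known.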