Q1: Why is tiled matrix multiplication faster than non-tiled?

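A1: Tiling loads blocks of the input matrices into shared memory, so each value fetched from global memory is reused by many threads in the block. This increases data locality and reduces global memory traffic.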
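
To make the reuse argument concrete, here is a minimal CPU-side sketch in plain Python, not a GPU kernel. It assumes square-ish inputs stored as lists of lists, models "global memory" reads with a simple counter, and models "shared memory" with small local lists. The names `matmul_naive`, `matmul_tiled`, and `tile` are hypothetical and chosen only for illustration.

```python
def matmul_naive(A, B):
    """Every multiply-add reads one element of A and one of B from 'global memory'."""
    n, k, m = len(A), len(B), len(B[0])
    loads = 0  # simulated global memory reads (an assumption of this sketch)
    C = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            acc = 0.0
            for p in range(k):
                acc += A[i][p] * B[p][j]
                loads += 2
            C[i][j] = acc
    return C, loads


def matmul_tiled(A, B, tile=2):
    """Stage one tile of A and one tile of B, then reuse them for a whole output tile."""
    n, k, m = len(A), len(B), len(B[0])
    loads = 0
    C = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for p0 in range(0, k, tile):
                # Copy the tiles into local lists that stand in for shared memory.
                a_tile = [row[p0:p0 + tile] for row in A[i0:i0 + tile]]
                b_tile = [row[j0:j0 + tile] for row in B[p0:p0 + tile]]
                loads += sum(len(r) for r in a_tile) + sum(len(r) for r in b_tile)
                # All multiply-adds below read only from the staged tiles.
                for i, a_row in enumerate(a_tile):
                    for j in range(len(b_tile[0])):
                        acc = 0.0
                        for p, a in enumerate(a_row):
                            acc += a * b_tile[p][j]
                        C[i0 + i][j0 + j] += acc
    return C, loads
```

When the matrix size is a multiple of `tile`, the simulated load count in this model drops by a factor equal to the tile width: each staged value is reused `tile` times instead of being fetched again.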

Q2: Why is online softmax faster than naive softmax?

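A2: Online softmax computes the running maximum and the normalizer in a single pass over the input, instead of one pass for each. This reduces the number of memory accesses per element from 4 (3 loads, 1 store) to 3 (2 loads, 1 store).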
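
The difference in passes can be sketched in plain Python. This is a minimal illustration that assumes finite inputs in a non-empty list. The baseline here is the three-pass, max-subtracting softmax, which is the variant that matches the count of 4 accesses per element. The function names `softmax_three_pass` and `softmax_online` are hypothetical.

```python
import math


def softmax_three_pass(x):
    """Baseline: three passes over x (3 loads per element) plus 1 store."""
    m = max(x)                                   # pass 1: load x
    d = sum(math.exp(v - m) for v in x)          # pass 2: load x
    return [math.exp(v - m) / d for v in x]      # pass 3: load x, store y


def softmax_online(x):
    """Online variant: two passes over x (2 loads per element) plus 1 store."""
    m = float("-inf")  # running maximum
    d = 0.0            # running normalizer, expressed relative to m
    for v in x:                                  # pass 1: load x
        m_new = max(m, v)
        # Rescale the old sum to the new maximum, then add the new term.
        d = d * math.exp(m - m_new) + math.exp(v - m_new)
        m = m_new
    return [math.exp(v - m) / d for v in x]      # pass 2: load x, store y
```

Both functions should agree up to floating-point rounding. The online version pays for the saved pass with one extra `exp` per element, which is usually cheap compared with a memory access on bandwidth-bound hardware.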