Why can't I get SOTA results on GPU in the cavity case?

I saw that From CPU to GPU in 80 Days - Palabos - UNIGE reports a performance of 7000 MLUPS in single precision. However, when I run this code on an A100, I only get 3500 MLUPS. What am I doing wrong?
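
To rule out a measurement mismatch, here is a minimal sketch of how I understand MLUPS (million lattice-site updates per second) to be computed: the number of lattice sites times the number of iterations, divided by the wall-clock time in microseconds. The function name `mlups` and its parameters are hypothetical and not part of Palabos; the benchmark itself may count nodes or time the run differently.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not a Palabos function.
// Returns million lattice-site updates per second for an
// nx * ny * nz domain advanced `iterations` time steps in `seconds`.
double mlups(std::int64_t nx, std::int64_t ny, std::int64_t nz,
             std::int64_t iterations, double seconds)
{
    assert(seconds > 0.0);  // elapsed wall-clock time must be positive
    // Accumulate in double to avoid integer overflow on large domains.
    const double updates = static_cast<double>(nx) * static_cast<double>(ny)
                         * static_cast<double>(nz) * static_cast<double>(iterations);
    return updates / (seconds * 1.0e6);
}
```

With placeholder numbers (not from my run), a 100 x 100 x 100 domain advanced 1000 steps in 1 second would give 1000 MLUPS. If the domain is too small to saturate the GPU, or the timed interval includes initialization or output, the reported value could come out lower than the published figure.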