Performance Reality Check


Intro

Scott Meyers gave an excellent talk on the importance of CPU caches - one that I can’t seem to find anymore. Here are my go-to notes from his talk, to serve as a reminder for me when designing programs. The bottom line is that if your code is not cache friendly, it will perform horribly. Period.

Here are the tips to make your code cache friendly.

General

  • Small code and small data; both will fit in cache.
  • Locality counts - stay in cache for both data and code, in both the temporal and the spatial sense.
  • Be prefetch friendly; that is, use predictable access patterns (linear, forward or backward).
  • Prefetch friendly implies contiguously allocated structures - aka arrays.
  • Where practical, employ linear traversal.
  • Use as much of the cache line as possible (when you read the data, use as much of it as possible for as long as possible).
  • Use padding in data structures to avoid cache-associativity issues arising from powers of 2 in size or stride.
  • Ref: cache-oblivious algorithms; they work by divide and conquer until the data set fits in cache.
  • Avoid false sharing. Be on the lookout for it, especially when code fails to scale up when it should have.
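To make the linear-traversal and prefetch points concrete, here is a minimal sketch (my own example, not from the talk) summing the same contiguous buffer two ways. The row-major walk is stride-1 and prefetch friendly; the column-major walk jumps a whole row per access and can touch a new cache line every step.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major traversal of a row-major buffer: memory is walked
// linearly, so the hardware prefetcher can stream cache lines
// ahead of the loop.
long sum_row_major(const std::vector<int>& m, std::size_t rows, std::size_t cols) {
    long total = 0;
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            total += m[r * cols + c];   // stride 1: cache/prefetch friendly
    return total;
}

// Column-major traversal of the same buffer: each access jumps
// `cols` elements, so for large matrices nearly every access can
// miss and the prefetcher has little to work with.
long sum_col_major(const std::vector<int>& m, std::size_t rows, std::size_t cols) {
    long total = 0;
    for (std::size_t c = 0; c < cols; ++c)
        for (std::size_t r = 0; r < rows; ++r)
            total += m[r * cols + c];   // stride `cols`: cache hostile
    return total;
}
```

Both functions compute the same sum; only the access pattern differs, and on large inputs that difference alone dominates the runtime.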

False sharing: data on the same cache line being written to by multiple threads. Reading it is not a problem, but a write by one thread invalidates the whole cache line, causing cache misses for the other threads too, even though they were working on their own data.

reason: The data was on the same cache line.

hint: Keep different threads’ data apart; stack data used by threads will most likely not be shared, since each thread’s stack lives in a different area of memory.
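A minimal sketch of the padding fix (my example; `PaddedCounter` and the 64-byte line size are assumptions - 64 is typical on x86/ARM, and C++17’s `std::hardware_destructive_interference_size` reports it portably). Each thread increments its own counter; `alignas(64)` puts each counter on its own cache line so the writes do not invalidate each other.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Without the alignment, adjacent counters would pack onto one
// 64-byte cache line and every write by one thread would
// invalidate that line in the other cores' caches (false sharing).
struct alignas(64) PaddedCounter {
    std::atomic<long> value{0};
    // alignas(64) pads the struct so each counter starts on its own line.
};

long count_in_parallel(int n_threads, long iters) {
    std::vector<PaddedCounter> counters(n_threads);
    std::vector<std::thread> workers;
    for (int t = 0; t < n_threads; ++t)
        workers.emplace_back([&counters, t, iters] {
            for (long i = 0; i < iters; ++i)
                counters[t].value.fetch_add(1, std::memory_order_relaxed);
        });
    for (auto& w : workers) w.join();
    long total = 0;
    for (auto& c : counters) total += c.value.load();
    return total;
}
```

Drop the `alignas(64)` and the result is still correct - only much slower under contention, which is exactly why false sharing is easy to miss until you look at scaling.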

For code

  • Branches cause cache misses and poor utilization of the cache line. They kill performance. Check odd-ball cases up front via non-inline functions, then make your fast-path code branch-free.

  • Adopt Data Oriented Design:

    • Design your code based on which data structures you would use.
    • The data structures to use are linear-traversal structures that exploit locality of reference.
    • Use as much of the cache line for as long as possible. Don’t read and discard. (spatial + temporal locality).
    • e.g., remove the typical isAlive() boolean flag from the object and move it into a separate array.
    • Use heterogeneous structures, or at least sort homogeneous structures by type, to be i-cache friendly. (I need a refresher on this last point; I can’t seem to recall what was being implied here.)
  • Use WPO (whole program optimization) and PGO (profile guided optimization). WPO is general purpose; PGO is very data-specific. PGO requires instrumented code to be run on actual data, followed by a recompile based on the results. If your data is very specific, PGO is worth doing. All major compilers support both.
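A minimal sketch of the isAlive() point above (my example; `Entity` and its fields are hypothetical). In the OOP layout the flag travels with the whole object, so a "who is alive?" scan drags every field through the cache; pulled out into its own dense array, a 64-byte line holds 64 flags and the scan touches a tiny fraction of the memory.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// OOP layout: scanning `alive` here loads the entire object
// (well over 100 bytes) per entity just to read one bool.
struct Entity {
    float pos[3]{};
    float vel[3]{};
    char  payload[100]{};   // hypothetical bulk state
    bool  alive = false;
};

// DOD layout: the flags live in their own contiguous array,
// scanned linearly and branch-free.
std::size_t count_alive(const std::vector<unsigned char>& alive_flags) {
    std::size_t n = 0;
    for (unsigned char f : alive_flags) n += f;  // stride-1, no branches
    return n;
}
```

The same idea generalizes: split hot, frequently-scanned fields out of cold objects so the hot loop only streams the data it actually uses.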

Aside

  • Data oriented design lends itself well to cache-friendly code.
  • Now you understand why loop unrolling works. (It is the fast-path code without branches that fits in the i-cache).
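For reference, a sketch of what unrolling looks like by hand (my example - compilers do this automatically at -O2/-O3; this only illustrates the idea). The unrolled body is straight-line code with one loop branch per four elements instead of one per element.

```cpp
#include <cassert>
#include <cstddef>

// 4-way manual unroll: fewer branches per element, and a
// straight-line fast path that sits compactly in the i-cache.
long sum_unrolled(const int* a, std::size_t n) {
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {   // one loop branch per 4 elements
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    long total = s0 + s1 + s2 + s3;
    for (; i < n; ++i) total += a[i];  // remainder (fewer than 4 left)
    return total;
}
```

The four independent accumulators also give the CPU instruction-level parallelism; they are summed only once at the end.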