Friday, January 15, 2010

Performance Optimization for Embedded Systems

Some key points:
  • Place as much code and data as possible in internal (L1) memory;
  • Compiler or linker options that optimize for code size can sometimes give better performance than options that optimize for speed, because smaller code fits better in internal memory and the instruction cache (GreenHills has a very nice tool to help find the best compromise between optimizing for speed and for size). The main options to adjust:
    • Optimize for speed
    • Optimize for size
    • Remove unused functions
    • Remove debug information
    • Enable code cache
  • Carefully tune the use of the code cache and data cache. The performance difference between a fine-tuned layout and the default can be 10 times or more;
  • Prefer integer or fixed-point arithmetic over float, and float over double. If floating-point computation is necessary but double precision is not needed, remember that all constants MUST be explicitly declared as single precision; otherwise there may be a lot of double-precision computation and float-to-double conversions. For example, write y = x * 2.0f; instead of y = x * 2.0;
  • A bit shift is usually no slower than an add, an add is usually faster than a multiply, and a multiply is faster than a divide. So the following tricks are usually helpful:
    • Use left and right shifts instead of multiplying and dividing by 2, 4, 8, 16, 32, ... (integers only; for negative signed values a right shift is not equivalent to division);
    • Use adds instead of multiplying by 2 or 3 (x + x, x + x + x);
    • If the same divisor is used many times, precompute its reciprocal and multiply by it instead;
  • Blackfin (architecture-specific): assign data to L1 data banks A and B properly so that two data fetches can proceed in parallel;
  • The profiler is your friend: use it to find the biggest consumers of cycles and focus on them;
  • In C++: avoid deep class hierarchies, because they take up precious memory, which in turn can seriously hurt performance;
  • Be careful when using the C runtime library: a single call to printf can easily add 1 KB to the memory footprint.
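
For the compiler and linker options above, the GNU toolchain equivalents would be -O2 or -O3 (speed), -Os (size), -ffunction-sections -fdata-sections together with the linker flag -Wl,--gc-sections (remove unused functions and data), and building without -g (debug information); GreenHills and other vendors use their own option names. GCC also lets you make the speed/size trade-off per function. The sketch below assumes GCC or a compatible compiler, and init_table and checksum are hypothetical names: run-once start-up code is marked cold (optimized for size), while the inner loop is marked hot (optimized more aggressively).

```c
#include <stdint.h>

/* cold: GCC optimizes this function for size and, on many targets, groups
 * it with other cold code so it stays out of the way of the hot path.
 * Hypothetical example: a table built once at start-up. */
__attribute__((cold)) void init_table(int16_t *table, int n)
{
    for (int i = 0; i < n; ++i)
        table[i] = (int16_t)(i * i);
}

/* hot: GCC optimizes this function more aggressively and, on many targets,
 * groups it with other hot functions, which helps instruction-cache locality.
 * Hypothetical example: a checksum run on every packet. */
__attribute__((hot)) uint32_t checksum(const uint8_t *buf, uint32_t len)
{
    uint32_t sum = 0;
    for (uint32_t i = 0; i < len; ++i)
        sum += buf[i];
    return sum;
}
```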
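
For the cache-tuning point, here is a minimal illustration of why data layout matters (not a tuning recipe for any particular chip). Both hypothetical functions compute the same sum over a 64 x 64 matrix, but only the first walks memory in the order C lays it out.

```c
#include <stddef.h>
#include <stdint.h>

#define ROWS 64
#define COLS 64

/* C stores 2-D arrays row by row, so walking along a row touches
 * consecutive addresses: each cache line fetched is fully used before
 * the next one is needed. */
int32_t sum_row_major(int16_t m[ROWS][COLS])
{
    int32_t sum = 0;
    for (size_t r = 0; r < ROWS; ++r)
        for (size_t c = 0; c < COLS; ++c)
            sum += m[r][c];
    return sum;
}

/* Same result, but consecutive accesses are COLS * sizeof(int16_t) bytes
 * apart. Once the matrix no longer fits in the data cache, this order can
 * miss far more often than the row-major version. */
int32_t sum_col_major(int16_t m[ROWS][COLS])
{
    int32_t sum = 0;
    for (size_t c = 0; c < COLS; ++c)
        for (size_t r = 0; r < ROWS; ++r)
            sum += m[r][c];
    return sum;
}
```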
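
To illustrate the float-constant point: the two hypothetical functions below compute the same thing, but in the first one the unsuffixed literals are doubles, so x is promoted and the whole expression is evaluated in double precision before being converted back to float. Calling sin() instead of sinf() on a float has the same effect. If you use GCC, -Wdouble-promotion warns about such implicit promotions, and -fsingle-precision-constant treats unsuffixed floating constants as float.

```c
/* Double literals: x is promoted to double, the multiply and add run in
 * double precision (a slow software routine on cores without a
 * double-precision FPU), and the result is converted back to float. */
float scale_slow(float x)
{
    return x * 2.5 + 1.0;
}

/* f-suffixed literals: everything stays in single precision. */
float scale_fast(float x)
{
    return x * 2.5f + 1.0f;
}
```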
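
Fixed point, the first preference in the arithmetic point above, can be sketched with Q15, a common 16-bit format on DSPs where an int16_t represents value / 32768. The names q15_t, Q15_ONE_HALF and q15_mul are chosen for illustration; many DSPs offer a saturating fractional multiply in hardware, which a vendor intrinsic would use instead of this portable version.

```c
#include <stdint.h>

/* Q15: signed 16-bit integer representing value / 32768, range [-1.0, 1.0). */
typedef int16_t q15_t;

#define Q15_ONE_HALF ((q15_t)16384)  /* 0.5 in Q15 */

/* Multiply two Q15 numbers using only integer operations. The 32-bit product
 * is in Q30; shifting right by 15 brings it back to Q15 (assumes >> on a
 * negative value is an arithmetic shift, as on mainstream compilers). The only
 * product that overflows is -1.0 * -1.0, which is saturated to the largest
 * positive Q15 value. */
static inline q15_t q15_mul(q15_t a, q15_t b)
{
    int32_t p = ((int32_t)a * b) >> 15;
    if (p > INT16_MAX)
        p = INT16_MAX;
    return (q15_t)p;
}
```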
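
The shift-and-add tricks look like this in C (a sketch on unsigned 32-bit values; the function names are hypothetical). Modern optimizing compilers usually perform this strength reduction themselves when the constant is known at compile time, so it is worth checking the generated assembly before rewriting code by hand.

```c
#include <stdint.h>

/* Multiply/divide by powers of two with shifts. Shown on unsigned values:
 * for negative signed values, x >> 3 rounds toward minus infinity on
 * typical compilers while x / 8 rounds toward zero, so the two differ. */
static inline uint32_t mul_by_8(uint32_t x) { return x << 3; }
static inline uint32_t div_by_8(uint32_t x) { return x >> 3; }

/* Multiply by small constants with adds (and shifts). */
static inline uint32_t mul_by_2(uint32_t x)  { return x + x; }
static inline uint32_t mul_by_3(uint32_t x)  { return x + x + x; }
static inline uint32_t mul_by_10(uint32_t x) { return (x << 3) + (x << 1); } /* 8x + 2x */
```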
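
Precomputing the reciprocal can be sketched with a hypothetical routine that divides a float array by one divisor. Compilers generally do not make this transformation on their own for floating-point data, because the result can differ slightly from true division; GCC only does it under -freciprocal-math (which -ffast-math enables).

```c
#include <stddef.h>

/* Divides every element by the same divisor d. One division computes the
 * reciprocal; the loop then uses only multiplications. Results can differ
 * from x[i] / d in the last bit, because 1/d is itself rounded. */
void scale_by_inverse(float *x, size_t n, float d)
{
    const float inv_d = 1.0f / d;   /* one divide, outside the loop */
    for (size_t i = 0; i < n; ++i)
        x[i] *= inv_d;              /* n multiplies instead of n divides */
}
```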
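
For the L1 memory and Blackfin bank points, a sketch assuming the GCC Blackfin port, whose l1_text, l1_data_A and l1_data_B attributes place a function in L1 instruction SRAM and variables in L1 data bank A or B. Other toolchains use their own pragmas or linker-file sections for this, so check their manuals. The IN_L1_* macros, fir_dot and the arrays are hypothetical; on a non-Blackfin compiler the macros expand to nothing, so the code still builds but without any placement benefit.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical portability macros: on the GCC Blackfin port (__bfin__),
 * l1_data_A / l1_data_B put a variable in L1 data bank A / B and l1_text
 * puts a function in L1 instruction SRAM. Elsewhere they expand to nothing. */
#if defined(__bfin__) && defined(__GNUC__)
#define IN_L1_BANK_A __attribute__((l1_data_A))
#define IN_L1_BANK_B __attribute__((l1_data_B))
#define IN_L1_CODE   __attribute__((l1_text))
#else
#define IN_L1_BANK_A
#define IN_L1_BANK_B
#define IN_L1_CODE
#endif

#define N_TAPS 8

/* Coefficients in bank A, samples in bank B: each loop iteration reads one
 * value from each array, and keeping them in different banks avoids a bank
 * conflict so both loads can be issued in parallel. */
IN_L1_BANK_A int16_t coeffs[N_TAPS] = {1, 2, 3, 4, 4, 3, 2, 1};
IN_L1_BANK_B int16_t samples[N_TAPS];

/* The hot inner loop itself is placed in L1 instruction memory. */
IN_L1_CODE int32_t fir_dot(const int16_t *c, const int16_t *x, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; ++i)
        acc += (int32_t)c[i] * x[i];
    return acc;
}
```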
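
One way to stay clear of printf when all you need is to print numbers is a small hand-written formatter like the hypothetical u32_to_dec below; the resulting string can then be sent with whatever low-level output routine the board provides. This is only a sketch, and the actual savings depend on your C library and linker settings.

```c
#include <stddef.h>
#include <stdint.h>

/* Formats v as decimal into buf, which must hold at least 11 bytes
 * (10 digits for a 32-bit value plus the terminator). Returns the number
 * of digits written. */
size_t u32_to_dec(uint32_t v, char *buf)
{
    char tmp[10];
    size_t n = 0;
    do {
        tmp[n++] = (char)('0' + v % 10);  /* least significant digit first */
        v /= 10;
    } while (v != 0);
    for (size_t i = 0; i < n; ++i)        /* reverse into the output buffer */
        buf[i] = tmp[n - 1 - i];
    buf[n] = '\0';
    return n;
}
```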
