Reproducible execution is a very complex problem: it depends on the OS, environment, running services and programs, system state, cache state and cache sharing, CPU frequency, pinning, etc. As a first step, we simply execute code using a system call and plan to gradually add parameters to stabilize or explain execution variation (the current pipeline contains modules to check the execution time distribution and apply several off-the-shelf normality tests).
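As a rough illustration of this first step, the sketch below runs a command several times through a system call, records the wall-clock time of each run, and applies one normality test to the timings. It is not the pipeline's actual module: the function names `time_command` and `jarque_bera` are hypothetical, and the Jarque-Bera test is only one example of an off-the-shelf normality test, chosen here because it needs nothing beyond the standard library. Its p-value is asymptotic and unreliable for small samples.

```python
import math
import statistics
import subprocess
import sys
import time


def time_command(cmd, repetitions=10):
    """Run cmd repeatedly; return the wall-clock time of each run in seconds."""
    timings = []
    for _ in range(repetitions):
        start = time.perf_counter()
        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
        timings.append(time.perf_counter() - start)
    return timings


def jarque_bera(samples):
    """Return (statistic, p_value) of the Jarque-Bera normality test.

    The statistic is asymptotically chi-squared with 2 degrees of freedom,
    whose survival function is exp(-x / 2).
    """
    n = len(samples)
    mean = sum(samples) / n
    m2 = sum((x - mean) ** 2 for x in samples) / n  # biased variance
    if m2 == 0:
        raise ValueError("all samples are identical; the test is undefined")
    m3 = sum((x - mean) ** 3 for x in samples) / n
    m4 = sum((x - mean) ** 4 for x in samples) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    statistic = n / 6 * (skewness ** 2 + (kurtosis - 3) ** 2 / 4)
    return statistic, math.exp(-statistic / 2)


if __name__ == "__main__":
    # Hypothetical workload: start the Python interpreter and exit immediately.
    timings = time_command([sys.executable, "-c", "pass"], repetitions=20)
    print("mean  = %.6f s" % statistics.mean(timings))
    print("stdev = %.6f s" % statistics.stdev(timings))
    print("Jarque-Bera: statistic = %.3f, p-value = %.3f" % jarque_bera(timings))
```

A small p-value suggests that the timings are not normally distributed (for example, several distinct modes caused by frequency changes or interference), in which case a mean and standard deviation alone may describe the execution time poorly.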
We should check/add the following:
- running the same or multiple codes in parallel (to analyze execution interference, saturate memory bandwidth, etc.)
- multi-threading in Python (see the discussion of running a subprocess with a timeout): http://stackoverflow.com/questions/1191374/subprocess-with-timeout
- pinning code, processes and threads to given cores (for contention analysis, etc.)
- cleaning cache or emulating cold/hot cache statistically
- monitoring/adapting frequency at run-time (to balance performance vs power)
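A minimal sketch of how the first three items might be prototyped together: it starts several copies of a command at once, pins each copy to one core, and kills any copy that exceeds a timeout. The function name `run_pinned_parallel` and its parameters are hypothetical. Pinning relies on `os.sched_setaffinity`, which is available on Linux only, so the sketch skips pinning elsewhere. The timeout is handled by polling rather than by extra threads, and completion times are therefore only accurate to about the polling interval.

```python
import functools
import os
import subprocess
import sys
import time


def run_pinned_parallel(cmd, copies=2, timeout=60.0, poll_interval=0.001):
    """Start `copies` instances of cmd at once, each pinned to one core.

    Returns a list of (core, return_code, seconds) in launch order.
    return_code is None for a copy killed after exceeding the timeout;
    core is None when pinning is not supported on this platform.
    """
    can_pin = hasattr(os, "sched_setaffinity")  # Linux only
    # Only use cores this process is allowed to run on.
    cores = sorted(os.sched_getaffinity(0)) if can_pin else [None]

    pending = {}
    for i in range(copies):
        core = cores[i % len(cores)]  # cores are reused if copies > cores
        # preexec_fn runs in the child before exec, so the copy never
        # executes unpinned.
        pin = functools.partial(os.sched_setaffinity, 0, {core}) if can_pin else None
        start = time.perf_counter()
        proc = subprocess.Popen(cmd, stdout=subprocess.DEVNULL, preexec_fn=pin)
        pending[i] = (core, proc, start)

    deadline = time.perf_counter() + timeout
    results = [None] * copies
    while pending:
        for i, (core, proc, start) in list(pending.items()):
            return_code = proc.poll()
            now = time.perf_counter()
            if return_code is None and now < deadline:
                continue  # still running and within the time limit
            if return_code is None:
                proc.kill()  # timed out
                proc.wait()  # reap the killed process
            results[i] = (core, return_code, now - start)
            del pending[i]
        if pending:
            time.sleep(poll_interval)
    return results


if __name__ == "__main__":
    # Hypothetical workload: a short busy loop in a fresh interpreter.
    workload = [sys.executable, "-c", "sum(range(10**6))"]
    for core, code, seconds in run_pinned_parallel(workload, copies=2, timeout=30.0):
        print("core=%s return_code=%s time=%.4f s" % (core, code, seconds))
```

Comparing the times of one copy running alone against several copies running together on neighbouring cores gives a first estimate of interference. Cache cleaning and frequency control are not covered here, since they typically require elevated privileges and platform-specific interfaces.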