In the future, we would like to move to a fully open, community-driven evaluation, which we successfully validated at ADAPT'16 - your comments and ideas are welcome!
However, our past Artifact Evaluation experience shows that the most challenging part is automating and customizing experimental workflows. It is even harder if you need to validate experiments using the latest software environment and hardware (rather than quickly outdated VM and Docker images). Currently, ad-hoc scripts are typically used to implement such workflows; they are difficult to change and customize, particularly when an evaluator would like to try other compilers, libraries and data sets.
Therefore, we decided to develop the Collective Knowledge framework (CK) - a small, portable and customizable framework that helps researchers share their artifacts as reusable Python components with a unified JSON API. This approach should help researchers quickly prototype experimental workflows (such as multi-objective autotuning) from such components while automatically detecting and resolving all required software and hardware dependencies. CK is also intended to reduce evaluators' burden by unifying statistical analysis and predictive analytics (via scikit-learn, R, DNN) and enabling interactive reports. Please see examples of a live repository, an interactive article and the PLDI'15 CLSmith artifact shared in the CK format. Feel free to contact us if you would like to use it but need help converting your artifacts into the CK format.
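To give a flavour of the unified JSON API, here is a minimal sketch in Python (assuming the "ck" package is installed); the specific action and module names are illustrative rather than taken from this page:

# Minimal sketch of calling a CK component via the unified JSON API.
# The 'detect'/'soft' action below is illustrative - any CK component
# is invoked the same way: a dict in, a dict out.
import ck.kernel as ck

# Every CK call takes a Python dict (JSON) and returns a dict with a
# 'return' code (0 on success) and, on failure, an 'error' string.
r = ck.access({'action': 'detect',
               'module_uoa': 'soft',
               'tags': 'compiler,gcc'})
if r['return'] > 0:
    print('CK error:', r['error'])
else:
    print('Compiler detected and registered as a CK environment')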
Furthermore, if your artifacts are already publicly available at the time of submission, you may profit from the "public review" option, where you engage directly with the community to discuss, evaluate and use your software. See such examples here (search for "example of public evaluation").
From our practical experience with collaborative and empirical autotuning (example), we usually perform as many repetitions as needed to "stabilize" the expected value (by analyzing a histogram of the results). But even reporting the variation of the results (for example, the standard deviation) is already a good start.
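The following sketch (not CK itself, just an illustration of the idea) repeats a measurement until the running mean stabilizes and then reports the mean and standard deviation; run_benchmark is a hypothetical function returning one timing in seconds:

# Repeat a measurement until the expected value stabilizes,
# then report mean and standard deviation.
import statistics

def measure(run_benchmark, min_reps=10, max_reps=100, tolerance=0.01):
    results = [run_benchmark() for _ in range(min_reps)]
    prev_mean = statistics.mean(results)
    while len(results) < max_reps:
        results.append(run_benchmark())
        mean = statistics.mean(results)
        # Stop when the mean changes by less than 1% between repetitions.
        if abs(mean - prev_mean) < tolerance * abs(prev_mean):
            break
        prev_mean = mean
    return statistics.mean(results), statistics.stdev(results)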
Furthermore, we strongly suggest you to pre-record results from your platform and provide a script to automatically compare new results with the pre-recorded ones preferably using expected values. This will help evaluators avoid wasting time when trying to dig out and validate results in stdout. For example, see how new results are visualized and compared against the pre-recorded ones using Collective Knowledge dashboard in this CGO'17 distinguished artifact.
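Such a comparison script can be very simple. Here is a minimal sketch (assumed, not the CK dashboard): it compares new results against pre-recorded expected values within a relative tolerance, so evaluators never have to dig through stdout; the file names and JSON layout are hypothetical:

# Compare new results against pre-recorded expected values.
import json, sys

def validate(new_file, expected_file, tolerance=0.05):
    with open(new_file) as f:
        new = json.load(f)
    with open(expected_file) as f:
        expected = json.load(f)
    failed = []
    for key, exp_value in expected.items():
        new_value = new.get(key)
        if new_value is None or abs(new_value - exp_value) > tolerance * abs(exp_value):
            failed.append((key, exp_value, new_value))
    return failed

if __name__ == '__main__':
    mismatches = validate('new_results.json', 'expected_results.json')
    for key, exp, new in mismatches:
        print('MISMATCH %s: expected %s, got %s' % (key, exp, new))
    sys.exit(1 if mismatches else 0)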