This document (V20200102) provides guidelines for evaluating artifacts across a range of systems and machine learning conferences and journals. We regularly update it based on our past Artifact Evaluation experience and open reproducibility discussions (2018, 2017a, 2017b, 2014, 2009), feedback from researchers (Artifact Evaluation google group, Shared Google doc), PL AE, and the ACM reviewing and badging policy, to which we contribute as a member of the ACM taskforce on reproducibility. Our goal is to converge on a common methodology, a unified Artifact Appendix with a reproducibility checklist, and an open reproducibility platform for artifact sharing, validation and reuse.

Reviewing process

Shortly after the artifact submission deadline, the AE committee members will bid on artifacts they would like to review, based on their competencies and on the information provided in the artifact abstract (such as software and hardware dependencies), while avoiding possible conflicts of interest. Within a few days, the AE chairs will make the final selection of reviewers to ensure at least two reviewers per artifact.

Reviewers will then have approximately two to three weeks to evaluate artifacts and provide a report using a dedicated artifact submission website (usually HotCRP). Reviewers are strongly encouraged to communicate any encountered issues to the authors immediately (and anonymously) via the HotCRP submission website to give the authors time to resolve all problems! Note that our philosophy of artifact evaluation is not to fail problematic artifacts but to help the authors improve their artifacts (at least the publicly available ones) and pass the evaluation!

In the end, the AE chairs will decide which of the standard ACM reproducibility badges (see below) to award to a given artifact, based on all reviews and the authors' responses. Such badges can be printed on the first page of the paper and made available as metadata in some digital libraries such as the ACM DL.

Authors and reviewers are encouraged to check the AE FAQ and use our Slack (#reproducible-research channel) and the dedicated AE google group in case of questions or suggestions.

Artifact evaluation

Reviewers will need to read the paper and then thoroughly go through the Artifact Appendix step by step to evaluate a given artifact. They should describe their experience at each stage (success or failure, encountered problems and how they were solved, and questions or suggestions to the authors), and then give a score on a scale from -1 to +1, where:

    +1) exceeded expectations
    0) met expectations (or inapplicable)
    -1) fell below expectations

The following criteria are scored as described above and, where applicable, map to the standard ACM reproducibility badges:
Artifacts available? Are all artifacts related to this paper publicly available?

Note that it is not obligatory to make artifacts publicly available!

The author-created artifacts relevant to this paper will receive an ACM "Artifacts Available" badge only if they have been placed on a publicly accessible archival repository such as Zenodo, FigShare or Dryad. A DOI will then be assigned to the artifacts and must be provided in the Artifact Appendix! (A minimal example of this archiving step follows the notes below.)

Notes:

  • ACM does not mandate the use of the above repositories. However, publisher repositories, institutional repositories, or open commercial repositories are acceptable only if they have a declared plan to enable permanent accessibility! Personal web pages, GitHub, GitLab and BitBucket are not acceptable for this purpose.
  • Artifacts do not need to have been formally evaluated in order for an article to receive this badge. In addition, they need not be complete in the sense described above. They simply need to be relevant to the study and add value beyond the text in the article. Such artifacts could be something as simple as the data from which the figures are drawn, or as complex as a complete software system under study.
  • The authors can provide the DOI at the very end of the AE process and use GitHub or any other convenient way to access their artifacts during AE.
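For illustration only, below is a minimal sketch of the archiving step using the public Zenodo REST API. The access token (read from a hypothetical ZENODO_TOKEN environment variable), the artifact.zip archive and all metadata values are placeholders; authors can just as well use the Zenodo web interface or its GitHub integration.

    # Minimal sketch: upload an artifact archive to Zenodo and obtain a DOI.
    # ZENODO_TOKEN and artifact.zip are hypothetical placeholders; endpoints
    # follow the public Zenodo REST API documentation.
    import os
    import requests

    API = "https://zenodo.org/api/deposit/depositions"
    token = {"access_token": os.environ["ZENODO_TOKEN"]}

    # 1) Create an empty deposition.
    dep = requests.post(API, params=token, json={}).json()

    # 2) Upload the artifact archive into the deposition's file bucket.
    with open("artifact.zip", "rb") as fp:
        requests.put(f"{dep['links']['bucket']}/artifact.zip", data=fp, params=token)

    # 3) Attach the minimal metadata required before publishing.
    metadata = {
        "metadata": {
            "title": "Artifact for <paper title>",
            "upload_type": "software",
            "description": "Code and data to reproduce the main results.",
            "creators": [{"name": "Lastname, Firstname"}],
        }
    }
    requests.put(f"{API}/{dep['id']}", params=token, json=metadata)

    # 4) Publish: Zenodo mints the DOI to cite in the Artifact Appendix.
    published = requests.post(f"{API}/{dep['id']}/actions/publish", params=token).json()
    print("DOI:", published["doi"])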
Artifacts functional? Is the package complete? Are all components relevant to the evaluation included in the package?

Note that proprietary artifacts need not be included. If they are required to exercise the package, then this should be documented, along with instructions on how to obtain them. Proxies for proprietary data should be included so as to demonstrate the analysis.
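As a purely illustrative sketch, such a proxy can be a synthetic data set that only mimics the schema and shape of the proprietary data; the column names, distributions and row count below are hypothetical.

    # Minimal sketch: generate a synthetic proxy for a proprietary data set so
    # that the analysis pipeline can still be exercised during evaluation.
    # The schema (columns, ranges, row count) is purely hypothetical.
    import csv
    import random

    random.seed(0)  # make the proxy deterministic for reviewers

    with open("proxy_dataset.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["request_id", "latency_ms", "cache_hit"])
        for i in range(1000):
            # Values only mimic the shape of the real (proprietary) data.
            writer.writerow([i, round(random.lognormvariate(3.0, 0.5), 2),
                             int(random.random() < 0.8)])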

The artifacts associated with the paper will receive an "Artifacts Evaluated - Functional" badge only if they are found to be documented, consistent, complete, exercisable, and include appropriate evidence of verification and validation.

We usually ask the authors to provide a small sample data set so that at least some results from the paper can be validated and the artifact is shown to be functional.
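Such a functionality check can be as simple as the following sketch, which reruns a hypothetical run_experiment.py script on a sample data set and compares one regenerated number against the value claimed in the paper within a tolerance; all file names, flags and values are placeholders.

    # Minimal sketch: check that an artifact is functional by regenerating one
    # number from a small sample data set and comparing it against the value
    # reported in the paper, within a tolerance. The run_experiment.py script,
    # file names and reference value are hypothetical.
    import json
    import subprocess

    REFERENCE_SPEEDUP = 2.4   # value claimed in the paper (hypothetical)
    TOLERANCE = 0.1           # accepted relative deviation

    # Run the authors' experiment script on the sample data set.
    subprocess.run(["python", "run_experiment.py", "--data", "sample/",
                    "--out", "results.json"], check=True)

    with open("results.json") as f:
        measured = json.load(f)["speedup"]

    deviation = abs(measured - REFERENCE_SPEEDUP) / REFERENCE_SPEEDUP
    print(f"measured={measured:.2f}, reference={REFERENCE_SPEEDUP:.2f}, "
          f"deviation={deviation:.1%}")
    assert deviation <= TOLERANCE, "sample result deviates more than expected"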

Well documented? Is there enough documentation to understand, install and evaluate the artifact?
Exercisable? Does the package include scripts and/or software to perform appropriate experiments and generate results?
Consistent? Are the artifacts relevant to the associated paper, and do they contribute in some inherent way to the generation of its main results?
Artifacts customizable and reusable?

Can other users easily reuse and customize this artifact and experimental workflow? For example, can it be used on a different platform, with different benchmarks, data sets, compilers, tools, under different conditions and parameters, etc.?

Unfortunately, the current ACM criteria for awarding this badge are quite vague. We think that such a badge should be awarded to artifacts that use a portable workflow framework such as CK with reusable automation "actions". You can find examples of reproduced papers with portable workflows and reusable actions here.

The artifacts associated with the paper will receive an "Artifacts Evaluated - Reusable" badge only if they are of a quality that significantly exceeds minimal functionality. That is, they have all the qualities of the Artifacts Evaluated - Functional level, but, in addition, they are very carefully documented and well-structured to the extent that reuse and repurposing are facilitated.

We usually ask the authors to demonstrate how their artifact can work on a different platform, in a different environment, with a different model/data set/software. We want to make sure that it is possible to reuse a given artifact in a different experimental setup.
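One simple way to support such reuse, shown in the sketch below, is to expose the experimental setup as command-line parameters so a reviewer can swap the data set, model or device. The flag names and the run() helper are purely illustrative and not taken from any specific artifact.

    # Minimal sketch: expose the experimental setup as command-line parameters
    # so reviewers can rerun the workflow with a different data set, model or
    # device. All flag names and the run() helper are hypothetical.
    import argparse

    def run(dataset: str, model: str, device: str, repetitions: int) -> None:
        # Placeholder for the actual experiment; print the configuration only.
        print(f"running {model} on {dataset} using {device}, "
              f"{repetitions} repetition(s)")

    if __name__ == "__main__":
        parser = argparse.ArgumentParser(description="Reusable experiment entry point")
        parser.add_argument("--dataset", default="imagenet-sample")
        parser.add_argument("--model", default="resnet18")
        parser.add_argument("--device", choices=["cpu", "cuda"], default="cpu")
        parser.add_argument("--repetitions", type=int, default=3)
        args = parser.parse_args()
        run(args.dataset, args.model, args.device, args.repetitions)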

Results validated? Can all main results from the paper be validated using the provided artifacts?

Report any unexpected artifact behavior (depending on the type of artifact: unexpected output, scalability issues, crashes, performance variation, etc.).

The artifacts associated with the paper will receive a "Results replicated" badge only if the main results of the paper have been obtained in a subsequent study by a person or team other than the authors, using, in part, artifacts provided by the author.

Note that variation of empirical and numerical results is tolerated. In fact, it is often unavoidable in computer systems research - see "how to report and compare empirical results" on the AE FAQ page, the SIGPLAN Empirical Evaluation Guidelines, and the NeurIPS reproducibility checklist.
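For example, a simple way to report such variation, in the spirit of the SIGPLAN guidelines, is to aggregate repeated runs into a mean with an approximate confidence interval before comparing against the paper; all measurements in the sketch below are hypothetical.

    # Minimal sketch: report variation across repeated runs instead of a single
    # number, e.g. mean and an approximate 95% confidence interval, before
    # comparing against the value reported in the paper. Measurements are
    # hypothetical.
    import statistics

    runs_ms = [182.1, 179.4, 185.0, 181.7, 183.3]   # hypothetical repeated runs
    paper_ms = 180.0                                 # value reported in the paper

    mean = statistics.mean(runs_ms)
    stdev = statistics.stdev(runs_ms)
    # Rough 95% CI using the normal approximation; use a t-distribution for
    # small sample sizes if more rigour is needed.
    half_width = 1.96 * stdev / len(runs_ms) ** 0.5

    print(f"measured: {mean:.1f} ms +/- {half_width:.1f} ms (95% CI, n={len(runs_ms)})")
    print(f"paper:    {paper_ms:.1f} ms "
          f"({(mean - paper_ms) / paper_ms:+.1%} relative difference)")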

Since it may take weeks or even months to rerun some complex experiments, such as deep learning model training, we suggest using a staged AE: we first validate that artifacts are functional before the camera-ready paper deadline, and then run a separate AE with full validation of all experimental results, with open reviewing and without strict deadlines. We successfully validated a similar approach at the ACM ASPLOS-ReQuEST'18 tournament (SW/HW co-design of Pareto-efficient deep learning) and at ADAPT'16, and we saw similar initiatives at the NeurIPS conference.

We are also working with the community on an open-source technology to enable "live" papers where experimental results are continuously validated by the community (see the RPi crowd-tuning paper and the MLPerf/MobileNets crowd-benchmarking demo).

Workflow framework used? Was any portable and customizable workflow framework used to automate the preparation and validation of experiments?

You can find examples of reproduced papers with portable workflows here.

We promote the use of portable workflow frameworks to standardize, automate and simplify the artifact evaluation process. Such artifacts can receive a special prize if arranged by the event.
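To illustrate what such automation typically provides, here is a minimal, framework-agnostic sketch with explicit prepare/run/validate stages; it is not the CK API, and all script names are hypothetical.

    # Minimal, framework-agnostic sketch of an automated artifact workflow with
    # explicit prepare / run / validate stages. This is NOT the CK API; it only
    # illustrates the kind of automation a portable workflow framework provides.
    # Script names are hypothetical.
    import subprocess
    import sys

    STAGES = [
        ("prepare",  ["python", "scripts/prepare.py"]),    # install deps, fetch data
        ("run",      ["python", "scripts/run_all.py"]),    # execute all experiments
        ("validate", ["python", "scripts/validate.py"]),   # compare with the paper
    ]

    for name, cmd in STAGES:
        print(f"==> stage: {name}")
        result = subprocess.run(cmd)
        if result.returncode != 0:
            sys.exit(f"stage '{name}' failed; see output above")
    print("all stages completed")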

See our related effort at the Supercomputing Student Cluster Competition: common artifact format, automation workflow, SCC'18 application workflow example.

Distinguished artifact? Is the artifact publicly available, functional, reproducible, portable and easily reusable? An artifact can receive the distinguished artifact award if arranged by the event.

Methodology archive

List of different versions of our artifact submission and reviewing guides to help you understand which one was used in papers with evaluated artifacts: