This document (V20190108) provides guidelines for reviewing artifacts. It gradually evolves to define common evaluation criteria based on our past Artifact Evaluations and open discussions, the ACM reviewing and badging policy (which we contributed to as a part of the ACM taskforce on reproducibility), artifact-eval.org and your feedback (2018a, 2018b, 2017a, 2017b, 2014).

Reviewing process

After the artifact submission deadline specific to a given event, AE reviewers will bid on the artifacts they would like to review based on the provided artifact abstract and check-list, their competencies, and their access to specific hardware and software, while avoiding possible conflicts of interest. Within a few days, AE chairs will make the final reviewer selection to ensure at least three reviewers per artifact.

Reviewers will then have approximately 2 to 3 weeks to evaluate artifacts and provide a report using a dedicated artifact submission website (usually HotCRP). Throughout this period, reviewers will communicate with authors continuously and anonymously via the submission website to resolve encountered issues quickly. Our philosophy of artifact evaluation is not to fail problematic artifacts but to help authors improve them (at least the publicly available ones) and pass evaluation!

In the end, AE chairs will decide on a set of badges (see below) to award to a given artifact based on all reviews and the authors' responses.

Artifact evaluation

Reviewers will need to read the paper and then go thoroughly through the authors' artifact appendix step by step to evaluate a given artifact. They should describe their experience at each stage (success or failure, encountered problems and how they were solved, if at all, and questions or suggestions to the authors), and then give a score on a scale from -1 to +1, where:

    +1) exceeded expectations
    0) met expectations (or inapplicable)
    -1) fell below expectations
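
The scale above can be captured in a small data structure when reviewers collect their notes. The following is only a minimal sketch: the names `Score`, `StageReport` and `summarize`, as well as the example stages, are hypothetical and are not part of any AE tooling or of HotCRP.

```python
from dataclasses import dataclass
from enum import IntEnum


class Score(IntEnum):
    """The three-point scale described in these guidelines."""
    BELOW = -1     # fell below expectations
    MET = 0        # met expectations (or inapplicable)
    EXCEEDED = 1   # exceeded expectations


@dataclass
class StageReport:
    """One reviewer's notes for one evaluation stage (hypothetical structure)."""
    stage: str
    score: Score
    notes: str = ""


def format_score(score):
    """Render a score as -1, 0 or +1, matching the notation used above."""
    value = int(score)
    return f"{value:+d}" if value != 0 else "0"


def summarize(reports):
    """Map each stage name to a short, human-readable score label."""
    return {
        r.stage: f"{format_score(r.score)} ({r.score.name.lower()})"
        for r in reports
    }


# Example stages are assumptions chosen for illustration only.
reports = [
    StageReport("installation", Score.MET, "Needed one extra package."),
    StageReport("documentation", Score.EXCEEDED),
    StageReport("result validation", Score.BELOW, "Script crashed at step 3."),
]
summary = summarize(reports)
```

Keeping free-text notes next to each score follows the request above that reviewers describe their experience at each stage, not only grade it.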

Criteria, scores and ACM reproducibility badges
Artifacts available? Are all artifacts related to this paper publicly available?

Note that it is not obligatory to make artifacts publicly available!

The author-created artifacts relevant to this paper will receive an ACM "artifact available" badge only if they have been placed on a publicly accessible archival repository such as Zenodo, FigShare or Dryad. A DOI will then be assigned to the artifacts and must be provided in the Artifact Appendix!

Notes: ACM does not mandate the use of the above repositories. However, publisher repositories, institutional repositories and open commercial repositories are acceptable only if they have a declared plan to enable permanent accessibility! Personal web pages, GitHub, GitLab and BitBucket are not acceptable for this purpose.
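
The availability rules above can be sketched as a simple check. This is an illustration under stated assumptions, not an official ACM or AE tool: the function name and its parameters are hypothetical, repositories are matched by plain name only, and we assume that a DOI is required in every accepted case.

```python
# Repositories named in these guidelines as suitable archival repositories.
ARCHIVAL = {"zenodo", "figshare", "dryad"}

# Hosting options named in these guidelines as not acceptable for this badge.
NOT_ACCEPTABLE = {"personal web page", "github", "gitlab", "bitbucket"}


def eligible_for_available_badge(repository, has_doi, has_permanence_plan=False):
    """Return True if the hosting choice appears to satisfy the rules above.

    Assumption: a DOI listed in the Artifact Appendix is required in all cases.
    """
    name = repository.strip().lower()
    if name in NOT_ACCEPTABLE:
        return False
    if name in ARCHIVAL:
        return has_doi
    # Publisher, institutional or open commercial repositories need a
    # declared plan to enable permanent accessibility.
    return has_doi and has_permanence_plan
```

For example, `eligible_for_available_badge("Zenodo", has_doi=True)` returns `True`, while the same call with `"GitHub"` returns `False` regardless of the other arguments.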

Artifacts do not need to have been formally evaluated in order for an article to receive this badge. In addition, they need not be complete in the sense described above. They simply need to be relevant to the study and add value beyond the text in the article. Such artifacts could be something as simple as the data from which the figures are drawn, or as complex as a complete software system under study.

Artifacts functional? Package complete? Are all components relevant to evaluation included in the package?

Note that proprietary artifacts need not be included. If they are required to exercise the package then this should be documented, along with instructions on how to obtain them. Proxies for proprietary data should be included so as to demonstrate the analysis.

The artifacts associated with the paper will receive an "Artifacts Evaluated - Functional" badge only if they are found to be documented, consistent, complete, exercisable, and include appropriate evidence of verification and validation.

Well documented? Enough to understand, install and evaluate artifact?
Exercisable? Includes scripts and/or software to perform appropriate experiments and generate results?
Consistent? Artifacts are relevant to the associated paper and contribute in some inherent way to the generation of its main results?
Artifacts customizable and reusable?

Can other users easily reuse and customize this artifact and experimental workflow? For example, can it be used on a different platform, with different benchmarks, data sets, compilers, tools, under different conditions and parameters, etc.?

Unfortunately, the current criteria for awarding this badge are vague. We are working with the community to develop the Collective Knowledge framework, which can help authors share their artifacts and workflows as reusable, portable and customizable components with a standard API. The idea is to award this badge automatically when authors use portable and customizable workflow frameworks (see ACM ReQuEST and SC SCC for more details).

The artifacts associated with the paper will receive an "Artifact Evaluated - Reusable" badge only if they are of a quality that significantly exceeds minimal functionality. That is, they have all the qualities of the Artifacts Evaluated - Functional level, but, in addition, they are very carefully documented and well-structured to the extent that reuse and repurposing is facilitated. In particular, norms and standards of the research community for artifacts of this type are strictly adhered to.
Results validated? Can all main results from the paper be validated using provided artifacts?

Report any unexpected artifact behavior. What counts as unexpected depends on the type of artifact: for example, unexpected output, scalability issues, crashes or performance variation.

The artifacts associated with the paper will receive a "Results replicated" badge only if the main results of the paper have been obtained in a subsequent study by a person or team other than the authors, using, in part, artifacts provided by the author.

Note that variation of empirical and numerical results is tolerated. In fact it is often unavoidable in computer systems research - see "how to report and compare empirical results" in AE FAQ!
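
As an illustration of tolerating such variation, a reviewer might compare the mean of several repeated measurements against the value reported in the paper. The sketch below is only one possible approach under stated assumptions: the function names, the 10% relative tolerance and the two-standard-deviation band are hypothetical choices for this example and are not prescribed by the AE FAQ or by ACM.

```python
import statistics


def summarize_runs(measurements):
    """Return (mean, sample standard deviation) of repeated measurements."""
    mean = statistics.mean(measurements)
    # A standard deviation needs at least two samples; use 0.0 otherwise.
    stdev = statistics.stdev(measurements) if len(measurements) > 1 else 0.0
    return mean, stdev


def within_tolerance(reported, measurements, rel_tol=0.10):
    """Check whether the measured mean is close to the reported value.

    The allowed gap is the larger of a relative tolerance on the reported
    value and two standard deviations of the reviewer's own runs. Both
    thresholds are assumptions made for this sketch.
    """
    mean, stdev = summarize_runs(measurements)
    allowed = max(rel_tol * abs(reported), 2 * stdev)
    return abs(mean - reported) <= allowed


# Hypothetical numbers: execution time in seconds.
reported_seconds = 12.0
measured_seconds = [12.4, 11.9, 12.8, 12.1, 12.6]
consistent = within_tolerance(reported_seconds, measured_seconds)
```

Whatever thresholds are chosen, it helps to state them in the review together with the number of repetitions, so that authors can see how the comparison was made.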

Since it may take months to reproduce some complex experiments (for example, to perform full training of a deep learning model), we are now discussing a staged AE: we will first validate that artifacts are functional before the camera-ready paper, and then use a separate AE with full validation of all results, based on the ACM ReQuEST tournament methodology, without strict deadlines. We plan to validate this approach at MLSys'19.

Workflow framework used? Was any portable and customizable workflow framework (such as CK) used to automate the preparation and validation of experiments? We promote the use of standard workflow frameworks to help evaluators quickly validate results in an automated and portable way. Such an artifact can receive a special prize if arranged by the event.
Distinguished artifact? Is the artifact publicly available, functional, reproducible and easily reusable? Such an artifact can receive a distinguished artifact award if arranged by the event.

Methodology archive

We keep track of past submission and reviewing methodologies to let readers understand which one was used for the papers with evaluated artifacts. Also see the original AE procedures for programming language conferences.