Collective Mind (cTuning3 repository and infrastructure)

Disclaimer

This documentation is written by Grigori Fursin and is gradually evolving together with the project (until the stable version is released). Please contact us if you notice any mistakes or inconsistencies, so that we can collaboratively improve this document!

What is Collective Mind infrastructure, repository and methodology?

C-mind-picture.png

Collective Mind (cM or cMind) is a light-weight, portable, modular and extensible NoSQL infrastructure and repository for collective research and development through collaborative knowledge discovery, systematization, sharing and reuse. It is currently used to understand the complex behavior of computer systems through collaborative decomposition of these systems into simple and semantically connected modules with defined interfaces and associated data. cMind makes it possible to preserve the whole experimental setup with any associated artifacts (codelets, benchmarks, datasets, tools, models, etc) and with extensible meta-data, helping the community to distribute and reproduce experiments. Having enough knowledge about existing computer systems should allow scientists and engineers to build more efficient computer systems in terms of performance, power consumption, code size, compilation time and other characteristics. It should also help the community adopt a new research methodology and publication model in which experimental results and all associated artifacts are continuously shared, validated and extended by the community, improving the quality and reproducibility of computer engineering, which is often seen as "hacking".

Modules serve as wrappers around all objects in the system (such as compilers, profilers, run-time systems, architectures, programs, predictive models, etc) to describe and record information flow in the system. Modules can either transparently pass information along or perform various actions such as recording, retrieving, analyzing, searching and visualizing it. Modules store information in associated data entries and provide an API to access this data.
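
To make this concrete, below is a minimal conceptual sketch in Python (cMind's implementation language) of such a module-as-wrapper. The class name, action names and dictionary keys are illustrative assumptions for this document, not the actual cMind API:

  # A hypothetical module wrapping 'dataset' objects: it exposes named actions,
  # receives all input as one dictionary and returns one dictionary, so the
  # information flow through the system stays uniform and easy to record.
  import json

  class DatasetModule:
      def __init__(self, meta_file):
          self.meta_file = meta_file          # JSON file of the associated data entry

      def run_action(self, action, params):
          if action == "record":              # record information passing through
              with open(self.meta_file, "w") as f:
                  json.dump(params, f, indent=2)
              return {"return": 0}
          if action == "retrieve":            # retrieve previously recorded data
              with open(self.meta_file) as f:
                  return {"return": 0, "data": json.load(f)}
          return {"return": 1, "error": "unknown action: " + action}

  # Usage: record and retrieve an entry through the same unified interface.
  m = DatasetModule("dataset-entry.json")
  m.run_action("record", {"tags": ["image"], "file": "data.jpg"})
  print(m.run_action("retrieve", {}))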

Modules and data always have unique IDs and possibly aliases (together referred to as a UOA, i.e. UID or Alias) and are packed into repositories. Repositories can be accessed locally or remotely through web services.

We hope that the community will gradually and collaboratively decompose complex systems and provide more connected modules that describe the properties, characteristics and choices for the information that passes through them. Having a common infrastructure allows the community to share data (applications, datasets, codelets and architecture descriptions), modules (classification, predictive modeling, run-time adaptation) and statistics about the behavior of computer systems. It should allow the community to quickly reproduce and validate existing results, and to focus their effort on learning the relationships between modules, data, properties, characteristics and choices using data mining, classification and predictive modeling.

What are cMind's key features?

  • General features:
    • has minimal third-party dependencies (only Python for the minimal version) and works out of the box on practically any Linux/Unix and Windows distribution (Android is supported by a light-weight cMind client)
    • uses its own simple directory/file-based NoSQL repository that can keep the whole experimental setup with any associated research artifact and with extensible meta-data (benchmarks, datasets, tools, code, models, architecture and compiler descriptions, etc)
    • any research artifact or interactive graph can be connected with (possibly interactive) publications to be easily validated by the community
    • can be easily used through a customizable Web-based interface and Web services or from CMD
    • uses plugins (modules/components) to implement all cMind functionality, from the cMind kernel to predictive modeling and online application tuning
    • uses simple, text-based, human-readable standard JSON as its internal data format, which can be edited directly (see the example after this list)
    • can easily keep archives, traces or any other files in repository entries, each meta-described by a JSON entry description file
    • uses the popular Python as its main high-level language, with the OpenME library to connect to other languages such as PHP, C, C++, Fortran, Java, etc.
    • uses the powerful and simple Java-based ElasticSearch framework for transparent data indexing to enable fast and complex queries
    • has a user-friendly web interface (implemented as cMind plugins)
    • the default web interface uses plain HTML and is thus very portable (we plan to extend it with AJAX and JavaScript later)
    • has unified web services to crowd-source auto-tuning
    • enables standalone customized web-based repositories with user self-registration and data access control (read/write/comments)
    • has data locking mechanism for collective (parallel) data aggregation (updates)
    • supports any number of local or remote repositories (access to remote repositories is transparent through unified web services)
    • can be deployed as extensible and customizable public or private (in-house company) online repository
    • can be easily deployed in cloud/grid environments to automate experiments
    • supports packages and unified installation with dependencies for full reproducibility of environment and for co-existence of multiple versions of tools (such as compilers)
    • supports OpenME interface to open up LLVM, GCC, Open64 and other compilers for interactive and on-the-fly fine-grain analysis and tuning
    • data (experiments, models, etc) can be gradually classified and ranked
  • Collaborative computer system analysis, characterization and optimization:
    • has an extensible experimental pipeline for program/architecture auto-tuning with properties and characteristics already exposed at multiple levels (supporting multiple compilers including GCC, LLVM, Open64, ICC and PathScale; profiling tools including perf, likwid and VTune; the OpenME plugin framework for online analysis and auto-tuning, particularly for CUDA and OpenCL codes on heterogeneous architectures; tools for monitoring and changing frequency, etc)
    • can serve as a distributed performance tracking and tuning buildbot for user applications and compilers including GCC and LLVM
    • has powerful graph plotters (for multidimensional spaces) with export functions to PNG, PDF and EPS
    • has data export functions to CSV, JSON and TXT
    • is collaboratively extended for online application tuning, data mining, predictive modeling, graph plotting, etc.
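
As an illustration of the JSON-based entry descriptions mentioned in the list above, the following Python sketch creates a hypothetical repository entry that keeps a raw file next to its meta-data. The directory layout, file names and fields are assumptions for illustration, not the actual cMind schema:

  # Create a hypothetical 'dataset' entry: a directory holding the raw file
  # plus a human-readable JSON meta-description that can be edited directly.
  import json, os

  entry_dir = "image-jpeg-0001"               # hypothetical entry alias
  if not os.path.isdir(entry_dir):
      os.makedirs(entry_dir)

  meta = {
      "module_uoa": "dataset",                # module owning this entry
      "data_uoa": "image-jpeg-0001",
      "tags": ["image", "jpeg", "auto-tuning"],
      "files": ["data.jpg"],                  # raw artifacts stored in the entry
  }

  with open(os.path.join(entry_dir, "meta.json"), "w") as f:
      json.dump(meta, f, indent=2)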

cMind concept and design motivation

The current Collective Mind concept and design accumulates our past 20 years of R&D experience:

  • In our past research, we spent most of our time not on data analysis and checking novel research ideas, but on dealing with ever-changing tools, architectures and huge amounts of heterogeneous data. Therefore, we decided to use wrappers to abstract all tools. These wrappers became cM plugins (modules): source code has an associated module code.source, a compiler has ctuning.compiler, a binary has code, a dataset has dataset, etc.
  • Various cM modules may have different functions (actions), such as code.source build to build a project or code run to run code. Therefore, to unify access to modules, we use the command-line front-end cm, which allows one to access all modules and their functions using cm <name of the module> <action> parameters.
  • For each module we may need to store some associated data: for a dataset, the real data set (an image file, video file, text file, audio, etc); for source code, all the source files including Makefiles. Previously we used MySQL, but it was very slow and complex to extend it for each module (we had to rebuild tables, check all relations, etc) or to keep binary data. Also, if some experimental data went wrong, it took very long to "clean up" and update the repository. Finally, as researchers, we often want direct access to our experimental files, which is why we used to keep myriads of CSV files. Therefore, for cM, we decided to use our own very simple directory- and file-based repository: a cM repository can live inside any directory and starts with a .cmr root directory, followed by the UID or alias of the module and then the UID or alias of the associated entries (see the path-resolution sketch after this list):
Cm repository structure.png
  • Another problem we faced in past research was dealing with the evolution of our own software. Hence, we decided to provide a unique ID for each module and data entry while allowing high-level aliases, e.g. module code.source has cM UID 45741e3fbcf4024b. We can refer to high-level modules or data using their aliases, but when a module API changes dramatically (rather than just being extended while keeping backward compatibility), we keep the alias and change the UID! Most cM modules can deal with both UIDs and aliases; this combination is called a cM UOA (UID or Alias). Since a repository is also data, it has its own UID. Therefore, any data can be found using either <module UOA>:<data UOA>, in which case it is searched for through all available repositories, or the fully qualified <repo UOA>:<module UOA>:<data UOA>. The unique data identifier in cM is called a CID (cM ID) and has the format (<repo UOA>:)<module UOA>:<data UOA>
  • Naturally, such a design is very flexible but can be slow for searching. However, it is very easy to combine with existing indexing tools. We decided to use ElasticSearch, which works directly with JSON and can perform fast searches and complex queries. We provide support for on-the-fly indexing of data in cM.
  • Yet another problem was that we had to use different frameworks depending on whether we wanted to just run experiments (on mobiles, GRID, clouds, supercomputers), perform analysis, provide a web front-end or build graphs. Now we can use the same framework with different module selections (the minimal cM core is only around 500KB).
  • Interestingly, modules are also entries inside repositories, making it possible to continuously evolve the framework and models as more "knowledge" becomes available.
  • We added the module class to start gradually classifying all data entries. We can also rank useful data entries (such as the most profitable compiler optimizations or models).
  • Since the cM format is now open and easily extensible, we can easily combine auto-tuning and expert knowledge (as the module ctuning.advice, for example).
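
The following Python sketch summarizes this design: it parses a CID of the form (<repo UOA>:)<module UOA>:<data UOA> and maps a module/data pair onto the .cmr directory layout shown above. The function names are illustrative assumptions, not the real cMind API:

  import os

  def parse_cid(cid):
      # Split a CID of the form (<repo UOA>:)<module UOA>:<data UOA>.
      parts = cid.split(":")
      if len(parts) == 3:
          return parts[0], parts[1], parts[2]   # repo, module, data
      if len(parts) == 2:
          return None, parts[0], parts[1]       # no repo: search all repositories
      raise ValueError("malformed CID: " + cid)

  def entry_path(repo_root, module_uoa, data_uoa):
      # Directory of a data entry inside a directory/file-based cM repository.
      return os.path.join(repo_root, ".cmr", module_uoa, data_uoa)

  # Both UIDs and aliases are valid UOAs, so both forms resolve the same way:
  print(parse_cid("code.source:my-program"))               # searched in all repositories
  print(parse_cid("my-repo:45741e3fbcf4024b:my-program"))  # fully qualified CID
  print(entry_path("/home/user/my-repo", "code.source", "my-program"))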

We now believe that we have a framework that is easy to extend in order to continue the collaborative systematization of characterization, optimization, learning and design of computer systems:

Cm overall structure.png

We use gradual top-down decomposition and learning of computer systems to keep complexity under control, balance our return on investment (analysis/optimization cost vs. benefit) and gradually improve our knowledge:

Cm top down decomposition and learning.png

Which data/knowledge is currently shared?

All research tools from Grigori Fursin's past R&D, including:

  • machine learning based compiler MILEPOST GCC
  • OpenME event-based plugin interface
  • multiple benchmarks/codelets prepared for auto-tuning, with OpenME events to time kernels, including OpenCL/CUDA ones
  • multiple datasets (cDataset and kDataset)
  • packages and description of multiple compilers including GCC 4.4.x, 4.6.x, 4.7.x, LLVM 3.x, Open64 4.x and 5.x, PathScale 2.x and 3.x, PGI 12.x, partially ICC 12.x and Microsoft compilers 2010
  • support for multiple OSes including Ubuntu, OpenSuse, CentOS, Android, Windows
  • experimental build and run pipeline with multiple exposed properties and characteristics
  • Pareto frontier online filter
  • online classifier of profitability of compiler flags
  • analysis of normality of experimental results (using R)
  • compiler flags auto-tuning scenarios
  • Android node to crowdsource compiler flag tuning
  • powerful multi-graph engines for data analysis and mining (can be exported to various image formats for publications)

We hope that this initiative, combined with released tools and data for reproducibility of results, will push forward a new publication model where experimental results are validated by the community!

Supported platforms

cMind requires Python >= 2.6 installed on the system and the <CM_ROOT>/bin directory added to your path. Python 3.x is not yet supported (though we expect that providing support should not be very difficult). For repository indexing and fast queries, cMind uses the powerful third-party ElasticSearch (already pre-packaged in cM), which requires a Java run-time environment.

We want to be able to run cMind across as many different computer systems as possible to collect enough statistics for data mining and machine learning. Therefore, we have made a considerable effort to make this framework very portable. We have currently checked it on the following platforms (both 32-bit and 64-bit):

cMind was also tested on various Linux distributions under VMWare 6.x and 7.x. cMind can be easily deployed as a bootable live USB key (we tested 8/16/32 GB USB keys with live Ubuntu 12.x and cM).

Where to get it?

The project is currently hosted at SourceForge (with the latest available stable release):

http://sourceforge.net/projects/c-mind

The current development version can be obtained anonymously from the SourceForge SVN server (in the future, we plan to move to Git):

svn checkout https://svn.code.sf.net/p/c-mind/code/projects/cm/trunk cm-trunk

What is cMind license?

After long consultations with industrial and academic partners, we decided to release cMind V1.x under the standard open-source BSD 3-clause license. However, we can also provide a customized license if strictly necessary - please contact Grigori Fursin for more details.

OpenME is an open-source, low-level, universal C/C++/Fortran/Java event-based interface with just 3 functions to open up various tools and applications, making them interactive and/or tunable while connecting them with cMind or any other third-party tool through plugins. Since various applications and tools may have different licenses and may not be open source, OpenME is licensed under LGPL v2.1.

How to install and configure?

We made a special effort to make cMind as simple and easy to use as possible. The basic cMind CMD version should work out of the box, provided that Python >= 2.6 is installed on the system and the <CM_ROOT>/bin directory is added to your path. Using advanced features requires a few very simple steps to fully install and configure cMind, as described here.
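
For example, on Linux/Unix the path can typically be extended as follows (substituting your actual cMind installation directory for <CM_ROOT>; on Windows, extend PATH through the system settings):

  export PATH=$PATH:<CM_ROOT>/bin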

How to use cM?

Users can access all cMind functionality from the command line using the cm or cmx front-ends (accessing the repository, executing experimental pipelines, reproducing experiments, composing new experiments, etc). However, cMind also features a built-in, full-featured Python-based web server that allows users to access most cMind functionality through the user-friendly cMind web front-end, or to expose data to the world through the unified cMind web services. Users can start the web server using cm_web_start.sh in a separate terminal (new terminals will be opened through this root terminal in some usage scenarios) and access the cMind front-end at http://localhost:3333
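
For example (reusing the illustrative module and action names from the design section above; exact parameters depend on each module and usage scenario):

  cm code.source build     # build a shared source-code entry
  cm code run              # run a compiled code entry
  cm_web_start.sh          # start the built-in web server, then open http://localhost:3333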

How to develop (and contribute)?

Check out:

Collective Mind development committee

History

Grigori Fursin originally started research on semiconductor neural computers and brain simulation, and in 1998 had to switch to computer engineering to speed up his simulation software and make it more power efficient. cMind aggregates all of Grigori's past techniques and tools developed on the way to understanding the behavior of computer systems and speeding up their development and optimization.

  • 2011-cur. - cTuning3 (aka Collective Mind): Python/OpenME/NoSQL/web-based repository and infrastructure for collective research and development through collaborative knowledge discovery, systematization, sharing and reuse
  • 2009-2010 - cTuning2 (aka Codelet Tuning Infrastructure): C/Python/NoSQL-based repository built for the Intel Exascale Lab (as presented at the SC BOF 2011)
  • 2007-2009 - cTuning/MILEPOST: C/PHP/MySQL/web-based framework for collaborative program and architecture characterization and optimization through predictive modeling
  • 2005-2006 - Framework for Continuous Optimizations: new C/NoSQL-based framework and OpenME-like API for continuous optimizations
  • 1999-2004 - Edinburgh Optimizing Software: C/Java/MySQL-based modular repository and infrastructure to analyze and optimize the memory behavior of large-scale computer systems
  • 1997-1999 - prototype of a collaborative experimental repository and tuning infrastructure to access supercomputers through unified web services (SCS - SuperComputer Services)
  • 1993-1997 - prototype of an experimental repository and analysis software to keep experiments for the development and modeling of semiconductor neural networks

Some references

Related events

All documentation

Contacts

If you have questions or comments, or would like to join this collaborative effort or invest in it, do not hesitate to contact us.

(C) 2014-2015 non-profit cTuning foundation (France) and DIVIDITI (UK)