The current Collective Mind concept and design accumulates our past 20 years of R&D experience:
- In our past research, we spent most of our time not on data analysis and testing novel research ideas, but on dealing with ever-changing tools, architectures and huge amounts of heterogeneous data. Therefore, we decided to use wrappers to abstract all tools. These wrappers became cM plugins (modules), i.e. source code has an associated module code.source, a compiler - ctuning.compiler, a binary - code, a dataset - dataset, etc.
- Various cM modules may have different functions (actions), such as code.source build to build a project or code run to run a binary. Therefore, to unify access to modules, we use a command-line front-end cm which allows one to access all modules and their functions using cm <name of the module> <action> <parameters>.
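The front-end described above can be sketched as a minimal dispatcher. This is only an illustration of the idea: the module/action table and the actions themselves are hypothetical stand-ins, while the real cm tool discovers modules inside repositories.

```python
# Minimal sketch of a cm-style command-line front-end (hypothetical
# module/action table; the real cm tool loads modules from repositories).

def build(params):
    # Stand-in for the real build action of module code.source.
    return "building project with " + " ".join(params)

def run(params):
    # Stand-in for the real run action of module code.
    return "running code with " + " ".join(params)

# Each module exposes a set of actions (functions).
MODULES = {
    "code.source": {"build": build},
    "code": {"run": run},
}

def cm(argv):
    """Dispatch 'cm <module> <action> <parameters>' to a module function."""
    module, action, params = argv[0], argv[1], argv[2:]
    return MODULES[module][action](params)

# Corresponds to invoking from the shell: cm code.source build --speed
print(cm(["code.source", "build", "--speed"]))
```

The key design point is that the front-end itself knows nothing about any tool; it only routes a module name and action to the corresponding plugin.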
- For each module we may need to store some associated data: for a dataset, the real data files (image, video, text, audio, etc.); for source code, all the source files including Makefiles, etc. Previously, we used MySQL, but it was slow and complex to extend it for each module (we had to rebuild tables, check all relations, etc.) or to keep binary data. Also, if some experimental data went wrong, "cleaning up" and updating the repository took a long time. Finally, as researchers, we often want direct access to our experimental files, which is why researchers typically keep myriads of CSV files. Therefore, for cM, we decided to use our own very simple directory- and file-based repository: a cM repository can be inside any directory and starts with a .cmr root directory, followed by the UID or alias of the module and then the UID or alias of associated entries:
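The layout above means that locating an entry on disk is just path composition. A minimal sketch (the repository path and entry names below are hypothetical examples):

```python
import os

def entry_path(repo_path, module_uoa, data_uoa):
    """Compose the on-disk location of a cM entry:
       <repo>/.cmr/<module UID or alias>/<data UID or alias>"""
    return os.path.join(repo_path, ".cmr", module_uoa, data_uoa)

# Hypothetical example: on a POSIX system this prints
# /home/user/myrepo/.cmr/dataset/image-jpeg-0001
print(entry_path("/home/user/myrepo", "dataset", "image-jpeg-0001"))
```

Because entries are plain directories and files, researchers can inspect, copy or clean up experimental data with ordinary file tools, which was one of the motivations for dropping MySQL.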
- Another problem that we faced in our past research was dealing with the evolution of our own software. Hence we decided to provide unique IDs for each module and data entry while allowing high-level aliases, i.e. module code.source has cM UID 45741e3fbcf4024b. We can refer to modules or data using an alias, but when the module API changes dramatically (rather than being extended while keeping backwards compatibility), we keep the alias but change the UID! Most cM modules can deal with both UID and alias - this combination is called a cM UOA (UID or Alias). Since a repository is also data, it has its own UID. Therefore, any data can be found either as <module UOA>:<data UOA>, in which case it is searched for across all available repositories, or as <repo UOA>:<module UOA>:<data UOA>. A unique data identifier in cM is called a CID (cM ID) and has the format (<repo UOA>:)<module UOA>:<data UOA>
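Parsing a CID is then a matter of splitting on ':' and treating the repo part as optional. A minimal sketch (the entry names in the example are hypothetical):

```python
def parse_cid(cid):
    """Split a cM CID '(<repo UOA>:)<module UOA>:<data UOA>' into parts.
       The repo part is optional; when it is absent (repo is None),
       all available repositories are searched."""
    parts = cid.split(":")
    if len(parts) == 2:
        repo, module, data = None, parts[0], parts[1]
    elif len(parts) == 3:
        repo, module, data = parts
    else:
        raise ValueError("invalid CID: " + cid)
    return repo, module, data

print(parse_cid("code.source:my-program"))          # searched in all repos
print(parse_cid("myrepo:code.source:my-program"))   # searched in one repo
```

Each part may be either a UID such as 45741e3fbcf4024b or a human-readable alias; the lookup logic treats them interchangeably.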
- Naturally, such a design is very flexible but can be slow for search, etc. However, it is very easy to combine with existing indexing tools. We decided to use ElasticSearch, which works directly with JSON and can perform fast search and complex queries. We provided support for on-the-fly indexing of data in cM.
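As a conceptual stand-in for that integration (the real cM forwards each entry's JSON meta to ElasticSearch; the toy index, helper names and field names below are all assumptions for illustration), on-the-fly indexing of JSON meta can be sketched as:

```python
import json

# Toy in-memory index keyed by (field, value); a stand-in for ElasticSearch,
# which indexes the same JSON meta and answers far richer queries.
INDEX = {}

def index_entry(cid, meta):
    """Index a cM entry's JSON meta on the fly, as soon as it is added."""
    doc = json.loads(json.dumps(meta))   # entry meta is plain JSON
    for key, value in doc.items():
        INDEX.setdefault((key, str(value)), []).append(cid)

def search(key, value):
    """Find CIDs of entries whose meta field matches the given value."""
    return INDEX.get((key, str(value)), [])

# Hypothetical entry and meta fields:
index_entry("dataset:image-0001", {"tags": "image", "size": 1024})
print(search("tags", "image"))
```

The point is that because every entry already carries JSON meta, indexing is a thin layer on top of the file-based repository rather than a change to its format.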
- Yet another problem we had was the need for different frameworks depending on whether we wanted to run experiments (on mobiles, GRID, cloud, supercomputers), perform analysis, provide a web front-end, or build graphs. Now we can use the same framework with various module selections (the minimal cM core is only around 500KB).
- Interestingly, modules are themselves entries inside repositories, making it possible to continuously evolve the framework and models as more "knowledge" becomes available.
- We added a module "class" to start gradually classifying all data entries. We can also rank useful data entries (such as the most profitable compiler optimizations or models).
- Since the cM format is now open and easily extensible, we can easily combine auto-tuning with expert knowledge (via a module such as ctuning.advice).
Now, we believe that we have a framework that is easy to extend and that lets us continue the collaborative systematization of the characterization, optimization, learning and design of computer systems:
We use gradual top-down decomposition and learning of computer systems (to keep complexity under control, balance our return on investment (analysis/optimization cost vs benefit), and gradually improve our knowledge):