SCS

Strategic Configuration Sampling: software overview

SCS (Strategic Configuration Sampling) is an open-source, Python-based software package designed to automate the generation of high-quality, system-specific datasets for the active learning of machine learning interatomic potentials (MLIPs). By combining model-driven exploration with on-the-fly data acquisition, SCS streamlines the traditionally manual and resource-intensive process of dataset creation, which is often a major bottleneck in MLIP development.

At the heart of SCS lies the concept of user-defined exploration workflows. These workflows organize ML-driven simulations into sequential phases to maximize the exploration of the chemical and configurational space of interest. In each phase, SCS automatically assembles atomic configurations using a simple, high-level syntax. A key innovation is its ability to dynamically construct new systems by “collaging” atomic structures taken from previous simulation outputs. This feature allows SCS to efficiently capture complex and non-trivial atomic environments—such as those arising from chemical reactions, mechanical stresses, or phase transitions—with minimal user intervention.
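To make the "collaging" idea concrete, here is a minimal sketch of one possible operation: stacking one structure on top of another along z. It does not use the SCS input syntax, which is not shown here. The function name `collage`, the `gap` parameter, and the representation of a structure as a list of `(symbol, x, y, z)` tuples are all assumptions made for illustration.

```python
def collage(lower, upper, gap=2.0):
    """Stack `upper` above `lower` along z, leaving `gap` between them.

    Hypothetical helper, not part of SCS. Each structure is a list of
    (symbol, x, y, z) tuples; `gap` is in the same length unit as the
    coordinates.
    """
    top_of_lower = max(z for _, _, _, z in lower)
    bottom_of_upper = min(z for _, _, _, z in upper)
    # Rigid shift that places the lowest atom of `upper` exactly `gap`
    # above the highest atom of `lower`.
    shift = top_of_lower + gap - bottom_of_upper
    shifted_upper = [(s, x, y, z + shift) for s, x, y, z in upper]
    return list(lower) + shifted_upper


# Toy usage: a two-atom "slab" and a two-atom fragment from another run.
slab = [("C", 0.0, 0.0, 0.0), ("C", 0.0, 0.0, 1.5)]
fragment = [("Si", 0.0, 0.0, 5.0), ("O", 0.0, 0.0, 6.0)]
combined = collage(slab, fragment, gap=2.0)
```

In practice the structures would come from earlier simulation outputs, and a real implementation would also need to handle the simulation cell and periodic boundaries, which this sketch ignores.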

Moreover, SCS supports the parallel exploration of multiple systems (for example, systems that differ in temperature or applied pressure) in a single session, each managed by its own workflow and with independently allocated computational resources. This allows efficient dataset generation even in heterogeneous scenarios, where different systems may require distinct resource strategies for DFT computations.

SCS is currently interfaced with two popular MLIP packages (MACE and DeePMD-kit), as well as several widely used open-source tools: Packmol for geometry initialization, LAMMPS as the molecular dynamics engine, Quantum ESPRESSO for DFT calculations, and the ASE Python package for general atomic simulations. An interface with the CP2K package is currently under development. The modular and general design of SCS means that it can be extended to support additional MLIP models and simulation engines, making it an adaptable tool for researchers working at the intersection of atomistic modeling and machine learning.

The project is fully open-source and available on GitLab: SCS GitLab Repository.

SCS active learning structure

Each active learning iteration begins with the training of an ensemble of MLIP models, each initialized with a different random seed but trained on the same dataset. Once the training is complete, exploration begins. SCS reads the user-defined input files and automatically builds the exploration workflow, assembling initial geometries and setting up the dynamical conditions of each simulation phase.

For example, Figure 2 illustrates SCS’s ability to automatically construct a complex silica–diamond interface under realistic tribological conditions. Starting from basic components (e.g., SiO₂ molecules, diamond surfaces, water), the workflow generates amorphous silica, performs surface hydroxylation, builds the interface, and simulates loading and wear processes—all within the same active learning session and without requiring manual intervention.

The simulations in each workflow are driven by the MLIP ensemble [ref]: one model pilots the MD simulation by computing atomic energies and forces, while the others evaluate the uncertainty (quantified as the standard deviation in predicted forces) along the trajectory. Configurations with high uncertainty are selected for further processing and validated via single-point DFT calculations. These new data are then added to the training set, and a new iteration of the active learning loop begins.
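The uncertainty estimate described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the SCS implementation: the function names, the use of the population standard deviation, the reduction to a per-frame maximum over atoms, and the `threshold` parameter are all hypothetical choices. Which models enter the statistic is left to the caller.

```python
import math


def force_deviation(ensemble_forces):
    """Per-atom standard deviation of the force vector across models.

    ensemble_forces[m][i] is the (fx, fy, fz) force on atom i predicted
    by model m. Returns one value per atom:
    sqrt( mean over models of |F_m,i - <F_i>|^2 ).
    """
    n_models = len(ensemble_forces)
    n_atoms = len(ensemble_forces[0])
    deviations = []
    for i in range(n_atoms):
        # Mean force on atom i over the ensemble, component by component.
        mean = [
            sum(ensemble_forces[m][i][k] for m in range(n_models)) / n_models
            for k in range(3)
        ]
        # Mean squared distance of each prediction from the ensemble mean.
        variance = sum(
            sum((ensemble_forces[m][i][k] - mean[k]) ** 2 for k in range(3))
            for m in range(n_models)
        ) / n_models
        deviations.append(math.sqrt(variance))
    return deviations


def select_uncertain_frames(trajectory_forces, threshold):
    """Indices of frames whose largest per-atom deviation exceeds `threshold`.

    trajectory_forces[t] holds the ensemble forces for frame t, in the
    layout expected by force_deviation. `threshold` is an assumed,
    user-chosen cutoff in force units.
    """
    return [
        t
        for t, frame in enumerate(trajectory_forces)
        if max(force_deviation(frame)) > threshold
    ]
```

Frames returned by `select_uncertain_frames` would be the candidates passed on to single-point DFT. Taking the maximum over atoms flags a frame as soon as a single atomic environment is poorly described; an average over atoms would be a gentler alternative.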

Through this iterative process, the MLIP ensemble becomes progressively more accurate and robust, learning to represent the relevant chemical and configurational space more effectively. SCS can also take advantage of pre-trained or universal MLIP models to bootstrap the exploration process when no initial dataset is available. This feature accelerates early iterations by enabling stable MD trajectories that facilitate the rapid acquisition of informative training data.
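The overall loop (train an ensemble with different seeds, explore, label the uncertain configurations, extend the dataset, repeat) can be summarized with the following skeleton. It is a generic sketch, not SCS code: `train`, `explore`, and `label` are hypothetical callables supplied by the caller, and `n_models` and `n_iterations` are assumed parameters.

```python
def active_learning_loop(dataset, train, explore, label,
                         n_models=4, n_iterations=3):
    """Generic ensemble-based active learning loop.

    Hypothetical callables (not SCS APIs):
      train(dataset, seed) -> model trained on `dataset` with that seed
      explore(models)      -> list of high-uncertainty configurations
      label(config)        -> labelled data point (e.g. single-point DFT)
    """
    dataset = list(dataset)  # work on a copy, leave the input untouched
    for _ in range(n_iterations):
        # Same data for every model, different random seed per model.
        models = [train(dataset, seed) for seed in range(n_models)]
        candidates = explore(models)
        if not candidates:
            break  # nothing uncertain was found, so stop early
        dataset.extend(label(config) for config in candidates)
    return dataset


# Toy usage with stand-in callables, just to show the control flow.
toy_train = lambda data, seed: (seed, len(data))
toy_explore = lambda models: [models[0][1]] if models[0][1] < 3 else []
toy_label = lambda config: config * 10
final_dataset = active_learning_loop([0], toy_train, toy_explore, toy_label,
                                     n_iterations=5)
```

Bootstrapping from a pre-trained or universal model would correspond to replacing the first call to `train` with a loader for that model; this variant is not shown here.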

Fig. 2: Complex automated exploration workflow for silica-diamond wear applications. Starting from simple geometries, SCS can generate sophisticated systems throughout the workflow’s phases: amorphous silica (Phase0), hydroxylated silica (Phase1), silica-diamond interface under load (Phase2), and wear (Phase3). The blue addition symbols indicate the use of the “collaging” feature to stitch different structures together.

The software is described in the following article: A. Pacini, M. Ferrario, and M. C. Righi, Accelerating Data Set Population for Training Machine Learning Potentials with Automated System Generation and Strategic Sampling, J. Chem. Theory Comput. 2025, 1549-9618.