SMILES Module
This module provides methods for augmenting molecular data in SMILES format.
augchem.Augmentator.SMILESModule(parent)
Module for augmenting molecular data in SMILES format.
Provides methods for generating augmented SMILES representations using various techniques including masking, deletion, swapping, fusion, and enumeration.
Source code in augchem\core.py
augment_data(dataset: Path, mask_ratio: float = 0.1, delete_ratio: float = 0.3, seed: int = 42, augment_percentage: float = 0.2, augmentation_methods: List[str] = ['fusion', 'enumerate'], col_to_augment: str = 'SMILES', property_col: str = None) -> pd.DataFrame
Augment molecular SMILES data from a CSV file.
Reads SMILES strings from a CSV file, applies the specified augmentation methods, and returns the augmented dataset. The augmented dataset is also saved to a new CSV file.
Parameters
dataset
: Path
Path to the CSV file containing SMILES data to augment
mask_ratio
: float, default=0.1
Fraction of tokens to mask when using masking-based augmentation methods
delete_ratio
: float, default=0.3
Fraction of tokens to delete when using deletion-based augmentation methods
seed
: int, default=42
Random seed for reproducible augmentation
augment_percentage
: float, default=0.2
Number of augmented molecules to generate, expressed as a fraction of the original dataset size
augmentation_methods
: List[str], default=["fusion", "enumerate"]
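List of augmentation methods to apply. Valid options include:
"mask", "delete", "swap", "fusion", "enumeration".
Note that the default value spells the last option "enumerate"; check the source to confirm which spelling is accepted.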
col_to_augment
: str, default='SMILES'
Column name in the CSV file containing SMILES strings to augment
property_col
: str, optional
Column name containing property values to preserve in augmented data
Returns
pd.DataFrame
DataFrame containing both original and augmented molecules, with a 'parent_idx'
column linking augmented molecules to their source molecules
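The 'parent_idx' column makes it possible to trace each augmented molecule back to its source. The sketch below illustrates that grouping with plain dictionaries standing in for DataFrame rows. The example rows are made up, and it is an assumption that original molecules carry a missing parent (None here) while augmented molecules store the row index of their source; how the library actually marks original rows is not stated on this page.

```python
from collections import defaultdict

# Hypothetical rows standing in for the returned DataFrame.
# Assumption: originals have parent_idx None; augmented rows point at
# the row index of the molecule they were derived from.
rows = [
    {"SMILES": "CCO", "parent_idx": None},
    {"SMILES": "c1ccccc1", "parent_idx": None},
    {"SMILES": "OCC", "parent_idx": 0},
    {"SMILES": "C(C)O", "parent_idx": 0},
    {"SMILES": "C1=CC=CC=C1", "parent_idx": 1},
]


def group_by_parent(rows):
    """Map each source row index to the augmented SMILES derived from it."""
    variants = defaultdict(list)
    for row in rows:
        if row["parent_idx"] is not None:
            variants[row["parent_idx"]].append(row["SMILES"])
    return dict(variants)


groups = group_by_parent(rows)
for parent, smiles_list in groups.items():
    print(rows[parent]["SMILES"], "->", smiles_list)
```

A grouping like this is useful when splitting data for training, because keeping a source molecule and its augmented variants in the same split avoids leaking near-duplicates across splits.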
Notes
The augmented dataset is automatically saved to "Augmented_QM9.csv" in the current working directory.
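As a usage sketch, the call below passes every documented parameter by keyword, using the defaults from the signature above. The helper name augment_csv and the file name "qm9.csv" are hypothetical, and the function expects an already constructed SMILESModule instance, because this page does not show how that object is obtained from the parent Augmentator.

```python
from pathlib import Path


def augment_csv(smiles_module, csv_path):
    """Call augment_data with the documented defaults spelled out.

    smiles_module is assumed to be an existing SMILESModule instance;
    constructing it is outside the scope of this sketch.
    """
    return smiles_module.augment_data(
        dataset=Path(csv_path),
        mask_ratio=0.1,             # fraction of tokens to mask
        delete_ratio=0.3,           # fraction of tokens to delete
        seed=42,                    # fixed seed for reproducibility
        augment_percentage=0.2,     # fraction of the original dataset size
        augmentation_methods=["fusion", "enumerate"],
        col_to_augment="SMILES",
        property_col=None,          # set to a column name to preserve properties
    )

# Example call (requires a real SMILESModule instance):
# augmented = augment_csv(smiles_module, "qm9.csv")
```

According to the note above, the output file name is fixed, so the result is written to "Augmented_QM9.csv" in the current working directory regardless of the input file name.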
Source code in augchem\core.py