SMILES Module

This module provides methods for augmenting molecular data in SMILES format.

augchem.Augmentator.SMILESModule(parent)

Module for augmenting molecular data in SMILES format.

Provides methods for generating augmented SMILES representations using various techniques including masking, deletion, swapping, fusion, and enumeration.

Source code in augchem\core.py
def __init__(self, parent):
    self.parent = parent

augment_data(dataset: Path, mask_ratio: float = 0.1, delete_ratio: float = 0.3, seed: int = 42, augment_percentage: float = 0.2, augmentation_methods: List[str] = ['fusion', 'enumerate'], col_to_augment: str = 'SMILES', property_col: str = None) -> pd.DataFrame

Augment molecular SMILES data from a CSV file.

Reads SMILES strings from a CSV file, applies specified augmentation methods, and returns the augmented dataset. Also saves the augmented dataset to a new CSV file.

Parameters

dataset : Path
    Path to the CSV file containing SMILES data to augment

mask_ratio : float, default=0.1
    Fraction of tokens to mask when using masking-based augmentation methods

delete_ratio : float, default=0.3
    Fraction of tokens to delete when using deletion-based augmentation methods

seed : int, default=42
    Random seed for reproducible augmentation

augment_percentage : float, default=0.2
    Target size of augmented dataset as a fraction of original dataset size

augmentation_methods : List[str], default=["fusion", "enumerate"]
    List of augmentation methods to apply. Valid options include: "mask", "delete", "swap", "fusion", "enumeration"

col_to_augment : str, default='SMILES'
    Column name in the CSV file containing SMILES strings to augment

property_col : str, optional
    Column name containing property values to preserve in augmented data
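
To make `mask_ratio` and `delete_ratio` concrete, the sketch below masks or deletes a fraction of the tokens in a SMILES string. It is only an illustration under stated assumptions: the regex tokenizer, the `[M]` mask token, the rounding rule, and the helper names `tokenize`, `pick_positions`, `mask_tokens`, and `delete_tokens` are hypothetical and are not taken from augchem.

```python
import random
import re

# Simplified SMILES tokenizer (assumption): bracket atoms, two-letter halogens,
# two-digit ring closures, then any remaining single character.
TOKEN_RE = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def tokenize(smiles):
    return TOKEN_RE.findall(smiles)


def pick_positions(n_tokens, ratio, rng):
    # Assumed rounding rule: round(n * ratio), with at least one token when ratio > 0.
    k = max(1, round(n_tokens * ratio)) if ratio > 0 else 0
    return set(rng.sample(range(n_tokens), min(k, n_tokens)))


def mask_tokens(smiles, mask_ratio=0.1, seed=42, mask_token="[M]"):
    # Replace a random subset of tokens with the mask token.
    rng = random.Random(seed)
    tokens = tokenize(smiles)
    chosen = pick_positions(len(tokens), mask_ratio, rng)
    return "".join(mask_token if i in chosen else t for i, t in enumerate(tokens))


def delete_tokens(smiles, delete_ratio=0.3, seed=42):
    # Drop a random subset of tokens; the result is not guaranteed to be valid SMILES.
    rng = random.Random(seed)
    tokens = tokenize(smiles)
    chosen = pick_positions(len(tokens), delete_ratio, rng)
    return "".join(t for i, t in enumerate(tokens) if i not in chosen)


smiles = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin, 21 tokens under this tokenizer
masked = mask_tokens(smiles)      # 2 of 21 tokens replaced by "[M]"
deleted = delete_tokens(smiles)   # 6 of 21 tokens removed
```

Seeding a local `random.Random` keeps the choice of positions reproducible without touching global random state, which is the role the `seed` parameter plays in the description above.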

Returns

pd.DataFrame
    DataFrame containing both original and augmented molecules, with a 'parent_idx' column linking augmented molecules to their source molecules
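
As a rough illustration of how a `parent_idx` link can be used, the sketch below groups augmented SMILES under their source molecule. The rows are made-up example data, and two points are assumptions rather than documented behavior: that original molecules carry no `parent_idx` (`None` here) and that augmented rows store the row index of their source. Plain dictionaries are used to keep the sketch dependency-free; with pandas, `df.to_dict("records")` yields the same shape.

```python
from collections import defaultdict

# Illustrative rows only (not library output). Assumption: originals have
# parent_idx None, augmented rows hold the row index of their source molecule.
rows = [
    {"SMILES": "CCO", "parent_idx": None},      # row 0, original (ethanol)
    {"SMILES": "CC(=O)O", "parent_idx": None},  # row 1, original (acetic acid)
    {"SMILES": "OCC", "parent_idx": 0},
    {"SMILES": "C(O)C", "parent_idx": 0},
    {"SMILES": "OC(C)=O", "parent_idx": 1},
]


def group_by_parent(rows):
    # Map each source SMILES to the list of augmented SMILES derived from it.
    groups = defaultdict(list)
    for row in rows:
        if row["parent_idx"] is not None:
            parent_smiles = rows[row["parent_idx"]]["SMILES"]
            groups[parent_smiles].append(row["SMILES"])
    return dict(groups)


variants = group_by_parent(rows)
```

Grouping this way is useful when splitting data for training, since it lets augmented variants stay in the same split as their source molecule.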

Notes

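The augmented dataset is saved automatically to a CSV file named "Augmented_" plus the dataset name, for example "Augmented_QM9.csv" for an input of "QM9.csv" in the current working directory. Exact duplicate rows are dropped before the file is written.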
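
When the dataset is given as a path with directories, it matters whether an "Augmented_" prefix is applied to the whole path string or only to the file name. The sketch below contrasts the two. The helper names `prefix_whole_path` and `prefix_file_name` are hypothetical, and `PurePosixPath` is used only so the strings come out the same on every platform.

```python
from pathlib import PurePosixPath


def prefix_whole_path(dataset):
    # Prefix applied to the full path string.
    return f"Augmented_{dataset}"


def prefix_file_name(dataset):
    # Prefix applied to the final path component only.
    return f"Augmented_{PurePosixPath(dataset).name}"


bare = PurePosixPath("QM9.csv")
nested = PurePosixPath("data/QM9.csv")

same_for_bare = prefix_whole_path(bare) == prefix_file_name(bare)
whole = prefix_whole_path(nested)    # prefix lands on the directory part
name_only = prefix_file_name(nested)  # prefix lands on the file name
```

For a bare file name both forms agree. For a nested path, prefixing the whole string points into a directory that likely does not exist, so passing a bare file name from the working directory is the safest way to get the output name described in the note.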

Source code in augchem\core.py
def augment_data(self, dataset: Path, mask_ratio: float = 0.1, delete_ratio: float = 0.3,
                 seed: int = 42, augment_percentage: float = 0.2,
                 augmentation_methods: List[str] = ["fusion", "enumerate"],
                 col_to_augment: str = 'SMILES', property_col: str = None) -> pd.DataFrame:
    """
    Augment molecular SMILES data from a CSV file.

    Reads SMILES strings from a CSV file, applies specified augmentation methods,
    and returns the augmented dataset. Also saves the augmented dataset to a new CSV file.

    Parameters
    ----------
    `dataset` : Path
        Path to the CSV file containing SMILES data to augment

    `mask_ratio` : float, default=0.1
        Fraction of tokens to mask when using masking-based augmentation methods

    `delete_ratio` : float, default=0.3
        Fraction of tokens to delete when using deletion-based augmentation methods

    `seed` : int, default=42
        Random seed for reproducible augmentation

    `augment_percentage` : float, default=0.2
        Target size of augmented dataset as a fraction of original dataset size

    `augmentation_methods` : List[str], default=["fusion", "enumerate"]
        List of augmentation methods to apply. Valid options include: 
        "mask", "delete", "swap", "fusion", "enumeration"

    `col_to_augment` : str, default='SMILES'
        Column name in the CSV file containing SMILES strings to augment

    `property_col` : str, optional
        Column name containing property values to preserve in augmented data

    Returns
    -------
    `pd.DataFrame`
        DataFrame containing both original and augmented molecules, with a 'parent_idx'
        column linking augmented molecules to their source molecules

    Notes
    -----
    The augmented dataset is automatically saved to a CSV file named "Augmented_"
    plus the dataset name (e.g. "Augmented_QM9.csv" for "QM9.csv") in the
    current working directory.
    """
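    # Load the source dataset and run the selected augmentation methods.
    df = pd.read_csv(dataset)
    new_df = augment_dataset(
        dataset=df,
        augmentation_methods=augmentation_methods,
        mask_ratio=mask_ratio,
        delete_ratio=delete_ratio,
        col_to_augment=col_to_augment,
        augment_percentage=augment_percentage,
        seed=seed,
        property_col=property_col,
    )

    # Drop exact duplicate rows, then write the result to the current working
    # directory. Only the file name is used, so a dataset path that includes
    # directories does not produce an invalid output path.
    new_df = new_df.drop_duplicates()
    new_df.to_csv(f"Augmented_{Path(dataset).name}", index=True, float_format='%.8e')

    # Report how many rows were added relative to the input.
    new_data = len(new_df) - len(df)
    print(f"Generated {new_data} new SMILES")

    return new_df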