SMILES Module

This module provides methods for augmenting molecular data in SMILES format.

`augchem.Augmentator.SMILESModule(parent)`

Module for augmenting molecular data in SMILES format.

Provides methods for generating augmented SMILES representations by masking, deletion, swapping, fusion, and enumeration.

Source code in augchem\core.py

def __init__(self, parent):
    self.parent = parent

`augment_data(dataset: Path, mask_ratio: float = 0.1, delete_ratio: float = 0.3, seed: int = 42, augment_percentage: float = 0.2, augmentation_methods: List[str] = ['fusion', 'enumerate'], col_to_augment: str = 'SMILES', property_col: str = None) -> pd.DataFrame`

Augment molecular SMILES data from a CSV file.

Reads SMILES strings from a CSV file, applies the specified augmentation methods, drops duplicate rows, and returns the augmented dataset. The result is also written to a new CSV file.

Parameters

dataset : Path
Path to the CSV file containing the SMILES data to augment.

mask_ratio : float, default=0.1
Fraction of tokens to mask when a masking-based augmentation method is used.

delete_ratio : float, default=0.3
Fraction of tokens to delete when a deletion-based augmentation method is used.

seed : int, default=42
Random seed for reproducible augmentation.

augment_percentage : float, default=0.2
Target number of augmented molecules to generate, as a fraction of the original dataset size.

augmentation_methods : List[str], default=["fusion", "enumerate"]
List of augmentation methods to apply. Valid options include "mask", "delete", "swap", "fusion", and "enumeration" (the default list spells the last one "enumerate").

col_to_augment : str, default='SMILES'
Name of the column in the CSV file that contains the SMILES strings to augment.

property_col : str, optional
Name of the column containing property values to preserve in the augmented data.
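To make the two ratio parameters concrete, here is a minimal sketch of token-level masking and deletion. It is not Augchem's implementation: the tokenizer regex, the helper names `tokenize`, `mask_tokens`, and `delete_tokens`, the `[M]` placeholder, and the rule of always changing at least one token are all assumptions chosen for illustration.

```python
import random
import re

# Simplified SMILES tokenizer (an assumption for this sketch): bracket atoms,
# the two-letter halogens, two-digit ring closures (%NN), then any single character.
TOKEN_RE = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def tokenize(smiles):
    """Split a SMILES string into tokens."""
    return TOKEN_RE.findall(smiles)


def _count(n_tokens, ratio):
    # Number of tokens to change: at least one, never more than available.
    return min(n_tokens, max(1, int(n_tokens * ratio)))


def mask_tokens(smiles, mask_ratio=0.1, seed=42, mask_token="[M]"):
    """Replace a fraction of tokens with a placeholder (hypothetical helper)."""
    rng = random.Random(seed)  # local generator, so the same seed gives the same result
    tokens = tokenize(smiles)
    for i in rng.sample(range(len(tokens)), _count(len(tokens), mask_ratio)):
        tokens[i] = mask_token
    return "".join(tokens)


def delete_tokens(smiles, delete_ratio=0.3, seed=42):
    """Remove a fraction of tokens (hypothetical helper)."""
    rng = random.Random(seed)
    tokens = tokenize(smiles)
    drop = set(rng.sample(range(len(tokens)), _count(len(tokens), delete_ratio)))
    return "".join(t for i, t in enumerate(tokens) if i not in drop)


aspirin = "CC(=O)Oc1ccccc1C(=O)O"  # 21 tokens with this tokenizer
print(mask_tokens(aspirin, mask_ratio=0.1))     # 2 of 21 tokens replaced by [M]
print(delete_tokens(aspirin, delete_ratio=0.3)) # 6 of 21 tokens removed
```

Strings produced this way are generally not valid SMILES, which is why the ratios control how aggressively each molecule is perturbed.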

Returns

pd.DataFrame
DataFrame containing both original and augmented molecules, with a 'parent_idx' column linking augmented molecules to their source molecules.
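The sketch below illustrates how a 'parent_idx' column can tie each augmented row back to its source row, and how `augment_percentage` can translate into a number of new rows. It uses plain dictionaries rather than a DataFrame, and everything in it is an assumption made for illustration: the helper name `sketch_augment`, the use of `None` as the parent index of original rows, the rounding with `int`, and the string reversal that stands in for a real augmentation method.

```python
import random


def sketch_augment(smiles_list, augment_percentage=0.2, seed=42,
                   transform=lambda s: s[::-1]):
    """Return original rows plus augmented rows, each carrying a parent_idx.

    `transform` is a placeholder for a real augmentation method.
    """
    rng = random.Random(seed)

    # Original rows: this sketch marks them with parent_idx = None (an assumption).
    rows = [{"SMILES": s, "parent_idx": None} for s in smiles_list]

    # Number of new rows requested, as a fraction of the original size.
    target_new = int(len(smiles_list) * augment_percentage)

    for _ in range(target_new):
        parent = rng.randrange(len(smiles_list))  # pick a source molecule
        rows.append({"SMILES": transform(smiles_list[parent]),
                     "parent_idx": parent})
    return rows


molecules = ["CCO", "CCN", "CCCO", "CCCN", "CCOC",
             "CCS", "CCCS", "CCCCO", "CCCCN", "CCF"]
rows = sketch_augment(molecules, augment_percentage=0.2)

# Each augmented row can be traced back to the molecule it came from.
for row in rows[len(molecules):]:
    print(row["SMILES"], "<-", molecules[row["parent_idx"]])
```

With 10 input molecules and `augment_percentage=0.2`, the sketch appends 2 rows, so the result has 12 rows before any duplicate removal.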

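Notes

The augmented dataset is automatically saved to a CSV file in the current working directory. The file name is "Augmented_" followed by the `dataset` argument, so passing "QM9.csv" produces "Augmented_QM9.csv".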
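The source listing for this method builds the output name with the f-string `f"Augmented_{dataset}"`. The sketch below reproduces only that expression to show what it yields for different inputs. `PurePosixPath` is used so the result does not depend on the operating system, and `output_name_basename_only` is a hypothetical alternative, not part of Augchem.

```python
from pathlib import PurePosixPath


def output_name(dataset):
    # Same expression as in the source listing: the whole path is interpolated.
    return f"Augmented_{dataset}"


def output_name_basename_only(dataset):
    # Hypothetical alternative: prefix only the file name, ignoring directories.
    return f"Augmented_{PurePosixPath(dataset).name}"


print(output_name(PurePosixPath("QM9.csv")))                     # Augmented_QM9.csv
print(output_name(PurePosixPath("data/QM9.csv")))                # Augmented_data/QM9.csv
print(output_name_basename_only(PurePosixPath("data/QM9.csv")))  # Augmented_QM9.csv
```

When `dataset` includes a directory, the prefix lands on the first path component, so the write only succeeds if a directory such as `Augmented_data` already exists. Passing a bare file name avoids this.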

Source code in augchem\core.py

def augment_data(
    self,
    dataset: Path,
    mask_ratio: float = 0.1,
    delete_ratio: float = 0.3,
    seed: int = 42,
    augment_percentage: float = 0.2,
    augmentation_methods: List[str] = ["fusion", "enumerate"],
    col_to_augment: str = 'SMILES',
    property_col: str = None,
) -> pd.DataFrame:
    """
    Augment molecular SMILES data from a CSV file.

    Reads SMILES strings from a CSV file, applies specified augmentation methods,
    and returns the augmented dataset. Also saves the augmented dataset to a new CSV file.

    Parameters
    ----------
    `dataset` : Path
        Path to the CSV file containing SMILES data to augment

    `mask_ratio` : float, default=0.1
        Fraction of tokens to mask when using masking-based augmentation methods

    `delete_ratio` : float, default=0.3
        Fraction of tokens to delete when using deletion-based augmentation methods

    `seed` : int, default=42
        Random seed for reproducible augmentation

    `augment_percentage` : float, default=0.2
        Target size of augmented dataset as a fraction of original dataset size

    `augmentation_methods` : List[str], default=["fusion", "enumerate"]
        List of augmentation methods to apply. Valid options include:
        "mask", "delete", "swap", "fusion", "enumeration"
        (the default list spells the last one "enumerate")

    `col_to_augment` : str, default='SMILES'
        Column name in the CSV file containing SMILES strings to augment

    `property_col` : str, optional
        Column name containing property values to preserve in augmented data

    Returns
    -------
    `pd.DataFrame`
        DataFrame containing both original and augmented molecules, with a 'parent_idx'
        column linking augmented molecules to their source molecules

    Notes
    -----
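    The augmented dataset is automatically saved to the current working
    directory under the name "Augmented_" followed by the `dataset` argument
    (for example, "Augmented_QM9.csv" when `dataset` is "QM9.csv").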
    """
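    # Load the input CSV into a DataFrame
    df = pd.read_csv(dataset)

    # Generate augmented rows with the selected methods
    new_df = augment_dataset(
        dataset=df,
        augmentation_methods=augmentation_methods,
        mask_ratio=mask_ratio,
        delete_ratio=delete_ratio,
        col_to_augment=col_to_augment,
        augment_percentage=augment_percentage,
        seed=seed,
        property_col=property_col,
    )

    # Drop exact duplicate rows, then save; the output name is "Augmented_"
    # followed by the dataset argument exactly as it was passed in
    new_df = new_df.drop_duplicates()
    new_df.to_csv(f"Augmented_{dataset}", index=True, float_format='%.8e')

    # Report how many rows were added relative to the input
    new_data = len(new_df) - len(df)
    print(f"Generated new {new_data} SMILES")

    return new_df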