Data Loaders¶

This section covers the data loading utilities available in Augchem. These tools help you load and process molecular data from various formats for augmentation and analysis.

Standard Loader¶

This class is a helper in case you don't have your own QM9 dataset loader. In case you have your own dataset loader, you can ignore these functions.

It's important to note that these data augmentation methods aren't exclusive to the QM9 dataset, and can be used with any SMILES, graph or InChI data.

`augchem.QM9Loader(path: Path)` ¶

Source code in augchem\core.py

def __init__(self, path: Path):
    self.path = path

`load_qm9_dataset(directory_path, list_mols=[])` ¶

Load the entire QM9 dataset from a directory containing .xyz files.

Source code in augchem\core.py

def load_qm9_dataset(self, directory_path, list_mols=[]):
    """Load the entire QM9 dataset from a directory containing .xyz files."""
    X = []
    Y = []
    S = []
    SMILES1 = []
    SMILES2 = []
    INCHI1 = []
    INCHI2 = []

    for file_name in os.listdir(directory_path):
        if file_name.endswith(".xyz"):
            file_path = os.path.join(directory_path, file_name)
            molecule_data = self.load_qm9_xyz(file_path)
            if molecule_data['natoms'] in list_mols or len(list_mols)==0:
                X.append([molecule_data['atoms'], molecule_data['coordinates']])
                Y.append(molecule_data['properties'])
                S.append(molecule_data['natoms'])
                SMILES1.append(molecule_data['smiles_1'])
                SMILES2.append(molecule_data['smiles_2'])
                INCHI1.append(molecule_data['inchi_1'])
                INCHI2.append(molecule_data['inchi_2'])

    return X, Y, S, SMILES1, SMILES2, INCHI1, INCHI2

`load_qm9_xyz(file_path)` ¶

Load a single QM9.xyz file.

Source code in augchem\core.py

def load_qm9_xyz(self, file_path):
    """Load a single QM9.xyz file."""
    with open(file_path, 'r') as f:
        # Number of atoms
        natoms = int(f.readline())
        # Properties are in the second line
        properties = list(map(float, f.readline().split()[2:]))
        # Read atomic coordinates and types
        atoms = []
        coordinates = []
        smiles1 = ''
        smiles2 = '' 
        inchi1 = ''
        inchi2 = ''
        # print(properties)
        for num_line, line in enumerate(f):
            # print(num_line, line)
            if num_line >= 0 and num_line < natoms:
                info = line.replace("*^","e").split()
                atoms.append(info[0])
                coordinates.append(list(map(float, info[1:-1])))
            if num_line == natoms + 1:
                smiles1, smiles2 = line.split()
            if num_line == natoms + 2:
                inchi1, inchi2 = line.split()

    return {
        "natoms": natoms,
        "atoms": atoms,
        "coordinates": np.array(coordinates),
        "smiles_1": smiles1,
        "smiles_2": smiles2,
        "inchi_1": inchi1,
        "inchi_2": inchi2,
        "properties": properties
    }

PyTorch Geometric SDF Loader¶

The TorchGeometricSDFLoader is a specialized loader for working with SDF (Structure-Data File) format using PyTorch Geometric. This loader is optimized for molecular graph processing and augmentation workflows.

Key Features¶

Pure PyTorch Geometric: Uses only PyTorch Geometric primitives for optimal performance
SDF Support: Direct loading from SDF files containing molecular structures
Graph Conversion: Converts RDKit molecules to torch_geometric.data.Data objects
Batch Processing: Integrated DataLoader creation for efficient training
Analytics: Comprehensive graph statistics and visualization tools
Self-loop Management: Automatic detection and removal for clean augmentation

Example Usage¶

# Initialize the loader
loader = TorchGeometricSDFLoader(
    sdf_path="path/to/molecules.sdf",
    transform=NormalizeFeatures()  # Optional transforms
)

# Load molecules from SDF
molecules = loader.load_sdf(max_molecules=1000)

# Convert to PyTorch Geometric graphs
graphs = loader.convert_to_torch_geometric_graphs(
    add_self_loops=False  # Important for augmentation techniques
)

# Create DataLoader for batch processing
dataloader = loader.create_torch_geometric_dataloader(
    batch_size=32, 
    shuffle=True
)

# Get comprehensive statistics
stats = loader.get_torch_geometric_statistics()
print(f"Total graphs: {stats['total_graphs']}")
print(f"Average nodes per graph: {stats['nodes_per_graph']['mean']:.2f}")

# Visualize a molecular graph
loader.visualize_with_torch_geometric(idx=0)

# Save processed graphs
loader.save_torch_geometric_graphs("processed_graphs.pt")

Graph Features¶

Node Features (9 dimensions)¶

Atomic number
Degree (number of bonds)
Formal charge
Hybridization type
Aromaticity (boolean)
Number of radical electrons
Total number of hydrogens
Ring membership (boolean)
Atomic mass

Edge Features (4 dimensions)¶

Bond type (single, double, triple, aromatic)
Aromaticity (boolean)
Ring membership (boolean)
Conjugation (boolean)

Augmentation-Ready Processing¶

The loader is specifically designed to work seamlessly with the graph augmentation techniques:

# Load and prepare graphs for augmentation
loader = TorchGeometricSDFLoader(sdf_path, transform=None)
molecules = loader.load_sdf(max_molecules=500)

# Convert without self-loops (optimal for edge dropping)
graphs = loader.convert_to_torch_geometric_graphs(add_self_loops=False)

# Apply augmentation techniques
from augchem.modules.graph.graphs_modules import augment_dataset

augmented_graphs = augment_dataset(
    graphs=graphs,
    augmentation_methods=['edge_drop', 'node_drop', 'feature_mask'],
    augment_percentage=0.3
)

Advanced Analytics¶

# Check for self-loops before augmentation
from your_utils import check_self_loops_in_graphs, remove_self_loops_from_graphs

# Analyze self-loops
self_loop_stats = check_self_loops_in_graphs(graphs)
print(f"Graphs with self-loops: {self_loop_stats['graphs_with_self_loops']}")

# Clean graphs if needed
if self_loop_stats['total_self_loops'] > 0:
    cleaned_graphs = remove_self_loops_from_graphs(graphs)
    loader.graphs = cleaned_graphs  # Update loader with clean graphs

# Comprehensive visualization
plot_torch_geometric_statistics(cleaned_graphs)

Performance Tips¶

Memory Management: Load molecules in batches for large datasets
Self-loop Removal: Always check and remove self-loops before augmentation
Transform Selection: Use transforms without AddSelfLoops for augmentation workflows
Batch Size: Optimize batch size based on your GPU memory
Reproducibility: Set random seeds for consistent results

Integration with Machine Learning¶

# Create train/validation splits
from sklearn.model_selection import train_test_split

train_graphs, val_graphs = train_test_split(
    augmented_graphs, 
    test_size=0.2, 
    random_state=42
)

# Create optimized DataLoaders
train_loader = DataLoader(train_graphs, batch_size=64, shuffle=True)
val_loader = DataLoader(val_graphs, batch_size=64, shuffle=False)

# Ready for GNN training!
for batch in train_loader:
    # batch.x: node features
    # batch.edge_index: edge connections
    # batch.edge_attr: edge features
    # batch.batch: graph assignment for nodes
    pass

Data Loaders¶

Standard Loader¶

augchem.QM9Loader(path: Path) ¶

load_qm9_dataset(directory_path, list_mols=[]) ¶

load_qm9_xyz(file_path) ¶

PyTorch Geometric SDF Loader¶

Key Features¶

Example Usage¶

Graph Features¶

Node Features (9 dimensions)¶

Edge Features (4 dimensions)¶

Augmentation-Ready Processing¶

Advanced Analytics¶

Performance Tips¶

Integration with Machine Learning¶

`augchem.QM9Loader(path: Path)` ¶

`load_qm9_dataset(directory_path, list_mols=[])` ¶

`load_qm9_xyz(file_path)` ¶