Skip to content

What is Augchem?

Augchem is a comprehensive toolbox for chemical data augmentation developed in partnership with CINE and FAPESP. It provides state-of-the-art techniques for augmenting molecular data across multiple representations.

Why use data augmentation for chemical data?

Machine Learning (ML) is gaining prominence in the discovery of new materials, complementing traditional methods by analyzing large volumes of data and identifying patterns. The effectiveness of ML models depends on data quality, making data augmentation techniques essential to improve model accuracy in materials chemistry. However, there is a lack of studies integrating these techniques in this field. This toolbox aims to address this gap with easy-to-use Python libraries for chemical data augmentation.

Key Features

🔤 SMILES Augmentation

  • Token Manipulation: Advanced parsing and modification of SMILES strings
  • Masking Techniques: Replace molecular tokens with mask symbols for robust training
  • Deletion Strategies: Remove random tokens to create structural variations
  • Atom Swapping: Exchange atomic positions for diverse canonical representations
  • Fusion Methods: Intelligently combine multiple augmentation techniques
  • Enumeration: Generate non-canonical SMILES for increased diversity
  • Chemical Validation: Built-in RDKit integration for molecular correctness
  • Property Preservation: Maintain molecular properties during augmentation

🔗 Molecular Graph Augmentation

  • Edge Dropping: Systematic removal of molecular bonds for structural variation
  • Node Dropping: Atomic removal while preserving molecular validity
  • Feature Masking: Node feature perturbation for robust representation learning
  • Edge Perturbation: Dynamic bond addition and removal for chemical space exploration

⚡ Integration & Processing

  • PyTorch Geometric: Native support for modern graph neural network workflows
  • Pandas Integration: Seamless DataFrame processing for molecular datasets
  • Batch Operations: Efficient processing of large molecular databases
  • RDKit Compatibility: Chemical validation and property calculation
  • Reproducible Results: Seed-based random state management

🛠️ Developer Experience

  • Simple API: Intuitive interface through main Augmentator class
  • Flexible Parameters: Customizable augmentation rates and methods
  • Quality Control: Built-in validation and error handling
  • Memory Efficient: Optimized tensor operations and data structures
  • GPU Support: Acceleration for large-scale processing

Target Applications

🔤 SMILES-Based Applications

  • Language Models: Training chemical language models with augmented SMILES
  • Property Prediction: Enhance datasets for molecular property regression
  • Drug Discovery: Virtual compound generation and ADMET prediction
  • Chemical Space Exploration: Systematic molecular diversity expansion
  • Text-Based ML: Transformer and RNN training for chemical sequences

🔗 Graph-Based Applications

  • Graph Neural Networks: Training data enhancement for chemical GNNs
  • Drug Discovery: Molecular property prediction with enhanced datasets
  • Materials Science: Crystal structure and property augmentation
  • Chemical Informatics: Robust molecular representation learning
  • Structure-Activity Relationships: SAR modeling with diverse molecular graphs

🔬 Research Applications

  • Cheminformatics Research: Systematic molecular dataset expansion
  • ML Benchmarking: Standardized augmentation for fair model comparison
  • Data Scarcity: Address limited training data in specialized chemical domains
  • Robustness Testing: Evaluate model performance under molecular variations