# User Guide

This guide walks you through the main features of **arch_eval** and shows how to use them effectively.

## Installation

```bash
# Clone the repository
git clone --depth=1 https://github.com/lof310/arch_eval.git
cd arch_eval

# Install in development mode (recommended)
pip install -e .

# Install normally
pip install .

# Or install from PyPI
pip install arch_eval
```

**Dependencies**:

- Python ≥ 3.8
- PyTorch ≥ 1.9
- pandas, numpy, scikit‑learn, psutil, matplotlib, seaborn
- Optional: wandb, transformer_engine (for FP8), ffmpeg (for video)

## Core Concepts

arch_eval is built around a few central objects:

- **`TrainingConfig`** – holds all parameters for a single training run.
- **`Trainer`** – trains a single model and returns a history.
- **`BenchmarkConfig`** – holds parameters for comparing multiple models.
- **`Benchmark`** – runs several models (sequentially or in parallel) and returns a comparison table.
- **`HyperparameterOptimizer`** – performs grid or random search over a hyperparameter space.

All configuration is done via dataclasses, making it easy to serialise and share.

## Basic Training

Here’s the simplest possible training script:

```python
import torch.nn as nn
from arch_eval import Trainer, TrainingConfig

# 1. Define your model
class SimpleMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(20, 10)

    def forward(self, x):
        return self.fc(x)

# 2. Create configuration
config = TrainingConfig(
    dataset="synthetic classification",  # built-in synthetic data
    dataset_params={
        "n_samples": 1000,
        "n_features": 20,
        "n_classes": 10,
    },
    training_args={
        "batch_size": 32,
        "learning_rate": 0.001,
        "num_epochs": 5,
    },
    task="classification",
    realtime=True,  # show live plot window
)

# 3. Instantiate trainer and run
model = SimpleMLP()
trainer = Trainer(model, config)
history = trainer.train()

print(history["val_accuracy"][-1])  # final validation accuracy
```

### Explanation

- `dataset="synthetic classification"` tells the library to generate a synthetic classification dataset.
- `dataset_params` are passed to `sklearn.datasets.make_classification`.
- `training_args` holds the usual training hyperparameters.
- `realtime=True` opens a matplotlib window that updates every `viz_interval` steps (default 10). It shows metric curves and system resource usage.

After training, `history` is a dictionary mapping metric names (such as `"train_loss"` and `"val_accuracy"`) to lists of per-epoch values.

## Configuration Deep Dive

All configuration classes inherit from `BaseConfig`, which provides common fields. Below are the most important ones.

### Data Specification

You can specify data in several ways:

| Method | Example |
|--------|---------|
| **Synthetic** | `dataset="synthetic classification"` with `dataset_params` |
| **Torchvision** | `dataset="cifar10"`, `dataset_params={"split": "train"}` |
| **Hugging Face** | `dataset = load_dataset("cifar10")` and pass the dataset object |
| **Custom Dataset** | Pass a `torch.utils.data.Dataset` instance |
| **Tensor/Dict** | Pass a tuple `(data, targets)` or a dict with `"data"` and `"targets"` |
| **Streaming** | Set `dataset_streaming=True` for a Hugging Face `IterableDataset` |

For distributed training, you can also shard datasets with `dataset_shard = {"num_shards": 4, "shard_id": rank}`.

### Device and Precision

- `device` – auto‑selects `"cuda"` if available, else `"cpu"`.
- `dtype` – default `torch.float32`.
- `mixed_precision=True` enables AMP. Use `mixed_precision_dtype` to choose `"float16"`, `"bfloat16"`, or `"fp8"` (experimental, requires Transformer Engine).

### Logging and Visualization

- `log_interval` – how often (in steps) to print to the console.
- `viz_interval` – how often to update the realtime window.
- `save_plot` – list of metric names; at the end of training, PNG plots are saved.
- `save_video` – list of metric names; a video of the metric evolution is created (requires ffmpeg).
- `log_to_wandb` – enable Weights & Biases logging. Also set `wandb_project` and optionally `wandb_run_name`.

### Callbacks

Callbacks are passed via the `callbacks` list. Built‑in callbacks:

```python
from arch_eval import EarlyStopping, ModelCheckpoint, TensorBoardLogger

callbacks = [
    EarlyStopping(monitor="val_loss", patience=5),
    ModelCheckpoint(
        filepath="checkpoints/epoch-{epoch}.pt",
        monitor="val_accuracy",
        save_best_only=True,
    ),
    TensorBoardLogger(log_dir="./logs"),
]
```

You can also write your own callback by subclassing `Callback` and overriding any of its methods.

## Benchmarking Multiple Models

To compare several architectures, use `Benchmark`:

```python
from arch_eval import Benchmark, BenchmarkConfig

models = [
    {"name": "MLP Small", "model": MLP(hidden=128)},
    {"name": "MLP Large", "model": MLP(hidden=256)},
]

bench_config = BenchmarkConfig(
    dataset="synthetic classification",
    dataset_params={"n_samples": 5000, "n_features": 64, "n_classes": 20},
    training_args={"num_epochs": 10, "batch_size": 64},
    compare_metrics=["accuracy", "loss"],
    parallel=True,        # run models concurrently
    use_processes=False,  # threads are safe on CPU; on GPU, prefer threads or sequential runs
)

benchmark = Benchmark(models, bench_config)
results = benchmark.run()  # returns a pandas DataFrame
print(results)
```

- `parallel=True` runs models in parallel using threads (or processes if `use_processes=True`). For GPU training, parallelism may cause memory issues – use it with caution or keep runs sequential.
- `compare_metrics` lists the metrics you want to extract from each model’s history. They must appear in the history (e.g., `"accuracy"`, `"val_loss"`).
- The resulting DataFrame contains one row per model with the final value of each requested metric.
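Because the comparison table is a plain pandas DataFrame, you can post-process it with the usual pandas tools. As a minimal sketch (with made-up numbers standing in for a real run), ranking the benchmarked models by their final accuracy looks like:

```python
import pandas as pd

# Hypothetical output of Benchmark.run(): one row per model,
# with the final value of each requested metric (numbers made up)
results = pd.DataFrame({
    "name": ["MLP Small", "MLP Large"],
    "accuracy": [0.81, 0.87],
    "loss": [0.52, 0.41],
})

# Rank models by final accuracy, best first
ranked = results.sort_values("accuracy", ascending=False).reset_index(drop=True)
print(ranked.loc[0, "name"])  # best model by accuracy
```

The same pattern works for saving results (`results.to_csv(...)`) or filtering to models above an accuracy threshold.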
## Hyperparameter Optimization

The `HyperparameterOptimizer` class provides grid and random search:

```python
from arch_eval import HyperparameterOptimizer, TrainingConfig

def model_fn():
    return MLP()  # must return a fresh model each time

base_config = TrainingConfig(
    dataset="synthetic classification",
    dataset_params={"n_samples": 1000, "n_features": 64, "n_classes": 10},
    training_args={"num_epochs": 5},
    task="classification",
    realtime=False,  # disable live plots during search
)

param_grid = {
    "learning_rate": [0.001, 0.01, 0.1],
    "batch_size": [32, 64],
}

optimizer = HyperparameterOptimizer(
    model_fn,
    base_config,
    param_grid,
    search_type="grid",  # or "random"
    metric="val_accuracy",
    mode="max",
)

results = optimizer.run()
print(results)
```

- `param_grid` keys can be either top‑level attributes of `TrainingConfig` (like `batch_size`) or keys inside `training_args` (like `learning_rate`). The optimizer updates the config accordingly.
- For random search, set `search_type="random"` and optionally `n_trials`.
- The returned DataFrame includes every tried hyperparameter combination and the target metric.

## Distributed Training

arch_eval supports three distributed backends:

- `DATAPARALLEL` – `torch.nn.DataParallel` (simple, but slower due to the GIL)
- `DISTRIBUTED` – `torch.nn.parallel.DistributedDataParallel` (recommended for multi‑GPU)
- `FSDP` – Fully Sharded Data Parallel (PyTorch ≥ 1.12)

To use DDP:

```python
from arch_eval import TrainingConfig, DistributedBackend

config = TrainingConfig(
    ...,
    distributed_backend=DistributedBackend.DISTRIBUTED,
    distributed_world_size=2,  # number of processes
    distributed_rank=0,        # set per process
    distributed_master_addr="127.0.0.1",
    distributed_master_port="29500",
)
```

You must launch your script with `torchrun` (or the legacy `torch.distributed.launch`).
For example:

```bash
torchrun --nproc_per_node=2 train.py
```

In the script, each process receives a different rank; the trainer automatically handles model wrapping and data sharding if you set `dataset_shard` accordingly.

## Using Plugins

Plugins are external modules that can register global hooks. They are discovered automatically if their module name starts with `arch_eval_plugin_` or ends with `_plugin`.

To create a plugin:

1. Create a Python file (e.g., `my_plugin.py`).
2. Define functions decorated with `@hook("hook_name")`.
3. Place it somewhere in your `PYTHONPATH`.

Example plugin:

```python
from arch_eval.plugins import hook

@hook("on_epoch_end")
def log_extra_info(trainer, epoch, metrics):
    print(f"Epoch {epoch} done, loss={metrics.get('val_loss', -1):.4f}")
```

You can also register local hooks directly on a `Trainer` instance via its `plugin_manager`:

```python
def my_local_hook(trainer, batch_idx, loss):
    print(f"Batch {batch_idx} loss: {loss}")

trainer.plugin_manager.register_local_hook("on_batch_end", my_local_hook)
```

## Advanced Features

### Gradient Checkpointing

Enable gradient checkpointing to reduce memory usage for large models:

```python
config.gradient_checkpointing = True
config.gradient_checkpointing_modules = ["layer1", "layer2"]  # optional: checkpoint only specific modules
```

### Mixed Precision with FP8

Requires NVIDIA Transformer Engine:

```python
config.mixed_precision = True
config.mixed_precision_dtype = "fp8"
```

### Profiling

Enable the PyTorch profiler to trace execution:

```python
config.profiler = {
    "enabled": True,
    "activities": ["cpu", "cuda"],
    "schedule": {"wait": 1, "warmup": 1, "active": 3},
    "trace_path": "./traces",
}
```

### Custom Loss Functions

You can provide your own loss function:

```python
import torch

def my_loss(output, target):
    return torch.nn.functional.mse_loss(output, target)

config.loss_function = my_loss
```

### Model Output Transformation

If your model returns a tuple or a dict, you can transform it into the expected format (a tensor of logits) using `model_output_transform`:

```python
def transform(output):
    return output["logits"] if isinstance(output, dict) else output[0]

config.model_output_transform = transform
```

## Transformer and Custom Model Compatibility

arch_eval is designed to work with any PyTorch model architecture, including transformer models from Hugging Face or custom implementations. The library automatically handles various output formats:

### Supported Output Formats

1. **Tensor output** (standard): `return logits` (shape: `(batch, num_classes)`)
2. **Tuple output**: `return (logits, loss)` or `return (loss, logits)`
3. **Dict output** (Hugging Face style): `return {"logits": logits, "loss": loss}`
4. **Dict with only logits**: `return {"logits": logits}`
5. **Hugging Face output objects**: instances of `CausalLMOutput`, `SequenceClassifierOutput`, etc. from `transformers.modeling_outputs`

The trainer automatically detects and extracts:

- Loss values from tuples, dicts, or objects with a `.loss` attribute (preferring explicit `"loss"` keys)
- Logits/predictions for metric calculation from tensors, dicts, or objects with a `.logits` attribute

### Example: Training a Transformer Model

```python
import torch
import torch.nn as nn
from arch_eval import Trainer, TrainingConfig

class SimpleTransformer(nn.Module):
    def __init__(self, vocab_size=1000, d_model=128, nhead=4, num_classes=10):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, x):
        # x shape: (batch, seq_len)
        emb = self.embedding(x)
        out = self.transformer(emb)
        pooled = out.mean(dim=1)  # mean pooling over the sequence
        return {"logits": self.classifier(pooled)}  # dict output like HF models

# Create a random text-classification dataset
seq_len, vocab_size = 64, 1000
X = torch.randint(0, vocab_size, (500, seq_len))
y = torch.randint(0, 10, (500,))

config = TrainingConfig(
    dataset=(X, y),
    training_args={"num_epochs": 5, "batch_size": 16},
    task="classification",
    realtime=False,
)

model = SimpleTransformer()
trainer = Trainer(model, config)
history = trainer.train()
```

### Example: Training with lof310/transformer or Hugging Face Models

The library is fully compatible with the [lof310/transformer](https://github.com/lof310/transformer) library and Hugging Face Transformers:

```python
import torch
from transformer import Transformer, TransformerConfig  # lof310/transformer
from arch_eval import Trainer, TrainingConfig

# Create transformer config
model_config = TransformerConfig(
    vocab_size=32000,
    d_model=256,
    n_heads=8,
    n_layer=4,
    d_ff=512,
    max_seq_len=128,
)

# Create model – returns CausalLMOutput(loss=..., logits=...)
model = Transformer(model_config)

# Prepare a language-modeling dataset
input_ids = torch.randint(0, 32000, (1000, 128))
labels = input_ids.clone()  # for next-token prediction

config = TrainingConfig(
    dataset=(input_ids, labels),
    training_args={"num_epochs": 3, "batch_size": 8},
    task="next-token-prediction",
    realtime=False,
)

trainer = Trainer(model, config)
history = trainer.train()  # works seamlessly
```

Similarly for Hugging Face models:

```python
from transformers import AutoModelForSequenceClassification
from arch_eval import Trainer, TrainingConfig

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=10)
# The model returns a dict with 'loss' and 'logits' keys

config = TrainingConfig(
    dataset=(input_ids, attention_mask, labels),  # pre-tokenized tensors prepared elsewhere
    training_args={"num_epochs": 3, "batch_size": 16},
    task="classification",
)

trainer = Trainer(model, config)
history = trainer.train()
```

### Custom Loss Handling

For models that compute their own loss internally (common in transformers), the trainer will use the provided loss value when available:

```python
# Model returning (logits, loss)
def forward(self, x, labels=None):
    ...
    if labels is not None:
        loss = criterion(logits, labels)
        return (logits, loss)  # trainer uses this loss directly
    return (logits,)

# Model returning a dict with a loss
def forward(self, x, labels=None):
    ...
    result = {"logits": logits}
    if labels is not None:
        result["loss"] = criterion(logits, labels)
    return result  # trainer uses result["loss"] if present
```

This flexibility ensures compatibility with:

- Hugging Face Transformers (`AutoModelForSequenceClassification`, etc.)
- Custom transformer architectures
- Models with auxiliary losses
- Multi-task learning setups

## Logging and Monitoring

- Console logging is configured via `setup_logging(level="INFO")`.
- WandB integration: set `log_to_wandb=True` and `wandb_project`.
- TensorBoard: use the `TensorBoardLogger` callback.
- Real‑time window: `realtime=True` (requires an interactive backend, e.g., TkAgg).
- Video recording: `save_video=["loss", "accuracy"]` – frames are saved and assembled with ffmpeg at the end.

## Best Practices

1. **Use `seed` for reproducibility** – set `seed=42` and optionally `deterministic=True`.
2. **Start with synthetic data** to quickly test your pipeline.
3. **Monitor GPU memory** with `memory_summary()` or the real‑time window.
4. **For hyperparameter search**, disable realtime plots (`realtime=False`) to avoid GUI overhead.
5. **When benchmarking on GPU**, prefer sequential execution or threads; processes may not work well with CUDA.
6. **Save checkpoints regularly** with `ModelCheckpoint` to recover from interruptions.
7. **Use the `profiler` to identify bottlenecks** in data loading or the model's forward/backward passes.

## Next Steps

- See the [Examples](examples.md) page for complete, runnable scripts.
- Browse the [API Reference](api.md) for detailed signatures and parameters.