Reference
addse.data
addse.data.AudioStreamingDataLoader
Bases: StreamingDataLoader
Audio streaming dataloader.
__init__
__init__(
dataset: AudioStreamingDataset | DynamicMixingDataset,
batch_size: int = 1,
num_workers: int = 0,
shuffle: bool | None = None,
**kwargs: Any,
) -> None
Initialize the audio streaming dataloader.
Parameters:
- dataset (AudioStreamingDataset | DynamicMixingDataset) – Dataset to wrap.
- batch_size (int, default: 1) – Batch size.
- num_workers (int, default: 0) – Number of workers.
- shuffle (bool | None, default: None) – Whether to shuffle the dataset at every epoch. If None, uses the dataset shuffle attribute.
- **kwargs (Any, default: {}) – Additional keyword arguments passed to the parent constructor.
__len__
Get the number of batches in the dataloader.
Returns:
- int – Number of batches in the dataloader.
Raises:
- TypeError – If the wrapped dataset is an instance of AudioStreamingDataset with segment_length != None, as the total number of segments in the dataset cannot be determined without iterating over it.
addse.data.AudioStreamingDataset
Bases: StreamingDataset
Audio streaming dataset.
__getitem__
__init__
__init__(
input_dir: str,
fs: int | None = None,
segment_length: float | None = None,
max_length: float | None = None,
max_dynamic_range: float | None = None,
shuffle: bool = False,
seed: int = 0,
**kwargs: Any,
) -> None
Initialize the audio streaming dataset.
Parameters:
- input_dir (str) – Path or URL to LitData-optimized audio data.
- fs (int | None, default: None) – Optional sample rate to resample to.
- segment_length (float | None, default: None) – Audio segment length in seconds. If provided, audio files are concatenated and segmented into chunks of this length. Otherwise, audio files are yielded as is and may have variable length.
- max_length (float | None, default: None) – Maximum output length in seconds. If provided, audio files longer than this are skipped. Cannot be used together with segment_length.
- max_dynamic_range (float | None, default: None) – Maximum dynamic range in dB. If provided, audio files and segments with a dynamic range greater than this value are skipped.
- shuffle (bool, default: False) – Whether to shuffle the dataset.
- seed (int, default: 0) – Random seed for shuffling.
- **kwargs (Any, default: {}) – Additional keyword arguments passed to the parent constructor.
__len__
Get the number of files in the dataset.
Returns:
- int – The number of files in the dataset.
Note
If segment_length is not None, the number of samples yielded by this dataset when iterating over it does not match the output of this method.
__next__
Get the next item from the dataset.
Returns:
- ASDOutput – Audio data with shape (1, num_samples), sample rate, name, and the number of files loaded to get this item. The number of files loaded is required by DynamicMixingDataset.
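The segment_length behaviour described above (concatenate, then cut into fixed-length chunks) can be sketched as follows; this is a hypothetical illustration in plain Python, not the addse implementation:

```python
def segment(samples: list[float], fs: int, segment_length: float) -> list[list[float]]:
    # Chunk size in samples for a segment of segment_length seconds at rate fs.
    chunk = int(segment_length * fs)
    # Yield only full chunks; a trailing partial chunk is dropped in this sketch.
    return [samples[i:i + chunk] for i in range(0, len(samples) - chunk + 1, chunk)]
```

This also shows why __len__ (which counts files) does not match the number of yielded segments.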
addse.data.DynamicMixingDataset
Bases: ParallelStreamingDataset
Dynamic mixing dataset.
Wraps two AudioStreamingDataset instances, one for speech and one for noise, and generates noisy speech samples on-the-fly by mixing the speech and noise samples at a random signal-to-noise ratio (SNR).
Multi-channel speech and noise samples are converted to mono by randomly selecting one channel.
If the speech and noise samples have different lengths, the noise is cycled or trimmed to match the speech length.
When length=float("inf"), this dataset is infinite and should be used with limit_<stage>_batches in the
Lightning Trainer.
__init__
__init__(
speech_dataset: AudioStreamingDataset,
noise_dataset: AudioStreamingDataset,
snr_range: tuple[float, float] = (-5.0, 15.0),
rms_range: tuple[float, float] | None = (0.0, 0.0),
length: int | float | None = float("inf"),
resume: bool = True,
reset_rngs: bool = False,
**kwargs: Any,
) -> None
Initialize the dynamic mixing dataset.
Parameters:
- speech_dataset (AudioStreamingDataset) – Speech dataset.
- noise_dataset (AudioStreamingDataset) – Noise dataset.
- snr_range (tuple[float, float], default: (-5.0, 15.0)) – SNR range.
- rms_range (tuple[float, float] | None, default: (0.0, 0.0)) – RMS range for the clean speech in dB. If None, no RMS adjustment is performed.
- length (int | float | None, default: float("inf")) – Number of samples to yield per epoch. If None, the speech and noise datasets are iterated over until one is exhausted. If an integer, the datasets are cycled until length samples are yielded. If float("inf"), the datasets are cycled indefinitely.
- resume (bool, default: True) – Whether to resume the dataset from where it left off in the previous epoch when starting a new epoch. Should be set to False for validation and test datasets. Only works when iterating with an AudioStreamingDataLoader. Ignored if length is None.
- reset_rngs (bool, default: False) – Whether to set the internal random number generators to the same initial state at the start of each epoch. If True, random numbers are consistent across epochs. Should be set to True for validation and test datasets.
- **kwargs (Any, default: {}) – Additional keyword arguments passed to the parent constructor.
__iter__
__len__
transform
staticmethod
transform(
samples: tuple[ASDOutput, ASDOutput],
rngs: dict[str, Any],
snr_range: tuple[float, float],
rms_range: tuple[float, float] | None,
) -> tuple[
torch.Tensor, torch.Tensor, int, tuple[int, int]
]
Generate noisy speech from speech and noise samples.
Parameters:
- samples (tuple[ASDOutput, ASDOutput]) – Tuple with speech and noise samples.
- rngs (dict[str, Any]) – Random number generators.
- snr_range (tuple[float, float]) – SNR range.
- rms_range (tuple[float, float] | None) – RMS range for the clean speech in dB. If None, no RMS adjustment is performed.
Returns:
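The core of the mixing operation (cycle or trim the noise to the speech length, then scale it so the mixture hits a target SNR) can be sketched in plain Python; this is an assumed formulation for illustration, and the actual transform may differ in detail:

```python
import math

def mix_at_snr(speech: list[float], noise: list[float], snr_db: float) -> list[float]:
    """Scale noise so that 10*log10(P_speech / P_noise) equals snr_db, then add."""
    p_s = sum(x * x for x in speech) / len(speech)
    p_n = sum(x * x for x in noise) / len(noise)
    # Cycle (repeat) or trim the noise to match the speech length, as described above.
    noise = (noise * (len(speech) // len(noise) + 1))[: len(speech)]
    # Gain that brings the noise power to p_s / 10**(snr_db / 10).
    gain = math.sqrt(p_s / (p_n * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]
```

At snr_db=0 the scaled noise has the same power as the speech; larger SNR values attenuate the noise.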
addse.layers
addse.layers.BandMerge
Bases: Module
Band-merge module.
__init__
__init__(
subband_idx: Iterable[tuple[int, int]],
input_channels: int,
output_channels: int,
num_channels: int,
norm: Callable[[int], Module],
mlp: Callable[
[int, int, Callable[[int], Module]], Module
],
residual: bool,
) -> None
Initialize the band-merge module.
forward
Forward pass.
Parameters:
- x (Tensor) – Input tensor with shape (batch_size, input_channels, num_bands, num_frames).
Returns:
- tuple[Tensor, Tensor | None] – Tuple (mask, residual), where mask are complex-valued spatial filtering coefficients with shape (batch_size, input_channels, output_channels, num_freqs, num_frames), and residual is a residual additive short-time Fourier transform with shape (batch_size, output_channels, num_freqs, num_frames), or None if residual=False.
addse.layers.BandSplit
Bases: Module
Band-split module.
addse.layers.BatchNorm
Bases: Module
Batch normalization.
Input tensors must have shape (B, C, ...) where B is the batch dimension, C is the channel dimension, and
... are the spatial dimensions (e.g. height and width in computer vision, frequency and time in audio, or sequence
length in NLP). The statistics are aggregated over the batch and spatial dimensions as in [1], Figure 2. The input is normalized using these statistics and transformed by channel-specific learnable scale and shift parameters \(\gamma\) and \(\beta\). Note the reparameterization of the scale parameter compared to the default PyTorch implementation.
Unlike other normalization modules, this module has track_running_stats and momentum options.
1. Y. Wu and K. He, "Group normalization", ECCV, 2018.
__init__
__init__(
num_channels: int,
eps: float = 1e-05,
track_running_stats: bool = True,
momentum: float | None = 0.1,
) -> None
Initialize the batch normalization module.
Parameters:
- num_channels (int) – Number of channels in input tensors.
- eps (float, default: 1e-05) – Small value for numerical stability.
- track_running_stats (bool, default: True) – If True, normalization statistics are aggregated over batches during training and saved for evaluation. If False, statistics are computed from the current batch both during training and evaluation.
- momentum (float | None, default: 0.1) – Momentum for running statistics. The bigger the value, the more weight is given to the current batch statistics. Ignored if track_running_stats is False. If None, running statistics are cumulatively aggregated over batches without decay.
addse.layers.GroupNorm
Bases: Module
Group normalization.
Input tensors must have shape (B, C, ...) where B is the batch dimension, C is the channel dimension, and
... are the spatial dimensions (e.g. height and width in computer vision, frequency and time in audio, or sequence
length in NLP). The statistics are aggregated over grouped channels and spatial dimensions as in [1], Figure 2. The input is normalized using these statistics and transformed by channel-specific learnable scale and shift parameters \(\gamma\) and \(\beta\). Note the reparameterization of the scale parameter compared to the default PyTorch implementation.
1. Y. Wu and K. He, "Group normalization", ECCV, 2018.
__init__
Initialize the group normalization module.
Parameters:
- num_groups (int) – Number of groups to separate the channels into.
- num_channels (int) – Number of channels in input tensors.
- eps (float, default: 1e-05) – Small value for numerical stability.
- causal (bool, default: False) – If True, normalization statistics are cumulatively aggregated along the time dimension. The time dimension must be the last dimension of the input tensor.
addse.layers.InstanceNorm
Bases: GroupNorm
Instance normalization.
Input tensors must have shape (B, C, ...) where B is the batch dimension, C is the channel dimension, and ... are the spatial dimensions (e.g. height and width in computer vision, frequency and time in audio, or sequence length in NLP). The statistics are aggregated over the spatial dimensions as in [1], Figure 2. The input is normalized using these statistics and transformed by channel-specific learnable scale and shift parameters \(\gamma\) and \(\beta\). Note the reparameterization of the scale parameter compared to the default PyTorch implementation.
1. Y. Wu and K. He, "Group normalization", ECCV, 2018.
__init__
Initialize the instance normalization module.
Parameters:
- num_channels (int) – Number of channels in input tensors.
- eps (float, default: 1e-05) – Small value for numerical stability.
- causal (bool, default: False) – If True, normalization statistics are cumulatively aggregated along the time dimension. The time dimension must be the last dimension of the input tensor.
addse.layers.LayerNorm
Bases: Module
Layer normalization.
Input tensors must have shape (B, C, ...) where B is the batch dimension, C is the channel dimension, and
... are the spatial dimensions (e.g. height and width in computer vision, frequency and time in audio, or sequence
length in NLP). The input is normalized using the aggregated statistics and transformed by channel-specific learnable scale and shift parameters \(\gamma\) and \(\beta\). Note the reparameterization of the scale parameter compared to the default PyTorch implementation.
If element_wise and frame_wise are both False, the statistics are aggregated over the channel dimension and all spatial dimensions as in [1], Figure 2. In this case, setting causal=False matches the global layer normalization in [2], while setting causal=True matches the cumulative layer normalization in [2]. The time dimension must be the last dimension of input tensors.
If element_wise is True, the statistics are aggregated over the channel dimension only as in [3]. I.e. each element (e.g. pixel in computer vision, time-frequency unit in audio, or token in NLP) is normalized independently.
If frame_wise is True, the statistics are aggregated over the channel dimension and all spatial dimensions except the time dimension. The time dimension must be the last dimension of input tensors.
1. Y. Wu and K. He, "Group normalization", ECCV, 2018.
2. Y. Luo and N. Mesgarani, "Conv-TasNet: Surpassing ideal time-frequency magnitude masking for speech separation", IEEE/ACM TASLP, 2019.
3. S. Shen, Z. Yao, A. Gholami, M. W. Mahoney, and K. Keutzer, "PowerNorm: Rethinking batch normalization in transformers", ICML, 2020.
__init__
__init__(
num_channels: int,
element_wise: bool = False,
frame_wise: bool = False,
causal: bool = False,
center: bool = True,
eps: float = 1e-05,
) -> None
Initialize the layer normalization module.
Parameters:
- num_channels (int) – Number of channels in input tensors.
- element_wise (bool, default: False) – If True, each element (e.g. pixel in computer vision, time-frequency unit in audio, or token in NLP) is normalized independently. Mutually exclusive with frame_wise and causal.
- frame_wise (bool, default: False) – If True, each time frame is normalized independently. The time dimension must be the last dimension of input tensors. Mutually exclusive with element_wise and causal.
- causal (bool, default: False) – If True, normalization statistics are cumulatively aggregated along the time dimension. The time dimension must be the last dimension of the input tensor. Mutually exclusive with element_wise and frame_wise.
- center (bool, default: True) – If False, the mean is not subtracted from the input, and the input is scaled using the root mean square (RMS) instead of the variance. The bias term \(\beta\) is also omitted.
- eps (float, default: 1e-05) – Small value for numerical stability.
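The causal (cumulative) aggregation can be illustrated on a single 1-D sequence, where the statistics at frame t use only frames 0..t. This toy sketch ignores the channel/spatial aggregation and the scale reparameterization described above:

```python
import math

def causal_norm(x: list[float], eps: float = 1e-5) -> list[float]:
    out, s, s2 = [], 0.0, 0.0
    for t, v in enumerate(x, start=1):
        # Running sums give the mean and variance over frames 0..t only.
        s += v
        s2 += v * v
        mean = s / t
        var = s2 / t - mean * mean
        out.append((v - mean) / math.sqrt(var + eps))
    return out
```

No future frame influences the output at frame t, which is what makes the normalization usable in causal (streaming) models.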
addse.layers.group_norm
group_norm(
x: Tensor,
num_groups: int,
weight: Tensor,
bias: Tensor | None,
eps: float,
causal: bool,
frame_wise: bool,
) -> torch.Tensor
Functional interface for group normalization.
See GroupNorm for details.
addse.lightning
addse.lightning.ADDSELightningModule
Bases: BaseLightningModule, ConfigureOptimizersMixin
ADDSE Lightning module.
__init__
__init__(
nac_cfg: str,
nac_ckpt: str,
model: ADDSERQDiT,
num_steps: int,
block_size: int,
optimizer: Callable[
[Iterator[Parameter]], Optimizer
] = Adam,
lr_scheduler: Mapping[str, Any] | None = None,
val_metrics: Mapping[str, BaseMetric] | None = None,
test_metrics: Mapping[str, BaseMetric] | None = None,
log_cfg: LogConfig | None = None,
debug_sample: tuple[int, int] | None = None,
) -> None
Initialize the ADDSE Lightning module.
forward
Enhance the input audio.
loss
Compute the \(\lambda\)-denoising cross-entropy loss.
Parameters:
- x_q (Tensor) – Noisy speech embeddings. Shape (batch_size, emb_channels, num_codebooks, seq_len).
- y_q (Tensor) – Clean speech embeddings. Shape (batch_size, emb_channels, num_codebooks, seq_len).
- y_tok (Tensor) – Clean speech tokens. Shape (batch_size, num_codebooks, seq_len).
Returns:
- Tensor – The \(\lambda\)-denoising cross-entropy loss.
addse.lightning.BaseLightningModule
Bases: LightningModule
Base class for Lightning modules.
log_debug_samples
log_debug_samples(
batch: tuple[Tensor, Tensor, Tensor],
batch_idx: int,
debug_samples: dict[str, Tensor],
) -> None
Log debug audio samples to W&B.
log_metrics
log_metrics(
loss: dict[str, Tensor],
metrics: dict[str, float],
stage: str,
on_step: bool,
on_epoch: bool,
) -> None
Log losses and metrics.
step
abstractmethod
step(
batch: tuple[Tensor, Tensor, Tensor],
stage: str,
batch_idx: int,
metrics: Mapping[str, BaseMetric] | None = None,
) -> tuple[
dict[str, Tensor], dict[str, float], dict[str, Tensor]
]
Training, validation, or test step.
Parameters:
-
batch(tuple[Tensor, Tensor, Tensor]) –A batch from the dataloader.
-
stage(str) –"train","val", or"test". -
batch_idx(int) –Index of the batch.
-
metrics(Mapping[str, BaseMetric] | None, default:None) –Metrics to compute.
Noneifstageis"train"or if no metrics are defined.
Returns:
test_step
training_step
validation_step
addse.lightning.ConfigureOptimizersMixin
Bases: LightningModule
Mixin for standard configuration of optimizer and learning rate scheduler.
configure_optimizers
Configure optimizers.
Returns:
- Any – Dictionary with the optimizer, learning rate scheduler, and learning rate scheduler configuration.
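The returned dictionary presumably follows the standard Lightning convention; a sketch with placeholder objects (not the addse code) would be:

```python
def build_optimizer_config(optimizer, scheduler, scheduler_cfg):
    # Lightning accepts {"optimizer": ..., "lr_scheduler": {"scheduler": ..., ...}}.
    out = {"optimizer": optimizer}
    if scheduler is not None:
        # scheduler_cfg carries keys such as "interval" or "frequency".
        out["lr_scheduler"] = {"scheduler": scheduler, **scheduler_cfg}
    return out
```

build_optimizer_config is a hypothetical helper named here for illustration only.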
addse.lightning.DataModule
Bases: LightningDataModule
Data module.
__init__
__init__(
train_dataset: Callable[[], Dataset],
train_dataloader: Callable[[Dataset], DataLoader],
val_dataset: Callable[[], Dataset] | None = None,
val_dataloader: Callable[[Dataset], DataLoader]
| None = None,
test_dataset: Callable[[], Dataset] | None = None,
test_dataloader: Callable[[Dataset], DataLoader]
| None = None,
) -> None
Initialize the data module.
Parameters:
- train_dataset (Callable[[], Dataset]) – Function to initialize the training dataset.
- val_dataset (Callable[[], Dataset] | None, default: None) – Function to initialize the validation dataset.
- test_dataset (Callable[[], Dataset] | None, default: None) – Function to initialize the test dataset.
- train_dataloader (Callable[[Dataset], DataLoader]) – Function to initialize the training dataloader.
- val_dataloader (Callable[[Dataset], DataLoader] | None, default: None) – Function to initialize the validation dataloader.
- test_dataloader (Callable[[Dataset], DataLoader] | None, default: None) – Function to initialize the test dataloader.
load_state_dict
Load the state dict of the data module.
setup
test_dataloader
Get the test dataloader.
Returns:
- DataLoader | list – The test dataloader, or an empty list if no test dataset was provided at initialization.
train_dataloader
val_dataloader
Get the validation dataloader.
Returns:
- DataLoader | list – The validation dataloader, or an empty list if no validation dataset was provided at initialization.
addse.lightning.EDMMixin
Bases: LightningModule
Mixin for training and sampling as in EDM.
denoiser
Compute the denoiser parametrization as in EDM.
addse.lightning.EDMNACSELightningModule
Bases: BaseLightningModule, ConfigureOptimizersMixin, EDMMixin
Lightning module for speech enhancement using NAC-domain EDM-style diffusion.
__init__
__init__(
nac_cfg: str,
nac_ckpt: str,
nac_domain: str,
nac_no_sum: bool,
nac_stack: bool,
model: ADDSERQDiT,
num_steps: int,
block_size: int,
norm_factor: float = 2.3,
sigma_data: float = 0.5,
p_mean: float = 0.0,
p_sigma: float = 1.0,
s_churn: float = 0.0,
s_min: float = 0.0,
s_max: float = float("inf"),
s_noise: float = 1.0,
sigma_min: float = 0.002,
sigma_max: float = 80.0,
rho: float = 7.0,
optimizer: Callable[
[Iterator[Parameter]], Optimizer
] = Adam,
lr_scheduler: Mapping[str, Any] | None = None,
val_metrics: Mapping[str, BaseMetric] | None = None,
test_metrics: Mapping[str, BaseMetric] | None = None,
log_cfg: LogConfig | None = None,
debug_sample: tuple[int, int] | None = None,
) -> None
Initialize the NAC-domain EDM-style Lightning module.
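The sigma_min, sigma_max, and rho parameters suggest the EDM noise schedule of Karras et al. (2022); a sketch of that schedule is shown below as an assumption, not code taken from the addse source:

```python
def edm_sigmas(num_steps: int, sigma_min: float = 0.002,
               sigma_max: float = 80.0, rho: float = 7.0) -> list[float]:
    # Interpolate between sigma_max and sigma_min in rho-warped space.
    inv = 1.0 / rho
    sigmas = [
        (sigma_max ** inv + i / (num_steps - 1) * (sigma_min ** inv - sigma_max ** inv)) ** rho
        for i in range(num_steps)
    ]
    return sigmas + [0.0]  # final step denoises to sigma = 0
```

Larger rho concentrates more steps near sigma_min, spending more sampler budget on low-noise refinement.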
addse.lightning.EDMSELightningModule
Bases: BaseLightningModule, ConfigureOptimizersMixin, EDMMixin
Lightning module for speech enhancement using STFT-domain EDM-style diffusion.
__init__
__init__(
model: ADM,
stft: STFT,
num_steps: int = 30,
sigma_data: float = 0.5,
p_mean: float = 0.0,
p_sigma: float = 1.0,
s_churn: float = 0.0,
s_min: float = 0.0,
s_max: float = float("inf"),
s_noise: float = 1.0,
sigma_min: float = 0.002,
sigma_max: float = 80.0,
rho: float = 7.0,
optimizer: Callable[
[Iterator[Parameter]], Optimizer
] = Adam,
lr_scheduler: Mapping[str, Any] | None = None,
val_metrics: Mapping[str, BaseMetric] | None = None,
test_metrics: Mapping[str, BaseMetric] | None = None,
log_cfg: LogConfig | None = None,
debug_sample: tuple[int, int] | None = None,
) -> None
Initialize the STFT-domain EDM-style Lightning module.
inverse_transform
Decompress and compute the inverse STFT.
addse.lightning.LightningModule
Bases: BaseLightningModule, ConfigureOptimizersMixin
Simple Lightning module for training models to directly predict clean speech given noisy speech.
__init__
__init__(
model: Module,
loss: BaseLoss,
optimizer: Callable[
[Iterator[Parameter]], Optimizer
] = Adam,
lr_scheduler: Mapping[str, Any] | None = None,
val_metrics: Mapping[str, BaseMetric] | None = None,
test_metrics: Mapping[str, BaseMetric] | None = None,
log_cfg: LogConfig | None = None,
debug_sample: tuple[int, int] | None = None,
) -> None
Initialize the simple Lightning module.
Parameters:
- model (Module) – Model to train.
- loss (BaseLoss) – Loss module.
- optimizer (Callable[[Iterator[Parameter]], Optimizer], default: Adam) – Optimizer constructor.
- lr_scheduler (Mapping[str, Any] | None, default: None) – Learning rate scheduler configuration.
- val_metrics (Mapping[str, BaseMetric] | None, default: None) – Metrics to compute during validation.
- test_metrics (Mapping[str, BaseMetric] | None, default: None) – Metrics to compute during testing.
- log_cfg (LogConfig | None, default: None) – Logging configuration.
- debug_sample (tuple[int, int] | None, default: None) – Tuple (batch_idx, sample_idx) to log debug audio samples to W&B during validation.
addse.lightning.LogConfig
dataclass
Configuration for logging losses and metrics.
addse.lightning.NACLightningModule
Bases: BaseLightningModule
Lightning module for neural audio codec.
__init__
__init__(
generator: NAC,
discriminator: Module | Iterable[Module],
reconstruction_loss: BaseLoss,
adversarial_loss_weight: float,
feature_loss_weight: float,
reconstruction_loss_weight: float,
codebook_loss_weight: float,
commitment_loss_weight: float,
generator_optimizer: Callable[
[Iterator[Parameter]], Optimizer
],
discriminator_optimizer: Callable[
[Iterator[Parameter]], Optimizer
],
generator_grad_clip: float = 0.0,
discriminator_grad_clip: float = 0.0,
val_metrics: Mapping[str, BaseMetric] | None = None,
test_metrics: Mapping[str, BaseMetric] | None = None,
log_cfg: LogConfig | None = None,
debug_sample: tuple[int, int] | None = None,
) -> None
Initialize the neural audio codec Lightning module.
configure_optimizers
discriminator_forward
Forward pass through all discriminators.
addse.lightning.NACSELightningModule
Bases: BaseLightningModule, ConfigureOptimizersMixin
Lightning module for speech enhancement using NAC-domain direct prediction.
__init__
__init__(
nac_cfg: str,
nac_ckpt: str,
nac_domain: str,
nac_no_sum: bool,
model: Module,
block_size: int,
optimizer: Callable[
[Iterator[Parameter]], Optimizer
] = Adam,
lr_scheduler: Mapping[str, Any] | None = None,
val_metrics: Mapping[str, BaseMetric] | None = None,
test_metrics: Mapping[str, BaseMetric] | None = None,
log_cfg: LogConfig | None = None,
debug_sample: tuple[int, int] | None = None,
) -> None
Initialize the NAC-domain Lightning module.
addse.lightning.SGMSELightningModule
Bases: BaseLightningModule, ConfigureOptimizersMixin
SGMSE Lightning module.
__init__
__init__(
model: SGMSEUNet,
stft: STFT,
num_steps: int = 30,
sigma_min: float = 0.05,
sigma_max: float = 0.5,
gamma: float = 1.5,
t_eps: float = 0.03,
corrector_snr: float = 0.5,
alpha: float = 0.5,
beta: float = 0.15,
optimizer: Callable[
[Iterator[Parameter]], Optimizer
] = Adam,
lr_scheduler: Mapping[str, Any] | None = None,
val_metrics: Mapping[str, BaseMetric] | None = None,
test_metrics: Mapping[str, BaseMetric] | None = None,
log_cfg: LogConfig | None = None,
debug_sample: tuple[int, int] | None = None,
) -> None
Initialize the SGMSE Lightning module.
inverse_transform
Decompress, descale, and compute the inverse STFT.
addse.lightning.compute_metrics
compute_metrics(
x: Tensor,
y: Tensor,
metrics: Mapping[str, BaseMetric] | None = None,
) -> dict[str, float]
Compute validation or test metrics.
Parameters:
- x (Tensor) – Signal to evaluate. Shape (batch_size, num_channels, num_samples).
- y (Tensor) – Reference signal for the metrics. Shape (batch_size, num_channels, num_samples).
- metrics (Mapping[str, BaseMetric] | None, default: None) – Metrics to compute.
Returns:
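The function presumably reduces to a loop over the metric mapping; a minimal sketch with plain callables standing in for BaseMetric instances:

```python
def compute_metrics_sketch(x, y, metrics=None):
    # With no metrics (None or empty mapping), nothing is computed.
    if not metrics:
        return {}
    # Each metric compares the evaluated signal x against the reference y.
    return {name: float(metric(x, y)) for name, metric in metrics.items()}
```

compute_metrics_sketch is a hypothetical stand-in; the real function operates on tensors and BaseMetric objects.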
addse.lightning.load_nac
Load a pretrained neural audio codec.
addse.losses
addse.losses.BaseLoss
Bases: Module
Base class for losses.
compute
abstractmethod
Compute the loss.
This method should not be called directly. Use forward instead.
forward
addse.losses.MSMelSpecLoss
Bases: MultiTaskLoss
Multi-scale mel-spectrogram loss.
__init__
__init__(
n_mels: int | Collection[int] = (
4,
8,
16,
32,
64,
128,
256,
),
frame_lengths: Collection[int] = (
31,
67,
127,
257,
509,
1021,
2053,
),
hop_lengths: Collection[int | None] | None = None,
n_ffts: Collection[int | None] | None = None,
weights: Collection[float] | None = None,
window: str = "flattop",
fs: int = 16000,
compression: float = 2.0,
log: bool = True,
power: float = 1.0,
eps: float = 1e-07,
mel_norm: Literal["slaney", "consistent"]
| None = "consistent",
stft_norm: bool = True,
) -> None
Initialize the multi-scale mel-spectrogram loss.
addse.losses.MelSpecLoss
Bases: BaseLoss
Mel-spectrogram loss.
__init__
__init__(
n_mels: int = 64,
frame_length: int = 512,
hop_length: int | None = None,
n_fft: int | None = None,
window: str = "flattop",
fs: int = 16000,
compression: float = 2.0,
log: bool = True,
power: float = 1.0,
eps: float = 1e-07,
mel_norm: Literal["slaney", "consistent"]
| None = "consistent",
stft_norm: bool = True,
) -> None
Initialize the mel-spectrogram loss.
addse.losses.MultiTaskLoss
Bases: BaseLoss
Multi-task loss.
addse.losses.SDRLoss
Bases: BaseLoss
Signal-to-distortion ratio (SDR) loss.
__init__
Initialize the SDR loss.
Parameters:
addse.metrics
addse.metrics.BaseMetric
Base class for metrics.
__call__
addse.metrics.DNSMOSMetric
Bases: BaseMetric
Deep noise suppression mean opinion score (DNSMOS) metric.
Calculated independently for each channel and averaged across channels.
__init__
addse.metrics.LPSMetric
Bases: BaseMetric
Levenshtein phoneme similarity (LPS).
Calculated independently for each channel and averaged across channels.
addse.metrics.MCDMetric
Bases: BaseMetric
Mel-cepstral distance (MCD) metric.
Calculated independently for each channel and averaged across channels.
addse.metrics.NISQAMetric
Bases: BaseMetric
Non-intrusive speech quality assessment (NISQA) metric.
Calculated independently for each channel and averaged across channels.
addse.metrics.PESQMetric
Bases: BaseMetric
Perceptual evaluation of speech quality (PESQ) metric.
Calculated independently for each channel and averaged across channels.
__init__
addse.metrics.SBSMetric
Bases: BaseMetric
SpeechBERTScore (SBS).
addse.metrics.SCOREQMetric
Bases: BaseMetric
Speech contrastive regression for quality assessment (SCOREQ).
Calculated independently for each channel and averaged across channels.
addse.metrics.SDRMetric
Bases: BaseMetric
Signal-to-distortion ratio (SDR) metric.
__init__
Initialize the SDR metric.
Parameters:
addse.metrics.STOIMetric
Bases: BaseMetric
Short-time objective intelligibility (STOI) metric.
Calculated independently for each channel and averaged across channels.
addse.metrics.UTMOSMetric
Bases: BaseMetric
UTokyo-SaruLab MOS prediction system (UTMOSv2).
Calculated independently for each channel and averaged across channels.
addse.models.addse
addse.models.addse.ADDSEDiT
Bases: Module
ADDSE DiT.
addse.models.addse.ADDSEDiTBlock
Bases: Module
ADDSE DiT block.
addse.models.addse.ADDSEEmbeddingBlock
Bases: Module
ADDSE noise embedding block with Fourier features.
addse.models.addse.ADDSERQDiT
Bases: Module
Residual Quantized Diffusion Transformer (RQDiT) backbone used in ADDSE.
__init__
__init__(
input_channels: int,
output_channels: int,
num_codebooks: int,
hidden_dim: int,
num_layers: int,
num_heads: int,
max_seq_len: int,
conditional: bool,
time_independent: bool,
) -> None
Initialize the ADDSE RQDiT backbone.
Parameters:
- input_channels (int) – Number of input channels.
- output_channels (int) – Number of output channels.
- num_codebooks (int) – Number of codebooks.
- hidden_dim (int) – Number of DiT hidden channels.
- num_layers (int) – Number of DiT layers.
- num_heads (int) – Number of DiT attention heads.
- max_seq_len (int) – Maximum sequence length.
- conditional (bool) – Whether the model is conditional.
- time_independent (bool) – Whether the model is time-independent.
forward
Forward pass.
Parameters:
- x (Tensor) – Diffused embeddings. Shape (batch_size, input_channels, num_codebooks, seq_len) or (batch_size, input_channels, seq_len).
- c (Tensor | None, default: None) – Conditioning embeddings. Same shape as x.
- t (Tensor | None, default: None) – Time step or noise level. Shape (batch_size,).
Returns:
- Tensor – Output tensor. Shape (batch_size, output_channels, num_codebooks, seq_len).
addse.models.addse.ADDSESelfAttentionBlock
Bases: Module
ADDSE self-attention block.
addse.models.adm
addse.models.adm.ADM
Bases: Module
ADM similar to configuration F in EDM2 paper.
__init__
__init__(
num_channels: int = 1,
base_channels: int = 96,
num_res_blocks: int = 3,
channel_mult: Sequence[int] = (1, 2, 3, 4),
attn_levels: Container[int] = (),
) -> None
Initialize ADM.
forward
Forward pass.
Parameters:
- y (Tensor) – Complex-valued diffused speech tensor. Shape (batch_size, num_channels, num_freqs, num_frames).
- x (Tensor) – Complex-valued noisy speech tensor. Shape (batch_size, num_channels, num_freqs, num_frames).
- t (Tensor) – Diffusion step or noise level. Shape (batch_size,).
Returns:
- Tensor – Complex-valued output score. Shape (batch_size, num_channels, num_freqs, num_frames).
addse.models.adm.ADMAttentionBlock
Bases: Module
ADM attention block.
addse.models.adm.ADMBlock
Bases: Module
ADM block.
addse.models.adm.ADMEmbeddingBlock
Bases: Module
ADM time step embedding block.
addse.models.adm.ADMResample
Bases: Module
ADM 2D resampling block.
addse.models.bsrnn
addse.models.bsrnn.BSRNN
Bases: Module
1. Y. Luo and J. Yu, "Music source separation with band-split RNN", IEEE/ACM TASLP, 2023.
2. J. Yu and Y. Luo, "Efficient monaural speech enhancement with universal sample rate band-split RNN", IEEE ICASSP, 2023.
3. J. Yu, H. Chen, Y. Luo, R. Gu, and C. Weng, "High fidelity speech enhancement with band-split RNN", INTERSPEECH, 2023.
__init__
__init__(
stft: STFT | None = None,
fs: int = 16000,
input_channels: int = 1,
output_channels: int = 1,
num_channels: int = 32,
num_layers: int = 6,
causal: bool = False,
subbands: Iterable[tuple[float, int]] = [
(100.0, 10),
(200.0, 10),
(500.0, 6),
(1000.0, 2),
],
residual: bool = False,
norm: Callable[[int], Module] | None = None,
) -> None
Initialize BSRNN.
Parameters:
- stft (STFT | None, default: None) – STFT module.
- fs (int, default: 16000) – Sampling rate.
- input_channels (int, default: 1) – Number of input channels.
- output_channels (int, default: 1) – Number of output channels.
- num_channels (int, default: 32) – Number of internal channels. Denoted as N in the paper.
- num_layers (int, default: 6) – Number of dual-path modelling layers.
- causal (bool, default: False) – Whether to use unidirectional RNNs along the time axis.
- subbands (Iterable[tuple[float, int]], default: [(100.0, 10), (200.0, 10), (500.0, 6), (1000.0, 2)]) – List of tuples (bandwidth, number), where bandwidth is the bandwidth of the subband in Hz and number is the number of subbands with that bandwidth.
- residual (bool, default: False) – Whether to predict a residual STFT in addition to the mask. The residual STFT is added after applying the mask to the input STFT.
- norm (Callable[[int], Module] | None, default: None) – Normalization module to use throughout the network. If None, defaults to LayerNorm with causal=causal. If a non-causal normalization module is provided, the network is not causal, even if causal=True.
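One plausible reading of the (bandwidth, number) tuples is that each bandwidth in Hz maps to a width in STFT bins given the sampling rate and FFT size. The helper below is an assumed interpretation for illustration; the addse implementation may partition bins differently:

```python
def subband_widths(subbands: list[tuple[float, int]], fs: int, n_fft: int) -> list[int]:
    # Frequency resolution of the STFT in Hz per bin.
    hz_per_bin = fs / n_fft
    widths = []
    for bandwidth, number in subbands:
        # Repeat each subband width `number` times, converted to bins.
        widths += [round(bandwidth / hz_per_bin)] * number
    return widths
```

With the default subbands at fs=16000, the low frequencies get many narrow subbands and the high frequencies a few wide ones.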
addse.models.bsrnn.BSRNNMLP
Bases: Module
Multi-Layer perceptron (MLP) used in BSRNN.
addse.models.bsrnn.BSRNNRNNBlock
Bases: Module
RNN block used in BSRNN.
addse.models.convtasnet
addse.models.convtasnet.ConvTasNet
Bases: Module
Conv-TasNet [1].
Consists of an encoder, a temporal convolutional network (TCN), and a decoder.
1. Y. Luo and N. Mesgarani, "Conv-TasNet: Surpassing ideal time-frequency magnitude masking for speech separation", IEEE/ACM TASLP, 2019.
__init__
__init__(
input_channels: int = 1,
output_channels: int = 1,
num_filters: int = 512,
filter_size: int = 32,
hop_size: int | None = None,
bottleneck_channels: int = 128,
hidden_channels: int = 512,
skip_channels: int = 128,
kernel_size: int = 3,
layers: int = 8,
repeats: int = 3,
causal: bool = False,
norm: Callable[[int], Module] | None = None,
) -> None
Initialize Conv-TasNet.
Parameters:
-
input_channels(int, default:1) –Number of input channels.
-
output_channels(int, default:1) –Number of output channels.
-
num_filters(int, default:512) –Number of filters in the encoder. Denoted as N in the paper.
-
filter_size(int, default:32) –Encoder filter length. Denoted as L in the paper.
-
hop_size(int | None, default:None) –Encoder hop size. If
None, defaults toencoder_kernel_size // 2. -
bottleneck_channels(int, default:128) –Number of bottleneck channels in the TCN. Denoted as B in the paper.
-
hidden_channels(int, default:512) –Number of hidden channels in the TCN. Denoted as H in the paper.
-
skip_channels(int, default:128) –Number of skip channels in the TCN. Denoted as Sc in the paper.
-
kernel_size(int, default:3) –Kernel size in the TCN. Denoted as P in the paper.
-
layers(int, default:8) –Number of layers in the TCN. Denoted as X in the paper.
-
repeats(int, default:3) –Number of repeats in the TCN. Denoted as R in the paper.
-
causal(bool, default:False) –Whether to use causal convolutions in the TCN.
-
norm(Callable[[int], Module] | None, default:None) –Normalization module to use in the TCN. If
None, defaults to LayerNorm with causal=causal. If a non-causal normalization module is provided, the TCN is not causal, even if causal=True.
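The kernel_size, layers, and repeats parameters determine the TCN's receptive field. A small sketch, assuming the standard Conv-TasNet dilation pattern where layer i of each repeat uses dilation 2**i (as in the paper):

```python
def tcn_receptive_field(kernel_size: int = 3, layers: int = 8, repeats: int = 3) -> int:
    """Receptive field of the TCN in encoder frames.

    Each dilated 1D conv block with dilation d grows the receptive field
    by (kernel_size - 1) * d; dilations double within each repeat.
    """
    rf = 1
    for _ in range(repeats):
        for i in range(layers):
            rf += (kernel_size - 1) * 2 ** i
    return rf

frames = tcn_receptive_field()  # 1 + 3 * 2 * (2**8 - 1) = 1531 frames
```

With the default filter_size=32 the encoder hop is 16 samples, so at 16 kHz the TCN sees roughly 1531 * 16 / 16000 ≈ 1.5 s of context.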
addse.models.convtasnet.ConvTasNetConv1DBlock
Bases: Module
1D convolutional block with PReLU activation and normalization used in Conv-TasNet.
addse.models.convtasnet.ConvTasNetTCN
Bases: Module
Temporal convolutional network (TCN) used in Conv-TasNet.
addse.models.mpd
addse.models.mpd.MPDiscriminator
Bases: Module
Multi-period discriminator.
addse.models.mpd.PDiscriminator
Bases: Module
Period discriminator.
addse.models.mpd.PDiscriminatorConv1d
Bases: Module
Period discriminator 1D convolutional layer.
addse.models.msstftd
addse.models.msstftd.MSSTFTDiscriminator
Bases: Module
Multi-scale short-time Fourier transform (MS-STFT) discriminator.
__init__
__init__(
frame_lengths: Collection[int] = (
127,
257,
509,
1021,
2053,
),
hop_lengths: Collection[int | None] | None = None,
n_ffts: Collection[int | None] | None = None,
window: str = "flattop",
in_channels: int = 1,
out_channels: int = 1,
num_channels: int = 32,
kernel_size: tuple[int, int] = (9, 3),
stride: tuple[int, int] = (2, 1),
dilations: Iterable[int] = (1, 2, 4),
) -> None
Initialize the MS-STFT discriminator.
addse.models.msstftd.STFTDiscriminator
Bases: Module
Short-time Fourier transform (STFT) discriminator.
__init__
__init__(
frame_length: int = 512,
hop_length: int | None = None,
n_fft: int | None = None,
window: str = "flattop",
in_channels: int = 1,
out_channels: int = 1,
num_channels: int = 32,
kernel_size: tuple[int, int] = (9, 3),
stride: tuple[int, int] = (2, 1),
dilations: Iterable[int] = (1, 2, 4),
) -> None
Initialize the STFT discriminator.
addse.models.msstftd.STFTDiscriminatorConv2d
Bases: Module
Short-time Fourier transform (STFT) discriminator 2D convolutional layer.
addse.models.nac
addse.models.nac.NAC
Bases: Module
Neural audio codec.
__init__
__init__(
in_channels: int = 1,
emb_channels: int = 1024,
base_channels: int = 32,
strides: list[int] = [2, 2, 4, 4, 5],
kernel_size: int = 3,
num_residual_units: int = 3,
dilation_base: int = 3,
encoder_in_kernel_size: int = 7,
encoder_out_kernel_size: int = 7,
decoder_in_kernel_size: int = 7,
decoder_out_kernel_size: int = 7,
codebook_channels: int | None = 8,
codebook_size: int = 1024,
num_codebooks: int = 4,
normalize: bool = True,
shared_codebook: bool = False,
) -> None
Initialize the neural audio codec.
Parameters:
-
in_channels(int, default:1) –Number of input channels.
-
emb_channels(int, default:1024) –Number of output and input channels for the encoder and decoder, respectively.
-
base_channels(int, default:32) –Number of base channels for the encoder and decoder.
-
strides(list[int], default:[2, 2, 4, 4, 5]) –Downsampling and upsampling factors for the encoder and decoder blocks, respectively.
-
kernel_size(int, default:3) –Kernel size for the residual units.
-
num_residual_units(int, default:3) –Number of residual units per encoder and decoder block.
-
dilation_base(int, default:3) –Dilation base for the residual units.
-
encoder_in_kernel_size(int, default:7) –Kernel size for the encoder input convolutional layer.
-
encoder_out_kernel_size(int, default:7) –Kernel size for the encoder output convolutional layer.
-
decoder_in_kernel_size(int, default:7) –Kernel size for the decoder input convolutional layer.
-
decoder_out_kernel_size(int, default:7) –Kernel size for the decoder output convolutional layer.
-
codebook_channels(int | None, default:8) –Number of channels for the codebook vectors. If
None, uses emb_channels. Otherwise, each quantizer uses input and output linear layers to map between emb_channels and codebook_channels. -
codebook_size(int, default:1024) –Number of vectors per codebook.
-
num_codebooks(int, default:4) –Number of codebooks.
-
normalize(bool, default:True) –Whether to normalize the embeddings and codebook vectors before codebook lookup.
-
shared_codebook(bool, default:False) –Whether to use the same codebook for all quantizers.
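The default strides and codebook settings fix the codec's frame rate and bitrate. A small arithmetic sketch (the 16 kHz sample rate is an assumption for illustration, not part of the model):

```python
import math

strides = [2, 2, 4, 4, 5]   # default encoder/decoder strides
codebook_size = 1024
num_codebooks = 4
fs = 16000                  # assumed sample rate

hop = math.prod(strides)                                   # total downsampling: 320 samples/frame
frame_rate = fs / hop                                      # 50 frames per second
bits_per_frame = num_codebooks * math.log2(codebook_size)  # 4 codebooks * 10 bits
bitrate = frame_rate * bits_per_frame                      # 2000 bits/s, i.e. 2 kbps
```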
decode
Decode input into audio.
Parameters:
-
x(Tensor) –Input tensor: - If
domain is "code": Shape (batch_size, num_codebooks, num_frames). - If domain is "x": Shape (batch_size, emb_channels, num_frames). - If domain is "q": Shape (batch_size, emb_channels, num_frames) if no_sum is False else (batch_size, emb_channels, num_codebooks, num_frames). - If domain is "x_proj": Shape (batch_size, codebook_channels, num_codebooks, num_frames). - If domain is "q_proj": Shape (batch_size, codebook_channels, num_codebooks, num_frames). -
no_sum(bool, default:False) –If
False, the input quantized embeddings are assumed to be summed across codebooks. Ignored if domain is not "q". -
domain(str, default:'code') –Domain of input tensor.
Returns:
-
Tensor–Decoded audio. Shape
(batch_size, in_channels, num_samples).
encode
Encode input audio into discrete codes.
Parameters:
-
x(Tensor) –Input audio. Shape
(batch_size, in_channels, num_samples). -
no_sum(bool, default:False) –If
True, the quantized embeddings are not summed across codebooks. Ignored if domain is not "q". -
domain(str, default:'q') –Which continuous output to return. One of: -
"x": Return the encoder output. - "q": Return the quantized embeddings. - "x_proj": Return the projected encoder output in codebook space. - "q_proj": Return the projected quantized embeddings in codebook space.
Returns:
-
Tensor–Tuple
(codes, continuous): -
Tensor–codes: Discrete codes. Shape (batch_size, num_codebooks, num_frames).
-
tuple[Tensor, Tensor]–continuous: Continuous output: - If
domain is "x": Shape (batch_size, emb_channels, num_frames). - If
domain is "q": Shape (batch_size, emb_channels, num_frames) if no_sum is False else (batch_size, emb_channels, num_codebooks, num_frames). - If
domain is "x_proj": Shape (batch_size, codebook_channels, num_codebooks, num_frames). - If
domain is "q_proj": Shape (batch_size, codebook_channels, num_codebooks, num_frames).
forward
Forward pass.
Parameters:
-
x(Tensor) –Input audio. Shape
(batch_size, in_channels, num_samples).
Returns:
-
Tensor–Tuple
(decoded, codes, codebook_loss, commit_loss), where decoded is the reconstructed audio with shape (batch_size, in_channels, num_samples), codes are the discrete codes with shape (batch_size, num_codebooks, num_frames), codebook_loss is the codebook loss, and commit_loss is the commitment loss.
addse.models.nac.NACConv1d
Bases: Module
Neural audio codec 1D convolutional layer.
addse.models.nac.NACConvTranspose1d
Bases: Module
Neural audio codec 1D transposed convolutional layer.
addse.models.nac.NACDecoder
Bases: Module
Neural audio codec decoder.
addse.models.nac.NACDecoderBlock
Bases: Module
Neural audio codec decoder block.
addse.models.nac.NACEncoder
Bases: Module
Neural audio codec encoder.
addse.models.nac.NACEncoderBlock
Bases: Module
Neural audio codec encoder block.
addse.models.nac.NACLSTMBlock
addse.models.nac.NACRVQVAE
Bases: Module
Neural audio codec residual vector quantizer.
__init__
__init__(
emb_channels: int,
codebook_size: int,
num_codebooks: int,
codebook_channels: int | None,
normalize: bool,
shared_codebook: bool,
) -> None
Initialize the neural audio codec residual vector quantizer.
decode
decode(
x: Tensor,
input_no_sum: bool = False,
output_no_sum: bool = False,
domain: str = "code",
) -> torch.Tensor
Decode input into quantized embeddings.
Parameters:
-
x(Tensor) –Input tensor: - If
domain is "code": Shape (batch_size, num_codebooks, num_frames). - If domain is "x": Shape (batch_size, emb_channels, num_frames). - If domain is "q": Shape (batch_size, emb_channels, num_frames) if input_no_sum is False else (batch_size, emb_channels, num_codebooks, num_frames). - If domain is "x_proj": Shape (batch_size, codebook_channels, num_codebooks, num_frames). - If domain is "q_proj": Shape (batch_size, codebook_channels, num_codebooks, num_frames). -
input_no_sum(bool, default:False) –If
False, the input quantized embeddings are assumed to be summed across codebooks. Ignored if domain is not "q". -
output_no_sum(bool, default:False) –If
True, the output quantized embeddings are not summed across codebooks. -
domain(str, default:'code') –Domain of input tensor.
Returns:
forward
forward(
x: Tensor, no_sum: bool = False
) -> tuple[
torch.Tensor,
torch.Tensor,
torch.Tensor,
torch.Tensor,
torch.Tensor,
torch.Tensor,
]
Assign discrete codes to continuous input embeddings.
Parameters:
-
x(Tensor) –Input continuous embeddings. Shape
(batch_size, emb_channels, num_frames). -
no_sum(bool, default:False) –If
True, the quantized embeddings are not summed across codebooks.
Returns:
-
Tensor–A tuple
(codes, quantized, codebook_loss, commit_loss, x_proj, quantized_proj): -
Tensor–codes: Assigned vector indices. Shape (batch_size, num_codebooks, num_frames).
-
Tensor–quantized: Quantized embeddings. Shape (batch_size, emb_channels, num_frames) if no_sum is False else (batch_size, emb_channels, num_codebooks, num_frames).
-
Tensor–codebook_loss: Codebook loss. 0-dimensional.
-
Tensor–commit_loss: Commitment loss. 0-dimensional.
-
Tensor–x_proj: Projected input embeddings. Shape (batch_size, codebook_channels, num_codebooks, num_frames).
-
tuple[Tensor, Tensor, Tensor, Tensor, Tensor, Tensor]–quantized_proj: Projected quantized embeddings. Shape (batch_size, codebook_channels, num_codebooks, num_frames).
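The residual quantization scheme can be sketched in NumPy as follows. This is an illustrative re-implementation under the usual residual-VQ definition, not the module's actual code; it mirrors the summed (no_sum=False) behaviour described above:

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Residual VQ sketch: each quantizer encodes the residual left by the previous ones.

    x: (emb_channels, num_frames); codebooks: list of (codebook_size, emb_channels) arrays.
    Returns (codes, quantized), codes with shape (num_codebooks, num_frames).
    """
    residual = x.copy()
    quantized = np.zeros_like(x)
    codes = []
    for cb in codebooks:
        # Nearest codebook vector per frame (Euclidean distance).
        dist = ((residual.T[:, None, :] - cb[None, :, :]) ** 2).sum(-1)
        idx = dist.argmin(1)
        q = cb[idx].T
        codes.append(idx)
        quantized += q   # summed across codebooks (no_sum=False behaviour)
        residual -= q    # the next quantizer sees what is left
    return np.stack(codes), quantized

rng = np.random.default_rng(0)
codes, quantized = rvq_encode(rng.standard_normal((8, 100)),
                              [rng.standard_normal((16, 8)) for _ in range(4)])
```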
addse.models.nac.NACResidualUnit
Bases: Module
Neural audio codec residual unit.
addse.models.nac.NACSnakeActivation
Bases: Module
Neural audio codec Snake activation function.
addse.models.nac.NACVQVAE
Bases: Module
Neural audio codec vector quantizer.
__init__
__init__(
emb_channels: int,
codebook_size: int,
codebook_channels: int | None,
normalize: bool,
codebook: Embedding | None,
) -> None
Initialize the neural audio codec vector quantizer.
decode
Decode input into quantized embeddings.
Parameters:
-
x(Tensor) –Input tensor: - Shape
(batch_size, num_frames) if domain is "code". - Shape (batch_size, emb_channels, num_frames) if domain is "x". - Shape (batch_size, emb_channels, num_frames) if domain is "q". - Shape (batch_size, codebook_channels, num_frames) if domain is "x_proj". - Shape (batch_size, codebook_channels, num_frames) if domain is "q_proj". -
domain(str, default:'code') –Domain of input tensor.
Returns:
-
Tensor–Decoded tensor. Shape
(batch_size, emb_channels, num_frames).
forward
forward(
x: Tensor,
) -> tuple[
torch.Tensor,
torch.Tensor,
torch.Tensor,
torch.Tensor,
torch.Tensor,
torch.Tensor,
]
Assign discrete codes to continuous input embeddings.
Parameters:
-
x(Tensor) –Input continuous embeddings. Shape
(batch_size, emb_channels, num_frames).
Returns:
-
Tensor–A tuple
(codes, quantized, codebook_loss, commit_loss, x_proj, quantized_proj): -
Tensor–codes: Assigned vector indices with shape (batch_size, num_frames).
-
Tensor–quantized: Quantized embeddings with shape (batch_size, emb_channels, num_frames).
-
Tensor–codebook_loss: Codebook loss. 0-dimensional.
-
Tensor–commit_loss: Commitment loss. 0-dimensional.
-
Tensor–x_proj: Projected input embeddings. Shape (batch_size, codebook_channels, num_frames).
-
tuple[Tensor, Tensor, Tensor, Tensor, Tensor, Tensor]–quantized_proj: Projected quantized embeddings. Shape (batch_size, codebook_channels, num_frames).
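The normalize=True option replaces the Euclidean nearest-neighbour search with a cosine one: embeddings and codebook vectors are L2-normalized before the lookup. A NumPy sketch (illustrative only, not the module's actual code):

```python
import numpy as np

def normalized_lookup(x, codebook):
    """Cosine-similarity codebook assignment, as used when normalize=True.

    x: (emb_channels, num_frames); codebook: (codebook_size, emb_channels).
    Returns assigned vector indices with shape (num_frames,).
    """
    xn = x / np.linalg.norm(x, axis=0, keepdims=True)           # unit-norm frames
    cn = codebook / np.linalg.norm(codebook, axis=1, keepdims=True)  # unit-norm vectors
    sim = cn @ xn   # (codebook_size, num_frames) cosine similarities
    return sim.argmax(0)

# Toy example: each frame is a scaled basis vector, so the lookup picks that axis.
codes = normalized_lookup(
    np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 0.0], [0.0, 3.0]]), np.eye(4)
)
```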
addse.models.sgmse
addse.models.sgmse.SGMSEAttentionBlock
Bases: Module
SGMSE attention block.
addse.models.sgmse.SGMSEEmbeddingBlock
Bases: Module
SGMSE time step embedding block with Gaussian Fourier projection and MLP.
addse.models.sgmse.SGMSEResample
Bases: Module
SGMSE 2D resampling block.
addse.models.sgmse.SGMSEUNet
Bases: Module
NCSN++ backbone used in SGMSE.
__init__
__init__(
num_channels: int = 1,
base_channels: int = 128,
num_res_blocks: int = 2,
channel_mult: Sequence[int] = (1, 1, 2, 2, 2, 2, 2),
attn_levels: Container[int] = (4,),
) -> None
Initialize the SGMSE NCSN++ backbone.
Parameters:
-
num_channels(int, default:1) –Number of input channels.
-
base_channels(int, default:128) –Base number of channels.
-
num_res_blocks(int, default:2) –Number of residual blocks per level.
-
channel_mult(Sequence[int], default:(1, 1, 2, 2, 2, 2, 2)) –Channel multiplier for each level.
-
attn_levels(Container[int], default:(4,)) –Indices of levels at which to apply attention.
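The channel_mult sequence scales base_channels at each resolution level. With the defaults:

```python
base_channels = 128
channel_mult = (1, 1, 2, 2, 2, 2, 2)

# Channel count at each of the 7 resolution levels of the NCSN++ backbone:
channels = [base_channels * m for m in channel_mult]
# -> [128, 128, 256, 256, 256, 256, 256]; attention is applied at level 4 only.
```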
forward
Forward pass.
Parameters:
-
x(Tensor) –Complex-valued noisy speech tensor. Shape
(batch_size, num_channels, num_freqs, num_frames). -
y(Tensor) –Complex-valued diffused speech tensor. Shape
(batch_size, num_channels, num_freqs, num_frames). -
t(Tensor) –Diffusion step or noise level. Shape
(batch_size,).
Returns:
-
Tensor–Complex-valued output score. Shape
(batch_size, num_channels, num_freqs, num_frames).
addse.models.sgmse.SGMSEUNetBlock
Bases: Module
SGMSE UNet block.
addse.stft
addse.stft.STFT
Bases: Module
Short-time Fourier transform (STFT) module.
__init__
__init__(
frame_length: int = 512,
hop_length: int | None = None,
n_fft: int | None = None,
window: str = "hann",
norm: bool = False,
) -> None
Initialize the STFT module.
Parameters:
-
frame_length(int, default:512) –Frame length.
-
hop_length(int | None, default:None) –Hop length. If
None, defaults to frame_length // 2. -
n_fft(int | None, default:None) –FFT size. If
None, defaults to frame_length. -
window(str, default:'hann') –Window type. Passed to scipy.signal.get_window.
-
norm(bool, default:False) –Whether to normalize the window by the square root of its sum of squares.
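The norm option scales the analysis window to unit energy. A NumPy sketch of that scaling (np.hanning is a symmetric Hann window, which may differ from the periodic window scipy.signal.get_window returns by default):

```python
import numpy as np

frame_length = 512
w = np.hanning(frame_length)            # Hann window, as with window="hann"
norm_w = w / np.sqrt((w ** 2).sum())    # norm=True: divide by sqrt of the sum of squares

# The normalized window has unit sum of squares:
energy = float((norm_w ** 2).sum())     # ≈ 1.0
```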
forward
inverse
addse.utils
addse.utils.build_subbands
build_subbands(
n_fft: int,
fs: int,
subbands: Iterable[tuple[float, int]],
) -> list[tuple[int, int]]
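A plausible sketch of how (bandwidth, number) pairs might map to contiguous FFT-bin ranges, matching the signature above; the actual rounding and remainder handling in build_subbands may differ:

```python
def build_subbands_sketch(n_fft, fs, subbands):
    """Map (bandwidth_hz, count) pairs to contiguous (start, end) FFT-bin ranges.

    Hypothetical re-implementation: bandwidths are converted to a bin count
    via the frequency resolution fs / n_fft.
    """
    hz_per_bin = fs / n_fft
    ranges, start = [], 0
    for bandwidth, number in subbands:
        width = max(1, round(bandwidth / hz_per_bin))  # bins per subband
        for _ in range(number):
            ranges.append((start, start + width))
            start += width
    return ranges

# BSRNN defaults: 10x100 Hz + 10x200 Hz + 6x500 Hz + 2x1000 Hz = 8 kHz,
# i.e. the full band at fs=16000.
bands = build_subbands_sketch(512, 16000, [(100.0, 10), (200.0, 10), (500.0, 6), (1000.0, 2)])
```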
addse.utils.bytes_str_to_int
addse.utils.dynamic_range
addse.utils.flatten_dict
addse.utils.hz_to_mel
Convert frequency in Hz to mel scale.
Parameters:
-
hz(float) –Frequency in Hz.
-
scale(str, default:'slaney') –Mel scale to use.
"htk"matches the Hidden Markov Toolkit, while"slaney"matches the Auditory Toolbox by Slaney. The"slaney"scale is linear below 1 kHz and logarithmic above 1 kHz.
Returns:
-
float–Frequency in mel scale.
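Both scales follow the standard formulas (HTK: 2595 log10(1 + hz/700); Slaney: linear below 1 kHz, with 1000 Hz mapping to 15 mel, then logarithmic, reaching 6400 Hz 27 steps above the break). A scalar sketch of the conversion:

```python
import math

def hz_to_mel(hz: float, scale: str = "slaney") -> float:
    """Standard HTK / Slaney mel conversions (scalar sketch)."""
    if scale == "htk":
        return 2595.0 * math.log10(1.0 + hz / 700.0)
    # "slaney": linear below 1 kHz at 200/3 Hz per mel, logarithmic above.
    f_sp = 200.0 / 3.0
    min_log_hz = 1000.0
    min_log_mel = min_log_hz / f_sp          # 15.0 mel at the 1 kHz break
    logstep = math.log(6.4) / 27.0           # 27 mel steps from 1 kHz to 6.4 kHz
    if hz < min_log_hz:
        return hz / f_sp
    return min_log_mel + math.log(hz / min_log_hz) / logstep
```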
addse.utils.load_hydra_config
Load a Hydra configuration file.
addse.utils.load_model
load_model(
config_path: str,
model_name: str | None = None,
logs_dir: str = "logs",
ckpt_name: str = "last.ckpt",
ckpt_path: str | None = None,
state_key: str | None = "state_dict",
prepend_key: str | None = None,
device: device | str | None = None,
strict: bool = True,
) -> L.LightningModule
Load a model.
addse.utils.mel_filters
mel_filters(
n_filters: int = 64,
n_fft: int = 512,
f_min: float = 0.0,
f_max: float | None = None,
fs: float = 16000,
scale: str = "slaney",
norm: Literal["slaney", "consistent"]
| None = "consistent",
dtype: dtype = torch.float32,
) -> tuple[torch.Tensor, torch.Tensor]
Get mel filters.
Parameters:
-
n_filters(int, default:64) –Number of filters.
-
n_fft(int, default:512) –Number of FFT points.
-
f_min(float, default:0.0) –Minimum frequency.
-
f_max(float | None, default:None) –Maximum frequency. If
None, uses fs / 2. -
fs(float, default:16000) –Sampling frequency.
-
scale(str, default:'slaney') –Mel scale to use.
"htk"matches the Hidden Markov Toolkit, while"slaney"matches the Auditory Toolbox by Slaney. The"slaney"scale is linear below 1 kHz and logarithmic above 1 kHz. -
norm(Literal['slaney', 'consistent'] | None, default:'consistent') –Filter normalization method. If
"slaney", the filters are normalized by their width in Hz. However, this makes the filter response scale with the frequency resolution n_fft / fs. If "consistent", the frequency resolution is factored in. If None, no normalization is applied. -
dtype(dtype, default:float32) –Data type to cast the filters to.
Returns:
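The filterbank consists of triangular filters whose edges are equally spaced on the mel axis. An unnormalized NumPy sketch using the slaney scale (illustrative, not the actual implementation; the norm options described above would then rescale each triangle):

```python
import numpy as np

def hz_to_mel_slaney(hz):
    hz = np.asarray(hz, dtype=float)
    mel = hz * 3.0 / 200.0                      # linear region below 1 kHz
    log_region = hz >= 1000.0
    safe = np.maximum(hz, 1e-9)                 # avoid log(0) in the masked branch
    return np.where(log_region, 15.0 + np.log(safe / 1000.0) * 27.0 / np.log(6.4), mel)

def mel_to_hz_slaney(mel):
    mel = np.asarray(mel, dtype=float)
    hz = mel * 200.0 / 3.0
    return np.where(mel >= 15.0, 1000.0 * np.exp(np.log(6.4) / 27.0 * (mel - 15.0)), hz)

def mel_filters_sketch(n_filters=64, n_fft=512, f_min=0.0, f_max=None, fs=16000.0):
    """Unnormalized triangular mel filterbank: shape (n_filters, n_fft // 2 + 1)."""
    f_max = fs / 2 if f_max is None else f_max
    # n_filters + 2 equally spaced mel points give the triangle edges in Hz.
    edges = mel_to_hz_slaney(
        np.linspace(hz_to_mel_slaney(f_min), hz_to_mel_slaney(f_max), n_filters + 2)
    )
    freqs = np.linspace(0.0, fs / 2, n_fft // 2 + 1)  # FFT bin centre frequencies
    fb = np.zeros((n_filters, freqs.size))
    for i in range(n_filters):
        lo, ctr, hi = edges[i], edges[i + 1], edges[i + 2]
        up = (freqs - lo) / (ctr - lo)
        down = (hi - freqs) / (hi - ctr)
        fb[i] = np.maximum(0.0, np.minimum(up, down))  # rising/falling triangle slopes
    return fb

fb = mel_filters_sketch()
```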
addse.utils.mel_to_hz
Convert frequency in mel scale to Hz.
Parameters:
-
mel(Tensor) –Frequency in mel scale.
-
scale(str, default:'slaney') –Mel scale to use.
"htk"matches the Hidden Markov Toolkit, while"slaney"matches the Auditory Toolbox by Slaney. The"slaney"scale is linear below 1 kHz and logarithmic above 1 kHz.
Returns:
-
Tensor–Frequency in Hz.
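The inverse mapping follows the same standard formulas, undoing the linear/log split at the 1 kHz break (15 mel on the slaney scale). A scalar sketch:

```python
import math

def mel_to_hz(mel: float, scale: str = "slaney") -> float:
    """Standard HTK / Slaney inverse mel conversions (scalar sketch)."""
    if scale == "htk":
        return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)
    # "slaney": linear below 15 mel (1 kHz), exponential above.
    f_sp = 200.0 / 3.0
    min_log_mel = 15.0                       # mel value at the 1 kHz break
    logstep = math.log(6.4) / 27.0
    if mel < min_log_mel:
        return mel * f_sp
    return 1000.0 * math.exp(logstep * (mel - min_log_mel))
```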
addse.utils.scan_files
addse.utils.seed_all
addse.utils.segment_audio_file
segment_audio_file(
path: str,
format: str = "ogg",
subtype: str | None = None,
seglen: float | None = None,
base: str | None = None,
) -> Iterator[tuple[bytes, str]]
Read and segment an audio file and yield bytes and a name for each segment.
Parameters:
-
path(str) –Path to the input audio file.
-
format(str, default:'ogg') –Audio format to convert to. See soundfile.write.
-
subtype(str | None, default:None) –Audio subtype to convert to. See soundfile.write.
-
seglen(float | None, default:None) –Segment length in seconds. If provided, the file is segmented into chunks of approximately this length. -
-
base(str | None, default:None) –Base path to strip from the file path.
Yields: