Reference

addse.data

addse.data.AudioStreamingDataLoader

Bases: StreamingDataLoader

Audio streaming dataloader.

shuffle property

shuffle: bool

Get the shuffle attribute of the dataset.

__init__

__init__(
    dataset: AudioStreamingDataset | DynamicMixingDataset,
    batch_size: int = 1,
    num_workers: int = 0,
    shuffle: bool | None = None,
    **kwargs: Any,
) -> None

Initialize the audio streaming dataloader.

Parameters:

  • dataset (AudioStreamingDataset | DynamicMixingDataset) –

    Dataset to wrap.

  • batch_size (int, default: 1 ) –

    Batch size.

  • num_workers (int, default: 0 ) –

    Number of workers.

  • shuffle (bool | None, default: None ) –

    Whether to shuffle the dataset at every epoch. If None, uses the dataset shuffle attribute.

  • **kwargs (Any, default: {} ) –

    Additional keyword arguments passed to parent constructor.
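
Example

A minimal usage sketch, assuming a LitData-optimized dataset at the hypothetical path data/speech_opt:

from addse.data import AudioStreamingDataLoader, AudioStreamingDataset

dataset = AudioStreamingDataset("data/speech_opt", fs=16000, shuffle=True)

# With shuffle=None, the loader inherits the dataset's shuffle attribute.
loader = AudioStreamingDataLoader(dataset, batch_size=8, num_workers=4)
for batch in loader:
    ...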

__len__

__len__() -> int

Get the number of batches in the dataloader.

Returns:

  • int

    Number of batches in the dataloader.

Raises:

  • TypeError

    If the wrapped dataset is an instance of AudioStreamingDataset with segment_length != None, as the total number of segments in the dataset cannot be determined without iterating over it.

addse.data.AudioStreamingDataset

Bases: StreamingDataset

Audio streaming dataset.

__getitem__

__getitem__(index: int) -> tuple[torch.Tensor, str]

Get an item from the dataset.

Parameters:

  • index (int) –

    Index of the item to retrieve.

Returns:

  • tuple[Tensor, str]

    Audio data with shape (1, num_samples) and name.

__init__

__init__(
    input_dir: str,
    fs: int | None = None,
    segment_length: float | None = None,
    max_length: float | None = None,
    max_dynamic_range: float | None = None,
    shuffle: bool = False,
    seed: int = 0,
    **kwargs: Any,
) -> None

Initialize the audio streaming dataset.

Parameters:

  • input_dir (str) –

    Path or URL to LitData-optimized audio data.

  • fs (int | None, default: None ) –

    Optional sample rate to resample to.

  • segment_length (float | None, default: None ) –

    Audio segment length in seconds. If provided, audio files are concatenated and segmented into chunks of this length. Otherwise, audio files are yielded as-is and may have variable length.

  • max_length (float | None, default: None ) –

    Maximum output length in seconds. If provided, audio files longer than this are skipped. Cannot be used together with segment_length.

  • max_dynamic_range (float | None, default: None ) –

    Maximum dynamic range in dB. If provided, audio files and segments with a dynamic range greater than this value are skipped.

  • shuffle (bool, default: False ) –

    Whether to shuffle the dataset.

  • seed (int, default: 0 ) –

    Random seed for shuffling.

  • **kwargs (Any, default: {} ) –

    Additional keyword arguments passed to parent constructor.
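
Example

A hedged sketch of the two length-handling modes, using a hypothetical input directory:

from addse.data import AudioStreamingDataset

# Fixed-length mode: files are concatenated and cut into 4-second segments.
segmented = AudioStreamingDataset("data/speech_opt", fs=16000, segment_length=4.0)

# Variable-length mode: files longer than 10 seconds are skipped.
# max_length cannot be combined with segment_length.
variable = AudioStreamingDataset("data/speech_opt", fs=16000, max_length=10.0)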

__iter__

__iter__() -> Iterator[ASDOutput]

Iterate over the dataset.

__len__

__len__() -> int

Get the number of files in the dataset.

Returns:

  • int

    The number of files in the dataset.

Note

If segment_length is not None, the number of samples yielded by this dataset when iterating over it does not match the output of this method.

__next__

__next__() -> ASDOutput

Get the next item from the dataset.

Returns:

  • ASDOutput

    Audio data with shape (1, num_samples), sample rate, name, and number of files loaded to get this item. The number of files loaded is required by DynamicMixingDataset.

check

check(item: Tensor, name: str) -> bool

Check if a signal meets the dataset criteria.

next_segment

next_segment() -> ASDOutput

Get the next audio segment from the dataset.

addse.data.DynamicMixingDataset

Bases: ParallelStreamingDataset

Dynamic mixing dataset.

Wraps two AudioStreamingDataset instances, one for speech and one for noise, and generates noisy speech samples on-the-fly by mixing the speech and noise samples at a random signal-to-noise ratio (SNR).

Multi-channel speech and noise samples are converted to mono by randomly selecting one channel.

If the speech and noise samples have different lengths, the noise is cycled or trimmed to match the speech length.

When length=float("inf"), this dataset is infinite and should be used with limit_<stage>_batches in the Lightning Trainer.

__init__

__init__(
    speech_dataset: AudioStreamingDataset,
    noise_dataset: AudioStreamingDataset,
    snr_range: tuple[float, float] = (-5.0, 15.0),
    rms_range: tuple[float, float] | None = (0.0, 0.0),
    length: int | float | None = float("inf"),
    resume: bool = True,
    reset_rngs: bool = False,
    **kwargs: Any,
) -> None

Initialize the dynamic mixing dataset.

Parameters:

  • speech_dataset (AudioStreamingDataset) –

    Speech dataset.

  • noise_dataset (AudioStreamingDataset) –

    Noise dataset.

  • snr_range (tuple[float, float], default: (-5.0, 15.0) ) –

    SNR range.

  • rms_range (tuple[float, float] | None, default: (0.0, 0.0) ) –

    RMS range for the clean speech in dB. If None, no RMS adjustment is performed.

  • length (int | float | None, default: float('inf') ) –

    Number of samples to yield per epoch. If None, the speech and noise datasets are iterated over until one is exhausted. If an integer, the datasets are cycled until length samples are yielded. If float("inf"), the datasets are cycled indefinitely.

  • resume (bool, default: True ) –

    Whether to resume the dataset from where it left off in the previous epoch when starting a new epoch. Should be set to False for validation and test datasets. Only works when iterating with an AudioStreamingDataLoader. Ignored if length is None.

  • reset_rngs (bool, default: False ) –

    Whether to set the internal random number generators to the same initial state at the start of each epoch. If True, random numbers are consistent across epochs. Should be set to True for validation and test datasets.

  • **kwargs (Any, default: {} ) –

    Additional keyword arguments passed to parent constructor.
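
Example

A sketch of typical training and validation configurations, assuming hypothetical speech and noise directories:

from addse.data import AudioStreamingDataset, DynamicMixingDataset

speech = AudioStreamingDataset("data/speech_opt", fs=16000, segment_length=4.0)
noise = AudioStreamingDataset("data/noise_opt", fs=16000, segment_length=4.0)

# Infinite training stream; pair with limit_train_batches in the Trainer.
train_set = DynamicMixingDataset(speech, noise, snr_range=(-5.0, 15.0))

# Finite, reproducible validation stream.
val_set = DynamicMixingDataset(speech, noise, length=500, resume=False, reset_rngs=True)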

__iter__

__iter__() -> Iterator[
    tuple[torch.Tensor, torch.Tensor, int]
]

Iterate over the dataset.

Yields:

  • tuple[Tensor, Tensor, int]

    Noisy speech, clean speech, and sample rate. Noisy and clean speech have shape (1, num_samples).

__len__

__len__() -> int

Get the number of samples yielded per epoch.

Returns:

  • int

    Number of samples yielded per epoch.

Raises:

  • TypeError

    If the dataset is infinite, i.e. if length is float("inf").

transform staticmethod

transform(
    samples: tuple[ASDOutput, ASDOutput],
    rngs: dict[str, Any],
    snr_range: tuple[float, float],
    rms_range: tuple[float, float] | None,
) -> tuple[
    torch.Tensor, torch.Tensor, int, tuple[int, int]
]

Generate noisy speech from speech and noise samples.

Parameters:

  • samples (tuple[ASDOutput, ASDOutput]) –

    Tuple with speech and noise samples.

  • rngs (dict[str, Any]) –

    Random number generators.

  • snr_range (tuple[float, float]) –

    SNR range.

  • rms_range (tuple[float, float] | None) –

    RMS range for the clean speech in dB. If None, no RMS adjustment is performed.

Returns:

  • tuple[Tensor, Tensor, int, tuple[int, int]]

    Noisy speech, clean speech, sample rate, and number of files loaded. Noisy and clean speech have shape (1, num_samples). The number of files loaded is for internal use only and is discarded before yielding when iterating over the dataset.

addse.layers

addse.layers.BandMerge

Bases: Module

Band-merge module.

__init__

__init__(
    subband_idx: Iterable[tuple[int, int]],
    input_channels: int,
    output_channels: int,
    num_channels: int,
    norm: Callable[[int], Module],
    mlp: Callable[
        [int, int, Callable[[int], Module]], Module
    ],
    residual: bool,
) -> None

Initialize the band-merge module.

forward

forward(
    x: Tensor,
) -> tuple[torch.Tensor, torch.Tensor | None]

Forward pass.

Parameters:

  • x (Tensor) –

    Input tensor with shape (batch_size, input_channels, num_bands, num_frames).

Returns:

  • tuple[Tensor, Tensor | None]

    Tuple (mask, residual) where mask are complex-valued spatial filtering coefficients with shape (batch_size, input_channels, output_channels, num_freqs, num_frames), and residual is a residual additive short-time Fourier transform with shape (batch_size, output_channels, num_freqs, num_frames), or None if residual=False.

addse.layers.BandSplit

Bases: Module

Band-split module.

__init__

__init__(
    subband_idx: Iterable[tuple[int, int]],
    input_channels: int,
    output_channels: int,
    norm: Callable[[int], Module],
) -> None

Initialize the band-split module.

forward

forward(x: Tensor) -> torch.Tensor

Forward pass.

Parameters:

  • x (Tensor) –

    Complex-valued short-time Fourier transform. Shape (batch_size, input_channels, num_freqs, num_frames).

Returns:

  • Tensor

    Output tensor with shape (batch_size, output_channels, num_bands, num_frames).

addse.layers.BatchNorm

Bases: Module

Batch normalization.

Input tensors must have shape (B, C, ...) where B is the batch dimension, C is the channel dimension, and ... are the spatial dimensions (e.g. height and width in computer vision, frequency and time in audio, or sequence length in NLP). The statistics are aggregated over the batch and spatial dimensions as in [1], Figure 2. Namely,

\[y = \frac{x - \mathbb{E}[x]}{\sqrt{\mathbb{V}[x] + \epsilon}} (1 + \gamma) + \beta,\]

where \(\gamma\) and \(\beta\) are channel-specific learnable scale and shift parameters. Note the reparameterization of the scale parameter compared to the default PyTorch implementation.

Unlike other normalization modules, this module has track_running_stats and momentum options.


  1. Y. Wu and K. He, "Group normalization", ECCV, 2018. 
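
Example

A minimal functional sketch of the formula above (illustrative only, not the module's implementation), with per-channel statistics over the batch and spatial dimensions:

import torch

def batch_norm_sketch(x, gamma, beta, eps=1e-5):
    # x: (B, C, ...); gamma, beta: (C,).
    dims = [0] + list(range(2, x.ndim))    # batch + spatial dimensions
    mean = x.mean(dim=dims, keepdim=True)
    var = x.var(dim=dims, unbiased=False, keepdim=True)
    shape = (1, -1) + (1,) * (x.ndim - 2)  # broadcast over the channel dim
    # Note the (1 + gamma) scale reparameterization compared to PyTorch.
    return (x - mean) / torch.sqrt(var + eps) * (1 + gamma.view(shape)) + beta.view(shape)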

__init__

__init__(
    num_channels: int,
    eps: float = 1e-05,
    track_running_stats: bool = True,
    momentum: float | None = 0.1,
) -> None

Initialize the batch normalization module.

Parameters:

  • num_channels (int) –

    Number of channels in input tensors.

  • eps (float, default: 1e-05 ) –

    Small value for numerical stability.

  • track_running_stats (bool, default: True ) –

    If True, normalization statistics are aggregated over batches during training and saved for evaluation. If False, statistics are computed from the current batch both during training and evaluation.

  • momentum (float | None, default: 0.1 ) –

    Momentum for running statistics. The bigger the value, the more weight is given to the current batch statistics. Ignored if track_running_stats is False. If None, running statistics are cumulatively aggregated over batches without decay.

forward

forward(x: Tensor) -> torch.Tensor

Forward pass.

addse.layers.GroupNorm

Bases: Module

Group normalization.

Input tensors must have shape (B, C, ...) where B is the batch dimension, C is the channel dimension, and ... are the spatial dimensions (e.g. height and width in computer vision, frequency and time in audio, or sequence length in NLP). The statistics are aggregated over grouped channels and spatial dimensions as in [1], Figure 2. Namely,

\[y = \frac{x - \mathbb{E}[x]}{\sqrt{\mathbb{V}[x] + \epsilon}} (1 + \gamma) + \beta,\]

where \(\gamma\) and \(\beta\) are channel-specific learnable scale and shift parameters. Note the reparameterization of the scale parameter compared to the default PyTorch implementation.


  1. Y. Wu and K. He, "Group normalization", ECCV, 2018. 

__init__

__init__(
    num_groups: int,
    num_channels: int,
    eps: float = 1e-05,
    causal: bool = False,
) -> None

Initialize the group normalization module.

Parameters:

  • num_groups (int) –

    Number of groups to separate the channels into.

  • num_channels (int) –

    Number of channels in input tensors.

  • eps (float, default: 1e-05 ) –

    Small value for numerical stability.

  • causal (bool, default: False ) –

    If True, normalization statistics are cumulatively aggregated along the time dimension. The time dimension must be the last dimension of the input tensor.

forward

forward(x: Tensor) -> torch.Tensor

Forward pass.

addse.layers.InstanceNorm

Bases: GroupNorm

Instance normalization.

Input tensors must have shape (B, C, ...) where B is the batch dimension, C is the channel dimension, and ... are the spatial dimensions (e.g. height and width in computer vision, frequency and time in audio, or sequence length in NLP). The statistics are aggregated over the spatial dimensions as in [1], Figure 2. Namely,

\[y = \frac{x - \mathbb{E}[x]}{\sqrt{\mathbb{V}[x] + \epsilon}} (1 + \gamma) + \beta,\]

where \(\gamma\) and \(\beta\) are channel-specific learnable scale and shift parameters. Note the reparameterization of the scale parameter compared to the default PyTorch implementation.


  1. Y. Wu and K. He, "Group normalization", ECCV, 2018. 

__init__

__init__(
    num_channels: int,
    eps: float = 1e-05,
    causal: bool = False,
) -> None

Initialize the instance normalization module.

Parameters:

  • num_channels (int) –

    Number of channels in input tensors.

  • eps (float, default: 1e-05 ) –

    Small value for numerical stability.

  • causal (bool, default: False ) –

    If True, normalization statistics are cumulatively aggregated along the time dimension. The time dimension must be the last dimension of the input tensor.

addse.layers.LayerNorm

Bases: Module

Layer normalization.

Input tensors must have shape (B, C, ...) where B is the batch dimension, C is the channel dimension, and ... are the spatial dimensions (e.g. height and width in computer vision, frequency and time in audio, or sequence length in NLP). Namely,

\[y = \frac{x - \mathbb{E}[x]}{\sqrt{\mathbb{V}[x] + \epsilon}} (1 + \gamma) + \beta,\]

where \(\gamma\) and \(\beta\) are channel-specific learnable scale and shift parameters. Note the reparameterization of the scale parameter compared to the default PyTorch implementation.

If element_wise and frame_wise are both False, the statistics are aggregated over the channel dimension and all spatial dimensions as in [1], Figure 2. In this case, setting causal=False matches the global layer normalization in [2], while setting causal=True matches the cumulative layer normalization in [2]. The time dimension must be the last dimension of input tensors.

If element_wise is True, the statistics are aggregated over the channel dimension only as in [3]. I.e. each element (e.g. pixel in computer vision, time-frequency unit in audio, or token in NLP) is normalized independently.

If frame_wise is True, the statistics are aggregated over the channel dimension and all spatial dimensions except the time dimension. The time dimension must be the last dimension of input tensors.


  1. Y. Wu and K. He, "Group normalization", ECCV, 2018. 

  2. Y. Luo and N. Mesgarani, "Conv-TasNet: Surpassing ideal time-frequency magnitude masking for speech separation", in IEEE/ACM TASLP, 2019. 

  3. S. Shen, Z. Yao, A. Gholami, M. W. Mahoney, and K. Keutzer, "PowerNorm: Rethinking batch normalization in transformers", ICML, 2020. 

__init__

__init__(
    num_channels: int,
    element_wise: bool = False,
    frame_wise: bool = False,
    causal: bool = False,
    center: bool = True,
    eps: float = 1e-05,
) -> None

Initialize the layer normalization module.

Parameters:

  • num_channels (int) –

    Number of channels in input tensors.

  • element_wise (bool, default: False ) –

    If True, each element (e.g. pixel in computer vision, time-frequency unit in audio, or token in NLP) is normalized independently. Mutually exclusive with frame_wise and causal.

  • frame_wise (bool, default: False ) –

    If True, each time frame is normalized independently. The time dimension must be the last dimension of input tensors. Mutually exclusive with element_wise and causal.

  • causal (bool, default: False ) –

    If True, normalization statistics are cumulatively aggregated along the time dimension. The time dimension must be the last dimension of the input tensor. Mutually exclusive with element_wise and frame_wise.

  • center (bool, default: True ) –

    If False, the mean is not subtracted from the input, and the input is scaled using the root mean square (RMS) instead of the variance. The bias term \(\beta\) is also omitted.

  • eps (float, default: 1e-05 ) –

    Small value for numerical stability.
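
Example

A sketch instantiating the documented variants (the names and channel count below are illustrative):

from addse.layers import LayerNorm

gln = LayerNorm(num_channels=64)                     # global layer norm, cf. [2]
cln = LayerNorm(num_channels=64, causal=True)        # cumulative layer norm, cf. [2]
eln = LayerNorm(num_channels=64, element_wise=True)  # per-element, cf. [3]
rms = LayerNorm(num_channels=64, center=False)       # RMS-style: no centering, no bias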

forward

forward(x: Tensor) -> torch.Tensor

Forward pass.

addse.layers.group_norm

group_norm(
    x: Tensor,
    num_groups: int,
    weight: Tensor,
    bias: Tensor | None,
    eps: float,
    causal: bool,
    frame_wise: bool,
) -> torch.Tensor

Functional interface for group normalization.

See GroupNorm for details.

addse.lightning

addse.lightning.ADDSELightningModule

Bases: BaseLightningModule, ConfigureOptimizersMixin

ADDSE Lightning module.

__init__

__init__(
    nac_cfg: str,
    nac_ckpt: str,
    model: ADDSERQDiT,
    num_steps: int,
    block_size: int,
    optimizer: Callable[
        [Iterator[Parameter]], Optimizer
    ] = Adam,
    lr_scheduler: Mapping[str, Any] | None = None,
    val_metrics: Mapping[str, BaseMetric] | None = None,
    test_metrics: Mapping[str, BaseMetric] | None = None,
    log_cfg: LogConfig | None = None,
    debug_sample: tuple[int, int] | None = None,
) -> None

Initialize the ADDSE Lightning module.

forward

forward(
    x: Tensor, return_nfe: bool = False
) -> Tensor | tuple[Tensor, int]

Enhance the input audio.

log_score

log_score(y_q: Tensor, x_q: Tensor) -> Tensor

Estimate the score function.

loss

loss(x_q: Tensor, y_q: Tensor, y_tok: Tensor) -> Tensor

Compute the \(\lambda\)-denoising cross-entropy loss.

Parameters:

  • x_q (Tensor) –

    Noisy speech embeddings. Shape (batch_size, emb_channels, num_codebooks, seq_len).

  • y_q (Tensor) –

    Clean speech embeddings. Shape (batch_size, emb_channels, num_codebooks, seq_len).

  • y_tok (Tensor) –

    Clean speech tokens. Shape (batch_size, num_codebooks, seq_len).

Returns:

  • Tensor

    The \(\lambda\)-denoising cross-entropy loss.

solve

solve(
    x_tok: Tensor,
    x_q: Tensor,
    num_steps: int,
    return_nfe: bool = False,
) -> Tensor | tuple[Tensor, int]

Sample assuming a log-linear noise schedule and an absorbing transition matrix.

addse.lightning.BaseLightningModule

Bases: LightningModule

Base class for Lightning modules.

log_debug_samples

log_debug_samples(
    batch: tuple[Tensor, Tensor, Tensor],
    batch_idx: int,
    debug_samples: dict[str, Tensor],
) -> None

Log debug audio samples to W&B.

log_metrics

log_metrics(
    loss: dict[str, Tensor],
    metrics: dict[str, float],
    stage: str,
    on_step: bool,
    on_epoch: bool,
) -> None

Log losses and metrics.

step abstractmethod

step(
    batch: tuple[Tensor, Tensor, Tensor],
    stage: str,
    batch_idx: int,
    metrics: Mapping[str, BaseMetric] | None = None,
) -> tuple[
    dict[str, Tensor], dict[str, float], dict[str, Tensor]
]

Training, validation, or test step.

Parameters:

  • batch (tuple[Tensor, Tensor, Tensor]) –

    A batch from the dataloader.

  • stage (str) –

    "train", "val", or "test".

  • batch_idx (int) –

    Index of the batch.

  • metrics (Mapping[str, BaseMetric] | None, default: None ) –

    Metrics to compute. None if stage is "train" or if no metrics are defined.

Returns:

  • tuple[dict[str, Tensor], dict[str, float], dict[str, Tensor]]

    Tuple of loss dictionary, metrics dictionary, and debug samples dictionary. Each debug sample must have shape (batch_size, num_channels, num_samples).

test_step

test_step(
    batch: tuple[Tensor, Tensor, Tensor], batch_idx: int
) -> dict[str, Tensor]

Test step.

training_step

training_step(
    batch: tuple[Tensor, Tensor, Tensor], batch_idx: int
) -> dict[str, Tensor]

Training step.

validation_step

validation_step(
    batch: tuple[Tensor, Tensor, Tensor], batch_idx: int
) -> dict[str, Tensor]

Validation step.

addse.lightning.ConfigureOptimizersMixin

Bases: LightningModule

Mixin for standard configuration of optimizer and learning rate scheduler.

configure_optimizers

configure_optimizers() -> Any

Configure optimizers.

Returns:

  • Any

    Dictionary with optimizer, learning rate scheduler, and learning rate scheduler configuration.

addse.lightning.DataModule

Bases: LightningDataModule

Data module.

__init__

__init__(
    train_dataset: Callable[[], Dataset],
    train_dataloader: Callable[[Dataset], DataLoader],
    val_dataset: Callable[[], Dataset] | None = None,
    val_dataloader: Callable[[Dataset], DataLoader]
    | None = None,
    test_dataset: Callable[[], Dataset] | None = None,
    test_dataloader: Callable[[Dataset], DataLoader]
    | None = None,
) -> None

Initialize the data module.

Parameters:

  • train_dataset (Callable[[], Dataset]) –

    Function to initialize the training dataset.

  • val_dataset (Callable[[], Dataset] | None, default: None ) –

    Function to initialize the validation dataset.

  • test_dataset (Callable[[], Dataset] | None, default: None ) –

    Function to initialize the test dataset.

  • train_dataloader (Callable[[Dataset], DataLoader]) –

    Function to initialize the training dataloader.

  • val_dataloader (Callable[[Dataset], DataLoader] | None, default: None ) –

    Function to initialize the validation dataloader.

  • test_dataloader (Callable[[Dataset], DataLoader] | None, default: None ) –

    Function to initialize the test dataloader.

load_state_dict

load_state_dict(state_dict: dict[str, Any]) -> None

Load the state dict of the data module.

setup

setup(stage: str) -> None

Set up the data module.

Parameters:

  • stage (str) –

    Either "fit", "validate", "test", or "predict".

state_dict

state_dict() -> dict[str, Any]

Get the state dict of the data module.

test_dataloader

test_dataloader() -> DataLoader | list

Get the test dataloader.

Returns:

  • DataLoader | list

    The test dataloader or an empty list if no test dataset was provided at initialization.

train_dataloader

train_dataloader() -> DataLoader

Get the training dataloader.

Returns:

  • DataLoader

    The training dataloader.

val_dataloader

val_dataloader() -> DataLoader | list

Get the validation dataloader.

Returns:

  • DataLoader | list

    The validation dataloader or an empty list if no validation dataset was provided at initialization.

addse.lightning.EDMMixin

Bases: LightningModule

Mixin for training and sampling as in EDM.

denoiser

denoiser(y: Tensor, x: Tensor, sigma: Tensor) -> Tensor

Compute the denoiser parametrization as in EDM.

loss

loss(x: Tensor, y: Tensor) -> Tensor

Compute the loss as in EDM.

sampling_step

sampling_step(i: int) -> float

Compute the i-th sampling step.

solve

solve(x: Tensor, num_steps: int) -> Tensor

Sample using the Heun method as in EDM.

addse.lightning.EDMNACSELightningModule

Bases: BaseLightningModule, ConfigureOptimizersMixin, EDMMixin

Lightning module for speech enhancement using NAC-domain EDM-style diffusion.

__init__

__init__(
    nac_cfg: str,
    nac_ckpt: str,
    nac_domain: str,
    nac_no_sum: bool,
    nac_stack: bool,
    model: ADDSERQDiT,
    num_steps: int,
    block_size: int,
    norm_factor: float = 2.3,
    sigma_data: float = 0.5,
    p_mean: float = 0.0,
    p_sigma: float = 1.0,
    s_churn: float = 0.0,
    s_min: float = 0.0,
    s_max: float = float("inf"),
    s_noise: float = 1.0,
    sigma_min: float = 0.002,
    sigma_max: float = 80.0,
    rho: float = 7.0,
    optimizer: Callable[
        [Iterator[Parameter]], Optimizer
    ] = Adam,
    lr_scheduler: Mapping[str, Any] | None = None,
    val_metrics: Mapping[str, BaseMetric] | None = None,
    test_metrics: Mapping[str, BaseMetric] | None = None,
    log_cfg: LogConfig | None = None,
    debug_sample: tuple[int, int] | None = None,
) -> None

Initialize the NAC-domain EDM-style Lightning module.

forward

forward(x: Tensor, num_steps: int | None = None) -> Tensor

Enhance the input audio.

addse.lightning.EDMSELightningModule

Bases: BaseLightningModule, ConfigureOptimizersMixin, EDMMixin

Lightning module for speech enhancement using STFT-domain EDM-style diffusion.

__init__

__init__(
    model: ADM,
    stft: STFT,
    num_steps: int = 30,
    sigma_data: float = 0.5,
    p_mean: float = 0.0,
    p_sigma: float = 1.0,
    s_churn: float = 0.0,
    s_min: float = 0.0,
    s_max: float = float("inf"),
    s_noise: float = 1.0,
    sigma_min: float = 0.002,
    sigma_max: float = 80.0,
    rho: float = 7.0,
    optimizer: Callable[
        [Iterator[Parameter]], Optimizer
    ] = Adam,
    lr_scheduler: Mapping[str, Any] | None = None,
    val_metrics: Mapping[str, BaseMetric] | None = None,
    test_metrics: Mapping[str, BaseMetric] | None = None,
    log_cfg: LogConfig | None = None,
    debug_sample: tuple[int, int] | None = None,
) -> None

Initialize the STFT-domain EDM-style Lightning module.

forward

forward(x: Tensor, num_steps: int | None = None) -> Tensor

Enhance the input audio.

inverse_transform

inverse_transform(x: Tensor, n: int) -> Tensor

Decompress and compute the inverse STFT.

transform

transform(x: Tensor) -> Tensor

Compute the STFT and compress.

addse.lightning.LightningModule

Bases: BaseLightningModule, ConfigureOptimizersMixin

Simple Lightning module for training models to directly predict clean speech given noisy speech.

__init__

__init__(
    model: Module,
    loss: BaseLoss,
    optimizer: Callable[
        [Iterator[Parameter]], Optimizer
    ] = Adam,
    lr_scheduler: Mapping[str, Any] | None = None,
    val_metrics: Mapping[str, BaseMetric] | None = None,
    test_metrics: Mapping[str, BaseMetric] | None = None,
    log_cfg: LogConfig | None = None,
    debug_sample: tuple[int, int] | None = None,
) -> None

Initialize the simple Lightning module.

Parameters:

  • model (Module) –

    Model to train.

  • loss (BaseLoss) –

    Loss module.

  • optimizer (Callable[[Iterator[Parameter]], Optimizer], default: Adam ) –

    Optimizer constructor.

  • lr_scheduler (Mapping[str, Any] | None, default: None ) –

    Learning rate scheduler configuration.

  • val_metrics (Mapping[str, BaseMetric] | None, default: None ) –

    Metrics to compute during validation.

  • test_metrics (Mapping[str, BaseMetric] | None, default: None ) –

    Metrics to compute during testing.

  • log_cfg (LogConfig | None, default: None ) –

    Logging configuration.

  • debug_sample (tuple[int, int] | None, default: None ) –

    Tuple (batch_idx, sample_idx) to log debug audio samples to W&B during validation.
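
Example

A hedged sketch wiring the module together with other classes documented on this page:

from addse.lightning import LightningModule
from addse.losses import SDRLoss
from addse.models.bsrnn import BSRNN

module = LightningModule(
    model=BSRNN(fs=16000),
    loss=SDRLoss(scale_invariant=True),
)
# enhanced = module(noisy)  # forward enhances the input audio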

forward

forward(x: Tensor) -> Tensor

Enhance the input audio.

addse.lightning.LogConfig dataclass

Configuration for logging losses and metrics.

addse.lightning.NACLightningModule

Bases: BaseLightningModule

Lightning module for neural audio codec.

__init__

__init__(
    generator: NAC,
    discriminator: Module | Iterable[Module],
    reconstruction_loss: BaseLoss,
    adversarial_loss_weight: float,
    feature_loss_weight: float,
    reconstruction_loss_weight: float,
    codebook_loss_weight: float,
    commitment_loss_weight: float,
    generator_optimizer: Callable[
        [Iterator[Parameter]], Optimizer
    ],
    discriminator_optimizer: Callable[
        [Iterator[Parameter]], Optimizer
    ],
    generator_grad_clip: float = 0.0,
    discriminator_grad_clip: float = 0.0,
    val_metrics: Mapping[str, BaseMetric] | None = None,
    test_metrics: Mapping[str, BaseMetric] | None = None,
    log_cfg: LogConfig | None = None,
    debug_sample: tuple[int, int] | None = None,
) -> None

Initialize the neural audio codec Lightning module.

configure_optimizers

configure_optimizers() -> tuple[Optimizer, Optimizer]

Configure optimizers.

Returns:

  • tuple[Optimizer, Optimizer]

    The generator and discriminator optimizers.

discriminator_forward

discriminator_forward(
    x: Tensor,
) -> tuple[list[Tensor], list[list[Tensor]]]

Forward pass through all discriminators.

discriminator_step

discriminator_step(x: Tensor, y: Tensor) -> Tensor

Discriminator step.

forward

forward(x: Tensor) -> Tensor

Forward pass through the generator.

generator_step

generator_step(
    x: Tensor,
    y: Tensor,
    codebook_loss: Tensor,
    commit_loss: Tensor,
) -> dict[str, Tensor]

Generator step.

addse.lightning.NACSELightningModule

Bases: BaseLightningModule, ConfigureOptimizersMixin

Lightning module for speech enhancement using NAC-domain direct prediction.

__init__

__init__(
    nac_cfg: str,
    nac_ckpt: str,
    nac_domain: str,
    nac_no_sum: bool,
    model: Module,
    block_size: int,
    optimizer: Callable[
        [Iterator[Parameter]], Optimizer
    ] = Adam,
    lr_scheduler: Mapping[str, Any] | None = None,
    val_metrics: Mapping[str, BaseMetric] | None = None,
    test_metrics: Mapping[str, BaseMetric] | None = None,
    log_cfg: LogConfig | None = None,
    debug_sample: tuple[int, int] | None = None,
) -> None

Initialize the NAC-domain Lightning module.

forward

forward(x: Tensor) -> Tensor

Enhance the input audio.

addse.lightning.SGMSELightningModule

Bases: BaseLightningModule, ConfigureOptimizersMixin

SGMSE Lightning module.

__init__

__init__(
    model: SGMSEUNet,
    stft: STFT,
    num_steps: int = 30,
    sigma_min: float = 0.05,
    sigma_max: float = 0.5,
    gamma: float = 1.5,
    t_eps: float = 0.03,
    corrector_snr: float = 0.5,
    alpha: float = 0.5,
    beta: float = 0.15,
    optimizer: Callable[
        [Iterator[Parameter]], Optimizer
    ] = Adam,
    lr_scheduler: Mapping[str, Any] | None = None,
    val_metrics: Mapping[str, BaseMetric] | None = None,
    test_metrics: Mapping[str, BaseMetric] | None = None,
    log_cfg: LogConfig | None = None,
    debug_sample: tuple[int, int] | None = None,
) -> None

Initialize the SGMSE Lightning module.

forward

forward(x: Tensor, num_steps: int | None = None) -> Tensor

Enhance the input audio.

inverse_transform

inverse_transform(x: Tensor, n: int) -> Tensor

Decompress, descale, and compute the inverse STFT.

loss

loss(x: Tensor, y: Tensor) -> Tensor

Compute the loss.

score

score(x: Tensor, y: Tensor, t: Tensor) -> Tensor

Estimate the score function.

sigma

sigma(t: Tensor) -> Tensor

Noise schedule.

solve

solve(x: Tensor, num_steps: int) -> Tensor

Sample using the predictor-corrector method.

transform

transform(x: Tensor) -> Tensor

Compute the STFT, compress, and scale.

addse.lightning.compute_metrics

compute_metrics(
    x: Tensor,
    y: Tensor,
    metrics: Mapping[str, BaseMetric] | None = None,
) -> dict[str, float]

Compute validation or test metrics.

Parameters:

  • x (Tensor) –

    Signal to evaluate. Shape (batch_size, num_channels, num_samples).

  • y (Tensor) –

    Reference signal for the metrics. Shape (batch_size, num_channels, num_samples).

  • metrics (Mapping[str, BaseMetric] | None, default: None ) –

    Metrics to compute.

Returns:

  • dict[str, float]

    Dictionary mapping metric names to computed values.

addse.lightning.load_nac

load_nac(cfg_path: str, ckpt_path: str) -> tuple[NAC, int]

Load a pretrained neural audio codec.

addse.lightning.process_in_blocks

process_in_blocks(
    args: tuple[Tensor, ...],
    block_size: int,
    fn: Callable[..., Tensor],
) -> Tensor

Process the inputs in blocks.
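
Example

A sketch of the call shape; how blocks are split and re-joined is an internal detail, so the function passed below is purely illustrative:

import torch
from addse.lightning import process_in_blocks

x = torch.randn(1, 1, 160_000)  # e.g. 10 s of 16 kHz audio
y = torch.randn(1, 1, 160_000)

# fn is applied block by block to the input tensors.
out = process_in_blocks((x, y), block_size=16_000, fn=lambda a, b: a + b)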

addse.losses

addse.losses.BaseLoss

Bases: Module

Base class for losses.

compute abstractmethod

compute(
    x: Tensor, y: Tensor
) -> torch.Tensor | dict[str, torch.Tensor]

Compute the loss.

This method should not be called directly. Use forward instead.

forward

forward(x: Tensor, y: Tensor) -> dict[str, torch.Tensor]

Compute the loss.

Validates inputs and calls compute.

Parameters:

  • x (Tensor) –

    Predicted signal. Shape (batch_size, num_channels, num_samples).

  • y (Tensor) –

    Target signal. Shape (batch_size, num_channels, num_samples).

Returns:

  • dict[str, Tensor]

    Dictionary with the computed loss values.

addse.losses.MSMelSpecLoss

Bases: MultiTaskLoss

Multi-scale mel-spectrogram loss.

__init__

__init__(
    n_mels: int | Collection[int] = (
        4,
        8,
        16,
        32,
        64,
        128,
        256,
    ),
    frame_lengths: Collection[int] = (
        31,
        67,
        127,
        257,
        509,
        1021,
        2053,
    ),
    hop_lengths: Collection[int | None] | None = None,
    n_ffts: Collection[int | None] | None = None,
    weights: Collection[float] | None = None,
    window: str = "flattop",
    fs: int = 16000,
    compression: float = 2.0,
    log: bool = True,
    power: float = 1.0,
    eps: float = 1e-07,
    mel_norm: Literal["slaney", "consistent"]
    | None = "consistent",
    stft_norm: bool = True,
) -> None

Initialize the multi-scale mel-spectrogram loss.
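
Example

A minimal usage sketch with random tensors in the documented (batch_size, num_channels, num_samples) layout:

import torch
from addse.losses import MSMelSpecLoss

loss_fn = MSMelSpecLoss(fs=16000)
x = torch.randn(2, 1, 16000)  # predicted signal
y = torch.randn(2, 1, 16000)  # target signal
losses = loss_fn(x, y)        # BaseLoss.forward returns a dict[str, Tensor]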

addse.losses.MelSpecLoss

Bases: BaseLoss

Mel-spectrogram loss.

__init__

__init__(
    n_mels: int = 64,
    frame_length: int = 512,
    hop_length: int | None = None,
    n_fft: int | None = None,
    window: str = "flattop",
    fs: int = 16000,
    compression: float = 2.0,
    log: bool = True,
    power: float = 1.0,
    eps: float = 1e-07,
    mel_norm: Literal["slaney", "consistent"]
    | None = "consistent",
    stft_norm: bool = True,
) -> None

Initialize the mel-spectrogram loss.

addse.losses.MultiTaskLoss

Bases: BaseLoss

Multi-task loss.

__init__

__init__(
    losses: Collection[BaseLoss],
    weights: Collection[float] | None = None,
    names: Collection[str] | None = None,
) -> None

Initialize the multi-task loss.

addse.losses.SDRLoss

Bases: BaseLoss

Signal-to-distortion ratio (SDR) loss.

__init__

__init__(
    scale_invariant: bool = False,
    zero_mean: bool = False,
    eps: float = 1e-07,
) -> None

Initialize the SDR loss.

Parameters:

  • scale_invariant (bool, default: False ) –

    If True, computes the scale-invariant signal-to-distortion ratio (SI-SDR).

  • zero_mean (bool, default: False ) –

    If True, subtracts the mean from the inputs before computing the loss.

  • eps (float, default: 1e-07 ) –

    Small value for numerical stability.
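
Example

A short sketch computing the scale-invariant variant:

import torch
from addse.losses import SDRLoss

si_sdr = SDRLoss(scale_invariant=True, zero_mean=True)
loss = si_sdr(torch.randn(2, 1, 16000), torch.randn(2, 1, 16000))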

addse.metrics

addse.metrics.BaseMetric

Base class for metrics.

__call__

__call__(x: ndarray | Tensor, y: ndarray | Tensor) -> float

Compute the metric.

Validates inputs and calls compute.

Parameters:

  • x (ndarray | Tensor) –

    Input signal to evaluate. Shape (num_channels, num_samples).

  • y (ndarray | Tensor) –

    Reference signal to compare against. Shape (num_channels, num_samples).

Returns:

  • float

    Metric value.

compute abstractmethod

compute(x: ndarray, y: ndarray) -> float

Compute the metric.

This method should not be called directly. Use __call__ instead.
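
Example

A sketch of a custom metric (the RMSE metric below is hypothetical, and assumes BaseMetric needs no constructor arguments):

import numpy as np
from addse.metrics import BaseMetric

class RMSEMetric(BaseMetric):
    def compute(self, x: np.ndarray, y: np.ndarray) -> float:
        # Root-mean-square error between the evaluated and reference signals.
        return float(np.sqrt(np.mean((x - y) ** 2)))

metric = RMSEMetric()
# value = metric(x, y)  # __call__ validates inputs, then dispatches to compute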

addse.metrics.DNSMOSMetric

Bases: BaseMetric

Deep noise suppression mean opinion score (DNSMOS) metric.

Calculated independently for each channel and averaged across channels.

__init__

__init__(fs: int) -> None

Initialize the DNSMOS metric.

Parameters:

  • fs (int) –

    Sampling frequency of input signals.

addse.metrics.LPSMetric

Bases: BaseMetric

Levenshtein phoneme similarity (LPS).

Calculated independently for each channel and averaged across channels.

__init__

__init__(
    fs: int,
    device: str = "auto",
    checkpoint: str = "facebook/wav2vec2-lv-60-espeak-cv-ft",
) -> None

Initialize the LPS metric.

addse.metrics.MCDMetric

Bases: BaseMetric

Mel-cepstral distance (MCD) metric.

Calculated independently for each channel and averaged across channels.

__init__

__init__(fs: int) -> None

Initialize the MCD metric.

addse.metrics.NISQAMetric

Bases: BaseMetric

Non-intrusive speech quality assessment (NISQA) metric.

Calculated independently for each channel and averaged across channels.

__init__

__init__(fs: int) -> None

Initialize the NISQA metric.

addse.metrics.PESQMetric

Bases: BaseMetric

Perceptual evaluation of speech quality (PESQ) metric.

Calculated independently for each channel and averaged across channels.

__init__

__init__(fs: int) -> None

Initialize the PESQ metric.

Parameters:

  • fs (int) –

    Sampling frequency of input signals.

addse.metrics.SBSMetric

Bases: BaseMetric

SpeechBERTScore (SBS).

__init__

__init__(fs: int, device: str = 'auto') -> None

Initialize the SBS metric.

addse.metrics.SCOREQMetric

Bases: BaseMetric

Speech contrastive regression for quality assessment (SCOREQ).

Calculated independently for each channel and averaged across channels.

__init__

__init__(fs: int) -> None

Initialize the SCOREQ metric.

addse.metrics.SDRMetric

Bases: BaseMetric

Signal-to-distortion ratio (SDR) metric.

__init__

__init__(
    scale_invariant: bool = False,
    zero_mean: bool = False,
    eps: float = 1e-07,
) -> None

Initialize the SDR metric.

Parameters:

  • scale_invariant (bool, default: False ) –

    If True, computes the scale-invariant signal-to-distortion ratio (SI-SDR).

  • zero_mean (bool, default: False ) –

    If True, subtracts the mean from the inputs before computing the metric.

  • eps (float, default: 1e-07 ) –

    Small value for numerical stability.

addse.metrics.STOIMetric

Bases: BaseMetric

Short-time objective intelligibility (STOI) metric.

Calculated independently for each channel and averaged across channels.

__init__

__init__(fs: int, extended: bool = False) -> None

Initialize the STOI metric.

Parameters:

  • fs (int) –

    Sampling frequency of input signals.

  • extended (bool, default: False ) –

    If True, computes the extended version of the STOI metric (ESTOI).

addse.metrics.UTMOSMetric

Bases: BaseMetric

UTokyo-SaruLab MOS prediction system (UTMOSv2).

Calculated independently for each channel and averaged across channels.

__init__

__init__(fs: int, device: str = 'auto') -> None

Initialize the UTMOS metric.

Parameters:

  • fs (int) –

    Sampling frequency of input signals.

  • device (str, default: 'auto' ) –

    Device to run the model on. One of 'auto', 'cpu', or 'cuda'.

addse.models.addse

addse.models.addse.ADDSEDiT

Bases: Module

ADDSE DiT.

__init__

__init__(
    dim: int,
    num_layers: int,
    num_heads: int,
    max_seq_len: int,
    elementwise_affine: bool,
) -> None

Initialize the ADDSE DiT.

forward

forward(
    x: Tensor,
    c: Tensor | None = None,
    t: Tensor | None = None,
) -> Tensor

Forward pass.

addse.models.addse.ADDSEDiTBlock

Bases: Module

ADDSE DiT block.

__init__

__init__(
    dim: int, num_heads: int, elementwise_affine: bool
) -> None

Initialize the ADDSE DiT block.

forward

forward(
    x: Tensor,
    c: Tensor | None,
    cos_emb: Tensor,
    sin_emb: Tensor,
) -> Tensor

Forward pass.

addse.models.addse.ADDSEEmbeddingBlock

Bases: Module

ADDSE noise embedding block with Fourier features.

__init__

__init__(dim: int, emb_dim: int = 256) -> None

Initialize the ADDSE time embedding block.

forward

forward(x: Tensor) -> Tensor

Forward pass.

addse.models.addse.ADDSERQDiT

Bases: Module

Residual Quantized Diffusion Transformer (RQDiT) backbone used in ADDSE.

__init__

__init__(
    input_channels: int,
    output_channels: int,
    num_codebooks: int,
    hidden_dim: int,
    num_layers: int,
    num_heads: int,
    max_seq_len: int,
    conditional: bool,
    time_independent: bool,
) -> None

Initialize the ADDSE RQDiT backbone.

Parameters:

  • input_channels (int) –

    Number of input channels.

  • output_channels (int) –

    Number of output channels.

  • num_codebooks (int) –

    Number of codebooks.

  • hidden_dim (int) –

    Number of DiT hidden channels.

  • num_layers (int) –

    Number of DiT layers.

  • num_heads (int) –

    Number of DiT attention heads.

  • max_seq_len (int) –

    Maximum sequence length.

  • conditional (bool) –

    Whether the model is conditional.

  • time_independent (bool) –

    Whether the model is time-independent.

forward

forward(
    x: Tensor,
    c: Tensor | None = None,
    t: Tensor | None = None,
) -> Tensor

Forward pass.

Parameters:

  • x (Tensor) –

    Diffused embeddings. Shape (batch_size, input_channels, num_codebooks, seq_len) or (batch_size, input_channels, seq_len).

  • c (Tensor | None, default: None ) –

    Conditioning embeddings. Same shape as x.

  • t (Tensor | None, default: None ) –

    Time step or noise level. Shape (batch_size,).

Returns:

  • Tensor

    Output tensor. Shape (batch_size, output_channels, num_codebooks, seq_len).

addse.models.addse.ADDSESelfAttentionBlock

Bases: Module

ADDSE self-attention block.

__init__

__init__(dim: int, num_heads: int) -> None

Initialize the ADDSE self-attention block.

forward

forward(
    x: Tensor, cos_emb: Tensor, sin_emb: Tensor
) -> Tensor

Forward pass.

addse.models.addse.get_rot_emb

get_rot_emb(
    dim: int, max_seq_len: int
) -> tuple[Tensor, Tensor]

Compute rotary embeddings. Shape (max_seq_len, dim).

addse.models.adm

addse.models.adm.ADM

Bases: Module

ADM similar to configuration F in the EDM2 paper.

__init__

__init__(
    num_channels: int = 1,
    base_channels: int = 96,
    num_res_blocks: int = 3,
    channel_mult: Sequence[int] = (1, 2, 3, 4),
    attn_levels: Container[int] = (),
) -> None

Initialize ADM.

forward

forward(y: Tensor, x: Tensor, t: Tensor) -> Tensor

Forward pass.

Parameters:

  • y (Tensor) –

    Complex-valued diffused speech tensor. Shape (batch_size, num_channels, num_freqs, num_frames).

  • x (Tensor) –

    Complex-valued noisy speech tensor. Shape (batch_size, num_channels, num_freqs, num_frames).

  • t (Tensor) –

    Diffusion step or noise level. Shape (batch_size,).

Returns:

  • Tensor

    Complex-valued output score. Shape (batch_size, num_channels, num_freqs, num_frames).

addse.models.adm.ADMAttentionBlock

Bases: Module

ADM attention block.

__init__

__init__(num_channels: int) -> None

Initialize the ADM attention block.

forward

forward(x: Tensor) -> Tensor

Forward pass.

addse.models.adm.ADMBlock

Bases: Module

ADM block.

__init__

__init__(
    in_ch: int,
    out_ch: int,
    emb_ch: int,
    kind: str,
    resample: bool = False,
    attn: bool = False,
) -> None

Initialize the ADM block.

forward

forward(x: Tensor, emb: Tensor) -> Tensor

Forward pass.

addse.models.adm.ADMEmbeddingBlock

Bases: Module

ADM time step embedding block.

__init__

__init__(in_channels: int, out_channels: int) -> None

Initialize the ADM time embedding block.

forward

forward(x: Tensor) -> Tensor

Forward pass.

addse.models.adm.ADMResample

Bases: Module

ADM 2D resampling block.

__init__

__init__(kind: str) -> None

Initialize the ADM 2D resampling block.

forward

forward(x: Tensor) -> Tensor

Forward pass.

addse.models.adm.adm_conv2d

adm_conv2d(
    in_channels: int,
    out_channels: int,
    kernel_size: int,
    stride: int = 1,
    padding: int = 0,
) -> nn.Conv2d

2D convolutional layer with weight normalization and no bias.

addse.models.bsrnn

addse.models.bsrnn.BSRNN

Bases: Module

Band-split RNN (BSRNN) [1][2][3].


  1. Y. Luo and J. Yu, "Music source separation with band-split RNN", IEEE/ACM TASLP, 2023. 

  2. J. Yu and Y. Luo, "Efficient monaural speech enhancement with universal sample rate band-split RNN", IEEE ICASSP, 2023. 

  3. J. Yu, H. Chen, Y. Luo, R. Gu, and C. Weng, "High fidelity speech enhancement with band-split RNN", INTERSPEECH, 2023. 

__init__

__init__(
    stft: STFT | None = None,
    fs: int = 16000,
    input_channels: int = 1,
    output_channels: int = 1,
    num_channels: int = 32,
    num_layers: int = 6,
    causal: bool = False,
    subbands: Iterable[tuple[float, int]] = [
        (100.0, 10),
        (200.0, 10),
        (500.0, 6),
        (1000.0, 2),
    ],
    residual: bool = False,
    norm: Callable[[int], Module] | None = None,
) -> None

Initialize BSRNN.

Parameters:

  • stft (STFT | None, default: None ) –

    STFT module.

  • fs (int, default: 16000 ) –

    Sampling rate.

  • input_channels (int, default: 1 ) –

    Number of input channels.

  • output_channels (int, default: 1 ) –

    Number of output channels.

  • num_channels (int, default: 32 ) –

    Number of internal channels. Denoted as N in the paper.

  • num_layers (int, default: 6 ) –

    Number of dual-path modelling layers.

  • causal (bool, default: False ) –

    Whether to use unidirectional RNNs along the time axis.

  • subbands (Iterable[tuple[float, int]], default: [(100.0, 10), (200.0, 10), (500.0, 6), (1000.0, 2)] ) –

    List of tuples (bandwidth, number), where bandwidth is the bandwidth of the subband in Hz and number is the number of subbands with that bandwidth.

  • residual (bool, default: False ) –

    Whether to predict a residual STFT in addition to the mask. The residual STFT is added after applying the mask to the input STFT.

  • norm (Callable[[int], Module] | None, default: None ) –

    Normalization module to use throughout the network. If None, defaults to LayerNorm with causal=causal. If a non-causal normalization module is provided, the network is not causal, even if causal=True.

forward

forward(x: Tensor) -> torch.Tensor

Forward pass.

Parameters:

  • x (Tensor) –

    Input tensor. Shape (batch_size, input_channels, num_samples).

Returns:

  • Tensor

    Enhanced tensor. Shape (batch_size, output_channels, num_samples).
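
Example

A minimal sketch running the model on random audio (assuming the default STFT configuration):

import torch
from addse.models.bsrnn import BSRNN

model = BSRNN(fs=16000, causal=True)  # unidirectional RNNs along time
x = torch.randn(2, 1, 16000)          # (batch_size, input_channels, num_samples)
y = model(x)                          # (batch_size, output_channels, num_samples)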

addse.models.bsrnn.BSRNNMLP

Bases: Module

Multi-Layer perceptron (MLP) used in BSRNN.

__init__

__init__(
    input_channels: int,
    output_channels: int,
    norm: Callable[[int], Module],
) -> None

Initialize the BSRNN MLP.

forward

forward(x: Tensor) -> torch.Tensor

Forward pass.

addse.models.bsrnn.BSRNNRNNBlock

Bases: Module

RNN block used in BSRNN.

__init__

__init__(
    num_channels: int,
    hidden_channels: int,
    causal: bool,
    seq_dim: int,
    norm: Callable[[int], Module],
) -> None

Initialize the BSRNN RNN block.

forward

forward(x: Tensor) -> torch.Tensor

Forward pass.

addse.models.convtasnet

addse.models.convtasnet.ConvTasNet

Bases: Module

Conv-TasNet [1].

Consists of an encoder, a temporal convolutional network (TCN), and a decoder.


  1. Y. Luo and N. Mesgarani, "Conv-TasNet: Surpassing ideal time-frequency magnitude masking for speech separation", in IEEE/ACM TASLP, 2019. 

__init__

__init__(
    input_channels: int = 1,
    output_channels: int = 1,
    num_filters: int = 512,
    filter_size: int = 32,
    hop_size: int | None = None,
    bottleneck_channels: int = 128,
    hidden_channels: int = 512,
    skip_channels: int = 128,
    kernel_size: int = 3,
    layers: int = 8,
    repeats: int = 3,
    causal: bool = False,
    norm: Callable[[int], Module] | None = None,
) -> None

Initialize Conv-TasNet.

Parameters:

  • input_channels (int, default: 1 ) –

    Number of input channels.

  • output_channels (int, default: 1 ) –

    Number of output channels.

  • num_filters (int, default: 512 ) –

    Number of filters in the encoder. Denoted as N in the paper.

  • filter_size (int, default: 32 ) –

    Encoder filter length. Denoted as L in the paper.

  • hop_size (int | None, default: None ) –

    Encoder hop size. If None, defaults to filter_size // 2.

  • bottleneck_channels (int, default: 128 ) –

    Number of bottleneck channels in the TCN. Denoted as B in the paper.

  • hidden_channels (int, default: 512 ) –

    Number of hidden channels in the TCN. Denoted as H in the paper.

  • skip_channels (int, default: 128 ) –

    Number of skip channels in the TCN. Denoted as Sc in the paper.

  • kernel_size (int, default: 3 ) –

    Kernel size in the TCN. Denoted as P in the paper.

  • layers (int, default: 8 ) –

    Number of layers in the TCN. Denoted as X in the paper.

  • repeats (int, default: 3 ) –

    Number of repeats in the TCN. Denoted as R in the paper.

  • causal (bool, default: False ) –

    Whether to use causal convolutions in the TCN.

  • norm (Callable[[int], Module] | None, default: None ) –

    Normalization module to use in the TCN. If None, defaults to LayerNorm with causal=causal. If a non-causal normalization module is provided, the TCN is not causal, even if causal=True.
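
Example

A sketch with the paper's default configuration (N=512, L=32, B=128, H=512, Sc=128, P=3, X=8, R=3), assuming the same (batch_size, channels, num_samples) input layout as the other models on this page:

import torch
from addse.models.convtasnet import ConvTasNet

model = ConvTasNet()
y = model(torch.randn(2, 1, 16000))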

forward

forward(x: Tensor) -> torch.Tensor

Forward pass.

addse.models.convtasnet.ConvTasNetConv1DBlock

Bases: Module

1D convolutional block with PReLU activation and normalization used in Conv-TasNet.

__init__

__init__(
    input_channels: int,
    hidden_channels: int,
    skip_channels: int,
    kernel_size: int,
    dilation: int,
    causal: bool,
    last: bool,
    norm: Callable[[int], Module],
) -> None

Initialize the Conv-TasNet 1D convolutional block.

forward

forward(
    x: Tensor,
) -> tuple[torch.Tensor, torch.Tensor | None]

Forward pass.

addse.models.convtasnet.ConvTasNetTCN

Bases: Module

Temporal convolutional network (TCN) used in Conv-TasNet.

__init__

__init__(
    input_channels: int,
    output_channels: int,
    bottleneck_channels: int,
    hidden_channels: int,
    skip_channels: int,
    kernel_size: int,
    layers: int,
    repeats: int,
    causal: bool,
    norm: Callable[[int], Module],
) -> None

Initialize the Conv-TasNet TCN.

forward

forward(x: Tensor) -> torch.Tensor

Forward pass.

addse.models.mpd

addse.models.mpd.MPDiscriminator

Bases: Module

Multi-period discriminator.

__init__

__init__(
    periods: Iterable[int] = (2, 3, 5, 7, 11),
    in_channels: int = 1,
    kernel_size: int = 5,
    stride: int = 3,
    channels: Sequence[int] = (32, 128, 512, 1024, 1024),
    out_kernel_size: int = 3,
    out_stride: int = 1,
) -> None

Initialize the multi-period discriminator.

forward

forward(
    x: Tensor,
) -> tuple[list[torch.Tensor], list[list[torch.Tensor]]]

Forward pass.

addse.models.mpd.PDiscriminator

Bases: Module

Period discriminator.

__init__

__init__(
    period: int,
    in_channels: int = 1,
    kernel_size: int = 5,
    stride: int = 3,
    channels: Sequence[int] = (32, 128, 512, 1024, 1024),
    out_kernel_size: int = 3,
    out_stride: int = 1,
) -> None

Initialize the period discriminator.

forward

forward(
    x: Tensor,
) -> tuple[torch.Tensor, list[torch.Tensor]]

Forward pass.

addse.models.mpd.PDiscriminatorConv1d

Bases: Module

Period discriminator 1D convolutional layer.

__init__

__init__(
    in_channels: int,
    out_channels: int,
    kernel_size: int,
    stride: int = 1,
    activation: bool = True,
) -> None

Initialize the period discriminator 1D convolutional layer.

forward

forward(x: Tensor) -> torch.Tensor

Forward pass.

addse.models.msstftd

addse.models.msstftd.MSSTFTDiscriminator

Bases: Module

Multi-scale short-time Fourier transform (MS-STFT) discriminator.

__init__

__init__(
    frame_lengths: Collection[int] = (
        127,
        257,
        509,
        1021,
        2053,
    ),
    hop_lengths: Collection[int | None] | None = None,
    n_ffts: Collection[int | None] | None = None,
    window: str = "flattop",
    in_channels: int = 1,
    out_channels: int = 1,
    num_channels: int = 32,
    kernel_size: tuple[int, int] = (9, 3),
    stride: tuple[int, int] = (2, 1),
    dilations: Iterable[int] = (1, 2, 4),
) -> None

Initialize the MS-STFT discriminator.

forward

forward(
    x: Tensor,
) -> tuple[list[torch.Tensor], list[list[torch.Tensor]]]

Forward pass.

addse.models.msstftd.STFTDiscriminator

Bases: Module

Short-time Fourier transform (STFT) discriminator.

__init__

__init__(
    frame_length: int = 512,
    hop_length: int | None = None,
    n_fft: int | None = None,
    window: str = "flattop",
    in_channels: int = 1,
    out_channels: int = 1,
    num_channels: int = 32,
    kernel_size: tuple[int, int] = (9, 3),
    stride: tuple[int, int] = (2, 1),
    dilations: Iterable[int] = (1, 2, 4),
) -> None

Initialize the STFT discriminator.

forward

forward(
    x: Tensor,
) -> tuple[torch.Tensor, list[torch.Tensor]]

Forward pass.

addse.models.msstftd.STFTDiscriminatorConv2d

Bases: Module

Short-time Fourier transform (STFT) discriminator 2D convolutional layer.

__init__

__init__(
    in_channels: int,
    out_channels: int,
    kernel_size: tuple[int, int],
    stride: tuple[int, int] = (1, 1),
    dilation: tuple[int, int] = (1, 1),
    activation: bool = True,
) -> None

Initialize the STFT discriminator 2D convolutional layer.

forward

forward(x: Tensor) -> torch.Tensor

Forward pass.

addse.models.nac

addse.models.nac.NAC

Bases: Module

Neural audio codec.

__init__

__init__(
    in_channels: int = 1,
    emb_channels: int = 1024,
    base_channels: int = 32,
    strides: list[int] = [2, 2, 4, 4, 5],
    kernel_size: int = 3,
    num_residual_units: int = 3,
    dilation_base: int = 3,
    encoder_in_kernel_size: int = 7,
    encoder_out_kernel_size: int = 7,
    decoder_in_kernel_size: int = 7,
    decoder_out_kernel_size: int = 7,
    codebook_channels: int | None = 8,
    codebook_size: int = 1024,
    num_codebooks: int = 4,
    normalize: bool = True,
    shared_codebook: bool = False,
) -> None

Initialize the neural audio codec.

Parameters:

  • in_channels (int, default: 1 ) –

    Number of input channels.

  • emb_channels (int, default: 1024 ) –

    Number of output and input channels for the encoder and decoder, respectively.

  • base_channels (int, default: 32 ) –

    Number of base channels for the encoder and decoder.

  • strides (list[int], default: [2, 2, 4, 4, 5] ) –

    Downsampling and upsampling factors for the encoder and decoder blocks, respectively.

  • kernel_size (int, default: 3 ) –

    Kernel size for the residual units.

  • num_residual_units (int, default: 3 ) –

    Number of residual units per encoder and decoder block.

  • dilation_base (int, default: 3 ) –

    Dilation base for the residual units.

  • encoder_in_kernel_size (int, default: 7 ) –

    Kernel size for the encoder input convolutional layer.

  • encoder_out_kernel_size (int, default: 7 ) –

    Kernel size for the encoder output convolutional layer.

  • decoder_in_kernel_size (int, default: 7 ) –

    Kernel size for the decoder input convolutional layer.

  • decoder_out_kernel_size (int, default: 7 ) –

    Kernel size for the decoder output convolutional layer.

  • codebook_channels (int | None, default: 8 ) –

    Number of channels for the codebook vectors. If None, uses emb_channels. Else, each quantizer uses input and output linear layers to map between emb_channels and codebook_channels.

  • codebook_size (int, default: 1024 ) –

    Number of vectors per codebook.

  • num_codebooks (int, default: 4 ) –

    Number of codebooks.

  • normalize (bool, default: True ) –

    Whether to normalize the embeddings and codebook vectors before codebook lookup.

  • shared_codebook (bool, default: False ) –

    Whether to use the same codebook for all quantizers.

decode

decode(
    x: Tensor, no_sum: bool = False, domain: str = "code"
) -> torch.Tensor

Decode input into audio.

Parameters:

  • x (Tensor) –

    Input tensor:

    • If domain is "code": Shape (batch_size, num_codebooks, num_frames).
    • If domain is "x": Shape (batch_size, emb_channels, num_frames).
    • If domain is "q": Shape (batch_size, emb_channels, num_frames) if no_sum is False, else (batch_size, emb_channels, num_codebooks, num_frames).
    • If domain is "x_proj": Shape (batch_size, codebook_channels, num_codebooks, num_frames).
    • If domain is "q_proj": Shape (batch_size, codebook_channels, num_codebooks, num_frames).

  • no_sum (bool, default: False ) –

    If False, the input quantized embeddings are assumed to be summed across codebooks. Ignored if domain is not "q".

  • domain (str, default: 'code' ) –

    Domain of input tensor.

Returns:

  • Tensor

    Decoded audio. Shape (batch_size, in_channels, num_samples).

encode

encode(
    x: Tensor, no_sum: bool = False, domain: str = "q"
) -> tuple[torch.Tensor, torch.Tensor]

Encode input audio into discrete codes.

Parameters:

  • x (Tensor) –

    Input audio. Shape (batch_size, in_channels, num_samples).

  • no_sum (bool, default: False ) –

    If True, the quantized embeddings are not summed across codebooks. Ignored if domain is not "q".

  • domain (str, default: 'q' ) –

    Which continuous output to return. One of:

    • "x": Return the encoder output.
    • "q": Return the quantized embeddings.
    • "x_proj": Return the projected encoder output in codebook space.
    • "q_proj": Return the projected quantized embeddings in codebook space.

Returns:

  • tuple[Tensor, Tensor]

    Tuple (codes, continuous):

    • codes: Discrete codes. Shape (batch_size, num_codebooks, num_frames).
    • continuous: Continuous output:
      • If domain is "x": Shape (batch_size, emb_channels, num_frames).
      • If domain is "q": Shape (batch_size, emb_channels, num_frames) if no_sum is False, else (batch_size, emb_channels, num_codebooks, num_frames).
      • If domain is "x_proj": Shape (batch_size, codebook_channels, num_codebooks, num_frames).
      • If domain is "q_proj": Shape (batch_size, codebook_channels, num_codebooks, num_frames).
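
Example

A hedged round-trip sketch through the codec:

import torch
from addse.models.nac import NAC

codec = NAC()
audio = torch.randn(1, 1, 16000)            # (batch_size, in_channels, num_samples)

codes, q = codec.encode(audio, domain="q")  # codes: (batch_size, num_codebooks, num_frames)
recon = codec.decode(codes, domain="code")  # (batch_size, in_channels, num_samples)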

forward

forward(
    x: Tensor,
) -> tuple[
    torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor
]

Forward pass.

Parameters:

  • x (Tensor) –

    Input audio. Shape (batch_size, in_channels, num_samples).

Returns:

  • tuple[Tensor, Tensor, Tensor, Tensor]

    Tuple (decoded, codes, codebook_loss, commit_loss):

    • decoded: Reconstructed audio. Shape (batch_size, in_channels, num_samples).
    • codes: Discrete codes. Shape (batch_size, num_codebooks, num_frames).
    • codebook_loss: Codebook loss.
    • commit_loss: Commitment loss.
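
For training-style use, the forward pass returns the reconstruction together with the quantizer losses. A sketch under the same assumed class name as above:

import torch
from addse.models.nac import NAC  # assumed class name, as above

codec = NAC()
x = torch.randn(2, 1, 16000)
decoded, codes, codebook_loss, commit_loss = codec(x)
aux = codebook_loss + commit_loss  # quantizer losses, typically added to a reconstruction loss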

addse.models.nac.NACConv1d

Bases: Module

Neural audio codec 1D convolutional layer.

__init__

__init__(
    in_channels: int,
    out_channels: int,
    kernel_size: int = 1,
    stride: int = 1,
    padding: tuple[int, int] | str = (0, 0),
    dilation: int = 1,
    activation: bool = True,
    bias: bool = True,
) -> None

Initialize the neural audio codec 1D convolutional layer.

forward

forward(x: Tensor) -> torch.Tensor

Forward pass.

addse.models.nac.NACConvTranspose1d

Bases: Module

Neural audio codec 1D transposed convolutional layer.

__init__

__init__(
    in_channels: int, out_channels: int, stride: int
) -> None

Initialize the neural audio codec 1D transposed convolutional layer.

forward

forward(x: Tensor) -> torch.Tensor

Forward pass.

addse.models.nac.NACDecoder

Bases: Module

Neural audio codec decoder.

__init__

__init__(
    in_channels: int,
    out_channels: int,
    base_channels: int,
    strides: list[int],
    kernel_size: int,
    num_residual_units: int,
    dilation_base: int,
    in_kernel_size: int,
    out_kernel_size: int,
) -> None

Initialize the neural audio codec decoder.

forward

forward(x: Tensor) -> torch.Tensor

Decode continuous embeddings into audio.

Parameters:

  • x (Tensor) –

    Continuous embeddings. Shape (batch_size, in_channels, num_frames).

Returns:

  • Tensor

    Decoded audio. Shape (batch_size, out_channels, num_samples).

addse.models.nac.NACDecoderBlock

Bases: Module

Neural audio codec decoder block.

__init__

__init__(
    in_channels: int,
    out_channels: int,
    stride: int,
    kernel_size: int,
    num_residual_units: int,
    dilation_base: int,
) -> None

Initialize the neural audio codec decoder block.

forward

forward(x: Tensor) -> torch.Tensor

Forward pass.

addse.models.nac.NACEncoder

Bases: Module

Neural audio codec encoder.

__init__

__init__(
    in_channels: int,
    out_channels: int,
    base_channels: int,
    strides: list[int],
    kernel_size: int,
    num_residual_units: int,
    dilation_base: int,
    in_kernel_size: int,
    out_kernel_size: int,
) -> None

Initialize the neural audio codec encoder.

forward

forward(x: Tensor) -> torch.Tensor

Encode input audio into continuous embeddings.

Parameters:

  • x (Tensor) –

    Input audio. Shape (batch_size, in_channels, num_samples).

Returns:

  • Tensor

    Continuous embeddings. Shape (batch_size, out_channels, num_frames).

addse.models.nac.NACEncoderBlock

Bases: Module

Neural audio codec encoder block.

__init__

__init__(
    in_channels: int,
    out_channels: int,
    stride: int,
    kernel_size: int,
    num_residual_units: int,
    dilation_base: int,
) -> None

Initialize the neural audio codec encoder block.

forward

forward(x: Tensor) -> torch.Tensor

Forward pass.

addse.models.nac.NACLSTMBlock

Bases: Module

Neural audio codec LSTM block.

__init__

__init__(channels: int) -> None

Initialize the neural audio codec LSTM block.

forward

forward(x: Tensor) -> torch.Tensor

Forward pass.

process_in_blocks

process_in_blocks(
    x: Tensor,
) -> tuple[torch.Tensor, torch.Tensor]

Process the input in blocks to avoid the LSTM sequence-length limitation.

See https://github.com/pytorch/pytorch/issues/133751.
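
A minimal sketch of the block-wise strategy, assuming the point is to run the LSTM on bounded-length chunks while carrying the hidden state across them, which reproduces a single long pass exactly (block size and LSTM setup are illustrative):

import torch
import torch.nn as nn

def lstm_in_blocks(lstm: nn.LSTM, x: torch.Tensor, block: int = 4096) -> torch.Tensor:
    # x: (batch, time, channels), with a batch_first LSTM.
    outs, state = [], None
    for start in range(0, x.shape[1], block):
        out, state = lstm(x[:, start : start + block], state)  # carry hidden state across blocks
        outs.append(out)
    return torch.cat(outs, dim=1)

lstm = nn.LSTM(64, 64, batch_first=True)
y = lstm_in_blocks(lstm, torch.randn(1, 10000, 64))  # equals one long pass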

addse.models.nac.NACRVQVAE

Bases: Module

Neural audio codec residual vector quantizer.

__init__

__init__(
    emb_channels: int,
    codebook_size: int,
    num_codebooks: int,
    codebook_channels: int | None,
    normalize: bool,
    shared_codebook: bool,
) -> None

Initialize the neural audio codec residual vector quantizer.

decode

decode(
    x: Tensor,
    input_no_sum: bool = False,
    output_no_sum: bool = False,
    domain: str = "code",
) -> torch.Tensor

Decode input into quantized embeddings.

Parameters:

  • x (Tensor) –

    Input tensor:

    • If domain is "code": Shape (batch_size, num_codebooks, num_frames).
    • If domain is "x": Shape (batch_size, emb_channels, num_frames).
    • If domain is "q": Shape (batch_size, emb_channels, num_frames) if input_no_sum is False else (batch_size, emb_channels, num_codebooks, num_frames).
    • If domain is "x_proj": Shape (batch_size, codebook_channels, num_codebooks, num_frames).
    • If domain is "q_proj": Shape (batch_size, codebook_channels, num_codebooks, num_frames).

  • input_no_sum (bool, default: False ) –

    If False, the input quantized embeddings are assumed to be summed across codebooks. Ignored if domain is not "q".

  • output_no_sum (bool, default: False ) –

    If True, the output quantized embeddings are not summed across codebooks.

  • domain (str, default: 'code' ) –

    Domain of input tensor.

Returns:

  • Tensor

    Decoded tensor. Shape (batch_size, emb_channels, num_frames) if output_no_sum is False else (batch_size, emb_channels, num_codebooks, num_frames).

forward

forward(
    x: Tensor, no_sum: bool = False
) -> tuple[
    torch.Tensor,
    torch.Tensor,
    torch.Tensor,
    torch.Tensor,
    torch.Tensor,
    torch.Tensor,
]

Assign discrete codes to continuous input embeddings.

Parameters:

  • x (Tensor) –

    Input continuous embeddings. Shape (batch_size, emb_channels, num_frames).

  • no_sum (bool, default: False ) –

    If True, the quantized embeddings are not summed across codebooks.

Returns:

  • Tensor

    A tuple (codes, quantized, codebook_loss, commit_loss, x_proj, quantized_proj):

  • Tensor
    • codes: Assigned vector indices. Shape (batch_size, num_codebooks, num_frames).
  • Tensor
    • quantized: Quantized embeddings. Shape (batch_size, emb_channels, num_frames) if no_sum is False else (batch_size, emb_channels, num_codebooks, num_frames).
  • Tensor
    • codebook_loss: Codebook loss. 0-dimensional.
  • Tensor
    • commit_loss: Commitment loss. 0-dimensional.
  • Tensor
    • x_proj: Projected input embeddings. Shape (batch_size, codebook_channels, num_codebooks, num_frames).
  • tuple[Tensor, Tensor, Tensor, Tensor, Tensor, Tensor]
    • quantized_proj: Projected quantized embeddings. Shape (batch_size, codebook_channels, num_codebooks, num_frames).
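
To make the residual structure concrete, here is an illustrative sketch, not the library's implementation (input/output projections, normalization, and losses are omitted): each codebook quantizes the residual left by the previous ones.

import torch

def rvq_sketch(x: torch.Tensor, codebooks: list[torch.Tensor]) -> torch.Tensor:
    # x: (batch, emb_channels, num_frames); each codebook: (codebook_size, emb_channels)
    residual, quantized = x, []
    for cb in codebooks:
        flat = residual.permute(0, 2, 1).reshape(-1, cb.shape[1])  # (batch*frames, emb)
        idx = torch.cdist(flat, cb).argmin(dim=1)                  # nearest codebook vector
        q = cb[idx].reshape(x.shape[0], -1, cb.shape[1]).permute(0, 2, 1)
        quantized.append(q)
        residual = residual - q                                    # next codebook sees the residual
    return torch.stack(quantized, dim=2)  # (batch, emb, num_codebooks, frames); sum over dim 2 for no_sum=False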

addse.models.nac.NACResidualUnit

Bases: Module

Neural audio codec residual unit.

__init__

__init__(
    channels: int, dilation: int, kernel_size: int
) -> None

Initialize the neural audio codec residual unit.

forward

forward(x: Tensor) -> torch.Tensor

Forward pass.

addse.models.nac.NACSnakeActivation

Bases: Module

Neural audio codec Snake activation function.

__init__

__init__(channels: int) -> None

Initialize the neural audio codec Snake activation function.

forward

forward(x: Tensor) -> torch.Tensor

Forward pass.

addse.models.nac.NACVQVAE

Bases: Module

Neural audio codec vector quantizer.

__init__

__init__(
    emb_channels: int,
    codebook_size: int,
    codebook_channels: int | None,
    normalize: bool,
    codebook: Embedding | None,
) -> None

Initialize the neural audio codec vector quantizer.

decode

decode(x: Tensor, domain: str = 'code') -> torch.Tensor

Decode input into quantized embeddings.

Parameters:

  • x (Tensor) –

    Input tensor:

    • Shape (batch_size, num_frames) if domain is "code".
    • Shape (batch_size, emb_channels, num_frames) if domain is "x".
    • Shape (batch_size, emb_channels, num_frames) if domain is "q".
    • Shape (batch_size, codebook_channels, num_frames) if domain is "x_proj".
    • Shape (batch_size, codebook_channels, num_frames) if domain is "q_proj".

  • domain (str, default: 'code' ) –

    Domain of input tensor.

Returns:

  • Tensor

    Decoded tensor. Shape (batch_size, emb_channels, num_frames).

forward

forward(
    x: Tensor,
) -> tuple[
    torch.Tensor,
    torch.Tensor,
    torch.Tensor,
    torch.Tensor,
    torch.Tensor,
    torch.Tensor,
]

Assign discrete codes to continuous input embeddings.

Parameters:

  • x (Tensor) –

    Input continuous embeddings. Shape (batch_size, emb_channels, num_frames).

Returns:

  • Tensor

    A tuple (codes, quantized, codebook_loss, commit_loss, x_proj, quantized_proj):

  • Tensor
    • codes: Assigned vector indices with shape (batch_size, num_frames).
  • Tensor
    • quantized: Quantized embeddings with shape (batch_size, emb_channels, num_frames).
  • Tensor
    • codebook_loss: Codebook loss. 0-dimensional.
  • Tensor
    • commit_loss: Commitment loss. 0-dimensional.
  • Tensor
    • x_proj: Projected input embeddings. Shape (batch_size, codebook_channels, num_frames).
  • tuple[Tensor, Tensor, Tensor, Tensor, Tensor, Tensor]
    • quantized_proj: Projected quantized embeddings. Shape (batch_size, codebook_channels, num_frames).
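
The two loss terms follow the standard VQ-VAE recipe: the codebook loss moves codes toward encoder outputs, while the commitment loss keeps the encoder close to its assigned codes. A sketch under that assumption (the library's exact weighting is not shown in this reference):

import torch
import torch.nn.functional as F

def vq_losses(x: torch.Tensor, quantized: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    codebook_loss = F.mse_loss(quantized, x.detach())  # gradient flows to the codebook only
    commit_loss = F.mse_loss(x, quantized.detach())    # gradient flows to the encoder only
    return codebook_loss, commit_loss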

quantize

quantize(
    x_proj: Tensor,
) -> tuple[
    torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor
]

Quantize projected input embeddings.

addse.models.sgmse

addse.models.sgmse.SGMSEAttentionBlock

Bases: Module

SGMSE attention block.

__init__

__init__(num_channels: int) -> None

Initialize the SGMSE attention block.

forward

forward(x: Tensor) -> Tensor

Forward pass.

addse.models.sgmse.SGMSEEmbeddingBlock

Bases: Module

SGMSE time step embedding block with Gaussian Fourier projection and MLP.

__init__

__init__(fourier_channels: int, emb_channels: int) -> None

Initialize the SGMSE time embedding block.

forward

forward(emb: Tensor) -> Tensor

Forward pass.

addse.models.sgmse.SGMSEResample

Bases: Module

SGMSE 2D resampling block.

__init__

__init__(kind: str) -> None

Initialize the SGMSE 2D resampling block.

forward

forward(x: Tensor) -> Tensor

Forward pass.

addse.models.sgmse.SGMSEUNet

Bases: Module

NCSN++ backbone used in SGMSE.

__init__

__init__(
    num_channels: int = 1,
    base_channels: int = 128,
    num_res_blocks: int = 2,
    channel_mult: Sequence[int] = (1, 1, 2, 2, 2, 2, 2),
    attn_levels: Container[int] = (4,),
) -> None

Initialize the SGMSE NCSN++ backbone.

Parameters:

  • num_channels (int, default: 1 ) –

    Number of input channels.

  • base_channels (int, default: 128 ) –

    Base number of channels.

  • num_res_blocks (int, default: 2 ) –

    Number of residual blocks per level.

  • channel_mult (Sequence[int], default: (1, 1, 2, 2, 2, 2, 2) ) –

    Channel multiplier for each level.

  • attn_levels (Container[int], default: (4,) ) –

    Indices of levels at which to apply attention.

forward

forward(x: Tensor, y: Tensor, t: Tensor) -> Tensor

Forward pass.

Parameters:

  • x (Tensor) –

    Complex-valued noisy speech tensor. Shape (batch_size, num_channels, num_freqs, num_frames).

  • y (Tensor) –

    Complex-valued diffused speech tensor. Shape (batch_size, num_channels, num_freqs, num_frames).

  • t (Tensor) –

    Diffusion step or noise level. Shape (batch_size,).

Returns:

  • Tensor

    Complex-valued output score. Shape (batch_size, num_channels, num_freqs, num_frames).
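
A hedged usage sketch: the constructor defaults are taken from above, while the spectrogram shape and dtype handling are assumptions (with seven levels, num_freqs and num_frames presumably need to be divisible by the network's total downsampling factor):

import torch
from addse.models.sgmse import SGMSEUNet

net = SGMSEUNet()  # defaults: num_channels=1, base_channels=128, ...
# Illustrative shape; (batch, channels, num_freqs, num_frames) per the docs above.
x = torch.randn(2, 1, 256, 128, dtype=torch.complex64)  # noisy speech
y = torch.randn(2, 1, 256, 128, dtype=torch.complex64)  # diffused speech
t = torch.rand(2)                                       # diffusion step / noise level
score = net(x, y, t)                                    # complex-valued, same shape as x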

addse.models.sgmse.SGMSEUNetBlock

Bases: Module

SGMSE UNet block.

__init__

__init__(
    in_ch: int,
    out_ch: int,
    prog_ch: int,
    emb_ch: int,
    kind: str | None = None,
    attn: bool = False,
) -> None

Initialize the SGMSE UNet block.

forward

forward(
    x: Tensor, prog: Tensor | None, emb: Tensor
) -> tuple[Tensor, Tensor | None]

Forward pass.

addse.models.sgmse.sgmse_groupnorm

sgmse_groupnorm(num_channels: int) -> nn.GroupNorm

SGMSE group normalization layer.

addse.stft

addse.stft.STFT

Bases: Module

Short-time Fourier transform (STFT) module.

__init__

__init__(
    frame_length: int = 512,
    hop_length: int | None = None,
    n_fft: int | None = None,
    window: str = "hann",
    norm: bool = False,
) -> None

Initialize the STFT module.

Parameters:

  • frame_length (int, default: 512 ) –

    Frame length.

  • hop_length (int | None, default: None ) –

    Hop length. If None, defaults to frame_length // 2.

  • n_fft (int | None, default: None ) –

    FFT size. If None, defaults to frame_length.

  • window (str, default: 'hann' ) –

    Window type. Passed to scipy.signal.get_window.

  • norm (bool, default: False ) –

    Whether to normalize the window by the square root of its sum of squares.

forward

forward(x: Tensor) -> torch.Tensor

Compute the STFT.

Parameters:

  • x (Tensor) –

    Input tensor. Shape (..., num_samples).

Returns:

  • Tensor

    STFT of input tensor. Shape (..., num_freqs, num_frames).

inverse

inverse(x: Tensor, n: int | None = None) -> torch.Tensor

Compute the inverse STFT.

Parameters:

  • x (Tensor) –

    Input tensor. Shape (..., num_freqs, num_frames).

  • n (int | None, default: None ) –

    If provided, the output tensor is trimmed to this length along the last axis.

Returns:

  • Tensor

    Reconstructed tensor. Shape (..., num_samples).
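
A round-trip sketch using the documented defaults (the complex output dtype is an assumption, consistent with the complex spectrograms used elsewhere in this reference):

import torch
from addse.stft import STFT

stft = STFT(frame_length=512)           # hop_length -> 256, n_fft -> 512
x = torch.randn(4, 16000)               # (..., num_samples)
X = stft(x)                             # (..., num_freqs, num_frames)
x_hat = stft.inverse(X, n=x.shape[-1])  # trim the reconstruction to the input length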

overlap_add

overlap_add(x: Tensor) -> torch.Tensor

Overlap-add.

Parameters:

  • x (Tensor) –

    Input tensor. Shape (batch_size, num_freqs, num_frames).

Returns:

  • Tensor

    Output tensor. Shape (batch_size, num_samples).

addse.utils

addse.utils.build_subbands

build_subbands(
    n_fft: int,
    fs: int,
    subbands: Iterable[tuple[float, int]],
) -> list[tuple[int, int]]

Derive subband indices on the FFT axis.

Parameters:

  • n_fft (int) –

    FFT size.

  • fs (int) –

    Sampling rate.

  • subbands (Iterable[tuple[float, int]]) –

    List of tuples (bandwidth, number), where bandwidth is the bandwidth of the subband in Hz and number is the number of subbands with that bandwidth.

Returns:

  • list[tuple[int, int]]

    List of tuples (start, end), where start and end are the start and end indices of the subband on the FFT axis.
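
At fs=16000 with n_fft=512, each FFT bin spans 31.25 Hz, so a 500 Hz subband covers 16 bins. A hypothetical call (the values are illustrative, and the exact index convention of the (start, end) tuples is defined by the implementation):

from addse.utils import build_subbands

# Four 500 Hz bands followed by two 3 kHz bands.
bands = build_subbands(n_fft=512, fs=16000, subbands=[(500.0, 4), (3000.0, 2)])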

addse.utils.bytes_str_to_int

bytes_str_to_int(bytes_str: str) -> int

Convert a human-readable byte size to an integer.

Parameters:

  • bytes_str (str) –

    Human-readable byte size (e.g., "64MB", "1GB").

Returns:

  • int

    Integer byte size.
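
A usage sketch; whether "64MB" is interpreted as a decimal or binary multiple is not specified in this reference:

from addse.utils import bytes_str_to_int

chunk_bytes = bytes_str_to_int("64MB")  # integer number of bytes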

addse.utils.dynamic_range

dynamic_range(
    x: Tensor, eps: float = 1e-08
) -> torch.Tensor

Dynamic range in dB.

Calculated as the ratio between the peak amplitude and the RMS.

Parameters:

  • x (Tensor) –

    Input signal. Any number of dimensions.

  • eps (float, default: 1e-08 ) –

    Small value for numerical stability.

Returns:

  • Tensor

    Dynamic range in dB.
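
Per the definition above, this is 20 * log10(peak / RMS). A sketch of that formula (the reduction axes are an assumption; the library may reduce per channel):

import torch

def dynamic_range_sketch(x: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    peak = x.abs().max()
    rms = x.square().mean().sqrt()
    return 20 * torch.log10((peak + eps) / (rms + eps))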

addse.utils.flatten_dict

flatten_dict(
    d: dict[str, Any], parent_key: str = "", sep: str = "."
) -> dict[str, Any]

Flatten a nested dictionary.

Parameters:

  • d (dict[str, Any]) –

    Dictionary to flatten.

  • parent_key (str, default: '' ) –

    Key prefix for the current level.

  • sep (str, default: '.' ) –

    Separator to use between keys.

Returns:

  • dict[str, Any]

    Flattened dictionary.
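
A usage sketch with an illustrative nested dictionary:

from addse.utils import flatten_dict

flatten_dict({"model": {"lr": 1e-3, "opt": {"name": "adam"}}})
# {"model.lr": 0.001, "model.opt.name": "adam"}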

addse.utils.hz_to_mel

hz_to_mel(hz: float, scale: str = 'slaney') -> float

Convert frequency in Hz to mel scale.

Parameters:

  • hz (float) –

    Frequency in Hz.

  • scale (str, default: 'slaney' ) –

    Mel scale to use. "htk" matches the Hidden Markov Model Toolkit (HTK), while "slaney" matches the Auditory Toolbox by Slaney. The "slaney" scale is linear below 1 kHz and logarithmic above 1 kHz.

Returns:

  • float

    Frequency in mel scale.
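
Both scales follow standard definitions. A sketch for reference (the formulas below are the commonly used ones, assumed to match this implementation):

import math

def hz_to_mel_sketch(hz: float, scale: str = "slaney") -> float:
    if scale == "htk":
        return 2595.0 * math.log10(1.0 + hz / 700.0)
    f_sp = 200.0 / 3.0  # ~66.67 Hz per mel below 1 kHz
    if hz < 1000.0:
        return hz / f_sp
    return 15.0 + math.log(hz / 1000.0) / (math.log(6.4) / 27.0)  # log region above 1 kHz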

addse.utils.load_hydra_config

load_hydra_config(
    path: str, overrides: list[str] | None = None
) -> tuple[DictConfig, str]

Load a Hydra configuration file.

addse.utils.load_model

load_model(
    config_path: str,
    model_name: str | None = None,
    logs_dir: str = "logs",
    ckpt_name: str = "last.ckpt",
    ckpt_path: str | None = None,
    state_key: str | None = "state_dict",
    prepend_key: str | None = None,
    device: device | str | None = None,
    strict: bool = True,
) -> L.LightningModule

Load a model.

addse.utils.mel_filters

mel_filters(
    n_filters: int = 64,
    n_fft: int = 512,
    f_min: float = 0.0,
    f_max: float | None = None,
    fs: float = 16000,
    scale: str = "slaney",
    norm: Literal["slaney", "consistent"]
    | None = "consistent",
    dtype: dtype = torch.float32,
) -> tuple[torch.Tensor, torch.Tensor]

Get mel filters.

Parameters:

  • n_filters (int, default: 64 ) –

    Number of filters.

  • n_fft (int, default: 512 ) –

    Number of FFT points.

  • f_min (float, default: 0.0 ) –

    Minimum frequency.

  • f_max (float | None, default: None ) –

    Maximum frequency. If None, uses fs / 2.

  • fs (float, default: 16000 ) –

    Sampling frequency.

  • scale (str, default: 'slaney' ) –

    Mel scale to use. "htk" matches the Hidden Markov Model Toolkit (HTK), while "slaney" matches the Auditory Toolbox by Slaney. The "slaney" scale is linear below 1 kHz and logarithmic above 1 kHz.

  • norm (Literal['slaney', 'consistent'] | None, default: 'consistent' ) –

    Filter normalization method. If "slaney", the filters are normalized by their width in Hz. However this makes the filter response scale with the frequency resolution n_fft / fs. If "consistent", the frequency resolution is factored in. If None, no normalization is applied.

  • dtype (dtype, default: float32 ) –

    Data type to cast the filters to.

Returns:

  • tuple[Tensor, Tensor]

    Mel filters and center frequencies. Shapes (n_filters, n_fft // 2 + 1) and (n_filters,).
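
A usage sketch applying the filters to a magnitude spectrogram (the matrix product is illustrative):

import torch
from addse.utils import mel_filters

filters, centers = mel_filters(n_filters=64, n_fft=512, fs=16000)
mag = torch.rand(257, 100)  # (n_fft // 2 + 1, num_frames), e.g. an STFT magnitude
mel_spec = filters @ mag    # (64, num_frames)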

addse.utils.mel_to_hz

mel_to_hz(
    mel: Tensor, scale: str = "slaney"
) -> torch.Tensor

Convert frequency in mel scale to Hz.

Parameters:

  • mel (Tensor) –

    Frequency in mel scale.

  • scale (str, default: 'slaney' ) –

    Mel scale to use. "htk" matches the Hidden Markov Model Toolkit (HTK), while "slaney" matches the Auditory Toolbox by Slaney. The "slaney" scale is linear below 1 kHz and logarithmic above 1 kHz.

Returns:

  • Tensor

    Frequency in Hz.

addse.utils.scan_files

scan_files(input_dir: str, regex: str) -> Iterator[str]

Scan a directory for files matching a regular expression.

Parameters:

  • input_dir (str) –

    Directory to scan.

  • regex (str) –

    Regular expression to match file paths.

Yields:

  • str

    Path matching the regular expression.
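
A usage sketch with a hypothetical directory and pattern:

from addse.utils import scan_files

# All .flac files anywhere under "data" (the regex is illustrative).
for path in scan_files("data", r".*\.flac$"):
    print(path)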

addse.utils.seed_all

seed_all(seed: int) -> None

Set the seed for all random number generators.

Parameters:

  • seed (int) –

    Seed value.

addse.utils.segment_audio_file

segment_audio_file(
    path: str,
    format: str = "ogg",
    subtype: str | None = None,
    seglen: float | None = None,
    base: str | None = None,
) -> Iterator[tuple[bytes, str]]

Read and segment an audio file and yield bytes and a name for each segment.

Parameters:

  • path (str) –

    Path to the input audio file.

  • format (str, default: 'ogg' ) –

    Audio format to convert to. See soundfile.write.

  • subtype (str | None, default: None ) –

    Audio subtype to convert to. See soundfile.write.

  • seglen (float | None, default: None ) –

    Segment length in seconds. If provided, the file is segmented into chunks of approximately this length.

  • base (str | None, default: None ) –

    Base path to strip from the file path.

Yields:

  • tuple[bytes, str]

    Encoded audio bytes and a name for each segment.
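
A usage sketch with a hypothetical input file:

from addse.utils import segment_audio_file

# Roughly 10-second Ogg segments from one file.
for data, name in segment_audio_file("speech.wav", format="ogg", seglen=10.0):
    print(name, len(data))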

addse.utils.set_snr

set_snr(
    speech: Tensor, noise: Tensor, snr: float
) -> torch.Tensor

Scale noise to achieve a desired signal-to-noise ratio (SNR).

Parameters:

  • speech (Tensor) –

    Speech signal. Any number of dimensions.

  • noise (Tensor) –

    Noise signal. Any number of dimensions.

  • snr (float) –

    Desired SNR in dB.

Returns:

  • Tensor

    Scaled noise signal.
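
The scaling follows from solving snr = 10 * log10(P_speech / (gain**2 * P_noise)) for the gain. A sketch under the assumption that power is the mean square over all elements:

import torch

def set_snr_sketch(speech: torch.Tensor, noise: torch.Tensor, snr: float) -> torch.Tensor:
    p_speech = speech.square().mean()
    p_noise = noise.square().mean()
    gain = torch.sqrt(p_speech / (p_noise * 10 ** (snr / 10)))
    return gain * noise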

addse.utils.unflatten_dict

unflatten_dict(
    d: dict[str, Any], sep: str = "."
) -> dict[str, Any]

Unflatten a dictionary.

Parameters:

  • d (dict[str, Any]) –

    Dictionary to unflatten.

  • sep (str, default: '.' ) –

    Separator used between keys.

Returns:

  • dict[str, Any]

    Unflattened dictionary.
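
A usage sketch, the inverse of the flatten_dict example above:

from addse.utils import unflatten_dict

unflatten_dict({"model.lr": 0.001, "model.opt.name": "adam"})
# {"model": {"lr": 0.001, "opt": {"name": "adam"}}}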