Reference
addse.data
addse.data.AudioStreamingDataLoader
Bases: StreamingDataLoader
Audio streaming dataloader.
__init__
__init__(
dataset: AudioStreamingDataset | DynamicMixingDataset,
batch_size: int = 1,
num_workers: int = 0,
shuffle: bool | None = None,
**kwargs: Any,
) -> None
Initialize the audio streaming dataloader.
Parameters:
- dataset (AudioStreamingDataset | DynamicMixingDataset) – Dataset to wrap.
- batch_size (int, default: 1) – Batch size.
- num_workers (int, default: 0) – Number of workers.
- shuffle (bool | None, default: None) – Whether to shuffle the dataset at every epoch. If None, uses the dataset shuffle attribute.
- **kwargs (Any, default: {}) – Additional keyword arguments passed to the parent constructor.
__len__
Get the number of batches in the dataloader.
Returns:
- int – Number of batches in the dataloader.
Raises:
- TypeError – If the wrapped dataset is an instance of AudioStreamingDataset with segment_length != None, as the total number of segments in the dataset cannot be determined without iterating over it.
addse.data.AudioStreamingDataset
Bases: StreamingDataset
Audio streaming dataset.
__getitem__
__init__
__init__(
input_dir: str,
fs: int | None = None,
segment_length: float | None = None,
max_length: float | None = None,
max_dynamic_range: float | None = None,
shuffle: bool = False,
seed: int = 0,
**kwargs: Any,
) -> None
Initialize the audio streaming dataset.
Parameters:
- input_dir (str) – Path or URL to LitData-optimized audio data.
- fs (int | None, default: None) – Optional sample rate to resample to.
- segment_length (float | None, default: None) – Audio segment length in seconds. If provided, audio files are concatenated and segmented into chunks of this length. Otherwise, audio files are yielded as is and may have variable length.
- max_length (float | None, default: None) – Maximum output length in seconds. If provided, audio files longer than this are skipped. Cannot be used together with segment_length.
- max_dynamic_range (float | None, default: None) – Maximum dynamic range in dB. If provided, audio files and segments with a dynamic range greater than this value are skipped.
- shuffle (bool, default: False) – Whether to shuffle the dataset.
- seed (int, default: 0) – Random seed for shuffling.
- **kwargs (Any, default: {}) – Additional keyword arguments passed to the parent constructor.
__len__
Get the number of files in the dataset.
Returns:
- int – The number of files in the dataset.
Note
If segment_length is not None, the number of samples yielded by this dataset when iterating over it does not match the output of this method.
__next__
Get the next item from the dataset.
Returns:
- ASDOutput – Audio data with shape (1, num_samples), sample rate, name, and the number of files loaded to get this item. The number of files loaded is required by DynamicMixingDataset.
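The segment_length behaviour described above (concatenate, then cut into fixed-length chunks) can be sketched as follows; this is a hypothetical illustration in plain Python, not the addse implementation:

```python
def segment(samples: list[float], fs: int, segment_length: float) -> list[list[float]]:
    # Chunk size in samples for a segment of segment_length seconds at rate fs.
    chunk = int(segment_length * fs)
    # Yield only full chunks; a trailing partial chunk is dropped in this sketch.
    return [samples[i:i + chunk] for i in range(0, len(samples) - chunk + 1, chunk)]
```

This also shows why __len__ (which counts files) does not match the number of yielded segments.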
addse.data.DynamicMixingDataset
Bases: ParallelStreamingDataset
Dynamic mixing dataset.
Wraps two AudioStreamingDataset instances, one for speech and one for noise, and generates noisy speech samples on-the-fly by mixing the speech and noise samples at a random signal-to-noise ratio (SNR).
Multi-channel speech and noise samples are converted to mono by randomly selecting one channel.
If the speech and noise samples have different lengths, the noise is cycled or trimmed to match the speech length.
When length=float("inf"), this dataset is infinite and should be used with limit_<stage>_batches in the
Lightning Trainer.
__init__
__init__(
speech_dataset: AudioStreamingDataset,
noise_dataset: AudioStreamingDataset,
snr_range: tuple[float, float] = (-5.0, 15.0),
rms_range: tuple[float, float] | None = (0.0, 0.0),
length: int | float | None = float("inf"),
resume: bool = True,
reset_rngs: bool = False,
**kwargs: Any,
) -> None
Initialize the dynamic mixing dataset.
Parameters:
- speech_dataset (AudioStreamingDataset) – Speech dataset.
- noise_dataset (AudioStreamingDataset) – Noise dataset.
- snr_range (tuple[float, float], default: (-5.0, 15.0)) – SNR range.
- rms_range (tuple[float, float] | None, default: (0.0, 0.0)) – RMS range for the clean speech in dB. If None, no RMS adjustment is performed.
- length (int | float | None, default: float("inf")) – Number of samples to yield per epoch. If None, the speech and noise datasets are iterated over until one is exhausted. If an integer, the datasets are cycled until length samples are yielded. If float("inf"), the datasets are cycled indefinitely.
- resume (bool, default: True) – Whether to resume the dataset from where it left off in the previous epoch when starting a new epoch. Should be set to False for validation and test datasets. Only works when iterating with an AudioStreamingDataLoader. Ignored if length is None.
- reset_rngs (bool, default: False) – Whether to set the internal random number generators to the same initial state at the start of each epoch. If True, random numbers are consistent across epochs. Should be set to True for validation and test datasets.
- **kwargs (Any, default: {}) – Additional keyword arguments passed to the parent constructor.
__iter__
__len__
transform
staticmethod
transform(
samples: tuple[ASDOutput, ASDOutput],
rngs: dict[str, Any],
snr_range: tuple[float, float],
rms_range: tuple[float, float] | None,
) -> tuple[
torch.Tensor, torch.Tensor, int, tuple[int, int]
]
Generate noisy speech from speech and noise samples.
Parameters:
- samples (tuple[ASDOutput, ASDOutput]) – Tuple with speech and noise samples.
- rngs (dict[str, Any]) – Random number generators.
- snr_range (tuple[float, float]) – SNR range.
- rms_range (tuple[float, float] | None) – RMS range for the clean speech in dB. If None, no RMS adjustment is performed.
Returns:
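The core of the mixing operation (cycle or trim the noise to the speech length, then scale it so the mixture hits a target SNR) can be sketched in plain Python; this is an assumed formulation for illustration, and the actual transform may differ in detail:

```python
import math

def mix_at_snr(speech: list[float], noise: list[float], snr_db: float) -> list[float]:
    """Scale noise so that 10*log10(P_speech / P_noise) equals snr_db, then add."""
    p_s = sum(x * x for x in speech) / len(speech)
    p_n = sum(x * x for x in noise) / len(noise)
    # Cycle (repeat) or trim the noise to match the speech length, as described above.
    noise = (noise * (len(speech) // len(noise) + 1))[: len(speech)]
    # Gain that brings the noise power to p_s / 10**(snr_db / 10).
    gain = math.sqrt(p_s / (p_n * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]
```

At snr_db=0 the scaled noise has the same power as the speech; larger SNR values attenuate the noise.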
addse.layers
addse.layers.BandMerge
Bases: Module
Band-merge module.
__init__
__init__(
subband_idx: Iterable[tuple[int, int]],
input_channels: int,
output_channels: int,
num_channels: int,
norm: Callable[[int], Module],
mlp: Callable[
[int, int, Callable[[int], Module]], Module
],
residual: bool,
) -> None
Initialize the band-merge module.
forward
Forward pass.
Parameters:
- x (Tensor) – Input tensor with shape (batch_size, input_channels, num_bands, num_frames).
Returns:
- tuple[Tensor, Tensor | None] – Tuple (mask, residual), where mask are complex-valued spatial filtering coefficients with shape (batch_size, input_channels, output_channels, num_freqs, num_frames), and residual is a residual additive short-time Fourier transform with shape (batch_size, output_channels, num_freqs, num_frames), or None if residual=False.
addse.layers.BandSplit
Bases: Module
Band-split module.
addse.layers.BatchNorm
Bases: Module
Batch normalization.
Input tensors must have shape (B, C, ...) where B is the batch dimension, C is the channel dimension, and
... are the spatial dimensions (e.g. height and width in computer vision, frequency and time in audio, or sequence
length in NLP). The statistics are aggregated over the batch and spatial dimensions as in [1], Figure 2. The input is normalized using these statistics and transformed by channel-specific learnable scale and shift parameters \(\gamma\) and \(\beta\). Note the reparameterization of the scale parameter compared to the default PyTorch implementation.
Unlike other normalization modules, this module has track_running_stats and momentum options.
1. Y. Wu and K. He, "Group normalization", ECCV, 2018.
__init__
__init__(
num_channels: int,
eps: float = 1e-05,
track_running_stats: bool = True,
momentum: float | None = 0.1,
) -> None
Initialize the batch normalization module.
Parameters:
- num_channels (int) – Number of channels in input tensors.
- eps (float, default: 1e-05) – Small value for numerical stability.
- track_running_stats (bool, default: True) – If True, normalization statistics are aggregated over batches during training and saved for evaluation. If False, statistics are computed from the current batch both during training and evaluation.
- momentum (float | None, default: 0.1) – Momentum for running statistics. The bigger the value, the more weight is given to the current batch statistics. Ignored if track_running_stats is False. If None, running statistics are cumulatively aggregated over batches without decay.
addse.layers.GroupNorm
Bases: Module
Group normalization.
Input tensors must have shape (B, C, ...) where B is the batch dimension, C is the channel dimension, and
... are the spatial dimensions (e.g. height and width in computer vision, frequency and time in audio, or sequence
length in NLP). The statistics are aggregated over grouped channels and spatial dimensions as in [1], Figure 2. The input is normalized using these statistics and transformed by channel-specific learnable scale and shift parameters \(\gamma\) and \(\beta\). Note the reparameterization of the scale parameter compared to the default PyTorch implementation.
1. Y. Wu and K. He, "Group normalization", ECCV, 2018.
__init__
Initialize the group normalization module.
Parameters:
- num_groups (int) – Number of groups to separate the channels into.
- num_channels (int) – Number of channels in input tensors.
- eps (float, default: 1e-05) – Small value for numerical stability.
- causal (bool, default: False) – If True, normalization statistics are cumulatively aggregated along the time dimension. The time dimension must be the last dimension of the input tensor.
addse.layers.InstanceNorm
Bases: GroupNorm
Instance normalization.
Input tensors must have shape (B, C, ...) where B is the batch dimension, C is the channel dimension, and ... are the spatial dimensions (e.g. height and width in computer vision, frequency and time in audio, or sequence length in NLP). The statistics are aggregated over the spatial dimensions as in [1], Figure 2. The input is normalized using these statistics and transformed by channel-specific learnable scale and shift parameters \(\gamma\) and \(\beta\). Note the reparameterization of the scale parameter compared to the default PyTorch implementation.
1. Y. Wu and K. He, "Group normalization", ECCV, 2018.
__init__
Initialize the instance normalization module.
Parameters:
- num_channels (int) – Number of channels in input tensors.
- eps (float, default: 1e-05) – Small value for numerical stability.
- causal (bool, default: False) – If True, normalization statistics are cumulatively aggregated along the time dimension. The time dimension must be the last dimension of the input tensor.
addse.layers.LayerNorm
Bases: Module
Layer normalization.
Input tensors must have shape (B, C, ...) where B is the batch dimension, C is the channel dimension, and
... are the spatial dimensions (e.g. height and width in computer vision, frequency and time in audio, or sequence
length in NLP). The input is normalized using the aggregated statistics and transformed by channel-specific learnable scale and shift parameters \(\gamma\) and \(\beta\). Note the reparameterization of the scale parameter compared to the default PyTorch implementation.
If element_wise and frame_wise are both False, the statistics are aggregated over the channel dimension and all spatial dimensions as in [1], Figure 2. In this case, setting causal=False matches the global layer normalization in [2], while setting causal=True matches the cumulative layer normalization in [2]. The time dimension must be the last dimension of input tensors.
If element_wise is True, the statistics are aggregated over the channel dimension only as in [3]. I.e. each element (e.g. pixel in computer vision, time-frequency unit in audio, or token in NLP) is normalized independently.
If frame_wise is True, the statistics are aggregated over the channel dimension and all spatial dimensions except the time dimension. The time dimension must be the last dimension of input tensors.
1. Y. Wu and K. He, "Group normalization", ECCV, 2018.
2. Y. Luo and N. Mesgarani, "Conv-TasNet: Surpassing ideal time-frequency magnitude masking for speech separation", IEEE/ACM TASLP, 2019.
3. S. Shen, Z. Yao, A. Gholami, M. W. Mahoney, and K. Keutzer, "PowerNorm: Rethinking batch normalization in transformers", ICML, 2020.
__init__
__init__(
num_channels: int,
element_wise: bool = False,
frame_wise: bool = False,
causal: bool = False,
center: bool = True,
eps: float = 1e-05,
) -> None
Initialize the layer normalization module.
Parameters:
- num_channels (int) – Number of channels in input tensors.
- element_wise (bool, default: False) – If True, each element (e.g. pixel in computer vision, time-frequency unit in audio, or token in NLP) is normalized independently. Mutually exclusive with frame_wise and causal.
- frame_wise (bool, default: False) – If True, each time frame is normalized independently. The time dimension must be the last dimension of input tensors. Mutually exclusive with element_wise and causal.
- causal (bool, default: False) – If True, normalization statistics are cumulatively aggregated along the time dimension. The time dimension must be the last dimension of the input tensor. Mutually exclusive with element_wise and frame_wise.
- center (bool, default: True) – If False, the mean is not subtracted from the input, and the input is scaled using the root mean square (RMS) instead of the variance. The bias term \(\beta\) is also omitted.
- eps (float, default: 1e-05) – Small value for numerical stability.
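The causal (cumulative) aggregation can be illustrated on a single 1-D sequence, where the statistics at frame t use only frames 0..t. This toy sketch ignores the channel/spatial aggregation and the scale reparameterization described above:

```python
import math

def causal_norm(x: list[float], eps: float = 1e-5) -> list[float]:
    out, s, s2 = [], 0.0, 0.0
    for t, v in enumerate(x, start=1):
        # Running sums give the mean and variance over frames 0..t only.
        s += v
        s2 += v * v
        mean = s / t
        var = s2 / t - mean * mean
        out.append((v - mean) / math.sqrt(var + eps))
    return out
```

No future frame influences the output at frame t, which is what makes the normalization usable in causal (streaming) models.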
addse.layers.group_norm
group_norm(
x: Tensor,
num_groups: int,
weight: Tensor,
bias: Tensor | None,
eps: float,
causal: bool,
frame_wise: bool,
) -> torch.Tensor
Functional interface for group normalization.
See GroupNorm for details.
addse.lightning
addse.lightning.ADDSELightningModule
Bases: BaseLightningModule, ConfigureOptimizersMixin
ADDSE Lightning module.
__init__
__init__(
nac_cfg: str,
nac_ckpt: str,
model: ADDSERQDiT,
num_steps: int,
block_size: int,
optimizer: Callable[
[Iterator[Parameter]], Optimizer
] = Adam,
lr_scheduler: Mapping[str, Any] | None = None,
val_metrics: Mapping[str, BaseMetric] | None = None,
test_metrics: Mapping[str, BaseMetric] | None = None,
log_cfg: LogConfig | None = None,
debug_sample: tuple[int, int] | None = None,
) -> None
Initialize the ADDSE Lightning module.
forward
Enhance the input audio.
loss
Compute the \(\lambda\)-denoising cross-entropy loss.
Parameters:
- x_q (Tensor) – Noisy speech embeddings. Shape (batch_size, emb_channels, num_codebooks, seq_len).
- y_q (Tensor) – Clean speech embeddings. Shape (batch_size, emb_channels, num_codebooks, seq_len).
- y_tok (Tensor) – Clean speech tokens. Shape (batch_size, num_codebooks, seq_len).
Returns:
- Tensor – The \(\lambda\)-denoising cross-entropy loss.
addse.lightning.BaseLightningModule
Bases: LightningModule
Base class for Lightning modules.
log_debug_samples
log_debug_samples(
batch: tuple[Tensor, Tensor, Tensor],
batch_idx: int,
debug_samples: dict[str, Tensor],
) -> None
Log debug audio samples to W&B.
log_metrics
log_metrics(
loss: dict[str, Tensor],
metrics: dict[str, float],
stage: str,
on_step: bool,
on_epoch: bool,
) -> None
Log losses and metrics.
step
abstractmethod
step(
batch: tuple[Tensor, Tensor, Tensor],
stage: str,
batch_idx: int,
metrics: Mapping[str, BaseMetric] | None = None,
) -> tuple[
dict[str, Tensor], dict[str, float], dict[str, Tensor]
]
Training, validation, or test step.
Parameters:
-
batch(tuple[Tensor, Tensor, Tensor]) –A batch from the dataloader.
-
stage(str) –"train","val", or"test". -
batch_idx(int) –Index of the batch.
-
metrics(Mapping[str, BaseMetric] | None, default:None) –Metrics to compute.
Noneifstageis"train"or if no metrics are defined.
Returns:
test_step
training_step
validation_step
addse.lightning.ConfigureOptimizersMixin
Bases: LightningModule
Mixin for standard configuration of optimizer and learning rate scheduler.
configure_optimizers
Configure optimizers.
Returns:
- Any – Dictionary with the optimizer, learning rate scheduler, and learning rate scheduler configuration.
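The returned dictionary presumably follows the standard Lightning convention; a sketch with placeholder objects (not the addse code) would be:

```python
def build_optimizer_config(optimizer, scheduler, scheduler_cfg):
    # Lightning accepts {"optimizer": ..., "lr_scheduler": {"scheduler": ..., ...}}.
    out = {"optimizer": optimizer}
    if scheduler is not None:
        # scheduler_cfg carries keys such as "interval" or "frequency".
        out["lr_scheduler"] = {"scheduler": scheduler, **scheduler_cfg}
    return out
```

build_optimizer_config is a hypothetical helper named here for illustration only.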
addse.lightning.DataModule
Bases: LightningDataModule
Data module.
__init__
__init__(
train_dataset: Callable[[], Dataset],
train_dataloader: Callable[[Dataset], DataLoader],
val_dataset: Callable[[], Dataset] | None = None,
val_dataloader: Callable[[Dataset], DataLoader]
| None = None,
test_dataset: Callable[[], Dataset] | None = None,
test_dataloader: Callable[[Dataset], DataLoader]
| None = None,
) -> None
Initialize the data module.
Parameters:
- train_dataset (Callable[[], Dataset]) – Function to initialize the training dataset.
- val_dataset (Callable[[], Dataset] | None, default: None) – Function to initialize the validation dataset.
- test_dataset (Callable[[], Dataset] | None, default: None) – Function to initialize the test dataset.
- train_dataloader (Callable[[Dataset], DataLoader]) – Function to initialize the training dataloader.
- val_dataloader (Callable[[Dataset], DataLoader] | None, default: None) – Function to initialize the validation dataloader.
- test_dataloader (Callable[[Dataset], DataLoader] | None, default: None) – Function to initialize the test dataloader.
load_state_dict
Load the state dict of the data module.
setup
test_dataloader
Get the test dataloader.
Returns:
- DataLoader | list – The test dataloader, or an empty list if no test dataset was provided at initialization.
train_dataloader
val_dataloader
Get the validation dataloader.
Returns:
- DataLoader | list – The validation dataloader, or an empty list if no validation dataset was provided at initialization.
addse.lightning.EDMMixin
Bases: LightningModule
Mixin for training and sampling as in EDM.
denoiser
Compute the denoiser parametrization as in EDM.
addse.lightning.EDMNACSELightningModule
Bases: BaseLightningModule, ConfigureOptimizersMixin, EDMMixin
Lightning module for speech enhancement using NAC-domain EDM-style diffusion.
__init__
__init__(
nac_cfg: str,
nac_ckpt: str,
nac_domain: str,
nac_no_sum: bool,
nac_stack: bool,
model: ADDSERQDiT,
num_steps: int,
block_size: int,
norm_factor: float = 2.3,
sigma_data: float = 0.5,
p_mean: float = 0.0,
p_sigma: float = 1.0,
s_churn: float = 0.0,
s_min: float = 0.0,
s_max: float = float("inf"),
s_noise: float = 1.0,
sigma_min: float = 0.002,
sigma_max: float = 80.0,
rho: float = 7.0,
optimizer: Callable[
[Iterator[Parameter]], Optimizer
] = Adam,
lr_scheduler: Mapping[str, Any] | None = None,
val_metrics: Mapping[str, BaseMetric] | None = None,
test_metrics: Mapping[str, BaseMetric] | None = None,
log_cfg: LogConfig | None = None,
debug_sample: tuple[int, int] | None = None,
) -> None
Initialize the NAC-domain EDM-style Lightning module.
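The sigma_min, sigma_max, and rho parameters suggest the EDM noise schedule of Karras et al. (2022); a sketch of that schedule is shown below as an assumption, not code taken from the addse source:

```python
def edm_sigmas(num_steps: int, sigma_min: float = 0.002,
               sigma_max: float = 80.0, rho: float = 7.0) -> list[float]:
    # Interpolate between sigma_max and sigma_min in rho-warped space.
    inv = 1.0 / rho
    sigmas = [
        (sigma_max ** inv + i / (num_steps - 1) * (sigma_min ** inv - sigma_max ** inv)) ** rho
        for i in range(num_steps)
    ]
    return sigmas + [0.0]  # final step denoises to sigma = 0
```

Larger rho concentrates more steps near sigma_min, spending more sampler budget on low-noise refinement.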
addse.lightning.EDMSELightningModule
Bases: BaseLightningModule, ConfigureOptimizersMixin, EDMMixin
Lightning module for speech enhancement using STFT-domain EDM-style diffusion.
__init__
__init__(
model: ADM,
stft: STFT,
num_steps: int = 30,
sigma_data: float = 0.5,
p_mean: float = 0.0,
p_sigma: float = 1.0,
s_churn: float = 0.0,
s_min: float = 0.0,
s_max: float = float("inf"),
s_noise: float = 1.0,
sigma_min: float = 0.002,
sigma_max: float = 80.0,
rho: float = 7.0,
optimizer: Callable[
[Iterator[Parameter]], Optimizer
] = Adam,
lr_scheduler: Mapping[str, Any] | None = None,
val_metrics: Mapping[str, BaseMetric] | None = None,
test_metrics: Mapping[str, BaseMetric] | None = None,
log_cfg: LogConfig | None = None,
debug_sample: tuple[int, int] | None = None,
) -> None
Initialize the STFT-domain EDM-style Lightning module.
inverse_transform
Decompress and compute the inverse STFT.
addse.lightning.LightningModule
Bases: BaseLightningModule, ConfigureOptimizersMixin
Simple Lightning module for training models to directly predict clean speech given noisy speech.
__init__
__init__(
model: Module,
loss: BaseLoss,
optimizer: Callable[
[Iterator[Parameter]], Optimizer
] = Adam,
lr_scheduler: Mapping[str, Any] | None = None,
val_metrics: Mapping[str, BaseMetric] | None = None,
test_metrics: Mapping[str, BaseMetric] | None = None,
log_cfg: LogConfig | None = None,
debug_sample: tuple[int, int] | None = None,
) -> None
Initialize the simple Lightning module.
Parameters:
- model (Module) – Model to train.
- loss (BaseLoss) – Loss module.
- optimizer (Callable[[Iterator[Parameter]], Optimizer], default: Adam) – Optimizer constructor.
- lr_scheduler (Mapping[str, Any] | None, default: None) – Learning rate scheduler configuration.
- val_metrics (Mapping[str, BaseMetric] | None, default: None) – Metrics to compute during validation.
- test_metrics (Mapping[str, BaseMetric] | None, default: None) – Metrics to compute during testing.
- log_cfg (LogConfig | None, default: None) – Logging configuration.
- debug_sample (tuple[int, int] | None, default: None) – Tuple (batch_idx, sample_idx) to log debug audio samples to W&B during validation.
addse.lightning.LogConfig
dataclass
Configuration for logging losses and metrics.
addse.lightning.NACLightningModule
Bases: BaseLightningModule
Lightning module for neural audio codec.
__init__
__init__(
generator: NAC,
discriminator: Module | Iterable[Module],
reconstruction_loss: BaseLoss,
adversarial_loss_weight: float,
feature_loss_weight: float,
reconstruction_loss_weight: float,
codebook_loss_weight: float,
commitment_loss_weight: float,
generator_optimizer: Callable[
[Iterator[Parameter]], Optimizer
],
discriminator_optimizer: Callable[
[Iterator[Parameter]], Optimizer
],
generator_grad_clip: float = 0.0,
discriminator_grad_clip: float = 0.0,
val_metrics: Mapping[str, BaseMetric] | None = None,
test_metrics: Mapping[str, BaseMetric] | None = None,
log_cfg: LogConfig | None = None,
debug_sample: tuple[int, int] | None = None,
) -> None
Initialize the neural audio codec Lightning module.
configure_optimizers
discriminator_forward
Forward pass through all discriminators.
addse.lightning.NACSELightningModule
Bases: BaseLightningModule, ConfigureOptimizersMixin
Lightning module for speech enhancement using NAC-domain direct prediction.
__init__
__init__(
nac_cfg: str,
nac_ckpt: str,
nac_domain: str,
nac_no_sum: bool,
model: Module,
block_size: int,
optimizer: Callable[
[Iterator[Parameter]], Optimizer
] = Adam,
lr_scheduler: Mapping[str, Any] | None = None,
val_metrics: Mapping[str, BaseMetric] | None = None,
test_metrics: Mapping[str, BaseMetric] | None = None,
log_cfg: LogConfig | None = None,
debug_sample: tuple[int, int] | None = None,
) -> None
Initialize the NAC-domain Lightning module.
addse.lightning.SGMSELightningModule
Bases: BaseLightningModule, ConfigureOptimizersMixin
SGMSE Lightning module.
__init__
__init__(
model: SGMSEUNet,
stft: STFT,
num_steps: int = 30,
sigma_min: float = 0.05,
sigma_max: float = 0.5,
gamma: float = 1.5,
t_eps: float = 0.03,
corrector_snr: float = 0.5,
alpha: float = 0.5,
beta: float = 0.15,
optimizer: Callable[
[Iterator[Parameter]], Optimizer
] = Adam,
lr_scheduler: Mapping[str, Any] | None = None,
val_metrics: Mapping[str, BaseMetric] | None = None,
test_metrics: Mapping[str, BaseMetric] | None = None,
log_cfg: LogConfig | None = None,
debug_sample: tuple[int, int] | None = None,
) -> None
Initialize the SGMSE Lightning module.
inverse_transform
Decompress, descale, and compute the inverse STFT.
addse.lightning.compute_metrics
compute_metrics(
x: Tensor,
y: Tensor,
metrics: Mapping[str, BaseMetric] | None = None,
) -> dict[str, float]
Compute validation or test metrics.
Parameters:
- x (Tensor) – Signal to evaluate. Shape (batch_size, num_channels, num_samples).
- y (Tensor) – Reference signal for the metrics. Shape (batch_size, num_channels, num_samples).
- metrics (Mapping[str, BaseMetric] | None, default: None) – Metrics to compute.
Returns:
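The function presumably reduces to a loop over the metric mapping; a minimal sketch with plain callables standing in for BaseMetric instances:

```python
def compute_metrics_sketch(x, y, metrics=None):
    # With no metrics (None or empty mapping), nothing is computed.
    if not metrics:
        return {}
    # Each metric compares the evaluated signal x against the reference y.
    return {name: float(metric(x, y)) for name, metric in metrics.items()}
```

compute_metrics_sketch is a hypothetical stand-in; the real function operates on tensors and BaseMetric objects.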
addse.lightning.load_nac
Load a pretrained neural audio codec.
addse.losses
addse.losses.BaseLoss
Bases: Module
Base class for losses.
compute
abstractmethod
Compute the loss.
This method should not be called directly. Use forward instead.
forward
addse.losses.MSMelSpecLoss
Bases: MultiTaskLoss
Multi-scale mel-spectrogram loss.
__init__
__init__(
n_mels: int | Collection[int] = (
4,
8,
16,
32,
64,
128,
256,
),
frame_lengths: Collection[int] = (
31,
67,
127,
257,
509,
1021,
2053,
),
hop_lengths: Collection[int | None] | None = None,
n_ffts: Collection[int | None] | None = None,
weights: Collection[float] | None = None,
window: str = "flattop",
fs: int = 16000,
compression: float = 2.0,
log: bool = True,
power: float = 1.0,
eps: float = 1e-07,
mel_norm: Literal["slaney", "consistent"]
| None = "consistent",
stft_norm: bool = True,
) -> None
Initialize the multi-scale mel-spectrogram loss.
addse.losses.MelSpecLoss
Bases: BaseLoss
Mel-spectrogram loss.
__init__
__init__(
n_mels: int = 64,
frame_length: int = 512,
hop_length: int | None = None,
n_fft: int | None = None,
window: str = "flattop",
fs: int = 16000,
compression: float = 2.0,
log: bool = True,
power: float = 1.0,
eps: float = 1e-07,
mel_norm: Literal["slaney", "consistent"]
| None = "consistent",
stft_norm: bool = True,
) -> None
Initialize the mel-spectrogram loss.
addse.losses.MultiTaskLoss
Bases: BaseLoss
Multi-task loss.
addse.losses.SDRLoss
Bases: BaseLoss
Signal-to-distortion ratio (SDR) loss.
__init__
Initialize the SDR loss.
Parameters:
addse.metrics
addse.metrics.BaseMetric
Base class for metrics.
__call__
addse.metrics.DNSMOSMetric
Bases: BaseMetric
Deep noise suppression mean opinion score (DNSMOS) metric.
Calculated independently for each channel and averaged across channels.
__init__
addse.metrics.LPSMetric
Bases: BaseMetric
Levenshtein phoneme similarity (LPS).
Calculated independently for each channel and averaged across channels.
addse.metrics.MCDMetric
Bases: BaseMetric
Mel-cepstral distance (MCD) metric.
Calculated independently for each channel and averaged across channels.
addse.metrics.NISQAMetric
Bases: BaseMetric
Non-intrusive speech quality assessment (NISQA) metric.
Calculated independently for each channel and averaged across channels.
addse.metrics.PESQMetric
Bases: BaseMetric
Perceptual evaluation of speech quality (PESQ) metric.
Calculated independently for each channel and averaged across channels.
__init__
addse.metrics.SBSMetric
Bases: BaseMetric
SpeechBERTScore (SBS).
addse.metrics.SCOREQMetric
Bases: BaseMetric
Speech contrastive regression for quality assessment (SCOREQ).
Calculated independently for each channel and averaged across channels.
addse.metrics.SDRMetric
Bases: BaseMetric
Signal-to-distortion ratio (SDR) metric.
__init__
Initialize the SDR metric.
Parameters:
addse.metrics.STOIMetric
Bases: BaseMetric
Short-time objective intelligibility (STOI) metric.
Calculated independently for each channel and averaged across channels.
addse.metrics.UTMOSMetric
Bases: BaseMetric
UTokyo-SaruLab MOS prediction system (UTMOSv2).
Calculated independently for each channel and averaged across channels.
addse.models.addse
addse.models.addse.ADDSEDiT
Bases: Module
ADDSE DiT.
addse.models.addse.ADDSEDiTBlock
Bases: Module
ADDSE DiT block.
addse.models.addse.ADDSEEmbeddingBlock
Bases: Module
ADDSE noise embedding block with Fourier features.
addse.models.addse.ADDSERQDiT
Bases: Module
Residual Quantized Diffusion Transformer (RQDiT) backbone used in ADDSE.
__init__
__init__(
input_channels: int,
output_channels: int,
num_codebooks: int,
hidden_dim: int,
num_layers: int,
num_heads: int,
max_seq_len: int,
conditional: bool,
time_independent: bool,
) -> None
Initialize the ADDSE RQDiT backbone.
Parameters:
- input_channels (int) – Number of input channels.
- output_channels (int) – Number of output channels.
- num_codebooks (int) – Number of codebooks.
- hidden_dim (int) – Number of DiT hidden channels.
- num_layers (int) – Number of DiT layers.
- num_heads (int) – Number of DiT attention heads.
- max_seq_len (int) – Maximum sequence length.
- conditional (bool) – Whether the model is conditional.
- time_independent (bool) – Whether the model is time-independent.
forward
Forward pass.
Parameters:
- x (Tensor) – Diffused embeddings. Shape (batch_size, input_channels, num_codebooks, seq_len) or (batch_size, input_channels, seq_len).
- c (Tensor | None, default: None) – Conditioning embeddings. Same shape as x.
- t (Tensor | None, default: None) – Time step or noise level. Shape (batch_size,).
Returns:
- Tensor – Output tensor. Shape (batch_size, output_channels, num_codebooks, seq_len).
addse.models.addse.ADDSESelfAttentionBlock
Bases: Module
ADDSE self-attention block.
addse.models.adm
addse.models.adm.ADM
Bases: Module
ADM similar to configuration F in EDM2 paper.
__init__
__init__(
num_channels: int = 1,
base_channels: int = 96,
num_res_blocks: int = 3,
channel_mult: Sequence[int] = (1, 2, 3, 4),
attn_levels: Container[int] = (),
) -> None
Initialize ADM.
forward
Forward pass.
Parameters:
- y (Tensor) – Complex-valued diffused speech tensor. Shape (batch_size, num_channels, num_freqs, num_frames).
- x (Tensor) – Complex-valued noisy speech tensor. Shape (batch_size, num_channels, num_freqs, num_frames).
- t (Tensor) – Diffusion step or noise level. Shape (batch_size,).
Returns:
- Tensor – Complex-valued output score. Shape (batch_size, num_channels, num_freqs, num_frames).
addse.models.adm.ADMAttentionBlock
Bases: Module
ADM attention block.
addse.models.adm.ADMBlock
Bases: Module
ADM block.
addse.models.adm.ADMEmbeddingBlock
Bases: Module
ADM time step embedding block.
addse.models.adm.ADMResample
Bases: Module
ADM 2D resampling block.
addse.models.bsrnn
addse.models.bsrnn.BSRNN
Bases: Module
1. Y. Luo and J. Yu, "Music source separation with band-split RNN", IEEE/ACM TASLP, 2023.
2. J. Yu and Y. Luo, "Efficient monaural speech enhancement with universal sample rate band-split RNN", IEEE ICASSP, 2023.
3. J. Yu, H. Chen, Y. Luo, R. Gu, and C. Weng, "High fidelity speech enhancement with band-split RNN", INTERSPEECH, 2023.
__init__
__init__(
stft: STFT | None = None,
fs: int = 16000,
input_channels: int = 1,
output_channels: int = 1,
num_channels: int = 32,
num_layers: int = 6,
causal: bool = False,
subbands: Iterable[tuple[float, int]] = [
(100.0, 10),
(200.0, 10),
(500.0, 6),
(1000.0, 2),
],
residual: bool = False,
norm: Callable[[int], Module] | None = None,
) -> None
Initialize BSRNN.
Parameters:
- stft (STFT | None, default: None) – STFT module.
- fs (int, default: 16000) – Sampling rate.
- input_channels (int, default: 1) – Number of input channels.
- output_channels (int, default: 1) – Number of output channels.
- num_channels (int, default: 32) – Number of internal channels. Denoted as N in the paper.
- num_layers (int, default: 6) – Number of dual-path modelling layers.
- causal (bool, default: False) – Whether to use unidirectional RNNs along the time axis.
- subbands (Iterable[tuple[float, int]], default: [(100.0, 10), (200.0, 10), (500.0, 6), (1000.0, 2)]) – List of tuples (bandwidth, number), where bandwidth is the bandwidth of the subband in Hz and number is the number of subbands with that bandwidth.
- residual (bool, default: False) – Whether to predict a residual STFT in addition to the mask. The residual STFT is added after applying the mask to the input STFT.
- norm (Callable[[int], Module] | None, default: None) – Normalization module to use throughout the network. If None, defaults to LayerNorm with causal=causal. If a non-causal normalization module is provided, the network is not causal, even if causal=True.
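One plausible reading of the (bandwidth, number) tuples is that each bandwidth in Hz maps to a width in STFT bins given the sampling rate and FFT size. The helper below is an assumed interpretation for illustration; the addse implementation may partition bins differently:

```python
def subband_widths(subbands: list[tuple[float, int]], fs: int, n_fft: int) -> list[int]:
    # Frequency resolution of the STFT in Hz per bin.
    hz_per_bin = fs / n_fft
    widths = []
    for bandwidth, number in subbands:
        # Repeat each subband width `number` times, converted to bins.
        widths += [round(bandwidth / hz_per_bin)] * number
    return widths
```

With the default subbands at fs=16000, the low frequencies get many narrow subbands and the high frequencies a few wide ones.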
addse.models.bsrnn.BSRNNMLP
Bases: Module
Multi-Layer perceptron (MLP) used in BSRNN.
addse.models.bsrnn.BSRNNRNNBlock
Bases: Module
RNN block used in BSRNN.
addse.models.convtasnet
addse.models.convtasnet.ConvTasNet
Bases: Module
Conv-TasNet [1].
Consists of an encoder, a temporal convolutional network (TCN), and a decoder.
1. Y. Luo and N. Mesgarani, "Conv-TasNet: Surpassing ideal time-frequency magnitude masking for speech separation", IEEE/ACM TASLP, 2019.
__init__
__init__(
input_channels: int = 1,
output_channels: int = 1,
num_filters: int = 512,
filter_size: int = 32,
hop_size: int | None = None,
bottleneck_channels: int = 128,
hidden_channels: int = 512,
skip_channels: int = 128,
kernel_size: int = 3,
layers: int = 8,
repeats: int = 3,
causal: bool = False,
norm: Callable[[int], Module] | None = None,
) -> None
Initialize Conv-TasNet.
Parameters:
-
input_channels(int, default:1) –Number of input channels.
-
output_channels(int, default:1) –Number of output channels.
-
num_filters(int, default:512) –Number of filters in the encoder. Denoted as N in the paper.
-
filter_size(int, default:32) –Encoder filter length. Denoted as L in the paper.
-
hop_size(int | None, default:None) –Encoder hop size. If
None, defaults toencoder_kernel_size // 2. -
bottleneck_channels(int, default:128) –Number of bottleneck channels in the TCN. Denoted as B in the paper.
-
hidden_channels(int, default:512) –Number of hidden channels in the TCN. Denoted as H in the paper.
-
skip_channels(int, default:128) –Number of skip channels in the TCN. Denoted as Sc in the paper.
-
kernel_size(int, default:3) –Kernel size in the TCN. Denoted as P in the paper.
-
layers(int, default:8) –Number of layers in the TCN. Denoted as X in the paper.
-
repeats(int, default:3) –Number of repeats in the TCN. Denoted as R in the paper.
-
causal(bool, default:False) –Whether to use causal convolutions in the TCN.
-
norm(Callable[[int], Module] | None, default:None) –Normalization module to use in the TCN. If
None, defaults to LayerNorm with causal=causal. If a non-causal normalization module is provided, the TCN is not causal, even if causal=True.
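The kernel_size, layers, and repeats parameters determine the TCN's receptive field. A small sketch, assuming the standard Conv-TasNet dilation pattern where layer i of each repeat uses dilation 2**i (as in the paper):

```python
def tcn_receptive_field(kernel_size: int = 3, layers: int = 8, repeats: int = 3) -> int:
    """Receptive field of the TCN in encoder frames.

    Each dilated 1D conv block with dilation d grows the receptive field
    by (kernel_size - 1) * d; dilations double within each repeat.
    """
    rf = 1
    for _ in range(repeats):
        for i in range(layers):
            rf += (kernel_size - 1) * 2 ** i
    return rf

frames = tcn_receptive_field()  # 1 + 3 * 2 * (2**8 - 1) = 1531 frames
```

With the default filter_size=32 the encoder hop is 16 samples, so at 16 kHz the TCN sees roughly 1531 * 16 / 16000 ≈ 1.5 s of context.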
addse.models.convtasnet.ConvTasNetConv1DBlock
Bases: Module
1D convolutional block with PReLU activation and normalization used in Conv-TasNet.
addse.models.convtasnet.ConvTasNetTCN
Bases: Module
Temporal convolutional network (TCN) used in Conv-TasNet.
addse.models.mpd
addse.models.mpd.MPDiscriminator
Bases: Module
Multi-period discriminator.
addse.models.mpd.PDiscriminator
Bases: Module
Period discriminator.
addse.models.mpd.PDiscriminatorConv1d
Bases: Module
Period discriminator 1D convolutional layer.
addse.models.msstftd
addse.models.msstftd.MSSTFTDiscriminator
Bases: Module
Multi-scale short-time Fourier transform (MS-STFT) discriminator.
__init__
__init__(
frame_lengths: Collection[int] = (
127,
257,
509,
1021,
2053,
),
hop_lengths: Collection[int | None] | None = None,
n_ffts: Collection[int | None] | None = None,
window: str = "flattop",
in_channels: int = 1,
out_channels: int = 1,
num_channels: int = 32,
kernel_size: tuple[int, int] = (9, 3),
stride: tuple[int, int] = (2, 1),
dilations: Iterable[int] = (1, 2, 4),
) -> None
Initialize the MS-STFT discriminator.
addse.models.msstftd.STFTDiscriminator
Bases: Module
Short-time Fourier transform (STFT) discriminator.
__init__
__init__(
frame_length: int = 512,
hop_length: int | None = None,
n_fft: int | None = None,
window: str = "flattop",
in_channels: int = 1,
out_channels: int = 1,
num_channels: int = 32,
kernel_size: tuple[int, int] = (9, 3),
stride: tuple[int, int] = (2, 1),
dilations: Iterable[int] = (1, 2, 4),
) -> None
Initialize the STFT discriminator.
addse.models.msstftd.STFTDiscriminatorConv2d
Bases: Module
Short-time Fourier transform (STFT) discriminator 2D convolutional layer.
addse.models.nac
addse.models.nac.NAC
Bases: Module
Neural audio codec.
__init__
__init__(
in_channels: int = 1,
emb_channels: int = 1024,
base_channels: int = 32,
strides: list[int] = [2, 2, 4, 4, 5],
kernel_size: int = 3,
num_residual_units: int = 3,
dilation_base: int = 3,
encoder_in_kernel_size: int = 7,
encoder_out_kernel_size: int = 7,
decoder_in_kernel_size: int = 7,
decoder_out_kernel_size: int = 7,
codebook_channels: int | None = 8,
codebook_size: int = 1024,
num_codebooks: int = 4,
normalize: bool = True,
shared_codebook: bool = False,
) -> None
Initialize the neural audio codec.
Parameters:
-
in_channels(int, default:1) –Number of input channels.
-
emb_channels(int, default:1024) –Number of output and input channels for the encoder and decoder, respectively.
-
base_channels(int, default:32) –Number of base channels for the encoder and decoder.
-
strides(list[int], default:[2, 2, 4, 4, 5]) –Downsampling and upsampling factors for the encoder and decoder blocks, respectively.
-
kernel_size(int, default:3) –Kernel size for the residual units.
-
num_residual_units(int, default:3) –Number of residual units per encoder and decoder block.
-
dilation_base(int, default:3) –Dilation base for the residual units.
-
encoder_in_kernel_size(int, default:7) –Kernel size for the encoder input convolutional layer.
-
encoder_out_kernel_size(int, default:7) –Kernel size for the encoder output convolutional layer.
-
decoder_in_kernel_size(int, default:7) –Kernel size for the decoder input convolutional layer.
-
decoder_out_kernel_size(int, default:7) –Kernel size for the decoder output convolutional layer.
-
codebook_channels(int | None, default:8) –Number of channels for the codebook vectors. If
None, uses emb_channels. Otherwise, each quantizer uses input and output linear layers to map between emb_channels and codebook_channels. -
codebook_size(int, default:1024) –Number of vectors per codebook.
-
num_codebooks(int, default:4) –Number of codebooks.
-
normalize(bool, default:True) –Whether to normalize the embeddings and codebook vectors before codebook lookup.
-
shared_codebook(bool, default:False) –Whether to use the same codebook for all quantizers.
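The default strides and codebook settings fix the codec's frame rate and bitrate. A small arithmetic sketch (the 16 kHz sample rate is an assumption for illustration, not part of the model):

```python
import math

strides = [2, 2, 4, 4, 5]   # default encoder/decoder strides
codebook_size = 1024
num_codebooks = 4
fs = 16000                  # assumed sample rate

hop = math.prod(strides)                                   # total downsampling: 320 samples/frame
frame_rate = fs / hop                                      # 50 frames per second
bits_per_frame = num_codebooks * math.log2(codebook_size)  # 4 codebooks * 10 bits
bitrate = frame_rate * bits_per_frame                      # 2000 bits/s, i.e. 2 kbps
```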
decode
Decode input into audio.
Parameters:
-
x(Tensor) –Input tensor: - If
domain is "code": Shape (batch_size, num_codebooks, num_frames). - If domain is "x": Shape (batch_size, emb_channels, num_frames). - If domain is "q": Shape (batch_size, emb_channels, num_frames) if no_sum is False else (batch_size, emb_channels, num_codebooks, num_frames). - If domain is "x_proj": Shape (batch_size, codebook_channels, num_codebooks, num_frames). - If domain is "q_proj": Shape (batch_size, codebook_channels, num_codebooks, num_frames). -
no_sum(bool, default:False) –If
False, the input quantized embeddings are assumed to be summed across codebooks. Ignored if domain is not "q". -
domain(str, default:'code') –Domain of input tensor.
Returns:
-
Tensor–Decoded audio. Shape
(batch_size, in_channels, num_samples).
encode
Encode input audio into discrete codes.
Parameters:
-
x(Tensor) –Input audio. Shape
(batch_size, in_channels, num_samples). -
no_sum(bool, default:False) –If
True, the quantized embeddings are not summed across codebooks. Ignored if domain is not "q". -
domain(str, default:'q') –Which continuous output to return. One of: -
"x": Return the encoder output. - "q": Return the quantized embeddings. - "x_proj": Return the projected encoder output in codebook space. - "q_proj": Return the projected quantized embeddings in codebook space.
Returns:
-
Tensor–Tuple
(codes, continuous): -
Tensor–codes: Discrete codes. Shape (batch_size, num_codebooks, num_frames).
-
tuple[Tensor, Tensor]–continuous: Continuous output: - If
domain is "x": Shape (batch_size, emb_channels, num_frames). - If
domain is "q": Shape (batch_size, emb_channels, num_frames) if no_sum is False else (batch_size, emb_channels, num_codebooks, num_frames). - If
domain is "x_proj": Shape (batch_size, codebook_channels, num_codebooks, num_frames). - If
domain is "q_proj": Shape (batch_size, codebook_channels, num_codebooks, num_frames).
forward
Forward pass.
Parameters:
-
x(Tensor) –Input audio. Shape
(batch_size, in_channels, num_samples).
Returns:
-
Tensor–Tuple
(decoded, codes, codebook_loss, commit_loss), where decoded is the reconstructed audio with shape (batch_size, in_channels, num_samples), codes are the discrete codes with shape (batch_size, num_codebooks, num_frames), codebook_loss is the codebook loss, and commit_loss is the commitment loss.
addse.models.nac.NACConv1d
Bases: Module
Neural audio codec 1D convolutional layer.
addse.models.nac.NACConvTranspose1d
Bases: Module
Neural audio codec 1D transposed convolutional layer.
addse.models.nac.NACDecoder
Bases: Module
Neural audio codec decoder.
addse.models.nac.NACDecoderBlock
Bases: Module
Neural audio codec decoder block.
addse.models.nac.NACEncoder
Bases: Module
Neural audio codec encoder.
addse.models.nac.NACEncoderBlock
Bases: Module
Neural audio codec encoder block.
addse.models.nac.NACLSTMBlock
addse.models.nac.NACRVQVAE
Bases: Module
Neural audio codec residual vector quantizer.
__init__
__init__(
emb_channels: int,
codebook_size: int,
num_codebooks: int,
codebook_channels: int | None,
normalize: bool,
shared_codebook: bool,
) -> None
Initialize the neural audio codec residual vector quantizer.
decode
decode(
x: Tensor,
input_no_sum: bool = False,
output_no_sum: bool = False,
domain: str = "code",
) -> torch.Tensor
Decode input into quantized embeddings.
Parameters:
-
x(Tensor) –Input tensor: - If
domain is "code": Shape (batch_size, num_codebooks, num_frames). - If domain is "x": Shape (batch_size, emb_channels, num_frames). - If domain is "q": Shape (batch_size, emb_channels, num_frames) if input_no_sum is False else (batch_size, emb_channels, num_codebooks, num_frames). - If domain is "x_proj": Shape (batch_size, codebook_channels, num_codebooks, num_frames). - If domain is "q_proj": Shape (batch_size, codebook_channels, num_codebooks, num_frames). -
input_no_sum(bool, default:False) –If
False, the input quantized embeddings are assumed to be summed across codebooks. Ignored if domain is not "q". -
output_no_sum(bool, default:False) –If
True, the output quantized embeddings are not summed across codebooks. -
domain(str, default:'code') –Domain of input tensor.
Returns:
forward
forward(
x: Tensor, no_sum: bool = False
) -> tuple[
torch.Tensor,
torch.Tensor,
torch.Tensor,
torch.Tensor,
torch.Tensor,
torch.Tensor,
]
Assign discrete codes to continuous input embeddings.
Parameters:
-
x(Tensor) –Input continuous embeddings. Shape
(batch_size, emb_channels, num_frames). -
no_sum(bool, default:False) –If
True, the quantized embeddings are not summed across codebooks.
Returns:
-
Tensor–A tuple
(codes, quantized, codebook_loss, commit_loss, x_proj, quantized_proj): -
Tensor–codes: Assigned vector indices. Shape (batch_size, num_codebooks, num_frames).
-
Tensor–quantized: Quantized embeddings. Shape (batch_size, emb_channels, num_frames) if no_sum is False else (batch_size, emb_channels, num_codebooks, num_frames).
-
Tensor–codebook_loss: Codebook loss. 0-dimensional.
-
Tensor–commit_loss: Commitment loss. 0-dimensional.
-
Tensor–x_proj: Projected input embeddings. Shape (batch_size, codebook_channels, num_codebooks, num_frames).
-
tuple[Tensor, Tensor, Tensor, Tensor, Tensor, Tensor]–quantized_proj: Projected quantized embeddings. Shape (batch_size, codebook_channels, num_codebooks, num_frames).
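The residual quantization scheme can be sketched in NumPy as follows. This is an illustrative re-implementation under the usual residual-VQ definition, not the module's actual code; it mirrors the summed (no_sum=False) behaviour described above:

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Residual VQ sketch: each quantizer encodes the residual left by the previous ones.

    x: (emb_channels, num_frames); codebooks: list of (codebook_size, emb_channels) arrays.
    Returns (codes, quantized), codes with shape (num_codebooks, num_frames).
    """
    residual = x.copy()
    quantized = np.zeros_like(x)
    codes = []
    for cb in codebooks:
        # Nearest codebook vector per frame (Euclidean distance).
        dist = ((residual.T[:, None, :] - cb[None, :, :]) ** 2).sum(-1)
        idx = dist.argmin(1)
        q = cb[idx].T
        codes.append(idx)
        quantized += q   # summed across codebooks (no_sum=False behaviour)
        residual -= q    # the next quantizer sees what is left
    return np.stack(codes), quantized

rng = np.random.default_rng(0)
codes, quantized = rvq_encode(rng.standard_normal((8, 100)),
                              [rng.standard_normal((16, 8)) for _ in range(4)])
```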
addse.models.nac.NACResidualUnit
Bases: Module
Neural audio codec residual unit.
addse.models.nac.NACSnakeActivation
Bases: Module
Neural audio codec Snake activation function.
addse.models.nac.NACVQVAE
Bases: Module
Neural audio codec vector quantizer.
__init__
__init__(
emb_channels: int,
codebook_size: int,
codebook_channels: int | None,
normalize: bool,
codebook: Embedding | None,
) -> None
Initialize the neural audio codec vector quantizer.
decode
Decode input into quantized embeddings.
Parameters:
-
x(Tensor) –Input tensor: - Shape
(batch_size, num_frames) if domain is "code". - Shape (batch_size, emb_channels, num_frames) if domain is "x". - Shape (batch_size, emb_channels, num_frames) if domain is "q". - Shape (batch_size, codebook_channels, num_frames) if domain is "x_proj". - Shape (batch_size, codebook_channels, num_frames) if domain is "q_proj". -
domain(str, default:'code') –Domain of input tensor.
Returns:
-
Tensor–Decoded tensor. Shape
(batch_size, emb_channels, num_frames).
forward
forward(
x: Tensor,
) -> tuple[
torch.Tensor,
torch.Tensor,
torch.Tensor,
torch.Tensor,
torch.Tensor,
torch.Tensor,
]
Assign discrete codes to continuous input embeddings.
Parameters:
-
x(Tensor) –Input continuous embeddings. Shape
(batch_size, emb_channels, num_frames).
Returns:
-
Tensor–A tuple
(codes, quantized, codebook_loss, commit_loss, x_proj, quantized_proj): -
Tensor–codes: Assigned vector indices with shape (batch_size, num_frames).
-
Tensor–quantized: Quantized embeddings with shape (batch_size, emb_channels, num_frames).
-
Tensor–codebook_loss: Codebook loss. 0-dimensional.
-
Tensor–commit_loss: Commitment loss. 0-dimensional.
-
Tensor–x_proj: Projected input embeddings. Shape (batch_size, codebook_channels, num_frames).
-
tuple[Tensor, Tensor, Tensor, Tensor, Tensor, Tensor]–quantized_proj: Projected quantized embeddings. Shape (batch_size, codebook_channels, num_frames).
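The normalize=True option replaces the Euclidean nearest-neighbour search with a cosine one: embeddings and codebook vectors are L2-normalized before the lookup. A NumPy sketch (illustrative only, not the module's actual code):

```python
import numpy as np

def normalized_lookup(x, codebook):
    """Cosine-similarity codebook assignment, as used when normalize=True.

    x: (emb_channels, num_frames); codebook: (codebook_size, emb_channels).
    Returns assigned vector indices with shape (num_frames,).
    """
    xn = x / np.linalg.norm(x, axis=0, keepdims=True)           # unit-norm frames
    cn = codebook / np.linalg.norm(codebook, axis=1, keepdims=True)  # unit-norm vectors
    sim = cn @ xn   # (codebook_size, num_frames) cosine similarities
    return sim.argmax(0)

# Toy example: each frame is a scaled basis vector, so the lookup picks that axis.
codes = normalized_lookup(
    np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 0.0], [0.0, 3.0]]), np.eye(4)
)
```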
addse.models.sgmse
addse.models.sgmse.SGMSEAttentionBlock
Bases: Module
SGMSE attention block.
addse.models.sgmse.SGMSEEmbeddingBlock
Bases: Module
SGMSE time step embedding block with Gaussian Fourier projection and MLP.
addse.models.sgmse.SGMSEResample
Bases: Module
SGMSE 2D resampling block.
addse.models.sgmse.SGMSEUNet
Bases: Module
NCSN++ backbone used in SGMSE.
__init__
__init__(
num_channels: int = 1,
base_channels: int = 128,
num_res_blocks: int = 2,
channel_mult: Sequence[int] = (1, 1, 2, 2, 2, 2, 2),
attn_levels: Container[int] = (4,),
) -> None
Initialize the SGMSE NCSN++ backbone.
Parameters:
-
num_channels(int, default:1) –Number of input channels.
-
base_channels(int, default:128) –Base number of channels.
-
num_res_blocks(int, default:2) –Number of residual blocks per level.
-
channel_mult(Sequence[int], default:(1, 1, 2, 2, 2, 2, 2)) –Channel multiplier for each level.
-
attn_levels(Container[int], default:(4,)) –Indices of levels at which to apply attention.
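The channel_mult sequence scales base_channels at each resolution level. With the defaults:

```python
base_channels = 128
channel_mult = (1, 1, 2, 2, 2, 2, 2)

# Channel count at each of the 7 resolution levels of the NCSN++ backbone:
channels = [base_channels * m for m in channel_mult]
# -> [128, 128, 256, 256, 256, 256, 256]; attention is applied at level 4 only.
```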
forward
Forward pass.
Parameters:
-
x(Tensor) –Complex-valued noisy speech tensor. Shape
(batch_size, num_channels, num_freqs, num_frames). -
y(Tensor) –Complex-valued diffused speech tensor. Shape
(batch_size, num_channels, num_freqs, num_frames). -
t(Tensor) –Diffusion step or noise level. Shape
(batch_size,).
Returns:
-
Tensor–Complex-valued output score. Shape
(batch_size, num_channels, num_freqs, num_frames).
addse.models.sgmse.SGMSEUNetBlock
Bases: Module
SGMSE UNet block.
addse.stft
addse.stft.STFT
Bases: Module
Short-time Fourier transform (STFT) module.
__init__
__init__(
frame_length: int = 512,
hop_length: int | None = None,
n_fft: int | None = None,
window: str = "hann",
norm: bool = False,
) -> None
Initialize the STFT module.
Parameters:
-
frame_length(int, default:512) –Frame length.
-
hop_length(int | None, default:None) –Hop length. If
None, defaults to frame_length // 2. -
n_fft(int | None, default:None) –FFT size. If
None, defaults to frame_length. -
window(str, default:'hann') –Window type. Passed to scipy.signal.get_window.
-
norm(bool, default:False) –Whether to normalize the window by the square root of its sum of squares.
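The norm option scales the analysis window to unit energy. A NumPy sketch of that scaling (np.hanning is a symmetric Hann window, which may differ from the periodic window scipy.signal.get_window returns by default):

```python
import numpy as np

frame_length = 512
w = np.hanning(frame_length)            # Hann window, as with window="hann"
norm_w = w / np.sqrt((w ** 2).sum())    # norm=True: divide by sqrt of the sum of squares

# The normalized window has unit sum of squares:
energy = float((norm_w ** 2).sum())     # ≈ 1.0
```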
forward
inverse
addse.utils
addse.utils.build_subbands
build_subbands(
n_fft: int,
fs: int,
subbands: Iterable[tuple[float, int]],
) -> list[tuple[int, int]]
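A plausible sketch of how (bandwidth, number) pairs might map to contiguous FFT-bin ranges, matching the signature above; the actual rounding and remainder handling in build_subbands may differ:

```python
def build_subbands_sketch(n_fft, fs, subbands):
    """Map (bandwidth_hz, count) pairs to contiguous (start, end) FFT-bin ranges.

    Hypothetical re-implementation: bandwidths are converted to a bin count
    via the frequency resolution fs / n_fft.
    """
    hz_per_bin = fs / n_fft
    ranges, start = [], 0
    for bandwidth, number in subbands:
        width = max(1, round(bandwidth / hz_per_bin))  # bins per subband
        for _ in range(number):
            ranges.append((start, start + width))
            start += width
    return ranges

# BSRNN defaults: 10x100 Hz + 10x200 Hz + 6x500 Hz + 2x1000 Hz = 8 kHz,
# i.e. the full band at fs=16000.
bands = build_subbands_sketch(512, 16000, [(100.0, 10), (200.0, 10), (500.0, 6), (1000.0, 2)])
```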
addse.utils.bytes_str_to_int
addse.utils.dynamic_range
addse.utils.flatten_dict
addse.utils.hz_to_mel
Convert frequency in Hz to mel scale.
Parameters:
-
hz(float) –Frequency in Hz.
-
scale(str, default:'slaney') –Mel scale to use.
"htk"matches the Hidden Markov Toolkit, while"slaney"matches the Auditory Toolbox by Slaney. The"slaney"scale is linear below 1 kHz and logarithmic above 1 kHz.
Returns:
-
float–Frequency in mel scale.
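Both scales follow the standard formulas (HTK: 2595 log10(1 + hz/700); Slaney: linear below 1 kHz, with 1000 Hz mapping to 15 mel, then logarithmic, reaching 6400 Hz 27 steps above the break). A scalar sketch of the conversion:

```python
import math

def hz_to_mel(hz: float, scale: str = "slaney") -> float:
    """Standard HTK / Slaney mel conversions (scalar sketch)."""
    if scale == "htk":
        return 2595.0 * math.log10(1.0 + hz / 700.0)
    # "slaney": linear below 1 kHz at 200/3 Hz per mel, logarithmic above.
    f_sp = 200.0 / 3.0
    min_log_hz = 1000.0
    min_log_mel = min_log_hz / f_sp          # 15.0 mel at the 1 kHz break
    logstep = math.log(6.4) / 27.0           # 27 mel steps from 1 kHz to 6.4 kHz
    if hz < min_log_hz:
        return hz / f_sp
    return min_log_mel + math.log(hz / min_log_hz) / logstep
```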
addse.utils.load_hydra_config
Load a Hydra configuration file.
addse.utils.load_model
load_model(
config_path: str,
model_name: str | None = None,
logs_dir: str = "logs",
ckpt_name: str = "last.ckpt",
ckpt_path: str | None = None,
state_key: str | None = "state_dict",
prepend_key: str | None = None,
device: device | str | None = None,
strict: bool = True,
) -> L.LightningModule
Load a model.
addse.utils.mel_filters
mel_filters(
n_filters: int = 64,
n_fft: int = 512,
f_min: float = 0.0,
f_max: float | None = None,
fs: float = 16000,
scale: str = "slaney",
norm: Literal["slaney", "consistent"]
| None = "consistent",
dtype: dtype = torch.float32,
) -> tuple[torch.Tensor, torch.Tensor]
Get mel filters.
Parameters:
-
n_filters(int, default:64) –Number of filters.
-
n_fft(int, default:512) –Number of FFT points.
-
f_min(float, default:0.0) –Minimum frequency.
-
f_max(float | None, default:None) –Maximum frequency. If
None, uses fs / 2. -
fs(float, default:16000) –Sampling frequency.
-
scale(str, default:'slaney') –Mel scale to use.
"htk"matches the Hidden Markov Toolkit, while"slaney"matches the Auditory Toolbox by Slaney. The"slaney"scale is linear below 1 kHz and logarithmic above 1 kHz. -
norm(Literal['slaney', 'consistent'] | None, default:'consistent') –Filter normalization method. If
"slaney", the filters are normalized by their width in Hz. However, this makes the filter response scale with the frequency resolution n_fft / fs. If "consistent", the frequency resolution is factored in. If None, no normalization is applied. -
dtype(dtype, default:float32) –Data type to cast the filters to.
Returns:
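The filterbank consists of triangular filters whose edges are equally spaced on the mel axis. An unnormalized NumPy sketch using the slaney scale (illustrative, not the actual implementation; the norm options described above would then rescale each triangle):

```python
import numpy as np

def hz_to_mel_slaney(hz):
    hz = np.asarray(hz, dtype=float)
    mel = hz * 3.0 / 200.0                      # linear region below 1 kHz
    log_region = hz >= 1000.0
    safe = np.maximum(hz, 1e-9)                 # avoid log(0) in the masked branch
    return np.where(log_region, 15.0 + np.log(safe / 1000.0) * 27.0 / np.log(6.4), mel)

def mel_to_hz_slaney(mel):
    mel = np.asarray(mel, dtype=float)
    hz = mel * 200.0 / 3.0
    return np.where(mel >= 15.0, 1000.0 * np.exp(np.log(6.4) / 27.0 * (mel - 15.0)), hz)

def mel_filters_sketch(n_filters=64, n_fft=512, f_min=0.0, f_max=None, fs=16000.0):
    """Unnormalized triangular mel filterbank: shape (n_filters, n_fft // 2 + 1)."""
    f_max = fs / 2 if f_max is None else f_max
    # n_filters + 2 equally spaced mel points give the triangle edges in Hz.
    edges = mel_to_hz_slaney(
        np.linspace(hz_to_mel_slaney(f_min), hz_to_mel_slaney(f_max), n_filters + 2)
    )
    freqs = np.linspace(0.0, fs / 2, n_fft // 2 + 1)  # FFT bin centre frequencies
    fb = np.zeros((n_filters, freqs.size))
    for i in range(n_filters):
        lo, ctr, hi = edges[i], edges[i + 1], edges[i + 2]
        up = (freqs - lo) / (ctr - lo)
        down = (hi - freqs) / (hi - ctr)
        fb[i] = np.maximum(0.0, np.minimum(up, down))  # rising/falling triangle slopes
    return fb

fb = mel_filters_sketch()
```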
addse.utils.mel_to_hz
Convert frequency in mel scale to Hz.
Parameters:
-
mel(Tensor) –Frequency in mel scale.
-
scale(str, default:'slaney') –Mel scale to use.
"htk"matches the Hidden Markov Toolkit, while"slaney"matches the Auditory Toolbox by Slaney. The"slaney"scale is linear below 1 kHz and logarithmic above 1 kHz.
Returns:
-
Tensor–Frequency in Hz.
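The inverse mapping follows the same standard formulas, undoing the linear/log split at the 1 kHz break (15 mel on the slaney scale). A scalar sketch:

```python
import math

def mel_to_hz(mel: float, scale: str = "slaney") -> float:
    """Standard HTK / Slaney inverse mel conversions (scalar sketch)."""
    if scale == "htk":
        return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)
    # "slaney": linear below 15 mel (1 kHz), exponential above.
    f_sp = 200.0 / 3.0
    min_log_mel = 15.0                       # mel value at the 1 kHz break
    logstep = math.log(6.4) / 27.0
    if mel < min_log_mel:
        return mel * f_sp
    return 1000.0 * math.exp(logstep * (mel - min_log_mel))
```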
addse.utils.scan_files
addse.utils.seed_all
addse.utils.segment_audio_file
segment_audio_file(
path: str,
format: str = "ogg",
subtype: str | None = None,
seglen: float | None = None,
base: str | None = None,
) -> Iterator[tuple[bytes, str]]
Read and segment an audio file and yield bytes and a name for each segment.
Parameters:
-
path(str) –Path to the input audio file.
-
format(str, default:'ogg') –Audio format to convert to. See soundfile.write.
-
subtype(str | None, default:None) –Audio subtype to convert to. See soundfile.write.
-
seglen(float | None, default:None) –Segment length in seconds. If provided, the file is segmented into chunks of approximately this length. -
-
base(str | None, default:None) –Base path to strip from the file path.
Yields: