Mirror of https://github.com/malarinv/tacotron2, synced 2026-03-08 09:42:34 +00:00

44 Commits

Author SHA1 Message Date
rafaelvalle
8128e0e98f using waveglow, pillow, text changes 2018-11-26 16:42:07 -08:00
rafaelvalle
e0c7e0e928 adding changes to text 2018-11-26 16:41:21 -08:00
rafaelvalle
f03692c73d cleanup, new model and waveglow 2018-11-26 16:37:44 -08:00
rafaelvalle
7eb045206c README.md: updating readme to explicitly mention that mel representation of WaveNet and Tacotron2 must be the same 2018-06-14 11:25:42 -07:00
rafaelvalle
c67005f1be Dockerfile: adding jupyter to dockerfile 2018-06-14 10:30:01 -07:00
rafaelvalle
cb3794796f Dockerfile: removing return from Dockerfile 2018-06-12 21:38:03 -07:00
Rafael Valle
a8de973923 Merge pull request #37 from yoks/master
`used_saved_learning_rate`  fix
2018-06-12 09:00:14 -07:00
yoks
a0ae2da05f used_saved_learning_rate fix
`used_saved_learning_rate` name change to `use_saved_learning_rate`
2018-06-12 16:53:12 +03:00
rafaelvalle
34066ac4fc requirements.txt: setting torch to 0.4.0 2018-06-11 08:16:31 -07:00
rafaelvalle
12ab5ba89c model.py: setting weight initialization to xavier uniform 2018-06-07 20:28:52 -07:00
rafaelvalle
d10da5f41e hparams.py: commenting n_frames_per_step to indicate that currently only 1 frame per step is supported now 2018-06-07 13:02:23 -07:00
rafaelvalle
5f0ea06c41 hparams.py: adding use saved learning rate param 2018-06-05 08:12:49 -07:00
rafaelvalle
22bcff1155 hparams.py: adding use saved learning rate param 2018-06-05 08:12:35 -07:00
rafaelvalle
2e934478a4 README.md: being explicit about action 2018-06-04 17:12:32 -07:00
rafaelvalle
8ae231b10b README.md: more explicit about demo audio 2018-06-04 16:56:22 -07:00
rafaelvalle
4d733d1bdd README.md: including demo.wav in readme 2018-06-04 16:55:17 -07:00
rafaelvalle
b4e52404ab adding demo.wav file 2018-06-04 16:46:36 -07:00
Rafael Valle
064629c9bc Merge pull request #23 from NVIDIA/attention_full_mel
model.py: attending to full mel instead of prenet and dropout mel
2018-05-20 12:25:54 -07:00
Rafael Valle
d5b64729d1 model.py: moving for better readibility 2018-05-20 12:22:06 -07:00
Rafael Valle
977cb37cea model.py: attending to full mel instead of prenet and dropout mel 2018-05-18 06:59:09 -07:00
Rafael Valle
da30fd8709 Merge pull request #20 from NVIDIA/fp16_path
Fp16 patch, not path!
2018-05-15 09:55:19 -07:00
Rafael Valle
27b1767cb2 train.py: fixing typo 2018-05-15 09:53:33 -07:00
Rafael Valle
817cd403d4 Merge branch 'master' of https://github.com/NVIDIA/tacotron2 into load_mel_from_disk 2018-05-15 09:51:41 -07:00
Rafael Valle
1071023017 train.py: patching score_mask_value formerly inf, not concrete value, for compatibility with pytorch 2018-05-15 09:50:56 -07:00
Rafael Valle
cd851585cb loss_scaler.py: patching loss scaler for compatibility with current pytorch 2018-05-15 09:50:08 -07:00
Rafael Valle
bd42cb6ed7 Merge pull request #19 from NVIDIA/load_mel_from_disk
Load mel from disk
2018-05-15 08:54:24 -07:00
Rafael Valle
2da7a2ebab README.md: describing how to load mel from disk 2018-05-15 08:50:21 -07:00
Rafael Valle
62d2c8b957 data_utils.py: adding support for loading mel from disk 2018-05-15 08:42:06 -07:00
Rafael Valle
2d41ea0682 hparams.py: adding load_mel_from_disk params 2018-05-15 08:41:03 -07:00
Rafael Valle
20568643cb Merge branch 'master' of https://github.com/NVIDIA/tacotron2 2018-05-06 08:58:07 -07:00
Rafael Valle
dcd925f6c8 model.py: mixed squeeze target. fixing 2018-05-06 08:58:01 -07:00
Raul Puri
4ac6ce9ab5 ipynb typo 2018-05-05 17:30:08 -07:00
Raul Puri
c67ca6531e force single gpu in inference.ipynb 2018-05-05 17:29:09 -07:00
Raul Puri
78d5150d83 inference (distributed) dataparallel patch
removing the '.module' that comes from (distibuted)dataparallel state dict
2018-05-05 17:23:11 -07:00
Rafael Valle
424b2f5bf0 README.md: fixing typo as pointed out by @syoyo 2018-05-04 23:06:58 -07:00
Rafael Valle
a38429e629 Merge pull request #6 from NVIDIA/padding-patch-0.4
integer maxlen for padding
2018-05-04 13:24:57 -07:00
Raul Puri
b20765a3dc 0.4 scalar tensor padding update 2018-05-04 12:12:08 -07:00
Raul Puri
2a394f4aaa integer maxlen for padding 2018-05-04 11:11:14 -07:00
Rafael Valle
2c545ac800 Merge pull request #4 from NVIDIA/mask-utils-0.4
mask utils update for 0.4 cuda
2018-05-04 11:02:30 -07:00
Raul Puri
6fbba8ef0f mask utils update for 0.4 cuda 2018-05-04 10:14:30 -07:00
Rafael Valle
c141726a96 requirements.txt: updating tensorflow requirements 2018-05-04 09:44:14 -07:00
Rafael Valle
535042a584 README.md: updating readme to include docker setup 2018-05-04 09:42:11 -07:00
Rafael Valle
a72160b8cb Dockerfile: adding dockerfile 2018-05-04 09:39:34 -07:00
Rafael Valle
d750fcf395 Merge pull request #2 from NVIDIA/single-gpu-and-0.4
train.py single gpu and 0.4 update
2018-05-04 09:12:13 -07:00
18 changed files with 248 additions and 176 deletions

.gitmodules (vendored, new file, 3 lines added)

@@ -0,0 +1,3 @@
[submodule "waveglow"]
path = waveglow
url = https://github.com/NVIDIA/waveglow

Dockerfile (new file, 2 lines added)

@@ -0,0 +1,2 @@
FROM pytorch/pytorch:0.4_cuda9_cudnn7
RUN pip install numpy scipy matplotlib librosa==0.6.0 tensorflow tensorboardX inflect==0.2.5 Unidecode==1.0.22 jupyter

README.md

@@ -11,6 +11,9 @@ Distributed and FP16 support relies on work by Christian Sarofeen and NVIDIA's
![Alignment, Predicted Mel Spectrogram, Target Mel Spectrogram](tensorboard.png)
[Download demo audio](https://github.com/NVIDIA/tacotron2/blob/master/demo.wav) trained on LJS and using Ryuichi Yamamoto's [pre-trained Mixture of Logistics
wavenet](https://github.com/r9y9/wavenet_vocoder/)
"Scientists at the CERN laboratory say they have discovered a new particle."
## Pre-requisites
1. NVIDIA GPU + CUDA cuDNN
@@ -20,11 +23,12 @@ Distributed and FP16 support relies on work by Christian Sarofeen and NVIDIA's
2. Clone this repo: `git clone https://github.com/NVIDIA/tacotron2.git`
3. CD into this repo: `cd tacotron2`
4. Update .wav paths: `sed -i -- 's,DUMMY,ljs_dataset_folder/wavs,g' filelists/*.txt`
- Alternatively, set `load_mel_from_disk=True` in `hparams.py` and update mel-spectrogram paths
5. Install [pytorch 0.4](https://github.com/pytorch/pytorch)
6. Install python requirements or use docker container (tbd)
- Install python requirements: `pip install requirements.txt`
6. Install python requirements or build docker image
- Install python requirements: `pip install -r requirements.txt`
- **OR**
- Docker container `(tbd)`
- Build docker image: `docker build --tag tacotron2 .`
## Training
1. `python train.py --output_directory=outdir --log_directory=logdir`
@@ -34,9 +38,13 @@ Distributed and FP16 support relies on work by Christian Sarofeen and NVIDIA's
1. `python -m multiproc train.py --output_directory=outdir --log_directory=logdir --hparams=distributed_run=True,fp16_run=True`
## Inference
When performing Mel-Spectrogram to Audio synthesis with a WaveNet model, make sure Tacotron 2 and WaveNet were trained on the same mel-spectrogram representation. Follow these steps to use a simple inference pipeline using Griffin-Lim:
1. `jupyter notebook --ip=127.0.0.1 --port=31337`
2. load inference.ipynb
## Related repos
[nv-wavenet](https://github.com/NVIDIA/nv-wavenet/): Faster than real-time
wavenet inference
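When `load_mel_from_disk=True` is set, the filelists must point at precomputed mel arrays rather than wavs. A minimal sketch of precomputing a mel the loader could read back; the filename is hypothetical, and the `(n_mel_channels, n_frames)` shape convention follows the assert added to `data_utils.py` below:

```python
import os
import tempfile

import numpy as np

n_mel_channels = 80  # must match hparams.n_mel_channels

# Hypothetical mel; a real one would come from layers.TacotronSTFT.
mel = np.random.randn(n_mel_channels, 123).astype(np.float32)

path = os.path.join(tempfile.gettempdir(), "LJ001-0001.npy")
np.save(path, mel)

# With load_mel_from_disk=True, TextMelLoader.get_mel() then does, in essence:
melspec = np.load(path)
assert melspec.shape[0] == n_mel_channels, "Mel dimension mismatch"
```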

data_utils.py

@@ -1,4 +1,5 @@
import random
import numpy as np
import torch
import torch.utils.data
@@ -13,19 +14,18 @@ class TextMelLoader(torch.utils.data.Dataset):
2) normalizes text and converts them to sequences of one-hot vectors
3) computes mel-spectrograms from audio files.
"""
def __init__(self, audiopaths_and_text, hparams, shuffle=True):
self.audiopaths_and_text = load_filepaths_and_text(
audiopaths_and_text, hparams.sort_by_length)
def __init__(self, audiopaths_and_text, hparams):
self.audiopaths_and_text = load_filepaths_and_text(audiopaths_and_text)
self.text_cleaners = hparams.text_cleaners
self.max_wav_value = hparams.max_wav_value
self.sampling_rate = hparams.sampling_rate
self.load_mel_from_disk = hparams.load_mel_from_disk
self.stft = layers.TacotronSTFT(
hparams.filter_length, hparams.hop_length, hparams.win_length,
hparams.n_mel_channels, hparams.sampling_rate, hparams.mel_fmin,
hparams.mel_fmax)
random.seed(1234)
if shuffle:
random.shuffle(self.audiopaths_and_text)
random.shuffle(self.audiopaths_and_text)
def get_mel_text_pair(self, audiopath_and_text):
# separate filename and text
@@ -35,12 +35,22 @@ class TextMelLoader(torch.utils.data.Dataset):
return (text, mel)
def get_mel(self, filename):
audio = load_wav_to_torch(filename, self.sampling_rate)
audio_norm = audio / self.max_wav_value
audio_norm = audio_norm.unsqueeze(0)
audio_norm = torch.autograd.Variable(audio_norm, requires_grad=False)
melspec = self.stft.mel_spectrogram(audio_norm)
melspec = torch.squeeze(melspec, 0)
if not self.load_mel_from_disk:
audio, sampling_rate = load_wav_to_torch(filename)
if sampling_rate != self.stft.sampling_rate:
raise ValueError("{} SR doesn't match target {} SR".format(
sampling_rate, self.stft.sampling_rate))
audio_norm = audio / self.max_wav_value
audio_norm = audio_norm.unsqueeze(0)
audio_norm = torch.autograd.Variable(audio_norm, requires_grad=False)
melspec = self.stft.mel_spectrogram(audio_norm)
melspec = torch.squeeze(melspec, 0)
else:
melspec = torch.from_numpy(np.load(filename))
assert melspec.size(0) == self.stft.n_mel_channels, (
'Mel dimension mismatch: given {}, expected {}'.format(
melspec.size(0), self.stft.n_mel_channels))
return melspec
def get_text(self, text):
@@ -78,9 +88,9 @@ class TextMelCollate():
text = batch[ids_sorted_decreasing[i]][0]
text_padded[i, :text.size(0)] = text
# Right zero-pad mel-spec with extra single zero vector to mark the end
# Right zero-pad mel-spec
num_mels = batch[0][1].size(0)
max_target_len = max([x[1].size(1) for x in batch]) + 1
max_target_len = max([x[1].size(1) for x in batch])
if max_target_len % self.n_frames_per_step != 0:
max_target_len += self.n_frames_per_step - max_target_len % self.n_frames_per_step
assert max_target_len % self.n_frames_per_step == 0
@@ -94,7 +104,7 @@ class TextMelCollate():
for i in range(len(ids_sorted_decreasing)):
mel = batch[ids_sorted_decreasing[i]][1]
mel_padded[i, :, :mel.size(1)] = mel
gate_padded[i, mel.size(1):] = 1
gate_padded[i, mel.size(1)-1:] = 1
output_lengths[i] = mel.size(1)
return text_padded, input_lengths, mel_padded, gate_padded, \
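The collate changes above drop the extra end-of-sequence frame and instead round the padded length up to a multiple of `n_frames_per_step`, with the gate target set to 1 from the last real frame onward (note the `-1`). The length arithmetic can be sketched in plain Python:

```python
def padded_target_len(frame_lens, n_frames_per_step=1):
    """Round the longest mel length up to a multiple of n_frames_per_step,
    mirroring TextMelCollate.__call__ after this change."""
    max_target_len = max(frame_lens)
    if max_target_len % n_frames_per_step != 0:
        max_target_len += n_frames_per_step - max_target_len % n_frames_per_step
    return max_target_len

def gate_target(frame_len, max_target_len):
    """1.0 from the final real frame on, 0.0 before it
    (gate_padded[i, mel.size(1)-1:] = 1 in the diff)."""
    return [0.0] * (frame_len - 1) + [1.0] * (max_target_len - frame_len + 1)
```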

demo.wav (new executable file; binary file not shown)

distributed.py

@@ -118,3 +118,55 @@ class DistributedDataParallel(Module):
super(DistributedDataParallel, self).train(mode)
self.module.train(mode)
'''
'''
Modifies existing model to do gradient allreduce, but doesn't change class
so you don't need "module"
'''
def apply_gradient_allreduce(module):
if not hasattr(dist, '_backend'):
module.warn_on_half = True
else:
module.warn_on_half = True if dist._backend == dist.dist_backend.GLOO else False
for p in module.state_dict().values():
if not torch.is_tensor(p):
continue
dist.broadcast(p, 0)
def allreduce_params():
if(module.needs_reduction):
module.needs_reduction = False
buckets = {}
for param in module.parameters():
if param.requires_grad and param.grad is not None:
tp = type(param.data)
if tp not in buckets:
buckets[tp] = []
buckets[tp].append(param)
if module.warn_on_half:
if torch.cuda.HalfTensor in buckets:
print("WARNING: gloo dist backend for half parameters may be extremely slow." +
" It is recommended to use the NCCL backend in this case. This currently requires " +
"PyTorch built from top of tree master.")
module.warn_on_half = False
for tp in buckets:
bucket = buckets[tp]
grads = [param.grad.data for param in bucket]
coalesced = _flatten_dense_tensors(grads)
dist.all_reduce(coalesced)
coalesced /= dist.get_world_size()
for buf, synced in zip(grads, _unflatten_dense_tensors(coalesced, grads)):
buf.copy_(synced)
for param in list(module.parameters()):
def allreduce_hook(*unused):
param._execution_engine.queue_callback(allreduce_params)
if param.requires_grad:
param.register_hook(allreduce_hook)
def set_needs_reduction(self, input, output):
self.needs_reduction = True
module.register_forward_hook(set_needs_reduction)
return module
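`apply_gradient_allreduce` buckets gradients by tensor type, flattens each bucket into one contiguous buffer, all-reduces the buffer, divides by world size, and copies the averaged values back. The flatten/average/unflatten step can be sketched with numpy standing in for `_flatten_dense_tensors` and `dist.all_reduce` (an illustrative sketch, not the torch API):

```python
import numpy as np

def average_grads(grad_lists):
    """grad_lists: one list of gradient arrays per worker.
    Returns the averaged gradients, shaped like worker 0's."""
    world_size = len(grad_lists)
    shapes = [g.shape for g in grad_lists[0]]
    # flatten each worker's grads into one contiguous buffer
    flat = [np.concatenate([g.ravel() for g in grads]) for grads in grad_lists]
    coalesced = sum(flat) / world_size  # the all_reduce + divide step
    # unflatten the averaged buffer back into per-parameter arrays
    out, offset = [], 0
    for shape in shapes:
        n = int(np.prod(shape))
        out.append(coalesced[offset:offset + n].reshape(shape))
        offset += n
    return out
```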

hparams.py

@@ -10,7 +10,7 @@ def create_hparams(hparams_string=None, verbose=False):
# Experiment Parameters #
################################
epochs=500,
iters_per_checkpoint=500,
iters_per_checkpoint=1000,
seed=1234,
dynamic_loss_scaling=True,
fp16_run=False,
@@ -23,10 +23,10 @@ def create_hparams(hparams_string=None, verbose=False):
################################
# Data Parameters #
################################
training_files='filelists/ljs_audio_text_train_filelist.txt',
validation_files='filelists/ljs_audio_text_val_filelist.txt',
load_mel_from_disk=False,
training_files='filelists/ljs_audio22khz_text_train_filelist.txt',
validation_files='filelists/ljs_audio22khz_text_val_filelist.txt',
text_cleaners=['english_cleaners'],
sort_by_length=False,
################################
# Audio Parameters #
@@ -38,7 +38,7 @@ def create_hparams(hparams_string=None, verbose=False):
win_length=1024,
n_mel_channels=80,
mel_fmin=0.0,
mel_fmax=None, # if None, half the sampling rate
mel_fmax=8000.0,
################################
# Model Parameters #
@@ -52,11 +52,13 @@ def create_hparams(hparams_string=None, verbose=False):
encoder_embedding_dim=512,
# Decoder parameters
n_frames_per_step=1,
n_frames_per_step=1, # currently only 1 is supported
decoder_rnn_dim=1024,
prenet_dim=256,
max_decoder_steps=1000,
gate_threshold=0.6,
gate_threshold=0.5,
p_attention_dropout=0.1,
p_decoder_dropout=0.1,
# Attention parameters
attention_rnn_dim=1024,
@@ -74,11 +76,12 @@ def create_hparams(hparams_string=None, verbose=False):
################################
# Optimization Hyperparameters #
################################
use_saved_learning_rate=False,
learning_rate=1e-3,
weight_decay=1e-6,
grad_clip_thresh=1,
batch_size=48,
mask_padding=False # set model's padded outputs to padded values
grad_clip_thresh=1.0,
batch_size=64,
mask_padding=True # set model's padded outputs to padded values
)
if hparams_string:
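`create_hparams` accepts an override string that is handed to `tf.contrib.training.HParams.parse`, which applies comma-separated `key=value` pairs on top of the defaults. A simplified stand-in for that parsing convention (not the HParams API itself):

```python
def parse_overrides(defaults, hparams_string):
    """Apply 'k=v,k2=v2' overrides to a dict of defaults, coercing each
    value to the type of the default. Simplified sketch: bool defaults
    would need special handling, since bool('False') is truthy."""
    hparams = dict(defaults)
    if hparams_string:
        for pair in hparams_string.split(","):
            key, value = pair.split("=", 1)
            key = key.strip()
            hparams[key] = type(hparams[key])(value)
    return hparams
```

For example, `--hparams=batch_size=32,learning_rate=5e-4` on the training command line corresponds to `parse_overrides(defaults, "batch_size=32,learning_rate=5e-4")`.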

inference.ipynb (Normal file → Executable file; 113 lines changed; file diff suppressed because one or more lines are too long)

layers.py

@@ -10,7 +10,7 @@ class LinearNorm(torch.nn.Module):
super(LinearNorm, self).__init__()
self.linear_layer = torch.nn.Linear(in_dim, out_dim, bias=bias)
torch.nn.init.xavier_uniform(
torch.nn.init.xavier_uniform_(
self.linear_layer.weight,
gain=torch.nn.init.calculate_gain(w_init_gain))
@@ -31,7 +31,7 @@ class ConvNorm(torch.nn.Module):
padding=padding, dilation=dilation,
bias=bias)
torch.nn.init.xavier_uniform(
torch.nn.init.xavier_uniform_(
self.conv.weight, gain=torch.nn.init.calculate_gain(w_init_gain))
def forward(self, signal):
@@ -42,7 +42,7 @@ class ConvNorm(torch.nn.Module):
class TacotronSTFT(torch.nn.Module):
def __init__(self, filter_length=1024, hop_length=256, win_length=1024,
n_mel_channels=80, sampling_rate=22050, mel_fmin=0.0,
mel_fmax=None):
mel_fmax=8000.0):
super(TacotronSTFT, self).__init__()
self.n_mel_channels = n_mel_channels
self.sampling_rate = sampling_rate
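The `xavier_uniform` → `xavier_uniform_` rename reflects PyTorch 0.4's convention that a trailing underscore marks an in-place operation. The initializer samples from U(-a, a) with a = gain * sqrt(6 / (fan_in + fan_out)), and `calculate_gain('tanh')` returns 5/3. The bound can be checked numerically:

```python
import math

def xavier_uniform_bound(fan_in, fan_out, gain=1.0):
    """Bound a of the U(-a, a) distribution used by
    torch.nn.init.xavier_uniform_."""
    return gain * math.sqrt(6.0 / (fan_in + fan_out))

# gain for tanh, as returned by torch.nn.init.calculate_gain('tanh')
tanh_gain = 5.0 / 3.0
```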

loss_scaler.py

@@ -51,11 +51,10 @@ class DynamicLossScaler:
# `x` is a torch.Tensor
def _has_inf_or_nan(x):
inf_count = torch.sum(x.abs() == float('inf'))
if inf_count > 0:
cpu_sum = float(x.float().sum())
if cpu_sum == float('inf') or cpu_sum == -float('inf') or cpu_sum != cpu_sum:
return True
nan_count = torch.sum(x != x)
return nan_count > 0
return False
# `overflow` is boolean indicating whether we overflowed in gradient
def update_scale(self, overflow):
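The patched `_has_inf_or_nan` replaces two elementwise scans with a single float sum: any inf or NaN entry makes the sum inf, -inf, or NaN (and `x != x` is the standard NaN test, since NaN compares unequal to itself). The same check in plain Python:

```python
def has_inf_or_nan(values):
    """True if any entry is +/-inf or NaN, using the single-sum trick
    from the patched DynamicLossScaler._has_inf_or_nan."""
    total = float(sum(values))
    return (total == float('inf') or total == -float('inf')
            or total != total)
```

Note that even a mix of +inf and -inf is caught, because their sum is NaN.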

model.py

@@ -1,3 +1,4 @@
from math import sqrt
import torch
from torch.autograd import Variable
from torch import nn
@@ -56,7 +57,7 @@ class Attention(nn.Module):
processed_query = self.query_layer(query.unsqueeze(1))
processed_attention_weights = self.location_layer(attention_weights_cat)
energies = self.v(F.tanh(
energies = self.v(torch.tanh(
processed_query + processed_attention_weights + processed_memory))
energies = energies.squeeze(-1)
@@ -107,7 +108,6 @@ class Postnet(nn.Module):
def __init__(self, hparams):
super(Postnet, self).__init__()
self.dropout = nn.Dropout(0.5)
self.convolutions = nn.ModuleList()
self.convolutions.append(
@@ -141,9 +141,8 @@ class Postnet(nn.Module):
def forward(self, x):
for i in range(len(self.convolutions) - 1):
x = self.dropout(F.tanh(self.convolutions[i](x)))
x = self.dropout(self.convolutions[-1](x))
x = F.dropout(torch.tanh(self.convolutions[i](x)), 0.5, self.training)
x = F.dropout(self.convolutions[-1](x), 0.5, self.training)
return x
@@ -155,7 +154,6 @@ class Encoder(nn.Module):
"""
def __init__(self, hparams):
super(Encoder, self).__init__()
self.dropout = nn.Dropout(0.5)
convolutions = []
for _ in range(hparams.encoder_n_convolutions):
@@ -175,7 +173,7 @@ class Encoder(nn.Module):
def forward(self, x, input_lengths):
for conv in self.convolutions:
x = self.dropout(F.relu(conv(x)))
x = F.dropout(F.relu(conv(x)), 0.5, self.training)
x = x.transpose(1, 2)
@@ -194,7 +192,7 @@ class Encoder(nn.Module):
def inference(self, x):
for conv in self.convolutions:
x = self.dropout(F.relu(conv(x)))
x = F.dropout(F.relu(conv(x)), 0.5, self.training)
x = x.transpose(1, 2)
@@ -215,6 +213,8 @@ class Decoder(nn.Module):
self.prenet_dim = hparams.prenet_dim
self.max_decoder_steps = hparams.max_decoder_steps
self.gate_threshold = hparams.gate_threshold
self.p_attention_dropout = hparams.p_attention_dropout
self.p_decoder_dropout = hparams.p_decoder_dropout
self.prenet = Prenet(
hparams.n_mel_channels * hparams.n_frames_per_step,
@@ -235,7 +235,7 @@ class Decoder(nn.Module):
self.linear_projection = LinearNorm(
hparams.decoder_rnn_dim + hparams.encoder_embedding_dim,
hparams.n_mel_channels*hparams.n_frames_per_step)
hparams.n_mel_channels * hparams.n_frames_per_step)
self.gate_layer = LinearNorm(
hparams.decoder_rnn_dim + hparams.encoder_embedding_dim, 1,
@@ -350,11 +350,13 @@ class Decoder(nn.Module):
gate_output: gate output energies
attention_weights:
"""
decoder_input = self.prenet(decoder_input)
cell_input = torch.cat((decoder_input, self.attention_context), -1)
self.attention_hidden, self.attention_cell = self.attention_rnn(
cell_input, (self.attention_hidden, self.attention_cell))
self.attention_hidden = F.dropout(
self.attention_hidden, self.p_attention_dropout, self.training)
self.attention_cell = F.dropout(
self.attention_cell, self.p_attention_dropout, self.training)
attention_weights_cat = torch.cat(
(self.attention_weights.unsqueeze(1),
@@ -368,6 +370,10 @@ class Decoder(nn.Module):
(self.attention_hidden, self.attention_context), -1)
self.decoder_hidden, self.decoder_cell = self.decoder_rnn(
decoder_input, (self.decoder_hidden, self.decoder_cell))
self.decoder_hidden = F.dropout(
self.decoder_hidden, self.p_decoder_dropout, self.training)
self.decoder_cell = F.dropout(
self.decoder_cell, self.p_decoder_dropout, self.training)
decoder_hidden_attention_context = torch.cat(
(self.decoder_hidden, self.attention_context), dim=1)
@@ -392,23 +398,23 @@ class Decoder(nn.Module):
alignments: sequence of attention weights from the decoder
"""
decoder_input = self.get_go_frame(memory)
decoder_input = self.get_go_frame(memory).unsqueeze(0)
decoder_inputs = self.parse_decoder_inputs(decoder_inputs)
decoder_inputs = torch.cat((decoder_input, decoder_inputs), dim=0)
decoder_inputs = self.prenet(decoder_inputs)
self.initialize_decoder_states(
memory, mask=~get_mask_from_lengths(memory_lengths))
mel_outputs, gate_outputs, alignments = [], [], []
while len(mel_outputs) < decoder_inputs.size(0):
while len(mel_outputs) < decoder_inputs.size(0) - 1:
decoder_input = decoder_inputs[len(mel_outputs)]
mel_output, gate_output, attention_weights = self.decode(
decoder_input)
mel_outputs += [mel_output.squeeze(1)]
gate_outputs += [gate_output.squeeze()]
alignments += [attention_weights]
decoder_input = decoder_inputs[len(mel_outputs) - 1]
mel_outputs, gate_outputs, alignments = self.parse_decoder_outputs(
mel_outputs, gate_outputs, alignments)
@@ -431,15 +437,15 @@ class Decoder(nn.Module):
self.initialize_decoder_states(memory, mask=None)
mel_outputs, gate_outputs, alignments = [], [], []
while True:
decoder_input = self.prenet(decoder_input)
mel_output, gate_output, alignment = self.decode(decoder_input)
mel_outputs += [mel_output.squeeze(1)]
gate_outputs += [gate_output.squeeze()]
gate_outputs += [gate_output]
alignments += [alignment]
if F.sigmoid(gate_output.data) > self.gate_threshold:
if torch.sigmoid(gate_output.data) > self.gate_threshold:
break
elif len(mel_outputs) == self.max_decoder_steps:
print("Warning! Reached max decoder steps")
@@ -462,6 +468,9 @@ class Tacotron2(nn.Module):
self.n_frames_per_step = hparams.n_frames_per_step
self.embedding = nn.Embedding(
hparams.n_symbols, hparams.symbols_embedding_dim)
std = sqrt(2.0 / (hparams.n_symbols + hparams.symbols_embedding_dim))
val = sqrt(3.0) * std # uniform bounds for std
self.embedding.weight.data.uniform_(-val, val)
self.encoder = Encoder(hparams)
self.decoder = Decoder(hparams)
self.postnet = Postnet(hparams)
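The new embedding initialization draws from a uniform distribution whose bounds are chosen so its standard deviation equals the Xavier std sqrt(2 / (fan_in + fan_out)); since U(-val, val) has std val / sqrt(3), the bound is val = sqrt(3) * std. Numerically, with illustrative sizes:

```python
import math

n_symbols, symbols_embedding_dim = 148, 512  # illustrative sizes

std = math.sqrt(2.0 / (n_symbols + symbols_embedding_dim))
val = math.sqrt(3.0) * std  # uniform bound giving that std

# std of U(-val, val) is val / sqrt(3), which recovers the target std
recovered_std = val / math.sqrt(3.0)
```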
@@ -471,7 +480,7 @@ class Tacotron2(nn.Module):
output_lengths = batch
text_padded = to_gpu(text_padded).long()
input_lengths = to_gpu(input_lengths).long()
max_len = torch.max(input_lengths.data)
max_len = torch.max(input_lengths.data).item()
mel_padded = to_gpu(mel_padded).float()
gate_padded = to_gpu(gate_padded).float()
output_lengths = to_gpu(output_lengths).long()
@@ -486,7 +495,7 @@ class Tacotron2(nn.Module):
def parse_output(self, outputs, output_lengths=None):
if self.mask_padding and output_lengths is not None:
mask = ~get_mask_from_lengths(output_lengths+1) # +1 <stop> token
mask = ~get_mask_from_lengths(output_lengths)
mask = mask.expand(self.n_mel_channels, mask.size(0), mask.size(1))
mask = mask.permute(1, 0, 2)
@@ -495,7 +504,6 @@ class Tacotron2(nn.Module):
outputs[2].data.masked_fill_(mask[:, 0, :], 1e3) # gate energies
outputs = fp16_to_fp32(outputs) if self.fp16_run else outputs
return outputs
def forward(self, inputs):
@@ -513,14 +521,6 @@ class Tacotron2(nn.Module):
mel_outputs_postnet = self.postnet(mel_outputs)
mel_outputs_postnet = mel_outputs + mel_outputs_postnet
# DataParallel expects equal sized inputs/outputs, hence padding
if input_lengths is not None:
alignments = alignments.unsqueeze(0)
alignments = nn.functional.pad(
alignments,
(0, max_len - alignments.size(3), 0, 0),
"constant", 0)
alignments = alignments.squeeze()
return self.parse_output(
[mel_outputs, mel_outputs_postnet, gate_outputs, alignments],
output_lengths)

requirements.txt

@@ -1,9 +1,10 @@
torch==0.2.0.post3
torch==0.4.0
matplotlib==2.1.0
tensorflow==1.5.0
tensorflow
numpy==1.13.3
inflect==0.2.5
librosa==0.6.0
scipy==1.0.0
tensorboardX==1.1
Unidecode==1.0.22
pillow

stft.py

@@ -61,7 +61,7 @@ class STFT(torch.nn.Module):
np.linalg.pinv(scale * fourier_basis).T[:, None, :])
if window is not None:
assert(win_length >= filter_length)
assert(filter_length >= win_length)
# get window and zero center pad it to filter_length
fft_window = get_window(window, win_length, fftbins=True)
fft_window = pad_center(fft_window, filter_length)

text/__init__.py

@@ -37,8 +37,6 @@ def text_to_sequence(text, cleaner_names):
sequence += _arpabet_to_sequence(m.group(2))
text = m.group(3)
# Append EOS token
sequence.append(_symbol_to_id['~'])
return sequence

text/symbols.py

@@ -7,11 +7,12 @@ The default is a set of ASCII characters that works well for English or text tha
from text import cmudict
_pad = '_'
_eos = '~'
_characters = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz!\'(),-.:;? '
_punctuation = '!\'(),.:;? '
_special = '-'
_letters = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
# Prepend "@" to ARPAbet symbols to ensure uniqueness (some are the same as uppercase letters):
_arpabet = ['@' + s for s in cmudict.valid_symbols]
# Export all symbols:
symbols = [_pad, _eos] + list(_characters) + _arpabet
symbols = [_pad] + list(_special) + list(_punctuation) + list(_letters) + _arpabet
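The refactor splits the old `_characters` string into punctuation, the special `-`, and letters, and drops the standalone EOS symbol `~` (matching the removal of EOS appending in `text/__init__.py` above). The resulting character-level inventory, before ARPAbet symbols are appended, is easy to check:

```python
from string import ascii_lowercase, ascii_uppercase

_pad = '_'
_punctuation = "!'(),.:;? "
_special = '-'
_letters = ascii_uppercase + ascii_lowercase  # same 52 letters as the literal

# Without the _arpabet entries, the exported list is:
symbols = [_pad] + list(_special) + list(_punctuation) + list(_letters)
```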

train.py

@@ -2,11 +2,12 @@ import os
import time
import argparse
import math
from numpy import finfo
import torch
from distributed import DistributedDataParallel
from distributed import apply_gradient_allreduce
import torch.distributed as dist
from torch.utils.data.distributed import DistributedSampler
from torch.nn import DataParallel
from torch.utils.data import DataLoader
from fp16_optimizer import FP16_Optimizer
@@ -29,19 +30,20 @@ def batchnorm_to_float(module):
def reduce_tensor(tensor, num_gpus):
rt = tensor.clone()
torch.distributed.all_reduce(rt, op=torch.distributed.reduce_op.SUM)
dist.all_reduce(rt, op=dist.reduce_op.SUM)
rt /= num_gpus
return rt
def init_distributed(hparams, n_gpus, rank, group_name):
assert torch.cuda.is_available(), "Distributed mode requires CUDA."
print("Initializing distributed")
print("Initializing Distributed")
# Set cuda device so everything is done on the right GPU.
torch.cuda.set_device(rank % torch.cuda.device_count())
# Initialize distributed communication
torch.distributed.init_process_group(
dist.init_process_group(
backend=hparams.dist_backend, init_method=hparams.dist_url,
world_size=n_gpus, rank=rank, group_name=group_name)
@@ -77,7 +79,9 @@ def prepare_directories_and_logger(output_directory, log_directory, rank):
def load_model(hparams):
model = Tacotron2(hparams).cuda()
model = batchnorm_to_float(model.half()) if hparams.fp16_run else model
if hparams.fp16_run:
model = batchnorm_to_float(model.half())
model.decoder.attention_layer.score_mask_value = float(finfo('float16').min)
if hparams.distributed_run:
model = DistributedDataParallel(model)
@@ -128,22 +132,20 @@ def validate(model, criterion, valset, iteration, batch_size, n_gpus,
pin_memory=False, collate_fn=collate_fn)
val_loss = 0.0
if distributed_run or torch.cuda.device_count() > 1:
batch_parser = model.module.parse_batch
else:
batch_parser = model.parse_batch
for i, batch in enumerate(val_loader):
x, y = batch_parser(batch)
x, y = model.parse_batch(batch)
y_pred = model(x)
loss = criterion(y_pred, y)
reduced_val_loss = reduce_tensor(loss.data, n_gpus)[0] \
if distributed_run else loss.data[0]
if distributed_run:
reduced_val_loss = reduce_tensor(loss.data, n_gpus).item()
else:
reduced_val_loss = loss.item()
val_loss += reduced_val_loss
val_loss = val_loss / (i + 1)
model.train()
return val_loss
print("Validation loss {}: {:9f} ".format(iteration, reduced_val_loss))
logger.log_validation(reduced_val_loss, model, y, y_pred, iteration)
def train(output_directory, log_directory, checkpoint_path, warm_start, n_gpus,
@@ -173,6 +175,9 @@ def train(output_directory, log_directory, checkpoint_path, warm_start, n_gpus,
optimizer = FP16_Optimizer(
optimizer, dynamic_loss_scale=hparams.dynamic_loss_scaling)
if hparams.distributed_run:
model = apply_gradient_allreduce(model)
criterion = Tacotron2Loss()
logger = prepare_directories_and_logger(
@@ -187,16 +192,14 @@ def train(output_directory, log_directory, checkpoint_path, warm_start, n_gpus,
if warm_start:
model = warm_start_model(checkpoint_path, model)
else:
model, optimizer, learning_rate, iteration = load_checkpoint(
model, optimizer, _learning_rate, iteration = load_checkpoint(
checkpoint_path, model, optimizer)
if hparams.use_saved_learning_rate:
learning_rate = _learning_rate
iteration += 1 # next iteration is iteration + 1
epoch_offset = max(0, int(iteration / len(train_loader)))
model.train()
if hparams.distributed_run or torch.cuda.device_count() > 1:
batch_parser = model.module.parse_batch
else:
batch_parser = model.parse_batch
# ================ MAIN TRAINING LOOP! ===================
for epoch in range(epoch_offset, hparams.epochs):
print("Epoch: {}".format(epoch))
@@ -206,18 +209,21 @@ def train(output_directory, log_directory, checkpoint_path, warm_start, n_gpus,
param_group['lr'] = learning_rate
model.zero_grad()
x, y = batch_parser(batch)
x, y = model.parse_batch(batch)
y_pred = model(x)
loss = criterion(y_pred, y)
reduced_loss = reduce_tensor(loss.data, n_gpus)[0] \
if hparams.distributed_run else loss.data[0]
if hparams.distributed_run:
reduced_loss = reduce_tensor(loss.data, n_gpus).item()
else:
reduced_loss = loss.item()
if hparams.fp16_run:
optimizer.backward(loss)
grad_norm = optimizer.clip_fp32_grads(hparams.grad_clip_thresh)
else:
loss.backward()
grad_norm = torch.nn.utils.clip_grad_norm(
grad_norm = torch.nn.utils.clip_grad_norm_(
model.parameters(), hparams.grad_clip_thresh)
optimizer.step()
@@ -228,20 +234,14 @@ def train(output_directory, log_directory, checkpoint_path, warm_start, n_gpus,
duration = time.perf_counter() - start
print("Train loss {} {:.6f} Grad Norm {:.6f} {:.2f}s/it".format(
iteration, reduced_loss, grad_norm, duration))
logger.log_training(
reduced_loss, grad_norm, learning_rate, duration, iteration)
if not overflow and (iteration % hparams.iters_per_checkpoint == 0):
reduced_val_loss = validate(
model, criterion, valset, iteration, hparams.batch_size,
n_gpus, collate_fn, logger, hparams.distributed_run, rank)
validate(model, criterion, valset, iteration, hparams.batch_size,
n_gpus, collate_fn, logger, hparams.distributed_run, rank)
if rank == 0:
print("Validation loss {}: {:9f} ".format(
iteration, reduced_val_loss))
logger.log_validation(
reduced_val_loss, model, y, y_pred, iteration)
checkpoint_path = os.path.join(
output_directory, "checkpoint_{}".format(iteration))
save_checkpoint(model, optimizer, learning_rate, iteration,
@@ -276,7 +276,7 @@ if __name__ == '__main__':
torch.backends.cudnn.benchmark = hparams.cudnn_benchmark
print("FP16 Run:", hparams.fp16_run)
print("Dynamic Loss Scaling", hparams.dynamic_loss_scaling)
print("Dynamic Loss Scaling:", hparams.dynamic_loss_scaling)
print("Distributed Run:", hparams.distributed_run)
print("cuDNN Enabled:", hparams.cudnn_enabled)
print("cuDNN Benchmark:", hparams.cudnn_benchmark)
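The checkpoint-resume change in `train.py` keeps the learning rate from `hparams` unless `use_saved_learning_rate` is set, in which case the rate stored in the checkpoint wins. The selection logic, isolated:

```python
def resume_learning_rate(hparams_lr, checkpoint_lr, use_saved_learning_rate):
    """Mirrors the load_checkpoint branch in train.py: the hparams value
    is the default; the checkpoint value is used only on request."""
    return checkpoint_lr if use_saved_learning_rate else hparams_lr
```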

utils.py

@@ -4,29 +4,26 @@ import torch
def get_mask_from_lengths(lengths):
max_len = torch.max(lengths)
ids = torch.arange(0, max_len, out=torch.LongTensor(max_len)).cuda()
max_len = torch.max(lengths).item()
ids = torch.arange(0, max_len, out=torch.cuda.LongTensor(max_len))
mask = (ids < lengths.unsqueeze(1)).byte()
return mask
def load_wav_to_torch(full_path, sr):
def load_wav_to_torch(full_path):
sampling_rate, data = read(full_path)
assert sr == sampling_rate, "{} SR doesn't match {} on path {}".format(
sr, sampling_rate, full_path)
return torch.FloatTensor(data.astype(np.float32))
return torch.FloatTensor(data.astype(np.float32)), sampling_rate
def load_filepaths_and_text(filename, sort_by_length, split="|"):
def load_filepaths_and_text(filename, split="|"):
with open(filename, encoding='utf-8') as f:
filepaths_and_text = [line.strip().split(split) for line in f]
if sort_by_length:
filepaths_and_text.sort(key=lambda x: len(x[1]))
return filepaths_and_text
def to_gpu(x):
x = x.contiguous().cuda(async=True)
x = x.contiguous()
if torch.cuda.is_available():
x = x.cuda(non_blocking=True)
return torch.autograd.Variable(x)
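`load_filepaths_and_text` now returns entries in file order (the `sort_by_length` option is gone), and `load_wav_to_torch` returns the sampling rate instead of asserting it. The filelist parsing itself is just pipe-splitting; the paths and transcripts below are illustrative:

```python
import io

def load_filepaths_and_text(f, split="|"):
    """Parse 'path|transcript' lines; f is any iterable of lines."""
    return [line.strip().split(split) for line in f]

example = io.StringIO(
    "wavs/LJ001-0001.wav|Printing, in the only sense.\n"
    "wavs/LJ001-0002.wav|in being comparatively modern.\n")
pairs = load_filepaths_and_text(example)
```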

waveglow (submodule added at 4b1001fa33)