Commit 5396a54a authored by Corentin Jemine

Adapted tacotron2 hyperparameters to LibriSpeech

Parent 04169a38
Problems:
- VC2 has mixed languages and I cannot seem to find labels for those anywhere. This creates a lack of consistency: VC1 also has mixed languages, but I did have the labels there and was able to prune them out.
* Try to find the labels to exclude them:
o It's less data, but it's better data
* I sent an email to the authors on the 5th of February
* I googled and asked /r/datasets to no avail.
* I could go for language guessing using YouTube's title and description, but that's a poor approach (a rough sketch of this appears after this list)
* Language detection through voice would be great
- It might be slow to run on the entire dataset
- It may just be weak as well
* Include the different languages anyway:
+ No effort required to filter non-English out
o It's apparently what the authors did as well when they reported the EER in the table, but their final model uses their internal dataset, which is entirely English. I wouldn't even be surprised if they never noticed there is non-English speech in VC.
- That may just weaken the ability of the model to distinguish among English speakers
- It's not the appropriate meaning for the embeddings: the language will likely be encoded but Tacotron is meant to synthesize English only.
* Meaning of the embedding for a sentence, non normalized, duration
- Embedding for a sentence of
\ No newline at end of file
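Rough sketch (not in the repo) of the text-based language guessing mentioned above; admittedly a poor proxy for the spoken language. It assumes the VoxCeleb video titles/descriptions have already been scraped, and uses langid as one off-the-shelf classifier:

import langid  # pip install langid

def looks_english(title, description):
    # langid.classify returns (language_code, score); keep only videos whose
    # metadata is classified as English.
    lang, _ = langid.classify(title + " " + description)
    return lang == "en"

# Hypothetical usage on scraped (video_id, title, description) rows:
videos = [("vid_001", "Interview with the actor", "Full interview, 2016")]
english_videos = [v for v in videos if looks_english(v[1], v[2])]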
......@@ -63,23 +63,18 @@ Ideas for a deeper analysis:
- How stable is the encoding w.r.t. the length of the utterance? Compare side by side embeddings of several utterances cut at 1.6s, 2.4s, 3.2s... Show the distribution of the distance to the centroids w.r.t. the length of the utterances, and the EER (see the sketch after this list)
- How much is the encoding affected by the quality of the microphone? Record the same sentences on 2 or 3 different microphones and project the embeddings. Look at how far they are from the rest of the dataset.
- Analyze the components of the embeddings
- Technically, you could do voice morphing in the same sentence
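Minimal sketch of the length-stability analysis from the first item above. embed_utterance() is hypothetical here and stands for the speaker encoder's forward pass on a single waveform; the cut lengths follow the 1.6s/2.4s/3.2s idea:

import numpy as np

def length_stability(wav, sr, embed_utterance, cut_lengths=(1.6, 2.4, 3.2, 4.0)):
    # Embed prefixes of the same utterance and measure how far each one drifts
    # from the embedding of the full utterance (the speaker centroid could be
    # used as the reference instead).
    full_embed = embed_utterance(wav, sr)
    distances = {}
    for seconds in cut_lengths:
        n_samples = int(seconds * sr)
        if n_samples > len(wav):
            break
        prefix_embed = embed_utterance(wav[:n_samples], sr)
        distances[seconds] = float(np.linalg.norm(prefix_embed - full_embed))
    return distances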
Ideas for extensions:
- Add a term in the loss that forces the centroid of synthesized utterances to be close to that of ground-truth utterances (a sketch of such a term follows below)
http://www.robots.ox.ac.uk/~vgg/data/lip_reading/
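Hedged sketch of what that loss term could look like in the repo's TensorFlow 1.x style. speaker_encoder() is hypothetical: it stands for a differentiable mapping from mel spectrograms to fixed-size speaker embeddings.

import tensorflow as tf

def embedding_consistency_loss(speaker_encoder, mel_outputs, mel_targets):
    # Embed both the synthesized and the ground-truth mel spectrograms.
    synth_embeds = speaker_encoder(mel_outputs)     # [batch, embed_dim]
    target_embeds = speaker_encoder(mel_targets)    # [batch, embed_dim]
    # Penalize the distance between the two batch centroids; this term would
    # be added to the existing Tacotron loss with some weight.
    synth_centroid = tf.reduce_mean(synth_embeds, axis=0)
    target_centroid = tf.reduce_mean(target_embeds, axis=0)
    return tf.norm(synth_centroid - target_centroid)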
TODO (priority at the top):
- Double check that all model parameters are correct (draw the model graph?)
- Find LSTM extensions?
- Analyze perf on test set (join vc1_test and vc2_test, they're small)
- Embed speaker matrix in visdom
- Remove y axis in matrix
- Include other datasets
- Adaptive LR/batch size
- Improve the load of the current implementation (?)
......
TOPIC 1
Description: Translate the voice of a character in a movie based on their voice in the original language, using subtitles in the target language.
Domains: NLP, Generative neural networks, Audio processing
Outline: The essential goal of this project is to be able to generate spoken sentences in a target language with a given voice based on training samples spoken in a different language by the same voice (we assume that for every audio sample we have, we also have a correct transcript of it). If this goal is achieved, the possible next step would be to incorporate lip movements in the process so that the generated audio is synchronized to the video. Note that in the general case, only the lip movements of the original language will be available, opening the possibility for another extension: letting the generative model also recreate lip movements.
Our possible objectives, in order of increasing difficulty, are as follows:
(1) Translate audio samples by a single speaker from their source language to a target language using transcripts available in both languages.
(2) Same as (1), but the audio samples come from a video (a filmed monologue of the speaker) and the generated audio must match the speed of the speaker.
(3) Same as (2), but the model also outputs a modified video where the lips of the speaker match the translated audio. We consider this objective to be optional.
(4) Same as (2) or (3), but generalized to multiple speakers (Note: the subtitle files include the identity of the speaker for each line).
(5) Same as (4), but on arbitrary video segments (-> how should music and noises in the audio be handled?)
Alternatively, we could aim for audio translation without the use of transcripts, performing speech recognition followed by language translation, but that seems extremely ambitious.
Difficulty: The project is ambitious. Achieving our first objective alone already amounts to a considerable amount of work, depending on the current state of the art. Thorough research must be conducted before being able to accurately rate the difficulty of this project. However, the remaining possible objectives ensure that this project will be challenging in the unlikely event that (1) is quickly achieved.
Opinion: I like that the project is ambitious and fairly original. It could also have actual applications in practice if I were to reach (5), but that is setting the bar very high. It is definitely a topic I am motivated to go for, but I need to research it properly to get an idea of what is achievable.
TOPIC 2
Description: Real-time multi-object tracking applied to humans playing football.
Domains: Computer vision
Outline: This project would be supervised by Deltatec (they have given their approval). It consists of creating a model that performs either semantic instance segmentation or multi-object detection (the latter is more common in practice) on players in a football game sequence. The model has to operate over time as it tracks each individual player, and it must be able to tell when a player leaves the scene. It should also handle edge cases such as two players overlapping each other. Ideally, the model should be able to differentiate between players and other humans present in the scene (line judges, spectators, ...). Previous work at Deltatec has shown that this can be a difficult task on still images, but we hope that it can be made easier with the added time dimension. Indeed, players hold specific poses or run on the field, which other humans present are not expected to do (aside from the referees, whom we will consider players). Thus, only a small number of frames would probably be needed before a detection can be made. The model should run in real-time or near real-time, depending on the constraints.
Difficulty: The difficulty of this project is advanced. It requires developing a framework that interacts with the model, and both the time constraints and a decent accuracy must be met.
Opinion: The project seems interesting and challenging enough overall. I am already familiar with the data, which is a plus.
TOPIC 3
Description: Real-time tracking of a board game: being able to tell what is being played and report it to an internal model of the game.
Domains: Computer vision
Outline: An example of such a project can be found here: https://www.youtube.com/watch?v=pnntrewH0xg.
The project would be designed with a chosen board game in mind. A single camera would capture the entire board, and machine learning models would need to infer the actions made by the players. Possibly, an AI could be integrated to take actions as an additional player. The software integrating the models should be able to determine the current state of the game, whether an invalid move was made, the score of each player, whose turn it is to play, etc.
Difficulty: The difficulty of this project is moderate. It mostly involves object detection, a field where highly performant models already exist. The remaining part is mostly routine software development.
Opinion: This is a project I can easily turn to if the others appear too difficult to tackle. It may, however, not be considered ambitious enough. The choice of the board game could potentially change the nature of the project; I have yet to explore the possibilities. My initial idea was to work with the game of Carcassonne: https://en.wikipedia.org/wiki/Carcassonne_(board_game).
TOPIC 4
Description: Generic framework for active learning. Application with a GUI that facilitates manual data labeling and possibly trains models at the same time.
Domains: Semi-supervised learning, Software development (higher focus than for the other projects)
Outline: The goal here is to create user-friendly and modular software specifically aimed at people who need to manually label data when labeling is a time-consuming process (e.g. segmentation). The software presents a graphical interface and assists the user in creating the labels. The user can define a model to be trained along with a training protocol, or provide a pretrained model. This model is used to predict labels, and the user is invited to correct them. The framework must operate in such a way that
- The predictions offered by the model actually help make the labeling process faster and are, at worst, not useful (but non-intrusive).
- The quality of the predictions increases over time, resulting in faster labeling.
It is likely difficult to make such an application truly generic with respect to the task at hand. More than likely, the application will be specialized to certain tasks where labeling is time-consuming, and it should be highly modular. We consider implementing it as a Python package so that users can easily interface their own code with the application.
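A minimal sketch of the intended labeling loop; every name below is hypothetical (no such package exists yet): the model proposes labels, the user corrects them, and corrected pairs are periodically fed back into training.

def labeling_session(model, unlabeled_samples, ask_user_to_correct, retrain_every=50):
    # Hypothetical active-learning labeling loop.
    labeled = []
    for i, sample in enumerate(unlabeled_samples, start=1):
        proposal = model.predict(sample)                # the model suggests a label
        label = ask_user_to_correct(sample, proposal)   # the user accepts or fixes it
        labeled.append((sample, label))
        if i % retrain_every == 0:
            # Retrain on everything labeled so far, so predictions improve over time.
            model.fit(*zip(*labeled))
    return labeled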
Difficulty: It is hard to evaluate the difficulty of this project while it remains underspecified and unresearched.
Opinion: This is likely more of a software development effort than an application of machine learning, even though part of the project involves getting familiar with active learning. I am not yet sure how interesting this project may be. As far as I know, active learning is uncommon in practice. I have, however, encountered situations where labeling more data seemed to be the only way to get more out of my models, and found no tools that would help me do this faster than making a labeling tool myself.
TRANSFER LEARNING FROM SPEAKER VERIFICATION TO MULTISPEAKER TEXT-TO-SPEECH SYNTHESIS:
- ASR to remove silence... Is it any different from webrtcvad? They mention a 14 s -> 5 s median duration, which sounds like a big reduction to me (maybe it's just because it's the median). Will have to check the samples myself.
- VCTK and LibriSpeech were separately used for different models.
- Same speaker set for training and validation in VCTK but disjoint for LibriSpeech.
- Noise reduction applied to the spectrogram. Go figure if I'll ever implement that.
- Note that VCTK is 1/10th the duration of LibriSpeech. Is it even worth bothering at this point? I could just start with LibriSpeech and add it later anyway.
- Guess I will have to skip the phoneme inputs as well. Maybe WaveGlow implements them? Would also have to check if they ever appear in the Tacotron 2 paper at all.
-------------------------------------------------
- Interface train-clean
- Listen to inverted samples to check if the silence removal is OK (a minimal VAD sketch follows these notes)
- Begin with train-clean to see if that works
- If it does, maybe consider adding VCTK then train-other
\ No newline at end of file
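To sanity-check the silence removal (second item above), a minimal VAD pass with webrtcvad could look like the sketch below; the 30 ms framing and aggressiveness level are assumptions, not values taken from the repo.

import numpy as np
import webrtcvad  # pip install webrtcvad

def speech_mask(wav_int16, sample_rate=16000, frame_ms=30, aggressiveness=3):
    # Returns one boolean per frame, True where the VAD detects speech.
    # wav_int16 must be mono 16-bit PCM as a numpy int16 array.
    vad = webrtcvad.Vad(aggressiveness)
    frame_len = int(sample_rate * frame_ms / 1000)
    mask = []
    for start in range(0, len(wav_int16) - frame_len + 1, frame_len):
        frame = wav_int16[start:start + frame_len].tobytes()
        mask.append(vad.is_speech(frame, sample_rate))
    return np.array(mask, dtype=bool)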
......@@ -93,7 +93,8 @@ hparams = tf.contrib.training.HParams(
# 6- If audio quality is too metallic or fragmented (or if linear spectrogram plots are
# showing black silent regions on top), then restart from step 2.
num_mels=80, # Number of mel-spectrogram channels and local conditioning dimensionality
num_freq=1025, # (= n_fft / 2 + 1) only used when adding linear spectrograms post processing
# num_freq=1025, # (= n_fft / 2 + 1) only used when adding linear spectrograms post processing
num_freq=513, # (= n_fft / 2 + 1) only used when adding linear spectrograms post processing
# network
rescale=True, # Whether to rescale audio prior to preprocessing
rescaling_max=0.999, # Rescaling value
......@@ -104,7 +105,7 @@ hparams = tf.contrib.training.HParams(
clip_mels_length=True,
# For cases of OOM (Not really recommended, only use if facing unsolvable OOM errors,
# also consider clipping your samples to smaller chunks)
max_mel_frames=1000, # was 1000
max_mel_frames=1000,
# Only relevant when clip_mels_length = True, please only use after trying output_per_steps=3
# and still getting OOM errors.
......@@ -117,10 +118,18 @@ hparams = tf.contrib.training.HParams(
silence_threshold=2, # silence threshold used for sound trimming for wavenet preprocessing
# Mel spectrogram
n_fft=2048, # Extra window size is filled with 0 paddings to match this parameter
hop_size=275, # For 22050Hz, 275 ~= 12.5 ms (0.0125 * sample_rate)
win_size=1100, # For 22050Hz, 1100 ~= 50 ms (If None, win_size = n_fft) (0.05 * sample_rate)
sample_rate=22050, # 22050 Hz (corresponding to ljspeech dataset) (sox --i <filename>)
# # FOR DATASETS IN 22050Hz:
# n_fft=2048, # Extra window size is filled with 0 paddings to match this parameter
# hop_size=275, # For 22050Hz, 275 ~= 12.5 ms (0.0125 * sample_rate)
# win_size=1100, # For 22050Hz, 1100 ~= 50 ms (If None, win_size = n_fft) (0.05 * sample_rate)
# sample_rate=22050, # 22050 Hz (corresponding to ljspeech dataset) (sox --i <filename>)
# FOR DATASETS IN 16000Hz:
n_fft=1024, # Extra window size is filled with 0 paddings to match this parameter
hop_size=200, # For 16000Hz, 200 ~= 12.5 ms (0.0125 * sample_rate)
win_size=800, # For 16000Hz, 800 ~= 50 ms (If None, win_size = n_fft) (0.05 * sample_rate)
sample_rate=16000, # 16000Hz (corresponding to librispeech) (sox --i <filename>)
frame_shift_ms=None, # Can replace hop_size parameter. (Recommended: 12.5)
# M-AILABS (and other datasets) trim params (these parameters are usually correct for any
......@@ -276,7 +285,8 @@ hparams = tf.contrib.training.HParams(
# for each frame while 2D spans "freq_axis_kernel_size" bands at a time
upsample_activation='LeakyRelu',
# Activation function used during upsampling. Can be ('LeakyRelu', 'Relu' or None)
upsample_scales=[5, 5, 11], # prod(upsample_scales) should be equal to hop_size
# upsample_scales=[5, 5, 11], # prod(upsample_scales) should be equal to hop_size
upsample_scales=[5, 5, 8], # prod(upsample_scales) should be equal to hop_size
freq_axis_kernel_size=3,
# Only used for 2D upsampling. This is the number of frequency bands that are spanned at a time
# for each frame.
......
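Quick sanity check for the 16 kHz settings above (not part of the diff): hop and window sizes follow from the sample rate, num_freq follows from n_fft, and the WaveNet upsample scales must multiply to the hop size, hence [5, 5, 8] for hop_size=200.

import numpy as np

sample_rate = 16000
n_fft = 1024
hop_size = int(0.0125 * sample_rate)   # 12.5 ms -> 200
win_size = int(0.05 * sample_rate)     # 50 ms   -> 800
num_freq = n_fft // 2 + 1              # -> 513
upsample_scales = [5, 5, 8]

assert hop_size == 200 and win_size == 800 and num_freq == 513
assert int(np.prod(upsample_scales)) == hop_size  # 5 * 5 * 8 = 200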
......@@ -32,11 +32,8 @@ def write_metadata(metadata, out_dir):
print('Max audio timesteps length: {}'.format(max(m[3] for m in metadata)))
def norm_data(args):
merge_books = (args.merge_books=='True')
print('Selecting data folders..')
supported_datasets = ['LJSpeech-1.0', 'LJSpeech-1.1', 'M-AILABS']
supported_datasets = ['LibriSpeech']
if args.dataset not in supported_datasets:
raise ValueError('dataset value entered {} does not belong to supported datasets: {}'.format(
args.dataset, supported_datasets))
......@@ -45,37 +42,6 @@ def norm_data(args):
return [os.path.join(args.base_dir, args.dataset)]
if args.dataset == 'M-AILABS':
supported_languages = ['en_US', 'en_UK', 'fr_FR', 'it_IT', 'de_DE', 'es_ES', 'ru_RU',
'uk_UK', 'pl_PL', 'nl_NL', 'pt_PT', 'fi_FI', 'se_SE', 'tr_TR', 'ar_SA']
if args.language not in supported_languages:
raise ValueError('Please enter a supported language to use from M-AILABS dataset! \n{}'.format(
supported_languages))
supported_voices = ['female', 'male', 'mix']
if args.voice not in supported_voices:
raise ValueError('Please enter a supported voice option to use from M-AILABS dataset! \n{}'.format(
supported_voices))
path = os.path.join(args.base_dir, args.language, 'by_book', args.voice)
supported_readers = [e for e in os.listdir(path) if os.path.isdir(os.path.join(path,e))]
if args.reader not in supported_readers:
raise ValueError('Please enter a valid reader for your language and voice settings! \n{}'.format(
supported_readers))
path = os.path.join(path, args.reader)
supported_books = [e for e in os.listdir(path) if os.path.isdir(os.path.join(path,e))]
if merge_books:
return [os.path.join(path, book) for book in supported_books]
else:
if args.book not in supported_books:
raise ValueError('Please enter a valid book for your reader settings! \n{}'.format(
supported_books))
return [os.path.join(path, args.book)]
def run_preprocess(args, hparams):
input_folders = norm_data(args)
output_folder = os.path.join(args.base_dir, args.output)
......@@ -89,12 +55,7 @@ def main():
parser.add_argument('--base_dir', default='')
parser.add_argument('--hparams', default='',
help='Hyperparameter overrides as a comma-separated list of name=value pairs')
parser.add_argument('--dataset', default='LJSpeech-1.1')
parser.add_argument('--language', default='en_US')
parser.add_argument('--voice', default='female')
parser.add_argument('--reader', default='mary_ann')
parser.add_argument('--merge_books', default='False')
parser.add_argument('--book', default='northandsouth')
parser.add_argument('--dataset', default='LibriSpeech')
parser.add_argument('--output', default='training_data')
parser.add_argument('--n_jobs', type=int, default=cpu_count())
args = parser.parse_args()
......
......@@ -240,7 +240,18 @@ class Feeder:
split_infos.append([input_max_len, mel_target_max_len, token_target_max_len, linear_target_max_len])
split_infos = np.asarray(split_infos, dtype=np.int32)
return (inputs, input_lengths, mel_targets, token_targets, linear_targets, targets_lengths, split_infos)
### SV2TTS ###
batch_size = mel_targets.shape[0]
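# NOTE: dummy embeddings (one constant 256-dim vector per batch item, value i/1000
# for item i), presumably a placeholder until real speaker encoder outputs are fed in.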
speaker_embeddings = np.array([[i / 1000] * 256 for i in range(batch_size)])
##############
return inputs, input_lengths, mel_targets, token_targets, linear_targets, targets_lengths, \
split_infos, speaker_embeddings
def _prepare_inputs(self, inputs):
max_len = max([len(x) for x in inputs])
......
......@@ -28,8 +28,8 @@ class Tacotron():
def __init__(self, hparams):
self._hparams = hparams
def initialize(self, inputs, input_lengths, mel_targets=None, stop_token_targets=None,
linear_targets=None, targets_lengths=None, gta=False,
def initialize(self, inputs, input_lengths, speaker_embeddings, mel_targets=None,
stop_token_targets=None, linear_targets=None, targets_lengths=None, gta=False,
global_step=None, is_training=False, is_evaluating=False, split_infos=None):
"""
Initializes the model for inference sets "mel_outputs" and "alignments" fields.
......@@ -38,6 +38,8 @@ class Tacotron():
steps in the input time series, and values are character IDs
- input_lengths: int32 Tensor with shape [N] where N is batch size and values are the
lengths of each sequence in inputs.
- speaker_embeddings: float32 Tensor with shape [N, E] where E is the speaker
embedding size.
- mel_targets: float32 Tensor with shape [N, T_out, M] where N is batch size,
T_out is number of steps in the output time series, M is num_mels, and values are
entries in the mel spectrogram. Only needed for training.
......@@ -152,9 +154,11 @@ class Tacotron():
### SV2TTS ###
# Append the speaker embedding to the encoder output at each timestep
dummy = tf.zeros((tf.shape(encoder_outputs)[0], 1, 256))
dummy1 = tf.tile(dummy, (1, tf.shape(encoder_outputs)[1], 1))
encoder_cond_outputs = tf.concat((encoder_outputs, dummy1), 2)
tileable_shape = [batch_size, 1, self._hparams.speaker_embedding_size]
tileable_speaker_embeddings = tf.reshape(speaker_embeddings, tileable_shape)
tiled_speaker_embeddings = tf.tile(tileable_speaker_embeddings,
[1, tf.shape(encoder_outputs)[1], 1])
encoder_cond_outputs = tf.concat((encoder_outputs, tiled_speaker_embeddings), 2)
##############
......
......@@ -20,14 +20,18 @@ class Synthesizer:
#Force the batch size to be known in order to use attention masking in batch synthesis
inputs = tf.placeholder(tf.int32, (None, None), name='inputs')
input_lengths = tf.placeholder(tf.int32, (None), name='input_lengths')
speaker_embeddings = tf.placeholder(tf.float32, (None, hparams.speaker_embedding_size),
name='speaker_embeddings')
targets = tf.placeholder(tf.float32, (None, None, hparams.num_mels), name='mel_targets')
split_infos = tf.placeholder(tf.int32, shape=(hparams.tacotron_num_gpus, None), name='split_infos')
with tf.variable_scope('Tacotron_model') as scope:
self.model = create_model(model_name, hparams)
if gta:
self.model.initialize(inputs, input_lengths, targets, gta=gta, split_infos=split_infos)
self.model.initialize(inputs, input_lengths, speaker_embeddings, targets, gta=gta,
split_infos=split_infos)
else:
self.model.initialize(inputs, input_lengths, split_infos=split_infos)
self.model.initialize(inputs, input_lengths, speaker_embeddings,
split_infos=split_infos)
self.mel_outputs = self.model.tower_mel_outputs
self.linear_outputs = self.model.tower_linear_outputs if (hparams.predict_linear and not gta) else None
......
......@@ -93,13 +93,14 @@ def model_train_mode(args, feeder, hparams, global_step):
model_name = 'Tacotron'
model = create_model(model_name or args.model, hparams)
if hparams.predict_linear:
model.initialize(feeder.inputs, feeder.input_lengths, feeder.mel_targets,
feeder.token_targets, linear_targets=feeder.linear_targets,
model.initialize(feeder.inputs, feeder.input_lengths, feeder.speaker_embeddings,
feeder.mel_targets, feeder.token_targets,
linear_targets=feeder.linear_targets,
targets_lengths=feeder.targets_lengths, global_step=global_step,
is_training=True, split_infos=feeder.split_infos)
else:
model.initialize(feeder.inputs, feeder.input_lengths, feeder.mel_targets,
feeder.token_targets,
model.initialize(feeder.inputs, feeder.input_lengths, feeder.speaker_embeddings,
feeder.mel_targets, feeder.token_targets,
targets_lengths=feeder.targets_lengths, global_step=global_step,
is_training=True, split_infos=feeder.split_infos)
model.add_loss()
......@@ -115,17 +116,17 @@ def model_test_mode(args, feeder, hparams, global_step):
model_name = 'Tacotron'
model = create_model(model_name or args.model, hparams)
if hparams.predict_linear:
model.initialize(feeder.eval_inputs, feeder.eval_input_lengths, feeder.eval_mel_targets,
feeder.eval_token_targets,
linear_targets=feeder.eval_linear_targets,
model.initialize(feeder.eval_inputs, feeder.eval_input_lengths,
feeder.speaker_embeddings, feeder.eval_mel_targets,
feeder.eval_token_targets, linear_targets=feeder.eval_linear_targets,
targets_lengths=feeder.eval_targets_lengths, global_step=global_step,
is_training=False, is_evaluating=True,
is_training=False, is_evaluating=True,
split_infos=feeder.eval_split_infos)
else:
model.initialize(feeder.eval_inputs, feeder.eval_input_lengths, feeder.eval_mel_targets,
feeder.eval_token_targets,
targets_lengths=feeder.eval_targets_lengths, global_step=global_step,
is_training=False, is_evaluating=True,
model.initialize(feeder.eval_inputs, feeder.eval_input_lengths,
feeder.speaker_embeddings, feeder.eval_mel_targets,
feeder.eval_token_targets, targets_lengths=feeder.eval_targets_lengths,
global_step=global_step, is_training=False, is_evaluating=True,
split_infos=feeder.eval_split_infos)
model.add_loss()
return model
......