updated README to include testing

Merge branch 'master' of /home/ilml/Public/Repos/speech_scoring
Added README.md describing the workflow
2017-12-29 16:21:38 +05:30 · 2017-12-29 13:15:51 +05:30 · 2017-12-29 13:14:37 +05:30 · 2017-12-28 20:02:44 +05:30 · 2017-12-28 20:01:44 +05:30 · 2017-12-28 20:00:19 +05:30
18 changed files with 1416 additions and 126 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -143,3 +143,5 @@ inputs/audio*
 logs/*
 models/*
 *.pkl
 temp/*
 trained/*
--- a/CLI.md
+++ b/CLI.md
@@ -0,0 +1,2 @@
 ### Convert audio files
 $ `for f in *.mp3; do ffmpeg -i "$f" "${f%.mp3}.aiff"; done`
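
The same conversion can be scripted from Python. This is only a sketch: `aiff_commands` and `convert_all` are hypothetical helper names that are not part of this repo, and `ffmpeg` is assumed to be on the `PATH`.

```python
import subprocess
from pathlib import Path


def aiff_commands(folder="."):
    """Build one ffmpeg command per .mp3 file, targeting a sibling .aiff file."""
    commands = []
    for mp3 in sorted(Path(folder).glob("*.mp3")):
        aiff = mp3.with_suffix(".aiff")
        # Mirrors the shell loop: ffmpeg -i "$f" "${f%.mp3}.aiff"
        commands.append(["ffmpeg", "-i", str(mp3), str(aiff)])
    return commands


def convert_all(folder="."):
    """Run every command; raises CalledProcessError if ffmpeg fails."""
    for cmd in aiff_commands(folder):
        subprocess.run(cmd, check=True)
```

Calling `aiff_commands()` first is a cheap way to check which files would be converted before running `convert_all()`.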
--- a/README.md
+++ b/README.md
@@ -0,0 +1,23 @@
 ### Setup
 `. env/bin/activate` to activate the virtualenv.
 ### Data Generation
 * update `OUTPUT_NAME` in *speech_samplegen.py*; the dataset folder is created under that name
 * `python speech_samplegen.py` generates variants of audio samples
 ### Data Preprocessing
 * `python speech_data.py` creates the training-testing data from the generated samples.
 * run `fix_csv(OUTPUT_NAME)` once to create the fixed index of the generated dataset
 * run `generate_sppas_trans(OUTPUT_NAME)` once to create the SPPAS transcription (wav+txt) data
 * run `$ (SPPAS_DIR)/bin/annotation.py -l eng -e csv --ipus --tok --phon --align -w ./outputs/OUTPUT_NAME/` once to create the phoneme alignment csv files for all variants.
 * `create_seg_phonpair_tfrecords(OUTPUT_NAME)` creates the tfrecords files
 with the phoneme-level pairs of right/wrong stresses
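
How the right/wrong pairs are formed can be sketched roughly as below. It loosely follows `siamese_pairs` in *speech_data.py*, which treats the `low` variant as right and `medium` as wrong, and which additionally filters and shuffles the pairs and keeps at most 32 of each. `make_pairs` and the sample labels are illustrative names only.

```python
import itertools


def make_pairs(right, wrong):
    """Return (same, different) pairs from two lists of samples."""
    # Every unordered pair of right samples counts as "same".
    same = list(itertools.combinations(right, 2))
    # Every right sample paired with every wrong sample counts as "different".
    different = [(r, w) for w in wrong for r in right]
    return same, different


right = ["low-voiceA", "low-voiceB", "low-voiceC"]
wrong = ["medium-voiceA", "medium-voiceB"]
same, different = make_pairs(right, wrong)
print(len(same), len(different))  # 3 6
```

Using combinations rather than permutations for the "same" group avoids emitting each pair twice in swapped order.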
 ### Training
 * `python speech_model.py` trains the model with the training data generated.
 * `train_siamese(OUTPUT_NAME)` trains the siamese model with the generated dataset.
 ### Testing
 * `python speech_test.py` tests the trained model with the test dataset
 * `evaluate_siamese(TEST_RECORD_FILE,audio_group=OUTPUT_NAME,weights = WEIGHTS_FILE_NAME)`
  `TEST_RECORD_FILE` is under the outputs directory and `WEIGHTS_FILE_NAME` is under the models directory; pick the most recent weights file.
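
Picking the most recent weights file can be automated. A minimal sketch, assuming the weights are saved as `.h5` files directly inside one model directory; `latest_weights` is a hypothetical helper, not a function from this repo.

```python
import glob
import os


def latest_weights(model_dir, pattern="*.h5"):
    """Return the most recently modified file matching pattern, or None."""
    candidates = glob.glob(os.path.join(model_dir, pattern))
    if not candidates:
        return None
    # Modification time stands in for "most recent checkpoint".
    return max(candidates, key=os.path.getmtime)
```

The returned path could then be passed as the `weights` argument of `evaluate_siamese`.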
--- a/requirements-linux.txt
+++ b/requirements-linux.txt
@@ -8,6 +8,7 @@ distributed==1.19.3
 entrypoints==0.2.3
 enum34==1.1.6
 futures==3.1.1
 graphviz==0.8.1
 h5py==2.7.1
 HeapDict==1.0.0
 html5lib==0.9999999
@@ -40,13 +41,14 @@ parso==0.1.0
 partd==0.3.8
 pexpect==4.2.1
 pickleshare==0.7.4
-pkg-resources==0.0.0
+praat-parselmouth==0.2.0
 progressbar2==3.34.3
 prompt-toolkit==1.0.15
-protobuf==3.4.0
+protobuf==3.5.0
 psutil==5.4.0
 ptyprocess==0.5.2
 PyAudio==0.2.11
 pydot==1.2.3
 Pygments==2.2.0
 pyparsing==2.2.0
 pysndfile==1.0.0
@@ -65,7 +67,7 @@ sortedcontainers==1.5.7
 tables==3.4.2
 tblib==1.3.2
 tensorflow==1.3.0
-tensorflow-tensorboard==0.4.0rc1
+tensorflow-tensorboard==0.4.0rc3
 terminado==0.6
 testpath==0.3.1
 toolz==0.8.2
--- a/segment_data.py
+++ b/segment_data.py
@@ -0,0 +1,265 @@
 import random
 import math
 import pickle
 from functools import reduce
 from tqdm import tqdm
 from sklearn.model_selection import train_test_split
 import numpy as np
 import pandas as pd
 import tensorflow as tf
 import shutil
 from speech_pitch import *
 from speech_tools import reservoir_sample,padd_zeros
 # import importlib
 # import speech_tools
 # importlib.reload(speech_tools)
 # %matplotlib inline
 SPEC_MAX_FREQUENCY = 8000
 SPEC_WINDOW_SIZE = 0.03
 def fix_csv(collection_name = 'test'):
    seg_data = pd.read_csv('./outputs/segments/'+collection_name+'/index.csv',names=['phrase','filename'
                ,'start_phoneme','end_phoneme','start_time','end_time'])
    seg_data.to_csv('./outputs/segments/'+collection_name+'/index.fixed.csv')
 def pick_random_phrases(collection_name='test'):
    # collection_name = 'test'
    seg_data = pd.read_csv('./outputs/'+collection_name+'.fixed.csv',index_col=0)
    phrase_groups = random.sample([i for i in seg_data.groupby(['phrase'])],10)
    result = []
    for ph,g in phrase_groups:
        result.append(ph)
    pd.DataFrame(result,columns=['phrase']).to_csv('./outputs/'+collection_name+'.random.csv')
 # pick_random_phrases()
 def plot_random_phrases(collection_name = 'test'):
    # collection_name = 'test'
    rand_words = pd.read_csv('./outputs/'+collection_name+'.random.csv',index_col=0)
    rand_w_list = rand_words['phrase'].tolist()
    seg_data = pd.read_csv('./outputs/'+collection_name+'.fixed.csv',index_col=0)
    result = (seg_data['phrase'] == rand_w_list[0])
    for i in rand_w_list[1:]:
         result |= (seg_data['phrase'] == i)
    phrase_groups = [i for i in seg_data[result].groupby(['phrase'])]
    self_files = ['a_wrong_turn-low1.aiff','great_pin-low1.aiff'
    ,'he_set_off_at_once_to_find_the_beast-low1.aiff'
    ,'hound-low1.aiff','noises-low1.aiff','po_burped-low1.aiff'
    ,'she_loves_the_roses-low1.aiff','the_busy_spider-low1.aiff'
    ,'the_rain_helped-low1.aiff','to_go_to_the_doctor-low1.aiff']
    co_files = map(lambda x: './inputs/self/'+x,self_files)
    for ((ph,g),s_f) in zip(phrase_groups,co_files):
    # ph,g = phrase_groups[0]
        file_path = './outputs/test/'+g.iloc[0]['filename']
        phrase_sample = pm_snd(file_path)
        self_sample = pm_snd(s_f)
        player,closer = play_sound()
        # rows = [i for i in g.iterrows()]
        # random.shuffle(rows)
        print(ph)
        phon_stops = []
        for (i,phon) in g.iterrows():
            end_t = phon['end_time']/1000
            phon_ch = phon['start_phoneme']
            phon_stops.append((end_t,phon_ch))
        plot_sample_pitch(phrase_sample,phons = phon_stops)
        plot_sample_pitch(self_sample)
        # player(phrase_sample)
        # input()
    # for (i,phon) in g.iterrows():
    #     # phon = g.iloc[1]
    #     start_t = phon['start_time']/1000
    #     end_t = phon['end_time']/1000
    #     phon_ch = phon['start_phoneme']
    #     phon_sample = phrase_sample.extract_part(from_time=start_t,to_time=end_t)
    #     if phon_sample.n_samples*phon_sample.sampling_period < 6.4/100:
    #         continue
    #     # if phon_ch[0] not in 'AEIOU':
    #     #     continue
    #     # phon_sample
    #     # player(phon_sample)
    #     # plot_sample_intensity(phon_sample)
    #     print(phon_ch)
    #     plot_sample_pitch(phon_sample)
    # closer()
 def plot_segments(collection_name = 'story_test_segments'):
    # collection_name = 'story_test_segments'
    seg_data = pd.read_csv('./outputs/'+collection_name+'.fixed.csv',index_col=0)
    phrase_groups = [i for i in seg_data.groupby(['phrase'])]
    for (ph,g) in phrase_groups:
        # ph,g = phrase_groups[0]
        file_path = './outputs/'+collection_name+'/'+g.iloc[0]['filename']
        phrase_sample = pm_snd(file_path)
        # player,closer = play_sound()
        print(ph)
        phon_stops = []
        for (i,phon) in g.iterrows():
            end_t = phon['end_time']/1000
            phon_ch = phon['start_phoneme']
            phon_stops.append((end_t,phon_ch))
        phrase_spec = phrase_sample.to_spectrogram(window_length=0.03, maximum_frequency=8000)
        sg_db = 10 * np.log10(phrase_spec.values)
        result = np.zeros(sg_db.shape[0],dtype=np.int64)
        ph_bounds = [t[0] for t in phon_stops[1:]]
        b_frames = np.asarray([spec_frame(phrase_spec,b) for b in ph_bounds])
        result[b_frames] = 1
    # print(audio)
 def generate_spec(aiff_file):
    phrase_sample = pm_snd(aiff_file)
    phrase_spec = phrase_sample.to_spectrogram(window_length=SPEC_WINDOW_SIZE, maximum_frequency=SPEC_MAX_FREQUENCY)
    sshow_abs = np.abs(phrase_spec.values + np.finfo(phrase_spec.values.dtype).eps)
    sg_db = 10 * np.log10(sshow_abs)
    sg_db[sg_db < 0] = 0
    return sg_db,phrase_spec
 def spec_frame(spec,b):
    return int(round(spec.frame_number_to_time(b)))
 def _float_feature(value):
  return tf.train.Feature(float_list=tf.train.FloatList(value=value))
 def _int64_feature(value):
  return tf.train.Feature(int64_list=tf.train.Int64List(value=value))
 def _bytes_feature(value):
  return tf.train.Feature(bytes_list=tf.train.BytesList(value=value))
 def create_segments_tfrecords(collection_name='story_test_segments',sample_count=0,train_test_ratio=0.1):
    audio_samples = pd.read_csv( './outputs/segments/' + collection_name + '/index.fixed.csv',index_col=0)
    audio_samples['file_path'] = audio_samples.loc[:, 'filename'].apply(lambda x: 'outputs/segments/' + collection_name + '/samples/' + x)
    n_records,n_spec,n_features = 0,0,0
    def write_samples(wg,sample_name):
        phrase_groups = tqdm(wg,desc='Computing segmentation')
        record_file = './outputs/segments/{}/{}.tfrecords'.format(collection_name,sample_name)
        writer = tf.python_io.TFRecordWriter(record_file)
        for (ph,g) in phrase_groups:
            fname = g.iloc[0]['filename']
            sg_db,phrase_spec = generate_spec(g.iloc[0]['file_path'])
            phon_stops = []
            phrase_groups.set_postfix(phrase=ph)
            spec_n,spec_w = sg_db.shape
            spec = sg_db.reshape(-1)
            for (i,phon) in g.iterrows():
                end_t = phon['end_time']/1000
                phon_ch = phon['start_phoneme']
                phon_stops.append((end_t,phon_ch))
            result = np.zeros(spec_n,dtype=np.int64)
            ph_bounds = [t[0] for t in phon_stops]
            f_bounds = [spec_frame(phrase_spec,b) for b in ph_bounds]
            valid_bounds = [i for i in f_bounds if 0 < i < spec_n]
            b_frames = np.asarray(valid_bounds)
            if len(b_frames) > 0:
                result[b_frames] = 1
            nonlocal n_records,n_spec,n_features
            n_spec = max([n_spec,spec_n])
            n_features = spec_w
            n_records+=1
            example = tf.train.Example(features=tf.train.Features(
              feature={
                'phrase': _bytes_feature([ph.encode('utf-8')]),
                'file': _bytes_feature([fname.encode('utf-8')]),
                'spec':_float_feature(spec),
                'spec_n':_int64_feature([spec_n]),
                'spec_w':_int64_feature([spec_w]),
                'output':_int64_feature(result)
              }
            ))
            writer.write(example.SerializeToString())
        phrase_groups.close()
        writer.close()
    word_groups = [i for i in audio_samples.groupby('phrase')]
    wg_sampled = reservoir_sample(word_groups,sample_count) if sample_count > 0 else word_groups
    # write_samples(word_groups,'all')
    tr_audio_samples,te_audio_samples = train_test_split(wg_sampled,test_size=train_test_ratio)
    write_samples(tr_audio_samples,'train')
    write_samples(te_audio_samples,'test')
    const_file = './outputs/segments/'+collection_name+'/constants.pkl'
    pickle.dump((n_spec,n_features,n_records),open(const_file,'wb'))
 def record_generator_count(records_file):
    record_iterator = tf.python_io.tf_record_iterator(path=records_file)
    count,spec_n = 0,0
    for i in record_iterator:
        count+=1
    record_iterator = tf.python_io.tf_record_iterator(path=records_file)
    return record_iterator,count
 def read_segments_tfrecords_generator(collection_name='audio',batch_size=32,test_size=0):
    # collection_name = 'story_test'
    records_file  = './outputs/segments/'+collection_name+'/train.tfrecords'
    const_file = './outputs/segments/'+collection_name+'/constants.pkl'
    (n_spec,n_features,n_records) = pickle.load(open(const_file,'rb'))
    def copy_read_consts(dest_dir):
        shutil.copy2(const_file,dest_dir+'/constants.pkl')
        return (n_spec,n_features,n_records)
    # @threadsafe_iter
    def record_generator():
        print('reading tfrecords({}-train)...'.format(collection_name))
        input_data = []
        output_data = []
        while True:
            record_iterator,records_count = record_generator_count(records_file)
            for (i,string_record) in enumerate(record_iterator):
                # (i,string_record) = next(enumerate(record_iterator))
                example = tf.train.Example()
                example.ParseFromString(string_record)
                spec_n = example.features.feature['spec_n'].int64_list.value[0]
                spec_w = example.features.feature['spec_w'].int64_list.value[0]
                spec = np.array(example.features.feature['spec'].float_list.value).reshape(spec_n,spec_w)
                p_spec = padd_zeros(spec,n_spec)
                input_data.append(p_spec)
                output = np.asarray(example.features.feature['output'].int64_list.value)
                p_output = np.pad(output,(0,n_spec-output.shape[0]),'constant')
                output_data.append(p_output)
                if len(input_data) == batch_size or i == records_count-1:
                    input_arr = np.asarray(input_data)
                    output_arr = np.asarray(output_data)
                    # input_arr.shape,output_arr.shape
                    yield (input_arr,output_arr)
                    input_data = []
                    output_data = []
    # Read test in one-shot
    print('reading tfrecords({}-test)...'.format(collection_name))
    te_records_file  = './outputs/segments/'+collection_name+'/test.tfrecords'
    te_re_iterator,te_n_records = record_generator_count(te_records_file)
    # test_size = 10
    test_size = min([test_size,te_n_records]) if test_size > 0 else te_n_records
    input_data = np.zeros((test_size,n_spec,n_features))
    output_data = np.zeros((test_size,n_spec))
    random_samples = enumerate(reservoir_sample(te_re_iterator,test_size))
    for (i,string_record) in tqdm(random_samples,total=test_size):
        # (i,string_record) = next(random_samples)
        example = tf.train.Example()
        example.ParseFromString(string_record)
        spec_n = example.features.feature['spec_n'].int64_list.value[0]
        spec_w = example.features.feature['spec_w'].int64_list.value[0]
        spec = np.array(example.features.feature['spec'].float_list.value).reshape(spec_n,spec_w)
        p_spec = padd_zeros(spec,n_spec)
        input_data[i] = p_spec
        output = np.asarray(example.features.feature['output'].int64_list.value)
        p_output = np.pad(output,(0,n_spec-output.shape[0]),'constant')
        output_data[i] = p_output
    return record_generator,input_data,output_data,copy_read_consts
 if __name__ == '__main__':
    # plot_random_phrases()
    # fix_csv('story_test_segments')
    # plot_segments('story_test_segments')
    # fix_csv('story_words')
    # pass
    create_segments_tfrecords('story_words.30', sample_count=36,train_test_ratio=0.1)
    # record_generator,input_data,output_data,copy_read_consts = read_segments_tfrecords_generator('story_test')
    # tr_gen = record_generator()
    # for i in tr_gen:
    #     print(i[0].shape,i[1].shape)
--- a/segment_model.py
+++ b/segment_model.py
@@ -0,0 +1,144 @@
 from __future__ import absolute_import
 from __future__ import print_function
 import numpy as np
 from keras.models import Model,load_model,model_from_yaml
 from keras.layers import Input,Concatenate,Lambda, Reshape, Dropout
 from keras.layers import Dense,Conv2D, LSTM, Bidirectional, GRU
 from keras.layers import BatchNormalization,Activation
 from keras.losses import categorical_crossentropy
 from keras.utils import to_categorical
 from keras.optimizers import RMSprop,Adadelta,Adagrad,Adam,Nadam
 from keras.callbacks import TensorBoard, ModelCheckpoint
 from keras import backend as K
 from keras.utils import plot_model
 from speech_tools import create_dir,step_count
 from segment_data import read_segments_tfrecords_generator
 # import importlib
 # import segment_data
 # import speech_tools
 # importlib.reload(segment_data)
 # importlib.reload(speech_tools)
 # TODO implement ctc losses
 # https://github.com/fchollet/keras/blob/master/examples/image_ocr.py
 def accuracy(y_true, y_pred):
    '''Compute classification accuracy with a fixed 0.5 threshold on predictions.
    '''
    return K.mean(K.equal(y_true, K.cast(y_pred > 0.5, y_true.dtype)))
 def ctc_lambda_func(args):
    y_pred, labels, input_length, label_length = args
    # the 2 is critical here since the first couple outputs of the RNN
    # tend to be garbage:
    y_pred = y_pred[:, 2:, :]
    return K.ctc_batch_cost(labels, y_pred, input_length, label_length)
 def segment_model(input_dim):
    inp = Input(shape=input_dim)
    cnv1 = Conv2D(filters=32, kernel_size=(5,9))(inp)
    cnv2 = Conv2D(filters=1, kernel_size=(5,9))(cnv1)
    dr_cnv2 = Dropout(rate=0.95)(cnv2)
    cn_rnn_dim = (dr_cnv2.shape[1].value,dr_cnv2.shape[2].value)
    r_dr_cnv2 = Reshape(target_shape=cn_rnn_dim)(dr_cnv2)
    b_gr1 = Bidirectional(GRU(512, return_sequences=True),merge_mode='sum')(r_dr_cnv2)
    b_gr2 = Bidirectional(GRU(512, return_sequences=True),merge_mode='sum')(b_gr1)
    b_gr3 = Bidirectional(GRU(512, return_sequences=True),merge_mode='sum')(b_gr2)
    oup = Dense(2, activation='softmax')(b_gr3)
    return Model(inp, oup)
 def simple_segment_model(input_dim):
    inp = Input(shape=input_dim)
    b_gr1 = Bidirectional(LSTM(32, return_sequences=True))(inp)
    b_gr1 = Bidirectional(LSTM(16, return_sequences=True),merge_mode='sum')(b_gr1)
    b_gr1 = LSTM(1, return_sequences=True,activation='softmax')(b_gr1)
    # b_gr1 = LSTM(4, return_sequences=True)(b_gr1)
    # b_gr1 = LSTM(2, return_sequences=True)(b_gr1)
    # bn_b_gr1 = BatchNormalization(momentum=0.98)(b_gr1)
    # b_gr2 = GRU(64, return_sequences=True)(b_gr1)
    # bn_b_gr2 = BatchNormalization(momentum=0.98)(b_gr2)
    # d1 = Dense(32)(b_gr2)
    # bn_d1 = BatchNormalization(momentum=0.98)(d1)
    # bn_da1 = Activation('relu')(bn_d1)
    # d2 = Dense(8)(bn_da1)
    # bn_d2 = BatchNormalization(momentum=0.98)(d2)
    # bn_da2 = Activation('relu')(bn_d2)
    # d3 = Dense(1)(b_gr1)
    # # bn_d3 = BatchNormalization(momentum=0.98)(d3)
    # bn_da3 = Activation('softmax')(d3)
    oup = Reshape(target_shape=(input_dim[0],))(b_gr1)
    return Model(inp, oup)
 def write_model_arch(mod,mod_file):
    model_f = open(mod_file,'w')
    model_f.write(mod.to_yaml())
    model_f.close()
 def load_model_arch(mod_file):
    model_f = open(mod_file,'r')
    mod = model_from_yaml(model_f.read())
    model_f.close()
    return mod
 def train_segment(collection_name = 'test',resume_weights='',initial_epoch=0):
    # collection_name = 'story_test'
    batch_size = 128
    # batch_size  = 4
    model_dir = './models/segment/'+collection_name
    create_dir(model_dir)
    log_dir = './logs/segment/'+collection_name
    create_dir(log_dir)
    tr_gen_fn,te_x,te_y,copy_read_consts = read_segments_tfrecords_generator(collection_name,batch_size,2*batch_size)
    tr_gen = tr_gen_fn()
    n_step,n_features,n_records = copy_read_consts(model_dir)
    input_dim = (n_step, n_features)
    model = simple_segment_model(input_dim)
    # model.output_shape,model.input_shape
    plot_model(model,show_shapes=True, to_file=model_dir+'/model.png')
    # loss_out = Lambda(ctc_lambda_func, output_shape=(1,), name='ctc')([y_pred, labels, input_length, label_length])
    tb_cb = TensorBoard(
        log_dir=log_dir,
        histogram_freq=1,
        batch_size=32,
        write_graph=True,
        write_grads=True,
        write_images=True,
        embeddings_freq=0,
        embeddings_layer_names=None,
        embeddings_metadata=None)
    cp_file_fmt = model_dir+'/speech_segment_model-{epoch:02d}-epoch-{val_loss:0.2f}\
 -acc.h5'
    cp_cb = ModelCheckpoint(
        cp_file_fmt,
        monitor='val_loss',
        verbose=0,
        save_best_only=False,
        save_weights_only=True,
        mode='auto',
        period=1)
    # train
    opt = RMSprop()
    model.compile(loss=categorical_crossentropy, optimizer=opt, metrics=[accuracy])
    write_model_arch(model,model_dir+'/speech_segment_model_arch.yaml')
    epoch_n_steps = step_count(n_records,batch_size)
    if resume_weights != '':
        model.load_weights(resume_weights)
    model.fit_generator(tr_gen
                        , epochs=10000
                        , steps_per_epoch=epoch_n_steps
                        , validation_data=(te_x, te_y)
                        , max_queue_size=32
                        , callbacks=[tb_cb, cp_cb],initial_epoch=initial_epoch)
    model.save(model_dir+'/speech_segment_model-final.h5')
    # y_pred = model.predict([te_pairs[:, 0], te_pairs[:, 1]])
    # te_acc = compute_accuracy(te_y, y_pred)
    # print('* Accuracy on test set: %0.2f%%' % (100 * te_acc))
 if __name__ == '__main__':
    # pass
    train_segment('story_words')#,'./models/segment/story_phrases.1000/speech_segment_model-final.h5',1001)
--- a/speech_data.py
+++ b/speech_data.py
@@ -1,42 +1,95 @@
 import pandas as pd
-from speech_tools import apply_by_multiprocessing,threadsafe_iter
+from speech_tools import *
 from speech_pitch import *
 # import dask as dd
 # import dask.dataframe as ddf
 import tensorflow as tf
 from tensorflow.python.ops import data_flow_ops
 import numpy as np
-from speech_spectrum import generate_aiff_spectrogram
+from speech_spectrum import generate_aiff_spectrogram,generate_sample_spectrogram
-from speech_pitch import compute_mfcc
+from speech_similar import segmentable_phoneme
 from sklearn.model_selection import train_test_split
 import itertools
 import os,shutil
 import random
 import csv
 import gc
 import pickle
 import itertools
 from tqdm import tqdm
 def siamese_pairs(rightGroup, wrongGroup):
    group1 = [r for (i, r) in rightGroup.iterrows()]
    group2 = [r for (i, r) in wrongGroup.iterrows()]
-    rightWrongPairs = [(g1, g2) for g2 in group2 for g1 in group1]+[(g2, g1) for g2 in group2 for g1 in group1]
+    rightWrongPairs = [(g1, g2) for g2 in group2 for g1 in group1]#+[(g2, g1) for g2 in group2 for g1 in group1]
-    rightRightPairs = [i for i in itertools.permutations(group1, 2)]#+[i for i in itertools.combinations(group2, 2)]
+    rightRightPairs = [i for i in itertools.combinations(group1, 2)]#+[i for i in itertools.combinations(group2, 2)]
    def filter_criteria(s1,s2):
        same = s1['variant'] == s2['variant']
        phon_same = s1['phonemes'] == s2['phonemes']
        voice_diff = s1['voice'] != s2['voice']
        if not same and phon_same:
            return False
-        if same and not voice_diff:
+        # if same and not voice_diff:
-            return False
+        #     return False
        return True
    validRWPairs = [i for i in rightWrongPairs if filter_criteria(*i)]
    validRRPairs = [i for i in rightRightPairs if filter_criteria(*i)]
    random.shuffle(validRWPairs)
    random.shuffle(validRRPairs)
    # return rightRightPairs[:10],rightWrongPairs[:10]
-    return validRWPairs[:32],validRRPairs[:32]
+    return validRRPairs[:32],validRWPairs[:32]
 def seg_siamese_pairs(rightGroup, wrongGroup):
    group1 = [r for (i, r) in rightGroup.iterrows()]
    group2 = [r for (i, r) in wrongGroup.iterrows()]
    rightWrongPairs = [(g1, g2) for g2 in group2 for g1 in group1]#+[(g2, g1) for g2 in group2 for g1 in group1]
    rightRightPairs = [i for i in itertools.combinations(group1, 2)]#+[i for i in itertools.combinations(group2, 2)]
    def filter_criteria(s1,s2):
        same = s1['variant'] == s2['variant']
        phon_same = s1['phonemes'] == s2['phonemes']
        voice_diff = s1['voice'] != s2['voice']
        if not same and phon_same:
            return False
        # if same and not voice_diff:
        #     return False
        return True
    validRWPairs = [i for i in rightWrongPairs if filter_criteria(*i)]
    validRRPairs = [i for i in rightRightPairs if filter_criteria(*i)]
    random.shuffle(validRWPairs)
    random.shuffle(validRRPairs)
    rrPhonePairs = []
    rwPhonePairs = []
    def compute_seg_spec(s1,s2):
        phon_count = len(s1['parsed_phoneme'])
        seg1_count = len(s1['segments'].index)
        seg2_count = len(s2['segments'].index)
        if phon_count == seg1_count and seg2_count == phon_count:
            s1nd,s2nd = pm_snd(s1['file_path']),pm_snd(s2['file_path'])
            segs1 = [tuple(x) for x in s1['segments'][['start','end']].values]
            segs2 = [tuple(x) for x in s2['segments'][['start','end']].values]
            pp12 = zip(s1['parsed_phoneme'],s2['parsed_phoneme'],segs1,segs2)
            for (p1,p2,(s1s,s1e),(s2s,s2e)) in pp12:
                spc1 = generate_sample_spectrogram(s1nd.extract_part(s1s,s1e).values)
                spc2 = generate_sample_spectrogram(s2nd.extract_part(s2s,s2e).values)
                s1_cp = s1.copy() # fresh copy per phoneme so appended pairs don't share one row
                s2_cp = s2.copy()
                s1_cp['spectrogram'] = spc1
                s2_cp['spectrogram'] = spc2
                # import pdb; pdb.set_trace()
                if repr(p1) == repr(p2):
                    rrPhonePairs.append((s1_cp,s2_cp))
                else:
                    rwPhonePairs.append((s1_cp,s2_cp))
    for (s1,s2) in validRRPairs:
        compute_seg_spec(s1,s2)
    for (s1,s2) in validRWPairs:
        compute_seg_spec(s1,s2)
    return rrPhonePairs[:32],rwPhonePairs[:32]
    # return rightRightPairs[:10],rightWrongPairs[:10]
    # return
    # validRRPairs[:8],validRWPairs[:8]
 def _float_feature(value):
@@ -64,8 +117,9 @@ def create_spectrogram_tfrecords(audio_group='audio',sample_count=0,train_test_r
        for (w, word_group) in word_group_prog:
            word_group_prog.set_postfix(word=w,sample_name=sample_name)
            g = word_group.reset_index()
-            # g['spectrogram'] = apply_by_multiprocessing(g['file_path'],generate_aiff_spectrogram)
+            # g['spectrogram'] = apply_by_multiprocessing(g['file_path'],pitch_array)
-            g['spectrogram'] = apply_by_multiprocessing(g['file_path'],compute_mfcc)
+            g['spectrogram'] = apply_by_multiprocessing(g['file_path'],generate_aiff_spectrogram)
            # g['spectrogram'] = apply_by_multiprocessing(g['file_path'],compute_mfcc)
            sample_right = g.loc[g['variant'] == 'low']
            sample_wrong = g.loc[g['variant'] == 'medium']
            same, diff = siamese_pairs(sample_right, sample_wrong)
@@ -120,25 +174,6 @@ def create_spectrogram_tfrecords(audio_group='audio',sample_count=0,train_test_r
    const_file = os.path.join('./outputs',audio_group+'.constants')
    pickle.dump((n_spec,n_features,n_records),open(const_file,'wb'))
 def padd_zeros(spgr, max_samples):
    return np.lib.pad(spgr, [(0, max_samples - spgr.shape[0]), (0, 0)],
                      'constant')
 def reservoir_sample(iterable, k):
    it = iter(iterable)
    if not (k > 0):
        raise ValueError("sample size must be positive")
    sample = list(itertools.islice(it, k)) # fill the reservoir
    random.shuffle(sample) # if number of items less then *k* then
                           #   return all items in random order.
    for i, item in enumerate(it, start=k+1):
        j = random.randrange(i) # random [0..i)
        if j < k:
            sample[j] = item # replace item with gradually decreasing probability
    return sample
 def read_siamese_tfrecords_generator(audio_group='audio',batch_size=32,test_size=0):
    records_file  = os.path.join('./outputs',audio_group+'.train.tfrecords')
    input_pairs = []
@@ -147,7 +182,7 @@ def read_siamese_tfrecords_generator(audio_group='audio',batch_size=32,test_size
    (n_spec,n_features,n_records) = pickle.load(open(const_file,'rb'))
    def copy_read_consts(dest_dir):
-        shutil.copy2(const_file,dest_dir)
+        shutil.copy2(const_file,dest_dir+'/constants.pkl')
        return (n_spec,n_features,n_records)
    # @threadsafe_iter
    def record_generator():
@@ -181,7 +216,7 @@ def read_siamese_tfrecords_generator(audio_group='audio',batch_size=32,test_size
    # Read test in one-shot
    print('reading tfrecords({}-test)...'.format(audio_group))
    te_records_file  = os.path.join('./outputs',audio_group+'.test.tfrecords')
-    te_re_iterator,te_n_records = record_generator_count(records_file)
+    te_re_iterator,te_n_records = record_generator_count(te_records_file)
    test_size = min([test_size,te_n_records]) if test_size > 0 else te_n_records
    input_data = np.zeros((test_size,2,n_spec,n_features))
    output_data = np.zeros((test_size,2))
@@ -208,11 +243,16 @@ def audio_samples_word_count(audio_group='audio'):
 def record_generator_count(records_file):
    record_iterator = tf.python_io.tf_record_iterator(path=records_file)
-    count = 0
+    count,spec_n = 0,0
    for i in record_iterator:
    #    example = tf.train.Example()
    #    example.ParseFromString(i)
    #    spec_n1 = example.features.feature['spec_n1'].int64_list.value[0]
    #    spec_n2 = example.features.feature['spec_n2'].int64_list.value[0]
    #    spec_n = max([spec_n,spec_n1,spec_n2])
        count+=1
    record_iterator = tf.python_io.tf_record_iterator(path=records_file)
-    return record_iterator,count
+    return record_iterator,count #,spec_n
 def fix_csv(audio_group='audio'):
    audio_csv_lines = open('./outputs/' + audio_group + '.csv','r').readlines()
@@ -239,6 +279,94 @@ def convert_old_audio():
    audio_samples = audio_samples[['word','phonemes', 'voice', 'language', 'rate', 'variant', 'file']]
    audio_samples.to_csv('./outputs/audio_new.csv',index=False,header=False)
 def generate_sppas_trans(audio_group='story_words.all'):
    # audio_group='story_words.all'
    audio_samples = pd.read_csv( './outputs/' + audio_group + '.fixed.csv',index_col=0)
    audio_samples['file_path'] = audio_samples.loc[:, 'file'].apply(lambda x: 'outputs/' + audio_group + '/' + x)
    # audio_samples = audio_samples.head(5)
    rows = tqdm(audio_samples.iterrows(),total = len(audio_samples.index)
                , desc='Transcribing Words ')
    for (i,row) in rows:
    # len(audio_samples.iterrows())
    # (i,row) = next(audio_samples.iterrows())
        rows.set_postfix(word=row['word'])
        transribe_audio_text(row['file_path'],row['word'])
    rows.close()
 def create_seg_phonpair_tfrecords(audio_group='story_words.all',sample_count=0,train_test_ratio=0.1):
    audio_samples = pd.read_csv( './outputs/' + audio_group + '.fixed.csv',index_col=0)
    audio_samples['file_path'] = audio_samples.loc[:, 'file'].apply(lambda x: 'outputs/' + audio_group + '/' + x)
    audio_samples = audio_samples[(audio_samples['variant'] == 'low') | (audio_samples['variant'] == 'medium')]
    audio_samples['parsed_phoneme'] = apply_by_multiprocessing(audio_samples['phonemes'],segmentable_phoneme)
    # audio_samples['sound'] = apply_by_multiprocessing(audio_samples['file_path'],pm_snd)
    # read_seg_file(audio_samples.iloc[0]['file_path'])
    audio_samples['segments'] = apply_by_multiprocessing(audio_samples['file_path'],read_seg_file)
    n_records,n_spec,n_features = 0,0,0
    def write_samples(wg,sample_name):
        word_group_prog = tqdm(wg,desc='Computing PhonPair spectrogram')
        record_file = './outputs/{}.{}.tfrecords'.format(audio_group,sample_name)
        writer = tf.python_io.TFRecordWriter(record_file)
        for (w, word_group) in word_group_prog:
            word_group_prog.set_postfix(word=w,sample_name=sample_name)
            g = word_group.reset_index()
            # g['spectrogram'] = apply_by_multiprocessing(g['file_path'],pitch_array)
            # g['spectrogram'] = apply_by_multiprocessing(g['file_path'],generate_aiff_spectrogram)
            # g['spectrogram'] = apply_by_multiprocessing(g['file_path'],compute_mfcc)
            sample_right = g.loc[g['variant'] == 'low']
            sample_wrong = g.loc[g['variant'] == 'medium']
            same, diff = seg_siamese_pairs(sample_right, sample_wrong)
            groups = [([0,1],same),([1,0],diff)]
            for (output,group) in groups:
                group_prog = tqdm(group,desc='Writing Spectrogram')
                for sample1,sample2 in group_prog:
                    group_prog.set_postfix(output=output
                        ,var1=sample1['variant']
                        ,var2=sample2['variant'])
                    spectro1,spectro2 = sample1['spectrogram'],sample2['spectrogram']
                    spec_n1,spec_n2 = spectro1.shape[0],spectro2.shape[0]
                    spec_w1,spec_w2 = spectro1.shape[1],spectro2.shape[1]
                    spec1,spec2 = spectro1.reshape(-1),spectro2.reshape(-1)
                    nonlocal n_spec,n_records,n_features
                    n_spec = max([n_spec,spec_n1,spec_n2])
                    n_features = spec_w1
                    n_records+=1
                    example = tf.train.Example(features=tf.train.Features(
                      feature={
                        'word': _bytes_feature([w.encode('utf-8')]),
                        'phoneme1': _bytes_feature([sample1['phonemes'].encode('utf-8')]),
                        'phoneme2': _bytes_feature([sample2['phonemes'].encode('utf-8')]),
                        'voice1': _bytes_feature([sample1['voice'].encode('utf-8')]),
                        'voice2': _bytes_feature([sample2['voice'].encode('utf-8')]),
                        'language': _bytes_feature([sample1['language'].encode('utf-8')]),
                        'rate1':_int64_feature([sample1['rate']]),
                        'rate2':_int64_feature([sample2['rate']]),
                        'variant1': _bytes_feature([sample1['variant'].encode('utf-8')]),
                        'variant2': _bytes_feature([sample2['variant'].encode('utf-8')]),
                        'file1': _bytes_feature([sample1['file'].encode('utf-8')]),
                        'file2': _bytes_feature([sample2['file'].encode('utf-8')]),
                        'spec1':_float_feature(spec1),
                        'spec2':_float_feature(spec2),
                        'spec_n1':_int64_feature([spec_n1]),
                        'spec_w1':_int64_feature([spec_w1]),
                        'spec_n2':_int64_feature([spec_n2]),
                        'spec_w2':_int64_feature([spec_w2]),
                        'output':_int64_feature(output)
                      }
                    ))
                    writer.write(example.SerializeToString())
                group_prog.close()
        word_group_prog.close()
        writer.close()
    word_groups = [i for i in audio_samples.groupby('word')]
    wg_sampled = reservoir_sample(word_groups,sample_count) if sample_count > 0 else word_groups
    tr_audio_samples,te_audio_samples = train_test_split(wg_sampled,test_size=train_test_ratio)
    write_samples(tr_audio_samples,'train')
    write_samples(te_audio_samples,'test')
    const_file = os.path.join('./outputs',audio_group+'.constants')
    with open(const_file,'wb') as const_f:
        pickle.dump((n_spec,n_features,n_records),const_f)
 if __name__ == '__main__':
    # sunflower_pairs_data()
    # create_spectrogram_data()
@@ -253,8 +381,11 @@ if __name__ == '__main__':
    # create_spectrogram_tfrecords('audio',sample_count=100)
    # create_spectrogram_tfrecords('story_all',sample_count=25)
    # fix_csv('story_words_test')
-    #fix_csv('story_phrases')
+    # fix_csv('test_5_words')
-    create_spectrogram_tfrecords('story_phrases',sample_count=100,train_test_ratio=0.1)
+    # generate_sppas_trans('test_5_words')
    create_seg_phonpair_tfrecords('test_5_words')
    # create_spectrogram_tfrecords('story_words.all',sample_count=0,train_test_ratio=0.1)
    #record_generator_count()
    # create_spectrogram_tfrecords('audio',sample_count=50)
    # read_siamese_tfrecords_generator('audio')
    # padd_zeros_siamese_tfrecords('audio')
--- a/speech_model.py
+++ b/speech_model.py
@@ -3,12 +3,14 @@ from __future__ import print_function
 import numpy as np
 from speech_data import read_siamese_tfrecords_generator
 from keras.models import Model,load_model,model_from_yaml
-from keras.layers import Input, Dense, Dropout, LSTM, Lambda, Concatenate, Bidirectional
+from keras.layers import Input,Concatenate,Lambda, BatchNormalization, Dropout
 from keras.layers import Dense, LSTM, Bidirectional, GRU
 from keras.losses import categorical_crossentropy
 from keras.utils import to_categorical
 from keras.optimizers import RMSprop
 from keras.callbacks import TensorBoard, ModelCheckpoint
 from keras import backend as K
 from keras.utils import plot_model
 from speech_tools import create_dir,step_count
@@ -17,10 +19,12 @@ def create_base_rnn_network(input_dim):
    '''
    inp = Input(shape=input_dim)
    # ls0 = LSTM(512, return_sequences=True)(inp)
-    ls1 = Bidirectional(LSTM(128, return_sequences=True))(inp)
+    ls1 = LSTM(128, return_sequences=True)(inp)
-    ls2 = LSTM(128, return_sequences=True)(ls1)
+    bn_ls1 = BatchNormalization(momentum=0.98)(ls1)
    ls2 = LSTM(64, return_sequences=True)(bn_ls1)
    bn_ls2 = BatchNormalization(momentum=0.98)(ls2)
    # ls3 = LSTM(32, return_sequences=True)(ls2)
-    ls4 = LSTM(64)(ls2)
+    ls4 = LSTM(32)(bn_ls2)
    # d1 = Dense(128, activation='relu')(ls4)
    #d2 = Dense(64, activation='relu')(ls2)
    return Model(inp, ls4)
@@ -42,10 +46,12 @@ def dense_classifier(processed):
    conc_proc = Concatenate()(processed)
    d1 = Dense(64, activation='relu')(conc_proc)
    # dr1 = Dropout(0.1)(d1)
    bn_d1 = BatchNormalization(momentum=0.98)(d1)
    # d2 = Dense(128, activation='relu')(d1)
-    d3 = Dense(8, activation='relu')(d1)
+    d3 = Dense(8, activation='relu')(bn_d1)
    bn_d3 = BatchNormalization(momentum=0.98)(d3)
    # dr2 = Dropout(0.1)(d2)
-    return Dense(2, activation='softmax')(d3)
+    return Dense(2, activation='softmax')(bn_d3)
 def siamese_model(input_dim):
    base_network = create_base_rnn_network(input_dim)
@@ -55,7 +61,7 @@ def siamese_model(input_dim):
    processed_b = base_network(input_b)
    final_output = dense_classifier([processed_a,processed_b])
    model = Model([input_a, input_b], final_output)
-    return model
+    return model,base_network
 def write_model_arch(mod,mod_file):
    model_f = open(mod_file,'w')
@@ -68,19 +74,20 @@ def load_model_arch(mod_file):
    model_f.close()
    return mod
-def train_siamese(audio_group = 'audio'):
+def train_siamese(audio_group = 'audio',resume_weights='',initial_epoch=0):
    batch_size = 128
    model_dir = './models/'+audio_group
    create_dir(model_dir)
    log_dir = './logs/'+audio_group
    create_dir(log_dir)
    tr_gen_fn,te_pairs,te_y,copy_read_consts = read_siamese_tfrecords_generator(audio_group,batch_size=batch_size,test_size=batch_size)
-    n_step,n_features,n_records = copy_read_consts()
+    n_step,n_features,n_records = copy_read_consts(model_dir)
    tr_gen = tr_gen_fn()
    input_dim = (n_step, n_features)
-    model = siamese_model(input_dim)
+    model,base_model = siamese_model(input_dim)
-
+    plot_model(model,show_shapes=True, to_file=model_dir+'/model.png')
    plot_model(base_model,show_shapes=True, to_file=model_dir+'/base_model.png')
    tb_cb = TensorBoard(
        log_dir=log_dir,
        histogram_freq=1,
@@ -96,7 +103,7 @@ def train_siamese(audio_group = 'audio'):
    cp_cb = ModelCheckpoint(
        cp_file_fmt,
-        monitor='val_loss',
+        monitor='acc',
        verbose=0,
        save_best_only=True,
        save_weights_only=True,
@@ -107,19 +114,21 @@ def train_siamese(audio_group = 'audio'):
    model.compile(loss=categorical_crossentropy, optimizer=rms, metrics=[accuracy])
    write_model_arch(model,model_dir+'/siamese_speech_model_arch.yaml')
    epoch_n_steps = step_count(n_records,batch_size)
    if resume_weights != '':
        model.load_weights(resume_weights)
    model.fit_generator(tr_gen
-                        , epochs=1000
+                        , epochs=10000
                        , steps_per_epoch=epoch_n_steps
                        , validation_data=([te_pairs[:, 0], te_pairs[:, 1]], te_y)
-                        , max_queue_size=32
+                        , max_queue_size=8
-                        , callbacks=[tb_cb, cp_cb])
+                        , callbacks=[tb_cb, cp_cb],initial_epoch=initial_epoch)
    model.save(model_dir+'/siamese_speech_model-final.h5')
-    y_pred = model.predict([te_pairs[:, 0], te_pairs[:, 1]])
+    # y_pred = model.predict([te_pairs[:, 0], te_pairs[:, 1]])
-    te_acc = compute_accuracy(te_y, y_pred)
+    # te_acc = compute_accuracy(te_y, y_pred)
-    print('* Accuracy on test set: %0.2f%%' % (100 * te_acc))
+    # print('* Accuracy on test set: %0.2f%%' % (100 * te_acc))
 if __name__ == '__main__':
-    train_siamese('story_phrases')
+    train_siamese('test_5_words')
--- a/speech_pitch.py
+++ b/speech_pitch.py
@@ -1,46 +1,158 @@
 import parselmouth as pm
 from pysndfile import sndio as snd
 import numpy as np
 import matplotlib.pyplot as plt
 import seaborn as sns
 import pyaudio as pa
 sns.set() # Use seaborn's default style to make graphs prettier
 def pm_snd(sample_file):
    # sample_file = 'inputs/self-apple/apple-low1.aiff'
    samples, samplerate, _  = snd.read(sample_file)
    return pm.Sound(values=samples,sampling_frequency=samplerate)
 def pitch_array(sample_file='outputs/audio/sunflowers-Victoria-180-normal-870.aiff'):
-    samples, samplerate, _  = snd.read(sample_file)
+    sample_sound = pm_snd(sample_file)
    sample_sound = pm.Sound(values=samples,sampling_frequency=samplerate)
    sample_pitch = sample_sound.to_pitch()
    return sample_pitch.to_matrix().as_array()
 def intensity_array(sample_file='outputs/audio/sunflowers-Victoria-180-normal-870.aiff'):
-    sample_file='outputs/audio/sunflowers-Victoria-180-normal-870.aiff'
+    sample_sound = pm_snd(sample_file)
    samples, samplerate, _  = snd.read(sample_file)
    sample_sound = pm.Sound(values=samples,sampling_frequency=samplerate)
    sample_intensity = sample_sound.to_intensity()
    return sample_intensity.as_array()
 def compute_mfcc(sample_file='outputs/audio/sunflowers-Victoria-180-normal-870.aiff'):
-    # sample_file='outputs/audio/sunflowers-Victoria-180-normal-870.aiff'
+    sample_sound = pm_snd(sample_file)
    samples, samplerate, _  = snd.read(sample_file)
    sample_sound = pm.Sound(values=samples,sampling_frequency=samplerate)
    sample_mfcc = sample_sound.to_mfcc()
    # sample_mfcc.to_array().shape
    return sample_mfcc.to_array()
-# sunflowers_vic_180_norm = pitch_array('outputs/audio/sunflowers-Victoria-180-normal-870.aiff')
+def compute_formants(sample_file='outputs/audio/sunflowers-Victoria-180-normal-870.aiff'):
-# sunflowers_fred_180_norm = pitch_array('outputs/audio/sunflowers-Fred-180-normal-6515.aiff')
+    # sample_file='outputs/audio/sunflowers-Victoria-180-normal-870.aiff'
-# sunflowers_vic_180_norm_mfcc = compute_mfcc('outputs/audio/sunflowers-Victoria-180-normal-870.aiff')
+    sample_sound = pm_snd(sample_file)
-fred_180_norm_mfcc = compute_mfcc('outputs/audio/sunflowers-Fred-180-normal-6515.aiff')
+    sample_formant = sample_sound.to_formant_burg()
-alex_mfcc = compute_mfcc('outputs/audio/sunflowers-Alex-180-normal-4763.aiff')
+    # sample_formant.x_bins()
-# # sunflowers_vic_180_norm.shape
+    return sample_formant.x_bins()
-# # sunflowers_fred_180_norm.shape
+
-# alex_mfcc.shape
+def draw_spectrogram(spectrogram, dynamic_range=70):
-# sunflowers_vic_180_norm_mfcc.shape
+    X, Y = spectrogram.x_grid(), spectrogram.y_grid()
-# sunflowers_fred_180_norm_mfcc.shape
+    sg_db = 10 * np.log10(spectrogram.values.T)
-from speech_spectrum import generate_aiff_spectrogram
+    plt.pcolormesh(X, Y, sg_db, vmin=sg_db.max() - dynamic_range, cmap='afmhot')
-vic_spec = generate_aiff_spectrogram('outputs/audio/sunflowers-Victoria-180-normal-870.aiff')
+    plt.ylim([spectrogram.ymin, spectrogram.ymax])
-alex_spec = generate_aiff_spectrogram('outputs/audio/sunflowers-Alex-180-normal-4763.aiff')
+    plt.xlabel("time [s]")
-alex150spec = generate_aiff_spectrogram('outputs/audio/sunflowers-Alex-150-normal-589.aiff')
+    plt.ylabel("frequency [Hz]")
-vic_spec.shape
+
-alex_spec.shape
+def draw_intensity(intensity):
-alex150spec.shape
+    plt.plot(intensity.xs(), intensity.values, linewidth=3, color='w')
-alex_mfcc.shape
+    plt.plot(intensity.xs(), intensity.values, linewidth=1)
-fred_180_norm_mfcc.shape
+    plt.grid(False)
-# pm.SoundFileFormat
+    plt.ylim(0)
-# pm.Pitch.get_number_of_frames()
+    plt.ylabel("intensity [dB]")
 def draw_pitch(pitch):
    # Extract selected pitch contour, and
    # replace unvoiced samples by NaN to not plot
    pitch_values = pitch.to_matrix().values
    pitch_values[pitch_values==0] = np.nan
    plt.plot(pitch.xs(), pitch_values, linewidth=3, color='w')
    plt.plot(pitch.xs(), pitch_values, linewidth=1)
    plt.grid(False)
    plt.ylim(0, pitch.ceiling)
    plt.ylabel("pitch [Hz]")
 def draw_formants(formant):
    # Extract the formant values and
    # replace zero-valued frames by NaN so they are not plotted
    formant_values = formant.to_matrix().values
    formant_values[formant_values==0] = np.nan
    plt.plot(formant.xs(), formant_values, linewidth=3, color='w')
    plt.plot(formant.xs(), formant_values, linewidth=1)
    plt.grid(False)
    plt.ylim(0)
    plt.ylabel("Formants [val]")
 def plot_sample_raw(sample_file='outputs/audio/sunflowers-Victoria-180-normal-870.aiff'):
    # %matplotlib inline
    # sample_file='outputs/audio/sunflowers-Victoria-180-normal-870.aiff
    snd_d = pm_snd(sample_file)
    plt.figure()
    plt.plot(snd_d.xs(), snd_d.values)
    plt.xlim([snd_d.xmin, snd_d.xmax])
    plt.xlabel("time [s]")
    plt.ylabel("amplitude")
    plt.show()
 def plot_file_intensity(sample_file='outputs/audio/sunflowers-Victoria-180-normal-870.aiff'):
    snd_d = pm_snd(sample_file)
    plot_sample_intensity(snd_d)
 def plot_sample_intensity(snd_d):
    intensity = snd_d.to_intensity()
    spectrogram = snd_d.to_spectrogram()
    plt.figure()
    draw_spectrogram(spectrogram)
    plt.twinx()
    draw_intensity(intensity)
    plt.xlim([snd_d.xmin, snd_d.xmax])
    plt.show()
 def plot_file_pitch(sample_file='outputs/audio/sunflowers-Victoria-180-normal-870.aiff'):
    snd_d = pm_snd(sample_file)
    plot_sample_pitch(snd_d)
 def plot_sample_pitch(snd_d,phons = []):
    pitch = snd_d.to_pitch()
    spectrogram = snd_d.to_spectrogram(window_length=0.03, maximum_frequency=8000)
    plt.figure()
    draw_spectrogram(spectrogram)
    plt.twinx()
    draw_pitch(pitch)
    for (p,c) in phons:
        plt.axvline(x=p)
        plt.text(p,-1,c)
    plt.xlim([snd_d.xmin, snd_d.xmax])
    plt.show()
 def play_sound(samplerate=22050):
    #snd_sample = pm_snd('outputs/test/a_warm_smile_and_a_good_heart-1917.aiff')
    p_oup = pa.PyAudio()
    stream = p_oup.open(
        format=pa.paFloat32,
        channels=2,
        rate=samplerate,
        output=True)
    def sample_player(snd_sample=None):
        samples = snd_sample.as_array()[:,0]
        one_channel = np.asarray([samples, samples]).T.reshape(-1)
        audio_data = one_channel.astype(np.float32).tobytes()
        stream.write(audio_data)
    def close_player():
        stream.close()
        p_oup.terminate()
    return sample_player,close_player
    # snd_part = snd_d.extract_part(from_time=0.9, preserve_times=True)
    # plt.figure()
    # plt.plot(snd_part.xs(), snd_part.values, linewidth=0.5)
    # plt.xlim([snd_part.xmin, snd_part.xmax])
    # plt.xlabel("time [s]")
    # plt.ylabel("amplitude")
    # plt.show()
 if __name__ == '__main__':
    # play_sound() opens the output stream once and returns (player, closer)
    sample_player, close_player = play_sound()
    plot_file_pitch('outputs/audio/sunflowers-Victoria-180-normal-870.aiff')
    plot_file_pitch('outputs/test/a_warm_smile_and_a_good_heart-1917.aiff')
    sample_player(pm_snd('outputs/test/a_warm_smile_and_a_good_heart-1917.aiff'))
    plot_file_pitch('outputs/test/a_wrong_turn-3763.aiff')
    sample_player(pm_snd('outputs/test/a_wrong_turn-3763.aiff'))
    plot_file_pitch('inputs/self/a_wrong_turn-low1.aiff')
    sample_player(pm_snd('inputs/self/a_wrong_turn-low1.aiff'))
    plot_file_pitch('inputs/self/a_wrong_turn-low2.aiff')
    sample_player(pm_snd('inputs/self/a_wrong_turn-low2.aiff'))
    plot_file_pitch('inputs/self/apple-low1.aiff')
    plot_file_pitch('inputs/self/apple-low2.aiff')
    plot_file_pitch('inputs/self/apple-medium1.aiff')
    close_player()
--- a/speech_samplegen.py
+++ b/speech_samplegen.py
@@ -5,7 +5,6 @@ from Foundation import NSURL
 import json
 import csv
 import random
 import string
 import os
 import re
 import subprocess
@@ -13,36 +12,12 @@ import time
 from tqdm import tqdm
 from generate_similar import similar_phoneme_phrase,similar_phrase
 from speech_tools import hms_string,create_dir,format_filename,reservoir_sample
-OUTPUT_NAME = 'story_phrases'
+OUTPUT_NAME = 'test_5_words'
 dest_dir = os.path.abspath('.') + '/outputs/' + OUTPUT_NAME + '/'
 dest_file = './outputs/' + OUTPUT_NAME + '.csv'
 def hms_string(sec_elapsed):
    h = int(sec_elapsed / (60 * 60))
    m = int((sec_elapsed % (60 * 60)) / 60)
    s = sec_elapsed % 60.
    return "{}:{:>02}:{:>05.2f}".format(h, m, s)
 def create_dir(direc):
    if not os.path.exists(direc):
        os.makedirs(direc)
 def format_filename(s):
    """
    Take a string and return a valid filename constructed from the string.
    Uses a whitelist approach: any characters not present in valid_chars are
    removed. Also spaces are replaced with underscores.
    Note: this method may produce invalid filenames such as ``, `.` or `..`
    When I use this method I prepend a date string like '2009_01_15_19_46_32_'
    and append a file extension like '.txt', so I avoid the potential of using
    an invalid filename.
    """
    valid_chars = "-_.() %s%s" % (string.ascii_letters, string.digits)
    filename = ''.join(c for c in s if c in valid_chars)
    filename = filename.replace(' ','_') # I don't like spaces in filenames.
    return filename
 def dest_filename(w, v, r, t):
    rand_no = str(random.randint(0, 10000))
@@ -241,6 +216,9 @@ def generate_audio_for_text_list(text_list):
    closer()
 def generate_audio_for_stories():
    '''
    Generates the audio sample variants for the list of words in the stories
    '''
    # story_file = './inputs/all_stories_hs.json'
    story_file = './inputs/all_stories.json'
    stories_data = json.load(open(story_file))
@@ -249,7 +227,11 @@ def generate_audio_for_stories():
    text_list = sorted(list(set(text_list_dup)))
    generate_audio_for_text_list(text_list)
-def generate_test_audio_for_stories():
+def generate_test_audio_for_stories(sample_count=0):
    '''
    Picks a list of words from the wordlist that are not in story words
    and generates the variants
    '''
    story_file = './inputs/all_stories_hs.json'
    # story_file = './inputs/all_stories.json'
    stories_data = json.load(open(story_file))
@@ -259,11 +241,12 @@ def generate_test_audio_for_stories():
    word_list = [i.strip('\n_') for i in open('./inputs/wordlist.txt','r').readlines()]
    text_set = set(text_list)
    new_word_list = [i for i in word_list if i not in text_set and len(i) > 4]
-    test_words = new_word_list[:int(len(text_list)/5+1)]
+    # test_words = new_word_list[:int(len(text_list)/5+1)]
    test_words = reservoir_sample(new_word_list,sample_count) if sample_count > 0 else new_word_list
    generate_audio_for_text_list(test_words)
 if __name__ == '__main__':
-    # generate_test_audio_for_stories()
+    generate_test_audio_for_stories(5)
    # generate_audio_for_text_list(['I want to go home','education'])
-    generate_audio_for_stories()
+    # generate_audio_for_stories()
--- a/speech_segmentgen.py
+++ b/speech_segmentgen.py
@@ -0,0 +1,237 @@
 import objc
 from AppKit import *
 from Foundation import NSURL
 from PyObjCTools import AppHelper
 from time import time
 import os
 import sys
 import random
 import json
 import csv
 import subprocess
 from tqdm import tqdm
 from speech_tools import create_dir,format_filename
 apple_phonemes = [
    '%', '@', 'AE', 'EY', 'AO', 'AX', 'IY', 'EH', 'IH', 'AY', 'IX', 'AA', 'UW',
    'UH', 'UX', 'OW', 'AW', 'OY', 'b', 'C', 'd', 'D', 'f', 'g', 'h', 'J', 'k',
    'l', 'm', 'n', 'N', 'p', 'r', 's', 'S', 't', 'T', 'v', 'w', 'y', 'z', 'Z'
 ]
 OUTPUT_NAME = 'story_test_segments'
 dest_dir = os.path.abspath('.') + '/outputs/' + OUTPUT_NAME + '/'
 csv_dest_file = os.path.abspath('.') + '/outputs/' + OUTPUT_NAME + '.csv'
 create_dir(dest_dir)
 def cli_gen_audio(speech_cmd, out_path):
    # arguments are passed without a shell, so the text needs no extra quoting
    subprocess.call(
        ['say', '-o', out_path, speech_cmd])
 class SpeechDelegate (NSObject):
    def speechSynthesizer_willSpeakWord_ofString_(self, sender, word, text):
        '''Called by the synthesizer just before it speaks a word'''
        # print("Speaking word {} in sentence {}".format(word,text))
        self.wordWillSpeak()
    def speechSynthesizer_willSpeakPhoneme_(self, sender, phoneme):
        phon_ch = apple_phonemes[phoneme]
        self.phonemeWillSpeak(phon_ch)
    def speechSynthesizer_didFinishSpeaking_(self, synth, didFinishSpeaking):
        if didFinishSpeaking:
            self.completeCB()
    def setC_W_Ph_(self, completed, word, phoneme):
        self.completeCB = completed
        self.wordWillSpeak = word
        self.phonemeWillSpeak = phoneme
 # del SpeechDelegate
 class Delegate (NSObject):
    def applicationDidFinishLaunching_(self, aNotification):
        '''Called automatically when the application has launched'''
        print("App Launched!")
        # phrases = story_texts()#random.sample(story_texts(), 100)  #
        # phrases = test_texts(30)
        phrases = story_words()
        # print(phrases)
        generate_audio(phrases)
 class PhonemeTiming(object):
    """Start time, duration and relative share of one spoken phoneme."""
    def __init__(self, phon, start):
        super(PhonemeTiming, self).__init__()
        self.phoneme = phon
        self.start = start
        self.fraction = 0
        self.duration = None
        self.end = None
    def is_audible(self):
        return self.phoneme not in ['%', '~']
    def tune(self):
        if self.is_audible():
            dur_ms = int(self.duration * 1000)
            return '{} {{D {}}}'.format(self.phoneme, dur_ms)
        else:
            return '~'
    def __repr__(self):
        return '[{}]({:0.4f})'.format(self.phoneme, self.fraction)
    @staticmethod
    def to_tune(phone_ts):
        tune_list = ['[[inpt TUNE]]']
        for ph in phone_ts:
            tune_list.append(ph.tune())
        tune_list.append('[[inpt TEXT]]')
        return '\n'.join(tune_list)
 class SegData(object):
    """Text, output filename and phoneme segments of one synthesized utterance."""
    def __init__(self, text, filename):
        super(SegData, self).__init__()
        self.text = text
        self.tune = ''
        self.filename = filename
        self.segments = []
    def csv_rows(self):
        result = []
        s_tim = self.segments[0].start
        for i in range(len(self.segments) - 1):
            cs = self.segments[i]
            # if cs.is_audible():
            ns = self.segments[i + 1]
            row = [self.text, self.filename, cs.phoneme, ns.phoneme,
                   (cs.start - s_tim) * 1000, (cs.end - s_tim) * 1000]
            result.append(row)
        return result
 class SynthesizerQueue(object):
    """Speaks queued texts and records per-phoneme timings via SpeechDelegate callbacks."""
    def __init__(self):
        super(SynthesizerQueue, self).__init__()
        self.synth = NSSpeechSynthesizer.alloc().init()
        self.didComplete = None
        q_delg = SpeechDelegate.alloc().init()
        self.synth.setDelegate_(q_delg)
        def synth_complete():
            end_time = time()
            for i in range(len(self.phoneme_timing)):
                if i == len(self.phoneme_timing) - 1:
                    self.phoneme_timing[i].duration = end_time - \
                        self.phoneme_timing[i].start
                    self.phoneme_timing[i].end = end_time
                else:
                    self.phoneme_timing[i].duration = self.phoneme_timing[i +
                                                                          1].start - self.phoneme_timing[i].start
                    self.phoneme_timing[i].end = self.phoneme_timing[i + 1].start
            total_time = sum(
                [i.duration for i in self.phoneme_timing if i.is_audible()])
            for ph in self.phoneme_timing:
                if ph.is_audible():
                    ph.fraction = ph.duration / total_time
            if self.didComplete:
                self.data.segments = self.phoneme_timing
                self.data.tune = PhonemeTiming.to_tune(self.phoneme_timing)
                self.didComplete(self.data)
        def will_speak_phoneme(phon):
            phtm = PhonemeTiming(phon, time())
            self.phoneme_timing.append(phtm)
        def will_speak_word():
            pass
            # no-op: this callback fires after the word's first phoneme has already started
            # phtm = PhonemeTiming('~', time())
            # self.phoneme_timing.append(phtm)
        q_delg.setC_W_Ph_(synth_complete, will_speak_word, will_speak_phoneme)
    def queueTask(self, text):
        rand_no = str(random.randint(0, 10000))
        fname = '{}-{}.aiff'.format(text, rand_no)
        sanitized = format_filename(fname)
        dest_file = dest_dir + sanitized
        cli_gen_audio(text, dest_file)
        self.phoneme_timing = []
        self.data = SegData(text, sanitized)
        self.synth.startSpeakingString_(text)
 def story_texts():
    story_file = './inputs/all_stories.json'
    stories_data = json.load(open(story_file))
    text_list_dup = [t for i in stories_data.values() for t in i]
    text_list = sorted(list(set(text_list_dup)))
    return text_list
 def story_words():
    story_file = './inputs/all_stories_hs.json'
    stories_data = json.load(open(story_file))
    text_list_dup = [t[0] for i in stories_data.values() for t in i]
    text_list = sorted(list(set(text_list_dup)))
    return text_list
 def test_texts(count=10):
    word_list = [i.strip('\n_') for i in open('./inputs/wordlist.txt','r').readlines()]
    text_list = sorted(random.sample(list(set(word_list)),count))
    return text_list
 def generate_audio(phrases):
    synthQ = SynthesizerQueue()
    f = open(csv_dest_file, 'w')
    s_csv_w = csv.writer(f, quoting=csv.QUOTE_MINIMAL)
    i = 0
    p = tqdm(total=len(phrases))
    def nextTask(seg_data=None):
        nonlocal i
        # write the finished phrase's rows first, before the file may be closed below
        if seg_data:
            s_csv_w.writerows(seg_data.csv_rows())
        if i < len(phrases):
            p.set_postfix(phrase=phrases[i])
            p.update()
            synthQ.queueTask(phrases[i])
            i += 1
        else:
            p.close()
            f.close()
            dg = NSApplication.sharedApplication().delegate
            print('App terminated.')
            NSApp().terminate_(dg)
    synthQ.didComplete = nextTask
    nextTask()
 def main():
    # Create a new application instance ...
    a = NSApplication.sharedApplication()
    # ... and create its delegate.  Note the use of the
    # Objective-C constructors below, because Delegate
    # is a subclass of an Objective-C class, NSObject
    delegate = Delegate.alloc().init()
    # Tell the application which delegate object to use.
    a.setDelegate_(delegate)
    AppHelper.runEventLoop()
 if __name__ == '__main__':
    main()
--- a/speech_similar.py
+++ b/speech_similar.py
@@ -0,0 +1,143 @@
 import pandas as pd
 import pronouncing
 import re
 import numpy as np
 import random
 # mapping = {
 #     s.split()[0]: s.split()[1]
 #     for s in """
 # AA AA
 # AE AE
 # AH UX
 # AO AO
 # AW AW
 # AY AY
 # B  b
 # CH C
 # D  d
 # DH D
 # EH EH
 # ER UXr
 # EY EY
 # F  f
 # G  g
 # HH h
 # IH IH
 # IY IY
 # JH J
 # K  k
 # L  l
 # M  m
 # N  n
 # NG N
 # OW OW
 # OY OY
 # P  p
 # R  r
 # S  s
 # SH S
 # T  t
 # TH T
 # UH UH
 # UW UW
 # V  v
 # W  w
 # Y  y
 # Z  z
 # ZH Z
 # """.strip().split('\n')
 # }
 # sim_mat = pd.read_csv('./similarity.csv', header=0, index_col=0)
 #
 #
 # def convert_ph(ph):
 #     stress_level = re.search("(\w+)([0-9])", ph)
 #     if stress_level:
 #         return stress_level.group(2) + mapping[stress_level.group(1)]
 #     else:
 #         return mapping[ph]
 #
 #
 # def sim_mat_to_apple_table(smt):
 #     colnames = [convert_ph(ph) for ph in smt.index.tolist()]
 #     smt = pd.DataFrame(np.nan_to_num(smt.values))
 #     fsmt = (smt.T + smt)
 #     np.fill_diagonal(fsmt.values, 100.0)
 #     asmt = pd.DataFrame.copy(fsmt)
 #     asmt.columns = colnames
 #     asmt.index = colnames
 #     apple_sim_table = asmt.stack().reset_index()
 #     apple_sim_table.columns = ['q', 'r', 's']
 #     return apple_sim_table
 #
 #
 # apple_sim_table = sim_mat_to_apple_table(sim_mat)
 #
 #
 # def top_match(ph):
 #     selected = apple_sim_table[(apple_sim_table.q == ph)
 #                                & (apple_sim_table.s < 100) &
 #                                (apple_sim_table.s >= 70)]
 #     tm = ph
 #     if len(selected) > 0:
 #         tm = pd.DataFrame.sort_values(selected, 's', ascending=False).iloc[0].r
 #     return tm
 apple_phonemes = [
    '%', '@', 'AE', 'EY', 'AO', 'AX', 'IY', 'EH', 'IH', 'AY', 'IX', 'AA', 'UW',
    'UH', 'UX', 'OW', 'AW', 'OY', 'b', 'C', 'd', 'D', 'f', 'g', 'h', 'J', 'k',
    'l', 'm', 'n', 'N', 'p', 'r', 's', 'S', 't', 'T', 'v', 'w', 'y', 'z', 'Z'
 ]
 class ApplePhoneme(object):
    """An Apple phoneme with its stress level and a vowel flag."""
    def __init__(self, phone, stress, vowel=False):
        super(ApplePhoneme, self).__init__()
        self.phone = phone
        self.stress = stress
        self.vowel = vowel
    def __str__(self):
        return (str(self.stress) if (self.vowel and self.stress>0) else '') + self.phone
    def __repr__(self):
        return "'{}'".format(str(self))
    def adjust_stress(self):
        self.stress = random.choice([i for i in range(3) if i != self.stress])
 def parse_apple_phonemes(ph_str):
    for i in range(len(ph_str)):
        pref, rest = ph_str[:i + 1], ph_str[i + 1:]
        if pref in apple_phonemes:
            vowel = pref[0] in 'AEIOU'
            return [ApplePhoneme(pref, 0, vowel)] + parse_apple_phonemes(rest)
        elif pref[0].isdigit() and pref[1:] in apple_phonemes:
            return [ApplePhoneme(pref[1:], int(pref[0]) , True)] + parse_apple_phonemes(rest)
        elif not pref.isalnum():
            return [ApplePhoneme(pref, -1, False)] + parse_apple_phonemes(rest)
    return []
 def segmentable_phoneme(ph_str):
    return [p for p in parse_apple_phonemes(ph_str) if p.stress >=0]
 def similar_phoneme_word(ph_str):
    phons = parse_apple_phonemes(ph_str)
    vowels = [i for i in phons if i.vowel]
    random.choice(vowels).adjust_stress()
    return ''.join([str(i) for i in phons])
 def similar_phoneme_phrase(ph_str):
    return ' '.join([similar_phoneme_word(w) for w in ph_str.split()])
 def similar_word(word_str):
    similar = pronouncing.rhymes(word_str)
    return random.choice(similar) if len(similar) > 0 else word_str
 def similar_phrase(ph_str):
    return ' '.join([similar_word(w) for w in ph_str.split()])
--- a/speech_spectrum.py
+++ b/speech_spectrum.py
@@ -79,6 +79,9 @@ def generate_spec_frec(samples, samplerate):
    ims[ims < 0] = 0  #np.finfo(sshow.dtype).eps
    return ims, freq
 def generate_sample_spectrogram(samples):
    ims, _ = generate_spec_frec(samples, 22050)
    return ims
 def generate_aiff_spectrogram(audiopath):
    samples, samplerate, _ = snd.read(audiopath)
--- a/speech_test.py
+++ b/speech_test.py
@@ -1,5 +1,5 @@
 from speech_model import load_model_arch
-from speech_tools import record_spectrogram, file_player
+from speech_tools import record_spectrogram, file_player, padd_zeros, pair_for_word
 from speech_data import record_generator_count
 # from importlib import reload
 # import speech_data
@@ -20,6 +20,21 @@ def predict_recording_with(m,sample_size=15):
    inp = create_test_pair(spec1,spec2,sample_size)
    return m.predict([inp[:, 0], inp[:, 1]])
 def predict_tts_sample(sample_word = 'able',audio_group='story_words',weights = 'siamese_speech_model-153-epoch-0.55-acc.h5'):
    # sample_word = 'able';audio_group='story_words';weights = 'siamese_speech_model-153-epoch-0.55-acc.h5'
    const_file = './models/'+audio_group+'/constants.pkl'
    arch_file='./models/'+audio_group+'/siamese_speech_model_arch.yaml'
    weight_file='./models/'+audio_group+'/'+weights
    (sample_size,n_features,n_records) = pickle.load(open(const_file,'rb'))
    model = load_model_arch(arch_file)
    model.load_weights(weight_file)
    spec1,spec2 = pair_for_word(sample_word)
    p_spec1 = padd_zeros(spec1,sample_size)
    p_spec2 = padd_zeros(spec2,sample_size)
    inp = np.array([[p_spec1,p_spec2]])
    result = model.predict([inp[:, 0], inp[:, 1]])[0]
    res_str = 'same' if result[0] < result[1] else 'diff'
    return res_str
 def test_with(audio_group):
    X,Y = speech_data(audio_group)
@@ -29,7 +44,7 @@ def test_with(audio_group):
 def evaluate_siamese(records_file,audio_group='audio',weights = 'siamese_speech_model-final.h5'):
    # audio_group='audio';model_file = 'siamese_speech_model-305-epoch-0.20-acc.h5'
    # records_file  = os.path.join('./outputs',eval_group+'.train.tfrecords')
-    const_file = os.path.join('./models/'+audio_group+'/',audio_group+'.constants')
+    const_file = os.path.join('./models/'+audio_group+'/','constants.pkl')
    arch_file='./models/'+audio_group+'/siamese_speech_model_arch.yaml'
    weight_file='./models/'+audio_group+'/'+weights
    (n_spec,n_features,n_records) = pickle.load(open(const_file,'rb'))
@@ -41,7 +56,6 @@ def evaluate_siamese(records_file,audio_group='audio',weights = 'siamese_speech_
    total,same_success,diff_success,skipped,same_failed,diff_failed = 0,0,0,0,0,0
    all_results = []
    for (i,string_record) in tqdm(enumerate(record_iterator),total=records_count):
        # string_record = next(record_iterator)
        total+=1
        example = tf.train.Example()
        example.ParseFromString(string_record)
@@ -178,7 +192,7 @@ def visualize_results(audio_group='audio'):
 if __name__ == '__main__':
    # evaluate_siamese('./outputs/story_words_test.train.tfrecords',audio_group='story_words.gpu',weights ='siamese_speech_model-58-epoch-0.00-acc.h5')
    # evaluate_siamese('./outputs/story_words.test.tfrecords',audio_group='story_words',weights ='siamese_speech_model-675-epoch-0.00-acc.h5')
-    evaluate_siamese('./outputs/story_words_test.train.tfrecords',audio_group='story_phrases',weights ='siamese_speech_model-231-epoch-0.00-acc.h5')
+    evaluate_siamese('./outputs/story_words.test.tfrecords',audio_group='story_words',weights ='siamese_speech_model-153-epoch-0.55-acc.h5')
    # play_results('story_words')
    #inspect_tfrecord('./outputs/story_phrases.test.tfrecords',audio_group='story_phrases')
    # visualize_results('story_words.gpu')
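The pairing step in `predict_tts_sample` can be sketched in isolation to show the array shapes that reach `model.predict`. This is a minimal sketch, not the project's code: `pad_zeros` is a stand-in for `padd_zeros`, and `sample_size`, `n_features`, and the two arrays are illustrative values rather than the contents of `constants.pkl` or real spectrograms.

```python
import numpy as np


def pad_zeros(spgr, max_samples):
    # Stand-in for padd_zeros: append zero rows along the time axis only.
    return np.pad(spgr, [(0, max_samples - spgr.shape[0]), (0, 0)], 'constant')


# Illustrative values (assumptions), not read from constants.pkl.
sample_size, n_features = 15, 4
spec1 = np.ones((9, n_features))    # stand-in for the "good" spectrogram
spec2 = np.ones((12, n_features))   # stand-in for the "test" spectrogram

# One batch entry holding the padded pair, as in predict_tts_sample.
pair = np.array([[pad_zeros(spec1, sample_size), pad_zeros(spec2, sample_size)]])

# The siamese model takes the two sides as separate inputs.
left, right = pair[:, 0], pair[:, 1]
print(pair.shape, left.shape, right.shape)
```

Both sides must be padded to the same `sample_size` before stacking. A spectrogram longer than `sample_size` produces a negative pad width, which `np.pad` rejects, so such inputs would need to be truncated or skipped first.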
--- a/speech_testgen.py
+++ b/speech_testgen.py
@@ -0,0 +1,50 @@
 import voicerss_tts
 import json
 from speech_tools import format_filename
 def generate_voice(phrase):
    voice = voicerss_tts.speech({
        'key': '0ae89d82aa78460691c99a4ac8c0f9ec',
        'hl': 'en-us',
        'src': phrase,
        'r': '0',
        'c': 'mp3',
        'f': '22khz_16bit_mono',
        'ssml': 'false',
        'b64': 'false'
    })
    # voicerss_tts.__request stores its outcome under bytes keys (b'error', b'response')
    if not voice.get(b'error'):
        return voice.get(b'response')
    return None
 def generate_test_audio_for_stories():
    story_file = './inputs/all_stories_hs.json'
    # story_file = './inputs/all_stories.json'
    stories_data = json.load(open(story_file))
    text_list_dup = [t[0] for i in stories_data.values() for t in i]
    text_list = sorted(list(set(text_list_dup)))[:10]
    for t in text_list:
        v = generate_voice(t)
        if v:
            f_name = format_filename(t)
            with open('inputs/voicerss/'+f_name+'.mp3','wb') as out_f:
                out_f.write(v)
 # def generate_test_audio_for(records_file,audio_group='audio'):
 #     # audio_group='audio';model_file = 'siamese_speech_model-305-epoch-0.20-acc.h5'
 #     # records_file  = os.path.join('./outputs',eval_group+'.train.tfrecords')
 #     const_file = os.path.join('./models/'+audio_group+'/','constants.pkl')
 #     (n_spec,n_features,n_records) = pickle.load(open(const_file,'rb'))
 #     print('evaluating {}...'.format(records_file))
 #     record_iterator,records_count = record_generator_count(records_file)
 #     all_results = []
 #     for (i,string_record) in tqdm(enumerate(record_iterator),total=records_count):
 #         total+=1
 #         example = tf.train.Example()
 #         example.ParseFromString(string_record)
 #         word = example.features.feature['word'].bytes_list.value[0].decode()
 # audio = generate_voice('hello world')
 # audio
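The text selection and file naming in `generate_test_audio_for_stories` can be tried without calling the TTS service. The sketch below assumes a small in-memory `stories_data` dict (hypothetical, standing in for `./inputs/all_stories_hs.json`) and repeats the body of `format_filename` so that it runs on its own.

```python
import string


def format_filename(s):
    # Same whitelist approach as format_filename in speech_tools.
    valid_chars = "-_.() %s%s" % (string.ascii_letters, string.digits)
    filename = ''.join(c for c in s if c in valid_chars)
    return filename.replace(' ', '_')


# Hypothetical stand-in for the stories JSON: each value is a list of
# entries whose first element is the text to synthesize.
stories_data = {
    'story_a': [['The cat sat.', 1], ['Hello, world!', 2]],
    'story_b': [['The cat sat.', 3]],
}

# Flatten, deduplicate, sort, and keep at most the first 10 texts.
text_list_dup = [t[0] for i in stories_data.values() for t in i]
text_list = sorted(set(text_list_dup))[:10]

file_names = [format_filename(t) + '.mp3' for t in text_list]
print(file_names)
```

Because `.` is in the whitelist, a sentence ending in a period yields a name with a double dot before the extension (`The_cat_sat..mp3`), which is valid but may be worth stripping.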
--- a/speech_tools.py
+++ b/speech_tools.py
@@ -1,17 +1,23 @@
 import os
 import math
 import string
 import threading
 import itertools
 import random
 import multiprocessing
 import subprocess
 import pandas as pd
 import numpy as np
 import pyaudio
 from pysndfile import sndio as snd
 # from matplotlib import pyplot as plt
-from speech_spectrum import plot_stft, generate_spec_frec
+from speech_spectrum import plot_stft, generate_spec_frec,generate_aiff_spectrogram
 SAMPLE_RATE = 22050
 N_CHANNELS = 2
 devnull = open(os.devnull, 'w')
 def step_count(n_records,batch_size):
    return int(math.ceil(n_records*1.0/batch_size))
@@ -35,6 +41,31 @@ def file_player():
        p_oup.terminate()
    return play_file,close_player
 def reservoir_sample(iterable, k):
    it = iter(iterable)
    if not (k > 0):
        raise ValueError("sample size must be positive")
    sample = list(itertools.islice(it, k)) # fill the reservoir
    random.shuffle(sample) # if the number of items is less than *k*,
                           #   return all items in random order.
    for i, item in enumerate(it, start=k+1):
        j = random.randrange(i) # random [0..i)
        if j < k:
            sample[j] = item # replace item with gradually decreasing probability
    return sample
 def padd_zeros(spgr, max_samples):
    return np.pad(spgr, [(0, max_samples - spgr.shape[0]), (0, 0)],
                  'constant')
 def read_seg_file(aiff_name):
    base_name = aiff_name.rsplit('.aiff',1)[0]
    seg_file = base_name+'-palign.csv'
    seg_data = pd.read_csv(seg_file,names=['action','start','end','phoneme'])
    seg_data = seg_data[(seg_data['action'] == 'PhonAlign') & (seg_data['phoneme'] != '#')]
    return seg_data
 def record_spectrogram(n_sec, plot=False, playback=False):
    # show_record_prompt()
    N_SEC = n_sec
@@ -70,6 +101,20 @@ def record_spectrogram(n_sec, plot=False, playback=False):
    ims, _ = generate_spec_frec(one_channel, SAMPLE_RATE)
    return ims
 def pair_for_word(phrase='able'):
    spec1 = generate_aiff_spectrogram('./inputs/pairs/good/'+phrase+'.aiff')
    spec2 = generate_aiff_spectrogram('./inputs/pairs/test/'+phrase+'.aiff')
    return spec1,spec2
 def transribe_audio_text(aiff_name,phrase):
    base_name = aiff_name.rsplit('.aiff',1)[0]
    wav_name = base_name+'.wav'
    txt_name = base_name+'.txt'
    params = ['ffmpeg', '-y', '-i',aiff_name,wav_name]
    subprocess.call(params,stdout=devnull,stderr=devnull)
    trcr_f = open(txt_name,'w')
    trcr_f.write(phrase)
    trcr_f.close()
 def _apply_df(args):
    df, func, num, kwargs = args
@@ -87,10 +132,15 @@ def apply_by_multiprocessing(df,func,**kwargs):
 def square(x):
    return x**2
-if __name__ == '__main__':
+# if __name__ == '__main__':
-    df = pd.DataFrame({'a':range(10), 'b':range(10)})
+#     df = pd.DataFrame({'a':range(10), 'b':range(10)})
-    apply_by_multiprocessing(df, square, axis=1, workers=4)
+#     apply_by_multiprocessing(df, square, axis=1, workers=4)
 def hms_string(sec_elapsed):
    h = int(sec_elapsed / (60 * 60))
    m = int((sec_elapsed % (60 * 60)) / 60)
    s = sec_elapsed % 60.
    return "{}:{:>02}:{:>05.2f}".format(h, m, s)
 def rm_rf(d):
    for path in (os.path.join(d,f) for f in os.listdir(d)):
@@ -108,6 +158,22 @@ def create_dir(direc):
        create_dir(direc)
 def format_filename(s):
    """
    Take a string and return a valid filename constructed from the string.
    Uses a whitelist approach: any characters not present in valid_chars are
    removed. Also spaces are replaced with underscores.
    Note: this method may produce invalid filenames such as ``, `.` or `..`
    When I use this method I prepend a date string like '2009_01_15_19_46_32_'
    and append a file extension like '.txt', so I avoid the potential of using
    an invalid filename.
    """
    valid_chars = "-_.() %s%s" % (string.ascii_letters, string.digits)
    filename = ''.join(c for c in s if c in valid_chars)
    filename = filename.replace(' ','_') # I don't like spaces in filenames.
    return filename
 #################### Now make the data generator threadsafe ####################
 class threadsafe_iter:
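`reservoir_sample` follows the classic single-pass reservoir scheme: fill a buffer with the first `k` items, then let the `i`-th item replace a random slot with probability `k/i`. The sketch below repeats the function and checks empirically that each item ends up in the sample about equally often. The trial count, seed, and tolerance are arbitrary choices for illustration.

```python
import itertools
import random
from collections import Counter


def reservoir_sample(iterable, k):
    it = iter(iterable)
    if k <= 0:
        raise ValueError("sample size must be positive")
    sample = list(itertools.islice(it, k))  # fill the reservoir
    random.shuffle(sample)
    for i, item in enumerate(it, start=k + 1):
        j = random.randrange(i)             # uniform in [0, i)
        if j < k:
            sample[j] = item                # happens with probability k / i
    return sample


random.seed(0)
trials = 20000
counts = Counter()
for _ in range(trials):
    counts.update(reservoir_sample(range(10), 3))

# Each of the 10 items should be selected in roughly 3/10 of the trials.
freqs = {item: counts[item] / trials for item in range(10)}
print(freqs)

# With fewer than k items, every item is returned (in random order).
short = reservoir_sample(range(2), 5)
print(sorted(short))
```

The function needs only one pass and `k` items of memory, which suits record iterators whose length is not known up front.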
--- a/voicerss_tts.py
+++ b/voicerss_tts.py
@@ -0,0 +1,52 @@
 import http.client, urllib.request, urllib.parse, urllib.error
 def speech(settings):
 	__validate(settings)
 	return __request(settings)
 def __validate(settings):
 	if not settings: raise RuntimeError('The settings are undefined')
 	if 'key' not in settings or not settings['key']: raise RuntimeError('The API key is undefined')
 	if 'src' not in settings or not settings['src']: raise RuntimeError('The text is undefined')
 	if 'hl' not in settings or not settings['hl']: raise RuntimeError('The language is undefined')
 def __request(settings):
 	result = {'error': None, 'response': None}
 	headers = {'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8'}
 	params = urllib.parse.urlencode(__buildRequest(settings))
 	if 'ssl' in settings and settings['ssl']:
 		conn = http.client.HTTPSConnection('api.voicerss.org:443')
 	else:
 		conn = http.client.HTTPConnection('api.voicerss.org:80')
 	conn.request('POST', '/', params, headers)
 	response = conn.getresponse()
 	content = response.read()
 	if response.status != 200:
 		result[b'error'] = response.reason
 	elif content.find(b'ERROR') == 0:
 		result[b'error'] = content
 	else:
 		result[b'response'] = content
 	conn.close()
 	return result
 def __buildRequest(settings):
 	params = {'key': '', 'src': '', 'hl': '', 'r': '', 'c': '', 'f': '', 'ssml': '', 'b64': ''}
 	if 'key' in settings: params['key'] = settings['key']
 	if 'src' in settings: params['src'] = settings['src']
 	if 'hl' in settings: params['hl'] = settings['hl']
 	if 'r' in settings: params['r'] = settings['r']
 	if 'c' in settings: params['c'] = settings['c']
 	if 'f' in settings: params['f'] = settings['f']
 	if 'ssml' in settings: params['ssml'] = settings['ssml']
 	if 'b64' in settings: params['b64'] = settings['b64']
 	return params
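One detail of `__request` in this Python 3 port deserves a note: the result dict is created with `str` keys but filled with `bytes` keys, and in Python 3 `'error'` and `b'error'` are different dictionary keys. The sketch below shows the effect on a plain dict; `get_field` is a hypothetical helper, not part of `voicerss_tts`, that reads either variant.

```python
def make_result():
    # Same initial shape as the result dict in __request.
    return {'error': None, 'response': None}


def get_field(result, name):
    """Return result[name], falling back to the bytes-key variant (hypothetical helper)."""
    value = result.get(name)
    if value is None:
        value = result.get(name.encode('ascii'))
    return value


result = make_result()
result[b'error'] = b'ERROR: example'  # adds a new entry; 'error' stays None

print(result['error'])             # still None
print(len(result))                 # three keys: 'error', 'response', b'error'
print(get_field(result, 'error'))  # finds the bytes-key value
```

A caller that checks only `result['error']` never sees a failure recorded this way, so either the keys should be made consistent in `__request` or callers should read the bytes keys.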
--- a/voicerss_tts.py.bak
+++ b/voicerss_tts.py.bak
@@ -0,0 +1,52 @@
 import httplib, urllib
 def speech(settings):
 	__validate(settings)
 	return __request(settings)
 def __validate(settings):
 	if not settings: raise RuntimeError('The settings are undefined')
 	if 'key' not in settings or not settings['key']: raise RuntimeError('The API key is undefined')
 	if 'src' not in settings or not settings['src']: raise RuntimeError('The text is undefined')
 	if 'hl' not in settings or not settings['hl']: raise RuntimeError('The language is undefined')
 def __request(settings):
 	result = {'error': None, 'response': None}
 	headers = {'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8'}
 	params = urllib.urlencode(__buildRequest(settings))
 	if 'ssl' in settings and settings['ssl']:
 		conn = httplib.HTTPSConnection('api.voicerss.org:443')
 	else:
 		conn = httplib.HTTPConnection('api.voicerss.org:80')
 	conn.request('POST', '/', params, headers)
 	response = conn.getresponse()
 	content = response.read()
 	if response.status != 200:
 		result['error'] = response.reason
 	elif content.find('ERROR') == 0:
 		result['error'] = content
 	else:
 		result['response'] = content
 	conn.close()
 	return result
 def __buildRequest(settings):
 	params = {'key': '', 'src': '', 'hl': '', 'r': '', 'c': '', 'f': '', 'ssml': '', 'b64': ''}
 	if 'key' in settings: params['key'] = settings['key']
 	if 'src' in settings: params['src'] = settings['src']
 	if 'hl' in settings: params['hl'] = settings['hl']
 	if 'r' in settings: params['r'] = settings['r']
 	if 'c' in settings: params['c'] = settings['c']
 	if 'f' in settings: params['f'] = settings['f']
 	if 'ssml' in settings: params['ssml'] = settings['ssml']
 	if 'b64' in settings: params['b64'] = settings['b64']
 	return params
Author	SHA1	Message	Date
Malar Kannan	225a720f18	updated README to include testing	2017-12-29 16:21:38 +05:30
Malar Kannan	b267b89a44	Merge branch 'master' of /home/ilml/Public/Repos/speech_scoring	2017-12-29 13:15:51 +05:30
Malar Kannan	eb10b577ae	Added README.md describing the workflow	2017-12-29 13:14:37 +05:30
Malar Kannan	ee2eb63f66	Merge branch 'master' of ssh://invnuc/~/Public/Repos/speech_scoring	2017-12-28 20:02:44 +05:30
Malar Kannan	2ae269d939	generating test for phone seg model	2017-12-28 20:01:44 +05:30
Malar Kannan	40d7933870	saving model on better 'acc'	2017-12-28 20:00:19 +05:30
Malar Kannan	4dd4bb5963	implemented phoneme segmented training on samples	2017-12-28 18:53:54 +05:30
Malar Kannan	0600482fe5	generating segmentation for words	2017-12-28 13:37:27 +05:30
Malar Kannan	507da49cfa	added voicerss tts support for test data generation	2017-12-26 14:32:56 +05:30
Malar Kannan	f44665e9b2	1. fixed softmax output and overfit the model for small sample 2. updated to run on complete data	2017-12-12 12:18:27 +05:30
Malar Kannan	cc4fbe45b9	trying to overfit 2 samples with model -> doesn't seem to converge	2017-12-11 15:03:14 +05:30
Malar Kannan	8d550c58cc	fixed batch normalization layer before activation	2017-12-11 14:33:56 +05:30
Malar Kannan	240ecb3f27	removed bn output layer	2017-12-11 14:12:23 +05:30
Malar Kannan	05242d5991	added batch normalization	2017-12-11 14:09:04 +05:30
Malar Kannan	fea9184aec	using the full data and fixed typo in model layer name	2017-12-11 13:47:30 +05:30
Malar Kannan	a6543491f8	fixed empty phoneme boundary case	2017-12-11 13:05:46 +05:30
Malar Kannan	d387922f7d	added dense-relu/softmax layers to segment output	2017-12-11 12:30:08 +05:30
Malar Kannan	52bbb69c65	resuming segment training	2017-12-10 21:58:55 +05:30
Malar Kannan	03edd935ea	fixed input_dim	2017-12-07 17:16:05 +05:30
Malar Kannan	a7f1451a7f	fixed exception in data generation	2017-12-07 16:49:34 +05:30
Malar Kannan	91fde710f3	completed the segmentation model	2017-12-07 15:17:59 +05:30
Malar Kannan	c8a07b3d7b	Merge branch 'master' of ssh://invnuc/~/Public/Repos/speech_scoring	2017-12-07 12:00:59 +05:30
Malar Kannan	8785522196	Merge branch 'master' of /home/ilml/Public/Repos/speech_scoring	2017-12-07 12:00:44 +05:30
Malar Kannan	435c4a4aa6	added a resume parameter for training	2017-12-07 12:00:42 +05:30
Malar Kannan	c1801b5aa3	implented segment tfrecords batch data-generator	2017-12-07 11:48:19 +05:30
Malar Kannan	c0369d7a66	Merge branch 'master' of ssh://gpuaws/~/repos/speech_scoring	2017-12-06 17:33:27 +05:30
Malar Kannan	8e14db2437	Merge branch 'master' of ssh://invmac/~/Public/repos/speech-scoring	2017-12-06 17:32:46 +05:30
Malar Kannan	bcf1041bde	created segment sample tfrecord writer	2017-12-06 17:32:26 +05:30
Malar Kannan	b50edb980d	implemented segment-generation for random words for testing	2017-12-06 14:41:25 +05:30
Malar Kannan	3f76207f0d	using pitch contour instead of spectrogram	2017-12-04 19:15:17 +05:30
Malar Kannan	6ef4e86f41	implemented segmentation visualization	2017-11-30 14:49:55 +05:30
Malar Kannan	0b1152b5c3	implemented the model, todo implement ctc and training queueing logic	2017-11-28 19:10:19 +05:30
Malar Kannan	1928fce4e8	Merge branch 'master' of ssh://invnuc/~/Public/Repos/speech_scoring	2017-11-28 17:05:35 +05:30
Malar Kannan	ec7303223c	merged	2017-11-28 17:05:20 +05:30
Malar Kannan	f12da988d3	segmentation model wip	2017-11-28 15:46:39 +05:30
Malar Kannan	705cf3d172	finding exact duration of sound sample	2017-11-28 12:52:20 +05:30
Malar Kannan	8f79316893	Merge branch 'master' of /Users/malarkannan/Public/repos/speech-scoring	2017-11-28 12:32:50 +05:30
Malar Kannan	0345cc46ae	implemented tts sementation generation code	2017-11-28 12:32:45 +05:30
Malar Kannan	20b2d7a958	updated model data	2017-11-27 14:08:01 +05:30
Malar Kannan	43d5b75db9	removing spec_n counter	2017-11-24 11:06:42 +00:00
Malar Kannan	ec08cc7d62	Merge branch 'master' of ssh://gpuaws/~/repos/speech_scoring	2017-11-24 14:32:43 +05:30
Malar Kannan	2268ad8bb0	implemented pitch plotting	2017-11-24 14:32:13 +05:30
Malar Kannan	ec317b6628	Merge branch 'master' of /home/ilml/Public/Repos/speech_scoring	2017-11-24 14:26:40 +05:30
Malar Kannan	235300691e	find spec_n from tfrecords	2017-11-24 14:26:36 +05:30
Malar Kannan	ae46578aec	Merge branch 'master' of ssh://invmac/~/Public/repos/speech-scoring	2017-11-23 17:50:47 +05:30
Malar Kannan	3d7542271d	implemented tts segmentation data generation	2017-11-23 17:50:11 +05:30
Malar Kannan	54f38ca775	removed a layer using lstm	2017-11-22 15:46:42 +05:30
Malar Kannan	6355db4af7	adding missing model-dir for training constants copying	2017-11-22 15:04:02 +05:30