Commit ba0a5437 authored by Shaojie Bai

Initial code commit + README

.DS_Store/
data/
*.pt
*.log
*.pyc
__pycache__/
# Sequence-Model-Benchmarks-TCN
This repository contains the experiments done in the work [An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling](#) by Shaojie Bai, J. Zico Kolter and Vladlen Koltun.
We specifically target a comprehensive set of tasks that have been repeatedly used to compare the effectiveness of different recurrent networks, and evaluate a simple, generic but powerful (purely) convolutional network on the recurrent nets' home turf.
Experiments are done in PyTorch.
## Domains and Datasets
This repository contains benchmarks for the following tasks, with details explained in each sub-directory:
- **The Adding Problem** with various T (we evaluated on T=200, 400, 600)
- **Copying Memory Task** with various T (we evaluated on T=500, 1000, 2000)
- **Sequential MNIST** digit classification
- **Permuted Sequential MNIST** (based on Seq. MNIST, but more challenging)
- **JSB Chorales** polyphonic music
- **Nottingham** polyphonic music
- **PennTreebank** [SMALL] word-level language modeling (LM)
- **Wikitext-103** [LARGE] word-level LM
- **LAMBADA** [LARGE] word-level LM and textual understanding
- **PennTreebank** [MEDIUM] char-level LM
- **Shakespeare** [SMALL] char-level LM (Note: a small dataset)
- **text8** [LARGE] char-level LM
Some of the large datasets are not included in this repo; we use the [observations](#) package to download them, which can be easily installed using pip.
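For example, the PTB corpus is fetched with a single call (a minimal sketch mirroring `char_cnn/utils.py`; the files are cached under `data/`):
```
import observations

# Returns the training / test / validation splits as plain strings, downloading
# them into data/ on first use (the same call used in char_cnn/utils.py).
traintext, testtext, valtext = observations.ptb('data/')
```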
## Usage
Each task is contained in its own directory, with the following structure:
```
[TASK_NAME]/
    data/
    [TASK_NAME]_test.py
    model.py
    utils.py
```
To run the TCN model on a task, one only needs to run `[TASK_NAME]_test.py` (e.g. `add_test.py`). To tune the hyperparameters, specify them via the argument options, which can be seen via the `-h` flag.
## The Adding Problem
### Overview
In this task, each input consists of a length-T sequence of depth 2, with all values in the first dimension chosen randomly in [0, 1]. The second dimension consists of all zeros except for two elements, which are marked by 1. The objective is to sum the two random values whose second dimensions are marked by 1. One can think of this as computing the dot product of the two dimensions. Simply predicting the sum to be 1 should give an MSE of about 0.1767.
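For example, a toy length-6 sample and its label might look like:
```
dimension 1 (values):  0.71  0.15  0.93  0.30  0.47  0.08
dimension 2 (markers): 0     1     0     0     1     0
label:                 0.15 + 0.47 = 0.62
```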
### Data Generation
See `data_generator` in `utils.py`.
### Note
Because a TCN's receptive field depends on the depth of the network and the filter size, we need
to make sure the model we use can cover the sequence length T (see the sketch below).
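A quick back-of-the-envelope check (a sketch based on `tcn.py` in this repository, where level `i` stacks two causal convolutions with dilation `2^i`):
```
def tcn_receptive_field(levels, kernel_size):
    # Each of the two convolutions in level i looks (kernel_size - 1) * 2**i
    # steps further into the past, so the stacked receptive field is
    # 1 + 2 * (kernel_size - 1) * (2**0 + 2**1 + ... + 2**(levels - 1)).
    return 1 + 2 * (kernel_size - 1) * (2 ** levels - 1)

# With the defaults in add_test.py (--ksize 7, --levels 8):
print(tcn_receptive_field(8, 7))   # 3061, which comfortably covers T = 400
```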
import torch
import argparse
import torch.optim as optim
import torch.nn.functional as F
from TCN.adding_problem.model import TCN
from TCN.adding_problem.utils import data_generator
parser = argparse.ArgumentParser(description='Sequence Modeling - The Adding Problem')
parser.add_argument('--batch_size', type=int, default=32, metavar='N',
help='batch size (default: 32)')
parser.add_argument('--cuda', action='store_false',
help='use CUDA (default: True)')
parser.add_argument('--dropout', type=float, default=0.0,
help='dropout applied to layers (default: 0.0)')
parser.add_argument('--clip', type=float, default=-1,
help='gradient clip, -1 means no clip (default: -1)')
parser.add_argument('--epochs', type=int, default=10,
help='upper epoch limit (default: 10)')
parser.add_argument('--ksize', type=int, default=7,
                    help='kernel size (default: 7)')
parser.add_argument('--levels', type=int, default=8,
                    help='# of levels (default: 8)')
parser.add_argument('--seq_len', type=int, default=400,
help='sequence length (default: 400)')
parser.add_argument('--log-interval', type=int, default=100, metavar='N',
                    help='report interval (default: 100)')
parser.add_argument('--lr', type=float, default=4e-3,
help='initial learning rate (default: 4e-3)')
parser.add_argument('--optim', type=str, default='Adam',
help='optimizer to use (default: Adam)')
parser.add_argument('--nhid', type=int, default=30,
help='number of hidden units per layer (default: 30)')
parser.add_argument('--seed', type=int, default=1111,
help='random seed (default: 1111)')
args = parser.parse_args()
torch.manual_seed(args.seed)
if torch.cuda.is_available():
if not args.cuda:
print("WARNING: You have a CUDA device, so you should probably run with --cuda")
input_channels = 2
n_classes = 1
batch_size = args.batch_size
seq_length = args.seq_len
epochs = args.epochs
print(args)
print("Producing data...")
X_train, Y_train = data_generator(50000, seq_length)
X_test, Y_test = data_generator(1000, seq_length)
# Note: We use a very simple setting here (assuming all levels have the same # of channels).
channel_sizes = [args.nhid]*args.levels
kernel_size = args.ksize
dropout = args.dropout
model = TCN(input_channels, n_classes, channel_sizes, kernel_size=kernel_size, dropout=dropout)
if args.cuda:
model.cuda()
X_train = X_train.cuda()
Y_train = Y_train.cuda()
X_test = X_test.cuda()
Y_test = Y_test.cuda()
lr = args.lr
optimizer = getattr(optim, args.optim)(model.parameters(), lr=lr)
def train(epoch):
global lr
model.train()
batch_idx = 1
total_loss = 0
for i in range(0, X_train.size()[0], batch_size):
if i + batch_size > X_train.size()[0]:
x, y = X_train[i:], Y_train[i:]
else:
x, y = X_train[i:(i+batch_size)], Y_train[i:(i+batch_size)]
optimizer.zero_grad()
output = model(x)
loss = F.mse_loss(output, y)
loss.backward()
if args.clip > 0:
torch.nn.utils.clip_grad_norm(model.parameters(), args.clip)
optimizer.step()
batch_idx += 1
total_loss += loss.data[0]
if batch_idx % args.log_interval == 0:
cur_loss = total_loss / args.log_interval
processed = min(i+batch_size, X_train.size()[0])
print('Train Epoch: {:2d} [{:6d}/{:6d} ({:.0f}%)]\tLearning rate: {:.4f}\tLoss: {:.6f}'.format(
epoch, processed, X_train.size()[0], 100.*processed/X_train.size()[0], lr, cur_loss))
total_loss = 0
def evaluate():
model.eval()
output = model(X_test)
test_loss = F.mse_loss(output, Y_test)
print('\nTest set: Average loss: {:.6f}\n'.format(test_loss.data[0]))
return test_loss.data[0]
for ep in range(1, epochs+1):
train(ep)
tloss = evaluate()
from torch import nn
from TCN.tcn import TemporalConvNet
class TCN(nn.Module):
def __init__(self, input_size, output_size, num_channels, kernel_size, dropout):
super(TCN, self).__init__()
self.tcn = TemporalConvNet(input_size, num_channels, kernel_size=kernel_size, dropout=dropout)
self.linear = nn.Linear(num_channels[-1], output_size)
self.init_weights()
def init_weights(self):
self.linear.weight.data.normal_(0, 0.01)
def forward(self, x):
y1 = self.tcn(x)
return self.linear(y1[:, :, -1])
import torch
import numpy as np
from torch.autograd import Variable
def data_generator(N, seq_length):
"""
Args:
seq_length: Length of the adding problem data
N: # of data in the set
"""
X_num = torch.rand([N, 1, seq_length])
X_mask = torch.zeros([N, 1, seq_length])
Y = torch.zeros([N, 1])
for i in range(N):
positions = np.random.choice(seq_length, size=2, replace=False)
X_mask[i, 0, positions[0]] = 1
X_mask[i, 0, positions[1]] = 1
Y[i,0] = X_num[i, 0, positions[0]] + X_num[i, 0, positions[1]]
X = torch.cat((X_num, X_mask), dim=1)
return Variable(X), Variable(Y)
## Character-level Language Modeling
### Overview
In character-level language modeling tasks, each sequence is broken into elements by characters.
Therefore, in a character-level language model, at each time step the model is expected to predict
the next coming character. We evaluate the temporal convolutional network as a character-level
language model on the PennTreebank dataset and the text8 dataset.
### Data
- **PennTreebank**: When used as a character-level language corpus, PTB contains 5,059K characters for training, 396K for validation, and 446K for testing, with an alphabet size of 50. PennTreebank is a well-studied (but relatively small) language dataset.
- **text8**: text8 is about 20 times larger than PTB, with about 100M characters from Wikipedia (90M for training, 5M for validation, and 5M for testing). The corpus has an alphabet size of 27.
See `data_generator` in `utils.py`. We download the language corpora using the [observations](#) package in Python.
### Note
- Just like in a recurrent network implementation where it is common to repackage
hidden units when a new sequence begins, we pass into TCN a sequence `T` consisting
of two parts: 1) effective history `L1`, and 2) valid sequence `L2`:
```
Sequence [---------T---------] = [--L1-- -----L2-----]
```
In the forward pass, the whole sequence is passed into TCN, but only the `L2` portion is used for
training. This ensures that the training data are also provided with sufficient history. The size
of `T` and `L2` can be adjusted via the flags `seq_len` and `validseqlen` (see the sketch at the end of this section).
- The choice of dataset to use can be specified via the `--dataset` flag. For instance, running
```
python char_cnn_test.py --dataset ptb
```
would (download if no data found, and) train on the PennTreebank (PTB) dataset.
- Empirically, we found that Adam works better than SGD on the text8 dataset.
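For concreteness, the slicing in the training loop of the test script below looks roughly like this (a minimal excerpt; `model`, `inp`, `target`, `criterion` and `n_characters` are defined in that script):
```
seq_len, validseqlen = 400, 320           # total length T and valid part L2
eff_history = seq_len - validseqlen       # L1 = 80 warm-up positions

output = model(inp)                       # inp: (batch, seq_len) character ids
final_output = output[:, eff_history:].contiguous().view(-1, n_characters)
final_target = target[:, eff_history:].contiguous().view(-1)
loss = criterion(final_output, final_target)   # loss over the L2 positions only
```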
import argparse
import torch.nn as nn
import torch.optim as optim
import sys
sys.path.append("../../")
from TCN.char_cnn.utils import *
from TCN.char_cnn.model import TCN
import time
import math
import warnings
warnings.filterwarnings("ignore")  # Suppress the RuntimeWarning on unicode
parser = argparse.ArgumentParser(description='Sequence Modeling - Character Level Language Model')
parser.add_argument('--batch_size', type=int, default=32, metavar='N',
help='batch size (default: 32)')
parser.add_argument('--cuda', action='store_false',
help='use CUDA (default: True)')
parser.add_argument('--dropout', type=float, default=0.1,
help='dropout applied to layers (default: 0.1)')
parser.add_argument('--emb_dropout', type=float, default=0.1,
help='dropout applied to the embedded layer (0 = no dropout)')
parser.add_argument('--clip', type=float, default=0.15,
help='gradient clip, -1 means no clip (default: 0.15)')
parser.add_argument('--epochs', type=int, default=100,
help='upper epoch limit (default: 100)')
parser.add_argument('--ksize', type=int, default=3,
                    help='kernel size (default: 3)')
parser.add_argument('--levels', type=int, default=3,
                    help='# of levels (default: 3)')
parser.add_argument('--log-interval', type=int, default=100, metavar='N',
                    help='report interval (default: 100)')
parser.add_argument('--lr', type=float, default=4,
help='initial learning rate (default: 4)')
parser.add_argument('--emsize', type=int, default=100,
help='dimension of character embeddings (default: 100)')
parser.add_argument('--optim', type=str, default='SGD',
help='optimizer to use (default: SGD)')
parser.add_argument('--nhid', type=int, default=450,
                    help='number of hidden units per layer (default: 450)')
parser.add_argument('--validseqlen', type=int, default=320,
help='valid sequence length (default: 320)')
parser.add_argument('--seq_len', type=int, default=400,
help='total sequence length, including effective history (default: 400)')
parser.add_argument('--seed', type=int, default=1111,
help='random seed (default: 1111)')
parser.add_argument('--dataset', type=str, default='ptb',
help='dataset to use (default: ptb)')
args = parser.parse_args()
# Set the random seed manually for reproducibility.
torch.manual_seed(args.seed)
if torch.cuda.is_available():
if not args.cuda:
print("WARNING: You have a CUDA device, so you should probably run with --cuda")
print(args)
file, file_len, valfile, valfile_len, testfile, testfile_len, corpus = data_generator(args)
n_characters = len(corpus.dict)
train_data = batchify(char_tensor(corpus, file), args.batch_size, args)
val_data = batchify(char_tensor(corpus, valfile), 1, args)
test_data = batchify(char_tensor(corpus, testfile), 1, args)
print("Corpus size: ", n_characters)
num_chans = [args.nhid] * (args.levels - 1) + [args.emsize]
k_size = args.ksize
dropout = args.dropout
emb_dropout = args.emb_dropout
model = TCN(args.emsize, n_characters, num_chans, kernel_size=k_size, dropout=dropout, emb_dropout=emb_dropout)
if args.cuda:
model.cuda()
criterion = nn.CrossEntropyLoss()
lr = args.lr
optimizer = getattr(optim, args.optim)(model.parameters(), lr=lr)
def evaluate(source):
model.eval()
total_loss = 0
count = 0
source_len = source.size(1)
for batch, i in enumerate(range(0, source_len - 1, args.validseqlen)):
if i + args.seq_len - args.validseqlen >= source_len:
continue
inp, target = get_batch(source, i, args)
output = model(inp)
eff_history = args.seq_len - args.validseqlen
final_output = output[:, eff_history:].contiguous().view(-1, n_characters)
final_target = target[:, eff_history:].contiguous().view(-1)
loss = criterion(final_output, final_target)
total_loss += loss.data * final_output.size(0)
count += final_output.size(0)
val_loss = total_loss[0] / count * 1.0
return val_loss
def train(epoch):
model.train()
total_loss = 0
start_time = time.time()
losses = []
source = train_data
source_len = source.size(1)
for batch_idx, i in enumerate(range(0, source_len - 1, args.validseqlen)):
if i + args.seq_len - args.validseqlen >= source_len:
continue
inp, target = get_batch(source, i, args)
optimizer.zero_grad()
output = model(inp)
eff_history = args.seq_len - args.validseqlen
final_output = output[:, eff_history:].contiguous().view(-1, n_characters)
final_target = target[:, eff_history:].contiguous().view(-1)
loss = criterion(final_output, final_target)
loss.backward()
if args.clip > 0:
torch.nn.utils.clip_grad_norm(model.parameters(), args.clip)
optimizer.step()
total_loss += loss.data
if batch_idx % args.log_interval == 0 and batch_idx > 0:
cur_loss = total_loss[0] / args.log_interval
losses.append(cur_loss)
elapsed = time.time() - start_time
print('| epoch {:3d} | {:5d}/{:5d} batches | lr {:02.5f} | ms/batch {:5.2f} | '
'loss {:5.3f} | bpc {:5.3f}'.format(
epoch, batch_idx, int((source_len-0.5) / args.validseqlen), lr,
elapsed * 1000 / args.log_interval, cur_loss, cur_loss / math.log(2)))
total_loss = 0
start_time = time.time()
# if batch % (200 * args.log_interval) == 0 and batch > 0:
# vloss = evaluate(val_data)
# print('-' * 89)
# print('| In epoch {:3d} | valid loss {:5.3f} | '
# 'valid bpc {:8.3f}'.format(epoch, vloss, vloss / math.log(2)))
# model.train()
return sum(losses) * 1.0 / len(losses)
def main():
global lr
try:
print("Training for %d epochs..." % args.epochs)
all_losses = []
best_vloss = 1e7
for epoch in range(1, args.epochs + 1):
loss = train(epoch)
vloss = evaluate(val_data)
print('-' * 89)
print('| End of epoch {:3d} | valid loss {:5.3f} | valid bpc {:8.3f}'.format(
epoch, vloss, vloss / math.log(2)))
test_loss = evaluate(test_data)
print('=' * 89)
print('| End of epoch {:3d} | test loss {:5.3f} | test bpc {:8.3f}'.format(
epoch, test_loss, test_loss / math.log(2)))
print('=' * 89)
if epoch > 5 and vloss > max(all_losses[-3:]):
lr = lr / 10.
for param_group in optimizer.param_groups:
param_group['lr'] = lr
all_losses.append(vloss)
if vloss < best_vloss:
print("Saving...")
save(model)
best_vloss = vloss
except KeyboardInterrupt:
print('-' * 89)
print("Saving before quit...")
save(model)
# Run on test data.
test_loss = evaluate(test_data)
print('=' * 89)
print('| End of training | test loss {:5.3f} | test bpc {:8.3f}'.format(
test_loss, test_loss / math.log(2)))
print('=' * 89)
# train_by_random_chunk()
if __name__ == "__main__":
main()
from torch import nn
import sys
sys.path.append("../../")
from TCN.tcn import TemporalConvNet
class TCN(nn.Module):
def __init__(self, input_size, output_size, num_channels, kernel_size=2, dropout=0.2, emb_dropout=0.2):
super(TCN, self).__init__()
self.encoder = nn.Embedding(output_size, input_size)
self.tcn = TemporalConvNet(input_size, num_channels, kernel_size=kernel_size, dropout=dropout)
self.decoder = nn.Linear(input_size, output_size)
self.decoder.weight = self.encoder.weight
self.drop = nn.Dropout(emb_dropout)
self.init_weights()
def init_weights(self):
initrange = 0.1
self.encoder.weight.data.uniform_(-initrange, initrange)
self.decoder.bias.data.fill_(0)
self.decoder.weight.data.uniform_(-initrange, initrange)
def forward(self, x):
# input has dimension (N, L_in), and emb has dimension (N, L_in, C_in)
emb = self.drop(self.encoder(x))
y = self.tcn(emb.transpose(1, 2))
o = self.decoder(y.transpose(1, 2))
return o.contiguous()
import unidecode
import torch
from torch.autograd import Variable
from collections import Counter
import observations
import os
import pickle
cuda = torch.cuda.is_available()
def data_generator(args):
file, testfile, valfile = getattr(observations, args.dataset)('data/')
file_len = len(file)
valfile_len = len(valfile)
testfile_len = len(testfile)
corpus = Corpus(file + " " + valfile + " " + testfile)
#############################################################
# Use the following if you want to pickle the loaded data
#
# pickle_name = "{0}.corpus".format(args.dataset)
# if os.path.exists(pickle_name):
# corpus = pickle.load(open(pickle_name, 'rb'))
# else:
# corpus = Corpus(file + " " + valfile + " " + testfile)
# pickle.dump(corpus, open(pickle_name, 'wb'))
#############################################################
return file, file_len, valfile, valfile_len, testfile, testfile_len, corpus
def read_file(filename):
file = unidecode.unidecode(open(filename).read())
return file, len(file)
class Dictionary(object):
def __init__(self):
self.char2idx = {}
self.idx2char = []
self.counter = Counter()
def add_word(self, char):
self.counter[char] += 1
def prep_dict(self):
for char in self.counter:
if char not in self.char2idx:
self.idx2char.append(char)
self.char2idx[char] = len(self.idx2char) - 1
def __len__(self):
return len(self.idx2char)
class Corpus(object):
def __init__(self, string):
self.dict = Dictionary()
for c in string:
self.dict.add_word(c)
self.dict.prep_dict()
def char_tensor(corpus, string):
tensor = torch.zeros(len(string)).long()
for i in range(len(string)):
tensor[i] = corpus.dict.char2idx[string[i]]
return Variable(tensor).cuda() if cuda else Variable(tensor)
def batchify(data, batch_size, args):
"""The output should have size [L x batch_size], where L could be a long sequence length"""
# Work out how cleanly we can divide the dataset into batch_size parts (i.e. continuous seqs).
nbatch = data.size(0) // batch_size
# Trim off any extra elements that wouldn't cleanly fit (remainders).
data = data.narrow(0, 0, nbatch * batch_size)
# Evenly divide the data across the batch_size batches.
data = data.view(batch_size, -1)
if args.cuda:
data = data.cuda()
return data
def get_batch(source, start_index, args):
seq_len = min(args.seq_len, source.size(1) - 1 - start_index)
end_index = start_index + seq_len
inp = source[:, start_index:end_index].contiguous()
target = source[:, start_index+1:end_index+1].contiguous() # The successors of the inp.
return inp, target
def save(model):
save_filename = 'model.pt'
torch.save(model, save_filename)
print('Saved as %s' % save_filename)
## Copying Memory Task
### Overview
In this task, each input sequence has length T+20. The first 10 values are chosen randomly from the digits 1-8, with the rest being all zeros, except for the last 11 entries, which are filled with the digit '9' (the first '9' is a delimiter). The goal is to produce an output of the same length that is zero everywhere, except for the last 10 values after the delimiter, where the model is expected to repeat the 10 values it encountered at the start of the input, as illustrated below.
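Concretely, each input/target pair produced by `data_generator` in `utils.py` has the following layout (`a1 ... a10` denote the ten random digits):
```
input:  a1 ... a10 | 0 0 ... 0 (T-1 zeros) | 9 9 9 9 9 9 9 9 9 9 9
target: 0 0 ... 0    (T+10 zeros)          | a1 a2 ... a10
```
Both sequences have length T+20, and the first `9` is the delimiter that signals the ten recall steps which follow it.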
### Data Generation
See `data_generator` in `utils.py`.
### Note
- Because a TCN's receptive field depends on the depth of the network and the filter size, we need
to make sure the model we use can cover the sequence length T+20.
- Using the `--seq_len` flag, one can change the number of values to recall (the typical setup is 10).
import argparse
import torch
import torch.nn as nn
import torch.optim as optim
from torch.autograd import Variable
import numpy as np
from TCN.copy_memory.utils import data_generator
from TCN.copy_memory.model import TCN
import time
parser = argparse.ArgumentParser(description='Sequence Modeling - Copying Memory Task')
parser.add_argument('--batch_size', type=int, default=32, metavar='N',
help='batch size (default: 32)')
parser.add_argument('--cuda', action='store_false',
help='use CUDA (default: True)')
parser.add_argument('--dropout', type=float, default=0.0,
help='dropout applied to layers (default: 0.0)')
parser.add_argument('--clip', type=float, default=1.0,
help='gradient clip, -1 means no clip (default: 1.0)')
parser.add_argument('--epochs', type=int, default=50,
help='upper epoch limit (default: 50)')
parser.add_argument('--ksize', type=int, default=8,
help='kernel size (default: 8)')
parser.add_argument('--iters', type=int, default=100,
help='number of iters per epoch (default: 100)')
parser.add_argument('--levels', type=int, default=8,
                    help='# of levels (default: 8)')
parser.add_argument('--blank_len', type=int, default=1000, metavar='N',
help='The size of the blank (i.e. T) (default: 1000)')
parser.add_argument('--seq_len', type=int, default=10,
help='initial history size (default: 10)')
parser.add_argument('--log-interval', type=int, default=50, metavar='N',
                    help='report interval (default: 50)')
parser.add_argument('--lr', type=float, default=5e-4,
help='initial learning rate (default: 5e-4)')
parser.add_argument('--optim', type=str, default='RMSprop',
help='optimizer to use (default: RMSprop)')
parser.add_argument('--nhid', type=int, default=10,
help='number of hidden units per layer (default: 10)')
parser.add_argument('--seed', type=int, default=1111,
help='random seed (default: 1111)')
args = parser.parse_args()
torch.manual_seed(args.seed)
if torch.cuda.is_available():
if not args.cuda:
print("WARNING: You have a CUDA device, so you should probably run with --cuda")
batch_size = args.batch_size
seq_len = args.seq_len # The size to memorize
epochs = args.epochs
iters = args.iters
T = args.blank_len
n_steps = T + (2 * seq_len)
n_classes = 10 # Digits 0 - 9
n_train = 10000
n_test = 1000
print(args)
print("Preparing data...")
train_x, train_y = data_generator(T, seq_len, n_train)
test_x, test_y = data_generator(T, seq_len, n_test)
channel_sizes = [args.nhid] * args.levels
kernel_size = args.ksize
dropout = args.dropout
model = TCN(1, n_classes, channel_sizes, kernel_size, dropout=dropout)
if args.cuda:
model.cuda()
train_x = train_x.cuda()
train_y = train_y.cuda()
test_x = test_x.cuda()
test_y = test_y.cuda()
criterion = nn.CrossEntropyLoss()
lr = args.lr
optimizer = getattr(optim, args.optim)(model.parameters(), lr=lr)
def evaluate():
model.eval()
out = model(test_x.unsqueeze(1).contiguous())
loss = criterion(out.view(-1, n_classes), test_y.view(-1))
pred = out.view(-1, n_classes).data.max(1, keepdim=True)[1]
correct = pred.eq(test_y.data.view_as(pred)).cpu().sum()
counter = out.view(-1, n_classes).size(0)
print('\nTest set: Average loss: {:.8f} | Accuracy: {:.4f}\n'.format(
loss.data[0], 100. * correct / counter))
return loss.data[0]
def train(ep):
global batch_size, seq_len, iters, epochs
model.train()
total_loss = 0
start_time = time.time()
correct = 0
counter = 0
for batch_idx, batch in enumerate(range(0, n_train, batch_size)):
start_ind = batch
end_ind = start_ind + batch_size
x = train_x[start_ind:end_ind]
y = train_y[start_ind:end_ind]
        optimizer.zero_grad()
        out = model(x.unsqueeze(1).contiguous())
        loss = criterion(out.view(-1, n_classes), y.view(-1))
        pred = out.view(-1, n_classes).data.max(1, keepdim=True)[1]
        correct += pred.eq(y.data.view_as(pred)).cpu().sum()
        counter += out.view(-1, n_classes).size(0)
        loss.backward()
        if args.clip > 0:
            torch.nn.utils.clip_grad_norm(model.parameters(), args.clip)
        optimizer.step()
        total_loss += loss
if batch_idx > 0 and batch_idx % args.log_interval == 0:
avg_loss = total_loss / args.log_interval
elapsed = time.time() - start_time
print('| Epoch {:3d} | {:5d}/{:5d} batches | lr {:2.5f} | ms/batch {:5.2f} | '
'loss {:5.8f} | accuracy {:5.4f}'.format(
ep, batch_idx, n_train // batch_size+1, args.lr, elapsed * 1000 / args.log_interval,
avg_loss.data[0], 100. * correct / counter))
start_time = time.time()
total_loss = 0
correct = 0
counter = 0
for ep in range(1, epochs + 1):
train(ep)
evaluate()
from torch import nn
from TCN.tcn import TemporalConvNet
class TCN(nn.Module):
def __init__(self, input_size, output_size, num_channels, kernel_size, dropout):
super(TCN, self).__init__()
self.tcn = TemporalConvNet(input_size, num_channels, kernel_size=kernel_size, dropout=dropout)
self.linear = nn.Linear(num_channels[-1], output_size)
self.init_weights()
def init_weights(self):
self.linear.weight.data.normal_(0, 0.01)
def forward(self, x):
y1 = self.tcn(x)
return self.linear(y1.transpose(1, 2))
import numpy as np
import torch
from torch.autograd import Variable
def data_generator(T, mem_length, b_size):
"""
Generate data for the copying memory task
:param T: The total blank time length
:param mem_length: The length of the memory to be recalled
    :param b_size: The number of sequences to generate
    :return: Input and target data tensors
"""
seq = torch.from_numpy(np.random.randint(1, 9, size=(b_size, mem_length))).float()
zeros = torch.zeros((b_size, T))
marker = 9 * torch.ones((b_size, mem_length + 1))
placeholders = torch.zeros((b_size, mem_length))
x = torch.cat((seq, zeros[:, :-1], marker), 1)
y = torch.cat((placeholders, zeros, seq), 1).long()
x, y = Variable(x), Variable(y)
return x, y
## Word-level Language Modeling
### Overview
LAMBADA is a collection of narrative passages that share the characteristic that human subjects can guess the target word accurately when given sufficient context, but not if they only see the last sentence containing it. On average, the context contains 4.6 sentences, and test performance is evaluated by having the model predict the last element of the target sentence (i.e. the very last word).
Most existing computational models fail on this task (without the help of an external memory unit, such as a neural cache). See [the original LAMBADA paper](https://arxiv.org/pdf/1606.06031.pdf) for more results on applying RNNs to LAMBADA.
**Example**:
```
Context: “Yes, I thought I was going to lose the baby.” “I was scared too,” he stated, sincerity flooding his eyes. “You were ?” “Yes, of course. Why do you even ask?” “This baby wasn’t exactly planned for.”
Target sentence: “Do you honestly think that I would want you to have a _______”
Target word: miscarriage
```
### Data
See `data_generator` in `utils.py`. You will need to download the LAMBADA dataset from [here](http://clic.cimec.unitn.it/lambada/) and put it under the directory `./data/lambada` (or another path specified via the `--data` flag).
### Note
- Just like in a recurrent network implementation where it is common to repackage
hidden units when a new sequence begins, we pass into TCN a sequence `T` consisting
of two parts: 1) effective history `L1`, and 2) valid sequence `L2`:
```
Sequence [---------T--------->] = [--L1--> ------L2------>]
```
In the forward pass, the whole sequence is passed into TCN, but only the `L2` portion is used for
training. This ensures that the training data are also provided with sufficient history. The size
of `T` and `L2` can be adjusted via flags `seq_len` and `validseqlen`.
- The choice of data to load can be specified via the `--data` flag, followed by the path to
the directory containing the data. For instance, running
```
python lambada_test.py --data ./data/lambada
```
would train on the LAMBADA dataset, if it is contained in `./data/lambada`.
- LAMBADA is a huge dataset with a very large vocabulary.
import argparse
import time
import math
import torch
import torch.nn as nn
from torch.autograd import Variable
import torch.optim as optim
import sys
sys.path.append("../../")
from TCN.lambada_language.utils import *
from TCN.lambada_language.model import TCN
import pickle
parser = argparse.ArgumentParser(description='Sequence Modeling - LAMBADA Textual Understanding')
parser.add_argument('--batch_size', type=int, default=20, metavar='N',
help='batch size (default: 20)')
parser.add_argument('--cuda', action='store_false',
help='use CUDA (default: True)')
parser.add_argument('--dropout', type=float, default=0.1,
help='dropout applied to layers (default: 0.1)')
parser.add_argument('--emb_dropout', type=float, default=0.1,
help='dropout applied to the embedded layer (default: 0.1)')
parser.add_argument('--clip', type=float, default=0.4,
help='gradient clip, -1 means no clip (default: 0.4)')
parser.add_argument('--epochs', type=int, default=100,
help='upper epoch limit (default: 100)')
parser.add_argument('--ksize', type=int, default=4,
help='kernel size (default: 4)')
parser.add_argument('--data', type=str, default='./data/lambada',
help='location of the data corpus (default: ./data/lambada)')
parser.add_argument('--emsize', type=int, default=500,
help='size of word embeddings (default: 500)')
parser.add_argument('--levels', type=int, default=5,
help='# of levels (default: 5)')
parser.add_argument('--log-interval', type=int, default=100, metavar='N',
help='report interval (default: 100)')
parser.add_argument('--lr', type=float, default=4,
help='initial learning rate (default: 4)')
parser.add_argument('--nhid', type=int, default=500,
help='number of hidden units per layer (default: 500)')
parser.add_argument('--seed', type=int, default=1111,
help='random seed (default: 1111)')
parser.add_argument('--tied', action='store_false',
help='tie the word embedding and softmax weights (default: True)')
parser.add_argument('--optim', type=str, default='SGD',
help='optimizer type (default: SGD)')
parser.add_argument('--validseqlen', type=int, default=50,
help='valid sequence length (default: 50)')
parser.add_argument('--seq_len', type=int, default=100,
help='total sequence length, including effective history (default: 100)')
parser.add_argument('--corpus', action='store_true',
help='force re-make the corpus (default: False)')
args = parser.parse_args()
# Set the random seed manually for reproducibility.
torch.manual_seed(args.seed)
if torch.cuda.is_available():
if not args.cuda:
print("WARNING: You have a CUDA device, so you should probably run with --cuda")
print(args)
train_data, val_data, test_data, corpus = data_generator(args)
n_words = len(corpus.dictionary)
print("Total # of words: {0}".format(n_words))
num_chans = [args.nhid] * (args.levels - 1) + [args.emsize]
k_size = args.ksize
dropout = args.dropout
emb_dropout = args.emb_dropout
tied = args.tied
model = TCN(args.emsize, n_words, num_chans, dropout=dropout,
emb_dropout=emb_dropout, kernel_size=k_size, tied_weights=tied)
if args.cuda:
model.cuda()
criterion = nn.CrossEntropyLoss()
lr = args.lr
optimizer = getattr(optim, args.optim)(model.parameters(), lr=lr)
def evaluate(data_source):
model.eval()
total_loss = 0
processed_data_size = 0
correct = 0
for i in range(len(data_source)):
data, targets = torch.LongTensor(data_source[i]).view(1, -1), torch.LongTensor([data_source[i][-1]]).view(1, -1)
data, targets = Variable(data), Variable(targets)
if args.cuda:
data, targets = data.cuda(), targets.cuda()
output = model(data)
final_output = output[:, -1].contiguous().view(-1, n_words)
final_target = targets[:, -1].contiguous().view(-1)
loss = criterion(final_output, final_target)
total_loss += loss.data
processed_data_size += 1
return total_loss[0] / processed_data_size
def train():
global train_data
model.train()
total_loss = 0
start_time = time.time()
    for batch_idx, i in enumerate(range(0, train_data.size(1) - 1, args.validseqlen)):
if i + args.seq_len - args.validseqlen >= train_data.size(1) - 1:
continue
data, targets = get_batch(train_data, i, args)
optimizer.zero_grad()
output = model(data)
eff_history = args.seq_len - args.validseqlen
if eff_history < 0:
raise ValueError("Valid sequence length must be smaller than sequence length!")
final_target = targets[:, eff_history:].contiguous().view(-1)
final_output = output[:, eff_history:].contiguous().view(-1, n_words)
loss = criterion(final_output, final_target)
loss.backward()
if args.clip > 0:
torch.nn.utils.clip_grad_norm(model.parameters(), args.clip)
optimizer.step()
total_loss += loss.data
if batch_idx % args.log_interval == 0 and batch_idx > 0:
cur_loss = total_loss[0] / args.log_interval
elapsed = time.time() - start_time
print('| epoch {:3d} | {:5d}/{:5d} batches | lr {:02.5f} | ms/batch {:5.5f} | '
'loss {:5.2f} | ppl {:8.2f}'.format(
epoch, batch_idx, train_data.size(1) // args.validseqlen, lr,
elapsed * 1000 / args.log_interval, cur_loss, math.exp(cur_loss)))
total_loss = 0
reg_loss = 0
start_time = time.time()
if __name__ == "__main__":
best_vloss = 1e8
try:
all_vloss = []
for epoch in range(1, args.epochs+1):
epoch_start_time = time.time()
train()
val_loss = evaluate(val_data)
test_loss = evaluate(test_data)
print('-' * 89)
print('| end of epoch {:3d} | time: {:5.2f}s | valid loss {:5.2f} | '
'valid ppl {:8.2f}'.format(epoch, (time.time() - epoch_start_time),
val_loss, math.exp(val_loss)))
print('| end of epoch {:3d} | time: {:5.2f}s | test loss {:5.2f} | '
'test ppl {:8.2f}'.format(epoch, (time.time() - epoch_start_time),
test_loss, math.exp(test_loss)))
print('-' * 89)
# Save the model if the validation loss is the best we've seen so far.
if val_loss < best_vloss:
with open("model.pt", 'wb') as f:
print('Save model!\n')
torch.save(model, f)
best_vloss = val_loss
if epoch > 5 and val_loss >= max(all_vloss[-5:]):
lr = lr / 10.
for param_group in optimizer.param_groups:
param_group['lr'] = lr
all_vloss.append(val_loss)
except KeyboardInterrupt:
print('-' * 89)
print('Exiting from training early')
# Load the best saved model.
with open("model.pt", 'rb') as f:
model = torch.load(f)
# Run on test data.
test_loss = evaluate(test_data)
print('=' * 89)
print('| End of training | test loss {:5.2f} | test ppl {:8.2f}'.format(
test_loss, math.exp(test_loss)))
print('=' * 89)
import torch
from torch import nn
import sys
sys.path.append("../../")
from TCN.tcn import TemporalConvNet
class TCN(nn.Module):
def __init__(self, input_size, output_size, num_channels,
kernel_size=2, dropout=0.3, emb_dropout=0.1, tied_weights=False):
super(TCN, self).__init__()
self.encoder = nn.Embedding(output_size, input_size)
self.tcn = TemporalConvNet(input_size, num_channels, kernel_size, dropout=dropout)
self.decoder = nn.Linear(num_channels[-1], output_size)
if tied_weights:
if num_channels[-1] != input_size:
raise ValueError('When using the tied flag, nhid must be equal to emsize')
self.decoder.weight = self.encoder.weight
print("Weight tied")
self.drop = nn.Dropout(emb_dropout)
self.emb_dropout = emb_dropout
self.init_weights()
def init_weights(self):
self.encoder.weight.data.normal_(0, 0.01)
self.decoder.bias.data.fill_(0)
self.decoder.weight.data.normal_(0, 0.01)
def forward(self, input):
"""Input ought to have dimension (N, C_in, L_in), where L_in is the seq_len; here the input is (N, L, C)"""
emb = self.drop(self.encoder(input))
y = self.tcn(emb.transpose(1, 2)).transpose(1, 2)
y = self.decoder(y)
return y.contiguous()
import os
import torch
from torch.autograd import Variable
import re
from collections import Counter
import pickle
"""
Note: The meaning of batch_size in PTB is different from that in MNIST example. In MNIST,
batch_size is the # of sample data that is considered in each iteration; in PTB, however,
it is the number of segments to speed up computation.
The goal of PTB is to train a language model to predict the next word.
"""
def data_generator(args):
if os.path.exists(args.data + "/corpus") and not args.corpus:
corpus = pickle.load(open(args.data + '/corpus', 'rb'))
else:
print("Creating Corpus...")
corpus = Corpus(args.data + "/lambada_vocabulary_sorted.txt", args.data)
pickle.dump(corpus, open(args.data + '/corpus', 'wb'))
eval_batch_size = 1
train_data = batchify(corpus.train, args.batch_size, args)
val_data = [[0] * (args.seq_len-len(line)) + line for line in corpus.valid]
test_data = [[0] * (args.seq_len-len(line)) + line for line in corpus.test]
return train_data, val_data, test_data, corpus
class Dictionary(object):
def __init__(self):
self.word2idx = {}
self.idx2word = []
def add_word(self, word):
if word not in self.word2idx:
self.idx2word.append(word)
self.word2idx[word] = len(self.idx2word) - 1
return self.word2idx[word]
def __len__(self):
return len(self.idx2word)
class Corpus(object):
def __init__(self, dict_path, path):
self.dictionary = Dictionary()
self.prep_dict(dict_path)
self.train = torch.LongTensor(self.tokenize(os.path.join(path, 'train-novels')))
self.valid = self.tokenize(os.path.join(path, 'lambada_development_plain_text.txt'), eval=True)
self.test = self.tokenize(os.path.join(path, 'lambada_test_plain_text.txt'), eval=True)
def prep_dict(self, dict_path):
assert os.path.exists(dict_path)
# Add words to the dictionary
with open(dict_path, 'r') as f:
tokens = 0
for line in f:
word = line.strip()
tokens += 1
self.dictionary.add_word(word)
if "<eos>" not in self.dictionary.word2idx:
self.dictionary.add_word("<eos>")
tokens += 1
print("The dictionary captured a vocabulary of size {0}.".format(tokens))
def tokenize(self, path, eval=False):
assert os.path.exists(path)
ids = []
token = 0
misses = 0
if not path.endswith(".txt"): # it's a folder
for subdir in os.listdir(path):
for filename in os.listdir(path + "/" + subdir):
if filename.endswith(".txt"):
full_path = "{0}/{1}/{2}".format(path, subdir, filename)
# Tokenize file content
delta_ids, delta_token, delta_miss = self._tokenize_file(full_path, eval=eval)
ids += delta_ids
token += delta_token
misses += delta_miss
else:
ids, token, misses = self._tokenize_file(path, eval=eval)
print(token, misses)
return ids
def _tokenize_file(self, path, eval=False):
with open(path, 'r') as f:
token = 0
ids = []
misses = 0
for line in f:
line_ids = []
words = line.strip().split() + ['<eos>']
if eval:
words = words[:-1]
for word in words:
# These words are in the text but not vocabulary
if word == "n't":
word = "not"
elif word == "'s":
word = "is"
elif word == "'re":
word = "are"
elif word == "'ve":
word = "have"
elif word == "wo":
word = "will"
if word not in self.dictionary.word2idx:
word = re.sub(r'[^\w\s]', '', word)
if word not in self.dictionary.word2idx:
misses += 1
continue
line_ids.append(self.dictionary.word2idx[word])
token += 1
if eval:
ids.append(line_ids)
else:
ids += line_ids
return ids, token, misses
def batchify(data, batch_size, args):
"""The output should have size [L x batch_size], where L could be a long sequence length"""
# Work out how cleanly we can divide the dataset into batch_size parts (i.e. continuous seqs).
nbatch = data.size(0) // batch_size
# Trim off any extra elements that wouldn't cleanly fit (remainders).
data = data.narrow(0, 0, nbatch * batch_size)
# Evenly divide the data across the batch_size batches.
data = data.view(batch_size, -1)
print(data.size())
if args.cuda:
data = data.cuda()
return data
def get_batch(source, i, args, seq_len=None, evaluation=False):
seq_len = min(seq_len if seq_len else args.seq_len, source.size(1) - 1 - i)
data = Variable(source[:, i:i+seq_len], volatile=evaluation)
target = Variable(source[:, i+1:i+1+seq_len]) # CAUTION: This is un-flattened!
return data, target
## Sequential MNIST & Permuted Sequential MNIST
### Overview
MNIST is a handwritten digit classification dataset (Lecun et al., 1998) that is frequently used to
test deep learning models. In particular, sequential MNIST is frequently used to test a recurrent
network’s ability to retain information from the distant past (see paper for references). In
this task, each MNIST image (28 x 28) is presented to the model as a 784 × 1 sequence
for digit classification. In the more challenging permuted MNIST (P-MNIST) setting, the order of
the sequence is shuffled according to a fixed random permutation (see the sketch below).
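A minimal sketch of how the inputs are prepared (mirroring the test script below; for P-MNIST the permutation is drawn once and reused for every image):
```
import torch
import numpy as np

permute = torch.from_numpy(np.random.permutation(784)).long()  # one fixed order

images = torch.randn(64, 1, 28, 28)      # stand-in for a batch of MNIST digits
seqs = images.view(-1, 1, 784)           # Sequential MNIST: (N, C=1, L=784)
pseqs = seqs[:, :, permute]              # P-MNIST: same reordering for every image
```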
### Data
See `data_generator` in `utils.py`. You only need to download the data once. The default path
to store the data is at `./data/mnist`.
Original source of the data can be found [here](http://yann.lecun.com/exdb/mnist/).
### Note
- Because a TCN's receptive field depends on the depth of the network and the filter size, we need
to make sure the model we use can cover the sequence length 784.
- While this is a sequence model task, we only use the very last output (i.e. at time T=784) for
the eventual classification.
import torch.nn.functional as F
from torch import nn
from TCN.tcn import TemporalConvNet
class TCN(nn.Module):
def __init__(self, input_size, output_size, num_channels, kernel_size, dropout):
super(TCN, self).__init__()
self.tcn = TemporalConvNet(input_size, num_channels, kernel_size=kernel_size, dropout=dropout)
self.linear = nn.Linear(num_channels[-1], output_size)
def forward(self, inputs):
"""Inputs have to have dimension (N, C_in, L_in)"""
y1 = self.tcn(inputs) # input should have dimension (N, C, L)
o = self.linear(y1[:, :, -1])
return F.log_softmax(o, dim=1)
import torch
from torch.autograd import Variable
import torch.optim as optim
import torch.nn.functional as F
from TCN.mnist_pixel.utils import data_generator
from TCN.mnist_pixel.model import TCN
import numpy as np
import argparse
parser = argparse.ArgumentParser(description='Sequence Modeling - (Permuted) Sequential MNIST')
parser.add_argument('--batch_size', type=int, default=64, metavar='N',
help='batch size (default: 64)')
parser.add_argument('--cuda', action='store_false',
help='use CUDA (default: True)')
parser.add_argument('--dropout', type=float, default=0.05,
                    help='dropout applied to layers (default: 0.05)')
parser.add_argument('--clip', type=float, default=-1,
help='gradient clip, -1 means no clip (default: -1)')
parser.add_argument('--epochs', type=int, default=20,
help='upper epoch limit (default: 20)')
parser.add_argument('--ksize', type=int, default=7,
                    help='kernel size (default: 7)')
parser.add_argument('--levels', type=int, default=8,
help='# of levels (default: 8)')
parser.add_argument('--log-interval', type=int, default=100, metavar='N',
                    help='report interval (default: 100)')
parser.add_argument('--lr', type=float, default=2e-3,
                    help='initial learning rate (default: 2e-3)')
parser.add_argument('--optim', type=str, default='Adam',
help='optimizer to use (default: Adam)')
parser.add_argument('--nhid', type=int, default=25,
help='number of hidden units per layer (default: 25)')
parser.add_argument('--seed', type=int, default=1111,
help='random seed (default: 1111)')
parser.add_argument('--permute', action='store_true',
help='use permuted MNIST (default: false)')
args = parser.parse_args()
torch.manual_seed(args.seed)
if torch.cuda.is_available():
if not args.cuda:
print("WARNING: You have a CUDA device, so you should probably run with --cuda")
root = './data/mnist'
batch_size = args.batch_size
n_classes = 10
input_channels = 1
seq_length = int(784 / input_channels)
epochs = args.epochs
steps = 0
print(args)
train_loader, test_loader = data_generator(root, batch_size)
permute = torch.Tensor(np.random.permutation(784).astype(np.float64)).long()
channel_sizes = [args.nhid] * args.levels
kernel_size = args.ksize
model = TCN(input_channels, n_classes, channel_sizes, kernel_size=kernel_size, dropout=args.dropout)
if args.cuda:
model.cuda()
permute = permute.cuda()
lr = args.lr
optimizer = getattr(optim, args.optim)(model.parameters(), lr=lr)
def train(ep):
global steps
train_loss = 0
model.train()
for batch_idx, (data, target) in enumerate(train_loader):
if args.cuda: data, target = data.cuda(), target.cuda()
data = data.view(-1, input_channels, seq_length)
if args.permute:
data = data[:, :, permute]
data, target = Variable(data), Variable(target)
optimizer.zero_grad()
output = model(data)
loss = F.nll_loss(output, target)
loss.backward()
if args.clip > 0:
torch.nn.utils.clip_grad_norm(model.parameters(), args.clip)
optimizer.step()
train_loss += loss
steps += seq_length
if batch_idx > 0 and batch_idx % args.log_interval == 0:
print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}\tSteps: {}'.format(
ep, batch_idx * batch_size, len(train_loader.dataset),
100. * batch_idx / len(train_loader), train_loss.data[0]/args.log_interval, steps))
train_loss = 0
def test():
model.eval()
test_loss = 0
correct = 0
for data, target in test_loader:
if args.cuda:
data, target = data.cuda(), target.cuda()
data = data.view(-1, input_channels, seq_length)
if args.permute:
data = data[:, :, permute]
data, target = Variable(data, volatile=True), Variable(target)
output = model(data)
test_loss += F.nll_loss(output, target, size_average=False).data[0]
pred = output.data.max(1, keepdim=True)[1]
correct += pred.eq(target.data.view_as(pred)).cpu().sum()
test_loss /= len(test_loader.dataset)
print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
test_loss, correct, len(test_loader.dataset),
100. * correct / len(test_loader.dataset)))
return test_loss
if __name__ == "__main__":
for epoch in range(1, epochs+1):
train(epoch)
test()
if epoch % 10 == 0:
lr /= 10
for param_group in optimizer.param_groups:
param_group['lr'] = lr
import torch
from torchvision import datasets, transforms
def data_generator(root, batch_size):
train_set = datasets.MNIST(root=root, train=True, download=True,
transform=transforms.Compose([
transforms.ToTensor(),
transforms.Normalize((0.1307,), (0.3081,))
]))
test_set = datasets.MNIST(root=root, train=False, download=True,
transform=transforms.Compose([
transforms.ToTensor(),
transforms.Normalize((0.1307,), (0.3081,))
]))
train_loader = torch.utils.data.DataLoader(train_set, batch_size=batch_size)
test_loader = torch.utils.data.DataLoader(test_set, batch_size=batch_size)
return train_loader, test_loader
## Polyphonic Music Dataset
### Overview
We evaluate the temporal convolutional network (TCN) on two popular polyphonic music datasets, described below.
- **JSB Chorales** dataset (Allan & Williams, 2005) is a polyphonic music dataset consisting of the entire corpus of 382 four-part harmonized chorales by J. S. Bach. In a polyphonic music dataset, each input is a sequence of elements having 88 dimensions, representing the 88 keys on a piano. Therefore, each element `x_t` is a chord written as a binary vector, in which a “1” indicates a key that is pressed.
- **Nottingham** dataset is a collection of 1200 British and American folk tunes. Nottingham is a much larger dataset than JSB Chorales. Along with JSB Chorales, Nottingham has been used in a number of works that investigated recurrent models’ applicability to polyphonic music, and the performance on both tasks is measured in terms of negative log-likelihood (NLL) loss.
The goal here is to predict the next note given some history of the notes played.
### Data
See `data_generator` in `utils.py`. The data has been pre-processed and can be loaded directly using
scipy functions.
Original source of the data can be found [here](http://www-etud.iro.umontreal.ca/~boulanni/icml2012).
### Note
- Each sequence can have a different length. In the current implementation, we simply train on each
sequence separately (i.e. the batch size is 1), but one can zero-pad all sequences to the same length
and train in batches.
- One can use different datasets by specifying them via the `--data` flag on the command line. The
default is `Nott`, for Nottingham.
- While each entry is binary, the fact that there are 88 dimensions (for the 88 keys) means there are
essentially `2^88` "classes". Therefore, instead of predicting each key directly, we follow the
standard practice of adding a sigmoid at the end of the network. This ensures that every entry is
converted to a value between 0 and 1 before computing the NLL loss (see the sketch below).
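The loss in `train()`/`evaluate()` below writes this summed binary NLL as a matrix trace; a minimal sketch of the equivalence, with dummy tensors standing in for the sigmoid output `p` and the binary piano-roll target `y`:
```
import torch

p = torch.rand(5, 88).double()            # sigmoid outputs for 5 time steps
y = (torch.rand(5, 88) > 0.9).double()    # binary piano-roll targets

nll_trace = -torch.trace(torch.matmul(y, torch.log(p).t()) +
                         torch.matmul(1 - y, torch.log(1 - p).t()))
nll_sum = -(y * torch.log(p) + (1 - y) * torch.log(1 - p)).sum()
# nll_trace and nll_sum agree up to floating-point error
```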
from torch import nn
from TCN.tcn import TemporalConvNet
import torch.nn.functional as F
class TCN(nn.Module):
def __init__(self, input_size, output_size, num_channels, kernel_size, dropout):
super(TCN, self).__init__()
self.tcn = TemporalConvNet(input_size, num_channels, kernel_size, dropout=dropout)
self.linear = nn.Linear(num_channels[-1], output_size)
self.sig = nn.Sigmoid()
def forward(self, x):
# x needs to have dimension (N, C, L) in order to be passed into CNN
output = self.tcn(x.transpose(1, 2)).transpose(1, 2)
output = self.linear(output).double()
return self.sig(output)
import argparse
import torch
import torch.nn as nn
from torch.autograd import Variable
import torch.optim as optim
from TCN.poly_music.model import TCN
from TCN.poly_music.utils import data_generator
import numpy as np
parser = argparse.ArgumentParser(description='Sequence Modeling - Polyphonic Music')
parser.add_argument('--cuda', action='store_false',
help='use CUDA (default: True)')
parser.add_argument('--dropout', type=float, default=0.25,
help='dropout applied to layers (default: 0.25)')
parser.add_argument('--clip', type=float, default=0.2,
help='gradient clip, -1 means no clip (default: 0.2)')
parser.add_argument('--epochs', type=int, default=100,
help='upper epoch limit (default: 100)')
parser.add_argument('--ksize', type=int, default=5,
help='kernel size (default: 5)')
parser.add_argument('--levels', type=int, default=4,
help='# of levels (default: 4)')
parser.add_argument('--log-interval', type=int, default=100, metavar='N',
                    help='report interval (default: 100)')
parser.add_argument('--lr', type=float, default=1e-3,
help='initial learning rate (default: 1e-3)')
parser.add_argument('--optim', type=str, default='Adam',
help='optimizer to use (default: Adam)')
parser.add_argument('--nhid', type=int, default=150,
help='number of hidden units per layer (default: 150)')
parser.add_argument('--data', type=str, default='Nott',
help='the dataset to run (default: Nott)')
parser.add_argument('--seed', type=int, default=1111,
help='random seed (default: 1111)')
args = parser.parse_args()
# Set the random seed manually for reproducibility.
torch.manual_seed(args.seed)
if torch.cuda.is_available():
if not args.cuda:
print("WARNING: You have a CUDA device, so you should probably run with --cuda")
print(args)
input_size = 88
X_train, X_valid, X_test = data_generator(args.data)
n_channels = [args.nhid] * args.levels
kernel_size = args.ksize
dropout = args.dropout
model = TCN(input_size, input_size, n_channels, kernel_size, dropout=args.dropout)
if args.cuda:
model.cuda()
criterion = nn.CrossEntropyLoss()
lr = args.lr
optimizer = getattr(optim, args.optim)(model.parameters(), lr=lr)
def evaluate(X_data):
model.eval()
eval_idx_list = np.arange(len(X_data), dtype="int32")
total_loss = 0.0
count = 0
for idx in eval_idx_list:
data_line = X_data[idx]
x, y = Variable(data_line[:-1]), Variable(data_line[1:])
if args.cuda:
x, y = x.cuda(), y.cuda()
output = model(x.unsqueeze(0)).squeeze(0)
loss = -torch.trace(torch.matmul(y, torch.log(output).float().t()) +
torch.matmul((1-y), torch.log(1-output).float().t()))
total_loss += loss.data[0]
count += output.size(0)
eval_loss = total_loss / count
print("Validation/Test loss: {:.5f}".format(eval_loss))
return eval_loss
def train(ep):
model.train()
total_loss = 0
count = 0
train_idx_list = np.arange(len(X_train), dtype="int32")
np.random.shuffle(train_idx_list)
for idx in train_idx_list:
data_line = X_train[idx]
x, y = Variable(data_line[:-1]), Variable(data_line[1:])
if args.cuda:
x, y = x.cuda(), y.cuda()
optimizer.zero_grad()
output = model(x.unsqueeze(0)).squeeze(0)
loss = -torch.trace(torch.matmul(y, torch.log(output).float().t()) +
torch.matmul((1 - y), torch.log(1 - output).float().t()))
total_loss += loss.data[0]
count += output.size(0)
        loss.backward()
        if args.clip > 0:
            torch.nn.utils.clip_grad_norm(model.parameters(), args.clip)
        optimizer.step()
if idx > 0 and idx % args.log_interval == 0:
cur_loss = total_loss / count
print("Epoch {:2d} | lr {:.5f} | loss {:.5f}".format(ep, lr, cur_loss))
total_loss = 0.0
count = 0
if __name__ == "__main__":
best_vloss = 1e8
vloss_list = []
model_name = "poly_music_{0}.pt".format(args.data)
for ep in range(1, args.epochs+1):
train(ep)
vloss = evaluate(X_valid)
tloss = evaluate(X_test)
if vloss < best_vloss:
with open(model_name, "wb") as f:
torch.save(model, f)
print("Saved model!\n")
best_vloss = vloss
if ep > 10 and vloss > max(vloss_list[-3:]):
lr /= 10
for param_group in optimizer.param_groups:
param_group['lr'] = lr
vloss_list.append(vloss)
print('-' * 89)
model = torch.load(open(model_name, "rb"))
tloss = evaluate(X_test)
from scipy.io import loadmat
import torch
import numpy as np
def data_generator(dataset):
if dataset == "JSB":
print('loading JSB data...')
data = loadmat('./mdata/JSB_Chorales.mat')
elif dataset == "Muse":
print('loading Muse data...')
data = loadmat('./mdata/MuseData.mat')
elif dataset == "Nott":
print('loading Nott data...')
data = loadmat('./mdata/Nottingham.mat')
elif dataset == "Piano":
print('loading Piano data...')
data = loadmat('./mdata/Piano_midi.mat')
X_train = data['traindata'][0]
X_valid = data['validdata'][0]
X_test = data['testdata'][0]
for data in [X_train, X_valid, X_test]:
for i in range(len(data)):
data[i] = torch.Tensor(data[i].astype(np.float64))
return X_train, X_valid, X_test
import torch
import torch.nn as nn
from torch.nn.utils import weight_norm
class Chomp1d(nn.Module):
def __init__(self, chomp_size):
super(Chomp1d, self).__init__()
self.chomp_size = chomp_size
    def forward(self, x):
        # Trim the trailing padding so the convolution stays causal
        # (no dependence on future time steps).
        return x[:, :, :-self.chomp_size].contiguous()
class TemporalBlock(nn.Module):
def __init__(self, n_inputs, n_outputs, kernel_size, stride, dilation, padding, dropout=0.2):
super(TemporalBlock, self).__init__()
self.conv1 = weight_norm(nn.Conv1d(n_inputs, n_outputs, kernel_size,
stride=stride, padding=padding, dilation=dilation))
self.chomp1 = Chomp1d(padding)
self.relu1 = nn.ReLU()
self.dropout1 = nn.Dropout2d(dropout)
self.conv2 = weight_norm(nn.Conv1d(n_outputs, n_outputs, kernel_size,
stride=stride, padding=padding, dilation=dilation))
self.chomp2 = Chomp1d(padding)
self.relu2 = nn.ReLU()
self.dropout2 = nn.Dropout2d(dropout)
self.net = nn.Sequential(self.conv1, self.chomp1, self.relu1, self.dropout1,
self.conv2, self.chomp2, self.relu2, self.dropout2)
self.downsample = nn.Conv1d(n_inputs, n_outputs, 1) if n_inputs != n_outputs else None
self.relu = nn.ReLU()
self.init_weights()
def init_weights(self):
self.conv1.weight.data.normal_(0, 0.01)
self.conv2.weight.data.normal_(0, 0.01)
if self.downsample is not None:
self.downsample.weight.data.normal_(0, 0.01)
def forward(self, x):
out = self.net(x)
res = x if self.downsample is None else self.downsample(x)
return self.relu(out + res)
class TemporalConvNet(nn.Module):
def __init__(self, num_inputs, num_channels, kernel_size=2, dropout=0.2):
super(TemporalConvNet, self).__init__()
layers = []
num_levels = len(num_channels)
for i in range(num_levels):
dilation_size = 2 ** i
in_channels = num_inputs if i == 0 else num_channels[i-1]
out_channels = num_channels[i]
layers += [TemporalBlock(in_channels, out_channels, kernel_size, stride=1, dilation=dilation_size,
padding=(kernel_size-1) * dilation_size, dropout=dropout)]
self.network = nn.Sequential(*layers)
def forward(self, x):
return self.network(x)
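# Shape sketch (illustrative): TemporalConvNet maps an input of shape
# (N, C_in, L) to an output of shape (N, num_channels[-1], L), one feature
# vector per time step, using only causal (left-side) context. For example:
#
#   tcn = TemporalConvNet(num_inputs=2, num_channels=[30] * 8, kernel_size=7)
#   out = tcn(torch.randn(16, 2, 400))   # out.shape == torch.Size([16, 30, 400])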
## Word-level Language Modeling
### Overview
In word-level language modeling tasks, each element of the sequence is a word, where the model
is expected to predict the next incoming word in the text. We evaluate the temporal convolutional
network as a word-level language model on three datasets: PennTreebank (PTB), Wikitext-103
and LAMBADA.
Because the evaluation of LAMBADA has a different requirement (predicting only the very last word
based on a broader context), we put it in another directory. See `../lambada_language`.
### Data
- **PennTreebank**: A frequently studied, but still relatively
small language corpus. When used as a word-level language corpus,
PTB contains 888K words for training, 70K for validation,
and 79K for testing, with a vocabulary size of 10K.
- **Wikitext-103**: Wikitext-103 is almost
110 times as large as PTB, featuring a vocabulary size of
about 268K. The dataset contains 28K Wikipedia articles
(about 103 million words) for training, 60 articles (about
218K words) for validation, and 60 articles (246K words)
for testing. This is a more representative and realistic dataset
than PTB, with a much larger vocabulary that includes many rare words, and it has been used in prior work (e.g. Merity et al. (2016)).
- **LAMBADA**: An even larger language corpus than Wikitext-103
consisting of novels from different categories. The goal is to
test a model's ability to understand text and predict according
to a long context. See `../lambada_language`.
See `data_generator` in `utils.py`.
### Note
- Just as it is common in recurrent network implementations to repackage (detach) the hidden
state when a new sequence begins, we pass the TCN a sequence `T` consisting of two parts:
1) the effective history `L1`, and 2) the valid sequence `L2`:
```
Sequence [---------T--------->] = [--L1--> ------L2------>]
```
In the forward pass, the whole sequence is passed into the TCN, but only the `L2` portion is used for
training. This ensures that each training position is provided with sufficient history. The sizes
of `T` and `L2` can be adjusted via the `--seq_len` and `--validseqlen` flags (see the sketch after
this list). A similar setting was used in the character-level language modeling experiments.
- The choice of dataset to load can be specified via the `--data` flag, followed by the path to
the directory containing the data. For instance, running
```
python word_cnn_test.py --data ./data/penn
```
would train on the PennTreebank (PTB) dataset, assuming it is contained in `./data/penn`.
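As a concrete illustration of the `L1`/`L2` split described above, here is a minimal, self-contained sketch (toy shapes, not the actual hyperparameters) of how the effective history is excluded from the loss, mirroring the slicing in `word_cnn_test.py`:
```
import torch
import torch.nn as nn

# Toy shapes for illustration only
batch, seq_len, validseqlen, n_words = 4, 80, 40, 10000
output = torch.randn(batch, seq_len, n_words)        # predictions over the full sequence T
targets = torch.randint(n_words, (batch, seq_len))   # next-word targets, same layout

eff_history = seq_len - validseqlen                  # length of L1, excluded from the loss
final_output = output[:, eff_history:].contiguous().view(-1, n_words)
final_target = targets[:, eff_history:].contiguous().view(-1)
loss = nn.CrossEntropyLoss()(final_output, final_target)  # only the L2 positions contribute
```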
import torch
from torch import nn
import sys
sys.path.append("../../")
from TCN.tcn import TemporalConvNet
class TCN(nn.Module):
def __init__(self, input_size, output_size, num_channels,
kernel_size=2, dropout=0.3, emb_dropout=0.1, tied_weights=False):
super(TCN, self).__init__()
self.encoder = nn.Embedding(output_size, input_size)
self.tcn = TemporalConvNet(input_size, num_channels, kernel_size, dropout=dropout)
self.decoder = nn.Linear(num_channels[-1], output_size)
if tied_weights:
if num_channels[-1] != input_size:
raise ValueError('When using the tied flag, nhid must be equal to emsize')
self.decoder.weight = self.encoder.weight
print("Weight tied")
self.drop = nn.Dropout(emb_dropout)
self.emb_dropout = emb_dropout
self.init_weights()
def init_weights(self):
self.encoder.weight.data.normal_(0, 0.01)
self.decoder.bias.data.fill_(0)
self.decoder.weight.data.normal_(0, 0.01)
def forward(self, input):
"""Input ought to have dimension (N, C_in, L_in), where L_in is the seq_len; here the input is (N, L, C)"""
emb = self.drop(self.encoder(input))
y = self.tcn(emb.transpose(1, 2)).transpose(1, 2)
y = self.decoder(y)
return y.contiguous()
import os
import torch
from torch.autograd import Variable
import pickle
"""
Note: The meaning of batch_size in PTB is different from that in the MNIST example. In MNIST,
batch_size is the number of samples considered in each iteration; in PTB, however, it is the
number of parallel segments the corpus is split into, which speeds up computation.
The goal of PTB is to train a language model to predict the next word.
"""
def data_generator(args):
if os.path.exists(args.data + "/corpus") and not args.corpus:
corpus = pickle.load(open(args.data + '/corpus', 'rb'))
else:
corpus = Corpus(args.data)
pickle.dump(corpus, open(args.data + '/corpus', 'wb'))
return corpus
class Dictionary(object):
def __init__(self):
self.word2idx = {}
self.idx2word = []
def add_word(self, word):
if word not in self.word2idx:
self.idx2word.append(word)
self.word2idx[word] = len(self.idx2word) - 1
return self.word2idx[word]
def __len__(self):
return len(self.idx2word)
class Corpus(object):
def __init__(self, path):
self.dictionary = Dictionary()
self.train = self.tokenize(os.path.join(path, 'train.txt'))
self.valid = self.tokenize(os.path.join(path, 'valid.txt'))
self.test = self.tokenize(os.path.join(path, 'test.txt'))
def tokenize(self, path):
"""Tokenizes a text file."""
assert os.path.exists(path)
# Add words to the dictionary
with open(path, 'r') as f:
tokens = 0
for line in f:
words = line.split() + ['<eos>']
tokens += len(words)
for word in words:
self.dictionary.add_word(word)
# Tokenize file content
with open(path, 'r') as f:
ids = torch.LongTensor(tokens)
token = 0
for line in f:
words = line.split() + ['<eos>']
for word in words:
ids[token] = self.dictionary.word2idx[word]
token += 1
return ids
def batchify(data, batch_size, args):
"""The output should have size [L x batch_size], where L could be a long sequence length"""
# Work out how cleanly we can divide the dataset into batch_size parts (i.e. continuous seqs).
nbatch = data.size(0) // batch_size
# Trim off any extra elements that wouldn't cleanly fit (remainders).
data = data.narrow(0, 0, nbatch * batch_size)
# Evenly divide the data across the batch_size batches.
data = data.view(batch_size, -1)
if args.cuda:
data = data.cuda()
return data
def get_batch(source, i, args, seq_len=None, evaluation=False):
seq_len = min(seq_len if seq_len else args.seq_len, source.size(1) - 1 - i)
data = Variable(source[:, i:i+seq_len], volatile=evaluation)
target = Variable(source[:, i+1:i+1+seq_len]) # CAUTION: This is un-flattened!
return data, target
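# Usage sketch (illustrative numbers, not part of the original file): with a corpus of
# 10,007 token ids and batch_size=16, batchify keeps 10,000 tokens and returns a (16, 625)
# tensor of parallel streams; get_batch(source, i, args) then slices a (16, seq_len) window
# of inputs together with its one-step-shifted targets along dimension 1.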
import argparse
import time
import math
import torch
import torch.nn as nn
from torch.autograd import Variable
import torch.optim as optim
import sys
sys.path.append("../../")
from TCN.word_cnn.utils import *
from TCN.word_cnn.model import *
import pickle
from random import randint
parser = argparse.ArgumentParser(description='Sequence Modeling - Word-level Language Modeling')
parser.add_argument('--batch_size', type=int, default=16, metavar='N',
help='batch size (default: 16)')
parser.add_argument('--cuda', action='store_false',
help='use CUDA (default: True)')
parser.add_argument('--dropout', type=float, default=0.45,
help='dropout applied to layers (default: 0.45)')
parser.add_argument('--emb_dropout', type=float, default=0.25,
help='dropout applied to the embedded layer (default: 0.25)')
parser.add_argument('--clip', type=float, default=0.35,
help='gradient clip, -1 means no clip (default: 0.35)')
parser.add_argument('--epochs', type=int, default=100,
help='upper epoch limit (default: 100)')
parser.add_argument('--ksize', type=int, default=3,
help='kernel size (default: 3)')
parser.add_argument('--data', type=str, default='./data/penn',
help='location of the data corpus (default: ./data/penn)')
parser.add_argument('--emsize', type=int, default=600,
help='size of word embeddings (default: 600)')
parser.add_argument('--levels', type=int, default=4,
help='# of levels (default: 4)')
parser.add_argument('--log-interval', type=int, default=100, metavar='N',
help='report interval (default: 100)')
parser.add_argument('--lr', type=float, default=4,
help='initial learning rate (default: 4)')
parser.add_argument('--nhid', type=int, default=600,
help='number of hidden units per layer (default: 600)')
parser.add_argument('--seed', type=int, default=1111,
help='random seed (default: 1111)')
parser.add_argument('--tied', action='store_false',
help='tie the encoder-decoder weights (default: True)')
parser.add_argument('--optim', type=str, default='SGD',
help='optimizer type (default: SGD)')
parser.add_argument('--validseqlen', type=int, default=40,
help='valid sequence length (default: 40)')
parser.add_argument('--seq_len', type=int, default=80,
help='total sequence length, including effective history (default: 80)')
parser.add_argument('--corpus', action='store_true',
help='force re-make the corpus (default: False)')
args = parser.parse_args()
# Set the random seed manually for reproducibility.
torch.manual_seed(args.seed)
if torch.cuda.is_available():
if not args.cuda:
print("WARNING: You have a CUDA device, so you should probably run with --cuda")
print(args)
corpus = data_generator(args)
eval_batch_size = 10
train_data = batchify(corpus.train, args.batch_size, args)
val_data = batchify(corpus.valid, eval_batch_size, args)
test_data = batchify(corpus.test, eval_batch_size, args)
n_words = len(corpus.dictionary)
num_chans = [args.nhid] * (args.levels - 1) + [args.emsize]
k_size = args.ksize
dropout = args.dropout
emb_dropout = args.emb_dropout
tied = args.tied
model = TCN(args.emsize, n_words, num_chans, dropout=dropout, emb_dropout=emb_dropout, kernel_size=k_size, tied_weights=tied)
if args.cuda:
model.cuda()
# May use adaptive softmax to speed up training
criterion = nn.CrossEntropyLoss()
lr = args.lr
optimizer = getattr(optim, args.optim)(model.parameters(), lr=lr)
def evaluate(data_source):
model.eval()
total_loss = 0
processed_data_size = 0
for i in range(0, data_source.size(1) - 1, args.validseqlen):
if i + args.seq_len - args.validseqlen >= data_source.size(1) - 1:
continue
data, targets = get_batch(data_source, i, args, evaluation=True)
output = model(data)
# Discard the effective history, just like in training
eff_history = args.seq_len - args.validseqlen
final_output = output[:, eff_history:].contiguous().view(-1, n_words)
final_target = targets[:, eff_history:].contiguous().view(-1)
loss = criterion(final_output, final_target)
# Note that we don't add TAR loss here
total_loss += (data.size(1) - eff_history) * loss.data
processed_data_size += data.size(1) - eff_history
return total_loss[0] / processed_data_size
def train():
# Turn on training mode which enables dropout.
global train_data
model.train()
total_loss = 0
start_time = time.time()
for batch_idx, i in enumerate(range(0, train_data.size(1) - 1, args.validseqlen)):
if i + args.seq_len - args.validseqlen >= train_data.size(1) - 1:
continue
data, targets = get_batch(train_data, i, args)
optimizer.zero_grad()
output = model(data)
# Discard the effective history part
eff_history = args.seq_len - args.validseqlen
if eff_history < 0:
            raise ValueError("Valid sequence length must not exceed the total sequence length!")
final_target = targets[:, eff_history:].contiguous().view(-1)
final_output = output[:, eff_history:].contiguous().view(-1, n_words)
loss = criterion(final_output, final_target)
loss.backward()
if args.clip > 0:
torch.nn.utils.clip_grad_norm(model.parameters(), args.clip)
optimizer.step()
total_loss += loss.data
if batch_idx % args.log_interval == 0 and batch_idx > 0:
cur_loss = total_loss[0] / args.log_interval
elapsed = time.time() - start_time
print('| epoch {:3d} | {:5d}/{:5d} batches | lr {:02.5f} | ms/batch {:5.5f} | '
'loss {:5.2f} | ppl {:8.2f}'.format(
epoch, batch_idx, train_data.size(1) // args.validseqlen, lr,
elapsed * 1000 / args.log_interval, cur_loss, math.exp(cur_loss)))
total_loss = 0
start_time = time.time()
if __name__ == "__main__":
best_vloss = 1e8
# At any point you can hit Ctrl + C to break out of training early.
try:
all_vloss = []
for epoch in range(1, args.epochs+1):
epoch_start_time = time.time()
train()
val_loss = evaluate(val_data)
test_loss = evaluate(test_data)
print('-' * 89)
print('| end of epoch {:3d} | time: {:5.2f}s | valid loss {:5.2f} | '
'valid ppl {:8.2f}'.format(epoch, (time.time() - epoch_start_time),
val_loss, math.exp(val_loss)))
print('| end of epoch {:3d} | time: {:5.2f}s | test loss {:5.2f} | '
'test ppl {:8.2f}'.format(epoch, (time.time() - epoch_start_time),
test_loss, math.exp(test_loss)))
print('-' * 89)
# Save the model if the validation loss is the best we've seen so far.
if val_loss < best_vloss:
with open("model.pt", 'wb') as f:
print('Save model!\n')
torch.save(model, f)
best_vloss = val_loss
# Anneal the learning rate if the validation loss plateaus
if epoch > 5 and val_loss >= max(all_vloss[-5:]):
lr = lr / 2.
for param_group in optimizer.param_groups:
param_group['lr'] = lr
all_vloss.append(val_loss)
except KeyboardInterrupt:
print('-' * 89)
print('Exiting from training early')
# Load the best saved model.
with open("model.pt", 'rb') as f:
model = torch.load(f)
# Run on test data.
test_loss = evaluate(test_data)
print('=' * 89)
print('| End of training | test loss {:5.2f} | test ppl {:8.2f}'.format(
test_loss, math.exp(test_loss)))
print('=' * 89)