Commit ba0a5437 authored by Shaojie Bai

Initial code commit + README

.DS_Store/
data/
*.pt
*.log
*.pyc
__pycache__/
# Sequence-Model-Benchmarks-TCN
This repository contains the experiments done in the work [An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling](#) by Shaojie Bai, J. Zico Kolter and Vladlen Koltun.
We specifically target a comprehensive set of tasks that have been repeatedly used to compare the effectiveness of different recurrent networks, and evaluate a simple, generic but powerful (purely) convolutional network on the recurrent nets' home turf.
Experiments are done in PyTorch.
## Domains and Datasets
This repository contains benchmarks for the following tasks, with details explained in each sub-directory:
- **The Adding Problem** with various T (we evaluated on T=200, 400, 600)
- **Copying Memory Task** with various T (we evaluated on T=500, 1000, 2000)
- **Sequential MNIST** digit classification
- **Permuted Sequential MNIST** (based on Seq. MNIST, but more challenging)
- **JSB Chorales** polyphonic music
- **Nottingham** polyphonic music
- **PennTreebank** [SMALL] word-level language modeling (LM)
- **Wikitext-103** [LARGE] word-level LM
- **LAMBADA** [LARGE] word-level LM and textual understanding
- **PennTreebank** [MEDIUM] char-level LM
- **Shakespeare** [SMALL] char-level LM (Note: a small dataset)
- **text8** [LARGE] char-level LM
Some of the large datasets are not included in this repo; we use the [observations](#) package to download them, which can be easily installed using pip.
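For example, the PTB corpus is fetched with a single call (a minimal sketch mirroring `char_cnn/utils.py`; the files are cached under `data/`):
```
import observations

# Returns the training / test / validation splits as plain strings, downloading
# them into data/ on first use (the same call used in char_cnn/utils.py).
traintext, testtext, valtext = observations.ptb('data/')
```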
## Usage
Each task is contained in its own directory, with the following structure:
```
[TASK_NAME]/
    data/
    [TASK_NAME]_test.py
    model.py
    utils.py
```
To run the TCN model on a task, one only needs to run `[TASK_NAME]_test.py` (e.g. `add_test.py`). To tune the hyperparameters, specify them via the argument options, which can be seen via the `-h` flag.
## The Adding Problem
### Overview
In this task, each input consists of a length-T sequence of depth 2, with all values in the first dimension chosen randomly in [0, 1]. The second dimension consists of all zeros except for two elements, which are marked by 1. The objective is to sum the two random values whose second dimensions are marked by 1. One can think of this as computing the dot product of the two dimensions. Simply predicting the sum to be 1 should give an MSE of about 0.1767.
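For example, a toy length-6 sample and its label might look like:
```
dimension 1 (values):  0.71  0.15  0.93  0.30  0.47  0.08
dimension 2 (markers): 0     1     0     0     1     0
label:                 0.15 + 0.47 = 0.62
```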
### Data Generation
See `data_generator` in `utils.py`.
### Note
Because a TCN's receptive field depends on the depth of the network and the filter size, we need
to make sure the model we use can cover the sequence length T (see the sketch below).
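A quick back-of-the-envelope check (a sketch based on `tcn.py` in this repository, where level `i` stacks two causal convolutions with dilation `2^i`):
```
def tcn_receptive_field(levels, kernel_size):
    # Each of the two convolutions in level i looks (kernel_size - 1) * 2**i
    # steps further into the past, so the stacked receptive field is
    # 1 + 2 * (kernel_size - 1) * (2**0 + 2**1 + ... + 2**(levels - 1)).
    return 1 + 2 * (kernel_size - 1) * (2 ** levels - 1)

# With the defaults in add_test.py (--ksize 7, --levels 8):
print(tcn_receptive_field(8, 7))   # 3061, which comfortably covers T = 400
```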
import torch
import argparse
import torch.optim as optim
import torch.nn.functional as F
from TCN.adding_problem.model import TCN
from TCN.adding_problem.utils import data_generator
parser = argparse.ArgumentParser(description='Sequence Modeling - The Adding Problem')
parser.add_argument('--batch_size', type=int, default=32, metavar='N',
help='batch size (default: 32)')
parser.add_argument('--cuda', action='store_false',
help='use CUDA (default: True)')
parser.add_argument('--dropout', type=float, default=0.0,
help='dropout applied to layers (default: 0.0)')
parser.add_argument('--clip', type=float, default=-1,
help='gradient clip, -1 means no clip (default: -1)')
parser.add_argument('--epochs', type=int, default=10,
help='upper epoch limit (default: 10)')
parser.add_argument('--ksize', type=int, default=7,
                    help='kernel size (default: 7)')
parser.add_argument('--levels', type=int, default=8,
                    help='# of levels (default: 8)')
parser.add_argument('--seq_len', type=int, default=400,
help='sequence length (default: 400)')
parser.add_argument('--log-interval', type=int, default=100, metavar='N',
                    help='report interval (default: 100)')
parser.add_argument('--lr', type=float, default=4e-3,
help='initial learning rate (default: 4e-3)')
parser.add_argument('--optim', type=str, default='Adam',
help='optimizer to use (default: Adam)')
parser.add_argument('--nhid', type=int, default=30,
help='number of hidden units per layer (default: 30)')
parser.add_argument('--seed', type=int, default=1111,
help='random seed (default: 1111)')
args = parser.parse_args()
torch.manual_seed(args.seed)
if torch.cuda.is_available():
if not args.cuda:
print("WARNING: You have a CUDA device, so you should probably run with --cuda")
input_channels = 2
n_classes = 1
batch_size = args.batch_size
seq_length = args.seq_len
epochs = args.epochs
print(args)
print("Producing data...")
X_train, Y_train = data_generator(50000, seq_length)
X_test, Y_test = data_generator(1000, seq_length)
# Note: We use a very simple setting here (assuming all levels have the same # of channels).
channel_sizes = [args.nhid]*args.levels
kernel_size = args.ksize
dropout = args.dropout
model = TCN(input_channels, n_classes, channel_sizes, kernel_size=kernel_size, dropout=dropout)
if args.cuda:
model.cuda()
X_train = X_train.cuda()
Y_train = Y_train.cuda()
X_test = X_test.cuda()
Y_test = Y_test.cuda()
lr = args.lr
optimizer = getattr(optim, args.optim)(model.parameters(), lr=lr)
def train(epoch):
global lr
model.train()
batch_idx = 1
total_loss = 0
for i in range(0, X_train.size()[0], batch_size):
if i + batch_size > X_train.size()[0]:
x, y = X_train[i:], Y_train[i:]
else:
x, y = X_train[i:(i+batch_size)], Y_train[i:(i+batch_size)]
optimizer.zero_grad()
output = model(x)
loss = F.mse_loss(output, y)
loss.backward()
if args.clip > 0:
torch.nn.utils.clip_grad_norm(model.parameters(), args.clip)
optimizer.step()
batch_idx += 1
total_loss += loss.data[0]
if batch_idx % args.log_interval == 0:
cur_loss = total_loss / args.log_interval
processed = min(i+batch_size, X_train.size()[0])
print('Train Epoch: {:2d} [{:6d}/{:6d} ({:.0f}%)]\tLearning rate: {:.4f}\tLoss: {:.6f}'.format(
epoch, processed, X_train.size()[0], 100.*processed/X_train.size()[0], lr, cur_loss))
total_loss = 0
def evaluate():
model.eval()
output = model(X_test)
test_loss = F.mse_loss(output, Y_test)
print('\nTest set: Average loss: {:.6f}\n'.format(test_loss.data[0]))
return test_loss.data[0]
for ep in range(1, epochs+1):
train(ep)
tloss = evaluate()
from torch import nn
from TCN.tcn import TemporalConvNet
class TCN(nn.Module):
def __init__(self, input_size, output_size, num_channels, kernel_size, dropout):
super(TCN, self).__init__()
self.tcn = TemporalConvNet(input_size, num_channels, kernel_size=kernel_size, dropout=dropout)
self.linear = nn.Linear(num_channels[-1], output_size)
self.init_weights()
def init_weights(self):
self.linear.weight.data.normal_(0, 0.01)
def forward(self, x):
y1 = self.tcn(x)
return self.linear(y1[:, :, -1])
import torch
import numpy as np
from torch.autograd import Variable
def data_generator(N, seq_length):
"""
Args:
seq_length: Length of the adding problem data
N: # of data in the set
"""
X_num = torch.rand([N, 1, seq_length])
X_mask = torch.zeros([N, 1, seq_length])
Y = torch.zeros([N, 1])
for i in range(N):
positions = np.random.choice(seq_length, size=2, replace=False)
X_mask[i, 0, positions[0]] = 1
X_mask[i, 0, positions[1]] = 1
Y[i,0] = X_num[i, 0, positions[0]] + X_num[i, 0, positions[1]]
X = torch.cat((X_num, X_mask), dim=1)
return Variable(X), Variable(Y)
## Character-level Language Modeling
### Overview
In character-level language modeling tasks, each sequence is broken into elements by characters.
Therefore, in a character-level language model, at each time step the model is expected to predict
the next coming character. We evaluate the temporal convolutional network as a character-level
language model on the PennTreebank dataset and the text8 dataset.
### Data
- **PennTreebank**: When used as a character-level language corpus, PTB contains 5,059K characters for training, 396K for validation, and 446K for testing, with an alphabet size of 50. PennTreebank is a well-studied (but relatively small) language dataset.
- **text8**: text8 is about 20 times larger than PTB, with about 100M characters from Wikipedia (90M for training, 5M for validation, and 5M for testing). The corpus has an alphabet size of 27.
See `data_generator` in `utils.py`. We download the language corpora using the [observations](#) package in Python.
### Note
- Just like in a recurrent network implementation where it is common to repackage
hidden units when a new sequence begins, we pass into TCN a sequence `T` consisting
of two parts: 1) effective history `L1`, and 2) valid sequence `L2`:
```
Sequence [---------T---------] = [--L1-- -----L2-----]
```
In the forward pass, the whole sequence is passed into TCN, but only the `L2` portion is used for
training. This ensures that the training data are also provided with sufficient history. The size
of `T` and `L2` can be adjusted via the flags `seq_len` and `validseqlen` (see the sketch at the end of this section).
- The choice of dataset to use can be specified via the `--dataset` flag. For instance, running
```
python char_cnn_test.py --dataset ptb
```
would (download if no data found, and) train on the PennTreebank (PTB) dataset.
- Empirically, we found that Adam works better than SGD on the text8 dataset.
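For concreteness, the slicing in the training loop of the test script below looks roughly like this (a minimal excerpt; `model`, `inp`, `target`, `criterion` and `n_characters` are defined in that script):
```
seq_len, validseqlen = 400, 320           # total length T and valid part L2
eff_history = seq_len - validseqlen       # L1 = 80 warm-up positions

output = model(inp)                       # inp: (batch, seq_len) character ids
final_output = output[:, eff_history:].contiguous().view(-1, n_characters)
final_target = target[:, eff_history:].contiguous().view(-1)
loss = criterion(final_output, final_target)   # loss over the L2 positions only
```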
import argparse
import torch.nn as nn
import torch.optim as optim
import sys
sys.path.append("../../")
from TCN.char_cnn.utils import *
from TCN.char_cnn.model import TCN
import time
import math
import warnings
warnings.filterwarnings("ignore")  # Suppress the RuntimeWarning on unicode
parser = argparse.ArgumentParser(description='Sequence Modeling - Character Level Language Model')
parser.add_argument('--batch_size', type=int, default=32, metavar='N',
help='batch size (default: 32)')
parser.add_argument('--cuda', action='store_false',
help='use CUDA (default: True)')
parser.add_argument('--dropout', type=float, default=0.1,
help='dropout applied to layers (default: 0.1)')
parser.add_argument('--emb_dropout', type=float, default=0.1,
help='dropout applied to the embedded layer (0 = no dropout)')
parser.add_argument('--clip', type=float, default=0.15,
help='gradient clip, -1 means no clip (default: 0.15)')
parser.add_argument('--epochs', type=int, default=100,
help='upper epoch limit (default: 100)')
parser.add_argument('--ksize', type=int, default=3,
                    help='kernel size (default: 3)')
parser.add_argument('--levels', type=int, default=3,
                    help='# of levels (default: 3)')
parser.add_argument('--log-interval', type=int, default=100, metavar='N',
                    help='report interval (default: 100)')
parser.add_argument('--lr', type=float, default=4,
help='initial learning rate (default: 4)')
parser.add_argument('--emsize', type=int, default=100,
help='dimension of character embeddings (default: 100)')
parser.add_argument('--optim', type=str, default='SGD',
help='optimizer to use (default: SGD)')
parser.add_argument('--nhid', type=int, default=450,
                    help='number of hidden units per layer (default: 450)')
parser.add_argument('--validseqlen', type=int, default=320,
help='valid sequence length (default: 320)')
parser.add_argument('--seq_len', type=int, default=400,
help='total sequence length, including effective history (default: 400)')
parser.add_argument('--seed', type=int, default=1111,
help='random seed (default: 1111)')
parser.add_argument('--dataset', type=str, default='ptb',
help='dataset to use (default: ptb)')
args = parser.parse_args()
# Set the random seed manually for reproducibility.
torch.manual_seed(args.seed)
if torch.cuda.is_available():
if not args.cuda:
print("WARNING: You have a CUDA device, so you should probably run with --cuda")
print(args)
file, file_len, valfile, valfile_len, testfile, testfile_len, corpus = data_generator(args)
n_characters = len(corpus.dict)
train_data = batchify(char_tensor(corpus, file), args.batch_size, args)
val_data = batchify(char_tensor(corpus, valfile), 1, args)
test_data = batchify(char_tensor(corpus, testfile), 1, args)
print("Corpus size: ", n_characters)
num_chans = [args.nhid] * (args.levels - 1) + [args.emsize]
k_size = args.ksize
dropout = args.dropout
emb_dropout = args.emb_dropout
model = TCN(args.emsize, n_characters, num_chans, kernel_size=k_size, dropout=dropout, emb_dropout=emb_dropout)
if args.cuda:
model.cuda()
criterion = nn.CrossEntropyLoss()
lr = args.lr
optimizer = getattr(optim, args.optim)(model.parameters(), lr=lr)
def evaluate(source):
model.eval()
total_loss = 0
count = 0
source_len = source.size(1)
for batch, i in enumerate(range(0, source_len - 1, args.validseqlen)):
if i + args.seq_len - args.validseqlen >= source_len:
continue
inp, target = get_batch(source, i, args)
output = model(inp)
eff_history = args.seq_len - args.validseqlen
final_output = output[:, eff_history:].contiguous().view(-1, n_characters)
final_target = target[:, eff_history:].contiguous().view(-1)
loss = criterion(final_output, final_target)
total_loss += loss.data * final_output.size(0)
count += final_output.size(0)
val_loss = total_loss[0] / count * 1.0
return val_loss
def train(epoch):
model.train()
total_loss = 0
start_time = time.time()
losses = []
source = train_data
source_len = source.size(1)
for batch_idx, i in enumerate(range(0, source_len - 1, args.validseqlen)):
if i + args.seq_len - args.validseqlen >= source_len:
continue
inp, target = get_batch(source, i, args)
optimizer.zero_grad()
output = model(inp)
eff_history = args.seq_len - args.validseqlen
final_output = output[:, eff_history:].contiguous().view(-1, n_characters)
final_target = target[:, eff_history:].contiguous().view(-1)
loss = criterion(final_output, final_target)
loss.backward()
if args.clip > 0:
torch.nn.utils.clip_grad_norm(model.parameters(), args.clip)
optimizer.step()
total_loss += loss.data
if batch_idx % args.log_interval == 0 and batch_idx > 0:
cur_loss = total_loss[0] / args.log_interval
losses.append(cur_loss)
elapsed = time.time() - start_time
print('| epoch {:3d} | {:5d}/{:5d} batches | lr {:02.5f} | ms/batch {:5.2f} | '
'loss {:5.3f} | bpc {:5.3f}'.format(
epoch, batch_idx, int((source_len-0.5) / args.validseqlen), lr,
elapsed * 1000 / args.log_interval, cur_loss, cur_loss / math.log(2)))
total_loss = 0
start_time = time.time()
# if batch % (200 * args.log_interval) == 0 and batch > 0:
# vloss = evaluate(val_data)
# print('-' * 89)
# print('| In epoch {:3d} | valid loss {:5.3f} | '
# 'valid bpc {:8.3f}'.format(epoch, vloss, vloss / math.log(2)))
# model.train()
return sum(losses) * 1.0 / len(losses)
def main():
global lr
try:
print("Training for %d epochs..." % args.epochs)
all_losses = []
best_vloss = 1e7
for epoch in range(1, args.epochs + 1):
loss = train(epoch)
vloss = evaluate(val_data)
print('-' * 89)
print('| End of epoch {:3d} | valid loss {:5.3f} | valid bpc {:8.3f}'.format(
epoch, vloss, vloss / math.log(2)))
test_loss = evaluate(test_data)
print('=' * 89)
print('| End of epoch {:3d} | test loss {:5.3f} | test bpc {:8.3f}'.format(
epoch, test_loss, test_loss / math.log(2)))
print('=' * 89)
if epoch > 5 and vloss > max(all_losses[-3:]):
lr = lr / 10.
for param_group in optimizer.param_groups:
param_group['lr'] = lr
all_losses.append(vloss)
if vloss < best_vloss:
print("Saving...")
save(model)
best_vloss = vloss
except KeyboardInterrupt:
print('-' * 89)
print("Saving before quit...")
save(model)
# Run on test data.
test_loss = evaluate(test_data)
print('=' * 89)
print('| End of training | test loss {:5.3f} | test bpc {:8.3f}'.format(
test_loss, test_loss / math.log(2)))
print('=' * 89)
# train_by_random_chunk()
if __name__ == "__main__":
main()
from torch import nn
import sys
sys.path.append("../../")
from TCN.tcn import TemporalConvNet
class TCN(nn.Module):
def __init__(self, input_size, output_size, num_channels, kernel_size=2, dropout=0.2, emb_dropout=0.2):
super(TCN, self).__init__()
self.encoder = nn.Embedding(output_size, input_size)
self.tcn = TemporalConvNet(input_size, num_channels, kernel_size=kernel_size, dropout=dropout)
self.decoder = nn.Linear(input_size, output_size)
self.decoder.weight = self.encoder.weight
self.drop = nn.Dropout(emb_dropout)
self.init_weights()
def init_weights(self):
initrange = 0.1
self.encoder.weight.data.uniform_(-initrange, initrange)
self.decoder.bias.data.fill_(0)
self.decoder.weight.data.uniform_(-initrange, initrange)
def forward(self, x):
# input has dimension (N, L_in), and emb has dimension (N, L_in, C_in)
emb = self.drop(self.encoder(x))
y = self.tcn(emb.transpose(1, 2))
o = self.decoder(y.transpose(1, 2))
return o.contiguous()
import unidecode
import torch
from torch.autograd import Variable
from collections import Counter
import observations
import os
import pickle
cuda = torch.cuda.is_available()
def data_generator(args):
file, testfile, valfile = getattr(observations, args.dataset)('data/')
file_len = len(file)
valfile_len = len(valfile)
testfile_len = len(testfile)
corpus = Corpus(file + " " + valfile + " " + testfile)
#############################################################
# Use the following if you want to pickle the loaded data
#
# pickle_name = "{0}.corpus".format(args.dataset)
# if os.path.exists(pickle_name):
# corpus = pickle.load(open(pickle_name, 'rb'))
# else:
# corpus = Corpus(file + " " + valfile + " " + testfile)
# pickle.dump(corpus, open(pickle_name, 'wb'))
#############################################################
return file, file_len, valfile, valfile_len, testfile, testfile_len, corpus
def read_file(filename):
file = unidecode.unidecode(open(filename).read())
return file, len(file)
class Dictionary(object):
def __init__(self):
self.char2idx = {}
self.idx2char = []
self.counter = Counter()
def add_word(self, char):
self.counter[char] += 1
def prep_dict(self):
for char in self.counter:
if char not in self.char2idx:
self.idx2char.append(char)
self.char2idx[char] = len(self.idx2char) - 1
def __len__(self):
return len(self.idx2char)
class Corpus(object):
def __init__(self, string):
self.dict = Dictionary()
for c in string:
self.dict.add_word(c)
self.dict.prep_dict()
def char_tensor(corpus, string):
tensor = torch.zeros(len(string)).long()
for i in range(len(string)):
tensor[i] = corpus.dict.char2idx[string[i]]
return Variable(tensor).cuda() if cuda else Variable(tensor)
def batchify(data, batch_size, args):
"""The output should have size [L x batch_size], where L could be a long sequence length"""
# Work out how cleanly we can divide the dataset into batch_size parts (i.e. continuous seqs).
nbatch = data.size(0) // batch_size
# Trim off any extra elements that wouldn't cleanly fit (remainders).
data = data.narrow(0, 0, nbatch * batch_size)
# Evenly divide the data across the batch_size batches.
data = data.view(batch_size, -1)
if args.cuda:
data = data.cuda()
return data
def get_batch(source, start_index, args):
seq_len = min(args.seq_len, source.size(1) - 1 - start_index)
end_index = start_index + seq_len
inp = source[:, start_index:end_index].contiguous()
target = source[:, start_index+1:end_index+1].contiguous() # The successors of the inp.
return inp, target
def save(model):
save_filename = 'model.pt'
torch.save(model, save_filename)
print('Saved as %s' % save_filename)
## Copying Memory Task
### Overview
In this task, each input sequence has length T+20. The first 10 values are chosen randomly from the digits 1-8, with the rest being all zeros, except for the last 11 entries, which are filled with the digit '9' (the first '9' is a delimiter). The goal is to produce an output of the same length that is zero everywhere, except for the last 10 values after the delimiter, where the model is expected to repeat the 10 values it encountered at the start of the input, as illustrated below.
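Concretely, each input/target pair produced by `data_generator` in `utils.py` has the following layout (`a1 ... a10` denote the ten random digits):
```
input:  a1 ... a10 | 0 0 ... 0 (T-1 zeros) | 9 9 9 9 9 9 9 9 9 9 9
target: 0 0 ... 0    (T+10 zeros)          | a1 a2 ... a10
```
Both sequences have length T+20, and the first `9` is the delimiter that signals the ten recall steps which follow it.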
### Data Generation
See `data_generator` in `utils.py`.
### Note
- Because a TCN's receptive field depends on the depth of the network and the filter size, we need
to make sure the model we use can cover the sequence length T+20.
- Using the `--seq_len` flag, one can change the number of values to recall (the typical setup is 10).
import argparse
import torch
import torch.nn as nn
import torch.optim as optim
from torch.autograd import Variable
import numpy as np
from TCN.copy_memory.utils import data_generator
from TCN.copy_memory.model import TCN
import time
parser = argparse.ArgumentParser(description='Sequence Modeling - Copying Memory Task')
parser.add_argument('--batch_size', type=int, default=32, metavar='N',
help='batch size (default: 32)')
parser.add_argument('--cuda', action='store_false',
help='use CUDA (default: True)')
parser.add_argument('--dropout', type=float, default=0.0,
help='dropout applied to layers (default: 0.0)')
parser.add_argument('--clip', type=float, default=1.0,
help='gradient clip, -1 means no clip (default: 1.0)')
parser.add_argument('--epochs', type=int, default=50,
help='upper epoch limit (default: 50)')
parser.add_argument('--ksize', type=int, default=8,
help='kernel size (default: 8)')
parser.add_argument('--iters', type=int, default=100,
help='number of iters per epoch (default: 100)')
parser.add_argument('--levels', type=int, default=8,
                    help='# of levels (default: 8)')
parser.add_argument('--blank_len', type=int, default=1000, metavar='N',
help='The size of the blank (i.e. T) (default: 1000)')
parser.add_argument('--seq_len', type=int, default=10,
help='initial history size (default: 10)')
parser.add_argument('--log-interval', type=int, default=50, metavar='N',
                    help='report interval (default: 50)')
parser.add_argument('--lr', type=float, default=5e-4,
help='initial learning rate (default: 5e-4)')
parser.add_argument('--optim', type=str, default='RMSprop',
help='optimizer to use (default: RMSprop)')
parser.add_argument('--nhid', type=int, default=10,
help='number of hidden units per layer (default: 10)')
parser.add_argument('--seed', type=int, default=1111,
help='random seed (default: 1111)')
args = parser.parse_args()
torch.manual_seed(args.seed)
if torch.cuda.is_available():
if not args.cuda:
print("WARNING: You have a CUDA device, so you should probably run with --cuda")
batch_size = args.batch_size
seq_len = args.seq_len # The size to memorize
epochs = args.epochs
iters = args.iters
T = args.blank_len
n_steps = T + (2 * seq_len)
n_classes = 10 # Digits 0 - 9
n_train = 10000
n_test = 1000
print(args)
print("Preparing data...")
train_x, train_y = data_generator(T, seq_len, n_train)
test_x, test_y = data_generator(T, seq_len, n_test)
channel_sizes = [args.nhid] * args.levels
kernel_size = args.ksize
dropout = args.dropout
model = TCN(1, n_classes, channel_sizes, kernel_size, dropout=dropout)
if args.cuda:
model.cuda()
train_x = train_x.cuda()
train_y = train_y.cuda()
test_x = test_x.cuda()
test_y = test_y.cuda()
criterion = nn.CrossEntropyLoss()
lr = args.lr
optimizer = getattr(optim, args.optim)(model.parameters(), lr=lr)
def evaluate():
model.eval()
out = model(test_x.unsqueeze(1).contiguous())
loss = criterion(out.view(-1, n_classes), test_y.view(-1))
pred = out.view(-1, n_classes).data.max(1, keepdim=True)[1]
correct = pred.eq(test_y.data.view_as(pred)).cpu().sum()
counter = out.view(-1, n_classes).size(0)
print('\nTest set: Average loss: {:.8f} | Accuracy: {:.4f}\n'.format(
loss.data[0], 100. * correct / counter))
return loss.data[0]
def train(ep):
global batch_size, seq_len, iters, epochs
model.train()
total_loss = 0
start_time = time.time()
correct = 0
counter = 0
for batch_idx, batch in enumerate(range(0, n_train, batch_size)):
start_ind = batch
end_ind = start_ind + batch_size
x = train_x[start_ind:end_ind]
y = train_y[start_ind:end_ind]
        optimizer.zero_grad()
        out = model(x.unsqueeze(1).contiguous())
        loss = criterion(out.view(-1, n_classes), y.view(-1))
        pred = out.view(-1, n_classes).data.max(1, keepdim=True)[1]
        correct += pred.eq(y.data.view_as(pred)).cpu().sum()
        counter += out.view(-1, n_classes).size(0)
        loss.backward()
        if args.clip > 0:
            torch.nn.utils.clip_grad_norm(model.parameters(), args.clip)
        optimizer.step()
        total_loss += loss
if batch_idx > 0 and batch_idx % args.log_interval == 0:
avg_loss = total_loss / args.log_interval
elapsed = time.time() - start_time
print('| Epoch {:3d} | {:5d}/{:5d} batches | lr {:2.5f} | ms/batch {:5.2f} | '
'loss {:5.8f} | accuracy {:5.4f}'.format(
ep, batch_idx, n_train // batch_size+1, args.lr, elapsed * 1000 / args.log_interval,
avg_loss.data[0], 100. * correct / counter))
start_time = time.time()
total_loss = 0
correct = 0
counter = 0
for ep in range(1, epochs + 1):
train(ep)
evaluate()
from torch import nn
from TCN.tcn import TemporalConvNet
class TCN(nn.Module):
def __init__(self, input_size, output_size, num_channels, kernel_size, dropout):
super(TCN, self).__init__()
self.tcn = TemporalConvNet(input_size, num_channels, kernel_size=kernel_size, dropout=dropout)
self.linear = nn.Linear(num_channels[-1], output_size)
self.init_weights()
def init_weights(self):
self.linear.weight.data.normal_(0, 0.01)
def forward(self, x):
y1 = self.tcn(x)
return self.linear(y1.transpose(1, 2))
import numpy as np
import torch
from torch.autograd import Variable
def data_generator(T, mem_length, b_size):
"""
Generate data for the copying memory task
:param T: The total blank time length
:param mem_length: The length of the memory to be recalled
    :param b_size: The number of sequences to generate
    :return: Input and target data tensors
"""
seq = torch.from_numpy(np.random.randint(1, 9, size=(b_size, mem_length))).float()
zeros = torch.zeros((b_size, T))
marker = 9 * torch.ones((b_size, mem_length + 1))
placeholders = torch.zeros((b_size, mem_length))
x = torch.cat((seq, zeros[:, :-1], marker), 1)
y = torch.cat((placeholders, zeros, seq), 1).long()
x, y = Variable(x), Variable(y)
return x, y
## Word-level Language Modeling
### Overview
LAMBADA is a collection of narrative passages that share the characteristic that human subjects can guess the target word accurately when given sufficient context, but not if they only see the last sentence containing it. On average, the context contains 4.6 sentences, and test performance is evaluated by having the model predict the last element of the target sentence (i.e. the very last word).
Most existing computational models fail on this task (without the help of an external memory unit, such as a neural cache). See [the original LAMBADA paper](https://arxiv.org/pdf/1606.06031.pdf) for more results on applying RNNs to LAMBADA.
**Example**:
```
Context: “Yes, I thought I was going to lose the baby.” “I was scared too,” he stated, sincerity flooding his eyes. “You were ?” “Yes, of course. Why do you even ask?” “This baby wasn’t exactly planned for.”
Target sentence: “Do you honestly think that I would want you to have a _______”
Target word: miscarriage
```
### Data
See `data_generator` in `utils.py`. You will need to download the LAMBADA dataset from [here](http://clic.cimec.unitn.it/lambada/) and put it under the directory `./data/lambada` (or another path specified via the `--data` flag).
### Note
- Just like in a recurrent network implementation where it is common to repackage
hidden units when a new sequence begins, we pass into TCN a sequence `T` consisting
of two parts: 1) effective history `L1`, and 2) valid sequence `L2`:
```
Sequence [---------T--------->] = [--L1--> ------L2------>]
```
In the forward pass, the whole sequence is passed into TCN, but only the `L2` portion is used for
training. This ensures that the training data are also provided with sufficient history. The size
of `T` and `L2` can be adjusted via flags `seq_len` and `validseqlen`.
- The choice of data to load can be specified via the `--data` flag, followed by the path to
the directory containing the data. For instance, running
```
python lambada_test.py --data ./data/lambada
```
would train on the LAMBADA dataset, if it is contained in `./data/lambada`.
- LAMBADA is a huge dataset with a very large vocabulary.
import argparse
import time
import math
import torch
import torch.nn as nn
from torch.autograd import Variable
import torch.optim as optim
import sys
sys.path.append("../../")
from TCN.lambada_language.utils import *
from TCN.lambada_language.model import TCN
import pickle
parser = argparse.ArgumentParser(description='Sequence Modeling - LAMBADA Textual Understanding')
parser.add_argument('--batch_size', type=int, default=20, metavar='N',
help='batch size (default: 20)')
parser.add_argument('--cuda', action='store_false',
help='use CUDA (default: True)')
parser.add_argument('--dropout', type=float, default=0.1,
help='dropout applied to layers (default: 0.1)')
parser.add_argument('--emb_dropout', type=float, default=0.1,
help='dropout applied to the embedded layer (default: 0.1)')
parser.add_argument('--clip', type=float, default=0.4,
help='gradient clip, -1 means no clip (default: 0.4)')
parser.add_argument('--epochs', type=int, default=100,
help='upper epoch limit (default: 100)')
parser.add_argument('--ksize', type=int, default=4,
help='kernel size (default: 4)')
parser.add_argument('--data', type=str, default='./data/lambada',
help='location of the data corpus (default: ./data/lambada)')
parser.add_argument('--emsize', type=int, default=500,
help='size of word embeddings (default: 500)')
parser.add_argument('--levels', type=int, default=5,
help='# of levels (default: 5)')
parser.add_argument('--log-interval', type=int, default=100, metavar='N',
help='report interval (default: 100)')
parser.add_argument('--lr', type=float, default=4,
help='initial learning rate (default: 4)')
parser.add_argument('--nhid', type=int, default=500,
help='number of hidden units per layer (default: 500)')
parser.add_argument('--seed', type=int, default=1111,
help='random seed (default: 1111)')
parser.add_argument('--tied', action='store_false',
help='tie the word embedding and softmax weights (default: True)')
parser.add_argument('--optim', type=str, default='SGD',
help='optimizer type (default: SGD)')
parser.add_argument('--validseqlen', type=int, default=50,
help='valid sequence length (default: 50)')
parser.add_argument('--seq_len', type=int, default=100,
help='total sequence length, including effective history (default: 100)')
parser.add_argument('--corpus', action='store_true',
help='force re-make the corpus (default: False)')
args = parser.parse_args()
# Set the random seed manually for reproducibility.
torch.manual_seed(args.seed)
if torch.cuda.is_available():
if not args.cuda:
print("WARNING: You have a CUDA device, so you should probably run with --cuda")
print(args)
train_data, val_data, test_data, corpus = data_generator(args)
n_words = len(corpus.dictionary)
print("Total # of words: {0}".format(n_words))
num_chans = [args.nhid] * (args.levels - 1) + [args.emsize]
k_size = args.ksize
dropout = args.dropout
emb_dropout = args.emb_dropout
tied = args.tied
model = TCN(args.emsize, n_words, num_chans, dropout=dropout,
emb_dropout=emb_dropout, kernel_size=k_size, tied_weights=tied)
if args.cuda:
model.cuda()
criterion = nn.CrossEntropyLoss()
lr = args.lr
optimizer = getattr(optim, args.optim)(model.parameters(), lr=lr)
def evaluate(data_source):
model.eval()
total_loss = 0
processed_data_size = 0
correct = 0
for i in range(len(data_source)):
data, targets = torch.LongTensor(data_source[i]).view(1, -1), torch.LongTensor([data_source[i][-1]]).view(1, -1)
data, targets = Variable(data), Variable(targets)
if args.cuda:
data, targets = data.cuda(), targets.cuda()
output = model(data)
final_output = output[:, -1].contiguous().view(-1, n_words)
final_target = targets[:, -1].contiguous().view(-1)
loss = criterion(final_output, final_target)
total_loss += loss.data
processed_data_size += 1
return total_loss[0] / processed_data_size
def train():
global train_data
model.train()
total_loss = 0
start_time = time.time()
    for batch_idx, i in enumerate(range(0, train_data.size(1) - 1, args.validseqlen)):
if i + args.seq_len - args.validseqlen >= train_data.size(1) - 1:
continue
data, targets = get_batch(train_data, i, args)
optimizer.zero_grad()
output = model(data)
eff_history = args.seq_len - args.validseqlen
if eff_history < 0:
raise ValueError("Valid sequence length must be smaller than sequence length!")
final_target = targets[:, eff_history:].contiguous().view(-1)
final_output = output[:, eff_history:].contiguous().view(-1, n_words)
loss = criterion(final_output, final_target)
loss.backward()
if args.clip > 0:
torch.nn.utils.clip_grad_norm(model.parameters(), args.clip)
optimizer.step()
total_loss += loss.data
if batch_idx % args.log_interval == 0 and batch_idx > 0:
cur_loss = total_loss[0] / args.log_interval
elapsed = time.time() - start_time
print('| epoch {:3d} | {:5d}/{:5d} batches | lr {:02.5f} | ms/batch {:5.5f} | '
'loss {:5.2f} | ppl {:8.2f}'.format(
epoch, batch_idx, train_data.size(1) // args.validseqlen, lr,
elapsed * 1000 / args.log_interval, cur_loss, math.exp(cur_loss)))
total_loss = 0
reg_loss = 0
start_time = time.time()
if __name__ == "__main__":
best_vloss = 1e8
try:
all_vloss = []
for epoch in range(1, args.epochs+1):
epoch_start_time = time.time()
train()
val_loss = evaluate(val_data)
test_loss = evaluate(test_data)
print('-' * 89)
print('| end of epoch {:3d} | time: {:5.2f}s | valid loss {:5.2f} | '
'valid ppl {:8.2f}'.format(epoch, (time.time() - epoch_start_time),
val_loss, math.exp(val_loss)))
print('| end of epoch {:3d} | time: {:5.2f}s | test loss {:5.2f} | '
'test ppl {:8.2f}'.format(epoch, (time.time() - epoch_start_time),
test_loss, math.exp(test_loss)))
print('-' * 89)
# Save the model if the validation loss is the best we've seen so far.
if val_loss < best_vloss:
with open("model.pt", 'wb') as f:
print('Save model!\n')
torch.save(model, f)
best_vloss = val_loss
if epoch > 5 and val_loss >= max(all_vloss[-5:]):
lr = lr / 10.
for param_group in optimizer.param_groups:
param_group['lr'] = lr
all_vloss.append(val_loss)
except KeyboardInterrupt:
print('-' * 89)
print('Exiting from training early')
# Load the best saved model.
with open("model.pt", 'rb') as f:
model = torch.load(f)
# Run on test data.
test_loss = evaluate(test_data)
print('=' * 89)
print('| End of training | test loss {:5.2f} | test ppl {:8.2f}'.format(
test_loss, math.exp(test_loss)))
print('=' * 89)
import torch
from torch import nn
import sys
sys.path.append("../../")
from TCN.tcn import TemporalConvNet
class TCN(nn.Module):
def __init__(self, input_size, output_size, num_channels,
kernel_size=2, dropout=0.3, emb_dropout=0.1, tied_weights=False):
super(TCN, self).__init__()
self.encoder = nn.Embedding(output_size, input_size)
self.tcn = TemporalConvNet(input_size, num_channels, kernel_size, dropout=dropout)
self.decoder = nn.Linear(num_channels[-1], output_size)
if tied_weights:
if num_channels[-1] != input_size:
raise ValueError('When using the tied flag, nhid must be equal to emsize')
self.decoder.weight = self.encoder.weight
print("Weight tied")
self.drop = nn.Dropout(emb_dropout)
self.emb_dropout = emb_dropout
self.init_weights()
def init_weights(self):
self.encoder.weight.data.normal_(0, 0.01)
self.decoder.bias.data.fill_(0)
self.decoder.weight.data.normal_(0, 0.01)
def forward(self, input):
"""Input ought to have dimension (N, C_in, L_in), where L_in is the seq_len; here the input is (N, L, C)"""
emb = self.drop(self.encoder(input))
y = self.tcn(emb.transpose(1, 2)).transpose(1, 2)
y = self.decoder(y)
return y.contiguous()
import os
import torch
from torch.autograd import Variable
import re
from collections import Counter
import pickle
"""
Note: The meaning of batch_size in PTB is different from that in MNIST example. In MNIST,
batch_size is the # of sample data that is considered in each iteration; in PTB, however,
it is the number of segments to speed up computation.
The goal of PTB is to train a language model to predict the next word.
"""
def data_generator(args):
if os.path.exists(args.data + "/corpus") and not args.corpus:
corpus = pickle.load(open(args.data + '/corpus', 'rb'))
else:
print("Creating Corpus...")
corpus = Corpus(args.data + "/lambada_vocabulary_sorted.txt", args.data)
pickle.dump(corpus, open(args.data + '/corpus', 'wb'))
eval_batch_size = 1
train_data = batchify(corpus.train, args.batch_size, args)
val_data = [[0] * (args.seq_len-len(line)) + line for line in corpus.valid]
test_data = [[0] * (args.seq_len-len(line)) + line for line in corpus.test]
return train_data, val_data, test_data, corpus
class Dictionary(object):
def __init__(self):
self.word2idx = {}
self.idx2word = []
def add_word(self, word):
if word not in self.word2idx:
self.idx2word.append(word)
self.word2idx[word] = len(self.idx2word) - 1
return self.word2idx[word]
def __len__(self):
return len(self.idx2word)
class Corpus(object):
def __init__(self, dict_path, path):
self.dictionary = Dictionary()
self.prep_dict(dict_path)
self.train = torch.LongTensor(self.tokenize(os.path.join(path, 'train-novels')))
self.valid = self.tokenize(os.path.join(path, 'lambada_development_plain_text.txt'), eval=True)
self.test = self.tokenize(os.path.join(path, 'lambada_test_plain_text.txt'), eval=True)
def prep_dict(self, dict_path):
assert os.path.exists(dict_path)
# Add words to the dictionary
with open(dict_path, 'r') as f:
tokens = 0
for line in f:
word = line.strip()
tokens += 1
self.dictionary.add_word(word)
if "<eos>" not in self.dictionary.word2idx:
self.dictionary.add_word("<eos>")
tokens += 1
print("The dictionary captured a vocabulary of size {0}.".format(tokens))
def tokenize(self, path, eval=False):
assert os.path.exists(path)
ids = []
token = 0
misses = 0
if not path.endswith(".txt"): # it's a folder
for subdir in os.listdir(path):
for filename in os.listdir(path + "/" + subdir):
if filename.endswith(".txt"):
full_path = "{0}/{1}/{2}".format(path, subdir, filename)
# Tokenize file content
delta_ids, delta_token, delta_miss = self._tokenize_file(full_path, eval=eval)
ids += delta_ids
token += delta_token
misses += delta_miss
else:
ids, token, misses = self._tokenize_file(path, eval=eval)
print(token, misses)
return ids
def _tokenize_file(self, path, eval=False):
with open(path, 'r') as f:
token = 0
ids = []
misses = 0
for line in f:
line_ids = []
words = line.strip().split() + ['<eos>']
if eval:
words = words[:-1]
for word in words:
# These words are in the text but not vocabulary
if word == "n't":
word = "not"
elif word == "'s":
word = "is"
elif word == "'re":
word = "are"
elif word == "'ve":
word = "have"
elif word == "wo":
word = "will"
if word not in self.dictionary.word2idx:
word = re.sub(r'[^\w\s]', '', word)
if word not in self.dictionary.word2idx:
misses += 1
continue
line_ids.append(self.dictionary.word2idx[word])
token += 1
if eval:
ids.append(line_ids)
else:
ids += line_ids
return ids, token, misses
def batchify(data, batch_size, args):
"""The output should have size [L x batch_size], where L could be a long sequence length"""
# Work out how cleanly we can divide the dataset into batch_size parts (i.e. continuous seqs).
nbatch = data.size(0) // batch_size
# Trim off any extra elements that wouldn't cleanly fit (remainders).
data = data.narrow(0, 0, nbatch * batch_size)
# Evenly divide the data across the batch_size batches.
data = data.view(batch_size, -1)
print(data.size())
if args.cuda:
data = data.cuda()
return data
def get_batch(source, i, args, seq_len=None, evaluation=False):
seq_len = min(seq_len if seq_len else args.seq_len, source.size(1) - 1 - i)
data = Variable(source[:, i:i+seq_len], volatile=evaluation)
target = Variable(source[:, i+1:i+1+seq_len]) # CAUTION: This is un-flattened!
return data, target
## Sequential MNIST & Permuted Sequential MNIST
### Overview
MNIST is a handwritten digit classification dataset (Lecun et al., 1998) that is frequently used to
test deep learning models. In particular, sequential MNIST is frequently used to test a recurrent
network’s ability to retain information from the distant past (see paper for references). In
this task, each MNIST image (28 x 28) is presented to the model as a 784 × 1 sequence
for digit classification. In the more challenging permuted MNIST (P-MNIST) setting, the order of
the sequence is shuffled according to a fixed random permutation (see the sketch below).
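A minimal sketch of how the inputs are prepared (mirroring the test script below; for P-MNIST the permutation is drawn once and reused for every image):
```
import torch
import numpy as np

permute = torch.from_numpy(np.random.permutation(784)).long()  # one fixed order

images = torch.randn(64, 1, 28, 28)      # stand-in for a batch of MNIST digits
seqs = images.view(-1, 1, 784)           # Sequential MNIST: (N, C=1, L=784)
pseqs = seqs[:, :, permute]              # P-MNIST: same reordering for every image
```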
### Data
See `data_generator` in `utils.py`. You only need to download the data once. The default path
to store the data is at `./data/mnist`.
Original source of the data can be found [here](http://yann.lecun.com/exdb/mnist/).
### Note
- Because a TCN's receptive field depends on the depth of the network and the filter size, we need
to make sure the model we use can cover the sequence length 784.
- While this is a sequence model task, we only use the very last output (i.e. at time T=784) for
the eventual classification.
import torch.nn.functional as F
from torch import nn
from TCN.tcn import TemporalConvNet
class TCN(nn.Module):
def __init__(self, input_size, output_size, num_channels, kernel_size, dropout):
super(TCN, self).__init__()
self.tcn = TemporalConvNet(input_size, num_channels, kernel_size=kernel_size, dropout=dropout)
self.linear = nn.Linear(num_channels[-1], output_size)
def forward(self, inputs):
"""Inputs have to have dimension (N, C_in, L_in)"""
y1 = self.tcn(inputs) # input should have dimension (N, C, L)
o = self.linear(y1[:, :, -1])
return F.log_softmax(o, dim=1)
import torch
from torch.autograd import Variable
import torch.optim as optim
import torch.nn.functional as F
from TCN.mnist_pixel.utils import data_generator
from TCN.mnist_pixel.model import TCN
import numpy as np
import argparse
parser = argparse.ArgumentParser(description='Sequence Modeling - (Permuted) Sequential MNIST')
parser.add_argument('--batch_size', type=int, default=64, metavar='N',
help='batch size (default: 64)')
parser.add_argument('--cuda', action='store_false',
help='use CUDA (default: True)')
parser.add_argument('--dropout', type=float, default=0.05,
                    help='dropout applied to layers (default: 0.05)')
parser.add_argument('--clip', type=float, default=-1,
help='gradient clip, -1 means no clip (default: -1)')
parser.add_argument('--epochs', type=int, default=20,
help='upper epoch limit (default: 20)')
parser.add_argument('--ksize', type=int, default=7,
                    help='kernel size (default: 7)')
parser.add_argument('--levels', type=int, default=8,
help='# of levels (default: 8)')
parser.add_argument('--log-interval', type=int, default=100, metavar='N',
                    help='report interval (default: 100)')
parser.add_argument('--lr', type=float, default=2e-3,
                    help='initial learning rate (default: 2e-3)')
parser.add_argument('--optim', type=str, default='Adam',
help='optimizer to use (default: Adam)')
parser.add_argument('--nhid', type=int, default=25,
help='number of hidden units per layer (default: 25)')
parser.add_argument('--seed', type=int, default=1111,
help='random seed (default: 1111)')
parser.add_argument('--permute', action='store_true',
help='use permuted MNIST (default: false)')
args = parser.parse_args()
torch.manual_seed(args.seed)
if torch.cuda.is_available():
if not args.cuda:
print("WARNING: You have a CUDA device, so you should probably run with --cuda")
root = './data/mnist'
batch_size = args.batch_size
n_classes = 10
input_channels = 1
seq_length = int(784 / input_channels)
epochs = args.epochs
steps = 0
print(args)
train_loader, test_loader = data_generator(root, batch_size)
permute = torch.Tensor(np.random.permutation(784).astype(np.float64)).long()
channel_sizes = [args.nhid] * args.levels
kernel_size = args.ksize
model = TCN(input_channels, n_classes, channel_sizes, kernel_size=kernel_size, dropout=args.dropout)
if args.cuda:
model.cuda()
permute = permute.cuda()
lr = args.lr
optimizer = getattr(optim, args.optim)(model.parameters(), lr=lr)
def train(ep):
global steps
train_loss = 0
model.train()
for batch_idx, (data, target) in enumerate(train_loader):
if args.cuda: data, target = data.cuda(), target.cuda()
data = data.view(-1, input_channels, seq_length)
if args.permute:
data = data[:, :, permute]
data, target = Variable(data), Variable(target)
optimizer.zero_grad()
output = model(data)
loss = F.nll_loss(output, target)
loss.backward()
if args.clip > 0:
torch.nn.utils.clip_grad_norm(model.parameters(), args.clip)
optimizer.step()
train_loss += loss
steps += seq_length
if batch_idx > 0 and batch_idx % args.log_interval == 0:
print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}\tSteps: {}'.format(
ep, batch_idx * batch_size, len(train_loader.dataset),
100. * batch_idx / len(train_loader), train_loss.data[0]/args.log_interval, steps))
train_loss = 0
def test():
model.eval()
test_loss = 0
correct = 0
for data, target in test_loader:
if args.cuda:
data, target = data.cuda(), target.cuda()
data = data.view(-1, input_channels, seq_length)
if args.permute:
data = data[:, :, permute]
data, target = Variable(data, volatile=True), Variable(target)
output = model(data)
test_loss += F.nll_loss(output, target, size_average=False).data[0]
pred = output.data.max(1, keepdim=True)[1]
correct += pred.eq(target.data.view_as(pred)).cpu().sum()
test_loss /= len(test_loader.dataset)
print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
test_loss, correct, len(test_loader.dataset),
100. * correct / len(test_loader.dataset)))
return test_loss
if __name__ == "__main__":
for epoch in range(1, epochs+1):
train(epoch)
test()
if epoch % 10 == 0:
lr /= 10
for param_group in optimizer.param_groups:
param_group['lr'] = lr
import torch
from torchvision import datasets, transforms
def data_generator(root, batch_size):
train_set = datasets.MNIST(root=root, train=True, download=True,
transform=transforms.Compose([
transforms.ToTensor(),
transforms.Normalize((0.1307,), (0.3081,))
]))
test_set = datasets.MNIST(root=root, train=False, download=True,
transform=transforms.Compose([
transforms.ToTensor(),
transforms.Normalize((0.1307,), (0.3081,))
]))
train_loader = torch.utils.data.DataLoader(train_set, batch_size=batch_size)
test_loader = torch.utils.data.DataLoader(test_set, batch_size=batch_size)
return train_loader, test_loader
## Polyphonic Music Dataset
### Overview
We evaluate the temporal convolutional network (TCN) on two popular polyphonic music datasets, described below.
- **JSB Chorales** dataset (Allan & Williams, 2005) is a polyphonic music dataset consisting of the entire corpus of 382 four-part harmonized chorales by J. S. Bach. In a polyphonic music dataset, each input is a sequence of elements having 88 dimensions, representing the 88 keys on a piano. Therefore, each element `x_t` is a chord written as a binary vector, in which a “1” indicates a key that is pressed.
- **Nottingham** dataset is a collection of 1200 British and American folk tunes. Nottingham is a much larger dataset than JSB Chorales. Along with JSB Chorales, Nottingham has been used in a number of works that investigated recurrent models’ applicability to polyphonic music, and the performance on both tasks is measured in terms of negative log-likelihood (NLL) loss.
The goal here is to predict the next note given some history of the notes played.
### Data
See `data_generator` in `utils.py`. The data has been pre-processed and can be loaded directly using
scipy functions.
Original source of the data can be found [here](http://www-etud.iro.umontreal.ca/~boulanni/icml2012).
### Note
- Each sequence can have a different length. In the current implementation, we simply train on each
sequence separately (i.e. the batch size is 1), but one can zero-pad all sequences to the same length
and train in batches.
- One can use different datasets by specifying them via the `--data` flag on the command line. The
default is `Nott`, for Nottingham.
- While each entry is binary, the fact that there are 88 dimensions (for the 88 keys) means there are
essentially `2^88` "classes". Therefore, instead of predicting each key directly, we follow the
standard practice of adding a sigmoid at the end of the network. This ensures that every entry is
converted to a value between 0 and 1 before computing the NLL loss (see the sketch below).
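The loss in `train()`/`evaluate()` below writes this summed binary NLL as a matrix trace; a minimal sketch of the equivalence, with dummy tensors standing in for the sigmoid output `p` and the binary piano-roll target `y`:
```
import torch

p = torch.rand(5, 88).double()            # sigmoid outputs for 5 time steps
y = (torch.rand(5, 88) > 0.9).double()    # binary piano-roll targets

nll_trace = -torch.trace(torch.matmul(y, torch.log(p).t()) +
                         torch.matmul(1 - y, torch.log(1 - p).t()))
nll_sum = -(y * torch.log(p) + (1 - y) * torch.log(1 - p)).sum()
# nll_trace and nll_sum agree up to floating-point error
```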
from torch import nn
from TCN.tcn import TemporalConvNet
import torch.nn.functional as F
class TCN(nn.Module):
def __init__(self, input_size, output_size, num_channels, kernel_size, dropout):
super(TCN, self).__init__()
self.tcn = TemporalConvNet(input_size, num_channels, kernel_size, dropout=dropout)
self.linear = nn.Linear(num_channels[-1], output_size)
self.sig = nn.Sigmoid()
def forward(self, x):
# x needs to have dimension (N, C, L) in order to be passed into CNN
output = self.tcn(x.transpose(1, 2)).transpose(1, 2)
output = self.linear(output).double()
return self.sig(output)
import argparse
import torch
import torch.nn as nn
from torch.autograd import Variable
import torch.optim as optim
from TCN.poly_music.model import TCN
from TCN.poly_music.utils import data_generator
import numpy as np
parser = argparse.ArgumentParser(description='Sequence Modeling - Polyphonic Music')
parser.add_argument('--cuda', action='store_false',
help='use CUDA (default: True)')
parser.add_argument('--dropout', type=float, default=0.25,
help='dropout applied to layers (default: 0.25)')
parser.add_argument('--clip', type=float, default=0.2,
help='gradient clip, -1 means no clip (default: 0.2)')
parser.add_argument('--epochs', type=int, default=100,
help='upper epoch limit (default: 100)')
parser.add_argument('--ksize', type=int, default=5,
help='kernel size (default: 5)')
parser.add_argument('--levels', type=int, default=4,
help='# of levels (default: 4)')
parser.add_argument('--log-interval', type=int, default=100, metavar='N',
                    help='report interval (default: 100)')
parser.add_argument('--lr', type=float, default=1e-3,
help='initial learning rate (default: 1e-3)')
parser.add_argument('--optim', type=str, default='Adam',
help='optimizer to use (default: Adam)')
parser.add_argument('--nhid', type=int, default=150,
help='number of hidden units per layer (default: 150)')
parser.add_argument('--data', type=str, default='Nott',
help='the dataset to run (default: Nott)')
parser.add_argument('--seed', type=int, default=1111,
help='random seed (default: 1111)')
args = parser.parse_args()
# Set the random seed manually for reproducibility.
torch.manual_seed(args.seed)
if torch.cuda.is_available():
if not args.cuda:
print("WARNING: You have a CUDA device, so you should probably run with --cuda")
print(args)
input_size = 88
X_train, X_valid, X_test = data_generator(args.data)
n_channels = [args.nhid] * args.levels
kernel_size = args.ksize
dropout = args.dropout
model = TCN(input_size, input_size, n_channels, kernel_size, dropout=args.dropout)
if args.cuda:
model.cuda()
criterion = nn.CrossEntropyLoss()
lr = args.lr
optimizer = getattr(optim, args.optim)(model.parameters(), lr=lr)
def evaluate(X_data):
model.eval()
eval_idx_list = np.arange(len(X_data), dtype="int32")
total_loss = 0.0
count = 0
for idx in eval_idx_list:
data_line = X_data[idx]
x, y = Variable(data_line[:-1]), Variable(data_line[1:])
if args.cuda:
x, y = x.cuda(), y.cuda()
output = model(x.unsqueeze(0)).squeeze(0)
loss = -torch.trace(torch.matmul(y, torch.log(output).float().t()) +
torch.matmul((1-y), torch.log(1-output).float().t()))
total_loss += loss.data[0]
count += output.size(0)
eval_loss = total_loss / count
print("Validation/Test loss: {:.5f}".format(eval_loss))
return eval_loss
def train(ep):
model.train()
total_loss = 0
count = 0
train_idx_list = np.arange(len(X_train), dtype="int32")
np.random.shuffle(train_idx_list)
for idx in train_idx_list:
data_line = X_train[idx]
x, y = Variable(data_line[:-1]), Variable(data_line[1:])
if args.cuda:
x, y = x.cuda(), y.cuda()
optimizer.zero_grad()
output = model(x.unsqueeze(0)).squeeze(0)
loss = -torch.trace(torch.matmul(y, torch.log(output).float().t()) +
torch.matmul((1 - y), torch.log(1 - output).float().t()))
total_loss += loss.data[0]
count += output.size(0)
        loss.backward()
        if args.clip > 0:
            torch.nn.utils.clip_grad_norm(model.parameters(), args.clip)
        optimizer.step()
if idx > 0 and idx % args.log_interval == 0:
cur_loss = total_loss / count
print("Epoch {:2d} | lr {:.5f} | loss {:.5f}".format(ep, lr, cur_loss))
total_loss = 0.0
count = 0
if __name__ == "__main__":
best_vloss = 1e8
vloss_list = []
model_name = "poly_music_{0}.pt".format(args.data)
for ep in range(1, args.epochs+1):
train(ep)
vloss = evaluate(X_valid)
tloss = evaluate(X_test)
if vloss < best_vloss:
with open(model_name, "wb") as f:
torch.save(model, f)
print("Saved model!\n")
best_vloss = vloss
if ep > 10 and vloss > max(vloss_list[-3:]):
lr /= 10
for param_group in optimizer.param_groups:
param_group['lr'] = lr
vloss_list.append(vloss)
print('-' * 89)
model = torch.load(open(model_name, "rb"))
tloss = evaluate(X_test)
from scipy.io import loadmat
import torch
import numpy as np
def data_generator(dataset):
if dataset == "JSB":
print('loading JSB data...')
data = loadmat('./mdata/JSB_Chorales.mat')
elif dataset == "Muse":
print('loading Muse data...')
data = loadmat('./mdata/MuseData.mat')
elif dataset == "Nott":
print('loading Nott data...')
data = loadmat('./mdata/Nottingham.mat')
elif dataset == "Piano":
print('loading Piano data...')
data = loadmat('./mdata/Piano_midi.mat')
X_train = data['traindata'][0]
X_valid = data['validdata'][0]
X_test = data['testdata'][0]
for data in [X_train, X_valid, X_test]:
for i in range(len(data)):
data[i] = torch.Tensor(data[i].astype(np.float64))
return X_train, X_valid, X_test
import torch
import torch.nn as nn
from torch.nn.utils import weight_norm
class Chomp1d(nn.Module):
def __init__(self, chomp_size):
super(Chomp1d, self).__init__()
self.chomp_size = chomp_size
    def forward(self, x):
        # Trim the trailing padding so the convolution stays causal
        # (no dependence on future time steps).
        return x[:, :, :-self.chomp_size].contiguous()
class TemporalBlock(nn.Module):
def __init__(self, n_inputs, n_outputs, kernel_size, stride, dilation, padding, dropout=0.2):
super(TemporalBlock, self).__init__()
self.conv1 = weight_norm(nn.Conv1d(n_inputs, n_outputs, kernel_size,
stride=stride, padding=padding, dilation=dilation))
self.chomp1 = Chomp1d(padding)
self.relu1 = nn.ReLU()
self.dropout1 = nn.Dropout2d(dropout)
self.conv2 = weight_norm(nn.Conv1d(n_outputs, n_outputs, kernel_size,
stride=stride, padding=padding, dilation=dilation))
self.chomp2 = Chomp1d(padding)
self.relu2 = nn.ReLU()
self.dropout2 = nn.Dropout2d(dropout)
self.net = nn.Sequential(self.conv1, self.chomp1, self.relu1, self.dropout1,
self.conv2, self.chomp2, self.relu2, self.dropout2)
self.downsample = nn.Conv1d(n_inputs, n_outputs, 1) if n_inputs != n_outputs else None
self.relu = nn.ReLU()
self.init_weights()
def init_weights(self):
self.conv1.weight.data.normal_(0, 0.01)
self.conv2.weight.data.normal_(0, 0.01)
if self.downsample is not None:
self.downsample.weight.data.normal_(0, 0.01)
def forward(self, x):
out = self.net(x)
res = x if self.downsample is None else self.downsample(x)
return self.relu(out + res)
class TemporalConvNet(nn.Module):
def __init__(self, num_inputs, num_channels, kernel_size=2, dropout=0.2):
super(TemporalConvNet, self).__init__()
layers = []
num_levels = len(num_channels)
for i in range(num_levels):
dilation_size = 2 ** i
in_channels = num_inputs if i == 0 else num_channels[i-1]
out_channels = num_channels[i]
layers += [TemporalBlock(in_channels, out_channels, kernel_size, stride=1, dilation=dilation_size,
padding=(kernel_size-1) * dilation_size, dropout=dropout)]
self.network = nn.Sequential(*layers)
def forward(self, x):
return self.network(x)
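# Shape sketch (illustrative): TemporalConvNet maps an input of shape
# (N, C_in, L) to an output of shape (N, num_channels[-1], L), one feature
# vector per time step, using only causal (left-side) context. For example:
#
#   tcn = TemporalConvNet(num_inputs=2, num_channels=[30] * 8, kernel_size=7)
#   out = tcn(torch.randn(16, 2, 400))   # out.shape == torch.Size([16, 30, 400])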
## Word-level Language Modeling
### Overview
In word-level language modeling tasks, each element of the sequence is a word, where the model
is expected to predict the next incoming word in the text. We evaluate the temporal convolutional
network as a word-level language model on three datasets: PennTreebank (PTB), Wikitext-103
and LAMBADA.
Because the evaluation of LAMBADA has a different requirement (predicting only the very last word
based on a broader context), we put it in another directory. See `../lambada_language`.
### Data
- **PennTreebank**: A frequently studied, but still relatively
small language corpus. When used as a word-level language corpus,
PTB contains 888K words for training, 70K for validation,
and 79K for testing, with a vocabulary size of 10K.
- **Wikitext-103**: Wikitext-103 is almost
110 times as large as PTB, featuring a vocabulary size of
about 268K. The dataset contains 28K Wikipedia articles
(about 103 million words) for training, 60 articles (about
218K words) for validation, and 60 articles (246K words)
for testing. This is a more representative and realistic dataset
than PTB, with a much larger vocabulary that includes many rare words, and it has been used in prior work (e.g. Merity et al. (2016)).
- **LAMBADA**: An even larger language corpus than Wikitext-103
consisting of novels from different categories. The goal is to
test a model's ability to understand text and predict according
to a long context. See `../lambada_language`.
See `data_generator` in `utils.py`.
### Note
- Just as it is common in recurrent network implementations to repackage (detach) the hidden
state when a new sequence begins, we pass the TCN a sequence `T` consisting of two parts:
1) the effective history `L1`, and 2) the valid sequence `L2`:
```
Sequence [---------T--------->] = [--L1--> ------L2------>]
```
In the forward pass, the whole sequence is passed into the TCN, but only the `L2` portion is used for
training. This ensures that each training position is provided with sufficient history. The sizes
of `T` and `L2` can be adjusted via the `--seq_len` and `--validseqlen` flags (see the sketch after
this list). A similar setting was used in the character-level language modeling experiments.
- The choice of dataset to load can be specified via the `--data` flag, followed by the path to
the directory containing the data. For instance, running
```
python word_cnn_test.py --data ./data/penn
```
would train on the PennTreebank (PTB) dataset, assuming it is contained in `./data/penn`.
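As a concrete illustration of the `L1`/`L2` split described above, here is a minimal, self-contained sketch (toy shapes, not the actual hyperparameters) of how the effective history is excluded from the loss, mirroring the slicing in `word_cnn_test.py`:
```
import torch
import torch.nn as nn

# Toy shapes for illustration only
batch, seq_len, validseqlen, n_words = 4, 80, 40, 10000
output = torch.randn(batch, seq_len, n_words)        # predictions over the full sequence T
targets = torch.randint(n_words, (batch, seq_len))   # next-word targets, same layout

eff_history = seq_len - validseqlen                  # length of L1, excluded from the loss
final_output = output[:, eff_history:].contiguous().view(-1, n_words)
final_target = targets[:, eff_history:].contiguous().view(-1)
loss = nn.CrossEntropyLoss()(final_output, final_target)  # only the L2 positions contribute
```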
import torch
from torch import nn
import sys
sys.path.append("../../")
from TCN.tcn import TemporalConvNet
class TCN(nn.Module):
def __init__(self, input_size, output_size, num_channels,
kernel_size=2, dropout=0.3, emb_dropout=0.1, tied_weights=False):
super(TCN, self).__init__()
self.encoder = nn.Embedding(output_size, input_size)
self.tcn = TemporalConvNet(input_size, num_channels, kernel_size, dropout=dropout)
self.decoder = nn.Linear(num_channels[-1], output_size)
if tied_weights:
if num_channels[-1] != input_size:
raise ValueError('When using the tied flag, nhid must be equal to emsize')
self.decoder.weight = self.encoder.weight
print("Weight tied")
self.drop = nn.Dropout(emb_dropout)
self.emb_dropout = emb_dropout
self.init_weights()
def init_weights(self):
self.encoder.weight.data.normal_(0, 0.01)
self.decoder.bias.data.fill_(0)
self.decoder.weight.data.normal_(0, 0.01)
def forward(self, input):
"""Input ought to have dimension (N, C_in, L_in), where L_in is the seq_len; here the input is (N, L, C)"""
emb = self.drop(self.encoder(input))
y = self.tcn(emb.transpose(1, 2)).transpose(1, 2)
y = self.decoder(y)
return y.contiguous()
import os
import torch
from torch.autograd import Variable
import pickle
"""
Note: The meaning of batch_size in PTB is different from that in the MNIST example. In MNIST,
batch_size is the number of samples considered in each iteration; in PTB, however, it is the
number of parallel segments the corpus is split into, which speeds up computation.
The goal of PTB is to train a language model to predict the next word.
"""
def data_generator(args):
if os.path.exists(args.data + "/corpus") and not args.corpus:
corpus = pickle.load(open(args.data + '/corpus', 'rb'))
else:
corpus = Corpus(args.data)
pickle.dump(corpus, open(args.data + '/corpus', 'wb'))
return corpus
class Dictionary(object):
def __init__(self):
self.word2idx = {}
self.idx2word = []
def add_word(self, word):
if word not in self.word2idx:
self.idx2word.append(word)
self.word2idx[word] = len(self.idx2word) - 1
return self.word2idx[word]
def __len__(self):
return len(self.idx2word)
class Corpus(object):
def __init__(self, path):
self.dictionary = Dictionary()
self.train = self.tokenize(os.path.join(path, 'train.txt'))
self.valid = self.tokenize(os.path.join(path, 'valid.txt'))
self.test = self.tokenize(os.path.join(path, 'test.txt'))
def tokenize(self, path):
"""Tokenizes a text file."""
assert os.path.exists(path)
# Add words to the dictionary
with open(path, 'r') as f:
tokens = 0
for line in f:
words = line.split() + ['<eos>']
tokens += len(words)
for word in words:
self.dictionary.add_word(word)
# Tokenize file content
with open(path, 'r') as f:
ids = torch.LongTensor(tokens)
token = 0
for line in f:
words = line.split() + ['<eos>']
for word in words:
ids[token] = self.dictionary.word2idx[word]
token += 1
return ids
def batchify(data, batch_size, args):
"""The output should have size [L x batch_size], where L could be a long sequence length"""
# Work out how cleanly we can divide the dataset into batch_size parts (i.e. continuous seqs).
nbatch = data.size(0) // batch_size
# Trim off any extra elements that wouldn't cleanly fit (remainders).
data = data.narrow(0, 0, nbatch * batch_size)
# Evenly divide the data across the batch_size batches.
data = data.view(batch_size, -1)
if args.cuda:
data = data.cuda()
return data
def get_batch(source, i, args, seq_len=None, evaluation=False):
seq_len = min(seq_len if seq_len else args.seq_len, source.size(1) - 1 - i)
data = Variable(source[:, i:i+seq_len], volatile=evaluation)
target = Variable(source[:, i+1:i+1+seq_len]) # CAUTION: This is un-flattened!
return data, target
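# Usage sketch (illustrative numbers, not part of the original file): with a corpus of
# 10,007 token ids and batch_size=16, batchify keeps 10,000 tokens and returns a (16, 625)
# tensor of parallel streams; get_batch(source, i, args) then slices a (16, seq_len) window
# of inputs together with its one-step-shifted targets along dimension 1.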
import argparse
import time
import math
import torch
import torch.nn as nn
from torch.autograd import Variable
import torch.optim as optim
import sys
sys.path.append("../../")
from TCN.word_cnn.utils import *
from TCN.word_cnn.model import *
import pickle
from random import randint
parser = argparse.ArgumentParser(description='Sequence Modeling - Word-level Language Modeling')
parser.add_argument('--batch_size', type=int, default=16, metavar='N',
help='batch size (default: 16)')
parser.add_argument('--cuda', action='store_false',
help='use CUDA (default: True)')
parser.add_argument('--dropout', type=float, default=0.45,
help='dropout applied to layers (default: 0.45)')
parser.add_argument('--emb_dropout', type=float, default=0.25,
help='dropout applied to the embedded layer (default: 0.25)')
parser.add_argument('--clip', type=float, default=0.35,
help='gradient clip, -1 means no clip (default: 0.35)')
parser.add_argument('--epochs', type=int, default=100,
help='upper epoch limit (default: 100)')
parser.add_argument('--ksize', type=int, default=3,
help='kernel size (default: 3)')
parser.add_argument('--data', type=str, default='./data/penn',
help='location of the data corpus (default: ./data/penn)')
parser.add_argument('--emsize', type=int, default=600,
help='size of word embeddings (default: 600)')
parser.add_argument('--levels', type=int, default=4,
help='# of levels (default: 4)')
parser.add_argument('--log-interval', type=int, default=100, metavar='N',
help='report interval (default: 100)')
parser.add_argument('--lr', type=float, default=4,
help='initial learning rate (default: 4)')
parser.add_argument('--nhid', type=int, default=600,
help='number of hidden units per layer (default: 600)')
parser.add_argument('--seed', type=int, default=1111,
help='random seed (default: 1111)')
parser.add_argument('--tied', action='store_false',
help='tie the encoder-decoder weights (default: True)')
parser.add_argument('--optim', type=str, default='SGD',
help='optimizer type (default: SGD)')
parser.add_argument('--validseqlen', type=int, default=40,
help='valid sequence length (default: 40)')
parser.add_argument('--seq_len', type=int, default=80,
help='total sequence length, including effective history (default: 80)')
parser.add_argument('--corpus', action='store_true',
help='force re-make the corpus (default: False)')
args = parser.parse_args()
# Set the random seed manually for reproducibility.
torch.manual_seed(args.seed)
if torch.cuda.is_available():
if not args.cuda:
print("WARNING: You have a CUDA device, so you should probably run with --cuda")
print(args)
corpus = data_generator(args)
eval_batch_size = 10
train_data = batchify(corpus.train, args.batch_size, args)
val_data = batchify(corpus.valid, eval_batch_size, args)
test_data = batchify(corpus.test, eval_batch_size, args)
n_words = len(corpus.dictionary)
num_chans = [args.nhid] * (args.levels - 1) + [args.emsize]
k_size = args.ksize
dropout = args.dropout
emb_dropout = args.emb_dropout
tied = args.tied
model = TCN(args.emsize, n_words, num_chans, dropout=dropout, emb_dropout=emb_dropout, kernel_size=k_size, tied_weights=tied)
if args.cuda:
model.cuda()
# May use adaptive softmax to speed up training
criterion = nn.CrossEntropyLoss()
lr = args.lr
optimizer = getattr(optim, args.optim)(model.parameters(), lr=lr)
def evaluate(data_source):
model.eval()
total_loss = 0
processed_data_size = 0
for i in range(0, data_source.size(1) - 1, args.validseqlen):
if i + args.seq_len - args.validseqlen >= data_source.size(1) - 1:
continue
data, targets = get_batch(data_source, i, args, evaluation=True)
output = model(data)
# Discard the effective history, just like in training
eff_history = args.seq_len - args.validseqlen
final_output = output[:, eff_history:].contiguous().view(-1, n_words)
final_target = targets[:, eff_history:].contiguous().view(-1)
loss = criterion(final_output, final_target)
# Note that we don't add TAR loss here
total_loss += (data.size(1) - eff_history) * loss.data
processed_data_size += data.size(1) - eff_history
return total_loss[0] / processed_data_size
def train():
# Turn on training mode which enables dropout.
global train_data
model.train()
total_loss = 0
start_time = time.time()
for batch_idx, i in enumerate(range(0, train_data.size(1) - 1, args.validseqlen)):
if i + args.seq_len - args.validseqlen >= train_data.size(1) - 1:
continue
data, targets = get_batch(train_data, i, args)
optimizer.zero_grad()
output = model(data)
# Discard the effective history part
eff_history = args.seq_len - args.validseqlen
if eff_history < 0:
            raise ValueError("Valid sequence length must not exceed the total sequence length!")
final_target = targets[:, eff_history:].contiguous().view(-1)
final_output = output[:, eff_history:].contiguous().view(-1, n_words)
loss = criterion(final_output, final_target)
loss.backward()
if args.clip > 0:
torch.nn.utils.clip_grad_norm(model.parameters(), args.clip)
optimizer.step()
total_loss += loss.data
if batch_idx % args.log_interval == 0 and batch_idx > 0:
cur_loss = total_loss[0] / args.log_interval
elapsed = time.time() - start_time
print('| epoch {:3d} | {:5d}/{:5d} batches | lr {:02.5f} | ms/batch {:5.5f} | '
'loss {:5.2f} | ppl {:8.2f}'.format(
epoch, batch_idx, train_data.size(1) // args.validseqlen, lr,
elapsed * 1000 / args.log_interval, cur_loss, math.exp(cur_loss)))
total_loss = 0
start_time = time.time()
if __name__ == "__main__":
best_vloss = 1e8
# At any point you can hit Ctrl + C to break out of training early.
try:
all_vloss = []
for epoch in range(1, args.epochs+1):
epoch_start_time = time.time()
train()
val_loss = evaluate(val_data)
test_loss = evaluate(test_data)
print('-' * 89)
print('| end of epoch {:3d} | time: {:5.2f}s | valid loss {:5.2f} | '
'valid ppl {:8.2f}'.format(epoch, (time.time() - epoch_start_time),
val_loss, math.exp(val_loss)))
print('| end of epoch {:3d} | time: {:5.2f}s | test loss {:5.2f} | '
'test ppl {:8.2f}'.format(epoch, (time.time() - epoch_start_time),
test_loss, math.exp(test_loss)))
print('-' * 89)
# Save the model if the validation loss is the best we've seen so far.
if val_loss < best_vloss:
with open("model.pt", 'wb') as f:
print('Save model!\n')
torch.save(model, f)
best_vloss = val_loss
# Anneal the learning rate if the validation loss plateaus
if epoch > 5 and val_loss >= max(all_vloss[-5:]):
lr = lr / 2.
for param_group in optimizer.param_groups:
param_group['lr'] = lr
all_vloss.append(val_loss)
except KeyboardInterrupt:
print('-' * 89)
print('Exiting from training early')
# Load the best saved model.
with open("model.pt", 'rb') as f:
model = torch.load(f)
# Run on test data.
test_loss = evaluate(test_data)
print('=' * 89)
print('| End of training | test loss {:5.2f} | test ppl {:8.2f}'.format(
test_loss, math.exp(test_loss)))
print('=' * 89)