Commit 0653337f authored by MaoXianxin

nlp recommend

Parent commit 5f5279d3
In the featurization tutorial we incorporated multiple features beyond just user and movie identifiers into our models, but we haven't explored whether those features improve model accuracy.
Many factors affect whether features beyond ids are useful in a recommender model:
1. Importance of context: if user preferences are relatively stable across contexts and time, context features may not provide much benefit. If, however, user preferences are highly contextual, adding context will improve the model significantly. For example, day of the week may be an important feature when deciding whether to recommend a short clip or a movie: users may only have time to watch short content during the week, but can relax and enjoy a full-length movie during the weekend. Similarly, query timestamps may play an important role in modelling popularity dynamics: one movie may be highly popular around the time of its release, but decay quickly afterwards. Conversely, other movies may be evergreens that are happily watched time and time again.
2. Data sparsity: using non-id features may be critical if data is sparse. With few observations available for a given user or item, the model may struggle with estimating a good per-user or per-item representation. To build an accurate model, other features such as item categories, descriptions, and images have to be used to help the model generalize beyond the training data. This is especially relevant in cold-start situations, where relatively little data is available on some items or users.
```
import os
import tempfile
import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds
import tensorflow_recommenders as tfrs
```
We follow the featurization tutorial and keep the user id, timestamp, and movie title features.
```
ratings = tfds.load("movielens/100k-ratings", split="train")
movies = tfds.load("movielens/100k-movies", split="train")
ratings = ratings.map(lambda x: {
    "movie_title": x["movie_title"],
    "user_id": x["user_id"],
    "timestamp": x["timestamp"],
})
movies = movies.map(lambda x: x["movie_title"])
```
We also do some housekeeping to prepare feature vocabularies.
```
timestamps = np.concatenate(list(ratings.map(lambda x: x["timestamp"]).batch(100)))
max_timestamp = timestamps.max()
min_timestamp = timestamps.min()
timestamp_buckets = np.linspace(
    min_timestamp, max_timestamp, num=1000,
)
unique_movie_titles = np.unique(np.concatenate(list(movies.batch(1000))))
unique_user_ids = np.unique(np.concatenate(list(ratings.batch(1_000).map(
    lambda x: x["user_id"]))))
```
## Model definition
### Query model
We start with the user model defined in the featurization tutorial as the first layer of our model, tasked with converting raw input examples into feature embeddings. However, we change it slightly to allow us to turn timestamp features on or off. This will allow us to more easily demonstrate the effect that timestamp features have on the model. In the code below, the use_timestamps parameter gives us control over whether we use timestamp features.
```
class UserModel(tf.keras.Model):

  def __init__(self, use_timestamps):
    super().__init__()

    self._use_timestamps = use_timestamps

    self.user_embedding = tf.keras.Sequential([
        tf.keras.layers.experimental.preprocessing.StringLookup(
            vocabulary=unique_user_ids, mask_token=None),
        tf.keras.layers.Embedding(len(unique_user_ids) + 1, 32),
    ])

    if use_timestamps:
      self.timestamp_embedding = tf.keras.Sequential([
          tf.keras.layers.experimental.preprocessing.Discretization(timestamp_buckets.tolist()),
          tf.keras.layers.Embedding(len(timestamp_buckets) + 1, 32),
      ])
      self.normalized_timestamp = tf.keras.layers.experimental.preprocessing.Normalization()
      self.normalized_timestamp.adapt(timestamps)

  def call(self, inputs):
    if not self._use_timestamps:
      return self.user_embedding(inputs["user_id"])

    return tf.concat([
        self.user_embedding(inputs["user_id"]),
        self.timestamp_embedding(inputs["timestamp"]),
        self.normalized_timestamp(inputs["timestamp"]),
    ], axis=1)
```
Note that our use of timestamp features in this tutorial interacts with our choice of training-test split in an undesirable way. Because we have split our data randomly rather than chronologically (a chronological split would ensure that events in the test set happen later than those in the training set), our model can effectively learn from the future. This is unrealistic: after all, we cannot train a model today on data from tomorrow.
This means that adding time features to the model lets it learn future interaction patterns. We do this for illustration purposes only: the MovieLens dataset itself is very dense, and unlike many real-world datasets does not benefit greatly from features beyond user ids and movie titles.
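If we wanted to avoid this leakage, a chronological split is straightforward. The sketch below is illustrative only and not part of the original tutorial; it reuses the timestamps array computed earlier.
```
# Illustrative sketch: split by time instead of randomly, so that every test
# interaction happens after every training interaction.
split_time = int(np.percentile(timestamps, 80))  # timestamp below which ~80% of interactions fall

train_chronological = ratings.filter(lambda x: x["timestamp"] < split_time)
test_chronological = ratings.filter(lambda x: x["timestamp"] >= split_time)
```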
This caveat aside, real-world models may well benefit from other time-based features such as time of day or day of the week, especially if the data has strong seasonal patterns.
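As a quick, hedged sketch (again, not part of the original tutorial), such calendar features could be derived from the Unix timestamps and embedded much like the timestamp buckets, interpreting times in UTC:
```
# Illustrative sketch: derive calendar features from Unix timestamps (UTC).
day_of_week = (timestamps // 86400 + 3) % 7  # 0 = Monday; the Unix epoch fell on a Thursday
hour_of_day = (timestamps % 86400) // 3600   # 0..23

# These small vocabularies can be embedded just like the timestamp buckets above.
day_of_week_embedding = tf.keras.layers.Embedding(7, 32)
hour_of_day_embedding = tf.keras.layers.Embedding(24, 32)
```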
### Candidate model
For simplicity, we'll keep the candidate model fixed. Again, we copy it from the featurization tutorial:
```
class MovieModel(tf.keras.Model):

  def __init__(self):
    super().__init__()

    max_tokens = 10_000

    self.title_embedding = tf.keras.Sequential([
        tf.keras.layers.experimental.preprocessing.StringLookup(
            vocabulary=unique_movie_titles, mask_token=None),
        tf.keras.layers.Embedding(len(unique_movie_titles) + 1, 32)
    ])

    self.title_vectorizer = tf.keras.layers.experimental.preprocessing.TextVectorization(
        max_tokens=max_tokens)

    self.title_text_embedding = tf.keras.Sequential([
        self.title_vectorizer,
        tf.keras.layers.Embedding(max_tokens, 32, mask_zero=True),
        tf.keras.layers.GlobalAveragePooling1D(),
    ])

    self.title_vectorizer.adapt(movies)

  def call(self, titles):
    return tf.concat([
        self.title_embedding(titles),
        self.title_text_embedding(titles),
    ], axis=1)
```
### Combined model
With both UserModel and MovieModel defined, we can put together a combined model and implement our loss and metrics logic.
Note that we also need to make sure that the query model and candidate model output embeddings of compatible size. Because we'll be varying their sizes by adding more features, the easiest way to accomplish this is to use a dense projection layer after each model:
```
class MovielensModel(tfrs.models.Model):

  def __init__(self, use_timestamps):
    super().__init__()
    self.query_model = tf.keras.Sequential([
        UserModel(use_timestamps),
        tf.keras.layers.Dense(32)
    ])
    self.candidate_model = tf.keras.Sequential([
        MovieModel(),
        tf.keras.layers.Dense(32)
    ])
    self.task = tfrs.tasks.Retrieval(
        metrics=tfrs.metrics.FactorizedTopK(
            candidates=movies.batch(128).map(self.candidate_model),
        ),
    )

  def compute_loss(self, features, training=False):
    # We only pass the user id and timestamp features into the query model. This
    # is to ensure that the training inputs would have the same keys as the
    # query inputs. Otherwise the discrepancy in input structure would cause an
    # error when loading the query model after saving it.
    query_embeddings = self.query_model({
        "user_id": features["user_id"],
        "timestamp": features["timestamp"],
    })
    movie_embeddings = self.candidate_model(features["movie_title"])

    return self.task(query_embeddings, movie_embeddings)
```
## Experiments
### Prepare the data
We first split the data into a training set and a testing set.
```
tf.random.set_seed(42)
shuffled = ratings.shuffle(100_000, seed=42, reshuffle_each_iteration=False)
train = shuffled.take(80_000)
test = shuffled.skip(80_000).take(20_000)
cached_train = train.shuffle(100_000).batch(2048)
cached_test = test.batch(4096).cache()
```
### Baseline: no timestamp features
We're ready to try out our first model: let's start without timestamp features to establish a baseline.
```
model = MovielensModel(use_timestamps=False)
model.compile(optimizer=tf.keras.optimizers.Adagrad(0.1))
model.fit(cached_train, epochs=3)
train_accuracy = model.evaluate(
    cached_train, return_dict=True)["factorized_top_k/top_100_categorical_accuracy"]
test_accuracy = model.evaluate(
    cached_test, return_dict=True)["factorized_top_k/top_100_categorical_accuracy"]
print(f"Top-100 accuracy (train): {train_accuracy:.2f}.")
print(f"Top-100 accuracy (test): {test_accuracy:.2f}.")
```
This gives us a baseline top-100 accuracy of around 0.2.
### Capturing time dynamics with time features
Do the results change if we add time features?
```
model = MovielensModel(use_timestamps=True)
model.compile(optimizer=tf.keras.optimizers.Adagrad(0.1))
model.fit(cached_train, epochs=3)
train_accuracy = model.evaluate(
    cached_train, return_dict=True)["factorized_top_k/top_100_categorical_accuracy"]
test_accuracy = model.evaluate(
    cached_test, return_dict=True)["factorized_top_k/top_100_categorical_accuracy"]
print(f"Top-100 accuracy (train): {train_accuracy:.2f}.")
print(f"Top-100 accuracy (test): {test_accuracy:.2f}.")
```
This is quite a bit better: not only is the training accuracy much higher, but the test accuracy is also substantially improved.
In this tutorial, we build a simple matrix factorization model using the MovieLens 100K dataset with TFRS. We can use this model to recommend movies for a given user.
### Import TFRS
```
from typing import Dict, Text
import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds
import tensorflow_recommenders as tfrs
```
### Read the data
```
# Ratings data.
ratings = tfds.load('movielens/100k-ratings', split="train")
# Features of all the available movies.
movies = tfds.load('movielens/100k-movies', split="train")
# Select the basic features.
ratings = ratings.map(lambda x: {
    "movie_title": x["movie_title"],
    "user_id": x["user_id"]
})
movies = movies.map(lambda x: x["movie_title"])
```
Build vocabularies to convert user ids and movie titles into integer indices for embedding layers:
```
user_ids_vocabulary = tf.keras.layers.experimental.preprocessing.StringLookup(mask_token=None)
user_ids_vocabulary.adapt(ratings.map(lambda x: x["user_id"]))
movie_titles_vocabulary = tf.keras.layers.experimental.preprocessing.StringLookup(mask_token=None)
movie_titles_vocabulary.adapt(movies)
```
### Define a model
We can define a TFRS model by inheriting from tfrs.Model and implementing the compute_loss method:
```
class MovieLensModel(tfrs.Model):
  # We derive from a custom base class to help reduce boilerplate. Under the hood,
  # these are still plain Keras Models.

  def __init__(
      self,
      user_model: tf.keras.Model,
      movie_model: tf.keras.Model,
      task: tfrs.tasks.Retrieval):
    super().__init__()

    # Set up user and movie representations.
    self.user_model = user_model
    self.movie_model = movie_model

    # Set up a retrieval task.
    self.task = task

  def compute_loss(self, features: Dict[Text, tf.Tensor], training=False) -> tf.Tensor:
    # Define how the loss is computed.
    user_embeddings = self.user_model(features["user_id"])
    movie_embeddings = self.movie_model(features["movie_title"])

    return self.task(user_embeddings, movie_embeddings)
```
Define the two models and the retrieval task.
```
# Define user and movie models.
user_model = tf.keras.Sequential([
    user_ids_vocabulary,
    tf.keras.layers.Embedding(user_ids_vocabulary.vocab_size(), 64)
])
movie_model = tf.keras.Sequential([
    movie_titles_vocabulary,
    tf.keras.layers.Embedding(movie_titles_vocabulary.vocab_size(), 64)
])

# Define your objectives.
task = tfrs.tasks.Retrieval(metrics=tfrs.metrics.FactorizedTopK(
    movies.batch(128).map(movie_model)
))
```
### Fit and evaluate it.
Create the model, train it, and generate predictions:
```
# Create a retrieval model.
model = MovieLensModel(user_model, movie_model, task)
model.compile(optimizer=tf.keras.optimizers.Adagrad(0.5))
# Train for 3 epochs.
model.fit(ratings.batch(4096), epochs=3)
# Use brute-force search to set up retrieval using the trained representations.
index = tfrs.layers.factorized_top_k.BruteForce(model.user_model)
index.index(movies.batch(100).map(model.movie_model), movies)
# Get some recommendations.
_, titles = index(np.array(["42"]))
print(f"Top 3 recommendations for user 42: {titles[0, :3]}")
```
One of the great advantages of using a deep learning framework to build recommender models is the freedom to build rich, flexible feature representations.
Raw features need to be appropriately transformed in order to be useful for building models:
+ User and item ids have to be translated into embedding vectors: high-dimensional numerical representations that are adjusted during training to help the model predict its objective better.
+ Raw text needs to be tokenized (split into smaller parts such as individual words) and translated into embeddings.
+ Numerical features need to be normalized so that their values lie in a small interval around 0.
## The MovieLens dataset
Let's first have a look at what features we can use from the MovieLens dataset:
```
import pprint
import tensorflow_datasets as tfds
ratings = tfds.load("movielens/100k-ratings", split="train")
for x in ratings.take(1).as_numpy_iterator():
  pprint.pprint(x)
```
There are a few key features here:
+ Movie title is useful as a movie identifier.
+ User id is useful as a user identifier.
+ Timestamps will allow us to model the effect of time.
The first two are categorical features; timestamps are a continuous feature.
## Turning categorical features into embeddings
A categorical feature is a feature that does not express a continuous quantity, but rather takes on one of a set of fixed values.
Most deep learning models express these features by turning them into high-dimensional vectors. During model training, the value of that vector is adjusted to help the model predict its objective better.
For example, suppose that our goal is to predict which user is going to watch which movie. To do that, we represent each user and each movie by an embedding vector. Initially, these embeddings will take on random values - but during training, we will adjust them so that embeddings of users and the movies they watch end up closer together.
Taking raw categorical features and turning them into embeddings is normally a two-step process:
+ Firstly, we need to translate the raw values into a range of contiguous integers, normally by building a mapping (called a "vocabulary") that maps raw values ("Star Wars") to integers (say, 15).
+ Secondly, we need to take these integers and turn them into embeddings.
### Defining the vocabulary
The first step is to define a vocabulary. We can do this easily using Keras preprocessing layers.
```
import numpy as np
import tensorflow as tf
movie_title_lookup = tf.keras.layers.experimental.preprocessing.StringLookup()
```
The layer itself does not have a vocabulary yet, but we can build it using our data.
```
movie_title_lookup.adapt(ratings.map(lambda x: x["movie_title"]))
print(f"Vocabulary: {movie_title_lookup.get_vocabulary()[:3]}")
```
Once we have this we can use the layer to translate raw tokens to embedding ids:
```
movie_title_lookup(["Star Wars (1977)", "One Flew Over the Cuckoo's Nest (1975)"])
```
Note that the layer's vocabulary includes one (or more!) unknown (or "out of vocabulary", OOV) tokens. This is really handy: it means that the layer can handle categorical values that are not in the vocabulary. In practical terms, this means that the model can continue to learn about and make recommendations even using features that have not been seen during vocabulary construction.
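For instance, a title the layer never saw during adapt (the second title below is invented purely for illustration) is simply mapped to the OOV index rather than raising an error:
```
# The second title is made up; it is not in the vocabulary, so the lookup maps
# it to the out-of-vocabulary (OOV) index instead of failing.
movie_title_lookup(["Star Wars (1977)", "A Film That Does Not Exist (2021)"])
```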
### Using feature hashing
We can take this to its logical extreme and rely entirely on feature hashing, with no vocabulary at all. This is implemented in the tf.keras.layers.experimental.preprocessing.Hashing layer.
```
# We set up a large number of bins to reduce the chance of hash collisions.
num_hashing_bins = 200_000
movie_title_hashing = tf.keras.layers.experimental.preprocessing.Hashing(
    num_bins=num_hashing_bins
)
```
We can do the lookup as before without the need to build vocabularies:
```
movie_title_hashing(["Star Wars (1977)", "One Flew Over the Cuckoo's Nest (1975)"])
```
### Defining the embeddings
Now that we have integer ids, we can use the Embedding layer to turn those into embeddings.
An embedding layer has two dimensions: the first dimension tells us how many distinct categories we can embed; the second tells us how large the vector representing each of them can be.
When creating the embedding layer for movie titles, we are going to set the first value to the size of our title vocabulary (or the number of hashing bins). The second is up to us: the larger it is, the higher the capacity of the model, but the slower it is to fit and serve.
```
movie_title_embedding = tf.keras.layers.Embedding(
    # Let's use the explicit vocabulary lookup.
    input_dim=movie_title_lookup.vocab_size(),
    output_dim=32
)
```
We can put the two together into a single layer which takes raw text in and yields embeddings.
```
movie_title_model = tf.keras.Sequential([movie_title_lookup, movie_title_embedding])
```
Just like that, we can directly get the embeddings for our movie titles:
```
movie_title_model(["Star Wars (1977)"])
```
We can do the same with user embeddings:
```
user_id_lookup = tf.keras.layers.experimental.preprocessing.StringLookup()
user_id_lookup.adapt(ratings.map(lambda x: x["user_id"]))
user_id_embedding = tf.keras.layers.Embedding(user_id_lookup.vocab_size(), 32)
user_id_model = tf.keras.Sequential([user_id_lookup, user_id_embedding])
```
## Normalizing continuous features
Continuous features also need normalization. For example, the timestamp feature is far too large to be used directly in a deep model:
```
for x in ratings.take(3).as_numpy_iterator():
  print(f"Timestamp: {x['timestamp']}.")
```
We need to process it before we can use it. While there are many ways in which we can do this, discretization and standardization are two common ones.
### Standardization
Standardization rescales features to normalize their range by subtracting the feature's mean and dividing by its standard deviation. It is a common preprocessing transformation.
This can be easily accomplished using the tf.keras.layers.experimental.preprocessing.Normalization layer:
```
timestamp_normalization = tf.keras.layers.experimental.preprocessing.Normalization()
timestamp_normalization.adapt(ratings.map(lambda x: x["timestamp"]).batch(1024))
for x in ratings.take(3).as_numpy_iterator():
  print(f"Normalized timestamp: {timestamp_normalization(x['timestamp'])}.")
```
### Discretization
Another common transformation is to turn a continuous feature into a number of categorical features. This makes good sense if we have reasons to suspect that a feature's effect is non-continuous.
To do this, we first need to establish the boundaries of the buckets we will use for discretization. The easiest way is to identify the minimum and maximum value of the feature, and divide the resulting interval equally:
```
max_timestamp = ratings.map(lambda x: x["timestamp"]).reduce(
    tf.cast(0, tf.int64), tf.maximum).numpy().max()
min_timestamp = ratings.map(lambda x: x["timestamp"]).reduce(
    np.int64(1e9), tf.minimum).numpy().min()

timestamp_buckets = np.linspace(
    min_timestamp, max_timestamp, num=1000)
print(f"Buckets: {timestamp_buckets[:3]}")
```
Given the bucket boundaries we can transform timestamps into embeddings:
```
timestamp_embedding_model = tf.keras.Sequential([
    tf.keras.layers.experimental.preprocessing.Discretization(timestamp_buckets.tolist()),
    tf.keras.layers.Embedding(len(timestamp_buckets) + 1, 32)
])

for timestamp in ratings.take(1).map(lambda x: x["timestamp"]).batch(1).as_numpy_iterator():
  print(f"Timestamp embedding: {timestamp_embedding_model(timestamp)}.")
```
## Processing text features
We may also want to add text features to our model. Usually, things like product descriptions are free form text, and we can hope that our model can learn to use the information they contain to make better recommendations, especially in a cold-start or long tail scenario.
While the MovieLens dataset does not give us rich textual features, we can still use movie titles. This may help us capture the fact that movies with very similar titles are likely to belong to the same series.
The first transformation we need to apply to text is tokenization (splitting into constituent words or word-pieces), followed by vocabulary learning, followed by an embedding.
The Keras tf.keras.layers.experimental.preprocessing.TextVectorization layer can do the first two steps for us:
```
title_text = tf.keras.layers.experimental.preprocessing.TextVectorization()
title_text.adapt(ratings.map(lambda x: x["movie_title"]))
```
Let's try it out:
```
for row in ratings.batch(1).map(lambda x: x["movie_title"]).take(1):
  print(title_text(row))
```
Each title is translated into a sequence of tokens, one for each piece we've tokenized.
We can check the learned vocabulary to verify that the layer is using the correct tokenization:
```
title_text.get_vocabulary()[40:45]
```
This looks correct: the layer is tokenizing titles into individual words.
To finish the processing, we now need to embed the text. Because each title contains multiple words, we will get multiple embeddings for each title. For use in a downstream model these are usually compressed into a single embedding. Models like RNNs or Transformers are useful here, but averaging all the words' embeddings together is a good starting point.
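As a rough illustration (a sketch that is not part of the tutorial, reusing the title_text layer adapted above), averaging the word embeddings looks like this:
```
# Illustrative sketch: embed each word and average the embeddings to obtain a
# single vector per title. The vocabulary size comes from the adapted layer.
title_word_embeddings = tf.keras.Sequential([
    title_text,
    tf.keras.layers.Embedding(
        input_dim=len(title_text.get_vocabulary()), output_dim=32, mask_zero=True),
    tf.keras.layers.GlobalAveragePooling1D(),
])

title_word_embeddings(["Star Wars (1977)"])
```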
## Putting it all together
With these components in place, we can build a model that does all the preprocessing together.
### User model
The full user model may look like the following:
```
class UserModel(tf.keras.Model):

  def __init__(self):
    super().__init__()

    self.user_embedding = tf.keras.Sequential([
        user_id_lookup,
        tf.keras.layers.Embedding(user_id_lookup.vocab_size(), 32),
    ])
    self.timestamp_embedding = tf.keras.Sequential([
        tf.keras.layers.experimental.preprocessing.Discretization(timestamp_buckets.tolist()),
        tf.keras.layers.Embedding(len(timestamp_buckets) + 2, 32)
    ])
    self.normalized_timestamp = tf.keras.layers.experimental.preprocessing.Normalization()

  def call(self, inputs):
    # Take the input dictionary, pass it through each input layer,
    # and concatenate the result.
    return tf.concat([
        self.user_embedding(inputs["user_id"]),
        self.timestamp_embedding(inputs["timestamp"]),
        self.normalized_timestamp(inputs["timestamp"])
    ], axis=1)
```
Let's try it out:
```
user_model = UserModel()

user_model.normalized_timestamp.adapt(
    ratings.map(lambda x: x["timestamp"]).batch(128))

for row in ratings.batch(1).take(1):
  print(f"Computed representations: {user_model(row)[0, :3]}")
```
### Movie model
We can do the same for the movie model:
```
class MovieModel(tf.keras.Model):

  def __init__(self):
    super().__init__()

    max_tokens = 10_000

    self.title_embedding = tf.keras.Sequential([
        movie_title_lookup,
        tf.keras.layers.Embedding(movie_title_lookup.vocab_size(), 32)
    ])
    self.title_text_embedding = tf.keras.Sequential([
        tf.keras.layers.experimental.preprocessing.TextVectorization(max_tokens=max_tokens),
        tf.keras.layers.Embedding(max_tokens, 32, mask_zero=True),
        # We average the embedding of individual words to get one embedding vector
        # per title.
        tf.keras.layers.GlobalAveragePooling1D(),
    ])

  def call(self, inputs):
    return tf.concat([
        self.title_embedding(inputs["movie_title"]),
        self.title_text_embedding(inputs["movie_title"]),
    ], axis=1)
```
Let's try it out:
```
movie_model = MovieModel()

movie_model.title_text_embedding.layers[0].adapt(
    ratings.map(lambda x: x["movie_title"]))

for row in ratings.batch(1).take(1):
  print(f"Computed representations: {movie_model(row)[0, :3]}")
```