update youtube dnn (#181)

Co-authored-by: tangwei12 <tangwei12@baidu.com>

03cec6d1 · 123malin · GitHub · ed5703ee · 03cec6d1 · 03cec6d1
10 changed files
--- a/models/recall/youtube_dnn/README.md
+++ b/models/recall/youtube_dnn/README.md
+# YouTube-DNN
+
+The directory layout of this example is as follows:
+
+```
+├── data # sample data
+    ├── train
+        ├── data.txt
+    ├── test
+        ├── data.txt
+├── generate_ramdom_data.py # script that generates random training data
+├── __init__.py
+├── README.md # documentation
+├── model.py # model definition
+├── config.yaml # configuration file
+├── data_prepare.sh # one-step data preparation script
+├── reader.py # data reader
+├── infer.py # inference script
+```
+
+Note: before reading this example, we recommend going through the following first:
+
+[PaddleRec getting-started tutorial](https://github.com/PaddlePaddle/PaddleRec/blob/master/README.md)
+
+
+---
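+## Contents
+
+- [Model overview](#model-overview)
+- [Data preparation](#data-preparation)
+- [Environment](#environment)
+- [Quick start](#quick-start)
+- [Paper reproduction](#paper-reproduction)
+- [Advanced usage](#advanced-usage)
+- [FAQ](#faq)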
+
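+## Model overview
+[Deep Neural Networks for YouTube Recommendations](https://link.zhihu.com/?target=https%3A//static.googleusercontent.com/media/research.google.com/zh-CN//pubs/archive/45530.pdf) describes the work of Google's YouTube team on applying DNNs to recommender systems. It is a classic embedding-based recall model: the model learns interest vectors for users and items, scores the similarity between a user and an item by the inner product of their vectors, and uses those scores to produce the final candidate set. YouTube uses two deep networks to complete the recommendation pipeline: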
+
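+1. The first network, the **Candidate Generation Model**, quickly filters candidate videos, reducing the candidate set from millions to hundreds.
+
+2. The second network, the **Ranking Model**, performs fine-grained ranking of those few hundred candidates.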
+
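+This project implements the recall part of YouTube DNN, the Candidate Generation Model, on PaddlePaddle. It produces vector representations for users and items, which other methods (such as user-item cosine similarity) can then use to recommend items to users.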
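
The retrieval step described above can be sketched in a few lines. This is a minimal illustration, assuming user and item vectors are already available as plain Python lists; the function name `top_k_by_inner_product` and the toy vectors are hypothetical and not part of this project.

```python
import heapq


def top_k_by_inner_product(user_vec, item_vecs, k):
    """Return the indices of the k items with the largest inner product."""
    scores = [
        sum(u * v for u, v in zip(user_vec, item_vec))
        for item_vec in item_vecs
    ]
    # nlargest tracks only k candidates instead of sorting every score.
    return heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)


# Toy data (hypothetical): one user vector and four item vectors.
user = [1.0, 0.0, 0.5]
items = [
    [0.9, 0.1, 0.4],
    [0.0, 1.0, 0.0],
    [0.5, 0.5, 0.5],
    [1.0, 0.0, 1.0],
]
candidates = top_k_by_inner_product(user, items, k=2)
print(candidates)
```

With millions of items, a production system would likely replace the linear scan with an approximate nearest-neighbor index; the scan here only shows the scoring rule.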
+
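+Because the original paper did not release a dataset, this project uses randomly generated data to verify that the network is correct.
+
+This project supports the following:
+
+Training: single-machine CPU, single-machine single-GPU, local simulated parameter-server training, and incremental training. For configuration, see [Launching training](https://github.com/PaddlePaddle/PaddleRec/blob/master/doc/train.md).
+
+Inference: single-machine CPU and single-machine single-GPU. For configuration, see [PaddleRec offline inference](https://github.com/PaddlePaddle/PaddleRec/blob/master/doc/predict.md).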
+
+## Data preparation
+Run `python generate_ramdom_data.py` to generate random training data. Each line has the following format:
+```
+#watch_vec;search_vec;other_feat;label
+0.01,0.02,...,0.09;0.01,0.02,...,0.09;0.01,0.02,...,0.09;20
+```
+For convenience, a one-step data generation script is provided:
+```
+sh data_prepare.sh
+```
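
To make the line format above concrete, here is a minimal parsing sketch. The helper `parse_line` is hypothetical (it is not the project's `reader.py`), and it assumes the three vectors hold floats and the label is an integer class index.

```python
def parse_line(line, watch_vec_size, search_vec_size, other_feat_size):
    """Split one 'watch_vec;search_vec;other_feat;label' line into typed fields."""
    fields = line.rstrip().split(";")
    if len(fields) != 4:
        raise ValueError("expected 4 ';'-separated fields, got %d" % len(fields))

    watch_vec = [float(x) for x in fields[0].split(",")]
    search_vec = [float(x) for x in fields[1].split(",")]
    other_feat = [float(x) for x in fields[2].split(",")]
    label = int(fields[3])

    # Check each vector against the size the model is configured for.
    expected = (
        ("watch_vec", watch_vec, watch_vec_size),
        ("search_vec", search_vec, search_vec_size),
        ("other_feat", other_feat, other_feat_size),
    )
    for name, vec, size in expected:
        if len(vec) != size:
            raise ValueError("%s has %d values, expected %d" % (name, len(vec), size))
    return watch_vec, search_vec, other_feat, label


# A shortened sample line with 3-dimensional vectors (the config uses 64).
sample = "0.1,0.2,0.3;0.4,0.5,0.6;0.7,0.8,0.9;20\n"
watch, search, other, label = parse_line(sample, 3, 3, 3)
print(watch, search, other, label)
```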
+
+## Environment
+
+PaddlePaddle >= 1.7.2
+
+Python 2.7/3.5/3.6/3.7
+
+PaddleRec >= 0.1
+
+OS: Windows/Linux/macOS
+
+## Quick start
+
+### Single-machine training
+
+```
+mode: [cpu_single_train]
+
+runner:
+- name: cpu_single_train
+  class: train
+  device: cpu   # if use_gpu, set it to gpu
+  epochs: 20
+  save_checkpoint_interval: 1
+  save_inference_interval: 1
+  save_checkpoint_path: "increment_youtubednn"
+  save_inference_path: "inference_youtubednn"
+  save_inference_feed_varnames: ["watch_vec", "search_vec", "other_feat"] # feed vars of save inference
+  save_inference_fetch_varnames: ["l3.tmp_2"]
+  print_interval: 1
+```
+
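+### Single-machine inference
+Compute the cosine similarity between every user and every item, then recommend the top-k videos to each user:
+
+CPU inference: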
+```
+python infer.py --test_epoch 19 --inference_model_dir ./inference_youtubednn --increment_model_dir ./increment_youtubednn --watch_vec_size 64 --search_vec_size 64 --other_feat_size 64 --topk 5
+```
+
+GPU inference:
+```
+python infer.py --use_gpu 1 --test_epoch 19 --inference_model_dir ./inference_youtubednn --increment_model_dir ./increment_youtubednn --watch_vec_size 64 --search_vec_size 64 --other_feat_size 64 --topk 5
+```
+### Run
+```
+python -m paddlerec.run -m paddlerec.models.recall.youtube_dnn
+```
+
+### Results
+
+Training output on the sample data:
+
+```
+Running SingleStartup.
+Running SingleRunner.
+batch: 1, acc: [0.03125]
+batch: 2, acc: [0.0625]
+batch: 3, acc: [0.]
+...
+epoch 0 done, use time: 0.0605320930481, global metrics: acc=[0.]
+...
+epoch 19 done, use time: 0.33447098732, global metrics: acc=[0.]
+```
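
The accuracy values in this log are what randomly labeled data would be expected to produce, rather than a likely sign of a bug. The sketch below works through the arithmetic, assuming `output_size: 100` and `batch_size: 32` as set in `config.yaml` and labels drawn uniformly at random as in `generate_ramdom_data.py`.

```python
output_size = 100  # number of classes, from config.yaml
batch_size = 32    # from config.yaml

# Labels are independent of the features, so on samples the model has not
# memorized, the expected accuracy of any predictor is 1 / output_size.
chance_accuracy = 1.0 / output_size

# Batch accuracy is (correct predictions) / batch_size, so only multiples
# of 1 / 32 can appear in the per-batch log lines.
one_correct = 1.0 / batch_size
two_correct = 2.0 / batch_size

print(chance_accuracy, one_correct, two_correct)
```

The logged values `0.03125` and `0.0625` correspond to one and two correct predictions in a batch of 32.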
+
+Inference output on the sample data:
+```
+user:0, top K videos:[40, 31, 4, 33, 93]
+user:1, top K videos:[35, 57, 58, 40, 17]
+user:2, top K videos:[35, 17, 88, 40, 9]
+user:3, top K videos:[73, 35, 39, 58, 38]
+user:4, top K videos:[40, 31, 57, 4, 73]
+user:5, top K videos:[38, 9, 7, 88, 22]
+user:6, top K videos:[35, 73, 14, 58, 28]
+user:7, top K videos:[35, 73, 58, 38, 56]
+user:8, top K videos:[38, 40, 9, 35, 99]
+user:9, top K videos:[88, 73, 9, 35, 28]
+user:10, top K videos:[35, 52, 28, 54, 73]
+```
+
+## Advanced usage
+
+## FAQ
--- a/models/recall/youtube_dnn/config.yaml
+++ b/models/recall/youtube_dnn/config.yaml
@@ -17,11 +17,10 @@ workspace: "models/recall/youtube_dnn"

 dataset:
 - name: dataset_train
-  batch_size: 5
-  type: DataLoader
-  #type: QueueDataset
+  batch_size: 32
+  type: DataLoader # or QueueDataset
  data_path: "{workspace}/data/train"
-  data_converter: "{workspace}/random_reader.py"
+  data_converter: "{workspace}/reader.py"

 hyper_parameters:
  watch_vec_size: 64
@@ -30,22 +29,23 @@ hyper_parameters:
  output_size: 100
  layers: [128, 64, 32]
  optimizer: 
-    class: adam
-    learning_rate: 0.001
-    strategy: async
+    class: SGD
+    learning_rate: 0.01

-mode: train_runner
+mode: [cpu_single_train]

 runner:
- name: train_runner
+- name: cpu_single_train
  class: train
  device: cpu
-  epochs: 3
-  save_checkpoint_interval: 2
-  save_inference_interval: 4
-  save_checkpoint_path: "increment"
-  save_inference_path: "inference"
-  print_interval: 10
+  epochs: 20
+  save_checkpoint_interval: 1
+  save_inference_interval: 1
+  save_checkpoint_path: "increment_youtubednn"
+  save_inference_path: "inference_youtubednn"
+  save_inference_feed_varnames: ["watch_vec", "search_vec", "other_feat"] # feed vars of save inference
+  save_inference_fetch_varnames: ["l3.tmp_2"]
+  print_interval: 1

 phase:
 - name: train

--- a/models/recall/youtube_dnn/data/test/data.txt
+++ b/models/recall/youtube_dnn/data/test/data.txt
--- a/models/recall/youtube_dnn/data/test/small_data.txt
+++ b/models/recall/youtube_dnn/data/test/small_data.txt
-4764,174,1
-4764,2958,0
-4764,452,0
-4764,1946,0
-4764,3208,0
-2044,2237,1
-2044,1998,0
-2044,328,0
-2044,1542,0
-2044,1932,0
-4276,65,1
-4276,3247,0
-4276,942,0
-4276,3666,0
-4276,2222,0
-3933,682,1
-3933,2451,0
-3933,3695,0
-3933,1643,0
-3933,3568,0
-1151,1265,1
-1151,118,0
-1151,2532,0
-1151,2083,0
-1151,2350,0
-1757,876,1
-1757,201,0
-1757,3633,0
-1757,1068,0
-1757,2549,0
-3370,276,1
-3370,2435,0
-3370,606,0
-3370,910,0
-3370,2146,0
-5137,1018,1
-5137,2163,0
-5137,3167,0
-5137,2315,0
-5137,3595,0
-3933,2831,1
-3933,2881,0
-3933,2949,0
-3933,3660,0
-3933,417,0
-3102,999,1
-3102,1902,0
-3102,2161,0
-3102,3042,0
-3102,1113,0
-2022,336,1
-2022,1672,0
-2022,2656,0
-2022,3649,0
-2022,883,0
-2664,655,1
-2664,3660,0
-2664,1711,0
-2664,3386,0
-2664,1668,0
-25,701,1
-25,32,0
-25,2482,0
-25,3177,0
-25,2767,0
-1738,1643,1
-1738,2187,0
-1738,228,0
-1738,650,0
-1738,3101,0
-5411,1241,1
-5411,2546,0
-5411,3019,0
-5411,3618,0
-5411,1674,0
-638,579,1
-638,3512,0
-638,783,0
-638,2111,0
-638,1880,0
-3554,200,1
-3554,2893,0
-3554,2428,0
-3554,969,0
-3554,2741,0
-4283,1074,1
-4283,3056,0
-4283,2032,0
-4283,405,0
-4283,1505,0
-5111,200,1
-5111,3488,0
-5111,477,0
-5111,2790,0
-5111,40,0
-3964,515,1
-3964,1528,0
-3964,2173,0
-3964,1701,0
-3964,2832,0
--- a/models/recall/youtube_dnn/data/train/data.txt
+++ b/models/recall/youtube_dnn/data/train/data.txt
--- a/models/recall/youtube_dnn/data/train/samll_data.txt
+++ b/models/recall/youtube_dnn/data/train/samll_data.txt
-4764,174,1
-4764,2958,0
-4764,452,0
-4764,1946,0
-4764,3208,0
-2044,2237,1
-2044,1998,0
-2044,328,0
-2044,1542,0
-2044,1932,0
-4276,65,1
-4276,3247,0
-4276,942,0
-4276,3666,0
-4276,2222,0
-3933,682,1
-3933,2451,0
-3933,3695,0
-3933,1643,0
-3933,3568,0
-1151,1265,1
-1151,118,0
-1151,2532,0
-1151,2083,0
-1151,2350,0
-1757,876,1
-1757,201,0
-1757,3633,0
-1757,1068,0
-1757,2549,0
-3370,276,1
-3370,2435,0
-3370,606,0
-3370,910,0
-3370,2146,0
-5137,1018,1
-5137,2163,0
-5137,3167,0
-5137,2315,0
-5137,3595,0
-3933,2831,1
-3933,2881,0
-3933,2949,0
-3933,3660,0
-3933,417,0
-3102,999,1
-3102,1902,0
-3102,2161,0
-3102,3042,0
-3102,1113,0
-2022,336,1
-2022,1672,0
-2022,2656,0
-2022,3649,0
-2022,883,0
-2664,655,1
-2664,3660,0
-2664,1711,0
-2664,3386,0
-2664,1668,0
-25,701,1
-25,32,0
-25,2482,0
-25,3177,0
-25,2767,0
-1738,1643,1
-1738,2187,0
-1738,228,0
-1738,650,0
-1738,3101,0
-5411,1241,1
-5411,2546,0
-5411,3019,0
-5411,3618,0
-5411,1674,0
-638,579,1
-638,3512,0
-638,783,0
-638,2111,0
-638,1880,0
-3554,200,1
-3554,2893,0
-3554,2428,0
-3554,969,0
-3554,2741,0
-4283,1074,1
-4283,3056,0
-4283,2032,0
-4283,405,0
-4283,1505,0
-5111,200,1
-5111,3488,0
-5111,477,0
-5111,2790,0
-5111,40,0
-3964,515,1
-3964,1528,0
-3964,2173,0
-3964,1701,0
-3964,2832,0
--- a/models/recall/youtube_dnn/data_prepare.sh
+++ b/models/recall/youtube_dnn/data_prepare.sh
+python generate_ramdom_data.py
--- a/models/recall/youtube_dnn/generate_ramdom_data.py
+++ b/models/recall/youtube_dnn/generate_ramdom_data.py
+# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import numpy as np
+
+# Build a random data set.
+sample_size = 100
+batch_size = 32
+watch_vec_size = 64
+search_vec_size = 64
+other_feat_size = 64
+output_size = 100
+
+watch_vecs = np.random.rand(batch_size * sample_size, watch_vec_size).tolist()
+search_vecs = np.random.rand(batch_size * sample_size,
+                             search_vec_size).tolist()
+other_vecs = np.random.rand(batch_size * sample_size, other_feat_size).tolist()
+labels = np.random.randint(
+    output_size, size=(batch_size * sample_size)).tolist()
+
+output_path = "./data/train/data.txt"
+with open(output_path, 'w') as fout:
+    for i in range(batch_size * sample_size):
+        _str_ = ','.join(map(str, watch_vecs[i])) + ";" + ','.join(
+            map(str, search_vecs[i])) + ";" + ','.join(
+                map(str, other_vecs[i])) + ";" + str(labels[i])
+        fout.write(_str_)
+        fout.write("\n")
--- a/models/recall/youtube_dnn/infer.py
+++ b/models/recall/youtube_dnn/infer.py
+# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import os
+import copy
+import numpy as np
+import argparse
+import paddle.fluid as fluid
+from paddle.fluid.incubate.fleet.utils import utils
+
+
+def parse_args():
+    parser = argparse.ArgumentParser("PaddlePaddle Youtube DNN infer example")
+    parser.add_argument(
+        '--use_gpu', type=int, default=0, help='whether use gpu')
+    parser.add_argument(
+        "--batch_size", type=int, default=32, help="batch_size")
+    parser.add_argument(
+        "--test_epoch", type=int, default=19, help="test_epoch")
+    parser.add_argument(
+        '--inference_model_dir',
+        type=str,
+        default='./inference_youtubednn',
+        help='inference_model_dir')
+    parser.add_argument(
+        '--increment_model_dir',
+        type=str,
+        default='./increment_youtubednn',
+        help='persistable_model_dir')
+    parser.add_argument(
+        '--watch_vec_size', type=int, default=64, help='watch_vec_size')
+    parser.add_argument(
+        '--search_vec_size', type=int, default=64, help='search_vec_size')
+    parser.add_argument(
+        '--other_feat_size', type=int, default=64, help='other_feat_size')
+    parser.add_argument('--topk', type=int, default=5, help='topk')
+    args = parser.parse_args()
+    return args
+
+
+def infer(args):
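+    # Video (item) vectors are the columns of the final layer's weight matrix.
+    # The shape [32, 100] assumes layers[-1] == 32 and output_size == 100,
+    # as set in config.yaml.
+    video_save_path = os.path.join(args.increment_model_dir,
+                                   str(args.test_epoch), "l4_weight")
+    video_vec, = utils.load_var("l4_weight", [32, 100], 'float32',
+                                video_save_path)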
+
+    place = fluid.CUDAPlace(0) if args.use_gpu else fluid.CPUPlace()
+    exe = fluid.Executor(place)
+    cur_model_path = os.path.join(args.inference_model_dir,
+                                  str(args.test_epoch))
+
+    user_vec = None
+    with fluid.scope_guard(fluid.Scope()):
+        infer_program, feed_target_names, fetch_vars = fluid.io.load_inference_model(
+            cur_model_path, exe)
+        # Build a random data set.
+        sample_size = 100
+        watch_vecs = []
+        search_vecs = []
+        other_feats = []
+
+        for i in range(sample_size):
+            watch_vec = np.random.rand(1, args.watch_vec_size)
+            search_vec = np.random.rand(1, args.search_vec_size)
+            other_feat = np.random.rand(1, args.other_feat_size)
+            watch_vecs.append(watch_vec)
+            search_vecs.append(search_vec)
+            other_feats.append(other_feat)
+
+        for i in range(sample_size):
+            l3 = exe.run(infer_program,
+                         feed={
+                             "watch_vec": watch_vecs[i].astype('float32'),
+                             "search_vec": search_vecs[i].astype('float32'),
+                             "other_feat": other_feats[i].astype('float32'),
+                         },
+                         return_numpy=True,
+                         fetch_list=fetch_vars)
+            if user_vec is not None:
+                user_vec = np.concatenate([user_vec, l3[0]], axis=0)
+            else:
+                user_vec = l3[0]
+
+    # Rank all videos for each user by similarity and keep the top k.
+    for i in range(user_vec.shape[0]):
+        user_video_sim_list = [
+            cos_sim(user_vec[i], video_vec[:, j])
+            for j in range(video_vec.shape[1])
+        ]
+        # A stable argsort on the negated scores yields descending order and
+        # returns distinct indices even when two videos have equal scores
+        # (list.index would return the same index twice in that case).
+        max_sim_index = np.argsort(
+            -np.array(user_video_sim_list),
+            kind="mergesort")[:args.topk].tolist()
+
+        print("user:{0}, top K videos:{1}".format(i, max_sim_index))
+
+
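+def cos_sim(vector_a, vector_b):
+    """Return the cosine similarity of two vectors, rescaled to [0, 1]."""
+    vector_a = np.asarray(vector_a).ravel()
+    vector_b = np.asarray(vector_b).ravel()
+    num = float(np.dot(vector_a, vector_b))
+    denom = np.linalg.norm(vector_a) * np.linalg.norm(vector_b)
+    # The small constant guards against division by zero for all-zero vectors.
+    cos = num / (denom + 1e-4)
+    # Map the cosine from [-1, 1] to [0, 1].
+    sim = 0.5 + 0.5 * cos
+    return sim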
+
+
+if __name__ == "__main__":
+    args = parse_args()
+    infer(args)
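
The rescaling used by `cos_sim` in `infer.py` can be checked on a few easy cases. The sketch below is a pure-Python re-implementation for illustration only; the name `rescaled_cosine` and the sample vectors are hypothetical.

```python
import math


def rescaled_cosine(a, b, eps=1e-4):
    """Cosine similarity mapped from [-1, 1] to [0, 1] as 0.5 + 0.5 * cos."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    # eps mirrors the 1e-4 guard against division by zero in infer.py.
    return 0.5 + 0.5 * dot / (norm + eps)


same = rescaled_cosine([1.0, 2.0], [2.0, 4.0])        # parallel vectors
orthogonal = rescaled_cosine([1.0, 0.0], [0.0, 1.0])  # perpendicular vectors
opposite = rescaled_cosine([1.0, 2.0], [-1.0, -2.0])  # anti-parallel vectors
print(same, orthogonal, opposite)
```

Parallel vectors score just under 1, perpendicular vectors score 0.5, and anti-parallel vectors score just above 0. Because the mapping is monotonic in the cosine, it should not change the ranking of videos for a given user apart from the small effect of `eps`.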
--- a/models/recall/youtube_dnn/random_reader.py
+++ b/models/recall/youtube_dnn/random_reader.py
@@ -39,13 +39,17 @@ class Reader(ReaderBase):
            """
            This function needs to be implemented by the user, based on data format
            """
-
+            features = line.rstrip().split(";")
+            watch_vec = features[0].split(',')
+            search_vec = features[1].split(',')
+            other_feat = features[2].split(',')
+            label = features[3]
+            assert (len(watch_vec) == self.watch_vec_size)
+            assert (len(search_vec) == self.search_vec_size)
+            assert (len(other_feat) == self.other_feat_size)
            feature_name = ["watch_vec", "search_vec", "other_feat", "label"]
            yield list(
-                zip(feature_name, [
-                    np.random.rand(self.watch_vec_size).tolist()
-                ] + [np.random.rand(self.search_vec_size).tolist()] + [
-                    np.random.rand(self.other_feat_size).tolist()
-                ] + [[np.random.randint(self.output_size)]]))
+                zip(feature_name, [watch_vec] + [search_vec] + [other_feat] +
+                    [label]))

        return reader