提交 883e3006 编写于 作者: O overlordmax

add flen

上级 591b005e
# FLEN
以下是本例的简要目录结构及说明:
```
├── data #样例数据
├── sample_data
├── train
├── sample_train.txt
├── run.sh
├── get_slot_data.py
├── __init__.py
├── README.md # 文档
├── model.py #模型文件
├── config.yaml #配置文件
```
## 简介
[《FLEN: Leveraging Field for Scalable CTR Prediction》](https://arxiv.org/pdf/1911.04690.pdf)文章提出了field-wise bi-interaction pooling技术,解决了在大规模应用特征field信息时存在的时间复杂度和空间复杂度高的困境,同时提出了一种缓解梯度耦合问题的方法dicefactor。该模型已应用于美图的大规模推荐系统中,持续稳定地取得业务效果的全面提升。
本项目在avazu数据集上验证模型效果
## 数据下载及预处理
## 环境
PaddlePaddle 1.7.2
python3.7
PaddleRec
## 单机训练
CPU环境
在config.yaml文件中设置好设备,epochs等。
```
# select runner by name
mode: [single_cpu_train, single_cpu_infer]
# config of each runner.
# runner is a kind of paddle training class, which wraps the train/infer process.
runner:
- name: single_cpu_train
class: train
# num of epochs
epochs: 4
# device to run training or infer
device: cpu
save_checkpoint_interval: 2 # save model interval of epochs
save_inference_interval: 4 # save inference
save_checkpoint_path: "increment_model" # save checkpoint path
save_inference_path: "inference" # save inference path
save_inference_feed_varnames: [] # feed vars of save inference
save_inference_fetch_varnames: [] # fetch vars of save inference
init_model_path: "" # load model path
print_interval: 10
phases: [phase1]
```
## 单机预测
CPU环境
在config.yaml文件中设置好epochs、device等参数。
```
- name: single_cpu_infer
class: infer
# num of epochs
epochs: 1
# device to run training or infer
device: cpu #选择预测的设备
init_model_path: "increment_dnn" # load model path
phases: [phase2]
```
## 运行
```
python -m paddlerec.run -m paddlerec.models.rank.flen
```
## 模型效果
在样例数据上测试模型
训练:
```
0702 13:38:20.903220 7368 parallel_executor.cc:440] The Program will be executed on CPU using ParallelExecutor, 2 cards are used, so 2 programs are executed in parallel.
I0702 13:38:20.925912 7368 parallel_executor.cc:307] Inplace strategy is enabled, when build_strategy.enable_inplace = True
I0702 13:38:20.933356 7368 parallel_executor.cc:375] Garbage collection strategy is enabled, when FLAGS_eager_delete_tensor_gb = 0
batch: 2, AUC: [0.09090909 0. ], BATCH_AUC: [0.09090909 0. ]
batch: 4, AUC: [0.31578947 0.29411765], BATCH_AUC: [0.31578947 0.29411765]
batch: 6, AUC: [0.41333333 0.33333333], BATCH_AUC: [0.41333333 0.33333333]
batch: 8, AUC: [0.4453125 0.44166667], BATCH_AUC: [0.4453125 0.44166667]
batch: 10, AUC: [0.39473684 0.38888889], BATCH_AUC: [0.44117647 0.41176471]
batch: 12, AUC: [0.41860465 0.45535714], BATCH_AUC: [0.5078125 0.54545455]
batch: 14, AUC: [0.43413729 0.42746615], BATCH_AUC: [0.56666667 0.56 ]
batch: 16, AUC: [0.46433566 0.47460087], BATCH_AUC: [0.53 0.59247649]
batch: 18, AUC: [0.44009217 0.44642857], BATCH_AUC: [0.46 0.47]
batch: 20, AUC: [0.42705314 0.43781095], BATCH_AUC: [0.45878136 0.4874552 ]
batch: 22, AUC: [0.45176471 0.46011281], BATCH_AUC: [0.48046875 0.45878136]
batch: 24, AUC: [0.48375 0.48910256], BATCH_AUC: [0.56630824 0.59856631]
epoch 0 done, use time: 0.21532440185546875
PaddleRec Finish
```
预测
```
PaddleRec: Runner single_cpu_infer Begin
Executor Mode: infer
processor_register begin
Running SingleInstance.
Running SingleNetwork.
QueueDataset can not support PY3, change to DataLoader
QueueDataset can not support PY3, change to DataLoader
Running SingleInferStartup.
Running SingleInferRunner.
load persistables from increment_model/0
batch: 20, AUC: [0.49121353], BATCH_AUC: [0.66176471]
batch: 40, AUC: [0.51156463], BATCH_AUC: [0.55197133]
Infer phase2 of 0 done, use time: 0.3941819667816162
PaddleRec Finish
```
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# workspace
workspace: "paddlerec.models.rank.flen"
# list of dataset
dataset:
- name: dataloader_train # name of dataset to distinguish different datasets
batch_size: 2
type: DataLoader # or QueueDataset
data_path: "{workspace}/data/sample_data/train"
sparse_slots: "click user_0 user_1 user_2 user_3 user_4 user_5 user_6 user_7 user_8 user_9 user_10 user_11 item_0 item_1 item_2 contex_0 contex_1 contex_2 contex_3 contex_4 contex_5"
dense_slots: ""
- name: dataset_infer # name
batch_size: 2
type: DataLoader # or QueueDataset
data_path: "{workspace}/data/sample_data/train"
sparse_slots: "click user_0 user_1 user_2 user_3 user_4 user_5 user_6 user_7 user_8 user_9 user_10 user_11 item_0 item_1 item_2 contex_0 contex_1 contex_2 contex_3 contex_4 contex_5"
dense_slots: ""
# hyper parameters of user-defined network
hyper_parameters:
# optimizer config
optimizer:
class: Adam
learning_rate: 0.001
strategy: async
# user-defined <key, value> pairs
sparse_inputs_slots: 21
sparse_feature_number: 100
sparse_feature_dim: 8
dense_input_dim: 1
dropout_rate: 0.5
# select runner by name
mode: [single_cpu_train, single_cpu_infer]
# config of each runner.
# runner is a kind of paddle training class, which wraps the train/infer process.
runner:
- name: single_cpu_train
class: train
# num of epochs
epochs: 1
# device to run training or infer
device: cpu
save_checkpoint_interval: 1 # save model interval of epochs
save_inference_interval: 4 # save inference
save_checkpoint_path: "increment_model" # save checkpoint path
save_inference_path: "inference" # save inference path
save_inference_feed_varnames: [] # feed vars of save inference
save_inference_fetch_varnames: [] # fetch vars of save inference
init_model_path: "" # load model path
print_interval: 2
phases: [phase1]
- name: single_gpu_train
class: train
# num of epochs
epochs: 1
# device to run training or infer
device: gpu
save_checkpoint_interval: 1 # save model interval of epochs
save_inference_interval: 4 # save inference
save_checkpoint_path: "increment_model" # save checkpoint path
save_inference_path: "inference" # save inference path
save_inference_feed_varnames: [] # feed vars of save inference
save_inference_fetch_varnames: [] # fetch vars of save inference
init_model_path: "" # load model path
print_interval: 2
phases: [phase1]
- name: single_cpu_infer
class: infer
# device to run training or infer
device: cpu
init_model_path: "increment_model" # load model path
phases: [phase2]
- name: single_gpu_infer
class: infer
# device to run training or infer
device: gpu
init_model_path: "increment_model" # load model path
phases: [phase2]
# runner will run all the phase in each epoch
phase:
- name: phase1
model: "{workspace}/model.py" # user-defined model
dataset_name: dataloader_train # select dataset by name
thread_num: 2
- name: phase2
model: "{workspace}/model.py" # user-defined model
dataset_name: dataset_infer # select dataset by name
thread_num: 2
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import paddle.fluid.incubate.data_generator as dg
class CriteoDataset(dg.MultiSlotDataGenerator):
"""
DacDataset: inheritance MultiSlotDataGeneratior, Implement data reading
Help document: http://wiki.baidu.com/pages/viewpage.action?pageId=728820675
"""
def generate_sample(self, line):
"""
Read the data line by line and process it as a dictionary
"""
def reader():
"""
This function needs to be implemented by the user, based on data format
"""
features = line.strip().split(',')
label = [int(features[0])]
s = "click:" + str(label[0])
for i, elem in enumerate(features[1:13]):
s += " user_" + str(i) + ":" + str(elem)
for i, elem in enumerate(features[13:16]):
s += " item_" + str(i) + ":" + str(elem)
for i, elem in enumerate(features[16:]):
s += " contex_" + str(i) + ":" + str(elem)
print(s.strip())
yield None
return reader
d = CriteoDataset()
d.run_from_stdin()
mkdir train
for i in `ls ./train_data`
do
cat train_data/$i | python get_slot_data.py > train/$i
done
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import paddle.fluid as fluid
import itertools
from paddlerec.core.utils import envs
from paddlerec.core.model import ModelBase
class Model(ModelBase):
def __init__(self, config):
ModelBase.__init__(self, config)
def _init_hyper_parameters(self):
self.is_distributed = True if envs.get_fleet_mode().upper(
) == "PSLIB" else False
self.sparse_feature_number = envs.get_global_env(
"hyper_parameters.sparse_feature_number")
self.sparse_feature_dim = envs.get_global_env(
"hyper_parameters.sparse_feature_dim")
self.learning_rate = envs.get_global_env(
"hyper_parameters.optimizer.learning_rate")
def _FieldWiseBiInteraction(self, inputs):
# MF module
field_wise_embeds_list = inputs
field_wise_vectors = [
fluid.layers.reduce_sum(
field_i_vectors, dim=1, keep_dim=True)
for field_i_vectors in field_wise_embeds_list
]
num_fields = len(field_wise_vectors)
h_mf_list = []
for emb_left, emb_right in itertools.combinations(field_wise_vectors,
2):
embeddings_prod = fluid.layers.elementwise_mul(emb_left, emb_right)
field_weighted_embedding = fluid.layers.fc(
input=embeddings_prod,
size=self.sparse_feature_dim,
param_attr=fluid.initializer.ConstantInitializer(value=1),
name='kernel_mf')
h_mf_list.append(field_weighted_embedding)
h_mf = fluid.layers.concat(h_mf_list, axis=1)
h_mf = fluid.layers.reshape(
x=h_mf, shape=[-1, num_fields, self.sparse_feature_dim])
h_mf = fluid.layers.reduce_sum(h_mf, dim=1)
square_of_sum_list = [
fluid.layers.square(
fluid.layers.reduce_sum(
field_i_vectors, dim=1, keep_dim=True))
for field_i_vectors in field_wise_embeds_list
]
sum_of_square_list = [
fluid.layers.reduce_sum(
fluid.layers.elementwise_mul(field_i_vectors, field_i_vectors),
dim=1,
keep_dim=True) for field_i_vectors in field_wise_embeds_list
]
field_fm_list = []
for square_of_sum, sum_of_square in zip(square_of_sum_list,
sum_of_square_list):
field_fm = fluid.layers.reshape(
fluid.layers.elementwise_sub(square_of_sum, sum_of_square),
shape=[-1, self.sparse_feature_dim])
field_fm = fluid.layers.fc(
input=field_fm,
size=self.sparse_feature_dim,
param_attr=fluid.initializer.ConstantInitializer(value=0.5),
name='kernel_fm')
field_fm_list.append(field_fm)
h_fm = fluid.layers.concat(field_fm_list, axis=1)
h_fm = fluid.layers.reshape(
x=h_fm, shape=[-1, num_fields, self.sparse_feature_dim])
h_fm = fluid.layers.reduce_sum(h_fm, dim=1)
return fluid.layers.elementwise_add(h_mf, h_fm)
def _DNNLayer(self, inputs, dropout_rate=0.2):
deep_input = inputs
for i, hidden_unit in enumerate([64, 32]):
fc_out = fluid.layers.fc(
input=deep_input,
size=hidden_unit,
param_attr=fluid.initializer.Xavier(uniform=False),
act='relu',
name='d_' + str(i))
fc_out = fluid.layers.dropout(fc_out, dropout_prob=dropout_rate)
deep_input = fc_out
return deep_input
def embeddingLayer(self, inputs):
emb_list = []
in_len = len(inputs)
for data in inputs:
feat_emb = fluid.embedding(
input=data,
size=[self.sparse_feature_number, self.sparse_feature_dim],
param_attr=fluid.ParamAttr(
name='item_emb',
learning_rate=5,
initializer=fluid.initializer.Xavier(
fan_in=self.sparse_feature_dim,
fan_out=self.sparse_feature_dim)),
is_sparse=True)
emb_list.append(feat_emb)
concat_emb = fluid.layers.concat(emb_list, axis=1)
field_emb = fluid.layers.reshape(
x=concat_emb, shape=[-1, in_len, self.sparse_feature_dim])
return field_emb
def net(self, input, is_infer=False):
self.user_inputs = self._sparse_data_var[1:13]
self.item_inputs = self._sparse_data_var[13:16]
self.contex_inputs = self._sparse_data_var[16:]
self.label_input = self._sparse_data_var[0]
dropout_rate = envs.get_global_env("hyper_parameters.dropout_rate")
field_wise_embeds_list = []
for inputs in [self.user_inputs, self.item_inputs, self.contex_inputs]:
field_emb = self.embeddingLayer(inputs)
field_wise_embeds_list.append(field_emb)
dnn_input = fluid.layers.concat(
[
fluid.layers.flatten(
x=field_i_vectors, axis=1)
for field_i_vectors in field_wise_embeds_list
],
axis=1)
#mlp part
dnn_output = self._DNNLayer(dnn_input, dropout_rate)
#field-weighted embedding
fm_mf_out = self._FieldWiseBiInteraction(field_wise_embeds_list)
logits = fluid.layers.concat([fm_mf_out, dnn_output], axis=1)
y_pred = fluid.layers.fc(
input=logits,
size=1,
param_attr=fluid.initializer.Xavier(uniform=False),
act='sigmoid',
name='logit')
self.predict = y_pred
auc, batch_auc, _ = fluid.layers.auc(input=self.predict,
label=self.label_input,
num_thresholds=2**12,
slide_steps=20)
if is_infer:
self._infer_results["AUC"] = auc
self._infer_results["BATCH_AUC"] = batch_auc
return
self._metrics["AUC"] = auc
self._metrics["BATCH_AUC"] = batch_auc
cost = fluid.layers.log_loss(
input=self.predict,
label=fluid.layers.cast(
x=self.label_input, dtype='float32'))
avg_cost = fluid.layers.reduce_mean(cost)
self._cost = avg_cost
Markdown is supported
0% .
You are about to add 0 people to the discussion. Proceed with caution.
先完成此消息的编辑!
想要评论请 注册