Unverified commit 1be38a77, authored by Snow, committed by GitHub

Dev sx insightface (#127)

* add insightface pr template

* update readme and add fp32 results

* update readme

* update readme and modify scripts

* update insightface readme and add reports

* add pics in chinese report

* add pics in english

* add en pics

* refine readme.md

* update test pics

* update report

* Update dlperf_insightface_test_report_v1.md

test pic

* Create dlperf_insightface_test_report_v1.md

test pics again

* fix pic bug

* Update dlperf_insightface_test_report_v1.md

* fix data

* add new path of insightface and update as reviewed

* refine scripts and update readme

* update data and pics

* modify as reviewed

* update report as reviewed

* modify OOM into - and rm invalid pics' links

* fix mxnet max num_classes

* update changelog

* update change log

* add insightface report links and introduction

* modify as reviewed

* modify as reviewed

* fix part num and dataset zoo

* rm data preprocess part

* update link

* update data of 2 nodes

* add multi node scripts

* update results of 4 nodes

* update report

* update pics

* update mxnet data

* update multi-node scripts and rm different insightface_train.py

* rm insightface_train.py and mv config into scripts with sed

* update gnuplot img
Co-authored-by: Liang Depeng <liangdepeng@gmail.com>
Co-authored-by: Flowingsun007 <flowingsun007@163.com>
Co-authored-by: MARD1NO <359521840@qq.com>
Parent 110f1642
@@ -296,6 +296,8 @@ Saving result to ./result/_result.json
| 1 | 1 | 64 | 246.45 | 1.00 |
| 1 | 4 | 64 | 948.96 | 3.85 |
| 1 | 8 | 64 | 1872.81 | 7.6 |
| 2 | 8 | 64 | 3540.09 | 14.36 |
| 4 | 8 | 64 | 6931.6 | 28.13 |
**batch_size=max**
@@ -304,6 +306,8 @@ Saving result to ./result/_result.json
| 1 | 1 | 120 | 256.61 | 1.00 |
| 1 | 4 | 120 | 990.82 | 3.86 |
| 1 | 8 | 120 | 1962.76 | 7.65 |
| 2 | 8 | 120 | 3856.52 | 15.03 |
| 4 | 8 | 120 | 7564.74 | 29.48 |
import os
import math
import argparse
import numpy as np
import oneflow as flow
from sample_config import config, default, generate_config
import ofrecord_util
import validation_util
from callback_util import TrainMetric
from insightface_val import Validator, get_val_args
from symbols import fresnet100, fmobilefacenet
def str2list(x):
    # Parse a comma-separated flag; numeric items become int/float, while
    # non-numeric items (e.g. node IPs like "10.11.0.2") stay strings.
    def parse(y):
        try:
            return float(y) if "." in y else int(y)
        except ValueError:
            return y
    return [parse(y) for y in x.split(",")]
def str2bool(v):
if v.lower() in ("yes", "true", "t", "y", "1"):
return True
elif v.lower() in ("no", "false", "f", "n", "0"):
return False
else:
raise argparse.ArgumentTypeError("Unsupported value encountered.")
def get_train_args():
train_parser = argparse.ArgumentParser(description="Flags for train")
train_parser.add_argument(
"--dataset", default=default.dataset, required=True, help="Dataset config"
)
train_parser.add_argument(
"--network", default=default.network, required=True, help="Network config"
)
train_parser.add_argument(
"--loss", default=default.loss, required=True, help="Loss config")
args, rest = train_parser.parse_known_args()
generate_config(args.network, args.dataset, args.loss)
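    # Two-stage parsing: dataset/network/loss are read first via parse_known_args()
    # so generate_config() can populate `config` and `default` before the
    # remaining flags, whose defaults depend on them, are registered.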
# distribution config
train_parser.add_argument(
"--device_num_per_node",
type=int,
default=default.device_num_per_node,
help="The number of GPUs used per node",
)
train_parser.add_argument(
"--num_nodes",
type=int,
default=default.num_nodes,
help="Node/Machine number for training",
)
train_parser.add_argument(
"--node_ips",
type=str2list,
default=default.node_ips,
        help='Node IP list for training, divided by ",", length >= num_nodes',
)
train_parser.add_argument(
"--model_parallel",
type=str2bool,
nargs="?",
default=default.model_parallel,
help="Whether to use model parallel",
)
train_parser.add_argument(
"--partial_fc",
type=str2bool,
nargs="?",
default=default.partial_fc,
help="Whether to use partial fc",
)
# train config
train_parser.add_argument("--num_classes", type=int, default=config.num_classes, help="Number of classes")
train_parser.add_argument("--data_part_num", type=int, default=config.train_data_part_num, help="Train data part num")
train_parser.add_argument(
"--train_batch_size",
type=int,
default=default.train_batch_size,
help="Train batch size totally",
)
train_parser.add_argument(
"--use_synthetic_data",
type=str2bool,
nargs="?",
default=default.use_synthetic_data,
help="Whether to use synthetic data",
)
train_parser.add_argument(
"--do_validation_while_train",
type=str2bool,
nargs="?",
default=default.do_validation_while_train,
help="Whether do validation while training",
)
train_parser.add_argument(
"--use_fp16", type=str2bool, nargs="?", default=default.use_fp16, help="Whether to use fp16"
)
train_parser.add_argument("--nccl_fusion_threshold_mb", type=int, default=default.nccl_fusion_threshold_mb,
help="NCCL fusion threshold megabytes, set to 0 to compatible with previous version of OneFlow.")
train_parser.add_argument("--nccl_fusion_max_ops", type=int, default=default.nccl_fusion_max_ops,
help="Maximum number of ops of NCCL fusion, set to 0 to compatible with previous version of OneFlow.")
# hyperparameters
train_parser.add_argument(
"--train_unit",
type=str,
default=default.train_unit,
help="Choose train unit of iteration, batch or epoch",
)
train_parser.add_argument(
"--train_iter",
type=int,
default=default.train_iter,
help="Iteration for training",
)
train_parser.add_argument(
"--lr", type=float, default=default.lr, help="Initial start learning rate"
)
train_parser.add_argument(
"--lr_steps",
type=str2list,
default=default.lr_steps,
help="Steps of lr changing",
)
train_parser.add_argument(
"-wd", "--weight_decay", type=float, default=default.wd, help="Weight decay"
)
train_parser.add_argument(
"-mom", "--momentum", type=float, default=default.mom, help="Momentum"
)
train_parser.add_argument("--scales", type=str2list,
default=default.scales, help="Learning rate step sacles")
# model and log
train_parser.add_argument(
"--model_load_dir",
type=str,
default=default.model_load_dir,
help="Path to load model",
)
train_parser.add_argument(
"--models_root",
type=str,
default=default.models_root,
help="Root directory to save model.",
)
train_parser.add_argument(
"--log_dir", type=str, default=default.log_dir, help="Log info save directory"
)
train_parser.add_argument(
"--loss_print_frequency",
type=int,
default=default.loss_print_frequency,
help="Frequency of printing loss",
)
train_parser.add_argument(
"--iter_num_in_snapshot",
type=int,
default=default.iter_num_in_snapshot,
help="The number of train unit iter in the snapshot",
)
train_parser.add_argument(
"--sample_ratio",
type=float,
default=default.sample_ratio,
help="The ratio for sampling",
)
# validation config
train_parser.add_argument(
"--val_batch_size_per_device",
type=int,
default=default.val_batch_size_per_device,
help="Validation batch size per device",
)
train_parser.add_argument(
"--validation_interval",
type=int,
default=default.validation_interval,
help="Validation interval while training, using train unit as interval unit",
)
train_parser.add_argument(
"--val_data_part_num",
type=str,
default=default.val_data_part_num,
help="Validation dataset dir prefix",
)
train_parser.add_argument(
"--lfw_total_images_num", type=int, default=12000,
)
train_parser.add_argument(
"--cfp_fp_total_images_num", type=int, default=14000,
)
train_parser.add_argument(
"--agedb_30_total_images_num", type=int, default=12000,
)
    for ds in config.val_targets:
        assert ds in ("lfw", "cfp_fp", "agedb_30"), "Only lfw, cfp_fp and agedb_30 datasets are supported now!"
train_parser.add_argument(
"--%s_dataset_dir" % ds,
type=str,
default=os.path.join(default.val_dataset_dir, ds),
help="Validation dataset path",
)
train_parser.add_argument(
"--nrof_folds", type=int, default=default.nrof_folds,
)
return train_parser.parse_args()
def get_train_config(args):
func_config = flow.FunctionConfig()
func_config.default_logical_view(flow.scope.consistent_view())
func_config.default_data_type(flow.float)
func_config.cudnn_conv_heuristic_search_algo(
config.cudnn_conv_heuristic_search_algo
)
func_config.enable_fuse_model_update_ops(
config.enable_fuse_model_update_ops)
func_config.enable_fuse_add_to_output(config.enable_fuse_add_to_output)
if args.use_fp16:
print("Training with FP16 now.")
func_config.enable_auto_mixed_precision(True)
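    # Partial FC samples rows of fc7-weight, so its gradient is sparse: route
    # that variable through the indexed-slices optimizer and disable fused
    # dense model updates, which do not handle sparse slices.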
if args.partial_fc:
func_config.enable_fuse_model_update_ops(False)
func_config.indexed_slices_optimizer_conf(
dict(include_op_names=dict(op_name=['fc7-weight'])))
if args.use_fp16 and (args.num_nodes * args.device_num_per_node) > 1:
flow.config.collective_boxing.nccl_fusion_all_reduce_use_buffer(False)
if args.nccl_fusion_threshold_mb:
flow.config.collective_boxing.nccl_fusion_threshold_mb(
args.nccl_fusion_threshold_mb)
if args.nccl_fusion_max_ops:
flow.config.collective_boxing.nccl_fusion_max_ops(
args.nccl_fusion_max_ops)
size = args.device_num_per_node * args.num_nodes
config.num_classes = args.num_classes
config.train_data_part_num = args.data_part_num
num_local = (config.num_classes + size - 1) // size
num_sample = int(num_local * args.sample_ratio)
args.total_num_sample = num_sample * size
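    # Example with the defaults above: 85744 classes on 1 node x 8 GPUs gives
    # num_local = ceil(85744 / 8) = 10718 class centers per device; with
    # sample_ratio = 0.1, num_sample = 1071 and total_num_sample = 8568.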
assert args.train_iter > 0, "Train iter must be greater than 0!"
steps_per_epoch = math.ceil(config.total_img_num / args.train_batch_size)
if args.train_unit == "epoch":
print("Using epoch as training unit now. Each unit of iteration is epoch, including train_iter, iter_num_in_snapshot and validation interval")
args.total_iter_num = steps_per_epoch * args.train_iter
args.iter_num_in_snapshot = steps_per_epoch * args.iter_num_in_snapshot
if args.validation_interval <= args.total_iter_num:
args.validation_interval = steps_per_epoch * args.validation_interval
else:
print(
"It doesn't do validation because validation_interval is greater than train_iter.")
elif args.train_unit == "batch":
print("Using batch as training unit now. Each unit of iteration is batch, including train_iter, iter_num_in_snapshot and validation interval")
args.total_iter_num = args.train_iter
args.iter_num_in_snapshot = args.iter_num_in_snapshot
args.validation_interval = args.validation_interval
else:
raise ValueError("Invalid train unit!")
return func_config
def make_train_func(args):
@flow.global_function(type="train", function_config=get_train_config(args))
def get_symbol_train_job():
if args.use_synthetic_data:
(labels, images) = ofrecord_util.load_synthetic(args)
else:
labels, images = ofrecord_util.load_train_dataset(args)
image_size = images.shape[1:-1]
assert len(
image_size) == 2, "The length of image size must be equal to 2."
assert image_size[0] == image_size[1], "image_size[0] should be equal to image_size[1]."
print("train image_size: ", image_size)
embedding = eval(config.net_name).get_symbol(images)
def _get_initializer():
return flow.random_normal_initializer(mean=0.0, stddev=0.01)
trainable = True
if config.loss_name == "softmax":
if args.model_parallel:
print("Training is using model parallelism now.")
labels = labels.with_distribute(flow.distribute.broadcast())
fc1_distribute = flow.distribute.broadcast()
fc7_data_distribute = flow.distribute.split(1)
fc7_model_distribute = flow.distribute.split(0)
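                # Model parallel: features and labels are broadcast to every
                # device, the fc7 weight is sharded along the class axis
                # (split(0)), and each device computes logits for its own
                # class slice (split(1) on the data).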
else:
fc1_distribute = flow.distribute.split(0)
fc7_data_distribute = flow.distribute.split(0)
fc7_model_distribute = flow.distribute.broadcast()
fc7 = flow.layers.dense(
inputs=embedding.with_distribute(fc1_distribute),
units=config.num_classes,
activation=None,
use_bias=False,
kernel_initializer=_get_initializer(),
bias_initializer=None,
trainable=trainable,
name="fc7",
model_distribute=fc7_model_distribute,
)
fc7 = fc7.with_distribute(fc7_data_distribute)
elif config.loss_name == "margin_softmax":
if args.model_parallel:
print("Training is using model parallelism now.")
labels = labels.with_distribute(flow.distribute.broadcast())
fc1_distribute = flow.distribute.broadcast()
fc7_data_distribute = flow.distribute.split(1)
fc7_model_distribute = flow.distribute.split(0)
else:
fc1_distribute = flow.distribute.split(0)
fc7_data_distribute = flow.distribute.split(0)
fc7_model_distribute = flow.distribute.broadcast()
fc7_weight = flow.get_variable(
name="fc7-weight",
shape=(config.num_classes, embedding.shape[1]),
dtype=embedding.dtype,
initializer=_get_initializer(),
regularizer=None,
trainable=trainable,
model_name="weight",
distribute=fc7_model_distribute,
)
if args.partial_fc and args.model_parallel:
print(
"Training is using model parallelism and optimized by partial_fc now."
)
(
mapped_label,
sampled_label,
sampled_weight,
) = flow.distributed_partial_fc_sample(
weight=fc7_weight, label=labels, num_sample=args.total_num_sample,
)
labels = mapped_label
fc7_weight = sampled_weight
fc7_weight = flow.math.l2_normalize(
input=fc7_weight, axis=1, epsilon=1e-10)
fc1 = flow.math.l2_normalize(
input=embedding, axis=1, epsilon=1e-10)
fc7 = flow.matmul(
a=fc1.with_distribute(fc1_distribute), b=fc7_weight, transpose_b=True
)
fc7 = fc7.with_distribute(fc7_data_distribute)
fc7 = (
flow.combined_margin_loss(
fc7, labels, m1=config.loss_m1, m2=config.loss_m2, m3=config.loss_m3
)
* config.loss_s
)
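            # combined_margin_loss folds the SphereFace (m1), ArcFace (m2) and
            # CosFace (m3) margins into one op; loss_s then rescales the
            # normalized logits, following the insightface formulation.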
fc7 = fc7.with_distribute(fc7_data_distribute)
else:
raise NotImplementedError
loss = flow.nn.sparse_softmax_cross_entropy_with_logits(
labels, fc7, name="softmax_loss"
)
lr_scheduler = flow.optimizer.PiecewiseScalingScheduler(
base_lr=args.lr,
boundaries=args.lr_steps,
scale=args.scales,
warmup=None
)
flow.optimizer.SGDW(lr_scheduler,
momentum=args.momentum if args.momentum > 0 else None,
weight_decay=args.weight_decay
).minimize(loss)
return loss
return get_symbol_train_job
def main(args):
flow.config.gpu_device_num(args.device_num_per_node)
print("gpu num: ", args.device_num_per_node)
if not os.path.exists(args.models_root):
os.makedirs(args.models_root)
prefix = os.path.join(
args.models_root, "%s-%s-%s" % (args.network,
args.loss, args.dataset), "model"
)
prefix_dir = os.path.dirname(prefix)
print("prefix: ", prefix)
if not os.path.exists(prefix_dir):
os.makedirs(prefix_dir)
if args.num_nodes > 1:
assert args.num_nodes <= len(
args.node_ips), "The number of nodes should not be greater than length of node_ips list."
flow.env.ctrl_port(12138)
nodes = []
for ip in args.node_ips:
addr_dict = {}
addr_dict["addr"] = ip
nodes.append(addr_dict)
flow.env.machine(nodes)
if config.data_format.upper() != "NCHW" and config.data_format.upper() != "NHWC":
raise ValueError("Invalid data format")
flow.env.log_dir(args.log_dir)
train_func = make_train_func(args)
validator = Validator(args)
if os.path.exists(args.model_load_dir):
print("Loading model from {}".format(args.model_load_dir))
variables = flow.checkpoint.get(args.model_load_dir)
flow.load_variables(variables)
print("num_classes ", config.num_classes)
print("Called with argument: ", args, config)
train_metric = TrainMetric(
desc="train", calculate_batches=args.loss_print_frequency, batch_size=args.train_batch_size
)
lr = args.lr
for step in range(args.total_iter_num):
# train
train_func().async_get(train_metric.metric_cb(step))
# validation
if args.do_validation_while_train and (step + 1) % args.validation_interval == 0:
for ds in config.val_targets:
issame_list, embeddings_list = validator.do_validation(
dataset=ds)
validation_util.cal_validation_metrics(
embeddings_list, issame_list, nrof_folds=args.nrof_folds,
)
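        # The lr printed below only mirrors the schedule for logging (assuming
        # a 0.1 decay per step); the actual decay is applied by the
        # PiecewiseScalingScheduler configured in make_train_func.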
if step in args.lr_steps:
lr *= 0.1
print("lr_steps: ", step)
print("lr change to ", lr)
# snapshot
if (step + 1) % args.iter_num_in_snapshot == 0:
path = os.path.join(
prefix_dir, "snapshot_" + str(step // args.iter_num_in_snapshot))
flow.checkpoint.save(path)
if __name__ == "__main__":
args = get_train_args()
main(args)
#!/bin/bash
set -ex
WORKSPACE=~/oneflow_temp
SCRIPTS_PATH=$WORKSPACE/oneflow_face
host_num=${1:-4}
network=${2:-"r100"}
dataset=${3:-"emore"}
loss=${4:-"arcface"}
num_nodes=${5:-${host_num}}
bz_per_device=${6:-64}
train_unit=${7:-"batch"}
train_iter=${8:-150}
gpu_num_per_node=${9:-8}
precision=${10:-fp32}
model_parallel=${11:-1}
partial_fc=${12:-1}
test_times=${13:-5}
sample_ratio=${14:-0.1}
num_classes=${15:-85744}
use_synthetic_data=${16:-False}
# 2n8g
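# Patch sample_config.py and run_multi_nodes.sh in place for the 2-node run:
# set num_nodes/node_ips to two hosts and comment out the unused host entries.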
sed -i "s/num_nodes = 1/num_nodes = 2/g" $SCRIPTS_PATH/sample_config.py
sed -i "s/node_ips = \['10.11.0.2'\]/node_ips = \['10.11.0.2', '10.11.0.3'\]/g" $SCRIPTS_PATH/sample_config.py
sed -i "s/\"10.11.0.3\"/\#\"10.11.0.3\"/g" $WORKSPACE/run_multi_nodes.sh
sed -i "s/\"10.11.0.4\"/\#\"10.11.0.4\"/g" $WORKSPACE/run_multi_nodes.sh
i=1
while [ $i -le ${test_times} ]
do
bash $SCRIPTS_PATH/run_multi_nodes.sh 2 ${network} ${dataset} ${loss} 2 $bz_per_device $train_unit $train_iter ${gpu_num_per_node} $precision $model_parallel $partial_fc $i $sample_ratio $num_classes $use_synthetic_data
echo " >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>Finished Test Case ${i}!<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< "
let i++
sleep 20s
done
# 4n8g
sed -i "s/num_nodes = 2/num_nodes = 4/g" $SCRIPTS_PATH/sample_config.py
sed -i "s/node_ips = \['10.11.0.2', '10.11.0.3'\]/node_ips = \['10.11.0.2', '10.11.0.3', '10.11.0.4', '10.11.0.5'\]/g" $SCRIPTS_PATH/sample_config.py
sed -i "s/\#\"10.11.0.3\"/\"10.11.0.3\"/g" $WORKSPACE/run_multi_nodes.sh
sed -i "s/\#\"10.11.0.4\"/\"10.11.0.4\"/g" $WORKSPACE/run_multi_nodes.sh
i=1
while [ $i -le ${test_times} ]
do
    bash $SCRIPTS_PATH/run_multi_nodes.sh 4 ${network} ${dataset} ${loss} 4 $bz_per_device $train_unit $train_iter ${gpu_num_per_node} $precision $model_parallel $partial_fc $i $sample_ratio $num_classes $use_synthetic_data
echo " >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>Finished Test Case ${i}!<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< "
let i++
sleep 20s
done
#!/bin/bash
set -ex
workdir=/home/leinao/sx
host_num=${1:-4}
network=${2:-"r100"}
dataset=${3:-"emore"}
loss=${4:-"arcface"}
num_nodes=${5:-${host_num}}
bz_per_device=${6:-64}
train_unit=${7:-"batch"}
train_iter=${8:-150}
gpu_num_per_node=${9:-8}
precision=${10:-fp32}
model_parallel=${11:-1}
partial_fc=${12:-1}
test_times=${13:-5}
sample_ratio=${14:-0.1}
num_classes=${15:-85744}
use_synthetic_data=${16:-False}
port=22
scripts_path=${workdir}/oneflow_face
test_scripts=${scripts_path}/scripts
LOCAL_RUN=${scripts_path}/scripts/train_insightface.sh
##############################################
#0 prepare the host list for training
#comment unused hosts with `#`
#or use first arg to limit the hosts number
declare -a host_list=(
"10.11.0.2"
"10.11.0.3"
"10.11.0.4"
"10.11.0.5"
)
if [ -n "$1" ]
then
host_num=$1
else
host_num=${#host_list[@]}
fi
if [ ${host_num} -gt ${#host_list[@]} ]
then
host_num=${#host_list[@]}
fi
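# Train on the first ${host_num} entries of host_list.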
hosts=("${host_list[@]:0:${host_num}}")
echo "Working on hosts:${hosts[@]}"
test_case=${host_num}n${gpu_num_per_node}g_b${bz_per_device}_${network}_${dataset}_${loss}
log_file=${test_case}.log
logs_folder=logs
mkdir -p $logs_folder
echo log file: ${log_file}
##############################################
#1 prepare oneflow_temp folder on each host
for host in "${hosts[@]}"
do
ssh -p ${port} $host "mkdir -p ~/oneflow_temp"
done
##############################################
#2 copy files to each host and start work
for host in "${hosts[@]:1}"
do
echo "start training on ${host}"
ssh -p ${port} $host "rm -rf ~/oneflow_temp/*"
scp -P ${port} -r $scripts_path $LOCAL_RUN $host:~/oneflow_temp
ssh -p ${port} $host "cd ~/oneflow_temp; nohup bash train_insightface.sh ~/oneflow_temp/oneflow_face ${network} ${dataset} ${loss} ${num_nodes} $bz_per_device $train_unit $train_iter ${gpu_num_per_node} $precision $model_parallel $partial_fc $test_times $sample_ratio $num_classes 1>${log_file} 2>&1 </dev/null &"
done
#3 copy files to master host and start work
host=${hosts[0]}
echo "start training on ${host}"
ssh -p ${port} $host "rm -rf ~/oneflow_temp/*"
scp -P ${port} -r $scripts_path $LOCAL_RUN $host:~/oneflow_temp
ssh -p ${port} $host "cd ~/oneflow_temp; bash train_insightface.sh ~/oneflow_temp/oneflow_face ${network} ${dataset} ${loss} ${num_nodes} $bz_per_device $train_unit $train_iter ${gpu_num_per_node} $precision $model_parallel $partial_fc $test_times $sample_ratio $num_classes 1>${log_file}"
echo "done"
cp ~/oneflow_temp/${log_file} $logs_folder/${log_file}
sleep 3
export ONEFLOW_DEBUG_MODE=""
#export ONEFLOW_DEBUG_MODE=True
export PYTHONUNBUFFERED=1
workspace=${1:-""}
workspace=${1:-"/data/oneflow_temp/oneflow_face"}
network=${2:-"r100"}
dataset=${3:-"emore"}
loss=${4:-"arcface"}
num_nodes=${5:-4}
batch_size_per_device=${6:-64}
train_unit=${7:-"batch"}
train_iter=${8:-150}
gpu_num_per_node=${9:-8}
precision=${10:-fp32}
model_parallel=${11:-1}
partial_fc=${12:-1}
test_times=${13:-1}
sample_ratio=${14:-0.1}
num_classes=${15:-85744}
use_synthetic_data=${16:-False}
MODEL_SAVE_DIR=${num_classes}_${precision}_b${batch_size_per_device}_oneflow_model_parallel_${model_parallel}_partial_fc_${partial_fc}/${num_nodes}n${gpu_num_per_node}g
LOG_DIR=$MODEL_SAVE_DIR
if [ $gpu_num_per_node -gt 1 ]; then
    if [ $network = "r100" ]; then
        data_part_num=32
    elif [ $network = "r100_glint360k" ]; then
        data_part_num=200
    else
        echo "Please modify exact data part num in sample_config.py!"
    fi
else
    data_part_num=1
fi
sed -i "s/emore.train_data_part_num = 32/emore.train_data_part_num = $data_part_num/g" $workspace/sample_config.py
sed -i "s/emore.num_classes = 85744/emore.num_classes = $num_classes/g" $workspace/sample_config.py
PREC=""
if [ "$precision" = "fp16" ] ; then
@@ -66,7 +71,7 @@ CMD+=" --use_synthetic_data=${use_synthetic_data}"
CMD+=" --num_classes=${num_classes}"
CMD+=" --data_part_num=${data_part_num}"
CMD="python3 $CMD "
CMD="/home/leinao/anaconda3/envs/insightface/bin/python3 $CMD "
set -x
if [ -z "$LOG_FILE" ] ; then
$CMD
@@ -84,7 +84,7 @@ In this report, num_classes means the number of face categories. In the tests, it
| 1 | 4 | 64 | 923.23 | 655.56 |
| 1 | 8 | 64 | 1836.8 | 650.8 |
![ ](../imgs/data_parallel_face_emore_r100_bz64.png)
**batch_size = max**
@@ -95,7 +95,7 @@ In this report, num_classes means the number of face categories. In the tests, it
| 1 | 4 | 972.8 | 733.1 |
| 1 | 8 | 1931.76 | 749.42 |
![ ](../imgs/data_parallel_face_emore_r100_bz_max.png)
#### Model Parallelism
@@ -108,7 +108,7 @@ In this report, num_classes means the number of face categories. In the tests, it
| 1 | 8 | 64 | 1854.15 | 756.96 |
![ ](../imgs/model_parallel_face_emore_r100_bz64.png)
**batch_size = max**
@@ -118,7 +118,7 @@ In this report, num_classes means the number of face categories. In the tests, it
| 1 | 4 | 970.1 | 724.26 |
| 1 | 8 | 1921.87 | 821.06 |
![ ](../imgs/model_parallel_face_emore_r100_bz_max.png)
#### Partial FC, sample_ratio = 0.1
@@ -126,21 +126,27 @@ In this report, num_classes means the number of face categories. In the tests, it
| node_num | gpu_num_per_node | batch_size_per_device | OneFlow samples/s | MXNet samples/s |
| -------- | ---------------- | --------------------- | ----------------- | --------------- |
| 1 | 1 | 64 | 246.45 | 218.84 |
| 1 | 4 | 64 | 948.96 | 787.07 |
| 1 | 8 | 64 | 1872.81 | 1423.12 |
| 2 | 8 | 64 | 3540.09 | 2612.65 |
| 4 | 8 | 64 | 6931.6 | 5008.72 |
![ ](../imgs/partial_fc_sample_ratio_0_1_face_emore_r100_bz64.png)
**batch_size=max**
| node_num | gpu_num_per_node | OneFlow samples/s(max bsz=120) | MXNet samples/s(max bsz=104) |
| -------- | ---------------- | ------------------------------ | ---------------------------- |
| 1 | 1 | 256.61 | 229.11 |
| 1 | 4 | 990.82 | 844.37 |
| 1 | 8 | 1962.76 | 1584.89 |
| 2 | 8 | 3856.52 | 2845.97 |
| 4 | 8 | 7564.74 | 5476.51 |
![ ](../imgs/partial_fc_sample_ratio_0_1_face_emore_r100_bz_max.png)
### Glint360k & R100 & FP32 Throughputs
@@ -199,7 +205,7 @@ In this report, num_classes means the number of face categories. In the tests, it
| 1 | 4 | 64 | 945.44 | 730.29 |
| 1 | 8 | 64 | 1858.57 | 1359.2 |
![ ](../imgs/partial_fc_sample_ratio_0_1_glint_r100_bz64.png)
**batch_size=max**
@@ -209,7 +215,7 @@ In this report, num_classes means the number of face categories. In the tests, it
| 1 | 4 | 973.63 | 811.34 |
| 1 | 8 | 1933.88 | 1493.51 |
![ ](../imgs/partial_fc_sample_ratio_0_1_glint_r100_bz_max.png)
### Face Emore & Y1 & FP32 Throughputs
@@ -224,7 +230,7 @@ In this report, num_classes means the number of face categories. In the tests, it
| 1 | 4 | 256 | 7354.49 | 1055.88 |
| 1 | 8 | 256 | 14298.02 | 1031.1 |
![ ](../imgs/data_parallel_face_emore_y1_bz256.png)
**batch_size = max**
@@ -234,7 +240,7 @@ In this report, num_classes means the number of face categories. In the tests, it
| 1 | 4 | 7511.53 | 1044.38 |
| 1 | 8 | 14756.03 | 1026.68 |
![ ](../imgs/data_parallel_face_emore_y1_bz_max.png)
#### Model Parallelism
@@ -246,7 +252,7 @@ In this report, num_classes means the number of face categories. In the tests, it
| 1 | 4 | 256 | 7264.54 | 984.88 |
| 1 | 8 | 256 | 14049.75 | 1030.58 |
![ ](../imgs/model_parallel_face_emore_y1_bz256.png)
**batch_size = max**
@@ -256,7 +262,7 @@ In this report, num_classes means the number of face categories. In the tests, it
| 1 | 4 | 7363.77 | 1017.78 |
| 1 | 8 | 14436.38 | 1038.6 |
![ ](../imgs/model_parallel_face_emore_y1_bz_max.png)
### Max num_classes
@@ -268,7 +274,7 @@ In this report, num_classes means the number of face categories. In the tests, it
## Conclusion
The above series of tests shows that:
@@ -38,7 +38,7 @@
| Framework | Version | Model source |
| ------------------------------------------------------------- | ----------------- | ------------------------------------------------------------------------------------------------------------------- |
| [OneFlow](https://github.com/Oneflow-Inc/oneflow/tree/v0.3.4) | 0.3.4 | [oneflow_face](https://github.com/Oneflow-Inc/oneflow_face/tree/1705ae5b4cee6466f7abf75ba891984ec02b8ea3) |
| [deepinsight](https://github.com/deepinsight) | 2021-01-20 update | [deepinsight/insightface](https://github.com/deepinsight/insightface/tree/a9beb60971fb8115698859c35fdca721d6f75f5d) |
## Benchmark Configuration
@@ -83,7 +83,7 @@ OneFlow's implementation is strictly aligned with MXNet, mainly including:
| 1 | 4 | 64 | 923.23 | 655.56 |
| 1 | 8 | 64 | 1836.8 | 650.8 |
![ ](../imgs/data_parallel_face_emore_r100_bz64.png)
**batch_size = max**
@@ -94,7 +94,7 @@ OneFlow's implementation is strictly aligned with MXNet, mainly including:
| 1 | 4 | 972.8 | 733.1 |
| 1 | 8 | 1931.76 | 749.42 |
![ ](../imgs/data_parallel_face_emore_r100_bz_max.png)
#### Model Parallelism
@@ -107,7 +107,7 @@ OneFlow's implementation is strictly aligned with MXNet, mainly including:
| 1 | 8 | 64 | 1854.15 | 756.96 |
![ ](../imgs/model_parallel_face_emore_r100_bz64.png)
**batch_size = max**
@@ -117,7 +117,7 @@ OneFlow's implementation is strictly aligned with MXNet, mainly including:
| 1 | 4 | 970.1 | 724.26 |
| 1 | 8 | 1921.87 | 821.06 |
![ ](../imgs/model_parallel_face_emore_r100_bz_max.png)
#### Partial FC, sample_ratio = 0.1
@@ -125,21 +125,25 @@ OneFlow's implementation is strictly aligned with MXNet, mainly including:
| node_num | gpu_num_per_node | batch_size_per_device | OneFlow samples/s | MXNet samples/s |
| -------- | ---------------- | --------------------- | ----------------- | --------------- |
| 1 | 1 | 64 | 246.45 | 218.84 |
| 1 | 4 | 64 | 948.96 | 787.07 |
| 1 | 8 | 64 | 1872.81 | 1423.12 |
| 2 | 8 | 64 | 3540.09 | 2612.65 |
| 4 | 8 | 64 | 6931.6 | 5008.72 |
![ ](../imgs/partial_fc_sample_ratio_0_1_face_emore_r100_bz64.png)
**batch_size=max**
| node_num | gpu_num_per_node | OneFlow samples/s(max bsz=120) | MXNet samples/s(max bsz=104) |
| -------- | ---------------- | ------------------------------ | ---------------------------- |
| 1 | 1 | 256.61 | 229.11 |
| 1 | 4 | 990.82 | 844.37 |
| 1 | 8 | 1962.76 | 1584.89 |
| 2 | 8 | 3856.52 | 2845.97 |
| 4 | 8 | 7564.74 | 5476.51 |
![ ](../imgs/partial_fc_sample_ratio_0_1_face_emore_r100_bz_max.png)
### Glint360k & R100 & FP32 Throughputs
@@ -198,7 +202,7 @@ OneFlow's implementation is strictly aligned with MXNet, mainly including:
| 1 | 4 | 64 | 945.44 | 730.29 |
| 1 | 8 | 64 | 1858.57 | 1359.2 |
![ ](../imgs/partial_fc_sample_ratio_0_1_glint_r100_bz64.png)
**batch_size=max**
@@ -208,7 +212,7 @@ OneFlow's implementation is strictly aligned with MXNet, mainly including:
| 1 | 4 | 973.63 | 811.34 |
| 1 | 8 | 1933.88 | 1493.51 |
![ ](../imgs/partial_fc_sample_ratio_0_1_glint_r100_bz_max.png)
### Face Emore & Y1 & FP32 Throughputs
@@ -223,7 +227,7 @@ OneFlow's implementation is strictly aligned with MXNet, mainly including:
| 1 | 4 | 256 | 7354.49 | 1055.88 |
| 1 | 8 | 256 | 14298.02 | 1031.1 |
![ ](../imgs/data_parallel_face_emore_y1_bz256.png)
**batch_size = max**
@@ -233,7 +237,7 @@ OneFlow's implementation is strictly aligned with MXNet, mainly including:
| 1 | 4 | 7511.53 | 1044.38 |
| 1 | 8 | 14756.03 | 1026.68 |
![ ](../imgs/data_parallel_face_emore_y1_bz_max.png)
#### Model Parallelism
@@ -245,7 +249,7 @@ OneFlow's implementation is strictly aligned with MXNet, mainly including:
| 1 | 4 | 256 | 7264.54 | 984.88 |
| 1 | 8 | 256 | 14049.75 | 1030.58 |
![ ](../imgs/model_parallel_face_emore_y1_bz256.png)
**batch_size = max**
@@ -255,7 +259,7 @@ OneFlow's implementation is strictly aligned with MXNet, mainly including:
| 1 | 4 | 7363.77 | 1017.78 |
| 1 | 8 | 14436.38 | 1038.6 |
![ ](../imgs/model_parallel_face_emore_y1_bz_max.png)
### Max num_classes
@@ -273,4 +277,4 @@ OneFlow's implementation is strictly aligned with MXNet, mainly including:
1. As `batch_size_per_device` grows, MXNet's throughput hardly improves even with Partial FC enabled, while OneFlow maintains stable, near-linear scaling;
2. Under the same conditions, OneFlow supports a larger `batch_size` and more `num_classes`: on a single node with 8 GPUs, batch size per GPU fixed at 64, and FP16, model_parallel and partial_fc all enabled, OneFlow supports 1.125x as many `num_classes` as MXNet (13.5 million vs. 12 million).

For more details, see the OneFlow and MXNet pages of DLPerf.
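For reference, a single-node Partial FC run matching the 1n8g, batch size 64 rows above could be launched roughly as follows. This is a sketch based on the flags defined in `insightface_train.py` earlier in this diff; the dataset location and the remaining defaults are assumed to come from `sample_config.py`:

```bash
# 1 node x 8 GPUs, 64 samples per device (512 total), FP32,
# model parallelism plus Partial FC with sample_ratio 0.1.
python3 insightface_train.py \
    --dataset emore --network r100 --loss arcface \
    --num_nodes 1 --device_num_per_node 8 \
    --train_batch_size 512 \
    --model_parallel True --partial_fc True --sample_ratio 0.1 \
    --train_unit batch --train_iter 150
```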