Unverified commit 1be38a77, authored by Snow, committed by GitHub

Dev sx insightface (#127)

* add insightface pr template

* update readme and add fp32 results

* update readme

* update readme and modify scripts

* update insightface readme and add reports

* add pics in chinese report

* add pics in english

* add en pics

* refine readme.md

* update test pics

* update report

* Update dlperf_insightface_test_report_v1.md

test pic

* Create dlperf_insightface_test_report_v1.md

test pics again

* fix pic bug

* Update dlperf_insightface_test_report_v1.md

* fix data

* add new path of insightface and update as reviewed

* refine scripts and update readme

* update data and pics

* modify as reviewed

* update report as reviewed

* modify OOM into - and rm invalid pics' links

* fix mxnet max num_classes

* update changelog

* update change log

* add insightface report links and introduction

* modify as reviewed

* modify as reviewed

* fix part num and dataset zoo

* rm data preprocess part

* update link

* update data of 2 nodes

* add multi node scripts

* update results of 4 nodes

* update report

* update pics

* update mxnet data

* update multi-node scripts and rm different insightface_train.py

* rm insightface_train.py and mv config into scripts with sed

* update gnuplot img
Co-authored-by: Liang Depeng <liangdepeng@gmail.com>
Co-authored-by: Flowingsun007 <flowingsun007@163.com>
Co-authored-by: MARD1NO <359521840@qq.com>
Parent 110f1642
@@ -296,6 +296,8 @@ Saving result to ./result/_result.json
| 1 | 1 | 64 | 246.45 | 1.00 |
| 1 | 4 | 64 | 948.96 | 3.85 |
| 1 | 8 | 64 | 1872.81 | 7.6 |
| 2 | 8 | 64 | 3540.09 | 14.36 |
| 4 | 8 | 64 | 6931.6 | 28.13 |
**batch_size=max**
@@ -304,6 +306,8 @@ Saving result to ./result/_result.json
| 1 | 1 | 120 | 256.61 | 1.00 |
| 1 | 4 | 120 | 990.82 | 3.86 |
| 1 | 8 | 120 | 1962.76 | 7.65 |
| 2 | 8 | 120 | 3856.52 | 15.03 |
| 4 | 8 | 120 | 7564.74 | 29.48 |
import os
import math
import argparse
import numpy as np
import oneflow as flow
from sample_config import config, default, generate_config
import ofrecord_util
import validation_util
from callback_util import TrainMetric
from insightface_val import Validator, get_val_args
from symbols import fresnet100, fmobilefacenet
def str2list(x):
    # Parse a comma-separated flag; numeric items become int/float, while
    # non-numeric items (e.g. node IPs like "10.11.0.2") stay strings.
    def parse(y):
        try:
            return float(y) if "." in y else int(y)
        except ValueError:
            return y
    return [parse(y) for y in x.split(",")]
def str2bool(v):
if v.lower() in ("yes", "true", "t", "y", "1"):
return True
elif v.lower() in ("no", "false", "f", "n", "0"):
return False
else:
raise argparse.ArgumentTypeError("Unsupported value encountered.")
def get_train_args():
train_parser = argparse.ArgumentParser(description="Flags for train")
train_parser.add_argument(
"--dataset", default=default.dataset, required=True, help="Dataset config"
)
train_parser.add_argument(
"--network", default=default.network, required=True, help="Network config"
)
train_parser.add_argument(
"--loss", default=default.loss, required=True, help="Loss config")
args, rest = train_parser.parse_known_args()
generate_config(args.network, args.dataset, args.loss)
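    # Two-stage parsing: dataset/network/loss are read first via parse_known_args()
    # so generate_config() can populate `config` and `default` before the
    # remaining flags, whose defaults depend on them, are registered.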
# distribution config
train_parser.add_argument(
"--device_num_per_node",
type=int,
default=default.device_num_per_node,
help="The number of GPUs used per node",
)
train_parser.add_argument(
"--num_nodes",
type=int,
default=default.num_nodes,
help="Node/Machine number for training",
)
train_parser.add_argument(
"--node_ips",
type=str2list,
default=default.node_ips,
        help='Node IP list for training, divided by ",", length >= num_nodes',
)
train_parser.add_argument(
"--model_parallel",
type=str2bool,
nargs="?",
default=default.model_parallel,
help="Whether to use model parallel",
)
train_parser.add_argument(
"--partial_fc",
type=str2bool,
nargs="?",
default=default.partial_fc,
help="Whether to use partial fc",
)
# train config
train_parser.add_argument("--num_classes", type=int, default=config.num_classes, help="Number of classes")
train_parser.add_argument("--data_part_num", type=int, default=config.train_data_part_num, help="Train data part num")
train_parser.add_argument(
"--train_batch_size",
type=int,
default=default.train_batch_size,
help="Train batch size totally",
)
train_parser.add_argument(
"--use_synthetic_data",
type=str2bool,
nargs="?",
default=default.use_synthetic_data,
help="Whether to use synthetic data",
)
train_parser.add_argument(
"--do_validation_while_train",
type=str2bool,
nargs="?",
default=default.do_validation_while_train,
help="Whether do validation while training",
)
train_parser.add_argument(
"--use_fp16", type=str2bool, nargs="?", default=default.use_fp16, help="Whether to use fp16"
)
train_parser.add_argument("--nccl_fusion_threshold_mb", type=int, default=default.nccl_fusion_threshold_mb,
help="NCCL fusion threshold megabytes, set to 0 to compatible with previous version of OneFlow.")
train_parser.add_argument("--nccl_fusion_max_ops", type=int, default=default.nccl_fusion_max_ops,
help="Maximum number of ops of NCCL fusion, set to 0 to compatible with previous version of OneFlow.")
# hyperparameters
train_parser.add_argument(
"--train_unit",
type=str,
default=default.train_unit,
help="Choose train unit of iteration, batch or epoch",
)
train_parser.add_argument(
"--train_iter",
type=int,
default=default.train_iter,
help="Iteration for training",
)
train_parser.add_argument(
"--lr", type=float, default=default.lr, help="Initial start learning rate"
)
train_parser.add_argument(
"--lr_steps",
type=str2list,
default=default.lr_steps,
help="Steps of lr changing",
)
train_parser.add_argument(
"-wd", "--weight_decay", type=float, default=default.wd, help="Weight decay"
)
train_parser.add_argument(
"-mom", "--momentum", type=float, default=default.mom, help="Momentum"
)
train_parser.add_argument("--scales", type=str2list,
default=default.scales, help="Learning rate step sacles")
# model and log
train_parser.add_argument(
"--model_load_dir",
type=str,
default=default.model_load_dir,
help="Path to load model",
)
train_parser.add_argument(
"--models_root",
type=str,
default=default.models_root,
help="Root directory to save model.",
)
train_parser.add_argument(
"--log_dir", type=str, default=default.log_dir, help="Log info save directory"
)
train_parser.add_argument(
"--loss_print_frequency",
type=int,
default=default.loss_print_frequency,
help="Frequency of printing loss",
)
train_parser.add_argument(
"--iter_num_in_snapshot",
type=int,
default=default.iter_num_in_snapshot,
help="The number of train unit iter in the snapshot",
)
train_parser.add_argument(
"--sample_ratio",
type=float,
default=default.sample_ratio,
help="The ratio for sampling",
)
# validation config
train_parser.add_argument(
"--val_batch_size_per_device",
type=int,
default=default.val_batch_size_per_device,
help="Validation batch size per device",
)
train_parser.add_argument(
"--validation_interval",
type=int,
default=default.validation_interval,
help="Validation interval while training, using train unit as interval unit",
)
train_parser.add_argument(
"--val_data_part_num",
type=str,
default=default.val_data_part_num,
help="Validation dataset dir prefix",
)
train_parser.add_argument(
"--lfw_total_images_num", type=int, default=12000,
)
train_parser.add_argument(
"--cfp_fp_total_images_num", type=int, default=14000,
)
train_parser.add_argument(
"--agedb_30_total_images_num", type=int, default=12000,
)
    for ds in config.val_targets:
        assert ds in ("lfw", "cfp_fp", "agedb_30"), "Only lfw, cfp_fp and agedb_30 datasets are supported now!"
train_parser.add_argument(
"--%s_dataset_dir" % ds,
type=str,
default=os.path.join(default.val_dataset_dir, ds),
help="Validation dataset path",
)
train_parser.add_argument(
"--nrof_folds", type=int, default=default.nrof_folds,
)
return train_parser.parse_args()
def get_train_config(args):
func_config = flow.FunctionConfig()
func_config.default_logical_view(flow.scope.consistent_view())
func_config.default_data_type(flow.float)
func_config.cudnn_conv_heuristic_search_algo(
config.cudnn_conv_heuristic_search_algo
)
func_config.enable_fuse_model_update_ops(
config.enable_fuse_model_update_ops)
func_config.enable_fuse_add_to_output(config.enable_fuse_add_to_output)
if args.use_fp16:
print("Training with FP16 now.")
func_config.enable_auto_mixed_precision(True)
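    # Partial FC samples rows of fc7-weight, so its gradient is sparse: route
    # that variable through the indexed-slices optimizer and disable fused
    # dense model updates, which do not handle sparse slices.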
if args.partial_fc:
func_config.enable_fuse_model_update_ops(False)
func_config.indexed_slices_optimizer_conf(
dict(include_op_names=dict(op_name=['fc7-weight'])))
if args.use_fp16 and (args.num_nodes * args.device_num_per_node) > 1:
flow.config.collective_boxing.nccl_fusion_all_reduce_use_buffer(False)
if args.nccl_fusion_threshold_mb:
flow.config.collective_boxing.nccl_fusion_threshold_mb(
args.nccl_fusion_threshold_mb)
if args.nccl_fusion_max_ops:
flow.config.collective_boxing.nccl_fusion_max_ops(
args.nccl_fusion_max_ops)
size = args.device_num_per_node * args.num_nodes
config.num_classes = args.num_classes
config.train_data_part_num = args.data_part_num
num_local = (config.num_classes + size - 1) // size
num_sample = int(num_local * args.sample_ratio)
args.total_num_sample = num_sample * size
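    # Example with the defaults above: 85744 classes on 1 node x 8 GPUs gives
    # num_local = ceil(85744 / 8) = 10718 class centers per device; with
    # sample_ratio = 0.1, num_sample = 1071 and total_num_sample = 8568.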
assert args.train_iter > 0, "Train iter must be greater than 0!"
steps_per_epoch = math.ceil(config.total_img_num / args.train_batch_size)
if args.train_unit == "epoch":
print("Using epoch as training unit now. Each unit of iteration is epoch, including train_iter, iter_num_in_snapshot and validation interval")
args.total_iter_num = steps_per_epoch * args.train_iter
args.iter_num_in_snapshot = steps_per_epoch * args.iter_num_in_snapshot
if args.validation_interval <= args.total_iter_num:
args.validation_interval = steps_per_epoch * args.validation_interval
else:
print(
"It doesn't do validation because validation_interval is greater than train_iter.")
elif args.train_unit == "batch":
print("Using batch as training unit now. Each unit of iteration is batch, including train_iter, iter_num_in_snapshot and validation interval")
args.total_iter_num = args.train_iter
args.iter_num_in_snapshot = args.iter_num_in_snapshot
args.validation_interval = args.validation_interval
else:
raise ValueError("Invalid train unit!")
return func_config
def make_train_func(args):
@flow.global_function(type="train", function_config=get_train_config(args))
def get_symbol_train_job():
if args.use_synthetic_data:
(labels, images) = ofrecord_util.load_synthetic(args)
else:
labels, images = ofrecord_util.load_train_dataset(args)
image_size = images.shape[1:-1]
assert len(
image_size) == 2, "The length of image size must be equal to 2."
assert image_size[0] == image_size[1], "image_size[0] should be equal to image_size[1]."
print("train image_size: ", image_size)
embedding = eval(config.net_name).get_symbol(images)
def _get_initializer():
return flow.random_normal_initializer(mean=0.0, stddev=0.01)
trainable = True
if config.loss_name == "softmax":
if args.model_parallel:
print("Training is using model parallelism now.")
labels = labels.with_distribute(flow.distribute.broadcast())
fc1_distribute = flow.distribute.broadcast()
fc7_data_distribute = flow.distribute.split(1)
fc7_model_distribute = flow.distribute.split(0)
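                # Model parallel: features and labels are broadcast to every
                # device, the fc7 weight is sharded along the class axis
                # (split(0)), and each device computes logits for its own
                # class slice (split(1) on the data).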
else:
fc1_distribute = flow.distribute.split(0)
fc7_data_distribute = flow.distribute.split(0)
fc7_model_distribute = flow.distribute.broadcast()
fc7 = flow.layers.dense(
inputs=embedding.with_distribute(fc1_distribute),
units=config.num_classes,
activation=None,
use_bias=False,
kernel_initializer=_get_initializer(),
bias_initializer=None,
trainable=trainable,
name="fc7",
model_distribute=fc7_model_distribute,
)
fc7 = fc7.with_distribute(fc7_data_distribute)
elif config.loss_name == "margin_softmax":
if args.model_parallel:
print("Training is using model parallelism now.")
labels = labels.with_distribute(flow.distribute.broadcast())
fc1_distribute = flow.distribute.broadcast()
fc7_data_distribute = flow.distribute.split(1)
fc7_model_distribute = flow.distribute.split(0)
else:
fc1_distribute = flow.distribute.split(0)
fc7_data_distribute = flow.distribute.split(0)
fc7_model_distribute = flow.distribute.broadcast()
fc7_weight = flow.get_variable(
name="fc7-weight",
shape=(config.num_classes, embedding.shape[1]),
dtype=embedding.dtype,
initializer=_get_initializer(),
regularizer=None,
trainable=trainable,
model_name="weight",
distribute=fc7_model_distribute,
)
if args.partial_fc and args.model_parallel:
print(
"Training is using model parallelism and optimized by partial_fc now."
)
(
mapped_label,
sampled_label,
sampled_weight,
) = flow.distributed_partial_fc_sample(
weight=fc7_weight, label=labels, num_sample=args.total_num_sample,
)
labels = mapped_label
fc7_weight = sampled_weight
fc7_weight = flow.math.l2_normalize(
input=fc7_weight, axis=1, epsilon=1e-10)
fc1 = flow.math.l2_normalize(
input=embedding, axis=1, epsilon=1e-10)
fc7 = flow.matmul(
a=fc1.with_distribute(fc1_distribute), b=fc7_weight, transpose_b=True
)
fc7 = fc7.with_distribute(fc7_data_distribute)
fc7 = (
flow.combined_margin_loss(
fc7, labels, m1=config.loss_m1, m2=config.loss_m2, m3=config.loss_m3
)
* config.loss_s
)
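            # combined_margin_loss folds the SphereFace (m1), ArcFace (m2) and
            # CosFace (m3) margins into one op; loss_s then rescales the
            # normalized logits, following the insightface formulation.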
fc7 = fc7.with_distribute(fc7_data_distribute)
else:
raise NotImplementedError
loss = flow.nn.sparse_softmax_cross_entropy_with_logits(
labels, fc7, name="softmax_loss"
)
lr_scheduler = flow.optimizer.PiecewiseScalingScheduler(
base_lr=args.lr,
boundaries=args.lr_steps,
scale=args.scales,
warmup=None
)
flow.optimizer.SGDW(lr_scheduler,
momentum=args.momentum if args.momentum > 0 else None,
weight_decay=args.weight_decay
).minimize(loss)
return loss
return get_symbol_train_job
def main(args):
flow.config.gpu_device_num(args.device_num_per_node)
print("gpu num: ", args.device_num_per_node)
if not os.path.exists(args.models_root):
os.makedirs(args.models_root)
prefix = os.path.join(
args.models_root, "%s-%s-%s" % (args.network,
args.loss, args.dataset), "model"
)
prefix_dir = os.path.dirname(prefix)
print("prefix: ", prefix)
if not os.path.exists(prefix_dir):
os.makedirs(prefix_dir)
if args.num_nodes > 1:
assert args.num_nodes <= len(
args.node_ips), "The number of nodes should not be greater than length of node_ips list."
flow.env.ctrl_port(12138)
nodes = []
for ip in args.node_ips:
addr_dict = {}
addr_dict["addr"] = ip
nodes.append(addr_dict)
flow.env.machine(nodes)
if config.data_format.upper() != "NCHW" and config.data_format.upper() != "NHWC":
raise ValueError("Invalid data format")
flow.env.log_dir(args.log_dir)
train_func = make_train_func(args)
validator = Validator(args)
if os.path.exists(args.model_load_dir):
print("Loading model from {}".format(args.model_load_dir))
variables = flow.checkpoint.get(args.model_load_dir)
flow.load_variables(variables)
print("num_classes ", config.num_classes)
print("Called with argument: ", args, config)
train_metric = TrainMetric(
desc="train", calculate_batches=args.loss_print_frequency, batch_size=args.train_batch_size
)
lr = args.lr
for step in range(args.total_iter_num):
# train
train_func().async_get(train_metric.metric_cb(step))
# validation
if args.do_validation_while_train and (step + 1) % args.validation_interval == 0:
for ds in config.val_targets:
issame_list, embeddings_list = validator.do_validation(
dataset=ds)
validation_util.cal_validation_metrics(
embeddings_list, issame_list, nrof_folds=args.nrof_folds,
)
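        # The lr printed below only mirrors the schedule for logging (assuming
        # a 0.1 decay per step); the actual decay is applied by the
        # PiecewiseScalingScheduler configured in make_train_func.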
if step in args.lr_steps:
lr *= 0.1
print("lr_steps: ", step)
print("lr change to ", lr)
# snapshot
if (step + 1) % args.iter_num_in_snapshot == 0:
path = os.path.join(
prefix_dir, "snapshot_" + str(step // args.iter_num_in_snapshot))
flow.checkpoint.save(path)
if __name__ == "__main__":
args = get_train_args()
main(args)
#!/bin/bash
set -ex
WORKSPACE=~/oneflow_temp
SCRIPTS_PATH=$WORKSPACE/oneflow_face
host_num=${1:-4}
network=${2:-"r100"}
dataset=${3:-"emore"}
loss=${4:-"arcface"}
num_nodes=${5:-${host_num}}
bz_per_device=${6:-64}
train_unit=${7:-"batch"}
train_iter=${8:-150}
gpu_num_per_node=${9:-8}
precision=${10:-fp32}
model_parallel=${11:-1}
partial_fc=${12:-1}
test_times=${13:-5}
sample_ratio=${14:-0.1}
num_classes=${15:-85744}
use_synthetic_data=${16:-False}
# 2n8g
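# Patch sample_config.py and run_multi_nodes.sh in place for the 2-node run:
# set num_nodes/node_ips to two hosts and comment out the unused host entries.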
sed -i "s/num_nodes = 1/num_nodes = 2/g" $SCRIPTS_PATH/sample_config.py
sed -i "s/node_ips = \['10.11.0.2'\]/node_ips = \['10.11.0.2', '10.11.0.3'\]/g" $SCRIPTS_PATH/sample_config.py
sed -i "s/\"10.11.0.3\"/\#\"10.11.0.3\"/g" $WORKSPACE/run_multi_nodes.sh
sed -i "s/\"10.11.0.4\"/\#\"10.11.0.4\"/g" $WORKSPACE/run_multi_nodes.sh
i=1
while [ $i -le ${test_times} ]
do
bash $SCRIPTS_PATH/run_multi_nodes.sh 2 ${network} ${dataset} ${loss} 2 $bz_per_device $train_unit $train_iter ${gpu_num_per_node} $precision $model_parallel $partial_fc $i $sample_ratio $num_classes $use_synthetic_data
echo " >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>Finished Test Case ${i}!<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< "
let i++
sleep 20s
done
# 4n8g
sed -i "s/num_nodes = 2/num_nodes = 4/g" $SCRIPTS_PATH/sample_config.py
sed -i "s/node_ips = \['10.11.0.2', '10.11.0.3'\]/node_ips = \['10.11.0.2', '10.11.0.3', '10.11.0.4', '10.11.0.5'\]/g" $SCRIPTS_PATH/sample_config.py
sed -i "s/\#\"10.11.0.3\"/\"10.11.0.3\"/g" $WORKSPACE/run_multi_nodes.sh
sed -i "s/\#\"10.11.0.4\"/\"10.11.0.4\"/g" $WORKSPACE/run_multi_nodes.sh
i=1
while [ $i -le ${test_times} ]
do
    bash $SCRIPTS_PATH/run_multi_nodes.sh 4 ${network} ${dataset} ${loss} 4 $bz_per_device $train_unit $train_iter ${gpu_num_per_node} $precision $model_parallel $partial_fc $i $sample_ratio $num_classes $use_synthetic_data
echo " >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>Finished Test Case ${i}!<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< "
let i++
sleep 20s
done
#!/bin/bash
set -ex
workdir=/home/leinao/sx
host_num=${1:-4}
network=${2:-"r100"}
dataset=${3:-"emore"}
loss=${4:-"arcface"}
num_nodes=${5:-${host_num}}
bz_per_device=${6:-64}
train_unit=${7:-"batch"}
train_iter=${8:-150}
gpu_num_per_node=${9:-8}
precision=${10:-fp32}
model_parallel=${11:-1}
partial_fc=${12:-1}
test_times=${13:-5}
sample_ratio=${14:-0.1}
num_classes=${15:-85744}
use_synthetic_data=${16:-False}
port=22
scripts_path=${workdir}/oneflow_face
test_scripts=${scripts_path}/scripts
LOCAL_RUN=${scripts_path}/scripts/train_insightface.sh
##############################################
#0 prepare the host list for training
#comment unused hosts with `#`
#or use first arg to limit the hosts number
declare -a host_list=(
"10.11.0.2"
"10.11.0.3"
"10.11.0.4"
"10.11.0.5"
)
if [ -n "$1" ]
then
host_num=$1
else
host_num=${#host_list[@]}
fi
if [ ${host_num} -gt ${#host_list[@]} ]
then
host_num=${#host_list[@]}
fi
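# Train on the first ${host_num} entries of host_list.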
hosts=("${host_list[@]:0:${host_num}}")
echo "Working on hosts:${hosts[@]}"
test_case=${host_num}n${gpu_num_per_node}g_b${bz_per_device}_${network}_${dataset}_${loss}
log_file=${test_case}.log
logs_folder=logs
mkdir -p $logs_folder
echo log file: ${log_file}
##############################################
#1 prepare oneflow_temp folder on each host
for host in "${hosts[@]}"
do
ssh -p ${port} $host "mkdir -p ~/oneflow_temp"
done
##############################################
#2 copy files to each host and start work
for host in "${hosts[@]:1}"
do
echo "start training on ${host}"
ssh -p ${port} $host "rm -rf ~/oneflow_temp/*"
scp -P ${port} -r $scripts_path $LOCAL_RUN $host:~/oneflow_temp
ssh -p ${port} $host "cd ~/oneflow_temp; nohup bash train_insightface.sh ~/oneflow_temp/oneflow_face ${network} ${dataset} ${loss} ${num_nodes} $bz_per_device $train_unit $train_iter ${gpu_num_per_node} $precision $model_parallel $partial_fc $test_times $sample_ratio $num_classes 1>${log_file} 2>&1 </dev/null &"
done
#3 copy files to master host and start work
host=${hosts[0]}
echo "start training on ${host}"
ssh -p ${port} $host "rm -rf ~/oneflow_temp/*"
scp -P ${port} -r $scripts_path $LOCAL_RUN $host:~/oneflow_temp
ssh -p ${port} $host "cd ~/oneflow_temp; bash train_insightface.sh ~/oneflow_temp/oneflow_face ${network} ${dataset} ${loss} ${num_nodes} $bz_per_device $train_unit $train_iter ${gpu_num_per_node} $precision $model_parallel $partial_fc $test_times $sample_ratio $num_classes 1>${log_file}"
echo "done"
cp ~/oneflow_temp/${log_file} $logs_folder/${log_file}
sleep 3
export ONEFLOW_DEBUG_MODE=""
#export ONEFLOW_DEBUG_MODE=True
export PYTHONUNBUFFERED=1
workspace=${1:-""}
workspace=${1:-"/data/oneflow_temp/oneflow_face"}
network=${2:-"r100"}
dataset=${3:-"emore"}
loss=${4:-"arcface"}
num_nodes=${5:-4}
batch_size_per_device=${6:-64}
train_unit=${7:-"batch"}
train_iter=${8:-150}
gpu_num_per_node=${9:-8}
precision=${10:-fp32}
model_parallel=${11:-1}
partial_fc=${12:-1}
test_times=${13:-1}
sample_ratio=${14:-0.1}
num_classes=${15:-85744}
use_synthetic_data=${16:-False}
MODEL_SAVE_DIR=${num_classes}_${precision}_b${batch_size_per_device}_oneflow_model_parallel_${model_parallel}_partial_fc_${partial_fc}/${num_nodes}n${gpu_num_per_node}g
LOG_DIR=$MODEL_SAVE_DIR
if [ $gpu_num_per_node -gt 1 ]; then
    if [ $network = "r100" ]; then
        data_part_num=32
    elif [ $network = "r100_glint360k" ]; then
        data_part_num=200
    else
        echo "Please modify exact data part num in sample_config.py!"
    fi
else
    data_part_num=1
fi
sed -i "s/emore.train_data_part_num = 32/emore.train_data_part_num = $data_part_num/g" $workspace/sample_config.py
sed -i "s/emore.num_classes = 85744/emore.num_classes = $num_classes/g" $workspace/sample_config.py
PREC=""
if [ "$precision" = "fp16" ] ; then
@@ -66,7 +71,7 @@ CMD+=" --use_synthetic_data=${use_synthetic_data}"
CMD+=" --num_classes=${num_classes}"
CMD+=" --data_part_num=${data_part_num}"
CMD="python3 $CMD "
CMD="/home/leinao/anaconda3/envs/insightface/bin/python3 $CMD "
set -x
if [ -z "$LOG_FILE" ] ; then
$CMD
@@ -84,7 +84,7 @@ In this report, num_classes means the number of face categories. In the tests, it
| 1 | 4 | 64 | 923.23 | 655.56 |
| 1 | 8 | 64 | 1836.8 | 650.8 |
![ ](../imgs/data_parallel_face_emore_r100_bz64.png)
**batch_size = max**
@@ -95,7 +95,7 @@ In this report, num_classes means the number of face categories. In the tests, it
| 1 | 4 | 972.8 | 733.1 |
| 1 | 8 | 1931.76 | 749.42 |
![ ](../imgs/data_parallel_face_emore_r100_bz_max.png)
#### Model Parallelism
@@ -108,7 +108,7 @@ In this report, num_classes means the number of face categories. In the tests, it
| 1 | 8 | 64 | 1854.15 | 756.96 |
![ ](../imgs/model_parallel_face_emore_r100_bz64.png)
**batch_size = max**
@@ -118,7 +118,7 @@ In this report, num_classes means the number of face categories. In the tests, it
| 1 | 4 | 970.1 | 724.26 |
| 1 | 8 | 1921.87 | 821.06 |
![ ](../imgs/model_parallel_face_emore_r100_bz_max.png)
#### Partial FC, sample_ratio = 0.1
@@ -126,21 +126,27 @@ In this report, num_classes means the number of face categories. In the tests, it
| node_num | gpu_num_per_node | batch_size_per_device | OneFlow samples/s | MXNet samples/s |
| -------- | ---------------- | --------------------- | ----------------- | --------------- |
| 1 | 1 | 64 | 246.45 | 218.84 |
| 1 | 4 | 64 | 948.96 | 787.07 |
| 1 | 8 | 64 | 1872.81 | 1423.12 |
| 2 | 8 | 64 | 3540.09 | 2612.65 |
| 4 | 8 | 64 | 6931.6 | 5008.72 |
![ ](../imgs/partial_fc_sample_ratio_0_1_face_emore_r100_bz64.png)
**batch_size=max**
| node_num | gpu_num_per_node | OneFlow samples/s(max bsz=120) | MXNet samples/s(max bsz=104) |
| -------- | ---------------- | ------------------------------ | ---------------------------- |
| 1 | 1 | 256.61 | 229.11 |
| 1 | 4 | 990.82 | 844.37 |
| 1 | 8 | 1962.76 | 1584.89 |
| 2 | 8 | 3856.52 | 2845.97 |
| 4 | 8 | 7564.74 | 5476.51 |
![ ](../imgs/partial_fc_sample_ratio_0_1_face_emore_r100_bz_max.png)
### Glint360k & R100 & FP32 Throughputs
@@ -199,7 +205,7 @@ In this report, num_classes means the number of face categories. In the tests, it
| 1 | 4 | 64 | 945.44 | 730.29 |
| 1 | 8 | 64 | 1858.57 | 1359.2 |
![ ](../imgs/partial_fc_sample_ratio_0_1_glint_r100_bz64.png)
**batch_size=max**
@@ -209,7 +215,7 @@ In this report, num_classes means the number of face categories. In the tests, it
| 1 | 4 | 973.63 | 811.34 |
| 1 | 8 | 1933.88 | 1493.51 |
![ ](../imgs/partial_fc_sample_ratio_0_1_glint_r100_bz_max.png)
### Face Emore & Y1 & FP32 Throughputs
@@ -224,7 +230,7 @@ In this report, num_classes means the number of face categories. In the tests, it
| 1 | 4 | 256 | 7354.49 | 1055.88 |
| 1 | 8 | 256 | 14298.02 | 1031.1 |
![ ](../imgs/data_parallel_face_emore_y1_bz256.png)
**batch_size = max**
@@ -234,7 +240,7 @@ In this report, num_classes means the number of face categories. In the tests, it
| 1 | 4 | 7511.53 | 1044.38 |
| 1 | 8 | 14756.03 | 1026.68 |
![ ](../imgs/data_parallel_face_emore_y1_bz_max.png)
#### Model Parallelism
@@ -246,7 +252,7 @@ In this report, num_classes means the number of face categories. In the tests, it
| 1 | 4 | 256 | 7264.54 | 984.88 |
| 1 | 8 | 256 | 14049.75 | 1030.58 |
![ ](../imgs/model_parallel_face_emore_y1_bz256.png)
**batch_size = max**
@@ -256,7 +262,7 @@ In this report, num_classes means the number of face categories. In the tests, it
| 1 | 4 | 7363.77 | 1017.78 |
| 1 | 8 | 14436.38 | 1038.6 |
![ ](../imgs/model_parallel_face_emore_y1_bz_max.png)
### Max num_classes
@@ -268,7 +274,7 @@ In this report, num_classes means the number of face categories. In the tests, it
## Conclusion
The above series of tests shows that:
@@ -38,7 +38,7 @@
| Framework | Version | Model source |
| ------------------------------------------------------------- | ----------------- | ------------------------------------------------------------------------------------------------------------------- |
| [OneFlow](https://github.com/Oneflow-Inc/oneflow/tree/v0.3.4) | 0.3.4 | [oneflow_face](https://github.com/Oneflow-Inc/oneflow_face/tree/1705ae5b4cee6466f7abf75ba891984ec02b8ea3) |
| [deepinsight](https://github.com/deepinsight) | 2021-01-20 update | [deepinsight/insightface](https://github.com/deepinsight/insightface/tree/a9beb60971fb8115698859c35fdca721d6f75f5d) |
## Benchmark Configuration
@@ -83,7 +83,7 @@ OneFlow's implementation is strictly aligned with MXNet, mainly including:
| 1 | 4 | 64 | 923.23 | 655.56 |
| 1 | 8 | 64 | 1836.8 | 650.8 |
![ ](../imgs/data_parallel_face_emore_r100_bz64.png)
**batch_size = max**
@@ -94,7 +94,7 @@ OneFlow's implementation is strictly aligned with MXNet, mainly including:
| 1 | 4 | 972.8 | 733.1 |
| 1 | 8 | 1931.76 | 749.42 |
![ ](../imgs/data_parallel_face_emore_r100_bz_max.png)
#### Model Parallelism
@@ -107,7 +107,7 @@ OneFlow's implementation is strictly aligned with MXNet, mainly including:
| 1 | 8 | 64 | 1854.15 | 756.96 |
![ ](../imgs/model_parallel_face_emore_r100_bz64.png)
**batch_size = max**
@@ -117,7 +117,7 @@ OneFlow's implementation is strictly aligned with MXNet, mainly including:
| 1 | 4 | 970.1 | 724.26 |
| 1 | 8 | 1921.87 | 821.06 |
![ ](../imgs/model_parallel_face_emore_r100_bz_max.png)
#### Partial FC, sample_ratio = 0.1
@@ -125,21 +125,25 @@ OneFlow's implementation is strictly aligned with MXNet, mainly including:
| node_num | gpu_num_per_node | batch_size_per_device | OneFlow samples/s | MXNet samples/s |
| -------- | ---------------- | --------------------- | ----------------- | --------------- |
| 1 | 1 | 64 | 246.45 | 218.84 |
| 1 | 4 | 64 | 948.96 | 787.07 |
| 1 | 8 | 64 | 1872.81 | 1423.12 |
| 2 | 8 | 64 | 3540.09 | 2612.65 |
| 4 | 8 | 64 | 6931.6 | 5008.72 |
![ ](../imgs/partial_fc_sample_ratio_0_1_face_emore_r100_bz64.png)
**batch_size=max**
| node_num | gpu_num_per_node | OneFlow samples/s(max bsz=120) | MXNet samples/s(max bsz=104) |
| -------- | ---------------- | ------------------------------ | ---------------------------- |
| 1 | 1 | 256.61 | 229.11 |
| 1 | 4 | 990.82 | 844.37 |
| 1 | 8 | 1962.76 | 1584.89 |
| 2 | 8 | 3856.52 | 2845.97 |
| 4 | 8 | 7564.74 | 5476.51 |
![ ](../imgs/partial_fc_sample_ratio_0_1_face_emore_r100_bz_max.png)
### Glint360k & R100 & FP32 Throughputs
@@ -198,7 +202,7 @@ OneFlow's implementation is strictly aligned with MXNet, mainly including:
| 1 | 4 | 64 | 945.44 | 730.29 |
| 1 | 8 | 64 | 1858.57 | 1359.2 |
![ ](../imgs/partial_fc_sample_ratio_0_1_glint_r100_bz64.png)
**batch_size=max**
@@ -208,7 +212,7 @@ OneFlow's implementation is strictly aligned with MXNet, mainly including:
| 1 | 4 | 973.63 | 811.34 |
| 1 | 8 | 1933.88 | 1493.51 |
![ ](../imgs/partial_fc_sample_ratio_0_1_glint_r100_bz_max.png)
### Face Emore & Y1 & FP32 Throughputs
@@ -223,7 +227,7 @@ OneFlow's implementation is strictly aligned with MXNet, mainly including:
| 1 | 4 | 256 | 7354.49 | 1055.88 |
| 1 | 8 | 256 | 14298.02 | 1031.1 |
![ ](../imgs/data_parallel_face_emore_y1_bz256.png)
**batch_size = max**
@@ -233,7 +237,7 @@ OneFlow's implementation is strictly aligned with MXNet, mainly including:
| 1 | 4 | 7511.53 | 1044.38 |
| 1 | 8 | 14756.03 | 1026.68 |
![ ](../imgs/data_parallel_face_emore_y1_bz_max.png)
#### Model Parallelism
@@ -245,7 +249,7 @@ OneFlow's implementation is strictly aligned with MXNet, mainly including:
| 1 | 4 | 256 | 7264.54 | 984.88 |
| 1 | 8 | 256 | 14049.75 | 1030.58 |
![ ](../imgs/model_parallel_face_emore_y1_bz256.png)
**batch_size = max**
@@ -255,7 +259,7 @@ OneFlow's implementation is strictly aligned with MXNet, mainly including:
| 1 | 4 | 7363.77 | 1017.78 |
| 1 | 8 | 14436.38 | 1038.6 |
![ ](../imgs/model_parallel_face_emore_y1_bz_max.png)
### Max num_classes
@@ -273,4 +277,4 @@ OneFlow's implementation is strictly aligned with MXNet, mainly including:
1. As `batch_size_per_device` grows, MXNet's throughput hardly improves even with Partial FC enabled, while OneFlow maintains stable, near-linear scaling;
2. Under the same conditions, OneFlow supports a larger `batch_size` and more `num_classes`: on a single node with 8 GPUs, batch size per GPU fixed at 64, and FP16, model_parallel and partial_fc all enabled, OneFlow supports 1.125x as many `num_classes` as MXNet (13.5 million vs. 12 million).

For more details, see the OneFlow and MXNet pages of DLPerf.
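For reference, a single-node Partial FC run matching the 1n8g, batch size 64 rows above could be launched roughly as follows. This is a sketch based on the flags defined in `insightface_train.py` earlier in this diff; the dataset location and the remaining defaults are assumed to come from `sample_config.py`:

```bash
# 1 node x 8 GPUs, 64 samples per device (512 total), FP32,
# model parallelism plus Partial FC with sample_ratio 0.1.
python3 insightface_train.py \
    --dataset emore --network r100 --loss arcface \
    --num_nodes 1 --device_num_per_node 8 \
    --train_batch_size 512 \
    --model_parallel True --partial_fc True --sample_ratio 0.1 \
    --train_unit batch --train_iter 150
```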