Commit 7ce9c9e6 authored by Yun Chen

update to pytorch1.5, replace nms/roipool with torchvision functions

Parent f9e00e28
......@@ -2,19 +2,21 @@
## 1. Introduction
**I've updated the code to support both Python 2 and Python 3 with PyTorch 1.0. If you want the old version of the code, please check out branch [v0.3](https://github.com/chenyuntc/simple-faster-rcnn-pytorch/tree/0.3)**
**[Update:]** I've further simplified the code for PyTorch 1.5 and torchvision 0.6, replacing the customized roipool and nms ops with the ones from torchvision. If you want the old version of the code, please check out branch [v1.0](https://github.com/chenyuntc/simple-faster-rcnn-pytorch/tree/v1.0)
This project is a **Simplified** Faster R-CNN implementation based on [chainercv](https://github.com/chainer/chainercv) and other [projects](#acknowledgement) . It aims to:
This project is a **Simplified** Faster R-CNN implementation based on [chainercv](https://github.com/chainer/chainercv) and other [projects](#acknowledgement). I hope it can serve as starter code for those who want to know the details of Faster R-CNN. It aims to:
- Simplify the code (*Simple is better than complex*)
- Make the code more straightforward (*Flat is better than nested*)
- Match the performance reported in the [original paper](https://arxiv.org/abs/1506.01497) (*Speed Counts and mAP Matters*)
And it has the following features:
- It can be run as pure Python code, no more build affair. (cuda code moves to cupy, Cython acceleration are optional)
- It can be run as pure Python code, with no build step required.
- It's a minimal implementation in around 2000 lines of valid code, with plenty of comments and instructions (thanks to chainercv's excellent documentation)
- It achieves higher mAP than the original implementation (0.712 vs. 0.699)
- It achieve speed compariable with other implementation (6fps and 14fps for train and test in TITAN XP with cython)
- It achieves speed comparable with other implementations (6 fps for training and 14 fps for inference on a TITAN Xp)
- It's memory-efficient (about 3GB for vgg16)
......@@ -50,21 +52,29 @@ VGG16 train on `trainval` and test on `test` split.
It could be faster after removing visualization, logging, loss averaging, etc.
## 3. Install dependencies
requires PyTorch >=0.4
- install PyTorch >=0.4 with GPU (code are GPU-only), refer to [official website](http://pytorch.org)
Here is an example of creating an environment **from scratch** with `anaconda`:
```sh
# create conda env
conda create --name simp python=3.7
conda activate simp
# install pytorch
conda install pytorch torchvision cudatoolkit=10.2 -c pytorch
- install cupy via `pip install cupy-cuda80` (or `cupy-cuda90`, `cupy-cuda91`, etc., matching your CUDA version).
# install other dependencies
pip install visdom scikit-image tqdm fire ipdb pprint matplotlib torchnet
- install other dependencies: `pip install -r requirements.txt `
# start visdom
nohup python -m visdom.server &
```
- Optional, but strongly recommended: build the Cython code `nms_gpu_post`:
If you don't use anaconda, then:
```Bash
cd model/utils/nms/
python build.py build_ext --inplace
cd -
```
- install PyTorch with GPU support (the code is GPU-only); refer to the [official website](http://pytorch.org)
- install other dependencies: `pip install visdom scikit-image tqdm fire ipdb pprint matplotlib torchnet`
- start visdom for visualization
......@@ -73,6 +83,7 @@ nohup python -m visdom.server &
```
## 4. Demo
Download pretrained model from [Google Drive](https://drive.google.com/open?id=1cQ27LIn-Rig4-Uayzy_gH5-cW-NRGVzY) or [Baidu Netdisk( passwd: scxn)](https://pan.baidu.com/s/1o87RuXW)
......@@ -114,7 +125,7 @@ See [demo.ipynb](https://github.com/chenyuntc/simple-faster-rcnn-pytorch/blob/ma
4. modify `voc_data_dir` cfg item in `utils/config.py`, or pass it to program using argument like `--voc-data-dir=/path/to/VOCdevkit/VOC2007/` .
### 5.2 Prepare caffe-pretrained vgg16
### 5.2 [Optional] Prepare caffe-pretrained vgg16
If you want to use a caffe-pretrained model as the initial weights, you can run the commands below to get vgg16 weights converted from caffe, which is what the original paper uses.
......@@ -134,12 +145,9 @@ If you want to use pretrained model from torchvision, you may skip this step.
### 5.3 begin training
```bash
mkdir checkpoints/ # folder for snapshots
```
```bash
python train.py train --env='fasterrcnn-caffe' --plot-every=100 --caffe-pretrain
python train.py train --env='fasterrcnn' --plot-every=100
```
You may refer to `utils/config.py` for more arguments.
......@@ -165,16 +173,10 @@ you may open browser, visit `http://<ip>:8097` and see the visualization of trai
see [discussion](https://github.com/pytorch/pytorch/issues/973#issuecomment-346405667); it's already fixed in [train.py](https://github.com/chenyuntc/simple-faster-rcnn-pytorch/blob/master/train.py#L17-L22), so you should be free from this problem.
- Windows support
I don't have a Windows machine with a GPU to debug and test on. Pull requests adding and testing Windows support are welcome.
## More
- [ ] training on coco
- [ ] resnet
- [ ] Maybe: replace cupy with THTensor+cffi?
- [ ] Maybe: convert all numpy code to tensor?
- [x] python2-compatibility
## Acknowledgement
This work builds on many excellent works, which include:
......
......@@ -2,10 +2,10 @@ from __future__ import absolute_import
from __future__ import division
import torch as t
import numpy as np
import cupy as cp
from utils import array_tool as at
from model.utils.bbox_tools import loc2bbox
from model.utils.nms import non_maximum_suppression
from torchvision.ops import nms
# from model.utils.nms import non_maximum_suppression
from torch import nn
from data.dataset import preprocess
......@@ -171,13 +171,13 @@ class FasterRCNN(nn.Module):
mask = prob_l > self.score_thresh
cls_bbox_l = cls_bbox_l[mask]
prob_l = prob_l[mask]
keep = non_maximum_suppression(
cp.array(cls_bbox_l), self.nms_thresh, prob_l)
keep = cp.asnumpy(keep)
bbox.append(cls_bbox_l[keep])
keep = nms(cls_bbox_l, prob_l, self.nms_thresh)
bbox.append(cls_bbox_l[keep].cpu().numpy())
# The labels are in [0, self.n_class - 2].
label.append((l - 1) * np.ones((len(keep),)))
score.append(prob_l[keep])
score.append(prob_l[keep].cpu().numpy())
bbox = np.concatenate(bbox, axis=0).astype(np.float32)
label = np.concatenate(label, axis=0).astype(np.int32)
score = np.concatenate(score, axis=0).astype(np.float32)
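The per-class loop above thresholds scores, runs NMS for each foreground class, and shifts labels down by one because class 0 is the background. A minimal NumPy sketch of that bookkeeping (hypothetical data, with NMS stubbed out as keep-everything, since only the indexing is being illustrated):

```python
import numpy as np

def suppress(cls_bbox, prob, n_class, score_thresh):
    """Mimic FasterRCNN._suppress's per-class bookkeeping: threshold
    scores, "suppress" (stubbed as identity), and offset labels so
    they land in [0, n_class - 2]."""
    bbox, label, score = [], [], []
    for l in range(1, n_class):                 # skip class 0 (background)
        cls_bbox_l = cls_bbox[:, l, :]          # (R, 4) boxes for class l
        prob_l = prob[:, l]
        mask = prob_l > score_thresh
        cls_bbox_l, prob_l = cls_bbox_l[mask], prob_l[mask]
        keep = np.arange(len(prob_l))           # stand-in for nms(...)
        bbox.append(cls_bbox_l[keep])
        label.append((l - 1) * np.ones((len(keep),)))
        score.append(prob_l[keep])
    return (np.concatenate(bbox, axis=0).astype(np.float32),
            np.concatenate(label, axis=0).astype(np.int32),
            np.concatenate(score, axis=0).astype(np.float32))

# Three proposals, two classes (background + one object class).
cls_bbox = np.zeros((3, 2, 4), dtype=np.float32)
prob = np.array([[0.1, 0.9], [0.8, 0.2], [0.3, 0.6]])
bbox_out, label_out, score_out = suppress(cls_bbox, prob,
                                          n_class=2, score_thresh=0.5)
```

Only the two proposals whose class-1 probability exceeds 0.5 survive, and both get label 0.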
......@@ -254,12 +254,9 @@ class FasterRCNN(nn.Module):
cls_bbox[:, 0::2] = (cls_bbox[:, 0::2]).clamp(min=0, max=size[0])
cls_bbox[:, 1::2] = (cls_bbox[:, 1::2]).clamp(min=0, max=size[1])
prob = at.tonumpy(F.softmax(at.totensor(roi_score), dim=1))
prob = (F.softmax(at.totensor(roi_score), dim=1))
raw_cls_bbox = at.tonumpy(cls_bbox)
raw_prob = at.tonumpy(prob)
bbox, label, score = self._suppress(raw_cls_bbox, raw_prob)
bbox, label, score = self._suppress(cls_bbox, prob)
bboxes.append(bbox)
labels.append(label)
scores.append(score)
......
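The `clamp` calls in the hunk above clip interleaved coordinates: even columns `0::2` are y values bounded by the image height, odd columns `1::2` are x values bounded by the width. The same operation in NumPy, on hypothetical boxes:

```python
import numpy as np

# Boxes in (y_min, x_min, y_max, x_max) order; image size (H, W) = (100, 150).
cls_bbox = np.array([[-5.0, 10.0, 120.0, 160.0],
                     [20.0, -3.0, 80.0, 140.0]])
H, W = 100, 150
cls_bbox[:, 0::2] = np.clip(cls_bbox[:, 0::2], 0, H)  # even columns: y in [0, H]
cls_bbox[:, 1::2] = np.clip(cls_bbox[:, 1::2], 0, W)  # odd columns: x in [0, W]
# first box becomes [0, 10, 100, 150]; second becomes [20, 0, 80, 140]
```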
......@@ -2,9 +2,10 @@ from __future__ import absolute_import
import torch as t
from torch import nn
from torchvision.models import vgg16
from torchvision.ops import RoIPool
from model.region_proposal_network import RegionProposalNetwork
from model.faster_rcnn import FasterRCNN
from model.roi_module import RoIPooling2D
from utils import array_tool as at
from utils.config import opt
......@@ -112,7 +113,7 @@ class VGG16RoIHead(nn.Module):
self.n_class = n_class
self.roi_size = roi_size
self.spatial_scale = spatial_scale
self.roi = RoIPooling2D(self.roi_size, self.roi_size, self.spatial_scale)
self.roi = RoIPool((self.roi_size, self.roi_size), self.spatial_scale)
def forward(self, x, rois, roi_indices):
"""Forward the chain.
......
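torchvision's `RoIPool` takes `(output_size, spatial_scale)` rather than separate `outh`/`outw` arguments, which is the only signature change above. What the op computes, max-pooling each scaled ROI into a fixed grid, can be sketched naively in NumPy (`roi_pool_naive` is a hypothetical helper for illustration, not the torchvision implementation):

```python
import numpy as np

def roi_pool_naive(feat, roi, out_size, spatial_scale):
    """Max-pool one ROI (x1, y1, x2, y2, in image coords) from a
    (C, H, W) feature map into a (C, out_size, out_size) grid."""
    C, H, W = feat.shape
    x1, y1, x2, y2 = [int(round(v * spatial_scale)) for v in roi]
    roi_w = max(x2 - x1 + 1, 1)   # malformed ROIs forced to 1x1
    roi_h = max(y2 - y1 + 1, 1)
    out = np.zeros((C, out_size, out_size), dtype=feat.dtype)
    for ph in range(out_size):
        for pw in range(out_size):
            hs = max(y1 + int(np.floor(ph * roi_h / out_size)), 0)
            he = min(y1 + int(np.ceil((ph + 1) * roi_h / out_size)), H)
            ws = max(x1 + int(np.floor(pw * roi_w / out_size)), 0)
            we = min(x1 + int(np.ceil((pw + 1) * roi_w / out_size)), W)
            if hs < he and ws < we:
                out[:, ph, pw] = feat[:, hs:he, ws:we].max(axis=(1, 2))
    return out

# A 4x4 single-channel feature map pooled over the full image into 2x2:
feat = np.arange(16, dtype=np.float32).reshape(1, 4, 4)
pooled = roi_pool_naive(feat, (0, 0, 3, 3), out_size=2, spatial_scale=1.0)
# pooled[0] is [[5, 7], [13, 15]]: the max of each 2x2 quadrant
```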
from collections import namedtuple
from string import Template
import cupy, torch
import cupy as cp
import torch as t
from torch.autograd import Function
from model.utils.roi_cupy import kernel_backward, kernel_forward
Stream = namedtuple('Stream', ['ptr'])
@cupy.util.memoize(for_each_device=True)
def load_kernel(kernel_name, code, **kwargs):
cp.cuda.runtime.free(0)
code = Template(code).substitute(**kwargs)
kernel_code = cupy.cuda.compile_with_cache(code)
return kernel_code.get_function(kernel_name)
CUDA_NUM_THREADS = 1024
def GET_BLOCKS(N, K=CUDA_NUM_THREADS):
return (N + K - 1) // K
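`GET_BLOCKS` is integer ceiling division: it sizes the CUDA grid so that `N` elements are fully covered by blocks of `K` threads. A quick sanity check:

```python
CUDA_NUM_THREADS = 1024

def GET_BLOCKS(N, K=CUDA_NUM_THREADS):
    # ceil(N / K) using only integer arithmetic
    return (N + K - 1) // K

print(GET_BLOCKS(1024))  # 1
print(GET_BLOCKS(1025))  # 2
```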
class RoI(Function):
def __init__(self, outh, outw, spatial_scale):
self.forward_fn = load_kernel('roi_forward', kernel_forward)
self.backward_fn = load_kernel('roi_backward', kernel_backward)
self.outh, self.outw, self.spatial_scale = outh, outw, spatial_scale
def forward(self, x, rois):
# NOTE: MAKE SURE input is contiguous too
x = x.contiguous()
rois = rois.contiguous()
self.in_size = B, C, H, W = x.size()
self.N = N = rois.size(0)
output = t.zeros(N, C, self.outh, self.outw).cuda()
self.argmax_data = t.zeros(N, C, self.outh, self.outw).int().cuda()
self.rois = rois
args = [x.data_ptr(), rois.data_ptr(),
output.data_ptr(),
self.argmax_data.data_ptr(),
self.spatial_scale, C, H, W,
self.outh, self.outw,
output.numel()]
stream = Stream(ptr=torch.cuda.current_stream().cuda_stream)
self.forward_fn(args=args,
block=(CUDA_NUM_THREADS, 1, 1),
grid=(GET_BLOCKS(output.numel()), 1, 1),
stream=stream)
return output
def backward(self, grad_output):
##NOTE: IMPORTANT CONTIGUOUS
# TODO: input
grad_output = grad_output.contiguous()
B, C, H, W = self.in_size
grad_input = t.zeros(self.in_size).cuda()
stream = Stream(ptr=torch.cuda.current_stream().cuda_stream)
args = [grad_output.data_ptr(),
self.argmax_data.data_ptr(),
self.rois.data_ptr(),
grad_input.data_ptr(),
self.N, self.spatial_scale, C, H, W, self.outh, self.outw,
grad_input.numel()]
self.backward_fn(args=args,
block=(CUDA_NUM_THREADS, 1, 1),
grid=(GET_BLOCKS(grad_input.numel()), 1, 1),
stream=stream
)
return grad_input, None
class RoIPooling2D(t.nn.Module):
def __init__(self, outh, outw, spatial_scale):
super(RoIPooling2D, self).__init__()
self.RoI = RoI(outh, outw, spatial_scale)
def forward(self, x, rois):
return self.RoI(x, rois)
def test_roi_module():
# fake data
B, N, C, H, W, PH, PW = 2, 8, 4, 32, 32, 7, 7
bottom_data = t.randn(B, C, H, W).cuda()
bottom_rois = t.randn(N, 5)
bottom_rois[:int(N / 2), 0] = 0
bottom_rois[int(N / 2):, 0] = 1
bottom_rois[:, 1:] = (t.rand(N, 4) * 100).float()
bottom_rois = bottom_rois.cuda()
spatial_scale = 1. / 16
outh, outw = PH, PW
# pytorch version
module = RoIPooling2D(outh, outw, spatial_scale)
x = bottom_data.requires_grad_()
rois = bottom_rois.detach()
output = module(x, rois)
output.sum().backward()
def t2c(variable):
npa = variable.data.cpu().numpy()
return cp.array(npa)
def test_eq(variable, array, info):
cc = cp.asnumpy(array)
neq = (cc != variable.data.cpu().numpy())
assert neq.sum() == 0, 'test failed: %s' % info
# chainer version; if you're going to run this,
# pip install chainer
import chainer.functions as F
from chainer import Variable
x_cn = Variable(t2c(x))
o_cn = F.roi_pooling_2d(x_cn, t2c(rois), outh, outw, spatial_scale)
test_eq(output, o_cn.array, 'forward')
F.sum(o_cn).backward()
test_eq(x.grad, x_cn.grad, 'backward')
print('test pass')
import numpy as np
import cupy as cp
import torch
from torchvision.ops import nms
from model.utils.bbox_tools import bbox2loc, bbox_iou, loc2bbox
from model.utils.nms import non_maximum_suppression
class ProposalTargetCreator(object):
......@@ -415,16 +414,18 @@ class ProposalCreator:
if n_pre_nms > 0:
order = order[:n_pre_nms]
roi = roi[order, :]
score = score[order]
# Apply nms (e.g. threshold = 0.7).
# Take after_nms_topN (e.g. 300).
keep = non_maximum_suppression(
cp.ascontiguousarray(cp.asarray(roi)),
thresh=self.nms_thresh)
keep = nms(
torch.from_numpy(roi).cuda(),
torch.from_numpy(score).cuda(),
self.nms_thresh)
if n_post_nms > 0:
keep = keep[:n_post_nms]
roi = roi[keep]
roi = roi[keep.cpu().numpy()]
return roi
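The hunk above keeps the top `n_pre_nms` proposals by score before NMS and the top `n_post_nms` afterwards. The ordering logic in NumPy, with hypothetical scores and NMS itself elided:

```python
import numpy as np

score = np.array([0.2, 0.9, 0.5, 0.7])
roi = np.arange(16, dtype=np.float32).reshape(4, 4)   # four dummy boxes
n_pre_nms, n_post_nms = 3, 2

order = score.argsort()[::-1][:n_pre_nms]   # descending-score indices: [1, 3, 2]
roi, score = roi[order], score[order]
# ... NMS would run here; suppose it keeps every proposal ...
keep = np.arange(len(roi))[:n_post_nms]
roi = roi[keep]                              # the two highest-scoring proposals
```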
from model.utils.nms.non_maximum_suppression import non_maximum_suppression
\ No newline at end of file
cimport numpy as np
from libc.stdint cimport uint64_t
import numpy as np
def _nms_gpu_post(np.ndarray[np.uint64_t, ndim=1] mask,
int n_bbox,
int threads_per_block,
int col_blocks
):
cdef:
int i, j, nblock, index
uint64_t inblock
int n_selection = 0
uint64_t one_ull = 1
np.ndarray[np.int32_t, ndim=1] selection
np.ndarray[np.uint64_t, ndim=1] remv
selection = np.zeros((n_bbox,), dtype=np.int32)
remv = np.zeros((col_blocks,), dtype=np.uint64)
for i in range(n_bbox):
nblock = i // threads_per_block
inblock = i % threads_per_block
if not (remv[nblock] & one_ull << inblock):
selection[n_selection] = i
n_selection += 1
index = i * col_blocks
for j in range(nblock, col_blocks):
remv[j] |= mask[index + j]
return selection, n_selection
import numpy as np
def _nms_gpu_post( mask,
n_bbox,
threads_per_block,
col_blocks
):
n_selection = 0
one_ull = np.array([1],dtype=np.uint64)
selection = np.zeros((n_bbox,), dtype=np.int32)
remv = np.zeros((col_blocks,), dtype=np.uint64)
for i in range(n_bbox):
nblock = i // threads_per_block
inblock = i % threads_per_block
if not (remv[nblock] & one_ull << inblock):
selection[n_selection] = i
n_selection += 1
index = i * col_blocks
for j in range(nblock, col_blocks):
remv[j] |= mask[index + j]
return selection, n_selection
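`_nms_gpu_post` decodes the suppression bitmask written by the CUDA kernel: bit `j` of `mask[i * col_blocks + b]` says box `i` suppresses box `b * threads_per_block + j`. A tiny hand-built mask (hypothetical: three boxes, box 0 suppresses box 1) shows the scan:

```python
import numpy as np

def _nms_gpu_post(mask, n_bbox, threads_per_block, col_blocks):
    # Pure-Python fallback, behaviorally identical to the Cython version.
    n_selection = 0
    one_ull = np.array([1], dtype=np.uint64)
    selection = np.zeros((n_bbox,), dtype=np.int32)
    remv = np.zeros((col_blocks,), dtype=np.uint64)
    for i in range(n_bbox):
        nblock, inblock = i // threads_per_block, i % threads_per_block
        if not (remv[nblock] & (one_ull << inblock)):
            selection[n_selection] = i          # box i survives
            n_selection += 1
            for j in range(nblock, col_blocks): # mark boxes i suppresses
                remv[j] |= mask[i * col_blocks + j]
    return selection, n_selection

# Bit 1 of mask[0] is set: box 0 suppresses box 1; boxes 1, 2 suppress nothing.
mask = np.array([0b010, 0, 0], dtype=np.uint64)
sel, n = _nms_gpu_post(mask, n_bbox=3, threads_per_block=64, col_blocks=1)
print(sel[:n])  # [0 2]
```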
from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext
import numpy
ext_modules = [Extension("_nms_gpu_post", ["_nms_gpu_post.pyx"],
include_dirs=[numpy.get_include()])]
setup(
name="nms pyx",
cmdclass={'build_ext': build_ext},
ext_modules=ext_modules
)
from __future__ import division
import numpy as np
import cupy as cp
import torch as t
try:
from ._nms_gpu_post import _nms_gpu_post
except ImportError:
import warnings
warnings.warn('''
the Python code for non_maximum_suppression is about 2x slower.
It is strongly recommended to build the Cython code:
`cd model/utils/nms/; python3 build.py build_ext --inplace`''')
from ._nms_gpu_post_py import _nms_gpu_post
@cp.util.memoize(for_each_device=True)
def _load_kernel(kernel_name, code, options=()):
cp.cuda.runtime.free(0)
assert isinstance(options, tuple)
kernel_code = cp.cuda.compile_with_cache(code, options=options)
return kernel_code.get_function(kernel_name)
def non_maximum_suppression(bbox, thresh, score=None,
limit=None):
"""Suppress bounding boxes according to their IoUs.
This method checks each bounding box sequentially and selects the bounding
box if the Intersection over Unions (IoUs) between the bounding box and the
previously selected bounding boxes are less than :obj:`thresh`. This method
is mainly used as postprocessing of object detection.
The bounding boxes are selected from ones with higher scores.
If :obj:`score` is not provided as an argument, the bounding box
is ordered by its index in ascending order.
The bounding boxes are expected to be packed into a two dimensional
tensor of shape :math:`(R, 4)`, where :math:`R` is the number of
bounding boxes in the image. The second axis represents attributes of
the bounding box. They are :math:`(y_{min}, x_{min}, y_{max}, x_{max})`,
where the four attributes are coordinates of the top left and the
bottom right vertices.
:obj:`score` is a float array of shape :math:`(R,)`. Each score indicates
confidence of prediction.
This function accepts both :obj:`numpy.ndarray` and :obj:`cupy.ndarray` as
an input. Please note that both :obj:`bbox` and :obj:`score` need to be
the same type.
The type of the output is the same as the input.
Args:
bbox (array): Bounding boxes to be transformed. The shape is
:math:`(R, 4)`. :math:`R` is the number of bounding boxes.
thresh (float): Threshold of IoUs.
score (array): An array of confidences whose shape is :math:`(R,)`.
limit (int): The upper bound of the number of the output bounding
boxes. If it is not specified, this method selects as many
bounding boxes as possible.
Returns:
array:
An array with indices of bounding boxes that are selected. \
They are sorted by the scores of bounding boxes in descending \
order. \
The shape of this array is :math:`(K,)` and its dtype is\
:obj:`numpy.int32`. Note that :math:`K \\leq R`.
"""
return _non_maximum_suppression_gpu(bbox, thresh, score, limit)
def _non_maximum_suppression_gpu(bbox, thresh, score=None, limit=None):
if len(bbox) == 0:
return cp.zeros((0,), dtype=np.int32)
n_bbox = bbox.shape[0]
if score is not None:
order = score.argsort()[::-1].astype(np.int32)
else:
order = cp.arange(n_bbox, dtype=np.int32)
sorted_bbox = bbox[order, :]
selec, n_selec = _call_nms_kernel(
sorted_bbox, thresh)
selec = selec[:n_selec]
selec = order[selec]
if limit is not None:
selec = selec[:limit]
return cp.asnumpy(selec)
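For reference, the greedy suppression the docstring describes (and the CUDA kernel below implements block-wise) can be written in a few lines of NumPy. This is a CPU sketch for clarity, not the code path used here, and `nms_cpu` is a hypothetical name:

```python
import numpy as np

def nms_cpu(bbox, thresh, score=None):
    """Greedy NMS over (y_min, x_min, y_max, x_max) boxes.
    Returns indices of kept boxes, highest score first."""
    if score is not None:
        order = score.argsort()[::-1]
    else:
        order = np.arange(len(bbox))
    keep = []
    while len(order) > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # Intersection of box i with every remaining box.
        tl = np.maximum(bbox[i, :2], bbox[rest, :2])
        br = np.minimum(bbox[i, 2:], bbox[rest, 2:])
        inter = np.prod(np.maximum(br - tl, 0), axis=1)
        area_i = np.prod(bbox[i, 2:] - bbox[i, :2])
        area_r = np.prod(bbox[rest, 2:] - bbox[rest, :2], axis=1)
        iou = inter / (area_i + area_r - inter)
        order = rest[iou < thresh]      # drop boxes too similar to box i
    return np.array(keep, dtype=np.int32)

boxes = np.array([[0., 0., 10., 10.],    # duplicate of the next box
                  [0., 0., 10., 10.],
                  [20., 20., 30., 30.]])
scores = np.array([0.9, 0.8, 0.7])
kept = nms_cpu(boxes, thresh=0.5, score=scores)
# kept is [0, 2]: the lower-scored duplicate (index 1) is suppressed
```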
_nms_gpu_code = '''
#define DIVUP(m,n) ((m) / (n) + ((m) % (n) > 0))
int const threadsPerBlock = sizeof(unsigned long long) * 8;
__device__
inline float devIoU(float const *const bbox_a, float const *const bbox_b) {
float top = max(bbox_a[0], bbox_b[0]);
float bottom = min(bbox_a[2], bbox_b[2]);
float left = max(bbox_a[1], bbox_b[1]);
float right = min(bbox_a[3], bbox_b[3]);
float height = max(bottom - top, 0.f);
float width = max(right - left, 0.f);
float area_i = height * width;
float area_a = (bbox_a[2] - bbox_a[0]) * (bbox_a[3] - bbox_a[1]);
float area_b = (bbox_b[2] - bbox_b[0]) * (bbox_b[3] - bbox_b[1]);
return area_i / (area_a + area_b - area_i);
}
extern "C"
__global__
void nms_kernel(const int n_bbox, const float thresh,
const float *dev_bbox,
unsigned long long *dev_mask) {
const int row_start = blockIdx.y;
const int col_start = blockIdx.x;
const int row_size =
min(n_bbox - row_start * threadsPerBlock, threadsPerBlock);
const int col_size =
min(n_bbox - col_start * threadsPerBlock, threadsPerBlock);
__shared__ float block_bbox[threadsPerBlock * 4];
if (threadIdx.x < col_size) {
block_bbox[threadIdx.x * 4 + 0] =
dev_bbox[(threadsPerBlock * col_start + threadIdx.x) * 4 + 0];
block_bbox[threadIdx.x * 4 + 1] =
dev_bbox[(threadsPerBlock * col_start + threadIdx.x) * 4 + 1];
block_bbox[threadIdx.x * 4 + 2] =
dev_bbox[(threadsPerBlock * col_start + threadIdx.x) * 4 + 2];
block_bbox[threadIdx.x * 4 + 3] =
dev_bbox[(threadsPerBlock * col_start + threadIdx.x) * 4 + 3];
}
__syncthreads();
if (threadIdx.x < row_size) {
const int cur_box_idx = threadsPerBlock * row_start + threadIdx.x;
const float *cur_box = dev_bbox + cur_box_idx * 4;
int i = 0;
unsigned long long t = 0;
int start = 0;
if (row_start == col_start) {
start = threadIdx.x + 1;
}
for (i = start; i < col_size; i++) {
if (devIoU(cur_box, block_bbox + i * 4) >= thresh) {
t |= 1ULL << i;
}
}
const int col_blocks = DIVUP(n_bbox, threadsPerBlock);
dev_mask[cur_box_idx * col_blocks + col_start] = t;
}
}
'''
def _call_nms_kernel(bbox, thresh):
# PyTorch does not support unsigned long Tensor.
# Doesn't matter, since it returns an ndarray in the end.
# So I'll keep it unmodified.
n_bbox = bbox.shape[0]
threads_per_block = 64
col_blocks = np.ceil(n_bbox / threads_per_block).astype(np.int32)
blocks = (col_blocks, col_blocks, 1)
threads = (threads_per_block, 1, 1)
mask_dev = cp.zeros((n_bbox * col_blocks,), dtype=np.uint64)
bbox = cp.ascontiguousarray(bbox, dtype=np.float32)
kern = _load_kernel('nms_kernel', _nms_gpu_code)
kern(blocks, threads, args=(cp.int32(n_bbox), cp.float32(thresh),
bbox, mask_dev))
mask_host = mask_dev.get()
selection, n_selec = _nms_gpu_post(
mask_host, n_bbox, threads_per_block, col_blocks)
return selection, n_selec
kernel_forward = '''
extern "C"
__global__ void roi_forward(const float* const bottom_data,const float* const bottom_rois,
float* top_data, int* argmax_data,
const double spatial_scale,const int channels,const int height,
const int width, const int pooled_height,
const int pooled_width,const int NN
){
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if(idx>=NN)
return;
const int pw = idx % pooled_width;
const int ph = (idx / pooled_width) % pooled_height;
const int c = (idx / pooled_width / pooled_height) % channels;
int num = idx / pooled_width / pooled_height / channels;
const int roi_batch_ind = bottom_rois[num * 5 + 0];
const int roi_start_w = round(bottom_rois[num * 5 + 1] * spatial_scale);
const int roi_start_h = round(bottom_rois[num * 5 + 2] * spatial_scale);
const int roi_end_w = round(bottom_rois[num * 5 + 3] * spatial_scale);
const int roi_end_h = round(bottom_rois[num * 5 + 4] * spatial_scale);
// Force malformed ROIs to be 1x1
const int roi_width = max(roi_end_w - roi_start_w + 1, 1);
const int roi_height = max(roi_end_h - roi_start_h + 1, 1);
const float bin_size_h = static_cast<float>(roi_height)
/ static_cast<float>(pooled_height);
const float bin_size_w = static_cast<float>(roi_width)
/ static_cast<float>(pooled_width);
int hstart = static_cast<int>(floor(static_cast<float>(ph)
* bin_size_h));
int wstart = static_cast<int>(floor(static_cast<float>(pw)
* bin_size_w));
int hend = static_cast<int>(ceil(static_cast<float>(ph + 1)