Commit d1d483f3, authored by gineshidalgo99

Multi-scale much faster & less memory

Parent 7ae8d776
......@@ -115,20 +115,23 @@ OpenPose Library - Release Notes
## Current version (future OpenPose 1.2.0alpha)
1. Main improvements:
1. Added IP camera support.
2. Output images can have the input size: OpenPose can change the output size for each image and no longer requires a fixed size.
1. Speed increase when processing images with different aspect ratios. E.g. ~20% increase over 3.7k COCO validation images on 1 scale.
2. Huge speed increase and memory reduction when processing multi-scale. E.g. over 3.7k COCO validation images on 4 scales: ~40% (~770 to ~450 sec) speed increase, ~25% memory reduction (from ~8.9 to ~6.7 GB / GPU).
3. Slight increase in accuracy after fixing minor bugs.
4. Added IP camera support.
5. Output images can have the input size: OpenPose can change the output size for each image and no longer requires a fixed size.
1. FrameDisplayer accepts variable size images by rescaling every time a frame with bigger width or height is displayed (gui module).
2. OpOutputToCvMat & GuiInfoAdder do not need to know the output size at construction time; it is deduced from each image.
3. CvMatToOutput and Renderers allow keeping the input resolution as the output resolution for images (core module).
3. New standalone face keypoint detector based on OpenCV face detector: much faster if body keypoint detection is not required but much less accurate.
4. Face and hand keypoint detectors now can return each keypoint heatmap.
5. The flag `USE_CUDNN` is no longer required; `USE_CAFFE` and `USE_CUDA` (replacing the old `CPU_ONLY`) are no longer required to use the library, only to build it. In addition, Boost, Caffe, and its dependencies have been removed from the OpenPose header files. Only OpenCV include and lib folders are required when building a project using OpenPose.
6. OpenPose successfully compiles if the flags `USE_CAFFE` and/or `USE_CUDA` are not enabled, although it will give an error saying they are required.
7. COCO JSON file outputs 0 as score for non-detected keypoints.
8. Added example for OpenPose for user asynchronous output and cleaned all `tutorial_wrapper/` examples.
9. Added `-1` option for `net_resolution` in order to auto-select the best possible aspect ratio given the user input.
10. Net resolution can be dynamically changed (e.g. for images with different size).
11. Added example to add functionality/modules to OpenPose.
6. New standalone face keypoint detector based on OpenCV face detector: much faster if body keypoint detection is not required but much less accurate.
7. Face and hand keypoint detectors now can return each keypoint heatmap.
8. The flag `USE_CUDNN` is no longer required; `USE_CAFFE` and `USE_CUDA` (replacing the old `CPU_ONLY`) are no longer required to use the library, only to build it. In addition, Boost, Caffe, and its dependencies have been removed from the OpenPose header files. Only OpenCV include and lib folders are required when building a project using OpenPose.
9. OpenPose successfully compiles if the flags `USE_CAFFE` and/or `USE_CUDA` are not enabled, although it will give an error saying they are required.
10. COCO JSON file outputs 0 as score for non-detected keypoints.
11. Added example for OpenPose for user asynchronous output and cleaned all `tutorial_wrapper/` examples.
12. Added `-1` option for `net_resolution` in order to auto-select the best possible aspect ratio given the user input.
13. Net resolution can be dynamically changed (e.g. for images with different size).
14. Added example to add functionality/modules to OpenPose.
2. Functions or parameters renamed:
1. OpenPose able to change its size and initial size dynamically:
1. Flag `resolution` renamed as `output_resolution`.
......
# Script for internal use. We might completely change it continuously and we will not answer questions about it.
clear && clear
# USAGE EXAMPLE
# See ./examples/tests/pose_accuracy_coco_test.sh
# Parameters
IMAGE_FOLDER=/media/posefs3b/Users/gines/openpose_train/dataset/COCO/images/test2017_dev/
JSON_FOLDER=../evaluation/coco_val_jsons/
OP_BIN=./build/examples/openpose/openpose.bin
# 1 scale
$OP_BIN --image_dir $IMAGE_FOLDER --write_coco_json ${JSON_FOLDER}1_test.json --no_display --render_pose 0
# # 3 scales
# $OP_BIN --image_dir $IMAGE_FOLDER --write_coco_json ${JSON_FOLDER}1_3.json --no_display --render_pose 0 --scale_number 3 --scale_gap 0.25
# 4 scales
$OP_BIN --image_dir $IMAGE_FOLDER --write_coco_json ${JSON_FOLDER}1_4_test.json --no_display --render_pose 0 --scale_number 4 --scale_gap 0.25 --net_resolution "1312x736"
......@@ -3,7 +3,7 @@
clear && clear
# USAGE EXAMPLE
# clear && clear && make all -j24 && bash ./examples/tests/pose_accuracy_coco_test.sh
# clear && clear && make all -j`nproc` && bash ./examples/tests/pose_accuracy_coco_test.sh
# # Go back to main folder
# cd ../../
......@@ -23,14 +23,14 @@ OP_BIN=./build/examples/openpose/openpose.bin
# 1 scale
$OP_BIN --image_dir $IMAGE_FOLDER --write_coco_json ${JSON_FOLDER}1.json --no_display --render_pose 0 --frame_last 3558
# 1 scale - Debugging
# 1 scale - Debugging
# $OP_BIN --image_dir $IMAGE_FOLDER --write_coco_json ${JSON_FOLDER}1.json --no_display --frame_last 3558 --write_images ~/Desktop/CppValidation/
# # 3 scales
# $OP_BIN --image_dir $IMAGE_FOLDER --write_coco_json ${JSON_FOLDER}1_3.json --no_display --render_pose 0 --scale_number 3 --scale_gap 0.25 --frame_last 3558
# # 4 scales
# $OP_BIN --num_gpu 1 --image_dir $IMAGE_FOLDER --write_coco_json ${JSON_FOLDER}1_4.json --no_display --render_pose 0 --num_gpu 1 --scale_number 4 --scale_gap 0.25 --net_resolution "1312x736" --frame_last 3558
# $OP_BIN --num_gpu 1 --image_dir $IMAGE_FOLDER --write_coco_json ${JSON_FOLDER}1_4.json --no_display --render_pose 0 --scale_number 4 --scale_gap 0.25 --net_resolution "1312x736" --frame_last 3558
# Debugging - Rendered frames saved
# $OP_BIN --image_dir $IMAGE_FOLDER --write_images ${JSON_FOLDER}frameOutput --no_display
# Script for internal use. We might completely change it continuously and we will not answer questions about it.
clear && clear
# USAGE EXAMPLE
# See ./examples/tests/pose_accuracy_coco_test.sh
# Parameters
IMAGE_FOLDER=/home/gines/devel/images/val2014/
JSON_FOLDER=../evaluation/coco_val_jsons/
OP_BIN=./build/examples/openpose/openpose.bin
# 1 scale
$OP_BIN --image_dir $IMAGE_FOLDER --write_coco_json ${JSON_FOLDER}1.json --no_display --render_pose 0 --frame_last 3558
# 3 scales
$OP_BIN --image_dir $IMAGE_FOLDER --write_coco_json ${JSON_FOLDER}1_3.json --no_display --render_pose 0 --scale_number 3 --scale_gap 0.25 --frame_last 3558
# 4 scales
$OP_BIN --image_dir $IMAGE_FOLDER --write_coco_json ${JSON_FOLDER}1_4.json --no_display --render_pose 0 --scale_number 4 --scale_gap 0.25 --net_resolution "1312x736" --frame_last 3558
......@@ -9,8 +9,9 @@ namespace op
class OP_API CvMatToOpInput
{
public:
Array<float> createArray(const cv::Mat& cvInputData, const std::vector<double>& scaleInputToNetInputs,
const std::vector<Point<int>>& netInputSizes) const;
std::vector<Array<float>> createArray(const cv::Mat& cvInputData,
const std::vector<double>& scaleInputToNetInputs,
const std::vector<Point<int>>& netInputSizes) const;
};
}
......
......@@ -35,13 +35,14 @@ namespace op
* with the net.
* In case of >1 scales, then each scale is right- and bottom-padded to fill the greatest resolution. The
* scales are sorted from bigger to smaller.
* Size: #scales x 3 x input_net_height x input_net_width
* Vector size: #scales
* Each array size: 3 x input_net_height x input_net_width
*/
Array<float> inputNetData;
std::vector<Array<float>> inputNetData;
/**
* Rendered image in Array<float> format.
* It consists of a blending of the inputNetData and the pose/body part(s) heatmap/PAF(s).
* It consists of a blending of the cvInputData and the pose/body part(s) heatmap/PAF(s).
* If rendering is disabled (e.g. `no_render_pose` flag in the demo), then outputData will be empty.
* Size: 3 x output_net_height x output_net_width
*/
......
......@@ -6,11 +6,15 @@
namespace op
{
template <typename T>
OP_API void resizeAndMergeCpu(T* targetPtr, const T* const sourcePtr, const std::array<int, 4>& targetSize, const std::array<int, 4>& sourceSize,
OP_API void resizeAndMergeCpu(T* targetPtr, const std::vector<const T*>& sourcePtrs,
const std::array<int, 4>& targetSize,
const std::vector<std::array<int, 4>>& sourceSizes,
const std::vector<T>& scaleInputToNetInputs = {1.f});
template <typename T>
OP_API void resizeAndMergeGpu(T* targetPtr, const T* const sourcePtr, const std::array<int, 4>& targetSize, const std::array<int, 4>& sourceSize,
OP_API void resizeAndMergeGpu(T* targetPtr, const std::vector<const T*>& sourcePtrs,
const std::array<int, 4>& targetSize,
const std::vector<std::array<int, 4>>& sourceSizes,
const std::vector<T>& scaleInputToNetInputs = {1.f});
}
......
......@@ -24,7 +24,7 @@ namespace op
virtual void LayerSetUp(const std::vector<caffe::Blob<T>*>& bottom, const std::vector<caffe::Blob<T>*>& top);
virtual void Reshape(const std::vector<caffe::Blob<T>*>& bottom, const std::vector<caffe::Blob<T>*>& top,
const float netFactor, const float scaleFactor, const bool mergeFirstDimension = true);
const T netFactor, const T scaleFactor, const bool mergeFirstDimension = true);
virtual inline const char* type() const { return "ResizeAndMerge"; }
......@@ -42,7 +42,7 @@ namespace op
private:
std::vector<T> mScaleRatios;
std::array<int, 4> mBottomSize;
std::vector<std::array<int, 4>> mBottomSizes;
std::array<int, 4> mTopSize;
DELETE_COPY(ResizeAndMergeCaffe);
......
......@@ -20,7 +20,7 @@ namespace op
void initializationOnThread();
virtual void forwardPass(const Array<float>& inputNetData, const Point<int>& inputDataSize,
virtual void forwardPass(const std::vector<Array<float>>& inputNetData, const Point<int>& inputDataSize,
const std::vector<double>& scaleRatios = {1.f}) = 0;
virtual const float* getHeatMapCpuConstPtr() const = 0;
......
......@@ -19,7 +19,7 @@ namespace op
void netInitializationOnThread();
void forwardPass(const Array<float>& inputNetData, const Point<int>& inputDataSize,
void forwardPass(const std::vector<Array<float>>& inputNetData, const Point<int>& inputDataSize,
const std::vector<double>& scaleInputToNetInputs = {1.f});
const float* getHeatMapCpuConstPtr() const;
......
......@@ -289,6 +289,7 @@ namespace op
};
const std::array<float, (int)PoseModel::Size> POSE_DEFAULT_CONNECT_INTER_MIN_ABOVE_THRESHOLD{
0.95f, 0.95f, 0.95f, 0.95f, 0.95f, 0.95f
// 0.85f, 0.85f, 0.85f, 0.85f, 0.85f, 0.85f // Matlab version
};
const std::array<float, (int)PoseModel::Size> POSE_DEFAULT_CONNECT_INTER_THRESHOLD{
0.05f, 0.01f, 0.01f, 0.05f, 0.05f, 0.05f
......@@ -298,6 +299,7 @@ namespace op
};
const std::array<float, (int)PoseModel::Size> POSE_DEFAULT_CONNECT_MIN_SUBSET_SCORE{
0.4f, 0.4f, 0.4f, 0.4f, 0.4f, 0.4f
// 0.2f, 0.4f, 0.4f, 0.4f, 0.4f, 0.4f // Matlab version
};
// Rendering parameters
......
......@@ -4,9 +4,9 @@
namespace op
{
Array<float> CvMatToOpInput::createArray(const cv::Mat& cvInputData,
const std::vector<double>& scaleInputToNetInputs,
const std::vector<Point<int>>& netInputSizes) const
std::vector<Array<float>> CvMatToOpInput::createArray(const cv::Mat& cvInputData,
const std::vector<double>& scaleInputToNetInputs,
const std::vector<Point<int>>& netInputSizes) const
{
try
{
......@@ -19,22 +19,22 @@ namespace op
error("scaleInputToNetInputs.size() != netInputSizes.size().", __LINE__, __FUNCTION__, __FILE__);
// inputNetData - Rescale keeping aspect ratio and transform to float the input deep net image
const auto numberScales = (int)scaleInputToNetInputs.size();
Array<float> inputNetData{{numberScales, 3, netInputSizes.at(0).y, netInputSizes.at(0).x}};
std::vector<double> scaleRatios(numberScales, 1.f);
const auto inputNetDataOffset = inputNetData.getVolume(1, 3);
for (auto i = 0; i < numberScales; i++)
std::vector<Array<float>> inputNetData(numberScales);
for (auto i = 0u ; i < inputNetData.size() ; i++)
{
inputNetData[i].reset({1, 3, netInputSizes.at(i).y, netInputSizes.at(i).x});
std::vector<double> scaleRatios(numberScales, 1.f);
const cv::Mat frameWithNetSize = resizeFixedAspectRatio(cvInputData, scaleInputToNetInputs[i],
netInputSizes[i]);
// Fill inputNetData
uCharCvMatToFloatPtr(inputNetData.getPtr() + i * inputNetDataOffset, frameWithNetSize, true);
// Fill inputNetData[i]
uCharCvMatToFloatPtr(inputNetData[i].getPtr(), frameWithNetSize, true);
}
return inputNetData;
}
catch (const std::exception& e)
{
error(e.what(), __LINE__, __FUNCTION__, __FILE__);
return Array<float>{};
return {};
}
}
}
......@@ -157,7 +157,9 @@ namespace op
datum.name = name;
// Input image and rendered version
datum.cvInputData = cvInputData.clone();
datum.inputNetData = inputNetData.clone();
datum.inputNetData.resize(inputNetData.size());
for (auto i = 0u ; i < datum.inputNetData.size() ; i++)
datum.inputNetData[i] = inputNetData[i].clone();
datum.outputData = outputData.clone();
datum.cvOutputData = cvOutputData.clone();
// Resulting Array<float> data
......
......@@ -33,8 +33,7 @@ namespace op
mGpuId{gpuId},
mCaffeProto{caffeProto},
mCaffeTrainedModel{caffeTrainedModel},
mLastBlobName{lastBlobName},
mNetInputSize4D{0,0,0,0}
mLastBlobName{lastBlobName}
{
const std::string message{".\nPossible causes:\n\t1. Not downloading the OpenPose trained models."
"\n\t2. Not running OpenPose from the same directory where the `model`"
......@@ -160,7 +159,10 @@ namespace op
#endif
// Perform deep network forward pass
upImpl->upCaffeNet->ForwardFrom(0);
cudaCheck(__LINE__, __FUNCTION__, __FILE__);
// Cuda checks
#ifdef USE_CUDA
cudaCheck(__LINE__, __FUNCTION__, __FILE__);
#endif
#else
UNUSED(inputData);
#endif
......
......@@ -4,16 +4,18 @@
namespace op
{
template <typename T>
void resizeAndMergeCpu(T* targetPtr, const T* const sourcePtr, const std::array<int, 4>& targetSize,
const std::array<int, 4>& sourceSize, const std::vector<T>& scaleInputToNetInputs)
void resizeAndMergeCpu(T* targetPtr, const std::vector<const T*>& sourcePtrs,
const std::array<int, 4>& targetSize,
const std::vector<std::array<int, 4>>& sourceSizes,
const std::vector<T>& scaleInputToNetInputs)
{
try
{
UNUSED(targetPtr);
UNUSED(sourcePtr);
UNUSED(sourcePtrs);
UNUSED(scaleInputToNetInputs);
UNUSED(targetSize);
UNUSED(sourceSize);
UNUSED(sourceSizes);
error("CPU version not completely implemented.", __LINE__, __FUNCTION__, __FILE__);
// TODO: THIS CODE IS WORKING, BUT IT DOES NOT CONSIDER THE SCALES (I.E. SCALE NUMBER, START AND GAP)
......@@ -34,10 +36,10 @@ namespace op
// const auto sourceOffsetChannel = sourceHeight * sourceWidth;
// const auto sourceOffsetNum = sourceOffsetChannel * channel;
// const auto sourceOffset = n*sourceOffsetNum + c*sourceOffsetChannel;
// const T* const sourcePtr = bottom->cpu_data();
// const T* const sourcePtrs = bottom->cpu_data();
// for (int y = 0; y < sourceHeight; y++)
// for (int x = 0; x < sourceWidth; x++)
// source.at<T>(x,y) = sourcePtr[sourceOffset + y*sourceWidth + x];
// source.at<T>(x,y) = sourcePtrs[sourceOffset + y*sourceWidth + x];
// // spatial resize
// cv::Mat target;
......@@ -60,8 +62,12 @@ namespace op
}
}
template void resizeAndMergeCpu(float* targetPtr, const float* const sourcePtr, const std::array<int, 4>& targetSize,
const std::array<int, 4>& sourceSize, const std::vector<float>& scaleInputToNetInputs);
template void resizeAndMergeCpu(double* targetPtr, const double* const sourcePtr, const std::array<int, 4>& targetSize,
const std::array<int, 4>& sourceSize, const std::vector<double>& scaleInputToNetInputs);
template void resizeAndMergeCpu(float* targetPtr, const std::vector<const float*>& sourcePtrs,
const std::array<int, 4>& targetSize,
const std::vector<std::array<int, 4>>& sourceSizes,
const std::vector<float>& scaleInputToNetInputs);
template void resizeAndMergeCpu(double* targetPtr, const std::vector<const double*>& sourcePtrs,
const std::array<int, 4>& targetSize,
const std::vector<std::array<int, 4>>& sourceSizes,
const std::vector<double>& scaleInputToNetInputs);
}
......@@ -15,110 +15,112 @@ namespace op
if (x < targetWidth && y < targetHeight)
{
const auto scaleWidth = targetWidth / T(sourceWidth);
const auto scaleHeight = targetHeight / T(sourceHeight);
const T xSource = (x + 0.5f) / scaleWidth - 0.5f;
const T ySource = (y + 0.5f) / scaleHeight - 0.5f;
const T xSource = (x + 0.5f) * sourceWidth / T(targetWidth) - 0.5f;
const T ySource = (y + 0.5f) * sourceHeight / T(targetHeight) - 0.5f;
targetPtr[y*targetWidth+x] = bicubicInterpolate(sourcePtr, xSource, ySource, sourceWidth, sourceHeight,
sourceWidth);
}
}
template <typename T>
__global__ void resizeKernelAndMerge(T* targetPtr, const T* const sourcePtr, const int sourceNumOffset,
const int num, const T* scaleInputToNetInputs, const int sourceWidth,
const int sourceHeight, const int targetWidth, const int targetHeight)
__global__ void resizeKernelAndMerge(T* targetPtr, const T* const sourcePtr, const T scaleWidth,
const T scaleHeight, const int sourceWidth, const int sourceHeight,
const int targetWidth, const int targetHeight, const int averageCounter)
{
const auto x = (blockIdx.x * blockDim.x) + threadIdx.x;
const auto y = (blockIdx.y * blockDim.y) + threadIdx.y;
if (x < targetWidth && y < targetHeight)
{
const T xSource = (x + 0.5f) / scaleWidth - 0.5f;
const T ySource = (y + 0.5f) / scaleHeight - 0.5f;
const auto interpolated = bicubicInterpolate(sourcePtr, xSource, ySource, sourceWidth, sourceHeight,
sourceWidth);
auto& targetPixel = targetPtr[y*targetWidth+x];
targetPixel = 0.f; // For average
// targetPixel = -1000.f; // For fastMax
for (auto n = 0; n < num; n++)
{
const auto currentWidth = sourceWidth * scaleInputToNetInputs[n] / scaleInputToNetInputs[0];
const auto currentHeight = sourceHeight * scaleInputToNetInputs[n] / scaleInputToNetInputs[0];
const auto scaleWidth = targetWidth / currentWidth;
const auto scaleHeight = targetHeight / currentHeight;
const T xSource = (x + 0.5f) / scaleWidth - 0.5f;
const T ySource = (y + 0.5f) / scaleHeight - 0.5f;
const T* const sourcePtrN = sourcePtr + n * sourceNumOffset;
const auto interpolated = bicubicInterpolate(sourcePtrN, xSource, ySource, intRound(currentWidth),
intRound(currentHeight), sourceWidth);
targetPixel += interpolated;
// targetPixel = fastMax(targetPixel, interpolated);
}
targetPixel /= num;
targetPixel = ((averageCounter * targetPixel) + interpolated) / T(averageCounter + 1);
// targetPixel = fastMax(targetPixel, interpolated);
}
}
template <typename T>
void resizeAndMergeGpu(T* targetPtr, const T* const sourcePtr, const std::array<int, 4>& targetSize,
const std::array<int, 4>& sourceSize, const std::vector<T>& scaleInputToNetInputs)
void resizeAndMergeGpu(T* targetPtr, const std::vector<const T*>& sourcePtrs, const std::array<int, 4>& targetSize,
const std::vector<std::array<int, 4>>& sourceSizes,
const std::vector<T>& scaleInputToNetInputs)
{
try
{
const auto num = sourceSize[0];
const auto channels = sourceSize[1];
const auto sourceHeight = sourceSize[2];
const auto sourceWidth = sourceSize[3];
// Security checks
if (sourceSizes.empty())
error("sourceSizes cannot be empty.", __LINE__, __FUNCTION__, __FILE__);
if (sourcePtrs.size() != sourceSizes.size() || sourceSizes.size() != scaleInputToNetInputs.size())
error("Size(sourcePtrs) must match size(sourceSizes) and size(scaleInputToNetInputs). Currently: "
+ std::to_string(sourcePtrs.size()) + " vs. " + std::to_string(sourceSizes.size()) + " vs. "
+ std::to_string(scaleInputToNetInputs.size()) + ".", __LINE__, __FUNCTION__, __FILE__);
// Parameters
const auto channels = targetSize[1];
const auto targetHeight = targetSize[2];
const auto targetWidth = targetSize[3];
const dim3 threadsPerBlock{THREADS_PER_BLOCK_1D, THREADS_PER_BLOCK_1D};
const dim3 numBlocks{getNumberCudaBlocks(targetWidth, threadsPerBlock.x),
getNumberCudaBlocks(targetHeight, threadsPerBlock.y)};
const auto sourceChannelOffset = sourceHeight * sourceWidth;
const auto targetChannelOffset = targetWidth * targetHeight;
const auto& sourceSize = sourceSizes[0];
const auto sourceHeight = sourceSize[2];
const auto sourceWidth = sourceSize[3];
// No multi-scale merging
if (targetSize[0] > 1)
// No multi-scale merging or no merging required
if (sourceSizes.size() == 1)
{
for (auto n = 0; n < num; n++)
const auto num = sourceSize[0];
if (targetSize[0] > 1 || num == 1)
{
const auto offsetBase = n*channels;
for (auto c = 0 ; c < channels ; c++)
const auto sourceChannelOffset = sourceHeight * sourceWidth;
const auto targetChannelOffset = targetWidth * targetHeight;
for (auto n = 0; n < num; n++)
{
const auto offset = offsetBase + c;
resizeKernel<<<numBlocks, threadsPerBlock>>>(targetPtr + offset * targetChannelOffset,
sourcePtr + offset * sourceChannelOffset,
sourceWidth, sourceHeight, targetWidth,
targetHeight);
const auto offsetBase = n*channels;
for (auto c = 0 ; c < channels ; c++)
{
const auto offset = offsetBase + c;
resizeKernel<<<numBlocks, threadsPerBlock>>>(targetPtr + offset * targetChannelOffset,
sourcePtrs.at(0) + offset * sourceChannelOffset,
sourceWidth, sourceHeight, targetWidth,
targetHeight);
}
}
}
// Old inefficient multi-scale merging
else
error("It should never reaches this point. Notify us.", __LINE__, __FUNCTION__, __FILE__);
}
// Multi-scale merging
// Multi-scaling merging
else
{
// If scale_number > 1 --> scaleInputToNetInputs must be set
if (scaleInputToNetInputs.size() != num)
error("The scale ratios size must be equal than the number of scales.",
__LINE__, __FUNCTION__, __FILE__);
const auto maxScales = 10;
if (scaleInputToNetInputs.size() > maxScales)
error("The maximum number of scales is " + std::to_string(maxScales) + ".",
__LINE__, __FUNCTION__, __FILE__);
// Copy scaleInputToNetInputs
T* scaleInputToNetInputsPtr;
cudaMalloc((void**)&scaleInputToNetInputsPtr, maxScales * sizeof(T));
cudaMemcpy(scaleInputToNetInputsPtr, scaleInputToNetInputs.data(),
scaleInputToNetInputs.size() * sizeof(T), cudaMemcpyHostToDevice);
// Perform resize + merging
const auto sourceNumOffset = channels * sourceChannelOffset;
for (auto c = 0 ; c < channels ; c++)
resizeKernelAndMerge<<<numBlocks, threadsPerBlock>>>(targetPtr + c * targetChannelOffset,
sourcePtr + c * sourceChannelOffset,
sourceNumOffset, num,
scaleInputToNetInputsPtr, sourceWidth,
sourceHeight, targetWidth, targetHeight);
// Free memory
cudaFree(scaleInputToNetInputsPtr);
const auto targetChannelOffset = targetWidth * targetHeight;
cudaMemset(targetPtr, 0.f, channels*targetChannelOffset * sizeof(T));
auto averageCounter = -1;
const auto scaleToMainScaleWidth = targetWidth / T(sourceWidth);
const auto scaleToMainScaleHeight = targetHeight / T(sourceHeight);
for (auto i = 0u ; i < sourceSizes.size(); i++)
{
const auto& currentSize = sourceSizes.at(i);
const auto currentHeight = currentSize[2];
const auto currentWidth = currentSize[3];
const auto sourceChannelOffset = currentHeight * currentWidth;
const auto scaleInputToNet = scaleInputToNetInputs[i] / scaleInputToNetInputs[0];
const auto scaleWidth = scaleToMainScaleWidth / scaleInputToNet;
const auto scaleHeight = scaleToMainScaleHeight / scaleInputToNet;
averageCounter++;
for (auto c = 0 ; c < channels ; c++)
{
resizeKernelAndMerge<<<numBlocks, threadsPerBlock>>>(
targetPtr + c * targetChannelOffset, sourcePtrs[i] + c * sourceChannelOffset,
scaleWidth, scaleHeight, currentWidth, currentHeight, targetWidth,
targetHeight, averageCounter
);
}
}
}
cudaCheck(__LINE__, __FUNCTION__, __FILE__);
......@@ -129,10 +131,12 @@ namespace op
}
}
template void resizeAndMergeGpu(float* targetPtr, const float* const sourcePtr,
const std::array<int, 4>& targetSize, const std::array<int, 4>& sourceSize,
template void resizeAndMergeGpu(float* targetPtr, const std::vector<const float*>& sourcePtrs,
const std::array<int, 4>& targetSize,
const std::vector<std::array<int, 4>>& sourceSizes,
const std::vector<float>& scaleInputToNetInputs);
template void resizeAndMergeGpu(double* targetPtr, const double* const sourcePtr,
const std::array<int, 4>& targetSize, const std::array<int, 4>& sourceSize,
template void resizeAndMergeGpu(double* targetPtr, const std::vector<const double*>& sourcePtrs,
const std::array<int, 4>& targetSize,
const std::vector<std::array<int, 4>>& sourceSizes,
const std::vector<double>& scaleInputToNetInputs);
}
......@@ -32,9 +32,9 @@ namespace op
{
#ifdef USE_CAFFE
if (top.size() != 1)
error("top.size() != 1", __LINE__, __FUNCTION__, __FILE__);
error("top.size() != 1.", __LINE__, __FUNCTION__, __FILE__);
if (bottom.size() != 1)
error("bottom.size() != 2", __LINE__, __FUNCTION__, __FILE__);
error("bottom.size() != 1.", __LINE__, __FUNCTION__, __FILE__);
#else
UNUSED(bottom);
UNUSED(top);
......@@ -49,16 +49,21 @@ namespace op
template <typename T>
void ResizeAndMergeCaffe<T>::Reshape(const std::vector<caffe::Blob<T>*>& bottom,
const std::vector<caffe::Blob<T>*>& top,
const float netFactor,
const float scaleFactor,
const T netFactor,
const T scaleFactor,
const bool mergeFirstDimension)
{
try
{
#ifdef USE_CAFFE
// Security checks
if (top.size() != 1)
error("top.size() != 1", __LINE__, __FUNCTION__, __FILE__);
if (bottom.empty())
error("bottom cannot be empty.", __LINE__, __FUNCTION__, __FILE__);
// Data
const auto* bottomBlob = bottom.at(0);
auto* topBlob = top.at(0);
const auto* bottomBlob = bottom.at(0);
// Set top shape
auto topShape = bottomBlob->shape();
topShape[0] = (mergeFirstDimension ? 1 : bottomBlob->shape(0));
......@@ -66,18 +71,21 @@ namespace op
// E.g. 100x100 image --> 200x200 --> 0-99 to 0-199 --> scale = 199/99 (not 2!)
// E.g. 101x101 image --> 201x201 --> scale = 2
// Test: pixel 0 --> 0, pixel 99 (ex 1) --> 199, pixel 100 (ex 2) --> 200
topShape[2] = intRound((topShape[2]*netFactor - 1.f) * scaleFactor + 1);
topShape[3] = intRound((topShape[3]*netFactor - 1.f) * scaleFactor + 1);
topShape[2] = intRound((topShape[2]*netFactor - 1.f) * scaleFactor) + 1;
topShape[3] = intRound((topShape[3]*netFactor - 1.f) * scaleFactor) + 1;
topBlob->Reshape(topShape);
// Array sizes
mTopSize = std::array<int, 4>{topBlob->shape(0), topBlob->shape(1), topBlob->shape(2),
topBlob->shape(3)};
mBottomSize = std::array<int, 4>{bottomBlob->shape(0), bottomBlob->shape(1),
bottomBlob->shape(2), bottomBlob->shape(3)};
mBottomSizes.resize(bottom.size());
for (auto i = 0u ; i < mBottomSizes.size() ; i++)
mBottomSizes[i] = std::array<int, 4>{bottom[i]->shape(0), bottom[i]->shape(1),
bottom[i]->shape(2), bottom[i]->shape(3)};
#else
UNUSED(bottom);
UNUSED(top);
UNUSED(factor);
UNUSED(netFactor);
UNUSED(scaleFactor);
UNUSED(mergeFirstDimension);
#endif
}
......@@ -107,7 +115,10 @@ namespace op
try
{
#ifdef USE_CAFFE
resizeAndMergeCpu(top.at(0)->mutable_cpu_data(), bottom.at(0)->cpu_data(), mTopSize, mBottomSize,
std::vector<const T*> sourcePtrs(bottom.size());
for (auto i = 0u ; i < sourcePtrs.size() ; i++)
sourcePtrs[i] = bottom[i]->cpu_data();
resizeAndMergeCpu(top.at(0)->mutable_cpu_data(), sourcePtrs, mTopSize, mBottomSizes,
mScaleRatios);
#else
UNUSED(bottom);
......@@ -127,7 +138,10 @@ namespace op
try
{
#if defined USE_CAFFE && defined USE_CUDA
resizeAndMergeGpu(top.at(0)->mutable_gpu_data(), bottom.at(0)->gpu_data(), mTopSize, mBottomSize,
std::vector<const T*> sourcePtrs(bottom.size());
for (auto i = 0u ; i < sourcePtrs.size() ; i++)
sourcePtrs[i] = bottom[i]->gpu_data();
resizeAndMergeGpu(top.at(0)->mutable_gpu_data(), sourcePtrs, mTopSize, mBottomSizes,
mScaleRatios);
#else
UNUSED(bottom);
......
......@@ -54,9 +54,9 @@ namespace op
poseNetInputSize.x * inputResolution.y / (float) inputResolution.x / 16.f
);
}
// scaleInputToNetInputs & sizes - Rescale keeping aspect ratio
// scaleInputToNetInputs & netInputSizes - Rescale keeping aspect ratio
std::vector<double> scaleInputToNetInputs(mScaleNumber, 1.f);
std::vector<Point<int>> sizes(mScaleNumber);
std::vector<Point<int>> netInputSizes(mScaleNumber);
for (auto i = 0; i < mScaleNumber; i++)
{
const auto currentScale = 1. - i*mScaleGap;
......@@ -70,7 +70,7 @@ namespace op
poseNetInputSize.y);
const Point<int> targetSize{targetWidth, targetHeight};
scaleInputToNetInputs[i] = resizeGetScaleFactor(inputResolution, targetSize);
sizes[i] = poseNetInputSize;
netInputSizes[i] = targetSize;
}
// scaleInputToOutput - Scale between input and desired output size
Point<int> outputResolution;
......@@ -88,7 +88,7 @@ namespace op
scaleInputToOutput = 1.;
}
// Return result
return std::make_tuple(scaleInputToNetInputs, sizes, scaleInputToOutput, outputResolution);
return std::make_tuple(scaleInputToNetInputs, netInputSizes, scaleInputToOutput, outputResolution);
}
catch (const std::exception& e)
{
......
......@@ -18,23 +18,30 @@ namespace op
struct PoseExtractorCaffe::ImplPoseExtractorCaffe
{
#ifdef USE_CAFFE
std::shared_ptr<NetCaffe> spNetCaffe;
// Used when increasing spCaffeNets
const PoseModel mPoseModel;
const int mGpuId;
const std::string mModelFolder;
const bool mEnableGoogleLogging;
// General parameters
std::vector<std::shared_ptr<NetCaffe>> spCaffeNets;
std::shared_ptr<ResizeAndMergeCaffe<float>> spResizeAndMergeCaffe;
std::shared_ptr<NmsCaffe<float>> spNmsCaffe;
std::shared_ptr<BodyPartConnectorCaffe<float>> spBodyPartConnectorCaffe;
std::vector<int> mNetInputSize4D;
std::vector<std::vector<int>> mNetInput4DSizes;
std::vector<double> mScaleInputToNetInputs;
// Init with thread
boost::shared_ptr<caffe::Blob<float>> spCaffeNetOutputBlob;
std::vector<boost::shared_ptr<caffe::Blob<float>>> spCaffeNetOutputBlobs;
std::shared_ptr<caffe::Blob<float>> spHeatMapsBlob;
std::shared_ptr<caffe::Blob<float>> spPeaksBlob;
std::shared_ptr<caffe::Blob<float>> spPoseBlob;
ImplPoseExtractorCaffe(const PoseModel poseModel, const int gpuId,
const std::string& modelFolder, const bool enableGoogleLogging) :
spNetCaffe{std::make_shared<NetCaffe>(modelFolder + POSE_PROTOTXT[(int)poseModel],
modelFolder + POSE_TRAINED_MODEL[(int)poseModel], gpuId,
enableGoogleLogging)},
mPoseModel{poseModel},
mGpuId{gpuId},
mModelFolder{modelFolder},
mEnableGoogleLogging{enableGoogleLogging},
spResizeAndMergeCaffe{std::make_shared<ResizeAndMergeCaffe<float>>()},
spNmsCaffe{std::make_shared<NmsCaffe<float>>()},
spBodyPartConnectorCaffe{std::make_shared<BodyPartConnectorCaffe<float>>()}
......@@ -44,10 +51,28 @@ namespace op
};
#ifdef USE_CAFFE
std::vector<caffe::Blob<float>*> caffeNetSharedToPtr(
std::vector<boost::shared_ptr<caffe::Blob<float>>>& caffeNetOutputBlob)
{
try
{
// Prepare spCaffeNetOutputBlobs
std::vector<caffe::Blob<float>*> caffeNetOutputBlobs(caffeNetOutputBlob.size());
for (auto i = 0u ; i < caffeNetOutputBlobs.size() ; i++)
caffeNetOutputBlobs[i] = caffeNetOutputBlob[i].get();
return caffeNetOutputBlobs;
}
catch (const std::exception& e)
{
error(e.what(), __LINE__, __FUNCTION__, __FILE__);
return {};
}
}
inline void reshapePoseExtractorCaffe(std::shared_ptr<ResizeAndMergeCaffe<float>>& resizeAndMergeCaffe,
std::shared_ptr<NmsCaffe<float>>& nmsCaffe,
std::shared_ptr<BodyPartConnectorCaffe<float>>& bodyPartConnectorCaffe,
boost::shared_ptr<caffe::Blob<float>>& caffeNetOutputBlob,
std::vector<boost::shared_ptr<caffe::Blob<float>>>& caffeNetOutputBlob,
std::shared_ptr<caffe::Blob<float>>& heatMapsBlob,
std::shared_ptr<caffe::Blob<float>>& peaksBlob,
std::shared_ptr<caffe::Blob<float>>& poseBlob,
......@@ -57,14 +82,47 @@ namespace op
try
{
// HeatMaps extractor blob and layer
resizeAndMergeCaffe->Reshape({caffeNetOutputBlob.get()}, {heatMapsBlob.get()},
const auto caffeNetOutputBlobs = caffeNetSharedToPtr(caffeNetOutputBlob);
resizeAndMergeCaffe->Reshape(caffeNetOutputBlobs, {heatMapsBlob.get()},
POSE_CCN_DECREASE_FACTOR[(int)poseModel], 1.f/scaleInputToNetInput);
// Pose extractor blob and layer
nmsCaffe->Reshape({heatMapsBlob.get()}, {peaksBlob.get()}, POSE_MAX_PEAKS[(int)poseModel]);
// Pose extractor blob and layer
bodyPartConnectorCaffe->Reshape({heatMapsBlob.get(), peaksBlob.get()}, {poseBlob.get()});
// Cuda check
cudaCheck(__LINE__, __FUNCTION__, __FILE__);
#ifdef USE_CUDA
cudaCheck(__LINE__, __FUNCTION__, __FILE__);
#endif
}
catch (const std::exception& e)
{
error(e.what(), __LINE__, __FUNCTION__, __FILE__);
}
}
void addCaffeNetOnThread(std::vector<std::shared_ptr<NetCaffe>>& netCaffe,
std::vector<boost::shared_ptr<caffe::Blob<float>>>& caffeNetOutputBlob,
const PoseModel poseModel, const int gpuId,
const std::string& modelFolder, const bool enableGoogleLogging)
{
try
{
// Add Caffe Net
netCaffe.emplace_back(
std::make_shared<NetCaffe>(modelFolder + POSE_PROTOTXT[(int)poseModel],
modelFolder + POSE_TRAINED_MODEL[(int)poseModel],
gpuId, enableGoogleLogging)
);
// Initializing them on the thread
netCaffe.back()->initializationOnThread();
caffeNetOutputBlob.emplace_back(netCaffe.back()->getOutputBlob());
// Security checks
if (netCaffe.size() != caffeNetOutputBlob.size())
error("Weird error, this should not happen. Notify us.", __LINE__, __FUNCTION__, __FILE__);
// Cuda check
#ifdef USE_CUDA
cudaCheck(__LINE__, __FUNCTION__, __FILE__);
#endif
}
catch (const std::exception& e)
{
......@@ -114,14 +172,18 @@ namespace op
// Logging
log("Starting initialization on thread.", Priority::Low, __LINE__, __FUNCTION__, __FILE__);
// Initialize Caffe net
upImpl->spNetCaffe->initializationOnThread();
cudaCheck(__LINE__, __FUNCTION__, __FILE__);
addCaffeNetOnThread(upImpl->spCaffeNets, upImpl->spCaffeNetOutputBlobs, upImpl->mPoseModel,
upImpl->mGpuId, upImpl->mModelFolder, upImpl->mEnableGoogleLogging);
#ifdef USE_CUDA
cudaCheck(__LINE__, __FUNCTION__, __FILE__);
#endif
// Initialize blobs
upImpl->spCaffeNetOutputBlob = upImpl->spNetCaffe->getOutputBlob();
upImpl->spHeatMapsBlob = {std::make_shared<caffe::Blob<float>>(1,1,1,1)};
upImpl->spPeaksBlob = {std::make_shared<caffe::Blob<float>>(1,1,1,1)};
upImpl->spPoseBlob = {std::make_shared<caffe::Blob<float>>(1,1,1,1)};
cudaCheck(__LINE__, __FUNCTION__, __FILE__);
#ifdef USE_CUDA
cudaCheck(__LINE__, __FUNCTION__, __FILE__);
#endif
// Logging
log("Finished initialization on thread.", Priority::Low, __LINE__, __FUNCTION__, __FILE__);
#endif
......@@ -132,7 +194,8 @@ namespace op
}
}
void PoseExtractorCaffe::forwardPass(const Array<float>& inputNetData, const Point<int>& inputDataSize,
void PoseExtractorCaffe::forwardPass(const std::vector<Array<float>>& inputNetData,
const Point<int>& inputDataSize,
const std::vector<double>& scaleInputToNetInputs)
{
try
......@@ -141,30 +204,50 @@ namespace op
// Security checks
if (inputNetData.empty())
error("Empty inputNetData.", __LINE__, __FUNCTION__, __FILE__);
for (const auto& inputNetDataI : inputNetData)
if (inputNetDataI.empty())
error("Empty inputNetData.", __LINE__, __FUNCTION__, __FILE__);
if (inputNetData.size() != scaleInputToNetInputs.size())
error("Size(inputNetData) must be the same as size(scaleInputToNetInputs).",
__LINE__, __FUNCTION__, __FILE__);
// 1. Caffe deep network
upImpl->spNetCaffe->forwardPass(inputNetData); // ~80ms
// Resize std::vectors if required
const auto numberScales = inputNetData.size();
upImpl->mNetInput4DSizes.resize(numberScales);
while (upImpl->spCaffeNets.size() < numberScales)
addCaffeNetOnThread(upImpl->spCaffeNets, upImpl->spCaffeNetOutputBlobs, upImpl->mPoseModel,
upImpl->mGpuId, upImpl->mModelFolder, false);
// Reshape blobs if required
// Note: To resize to the input size and get the same results as Matlab, uncomment the commented lines
if (!vectorsAreEqual(upImpl->mNetInputSize4D, inputNetData.getSize()))
// || !vectorsAreEqual(upImpl->mScaleInputToNetInputs, scaleInputToNetInputs))
// Process each image
for (auto i = 0u ; i < inputNetData.size(); i++)
{
upImpl->mNetInputSize4D = inputNetData.getSize();
mNetOutputSize = Point<int>{upImpl->mNetInputSize4D[3], upImpl->mNetInputSize4D[2]};
// upImpl->mScaleInputToNetInputs = scaleInputToNetInputs;
reshapePoseExtractorCaffe(upImpl->spResizeAndMergeCaffe, upImpl->spNmsCaffe,
upImpl->spBodyPartConnectorCaffe, upImpl->spCaffeNetOutputBlob,
upImpl->spHeatMapsBlob, upImpl->spPeaksBlob, upImpl->spPoseBlob,
1.f, mPoseModel);
// scaleInputToNetInputs[0], mPoseModel);
// 1. Caffe deep network
upImpl->spCaffeNets.at(i)->forwardPass(inputNetData[i]); // ~80ms
// Reshape blobs if required
// Note: To resize to the input size and get the same results as Matlab, uncomment the commented
// lines
if (!vectorsAreEqual(upImpl->mNetInput4DSizes.at(i), inputNetData[i].getSize()))
// || !vectorsAreEqual(upImpl->mScaleInputToNetInputs, scaleInputToNetInputs))
{
upImpl->mNetInput4DSizes.at(i) = inputNetData[i].getSize();
mNetOutputSize = Point<int>{upImpl->mNetInput4DSizes[0][3],
upImpl->mNetInput4DSizes[0][2]};
// upImpl->mScaleInputToNetInputs = scaleInputToNetInputs;
reshapePoseExtractorCaffe(upImpl->spResizeAndMergeCaffe, upImpl->spNmsCaffe,
upImpl->spBodyPartConnectorCaffe, upImpl->spCaffeNetOutputBlobs,
upImpl->spHeatMapsBlob, upImpl->spPeaksBlob, upImpl->spPoseBlob,
1.f, mPoseModel);
// scaleInputToNetInputs[i], mPoseModel);
}
}
// 2. Resize heat maps + merge different scales
const auto caffeNetOutputBlobs = caffeNetSharedToPtr(upImpl->spCaffeNetOutputBlobs);
const std::vector<float> floatScaleRatios(scaleInputToNetInputs.begin(), scaleInputToNetInputs.end());
upImpl->spResizeAndMergeCaffe->setScaleRatios(floatScaleRatios);
#ifdef USE_CUDA
upImpl->spResizeAndMergeCaffe->Forward_gpu({upImpl->spCaffeNetOutputBlob.get()}, // ~5ms
upImpl->spResizeAndMergeCaffe->Forward_gpu(caffeNetOutputBlobs, // ~5ms
{upImpl->spHeatMapsBlob.get()});
cudaCheck(__LINE__, __FUNCTION__, __FILE__);
#else
......
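The hunks above replace the single net and single output blob with one net per scale: `forwardPass` grows `spCaffeNets` until it matches the number of scales, and `caffeNetSharedToPtr` turns the owned output blobs into the raw-pointer vector that `Reshape` and `Forward_gpu` receive. The following is a minimal sketch of that pattern only, not the OpenPose implementation. `Blob`, `Net`, `ensureNetPerScale`, and `sharedToPtr` are hypothetical stand-ins, and `std::shared_ptr` is used in place of `boost::shared_ptr` so the example builds without Caffe or Boost.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for caffe::Blob<float>; only pointer identity matters here.
struct Blob
{
    int channels = 0;
};

// Hypothetical stand-in for a per-scale network that owns one output blob.
struct Net
{
    std::shared_ptr<Blob> outputBlob = std::make_shared<Blob>();
};

// Grow the pool until there is one net per scale. Existing nets are reused and
// the pool never shrinks, mirroring the `while (size < numberScales)` loop.
inline void ensureNetPerScale(std::vector<std::shared_ptr<Net>>& nets,
                              std::vector<std::shared_ptr<Blob>>& outputBlobs,
                              const std::size_t numberScales)
{
    while (nets.size() < numberScales)
    {
        nets.emplace_back(std::make_shared<Net>());
        outputBlobs.emplace_back(nets.back()->outputBlob);
    }
    // Both vectors must stay index-aligned: entry i belongs to scale i.
    assert(nets.size() == outputBlobs.size());
}

// Build a non-owning view of the shared blobs, in the form a layer that takes
// raw pointers expects. The shared_ptr vector must outlive the returned view.
inline std::vector<Blob*> sharedToPtr(const std::vector<std::shared_ptr<Blob>>& blobs)
{
    std::vector<Blob*> rawBlobs(blobs.size());
    for (std::size_t i = 0; i < blobs.size(); i++)
        rawBlobs[i] = blobs[i].get();
    return rawBlobs;
}
```

Calling `ensureNetPerScale(nets, blobs, 4)` on empty vectors creates four nets; a later call with a smaller scale count leaves the pool unchanged, so nets are only constructed the first time a given number of scales is seen.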