Multi-scale much faster & less memory

d1d483f3 · gineshidalgo99 · 7ae8d776
19 changed files
--- a/doc/release_notes.md
+++ b/doc/release_notes.md
@@ -115,20 +115,23 @@ OpenPose Library - Release Notes

 ## Current version (future OpenPose 1.2.0alpha)
 1. Main improvements:
-    1. Added IP camera support.
-    2. Output images can have the input size, OpenPose able to change its size for each image and not required fixed size anymore.
+    1. Speed increase when processing images with different aspect ratios. E.g. ~20% increase over 3.7k COCO validation images on 1 scale.
+    2. Huge speed increase and memory reduction when processing multi-scale. E.g. over 3.7k COCO validation images on 4 scales: ~40% (~770 to ~450 sec) speed increase, ~25% memory reduction (from ~8.9 to ~6.7 GB / GPU).
+    3. Slight accuracy increase thanks to minor bug fixes.
+    4. Added IP camera support.
+    5. Output images can have the input size, OpenPose able to change its size for each image and not required fixed size anymore.
        1. FrameDisplayer accepts variable size images by rescaling every time a frame with bigger width or height is displayed (gui module).
        2. OpOutputToCvMat & GuiInfoAdder does not require to know the output size at construction time, deduced from each image.
        3. CvMatToOutput and Renderers allow to keep input resolution as output for images (core module).
-    3. New standalone face keypoint detector based on OpenCV face detector: much faster if body keypoint detection is not required but much less accurate.
-    4. Face and hand keypoint detectors now can return each keypoint heatmap.
-    5. The flag `USE_CUDNN` is no longer required; `USE_CAFFE` and `USE_CUDA` (replacing the old `CPU_ONLY`) are no longer required to use the library, only to build it. In addition, Boost, Caffe, and its dependencies have been removed from the OpenPose header files. Only OpenCV include and lib folders are required when building a project using OpenPose.
-    6. OpenPose successfully compiles if the flags `USE_CAFFE` and/or `USE_CUDA` are not enabled, although it will give an error saying they are required.
-    7. COCO JSON file outputs 0 as score for non-detected keypoints.
-    8. Added example for OpenPose for user asynchronous output and cleaned all `tutorial_wrapper/` examples.
-    9. Added `-1` option for `net_resolution` in order to auto-select the best possible aspect ratio given the user input.
-    10. Net resolution can be dynamically changed (e.g. for images with different size).
-    11. Added example to add functionality/modules to OpenPose.
+    6. New standalone face keypoint detector based on OpenCV face detector: much faster if body keypoint detection is not required but much less accurate.
+    7. Face and hand keypoint detectors now can return each keypoint heatmap.
+    8. The flag `USE_CUDNN` is no longer required; `USE_CAFFE` and `USE_CUDA` (replacing the old `CPU_ONLY`) are no longer required to use the library, only to build it. In addition, Boost, Caffe, and its dependencies have been removed from the OpenPose header files. Only OpenCV include and lib folders are required when building a project using OpenPose.
+    9. OpenPose successfully compiles if the flags `USE_CAFFE` and/or `USE_CUDA` are not enabled, although it will give an error saying they are required.
+    10. COCO JSON file outputs 0 as score for non-detected keypoints.
+    11. Added example for OpenPose for user asynchronous output and cleaned all `tutorial_wrapper/` examples.
+    12. Added `-1` option for `net_resolution` in order to auto-select the best possible aspect ratio given the user input.
+    13. Net resolution can be dynamically changed (e.g. for images with different size).
+    14. Added example to add functionality/modules to OpenPose.
 2. Functions or parameters renamed:
    1. OpenPose able to change its size and initial size dynamically:
        1. Flag `resolution` renamed as `output_resolution`.

--- a/examples/tests/pose_accuracy_coco_test_dev.sh
+++ b/examples/tests/pose_accuracy_coco_test_dev.sh
+# Script for internal use. It may change substantially at any time, and we will not answer questions about it.
+
+clear && clear
+
+# USAGE EXAMPLE
+# See ./examples/tests/pose_accuracy_coco_test.sh
+
+# Parameters
+IMAGE_FOLDER=/media/posefs3b/Users/gines/openpose_train/dataset/COCO/images/test2017_dev/
+JSON_FOLDER=../evaluation/coco_val_jsons/
+OP_BIN=./build/examples/openpose/openpose.bin
+
+    # 1 scale
+$OP_BIN --image_dir $IMAGE_FOLDER --write_coco_json ${JSON_FOLDER}1_test.json --no_display --render_pose 0
+
+#     # 3 scales
+# $OP_BIN --image_dir $IMAGE_FOLDER --write_coco_json ${JSON_FOLDER}1_3.json --no_display --render_pose 0 --scale_number 3 --scale_gap 0.25
+
+    # 4 scales
+$OP_BIN --image_dir $IMAGE_FOLDER --write_coco_json ${JSON_FOLDER}1_4_test.json --no_display --render_pose 0 --scale_number 4 --scale_gap 0.25 --net_resolution "1312x736"
--- a/examples/tests/pose_accuracy_coco_test.sh
+++ b/examples/tests/pose_accuracy_coco_test.sh
@@ -3,7 +3,7 @@
 clear && clear

 # USAGE EXAMPLE
-# clear && clear && make all -j24 && bash ./examples/tests/pose_accuracy_coco_test.sh
+# clear && clear && make all -j`nproc` && bash ./examples/tests/pose_accuracy_coco_test.sh

 # # Go back to main folder
 # cd ../../
@@ -23,14 +23,14 @@ OP_BIN=./build/examples/openpose/openpose.bin
    # 1 scale
 $OP_BIN --image_dir $IMAGE_FOLDER --write_coco_json ${JSON_FOLDER}1.json --no_display --render_pose 0 --frame_last 3558

-	# 1 scale - Debugging
+    # 1 scale - Debugging
 # $OP_BIN --image_dir $IMAGE_FOLDER --write_coco_json ${JSON_FOLDER}1.json --no_display --frame_last 3558 --write_images ~/Desktop/CppValidation/

 #     # 3 scales
 # $OP_BIN --image_dir $IMAGE_FOLDER --write_coco_json ${JSON_FOLDER}1_3.json --no_display --render_pose 0 --scale_number 3 --scale_gap 0.25 --frame_last 3558

 #     # 4 scales
-# $OP_BIN --num_gpu 1 --image_dir $IMAGE_FOLDER --write_coco_json ${JSON_FOLDER}1_4.json --no_display --render_pose 0 --num_gpu 1 --scale_number 4 --scale_gap 0.25 --net_resolution "1312x736" --frame_last 3558
+# $OP_BIN --num_gpu 1 --image_dir $IMAGE_FOLDER --write_coco_json ${JSON_FOLDER}1_4.json --no_display --render_pose 0 --scale_number 4 --scale_gap 0.25 --net_resolution "1312x736" --frame_last 3558

 # Debugging - Rendered frames saved
 # $OP_BIN --image_dir $IMAGE_FOLDER --write_images ${JSON_FOLDER}frameOutput --no_display
--- a/examples/tests/pose_accuracy_coco_val_server.sh
+++ b/examples/tests/pose_accuracy_coco_val_server.sh
+# Script for internal use. It may change substantially at any time, and we will not answer questions about it.
+
+clear && clear
+
+# USAGE EXAMPLE
+# See ./examples/tests/pose_accuracy_coco_test.sh
+
+# Parameters
+IMAGE_FOLDER=/home/gines/devel/images/val2014/
+JSON_FOLDER=../evaluation/coco_val_jsons/
+OP_BIN=./build/examples/openpose/openpose.bin
+
+    # 1 scale
+$OP_BIN --image_dir $IMAGE_FOLDER --write_coco_json ${JSON_FOLDER}1.json --no_display --render_pose 0 --frame_last 3558
+
+    # 3 scales
+$OP_BIN --image_dir $IMAGE_FOLDER --write_coco_json ${JSON_FOLDER}1_3.json --no_display --render_pose 0 --scale_number 3 --scale_gap 0.25 --frame_last 3558
+
+    # 4 scales
+$OP_BIN --image_dir $IMAGE_FOLDER --write_coco_json ${JSON_FOLDER}1_4.json --no_display --render_pose 0 --scale_number 4 --scale_gap 0.25 --net_resolution "1312x736" --frame_last 3558
--- a/include/openpose/core/cvMatToOpInput.hpp
+++ b/include/openpose/core/cvMatToOpInput.hpp
@@ -9,8 +9,9 @@ namespace op
    class OP_API CvMatToOpInput
    {
    public:
-        Array<float> createArray(const cv::Mat& cvInputData, const std::vector<double>& scaleInputToNetInputs,
-                                 const std::vector<Point<int>>& netInputSizes) const;
+        std::vector<Array<float>> createArray(const cv::Mat& cvInputData,
+                                              const std::vector<double>& scaleInputToNetInputs,
+                                              const std::vector<Point<int>>& netInputSizes) const;
    };
 }


--- a/include/openpose/core/datum.hpp
+++ b/include/openpose/core/datum.hpp
@@ -35,13 +35,14 @@ namespace op
         * with the net.
         * In case of >1 scales, then each scale is right- and bottom-padded to fill the greatest resolution. The
         * scales are sorted from bigger to smaller.
-         * Size: #scales x 3 x input_net_height x input_net_width
+         * Vector size: #scales
+         * Each array size: 3 x input_net_height x input_net_width
         */
-        Array<float> inputNetData;
+        std::vector<Array<float>> inputNetData;

        /**
         * Rendered image in Array<float> format.
-         * It consists of a blending of the inputNetData and the pose/body part(s) heatmap/PAF(s).
+         * It consists of a blending of the cvInputData and the pose/body part(s) heatmap/PAF(s).
         * If rendering is disabled (e.g. `no_render_pose` flag in the demo), then outputData will be empty.
         * Size: 3 x output_net_height x output_net_width
         */
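
The switch from one padded `Array<float>` to one array per scale is where part of the memory saving in the input buffers comes from. A rough sketch of the float counts for both layouts follows. `NetSize`, `paddedFloatCount` and `perScaleFloatCount` are hypothetical names used only for this illustration, and the per-scale sizes in the usage note are assumed values, not necessarily the sizes OpenPose computes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a per-scale net input size (not an OpenPose type).
struct NetSize
{
    int width;
    int height;
};

// Floats needed when every scale is padded to the largest resolution and stored
// in a single #scales x 3 x height x width block (layout before this change).
std::size_t paddedFloatCount(const std::vector<NetSize>& sizes)
{
    if (sizes.empty())
        return 0;
    // Scales are sorted from bigger to smaller, so sizes[0] is the largest one.
    const auto largest = static_cast<std::size_t>(sizes[0].height)
                       * static_cast<std::size_t>(sizes[0].width);
    return sizes.size() * 3u * largest;
}

// Floats needed when each scale owns a 3 x height x width array of its own size
// (layout after this change).
std::size_t perScaleFloatCount(const std::vector<NetSize>& sizes)
{
    std::size_t total = 0;
    for (const auto& size : sizes)
        total += 3u * static_cast<std::size_t>(size.height)
                    * static_cast<std::size_t>(size.width);
    return total;
}
```

For example, calling both functions with the assumed sizes `{{1312, 736}, {984, 552}, {656, 368}, {328, 184}}` (4 scales with a 0.25 gap) shows the per-scale layout needing fewer than half the floats of the padded one. This only covers the input buffers; the overall GPU memory figure quoted in the release notes depends on the whole pipeline.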

--- a/include/openpose/core/resizeAndMergeBase.hpp
+++ b/include/openpose/core/resizeAndMergeBase.hpp
@@ -6,11 +6,15 @@
 namespace op
 {
    template <typename T>
-    OP_API void resizeAndMergeCpu(T* targetPtr, const T* const sourcePtr, const std::array<int, 4>& targetSize, const std::array<int, 4>& sourceSize,
+    OP_API void resizeAndMergeCpu(T* targetPtr, const std::vector<const T*>& sourcePtrs,
+                                  const std::array<int, 4>& targetSize,
+                                  const std::vector<std::array<int, 4>>& sourceSizes,
                                  const std::vector<T>& scaleInputToNetInputs = {1.f});

    template <typename T>
-    OP_API void resizeAndMergeGpu(T* targetPtr, const T* const sourcePtr, const std::array<int, 4>& targetSize, const std::array<int, 4>& sourceSize,
+    OP_API void resizeAndMergeGpu(T* targetPtr, const std::vector<const T*>& sourcePtrs,
+                                  const std::array<int, 4>& targetSize,
+                                  const std::vector<std::array<int, 4>>& sourceSizes,
                                  const std::vector<T>& scaleInputToNetInputs = {1.f});
 }
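
The resize kernel behind `resizeAndMergeGpu` samples the source at pixel centers. A minimal host-side sketch of that coordinate mapping, mirroring the expression used in the CUDA resize kernel of this commit, could look as follows. `targetToSource` is a hypothetical helper written for this illustration and is not part of the OpenPose API.

```cpp
#include <cassert>

// Maps a target pixel index to a (possibly fractional) source coordinate so that
// pixel centers, rather than pixel corners, line up between the two grids.
// Hypothetical helper, not an OpenPose function.
template <typename T>
T targetToSource(const int targetIndex, const int sourceLength, const int targetLength)
{
    return (targetIndex + T(0.5)) * sourceLength / T(targetLength) - T(0.5);
}
```

When source and target have the same length, the mapping is the identity. When upsampling, the first and last target pixels map slightly outside the source range, so the interpolation routine is the one expected to handle border coordinates.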


--- a/include/openpose/core/resizeAndMergeCaffe.hpp
+++ b/include/openpose/core/resizeAndMergeCaffe.hpp
@@ -24,7 +24,7 @@ namespace op
        virtual void LayerSetUp(const std::vector<caffe::Blob<T>*>& bottom, const std::vector<caffe::Blob<T>*>& top);

        virtual void Reshape(const std::vector<caffe::Blob<T>*>& bottom, const std::vector<caffe::Blob<T>*>& top,
-                             const float netFactor, const float scaleFactor, const bool mergeFirstDimension = true);
+                             const T netFactor, const T scaleFactor, const bool mergeFirstDimension = true);

        virtual inline const char* type() const { return "ResizeAndMerge"; }

@@ -42,7 +42,7 @@ namespace op

    private:
        std::vector<T> mScaleRatios;
-        std::array<int, 4> mBottomSize;
+        std::vector<std::array<int, 4>> mBottomSizes;
        std::array<int, 4> mTopSize;

        DELETE_COPY(ResizeAndMergeCaffe);

--- a/include/openpose/pose/poseExtractor.hpp
+++ b/include/openpose/pose/poseExtractor.hpp
@@ -20,7 +20,7 @@ namespace op

        void initializationOnThread();

-        virtual void forwardPass(const Array<float>& inputNetData, const Point<int>& inputDataSize,
+        virtual void forwardPass(const std::vector<Array<float>>& inputNetData, const Point<int>& inputDataSize,
                                 const std::vector<double>& scaleRatios = {1.f}) = 0;

        virtual const float* getHeatMapCpuConstPtr() const = 0;

--- a/include/openpose/pose/poseExtractorCaffe.hpp
+++ b/include/openpose/pose/poseExtractorCaffe.hpp
@@ -19,7 +19,7 @@ namespace op

        void netInitializationOnThread();

-        void forwardPass(const Array<float>& inputNetData, const Point<int>& inputDataSize,
+        void forwardPass(const std::vector<Array<float>>& inputNetData, const Point<int>& inputDataSize,
                         const std::vector<double>& scaleInputToNetInputs = {1.f});

        const float* getHeatMapCpuConstPtr() const;

--- a/include/openpose/pose/poseParameters.hpp
+++ b/include/openpose/pose/poseParameters.hpp
@@ -289,6 +289,7 @@ namespace op
    };
    const std::array<float, (int)PoseModel::Size>    POSE_DEFAULT_CONNECT_INTER_MIN_ABOVE_THRESHOLD{
        0.95f,      0.95f,      0.95f,      0.95f,      0.95f,      0.95f
+        // 0.85f,      0.85f,      0.85f,      0.85f,      0.85f,      0.85f // Matlab version
    };
    const std::array<float, (int)PoseModel::Size>           POSE_DEFAULT_CONNECT_INTER_THRESHOLD{
        0.05f,      0.01f,      0.01f,      0.05f,      0.05f,      0.05f
@@ -298,6 +299,7 @@ namespace op
    };
    const std::array<float, (int)PoseModel::Size>           POSE_DEFAULT_CONNECT_MIN_SUBSET_SCORE{
        0.4f,       0.4f,       0.4f,       0.4f,       0.4f,       0.4f
+        // 0.2f,       0.4f,       0.4f,       0.4f,       0.4f,       0.4f // Matlab version
    };

    // Rendering parameters

--- a/src/openpose/core/cvMatToOpInput.cpp
+++ b/src/openpose/core/cvMatToOpInput.cpp
@@ -4,9 +4,9 @@

 namespace op
 {
-    Array<float> CvMatToOpInput::createArray(const cv::Mat& cvInputData,
-                                             const std::vector<double>& scaleInputToNetInputs,
-                                             const std::vector<Point<int>>& netInputSizes) const
+    std::vector<Array<float>> CvMatToOpInput::createArray(const cv::Mat& cvInputData,
+                                                          const std::vector<double>& scaleInputToNetInputs,
+                                                          const std::vector<Point<int>>& netInputSizes) const
    {
        try
        {
@@ -19,22 +19,22 @@ namespace op
                error("scaleInputToNetInputs.size() != netInputSizes.size().", __LINE__, __FUNCTION__, __FILE__);
             // inputNetData - Rescale keeping aspect ratio and transform to float the input deep net image
            const auto numberScales = (int)scaleInputToNetInputs.size();
-            Array<float> inputNetData{{numberScales, 3, netInputSizes.at(0).y, netInputSizes.at(0).x}};
-            std::vector<double> scaleRatios(numberScales, 1.f);
-            const auto inputNetDataOffset = inputNetData.getVolume(1, 3);
-            for (auto i = 0; i < numberScales; i++)
+            std::vector<Array<float>> inputNetData(numberScales);
+            for (auto i = 0u ; i < inputNetData.size() ; i++)
            {
+                inputNetData[i].reset({1, 3, netInputSizes.at(i).y, netInputSizes.at(i).x});
+                std::vector<double> scaleRatios(numberScales, 1.f);
                const cv::Mat frameWithNetSize = resizeFixedAspectRatio(cvInputData, scaleInputToNetInputs[i],
                                                                        netInputSizes[i]);
-                // Fill inputNetData
-                uCharCvMatToFloatPtr(inputNetData.getPtr() + i * inputNetDataOffset, frameWithNetSize, true);
+                // Fill inputNetData[i]
+                uCharCvMatToFloatPtr(inputNetData[i].getPtr(), frameWithNetSize, true);
            }
            return inputNetData;
        }
        catch (const std::exception& e)
        {
            error(e.what(), __LINE__, __FUNCTION__, __FILE__);
-            return Array<float>{};
+            return {};
        }
    }
 }
--- a/src/openpose/core/datum.cpp
+++ b/src/openpose/core/datum.cpp
@@ -157,7 +157,9 @@ namespace op
            datum.name = name;
            // Input image and rendered version
            datum.cvInputData = cvInputData.clone();
-            datum.inputNetData = inputNetData.clone();
+            datum.inputNetData.resize(inputNetData.size());
+            for (auto i = 0u ; i < datum.inputNetData.size() ; i++)
+                datum.inputNetData[i] = inputNetData[i].clone();
            datum.outputData = outputData.clone();
            datum.cvOutputData = cvOutputData.clone();
            // Resulting Array<float> data

--- a/src/openpose/core/netCaffe.cpp
+++ b/src/openpose/core/netCaffe.cpp
@@ -33,8 +33,7 @@ namespace op
                mGpuId{gpuId},
                mCaffeProto{caffeProto},
                mCaffeTrainedModel{caffeTrainedModel},
-                mLastBlobName{lastBlobName},
-                mNetInputSize4D{0,0,0,0}
+                mLastBlobName{lastBlobName}
            {
                const std::string message{".\nPossible causes:\n\t1. Not downloading the OpenPose trained models."
                                          "\n\t2. Not running OpenPose from the same directory where the `model`"
@@ -160,7 +159,10 @@ namespace op
                #endif
                // Perform deep network forward pass
                upImpl->upCaffeNet->ForwardFrom(0);
-                cudaCheck(__LINE__, __FUNCTION__, __FILE__);
+                // Cuda checks
+                #ifdef USE_CUDA
+                    cudaCheck(__LINE__, __FUNCTION__, __FILE__);
+                #endif
            #else
                UNUSED(inputData);
            #endif

--- a/src/openpose/core/resizeAndMergeBase.cpp
+++ b/src/openpose/core/resizeAndMergeBase.cpp
@@ -4,16 +4,18 @@
 namespace op
 {
    template <typename T>
-    void resizeAndMergeCpu(T* targetPtr, const T* const sourcePtr, const std::array<int, 4>& targetSize,
-                           const std::array<int, 4>& sourceSize, const std::vector<T>& scaleInputToNetInputs)
+    void resizeAndMergeCpu(T* targetPtr, const std::vector<const T*>& sourcePtrs,
+                           const std::array<int, 4>& targetSize,
+                           const std::vector<std::array<int, 4>>& sourceSizes,
+                           const std::vector<T>& scaleInputToNetInputs)
    {
        try
        {
            UNUSED(targetPtr);
-            UNUSED(sourcePtr);
+            UNUSED(sourcePtrs);
            UNUSED(scaleInputToNetInputs);
            UNUSED(targetSize);
-            UNUSED(sourceSize);
+            UNUSED(sourceSizes);
            error("CPU version not completely implemented.", __LINE__, __FUNCTION__, __FILE__);

            // TODO: THIS CODE IS WORKING, BUT IT DOES NOT CONSIDER THE SCALES (I.E. SCALE NUMBER, START AND GAP) 
@@ -34,10 +36,10 @@ namespace op
            //         const auto sourceOffsetChannel = sourceHeight * sourceWidth;
            //         const auto sourceOffsetNum = sourceOffsetChannel * channel;
            //         const auto sourceOffset = n*sourceOffsetNum + c*sourceOffsetChannel;
-            //         const T* const sourcePtr = bottom->cpu_data();
+            //         const T* const sourcePtrs = bottom->cpu_data();
            //         for (int y = 0; y < sourceHeight; y++)
            //             for (int x = 0; x < sourceWidth; x++)
-            //                 source.at<T>(x,y) = sourcePtr[sourceOffset + y*sourceWidth + x];
+            //                 source.at<T>(x,y) = sourcePtrs[sourceOffset + y*sourceWidth + x];

            //         // spatial resize
            //         cv::Mat target;
@@ -60,8 +62,12 @@ namespace op
        }
    }

-    template void resizeAndMergeCpu(float* targetPtr, const float* const sourcePtr, const std::array<int, 4>& targetSize,
-                                    const std::array<int, 4>& sourceSize, const std::vector<float>& scaleInputToNetInputs);
-    template void resizeAndMergeCpu(double* targetPtr, const double* const sourcePtr, const std::array<int, 4>& targetSize,
-                                    const std::array<int, 4>& sourceSize, const std::vector<double>& scaleInputToNetInputs);
+    template void resizeAndMergeCpu(float* targetPtr, const std::vector<const float*>& sourcePtrs,
+                                    const std::array<int, 4>& targetSize,
+                                    const std::vector<std::array<int, 4>>& sourceSizes,
+                                    const std::vector<float>& scaleInputToNetInputs);
+    template void resizeAndMergeCpu(double* targetPtr, const std::vector<const double*>& sourcePtrs,
+                                    const std::array<int, 4>& targetSize,
+                                    const std::vector<std::array<int, 4>>& sourceSizes,
+                                    const std::vector<double>& scaleInputToNetInputs);
 }
--- a/src/openpose/core/resizeAndMergeBase.cu
+++ b/src/openpose/core/resizeAndMergeBase.cu
@@ -15,110 +15,112 @@ namespace op

        if (x < targetWidth && y < targetHeight)
        {
-            const auto scaleWidth = targetWidth / T(sourceWidth);
-            const auto scaleHeight = targetHeight / T(sourceHeight);
-            const T xSource = (x + 0.5f) / scaleWidth - 0.5f;
-            const T ySource = (y + 0.5f) / scaleHeight - 0.5f;
-
+            const T xSource = (x + 0.5f) * sourceWidth / T(targetWidth) - 0.5f;
+            const T ySource = (y + 0.5f) * sourceHeight / T(targetHeight) - 0.5f;
            targetPtr[y*targetWidth+x] = bicubicInterpolate(sourcePtr, xSource, ySource, sourceWidth, sourceHeight,
                                                            sourceWidth);
        }
    }

    template <typename T>
-    __global__ void resizeKernelAndMerge(T* targetPtr, const T* const sourcePtr, const int sourceNumOffset,
-                                         const int num, const T* scaleInputToNetInputs, const int sourceWidth,
-                                         const int sourceHeight, const int targetWidth, const int targetHeight)
+    __global__ void resizeKernelAndMerge(T* targetPtr, const T* const sourcePtr, const T scaleWidth,
+                                         const T scaleHeight, const int sourceWidth, const int sourceHeight,
+                                         const int targetWidth, const int targetHeight, const int averageCounter)
    {
        const auto x = (blockIdx.x * blockDim.x) + threadIdx.x;
        const auto y = (blockIdx.y * blockDim.y) + threadIdx.y;

        if (x < targetWidth && y < targetHeight)
        {
+            const T xSource = (x + 0.5f) / scaleWidth - 0.5f;
+            const T ySource = (y + 0.5f) / scaleHeight - 0.5f;
+            const auto interpolated = bicubicInterpolate(sourcePtr, xSource, ySource, sourceWidth, sourceHeight,
+                                                         sourceWidth);
            auto& targetPixel = targetPtr[y*targetWidth+x];
-            targetPixel = 0.f; // For average
-            // targetPixel = -1000.f; // For fastMax
-            for (auto n = 0; n < num; n++)
-            {
-                const auto currentWidth = sourceWidth * scaleInputToNetInputs[n] / scaleInputToNetInputs[0];
-                const auto currentHeight = sourceHeight * scaleInputToNetInputs[n] / scaleInputToNetInputs[0];
-
-                const auto scaleWidth = targetWidth / currentWidth;
-                const auto scaleHeight = targetHeight / currentHeight;
-                const T xSource = (x + 0.5f) / scaleWidth - 0.5f;
-                const T ySource = (y + 0.5f) / scaleHeight - 0.5f;
-
-                const T* const sourcePtrN = sourcePtr + n * sourceNumOffset;
-                const auto interpolated = bicubicInterpolate(sourcePtrN, xSource, ySource, intRound(currentWidth),
-                                                             intRound(currentHeight), sourceWidth);
-                targetPixel += interpolated;
-                // targetPixel = fastMax(targetPixel, interpolated);
-            }
-            targetPixel /= num;
+            targetPixel = ((averageCounter * targetPixel) + interpolated) / T(averageCounter + 1);
+            // targetPixel = fastMax(targetPixel, interpolated);
        }
    }

    template <typename T>
-    void resizeAndMergeGpu(T* targetPtr, const T* const sourcePtr, const std::array<int, 4>& targetSize,
-                           const std::array<int, 4>& sourceSize, const std::vector<T>& scaleInputToNetInputs)
+    void resizeAndMergeGpu(T* targetPtr, const std::vector<const T*>& sourcePtrs, const std::array<int, 4>& targetSize,
+                           const std::vector<std::array<int, 4>>& sourceSizes,
+                           const std::vector<T>& scaleInputToNetInputs)
    {
        try
        {
-            const auto num = sourceSize[0];
-            const auto channels = sourceSize[1];
-            const auto sourceHeight = sourceSize[2];
-            const auto sourceWidth = sourceSize[3];
+            // Security checks
+            if (sourceSizes.empty())
+                error("sourceSizes cannot be empty.", __LINE__, __FUNCTION__, __FILE__);
+            if (sourcePtrs.size() != sourceSizes.size() || sourceSizes.size() != scaleInputToNetInputs.size())
+                error("Size(sourcePtrs) must match size(sourceSizes) and size(scaleInputToNetInputs). Currently: "
+                      + std::to_string(sourcePtrs.size()) + " vs. " + std::to_string(sourceSizes.size()) + " vs. "
+                      + std::to_string(scaleInputToNetInputs.size()) + ".", __LINE__, __FUNCTION__, __FILE__);
+
+            // Parameters
+            const auto channels = targetSize[1];
            const auto targetHeight = targetSize[2];
            const auto targetWidth = targetSize[3];
-
            const dim3 threadsPerBlock{THREADS_PER_BLOCK_1D, THREADS_PER_BLOCK_1D};
            const dim3 numBlocks{getNumberCudaBlocks(targetWidth, threadsPerBlock.x),
                                 getNumberCudaBlocks(targetHeight, threadsPerBlock.y)};
-            const auto sourceChannelOffset = sourceHeight * sourceWidth;
-            const auto targetChannelOffset = targetWidth * targetHeight;
+            const auto& sourceSize = sourceSizes[0];
+            const auto sourceHeight = sourceSize[2];
+            const auto sourceWidth = sourceSize[3];

-            // No multi-scale merging
-            if (targetSize[0] > 1)
+            // No multi-scale merging or no merging required
+            if (sourceSizes.size() == 1)
            {
-                for (auto n = 0; n < num; n++)
+                const auto num = sourceSize[0];
+                if (targetSize[0] > 1 || num == 1)
                {
-                    const auto offsetBase = n*channels;
-                    for (auto c = 0 ; c < channels ; c++)
+                    const auto sourceChannelOffset = sourceHeight * sourceWidth;
+                    const auto targetChannelOffset = targetWidth * targetHeight;
+                    for (auto n = 0; n < num; n++)
                    {
-                        const auto offset = offsetBase + c;
-                        resizeKernel<<<numBlocks, threadsPerBlock>>>(targetPtr + offset * targetChannelOffset,
-                                                                     sourcePtr + offset * sourceChannelOffset,
-                                                                     sourceWidth, sourceHeight, targetWidth,
-                                                                     targetHeight);
+                        const auto offsetBase = n*channels;
+                        for (auto c = 0 ; c < channels ; c++)
+                        {
+                            const auto offset = offsetBase + c;
+                            resizeKernel<<<numBlocks, threadsPerBlock>>>(targetPtr + offset * targetChannelOffset,
+                                                                         sourcePtrs.at(0) + offset * sourceChannelOffset,
+                                                                         sourceWidth, sourceHeight, targetWidth,
+                                                                         targetHeight);
+                        }
                    }
                }
+                // Old inefficient multi-scale merging
+                else
+                    error("It should never reach this point. Notify us.", __LINE__, __FUNCTION__, __FILE__);
            }
-            // Multi-scale merging
+            // Multi-scaling merging
            else
            {
-                // If scale_number > 1 --> scaleInputToNetInputs must be set
-                if (scaleInputToNetInputs.size() != num)
-                    error("The scale ratios size must be equal than the number of scales.",
-                          __LINE__, __FUNCTION__, __FILE__);
-                const auto maxScales = 10;
-                if (scaleInputToNetInputs.size() > maxScales)
-                    error("The maximum number of scales is " + std::to_string(maxScales) + ".",
-                          __LINE__, __FUNCTION__, __FILE__);
-                // Copy scaleInputToNetInputs
-                T* scaleInputToNetInputsPtr;
-                cudaMalloc((void**)&scaleInputToNetInputsPtr, maxScales * sizeof(T));
-                cudaMemcpy(scaleInputToNetInputsPtr, scaleInputToNetInputs.data(),
-                           scaleInputToNetInputs.size() * sizeof(T), cudaMemcpyHostToDevice);
-                // Perform resize + merging
-                const auto sourceNumOffset = channels * sourceChannelOffset;
-                for (auto c = 0 ; c < channels ; c++)
-                    resizeKernelAndMerge<<<numBlocks, threadsPerBlock>>>(targetPtr + c * targetChannelOffset,
-                                                                         sourcePtr + c * sourceChannelOffset,
-                                                                         sourceNumOffset, num,
-                                                                         scaleInputToNetInputsPtr, sourceWidth,
-                                                                         sourceHeight, targetWidth, targetHeight);
-                // Free memory
-                cudaFree(scaleInputToNetInputsPtr);
+                const auto targetChannelOffset = targetWidth * targetHeight;
+                cudaMemset(targetPtr, 0, channels*targetChannelOffset * sizeof(T));
+                auto averageCounter = -1;
+                const auto scaleToMainScaleWidth = targetWidth / T(sourceWidth);
+                const auto scaleToMainScaleHeight = targetHeight / T(sourceHeight);
+
+                for (auto i = 0u ; i < sourceSizes.size(); i++)
+                {
+                    const auto& currentSize = sourceSizes.at(i);
+                    const auto currentHeight = currentSize[2];
+                    const auto currentWidth = currentSize[3];
+                    const auto sourceChannelOffset = currentHeight * currentWidth;
+                    const auto scaleInputToNet = scaleInputToNetInputs[i] / scaleInputToNetInputs[0];
+                    const auto scaleWidth = scaleToMainScaleWidth / scaleInputToNet;
+                    const auto scaleHeight = scaleToMainScaleHeight / scaleInputToNet;
+                    averageCounter++;
+                    for (auto c = 0 ; c < channels ; c++)
+                    {
+                        resizeKernelAndMerge<<<numBlocks, threadsPerBlock>>>(
+                            targetPtr + c * targetChannelOffset, sourcePtrs[i] + c * sourceChannelOffset,
+                            scaleWidth, scaleHeight, currentWidth, currentHeight, targetWidth,
+                            targetHeight, averageCounter
+                        );
+                    }
+                }
            }

            cudaCheck(__LINE__, __FUNCTION__, __FILE__);
@@ -129,10 +131,12 @@ namespace op
        }
    }

-    template void resizeAndMergeGpu(float* targetPtr, const float* const sourcePtr,
-                                    const std::array<int, 4>& targetSize, const std::array<int, 4>& sourceSize,
+    template void resizeAndMergeGpu(float* targetPtr, const std::vector<const float*>& sourcePtrs,
+                                    const std::array<int, 4>& targetSize,
+                                    const std::vector<std::array<int, 4>>& sourceSizes,
                                    const std::vector<float>& scaleInputToNetInputs);
-    template void resizeAndMergeGpu(double* targetPtr, const double* const sourcePtr,
-                                    const std::array<int, 4>& targetSize, const std::array<int, 4>& sourceSize,
+    template void resizeAndMergeGpu(double* targetPtr, const std::vector<const double*>& sourcePtrs,
+                                    const std::array<int, 4>& targetSize,
+                                    const std::vector<std::array<int, 4>>& sourceSizes,
                                    const std::vector<double>& scaleInputToNetInputs);
 }
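Review note: the body of `resizeKernelAndMerge` is not part of this excerpt. The `averageCounter` argument, which starts at 0 for the first scale and grows by one per scale, suggests that each resized scale is folded into the target as a running mean. A minimal CPU sketch of that idea, under that assumption (`mergeRunningAverage` is a hypothetical name and the interpolation step is omitted):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of OpenPose: folds `sample` into `target`
// as term number (averageCounter + 1) of a running mean.
// With averageCounter == 0 the previous content of `target` is discarded.
void mergeRunningAverage(std::vector<float>& target, const std::vector<float>& sample,
                         const int averageCounter)
{
    for (std::size_t i = 0; i < target.size() && i < sample.size(); i++)
        target[i] = (target[i] * averageCounter + sample[i]) / float(averageCounter + 1);
}
```

Calling it once per scale with counters 0, 1, 2, ... leaves the per-pixel mean of all scales in `target`, without a separate accumulation buffer per scale.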
--- a/src/openpose/core/resizeAndMergeCaffe.cpp
+++ b/src/openpose/core/resizeAndMergeCaffe.cpp
@@ -32,9 +32,9 @@ namespace op
        {
            #ifdef USE_CAFFE
                if (top.size() != 1)
-                    error("top.size() != 1", __LINE__, __FUNCTION__, __FILE__);
+                    error("top.size() != 1.", __LINE__, __FUNCTION__, __FILE__);
                if (bottom.size() != 1)
-                    error("bottom.size() != 2", __LINE__, __FUNCTION__, __FILE__);
+                    error("bottom.size() != 1.", __LINE__, __FUNCTION__, __FILE__);
            #else
                UNUSED(bottom);
                UNUSED(top);
@@ -49,16 +49,21 @@ namespace op
    template <typename T>
    void ResizeAndMergeCaffe<T>::Reshape(const std::vector<caffe::Blob<T>*>& bottom,
                                         const std::vector<caffe::Blob<T>*>& top,
-                                         const float netFactor,
-                                         const float scaleFactor,
+                                         const T netFactor,
+                                         const T scaleFactor,
                                         const bool mergeFirstDimension)
    {
        try
        {
            #ifdef USE_CAFFE
+                // Security checks
+                if (top.size() != 1)
+                    error("top.size() != 1.", __LINE__, __FUNCTION__, __FILE__);
+                if (bottom.empty())
+                    error("bottom cannot be empty.", __LINE__, __FUNCTION__, __FILE__);
                // Data
-                const auto* bottomBlob = bottom.at(0);
                auto* topBlob = top.at(0);
+                const auto* bottomBlob = bottom.at(0);
                // Set top shape
                auto topShape = bottomBlob->shape();
                topShape[0] = (mergeFirstDimension ? 1 : bottomBlob->shape(0));
@@ -66,18 +71,21 @@ namespace op
                // E.g. 100x100 image --> 200x200 --> 0-99 to 0-199 --> scale = 199/99 (not 2!)
                // E.g. 101x101 image --> 201x201 --> scale = 2
                // Test: pixel 0 --> 0, pixel 99 (ex 1) --> 199, pixel 100 (ex 2) --> 200
-                topShape[2] = intRound((topShape[2]*netFactor - 1.f) * scaleFactor + 1);
-                topShape[3] = intRound((topShape[3]*netFactor - 1.f) * scaleFactor + 1);
+                topShape[2] = intRound((topShape[2]*netFactor - 1.f) * scaleFactor) + 1;
+                topShape[3] = intRound((topShape[3]*netFactor - 1.f) * scaleFactor) + 1;
                topBlob->Reshape(topShape);
                // Array sizes
                mTopSize = std::array<int, 4>{topBlob->shape(0), topBlob->shape(1), topBlob->shape(2),
                                              topBlob->shape(3)};
-                mBottomSize = std::array<int, 4>{bottomBlob->shape(0), bottomBlob->shape(1),
-                                                 bottomBlob->shape(2), bottomBlob->shape(3)};
+                mBottomSizes.resize(bottom.size());
+                for (auto i = 0u ; i < mBottomSizes.size() ; i++)
+                    mBottomSizes[i] = std::array<int, 4>{bottom[i]->shape(0), bottom[i]->shape(1),
+                                                         bottom[i]->shape(2), bottom[i]->shape(3)};
            #else
                UNUSED(bottom);
                UNUSED(top);
-                UNUSED(factor);
+                UNUSED(netFactor);
+                UNUSED(scaleFactor);
                UNUSED(mergeFirstDimension);
            #endif
        }
@@ -107,7 +115,10 @@ namespace op
        try
        {
            #ifdef USE_CAFFE
-                resizeAndMergeCpu(top.at(0)->mutable_cpu_data(), bottom.at(0)->cpu_data(), mTopSize, mBottomSize,
+                std::vector<const T*> sourcePtrs(bottom.size());
+                for (auto i = 0u ; i < sourcePtrs.size() ; i++)
+                    sourcePtrs[i] = bottom[i]->cpu_data();
+                resizeAndMergeCpu(top.at(0)->mutable_cpu_data(), sourcePtrs, mTopSize, mBottomSizes,
                                  mScaleRatios);
            #else
                UNUSED(bottom);
@@ -127,7 +138,10 @@ namespace op
        try
        {
            #if defined USE_CAFFE && defined USE_CUDA
-                resizeAndMergeGpu(top.at(0)->mutable_gpu_data(), bottom.at(0)->gpu_data(), mTopSize, mBottomSize,
+                std::vector<const T*> sourcePtrs(bottom.size());
+                for (auto i = 0u ; i < sourcePtrs.size() ; i++)
+                    sourcePtrs[i] = bottom[i]->gpu_data();
+                resizeAndMergeGpu(top.at(0)->mutable_gpu_data(), sourcePtrs, mTopSize, mBottomSizes,
                                  mScaleRatios);
            #else
                UNUSED(bottom);
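Review note: the top-shape expression in `Reshape` scales the pixel index range (0 to n-1) rather than the pixel count, then adds 1 back. A small standalone sketch of that arithmetic, checked against the examples in the code comments (`roundToInt` and `resizedLength` are hypothetical names, with `roundToInt` standing in for OpenPose's `intRound`):

```cpp
#include <cmath>

// Hypothetical stand-in for OpenPose's intRound: round to the nearest integer.
inline int roundToInt(const double value)
{
    return static_cast<int>(std::lround(value));
}

// Length of one spatial dimension after undoing the net stride (netFactor)
// and applying scaleFactor. The largest index (length - 1) is scaled,
// and 1 is added afterwards to turn it back into a length.
inline int resizedLength(const int netOutputLength, const double netFactor,
                         const double scaleFactor)
{
    return roundToInt((netOutputLength * netFactor - 1.0) * scaleFactor) + 1;
}
```

For instance, `resizedLength(101, 1., 2.)` yields 201, and `resizedLength(100, 1., 199. / 99.)` yields 200, matching the two examples in the comments.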

--- a/src/openpose/core/scaleAndSizeExtractor.cpp
+++ b/src/openpose/core/scaleAndSizeExtractor.cpp
@@ -54,9 +54,9 @@ namespace op
                        poseNetInputSize.x * inputResolution.y / (float) inputResolution.x / 16.f
                    );
            }
-            // scaleInputToNetInputs & sizes - Reescale keeping aspect ratio
+            // scaleInputToNetInputs & netInputSizes - Rescale keeping aspect ratio
            std::vector<double> scaleInputToNetInputs(mScaleNumber, 1.f);
-            std::vector<Point<int>> sizes(mScaleNumber);
+            std::vector<Point<int>> netInputSizes(mScaleNumber);
            for (auto i = 0; i < mScaleNumber; i++)
            {
                const auto currentScale = 1. - i*mScaleGap;
@@ -70,7 +70,7 @@ namespace op
                                                       poseNetInputSize.y);
                const Point<int> targetSize{targetWidth, targetHeight};
                scaleInputToNetInputs[i] = resizeGetScaleFactor(inputResolution, targetSize);
-                sizes[i] = poseNetInputSize;
+                netInputSizes[i] = targetSize;
            }
            // scaleInputToOutput - Scale between input and desired output size
            Point<int> outputResolution;
@@ -88,7 +88,7 @@ namespace op
                scaleInputToOutput = 1.;
            }
            // Return result
-            return std::make_tuple(scaleInputToNetInputs, sizes, scaleInputToOutput, outputResolution);
+            return std::make_tuple(scaleInputToNetInputs, netInputSizes, scaleInputToOutput, outputResolution);
        }
        catch (const std::exception& e)
        {
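Review note: with this change each scale reports its own `targetSize` instead of the fixed `poseNetInputSize`, so smaller scales can use smaller net inputs. The relative scale of each level follows `1. - i*mScaleGap`, as in the loop above. A minimal sketch of that ladder (`scaleLadder` is a hypothetical helper; the target width and height computation is not shown in this excerpt and is left out):

```cpp
#include <vector>

// Hypothetical helper, not part of OpenPose: relative scale of each level,
// computed as 1 - i * scaleGap for i in [0, scaleNumber).
std::vector<double> scaleLadder(const int scaleNumber, const double scaleGap)
{
    std::vector<double> scales(scaleNumber > 0 ? scaleNumber : 0);
    for (auto i = 0u; i < scales.size(); i++)
        scales[i] = 1. - i * scaleGap;
    return scales;
}
```

With 4 scales and a gap of 0.25 this gives 1, 0.75, 0.5 and 0.25.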

--- a/src/openpose/pose/poseExtractorCaffe.cpp
+++ b/src/openpose/pose/poseExtractorCaffe.cpp
@@ -18,23 +18,30 @@ namespace op
    struct PoseExtractorCaffe::ImplPoseExtractorCaffe
    {
        #ifdef USE_CAFFE
-            std::shared_ptr<NetCaffe> spNetCaffe;
+            // Used when increasing spCaffeNets
+            const PoseModel mPoseModel;
+            const int mGpuId;
+            const std::string mModelFolder;
+            const bool mEnableGoogleLogging;
+            // General parameters
+            std::vector<std::shared_ptr<NetCaffe>> spCaffeNets;
            std::shared_ptr<ResizeAndMergeCaffe<float>> spResizeAndMergeCaffe;
            std::shared_ptr<NmsCaffe<float>> spNmsCaffe;
            std::shared_ptr<BodyPartConnectorCaffe<float>> spBodyPartConnectorCaffe;
-            std::vector<int> mNetInputSize4D;
+            std::vector<std::vector<int>> mNetInput4DSizes;
            std::vector<double> mScaleInputToNetInputs;
            // Init with thread
-            boost::shared_ptr<caffe::Blob<float>> spCaffeNetOutputBlob;
+            std::vector<boost::shared_ptr<caffe::Blob<float>>> spCaffeNetOutputBlobs;
            std::shared_ptr<caffe::Blob<float>> spHeatMapsBlob;
            std::shared_ptr<caffe::Blob<float>> spPeaksBlob;
            std::shared_ptr<caffe::Blob<float>> spPoseBlob;

            ImplPoseExtractorCaffe(const PoseModel poseModel, const int gpuId,
                                   const std::string& modelFolder, const bool enableGoogleLogging) :
-                spNetCaffe{std::make_shared<NetCaffe>(modelFolder + POSE_PROTOTXT[(int)poseModel],
-                                                      modelFolder + POSE_TRAINED_MODEL[(int)poseModel], gpuId,
-                                                      enableGoogleLogging)},
+                mPoseModel{poseModel},
+                mGpuId{gpuId},
+                mModelFolder{modelFolder},
+                mEnableGoogleLogging{enableGoogleLogging},
                spResizeAndMergeCaffe{std::make_shared<ResizeAndMergeCaffe<float>>()},
                spNmsCaffe{std::make_shared<NmsCaffe<float>>()},
                spBodyPartConnectorCaffe{std::make_shared<BodyPartConnectorCaffe<float>>()}
@@ -44,10 +51,28 @@ namespace op
    };

    #ifdef USE_CAFFE
+        std::vector<caffe::Blob<float>*> caffeNetSharedToPtr(
+            std::vector<boost::shared_ptr<caffe::Blob<float>>>& caffeNetOutputBlob)
+        {
+            try
+            {
+                // Prepare spCaffeNetOutputBlobs
+                std::vector<caffe::Blob<float>*> caffeNetOutputBlobs(caffeNetOutputBlob.size());
+                for (auto i = 0u ; i < caffeNetOutputBlobs.size() ; i++)
+                    caffeNetOutputBlobs[i] = caffeNetOutputBlob[i].get();
+                return caffeNetOutputBlobs;
+            }
+            catch (const std::exception& e)
+            {
+                error(e.what(), __LINE__, __FUNCTION__, __FILE__);
+                return {};
+            }
+        }
+
        inline void reshapePoseExtractorCaffe(std::shared_ptr<ResizeAndMergeCaffe<float>>& resizeAndMergeCaffe,
                                              std::shared_ptr<NmsCaffe<float>>& nmsCaffe,
                                              std::shared_ptr<BodyPartConnectorCaffe<float>>& bodyPartConnectorCaffe,
-                                              boost::shared_ptr<caffe::Blob<float>>& caffeNetOutputBlob,
+                                              std::vector<boost::shared_ptr<caffe::Blob<float>>>& caffeNetOutputBlob,
                                              std::shared_ptr<caffe::Blob<float>>& heatMapsBlob,
                                              std::shared_ptr<caffe::Blob<float>>& peaksBlob,
                                              std::shared_ptr<caffe::Blob<float>>& poseBlob,
@@ -57,14 +82,47 @@ namespace op
            try
            {
                // HeatMaps extractor blob and layer
-                resizeAndMergeCaffe->Reshape({caffeNetOutputBlob.get()}, {heatMapsBlob.get()},
+                const auto caffeNetOutputBlobs = caffeNetSharedToPtr(caffeNetOutputBlob);
+                resizeAndMergeCaffe->Reshape(caffeNetOutputBlobs, {heatMapsBlob.get()},
                                             POSE_CCN_DECREASE_FACTOR[(int)poseModel], 1.f/scaleInputToNetInput);
                // Pose extractor blob and layer
                nmsCaffe->Reshape({heatMapsBlob.get()}, {peaksBlob.get()}, POSE_MAX_PEAKS[(int)poseModel]);
                // Pose extractor blob and layer
                bodyPartConnectorCaffe->Reshape({heatMapsBlob.get(), peaksBlob.get()}, {poseBlob.get()});
                // Cuda check
-                cudaCheck(__LINE__, __FUNCTION__, __FILE__);
+                #ifdef USE_CUDA
+                    cudaCheck(__LINE__, __FUNCTION__, __FILE__);
+                #endif
+            }
+            catch (const std::exception& e)
+            {
+                error(e.what(), __LINE__, __FUNCTION__, __FILE__);
+            }
+        }
+
+        void addCaffeNetOnThread(std::vector<std::shared_ptr<NetCaffe>>& netCaffe,
+                         std::vector<boost::shared_ptr<caffe::Blob<float>>>& caffeNetOutputBlob,
+                         const PoseModel poseModel, const int gpuId,
+                         const std::string& modelFolder, const bool enableGoogleLogging)
+        {
+            try
+            {
+                // Add Caffe Net
+                netCaffe.emplace_back(
+                    std::make_shared<NetCaffe>(modelFolder + POSE_PROTOTXT[(int)poseModel],
+                                               modelFolder + POSE_TRAINED_MODEL[(int)poseModel],
+                                               gpuId, enableGoogleLogging)
+                );
+                // Initializing them on the thread
+                netCaffe.back()->initializationOnThread();
+                caffeNetOutputBlob.emplace_back(netCaffe.back()->getOutputBlob());
+                // Security checks
+                if (netCaffe.size() != caffeNetOutputBlob.size())
+                    error("netCaffe.size() != caffeNetOutputBlob.size(). This should not happen. Notify us.", __LINE__, __FUNCTION__, __FILE__);
+                // Cuda check
+                #ifdef USE_CUDA
+                    cudaCheck(__LINE__, __FUNCTION__, __FILE__);
+                #endif
            }
            catch (const std::exception& e)
            {
@@ -114,14 +172,18 @@ namespace op
                // Logging
                log("Starting initialization on thread.", Priority::Low, __LINE__, __FUNCTION__, __FILE__);
                // Initialize Caffe net
-                upImpl->spNetCaffe->initializationOnThread();
-                cudaCheck(__LINE__, __FUNCTION__, __FILE__);
+                addCaffeNetOnThread(upImpl->spCaffeNets, upImpl->spCaffeNetOutputBlobs, upImpl->mPoseModel,
+                                    upImpl->mGpuId, upImpl->mModelFolder, upImpl->mEnableGoogleLogging);
+                #ifdef USE_CUDA
+                    cudaCheck(__LINE__, __FUNCTION__, __FILE__);
+                #endif
                // Initialize blobs
-                upImpl->spCaffeNetOutputBlob = upImpl->spNetCaffe->getOutputBlob();
                upImpl->spHeatMapsBlob = {std::make_shared<caffe::Blob<float>>(1,1,1,1)};
                upImpl->spPeaksBlob = {std::make_shared<caffe::Blob<float>>(1,1,1,1)};
                upImpl->spPoseBlob = {std::make_shared<caffe::Blob<float>>(1,1,1,1)};
-                cudaCheck(__LINE__, __FUNCTION__, __FILE__);
+                #ifdef USE_CUDA
+                    cudaCheck(__LINE__, __FUNCTION__, __FILE__);
+                #endif
                // Logging
                log("Finished initialization on thread.", Priority::Low, __LINE__, __FUNCTION__, __FILE__);
            #endif
@@ -132,7 +194,8 @@ namespace op
        }
    }

-    void PoseExtractorCaffe::forwardPass(const Array<float>& inputNetData, const Point<int>& inputDataSize,
+    void PoseExtractorCaffe::forwardPass(const std::vector<Array<float>>& inputNetData,
+                                         const Point<int>& inputDataSize,
                                         const std::vector<double>& scaleInputToNetInputs)
    {
        try
@@ -141,30 +204,50 @@ namespace op
                // Security checks
                if (inputNetData.empty())
                    error("Empty inputNetData.", __LINE__, __FUNCTION__, __FILE__);
+                for (const auto& inputNetDataI : inputNetData)
+                    if (inputNetDataI.empty())
+                        error("Empty inputNetData element.", __LINE__, __FUNCTION__, __FILE__);
+                if (inputNetData.size() != scaleInputToNetInputs.size())
+                    error("Size(inputNetData) must be the same as size(scaleInputToNetInputs).",
+                          __LINE__, __FUNCTION__, __FILE__);

-                // 1. Caffe deep network
-                upImpl->spNetCaffe->forwardPass(inputNetData);                                                 // ~80ms
+                // Resize std::vectors if required
+                const auto numberScales = inputNetData.size();
+                upImpl->mNetInput4DSizes.resize(numberScales);
+                while (upImpl->spCaffeNets.size() < numberScales)
+                    addCaffeNetOnThread(upImpl->spCaffeNets, upImpl->spCaffeNetOutputBlobs, upImpl->mPoseModel,
+                                        upImpl->mGpuId, upImpl->mModelFolder, false);

-                // Reshape blobs if required
-                // Note: In order to resize to input size to have same results as Matlab, uncomment the commented lines
-                if (!vectorsAreEqual(upImpl->mNetInputSize4D, inputNetData.getSize()))
-                    // || !vectorsAreEqual(upImpl->mScaleInputToNetInputs, scaleInputToNetInputs))
+                // Process each image
+                for (auto i = 0u ; i < inputNetData.size(); i++)
                {
-                    upImpl->mNetInputSize4D = inputNetData.getSize();
-                    mNetOutputSize = Point<int>{upImpl->mNetInputSize4D[3], upImpl->mNetInputSize4D[2]};
-                    // upImpl->mScaleInputToNetInputs = scaleInputToNetInputs;
-                    reshapePoseExtractorCaffe(upImpl->spResizeAndMergeCaffe, upImpl->spNmsCaffe,
-                                              upImpl->spBodyPartConnectorCaffe, upImpl->spCaffeNetOutputBlob,
-                                              upImpl->spHeatMapsBlob, upImpl->spPeaksBlob, upImpl->spPoseBlob,
-                                              1.f, mPoseModel);
-                                              // scaleInputToNetInputs[0], mPoseModel);
+                    // 1. Caffe deep network
+                    upImpl->spCaffeNets.at(i)->forwardPass(inputNetData[i]);                                   // ~80ms
+
+                    // Reshape blobs if required
+                    // Note: To resize to the input size and match the Matlab results, uncomment the commented
+                    // lines
+                    if (!vectorsAreEqual(upImpl->mNetInput4DSizes.at(i), inputNetData[i].getSize()))
+                        // || !vectorsAreEqual(upImpl->mScaleInputToNetInputs, scaleInputToNetInputs))
+                    {
+                        upImpl->mNetInput4DSizes.at(i) = inputNetData[i].getSize();
+                        mNetOutputSize = Point<int>{upImpl->mNetInput4DSizes[0][3],
+                                                    upImpl->mNetInput4DSizes[0][2]};
+                        // upImpl->mScaleInputToNetInputs = scaleInputToNetInputs;
+                        reshapePoseExtractorCaffe(upImpl->spResizeAndMergeCaffe, upImpl->spNmsCaffe,
+                                                  upImpl->spBodyPartConnectorCaffe, upImpl->spCaffeNetOutputBlobs,
+                                                  upImpl->spHeatMapsBlob, upImpl->spPeaksBlob, upImpl->spPoseBlob,
+                                                  1.f, mPoseModel);
+                                                  // scaleInputToNetInputs[i], mPoseModel);
+                    }
                }

                // 2. Resize heat maps + merge different scales
+                const auto caffeNetOutputBlobs = caffeNetSharedToPtr(upImpl->spCaffeNetOutputBlobs);
                const std::vector<float> floatScaleRatios(scaleInputToNetInputs.begin(), scaleInputToNetInputs.end());
                upImpl->spResizeAndMergeCaffe->setScaleRatios(floatScaleRatios);
                #ifdef USE_CUDA
-                    upImpl->spResizeAndMergeCaffe->Forward_gpu({upImpl->spCaffeNetOutputBlob.get()},            // ~5ms
+                    upImpl->spResizeAndMergeCaffe->Forward_gpu(caffeNetOutputBlobs,                             // ~5ms
                                                               {upImpl->spHeatMapsBlob.get()});
                    cudaCheck(__LINE__, __FUNCTION__, __FILE__);
                #else