Unverified commit 1afb1ca0, authored by Michael Auli, committed by GitHub

Merge pull request #149 from dariopavllo/master

Update setup instructions; Detectron2 and trajectory support for inference in the wild
# Dataset setup
## Human3.6M
We provide two ways to set up the Human3.6M dataset for our pipeline. You can either convert the original dataset (recommended) or use the [dataset preprocessed by Martinez et al.](https://github.com/una-dinosauria/3d-pose-baseline) (no longer available as of May 22nd, 2020). Both methods produce the same result. After this step, you should end up with two files in the `data` directory: `data_3d_h36m.npz` for the 3D poses, and `data_2d_h36m_gt.npz` for the ground-truth 2D poses.
### Setup from original source (recommended)
**Update:** we have updated the instructions to simplify the procedure. MATLAB is no longer required for this step.
Register on the [Human3.6M website](http://vision.imar.ro/human3.6m/) (or log in if you already have an account) and download the dataset in its original format. You only need to download *Poses -> D3 Positions* for each subject (1, 5, 6, 7, 8, 9, 11).
##### Instructions without MATLAB (recommended)
You first need to install the `cdflib` Python library via `pip install cdflib`.
Extract the archives named `Poses_D3_Positions_S*.tgz` (subjects 1, 5, 6, 7, 8, 9, 11) to a common directory. Your directory tree should look like this:
```
/path/to/dataset/S1/MyPoseFeatures/D3_Positions/Directions 1.cdf
/path/to/dataset/S1/MyPoseFeatures/D3_Positions/Directions.cdf
...
```
Then, run the preprocessing script:
```sh
cd data
python prepare_data_h36m.py --from-source-cdf /path/to/dataset
cd ..
```
If everything goes well, you are ready to go.
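Optionally, you can sanity-check the raw data before (or after) running the conversion. The snippet below is a minimal sketch that mirrors what `prepare_data_h36m.py --from-source-cdf` does internally for a single file; the path is a placeholder.
```python
# Minimal sketch: read one raw Human3.6M pose file with cdflib,
# mirroring what prepare_data_h36m.py does internally.
import cdflib

cdf = cdflib.CDF('/path/to/dataset/S1/MyPoseFeatures/D3_Positions/Directions.cdf')  # placeholder path
positions = cdf['Pose'].reshape(-1, 32, 3) / 1000  # (frames, 32 joints, xyz) in meters
print(positions.shape)
```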
##### Instructions with MATLAB (old instructions)
First, we need to convert the 3D poses from `.cdf` to `.mat`, so they can be loaded from Python scripts. To this end, we have provided the MATLAB script `convert_cdf_to_mat.m` in the `data` directory. Extract the archives named `Poses_D3_Positions_S*.tgz` (subjects 1, 5, 6, 7, 8, 9, 11) to a directory named `pose`, and set up your directory tree so that it looks like this:
```
/path/to/dataset/pose/S1/MyPoseFeatures/D3_Positions/Directions 1.cdf
/path/to/dataset/pose/S1/MyPoseFeatures/D3_Positions/Directions.cdf
...
```
Then run `convert_cdf_to_mat.m` from MATLAB.
Finally, run the Python conversion script specifying the dataset path:
```sh
cd data
python prepare_data_h36m.py --from-source /path/to/dataset/pose
cd ..
```
### Setup from preprocessed dataset (old instructions)
**Update:** the link to the preprocessed dataset is no longer available; please use the procedure above. These instructions have been kept for backwards compatibility in case you already have a copy of this archive. All procedures produce the same result.
Download the [~~h36m.zip archive~~](https://www.dropbox.com/s/e35qv3n6zlkouki/h36m.zip) (source: [3D pose baseline repository](https://github.com/una-dinosauria/3d-pose-baseline)) to the `data` directory, and run the conversion script from the same directory. This step does not require any additional dependency.
```sh
cd data
wget https://www.dropbox.com/s/e35qv3n6zlkouki/h36m.zip
python prepare_data_h36m.py --from-archive h36m.zip
cd ..
```
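Whichever procedure you follow, you can quickly verify that the generated files load correctly. This is a minimal sketch assuming the default output names; the 3D archive stores a `positions_3d` dictionary (subject -> action -> array), and the ground-truth 2D archive is expected to contain `positions_2d` and `metadata` entries.
```python
# Minimal sanity check for the generated dataset files (run from the repo root).
import numpy as np

data_3d = np.load('data/data_3d_h36m.npz', allow_pickle=True)['positions_3d'].item()
print(sorted(data_3d.keys()))                  # subjects, e.g. ['S1', 'S11', 'S5', ...]
subject = 'S1'
action = next(iter(data_3d[subject]))
print(action, data_3d[subject][action].shape)  # (num_frames, 32, 3)

data_2d = np.load('data/data_2d_h36m_gt.npz', allow_pickle=True)
print(data_2d['metadata'].item())              # keypoint layout information
```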
## 2D detections for Human3.6M
We provide support for the following 2D detections:
- `gt`: ground-truth 2D poses, extracted through the camera projection parameters.
- `sh_pt_mpii`: Stacked Hourglass detections (model pretrained on MPII, no fine tuning).
- `sh_ft_h36m`: Stacked Hourglass detections, fine-tuned on Human3.6M.
- `detectron_pt_h36m`: Detectron (Mask R-CNN) detections (model pretrained on COCO, no fine tuning).
- `detectron_ft_h36m`: Detectron (Mask R-CNN) detections, fine-tuned on Human3.6M.
- `cpn_ft_h36m_dbb`: Cascaded Pyramid Network detections, fine-tuned on Human3.6M. Bounding boxes from `detectron_ft_h36m`.
- User-supplied (see below).
@@ -48,7 +74,7 @@ The 2D detection source is specified through the `--keypoints` parameter, which
Ground-truth poses (`gt`) have already been extracted by the previous step. The other detections must be downloaded manually (see instructions below). You only need to download the detections you want to use. For reference, our best results on Human3.6M are achieved by `cpn_ft_h36m_dbb`.
### Mask R-CNN and CPN detections
You can download these directly and put them in the `data` directory; you only need `data_2d_h36m_cpn_ft_h36m_dbb.npz` and `data_2d_h36m_detectron_ft_h36m.npz`:
```sh
cd data
wget https://dl.fbaipublicfiles.com/video-pose-3d/data_2d_h36m_cpn_ft_h36m_dbb.npz
wget https://dl.fbaipublicfiles.com/video-pose-3d/data_2d_h36m_detectron_ft_h36m.npz
cd ..
```
These detections have been produced by models fine-tuned on Human3.6M. We adopted the usual protocol of fine-tuning on 5 subjects (S1, S5, S6, S7, and S8). We also included detections from the unlabeled subjects S2, S3, S4, which can be loaded by our framework for semi-supervised experimentation.
Optionally, you can download the Mask R-CNN detections without fine-tuning if you want to experiment with these:
```sh
cd data
wget https://dl.fbaipublicfiles.com/video-pose-3d/data_2d_h36m_detectron_pt_coco.npz
cd ..
```
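To check that a downloaded detections archive is usable, you can load it the same way the training code does. A minimal sketch (the archives are expected to contain `positions_2d` and `metadata` entries; the file name below is one of the downloads above):
```python
# Minimal sketch: inspect a downloaded 2D detections archive.
import numpy as np

archive = np.load('data/data_2d_h36m_cpn_ft_h36m_dbb.npz', allow_pickle=True)
print(archive['metadata'].item())           # keypoint layout information

keypoints = archive['positions_2d'].item()  # subject -> action -> list of per-camera arrays
print(sorted(keypoints.keys()))
cam0 = next(iter(keypoints['S1'].values()))[0]
print(cam0.shape)                           # (num_frames, num_joints, 2)
```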
### Stacked Hourglass detections
These detections (both pretrained and fine-tuned) are provided by [Martinez et al.](https://github.com/una-dinosauria/3d-pose-baseline) in their repository on 3D human pose estimation. The 2D poses produced by the pretrained model are in the same archive as the dataset ([h36m.zip](https://www.dropbox.com/s/e35qv3n6zlkouki/h36m.zip)). The fine-tuned poses can be downloaded [here](https://drive.google.com/open?id=0BxWzojlLp259S2FuUXJ6aUNxZkE). Put the two archives in the `data` directory and run:
@@ -99,4 +132,4 @@ Since HumanEva is very small, we do not fine-tune the pretrained models. As befo
```sh
cd data
wget https://dl.fbaipublicfiles.com/video-pose-3d/data_2d_humaneva15_detectron_pt_coco.npz
cd ..
```
# Inference in the wild
**Update:** we have added support for Detectron2.
In this short tutorial, we show how to run our model on arbitrary videos and visualize the predictions. Note that this feature is only provided for experimentation/research purposes and presents some limitations, as this repository is meant to provide a reference implementation of the approach described in the paper (not production-ready code for inference in the wild).
Our script assumes that a video depicts *exactly* one person. In case of multiple people visible at once, the script will select the person corresponding to the bounding box with the highest confidence, which may cause glitches.
@@ -6,9 +9,9 @@
The instructions below show how to use Detectron to infer 2D keypoints from videos, convert them to a custom dataset for our code, and infer 3D poses. For now, we do not have instructions for CPN. In the last section of this tutorial, we also provide some tips.
## Step 1: setup
The inference script requires `ffmpeg`, which you can easily install via conda, pip, or manually.
Download the [pretrained model](https://dl.fbaipublicfiles.com/video-pose-3d/pretrained_h36m_detectron_coco.bin) for generating 3D predictions. This model is different from the pretrained ones listed in the main README, as it expects input keypoints in COCO format (generated by the pretrained Detectron model) and outputs 3D joint positions in Human3.6M format. Put this model in the `checkpoint` directory of this repo.
**Note:** if you had downloaded `d-pt-243.bin`, you should download the new pretrained model using the link above. `d-pt-243.bin` takes the keypoint probabilities as input (in addition to the x, y coordinates), which causes problems on videos with a different resolution than that of Human3.6M. The new model is only trained on 2D coordinates and works with any resolution/aspect ratio.
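Concretely, the resolution independence comes from normalizing the pixel coordinates by the frame size before they are fed to the network (the repository does this in `common/camera.py`). A rough sketch of that normalization:
```python
# Rough sketch of how 2D keypoints are made resolution-independent before
# being fed to the 3D model (see normalize_screen_coordinates in common/camera.py).
import numpy as np

def normalize_screen_coordinates(X, w, h):
    # Map x from [0, w] to [-1, 1] and scale y by the same factor,
    # so the aspect ratio is preserved for any input resolution.
    assert X.shape[-1] == 2
    return X / w * 2 - [1, h / w]

keypoints = np.random.uniform(0, 1000, size=(100, 17, 2))  # dummy (frames, joints, x/y)
print(normalize_screen_coordinates(keypoints, w=1000, h=1002).shape)
```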
@@ -25,6 +28,26 @@ ffmpeg -i input.mp4 -filter "minterpolate='fps=50'" -crf 0 output.mp4
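The ffmpeg command above resamples a video to 50 fps with the `minterpolate` filter. If you prefer to drive it from Python, here is a minimal sketch (assuming `ffmpeg` is on your PATH; file names are placeholders):
```python
# Minimal sketch: resample a video to 50 fps with ffmpeg's minterpolate filter.
# Assumes ffmpeg is installed and on PATH; file names are placeholders.
import subprocess

subprocess.run([
    'ffmpeg', '-i', 'input.mp4',
    '-vf', 'minterpolate=fps=50',
    '-crf', '0',
    'output.mp4',
], check=True)
```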
## Step 3: inferring 2D keypoints with Detectron
### Using Detectron2 (new)
Set up [Detectron2](https://github.com/facebookresearch/detectron2) and use the script `inference/infer_video_d2.py` (no need to copy this, as it directly uses the Detectron2 API). This script provides a convenient interface to generate 2D keypoint predictions from videos without manually extracting individual frames.
To infer keypoints from all the mp4 videos in `input_directory`, run
```
cd inference
python infer_video_d2.py \
--cfg COCO-Keypoints/keypoint_rcnn_R_101_FPN_3x.yaml \
--output-dir output_directory \
--image-ext mp4 \
input_directory
```
The results will be exported to `output_directory` as custom NumPy archives (`.npz` files). You can change the video extension in `--image-ext` (ffmpeg supports a wide range of formats).
**Note:** although the architecture is the same (ResNet-101), the weights used by the Detectron2 model are not the same as those used by Detectron1. Since our pretrained model was trained on Detectron1 poses, the result might be slightly different (but it should still be pretty close).
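Each output archive follows the Detectron1-style layout written by `infer_video_d2.py` (see the script itself): per-frame `boxes`, `segments`, and `keypoints` lists plus a small `metadata` dictionary with the video resolution. A minimal sketch for inspecting one result (the file name is a placeholder):
```python
# Minimal sketch: inspect one keypoint archive produced by infer_video_d2.py.
import numpy as np

data = np.load('output_directory/video.mp4.npz', allow_pickle=True)  # placeholder name
print(data['metadata'].item())          # {'w': ..., 'h': ...}

boxes, keypoints = data['boxes'], data['keypoints']
print(len(keypoints), 'frames')

bb, kp = boxes[0][1], keypoints[0][1]   # index 1 = 'person' class (Detectron1 convention)
print(np.asarray(bb).shape)             # (num_detections, 5): x1, y1, x2, y2, score
print(np.asarray(kp).shape)             # (num_detections, 4, 17): x, y, dummy logit, prob
```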
### Using Detectron1 (old instructions)
Set up [Detectron](https://github.com/facebookresearch/Detectron) and copy the script `inference/infer_video.py` from this repo to the `tools` directory of the Detectron repo. This script provides a convenient interface to generate 2D keypoint predictions from videos without manually extracting individual frames.
Our Detectron script `infer_video.py` is a simple adaptation of `infer_simple.py` (which works on images) and has a similar command-line syntax.
To infer keypoints from all the mp4 videos in `input_directory`, run
@@ -57,6 +80,6 @@ You can also export the 3D joint positions (in camera space) to a NumPy archive.
## Limitations and tips
- The model was trained on Human3.6M cameras (which are relatively undistorted), and the results may be bad if the intrinsic parameters of the cameras of your videos differ much from those of Human3.6M. This may be particularly noticeable with fisheye cameras, which present a high degree of non-linear lens distortion. If the camera parameters are known, consider preprocessing your videos to match those of Human3.6M as closely as possible.
- If you want multi-person tracking, you should implement a bounding box matching strategy. An example would be to use bipartite matching on the bounding box overlap (IoU) between subsequent frames, but there are many other approaches (a minimal sketch of the IoU idea is given after this list).
- Predictions are relative to the root joint, i.e. the global trajectory is not regressed. If you need it, you may want to use another model to regress it, such as the one we use for semi-supervision.
- Predictions are always in *camera space* (regardless of whether the trajectory is available). For our visualization script, we simply take a random camera from Human3.6M, which fits most videos decently as long as the camera viewport is roughly parallel to the ground.
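As a starting point for the matching strategy mentioned above, here is a minimal sketch (not part of this repository, and assuming SciPy is available) of bipartite IoU matching between the boxes of two consecutive frames, with boxes in `(x1, y1, x2, y2)` format:
```python
# Minimal sketch (not part of this repo): bipartite IoU matching between the
# bounding boxes of two consecutive frames. Boxes are (x1, y1, x2, y2) arrays.
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou_matrix(boxes_a, boxes_b):
    x1 = np.maximum(boxes_a[:, None, 0], boxes_b[None, :, 0])
    y1 = np.maximum(boxes_a[:, None, 1], boxes_b[None, :, 1])
    x2 = np.minimum(boxes_a[:, None, 2], boxes_b[None, :, 2])
    y2 = np.minimum(boxes_a[:, None, 3], boxes_b[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (boxes_a[:, 2] - boxes_a[:, 0]) * (boxes_a[:, 3] - boxes_a[:, 1])
    area_b = (boxes_b[:, 2] - boxes_b[:, 0]) * (boxes_b[:, 3] - boxes_b[:, 1])
    return inter / (area_a[:, None] + area_b[None, :] - inter + 1e-9)

def match_boxes(prev_boxes, curr_boxes, min_iou=0.3):
    # Returns (prev_index, curr_index) pairs linking the same person across frames
    iou = iou_matrix(prev_boxes, curr_boxes)
    rows, cols = linear_sum_assignment(-iou)  # maximize total IoU
    return [(int(r), int(c)) for r, c in zip(rows, cols) if iou[r, c] >= min_iou]

prev = np.array([[10., 10., 100., 200.]])
curr = np.array([[12., 14., 102., 205.], [300., 50., 380., 220.]])
print(match_boxes(prev, curr))  # [(0, 0)]
```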
@@ -88,7 +88,10 @@ def render_animation(keypoints, keypoints_metadata, poses, skeleton, fps, bitrat
        ax.set_xlim3d([-radius/2, radius/2])
        ax.set_zlim3d([0, radius])
        ax.set_ylim3d([-radius/2, radius/2])
        try:
            ax.set_aspect('equal')
        except NotImplementedError:
            # Recent Matplotlib versions do not implement 'equal' aspect for 3D axes
            ax.set_aspect('auto')
        ax.set_xticklabels([])
        ax.set_yticklabels([])
        ax.set_zticklabels([])
@@ -183,9 +186,9 @@ def render_animation(keypoints, keypoints_metadata, poses, skeleton, fps, bitrat
            for n, ax in enumerate(ax_3d):
                pos = poses[n][i]
                lines_3d[n][j-1][0].set_xdata(np.array([pos[j, 0], pos[j_parent, 0]]))
                lines_3d[n][j-1][0].set_ydata(np.array([pos[j, 1], pos[j_parent, 1]]))
                lines_3d[n][j-1][0].set_3d_properties(np.array([pos[j, 2], pos[j_parent, 2]]), zdir='z')

        points.set_offsets(keypoints[i])
@@ -6,7 +6,6 @@
#
import numpy as np
mpii_metadata = {
    'layout_name': 'mpii',
@@ -88,6 +87,7 @@ def import_cpn_poses(path):
def import_sh_poses(path):
    import h5py
    with h5py.File(path) as hf:
        positions = hf['poses'].value
        return positions.astype('float32')
@@ -30,12 +30,17 @@ if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Human3.6M dataset downloader/converter')

    # Convert dataset preprocessed by Martinez et al. in https://github.com/una-dinosauria/3d-pose-baseline
    parser.add_argument('--from-archive', default='', type=str, metavar='PATH', help='convert preprocessed dataset')

    # Convert dataset from original source, using files converted to .mat (the Human3.6M dataset path must be specified manually)
    # This option requires MATLAB to convert files using the provided script
    parser.add_argument('--from-source', default='', type=str, metavar='PATH', help='convert original dataset')

    # Convert dataset from original source, using original .cdf files (the Human3.6M dataset path must be specified manually)
    # This option does not require MATLAB, but the Python library cdflib must be installed
    parser.add_argument('--from-source-cdf', default='', type=str, metavar='PATH', help='convert original dataset')

    args = parser.parse_args()

    if args.from_archive and args.from_source:
@@ -106,6 +111,36 @@ if __name__ == '__main__':
        print('Done.')

    elif args.from_source_cdf:
        print('Converting original Human3.6M dataset from', args.from_source_cdf, '(CDF files)')
        output = {}

        import cdflib

        for subject in subjects:
            output[subject] = {}
            file_list = glob(args.from_source_cdf + '/' + subject + '/MyPoseFeatures/D3_Positions/*.cdf')
            assert len(file_list) == 30, "Expected 30 files for subject " + subject + ", got " + str(len(file_list))
            for f in file_list:
                action = os.path.splitext(os.path.basename(f))[0]

                if subject == 'S11' and action == 'Directions':
                    continue # Discard corrupted video

                # Use consistent naming convention
                canonical_name = action.replace('TakingPhoto', 'Photo') \
                                       .replace('WalkingDog', 'WalkDog')

                hf = cdflib.CDF(f)
                positions = hf['Pose'].reshape(-1, 32, 3)
                positions /= 1000 # Meters instead of millimeters
                output[subject][canonical_name] = positions.astype('float32')

        print('Saving...')
        np.savez_compressed(output_filename, positions_3d=output)

        print('Done.')

    else:
        print('Please specify the dataset source')
        exit(0)
# Copyright (c) 2018-present, Facebook, Inc.
# All rights reserved.
#
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
#
"""Perform inference on a single video or all videos with a certain extension
(e.g., .mp4) in a folder.
"""
import detectron2
from detectron2.utils.logger import setup_logger
from detectron2.config import get_cfg
from detectron2 import model_zoo
from detectron2.engine import DefaultPredictor
import subprocess as sp
import numpy as np
import time
import argparse
import sys
import os
import glob
def parse_args():
    parser = argparse.ArgumentParser(description='End-to-end inference')
    parser.add_argument(
        '--cfg',
        dest='cfg',
        help='cfg model file (/path/to/model_config.yaml)',
        default=None,
        type=str
    )
    parser.add_argument(
        '--output-dir',
        dest='output_dir',
        help='directory for visualization pdfs (default: /tmp/infer_simple)',
        default='/tmp/infer_simple',
        type=str
    )
    parser.add_argument(
        '--image-ext',
        dest='image_ext',
        help='image file name extension (default: mp4)',
        default='mp4',
        type=str
    )
    parser.add_argument(
        'im_or_folder', help='image or folder of images', default=None
    )
    if len(sys.argv) == 1:
        parser.print_help()
        sys.exit(1)
    return parser.parse_args()
def get_resolution(filename):
    command = ['ffprobe', '-v', 'error', '-select_streams', 'v:0',
               '-show_entries', 'stream=width,height', '-of', 'csv=p=0', filename]
    pipe = sp.Popen(command, stdout=sp.PIPE, bufsize=-1)
    for line in pipe.stdout:
        w, h = line.decode().strip().split(',')
        return int(w), int(h)
def read_video(filename):
    w, h = get_resolution(filename)

    command = ['ffmpeg',
               '-i', filename,
               '-f', 'image2pipe',
               '-pix_fmt', 'bgr24',
               '-vsync', '0',
               '-vcodec', 'rawvideo', '-']

    pipe = sp.Popen(command, stdout=sp.PIPE, bufsize=-1)
    while True:
        data = pipe.stdout.read(w*h*3)
        if not data:
            break
        yield np.frombuffer(data, dtype='uint8').reshape((h, w, 3))
def main(args):
    cfg = get_cfg()
    cfg.merge_from_file(model_zoo.get_config_file(args.cfg))
    cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.7
    cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(args.cfg)
    predictor = DefaultPredictor(cfg)

    if os.path.isdir(args.im_or_folder):
        im_list = glob.iglob(args.im_or_folder + '/*.' + args.image_ext)
    else:
        im_list = [args.im_or_folder]

    for video_name in im_list:
        out_name = os.path.join(
            args.output_dir, os.path.basename(video_name)
        )
        print('Processing {}'.format(video_name))

        boxes = []
        segments = []
        keypoints = []

        for frame_i, im in enumerate(read_video(video_name)):
            t = time.time()
            outputs = predictor(im)['instances'].to('cpu')
            print('Frame {} processed in {:.3f}s'.format(frame_i, time.time() - t))

            has_bbox = False
            if outputs.has('pred_boxes'):
                bbox_tensor = outputs.pred_boxes.tensor.numpy()
                if len(bbox_tensor) > 0:
                    has_bbox = True
                    scores = outputs.scores.numpy()[:, None]
                    bbox_tensor = np.concatenate((bbox_tensor, scores), axis=1)
            if has_bbox:
                kps = outputs.pred_keypoints.numpy()
                kps_xy = kps[:, :, :2]
                kps_prob = kps[:, :, 2:3]
                kps_logit = np.zeros_like(kps_prob) # Dummy
                kps = np.concatenate((kps_xy, kps_logit, kps_prob), axis=2)
                kps = kps.transpose(0, 2, 1)
            else:
                kps = []
                bbox_tensor = []

            # Mimic Detectron1 format
            cls_boxes = [[], bbox_tensor]
            cls_keyps = [[], kps]

            boxes.append(cls_boxes)
            segments.append(None)
            keypoints.append(cls_keyps)

        # Video resolution
        metadata = {
            'w': im.shape[1],
            'h': im.shape[0],
        }

        np.savez_compressed(out_name, boxes=boxes, segments=segments, keypoints=keypoints, metadata=metadata)
if __name__ == '__main__':
    setup_logger()
    args = parse_args()
    main(args)
@@ -209,6 +209,18 @@ if args.resume or args.evaluate:
    model_pos_train.load_state_dict(checkpoint['model_pos'])
    model_pos.load_state_dict(checkpoint['model_pos'])

    if args.evaluate and 'model_traj' in checkpoint:
        # Load trajectory model if it is contained in the checkpoint (e.g. for inference in the wild)
        model_traj = TemporalModel(poses_valid_2d[0].shape[-2], poses_valid_2d[0].shape[-1], 1,
                                   filter_widths=filter_widths, causal=args.causal, dropout=args.dropout, channels=args.channels,
                                   dense=args.dense)
        if torch.cuda.is_available():
            model_traj = model_traj.cuda()
        model_traj.load_state_dict(checkpoint['model_traj'])
    else:
        model_traj = None

test_generator = UnchunkedGenerator(cameras_valid, poses_valid, poses_valid_2d,
                                    pad=pad, causal_shift=causal_shift, augment=False,
                                    kps_left=kps_left, kps_right=kps_right, joints_left=joints_left, joints_right=joints_right)
@@ -637,13 +649,16 @@ if not args.evaluate:
        plt.close('all')

# Evaluate
def evaluate(test_generator, action=None, return_predictions=False, use_trajectory_model=False):
    epoch_loss_3d_pos = 0
    epoch_loss_3d_pos_procrustes = 0
    epoch_loss_3d_pos_scale = 0
    epoch_loss_3d_vel = 0
    with torch.no_grad():
        if not use_trajectory_model:
            model_pos.eval()
        else:
            model_traj.eval()
        N = 0
        for _, batch, batch_2d in test_generator.next_epoch():
            inputs_2d = torch.from_numpy(batch_2d.astype('float32'))
@@ -651,13 +666,17 @@ def evaluate(test_generator, action=None, return_predictions=False):
                inputs_2d = inputs_2d.cuda()

            # Positional model
            if not use_trajectory_model:
                predicted_3d_pos = model_pos(inputs_2d)
            else:
                predicted_3d_pos = model_traj(inputs_2d)

            # Test-time augmentation (if enabled)
            if test_generator.augment_enabled():
                # Undo flipping and take average with non-flipped version
                predicted_3d_pos[1, :, :, 0] *= -1
                if not use_trajectory_model:
                    predicted_3d_pos[1, :, joints_left + joints_right] = predicted_3d_pos[1, :, joints_right + joints_left]
                predicted_3d_pos = torch.mean(predicted_3d_pos, dim=0, keepdim=True)

            if return_predictions:
@@ -717,6 +736,9 @@ if args.render:
                             pad=pad, causal_shift=causal_shift, augment=args.test_time_augmentation,
                             kps_left=kps_left, kps_right=kps_right, joints_left=joints_left, joints_right=joints_right)
    prediction = evaluate(gen, return_predictions=True)
    if model_traj is not None and ground_truth is None:
        prediction_traj = evaluate(gen, return_predictions=True, use_trajectory_model=True)
        prediction += prediction_traj

    if args.viz_export is not None:
        print('Exporting joint positions to', args.viz_export)