Commit 67e6cf2f authored by Dario Pavllo

Add preliminary support for inference in the wild

Parent 11e90ad9
......@@ -68,6 +68,7 @@ Here is a list of the command-line arguments related to visualization:
- `--viz-limit`: render only first N frames. By default, all frames are rendered.
- `--viz-downsample`: downsample videos by the specified factor, i.e. reduce the frame rate. E.g. if set to `2`, the frame rate is reduced from 50 FPS to 25 FPS. Default: `1` (no downsampling).
- `--viz-size`: output resolution multiplier. Higher = larger images. Default: `5`.
- `--viz-export`: export 3D joint coordinates (in camera space) to the specified NumPy archive.
Example:
```
......
# Inference in the wild
In this short tutorial, we show how to run our model on arbitrary videos and visualize the predictions. Note that this feature is provided for experimentation/research purposes only and has some limitations, as this repository is meant to provide a reference implementation of the approach described in the paper (not production-ready code for inference in the wild).
Our script assumes that a video depicts *exactly* one person. If multiple people are visible at once, the script selects the person corresponding to the bounding box with the highest confidence, which may cause glitches.
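For reference, that selection boils down to an argmax over the per-frame detection confidences. The sketch below is purely illustrative (the helper name and array layout are assumptions); the actual logic lives in `data/prepare_data_2d_custom.py`, shown later in this diff.
```
import numpy as np

def pick_best_person(person_boxes, person_keypoints):
    """Illustrative helper: keep only the highest-confidence detection of a frame.
    person_boxes is an (N, 5) array of [x1, y1, x2, y2, score] rows and
    person_keypoints is a list of N keypoint arrays (one per detection)."""
    if len(person_boxes) == 0:
        return None, None  # Missed detection; such frames are interpolated later
    best = np.argmax(person_boxes[:, 4])  # Column 4 holds the confidence score
    return person_boxes[best, :4], person_keypoints[best]
```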
The instructions below show how to use Detectron to infer 2D keypoints from videos, convert them to a custom dataset for our code, and infer 3D poses. For now, we do not have instructions for CPN. In the last section of this tutorial, we also provide some tips.
## Step 1: setup
Set up [Detectron](https://github.com/facebookresearch/Detectron) and copy the script `inference/infer_video.py` from this repo to the `tools` directory of the Detectron repo. This script, which requires `ffmpeg` to be installed on your system, provides a convenient interface to generate 2D keypoint predictions from videos without manually extracting individual frames.
Next, download the [pretrained model](https://dl.fbaipublicfiles.com/video-pose-3d/pretrained_h36m_detectron_coco.bin) for generating 3D predictions. This model is different from the pretrained ones listed in the main README, as it expects input keypoints in COCO format (generated by the pretrained Detectron model) and outputs 3D joint positions in Human3.6M format. Put this model in the `checkpoint` directory of this repo.
**Note:** if you previously downloaded `d-pt-243.bin`, you should download the new pretrained model using the link above. `d-pt-243.bin` takes the keypoint probabilities as input (in addition to the x, y coordinates), which causes problems on videos with a resolution different from that of Human3.6M. The new model is trained on 2D coordinates only and works with any resolution/aspect ratio.
## Step 2 (optional): video preprocessing
Since the script expects a single-person scenario, you may want to extract a portion of your video. This is very easy to do with ffmpeg, e.g.
```
ffmpeg -i input.mp4 -ss 1:00 -to 1:30 -c copy output.mp4
```
extracts a clip from minute 1:00 to minute 1:30 of `input.mp4`, and exports it to `output.mp4`.
Optionally, you can also adapt the frame rate of the video. Most videos have a frame rate of about 25 FPS, but our Human3.6M model was trained on 50-FPS videos. Since our model is robust to alterations in speed, this step is not very important and can be skipped, but if you want the best possible results you can use ffmpeg again for this task:
```
ffmpeg -i input.mp4 -filter "minterpolate='fps=50'" -crf 0 output.mp4
```
## Step 3: inferring 2D keypoints with Detectron
Our Detectron script `infer_video.py` is a simple adaptation of `infer_simple.py` (which works on images) and has a similar command-line syntax.
To infer keypoints from all the mp4 videos in `input_directory`, run
```
python tools/infer_video.py \
--cfg configs/12_2017_baselines/e2e_keypoint_rcnn_R-101-FPN_s1x.yaml \
--output-dir output_directory \
--image-ext mp4 \
--wts https://dl.fbaipublicfiles.com/detectron/37698009/12_2017_baselines/e2e_keypoint_rcnn_R-101-FPN_s1x.yaml.08_45_57.YkrJgP6O/output/train/keypoints_coco_2014_train:keypoints_coco_2014_valminusminival/generalized_rcnn/model_final.pkl \
input_directory
```
The results will be exported to `output_directory` as custom NumPy archives (`.npz` files). You can change the video extension in `--image-ext` (ffmpeg supports a wide range of formats).
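If you want to sanity-check the raw output before moving on, each archive can be opened directly with NumPy. A minimal sketch, assuming a clip named `my_clip.mp4` was processed (the keys mirror what `infer_video.py` saves, as shown later in this diff):
```
import numpy as np

# Inspect one per-video archive written by infer_video.py (path is an assumption).
data = np.load('output_directory/my_clip.mp4.npz', encoding='latin1', allow_pickle=True)

print(data['metadata'].item())        # {'w': ..., 'h': ...} -> video resolution
print(len(data['boxes']))             # number of processed frames

# Index 1 is the 'person' class in Detectron's per-class outputs
frame0_boxes = data['boxes'][0][1]    # rows are [x1, y1, x2, y2, score]
frame0_kps = data['keypoints'][0][1]  # one (4, 17) keypoint array per detection
if len(frame0_boxes) > 0:
    print(frame0_boxes.shape, frame0_kps[0].shape)
```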
## Step 4: creating a custom dataset
Run our dataset preprocessing script from the `data` directory:
```
python prepare_data_2d_custom.py -i /path/to/detections/output_directory -o myvideos
```
This creates a custom dataset named `myvideos` (which contains all the videos in `output_directory`, each mapped to a different subject) and saves it to `data_2d_custom_myvideos.npz`. You are free to specify any name for the dataset.
**Note:** as mentioned, the script will take the bounding box with the highest probability for each frame. If a particular frame has no bounding boxes, it is assumed to be a missed detection and the keypoints will be interpolated from neighboring frames.
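As a quick check, the resulting dataset archive can also be inspected with NumPy. A minimal sketch (key names follow the `np.savez_compressed` call in `prepare_data_2d_custom.py`, shown later in this diff):
```
import numpy as np

# Inspect the custom 2D dataset created in this step.
archive = np.load('data_2d_custom_myvideos.npz', allow_pickle=True)
metadata = archive['metadata'].item()
positions_2d = archive['positions_2d'].item()

print(metadata['layout_name'])  # 'coco'
for video_name, actions in positions_2d.items():
    keypoints = actions['custom'][0]                    # (num_frames, 17, 2) pixel coordinates
    resolution = metadata['video_metadata'][video_name]
    print(video_name, keypoints.shape, resolution['w'], resolution['h'])
```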
## Step 5: rendering a custom video and exporting coordinates
You can finally use the visualization feature to render a video of the 3D joint predictions. You must specify the `custom` dataset (`-d custom`), the input keypoints as exported in the previous step (`-k myvideos`), the correct architecture/checkpoint, and the action `custom` (`--viz-action custom`). The subject is the file name of the input video, and the camera is always 0.
```
python run.py -d custom -k myvideos -arc 3,3,3,3,3 -c checkpoint --evaluate pretrained_h36m_detectron_coco.bin --render --viz-subject input_video.mp4 --viz-action custom --viz-camera 0 --viz-video /path/to/input_video.mp4 --viz-output output.mp4 --viz-size 6
```
You can also export the 3D joint positions (in camera space) to a NumPy archive. To this end, replace `--viz-output` with `--viz-export` and specify the file name.
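The exported file is a plain NumPy array that can be loaded back directly. A minimal sketch, assuming the coordinates were exported to `output_coords` (`run.py` uses `np.save`, so the file gets a `.npy` extension):
```
import numpy as np

# Load the 3D joint coordinates exported via --viz-export (file name is an assumption).
prediction = np.load('output_coords.npy')
print(prediction.shape)  # (num_frames, 17, 3): 17 Human3.6M joints in camera space
print(prediction[0, 0])  # root joint of the first frame (near the origin, since the trajectory is not regressed)
```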
## Limitations and tips
- The model was trained on Human3.6M cameras (which are relatively undistorted), and the results may be poor if the intrinsic parameters of your cameras differ significantly from those of Human3.6M. This is particularly noticeable with fisheye cameras, which exhibit a high degree of non-linear lens distortion. If the camera parameters are known, consider preprocessing your videos to match those of Human3.6M as closely as possible.
- If you want multi-person tracking, you should implement a bounding box matching strategy. An example would be to use bipartite matching on the bounding box overlap (IoU) between subsequent frames, but there are many other approaches.
- Predictions are relative to the root joint, i.e. the global trajectory is not regressed. If you need it, you may want to use another model to regress it, such as the one we use for semi-supervision.
- Predictions are always in *camera space* (regardless of whether the trajectory is available). For our visualization script, we simply take a random camera from Human3.6M, which fits most videos reasonably well as long as the camera viewport is roughly parallel to the ground.
\ No newline at end of file
......@@ -65,6 +65,9 @@ python run.py -d humaneva15 -k detectron_pt_coco -str Train/S1,Train/S2,Train/S3
[`DOCUMENTATION.md`](DOCUMENTATION.md) provides a precise description of all command-line arguments.
### Inference in the wild
We have introduced an experimental feature to run our model on custom videos. See [`INFERENCE.md`](INFERENCE.md) for more details.
### Training from scratch
If you want to reproduce the results of our pretrained models, run the following commands.
......
......@@ -65,6 +65,7 @@ def parse_args():
parser.add_argument('--viz-video', type=str, metavar='PATH', help='path to input video')
parser.add_argument('--viz-skip', type=int, default=0, metavar='N', help='skip first N frames of input video')
parser.add_argument('--viz-output', type=str, metavar='PATH', help='output file name (.gif or .mp4)')
parser.add_argument('--viz-export', type=str, metavar='PATH', help='output file name for coordinates')
parser.add_argument('--viz-bitrate', type=int, default=3000, metavar='N', help='bitrate for mp4 videos')
parser.add_argument('--viz-no-ground-truth', action='store_true', help='do not show ground-truth poses')
parser.add_argument('--viz-limit', type=int, default=-1, metavar='N', help='only render first N frames')
......
# Copyright (c) 2018-present, Facebook, Inc.
# All rights reserved.
#
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
#
import numpy as np
import copy
from common.skeleton import Skeleton
from common.mocap_dataset import MocapDataset
from common.camera import normalize_screen_coordinates, image_coordinates
from common.h36m_dataset import h36m_skeleton
custom_camera_params = {
'id': None,
'res_w': None, # Pulled from metadata
'res_h': None, # Pulled from metadata
# Dummy camera parameters (taken from Human3.6M), only for visualization purposes
'azimuth': 70, # Only used for visualization
'orientation': [0.1407056450843811, -0.1500701755285263, -0.755240797996521, 0.6223280429840088],
'translation': [1841.1070556640625, 4955.28466796875, 1563.4454345703125],
}
class CustomDataset(MocapDataset):
def __init__(self, detections_path, remove_static_joints=True):
super().__init__(fps=None, skeleton=h36m_skeleton)
# Load serialized dataset
data = np.load(detections_path, allow_pickle=True)
resolutions = data['metadata'].item()['video_metadata']
self._cameras = {}
self._data = {}
for video_name, res in resolutions.items():
cam = {}
cam.update(custom_camera_params)
cam['orientation'] = np.array(cam['orientation'], dtype='float32')
cam['translation'] = np.array(cam['translation'], dtype='float32')
cam['translation'] = cam['translation']/1000 # mm to meters
cam['id'] = video_name
cam['res_w'] = res['w']
cam['res_h'] = res['h']
self._cameras[video_name] = [cam]
self._data[video_name] = {
'custom': {
'cameras': cam
}
}
if remove_static_joints:
# Bring the skeleton to 17 joints instead of the original 32
self.remove_joints([4, 5, 9, 10, 11, 16, 20, 21, 22, 23, 24, 28, 29, 30, 31])
# Rewire shoulders to the correct parents
self._skeleton._parents[11] = 8
self._skeleton._parents[14] = 8
def supports_semi_supervised(self):
return False
\ No newline at end of file
......@@ -20,7 +20,8 @@ class MocapDataset:
for subject in self._data.keys():
for action in self._data[subject].keys():
s = self._data[subject][action]
s['positions'] = s['positions'][:, kept_joints]
if 'positions' in s:
s['positions'] = s['positions'][:, kept_joints]
def __getitem__(self, key):
......
......@@ -21,6 +21,14 @@ def get_resolution(filename):
for line in pipe.stdout:
w, h = line.decode().strip().split(',')
return int(w), int(h)
def get_fps(filename):
command = ['ffprobe', '-v', 'error', '-select_streams', 'v:0',
'-show_entries', 'stream=r_frame_rate', '-of', 'csv=p=0', filename]
with sp.Popen(command, stdout=sp.PIPE, bufsize=-1) as pipe:
for line in pipe.stdout:
a, b = line.decode().strip().split('/')
return int(a) / int(b)
def read_video(filename, skip=0, limit=-1):
w, h = get_resolution(filename)
......@@ -39,10 +47,11 @@ def read_video(filename, skip=0, limit=-1):
if not data:
break
i += 1
if i > limit and limit != -1:
continue
if i > skip:
yield np.frombuffer(data, dtype='uint8').reshape((h, w, 3))
if i == limit:
break
......@@ -50,7 +59,7 @@ def downsample_tensor(X, factor):
length = X.shape[0]//factor * factor
return np.mean(X[:length].reshape(-1, factor, *X.shape[1:]), axis=1)
def render_animation(keypoints, poses, skeleton, fps, bitrate, azim, output, viewport,
def render_animation(keypoints, keypoints_metadata, poses, skeleton, fps, bitrate, azim, output, viewport,
limit=-1, downsample=1, size=6, input_video_path=None, input_video_skip=0):
"""
TODO
......@@ -97,10 +106,17 @@ def render_animation(keypoints, poses, skeleton, fps, bitrate, azim, output, vie
else:
# Load video using ffmpeg
all_frames = []
for f in read_video(input_video_path, skip=input_video_skip):
for f in read_video(input_video_path, skip=input_video_skip, limit=limit):
all_frames.append(f)
effective_length = min(keypoints.shape[0], len(all_frames))
all_frames = all_frames[:effective_length]
keypoints = keypoints[input_video_skip:] # todo remove
for idx in range(len(poses)):
poses[idx] = poses[idx][input_video_skip:]
if fps is None:
fps = get_fps(input_video_path)
if downsample > 1:
keypoints = downsample_tensor(keypoints, downsample)
......@@ -129,6 +145,9 @@ def render_animation(keypoints, poses, skeleton, fps, bitrate, azim, output, vie
ax.set_ylim3d([-radius/2 + trajectories[n][i, 1], radius/2 + trajectories[n][i, 1]])
# Update 2D poses
joints_right_2d = keypoints_metadata['keypoints_symmetry'][1]
colors_2d = np.full(keypoints.shape[1], 'black')
colors_2d[joints_right_2d] = 'red'
if not initialized:
image = ax_in.imshow(all_frames[i], aspect='equal')
......@@ -136,7 +155,7 @@ def render_animation(keypoints, poses, skeleton, fps, bitrate, azim, output, vie
if j_parent == -1:
continue
if len(parents) == keypoints.shape[1]:
if len(parents) == keypoints.shape[1] and keypoints_metadata['layout_name'] != 'coco':
# Draw skeleton only if keypoints match (otherwise we don't have the parents definition)
lines.append(ax_in.plot([keypoints[i, j, 0], keypoints[i, j_parent, 0]],
[keypoints[i, j, 1], keypoints[i, j_parent, 1]], color='pink'))
......@@ -148,7 +167,7 @@ def render_animation(keypoints, poses, skeleton, fps, bitrate, azim, output, vie
[pos[j, 1], pos[j_parent, 1]],
[pos[j, 2], pos[j_parent, 2]], zdir='z', c=col))
points = ax_in.scatter(*keypoints[i].T, 5, color='red', edgecolors='white', zorder=10)
points = ax_in.scatter(*keypoints[i].T, 10, color=colors_2d, edgecolors='white', zorder=10)
initialized = True
else:
......@@ -158,7 +177,7 @@ def render_animation(keypoints, poses, skeleton, fps, bitrate, azim, output, vie
if j_parent == -1:
continue
if len(parents) == keypoints.shape[1]:
if len(parents) == keypoints.shape[1] and keypoints_metadata['layout_name'] != 'coco':
lines[j-1][0].set_data([keypoints[i, j, 0], keypoints[i, j_parent, 0]],
[keypoints[i, j, 1], keypoints[i, j_parent, 1]])
......
# Copyright (c) 2018-present, Facebook, Inc.
# All rights reserved.
#
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
#
import numpy as np
from glob import glob
import os
import sys
import argparse
from data_utils import suggest_metadata
output_prefix_2d = 'data_2d_custom_'
def decode(filename):
# Latin1 encoding because Detectron runs on Python 2.7
print('Processing {}'.format(filename))
data = np.load(filename, encoding='latin1', allow_pickle=True)
bb = data['boxes']
kp = data['keypoints']
metadata = data['metadata'].item()
results_bb = []
results_kp = []
for i in range(len(bb)):
if len(bb[i][1]) == 0 or len(kp[i][1]) == 0:
# No bbox/keypoints detected for this frame -> will be interpolated
results_bb.append(np.full(4, np.nan, dtype=np.float32)) # 4 bounding box coordinates
results_kp.append(np.full((17, 4), np.nan, dtype=np.float32)) # 17 COCO keypoints
continue
best_match = np.argmax(bb[i][1][:, 4])
best_bb = bb[i][1][best_match, :4]
best_kp = kp[i][1][best_match].T.copy()
results_bb.append(best_bb)
results_kp.append(best_kp)
bb = np.array(results_bb, dtype=np.float32)
kp = np.array(results_kp, dtype=np.float32)
kp = kp[:, :, :2] # Extract (x, y)
# Fix missing bboxes/keypoints by linear interpolation
mask = ~np.isnan(bb[:, 0])
indices = np.arange(len(bb))
for i in range(4):
bb[:, i] = np.interp(indices, indices[mask], bb[mask, i])
for i in range(17):
for j in range(2):
kp[:, i, j] = np.interp(indices, indices[mask], kp[mask, i, j])
print('{} total frames processed'.format(len(bb)))
print('{} frames were interpolated'.format(np.sum(~mask)))
print('----------')
return [{
'start_frame': 0, # Inclusive
'end_frame': len(kp), # Exclusive
'bounding_boxes': bb,
'keypoints': kp,
}], metadata
if __name__ == '__main__':
if os.path.basename(os.getcwd()) != 'data':
print('This script must be launched from the "data" directory')
exit(0)
parser = argparse.ArgumentParser(description='Custom dataset creator')
parser.add_argument('-i', '--input', type=str, default='', metavar='PATH', help='detections directory')
parser.add_argument('-o', '--output', type=str, default='', metavar='PATH', help='output suffix for 2D detections')
args = parser.parse_args()
if not args.input:
print('Please specify the input directory')
exit(0)
if not args.output:
print('Please specify an output suffix (e.g. detectron_pt_coco)')
exit(0)
print('Parsing 2D detections from', args.input)
metadata = suggest_metadata('coco')
metadata['video_metadata'] = {}
output = {}
file_list = glob(args.input + '/*.npz')
for f in file_list:
canonical_name = os.path.splitext(os.path.basename(f))[0]
data, video_metadata = decode(f)
output[canonical_name] = {}
output[canonical_name]['custom'] = [data[0]['keypoints'].astype('float32')]
metadata['video_metadata'][canonical_name] = video_metadata
print('Saving...')
np.savez_compressed(output_prefix_2d + args.output, positions_2d=output, metadata=metadata)
print('Done.')
\ No newline at end of file
# Copyright (c) 2018-present, Facebook, Inc.
# All rights reserved.
#
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
#
"""Perform inference on a single video or all videos with a certain extension
(e.g., .mp4) in a folder.
"""
from infer_simple import *
import subprocess as sp
import numpy as np
def get_resolution(filename):
command = ['ffprobe', '-v', 'error', '-select_streams', 'v:0',
'-show_entries', 'stream=width,height', '-of', 'csv=p=0', filename]
pipe = sp.Popen(command, stdout=sp.PIPE, bufsize=-1)
for line in pipe.stdout:
w, h = line.decode().strip().split(',')
return int(w), int(h)
def read_video(filename):
w, h = get_resolution(filename)
command = ['ffmpeg',
'-i', filename,
'-f', 'image2pipe',
'-pix_fmt', 'bgr24',
'-vsync', '0',
'-vcodec', 'rawvideo', '-']
pipe = sp.Popen(command, stdout=sp.PIPE, bufsize=-1)
while True:
data = pipe.stdout.read(w*h*3)
if not data:
break
yield np.frombuffer(data, dtype='uint8').reshape((h, w, 3))
def main(args):
logger = logging.getLogger(__name__)
merge_cfg_from_file(args.cfg)
cfg.NUM_GPUS = 1
args.weights = cache_url(args.weights, cfg.DOWNLOAD_CACHE)
assert_and_infer_cfg(cache_urls=False)
model = infer_engine.initialize_model_from_cfg(args.weights)
dummy_coco_dataset = dummy_datasets.get_coco_dataset()
if os.path.isdir(args.im_or_folder):
im_list = glob.iglob(args.im_or_folder + '/*.' + args.image_ext)
else:
im_list = [args.im_or_folder]
for video_name in im_list:
out_name = os.path.join(
args.output_dir, os.path.basename(video_name)
)
print('Processing {}'.format(video_name))
boxes = []
segments = []
keypoints = []
for frame_i, im in enumerate(read_video(video_name)):
logger.info('Frame {}'.format(frame_i))
timers = defaultdict(Timer)
t = time.time()
with c2_utils.NamedCudaScope(0):
cls_boxes, cls_segms, cls_keyps = infer_engine.im_detect_all(
model, im, None, timers=timers
)
logger.info('Inference time: {:.3f}s'.format(time.time() - t))
for k, v in timers.items():
logger.info(' | {}: {:.3f}s'.format(k, v.average_time))
boxes.append(cls_boxes)
segments.append(cls_segms)
keypoints.append(cls_keyps)
# Video resolution
metadata = {
'w': im.shape[1],
'h': im.shape[0],
}
np.savez_compressed(out_name, boxes=boxes, segments=segments, keypoints=keypoints, metadata=metadata)
if __name__ == '__main__':
workspace.GlobalInit(['caffe2', '--caffe2_log_level=0'])
setup_logging(__name__)
args = parse_args()
main(args)
......@@ -42,6 +42,9 @@ if args.dataset == 'h36m':
elif args.dataset.startswith('humaneva'):
from common.humaneva_dataset import HumanEvaDataset
dataset = HumanEvaDataset(dataset_path)
elif args.dataset.startswith('custom'):
from common.custom_dataset import CustomDataset
dataset = CustomDataset('data/data_2d_' + args.dataset + '_' + args.keypoints + '.npz')
else:
raise KeyError('Invalid dataset')
......@@ -50,16 +53,18 @@ for subject in dataset.subjects():
for action in dataset[subject].keys():
anim = dataset[subject][action]
positions_3d = []
for cam in anim['cameras']:
pos_3d = world_to_camera(anim['positions'], R=cam['orientation'], t=cam['translation'])
pos_3d[:, 1:] -= pos_3d[:, :1] # Remove global offset, but keep trajectory in first position
positions_3d.append(pos_3d)
anim['positions_3d'] = positions_3d
if 'positions' in anim:
positions_3d = []
for cam in anim['cameras']:
pos_3d = world_to_camera(anim['positions'], R=cam['orientation'], t=cam['translation'])
pos_3d[:, 1:] -= pos_3d[:, :1] # Remove global offset, but keep trajectory in first position
positions_3d.append(pos_3d)
anim['positions_3d'] = positions_3d
print('Loading 2D detections...')
keypoints = np.load('data/data_2d_' + args.dataset + '_' + args.keypoints + '.npz', allow_pickle=True)
keypoints_symmetry = keypoints['metadata'].item()['keypoints_symmetry']
keypoints_metadata = keypoints['metadata'].item()
keypoints_symmetry = keypoints_metadata['keypoints_symmetry']
kps_left, kps_right = list(keypoints_symmetry[0]), list(keypoints_symmetry[1])
joints_left, joints_right = list(dataset.skeleton().joints_left()), list(dataset.skeleton().joints_right())
keypoints = keypoints['positions_2d'].item()
......@@ -68,6 +73,9 @@ for subject in dataset.subjects():
assert subject in keypoints, 'Subject {} is missing from the 2D detections dataset'.format(subject)
for action in dataset[subject].keys():
assert action in keypoints[subject], 'Action {} of subject {} is missing from the 2D detections dataset'.format(action, subject)
if 'positions_3d' not in dataset[subject][action]:
continue
for cam_idx in range(len(keypoints[subject][action])):
# We check for >= instead of == because some videos in H3.6M contain extra frames
......@@ -90,7 +98,10 @@ for subject in keypoints.keys():
subjects_train = args.subjects_train.split(',')
subjects_semi = [] if not args.subjects_unlabeled else args.subjects_unlabeled.split(',')
subjects_test = args.subjects_test.split(',')
if not args.render:
subjects_test = args.subjects_test.split(',')
else:
subjects_test = [args.viz_subject]
semi_supervised = len(subjects_semi) > 0
if semi_supervised and not dataset.supports_semi_supervised():
......@@ -160,15 +171,15 @@ cameras_valid, poses_valid, poses_valid_2d = fetch(subjects_test, action_filter)
filter_widths = [int(x) for x in args.architecture.split(',')]
if not args.disable_optimizations and not args.dense and args.stride == 1:
# Use optimized model for single-frame predictions
model_pos_train = TemporalModelOptimized1f(poses_valid_2d[0].shape[-2], poses_valid_2d[0].shape[-1], poses_valid[0].shape[-2],
model_pos_train = TemporalModelOptimized1f(poses_valid_2d[0].shape[-2], poses_valid_2d[0].shape[-1], dataset.skeleton().num_joints(),
filter_widths=filter_widths, causal=args.causal, dropout=args.dropout, channels=args.channels)
else:
# When incompatible settings are detected (stride > 1, dense filters, or disabled optimization) fall back to normal model
model_pos_train = TemporalModel(poses_valid_2d[0].shape[-2], poses_valid_2d[0].shape[-1], poses_valid[0].shape[-2],
model_pos_train = TemporalModel(poses_valid_2d[0].shape[-2], poses_valid_2d[0].shape[-1], dataset.skeleton().num_joints(),
filter_widths=filter_widths, causal=args.causal, dropout=args.dropout, channels=args.channels,
dense=args.dense)
model_pos = TemporalModel(poses_valid_2d[0].shape[-2], poses_valid_2d[0].shape[-1], poses_valid[0].shape[-2],
model_pos = TemporalModel(poses_valid_2d[0].shape[-2], poses_valid_2d[0].shape[-1], dataset.skeleton().num_joints(),
filter_widths=filter_widths, causal=args.causal, dropout=args.dropout, channels=args.channels,
dense=args.dense)
......@@ -695,10 +706,11 @@ if args.render:
print('Rendering...')
input_keypoints = keypoints[args.viz_subject][args.viz_action][args.viz_camera].copy()
ground_truth = None
if args.viz_subject in dataset.subjects() and args.viz_action in dataset[args.viz_subject]:
ground_truth = dataset[args.viz_subject][args.viz_action]['positions_3d'][args.viz_camera].copy()
else:
ground_truth = None
if 'positions_3d' in dataset[args.viz_subject][args.viz_action]:
ground_truth = dataset[args.viz_subject][args.viz_action]['positions_3d'][args.viz_camera].copy()
if ground_truth is None:
print('INFO: this action is unlabeled. Ground truth will not be rendered.')
gen = UnchunkedGenerator(None, None, [input_keypoints],
......@@ -706,40 +718,46 @@ if args.render:
kps_left=kps_left, kps_right=kps_right, joints_left=joints_left, joints_right=joints_right)
prediction = evaluate(gen, return_predictions=True)
if ground_truth is not None:
# Reapply trajectory
trajectory = ground_truth[:, :1]
ground_truth[:, 1:] += trajectory
prediction += trajectory
# Invert camera transformation
cam = dataset.cameras()[args.viz_subject][args.viz_camera]
if ground_truth is not None:
prediction = camera_to_world(prediction, R=cam['orientation'], t=cam['translation'])
ground_truth = camera_to_world(ground_truth, R=cam['orientation'], t=cam['translation'])
else:
# If the ground truth is not available, take the camera extrinsic params from a random subject.
# They are almost the same, and anyway, we only need this for visualization purposes.
for subject in dataset.cameras():
if 'orientation' in dataset.cameras()[subject][args.viz_camera]:
rot = dataset.cameras()[subject][args.viz_camera]['orientation']
break
prediction = camera_to_world(prediction, R=rot, t=0)
# We don't have the trajectory, but at least we can rebase the height
prediction[:, :, 2] -= np.min(prediction[:, :, 2])
anim_output = {'Reconstruction': prediction}
if ground_truth is not None and not args.viz_no_ground_truth:
anim_output['Ground truth'] = ground_truth
if args.viz_export is not None:
print('Exporting joint positions to', args.viz_export)
# Predictions are in camera space
np.save(args.viz_export, prediction)
input_keypoints = image_coordinates(input_keypoints[..., :2], w=cam['res_w'], h=cam['res_h'])
from common.visualization import render_animation
render_animation(input_keypoints, anim_output,
dataset.skeleton(), dataset.fps(), args.viz_bitrate, cam['azimuth'], args.viz_output,
limit=args.viz_limit, downsample=args.viz_downsample, size=args.viz_size,
input_video_path=args.viz_video, viewport=(cam['res_w'], cam['res_h']),
input_video_skip=args.viz_skip)
if args.viz_output is not None:
if ground_truth is not None:
# Reapply trajectory
trajectory = ground_truth[:, :1]
ground_truth[:, 1:] += trajectory
prediction += trajectory
# Invert camera transformation
cam = dataset.cameras()[args.viz_subject][args.viz_camera]
if ground_truth is not None:
prediction = camera_to_world(prediction, R=cam['orientation'], t=cam['translation'])
ground_truth = camera_to_world(ground_truth, R=cam['orientation'], t=cam['translation'])
else:
# If the ground truth is not available, take the camera extrinsic params from a random subject.
# They are almost the same, and anyway, we only need this for visualization purposes.
for subject in dataset.cameras():
if 'orientation' in dataset.cameras()[subject][args.viz_camera]:
rot = dataset.cameras()[subject][args.viz_camera]['orientation']
break
prediction = camera_to_world(prediction, R=rot, t=0)
# We don't have the trajectory, but at least we can rebase the height
prediction[:, :, 2] -= np.min(prediction[:, :, 2])
anim_output = {'Reconstruction': prediction}
if ground_truth is not None and not args.viz_no_ground_truth:
anim_output['Ground truth'] = ground_truth