Deep Learning

Seeing "operation failed" followed by the process dying

One cause of this issue is that the GPU does not have enough memory to run the model. For example, DOPE may require up to 6 GB of VRAM to operate, depending on the application.
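If you suspect a memory shortfall, a quick pre-flight check of free VRAM can confirm it before launching the graph. The following is a minimal sketch of a hypothetical helper, assuming nvidia-smi is on the PATH and using the 6 GB DOPE figure above as the threshold:

# Hypothetical pre-flight check: report free VRAM per GPU before
# launching the graph. Assumes nvidia-smi is on the PATH; the 6 GB
# threshold matches the DOPE figure quoted above.
import subprocess

REQUIRED_MIB = 6 * 1024  # ~6 GB, per the DOPE requirement above

out = subprocess.check_output(
    ["nvidia-smi", "--query-gpu=memory.free", "--format=csv,noheader,nounits"],
    text=True,
)
for gpu_index, line in enumerate(out.strip().splitlines()):
    free_mib = int(line)
    status = "OK" if free_mib >= REQUIRED_MIB else "insufficient"
    print(f"GPU {gpu_index}: {free_mib} MiB free ({status})")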

Symptom

[component_container_mt-1] 2022-06-27 08:35:37.518 ERROR extensions/tensor_ops/Reshape.cpp@71: reshape tensor failed.
[component_container_mt-1] 2022-06-27 08:35:37.518 ERROR extensions/tensor_ops/TensorOperator.cpp@151: operation failed.
[component_container_mt-1] 2022-06-27 08:35:37.518 ERROR gxf/std/entity_executor.cpp@200: Entity with 102 not found!
[component_container_mt-1] INFO: infer_simple_runtime.cpp:69 TrtISBackend id:164 initialized model: Ketchup
[component_container_mt-1] 2022-06-27 08:35:37.518 WARN  gxf/std/greedy_scheduler.cpp@221: Error while executing entity 87 named 'VERAGYEWGZ_reshaper': GXF_FAILURE
[component_container_mt-1] [ERROR] [1656318937.518424053] [dope_encoder]: [NitrosPublisher] Vault ("vault/vault", eid=102) was stopped. The graph may have been terminated due to an error.
[component_container_mt-1] terminate called after throwing an instance of 'std::runtime_error'
[component_container_mt-1]   what():  [NitrosPublisher] Vault ("vault/vault", eid=102) was stopped. The graph may have been terminated due to an error.
[ERROR] [component_container_mt-1]: process has died [pid 13378, exit code -6, cmd '/opt/ros/humble/install/lib/rclcpp_components/component_container_mt --ros-args -r __node:=dope_container -r __ns:=/'].

Solution

Try using the Isaac ROS TensorRT node or the Isaac ROS Triton node with the TensorRT backend instead. If the problem persists, a discrete GPU with more VRAM may be required.
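As a sketch of that workaround, the hypothetical ROS 2 launch file below runs inference through the Isaac ROS TensorRT node. The package, plugin, and parameter names follow the Isaac ROS DNN Inference examples, but verify them against your installed version; the file paths and tensor names are placeholders for your model's values:

# Hedged sketch: run inference through the Isaac ROS TensorRT node.
# Plugin and parameter names follow the Isaac ROS DNN Inference
# examples; paths and tensor names are placeholders.
from launch import LaunchDescription
from launch_ros.actions import ComposableNodeContainer
from launch_ros.descriptions import ComposableNode


def generate_launch_description():
    tensor_rt_node = ComposableNode(
        package='isaac_ros_tensor_rt',
        plugin='nvidia::isaac_ros::dnn_inference::TensorRTNode',
        name='tensor_rt',
        parameters=[{
            'model_file_path': '/tmp/models/dope/model.onnx',   # placeholder
            'engine_file_path': '/tmp/models/dope/model.plan',  # placeholder
            'input_tensor_names': ['input_tensor'],
            'input_binding_names': ['input'],
            'output_tensor_names': ['output_tensor'],
            'output_binding_names': ['output'],
            'force_engine_update': False,
        }],
    )
    container = ComposableNodeContainer(
        name='dope_container',
        namespace='',
        package='rclcpp_components',
        executable='component_container_mt',
        composable_node_descriptions=[tensor_rt_node],
        output='screen',
    )
    return LaunchDescription([container])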

Triton fails to create the TensorRT engine and load a model

Symptom

1: [component_container_mt-1] I0331 05:56:07.479791 11359 tensorrt.cc:5591] TRITONBACKEND_ModelInitialize: detectnet (version 1)
1: [component_container_mt-1] I0331 05:56:07.483989 11359 tensorrt.cc:5640] TRITONBACKEND_ModelInstanceInitialize: detectnet (GPU device 0)
1: [component_container_mt-1] I0331 05:56:08.169240 11359 logging.cc:49] Loaded engine size: 21 MiB
1: [component_container_mt-1] E0331 05:56:08.209208 11359 logging.cc:43] 1: [runtime.cpp::parsePlan::314] Error Code 1: Serialization (Serialization assertion plan->header.magicTag == rt::kPLAN_MAGIC_TAG failed.)
1: [component_container_mt-1] I0331 05:56:08.213483 11359 tensorrt.cc:5678] TRITONBACKEND_ModelInstanceFinalize: delete instance state
1: [component_container_mt-1] I0331 05:56:08.213525 11359 tensorrt.cc:5617] TRITONBACKEND_ModelFinalize: delete model state
1: [component_container_mt-1] E0331 05:56:08.214059 11359 model_lifecycle.cc:596] failed to load 'detectnet' version 1: Internal: unable to create TensorRT engine
1: [component_container_mt-1] ERROR: infer_trtis_server.cpp:1057 Triton: failed to load model detectnet, triton_err_str:Invalid argument, err_msg:load failed for model 'detectnet': version 1 is at UNAVAILABLE state: Internal: unable to create TensorRT engine;
1: [component_container_mt-1]
1: [component_container_mt-1] ERROR: infer_trtis_backend.cpp:54 failed to load model: detectnet, nvinfer error:NVDSINFER_TRITON_ERROR
1: [component_container_mt-1] ERROR: infer_simple_runtime.cpp:33 failed to initialize backend while ensuring model:detectnet ready, nvinfer error:NVDSINFER_TRITON_ERROR
1: [component_container_mt-1] ERROR: Error in createNNBackend() <infer_simple_context.cpp:76> [UID = 16]: failed to initialize triton simple runtime for model:detectnet, nvinfer error:NVDSINFER_TRITON_ERROR
1: [component_container_mt-1] ERROR: Error in initialize() <infer_base_context.cpp:79> [UID = 16]: create nn-backend failed, check config file settings, nvinfer error:NVDSINFER_TRITON_ERROR

Solution

This error can occur when TensorRT attempts to load an incompatible model.plan file. The incompatibility typically arises from a TensorRT version or GPU platform mismatch between the machine that generated the plan and the machine that executes it.
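To confirm the mismatch, you can attempt to deserialize the plan with the TensorRT Python bindings on the machine that will execute it; an incompatible plan fails at this step, mirroring the magicTag assertion in the log above. The following is a diagnostic sketch, assuming the tensorrt Python package is installed and using a placeholder plan path:

# Hedged diagnostic: try to deserialize model.plan with the TensorRT
# runtime on the machine that will execute it. A version or platform
# mismatch fails here, mirroring the magicTag assertion in the log.
import tensorrt as trt

PLAN_PATH = '/tmp/models/detectnet/1/model.plan'  # placeholder path

print(f'TensorRT version on this machine: {trt.__version__}')

logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)
with open(PLAN_PATH, 'rb') as f:
    engine = runtime.deserialize_cuda_engine(f.read())

if engine is None:
    print('Deserialization failed: regenerate the plan on this machine.')
else:
    print('Plan deserialized successfully; engine is compatible.')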

Delete the model.plan file passed to the Triton node's model_repository_paths parameter, then follow the source package's instructions to regenerate model.plan from the original weights file (often a .etlt or .onnx file).
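For ONNX weights, one hypothetical way to regenerate the plan on the target machine is TensorRT's trtexec tool, invoked here from Python. The paths are placeholders, and .etlt models should instead be converted per the source package's instructions (for example, with the TAO converter):

# Hedged sketch: regenerate model.plan from the original ONNX weights
# on the machine that will run it, using TensorRT's trtexec tool.
# Assumes trtexec is on the PATH (it often lives under
# /usr/src/tensorrt/bin); paths below are placeholders.
import os
import subprocess

MODEL_REPO = '/tmp/models/detectnet/1'          # placeholder repository path
ONNX_PATH = '/tmp/models/detectnet/model.onnx'  # placeholder weights file
PLAN_PATH = os.path.join(MODEL_REPO, 'model.plan')

# Remove the stale, incompatible plan first.
if os.path.exists(PLAN_PATH):
    os.remove(PLAN_PATH)

# trtexec builds and serializes an engine for this GPU and this
# TensorRT version, avoiding the generation/execution mismatch.
subprocess.run(
    ['trtexec', f'--onnx={ONNX_PATH}', f'--saveEngine={PLAN_PATH}'],
    check=True,
)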