Deep Learning

Seeing operation failed followed by the process dying

One cause of this issue is that the GPU being used does not have enough memory to run the model. For example, DOPE may require up to 6 GB of VRAM to operate, depending on the application.

Symptom

[component_container_mt-1] 2022-06-27 08:35:37.518 ERROR extensions/tensor_ops/Reshape.cpp@71: reshape tensor failed.
[component_container_mt-1] 2022-06-27 08:35:37.518 ERROR extensions/tensor_ops/TensorOperator.cpp@151: operation failed.
[component_container_mt-1] 2022-06-27 08:35:37.518 ERROR gxf/std/entity_executor.cpp@200: Entity with 102 not found!
[component_container_mt-1] INFO: infer_simple_runtime.cpp:69 TrtISBackend id:164 initialized model: Ketchup
[component_container_mt-1] 2022-06-27 08:35:37.518 WARN  gxf/std/greedy_scheduler.cpp@221: Error while executing entity 87 named 'VERAGYEWGZ_reshaper': GXF_FAILURE
[component_container_mt-1] [ERROR] [1656318937.518424053] [dope_encoder]: [NitrosPublisher] Vault ("vault/vault", eid=102) was stopped. The graph may have been terminated due to an error.
[component_container_mt-1] terminate called after throwing an instance of 'std::runtime_error'
[component_container_mt-1]   what():  [NitrosPublisher] Vault ("vault/vault", eid=102) was stopped. The graph may have been terminated due to an error.
[ERROR] [component_container_mt-1]: process has died [pid 13378, exit code -6, cmd '/opt/ros/humble/install/lib/rclcpp_components/component_container_mt --ros-args -r __node:=dope_container -r __ns:=/'].

Solution

Try using the Isaac ROS TensorRT node or the Isaac ROS Triton node with the TensorRT backend instead. Otherwise, a discrete GPU with more VRAM may be required.
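
To check whether the GPU has enough free memory before launching the graph, nvidia-smi can report per-device totals. Below is a minimal sketch in Python, assuming the NVIDIA driver and nvidia-smi are installed; the 6 GB threshold is only the illustrative figure from above, not a hard requirement of any specific node.

# check_vram.py: report total and free GPU memory via nvidia-smi.
# Assumes nvidia-smi is on PATH; the 6 GB threshold is illustrative only.
import subprocess

REQUIRED_MIB = 6 * 1024  # example figure, e.g. DOPE (~6 GB)

query = subprocess.run(
    ["nvidia-smi", "--query-gpu=memory.total,memory.free",
     "--format=csv,noheader,nounits"],
    capture_output=True, text=True, check=True,
)

for index, line in enumerate(query.stdout.strip().splitlines()):
    total_mib, free_mib = (int(value) for value in line.split(","))
    status = "OK" if free_mib >= REQUIRED_MIB else "insufficient free VRAM"
    print(f"GPU {index}: {free_mib} / {total_mib} MiB free -> {status}")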

Triton fails to create the TensorRT engine and load a model

Symptom

1: [component_container_mt-1] I0331 05:56:07.479791 11359 tensorrt.cc:5591] TRITONBACKEND_ModelInitialize: detectnet (version 1)
1: [component_container_mt-1] I0331 05:56:07.483989 11359 tensorrt.cc:5640] TRITONBACKEND_ModelInstanceInitialize: detectnet (GPU device 0)
1: [component_container_mt-1] I0331 05:56:08.169240 11359 logging.cc:49] Loaded engine size: 21 MiB
1: [component_container_mt-1] E0331 05:56:08.209208 11359 logging.cc:43] 1: [runtime.cpp::parsePlan::314] Error Code 1: Serialization (Serialization assertion plan->header.magicTag == rt::kPLAN_MAGIC_TAG failed.)
1: [component_container_mt-1] I0331 05:56:08.213483 11359 tensorrt.cc:5678] TRITONBACKEND_ModelInstanceFinalize: delete instance state
1: [component_container_mt-1] I0331 05:56:08.213525 11359 tensorrt.cc:5617] TRITONBACKEND_ModelFinalize: delete model state
1: [component_container_mt-1] E0331 05:56:08.214059 11359 model_lifecycle.cc:596] failed to load 'detectnet' version 1: Internal: unable to create TensorRT engine
1: [component_container_mt-1] ERROR: infer_trtis_server.cpp:1057 Triton: failed to load model detectnet, triton_err_str:Invalid argument, err_msg:load failed for model 'detectnet': version 1 is at UNAVAILABLE state: Internal: unable to create TensorRT engine;
1: [component_container_mt-1]
1: [component_container_mt-1] ERROR: infer_trtis_backend.cpp:54 failed to load model: detectnet, nvinfer error:NVDSINFER_TRITON_ERROR
1: [component_container_mt-1] ERROR: infer_simple_runtime.cpp:33 failed to initialize backend while ensuring model:detectnet ready, nvinfer error:NVDSINFER_TRITON_ERROR
1: [component_container_mt-1] ERROR: Error in createNNBackend() <infer_simple_context.cpp:76> [UID = 16]: failed to initialize triton simple runtime for model:detectnet, nvinfer error:NVDSINFER_TRITON_ERROR
1: [component_container_mt-1] ERROR: Error in initialize() <infer_base_context.cpp:79> [UID = 16]: create nn-backend failed, check config file settings, nvinfer error:NVDSINFER_TRITON_ERROR

Solution

This error can occur when TensorRT attempts to load an incompatible model.plan file. The incompatibility may arise due to a versioning or platform mismatch between the time of plan generation and the time of plan execution.

Delete the model.plan file that is being passed in as an argument to the Triton node’s model_repository_paths parameter, and then use the source package’s instructions to regenerate the model.plan file from the original weights file (often a .etlt or .onnx file).
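
If the original weights are available as an .onnx file, the plan can also be rebuilt directly on the machine that will run it, which guarantees the engine matches the local TensorRT version and GPU. The following is a minimal sketch using the TensorRT 8.x Python API; the paths are placeholders, and the source package's own conversion instructions (including any model-specific flags) should take precedence.

# rebuild_plan.py: regenerate model.plan from an ONNX file on the target
# machine so the engine matches the local TensorRT version and GPU.
# Paths are placeholders; API shown is the TensorRT 8.x Python API.
import tensorrt as trt

ONNX_PATH = "/tmp/models/detectnet/1/model.onnx"   # placeholder path
PLAN_PATH = "/tmp/models/detectnet/1/model.plan"   # placeholder path

logger = trt.Logger(trt.Logger.WARNING)
print("TensorRT version:", trt.__version__)  # must match the runtime loading the plan

builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open(ONNX_PATH, "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise SystemExit("Failed to parse ONNX model")

config = builder.create_builder_config()
serialized_engine = builder.build_serialized_network(network, config)
if serialized_engine is None:
    raise SystemExit("Engine build failed")

with open(PLAN_PATH, "wb") as f:
    f.write(serialized_engine)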

Seeing CUDA Error: an illegal memory access was encountered

One cause of this issue is that the GPU being used does not have enough memory to run the model or combination of models. For example, SAM with YOLO may require up to 12 GB of VRAM to operate, depending on the application.

Symptom

[component_container_mt-1] [ERROR] [1714404908.405286740] [NitrosImage]: [convert_to_custom] cudaMemcpy2D failed for conversion from sensor_msgs::msg::Image to NitrosImage: cudaErrorIllegalAddress (an illegal memory access was encountered)
[component_container_mt-1] 2024-04-29 21:05:08.405 ERROR gxf/std/entity_executor.cpp@552: Failed to tick codelet  in entity: UKNJEXSISG_triton_response code: GXF_FAILURE
[component_container_mt-1] 2024-04-29 21:05:08.405 ERROR gxf/std/entity_executor.cpp@552: Failed to tick codelet  in entity: APDRLGKSXZ_cuda_stream_sync code: GXF_FAILURE
[component_container_mt-1] 2024-04-29 21:05:08.405 ERROR gxf/std/entity_executor.cpp@210: Entity with eid 207 not found!
[component_container_mt-1] CUDA Error: an illegal memory access was encountered
[component_container_mt-1] 2024-04-30 14:08:12.187 ERROR extensions/triton/inferencers/triton_inferencer_impl.cpp@729: cudaMemcpy error: an illegal memory access was encountered
[component_container_mt-1]
[component_container_mt-1] 2024-04-30 14:08:12.187 ERROR gxf/std/entity_executor.cpp@552: Failed to tick codelet  in entity: CMMPVZZGUM_triton_response code: GXF_FAILURE
[component_container_mt-1] terminate called after throwing an instance of 'std::runtime_error'
[component_container_mt-1]   what():  CUDA error returned at ./src/image_to_tensor_node.cpp:154, Error code: 700 (an illegal memory access was encountered)
[component_container_mt-1] 2024-04-30 14:08:12.187 WARN  gxf/std/multi_thread_scheduler.cpp@342: Error while executing entity E926 named 'CMMPVZZGUM_triton_response': GXF_FAILURE
[ERROR] [component_container_mt-1]: process has died [pid 4801, exit code -6, cmd '/opt/ros/humble/lib/rclcpp_components/component_container_mt --ros-args --log-level WARN --ros-args -r __node:=segment_anything_container -r __ns:=/'].

Solution

A discrete GPU with more VRAM may be required.
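
To see how close the combined models come to exhausting VRAM before the crash, NVML can report device and per-process GPU memory use while the graph is running. Below is a minimal sketch, assuming the nvidia-ml-py package (imported as pynvml) is installed; run it in a separate terminal while the inference graph is active.

# gpu_mem_watch.py: print device and per-process GPU memory use once per second.
# Assumes `pip install nvidia-ml-py` (imported as pynvml).
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU

try:
    while True:
        info = pynvml.nvmlDeviceGetMemoryInfo(handle)
        print(f"used {info.used // 2**20} MiB / total {info.total // 2**20} MiB")
        for proc in pynvml.nvmlDeviceGetComputeRunningProcesses(handle):
            used = proc.usedGpuMemory
            used_mib = "n/a" if used is None else used // 2**20
            print(f"  pid {proc.pid}: {used_mib} MiB")
        time.sleep(1.0)
finally:
    pynvml.nvmlShutdown()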

System throttled due to Over-current warnings

Inference for large models such as FoundationPose or Segformer is extremely computationally intensive. It can drive the accelerator clocks to their maximum limits to sustain performance, which can trigger over-current throttling to prevent permanent hardware damage from the sustained stress.

Symptom

A warning dialog on the Jetson shows the message “System throttled due to Over-current.”

Solution

The warning does not prevent the task from completing, but it can reduce the processing rate. Consider using a lighter variant of the model for inference or a more powerful discrete GPU.
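
To confirm the throttling and observe clock and thermal behavior while inference runs, the stock Jetson tools nvpmodel and tegrastats can be sampled. Below is a minimal sketch, assuming a standard JetPack install; both utilities are Jetson-specific and may require sudo for complete readings.

# jetson_power_check.py: print the active power mode and a few seconds of
# tegrastats output. Jetson-only; assumes a standard JetPack install.
import subprocess

# Query the active NV power mode (e.g. MAXN or a capped preset).
mode = subprocess.run(["nvpmodel", "-q"], capture_output=True, text=True)
print(mode.stdout.strip() or mode.stderr.strip())

# Sample utilization/clock/thermal counters for ~5 seconds.
stats = subprocess.Popen(["tegrastats", "--interval", "1000"],
                         stdout=subprocess.PIPE, text=True)
try:
    for _ in range(5):
        print(stats.stdout.readline().rstrip())
finally:
    stats.terminate()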