Deep Learning
Seeing operation failed followed by the process dying
One cause of this issue is that the GPU being used does not have enough memory to run the model. For example, DOPE may require up to 6 GB of VRAM to operate, depending on the application.
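To rule out a memory shortfall before launching the graph, the amount of free VRAM can be checked with a short script. The following is a minimal sketch using pynvml; NVML queries work on discrete GPUs, while Jetson platforms report GPU memory through tegrastats instead. The 6 GiB budget is taken from the note above.

import pynvml

REQUIRED_GIB = 6.0  # rough VRAM budget for DOPE, per the note above

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
free_gib = mem.free / (1024 ** 3)
print(f'Free VRAM: {free_gib:.1f} GiB of {mem.total / (1024 ** 3):.1f} GiB')
if free_gib < REQUIRED_GIB:
    print('Likely not enough free VRAM to run this model.')
pynvml.nvmlShutdown()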
Symptom
[component_container_mt-1] 2022-06-27 08:35:37.518 ERROR extensions/tensor_ops/Reshape.cpp@71: reshape tensor failed.
[component_container_mt-1] 2022-06-27 08:35:37.518 ERROR extensions/tensor_ops/TensorOperator.cpp@151: operation failed.
[component_container_mt-1] 2022-06-27 08:35:37.518 ERROR gxf/std/entity_executor.cpp@200: Entity with 102 not found!
[component_container_mt-1] INFO: infer_simple_runtime.cpp:69 TrtISBackend id:164 initialized model: Ketchup
[component_container_mt-1] 2022-06-27 08:35:37.518 WARN gxf/std/greedy_scheduler.cpp@221: Error while executing entity 87 named 'VERAGYEWGZ_reshaper': GXF_FAILURE
[component_container_mt-1] [ERROR] [1656318937.518424053] [dope_encoder]: [NitrosPublisher] Vault ("vault/vault", eid=102) was stopped. The graph may have been terminated due to an error.
[component_container_mt-1] terminate called after throwing an instance of 'std::runtime_error'
[component_container_mt-1] what(): [NitrosPublisher] Vault ("vault/vault", eid=102) was stopped. The graph may have been terminated due to an error.
[ERROR] [component_container_mt-1]: process has died [pid 13378, exit code -6, cmd '/opt/ros/humble/install/lib/rclcpp_components/component_container_mt --ros-args -r __node:=dope_container -r __ns:=/'].
Solution
Try using the Isaac ROS TensorRT node or the Isaac ROS Triton node with the TensorRT backend instead. Otherwise, a discrete GPU with more VRAM may be required.
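As a sketch of the first option, the composable node below stands in for the Triton node in a launch file. The package and plugin names follow the Isaac ROS DNN Inference examples; the model paths, tensor names, and binding names are placeholders and must be taken from your own model and launch graph, so verify all parameters against your installed version.

import launch
from launch_ros.actions import ComposableNodeContainer
from launch_ros.descriptions import ComposableNode


def generate_launch_description():
    # Isaac ROS TensorRT inference node in place of the Triton node.
    tensor_rt_node = ComposableNode(
        package='isaac_ros_tensor_rt',
        plugin='nvidia::isaac_ros::dnn_inference::TensorRTNode',
        name='tensor_rt',
        parameters=[{
            # The engine is (re)built on this machine from the ONNX file.
            'model_file_path': '/tmp/models/model.onnx',    # placeholder path
            'engine_file_path': '/tmp/models/model.plan',   # placeholder path
            'input_tensor_names': ['input_tensor'],         # placeholder names
            'input_binding_names': ['input'],               # placeholder names
            'output_tensor_names': ['output_tensor'],       # placeholder names
            'output_binding_names': ['output'],             # placeholder names
            'force_engine_update': False,
        }])

    container = ComposableNodeContainer(
        name='inference_container',
        namespace='',
        package='rclcpp_components',
        executable='component_container_mt',
        composable_node_descriptions=[tensor_rt_node],
        output='screen')

    return launch.LaunchDescription([container])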
Triton fails to create the TensorRT engine and load a model
Symptom
1: [component_container_mt-1] I0331 05:56:07.479791 11359 tensorrt.cc:5591] TRITONBACKEND_ModelInitialize: detectnet (version 1)
1: [component_container_mt-1] I0331 05:56:07.483989 11359 tensorrt.cc:5640] TRITONBACKEND_ModelInstanceInitialize: detectnet (GPU device 0)
1: [component_container_mt-1] I0331 05:56:08.169240 11359 logging.cc:49] Loaded engine size: 21 MiB
1: [component_container_mt-1] E0331 05:56:08.209208 11359 logging.cc:43] 1: [runtime.cpp::parsePlan::314] Error Code 1: Serialization (Serialization assertion plan->header.magicTag == rt::kPLAN_MAGIC_TAG failed.)
1: [component_container_mt-1] I0331 05:56:08.213483 11359 tensorrt.cc:5678] TRITONBACKEND_ModelInstanceFinalize: delete instance state
1: [component_container_mt-1] I0331 05:56:08.213525 11359 tensorrt.cc:5617] TRITONBACKEND_ModelFinalize: delete model state
1: [component_container_mt-1] E0331 05:56:08.214059 11359 model_lifecycle.cc:596] failed to load 'detectnet' version 1: Internal: unable to create TensorRT engine
1: [component_container_mt-1] ERROR: infer_trtis_server.cpp:1057 Triton: failed to load model detectnet, triton_err_str:Invalid argument, err_msg:load failed for model 'detectnet': version 1 is at UNAVAILABLE state: Internal: unable to create TensorRT engine;
1: [component_container_mt-1]
1: [component_container_mt-1] ERROR: infer_trtis_backend.cpp:54 failed to load model: detectnet, nvinfer error:NVDSINFER_TRITON_ERROR
1: [component_container_mt-1] ERROR: infer_simple_runtime.cpp:33 failed to initialize backend while ensuring model:detectnet ready, nvinfer error:NVDSINFER_TRITON_ERROR
1: [component_container_mt-1] ERROR: Error in createNNBackend() <infer_simple_context.cpp:76> [UID = 16]: failed to initialize triton simple runtime for model:detectnet, nvinfer error:NVDSINFER_TRITON_ERROR
1: [component_container_mt-1] ERROR: Error in initialize() <infer_base_context.cpp:79> [UID = 16]: create nn-backend failed, check config file settings, nvinfer error:NVDSINFER_TRITON_ERROR
Solution
This error can occur when TensorRT attempts to load an incompatible model.plan file. The incompatibility typically arises from a TensorRT version or platform mismatch between the environment in which the plan was generated and the environment in which it is executed.

Delete the model.plan file that is passed to the Triton node’s model_repository_paths parameter, then follow the source package’s instructions to regenerate the model.plan file from the original weights file (often a .etlt or .onnx file).
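If the packaged instructions are not at hand and the original weights are an .onnx file, the plan can also be rebuilt directly on the target machine with the TensorRT Python API. The sketch below assumes the TensorRT 8.x API and uses placeholder paths; .etlt files instead require the converter tool referenced in the source package’s instructions.

import tensorrt as trt

ONNX_PATH = '/tmp/models/model.onnx'   # placeholder path
PLAN_PATH = '/tmp/models/model.plan'   # placeholder path

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

# Parse the ONNX weights; report parser errors if the model is malformed.
with open(ONNX_PATH, 'rb') as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError('Failed to parse ONNX model')

config = builder.create_builder_config()
# 1 GiB builder workspace; adjust to the memory available on the target GPU.
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)

# The serialized plan is specific to this GPU and this TensorRT version.
plan = builder.build_serialized_network(network, config)
with open(PLAN_PATH, 'wb') as f:
    f.write(plan)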
Seeing CUDA Error: an illegal memory access was encountered
One cause of this issue is that the GPU being used does not have enough memory to run the model or combination of models. For example, SAM with YOLO may require up to 12 GB of VRAM to operate, depending on the application.
Symptom
[component_container_mt-1] [ERROR] [1714404908.405286740] [NitrosImage]: [convert_to_custom] cudaMemcpy2D failed for conversion from sensor_msgs::msg::Image to NitrosImage: cudaErrorIllegalAddress (an illegal memory access was encountered)
[component_container_mt-1] 2024-04-29 21:05:08.405 ERROR gxf/std/entity_executor.cpp@552: Failed to tick codelet in entity: UKNJEXSISG_triton_response code: GXF_FAILURE
[component_container_mt-1] 2024-04-29 21:05:08.405 ERROR gxf/std/entity_executor.cpp@552: Failed to tick codelet in entity: APDRLGKSXZ_cuda_stream_sync code: GXF_FAILURE
[component_container_mt-1] 2024-04-29 21:05:08.405 ERROR gxf/std/entity_executor.cpp@210: Entity with eid 207 not found!
[component_container_mt-1] CUDA Error: an illegal memory access was encountered
[component_container_mt-1] 2024-04-30 14:08:12.187 ERROR extensions/triton/inferencers/triton_inferencer_impl.cpp@729: cudaMemcpy error: an illegal memory access was encountered
[component_container_mt-1]
[component_container_mt-1] 2024-04-30 14:08:12.187 ERROR gxf/std/entity_executor.cpp@552: Failed to tick codelet in entity: CMMPVZZGUM_triton_response code: GXF_FAILURE
[component_container_mt-1] terminate called after throwing an instance of 'std::runtime_error'
[component_container_mt-1] what(): CUDA error returned at ./src/image_to_tensor_node.cpp:154, Error code: 700 (an illegal memory access was encountered)
[component_container_mt-1] 2024-04-30 14:08:12.187 WARN gxf/std/multi_thread_scheduler.cpp@342: Error while executing entity E926 named 'CMMPVZZGUM_triton_response': GXF_FAILURE
[ERROR] [component_container_mt-1]: process has died [pid 4801, exit code -6, cmd '/opt/ros/humble/lib/rclcpp_components/component_container_mt --ros-args --log-level WARN --ros-args -r __node:=segment_anything_container -r __ns:=/'].
Solution
A discrete GPU with more VRAM may be required.
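To confirm that VRAM is actually the bottleneck, GPU memory can be sampled while the graph runs; the peak figure shows how much headroom the model combination needs. The following is a minimal sketch using pynvml, with an arbitrary sampling interval and duration; on Jetson, use tegrastats instead.

import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
total = pynvml.nvmlDeviceGetMemoryInfo(handle).total
peak_used = 0
try:
    for _ in range(600):  # roughly 60 seconds at 0.1 s per sample
        peak_used = max(peak_used,
                        pynvml.nvmlDeviceGetMemoryInfo(handle).used)
        time.sleep(0.1)
finally:
    pynvml.nvmlShutdown()
print(f'Peak VRAM used: {peak_used / 1024 ** 3:.1f} GiB '
      f'of {total / 1024 ** 3:.1f} GiB')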
System throttled due to Over-current warnings
Inference with large models such as FoundationPose or Segformer can be extremely computationally intensive. This can push the accelerator clocks to their maximum rates for best performance and can trigger over-current throttling, which prevents permanent hardware damage under sustained stress.
Symptom
A warning dialog on the Jetson shows the message “System throttled due to Over-current.”
Solution
The warning does not prevent the task from completing, but it can reduce the inference rate. Consider using a lighter variant of the model for inference, or a more powerful discrete GPU.