62 changes: 38 additions & 24 deletions docsrc/user_guide/mixed_precision.rst
@@ -32,18 +32,15 @@ Consider the following PyTorch model which explicitly casts intermediate layer t
return x


If we compile the above model using Torch-TensorRT with the following settings, layer profiling logs indicate that all the layers are
run in FP32. This is because TensorRT picks the kernels for layers which result in the best performance (i.e., weak typing in TensorRT).
Collaborator: We may want to reorient around strong typing first and then weak typing as an optimization. Right now this is a bit confusing.

Collaborator: So, like in the tutorial:

  1. Demonstrate strong typing and explain that it's going to be the default behavior
  2. Show the weak typing behavior and talk about how the TRT graph changed (and maybe why)
  3. Show how you can recover the weak typing behavior using autocast for TRT 11 and beyond

Collaborator (Author): Since TRT has deprecated weak typing, should we mention that weak typing is deprecated and that autocast needs to be used instead? Thus, we have only two modes:

User-defined precision:      use_explicit_typing=True + enable_autocast=False
Autocast chooses precision:  use_explicit_typing=True + enable_autocast=True


.. code-block:: python

inputs = [torch.randn((1, 10), dtype=torch.float32).cuda()]
mod = MyModule().eval().cuda()
ep = torch.export.export(mod, tuple(inputs))
trt_gm = torch_tensorrt.dynamo.compile(ep, inputs=inputs)

# Debug log info
# Layers:
@@ -53,31 +50,50 @@ run in FP32. This is because TensorRT picks the kernels for layers which result


In order to respect the types specified by the user in the model (eg: in this case, ``linear2`` layer to run in FP16), users can enable
the compilation setting ``use_explicit_typing=True``. Compiling with this option results in the following TensorRT logs:

.. code-block:: python

inputs = [torch.randn((1, 10), dtype=torch.float32).cuda()]
mod = MyModule().eval().cuda()
ep = torch.export.export(mod, tuple(inputs))
trt_gm = torch_tensorrt.dynamo.compile(ep, inputs=inputs, use_explicit_typing=True)

# Debug log info
# Layers:
# Name: __myl_MulSumAddCas_myl0_0, LayerType: kgen, Inputs: [ { Name: linear1/addmm_constant_0 _ linear1/addmm_add_broadcast_to_same_shape_lhs_broadcast_constantFloat, Dimensions: [1,10], Format/Datatype: Float }, { Name: __mye112_dconst, Dimensions: [10,10], Format/Datatype: Float }, { Name: x, Dimensions: [10,1], Format/Datatype: Float }], Outputs: [ { Name: __myln_k_arg__bb1_2, Dimensions: [1,10], Format/Datatype: Half }], TacticName: __myl_MulSumAddCas_0xacf8f5dd9be2f3e7bb09cdddeac6c936, StreamId: 0, Metadata:
# Name: __myl_ResMulSumAddCas_myl0_1, LayerType: kgen, Inputs: [ { Name: __mye127_dconst, Dimensions: [10,30], Format/Datatype: Half }, { Name: linear2/addmm_1_constant_0 _ linear2/addmm_1_add_broadcast_to_same_shape_lhs_broadcast_constantHalf, Dimensions: [1,30], Format/Datatype: Half }, { Name: __myln_k_arg__bb1_2, Dimensions: [1,10], Format/Datatype: Half }], Outputs: [ { Name: __myln_k_arg__bb1_3, Dimensions: [1,30], Format/Datatype: Float }], TacticName: __myl_ResMulSumAddCas_0x5a3b318b5a1c97b7d5110c0291481337, StreamId: 0, Metadata:
# Name: __myl_ResMulSumAdd_myl0_2, LayerType: kgen, Inputs: [ { Name: __mye142_dconst, Dimensions: [30,40], Format/Datatype: Float }, { Name: linear3/addmm_2_constant_0 _ linear3/addmm_2_add_broadcast_to_same_shape_lhs_broadcast_constantFloat, Dimensions: [1,40], Format/Datatype: Float }, { Name: __myln_k_arg__bb1_3, Dimensions: [1,30], Format/Datatype: Float }], Outputs: [ { Name: output0, Dimensions: [1,40], Format/Datatype: Float }], TacticName: __myl_ResMulSumAdd_0x3fad91127c640fd6db771aa9cde67db0, StreamId: 0, Metadata:

Now the ``linear2`` layer runs in FP16 as shown in the above logs.
Autocast
---------------

Weak typing behavior in TensorRT is deprecated. However, mixed precision remains a good way to maximize performance. Therefore, in Torch-TensorRT,
Collaborator: However, mixed precision is a good way to maximize performance.

we provide a way to recover this weak typing behavior, which is called `Autocast`.

Torch-TensorRT Autocast intelligently selects nodes to keep in FP32 precision to maintain model accuracy while benefiting from
reduced precision on the rest of the nodes. Torch-TensorRT Autocast also lets users specify which nodes to exclude from Autocast,
since some nodes might be more sensitive to reduced precision and could affect accuracy. In addition, Torch-TensorRT Autocast can cooperate with PyTorch
native Autocast, allowing users to use both PyTorch and Torch-TensorRT Autocast in the same model. Torch-TensorRT respects the precision
of the nodes within PyTorch Autocast regions.
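
For instance, in the following minimal sketch (the module name and layer sizes are illustrative, not taken from this guide's earlier example), the FP16 precision chosen inside the PyTorch ``torch.autocast`` region is preserved when the module is compiled with Torch-TensorRT Autocast:

.. code-block:: python

    class MyAutocastModule(torch.nn.Module):  # hypothetical module for illustration
        def __init__(self):
            super().__init__()
            self.linear1 = torch.nn.Linear(10, 10)
            self.linear2 = torch.nn.Linear(10, 20)

        def forward(self, x):
            x = self.linear1(x)  # precision decided by Torch-TensorRT Autocast
            # PyTorch native Autocast region: Torch-TensorRT keeps linear2 in FP16
            with torch.autocast("cuda", dtype=torch.float16):
                x = self.linear2(x)
            return x
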
Collaborator: Can you explain the difference between PyTorch and Torch-TensorRT autocast?


To enable Torch-TensorRT Autocast, users need to set both ``enable_autocast=True`` and ``use_explicit_typing=True``. For example,

.. code-block:: python

inputs = [torch.randn((1, 10), dtype=torch.float32).cuda()]
mod = MyModule().eval().cuda()
ep = torch.export.export(mod, tuple(inputs))
trt_gm = torch_tensorrt.dynamo.compile(ep, inputs=inputs, enable_autocast=True, use_explicit_typing=True)


Users can also control the reduced precision with ``autocast_low_precision_type``, or use ``autocast_excluded_nodes`` / ``autocast_excluded_ops``
to exclude certain nodes/ops from Autocast.
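
For example, a minimal sketch building on the snippet above (the parameter values here are illustrative):

.. code-block:: python

    trt_gm = torch_tensorrt.dynamo.compile(
        ep,
        inputs=inputs,
        use_explicit_typing=True,
        enable_autocast=True,
        autocast_low_precision_type=torch.float16,  # reduce eligible nodes to FP16
        autocast_excluded_nodes={"^linear1$"},  # regex patterns of node names to keep in FP32
        autocast_excluded_ops={"torch.ops.aten.flatten.using_ints"},  # ATen ops to keep in FP32
    )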

In summary, there are three ways in Torch-TensorRT to enable mixed precision (see the sketch after this list):

1. TRT chooses precision (weak typing): ``use_explicit_typing=False + enable_autocast=False``
2. User specifies precision (strong typing): ``use_explicit_typing=True + enable_autocast=False``
3. Autocast chooses precision (autocast + strong typing): ``use_explicit_typing=True + enable_autocast=True``
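
As a compact sketch (flag combinations are taken from the summary above; the precision value is illustrative), the three modes correspond to the following compile calls:

.. code-block:: python

    # 1. TRT chooses precision (weak typing, deprecated)
    trt_gm = torch_tensorrt.dynamo.compile(ep, inputs=inputs, enabled_precisions={torch.float16})

    # 2. User specifies precision (strong typing), honoring the casts in the model
    trt_gm = torch_tensorrt.dynamo.compile(ep, inputs=inputs, use_explicit_typing=True)

    # 3. Autocast chooses precision (autocast + strong typing)
    trt_gm = torch_tensorrt.dynamo.compile(ep, inputs=inputs, use_explicit_typing=True, enable_autocast=True)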

FP32 Accumulation
-----------------
@@ -93,14 +109,12 @@ When ``use_fp32_acc=True`` is set, Torch-TensorRT will attempt to use FP32 accum
inputs = [torch.randn((1, 10), dtype=torch.float16).cuda()]
mod = MyModule().eval().cuda()
ep = torch.export.export(mod, tuple(inputs))
trt_gm = torch_tensorrt.dynamo.compile(
ep,
inputs=inputs,
use_fp32_acc=True,
use_explicit_typing=True, # Explicit typing must be enabled
)

# Debug log info
# Layers:
70 changes: 70 additions & 0 deletions examples/dynamo/autocast_example.py
@@ -0,0 +1,70 @@
import torch
Collaborator: Can you add comments to this doc? Here is an example of what I'm looking for: https://docs.pytorch.org/TensorRT/tutorials/_rendered_examples/dynamo/converter_overloading.html

import torch.nn as nn
import torch_tensorrt


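# This example model mixes PyTorch native autocast with Torch-TensorRT Autocast:
# the convolutional feature extractor runs outside any autocast region, while the
# final linear layer is wrapped in torch.autocast so it executes in FP16 and the
# following torch.log is kept in FP32 by PyTorch autocast rules.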
class MixedPytorchAutocastModel(nn.Module):
    def __init__(self):
        super(MixedPytorchAutocastModel, self).__init__()
        self.conv1 = nn.Conv2d(
            in_channels=3, out_channels=8, kernel_size=3, stride=1, padding=1
        )
        self.relu1 = nn.ReLU()
        self.pool1 = nn.MaxPool2d(kernel_size=2, stride=2)
        self.conv2 = nn.Conv2d(
            in_channels=8, out_channels=16, kernel_size=3, stride=1, padding=1
        )
        self.relu2 = nn.ReLU()
        self.pool2 = nn.MaxPool2d(kernel_size=2, stride=2)
        self.flatten = nn.Flatten()
        self.fc1 = nn.Linear(16 * 8 * 8, 10)

    def forward(self, x):
        x = self.conv1(x)
        x = self.relu1(x)
        x = self.pool1(x)
        x = self.conv2(x)
        x = self.relu2(x)
        x = self.pool2(x)
        x = self.flatten(x)
        with torch.autocast(x.device.type, enabled=True, dtype=torch.float16):
            x = self.fc1(x)
            out = torch.log(
                torch.abs(x) + 1
            )  # log is fp32 due to Pytorch Autocast requirements
        return out


if __name__ == "__main__":
Collaborator: I know it's not best practice, but let's just make them pure scripts so they render better.

    model = MixedPytorchAutocastModel().cuda().eval()
    inputs = (torch.randn((8, 3, 32, 32), dtype=torch.float32, device="cuda"),)
    ep = torch.export.export(model, inputs)
    calibration_dataloader = torch.utils.data.DataLoader(
        torch.utils.data.TensorDataset(*inputs), batch_size=2, shuffle=False
    )

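    # Compile with Torch-TensorRT Autocast: strong (explicit) typing is enabled and
    # Autocast chooses which nodes run in FP16, while excluded nodes/ops stay in FP32.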
    with torch_tensorrt.dynamo.Debugger(
        "graphs",
        logging_dir=".",
        engine_builder_monitor=False,
    ):
        trt_autocast_mod = torch_tensorrt.compile(
            ep.module(),
            arg_inputs=inputs,
            min_block_size=1,
            use_python_runtime=True,
            ##### weak typing #####
            # use_explicit_typing=False,
            # enabled_precisions={torch.float16},
            ##### strong typing + autocast #####
            use_explicit_typing=True,
            enable_autocast=True,
            autocast_low_precision_type=torch.float16,
            autocast_excluded_nodes={"^conv1$", "relu"},
            autocast_excluded_ops={"torch.ops.aten.flatten.using_ints"},
            autocast_max_output_threshold=512,
            autocast_max_depth_of_reduction=None,
            autocast_calibration_dataloader=calibration_dataloader,
        )

    autocast_outs = trt_autocast_mod(*inputs)
46 changes: 45 additions & 1 deletion py/torch_tensorrt/dynamo/_compiler.py
@@ -141,7 +141,7 @@ def cross_compile_for_windows(
disable_tf32 (bool): Force FP32 layers to use traditional as FP32 format vs the default behavior of rounding the inputs to 10-bit mantissas before multiplying, but accumulates the sum using 23-bit mantissas
assume_dynamic_shape_support (bool): Setting this to true enables the converters work for both dynamic and static shapes. Default: False
sparse_weights (bool): Enable sparsity for convolution and fully connected layers.
enabled_precisions (Set(Union(torch.dtype, torch_tensorrt.dtype))): The set of datatypes that TensorRT can use when selecting kernels
capability (torch_tensorrt.EngineCapability): Restrict kernel selection to safe gpu kernels or safe dla kernels
num_avg_timing_iters (int): Number of averaging timing iterations used to select kernels
workspace_size (int): Maximum size of workspace given to TensorRT
@@ -434,6 +434,19 @@ def compile(
l2_limit_for_tiling: int = _defaults.L2_LIMIT_FOR_TILING,
offload_module_to_cpu: bool = _defaults.OFFLOAD_MODULE_TO_CPU,
use_distributed_mode_trace: bool = _defaults.USE_DISTRIBUTED_MODE_TRACE,
enable_autocast: bool = _defaults.ENABLE_AUTOCAST,
autocast_low_precision_type: Optional[
Union[torch.dtype, dtype]
] = _defaults.AUTOCAST_LOW_PRECISION_TYPE,
autocast_excluded_nodes: Collection[str] = _defaults.AUTOCAST_EXCLUDED_NODES,
autocast_excluded_ops: Collection[Target] = _defaults.AUTOCAST_EXCLUDED_OPS,
autocast_max_output_threshold: float = _defaults.AUTOCAST_MAX_OUTPUT_THRESHOLD,
autocast_max_depth_of_reduction: Optional[
int
] = _defaults.AUTOCAST_MAX_DEPTH_OF_REDUCTION,
autocast_calibration_dataloader: Optional[
torch.utils.data.DataLoader
] = _defaults.AUTOCAST_CALIBRATION_DATALOADER,
**kwargs: Any,
) -> torch.fx.GraphModule:
"""Compile an ExportedProgram module for NVIDIA GPUs using TensorRT
@@ -511,6 +524,13 @@ def compile(
l2_limit_for_tiling (int): The target L2 cache usage limit (in bytes) for tiling optimization (default is -1 which means no limit).
offload_module_to_cpu (bool): Offload the module to CPU. This is useful when we need to minimize GPU memory usage.
use_distributed_mode_trace (bool): Using aot_autograd to trace the graph. This is enabled when DTensors or distributed tensors are present in distributed model
enable_autocast (bool): Whether to enable autocast. If enabled, use_explicit_typing will be set to True.
autocast_low_precision_type (Optional[Union[torch.dtype, dtype]]): The precision to reduce to. We currently support torch.float16 and torch.bfloat16. Default is None, which means no low precision is used.
autocast_excluded_nodes (Collection[str]): The set of regex patterns to match user-specified node names that should remain in FP32. Default is [].
autocast_excluded_ops (Collection[Target]): The set of targets (ATen ops) that should remain in FP32. Default is [].
autocast_max_output_threshold (float): Maximum absolute value for node outputs; nodes with outputs greater than this value will remain in FP32. Default is 512.
autocast_max_depth_of_reduction (Optional[int]): Maximum depth of reduction allowed in low precision. Nodes with higher reduction depths will remain in FP32. This helps prevent excessive accuracy loss in operations particularly sensitive to reduced precision, as higher-depth reductions may amplify computation errors in low precision formats. If not provided, infinity will be used. Default is None.
autocast_calibration_dataloader (Optional[torch.utils.data.DataLoader]): The dataloader to use for autocast calibration. Default is None.
**kwargs: Any,
Returns:
torch.fx.GraphModule: Compiled FX Module, when run it will execute via TensorRT
@@ -584,6 +604,10 @@ def compile(
"\nThis feature is unimplemented in Torch-TRT Dynamo currently."
)

if enable_autocast:
use_explicit_typing = True
logger.debug("Autocast is enabled, setting use_explicit_typing to True.")

if use_explicit_typing:
if len(enabled_precisions) != 1 or not any(
x in enabled_precisions
@@ -593,6 +617,19 @@
f"use_explicit_typing was set to True, however found that enabled_precisions was also specified (saw: {enabled_precisions}, expected: dtype.f32, dtype.f4). enabled_precisions should not be used when use_explicit_typing=True"
)

if autocast_low_precision_type is not None:
if not isinstance(autocast_low_precision_type, (torch.dtype, dtype)):
raise ValueError(
f"autocast_low_precision_type must be a torch.dtype or torch_tensorrt._enums.dtype, got {type(autocast_low_precision_type)}"
)
if autocast_low_precision_type not in {
torch.float16,
torch.bfloat16,
} and autocast_low_precision_type not in {dtype.f16, dtype.bf16}:
raise ValueError(
f"autocast_low_precision_type must be one of torch.float16, torch.bfloat16, dtype.f16, dtype.bf16, got {autocast_low_precision_type}"
)

if use_fp32_acc:
logger.debug(
"FP32 accumulation for matmul layers is enabled. This option should only be enabled if the model already has FP16 weights and has no effect if it has FP32 weights. \
@@ -680,6 +717,13 @@
"l2_limit_for_tiling": l2_limit_for_tiling,
"offload_module_to_cpu": offload_module_to_cpu,
"use_distributed_mode_trace": use_distributed_mode_trace,
"enable_autocast": enable_autocast,
"autocast_low_precision_type": autocast_low_precision_type,
"autocast_excluded_nodes": autocast_excluded_nodes,
"autocast_excluded_ops": autocast_excluded_ops,
"autocast_max_output_threshold": autocast_max_output_threshold,
"autocast_max_depth_of_reduction": autocast_max_depth_of_reduction,
"autocast_calibration_dataloader": autocast_calibration_dataloader,
}

settings = CompilationSettings(**compilation_options)
7 changes: 7 additions & 0 deletions py/torch_tensorrt/dynamo/_defaults.py
@@ -57,6 +57,13 @@
L2_LIMIT_FOR_TILING = -1
USE_DISTRIBUTED_MODE_TRACE = False
OFFLOAD_MODULE_TO_CPU = False
ENABLE_AUTOCAST = False
AUTOCAST_LOW_PRECISION_TYPE = None
AUTOCAST_EXCLUDED_NODES = set[str]()
AUTOCAST_EXCLUDED_OPS = set[torch.fx.node.Target]()
AUTOCAST_MAX_OUTPUT_THRESHOLD = 512
AUTOCAST_MAX_DEPTH_OF_REDUCTION = None
AUTOCAST_CALIBRATION_DATALOADER = None

if platform.system() == "Linux":
import pwd