vision3d.transforms.v2

Mirror of torchvision.transforms.v2 with geometric safety guarantees.

Swap

from torchvision.transforms import v2 as T

for

from vision3d.transforms import v2 as T

to make every transform that would silently break the geometric consistency of a 3D scene refuse vision3d-aware TVTensor inputs with a TypeError instead.

The module forwards every public name from torchvision.transforms.v2 unchanged, except for the transforms listed in the module-private _REFUSED set. Those are subclassed with a refusal mixin: calling one on a sample containing any vision3d TVTensor (PointCloud3D, BoundingBoxes3D, CameraImages, CameraExtrinsics, or CameraIntrinsics) raises TypeError. They still work on plain torchvision.tv_tensors.Image / Mask samples.
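The refusal behaviour can be pictured with a small, library-free sketch. Every name below is hypothetical: `FakePointCloud3D`, `BaseTransform`, `RefuseMixin`, and `contains_3d` only stand in for a vision3d TVTensor, a torchvision transform, and the private `_Refuse3DAwareMixin`, whose real implementation is not shown on this page.

```python
class FakePointCloud3D:
    """Hypothetical stand-in for a vision3d TVTensor such as PointCloud3D."""


# Assumption: the real mixin checks against the vision3d TVTensor types.
REFUSED_TYPES = (FakePointCloud3D,)


def contains_3d(sample):
    """Recursively search tuples, lists, and dict values for a refused type."""
    if isinstance(sample, REFUSED_TYPES):
        return True
    if isinstance(sample, dict):
        return any(contains_3d(value) for value in sample.values())
    if isinstance(sample, (list, tuple)):
        return any(contains_3d(value) for value in sample)
    return False


class BaseTransform:
    """Hypothetical stand-in for a torchvision transform (returns its input)."""

    def __call__(self, *inputs):
        return inputs if len(inputs) > 1 else inputs[0]


class RefuseMixin:
    """Raise TypeError before the wrapped transform sees any 3D-aware input."""

    def __call__(self, *inputs):
        if contains_3d(inputs):
            raise TypeError(
                f"{type(self).__name__} does not support vision3d TVTensor inputs"
            )
        return super().__call__(*inputs)


class RandomRotationLike(RefuseMixin, BaseTransform):
    """The mixin comes first in the bases so its check runs first."""
```

Calling `RandomRotationLike()("plain image")` passes the input through, while `RandomRotationLike()({"points": FakePointCloud3D()})` raises `TypeError`.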

To remove a transform from the refused set (after registering the necessary kernels), delete the entry from _REFUSED.

Flip-axis convention

The registered kernels map each image-space flip to a fixed world-axis reflection:

RandomHorizontalFlip: world Y reflection
RandomVerticalFlip: world Z reflection

These choices match the intuition of an upright rig (image_y aligned with -world_Z), but projection stays consistent for any camera orientation: the extrinsics kernel reflects the matching camera-frame axis to absorb the discrepancy.
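The consistency claim can be checked numerically for a camera that is not upright. This is a sketch under stated assumptions, not the registered kernels: it assumes a pinhole model, the continuous pixel convention u -> W - u for a horizontal flip, a mirrored principal point cx -> W - cx, and new extrinsics of the form F R S, where S reflects world Y and F reflects the camera-frame x axis. All names and values are illustrative.

```python
def matmul(a, b):
    """Multiply two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)] for i in range(3)]


def project(point, rotation, translation, fx, fy, cx, cy):
    """Pinhole projection of a world point: world -> camera frame -> pixel."""
    cam = [sum(rotation[i][k] * point[k] for k in range(3)) + translation[i] for i in range(3)]
    return fx * cam[0] / cam[2] + cx, fy * cam[1] / cam[2] + cy


WIDTH = 640.0
FX, FY, CX, CY = 500.0, 500.0, 300.0, 240.0

# A deliberately non-upright camera: camera x = world z, camera y = world x, camera z = world y.
ROTATION = [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
TRANSLATION = [0.5, -0.2, 4.0]
POINT = [2.0, 1.0, 0.5]

REFLECT_WORLD_Y = [[1.0, 0.0, 0.0], [0.0, -1.0, 0.0], [0.0, 0.0, 1.0]]   # S
REFLECT_CAMERA_X = [[-1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # F

u, v = project(POINT, ROTATION, TRANSLATION, FX, FY, CX, CY)

# World-space kernel: reflect the point across the world Y axis.
flipped_point = [sum(REFLECT_WORLD_Y[i][k] * POINT[k] for k in range(3)) for i in range(3)]
# Extrinsics kernel (assumed form): R' = F R S and t' = F t.
flipped_rotation = matmul(matmul(REFLECT_CAMERA_X, ROTATION), REFLECT_WORLD_Y)
flipped_translation = [-TRANSLATION[0], TRANSLATION[1], TRANSLATION[2]]
# Intrinsics kernel (assumed form): mirror the principal point.
flipped_cx = WIDTH - CX

u_flipped, v_flipped = project(
    flipped_point, flipped_rotation, flipped_translation, FX, FY, flipped_cx, CY
)
```

Under these assumptions the reprojected pixel lands at `(WIDTH - u, v)`, which is where a horizontal image flip moves the original pixel. F R S has determinant +1, so the flipped extrinsics are still a proper rotation.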

World X-flip has no torchvision equivalent and stays in vision3d.transforms.RandomFlip3D (achievable via Y-flip + a 180 degree yaw rotation).
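The composition mentioned above can be verified with plain 3x3 matrices. The sketch assumes that yaw means a rotation about the world Z axis; the helper names are illustrative and not part of vision3d.

```python
import math


def matmul(a, b):
    """Multiply two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)] for i in range(3)]


def yaw(angle_rad):
    """Rotation about the world Z axis (assumed here to be the yaw axis)."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


Y_FLIP = [[1.0, 0.0, 0.0], [0.0, -1.0, 0.0], [0.0, 0.0, 1.0]]
X_FLIP = [[-1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# A Y reflection followed by a 180 degree yaw.
composed = matmul(yaw(math.pi), Y_FLIP)

# Largest element-wise deviation from a pure X reflection (floating-point noise only).
max_error = max(abs(composed[i][j] - X_FLIP[i][j]) for i in range(3) for j in range(3))
```

`max_error` stays at floating-point noise level, confirming that the two operations together equal an X reflection.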

Classes

`AugMix`([severity, mixture_width, ...])	AugMix data augmentation method based on "AugMix: A Simple Data Processing Method to Improve Robustness and Uncertainty".
`AutoAugment`([policy, interpolation, fill])	AutoAugment data augmentation method based on "AutoAugment: Learning Augmentation Strategies from Data".
`CutMix`(*[, alpha, num_classes, labels_getter])	Apply CutMix to the provided batch of images and labels.
`ElasticTransform`([alpha, sigma, ...])	Transform the input with elastic transformations.
`FiveCrop`(size)	Crop the image or video into four corners and the central crop.
`MixUp`(*[, alpha, num_classes, labels_getter])	Apply MixUp to the provided batch of images and labels.
`RandAugment`([num_ops, magnitude, ...])	RandAugment data augmentation method based on "RandAugment: Practical automated data augmentation with a reduced search space".
`RandomAffine`(degrees[, translate, scale, ...])	Random affine transformation of the input, keeping the center invariant.
`RandomIoUCrop`([min_scale, max_scale, ...])	Random IoU crop transformation from "SSD: Single Shot MultiBox Detector".
`RandomPerspective`([distortion_scale, p, ...])	Perform a random perspective transformation of the input with a given probability.
`RandomRotation`(degrees[, interpolation, ...])	Rotate the input by angle.
`TenCrop`(size[, vertical_flip])	Crop the image or video into four corners and the central crop plus the flipped version of these (horizontal flipping is used by default).
`TrivialAugmentWide`([num_magnitude_bins, ...])	Dataset-independent data-augmentation with TrivialAugment Wide, as described in "TrivialAugment: Tuning-free Yet State-of-the-Art Data Augmentation".

class vision3d.transforms.v2.AugMix(severity=3, mixture_width=3, chain_depth=-1, alpha=1.0, all_ops=True, interpolation=InterpolationMode.BILINEAR, fill=None)

Bases: _Refuse3DAwareMixin, AugMix

AugMix data augmentation method based on “AugMix: A Simple Data Processing Method to Improve Robustness and Uncertainty”.

This transformation works on images and videos only.

If the input is torch.Tensor, it should be of type torch.uint8, and it is expected to have […, 1 or 3, H, W] shape, where … means an arbitrary number of leading dimensions. If img is PIL Image, it is expected to be in mode “L” or “RGB”.

Parameters:

severity (int, optional) – The severity of base augmentation operators. Default is 3.
mixture_width (int, optional) – The number of augmentation chains. Default is 3.
chain_depth (int, optional) – The depth of augmentation chains. A negative value denotes stochastic depth sampled from the interval [1, 3]. Default is -1.
alpha (float, optional) – The hyperparameter for the probability distributions. Default is 1.0.
all_ops (bool, optional) – Use all operations (including brightness, contrast, color and sharpness). Default is True.
interpolation (InterpolationMode, optional) – Desired interpolation enum defined by torchvision.transforms.InterpolationMode. Default is InterpolationMode.BILINEAR. If input is Tensor, only InterpolationMode.NEAREST, InterpolationMode.BILINEAR are supported.
fill (sequence or number, optional) – Pixel fill value for the area outside the transformed image. If given a number, the value is used for all bands respectively.

class vision3d.transforms.v2.AutoAugment(policy=AutoAugmentPolicy.IMAGENET, interpolation=InterpolationMode.NEAREST, fill=None)

Bases: _Refuse3DAwareMixin, AutoAugment

AutoAugment data augmentation method based on “AutoAugment: Learning Augmentation Strategies from Data”.

This transformation works on images and videos only.

Parameters:

policy (AutoAugmentPolicy, optional) – Desired policy enum defined by torchvision.transforms.autoaugment.AutoAugmentPolicy. Default is AutoAugmentPolicy.IMAGENET.
interpolation (InterpolationMode, optional) – Desired interpolation enum defined by torchvision.transforms.InterpolationMode. Default is InterpolationMode.NEAREST. If input is Tensor, only InterpolationMode.NEAREST, InterpolationMode.BILINEAR are supported.
fill (sequence or number, optional) – Pixel fill value for the area outside the transformed image. If given a number, the value is used for all bands respectively.

static get_params(transform_num)

Get parameters for autoaugment transformation

Returns: params required by the autoaugment transformation
Parameters: transform_num (int)
Return type: tuple[int, Tensor, Tensor]

class vision3d.transforms.v2.CutMix(*, alpha=1.0, num_classes=None, labels_getter='default')

Bases: _Refuse3DAwareMixin, CutMix

Apply CutMix to the provided batch of images and labels.

Paper: CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features.

Note

This transform is meant to be used on batches of samples, not individual images. See How to use CutMix and MixUp for detailed usage examples. The sample pairing is deterministic and done by matching consecutive samples in the batch, so the batch needs to be shuffled (this is an implementation detail, not a guaranteed convention).

In the input, the labels are expected to be a tensor of shape (batch_size,). They will be transformed into a tensor of shape (batch_size, num_classes).

Parameters:

alpha (float, optional) – hyperparameter of the Beta distribution used for CutMix. Default is 1.
num_classes (int, optional) – number of classes in the batch. Used for one-hot-encoding. Can be None only if the labels are already one-hot-encoded.
labels_getter (callable or "default", optional) – indicates how to identify the labels in the input. By default, this will pick the second parameter as the labels if it’s a tensor. This covers the most common scenario where this transform is called as CutMix()(imgs_batch, labels_batch). It can also be a callable that takes the same input as the transform, and returns the labels.
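The label transformation from shape (batch_size,) to (batch_size, num_classes) can be sketched in plain Python. This is an illustration under assumptions, not the library code: it pairs each sample with its predecessor by rolling the batch by one (one way to match consecutive samples), and it takes the mixing weight `lam` as a fixed argument, whereas the transform draws it from a Beta(alpha, alpha) distribution. The function names are hypothetical.

```python
def one_hot(labels, num_classes):
    """Encode integer labels as rows of a (batch_size, num_classes) matrix."""
    return [[1.0 if cls == label else 0.0 for cls in range(num_classes)] for label in labels]


def mix_labels(labels, num_classes, lam):
    """Blend each one-hot label with the label of the previous sample in the batch."""
    encoded = one_hot(labels, num_classes)
    # Roll the batch by one: row i is paired with row i - 1 (row 0 wraps to the last row).
    rolled = encoded[-1:] + encoded[:-1]
    return [
        [lam * own + (1.0 - lam) * other for own, other in zip(row, partner)]
        for row, partner in zip(encoded, rolled)
    ]


mixed = mix_labels([0, 2, 1], num_classes=3, lam=0.75)
# mixed == [[0.75, 0.25, 0.0], [0.25, 0.0, 0.75], [0.0, 0.75, 0.25]]
```

Each output row still sums to 1, so it can be used directly as a soft target.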

class vision3d.transforms.v2.ElasticTransform(alpha=50.0, sigma=5.0, interpolation=InterpolationMode.BILINEAR, fill=0)

Bases: _Refuse3DAwareMixin, ElasticTransform

Transform the input with elastic transformations.

If the input is a torch.Tensor or a TVTensor (e.g. Image, Video, BoundingBoxes etc.), it can have an arbitrary number of leading batch dimensions. For example, the image can have [..., C, H, W] shape. A bounding box can have [..., 4] shape.

Given alpha and sigma, it will generate displacement vectors for all pixels based on random offsets. Alpha controls the strength and sigma controls the smoothness of the displacements. The displacements are added to an identity grid and the resulting grid is used to transform the input.

Note

The bounding-box transformation is approximate, not exact. We construct an approximation of the inverse grid as inverse_grid = identity - displacement. This is not an exact inverse of the grid used to transform images, i.e. grid = identity + displacement. Our assumption is that displacement * displacement is small and can be ignored. Large displacements would lead to large errors in the approximation.

Applications: Randomly transforms the morphology of objects in images and produces a see-through-water-like effect.

Parameters:

alpha (float or sequence of floats, optional) – Magnitude of displacements. Default is 50.0. A single value is interpreted as [alpha, alpha].
sigma (float or sequence of floats, optional) – Smoothness of displacements. Default is 5.0. A single value is interpreted as [sigma, sigma].
interpolation (InterpolationMode, optional) – Desired interpolation enum defined by torchvision.transforms.InterpolationMode. Default is InterpolationMode.BILINEAR. If input is Tensor, only InterpolationMode.NEAREST, InterpolationMode.BILINEAR are supported. The corresponding Pillow integer constants, e.g. PIL.Image.BILINEAR are accepted as well.
fill (number or tuple or dict, optional) – Pixel fill value used when the padding_mode is constant. Default is 0. If a tuple of length 3, it is used to fill R, G, B channels respectively. Fill value can be also a dictionary mapping data type to the fill value, e.g. fill={tv_tensors.Image: 127, tv_tensors.Mask: 0} where Image will be filled with 127 and Mask will be filled with 0.
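The note on the approximate inverse grid can be illustrated in one dimension. The sketch below is not the library code: it uses a hypothetical smooth displacement d(x) = amplitude * sin(x), maps a coordinate forward with identity + d, maps it back with identity - d, and measures how far the round trip misses the starting point.

```python
import math


def displacement(x, amplitude):
    """Hypothetical smooth displacement field used only for this illustration."""
    return amplitude * math.sin(x)


def round_trip_error(x, amplitude):
    """Distance between x and (identity - d)((identity + d)(x))."""
    forward = x + displacement(x, amplitude)
    back = forward - displacement(forward, amplitude)
    return abs(back - x)


small = round_trip_error(1.0, amplitude=0.01)
large = round_trip_error(1.0, amplitude=0.1)
ratio = large / small
```

Making the displacement ten times larger makes the round-trip error roughly a hundred times larger, which is the second-order behaviour the note describes.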

class vision3d.transforms.v2.FiveCrop(size)

Bases: _Refuse3DAwareMixin, FiveCrop

Crop the image or video into four corners and the central crop.

If the input is a torch.Tensor, an Image, or a Video, it can have an arbitrary number of leading batch dimensions. For example, the image can have [..., C, H, W] shape.

Note

This transform returns a tuple of images and there may be a mismatch in the number of inputs and targets your Dataset returns. See below for an example of how to deal with this.

Parameters: size (sequence or int) – Desired output size of the crop. If size is an int instead of sequence like (h, w), a square crop of size (size, size) is made. If provided a sequence of length 1, it will be interpreted as (size[0], size[0]).

Example

>>> from typing import Tuple, Union
>>> import torch
>>> from torchvision import tv_tensors
>>> from vision3d.transforms import v2 as transforms
>>> class BatchMultiCrop(transforms.Transform):
...     def forward(self, sample: Tuple[Tuple[Union[tv_tensors.Image, tv_tensors.Video], ...], int]):
...         # sample is (tuple of crops, single label)
...         images_or_videos, label = sample
...         batch_size = len(images_or_videos)
...         image_or_video = images_or_videos[0]
...         # Stack the crops into one batch, keeping the TVTensor type of the first crop.
...         images_or_videos = tv_tensors.wrap(torch.stack(images_or_videos), like=image_or_video)
...         # Repeat the single label once per crop.
...         labels = torch.full((batch_size,), label, device=images_or_videos.device)
...         return images_or_videos, labels
...
>>> image = tv_tensors.Image(torch.rand(3, 256, 256))
>>> label = 3
>>> transform = transforms.Compose([transforms.FiveCrop(224), BatchMultiCrop()])
>>> images, labels = transform(image, label)
>>> images.shape
torch.Size([5, 3, 224, 224])
>>> labels
tensor([3, 3, 3, 3, 3])
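The crop geometry itself can be sketched without any tensor library. The helpers below are hypothetical and only illustrate the documented size handling and the four-corners-plus-center layout. The ordering of the returned boxes and the rounding of the center offset are assumptions.

```python
def normalize_size(size):
    """Turn an int or a length-1 sequence into (height, width), as documented for size."""
    if isinstance(size, int):
        return size, size
    if len(size) == 1:
        return size[0], size[0]
    return size[0], size[1]


def five_crop_boxes(image_height, image_width, size):
    """Return (top, left, height, width) for four corner crops and the center crop."""
    crop_h, crop_w = normalize_size(size)
    if crop_h > image_height or crop_w > image_width:
        raise ValueError("requested crop is larger than the image")
    bottom = image_height - crop_h
    right = image_width - crop_w
    center_top = int(round(bottom / 2.0))
    center_left = int(round(right / 2.0))
    return [
        (0, 0, crop_h, crop_w),                     # top-left corner
        (0, right, crop_h, crop_w),                 # top-right corner
        (bottom, 0, crop_h, crop_w),                # bottom-left corner
        (bottom, right, crop_h, crop_w),            # bottom-right corner
        (center_top, center_left, crop_h, crop_w),  # center
    ]


boxes = five_crop_boxes(256, 256, 224)
# boxes[4] == (16, 16, 224, 224): the center crop of a 256 x 256 image
```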

class vision3d.transforms.v2.MixUp(*, alpha=1.0, num_classes=None, labels_getter='default')

Bases: _Refuse3DAwareMixin, MixUp

Apply MixUp to the provided batch of images and labels.

Paper: mixup: Beyond Empirical Risk Minimization.

Note

In the input, the labels are expected to be a tensor of shape (batch_size,). They will be transformed into a tensor of shape (batch_size, num_classes).

Parameters:

alpha (float, optional) – hyperparameter of the Beta distribution used for mixup. Default is 1.
num_classes (int, optional) – number of classes in the batch. Used for one-hot-encoding. Can be None only if the labels are already one-hot-encoded.
labels_getter (callable or "default", optional) – indicates how to identify the labels in the input. By default, this will pick the second parameter as the labels if it’s a tensor. This covers the most common scenario where this transform is called as MixUp()(imgs_batch, labels_batch). It can also be a callable that takes the same input as the transform, and returns the labels.

class vision3d.transforms.v2.RandAugment(num_ops=2, magnitude=9, num_magnitude_bins=31, interpolation=InterpolationMode.NEAREST, fill=None)

Bases: _Refuse3DAwareMixin, RandAugment

RandAugment data augmentation method based on “RandAugment: Practical automated data augmentation with a reduced search space”.

This transformation works on images and videos only.

Parameters:

num_ops (int, optional) – Number of augmentation transformations to apply sequentially; must be a non-negative integer. Default is 2.
magnitude (int, optional) – Magnitude for all the transformations. Default is 9.
num_magnitude_bins (int, optional) – The number of different magnitude values. Default is 31.
interpolation (InterpolationMode, optional) – Desired interpolation enum defined by torchvision.transforms.InterpolationMode. Default is InterpolationMode.NEAREST. If input is Tensor, only InterpolationMode.NEAREST, InterpolationMode.BILINEAR are supported.
fill (sequence or number, optional) – Pixel fill value for the area outside the transformed image. If given a number, the value is used for all bands respectively.

class vision3d.transforms.v2.RandomAffine(degrees, translate=None, scale=None, shear=None, interpolation=InterpolationMode.NEAREST, fill=0, center=None)

Bases: _Refuse3DAwareMixin, RandomAffine

Random affine transformation of the input, keeping the center invariant.

Parameters:

degrees (sequence or number) – Range of degrees to select from. If degrees is a number instead of sequence like (min, max), the range of degrees will be (-degrees, +degrees). Set to 0 to deactivate rotations.
translate (tuple, optional) – Tuple of maximum absolute fractions for horizontal and vertical translations. For example, with translate=(a, b), the horizontal shift is randomly sampled in the range -img_width * a < dx < img_width * a and the vertical shift is randomly sampled in the range -img_height * b < dy < img_height * b. Will not translate by default.
scale (tuple, optional) – Scaling factor interval. For example, with scale=(a, b), the scale is randomly sampled from the range a <= scale <= b. Will keep original scale by default.
shear (sequence or number, optional) – Range of degrees to select from. If shear is a number, a shear parallel to the x-axis in the range (-shear, +shear) will be applied. Else if shear is a sequence of 2 values a shear parallel to the x-axis in the range (shear[0], shear[1]) will be applied. Else if shear is a sequence of 4 values, an x-axis shear in (shear[0], shear[1]) and y-axis shear in (shear[2], shear[3]) will be applied. Will not apply shear by default.
interpolation (InterpolationMode, optional) – Desired interpolation enum defined by torchvision.transforms.InterpolationMode. Default is InterpolationMode.NEAREST. If input is Tensor, only InterpolationMode.NEAREST, InterpolationMode.BILINEAR are supported. The corresponding Pillow integer constants, e.g. PIL.Image.BILINEAR are accepted as well.
fill (number or tuple or dict, optional) – Pixel fill value used when the padding_mode is constant. Default is 0. If a tuple of length 3, it is used to fill R, G, B channels respectively. Fill value can be also a dictionary mapping data type to the fill value, e.g. fill={tv_tensors.Image: 127, tv_tensors.Mask: 0} where Image will be filled with 127 and Mask will be filled with 0.
center (sequence, optional) – Optional center of rotation, (x, y). Origin is the upper left corner. Default is the center of the image.

static get_params(degrees, translate, scale_ranges, shears, img_size)

Get parameters for affine transformation

Returns:

params to be passed to the affine transformation

Parameters:

degrees (list[float])
translate (list[float] | None)
scale_ranges (list[float] | None)
shears (list[float] | None)
img_size (list[int])

Return type:

tuple[float, tuple[int, int], float, tuple[float, float]]
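The parameter ranges described for degrees, translate, and scale can be sketched with the standard library. This is a simplified illustration rather than the get_params implementation: shear is left out, uniform sampling and rounding of the pixel shifts are assumptions, and `sample_affine_params` is a hypothetical name.

```python
import random


def expand_degrees(degrees):
    """A single number d means the range (-d, +d); a pair is used as (min, max)."""
    if isinstance(degrees, (int, float)):
        return -float(degrees), float(degrees)
    return float(degrees[0]), float(degrees[1])


def sample_affine_params(degrees, translate, scale, width, height, rng):
    """Draw (angle, (dx, dy), scale_factor) from the documented ranges."""
    low, high = expand_degrees(degrees)
    angle = rng.uniform(low, high)
    if translate is None:
        shift = (0, 0)  # no translation by default
    else:
        max_dx = translate[0] * width
        max_dy = translate[1] * height
        # Assumption: shifts are sampled uniformly and rounded to whole pixels.
        shift = (round(rng.uniform(-max_dx, max_dx)), round(rng.uniform(-max_dy, max_dy)))
    factor = 1.0 if scale is None else rng.uniform(scale[0], scale[1])
    return angle, shift, factor


rng = random.Random(0)
angle, (dx, dy), factor = sample_affine_params(15, (0.1, 0.2), (0.8, 1.2), 640, 480, rng)
```

With `degrees=0`, `translate=None`, and `scale=None`, the sketch returns the identity parameters `(0.0, (0, 0), 1.0)`.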

class vision3d.transforms.v2.RandomIoUCrop(min_scale=0.3, max_scale=1.0, min_aspect_ratio=0.5, max_aspect_ratio=2.0, sampler_options=None, trials=40)

Bases: _Refuse3DAwareMixin, RandomIoUCrop

Random IoU crop transformation from “SSD: Single Shot MultiBox Detector”.

This transformation requires image or video data and tv_tensors.BoundingBoxes in the input.

Warning

In order to properly remove the bounding boxes below the IoU threshold, RandomIoUCrop must be followed by SanitizeBoundingBoxes, either immediately after or later in the transforms pipeline.

Parameters:

min_scale (float, optional) – Minimum factor to scale the input size. Default is 0.3.
max_scale (float, optional) – Maximum factor to scale the input size. Default is 1.0.
min_aspect_ratio (float, optional) – Minimum aspect ratio for the cropped image or video. Default is 0.5.
max_aspect_ratio (float, optional) – Maximum aspect ratio for the cropped image or video. Default is 2.0.
sampler_options (list of float, optional) – List of minimal IoU (Jaccard) overlap between all the boxes and a cropped image or video. Default is None, which corresponds to [0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0].
trials (int, optional) – Number of trials to find a crop for a given value of minimal IoU (Jaccard) overlap. Default is 40.
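The IoU (Jaccard) overlap that sampler_options refers to can be computed as follows. This is a plain-Python sketch with hypothetical helper names. The boxes are assumed to be in (x1, y1, x2, y2) format, and the acceptance rule (keep a candidate crop when its best overlap with any box reaches the threshold) is an assumption about how a sampled threshold is applied.

```python
def box_iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes in (x1, y1, x2, y2) format."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - intersection
    return intersection / union if union > 0 else 0.0


def crop_meets_threshold(crop, boxes, min_iou):
    """Assumed rule: accept the crop if its best IoU with any box reaches min_iou."""
    best = max((box_iou(crop, box) for box in boxes), default=0.0)
    return best >= min_iou


crop = (0.0, 0.0, 10.0, 10.0)
boxes = [(5.0, 0.0, 15.0, 10.0), (20.0, 20.0, 30.0, 30.0)]
overlap = box_iou(crop, boxes[0])  # intersection 50, union 150
```

Here the first box overlaps the crop with an IoU of one third, so the crop passes a threshold of 0.3 but fails a threshold of 0.5.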

class vision3d.transforms.v2.RandomPerspective(distortion_scale=0.5, p=0.5, interpolation=InterpolationMode.BILINEAR, fill=0)

Bases: _Refuse3DAwareMixin, RandomPerspective

Perform a random perspective transformation of the input with a given probability.

Parameters:

distortion_scale (float, optional) – argument to control the degree of distortion and ranges from 0 to 1. Default is 0.5.
p (float, optional) – probability of the input being transformed. Default is 0.5.
interpolation (InterpolationMode, optional) – Desired interpolation enum defined by torchvision.transforms.InterpolationMode. Default is InterpolationMode.BILINEAR. If input is Tensor, only InterpolationMode.NEAREST, InterpolationMode.BILINEAR are supported. The corresponding Pillow integer constants, e.g. PIL.Image.BILINEAR are accepted as well.
fill (number or tuple or dict, optional) – Pixel fill value used when the padding_mode is constant. Default is 0. If a tuple of length 3, it is used to fill R, G, B channels respectively. Fill value can be also a dictionary mapping data type to the fill value, e.g. fill={tv_tensors.Image: 127, tv_tensors.Mask: 0} where Image will be filled with 127 and Mask will be filled with 0.

static get_params(width, height, distortion_scale)

Get parameters for perspective for a random perspective transform.

Parameters:

width (int) – width of the image.
height (int) – height of the image.
distortion_scale (float) – argument to control the degree of distortion and ranges from 0 to 1.

Returns:

List containing [top-left, top-right, bottom-right, bottom-left] of the original image, List containing [top-left, top-right, bottom-right, bottom-left] of the transformed image.

Return type:

tuple[list[list[int]], list[list[int]]]

class vision3d.transforms.v2.RandomRotation(degrees, interpolation=InterpolationMode.NEAREST, expand=False, center=None, fill=0)

Bases: _Refuse3DAwareMixin, RandomRotation

Rotate the input by angle.

Note

When center=None and the angle is a multiple of 90 degrees (0, 90, 180, 270), the rotation is performed using torch.rot90() instead of an affine transform. This is significantly faster, but the output tensor for 90 and 270 degree rotations may not be contiguous. Users who need contiguous output should call contiguous() on the result.

Parameters:

degrees (sequence or number) – Range of degrees to select from. If degrees is a number instead of sequence like (min, max), the range of degrees will be [-degrees, +degrees]. [90, 90] will rotate the image by 90 degrees anticlockwise.
interpolation (InterpolationMode, optional) – Desired interpolation enum defined by torchvision.transforms.InterpolationMode. Default is InterpolationMode.NEAREST. If input is Tensor, only InterpolationMode.NEAREST, InterpolationMode.BILINEAR are supported. The corresponding Pillow integer constants, e.g. PIL.Image.BILINEAR are accepted as well.
expand (bool, optional) – Optional expansion flag. If true, expands the output to make it large enough to hold the entire rotated image. If false or omitted, make the output image the same size as the input image. Note that the expand flag assumes rotation around the center (see note below) and no translation.
center (sequence, optional) – Optional center of rotation, (x, y). Origin is the upper left corner. Default is the center of the image. Note: In theory, setting center has no effect if expand=True, since the image center will become the center of rotation. In practice, however, due to numerical precision, this can lead to off-by-one differences in the resulting image size compared to using the image center in the first place. Thus, when setting expand=True, it’s best to leave center=None (default).
fill (number or tuple or dict, optional) – Pixel fill value used when the padding_mode is constant. Default is 0. If a tuple of length 3, it is used to fill R, G, B channels respectively. Fill value can be also a dictionary mapping data type to the fill value, e.g. fill={tv_tensors.Image: 127, tv_tensors.Mask: 0} where Image will be filled with 127 and Mask will be filled with 0.

static get_params(degrees)

Get parameters for rotate for a random rotation.

Returns: angle parameter to be passed to rotate for random rotation.
Return type: float
Parameters: degrees (list[float])

class vision3d.transforms.v2.TenCrop(size, vertical_flip=False)

Bases: _Refuse3DAwareMixin, TenCrop

Crop the image or video into four corners and the central crop plus the flipped version of these (horizontal flipping is used by default).

If the input is a torch.Tensor, an Image, or a Video, it can have an arbitrary number of leading batch dimensions. For example, the image can have [..., C, H, W] shape.

See FiveCrop for an example.

Note

This transform returns a tuple of images and there may be a mismatch in the number of inputs and targets your Dataset returns. See the FiveCrop example for how to deal with this.

Parameters:

size (sequence or int) – Desired output size of the crop. If size is an int instead of sequence like (h, w), a square crop of size (size, size) is made. If provided a sequence of length 1, it will be interpreted as (size[0], size[0]).
vertical_flip (bool, optional) – Use vertical flipping instead of horizontal. Default is False.

class vision3d.transforms.v2.TrivialAugmentWide(num_magnitude_bins=31, interpolation=InterpolationMode.NEAREST, fill=None)

Bases: _Refuse3DAwareMixin, TrivialAugmentWide

Dataset-independent data-augmentation with TrivialAugment Wide, as described in “TrivialAugment: Tuning-free Yet State-of-the-Art Data Augmentation”.

This transformation works on images and videos only.

Parameters:

num_magnitude_bins (int, optional) – The number of different magnitude values. Default is 31.
interpolation (InterpolationMode, optional) – Desired interpolation enum defined by torchvision.transforms.InterpolationMode. Default is InterpolationMode.NEAREST. If input is Tensor, only InterpolationMode.NEAREST, InterpolationMode.BILINEAR are supported.
fill (sequence or number, optional) – Pixel fill value for the area outside the transformed image. If given a number, the value is used for all bands respectively.