vision3d.transforms.v2
Mirror of torchvision.transforms.v2 with geometric safety guarantees.
Swap
from torchvision.transforms import v2 as T
for
from vision3d.transforms import v2 as T
to make every transform that would silently break the geometric
consistency of a 3D scene refuse vision3d-aware TVTensor inputs with a
TypeError instead.
The module forwards every public name from torchvision.transforms.v2
unchanged, except for the transforms listed in the module-private
_REFUSED set. Those are subclassed with a refusal mixin: calling one
on a sample containing any vision3d TVTensor (PointCloud3D,
BoundingBoxes3D, CameraImages, CameraExtrinsics, or CameraIntrinsics)
raises TypeError. They still work on plain
torchvision.tv_tensors.Image / Mask samples.
To remove a transform from the refused set (after registering the
necessary kernels), delete the entry from _REFUSED.
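The refusal behaviour described above can be pictured with a small, dependency-free sketch. Everything in it is a stand-in invented for illustration: the placeholder classes, `_3D_AWARE_TYPES`, `Refuse3DAwareSketch`, and the way the sample is flattened are assumptions, not the actual vision3d implementation.

```python
# Illustrative sketch only: stand-in classes, not the real vision3d or torchvision code.


class PointCloud3D:
    """Stand-in for a vision3d TVTensor type."""


class Image:
    """Stand-in for torchvision.tv_tensors.Image."""


_3D_AWARE_TYPES = (PointCloud3D,)  # hypothetical tuple of refused input types


def _flatten(sample):
    """Yield every leaf of a nested tuple/list/dict sample."""
    if isinstance(sample, dict):
        for value in sample.values():
            yield from _flatten(value)
    elif isinstance(sample, (tuple, list)):
        for item in sample:
            yield from _flatten(item)
    else:
        yield sample


class Refuse3DAwareSketch:
    """Mixin that rejects samples with 3D-aware leaves before delegating upstream."""

    def __call__(self, *inputs):
        for leaf in _flatten(inputs):
            if isinstance(leaf, _3D_AWARE_TYPES):
                raise TypeError(
                    f"{type(self).__name__} does not support {type(leaf).__name__} inputs"
                )
        return super().__call__(*inputs)


class HorizontalFlipBase:
    """Stand-in for an upstream transform; it simply returns its input."""

    def __call__(self, *inputs):
        return inputs if len(inputs) > 1 else inputs[0]


class RefusingHorizontalFlip(Refuse3DAwareSketch, HorizontalFlipBase):
    """The mixin comes first in the bases so its check runs before the transform."""
```

With these stand-ins, `RefusingHorizontalFlip()(Image())` passes the image through, while a sample such as `{"points": PointCloud3D()}` raises TypeError.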
Classes
| AugMix | AugMix data augmentation method based on "AugMix: A Simple Data Processing Method to Improve Robustness and Uncertainty". |
| AutoAugment | AutoAugment data augmentation method based on "AutoAugment: Learning Augmentation Strategies from Data". |
| CutMix | Apply CutMix to the provided batch of images and labels. |
| ElasticTransform | Transform the input with elastic transformations. |
| FiveCrop | Crop the image or video into four corners and the central crop. |
| MixUp | Apply MixUp to the provided batch of images and labels. |
| RandAugment | RandAugment data augmentation method based on "RandAugment: Practical automated data augmentation with a reduced search space". |
| RandomAffine | Random affine transformation of the input keeping center invariant. |
| RandomHorizontalFlip | Horizontally flip the input with a given probability. |
| RandomIoUCrop | Random IoU crop transformation from "SSD: Single Shot MultiBox Detector". |
| RandomPerspective | Perform a random perspective transformation of the input with a given probability. |
| RandomRotation | Rotate the input by angle. |
| RandomVerticalFlip | Vertically flip the input with a given probability. |
| TenCrop | Crop the image or video into four corners and the central crop plus the flipped version of these (horizontal flipping is used by default). |
| TrivialAugmentWide | Dataset-independent data-augmentation with TrivialAugment Wide, as described in "TrivialAugment: Tuning-free Yet State-of-the-Art Data Augmentation". |
- class vision3d.transforms.v2.AugMix(severity=3, mixture_width=3, chain_depth=-1, alpha=1.0, all_ops=True, interpolation=InterpolationMode.BILINEAR, fill=None)
Bases: _Refuse3DAwareMixin, AugMix
AugMix data augmentation method based on “AugMix: A Simple Data Processing Method to Improve Robustness and Uncertainty”.
This transformation works on images and videos only.
If the input is torch.Tensor, it should be of type torch.uint8, and it is expected to have […, 1 or 3, H, W] shape, where … means an arbitrary number of leading dimensions. If img is PIL Image, it is expected to be in mode “L” or “RGB”.
- Parameters:
severity (int, optional) – The severity of base augmentation operators. Default is 3.
mixture_width (int, optional) – The number of augmentation chains. Default is 3.
chain_depth (int, optional) – The depth of augmentation chains. A negative value denotes stochastic depth sampled from the interval [1, 3]. Default is -1.
alpha (float, optional) – The hyperparameter for the probability distributions. Default is 1.0.
all_ops (bool, optional) – Use all operations (including brightness, contrast, color and sharpness). Default is True.
interpolation (InterpolationMode, optional) – Desired interpolation enum defined by torchvision.transforms.InterpolationMode. Default is InterpolationMode.BILINEAR. If input is Tensor, only InterpolationMode.NEAREST, InterpolationMode.BILINEAR are supported.
fill (sequence or number, optional) – Pixel fill value for the area outside the transformed image. If given a number, the value is used for all bands respectively.
- class vision3d.transforms.v2.AutoAugment(policy=AutoAugmentPolicy.IMAGENET, interpolation=InterpolationMode.NEAREST, fill=None)
Bases: _Refuse3DAwareMixin, AutoAugment
AutoAugment data augmentation method based on “AutoAugment: Learning Augmentation Strategies from Data”.
This transformation works on images and videos only.
If the input is torch.Tensor, it should be of type torch.uint8, and it is expected to have […, 1 or 3, H, W] shape, where … means an arbitrary number of leading dimensions. If img is PIL Image, it is expected to be in mode “L” or “RGB”.
- Parameters:
policy (AutoAugmentPolicy, optional) – Desired policy enum defined by torchvision.transforms.autoaugment.AutoAugmentPolicy. Default is AutoAugmentPolicy.IMAGENET.
interpolation (InterpolationMode, optional) – Desired interpolation enum defined by torchvision.transforms.InterpolationMode. Default is InterpolationMode.NEAREST. If input is Tensor, only InterpolationMode.NEAREST, InterpolationMode.BILINEAR are supported.
fill (sequence or number, optional) – Pixel fill value for the area outside the transformed image. If given a number, the value is used for all bands respectively.
- class vision3d.transforms.v2.CutMix(*, alpha=1.0, num_classes=None, labels_getter='default')
Bases: _Refuse3DAwareMixin, CutMix
Apply CutMix to the provided batch of images and labels.
Paper: CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features.
Note
This transform is meant to be used on batches of samples, not individual images. See How to use CutMix and MixUp for detailed usage examples. The sample pairing is deterministic and done by matching consecutive samples in the batch, so the batch needs to be shuffled (this is an implementation detail, not a guaranteed convention).
In the input, the labels are expected to be a tensor of shape (batch_size,). They will be transformed into a tensor of shape (batch_size, num_classes).
- Parameters:
alpha (float, optional) – hyperparameter of the Beta distribution used for mixup. Default is 1.
num_classes (int, optional) – number of classes in the batch. Used for one-hot-encoding. Can be None only if the labels are already one-hot-encoded.
labels_getter (callable or "default", optional) – indicates how to identify the labels in the input. By default, this will pick the second parameter as the labels if it’s a tensor. This covers the most common scenario where this transform is called as CutMix()(imgs_batch, labels_batch). It can also be a callable that takes the same input as the transform, and returns the labels.
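To make the role of alpha concrete, here is a minimal sketch of how a CutMix box and its label weight could be sampled, following the recipe in the CutMix paper. The helper name `sample_cutmix_box` and its arguments are hypothetical, it uses only the standard library, and it is not claimed to match the library's internals.

```python
import math
import random


def sample_cutmix_box(height, width, alpha=1.0, rng=random):
    """Sample a CutMix box and the label-mixing weight adjusted to its area."""
    lam = rng.betavariate(alpha, alpha)  # mixing ratio drawn from Beta(alpha, alpha)
    center_x = rng.randrange(width)
    center_y = rng.randrange(height)

    # Before clipping, the box covers a (1 - lam) fraction of the image area.
    half_ratio = 0.5 * math.sqrt(1.0 - lam)
    half_w = int(half_ratio * width)
    half_h = int(half_ratio * height)

    x1 = max(center_x - half_w, 0)
    y1 = max(center_y - half_h, 0)
    x2 = min(center_x + half_w, width)
    y2 = min(center_y + half_h, height)

    # Clipping at the border changes the pasted area, so recompute the weight.
    lam_adjusted = 1.0 - (x2 - x1) * (y2 - y1) / (width * height)
    return (x1, y1, x2, y2), lam_adjusted
```

The pixels inside the returned box would come from the paired sample, and the one-hot labels would be mixed with weights `lam_adjusted` and `1 - lam_adjusted`.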
- class vision3d.transforms.v2.ElasticTransform(alpha=50.0, sigma=5.0, interpolation=InterpolationMode.BILINEAR, fill=0)
Bases: _Refuse3DAwareMixin, ElasticTransform
Transform the input with elastic transformations.
If the input is a torch.Tensor or a TVTensor (e.g. Image, Video, BoundingBoxes etc.) it can have arbitrary number of leading batch dimensions. For example, the image can have [..., C, H, W] shape. A bounding box can have [..., 4] shape.
Given alpha and sigma, it will generate displacement vectors for all pixels based on random offsets. Alpha controls the strength and sigma controls the smoothness of the displacements. The displacements are added to an identity grid and the resulting grid is used to transform the input.
Note
The implementation for transforming bounding boxes is approximate (not exact). We construct an approximation of the inverse grid as inverse_grid = identity - displacement. This is not an exact inverse of the grid used to transform images, i.e. grid = identity + displacement. Our assumption is that displacement * displacement is small and can be ignored. Large displacements would lead to large errors in the approximation.
- Applications:
Randomly transforms the morphology of objects in images and produces a see-through-water-like effect.
- Parameters:
alpha (float or sequence of floats, optional) – Magnitude of displacements. Default is 50.0. A single value is [alpha, alpha].
sigma (float or sequence of floats, optional) – Smoothness of displacements. Default is 5.0. A single value is [sigma, sigma].
interpolation (InterpolationMode, optional) – Desired interpolation enum defined by torchvision.transforms.InterpolationMode. Default is InterpolationMode.BILINEAR. If input is Tensor, only InterpolationMode.NEAREST, InterpolationMode.BILINEAR are supported. The corresponding Pillow integer constants, e.g. PIL.Image.BILINEAR are accepted as well.
fill (number or tuple or dict, optional) – Pixel fill value used when the padding_mode is constant. Default is 0. If a tuple of length 3, it is used to fill R, G, B channels respectively. Fill value can be also a dictionary mapping data type to the fill value, e.g. fill={tv_tensors.Image: 127, tv_tensors.Mask: 0} where Image will be filled with 127 and Mask will be filled with 0.
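The note on the approximate inverse grid can be checked numerically in one dimension. The sketch below uses a made-up displacement field (`strength * sin(x)`) purely for illustration. It maps a point forward with identity + displacement, back with identity - displacement, and measures how far the round trip lands from the starting point.

```python
import math


def displacement(x, strength):
    """Hypothetical smooth 1-D displacement field, used only for this demo."""
    return strength * math.sin(x)


def forward_grid(x, strength):
    # grid = identity + displacement
    return x + displacement(x, strength)


def approx_inverse_grid(y, strength):
    # inverse_grid = identity - displacement
    return y - displacement(y, strength)


def round_trip_error(x, strength):
    """Distance between x and approx_inverse_grid(forward_grid(x))."""
    return abs(approx_inverse_grid(forward_grid(x, strength), strength) - x)


small = round_trip_error(1.0, 0.01)  # small displacement: error is second order
large = round_trip_error(1.0, 0.5)   # large displacement: error is clearly visible
```

In this toy setting the error grows roughly with the square of the displacement strength, which is the behaviour the note warns about.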
- class vision3d.transforms.v2.FiveCrop(size)
Bases: _Refuse3DAwareMixin, FiveCrop
Crop the image or video into four corners and the central crop.
If the input is a torch.Tensor or an Image or a Video it can have arbitrary number of leading batch dimensions. For example, the image can have [..., C, H, W] shape.
Note
This transform returns a tuple of images and there may be a mismatch in the number of inputs and targets your Dataset returns. See below for an example of how to deal with this.
- Parameters:
size (sequence or int) – Desired output size of the crop. If size is an int instead of sequence like (h, w), a square crop of size (size, size) is made. If provided a sequence of length 1, it will be interpreted as (size[0], size[0]).
Example
>>> import torch
>>> from typing import Tuple, Union
>>> from torchvision import tv_tensors
>>> from vision3d.transforms import v2 as transforms
>>> class BatchMultiCrop(transforms.Transform):
...     def forward(self, sample: Tuple[Tuple[Union[tv_tensors.Image, tv_tensors.Video], ...], int]):
...         images_or_videos, labels = sample
...         batch_size = len(images_or_videos)
...         image_or_video = images_or_videos[0]
...         images_or_videos = tv_tensors.wrap(torch.stack(images_or_videos), like=image_or_video)
...         labels = torch.full((batch_size,), labels, device=images_or_videos.device)
...         return images_or_videos, labels
...
>>> image = tv_tensors.Image(torch.rand(3, 256, 256))
>>> label = 3
>>> transform = transforms.Compose([transforms.FiveCrop(224), BatchMultiCrop()])
>>> images, labels = transform(image, label)
>>> images.shape
torch.Size([5, 3, 224, 224])
>>> labels
tensor([3, 3, 3, 3, 3])
- class vision3d.transforms.v2.MixUp(*, alpha=1.0, num_classes=None, labels_getter='default')
Bases: _Refuse3DAwareMixin, MixUp
Apply MixUp to the provided batch of images and labels.
Paper: mixup: Beyond Empirical Risk Minimization.
Note
This transform is meant to be used on batches of samples, not individual images. See How to use CutMix and MixUp for detailed usage examples. The sample pairing is deterministic and done by matching consecutive samples in the batch, so the batch needs to be shuffled (this is an implementation detail, not a guaranteed convention).
In the input, the labels are expected to be a tensor of shape (batch_size,). They will be transformed into a tensor of shape (batch_size, num_classes).
- Parameters:
alpha (float, optional) – hyperparameter of the Beta distribution used for mixup. Default is 1.
num_classes (int, optional) – number of classes in the batch. Used for one-hot-encoding. Can be None only if the labels are already one-hot-encoded.
labels_getter (callable or "default", optional) – indicates how to identify the labels in the input. By default, this will pick the second parameter as the labels if it’s a tensor. This covers the most common scenario where this transform is called as MixUp()(imgs_batch, labels_batch). It can also be a callable that takes the same input as the transform, and returns the labels.
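The label transformation from shape (batch_size,) to (batch_size, num_classes) can be sketched with plain lists. The helper names are hypothetical, and pairing each sample with its predecessor (a roll by one along the batch axis) is an assumption that merely illustrates "matching consecutive samples".

```python
import random


def one_hot(label, num_classes):
    """Encode an integer class index as a list of floats."""
    return [1.0 if i == label else 0.0 for i in range(num_classes)]


def mixup_labels(labels, num_classes, alpha=1.0, rng=random):
    """Mix each one-hot label with the label of the previous sample in the batch."""
    lam = rng.betavariate(alpha, alpha)  # one mixing ratio for the whole batch
    encoded = [one_hot(label, num_classes) for label in labels]
    # Pair sample i with sample i - 1; the last sample wraps around to the front.
    rolled = encoded[-1:] + encoded[:-1]
    mixed = [
        [lam * a + (1.0 - lam) * b for a, b in zip(row, partner)]
        for row, partner in zip(encoded, rolled)
    ]
    return mixed, lam
```

The images would be blended with the same `lam`, so each mixed label row still sums to 1.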
- class vision3d.transforms.v2.RandAugment(num_ops=2, magnitude=9, num_magnitude_bins=31, interpolation=InterpolationMode.NEAREST, fill=None)
Bases: _Refuse3DAwareMixin, RandAugment
RandAugment data augmentation method based on “RandAugment: Practical automated data augmentation with a reduced search space”.
This transformation works on images and videos only.
If the input is torch.Tensor, it should be of type torch.uint8, and it is expected to have […, 1 or 3, H, W] shape, where … means an arbitrary number of leading dimensions. If img is PIL Image, it is expected to be in mode “L” or “RGB”.
- Parameters:
num_ops (int, optional) – Number of augmentation transformations to apply sequentially, must be non-negative integer. Default: 2.
magnitude (int, optional) – Magnitude for all the transformations.
num_magnitude_bins (int, optional) – The number of different magnitude values.
interpolation (InterpolationMode, optional) – Desired interpolation enum defined by torchvision.transforms.InterpolationMode. Default is InterpolationMode.NEAREST. If input is Tensor, only InterpolationMode.NEAREST, InterpolationMode.BILINEAR are supported.
fill (sequence or number, optional) – Pixel fill value for the area outside the transformed image. If given a number, the value is used for all bands respectively.
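One way to read magnitude and num_magnitude_bins together: the integer magnitude selects one of num_magnitude_bins evenly spaced values in an operation-specific range. The sketch below assumes that reading. The function name and the 0 to 30 degree rotation range are illustrative assumptions, not values taken from this page.

```python
def magnitude_value(magnitude, num_magnitude_bins, max_value):
    """Map an integer magnitude level to an evenly spaced value in [0, max_value]."""
    if not 0 <= magnitude < num_magnitude_bins:
        raise ValueError("magnitude must lie in [0, num_magnitude_bins)")
    step = max_value / (num_magnitude_bins - 1)
    return magnitude * step


# Hypothetical rotation range of 0 to 30 degrees, used only for illustration.
rotation_degrees = magnitude_value(magnitude=9, num_magnitude_bins=31, max_value=30.0)
```

Under these assumptions the default magnitude of 9 with 31 bins corresponds to a 9 degree rotation.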
- class vision3d.transforms.v2.RandomAffine(degrees, translate=None, scale=None, shear=None, interpolation=InterpolationMode.NEAREST, fill=0, center=None)
Bases: _Refuse3DAwareMixin, RandomAffine
Random affine transformation of the input keeping center invariant.
If the input is a torch.Tensor or a TVTensor (e.g. Image, Video, BoundingBoxes etc.) it can have arbitrary number of leading batch dimensions. For example, the image can have [..., C, H, W] shape. A bounding box can have [..., 4] shape.
- Parameters:
degrees (sequence or number) – Range of degrees to select from. If degrees is a number instead of sequence like (min, max), the range of degrees will be (-degrees, +degrees). Set to 0 to deactivate rotations.
translate (tuple, optional) – tuple of maximum absolute fraction for horizontal and vertical translations. For example translate=(a, b), then horizontal shift is randomly sampled in the range -img_width * a < dx < img_width * a and vertical shift is randomly sampled in the range -img_height * b < dy < img_height * b. Will not translate by default.
scale (tuple, optional) – scaling factor interval, e.g. (a, b), then scale is randomly sampled from the range a <= scale <= b. Will keep original scale by default.
shear (sequence or number, optional) – Range of degrees to select from. If shear is a number, a shear parallel to the x-axis in the range (-shear, +shear) will be applied. Else if shear is a sequence of 2 values a shear parallel to the x-axis in the range (shear[0], shear[1]) will be applied. Else if shear is a sequence of 4 values, an x-axis shear in (shear[0], shear[1]) and y-axis shear in (shear[2], shear[3]) will be applied. Will not apply shear by default.
interpolation (InterpolationMode, optional) – Desired interpolation enum defined by torchvision.transforms.InterpolationMode. Default is InterpolationMode.NEAREST. If input is Tensor, only InterpolationMode.NEAREST, InterpolationMode.BILINEAR are supported. The corresponding Pillow integer constants, e.g. PIL.Image.BILINEAR are accepted as well.
fill (number or tuple or dict, optional) – Pixel fill value used when the padding_mode is constant. Default is 0. If a tuple of length 3, it is used to fill R, G, B channels respectively. Fill value can be also a dictionary mapping data type to the fill value, e.g. fill={tv_tensors.Image: 127, tv_tensors.Mask: 0} where Image will be filled with 127 and Mask will be filled with 0.
center (sequence, optional) – Optional center of rotation, (x, y). Origin is the upper left corner. Default is the center of the image.
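The sampling ranges for degrees, translate, and scale can be summarised in a short standard-library sketch. The function `sample_affine_params` is hypothetical and omits shear. Rounding the shift to whole pixels is an assumption of this sketch, not something the parameter descriptions state.

```python
import random


def sample_affine_params(degrees, translate, scale, img_width, img_height, rng=random):
    """Draw one rotation angle, pixel translation, and scale factor."""
    if isinstance(degrees, (int, float)):
        degrees = (-degrees, degrees)  # a single number means (-degrees, +degrees)
    angle = rng.uniform(degrees[0], degrees[1])

    if translate is None:
        dx, dy = 0, 0  # no translation by default
    else:
        max_dx = translate[0] * img_width
        max_dy = translate[1] * img_height
        dx = round(rng.uniform(-max_dx, max_dx))
        dy = round(rng.uniform(-max_dy, max_dy))

    # Keep the original scale unless an interval (a, b) is given.
    factor = 1.0 if scale is None else rng.uniform(scale[0], scale[1])
    return angle, (dx, dy), factor
```

For example, `sample_affine_params(15, (0.1, 0.2), (0.8, 1.2), 100, 50)` draws an angle within 15 degrees of zero, a shift of at most 10 pixels along each axis, and a scale between 0.8 and 1.2.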
- class vision3d.transforms.v2.RandomHorizontalFlip(p=0.5)
Bases: _Refuse3DAwareMixin, RandomHorizontalFlip
Horizontally flip the input with a given probability.
If the input is a torch.Tensor or a TVTensor (e.g. Image, Video, BoundingBoxes etc.) it can have arbitrary number of leading batch dimensions. For example, the image can have [..., C, H, W] shape. A bounding box can have [..., 4] shape.
- Parameters:
p (float, optional) – probability of the input being flipped. Default value is 0.5
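For a 2D bounding box, a horizontal flip is a single reflection of the x coordinates. The sketch below assumes boxes in (x1, y1, x2, y2) pixel format. `hflip_xyxy_box` and `canvas_width` are illustrative names, not part of the library.

```python
def hflip_xyxy_box(box, canvas_width):
    """Mirror an (x1, y1, x2, y2) box about the vertical centre line of the canvas."""
    x1, y1, x2, y2 = box
    # The left and right edges swap roles after mirroring.
    return (canvas_width - x2, y1, canvas_width - x1, y2)


flipped = hflip_xyxy_box((10, 20, 50, 80), canvas_width=100)
# flipped == (50, 20, 90, 80)
```

Applying the function twice returns the original box.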
- class vision3d.transforms.v2.RandomIoUCrop(min_scale=0.3, max_scale=1.0, min_aspect_ratio=0.5, max_aspect_ratio=2.0, sampler_options=None, trials=40)
Bases: _Refuse3DAwareMixin, RandomIoUCrop
Random IoU crop transformation from “SSD: Single Shot MultiBox Detector”.
This transformation requires an image or video data and tv_tensors.BoundingBoxes in the input.
Warning
In order to properly remove the bounding boxes below the IoU threshold, RandomIoUCrop must be followed by SanitizeBoundingBoxes, either immediately after or later in the transforms pipeline.
If the input is a torch.Tensor or a TVTensor (e.g. Image, Video, BoundingBoxes etc.) it can have arbitrary number of leading batch dimensions. For example, the image can have [..., C, H, W] shape. A bounding box can have [..., 4] shape.
- Parameters:
min_scale (float, optional) – Minimum factors to scale the input size.
max_scale (float, optional) – Maximum factors to scale the input size.
min_aspect_ratio (float, optional) – Minimum aspect ratio for the cropped image or video.
max_aspect_ratio (float, optional) – Maximum aspect ratio for the cropped image or video.
sampler_options (list of float, optional) – List of minimal IoU (Jaccard) overlap between all the boxes and a cropped image or video. Default, None which corresponds to [0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0]
trials (int, optional) – Number of trials to find a crop for a given value of minimal IoU (Jaccard) overlap. Default, 40.
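The IoU (Jaccard) overlap that sampler_options thresholds can be computed as follows. This is a plain-Python sketch with hypothetical helper names. The acceptance rule in `crop_is_acceptable` (the best-overlapping box decides) is an assumption about how a candidate crop is judged.

```python
def box_iou(box_a, box_b):
    """Intersection over union (Jaccard overlap) of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - intersection
    return intersection / union if union > 0 else 0.0


def crop_is_acceptable(crop, boxes, min_iou):
    """Accept a candidate crop if at least one box overlaps it by min_iou or more."""
    return any(box_iou(crop, box) >= min_iou for box in boxes)
```

A crop of (0, 0, 10, 10) and a box of (5, 5, 15, 15) share 25 of 175 units of combined area, so the crop would pass a 0.1 threshold and fail a 0.3 threshold.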
- class vision3d.transforms.v2.RandomPerspective(distortion_scale=0.5, p=0.5, interpolation=InterpolationMode.BILINEAR, fill=0)
Bases: _Refuse3DAwareMixin, RandomPerspective
Perform a random perspective transformation of the input with a given probability.
If the input is a torch.Tensor or a TVTensor (e.g. Image, Video, BoundingBoxes etc.) it can have arbitrary number of leading batch dimensions. For example, the image can have [..., C, H, W] shape. A bounding box can have [..., 4] shape.
- Parameters:
distortion_scale (float, optional) – argument to control the degree of distortion and ranges from 0 to 1. Default is 0.5.
p (float, optional) – probability of the input being transformed. Default is 0.5.
interpolation (InterpolationMode, optional) – Desired interpolation enum defined by torchvision.transforms.InterpolationMode. Default is InterpolationMode.BILINEAR. If input is Tensor, only InterpolationMode.NEAREST, InterpolationMode.BILINEAR are supported. The corresponding Pillow integer constants, e.g. PIL.Image.BILINEAR are accepted as well.
fill (number or tuple or dict, optional) – Pixel fill value used when the padding_mode is constant. Default is 0. If a tuple of length 3, it is used to fill R, G, B channels respectively. Fill value can be also a dictionary mapping data type to the fill value, e.g. fill={tv_tensors.Image: 127, tv_tensors.Mask: 0} where Image will be filled with 127 and Mask will be filled with 0.
- static get_params(width, height, distortion_scale)
Get parameters for perspective for a random perspective transform.
- Parameters:
- Returns:
List containing [top-left, top-right, bottom-right, bottom-left] of the original image, List containing [top-left, top-right, bottom-right, bottom-left] of the transformed image.
- Return type:
- class vision3d.transforms.v2.RandomRotation(degrees, interpolation=InterpolationMode.NEAREST, expand=False, center=None, fill=0)
Bases: _Refuse3DAwareMixin, RandomRotation
Rotate the input by angle.
If the input is a torch.Tensor or a TVTensor (e.g. Image, Video, BoundingBoxes etc.) it can have arbitrary number of leading batch dimensions. For example, the image can have [..., C, H, W] shape. A bounding box can have [..., 4] shape.
Note
When center=None and the angle is a multiple of 90 degrees (0, 90, 180, 270), the rotation is performed using torch.rot90() instead of an affine transform. This is significantly faster, but the output tensor for 90 and 270 degree rotations may not be contiguous. Users who need contiguous output should call contiguous() on the result.
- Parameters:
degrees (sequence or number) – Range of degrees to select from. If degrees is a number instead of sequence like (min, max), the range of degrees will be [-degrees, +degrees]. [90, 90] will rotate the image by 90 degrees anticlockwise.
interpolation (InterpolationMode, optional) – Desired interpolation enum defined by torchvision.transforms.InterpolationMode. Default is InterpolationMode.NEAREST. If input is Tensor, only InterpolationMode.NEAREST, InterpolationMode.BILINEAR are supported. The corresponding Pillow integer constants, e.g. PIL.Image.BILINEAR are accepted as well.
expand (bool, optional) – Optional expansion flag. If true, expands the output to make it large enough to hold the entire rotated image. If false or omitted, make the output image the same size as the input image. Note that the expand flag assumes rotation around the center (see note below) and no translation.
center (sequence, optional) – Optional center of rotation, (x, y). Origin is the upper left corner. Default is the center of the image.
Note
In theory, setting center has no effect if expand=True, since the image center will become the center of rotation. In practice however, due to numerical precision, this can lead to off-by-one differences of the resulting image size compared to using the image center in the first place. Thus, when setting expand=True, it’s best to leave center=None (default).
fill (number or tuple or dict, optional) – Pixel fill value used when the padding_mode is constant. Default is 0. If a tuple of length 3, it is used to fill R, G, B channels respectively. Fill value can be also a dictionary mapping data type to the fill value, e.g. fill={tv_tensors.Image: 127, tv_tensors.Mask: 0} where Image will be filled with 127 and Mask will be filled with 0.
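The anticlockwise convention and the center parameter can be illustrated by rotating a single point. This is a geometric sketch with a hypothetical helper, `rotate_point`. It assumes image coordinates with the origin at the upper left corner and y increasing downwards.

```python
import math


def rotate_point(point, angle_degrees, center):
    """Rotate (x, y) about center; positive angles look anticlockwise on screen.

    Because y points down in image coordinates, the sine terms carry the
    opposite sign from the usual mathematical (y up) convention.
    """
    x, y = point
    cx, cy = center
    theta = math.radians(angle_degrees)
    dx, dy = x - cx, y - cy
    new_x = cx + dx * math.cos(theta) + dy * math.sin(theta)
    new_y = cy - dx * math.sin(theta) + dy * math.cos(theta)
    return new_x, new_y
```

A point one pixel to the right of the center, rotated by 90 degrees, ends up one pixel above it, which is what an anticlockwise rotation looks like on screen.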
- class vision3d.transforms.v2.RandomVerticalFlip(p=0.5)
Bases: _Refuse3DAwareMixin, RandomVerticalFlip
Vertically flip the input with a given probability.
If the input is a torch.Tensor or a TVTensor (e.g. Image, Video, BoundingBoxes etc.) it can have arbitrary number of leading batch dimensions. For example, the image can have [..., C, H, W] shape. A bounding box can have [..., 4] shape.
- Parameters:
p (float, optional) – probability of the input being flipped. Default value is 0.5
- class vision3d.transforms.v2.TenCrop(size, vertical_flip=False)
Bases: _Refuse3DAwareMixin, TenCrop
Crop the image or video into four corners and the central crop plus the flipped version of these (horizontal flipping is used by default).
If the input is a torch.Tensor or an Image or a Video it can have arbitrary number of leading batch dimensions. For example, the image can have [..., C, H, W] shape.
Note
This transform returns a tuple of images and there may be a mismatch in the number of inputs and targets your Dataset returns. See FiveCrop for an example of how to deal with this.
- Parameters:
size (sequence or int) – Desired output size of the crop. If size is an int instead of sequence like (h, w), a square crop (size, size) is made. If provided a sequence of length 1, it will be interpreted as (size[0], size[0]).
vertical_flip (bool, optional) – Use vertical flipping instead of horizontal
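The five crop windows shared by FiveCrop and TenCrop can be written down directly from the size rules above. `five_crop_boxes` is a hypothetical helper returning (top, left, height, width) tuples. The rounding used for the center crop is an assumption of this sketch.

```python
def five_crop_boxes(img_height, img_width, size):
    """Return (top, left, height, width) for the four corner crops and the center crop."""
    if isinstance(size, int):
        crop_h, crop_w = size, size  # an int means a square crop
    elif len(size) == 1:
        crop_h, crop_w = size[0], size[0]  # length-1 sequence is also square
    else:
        crop_h, crop_w = size
    if crop_h > img_height or crop_w > img_width:
        raise ValueError("requested crop is larger than the input")

    center_top = round((img_height - crop_h) / 2)
    center_left = round((img_width - crop_w) / 2)
    return [
        (0, 0, crop_h, crop_w),  # top-left
        (0, img_width - crop_w, crop_h, crop_w),  # top-right
        (img_height - crop_h, 0, crop_h, crop_w),  # bottom-left
        (img_height - crop_h, img_width - crop_w, crop_h, crop_w),  # bottom-right
        (center_top, center_left, crop_h, crop_w),  # center
    ]
```

TenCrop would apply the same five windows a second time to the flipped input, giving ten crops in total.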
- class vision3d.transforms.v2.TrivialAugmentWide(num_magnitude_bins=31, interpolation=InterpolationMode.NEAREST, fill=None)
Bases: _Refuse3DAwareMixin, TrivialAugmentWide
Dataset-independent data-augmentation with TrivialAugment Wide, as described in “TrivialAugment: Tuning-free Yet State-of-the-Art Data Augmentation”.
This transformation works on images and videos only.
If the input is torch.Tensor, it should be of type torch.uint8, and it is expected to have […, 1 or 3, H, W] shape, where … means an arbitrary number of leading dimensions. If img is PIL Image, it is expected to be in mode “L” or “RGB”.
- Parameters:
num_magnitude_bins (int, optional) – The number of different magnitude values.
interpolation (InterpolationMode, optional) – Desired interpolation enum defined by torchvision.transforms.InterpolationMode. Default is InterpolationMode.NEAREST. If input is Tensor, only InterpolationMode.NEAREST, InterpolationMode.BILINEAR are supported.
fill (sequence or number, optional) – Pixel fill value for the area outside the transformed image. If given a number, the value is used for all bands respectively.