Data standardization and augmentation

Prior to feeding the data to the neural network for training, some preprocessing is usually done. Many beginners fail to obtain reasonable results not because of the architectures or methods or a lack of regularization, but simply because they did not normalize and visually inspect their data. The two most important forms of preprocessing are data standardization and dataset augmentation.

There are a few data standardization techniques common in imaging.

• Mean subtraction. During mean subtraction, the mean of every channel is computed over the training dataset, and these means are subtracted channelwise from both the training and the testing data.

• Scaling. Scaling amounts to computing channelwise standard deviations across the training dataset and dividing the input data channelwise by these values so as to obtain a distribution with standard deviation equal to 1 in each channel. In place of division by the standard deviation one can divide, e.g., by the 95th percentile of the absolute value of a channel.

• Specialized methods. In addition to these generic methods, there are also specialized standardization methods for medical imaging tasks. For example, in chest X-ray one has to work with images coming from different vendors; furthermore, X-ray tubes might be deteriorating. In [17], local energy-based normalization was investigated for chest X-ray images, and it was shown that this normalization technique improves model performance on supervised computer-aided detection tasks. As another example, when working with hematoxylin and eosin (H&E) stained histological slides, one can observe variations in color and intensity in samples coming from different laboratories and stained on different days. These variations can potentially reduce the effectiveness of quantitative image analysis. A normalization algorithm specifically designed to tackle this problem was suggested in [18], where it was also shown to improve performance on several computer-aided detection tasks for these slide images. Finally, in certain scenarios (e.g., working directly with raw sinogram data for CT or Digital Breast Tomosynthesis [19]) it is reasonable to take a log-transform of the input data as an extra preprocessing step.

Neural networks are known to benefit from large amounts of training data, and it is common practice to artificially enlarge an existing dataset by adding data to it in a process called "augmentation". We distinguish between train-time augmentation and test-time augmentation, and concentrate on the former for now, as it is also the more common of the two. In the case of train-time augmentation, the goal is to provide a larger training dataset to the algorithm. In a supervised learning scenario, we are given a dataset D consisting of pairs (x_j, y_j) of a training sample x_j ∈ R^d and the corresponding label y_j. Given the dataset D, one designs transformations T_1, T_2, ..., T_n : R^d → R^d which are label-preserving in the sense that for every sample (x_j, y_j) ∈ D and every transformation T_i the resulting vector T_i x_j still looks like a sample from D with label y_j. Multiple transformations can additionally be composed, resulting in a greater number of new samples. The new samples, with labels assigned to them in this way, are added to the training dataset, and optimization is performed as usual.
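The following sketch illustrates channelwise standardization and a simple label-preserving train-time augmentation in NumPy. It is only a minimal illustration of the ideas above, not part of any of the cited methods: the array names and placeholder data are hypothetical, and images are assumed to be stored as float arrays of shape (N, H, W, C).

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in data: float images of shape (N, H, W, C) with labels.
train_images = rng.random((100, 64, 64, 3)).astype(np.float32)
train_labels = rng.integers(0, 2, size=100)
test_images = rng.random((20, 64, 64, 3)).astype(np.float32)

# Channelwise statistics computed over the training dataset only.
channel_mean = train_images.mean(axis=(0, 1, 2))        # shape (C,)
channel_std = train_images.std(axis=(0, 1, 2)) + 1e-8   # avoid division by zero

def standardize(images):
    """Mean subtraction and scaling with training-set statistics."""
    return (images - channel_mean) / channel_std

train_images = standardize(train_images)
test_images = standardize(test_images)    # same statistics as for the training data

# A simple label-preserving augmentation: horizontal flipping.
# The mirrored copies keep the labels of the originals and enlarge the dataset.
flipped = train_images[:, :, ::-1, :]     # mirror along the width axis
train_images = np.concatenate([train_images, flipped], axis=0)
train_labels = np.concatenate([train_labels, train_labels], axis=0)
```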
In the case of test-time augmentation, the goal is to improve the test-time performance of the model as follows. For a predictive model f, given a test sample x ∈ R^d, one computes the model predictions f(x), f(T_1 x), ..., f(T_n x) for different augmenting transformations and aggregates these predictions in a certain way (e.g., by averaging the softmax output of the classification layer [6]). In general, the choice of augmenting transformations depends on the dataset, but there are a few common strategies for data augmentation in imaging tasks (several of them are illustrated in the sketch following this list):

• Flipping. The image x is mirrored in one or two dimensions, yielding one or two additional samples. Flipping in the horizontal dimension is commonly done, e.g., on the ImageNet dataset [6], while on medical imaging datasets flipping in both dimensions is sometimes used.

• Random cropping and scaling. An image x of dimensions W × H is cropped to a random region [x_1, x_2] × [y_1, y_2] ⊆ [0, W] × [0, H], and the result is interpolated to the original pixel dimensions if necessary. The cropped region should still be large enough to preserve sufficient global context for correct label assignment.

• Random rotation. An image x is rotated by a random angle ϕ (often limited to the set ϕ ∈ {π/2, π, 3π/2}). This transformation is useful, e.g., in pathology, where rotation invariance of the samples is observed; however, it is not widely used on datasets like ImageNet.

• Gamma transform. A grayscale image x is mapped to the image x^γ for some γ > 0, where γ = 1 corresponds to the identity mapping. This transformation in effect adjusts the contrast of the image.

• Color augmentations. Individual color channels of the image are altered in order to capture a certain invariance of the classification with respect to variation in factors such as the intensity of illumination or its color. This can be done, e.g., by adding small random offsets to individual channel values; an alternative scheme based on PCA can be found in [6].
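The sketch below applies several of the listed transformations (flips, rotations by multiples of π/2, a gamma transform, and small channel offsets) to a single test image and averages the model's softmax outputs, in the spirit of the aggregation scheme described above. It is only an illustration: `model` is a hypothetical callable returning class probabilities for a batch of images, the image is assumed to be square with values in [0, 1], and the parameter values are arbitrary.

```python
import numpy as np

def augmentations(image):
    """Yield augmented copies of a single square (H, H, C) image with values in [0, 1]."""
    yield image
    yield image[:, ::-1, :]                     # horizontal flip
    yield image[::-1, :, :]                     # vertical flip
    for k in (1, 2, 3):                         # rotations by pi/2, pi, 3*pi/2
        yield np.rot90(image, k=k, axes=(0, 1))
    yield np.clip(image, 0.0, 1.0) ** 1.2       # gamma transform (contrast change)
    yield np.clip(image + 0.05 * np.random.randn(image.shape[-1]), 0.0, 1.0)  # channel offsets

def predict_with_tta(model, image):
    """Average the model's softmax outputs over the augmented versions of the image."""
    batch = np.stack(list(augmentations(image)))
    return model(batch).mean(axis=0)            # aggregated class probabilities
```

For train-time use, the same kinds of transformations would instead be applied, usually at random, to training samples while keeping their labels, as described earlier in this section.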