7641group11.github.io

Semantic Segmentation for Autonomous Vehicles Subjected to Adverse Weather Conditions.

Contributed by Hingwe Mihir, Ibidapo Samuel, Jere Kunal, Pawaskar Bhushan, Sodimu Oluwatofunmi.

Introduction:

To perceive its environment, an autonomous vehicle relies on advanced sensors like lidars and cameras, which generate tonnes of raw data. However, this data must undergo extensive processing to derive meaningful semantic and spatial information, which then serves as the foundation for making informed decisions further along the pipeline. In literature, image segmentation models utilizing deep learning techniques have been employed to not only extract significant features from the environment but also to generate rich descriptions of diverse object categories [1][2].

Existing image segmentation models perform well under clear conditions but severe weather degrades the quality of image data reducing the performance [3]. To counter the distortion of images, realtime and post processing techniques have been used to mitigate the effect of severe weather on these images without altering the scene [4][5]. Additionally, ensembling multiple deep learning models has been shown to improve the accuracy of object detection in unfavorable weather conditions [6].

Problem Statement:

Reliability of camera data is affected due to adverse weather conditions like rainfall. Our primary focus is on designing a model that can excel in these challenging scenarios.

Our project aims to create a machine learning model capable of robustly and accurately predicting semantic image labels across the spatial regions within images taken from an autonomous vehicle camera in rainy weather environments.

Data Collection

For this study, we will be using the RaidaR dataset consisting of 58,542 images of street scenes taken onboard a self-driving car camera. The dataset includes a pair of 14,570 actual rainy images and their corresponding semantic segmented annotations. It was determined that it would be computationally intensive and time-consuming to train our model on all 14,570 images, so we decided to use a subset of 2000 images. Additionally, we decided to randomly select the images because it was discovered that many of the images were similar because they were obtained from video feeds, which necessitated the selection of images from different scenes to ensure a representative dataset.

To achieve this, the images were shuffled using the numpy random module and 2000 of them were randomly selected as the actual dataset. Further exploration of the selected images revealed that some of the images were brighter corresponding to being taken during the day, while others seemed to have been taken at dusk and were relatively not too bright.

It was important to cluster the dataset based on this characteristic to avoid bias and improve model accuracy, and the following methods were explored:

Classifying based on the average of the HSV V-channel (brightness) of all pixels in the image. An interesting observation from this was that relatively dark images were classified as relatively bright images if they had light emitting objects within them (e.g. an image with multiple cars with their headlamps on).
KMeans algorithm clustering of the normalized brightness values of each image.
Classifying based on the average pixel values of the HSV V-channel for a subset of pixels from the sky (top-middle) of the images.

Of these methods, classifying based on the average brightness of the sky was the method of choice because it was determined to be the most reliable indicator of time of day. The mean brightness across the dataset was calculated and used as the threshold for grouping the images. Representative images from clusters 1 (dawn to mid-day) and 2 (dusk) are shown in Figures 1 and 2 below.

Cluster 1 (Dawn to Mid-day) — Representative images of dawn to midday cluster

Cluster 2 (Dusk) — Representative images of dusk cluster

Methodology:

So far we have been utilizing unsupervised learning techniques. We chose to employ two distinct methods to segment the images in our dataset; GMM and Hierarchical clustering.

These algorithms cluster data points from the images based on their distance in the RGB color channel space. This approach facilitates grouping of objects that exhibit similar colors in close proximity to one another. However, the ground truth labels that we intend to use for classification are not segmented based on the RGB color space. Instead, their actual segmentation is performed based on semantic categories that may or may not be related to the colors they exhibit. Although it can be argued that certain areas, such as the sky or large patches of vegetation, may have a singular color, this is not necessarily applicable to other categories such as signs, people, or cars. Consequently, these methods yield unsatisfactory results when applied to our images, performing well in separating large areas but failing to segment smaller objects that may not be correlated with their color.

In spite of this, we proceeded to attempt image segmentation using both GMM and Hierarchical clustering in order to evaluate their efficacy in this context. As these algorithms require the number of clusters to be specified beforehand, we employed the elbow method to determine the optimal number of clusters.

The distortion score versus the number of clusters graph obtained using hierarchical clustering is presented below.

Following the evaluation, we have determined to segment our image between four and six clusters, providing a foundation to further refine our results. To improve the outcomes, we have experimented with several preprocessing methods such as:

Blurring and normalizing
Adding white noise
Contouring
Blob detection
Considering pixel position when calculating distance

To evaluate the effectiveness of the clustering, we have utilized internal measures for clustering evaluation, rather than relying on external measures and comparing with ground truth labels. This approach was taken due to the large number of object categories present in the ground truth labels, compared to the limited number of clusters employed. Although we could have increased the number of clusters to match the semantic object categories, unsupervised learning methods such as GMM or Hierarchical clustering do not predict which cluster belongs to which semantic category. Instead, they merely provide us with the clusters. Hence, there is no direct method to match predicted and actual clusters to ensure they represent the same cluster. While we could have attempted to match clusters using means, the algorithms are generating clusters in rgb space, whereas the actual clusters exist in semantic space. Therefore, there is little correlation between the means of these clusters.

The internal measure based methods that compare intra-cluster distances with inter-cluster distances which we used were:

Davies-Bouldin Index
Silhouette Coefficient
Beta CV measure

The external measures used were as follows:

Pairwise measures
- Rand’s statistic
- Jaccard’s coefficient
Entropy Based measures
- Normalized Mutual information

For our supervised method we decided to use a UNet which is a CNN architecture commonly used for image segmentation in medical imaging.

O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional Networks for Biomedical Image Segmentation,” arXiv.org, 18-May-2015. https://doi.org/10.48550/arXiv.1505.04597.

The network first “encodes” the image to extract the features of the input similar to a traditional CNN. Each convolution block consists of a 3x3 convolution, followed by regularization and a ReLu step. This is repeated once more then a maxpool calculation is performed. This encoding step is repeated until the desired depth of the Unet is reached. Similar to the above figure, our implementation was three layers deep.

The next portion of the network “decodes” the extracted features to create a segmentation map which shows the probability of a pixel belonging to a certain class. The decoder performs deconvolution or upsampling and then concatenates the output with the feature map taken from the output of the encoder at the same layer. This “skip-step” allows the decoder to have access to higher resolution feature maps which improves localization. Following this, two convolution steps are taken similar to the encoding phase. The decoding step is repeated until the output is the same dimension as the input image. At this point a final 1x1 convolution and softmax calculation is performed. The result is an image segmentation map which can be compared to the truth image.

To update the network parameters for the UNet, we used multiclass cross entropy as our loss function. Each pixel of the pre segmented image was assigned a class integer label and then vectorized using one hot encoding. We chose cross entropy as it would penalize the output of the UNet for having a low probability for the correct classification of the input image pixel.

We also implemented a custom convolutional neural network for individual pixel classification. The pixels of each of the semantically segmented images were assigned an integer label (e.g. static (0,0,0) - 0; ground (81,0,81) - 6, person (220,20,60) - 24), and the flattened array was used as the y values for training/testing. The architecture consisted of three convolutional layers with leaky relu activation functions followed by a dropout and reshape layer, two dense layers with leaky relu activation functions, another dropout layer, and finally a dense layer using a softmax activation function with 31 neurons corresponding to each of the labels.The labels were then converted back to semantically segmented images by matching the class with the highest probability to the corresponding RGB value. The input images were resized from 1920x1080 to 64x64 before fitting.

Results and Discussion:

Unsupervised

Based on the subset of images that was set aside for training, it can be seen that randomization was not effective at generating a representative dataset, as many of the images in the randomized dataset were still from similar scenes. Moving forward, we intend to use clustering to group images from similar scenes together and pick one or a few images from each cluster depending on the number of clusters generated. Additionally, for classifying images based on time of day, we determined that further analysis would need to be done to decide the most appropriate threshold.

Forward and backward feature selection are not commonly used for images and implementing dimensionality reduction methods such as PCA and LDA can lead to loss of information and ultimately affect the quality of the segmentation. Therefore, our preprocessing techniques focused on preprocessing specifically for image segmentation, such as blurring and normalizing, blob detection, pixel positioning, and contouring.

We have split our results based off of each image pre-processing technique we implemented on the Gaussian Mixture Modeling and Hierarchical Clustering algorithms. Our calculation metrics were most consistent with the Silhouette Coefficient and Beta CV and based on that, the preprocessing techniques that performed the worst were contouring and blob detection. Surprisingly, clustering the original image without any preprocessing performed the best followed closely by the image that we preprocessed by blurring and normalizing. The images and data table can be viewed below.

Original Image

No Preprocessing

Blob Detection

Contouring

Blurred and Normalized

Pixel Positioning

Supervised

As can be seen in the images below and the accuracy table, the customized CNN for individual pixel classification did not perform as well as the UNet method. Some of the limitations of this technique include:

Image size: The input images were resized from 1920 by 1080 to 64 by 64, this led to a significant loss in the information needed for the individual pixel classification.
Architecture: The architecture of the model, including filters, kernel sizes, optimizer, and loss functions, needs to be further optimized for individual pixel classification.

Unet training (batch size = 10)

Unet training (batch size = 32)

Unet training (batch size = 64)

Custom CNN

UNet seems to perform much better than the custom FCN on a variety of evaluation metrics.

For evaluation of our results we used the following indices which are

DB - Davies Bouldin Index
SS - Silhouette Coefficient
Beta CV - Beta CV measure
Jac - Jaccard Index
Rand - Rands Score
Fm - Fowlkes Mallows Score
Hom - Homogeneity Score
Nmi - Normalized Mutual Information

Comparison of results for supervised and unsupervised methods:

Next, we compare the common evaluation measures Davies Bouldin Index, Silhouette Coefficient, Beta CV measure for similar images to see how well the CNN did in comparison to Gaussian Mixture Models and Hierarchical Clustering.

Higher values represent better performance. As can be clearly seen, well optimized neural networks like UNet do far better than simple clustering algorithms like GMM. This is because of the complexity that these models have in comparison. The custom fully convolutional did not perform as well even after playing around with many parameters. This can be attributed to the fact that UNet has been optimized over years of research and thus is far superior to the basic FCN used by us. Specifically, the “decoding” step of the UNet and incorporation of skip steps allows for a more robust output segmented output of the image.

Final Project Video:

https://drive.google.com/file/d/1HFgAdporBk8gFErS8lpyvu5zPqvNHS-D/view?usp=share_link

References:

[1] I. Papadeas, L. Tsochatzidis, A. Amanatiadis, and I. Pratikakis, “Real-Time Semantic Image Segmentation with Deep Learning for Autonomous Driving: A Survey,” Applied Sciences, vol. 11, no. 19, p. 8802, Sep. 2021, doi:

[2] H. Fujiyoshi, T. Hirakawa, and T. Yamashita, “Deep learning-based image recognition for autonomous driving,” IATSS Research, vol. 43, no. 4, pp. 244–252, Dec. 2019, doi: 10.1016/j.iatssr.2019.11.008.

[3] S. Zang, M. Ding, D. Smith, P. Tyler, T. Rakotoarivelo and M. A. Kaafar, “The Impact of Adverse Weather Conditions on Autonomous Vehicles: How Rain, Snow, Fog, and Hail Affect the Performance of a Self-Driving Car,” in IEEE Vehicular Technology Magazine, vol. 14, no. 2, pp. 103-111, June 2019, doi: 10.1109/MVT.2019.2892497.

[4] K. Garg and S. K. Nayar, “When does a camera see rain?,” Tenth IEEE International Conference on Computer Vision (ICCV’05) Volume 1, Beijing, China, 2005, pp. 1067-1074 Vol. 2, doi: 10.1109/ICCV.2005.253.

[5] X. Fu, J. Huang, X. Ding, Y. Liao and J. Paisley, “Clearing the Skies: A Deep Network Architecture for Single-Image Rain Removal,” in IEEE Transactions on Image Processing, vol. 26, no. 6, pp. 2944-2956, June 2017, doi: 10.1109/TIP.2017.2691802.

[6] Rahee Walambe, Aboli Marathe, Ketan Kotecha, George Ghinea, “Lightweight Object Detection Ensemble Framework for Autonomous Vehicles in Challenging Weather Conditions”, Computational Intelligence and Neuroscience, vol. 2021, Article ID 5278820, 12 pages, 2021. https://doi.org/10.1155/2021/5278820.

[7] Yao ZhiWei, Yao Yu and Xu Xiao, “Image segmentation based on ensemble learning,” 2010 International Conference on Computer and Communication Technologies in Agriculture Engineering, Chengdu, 2010, pp. 423-427, doi: 10.1109/CCTAE.2010.5543712.