UC Berkeley and Adobe AI researchers propose BlobGAN, a new unsupervised, mid-level representation for remarkably flexible scene manipulation

Since the advent of computer vision, one of the fundamental questions of the research community has been how to represent the incredible richness of the visual world. One concept that has emerged from the beginning is the importance of the scene as context for understanding objects. Suppose we want a classifier to distinguish between a sofa and a bed. In that case, the scene context provides information about the environment (i.e., whether the room is a living room or a bedroom) that can be useful for classification.

However, after years of research, scene images are still primarily represented in two ways: 1) in a top-down fashion, where scene classes are labeled the same way as object classes, or 2) in a bottom-up fashion, with semantic labeling of individual pixels. The main limitation of these two approaches is that they do not represent the different parts of a scene as entities. In the first case, the different components are merged into a single label; in the second case, the individual elements are individual pixels, not entities.

From the official video presentation | Source: https://arxiv.org/pdf/2205.02837.pdf

To fill this gap, researchers at UC Berkeley and Adobe Research proposed BlobGAN, a novel unsupervised, mid-level representation for generative models of scenes. Mid-level means that the representation is neither per pixel nor per image; instead, entities in scenes are modeled as spatially arranged, depth-ordered Gaussian blobs. Given some random noise, the layout network, an 8-layer MLP, maps it to a collection of blob parameters, which are then splatted onto a spatial grid and passed to a StyleGAN2-like decoder. The model is trained in an adversarial framework with an unmodified StyleGAN2 discriminator.
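As a rough Python sketch of this first stage (assuming PyTorch; the hidden width, activation, and parameter layout below are illustrative assumptions, not the paper's exact settings), the layout network can be thought of as an MLP that maps a noise vector to per-blob parameters and feature vectors:

```python
# Minimal sketch of the layout-network idea, not the authors' implementation.
import torch
import torch.nn as nn

class LayoutNetwork(nn.Module):
    def __init__(self, noise_dim=512, k_blobs=10, feature_dim=256, hidden_dim=1024):
        super().__init__()
        # Per blob: center (x, y), scale, aspect ratio, rotation = 5 scalars,
        # plus a structure vector and a style vector (assumed sizes).
        self.k = k_blobs
        self.per_blob = 5 + 2 * feature_dim
        layers, dim = [], noise_dim
        for _ in range(8):  # eight-layer MLP, as described in the article
            layers += [nn.Linear(dim, hidden_dim), nn.LeakyReLU(0.2)]
            dim = hidden_dim
        layers.append(nn.Linear(dim, self.k * self.per_blob))
        self.mlp = nn.Sequential(*layers)

    def forward(self, z):
        out = self.mlp(z).view(-1, self.k, self.per_blob)
        params, feats = out[..., :5], out[..., 5:]
        return params, feats  # blob geometry and per-blob feature vectors

z = torch.randn(4, 512)
params, feats = LayoutNetwork()(z)
print(params.shape, feats.shape)  # torch.Size([4, 10, 5]) torch.Size([4, 10, 512])
```

The output of such a network would then be rendered into a spatial feature grid and decoded by the StyleGAN2-like generator.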

Source: https://arxiv.org/pdf/2205.02837.pdf

More specifically, blobs are represented as ellipses with center coordinates x, scale s, aspect ratio a, and rotation angle θ. In addition, each blob is associated with two feature vectors, one for structure and one for style.
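As a small illustration (the field names are mine, not taken from the released code), the per-blob parameters can be grouped like this:

```python
# Hypothetical grouping of one blob's parameters, for illustration only.
from dataclasses import dataclass
import torch

@dataclass
class Blob:
    center: torch.Tensor        # (x, y) position in normalized image coordinates
    scale: torch.Tensor         # s: blob size; very low values suppress the blob
    aspect_ratio: torch.Tensor  # a: elongation of the ellipse
    angle: torch.Tensor         # θ: rotation of the ellipse
    structure: torch.Tensor     # feature vector controlling structure
    style: torch.Tensor         # feature vector controlling style/appearance

blob = Blob(center=torch.tensor([0.5, 0.6]), scale=torch.tensor(0.3),
            aspect_ratio=torch.tensor(1.5), angle=torch.tensor(0.2),
            structure=torch.randn(256), style=torch.randn(256))
```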

Source: https://arxiv.org/pdf/2205.02837.pdf | From the official video presentation

Thus, the layout network maps random noise to a fixed number k of blobs (the network can also suppress a blob by assigning it a very low scale parameter), each represented by four parameters (actually five, since the center is defined by x and y coordinates) and two feature vectors. The ellipses defined by these parameters are then splatted onto a spatial grid with a depth ordering, alpha-composited in 2D (to handle occlusion and relationships between objects), and populated using the information in the feature vectors. The resulting layout is then passed to the generator. In the original StyleGAN2, the generator took as input a single vector containing all the extracted information, while in this work the first layers have been modified to take layout and appearance separately. This design yields a disentangled representation, reinforced by the authors' addition of uniform noise to the blob parameters before feeding them into the generator.
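Below is a deliberately simplified sketch of the splatting-and-compositing step: isotropic Gaussian opacities stand in for the full ellipse parameterization (aspect ratio and rotation are ignored here), and the compositing is a plain front-to-back alpha blend, so this approximates the idea rather than the paper's exact formulation.

```python
# Assumed shapes and simplified math; not the official splatting code.
import torch

def splat_blobs(centers, scales, feats, grid_size=16):
    """centers: (k, 2), scales: (k,), feats: (k, d) -> (d, grid, grid) feature map."""
    k, d = feats.shape
    ys, xs = torch.meshgrid(torch.linspace(0, 1, grid_size),
                            torch.linspace(0, 1, grid_size), indexing="ij")
    coords = torch.stack([xs, ys], dim=-1)                        # (g, g, 2)
    # Isotropic Gaussian opacity per blob; low scale -> near-zero opacity.
    d2 = ((coords[None] - centers[:, None, None]) ** 2).sum(-1)   # (k, g, g)
    alpha = torch.sigmoid(scales)[:, None, None] * torch.exp(-d2 / 0.01)

    # Alpha-composite blobs front-to-back (index 0 = nearest), so earlier
    # blobs occlude later ones at each grid location.
    feature_map = torch.zeros(d, grid_size, grid_size)
    remaining = torch.ones(grid_size, grid_size)
    for i in range(k):
        w = alpha[i] * remaining
        feature_map += w * feats[i][:, None, None]
        remaining = remaining * (1 - alpha[i])
    return feature_map

fm = splat_blobs(torch.rand(10, 2), torch.randn(10), torch.randn(10, 256))
print(fm.shape)  # torch.Size([256, 16, 16])
```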

The network defined above was trained in an unsupervised manner on the LSUN scenes dataset.

Despite being unsupervised, thanks to the spatial uniformity of blobs and the locality of convolutions, the network learned to associate different blobs with different components of the scene. This is evident from the presented results, computed with k = 10 blobs. For a comprehensive visualization of the results, the project page includes animations. The results are impressive, as can be seen from the image below: manipulating blobs allows substantial and precise adjustments to the generated image. For example, it is possible to empty a room (even though the model was never trained on images of empty rooms), to add, resize, and move entities, and to restyle the individual objects.
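To make these edits concrete, here is a purely illustrative snippet: the generator is a stand-in stub, and the indexing assumes the hypothetical [x, y, scale, aspect, angle] layout from the earlier sketch, not the official code's interface.

```python
# Illustrative only: scene edits amount to changing blob parameters and re-decoding.
import torch

def fake_generator(params, feats):
    # Placeholder: the real model renders an image from the blob layout + features.
    return torch.zeros(1, 3, 256, 256)

params = torch.randn(1, 10, 5)     # (batch, k blobs, [x, y, scale, aspect, angle])
feats = torch.randn(1, 10, 512)

# "Remove" a blob (e.g., empty the room) by pushing its scale very low.
params_removed = params.clone()
params_removed[0, 3, 2] = -10.0    # blob 3, scale entry -> effectively suppressed

# Move another blob by shifting its (x, y) center.
params_moved = params.clone()
params_moved[0, 5, :2] += torch.tensor([0.1, 0.0])

edited = fake_generator(params_removed, feats)
```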

Source: https://arxiv.org/pdf/2205.02837.pdf

In conclusion, even though diffusion models have recently eclipsed GANs, this paper presents a novel and disruptive technique that controls the generated scene with unprecedented precision. In addition, the training is completely unsupervised, so no time is needed to label the images.

This article is written as a summary article by Marktechpost Staff based on the paper 'BlobGAN: Spatially Disentangled Scene Representations'. All credit for this research goes to the researchers on this project. Check out the paper, GitHub, and project page.

