Introduction

In our recent magazine paper (under review), one topic of interests is subspace-aided transfer learning. In this blog, we give an overview of the method we studied with illustrations that are not included in the paper. This blog will nonethless not go into mathematics details, rather, we ask motivated readers to bear with the partial presentation and postpone the enjoyment of full derivation when our paper is ready. We hope our vivid illustrations and animations may partially explain the reasons why this seemingly ad-hoc method may work a posteriori.

More introductory texts on transfer learning can be found in one of my earlier blog. We should remark that, the jargon “domain adaptation” and “transfer learning” are not exactly equivalent in a mathematical sense. Subtle differences exist though, which are essentially different assumptions on the underlying latent distribution. We do not, nonetheless, differentiate them in this rather casual writing.

Dataset

As a general setup, there are several popular dataset tailored for benchmarking transfer learning. We select the celebrated Office data set [Saenko et al. 2010] augmented by Caltech-256 [Griffin et al. 2007]. These datasets have been collected for many years and are among the very popular choices.

Office dataset contains a handful of categories of objects whose images are collected by either from online merchant (‘amazon’); by a DSLR camera (‘dslr’); or by a webcam (‘webcam’). The Caltech-256 dataset, as the name suggested, contains 256 categories of objects. Usually, we would select a small subset of categories in common, and train a supervised model using one domain out of the four and test in another domain. This procedure is known as unsupervised domain adaptation since no label in the target domain is known; Alternatively, if we train the model using very few but not none data from the target domain in addition to the source domain, we are indeed doing the semi-supervised (or semi-unsupervised) domain adaptation.

Firstly, let us have a brief overview of several categories among four domains:

Headphone

headphone

Projector

projector

As suggested by name, we see the core disparities between domains: “amazon” has generally better-aligned objects; images in “dslr” are of higher resolution; “webcam” on the contrary is much more casual and usually under unbalanced illumination; “caltech” is more in the wild. The discrepancies here can be considered as a new scenario that we wish our model to generalized well from previously learnt domains.

Geodesic Sampling

The details of the methods we studied and implemented can be found in [Gopalan et al. 2011; Gong et al. 2012]. Very roughly speaking, we model the source and target domains of images of the same category as two points on some highly-nonlinear manifold (i.e., Grassmann manifold), which, by some heurstic argument, may capture the geometric and statistical properties of the set of the image by rendering a unified latent representation. Though the manifold is highly non-linear per se, its distinct element infact constitutes of vector spaces (i.e., linear subspaces); we may as well construct its tangent space as vector spaces. Furthermore, to measure the discrepancies between points (or rather, define the metric over the manifold), we progress to the notion of principle angles between subspaces. For example, in $\mathcal{R}^2$, each vector is by itself (the basis of) a 1 dimensional subspace and thus the discrepancy between two such spaces can be measured by the cosine of the angle they make. Principle angle is a natural extension of such construction in general dimension subspaces. Above is some very coarse-grained discussion, we refer interested readers for classical texts on the subjects such as Boothby, Edelman or Absil. Readers are also encouraged to check out our paper for abundant details omitted here.

The geodesic is the shortest curve under constant velocity joining two points (the velocity is indeed the element of the tangent space). By sampling on the geodesic between the source and target domains, it has been heuristically argued and empirically tested that some intermediate points exhibited more consistency in terms of statistical and geometrical properties between source and target domains. Such samples can thus be viewed as latent representations and consequently a discriminatively-trained classifier should be able to generalize well on such latent spaces.

In fact, suppose we parametric the geodesic (recall it is simply a curve joining two points) in terms of $t \in [0, 1]$, we can infact visualize the change of a single image in the source domain as we traverse the geodesic to another domain. For such maneuver, we apply a similar procedure used in “Eigenfaces” by way of principle component analysis.

Concretely, assume we wish to generalize bikes from “caltech” to “amazon”, we first plot their mean images to have a better feeling of the two domains:

mean

The mean image is simply an average over all images in the domain, we noted the main differences may be the size of the wheels and orientation of the bikes; and the characteristics shared are the structures of the bike.

sample_bike

We choose an image from the source domain and visualize its latent-representation on five samples along the geodesic:

samples_bike

The result shows some interesting features besides the overreacted pixels: along the geodesic, there is a noticeable deformation of the front wheel shrinking to the size that mostly appears in the target domain.

As another example, consider the computer mouse from “dslr” to “webcam”:

sample_mouse

samples_mouse

and keyboards of the same source-target pair:

sample_keyboard

samples_keyboard

This example illustrates that one of the defining factors of the visual consistency of such manuvour is the quality of the mean image.

In fact, we can animate the whole process, take our now familiar caltech bike, for example:

animated_cal_bike_sample

animated_cal_bike

How about a backpack from “webcam” to “amazon”?

animated_webcam_backpack_sample

animated_webcam_backpack

Feature Visualization

Going along this direction a little bit further, we are curious what are the latent features and how they change along the geodesic. To do this, we follow the same procedure of “Eigenface” again. Concretely, we animate top eigenvector of the latent samples along the geodesic of calculators from “dslr” to “amazon”.

latent_calculator_1

Remark.

During a recent talk by D. A. Forsyth I attended, he commented on a general perception in CV community which I consider to be very true:

In the past, we understood the theories very well, but we cannot do well; now the models perform very well but we do not know why.

I believe the incorporation of deep models and mechanism inspired from geometrically/physcially-based perspective may lead to our better understanding and exploring the capacity of deep models; and better comprehend how to obtain models that really understand the image, not at least the classifications or regressions.

The experiments in this blog are purely illustrative: we did not even bother to feed HOG or SURF features, rather, we directly used normalized raw images; we did neither augment the dataset by removing lousy images or localizing objects.

References

  1. Gong, B., Shi, Y., Sha, F., and Grauman, K. 2012. Geodesic flow kernel for unsupervised domain adaptation. Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, IEEE, 2066–2073.
  2. Gopalan, R., Li, R., and Chellappa, R. 2011. Domain adaptation for object recognition: An unsupervised approach. ICCV ’11, IEEE, 999–1006.
  3. Griffin, G., Holub, A., and Perona, P. 2007. Caltech-256 object category dataset. .
  4. Saenko, K., Kulis, B., Fritz, M., and Darrell, T. 2010. Adapting visual category models to new domains. Computer Vision–ECCV 2010, 213–226.