Honza Tyl · 2 min read · Archive 2017

Capsule Networks and Inverse Graphics

Have you encountered Capsule Networks and Inverse Graphics? These new concepts were recently popularised by the 'godfather' of deep learning, Geoffrey Hinton. Generally, they push computer vision a step closer to emulating human perception. I have distilled five key ideas for you:

  1. Hierarchy – Humans learn and analyse visual information hierarchically. Children first learn to recognise colours and outlines. A person sees two eyes, one nose, and a mouth, and concludes that the whole looks like a human face. This principle has been known since the 1970s and underpins deep networks, where early layers pass increasingly abstract features to later layers, building up an ever more complex understanding of the image.
  2. Positional Equivalence – The position of an object within the image should not affect how the network classifies it: whether a kitten appears on the left or the right of the picture, the network should evaluate both as images of a cat. Convolution (small filters that scan local parts of the image and respond to interesting features such as colour, edges, etc.) helps significantly here.
  3. MaxPool Doesn't Work – Max pooling (for example, with a 2×2 filter), a technique dating back to Kunihiko Fukushima's work in the 1980s, can find an eye in an image but discards its spatial relationship to the other parts of the face. The network then perceives a portrait with the mouth and an eye swapped as a perfectly normal face – this property/error is referred to as translational invariance.
  4. Distilling Views – The pose matrix (a transformation matrix) is a 4×4 matrix that represents the properties of an object, such as its xyz coordinates, scale, and rotation. On top of this, a matrix representing the hierarchical relationships between objects in the image is added (the eyes, mouth, and nose are parts of the head; the head is part of the body…). Just as in a 3D rendering program, various "camera" views can then be computed.
  5. Inverse Graphics – This works in the opposite direction to distilling views: looking at a 2D image, the network estimates what the underlying 3D object might look like. This allows spatial relationships to be modelled with linear transformations and multiple views to be generalised into a single matrix.
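To make point 2 concrete, here is a minimal NumPy sketch (purely illustrative – the edge filter and toy images are made up, not taken from any capsule paper) showing why convolution handles position so well: shifting the object in the input simply shifts the feature map by the same amount.

```python
import numpy as np

def convolve2d(image, kernel):
    """Valid 2-D cross-correlation: slide the kernel over the image."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A hand-crafted vertical-edge detector (a typical convolution filter).
edge_kernel = np.array([[1, 0, -1],
                        [1, 0, -1],
                        [1, 0, -1]], dtype=float)

# The same bright square drawn at two different horizontal positions.
left = np.zeros((8, 8));  left[2:5, 1:4] = 1.0
right = np.zeros((8, 8)); right[2:5, 4:7] = 1.0   # shifted 3 columns right

resp_left = convolve2d(left, edge_kernel)
resp_right = convolve2d(right, edge_kernel)

# Convolution is shift-friendly: moving the input 3 columns right
# moves the feature-map response 3 columns right as well.
assert np.allclose(resp_left[:, :-3], resp_right[:, 3:])
```

The filter never "knows" where the square is; it reports the same edge pattern wherever the square lands, which is exactly why the kitten-on-the-left and kitten-on-the-right produce matching features.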
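Point 3 can be demonstrated just as easily (again an illustrative toy, not code from any real network): two feature maps with their strong activations in different places inside each pooling window collapse to the identical pooled output, so the "where" is lost.

```python
import numpy as np

def maxpool2x2(x):
    """Non-overlapping 2x2 max pooling over a 2-D array."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

# Two 4x4 feature maps: the strong activation (9) sits in a different
# corner of each 2x2 pooling window...
a = np.array([[9, 0, 0, 0],
              [0, 0, 0, 0],
              [0, 0, 0, 0],
              [0, 0, 0, 9]], dtype=float)
b = np.array([[0, 0, 0, 0],
              [0, 9, 0, 0],
              [0, 0, 9, 0],
              [0, 0, 0, 0]], dtype=float)

# ...yet after pooling the two maps are indistinguishable: the exact
# position inside each window is thrown away (translational invariance).
assert np.array_equal(maxpool2x2(a), maxpool2x2(b))
```

This is the mechanism behind the swapped-eye-and-mouth face: once positions inside a window are discarded, a "mouth here, eye there" portrait pools to the same summary as a correct one.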
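And for points 4–5, a short sketch of the 4×4 pose matrix idea (the `pose` helper and the numbers are my own illustration, assuming rotation about a single axis and uniform scale): composing the part-to-whole matrix with the whole-to-scene matrix places the part in the scene, exactly as a 3D renderer would.

```python
import numpy as np

def pose(tx, ty, tz, angle_deg=0.0, scale=1.0):
    """4x4 homogeneous pose: rotation about the z-axis, uniform scale,
    and an xyz translation, in one transformation matrix."""
    a = np.deg2rad(angle_deg)
    m = np.eye(4)
    m[0, 0] = scale * np.cos(a); m[0, 1] = -scale * np.sin(a)
    m[1, 0] = scale * np.sin(a); m[1, 1] =  scale * np.cos(a)
    m[2, 2] = scale
    m[:3, 3] = [tx, ty, tz]
    return m

# Hierarchy as matrices: where the face sits in the scene,
# and where the eye sits relative to the face.
face_in_scene = pose(10.0, 5.0, 0.0, angle_deg=30.0)
eye_in_face = pose(1.5, 0.5, 0.0)

# One matrix product walks down the part-whole hierarchy:
# the eye's pose in the scene, ready for any "camera" view.
eye_in_scene = face_in_scene @ eye_in_face
```

Inverse graphics then runs this pipeline backwards: from the 2D appearance, estimate the pose matrices that would render it.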

These new approaches enable better computer vision. For example, a capsule network reaches 99.75% test accuracy on the classic MNIST dataset! Results on more complex data have yet to be verified. Another strong point is that vision models become less of a black box.

https://towardsdatascience.com/uncovering-the-intuition-beh…

https://hackernoon.com/what-is-a-capsnet-or-capsule-network…

Published by Artificial Intelligence on 5 December 2017

Original source: wordpress
