Because this was a winning submission, I cannot share code as per CrowdAnalytix’s Solver’s Agreement. Permission is however given to share the approach to the solution.
From the competition website, “The objective of this contest is to extract the MPN for a given product from its Title/Description using regex patterns.” Now, I didn’t know what RegEx patterns were, but I could understand the problem of extracting text from a larger text. For my purposes, given that I wanted to learn representations, it was enough for me to understand that if I had the following:
EVGA NVIDIA GeForce GTX 1080 Founders Edition 8gb Gddr5x 08GP46180KR
Then I just wanted to extract the MPN “08GP46180KR” using some representations that learned to distinguish MPNs from other text making up the product title and description.
Here’s the basic gist of approaching this problem using RegEx: you hard-code some rules for patterns that you are interested in finding. Here’s an example for finding e-mail addresses:
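The pattern itself is not reproduced here, so as a stand-in, here is a minimal sketch of what such a hard-coded rule could look like in Python. The pattern, the function name `find_emails`, and the sample text are my own illustrative assumptions, and the rule is deliberately simplified rather than a complete e-mail validator:

```python
import re

# Simplified, illustrative pattern (an assumption, not a full e-mail validator):
# a run of allowed characters, "@", a domain, then "." and a top-level domain.
EMAIL_PATTERN = re.compile(
    r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*\.[A-Za-z]{2,}"
)


def find_emails(text):
    """Return every substring of `text` that matches the simplified pattern."""
    return EMAIL_PATTERN.findall(text)


sample = "Contact sales@example.com or support@mail.example.org for details."
print(find_emails(sample))
```

Every character class in that pattern is a rule somebody had to write down by hand, which is exactly the part I wanted the network to learn instead.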
Here, this RegEx looks for pre-defined characters in fields surrounding the “@” and “.” characters. The power of Deep Learning is that, provided enough training examples, we can learn these RegEx patterns from the data directly instead of hard-coding them. This is the approach that I took.
The training data consisted of ~54,000 examples with the following four entries: [id, product_title, product_description, mpn]; test data was the same except for the omission of MPN field. Upon inspection of the data, I found that the MPN was, in almost all cases, present in either the product title or description, if not both. It also became evident to me that this was a hard problem as there were many other “distractors” that looked very similar to MPNs but were not marked as the target (for example, in the above Graphics Processing Unit product, “Gddr5x” looks a lot like other MPNs that existed in the training set). Given that the problem was to extract the MPN from the other fields, I set the input as a concatenation of the product title and description and set the target (or output) as the MPN.
Now that I had determined what my inputs and outputs were, I needed to determine some sort of embedding so that I could use a neural network. Because this was not a usual Natural Language Processing problem due to the presence of MPN codes, HTML snippets and other odd characters, common choices such as word2vec were not going to be suitable (correct me if I’m wrong here). I fortunately had a rock-climbing buddy, Joseph Prusa, who had been working with character-wise embeddings for sentiment analysis (Prusa & Khoshgoftaar, 2014). He very kindly shared his embedding code, and after some custom-tailoring to my problem, I had an embedding solution.
The embedding procedure takes each character and embeds it as an 8-bit binary vector. For example the string “EVGA NVIDIA GeForce GTX 1080 Founders Edition 8gb Gddr5x 08GP46180KR” from the above example would be represented like such:
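Since I cannot share the actual embedding code, here is a rough sketch of the idea under my own assumptions: each character is mapped to its 8-bit character code (most significant bit first), and anything that does not fit in one byte is mapped to all zeros. The function names are hypothetical and the real encoding may differ in its details.

```python
def embed_char(ch):
    """Encode one character as a list of 8 bits, most significant bit first."""
    code = ord(ch)
    if code > 255:
        code = 0  # assumption: characters outside one byte become all zeros
    return [(code >> shift) & 1 for shift in range(7, -1, -1)]


def embed_string(text):
    """Return an 8 x len(text) matrix with one column of bits per character."""
    columns = [embed_char(ch) for ch in text]
    return [[col[row] for col in columns] for row in range(8)]


def decode_matrix(matrix):
    """Invert embed_string: rebuild the text from the bit columns."""
    n_chars = len(matrix[0]) if matrix else 0
    chars = []
    for j in range(n_chars):
        code = 0
        for row in range(8):
            code = (code << 1) | matrix[row][j]
        chars.append(chr(code))
    return "".join(chars)


embedded = embed_string("08GP46180KR")
print(len(embedded), len(embedded[0]))  # 8 rows, one column per character
print(decode_matrix(embedded))          # round-trips back to the string
```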
The next problem was that inputs (i.e., the concatenated product title and description string) were of varying length. Thus, I figured that I needed to settle on some way to make them all the same length to feed to the network. My first step was to visualize the distribution of all the input lengths.
Based on this distribution, I chose to set the max length to 2000, as it included most examples and excluded only a couple of very long outliers. With this max length set, I first clipped each string input and then embedded it using the procedure above. If an input was shorter than the max length, it was padded with zeros. If it was longer and the MPN code fell within the max length, there was no problem; if it did not, then it was just another case where the MPN code was absent (which was very infrequent as well). The result of all this is an 8 x 2000 “image” that can now be fed to the model.
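The clip-then-pad step can be sketched as follows. This is an illustration rather than the competition code: `to_fixed_length` is a hypothetical helper, and the 8-bit encoding inside it is the same assumed encoding as in the text (plain character codes, zeros for anything outside one byte).

```python
MAX_LEN = 2000  # max length reported in the text


def to_fixed_length(text, max_len=MAX_LEN):
    """Clip `text` to max_len characters, encode each as 8 bits, zero-pad the rest."""
    clipped = text[:max_len]
    columns = [
        [((ord(ch) >> shift) & 1) if ord(ch) < 256 else 0 for shift in range(7, -1, -1)]
        for ch in clipped
    ]
    # pad with all-zero columns up to max_len
    columns += [[0] * 8 for _ in range(max_len - len(clipped))]
    # transpose to 8 rows x max_len columns
    return [list(row) for row in zip(*columns)]


image = to_fixed_length("EVGA NVIDIA GeForce GTX 1080 08GP46180KR")
print(len(image), len(image[0]))  # 8 2000
```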
Assuming that we want to build some neural network that we can train using back-propagation, the next question is what the appropriate output and loss function are. The most natural choice seemed to be that the output would be just the MPN in the embedded vectorial format. This, in combination with a loss like Mean-Squared Error that is common in generative models in unsupervised learning, just did not do the trick due to technical reasons.
Eventually I converged on the following solution that was sufficient to get some reasonable results. Namely, I defined the output of the network to be two one-hot binary vectors with a length equal to the max length (set to 2000 here), where the first vector indicated the starting index of the MPN and the second vector indicated the ending index of the MPN. Then the loss was simply the summed categorical crossentropy for both vectors.
Given this output, an auxiliary function was then created on the backend to extract the MPN vectorial representation from the input given the two indices and then convert the embedded MPN back to a string representation as the final output. In the cases where no MPN was present, the target was defined as ones at the end of both vectors.
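To make the target formulation concrete, here is a small sketch in plain Python. It is not the competition code: the function names are hypothetical, I assume the end index points at the MPN's last character, and for brevity the decoding step slices the raw string rather than the embedded input.

```python
import math


def make_targets(text, mpn, max_len=2000):
    """Build one-hot start/end vectors for `mpn` inside `text`.

    Assumption: a missing (or clipped) MPN is marked by a one in the final
    position of both vectors, as described above.
    """
    start = [0] * max_len
    end = [0] * max_len
    pos = text.find(mpn) if mpn else -1
    if pos == -1 or pos + len(mpn) > max_len:
        start[-1] = end[-1] = 1
    else:
        start[pos] = 1
        end[pos + len(mpn) - 1] = 1
    return start, end


def summed_crossentropy(start_true, end_true, start_prob, end_prob, eps=1e-12):
    """Sum of the categorical cross-entropies of the start and end predictions."""
    def xent(true, prob):
        return -sum(t * math.log(p + eps) for t, p in zip(true, prob))
    return xent(start_true, start_prob) + xent(end_true, end_prob)


def extract_span(text, start_prob, end_prob):
    """Decode a prediction by slicing `text` between the most probable indices."""
    i = max(range(len(start_prob)), key=start_prob.__getitem__)
    j = max(range(len(end_prob)), key=end_prob.__getitem__)
    if i == len(start_prob) - 1 and j == len(end_prob) - 1:
        return ""  # both predictions sit on the "no MPN" marker
    return text[i:j + 1] if i <= j else ""
```

At prediction time the two softmax outputs take the place of `start_prob` and `end_prob`, so the whole pipeline reduces to two argmax operations and a slice.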
Ok, so now that the data has been embedded, and our target has been formulated, the next step was to build a model that would perform the above task well. I tried a bunch of different neural network models, including deep convolutional neural networks with standard architectures (e.g., 2D conv-net with max-pooling layers). These produced good but unsatisfactory results – nothing that was going to get me a winning spot.
Fortunately, Google DeepMind had just put out a paper on their new model WaveNet that used causal, dilated convolutions that served as the seed of my idea. WaveNet, and other similar models, were very intriguing because they used multiple layers of convolutional operators with no pooling that were able to obtain filters with very large receptive fields while keeping the number of parameters within a reasonable range because of the dilations used at each subsequent layer (see the red portion of the figure below; image source – Neural Machine Translation in Linear Time).
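A quick back-of-the-envelope calculation shows why dilation is so attractive. The sketch below computes the receptive field of a stack of 1-D convolutions; the kernel size and the doubling dilation schedule are assumptions chosen to mirror the WaveNet-style setup, not the settings of my model.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (in input positions) of stacked dilated 1-D convolutions.

    A layer with kernel size k and dilation d widens the field by (k - 1) * d.
    """
    field = 1
    for d in dilations:
        field += (kernel_size - 1) * d
    return field


doubling = [2 ** i for i in range(10)]  # dilations 1, 2, 4, ..., 512
print(receptive_field(2, doubling))      # 1024
print(receptive_field(2, [1] * 10))      # 11
```

Both stacks have ten layers with the same kernel size, so they have the same number of weights, yet the dilated stack sees roughly a hundred times more of the input.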
The final model idea that I converged on was to extract a set of basic features from the input, feed them through a series of dilated convolutions and then branch off two convolutional filters with softmax activation functions to predict start and end indices. In more detail, the model was as follows:
The model architecture is represented graphically below, showing the major features of the model.
After training, I observed that the model was close to perfect on the training set, hovered around ~90% accuracy for the validation set, and obtained ~84% on the public leaderboard. Not bad!
One thing that I noticed as I was scrambling to make submissions was that the model overfit the data very quickly due to the relatively small number of samples. I knew that with only ~54,000 training examples, learning representations directly from the data was a bit risky, but I believe that with a couple hundred thousand, my solution might have placed higher. Because I was late to the competition, I just chose to lower the learning rate and only train for a couple of epochs, which in the end worked out for me. However, had there been more time, I would have liked to explore some data augmentation techniques and model regularization, which would have helped make the model more expressive and prevent overfitting. Additionally, pretraining on other text might have been a successful strategy. A brute-force effort would have been to increase the max length parameter slightly, which may have given me some marginal improvements, but at a very high computational cost.
This was a fun challenge for me and I found it satisfying to place especially given that I had not really worked on this type of problem before. Sorry in advance for adding to the Deep Learning hype, but I found this to be another interesting application of said methods to a domain that probably doesn’t see much of these techniques used, again showing the general abilities of Deep Learning. Hope this helps someone with a similar problem.
The main point that Epstein attempts to convey in this article is that we are confusing ourselves about how the brain works by using our most advanced technology, computers, as a metaphor – what he refers to as the information processing (IP) metaphor. He notes that we do not have, and never develop, the components that are essential to modern day computers, such as software, models, or memory buffers. Indeed, computers process information by moving around large arrays of data encoded in bits that are then processed using an algorithm. This is not what humans do – a given. Challenging researchers in the field, he finds that basically none can explain the brain without appealing to the IP metaphor, which he reasonably sees as a problem. Crucially, he points out that the metaphor is based on a faulty syllogism whereby we conclude “…all entities that are capable of behaving intelligently are information processors”. Just as previous metaphors for the brain seem silly to us now, so will that of the IP metaphor in the future.
As with most debates, it is important to define some important terms so that disagreements are not based in semantics. The following are definitions for key terms used in Epstein’s argument and that I will reference throughout this reply. These are surely debatable, but it was the best I could do here.
algorithm - ”…a self-contained step-by-step set of operations to be performed”
operation - in mathematics, ”…a calculation from zero or more input values (called operands) to an output value.”
information processing - in the computing sense, ”…the use of algorithms to transform data…”
information - the resolver of uncertainty. We can say that data that is completely random is uncertain and has no information. If information of some form exists in data, it can resolve the uncertainty with respect to outputs.
data - the quantities, characters, or symbols on which operations are performed by a computer
representation - from a cognitive viewpoint (which Epstein is likely to be more familiar with), is that of an “internal cognitive symbol that represents external reality”; from machine learning, ”…learning representations of the data that make it easier to extract useful information when building classifiers or other predictors.”
computer - ”…a device that can be instructed to carry out an arbitrary set of arithmetic or logical operations automatically.”; more generally, we can say “one that computes”
computation - ”…any type of calculation … that follows a well-defined model understood and expressed as, for example, an algorithm.”
These definitions were not cherry picked to support my position, and when appropriate, I gave multiple definitions to reflect use-specific cases.
I appreciate Epstein’s challenge to the IP metaphor and recognize the higher chance of it being invalid rather than the final answer to understanding the brain given the history of failed metaphorical applications to brain functioning. Figures such as Epstein are a necessary component to scientific progress, as we must challenge our scientific paradigms and inspect how they bias the observations we make. However, as someone who builds artificial neural networks, I (perhaps erroneously) remain convinced of the brain’s role in computation despite the evidence he and others provide. Here I attempt to illustrate how most of the examples he cites are straw men that convey an inaccurate view of how the IP metaphor is seen by those who take seriously the primary role of computation in artificial general intelligence and its relation to the brain.
Based on our definition above, information processing involves the transformation of data using algorithms that are themselves a set of sequential operations. From this, Epstein’s understanding of how algorithms and computations relate to brain function appears immediately outdated and misguided. He strongly emphasizes that an algorithm is about the set of rules in machine code that dictates how data is stored, transferred, and received from hardware elements such as buffers, devices, and registers. Yes, this is an algorithm, but not the one that we are interested in when trying to draw connections to the brain. These aspects he highlights are details of implementation specific to conventional computers that those simulating neural nets, for example, would never argue take place in the brain.
One aspect of his overly narrow view of computation is a veridical, discrete, and easily accessible memory store. To illustrate that we don’t have any type of memory bank like a computer, he describes a demonstration where an intern is asked to draw a dollar bill. When she must do so from memory, she is unable to draw the object with much detail. In contrast, when she is given the opportunity to look at the dollar bill, she can draw it in great detail.
“Jinny was as surprised by the outcome as you probably are…”
Not really. Besides this being the expected outcome, common findings in unsupervised learning predict that outcome, producing “fuzzy” generated images when prompted. On the other hand, the most recent state-of-the-art in generative modeling, adversarial networks, are more akin to what an artist might do – producing depictions that can be compared to reality until they match more closely (at least with respect to realism).
Even if she spent time studying the details, “…no image of the dollar bill has in any sense been ‘stored’ in Jinny’s brain.”
Artificial neural networks, and likely biological ones for that matter, don’t have images stored inside them; they have abstract, identity-preserving, translation-invariant representations learned from input data. In fact, the representations themselves are a form of memory. Presumably what Jinny would be doing in those moments of studying is temporarily strengthening connections in the network hierarchy that would generate a dollar bill image. Below, we will see how this is likely distributed across the network – unlike a modern computer, yet still computational in nature.
Let us adopt the general (non-spiking, feed-forward) artificial neural network (aNN) as our main example of modeling the brain from an information processing perspective. At their core, aNNs are algorithms (i.e., a sequence of operations, such as weighted sums and linear rectifiers), which are arguably biologically plausible (i.e., synapses with inhibitory and excitatory strengths that are thresholded), yet clearly limited in scope.
Deep neural networks, more specifically, are simply a hierarchical series of operations that take data (e.g., images analogous to retinal pattern of activation) and transform them through each layer, increasingly projecting them into a space that is more valuable for learning and making decisions. This is nothing more than an algorithm, and there is good reason to believe that it is very similar to what our brains are doing.
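To show how little machinery this involves, here is a toy forward pass written with nothing but weighted sums and linear rectifiers. The weights and layer sizes are made up for illustration; the point is only that the whole "algorithm" is a short sequence of primitive operations with no buffers, registers, or stored images anywhere in it.

```python
def relu(values):
    """Linear rectifier: pass positive values through, clamp the rest to zero."""
    return [max(0.0, v) for v in values]


def dense(inputs, weights, biases):
    """Weighted sum per unit; weights[i][j] connects input j to unit i."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


def forward(inputs, layers):
    """Transform the input through each layer in turn."""
    activations = inputs
    for weights, biases in layers:
        activations = relu(dense(activations, weights, biases))
    return activations


# two made-up layers: 2 inputs -> 2 hidden units -> 1 output unit
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),
    ([[1.0, 1.0]], [-0.5]),
]
print(forward([2.0, 1.0], layers))  # [2.0]
```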
For example, a convolutional neural network convolves filters across an image. Certainly our brains do not perform convolutions, but rather this reflects certain assumptions that allow us to simulate them in software. Specifically, images translate across space and because we can move our heads and eyes, similar basic features can occur anywhere in the visual field. Therefore it would make sense to have similar basic features tiled across the early visual cortex with receptive fields at different locations. Indeed this is what we see in biological brains. Although an oversimplification that undoubtedly has biologically implausible limitations, it serves to make the point that these computations are abstractions of primitives (e.g., weighted sums) that can easily be physically implemented.
Now, in the above drawing demonstration, Epstein uses the outcome as an argument against the existence of “representations” that exist as stored memories, particularly in individual neurons. Indeed, as Epstein states and surely knows, no one in contemporary computational neuroscience would make such an absurd assertion. A very common concept in computational neuroscience is that of a sparse distributed representation, which is very similar to the idea of parallel distributed processing. In this framework, inputs can be represented with a very small number of features that are distributed. Additionally, memories (which may be representations themselves as described above) are not discrete, but are distributed. Therefore, “deleting” one memory, if even possible, may involve removing innumerable others.
From my previous blog post:
A sparse coding scheme also has logically demonstrable functional advantages over the other possible types (i.e., local and dense codes) in that it has high representational and memory capacities (representational capacity grows exponentially with average activity ratio and short codes do not occupy much memory), fast learning (only a few units have to be updated), good fault tolerance (the failure or death of a unit is not entirely crippling), and controlled interference (many representations can be active simultaneously; Földiák & Young, 1995).
Thus, when we look at the deep neural network below, the image of the face or cat that points at a particular image does not mean that an “image” is stored at that location. This image was generated by synthesizing an input that maximized that unit’s activation. But because that node’s activation is contingent upon a complex weighting of all the connections before it, it would be more accurate to say that the representation is distributed across all the connections before it, not in one location. Also note that the synthesized image is fuzzy and shows that it could be insensitive to translations.
In fact, he said it himself.
“For any given experience, orderly change could involve a thousand neurons, a million neurons or even the entire brain, with the pattern of change different in every brain.”
This is exactly what neural network simulations do; this fact does not preclude computation.
After this example, he attempts “to build the framework of a metaphor-free theory of intelligent human behaviour” that I find unsuccessful and only reframes the existing IP metaphor. In particular, he notes that we learn by making observations of paired (i.e., correlated) events and we are punished or rewarded based on how we respond to them. These are not new concepts to the field of machine learning and artificial intelligence. In fact, the success of almost all machine learning applications hinges on learning patterns built from correlated primitives that allow for good decisions to be made.
As an illustration he submits that when a new song is learned, instead of “storing” it, the brain has simply “changed”. It’s not clear how this is at all different. What has changed? And how? In a computer I may save a new song as a discrete file on my hard drive. But if I want a neural network to learn it and be able to generate it, the “storing” would involve changing weights distributed across the network (as stated above). For example, if I already have existing knowledge of songs, many of which have similar components, I can represent this new song as a distributed code of sparse components that is not located in one single place. These components will be coupled, thereby creating a memory (or representation) that is distributed across the very thing that perceives it and produces it.
To give a final illustration, he cites a commonly referenced example of catching a fly ball. Epstein, as well as others, argue that:
The IP perspective requires the player to formulate an estimate of various initial conditions of the ball’s flight – the force of the impact, the angle of the trajectory, that kind of thing – then to create and analyse an internal model of the path along which the ball will likely move, then to use that model to guide and adjust motor movements continuously in time in order to intercept the ball.
Now, granted that this is from a 1995 article, this is an outdated view. As someone who takes the information processing view of the brain, I would never say that catching a fly ball involves explicitly calculating trajectory. In fact, keeping the ball in constant relation to the surroundings is likely exactly what a neural network agent using reinforcement learning would learn to do if it were trained to do so based on visual input. Furthermore, it would learn representations specific to those aspects (i.e., the ball, the horizon), and ultimately be calculating trajectory implicitly. It is not clear how this could be done “completely free of computations, representations and algorithms”, as Epstein and others claim.
He argues that “because neither ‘memory banks’ nor ‘representations’ of stimuli exist in the brain, and because all that is required for us to function in the world is for the brain to change in an orderly way as a result of our experiences, there is no reason to believe that any two of us are changed the same way by the same experience.” It is just as easy to see that no experience would be the same due to distributed representations of “fuzzy” memories that have been compressed based on the existing network. IP prevails.
Epstein’s argument appears to be based on an erroneous, outdated, and rigid view of information processing that is unlike what those like myself take it to be. Unlike what he suggests, brains do not have perfect memory stores and representations are distributed, not local. Many simulated neural networks have exactly these features. What he has done is conflate computers with computation. Ultimately, artificial neural networks can be implemented in hardware, using the same operations as in the simulation, but without any of the other aspects involved in conventional computers, such as data transferring. Maybe if Epstein understood this, he would have to update his position.
At the closing of his essay, Epstein makes a rather insulting statement:
“The IP metaphor has had a half-century run, producing few, if any, insights along the way.”
This is a slap in the face to the fields of computational neuroscience, neuromorphic computing, and artificial intelligence, to name a few. The field of artificial intelligence, specifically, has been greatly guided by principles derived from our understanding of the brain. To ignore what those fields have to say about cognition is a dire mistake and rejecting the IP metaphor upon which they are founded removes all chance of such a dialogue.
Ultimately, I can’t explain the brain either without appealing to the IP metaphor. But it appears neither can he. I see that the IP metaphor is based on invalid reasoning, but it is the best we have to go on and amazingly deep insights have been made through it. There is also something special, and universal about IP that, at least to me, makes it seem very likely to be implemented in the brain.
“We are organisms, not computers.”
Well, maybe we’re both. At the very least we’re doing some computation. And the thing about computation is that it transcends the medium in which it is implemented – be it flesh or silicon.
Before being able to run the code described in this post, there are a couple of dependencies that must be installed (if not already on your machine). These include NetworkX, NLTK, and Graphviz. Also, after installing NLTK, import nltk and use nltk.download() to further install the wordnet and wordnet_ic databases. You should be all set at this point!
For this code demonstration, we do not actually need the CIFAR-10 dataset, but rather its object categories. One alternative would be to download the dataset and use the batches.meta file to import the labels. For simplicity, I instead just list out the categories and put them into a set.
# the ten CIFAR-10 object categories
categories = {
    'airplane', 'automobile', 'bird', 'cat', 'deer',
    'dog', 'frog', 'horse', 'ship', 'truck',
}
Now we need to define a function that, beginning with a given object class, recursively adds a node and an edge between it and its hypernym all the way up to the highest node (i.e., “entity”). I found this post that demonstrated code that could do this, so I borrowed it and modified it for my purposes. The major addition was to extend the graph building function to multiple object categories. We define a function wordnet_graph that builds our network:
import networkx as nx
import matplotlib.pyplot as pl
from nltk.corpus import wordnet as wn


def wordnet_graph(words):
    """
    Construct a semantic graph and labels for a set of object categories using
    WordNet and NetworkX.

    Parameters:
    ----------
    words : set
        Set of words for all the categories.

    Returns:
    -------
    graph : graph
        Graph object containing edges and nodes for the network.
    labels : dict
        Dictionary of all synset labels.
    """

    graph = nx.Graph()
    labels = {}
    seen = set()

    def recurse(s):
        """Recursively move up semantic hierarchy and add nodes / edges"""
        # note: Synset.name() is a method in NLTK 3, so it is called here
        if s not in seen:                               # if not seen...
            seen.add(s)                                 # add to seen
            graph.add_node(s.name())                    # add node
            labels[s.name()] = s.name().split(".")[0]   # add label
            hypernyms = s.hypernyms()                   # get hypernyms

            for s1 in hypernyms:                        # for hypernyms
                graph.add_node(s1.name())               # add node
                graph.add_edge(s.name(), s1.name())     # add edge between
                recurse(s1)                             # do so until top

    # build network containing all categories
    for word in words:                                  # for all categories
        s = wn.synset(str(word) + '.n.01')              # create synset
        recurse(s)                                      # call recurse

    # return the graph and labels
    return graph, labels
Now we’re ready to create the graph for visualizing the semantic network for CIFAR-10.
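# create the graph and labels
graph, labels = wordnet_graph(categories)

# draw the graph; nx.nx_agraph.graphviz_layout requires pygraphviz
# (older NetworkX releases exposed this as nx.graphviz_layout / nx.draw_graphviz)
pos = nx.nx_agraph.graphviz_layout(graph)
nx.draw(graph, pos=pos)
nx.draw_networkx_labels(graph, pos=pos, labels=labels)
pl.show()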
The resulting semantic network should look like the following:
We can see that from entity, the main branch between categories in CIFAR-10 is between artifacts and living things. The object categories themselves tend to be terminal nodes (except for dog).
We can also use WordNet to quantify the semantic distance between two given object categories. Developing quantifications for semantic similarity is an area of ongoing study and the NLTK includes a couple of variations. Here, we use the simple path_similarity measure, a score between 0 and 1 based on the length of the shortest path between two nodes in the hypernym hierarchy, but many others can be implemented by using the wordnet_ic dataset and defining an information content dictionary (see here).
To find the semantic distance between all object categories, we create an empty similarity matrix of size N x N, where N equals the number of object categories, and iteratively calculate the semantic similarity for all pair-wise comparisons.
import numpy as np
from nltk.corpus import wordnet as wn

# empty similarity matrix
N = len(categories)
similarity_matrix = np.zeros((N, N))

# fix the category order once so that rows and columns line up
ordered = list(categories)

# loop over all pairwise comparisons
for x_index, category_x in enumerate(ordered):
    x = wn.synset(str(category_x) + '.n.01')
    for y_index, category_y in enumerate(ordered):
        y = wn.synset(str(category_y) + '.n.01')
        # enter similarity value into the matrix
        similarity_matrix[x_index, y_index] = x.path_similarity(y)

# set the main diagonal of the matrix to zero
np.fill_diagonal(similarity_matrix, 0)
We can then visualize this matrix using Pylab. I found this notebook that contained some code for generating a nice comparison matrix. I borrowed that code and only made slight modifications for the current purposes. This code is as follows:
# Plot it out
fig, ax = pl.subplots()
heatmap = ax.pcolor(similarity_matrix, cmap=pl.cm.Blues, alpha=0.8)

# Format
fig.set_size_inches(8, 11)

# turn off the frame
ax.set_frame_on(False)

# put the major ticks at the middle of each cell
ax.set_yticks(np.arange(similarity_matrix.shape[0]) + 0.5, minor=False)
ax.set_xticks(np.arange(similarity_matrix.shape[1]) + 0.5, minor=False)

# want a more natural, table-like display
ax.invert_yaxis()
ax.xaxis.tick_top()

# Set the labels (same iteration order as used to fill the matrix)
labels = list(categories)
ax.set_xticklabels(labels, minor=False)
ax.set_yticklabels(labels, minor=False)

# rotate the x-axis labels
pl.xticks(rotation=90)

ax.grid(False)
ax.set_aspect('equal', adjustable='box')

# Turn off all the tick marks
# (the tick1On / tick2On attributes were removed from recent Matplotlib)
for t in ax.xaxis.get_major_ticks():
    t.tick1line.set_visible(False)
    t.tick2line.set_visible(False)
for t in ax.yaxis.get_major_ticks():
    t.tick1line.set_visible(False)
    t.tick2line.set_visible(False)
This generates the following visualization of the semantic similarity matrix for the CIFAR-10 object categories:
In this image, bluer colors represent higher similarity (neglecting the main diagonal which was forced to zero for better visualization). As is apparent, all of the object categories belonging to either the artifact or living_thing major branches are closely similar to one another and very different from objects in the opposite branch. Now these semantic distances between object categories can be used for many other types of analyses.
In biological vision, receptive fields in early visual cortex are organized into orientation columns where adjacent columns have selectivity close in feature space. The global appearance of selectivity to orientation across the cortical sheet is that of smooth transitions between the orientation preferences of columns, along with the classic pinwheel features where orientation column selectivities meet at a singularity (see image below).
A large amount of computational research has explored the mechanisms underlying such organization (Swindale, 1996). More recent research has learned the self-organization of feature detectors based on the natural statistics of images when structured sparsity is imposed (Hyvärinen, Hoyer, & Inki, 2001; Jia & Karayev, 2010; Kavukcuoglu, Ranzato, Fergus, & LeCun, 2009; Welling, Osindero, & Hinton, 2002).^{1} Most of these models involve two layers, where the activations of the first layer are square rectified and projected up to a second layer based on locally defined connections. If we have activations in the first layer given by:
\begin{equation} a^{(1)} = \mathbf{w}^T \mathbf{x} \end{equation}
where $\mathbf{w}$ is the weight matrix and $\mathbf{x}$ is the input data, these can then be projected up to a second layer unit given the local connections defined by overlapping neighborhoods $H_i$:
\begin{equation} a_i^{(2)} = \sqrt{\sum_{j \in H_i} (a^{(1)}_j)^2 } \end{equation}
Thus, the activation of each unit in the second layer is the square root of the sum of squares of adjacent units in the first layer, as defined by a local connectivity matrix that can either be binary or have some distribution across the neighborhood (below is an example of 3 x 3 overlapping neighborhoods).
To avoid edge artifacts, these neighborhoods are also commonly defined to be toroidal so that each unit in a given layer has an equal number of neighbors.
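As a rough illustration of the pooling step, the sketch below builds a binary connectivity matrix for toroidal neighborhoods on a grid of first-layer units and applies the second-layer equation. The grid size, the function names, and the use of a dense matrix are my own assumptions for clarity, not part of any of the cited models.

```python
import numpy as np


def toroidal_neighborhoods(rows, cols, size=3):
    """Binary matrix H with H[i, j] = 1 if unit j lies in unit i's neighborhood.

    Units sit on a rows x cols grid whose edges wrap around (a torus), so
    every unit has the same number of neighbors.
    """
    n = rows * cols
    H = np.zeros((n, n))
    half = size // 2
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            for dr in range(-half, half + 1):
                for dc in range(-half, half + 1):
                    j = ((r + dr) % rows) * cols + (c + dc) % cols
                    H[i, j] = 1.0
    return H


def second_layer(a1, H):
    """a2_i = sqrt(sum over j in H_i of a1_j ** 2)."""
    return np.sqrt(H @ (a1 ** 2))


H = toroidal_neighborhoods(4, 4)
a1 = np.ones(16)
print(second_layer(a1, H))  # every unit pools 9 ones, so each value is 3.0
```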
Thus the optimization objective for a sparse-penalized least-squares reconstruction model with the aforementioned architecture would be:
\begin{equation} \min_{\mathbf{\alpha}\in\mathbb{R}^m} \vert\vert \mathbf{x} - \mathbf{w}^T a^{(1)} \vert\vert^2_2+\lambda\vert\vert a^{(1)} +a^{(2)} \vert\vert_1 \end{equation}
where, as before, $\lambda$ is a sparsity tradeoff parameter.
Locally competitive algorithms (Rozell, Johnson, Baraniuk, & Olshausen, 2007) are dynamic models that are implementable in hardware and converge to good solutions for sparse approximation. In these models, each unit has a state $u_m(t)$, and when presented with a stimulus $\mathbf{s}(t)$, each unit begins accumulating activity that leaks out over time (much like a bucket with small holes in the bottom). When units reach a threshold $\lambda$, they begin exerting inhibition over their competitors, weighted by some function based on similarity or proximity in space. The state of a given unit is described by the nonlinear ordinary differential equation
\begin{equation} \dot{u}_m(t) = \frac{1}{\tau} \bigg[ b_m(t) - u_m(t) - \sum_{n \neq m} G_{m,n} a_n(t) \bigg] \end{equation}
where $b_m(t)$ represents increased activation proportional to the receptive field’s similarity to the incoming input. The internal activation of each unit, and thus the degree of inhibition it can exert, is given by a hard thresholding function $a_m(t) = T_\lambda(u_m(t))$, which simply means that if the state of a unit is below the threshold, its internal activation is zero, and if the state is above threshold, its internal activation is a linear function of $u_m(t)$. This inhibition is finally weighted by the similarity $G_{m,n}$ between two units, ensuring that redundant feature representations are not used for any given input and a sparse approximation is achieved.
A simple implementation of LCA in MATLAB is as follows (more details given later in post):
    function [a] = LCA(W, X, neurons, batch_size, thresh)

    % get activation values (b) and similarity values (G)
    b = W' * X;                               % [neurons X examples]
    G = W' * W - eye(neurons);                % [neurons X neurons]

    % LCA
    u = zeros(neurons, batch_size);           % unit states
    for l = 1:10
        a = u .* (abs(u) > thresh);           % internal activations
        u = 0.9 * u + 0.01 * (b - G * a);     % update unit states
    end
    a = u .* (abs(u) > thresh);               % [neurons X batch_size]
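For readers working in Python, the same loop can be sketched in NumPy. This is a loose port rather than the original code: the function name `lca`, the keyword arguments, and the random toy dictionary and data are assumptions for illustration, while the constants 0.9, 0.01, and 10 steps are taken from the MATLAB listing.

```python
import numpy as np

def lca(W, X, thresh=0.1, n_steps=10, decay=0.9, rate=0.01):
    """W: [pixels x neurons] dictionary with unit-norm columns,
    X: [pixels x examples] data. Returns thresholded activations."""
    n_neurons = W.shape[1]
    b = W.T @ X                          # feedforward drive, [neurons x examples]
    G = W.T @ W - np.eye(n_neurons)      # receptive-field similarity, no self-inhibition
    u = np.zeros_like(b)                 # unit states
    for _ in range(n_steps):
        a = u * (np.abs(u) > thresh)     # hard threshold
        u = decay * u + rate * (b - G @ a)
    return u * (np.abs(u) > thresh)

# toy problem: random unit-norm dictionary and random data
rng = np.random.default_rng(0)
W = rng.standard_normal((16, 8))
W /= np.linalg.norm(W, axis=0, keepdims=True)
X = rng.standard_normal((16, 5))
a = lca(W, X)
```

Every entry of `a` is either exactly zero or larger in magnitude than the threshold, which is where the sparsity of the code comes from.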
Activation is very sparse across both the population and the examples: most activation values are zero, and only a few are nonzero.
Observing the weights that are learned, we can also see that, consistent with previous research, LCA learns Gabor-like receptive fields.
It is also interesting to consider the reconstruction performance of the learned receptive fields.
The reconstructed image clearly captures the most important structural features of the original image and removes much of the noise.
Here I will introduce a two-layer Locally Competitive Algorithm that I will call Topographical Locally Competitive Algorithm (tLCA). The general procedure is to first determine the initial activity of the first layer, immediately project it to the second layer in a fast feedforward manner, perform LCA at the second layer, project the activity back down to the first layer, and then perform LCA on the first layer (see figure for schematic illustration).
Now we will walk through the steps of building the model (to see all code navigate to my tLCA repository). To begin, we first define some parameters:
    % set environment parameters
    neurons = 121;        % number of neurons
    patch_size = 256;     % patch size
    batch_size = 1000;    % batch size
    thresh = 0.1;         % LCA threshold
    h = .005;             % learning rate
    blocksize = 3;        % neighborhood size
    maxIter = 1000;       % maximum number of iterations
We can then randomly initialize the weights of the network and constrain them to lie on the unit ball via normalization:
    W = randn(patch_size, neurons);              % randomly initialize weights
    W = W * diag(1 ./ sqrt(sum(W .^ 2, 1)));     % normalize the weights
Next we need to define the local connectivities between the first layer and the second layer. These weights are held constant and are not trained like the weights of the first layer that connect to the input. To do so, we define a function `gridGenerator` that takes the arguments `neurons` and `filterSize` and returns a group × neurons matrix `blockMaster` containing binary row vectors whose filled entries correspond to the neurons that belong to each group.
1 function blockMaster = gridGenerator(neurons, filterSize)
2
3 % determine grid dimensions
4 gridSize = sqrt(neurons);
5
6 % create matrix with grids
7 blockMaster = zeros(neurons, neurons);
8 c = 1;
9 x = zeros(gridSize, gridSize);
10 x(end - (filterSize - 1):end, end - (filterSize - 1):end) = 1;
11 x = circshift(x, [1, 1]);
12 for i = 1:gridSize
13 for j = 1:gridSize
14 temp = circshift(x, [i, j]);
15 blockMaster(c,:) = temp(:)';
16 c = c + 1;
17 end
18 end
This works by first creating a binary matrix with ones over the group centered at `x(1,1)` (lines 9 - 11); because it is toroidal, there are ones on opposite sides of the matrix. Then, for all groups, it shifts this primary matrix around until all group local connections have been created and saved into the master matrix.
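The same construction can be sketched in NumPy with `np.roll` playing the role of `circshift`. This is an illustrative port, not the original code: the name `grid_neighborhoods` is made up, an odd `filter_size` is assumed, and NumPy flattens row-major whereas MATLAB's `temp(:)` is column-major, so the unit ordering need not match the MATLAB output exactly.

```python
import numpy as np

def grid_neighborhoods(neurons, filter_size):
    """Return a [groups x neurons] binary matrix; row g marks the units in
    the filter_size x filter_size toroidal neighborhood of unit g."""
    grid = int(round(np.sqrt(neurons)))
    assert grid * grid == neurons, "neurons must be a perfect square"
    half = filter_size // 2

    # template neighborhood centered on the top-left unit, wrapping at the edges
    template = np.zeros((grid, grid))
    for dy in range(-half, half + 1):
        for dx in range(-half, half + 1):
            template[dy % grid, dx % grid] = 1

    # shift the template to every grid position and store it as a row
    block = np.zeros((neurons, neurons))
    for i in range(grid):
        for j in range(grid):
            block[i * grid + j] = np.roll(template, (i, j), axis=(0, 1)).ravel()
    return block

block_master = grid_neighborhoods(121, 3)
```

With 121 neurons and a 3 × 3 filter, every group contains nine units and every unit belongs to nine groups, which is the point of the toroidal layout.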
Now that we have a means of projecting the first-layer activation up to the second layer, we need to define how inhibition between units in the second layer should be weighted. We can define the mutual inhibition between two units in the second layer as being proportional to how many first-layer units their local connections share. This can be conveniently created as follows:
    % create group inhibition weight matrix
    G2 = blockMaster * blockMaster';
    G2 = G2 ./ max(max(G2));
    G2 = G2 - eye(neurons);
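To see what these three lines produce, here is a small NumPy sketch on a toy problem. The one-dimensional ring of six units with neighborhoods of width three is an assumption chosen only to keep the numbers readable; it is not the grid used in the model.

```python
import numpy as np

n = 6
# ring neighborhoods of width 3: row g marks units g-1, g, g+1 (mod n)
B = np.zeros((n, n))
for g in range(n):
    for d in (-1, 0, 1):
        B[g, (g + d) % n] = 1

G2 = B @ B.T            # number of first-layer units shared by each pair of groups
G2 = G2 / G2.max()      # scale so that full overlap equals 1
G2 = G2 - np.eye(n)     # remove self-inhibition
```

The first row comes out as `[0, 2/3, 1/3, 0, 1/3, 2/3]`: immediate neighbors share two units and inhibit each other most strongly, while groups with no shared units do not inhibit each other at all.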
Lastly we also need to set up a similarity matrix for all pairwise connections between units. In the traditional LCA, this was computed as the similarity between receptive fields as described previously. Here we instead compute similarity as Euclidean distance in simulated cortical space. We can compute the distance of each unit to all other units using the function `lateral_connection_generator`:
    function master = lateral_connection_generator(neurons)

    % define grid size
    dim = sqrt(neurons);

    % create list of all pairwise x-y coordinates
    x = zeros(dim * dim, 2);
    c = 1;
    for i = 1:dim
        for j = 1:dim
            x(c, :) = [i, j];
            c = c + 1;
        end
    end

    % create distance matrix of each cell from the center of the matrix
    center_index = ceil(neurons / 2);
    center = x(center_index, :);
    temp = zeros(neurons, 1);
    for j = 1:size(x, 1)
        temp(j) = norm(center - x(j, :));
    end
    temp = reshape(temp, [dim, dim]);

    % shift the center of the matrix (zero distance) to the bottom right corner
    temp = circshift(temp, [center_index - 1, center_index - 1]);

    % create master matrix
    master = zeros(neurons, neurons);
    c = 1;
    for i = 1:dim
        for j = 1:dim
            new = circshift(temp, [j, i]);
            master(c, :) = new(:)';
            c = c + 1;
        end
    end
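Shifting a distance-from-center map around the grid amounts to measuring wrapped (toroidal) Euclidean distance between every pair of units. A vectorized NumPy sketch of that idea follows; the function name `toroidal_distances` is hypothetical, an odd grid side is assumed, and the unit ordering may differ from the MATLAB output, which I have not checked it against.

```python
import numpy as np

def toroidal_distances(neurons):
    """Pairwise Euclidean distance between units on a square grid
    whose edges wrap around."""
    dim = int(round(np.sqrt(neurons)))
    idx = np.arange(neurons)
    rows, cols = idx // dim, idx % dim

    # absolute row/column offsets between every pair of units
    dr = np.abs(rows[:, None] - rows[None, :])
    dc = np.abs(cols[:, None] - cols[None, :])

    # going the other way around the torus may be shorter
    dr = np.minimum(dr, dim - dr)
    dc = np.minimum(dc, dim - dc)
    return np.sqrt(dr ** 2 + dc ** 2)

D = toroidal_distances(121)
```

On the 11 × 11 grid the largest wrapped offset is five units in each direction, so no pair of units is further apart than √50, and units on opposite edges of the same row are direct neighbors.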
Now we are ready to actually run the neural network and analyze its characteristics. Image patches that were preselected from natural images and preprocessed through normalization are read in and assigned to the variable `X`. Then we loop through each training iteration and perform the following procedure:
    % 1. normalize the weights
    W = W * diag(1 ./ sqrt(sum(W .^ 2, 1)));

    % 2. feedforward: first-layer drive, projected up to the second layer
    b1 = W' * X;                                       % [neurons X examples]
    b2 = (blockMaster * sqrt(b1 .^ 2)) / blocksize;    % [groups X examples]

    % 3. LCA at the second layer
    u2 = zeros(size(blockMaster, 1), batch_size);
    for i = 1:5
        a2 = u2 .* (abs(u2) > thresh);
        u2 = 0.9 * u2 + 0.01 * (b2 - G2 * a2);
    end
    a2 = u2 .* (abs(u2) > thresh);                     % [groups X batch_size]

    % 4. project the second-layer activity back down to the first layer
    a1 = blockMaster' * a2;                            % [neurons X batch_size]
    a1 = a1 .* b1;                                     % weight by first level activation

    % 5. LCA at the first layer (G1 holds the first-layer lateral inhibition weights)
    u1 = a1;
    for l = 1:10
        a1 = u1 .* (abs(u1) > thresh);
        u1 = 0.9 * u1 + 0.01 * (b1 - G1 * a1);
    end
    a1 = u1 .* (abs(u1) > thresh);                     % [neurons X batch_size]

    % 6. update the weights
    W = W + h * ((X - W * a1) * a1');
Running this code using `maxIter` as set above takes just over a minute. The features that are learned replicate those found in the literature, and they also self-organize as has been found in the studies cited. An important observation is that the receptive fields organize by both orientation and spatial frequency, whereas lateral connections alone (run `latLCA.m` for comparison) lead to only some organization of orientation. Therefore, performing LCA in a two-layer network as we did here seems to be necessary to get good self-organization along both dimensions. It is also important to note that phase appears to organize randomly, and this is due to the square rectification of the first layer (i.e., a counter-phase stimulus may result in a negative activation, but this will be rectified into a positive activation).
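With the activations held fixed, the weight update `W = W + h * ((X - W * a1) * a1')` used in the training loop is a gradient step on the squared reconstruction error. A minimal NumPy sketch of that single step is given here; the sizes and the random stand-in for the sparse code `a1` are assumptions, since in the model `a1` comes out of the two LCA stages.

```python
import numpy as np

rng = np.random.default_rng(0)
patch_size, neurons, batch = 16, 9, 50

# random unit-norm dictionary and data
W = rng.standard_normal((patch_size, neurons))
W /= np.linalg.norm(W, axis=0, keepdims=True)
X = rng.standard_normal((patch_size, batch))

# stand-in sparse code: roughly 20% of the entries are nonzero
mask = rng.random((neurons, batch)) < 0.2
a1 = rng.standard_normal((neurons, batch)) * mask

h = 0.005                                       # learning rate from the parameter list
err_before = np.sum((X - W @ a1) ** 2)
W_new = W + h * (X - W @ a1) @ a1.T             # one update with a1 held fixed
err_after = np.sum((X - W_new @ a1) ** 2)
```

For a small enough learning rate the reconstruction error cannot increase, which is why the loop renormalizes the weights at the start of each iteration rather than constraining the step itself.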
The activation distributions of both the first and second level are very sparse, just as in regular LCA.
When we look at the activity distribution across cortical space for both levels of the network, we see that they are very localized. Also note that there is a high degree of overlap between activations across the two levels.
We can also see that the reconstruction capability of tLCA is roughly on par with LCA, though somewhat poorer.
The gap is likely due to the smaller number of neurons, their more dependent activity, and fewer training iterations.
^{1} Although these models have the advantage of being driven by natural image statistics, they also suffer from some biological implausibility (Antolík & Bednar, 2011).
Models employing sparsity-inducing norms are ubiquitous in the statistical modeling of images. Their use is strongly motivated by the way a sparse code’s efficiency, functionality, and match to the input distribution are woven together, a rather uncanny alignment of properties. Representing an input signal with a small number of active units has obvious benefits for efficient energy usage; the fewer units that can be used to provide a good representation, without breaking the system, the better. A sparse coding scheme also has logically demonstrable functional advantages over the other possible types (i.e., local and dense codes) in that it has high representational and memory capacities (representational capacity grows exponentially with average activity ratio and short codes do not occupy much memory), fast learning (only a few units have to be updated), good fault tolerance (the failure or death of a unit is not entirely crippling), and controlled interference (many representations can be active simultaneously) (Földiák & Young, 1995). Finally, and perhaps most mysteriously, a sparse code is a good representational scheme because it matches the sparse structure, or non-Gaussianity, of natural images (Simoncelli & Olshausen, 2001). That is, images can be represented as a combination of a small number of elements. Because a sparse code matches the sparse distribution of natural scenes, it provides a good statistical model of the input, which is useful because…
...such models provide the prior probabilities needed in Bayesian inference, or, in general, the prior information that the visual system needs on the environment. These tasks include denoising and completion of missing data. So, sparse coding models are useful for the visual system simply because they provide a better statistical model of the input data.
(Hyvärinen, Hurri, & Hoyer, 2009)
In the mid 90s, a seminal article by Olshausen and Field marked the beginning of a proliferation of research in theoretical neuroscience, computer vision, and machine learning more generally. There, they first introduced the computational model of sparse coding (Olshausen & Field, 1996) and demonstrated the ability to learn units with receptive fields strongly resembling those observed in biological vision when trained on natural images. Sparse coding is based on the assumption that an input image $I(x, y)$ can be modeled as a linear combination of sparsely activated representational units $\phi_i$:
\begin{equation} {I(x, y)} = \sum_i a_i \phi_i(x, y) \end{equation}
Given this linear generative model of images, the goal of sparse coding is then to find some representational units $\phi_i$ that can be used to represent an image using a sparse activity coefficient vector $\mathbf{a}$ (i.e., one that has a leptokurtic distribution with a large peak around zero and heavy tails, as can be seen in the figure below).
The optimization problem for finding such a sparse code can be formalized by minimizing the following cost function:
\begin{equation} E = {\sum_{x, y} \bigg[ {I(x, y)} - \sum_i a_i \phi_i(x, y) \bigg] ^2} + {\sum_i } S\Big(\frac{a_i} {\sigma}\Big) \end{equation}
where $S$ is some non-linear function that penalizes non-sparse activations and $\sigma$ is a scaling constant. We can see that this is basically a combination of a reconstruction error and a sparsity cost, which can be referred to as sparse-penalized least-squares reconstruction and can be generally represented by:
\begin{equation} \text{cost = [reconstruction error] + [sparseness]} \end{equation}
This form of problem falls under the more general class of sparse approximation, where a good subset of a dictionary must be found to reconstruct the data:
\begin{equation} \min_{\mathbf{\alpha}\in\mathbb{R}^m} \frac{1}{2} \vert\vert \mathbf{x} - \mathbf{D}\alpha\vert\vert^2_2+\lambda\vert\vert \alpha\vert\vert_1 \end{equation}
However, in this case, the dictionary $\mathbf{D}$ is not known, which makes this an unsupervised learning problem.
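To make the tradeoff in the sparse approximation objective concrete, here is a small NumPy sketch that evaluates it for two candidate codes. The identity dictionary, the signal, and the value of $\lambda$ are made-up numbers chosen so the arithmetic is easy to follow by hand.

```python
import numpy as np

def sparse_cost(x, D, alpha, lam):
    """0.5 * ||x - D alpha||_2^2 + lam * ||alpha||_1"""
    residual = x - D @ alpha
    return 0.5 * residual @ residual + lam * np.abs(alpha).sum()

D = np.eye(3)                          # toy dictionary
x = np.array([1.0, 0.1, 0.0])          # signal to reconstruct

dense = np.array([1.0, 0.1, 0.0])      # perfect reconstruction, two active units
sparse = np.array([1.0, 0.0, 0.0])     # drops the small coefficient

cost_dense = sparse_cost(x, D, dense, lam=0.5)     # 0 + 0.5 * 1.1 = 0.55
cost_sparse = sparse_cost(x, D, sparse, lam=0.5)   # 0.005 + 0.5 * 1.0 = 0.505
```

With this $\lambda$, the code that gives up a little reconstruction accuracy in exchange for one fewer active unit has the lower cost.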
Sparse Filtering (Ngiam, Chen, Bhaskar, Koh, & Ng, 2011) is an unsupervised learning technique that does not directly model the data (i.e., it has no reconstruction error term in the cost function). The goal of the algorithm is to learn a dictionary that provides a sparse representation by minimizing the normalized entries in a feature value matrix. Each iteration of the algorithm normalizes the feature matrix across rows, then across columns, and then sums the absolute values of the normalized entries to obtain the objective.
The remaining portion of this subsection is an excerpt from (Hahn, Lewkowitz, Lacombe Jr, & Barenholtz, 2015):
Let $F$ be the feature value matrix to be normalized, summed, and minimized. The components
\begin{equation} f^{(i)}_j \end{equation}
represent the $j^{\text{th}}$ feature value ($j^{\text{th}}$ row) for the $i^{\text{th}}$ example ($i^{\text{th}}$ column), where
\begin{equation} f^{(i)}_j=\mathbf{w}_j^T\mathbf{x}^{(i)} \end{equation}
Here, the $\mathbf{x}^{(i)}$ are the input patches and $W$ is the weight matrix. Initially random, the weight matrix is updated iteratively in order to minimize the Objective Function.
In the first step of the optimization scheme, each feature row is treated as a vector and mapped to the unit ball by dividing by its $\ell_2$-norm:
\begin{equation} \widetilde{\mathbf{f}}_j=\frac{\mathbf{f}_j}{\vert\vert\mathbf{f}_j\vert\vert_2} \end{equation}
This has the effect of giving each feature approximately the same variance.
The second step is to normalize across the columns, which again maps the entries to the unit ball. This makes the rows about equally active, introducing competition between the features and thus removing the need for an orthogonal basis. Sparse filtering prevents degenerate situations in which the same features are always active (Ngiam, Chen, Bhaskar, Koh, & Ng, 2011).
\begin{equation} \hat{\mathbf{f}}^{(i)}=\frac{\widetilde{\mathbf{f}}^{(i)}}{\vert\vert\widetilde{\mathbf{f}}^{(i)}\vert\vert_2} \end{equation}
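The normalized features are optimized for sparseness by minimizing the $\ell_1$ norm. That is, minimize the Objective Function, the sum of the absolute values of all the entries of $\hat{F}$. For datasets of $M$ examples we have the sparse filtering objective:
\begin{equation} \text{minimize} \quad \sum_{i=1}^{M} \vert\vert \hat{\mathbf{f}}^{(i)} \vert\vert_1 = \sum_{i=1}^{M} \bigg\vert\bigg\vert \frac{\widetilde{\mathbf{f}}^{(i)}}{\vert\vert\widetilde{\mathbf{f}}^{(i)}\vert\vert_2} \bigg\vert\bigg\vert_1 \end{equation}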
The sparse filtering objective is minimized using a Limited-memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS) algorithm, a common iterative method for solving unconstrained nonlinear optimization problems.
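The two normalization steps and the objective can be sketched in plain NumPy as a forward pass. This is a sketch under assumptions: the function name, the matrix shapes, and the random toy inputs are mine, and the soft absolute value with a small `eps` mirrors the `fs` line of the Theano implementation in this post rather than anything in the excerpt.

```python
import numpy as np

def sparse_filtering_objective(W, X, eps=1e-8):
    """W: [features x inputs] weights, X: [inputs x examples] data.
    Returns the objective and the normalized feature matrix."""
    F = W @ X                                                 # raw feature values
    Fs = np.sqrt(F ** 2 + eps)                                # soft absolute value
    F_tilde = Fs / np.linalg.norm(Fs, axis=1, keepdims=True)  # normalize each row (feature)
    F_hat = F_tilde / np.linalg.norm(F_tilde, axis=0, keepdims=True)  # normalize each column (example)
    return F_hat.sum(), F_hat                                 # entries are positive, so sum = l1 norm

rng = np.random.default_rng(0)
W = rng.standard_normal((10, 16))
X = rng.standard_normal((16, 30))
cost, F_hat = sparse_filtering_objective(W, X)
```

Since every column of `F_hat` has unit $\ell_2$ norm, each column's $\ell_1$ norm lies between 1 (a single active feature) and $\sqrt{10}$ (all ten features equally active), so pushing the objective down pushes each example toward using few features.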
Theano is a powerful Python library that allows the user to define and optimize functions that are compiled to machine code for faster run time performance. One of the nicest features of this package is that it performs automatic symbolic differentiation. This means we can simply define a model and its cost function and Theano will calculate the gradients for us! This frees the user from analytically deriving the gradients and allows us to explore many different model-cost combinations much more quickly. However, one of the drawbacks of this library is that it does not come prepackaged with more sophisticated optimization algorithms, like L-BFGS. Other Python libraries, such as SciPy’s optimize library, do contain these optimization algorithms, and here I will show how they can be integrated with Theano to optimize sparse filters with respect to the cost function described above.
First we define a `SparseFilter` class which performs the normalization scheme formalized above.
    import theano
    from theano import tensor as t

    class SparseFilter(object):

        """ Sparse Filtering """

        def __init__(self, w, x):

            """
            Build a sparse filtering model.

            Parameters:
            ----------
            w : ndarray
                Weight matrix randomly initialized.
            x : ndarray (symbolic Theano variable)
                Data for model.
            """

            # assign inputs to sparse filter
            self.w = w
            self.x = x

        def feed_forward(self):

            """ Performs sparse filtering normalization procedure """

            f = t.dot(self.w, self.x.T)               # initial activation values
            fs = t.sqrt(f ** 2 + 1e-8)                # numerical stability
            l2fs = t.sqrt(t.sum(fs ** 2, axis=1))     # l2 norm of row
            nfs = fs / l2fs.dimshuffle(0, 'x')        # normalize rows
            l2fn = t.sqrt(t.sum(nfs ** 2, axis=0))    # l2 norm of column
            f_hat = nfs / l2fn.dimshuffle('x', 0)     # normalize columns

            return f_hat

        def get_cost_grads(self):

            """ Returns the cost and flattened gradients for the layer """

            cost = t.sum(t.abs_(self.feed_forward()))
            grads = t.grad(cost=cost, wrt=self.w).flatten()

            return cost, grads
When this object is created, it is initialized with the passed weights and data variables. It also has a `feed_forward` method for getting the normalized activation values as well as a `get_cost_grads` method that returns the cost (defined above) and the gradients of the cost with respect to the weights. Note that in this implementation, the gradients are flattened out; this has to do with making Theano compatible with SciPy’s optimization library as will be described next.
Now we need to define a function that, when called, will compile a Theano training function for the `SparseFilter` based on its cost and gradients at each training step, as well as a callable function for SciPy’s optimization procedure that does the following steps:
1. Reshape the new weights `theta_value` consistent with how they are initialized in the model and convert to float32
2. Assign the reshaped weights to the model
3. Get the cost and the vectorized gradients
4. Convert the cost and gradients to float64 for SciPy
Note that in step #3, the gradients returned are already vectorized based on the `get_cost_grads` method of the `SparseFilter` class for compatibility with SciPy’s optimization framework. The code for accomplishing this is as follows:
    import numpy as np

    def training_functions(data, model, weight_dims):

        """
        Construct training functions for the model.

        Parameters:
        ----------
        data : ndarray
            Training data for unsupervised feature learning.
        model : SparseFilter
            Model providing the symbolic cost and gradients.
        weight_dims : tuple
            Shape of the weight matrix.

        Returns:
        -------
        train_fn : callable
            Callable training function for L-BFGS.
        """

        # compile the Theano training function
        cost, grads = model.get_cost_grads()
        fn = theano.function(inputs=[], outputs=[cost, grads],
                             givens={model.x: data}, allow_input_downcast=True)

        def train_fn(theta_value):

            """
            Creates a shell around training function for L-BFGS optimization
            algorithm such that weights are reshaped before calling Theano
            training function and outputs of Theano training function are
            converted to float64 for SciPy optimization procedure.

            Parameters:
            ----------
            theta_value : ndarray
                Output of SciPy optimization procedure (vectorized).

            Returns:
            -------
            c : float64
                The cost value for the model at a given iteration.
            g : float64
                The vectorized gradients of all weights
            """

            # reshape the theta value for Theano and convert to float32
            theta_value = np.asarray(theta_value.reshape(weight_dims[0],
                                                         weight_dims[1]),
                                     dtype=theano.config.floatX)

            # assign the theta value to weights
            model.w.set_value(theta_value, borrow=True)

            # get the cost and vectorized grads
            c, g = fn()

            # convert values to float64 for SciPy
            c = np.asarray(c, dtype=np.float64)
            g = np.asarray(g, dtype=np.float64)

            return c, g

        return train_fn
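The wrapper pattern above (flat vector in, reshaped for the model, float64 cost and flat gradient out) can be tried on a toy problem without Theano. The sketch below is only an illustration: the quadratic cost and the `target` matrix are stand-ins for the sparse filtering cost, and the gradient is written by hand instead of coming from symbolic differentiation.

```python
import numpy as np
from scipy.optimize import minimize

# hypothetical stand-in for the model: pull a 2 x 3 weight matrix toward a target
target = np.arange(6, dtype=np.float64).reshape(2, 3)
weight_dims = target.shape

def train_fn(theta_value):
    """Take the flat vector from SciPy, return (cost, flat gradient)."""
    w = theta_value.reshape(weight_dims)     # flat vector -> weight matrix
    diff = w - target
    cost = 0.5 * np.sum(diff ** 2)
    grad = diff.ravel()                      # flatten the gradient for SciPy
    return np.float64(cost), grad.astype(np.float64)

# jac=True tells SciPy that train_fn returns the gradient along with the cost
result = minimize(train_fn, np.zeros(target.size),
                  method='L-BFGS-B', jac=True,
                  options={'maxiter': 100})
w_opt = result.x.reshape(weight_dims)
```

The optimizer only ever sees one-dimensional float64 arrays, which is why the reshape and dtype conversions live inside the wrapper and not in the model.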
Now that we have the model and the training environment defined, we can build the model and visualize what it learns. First we read in some data and preprocess it by centering the mean at zero and whitening it with SciPy’s `whiten`, which scales each feature to unit variance. Finally we convert the data to float32 for GPU compatibility.
    from scipy.io import loadmat
    from scipy.cluster.vq import whiten

    data = loadmat("patches.mat")['X']    # load in the data
    data -= data.mean(axis=0)             # center data at mean
    data = whiten(data)                   # whiten the data
    data = np.float32(data.T)             # convert to float32
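`scipy.cluster.vq.whiten` divides each feature by its standard deviation, so correlations between features are left in place. If full decorrelation is wanted, ZCA whitening is one common option. The sketch below is not part of the original pipeline: the function name `zca_whiten`, the `eps` regularizer, and the toy data are assumptions, and the data is taken to be laid out as [examples x features].

```python
import numpy as np

def zca_whiten(data, eps=1e-5):
    """Center the data and rotate/scale it so that its covariance is
    approximately the identity. data: [examples x features]."""
    centered = data - data.mean(axis=0)
    cov = centered.T @ centered / centered.shape[0]
    eigvals, eigvecs = np.linalg.eigh(cov)            # covariance is symmetric
    transform = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return centered @ transform

# correlated toy data: independent noise passed through a fixed mixing matrix
rng = np.random.default_rng(0)
mixing = np.array([[1.0, 0.5, 0.0, 0.0],
                   [0.0, 1.0, 0.5, 0.0],
                   [0.0, 0.0, 1.0, 0.5],
                   [0.0, 0.0, 0.0, 1.0]])
toy = rng.standard_normal((2000, 4)) @ mixing
white = zca_whiten(toy)
cov_white = white.T @ white / white.shape[0]          # close to the identity
```

The `eps` term keeps the inverse square root stable when some directions in the data have almost no variance.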
Next we define the model variables, including the network architecture (i.e., number of neurons and their weights), the initial weights themselves, and a symbolic variable for the data.
    from init import init_weights

    weight_dims = (100, 256)           # network architecture
    w = init_weights(weight_dims)      # random weights
    x = t.fmatrix()                    # symbolic variable for data
The imported method `init_weights` simply generates random weights with zero mean and unit variance. In addition, these weights are designated “shared” variables so that they can be updated across all functions that they appear in, and are stored as float32 for GPU compatibility. With this in place, we can then build the Sparse Filtering model and the training functions for its optimization.
    model = SparseFilter(w, x)
    train_fn = training_functions(data, model, weight_dims)
Finally, we can train the model using SciPy’s optimization library.
    from scipy.optimize import minimize

    weights = minimize(train_fn, model.w.eval().flatten(),
                       method='L-BFGS-B', jac=True,
                       options={'maxiter': 100, 'disp': True})
With the maximum number of iterations set at 100, this algorithm converges in well under a minute. We can then visualize the representations that it has learned by grabbing the final weights and reshaping them.
    import visualize

    weights = weights.x.reshape(weight_dims[0], weight_dims[1])
    visualize.drawplots(weights.T, 'y', 'n', 1)
As we can see, Sparse Filtering learns edge-like feature detectors even without modeling the data directly. Similar outcomes can also be obtained using standard gradient descent methods.