• Cameron Gordon

The Information Bottleneck Principle and Corporate Structure

Over the summer I completed one final research project to finish my Master's. My goal was to understand the information bottleneck principle, which in recent years has been used to partially explain why neural networks are so effective at learning complex associations.

What I'd like to do with this post is show how the same principle can be used to understand how large multi-level companies efficiently process information to make decisions. I'll try to avoid being too math heavy here and instead present the key ideas informally.

In brief, the information bottleneck method considers two signals, an input X and an output Y, and attempts to squeeze out a representation θ of the information in X that is relevant for predicting Y. It's worth highlighting this with an example - take the picture of a rose below. It contains a large amount of information: each of the 500x700 pixels has a hue, there are shapes and swirls and backgrounds and petal edges - that's the input. The output is the word 'rose' out of all the other possible flowers it could be. The information bottleneck method seeks to extract the relevant parts of the input (e.g. the colour, the shape of the petals) that tell us that it is a 'rose'.

The relevant information is a highly compressed representation of the initial input - it ignores irrelevant features (e.g. the colour of the background) to find the parts that are predictive of the output. The central problem is learning what needs to be paid attention to and what should be ignored. Finding that optimal representation θ is the goal of the method.
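For readers who want to see the trade-off written down, the usual formal statement from Tishby, Pereira, and Bialek (2000) is sketched below. The original paper uses a different symbol for the representation; θ is used here only to match this post. I(·;·) denotes mutual information and β is a positive parameter that sets how much predictive power is worth relative to compression.

```latex
\min_{p(\theta \mid x)} \; I(X; \theta) \;-\; \beta \, I(\theta; Y)
```

The first term rewards throwing away detail about the input; the second rewards keeping whatever is predictive of the output. A small β favours aggressive compression, a large β favours retaining more of the signal.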

Let's turn to business for a second. The information that flows into a department is broad: there are individual sales, there's the weather of the day, statistics reports, statements in the media, gossip with staff, Government taxation policy, supply disruptions overseas, the colour of a customer's shoes - a staggering amount of raw information, only a small proportion of which is going to be useful for a business decision such as opening a new store. An omniscient and unconstrained entity would be able to use all of this information to make a business decision. The colour of a customer's shoes might be relevant for a store opening - but probably not. Businesses are constrained by resources and face a dual problem: to compress and filter the sum total of input information subject to its business relevance.

Extracting a compressed representation of relevant information is the question that Shwartz-Ziv and Tishby (2017) sought to understand for deep neural networks. Put most simply, a deep neural network consists of simple calculating units (neurons) arranged in layers between an input and an output. This structure is similar to how many large businesses are organised: a hierarchy of staff from ground-level employees to middle managers, departmental heads, and a controlling board responsible for approving business decisions. Drawing a comparison between neural networks and corporate hierarchy may seem odd, but as we'll see it's useful for understanding how the information bottleneck principle can apply.

The above shows two images: on the left we have a standard feedforward neural network (source: Glassner 2018); on the right a highly simplified corporate structure involving a direct reporting structure with information received by the organisation at the operational staff level and strategic decisions undertaken by the board. The similar structure here is striking. Actual corporate structures aren't as neatly defined as this - but we'll treat it as an imperfect model to see what light it can shed on corporate information processing.

What Shwartz-Ziv and Tishby (2017) note is that a neural network acts as a whole to encode the relation between an input and the output (X → layer 1 → layer 2 → ... → Y). Each individual layer decodes a representation passed from the previous layer and encodes it for the next. As a result, information becomes more compressed as it flows up through the hierarchy. A feedback mechanism (backpropagation) allows the network representations to become more predictive of the output during interaction with the input-output signals. The final network cuts the input signal down to a compressed representation of relevant information predictive of the output. The 'bottleneck' in the information bottleneck method is a limit on compression (compress too far and valuable information is lost from the signal).
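To make the compression idea concrete, here is a minimal sketch using a toy version of the rose example. Everything in it is an assumption made for illustration: the three binary features, the rule that petal shape alone decides the label, and the four hand-written "layers" (a real network learns its encoders rather than being handed them). For each layer it measures how much the representation still says about the input and about the label.

```python
from collections import defaultdict
from itertools import product
from math import log2


def mutual_information(pairs):
    """I(A;B) in bits, treating each (a, b) pair as equally likely."""
    n = len(pairs)
    joint, pa, pb = defaultdict(float), defaultdict(float), defaultdict(float)
    for a, b in pairs:
        joint[(a, b)] += 1 / n
        pa[a] += 1 / n
        pb[b] += 1 / n
    return sum(p * log2(p / (pa[a] * pb[b])) for (a, b), p in joint.items())


# Hypothetical toy input: (petal shape, colour, background), all combinations equally likely.
inputs = list(product(["layered", "cup"], ["red", "yellow"], ["garden", "wall"]))


def label(x):
    """Assumed labelling rule: only the petal shape matters."""
    return "rose" if x[0] == "layered" else "tulip"


# Hand-written encoders, each keeping less of the input than the one before.
layers = {
    "raw input": lambda x: x,
    "drop background": lambda x: (x[0], x[1]),
    "drop colour": lambda x: x[0],
    "over-compressed": lambda x: "flower",
}

plane = {}
for name, encode in layers.items():
    about_input = mutual_information([(x, encode(x)) for x in inputs])
    about_label = mutual_information([(encode(x), label(x)) for x in inputs])
    plane[name] = (about_input, about_label)
    print(f"{name:16s} I(X;T) = {about_input:.2f} bits   I(T;Y) = {about_label:.2f} bits")
```

In this toy setup the first three layers shed bits about the input while losing nothing about the label, which is the useful kind of compression. The last layer compresses past the bottleneck: it says nothing about the input and nothing about the label either.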

Hierarchical information flow. Each layer decodes information from the previous, processes the details and encodes a compressed representation to the next.

Two important results are useful in understanding why this occurs. The first is the Data Processing Inequality which states that information cannot increase through processing. Secondly, mutual information is invariant to invertible transformations. (For the mathematically inclined, Tishby and Zaslavsky (2015) give a fuller explanation).
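Both results can be checked numerically on a small example. The sketch below is illustrative only: the four-valued input, the 10% garbling rate, and the names `summarise`, `relay`, and `relabel` are all assumptions invented for the demonstration, not anything taken from the cited papers. It passes a signal up a short chain and measures how much each stage still says about the original input.

```python
from collections import defaultdict
from math import log2


def mutual_information(joint):
    """I(A;B) in bits, from a dict {(a, b): probability}."""
    pa, pb = defaultdict(float), defaultdict(float)
    for (a, b), p in joint.items():
        pa[a] += p
        pb[b] += p
    return sum(p * log2(p / (pa[a] * pb[b]))
               for (a, b), p in joint.items() if p > 0)


def pass_up(joint, channel):
    """Turn p(a, b) into p(a, c), where c is produced from b alone via p(c | b)."""
    out = defaultdict(float)
    for (a, b), p in joint.items():
        for c, q in channel[b].items():
            out[(a, c)] += p * q
    return dict(out)


# Four equally likely raw observations; the signal starts as a perfect copy of X.
x_copy = {(x, x): 0.25 for x in range(4)}

# Staff summarise four observations into two categories (lossy, deterministic).
summarise = {0: {"low": 1.0}, 1: {"low": 1.0}, 2: {"high": 1.0}, 3: {"high": 1.0}}
# The manager relays the summary but garbles it 10% of the time (lossy, noisy).
relay = {"low": {"low": 0.9, "high": 0.1}, "high": {"high": 0.9, "low": 0.1}}
# A one-to-one relabelling of the summary is an invertible transformation.
relabel = {"low": {"bas": 1.0}, "high": {"haut": 1.0}}

x_staff = pass_up(x_copy, summarise)
x_manager = pass_up(x_staff, relay)
x_relabelled = pass_up(x_staff, relabel)

i_raw = mutual_information(x_copy)
i_staff = mutual_information(x_staff)
i_manager = mutual_information(x_manager)
i_relabelled = mutual_information(x_relabelled)

print(f"raw copy:        {i_raw:.3f} bits about X")
print(f"staff summary:   {i_staff:.3f} bits about X")
print(f"manager relay:   {i_manager:.3f} bits about X")
print(f"relabelled:      {i_relabelled:.3f} bits about X")
```

Each processing step leaves the same amount of information about X or less (2 bits, then 1 bit, then roughly 0.53 bits), which is the Data Processing Inequality at work. The one-to-one relabelling leaves the staff summary's 1 bit untouched, which is the invariance property.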

The first property means that information can only stay the same or shrink as we move up the hierarchy. It can be processed, it can be analysed, it can be extracted, filtered, summarised - but it cannot increase as we move from one layer to the next. The second means that we can apply the same idea to a wide variety of different structures, of which neural networks are only one. Moreover, we can process the information in a variety of ways - including (in theory) the complex mental and analytical processes of a staff member briefing their manager, and so on up through the chain of command. Feedback flowing down through the organisation similarly modifies the organisation and its processing to be more relevant to the board's goals.

Together the two properties explain how potentially useful information can be lost through the corporate hierarchy. Perhaps the woman at the entrance with the bright red shoes is actually a high-flying corporate executive on the hunt for an acquisition. A ground-level staff member remarks to their manager about the 'serious woman with fancy shoes'; the manager (thinking it a spurious signal) ignores it and tells them to get back to work. The information is filtered early and never reaches the board, who with terminally falling revenues would have been keen to try a sideline pitch. A contrived example, yes - but it illustrates the principle.

A window shopping Meryl Streep. Spurious signal or not, that information will move quickly.

In future posts I'll extend this concept to more realistic corporate structures and discuss information loss (where relevant information is incorrectly filtered or lost moving through the hierarchy) more seriously, in particular drawing insights from recent advances in neural network architectures to discuss how these ideas apply to corporate environments. I'll also discuss how these concepts fit with Herbert Simon's seminal works on administrative decision making.


After writing this post I was saddened to learn that Naftali Tishby, one of the key authors of the information bottleneck method and in particular its application to deep neural networks, passed away on 9 August 2021. His work was an inspiration to me and changed how I view the world. For those who want to learn more about the method, I refer you to Tishby, Pereira, and Bialek (2000), Tishby and Zaslavsky (2015), and Shwartz-Ziv and Tishby (2017).

For a brief introduction to the ideas, including the necessary information theory background, the summary report I produced over the summer is below:

Information Bottleneck Summary Paper
Download PDF • 1.65MB
