Self Organizing Map (SOM) in Excel

Neural Network Based Clustering
using Self Organizing Map (SOM) in Excel

Here is a small tool in Excel using which you can find clusters in your data set. The tool uses Self Organizing Maps (SOM) - originally proposed by T.Kohonen as the method for clustering.

A bit of background on Clustering...

Clustering is a very popular and useful technique in data analysis / data mining. Suppose your data contain observations on individuals - for example, individuals could be -

Customers of a business,

Outlets of a retail chain of grocery stores,

Students of a University

measured on a large number of variables - for example

Customers - age, monthly expenditure, credit scores ...

Outlets - geographical location, sales, number of employees ...

Students - scores in different subjects

One very important and interesting problem regarding such data is to identify small number of groups of such individuals based on their similarity to each other. Why might one be doing such groupings? Typically for taking some actions on these groups and the actions that can be customized for different groups as opposed to the same action taken on all groups. For example

Design different promotional campaigns for different groups of customers.

Design different business strategies to promote sales in different groups of outlets.

Design different courses based on strengths and weaknesses of different groups of students.

There are many clustering algorithms available. SOM is one of them. SOM is an extensively researched area and it has been applied sucessfully on problems from various fields.

Why SOM in Excel ?

Guys who are extensively working on SOM would perhaps immediately recognize that SOM in Excel is not a very good idea. One reason for this is - SOM needs a lot of computing power. If you are really serious about solving a large real life problem with SOM - no way you are going to choose Excel as your vehicle. You need thousands of iterations to run, you need speed in computation, you need pretty sophisticated graphics to visualize the results of the analysis. In none of these fronts Excel will be able to compete favorably with other packages speficically writeen for SOM. More importantly - Excel is not meant for these things.

The tool here is really a toy version of SOM. Severely restricted in speed and capability. So a better name for this tool would be, perhaps, A Poor Man's SOM .

If you are new to SOM - just read or heard about it, know a bit about the algorithm - then here is a small toy to play with and get a feel about how it works.

You have just started wondering where and how you can use SOM. You want to buy a commercial SOM software eventually but for now you want to have a feel of it and see what kind of features would be useful or nice to have. There would be quite a few freewares available on the web. You may find that they run on Unix systems and you don't have Unix. You may find stuff for windows but that may need a bit of work to install it, read through the help files, prepare your dataset in a special format etc. You wish there were some stuff less messy.

You need to give a short quick presentation on SOM to your colleagues, students. You wish there were something small an easy to run with which you can run a quick demo.

This tool comes in handy typically for such situations. As long as you have Excel (I naively assume you have, pardon me if you don't) you can download the tool and immediately start playing with it.

I am sure there must be Excel tools freely available on the web doing SOM for you. Here is another one of them. It's upto you to choose. Since it's free - why not give it a try to see whether its any good.

Some Nice Features of the Application ...

The data columns used for clustering need not reside in contiguous columns in your data set. For example - you may have 20 columns in your data and you just want to use columns 4,5, 19 and 11 for clustering. For this tool you don't need to bring the columns side by side. This means less time spent formatting you data so that Her Majesty, the software - likes it.

You can change some of the training parametrs and see how it affects the result of clustering.

While training the map you can present the data to the map in random order.

Once you have trained the map, the application saves the weight vectors of the map. Next time you want to train a map of same dimension on same data, you have the option of using the weights already saved as the starting weights. This provides incremental learning of the map.

Application has some capability of handling missing data . Any non-number in any of your data column which are used in clustering will be treated as missing value. Application will replace all the missing values in a column by respective column means.

Once the clustering is over, in the output mean and varaince of each cluster are reported. Location of each cluster on the Map is reported. Application also generates the map along with data points on the map to visually depict the cluster locations. On the map, ideally one should represent all the data points of a cluster by a single point - the neuron that captured all those points. However this does not give a good idea about how many points are captured by a neuron. So the locations of the data points on the map are randomly jittered a bit to give you the idea of cluster sizes.

Once the training is over, you have the option of saving the model in a separate workbook. Here your actual data set along with Cluster labels are saved. The variables that are not used in clustering are also saved in the output. This may be helpful for profilng the clusters. For example for a clustering of customer data base you may have used only - Income, Credit Score and Monthly expenditure variables. However the database may also have the variable - AGE. Once the clustering is complete, you might check the age distribution in each cluster. You might try to see whether certain clusters consist of predominantly younger people etc.

If you choose the option of saving the model in a separate workbook, you also get a Radar Plot. This plots visually compares the cluster means of the varaibles across different clusters.

Stuff that are not so nice ...

Size restriction - The application can handle at most 50 input variables. Application cannot handle categorical data. All the columns to used in clustering should be numeric. Any non-number is treated as missing values and replaced by the column average. This restriction is basically because according to my limited knowledge, there does not seem to be any consescus on how to mesure distances between categories of a categorical variable.

At one time you can run at most 500 training cycles. This is however, not such a big handicap since you can start from the weights you have reached after 500 cycles and run it for another 500 cycles ..... and keep on doing that.

Speed - Speed is a big concern. For moderately large data sets the application can get painfully slow. Actual training of the map is not that much of a problem. However following are the two major bottlenecks
1. Data pre-processing before the training starts.
2. Final drawing of the map once the training is over.
One tip to speed it up - before copying the data into the Data sheet of the file, please remove any non-numbers (blank cells, cells with Excel errors, text etc.) from the data columns that you are planning to use for clustering. If non-numbers exists in your data application searches cell by cell to locate the non-number and replaces it with the column mean. For a moderate to big data set this search takes awful lot of time.

Graphics - There are lots of nice graphics typicallyy associated with SOMs. Unfortunately, this application does not have that.

Techy Stuff ... for the more inquisitive guys

The application uses a square grid of neurons arranged in rows and columns. This makes the length and width of the map equal. Unfortunately application cannot handle non-square grids.
Application uses Gaussian neighborhood function.
Weight vectors of the map are randomly initialized with coordiante values between - 1 and 1. This is not a good strategy at all. Recommended strategy is to get two largest principal components of your data and look at a plane spanned by those two components and sytematically initialize the weight vecotrs from that place preserving the distance ordering. For me, this happens to be a dream since I don't know any nice simple method in Excel to compute principal components.
Input variables are scaled to have values between -1 nd 1.
Plain Euclidean distance is used to compute similarities. Application cannot handle any other kind of distances.

mail me

Neural Network Based Clustering using Self Organizing Map (SOM) in Excel

A bit of background on Clustering...

Why SOM in Excel ?

Neural Network Based Clustering
using Self Organizing Map (SOM) in Excel