Here is a small tool in Excel using which you can find
clusters in your data set. The tool uses
Self Organizing Maps (SOM) -
originally proposed by T.Kohonen as the method for clustering.
A bit of background on Clustering...
Clustering is a very popular and useful technique in data analysis / data mining. Suppose your data contain observations on individuals
- for example, individuals could be -
Customers of a business,
Outlets of a retail chain of grocery stores,
Students of a University
measured on a large number of variables - for example
Customers - age, monthly expenditure, credit scores ...
Outlets - geographical location, sales, number of employees ...
Students - scores in different subjects
One very important and interesting problem regarding such data is to identify small number of groups of such individuals based on their similarity to each other. Why might one be doing such groupings?
Typically for taking some actions on these groups and the actions
that can be customized for different groups as opposed to the
same action taken on all groups. For example
Design different promotional campaigns for different groups of customers.
Design different business strategies to promote sales
in different groups of outlets.
Design different courses based on strengths and weaknesses of
different groups of students.
There are many clustering algorithms available. SOM is one of them.
SOM is an extensively researched area and it has been applied
sucessfully on problems from various fields.
Why SOM in Excel ?
Guys who are extensively working on SOM would perhaps
immediately recognize that SOM in Excel is not a very good idea.
One reason for this is - SOM needs a lot of computing power.
If you are really serious about solving a large real life problem with SOM - no way you are going to choose Excel as your vehicle.
You need thousands of iterations to run, you need speed in computation,
you need pretty sophisticated graphics to visualize the results
of the analysis. In none of these fronts Excel will be able to compete
favorably with other packages speficically writeen for SOM. More
importantly - Excel is not meant for these things.
The tool here is really a toy version of SOM. Severely restricted
in speed and capability. So a better name for this tool would be, perhaps, A Poor Man's SOM .
If you are new to SOM - just read or heard about it, know a bit about the algorithm - then here is a small toy to play with and get a feel about how it works.
You have just started wondering where and how you can use SOM.
You want to buy a commercial SOM software eventually but for now
you want to have a feel of it and see what kind of features
would be useful or nice to have. There would be quite a few freewares available on the web. You may find that they run on Unix systems and you don't have Unix. You may find stuff for windows but that may need a bit of work to install it, read through the help files, prepare your dataset in a special format etc. You wish there were some stuff less messy.
You need to give a short quick presentation on SOM to
your colleagues, students. You wish there were something small an
easy to run with which you can run a quick demo.
This tool comes in handy typically for such situations. As long as you
have Excel (I naively assume you have, pardon me if you don't) you can
download the tool and immediately start playing with it.
I am sure there must be Excel tools freely available on the web doing SOM for you. Here is another one of them. It's upto you to choose.
Since it's free - why not give it a try to see whether its any good.
  Some Nice Features of the Application ...  
|
-
The data columns used for clustering need not reside in contiguous columns in your data set. For example - you may have 20 columns in your data and you just want
to use columns 4,5, 19 and 11 for clustering. For this tool you
don't need to bring the columns side by side. This means less
time spent formatting you data so that Her Majesty, the software - likes it.
-
You can change some of the training parametrs and see how it affects the result of clustering.
-
While training the map you can present the data to the map in
random order.
-
Once you have trained the map, the application saves the weight vectors of the map. Next time you want to train a map of same dimension on same data, you have the option of using the weights already saved as the starting weights. This provides incremental learning of the map.
-
Application has some capability of handling
missing data . Any non-number in any of your data column which are used in clustering will be treated as missing value. Application will replace all the missing values in a column by respective column means.
- Once the clustering is over, in the output
mean and varaince of each cluster
are reported. Location of each cluster on the Map is reported.
Application also generates the map along with data points on the map to visually depict the cluster locations. On the map, ideally one should represent all the data points of a cluster by a single point - the neuron that captured all those points. However this does not give a good idea about how many points are captured by a neuron. So the locations of the data points on the map are randomly jittered a bit to give you the idea of cluster sizes.
-
Once the training is over, you have the option of
saving the model in a separate workbook. Here your actual data set along with Cluster labels are saved. The variables that are not used in clustering are
also saved in the output. This may be helpful for profilng the clusters. For example for a clustering of customer data base you may have used only - Income, Credit Score and Monthly expenditure variables. However the database may also have the variable - AGE.
Once the clustering is complete, you might check the age distribution in each cluster. You might try to see whether certain clusters consist
of predominantly younger people etc.
-
If you choose the option of saving the model in a separate workbook, you also get a Radar Plot.
This plots visually compares the cluster means of the varaibles
across different clusters.
  Stuff that are not so nice ...  
|
-
Size restriction - The application can handle at most 50 input variables. Application cannot handle categorical data. All the columns to used in clustering should be numeric. Any non-number is treated as missing values and replaced by the column average. This restriction is basically because according to my limited knowledge, there does not seem to be any consescus on how
to mesure distances between categories of a categorical variable.
-
At one time you can run at most 500 training cycles. This is however, not such a big handicap since you can start from
the weights you have reached after 500 cycles and run it for another 500 cycles ..... and keep on doing that.
-
Speed -
Speed is a big concern. For moderately large data sets the application can get painfully slow. Actual training of the map is not that much of a problem. However following are the two major bottlenecks
- Data pre-processing before the training starts.
- Final drawing of the map once the training is over.
One tip to speed it up - before copying the data into the Data sheet of the file, please remove any non-numbers (blank cells, cells with Excel errors, text etc.) from the data columns that you
are planning to use for clustering. If non-numbers exists in your data
application searches cell by cell to locate the non-number and replaces it with the column mean. For a moderate to big data set this search takes awful lot of time.
-
Graphics
- There are lots of nice graphics typicallyy associated with SOMs. Unfortunately, this application does not have that.
  Techy Stuff ... for the more inquisitive guys  
|
-
The application uses a
square grid
of neurons arranged in rows and columns. This makes the length and width of the map equal.
Unfortunately application cannot handle non-square grids.
-
Application uses
Gaussian neighborhood function.
-
Weight vectors of the map are
randomly initialized
with coordiante values between - 1 and 1. This is not a good strategy at all. Recommended strategy is to get two largest principal components of your data and look at a plane spanned by those two components and
sytematically initialize the weight vecotrs from that place preserving the distance ordering. For me, this happens to be a dream since I don't know any nice simple method in Excel to compute principal components.
-
Input variables are
scaled to have values
between -1 nd 1.
-
Plain Euclidean distance
is used to compute similarities. Application cannot handle any other kind of distances.
Now you know more about the application. I feel that it is a nice little toy for small experiments regarding SOM and clustering. You may even try it just for fun. I myself is not very experienced with SOMs.
I have put together the application by reading about SOM in bits and pieces here and there. So if you find major bugs or error I would really appreciate if you could mail me your comments, feedback, bug reports etc.
|
Home |
Resume |
Thesis Abstract |
PET |
Neural Network in Excel | Clustering in Excel |
Classification Tree in Excel |