The sixth DublinICA seminar will take place this Friday, the 19th of August
and will feature Tom Melia and Susanna Still talking about source separation
and clustering. In keeping with tradition, visitors from Hawaiian
universities will be allowed to speak for as long as they like (provided
it's about an hour), hence the 1.5-hour meeting.
The seminar will start at 3:30pm in Room 234 in the Engineering Building at
UCD. Transportation info can be found here:
http://www.ucd.ie/trans.htm
http://www.ucd.ie/maps/campusmap_feb05.pdf
(Engineering is building 21 on the above map). Coffee, tea, and cookies
will be available from 3:15pm.

Tom Melia (UCD)
The DESPRIT Source Separation Algorithm

Susanna Still (Dept of Computer Science, University of Hawaii)
TITLE
An information theoretic approach to clustering and complexity control.
ABSTRACT
I will give a brief introduction to clustering / unsupervised learning
within an information theoretic framework, and then I will discuss the
important problem of complexity control within this framework.
Clustering provides a common means of identifying structure in complex data,
and there is renewed interest in clustering as a tool for the analysis of
large data sets in many fields. A natural question is how many clusters are
appropriate for the description of a given system.
Traditional approaches to this problem either assume clusters of a
particular shape as a model of the system, or follow a two-step procedure
in which a clustering criterion determines the optimal assignments for a
given number of clusters and a separate criterion measures the goodness of
the classification to determine the number of clusters.
In a statistical mechanics approach, clustering can be seen as a tradeoff
between energy and entropy-like terms, with lower temperature driving the
proliferation of clusters to provide a more detailed description of the
data. For finite data sets, we expect that there is a limit to the
meaningful structure that can be resolved, and therefore a minimum
temperature below which we will capture sampling noise. This suggests that
correcting the clustering criterion for the bias that arises due to sampling
errors will allow us to find a clustering solution at a temperature that is
optimal in the sense that we capture maximal meaningful structure, without
having to define an external criterion for the goodness or stability of the
clustering. We have shown that, in a general information-theoretic framework,
the finite size of a data set determines an optimal temperature, and we have
introduced a method for finding the maximal number of clusters that can be
resolved from the data in the hard clustering limit. In my talk, for
simplicity, I will focus on this limit.
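As a rough illustration of the statistical-mechanics picture described in the abstract (a minimal sketch of the general idea, not the speaker's actual algorithm), the following Python/NumPy fragment performs soft clustering in which assignments follow a Gibbs distribution over squared distances, p(c|x) proportional to exp(-||x - mu_c||^2 / T); lowering the temperature T sharpens the assignments toward a hard partition:

```python
import numpy as np

def soft_cluster(X, k, T, n_iter=100, seed=0):
    """Soft clustering at temperature T (deterministic-annealing flavour).

    Assignments follow a Gibbs distribution over squared distances,
    p(c|x) proportional to exp(-||x - mu_c||^2 / T).  Lower T sharpens
    assignments toward a hard partition (more structure resolved);
    higher T blurs them toward uniform (clusters merge).
    """
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # squared distances, shape (n_points, k)
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        logits = -d2 / T
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        p = np.exp(logits)
        p /= p.sum(axis=1, keepdims=True)             # Gibbs assignments p(c|x)
        weights = np.maximum(p.sum(axis=0), 1e-12)    # avoid division by zero
        centers = (p.T @ X) / weights[:, None]        # responsibility-weighted means
    return centers, p
```

At very high T the assignments blur toward uniform, so the data are effectively described by a single cluster; at low T the assignments become essentially hard and finer structure is resolved, illustrating the temperature-driven proliferation of clusters mentioned above.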
If time remains, I will discuss how the widely used K-means algorithm can
be derived and understood from information-theoretic principles.
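In the hard-clustering limit mentioned in the abstract, each point is attributed entirely to its nearest cluster center, and the resulting update is the familiar K-means iteration. A textbook sketch (not the speaker's derivation) looks like this:

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Plain k-means: the hard-assignment limit in which each point
    belongs entirely to its nearest center."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d2.argmin(axis=1)            # hard nearest-center assignment
        for c in range(k):
            members = X[labels == c]
            if len(members):                  # leave empty clusters in place
                centers[c] = members.mean(axis=0)
    return centers, labels
```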
Relevant publications:
S. Still and W. Bialek (2004): "How many clusters? An information
theoretic perspective." Neural Computation, 16:2483-2506.
S. Still, W. Bialek and L. Bottou (2003): "Geometric Clustering using the
Information Bottleneck method." In "Advances in Neural Information
Processing Systems 16". http://books.nips.cc/nips16.html
