With recent advances in data acquisition and storage, extremely large data collections (in the gigabyte to terabyte range) have become very common. About 10 years ago, the systematic study of techniques for discovering information in such large collections was initiated; these techniques are commonly referred to as data mining algorithms. The Data Mining Group at the Australian National University has been active since 1997 in the development and analysis of data mining techniques.
The main focus of our work is on the computational aspects relating to the large number of data records available, the complexity of the data and parallel algorithms. Techniques studied include scalable smoothing techniques, wavelet-based methods and scalable parallel algorithms. Much of the research is done jointly with government and commercial organisations, and our students participate actively in the interaction with industry and government. The research makes use of the national high performance computing facilities at the APAC (Australian Partnership for Advanced Computing) National Facility located on the ANU campus. The main aim is to obtain better and faster algorithms that ultimately bring the data to the desktop, so that end users can analyse very large data sets in real time.
The following projects are available for potential honours and post-graduate students. Please contact us if you are interested; we are happy to provide more information.
A computer science honours research project. For a project description and more information click here.
Contact: Markus Hegland
This particular study shall investigate the application of additive models which are functions of the form:
f(x1,...,xn) = f1(x1) + ... + fn(xn)
where the attributes x1,...,xn can be simple (numbers or categories) or composite (e.g., sets, arrays or graphs). Our earlier work has shown that additive models are effective data mining tools: they handle very large numbers of attributes effectively and retain good approximation properties in that setting. However, very little work has been done on composite attributes, and this is where this project shall contribute new insights. The focus of the project depends on the interests of the student and could include the development and analysis of new algorithms, the study of approximation properties and the application in data mining projects.
Depending on the actual direction taken, the work is suitable for studies in mathematics or computer science. Preferably this research should ultimately lead to a Ph.D., but initial work including implementations and literature studies could also be done in an honours project.
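To make the additive form concrete, here is a minimal sketch of fitting such a model to simple numeric attributes by backfitting. It is an illustration under stated assumptions, not the group's implementation: the function name `fit_additive`, the piecewise-constant smoother on equal-width bins, and the synthetic data are all hypothetical choices. Each sweep touches every record once per attribute, so the work grows linearly with the number of records.

```python
import math
import random


def fit_additive(X, y, n_bins=20, n_sweeps=10):
    """Backfit f(x1,...,xn) = mean + f1(x1) + ... + fn(xn).

    X is a list of records (each a list of n numbers), y a list of targets.
    Each fj is piecewise constant on n_bins equal-width bins; this simple
    smoother is an assumption made for illustration.
    """
    n_records, n_attrs = len(X), len(X[0])
    # Bin every attribute value once (assumes no attribute is constant).
    bins = []
    for j in range(n_attrs):
        col = [row[j] for row in X]
        lo, hi = min(col), max(col)
        bins.append([min(int((v - lo) / (hi - lo) * n_bins), n_bins - 1)
                     for v in col])
    mean = sum(y) / n_records
    f = [[0.0] * n_bins for _ in range(n_attrs)]  # f[j][b]: fj on bin b
    fitted = [mean] * n_records
    for _ in range(n_sweeps):
        for j in range(n_attrs):
            b, fj = bins[j], f[j]
            sums = [0.0] * n_bins
            counts = [0] * n_bins
            # Partial residual: the target minus every component except fj.
            for i in range(n_records):
                sums[b[i]] += y[i] - fitted[i] + fj[b[i]]
                counts[b[i]] += 1
            new_fj = [s / c if c else 0.0 for s, c in zip(sums, counts)]
            # Centre fj over the records so the constant stays in `mean`.
            shift = sum(new_fj[k] for k in b) / n_records
            new_fj = [v - shift for v in new_fj]
            for i in range(n_records):
                fitted[i] += new_fj[b[i]] - fj[b[i]]
            f[j] = new_fj
    return mean, f, fitted


# Synthetic example: the target depends additively on the first two
# attributes only; the third is irrelevant.
random.seed(0)
X = [[random.random() for _ in range(3)] for _ in range(5000)]
y = [math.sin(2 * math.pi * r[0]) + r[1] ** 2 for r in X]
mean, f, fitted = fit_additive(X, y)
rms = math.sqrt(sum((yi - fi) ** 2 for yi, fi in zip(y, fitted)) / len(y))
print(f"RMS residual: {rms:.3f}")
```

Composite attributes, the subject of the project, would need a different smoother for each fj; the backfitting loop itself does not depend on how a component is represented.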
Contact: peter (dot) christen {at} anu [dot] edu {dot} au and Markus Hegland
Data mining applications have to deal with increasingly large data sets and complexity. Only algorithms which scale linearly with data size are feasible for successful data mining applications.
In our group we are developing algorithms for predictive modelling of high dimensional and very large data sets that scale both with the number of data records and, if implemented in parallel, with the number of processors. These algorithms are based on techniques like finite elements, sparse grids, thin plate splines, wavelets, additive models and clustering. Prototype implementations have been developed in Matlab, Python, C and MPI.
This research project will involve the further development of parallel data mining algorithms based on the available mathematical algorithms and prototypes. Besides scalability, other important aspects are data distribution, load balancing and integration into a larger data mining framework currently being developed in our group.
A student working on this project will have access to some of Australia's most powerful parallel computers, namely the 196-processor Pentium Beowulf Linux cluster and the APAC National Facility, a Compaq cluster with 480 Alpha processors.
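One common way to obtain both properties at once is to choose models whose fit depends on the data only through sums: each processor accumulates its sums over its own share of the records, and the partial results are then added together. The sketch below illustrates this for a least-squares straight line. The `Moments` class, the `fit_chunk` helper and the strided list slices standing in for processors are hypothetical; in a real parallel code the merge step would typically be an MPI reduction.

```python
from dataclasses import dataclass


@dataclass
class Moments:
    """Sums needed for the least-squares line y = a + b*x."""
    n: int = 0
    sx: float = 0.0
    sy: float = 0.0
    sxx: float = 0.0
    sxy: float = 0.0

    def update(self, x, y):
        """Fold one record in; constant work per record."""
        self.n += 1
        self.sx += x
        self.sy += y
        self.sxx += x * x
        self.sxy += x * y

    def merge(self, other):
        """Combine the partial sums of two chunks (the reduction step)."""
        return Moments(self.n + other.n, self.sx + other.sx,
                       self.sy + other.sy, self.sxx + other.sxx,
                       self.sxy + other.sxy)

    def solve(self):
        """Solve the 2x2 normal equations for intercept a and slope b."""
        det = self.n * self.sxx - self.sx ** 2
        b = (self.n * self.sxy - self.sx * self.sy) / det
        a = (self.sy - b * self.sx) / self.n
        return a, b


def fit_chunk(records):
    """One pass over a chunk of (x, y) records."""
    m = Moments()
    for x, y in records:
        m.update(x, y)
    return m


# Hypothetical data, dealt out evenly to four "processors".
data = [(float(i), 3.0 + 2.0 * i) for i in range(1000)]
chunks = [data[k::4] for k in range(4)]
total = Moments()
for part in map(fit_chunk, chunks):
    total = total.merge(part)
a, b = total.solve()
print(a, b)
```

The size of the merged object is independent of the number of records, so communication cost depends only on the model size. Dealing the records out in equal strides is the simplest form of load balancing and assumes every record costs the same to process.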
Contact: Ole Nielsen, Markus Hegland and Zuowei Shen
A fundamental issue in data mining is the development of algorithms to extract useful information from very large databases. One important technique is to estimate a smooth surface approximating the data. Such an approximation can be used for visualisation, prediction, or classification purposes. However, the size of data sets tends to grow steadily both in terms of the number of records and the number of attributes. While the former issue requires that algorithms scale linearly with the number of records in order to be feasible, the latter requires the so-called curse of dimensionality to be addressed: the complexity of the smoothing problem grows exponentially with the dimension, so any algorithm computing and storing a smooth surface exactly becomes infeasible for dimensions higher than 4 or 5. Hence, approximate methods that balance accuracy against complexity are needed.
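A short calculation shows why storage alone becomes the bottleneck. The sketch below counts the values needed to tabulate a surface on a regular grid; the resolution of 100 points per axis and the 8 bytes per value are assumptions chosen for illustration.

```python
def full_grid_points(points_per_axis, dim):
    """Number of nodes in a regular grid with equal resolution on each axis."""
    return points_per_axis ** dim


# Assumed: 100 points per axis, one 8-byte floating point value per node.
for dim in range(1, 8):
    n = full_grid_points(100, dim)
    gib = n * 8 / 2 ** 30
    print(f"d={dim}: {n:.1e} values, {gib:.2e} GiB")
```

Under these assumptions four dimensions need 10^8 values (0.8 GB) and five need 10^10 values (80 GB), with each further dimension multiplying the cost by 100.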
Existing technologies for dealing with high dimensional data include neural nets, classification and regression trees, and regression splines. Our group has developed an alternative approach using wavelets, and a prototype written in Matlab has successfully addressed smoothing problems in six dimensions. It works by computing a projection onto spaces of low density formed as a sum of tensor products of multilevel spaces, carefully chosen such that the approximation properties are good for reasonably smooth functions and the algorithmic complexity is reduced significantly.
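The saving from keeping only some tensor products can be illustrated by counting degrees of freedom. The sketch below uses the classical sparse grid construction as a stand-in, which is an assumption: the spaces used in the group's wavelet prototype may be chosen differently. In one dimension the increment space at level l is taken to contribute 2^(l-1) basis functions, and a tensor product of increments with levels (l1,...,ld) is kept only if l1 + ... + ld <= n + d - 1. The function names are hypothetical.

```python
from itertools import product
from math import prod


def full_dof(level, dim):
    """Degrees of freedom when every tensor product up to `level` is kept."""
    return (2 ** level - 1) ** dim


def sparse_dof(level, dim):
    """Degrees of freedom when only products with l1+...+ld <= level+dim-1
    are kept (classical sparse grid selection rule, assumed here)."""
    total = 0
    for levels in product(range(1, level + 1), repeat=dim):
        if sum(levels) <= level + dim - 1:
            # A tensor product of increments has the product of their sizes.
            total += prod(2 ** (l - 1) for l in levels)
    return total


for dim in (1, 2, 4, 6):
    print(f"d={dim}: full={full_dof(5, dim)}, sparse={sparse_dof(5, dim)}")
```

In one dimension the two counts coincide; as the dimension increases, the selected sum of tensor products grows far more slowly than the full tensor product space.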
This project will involve the further development of these ideas based on the available theory and prototypes. Issues to be addressed include (but are not limited to) the following: