Research Projects
The ANU Data Mining Group specialises in developing
computational approaches to data mining that can deal with
large-scale, high-dimensional data sets. We apply our
technologies in real-world consultancies. Technology development
includes scalable parallel algorithms, handling of large, complex
data sets, and scripting tools for automating routine
tasks. We have consultancy experience in health (Australian
Medicare data), astronomy and chemistry. The main research focus
of the group is on predictive modelling, in particular
additive models, high-dimensional wavelet smoothing and time
series analysis.
We have access to several high-performance computing platforms:
- Our main platform for research is a Sun Enterprise
450 server with four processors and 4 Gigabytes of
main memory.
- For scalable parallel algorithm development we are
using the ANU Bunyip, a Beowulf-style Linux cluster
with 196 Pentium processors. This distributed memory
computer has a total of 36 Gigabytes of main memory and
1.3 Terabytes of disk space.
- We are also using the APAC National Facility, a Compaq cluster
with 480 Alpha processors.
- For confidential data mining consultancies we have
access to a secure facility at the
CSIRO CMIS that consists of a 12-processor
Sun Enterprise 4500 shared memory
multiprocessor with 4.75 Gigabytes of main memory and
a 256 Gigabyte RAID disk array.
Data Mining is one of 13 APAC expertise sub-programs located at the
National Facility on the ANU campus. APAC's vision is to underpin
significant achievements in Australian research, education
and technology diffusion by sustaining an advanced computing
capability ranked in the top 10 countries. The expertise
sub-program in Data Mining aims at developing innovative data
mining techniques and customised software that are fully
scalable and applied to a number of areas. Particular areas
considered initially are astronomy, administrative health
data, and proteomics.
Research Projects
- Parallel Techniques for High-Performance Record Linkage
Record linkage (also called Database Matching or
Data Cleansing) is used to merge data collections.
Probabilistic techniques have to be applied if no common
identifier is available. This project is conducted in
collaboration with the
NSW Health Department, Epidemiology and
Surveillance Branch, which also funds this research.
The aim of the project is to develop high-performance
techniques for probabilistic record linkage of
administrative health data collections.
- Memory Performance of KDD Applications
Many data mining applications have irregular memory
access patterns due to their complex and recursive
data structures. This results in low performance on
modern high-performance platforms compared to scientific
or engineering applications. This research project
aims to analyse the performance behaviour of data mining
applications.
- Predictive Modelling with Sparse Grids
- A Flexible and Efficient Toolbox For Data Mining
The DMtools are
an efficient and flexible toolbox for common tasks in data
mining, written in the open source programming language
Python.
They can be used for all kinds of tasks in data mining,
starting from data analysis and preprocessing up to
visualisation and report generation.
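As a rough illustration of the probabilistic record linkage described in the first project above, the following Python sketch scores a pair of records by summing per-field agreement and disagreement weights, in the style of the classical Fellegi-Sunter approach. It is not the project's implementation: the field names, the m and u probabilities, and the two thresholds are all hypothetical values chosen for illustration.

```python
import math

# Hypothetical per-field probabilities (illustrative values only):
#   m = P(field agrees | both records refer to the same person)
#   u = P(field agrees | records refer to different people)
FIELD_PROBS = {
    "surname": (0.95, 0.01),
    "birth_year": (0.98, 0.02),
    "postcode": (0.90, 0.05),
}


def match_weight(rec_a, rec_b, field_probs=FIELD_PROBS):
    """Sum log2 agreement/disagreement weights over the compared fields."""
    total = 0.0
    for field, (m, u) in field_probs.items():
        a, b = rec_a.get(field), rec_b.get(field)
        if a is None or b is None:
            continue  # a missing value contributes no evidence either way
        if a == b:
            total += math.log2(m / u)  # agreement raises the weight
        else:
            total += math.log2((1.0 - m) / (1.0 - u))  # disagreement lowers it
    return total


def classify(weight, lower=0.0, upper=8.0):
    """Map a weight to a decision using two assumed thresholds."""
    if weight >= upper:
        return "match"
    if weight <= lower:
        return "non-match"
    return "possible match"


if __name__ == "__main__":
    a = {"surname": "smith", "birth_year": 1970, "postcode": "2600"}
    b = {"surname": "smith", "birth_year": 1971, "postcode": "2600"}
    w = match_weight(a, b)
    print(round(w, 2), classify(w))
```

Pairs that fall between the two thresholds would normally go to clerical review. In practice the m and u probabilities are estimated from the data rather than fixed by hand, and candidate pairs are first reduced by blocking, which is where high-performance and parallel techniques become relevant.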