One of the main research projects the ANU Data Mining Group is currently working on is Parallel Techniques for High-Performance Record Linkage. This project is supported by and conducted in collaboration with the NSW Department of Health.
In October 2003 the Australian Research Council (ARC) approved our application "Investigation and Development of Parallel Large Scale Record Linkage Techniques" for a linkage grant, which allows us to offer this APAI PhD scholarship.
Project Background and Aims
Historical collections of administrative data and transactional databases contain hundreds of millions of records, with millions of records added per annum. Examples of such data collections occur in credit card and insurance administration, census, taxation, the health sector, police/intelligence and telecommunications. Data mining techniques are increasingly used to analyse such large data sets. Often information from multiple data sources needs to be combined or linked in order to allow more detailed analysis. The aim of such linkages is to merge all records relating to the same entity, such as a customer or patient. Most of the time the linkage process is challenged by the lack of a common unique identifier, and thus becomes non-trivial.
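To make the difficulty concrete, the following is a minimal sketch of linking two small record sets that share no unique identifier. It is an illustration only and not the method used in Febrl: the field names (given, surname, dob), the sample records, the 0.85 threshold and the helper functions are all assumptions made for this example. Names are compared approximately using Python's standard difflib module, and the date of birth must agree exactly.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Approximate string similarity between 0.0 and 1.0 (case-insensitive)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def link_records(left, right, threshold=0.85):
    """Return (left_index, right_index, score) for record pairs judged to match.

    Hypothetical rule for this sketch: the date of birth must be identical
    and the mean similarity of given name and surname must reach the threshold.
    """
    links = []
    for i, rec_a in enumerate(left):
        for j, rec_b in enumerate(right):
            if rec_a["dob"] != rec_b["dob"]:
                continue
            score = (similarity(rec_a["given"], rec_b["given"])
                     + similarity(rec_a["surname"], rec_b["surname"])) / 2
            if score >= threshold:
                links.append((i, j, score))
    return links


# Made-up sample data: the same two people, spelled differently in each source.
left = [
    {"given": "Catherine", "surname": "Smith", "dob": "1970-01-02"},
    {"given": "John", "surname": "Miller", "dob": "1985-06-30"},
]
right = [
    {"given": "Katherine", "surname": "Smith", "dob": "1970-01-02"},
    {"given": "Jon", "surname": "Miller", "dob": "1985-06-30"},
    {"given": "Peter", "surname": "Jones", "dob": "1960-03-03"},
]

links = link_records(left, right)
for i, j, score in links:
    print(left[i]["given"], left[i]["surname"], "<->",
          right[j]["given"], right[j]["surname"], f"(score {score:.2f})")
```

The nested loop compares every record in one source with every record in the other, so the number of comparisons grows with the product of the two data set sizes.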
The major challenges in record linkage are of a computational nature. Current experience suggests that commonly used record linkage techniques require processing times in the order of days or weeks, with the manual clerical review of possible links being especially time-consuming, tedious and labour-intensive. We aim to address the following four challenge areas of record linkage.
We will develop testbeds based on our already implemented prototype software Febrl, which is written in the open-source programming language Python. Python is well suited to rapid prototyping because of the wide range of modules available for it (including modules for numerical computations, statistics, database access and parallel computing).
A successful PhD applicant will work on various aspects of this project, including improved algorithms and techniques for blocking (clustering) and classification for record linkage (with a view to reducing or eliminating the clerical review of possible links), and on parallel implementations of these techniques. The following techniques and methods will be used:
It is planned that the PhD student will spend around three one-week visits per year at the NSW Department of Health Centre for Epidemiology and Research. These visits will provide experience of working with experts in record linkage in an industry setting, and an opportunity to explore and evaluate the newly developed algorithms and techniques on various in-house health data sets.
The student will be able to attend and present the work at major conferences and workshops in data mining, data preprocessing and management, and record linkage.
This PhD scholarship will allow the student to gain expertise in areas ranging from computational mathematics through parallel computing and data mining to practical record linkage. In the first year the student will attend lectures at the ANU providing foundations in the underlying computer science, machine learning and mathematics. The student will also receive an introduction to practical record linkage while visiting the NSW Department of Health.
The offered PhD scholarship will start in early 2004 for a duration of 3 years (extendable to 3 1/2 years). The rate of pay will be around A$ 23,000 per year.
Note: Applicants must be Australian citizens or hold a valid Australian permanent residency visa.
Please contact Peter Christen for more information.
Phone: ++61 (02) 6125 5690
Fax: ++61 (02) 6125 0010