ANU Data Mining Group

Parallel Large Scale Techniques for High-Performance Record Linkage



Contact and Mailing List:

Contact: peter (dot) christen {at} anu [dot] edu {dot} au

We have created two mailing lists at Sourceforge.Net. For more details see Sourceforge Febrl Mailing Lists. If you are interested in this project, please subscribe and we will inform you of software releases, publications and other outputs of this project.

Research Team:

Project Description:

Record or data linkage techniques are used to link together records which relate to the same entity (e.g. patient, customer, household) in one or more data sets where a unique identifier for each entity is not available in all or any of the data sets to be linked.

Record linkage is an important initial step in many research and data mining projects in the biomedical and other sectors, where it is used to improve data quality and to assemble longitudinal or other data sets which would not otherwise be available.

The ANU Data Mining Group is currently working in collaboration with the Centre for Epidemiology and Research at the NSW Department of Health on the improvement of record linkage techniques and software. We are particularly interested in advancing the development of two aspects of record linkage:

We have started developing prototype software which undertakes data standardisation, an essential pre-processing phase for most record linkage projects, and which implements the "classical" probabilistic record linkage model described by Fellegi and Sunter (I. Fellegi and A. Sunter, A theory for record linkage. Journal of the American Statistical Association, 1969) and subsequently extended by others. We hope that this prototype software will be of immediate use to biomedical and other researchers.
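As a rough illustration of the Fellegi and Sunter approach mentioned above, the following minimal Python sketch sums log-ratio agreement and disagreement weights over a few fields and then classifies the record pair using two thresholds. It is not taken from the prototype software: the field names, the m- and u-probabilities, the threshold values and the function names `match_weight` and `classify` are all assumptions chosen for illustration.

```python
import math

# Hypothetical per-field probabilities (illustrative values only):
#   m = P(field agrees | the two records refer to the same entity)
#   u = P(field agrees | the two records refer to different entities)
FIELD_PROBS = {
    "surname": (0.95, 0.01),
    "given_name": (0.90, 0.05),
    "postcode": (0.85, 0.10),
}


def match_weight(rec_a, rec_b, field_probs=FIELD_PROBS):
    """Sum the log2 agreement or disagreement weight of each compared field."""
    total = 0.0
    for field, (m, u) in field_probs.items():
        if rec_a.get(field) == rec_b.get(field):
            total += math.log2(m / u)  # agreement weight (positive)
        else:
            total += math.log2((1.0 - m) / (1.0 - u))  # disagreement weight (negative)
    return total


def classify(weight, lower=0.0, upper=8.0):
    """Map a total weight to one of three classes using two assumed thresholds."""
    if weight >= upper:
        return "link"
    if weight <= lower:
        return "non-link"
    return "possible link"  # would be passed on to clerical review


rec_a = {"surname": "smith", "given_name": "anne", "postcode": "2600"}
rec_b = {"surname": "smith", "given_name": "ann", "postcode": "2601"}
print(classify(match_weight(rec_a, rec_b)))
```

In practice the m- and u-probabilities and the two thresholds would be estimated from the data rather than fixed by hand, and field comparisons would usually be approximate rather than exact.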

We plan to use that software as a platform for exploring various parallel computing and machine learning techniques. To our knowledge, no parallel implementation of probabilistic record linkage is currently available. Issues to be explored include data distribution, blocking techniques, parallel pre-processing and load balancing. Although a number of machine learning and other classification techniques have been applied to the record linkage problem over the last few years, no-one has yet focused on using these techniques to reduce or eliminate the time-consuming and tedious manual clerical review process which is needed to decide the status of possible or doubtful links between records.
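To make the idea of blocking concrete, here is a minimal Python sketch of standard blocking: records are grouped by a blocking key and only records within the same block are compared. The key definition (first letter of the surname plus the postcode), the field names and the function names are assumptions for illustration and do not describe the prototype software.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    """Hypothetical blocking key: first letter of the surname plus the postcode."""
    return (record["surname"][:1].upper(), record["postcode"])


def candidate_pairs(records):
    """Return index pairs of records that share a blocking key."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[blocking_key(rec)].append(idx)

    pairs = []
    for members in blocks.values():
        # Only records inside the same block are compared with each other.
        pairs.extend(combinations(members, 2))
    return pairs


records = [
    {"surname": "Smith", "postcode": "2600"},
    {"surname": "Jones", "postcode": "2600"},
    {"surname": "Smyth", "postcode": "2600"},
    {"surname": "Smith", "postcode": "2000"},
]
print(candidate_pairs(records))
```

Comparing every record with every other would give six pairs for these four records, while blocking leaves only the pairs that share a key. Because blocks are independent of each other, they are also a natural unit for distributing comparisons across processors, although uneven block sizes raise the load balancing issue noted above.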

The prototype software is published under a free, open source software license in order to promote collaboration and to encourage others to contribute to the development and maintenance of the software. The tools we are using are also all free, open source software, namely the object-oriented programming language Python and associated extension libraries.

To gain an appreciation of the wide range of uses of record linkage in biomedical research, perform a PubMed search for the term "Medical Record Linkage".



Prototype Software:

The prototype software Febrl (Freely extensible biomedical record linkage) is hosted on Sourceforge.Net and can be downloaded from:

Information on the current (bold font) and past releases is also available locally:

Febrl is licensed under the ANU Open Source License; please see the release pages and the Febrl Sourceforge page for more details.
