News:
We have created two mailing lists at Sourceforge.Net. For more details see Sourceforge Febrl Mailing Lists. If you are interested in this project please subscribe and we will inform you of software releases, publications and other output of this project.
Record linkage is an important initial step in many research and data mining projects in the biomedical and other sectors, where it is used to improve data quality and to assemble longitudinal or other data sets which would not otherwise be available.
The ANU Data Mining Group is currently working in collaboration with the Centre for Epidemiology and Research at the NSW Department of Health on the improvement of record linkage techniques and software. We are particularly interested in advancing the development of two aspects of record linkage:
We have started developing prototype software which undertakes data standardisation, an essential pre-processing phase for most record linkage projects, and which implements the "classical" probabilistic record linkage model described by Fellegi and Sunter (I. Fellegi and A. Sunter, A theory for record linkage. Journal of the American Statistical Association, 1969) and subsequently extended by others. We hope that this prototype software will be of immediate use to biomedical and other researchers.
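As a rough illustration of the classical Fellegi and Sunter approach mentioned above, the following minimal sketch sums per-field agreement and disagreement weights and classifies a record pair against two thresholds. It is not taken from the prototype software: the function names, field names, m- and u-probabilities and thresholds are all assumptions chosen for illustration.

```python
import math


def field_weight(agrees, m, u):
    """Log2 weight contributed by one field.

    m: assumed probability that the field agrees for a true match.
    u: assumed probability that the field agrees for a true non-match.
    """
    if agrees:
        return math.log2(m / u)
    return math.log2((1.0 - m) / (1.0 - u))


def match_weight(rec_a, rec_b, params):
    """Sum the field weights over all fields listed in params."""
    total = 0.0
    for field, (m, u) in params.items():
        agrees = rec_a.get(field) == rec_b.get(field)
        total += field_weight(agrees, m, u)
    return total


def classify(weight, lower, upper):
    """Three-way decision: link, non-link, or possible link (clerical review)."""
    if weight >= upper:
        return "link"
    if weight <= lower:
        return "non-link"
    return "possible link"


# Illustrative (made-up) parameters: field -> (m, u)
PARAMS = {"surname": (0.9, 0.01), "postcode": (0.8, 0.05)}

a = {"surname": "smith", "postcode": "2600"}
b = {"surname": "smith", "postcode": "2600"}
c = {"surname": "jones", "postcode": "2000"}

print(classify(match_weight(a, b, PARAMS), lower=0.0, upper=8.0))
print(classify(match_weight(a, c, PARAMS), lower=0.0, upper=8.0))
```

Pairs whose total weight falls between the two thresholds are the "possible links" that would normally go to clerical review. In practice the m- and u-probabilities are estimated from data rather than set by hand.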
We plan to use that software as a platform for exploring various parallel computing and machine learning techniques. To our knowledge, no parallel implementation of probabilistic record linkage is currently available. Issues to be explored include data distribution, blocking techniques, parallel preprocessing and load balancing. Although a number of machine learning and other classification techniques have been applied to the record linkage problem over the last few years, no one has yet focused on using these techniques to reduce or eliminate the time-consuming and tedious manual clerical review process which is needed to decide the status of possible or doubtful links between records.
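To make the idea of blocking concrete, here is a minimal sketch, assuming a simple in-memory list of records and a hypothetical blocking key (first letter of the surname plus postcode). It only generates candidate pairs within each block, which is what reduces the number of comparisons. It does not reflect how the prototype software implements blocking.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    """Hypothetical blocking key: surname initial plus postcode."""
    return (record["surname"][:1].lower(), record["postcode"])


def candidate_pairs(records):
    """Yield index pairs of records that share a blocking key."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[blocking_key(rec)].append(idx)
    for members in blocks.values():
        # Only records inside the same block are compared with each other.
        yield from combinations(members, 2)


records = [
    {"surname": "Smith", "postcode": "2600"},
    {"surname": "Smyth", "postcode": "2600"},
    {"surname": "Jones", "postcode": "2600"},
    {"surname": "Smith", "postcode": "2000"},
]

pairs = sorted(candidate_pairs(records))
print(pairs)
```

Because each block can be processed independently, a scheme like this is also a natural unit for distributing work across processors, although load balancing becomes an issue when block sizes differ widely.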
The prototype software is published under a free, open source software license in order to promote collaboration and to encourage others to contribute to the development and maintenance of the software. The tools we are using are also all free, open source software, namely the object-oriented programming language Python and associated extension libraries.
To gain an appreciation of the wide range of uses of record linkage in biomedical research, perform a PubMed search for the term "Medical Record Linkage".
The linkage of records which refer to the same entity in separate data collections is a common requirement in public health and biomedical research. Traditionally, record linkage techniques have required that all the identifying data in which links are sought be revealed to at least one party, often a third party (i.e. the organisation performing the linkage). This necessarily invades personal privacy and requires complete trust in the intentions of that party and their ability to maintain security and confidentiality.
In this project we aim to develop techniques for blindfolded record linkage, based on secure one-way hash transformations and n-gram (e.g. bigram) scores (which permit the calculation of a general similarity measure between strings), without having to reveal the data being compared, albeit at some cost in computation and data communication. These techniques can be combined with public key cryptography and automatic estimation of linkage model parameters to create an overall system for blindfolded record linkage.
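The following sketch illustrates the general idea of comparing strings through one-way hashed bigrams. It is a deliberately simplified illustration and not the project's protocol: the keyed hash (HMAC-SHA-256), the shared secret key and the Dice coefficient as the similarity measure are assumptions made for this example.

```python
import hashlib
import hmac


def bigrams(s):
    """Return the set of character bigrams of a lower-cased string."""
    s = s.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}


def hashed_bigrams(s, key):
    """Replace each bigram by a keyed one-way hash (key is an assumed shared secret)."""
    return {
        hmac.new(key, bg.encode("utf-8"), hashlib.sha256).hexdigest()
        for bg in bigrams(s)
    }


def dice(set_a, set_b):
    """Dice coefficient of two sets: 2 * |A & B| / (|A| + |B|)."""
    if not set_a and not set_b:
        return 0.0
    return 2.0 * len(set_a & set_b) / (len(set_a) + len(set_b))


KEY = b"illustrative shared secret"  # assumption: agreed between the data holders

# Each party hashes its own value; only the hashes are compared.
party_a = hashed_bigrams("smith", KEY)
party_b = hashed_bigrams("smyth", KEY)

print(dice(party_a, party_b))
```

The score computed on the hashes equals the score that would be obtained on the plain bigrams, because equal bigrams map to equal hashes. Note that hashing individual bigrams in this naive way is open to frequency analysis, so a real protocol needs additional safeguards.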
For more details please see our publications.
It is estimated that between 80% and 90% of governmental and business data collections contain address information. Geocoding, the process of assigning geographic co-ordinates to addresses, is becoming increasingly important in many application areas that involve the analysis and mining of such data. In many cases, address records are captured and/or stored in a free-form or inconsistent manner, which complicates the task of robustly matching such addresses to a spatially-annotated reference file.
The aim of this project is to develop geocoding techniques based on our data cleaning, standardisation and linkage matching techniques. The geocoded reference file used is the Australian Geocoded National Address File (G-NAF), a comprehensive high-quality geocoded national address database.
For more details please see our publications.
The prototype software Febrl (Freely extensible biomedical record linkage) is hosted on Sourceforge.Net and can be downloaded from:
Information on the current (bold font) and past releases is also available locally:
Febrl is licensed under the ANU Open Source License; please see the release pages for more details.
Please see the Febrl Sourceforge page for more details.