Each record describes an individual person: first name, family name, sex, date of birth and postal code. The records were collected through iterative insertions over several years. The comparison patterns in this data set are based on a sample of 100,000 records dating from 2005 to 2008. Record pairs were classified as "match" or "non-match" in an extensive manual review involving several documentarists.
The resulting classification formed the basis for assessing the quality of the registry’s own record linkage procedure.
To limit the number of comparison patterns, a blocking procedure was applied that selects only record pairs meeting specific agreement conditions. The results of the following six blocking iterations were merged:
- Phonetic equality of first name and family name, equality of date of birth.
- Phonetic equality of first name, equality of day of birth.
- Phonetic equality of first name, equality of month of birth.
- Phonetic equality of first name, equality of year of birth.
- Equality of complete date of birth.
- Phonetic equality of family name, equality of sex.
This procedure resulted in 5,749,132 record pairs, of which 20,931 are matches.
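The six blocking passes and their merge can be illustrated with a minimal Python sketch. It is not the registry's implementation: the description does not say which phonetic code was used, so `phonetic_key` is a hypothetical toy stand-in, and the record layout (`first`, `family`, `sex`, `dob` as a `(year, month, day)` tuple) is assumed for illustration only.

```python
from collections import defaultdict
from itertools import combinations


def phonetic_key(name):
    """Hypothetical toy phonetic code (assumption, not the registry's):
    keep the first letter, drop later vowels, skip a letter that equals
    the last kept one."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    key = [letters[0]]
    for c in letters[1:]:
        if c in "AEIOUY":
            continue
        if c != key[-1]:
            key.append(c)
    return "".join(key)


# One blocking-key function per pass, in the order of the list above.
# dob is assumed to be a (year, month, day) tuple.
BLOCKING_PASSES = [
    lambda r: (phonetic_key(r["first"]), phonetic_key(r["family"]), r["dob"]),
    lambda r: (phonetic_key(r["first"]), r["dob"][2]),  # day of birth
    lambda r: (phonetic_key(r["first"]), r["dob"][1]),  # month of birth
    lambda r: (phonetic_key(r["first"]), r["dob"][0]),  # year of birth
    lambda r: r["dob"],                                 # complete date of birth
    lambda r: (phonetic_key(r["family"]), r["sex"]),
]


def candidate_pairs(records):
    """Return the union of record-index pairs produced by all passes."""
    pairs = set()
    for key_of in BLOCKING_PASSES:
        buckets = defaultdict(list)
        for idx, rec in enumerate(records):
            buckets[key_of(rec)].append(idx)
        # Every two records sharing a blocking key form a candidate pair.
        for members in buckets.values():
            pairs.update(combinations(members, 2))
    return pairs


# Made-up example records, not taken from the data set.
records = [
    {"first": "Anna", "family": "Meyer", "sex": "f", "dob": (1970, 5, 12)},
    {"first": "Ana", "family": "Meier", "sex": "f", "dob": (1970, 5, 12)},
    {"first": "Peter", "family": "Schmidt", "sex": "m", "dob": (1982, 11, 3)},
    {"first": "Petra", "family": "Lang", "sex": "f", "dob": (1982, 1, 30)},
]
pairs = candidate_pairs(records)
```

Collecting the pairs in a set means a pair found by several passes is counted only once, which is what merging the six iterations implies. Only the pairs in this union would then be compared field by field.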
The data set is split into 10 blocks, each with (approximately) the same size and the same ratio of matches to non-matches.
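The description does not say how the 10 blocks were formed. One possible way to obtain blocks of similar size and similar match ratio is a stratified round-robin assignment, sketched below in Python. The function name `stratified_blocks` and its parameters are illustrative assumptions.

```python
import random


def stratified_blocks(labels, n_blocks=10, seed=0):
    """Distribute pair indices over n_blocks so that matches and
    non-matches are each spread evenly (illustrative sketch only)."""
    rng = random.Random(seed)
    blocks = [[] for _ in range(n_blocks)]
    for wanted in (True, False):
        # Handle matches and non-matches separately (the two strata).
        idxs = [i for i, is_match in enumerate(labels) if is_match == wanted]
        rng.shuffle(idxs)
        # Deal the shuffled indices out round-robin across the blocks.
        for pos, i in enumerate(idxs):
            blocks[pos % n_blocks].append(i)
    return blocks


# Made-up labels: 20 matches among 1,000 pairs.
labels = [True] * 20 + [False] * 980
blocks = stratified_blocks(labels)
```

Applied to the counts given above, such a split would put roughly 574,913 pairs and about 2,093 matches in each block.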
Record Linkage Comparison Patterns Data Set at UCI