A note on the F-measure for evaluating record linkage algorithms (and classification methods and information retrieval systems)

56 mins 34 secs,  343.42 MB,  WebM  640x360,  29.97 fps,  44100 Hz,  828.9 kbits/sec

About this item
Description: Hand, D (Imperial College London) - Christen, P (Australian National University)
Thursday 8th September 2016 - 15:30 to 16:30
 
Created: 2016-09-30 14:40
Collection: Data Linkage and Anonymisation
Publisher: Isaac Newton Institute
Copyright: Hand, D - Christen, P
Language: eng (English)
Distribution: World     (downloadable)
Explicit content: No
Aspect Ratio: 16:9
Screencast: No
 
Abstract: Record linkage is the process of identifying and linking records about the same entities from one or more databases. When applied to a single database, the process is known as deduplication. Record linkage can be viewed as a classification problem where the aim is to decide whether a pair of records is a match (the two records refer to the same real-world entity) or a non-match (the two records refer to two different entities). Various classification techniques, including supervised, unsupervised, semi-supervised, and active-learning-based approaches, have been employed for record linkage. If ground truth data in the form of known true matches and non-matches are available, the quality of the classified links can be evaluated. Because class imbalance is generally high in record linkage problems, standard measures such as accuracy or misclassification rate are not meaningful for assessing the quality of a set of linked records. Instead, precision and recall, as commonly used in information retrieval, are used. These are often combined into the popular F-measure, which is normally presented as the harmonic mean of precision and recall. We show that the F-measure can be expressed as a weighted sum of precision and recall, with weights that depend on the linkage method being used. This reformulation reveals a major conceptual weakness of the measure: the relative importance assigned to precision and recall should be an aspect of the problem and the user, not of the particular instrument being used. We suggest alternative measures that do not suffer from this fundamental flaw.
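The weighted-sum view of the F-measure can be illustrated with a short sketch. It assumes the standard definitions of precision P = TP/(TP+FP) and recall R = TP/(TP+FN), and uses the algebraic identity F = wP + (1-w)R with w = (TP+FP)/((TP+FP)+(TP+FN)). The function names and the counts for the two linkage methods are hypothetical and are not taken from the talk; the sketch also assumes at least one true positive, so that no denominator is zero.

```python
from fractions import Fraction


def precision_recall_f(tp, fp, fn):
    """Return (precision, recall, F-measure) as exact fractions.

    Assumes tp > 0 so that no denominator is zero.
    """
    precision = Fraction(tp, tp + fp)
    recall = Fraction(tp, tp + fn)
    # Usual presentation: harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


def f_as_weighted_sum(tp, fp, fn):
    """Return (w, w*P + (1-w)*R), where w is the weight on precision.

    w is the number of predicted matches divided by the sum of
    predicted matches and true matches, so it changes with the
    classifier's output, not with the user's preferences.
    """
    predicted_matches = tp + fp
    true_matches = tp + fn
    w = Fraction(predicted_matches, predicted_matches + true_matches)
    precision, recall, _ = precision_recall_f(tp, fp, fn)
    return w, w * precision + (1 - w) * recall


# Two hypothetical linkage methods evaluated on the same ground truth
# (100 true matches in both cases).
methods = {
    "method A": (80, 20, 20),    # tp, fp, fn
    "method B": (60, 240, 40),
}

for name, (tp, fp, fn) in methods.items():
    p, r, f = precision_recall_f(tp, fp, fn)
    w, weighted = f_as_weighted_sum(tp, fp, fn)
    assert weighted == f  # both formulations agree exactly
    print(f"{name}: P={float(p):.2f} R={float(r):.2f} "
          f"F={float(f):.2f} weight on precision={float(w):.2f}")
```

With these illustrative counts, method A places a weight of 0.50 on precision while method B places 0.75 on it, even though both are scored against the same ground truth. The weights differ only because the two methods declare different numbers of matches.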
Available Formats
Format        Quality   Bitrate           Size
MPEG-4 Video  640x360   1.9 Mbits/sec     809.09 MB
WebM *        640x360   828.9 kbits/sec   343.42 MB
iPod Video    480x270   487.25 kbits/sec  201.82 MB
MP3           44100 Hz  249.79 kbits/sec  103.55 MB