Algorithm Design

Overview

The MPI (Master Patient Index) system uses a multi-pass record linkage algorithm to identify whether incoming patient records belong to existing person clusters in the database. The algorithm employs blocking keys to efficiently reduce the search space, then uses fuzzy matching with log-odds scoring to evaluate potential matches across multiple evaluation passes.

Core Components

Blocking Strategy

The algorithm uses blocking keys to dramatically reduce the number of comparisons needed. Rather than comparing every incoming record against every existing record, blocking groups records that share certain simplified characteristics. Some example blocking keys include birthdate, zipcode, and first 4 characters of first name (for a full list of blocking keys see Algorithm Configuration).

The algorithm uses a subset of these keys from the incoming record to quickly find potential matches against existing patients in the MPI, matching on their blocking values and including any patient records in the Person clusters of those blocked records.

Record Evaluation with Log-Odds Scoring

The algorithm compares the incoming patient record against patient records that share the same blocking values, as well as patient records in the Person clusters of those that share the same blocking values which are missing data in the fields used for blocking (such records aren't blocked on individually because of this missingness, so including them after the fact allows them to be considered).

For each Person cluster found during blocking, the system computes an aggregated log-odds sum measuring the median number of log-odds points the incoming record scored in comparison to each Patient record in the cluster. These sums are tracked across the algorithm and are used to compute a Relative Match Score (RMS), the percentage of total possible log-odds points the incoming record scored during evaluation. These RMS values are used in conjunction with minimum match and certain match thresholds supplied by the user to make final linkage decisions.

Scoring Process

Field Comparison: Each evaluation field is compared between the incoming record and candidate records using fuzzy matching algorithms
Threshold Application: Only similarity scores above the configured fuzzy threshold for each field contribute points (scores below threshold contribute 0 points)
Weight Application: Qualifying similarity scores are multiplied by pre-configured log-odds weights specific to each field
Cluster Aggregation: The median log-odds sum across all patient records in a cluster determines the cluster's overall score
RMS Calculation: The cluster's RMS is calculated as the median log-odds sum divided by total possible evaluation points for the pass

Missing Data Handling

The algorithm includes sophisticated handling for incomplete records. These strategies recognize that purely missing data doesn't indicate that two records aren't for the same individual—rather, we might not have enough information to make that determination. These strategies are used in both the blocking and evaluation phases of the algorithm:

Blocking with Missing Data: Patient records missing blocking fields are still considered if they belong to Person clusters that contain records matching the blocking criteria
Evaluation with Missing Data: Missing fields receive partial credit based on the missing_field_points_proportion parameter (typically 0.5, meaning 50% of possible points)
Missingness Threshold: If more than max_missingness_allowed_proportion (typically 0.5) of evaluation points come from missing fields, the comparison is skipped entirely to avoid decisions based on insufficient data

Match Grading

For each Person cluster found, a match grade is assigned based on where that cluster's RMS falls in relation to the provided thresholds:

If RMS < minimum match threshold, the cluster is graded as certainly-not a match
If RMS ≥ certain match threshold, the cluster is graded as a certain match
Otherwise, if the RMS is between the two thresholds, the cluster is graded as a possible match

Algorithm Workflow

1. Multi-Pass Processing

The algorithm executes multiple passes, each with different blocking and evaluation strategies designed to catch different types of matches. Each pass follows this workflow:

Blocking Phase

Generate blocking keys from the incoming record based on the pass configuration
Find all existing Patient records with matching blocking values
Include Patient records from Person clusters of blocked records, even if they have missing blocking data
Return the set of candidate Person clusters for evaluation

Evaluation Phase

For each candidate Person cluster, compare the incoming record against each matched Patient record in the cluster
Apply the scoring process described above to calculate log-odds sums
Calculate the cluster's RMS and assign a match grade
Track results for cross-pass aggregation

2. Cross-Pass Aggregation

Clusters appearing in multiple passes retain their highest RMS score
Match grades are updated if a higher grade is achieved in any pass
Final results aggregate the best performance across all passes

3. Final Decision Logic

The algorithm makes final linkage decisions based on aggregated results:

Automatic Linking: If at least one cluster is graded as certain, the certain cluster with the highest RMS is returned as the result, and a link is formed between that cluster and the incoming record.

Manual Review Required: If no clusters grade as certain but at least one cluster grades as possible, no links are made automatically, but the possible clusters are returned to the user for manual evaluation.

New Person Creation: If no clusters grade better than certainly-not, a new Person cluster is created for the incoming record.