Monday, January 29, 2007

Record Linkage Graphical User Interfaces

Manual handling of duplicates in a database can be quite time consuming. It is important to find the right tool to help you speed up this process. And if you are building your own record linkage tool, it is always good to see what is out there, how GUIs are laid out and what the common features are.

The Link King:

The Link King’s graphical user interface (GUI) makes record linkage and unduplication easy for beginning and advanced users. The data linking neophyte will appreciate the easy-to-follow instructions. The Link King's artificial intelligence will assist in the selection of the most appropriate linkage/unduplication protocol.

Linkage Wiz:

LinkageWiz is a powerful data matching and record de-duplication software program used by businesses, government agencies, hospitals and other organisations in the USA, Canada, UK, Australia and France. It makes it easy to link records across multiple databases and to identify and remove duplicate records within databases.

TAILOR:
TAILOR is extensible, and hence any proposed searching method, comparison function, decision model, or measurement tool can be easily plugged into the system. We have proposed three machine learning record linkage models that overcome the limitations of the existing record linkage models. Our extensive experimental study, using both synthetic and real data, shows that the machine learning record linkage models outperform the probabilistic record linkage model with respect to the accuracy and completeness metrics; that the probabilistic record linkage model identifies a smaller percentage of possibly matched record pairs; that both the clustering and the hybrid record linkage models are very useful, especially in real applications where training sets are not available or are very expensive to obtain; and that Jaro's algorithm performs better than the other comparison functions.

The following three screen snapshots show the basic screens of the TAILOR graphical user interface. The first screen allows the user to either generate a synthetic experiment using DBGen, perform a real experiment on a database, or repeat a previous experiment from its data files. The second screen lets the user select a searching method and a comparison function and tune their parameters. Finally, the third screen allows the user to select the decision model to apply, and it outputs the values of the measures if the experiment is on synthetic data.

Download PDF for screenshots of TAILOR here.
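
Since the TAILOR study singles out Jaro's comparison function, here is a minimal Python sketch of the Jaro similarity (a textbook formulation of the algorithm, not TAILOR's own code):

def jaro(s1, s2):
    """Jaro string similarity in [0, 1]; 1.0 means identical."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if not len1 or not len2:
        return 0.0
    window = max(len1, len2) // 2 - 1
    match1, match2 = [False] * len1, [False] * len2
    matches = 0
    for i, ch in enumerate(s1):  # find characters matching within the window
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not match2[j] and s2[j] == ch:
                match1[i] = match2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    transpositions, k = 0, 0  # count matched characters that are out of order
    for i in range(len1):
        if match1[i]:
            while not match2[k]:
                k += 1
            if s1[i] != s2[k]:
                transpositions += 1
            k += 1
    t = transpositions // 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3

print(round(jaro("MARTHA", "MARHTA"), 3))  # 0.944, the classic example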

matchIT:
matchIT incorporates our proprietary matching algorithms to ensure phonetic, miskeyed and abbreviated variations of data are detected. Results can be verified using comprehensive data auditing functions, drilling down to suspect data, identifying data anomalies, and filtering garbage and salacious words.

Fuzzy Dupes 2007:

Did you know that your contact database typically contains 3-10% duplicates?

These duplicate records result in unnecessary costs when sending out printed catalogs, are aggravating to your customers, create problems in financial controlling, and so on. With classical methods there is no way to locate these duplicates in your database.




Sunday, January 28, 2007

Machine Learning and the Hidden Markov Models (HMM)

What is a Hidden Markov Model? (from Answers.com)

A hidden Markov model (HMM) is a statistical model in which the system being modeled is assumed to be a Markov process with unknown parameters, and the challenge is to determine the hidden parameters from the observable parameters. The extracted model parameters can then be used to perform further analysis, for example in pattern recognition applications. An HMM can be considered the simplest dynamic Bayesian network.

In a regular Markov model, the state is directly visible to the observer, and therefore the state transition probabilities are the only parameters. In a hidden Markov model, the state is not directly visible, but variables influenced by the state are visible. Each state has a probability distribution over the possible output tokens. Therefore the sequence of tokens generated by an HMM gives some information about the sequence of states.

Hidden Markov models are especially known for their application in temporal pattern recognition such as speech, handwriting, gesture recognition and bioinformatics.

[Figure] State transitions in a hidden Markov model (example):
x — hidden states
y — observable outputs
a — transition probabilities
b — output probabilities
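
To make the x, y, a and b pieces concrete, here is a minimal Python sketch of the forward algorithm for a toy two-state HMM (the states, observations and probabilities are invented for illustration):

states = ("Rainy", "Sunny")                     # hidden states (x)
start_p = {"Rainy": 0.6, "Sunny": 0.4}          # initial distribution
trans_p = {                                     # transition probabilities (a)
    "Rainy": {"Rainy": 0.7, "Sunny": 0.3},
    "Sunny": {"Rainy": 0.4, "Sunny": 0.6},
}
emit_p = {                                      # output probabilities (b)
    "Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
    "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1},
}

def forward(observations):
    """Return P(observations) summed over all hidden state paths."""
    alpha = {s: start_p[s] * emit_p[s][observations[0]] for s in states}
    for obs in observations[1:]:
        alpha = {
            s: sum(alpha[prev] * trans_p[prev][s] for prev in states) * emit_p[s][obs]
            for s in states
        }
    return sum(alpha.values())

print(forward(("walk", "shop", "clean")))  # likelihood of the observed outputs (y)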

Resources:

Machine Learning Links

Record Linkage and List Quality

What is Record Linkage:

Record linkage is the task of quickly and accurately identifying records corresponding to the same entity from one or more data sources. Record linkage is also known as data cleaning, entity reconciliation or identification, and the merge/purge problem. This paper presents the “standard” probabilistic record linkage model and the associated algorithm. Recent work in information retrieval, federated database systems and data mining has proposed alternatives to key components of the standard algorithm. The impact of these alternatives on the standard approach is assessed. The key question is whether and how these new alternatives are better in terms of time, accuracy and degree of automation for a particular record linkage application.


Other names that mean the same thing: entity heterogeneity, entity identification, object isomerism, instance identification, merge/purge, entity reconciliation, list washing, match/consolidate and data cleaning. I like the term "record linkage" and will refer to it as such from this point forward in this blog.

It seems clear that if you want to be thorough in your record linkage efforts you should implement a combination of deterministic and probabilistic matching methodologies. Using a straightforward name and address match such as the Firstlogic (now Business Objects) approach will usually be sufficient if you are a list vendor or mailhouse. But if you are at all serious about identity management you will step into the deep end and implement a probabilistic matching method. I've downloaded the Febrl open source probabilistic matching tool, but have yet to experiment with it.

But whatever method you use or software package you buy, the quality of your record linkage always comes down to how well you have configured your rules. This is not an off-the-shelf solution - it requires work.

I bet most organizations' "record linkage problems" could have been avoided if enough foresight and initiative had been put into the original database systems, in establishing that unique key.

Well, I guess hindsight is in fact 20/20.

One product that definitely contributes to an increased record linkage match rate is the SSA-NAME3 product. As you'll see from their site, they have developed a name search tool that uses probabilistic matching but factors in how common or uncommon the name is. For example, matching two Jose Garcias in the city of Los Angeles is not the same as matching two Jose Garcias in Iceland. In Iceland they are probably the same person; in LA, most likely not.
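
The intuition can be sketched with Fellegi-Sunter style agreement weights (the frequency table and numbers below are invented, and this is not SSA-NAME3's actual algorithm):

import math

# Hypothetical counts of each name in the population the file is drawn from.
name_counts = {"jose garcia": 18000, "gudrun jonsdottir": 12}
total_names = 5_000_000

M_PROB = 0.95  # assumed P(names agree | records truly match)

def name_agreement_weight(name):
    """Agreement weight adjusted for name frequency: u is the chance two
    random, unrelated records share this name, so rare names score higher."""
    u = name_counts.get(name.lower(), 1) / total_names
    return math.log2(M_PROB / u)

print(round(name_agreement_weight("Jose Garcia"), 1))        # common: ~8.0
print(round(name_agreement_weight("Gudrun Jonsdottir"), 1))  # rare: ~18.6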

What is a Data Steward?

Data Steward defined:

The person responsible for a data standard. In this role, a Data Steward is charged by his/her Management to develop and maintain the data standard and to counsel Service personnel on the proper use of the data. He/she must: have a thorough knowledge of the subject matter of the standard, provide accurate and current electronic copies of data relevant to the standard, and weigh the pros and cons of comments received during review of the standard. He/she is authorized to defend or modify the standard as necessary in order to ensure its proper use.

Claudia Imhoff's definition and description of duties:
Steward - from Old English for "keeper of the sty", a sty ward.

Data Steward - Person responsible for managing the data in a corporation in terms of integrated, consistent definitions, structures, calculations, derivations, and so on.

Corporations are demanding better and better sources of data. The explosive growth of data warehousing and sophistication of the access tools are proof that data is one of the most critical assets any company possesses. Data, in the form of information, must be delivered to decision-makers quickly, concisely and more importantly, accurately.

The data warehouse is an excellent mechanism for getting information into the hands of decision-makers. However, it is only as good as the data that goes into it. Problems occur when we attempt to acquire and deliver this information. A major effort must be made in defining, integrating and synchronizing the data coming from the myriad operational systems producing data throughout the corporation. Who should be responsible for this important task? The answer for a growing number of companies is a new business function called Data Stewardship.

What is Data Stewardship?

Data Stewardship has, as its main objective, the management of the corporation's data assets in order to improve their reusability, accessibility, and quality. It is the Data Stewards' responsibility to approve business naming standards, develop consistent data definitions, determine data aliases, develop standard calculations and derivations, document the business rules of the corporation, monitor the quality of the data in the data warehouse, define security requirements, and so forth (see Table 1 for a list of the data integration issues determined by Data Stewards).

This data about data, or meta data, developed by Data Stewards can then be used by the corporation's knowledge workers in their everyday analyses to determine what comparisons should be made, which trends are significant, that apples have indeed been compared to apples, etc.

Just as the demand for a data warehouse with good data has grown, the need for a Data Stewardship function has likewise grown. More and more companies are recognizing the critical role this function serves in the overall quest for high quality, available data. Such an integrated, corporate-wide view of the data provides the foundation for the shared data so critical in the data warehouse. ...

Data Stewards are responsible for the following:
* Standard Business Naming Standards
* Standard Entity Definitions
* Standard Attribute Definitions
* Business Rules Specification
* Standard Calculation and Summarization Definitions
* Entity and Attribute Aliases
* Data Quality Analyses
* Sources of Data for the Data Warehouse
* Data Security Specification
* Data Retention Criteria

Crash Course on Data Stewardship:

Data Stewardship programs are implemented to reduce information technology costs and improve the value companies gain from their data assets. Stewardship programs focus on improving data quality, reducing data duplication, formalizing accountability for data, and improving business and IT productivity. An effective Data Stewardship program will rapidly improve the ROI from data warehousing and business intelligence efforts.

Data Governance - IT or Business?

"Where does Data Governance fit into the Organization?"

This question appears to have two answers. "In Business" or "In IT". The answer I get most often when I ask the question is ... "In Business". Or "Business should own Data Governance".

I wish it were that simple.

What exactly does it mean for Data Governance to fit "In Business"?

This article seemed to take up the questions posed in my last post. The last paragraph sums it up and answers this question.

The best answer to the question "Where does Data Governance fit into the Organization?" is "It doesn't matter". Data Governance can be successful when managed by a business area or by an IT area.

The decision of who will manage the Data Governance program can be very important to the success of the program. However, it will not necessarily make or break a well-defined Data Governance program's likelihood of success. As long as the business areas and IT areas coordinate their efforts, use a Data Governance Council as a strategic resource, cooperate in strategic data management activities, and act in the best interests of the organization (data-wise), the placement of the management of the Data Governance program is not nearly the most important question that needs to be answered.

Wednesday, January 24, 2007

IT Departments Role in CDI

Introducing Customer Data Integration into an organization can be a bit daunting. A recommendation to anyone doing this is Jill Dyche and Evan Levy's book Customer Data Integration, which definitely builds the needed foundation for anyone's understanding of this subject. However, there are still aspects that need to be understood and certain functions clearly delineated.

What is the role of IT in CDI (Customer Data Integration)?

First a definition of IT:

information technology
n. Abbr. IT
The development, installation, and implementation of computer systems and applications.
"IT." The American Heritage® Dictionary of the English Language, Fourth Edition. Houghton Mifflin Company, 2004. 24 Jan. 2007. http://dictionary.reference.com/browse/IT>
The development, installation and implementation of the CDI system is the responsibility of IT - not the day-to-day upkeep of the data. As a Data Steward you must have the needed access to the data in order to uphold your responsibilities.

Tight coordination is needed between the Data Stewards and the IT Dept when the CDI project is in its early stages. Clear definitions should be established for the roles needed, responsibilities assigned and the needed functionality and access levels built into the CDI project.

Getting this established and clearly delineated at the onset is critical to a successful CDI implementation and will increase the mileage and speed of progress in achieving your goals with CDI.

Friday, January 19, 2007

Postalsoft Mailing Software: Compare Solutions

Here is a good comparison of the Firstlogic mailing products. I thought I had the professional suite but in looking at these it seems I am only getting the Business Edition options. I better call my account rep...

Postalsoft Mailing Software: Compare Solutions

Thursday, January 18, 2007

Oracle TO_CHAR function

The Oracle TO_CHAR function can display seconds past midnight with the following syntax:

SELECT TO_CHAR(sysdate, 'SSSSS') FROM dual;


Further Oracle SQL/PLSQL topics

Monday, January 15, 2007

Machine Learning and Artificial Intelligence

Machine Learning is a topic that has come up a lot during my CDI (Customer Data Integration) research, specifically when diving into the domain of probabilistic matching. From the Initiate website:

Probabilistic matching

Probabilistic matching uses likelihood ratio theory to assign comparison outcomes to the correct, or more likely decision. This method leverages statistical theory and data analysis and, thus, can establish more accurate links than deterministic systems between records that have more complex typographical errors and error patterns.

Typically, probabilistic systems assign a percentage (such as 75 percent) indicating the probability of a match. Because these systems pinpoint variation and nuances to a much finer degree than a deterministic approach, they are better suited for businesses that have complex data systems with multiple databases. Due to the size of these data systems, the potential for duplicates, human error and discrepancies is far greater, making a system designed to establish links between records with complex error patterns much more effective.
Probabilistic matching enables one to use match scores and percentages on a field-by-field comparison to determine a match. Three categories are output from probabilistic matching, with thresholds set by the user based on the overall probability percentage:

1) Match - can be automatically merged
2) Candidate Match - requires manual review
3) Non-Match
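
To make the three-way split concrete, here is a minimal Python sketch (the field weights, per-field scoring and thresholds are invented for illustration, not any vendor's actual scoring):

FIELD_WEIGHTS = {"last_name": 0.4, "first_name": 0.2, "dob": 0.25, "zip": 0.15}
UPPER, LOWER = 0.90, 0.65  # user-set probability thresholds

def field_score(a, b):
    """Crude per-field agreement: 1.0 exact, 0.8 prefix match, else 0.0."""
    if a == b:
        return 1.0
    if a and b and (a.startswith(b) or b.startswith(a)):
        return 0.8
    return 0.0

def classify(rec_a, rec_b):
    score = sum(w * field_score(rec_a.get(f, ""), rec_b.get(f, ""))
                for f, w in FIELD_WEIGHTS.items())
    if score >= UPPER:
        return score, "Match"            # safe to merge automatically
    if score >= LOWER:
        return score, "Candidate Match"  # queue for manual review
    return score, "Non-Match"

a = {"last_name": "SMITH", "first_name": "ROBERT", "dob": "1970-01-01", "zip": "90210"}
b = {"last_name": "SMITH", "first_name": "ROB", "dob": "1970-01-01", "zip": "90210"}
print(classify(a, b))  # (0.96, 'Match')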

The topic of machine learning and artificial intelligence enters in with the manual review process of the candidate matches. The idea is that the computer can learn from the user's decisions about what was manually determined to be a match or non-match and build these into its future decisions and probability scores.

If an expert system--brilliantly designed, engineered and implemented--cannot learn not to repeat its mistakes, it is not as intelligent as a worm or a sea anemone or a kitten.
-Oliver G. Selfridge, from The Gardens of Learning.

"Find a bug in a program, and fix it, and the program will work today. Show the program how to find and fix a bug, and the program will work forever."
- Oliver G. Selfridge, in AI's Greatest Trends and Controversies

Machine learning refers to a system capable of the autonomous acquisition and integration of knowledge. This capacity to learn from experience, analytical observation, and other means, results in a system that can continuously self-improve and thereby offer increased efficiency and effectiveness.

(read full article on Machine Learning here)


Companies like Purisma and Siperian offer machine learning techniques built into their software.

A primary goal of machine learning would be to reduce the amount of manual review needed to determine matches and continuously improve the software's ability to accurately detect and consolidate duplicate records.
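
As a toy illustration of that feedback loop, here is a sketch that tightens the auto-merge cutoff based on data stewards' decisions (the update rule is deliberately naive and not any product's actual method):

review_log = [
    (0.88, True), (0.84, True), (0.79, False),
    (0.82, True), (0.71, False), (0.76, False),
]  # (match score, steward confirmed a match?) gathered during manual review

def learned_cutoff(log):
    """Midpoint between the lowest confirmed match and the highest confirmed
    non-match; pairs scoring above it could auto-merge on the next run."""
    match_scores = [s for s, confirmed in log if confirmed]
    nonmatch_scores = [s for s, confirmed in log if not confirmed]
    return (min(match_scores) + max(nonmatch_scores)) / 2

print(learned_cutoff(review_log))  # 0.805 -> fewer pairs sent to review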

An open source tool that can be used for probabilistic record matching is Febrl (Freely Extensible Biomedical Record Linkage). Written in Python, it can be downloaded from sourceforge.net so anyone can get their feet wet with probabilistic matching.

Tuesday, January 09, 2007

An SQL Introduction and Tutorial

If you need to freshen up your basics on SQL or if you need to train a new employee on a gradient approach to SQL, the following is a suggested training line-up:

First, clear up each of the following terms:

SQL
SELECT
FROM
AS
WHERE
LIKE
IN
BETWEEN
AND
OR
INTO
GROUP BY
ORDER BY
HAVING
COUNT

Next, go through this online interactive tutorial. It walks the student through real examples using an online SQL tool, so you can practice what you learn without having to switch to another computer where your database lives.

An Interactive SQL Tutorial

In learning any new subject it is extremely important to start with the key words and to have these cleared up first, before carrying on with the material. This is covered in Hubbard's Study Technology.

Here are good basic definitions for each of these SQL terms:

SQL
SQL stands for Structured Query Language and is a computer language that allows you to ask questions (query) or interact with your database in a structured manner.

SELECT
The SELECT statement indicates that you wish to query and retrieve information from a database. The select_list specifies the type of information (or column names) to retrieve. The keyword ALL or the wildcard character asterisk (*) could be used to signify all columns in the table. Also, the keyword DISTINCT could be used to discard duplicate records and retrieve only the unique records for the specified columns.

INTO
The INTO keyword indicates that you are inserting records into another table. For example in Microsoft Access you can create a new table and insert all records from another table by using the following: SELECT * INTO newtablename FROM tablename

FROM
The FROM clause is required in any SELECT statement. It specifies the tables to retrieve data from.

AS
The AS keyword is used in the SELECT or the FROM clause when you want to refer to a table or column as something else. This is very handy if you have long table or column names and you want to use a shorter name to refer to them. This is commonly used when you are joining multiple tables together and do not want to rewrite the full table name each time you are referring to that table. For example:

SELECT a.column, b.column
FROM table_name_a AS a, table_name_b AS b
WHERE a.ID = b.ID

This is also commonly used to have a nicer display of your results. For example if you have a column called "field17" and you want to display it as "COUNTRY" you can use the following:

SELECT field17 as COUNTRY
FROM table_name

WHERE
The optional WHERE clause limits the results to those records (or rows) that meet particular conditions.

LIKE
LIKE is another SQL keyword that is used in the WHERE clause. LIKE allows you to search based on a pattern rather than specifying exactly what is desired. The syntax is as follows: SELECT * FROM table WHERE column LIKE 'abc%'
The wildcard '%' (in Microsoft SQL Server and Oracle) or '*' (in Access) matches any sequence of characters and can be used in any location within the quotes. This will return any records that match 'abc' as the first three letters followed by anything else, such as 'abc123', 'abcdef', etc.

IN
The IN keyword is used in the WHERE clause when you want to select all records that match any of a list of specific values. For example, if you wanted all customers that have either blue, green or hazel eyes, you would use the following:

SELECT *
FROM customers
WHERE eye_color IN ('blue', 'green', 'hazel')

You cannot use wildcards with the IN keyword.

BETWEEN
If you want to specify a range of values you can use the BETWEEN keyword. This will give you anything between values a and b, inclusive. For example, if you wanted any zip codes between '90210' and '91420' you would use the following:

SELECT *
FROM address_list
WHERE zip_code BETWEEN '90210' AND '91420'

AND
The AND keyword is used when you want to add additional criteria to your WHERE clause or as part of the BETWEEN clause.

OR
The OR keyword is used when you want records that meet either one criterion or another (or both). Be very careful when using the OR keyword: if you have AND and OR in the same WHERE clause your result will probably not be what you expect, unless you use parentheses (). For example, if you want all people with blue or hazel eyes who live in California, you would use the following:

SELECT *
FROM address_list
WHERE (eye_color = 'blue' OR eye_color = 'hazel') AND state = 'CA'

If you don't use the parentheses, AND binds more tightly than OR, so your query will return everyone with blue eyes (in any state) plus only the hazel-eyed people in California. So remember your math class, where you learned to group parts of an equation to avoid confusion: (a * b) + y ...

GROUP BY
The GROUP BY clause specifies whether you are grouping (or aggregating) any of the columns in your SELECT statement. For example, if you want to display a count of all addresses by city, you would use the GROUP BY clause to group the results by city. The following example will give you a breakdown by city, sorted from most to least:

SELECT city, COUNT(*)
FROM table
GROUP BY city
ORDER BY COUNT(*) DESC

HAVING
The optional HAVING clause filters the groups produced by GROUP BY. For example, you may want to group your results by city but only display those cities that have more than 10 matches. You would do this by saying "HAVING COUNT(*) > 10". HAVING always comes after the GROUP BY clause.

ORDER BY
The ORDER BY clause specifies whether to output the query result in ascending or descending order. This is often used to sort your results in alphabetical or numerical sequence.

COUNT
The COUNT keyword is used in the SELECT or HAVING clauses. This allows you to give a total count of something. This is also referred to as an "aggregate": you are grouping things together and displaying aggregated data on the group instead of listing each individual record. Other aggregate functions are SUM, MIN and MAX: COUNT gives you the count of records, SUM gives you the sum of the values in a column, MIN gives you the minimum value and MAX gives you the maximum value. Here is an example of the COUNT usage:

SELECT COUNT(*)
FROM table_name
WHERE last_name = 'SMITH'

This will give you a count of all records with the last name of 'SMITH'. Notice the (*) after the keyword. This means you want to give a count of all records meeting the criteria in your WHERE clause. But you may also see COUNT(1) which also means give a total number of the records. The COUNT keyword is very useful in giving reports or breakdowns of various columns. For example if you want a breakdown of all records you have for each eye_color (if you have this column in your table) you could use the following:

SELECT eye_color, COUNT(1)
FROM table_name
GROUP BY eye_color
ORDER BY COUNT(1) desc

Notice the GROUP BY and ORDER BY keywords used as well. This will show you the count by eye_color, sorted from the highest number to the lowest.
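
If you would rather practice these statements locally, Python's built-in sqlite3 module gives you a throwaway database to run them against; here is a quick sketch (the table and data are invented):

import sqlite3

conn = sqlite3.connect(":memory:")  # temporary practice database
conn.execute("CREATE TABLE customers (last_name TEXT, eye_color TEXT, state TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [("SMITH", "blue", "CA"), ("SMITH", "hazel", "NY"), ("JONES", "blue", "CA")],
)

# The GROUP BY / ORDER BY / COUNT pattern from above:
for eye_color, total in conn.execute(
    "SELECT eye_color, COUNT(1) FROM customers "
    "GROUP BY eye_color ORDER BY COUNT(1) DESC"
):
    print(eye_color, total)  # blue 2, then hazel 1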
Here is another simple introduction to SQL.


Friday, January 05, 2007

Filmed Interview with L. Ron Hubbard

The Church of Scientology just released the only video interview with the founder of Scientology, L. Ron Hubbard. This answers all common questions on Scientology:

"What is Scientology?"
"What is the Mind?"
"How did L. Ron Hubbard come to develop Scientology?"
"How did he make these discoveries?"
"What is an auditor?"
"Why is Man on this planet and what is his purpose here?"
"Why is Man basically good?"

This site also has a running tally of how many of these DVDs are now in circulation - over 100,000 in the first week after its release!

CDI - Matching Engines

Another tip from CDI expert Jill Dyche on the subject of matching engines and whether or not to base your CDI software decision entirely on the quality of the match engine:

Here are some other things to keep in mind when assessing matching engines:

  • Test with lots of real data; compare and benchmark results. Organizations testing CDI hubs should test with as much data as possible to get a more real-world test of the system, Levy said. DataFlux's Gidley also recommends profiling data sources to assess the accuracy and completeness of data. This can help with tuning the matching engine's business rules. For example, if 50% of the phone number fields in a database are empty, it might not be a good matching attribute (see the sketch after this list).
  • Consider the interface. When a system isn't sure whether records match, it generally refers the matter to a human data steward to make the call. So the interface for data stewards is an important part of a tool decision, Gidley said.
  • Think globally and futuristically. Language differences, industry-specific requirements, and future infrastructure plans -- such as moving to a service-oriented architecture -- can also affect matching engine choices, according to Dyche.
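
As a rough sketch of the profiling Gidley describes (the column names and the 50% completeness rule of thumb are illustrative assumptions):

records = [
    {"name": "Ann Lee", "phone": "555-0101", "zip": "90210"},
    {"name": "Ann Lee", "phone": "", "zip": "90210"},
    {"name": "Bob Ray", "phone": None, "zip": ""},
]

def completeness(rows, field):
    """Fraction of rows with a non-empty value in the given field."""
    return sum(1 for r in rows if r.get(field)) / len(rows)

for field in ("name", "phone", "zip"):
    pct = completeness(records, field)
    verdict = "usable match attribute" if pct >= 0.5 else "poor match attribute"
    print(f"{field}: {pct:.0%} complete -> {verdict}")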
Overall, Dyche said, the matching engine should be only a part of the overall CDI tool choice.

"Matching is one of many decisions to make," Dyche said. "When we see these vendor bake-offs and companies get down to [which CDI vendor] has the most accurate match, that's still only one component of the decision."

So, in some cases, the CDI tool with the most accurate matching engine might not be the final choice. Companies must consider the big picture, including data volumes, processing speed and functional requirements, Dyche said.

CDI - Project Research

A great post about CDI project research covers what you should ask yourself before diving too deep into the CDI pool:

What is the "need, pain or problem" that we're trying to solve?
What are our requirements, both from a business as well as a functional perspective?
What data sources have critical customer information in them?
What's the current system of record for customer data?
Who are the current de facto data owners in the company?
What are the ideal matching algorithms for our particular data?
What is the potential impact on existing technology architecture and systems?
How and when will data stewardship be addressed?

"Eight must-ask questions for CDI projects by Jill Dyche


Thursday, January 04, 2007

ISAL Rate Calculator - Download it here

If you are a mailer who has ever sent out ISAL (International Surface Air Lift) mailings, you know the postal form can be quite a pain to fill out. To simplify this for myself and for anyone else who is interested, I've made an Excel spreadsheet that contains the latest ISAL rates for each ISAL Zone as well as the calculations for the per-piece and per-pound costs. It is set up so all you have to do is enter the weight of the mailpiece (in ounces) and how many pieces are going to each ISAL Zone. Here is a view of this tool, which you can download below and use:


Download the ISAL Rate Calculator here.

Just enter the weight and pieces for each ISAL Zone in yellow. If you get an ISAL Discount you can edit this cell as well. Your final ISAL cost will show up in the blue cell.

This can be used to help you fill in the USPS Postal Form for International mail (PS FORM 3650).

If you downloaded this ISAL Rate Calculator and it worked for you, let me know.

Here is the current link to the USPS ISAL Page where the latest rates for each ZONE can be found. Note there is a new ISAL Zone for Australia as of May 14th, 2007.

Tuesday, January 02, 2007

Matching Algorithms for CDI

It is generally understood that between 2% and 5% of a customer database will contain undetected duplicates after a standard merge/purge routine is run. As stated in CRM Today, most merge/purge algorithms are 20 years old...

Most merge/purge processes were developed over 20 years ago and were never conceived to recognize the fluidity of movement, name change, and channels in which customers interact today. Even the most advanced de-duplication processes use character based logic and look up tables that are ill-equipped to assess the totality of a customers’ name and address permutations that accumulate through multiple customer interaction channels. These processes are easily deceived by minor variations in the name and address elements such as married/maiden, nick names, typos, and mis-keys. A typical file will contain 2 to 5% unidentified duplicate customers after a standard merge/purge process.
Instead of the common approach of only using one matching algorithm (I am guilty of this one), modern approaches utilize many algorithms to isolate duplicate records.
No single algorithm can efficiently and effectively power a matching technology due to the multiple culprits of customer identity and data quality error. Advanced solutions incorporate not one, but several advanced matching algorithms, each designed to group records into temporal data sets for the purpose of bringing visibility to distinct patterns of repetitious error. Once patterns of error are identified, the records can be referenced to consumer data sources, allowing for the remediation of the error and recognition of the true identity of the customer.
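As a toy example of the multi-algorithm idea, here is a sketch that flags a pair as a candidate duplicate if any one of several cheap comparators fires, so each algorithm covers another's blind spot (the comparators and the 0.8 cutoff are my own illustration, not a vendor's method):

from difflib import SequenceMatcher

def exact(a, b):
    return 1.0 if a == b else 0.0

def edit_similarity(a, b):
    """Character-level similarity; catches typos and mis-keys."""
    return SequenceMatcher(None, a, b).ratio()

def token_overlap(a, b):
    """Word-level overlap; catches swapped or missing name parts."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / max(len(ta | tb), 1)

COMPARATORS = (exact, edit_similarity, token_overlap)

def is_candidate_duplicate(a, b, cutoff=0.8):
    a, b = a.lower().strip(), b.lower().strip()
    return any(compare(a, b) >= cutoff for compare in COMPARATORS)

print(is_candidate_duplicate("Robert Smith", "Robert Smyth"))  # typo -> True
print(is_candidate_duplicate("Robert Smith", "Smith Robert"))  # swap -> True
print(is_candidate_duplicate("Robert Smith", "Jane Doe"))      # -> False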
An integral part of any CDI product is its matching capability. Match rules will vary from organization to organization, so using a standard template or built-in rules usually will not suffice if you want thorough duplicate detection and handling.

Firstlogic (now Business Objects) has a sophisticated match engine, and beyond the standard match capabilities is "Extended Matching," which allows the user to set up custom rules for different match scenarios. I believe that simply using rule-based matching you can solve 80% of your data consolidation issues.

Companies like Purisma or Siperian offer much more sophisticated match engines that utilize match clusters and larger "match footprints," which ease the process; but for the most part, if you know what you are trying to do, any rule-based matching engine will do. And remember:
"The computer is no better than the organization that feeds it." - L. Ron Hubbard
The moral: first work out all your match rules, then see if you can solve your problems using your existing match engine software before looking elsewhere to solve these match-related problems.