Previously I have shown how to use Excel and VBA language to automatically format financial statements or other documents. Today I will describe how to use theory of probability to correct typos in the data collected in our database e.g spreadsheets.
Surprised? Yes, we use theory of probability and statistics, completely useless piece of …. maths, to automatically find and correct typos made by our clients or colleagues.
– We don’t need that because we have autocorrection tools in our office software!
– Show me then, how to correct surnames or account numbers using these tools?
– Oh… We need to wait when our customer will tell us about the mistake in his/her personal data…
– And then you need to say sorry ;)
– What can I do instead?
– Let’s look at the example below …
Imagine tons of entries that look like below:
Id Surname City Account Number 1 Novak Warsaw 0000-1111-0000 2 Novak Warsaw 0000-111-0000 3 Novak Warsaw 0000-1111-0000 4 Novak Waraw 0000-1111-0000 5 Novk Warsaw 0000-1111-0000 6 Conor Warsaw 0000-1111-0000
We may notice that there are relationships between surname, city and account number, e.g:
- account number is unique
- there might be number of people with the same surname, but there is less people with the same surname, who are living in the same city
We want to use these relationship to find and correct typos.
- First we divide our entries according to account number. We will get two sets of data. Account number is unique, so the data inside a single set should apply to the same person:
Id Surname City Account Number 1 Novak Warsaw 0000-1111-0000 3 Novak Warsaw 0000-1111-0000 4 Novak Waraw 0000-1111-0000 5 Novk Warsaw 0000-1111-0000 6 Conor Warsaw 0000-1111-0000
Id Surname City Account Number 2 Novak Warsaw 0000-111-0000
- Compute the probability, that the city equals a value in a row. In the first set the probabilities will be: 0.8 for Warsaw and 0.2 for Waraw.
- Notice that mistakes will be less probable than the correct values and the probability of mistakes will be lower the more data is collected. Don’t be afraid checking millions of entries!
- To find typos among cities examine the values with the lowest probability. There will be much less data that should be examined manually and the probability that we will find mistakes among them is truly high. In the same time the probability that we will find a mistake among high probable values will be very low. Our work will be very efficient!
- Repeat this procedure for surname. Here you may also notice that Conor is an uncle, a cousin, a friend or a other person that paid for a Novak :) You may use this information in the future.
How to find incorrect account number? Let’s group our data according to surname and the city after correcting these columns. We should get two sets of data:
Id Surname City Account Number 1 Novak Warsaw 0000-1111-0000 2 Novak Warsaw 0000-111-0000 3 Novak Warsaw 0000-1111-0000 4 Novak Waraw 0000-1111-0000 5 Novk Warsaw 0000-1111-0000
Id Surname City Account Number 6 Conor Warsaw 0000-1111-0000
Then the probability that account number equals specified value in a row should be: 0.8 for 0000-1111-0000 and 0.2 for 0000-111-0000. You know what it means ;)
That’s how theory of probability help us in everyday life :) But … it is only a very beginning of the story. In the next article I will show you a working solution in an Excel spreadsheet and an algorithm recognizing that Novk is a typo but Conor is something totally different. We will use levenshtein distance. Sounds complicated? Not necessarily!