Theory of probability in the office. How to correct typos in the data?

dt160106Previously I have shown how to use Excel and VBA language to automatically format financial statements or other documents. Today I will describe how to use theory of probability to correct typos in the data collected in our database e.g spreadsheets.

Surprised? Yes, we use theory of probability and statistics, completely useless piece of …. maths, to automatically find and correct typos made by our clients or colleagues.

– We don’t need that because we have autocorrection tools in our office software!
– Show me then, how to correct surnames or account numbers using these tools?
Oh… We need to wait when our customer will tell us about the mistake in his/her personal data…
And then you need to say sorry ;)
– What can I do instead?
– Let’s look at the example below …

Imagine tons of entries that look like below:

Id   Surname  City      Account Number
1    Novak    Warsaw    0000-1111-0000
2    Novak    Warsaw    0000-111-0000
3    Novak    Warsaw    0000-1111-0000
4    Novak    Waraw     0000-1111-0000
5    Novk     Warsaw    0000-1111-0000
6    Conor    Warsaw    0000-1111-0000

We may notice that there are relationships between surname, city and account number, e.g:

  • account number is unique
  • there might be number of people with the same surname, but there is less people with the same surname, who are living in the same city

We want to use these relationship to find and correct typos.

  1. First we divide our entries according to account number. We will get two sets of data. Account number is unique, so the data inside a single set should apply to the same person:
    Id   Surname    City      Account Number
    1    Novak      Warsaw    0000-1111-0000
    3    Novak      Warsaw    0000-1111-0000
    4    Novak      Waraw     0000-1111-0000
    5    Novk       Warsaw    0000-1111-0000
    6    Conor      Warsaw    0000-1111-0000
    Id    Surname    City      Account Number
    2     Novak      Warsaw    0000-111-0000
  2. Compute the probability, that the city equals a value in a row. In the first set the probabilities will be: 0.8 for Warsaw and 0.2 for Waraw.
  3. Notice that mistakes will be less probable than the correct values and the probability of mistakes will be lower the more data is collected. Don’t be afraid checking millions of entries!
  4. To find typos among cities examine the values with the lowest probability. There will be much less data that should be examined manually and the probability that we will find mistakes among them is truly high. In the same time the probability that we will find a mistake among high probable values will be very low. Our work will be very efficient!
  5. Repeat this procedure for surname. Here you may also notice that Conor is an uncle, a cousin, a friend or a other person that paid for a Novak :) You may use this information in the future.

How to find incorrect account number? Let’s group our data according to surname and the city after correcting these columns. We should get two sets of data:

Id   Surname    City      Account Number
1    Novak      Warsaw    0000-1111-0000
2    Novak      Warsaw    0000-111-0000
3    Novak      Warsaw    0000-1111-0000
4    Novak      Waraw     0000-1111-0000
5    Novk       Warsaw    0000-1111-0000
Id   Surname    City      Account Number
6    Conor      Warsaw    0000-1111-0000

Then the probability that account number equals specified value in a row should be: 0.8 for 0000-1111-0000 and 0.2 for 0000-111-0000. You know what it means ;)

That’s how theory of probability help us in everyday life :) But … it is only a very beginning of the story. In the next article I will show you a working solution in an Excel spreadsheet and an algorithm recognizing that Novk is a typo but Conor is something totally different. We will use levenshtein distance. Sounds complicated? Not necessarily!

Advertisements

One thought on “Theory of probability in the office. How to correct typos in the data?

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s