Data Warehouse / Business Intelligence Requirements Elicitation. Where do You Begin? (Part 4 of 5)
I discussed dimensional modeling in the last post and why I felt this was something the BI BA can own. I also focused on red and blue people (nothing political here). These colors represent the business and IT. When blended they become “Purple People”, a Wayne Eckerson (TDWI) term. I wanted to continue down this path with data cleansing and data quality, and how BAs can assist in this area.
Data Cleansing
Data cleansing attempts to understand the content of the data, establish a standard form, and correct any inaccuracies detected. Data cleansing is done prior to the extract, transform, and loading (ETL) of the data. One of the tasks associated with data cleansing is the removal of obvious duplicates and not so obvious duplicates.
Data cleansing alone will not lead to “quality” data because it is a perpetual bandage. It corrects the same data and the same inaccuracies. Data cleansing should not be confused with data quality.
BAs can assist by identifying embedded business logic that could potentially be removed by automated data profiling tools. For example, data that contains a mixture of numbers and letters might not look like much to you but could have significant meaning to the business (i.e., NewYork22, 777L).
Data Quality
Having access to prodigious amounts of data sounds like a good idea when information is needed to answer business questions. However, if the veracity of that data is suspect then one cannot trust results based on that data. I’m sure we have all experienced this on one or more occasions. This is definitely an area BAs should get involved in considering we represent the business.
Data quality attempts to improve business processes in order to eliminate the root of flawed data. Much like data cleansing, corrections are made along the way when discovered as they are low hanging fruit. However, unlike data cleansing, corrections are persisted to new and old data each time the data warehouse is updated or loaded.
Bad data not only gives us a headache, it costs lots of money in the long run. The opposite also holds true for the bottom line. Below are examples of the types of errors that can be eliminated by a strong data quality initiative:
Attribute Granularity:
This refers to the amount of information embedded in one attribute. An example would be first name, middle name, and last name all in one. This often stems from poor modeling.
Overloaded Attributes:
This is similar to attribute granularity but can stem from poor modeling or a poor business practice. Overloaded attributes clearly consist of more than one attributes A good example of this kind of error is relationships among individuals. This often occurs in the financial services industry where companies are going from an account centric view to a customer centric view. See below:
1. James McKinney and Jacqueline McKinney
2. James and Jacqueline McKinney
3. James McKinney in Trust for Kelvin Barkley
Floating Data:
This is usually the inconsistency of data entry caused by poor metadata modeling. Have you ever seen suite# or apt# appear in the address 1 field instead of the address 2 field? If so, just imagine this on a much larger scale. The business might not be the blame here but again, these are the kinds of findings that surface during the data quality deep dive.
Strict Format Conformance:
It is comfortable and natural for us to look for patterns in data but sometimes they can screw us up. For example, in the US our phone number format is ###-###-#### and we will create attributes to adhere to this format. What happens when we open our data warehouses to phone number formats of European and Japanese individuals? You get bad data because these fields cannot be completed properly.
I hope I provided you with a new way to look at data quality and how you can help by lobbying to improve your business processes. Remember the old IT axiom, garbage in garbage out (GIGO).
Till next time,
Jamie
- jamiemckinney's blog
- Login or register to post comments
- 495 reads

