Microsoft has provided a handy guide for data cleaning in Excel.
In this blog post, I will list their top ten ways to clean data and briefly describe why each matters for data mining. Data cleaning is an important prerequisite for data mining, and in terms of time spent it may comprise most of the data mining effort. The Microsoft tips are the place to start, but I will also add some data cleaning recommendations of my own.
Click the link below each tip for a more detailed explanation with real-world examples.
Tip one: Spell Checking
I used to be a better speller, but my skill has been eroded by all sorts of automated agents that check spelling even while I type. WordPress has various spell checkers too (I use one called After the Deadline, which audits spelling and grammar). Excel checks spelling and grammar as well, and for data mining the payoff is consistency in the attribute values.
Tip two: Removing duplicate rows
SQL Server Data Mining considers all rows presented to it equally. Removing duplicate rows can therefore be important; in my experience, the most obvious error is when entire blocks of rows are (for whatever reason) included twice in the data. Filtering can be done in Excel or PowerPivot. Duplicate rows are not automatically a problem, though: in the case of oversampling, someone could repeat rows intentionally to give them more weight in the final analysis.
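Outside Excel, the same deduplication is easy to sketch in Python. This is a minimal illustration (the sample rows are invented), keeping the first occurrence of each row so the original order survives:

```python
def drop_duplicate_rows(rows):
    """Remove exact duplicate rows while preserving first-seen order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)  # tuples are hashable, so we can track what we've seen
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

data = [
    ["Alice", "2024-01-05", 100],
    ["Bob",   "2024-01-06", 250],
    ["Alice", "2024-01-05", 100],  # an accidental repeat of the first row
]
print(drop_duplicate_rows(data))  # the repeated Alice row is dropped
```

If intentional repeats are present for oversampling, this step should of course be skipped for those rows.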
Tip three: Finding and replacing text
Again, text cleaning plays a role in all data mining, since the attributes need to be consistent. Data mining uses text labels to uniquely identify and group results. A find-and-replace operation can systematically correct errors across many columns.
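A systematic find-and-replace over every column can be sketched like this (the label mapping is invented for illustration; here whole cell values are replaced, which is the safe default for categorical labels):

```python
def find_replace(rows, mapping):
    """Apply the same label substitutions to every cell in every column."""
    return [[mapping.get(cell, cell) for cell in row] for row in rows]

mapping = {"US": "United States", "U.S.": "United States"}  # illustrative labels
rows = [["US", "Gold"], ["U.S.", "Silver"], ["Canada", "Gold"]]
print(find_replace(rows, mapping))  # both spellings collapse to "United States"
```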
Tip four: Changing the case of text
I like to have a consistent case structure in the source data. In a test I did with Microsoft Clustering, the machine learning algorithm ignored case (even when I changed the server collation setting that governs case sensitivity). In some applications the case might matter, and then the remedy is tip three: use find and replace to give the differently cased values distinct expressions.
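Collapsing case variants up front is a one-liner; a small sketch, with made-up values:

```python
def canonical_case(values):
    """Collapse case variants so 'USA', 'usa', and 'Usa' count as one attribute value."""
    return [v.casefold() for v in values]

print(canonical_case(["USA", "usa", "Usa"]))  # all three become "usa"
```

If case is actually meaningful in your data, do the opposite: rewrite one variant to a distinct label (per tip three) before this kind of normalization erases the difference.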
Tip five: Removing spaces and nonprinting characters from text
I would recommend removing nonprinting characters even if Analysis Services takes them. By “nonprinting” we could be referring to characters which are not aligned with the language collation setting for Analysis Services. I did another test adding spacing between characters (from Excel) and Analysis Services does recognize a difference. Thus, an inadvertent double space (such as “United States”) would be a separate attribute.
Tip six: Fixing numbers and number signs
Excel can store numbers as text, and that presents a problem to Analysis Services. I tested this by building a time series whose supposedly numeric values were stored as text (done by adding a single apostrophe before each value). Another kind of issue is a numeric code that is not intended to be treated as a numeric value. See the Microsoft tips page on how to fix that.
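When the data leaves Excel, a defensive coercion step catches text-typed numbers. This is a sketch: Excel's leading apostrophe is normally a formatting marker rather than part of the value, but I strip it here just in case an export carries it through.

```python
def to_number(cell):
    """Coerce a text cell to a number; return None when it is not numeric."""
    if isinstance(cell, (int, float)):
        return cell
    text = str(cell).strip().lstrip("'")  # drop a stray text-marker apostrophe
    try:
        return float(text)
    except ValueError:
        return None  # genuinely non-numeric; handle as a category or missing value

print([to_number(c) for c in ["'42", "3.5", 7, "n/a"]])  # -> [42.0, 3.5, 7, None]
```

A value like a ZIP code that survives this coercion may still belong in a categorical column, which is the second issue the tip mentions.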
Tip seven: Fixing dates and times
Dates continue to be, in my experience, the most perplexing topic. I once had a legal client who hired me to fix the week-of-year as it printed on their Microsoft Access report (linked to Excel). Excel uses integers to represent dates and decimal fractions to represent time. On my current settings, the number 1 corresponds to January 1, 1900, at 12 AM, and the number zero corresponds to the mythical date of January 0, 1900, at 12 AM. The Microsoft tip includes important date and time functions to apply to the data.
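The serial-number scheme can be decoded outside Excel too. A sketch, assuming the default 1900 date system: the epoch below is chosen so that serials after February 1900 line up, because Excel deliberately preserves Lotus 1-2-3's bug of treating 1900 as a leap year (so serials 1 through 59 come out one day off with this simple formula).

```python
from datetime import datetime, timedelta

# 1899-12-30 rather than 1899-12-31: compensates for Excel's fictitious Feb 29, 1900
EXCEL_EPOCH = datetime(1899, 12, 30)

def serial_to_datetime(serial):
    """Convert an Excel serial (integer part = date, fraction = time) to a datetime."""
    return EXCEL_EPOCH + timedelta(days=serial)

print(serial_to_datetime(44197.5))  # 2021-01-01 12:00:00 under these assumptions
```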
Tip eight: Merging and splitting columns
I have a super data mining recommendation under this tip. Someone once asked me, for a medical industry data mining model, whether it is possible to have multiple target values for a supervised algorithm. Though I am not aware of machine learning algorithms that do this, there are statistical methods, such as MANOVA, that model multiple outputs. Practically, though, Analysis Services allows only one output. One way to allow for multiple outputs is to merge columns, which assumes a categorical target: two columns are combined into a new single column reflecting the diversity of its components. Splitting goes the other way, creating multiple predictors out of a single column. The Microsoft tip page shows how to approach that topic.
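The merge-for-multiple-targets idea can be sketched directly; the column positions, separator, and sample labels below are all invented for illustration:

```python
def merge_columns(rows, i, j, sep="|"):
    """Replace columns i and j with a single combined categorical column."""
    out = []
    for row in rows:
        merged = f"{row[i]}{sep}{row[j]}"
        rest = [c for k, c in enumerate(row) if k not in (i, j)]
        out.append(rest + [merged])
    return out

def split_column(rows, i, sep="|"):
    """Split one delimited column back into multiple columns (e.g. predictors)."""
    return [[c for k, c in enumerate(row) if k != i] + row[i].split(sep)
            for row in rows]

rows = [["patient-1", "diabetic", "hypertensive"]]
merged = merge_columns(rows, 1, 2)
print(merged)  # -> [['patient-1', 'diabetic|hypertensive']]
```

The merged label "diabetic|hypertensive" then serves as the single categorical target that Analysis Services expects.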
Tip nine: Transforming and rearranging columns and rows
Excel’s transpose function might as well be named normalize, but that word itself might confuse people expecting statistical language instead of relational database language. Also, even for database people, normalization comes in degrees, and a transpose operation might not complete the task. In any case, I find myself using transpose in the most common situation: when observations are in columns instead of rows (the Excel Data Mining add-in expects each column to hold a single attribute and its values).
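The transpose itself is a tiny operation once the data is in a row-of-rows structure; a minimal sketch:

```python
def transpose(rows):
    """Swap rows and columns, e.g. to move observations from columns into rows."""
    return [list(col) for col in zip(*rows)]

# Two observations laid out as columns become two rows:
print(transpose([["age", 34, 52], ["income", 40000, 65000]]))
```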
Tip ten: Reconciling table data by joining or matching
Sometimes there is a need to join tables, and Excel functions like LOOKUP simulate what a relational database does. More basically, that family of functions allows array-type access too. It’s a “join” in the sense that the returned value becomes part of another table (as a reminder, Excel Data Mining with the add-in only works when a range is formatted as a table).
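That lookup-style join can be sketched with a dictionary; the table contents are invented, and this assumes the lookup table's first column is the key and its second is the value being pulled in (roughly what a two-column VLOOKUP does):

```python
def lookup_join(rows, lookup_table, key_index):
    """Mimic a LOOKUP-style join: pull a matched value into each row by key."""
    index = {entry[0]: entry[1] for entry in lookup_table}
    # None marks keys with no match, analogous to Excel's #N/A
    return [row + [index.get(row[key_index])] for row in rows]

orders = [["A-100", "US"], ["A-101", "CA"]]
regions = [["US", "North America"], ["CA", "North America"], ["FR", "Europe"]]
print(lookup_join(orders, regions, 1))  # each order row gains a region column
```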
Click on each formula link to learn more
None of the previous data cleaning tips talks about missing data or outliers. Missing data (sometimes called “messy data” by statisticians) might be an issue, because there are different kinds of missing. Here is a list off the top of my head (based on my experience):
- null (meaning no value)
- a blank cell, which is Excel’s default null (though a blank might be a valid entry where a text field expects a space)
- an entered zero (where the zero was actually keyed into some system to flag a missing value); people sometimes make other substitutions too, such as a letter or a negative number where a number is expected
I prefer to use a formal null when available; some languages have such a construct. In Excel, an analyst will have to decide what to do.
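The list above can be collapsed into one formal null. A sketch, with a caveat: the sentinel set here is purely illustrative, and treating "0" or "-999" as missing is only correct when you know the source system used them that way.

```python
# Illustrative only: which values mean "missing" depends entirely on the source system
MISSING_SENTINELS = {"", "0", "N/A", "-999"}

def normalize_missing(cell):
    """Map the various 'missing' encodings to a single None (a formal null)."""
    if cell is None:
        return None
    if str(cell).strip() in MISSING_SENTINELS:
        return None
    return cell

print([normalize_missing(c) for c in [None, "", " ", "N/A", 5, "-999"]])
```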
For a numeric value, an outlier might be defined as an entry more than two standard deviations from the mean. However, that simple heuristic may or may not describe an outlier in a multivariate distribution. Someone may need to study values across combinations of variables to see what does and does not fall within the range of peer observations. There may be metadata indicating that some value is an outlier. And the mere fact that a particular combination appears might itself be an outlier (even though the individual attributes and values all fall within expected ranges column by column).
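The two-standard-deviation heuristic is simple enough to sketch directly; keep in mind it is univariate and, per the discussion above, may miss multivariate outliers entirely:

```python
from statistics import mean, stdev

def flag_outliers(values, threshold=2.0):
    """Flag values more than `threshold` sample standard deviations from the mean."""
    m, s = mean(values), stdev(values)
    return [abs(v - m) > threshold * s for v in values]

readings = [10, 11, 9, 10, 12, 10, 50]  # made-up data with one obvious extreme
print(flag_outliers(readings))  # only the 50 is flagged
```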
The data mining tips talked extensively about text transformations but not about numeric transformations. Two common transformations for numbers are applying the log (either base 10 or the natural log) or a square root. These transformations tend to reduce skew in numeric distributions, letting the data mining algorithm see the values spread more evenly.
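Both transformations are one-liners; a small sketch, noting the domain restrictions that apply before you reach for either one:

```python
import math

def log10_transform(values):
    """Base-10 log transform; only defined for strictly positive values."""
    return [math.log10(v) for v in values]

def sqrt_transform(values):
    """Square-root transform; only defined for non-negative values."""
    return [math.sqrt(v) for v in values]

print(log10_transform([1, 10, 100]))  # -> [0.0, 1.0, 2.0]
print(sqrt_transform([4, 9]))         # -> [2.0, 3.0]
```

Zeros or negatives in the data (including zeros standing in for missing values, as discussed above) must be handled before a log transform is applied.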
Source data, though, has inherent variation from the original measurement. For example, numeric values sometimes come from an estimate, and “estimates” range from wild guesses to more guided and defined ones. Even a physical measuring device (such as a thermometer) has a known and calculable error range. These inherent source errors may play a role in deciding whether a numeric transformation is sensible. The machine learning algorithms in Analysis Services treat numeric values (raw or transformed) as if there were no measurement error.