QuadExcel.com

Top 10 ways to Clean your Data in Microsoft Excel

Microsoft has provided a handy guide for data cleaning in Excel.

In this blog post, I will list their top ten ways to clean data and briefly describe the importance of each to data mining. Data cleaning is an important prerequisite for data mining, and it may take up most of the time spent on a data mining project. The Microsoft tips are the place to start, but I will also add some data cleaning recommendations of my own.

Follow the link below each tip for a more detailed explanation with real-world examples.

Tip one: Spell Checking

I used to be a better speller, but my skill has been eroded by all sorts of automated agents which check spelling even while I type. WordPress has various spell checkers too (I use one called After the Deadline, which audits spelling and grammar). Excel checks spelling, and what matters for data mining is that attribute labels display consistently in the results.

How to Spell Check in Excel

Tip two: Removing duplicate rows

SQL Server Data Mining will equally consider all rows presented to it. Removing the duplicate rows could be important, and in my experience, the most obvious error would be when entire blocks of rows are (for whatever reason) included twice in the data. Filtering can be done in Excel or PowerPivot. Duplicate rows do not automatically represent a problem, and in the case of oversampling, someone could manually repeat rows intentionally to give them more weight in the final analysis.
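Outside Excel, the same idea can be sketched in a few lines of Python. This is only an illustration: the function name `remove_duplicate_rows` and the sample rows are made up for the example, and it drops exact duplicates only, keeping the first occurrence of each row.

```python
def remove_duplicate_rows(rows):
    """Return rows with exact duplicates dropped, keeping first occurrences in order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)  # lists are unhashable, so use a tuple as the set key
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Hypothetical sample data: the third row repeats the first.
rows = [
    ["Alice", "Seattle", 34],
    ["Bob", "Denver", 51],
    ["Alice", "Seattle", 34],
]
deduped = remove_duplicate_rows(rows)
print(deduped)
```

If rows were repeated on purpose for oversampling, a step like this would undo that weighting, so it should be applied deliberately.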

How to Remove Duplicate Rows in Excel

Tip three: Finding and replacing text

Again, text cleaning plays a role in all data mining since the attributes need to be consistent. Data mining uses text labels to uniquely determine and group results. A find-and-replace operation can systematically handle errors among many columns.

How to Use Find and Replace in Excel

Tip four: Changing the case of text

I like to have consistent case structure in the source data. In a test I ran with Microsoft Clustering, the machine learning algorithm ignored case (even when I changed the server collation setting that governs case sensitivity). In some applications the case might matter, and then the remedy would be tip three (use find and replace to give the two or more case variants different expressions).

How to Quickly and Easily Change Case in Excel

Tip five: Removing spaces and nonprinting characters from text

I would recommend removing nonprinting characters even if Analysis Services accepts them. By “nonprinting” we could be referring to characters which are not aligned with the language collation setting for Analysis Services. I did another test adding spacing between characters (from Excel), and Analysis Services does recognize a difference. Thus, an inadvertent double space (such as “United  States”) would be treated as a separate attribute.

How to Remove Non-Printable Characters in Excel.

Tip six: Fixing numbers and number signs

Excel can store numbers as text, and that issue presents a problem to Analysis Services. I tested out this fact by making a time series with supposedly numeric values stored as text (done by adding a single apostrophe before the value). Another type of issue might be a numeric category which is not intended to be a numeric value. See the Microsoft tips page on how to fix that issue.

How to Fix Numbers Stored as Text

Tip seven: Fixing dates and times

Dates continue to be, in my experience, the most perplexing topic. I once had a legal client who hired me to fix the week-of-year as it would print on their Microsoft Access report (linked to Excel). Excel uses integers to represent dates, and decimal parts to represent time. On my current settings, the number 1 corresponds to January 1, 1900, at 12 AM, and the number zero corresponds to the mythical date of January 0, 1900, 12 AM. The Microsoft tip includes important date and time functions to apply to data.
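To make the serial-number idea concrete, here is a small Python sketch that converts a serial number from Excel's default 1900 date system into a date and time. The function name is hypothetical. It also accounts for a known quirk: Excel treats 1900 as a leap year, so serial 60 stands for a February 29, 1900 that never existed.

```python
from datetime import datetime, timedelta


def excel_serial_to_datetime(serial):
    """Convert a 1900-system Excel serial number to a datetime.

    Serials below 60 count from 1899-12-31; serials of 61 or more count
    from 1899-12-30, because Excel inserts a fictitious 1900-02-29 at 60.
    """
    if 60 <= serial < 61:
        raise ValueError("serial 60 is Excel's fictitious 1900-02-29")
    epoch = datetime(1899, 12, 31) if serial < 60 else datetime(1899, 12, 30)
    # The integer part is the day count; the fraction is the time of day.
    return epoch + timedelta(days=serial)


print(excel_serial_to_datetime(1))        # the first day of 1900
print(excel_serial_to_datetime(36526.5))  # noon on a later date
```

Workbooks that use the 1904 date system (an option in Excel) would need a different starting point, which this sketch does not handle.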

Tip eight: Merging and splitting columns

I have a super data mining recommendation under this tip. Someone once asked me, for a medical industry data mining model, if it is possible to have multiple target values for a supervised algorithm. Though I am not aware of machine learning algorithms which do this task, there are statistical methods like MANOVA which have multiple outputs. Practically, though, the Analysis Services technology will only allow one output. A way to allow for multiple outputs is to merge columns together, which would assume a categorical target. Thus, two columns would therefore be combined (through merging) to make a new single column reflecting the diversity of its components. The splitting idea would create multiple predictors out of a single column. The Microsoft tip page shows how to approach that topic.
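As a rough sketch of the merging idea, the Python below combines two categorical target columns into one label per row. The function name, the `|` separator, and the sample values are all assumptions made for illustration.

```python
def merge_targets(first, second, sep="|"):
    """Combine two categorical columns into a single label per row."""
    return [f"{a}{sep}{b}" for a, b in zip(first, second)]


# Hypothetical target columns for the same four observations.
risk = ["high", "low", "high", "low"]
readmitted = ["yes", "no", "no", "no"]

combined = merge_targets(risk, readmitted)
print(combined)
print(sorted(set(combined)))  # the distinct categories the model would see
```

The merged column has at most as many categories as the product of the two originals, so rare combinations may end up with very few rows.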

How to Merge and Split columns

Tip nine: Transforming and rearranging columns and rows

Excel’s transpose function might as well be named normalize, but that word itself might be confusing to people expecting statistical language instead of relational database language. Also, even for the database people, normalization is by degrees, and a transpose operation might not complete the task. In any case, I find myself using transpose for the most common situation, when the observations are in columns instead of rows (since the Excel Data Mining add-in will expect the columns to have single attributes and values).

How to Transpose Rows and Columns in Excel 2013

Tip ten: Reconciling table data by joining or matching

Sometimes there is a need to join tables, and Excel functions like LOOKUP simulate what a relational database does. More basically, that family of functions allows array-type access too. It’s a “join” in the sense that the returned value becomes part of another table (as a reminder, Excel Data Mining with the add-in only works when a range is formatted as a table).
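The "join" behavior of an exact-match lookup can be sketched in Python with a dictionary. Everything here is illustrative: `lookup_join`, the product codes, and the prices are invented for the example and are not Excel's own implementation.

```python
def lookup_join(rows, key_index, table, default=None):
    """Append to each row the value that `table` holds for the row's key."""
    return [row + [table.get(row[key_index], default)] for row in rows]


# Hypothetical order rows (product code, quantity) and a price table.
orders = [["A100", 3], ["B200", 1], ["C300", 5]]
prices = {"A100": 2.5, "B200": 10.0}

joined = lookup_join(orders, key_index=0, table=prices)
print(joined)
```

Keys missing from the lookup table come back as `None` here, which plays roughly the role of the #N/A error that an exact-match lookup returns in Excel.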

Click on each formula link to learn more

VLOOKUP, HLOOKUP, INDEX MATCH, VLOOKUP MATCH, INDEX MATCH MATCH

Also read How to Decide Which Excel Lookup Formula to Use

Additional comments

None of the previous data cleaning tips talk about missing data or outliers. Missing data (sometimes called “messy data” by statisticians) might be an issue, because there are different types of missing. Here is a list off the top of my head (based on my experience):

I prefer to use a formal null when available — some languages have such a construct. In Excel, an analyst will have to decide what to do.

For a numeric value, an outlier might be an entry which is more than two standard deviations from the mean. However, that simple heuristic may or may not describe an outlier in a multivariate distribution. Someone may need to study values across combinations of variables to see what does and does not fall within the range of peer observations. There may be metadata which indicates that some output is an outlier. And the fact that a particular combination appears might itself be an outlier (even though the individual attributes and values all fall in expected ranges by columns).
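The two-standard-deviation heuristic for a single numeric column could look like the following Python sketch. The function name, the threshold default, and the sample readings are assumptions for illustration; it uses the sample standard deviation and says nothing about multivariate outliers.

```python
from statistics import mean, stdev


def flag_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    center = mean(values)
    spread = stdev(values)  # sample standard deviation; needs at least two values
    return [v for v in values if abs(v - center) > threshold * spread]


# Hypothetical readings with one suspicious entry.
readings = [10, 11, 9, 10, 12, 10, 11, 9, 10, 50]
flagged = flag_outliers(readings)
print(flagged)
```

Note that an extreme value inflates the mean and standard deviation used to judge it, which is one reason this rule is only a first pass.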

The data cleaning tips talked extensively about text transformations, but did not talk about numeric transformations. Two common transformations for numbers are the log (either base 10 or natural log) and the square root. Those transformations tend to normalize numeric distributions and allow the data mining algorithm to look at the values in a wider spread.
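For illustration, here is how the log and square-root transformations might be applied to a column of positive numbers in Python. The sample values are made up, and the sketch assumes every value is strictly positive for the logs and non-negative for the square root.

```python
import math

# Hypothetical skewed values spanning several orders of magnitude.
values = [1, 10, 100, 1000]

log10_values = [math.log10(v) for v in values]  # base-10 log
ln_values = [math.log(v) for v in values]       # natural log
sqrt_values = [math.sqrt(v) for v in values]    # square root

print(log10_values)
print(sqrt_values)
```

After the base-10 log, values that were spread over three orders of magnitude sit on an evenly spaced scale, which is the kind of compression the paragraph above describes.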

Source data, though, has inherent variation in original measurement. For example, sometimes numeric values result from an estimate, and “estimates” range from wild guesses to more guided and defined estimates. Even a physical measuring device (such as measuring the temperature) has a known and calculable error range. These inherent source errors may play a role in determining whether a numeric transformation is logical. The machine learning algorithms in Analysis Services look at numeric values (raw or transformed) as if there is no measurement error.


Courtesy: marktab.net

How to Spell Check in Excel 2013

You’ll be relieved to know that Excel 2013 has a built-in spell checker that can catch and remove all those embarrassing little spelling errors. With this in mind, you no longer have any excuse for putting out worksheets with typos in the titles or headings.

To check the spelling in a worksheet, you have the following options:

  • Click the Spelling command button on the Ribbon’s Review tab

  • Press Alt+RS

  • Press F7

Any way you do it, Excel begins checking the spelling of all text entries in the worksheet. When the program comes across an unknown word, it displays the Spelling dialog box.

Excel suggests replacements for the unknown word shown in the Not in Dictionary text box with a likely replacement in the Suggestions list box of the Spelling dialog box. If that replacement is incorrect, you can scroll through the Suggestions list and click the correct replacement. Use the Spelling dialog box options as follows:

  • Ignore Once and Ignore All: When Excel’s spell check comes across a word its dictionary finds suspicious but you know is viable, click the Ignore Once button. If you don’t want the spell checker to bother querying you about this word again, click the Ignore All button.

  • Add to Dictionary: Click this button to add the unknown (to Excel) word — such as your name — to a custom dictionary so that Excel won’t flag it again when you check the spelling in the worksheet later on.

  • Change: Click this button to replace the word listed in the Not in Dictionary text box with the word Excel offers in the Suggestions list box.

  • Change All: Click this button to change all occurrences of this misspelled word in the worksheet to the word Excel displays in the Suggestions list box.

  • AutoCorrect: Click this button to have Excel automatically correct this spelling error with the suggestion displayed in the Suggestions list box (by adding the misspelling and suggestion to the AutoCorrect dialog box).

  • Dictionary Language: To switch to another dictionary (such as a United Kingdom English dictionary, or a French dictionary when checking French terms in a multilingual worksheet), click this drop-down button and then select the name of the desired language in the list.

  • Options: Click this button to open the Proofing tab in the Excel Options dialog box, where you can modify the current Excel spell-check settings such as Ignore Words in Uppercase, Ignore Words with Numbers, and the like.

Notice that the Excel spell checker not only flags words not found in its built-in or custom dictionary, but also flags occurrences of double words in a cell entry (such as total total) and words with unusual capitalization (such as NEw York instead of New York).

By default, the spell checker ignores all words with numbers and all Internet addresses. If you want it to ignore all words in uppercase letters as well, click the Options button at the bottom of the Spelling dialog box, and then select the Ignore Words in UPPERCASE check box before clicking OK.

You can check the spelling of just a particular group of entries by selecting the cells before you click the Spelling command button on the Review tab of the Ribbon or press F7.

Excel also has a Thesaurus pane that enables you to find synonyms for the label entered into the cell that’s current when you open the pane (or that you type into its text box). To open the Thesaurus pane, select Review→Thesaurus in the Proofing group at the beginning of the Review tab on the Ribbon or press Shift+F7.

Excel then opens a pane showing a list of all the synonyms for the label in the current cell or the term manually entered in its text box. To view more synonyms for a particular term in the list, select it. To replace the label entered in the current cell with a term in the Thesaurus list, select Insert on the term’s drop-down menu.



How to Use Find and Replace in Excel 2013

If your purpose for finding a cell with a particular entry in Excel 2013 is so that you can change it, you can automate this process by using the Replace tab on the Find and Replace dialog box. If you click Home→Find & Select→Replace or press Ctrl+H or Alt+HFDR, Excel opens the Find and Replace dialog box with the Replace tab (rather than the Find tab) selected.

On the Replace tab, enter the text or value you want to replace in the Find What text box, and then enter the replacement text or value in the Replace With text box.

When you enter replacement text, enter it exactly how you want it to appear in the cell. In other words, if you want to replace all occurrences of Jan in the worksheet with January, enter the following in the Replace With text box:

January

Make sure that you use a capital J in the Replace With text box, even though you can enter the following in the Find What text box (providing you don’t check the Match Case check box that appears only when you choose the Options button to expand the Find and Replace dialog box options):

Jan

After specifying what to replace and what to replace it with, you can have Excel replace occurrences in the worksheet on a case-by-case basis or globally. To replace all occurrences in a single operation, click the Replace All button.

Be careful with global search-and-replace operations; they can really mess up a worksheet in a hurry if you inadvertently replace values, parts of formulas, or characters in titles and headings that you hadn’t intended to change. With this in mind, always follow one rule: Never undertake a global search-and-replace operation on an unsaved worksheet.

Also, verify whether the Match Entire Cell Contents check box (displayed only when you click the Options button) is selected before you begin. You can end up with many unwanted replacements if you leave this check box unselected when you really only want to replace entire cell entries (rather than matching parts in cell entries).
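The difference between partial and whole-cell replacement is easy to see in a small Python sketch. The function `replace_in_cells` and its parameter names are hypothetical stand-ins for the dialog box options; this is not how Excel implements the feature, and it ignores Excel's wildcard characters.

```python
import re


def replace_in_cells(cells, find, replace_with, match_case=False, match_entire_cell=False):
    """Replace `find` in each cell, optionally requiring the whole cell to match."""
    flags = 0 if match_case else re.IGNORECASE
    pattern = re.compile(re.escape(find), flags)
    result = []
    for cell in cells:
        if match_entire_cell:
            result.append(replace_with if pattern.fullmatch(cell) else cell)
        else:
            # A function replacement avoids treating backslashes in replace_with specially.
            result.append(pattern.sub(lambda _match: replace_with, cell))
    return result


cells = ["Jan", "jan sales", "January"]
partial = replace_in_cells(cells, "Jan", "January")
whole = replace_in_cells(cells, "Jan", "January", match_entire_cell=True)
print(partial)
print(whole)
```

With partial matching, the cell that already held January is changed again and ends up garbled, which is exactly the kind of unwanted replacement the check box guards against.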

If you do make a mess, immediately click the Undo button on the Quick Access toolbar or press Ctrl+Z to restore the worksheet.

To see each occurrence before you replace it, click the Find Next button or press Enter. Excel selects the next cell with the text or value you enter in the Find What text box. To have the program replace the selected text, click the Replace button. To skip this occurrence, click the Find Next button to continue the search. When you finish replacing occurrences, click the Close button.



How to Quickly and Easily Change Case in Excel 2013

What if you want to convert text from uppercase to lowercase, from lowercase to uppercase, or from any mix to proper case to make it more readable? Imagine the text is very long, or comes from another sheet or even from another file. How can you change the case of text in Microsoft Excel?

Use the UPPER function in Microsoft Excel 2013

Converts text to uppercase. Text can be a reference or text string.

Use the LOWER function in Microsoft Excel 2013

Converts text to lowercase. The LOWER function does not change characters in text that are not letters. Text can be a reference or text string.

Use the PROPER function in Microsoft Excel 2013

Capitalizes the first letter in a text string and any other letter that follows a character other than a letter, and converts all other letters to lowercase. Text can be a reference or text string. The PROPER function is a very practical one.
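For readers who also work outside Excel, the three functions have close counterparts among Python's string methods. Treat the mapping as an approximation: `str.title()` follows a similar rule to PROPER (capitalize after any non-letter), but the two may differ on unusual characters. The sample text is made up.

```python
text = "hello WORLD, it's a 2-way street"

upper = text.upper()   # comparable to =UPPER(A1)
lower = text.lower()   # comparable to =LOWER(A1)
proper = text.title()  # comparable to =PROPER(A1)

print(upper)
print(lower)
print(proper)
```

Notice that the letter after the apostrophe and the letter after the hyphen are both capitalized, because each follows a character that is not a letter.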



Remove Non-Printable Characters in Excel.

CLEAN Function Overview

When data files are imported or copied into an Excel worksheet, non-printable characters can sometimes be included along with the good data.

A common example of these non-printable characters is the small box shaped character seen in cell D1 in the image above.

The CLEAN function can be used to remove most of these non-printable characters.

ASCII Character Codes

Each character on a computer – printable and non-printable – has a number known as its ASCII code or value.

ASCII stands for the American Standard Code for Information Interchange. Standard ASCII defines 128 codes (0 to 127); the codes from 128 to 255 mentioned later in this article belong to extended character sets built on top of it.

The first 32 characters (0 to 31) of the ASCII set are referred to as control characters and they are used by programs to control peripheral devices such as printers.

As such, they are not intended for use in a worksheet and they cannot be printed. In addition, they also create errors in Excel formulas if they are present in a cell along with good data.

The CLEAN Function’s Syntax and Arguments

A function’s syntax refers to the layout of the function and includes the function’s name, brackets, and arguments.

The syntax for the CLEAN function is:

= CLEAN ( Text )

Text – (required) the data from which you want to remove non-printable characters. Although the argument can contain the actual data, it can also be a cell reference to the location of the data in the worksheet.

Example: Remove Non-Printable Characters with the CLEAN Function

The image above shows a box-shaped non-printable character (ASCII code #12) along with the number 10 in cell D1 of an Excel worksheet.

Cell F1 contains a formula that is attempting to add the contents of cells D1 + E1, but the presence of the non-printable character in D1 has resulted in a #VALUE! error value appearing in cell F1 rather than the formula answer.

Row two in the image uses the CLEAN function to remove the non-printable character, which permits the formula in F2 to calculate the result of 25.

This example will go through the steps necessary to remove the non-printable character with the CLEAN function as seen in row two of the image.

Entering a Non-Printable Character into a Worksheet

In order to have a non-printable character to remove, the first step will be to enter one into the worksheet using the CHAR function. This function converts an ASCII code into an actual character.

  1. Enter the following data into cell D1
    =CHAR(12) & "10" – this enters the non-printable character next to the number 10 (type straight quotation marks, since curly quotes are not valid in a formula)
  2. If the box-shaped character is not present next to the number, as shown in the image above, change the font for cell D1 to Verdana – some fonts do not display non-printable characters
  3. Enter the number 15 into cell E1
  4. Click on cell F1 and enter the formula = D1 + E1
  5. Press the Enter key on the keyboard to complete the formula
  6. The #VALUE! error value should appear in cell F1 due to the presence of the non-printable character in cell D1

Entering the CLEAN Function

  1. Click on cell D2 to make it the active cell
  2. Click on the Formulas tab of the ribbon menu
  3. Choose Text from the ribbon to open the function drop down list
  4. Click on CLEAN in the list to bring up the function’s dialog box
  5. In the dialog box, click on the Text line
  6. Click on cell D1 in the worksheet
  7. Click OK
  8. The number 10 should appear in cell D2 on its own
  • If you click on cell D2 the complete function = CLEAN ( D1 ) appears in the formula bar above the worksheet

Entering the Addition Formula

  1. Enter the number 15 into cell E2
  2. Click on cell F2 and enter the formula = D2 + E2
  3. Press the Enter key on the keyboard
  • The number 25 – the answer to the formula – should appear in cell F2 now that the non-printable character has been removed

The CLEAN Function and Other Non-Printable Characters

As mentioned above, the CLEAN function was designed to remove the first 32 non-printable ASCII characters from a worksheet.

There are additional non-printable ASCII characters – #127, #129, #141, #143, #144, and #157 – which, according to the Microsoft help file, cannot be removed by the CLEAN function alone.

In actuality, however, the CLEAN function does remove ASCII characters #129, #141, #143, #144, and #157, which leaves #127 as the only non-printable character that the function cannot remove.

ASCII #127 is the code for the delete key on the keyboard, and, while it cannot be removed by the CLEAN function, it can be removed using a formula containing the SUBSTITUTE and CHAR functions.
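The behavior described in this section can be mimicked in Python as a rough sketch. The function names are hypothetical, and the code only models what the text says: CLEAN strips codes 0 to 31, and code 127 needs a separate substitution (in Excel, something along the lines of nesting CLEAN inside SUBSTITUTE with CHAR(127)).

```python
def clean(text):
    """Drop control characters with codes 0 to 31, as CLEAN is described doing."""
    return "".join(ch for ch in text if ord(ch) > 31)


def clean_with_delete(text):
    """Also drop character 127, which the plain clean step leaves in place."""
    return clean(text).replace(chr(127), "")


raw = chr(12) + "10"           # same content as the example in cell D1
total = int(clean(raw)) + 15   # mirrors the addition in row two

print(repr(raw), repr(clean(raw)), total)
print(repr(clean_with_delete("ab" + chr(127) + "c")))
```

Once the control character is gone, the remaining text converts to a number and the addition works, which is what row two of the worked example demonstrates.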

Removing Non-Breaking Spaces from a Worksheet

Similar to non-printable characters is the non-breaking space (&nbsp;), which can also cause problems with calculations and formatting in a worksheet. The character code for a non-breaking space is #160, which lies in the extended range above standard ASCII.

Non-breaking spaces are used extensively in web pages, so if data is copied into Excel from a web page, non-breaking spaces may show up in a worksheet.

This tutorial on removing non-breaking spaces explains how to remove them with a formula that combines the SUBSTITUTE, CHAR, and TRIM functions.
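A Python sketch of the same three-step idea follows: swap each non-breaking space for an ordinary space, then trim. The function names are made up, and the `trim` helper only approximates Excel's TRIM (strip leading and trailing spaces, collapse internal runs of spaces to one).

```python
def trim(text):
    """Strip leading/trailing spaces and collapse runs of spaces to a single space."""
    return " ".join(part for part in text.split(" ") if part)


def remove_nbsp(text):
    """Replace non-breaking spaces (code 160) with regular spaces, then trim."""
    return trim(text.replace(chr(160), " "))


# Hypothetical cell content pasted from a web page.
pasted = chr(160) + "United" + chr(160) + chr(160) + "States "
cleaned = remove_nbsp(pasted)
print(repr(cleaned))
```

The replacement has to happen before trimming, because a trim step that only looks for ordinary spaces would leave the non-breaking ones untouched.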



Fixing Numbers Stored as Text

Sometimes when you import data or receive data from another source, the numbers might be converted to text. When you try to sum them, nothing works. That is because Excel will not sum numbers stored as text.

When numbers in a sheet are being stored as text, Excel lets you know by placing a green triangle in the cell (if File, Options, Formulas, Error Checking, Enable Background Error Checking is selected). When you select the cell and click the warning sign that appears, Excel informs you that the number is being stored as text, as shown in Figure 3.19. It then gives you options for handling the number, such as Convert to Number or Ignore Error.

Figure 3.19. With Background Error Checking enabled, Excel informs you if a number is being stored as text.

If you have a worksheet with thousands of cells, it will take a long time to convert them all to numbers. Three options for doing a larger-scale conversion are covered in the next sections.

Using Convert to Number on a Range

One option for converting multiple cells into numbers is to use the information drop-down that Excel has provided:

  1. Select the range consisting of all the cells you need to convert (making sure that the first cell in the range needs to be converted). The range can include text and other numerical values, as long as it doesn’t include cells you do not want to be converted to numbers.
  2. Click the warning symbol in the first cell.
  3. From the drop-down, select Convert to Number, and all cells in the selected range will be modified, turning the numbers to true numbers.

Using Paste Special to Force a Number

If you have the Background Error Checking disabled and don’t see the green warning triangle, try this method for converting cells to numbers:

  1. Enter a 1 in a blank cell and copy it.
  2. Select the cells containing the numbers, right-click and select Paste Special, Paste Special.
  3. From the dialog box that opens, select Multiply, and click OK. The act of multiplying the values by 1 forces the contents of the cells to become their numerical values.
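The same coercion idea can be sketched in Python for data that has already left Excel. The function name `to_number` and the sample cells are assumptions for illustration; anything that does not parse as a number is left unchanged.

```python
def to_number(value):
    """Convert numeric-looking text to a float; leave everything else as it is."""
    if isinstance(value, str):
        try:
            return float(value.strip())
        except ValueError:
            return value
    return value


# Hypothetical column mixing text numbers, real numbers, and a label.
cells = ["10", " 15 ", "n/a", 7]
converted = [to_number(c) for c in cells]
numeric_total = sum(c for c in converted if isinstance(c, (int, float)))

print(converted)
print(numeric_total)
```

Before conversion, the two text entries would be skipped by a sum, just as Excel skips numbers stored as text.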

Using Text to Columns to Convert Text to Numbers

In step 3 of the Text to Columns wizard, you select the data type of a column. You can use this functionality to also correct numbers being stored as text. To convert a column of numbers stored as text to just numbers, follow these steps:

  1. Highlight the range of text to be converted.
  2. Go to Data, Text to Columns.
  3. Click Finish. The numbers are no longer stored as text.



Merging and Splitting Cells in Excel

You can merge and center data horizontally or vertically across multiple cells in Excel 2010. You also can unmerge or split a merged cell into its original, individual cells. A common use of merge and center in Excel 2010 is to horizontally center a worksheet title over a table.

You can only split a cell that has previously been merged.

Merging and centering cells

Follow these steps to merge and center a range of cells:

  1. Select the range of cells you want to merge and center.

    You can use Merge & Center only on a contiguous, rectangle-shaped range of cells.

    A worksheet title before merging and centering.
  2. On the Home tab, in the Alignment group, click the Merge & Center button.

    The cells are merged into a single cell, and the text (if any) is centered within the merged cell.

    A worksheet title after centering it across columns A through E.
  3. (Optional) Change the alignment in the merged cell, if desired.

    For example, click the Align Text Right button in the Alignment group if you want the text in the merged cell to be right-aligned instead of centered.

If you want to merge cells without centering the text contained in the cell, click the drop-down arrow beside the Merge & Center button in the Alignment group and choose either Merge Across or Merge Cells.

Splitting a merged cell

If you need to split a cell that you’ve merged with the Merge & Center button, follow these steps:

  1. Select the merged cell.

    The Merge & Center button appears selected in the Alignment group.

  2. Click the Merge & Center button in the Alignment group.

    The merged cell reverts to a cell range again, and any text contained in the merged cell appears in the upper-left cell of the range.


How to Transpose Rows and Columns in Excel 2013

You may have noticed that in Microsoft Office Excel you can copy or move cells, rows, columns, or ranges as you like. Swapping rows and columns, however, may not seem to have an obvious shortcut. In fact, Excel supports a quick operation for exactly that, known as ‘Transpose’.

The feature works with most versions of Excel such as Excel 2010, 2007, 2003.

Transpose feature in Excel 2013

Here, I have created a spreadsheet in Excel 2013 in a vertical orientation, with the data laid out in a column. We will change it into a horizontal, row-style layout. Here’s how.

From your Excel sheet, select the desired cells you would want to change.

Once done, Right-Click and select the ‘Copy’ option.

Next, Right-Click in any empty cell and choose Paste Options > Transpose as shown in the screen-shot below.

One of the highlights of Excel 2013 is that it will show you a preview of what the paste would look like before confirming the action.

If you do not see it, as an alternative step you can Right-Click an empty cell and Select ‘Paste Special’.

Then, from its window you can check the ‘Transpose’ box and Click OK.

You will now see your data transposed to a new layout style.

The same feature allows you to convert horizontal row data sheets into the vertical column style sheets too.
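For comparison, the same swap of rows and columns takes one line in Python using the built-in `zip`. This is only a sketch with made-up data, and it assumes every row has the same number of cells.

```python
def transpose(rows):
    """Swap rows and columns of a rectangular table."""
    return [list(column) for column in zip(*rows)]


# Hypothetical vertical layout: one label and one value per row.
vertical = [
    ["Jan", 100],
    ["Feb", 120],
    ["Mar", 90],
]
horizontal = transpose(vertical)
print(horizontal)
```

Transposing the result again returns the original layout, matching the note above that the feature works in both directions.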

