Clean messy data by providing examples in Excel

While Excel isn’t usually my tool of choice for manipulating or analyzing data (I prefer to use it as a data source for R), it has just learned a new trick that’s likely to prove useful from time to time. Extracting the useful information from complicated or inconsistent formats can often be a pain, but with Excel’s new Transform Data By Example add-in, you can start with a column of messy data, manually provide a few examples of transformations by hand (3 or so usually does the trick — just as many as you need to cover all the variations), and Excel will automatically figure out how to generate similar examples for the rest of the rows.

For example, given just three examples of “Lastname. Firstname” format, the add-in can extract formatted names from a mixture of name formats while discarding titles and suffixes:

clean-messy-data-by-providing-examples-in-excel Clean messy data by providing examples in Excel 1495170264_828_clean-messy-data-by-providing-examples-in-excel Clean messy data by providing examples in Excel

Excel has had a system like this (called Flash Fill) for a while, which uses the automated rule-writing algorithm described in this paper. This new add-in extends the concept by searching not just the space of text transformations, but also the outputs from library of web-based services from Github, .NET and StackOverflow. You can even add your own transformations by publishing them as an Azure Function or web service. The video below shows an example from Python, but I imagine that R-based examples are equally possible.

For more, check out the link below.

Microsoft: Transform Data By Example

Source link

Previous Getting Value Faster with a Data Strategy
Next Refurb Belkin WeMo WiFi-Enabled Light Switch for $15 + free shipping