Written by 16:16 AI, Data

Cleaning your dirty data

Freshening up your information for free

In 2022, data scientists and data engineers are in high demand. The potential benefits of these hires for businesses are clear: we can mine the mass of data that our business collects every minute, and conjure up predictive machine learning models that help us make better, faster decisions. The reality, however, is that data experts spend up to 80% of their time cleaning data.

Data in its raw form is usually a mess. Gaps need to be filled, duplicates need to be removed, metadata needs to be added, outliers need to be dealt with – and everything needs to be unified and standardised. Fixing datasets requires a whole heap of manual work, bolted-together code and spreadsheet hacks. 
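
To make those chores concrete, here is a minimal sketch in plain Python of four of them: standardising text, removing duplicates, filling a gap and flagging an outlier. The records, the field names and the "five times the median" outlier rule are all made up for illustration; real projects would choose rules to suit their own data.

```python
from statistics import median

# Hypothetical raw records: messy spacing and casing, a duplicate,
# a missing value and one suspiciously large figure.
raw = [
    {"customer": " Acme Ltd ", "country": "uk", "spend": 120.0},
    {"customer": "acme ltd", "country": "UK", "spend": 120.0},
    {"customer": "Bolt & Co", "country": "UK", "spend": None},
    {"customer": "Cedar Inc", "country": "us", "spend": 95.0},
    {"customer": "Dune AG", "country": "DE", "spend": 110.0},
    {"customer": "Elm SA", "country": "fr", "spend": 9000.0},
]


def clean(records, outlier_factor=5):
    """Return cleaned copies of the records; the input list is not modified."""
    # 1. Standardise: collapse stray whitespace, upper-case country codes.
    rows = [
        {
            "customer": " ".join(r["customer"].split()),
            "country": r["country"].strip().upper(),
            "spend": r["spend"],
        }
        for r in records
    ]

    # 2. Remove duplicates, comparing names case-insensitively and
    #    keeping the first spelling seen.
    seen = set()
    unique = []
    for r in rows:
        key = (r["customer"].casefold(), r["country"])
        if key not in seen:
            seen.add(key)
            unique.append(r)

    # 3. Fill gaps with the median of the known values.
    typical = median(r["spend"] for r in unique if r["spend"] is not None)
    for r in unique:
        if r["spend"] is None:
            r["spend"] = typical

    # 4. Flag (rather than delete) outliers using a crude, assumed rule.
    for r in unique:
        r["outlier"] = r["spend"] > outlier_factor * typical

    return unique


cleaned = clean(raw)
for row in cleaned:
    print(row)
```

Flagging outliers instead of dropping them is a deliberate choice in this sketch: a person can then decide whether the figure is a typo or a genuinely large customer.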

Thankfully, there are tools that help. But many of the more sophisticated software solutions require huge investment, and will be out of reach for scaling startups or medium-sized enterprises. The fact is, we need just enough technology to get the job done – whether that’s cleaning the data for one part of the business, or developing a proof of concept for a bigger project.

Fresh and free


OpenRefine (formerly Google Refine) has long been a popular choice, not least because it’s free and open source. It requires a lot of configuration for the job in hand, but it’s highly customisable (if you have the technical knowledge). 

Paid-for options often offer easier set-up and more features. Trifacta Wrangler uses machine learning to identify inconsistencies in the data and suggest changes, and has a visual UI for setting up pipelines. The award-winning WinPure Clean & Match is focused on customer and business data (think CRM data), and combines an intuitive interface with tools such as fuzzy matching (which is great for picking up typos or inconsistent abbreviations).
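
For a feel of what fuzzy matching does, the sketch below uses Python's standard difflib module to group company names that are nearly, but not exactly, the same. It is not how WinPure or any other product mentioned here works internally; the names, the helper functions and the 0.85 threshold are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a score from 0 to 1, ignoring case and surrounding spaces."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def group_near_duplicates(names, threshold=0.85):
    """Put each name in the first group whose first member it resembles."""
    groups = []
    for name in names:
        for group in groups:
            if similarity(name, group[0]) >= threshold:
                group.append(name)
                break
        else:
            # No existing group was close enough, so start a new one.
            groups.append([name])
    return groups


# Hypothetical CRM entries containing a typo and inconsistent casing.
names = ["Acme Limited", "Acme Limted", "ACME LIMITED ", "Bolt & Co", "Cedar Inc"]
groups = group_near_duplicates(names)
for group in groups:
    print(group)
```

With these sample names, the three Acme variants land in one group while the other two companies stay separate. Abbreviations such as "Ltd" for "Limited" differ by more characters than a typo does, so catching them would likely need a lower threshold or a lookup table of known short forms.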

Trajektory specialises in sponsorship data, integrating across all your internal and partner data sources and generating reports for business users as well as analysts. And Turkish startup Sweephy is currently using a seed funding round to develop a no-code solution that will clean up data for machine learning models. 

With the right tools, your data will be sparkling in no time.