10 Things I Learned About Data Preparation
Introduction
Data preparation is a crucial step in any data science project. It allows you to clean your data, making sure it’s ready for analysis and model building. However, many new data scientists are surprised by how much work goes into preparing their datasets. In this post we’ll cover 10 important lessons I learned about data preparation during my first few months as a practicing data scientist:
Data is not always clean
Data is not always clean, and you’ll have to get used to that. You can’t just load it into your database and expect it to be ready for analysis. There are many things that can go wrong with your data, such as:
- Missing values – these are values in your dataset which don’t exist or aren’t known at the time of analysis
- Duplicate entries – multiple rows with identical information (often caused by manual entry errors or by importing the same data twice)
- Inconsistent formats – the same kind of information is recorded in different ways (e.g., dates written as 2023-01-05 in some rows and 05/01/2023 in others)
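To make these three problems concrete, here is a minimal sketch in Python that counts each of them in a tiny CSV. The sample data, the column names, and the `date_pattern` helper are all made up for illustration; real checks would need to be adapted to your own columns.

```python
import csv
import io
import re
from collections import Counter

# Hypothetical sample data: one missing amount, one duplicated row,
# and two different date formats in the same column.
RAW = """order_id,order_date,amount
1,2023-01-05,19.99
2,05/01/2023,
3,2023-01-07,5.00
3,2023-01-07,5.00
"""

def date_pattern(value):
    """Reduce a date string to its shape, e.g. '2023-01-05' -> 'dddd-dd-dd'."""
    return re.sub(r"\d", "d", value)

rows = list(csv.DictReader(io.StringIO(RAW)))

# Missing values: cells that are empty after trimming whitespace.
missing = sum(1 for row in rows for v in row.values() if v.strip() == "")

# Duplicate entries: rows whose full contents have already been seen.
row_counts = Counter(tuple(row.values()) for row in rows)
duplicates = sum(count - 1 for count in row_counts.values())

# Inconsistent formats: more than one shape in the date column.
date_shapes = Counter(date_pattern(row["order_date"]) for row in rows)

print("missing cells:", missing)
print("duplicate rows:", duplicates)
print("date shapes:", dict(date_shapes))
```

If the date column reports more than one shape, that is a hint (not proof) that the values were recorded in different formats.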
Your data might be broken up into multiple files
If you’re working with data that has been split into multiple files, you will need to combine them before you can use them. This is often the case when people export their data from an application like Excel or Google Sheets and save it as a CSV file.
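The Unix cat command can concatenate files into one large file. Keep in mind that cat copies every line as-is, so if each CSV file begins with a header row, that header will appear again in the middle of the combined file: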
- $ cat file1 file2 > combinedFile_new
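When the files are CSVs that each carry their own header row, plain concatenation repeats that header in the output. One possible workaround is sketched below in Python; `combine_csv` is a hypothetical helper, and the in-memory strings stand in for file1 and file2.

```python
import csv
import io

def combine_csv(sources, destination):
    """Append rows from several CSV streams, writing the header only once."""
    writer = csv.writer(destination, lineterminator="\n")
    expected_header = None
    for source in sources:
        reader = csv.reader(source)
        header = next(reader, None)
        if header is None:
            continue  # skip empty inputs
        if expected_header is None:
            expected_header = header
            writer.writerow(header)
        elif header != expected_header:
            raise ValueError(f"header mismatch: {header} != {expected_header}")
        writer.writerows(reader)

# In-memory stand-ins for file1 and file2 (hypothetical contents).
file1 = io.StringIO("id,name\n1,Ada\n2,Grace\n")
file2 = io.StringIO("id,name\n3,Alan\n")
combined = io.StringIO()
combine_csv([file1, file2], combined)
print(combined.getvalue())
```

With real files you would pass objects opened with `open(path, newline="")` instead of the `StringIO` stand-ins. Raising an error on a header mismatch is a design choice here; it stops you from silently stacking files whose columns don't line up.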
Data can contain a lot of null values
Null values are a common occurrence in many datasets, so it’s important that you know how to handle them if you want to produce reliable results from your analysis. A null means that a value is absent or unknown, which is not the same as zero or an empty string. Nulls affect the analysis because sums, averages, and counts come out differently depending on whether the null rows are skipped, counted as zero, or cause an error. Handling them well takes careful attention to detail and a keen understanding of what your data actually means.
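As a small illustration of how nulls can change a result, the sketch below averages the same column two ways. The `ages` list is invented, and `None` is assumed to be how nulls arrive in your data.

```python
from statistics import mean

# Hypothetical survey column; None marks a response that was never recorded.
ages = [34, None, 29, None, 41]

# Treating nulls as zero silently drags the average down.
as_zero = mean(0 if a is None else a for a in ages)

# Excluding nulls averages only the values that were actually observed.
observed = [a for a in ages if a is not None]
excluding = mean(observed)

print("nulls counted as zero:", as_zero)
print("nulls excluded:", round(excluding, 2))
```

Neither number is automatically the right one; which treatment makes sense depends on what a null means in that particular column.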
Some columns might be missing values
Missing values are a common problem in datasets of every size. They can be caused by:
- Data collection errors, such as when a device didn’t record any information or the person running a survey forgot to ask some questions.
- Data entry errors, where someone types in a number when they meant to write down “no” or something else.
- Data processing errors, like when a spreadsheet program turns a zero into an empty cell on import (this happened to me once).
Because the causes vary, missing values are not always easy to spot at first glance. If you’re working with a large dataset and haven’t accounted for them yet, don’t worry: there are several ways of dealing with missing values depending on which type you have, such as dropping the affected rows, filling the gaps with a typical value like the median, or flagging them so they can be handled separately.
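Here is a rough sketch of two common ways to handle a column once its missing values have been found: dropping them, or filling them with the median. The list of tokens treated as "missing" and the sample readings are assumptions for illustration, not a standard.

```python
from statistics import median

# Tokens treated as "missing" here are an assumption; adjust for your data.
MISSING_TOKENS = {"", "na", "n/a", "null", "none", "-"}

def to_number(raw):
    """Return a float, or None when the cell holds a missing-value token."""
    text = raw.strip()
    if text.lower() in MISSING_TOKENS:
        return None
    return float(text)

raw_column = ["12.5", "N/A", "14.0", "", "13.5"]  # hypothetical readings
values = [to_number(cell) for cell in raw_column]

# Option 1: drop the missing entries.
dropped = [v for v in values if v is not None]

# Option 2: fill them with the median, and keep a flag for which were filled.
fill = median(dropped)
imputed = [fill if v is None else v for v in values]
was_missing = [v is None for v in values]

print("dropped:", dropped)
print("imputed:", imputed)
print("was_missing:", was_missing)
```

Keeping the `was_missing` flag alongside the imputed values means a later analysis can still tell real measurements from filled-in ones.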
Don’t put all your eggs in one basket
A common mistake made by data scientists is to rely on one source of data. It’s tempting to build your whole analysis on a single feed or export, but this can lead to problems if that source goes offline or becomes unavailable for whatever reason.
Another thing that I learned from my experience was not to rely on just one type of data either: make sure you have multiple sources of information so that if one gets taken down (or otherwise corrupted), there are still others available for use in your analysis.
Always run a data quality check on your data before you begin
A good data quality check is an essential tool for any data scientist, and it can be used in many different ways. For example:
- You can use R or another programming language to check the quality of your dataset.
- You can use a dedicated data quality tool, such as the data preparation components in Microsoft Azure Machine Learning Studio, to automate this process by running multiple checks at once. Typical checks cover missing values and outliers, as well as structural problems such as duplicate rows or columns that don’t make sense together (e.g., two columns representing different versions of the same date).
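If you would rather script the check yourself, a minimal sketch of a multi-check report might look like the following. It uses Python rather than R, the `quality_report` and `_is_mixed` names are hypothetical, and the three checks shown are only a starting point, not a complete quality audit.

```python
import csv
import io
from collections import Counter

def _is_mixed(cells):
    """True when a column holds both numeric and non-numeric text."""
    kinds = set()
    for cell in cells:
        text = (cell or "").strip()
        if not text:
            continue  # empty cells are counted by the missing-value check
        try:
            float(text)
            kinds.add("number")
        except ValueError:
            kinds.add("text")
    return len(kinds) > 1

def quality_report(rows):
    """Run several basic checks over a list of dict rows and summarize them."""
    columns = list(rows[0].keys()) if rows else []
    missing = {c: sum(1 for r in rows if not (r[c] or "").strip()) for c in columns}
    counts = Counter(tuple(r[c] for c in columns) for r in rows)
    return {
        "rows": len(rows),
        "missing_per_column": missing,
        "duplicate_rows": sum(n - 1 for n in counts.values()),
        "mixed_type_columns": [c for c in columns if _is_mixed(r[c] for r in rows)],
    }

# Hypothetical sample with a repeated row and a score typed as a word.
DATA = """id,signup_date,score
1,2023-02-01,88
2,,ninety
2,,ninety
3,2023-02-03,75
"""
report = quality_report(list(csv.DictReader(io.StringIO(DATA))))
print(report)
```

Returning a dictionary instead of printing inside the function makes it easy to fail a pipeline automatically, for example when `duplicate_rows` is above zero.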
Make sure to use the right data types and formats for each variable you find
Data types matter, even when two values look alike to a human reader. The following examples show the difference between adding two integers and adding an integer to a string:
- Integer (1) & integer (1)
- Integer (1) & string (“one”)
The first example returns 2 because both values are integers, so they are added together as such. The second example returns 1 in a tool that silently skips non-numeric values (as a spreadsheet SUM over a range of cells does), because only one value is an integer and that is all that gets added. Stricter languages raise a type error instead of returning a result.
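The same contrast can be sketched in Python. The two helper functions below are hypothetical: `add_strict` is ordinary addition, while `add_lenient` imitates a tool that quietly ignores anything that isn't an integer.

```python
def add_strict(a, b):
    """Plain addition: Python refuses to add an int to a str."""
    return a + b

def add_lenient(values):
    """Sum only the integers, silently skipping everything else."""
    return sum(v for v in values if isinstance(v, int))

print(add_strict(1, 1))         # 2
print(add_lenient([1, "one"]))  # 1

try:
    add_strict(1, "one")
except TypeError as exc:
    print("TypeError:", exc)

# Converting text to the right type first makes the arithmetic work.
print(add_strict(1, int("1")))  # 2
```

The lenient version is the more dangerous of the two: it produces a plausible number without any warning that a value was dropped.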
Make sure that your data is properly formatted and use the right columns if possible.
One of the most important things to remember when preparing data is to make sure that it’s in the right shape. In other words, you need to ensure that your data has all the columns and rows it needs. If you are unsure of what these are, ask someone who knows more about SQL than you do!
If a column is missing some information or contains extra characters (like leading or trailing blank spaces), it can cause problems in later steps, when you try to join tables together: a key stored as “A123 ” will not match “A123”.
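To show how stray spaces can break a join, here is a small sketch using plain Python dictionaries in place of database tables. The customer IDs, names, and the `clean_key` helper are invented for the example.

```python
def clean_key(value):
    """Trim leading and trailing whitespace from a join key."""
    return value.strip()

# Hypothetical tables: the IDs differ only by stray spaces.
customers = {"A123 ": "Ada", "B456": "Grace"}
orders = [("A123", 19.99), (" B456", 5.00)]

# Naive join: keys are compared exactly, so nothing matches.
naive = [(customers[k], amount) for k, amount in orders if k in customers]

# Cleaned join: trim the keys on both sides before matching.
cleaned_customers = {clean_key(k): name for k, name in customers.items()}
cleaned = [
    (cleaned_customers[clean_key(k)], amount)
    for k, amount in orders
    if clean_key(k) in cleaned_customers
]

print("naive join:", naive)
print("cleaned join:", cleaned)
```

The naive join does not raise an error; it just returns fewer rows, which is why this kind of problem is easy to miss.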
Your dataset may have outliers, so it’s important that you know how to deal with them.
You may have heard the term “outliers” before, but you might not know what it means. An outlier is a data point that does not follow the general pattern of your dataset. This could be because of measurement errors or data entry errors (e.g., someone entered a number wrong). Outliers can also cause problems with your analysis and skew results if they’re included in your calculations, so it’s important to know how to deal with them!
It helps to distinguish two kinds of outliers: statistical outliers and substantive outliers. Statistical outliers are extreme values that come from measurement error, entry error, or chance alone; they don’t reflect anything about the underlying population being sampled. Substantive outliers are extreme values that are real: they indicate a genuine anomaly in an individual data point. For example, maybe one person was sick all week, so their productivity was lower than usual. The first kind can often be corrected or removed, while the second kind is usually worth keeping and investigating.
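One widely used way to flag candidate outliers is the interquartile range (IQR) rule. This is a common technique I'm bringing in for illustration, not something specific to this post. The sketch below assumes a small list of numbers; `find_outliers`, the multiplier `k=1.5`, and the sample values are illustrative choices.

```python
from statistics import quantiles

def find_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (the IQR rule)."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical weekly productivity scores; one week is far below the rest.
weekly_output = [42, 45, 44, 43, 41, 46, 44, 12, 45, 43]
print(find_outliers(weekly_output))  # [12]
```

The rule only flags the value. Deciding whether 12 is a typo or a week of illness still takes knowledge of where the data came from.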
Data preparation is just as important as collecting data
Data preparation is a critical step in the data science process. Data cleaning and transformation are part art and part science. There are many tools out there to help you prepare your data for analysis, from the Unix command line to R, SQL, and dedicated data quality checkers, but it’s important that you use the right tool for each job.
Conclusion
So, there you have it. Data preparation is an important part of your data science workflow, but it’s also one that many people neglect. If you’re planning on using your data for anything more than just a quick analysis or visualization, then it’s worth spending some time getting it ready first.