Preprocessing for Clean Data: Handling Missing Values in Datasets

According to many, many sources, most of data science is data preprocessing: making data usable for analysis and computation. Using a cardiac dataset, preprocessing can be shown by example. One important task is either removing nulls from a dataset or imputing values for them. Another is keeping only values within specified ranges, or imputing values to replace what does not make sense. Are there ever negative heartbeats per minute?

Why It's an Issue

Using data produces results, and results come from programmed algorithms, whether built in or designed for a specific use. The numbers fed in change the results, so when nulls or out-of-bound values appear in a dataset, they need to be cleaned before proper computation. Sometimes this is due to poor data entry and mistakes in recorded information. In large datasets, however, data captured automatically through sensors or live feeds will often contain errors. These inconsistencies can themselves be of value, but for processing they need to be removed or edited before analytics can extract value from the data.

Solutions: Different Ways

As an example, cardiac data is publicly available. There are two standard ways to handle nulls and NAs: deletion or imputation. Both methods can be carried out in Excel or R.

R

There are several choices in R for handling dirty data. When importing data into R, blanks are filled with NA as the value. Using arrays or data frames, data can be formatted with imputation for NA: set each variable or feature to a chosen value wherever an NA occurs. There are many differences between arrays and data frames. Helpful cheat sheets exist for dplyr and tidyr, whose commands can be combined to clean data on import or to transform it into a new, clean dataset, as in the sketch below.
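
As a rough sketch, and assuming the imported data frame has a hypothetical heart_rate column measured in beats per minute, dplyr and tidyr commands can be chained to impute the NAs and drop out-of-range values in one pass (the file name and thresholds here are only placeholders):

    library(dplyr)   # mutate(), filter() and the %>% pipe
    library(tidyr)   # replace_na() for imputation

    raw <- read.csv("cardiac.csv")          # hypothetical input file

    clean <- raw %>%
      # impute missing heart rates with the column median
      mutate(heart_rate = replace_na(heart_rate, median(heart_rate, na.rm = TRUE))) %>%
      # keep only physiologically plausible beats per minute
      filter(heart_rate > 0, heart_rate < 250)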

The most common one-liners are df[is.na(df)] <- 0 for filling NA or missing values with 0, and na.omit(df) for removing any row where an NA or missing value occurs.
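
In base R, a minimal sketch of those two options might look like this, assuming 0 is an acceptable fill value for the columns in question (the file name is a placeholder):

    df <- read.csv("cardiac.csv")    # hypothetical input file

    # Option 1 - imputation: replace every NA in the data frame with 0
    df_filled <- df
    df_filled[is.na(df_filled)] <- 0

    # Option 2 - deletion: drop any row that contains an NA
    df_dropped <- na.omit(df)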

Excel

By inserting columns, more helper variables can be created. One approach is to add an extra character to the fields that need sorting, to aid in sorting and matching when removing null values from a dataset. Using a logical formula, the marker character is added and that information is placed in a new column. From the new column, further formulas can apply logic to include valid rows or to set aside invalid data. There is also a filter to remove #N/A values or blanks from a dataset; however, I have found it is better to create additional columns that hold the clean data.
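
As an illustration, assuming the heart-rate values sit in column B with a header in row 1, a helper column could flag each row with a formula along these lines (the cell references and cutoff are only examples):

    =IF(OR(ISBLANK(B2), B2<=0, B2>250), "EXCLUDE", B2)

Copied down the column, this keeps valid readings and marks blanks or impossible values so they can be filtered out or set aside.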

Conclusion

Dirty data yields dirty results; clean data yields clean results. Many tools offer automated menus to clean data, but making a usable dataset is often a primary job before applying machine learning or big data analytics to live-stream or sensor data. Continuous data is a growing concern in the market and is valuable information when processed properly for review.

Originally published Dec 27, 2020, in Artificial Intelligence in Plain English on Medium
