How to Clean CSV Data at the Command Line

Have you ever dealt with a big-scary CSV file that has many columns that you don’t want and many records that slow down the process for you to filter and get the desired information?

This tutorial is about using two command-line programs that can solve these problems; csvkit and xsv . We will compare the two at the end and see how performant each and when we can use one and not the other in terms of speed especially if we’re processing a large CSV file. In the last blog post, we talked about how to clean text data at the command line that I recommend to have a look at.

Downloading COVID data from covidtracking

Let’s first download recent coronavirus data across the United States from COVID Tracking Project which is a volunteer organization dedicated to collecting and publishing the data required to understand the COVID-19 outbreak in the US. Btw, The data is published under a Creative Commons CC BY 4.0 license.

Let’s do this by downloading the CSV file manually or using curl :

$ curl -LO https://covidtracking.com/data/download/all-states-history.csv

-LO is a combination of -L and -O

-L is used to make sure if the URL has changed to another location, curl will redo the request on the new redirection link
-O this option is used to create an output file of the same name of the requested file name which is all-states-history.csv here

Printing the CSV file headers

Let’s first print what column names we have for this all-states.history.csv file:

$ csvcut -n all-states-history.csv 
  1: date
  2: state
  3: dataQualityGrade
  4: death
  5: deathConfirmed
  6: deathIncrease
  7: deathProbable
  8: hospitalized
  9: hospitalizedCumulative
 10: hospitalizedCurrently
 11: hospitalizedIncrease
 12: inIcuCumulative
 13: inIcuCurrently
 14: negative
 15: negativeIncrease
 16: negativeTestsAntibody
 17: negativeTestsPeopleAntibody
 18: negativeTestsViral
 19: onVentilatorCumulative
 20: onVentilatorCurrently
 21: pending
 22: positive
 23: positiveCasesViral
 24: positiveIncrease
 25: positiveScore
 26: positiveTestsAntibody
 27: positiveTestsAntigen
 28: positiveTestsPeopleAntibody
 29: positiveTestsPeopleAntigen
 30: positiveTestsViral
 31: recovered
 32: totalTestEncountersViral
 33: totalTestEncountersViralIncrease
 34: totalTestResults
 35: totalTestResultsIncrease
 36: totalTestsAntibody
 37: totalTestsAntigen
 38: totalTestsPeopleAntibody
 39: totalTestsPeopleAntigen
 40: totalTestsPeopleViral
 41: totalTestsPeopleViralIncrease
 42: totalTestsViral
 43: totalTestsViralIncrease

As you can see, using csvcut with the option -n can list all the headers we have with their associated order which can help us select some specific columns that we’re interested in.

Selecting specific columns

In this tutorial, we’re interested in four columns and these are their descriptions as reported by the COVID Tracking Project:

data: Date on which data was collected by The COVID Tracking Project.
state: Two-letter abbreviation for the state or territory.
positive: Total number of confirmed plus probable cases of COVID-19 reported by the state or territory
death: Total fatalities with confirmed OR probable COVID-19 case diagnosis

Let’s see how we can get the first 10 lines of these 4 columns in our CSV file at the command line:

$ csvcut -c date,state,positive,death all-states-history.csv | head | csvlook 
|       date | state | positive |  death |
| ---------- | ----- | -------- | ------ |
| 2020-10-26 | AK    |   14,413 |     68 |
| 2020-10-26 | AL    |  185,322 |  2,866 |
| 2020-10-26 | AR    |  106,727 |  1,833 |
| 2020-10-26 | AS    |        0 |      0 |
| 2020-10-26 | AZ    |  238,964 |  5,875 |
| 2020-10-26 | CA    |  901,010 | 17,357 |
| 2020-10-26 | CO    |   95,089 |  2,076 |
| 2020-10-26 | CT    |   68,099 |  4,589 |
| 2020-10-26 | DC    |   16,812 |    642 |

So csvcut with the option -c is used here to select the upcoming columns separated by commas. These 10 lines look better aligned with csvlook

Note that we could’ve done that with either of the following commands:

$ csvcut -c 1,2,22,4 all-states-history.csv | csvgrep -c state -m CA |head | csvlook
$ csvcut -c 1-2,22,4 all-states-history.csv | csvgrep -c state -m CA |head | csvlook
$ csvcut -c 1-2,positive,4 all-states-history.csv | csvgrep -c state -m CA |head | csvlook

Meaning you can select the columns with their numbers or ranges or a combination of numbers and column names as strings.

Take care that this CSV data may differ from yours if you’re using the recent data from the COVID Tracking Project on another day than the day this tutorial was written.

Filtering information

Let’s now filter out COVID data at California state:

$ csvcut -c date,state,positive,death all-states-history.csv | csvgrep -c state -m AL | head | csvlook 
|       date | state | positive | death |
| ---------- | ----- | -------- | ----- |
| 2020-10-26 | AL    |  185,322 | 2,866 |
| 2020-10-25 | AL    |  184,355 | 2,866 |
| 2020-10-24 | AL    |  183,276 | 2,866 |
| 2020-10-23 | AL    |  180,916 | 2,859 |
| 2020-10-22 | AL    |  177,064 | 2,843 |
| 2020-10-21 | AL    |  174,528 | 2,805 |
| 2020-10-20 | AL    |  174,528 | 2,805 |
| 2020-10-19 | AL    |  173,485 | 2,789 |
| 2020-10-18 | AL    |  172,626 | 2,788 |

We used here csvgrep with the option -c to select the column that we’re filtering which is the state here to match AL using -m option that matches the pattern we search for.

I’d like to make sure of this data, so I went to Google and asked how many cases we have at Alabama and this is the answer:

Image by the Author

Looks like the data reported by the COVID Tracking Project is close to what Google is reporting having 186K positive cases and 2892 fatalities.

If you also put another column to show the increase in the positive cases from the previous day, you’d find:

$ csvcut -c date,state,positive,24,death all-states-history.csv | csvgrep -c state -m AL | head | csvlook
|       date | state | positive | positiveIncrease | death |
| ---------- | ----- | -------- | ---------------- | ----- |
| 2020-10-26 | AL    |  185,322 |              967 | 2,866 |
| 2020-10-25 | AL    |  184,355 |            1,079 | 2,866 |
| 2020-10-24 | AL    |  183,276 |            2,360 | 2,866 |
| 2020-10-23 | AL    |  180,916 |            3,852 | 2,859 |
| 2020-10-22 | AL    |  177,064 |            2,536 | 2,843 |
| 2020-10-21 | AL    |  174,528 |                0 | 2,805 |
| 2020-10-20 | AL    |  174,528 |            1,043 | 2,805 |
| 2020-10-19 | AL    |  173,485 |              859 | 2,789 |
| 2020-10-18 | AL    |  172,626 |              964 | 2,788 |

967 positive cases increased from Oct. 26 to Oct. 27 and this number exactly matches what Google reports (+967) below the Total cases number in the image above.

Joining two CSVs

I’m not familiar with some abbreviations in the state column, so let’s have the second CSV file which we can join on to get a cleaner output of CSV data we understand. Let’s download it using curl:

$ curl -LO https://gist.githubusercontent.com/afomi/8824ddb02a68cf15151a804d4d0dc3b7/raw/5f1cfabf2e65c5661a9ed12af27953ae4032b136/states.csv

This states.csv ** file has two columns: State and Abbreviation

Let’s see how we can make this interesting join here:

$ csvjoin -c Abbreviation,state states.csv all-states-history.csv | csvcut -c date,State,Abbreviation,positive,death | head | csvlook 
|       date | State   | Abbreviation | positive | death |
| ---------- | ------- | ------------ | -------- | ----- |
| 2020-10-26 | ALABAMA | AL           |  185,322 | 2,866 |
| 2020-10-25 | ALABAMA | AL           |  184,355 | 2,866 |
| 2020-10-24 | ALABAMA | AL           |  183,276 | 2,866 |
| 2020-10-23 | ALABAMA | AL           |  180,916 | 2,859 |
| 2020-10-22 | ALABAMA | AL           |  177,064 | 2,843 |
| 2020-10-21 | ALABAMA | AL           |  174,528 | 2,805 |
| 2020-10-20 | ALABAMA | AL           |  174,528 | 2,805 |
| 2020-10-19 | ALABAMA | AL           |  173,485 | 2,789 |
| 2020-10-18 | ALABAMA | AL           |  172,626 | 2,788 |

Note here that csvjoin command takes much time because it’s reading both files into memory.

Here we joined the two CSV files on a column for each CSV; Abbreviation in the first file and state in the second one and then we filtered out 5 columns to view using csvcut -c

Also, note that there the second column you filtered out when you joined is gone meaning if you filtered out state ( which was the column that has the two-letter abbreviation of the state) it will give an error that ‘state’ is invalid which means this column is not there anymore.

Comparing between xsv and csvkit utilities

As we noticed, some commands took much time using csvkit command line utility. Let’s see a quick comparison between its command-line tools and their associated ones at xsv.

All the upcoming commands run are relative to my machine, let’s compare one by another:

xsv headers vs csvcut -n

$ time csvcut -n all-states-history.csv | head
  1: date
  2: state
  3: dataQualityGrade
  4: death
  5: deathConfirmed
  6: deathIncrease
  7: deathProbable
  8: hospitalized
  9: hospitalizedCumulative
 10: hospitalizedCurrently

real    0m0.307s
user    0m0.224s
sys 0m0.077s

Time of csvkit’s csvcut -n : ~307ms

$ time xsv headers all-states-history.csv | head
1   date
2   state
3   dataQualityGrade
4   death
5   deathConfirmed
6   deathIncrease
7   deathProbable
8   hospitalized
9   hospitalizedCumulative
10  hospitalizedCurrently

real    0m0.013s
user    0m0.008s
sys 0m0.007s

Time of xsv’s headers : ~13ms

xsv select vs csvcut -c

$ time csvcut -c date,state,positive,death all-states-history.csv | head
date,state,positive,death
2020-10-26,AK,14413,68
2020-10-26,AL,185322,2866
2020-10-26,AR,106727,1833
2020-10-26,AS,0,0
2020-10-26,AZ,238964,5875
2020-10-26,CA,901010,17357
2020-10-26,CO,95089,2076
2020-10-26,CT,68099,4589
2020-10-26,DC,16812,642

real    0m0.288s
user    0m0.209s
sys 0m0.073s

Time of csvkit’s csvcut -c : ~288ms

$ time xsv select date,state,positive,death all-states-history.csv | head
date,state,positive,death
2020-10-26,AK,14413,68
2020-10-26,AL,185322,2866
2020-10-26,AR,106727,1833
2020-10-26,AS,0,0
2020-10-26,AZ,238964,5875
2020-10-26,CA,901010,17357
2020-10-26,CO,95089,2076
2020-10-26,CT,68099,4589
2020-10-26,DC,16812,642

real    0m0.035s
user    0m0.012s
sys 0m0.011s

Time of xsv’s select : ~35ms

xsv search vs csvgrep

$ time csvcut -c date,state,positive,death all-states-history.csv | csvgrep -c state -m AL |head 
date,state,positive,death
2020-10-26,AL,185322,2866
2020-10-25,AL,184355,2866
2020-10-24,AL,183276,2866
2020-10-23,AL,180916,2859
2020-10-22,AL,177064,2843
2020-10-21,AL,174528,2805
2020-10-20,AL,174528,2805
2020-10-19,AL,173485,2789
2020-10-18,AL,172626,2788

real    0m0.438s
user    0m0.571s
sys 0m0.173s

Time of csvkit’s csvgrep with csvcut : ~438ms

$ time xsv select date,state,positive,death all-states-history.csv | xsv search -s state AL |head 
date,state,positive,death
2020-10-26,AL,185322,2866
2020-10-25,AL,184355,2866
2020-10-24,AL,183276,2866
2020-10-23,AL,180916,2859
2020-10-22,AL,177064,2843
2020-10-21,AL,174528,2805
2020-10-20,AL,174528,2805
2020-10-19,AL,173485,2789
2020-10-18,AL,172626,2788

real    0m0.038s
user    0m0.026s
sys 0m0.015s

Time of xsv’s search with select : ~38ms

xsv table vs csvlook

$ time csvcut -c date,state,positive,death all-states-history.csv | csvgrep -c state -m AL | head | csvlook
|       date | state | positive | death |
| ---------- | ----- | -------- | ----- |
| 2020-10-26 | AL    |  185,322 | 2,866 |
| 2020-10-25 | AL    |  184,355 | 2,866 |
| 2020-10-24 | AL    |  183,276 | 2,866 |
| 2020-10-23 | AL    |  180,916 | 2,859 |
| 2020-10-22 | AL    |  177,064 | 2,843 |
| 2020-10-21 | AL    |  174,528 | 2,805 |
| 2020-10-20 | AL    |  174,528 | 2,805 |
| 2020-10-19 | AL    |  173,485 | 2,789 |
| 2020-10-18 | AL    |  172,626 | 2,788 |

real    0m0.476s
user    0m0.879s
sys 0m0.281s

Time of csvkit’s csvlook with csvgrep and csvcut : ~476ms

$ time xsv select date,state,positive,death all-states-history.csv | xsv search -s state AL | head | xsv table
date        state  positive  death
2020-10-26  AL     185322    2866
2020-10-25  AL     184355    2866
2020-10-24  AL     183276    2866
2020-10-23  AL     180916    2859
2020-10-22  AL     177064    2843
2020-10-21  AL     174528    2805
2020-10-20  AL     174528    2805
2020-10-19  AL     173485    2789
2020-10-18  AL     172626    2788

real    0m0.041s
user    0m0.036s
sys 0m0.023s

Time of xsv’s table with search with select : ~41ms

xsv join vs csvjoin

$ time csvjoin -c Abbreviation,state states.csv all-states-history.csv | csvcut -c date,State,Abbreviation,positive,death | head
date,State,Abbreviation,positive,death
2020-10-26,ALABAMA,AL,185322,2866
2020-10-25,ALABAMA,AL,184355,2866
2020-10-24,ALABAMA,AL,183276,2866
2020-10-23,ALABAMA,AL,180916,2859
2020-10-22,ALABAMA,AL,177064,2843
2020-10-21,ALABAMA,AL,174528,2805
2020-10-20,ALABAMA,AL,174528,2805
2020-10-19,ALABAMA,AL,173485,2789
2020-10-18,ALABAMA,AL,172626,2788

real    1m5.788s
user    1m5.293s
sys 0m0.462s

Time of csvkit’s csvjoin with csvcut : ~1.6min

$ time xsv join Abbreviation states.csv state all-states-history.csv | xsv select date,State,Abbreviation,positive,death | head
date,State,Abbreviation,positive,death
2020-10-26,ALABAMA,AL,185322,2866
2020-10-25,ALABAMA,AL,184355,2866
2020-10-24,ALABAMA,AL,183276,2866
2020-10-23,ALABAMA,AL,180916,2859
2020-10-22,ALABAMA,AL,177064,2843
2020-10-21,ALABAMA,AL,174528,2805
2020-10-20,ALABAMA,AL,174528,2805
2020-10-19,ALABAMA,AL,173485,2789
2020-10-18,ALABAMA,AL,172626,2788

real    0m0.051s
user    0m0.036s
sys 0m0.018s

Time of xsv’s join with select : ~51ms

‍

Did you see what just happened?! 51 milliseconds vs 1.6 minutes?!

Well, xsv can do better than 51 ms here with indexing with:

$ xsv index all-states-history.csv

like this:

$ time xsv join Abbreviation states.csv state all-states-history.csv | xsv select date,State,Abbreviation,positive,death | head
date,State,Abbreviation,positive,death
2020-10-26,ALABAMA,AL,185322,2866
2020-10-25,ALABAMA,AL,184355,2866
2020-10-24,ALABAMA,AL,183276,2866
2020-10-23,ALABAMA,AL,180916,2859
2020-10-22,ALABAMA,AL,177064,2843
2020-10-21,ALABAMA,AL,174528,2805
2020-10-20,ALABAMA,AL,174528,2805
2020-10-19,ALABAMA,AL,173485,2789
2020-10-18,ALABAMA,AL,172626,2788

real    0m0.036s
user    0m0.031s
sys 0m0.017s

but we can feel more how fast xsv is if the file we’re investigating has many more records.

Final thoughts

From our investigation of the 13269 records data of the COVID Tracking Project CSV file, it seems that xsv is killing it and it’s more performant than csvkit. We’ve seen a tremendous improvement in speed when cleaning our data by:

knowing the headers of the CSV file using xsv headers and comparing it with csvcut -n ‍
filtering out our desired columns using xsv select comparing it with csvcut -c
searching for a specific pattern using xsv select comparing it with csvgrep
better looking for the CSV using xsv table comparing it with csvlook
or joining two tables using xsv join comparing it with csvjoin

In the end, you have the choice to choose whatever you want from the csvkit or xsv but it’d be fair to use what makes our life easy which is xsv especially when we’re dealing with a large CSV file and if speed and performance is not what we look for especially when we’re dealing with small CSVs, we can go for csvkit.

You might have noticed that the syntax is kind of similar, it just differs in some commands like joining two CSVs. So you always have the choice!

‍

This tutorial is mainly motivated by Data Science at the Command Line

Disclosure: The Amazon links for the book (in this section) are paid links so if you buy the book, I will have a small commission

This book tries to catch your attention on the ability of the command line when you do data science tasks - meaning you can obtain your data, manipulate it, explore it, and make your prediction on it using the command line. If you are a data scientist, aspiring to be, or want to know more about it, I highly recommend this book. You can read it online for free from its website or order an ebook or paperback . In this tutorial we’re gonna focus on using the command line to clean our data.

‍

You might be interested in my previous tutorials on why we use docker tutorial or the one similar to this on how to clean text data at the command line

Stay safe everybody, will see you in the next tutorials ;)

If you want the whole package of cleaning data at the command line in PDF format, get my ebook below 👇

Get your ebook