Have you ever dealt with a big-scary CSV file that has many columns that you don’t want and many records that slow down the process for you to filter and get the desired information?
This tutorial is about using two command-line programs that can solve these problems; csvkit and xsv . We will compare the two at the end and see how performant each and when we can use one and not the other in terms of speed especially if we’re processing a large CSV file. In the last blog post, we talked about how to clean text data at the command line that I recommend to have a look at.
Downloading COVID data from covidtracking
Let’s first download recent coronavirus data across the United States from COVID Tracking Project which is a volunteer organization dedicated to collecting and publishing the data required to understand the COVID-19 outbreak in the US. Btw, The data is published under a Creative Commons CC BY 4.0 license.
Let’s do this by downloading the CSV file manually or using curl :
$ curl -LO https://covidtracking.com/data/download/all-states-history.csv
-LO is a combination of -L and -O
- -L is used to make sure if the URL has changed to another location, curl will redo the request on the new redirection link
- -O this option is used to create an output file of the same name of the requested file name which is all-states-history.csv here
Printing the CSV file headers
Let’s first print what column names we have for this all-states.history.csv file:
$ csvcut -n all-states-history.csv
1: date
2: state
3: dataQualityGrade
4: death
5: deathConfirmed
6: deathIncrease
7: deathProbable
8: hospitalized
9: hospitalizedCumulative
10: hospitalizedCurrently
11: hospitalizedIncrease
12: inIcuCumulative
13: inIcuCurrently
14: negative
15: negativeIncrease
16: negativeTestsAntibody
17: negativeTestsPeopleAntibody
18: negativeTestsViral
19: onVentilatorCumulative
20: onVentilatorCurrently
21: pending
22: positive
23: positiveCasesViral
24: positiveIncrease
25: positiveScore
26: positiveTestsAntibody
27: positiveTestsAntigen
28: positiveTestsPeopleAntibody
29: positiveTestsPeopleAntigen
30: positiveTestsViral
31: recovered
32: totalTestEncountersViral
33: totalTestEncountersViralIncrease
34: totalTestResults
35: totalTestResultsIncrease
36: totalTestsAntibody
37: totalTestsAntigen
38: totalTestsPeopleAntibody
39: totalTestsPeopleAntigen
40: totalTestsPeopleViral
41: totalTestsPeopleViralIncrease
42: totalTestsViral
43: totalTestsViralIncrease
As you can see, using csvcut with the option -n can list all the headers we have with their associated order which can help us select some specific columns that we’re interested in.
Selecting specific columns
In this tutorial, we’re interested in four columns and these are their descriptions as reported by the COVID Tracking Project:
- data: Date on which data was collected by The COVID Tracking Project.
- state: Two-letter abbreviation for the state or territory.
- positive: Total number of confirmed plus probable cases of COVID-19 reported by the state or territory
- death: Total fatalities with confirmed OR probable COVID-19 case diagnosis
Let’s see how we can get the first 10 lines of these 4 columns in our CSV file at the command line:
$ csvcut -c date,state,positive,death all-states-history.csv | head | csvlook
| date | state | positive | death |
| ---------- | ----- | -------- | ------ |
| 2020-10-26 | AK | 14,413 | 68 |
| 2020-10-26 | AL | 185,322 | 2,866 |
| 2020-10-26 | AR | 106,727 | 1,833 |
| 2020-10-26 | AS | 0 | 0 |
| 2020-10-26 | AZ | 238,964 | 5,875 |
| 2020-10-26 | CA | 901,010 | 17,357 |
| 2020-10-26 | CO | 95,089 | 2,076 |
| 2020-10-26 | CT | 68,099 | 4,589 |
| 2020-10-26 | DC | 16,812 | 642 |
So csvcut with the option -c is used here to select the upcoming columns separated by commas. These 10 lines look better aligned with csvlook
Note that we could’ve done that with either of the following commands:
$ csvcut -c 1,2,22,4 all-states-history.csv | csvgrep -c state -m CA |head | csvlook
$ csvcut -c 1-2,22,4 all-states-history.csv | csvgrep -c state -m CA |head | csvlook
$ csvcut -c 1-2,positive,4 all-states-history.csv | csvgrep -c state -m CA |head | csvlook
Meaning you can select the columns with their numbers or ranges or a combination of numbers and column names as strings.
Take care that this CSV data may differ from yours if you’re using the recent data from the COVID Tracking Project on another day than the day this tutorial was written.
Filtering information
Let’s now filter out COVID data at California state:
$ csvcut -c date,state,positive,death all-states-history.csv | csvgrep -c state -m AL | head | csvlook
| date | state | positive | death |
| ---------- | ----- | -------- | ----- |
| 2020-10-26 | AL | 185,322 | 2,866 |
| 2020-10-25 | AL | 184,355 | 2,866 |
| 2020-10-24 | AL | 183,276 | 2,866 |
| 2020-10-23 | AL | 180,916 | 2,859 |
| 2020-10-22 | AL | 177,064 | 2,843 |
| 2020-10-21 | AL | 174,528 | 2,805 |
| 2020-10-20 | AL | 174,528 | 2,805 |
| 2020-10-19 | AL | 173,485 | 2,789 |
| 2020-10-18 | AL | 172,626 | 2,788 |
We used here csvgrep with the option -c to select the column that we’re filtering which is the state here to match AL using -m option that matches the pattern we search for.
I’d like to make sure of this data, so I went to Google and asked how many cases we have at Alabama and this is the answer:
Image by the Author
Looks like the data reported by the COVID Tracking Project is close to what Google is reporting having 186K positive cases and 2892 fatalities.
If you also put another column to show the increase in the positive cases from the previous day, you’d find:
$ csvcut -c date,state,positive,24,death all-states-history.csv | csvgrep -c state -m AL | head | csvlook
| date | state | positive | positiveIncrease | death |
| ---------- | ----- | -------- | ---------------- | ----- |
| 2020-10-26 | AL | 185,322 | 967 | 2,866 |
| 2020-10-25 | AL | 184,355 | 1,079 | 2,866 |
| 2020-10-24 | AL | 183,276 | 2,360 | 2,866 |
| 2020-10-23 | AL | 180,916 | 3,852 | 2,859 |
| 2020-10-22 | AL | 177,064 | 2,536 | 2,843 |
| 2020-10-21 | AL | 174,528 | 0 | 2,805 |
| 2020-10-20 | AL | 174,528 | 1,043 | 2,805 |
| 2020-10-19 | AL | 173,485 | 859 | 2,789 |
| 2020-10-18 | AL | 172,626 | 964 | 2,788 |
967 positive cases increased from Oct. 26 to Oct. 27 and this number exactly matches what Google reports (+967) below the Total cases number in the image above.
Joining two CSVs
I’m not familiar with some abbreviations in the state column, so let’s have the second CSV file which we can join on to get a cleaner output of CSV data we understand. Let’s download it using curl:
$ curl -LO https://gist.githubusercontent.com/afomi/8824ddb02a68cf15151a804d4d0dc3b7/raw/5f1cfabf2e65c5661a9ed12af27953ae4032b136/states.csv
This states.csv ** file has two columns: State and Abbreviation
Let’s see how we can make this interesting join here:
$ csvjoin -c Abbreviation,state states.csv all-states-history.csv | csvcut -c date,State,Abbreviation,positive,death | head | csvlook
| date | State | Abbreviation | positive | death |
| ---------- | ------- | ------------ | -------- | ----- |
| 2020-10-26 | ALABAMA | AL | 185,322 | 2,866 |
| 2020-10-25 | ALABAMA | AL | 184,355 | 2,866 |
| 2020-10-24 | ALABAMA | AL | 183,276 | 2,866 |
| 2020-10-23 | ALABAMA | AL | 180,916 | 2,859 |
| 2020-10-22 | ALABAMA | AL | 177,064 | 2,843 |
| 2020-10-21 | ALABAMA | AL | 174,528 | 2,805 |
| 2020-10-20 | ALABAMA | AL | 174,528 | 2,805 |
| 2020-10-19 | ALABAMA | AL | 173,485 | 2,789 |
| 2020-10-18 | ALABAMA | AL | 172,626 | 2,788 |
Note here that csvjoin command takes much time because it’s reading both files into memory.
Here we joined the two CSV files on a column for each CSV; Abbreviation in the first file and state in the second one and then we filtered out 5 columns to view using csvcut -c
Also, note that there the second column you filtered out when you joined is gone meaning if you filtered out state ( which was the column that has the two-letter abbreviation of the state) it will give an error that ‘state’ is invalid which means this column is not there anymore.
Comparing between xsv and csvkit utilities
As we noticed, some commands took much time using csvkit command line utility. Let’s see a quick comparison between its command-line tools and their associated ones at xsv.
All the upcoming commands run are relative to my machine, let’s compare one by another:
xsv headers vs csvcut -n
$ time csvcut -n all-states-history.csv | head
1: date
2: state
3: dataQualityGrade
4: death
5: deathConfirmed
6: deathIncrease
7: deathProbable
8: hospitalized
9: hospitalizedCumulative
10: hospitalizedCurrently
real 0m0.307s
user 0m0.224s
sys 0m0.077s
Time of csvkit’s csvcut -n : ~307ms
$ time xsv headers all-states-history.csv | head
1 date
2 state
3 dataQualityGrade
4 death
5 deathConfirmed
6 deathIncrease
7 deathProbable
8 hospitalized
9 hospitalizedCumulative
10 hospitalizedCurrently
real 0m0.013s
user 0m0.008s
sys 0m0.007s
Time of xsv’s headers : ~13ms
xsv select vs csvcut -c
$ time csvcut -c date,state,positive,death all-states-history.csv | head
date,state,positive,death
2020-10-26,AK,14413,68
2020-10-26,AL,185322,2866
2020-10-26,AR,106727,1833
2020-10-26,AS,0,0
2020-10-26,AZ,238964,5875
2020-10-26,CA,901010,17357
2020-10-26,CO,95089,2076
2020-10-26,CT,68099,4589
2020-10-26,DC,16812,642
real 0m0.288s
user 0m0.209s
sys 0m0.073s
Time of csvkit’s csvcut -c : ~288ms
$ time xsv select date,state,positive,death all-states-history.csv | head
date,state,positive,death
2020-10-26,AK,14413,68
2020-10-26,AL,185322,2866
2020-10-26,AR,106727,1833
2020-10-26,AS,0,0
2020-10-26,AZ,238964,5875
2020-10-26,CA,901010,17357
2020-10-26,CO,95089,2076
2020-10-26,CT,68099,4589
2020-10-26,DC,16812,642
real 0m0.035s
user 0m0.012s
sys 0m0.011s
Time of xsv’s select : ~35ms
xsv search vs csvgrep
$ time csvcut -c date,state,positive,death all-states-history.csv | csvgrep -c state -m AL |head
date,state,positive,death
2020-10-26,AL,185322,2866
2020-10-25,AL,184355,2866
2020-10-24,AL,183276,2866
2020-10-23,AL,180916,2859
2020-10-22,AL,177064,2843
2020-10-21,AL,174528,2805
2020-10-20,AL,174528,2805
2020-10-19,AL,173485,2789
2020-10-18,AL,172626,2788
real 0m0.438s
user 0m0.571s
sys 0m0.173s
Time of csvkit’s csvgrep with csvcut : ~438ms
$ time xsv select date,state,positive,death all-states-history.csv | xsv search -s state AL |head
date,state,positive,death
2020-10-26,AL,185322,2866
2020-10-25,AL,184355,2866
2020-10-24,AL,183276,2866
2020-10-23,AL,180916,2859
2020-10-22,AL,177064,2843
2020-10-21,AL,174528,2805
2020-10-20,AL,174528,2805
2020-10-19,AL,173485,2789
2020-10-18,AL,172626,2788
real 0m0.038s
user 0m0.026s
sys 0m0.015s
Time of xsv’s search with select : ~38ms
xsv table vs csvlook
$ time csvcut -c date,state,positive,death all-states-history.csv | csvgrep -c state -m AL | head | csvlook
| date | state | positive | death |
| ---------- | ----- | -------- | ----- |
| 2020-10-26 | AL | 185,322 | 2,866 |
| 2020-10-25 | AL | 184,355 | 2,866 |
| 2020-10-24 | AL | 183,276 | 2,866 |
| 2020-10-23 | AL | 180,916 | 2,859 |
| 2020-10-22 | AL | 177,064 | 2,843 |
| 2020-10-21 | AL | 174,528 | 2,805 |
| 2020-10-20 | AL | 174,528 | 2,805 |
| 2020-10-19 | AL | 173,485 | 2,789 |
| 2020-10-18 | AL | 172,626 | 2,788 |
real 0m0.476s
user 0m0.879s
sys 0m0.281s
Time of csvkit’s csvlook with csvgrep and csvcut : ~476ms
$ time xsv select date,state,positive,death all-states-history.csv | xsv search -s state AL | head | xsv table
date state positive death
2020-10-26 AL 185322 2866
2020-10-25 AL 184355 2866
2020-10-24 AL 183276 2866
2020-10-23 AL 180916 2859
2020-10-22 AL 177064 2843
2020-10-21 AL 174528 2805
2020-10-20 AL 174528 2805
2020-10-19 AL 173485 2789
2020-10-18 AL 172626 2788
real 0m0.041s
user 0m0.036s
sys 0m0.023s
Time of xsv’s table with search with select : ~41ms
xsv join vs csvjoin
$ time csvjoin -c Abbreviation,state states.csv all-states-history.csv | csvcut -c date,State,Abbreviation,positive,death | head
date,State,Abbreviation,positive,death
2020-10-26,ALABAMA,AL,185322,2866
2020-10-25,ALABAMA,AL,184355,2866
2020-10-24,ALABAMA,AL,183276,2866
2020-10-23,ALABAMA,AL,180916,2859
2020-10-22,ALABAMA,AL,177064,2843
2020-10-21,ALABAMA,AL,174528,2805
2020-10-20,ALABAMA,AL,174528,2805
2020-10-19,ALABAMA,AL,173485,2789
2020-10-18,ALABAMA,AL,172626,2788
real 1m5.788s
user 1m5.293s
sys 0m0.462s
Time of csvkit’s csvjoin with csvcut : ~1.6min
$ time xsv join Abbreviation states.csv state all-states-history.csv | xsv select date,State,Abbreviation,positive,death | head
date,State,Abbreviation,positive,death
2020-10-26,ALABAMA,AL,185322,2866
2020-10-25,ALABAMA,AL,184355,2866
2020-10-24,ALABAMA,AL,183276,2866
2020-10-23,ALABAMA,AL,180916,2859
2020-10-22,ALABAMA,AL,177064,2843
2020-10-21,ALABAMA,AL,174528,2805
2020-10-20,ALABAMA,AL,174528,2805
2020-10-19,ALABAMA,AL,173485,2789
2020-10-18,ALABAMA,AL,172626,2788
real 0m0.051s
user 0m0.036s
sys 0m0.018s
Time of xsv’s join with select : ~51ms
Did you see what just happened?! 51 milliseconds vs 1.6 minutes?!
Well, xsv can do better than 51 ms here with indexing with:
$ xsv index all-states-history.csv
like this:
$ time xsv join Abbreviation states.csv state all-states-history.csv | xsv select date,State,Abbreviation,positive,death | head
date,State,Abbreviation,positive,death
2020-10-26,ALABAMA,AL,185322,2866
2020-10-25,ALABAMA,AL,184355,2866
2020-10-24,ALABAMA,AL,183276,2866
2020-10-23,ALABAMA,AL,180916,2859
2020-10-22,ALABAMA,AL,177064,2843
2020-10-21,ALABAMA,AL,174528,2805
2020-10-20,ALABAMA,AL,174528,2805
2020-10-19,ALABAMA,AL,173485,2789
2020-10-18,ALABAMA,AL,172626,2788
real 0m0.036s
user 0m0.031s
sys 0m0.017s
but we can feel more how fast xsv is if the file we’re investigating has many more records.
Final thoughts
From our investigation of the 13269 records data of the COVID Tracking Project CSV file, it seems that xsv is killing it and it’s more performant than csvkit. We’ve seen a tremendous improvement in speed when cleaning our data by:
- knowing the headers of the CSV file using xsv headers and comparing it with csvcut -n
- filtering out our desired columns using xsv select comparing it with csvcut -c
- searching for a specific pattern using xsv select comparing it with csvgrep
- better looking for the CSV using xsv table comparing it with csvlook
- or joining two tables using xsv join comparing it with csvjoin
In the end, you have the choice to choose whatever you want from the csvkit or xsv but it’d be fair to use what makes our life easy which is xsv especially when we’re dealing with a large CSV file and if speed and performance is not what we look for especially when we’re dealing with small CSVs, we can go for csvkit.
You might have noticed that the syntax is kind of similar, it just differs in some commands like joining two CSVs. So you always have the choice!
This tutorial is mainly motivated by Data Science at the Command Line
Disclosure: The Amazon links for the book (in this section) are paid links so if you buy the book, I will have a small commission
This book tries to catch your attention on the ability of the command line when you do data science tasks - meaning you can obtain your data, manipulate it, explore it, and make your prediction on it using the command line. If you are a data scientist, aspiring to be, or want to know more about it, I highly recommend this book. You can read it online for free from its website or order an ebook or paperback . In this tutorial we’re gonna focus on using the command line to clean our data.
You might be interested in my previous tutorials on why we use docker tutorial or the one similar to this on how to clean text data at the command line
Stay safe everybody, will see you in the next tutorials ;)
If you want the whole package of cleaning data at the command line in PDF format, get my ebook below 👇
Motivated by
- csvkit Documentation
- 5 Magic Fixes for the Most Common CSV File reader Problems | ThoughtSpot
- COVID Tracking Data
- States Abbreviations Data
- xsv Repo by the BruntSushi Author
- XSV: A fast CSV command-line toolkit written in Rust