In the last blog post, we proved that xsv is ~1882x faster than csvkit when joining two CSVs and saw how performant xsv is and when we can use xsv or csvkit when we are at the terminal.
Today, we’re talking about part 2 of cleaning CSV data at the command line investigating a large CSV file (from Kaggle) that contains ~26 million users and their number of comments posted from 2005 to 2017. We also talked about cleaning text files in general using the command line if you want to check out.
Prerequisites
You just need to install csvkit and xsv with the following commands:
install csvkit
Just need pip to be able to install csvkit:
$ pip install csvkit
install xsv
Binaries for Windows, Linux and macOS are available from Github .
Or if you’re using macOS Homebrew, install it just like this:
$ brew install xsv
Compiling from its repository also works similarly:
$ git clone git://github.com/BurntSushi/xsv
$ cd xsv
$ cargo build --release
but you need to have Cargo installed which is Rust’s package manager since xsv is written in Rust
Concatenating: xsv cat vs csvstack
Let’s first create a CSV file to play with instead of the large file. Create it with your favorite text editor or just at the command line:
$ cat >> fake_users.csv
author,n
x,5
z,7
y,6
Here, we’d like to concatenate this fake_users.csv file with the large users.csv file.
To see the effect of the performance here, we will stack the large file first because that will be more accurate when using tail to validate that the second file is already read into memory and concatenated.
xsv has a tool for that which is cat , let’s use it:
$ time xsv cat rows users.csv fake_users.csv | tail
youbd636,1
generalkickstand,1
Throwaway_injuries,5
LowlzMan,3
ImSoLitBroLMFAO,1
i_want_to_die_killme,2
beeman3459,1
x,5
z,7
y,6
real 0m15.198s
user 0m17.499s
sys 0m0.364s
- rows here is used to indicate that we will concatenate the rows so the rows of the second file will be appended to the rows of the first file
Time here taken by xsv cat is ~15s
csvkit also has a csvstack tool, let’s see it:
$ time csvstack users.csv fake_users.csv | tail
youbd636,1
generalkickstand,1
Throwaway_injuries,5
LowlzMan,3
ImSoLitBroLMFAO,1
i_want_to_die_killme,2
beeman3459,1
x,5
z,7
y,6
real 1m7.213s
user 1m21.464s
sys 0m1.026s
Time here taken by csvstack is ~1.7mins
So approximately, xsv is ~7x faster here than csvkit
Sorting: pandas.sort_values() vs xsv sort vs GNU sort vs csvsort
Let’s first see how csvsort and xsv sort work by sorting fake_users.csv by the number of comments n :
$ csvsort -c n fake_users.csv
author,n
x,5
y,6
z,7
We can also sort in reverse order by setting the option -r:
$ csvsort -rc n fake_users.csv
author,n
z,7
y,6
x,5
The equivalent xsv command is:
$ xsv sort -RNs n fake_users.csv
-RNs is a combination of:
- -R for reverse order
- -N for numeric sorting
- -s for selecting the column that you sort by
After downloading users.csv , **** the ~359MB CSV file, let’s sort it with xsv first. This is a +25 million records so expect a latency here!
$ time xsv sort -RNs n users.csv > /dev/null
real 1m48.611s
user 1m27.743s
sys 0m16.621s
So xsv sort took here ~1.4 mins for this much data.
Let’s try the sort command coming with the UNIX system:
$ time sort -t, -rnk2 users.csv > /dev/null
real 6m29.940s
user 5m49.416s
sys 0m32.340s
- -t, to determine the field separator which is comma here as the file is a CSV file
- -r for reverse order
- -n for numeric sorting
- -k2 for setting the key field (the column that you sort by which is the second column here)
Here BSD sort took ~6.3min which is ~5x slower than xsv
What about csvsort ?
$ time csvsort -rc n users.csv > /dev/null
Do you know that I have terminated my terminal after more than an hour of waiting for this command to complete execution and it’s not completed yet!
Knowing that csvkit is written in pure python can somehow proves that it’s not advisable to write a very performant tool with such technology alone unless you use C or Cython with Python as Pandas did:
$ python sort_with_pandas.py
Time: 77.732413994
Pandas here beats ALL, csvkit , BSD sort , and even xsv with execution time of ~78 seconds which means it’s very close to xsv result (1.4min = 84s) so just 6 seconds difference and with each run, the time varies so it’s acceptable for xsv to be competing with pandas for performance.
Final thoughts
From our investigation of the 26 million records data of the Reddit username data, it seems that xsv is the fastest command line I ever used and I think ever existed at the terminal. It is more performant than csvkit . We’ve seen a tremendous improvement in speed when cleaning our data by:
- concatenating the rows of the CSV files using xsv cat rows and comparing it with csvstack
- sorting by a column and in a reverse order using xsv sort -sRN comparing it with csvsort -rc
- also comparing the two utilities with BSD sort and pandas resulting in a very good performing of pandas and xsv
In the end, as always you have the choice to choose whatever you want from the csvkit or xsv but it’d be fair to use what makes our life easy which is xsv especially when we’re dealing with a large CSV file like what we saw today and if speed and performance are not what we look for especially when we’re dealing with small CSVs, we can go for csvkit .
You might have noticed that the syntax is kind of similar in sorting and stacking. So you always have the choice!
This tutorial is mainly motivated by Data Science at the Command Line
Disclosure: The Amazon links for the book (in this section) are paid links so if you buy the book, I will have a small commission
This book tries to catch your attention on the ability of the command line when you do data science tasks - meaning you can obtain your data, manipulate it, explore it, and make your prediction on it using the command line. If you are a data scientist, aspiring to be, or want to know more about it, I highly recommend this book. You can read it online for free from its website or order an ebook or paperback .
You might be interested in my previous tutorials on part 1 of cleaning CSV data or why we use docker tutorial or the one similar to this on how to clean text data at the command line
Take care, will see you in the next tutorials :)
If you want the whole package of cleaning data at the command line in PDF format, get my ebook below 👇
Click here to get fresh content to your inbox
Resources
- csvkit Documentation
- xsv Repo by the BruntSushi Author
- Reddit Usernames data | Kaggle
- pandas.sort_values()
- how to sort pandas dataframe from one column
- How do you calculate program run time in python? [duplicate]
- Photo by Yan Ots on Unsplash