How to Clean CSV Data at the Command Line | Part 2

Created on Nov 7, 2020
Updated on Mar 6, 2021

In the last blog post, we saw how performant xsv is: it proved to be ~1882x faster than csvkit when joining two CSVs. We also discussed when to reach for xsv or csvkit at the terminal.

Today, in part 2 of cleaning CSV data at the command line, we investigate a large CSV file (from Kaggle) that contains ~26 million Reddit users and the number of comments each posted from 2005 to 2017. We have also covered cleaning text files in general at the command line, if you want to check that out.

Prerequisites

You just need to install csvkit and xsv with the following commands:

Install csvkit

You just need pip to install csvkit:

$ pip install csvkit

Install xsv

Binaries for Windows, Linux, and macOS are available from GitHub.

Or, if you’re on macOS, install it with Homebrew:

$ brew install xsv

You can also compile it from its repository:

$ git clone git://github.com/BurntSushi/xsv
$ cd xsv
$ cargo build --release

but you need Cargo, Rust’s package manager, installed, since xsv is written in Rust.

Concatenating: xsv cat vs csvstack

Let’s first create a small CSV file to play with instead of the large file. Create it with your favorite text editor or right at the command line (press Ctrl-D on a new line to finish):

$ cat >> fake_users.csv
author,n
x,5
z,7
y,6
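If you’d rather create the file programmatically, the same data can be written with Python’s standard csv module. This is just a small sketch; note that it overwrites the file rather than appending like cat >> does:

```python
import csv

# Write the same three fake users that the cat command above created
rows = [("author", "n"), ("x", "5"), ("z", "7"), ("y", "6")]
with open("fake_users.csv", "w", newline="") as f:
    csv.writer(f, lineterminator="\n").writerows(rows)
```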

Here, we’d like to concatenate this fake_users.csv file with the large users.csv file.

To see the performance effect, we will stack the large file first: that way, tail shows the fake file’s rows at the end of the output, confirming that the second file was fully read and concatenated.

xsv has a subcommand for that, cat. Let’s use it:

$ time xsv cat rows users.csv fake_users.csv | tail
youbd636,1
generalkickstand,1
Throwaway_injuries,5
LowlzMan,3
ImSoLitBroLMFAO,1
i_want_to_die_killme,2
beeman3459,1
x,5
z,7
y,6

real    0m15.198s
user    0m17.499s
sys 0m0.364s

The time taken by xsv cat here is ~15s.
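Conceptually, row-wise stacking writes the first file as-is and then appends every subsequent file minus its header row. A minimal Python sketch of that logic (the function name and file handling are my own, not xsv’s):

```python
def cat_rows(paths, out_path):
    """Concatenate CSV files row-wise, keeping only the first file's header."""
    with open(out_path, "w") as out:
        for i, path in enumerate(paths):
            with open(path) as f:
                header = f.readline()
                if i == 0:          # keep the header of the first file only
                    out.write(header)
                out.writelines(f)   # stream the remaining rows
```

Unlike xsv cat rows, this sketch does not check that the headers of later files actually match the first one.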

csvkit’s equivalent is the csvstack tool. Let’s try it:

$ time csvstack users.csv fake_users.csv | tail
youbd636,1
generalkickstand,1
Throwaway_injuries,5
LowlzMan,3
ImSoLitBroLMFAO,1
i_want_to_die_killme,2
beeman3459,1
x,5
z,7
y,6

real    1m7.213s
user    1m21.464s
sys 0m1.026s

The time taken by csvstack is ~67 seconds (about 1.1 minutes).

So xsv is roughly 4.4x faster than csvkit here (15.2s vs 67.2s).

Sorting: pandas.sort_values() vs xsv sort vs GNU sort vs csvsort

Let’s first see how csvsort and xsv sort work by sorting fake_users.csv by the number of comments n:

$ csvsort -c n fake_users.csv
author,n
x,5
y,6
z,7

We can also sort in reverse order by adding the -r option:

$ csvsort -rc n fake_users.csv
author,n
z,7
y,6
x,5

The equivalent xsv command is:

$ xsv sort -RNs n fake_users.csv 

-RNs is a combination of three flags:

- -s to select the column to sort by (here, n)
- -N to compare values numerically instead of lexicographically
- -R to sort in reverse (descending) order

After downloading users.csv, the ~359MB CSV file, let’s sort it with xsv first. This file holds over 25 million records, so expect some latency here!

$ time xsv sort -RNs n users.csv > /dev/null

real    1m48.611s
user    1m27.743s
sys 0m16.621s

So xsv sort took ~1.8 minutes (about 109 seconds) for this much data.

Let’s try the sort command that ships with the UNIX system. Here -t, sets the comma as the field delimiter, -r reverses the order, -n compares numerically, and -k2 sorts by the second column:

$ time sort -t, -rnk2 users.csv > /dev/null

real    6m29.940s
user    5m49.416s
sys 0m32.340s

Here BSD sort took ~6.5 minutes, which is ~3.6x slower than xsv.

What about csvsort ?

$ time csvsort -rc n users.csv > /dev/null

Believe it or not, I killed my terminal after more than an hour of waiting for this command to finish, and it still hadn’t completed!

Knowing that csvkit is written in pure Python goes some way toward proving that it’s not advisable to write a highly performant tool in Python alone; you need C or Cython under the hood, as pandas does:

$ python sort_with_pandas.py
Time: 77.732413994
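The post doesn’t show sort_with_pandas.py itself, but its timing is consistent with something like the following sketch. The file name users.csv and column n come from the examples above; the exact script may differ, and sort_users is a name of my own choosing:

```python
import io
import time

import pandas as pd

def sort_users(path_or_buf, column="n"):
    """Load a CSV with pandas and sort it by `column`, descending."""
    return pd.read_csv(path_or_buf).sort_values(column, ascending=False)

# Demo on the small fake_users data; for the real users.csv you would
# wrap the call in time.time() measurements to reproduce the timing above.
demo = io.StringIO("author,n\nx,5\nz,7\ny,6\n")
print(sort_users(demo)["author"].tolist())  # ['z', 'y', 'x']
```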

Pandas here beats them all: csvkit, BSD sort, and even xsv, with an execution time of ~78 seconds. That’s only about 31 seconds faster than xsv’s ~109 seconds, and since the timings vary from run to run, xsv holds its own competing with pandas on performance.

Final thoughts

From our investigation of the 26 million records of Reddit username data, xsv seems to be the fastest command-line CSV tool I have ever used, and it is clearly more performant than csvkit. We’ve seen a tremendous improvement in speed when cleaning our data by concatenating with xsv cat and sorting with xsv sort.

In the end, as always, you have the choice between csvkit and xsv. It’s fair to use what makes our life easy, which is xsv, especially when we’re dealing with a large CSV file like the one we saw today. If speed and performance are not what we’re after, such as when dealing with small CSVs, we can go for csvkit.

You might have noticed that the syntax is kind of similar in sorting and stacking. So you always have the choice!

This tutorial is mainly motivated by Data Science at the Command Line.

This book tries to catch your attention on the capabilities of the command line when you do data science tasks, meaning you can obtain your data, manipulate it, explore it, and make your predictions on it using the command line. If you are a data scientist, aspire to be one, or want to know more about the field, I highly recommend this book. You can read it online for free from its website, or order an ebook or paperback.

You might be interested in my previous tutorials: part 1 of cleaning CSV data, the one on why we use Docker, and the similar one on how to clean text data at the command line.

Take care, will see you in the next tutorials :)

If you want the whole package of cleaning data at the command line in PDF format, get my ebook below 👇

Get your ebook

Click here to get fresh content to your inbox

Resources

Published on Medium