I’ve been reading Data Science at the Command Line recently, and a problem I ran into along the way showed me an important reason why we use Docker. Finding something out in practice means a lot to me, even when I’ve already heard plenty about it in theory. You’ve probably faced this problem before too, and many times you’ve said: ‘it works on my machine!’ So why doesn’t it work on another machine?
What is Data Science at the Command Line?
Before we dive into the problem and how Docker can solve it, let’s see what Data Science at the Command Line is really about.
Disclosure: the Amazon link is a paid link, so if you buy the book, I will earn a small commission.
This book draws your attention to the power of the command line for data science tasks: you can obtain your data, manipulate it, explore it, and make predictions on it, all from the command line. If you are a data scientist, aspire to be one, or want to know more about the field, I highly recommend this book. You can read it online for free on its website or order an ebook or paperback.
Getting data from OpenML
Let’s look at the problem by first getting some data and seeing what we can and can’t do with it locally. OpenML is an open-source platform that aims to build an open, organized, online ecosystem for machine learning. One of its features is that you can get public datasets and start playing with them. Let’s first make a new directory to download the data into:
$ mkdir docker-tutorial
$ cd docker-tutorial
Let’s fetch a dataset called penguins using the command-line tool curl, then list what we got using ls:
$ curl -sL 'https://www.openml.org/data/get_csv/21854866/penguins.arff' > penguins.csv
$ ls
Now that we have penguins.csv on our machine, let’s find out how many lines it has:
$ wc -l penguins.csv
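Keep in mind that wc -l counts lines, so the header row is included in the number. As a minimal sketch of the difference, the snippet below uses a tiny made-up CSV (not the real penguins data; the sample_csv name and its values are assumptions for illustration) and counts both the total lines and the data rows:

```shell
# Create a throwaway CSV with one header line and two invented data rows.
sample_csv=$(mktemp)
printf 'culmen_length_mm,body_mass_g\n39.1,3750\n39.5,3800\n' > "$sample_csv"

# Total lines, header included (tr strips the padding some wc versions add).
total_lines=$(wc -l < "$sample_csv" | tr -d ' ')

# Data rows only: skip the first line before counting.
data_rows=$(tail -n +2 "$sample_csv" | wc -l | tr -d ' ')

echo "lines: $total_lines, data rows: $data_rows"
rm -f "$sample_csv"
```

The same tail -n +2 trick should work on penguins.csv if you want the number of penguins rather than the number of lines.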
Let’s see the first 10 lines of that CSV file:
$ head -n 10 penguins.csv
The result should look like this:
Notice the _mm suffix in each of these columns: culmen_length_mm, culmen_depth_mm, flipper_length_mm, and the _g suffix in body_mass_g.
I’d like to get rid of these suffixes, so that the columns become: culmen_length, culmen_depth, flipper_length, body_mass.
We can do that by replacing them with nothing using sed, a powerful stream editor that can insert, delete, search, and replace (substitute) text.
$ sed -i -r '1s/_(mm|g)//g' penguins.csv
To understand this command, we should know a few things:
- the option -i makes the changes in place in the penguins.csv file
- the option -r indicates that we’re using extended regex to match the text
- ‘1s/_(mm|g)//g’ is the sed script; let’s break it down piece by piece:
1 operate on the first line only
s start substituting
/ the first delimiter, which comes right before the pattern you want to match
_( the regex starts with an underscore followed by a group of…
mm two m’s
|g or a g
) the end of the group
/ the second delimiter, between the regex and the replacement string
(nothing) the replacement string is empty here because we want to delete the match
/g the closing delimiter followed by the g flag: replace all matches on the line, not just the first
So this command should remove the _mm and _g suffixes from the header.
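As a rough sketch of what the substitution should produce, here is the same script run on a small sample piped through sed, so no file gets modified. The sample header uses the column names mentioned above, while the data row is invented for illustration. The sketch uses -E, which both GNU sed and BSD sed accept for extended regex, in place of -r:

```shell
# Two-line sample: a header plus an invented row that also contains "_g",
# to show that the leading 1 restricts the substitution to the first line.
sample='culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g
39.1,18.7,181,tag_g'

# No -i here: sed reads from the pipe and writes the result to stdout.
trimmed=$(printf '%s\n' "$sample" | sed -E '1s/_(mm|g)//g')
printf '%s\n' "$trimmed"
```

Only the header loses its suffixes; the tag_g value on the second line is left alone.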
BUT the problem is that I got the same output as before:
When I investigated, I found that the sed on my Mac is the BSD version that ships with macOS, not GNU sed, which I hadn’t installed. The two differ in their options: BSD sed expects a backup-file suffix right after -i, so the -r here was most likely taken as that suffix instead of switching on extended regex, and the pattern then matched nothing. The Data Science at the Command Line book uses Linux, so I need the same environment as the book to be able to follow along. Luckily the book provides a Docker image on the public registry, but that image is ~905 MB, which is very big for the small thing I’d like to demonstrate here.
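If you’re not sure which sed you have, a quick heuristic (a sketch, not an official check) is to ask for its version: GNU sed answers sed --version with a first line starting with sed (GNU sed), while BSD sed rejects the option. The sed_flavor variable name is just for illustration:

```shell
# GNU sed prints "sed (GNU sed) <version>" as the first line of --version;
# BSD sed (the macOS default) does not support --version and reports an error.
if sed --version 2>/dev/null | head -n 1 | grep -q '^sed (GNU sed)'; then
  sed_flavor="GNU"
else
  sed_flavor="not GNU (possibly BSD)"
fi

echo "sed flavor: $sed_flavor"
```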
Solving ‘it works on my machine’ problem
So what does Docker actually do here? It lets me run the sed command that failed on my machine but worked on the author’s machine. If I have the same Linux environment with GNU sed installed in it, I can proceed with the book and follow along, and my life will be better. Let’s try that and learn how Docker really works.
To have this environment, you need a Docker image. But what is that?
- A Docker image is a read-only template, built from a set of instructions (a Dockerfile), that containers are created from
- A container is a running instance of an image: an isolated environment that packages everything the application needs (dependencies included) and in which the application runs
So how can we get a container with that new environment?
- Either we get a ready-made Docker image, and when we run it we have a container and we’re good to go
- Or we build the image ourselves from scratch and then run it
I will show you both ways:
1. Getting a ready docker image
There is a public registry, Docker Hub, that is like GitHub but for Docker images. There you can find the image that you want to use, and later you can share your own image with others, as we’ll see in the second part.
The Docker image I created for this tutorial resides in this repo on Docker Hub, and you can pull it using this command:
$ docker pull ezzeddin/penguins-in-docker
Running the docker image
Let’s run this image and start using the environment that we wished for:
$ docker run --rm -it -v `pwd`:/data ezzeddin/penguins-in-docker
- docker run is the command that creates and starts a container from the Docker image
- the option --rm removes the container after it exits
- the option -it, a combination of -i and -t, gives us an interactive process (a shell)
- -v `pwd`:/data mounts a volume: the absolute path of the local directory (given by `pwd`) is mounted at /data, the working directory in the container, so that sed can reach the penguins.csv file
- ezzeddin/penguins-in-docker is the Docker image name
Note: this command works on Mac and Linux.
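The same mount can also be used without an interactive shell. The sketch below is a hedged variant of the command above: it uses $(pwd) in double quotes instead of backticks, so that a local path containing spaces still works, and it passes ls /data as the command to run in place of the image’s default shell. It assumes the image defines no ENTRYPOINT, and the image and mount_spec variable names are made up for illustration. It only calls Docker if the docker CLI and daemon are available:

```shell
image="ezzeddin/penguins-in-docker"
# host-path:container-path, the same mapping as the -v option above
mount_spec="$(pwd):/data"

if command -v docker >/dev/null 2>&1 && docker info >/dev/null 2>&1; then
  # List the mounted directory from inside a throwaway container.
  docker run --rm -v "$mount_spec" "$image" ls /data || echo "docker run failed"
else
  echo "Docker is not available; skipping: docker run --rm -v \"$mount_spec\" $image ls /data"
fi
```

If everything is set up, the listing should include the files from your local directory, penguins.csv among them.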
After running this Docker image, a shell opens and you can see that the prompt has changed from $ to bash-5.0#. If you list what’s in there using ls, you can see the files from your local directory mounted in the container, including penguins.csv.
Let’s get to the moment we’ve been waiting for, using sed:
bash-5.0# sed -i -r '1s/_(mm|g)//g' penguins.csv
Let’s check whether the in-place change really happened to penguins.csv:
bash-5.0# head -n 10 penguins.csv
And voilà!
The penguins are working in Docker now: sed works well in the container’s environment, and we can see that the columns are trimmed to culmen_length, culmen_depth, flipper_length, and body_mass.
Let’s see how this image was built, or better, how you can build it again from scratch and put it in your own public repo on Docker Hub…
2. Dockerfile
Let’s first exit this shell and return to our local directory by pressing CTRL+D or typing exit.
We need to write a Dockerfile that results in a lightweight Docker image containing the sed program we want, plus anything else we need to install.
Let’s see what this Dockerfile has:
- FROM alpine:3.12.0 gets a lightweight Linux distribution to build on top of
- LABEL maintainer "EzzEddin Abdullah <ezzeddinabdullah@gmail.com>" attributes the Dockerfile to its author
- RUN apk add --no-cache sed bash is like apt-get install on Ubuntu, but for another distribution (Alpine), which yields a much smaller image. Here we install sed, and also bash so that we can interact with the terminal and write some bash commands after we run the Docker image. The option --no-cache avoids caching the package index locally, which is useful for keeping images small
- WORKDIR /data sets the working directory to /data
- CMD bash starts a bash shell when the container runs
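Putting those instructions together, the Dockerfile can be sketched as follows. This is a reconstruction from the list above rather than a verbatim copy of the original file, and the LABEL line uses the key=value form, which is an assumption here. The heredoc writes the file into the current directory, replacing any Dockerfile already there:

```shell
# Write the Dockerfile described above into the current directory.
cat > Dockerfile <<'EOF'
# Lightweight base image
FROM alpine:3.12.0
LABEL maintainer="EzzEddin Abdullah <ezzeddinabdullah@gmail.com>"
# Install sed and bash without keeping the apk index cache
RUN apk add --no-cache sed bash
# Directory the local files get mounted into
WORKDIR /data
# Default command: open a bash shell
CMD bash
EOF

# Show what was written.
cat Dockerfile
```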
Building a Dockerfile
Make sure you’re still in the same docker-tutorial directory. You now have the Dockerfile and penguins.csv (the edited version) in that directory. Let’s download the penguins file again, this time as penguins_v2.csv, using:
$ curl -sL 'https://www.openml.org/data/get_csv/21854866/penguins.arff' > penguins_v2.csv
Now we need to build the image from the Dockerfile:
$ docker build -t username/penguins-in-docker .
- docker build builds a Docker image from a Dockerfile
- -t tags the Docker image that you’re building
- username/penguins-in-docker is the tag; use your own Docker Hub username so that you can put the image on the public registry
- . the dot specifies the build context, the directory where Docker looks for the Dockerfile, which is the current directory here
This Docker image now resides locally on your machine, so you can go to the next section and run it. But if you want to make it public so that people, including you, can pull it, guess what! You just need to push it to Docker Hub under the same name, since the name already starts with your Docker Hub username (you need to be logged in, for example via docker login):
$ docker push username/penguins-in-docker
Running the docker image
Here we run the docker image as before:
$ docker run --rm -it -v `pwd`:/data username/penguins-in-docker
The only difference here is your username. You can run it now, then write the sed command in the new shell (for example on penguins_v2.csv), and you’re good to go.
Conclusion
Now there’s no need to worry that your environment differs from mine, because we can both have the same set of dependencies we need to solve our problems, without the hassle of installing a specific version of a program that works for both of us, especially when there are many dependencies. I hope this basic tutorial made Docker simpler for you. If you want to go deeper, have a look at the documentation and the other references I put here.
To sum up, you can now download a file using curl, use sed to manipulate strings, pull an already existing Docker image, mount a local directory as a volume, run a Docker image, write a Dockerfile and build your own Docker image from it, and finally see a big reason why we use Docker: to get the isolated environment we need and to solve the ‘it works on my machine’ problem.
Thank you for reading this far!
Resources
- docker run doc
- What is Docker and How to Use it With Python (Tutorial)
- Data Science at the Command Line
- OpenML a platform for democratizing machine learning
- sed command with examples
- Alpine Dockerfile Advantages of --no-cache Vs. rm /var/cache/apk/*
- basic and extended regex
- Photo by Derek Oyen on Unsplash
More tutorials?
If you want to see more tutorial blog posts, check out:
https://www.ezzeddinabdullah.com/categories/tutorials