Command Line 101

This post is inspired by a friend who has never heard of the command line before. This is not too surprising because I only started about two years ago. Now, I use it every day.

One of the most important tools in data science is the command line (synonymous phrases include terminal, shell, console, command prompt, Bash). Especially when working with Amazon Web Services (AWS) and Elastic Compute Cloud (EC2), familiarity with the command line is a must.

You may ask, “Why can’t I just use my own computer?” Well, the answer is simple: as data volume increases, it becomes impractical to process terabytes of data with merely 8 or 16 GB of RAM. Using AWS enables scalability when working with Big Data. You are no longer using one local computer, but perhaps 40 computers in the cloud that split the work among them, a concept known as parallel processing. In a nutshell (pun intended), you are paying Amazon to borrow their computers.

The purpose of the command line is to interact with the computer (local or remote) and its filesystem. It is a text-only interface (yes, no more pointing and clicking) for typing commands that your operating system then runs.

Some use cases:

- Read, write, edit, find, move, copy, remove, download files

- Git/GitHub

- Basic data exploration/manipulation

- Logging onto a remote computer aka SSH-ing (Secure Shell)

- Watch Star Wars (Open your terminal and type telnet towel.blinkenlights.nl)

Some dangerous use cases:

- Denial-of-service (DoS) attacks

- Hacking and stealing people’s information

Let’s begin by grabbing some text (Pride and Prejudice, by Jane Austen) from Project Gutenberg: http://www.gutenberg.org/files/1342/1342-0.txt

Anything inside [brackets] will be the definition of the term and anything starting with $ will be command line syntax. 

[wget: download a file from a website]

$ wget http://www.gutenberg.org/files/1342/1342-0.txt

[ls: list files in your current working directory]

Your terminal should show one file called 1342-0.txt when typing ls.

$ ls

1342-0.txt

Files whose names begin with . are hidden files, and ls skips them by default. The argument -a will display them. Some arguments are mandatory while others, like -a, are optional.

[man: view manual page for a command]

Typing man ls will show you what each argument does. Multiple arguments can be combined by typing them together, e.g. ls -ltr will show your files in a long list format, sorted by modification time, with the oldest entries appearing first.
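To see these arguments in action without touching your own files, here is a small sketch that works in a throwaway directory. The names demo_dir, notes.txt and .secret are made up for illustration, and mktemp (which creates a temporary directory) is not covered elsewhere in this post.

```shell
# Work in a throwaway directory so nothing in your own folders is touched.
demo_dir=$(mktemp -d)
cd "$demo_dir"

# Create one regular file and one hidden file (its name starts with a dot).
touch notes.txt .secret

# Plain ls skips hidden files.
visible=$(ls)
echo "$visible"

# ls -a includes hidden files, plus the special entries . and ..
everything=$(ls -a)
echo "$everything"

# Arguments can be combined: -l (long format), -t (sort by time), -r (reverse).
ls -ltr
```

The first listing should contain only notes.txt, while the second should also include .secret.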

[head: print the first 10 lines]

$ head 1342-0.txt

The Project Gutenberg EBook of Pride and Prejudice, by Jane Austen

This eBook is for the use of anyone anywhere at no cost and with almost no restrictions whatsoever.  You may copy it, give it away or re-use it under the terms of the Project Gutenberg License included with this eBook or online at www.gutenberg.org

Title: Pride and Prejudice  

[tail: print the last 10 lines]
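Both head and tail accept -n to change how many lines they print. The sketch below uses a generated 20-line file rather than the book, so the result is easy to check by eye. The names demo_dir and sample.txt are made up, and seq (which prints a sequence of numbers) is an extra command not covered in this post.

```shell
# Build a small sample file containing the numbers 1 to 20, one per line.
demo_dir=$(mktemp -d)
seq 1 20 > "$demo_dir/sample.txt"

# With no arguments, head and tail each print 10 lines.
head "$demo_dir/sample.txt"
tail "$demo_dir/sample.txt"

# -n changes how many lines are printed.
first_three=$(head -n 3 "$demo_dir/sample.txt")
last_two=$(tail -n 2 "$demo_dir/sample.txt")
echo "$first_three"
echo "$last_two"
```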

What you see on the screen is referred to as the standard output. Let’s combine two commands using the pipe operator.

[cat: print the contents of the file(s)]

[ |: pipe operator that passes the output of one command as input to another]

[wc: word count]

$ cat 1342-0.txt | wc

13427  124592  724726

First, cat writes the text content to standard output. Then, the pipe passes that output to the wc command, which reports the line count, word count, and byte count of the file (for plain ASCII text, the byte count equals the character count).
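If you only want one of the three numbers, wc takes -l, -w and -c for lines, words and bytes. Here is a minimal sketch on three short lines of made-up text, using printf to produce the input instead of a file.

```shell
# Three short lines of text, sent straight into wc through a pipe.
printf 'one two\nthree\nfour five six\n' | wc

# -l, -w and -c each ask for a single count: lines, words, bytes.
lines=$(printf 'one two\nthree\nfour five six\n' | wc -l)
words=$(printf 'one two\nthree\nfour five six\n' | wc -w)
bytes=$(printf 'one two\nthree\nfour five six\n' | wc -c)

# $(( )) strips the padding that some versions of wc add around the number.
echo "lines: $((lines)), words: $((words)), bytes: $((bytes))"
```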

[mv: move a file (can be used to rename)]

[mkdir: create a directory/folder]

[cp: copy a file]

[rm: remove a file]

[cd: change directory]

Let’s rename the text file to pride_and_prejudice, create a directory called books, and copy the pride_and_prejudice file into books.

$ mv 1342-0.txt pride_and_prejudice

$ mkdir books

$ cp pride_and_prejudice books/
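The commands cd and rm were defined above but not used, so here is a rough sketch of both. It rebuilds the same layout in a throwaway directory with placeholder contents, so your real copy of the book is left alone. The name demo_dir is made up for illustration.

```shell
# Rebuild the layout from the steps above in a throwaway directory.
demo_dir=$(mktemp -d)
cd "$demo_dir"
echo "placeholder text" > pride_and_prejudice
mkdir books
cp pride_and_prejudice books/

# cd moves into the books directory; ls now lists what is inside it.
cd books
inside_books=$(ls)
echo "$inside_books"

# .. means "the parent directory", so this goes back up one level.
cd ..

# rm deletes the original; the copy in books/ is untouched.
# There is no recycle bin here, so a removed file is gone for good.
rm pride_and_prejudice
after_rm=$(ls)
echo "$after_rm"
```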

[grep: filter based on a pattern]

[ >: write standard output to a file (overwrites if there is an existing file with the same name)]

[ >>: append standard output to the end of a file]

[touch: create an empty file]

[echo: print a message to standard output]

Let’s store all lines containing the word “happy” into a file called happy.txt. Next, let’s store all lines containing the word “sad” into a file called sad.txt. Then, create an empty file called subset and combine the two files together. Add a message to the end of subset that says “Finished”.

$ cat pride_and_prejudice | grep happy > happy.txt

$ cat pride_and_prejudice | grep -sw sad > sad.txt

$ touch subset

$ cat *.txt >> subset

$ echo "Finished" >> subset

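On the second line, the optional argument -sw is used so that words like disadvantage are not captured as well: -w matches sad only as a whole word, and -s suppresses error messages about missing or unreadable files. On the fourth line, the asterisk * is a wildcard, so *.txt performs the operation on all files ending with the extension .txt.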
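To make the difference between grep and grep -w, and between > and >>, easier to see, here is a small sketch on three made-up lines of text. The file names (sample.txt, loose.txt, strict.txt, log.txt) are hypothetical.

```shell
demo_dir=$(mktemp -d)
cd "$demo_dir"
printf 'she was sad\na great disadvantage\nno match here\n' > sample.txt

# Plain grep matches "sad" anywhere, including inside "disadvantage".
cat sample.txt | grep sad > loose.txt

# -w only matches "sad" when it stands alone as a whole word.
cat sample.txt | grep -w sad > strict.txt

# > replaces the contents of the file each time; >> adds to the end.
echo "first" > log.txt
echo "second" > log.txt     # log.txt now holds only "second"
echo "third" >> log.txt     # log.txt now holds "second" and then "third"

loose_count=$(cat loose.txt | wc -l)
strict_count=$(cat strict.txt | wc -l)
log_contents=$(cat log.txt)
echo "plain grep: $((loose_count)) lines, grep -w: $((strict_count)) lines"
echo "$log_contents"
```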

Let’s say you were tasked with downloading 100 files (Books 1000–1099) from the Project Gutenberg website AND changing the file name to the title of the book. It might seem like a very monotonous task, but using the command line, it can be done in just a few lines!

First, we need to learn how to write for loops.

for i in 1 2 3 4 5

do

    echo "Hi Person $i"

done

The output would be:

Hi Person 1

Hi Person 2

Hi Person 3

Hi Person 4

Hi Person 5
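Typing out every number by hand does not scale to 100 books. In Bash, a range can be written as {1..5} instead, a feature known as brace expansion (other shells may not support it). A minimal sketch:

```shell
# {1..5} expands to the words 1 2 3 4 5 before the loop starts.
for i in {1..5}
do
    echo "Hi Person $i"
done

# The expansion works with any command, so echo simply sees five words.
expanded=$(echo {1..5})
echo "$expanded"
```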

A slightly more complicated example:

for i in $( ls )

do

    echo file: $i

done

The output would be:

file: books

file: happy.txt

file: pride_and_prejudice

file: sad.txt

file: subset

The $( ) syntax, known as command substitution, enables you to use the output of one command inside ANOTHER command.
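Command substitution is also how you store the output of a command in a variable. Here is a short sketch with three made-up files; it also shows the *.txt wildcard as an alternative to $( ls ) in a loop, which tends to cope better with file names that contain spaces.

```shell
demo_dir=$(mktemp -d)
cd "$demo_dir"
touch a.txt b.txt c.txt

# The inner pipeline runs first; its output is stored in file_count.
file_count=$(ls | wc -l)
echo "There are $((file_count)) files here"

# Looping over a wildcard instead of the output of ls.
for f in *.txt
do
    echo "file: $f"
done
```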

On the Gutenberg website, the files will be at either http://www.gutenberg.org/files/1/1-0.txt or http://www.gutenberg.org/files/1/1.txt (it is inconsistent whether or not they have a -0 in the file name).

To account for both scenarios, we can use the || operator, which runs the second command only if the first one fails.
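Here is a rough offline sketch of the same fallback idea, using local files instead of downloads. The commands true and false (which always succeed and always fail) and the 2>/dev/null part (which hides the error message) are extras not covered in this post, and the file names are made up.

```shell
# false always fails, so the command after || runs.
first=$(false || echo "fallback ran")

# true always succeeds, so the command after || is skipped.
second=$(true || echo "fallback ran")

echo "after a failure: '$first'"
echo "after a success: '$second'"

# Same idea with files: try one name and fall back to the other.
demo_dir=$(mktemp -d)
echo "hello" > "$demo_dir/1-0.txt"
contents=$(cat "$demo_dir/1.txt" 2>/dev/null || cat "$demo_dir/1-0.txt")
echo "$contents"
```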

[tr: translate characters (using -d will delete the listed characters; adding -c applies the operation to every character NOT listed)]
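A few small examples of tr on made-up strings may help before it appears in the loop:

```shell
# Translate: replace every lowercase letter with its uppercase version.
upper=$(echo "pride and prejudice" | tr 'a-z' 'A-Z')

# -d deletes the listed characters.
no_commas=$(echo "Pride, and, Prejudice" | tr -d ',')

# -c inverts the set, so -cd deletes everything EXCEPT the listed characters.
# Here only letters, digits and whitespace survive.
cleaned=$(echo "Mrs. Warren's Profession!" | tr -cd '[:alnum:][:space:]')

echo "$upper"
echo "$no_commas"
echo "$cleaned"
```

The last line is why punctuation such as apostrophes goes missing from the book titles produced by the loop.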

The code will be the following (step-by-step details can be seen below):

$ mkdir gutenberg

$ cd gutenberg

$ for i in {1000..1099}

> do

> wget -O file "http://www.gutenberg.org/files/$i/$i.txt" || wget -O file "http://www.gutenberg.org/files/$i/$i-0.txt"

> name=$(cat file | head -n 1 | tr -cd "[:alnum:][:space:]")

> name="${name/$'\r'/}"

> mkdir "$i"

> mv file "$i/$name"

> done

Typing ls should give you this:

1000  1007  1014  1021  1028  1035  1042  1049  1056  1063  1070  1077  1084  1091  1098

1001  1008  1015  1022  1029  1036  1043  1050  1057  1064  1071  1078  1085  1092  1099

1002  1009  1016  1023  1030  1037  1044  1051  1058  1065  1072  1079  1086  1093

1003  1010  1017  1024  1031  1038  1045  1052  1059  1066  1073  1080  1087  1094

1004  1011  1018  1025  1032  1039  1046  1053  1060  1067  1074  1081  1088  1095

1005  1012  1019  1026  1033  1040  1047  1054  1061  1068  1075  1082  1089  1096

1006  1013  1020  1027  1034  1041  1048  1055  1062  1069  1076  1083  1090  1097

To view the files inside the folders, you can use ls -R :

./1095: 'The Project Gutenberg EBook of The Light of Western Stars by Zane Grey'

./1096: 'The Project Gutenberg Etext of The Faith of Men by Jack London'

./1097: 'Project Gutenbergs Mrs Warrens Profession by George Bernard Shaw'

./1098: 'The Project Gutenberg EBook of The Turmoil by Booth Tarkington'

./1099: 'The Project Gutenberg EBook of The Riverman by Stewart Edward White'  

Making a folder called gutenberg and changing directory to it

$ mkdir gutenberg

$ cd gutenberg

Starting the for loop where i will be a number from 1000 to 1099 (inclusive)

$ for i in {1000..1099}

> do

The argument -O saves the download under the given name, here file. The command first tries to download the .txt version and, if that fails, tries the -0.txt version.

$ wget -O file "http://www.gutenberg.org/files/$i/$i.txt" || wget -O file "http://www.gutenberg.org/files/$i/$i-0.txt"

This will take the text file, retrieve the first line (where the title is located), keep only alphanumeric and whitespace characters, and store the string as a variable called name. [:alnum:] and [:space:] are character sets for alphanumeric and whitespace characters respectively. The next line removes the carriage return (\r, written $'\r' in Bash) that remains at the end of the title, e.g. converting 'The Project Gutenberg EBook of the Riverman by Stewart Edward White'$'\r' to 'The Project Gutenberg EBook of the Riverman by Stewart Edward White'. The carriage return typically comes from Windows-style line endings, and it survives the tr step because it counts as whitespace. This uses the concept of parameter expansion with this syntax: ${parameter/pattern/string}, which replaces the first match of pattern (${parameter//pattern/string} would replace every match). In this part, the string component is empty, so it replaces \r with nothing.

> name=$(cat file | head -n 1 | tr -cd "[:alnum:][:space:]")

> name="${name/$'\r'/}"
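Here is a minimal sketch of this substitution on its own, assuming Bash (the $'\r' notation and this form of substitution are not available in every shell). The title is shortened for illustration, and ${#name}, which gives the length of a variable, is an extra not covered in this post.

```shell
# A title that ends in a carriage return, which is invisible when printed.
name=$'The Riverman\r'
echo "length before: ${#name}"

# ${name/pattern/string} replaces the first match; an empty string deletes it.
name="${name/$'\r'/}"
echo "length after: ${#name}"

# A doubled slash replaces every match instead of only the first.
spaced="a b c"
first_only="${spaced/ /_}"
every_one="${spaced// /_}"
echo "$first_only"
echo "$every_one"
```

The length should drop by one after the substitution, even though the title looks the same on screen.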

This last part makes a folder named after the book number, moves the file into it under its new name, and ends the for loop.

> mkdir "$i"

> mv file "$i/$name"

> done
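If you would like to try the renaming logic without downloading anything, here is an offline dry run, assuming Bash. It swaps the wget step for a cp of two made-up local files (1098.txt and 1099.txt, with shortened, invented first lines), and keeps the rest of the loop body as it is above.

```shell
# Made-up stand-ins for two downloaded books, with Windows-style line endings.
demo_dir=$(mktemp -d)
cd "$demo_dir"
printf 'The Turmoil, by Booth Tarkington\r\nChapter I\r\n' > 1098.txt
printf 'The Riverman, by Stewart Edward White\r\nChapter I\r\n' > 1099.txt

for i in 1098 1099
do
    # Stand-in for the wget step: copy the local file to the name "file".
    cp "$i.txt" file
    # First line only, keeping just letters, digits and whitespace.
    name=$(cat file | head -n 1 | tr -cd "[:alnum:][:space:]")
    # Strip the carriage return left at the end of the title.
    name="${name/$'\r'/}"
    mkdir "$i"
    mv file "$i/$name"
done

ls -R
```

Each numbered folder should end up holding one file named after the cleaned-up first line.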

Thank you for reading! I hope you were able to learn the basics of the command line from this tutorial.
