☰ Menu

      Introduction to the Command Line for Bioinformatics

Home
Introduction
Introduction to the Workshop and the Core
Schedule
Support
Slack
Zoom
Cheat Sheets
Windows 10 and Mac installation
CLI
Introduction to the Command Line 1
Introduction to the Command Line 2
Introduction to the Command Line 3
Extra
Software Installation
Installation using make and cmake
Conda
ETC
Closing thoughts
Github page
Biocore website

Extra Command-Line

Find

Find is a very powerful command that is used to recursively find files/directories in a file system. First take a look at the find man page:

man find

Notice there are LOTS of options. The simplest version of the find command will simply list every file and directory within a path, as far down as it can go:

find /usr/lib

If we want to refine the command to only show you files that end in “.so”, we use the “-name” option:

find /usr/lib -name "*.so"

One of the most powerful uses of find is to execute commands on every file it finds. To do this, you use the “-exec” option. When you use that option, everything after the “-exec” is assumed to be a command, and you use the “{}” characters to substitute for the file names that it finds. So in the command below, “wc -l” will get executed sequentially for every file it finds. Finally, the exec option needs to end with a semi-colon, however, since the semi-colon is a special character that the shell will try to interpret, you need to “escape” the semi-colon with a backslash, to indicate to the shell that the semi-colon is NOT to be interpreted and just sent as is to the find command:

find /usr/lib -name "*.so" -exec wc -l {} \;

You will probably want to Ctrl-C out of this because it will take a long time to go through them all.

Xargs

xargs is another command that can be very useful for running a program on a long list of files. For example, the “find” commands we ran above could be run using xargs like this:

find /usr/lib -name "*.so" | xargs wc -l

This is taking the output of the find command and then creating a list of all the filenames which it adds to the command given to xargs. So, in this case, after “xargs” comes “wc -l”… so “wc -l” will get run on the entire list of filenames from find.

.bashrc/.bash_profile, aliases & the PATH variable

On a Linux system, there is usually a user-modifiable file of commands that gets run every time you log in. This is used to set up your environment the way that you want it. On our systems, the file is “.bash_profile” and it resides in your home directory. Sometimes the file is called “.bashrc” as well. Take a look at a .bash_profile:

cat ~/.bash_profile

This one has a lot of stuff in it to set up the environment. Now take a look at your own .bash_profile. One of the things you set up was adding to the PATH variable. One thing that is very useful to add to the PATH variable is the “.” directory. This allows you to execute things that are in your current directory, specified by “.”. So, let’s use nano to edit our .bash_profile and add “.” to PATH:

nano ~/.bash_profile

Add “:.” to the PATH variable by adding this line:

export PATH=$PATH:.

Now, next time you log in, “.” will be in your PATH.

Another thing that is very useful are aliases. An alias is a user-defined command that is a shortcut for another command. For example, let’s say you typed the command “ls -ltrh” a lot and it would be easier to have it be a simpler command. Use an alias:

alias lt='ls -ltrh'

Now, you’ve created an alias that lists the contents of a directory with extra information (-l), in reverse time order of last modified (-r and -t), and with human readable file sizes (-h). Try it out:

lt

Typing alias by itself give you a list of all the aliases:

alias

You can now put the alias command for lt in your .bash_profile and you will have it automatically when you log in.

Environment variables

In Linux, environment variables act as placeholders for information stored within the system that passes data to programs launched in your shell. To look at most of the environment variables currently in your shell, use the env command:

env

In order to see the value of any one variable, you can “echo” the value using a “$” in front of the variable name:

echo $USER

This gives you your user ID. Some of the most commonly used environment variables are USER, HOSTNAME, SHELL, EDITOR, TERM, and PATH.

You can also create your own environment variables:

MYVAR=1
echo $MYVAR

By convention, variable names are in all caps, but they don’t have to be. You can then update the variable as well:

MYVAR=2
echo $MYVAR

In order to use any environment variable in a program that you execute from the command-line, you need to use the export command:

export MYVAR=3

You can also use backticks (`) to execute a command and send the output into a variable:

MYVAR=`pwd`
echo $MYVAR

More grep

Grep is a very powerful tool that has many applications. Grep can be used to find lines in a file that match a pattern, but also you can get lines above and below the matching line as well. The “-A” option to grep is used to specify number of lines after a match and the “-B” option is for number of lines before a match. So, for example, if you wanted to find a particular sequence in a fastq file and also the 3 other lines that form that entire fastq record, you would do this:

zcat C61_S67_L006_R1_001.fastq.gz | grep -B1 -A2 CACAATGTTTCTGCTGCCTGAACC

This looks for the sequence “CACAATGTTTCTGCTGCCTGAACC” in the fastq file and then also prints the line before and two lines after each match.

Another thing grep can do is regular expressions. Regular expressions are a way of specifying a search pattern. Two very useful characters in regular expressions are “^” and “$”. The “^” symbol specifies the beginning of a line and the “$” specifies the end of a line. So, for example, if you wanted to find just the lines that began with “TTCCAACACA” you would do this:

zcat C61_S67_L006_R1_001.fastq.gz | grep ^TTCCAACACA

Without the “^”, grep will find any line that has “TTCCAACACA” anywhere in the line, not just the beginning. Conversely, if you wanted to find the lines that ended in “TAAACTTA”:

zcat C61_S67_L006_R1_001.fastq.gz | grep TAAACTTA$

There are also extended regular expression that grep can use to do more complex matches, using the “-E” option:

zcat C61_S67_L006_R1_001.fastq.gz | grep -E '^TTCCAACACA|TAAACTTA$'

This command will find any line that begins with “TTCCAACACA” OR ends with “TAAACTTA”. The “|” character means OR.

CHALLENGE: Find a way to use grep to match any line that has between 7 and 16 ‘A’s in a row at the end of the line. You will probably need to look at the man page for grep.

The Prompt

The Prompt is the part of the command line that is the information on the screen before where you type commands. Typically, it has your username, the name of the machine you are on, and your current directory. This, like everything else in Linux, is highly customizable. Your current prompt probably has your username, your hostname, and your current directory. However, there are many other things you could add to it if you so desired. The environment variable used for your prompt is “PS1”. So to change your prompt you just need to reassign PS1. There are some special escape characters that you use to specify your username, hostname, etc… these can be found in the “PROMPTING” section of the bash man page:

man bash

Then type “/” and “PROMPTING” to find the section. You’ll see that for username it is “\u”, for hostname it is “\h”, and for the full current working directory it is “\w”. So you would change your prompt by doing this:

export PS1="\u@\h:\w\\$ "

The convention is to put the “@” symbol and the “:” symbol to delineate the different parts, but you can use anything you want. Also, it is convention to put a “$” at the end to indicate the end of the prompt.

You can also do all kinds of fancy things in your prompt, like color and highlighting. Here is an example of such a prompt:

export PS1='\[\033[7;29m\]\u@\h\[\033[0m\]:\[\e[1m\]\w\[\e[m\]$ '

When you have a prompt you like, you can put it in your .bash_profile/.bashrc so that it is automatically set when you log in.

Awk

Awk is a simple programming language that can be used to do more complex filtering of data. Awk has many capabilities, and we are only going to touch on one of them here. One really useful thing is to filter lines of a file based on the value in a column. Let’s get a file with some data:

wget https://raw.githubusercontent.com/ucdavis-bioinformatics-training/2022-Jan-Introduction-to-the-Command-Line-for-Bioinformatics/master/cli/DMR.GBM2.vs.NB1.bed -O DMR.GBM2.vs.NB1.bed

Take a look at the beginning of the file:

head DMR.GBM2.vs.NB1.bed

Let’s say we wanted to get only the lines where the pvalue (column 10) was below a certain value. In awk, the default delimiter is the tab character, which most bioinformatics files use as their delimiter. Using the tab as a delimiter, it assigns each value in a column to the variables $1, $2, $3, etc… for as many columns as there are. So if we wanted to get lines where the pvalue column was under 0.00000005 pvalue:

cat DMR.GBM2.vs.NB1.bed | awk '$10 < 0.00000005'

And lines where the pvalue >= 0.00000005:

cat DMR.GBM2.vs.NB1.bed | awk '$10 >= 0.00000005'

You can also use it to extract lines with a particular word in a column:

cat DMR.GBM2.vs.NB1.bed | awk '$1 == "chr3"'

A double equals (==) is used for equality comparisons. This will pull out lines where the chromosome column is “chr3”.

Let’s say you wanted only the lines where the p-value<0.00000005 AND only on chr3:

cat DMR.GBM2.vs.NB1.bed | awk '$1 == "chr3" && $10 < 0.00000005'

You can also change the value of one of the columns and then print each line:

cat DMR.GBM2.vs.NB1.bed | awk '{$2=$2+11; print $0;}'

Or if you want to print a count variable appended to a column, you can use a count variable. You can use the BEGIN keyword to specify awk code that is to be run once before the rest, i.e. initialize the count. And space is the concatenating operator for strings that you use to add the count value to the first column.:

cat DMR.GBM2.vs.NB1.bed | awk 'BEGIN{cnt=0;} {$1 = $1 "_" cnt; cnt=cnt+1; print $0}'

Take a look at the awk manual to learn more about the capabilities of awk.