☰ Menu

      Introduction to Analysis of Epigenetic Data 2020

Home
Introduction
Introduction to the Workshop and the Core
Schedule
What is Bioinformatics/Genomics?
Some Experimental Design?
Support
Slack
Zoom
Cheat Sheets
Software and Links
Scripts
Prerequisites
Introduction to the Command Line
Introduction to R
High Performance Computing
Data Reduction ChIPseq/ATACseq
Filetype Description
ChIP and ATAC project setup
Preprocessing Reads
Mapping Reads
Filtering Reads
Peak Call with MACS2
Data Analysis ChIPseq/ATACseq
Differential ChIPseq
Differential ATACseq
R Package Installation
Methylation
Introduction to WGBS
WGBS hands-on
ETC
Closing thoughts
Workshop Photos
Github page
Biocore website

Introduction to Command Line Interface

Outline:

  1. What is the command line?
  2. Directory Structure
  3. Syntax of a Command
  4. Logging Into a Remove Server
  5. Command Line Basics (ls, pwd, Ctrl-C, man, alias, ls -lthra)
  6. Getting Around (cd)
  7. Absolute and Relative Paths
  8. Tab Completion
  9. History Repeats Itself (history, head, tail, )
  10. Editing Yourself (Ctrl-A, Ctrl-E, Ctrl-K, Ctrl-W)
  11. Create and Destroy (echo, cat, rm, rmdir)
  12. Transferring Files (scp)
  13. Piping and Redirection (|, >, », cut, sort, grep)
  14. Compressions and Archives (tar, gzip, gunzip)
  15. Forced Removal (rm -r)
  16. BASH Wildcard Characters (?, *, find, environment variables($), quotes/ticks)
  17. Manipulation of a FASTA file (cp, mv, wc -l/-c)
  18. Symbolic Links (ln -s)
  19. STDOUT and STDERR (>1, >2)
  20. Paste Command (paste, for loops)
  21. Shell Scripts and File Permissions (chmod, nano, ./)

A CLI cheat sheet

This browser does not support PDFs. Please download the PDF to view it: Download PDF.

</embed>

What is the Command-Line Interface

cli_figure1

Logging In

In order to log in, you should have already created an account on our systems.

For Macs/Linux/Windows 10 - Logging In

  1. For Macs, open a Terminal (usually under Applications/Utilities on a Mac), or install iterm2. For Linux, just open a regular terminal. For Windows 10, open a command prompt by searching for and running “cmd”.

  2. Copy and paste this into the terminal:

    ssh username@tadpole.genomecenter.ucdavis.edu

where ‘username’ is replaced with your username. Press Enter. You will probably get a warning about not being able to establish the authenticity of the host and it will ask if you want to continue connecting. Just type “yes” and press enter.

  1. Type in your password. No characters will display when you are typing. Press Enter.

For Windows 8 and less

  1. Download and install PuTTY.
  2. Open up PuTTY
  3. In the Host Name field, type tadpole.genomecenter.ucdavis.edu
  4. Make sure the Connection Type is SSH.
  5. Press “Open”. It will ask you for your username and password.

After opening, system messages are often displayed, followed by the “prompt”. A prompt is a short text message at the start of the command line and ends with ‘$’ in bash shell, commands are typed after the prompt. The prompt typically follows the form username@server:current_directory$. If your screen looks like the one below, i.e. your see your a bunch of messages and then your username followed by “@tadpole:~$” at the beginning of the line, then you are successfully logged in.

cli_figure4

Command Line Basics

First some basics - how to look at your surroundings.

pwd

present working directory … where am I?

ls

list files here … you should see nothing since your homes are empty

ls /tmp/

list files somewhere else, like /tmp/

Because one of the first things that’s good to know is how to escape once you’ve started something you don’t want.

sleep 1000  # wait for 1000 seconds!

Use Ctrl-c (shows as ‘^C’ in the terminal) to exit (kill) a command. In some cases, a different key sequence is required (Ctrl-d). Note that anything including and after a “#” symbol is ignored, i.e. a comment. So in all the commands below, you do not have to type anything including and past a “#”.

Options

Each command can act as a basic tool, or you can add ‘options’ or ‘flags’ that modify the default behavior of the tool. These flags come in the form of ‘-v’ … or, when it’s a more descriptive word, two dashes: ‘--verbose’ … that’s a common (but not universal) one that tells a tool that you want it to give you output with more detail. Sometimes, options require specifying amounts or strings, like ‘-o results.txt’ or ‘--output results.txt’ … or ‘-n 4’ or ‘--numCPUs 4’. Let’s try some, and see what the man page for the ‘list files’ command ‘ls’ is like.

ls -R /

Lists directories and files recursively. This will be a very long output, so use Ctrl-C to break out of it. Sometimes you have to press Ctrl-C many times to get the terminal to recognize it. In order to know which options do what, you can use the manual pages. To look up a command in the manual pages type “man” and then the command name. So to look up the options for “ls”, type:

man ls

Navigate this page using the up and down arrow keys, PageUp and PageDown, and then use q to quit out of the manual. In this manual page, find the following options, quit the page, and then try those commands. You could even open another terminal, log in again, and run manual commands in that terminal.

ls -l /share/genomes/GENCODE/GRCm38.p6/ # long format, gives permission values, owner, group, size, modification time, and name

ls1

ls -a /share/genomes/GENCODE/GRCm38.p6/ # shows ALL files, including hidden ones

ls2

ls -l -a /share/genomes/GENCODE/GRCm38.p6/ # does both of the above

ls3

ls -la  /share/genomes/GENCODE/GRCm38.p6/ # option 'smushing' can be done with single letter options

ls4

ls -ltrha /share/genomes/GENCODE/GRCm38.p6/ # shows all files, long format, in last modified time reverse order, with human readable sizes

ls5

And finally adding color (white for regular files, blue for directories, turquoise for links):

ls -ltrha --color /share/genomes/GENCODE/GRCm38.p6/ # single letter (smushed) vs word options (Linux)

OR

ls -ltrhaG /share/genomes/GENCODE/GRCm38.p6/ # (MacOS)

ls6

Quick aside: what if I want to use same options repeatedly? and be lazy? You can create a shortcut to another command using ‘alias’.

alias ll='ls -lah'
ll

Directory Structure

Absolute path: always starts with ”/” - the root folder

/share/workshop/msettles/cli

the folder (or file) “cli” in the folder “msettles” in the folder “workshop” in the folder “share” from the root folder.

Relative path: always relative to our current location.

a single dot (.) refers to the current directory
two dots (..) refers to the directory one level up

cli_figure2

Usually, /home is where the user accounts reside, ie. users’ ‘home’ directories. For example, for a user that has a username of “msettles”: their home directory is /home/msettles It is the directory that a user starts in after starting a new shell or logging into a remote server.

The tilde (~) is a short form of a user’s home directory.

Syntax of a command

cli_figure3

Quiz 1

Getting Around

The filesystem you’re working on is like the branching root system of a tree. The top level, right at the root of the tree, is called the ‘root’ directory, specified by ‘/’ … which is the divider for directory addresses, or ‘paths’. We move around using the ‘change directory’ command, ‘cd’. The command pwd return the present working directory.

cd  # no effect? that's because by itself it sends you home (to ~)
cd /  # go to root of tree's root system
cd home  # go to where everyone's homes are
pwd
cd username  # use your actual home, not "username"
pwd
cd /
pwd
cd ~  # a shortcut to home, from anywhere
pwd
cd .  # '.' always means *this* directory
pwd
cd ..  # '..' always means *one directory up*
pwd

cli_figure5

You should also notice the location changes in your prompt.

Absolute and Relative Paths

You can think of paths like addresses. You can tell your friend how to go to a particular store from where they are currently (a ‘relative’ path), or from the main Interstate Highway that everyone uses (in this case, the root of the filesystem, ‘/’ … this is an ‘absolute’ path). Both are valid. But absolute paths can’t be confused, because they always start off from the same place, and are unique. Relative paths, on the other hand, could be totally wrong for your friend if you assume they’re somewhere they’re not. With this in mind, let’s try a few more:

cd ~  # let's start at home

relative (start here, take two steps up, then down through share and workshop)

cd ../../share/workshop/
pwd

absolute (start at root, take steps)

cd /share/workshop/
pwd

Now, because it can be a real pain to type out, or remember these long paths, we need to discuss …

Tab Completion

Using tab-completion is a must on the command line. A single auto-completes file or directory names when there's only one name that could be completed correctly. If multiple files could satisfy the tab-completion, then nothing will happen after the first . In this case, press a second time to list all the possible completing names. Note that if you've already made a mistake that means that no files will ever be completed correctly from its current state, then 's will do nothing.

touch updates the timestamp on a file, here we use it to create three empty files.

mkdir ~/tmp
cd ~/tmp
touch one seven september
ls o

tab with no enter should complete to ‘one’, then enter

ls s

tab with no enter completes up to ‘se’ since that’s in common between seven and september. tab again and no enter, this second tab should cause listing of seven and september. type ‘v’ then tab and no enter now it’s unique to seven, and should complete to seven. enter runs ‘ls seven’ command.

I can’t overstate how useful tab completion is. You should get used to using it constantly. Watch experienced users type and they maniacally hit tab once or twice in between almost every character. You don’t have to go that far, of course, but get used to constantly getting feedback from hitting tab and you will save yourself a huge amount of typing and trying to remember weird directory and filenames.

Quiz 2

History Repeats Itself

Linux remembers everything you’ve done (at least in the current shell session), which allows you to pull steps from your history, potentially modify them, and redo them. This can obviously save a lot of time and typing.

The ‘head’ command views the first 10 (by default) lines of a file. The ‘tail’ commands views the last 10 (by default) lines of a file. Type ‘man head’ or ‘man tail’ to consult their manuals.

<up arrow>  # last command
<up>  # next-to-last command
<down>  # last command, again
<down>  # current command, empty or otherwise
history  # usually too much for one screen, so ...
history | head # we discuss pipes (the vertical bar) below
history | tail
history | less # use 'q' to exit less
ls -l
pwd
history | tail
!560  # re-executes 560th command (yours will have different numbers; choose the one that recreates your really important result!)

Editing Yourself

Here are some more ways to make editing previous commands, or novel commands that you’re building up, easier:

<up><up>  # go to some previous command, just to have something to work on
<ctrl-a>  # go to the beginning of the line
<ctrl-e>  # go to the end of the line
# now use left and right to move to a single word (surrounded by whitespace: spaces or tabs)
<ctrl-k>  # delete from here to end of line
<ctrl-w>  # delete from here to beginning of preceeding word
blah blah blah<ctrl-w><ctrl-w>  # leaves you with only one 'blah'

You can also search your history from the command line:

<ctrl-r>fir  # should find most recent command containing 'fir' string: echo 'first' > test.txt
<enter>  # to run command
<ctrl-c>  # get out of recursive search
<ctr-r>  # repeat <ctrl-r> to find successively older string matches

Transferring files

For Macs/Linux/Windows 10

  1. Open a Terminal/Command Prompt on your local machine.
  2. Using the cd command go to the directory you want to copy to.
  3. Use scp (secure copy, a remote file copying program):

    scp username@tadpole.genomecenter.ucdavis.edu:/full/path/to/file .

  4. Replace ‘username’ with your username and replace ‘/full/path/to/file’ with the full path to the file you want to transfer. Note that there is a “.” at the end of the command, which is where to put the file, i.e. your current directory. You will have to type in your password.

For Windows 8 and less

  1. Open up WinSCP. If you haven’t installed it, get WinSCP here.
  2. In the Host Name field, type tadpole.genomecenter.ucdavis.edu
  3. Type in your username and password.
  4. Make sure the File Protocol is SFTP.
  5. Press “Login”.
  6. Look at the WinSCP documentation to learn how to transfer items.

Try transferring a file from tadpole to your local computer.

Create and Destroy

We already learned one command that will create a file, touch. Lets create a folder in /share/workshop for you to work in and then another directory cli. We will use the environment variable $USER, that contains your username.

cd  # home again
echo $USER # echo to screen the contents of the variable $USER
cd ~/tmp
echo 'Hello, world!' > first.txt

echo text then redirect (‘>’) to a file.

cat first.txt  # 'cat' means 'concatenate', or just spit the contents of the file to the screen

why ‘concatenate’? try this:

cat first.txt first.txt first.txt > second.txt
cat second.txt

OK, let’s destroy what we just created:

cd ../
rmdir tmp  # 'rmdir' meands 'remove directory', but this shouldn't work!
rm tmp/first.txt
rm tmp/second.txt  # clear directory first
rmdir tmp  # should succeed now

So, ‘mkdir’ and ‘rmdir’ are used to create and destroy (empty) directories. ‘rm’ to remove files. To create a file can be as simple as using ‘echo’ and the ‘>’ (redirection) character to put text into a file. Even simpler is the ‘touch’ command.

mkdir ~/cli
cd ~/cli
touch newFile
ls -ltra  # look at the time listed for the file you just created
cat newFile  # it's empty!
sleep 60  # go grab some coffee
touch newFile
ls -ltra  # same time?

So ‘touch’ creates empty files, or updates the ‘last modified’ time. Note that the options on the ‘ls’ command you used here give you a Long listing, of All files, in Reverse Time order (l, a, r, t).

Forced Removal

When you’re on the command line, there’s no ‘Recycle Bin’. Since we’ve expanded a whole directory tree, we need to be able to quickly remove a directory without clearing each subdirectory and using ‘rmdir’.

cd
mkdir -p rmtest/dir1/dir2 # the -p option creates all the directories at once
rmdir rmtest # gives an error since rmdir can only remove directories that are empty
rm -rf rmtest # will remove the directory and EVERYTHING in it

Here -r = recursively remove sub-directories, -f means force. Obviously, be careful with ‘rm -rf’, there is no going back, if you delete something with rm, rmdir its gone! There is no Recycle Bin on the Command-Line!

Quiz 3

Piping and Redirection

Pipes (‘|’) allow commands to hand output to other commands, and redirection characters (‘>’ and ‘»’) allow you to put output into files.

echo 'first' > test.txt
cat test.txt # outputs the contents of the file to the terminal
echo 'second' > test.txt
cat test.txt
echo 'third' >> test.txt
cat test.txt

The ‘>’ character redirects output of a command that would normally go to the screen instead into a specified file. ‘>’ overwrites the file, ‘»’ appends to the file.

The ‘cut’ command pieces of lines from a file line by line. This command cuts characters 1 to 3, from every line, from file ‘test.txt’

cut -c 1-3 test.txt  

same thing, piping output of one command into input of another

cat test.txt | cut -c 1-3  

This pipes (i.e., sends the output of) cat to cut to sort (-r means reverse order sort), and then grep searches for pattern (‘s’) matches (i.e. for any line where an ‘s’ appears anywhere on the line.)

cat test.txt | cut -c 1-3 | sort -r
cat test.txt | cut -c 1-3 | sort -r | grep s

This is a great way to build up a set of operations while inspecting the output of each step in turn. We’ll do more of this in a bit.

Compression and Archives

As file sizes get large, you’ll often see compressed files, or whole compressed folders. Note that any good bioinformatics software should be able to work with compressed file formats.

gzip test.txt
cat test.txt.gz

To uncompress a file

gunzip -c test.txt.gz

The ‘-c’ leaves the original file alone, but dumps expanded output to screen

gunzip test.txt.gz  # now the file should change back to uncompressed test.txt

Tape archives, or .tar files, are one way to compress entire folders and all contained folders into one file. When they’re further compressed they’re called ‘tarballs’. We can use wget (web get).

wget http://igenomes.illumina.com.s3-website-us-east-1.amazonaws.com/PhiX/Illumina/RTA/PhiX_Illumina_RTA.tar.gz

The .tar.gz and .tgz are commonly used extensions for compressed tar files, when gzip compression is used. The application tar is used to uncompress .tar files

tar -xzvf PhiX_Illumina_RTA.tar.gz

Here -x = extract, -z = use gzip/gunzip, -v = verbose (show each file in archive), -f filename

Note that, unlike Windows, linux does not depend on file extensions to determine file behavior. So you could name a tarball ‘fish.puppy’ and the extract command above should work just fine. The only thing that should be different is that tab-completion doesn’t work within the ‘tar’ command if it doesn’t see the ‘correct’ file extension.

BASH Wildcard Characters

We can use ‘wildcard characters’ when we want to specify or operate on sets of files all at once.

ls ?hiX/Illumina

list files in Illumina sub-directory of any directory ending in ‘hiX’

ls PhiX/Illumina/RTA/Sequence/*/*.fa

list all files ending in ‘.fa’ a few directories down. So, ‘?’ fills in for zero or one character, ‘*’ fills in for zero or more characters. The ‘find’ command can be used to locate files using a similar form.

find . -name "*.f*"
find . -name "*.f?"

how is this different from the previous ls commands?

Quick Note About the Quote(s)

The quote characters “ and ‘ are different. In general, single quotes preserve the literal meaning of all characters between them. On the other hand, double quotes allow the shell to see what’s between them and make substitutions when appropriate. For example:

VRBL=someText
echo '$VRBL'
echo "$VRBL"

However, some commands try to be ‘smarter’ about this behavior, so it’s a little hard to predict what will happen in all cases. It’s safest to experiment first when planning a command that depends on quoting … list filenames first, instead of changing them, etc. Finally, the ‘backtick’ characters ` (same key - unSHIFTED - as the tilde ~) causes the shell to interpret what’s between them as a command, and return the result.

 # counts the number of lines in file and stores result in the LINES variable
LINES=`cat PhiX/Illumina/RTA/Sequence/Bowtie2Index/genome.1.bt2 | wc -l` 
echo $LINES

Quiz 4

Manipulation of a FASTA File

We just found the phiX-174 genome, so let’s copy it (using the ‘cp’ command) to our current directory so we can play with it:

cp ./PhiX/Illumina/RTA/Sequence/WholeGenomeFasta/genome.fa phix.fa    

Similarly we can also use the move command here, but then ./PhiX/Illumina/RTA/Sequence/WholeGenomeFasta/genome.fa will no longer be there:

cp ./PhiX/Illumina/RTA/Sequence/WholeGenomeFasta/genome.fa  ./PhiX/Illumina/RTA/Sequence/WholeGenomeFasta/genome2.fa
ls ./PhiX/Illumina/RTA/Sequence/WholeGenomeFasta/
mv ./PhiX/Illumina/RTA/Sequence/WholeGenomeFasta/genome2.fa phix.fa
ls ./PhiX/Illumina/RTA/Sequence/WholeGenomeFasta/

This functionality of mv is why it is used to rename files.

Note how we copied the ‘genome.fa’ file to a different name: ‘phix.fa’

wc -l phix.fa

count the number of lines in the file using ‘wc’ (word count) and parameter ‘-l’ (lines).

We can use the ‘grep’ command to search for matches to patterns. ‘grep’ comes from ‘globally search for a regular expression and print’.

grep -c '>' phix.fa

Only one FASTA sequence entry, since only one header line (‘>gi|somethingsomething…’)

cat phix.fa

This may not be useful for anything larger than a virus! Let’s look at the start codon and the two following codons:

grep --color "ATG......" phix.fa

’.’ characters are the single-character wildcards for grep. So “ATG……” matches any set of 9 characters that starts with ATG.

Use the –color ‘-o’ option to only print the pattern matches, one per line

grep -o "ATG......" phix.fa

Use the ‘cut’ command with ‘-c’ to select characters 4-6, the second codon

grep --color  -o "ATG......" phix.fa | cut -c4-6

‘sort’ the second codon sequences (default order is same as ASCII table; see ‘man ascii’)

grep --color  -o "ATG......" phix.fa | cut -c4-6 | sort

Combine successive identical sequences, but count them using the ‘uniq’ command with the ‘-c’ option

grep --color  -o "ATG......" phix.fa | cut -c4-6 | sort | uniq -c

Finally sort using reverse numeric order (‘-rn’)

grep --color  -o "ATG......" phix.fa | cut -c4-6 | sort | uniq -c | sort -rn 

… which gives us the most common codons first

This may not be a particularly useful thing to do with a genomic FASTA file, but it illustrates the process by which one can build up a string of operations, using pipes, in order to ask quantitative questions about sequence content. More generally than that, this process allows one to ask questions about files and file contents and the operating system, and verify at each step that the process so far is working as expected. The command line is, in this sense, really a modular workflow management system.

Since copying or even moving large files (like sequence data) around your filesystem may be impractical, we can use links to reference ‘distant’ files without duplicating the data in the files. Symbolic links are disposable pointers that refer to other files, but behave like the referenced files in commands. I.e., they are essentially ‘Shortcuts’ (to use a Windows term) to a file or directory.

The ‘ln’ command creates a link. You should, by default, always create a symbolic link using the -s option.

ln -s PhiX/Illumina/RTA/Sequence/WholeGenomeFasta/genome.fa .
ls -ltrhaF  # notice the symbolic link pointing at its target
grep -c ">" genome.fa

STDOUT & STDERR

Programs can write to two separate output streams, ‘standard out’ (STDOUT), and ‘standard error’ (STDERR). The former is generally for direct output of a program, while the latter is supposed to be used for reporting problems. I’ve seen some bioinformatics tools use STDERR to report summary statistics about the output, but this is probably bad practice. Default behavior in a lot of cases is to dump both STDOUT and STDERR to the screen, unless you specify otherwise. In order to nail down what goes where, and record it for posterity:

wc -c genome.fa 1> chars.txt 2> any.err

the 1st output, STDOUT, goes to ‘chars.txt’
the 2nd output, STDERR, goes to ‘any.err’

cat chars.txt

Contains the character count of the file genome.fa

cat any.err

Empty since no errors occured.

Saving STDOUT is pretty routine (you want your results, yes?), but remember that explicitly saving STDERR is important on a remote server, since you may not directly see the ‘screen’ when you’re running jobs.

Shell Scripts, File Permissions

Often it’s useful to define a whole string of commands to run on some input, so that (1) you can be sure you’re running the same commands on all data, and (2) so you don’t have to type the same commands in over and over! Let’s use the ‘nano’ text editor program that’s pretty reliably installed on most linux systems.

nano test.sh

cli_figure7

nano now occupies the whole screen; see commands at the bottom type/paste in the following … (note that ‘#!’ is an interpreted command to the shell, not a comment)

#!/bin/bash grep -o . $1 | \ sort | \ uniq -c | \ sort -rn -k1,1

Cntrl-X top exit first saving the document. Follow the instruction at the bottom of the screen

Note that ‘$1’ means ‘the value of the 1st argument to the shell script’ … in other words, the text that follows the shell script name when we run it (see below).

Though there are ways to run the commands in test.sh right now, it’s generally useful to give yourself (and others) ‘execute’ permissions for test.sh, really making it a shell script. Note the characters in the first (left-most) field of the file listing:

ls -lh test.sh
-rw-rw-r-- 1 msettles biocore 79 Aug 19 15:05 test.sh

The first ‘-‘ becomes a ‘d’ if the ‘file’ is actually a directory. The next three characters represent read, write, and execute permissions for the file owner (you), followed by three characters for users in the owner’s group, followed by three characters for all other users. Run the ‘chmod’ command to change permissions for the ‘test.sh’ file, adding execute permissions (‘+x’) for the user (you) and your group (‘ug’):

chmod ug+x test.sh
ls -lh test.sh
-rwxr-xr-- 1 msettles biocore 79 Aug 19 15:05 test.sh

The first 10 characters of the output represent the file and permissions. The first character is the file type, the next three sets of three represent the file permissions for the user, group, and everyone respectively.

OK! So let’s run this script, feeding it the phiX genome. When we put the genome file 1st after the name of the script, this filename becomes variable ‘1’, which the script can access by specifying ‘$1’. We have to provide a relative reference to the script ‘./’ because its not our our “PATH”.

./test.sh genome.fa
msettles@tadpole:/share/workshop/msettles/cli$ ./test.sh genome.fa 1686 T 1292 A 1253 G 1155 C 1 x 1 p 1 i 1 h 1 >

The script’s grep command splits out every character in the file on a separate line, then sorts them so it can count the occurrences of every unique character and show the most frequent characters first … a quick and dirty way to get at GC content.

Quiz 5