Nat Pombubpa Lab - Basic Unix for Biologists (EP2)

Basic Unix for Biologists EP2 aims to help anyone who would like to learn basic Unix programming. This tutorial does not require any installation; you can simply use Rstudio Cloud in your browser.

This web page teaches the basics of the Unix shell. Learners do not need to download any software onto their own computer; they only need to use Rstudio Cloud in a web browser.

Open Rstudio Cloud and Launch Terminal

Step A: Open Rstudio cloud and Launch Terminal

Landing Page

Once you log in to Rstudio Cloud, your web browser should bring up a window similar to the picture shown above. Click the button in the top right corner to create a new Rstudio project. Then click “Terminal”, which should look like the picture below after you click on it.

Terminal

Download example files (If you have done this for EP1, you can skip this part.)

/cloud/project$ svn export https://github.com/NatPombubpa/Binder_Intro_Unix/trunk/unix_intro
/cloud/project$ svn export https://github.com/NatPombubpa/Binder_Intro_Unix/trunk/data-shell

If everything works for you, you are ready for the tutorial.

Very useful commands

We will learn some useful commands that are often used in bioinformatics.


/cloud/project$ cd unix_intro/six_commands/

We’ll be working with gene_annotations.tsv, which contains four columns: gene_ID, genome, KO_ID, and KO_annotation (KO stands for KEGG Orthology, a functional database).

Let’s check out the file

/cloud/project/unix_intro/six_commands$ head gene_annotations.tsv 
gene_ID genome  KO_ID   KO_annotation
1       CC9311  K02338  DPO3B; DNA polymerase III subunit beta [EC:2.7.7.7]
2       CC9311  NA      NA
3       CC9311  K01952  purL; phosphoribosylformylglycinamidine synthase [EC:6.3.5.3]
4       CC9311  K00764  purF; amidophosphoribosyltransferase [EC:2.4.2.14]
5       CC9311  K02469  gyrA; DNA gyrase subunit A [EC:5.99.1.3]
6       CC9311  NA      NA
7       CC9311  K18979  queG; epoxyqueuosine reductase [EC:1.17.99.6]
8       CC9311  NA      NA
9       CC9311  NA      NA

Let’s take a look at the first few lines

/cloud/project/unix_intro/six_commands$ head -n 3 gene_annotations.tsv
gene_ID genome  KO_ID   KO_annotation
1       CC9311  K02338  DPO3B; DNA polymerase III subunit beta [EC:2.7.7.7]
2       CC9311  NA      NA

We can also count the number of lines in the file

/cloud/project/unix_intro/six_commands$ wc -l gene_annotations.tsv
101 gene_annotations.tsv
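Because wc -l counts every line, the 101 reported above includes the header row, so the file presumably holds 100 gene records. Here is a minimal sketch of counting only the data rows. It uses a small made-up stand-in file (toy_annotations.tsv is hypothetical and not part of the tutorial data) so it can be run anywhere:

```shell
# Build a tiny tab-separated stand-in file (made-up rows, not the tutorial file)
tmpdir=$(mktemp -d)
printf 'gene_ID\tgenome\tKO_ID\n1\tCC9311\tK02338\n2\tCC9311\tNA\n' > "$tmpdir/toy_annotations.tsv"

# wc -l counts every line, header included; reading from stdin omits the file name.
# tr removes the padding spaces that some wc implementations print.
total_lines=$(wc -l < "$tmpdir/toy_annotations.tsv" | tr -d ' ')

# tail -n +2 starts printing at line 2, which skips the header
data_rows=$(tail -n +2 "$tmpdir/toy_annotations.tsv" | wc -l | tr -d ' ')

echo "total lines: $total_lines"
echo "data rows: $data_rows"
```

The same tail -n +2 step should work on gene_annotations.tsv if you want the gene count without the header.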

cut command

Use cut to extract a column from a tab-delimited file

/cloud/project/unix_intro/six_commands$ cut -f 1 gene_annotations.tsv

Cut and print out just the first few lines by piping the output to head

/cloud/project/unix_intro/six_commands$ cut -f 1 gene_annotations.tsv | head

/cloud/project/unix_intro/six_commands$ cut -f 1,3 gene_annotations.tsv | head

/cloud/project/unix_intro/six_commands$ cut -f 1-3 gene_annotations.tsv | head

However, if we use other types of files, such as a comma-separated file, we have to specify the delimiter with -d.

/cloud/project/unix_intro/six_commands$ cut -d "," -f 1-3 example_gene_annotations.csv | head
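cut splits on tabs by default, which is why -d was not needed for the .tsv file. The sketch below compares both cases on two made-up files (toy.tsv and toy.csv are hypothetical names with invented rows), and also shows what happens if -d is forgotten:

```shell
# Hypothetical two-line samples: one tab-separated, one comma-separated
tmpdir=$(mktemp -d)
printf 'gene_ID\tgenome\tKO_ID\n1\tCC9311\tK02338\n' > "$tmpdir/toy.tsv"
printf 'gene_ID,genome,KO_ID\n1,CC9311,K02338\n' > "$tmpdir/toy.csv"

# Tab is the default delimiter, so -f alone is enough for the .tsv
tsv_cols=$(cut -f 1,3 "$tmpdir/toy.tsv")

# For the .csv, -d "," tells cut to split on commas instead
csv_cols=$(cut -d "," -f 1,3 "$tmpdir/toy.csv")

# Without -d, a line containing no tab cannot be split, so cut prints it whole
csv_no_delim=$(cut -f 1 "$tmpdir/toy.csv")

echo "$tsv_cols"
echo "$csv_cols"
echo "$csv_no_delim"
```

If cut seems to return the whole line unchanged, a wrong or missing delimiter is a likely cause.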

cut command practice

Create a new file that contains two columns: gene_ID and KO_annotation. Hint: > redirects the output of a command into a file.
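As a nudge for the exercise, here is a sketch of the redirection pattern on a made-up file. The file names are hypothetical and the columns chosen here differ from the ones the exercise asks for, so you will still need to adapt it:

```shell
# Made-up stand-in data (not the tutorial file)
tmpdir=$(mktemp -d)
printf 'gene_ID\tgenome\tKO_ID\n1\tCC9311\tK02338\n2\tCC9311\tNA\n' > "$tmpdir/toy.tsv"

# > writes cut's output into a file, replacing the file if it already exists
cut -f 2,3 "$tmpdir/toy.tsv" > "$tmpdir/genome_and_ko.tsv"

# Look at the new file to confirm it holds only the two chosen columns
cat "$tmpdir/genome_and_ko.tsv"
```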

grep command

grep stands for "global regular expression print". It can be used to search through a text file and print out the lines that match a pattern.


/cloud/project/unix_intro/six_commands$ grep re colors.txt
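grep matches the pattern anywhere in a line, not just whole words. The contents of colors.txt are not shown here, so the sketch below uses an invented list of colors (toy_colors.txt is a hypothetical file) to illustrate the behaviour:

```shell
# Invented color list standing in for colors.txt
tmpdir=$(mktemp -d)
printf 'red\ngreen\nblue\norange\n' > "$tmpdir/toy_colors.txt"

# Prints every line that contains the letters "re" anywhere in it
matches=$(grep re "$tmpdir/toy_colors.txt")
echo "$matches"

# -n prefixes each matching line with its line number in the file
numbered=$(grep -n re "$tmpdir/toy_colors.txt")
echo "$numbered"
```

Here both "red" and "green" match, because each contains "re" somewhere in the word.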

Let’s imagine we’re looking for genes that are predicted to encode the enzyme epoxyqueuosine reductase. When we search for this on the KO website, we find two KO_IDs linked with it: K09765 and K18979. Use grep to find these IDs.


/cloud/project/unix_intro/six_commands$ grep K09765 gene_annotations.tsv


/cloud/project/unix_intro/six_commands$ grep K18979 gene_annotations.tsv

To report how many lines match the pattern, we can add the -c flag

/cloud/project/unix_intro/six_commands$ grep -c K18979 gene_annotations.tsv
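The two searches and the count can also be combined. The sketch below runs on made-up rows (toy.tsv and the genome name toyB are hypothetical) and shows -c for counting, what -c reports when nothing matches, and -e for searching several patterns in one call:

```shell
# Made-up stand-in data (not the tutorial file)
tmpdir=$(mktemp -d)
printf 'gene_ID\tgenome\tKO_ID\n1\tCC9311\tK02338\n2\tCC9311\tK18979\n3\ttoyB\tK18979\n4\ttoyB\tNA\n' > "$tmpdir/toy.tsv"

# -c prints the number of matching lines instead of the lines themselves
count_found=$(grep -c K18979 "$tmpdir/toy.tsv")

# With no match, grep -c prints 0 and returns a non-zero exit status;
# "|| true" keeps that status from stopping a script run with "set -e"
count_missing=$(grep -c K09765 "$tmpdir/toy.tsv" || true)

# Each -e adds one pattern; a line is printed if it matches any of them
either_id=$(grep -e K09765 -e K18979 "$tmpdir/toy.tsv")

echo "K18979 lines: $count_found"
echo "K09765 lines: $count_missing"
echo "$either_id"
```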

grep command practice

Use grep and cut to print out just column 2 (genome) for the genes that have the K18979 annotation. Hint: | is a pipe, which sends the output of one command to the next command as input.
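To show how a pipe chains the two commands, here is a sketch on made-up rows (toy.tsv and the genome name toyB are hypothetical). It extracts a different column from the one the exercise asks for, so the final step is left to you:

```shell
# Made-up stand-in data (not the tutorial file)
tmpdir=$(mktemp -d)
printf 'gene_ID\tgenome\tKO_ID\n1\tCC9311\tK02338\n2\tCC9311\tK18979\n3\ttoyB\tK18979\n' > "$tmpdir/toy.tsv"

# grep keeps only the matching lines; | hands them to cut, which keeps column 1
ids=$(grep K18979 "$tmpdir/toy.tsv" | cut -f 1)

echo "$ids"
```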

References