Basic Unix and Version control for Biologists EP3 is aiming to helps anyone who would like to learn basic unix programming and version control using github. This introduction/tutorial dose not require installation, you can simply click you can simply use Rstudio Cloud on your browser.

เว็บเพจนี้สอน Unix Shell เบื้องต้น โดยผู้เรียนไม่ต้องดาวน์โหลดโปรแกรมลงบนคอมพิวเตอร์ส่วนตัว เพียงใช้ Rstudio Cloud บนเว็บบราวเชอร์

Open Binder and Launch Terminal

Step A: Open Rstudio cloud and Launch Terminal

Landing Page

Once you log in to Rstudio cloud, your web browser should bring up a similar window as the picture shown above. Click the button on the top right corner to create a new Rstudio project. Then, the next step is to click “Terminal” which should look like a picture below after you click on it.

Terminal

Hands-On Exercise (45 minutes)

  • Downloading a Sample Dataset
    • Example: FASTA file of DNA sequences
  • Processing the Data
    • Viewing the data structure
    • Counting sequences
    • Extracting specific sequences using grep

Creating Bash Scripts and Version Control (60 minutes)

  • Creating Bash Scripts
    • Writing and executing a simple script
  • Version Control with GitHub
    • Basic Git commands
    • Pushing code to GitHub

Summary and Q&A (45 minutes)

  • Recap of Key Commands
  • Real-World Applications in Bioinformatics
  • Questions and Discussion

Lesson

Hands-On Exercise (45 minutes)

Basic command line recap Downloading a Sample Dataset

  • Download a sample FASTA file of DNA sequences.
/cloud/project$ wget -O SRR21388484.fasta.gz https://trace.ncbi.nlm.nih.gov/Traces/sra-reads-be/fasta?acc=SRR21388484

Processing the Data

  • View the structure of the FASTA file using cat, less, and head.
/cloud/project$ gunzip SRR21388484.fasta.gz
/cloud/project$ cat SRR21388484.fasta | head
/cloud/project$ less SRR21388484.fasta

note: type q to quit less

/cloud/project$ head SRR21388484.fasta
  • Count the number of sequences in the file using grep and wc.
/cloud/project$ grep -c ">" SRR21388484.fasta
/cloud/project$ grep ">" SRR21388484.fasta | wc -l

Creating Bash Scripts and Version Control (60 minutes)

Creating Bash Scripts

A bash script is a file containing a series of commands that you can execute together. Here’s how to create and execute a simple bash script:

  • Create a new file called script.sh:
/cloud/project$ vim script.sh
  • Add the following lines to the script:
#!/bin/bash
# This script prints the current date and lists files in the directory

echo "Current date and time:"
date

echo "Files in the current directory:"
ls

save and exit by type :wq and hit Enter

  • Execute the script:
/cloud/project$ bash script.sh

Version Control with GitHub

Version control is essential for tracking changes to your code and collaborating with others. Here’s how to use Git and GitHub for version control:

  • Create an account on Github
  • Create a New Repository on GitHub:
    • Go to GitHub and create a new repository.
    • we’ll demonstrate this on github.

Additional Topics

Working with FASTQ Files

FASTQ is a common file format used to store sequences and their corresponding quality scores in bioinformatics. Understanding how to work with FASTQ files is essential for tasks like analyzing high-throughput sequencing data.

Introduction to the FASTQ Format:

FASTQ files typically contain information about DNA or RNA sequences obtained from next-generation sequencing (NGS) platforms. Each record in a FASTQ file consists of four lines:

  1. Sequence Identifier (ID): Begins with “@” and contains information about the read or sequence.
  2. Sequence Data: Contains the actual DNA or RNA sequence as a string of characters (A, C, G, T, or N).
  3. Quality Identifier (ID): Begins with “+” and is often the same as the sequence identifier.
  4. Quality Scores: Represents the quality of each base in the sequence as ASCII characters.

Here’s an example FASTQ record:

/cloud/project$ wget -O SRR21388484.fastq.gz https://trace.ncbi.nlm.nih.gov/Traces/sra-reads-be/fastq?acc=SRR21388484
/cloud/project$ gunzip SRR21388484.fastq.gz
/cloud/project$ head SRR21388484.fastq 
@SRR21388484.1 1 length=301
TACGTAGGGGGCGAGCGTTGTCCGGAATTATTGGGCGTAAAGAGCGTGTAGGCGGTTTGGTAAGTCTGCC
GTGAAAACCCGGGGCTCAACCCCGGTCGTGCGGTGGATACTGCCAGGCTAGAGGATGGTAGAGGCGAGTG
GAATTCCCGGTGTAGCGGTGAAATGCGCAGATATCGGGAGGAACACCAGTAGCGAAGGCGGCTCGCTGGG
CCATTCCTGACGCTGAGACGCGAAAGCTAGGGGAGCGAACAGGATTAGATACCCTGGTAGTCCGGCTGAC
TGACTATCTCGTATTCCGTCT
+SRR21388484.1 1 length=301
??????????????????????????????????????????????????????????????????????
??????????????????????????????????????????????????????????????????????
??????????????????????????????????????????????????????????????????????

Quality score refrence https://help.basespace.illumina.com/files-used-by-basespace/quality-scores

Viewing FASTQ Files:

  • Use cat to display the entire file on the terminal (suitable for small files).
/cloud/project$ cat SRR21388484.fastq | head -5
  • Use less for interactive viewing of large FASTQ files.
/cloud/project$ less SRR21388484.fastq
  • Use more for interactive viewing similar to less.
/cloud/project$ more SRR21388484.fastq

Combining, Splitting, and Converting File Formats:

  • Split a large FASTQ file into smaller files based on the number of records:
/cloud/project$ split -l 600006 SRR21388484.fastq 600kSRR
  • Combine multiple FASTQ files into one:
/cloud/project$ cat 600kSRRaa 600kSRRab > combined.fastq

Extracting Sequences from FASTQ Files Based on IDs:

  • Use grep to extract sequences based on their sequence identifiers:
/cloud/project$ vim sequence_ids.txt 
  • sequence_ids.txt should contain one sequence ID per line:, try adding SRR21388484.1 and SRR21388484.2
/cloud/project$ grep -A 3 -f sequence_ids.txt combined.fastq > extracted_sequences.fastq

Example

@seq_id_1
@seq_id_2
...
/cloud/project$ head extracted_sequences.fastq 
@SRR21388484.1 1 length=301
TACGTAGGGGGCGAGCGTTGTCCGGAATTATTGGGCGTAAAGAGCGTGTAGGCGGTTTGGTAAGTCTGCC
GTGAAAACCCGGGGCTCAACCCCGGTCGTGCGGTGGATACTGCCAGGCTAGAGGATGGTAGAGGCGAGTG
GAATTCCCGGTGTAGCGGTGAAATGCGCAGATATCGGGAGGAACACCAGTAGCGAAGGCGGCTCGCTGGG
--
+SRR21388484.1 1 length=301
??????????????????????????????????????????????????????????????????????
??????????????????????????????????????????????????????????????????????
??????????????????????????????????????????????????????????????????????
--

Summary and Q&A (45 minutes)

  • Recap the key UNIX commands covered.
  • Discuss how these commands are applied in real-world bioinformatics.
  • Open the floor for questions and further discussion.