Bioinformatics Procedures and Recommendations

(This needs to be expanded, just getting a few thoughts down) Advice on setting up your computer for common bioinformatics tasks

Sequencing Database

sequencing_database_and_sra_submission

Organizing your project

READ How to organize a bioinformatics project READ Best Practices for Scientific Computing

Atom (Julin)

link to suggested packages

Github (Julin)

link to info and to lab repos

Genome Center Clusters

Maloof lab info on genome center cluster
Genome center tutorials on UCD cluster usage

R (Ruijuan)

Tips and Tricks

You can directly access google sheets from R using the googlesheets library. See CRAN or github repositories for more info. Follow the basic usage vignette to get started.

Pipelines

RNAseq (illumina platform)

BIS180L (the lab Julin created for his undergraduate teaching) has most of the information/guidelines on each step.
There is also a paper describing the best practice for RNAseq data analysis, check me please.

Below is a step-by-step instruction:

1) Quality check: check the quality of raw fastq data using FastQC to see whether there is bad sequencing quality and/or adapter contamination.

2) Trimming: AlWAYS trim raw data! Trimmomatic is a good tool for quality and adapter trimming. Make sure you are using the adapter you used for your library making for trimming.

3) Quality check: Check the trimmed data using FastQC.

4) mapping to reference:

4.a) If you are mapping to reference genome sequence, use gapped aligner STAR.

Here is how we usually use STAR:

Firstly index the reference genome with gff3 annotation file:

  STAR --runMode genomeGenerate --genomeDir ref_genome_dir/ --genomeFastaFiles ref.fa --sjdbGTFfile ref.gff3 --runThreadN 6 --sjdbGTFtagExonParentTranscript Parent --sjdbGTFfeatureExon CDS

Depends on the format of your gff3 file, you may need to modify it a bit.

Then map (a paired end data example):

  STAR --genomeDir ref_genome_dir/ --readFilesIn 1.fq 2.fq --outSAMtype BAM SortedByCoordinate --sjdbGTFfile ref.gff3 --quantMode TranscriptomeSAM GeneCounts --twopassMode Basic –alignIntronMax 15000 --outFilterIntronMotifs RemoveNoncanonical --runThreadN 6 --sjdbGTFtagExonParentTranscript Parent --sjdbGTFfeatureExon CDS --outReadsUnmapped fastx

4.b) If you are mapping to reference CDS, use kallisto.

index genome:

  kallisto index -k 19 -i ref.cds.19.kai ref.cds.fa

map:

  kallisto quant --single --plaintext -l 250 -s 50 -t 12 -i ref.cds.19.kai -o out.dir file.fq

5) Alignment result checking, visualization: check the mapping rate after mapping, very low mapping rate (<75%) to reference genome requires trouble shooting. Use IGV to visualize alignment results. You can also use Picard CollectInsertSizeMetrics to decide library type and check insert size.

6) read count file generation: make a master raw read count tsv file from mapping result of each library, kallisto and STAR give you the read count file directly, but sometimes you need to generate read count file from your bam alignment using Rsubread or HTseq.

7) expression analysis: import tsv file into R for differential expression analysis. edgeR is the tool we prefer for expression analysis. Refer to BIS180L and edgeR user manual for differential expression analysis guideline.

tips on expression analysis:
a) libraries with very low read count compared to other libraries should be excluded. We used 1,000,000 read count as the threshold to keep a library.
b) Use plotMDS() to check your sample clustering pattern, biological replicates are expected to cluster together, if not, trouble shooting.
c) Check batch effect, if there is, included in the model, read edgeR user manual for batch effect.

8) GO and promoter motif enrichment analysis: use Goseq package for GO enrichment analysis, Julin also wrote a function for promoter motif enrichment analysis. Check BIS180L.

9) Network/coexpression analysis: use SOM (R package kohonen) or WGCNA for coexpression analysis on variance stabilizing transformed data with DEseq.

SNP calling

transcriptome assembly

QTL Analysis

mixed model tutorials

Bayesian advice

programs and settings

ImageJ https://imagej.nih.gov/ij/

LeafJ https://bitbucket.org/jnmaloof/leafj

Tips on screen command

screen command in general https://www.rackaid.com/blog/linux-screen-tutorial-and-how-to/

Start screen

      screen

resumes a detached screen session

      screen -r

HOW TO ACCESS ATTACHED SCREEN AFTER CONNECTION DROPPED

      screen -D -r ****

Please see "how to kill screen" http://stackoverflow.com/questions/1509677/kill-detached-screen-session screen -X -S [session # you want to kill] kill You can kill a detached session which is not responding within the screen session by doing the following.

type "screen -list" to identify the (detached) screen session. eg: screen -list There are screens on: 20751.Melvin_Peter_V42 (Detached) Note: "20751.Melvin_Peter_V42" is your session id.

get attached to the detached screen session eg: screen -r 20751.Melvin_Peter_V42

Once connected to the session which might or might not respond, do the following. press "Ctrl + a" (there wont be any changes in your window now) type ":quit" ( its a colon[:] followed quit)

Thats its your remote screen session will be terminated now.

Bioinformatics Procedures and Recommendations bioinformatics