(This needs to be expanded, just getting a few thoughts down) Advice on setting up your computer for common bioinformatics tasks
link to suggested packages
link to info and to lab repos
RNAseq (illumina platform)
BIS180L (the lab Julin created for his undergraduate teaching) has most of the information/guidelines on each step.
There is also a paper describing the best practice for RNAseq data analysis, check me please.
Below is a step-by-step instruction:
2) Trimming: AlWAYS trim raw data! Trimmomatic is a good tool for quality and adapter trimming. Make sure you are using the adapter you used for your library making for trimming.
3) Quality check: Check the trimmed data using FastQC.
4) mapping to reference:
Here is how we usually use STAR:
STAR --runMode genomeGenerate --genomeDir ref_genome_dir/ --genomeFastaFiles ref.fa --sjdbGTFfile ref.gff3 --runThreadN 6 --sjdbGTFtagExonParentTranscript Parent --sjdbGTFfeatureExon CDS
Depends on the format of your gff3 file, you may need to modify it a bit.
Then map (a paired end data example):
STAR --genomeDir ref_genome_dir/ --readFilesIn 1.fq 2.fq --outSAMtype BAM SortedByCoordinate --sjdbGTFfile ref.gff3 --quantMode TranscriptomeSAM GeneCounts --twopassMode Basic –alignIntronMax 15000 --outFilterIntronMotifs RemoveNoncanonical --runThreadN 6 --sjdbGTFtagExonParentTranscript Parent --sjdbGTFfeatureExon CDS --outReadsUnmapped fastx
4.b) If you are mapping to reference CDS, use kallisto.
kallisto index -k 19 -i ref.cds.19.kai ref.cds.fa
kallisto quant --single --plaintext -l 250 -s 50 -t 12 -i ref.cds.19.kai -o out.dir file.fq
5) Alignment result checking, visualization: check the mapping rate after mapping, very low mapping rate (<75%) to reference genome requires trouble shooting. Use IGV to visualize alignment results. You can also use Picard CollectInsertSizeMetrics to decide library type and check insert size.
6) read count file generation: make a master raw read count tsv file from mapping result of each library, kallisto and STAR give you the read count file directly, but sometimes you need to generate read count file from your bam alignment using Rsubread or HTseq.
7) expression analysis: import tsv file into R for differential expression analysis. edgeR is the tool we prefer for expression analysis. Refer to BIS180L and edgeR user manual for differential expression analysis guideline.
tips on expression analysis:
a) libraries with very low read count compared to other libraries should be excluded. We used 1,000,000 read count as the threshold to keep a library.
b) Use plotMDS() to check your sample clustering pattern, biological replicates are expected to cluster together, if not, trouble shooting.
c) Check batch effect, if there is, included in the model, read edgeR user manual for batch effect.
8) GO and promoter motif enrichment analysis: use Goseq package for GO enrichment analysis, Julin also wrote a function for promoter motif enrichment analysis. Check BIS180L.
mixed model tutorials
screen command in general https://www.rackaid.com/blog/linux-screen-tutorial-and-how-to/
resumes a detached screen session
HOW TO ACCESS ATTACHED SCREEN AFTER CONNECTION DROPPED
screen -D -r ****
Please see "how to kill screen" http://stackoverflow.com/questions/1509677/kill-detached-screen-session screen -X -S [session # you want to kill] kill You can kill a detached session which is not responding within the screen session by doing the following.
type "screen -list" to identify the (detached) screen session. eg: screen -list There are screens on: 20751.Melvin_Peter_V42 (Detached) Note: "20751.Melvin_Peter_V42" is your session id.
get attached to the detached screen session eg: screen -r 20751.Melvin_Peter_V42
Once connected to the session which might or might not respond, do the following. press "Ctrl + a" (there wont be any changes in your window now) type ":quit" ( its a colon[:] followed quit)
Thats its your remote screen session will be terminated now.