User Manual


References


If you use NovelSeq in your projects, please cite the following:
"Detection and characterization of novel sequence insertions using paired-end next-generation sequencing."
Iman Hajirasouliha, Fereydoun Hormozdiari, Can Alkan, Jeffrey M. Kidd, Inanc Birol, Evan E. Eichler, S. Cenk Sahinalp. Bioinformatics, May 15; 26(10):1277-83, 2010.

Steps


1) Paired-end read mapping:


Input:
FASTA or FASTQ file that contains paired-end sequences of an individual genome. This FASTA/Q
file is the only input of the pipeline.

Output:
A FASTA file containing those paired-end sequences where neither end-read sequence can be
mapped onto the reference genome. We call these paired-end reads orphan reads.
Another output file containing the alignments of those paired end reads where exactly oneend
sequence can be mapped onto the reference genome. These are referred to as one-end
anchored (OEA) reads.

Note that mrFAST and mrsFAST provide the desired output
format.

2) Orphan assembly and contamination removal:


Input:
The set of orphan reads in FASTA (FASTQ) format.

Output:
A FASTA file containing assembled contigs from the set of orphan paired-end reads.
Recommended de novo assembly tools are ABySS and Velvet.

IMPORTANT: Note that one should remove the contaminant contigs using standard BLAST scripts
against bacterial databases. Please see the paper for more details.

3) OEA read clustering and the local assembly of the OEA clusters:


Description: The OEA-reads are clustered using the minimum set-cover approach and their
corresponding mapping location in the "selected" clusters are outputted.

Input:
A SAM file containing the alignments of those paired end reads where exactly one-end sequence
can be mapped onto the reference genome.
A FASTA file containing all the OEA read sequences.

Output:
A FASTA file containing the assembly each selected OEA cluster.
A cluster file containing the each selected OEA cluster.

Command:
./novelseq_cluster
<dir><dataset><satRemChrs><chroSDir><AllreadsSeq><histogram><maxNumOfReadsEachClust
er>
<dir>: is the working directory.
<dataset>: is an standard SAM file which contains the alignments of those paired end reads where
exactly one-end sequence can be mapped onto the reference genome.
A sample line of a SAM file is shown below:
EAS56_200:2:96:4067:8649 133 chr10 0 255 101M = 5270364 0
TTGTGGCACTATTCACAATAACAAAGACTTGGAACCAACCCAAATGTCCATCAATGATAGACTGGGTTAAGA
AAATGTGGCACATATACACTATGGAATAC
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGEGGG
DFEGGGGGGFGEFEGGGFGFGDGFFFGGEGEEEDEGF NM:i:4 MD:Z:20G29A14A25C9
NS:Z:ATCAACTCGTCATTTACATTAGGTATTTCTCCTAATGTTATCCCTCTCCTAGCCCCCCACCCCCTGACT
GGCCCTGGTGTGTGATGTTCCCCACCTTGTGT
NQ:Z:GGGGGGGGGGGGGGGGGGGGGGGEGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGG
GGGGGGFGGGGGGGEBEDFCGCDDDGEFGFGGEGECEBCFD
<satRemChrs>: is a text file which contains the satellite regions of the reference. Alignments to
these regions are excluded from our study.
Each line of SatRemChrs has three fields <chroN> <ML_L> <ML_R>:
<chroN>: Chromosome name.
<ML_L>: Position of left side of satellite region.
<ML_R>: Position of right side of satellite region.
Few lines of a sample is shown below:
chr10 6130904 6130985
chr10 15016925 15018149
chr10 15018157 15022510
chr10 15025011 15026274
chr10 15026181 15027677
chrY 58982440 58982785
chrY 58983271 58997782
chrY 59361268 59362572
chrY 59362752 59362785
<chroDir>
Directory containing data about lengths of chromosomes; data files named chroName$chr where
$chr=chromosome no., each file contains 1 line: "chr$chr $length"
<AllreadsSeq>: is a FASTA file containing all the OEA read sequences.
<histogram>: is a text file representing the insert size histogram of the paried-end reads. the first
line shows the minimum insert size (Delta_min) and the maximum insert size (Delta_max). The
following Delta_max-Delta_min+1 lines also have two numbers. The first column shows an insert
size and the second column shows the fraction of paired-end reads having the insert size. An
example of a histogram file is shown as follows:
74 399
74 0.00000039
75 0.00000037
76 0.00000037
77 0.00000039
78 0.00000041
79 0.00000041
80 0.00000036
81 0.00000038
...
398 0.00000028
399 0.00000028
<maxNumOfReadsEachCluster>: is a user defined upper-bound on the number of reads in each
cluster.
*IMPORTANT NOTE: After running the command, a cluster file is also being created. This file
represents a set of clusters (the user might have filtered some of the OEA clusters from the output
of the clustering package based on the number of read supports already). The file contains a
number of clusters and the reads supporting the clusters.
For each cluster in the file, we have a set line which starts from a line <cluserID> <Strand> and
ends with a line containing the word END. After the first line, for each read in the cluster, we have
the following lines:
<readN>: Name of the read
<seqN>: Sequence content of the read
<chroN>: Chromosome to which the read is mapped
<MapLoc>: Outer end of mapping location of the read
<Score>: Denoted the score of the mapping (e.g. equals to 1.000 if it is a uniqe mapping)
<ClusterID>: The identifier for the cluster.
Here is a simple example:
1 R
EAS254_13:4:57:1675:16601/2
ATAAATTTAGCAACTGTCTGTCCAAGTGTTGCACCACACAAGTACCTCGTC
chr2
132735921
R1
.000000
1 END
1 F
EAS139_43:5:49:1430:3681/1
GAGGTACTTGTGTGGTGCAACACTTGGACAGACAGTTGCTAAAGTTATGTA
chr2
132735630
F
1.000000
1 EAS139_43:4:63:787:8041/2
AAGTTATGTAAAGATGTGAATTTAAAACTTTGGTCAGGATCTGGCTTTTCA
chr2
132735555
F
0.998674
1
END

4) Anchoring orphan contigs using the OEA contigs:

Anchoring orphan contigs has two independent parts. In the first part, we create a "weight" matrix
with the following command:
Command: ./createScoreMatrix.out -maxLen <maxLen> -fOEA <OEAcontigs> -fNH
<ORPHANcontigs> -OEAcount <OEAcount> -NHcount <ORPHANcount> -OEAStartCover
<OEA_startId> -OEAStopCover <OEA_stopId>
<maxLen>: the maximum possible length of an OEA contig
<OEAcontigs>: the file containing the OEA contigs (which was obtained from the local assembly in
the previous step)
<ORPHANcontigs>: a FASTA file containing all the orphan CONTIGS.
<OEAcount>: the number of total oea contigs. (note that both F and R strands together contribute
ONE in this count)
<ORPHANcount>: the number of orphan contigs
<OEA_startId>: the starting ID of the OEA contigs
<OEA_stopId>: the stopping ID of the OEA contigs (i.e. only the OEA clusters which have an ID
greater than or equal to <OEA_startId> and less than <OEA_stopId> will be considered).

Input:
A cluster file (indicating all OEA clusters – obtained from the previous section), A fasta file
containing all orphan contigs, A fasta file containing all OEA contigs.

Output:
An output file showing those orphan contigs which match up with an OEA contig and the loci
information of the insertion.
Note that a matching format is defined as follows:
Command:
./greedyMatching.out -fOEA_C <oea_c> -fOEA_S <oea_s> -fMatrix <score_matrix> -fNH
<ORPHANcontigs> -OEAcount <OEAcount> -NHcount <ORPHANcount>
<oea_c>: A cluster format 2 file, containing all the OEA clusters.
<oea_s>: is the file containing the OEA contigs
<score_matrix>: is the file representing the alignment scores between OEA and orphan contigs.
Each line has the 4 items as follows:
<OEA_ID><ORPHAN_ID><SCORE><ORIENTATION>
<OEA_ID>: an integer showing the ID of an OEA contig
<ORPHAN_ID>: an integer showing the ID of an orphan contig
<SCORE>: the alignment score between the OEA and orphan contigs.
<ORIENTATION>: is either 1 or 2. 1 if the OEA contig is aligned with the given orphan sequence
content, 2 if the OEA contig is aligned with the reverse complement of that.
<ORPHANcontigs> : a FASTA file containing all the orphan CONTIGS.
<OEAcount>: the number of total oea contigs. (note that both F and R strands together contribute
ONE in this count)
<ORPHANcount>: the number of orphan contigs.
Valid XHTML :: Valid CSS: :: Powered by WikkaWiki