remote BLAST 的使用!
时间:2010-06-24 来源:aids260
Accesing remote BLAST |
Written by Administrator |
Problem: given a set of sequnces, run remote BLAST at NCBI and get neighborhood near the hit. Here\'s implementation in Perl + BioPerl. 1. read sequences from text file (each sequnce on separate line). This can be easily adapted to reading in FASTA-format. 2. Run remote blastn from NCBI server. Specify the following parameters: - minimum e-value - nr database, search in only viruses (txid10239 [ORGN]). Again, this needs to be modified for your needs. 3. BLAST result is saved as XML. For each hit also download the sequnce, containing the hit and cut neighborhood of given length ($DELTA). 4. Output to text file for each initial sequence. Each file can contain several hits. Code was adapted from BioPerl demo and is not optimized. For example, sequence containing a hit (usually full genome) should be cached instead of downloading it separtely each time. All parameters are specified in the code. Additional Perl module misc.pm is used only for trimming spaces from a string, trim(). This is for additional safety and can be removed. Please, feel free to post comments/questions. # # BLAST nucleotide sequences against Genbank # use List::Util qw[min max]; use Bio::SearchIO; use Bio::Tools::Run::RemoteBlast; use Bio::DB::GenBank; use misc; my $infile=\"seqs.txt\"; # file with sequences my $prog = \'blastn\'; my $blastdb = \'nr\'; my $e_val= \'1\'; $DELTA=100; #how many nucleotides before and after hit to preserve $db = new Bio::DB::GenBank; #connect to NCBI my @params = ( \'-prog\' => $prog, \'-data\' => $blastdb, \'-expect\' => $e_val, \'-readmethod\' => \'xml\' ); my $factory = Bio::Tools::Run::RemoteBlast->new(@params); #SEE http://www.ncbi.nlm.nih.gov/BLAST/Doc/urlapi.html for more params $factory->retrieve_parameter(\'FORMAT_TYPE\', \'XML\'); # tells NCBI to send XML back #change a query paramter $Bio::Tools::Run::RemoteBlast::HEADER{\'ENTREZ_QUERY\'} = \'txid10239 [ORGN]\'; # all viruses #$Bio::Tools::Run::RemoteBlast::HEADER{\'ENTREZ_QUERY\'} = \'txid629 [ORGN]\'; # Yersinia $Bio::Tools::Run::RemoteBlast::HEADER{\'FILTER\'} = \'L\'; $Bio::Tools::Run::RemoteBlast::HEADER{\'WORD_SIZE\'} = \'7\'; $Bio::Tools::Run::RemoteBlast::HEADER{\'FORMAT_TYPE\'} = \'XML\'; #see also OTHER_ADVANCED for specifying word size etc #howto: change a retrieval parameter #$Bio::Tools::Run::RemoteBlast::RETRIEVALHEADER{\'DESCRIPTIONS\'} = 1000; #remove a parameter #delete $Bio::Tools::Run::RemoteBlast::HEADER{\'FILTER\'}; open(INFILE,$infile) or die \"error opening file\"; while () { my $str=$_; chomp($str); my $pos=index $str, \"#\"; if ($pos==0) { next; } #skipping commentaries $seq=trim($str); $outfile=$seq.\".txt\"; # to allow re-run if (-e $outfile) { print \"file $outfile already exists - skipping\\n\"; next; } open(OUTFILE,\">$outfile\"); $old_fh=select(OUTFILE);$|=1;select($old_fh); #disable buffering my $seqobj = Bio::Seq->new(-seq =>$seq); my $r = $factory->submit_blast($seqobj); while ( my @rids = $factory->each_rid ) { #RID - request ID # my @rids = $factory->each_rid ; #RID - request ID $rid= $rids[-1]; #taking last one my $rc = $factory->retrieve_blast($rid); if( !ref($rc) ) { if($rc<0) { $factory->remove_rid($rid); } print \".\"; sleep 1;} else { my $result = $rc->next_result(); my $filename = $seq.\".blres.xml\"; $factory->save_output($filename); # save balst result to xml # parsing xml file my $searchio = new Bio::SearchIO( -format => \'blastxml\', -file => $filename ); while ( my $res = $searchio->next_result() ) { while( my $hit = $res->next_hit ) { # Bio::Search::Hit::HitI object my $query_id=$hit->accession; my $hitdescr=$hit->description; while( my $hsp = $hit->next_hsp ) { # Bio::Search::HSP::HSPI object $start=$hsp->{HIT_START}; $end=$hsp->{HIT_END}; $evalue=$hsp->{EVALUE}; #NB: do this before retrieving the sequence! if ($start>$end) { print \"Error: start $start > end $end Swapping them\\n\"; my $tmp=$start; $start=$end; $end=$tmp; } $delta=$DELTA; #NB reassign it here each time print \"retrieving $query_id...\"; $seq = $db->get_Seq_by_acc($query_id); print \"done\\n\"; if ((($start-$delta)<=0)||(($end+$delta)>length($seq->seq))) { $delta = min($start-1, length($seq->seq)-$end, 100); } #change delta # for debug: # print \"debug start= $start end = $end delta=$delta seqlength=\".length($seq->seq).\" \"; # printing output print OUTFILE \"$query_id\\t$hitdescr\\n\"; print OUTFILE \"$start\\t$end\\t$hsp->{IDENTICAL}\\t$evalue\\n\"; $subseq = $seq->primary_seq->subseq($start-$delta,$end+$delta); print OUTFILE \"$subseq\\n\"; for(my $i=0;$i<$delta;$i++) { print OUTFILE \" \";}; print OUTFILE \"$hsp->{QUERY_SEQ}\\n\"; for(my $i=0;$i<$delta;$i++) { print OUTFILE \" \";}; print OUTFILE \"$hsp->{HOMOLOGY_SEQ}\\n\"; for(my $i=0;$i<$delta;$i++) { print OUTFILE \" \";}; print OUTFILE \"$hsp->{HIT_SEQ}\\n\"; } print OUTFILE \"\\n\"; #separator between records } # for each hsp } #$result = $searchio->next_result() last; } }# while ( my @rids = $factory->each_rid ) print \"\\nfinished for sequence $seq\\n\"; close(OUTFILE); }#reading $infile close(INFILE); print \"#work is finished\"; |
相关阅读 更多 +