How to extract the few lines of text that fall within a range (CU forum post)

Date: 2009-03-19  Source: mouse.rice

Given a file like this:
CC -!- FUNCTION: Rapidly .
CC -!- CATALYTIC ACTIVITY: Acetylcholine.
CC -!- SUBUNIT: Homotetramer; composed .
CC    Interacts with PRIMA1.
CC    anchor it to the basal
CC    (By similarity).
CC -!- SUBCELLULAR LOCATION: Cell junction, synapse. Secreted (By
CC    similarity). Cell membrane; Peripheral membrane protein (By
CC    similarity).
CC -!- SUBCELLULAR LOCATION: Isoform 2: Cell membrane;
CC    anchor; Extracellular side (By similarity).
CC -!- ALTERNATIVE PRODUCTS:
CC    Event=Alternative splicing; Named isoforms=2;
I want to extract the small sections that begin with SUBCELLULAR LOCATION, like this:
SUBCELLULAR LOCATION: Cell junction, synapse. Secreted (By
  similarity). Cell membrane; Peripheral membrane protein (By
  similarity).
SUBCELLULAR LOCATION: Isoform 2: Cell membrane; Lipid-anchor, GPI-
anchor; Extracellular side (By similarity).
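No1. Here is a general solution to this class of problem.

This is a lightweight, line-oriented approach. It reads one line at a time, so it never needs the whole file in memory,
and it is far more elegant than approaches that pattern-match against the entire file.

$start is the pattern for the opening marker and $end is the pattern for the closing marker.
if ( (/$start/ .. /$end/) and !/$end/ ){
selects the lines from the opening marker up to the closing marker, but excludes the closing line itself.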

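#! /usr/bin/env perl

use strict;
use warnings;

my $start = qr/^CC\s+-!- SUBCELLULAR LOCATION/;
# Any other "-!-" header closes the range; the negative lookahead keeps a
# second SUBCELLULAR LOCATION header from closing it.
my $end = qr/^CC\s+-!- (?!SUBCELLULAR LOCATION)/;

while(<DATA>){
    # The flip-flop is true from a $start line through the next $end line.
    if ( (/$start/ .. /$end/) and !/$end/ ){
        print "*** $_";
    }
    else{
        print "--- $_";
    }
}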
__END__
CC -!- FUNCTION: Rapidly .
CC -!- CATALYTIC ACTIVITY: Acetylcholine.
CC -!- SUBUNIT: Homotetramer; composed .
CC Interacts with PRIMA1.
CC anchor it to the basal
CC (By similarity).
CC -!- SUBCELLULAR LOCATION: Cell junction, synapse. Secreted (By
CC similarity). Cell membrane; Peripheral membrane protein (By
CC similarity).
CC -!- SUBCELLULAR LOCATION: Isoform 2: Cell membrane;
CC anchor; Extracellular side (By similarity).
CC -!- ALTERNATIVE PRODUCTS:
CC Event=Alternative splicing; Named isoforms=2;

Output:

flw@debian:~$ ./ttt.pl
--- CC -!- FUNCTION: Rapidly .
--- CC -!- CATALYTIC ACTIVITY: Acetylcholine.
--- CC -!- SUBUNIT: Homotetramer; composed .
--- CC Interacts with PRIMA1.
--- CC anchor it to the basal
--- CC (By similarity).
*** CC -!- SUBCELLULAR LOCATION: Cell junction, synapse. Secreted (By
*** CC similarity). Cell membrane; Peripheral membrane protein (By
*** CC similarity).
*** CC -!- SUBCELLULAR LOCATION: Isoform 2: Cell membrane;
*** CC anchor; Extracellular side (By similarity).
--- CC -!- ALTERNATIVE PRODUCTS:
--- CC Event=Alternative splicing; Named isoforms=2;
flw@debian:~$
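The script above only marks the matching lines. As a minimal sketch of the same flip-flop idea, the variant below returns the matching lines with the "CC" prefix stripped, which is closer to the output requested at the top of the thread. Instead of testing !/$end/ a second time, it relies on the documented return value of the range operator in scalar context: a sequence number inside the range, with "E0" appended on the line that closes it. The helper name extract_sections and the in-memory sample are assumptions made for illustration, not part of the original post.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

my $start = qr/^CC\s+-!- SUBCELLULAR LOCATION/;
my $end   = qr/^CC\s+-!- (?!SUBCELLULAR LOCATION)/;

# extract_sections is a hypothetical helper name used only in this sketch.
# Caveat: a flip-flop keeps its state between calls to the sub that
# contains it, so if one input ends in the middle of a section, the next
# call would start "inside" the range.
sub extract_sections {
    my ($fh) = @_;
    my @out;
    while (my $line = <$fh>) {
        # In scalar context ".." yields "" outside the range, a sequence
        # number inside it, and that number plus "E0" on the closing line.
        my $seq = ($line =~ $start) .. ($line =~ $end);
        next unless $seq;
        next if $seq =~ /E0$/;               # skip the closing line itself
        $line =~ s/^CC\s+(?:-!-\s+)?//;      # strip the "CC" / "CC -!-" prefix
        push @out, $line;
    }
    return @out;
}

my $sample = <<'EOT';
CC -!- SUBUNIT: Homotetramer; composed .
CC -!- SUBCELLULAR LOCATION: Cell junction, synapse. Secreted (By
CC similarity).
CC -!- SUBCELLULAR LOCATION: Isoform 2: Cell membrane;
CC anchor; Extracellular side (By similarity).
CC -!- ALTERNATIVE PRODUCTS:
EOT

open my $fh, '<', \$sample or die "cannot open in-memory file: $!";
print extract_sections($fh);
close $fh;
```

Checking for "E0" avoids running the $end pattern twice per line. The original `and !/$end/` form is arguably easier to read, so which one to prefer is a matter of taste.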

No2.

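#!/usr/bin/perl

use strict;
use warnings;

# Slurp the whole file into one string.
my $text = do { local $/; <DATA> };

# Capture each SUBCELLULAR section up to, but not including, the next
# "CC -!-" header or the end of the input (so a final section is not lost).
my @t = $text =~ /^CC\s+-!-\s+(SUBCELLULAR.*?)(?=^CC\s+-!-|\z)/msg;

# Strip the "CC" prefix from the start of each continuation line only.
s/^CC\s+//mg for @t;

print @t;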

__DATA__
CC -!- FUNCTION: Rapidly .
CC -!- CATALYTIC ACTIVITY: Acetylcholine.
CC -!- SUBUNIT: Homotetramer; composed .
CC Interacts with PRIMA1.
CC anchor it to the basal
CC (By similarity).
CC -!- SUBCELLULAR LOCATION: Cell junction, synapse. Secreted (By
CC similarity). Cell membrane; Peripheral membrane protein (By
CC similarity).
CC -!- SUBCELLULAR LOCATION: Isoform 2: Cell membrane;
CC anchor; Extracellular side (By similarity).
CC -!- ALTERNATIVE PRODUCTS:
CC Event=Alternative splicing; Named isoforms=2;
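Another way to work on the whole file at once, sketched here under the assumption that every topic starts with a "CC -!-" header line, is to split the text into one record per topic and then keep the records you want. The helper name sections_for and its $topic parameter are hypothetical and used only for illustration.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# sections_for is a hypothetical helper name used only in this sketch.
sub sections_for {
    my ($text, $topic) = @_;
    # Split before every "CC -!-" header. The lookahead is zero-width,
    # so each record keeps its own header line.
    my @records = split /^(?=CC\s+-!-)/m, $text;
    # Keep only the records whose header names the requested topic.
    my @wanted = grep { /\ACC\s+-!-\s+\Q$topic\E:/ } @records;
    for my $rec (@wanted) {
        # Drop the "CC" / "CC -!-" prefix at the start of every line.
        $rec =~ s/^CC[ \t]+(?:-!-[ \t]+)?//mg;
    }
    return @wanted;
}

my $sample = <<'EOT';
CC -!- SUBUNIT: Homotetramer; composed .
CC -!- SUBCELLULAR LOCATION: Cell junction, synapse. Secreted (By
CC similarity).
CC -!- ALTERNATIVE PRODUCTS:
CC Event=Alternative splicing; Named isoforms=2;
EOT

print sections_for($sample, 'SUBCELLULAR LOCATION');
```

Because each record is a single string, a section that happens to be the last one in the file is handled the same way as any other. The cost is that the whole file has to be held in memory.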

No3.

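#!/usr/bin/perl

use warnings;
use strict;

my $key = 0;    # true while we are inside a SUBCELLULAR LOCATION section

while(<DATA>){
    # Any "-!-" header ends the current section.
    if (/-!-/) {
        $key = 0;
    }
    # A SUBCELLULAR LOCATION header starts (or restarts) a section.
    if (/SUBCELLULAR LOCATION/) {
        print;
        $key = 1;
        next;
    }
    # Continuation line of the section we are currently in.
    if ($key) {
        print;
    }
}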

__END__
CC -!- FUNCTION: Rapidly .
CC -!- CATALYTIC ACTIVITY: Acetylcholine.
CC -!- SUBUNIT: Homotetramer; composed .
CC Interacts with PRIMA1.
CC anchor it to the basal
CC (By similarity).
CC -!- SUBCELLULAR LOCATION: Cell junction, synapse. Secreted (By
CC similarity). Cell membrane; Peripheral membrane protein (By
CC similarity).
CC -!- SUBCELLULAR LOCATION: Isoform 2: Cell membrane;
CC anchor; Extracellular side (By similarity).
CC -!- ALTERNATIVE PRODUCTS:
CC Event=Alternative splicing; Named isoforms=2;