Data Mining

Strategies for getting what you need from databases

Often you can obtain large amounts of the data you want, but not in a format that is useful to you. How can you sort through what you have to get what you need?

Sometimes the answer is to use something already familiar to you. For example:

Microsoft Excel

Recall the Report file that was generated when you did the sequence assembly project.

A lot of information was embedded in the file (names of sequence reads, number of bases used from each, which contig each read was added to, etc.):

Scanned against repeated sequences:
Time to do prepass: 0:1:10
Preassembly Elapsed Time 0:0:0
Construction parameters:
Match Size 12
Maximum Added Gap Length in Contig 20
Maximum Added Gap Length in Sequence 20
Minimum Match Percentage 80
Maximum Register Shift Difference 20
Lastgroup Considered 2
Gap Penalty 0.20
Gap Length Penalty 0.70
Consensus Threshold 75
Entering 1762 sequences on 12/13/01, 11:34 AM
CREATING NEW contig 1: from 58-2-11-C07.r.1.scf(1>933)
ENTERING 58-2-2-C10.f.1.scf(4>942) in Contig 1: percent match 96
ENTERING 58-2-11-H02.r.1.scf(1>894) in Contig 1: percent match 96
CREATING NEW contig 2: from 58-2-6-F06.r.1.scf(1>795)
ENTERING 58-2-2-B07.f.1.scf(8>1014) in Contig 2: percent match 96
ENTERING 58-2-12-A01.r.1.scf(1>897) in Contig 1: percent match 96
ENTERING 58-2-10-H06.r.1.scf(1>783) in Contig 2: percent match 96
ENTERING 58-2-12-F02.r.1.scf(6>889) in Contig 1: percent match 96
ENTERING 58-2-3-D11.r.1.scf(1>693) in Contig 1: percent match 97
ENTERING 58-2-7-E01.r.1.scf(1>793) in Contig 1: percent match 97
CREATING NEW contig 3: from 58-2-12-B05.r.1.scf(1>952)
ENTERING 58-2-3-B07.f.1.scf(8>680) in Contig 1: percent match 99
CREATING NEW contig 4: from 58-2-10-C04.r.1.scf(1>769)
CREATING NEW contig 5: from 58-2-1-D09.r.1.scf(8>800)
ENTERING 58-2-11-E01.r.1.scf(1>964) in Contig 3: percent match 85
ENTERING 58-2-4-C01.r.1.scf(3>784) in Contig 1: percent match 95

In looking at the Strategy view of the entire dataset, I noticed that there were many plasmids that had been sequenced with one primer, but not the other, shown as black reads below:

How many were there and what are all their names so we can ask to have them sequenced?

First I modified a copy of the assembly report, for ease in reading into Excel. This is often not necessary, but in this case, I wanted to sort through the names of the reads to look for paired reads. This meant breaking up the name with tab characters:

For this I used a text editor's find-and-replace command to insert tab characters (\t) within the file names. Now I can import the file into Excel (just open the file; Excel will guide you through) and analyze the names. Different parts of the file name are now different fields (columns) within the file. Let's look at what's in a name:

58-2-4-C01.r.1.scf

All the files begin with "58-2-". That is the name of the sequencing project. Next is "4-", the microtiter plate number for the plasmid. "C01" is the well location on the plate. "r" is a reaction using the reverse primer; all plasmids should be sequenced once each with the forward (f) and reverse (r) primers. "1" is the read number, since some reactions are repeated.
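The find-and-replace step can also be scripted. The following is only a minimal sketch, not the procedure used above: it assumes report lines that look like the sample shown earlier, and the temporary file and the variable names (report, fields) are made up for illustration.

```shell
# Hypothetical sample: three lines in the format of the assembly report.
report=$(mktemp)
cat > "$report" <<'EOF'
CREATING NEW contig 1: from 58-2-11-C07.r.1.scf(1>933)
ENTERING 58-2-2-C10.f.1.scf(4>942) in Contig 1: percent match 96
Gap Penalty 0.20
EOF

# Step 1: keep only lines that mention a read and strip everything
#         except the read name (without the .scf extension).
# Step 2: split the name into project, plate, well, primer, read.
# Step 3: turn the separating spaces into tabs for Excel.
fields=$(sed -n 's/.*\(58-2-[0-9]*-[A-H][0-9]*\.[fr]\.[0-9]*\)\.scf.*/\1/p' "$report" |
  sed 's/^58-2-\([0-9]*\)-\([A-H][0-9]*\)\.\([fr]\)\.\([0-9]*\)$/58-2 \1 \2 \3 \4/' |
  tr ' ' '\t')

printf '%s\n' "$fields"
rm -f "$report"
```

Each output line holds one read, with the project, plate, well, primer and read number in separate tab-delimited columns.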

So here is how it looks in Excel:

I have hidden the text in front of the name and the names have been sorted. You see that there were reads from both primers for 58-2-1A01, 1 forward and 2 reverse reads for 1A02, but only single reads from 1A03 and 1A04. These are the ones we want to find. We can put a conditional formula in column O to help us:

=IF(F7=F8,1,IF(F7=F6,1,0))

This means: if the value in cell F7 equals the one below it [F7=F8], then put a 1 in O7 [,1]; otherwise, if F7 equals the one above it [,IF(F7=F6,1], then put a 1 in O7. If neither of these is true, put a 0 in O7 [,0))].

So 1A03 and 1A04 should both have "0" in column O, as would all unique reads. But what if there is more than 1 read in one direction, but none in the other? The next 2 columns help there. Column P keeps track of the number of reads with the same name:

=IF(F11=F10,P10+1,1)

This conditional statement says: IF the value in cell F11 is the same as F10, then add 1 to the value of P10 (the cell above the formula); otherwise, put a 1 in P11. So the first read from a plasmid will have a 1 in this column, the next a 2, the third a 3, etc.

But what about the case of 2 reverse reads, but no forward reads? Column Q takes care of that:

If there are more than 2 reads, it compares the value in G (the primer direction) to the cell 2 above it; if they match, it puts a 1 in column Q. If there is more than 1 read, it compares the value of G to the cell directly above and again puts a 1 here if they match. If there is only one read, or it is the first read with that name, then it puts a 0 in column Q.

Project	Plate	Row	Column	Primer	Read	O	P	Q	Class
58-2-	1	A	3	r	2	0	1	0	unique
58-2-	1	A	4	r	2	0	1	0	unique
58-2-	1	A	6	f	1	1	1	0	normal
58-2-	1	A	6	r	2	1	2	0	normal
58-2-	1	A	7	f	1	1	1	0	extra OK
58-2-	1	A	7	r	1	1	2	0	extra OK
58-2-	1	A	7	r	2	1	3	0	extra OK
58-2-	1	A	8	f	1	1	1	0	extra OK
58-2-	1	A	8	r	1	1	2	0	extra OK
58-2-	1	A	8	r	2	1	3	0	extra OK
58-2-	1	A	9	f	1	1	1	0	normal
58-2-	1	A	9	r	2	1	2	0	normal
58-2-	1	A	10	r	1	1	1	0	2 reads, 1 pr
58-2-	1	A	10	r	2	1	2	1	2 reads, 1 pr
58-2-	1	A	11	r	1	1	1	0	2 reads, 1 pr
58-2-	1	A	11	r	2	1	2	1	2 reads, 1 pr
58-2-	1	A	12	r	1	1	1	0	2 reads, 1 pr
58-2-	1	A	12	r	2	1	2	1	2 reads, 1 pr

You should now be able to sort the spreadsheet to identify the different classes of reads. Sorting, however, recalculates the formulas and changes the values in the diagnostic columns. First copy those columns and paste them elsewhere as values, so you can sort the numbers without changing them.

You can also download new data from websites into Excel. This allows you to do calculations on the most up-to-date information available.
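The same question (which plasmids lack a read from one primer?) can also be sketched on the command line. This is an illustration under stated assumptions rather than the method used above: the input is assumed to be tab-separated with plate, well, primer and read number in columns 1 to 4, and the sample rows and variable names are hypothetical.

```shell
# Hypothetical input: plate, well, primer (f or r), read number.
reads=$(mktemp)
printf '1\tA03\tr\t2\n1\tA06\tf\t1\n1\tA06\tr\t2\n1\tA10\tr\t1\n1\tA10\tr\t2\n' > "$reads"

# Count forward and reverse reads per plasmid (plate + well), then
# report every plasmid that lacks one of the two directions.
missing=$(awk -F'\t' '
  { id = $1 $2; seen[id] = 1; n[id, $3]++ }
  END {
    for (id in seen) {
      if (n[id, "f"] == 0)      print id, "needs forward"
      else if (n[id, "r"] == 0) print id, "needs reverse"
    }
  }' "$reads" | sort)

printf '%s\n' "$missing"
rm -f "$reads"
```

Because the counts are kept per plasmid, a plasmid with two reverse reads and no forward read is reported just like one with a single read, which is the case columns P and Q were added to catch.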

sed, awk, grep

Several command-line programs are available on any UNIX computer (such as the rs6000; if you have a Tulane email address, you can log in to rs6000.tcs.tulane.edu with your mail account login). They are very fast at handling text files, and once you learn to use them they are powerful timesavers. You can tie them together with scripts to perform multiple manipulations on multiple files. Type "man program_name" to get help.
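grep is the simplest of the three: it prints the lines of a file that match a pattern. As a small sketch, assuming a few lines in the format of the assembly report shown earlier (the temporary file and variable names are made up for illustration):

```shell
# Hypothetical sample in the format of the assembly report.
report=$(mktemp)
cat > "$report" <<'EOF'
CREATING NEW contig 1: from 58-2-11-C07.r.1.scf(1>933)
ENTERING 58-2-2-C10.f.1.scf(4>942) in Contig 1: percent match 96
CREATING NEW contig 2: from 58-2-6-F06.r.1.scf(1>795)
ENTERING 58-2-2-B07.f.1.scf(8>1014) in Contig 2: percent match 96
EOF

# -c counts matching lines instead of printing them:
# how many contigs were started?
n_contigs=$(grep -c '^CREATING NEW contig' "$report")

# Without -c, grep prints the matching lines themselves:
# which reads were entered into Contig 2?
contig2=$(grep 'in Contig 2:' "$report")

printf 'contigs: %s\n%s\n' "$n_contigs" "$contig2"
rm -f "$report"
```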

awk

awk can search for text strings, compare the values of fields, and output the results however you tell it:

jnolan% awk '/HUMAN/ {print $2, $1}' cytcox.aln
---MALPLRPLTRGLASA--------AKGGHGGAG------------------ARTWRLL COXD_HUMAN
--MAVVGVSSVSRLLGRSRPQLGRPMSSGAHGEEGS-----------------ARMWKTL COXE_HUMAN
TFVLALPSVALCTFNSYL-HSGH--RERPE--------FRPYQHLRIRTKPYPWGDGNHT COXD_HUMAN
TFFVALPGVAVSMLNVYL-KSHHGEHERPE--------FIAYPHLRIRTKPFPWGDGNHT COXE_HUMAN
LFHNSHVNPLP-TGYEHP---- COXD_HUMAN
LFHNPHVNPLP-TGYEDE---- COXE_HUMAN

The program searched for lines containing 'HUMAN' and printed out the second field first, followed by the first field.
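awk can also compare the value of a field with a number. The sketch below assumes lines in the format of the assembly report shown earlier and lists the reads that entered a contig with a percent match below 90; the cutoff of 90 and the variable names are arbitrary choices for illustration.

```shell
# Hypothetical sample in the format of the assembly report.
report=$(mktemp)
cat > "$report" <<'EOF'
ENTERING 58-2-3-B07.f.1.scf(8>680) in Contig 1: percent match 99
ENTERING 58-2-11-E01.r.1.scf(1>964) in Contig 3: percent match 85
ENTERING 58-2-4-C01.r.1.scf(3>784) in Contig 1: percent match 95
EOF

# $2 is the read; $NF is the last field on the line (the percent match).
# A line is printed only when both conditions are true.
low=$(awk '/^ENTERING/ && $NF < 90 {print $2, $NF}' "$report")

printf '%s\n' "$low"
rm -f "$report"
```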

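The csh history shortcut ^HUMAN^ELVIS^ repeats the previous command with ELVIS in place of HUMAN. No line contains 'ELVIS', so nothing is printed:

[nolan:lecture/723/seqs] jnolan% ^HUMAN^ELVIS^
awk '/ELVIS/ {print $2, $1}' cytcox.aln
-no result!-

To select the lines that do not match, use the !~ operator. It needs something to test on its left, such as $0 (the whole line), and in csh the ! must be escaped as \! so the shell does not treat it as a history reference. This prints every line that lacks 'ELVIS':

[nolan:lecture/723/seqs] jnolan% awk '$0 \!~ /ELVIS/ {print $2, $1}' cytcox.aln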

awk '/COXE_HUMAN/ {print $2}' cytcox.aln
--MAVVGVSSVSRLLGRSRPQLGRPMSSGAHGEEGS-----------------ARMWKTL
TFFVALPGVAVSMLNVYL-KSHHGEHERPE--------FIAYPHLRIRTKPFPWGDGNHT
LFHNPHVNPLP-TGYEDE----

This returns just the sequence for COXE_HUMAN, which makes it a good way to extract a sequence from an alignment.

We can take this result and pass it through a second program, sed, which is a stream (non-interactive) text editor:

awk '/COXE_HUMAN/ {print $2}' cytcox.aln | sed 's/-//g'
MAVVGVSSVSRLLGRSRPQLGRPMSSGAHGEEGSARMWKTL
TFFVALPGVAVSMLNVYLKSHHGEHERPEFIAYPHLRIRTKPFPWGDGNHT
LFHNPHVNPLPTGYEDE

We repeated the awk command, but used the pipe "|" to pass the result to sed without ever seeing the first result. Then we used sed to do a global search and replace that removes every "-".

Now we have our old sequence back!
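The sequence still comes back split across several lines. One more stage in the pipe can join it, as in this sketch. It assumes an alignment file with the name in the first field and the sequence block in the second; the sample below is a shortened, hypothetical excerpt, and writing the result with a ">" header line is just one convenient choice.

```shell
# Hypothetical, shortened excerpt of an alignment file.
aln=$(mktemp)
cat > "$aln" <<'EOF'
COXD_HUMAN ---MALPLRP
COXE_HUMAN --MAVVGVSS
COXD_HUMAN TFVLALPSVA
COXE_HUMAN TFFVALPGVA
EOF

# awk picks the sequence blocks, sed removes the gap characters,
# and tr -d deletes the newlines so the blocks form one line.
seq=$(awk '/COXE_HUMAN/ {print $2}' "$aln" | sed 's/-//g' | tr -d '\n')

printf '>COXE_HUMAN\n%s\n' "$seq"
rm -f "$aln"
```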

Other useful unix commands:

comm file1 file2 - compares two sorted files line by line. Output is in 3 columns: lines unique to file1, lines unique to file2, and lines common to both.

diff file1 file2 - compares file1 with file2. Angle brackets mark the differences: "<" for lines found only in file1, ">" for lines found only in file2.
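A short sketch of comm, assuming two hypothetical lists of sequence names. The files are sorted first because comm expects sorted input, and the -23 and -12 options suppress the columns that are not wanted.

```shell
# Two hypothetical lists of sequence names.
a=$(mktemp); b=$(mktemp)
printf 'COXE_YEAST\nCOXD_HUMAN\nCOXE_HUMAN\n' | sort > "$a"
printf 'COXE_HUMAN\nCOXD_HUMAN\n' | sort > "$b"

# -23 hides columns 2 and 3, leaving lines found only in the first file.
only_a=$(comm -23 "$a" "$b")

# -12 hides columns 1 and 2, leaving lines common to both files.
both=$(comm -12 "$a" "$b")

printf 'only in first: %s\nin both:\n%s\n' "$only_a" "$both"
rm -f "$a" "$b"
```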

diff cytcox.aln cytcox2.aln
16,17d15
< COXE_YEAST ---MFR---QCAKRYASSLPPNALKPAFGPPDKVAAQKFKESLMATEKHAKDTSNMWVKI
< COXE_SCHPO MSMMNRNIGFLSRTLKTSVPKRAGLLSFRAYSNEAKVNWLEEVQAEEEHAKRSSEFWKKV
32,33d29
< COXE_YEAST SVWVALPAIALTAVNTYFVEKEHAEHREHLKHVPDSEWPRDYEFMNIRSKPFFWGDGDKT
< COXE_SCHPO TYYIGGPALILASANAYYIYCKHQEHAKHVEDTDPG-----YSFENLRFKKYPWGDGSKT
48,49d43
< COXE_YEAST LFWNPVVNRHIEHDD-------
< COXE_SCHPO LFWNDKVN-HLKKDDE------

cut filename - extracts columns of text using character numbers or field numbers

cut -c17-76 cytcox.aln
multiple sequence alignment

-MAMSPAATVARRRLAAA--------SQGSH-EGG------------------ARTWKIL
--MASPASMAARRVLSAA--------SHAGH-EGGS-----------------ARTWKIL
---MALPLKSLSRGLASA--------AKGDHGGTG------------------ARTWRFL
---MALPLRPLTRGLASA--------AKGGHGGAG------------------ARTWRLL
------PLKVLSRSMASA--------SKGDHGGAG------------------ANTWRLL
---MALPLKVLSRSMASA--------AKGDHGGAG------------------ANTWRLL
MASAVLSASRVSRPLGRALPGLRRPMSSGAHGEEGS-----------------ARMWKAL
MASAVLSASRVSGLLGRALPRVGRPMSSGAHGEEGS-----------------ARIWKAL
--------------------------SSGAHGEEGS-----------------ARMWKAL
--MAVVGVSSVSRLLGRSRPQLGRPMSSGAHGEEGS-----------------ARMWKTL
--MAAAAWSRVSQLLGRSRLQVGRPMSSGAHGEEGS-----------------ARMWKAL
---MNRLAQPATRSVVKTFQRKSSGSFYGSNNVEGFKESYVTPLKQAHNA---SETWKKI
---MFR---QCAKRYASSLPPNALKPAFGPPDKVAAQKFKESLMATEKHAKDTSNMWVKI
MSMMNRNIGFLSRTLKTSVPKRAGLLSFRAYSNEAKVNWLEEVQAEEEHAKRSSEFWKKV
. :. * :
SFVLALPGVGVCMANAYM-KMQAHSHDPPE--------FVPYPHLRIRTKPWPWGDGNHS
SFVLALPGVAVCIANAYM-KMQQHSHEPPE--------FVAYSHLRIRTKKWPWGDGNHS
TFGLALPSVALCTLNSWL-HSGH--RERPA--------FIPYHHLRIRTKPFSWGDGNHT
TFVLALPSVALCTFNSYL-HSGH--RERPE--------FRPYQHLRIRTKPYPWGDGNHT
TFVLALPSVALCSLNCWM-HAGH--HERPE--------FIPYHHLRIRTKPFSWGDGNHT
TFVLALPGVALCSLNCWM-HAGH--HERPE--------FIPYHHLRIRTKPFAWGDGNHT
TYFVALPGVGVSMLNVFL-KSRHEEHERPP--------FVAYPHLRIRTKPFPWGDGNHT
TYFVALPGVGVSMLNVFL-KSRHEEHERPE--------FVAYPHLRIRTKPFPWGDGNHT
TLFVALPGVGVSMLNVFM-KSHHGEEERPE--------FVAYPHLRIRSKPFPWGDGNHT
TFFVALPGVAVSMLNVYL-KSHHGEHERPE--------FIAYPHLRIRTKPFPWGDGNHT
TYFVALPGVGVSMLNVYL-KSHHEEHERPE--------FIAYPHLRIRSKPFPWGDGNHT
FFIASIPCLALTMYAAFKDHKKHMSHERPE--------HVEYAFLNVRNKPFPWSDGNHS
SVWVALPAIALTAVNTYFVEKEHAEHREHLKHVPDSEWPRDYEFMNIRSKPFFWGDGDKT
TYYIGGPALILASANAYYIYCKHQEHAKHVEDTDPG-----YSFENLRFKKYPWGDGSKT
. * : : : . * . .:* * : *.**.::
LFHNAHTNALP-TGYEGPHH--
LFHNPHENALP-EGYEGPRH--
FFHNPRVNPLP-TGYEKP----
LFHNSHVNPLP-TGYEHP----
LFHNPHVNPLP-TGYEQP----
LFHNPHVNPLP-TGYEHP----
LFHNPHVNPLP-TGYEDE----
LFHNPHMNPLP-TGYEDE----
LFHNPHVNPLP-TGYEDE----
LFHNPHVNPLP-TGYEDE----
LFHNPHVNPLP-TGYEDV----
LFHNKAEQFVPGVGFEADREKH
LFWNPVVNRHIEHDD-------
LFWNDKVN-HLKKDDE------
:* * : .
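cut can also select whole fields rather than character positions. A minimal sketch, assuming a tab-separated file of split read names and a space-separated alignment line (both samples are hypothetical):

```shell
# Hypothetical tab-separated read names: project, plate, well, primer, read.
tab=$(mktemp)
printf '58-2\t11\tC07\tr\t1\n58-2\t2\tC10\tf\t1\n' > "$tab"

# -f selects fields; tab is the default field delimiter.
wells=$(cut -f3,4 "$tab")

# -d sets a different delimiter, here a single space.
name=$(printf 'COXD_HUMAN ---MALPLRP\n' | cut -d' ' -f1)

printf '%s\n%s\n' "$wells" "$name"
rm -f "$tab"
```

Field mode only works cleanly when exactly one delimiter separates the fields; for names padded with several spaces, character positions (-c) as shown above are the safer choice.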

paste file1 file2

puts 2 files together side-by-side

paste -d"Z" names seqs
CLUSTAL WZ multiple sequence alignment
Z
Z
COXE_CYPCZ-MAMSPAATVARRRLAAA--------SQGSH-EGG------------------ARTWKIL
COXE_ONCMZ--MASPASMAARRVLSAA--------SHAGH-EGGS-----------------ARTWKIL
COXD_BOVIZ---MALPLKSLSRGLASA--------AKGDHGGTG------------------ARTWRFL
COXD_HUMAZ---MALPLRPLTRGLASA--------AKGGHGGAG------------------ARTWRLL
COXD_RAT Z------PLKVLSRSMASA--------SKGDHGGAG------------------ANTWRLL
COXD_MOUSZ---MALPLKVLSRSMASA--------AKGDHGGAG------------------ANTWRLL
COXE_MOUSZMASAVLSASRVSRPLGRALPGLRRPMSSGAHGEEGS-----------------ARMWKAL
COXE_RAT ZMASAVLSASRVSGLLGRALPRVGRPMSSGAHGEEGS-----------------ARIWKAL
COXE_BOVIZ--------------------------SSGAHGEEGS-----------------ARMWKAL
COXE_HUMAZ--MAVVGVSSVSRLLGRSRPQLGRPMSSGAHGEEGS-----------------ARMWKTL
COXE_RABIZ--MAAAAWSRVSQLLGRSRLQVGRPMSSGAHGEEGS-----------------ARMWKAL
COXE_CAEEZ---MNRLAQPATRSVVKTFQRKSSGSFYGSNNVEGFKESYVTPLKQAHNA---SETWKKI
COXE_YEASZ---MFR---QCAKRYASSLPPNALKPAFGPPDKVAAQKFKESLMATEKHAKDTSNMWVKI
COXE_SCHPZMSMMNRNIGFLSRTLKTSVPKRAGLLSFRAYSNEAKVNWLEEVQAEEEHAKRSSEFWKKV
Z . :. * :
Z
COXE_CYPCZSFVLALPGVGVCMANAYM-KMQAHSHDPPE--------FVPYPHLRIRTKPWPWGDGNHS
COXE_ONCMZSFVLALPGVAVCIANAYM-KMQQHSHEPPE--------FVAYSHLRIRTKKWPWGDGNHS
COXD_BOVIZTFGLALPSVALCTLNSWL-HSGH--RERPA--------FIPYHHLRIRTKPFSWGDGNHT
COXD_HUMAZTFVLALPSVALCTFNSYL-HSGH--RERPE--------FRPYQHLRIRTKPYPWGDGNHT
COXD_RAT ZTFVLALPSVALCSLNCWM-HAGH--HERPE--------FIPYHHLRIRTKPFSWGDGNHT
COXD_MOUSZTFVLALPGVALCSLNCWM-HAGH--HERPE--------FIPYHHLRIRTKPFAWGDGNHT
COXE_MOUSZTYFVALPGVGVSMLNVFL-KSRHEEHERPP--------FVAYPHLRIRTKPFPWGDGNHT
COXE_RAT ZTYFVALPGVGVSMLNVFL-KSRHEEHERPE--------FVAYPHLRIRTKPFPWGDGNHT
COXE_BOVIZTLFVALPGVGVSMLNVFM-KSHHGEEERPE--------FVAYPHLRIRSKPFPWGDGNHT
COXE_HUMAZTFFVALPGVAVSMLNVYL-KSHHGEHERPE--------FIAYPHLRIRTKPFPWGDGNHT
COXE_RABIZTYFVALPGVGVSMLNVYL-KSHHEEHERPE--------FIAYPHLRIRSKPFPWGDGNHT
COXE_CAEEZFFIASIPCLALTMYAAFKDHKKHMSHERPE--------HVEYAFLNVRNKPFPWSDGNHS
COXE_YEASZSVWVALPAIALTAVNTYFVEKEHAEHREHLKHVPDSEWPRDYEFMNIRSKPFFWGDGDKT
COXE_SCHPZTYYIGGPALILASANAYYIYCKHQEHAKHVEDTDPG-----YSFENLRFKKYPWGDGSKT
Z . * : : : . * . .:* * : *.**.::
Z
COXE_CYPCZLFHNAHTNALP-TGYEGPHH--
COXE_ONCMZLFHNPHENALP-EGYEGPRH--
COXD_BOVIZFFHNPRVNPLP-TGYEKP----
COXD_HUMAZLFHNSHVNPLP-TGYEHP----
COXD_RAT ZLFHNPHVNPLP-TGYEQP----
COXD_MOUSZLFHNPHVNPLP-TGYEHP----
COXE_MOUSZLFHNPHVNPLP-TGYEDE----
COXE_RAT ZLFHNPHMNPLP-TGYEDE----
COXE_BOVIZLFHNPHVNPLP-TGYEDE----
COXE_HUMAZLFHNPHVNPLP-TGYEDE----
COXE_RABIZLFHNPHVNPLP-TGYEDV----
COXE_CAEEZLFHNKAEQFVPGVGFEADREKH
COXE_YEASZLFWNPVVNRHIEHDD-------
COXE_SCHPZLFWNDKVN-HLKKDDE------
Z:* * : .
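cut and paste work well as a pair: one splits a file into columns and the other puts columns back together. The sketch below assumes a small hypothetical alignment with one space between name and sequence, and leaves out -d so that paste uses its default delimiter, a tab, which a spreadsheet reads directly.

```shell
# Hypothetical two-line alignment, name and sequence separated by a space.
aln=$(mktemp); names=$(mktemp); seqs=$(mktemp)
printf 'COXD_HUMAN ---MALPLRP\nCOXE_HUMAN --MAVVGVSS\n' > "$aln"

# Split the file into a names column and a sequences column.
cut -d' ' -f1 "$aln" > "$names"
cut -d' ' -f2 "$aln" > "$seqs"

# Join the two columns side by side, separated by a tab.
joined=$(paste "$names" "$seqs")

printf '%s\n' "$joined"
rm -f "$aln" "$names" "$seqs"
```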

Perl

Perl is one of the most powerful mining and manipulation tools. It is fairly easy to learn, and once you do, you can do just about anything. The BioPerl project at bioperl.org offers hundreds of bioinformatics tools. An excellent text, "Beginning Perl for Bioinformatics", is available from O'Reilly. It has some handy exercises you can download to lead you through the programming steps as you learn Perl.

Perl is very useful for analyzing sequences and parsing results.

Parsing: extracting data from a result in a useful form, for example from BLAST results. A BLAST report holds lots of information, but it is hard to compare and compile all the results from one search. Parsers search through the file and organize it into fields:

Query Seq Name	Start Subj	End Subj	Query Start	Query End	Score Bits	Score 2	Expect	Length	Overlap Length	Identities	Total	% Identities
Contig15 64577 bp	1	1032	48366	45271	1702	4408	0	1032	3095	809	1032	78%
Contig15 64577 bp	1	658	50333	48372	1149	2972	0	660	1961	569	658	86%
Contig15 64577 bp	1	610	34786	32954	1117	2890	0	610	1832	544	611	89%
Contig15 64577 bp	1	659	32920	30944	1089	2817	0	659	1976	533	659	80%
Contig15 64577 bp	2	511	19393	17861	883	2281		587	1532	427	511	83%
Contig15 64577 bp	1	602	43345	41534	969	2506	0	602	1811	462	604	76%
Contig15 64577 bp	2	521	26507	24948	957	2475	0	521	1559	485	520	93%
Contig15 64577 bp	1	523	34525	32954	957	2474	0	523	1571	470	524	89%
Contig15 64577 bp	1	524	30249	28681	932	2408	0	524	1568	450	524	85%
Contig15 64577 bp	1	575	52890	51160	925	2391	0	575	1730	450	578	77%
Contig15 64577 bp	1	505	34471	32954	922	2384	0	505	1517	453	506	89%
Contig15 64577 bp	40	427	24406	23243	720	1859		427	1163	365	388	94%
Contig15 64577 bp	151	401	25787	25014	74.7	182	2.00E-14		773	78	273	28%
Contig15 64577 bp	1	416	34204	32954	746	1926	0	416	1250	366	417	87%

The resulting table can be opened in a spreadsheet program.

Parsing takes advantage of key features of a document to divide it into its important parts, and assigns those parts to variables that can be used to output the data in a useful format:

BLAST Parser variables:
$hsp->hit->seq_id $hsp->subject->length	$hsp->score $hsp->bits
$hsp->P	$hsp->sbjctFrame
$hsp->match $hsp->length $hsp->percent $hsp->positive	$hsp->querySeq $hsp->homologySeq $hsp->sbjctSeq
$hsp->hit->start $hsp->hit->end	$hsp->query->start $hsp->query->end

Where they come from in the BLAST output:

>uvsX_Aeh1 RecA-like recomb. pro; DNA-ATPase[seq_id]
          Length = 411[subject length]

 Score =  439 bits (1130), Expect = e-124[Score and P]
 Identities = 203/357 (56%), Positives = 278/357 (77%)[match/length][positive]
 Frame = -2

Query: 23561 MSDLKSRLIKASTSKLTAELTASKFFNEKDVVRTKIPMMNIALSGEITGGMQSGLLILAG 23382 
             +  L S+L   S++K+++ L  SKFFN+KD VRT++P++N+A+SGE+ GG+  GL +LAG
Sbjct: 13    LGSLMSKLAGTSSNKMSSVLADSKFFNDKDCVRTRVPLLNLAMSGELDGGLTPGLTVLAG 72


Query: 23381 PSKSFKSNFGLTMVSSYMRQYPDAVCLFYDSEFGITPAYLRSMGVDPERVIHTPVQSLEQ 23202
             PSK FKSN  L  V++Y+R+YPDAVC+F+D+EFG TP Y  S GVD  RVIH P +++E+
Sbjct: 73    PSKHFKSNLSLVFVAAYLRKYPDAVCIFFDNEFGSTPGYFESQGVDISRVIHCPFKNIEE 132


Query: 23201 LRIDMVNQLDAIERGEKVVVFIDSLGNLASKKETEDALNEKVVSDMTRAKTMKSLFRIVT 23022
             L+ D+V +L+AIERG++V+VF+DS+GN ASKKE +DA++EK VSDMTRAK +KSL R++T
Sbjct: 133   LKFDIVKKLEAIERGDRVIVFVDSIGNAASKKEIDDAIDEKSVSDMTRAKQIKSLTRMMT 192


Query: 23021 PYFSTKNIPCIAINHTYETQEMFSKTVMGGGTGPMYSADTVFIIGKRQIKDGSDLQGYQF 22842
             PY +  +IP I + HTY+TQEM+SK V+ GGTG  YS+DTV IIG++Q KDG +L GY F
Sbjct: 193   PYLTVNDIPAIMVAHTYDTQEMYSKKVVSGGTGITYSSDTVIIIGRQQEKDGKELLGYNF 252


Query: 22841 VLNVEKSRTVKEKSKFFIDVKFDGGIDPYSGLLDMALELGFVVKPKNGWYAREFLDEETG 22662
             VLN+EKSR VKE+SK  ++V F GGI+ YSG+LD+ALE+GFVVKP NGW++R FLDEETG
Sbjct: 253   VLNMEKSRFVKEQSKLPLEVTFQGGINTYSGMLDIALEVGFVVKPSNGWFSRAFLDEETG 312


Query: 22661 EMIREEKSWRAKDTNCTTFWGPLFKHQPFRDAIKRAYQLGAIDSNEIVEAEVDELIN 22491
             E++ E++ WR  DTNC  FW P+F HQPF+ A    ++L ++   + V  EVDEL +
Sbjct: 313   ELVEEDRKWRRADTNCLEFWKPMFAHQPFKTACSDMFKLKSVAVKDEVFDEVDELFS 369


>60plus39_Aeh1 DNA topoisomerase sub.; DNAdep. ATPase; memb-assoc
          Length = 613

 Score =  412 bits (1058), Expect = e-115
 Identities = 214/471 (45%), Positives = 296/471 (62%)
 Frame = -1

Query: 5325 IKNEIKILSDIEHIKKRSGMYIGSSANETHERFMFGKWESVQYVPGLVKLIDEIIDNSVD 5146
            +  E K+LSD EH    + MYIGS++ ETH+  + GK+  + YVPGLVK+ DE+IDNSVD
Sbjct: 1    MSQEFKVLSDKEHCLINTDMYIGSTSTETHDVLVDGKFVQIAYVPGLVKITDEVIDNSVD 60

The same approach can be used to glean data from almost any useful source for further manipulation.
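The idea behind a parser can be sketched with awk as well. This is not the Perl parser described above, only an illustration of using key features of the text ("&gt;" replaced here by the literal ">" at the start of a hit, "Length =", "Score =", "Identities =") to fill variables. It assumes one alignment per hit, and the variable names and the choice of output columns are made up.

```shell
# Sample hit copied from the BLAST output above (annotations removed).
blast=$(mktemp)
cat > "$blast" <<'EOF'
>uvsX_Aeh1 RecA-like recomb. pro; DNA-ATPase
          Length = 411

 Score =  439 bits (1130), Expect = e-124
 Identities = 203/357 (56%), Positives = 278/357 (77%)
 Frame = -2
EOF

# Each pattern recognizes one key line and stores the part of interest.
# The Identities line is the last one needed, so it prints the row.
table=$(awk '
  /^>/            { hit = substr($1, 2) }       # name without the ">"
  /^ *Length =/   { len = $3 }
  /^ *Score =/    { bits = $3; expect = $NF }
  /^ *Identities/ { split($3, id, "/")          # "203/357" into 203 and 357
                    printf "%s\t%s\t%s\t%s\t%s\t%s\n",
                           hit, len, bits, expect, id[1], id[2] }
' "$blast")

printf '%s\n' "$table"
rm -f "$blast"
```

The output is one tab-separated row per hit (name, subject length, bit score, expect value, identities, aligned length), ready to open in a spreadsheet.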
