Often you can obtain large amounts of the data you want, but not in a format that is useful to you. How can you sort through what you have to get what you want?
Sometimes the answer is to use a tool that is already familiar to you. For example:
Recall the Report file that was generated when you did the sequence assembly project.
There was a lot of information embedded in the file (names of sequence reads, number of bases used from each, which contig each read was added to, etc.):
Scanned against repeated sequences:
Time to do prepass: 0:1:10
Preassembly Elapsed Time 0:0:0
Construction parameters:
Match Size 12
Maximum Added Gap Length in Contig 20
Maximum Added Gap Length in Sequence 20
Minimum Match Percentage 80
Maximum Register Shift Difference 20
Lastgroup Considered 2
Gap Penalty 0.20
Gap Length Penalty 0.70
Consensus Threshold 75
Entering 1762 sequences on 12/13/01, 11:34 AM
CREATING NEW contig 1: from 58-2-11-C07.r.1.scf(1>933)
ENTERING 58-2-2-C10.f.1.scf(4>942) in Contig 1: percent match 96
ENTERING 58-2-11-H02.r.1.scf(1>894) in Contig 1: percent match 96
CREATING NEW contig 2: from 58-2-6-F06.r.1.scf(1>795)
ENTERING 58-2-2-B07.f.1.scf(8>1014) in Contig 2: percent match 96
ENTERING 58-2-12-A01.r.1.scf(1>897) in Contig 1: percent match 96
ENTERING 58-2-10-H06.r.1.scf(1>783) in Contig 2: percent match 96
ENTERING 58-2-12-F02.r.1.scf(6>889) in Contig 1: percent match 96
ENTERING 58-2-3-D11.r.1.scf(1>693) in Contig 1: percent match 97
ENTERING 58-2-7-E01.r.1.scf(1>793) in Contig 1: percent match 97
CREATING NEW contig 3: from 58-2-12-B05.r.1.scf(1>952)
ENTERING 58-2-3-B07.f.1.scf(8>680) in Contig 1: percent match 99
CREATING NEW contig 4: from 58-2-10-C04.r.1.scf(1>769)
CREATING NEW contig 5: from 58-2-1-D09.r.1.scf(8>800)
ENTERING 58-2-11-E01.r.1.scf(1>964) in Contig 3: percent match 85
ENTERING 58-2-4-C01.r.1.scf(3>784) in Contig 1: percent match 95
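The read lines in a report like this follow a regular pattern, so they can also be pulled apart with a short script. The following is only a sketch in Python, assuming the two line shapes seen in the excerpt above ("CREATING NEW contig" and "ENTERING ... in Contig"); the function name parse_report and the dictionary keys are made up for illustration.

```python
import re

# A few lines copied from the report excerpt above.
REPORT = """\
Consensus Threshold 75
CREATING NEW contig 1: from 58-2-11-C07.r.1.scf(1>933)
ENTERING 58-2-2-C10.f.1.scf(4>942) in Contig 1: percent match 96
ENTERING 58-2-11-H02.r.1.scf(1>894) in Contig 1: percent match 96
CREATING NEW contig 2: from 58-2-6-F06.r.1.scf(1>795)
ENTERING 58-2-2-B07.f.1.scf(8>1014) in Contig 2: percent match 96
"""

# Read name followed by (first>last), the bases used from that read.
READ = r"(?P<read>\S+\.scf)\((?P<first>\d+)>(?P<last>\d+)\)"
NEW = re.compile(r"CREATING NEW contig (?P<contig>\d+): from " + READ)
ADD = re.compile(
    r"ENTERING " + READ + r" in Contig (?P<contig>\d+): percent match (?P<pct>\d+)"
)

def parse_report(text):
    """Return one dict per read line; all other lines are skipped."""
    rows = []
    for line in text.splitlines():
        m = NEW.match(line) or ADD.match(line)
        if m is None:
            continue  # parameters, timings, etc.
        d = m.groupdict()
        rows.append({
            "read": d["read"],
            "contig": int(d["contig"]),
            "first": int(d["first"]),
            "last": int(d["last"]),
            # The first read of a contig has no percent match.
            "percent_match": int(d["pct"]) if d.get("pct") else None,
        })
    return rows

rows = parse_report(REPORT)
for row in rows:
    print(row["read"], row["contig"], row["percent_match"], sep="\t")
```

Printing with tabs gives a file that opens directly as columns in a spreadsheet.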
In looking at the Strategy view of the entire dataset, I noticed that there were many plasmids that had been sequenced with one primer, but not the other, shown as black reads below:
How many were there and what are all their names so we can ask to have them sequenced?
First I modified a copy of the assembly report, for ease in reading into Excel. This is often not necessary, but in this case, I wanted to sort through the names of the reads to look for paired reads. This meant breaking up the name with tab characters:
For this I used a text editor's find and replace command to insert tabs (\t) within the file names. Now I can import the file into Excel (just open the file and Excel will guide you through) and analyze the names of the files. Different parts of the filename are now different fields (columns) within the file. Let's look at what's in a name:
58-2-4-C01.r.1.scf
All the files begin with "58-2-". That is the name of the sequencing project. Next is "4-", the microtiter plate number for the plasmid. "C01" is the well location on the plate. "r" is a reaction using the reverse primer; all plasmids should be sequenced once each with the forward (f) and reverse (r) primers. "1" is the read number, since some reactions are repeated.
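The same breakdown of a read name can be written as a pattern. This is a minimal Python sketch, assuming every name has exactly the shape described above; the field names (project, plate, row, column, primer, read) and the function split_read_name are labels chosen here, not anything produced by the assembly program.

```python
import re

NAME = re.compile(
    r"(?P<project>\d+-\d+)-"   # sequencing project, e.g. 58-2
    r"(?P<plate>\d+)-"         # microtiter plate number
    r"(?P<row>[A-Z])"          # well row letter
    r"(?P<column>\d+)"         # well column number
    r"\.(?P<primer>[fr])"      # f = forward primer, r = reverse primer
    r"\.(?P<read>\d+)"         # read number (repeated reactions get 2, 3, ...)
    r"\.scf"
)

def split_read_name(name):
    """Split a read file name into its fields; numbers become ints."""
    m = NAME.fullmatch(name)
    if m is None:
        raise ValueError(f"unexpected read name: {name!r}")
    fields = m.groupdict()
    for key in ("plate", "column", "read"):
        fields[key] = int(fields[key])
    return fields

fields = split_read_name("58-2-4-C01.r.1.scf")
print(fields)
```

A name that does not fit the pattern raises an error rather than being split silently into the wrong columns.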
So here is how it looks in Excel:
I have hidden the text in front of the name, and the names have been sorted. You can see that there were reads from both primers for 58-2-1A01, 1 forward and 2 reverse reads for 1A02, but only single reads from 1A03 and 1A04. These are the ones we want to find. We can put a conditional formula in column O to help us:
=IF(F7=F8,1,IF(F7=F6,1,0))
This means: if the value in cell F7 equals the one below it [F7=F8], then put a 1 in O7 [,1], OR if F7 equals the one above it [,IF(F7=F6,1], then put a 1 in O7. If neither of these is true, put a 0 in O7 [,0))].
So 1A03 and 1A04 should both have "0" in column O, as would all unique reads. But what if there is more than one read in one direction, but none in the other? The next 2 columns help there. Column P keeps track of the number of reads with the same name:
=IF(F11=F10,P10+1,1)
This conditional statement says: IF the value in cell F11 is the same as F10, then add 1 to the value of P10 (the cell above the formula); otherwise, put a 1 in P11. So the first read from a plasmid will have a 1 in this column, the next a 2, the third a 3, etc.
But what about the case of 2 reverse reads, but no forward reads? Column Q takes care of that:
If there are more than 2 reads, it compares the value in G to the cell 2 above it; if they match, it puts a 1 in column Q. If there is more than 1 read, it compares the value in G to the cell directly above and again puts a 1 in column Q if they match. If there is only one read, or it is the first read with that name, it puts a 0 in column Q.
Project | Plate | Row | Column | Primer | Read | O | P | Q | Diagnostics |
58-2- | 1 | A | 3 | r | 2 | 0 | 1 | 0 | unique |
58-2- | 1 | A | 4 | r | 2 | 0 | 1 | 0 | unique |
58-2- | 1 | A | 6 | f | 1 | 1 | 1 | 0 | normal |
58-2- | 1 | A | 6 | r | 2 | 1 | 2 | 0 | normal |
58-2- | 1 | A | 7 | f | 1 | 1 | 1 | 0 | extra OK |
58-2- | 1 | A | 7 | r | 1 | 1 | 2 | 0 | extra OK |
58-2- | 1 | A | 7 | r | 2 | 1 | 3 | 0 | extra OK |
58-2- | 1 | A | 8 | f | 1 | 1 | 1 | 0 | extra OK |
58-2- | 1 | A | 8 | r | 1 | 1 | 2 | 0 | extra OK |
58-2- | 1 | A | 8 | r | 2 | 1 | 3 | 0 | extra OK |
58-2- | 1 | A | 9 | f | 1 | 1 | 1 | 0 | normal |
58-2- | 1 | A | 9 | r | 2 | 1 | 2 | 0 | normal |
58-2- | 1 | A | 10 | r | 1 | 1 | 1 | 0 | 2 reads, 1 pr |
58-2- | 1 | A | 10 | r | 2 | 1 | 2 | 1 | 2 reads, 1 pr |
58-2- | 1 | A | 11 | r | 1 | 1 | 1 | 0 | 2 reads, 1 pr |
58-2- | 1 | A | 11 | r | 2 | 1 | 2 | 1 | 2 reads, 1 pr |
58-2- | 1 | A | 12 | r | 1 | 1 | 1 | 0 | 2 reads, 1 pr |
58-2- | 1 | A | 12 | r | 2 | 1 | 2 | 1 | 2 reads, 1 pr |
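The logic of the three diagnostic columns can also be written out as a short script. This is a sketch in Python of my reading of the column descriptions above, not a copy of the spreadsheet formulas. In particular, it assumes the column Q test is nested: the "2 above" comparison is used only when the count in P is above 2, and the "directly above" comparison only when P is exactly 2. The function name diagnostics is made up.

```python
def diagnostics(rows):
    """rows: (plasmid, primer) pairs, already sorted by plasmid then primer.

    Returns one (O, P, Q) tuple per row:
      O = 1 if a neighboring row has the same plasmid name, else 0
      P = running count of reads with the same plasmid name
      Q = 1 if the compared earlier read used the same primer, else 0
    """
    out = []
    for i, (plasmid, primer) in enumerate(rows):
        above = rows[i - 1][0] if i > 0 else None
        below = rows[i + 1][0] if i + 1 < len(rows) else None
        o = 1 if plasmid in (above, below) else 0
        p = out[-1][1] + 1 if plasmid == above else 1
        if p > 2:
            q = 1 if primer == rows[i - 2][1] else 0  # compare 2 rows up
        elif p > 1:
            q = 1 if primer == rows[i - 1][1] else 0  # compare 1 row up
        else:
            q = 0  # first (or only) read for this plasmid
        out.append((o, p, q))
    return out

# Plasmid and primer taken from some rows of the table above.
rows = [("1A03", "r"), ("1A04", "r"),
        ("1A06", "f"), ("1A06", "r"),
        ("1A07", "f"), ("1A07", "r"), ("1A07", "r"),
        ("1A10", "r"), ("1A10", "r")]
for row, diag in zip(rows, diagnostics(rows)):
    print(*row, *diag, sep="\t")
```

For these rows the sketch gives the same O, P and Q values as the table.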
You should now be able to sort the spreadsheet to identify the different classes of reads. However, sorting will change the values in the diagnostic columns, because the formulas refer to neighboring cells. Copy the diagnostic columns and paste their values elsewhere first, so that you can sort on the numbers without changing them.
You can also download new data from websites into Excel. This allows you to do calculations on the most up-to-date information available.
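The original question (which plasmids still need a read from the other primer?) can also be answered without any sorting, by grouping the reads by plasmid. This is a minimal Python sketch; the function name missing_primers is made up, and the (plasmid, primer) pairs are typed in by hand here rather than read from the report.

```python
from collections import defaultdict

def missing_primers(reads):
    """Map each plasmid lacking a forward or reverse read to the primers it lacks."""
    seen = defaultdict(set)
    for plasmid, primer in reads:
        seen[plasmid].add(primer)
    wanted = {"f", "r"}
    return {plasmid: sorted(wanted - primers)
            for plasmid, primers in seen.items()
            if primers != wanted}

# Plasmid and primer for some of the reads discussed above.
reads = [("1A03", "r"), ("1A04", "r"),
         ("1A06", "f"), ("1A06", "r"),
         ("1A07", "f"), ("1A07", "r"), ("1A07", "r"),
         ("1A10", "r"), ("1A10", "r")]

todo = missing_primers(reads)
for plasmid, primers in todo.items():
    print(plasmid, "needs", ",".join(primers))
```

The order of the input does not matter, so repeated reads with one primer (like 1A10) are caught in the same pass as the unique reads.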
There are several command-line programs available on any UNIX computer (like the rs6000; if you have a Tulane email address, you can log in to rs6000.tcs.tulane.edu with your mail account login). They are very fast at handling text files. Once you learn to use them, they can be very powerful and big timesavers. You can tie them together with scripts to perform multiple manipulations on multiple files. You can type "man program_name" to get help.
awk can search for text strings, compare values of fields, and output results however you tell it:
jnolan% awk '/HUMAN/ {print $2, $1}' cytcox.aln
---MALPLRPLTRGLASA--------AKGGHGGAG------------------ARTWRLL COXD_HUMAN
--MAVVGVSSVSRLLGRSRPQLGRPMSSGAHGEEGS-----------------ARMWKTL COXE_HUMAN
TFVLALPSVALCTFNSYL-HSGH--RERPE--------FRPYQHLRIRTKPYPWGDGNHT COXD_HUMAN
TFFVALPGVAVSMLNVYL-KSHHGEHERPE--------FIAYPHLRIRTKPFPWGDGNHT COXE_HUMAN
LFHNSHVNPLP-TGYEHP---- COXD_HUMAN
LFHNPHVNPLP-TGYEDE---- COXE_HUMAN
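The program searched for lines containing 'HUMAN' and printed the second field first, followed by the first field. In the next example, ^HUMAN^ELVIS^ is the shell's history substitution: it reruns the previous command with HUMAN replaced by ELVIS.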
[nolan:lecture/723/seqs] jnolan% ^HUMAN^ELVIS^
awk '/ELVIS/ {print $2, $1}' cytcox.aln
[nolan:lecture/723/seqs] jnolan% awk '!~/ELVIS/ {print $2, $1}' cytcox.aln
-no result!-
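For comparison, the pattern-then-print behavior of the awk commands above can be sketched in Python. This is only an illustration: the function name match_and_swap is made up, and the three alignment lines are copied from this page into a string, assuming the name and the sequence are separated by spaces as in cytcox.aln.

```python
import re

# Three lines in the layout of cytcox.aln: name, whitespace, aligned sequence.
ALN = """\
COXD_HUMAN      ---MALPLRPLTRGLASA--------AKGGHGGAG------------------ARTWRLL
COXE_HUMAN      --MAVVGVSSVSRLLGRSRPQLGRPMSSGAHGEEGS-----------------ARMWKTL
COXE_YEAST      ---MFR---QCAKRYASSLPPNALKPAFGPPDKVAAQKFKESLMATEKHAKDTSNMWVKI
"""

def match_and_swap(text, pattern):
    """Rough equivalent of: awk '/pattern/ {print $2, $1}'"""
    out = []
    for line in text.splitlines():
        if re.search(pattern, line):
            fields = line.split()  # like awk, split on runs of whitespace
            first = fields[0] if len(fields) > 0 else ""
            second = fields[1] if len(fields) > 1 else ""
            out.append(f"{second} {first}")
    return out

human = match_and_swap(ALN, "HUMAN")
elvis = match_and_swap(ALN, "ELVIS")
print("\n".join(human))
print(len(elvis), "lines matched ELVIS")
```

A pattern that matches nothing simply gives an empty list, which is the "no result" case.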
awk '/COXE_HUMAN/ {print $2}' cytcox.aln
--MAVVGVSSVSRLLGRSRPQLGRPMSSGAHGEEGS-----------------ARMWKTL
TFFVALPGVAVSMLNVYL-KSHHGEHERPE--------FIAYPHLRIRTKPFPWGDGNHT
LFHNPHVNPLP-TGYEDE----
This returns just the sequence for COXE_HUMAN, which is a good way to extract sequences from an alignment.
We can take this result and pass it through a second program, sed, which is a stream editor:
awk '/COXE_HUMAN/ {print $2}' cytcox.aln | sed 's/-//g'
MAVVGVSSVSRLLGRSRPQLGRPMSSGAHGEEGSARMWKTL
TFFVALPGVAVSMLNVYLKSHHGEHERPEFIAYPHLRIRTKPFPWGDGNHT
LFHNPHVNPLPTGYEDE
We repeated the awk command, but used the pipe "|" to pass the result to sed without ever seeing the first result. Then we used sed to do a global search and replace to get rid of all "-".
Now we have our old sequence back!
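The two steps of that pipeline (pick the lines for one sequence, then remove the gap characters) can be sketched in Python as well. Two things differ from the awk and sed version and are my own choices: the name is compared to the whole first field instead of being searched for anywhere in the line, and the pieces are joined into one string. The function name extract_sequence is made up.

```python
# The COXE_HUMAN lines from the alignment, plus one other line for contrast.
ALN = """\
COXD_HUMAN      ---MALPLRPLTRGLASA--------AKGGHGGAG------------------ARTWRLL
COXE_HUMAN      --MAVVGVSSVSRLLGRSRPQLGRPMSSGAHGEEGS-----------------ARMWKTL
COXE_HUMAN      TFFVALPGVAVSMLNVYL-KSHHGEHERPE--------FIAYPHLRIRTKPFPWGDGNHT
COXE_HUMAN      LFHNPHVNPLP-TGYEDE----
"""

def extract_sequence(text, name):
    """Collect the aligned pieces for one name and strip the '-' gaps."""
    pieces = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == name:
            pieces.append(fields[1].replace("-", ""))  # the sed 's/-//g' step
    return "".join(pieces)

seq = extract_sequence(ALN, "COXE_HUMAN")
print(seq)
```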
Other useful UNIX commands:
comm file1 file2 - compares file1 with file2 (both files must be sorted). Output is in 3 columns: lines unique to file1, lines unique to file2, lines common to both.
diff file1 file2 - compares file1 with file2. Output has brackets indicating where the differences are in the files:
diff cytcox.aln cytcox2.aln
16,17d15
< COXE_YEAST ---MFR---QCAKRYASSLPPNALKPAFGPPDKVAAQKFKESLMATEKHAKDTSNMWVKI
< COXE_SCHPO MSMMNRNIGFLSRTLKTSVPKRAGLLSFRAYSNEAKVNWLEEVQAEEEHAKRSSEFWKKV
32,33d29
< COXE_YEAST SVWVALPAIALTAVNTYFVEKEHAEHREHLKHVPDSEWPRDYEFMNIRSKPFFWGDGDKT
< COXE_SCHPO TYYIGGPALILASANAYYIYCKHQEHAKHVEDTDPG-----YSFENLRFKKYPWGDGSKT
48,49d43
< COXE_YEAST LFWNPVVNRHIEHDD-------
< COXE_SCHPO LFWNDKVN-HLKKDDE------
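The three-column idea behind comm is easy to sketch in Python. This is an illustration only: the function comm below is my own and, unlike the real command, it does not need sorted input and it ignores how many times a line is repeated. The two name lists stand in for the sequence names in two alignment files like the ones compared above.

```python
def comm(lines1, lines2):
    """Return (only in lines1, only in lines2, in both), keeping input order."""
    in1, in2 = set(lines1), set(lines2)
    only1 = [line for line in lines1 if line not in in2]
    only2 = [line for line in lines2 if line not in in1]
    both = [line for line in lines1 if line in in2]
    return only1, only2, both

# Hypothetical name lists: the second file lacks the two fungal sequences.
names1 = ["COXD_HUMAN", "COXE_HUMAN", "COXE_SCHPO", "COXE_YEAST"]
names2 = ["COXD_HUMAN", "COXE_HUMAN"]

only1, only2, both = comm(names1, names2)
print("only in file1:", only1)
print("only in file2:", only2)
print("in both:", both)
```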
cut filename - extracts columns of text using character numbers (-c) or field numbers (-f)
cut -c17-76 cytcox.aln
multiple sequence alignment
-MAMSPAATVARRRLAAA--------SQGSH-EGG------------------ARTWKIL
--MASPASMAARRVLSAA--------SHAGH-EGGS-----------------ARTWKIL
---MALPLKSLSRGLASA--------AKGDHGGTG------------------ARTWRFL
---MALPLRPLTRGLASA--------AKGGHGGAG------------------ARTWRLL
------PLKVLSRSMASA--------SKGDHGGAG------------------ANTWRLL
---MALPLKVLSRSMASA--------AKGDHGGAG------------------ANTWRLL
MASAVLSASRVSRPLGRALPGLRRPMSSGAHGEEGS-----------------ARMWKAL
MASAVLSASRVSGLLGRALPRVGRPMSSGAHGEEGS-----------------ARIWKAL
--------------------------SSGAHGEEGS-----------------ARMWKAL
--MAVVGVSSVSRLLGRSRPQLGRPMSSGAHGEEGS-----------------ARMWKTL
--MAAAAWSRVSQLLGRSRLQVGRPMSSGAHGEEGS-----------------ARMWKAL
---MNRLAQPATRSVVKTFQRKSSGSFYGSNNVEGFKESYVTPLKQAHNA---SETWKKI
---MFR---QCAKRYASSLPPNALKPAFGPPDKVAAQKFKESLMATEKHAKDTSNMWVKI
MSMMNRNIGFLSRTLKTSVPKRAGLLSFRAYSNEAKVNWLEEVQAEEEHAKRSSEFWKKV
. :. * :
SFVLALPGVGVCMANAYM-KMQAHSHDPPE--------FVPYPHLRIRTKPWPWGDGNHS
SFVLALPGVAVCIANAYM-KMQQHSHEPPE--------FVAYSHLRIRTKKWPWGDGNHS
TFGLALPSVALCTLNSWL-HSGH--RERPA--------FIPYHHLRIRTKPFSWGDGNHT
TFVLALPSVALCTFNSYL-HSGH--RERPE--------FRPYQHLRIRTKPYPWGDGNHT
TFVLALPSVALCSLNCWM-HAGH--HERPE--------FIPYHHLRIRTKPFSWGDGNHT
TFVLALPGVALCSLNCWM-HAGH--HERPE--------FIPYHHLRIRTKPFAWGDGNHT
TYFVALPGVGVSMLNVFL-KSRHEEHERPP--------FVAYPHLRIRTKPFPWGDGNHT
TYFVALPGVGVSMLNVFL-KSRHEEHERPE--------FVAYPHLRIRTKPFPWGDGNHT
TLFVALPGVGVSMLNVFM-KSHHGEEERPE--------FVAYPHLRIRSKPFPWGDGNHT
TFFVALPGVAVSMLNVYL-KSHHGEHERPE--------FIAYPHLRIRTKPFPWGDGNHT
TYFVALPGVGVSMLNVYL-KSHHEEHERPE--------FIAYPHLRIRSKPFPWGDGNHT
FFIASIPCLALTMYAAFKDHKKHMSHERPE--------HVEYAFLNVRNKPFPWSDGNHS
SVWVALPAIALTAVNTYFVEKEHAEHREHLKHVPDSEWPRDYEFMNIRSKPFFWGDGDKT
TYYIGGPALILASANAYYIYCKHQEHAKHVEDTDPG-----YSFENLRFKKYPWGDGSKT
. * : : : . * . .:* * : *.**.::
LFHNAHTNALP-TGYEGPHH--
LFHNPHENALP-EGYEGPRH--
FFHNPRVNPLP-TGYEKP----
LFHNSHVNPLP-TGYEHP----
LFHNPHVNPLP-TGYEQP----
LFHNPHVNPLP-TGYEHP----
LFHNPHVNPLP-TGYEDE----
LFHNPHMNPLP-TGYEDE----
LFHNPHVNPLP-TGYEDE----
LFHNPHVNPLP-TGYEDE----
LFHNPHVNPLP-TGYEDV----
LFHNKAEQFVPGVGFEADREKH
LFWNPVVNRHIEHDD-------
LFWNDKVN-HLKKDDE------
:* * : .
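Cutting by character position is just string slicing. Here is a minimal Python sketch of the -c form of cut; the function name cut_chars is made up, and the test line is built on the assumption that names are padded to 16 characters so that the sequence starts at character 17, as in the command above.

```python
def cut_chars(lines, first, last):
    """Like cut -cFIRST-LAST: keep characters first..last (1-based, inclusive)."""
    return [line[first - 1:last] for line in lines]

name = "COXE_HUMAN"
seq = "--MAVVGVSSVSRLLGRSRPQLGRPMSSGAHGEEGS-----------------ARMWKTL"
line = f"{name:<16}{seq}"  # name padded with spaces to 16 characters

sequences = cut_chars([line], 17, 76)
names = cut_chars([line], 1, 9)
print(sequences[0])
print(names[0])
```

Lines shorter than the requested range just give a shorter (or empty) result, which is why the blank lines of the alignment stay blank.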
paste file1 file2 - puts 2 files together side by side:
paste -d"Z" names seqs
CLUSTAL WZ multiple sequence alignment
Z
Z
COXE_CYPCZ-MAMSPAATVARRRLAAA--------SQGSH-EGG------------------ARTWKIL
COXE_ONCMZ--MASPASMAARRVLSAA--------SHAGH-EGGS-----------------ARTWKIL
COXD_BOVIZ---MALPLKSLSRGLASA--------AKGDHGGTG------------------ARTWRFL
COXD_HUMAZ---MALPLRPLTRGLASA--------AKGGHGGAG------------------ARTWRLL
COXD_RAT Z------PLKVLSRSMASA--------SKGDHGGAG------------------ANTWRLL
COXD_MOUSZ---MALPLKVLSRSMASA--------AKGDHGGAG------------------ANTWRLL
COXE_MOUSZMASAVLSASRVSRPLGRALPGLRRPMSSGAHGEEGS-----------------ARMWKAL
COXE_RAT ZMASAVLSASRVSGLLGRALPRVGRPMSSGAHGEEGS-----------------ARIWKAL
COXE_BOVIZ--------------------------SSGAHGEEGS-----------------ARMWKAL
COXE_HUMAZ--MAVVGVSSVSRLLGRSRPQLGRPMSSGAHGEEGS-----------------ARMWKTL
COXE_RABIZ--MAAAAWSRVSQLLGRSRLQVGRPMSSGAHGEEGS-----------------ARMWKAL
COXE_CAEEZ---MNRLAQPATRSVVKTFQRKSSGSFYGSNNVEGFKESYVTPLKQAHNA---SETWKKI
COXE_YEASZ---MFR---QCAKRYASSLPPNALKPAFGPPDKVAAQKFKESLMATEKHAKDTSNMWVKI
COXE_SCHPZMSMMNRNIGFLSRTLKTSVPKRAGLLSFRAYSNEAKVNWLEEVQAEEEHAKRSSEFWKKV
Z . :. * :
Z
COXE_CYPCZSFVLALPGVGVCMANAYM-KMQAHSHDPPE--------FVPYPHLRIRTKPWPWGDGNHS
COXE_ONCMZSFVLALPGVAVCIANAYM-KMQQHSHEPPE--------FVAYSHLRIRTKKWPWGDGNHS
COXD_BOVIZTFGLALPSVALCTLNSWL-HSGH--RERPA--------FIPYHHLRIRTKPFSWGDGNHT
COXD_HUMAZTFVLALPSVALCTFNSYL-HSGH--RERPE--------FRPYQHLRIRTKPYPWGDGNHT
COXD_RAT ZTFVLALPSVALCSLNCWM-HAGH--HERPE--------FIPYHHLRIRTKPFSWGDGNHT
COXD_MOUSZTFVLALPGVALCSLNCWM-HAGH--HERPE--------FIPYHHLRIRTKPFAWGDGNHT
COXE_MOUSZTYFVALPGVGVSMLNVFL-KSRHEEHERPP--------FVAYPHLRIRTKPFPWGDGNHT
COXE_RAT ZTYFVALPGVGVSMLNVFL-KSRHEEHERPE--------FVAYPHLRIRTKPFPWGDGNHT
COXE_BOVIZTLFVALPGVGVSMLNVFM-KSHHGEEERPE--------FVAYPHLRIRSKPFPWGDGNHT
COXE_HUMAZTFFVALPGVAVSMLNVYL-KSHHGEHERPE--------FIAYPHLRIRTKPFPWGDGNHT
COXE_RABIZTYFVALPGVGVSMLNVYL-KSHHEEHERPE--------FIAYPHLRIRSKPFPWGDGNHT
COXE_CAEEZFFIASIPCLALTMYAAFKDHKKHMSHERPE--------HVEYAFLNVRNKPFPWSDGNHS
COXE_YEASZSVWVALPAIALTAVNTYFVEKEHAEHREHLKHVPDSEWPRDYEFMNIRSKPFFWGDGDKT
COXE_SCHPZTYYIGGPALILASANAYYIYCKHQEHAKHVEDTDPG-----YSFENLRFKKYPWGDGSKT
Z . * : : : . * . .:* * : *.**.::
Z
COXE_CYPCZLFHNAHTNALP-TGYEGPHH--
COXE_ONCMZLFHNPHENALP-EGYEGPRH--
COXD_BOVIZFFHNPRVNPLP-TGYEKP----
COXD_HUMAZLFHNSHVNPLP-TGYEHP----
COXD_RAT ZLFHNPHVNPLP-TGYEQP----
COXD_MOUSZLFHNPHVNPLP-TGYEHP----
COXE_MOUSZLFHNPHVNPLP-TGYEDE----
COXE_RAT ZLFHNPHMNPLP-TGYEDE----
COXE_BOVIZLFHNPHVNPLP-TGYEDE----
COXE_HUMAZLFHNPHVNPLP-TGYEDE----
COXE_RABIZLFHNPHVNPLP-TGYEDV----
COXE_CAEEZLFHNKAEQFVPGVGFEADREKH
COXE_YEASZLFWNPVVNRHIEHDD-------
COXE_SCHPZLFWNDKVN-HLKKDDE------
Z:* * : .
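Joining two files line by line can be sketched in Python with zip_longest. The function name paste and the short example lists are mine; a tab is used as the default delimiter, and an empty string stands in for the missing lines when one input is shorter.

```python
from itertools import zip_longest

def paste(lines1, lines2, delimiter="\t"):
    """Join corresponding lines of two inputs with a delimiter."""
    return [delimiter.join(pair)
            for pair in zip_longest(lines1, lines2, fillvalue="")]

# Shortened stand-ins for the names and seqs files used above.
names = ["COXD_HUMA", "COXE_HUMA", ""]
seqs = ["---MALPL", "--MAVVGV", ""]

for line in paste(names, seqs, "Z"):
    print(line)
```

As in the output above, a pair of empty lines comes out as a line holding only the delimiter.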
Perl is one of the most powerful data mining and manipulation tools. It is fairly easy to learn, and once you do, you can do just about anything. There is a site with hundreds of bioinformatics tools at bioperl.org. There is an excellent text available from O'Reilly, titled "Beginning Perl for Bioinformatics". It has some handy exercises you can download to lead you through the programming steps in learning Perl.
Perl is very useful for analyzing sequences and parsing results.
Parsing: extracting data from a result in a useful manner. For example:
BLAST results:
A BLAST report holds lots of info, but it is hard to compare and compile all the results from one search. Parsers search through the file and organize it into fields:
Query Seq Name | Start Subj | End Subj | Query Start | Query End | Score Bits | Score 2 | Expect | Length | Overlap Length | Identities | Total | % Identities |
Contig15 64577 bp | 1 | 1032 | 48366 | 45271 | 1702 | 4408 | 0 | 1032 | 3095 | 809 | 1032 | 78% |
Contig15 64577 bp | 1 | 658 | 50333 | 48372 | 1149 | 2972 | 0 | 660 | 1961 | 569 | 658 | 86% |
Contig15 64577 bp | 1 | 610 | 34786 | 32954 | 1117 | 2890 | 0 | 610 | 1832 | 544 | 611 | 89% |
Contig15 64577 bp | 1 | 659 | 32920 | 30944 | 1089 | 2817 | 0 | 659 | 1976 | 533 | 659 | 80% |
Contig15 64577 bp | 2 | 511 | 19393 | 17861 | 883 | 2281 | | 587 | 1532 | 427 | 511 | 83% |
Contig15 64577 bp | 1 | 602 | 43345 | 41534 | 969 | 2506 | 0 | 602 | 1811 | 462 | 604 | 76% |
Contig15 64577 bp | 2 | 521 | 26507 | 24948 | 957 | 2475 | 0 | 521 | 1559 | 485 | 520 | 93% |
Contig15 64577 bp | 1 | 523 | 34525 | 32954 | 957 | 2474 | 0 | 523 | 1571 | 470 | 524 | 89% |
Contig15 64577 bp | 1 | 524 | 30249 | 28681 | 932 | 2408 | 0 | 524 | 1568 | 450 | 524 | 85% |
Contig15 64577 bp | 1 | 575 | 52890 | 51160 | 925 | 2391 | 0 | 575 | 1730 | 450 | 578 | 77% |
Contig15 64577 bp | 1 | 505 | 34471 | 32954 | 922 | 2384 | 0 | 505 | 1517 | 453 | 506 | 89% |
Contig15 64577 bp | 40 | 427 | 24406 | 23243 | 720 | 1859 | | 427 | 1163 | 365 | 388 | 94% |
Contig15 64577 bp | 151 | 401 | 25787 | 25014 | 74.7 | 182 | 2.00E-14 | | 773 | 78 | 273 | 28% |
Contig15 64577 bp | 1 | 416 | 34204 | 32954 | 746 | 1926 | 0 | 416 | 1250 | 366 | 417 | 87% |
The parsed results can be used in a spreadsheet program.
Parsing takes advantage of key features of the document that can be used to divide it into its important parts, and assigns those parts to variables for outputting data in a useful format:
BLAST Parser variables:
$hsp->hit->seq_id $hsp->subject->length |
$hsp->score $hsp->bits |
$hsp->P | $hsp->sbjctFrame |
$hsp->match $hsp->length $hsp->percent $hsp->positive |
$hsp->querySeq $hsp->homologySeq $hsp->sbjctSeq |
$hsp->hit->start $hsp->hit->end |
$hsp->query->start $hsp->query->end |
>uvsX_Aeh1 RecA-like recomb. pro; DNA-ATPase [seq_id]
Length = 411 [subject length]
Score = 439 bits (1130), Expect = e-124 [Score and P]
Identities = 203/357 (56%), Positives = 278/357 (77%) [match/length] [positive]
Frame = -2
Query: 23561 MSDLKSRLIKASTSKLTAELTASKFFNEKDVVRTKIPMMNIALSGEITGGMQSGLLILAG 23382
+ L S+L S++K+++ L SKFFN+KD VRT++P++N+A+SGE+ GG+ GL +LAG
Sbjct: 13 LGSLMSKLAGTSSNKMSSVLADSKFFNDKDCVRTRVPLLNLAMSGELDGGLTPGLTVLAG 72
Query: 23381 PSKSFKSNFGLTMVSSYMRQYPDAVCLFYDSEFGITPAYLRSMGVDPERVIHTPVQSLEQ 23202
PSK FKSN L V++Y+R+YPDAVC+F+D+EFG TP Y S GVD RVIH P +++E+
Sbjct: 73 PSKHFKSNLSLVFVAAYLRKYPDAVCIFFDNEFGSTPGYFESQGVDISRVIHCPFKNIEE 132
Query: 23201 LRIDMVNQLDAIERGEKVVVFIDSLGNLASKKETEDALNEKVVSDMTRAKTMKSLFRIVT 23022
L+ D+V +L+AIERG++V+VF+DS+GN ASKKE +DA++EK VSDMTRAK +KSL R++T
Sbjct: 133 LKFDIVKKLEAIERGDRVIVFVDSIGNAASKKEIDDAIDEKSVSDMTRAKQIKSLTRMMT 192
Query: 23021 PYFSTKNIPCIAINHTYETQEMFSKTVMGGGTGPMYSADTVFIIGKRQIKDGSDLQGYQF 22842
PY + +IP I + HTY+TQEM+SK V+ GGTG YS+DTV IIG++Q KDG +L GY F
Sbjct: 193 PYLTVNDIPAIMVAHTYDTQEMYSKKVVSGGTGITYSSDTVIIIGRQQEKDGKELLGYNF 252
Query: 22841 VLNVEKSRTVKEKSKFFIDVKFDGGIDPYSGLLDMALELGFVVKPKNGWYAREFLDEETG 22662
VLN+EKSR VKE+SK ++V F GGI+ YSG+LD+ALE+GFVVKP NGW++R FLDEETG
Sbjct: 253 VLNMEKSRFVKEQSKLPLEVTFQGGINTYSGMLDIALEVGFVVKPSNGWFSRAFLDEETG 312
Query: 22661 EMIREEKSWRAKDTNCTTFWGPLFKHQPFRDAIKRAYQLGAIDSNEIVEAEVDELIN 22491
E++ E++ WR DTNC FW P+F HQPF+ A ++L ++ + V EVDEL +
Sbjct: 313 ELVEEDRKWRRADTNCLEFWKPMFAHQPFKTACSDMFKLKSVAVKDEVFDEVDELFS 369
>60plus39_Aeh1 DNA topoisomerase sub.; DNAdep. ATPase; memb-assoc
Length = 613
Score = 412 bits (1058), Expect = e-115
Identities = 214/471 (45%), Positives = 296/471 (62%)
Frame = -1
Query: 5325 IKNEIKILSDIEHIKKRSGMYIGSSANETHERFMFGKWESVQYVPGLVKLIDEIIDNSVD 5146
+ E K+LSD EH + MYIGS++ ETH+ + GK+ + YVPGLVK+ DE+IDNSVD
Sbjct: 1 MSQEFKVLSDKEHCLINTDMYIGSTSTETHDVLVDGKFVQIAYVPGLVKITDEVIDNSVD 60
Parsers can be used to glean data from almost any useful source for further manipulation.
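To show what a parser does with those key features, here is a small sketch in Python that pulls the summary values out of one hit header like the first one above. It is an illustration with regular expressions, not the Perl parser whose variables are listed earlier; the function name parse_hit and the dictionary keys are made up, and the exact spacing of the header text is an assumption.

```python
import re

# Header of the first hit shown above (spacing is assumed).
HIT = """\
>uvsX_Aeh1 RecA-like recomb. pro; DNA-ATPase
          Length = 411

 Score =  439 bits (1130), Expect = e-124
 Identities = 203/357 (56%), Positives = 278/357 (77%)
 Frame = -2
"""

def parse_hit(text):
    """Extract the summary fields of one BLAST hit into a dict."""
    def grab(pattern):
        m = re.search(pattern, text, re.MULTILINE)
        if m is None:
            raise ValueError(f"pattern not found: {pattern}")
        return m.groups()

    (seq_id,) = grab(r"^>(\S+)")
    (length,) = grab(r"Length\s*=\s*(\d+)")
    bits, score, expect = grab(
        r"Score\s*=\s*([\d.]+) bits \((\d+)\), Expect\s*=\s*(\S+)")
    match, aligned, percent = grab(r"Identities\s*=\s*(\d+)/(\d+) \((\d+)%\)")
    (positive,) = grab(r"Positives\s*=\s*(\d+)/")
    (frame,) = grab(r"Frame\s*=\s*([+-]?\d+)")
    return {
        "seq_id": seq_id,
        "subject_length": int(length),
        "bits": float(bits),
        "score": int(score),
        "expect": expect,  # kept as text: BLAST writes values like e-124
        "match": int(match),
        "length": int(aligned),
        "percent": int(percent),
        "positive": int(positive),
        "frame": int(frame),
    }

hit = parse_hit(HIT)
print("\t".join(str(value) for value in hit.values()))
```

Each hit becomes one tab-separated line, which is the kind of table that opens directly in a spreadsheet.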