Often you can obtain large amounts of the data you want, but not in a format that is useful to you. How can you sort through what you have to get what you want?
Sometimes the answer is to use a tool that is already familiar to you. For example:
Here is a Report file that was generated when I did a sequence assembly project with another program.
There was a lot of information embedded in the file (names of sequence reads, number of bases used from each, what Contig it was added to, etc.):
Scanned against repeated sequences:
Time to do prepass: 0:1:10
Preassembly Elapsed Time 0:0:0
Construction parameters:
Match Size 12
Maximum Added Gap Length in Contig 20
Maximum Added Gap Length in Sequence 20
Minimum Match Percentage 80
Maximum Register Shift Difference 20
Lastgroup Considered 2
Gap Penalty 0.20
Gap Length Penalty 0.70
Consensus Threshold 75
Entering 1762 sequences on 12/13/01, 11:34 AM
CREATING NEW contig 1: from 58-2-11-C07.r.1.scf(1>933)
ENTERING 58-2-2-C10.f.1.scf(4>942) in Contig 1: percent match 96
ENTERING 58-2-11-H02.r.1.scf(1>894) in Contig 1: percent match 96
CREATING NEW contig 2: from 58-2-6-F06.r.1.scf(1>795)
ENTERING 58-2-2-B07.f.1.scf(8>1014) in Contig 2: percent match 96
ENTERING 58-2-12-A01.r.1.scf(1>897) in Contig 1: percent match 96
ENTERING 58-2-10-H06.r.1.scf(1>783) in Contig 2: percent match 96
ENTERING 58-2-12-F02.r.1.scf(6>889) in Contig 1: percent match 96
ENTERING 58-2-3-D11.r.1.scf(1>693) in Contig 1: percent match 97
ENTERING 58-2-7-E01.r.1.scf(1>793) in Contig 1: percent match 97
CREATING NEW contig 3: from 58-2-12-B05.r.1.scf(1>952)
ENTERING 58-2-3-B07.f.1.scf(8>680) in Contig 1: percent match 99
CREATING NEW contig 4: from 58-2-10-C04.r.1.scf(1>769)
CREATING NEW contig 5: from 58-2-1-D09.r.1.scf(8>800)
ENTERING 58-2-11-E01.r.1.scf(1>964) in Contig 3: percent match 85
ENTERING 58-2-4-C01.r.1.scf(3>784) in Contig 1: percent match 95
In looking at the Strategy view of the entire data set, I noticed that there were many plasmids that had been sequenced with one primer, but not the other, shown as black reads below:
How many were there and what are all their names so we can ask to have them sequenced?
First I modified a copy of the assembly report, for ease in reading into Excel. This is often not necessary, but in this case, I wanted to sort through the names of the reads to look for paired reads. This meant breaking up the name with tab characters:
For this I used a text editor's find and replace command to insert tabs (\t) within the file names. Now I can import the file into Excel (just open the file and Excel will guide you through) and analyze the names of the files. Different parts of the filename are now different fields (columns) within the file. Let's look at what's in a name:
58-2-4-C01.r.1.scf
All the files begin with "58-2-". That is the name of the sequencing project. Next is "4-", the microtiter plate number for the plasmid. "C01" is the well location on the plate. "r" is a reaction using the reverse primer; all plasmids should be sequenced once each with the forward (f) and reverse (r) primers. "1" is the read number, since some reactions are repeated.
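The same split can be done in a script instead of with find and replace. The following is a minimal Python sketch, assuming every name follows the pattern of the example above; the function name `parse_read_name` and the field names are made up for illustration.

```python
import re

# Pattern assumed from the example name "58-2-4-C01.r.1.scf":
# project "58-2", plate number, well (row letter + 2-digit column),
# primer (f or r), read number.
READ_NAME = re.compile(
    r"^(?P<project>\d+-\d+)-(?P<plate>\d+)-(?P<row>[A-H])(?P<col>\d{2})"
    r"\.(?P<primer>[fr])\.(?P<read>\d+)\.scf$"
)

def parse_read_name(name):
    """Split a read file name into its fields; return None if it does not match."""
    m = READ_NAME.match(name)
    if m is None:
        return None
    fields = m.groupdict()
    # Convert the numeric parts so they sort as numbers, not as text.
    for key in ("plate", "col", "read"):
        fields[key] = int(fields[key])
    return fields

print(parse_read_name("58-2-4-C01.r.1.scf"))
```

Names that do not fit the assumed pattern come back as None, so they can be listed and checked by hand.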
So here is how it looks in Excel: Here's a link to the file
I have hidden the text in front of the name and the names have been sorted. You see that there were reads from both primers for 58-2-1A01, 1 forward and 2 reverse reads for 1A02, but only single reads from 1A03 and 1A04. These are the ones we want to find. We can put a conditional formula in column O to help us:
The formula in O7, =IF(F7=F8,1,IF(F7=F6,1,0)), means "if the value in cell F7 equals the one below it [F7=F8] then put a 1 in O7 [,1] OR if F7 equals the one above it [,IF(F7=F6,1] then put a 1 in O7. If neither of these is true, put a 0 in O7 [,0))]".
So 1A03 and 1A04 should both have "0" in column O, as would all unique reads. But what if there are more than 1 read in one direction, but none in the other? The next 2 columns help there. Column P keeps track of the number of reads with the same name:
This conditional statement, =IF(F11=F10,P10+1,1) in cell P11, says: IF the value in cell F11 is the same as F10, then add 1 to the value of P10 (the cell above the formula), otherwise put a 1 in P11. So the first read from a plasmid will have a 1 in this column, the next a 2, the third a 3, etc.
But what about the case of 2 reverse reads, but no forward reads? Column Q takes care of that:
If there are more than 2 reads, it compares the value in G to the cell 2 above it; if they match, it puts a 1 in column Q. If there is more than 1 read, it compares the value of G to the cell directly above and again puts a 1 here if they match. If there is only one read, or it is the first read with that name, then it puts a 0 in column Q.
Read name | diagnostics | ||||||||
58-2- | 1 | A | 3 | r | 2 | 0 | 1 | 0 | unique |
58-2- | 1 | A | 4 | r | 2 | 0 | 1 | 0 | unique |
58-2- | 1 | A | 6 | f | 1 | 1 | 1 | 0 | normal |
58-2- | 1 | A | 6 | r | 2 | 1 | 2 | 0 | normal |
58-2- | 1 | A | 7 | f | 1 | 1 | 1 | 0 | extra OK |
58-2- | 1 | A | 7 | r | 1 | 1 | 2 | 0 | extra OK |
58-2- | 1 | A | 7 | r | 2 | 1 | 3 | 0 | extra OK |
58-2- | 1 | A | 8 | f | 1 | 1 | 1 | 0 | extra OK |
58-2- | 1 | A | 8 | r | 1 | 1 | 2 | 0 | extra OK |
58-2- | 1 | A | 8 | r | 2 | 1 | 3 | 0 | extra OK |
58-2- | 1 | A | 9 | f | 1 | 1 | 1 | 0 | normal |
58-2- | 1 | A | 9 | r | 2 | 1 | 2 | 0 | normal |
58-2- | 1 | A | 10 | r | 1 | 1 | 1 | 0 | 2 reads, 1 pr. |
58-2- | 1 | A | 10 | r | 2 | 1 | 2 | 1 | 2 reads, 1 pr |
58-2- | 1 | A | 11 | r | 1 | 1 | 1 | 0 | 2 reads, 1 pr |
58-2- | 1 | A | 11 | r | 2 | 1 | 2 | 1 | 2 reads, 1 pr |
58-2- | 1 | A | 12 | r | 1 | 1 | 1 | 0 | 2 reads, 1 pr |
58-2- | 1 | A | 12 | r | 2 | 1 | 2 | 1 | 2 reads, 1 pr |
You should now be able to sort the spreadsheet to identify different classes of reads. Sorting will change the values in the diagnostic columns, however, because the formulas refer to neighboring rows. Copy the columns and paste them elsewhere as values first, so you can sort the numbers without changing them.
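The question the diagnostic columns answer (which plasmids have reads from only one primer?) can also be asked directly in a script. This is a minimal Python sketch, not the spreadsheet method itself: it groups reads by plasmid instead of comparing neighboring rows, so it does not depend on sort order. The function name `unpaired_plasmids` and the sample rows are illustrative assumptions.

```python
from collections import defaultdict

def unpaired_plasmids(reads):
    """Return plasmids sequenced with only one primer.

    `reads` is an iterable of (plasmid, primer) pairs, e.g. ("1A03", "r").
    The result maps each such plasmid to the primer that is still missing.
    """
    primers_seen = defaultdict(set)
    for plasmid, primer in reads:
        primers_seen[plasmid].add(primer)
    missing = {}
    for plasmid, primers in primers_seen.items():
        if primers == {"f"}:
            missing[plasmid] = "r"
        elif primers == {"r"}:
            missing[plasmid] = "f"
    return missing

# Example rows in the spirit of the table above: (plasmid, primer).
reads = [
    ("1A03", "r"), ("1A04", "r"),
    ("1A06", "f"), ("1A06", "r"),
    ("1A07", "f"), ("1A07", "r"), ("1A07", "r"),
    ("1A10", "r"), ("1A10", "r"),
]
for plasmid, primer in sorted(unpaired_plasmids(reads).items()):
    print(plasmid, "needs", primer)
```

Both the "unique" class and the "2 reads, 1 primer" class end up in the result, since in either case one primer is missing.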
You can also download new data from web sites into Excel. This allows you to do calculations on the most up-to-date information available.
There are several command-line programs available on any UNIX computer (like the rs6000; if you have a Tulane email address, you can log in to rs6000.tcs.tulane.edu with your mail account login). They are very fast at handling text files. Once you learn to use them, they can be powerful time savers. You can tie them together with scripts to perform multiple manipulations on multiple files. Type "man program_name" to get help.
awk can search for text strings, compare values of fields, and output results however you tell it:
jnolan% awk '/HUMAN/ {print $2, $1}' cytcox.aln
---MALPLRPLTRGLASA--------AKGGHGGAG------------------ARTWRLL COXD_HUMAN
--MAVVGVSSVSRLLGRSRPQLGRPMSSGAHGEEGS-----------------ARMWKTL COXE_HUMAN
TFVLALPSVALCTFNSYL-HSGH--RERPE--------FRPYQHLRIRTKPYPWGDGNHT COXD_HUMAN
TFFVALPGVAVSMLNVYL-KSHHGEHERPE--------FIAYPHLRIRTKPFPWGDGNHT COXE_HUMAN
LFHNSHVNPLP-TGYEHP---- COXD_HUMAN
LFHNPHVNPLP-TGYEDE---- COXE_HUMAN
The program searched for lines containing 'HUMAN' and printed out the second field first, followed by the first field.
[nolan:lecture/723/seqs] jnolan% ^HUMAN^ELVIS^
awk '/ELVIS/ {print $2, $1}' cytcox.aln
[nolan:lecture/723/seqs] jnolan% awk '!~/ELVIS/ {print $2, $1}' cytcox.aln
-no result!-
awk '/COXE_HUMAN/ {print $2}' cytcox.aln
--MAVVGVSSVSRLLGRSRPQLGRPMSSGAHGEEGS-----------------ARMWKTL
TFFVALPGVAVSMLNVYL-KSHHGEHERPE--------FIAYPHLRIRTKPFPWGDGNHT
LFHNPHVNPLP-TGYEDE----
This returns just the sequence for COXE_HUMAN, which is a good way to extract sequences from an alignment.
We can take this result and pass it through a second program, sed, which is a stream editor:
awk '/COXE_HUMAN/ {print $2}' cytcox.aln | sed 's/-//g'
MAVVGVSSVSRLLGRSRPQLGRPMSSGAHGEEGSARMWKTL
TFFVALPGVAVSMLNVYLKSHHGEHERPEFIAYPHLRIRTKPFPWGDGNHT
LFHNPHVNPLPTGYEDE
We repeated the awk command, but used the pipe "|" to pass the result to sed without ever seeing the first result. Then we used sed to do a global search and replace to get rid of all "-".
Now we have our old sequence back!
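For comparison, here is a rough Python sketch of the same awk-then-sed idea. It is not a drop-in replacement: it matches the name exactly against the first field (awk's /COXE_HUMAN/ matches anywhere on the line) and it joins the fragments into one string. The function name `extract_sequence` and the two sample rows are assumptions for illustration.

```python
def extract_sequence(lines, name):
    """Collect the aligned fragments for `name` and remove gap characters."""
    fragments = []
    for line in lines:
        fields = line.split()
        # Alignment rows look like "<name> <aligned fragment>".
        if len(fields) == 2 and fields[0] == name:
            fragments.append(fields[1])
    # Same effect as sed 's/-//g' on the joined fragments.
    return "".join(fragments).replace("-", "")

alignment = [
    "COXD_HUMAN      LFHNSHVNPLP-TGYEHP----",
    "COXE_HUMAN      LFHNPHVNPLP-TGYEDE----",
]
print(extract_sequence(alignment, "COXE_HUMAN"))
```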
Other useful UNIX commands:
comm file1 file2 - compares file1 with file2 (both files must be sorted). Output is in 3 columns: lines unique to file1, lines unique to file2, lines common to both.
diff file1 file2 - compares file1 with file2. Output uses angle brackets (< and >) to mark the lines that differ between the files
diff cytcox.aln cytcox2.aln
16,17d15
< COXE_YEAST ---MFR---QCAKRYASSLPPNALKPAFGPPDKVAAQKFKESLMATEKHAKDTSNMWVKI
< COXE_SCHPO MSMMNRNIGFLSRTLKTSVPKRAGLLSFRAYSNEAKVNWLEEVQAEEEHAKRSSEFWKKV
32,33d29
< COXE_YEAST SVWVALPAIALTAVNTYFVEKEHAEHREHLKHVPDSEWPRDYEFMNIRSKPFFWGDGDKT
< COXE_SCHPO TYYIGGPALILASANAYYIYCKHQEHAKHVEDTDPG-----YSFENLRFKKYPWGDGSKT
48,49d43
< COXE_YEAST LFWNPVVNRHIEHDD-------
< COXE_SCHPO LFWNDKVN-HLKKDDE------
cut filename - extracts columns of text using character numbers or field numbers
cut -c17-76 cytcox.aln
multiple sequence alignment
-MAMSPAATVARRRLAAA--------SQGSH-EGG------------------ARTWKIL
--MASPASMAARRVLSAA--------SHAGH-EGGS-----------------ARTWKIL
---MALPLKSLSRGLASA--------AKGDHGGTG------------------ARTWRFL
---MALPLRPLTRGLASA--------AKGGHGGAG------------------ARTWRLL
------PLKVLSRSMASA--------SKGDHGGAG------------------ANTWRLL
---MALPLKVLSRSMASA--------AKGDHGGAG------------------ANTWRLL
MASAVLSASRVSRPLGRALPGLRRPMSSGAHGEEGS-----------------ARMWKAL
MASAVLSASRVSGLLGRALPRVGRPMSSGAHGEEGS-----------------ARIWKAL
--------------------------SSGAHGEEGS-----------------ARMWKAL
--MAVVGVSSVSRLLGRSRPQLGRPMSSGAHGEEGS-----------------ARMWKTL
--MAAAAWSRVSQLLGRSRLQVGRPMSSGAHGEEGS-----------------ARMWKAL
---MNRLAQPATRSVVKTFQRKSSGSFYGSNNVEGFKESYVTPLKQAHNA---SETWKKI
---MFR---QCAKRYASSLPPNALKPAFGPPDKVAAQKFKESLMATEKHAKDTSNMWVKI
MSMMNRNIGFLSRTLKTSVPKRAGLLSFRAYSNEAKVNWLEEVQAEEEHAKRSSEFWKKV
. :. * :
SFVLALPGVGVCMANAYM-KMQAHSHDPPE--------FVPYPHLRIRTKPWPWGDGNHS
SFVLALPGVAVCIANAYM-KMQQHSHEPPE--------FVAYSHLRIRTKKWPWGDGNHS
TFGLALPSVALCTLNSWL-HSGH--RERPA--------FIPYHHLRIRTKPFSWGDGNHT
TFVLALPSVALCTFNSYL-HSGH--RERPE--------FRPYQHLRIRTKPYPWGDGNHT
TFVLALPSVALCSLNCWM-HAGH--HERPE--------FIPYHHLRIRTKPFSWGDGNHT
TFVLALPGVALCSLNCWM-HAGH--HERPE--------FIPYHHLRIRTKPFAWGDGNHT
TYFVALPGVGVSMLNVFL-KSRHEEHERPP--------FVAYPHLRIRTKPFPWGDGNHT
TYFVALPGVGVSMLNVFL-KSRHEEHERPE--------FVAYPHLRIRTKPFPWGDGNHT
TLFVALPGVGVSMLNVFM-KSHHGEEERPE--------FVAYPHLRIRSKPFPWGDGNHT
TFFVALPGVAVSMLNVYL-KSHHGEHERPE--------FIAYPHLRIRTKPFPWGDGNHT
TYFVALPGVGVSMLNVYL-KSHHEEHERPE--------FIAYPHLRIRSKPFPWGDGNHT
FFIASIPCLALTMYAAFKDHKKHMSHERPE--------HVEYAFLNVRNKPFPWSDGNHS
SVWVALPAIALTAVNTYFVEKEHAEHREHLKHVPDSEWPRDYEFMNIRSKPFFWGDGDKT
TYYIGGPALILASANAYYIYCKHQEHAKHVEDTDPG-----YSFENLRFKKYPWGDGSKT
. * : : : . * . .:* * : *.**.::
LFHNAHTNALP-TGYEGPHH--
LFHNPHENALP-EGYEGPRH--
FFHNPRVNPLP-TGYEKP----
LFHNSHVNPLP-TGYEHP----
LFHNPHVNPLP-TGYEQP----
LFHNPHVNPLP-TGYEHP----
LFHNPHVNPLP-TGYEDE----
LFHNPHMNPLP-TGYEDE----
LFHNPHVNPLP-TGYEDE----
LFHNPHVNPLP-TGYEDE----
LFHNPHVNPLP-TGYEDV----
LFHNKAEQFVPGVGFEADREKH
LFWNPVVNRHIEHDD-------
LFWNDKVN-HLKKDDE------
:* * : .
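The character ranges used by cut -c map directly onto string slicing. A small Python sketch, assuming the lines have already been read without their trailing newlines; `cut_chars` is a made-up helper name.

```python
def cut_chars(lines, start, end):
    """Keep characters start..end of each line, numbered from 1 like cut -c."""
    # cut counts from 1 and includes the end position; slices count from 0
    # and exclude the end, hence start - 1.
    return [line[start - 1:end] for line in lines]

rows = ["COXE_HUMAN      LFHNPHVNPLP-TGYEDE----"]
print(cut_chars(rows, 17, 76))
```

Lines shorter than the requested range are simply returned as far as they go, which is also what cut does.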
paste file1 file2 - puts 2 files together side by side, line by line
paste -d"Z" names seqs
CLUSTAL WZ multiple sequence alignment
Z
Z
COXE_CYPCZ-MAMSPAATVARRRLAAA--------SQGSH-EGG------------------ARTWKIL
COXE_ONCMZ--MASPASMAARRVLSAA--------SHAGH-EGGS-----------------ARTWKIL
COXD_BOVIZ---MALPLKSLSRGLASA--------AKGDHGGTG------------------ARTWRFL
COXD_HUMAZ---MALPLRPLTRGLASA--------AKGGHGGAG------------------ARTWRLL
COXD_RAT Z------PLKVLSRSMASA--------SKGDHGGAG------------------ANTWRLL
COXD_MOUSZ---MALPLKVLSRSMASA--------AKGDHGGAG------------------ANTWRLL
COXE_MOUSZMASAVLSASRVSRPLGRALPGLRRPMSSGAHGEEGS-----------------ARMWKAL
COXE_RAT ZMASAVLSASRVSGLLGRALPRVGRPMSSGAHGEEGS-----------------ARIWKAL
COXE_BOVIZ--------------------------SSGAHGEEGS-----------------ARMWKAL
COXE_HUMAZ--MAVVGVSSVSRLLGRSRPQLGRPMSSGAHGEEGS-----------------ARMWKTL
COXE_RABIZ--MAAAAWSRVSQLLGRSRLQVGRPMSSGAHGEEGS-----------------ARMWKAL
COXE_CAEEZ---MNRLAQPATRSVVKTFQRKSSGSFYGSNNVEGFKESYVTPLKQAHNA---SETWKKI
COXE_YEASZ---MFR---QCAKRYASSLPPNALKPAFGPPDKVAAQKFKESLMATEKHAKDTSNMWVKI
COXE_SCHPZMSMMNRNIGFLSRTLKTSVPKRAGLLSFRAYSNEAKVNWLEEVQAEEEHAKRSSEFWKKV
Z . :. * :
Z
COXE_CYPCZSFVLALPGVGVCMANAYM-KMQAHSHDPPE--------FVPYPHLRIRTKPWPWGDGNHS
COXE_ONCMZSFVLALPGVAVCIANAYM-KMQQHSHEPPE--------FVAYSHLRIRTKKWPWGDGNHS
COXD_BOVIZTFGLALPSVALCTLNSWL-HSGH--RERPA--------FIPYHHLRIRTKPFSWGDGNHT
COXD_HUMAZTFVLALPSVALCTFNSYL-HSGH--RERPE--------FRPYQHLRIRTKPYPWGDGNHT
COXD_RAT ZTFVLALPSVALCSLNCWM-HAGH--HERPE--------FIPYHHLRIRTKPFSWGDGNHT
COXD_MOUSZTFVLALPGVALCSLNCWM-HAGH--HERPE--------FIPYHHLRIRTKPFAWGDGNHT
COXE_MOUSZTYFVALPGVGVSMLNVFL-KSRHEEHERPP--------FVAYPHLRIRTKPFPWGDGNHT
COXE_RAT ZTYFVALPGVGVSMLNVFL-KSRHEEHERPE--------FVAYPHLRIRTKPFPWGDGNHT
COXE_BOVIZTLFVALPGVGVSMLNVFM-KSHHGEEERPE--------FVAYPHLRIRSKPFPWGDGNHT
COXE_HUMAZTFFVALPGVAVSMLNVYL-KSHHGEHERPE--------FIAYPHLRIRTKPFPWGDGNHT
COXE_RABIZTYFVALPGVGVSMLNVYL-KSHHEEHERPE--------FIAYPHLRIRSKPFPWGDGNHT
COXE_CAEEZFFIASIPCLALTMYAAFKDHKKHMSHERPE--------HVEYAFLNVRNKPFPWSDGNHS
COXE_YEASZSVWVALPAIALTAVNTYFVEKEHAEHREHLKHVPDSEWPRDYEFMNIRSKPFFWGDGDKT
COXE_SCHPZTYYIGGPALILASANAYYIYCKHQEHAKHVEDTDPG-----YSFENLRFKKYPWGDGSKT
Z . * : : : . * . .:* * : *.**.::
Z
COXE_CYPCZLFHNAHTNALP-TGYEGPHH--
COXE_ONCMZLFHNPHENALP-EGYEGPRH--
COXD_BOVIZFFHNPRVNPLP-TGYEKP----
COXD_HUMAZLFHNSHVNPLP-TGYEHP----
COXD_RAT ZLFHNPHVNPLP-TGYEQP----
COXD_MOUSZLFHNPHVNPLP-TGYEHP----
COXE_MOUSZLFHNPHVNPLP-TGYEDE----
COXE_RAT ZLFHNPHMNPLP-TGYEDE----
COXE_BOVIZLFHNPHVNPLP-TGYEDE----
COXE_HUMAZLFHNPHVNPLP-TGYEDE----
COXE_RABIZLFHNPHVNPLP-TGYEDV----
COXE_CAEEZLFHNKAEQFVPGVGFEADREKH
COXE_YEASZLFWNPVVNRHIEHDD-------
COXE_SCHPZLFWNDKVN-HLKKDDE------
Z:* * : .
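The line-by-line pairing that paste performs can be sketched in Python as follows. This is an illustration under the assumption that both files have been read into lists of lines; the helper name `paste` just mirrors the command.

```python
from itertools import zip_longest

def paste(left, right, delimiter="\t"):
    """Join corresponding lines of two lists, like paste -d."""
    # If one list is shorter, its missing lines are treated as empty,
    # which is how the paste command handles files of unequal length.
    return [a + delimiter + b for a, b in zip_longest(left, right, fillvalue="")]

names = ["COXE_HUMA", "COXE_RABI"]
seqs = ["LFHNPHVNPLP-TGYEDE----", "LFHNPHVNPLP-TGYEDV----"]
for line in paste(names, seqs, delimiter="Z"):
    print(line)
```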
Perl is one of the most powerful data mining and manipulation tools. It is fairly easy to learn, and once you do, you can do just about anything. The site bioperl.org has hundreds of bioinformatics tools. An excellent text, "Beginning Perl for Bioinformatics", is available from O'Reilly. It has some handy exercises you can download to lead you through the programming steps in learning Perl.
Perl is very useful for analyzing sequences and parsing results.
Parsing: extracting data from a result in a useful manner. For example, BLAST results:
Lots of info, but it is hard to compare and compile all the results from one search. Parsers search through the file and organize it into fields:
Query Seq Name | Start Subj | End Subj | Query Start | Query End | Score Bits | Score 2 | Expect | Length | Overlap Length | Identities | Total | % Identities |
Contig15 64577 bp | 1 | 1032 | 48366 | 45271 | 1702 | 4408 | 0 | 1032 | 3095 | 809 | 1032 | 78% |
Contig15 64577 bp | 1 | 658 | 50333 | 48372 | 1149 | 2972 | 0 | 660 | 1961 | 569 | 658 | 86% |
Contig15 64577 bp | 1 | 610 | 34786 | 32954 | 1117 | 2890 | 0 | 610 | 1832 | 544 | 611 | 89% |
Contig15 64577 bp | 1 | 659 | 32920 | 30944 | 1089 | 2817 | 0 | 659 | 1976 | 533 | 659 | 80% |
Contig15 64577 bp | 2 | 511 | 19393 | 17861 | 883 | 2281 | | 587 | 1532 | 427 | 511 | 83% |
Contig15 64577 bp | 1 | 602 | 43345 | 41534 | 969 | 2506 | 0 | 602 | 1811 | 462 | 604 | 76% |
Contig15 64577 bp | 2 | 521 | 26507 | 24948 | 957 | 2475 | 0 | 521 | 1559 | 485 | 520 | 93% |
Contig15 64577 bp | 1 | 523 | 34525 | 32954 | 957 | 2474 | 0 | 523 | 1571 | 470 | 524 | 89% |
Contig15 64577 bp | 1 | 524 | 30249 | 28681 | 932 | 2408 | 0 | 524 | 1568 | 450 | 524 | 85% |
Contig15 64577 bp | 1 | 575 | 52890 | 51160 | 925 | 2391 | 0 | 575 | 1730 | 450 | 578 | 77% |
Contig15 64577 bp | 1 | 505 | 34471 | 32954 | 922 | 2384 | 0 | 505 | 1517 | 453 | 506 | 89% |
Contig15 64577 bp | 40 | 427 | 24406 | 23243 | 720 | 1859 | | 427 | 1163 | 365 | 388 | 94% |
Contig15 64577 bp | 151 | 401 | 25787 | 25014 | 74.7 | 182 | 2.00E-14 | | 773 | 78 | 273 | 28% |
Contig15 64577 bp | 1 | 416 | 34204 | 32954 | 746 | 1926 | 0 | 416 | 1250 | 366 | 417 | 87% |
The parsed fields can then be used in a spreadsheet program.
Parsing takes advantage of key features of a document to divide it into its important parts. Each part is assigned to a variable, and the variables are used to output the data in a useful format:
BLAST Parser variables: | |
$hsp->hit->seq_id $hsp->subject->length |
$hsp->score $hsp->bits |
$hsp->P | $hsp->sbjctFrame |
$hsp->match $hsp->length $hsp->percent $hsp->positive |
$hsp->querySeq $hsp->homologySeq $hsp->sbjctSeq |
$hsp->hit->start $hsp->hit->end |
$hsp->query->start $hsp->query->end |
>uvsX_Aeh1 RecA-like recomb. pro; DNA-ATPase[seq_id] Length = 411[subject length] Score = 439 bits (1130), Expect = e-124[Score and P] Identities = 203/357 (56%), Positives = 278/357 (77%)[match/length][positive] Frame = -2 Query: 23561 MSDLKSRLIKASTSKLTAELTASKFFNEKDVVRTKIPMMNIALSGEITGGMQSGLLILAG 23382 + L S+L S++K+++ L SKFFN+KD VRT++P++N+A+SGE+ GG+ GL +LAG Sbjct: 13 LGSLMSKLAGTSSNKMSSVLADSKFFNDKDCVRTRVPLLNLAMSGELDGGLTPGLTVLAG 72
Query: 23381 PSKSFKSNFGLTMVSSYMRQYPDAVCLFYDSEFGITPAYLRSMGVDPERVIHTPVQSLEQ 23202 PSK FKSN L V++Y+R+YPDAVC+F+D+EFG TP Y S GVD RVIH P +++E+ Sbjct: 73 PSKHFKSNLSLVFVAAYLRKYPDAVCIFFDNEFGSTPGYFESQGVDISRVIHCPFKNIEE 132
Query: 23201 LRIDMVNQLDAIERGEKVVVFIDSLGNLASKKETEDALNEKVVSDMTRAKTMKSLFRIVT 23022 L+ D+V +L+AIERG++V+VF+DS+GN ASKKE +DA++EK VSDMTRAK +KSL R++T Sbjct: 133 LKFDIVKKLEAIERGDRVIVFVDSIGNAASKKEIDDAIDEKSVSDMTRAKQIKSLTRMMT 192
Query: 23021 PYFSTKNIPCIAINHTYETQEMFSKTVMGGGTGPMYSADTVFIIGKRQIKDGSDLQGYQF 22842 PY + +IP I + HTY+TQEM+SK V+ GGTG YS+DTV IIG++Q KDG +L GY F Sbjct: 193 PYLTVNDIPAIMVAHTYDTQEMYSKKVVSGGTGITYSSDTVIIIGRQQEKDGKELLGYNF 252
Query: 22841 VLNVEKSRTVKEKSKFFIDVKFDGGIDPYSGLLDMALELGFVVKPKNGWYAREFLDEETG 22662 VLN+EKSR VKE+SK ++V F GGI+ YSG+LD+ALE+GFVVKP NGW++R FLDEETG Sbjct: 253 VLNMEKSRFVKEQSKLPLEVTFQGGINTYSGMLDIALEVGFVVKPSNGWFSRAFLDEETG 312
Query: 22661 EMIREEKSWRAKDTNCTTFWGPLFKHQPFRDAIKRAYQLGAIDSNEIVEAEVDELIN 22491 E++ E++ WR DTNC FW P+F HQPF+ A ++L ++ + V EVDEL + Sbjct: 313 ELVEEDRKWRRADTNCLEFWKPMFAHQPFKTACSDMFKLKSVAVKDEVFDEVDELFS 369 >60plus39_Aeh1 DNA topoisomerase sub.; DNAdep. ATPase; memb-assoc Length = 613 Score = 412 bits (1058), Expect = e-115 Identities = 214/471 (45%), Positives = 296/471 (62%) Frame = -1 Query: 5325 IKNEIKILSDIEHIKKRSGMYIGSSANETHERFMFGKWESVQYVPGLVKLIDEIIDNSVD 5146 + E K+LSD EH + MYIGS++ ETH+ + GK+ + YVPGLVK+ DE+IDNSVD Sbjct: 1 MSQEFKVLSDKEHCLINTDMYIGSTSTETHDVLVDGKFVQIAYVPGLVKITDEVIDNSVD 60
Parsers can be used to glean data from almost any source for further manipulation.
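To make the idea of "key features" concrete, here is a minimal Python sketch that pulls a few fields out of a hit header like the annotated one above. It uses plain regular expressions rather than the BLAST parser variables listed earlier; the function name `parse_hsp_header` and the dictionary keys are assumptions chosen for illustration.

```python
import re

# Key features of the header text: "Score = ... bits (...), Expect = ..."
# and "Identities = n/m (p%)".
SCORE = re.compile(r"Score\s*=\s*([\d.]+) bits \((\d+)\),\s*Expect\s*=\s*(\S+)")
IDENT = re.compile(r"Identities\s*=\s*(\d+)/(\d+) \((\d+)%\)")

def parse_hsp_header(text):
    """Pull score, expect and identity fields out of one hit header."""
    score = SCORE.search(text)
    ident = IDENT.search(text)
    if score is None or ident is None:
        return None
    return {
        "bits": float(score.group(1)),
        "score": int(score.group(2)),
        "expect": score.group(3),  # kept as text, e.g. "e-124"
        "identities": int(ident.group(1)),
        "length": int(ident.group(2)),
        "percent": int(ident.group(3)),
    }

header = ("Score = 439 bits (1130), Expect = e-124 "
          "Identities = 203/357 (56%), Positives = 278/357 (77%)")
print(parse_hsp_header(header))
```

Each dictionary could then be written out as one tab-separated row, giving a table like the one shown earlier.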
A good BLAST example:
Automatically annotating phage genomes:
Output from GeneMark:
1 - 1 1731 1731 1
2 - 1815 2177 363 1
3 - 2188 2391 204 1
4 - 2446 4263 1818 1
5 - 4333 4593 261 1
6 - 4599 4970 372 1
GFF format needed for webpage programs:
RB32 predicted ORF 1 1731 . - . ORF "RB32ORF0001c" ; Note "hypothetical protein"
RB32 predicted ORF 1815 2177 . - . ORF "RB32ORF0002c" ; Note "hypothetical protein"
RB32 predicted ORF 2188 2391 . - . ORF "RB32ORF0003c" ; Note "hypothetical protein"
RB32 predicted ORF 2446 4263 . - . ORF "RB32ORF0004c" ; Note "hypothetical protein"
The two formats are pretty similar; I just moved columns around and added some generic names.
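The column shuffle can be sketched in a few lines of Python. This is an illustration, not the script actually used: the function name `genemark_to_gff` is hypothetical, the meaning of the trailing "c" in the ORF names is an assumption (complementary strand), and the output is tab-separated as GFF expects, although the listing above shows spaces.

```python
def genemark_to_gff(line, genome="RB32"):
    """Turn one GeneMark row (number, strand, start, end, length, class) into a GFF line."""
    number, strand, start, end, _length, _cls = line.split()
    # Assumption: the trailing "c" in names such as RB32ORF0001c marks the
    # complementary (-) strand; forward-strand ORFs get no suffix here.
    suffix = "c" if strand == "-" else ""
    orf_name = f"{genome}ORF{int(number):04d}{suffix}"
    attributes = f'ORF "{orf_name}" ; Note "hypothetical protein"'
    # GFF columns are tab-separated: seqname, source, feature, start, end,
    # score, strand, frame, attributes.
    return "\t".join([genome, "predicted", "ORF", start, end, ".", strand, ".", attributes])

print(genemark_to_gff("2 - 1815 2177 363 1"))
```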
You've seen BLAST files before. After I parse the file, I get just the score, the name of the ORF, and the name and function of the T4 hit:
2951 RB32ORF0001c hypothetical protein (576 letters) rIIA_T4 membrane-assoc. affects host memb. ATPase
509 RB32ORF0002c hypothetical protein (120 letters) rIIA_T4 membrane-assoc. affects host memb. ATPase
333 RB32ORF0003c hypothetical protein (67 letters) rIIA.1_T4 unknown funct.
2254 RB32ORF0004c hypothetical protein (605 letters) 60plus39_T4 DNA topoisomerase sub.; DNAdep. ATPase; memb-assoc
I use the join program to give me the new names with the old coordinates. If it didn't match T4, the old name is printed:
RB32ORF0001c RB32 predicted ORF 1 1731 . - . ORF ; Note hypothetical
protein 2951 hypothetical protein (576 letters) rIIA_T4 membrane-assoc. affects
host memb. ATPase
RB32ORF0002c RB32 predicted ORF 1815 2177 . - . ORF ; Note hypothetical protein
509 hypothetical protein (120 letters) rIIA_T4 membrane-assoc. affects host
memb. ATPase
RB32ORF0003c RB32 predicted ORF 2188 2391 . - . ORF ; Note hypothetical protein
333 hypothetical protein (67 letters) rIIA.1_T4 unknown funct.
RB32ORF0004c RB32 predicted ORF 2446 4263 . - . ORF ; Note hypothetical protein
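The join step pairs up lines from two files that share a key, here the ORF name. A rough Python equivalent uses a dictionary lookup, as in the sketch below. The function name `join_on_name` and the shortened sample rows are assumptions; note also that the UNIX join command expects both files to be sorted on the key and needs an option such as -a to keep lines without a partner, whereas this sketch keeps them by default.

```python
def join_on_name(gff_rows, hits):
    """Attach the parsed BLAST hit to each ORF, keeping ORFs without a hit."""
    joined = []
    for name, coords in gff_rows:
        # dict.get returns None when the ORF had no T4 match.
        joined.append((name, coords, hits.get(name)))
    return joined

# (ORF name, coordinates) from the GFF file.
gff_rows = [
    ("RB32ORF0003c", "2188 2391"),
    ("RB32ORF0004c", "2446 4263"),
    ("RB32ORF0006c", "4599 4970"),
]
# ORF name -> T4 hit from the parsed BLAST output.
hits = {
    "RB32ORF0003c": "rIIA.1_T4 unknown funct.",
    "RB32ORF0004c": "60plus39_T4 DNA topoisomerase sub.; DNAdep. ATPase; memb-assoc",
}
for name, coords, hit in join_on_name(gff_rows, hits):
    print(name, coords, hit if hit else "(no T4 match, old name kept)")
```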
These are rearranged with a text editor to give the final format for the web server:
RB32 predicted gene 1 1731 . - . Gene "rIIA-b" ; Note "Possible frameshift membrane-assoc. affects host memb. ATPase"
RB32 predicted gene 1815 2177 . - . Gene "rIIA" ; Note "membrane-assoc. affects host memb. ATPase"
RB32 predicted gene 2188 2391 . - . Gene "rIIA.1" ; Note "unknown funct."
RB32 predicted gene 2446 4263 . - . Gene "60plus39" ; Note "DNA topoisomerase sub.; DNAdep. ATPase; memb-assoc"
RB32 predicted ORF 4599 4970 . - . ORF "RB32ORF006c" ; Note "hypothetical protein"
Other information, like tRNA predictions and BLAST hits to other phages, is parsed and added to the file. The result:
The display program is also written in Perl and automatically generates images with links to more information.
I was helping my friends manage a basketball pool for the NCAA tournament. I had installed a program to let people make entries and calculate the scores, but it was hard to use and easy to make a mistake entering the points awarded for each game.
It was even harder to check the data entered. I could get a list of games won and points awarded for picking the game, but the teams showed up as a list of numbers, not names. The "LOOKUP" function of Excel did the trick: (poolaudit.xls)
The program uses the columns of information in A-D to keep track of teams. Kentucky is team 1, Gonzaga is team 2, etc. Their ranking, or seed is important for calculating the score in the pool. The info on who won each game is in the next 3 columns.
Columns I, J, L and M refer to the first 3 columns to find information. Double-click on I2 to see how it works:
The game winner's ID number is in F3, in this case "2". The formula looks that number up in column A and reports the value from the 2nd column, "Gonzaga". Column J looks up the seed number for team #2, in this case "2". The point calculation is made in column K. If a value in column K doesn't match the value in column H, it shows up in red because of Excel's Conditional Formatting function; otherwise it is black. This is a great way to compare 2 lists of numbers, like the ones used in the website programs.
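The lookup-and-compare audit can be expressed in a few lines of Python as well. This is only a sketch of the idea: the team ids and names follow the text, but Kentucky's seed, the recorded points and the scoring rule (`points_for`) are placeholders, since the pool's actual rule is not given here.

```python
def audit_points(teams, games, points_for):
    """Return games whose recorded points differ from the recomputed value."""
    mismatches = []
    for winner_id, recorded in games:
        name, seed = teams[winner_id]      # the lookup step
        expected = points_for(seed)
        if expected != recorded:           # what the red formatting flags
            mismatches.append((name, recorded, expected))
    return mismatches

# team id -> (name, seed); seeds other than Gonzaga's are placeholders.
teams = {1: ("Kentucky", 1), 2: ("Gonzaga", 2)}
# (winner id, points recorded by the pool program); values are made up.
games = [(1, 1), (2, 5)]
print(audit_points(teams, games, points_for=lambda seed: seed))
```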
On to the Perl exercise for lab!