Testing 4 Prophage finding tools

I was searching for the best prophage finding tool for bacterial genomics. After a literature review and recommendations from colleagues, I decided on 3 (updated: 4): Phispy, VirSorter, Phigaro, and ProphET. I also throw in PHASTER, but as it is not open source, I decided against it. Having code available to peer review (or in this case, to install locally) is important to me. If the title of this post bothers you, just pretend I am using 0-indexed counting.

Our test data

Erwinia carotovora subsp. atroseptica SCRI1043 BX950851.1, was chosen for a test bug, as one of the submitters confirmend that considerable human effort went into manually annotating the prophages.

PHASTER

We uploaded a fasta file of our test genome to PHASTER. After about 5 minutes, we got a result

Region	Region Length	Completeness	Score	# Total Proteins	Region Position	Most Common Phage	GC %	Details
1	41.4Kb	intact	150	42	2926054-2967509 info_outline	PHAGE_Entero_P88_NC_026014(22)	50.58%	Show info_outline
2	40.6Kb	questionable	90	55	4140147-4180770 info_outline	PHAGE_Burkho_BcepMu_NC_005882(32)	52.22%	Show info_outline

This was, unsurprisingly, the easiest of the tools to use.

ProphET

Next we tried ProphET. Being a non-Broadie (single tear), I installed it into a conda env I created:

conda create -n ProphET
source activate ProphET
conda install -c biocore blast-legacy
conda install bedtools
conda install  perl-bioperl 

NOTE: There were a few more perl libraries that I needed to install, which they document well in their README.

After getting everything installed, I tried running it, but started running into GFF issues. For the uninitiated, the GFF file format is the leading cause of existential crises. The author’s insist on using one GFF spec, while other resources uses another. For this reason, they include a tool rewrite_gff.pl to fix this. So, my workflow became (running from the root directory of ProphET):

    cd ./UTILS.dir/GFFLib/
    perl ./gff_rewrite.pl --input path/to/BX950851.1.gff --output ../../BX950851.1_new..gff --add_missing_features
    cd ../../
    ./ProphET_standalone.pl --fasta_in path/to/BX950851.1.fasta../../BX950851.1_new..gff --outdir Lys${i}_ProphET

This was cumbersome (especially when running with multiple genomes), and it is surprising and frustrating that this is not built in to the tool.

The results look like this:

BX950851.1      1       2932697 2967509
BX950851.1      2       3803975 3827465
BX950851.1      3       4144591 4180770

So ProphET has truncated the first hit from PHASTER, and added an aditional hit. But overall, these results agree decently well.

Phispy

The last release of Phispy was in 2014; the readme shows that the author was a PhD student, who has likely gone on to greater things now. Dispite the tool’s age, it came highly recommended.

I installed it from sourceforge:

# download from https://downloads.sourceforge.net/project/phispy/phiSpyNov11_v2.3.zip

unzip phiS*
conda create --name phispy python=2.7.15 biopython r-randomforests
source activate phispy

Ok, so it says that it needs a certain type of annotation. Lets give it a try with a vanilla annotation

python ~/miniconda3/envs/phispy/bin/phiSpyNov11_v2.3/genbank_to_seed.py ./BX950851.1.gb ./BX950851_seed/ 

In the GenBank file, for a gene/RNA, locus_tag is missing. locus_tag is required for each gene/RNA.                                                                   
Please make sure that each gene/RNA has locus_tag and run the program again.       

It didnt work. I reannoated the genome with Prokka, and it subsequently worked. I ran Phispy, and eventually got the following results:

CAIIPDDM_00pp.1 CAIIPDDM_1_1063490_1081325
CAIIPDDM_02pp.2 CAIIPDDM_1_2372933_2431115
CAIIPDDM_02pp.3 CAIIPDDM_1_2773681_2839971
CAIIPDDM_02pp.4 CAIIPDDM_1_2929563_2964448
CAIIPDDM_02pp.5 CAIIPDDM_1_3192250_3271834
CAIIPDDM_03pp.6 CAIIPDDM_1_4116963_4175637

Now, we have 1 that we have seen before, and 5 new ones.

Phigaro

Joe Healey recommended trying another tool, Phigaro. why not? Its available on pip, how bad could it be?

conda create -n phigaro
source activate phigaro
pip install phigaro
phigaro-setup

I am working on an HPC, so I do not have sudo access. Strike 1 for phigaro. But they include instruction for semi-manual setup.

I am stuck trying to get metagenemark. Their licence page on their website isn’t working. I have emailed the authors. Strike 2.

[6 hours later] The authors got it sorted out, and I now have it configured, after a bit of fiddling around with having the .gm_key in the right part of the filesystem.

I ran the program as phigaro -f ./BX950851.fasta -o BX950851_phigaro.txt, getting the following result:

scaffold        begin   end
BX950851.1 Erwinia carotovora subsp. atroseptica SCRI1043, complete genome      1356757 1376318
BX950851.1 Erwinia carotovora subsp. atroseptica SCRI1043, complete genome      2928085 3003543
BX950851.1 Erwinia carotovora subsp. atroseptica SCRI1043, complete genome      3106945 3119520
BX950851.1 Erwinia carotovora subsp. atroseptica SCRI1043, complete genome      3735875 3765347
BX950851.1 Erwinia carotovora subsp. atroseptica SCRI1043, complete genome      4143714 4184809

Compared to Phispy, we have 1, maybe two overlapping hits. Both the PHASTER hits seem to be there, though.

VirSorter

Lastly, we tried Virsorter. You have to download a sizable database, but they give a nice walkthrough on how to get set up on their Github page. I’ll spare you the details of running it; it was pretty pain-free. Here are the results

Update: talked with the authors on how to find the coordinates.

Summary

| Island no. | Putative phenotype(s)                                                  | start   | stop    | PHASTER | ProphET | phiSpy | phigaro | VirSorter |
|------------|------------------------------------------------------------------------|---------|---------|---------|---------|--------|---------|-----------|
| HAI1       | Capsular polysaccharide biosynthesis                                   | 574080  | 587651  |         |         |        |         |    X      |
| HAI2       | Polyketide phytotoxin biosynthesis (cfa)                               | 590844  | 688402  |         |         |        |         |    X      |
| HAI3       |                                                                        | 739283  | 750273  |         |         |        |         |           |
|            |                                                                        | 1063490 | 1081325 |         |         |   X    |         |           |
|            |                                                                        | 1356757 | 1376318 |         |         |        |   X     |           |
| HAI4       |                                                                        | 1180862 | 1196644 |         |         |        |         |           |
| HAI5       | Exopolysaccharide and O-antigen biosynthesis                           | 1606718 | 1638218 |         |         |        |         |           |
| HAI6       | Nonribosomal peptide phytotoxin                                        | 1666599 | 1727303 |         |         |        |         |           |
| HAI7       | Type IV secretion (virB) ; Integrated plasmid ;                        | 1855523 | 1926959 |         |         |        |         |    X      |
| HAI8       | Type III secretion (hrp);  (hecAB)                                     | 2324722 | 2486065 |         |         |   X    |         |    X      |
| HAI9       | P2 family prophage                                                     | 2935461 | 2966671 |   X     |   o     |   X    |   X     |    X      |
| HAI10      | Phenazine antibiotic biosynthesis (ehp)                                | 3029319 | 3040751 |         |         |        |         |           |
| HAI11      |                                                                        | 3092131 | 3101182 |         |         |        |         |           |
|            |                                                                        | 3106945 | 3119520 |         |         |        |   X     |           |
| HAI12      | Rhs and its accessory element VgrG                                     | 3194782 | 3227396 |         |         |   X    |         |           |
| HAI13      | Putative integrated plasmid                                            | 3236381 | 3263492 |         |         |        |         |    X      |
| HAI14      | Nitrogen fixation (nif)                                                | 3280604 | 3355481 |         |         |        |         |           |
| HAI15      | Agglutination/adhesion (aggA)                                          | 3652523 | 3677516 |         |         |        |         |           |
|            |                                                                        | 3735875 | 3765347 |         |         |        |   X     |           |
| HAI16      |                                                                        | 3794816 | 3880056 |         |   X     |        |         |           |
| HAI17      | Prophage                                                               | 4144591 | 4180770 |   X     |   X     |   o    |   X     |    X      |

X = Hit o = partial hit ? = unknown

This table adapted from table 1 of https://www.ncbi.nlm.nih.gov/pubmed/15263089

We do appear to detect the obvious phages (marked HAI11 and HAI17), but thats where the similarity ends. We don’t really know what to make of the other hits. And this is sensible – the longer a prophage is integrated into a genome, the more replication errors are going to occur, and the more that region is going to start blending in to the host region. I think that we learn is that any of these softwares can detect obvious (read: recent) prophages, but for more ancestral ones, its anyones guess.

If one was being conservative, they would likely choose PHASTER, and I would, if it wasn’t a webapp. The next closest is ProphET, but what about HAI16? If that is not a legitimate phage, than a 33% false pasitive rate is unacceptable. Manually, I went through and blasted the genes in HAI16, to see whether they appeared to be phagy. And they did! Conversely, I did the same with genes between 1063490-1081325 to see whether phispy would be usable – unfortuately, none of those genes looked like phage at all. For phigaro, I checked the 1356757-1376318 region; nothing looked phagy, and the region contained the only copy of gyrA, which I am pretty sure is essential for Erwinia. The 3106945-3119520 region had one partial hit for a phage tail protein, but nothing else.

VirSorter appears to detect anything that doesn’t look like the host genome, including plasmids; that rule out VirSorter. I will be using some different software for plasmid detection.

ProphET seems to be the most conservative.

Conclusion

Keep in mind that this is n_genoms = 1. Take this with a grain of salt.

This was not fun.

I wish most of these were easier to install.
I wish none of these were written in Perl.
I wish all of these were open source.
I wish Phispy was still maintained.
I wish none of them were hosted on Sourceforge.
I wish there weren’t additional tools that I would need to compare to make this more complete.
I wish these worked from raw reads rather than assemblies, as we know prophages are difficult to correcly assemble.
I wish all the tools’s publications had appropriate, benchmarked comparisons to existing tools.
I wish I didn’t get 5 different answers from 5 different tools.
I wish none of these involved instalation licenses

weird_one_offs

For analyses that I don't want to loose but that don't deserve their own repo