% PLEASE NOTE:
%
% This update of PlantClusterFinder can process genomes where gene
% identifier to protein identifier mappings are complex. However, the change
% of the algorithm comes with a change in the processing of the input files.
% Any change in input files (e.g. a change of the genome assembly or changes
% in genome annotations), as well as running it with a different Matlab
% version, can change the insertion of hypothetical genes and thus can have
% an impact on cluster prediction. Hence, if you plan to use
% PlantClusterFinder for a comparative study, please re-compute all species
% that are included in your study.
% We tested in various ways how our results (published in Schlapfer et al.
% 2017, PlantPhys) are impacted by such manipulations (changing biomart
% files, changing genomes, changing top% thresholds), and we found that the
% numbers of clusters can change, but our conclusions were robust to all of
% these changes.
%
% PLEASE ALSO NOTE:
% You need to provide a Genename_conversion_File that maps all IDs used in
% PGDBs and MCL clusterings to one gene identifier. It accepts regular
% expressions, and lines are executed sequentially. We strongly suggest you
% convert all gene identifiers to start with FIX_ so that any potential
% hardcoded ID mappings in the core of PlantClusterFinder are ignored. There
% will be an update of this algorithm which makes this one obsolete.
%
% PLEASE CHECK FOR RELEASE OF VERSION 1.3.
%
% This standalone needs the Matlab runtime. See below for more information.
% Ensure that you have rwx access to all folders, files, and executables
% you want to use.
%
% To run this code for any other species, you will need:
% - a PGDB of that species (either downloaded from plantcyc.org (a free
%   license is needed) or from another source using pathwaytools).
% - a gene position file, created either from a biomart of a genome or from
%   a gff file. This file ideally matches the genome that was used to create
%   the PGDB.
%   The structure is a tab-separated file with a header, as follows:
%   Gene Name    Gene Start (bp)    Gene End (bp)    Chromosome Name    Strand
%   FIX_PGSC0003DMG400030251    3198    6347    chr01    -1
%   FIX_PGSC0003DMG402030252    30275    32399    chr01    1
% - A masked DNA file of the species that should match the data used for the
%   PGDB and the biomart file.
% - An MCL clustering file. This needs a protein annotation fasta file to start.
%   Please see README_HOW_TO_MCL_CLUSTER.txt for how to perform this step.
% - A Gene_Name_conversion file mapping protein and gene IDs to your own IDs.
%   Since there are hard-coded parts in the older parts of the code, we suggest
%   mapping all IDs to FIX_GeneID. The file is tab-separated, with a regular
%   expression as the first and second argument and a string that should
%   replace the regular expression.
%   Every line is executed serially. Thus, if there are gene identifiers A, B,
%   and C, and the first line of your file reads A A B and the second line
%   reads B B C, then the resulting gene identifiers will be C, C, C.
%   Here is an example for potato, converting PGSC0003DMP400000001,
%   PGSC0003DMT400000001 and PGSC0003DMG400000001 all to
%   FIX_PGSC0003DMG400000001:
%   PGSC0003DMP400000001    PGSC0003DMP400000001    PGSC0003DMG400000001
%   PGSC0003DMT400000001    PGSC0003DMT400000001    PGSC0003DMG400000001
%   PGSC0003DMG400000001    PGSC0003DMG400000001    FIX_PGSC0003DMG400000001
%   FIX_FIX_    FIX_FIX_    FIX_
%
% Other things to consider:
% - If you use this software with PGDBs of versions higher than 20.5, please
%   download updated files for metacyc reactions and store them with the input
%   files. Also check for new releases of plantcyc.
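%
% The serial execution of the Gene_Name_conversion file described above can
% be illustrated with a short Python sketch. It is an illustration only and
% not the code used by PlantClusterFinder. Assumptions: the first column is
% taken as the pattern and the third column as the replacement (the role of
% the second column is not spelled out in this text), and Python's re module
% stands in for the Matlab regular expression engine, which may differ in
% details. The function names parse_rules and convert_id are hypothetical.

```python
import re


def parse_rules(text):
    """Parse tab-separated conversion lines into (pattern, replacement) pairs."""
    rules = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip empty lines
        columns = line.split("\t")
        # Assumption: column 1 is the pattern, column 3 the replacement.
        rules.append((columns[0], columns[2]))
    return rules


def convert_id(identifier, rules):
    """Apply every rule to the identifier, one after the other, in file order."""
    for pattern, replacement in rules:
        identifier = re.sub(pattern, replacement, identifier)
    return identifier


# The potato example from the text above.
POTATO_RULES = parse_rules(
    "PGSC0003DMP400000001\tPGSC0003DMP400000001\tPGSC0003DMG400000001\n"
    "PGSC0003DMT400000001\tPGSC0003DMT400000001\tPGSC0003DMG400000001\n"
    "PGSC0003DMG400000001\tPGSC0003DMG400000001\tFIX_PGSC0003DMG400000001\n"
    "FIX_FIX_\tFIX_FIX_\tFIX_\n"
)

for original in (
    "PGSC0003DMP400000001",
    "PGSC0003DMT400000001",
    "PGSC0003DMG400000001",
    "FIX_PGSC0003DMG400000001",
):
    print(original, "->", convert_id(original, POTATO_RULES))
```

% Under these assumptions, all four identifiers end up as
% FIX_PGSC0003DMG400000001. The last input shows why the FIX_FIX_ line is
% useful: an identifier that already carries the prefix is prefixed a second
% time by the third rule, and the fourth rule removes the duplicate.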
% - If you want to use Metabolic Domain information, then update the
%   information in PathwayMetabolicDomainClassification.txt,
%   ReactionMetabolicDomainClassification.txt and SuperPathwayList.txt.
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Function for running the analysis of where gaps exist in a genome       %
% (either gaps that are encoded by base-pairs that are not ATCGatcg* or   %
% gaps that are encoded by sequences of N's). This is then used to        %
% modify a set of gene position lists and to invent hypothetical genes    %
% to fill the gaps. The output files are then usually used to find        %
% gene clusters of metabolic enzymes.                                     %
% Author: P. Schlapfer, Carnegie Institution for Science                  %
% Date: 2017/12/10                                                        %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% INPUT:
% - vGene_Position_Files_Folder:
%       A string defining the location of the gene position files.
% - vGene_Position_File_Names:
%       A string with the names of all gene position files, concatenated
%       by semicolons (;).
% - vGenename_Conversion_Files_Folder:
%       A string defining the location of the gene name conversion
%       files.
% - vGenename_Conversion_File_Names:
%       A string with the names of all gene name conversion files,
%       concatenated by semicolons (;).
% - vGenome_Files_Folder:
%       A string defining the location of the genome fasta files.
% - vGenome_File_Names:
%       A string with the names of the genome fasta files of all
%       species that are investigated, concatenated by
%       semicolons (;).
% - vMCL_clustering_result_Folder:
%       A string defining the location of the gene MCL clustering
%       files.
% - vMCL_clustering_result_file_Names:
%       A string that defines the names of the MCL clustering results,
%       concatenated by semicolons (;).
%       PlantClusterFinder was written in Java and built to run on
%       64-bit Linux systems; it has not been tested widely on
%       different systems. It relies on the results from
%       all-against-all BLAST and MCL clustering.
%       The PlantClusterFinder package does not include BLAST and MCL.
%       For how to prepare the MCL clustering result file for a
%       genome, please consult the protocol from the manual of MCL
%       (http://micans.org/mcl/man/clmprotocols.html#blast).
% - vClusterResultFolderNames:
%       A string that defines the result file names.
% - vGaps_noCluster:
%       A string that defines the genome sequence gap sizes that should
%       be interpreted as preventing two genes on either side of such a
%       gap from ending up in the same cluster.
% - vGapSizeEnd:
%       Maximal number of non-metabolic genes allowed between two
%       metabolic enzymes, such that a gene cluster is still defined as
%       such.
% - vMetaCyc_reactions_dat:
%       The full path and name of the metacyc reactions flatfile that
%       was used to produce the PGDBs (see vPGDB_Folder below).
% - vPlantCyc_reactions_dat:
%       The full path and name of the plantcyc reactions flatfile that
%       was used to produce the PGDBs (see vPGDB_Folder below). This can
%       be an empty string, in which case it is ignored.
% - vPGDB_Folder:
%       The folder where the PGDBs are stored. The last version of an
%       organism will be used (using the default_version).
% - vPathway_Domains_File:
%       A tab-separated txt file containing all pathways in the study
%       with their annotated metabolic domains.
% - vReaction_Domains_File:
%       A tab-separated txt file containing all reactions in the study
%       with their annotated metabolic domains.
% - vSuperPathway_File:
%       A txt file containing a list of pathways that represent
%       superpathways (a pathway containing more than one pathway or
%       containing a pathway and an additional reaction).
% - vScaffoldTailoringReactions_File:
%       A tab-delimited txt file containing a list of signature and
%       tailoring enzymes.
% - vScaffoldTailoringReactions_Ignore_File:
%       A txt file with EC numbers that should be ignored for the
%       signature and tailoring enzyme search.
%
% Optional:
% In addition to the mandatory input above, the input below can be given.
% The tag (like KeepTempFiles) always has to be followed by one of its
% options (for example 1) as the next argument.
% - 'KeepTempFiles', [0, 1]: Usually the switch is on 0, but you can
%                            set it to 1; then the temporary folder
%                            is not deleted. All other options are
%                            sent to the function of the hypothetical
%                            genes.
% - 'MaxClusterSize', [0 to n]: Maximal size of clusters allowed
%                               (number of genes represented in a
%                               cluster).
% - 'OutputFolder', 'string': Folder into which the ResultsFolder should
%                             be written. Needs a string as
%                             input.
% - 'TempFolder', 'string': Folder to which the temporary information
%                           about missing sequencing information
%                           should be written. Default is
%                           Temp_Get_Gaps_And_Modify_Gene_Lists.
% - 'TempPCFFolderName', 'string': Folder where the core of
%                                  PlantClusterFinder should be stored.
% - 'RandSeed', [0 to n]: Integer seed number for randomization
%                         purposes.
% - 'Verbose', [0, 1 or 2]: Prints more output for debugging.
% - 'OverwriteGap', [0, 1 (default)]: If set to zero, previous gap
%                                     information files are not
%                                     recomputed and replaced. Old ones
%                                     are used.
% - 'OverwriteHypo', [0, 1 (default)]: If set to zero, previous gene
%                                      position files populated with
%                                      hypothetical genes are not
%                                      recreated. Old ones are used.
% - 'SkipHypo', [0 (default), 1]: Do not insert hypothetical genes in
%                                 regions of the genome that have
%                                 sequencing gaps.
% - 'UseCore', [1.2 or 1.3]: Defines the core that is used (1.2 was the
%                            published core).
%
% Options of subfunctions: Hypothetical genes:
% Same usage as for the options above.
% - 'Switch', [0 or 1]: Identifies whether intermediate results
%                       should be loaded (1) or overwritten (0).
% - 'MaxGap', [0, or 1, ..., or n]: Identifies the maximal number of
%                                   genes that should be introduced
%                                   into a gap of extreme length.
% - 'Silent', [0 or 1]: Identifies whether some output should be
%                       given to the standard output (1) or not (0).
% - 'TempSave', [0 or 1]: Regulates whether temporary matfiles are
%                         written out.
% - 'GiveDirectOutput', [0 or 1]: Gives the lists of modified gene
%                                 position files to the standard
%                                 output.
% OUTPUT:
% - Result folder with temporary files to find the gaps encoded in
%   the genome files, stored in a folder named
%   Temp_Get_Gaps_And_Modify_Gene_Lists under the current directory.
%   The folder is deleted afterwards, unless deletion is prevented by
%   the option KeepTempFiles. You can change the name and the
%   location of the folder by using the flag TempFolder.
% - PlantClusterFinder result files stored in a results folder,
%   located in the folder from which PlantClusterFinder is run. See the
%   documentation of the former PlantClusterFinder by Taehyong Kim for
%   more details on the result files.
%
% CALL of the function:
% ./run_f_PlantClusterFinder.sh %Location of the MCR% "%Gene Position File Folder%" %Gene Position File Name% "%Genename Conversion Files Folder%" %Genename Conversion File Name% "%Genome Fasta File Folder%" %Genome Fasta File% "%MCL clustering Files Folder%" %MCL Clustering File% %PGDBName% "%gapsize1%, %gapsize2%, %gapsize3%" %Maximal number of intervening genes to be tested% "%Full path and filename to metacyc reactions.dat%" "%Full path and filename to plantcyc reactions.dat%" "%Path to location of pgdbs%" "%Full path and file name to the pathway Metabolic Domain Classification file%" "%Full path and file name to the reaction Metabolic Domain Classification file%" "%Full path and file name to the File containing the list of Superpathways%" "%Full path and file name to the File containing the list of Scaffold and Tailoring reactions%" "%Full path and file name to the File containing the list of ECs that should not be counted as Scaffold and Tailoring reactions%"
% or when run for multiple species:
% ./run_f_PlantClusterFinder.sh %Location of the MCR% "%Gene Position Files Folder%" "%Gene Position File Name1%;%Gene Position File Name2%" "%Genename Conversion Files Folder%" "%Genename Conversion File Name1%;%Genename Conversion File Name2%" "%Genome
Fasta Files Folder%" "%Genome Fasta File1%;%Genome Fasta File2%" "%MCL clustering Files Folder%" "%MCL Clustering File1%;%MCL Clustering File2%" "%PGDBName1%;%PGDBName2%" "%gapsize1%, %gapsize2%, %gapsize3%; %gapsize1%, %gapsize2%, %gapsize3%" %Maximal number of intervening genes to be tested% "%Full path and filename to metacyc reactions.dat%" "%Full path and filename to plantcyc reactions.dat%" "%Path to location of pgdbs%" "%Full path and file name to the pathway Metabolic Domain Classification file%" "%Full path and file name to the reaction Metabolic Domain Classification file%" "%Full path and file name to the File containing the list of Superpathways%" "%Full path and file name to the File containing the list of Scaffold and Tailoring reactions%" "%Full path and file name to the File containing the list of ECs that should not be counted as Scaffold and Tailoring reactions%"
%
% Example Call:
% For published versions: please look into
% run_PlantClusterFinder_v1_2_on_linux_batch.SPECIESNAME.sh, adapt the paths,
% and run the scripts. Some of the code depends on random number generation,
% and thus hypothetical gene insertion can change.
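%
% The call templates above can also be assembled programmatically. The
% following Python sketch is one possible way to do so, not part of the
% package: it joins the per-species values with semicolons in the argument
% order given under "CALL of the function". All folder names, file names,
% and species names in it are placeholders, and the function
% build_command is hypothetical. shlex.join requires Python 3.8 or later.

```python
import shlex


def build_command(run_script, mcr_root, folders, species, gap_size_end, shared_files):
    """Return the argument list for run_f_PlantClusterFinder.sh.

    Per-species values are joined with ';' as described in the INPUT section.
    """
    def joined(key):
        return ";".join(entry[key] for entry in species)

    return [
        run_script, mcr_root,
        folders["gene_position"], joined("gene_position_file"),
        folders["conversion"], joined("conversion_file"),
        folders["genome"], joined("genome_file"),
        folders["mcl"], joined("mcl_file"),
        joined("pgdb_name"),
        joined("gap_sizes"),
        str(gap_size_end),
        *shared_files,
    ]


# Placeholder values for two species; replace them with your own paths.
folders = {
    "gene_position": "/path/to/Gene_position_files/",
    "conversion": "/path/to/Gene_name_conversion_files/",
    "genome": "/path/to/Genomes/",
    "mcl": "/path/to/MCL_clustering_results/",
}
species = [
    {"gene_position_file": "speciesA_positions.txt",
     "conversion_file": "speciesA_conversion.txt",
     "genome_file": "speciesA.fa",
     "mcl_file": "speciesA.mcl",
     "pgdb_name": "speciesacyc",
     "gap_sizes": "5000, 50000, 100000"},
    {"gene_position_file": "speciesB_positions.txt",
     "conversion_file": "speciesB_conversion.txt",
     "genome_file": "speciesB.fa",
     "mcl_file": "speciesB.mcl",
     "pgdb_name": "speciesbcyc",
     "gap_sizes": "5000, 50000, 100000"},
]
# Order: metacyc reactions, plantcyc reactions, PGDB folder, pathway domains,
# reaction domains, superpathways, scaffold/tailoring list, ignore list.
shared_files = [
    "/path/to/Input_files/reactions_metacyc.dat",
    "/path/to/Input_files/reactions_plantcyc.dat",
    "/path/to/Pgdbs/",
    "/path/to/Input_files/PathwayMetabolicDomainClassification.txt",
    "/path/to/Input_files/ReactionMetabolicDomainClassification.txt",
    "/path/to/Input_files/SuperPathwayList.txt",
    "/path/to/Input_files/scaffold-tailoring-reactions.tab",
    "/path/to/Input_files/scaffold-tailoring-reactions-not.list",
]

command = build_command(
    "./run_f_PlantClusterFinder.sh", "/path/to/MATLAB_Runtime/v91",
    folders, species, 5, shared_files,
)
# Print a shell-safe version of the command line for inspection.
print(shlex.join(command))
```

% Passing the list to subprocess.run(command) starts the script without an
% intermediate shell, so the semicolons reach PlantClusterFinder unchanged.
% When the command is typed into a shell instead, every value that contains
% a semicolon has to be quoted, which is what shlex.join does in the printed
% line.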
% In general:
% ./run_f_PlantClusterFinder.sh /share/apps/MATLAB/MATLAB_Runtime/v90 "/Volumes/DPB/Data/Shared/Labs/Rhee/Everyone/GeneClusters/PlantClusterFinder_v1_1/Gene_position_files/" mart_export_mtruncatulacyc.txt "/Volumes/DPB/Data/Shared/Labs/Rhee/Everyone/GeneClusters/PlantClusterFinder_v1_1/Gene_name_conversion_files/" Gene_name_conversion_mtruncatulacyc.txt "/Volumes/DPB/Data/Shared/Labs/Rhee/Everyone/GeneClusters/PlantClusterFinder_v1_1/Genomes/" Mtruncatula_285_Mt4.0.fa "/Volumes/DPB/Data/Shared/Labs/Rhee/Everyone/GeneClusters/PlantClusterFinder_v1_1/MCL_clustering_results/" dump.out.mtruncatulacyc.mci.I20 mtruncatulacyc "5000, 50000, 100000" 5 "/Volumes/DPB/Data/Shared/Labs/Rhee/Everyone/GeneClusters/PlantClusterFinder_v1_1/Input_files/reactions_metacyc20.5.dat" "/Volumes/DPB/Data/Shared/Labs/Rhee/Everyone/GeneClusters/PlantClusterFinder_v1_1/Input_files/reactions_plantcyc12.0.dat" "/Volumes/DPB/Data/Shared/Labs/Rhee/Everyone/GeneClusters/PlantClusterFinder_v1_1/Pgdbs/" "/Volumes/DPB/Data/Shared/Labs/Rhee/Everyone/GeneClusters/PlantClusterFinder_v1_1/Input_files/PathwayMetabolicDomainClassification.txt" "/Volumes/DPB/Data/Shared/Labs/Rhee/Everyone/GeneClusters/PlantClusterFinder_v1_1/Input_files/ReactionMetabolicDomainClassification.txt" "/Volumes/DPB/Data/Shared/Labs/Rhee/Everyone/GeneClusters/PlantClusterFinder_v1_1/Input_files/SuperPathwayList.txt" "/Volumes/DPB/Data/Shared/Labs/Rhee/Everyone/GeneClusters/PlantClusterFinder_v1_1/Input_files/scaffold-tailoring-reactions.tab" "/Volumes/DPB/Data/Shared/Labs/Rhee/Everyone/GeneClusters/PlantClusterFinder_v1_1/Input_files/scaffold-tailoring-reactions-not.list"
% or when run for multiple species (values joined by semicolons are quoted
% so that the shell does not treat ";" as a command separator):
% ./run_f_PlantClusterFinder.sh /share/apps/MATLAB/MATLAB_Runtime/v90 "/Volumes/DPB/Data/Shared/Labs/Rhee/Everyone/GeneClusters/PlantClusterFinder_v1_1/Gene_position_files/" "mart_export_mtruncatulacyc.txt;mart_export_mtruncatulacyc.txt" "/Volumes/DPB/Data/Shared/Labs/Rhee/Everyone/GeneClusters/PlantClusterFinder_v1_1/Gene_name_conversion_files/" "Gene_name_conversion_mtruncatulacyc.txt;Gene_name_conversion_mtruncatulacyc.txt" "/Volumes/DPB/Data/Shared/Labs/Rhee/Everyone/GeneClusters/PlantClusterFinder_v1_1/Genomes/" "Mtruncatula_285_Mt4.0.fa;Mtruncatula_285_Mt4.0.fa" "/Volumes/DPB/Data/Shared/Labs/Rhee/Everyone/GeneClusters/PlantClusterFinder_v1_1/MCL_clustering_results/" "dump.out.mtruncatulacyc.mci.I20;dump.out.mtruncatulacyc.mci.I20" "mtruncatulacyc;mtruncatulacyc" "5000, 50000, 100000;5000, 50000, 100000" 5 "/Volumes/DPB/Data/Shared/Labs/Rhee/Everyone/GeneClusters/PlantClusterFinder_v1_1/Input_files/reactions_metacyc20.5.dat" "/Volumes/DPB/Data/Shared/Labs/Rhee/Everyone/GeneClusters/PlantClusterFinder_v1_1/Input_files/reactions_plantcyc12.0.dat" "/Volumes/DPB/Data/Shared/Labs/Rhee/Everyone/GeneClusters/PlantClusterFinder_v1_1/Pgdbs/" "/Volumes/DPB/Data/Shared/Labs/Rhee/Everyone/GeneClusters/PlantClusterFinder_v1_1/Input_files/PathwayMetabolicDomainClassification.txt" "/Volumes/DPB/Data/Shared/Labs/Rhee/Everyone/GeneClusters/PlantClusterFinder_v1_1/Input_files/ReactionMetabolicDomainClassification.txt" "/Volumes/DPB/Data/Shared/Labs/Rhee/Everyone/GeneClusters/PlantClusterFinder_v1_1/Input_files/SuperPathwayList.txt" "/Volumes/DPB/Data/Shared/Labs/Rhee/Everyone/GeneClusters/PlantClusterFinder_v1_1/Input_files/scaffold-tailoring-reactions.tab" "/Volumes/DPB/Data/Shared/Labs/Rhee/Everyone/GeneClusters/PlantClusterFinder_v1_1/Input_files/scaffold-tailoring-reactions-not.list"
%
%
% MATLAB Runtime
%
% Download the Linux 64-bit version of the MATLAB Runtime for R2016b
% from the MathWorks Web site by navigating to
%
%    http://www.mathworks.com/products/compiler/mcr/index.html
%
%
% For more information about the MATLAB Runtime and the MATLAB Runtime installer, see
% Package and Distribute in the MATLAB Compiler documentation
% in the MathWorks Documentation Center.
%
%
% Files in the Package
%
%
% -f_PlantClusterFinder
% -run_f_PlantClusterFinder.sh (shell script for temporarily setting environment variables
%  and executing the application)
%    -to run the shell script, type
%
%        ./run_f_PlantClusterFinder.sh <mcr_directory> <argument_list>
%
%     at the Linux or Mac command prompt. <mcr_directory> is the directory
%     where version 9.1 of the MATLAB Runtime is installed or the directory where
%     MATLAB is installed on the machine. <argument_list> is all the
%     arguments you want to pass to your application. For example,
%
%     If you have version 9.1 of the MATLAB Runtime installed in
%     /mathworks/home/application/v91, run the shell script as:
%
%        ./run_f_PlantClusterFinder.sh /mathworks/home/application/v91
%
%     If you have MATLAB installed in /mathworks/devel/application/matlab,
%     run the shell script as:
%
%        ./run_f_PlantClusterFinder.sh /mathworks/devel/application/matlab
%
% Addendum:
% A. Linux x86-64 systems:
% In the following directions, replace MCR_ROOT by the directory where the MATLAB Runtime
% is installed on the target machine.
%
% (1) Set the environment variable XAPPLRESDIR to this value:
%
%     MCR_ROOT/v91/X11/app-defaults
%
%
% (2) If the environment variable LD_LIBRARY_PATH is undefined, set it to the concatenation
%     of the following strings:
%
%     MCR_ROOT/v91/runtime/glnxa64:
%     MCR_ROOT/v91/bin/glnxa64:
%     MCR_ROOT/v91/sys/os/glnxa64:
%     MCR_ROOT/v91/sys/opengl/lib/glnxa64
%
%     If it is defined, set it to the concatenation of these strings:
%
%     ${LD_LIBRARY_PATH}:
%     MCR_ROOT/v91/runtime/glnxa64:
%     MCR_ROOT/v91/bin/glnxa64:
%     MCR_ROOT/v91/sys/os/glnxa64:
%     MCR_ROOT/v91/sys/opengl/lib/glnxa64
%
% For more detailed information about setting the MATLAB Runtime paths, see Package and
% Distribute in the MATLAB Compiler documentation in the MathWorks Documentation Center.
%
%
%
% NOTE: To make these changes persistent after logout on Linux
%       or Mac machines, modify the .cshrc file to include this
%       setenv command.
% NOTE: The environment variable syntax utilizes forward
%       slashes (/), delimited by colons (:).
% NOTE: When deploying standalone applications, it is possible
%       to run the shell script file run_f_PlantClusterFinder.sh
%       instead of setting environment variables. See the
%       section "Files in the Package" above.
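%
% The two environment variables from the Addendum can be composed as in the
% following Python sketch. It is a minimal illustration that follows the
% directory layout listed above (MCR_ROOT/v91/...); the function name
% mcr_environment and the example root /opt/mcr are hypothetical.

```python
import os


def mcr_environment(mcr_root, version="v91", existing=None):
    """Return XAPPLRESDIR and LD_LIBRARY_PATH as listed in the Addendum.

    `existing` is the current LD_LIBRARY_PATH, if any; when it is set, it is
    placed in front of the MATLAB Runtime directories.
    """
    base = f"{mcr_root}/{version}"
    parts = [
        f"{base}/runtime/glnxa64",
        f"{base}/bin/glnxa64",
        f"{base}/sys/os/glnxa64",
        f"{base}/sys/opengl/lib/glnxa64",
    ]
    if existing:
        parts.insert(0, existing)
    return {
        "XAPPLRESDIR": f"{base}/X11/app-defaults",
        "LD_LIBRARY_PATH": ":".join(parts),
    }


# Example: extend a copy of the current environment for a child process.
env = dict(os.environ)
env.update(mcr_environment("/opt/mcr", existing=os.environ.get("LD_LIBRARY_PATH")))
print(env["XAPPLRESDIR"])
print(env["LD_LIBRARY_PATH"])
```

% The resulting dictionary can be handed to a child process (for example
% via the env argument of subprocess.run) so that the variables are set only
% for that call and the login environment stays untouched.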