FISHtrees Quickstart
================================================================

FISHtrees 3 (currently version 3.2.0) is a software program that takes as input copy-number counts of fluorescence in situ hybridization (FISH) probes obtained from individual cells in a tumor sample. It applies phylogenetic algorithms to infer a likely evolutionary tree from these data and uses these trees to estimate the probability of different types of mutations. It can also analyze consensus between trees generated from tumor samples taken in different states; for instance, primary vs. metastatic cancer. For further background see the (* BACKGROUND) section and the references therein.

FISHtrees runs as a command line utility. It was written in standard-conforming C++, but was primarily developed on GNU/Linux systems using GCC as a compiler. The installation instructions reflect this history -- FISHtrees is usually built using a Unix-style toolchain and a Makefile. We have verified that FISHtrees works on recent versions of Linux and MacOS (which is a Unix-like system).

A prerequisite for FISHtrees is the SCIP (http://scip.zib.de/) library. SCIP must be built before FISHtrees can be built. See the SCIP website for instructions -- additional instructions are given in the (* SCIP INSTALLATION) section below.

The FISHtrees source is distributed as a .tar.gz file. The traditional utility for extracting the contents of such an archive file is named 'tar' and may be run from the command line

    tar zxf FISHtrees...tar.gz

where '..' is the version number -- the version number is currently 3.2.0. However, many graphical user interfaces also understand this archive type, so double-clicking on the tar.gz file's icon may suffice to extract the contents.

To build FISHtrees, the Makefile system must know where to find SCIP. By default, we assume it is located at '../scip'.
In other words, the default setup is such that a directory listing would show the folders

    scip/  FISHtrees3.2.0/

as siblings. For instructions on how to edit the Makefile source files to find another version of SCIP, or SCIP installed in a different location, see the (* SCIP INSTALLATION) section below.

FISHtrees may then be built using the following commands.

    cd FISHtrees3.2.0
    make

The resulting executable is named 'fish'. It may be run using a command line similar to

    ./fish P1/ CEP7 LAMP3 CEP7 PROX1 4

that will analyze the data files within the directory P1/ for the FISH probes CEP7, LAMP3 and PROX1 using two different methods. The format of the data files is described in the (* INPUT) section. The probe name CEP7 is repeated on the example command line to indicate that it serves as a probe of the ploidy of the cell. The order in which the probes are specified is important. In particular, the ploidy probe must come before each gene probe. The number '4' selects the analysis mode to be used; '4' is the recommended value.

The results of the analyses are printed by the program and sent to several output files (see the (* OUTPUT FILES) section below). The program also generates GraphViz 'dot' files that may be used to visualize the generated trees. This is discussed in more detail in the (* VISUALIZING GRAPHS) section below.

An additional documentation file, called README.code, explains the structure of the source code files for the FISHtrees package.

* BACKGROUND
================================================================

Phylogenetic algorithms, which infer evolutionary histories from profiles of discrete species, have previously proven a powerful tool for interpreting patterns of tumor evolution from profiles of tumors or tumor cells. Pennington et al.
[1] developed computational approaches to study evolution within individual tumors, using FISH profiles of DNA copy numbers to label discrete cells within tumors and then inferring likely evolutionary trees between cell lineages within tumors.

"FISHtrees" [2] is a software program:
1) to parse copy numbers of multiple Fluorescence In Situ Hybridization (FISH) probe signals in a single-cell assay;
2) to analyze probe signals across multiple single-cell assays in a patient data file;
3) to model tumorigenesis as a tree of tumor progression paths for a given probe on a given patient data file;
4) to merge multiple tumorigenesis trees to represent multiple probes on a given patient data file;
5) to combine multiple joint tumorigenesis trees, representing multiple patient data files, into a consensus network, that is, a graph of tumor progression pathways.

The development of FISHtrees was stimulated by the desire to analyze tumor progression on datasets such as those for cervical cancer [6], breast cancer [7], prostate cancer [8], and tongue cancer [9], collected by Kerstin Heselmeyer-Haddad, Darawalee Wangsa and colleagues in the laboratory of Thomas Ried, National Cancer Institute/NIH.

FISHtrees includes two classes of methods that differ in whether they incorporate an estimate of the ploidy, which is the mode for the number of copies of the autosomes. Below, these classes of methods are described as "ploidy-based" [1, 5] and "ploidyless" [2, 3, 4].

In 2014, there were substantial improvements and extensions in the ploidyless methods beyond the algorithms described in [1, 2, 3]. Substantial changes made in 2014, and described in [4], after the release of August 2014 are marked with [new_end2014]. In 2015-2016, substantial additions were made to the ploidy-based and consensus tree methods, as described in [5], and these are labeled [new_2016].
The current release series, beginning with version 3.0, incorporates advances in ploidy-based and consensus methods described in [5] as well as advances in ploidyless methods described in [4]. Version 3.1, released in May 2017, improves the speed of the high weight branching algorithm. Version 3.2 improves the output content of the program, and enables maximum copy numbers greater than 9, while setting the default maximum copy number to 10. The relevant constant is MAX_COPY, which is defined in fish.h.

We refer to the package as FISHtrees and to the executable program as fish.

Search for the symbol * to navigate to different sections of this README.

============================================================================
* ORGANIZATION OF THE FTP SITE

FISHtrees is available from

    ftp://ftp.ncbi.nlm.nih.gov/pub/FISHtrees

This file is README and is at the top level.

To build and run FISHtrees, one must first install SCIP, which is distributed under its own license, free for non-commercial use. SCIP may be obtained from

    http://scip.zib.de

The FISHtrees code was written in part by United States government employees as part of the authors' official duties and so cannot be copyrighted. The source code of FISHtrees is provided.

In the subdirectory code/ there is a tar gzip'd archive FISHtrees...tar.gz containing all the source code and a Makefile.

In the subdirectory data/ are tar archives

    cervical_cancer_data_filtered.tar
    breast_cancer_data_filtered.tar

for the cervical cancer [6] and breast cancer data sets [7]. The word "filtered" refers to a method of filtering out a small proportion of the cells in each data file because they are suspected to have cut nuclei. The filtering method was described in [7], but the filtering program itself is not part of the FISHtrees software distribution.
============================================================================
* INPUT

Any file in the specified patient directory that includes the substring 'txt' in the file name is treated as a patient file. Each patient data file is in tab-delimited columns with two header rows.

The first row gives the number of ploidy (chromosome) probes followed by their names, then the number of gene probes followed by their names. The second row gives the number of distinct cell count patterns followed by the total number of cells. Each remaining row describes one cell count pattern: the copy number of each probe, in the order given on the first row, followed by the number of cells that show this pattern. There is one row per distinct cell count pattern, so the cell counts in the last column sum to the total number of cells given on the second row.

The upper part of a patient data file looks as follows, with a possible variant in the second row:

    1 CEP7 4 LAMP3 PROX1 PRKAA1 CCND1
    112 249
    1 0 3 2 1 1
    1 0 4 1 1 1
    1 1 1 1 2 3
    1 1 1 2 3 1
    1 1 2 1 1 1
    1 1 2 2 2 2
    1 1 3 0 1 1
    1 1 3 1 2 1
    1 1 3 2 1 1
    1 1 3 2 2 1
    1 1 6 4 0 1
    1 2 0 1 1 1
    1 2 1 0 1 1
    1 2 1 1 0 1
    1 2 1 1 1 1
    1 2 1 2 2 2
    1 2 2 0 1 1

followed possibly by more rows of cell types.

The second row can also look like:

    112 249 3 1 5 11

where the 3rd, 4th,... entries are the chromosomes containing the gene probes. When the chromosome location information is included, it can be used to detect if two genes are on the same chromosome. The recognition that two genes are on the same chromosome is useful for incorporating chromosome gains and losses into the tree progression models.

If one included the chromosome information, then the first two lines would instead look like this:

    1 CEP7 4 LAMP3 PROX1 PRKAA1 CCND1
    112 249 3 1 5 11

On the second row
    112 indicates that there are 112 distinct cell count patterns
    249 indicates that there are 249 cells evaluated in the sample
    3 indicates that the gene LAMP3 is on human chromosome 3
    1 indicates that the gene PROX1 is on human chromosome 1
    5 indicates that the gene PRKAA1 is on human chromosome 5
    11 indicates that the gene CCND1 is on human chromosome 11

It is expected that there are data on multiple patients. Each sample corresponds to exactly one file; a single patient may have multiple files, if the patient was sampled multiple times. The input files should be put in a separate directory; that directory is one of the parameters input to the program.

----------------------------------------------------------------------------
* SCIP INSTALLATION

As part of the major changes in FISHtrees version 3, the fish program now uses the package SCIP (scip.zib.de), which must be installed to build and run fish.
Previous versions of FISHtrees, through 2.3, used the GNU Linear Programming Kit (GLPK) package, but that package is no longer used by FISHtrees.

FISHtrees has been built and tested with SCIP v6.0.2 and some earlier versions. Over time, the SCIP developers have changed build instructions and the names of the generated libraries between versions of SCIP and the FISHtrees instructions have changed accordingly.

SCIP requires that the GNU Multiple Precision Arithmetic Library (GMP) be installed on a system. On Linux systems, this is usually a non-issue -- it already is installed by default. On MacOS, it must be installed. We recommend using Homebrew (https://brew.sh) to do the installation, but Fink, MacPorts or simply installing GMP oneself are also viable options.

For SCIP version 6.0.2, the recommended build instructions use the CMake system, which is available for most operating systems. CMake is invariably supplied by Linux package managers and MacOS package managers (e.g. Homebrew), but may also be downloaded from https://cmake.org.

SCIP requires users to register one time. On the SCIP website, go to Downloads and download the current SCIP Optimization Suite source whose filename should look like scipoptsuite-...tgz, where '..' is the current version number. Place this archive in the same directory as the FISHtrees folder, not in the FISHtrees folder. Then do:

    tar zxf scipoptsuite-...tgz
    mkdir scip

A directory listing should show the folders

    scip/  FISHtrees/  scipoptsuite-../

For the purpose of building SCIP, create a shell variable pointing to the installation directory.

    SCIPOPTDIR=$PWD/scip

The variable SCIPOPTDIR needs to be created only once, and is only used in the call to `cmake` below. Then

    cd scipoptsuite-..
    mkdir build
    cd build
    cmake -DCMAKE_C_STANDARD=11 -DCMAKE_INSTALL_PREFIX=$SCIPOPTDIR -D SHARED=off ..
Note that we are building a static version of the SCIP libraries -- a version that will ultimately be embedded via software library linking in the `fish` executable. This embedding results in an executable that may be redistributed only according to the SCIP licence; see the licence information within SCIP or on the SCIP website.

If cmake works, then run

    make
    make install

In this document, we present instructions on how to build a static (not shared) version of SCIP in a specific location. In principle, one may build a shared version of the SCIP libraries and then build FISHtrees using this shared version. However, details about how to build and install shared libraries differ between operating systems, and even between versions of the same operating system. By compiling a static version, the user does not need to understand how shared libraries work.

----------------------------------------------------------------------------
* CODE COMPILATION

Typically, the 'make' utility is aware of the location of an appropriate C++ compiler. If this is not the case on your system, insert an (unindented) line similar to

CXX = g++

at the top of the Makefile. Replace g++ with the compiler appropriate to your system.

One can choose among four compilation modes in Makefile as follows:
1) default mode
   CXXFLAGS = -O -Wall -Wextra --pedantic-errors -Wno-long-long
2) permissive mode
   CXXFLAGS = -O -Wno-long-long
3) debug mode
   CXXFLAGS = -g -Wall -Wextra --pedantic-errors -Wno-long-long
4) profile mode
   CXXFLAGS = -pg -O -Wno-long-long

Users are likely to be interested in only the default mode. Changes in compilation mode are made by uncommenting the line of the desired mode and commenting out the lines of the other three modes in Makefile. In the Makefile, the symbol '#' indicates that everything to the right of the '#' on the same line is a comment. Thus, one comments out a line by inserting a '#' on the far left.
----------------------------------------------------------------------------
To compile and build the executable, run the command line utility

    make

After 'make' has finished successfully, one may optionally recover some disk space by typing

    make clean

Sometimes, one must type 'gmake' or 'nmake' instead of 'make'.

Compilation and execution have been tested on Linux and MacOS.

============================================================================
* COMMAND-LINE SYNTAX and USAGE

Usage: fish DIRECTORY PLOIDY_PROBE GENE_PROBE ... STRATEGY

with one PLOIDY_PROBE GENE_PROBE pair per gene probe. DIRECTORY can be an absolute or relative path ending with the system-dependent directory separator '/' or '\'. DIRECTORY and all arguments that follow it are case-sensitive. It is assumed that the only files in DIRECTORY are input files, one file per sample. A patient may have multiple samples.

[new_end2014] The rightmost argument, STRATEGY, can take the values

    -4, -3, -2, -1, 0, 1, 2, 3, 4

If the strategy is nonnegative, the ploidy-based method [5] will be employed. If the strategy is nonzero, a ploidyless method will be employed, with the absolute value of STRATEGY indicating the ploidyless strategy to use, as follows.
1) for the exact method in [2]
2) for the heuristic method without genome duplication in [2]
3) for the unweighted heuristic method with genome duplication and chromosome gains/losses as in [3]
4) for the weighted heuristic method with genome duplication and chromosome gains/losses in [4]

The running time of the exact method rises superexponentially with the number of probes.

To use modes with absolute value 3 or 4 to best effect, the input files should contain the chromosomes for the gene probes on the second line. If the chromosome information is missing, the program assumes that all genes are on different chromosomes.
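The argument order and the meaning of the strategy value described above can be sketched in a few lines of Python. This is only an illustration and is not part of the FISHtrees distribution: the helper names build_fish_args and describe_strategy are hypothetical, and the sketch assumes a single ploidy probe that is repeated before every gene probe, as in the examples in this README.

```python
# Illustrative sketch only; these helpers are NOT part of FISHtrees.
# Assumption: one ploidy probe, repeated before each gene probe.

PLOIDYLESS_METHODS = {
    1: "exact method [2]",
    2: "heuristic without genome duplication [2]",
    3: "unweighted heuristic with genome duplication "
       "and chromosome gains/losses [3]",
    4: "weighted heuristic with genome duplication "
       "and chromosome gains/losses [4]",
}


def build_fish_args(directory, ploidy_probe, gene_probes, strategy):
    """Return an argv-style list for the fish executable."""
    if strategy not in range(-4, 5):
        raise ValueError("strategy must be an integer from -4 to 4")
    args = ["./fish", directory]
    for gene in gene_probes:
        # The ploidy probe must come before each gene probe.
        args += [ploidy_probe, gene]
    args.append(str(strategy))
    return args


def describe_strategy(strategy):
    """Return (runs_ploidy_based, ploidyless_method_or_None)."""
    runs_ploidy_based = strategy >= 0                  # nonnegative
    ploidyless = PLOIDYLESS_METHODS.get(abs(strategy)) # None when 0
    return runs_ploidy_based, ploidyless


if __name__ == "__main__":
    print(" ".join(build_fish_args("P1/", "CEP7", ["LAMP3", "PROX1"], 4)))
    print(describe_strategy(-4))
```

Under these assumptions, a strategy of 4 selects both the ploidy-based method and the weighted ploidyless heuristic, -4 selects only the latter, and 0 selects only the ploidy-based method.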
----------------------------------------------------------------------------
For example, the command:

    ./fish P1/ CEP7 LAMP3 CEP7 PROX1 CEP7 PRKAA1 4

reads all patient data files under the P1/ sub-directory, uses CEP7 as a reference chromosome probe, analyzes patterns of three gene probes {LAMP3, PROX1, PRKAA1}, and calculates the tumorigenesis trees and graphs as the results. A more advanced heuristic method is used that allows for genome duplication and chromosome gains/losses, with weights, due to the 4 on the far right. Because the 4 is positive, the ploidy-based method will also be run and produce separate output files. To do only the ploidyless part of this run, use instead

    ./fish P1/ CEP7 LAMP3 CEP7 PROX1 CEP7 PRKAA1 -4

Use 3 instead of 4 for an unweighted version of this heuristic. Unweighted in this context means each type of change has the same probability, regardless of the gene. We showed in reference [4] that mode 4 (or -4) is better than mode 3 (or -3), so mode 3 (or -3) is retained mostly for backwards compatibility.

The command

    ./fish P1/ CEP7 LAMP3 CEP7 PROX1 CEP7 PRKAA1 2

parses all patient data files under the P1/ sub-directory, uses CEP7 as a reference chromosome probe, analyzes patterns of three gene probes {LAMP3, PROX1, PRKAA1}, and calculates the tumorigenesis trees and graphs as the results. The original heuristic method is used for the ploidyless trees and the ploidy-based method is also run due to the positive 2 on the far right.

    ./fish P1/ CEP7 LAMP3 CEP7 CCND1 -1

parses all patient data files under the P1/ sub-directory, uses CEP7 as a reference chromosome probe, analyzes patterns of two gene probes {LAMP3, CCND1}, and calculates the tumorigenesis trees and graphs as the results. Only the exact method is used for the ploidyless trees due to the -1 on the far right.
    ./fish P1/ CEP7 LAMP3 CEP7 CCND1 0

parses all patient data files under the P1/ sub-directory, uses CEP7 as a reference chromosome probe, analyzes patterns of two gene probes {LAMP3, CCND1}, and calculates the tumorigenesis trees and graphs as the results. Only the ploidy-based method is used due to the 0 on the far right.

============================================================================
* EXPERIMENTAL REORDERING OF GENES

Modes -4, ..., -1 for any number of gene probes, and modes 0, ..., 4 for two gene probes, are minimally sensitive to the order of the input genes. That is, no order is objectively better, and the answers should be similar among reorderings of the genes, differing only due to numeric precision or when ties are broken arbitrarily within the algorithm.

For modes 0, ..., 4 and more than two genes, the ploidy-based trees do explicitly depend on the order of the genes. By default, FISHtrees versions 3.x.x use the genes in the order provided by the user on the command line. A command line option '--choose-gene-order' (which may appear anywhere after './fish' on the command line) causes FISHtrees to use a heuristic that reorders genes to attempt to produce a better tree. The details of this heuristic are presented in [5]. This heuristic is experimental, and its implementation may change in future releases.

============================================================================
* OUTPUT FILES

The fish program writes three varieties of files in the DOT format (http://www.graphviz.org/doc/info/lang.html).
1) single probe trees
   FILE.PROBE.dot in ploidy-based modeling
2) joint trees
   FILE.PROBE+...PROBE.dot in ploidy-based modeling and
   FILE.ploidyless.dot in ploidyless modeling
3) consensus graphs
   DIR.ploidy-based.THRESHOLD.dot in ploidy-based modeling and
   DIR.ploidy-less.THRESHOLD.dot in ploidyless modeling

Output files are written to the directory from which the fish command is issued. This directory should be different from the directory in which input files are stored because fish assumes that every file in the input file directory is an input file. FILE is the name of the input file and PROBE is the name of a gene probe; DIR and THRESHOLD correspond to P1 and to 0.05 or 1 in the example below. Since suffixes are appended in the output files, the output files will have different names than the input files.

----------------------------------------------------------------------------
For example, the command:

    ./fish P1/ CEP7 LAMP3 CEP7 PROX1 4

produces the following 12 output files

    C1.txt.LAMP3.dot         : ploidy-based single probe tree
    C1.txt.PROX1.dot         : ploidy-based single probe tree
    C1.txt.LAMP3+PROX1.dot   : ploidy-based joint tree
    C1.txt.ploidyless.dot    : ploidyless joint tree
    C3.txt.LAMP3.dot         : ploidy-based single probe tree
    C3.txt.PROX1.dot         : ploidy-based single probe tree
    C3.txt.LAMP3+PROX1.dot   : ploidy-based joint tree
    C3.txt.ploidyless.dot    : ploidyless joint tree
    P1.ploidy-based.0.05.dot : ploidy-based consensus graph
    P1.ploidy-based.1.dot    : ploidy-based consensus graph
    P1.ploidy-less.0.05.dot  : ploidyless consensus graph
    P1.ploidy-less.1.dot     : ploidyless consensus graph

The files with 0.05 are "union" consensus graphs in which edges for either C1 or C3 are shown. The files with 1 are "intersection" consensus graphs in which only edges shared by the two trees for C1 and C3 are shown. The constant 0.05 is #defined as BOUND_T in fish.h. The constant 1.0 is currently hard-coded in fish.cpp.

If instead one used

    ./fish P1/ CEP7 LAMP3 CEP7 PROX1 -4

then only the subset of files

    C1.txt.ploidyless.dot    : ploidyless joint tree
    C3.txt.ploidyless.dot    : ploidyless joint tree
    P1.ploidy-less.0.05.dot  : ploidyless consensus graph
    P1.ploidy-less.1.dot     : ploidyless consensus graph

are produced.
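The naming scheme can be checked against the two examples above with a small Python sketch. It is not part of FISHtrees; the function name expected_outputs is hypothetical, and two points are assumptions rather than documented behavior: that joint-tree names join all gene probes with '+' when more than two genes are given, and that strategy 0 yields only the ploidy-based files.

```python
# Illustrative sketch only; NOT part of FISHtrees.
# Assumptions: gene probes are joined with '+' in joint-tree names,
# and the consensus thresholds are the 0.05 and 1 shown in the example.

def expected_outputs(input_files, gene_probes, dir_name, strategy,
                     thresholds=("0.05", "1")):
    """List the DOT file names a fish run is expected to write."""
    ploidy_based = strategy >= 0   # nonnegative strategy
    ploidyless = strategy != 0     # nonzero strategy
    names = []
    for f in input_files:
        if ploidy_based:
            # One single probe tree per gene, then one joint tree.
            names += [f"{f}.{gene}.dot" for gene in gene_probes]
            names.append(f"{f}.{'+'.join(gene_probes)}.dot")
        if ploidyless:
            names.append(f"{f}.ploidyless.dot")
    if ploidy_based:
        names += [f"{dir_name}.ploidy-based.{t}.dot" for t in thresholds]
    if ploidyless:
        names += [f"{dir_name}.ploidy-less.{t}.dot" for t in thresholds]
    return names


if __name__ == "__main__":
    for name in expected_outputs(["C1.txt", "C3.txt"],
                                 ["LAMP3", "PROX1"], "P1", 4):
        print(name)
```

With input files C1.txt and C3.txt, this reproduces the 12 names listed for strategy 4 and the 4 names listed for strategy -4.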
============================================================================
* VISUALIZING GRAPHS

The fish program writes calculated single probe and joint trees (as graphs) and consensus networks (as graphs) in the DOT Language, which is defined at http://www.graphviz.org/doc/info/lang.html.

There are at least three ways to visualize such graphs:
1) Graphviz is an open source graph visualization package, which can be downloaded from http://www.graphviz.org/. [10]
2) dotty is a visualization program on most UNIX and Linux systems. The command is:
   dotty FILE.dot
3) dot takes a DOT file as the input and converts it into a supported format listed at http://www.graphviz.org/content/output-formats/. The command is:
   dot -TFORMAT FILE.dot -o FILE.FORMAT
   The following example converts a DOT file into a corresponding PNG file.
   dot -Tpng C1.txt.LAMP3.dot -o C1.txt.LAMP3.png

============================================================================
* COMMENTS

Send any comments, questions or complaints about fish to: Alejandro Schaffer and E. Michael Gertz and Russell Schwartz.

Most of the code in the FISHtrees package for the executable program fish was implemented by Salim Akhter Chowdhury, E. Michael Gertz, and Adam Lee.

Please cite references [2,5] if you publish results using FISHtrees.

============================================================================
* BIBLIOGRAPHY

References:

[1] Pennington G, Smith CA, Shackney S, Schwartz R: Reconstructing tumor phylogenies from heterogeneous single-cell data. J Bioinform Comput Biol 2007, 5:407-427.

[2] Chowdhury SA, Shackney SE, Heselmeyer-Haddad K, Ried T, Schaffer AA, Schwartz R: Phylogenetic analysis of multiprobe fluorescence in situ hybridization data from tumor cell populations. Bioinformatics 29:i189--i198, 2013.

[3] Chowdhury SA, Shackney SE, Heselmeyer-Haddad K, Ried T, Schaffer AA, Schwartz R: Algorithms to model single gene, single chromosome, and whole genome copy number changes jointly in tumor phylogenetics. PLoS Comput Biol.
2014, 10:e1003740.

[4] Chowdhury SA, Gertz EM, Wangsa D, Heselmeyer-Haddad K, Ried T, Schaffer AA, Schwartz R: Inferring models of multiscale copy number evolution for single-tumor phylogenetics. Bioinformatics 2015, 31:i258--i267.

[5] Gertz EM, Chowdhury SA, Lee W-J, Wangsa D, Heselmeyer-Haddad K, Ried T, Schwartz R, Schaffer AA: FISHtrees 3.0: Tumor phylogenetics using a ploidy probe. PLoS One. 2016 Jun 30;11(6).

[6] Wangsa D, Heselmeyer-Haddad K, Ried P, Eriksson E, Schaffer AA, Morrison L, Luo J, Auer G, Munck-Wikland E, Ried T, Lundqvist EA: FISH markers for detection of cervical lymph node metastases. Am J Pathol 2009, 175:2637-2645.

[7] Heselmeyer-Haddad K, Berroa Garcia LY, Bradley A, Ortiz-Melendez C, Lee W-J, Christensen R, Prindiville SA, Calzone KA, Soballe PW, Hu Y, Chowdhury SA, Schwartz R, Schaffer AA, Ried T: Single-cell genetic analysis of ductal carcinoma in situ and invasive breast cancer reveals enormous tumor heterogeneity, yet conserved genomic imbalances and gain of MYC during progression. Am J Pathol 2012, 181:1807-1822.

[8] Heselmeyer-Haddad K, Berroa Garcia LY, Bradley A, Hernandez L, Hu Y, Habermann JK, Dumke C, Thorns C, Pestova E, Burke C, Chowdhury SA, Schwartz R, Schaffer AA, Paris PL, Ried T: Single-Cell genetic analysis reveals insights into clonal development of prostate cancers and suggests loss of PTEN as a marker of poor prognosis. Am J Pathol 2014, 184:2671-2686.

[9] Wangsa D, Chowdhury SA, Ryott M, Gertz EM, Elmberger G, Auer G, Lundqvist EA, Kuffer S, Strobel P, Schaffer AA, Schwartz R, Munck-Wikland E, Ried T, Heselmeyer-Haddad K: Phylogenetic analysis of multiple FISH markers in oral tongue squamous cell carcinoma suggests that a diverse distribution of copy number changes is associated with poor prognosis. Int J Cancer 2016, 138:98-109.

[10] Gansner ER, North SC: An open graph visualization system and its applications to software engineering. Software --- Pract Exp 2000, 30:1203-1233.
============================================================================