User Guide

Definitions

The term “should” will be used when an operation is not mandatory but is highly recommended for the program to run in the best conditions.

The term “must” will be used when a condition is mandatory to meet the program requirements.

The term “can” will be used to describe a program option (command-line, configuration or installation related); the possibilities are not limited to this option.

The term “may” will be used to define a condition that does not depend on the current program (CPU speed, cluster queue management, for example).

Installation

To install the LogolMatch package, copy the installation package file to the target system (Linux) and install it (dpkg -i xxx.dpkg or rpm -ivh xxx.rpm).

For manual installation, first install the prerequisites (see Technical_Overview#Requirements), then extract the code into a directory and execute

   ant test_swi for SWI-Prolog

or

   ant test_sicstus for SICStus Prolog

This will compile and test the software.

Vmatch or Cassiopee software is required to run LogolMatch. It is advised to install one first.

Cassiopee is the default tool and supports DNA ambiguity. Vmatch performs better on large sequences (genomes). Cassiopee is available in Debian/Ubuntu as the "ruby-cassiopee" package, or as an RPM in our repository (http://rpm.genouest.org/rpm/yourdistrib). It can also be installed as a Ruby gem via rubygems (cassiopee).

Two directories are required to run LogolMatch. The first one is the place where result files will be written. If running on a cluster, this directory must be accessible from all the nodes of the cluster.

The second one is a work directory where temporary files are created. On a cluster, it should be local to each node. If the software is to be used on a single server, both directories can be the same.

Once the software is installed, a few additional steps are required:

  1. Configure the pattern search software with default parameters (see next section)
  2. If a cluster is to be used, or if an email must be sent at the end of a run, edit the mail template file located in $INSTALLDIR/LogolMatch/prolog/mail.tpl. The __FILE__ string will be replaced at runtime by the final file name.
  3. To compile the files for another architecture, read the files $INSTALLDIR/LogolMatch/README.txt and prolog/SWI_README.
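
The placeholder substitution described in step 2 can be pictured with the short sketch below. It is only an illustration: the template wording and the result file name are invented for the example, and the real substitution is done by LogolMatch itself.

```python
# Hypothetical template text; the real mail.tpl shipped with LogolMatch may differ.
TEMPLATE = "Your Logol analysis is complete.\nResult file: __FILE__\n"


def render_mail(template, result_file):
    """Replace every __FILE__ placeholder with the final file name."""
    return template.replace("__FILE__", result_file)


if __name__ == "__main__":
    # The path below is an invented example, not a name produced by LogolMatch.
    print(render_mail(TEMPLATE, "/tmp/Logol/myrun.zip"))
```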

NB: A Java environment (JRE 1.6+) should be set up to run the programs.

Configuration

The configuration is available in $INSTALLDIR/LogolMatch/prolog/logol.properties

Most of the configuration should already be set by the installer. Parameters in bold are the ones that need attention:


##
# Configuration file for LogolMatch. Some properties may be overridden by command-line parameters.
##
# Minimum size to use to split a file to parallelize treatments
minSplitSize=2000000
# Maximum size to display a solution, 0 is no limit. Above the limit, the variable is replaced by "-" character
maxResultSize=0
# Maximum size of a solution (used to optimize the search)
#maxMatchSize=30000
maxMatchSize=0
# Temporary directory used for the analysis. Should be local to the node
workingDir=/tmp/Logol
# Directory where to place the results. In case of cluster usage, result must be a shared directory between nodes
dir.result=/tmp/Logol
# Maximum length of a spacer when looking forward for a match
#maxSpacerLength=10000
maxSpacerLength=0
# Maximum length of a variable in a match
#maxLength=1000
maxLength=0
# Minimum length of a variable in a match
minLength=2
# Default strategy to use, 1 must be kept by default
parentStrategy=1
# Number of processors on the computer running the analysis, or number of available processors on DRM nodes. Can speed up the search process when the sequence file can be split.
nbProcessor=1
# Max Number of jobs to run for a single sequence when used in DRM config.
nbJobs=1
# Default number to limit number of results (must be above 0)
maxSolutions=100
# Minimum size of tree index. In case of use of small sequences, should be set to 2, else use 4. (see vmatch manual). This applies for all sequences.
minTreeIndex=2
# Host where is smtp server (if email required)
smtp.host=localhost
# Mail user for smtp host
mail.user=
# DRM queue command if a specific queue is to be used
# Example for SGE: drm.queue= -q long
drm.queue=
# Suffix tool   0: Cassiopee (default), 1: Vmatch
suffix.tool = 0
suffix.path=

The default configuration (except for the directories) should suit most use cases, but the parameters should be studied carefully to improve the performance of the software.

The configuration file described here is the default configuration file. However, a per-request configuration file can be specified on the command line, which makes it possible to adapt the parameters to specific queries or sequences.
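
Before launching a run with a per-request configuration file, it can help to read the file back and check a few values. The sketch below is a simplified reader for key=value lines (it ignores the finer points of the Java properties format) and applies two checks taken from the comments of the default file; the function names are chosen for this example and are not part of LogolMatch.

```python
def parse_properties(text):
    """Parse 'key=value' lines, skipping blank lines and '#' comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def check_properties(props):
    """Return a list of problems, based on the comments of the default file."""
    problems = []
    if int(props.get("maxSolutions", "0")) <= 0:
        problems.append("maxSolutions must be above 0")
    if props.get("suffix.tool") not in ("0", "1"):
        problems.append("suffix.tool must be 0 (Cassiopee) or 1 (Vmatch)")
    return problems


# Excerpt of the default configuration listed in this guide.
SAMPLE = """\
workingDir=/tmp/Logol
dir.result=/tmp/Logol
maxSolutions=100
suffix.tool = 0
suffix.path=
"""

if __name__ == "__main__":
    props = parse_properties(SAMPLE)
    print(props["workingDir"], props["dir.result"], check_properties(props))
```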

Usage

Sequences

Input sequences must be in a single file in NCBI Fasta format. All headers must follow this form:

>gi|51511735|ref|nc_000018.8|nc_000018 test sequence for logol validation

References are not checked and can be any value.
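
A quick pre-check of the headers can avoid a failed run. The sketch below flags header lines that do not follow the layout of the example above. The regular expression is an assumption derived from that single example (gi, an identifier, ref, a reference, a name, then free text); the exact rule applied by LogolMatch is not described in this guide.

```python
import re

# Assumed layout: >gi|<id>|ref|<reference>|<name> <free text>
HEADER_RE = re.compile(r"^>gi\|[^|\s]+\|ref\|[^|\s]+\|\S*(\s.*)?$")


def bad_headers(fasta_text):
    """Return (line_number, line) for each header not matching the assumed layout."""
    bad = []
    for number, line in enumerate(fasta_text.splitlines(), start=1):
        if line.startswith(">") and not HEADER_RE.match(line):
            bad.append((number, line))
    return bad


# First header is the example from this guide; the second is deliberately wrong.
SAMPLE = """\
>gi|51511735|ref|nc_000018.8|nc_000018 test sequence for logol validation
acgtacgtacgt
>my_sequence without NCBI fields
ttgacca
"""

if __name__ == "__main__":
    for number, line in bad_headers(SAMPLE):
        print(f"line {number}: unexpected header: {line}")
```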

Web interfaces (Genouest web site only)

For the web interface, connect to the web container at the application URL. For LogolDesigner, an index page provides links to the online help and software as well as some screencasts.

For the LogolAnalyser, online help is available at [1]

Command line

Options specified on the command line supersede the default configuration options.

Use programname.sh -h to get a list of available options.

LogolMultiExec.sh

LogolMultiExec.sh is an intermediate program only. It takes one or more sequences as input and dispatches them to LogolExec. If the configuration allows it, it can also split a sequence into smaller parts; in that case, it is also in charge of merging the results for the sequence. On DRM systems, it creates a new job for each (sub-)sequence. On non-DRM systems, all (sub-)sequences are processed sequentially.

A man page is available for options.

LogolExec.sh

Called by LogolMultiExec.sh, it can be run directly when using a single sequence. A man page is available for options.

Checking a grammar

To check a grammar, one can run LogolExec.sh -check -g mygrammarfile
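
When the check is part of a larger workflow, the same command can be launched from a script. The sketch below only wraps the command shown above; it assumes LogolExec.sh is on the PATH, and since this guide does not say whether the exit status reflects the outcome of the check, the program output is returned as well.

```python
import subprocess


def check_command(grammar_file, program="LogolExec.sh"):
    """Build the grammar-check command line given in this guide."""
    return [program, "-check", "-g", grammar_file]


def check_grammar(grammar_file, program="LogolExec.sh"):
    """Run the check and return (exit status, combined output)."""
    result = subprocess.run(
        check_command(grammar_file, program),
        capture_output=True,
        text=True,
    )
    return result.returncode, result.stdout + result.stderr


if __name__ == "__main__":
    # Only prints the command; call check_grammar() on a system with LogolMatch installed.
    print(" ".join(check_command("mygrammarfile")))
```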

Results

Results are zipped in a single file. There is one result file per input sequence, in XML format.

The model is the id of the model defined in the grammar. Variables give the detailed values of the match according to the grammar. The id of a match is unique within the sequence result file. Matches found by a reverse complement search, when selected, have a begin position higher than their end position.

Here is the DTD of the XML document:

<!ELEMENT sequences ( fastaHeader, grammar, model, match+ ) >
<!ELEMENT fastaHeader ( #PCDATA ) >
<!ELEMENT grammar ( #PCDATA ) >
<!ELEMENT match ( model, id, begin, end, errors, distance, variable+ ) >
<!ELEMENT model ( #PCDATA ) >
<!ELEMENT id ( #PCDATA ) >
<!ELEMENT begin ( #PCDATA ) >
<!ELEMENT content ( #PCDATA ) >
<!ELEMENT end ( #PCDATA ) >
<!ELEMENT errors ( #PCDATA ) >
<!ELEMENT distance ( #PCDATA ) >
<!ELEMENT variable ( begin, end, size, errors, content, text ) >
<!ELEMENT size ( #PCDATA ) >
<!ELEMENT text ( #PCDATA ) >
<!ATTLIST variable name CDATA  #REQUIRED >
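
As an illustration of how a result file following this DTD could be read, here is a minimal sketch using Python's standard xml.etree.ElementTree module. The sample document is hand-written to fit the DTD: all its values are invented and are not the output of a real run. The sketch assumes that begin and end are integers, and flags reverse complement matches using the begin/end rule given above.

```python
import xml.etree.ElementTree as ET

# Hand-written sample following the DTD; values are invented for illustration.
SAMPLE_XML = """\
<sequences>
  <fastaHeader>gi|51511735|ref|nc_000018.8|nc_000018 test sequence</fastaHeader>
  <grammar>mygrammarfile</grammar>
  <model>mod1</model>
  <match>
    <model>mod1</model><id>1</id><begin>10</begin><end>17</end>
    <errors>0</errors><distance>0</distance>
    <variable name="X1">
      <begin>10</begin><end>17</end><size>8</size>
      <errors>0</errors><content>acgtacgt</content><text></text>
    </variable>
  </match>
  <match>
    <model>mod1</model><id>2</id><begin>40</begin><end>33</end>
    <errors>0</errors><distance>0</distance>
    <variable name="X1">
      <begin>40</begin><end>33</end><size>8</size>
      <errors>0</errors><content>ttgaccag</content><text></text>
    </variable>
  </match>
</sequences>
"""


def read_matches(xml_text):
    """Return one dictionary per <match> element."""
    root = ET.fromstring(xml_text)
    matches = []
    for match in root.findall("match"):
        # findtext() only looks at direct children, so these are the
        # positions of the match itself, not those of its variables.
        begin = int(match.findtext("begin"))
        end = int(match.findtext("end"))
        matches.append({
            "id": match.findtext("id"),
            "model": match.findtext("model"),
            "begin": begin,
            "end": end,
            "reverse": begin > end,
            "variables": {
                var.get("name"): var.findtext("content")
                for var in match.findall("variable")
            },
        })
    return matches


if __name__ == "__main__":
    for m in read_matches(SAMPLE_XML):
        print(m["id"], m["begin"], m["end"], m["reverse"], m["variables"])
```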

FAQ

  • Out of memory issue when running LogolMatch programs:

Depending on the sequence size, it may be necessary to increase the JVM maximum memory. If the problem occurs, edit the .sh files and increase the value of the -Xmx parameter (it should be at least 2 times the sequence size).
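
The rule of thumb above can be turned into a small calculation. The sketch below assumes that "sequence size" means the size of the sequence file in bytes, and expresses the result in megabytes using the m suffix accepted by the JVM; the function name is chosen for this example.

```python
import math


def min_xmx_option(sequence_bytes):
    """Return a -Xmx option (in MB) of at least twice the sequence size."""
    megabytes = max(1, math.ceil(2 * sequence_bytes / (1024 * 1024)))
    return f"-Xmx{megabytes}m"


if __name__ == "__main__":
    # Example: a 300 MB sequence file (use os.path.getsize() on a real file).
    print(min_xmx_option(300 * 1024 * 1024))
```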

  • How to add a specific cost function:

Create a program in the LogolMatch/tools directory. The script should print the number of errors found to stdout and must be executable by the LogolMatch user.
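
The sketch below shows what such a script could look like. The way LogolMatch passes data to the script is not described in this guide, so the interface used here (two sequences given as command-line arguments) is purely an assumption for the sake of the example, as is the cost itself (a count of differing positions). Only the last step, printing the number of errors to stdout, comes from the description above.

```python
#!/usr/bin/env python3
import sys


def count_errors(expected, observed):
    """Count differing positions, plus any difference in length."""
    mismatches = sum(1 for a, b in zip(expected, observed) if a != b)
    return mismatches + abs(len(expected) - len(observed))


# Assumed interface: cost.py EXPECTED OBSERVED (hypothetical, see above).
if __name__ == "__main__" and len(sys.argv) == 3:
    print(count_errors(sys.argv[1], sys.argv[2]))
```

Remember to give the file execution rights (chmod +x) so that the LogolMatch user can run it.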

  • I have some grammar errors when running a search

Look at the error message; it usually specifies the kind of error.

  • I have no results or result file is empty

Look at the information messages on stdout; there could be grammar issues or a problem creating the suffix tree file. When running through a DRM system, inspect the generated xxx.ojobid and xxx.ejobid files to get the job stream information. If this is not enough, the level of information can be increased in the file LogolMatch/log4j.properties: change the level of log4j.logger.org.irisa.genouest.logol and log4j.logger.org.irisa.genouest.logol.StreamGobbler from ERROR to INFO or DEBUG.

  • Does grammar support DNA ambiguity:

Yes with the Cassiopee matcher, no with Vmatch. With Vmatch, the alphabet itself is supported (B, N...), but comparison between nucleotides will fail: a 'B' facing a 'B' will match, but a 'B' facing a 'C', 'G' or 'T' will fail. In the same way, '-' is accepted but will only match another '-' (in alignment sequence files, for example). Though this could easily be implemented in direct match analysis, it is an issue for suffix array implementations, as this alphabet introduces high complexity in result combinations.
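
The difference between the two behaviours can be illustrated with the standard IUPAC nucleotide codes. The sketch below is not the implementation of either tool: literal_match mimics the symbol-by-symbol comparison described for Vmatch, and ambiguity_match shows one possible ambiguity-aware rule, which assumes that two symbols match when they share at least one possible nucleotide.

```python
# Standard IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def literal_match(a, b):
    """Symbols match only when they are identical (a 'B' matches a 'B' only)."""
    return a.upper() == b.upper()


def ambiguity_match(a, b):
    """Symbols match when they share at least one possible nucleotide."""
    return bool(set(IUPAC[a.upper()]) & set(IUPAC[b.upper()]))


if __name__ == "__main__":
    for pair in [("B", "B"), ("B", "C"), ("B", "A"), ("N", "T")]:
        print(pair, literal_match(*pair), ambiguity_match(*pair))
```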