
Technical Overview

From LogolExec Wiki


Introduction

LogolMatch is a Logol interpreter and a pattern search tool. It takes as input a biological sequence (DNA, RNA or protein) and a grammar file. The grammar is a Logol grammar that describes a pattern to be found in the input sequence. LogolMatch analyses the grammar and executes a program to match the pattern against the sequence. It returns a result file containing the matches with all required details.

Logol is a highly descriptive language dedicated to biological sequence analysis. It defines sequence information together with allowed mutations and morphisms. Models and variables can be used to express a specific pattern, or to find repetitions of a pattern along the sequence.

LogolMatch is open source and free to use, though it depends on other tools that may require fees or specific license agreements. Please refer to the appendix for tools and bibliography.

Architecture

The tool is built around three modules:
  • the grammar interpreter
  • the pattern match tool
  • the analyser

Requirements

LogolMatch requires the following software to run:
  • Linux operating system
  • Java 1.6 or above
  • Ruby (>=1.8), Rubygems, Cassiopee gem (gem install cassiopee or use existing packages for your distribution)
  • Recommended: vmatch (see S. Kurtz) for better performance on large sequences.
  • The Java DRMAA library if a cluster is used.
  • Gawk (tested with version 3.1.6)

To build the software, the following is also needed:

  • JDK 1.6 or above
  • Junit4
  • Ant and Ant-junit
  • SWI-Prolog OR SICStus Prolog 1.4.04 (to compile programs only, if executables are not available for your platform). To run the program, a Prolog interpreter is not required: the executables embed all required libraries.

Running modes

The software is designed to run on a single computer, with one or several CPUs, or on a cluster. When possible, it makes use of a configurable number of CPUs for parallel processing. When several sequences must be analysed with the same grammar (against a personal or public bank), the program can either serialize the analyses or distribute them on a cluster with the help of the Java DRMAA library.

Model components

Different components are in charge of specialized functions; some of them run in parallel, while others synchronize them.

  1. The grammar interpreter reads a grammar file and generates an executable ready to search for the pattern on an input sequence, i.e. the sequence parser. The grammar file can also be a model grammar designed by a model designer.
  2. The multi-sequence manager takes a group of sequences as input and runs a treatment for each sequence. According to the mode, sequences are analysed sequentially or sent as cluster jobs. The multi-sequence manager is built upon a job manager interface, making it easy to add new managers. If only one sequence is used, the program can be called directly without the multi-sequence manager. The multi-sequence manager may also split the sequences according to the maximum number of jobs allowed per sequence.
  3. The job manager analyses the sequence and some configuration parameters to decide whether the file can be split to run several analyses in parallel. Accordingly, one or several processes are executed, each generating its own result file. Finally, the job manager reassembles these files into a single result file.
  4. The result file can then be studied in a graphical analyser.
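The cooperation of these components can be sketched as follows. This is an illustrative Python sketch, not LogolMatch's actual Java code; all function names here are hypothetical.

```python
# Hypothetical sketch: interpreter -> multi-sequence manager -> job manager
# -> result assembly, mirroring the four components described above.

def interpret_grammar(grammar: str) -> str:
    """Stand-in for the grammar interpreter: returns a 'sequence parser' id."""
    return f"parser({grammar})"

def split_sequence(sequence: str, max_jobs: int) -> list:
    """Job-manager side: split a sequence into chunks for parallel analysis."""
    size = max(1, -(-len(sequence) // max_jobs))  # ceiling division
    return [sequence[i:i + size] for i in range(0, len(sequence), size)]

def run_parser(parser: str, chunk: str) -> dict:
    """Stand-in for one logolmatch process producing a partial result file."""
    return {"parser": parser, "chunk": chunk}

def analyse_bank(grammar: str, bank: list, max_jobs: int = 2) -> list:
    """Multi-sequence manager: one treatment per sequence, results reassembled."""
    parser = interpret_grammar(grammar)
    results = []
    for sequence in bank:                  # sequential mode; a cluster mode
        partial = [run_parser(parser, c)   # would queue these jobs instead
                   for c in split_sequence(sequence, max_jobs)]
        results.append(partial)            # job manager merges per sequence
    return results

print(len(analyse_bank('"ccc",.*,"ccc"', ["cccacgtccc", "aaacccaaa"])))  # 2
```

The sketch keeps the separation of concerns described above: the grammar is interpreted once, while splitting and merging happen per sequence.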

Processes

The program is made of multiple processes:

  • The multi-sequence process runs a logolmatch process for each sequence in the bank. All processes are executed separately, just as would be done for a single sequence. It generates a unique identifier for each sequence and results in X sequence results, where X is the number of sequences in the bank.
  • The logolmatch process takes a configuration file, a sequence and a grammar file as input. As a pre-treatment, it creates temporary files and generates an executable (the sequence parser) able to match the grammar against the sequence. This executable is a Prolog executable made of a generated file linked with a pre-defined Prolog library. The process executes the newly created program, then stores the result in an XML file containing the matches for the input sequence. Finally, all temporary files are deleted.
  • The sequence parser is a dynamically created program and is specific to a grammar, not to a sequence. The program takes some arguments as input and can be used against any sequence of the same type (DNA/RNA/protein). It is a Prolog executable linked with the SICStus implementation. It makes use of vmatch, as an external program, to search through the sequence for fixed patterns located at a distant position. A suffix array is used for its performance.

To summarize, the main program, in Java, is responsible for the interpretation of the grammar and for the coordination of work among the different sequences or sub-sequences (if they can be split). It executes sub-programs, in Prolog, which parse one sequence as a string of alphabet characters to match the defined patterns.

  • The result analyser is a separate program, independent of the match analysis. It takes an XML file as input and offers a graphical interface to inspect the results. The XML file can also be used directly if required.

Physical implementation

Resource usage differs according to the execution mode (single computer or cluster).

Single computer

Properties can define the number of available processors. The program tries to use this number to run treatments in parallel when there are several sequences or when a sequence can be split to speed up the treatment. The input sequence can be located anywhere on an available disk (local or remote). A temporary directory is used (and required) to pre-process the sequence and create any additional required file. Finally, result files are generated in a configurable directory (results).

Cluster

In the case of a cluster, the main program can be executed on a cluster node, using any cluster submission operation. The environment, however, needs to be set up. Once the program is running on a node, two cases are possible.

  • Single sequence analysis:

The program runs on the node just as on a single computer; however, the input sequence must be available locally to the node, e.g. on a shared directory, or the sequence must be transferred locally when the job is submitted. A local directory is used for temporary operations and result files are written to a shared directory (results). If results cannot be shared, the shell script should be modified to return the result file to an available location by any means available on the system. The temporary directory should be local to the node for performance reasons.

  • Multiple sequence analysis:

The multi-sequence process runs on a node. The process can make use of the cluster with the DRMAA library to queue the treatment of each sequence in the bank. The queue is managed not by the program but by the cluster queue manager; therefore, all sequences may run in parallel if the queue and available nodes permit it. The process waits for the end of all the queued programs. A local directory is used for temporary operations and result files are written to a shared directory (results). The result directory must be shared, because splitting the bank into individual sequences requires each sequence to be available to the queued processes.

Whether it is the logolmatch process or the multi-sequence process, the called program ends when, and only when, all sub-operations are over. This means that if the process is alive on the calling system, some operations are still running, either on the local system or on a remote node.

Resource usage

As described before, two directories are required for the program:

  • a temporary directory: it contains suffix trees, temporary search info, intermediate result files, etc. All files are deleted at the end of the treatment (an option is available to keep them, for debugging purposes for example). The program requires a lot of temporary disk space: the suffix tree generates many sub-files and the input sequence is copied too. Temporary data requires around 14 times the original sequence size. If the reverse search is done in parallel (multi-processor), it should be considered as a new sequence search run in parallel and will require the same temporary space. For a multi-sequence search, an additional sequence size is required, i.e. 15 × the sequence size.
  • a result directory: it contains the XML files resulting from the analysis, one file per sequence. Large sequences with large matches will of course result in large result files. Due to the tree display of the matches, a match of 100 b may require 200 b (or much more) depending on the depth of the tree.
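The multipliers above can be turned into a rough disk-space estimator. The factors (14×, 15×, doubling for a parallel reverse search) come from this page; the helper itself is illustrative.

```python
# Rough temporary-disk estimator based on the figures given above.
# The multipliers are taken from this document; the function is a sketch.

def temp_space_estimate(seq_bytes: int, multi_sequence: bool = False,
                        parallel_reverse: bool = False) -> int:
    factor = 15 if multi_sequence else 14   # multi-sequence copies one more
    estimate = factor * seq_bytes
    if parallel_reverse:
        # reverse search behaves like a second sequence search run in parallel
        estimate *= 2
    return estimate

# A 10 MB sequence, reverse searched in parallel:
print(temp_space_estimate(10_000_000, parallel_reverse=True))  # 280000000
```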

The program uses a configurable number of processes. The memory may need to be adjusted in the shell scripts according to the sequence size for Java execution (memory heap). At least 1 GB of RAM is required. Memory is a limiting factor tied to the number of solutions: as the number of solutions for a sequence increases, more memory is required (Prolog needs).

Logol

The Logol language is implemented in the grammar interpreter. It supports recursion among types, according to the language definition; recursion is mapped in the result file with a corresponding tree. For more information, please refer to the Logol implementation specification.

For example, a View can be made of a group of Entities, and one of those Entities can itself be a View, and so on.

Software implementation

LogolMatch

LogolMatch is organized in several processing steps. The first major step is the creation of the executable from a grammar; the second step is the parsing of the sequence by this executable.

Grammar analyser

The grammar interpreter takes a Logol grammar file as input. It basically runs in 3 steps.

The first step parses the grammar to analyse the variables, constraints, etc., and stores the structure of the data internally.

The second step generates a Prolog program to analyse the definition order of the variables. Indeed, a variable A can have a constraint related to a variable B. As the sequence analyser works in a left-to-right (L2R) way, B needs to be known before A for this to work. If B is unknown, A needs to be selected with the least constraining weight and checked later on, when B is known. This is the purpose of step 2: it redefines the grammar so as to know which variable analyses need to be postponed to a later time. These will be referred to as postponed treatments in this document.
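The ordering decision can be illustrated with a small Python sketch (the real step 2 is a generated Prolog program; the data shapes and names here are hypothetical):

```python
# Hypothetical illustration of step 2: in a left-to-right scan, decide
# which variables can be resolved immediately and which must be postponed
# until the variable they reference has been seen.

def plan(order, deps):
    """order: variables in grammar (L2R) order; deps: var -> referenced var."""
    known, postponed, schedule = set(), [], []
    for var in order:
        dep = deps.get(var)
        if dep is None or dep in known:
            schedule.append((var, "resolve"))
        else:
            schedule.append((var, "postpone"))  # checked later, once dep known
            postponed.append(var)
        known.add(var)
    return schedule, postponed

# A references B, but B only appears later: A becomes a postponed treatment.
print(plan(["A", "B"], {"A": "B"}))
# ([('A', 'postpone'), ('B', 'resolve')], ['A'])
```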

Finally, step 3 parses the grammar again, but takes into account the results of step 2 to know what kind of treatment is required for the current variable (a variable could refer to a fixed value). It generates Prolog predicates, using an available library of functions, to create the sequence analyser program.

According to the type of the variable and the required functions, the analyser translates those into internal predicates, used by the functions, or into postpone predicates that validate a variable at a later time. Due to Prolog internals, these manipulations sometimes require keeping track of variable contents so they can be passed as arguments to a function, even if in the original grammar the variable is not an argument. The logic of the program takes these constraints into account and rearranges the data to solve the problem.

Once the program is created, the suffix array of the sequence to analyse is built. If possible, according to input parameters and configuration, the file is split to run several analyses in parallel; splitting accelerates the treatments. The program is then executed with the different sequences (the input, or the input split into parts, and possibly the reverse sequence if requested at execution).

When the results are available, the generated XML files are parsed to reorder some data and merged into a single file.

Sequence analyser

The sequence analyser is made of a main entry point which counts the number of solutions and repeatedly asks the main Prolog predicate for another solution.

As in any Prolog program, solutions are explored one after the other while the program runs. Each condition is analysed and, while conditions hold, the next conditions are analysed. When all of them are verified, a solution is sent to the output. The program then backtracks to match a new solution, until none is available or the main program stops asking for new solutions.
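This enumeration style can be mimicked in Python with nested generators: each condition yields the positions at which it can succeed, and the driver backtracks through the alternatives. A minimal sketch (the pattern and function names are made up for illustration):

```python
# Python analogue of the Prolog enumeration described above, for the
# illustrative pattern "cc", gap, "cc". Each condition yields the end
# positions at which it succeeds; exhausting a generator is backtracking.

def match_literal(seq, pos, word):
    if seq.startswith(word, pos):
        yield pos + len(word)

def match_gap(seq, pos):
    for end in range(pos, len(seq) + 1):   # try every possible gap length
        yield end

def solutions(seq):
    for p1 in match_literal(seq, 0, "cc"):     # condition 1
        for p2 in match_gap(seq, p1):          # condition 2 (spacer)
            for p3 in match_literal(seq, p2, "cc"):  # condition 3
                yield (0, p3)                  # one fully verified solution

print(list(solutions("ccacc")))  # [(0, 5)]
```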

The analysis is made at a position on the sequence. Each function tries to match from the input position and outputs a new position on the sequence. When all constraints are checked, the variable is saved in memory with all its data (position, size, content, sub-content, number of errors, ...).

When required, a call to an external program is used for suffix search on the sequence (when there is a gap and a known variable content). Other programs can also be used for user-defined cost functions. These programs are located in the tools directory.

When all the data checks succeed and a full match is made, the data are reformatted into XML content.

Prolog library

The sequence analyser uses a library of functions available in the prolog directory. It is a set of predicates providing different ways to analyse the data, among which are exact match functions, match with errors only, match with an allowed distance, overlap, etc.

It also manages gaps, also called spacers, between variables, and uses different search algorithms depending on the context. Some predicates take another predicate as input; this is the case for overlap and repeat testing. These predicates are generic, checking for a match on a global function.

For example, for repeat(exactmatch(A) AND exactmatch(B)), the grammar analyser will create an internal predicate pred(A,B) = exactmatch(A) AND exactmatch(B), and call repeat(pred(A,B)).
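The same higher-order idea, expressed in Python rather than Prolog (names here are illustrative, not Logol's API): the composed condition becomes its own predicate, which the generic repeat machinery then calls.

```python
# Sketch of the pred(A,B) transformation: compose two matchers into one
# predicate, then hand that predicate to a generic repeat combinator.

def exactmatch(word):
    def matcher(seq, pos):
        return pos + len(word) if seq.startswith(word, pos) else None
    return matcher

def conj(m1, m2):
    """pred(A,B) = exactmatch(A) AND exactmatch(B), as a single predicate."""
    def matcher(seq, pos):
        mid = m1(seq, pos)
        return m2(seq, mid) if mid is not None else None
    return matcher

def repeat(pred, seq, pos=0):
    """Generic repeat: apply pred as long as it matches, count occurrences."""
    count = 0
    while (nxt := pred(seq, pos)) is not None:
        pos, count = nxt, count + 1
    return count

pred = conj(exactmatch("ab"), exactmatch("c"))   # repeat(pred(A,B))
print(repeat(pred, "abcabcab"))  # 2
```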

Result file

The DTD of the final XML is available in the appendix. The file contains a list of matches; each match has a unique identifier. The begin/end elements of the match correspond to the data of the full match. The details follow, for each variable of the match, in a tree format.

The content element contains the matched subsequence data, while the text element represents the grammar part matching this variable. Each match may contain results for one or several models. Each model also has a unique id within the match. This occurs, for example, for a rule like: mod1(X),mod2(X)==>SEQ1

For such a rule, a match is searched for each model, but both results will be in the same match because they share some data (X). Finally, one should take care with the "reverse" search case: the begin tag corresponds to the begin position in an L2R reading, so the begin position will be after the end position.

Example of a result match for the grammar "ccc",.*,"ccc"

<match xmlns="http://www.w3.org/TR/REC-html40">
  <fastaHeader>gi|00000000|Logol search|ap003045_5.0| sequence input</fastaHeader>
  <sequenceBegin>0</sequenceBegin>
  <sequenceEnd>1044</sequenceEnd>
  <model>FXFEE.....</model>
  <model>1</model>
  <id>0-12</id>
  <begin>15</begin>
  <end>30</end>
  <errors>0</errors>
  <variable name="LogolVAR_1">
    <begin>15</begin>
    <end>18</end>
    <size>3</size>
    <errors>0</errors>
    <content>ccc</content>
    <text>"ccc"</text>
  </variable>
  <variable name="LogolVAR_2">
    <begin>18</begin>
    <end>27</end>
    <size>9</size>
    <errors>0</errors>
    <content>cacgtaggt</content>
    <text>_</text>
  </variable>
  <variable name="LogolVAR_3">
    <begin>27</begin>
    <end>30</end>
    <size>3</size>
    <errors>0</errors>
    <content>ccc</content>
    <text>"ccc"</text>
  </variable>
</match>
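A result match like this can be consumed programmatically with Python's standard library. A minimal sketch, operating on a trimmed-down match fragment (not a full result file):

```python
import xml.etree.ElementTree as ET

# Trimmed-down <match> fragment following the structure shown above.
MATCH = """<match>
  <id>0-12</id><begin>15</begin><end>30</end><errors>0</errors>
  <variable name="LogolVAR_1">
    <begin>15</begin><end>18</end><size>3</size><errors>0</errors>
    <content>ccc</content><text>"ccc"</text>
  </variable>
</match>"""

def variables(match_xml: str) -> list:
    """Return (name, matched content) pairs for every variable of a match."""
    root = ET.fromstring(match_xml)
    return [(v.get("name"), v.findtext("content"))
            for v in root.iter("variable")]

print(variables(MATCH))  # [('LogolVAR_1', 'ccc')]
```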

The result file is named according to the grammar plus a unique identifier. This identifier can be specified when executing the program or automatically generated. The file name will look like:

For logolmatch, grammar_file_name.uid.begin-end-all.xml

For multi-sequence program, grammar_file_name.guid.zip (will contain one file per sequence)

grammar_file_name: the name of the input grammar
uid: the input identifier (must be unique on the server), or automatically generated
begin: the start position of the whole sequence (in case an offset is specified as input)
end: the end position of the whole sequence (taking the offset into account too)
guid: a unique identifier generated by the multi-sequence program to differentiate the treatments for each sequence in the bank

A possible mask to find the file generated by logolmatch for the current grammar and uid is:

grammar_file_name.*.uid.*-all.xml

An option (-output) is available in logolmatch to specify the output file name.

For the multi-sequence program, the file name is determined by the grammar name and the guid (which can be specified on the command line).
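The naming scheme above can be captured in a few helpers. The format strings follow this page's description verbatim; the function names themselves are made up.

```python
# Helpers reconstructing the result-file naming scheme described above.
# Format strings are taken from this page; function names are illustrative.

def logolmatch_result_name(grammar_file_name, uid, begin, end):
    return f"{grammar_file_name}.{uid}.{begin}-{end}-all.xml"

def multi_sequence_result_name(grammar_file_name, guid):
    return f"{grammar_file_name}.{guid}.zip"

def result_mask(grammar_file_name, uid):
    """Glob mask for logolmatch output, as given on this page."""
    return f"{grammar_file_name}.*.{uid}.*-all.xml"

print(logolmatch_result_name("model.logol", "run42", 0, 1044))
# model.logol.run42.0-1044-all.xml
```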

Appendix

Tools

SWI-Prolog (for compilation only): http://www.swi-prolog.org/
SICStus Prolog (for compilation only): not included in the product; see http://www.sics.se/isl/sicstuswww/site/index.html to purchase a license. Porting to a free Prolog interpreter is limited to porting a few functions in the main Prolog library.
BioJava: GNU Lesser GPL v2.1
Apache libraries: Apache License v2.0
Antlr: The BSD License
Cassiopee: GNU Lesser GPL, https://rubygems.org/gems/cassiopee
vmatch: not included in the distribution; see http://www.vmatch.de/ for software and license agreement.


XML DTD

<!ELEMENT sequences ( fastaHeader, match+ ) >
<!ELEMENT fastaHeader ( #PCDATA ) >
<!ELEMENT model ( #PCDATA ) >
<!ELEMENT grammar ( #PCDATA ) >
<!ELEMENT match ( model, id, begin, end, errors, variable+ ) >
<!ELEMENT id ( #PCDATA ) >
<!ELEMENT begin ( #PCDATA ) >
<!ELEMENT content ( #PCDATA ) >
<!ELEMENT end ( #PCDATA ) >
<!ELEMENT errors ( #PCDATA ) >
<!ELEMENT variable ( begin, end, size, errors, content, text ) >
<!ELEMENT size ( #PCDATA ) >
<!ELEMENT text ( #PCDATA ) >
<!ATTLIST variable name CDATA  #REQUIRED >