Index: galaxy-central/tools/unix_tools/remove_ending.xml =================================================================== --- galaxy-central/tools/unix_tools/remove_ending.xml (revision 3) +++ galaxy-central/tools/unix_tools/remove_ending.xml (revision 3) @@ -0,0 +1,43 @@ + + of a file + remove_ending.sh $num_lines $input $out_file1 + + + + + + + + + + + + + + + + +**What it does** + +This tool removes specified number of lines from the ending of a dataset + +----- + +**Example** + +Input File:: + + chr7 56632 56652 D17003_CTCF_R6 310 + + chr7 56736 56756 D17003_CTCF_R7 354 + + chr7 56761 56781 D17003_CTCF_R4 220 + + chr7 56772 56792 D17003_CTCF_R7 372 + + chr7 56775 56795 D17003_CTCF_R4 207 + + +After removing the last 2 lines the dataset will look like this:: + + chr7 56632 56652 D17003_CTCF_R6 310 + + chr7 56736 56756 D17003_CTCF_R7 354 + + chr7 56761 56781 D17003_CTCF_R4 220 + + + + Index: galaxy-central/tools/unix_tools/word_list_grep.xml =================================================================== --- galaxy-central/tools/unix_tools/word_list_grep.xml (revision 3) +++ galaxy-central/tools/unix_tools/word_list_grep.xml (revision 3) @@ -0,0 +1,106 @@ + +by word list + + word_list_grep.pl + #if $searchwhere.choice == "column": + -c $searchwhere.column + #end if + -o $output + $inverse + $caseinsensitive + $wholewords + $skip_first_line + $wordlist + $input + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +**What it does** + +This tool selects lines that match words from a word list. 
+ +-------- + +**Example** + +Input file (UCSC's rmsk track from dm3):: + + 585 787 66 241 11 chrXHet 2860 3009 -201103 - DNAREP1_DM LINE Penelope 0 594 435 1 + 585 1383 78 220 0 chrXHet 3012 3320 -200792 - DNAREP1_DM LINE Penelope -217 377 2 1 + 585 244 103 0 0 chrXHet 3737 3776 -200336 - DNAREP1_DM LINE Penelope -555 39 1 1 + 585 2270 83 144 0 chrXHet 7907 8426 -195686 + DNAREP1_DM LINE Penelope 1 594 0 1 + 585 617 189 73 68 chrXHet 10466 10671 -193441 + DNAREP1_DM LINE Penelope 368 573 -21 1 + 586 1122 71 185 0 chrXHet 173138 173322 -30790 - PROTOP DNA P -4033 447 230 1 + ... + ... + + +Word list file:: + + STALKER + PROTOP + + + +Output sequence (searching in column 11):: + + 586 1122 71 185 0 chrXHet 173138 173322 -30790 - PROTOP DNA P -4033 447 230 1 + 586 228 162 0 0 chrXHet 181026 181063 -23049 + STALKER4_I LTR Gypsy 9 45 -6485 1 + 585 245 105 26 0 chr3R 41609 41647 -27863406 + PROTOP_B DNA P 507 545 -608 4 + 586 238 91 0 0 chr3R 140224 140257 -27764796 - PROTOP_B DNA P -617 536 504 4 + ... + ... + +( With **find whole-words** not selected, *PROTOP* matched *PROTOP_B*, *STALKER* matched *STALKER4_I* ) + + + + +Output sequence (searching in column 11, and whole-words only):: + + 586 670 90 38 57 chrXHet 168356 168462 -35650 - PROTOP DNA P -459 4021 3918 1 + 586 413 139 70 0 chrXHet 168462 168548 -35564 - PROTOP DNA P -3406 1074 983 1 + 586 1122 71 185 0 chrXHet 173138 173322 -30790 - PROTOP DNA P -4033 447 230 1 + ... + ... 
+ + + + Index: galaxy-central/tools/unix_tools/sort_tool.xml =================================================================== --- galaxy-central/tools/unix_tools/sort_tool.xml (revision 3) +++ galaxy-central/tools/unix_tools/sort_tool.xml (revision 3) @@ -0,0 +1,134 @@ + + + sort -S 2G $unique + #for $key in $sortkeys + '-k ${key.column},${key.column}${key.order}${key.style}' + #end for + $input > $out_file1 + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +**What it does** + +This tool runs the unix **sort** command on the selected data file. + +----- + +**Sorting Styles** + +* **Fast Numeric**: sort by numeric values. Handles integer values (e.g. 43, 134) and decimal-point values (e.g. 3.14). *Does not* handle scientific notation (e.g. -2.32e2). +* **General Numeric**: sort by numeric values. Handles all numeric notations (including scientific notation). Slower than *fast numeric*, so use only when necessary. +* **Natural Sort**: Sort in 'natural' order (natural to humans, not to computers). See example below. +* **Alphabetical sort**: Sort in strict alphabetical order. See example below. + + + + +**Sorting Examples** + +Given the following list:: + + chr4 + chr13 + chr1 + chr10 + chr20 + chr2 + +**Alphabetical sort** would produce the following sorted list:: + + chr1 + chr10 + chr13 + chr2 + chr20 + chr4 + +**Natural Sort** would produce the following sorted list:: + + chr1 + chr2 + chr4 + chr10 + chr13 + chr20 + + +.. class:: infomark + +If you're planning to use the file with another tool that expected sorted files (such as *join*), you should use the **Alphabetical sort**, not the **Natural Sort**. Natural sort order is easier for humans, but is unnatural for computer programs. 
+ + + Index: galaxy-central/tools/unix_tools/sed_wrapper.sh =================================================================== --- galaxy-central/tools/unix_tools/sed_wrapper.sh (revision 3) +++ galaxy-central/tools/unix_tools/sed_wrapper.sh (revision 3) @@ -0,0 +1,37 @@ +#!/bin/sh + +## +## Galaxy wrapper for SED command +## + +## +## command line arguments: +## input_file +## output_file +## sed-program +## [other parameters passed on to sed] + +INPUT="$1" +OUTPUT="$2" +PROG="$3" + +shift 3 + +if [ -z "$PROG" ]; then + echo usage: $0 INPUTFILE OUTPUTFILE SED-PROGRAM [other sed patameters] >&2 + exit 1 +fi + +if [ ! -r "$INPUT" ]; then + echo "error: input file ($INPUT) not found!" >&2 + exit 1 +fi + +# Messages printed to STDOUT will be displayed in the "INFO" field in the galaxy dataset. +# This way the user can tell what was the command +echo "sed" "$@" "$PROG" + +sed -r --sandbox "$@" "$PROG" "$INPUT" > "$OUTPUT" +if (( $? )); then exit; fi + +exit 0 Index: galaxy-central/tools/unix_tools/cut_tool.xml =================================================================== --- galaxy-central/tools/unix_tools/cut_tool.xml (revision 3) +++ galaxy-central/tools/unix_tools/cut_tool.xml (revision 3) @@ -0,0 +1,94 @@ + + columns from files + + cut_wrapper.sh '$complement' '$cutwhat' '$list' '$input' '$output' + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +**What it does** + +This tool runs the **cut** unix command, which extract or delete columns from a file. + +----- + +Field List Example: + +**1,3,7** - Cut specific fields/characters. + +**3-** - Cut from the third field/character to the end of the line. + +**2-5** - Cut from the second to the fifth field/character. + +**-8** - Cut from the first to the eight field/characters. 
+ + + + +Input Example:: + + fruit color price weight + apple red 1.4 0.5 + orange orange 1.5 0.3 + banana yellow 0.9 0.3 + + +Output Example ( **Keeping fields 1,3,4** ):: + + fruit price weight + apple 1.4 0.5 + orange 1.5 0.3 + banana 0.9 0.3 + +Output Example ( **Discarding field 2** ):: + + fruit price weight + apple 1.4 0.5 + orange 1.5 0.3 + banana 0.9 0.3 + +Output Example ( **Keeping 3 characters** ):: + + fru + app + ora + ban + + + Index: galaxy-central/tools/unix_tools/grep_tool.xml =================================================================== --- galaxy-central/tools/unix_tools/grep_tool.xml (revision 3) +++ galaxy-central/tools/unix_tools/grep_tool.xml (revision 3) @@ -0,0 +1,130 @@ + + + grep_wrapper.sh $input $output '$url_paste' $color -A $lines_after -B $lines_before $invert $case_sensitive + + + + + + + + + + + value.find('\'')==-1 + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +**What it does** + +This tool runs the unix **grep** command on the selected data file. + +.. class:: infomark + +**TIP:** This tool uses the **perl** regular expression syntax (same as running 'grep -P'). This is **NOT** the POSIX or POSIX-extended syntax (unlike the awk/sed tools). + + +**Further reading** + +- Wikipedia's Regular Expression page (http://en.wikipedia.org/wiki/Regular_expression) +- Regular Expressions cheat-sheet (PDF) (http://www.addedbytes.com/cheat-sheets/download/regular-expressions-cheat-sheet-v2.pdf) +- Grep Tutorial (http://www.panix.com/~elflord/unix/grep.html) + +----- + +**Grep Examples** + +- **AGC.AAT** would match lines with AGC followed by any character, followed by AAT (e.g. **AGCQAAT**, **AGCPAAT**, **AGCwAAT**) +- **C{2,5}AGC** would match lines with 2 to 5 consecutive Cs followed by AGC +- **TTT.{4,10}AAA** would match lines with 3 Ts, followed by 4 to 10 characters (any characeters), followed by 3 As. 
+- **^chr([0-9A-Za-z])+** would match lines that begin with chromsomes, such as lines in a BED format file. +- **(ACGT){1,5}** would match at least 1 "ACGT" and at most 5 "ACGT" consecutively. +- **hsa|mmu** would match lines containing "hsa" or "mmu" (or both). + +----- + +**Regular Expression Syntax** + +The select tool searches the data for lines containing or not containing a match to the given pattern. A Regular Expression is a pattern descibing a certain amount of text. + +- **( ) { } [ ] . * ? + \ ^ $** are all special characters. **\\** can be used to "escape" a special character, allowing that special character to be searched for. +- **^** matches the beginning of a string(but not an internal line). +- **\\d** matches a digit, same as [0-9]. +- **\\D** matches a non-digit. +- **\\s** matches a whitespace character. +- **\\S** matches anything BUT a whitespace. +- **\\t** matches a tab. +- **\\w** matches an alphanumeric character ( A to Z, 0 to 9 and underscore ) +- **\\W** matches anything but an alphanumeric character. +- **(** .. **)** groups a particular pattern. +- **\\Z** matches the end of a string(but not a internal line). +- **{** n or n, or n,m **}** specifies an expected number of repetitions of the preceding pattern. + + - **{n}** The preceding item is matched exactly n times. + - **{n,}** The preceding item ismatched n or more times. + - **{n,m}** The preceding item is matched at least n times but not more than m times. + +- **[** ... **]** creates a character class. Within the brackets, single characters can be placed. A dash (-) may be used to indicate a range such as **a-z**. +- **.** Matches any single character except a newline. +- ***** The preceding item will be matched zero or more times. +- **?** The preceding item is optional and matched at most once. +- **+** The preceding item will be matched one or more times. +- **^** has two meaning: + - matches the beginning of a line or string. + - indicates negation in a character class. 
For example, [^...] matches every character except the ones inside brackets. +- **$** matches the end of a line or string. +- **\|** Separates alternate possibilities. + + + + Index: galaxy-central/tools/unix_tools/remove_ending.sh =================================================================== --- galaxy-central/tools/unix_tools/remove_ending.sh (revision 3) +++ galaxy-central/tools/unix_tools/remove_ending.sh (revision 3) @@ -0,0 +1,69 @@ +#!/bin/sh + +# Version 0.1 , 15aug08 +# Written by Assaf Gordon (gordon@cshl.edu) +# + +LINES="$1" +INFILE="$2" +OUTFILE="$3" + +if [ "$LINES" == "" ]; then + cat >&2 < my_output_file.txt + +EOF + + exit 1 +fi + +#Validate line argument - remove non-digits characters +LINES=${LINES//[^[:digit:]]/} + +#Make sure the line strings isn't empty +#(after the regex above, they will either contains digits or be empty) +if [ -z "$LINES" ]; then + echo "Error: bad line value (must be numeric)" >&2 + exit 1 +fi + +# Use default (stdin/out) values if infile / outfile not specified +[ -z "$INFILE" ] && INFILE="/dev/stdin" +[ -z "$OUTFILE" ] && OUTFILE="/dev/stdout" + +#Make sure the input file (if specified) exists. +if [ ! -r "$INFILE" ]; then + echo "Error: input file ($INFILE) not found!" 
>&2 + exit 1 +fi + + +# The "gunzip -f" trick allows +# piping a file (gzip or plain text, real file name or "/dev/stdin") to sed +gunzip -f <"$INFILE" | sed -n -e :a -e "1,${LINES}!{P;N;D;};N;ba" > "$OUTFILE" + Index: galaxy-central/tools/unix_tools/join_tool.xml =================================================================== --- galaxy-central/tools/unix_tools/join_tool.xml (revision 3) +++ galaxy-central/tools/unix_tools/join_tool.xml (revision 3) @@ -0,0 +1,54 @@ + + two files + join_tool.sh "$jointype" "$output_format" + "$empty_string_filler" "$delimiter" + "$ignore_case" + "$input1" "$column1" + "$input2" "$column2" + "$output" + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + Index: galaxy-central/tools/unix_tools/awk_wrapper.sh =================================================================== --- galaxy-central/tools/unix_tools/awk_wrapper.sh (revision 3) +++ galaxy-central/tools/unix_tools/awk_wrapper.sh (revision 3) @@ -0,0 +1,47 @@ +#!/bin/sh + +## +## Galaxy wrapper for AWK command +## + +## +## command line arguments: +## input_file +## output_file +## awk-program +## input-field-separator +## output-field-separator + +INPUT="$1" +OUTPUT="$2" +PROG="$3" +FS="$4" +OFS="$5" + +shift 5 + +if [ -z "$OFS" ]; then + echo usage: $0 INPUTFILE OUTPUTFILE AWK-PROGRAM FS OFS>&2 + exit 1 +fi + +if [ ! -r "$INPUT" ]; then + echo "error: input file ($INPUT) not found!" >&2 + exit 1 +fi + +if [ "$FS" == "tab" ]; then + FS="\t" +fi +if [ "$OFS" == "tab" ]; then + OFS="\t" +fi + +# Messages printed to STDOUT will be displayed in the "INFO" field in the galaxy dataset. +# This way the user can tell what was the command +echo "awk" "$PROG" + +awk --sandbox -v OFS="$OFS" -v FS="$FS" --re-interval "$PROG" "$INPUT" > "$OUTPUT" +if (( $? 
)); then exit; fi + +exit 0 Index: galaxy-central/tools/unix_tools/find_and_replace.xml =================================================================== --- galaxy-central/tools/unix_tools/find_and_replace.xml (revision 3) +++ galaxy-central/tools/unix_tools/find_and_replace.xml (revision 3) @@ -0,0 +1,154 @@ + + text + + find_and_replace.pl + #if $searchwhere.choice == "column": + -c $searchwhere.column + #end if + -o $output + $caseinsensitive + $wholewords + $skip_first_line + $is_regex + '$url_paste' + '$file_data' + '$input' + + + + + + + value.find('\'')==-1 + + + + value.find('\'')==-1 + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +**What it does** + +This tool finds & replaces text in an input dataset. + +.. class:: infomark + +The **pattern to find** can be a simple text string, or a perl **regular expression** string (depending on *pattern is a regex* check-box). + +.. class:: infomark + +When using regular expressions, the **replace pattern** can contain back-references ( e.g. \\1 ) + +.. class:: infomark + +This tool uses Perl regular expression syntax. + +----- + +**Examples of *regular-expression* Find Patterns** + +- **HELLO** The word 'HELLO' (case sensitive). +- **AG.T** The letters A,G followed by any single character, followed by the letter T. +- **A{4,}** Four or more consecutive A's. +- **chr2[012]\\t** The words 'chr20' or 'chr21' or 'chr22' followed by a tab character. +- **hsa-mir-([^ ]+)** The text 'hsa-mir-' followed by one-or-more non-space characters. When using parenthesis, the matched content of the parenthesis can be accessed with **\1** in the **replace** pattern. + + +**Examples of Replace Patterns** + +- **WORLD** The word 'WORLD' will be placed whereever the find pattern was found. +- **FOO-&-BAR** Each time the find pattern is found, it will be surrounded with 'FOO-' at the begining and '-BAR' at the end. **&** (ampersand) represents the matched find pattern. 
+- **\\1** The text which matched the first parenthesis in the Find Pattern. + + +----- + +**Example 1** + +**Find Pattern:** HELLO +**Replace Pattern:** WORLD +**Regular Expression:** no +**Replace what:** entire line + +Every time the word HELLO is found, it will be replaced with the word WORLD. + +----- + +**Example 2** + +**Find Pattern:** ^chr +**Replace Pattern:** (empty) +**Regular Expression:** yes +**Replace what:** column 11 + +If column 11 (of every line) begins with ther letters 'chr', they will be removed. Effectively, it'll turn "chr4" into "4" and "chrXHet" into "XHet" + + +----- + +**Perl's Regular Expression Syntax** + +The Find & Replace tool searches the data for lines containing or not containing a match to the given pattern. A Regular Expression is a pattern descibing a certain amount of text. + +- **( ) { } [ ] . * ? + \\ ^ $** are all special characters. **\\** can be used to "escape" a special character, allowing that special character to be searched for. +- **^** matches the beginning of a string(but not an internal line). +- **(** .. **)** groups a particular pattern. +- **{** n or n, or n,m **}** specifies an expected number of repetitions of the preceding pattern. + + - **{n}** The preceding item is matched exactly n times. + - **{n,}** The preceding item ismatched n or more times. + - **{n,m}** The preceding item is matched at least n times but not more than m times. + +- **[** ... **]** creates a character class. Within the brackets, single characters can be placed. A dash (-) may be used to indicate a range such as **a-z**. +- **.** Matches any single character except a newline. +- ***** The preceding item will be matched zero or more times. +- **?** The preceding item is optional and matched at most once. +- **+** The preceding item will be matched one or more times. +- **^** has two meaning: + - matches the beginning of a line or string. + - indicates negation in a character class. For example, [^...] 
matches every character except the ones inside brackets. +- **$** matches the end of a line or string. +- **\\|** Separates alternate possibilities. +- **\\d** matches a single digit +- **\\w** matches a single letter or digit or an underscore. +- **\\s** matches a single white-space (space or tabs). + + + + + Index: galaxy-central/tools/unix_tools/word_list_grep.pl =================================================================== --- galaxy-central/tools/unix_tools/word_list_grep.pl (revision 3) +++ galaxy-central/tools/unix_tools/word_list_grep.pl (revision 3) @@ -0,0 +1,182 @@ +#!/usr/bin/perl +use strict; +use warnings; +use Getopt::Std; + +sub parse_command_line(); +sub load_word_list(); +sub compile_regex(@); +sub usage(); + +my $word_list_file; +my $input_file ; +my $output_file; +my $find_complete_words ; +my $find_inverse; +my $find_in_specific_column ; +my $find_case_insensitive ; +my $skip_first_line ; + + +## +## Program Start +## +usage() if @ARGV==0; +parse_command_line(); + +my @words = load_word_list(); + +my $regex = compile_regex(@words); + +# Allow first line to pass without filtering? +if ( $skip_first_line ) { + my $line = <$input_file>; + print $output_file $line ; +} + + +## +## Main loop +## +while ( <$input_file> ) { + my $target = $_; + + + # If searching in a specific column (and not in the entire line) + # extract the content of that one column + if ( $find_in_specific_column ) { + my @columns = split ; + + #not enough columns in this line - skip it + next if ( @columns < $find_in_specific_column ) ; + + $target = $columns [ $find_in_specific_column - 1 ] ; + } + + # Match ? 
+ if ( ($target =~ $regex) ^ ($find_inverse) ) { + print $output_file $_ ; + } +} + +close $input_file; +close $output_file; + +## +## Program end +## + + +sub parse_command_line() +{ + my %opts ; + getopts('siwvc:o:', \%opts) or die "$0: Invalid option specified\n"; + + die "$0: missing word-list file name\n" if (@ARGV==0); + + $word_list_file = $ARGV[0]; + die "$0: Word-list file '$word_list_file' not found\n" unless -e $word_list_file ; + + $find_complete_words = ( exists $opts{w} ) ; + $find_inverse = ( exists $opts{v} ) ; + $find_case_insensitive = ( exists $opts{i} ) ; + $skip_first_line = ( exists $opts{s} ) ; + + + # Search in specific column ? + if ( defined $opts{c} ) { + $find_in_specific_column = $opts{c}; + + die "$0: invalid column number ($find_in_specific_column).\n" + unless $find_in_specific_column =~ /^\d+$/ ; + + die "$0: invalid column number ($find_in_specific_column).\n" + if $find_in_specific_column <= 0; + } + else { + $find_in_specific_column = 0 ; + } + + + # Output File specified (instead of STDOUT) ? + if ( defined $opts{o} ) { + my $filename = $opts{o}; + open $output_file, ">$filename" or die "$0: Failed to create output file '$filename': $!\n" ; + } else { + $output_file = *STDOUT ; + } + + + + # Input file Specified (instead of STDIN) ? 
+ if ( @ARGV>1 ) { + my $filename = $ARGV[1]; + open $input_file, "<$filename" or die "$0: Failed to open input file '$filename': $!\n" ; + } else { + $input_file = *STDIN; + } +} + +sub load_word_list() +{ + open WORDLIST, "<$word_list_file" or die "$0: Failed to open word-list file '$word_list_file'\n" ; + my @words ; + while ( ) { + chomp ; + s/^\s+//; + s/\s+$//; + next if length==0; + push @words,quotemeta $_; + } + close WORDLIST; + + die "$0: Error: word-list file '$word_list_file' is empty!\n" + unless @words; + + return @words; +} + +sub compile_regex(@) +{ + my @words = @_; + + my $regex_string = join ( '|', @words ) ; + if ( $find_complete_words ) { + $regex_string = "\\b($regex_string)\\b"; + } + my $regex; + + if ( $find_case_insensitive ) { + $regex = qr/$regex_string/i ; + } else { + $regex = qr/$regex_string/; + } + + return $regex; +} + +sub usage() +{ +print <&2 + exit 1 +fi + +if [ ! -r "$INPUT" ]; then + echo "error: input file ($INPUT) not found!" >&2 + exit 1 +fi + +# Messages printed to STDOUT will be displayed in the "INFO" field in the galaxy dataset. +# This way the user can tell what was the command +if [ -z "$COMPLEMENT" ]; then + echo -n "Extracting " +else + echo "Deleting " +fi + +case $CUTWHAT in + -f) echo -n "field(s) " + ;; + + -c) echo -n "character(s) " + ;; +esac + +echo "$CUTLIST" + + +cut $COMPLEMENT $CUTWHAT $CUTLIST < $INPUT > $OUTPUT + +exit Index: galaxy-central/tools/unix_tools/join_tool.sh =================================================================== --- galaxy-central/tools/unix_tools/join_tool.sh (revision 3) +++ galaxy-central/tools/unix_tools/join_tool.sh (revision 3) @@ -0,0 +1,37 @@ +#!/bin/sh + +# +# NOTE: +# This is a wrapper for GNU's join under galaxy +# not ment to be used from command line (if you're using the command line, simply run 'join' directly...) +# +# All parameters must be supplied. +# the join_tool.xml file takes care of that. 
+ +JOINTYPE="$1" +OUTPUT_FORMAT="$2" +EMPTY_STRING="$3" +DELIMITER="$4" +IGNORE_CASE="$5" + +INPUT1="$6" +COLUMN1="$7" +INPUT2="$8" +COLUMN2="$9" +OUTPUT="${10}" + +if [ "$OUTPUT" == "" ]; then + echo "This script is part of galaxy. Don't run it manually.\n" >&2 + exit 1; +fi + +#This a TAB hack for galaxy (which can't transfer a "\t" as a parameter) +[ "$DELIMITER" == "tab" ] && DELIMITER=" " + +#Remove spaces from the output format (if the user entered any) +OUTPUT_FORMAT=${OUTPUT_FORMAT// /} +[ "$OUTPUT_FORMAT" != "" ] && OUTPUT_FORMAT="-o $OUTPUT_FORMAT" + +echo join $OUTPUT_FORMAT -t "$DELIMITER" -e "$EMPTY_STRING" $IGNORE_CASE $JOINTYPE -1 "$COLUMN1" -2 "$COLUMN2" +#echo join $OUTPUT_FORMAT -t "$DELIMITER" -e "$EMPTY_STRING" $IGNORE_CASE $JOINTYPE -1 "$COLUMN1" -2 "$COLUMN2" "$INPUT1" "$INPUT2" \> "$OUTPUT" +join $OUTPUT_FORMAT -t "$DELIMITER" -e "$EMPTY_STRING" $JOINTYPE -1 "$COLUMN1" -2 "$COLUMN2" "$INPUT1" "$INPUT2" > "$OUTPUT" || exit 1 Index: galaxy-central/tools/unix_tools/grep_wrapper.sh =================================================================== --- galaxy-central/tools/unix_tools/grep_wrapper.sh (revision 3) +++ galaxy-central/tools/unix_tools/grep_wrapper.sh (revision 3) @@ -0,0 +1,62 @@ +#!/bin/sh + +## +## Galaxy wrapper for GREP command. +## + +## +## command line arguments: +## input_file +## output_file +## regex +## COLOR or NOCOLOR +## [other parameters passed on to grep] + +INPUT="$1" +OUTPUT="$2" +REGEX="$3" +COLOR="$4" + +shift 4 + +if [ -z "$COLOR" ]; then + echo usage: $0 INPUTFILE OUTPUTFILE REGEX COLOR\|NOCOLOR [other grep patameters] >&2 + exit 1 +fi + +if [ ! -r "$INPUT" ]; then + echo "error: input file ($INPUT) not found!" >&2 + exit 1 +fi + +# Messages printed to STDOUT will be displayed in the "INFO" field in the galaxy dataset. +# This way the user can tell what was the command +echo "grep" "$@" "$REGEX" + +if [ "$COLOR" == "COLOR" ]; then + # + # What the heck is going on here??? + # 1. 
"GREP_COLORS" is an environment variable, telling GREP which ANSI colors to use.
+	# 2. "--color=always" tells grep to actually use colors (according to the GREP_COLORS variable)
+	# 3. first sed command translates the ANSI color to a <font> tag with blue color (and a <b> tag, too)
+	# 4. second sed command translates the no-color ANSI command to a </font> tag (and a </b> tag, too)
+	# 5. htmlize_pre script takes a text input and wraps it in <pre> tags, making it a fixed-font HTML file.
+
+	GREP_COLORS="ms=31" grep --color=always -P "$@" -- "$REGEX" "$INPUT" | \
+		grep -v "^\[36m\[K--\[m\[K$" | \
+		sed -r 's/\[[0123456789;]+m\[K?/<font color="blue"><b>/g' | \
+		sed -r 's/\[m\[K?/<\/b><\/font>/g' | \
+		htmlize_pre.sh > "$OUTPUT"
+
+
+	if (( $? ));  then exit; fi
+
+elif [ "$COLOR" == "NOCOLOR" ]; then
+	grep -P "$@" -- "$REGEX" "$INPUT" | grep -v "^--$" > "$OUTPUT"
+	if (( $? ));  then exit; fi
+else
+	echo 'Error: fourth parameter must be "COLOR" or "NOCOLOR"' >&2
+	exit 1
+fi
+
+exit 0
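Review note: the NOCOLOR branch above pipes grep through `grep -v "^--$"` to drop the group separators that grep prints between non-adjacent context blocks. A minimal sketch of that behavior follows. The helper name `grep_context` is hypothetical and not part of the patch, and `-E` is used instead of the wrapper's `-P` only so the sketch also runs where grep lacks Perl regex support. Like the wrapper, it would also drop a genuine input line that consists of exactly `--`.

```shell
#!/bin/sh
# Hypothetical helper (not part of the patch).
# Usage: grep_context PATTERN LINES_AFTER < input
grep_context() {
	# Print each match plus LINES_AFTER trailing lines,
	# then remove the "--" separators grep emits between context groups.
	grep -E -A "$2" -- "$1" | grep -v '^--$'
}

# Two matches far enough apart that grep would print a "--" line between them.
printf 'a\nx\nb\nc\nd\na\ny\n' | grep_context '^a$' 1
```

Without the second grep, the output would contain a `--` line between the two context groups.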
Index: galaxy-central/tools/unix_tools/sed_tool.xml
===================================================================
--- galaxy-central/tools/unix_tools/sed_tool.xml (revision 3)
+++ galaxy-central/tools/unix_tools/sed_tool.xml (revision 3)
@@ -0,0 +1,92 @@
+
+  
+  
+  sed_wrapper.sh $input $output '$url_paste' $silent
+  
+    
+
+    
+     
+    	value.find('\'')==-1
+    
+
+     
+      
+      
+    
+
+  
+  
+    
+  
+
+
+**What it does**
+
+This tool runs the unix **sed** command on the selected data file.
+
+.. class:: infomark
+
+**TIP:** This tool uses the **extended regular expression** syntax (same as running 'sed -r').
+
+
+
+**Further reading**
+
+- Short sed tutorial (http://www.linuxhowtos.org/System/sed_tutorial.htm)
+- Long sed tutorial (http://www.grymoire.com/Unix/Sed.html)
+- sed faq with good examples (http://sed.sourceforge.net/sedfaq.html)
+- sed cheat-sheet (http://www.catonmat.net/download/sed.stream.editor.cheat.sheet.pdf)
+- Collection of useful sed one-liners (http://student.northpark.edu/pemente/sed/sed1line.txt)
+
+-----
+
+**Sed commands**
+
+The most useful sed command is **s** (substitute).
+
+**Examples**
+
+- **s/hsa//**  will remove the first instance of 'hsa' in every line.
+- **s/hsa//g**  will remove all instances (because of the **g**) of 'hsa' in every line.
+- **s/A{4,}/--&--/g**  will find sequences of 4 or more consecutive A's, and once found, will surround them with two dashes on each side. The **&** marker is a placeholder for 'whatever matched the regular expression'.
+- **s/hsa-mir-([^ ]+)/short name: \\1 full name: &/**  will find strings such as 'hsa-mir-43a' (the regular expression is 'hsa-mir-' followed by non-space characters) and will replace them with a string such as 'short name: 43a full name: hsa-mir-43a'.  The **\\1** marker is a placeholder for 'whatever matched the first parenthesis' (similar to perl's **$1**).
+
+
+**sed's Regular Expression Syntax**
+
+This tool applies the given sed program to every line of the data; commands such as **s** act on the text matched by their pattern. A regular expression is a pattern describing a certain amount of text.
+
+- **( ) { } [ ] . * ? + \\ ^ $** are all special characters. **\\** can be used to "escape" a special character, allowing that special character to be searched for.
+- **^** matches the beginning of a string (but not an internal line).
+- **(** .. **)** groups a particular pattern.
+- **{** n or n, or n,m **}** specifies an expected number of repetitions of the preceding pattern.
+
+  - **{n}** The preceding item is matched exactly n times.
+  - **{n,}** The preceding item is matched n or more times.
+  - **{n,m}** The preceding item is matched at least n times but not more than m times. 
+
+- **[** ... **]** creates a character class. Within the brackets, single characters can be placed. A dash (-) may be used to indicate a range such as **a-z**.
+- **.** Matches any single character except a newline.
+- ***** The preceding item will be matched zero or more times.
+- **?** The preceding item is optional and matched at most once.
+- **+** The preceding item will be matched one or more times.
+- **^** has two meanings:
+  - matches the beginning of a line or string. 
+  - indicates negation in a character class. For example, [^...] matches every character except the ones inside brackets.
+- **$** matches the end of a line or string.
+- **\|** Separates alternate possibilities. 
+
+
+**Note**: SED uses extended regular expression syntax, not Perl syntax. **\\d** is **not** supported; use **[0-9]** instead. GNU sed accepts **\\w** and **\\s** as extensions, but other sed implementations may not.
+
+
+
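Review note: the substitute examples in the help text above can be tried from a shell. This is a minimal sketch, not the tool's own invocation. The helper name `sed_demo` is hypothetical, `-E` is the portable spelling of the `-r` flag the wrapper uses, and the `--sandbox` flag is left out. The help text writes `\\1` because of reStructuredText escaping; the actual sed program uses a single backslash.

```shell
#!/bin/sh
# Hypothetical helper (not part of the patch): run one sed program with
# extended regular expressions on standard input.
sed_demo() {
	sed -E "$1"
}

# Remove only the first 'hsa' on the line.
echo 'hsa-mir-43a hsa-let-7' | sed_demo 's/hsa//'
# Remove every 'hsa' on the line (the g flag).
echo 'hsa-mir-43a hsa-let-7' | sed_demo 's/hsa//g'
# Surround runs of four or more A's with dashes; & is the matched text.
echo 'TTAAAAAGG' | sed_demo 's/A{4,}/--&--/g'
# \1 is the text matched by the first parenthesized group.
echo 'hsa-mir-43a' | sed_demo 's/hsa-mir-([^ ]+)/short name: \1 full name: &/'
```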
Index: galaxy-central/tools/unix_tools/find_and_replace.pl
===================================================================
--- galaxy-central/tools/unix_tools/find_and_replace.pl (revision 3)
+++ galaxy-central/tools/unix_tools/find_and_replace.pl (revision 3)
@@ -0,0 +1,202 @@
+#!/usr/bin/perl
+use strict;
+use warnings;
+use Getopt::Std;
+
+sub parse_command_line();
+sub build_regex_string();
+sub usage();
+
+my $input_file ;
+my $output_file;
+my $find_pattern ;
+my $replace_pattern ;
+my $find_complete_words ;
+my $find_pattern_is_regex ;
+my $find_in_specific_column ;
+my $find_case_insensitive ;
+my $replace_global ;
+my $skip_first_line ;
+
+
+##
+## Program Start
+##
+usage() if @ARGV<2;
+parse_command_line();
+my $regex_string = build_regex_string() ;
+
+# Allow first line to pass without filtering?
+if ( $skip_first_line ) {
+	my $line = <$input_file>;
+	print $output_file $line ;
+}
+
+
+##
+## Main loop
+##
+
+## I LOVE PERL (and hate it, at the same time...)
+##
+## So what's going on with the self-compiling perl code?
+##
+## 1. The program gets the find-pattern and the replace-pattern from the user (as strings).
+## 2. If both the find-pattern and replace-pattern are simple strings (not regex), 
+##    it would be possible to pre-compile a regex (with qr//) and use it in a 's///'
+## 3. If the find-pattern is a regex but the replace-pattern is a simple text string (without back-references)
+##    it is still possible to pre-compile the regex and use it in a 's///'
+## However,
+## 4. If the replace-pattern contains back-references, pre-compiling is not possible.
+##    (in perl, you can't precompile a substitute regex).
+##    See these examples:
+##    http://www.perlmonks.org/?node_id=84420
+##    http://stackoverflow.com/questions/125171/passing-a-regex-substitution-as-a-variable-in-perl
+##
+##    The solution:
+##    we build the regex string as valid perl code (in 'build_regex_string()', stored in $regex_string ),
+##    Then eval() a new perl code that contains the substitution regex as inlined code.
+##    Gotta love perl!
+
+my $perl_program ;
+if ( $find_in_specific_column ) {
+	# Find & replace in specific column
+
+	$perl_program = <<EOF ;
+	while ( <STDIN> ) {
+		chomp ;
+		my \@columns = split ;
+
+		#not enough columns in this line - skip it
+		next if ( \@columns < $find_in_specific_column ) ;
+
+		\$columns [ $find_in_specific_column - 1 ] =~ $regex_string ;
+
+		print STDOUT join("\t", \@columns), "\n" ;
+	}
+EOF
+
+} else {
+	# Find & replace the entire line
+	$perl_program = <<EOF ;
+		while ( <STDIN> ) {
+			$regex_string ;
+			print STDOUT;
+		}
+EOF
+}
+
+
+# The dynamic perl code reads from STDIN and writes to STDOUT,
+# so connect these handles (if the user didn't specify input / output
+# file names, these might already be STDIN/STDOUT, so the whole thing could be a no-op).
+*STDIN = $input_file ;
+*STDOUT = $output_file ;
+eval $perl_program ;
+die "$0: find & replace failed: $@\n" if $@ ;
+
+
+##
+## Program end
+##
+
+
+sub parse_command_line()
+{
+	my %opts ;
+	getopts('grsiwc:o:', \%opts) or die "$0: Invalid option specified\n";
+
+	die "$0: missing Find-Pattern argument\n" if (@ARGV==0); 
+	$find_pattern = $ARGV[0];
+	die "$0: missing Replace-Pattern argument\n" if (@ARGV==1); 
+	$replace_pattern = $ARGV[1];
+
+	$find_complete_words = ( exists $opts{w} ) ;
+	$find_case_insensitive = ( exists $opts{i} ) ;
+	$skip_first_line = ( exists $opts{s} ) ;
+	$find_pattern_is_regex = ( exists $opts{r} ) ;
+	$replace_global = ( exists $opts{g} ) ;
+
+	# Search in specific column ?
+	if ( defined $opts{c} ) {
+		$find_in_specific_column = $opts{c};
+
+		die "$0: invalid column number ($find_in_specific_column).\n"
+			unless $find_in_specific_column =~ /^\d+$/ ;
+			
+		die "$0: invalid column number ($find_in_specific_column).\n"
+			if $find_in_specific_column <= 0; 
+	}
+	else {
+		$find_in_specific_column = 0 ;
+	}
+
+	# Output File specified (instead of STDOUT) ?
+	if ( defined $opts{o} ) {
+		my $filename = $opts{o};
+		# three-argument open: the file name is never parsed for mode characters
+		open $output_file, '>', $filename or die "$0: Failed to create output file '$filename': $!\n" ;
+	} else {
+		$output_file = *STDOUT ;
+	}
+
+
+	# Input file Specified (instead of STDIN) ?
+	if ( @ARGV>2 ) {
+		my $filename = $ARGV[2];
+		# three-argument open: the file name is never parsed for mode characters
+		open $input_file, '<', $filename or die "$0: Failed to open input file '$filename': $!\n" ;
+	} else {
+		$input_file = *STDIN;
+	}
+}
+
+sub build_regex_string()
+{
+	my $find_string ;
+	my $replace_string ;
+
+	if ( $find_pattern_is_regex ) {
+		$find_string = $find_pattern ;
+		$replace_string = $replace_pattern ;
+	} else {
+		$find_string = quotemeta $find_pattern ;
+		$replace_string = quotemeta $replace_pattern;
+	}
+
+	if ( $find_complete_words ) {
+		$find_string = "\\b($find_string)\\b"; 
+	}
+
+	my $regex_string = "s/$find_string/$replace_string/";
+
+	$regex_string .= "i" if ( $find_case_insensitive );
+	$regex_string .= "g" if ( $replace_global ) ;
+	
+
+	return $regex_string;
+}
+
+sub usage()
+{
+print <
+  
+  	uniq -f $skipfields $count $repeated $ignorecase $uniqueonly $input $output
+  
+
+  
+	
+		
+	
+
+	
+
+	
+
+	
+
+	
+  
+
+  
+    
+  
+  
+  
+
Index: galaxy-central/tools/unix_tools/awk_tool.xml
===================================================================
--- galaxy-central/tools/unix_tools/awk_tool.xml (revision 3)
+++ galaxy-central/tools/unix_tools/awk_tool.xml (revision 3)
@@ -0,0 +1,138 @@
+
+  
+  awk_wrapper.sh $input $output '$file_data' '$FS' '$OFS'
+  
+    
+
+    
+	
+	
+	
+	
+	
+	
+	
+	
+    
+
+    
+	
+	
+	
+	
+	
+	
+	
+	
+    
+
+
+    
+     
+    	value.find('\'')==-1
+    
+
+  
+  
+	  
+		  
+		  
+		  
+		  
+		  
+	  
+  
+  
+    
+  
+
+
+**What it does**
+
+This tool runs the unix **awk** command on the selected data file.
+
+.. class:: infomark
+
+**TIP:** This tool uses the **extended regular expression** syntax (not the perl syntax).
+
+
+**Further reading**
+
+- Awk by Example (http://www.ibm.com/developerworks/linux/library/l-awk1.html)
+- Long AWK tutorial (http://www.grymoire.com/Unix/Awk.html)
+- Learn AWK in 1 hour (http://www.selectorweb.com/awk.html)
+- awk cheat-sheet (http://cbi.med.harvard.edu/people/peshkin/sb302/awk_cheatsheets.pdf)
+- Collection of useful awk one-liners (http://student.northpark.edu/pemente/awk/awk1line.txt)
+
+-----
+
+**AWK programs**
+
+Most AWK programs consist of **patterns** (i.e. rules that match lines of text) and **actions** (i.e. commands to execute when a pattern matches a line).
+
+The basic form of an AWK program is::
+
+    pattern { action 1; action 2; action 3; }
+
+
+
+
+
+**Pattern Examples**
+
+- **$2 == "chr3"**  will match lines whose second column is the string 'chr3'
+- **$5-$4>23**  will match lines where the fifth column minus the fourth column is larger than 23.
+- **/AG..AG/** will match lines that contain the regular expression **AG..AG** (meaning the characters AG followed by any two characters followed by AG). (This is the way to specify regular expressions on the entire line, similar to GREP.)
+- **$7 ~ /A{4}U/**  will match lines whose seventh column contains 4 consecutive A's followed by a U. (This is the way to specify regular expressions on a specific field.)
+- **10000 < $4 && $4 < 20000** will match lines whose fourth column value is larger than 10,000 but smaller than 20,000
+- If no pattern is specified, all lines match (meaning the **action** part will be executed on all lines).
+
+
+
+**Action Examples**
+
+- **{ print }** or **{ print $0 }**   will print the entire input line (the line that matched in **pattern**). **$0** is a special marker meaning 'the entire line'.
+- **{ print $1, $4, $5 }** will print only the first, fourth and fifth fields of the input line.
+- **{ print $4, $5-$4 }** will print the fourth column and the difference between the fifth and fourth columns. (If the fourth column is the start position and the fifth column is the end position, the output will contain the start position and the length.)
+- If no action part is specified (not even the curly brackets), the default action is to print the entire line.
+
+
+
+
+
+
+
+
+
+**AWK's Regular Expression Syntax**
+
+Regular expressions can be used in AWK patterns to select lines (or fields) that match, or do not match, a given pattern. A regular expression is a pattern describing a certain amount of text.
+
+- **( ) { } [ ] . * ? + \ ^ $** are all special characters. **\\** can be used to "escape" a special character, allowing that special character to be searched for.
+- **^** matches the beginning of a string (but not an internal line).
+- **(** .. **)** groups a particular pattern.
+- **{** n or n, or n,m **}** specifies an expected number of repetitions of the preceding pattern.
+
+  - **{n}** The preceding item is matched exactly n times.
+  - **{n,}** The preceding item is matched n or more times.
+  - **{n,m}** The preceding item is matched at least n times but not more than m times. 
+
+- **[** ... **]** creates a character class. Within the brackets, single characters can be placed. A dash (-) may be used to indicate a range such as **a-z**.
+- **.** Matches any single character except a newline.
+- ***** The preceding item will be matched zero or more times.
+- **?** The preceding item is optional and matched at most once.
+- **+** The preceding item will be matched one or more times.
+- **^** has two meanings:
+
+  - matches the beginning of a line or string.
+  - indicates negation in a character class. For example, [^...] matches every character except the ones inside brackets.
+
+- **$** matches the end of a line or string.
+- **\|** Separates alternate possibilities. 
+
+
+**Note**: AWK uses extended regular expression syntax, not Perl syntax. **\\d**, **\\w**, **\\s** etc. are **not** supported.
+
+
+