GSE138197 Processing Pipeline

RNA-Seq code_examples 6 steps

Publication

Skipper analysis of eCLIP datasets enables sensitive detection of constrained translation factor binding sites.

Cell genomics (2023) — PMID 37388912

Dataset

GSE138197

Alternative polyadenylation mediates genetic regulation of gene expression

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1. 1

    map to HG19 with WASP output mode.

    STAR (Inferred with models/gemini-2.5-flash) v2.7.9a GitHub
    $ Bash example
    # Install STAR if not already available
    # conda install -c bioconda star
    
    # Define variables
    # Replace with actual paths to your HG19 STAR index and input FASTQ files
    GENOME_DIR="/path/to/hg19_star_index" # Pre-built STAR index for HG19
    READ1="input_R1.fastq.gz"
    READ2="input_R2.fastq.gz" # Adjust if single-end reads, remove READ2 and use only READ1
    OUTPUT_PREFIX="aligned_hg19_wasp_"
    THREADS=8 # Adjust as needed, typically based on available CPU cores
    
    # Note: The STAR genome index for HG19 must be pre-built. 
    # Example command to build index (requires HG19 FASTA and GTF/GFF annotation):
    # STAR --runThreadN ${THREADS} --runMode genomeGenerate --genomeDir ${GENOME_DIR} \
    #      --genomeFastaFiles /path/to/hg19.fa --sjdbGTFfile /path/to/hg19.gtf
    
    # Run STAR alignment with WASP-compatible output parameters
    # The '--outSAMattrIHstart 0' parameter is crucial for WASP compatibility.
    STAR --runThreadN ${THREADS} \
         --genomeDir ${GENOME_DIR} \
         --readFilesIn ${READ1} ${READ2} \
         --readFilesCommand zcat \
         --outFileNamePrefix ${OUTPUT_PREFIX} \
         --outSAMtype BAM SortedByCoordinate \
         --outSAMattributes All \
         --outSAMattrIHstart 0 \
         --outFilterMultimapNmax 20 \
         --outFilterMismatchNmax 999 \
         --outFilterMismatchNoverLmax 0.04 \
         --alignIntronMin 20 \
         --alignIntronMax 1000000 \
         --alignMatesGapMax 1000000 \
         --alignSJoverhangMin 8 \
         --alignSJDBoverhangMin 1 \
         --sjdbScore 1 \
         --limitBAMsortRAM 30000000000 # Adjust RAM as needed (e.g., 30GB)
    
    # The output BAM file (${OUTPUT_PREFIX}Aligned.sortedByCoordinate.out.bam) 
    # can then be used as input for the WASP tool for allele-specific analysis.
    # WASP GitHub: https://github.com/bmvdgeijn/WASP
    
  2. 2

    Remove reads not passing WASP multimapping.

    WASP vlatest
    $ Bash example
    # Install WASP (if not already installed)
    # git clone https://github.com/bmv45/WASP.git
    # cd WASP
    # pip install .
    
    # Define input and output files
    # INPUT_BAM: Aligned BAM file (e.g., from STAR, HISAT2) that needs multimapping reads filtered by WASP
    # OUTPUT_BAM: Filtered BAM file with multimapping reads resolved or removed by WASP
    # SNPS_VCF: VCF file containing known SNPs for the reference genome (e.g., dbSNP for hg38)
    # REFERENCE_FASTA: Reference genome FASTA file (e.g., GRCh38.p14.genome.fa)
    
    INPUT_BAM="path/to/aligned_reads.bam"
    OUTPUT_BAM="path/to/wasp_filtered_reads.bam"
    SNPS_VCF="path/to/dbSNP_GRCh38.vcf.gz" # Placeholder: Use an appropriate VCF for your organism/assembly
    REFERENCE_FASTA="path/to/GRCh38.p14.genome.fa" # Placeholder: Use your reference genome FASTA
    
    # Ensure WASP scripts are accessible (e.g., in PATH or specify full path)
    # This command uses WASP's filter_multimapping_reads.py script to remove reads
    # that do not pass WASP's criteria for allele-specific multimapping resolution.
    # It requires a sorted and indexed BAM file and a VCF file with known SNPs.
    python WASP/WASP_scripts/filter_multimapping_reads.py \
        --input_bam "${INPUT_BAM}" \
        --output_bam "${OUTPUT_BAM}" \
        --snp_vcf "${SNPS_VCF}" \
        --reference_fasta "${REFERENCE_FASTA}" \
        --log_file "wasp_filter.log"
  3. 3

    Clean for mispriming by removing reads with 6 A's upstream of 7 of 10 As in upstream sequence.

    filter_mispriming.py (Inferred with models/gemini-2.5-flash) vFrom yeolab/eclip repository GitHub
    $ Bash example
    # Clone the eclip repository to get the script. This script is part of the Yeo lab eCLIP workflow.
    # git clone https://github.com/yeolab/eclip.git
    # cd eclip/scripts
    
    # Define input and output file paths
    INPUT_R1="input_R1.fastq.gz"
    INPUT_R2="input_R2.fastq.gz" # Assuming paired-end reads based on eCLIP workflow
    OUTPUT_R1="output_R1.mispriming_filtered.fastq.gz"
    OUTPUT_R2="output_R2.mispriming_filtered.fastq.gz"
    
    # Path to the filter_mispriming.py script
    # Replace '/path/to/eclip/scripts' with the actual path to the scripts directory within your cloned eclip repository.
    ECLIP_SCRIPTS_DIR="/path/to/eclip/scripts"
    
    # Execute the mispriming filtering script
    # The description "7 of 10 As" directly maps to --polyA_threshold 0.7 and --polyA_window 10.
    # The phrase "6 A's upstream" is interpreted as a descriptive characteristic of the mispriming event
    # targeted by the filter, rather than a distinct, directly mappable parameter in the script.
    # The script checks for poly-A stretches within an 'upstream_window' from the 5' end of the read.
    # The default 'upstream_window' of 20 is commonly used in eCLIP pipelines.
    python "${ECLIP_SCRIPTS_DIR}/filter_mispriming.py" \
        -r1 "${INPUT_R1}" \
        -r2 "${INPUT_R2}" \
        -o1 "${OUTPUT_R1}" \
        -o2 "${OUTPUT_R2}" \
        --polyA_threshold 0.7 \
        --polyA_window 10 \
        --upstream_window 20
  4. 4

    Merge all BAM files and call peaks with inhouse scripts (avaialbe on github)

    clipper (Inferred with models/gemini-2.5-flash) vlatest (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install samtools if not already installed
    # conda install -c bioconda samtools
    
    # Install clipper (assuming it's a Python script and dependencies are met)
    # git clone https://github.com/yeolab/clipper.git
    # cd clipper
    # pip install -r requirements.txt # if any
    
    # Define input BAM files (replace with actual paths to your BAM files)
    INPUT_BAMS="sample_1.bam sample_2.bam sample_3.bam"
    
    # Define output merged BAM file
    MERGED_BAM="merged_input.bam"
    
    # Define reference genome (e.g., hg38, mm10). This is a placeholder.
    # clipper expects a genome name that it can resolve internally or via configuration.
    GENOME_NAME="hg38"
    
    # Define output prefix for clipper peak files
    OUTPUT_PREFIX="clipper_peaks"
    
    # 1. Merge all BAM files
    echo "Merging BAM files: $INPUT_BAMS into $MERGED_BAM"
    samtools merge -@ 8 "$MERGED_BAM" $INPUT_BAMS
    
    # Index the merged BAM file (recommended for downstream tools)
    samtools index "$MERGED_BAM"
    
    # 2. Call peaks with clipper
    echo "Calling peaks with clipper on $MERGED_BAM for genome $GENOME_NAME"
    # Assuming clipper.py is in your PATH or specified with its full path
    # Common parameters: -b for input BAM, -s for genome name, -o for output prefix.
    # Other parameters like -f (FDR), -w (window size) can be added if specified.
    python /path/to/clipper/clipper.py -b "$MERGED_BAM" -s "$GENOME_NAME" -o "$OUTPUT_PREFIX"
  5. 5

    Assign PAS to genes using RefSeq annoations

    RefSeq v2.30.0 GitHub
    $ Bash example
    # Install bedtools if not already installed
    # conda install -c bioconda bedtools
    
    # Placeholder for identified Polyadenylation Sites (PAS) in BED format.
    # This file would typically be generated by an upstream step (e.g., from RNA-seq or 3' end sequencing data).
    PAS_FILE="identified_pas.bed"
    
    # --- Reference RefSeq gene annotations for hg38 ---
    # Download RefSeq gene annotations from UCSC Table Browser (refGene table).
    # This file contains gene models including exons, introns, UTRs.
    # We will extract gene body coordinates from it.
    REFSEQ_GENES_TXT="refGene.txt"
    REFSEQ_GENES_URL="http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/refGene.txt.gz"
    
    # Download and decompress refGene.txt if not present.
    # if [ ! -f "${REFSEQ_GENES_TXT}" ]; then
    #     wget -O "${REFSEQ_GENES_TXT}.gz" "${REFSEQ_GENES_URL}"
    #     gunzip "${REFSEQ_GENES_TXT}.gz"
    # fi
    
    # Convert refGene.txt to a BED file containing gene body coordinates.
    # Columns in refGene.txt: bin, name, chrom, strand, txStart, txEnd, cdsStart, cdsEnd, exonCount, exonStarts, exonEnds, id, name2, cdsStartStat, cdsEndStat, exonFrames
    # We extract: chrom ($3), txStart ($5), txEnd ($6), name2 (gene symbol, $13), score (0), strand ($4).
    # The 'txStart' and 'txEnd' define the transcript (gene) boundaries.
    REFSEQ_GENES_BED="refseq_genes_hg38_gene_bodies.bed"
    
    # awk '{print $3"\t"$5"\t"$6"\t"$13"\t0\t"$4}' "${REFSEQ_GENES_TXT}" | sort -k1,1 -k2,2n > "${REFSEQ_GENES_BED}"
    
    # Ensure the reference genome assembly is specified (e.g., hg38).
    GENOME_ASSEMBLY="hg38"
    
    # Assign PAS to genes using bedtools intersect.
    # -a: input file A (PAS).
    # -b: input file B (RefSeq gene bodies).
    # -wao: Write the original A and B entries plus the number of overlapping bases.
    #       If no overlap, B fields are set to NULL and overlap is 0.
    # -s: Force strandedness. Requires A and B to be on the same strand.
    # Output will contain PAS coordinates and overlapping gene information.
    bedtools intersect -a "${PAS_FILE}" -b "${REFSEQ_GENES_BED}" -wao -s > pas_assigned_to_genes.bed
    
    # Further processing might be needed to determine specific assignment logic 
    # (e.g., within 3' UTR, upstream, downstream) based on the raw intersection output.
  6. 6

    Remove PAS used on average <5% in total or nuclear individuals

    Custom Python script (Inferred with models/gemini-2.5-flash) v1.0 (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # This is a placeholder for a custom Python script that would perform the filtering.
    # The actual script would depend on the format of the input PAS usage file, which is assumed to be a TSV.
    
    # Example Python script (filter_pas.py):
    # import pandas as pd
    #
    # def filter_pas_by_usage(input_file, output_file, threshold=5.0, total_col='total_usage_percent', nuclear_col='nuclear_usage_percent'):
    #     df = pd.read_csv(input_file, sep='\t')
    #     # Filter rows where either total_usage_percent or nuclear_usage_percent is >= threshold
    #     # The description states "Remove PAS used on average <5%", meaning we keep PAS used >=5%.
    #     filtered_df = df[(df[total_col] >= threshold) | (df[nuclear_col] >= threshold)]
    #     filtered_df.to_csv(output_file, sep='\t', index=False)
    #
    # if __name__ == "__main__":
    #     import argparse
    #     parser = argparse.ArgumentParser(description="Filter PAS based on total or nuclear usage percentage.")
    #     parser.add_argument("-i", "--input", required=True, help="Input TSV file with PAS usage data.")
    #     parser.add_argument("-o", "--output", required=True, help="Output TSV file for filtered PAS data.")
    #     parser.add_argument("-t", "--threshold", type=float, default=5.0, help="Minimum usage percentage (total or nuclear) to keep a PAS.")
    #     parser.add_argument("--total_col", default="total_usage_percent", help="Column name for total usage percentage.")
    #     parser.add_argument("--nuclear_col", default="nuclear_usage_percent", help="Column name for nuclear usage percentage.")
    #     args = parser.parse_args()
    #     filter_pas_by_usage(args.input, args.output, args.threshold, args.total_col, args.nuclear_col)
    
    # Execution command:
    # Assuming 'pas_usage_data.tsv' is the input file containing PAS usage percentages
    # and 'filtered_pas_data.tsv' will be the output file.
    python filter_pas.py -i pas_usage_data.tsv -o filtered_pas_data.tsv -t 5.0 --total_col total_usage_percent --nuclear_col nuclear_usage_percent
Raw Source Text
map to HG19 with WASP output mode. Remove reads not passing WASP multimapping.
Clean for mispriming by removing reads with 6 A's upstream of 7 of 10 As in upstream sequence.
Merge all BAM files and call peaks with inhouse scripts (avaialbe on github)
Assign PAS to genes using RefSeq annoations
Remove PAS used on average <5% in total or nuclear individuals
Genome_build: GRCh37
Supplementary_files_format_and_content: APAPAS_GeneLocAnno.5perc.bed: bed file with location of each PAS used at 5% in the total or nuclear fraction. Name column has peak unique number, gene, and PAS genic location. End corresponds to 5kb downstream of annotated 3' UTR
Supplementary_files_format_and_content: MetaDataSequencing.txt: file with relevant cell collection and sequencing metadata
← Back to Analysis