GSE138197 Processing Pipeline

RNA-Seq code_examples 6 steps

Publication

Skipper analysis of eCLIP datasets enables sensitive detection of constrained translation factor binding sites.

Cell genomics (2023) — PMID 37388912

Dataset

Alternative polyadenylation mediates genetic regulation of gene expression

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

map to HG19 with WASP output mode.

STAR (Inferred with models/gemini-2.5-flash) v2.7.9a GitHub

$ Bash example

# Install STAR if not already available
# conda install -c bioconda star

# Define variables
# Replace with actual paths to your HG19 STAR index and input FASTQ files
GENOME_DIR="/path/to/hg19_star_index" # Pre-built STAR index for HG19
READ1="input_R1.fastq.gz"
READ2="input_R2.fastq.gz" # Adjust if single-end reads, remove READ2 and use only READ1
OUTPUT_PREFIX="aligned_hg19_wasp_"
THREADS=8 # Adjust as needed, typically based on available CPU cores

# Note: The STAR genome index for HG19 must be pre-built. 
# Example command to build index (requires HG19 FASTA and GTF/GFF annotation):
# STAR --runThreadN ${THREADS} --runMode genomeGenerate --genomeDir ${GENOME_DIR} \
#      --genomeFastaFiles /path/to/hg19.fa --sjdbGTFfile /path/to/hg19.gtf

# Run STAR alignment with WASP-compatible output parameters
# The '--outSAMattrIHstart 0' parameter is crucial for WASP compatibility.
STAR --runThreadN ${THREADS} \
     --genomeDir ${GENOME_DIR} \
     --readFilesIn ${READ1} ${READ2} \
     --readFilesCommand zcat \
     --outFileNamePrefix ${OUTPUT_PREFIX} \
     --outSAMtype BAM SortedByCoordinate \
     --outSAMattributes All \
     --outSAMattrIHstart 0 \
     --outFilterMultimapNmax 20 \
     --outFilterMismatchNmax 999 \
     --outFilterMismatchNoverLmax 0.04 \
     --alignIntronMin 20 \
     --alignIntronMax 1000000 \
     --alignMatesGapMax 1000000 \
     --alignSJoverhangMin 8 \
     --alignSJDBoverhangMin 1 \
     --sjdbScore 1 \
     --limitBAMsortRAM 30000000000 # Adjust RAM as needed (e.g., 30GB)

# The output BAM file (${OUTPUT_PREFIX}Aligned.sortedByCoordinate.out.bam) 
# can then be used as input for the WASP tool for allele-specific analysis.
# WASP GitHub: https://github.com/bmvdgeijn/WASP

View on GitHub

Remove reads not passing WASP multimapping.

WASP vlatest

$ Bash example

# Install WASP (if not already installed)
# git clone https://github.com/bmv45/WASP.git
# cd WASP
# pip install .

# Define input and output files
# INPUT_BAM: Aligned BAM file (e.g., from STAR, HISAT2) that needs multimapping reads filtered by WASP
# OUTPUT_BAM: Filtered BAM file with multimapping reads resolved or removed by WASP
# SNPS_VCF: VCF file containing known SNPs for the reference genome (e.g., dbSNP for hg38)
# REFERENCE_FASTA: Reference genome FASTA file (e.g., GRCh38.p14.genome.fa)

INPUT_BAM="path/to/aligned_reads.bam"
OUTPUT_BAM="path/to/wasp_filtered_reads.bam"
SNPS_VCF="path/to/dbSNP_GRCh38.vcf.gz" # Placeholder: Use an appropriate VCF for your organism/assembly
REFERENCE_FASTA="path/to/GRCh38.p14.genome.fa" # Placeholder: Use your reference genome FASTA

# Ensure WASP scripts are accessible (e.g., in PATH or specify full path)
# This command uses WASP's filter_multimapping_reads.py script to remove reads
# that do not pass WASP's criteria for allele-specific multimapping resolution.
# It requires a sorted and indexed BAM file and a VCF file with known SNPs.
python WASP/WASP_scripts/filter_multimapping_reads.py \
    --input_bam "${INPUT_BAM}" \
    --output_bam "${OUTPUT_BAM}" \
    --snp_vcf "${SNPS_VCF}" \
    --reference_fasta "${REFERENCE_FASTA}" \
    --log_file "wasp_filter.log"

Clean for mispriming by removing reads with 6 A's upstream of 7 of 10 As in upstream sequence.

filter_mispriming.py (Inferred with models/gemini-2.5-flash) vFrom yeolab/eclip repository GitHub

$ Bash example

# Clone the eclip repository to get the script. This script is part of the Yeo lab eCLIP workflow.
# git clone https://github.com/yeolab/eclip.git
# cd eclip/scripts

# Define input and output file paths
INPUT_R1="input_R1.fastq.gz"
INPUT_R2="input_R2.fastq.gz" # Assuming paired-end reads based on eCLIP workflow
OUTPUT_R1="output_R1.mispriming_filtered.fastq.gz"
OUTPUT_R2="output_R2.mispriming_filtered.fastq.gz"

# Path to the filter_mispriming.py script
# Replace '/path/to/eclip/scripts' with the actual path to the scripts directory within your cloned eclip repository.
ECLIP_SCRIPTS_DIR="/path/to/eclip/scripts"

# Execute the mispriming filtering script
# The description "7 of 10 As" directly maps to --polyA_threshold 0.7 and --polyA_window 10.
# The phrase "6 A's upstream" is interpreted as a descriptive characteristic of the mispriming event
# targeted by the filter, rather than a distinct, directly mappable parameter in the script.
# The script checks for poly-A stretches within an 'upstream_window' from the 5' end of the read.
# The default 'upstream_window' of 20 is commonly used in eCLIP pipelines.
python "${ECLIP_SCRIPTS_DIR}/filter_mispriming.py" \
    -r1 "${INPUT_R1}" \
    -r2 "${INPUT_R2}" \
    -o1 "${OUTPUT_R1}" \
    -o2 "${OUTPUT_R2}" \
    --polyA_threshold 0.7 \
    --polyA_window 10 \
    --upstream_window 20

View on GitHub

Merge all BAM files and call peaks with inhouse scripts (avaialbe on github)

clipper (Inferred with models/gemini-2.5-flash) vlatest (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install samtools if not already installed
# conda install -c bioconda samtools

# Install clipper (assuming it's a Python script and dependencies are met)
# git clone https://github.com/yeolab/clipper.git
# cd clipper
# pip install -r requirements.txt # if any

# Define input BAM files (replace with actual paths to your BAM files)
INPUT_BAMS="sample_1.bam sample_2.bam sample_3.bam"

# Define output merged BAM file
MERGED_BAM="merged_input.bam"

# Define reference genome (e.g., hg38, mm10). This is a placeholder.
# clipper expects a genome name that it can resolve internally or via configuration.
GENOME_NAME="hg38"

# Define output prefix for clipper peak files
OUTPUT_PREFIX="clipper_peaks"

# 1. Merge all BAM files
echo "Merging BAM files: $INPUT_BAMS into $MERGED_BAM"
samtools merge -@ 8 "$MERGED_BAM" $INPUT_BAMS

# Index the merged BAM file (recommended for downstream tools)
samtools index "$MERGED_BAM"

# 2. Call peaks with clipper
echo "Calling peaks with clipper on $MERGED_BAM for genome $GENOME_NAME"
# Assuming clipper.py is in your PATH or specified with its full path
# Common parameters: -b for input BAM, -s for genome name, -o for output prefix.
# Other parameters like -f (FDR), -w (window size) can be added if specified.
python /path/to/clipper/clipper.py -b "$MERGED_BAM" -s "$GENOME_NAME" -o "$OUTPUT_PREFIX"

View on GitHub

Assign PAS to genes using RefSeq annoations

RefSeq v2.30.0 GitHub

$ Bash example

# Install bedtools if not already installed
# conda install -c bioconda bedtools

# Placeholder for identified Polyadenylation Sites (PAS) in BED format.
# This file would typically be generated by an upstream step (e.g., from RNA-seq or 3' end sequencing data).
PAS_FILE="identified_pas.bed"

# --- Reference RefSeq gene annotations for hg38 ---
# Download RefSeq gene annotations from UCSC Table Browser (refGene table).
# This file contains gene models including exons, introns, UTRs.
# We will extract gene body coordinates from it.
REFSEQ_GENES_TXT="refGene.txt"
REFSEQ_GENES_URL="http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/refGene.txt.gz"

# Download and decompress refGene.txt if not present.
# if [ ! -f "${REFSEQ_GENES_TXT}" ]; then
#     wget -O "${REFSEQ_GENES_TXT}.gz" "${REFSEQ_GENES_URL}"
#     gunzip "${REFSEQ_GENES_TXT}.gz"
# fi

# Convert refGene.txt to a BED file containing gene body coordinates.
# Columns in refGene.txt: bin, name, chrom, strand, txStart, txEnd, cdsStart, cdsEnd, exonCount, exonStarts, exonEnds, id, name2, cdsStartStat, cdsEndStat, exonFrames
# We extract: chrom ($3), txStart ($5), txEnd ($6), name2 (gene symbol, $13), score (0), strand ($4).
# The 'txStart' and 'txEnd' define the transcript (gene) boundaries.
REFSEQ_GENES_BED="refseq_genes_hg38_gene_bodies.bed"

# awk '{print $3"\t"$5"\t"$6"\t"$13"\t0\t"$4}' "${REFSEQ_GENES_TXT}" | sort -k1,1 -k2,2n > "${REFSEQ_GENES_BED}"

# Ensure the reference genome assembly is specified (e.g., hg38).
GENOME_ASSEMBLY="hg38"

# Assign PAS to genes using bedtools intersect.
# -a: input file A (PAS).
# -b: input file B (RefSeq gene bodies).
# -wao: Write the original A and B entries plus the number of overlapping bases.
#       If no overlap, B fields are set to NULL and overlap is 0.
# -s: Force strandedness. Requires A and B to be on the same strand.
# Output will contain PAS coordinates and overlapping gene information.
bedtools intersect -a "${PAS_FILE}" -b "${REFSEQ_GENES_BED}" -wao -s > pas_assigned_to_genes.bed

# Further processing might be needed to determine specific assignment logic 
# (e.g., within 3' UTR, upstream, downstream) based on the raw intersection output.

View on GitHub

Remove PAS used on average <5% in total or nuclear individuals

Custom Python script (Inferred with models/gemini-2.5-flash) v1.0 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# This is a placeholder for a custom Python script that would perform the filtering.
# The actual script would depend on the format of the input PAS usage file, which is assumed to be a TSV.

# Example Python script (filter_pas.py):
# import pandas as pd
#
# def filter_pas_by_usage(input_file, output_file, threshold=5.0, total_col='total_usage_percent', nuclear_col='nuclear_usage_percent'):
#     df = pd.read_csv(input_file, sep='\t')
#     # Filter rows where either total_usage_percent or nuclear_usage_percent is >= threshold
#     # The description states "Remove PAS used on average <5%", meaning we keep PAS used >=5%.
#     filtered_df = df[(df[total_col] >= threshold) | (df[nuclear_col] >= threshold)]
#     filtered_df.to_csv(output_file, sep='\t', index=False)
#
# if __name__ == "__main__":
#     import argparse
#     parser = argparse.ArgumentParser(description="Filter PAS based on total or nuclear usage percentage.")
#     parser.add_argument("-i", "--input", required=True, help="Input TSV file with PAS usage data.")
#     parser.add_argument("-o", "--output", required=True, help="Output TSV file for filtered PAS data.")
#     parser.add_argument("-t", "--threshold", type=float, default=5.0, help="Minimum usage percentage (total or nuclear) to keep a PAS.")
#     parser.add_argument("--total_col", default="total_usage_percent", help="Column name for total usage percentage.")
#     parser.add_argument("--nuclear_col", default="nuclear_usage_percent", help="Column name for nuclear usage percentage.")
#     args = parser.parse_args()
#     filter_pas_by_usage(args.input, args.output, args.threshold, args.total_col, args.nuclear_col)

# Execution command:
# Assuming 'pas_usage_data.tsv' is the input file containing PAS usage percentages
# and 'filtered_pas_data.tsv' will be the output file.
python filter_pas.py -i pas_usage_data.tsv -o filtered_pas_data.tsv -t 5.0 --total_col total_usage_percent --nuclear_col nuclear_usage_percent

View on GitHub

Raw Source Text

map to HG19 with WASP output mode. Remove reads not passing WASP multimapping.
Clean for mispriming by removing reads with 6 A's upstream of 7 of 10 As in upstream sequence.
Merge all BAM files and call peaks with inhouse scripts (avaialbe on github)
Assign PAS to genes using RefSeq annoations
Remove PAS used on average <5% in total or nuclear individuals
Genome_build: GRCh37
Supplementary_files_format_and_content: APAPAS_GeneLocAnno.5perc.bed: bed file with location of each PAS used at 5% in the total or nuclear fraction. Name column has peak unique number, gene, and PAS genic location. End corresponds to 5kb downstream of annotated 3' UTR
Supplementary_files_format_and_content: MetaDataSequencing.txt: file with relevant cell collection and sequencing metadata

← Back to Analysis