GSE152684 Processing Pipeline

RNA-Seq code_examples 8 steps

Publication

Evaluation of Engineered CRISPR-Cas-Mediated Systems for Site-Specific RNA Editing.

Cell reports (2020) — PMID 33147453

Dataset

Discovery and evaluation of CRISPR/Cas-mediated systems for site-specific RNA editing

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

Raw reads were adapter-trimmed using Cutadapt (version 1.14)

cutadapt v1.14 GitHub

$ Bash example

# Install Cutadapt if not already installed
# conda install -c bioconda cutadapt=1.14

# Define input and output file paths
INPUT_READS="raw_reads.fastq.gz"
OUTPUT_TRIMMED_READS="trimmed_reads.fastq.gz"

# Define adapter sequence. This must be replaced with the actual adapter sequence used.
# Common Illumina universal adapter sequence: AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC
# If trimming 3' adapters, use -a. If trimming 5' adapters, use -g.
# If trimming unknown adapters, consider using --trim-n.
ADAPTER_SEQUENCE="AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC" # Placeholder: Replace with actual adapter

# Execute Cutadapt for adapter trimming
# -a ADAPTER: Trims 3' adapters
# -o: Output file for trimmed reads
cutadapt -a "${ADAPTER_SEQUENCE}" \
         -o "${OUTPUT_TRIMMED_READS}" \
         "${INPUT_READS}"

View on GitHub

Sequenced reads were aligned to hg19 genome using STAR (version 2.5.2b) and sorted using samtools (version 1.5)

STAR v2.5.2b GitHub

$ Bash example

# Install STAR
# conda install -c bioconda star=2.5.2b

# Install samtools
# conda install -c bioconda samtools=1.5

# Placeholder for STAR genome index directory for hg19
# Replace with the actual path to your indexed hg19 genome
# Example: /path/to/STAR_indices/hg19
GENOME_DIR="/path/to/hg19_STAR_index"

# Placeholder for input FASTQ file(s)
# Replace with your actual input FASTQ file(s)
INPUT_FASTQ="reads.fastq.gz"

# Placeholder for output prefix
OUTPUT_PREFIX="aligned_reads"

# Align reads using STAR
# --genomeDir: path to the STAR genome index
# --readFilesIn: input FASTQ file(s)
# --outFileNamePrefix: prefix for output files
# --outSAMtype BAM Unsorted: output unsorted BAM file, to be sorted by samtools
# --runThreadN: number of threads to use (example: 8)
STAR --genomeDir "${GENOME_DIR}" \
     --readFilesIn "${INPUT_FASTQ}" \
     --outFileNamePrefix "${OUTPUT_PREFIX}_" \
     --outSAMtype BAM Unsorted \
     --runThreadN 8

# Sort the aligned BAM file using samtools
# -o: output file
# Input file is the unsorted BAM generated by STAR (e.g., aligned_reads_Aligned.out.bam)
samtools sort -o "${OUTPUT_PREFIX}.sorted.bam" "${OUTPUT_PREFIX}_Aligned.out.bam"

View on GitHub

Gene counts were quantified from sorted BAM files using featureCounts (version 1.5.0) of the Subread (Liao et al., Nucleic Acids Research 2013) package using "gene" as a metafeature.

featureCounts v1.5.0

$ Bash example

# Install featureCounts (part of the Subread package)
# conda install -c bioconda subread

# Quantify gene counts from sorted BAM files using featureCounts
# -a: Specify the annotation file (GTF/GFF) for gene features (e.g., hg38.gtf)
# -o: Specify the output file for gene counts
# -t gene: Specify "gene" as the feature type to count reads for (metafeature)
# input.bam: Placeholder for the sorted BAM input file(s)
featureCounts -a hg38.gtf -o gene_counts.txt -t gene input.bam

Annotated gene length and gene counts were subsequently used to calculate transcript per-million (TPM) for each gene in each library.

RSEM (Inferred with models/gemini-2.5-flash) v1.3.3 GitHub

$ Bash example

# Install RSEM (example using conda)
# conda install -c bioconda rsem

# --- Placeholder for reference data ---
# Replace with actual paths to your genome FASTA and GTF annotation.
# For eCLIP, this would typically be the reference genome and annotation for the species (e.g., GRCh38).
GENOME_FASTA="path/to/your/genome.fa"
GTF_ANNOTATION="path/to/your/annotation.gtf"
RSEM_REFERENCE_DIR="rsem_reference_index" # Directory to store RSEM index

# --- Build RSEM reference index (if not already built) ---
# This step needs to be run once for a given reference genome and annotation.
# The GTF provides gene/transcript lengths and annotations necessary for RSEM.
# rsem-prepare-reference --gtf "${GTF_ANNOTATION}" "${GENOME_FASTA}" "${RSEM_REFERENCE_DIR}"

# --- Input data ---
# Replace with the actual path to your aligned BAM file (e.g., from STAR alignment in an eCLIP pipeline).
INPUT_ALIGNED_BAM="path/to/your/aligned_eclip_reads.bam"
SAMPLE_NAME="eclip_sample" # Prefix for output files

# --- Run RSEM to calculate expression and TPM ---
# RSEM uses the provided GTF (for gene lengths) and the aligned reads to estimate
# gene counts (expected counts) and subsequently calculate TPM values.
# The '--bam' flag indicates input is a BAM file.
# '--no-qualities' can speed up processing if quality scores are not needed for quantification.
# '--paired-end' if your reads are paired-end (common in eCLIP). Remove if single-end.
# '--num-threads' for parallel processing.
rsem-calculate-expression \
    --bam \
    --no-qualities \
    --paired-end \
    --num-threads 8 \
    "${INPUT_ALIGNED_BAM}" \
    "${RSEM_REFERENCE_DIR}" \
    "${SAMPLE_NAME}"

# Output files in the current directory will include:
# ${SAMPLE_NAME}.genes.results (contains TPM values for genes, along with expected counts, FPKM, etc.)
# ${SAMPLE_NAME}.isoforms.results (contains TPM values for isoforms)
# ${SAMPLE_NAME}.stat (statistics about the run)

View on GitHub

For downsampling prior to downstream editing analysis, samtools view -bs INT.FRAC was used for proportional downsampling of read pairs to achieve sampling depth between 15-35 million aligned reads

samtools v1.10 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install samtools (example using conda)
# conda install -c bioconda samtools

# Define input and output BAM files
INPUT_BAM="path/to/your/input.bam"
OUTPUT_BAM="path/to/your/downsampled.bam"

# --- The following variables (SAMPLING_SEED, SAMPLING_FRACTION) would be calculated
# --- based on the current read count and the desired target range (15-35 million aligned reads).
# --- Example calculation (replace with actual logic based on your pipeline):
# # Get current read count (excluding unmapped reads, primary alignments)
# CURRENT_READS=$(samtools view -c -F 4 -F 256 "$INPUT_BAM")
# # Define target (e.g., 25 million, midpoint of 15-35M)
# TARGET_READS=25000000
# # Calculate fraction, ensuring it doesn't exceed 1 if current reads are already below target
# SAMPLING_FRACTION=$(echo "scale=4; if ($CURRENT_READS > $TARGET_READS) $TARGET_READS / $CURRENT_READS else 1" | bc)
# # Generate a random seed for reproducibility
# SAMPLING_SEED=$(shuf -i 1-10000 -n 1)

# Placeholder values for demonstration
SAMPLING_SEED=12345
SAMPLING_FRACTION=0.5 # Example: downsample to 50% if current reads are 50M and target is 25M

# Perform proportional downsampling of read pairs
# The -b flag outputs in BAM format.
# The -s flag takes a seed and a fraction (e.g., 12345.0.5 for 50% sampling with seed 12345).
# When used with paired-end reads, samtools view -s ensures both reads in a pair are kept or discarded together.
samtools view -b -s "${SAMPLING_SEED}.${SAMPLING_FRACTION}" "$INPUT_BAM" > "$OUTPUT_BAM"

View on GitHub

Aligned BAM files were run through SAILOR (v1.0.4) using the following parameters-- min_variant_coverage: 5, alpha: 0, beta: 0, edit_fraction: 0.01

SAILOR v1.0.4 GitHub

$ Bash example

# Install SAILOR (example, adjust based on specific environment)
# git clone https://github.com/yeolab/SAILOR.git
# cd SAILOR
# pip install .

# Define input and output files
INPUT_BAM="aligned.bam" # Placeholder for aligned BAM file
OUTPUT_FILE="sailor_variants.tsv" # Placeholder for output file (e.g., TSV or VCF)
REFERENCE_FASTA="genome.fa" # Placeholder for reference genome FASTA (e.g., hg38.fa)

# Execute SAILOR
python sailor.py \
    --input_bam "${INPUT_BAM}" \
    --reference_fasta "${REFERENCE_FASTA}" \
    --output_file "${OUTPUT_FILE}" \
    --min_variant_coverage 5 \
    --alpha 0 \
    --beta 0 \
    --edit_fraction 0.01

View on GitHub

BED files that are output corresponding to positive and negative strands of the genome from SAILOR are combined into one BED file and filtered based upon confidence score (Washburn et al., Cell Reports 2014)

SAILOR vWashburn et al., Cell Reports 2014 GitHub

$ Bash example

# Define input and output file names
POSITIVE_STRAND_BED="sailor_output_positive.bed"
NEGATIVE_STRAND_BED="sailor_output_negative.bed"
COMBINED_BED="sailor_combined.bed"
FILTERED_BED="sailor_filtered.bed"

# Define confidence score threshold (placeholder - adjust as needed)
# The specific threshold and filtering logic (e.g., >= or <=) depends on
# how SAILOR defines its confidence score (e.g., raw score, p-value, FDR).
# For demonstration, assuming a higher score is better.
CONFIDENCE_THRESHOLD=5 # Example threshold, adjust based on SAILOR's score range

# Combine positive and negative strand BED files
cat "${POSITIVE_STRAND_BED}" "${NEGATIVE_STRAND_BED}" > "${COMBINED_BED}"

# Filter the combined BED file based on confidence score (5th column)
# Assuming the 5th column is the confidence score and higher is better.
awk -v threshold="${CONFIDENCE_THRESHOLD}" '$5 >= threshold' "${COMBINED_BED}" > "${FILTERED_BED}"

View on GitHub

Recurrent sites between replicates were determined using pybedtools (version 0.8.0) intersect with the following options: wo=True, s=True

pybedtools v0.8.0 GitHub

$ Bash example

# Install pybedtools and bedtools if not already installed
# conda install -c bioconda pybedtools bedtools

# Define input and output files
# Replace 'replicate1.bed' and 'replicate2.bed' with your actual input BED files.
# These files should contain genomic regions (e.g., peaks, binding sites).
REPLICATE1_BED="replicate1.bed"
REPLICATE2_BED="replicate2.bed"

# Define the output file name.
# The 'wo=True' option means the output will include columns from both input files
# plus an additional column for the overlap length, making it a tab-separated file (TSV).
OUTPUT_TSV="recurrent_sites_overlap_details.tsv"

# Execute the pybedtools intersect operation using a Python one-liner
# This command finds regions that overlap between replicate1.bed and replicate2.bed.
# - 'wo=True' ensures that the output includes the original entries from both files
#   and the length of the overlap.
# - 's=True' requires that overlapping features must be on the same strand.
python -c "
import pybedtools
import sys

# Check if correct number of arguments are provided
if len(sys.argv) != 4:
    print('Usage: python -c \"...\" <replicate1.bed> <replicate2.bed> <output.tsv>', file=sys.stderr)
    sys.exit(1)

# Load bed files from command-line arguments
try:
    a = pybedtools.BedTool(sys.argv[1])
    b = pybedtools.BedTool(sys.argv[2])
except Exception as e:
    print(f'Error loading BedTool files: {e}', file=sys.stderr)
    sys.exit(1)

# Perform intersect with specified options
# wo=True: write original A and B entries plus the number of base pairs of overlap
# s=True: require same strand
result = a.intersect(b, wo=True, s=True)

# Save the result to a file
try:
    result.saveas(sys.argv[3])
    print(f'Recurrent sites (with overlap details) saved to {sys.argv[3]}', file=sys.stderr)
except Exception as e:
    print(f'Error saving output file: {e}', file=sys.stderr)
    sys.exit(1)
" "${REPLICATE1_BED}" "${REPLICATE2_BED}" "${OUTPUT_TSV}"

View on GitHub

Tools Used

STAR SAILOR

Raw Source Text

Raw reads were adapter-trimmed using Cutadapt (version 1.14)
Sequenced reads were aligned to hg19 genome using STAR (version 2.5.2b) and sorted using samtools (version 1.5)
Gene counts were quantified from sorted BAM files using featureCounts (version 1.5.0) of the Subread (Liao et al., Nucleic Acids Research 2013) package using "gene" as a metafeature. Annotated gene length and gene counts were subsequently used to calculate transcript per-million (TPM) for each gene in each library.
For downsampling prior to downstream editing analysis, samtools view -bs INT.FRAC was used for proportional downsampling of read pairs to achieve sampling depth between 15-35 million aligned reads
Aligned BAM files were run through SAILOR (v1.0.4) using the following parameters-- min_variant_coverage: 5, alpha: 0, beta: 0, edit_fraction: 0.01
BED files that are output corresponding to positive and negative strands of the genome from SAILOR are combined into one BED file and filtered based upon confidence score (Washburn et al., Cell Reports 2014)
Recurrent sites between replicates were determined using pybedtools (version 0.8.0) intersect with the following options: wo=True, s=True
Genome_build: hg19
Supplementary_files_format_and_content: Matrix tables containing normalized transcripts per-million (TPM) data for every annotated gene; BED files containing editing events per library

← Back to Analysis