GSE69583 Processing Pipeline


Publication

Musashi-2 attenuates AHR signalling to expand human haematopoietic stem cells.

Nature (2016) — PMID 27121842

Dataset

Musashi-2 Post-transcriptionally Attenuates Aryl Hydrocarbon Receptor Signaling to Expand Human Hematopoietic Stem Cells

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps


Sequencing reads from CLIP-seq and RIP-seq libraries were first trimmed of polyA tails, adapters, and low-quality ends using cutadapt with parameters --match-read-wildcards --times 2 -e 0 -O 5 --quality-cutoff 6 -m 18 -b TCGTATGCCGTCTTCTGCTTG -b ATCTCGTATGCCGTCTTCTGCTTG -b CGACAGGTTCAGAGTTCTACAGTCCGACGATC -b TGGAATTCTCGGGTGCCAAGG -b AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA -b TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT.

cutadapt v1.18 GitHub

$ Bash example

# Install cutadapt (e.g., using conda)
# conda install -c bioconda cutadapt

# Define input and output file names (placeholders)
INPUT_FASTQ="input.fastq.gz"
OUTPUT_FASTQ="trimmed.fastq.gz"

# Define adapter sequences
ADAPTER1="TCGTATGCCGTCTTCTGCTTG"
ADAPTER2="ATCTCGTATGCCGTCTTCTGCTTG"
ADAPTER3="CGACAGGTTCAGAGTTCTACAGTCCGACGATC"
ADAPTER4="TGGAATTCTCGGGTGCCAAGG"
ADAPTER_POLYA="AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"
ADAPTER_POLYT="TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT"

# Execute cutadapt command
cutadapt \
  --match-read-wildcards \
  --times 2 \
  -e 0 \
  -O 5 \
  --quality-cutoff 6 \
  -m 18 \
  -b "${ADAPTER1}" \
  -b "${ADAPTER2}" \
  -b "${ADAPTER3}" \
  -b "${ADAPTER4}" \
  -b "${ADAPTER_POLYA}" \
  -b "${ADAPTER_POLYT}" \
  -o "${OUTPUT_FASTQ}" \
  "${INPUT_FASTQ}"

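A quick way to sanity-check the trimming step is to confirm that no read in the trimmed file is shorter than the -m 18 minimum. The sketch below is an illustration only, not part of the published pipeline; the function name fastq_length_summary is hypothetical, and it assumes standard 4-line FASTQ records on standard input.

```shell
# Report the read count and the shortest read length in a FASTQ stream.
# Assumes 4-line FASTQ records (sequence on every 2nd line of each record).
fastq_length_summary() {
  awk 'NR % 4 == 2 {
         n++
         len = length($0)
         if (n == 1 || len < min) min = len
       }
       END { printf "reads=%d min_length=%d\n", n, min }'
}
```

For a gzipped file this could be run as: gzip -dc trimmed.fastq.gz | fastq_length_summary. A min_length below 18 would suggest the -m 18 filter was not applied.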

Reads were then mapped against a database of repetitive elements derived from RepBase18.05.

bowtie2 (Inferred with models/gemini-2.5-flash) v2.5.2 GitHub

$ Bash example

# Install bowtie2 (if not already installed)
# conda install -c bioconda bowtie2
# conda install -c bioconda samtools

# --- Reference Data Preparation ---
# Download RepBase18.05 (placeholder - actual download method may vary, often requires license)
# For demonstration, let's assume RepBase18.05.fasta is available.
# Example: wget "http://www.girinst.org/repbase/update/RepBase18.05.fasta.gz"
# gunzip RepBase18.05.fasta.gz

# Build the bowtie2 index for repetitive elements
# This step needs to be done once for the reference database
# Replace RepBase18.05.fasta with the actual path to your RepBase FASTA file
# The index prefix will be RepBase18.05_index
bowtie2-build RepBase18.05.fasta RepBase18.05_index

# --- Read Mapping ---
# Define input and output files
READS_FASTQ="input_reads.fastq.gz" # Replace with your actual input reads file (can be .fastq or .fastq.gz)
OUTPUT_SAM="mapped_to_repeats.sam"
OUTPUT_BAM="mapped_to_repeats.bam"
OUTPUT_SORTED_BAM="mapped_to_repeats.sorted.bam"
INDEX_PREFIX="RepBase18.05_index"
NUM_THREADS=8 # Adjust based on available CPU cores

# Map reads against the repetitive elements database using bowtie2
# --very-sensitive: preset that favours sensitivity over speed.
# -U: For single-end reads. Use -1 <R1.fastq> -2 <R2.fastq> for paired-end reads.
# -S: Output SAM file. Unaligned reads are kept in the SAM (--no-unal is not used),
#     because the next step realigns reads that did not map to repeats.
bowtie2 --very-sensitive -p "${NUM_THREADS}" -x "${INDEX_PREFIX}" -U "${READS_FASTQ}" -S "${OUTPUT_SAM}"

# Convert SAM to BAM and sort the BAM file
# -bS: Output in BAM format, input is SAM.
# -o: Output file.
samtools view -bS "${OUTPUT_SAM}" -o "${OUTPUT_BAM}"

# Sort the BAM file by coordinate
samtools sort "${OUTPUT_BAM}" -o "${OUTPUT_SORTED_BAM}"

# Remove intermediate files if desired
# rm "${OUTPUT_SAM}"
# rm "${OUTPUT_BAM}"


Bowtie version 1.0.0 with parameters -S -q -p 16 -e 100 -l 20 was used to align reads against an index generated from Repbase sequences (Langmead et al., 2009).

Bowtie v1.0.0 GitHub

$ Bash example

# Install Bowtie (if not already installed)
# conda install -c bioconda bowtie=1.0.0

# Align reads using Bowtie
# Assuming 'repbase_index' is the prefix for the Bowtie index files generated from Repbase sequences
# Assuming 'reads.fastq' is the input FASTQ file containing the reads
# Assuming 'output.sam' is the desired output SAM file
bowtie -S -q -p 16 -e 100 -l 20 repbase_index reads.fastq > output.sam

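The next step aligns only the reads that did not map to Repbase, so the unmapped records have to be pulled out of the Bowtie SAM output and turned back into FASTQ. The source does not say how the authors did this; the sketch below is one possible approach using awk, and the function name sam_unmapped_to_fastq is hypothetical. It relies on the SAM convention that bit 0x4 of the FLAG field marks an unmapped read.

```shell
# Convert unmapped SAM records (FLAG bit 0x4 set) on stdin to FASTQ on stdout.
# Unmapped reads are stored in their original orientation, so no
# reverse-complementing is needed.
sam_unmapped_to_fastq() {
  awk -F'\t' '
    /^@/ { next }                      # skip SAM header lines
    int($2 / 4) % 2 == 1 {             # FLAG bit 0x4: read is unmapped
      printf "@%s\n%s\n+\n%s\n", $1, $10, $11
    }
  '
}
```

With the file names used above, this could be run as: sam_unmapped_to_fastq < output.sam > input_reads_filtered_from_repbase.fastq. Bowtie 1 also has an --un option that writes unaligned reads directly to a file, which may be simpler in practice.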

Reads not mapped to Repbase sequences were aligned to the hg19 human genome (UCSC assembly) using STAR (Dobin et al., 2013) version 2.3.0e with parameters --outSAMunmapped Within --outFilterMultimapNmax 1 --outFilterMultimapScoreRange 1.

STAR v2.3.0e GitHub

$ Bash example

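# Install STAR (example using conda; this exact legacy version may not be available
# from bioconda, in which case it would need to be built from the STAR source archive)
# conda install -c bioconda star=2.3.0e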

# Placeholder for STAR genome index generation (run once for hg19)
# STAR --runMode genomeGenerate --genomeDir hg19_star_index --genomeFastaFiles hg19.fa --sjdbGTFfile genes.gtf --runThreadN 8

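# Align reads to hg19 using STAR
# Note: --outSAMtype (direct BAM output) was introduced in later STAR releases (2.4.x)
# and is omitted here. STAR 2.3.0e writes SAM (<prefix>Aligned.out.sam), which can be
# converted and sorted afterwards with samtools.
STAR \
  --genomeDir hg19_star_index \
  --readFilesIn input_reads_filtered_from_repbase.fastq \
  --outFileNamePrefix star_hg19_alignment_ \
  --outSAMunmapped Within \
  --outFilterMultimapNmax 1 \
  --outFilterMultimapScoreRange 1 \
  --runThreadN 8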

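Because --outSAMunmapped Within keeps unmapped reads in the alignment output, it can be useful to tally mapped against unmapped records before moving on. The sketch below is an illustrative check, not something described in the source; the function name sam_mapping_tally is hypothetical.

```shell
# Count mapped and unmapped records in a SAM stream using FLAG bit 0x4.
sam_mapping_tally() {
  awk -F'\t' '
    /^@/ { next }                                  # skip SAM header lines
    { if (int($2 / 4) % 2 == 1) unmapped++; else mapped++ }
    END { printf "mapped=%d unmapped=%d\n", mapped, unmapped }
  '
}
```

With the prefix used above, STAR's SAM output could be checked as: sam_mapping_tally < star_hg19_alignment_Aligned.out.sam. Since --outFilterMultimapNmax 1 discards multi-mapping reads, the mapped count should correspond to uniquely aligned reads.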

Reads that were PCR replicates were removed from each CLIP-seq library using a custom script.

Picard MarkDuplicates (stand-in for the authors' custom script) v2.23.4 GitHub

$ Bash example

# Install Picard (if not already installed)
# conda install -c bioconda picard

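# Run Picard MarkDuplicates to remove PCR replicates
# Note: the authors used a custom script that is not provided in the source text.
# Picard is shown here only as a stand-in and may not reproduce their results exactly.
# Assuming input.bam is a coordinate-sorted BAM file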
java -jar /path/to/picard.jar MarkDuplicates \
    INPUT=input.bam \
    OUTPUT=output_dedup.bam \
    METRICS_FILE=deduplication_metrics.txt \
    REMOVE_DUPLICATES=true \
    ASSUME_SORTED=true \
    VALIDATION_STRINGENCY=SILENT


Briefly, one read was kept at each nucleotide position when more than one read's 5' end was mapped.

dedup_reads.py (Inferred with models/gemini-2.5-flash) v1.0.0 GitHub

$ Bash example

# Install samtools and pysam if not available
# conda install -c bioconda samtools pysam

# Clone the eCLIP workflow repository to get the script
# git clone https://github.com/yeolab/eclip.git

# Define input and output file names
INPUT_BAM="aligned.bam"
OUTPUT_DEDUP_BAM="deduplicated.bam"
SORTED_BAM="aligned.sorted.bam"

# Sort the input BAM file by coordinate, which is required by the deduplication script
samtools sort -o "${SORTED_BAM}" "${INPUT_BAM}"

# Index the sorted BAM file (optional, but good practice for downstream tools)
samtools index "${SORTED_BAM}"

# Execute the deduplication script
# Note: the script name and its location in the repository are inferred and unverified;
# check the repository contents before running.
python eclip/tools/dedup_reads.py "${SORTED_BAM}" "${OUTPUT_DEDUP_BAM}"

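The rule described in the source (keep one read per position when several reads share a 5' end) can be sketched directly on SAM records. This is a minimal illustration under stated assumptions, not the authors' custom script: the function name dedup_by_5prime is hypothetical, the first read encountered is the one kept (the source does not say which read was retained), strand is treated as part of the position, and the 5' end is taken from the aligned portion of the read, ignoring soft-clipped bases.

```shell
# Keep the first read seen for each (chromosome, strand, 5-prime end) combination.
# Input: SAM on stdin. Output: deduplicated SAM on stdout (mapped reads only).
dedup_by_5prime() {
  awk -F'\t' '
    /^@/ { print; next }                  # pass header lines through
    int($2 / 4) % 2 == 1 { next }         # drop unmapped reads (FLAG bit 0x4)
    {
      strand = (int($2 / 16) % 2 == 1) ? "-" : "+"
      pos = $4
      if (strand == "-") {
        # On the reverse strand the 5-prime end is the rightmost reference base:
        # POS + (reference bases consumed by the CIGAR) - 1.
        cigar = $6; reflen = 0
        while (match(cigar, /^[0-9]+[MIDNSHP=X]/)) {
          n = substr(cigar, 1, RLENGTH - 1) + 0
          op = substr(cigar, RLENGTH, 1)
          if (op ~ /[MDN=X]/) reflen += n   # only these operations consume reference
          cigar = substr(cigar, RLENGTH + 1)
        }
        pos = $4 + reflen - 1
      }
      key = $3 SUBSEP strand SUBSEP pos
      if (!(key in seen)) { seen[key] = 1; print }
    }
  '
}
```

Assuming samtools is installed, this could be applied to a BAM file as: samtools view -h aligned.sorted.bam | dedup_by_5prime | samtools view -b -o deduplicated.bam -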

Clusters were then assigned using the CLIPper software with parameters --bonferroni --superlocal --threshold- software (Lovci et al., 2013).

CLIPper v1.0 GitHub

$ Bash example

# Installation (example, adjust path as needed)
# git clone https://github.com/yeolab/clipper.git
# cd clipper
# Ensure Python dependencies are met (e.g., numpy, scipy, pysam)
# pip install numpy scipy pysam

# Example: Run CLIPper for peak calling
# Input: Aligned BAM file (e.g., from STAR or HISAT2)
# Output: BED file containing identified peaks/clusters

# Note: the source text truncates the threshold flag ("--threshold-"), so the authors'
# setting cannot be recovered here and is deliberately left out of this example.
# The invocation below follows the usage shown in the CLIPper README
# (clipper -b <bam> -s <species> -o <bed>); confirm all flags with `clipper --help`.
clipper \
    -b input_aligned.bam \
    -s hg19 \
    --bonferroni \
    --superlocal \
    -o output_peaks.bed


Tools Used

STAR CLIP-seq

Raw Source Text

Sequencing reads from CLIP-seq and RIP-seq libraries were first trimmed of polyA tails, adapters, and low quality ends using cutadapt with parameters --match-read-wildcards --times 2 -e 0 -O 5 --quality-cutoff' 6 -m 18 -b TCGTATGCCGTCTTCTGCTTG -b ATCTCGTATGCCGTCTTCTGCTTG -b CGACAGGTTCAGAGTTCTACAGTCCGACGATC -b TGGAATTCTCGGGTGCCAAGG -b AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA -b TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT.
Reads were then mapped against a database of repetitive elements derived from RepBase18.05. Bowtie version 1.0.0 with parameters -S -q -p 16 -e 100 -l 20 was used to align reads against an index generated from Repbase sequences (Langmead et al., 2009).
Reads not mapped to Repbase sequences were aligned to the hg19 human genome (UCSC assembly) using STAR (Dobin et al., 2013) version 2.3.0e with parameters --outSAMunmapped Within --outFilterMultimapNmax 1 --outFilterMultimapScoreRange 1.
Reads that were PCR replicates were removed from each CLIP-seq library using a custom script. Briefly one read was kept at each nucleotide position when more than one read's 5' end was mapped
Clusters were then assigned using the CLIPper software with parameters --bonferroni --superlocal --threshold- software (Lovci et al., 2013).
Genome_build: hg19
Supplementary_files_format_and_content: bed format, contains clusters of predicted MSI2 binding
