GSE69583 Processing Pipeline
Publication
Musashi-2 attenuates AHR signalling to expand human haematopoietic stem cells. Nature (2016), PMID 27121842
Dataset
GSE69583: Musashi-2 Post-transcriptionally Attenuates Aryl Hydrocarbon Receptor Signaling to Expand Human Hematopoietic Stem Cells
Processing Steps
1
Sequencing reads from CLIP-seq and RIP-seq libraries were first trimmed of polyA tails, adapters, and low quality ends using cutadapt with parameters --match-read-wildcards --times 2 -e 0 -O 5 --quality-cutoff 6 -m 18 -b TCGTATGCCGTCTTCTGCTTG -b ATCTCGTATGCCGTCTTCTGCTTG -b CGACAGGTTCAGAGTTCTACAGTCCGACGATC -b TGGAATTCTCGGGTGCCAAGG -b AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA -b TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT.
$ Bash example
# Install cutadapt (e.g., using conda)
# conda install -c bioconda cutadapt

# Define input and output file names (placeholders)
INPUT_FASTQ="input.fastq.gz"
OUTPUT_FASTQ="trimmed.fastq.gz"

# Define adapter sequences
ADAPTER1="TCGTATGCCGTCTTCTGCTTG"
ADAPTER2="ATCTCGTATGCCGTCTTCTGCTTG"
ADAPTER3="CGACAGGTTCAGAGTTCTACAGTCCGACGATC"
ADAPTER4="TGGAATTCTCGGGTGCCAAGG"
ADAPTER_POLYA="AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"
ADAPTER_POLYT="TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT"

# Execute cutadapt command
cutadapt \
  --match-read-wildcards \
  --times 2 \
  -e 0 \
  -O 5 \
  --quality-cutoff 6 \
  -m 18 \
  -b "${ADAPTER1}" \
  -b "${ADAPTER2}" \
  -b "${ADAPTER3}" \
  -b "${ADAPTER4}" \
  -b "${ADAPTER_POLYA}" \
  -b "${ADAPTER_POLYT}" \
  -o "${OUTPUT_FASTQ}" \
  "${INPUT_FASTQ}"
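Because the trimming step discards reads shorter than 18 nt (-m 18), a quick sanity check on the trimmed file can confirm that the filter took effect. The sketch below is only an illustration: the function name count_short_reads is hypothetical, and it assumes standard four-line FASTQ records (no wrapped sequence lines).

```shell
# Count reads shorter than a minimum length in a FASTQ stream read from stdin.
# Hypothetical helper; assumes each record spans exactly four lines.
count_short_reads() {
  # $1: minimum read length (18 mirrors cutadapt's -m 18)
  awk -v min="$1" 'NR % 4 == 2 && length($0) < min { n++ } END { print n + 0 }'
}
```

For a gzipped file this could be run as `gzip -cd trimmed.fastq.gz | count_short_reads 18`; a result of 0 is what one would expect after trimming with -m 18.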
2
Reads were then mapped against a database of repetitive elements derived from RepBase18.05.
$ Bash example
# Install Bowtie 1 and samtools (if not already installed)
# conda install -c bioconda bowtie samtools

# --- Reference Data Preparation ---
# RepBase is distributed by GIRI and typically requires a license, so the
# sequences must be obtained separately.
# RepBase18.05.fasta is a placeholder path to the RepBase 18.05 FASTA file.

# Build the Bowtie index for repetitive elements (needed once per reference).
# Step 3 states that Bowtie 1.0.0 was used, so the index is built with
# bowtie-build rather than bowtie2-build.
bowtie-build RepBase18.05.fasta RepBase18.05_index

# --- Read Mapping ---
# Define input and output files (placeholders)
READS_FASTQ="trimmed.fastq"   # uncompressed; older Bowtie 1 releases may not read .gz input
OUTPUT_SAM="mapped_to_repeats.sam"
OUTPUT_BAM="mapped_to_repeats.bam"
OUTPUT_SORTED_BAM="mapped_to_repeats.sorted.bam"
INDEX_PREFIX="RepBase18.05_index"

# Map reads against the repetitive elements index with the parameters given in step 3
# -S: write SAM output; -q: input is FASTQ; -p 16: use 16 threads
bowtie -S -q -p 16 -e 100 -l 20 "${INDEX_PREFIX}" "${READS_FASTQ}" > "${OUTPUT_SAM}"

# Convert SAM to BAM, then sort the BAM file by coordinate
samtools view -bS -o "${OUTPUT_BAM}" "${OUTPUT_SAM}"
samtools sort "${OUTPUT_BAM}" -o "${OUTPUT_SORTED_BAM}"

# Remove intermediate files if desired
# rm "${OUTPUT_BAM}"
3
Bowtie version 1.0.0 with parameters -S -q -p 16 -e 100 -l 20 was used to align reads against an index generated from Repbase sequences (Langmead et al., 2009).
$ Bash example
# Install Bowtie (if not already installed)
# conda install -c bioconda bowtie=1.0.0

# Align reads using Bowtie
# Assuming 'repbase_index' is the prefix for the Bowtie index files generated from Repbase sequences
# Assuming 'reads.fastq' is the input FASTQ file containing the reads
# Assuming 'output.sam' is the desired output SAM file
bowtie -S -q -p 16 -e 100 -l 20 repbase_index reads.fastq > output.sam
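The next step works only on reads that did not map to Repbase, so those reads have to be pulled out of the Bowtie SAM output. The source does not say how the authors did this. One possible sketch, assuming unaligned reads are present in the SAM file with flag bit 0x4 set, converts them back to FASTQ with awk. The function name sam_unmapped_to_fastq is hypothetical. Bowtie's --un option, which writes unaligned reads to a file during alignment, would be an alternative.

```shell
# Print unmapped SAM records (flag bit 0x4 set) as FASTQ.
# Hypothetical helper; reads SAM from the given file(s) or from stdin.
sam_unmapped_to_fastq() {
  awk -F'\t' '
    # Skip header lines; int(flag / 4) % 2 tests bit 0x4 without bitwise operators
    !/^@/ && int($2 / 4) % 2 == 1 {
      printf "@%s\n%s\n+\n%s\n", $1, $10, $11
    }
  ' "$@"
}
```

A call such as `sam_unmapped_to_fastq output.sam > input_reads_filtered_from_repbase.fastq` would then produce the input file used in the STAR example.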
4
Reads not mapped to Repbase sequences were aligned to the hg19 human genome (UCSC assembly) using STAR (Dobin et al., 2013) version 2.3.0e with parameters --outSAMunmapped Within --outFilterMultimapNmax 1 --outFilterMultimapScoreRange 1.
$ Bash example
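# Install STAR (example using conda; the exact 2.3.0e build may not be available from every channel)
# conda install -c bioconda star

# Placeholder for STAR genome index generation (run once for hg19)
# STAR --runMode genomeGenerate --genomeDir hg19_star_index --genomeFastaFiles hg19.fa --sjdbGTFfile genes.gtf --runThreadN 8

# Align reads to hg19 using STAR with the parameters stated in the source.
# File names and the thread count are placeholders.
STAR \
  --genomeDir hg19_star_index \
  --readFilesIn input_reads_filtered_from_repbase.fastq \
  --outFileNamePrefix star_hg19_alignment_ \
  --outSAMunmapped Within \
  --outFilterMultimapNmax 1 \
  --outFilterMultimapScoreRange 1 \
  --runThreadN 8

# STAR writes star_hg19_alignment_Aligned.out.sam by default.
# The --outSAMtype BAM SortedByCoordinate option appears to exist only in newer
# STAR releases, so sorting is done with samtools here (not part of the source description).
samtools sort -o star_hg19_alignment_sorted.bam star_hg19_alignment_Aligned.out.sam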
5
Reads that were PCR replicates were removed from each CLIP-seq library using a custom script.
$ Bash example
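# NOTE: the source states only that a custom script was used and does not name it.
# Picard MarkDuplicates is shown here as a stand-in, not as the authors' method;
# its duplicate definition may differ from the custom script's.

# Install Picard (if not already installed)
# conda install -c bioconda picard

# Run Picard MarkDuplicates to remove PCR replicates
# Assuming input.bam is a coordinate-sorted BAM file
java -jar /path/to/picard.jar MarkDuplicates \
  INPUT=input.bam \
  OUTPUT=output_dedup.bam \
  METRICS_FILE=deduplication_metrics.txt \
  REMOVE_DUPLICATES=true \
  ASSUME_SORTED=true \
  VALIDATION_STRINGENCY=SILENT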
6
Briefly, one read was kept at each nucleotide position when more than one read's 5' end was mapped there.
$ Bash example
# Install samtools if not available
# conda install -c bioconda samtools

# NOTE: the source does not name or publish the custom deduplication script.
# DEDUP_SCRIPT is a hypothetical placeholder for any script that keeps one read
# per 5' end position; its command-line interface is assumed, not documented.
DEDUP_SCRIPT="/path/to/custom_dedup_script.py"

# Define input and output file names (placeholders)
INPUT_BAM="aligned.bam"
SORTED_BAM="aligned.sorted.bam"
OUTPUT_DEDUP_BAM="deduplicated.bam"

# Sort the input BAM file by coordinate
samtools sort -o "${SORTED_BAM}" "${INPUT_BAM}"

# Index the sorted BAM file (optional, but good practice for downstream tools)
samtools index "${SORTED_BAM}"

# Execute the deduplication script (assumed interface: input BAM, then output BAM)
python "${DEDUP_SCRIPT}" "${SORTED_BAM}" "${OUTPUT_DEDUP_BAM}"
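Since the authors' script is not available, the rule described above (keep one read per 5' end position) can be illustrated with a small awk sketch that works on SAM text. It is not the authors' implementation. The function name dedup_by_5prime is hypothetical, and two choices are assumptions: the first read seen at each position is the one kept, and the 5' end of a reverse-strand read is taken as the rightmost aligned reference base, computed from POS and the CIGAR string.

```shell
# Keep one read per (reference, strand, 5' end position) in a SAM file.
# Hypothetical helper; header lines are passed through, unmapped reads are dropped.
dedup_by_5prime() {
  awk -F'\t' '
    /^@/ { print; next }                    # pass header lines through
    int($2 / 4) % 2 == 1 { next }           # skip unmapped reads (flag bit 0x4)
    {
      strand = (int($2 / 16) % 2 == 1) ? "-" : "+"   # flag bit 0x10 = reverse strand
      five = $4                                       # forward strand: 5 prime end is POS
      if (strand == "-") {
        # Reverse strand: 5 prime end is POS + reference span - 1,
        # where the span sums the M, D, N, = and X CIGAR operations.
        cigar = $6
        span = 0
        while (match(cigar, /^[0-9]+[MIDNSHP=X]/)) {
          len = substr(cigar, 1, RLENGTH - 1) + 0
          op = substr(cigar, RLENGTH, 1)
          if (op ~ /[MDN=X]/) span += len
          cigar = substr(cigar, RLENGTH + 1)
        }
        five = $4 + span - 1
      }
      key = $3 SUBSEP strand SUBSEP five
      if (!(key in seen)) { seen[key] = 1; print }    # keep the first read per key
    }
  ' "$1"
}
```

With samtools installed, a BAM file could be processed as `samtools view -h aligned.sorted.bam > aligned.sam`, then `dedup_by_5prime aligned.sam | samtools view -b -o deduplicated.bam -`.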
7
Clusters were then assigned using the CLIPper software (Lovci et al., 2013) with parameters --bonferroni --superlocal --threshold- (the last parameter is truncated in the source).
$ Bash example
# Installation (example, adjust path as needed)
# git clone https://github.com/yeolab/clipper.git
# cd clipper
# Follow the repository's installation instructions for its Python dependencies

# Example: Run CLIPper for peak calling
# Input: aligned, deduplicated BAM file
# Output: BED file containing identified peaks/clusters
# Note: the source truncates the third parameter ("--threshold-"). It is left out
# here rather than guessed; check `clipper --help` for the options that begin
# with --threshold.
# The -b/-s/-o invocation below follows the usage shown in the CLIPper README as
# best recalled; verify it against the installed version. File names are placeholders.
clipper \
  -b input_aligned.bam \
  -s hg19 \
  -o output_peaks.bed \
  --bonferroni \
  --superlocal
Raw Source Text
Sequencing reads from CLIP-seq and RIP-seq libraries were first trimmed of polyA tails, adapters, and low quality ends using cutadapt with parameters --match-read-wildcards --times 2 -e 0 -O 5 --quality-cutoff 6 -m 18 -b TCGTATGCCGTCTTCTGCTTG -b ATCTCGTATGCCGTCTTCTGCTTG -b CGACAGGTTCAGAGTTCTACAGTCCGACGATC -b TGGAATTCTCGGGTGCCAAGG -b AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA -b TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT. Reads were then mapped against a database of repetitive elements derived from RepBase18.05. Bowtie version 1.0.0 with parameters -S -q -p 16 -e 100 -l 20 was used to align reads against an index generated from Repbase sequences (Langmead et al., 2009). Reads not mapped to Repbase sequences were aligned to the hg19 human genome (UCSC assembly) using STAR (Dobin et al., 2013) version 2.3.0e with parameters --outSAMunmapped Within --outFilterMultimapNmax 1 --outFilterMultimapScoreRange 1. Reads that were PCR replicates were removed from each CLIP-seq library using a custom script. Briefly one read was kept at each nucleotide position when more than one read's 5' end was mapped. Clusters were then assigned using the CLIPper software with parameters --bonferroni --superlocal --threshold- software (Lovci et al., 2013).
Genome_build: hg19
Supplementary_files_format_and_content: bed format, contains clusters of predicted MSI2 binding