GSE178259 Processing Pipeline
RNA-Seq
code_examples
6 steps
Publication
Non-microRNA binding competitively inhibits LIN28 regulation.Cell reports (2021) — PMID 34380031
Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
Processing Steps
Generate Jupyter Notebook-
1
Raw sequencing reads were trimmed for adapter sequences and barcode sequences (eCLIP samples) using cutadapt.
$ Bash example
# Install cutadapt (e.g., using conda) # conda install -c bioconda cutadapt=1.18 # Define input and output files INPUT_FASTQ="raw_reads.fastq.gz" OUTPUT_FASTQ="trimmed_reads.fastq.gz" REPORT_FILE="cutadapt_report.txt" # Define adapter sequences # Illumina TruSeq 3' adapter (common for eCLIP) ADAPTER_3PRIME="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA" # Placeholder for 5' barcode (e.g., 10 random Ns for UMI/randomer in eCLIP) # Adjust this based on the specific eCLIP library preparation protocol ADAPTER_5PRIME_BARCODE="N{10}" # Execute cutadapt cutadapt \ -a "${ADAPTER_3PRIME}" \ -g "${ADAPTER_5PRIME_BARCODE}" \ -o "${OUTPUT_FASTQ}" \ --minimum-length 18 \ --quality-cutoff 20 \ --error-rate 0.1 \ --cores 8 \ --discard-untrimmed \ --max-n 0 \ --trim-n \ "${INPUT_FASTQ}" > "${REPORT_FILE}" 2>&1 -
2
Trimmed reads were mapped against RepBase to remove reads mapping to repetitive sequences.
$ Bash example
# Install STAR and samtools if not available # conda install -c bioconda star samtools # Placeholder for RepBase STAR index directory. # This directory should contain the Genome.fa, SA, SAindex, etc. files for RepBase. REPBASE_STAR_INDEX="/path/to/repbase_star_index" # Placeholder for input trimmed reads (e.g., single-end or R1 for paired-end) TRIMMED_READS_R1="trimmed_reads_R1.fastq.gz" # If paired-end, also define TRIMMED_READS_R2 # TRIMMED_READS_R2="trimmed_reads_R2.fastq.gz" # Placeholder for output prefix for STAR alignment files STAR_OUTPUT_PREFIX="repbase_alignment" # 1. Align trimmed reads to RepBase using STAR. # Parameters are chosen to be permissive for repetitive sequences and consistent with eCLIP workflows. STAR --genomeDir "${REPBASE_STAR_INDEX}" \ --readFilesIn "${TRIMMED_READS_R1}" \ # If paired-end, uncomment the line below and add R2 # --readFilesIn "${TRIMMED_READS_R1}" "${TRIMMED_READS_R2}" \ --outFileNamePrefix "${STAR_OUTPUT_PREFIX}" \ --outSAMtype BAM Unsorted \ --outSAMunmapped Within \ --outFilterMultimapNmax 100 \ --outFilterMismatchNmax 10 \ --outFilterScoreMinOverLread 0.66 \ --outFilterMatchNminOverLread 0.66 \ --runThreadN 8 # Adjust number of threads as appropriate for your system # 2. Filter out reads that mapped to RepBase (keep only unmapped reads). # The flag '-f 4' selects reads where the 'read unmapped' flag is set. # The output 'non_repetitive_reads.bam' will contain reads that did not map to RepBase. samtools view -f 4 -b "${STAR_OUTPUT_PREFIX}Aligned.out.bam" > non_repetitive_reads.bam # Optional: Convert the filtered BAM back to FASTQ if subsequent steps require FASTQ input # samtools fastq -n -F 4 "${STAR_OUTPUT_PREFIX}Aligned.out.bam" > non_repetitive_reads.fastq.gz -
3
Remaining reads were mapped to the appropriate genome build using STAR aligner
$ Bash example
# Install STAR (example using conda) # conda create -n star_env star=2.7.10a -c bioconda -c conda-forge # conda activate star_env # Define variables READS="remaining_reads.fastq.gz" # Placeholder for input reads GENOME_DIR="STAR_genome_index/hg38" # Placeholder for STAR genome index (e.g., for human hg38) OUTPUT_PREFIX="aligned_STAR_" THREADS=8 # Number of threads to use # --- Reference Data Setup (run once per genome) --- # To generate the STAR genome index, you would typically use: # STAR --runThreadN ${THREADS} --runMode genomeGenerate \ # --genomeDir ${GENOME_DIR} \ # --genomeFastaFiles /path/to/hg38.fa \ # --sjdbGTFfile /path/to/gencode.v38.annotation.gtf \ # --sjdbOverhang 100 # Recommended: (ReadLength - 1) # Reference genome (e.g., hg38.fa) can be obtained from UCSC, Ensembl, or NCBI. # Annotation GTF (e.g., gencode.v38.annotation.gtf) can be obtained from GENCODE. # Map reads to the genome STAR --runThreadN ${THREADS} \ --genomeDir ${GENOME_DIR} \ --readFilesIn ${READS} \ --readFilesCommand zcat \ --outFileNamePrefix ${OUTPUT_PREFIX} \ --outSAMtype BAM SortedByCoordinate \ --outSAMattributes Standard \ --outFilterMultimapNmax 20 \ --outFilterMismatchNmax 999 \ --outFilterMismatchNoverLmax 0.1 \ --alignIntronMin 20 \ --alignIntronMax 1000000 \ --alignMatesGapMax 1000000 \ --sjdbScore 1 -
4
RPKM values were calculated for all polysome samples.
$ Bash example
# Install RSEM (if not already installed) # conda install -c bioconda rsem # --- Reference Preparation (run once) --- # This step builds the RSEM reference index from a genome FASTA and gene annotation GTF file. # Replace 'genome.fa' with your reference genome FASTA file (e.g., hg38.fa). # Replace 'genes.gtf' with your gene annotation GTF file (e.g., gencode.v38.annotation.gtf). # 'rsem_reference_index' will be the prefix for the generated index files. # rsem-prepare-reference --gtf genes.gtf genome.fa rsem_reference_index # --- RPKM Calculation for a single polysome sample --- # This command calculates RPKM (among other metrics) for a single sample. # For multiple polysome samples, this command would typically be run in a loop or a pipeline. # Replace 'aligned_reads.bam' with the path to your input BAM file (e.g., from STAR alignment). # Replace 'rsem_reference_index' with the path to your prepared RSEM reference index (e.g., 'rsem_reference_index'). # Replace 'sample_output_prefix' with a desired prefix for output files (e.g., 'polysome_sample1'). # The '--strandedness reverse' parameter is common for many RNA-seq library preparations (e.g., Illumina TruSeq stranded). # Adjust to 'forward' or 'none' based on your library preparation protocol. rsem-calculate-expression \ --bam \ --no-qualities \ -p 8 \ --strandedness reverse \ aligned_reads.bam \ rsem_reference_index \ sample_output_prefix # The RPKM values will be found in the 'sample_output_prefix.genes.results' file (e.g., 'polysome_sample1.genes.results'). -
5
For eCLIP samples, read densities were calculated to identify eCLIP peaks.
$ Bash example
# Install dependencies for clipper # pip install numpy scipy pysam # Placeholder for genome sizes file (e.g., hg38.chrom.sizes) # This file can be generated from a FASTA file using samtools faidx and awk, # or downloaded from UCSC (e.g., http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes) # Example: # samtools faidx hg38.fa # cut -f1,2 hg38.fa.fai > hg38.chrom.sizes # Run clipper to identify eCLIP peaks # -b: Input BAM file (aligned reads) # -s: Genome sizes file # -o: Output prefix for peak files python clipper.py -b sample_eclip.bam -s hg38.chrom.sizes -o sample_eclip_peaks
-
6
eclip data processing pipeline can be requested from the following link : https://github.com/YeoLab/eclip
$ Bash example
# Install cwltool (if not already installed) # pip install cwltool # Clone the YeoLab eCLIP CWL workflow repository git clone https://github.com/YeoLab/eclip.git cd eclip # Navigate into the cloned repository directory # --- Placeholder Input Data --- # Replace with actual FASTQ files for your experiment # Assuming single-end reads for demonstration, as paired-end is optional in the CWL mkdir -p data/fastq echo "fake_read1_content" > data/fastq/sample_R1.fastq.gz # --- Placeholder Reference Data (hg38) --- # Download or specify paths to your reference genome files # These are examples for human hg38, replace with actual paths mkdir -p reference/hg38 # Example: Download from UCSC/GENCODE # wget -P reference/hg38 https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz # gunzip reference/hg38/hg38.fa.gz # mv reference/hg38/hg38.fa reference/hg38/GRCh38.p14.genome.fa # wget -P reference/hg38 http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_45/gencode.v45.annotation.gtf.gz # gunzip reference/hg38/gencode.v45.annotation.gtf.gz # wget -P reference/hg38 https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes # wget -P reference/hg38 https://raw.githubusercontent.com/Boyle-Lab/Blacklist/master/lists/hg38-blacklist.v2.bed # Create a dummy STAR index directory mkdir -p reference/hg38/STAR_index # In a real scenario, you would generate this using STAR --runMode genomeGenerate # STAR --runMode genomeGenerate --genomeDir reference/hg38/STAR_index --genomeFastaFiles reference/hg38/GRCh38.p14.genome.fa --sjdbGTFfile reference/hg38/gencode.v45.annotation.gtf --sjdbOverhang 99 --runThreadN 8 # --- Create CWL input YAML file --- # This file specifies all inputs for the eCLIP CWL workflow cat << EOF > eclip_inputs.yaml fastq_file_r1: class: File path: data/fastq/sample_R1.fastq.gz # fastq_file_r2: # Uncomment and provide path for paired-end reads # class: File # path: data/fastq/sample_R2.fastq.gz genome_fasta: class: File path: reference/hg38/GRCh38.p14.genome.fa genome_gtf: class: File path: reference/hg38/gencode.v45.annotation.gtf star_index: class: Directory path: reference/hg38/STAR_index chrom_sizes: class: File path: reference/hg38/hg38.chrom.sizes blacklist_bed: class: File path: reference/hg38/hg38-blacklist.v2.bed # Optional parameters for CLIPper (using default values from the CWL or common practice) min_read_count: 10 min_peak_length: 10 max_peak_length: 500 p_threshold: 0.01 # Add other parameters as needed, e.g., for adapter trimming, quality filtering, etc. EOF # --- Execute the eCLIP CWL workflow --- # The eclip.cwl file is in the current directory after 'cd eclip' cwltool eclip.cwl eclip_inputs.yaml --outdir eclip_output
Raw Source Text
Raw sequencing reads were trimmed for adapter sequences and barcode sequences (eCLIP samples) using cutadapt. Trimmed reads were mapped against RepBase to remove reads mapping to repetitive sequences. Remaining reads were mapped to the appropriate genome build using STAR aligner RPKM values were calculated for all polysome samples. For eCLIP samples, read densities were calculated to identify eCLIP peaks. eclip data processing pipeline can be requested from the following link : https://github.com/YeoLab/eclip Genome_build: mm10 and hg19 Supplementary_files_format_and_content: RPKM files for all polysome samples. eCLIP input normalized peaks in BED format. eCLIP reproduclible peaks in BED format. eCLIP normalized read densitied in BIGWIG format.