GSE109423 Processing Pipeline
RNA-Seq
code_examples
6 steps
Publication
A Transcriptome-wide Translational Program Defined by LIN28B Expression Level.Molecular cell (2019) — PMID 30527666
Dataset
GSE109423A transcriptome-wide divergence in protein translation scales with LIN28B expression
Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
Processing Steps
Generate Jupyter Notebook-
1
Library strategy: ribosome profiling
$ Bash example
# Install STAR (if not already installed) # conda install -c bioconda star # Define variables # Replace with actual paths and filenames GENOME_DIR="/path/to/STAR_genome_index/hg38" # Directory containing STAR genome index files (e.g., SA, SAindex) READS_FILE="input_riboseq_reads.fastq.gz" # Input gzipped FASTQ file OUTPUT_PREFIX="riboseq_alignment" # Prefix for output files THREADS=8 # Number of threads to use # Note: A STAR genome index must be pre-built using STAR --runMode genomeGenerate. # Example command for building index (run once per genome/annotation combination): # STAR --runMode genomeGenerate \ # --genomeDir ${GENOME_DIR} \ # --genomeFastaFiles /path/to/hg38.fa \ # --sjdbGTFfile /path/to/hg38.gtf \ # --runThreadN ${THREADS} # Align Ribo-seq reads using STAR STAR --genomeDir ${GENOME_DIR} \ --readFilesIn ${READS_FILE} \ --runThreadN ${THREADS} \ --outFileNamePrefix ${OUTPUT_PREFIX}_ \ --outSAMtype BAM SortedByCoordinate \ --outSAMunmapped None \ --outFilterType BySJout \ --outFilterMultimapNmax 20 \ --outFilterMismatchNmax 999 \ --outFilterMismatchNoverLmax 0.04 \ --alignIntronMin 20 \ --alignIntronMax 1000000 \ --alignMatesGapMax 1000000 \ --alignSJoverhangMin 8 \ --alignSJDBoverhangMin 1 \ --sjdbScore 1 \ --limitBAMsortRAM 30000000000 # Adjust based on available RAM (e.g., 30GB) -
2
Raw sequencing reads were trimmed for adapter sequences and barcode sequences (eCLIP samples) using cutadapt.
$ Bash example
# Install cutadapt if not already available # conda install -c bioconda cutadapt=3.4 # Define input and output files INPUT_READS="raw_reads.fastq.gz" OUTPUT_TRIMMED_READS="trimmed_reads.fastq.gz" # Define adapter and barcode sequences common in eCLIP # Common Illumina TruSeq Universal Adapter (3' end) ADAPTER_3PRIME="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA" # Placeholder for a 5' barcode/UMI (e.g., 8 Ns) # In a real eCLIP experiment, this would be a specific barcode sequence or a UMI pattern. BARCODE_5PRIME="NNNNNNNN" # Trim adapter and barcode sequences, perform quality trimming, and filter by length cutadapt \ -a "${ADAPTER_3PRIME}" \ -g "${BARCODE_5PRIME}" \ -m 18 \ -q 20 \ -e 0.1 \ --cores 4 \ -o "${OUTPUT_TRIMMED_READS}" \ "${INPUT_READS}" -
3
Trimmed reads were mapped against RepBase to remove reads mapping to repetitive sequences.
bowtie2 (Inferred with models/gemini-2.5-flash) v2.4.5 (Inferred with models/gemini-2.5-flash) GitHub$ Bash example
# Install bowtie2 if not already installed # conda install -c bioconda bowtie2 # --- Reference Data Preparation (RepBase) --- # RepBase sequences are typically obtained from the Genetic Information Research Institute (GIRI). # For demonstration, we assume a RepBase FASTA file is available. # The actual download might involve specific species or versions (e.g., human, mouse). # Example (illustrative, actual URL/process may vary): # wget -O RepBase.fasta "http://www.girinst.org/repbase/update/RepBase.fasta" REPBASE_FASTA="RepBase.fasta" # Placeholder for the RepBase FASTA file REPBASE_INDEX_PREFIX="RepBase_index" # Build bowtie2 index for RepBase # This step only needs to be done once per RepBase FASTA if [ ! -f "${REPBASE_INDEX_PREFIX}.1.bt2" ]; then echo "Building bowtie2 index for RepBase..." bowtie2-build "${REPBASE_FASTA}" "${REPBASE_INDEX_PREFIX}" echo "bowtie2 index built." else echo "bowtie2 index for RepBase already exists." fi # --- Read Mapping and Filtering --- INPUT_FASTQ="trimmed_reads.fastq.gz" # Input trimmed reads OUTPUT_UNMAPPED_FASTQ="reads_without_repetitive_sequences.fastq.gz" # Output reads that did not map to RepBase NUM_THREADS=8 # Number of threads to use echo "Mapping trimmed reads against RepBase to remove repetitive sequences..." bowtie2 -x "${REPBASE_INDEX_PREFIX}" \ -U "${INPUT_FASTQ}" \ --un-gz "${OUTPUT_UNMAPPED_FASTQ}" \ -S /dev/null \ -p "${NUM_THREADS}" \ --very-sensitive-local # Using a sensitive local alignment mode for better detection of repetitive elements echo "Reads not mapping to RepBase saved to ${OUTPUT_UNMAPPED_FASTQ}" -
4
Remaining reads were mapped to the appropriate genome build using STAR aligner
$ Bash example
# Install STAR (if not already installed) # conda create -n star_env star=2.7.10a -c bioconda -c conda-forge # conda activate star_env # Define variables # Placeholder for GRCh38 genome index. Replace with the actual path to your STAR genome index. GENOME_DIR="/path/to/STAR_genome_index/GRCh38" # Input FASTQ files. Assuming paired-end reads. Adjust if single-end. READS_R1="remaining_reads_R1.fastq.gz" READS_R2="remaining_reads_R2.fastq.gz" OUTPUT_PREFIX="remaining_reads_aligned" NUM_THREADS=8 # Adjust based on available CPU cores RAM_LIMIT_BAM_SORT=30000000000 # 30GB, adjust based on available RAM # Run STAR alignment STAR --genomeDir "${GENOME_DIR}" \ --readFilesIn "${READS_R1}" "${READS_R2}" \ --readFilesCommand zcat \ --runThreadN "${NUM_THREADS}" \ --outFileNamePrefix "${OUTPUT_PREFIX}" \ --outSAMtype BAM SortedByCoordinate \ --outSAMattributes Standard \ --outFilterType BySJout \ --outFilterMismatchNmax 10 \ --outFilterScoreMinOverLread 0.3 \ --outFilterMatchNminOverLread 0.3 \ --alignIntronMin 20 \ --alignIntronMax 1000000 \ --alignMatesGapMax 1000000 \ --limitBAMsortRAM "${RAM_LIMIT_BAM_SORT}" -
5
RPKM values were calculated for all polysome and rnaseq samples.
$ Bash example
# Install RSEM (if not already installed) # conda install -c bioconda rsem # --- Reference Preparation (run once) --- # RSEM requires a reference genome and a GTF annotation file to build its index. # For human (GRCh38), you can download these from Ensembl or GENCODE. # Example for human GRCh38 (Ensembl release 109): # wget -O Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz ftp://ftp.ensembl.org/pub/release-109/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz # wget -O Homo_sapiens.GRCh38.109.gtf.gz ftp://ftp.ensembl.org/pub/release-109/gtf/homo_sapiens/Homo_sapiens.GRCh38.109.gtf.gz # gunzip Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz # gunzip Homo_sapiens.GRCh38.109.gtf.gz # Define paths for reference files (replace with your actual paths) # GENOME_FA="/path/to/Homo_sapiens.GRCh38.dna.primary_assembly.fa" # GTF_FILE="/path/to/Homo_sapiens.GRCh38.109.gtf" # RSEM_REF_INDEX="/path/to/rsem_ref_index_GRCh38" # Build RSEM reference index (uncomment and run once per reference) # mkdir -p $(dirname "${RSEM_REF_INDEX}") # rsem-prepare-reference \ # --gtf "${GTF_FILE}" \ # "${GENOME_FA}" \ # "${RSEM_REF_INDEX}" # --- RPKM Calculation for Samples --- # Define paths for input FASTQ files and output results # INPUT_DIR="/path/to/your/fastq_files" # OUTPUT_DIR="/path/to/your/rsem_results" # RSEM_REF_INDEX="/path/to/rsem_ref_index_GRCh38" # Same as used for reference preparation # Create output directory if it doesn't exist # mkdir -p "${OUTPUT_DIR}" # Loop through samples (assuming paired-end FASTQ files: SAMPLE_ID_R1.fastq.gz, SAMPLE_ID_R2.fastq.gz) # Replace 'sample1 sample2' with the actual list of your sample IDs for sample_id in sample1 sample2; do READ1="${INPUT_DIR}/${sample_id}_R1.fastq.gz" READ2="${INPUT_DIR}/${sample_id}_R2.fastq.gz" OUTPUT_PREFIX="${OUTPUT_DIR}/${sample_id}" echo "Processing sample: ${sample_id}" rsem-calculate-expression \ --paired-end \ --num-threads 8 \ "${READ1}" \ "${READ2}" \ "${RSEM_REF_INDEX}" \ "${OUTPUT_PREFIX}" done # RPKM values for genes are typically found in the '.genes.results' file # The 6th column of this file (0-indexed) contains the RPKM values. # Example: To view gene_id and RPKM for 'sample1': # head -n 1 "${OUTPUT_DIR}/sample1.genes.results" # cat "${OUTPUT_DIR}/sample1.genes.results" | awk 'NR==1 || $1 ~ /^ENSG/ {print $1, $6}' -
6
For eCLIP samples, read densities were calculated to identify eCLIP peaks. eCLIP peaks were identified using clipper
$ Bash example
# Clone the clipper repository if you haven't already # git clone https://github.com/yeolab/clipper.git # cd clipper # Install dependencies (if not already present in your environment) # CLIPper is a Python script and typically requires numpy, scipy, and pysam. # conda install -c conda-forge numpy scipy pysam # Define input files and parameters # Replace with your actual file paths and desired parameters TREATMENT_BAM="path/to/your/eCLIP_sample.bam" CONTROL_BAM="path/to/your/SMInput_or_IgG_control.bam" # This is typically the size-matched input or IgG control BAM GENOME_SIZE="hg38" # Example: hg38 for human, mm10 for mouse. Adjust as needed. OUTPUT_PREFIX="eCLIP_peaks_output" P_VALUE_THRESHOLD="0.01" # Default p-value threshold in clipper.py FDR_THRESHOLD="0.05" # Default FDR threshold in clipper.py MIN_PEAK_LENGTH="10" # Default minimum peak length in clipper.py # Execute clipper # Ensure clipper.py is in your current directory or in your PATH python clipper.py \ -t "${TREATMENT_BAM}" \ -c "${CONTROL_BAM}" \ -s "${GENOME_SIZE}" \ -o "${OUTPUT_PREFIX}" \ -p "${P_VALUE_THRESHOLD}" \ -f "${FDR_THRESHOLD}" \ -l "${MIN_PEAK_LENGTH}"
Raw Source Text
Library strategy: ribosome profiling Raw sequencing reads were trimmed for adapter sequences and barcode sequences (eCLIP samples) using cutadapt. Trimmed reads were mapped against RepBase to remove reads mapping to repetitive sequences. Remaining reads were mapped to the appropriate genome build using STAR aligner RPKM values were calculated for all polysome and rnaseq samples. For eCLIP samples, read densities were calculated to identify eCLIP peaks. eCLIP peaks were identified using clipper Genome_build: mm10 and hg19 Supplementary_files_format_and_content: RPKM files for all polysome samples. eCLIP input normalized peaks in BED format