GSE195495 Processing Pipeline

RIP-Seq code_examples 12 steps

Publication

Nuclear and cytoplasmic poly(A) binding proteins (PABPs) favor distinct transcripts and isoforms.

Nucleic acids research (2022) — PMID 35438785

Dataset

GSE195495

Identification of different transcripts that favor the nuclear or cytoplasmic Poly(A) Binding Proteins (PABPs) through RNA immunoprecipitation

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

All reads were aligned to the human genome hg38 primary assembly using STAR.

STAR v2.7.10a GitHub

$ Bash example

# Install STAR (if not already installed)
# conda install -c bioconda star=2.7.10a

# Define paths for reference genome and GTF
# For hg38 primary assembly, we typically use GENCODE or Ensembl annotations.
# Example Ensembl GRCh38 release 110:
# GENOME_FASTA="ftp://ftp.ensembl.org/pub/release-110/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz"
# GTF_FILE="ftp://ftp.ensembl.org/pub/release-110/gtf/homo_sapiens/Homo_sapiens.GRCh38.110.gtf.gz"

# Placeholder paths for local files
GENOME_FASTA="/path/to/reference/hg38/Homo_sapiens.GRCh38.dna.primary_assembly.fa" # e.g., from Ensembl or UCSC
GTF_FILE="/path/to/reference/hg38/Homo_sapiens.GRCh38.110.gtf" # e.g., from Ensembl or GENCODE
STAR_INDEX_DIR="/path/to/STAR_index/hg38_ensembl_110" # Directory to store the STAR genome index

# Create STAR genome index (run this command once to build the index)
# mkdir -p ${STAR_INDEX_DIR}
# STAR --runMode genomeGenerate \
#      --genomeDir ${STAR_INDEX_DIR} \
#      --genomeFastaFiles ${GENOME_FASTA} \
#      --sjdbGTFfile ${GTF_FILE} \
#      --sjdbOverhang 100 \
#      --runThreadN 8 # Adjust threads based on available resources

# Define input and output files
READS_R1="sample_R1.fastq.gz" # Placeholder for forward reads
READS_R2="sample_R2.fastq.gz" # Placeholder for reverse reads (if paired-end, otherwise remove)
OUTPUT_PREFIX="sample_aligned" # Prefix for output files (e.g., sample_aligned.Aligned.sortedByCoord.out.bam)
THREADS=8 # Number of threads to use for alignment

# Align reads to hg38 primary assembly using STAR
STAR --genomeDir ${STAR_INDEX_DIR} \
     --readFilesIn ${READS_R1} ${READS_R2} \
     --readFilesCommand zcat \
     --runThreadN ${THREADS} \
     --outFileNamePrefix ${OUTPUT_PREFIX}. \
     --outSAMtype BAM SortedByCoordinate \
     --outSAMattributes All \
     --outSAMunmapped Within \
     --twopassMode Basic

View on GitHub

featureCounts version 2.0.2 was used to annotate reads, using the flag --minOverlap 20 and a custom GTF file derived from gencode v34

featureCounts v2.0.2

$ Bash example

# featureCounts is part of the Subread package.
# Installation via Bioconda:
# conda install -c bioconda subread

# Placeholder for the custom GTF file derived from gencode v34.
# Ensure this GTF file is available in your working directory or provide its full path.
# Example: gencode.v34.annotation.gtf
CUSTOM_GTF="custom_gencode_v34.gtf"

# Placeholder for input BAM file(s).
# featureCounts can take multiple BAM files as input.
INPUT_BAM="input.bam"

# Placeholder for output counts file.
OUTPUT_COUNTS="featureCounts_counts.txt"

# Execute featureCounts
# -a: Annotation file (GTF/GFF)
# -o: Output file
# --minOverlap: Minimum number of overlapping bases in a read that is required for read assignment.
# Input BAM files are provided as positional arguments.
featureCounts -a "${CUSTOM_GTF}" \
              -o "${OUTPUT_COUNTS}" \
              --minOverlap 20 \
              "${INPUT_BAM}"

Differential expression was calculated using DESeq2.

DESeq2 v1.42.0 (R 4.3.2, Bioconductor 3.18) (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install R and Bioconductor if not already present
# conda create -n deseq2_env r-base=4.3.2 bioconductor-deseq2 -c bioconda -c conda-forge -y
# conda activate deseq2_env

# Create dummy input files for demonstration purposes
# In a real scenario, 'counts.tsv' would be your gene/transcript count matrix
# and 'sample_info.tsv' would contain metadata for your samples.
cat << EOF > counts.tsv	gene1	gene2	gene3	sampleA1	100	50	20
sampleA2	120	60	25
sampleB1	50	100	15
sampleB2	60	110	18
EOF

cat << EOF > sample_info.tsv	sample	condition	sampleA1	control	sampleA2	control	sampleB1	treated	sampleB2	treated
EOF

# R script for DESeq2 analysis
cat << 'EOF' > run_deseq2.R
library(DESeq2)

# Load count data
count_data <- read.table("counts.tsv", header=TRUE, row.names=1, sep="\t")

# Load sample information
sample_info <- read.table("sample_info.tsv", header=TRUE, row.names=1, sep="\t")

# Ensure sample order matches between count data and sample info
sample_info <- sample_info[colnames(count_data), , drop=FALSE]

# Create DESeqDataSet object
dds <- DESeqDataSetFromMatrix(countData = round(count_data), # DESeq2 expects integer counts
                              colData = sample_info,
                              design = ~ condition)

# Run DESeq2 analysis
dds <- DESeq(dds)

# Get results for the 'treated' vs 'control' comparison
res <- results(dds, contrast=c("condition", "treated", "control"))

# Order results by adjusted p-value
res <- res[order(res$padj), ]

# Write results to a CSV file
write.csv(as.data.frame(res), file="deseq2_results.csv")

# Optional: Save normalized counts
norm_counts <- counts(dds, normalized=TRUE)
write.csv(as.data.frame(norm_counts), file="deseq2_normalized_counts.csv")

message("DESeq2 analysis complete. Results saved to deseq2_results.csv")
EOF

# Execute the R script
Rscript run_deseq2.R

View on GitHub

Cut-offs used to determine significant differential expression were baseMean > 50 and padj <=0.01

DESeq2 (Inferred with models/gemini-2.5-flash) v1.42.0 GitHub

$ Bash example

# This script assumes DESeq2 has already been run and its results are in 'deseq2_results.tsv'.
# The following R script filters the results based on the specified cut-offs.

# If R and DESeq2 are not installed:
# # conda install -c conda-forge r-base
# # conda install -c bioconda bioconductor-deseq2

Rscript -e '
  results_df <- read.delim("deseq2_results.tsv", sep="\t", header=TRUE)
  significant_results <- subset(results_df, baseMean > 50 & padj <= 0.01)
  write.table(significant_results, "significant_differential_expression.tsv", sep="\t", quote=FALSE, row.names=FALSE)
'

View on GitHub

Genes which had substantial upstream intergenic reads (similar to DoGs) were removed from DE tables.

Custom script (Inferred with models/gemini-2.5-flash) vN/A

$ Bash example

# This script filters a Differential Expression (DE) results table
# by removing genes identified as having "substantial upstream intergenic reads" (similar to DoGs).
# The identification of such genes is assumed to be performed by a prior custom script.

# --- Conceptual Step: Identification of genes with substantial upstream intergenic reads ---
# This part is highly dependent on how "substantial upstream intergenic reads" is defined.
# It would typically involve:
# 1. Defining upstream intergenic regions for each gene using a genome annotation file (e.g., GTF/GFF).
#    Reference dataset: Genome annotation (e.g., Homo_sapiens.GRCh38.109.gtf)
# 2. Quantifying reads in these defined regions from aligned sequencing data (e.g., BAM files).
#    Reference dataset: Aligned reads (e.g., sample.bam)
# 3. Applying a threshold or statistical method to identify genes with "substantial" read counts
#    in these upstream intergenic regions.
#
# For this filtering step, we assume a file named 'genes_to_remove.txt' has already been generated,
# containing one gene identifier per line for genes that meet the removal criteria.
# Example (conceptual) command that might generate 'genes_to_to_remove.txt':
# python /path/to/custom_script/identify_dogs_like_genes.py \
#     --bam_file /path/to/aligned_reads/sample.bam \
#     --gtf_file /path/to/genome_annotation/Homo_sapiens.GRCh38.109.gtf \
#     --output_file genes_to_remove.txt \
#     --upstream_distance 1000 \
#     --min_read_count 50 \
#     --intergenic_only

# --- Actual Filtering Step ---

# Input: Differential Expression results table
# This file is expected to have gene identifiers in the first column and a header.
DE_TABLE="differential_expression_results.tsv"

# Input: List of gene identifiers to be removed
# Each line should contain one gene ID.
GENES_TO_REMOVE="genes_to_remove.txt"

# Output: Filtered Differential Expression results table
FILTERED_DE_TABLE="differential_expression_results_filtered.tsv"

# Check if the DE table exists
if [ ! -f "$DE_TABLE" ]; then
    echo "Error: Differential expression table '$DE_TABLE' not found."
    exit 1
fi

# Check if the list of genes to remove exists and is not empty
if [ ! -s "$GENES_TO_REMOVE" ]; then
    echo "Warning: List of genes to remove '$GENES_TO_REMOVE' is empty or not found. No genes will be filtered."
    # If the file is empty, we can just copy the original DE table or proceed without filtering.
    cp "$DE_TABLE" "$FILTERED_DE_TABLE"
    echo "Original DE table copied to $FILTERED_DE_TABLE as no genes were specified for removal."
    exit 0
fi

# Filter the DE table:
# 1. Read the list of genes to remove into an associative array 'a'.
# 2. Process the DE table:
#    - Print the header line (NR==1).
#    - For subsequent lines, print the line only if the gene ID (first column, $1) is NOT in the 'a' array.
awk 'FNR==NR{a[$1]=1; next} NR==1 || !($1 in a)' "$GENES_TO_REMOVE" "$DE_TABLE" > "$FILTERED_DE_TABLE"

echo "Filtering complete. Filtered DE table saved to: $FILTERED_DE_TABLE"
echo "Genes removed from DE table (listed in $GENES_TO_REMOVE):"
cat "$GENES_TO_REMOVE"

Upstream intergenic regions were extracted using bedtools flank -l 2000 -r 0 -s.

bedtools v2.31.0 GitHub

$ Bash example

# Install bedtools if not already installed
# conda install -c bioconda bedtools

# Placeholder for input intergenic regions file.
# This file would typically be generated in a previous step,
# e.g., by subtracting gene annotations from the whole genome.
# For demonstration, let's assume 'intergenic_regions.bed' exists.
# Example (not part of this specific step):
# bedtools complement -i genes.bed -g hg38.chrom.sizes > intergenic_regions.bed

# Download a chrom.sizes file for the desired genome assembly (e.g., hg38)
# wget -O hg38.chrom.sizes http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes

# Extract upstream intergenic regions
bedtools flank -i intergenic_regions.bed -g hg38.chrom.sizes -l 2000 -r 0 -s > upstream_intergenic_regions.bed

View on GitHub

Intervening upstream genes that had any overlap with this region were removed with bedtools subtract.

bedtools v2.30.0 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install bedtools if not already installed
# conda install -c bioconda bedtools

# Define input and output files (placeholders)
# input_regions.bed: BED file containing the regions of interest (e.g., original regions)
# upstream_genes.bed: BED file containing the upstream gene regions to be removed
# filtered_regions.bed: Output BED file with upstream genes subtracted

bedtools subtract -a input_regions.bed -b upstream_genes.bed > filtered_regions.bed

View on GitHub

Reads that did not align to intronic or exonic regions were extracted from bam files with fgrep.

fgrep (Inferred with models/gemini-2.5-flash) vSystem Utility (Inferred with models/gemini-2.5-flash)

$ Bash example

# This step describes extracting reads that *did not* align to intronic or exonic regions.
# Directly using `fgrep` on a BAM file for genomic region filtering is not standard as BAM is binary.
# A common approach involves first identifying reads that *do* align to these regions (e.g., using bedtools intersect),
# extracting their read IDs, and then using `fgrep -v` on a list of all read IDs to find the complement.
# Finally, `samtools` can filter the original BAM based on these non-aligning read IDs.
# Alternatively, if 'fgrep' is used directly, it implies the BAM was converted to SAM/FASTQ and then filtered for specific string patterns, which is less likely for genomic region filtering.
# The following code block demonstrates a plausible interpretation using standard tools to achieve the described outcome, assuming 'fgrep' was used on a list of read IDs.

# Define input and output files
INPUT_BAM="input.bam"
INTRONIC_EXONIC_BED="intronic_exonic.bed" # Placeholder for a BED file defining intronic/exonic regions
OUTPUT_BAM="reads_not_in_intronic_exonic.bam"

# --- Installation (commented out) ---
# conda install -c bioconda samtools bedtools

# 1. Extract all read IDs from the input BAM file
samtools view "${INPUT_BAM}" | cut -f1 > all_read_ids.txt

# 2. Identify reads that *do* align to intronic or exonic regions
# This step uses bedtools intersect to find reads overlapping the defined regions.
# The output is a BAM file containing only reads that align to intronic/exonic regions.
bedtools intersect -abam "${INPUT_BAM}" -b "${INTRONIC_EXONIC_BED}" > aligned_to_regions.bam

# 3. Extract read IDs from the reads that *do* align to intronic/exonic regions
samtools view aligned_to_regions.bam | cut -f1 | sort -u > aligned_read_ids.txt

# 4. Use fgrep to find read IDs that are *not* in the 'aligned_read_ids.txt' list
# This effectively identifies reads that did not align to intronic or exonic regions.
fgrep -v -f aligned_read_ids.txt all_read_ids.txt > non_intronic_exonic_read_ids.txt

# 5. Filter the original BAM file to extract reads corresponding to the non-intronic/exonic read IDs
# Note: Filtering BAM by a list of read names is not directly supported by `samtools view -N`.
# A common workaround is to convert to SAM, filter with `grep`, and convert back.
# However, `samtools view -N` is a common misconception. A more efficient way is to use `samtools view -L` for regions or `samtools view -f` for flags.
# Given the explicit mention of `fgrep` and the need to filter by read names, converting to SAM and back is the most direct interpretation of using `fgrep` for this purpose.

samtools view -h "${INPUT_BAM}" | fgrep -f non_intronic_exonic_read_ids.txt - > temp_filtered.sam
samtools view -bS temp_filtered.sam > "${OUTPUT_BAM}"

# Clean up intermediate files
rm all_read_ids.txt aligned_to_regions.bam aligned_read_ids.txt non_intronic_exonic_read_ids.txt temp_filtered.sam

Coverage across the 2000 bp upstream region was determined with bedtools coverage -S -split.

bedtools vv2.31.0 GitHub

$ Bash example

# Install bedtools (if not already installed)
# conda install -c bioconda bedtools

# Placeholder variables (replace with actual file paths)
INPUT_BAM="aligned_reads.bam" # e.g., a STAR-aligned BAM file from eCLIP data
UPSTREAM_REGIONS_BED="upstream_regions_2000bp.bed" # BED file defining the 2000 bp upstream regions
OUTPUT_COVERAGE_FILE="upstream_coverage.tsv"

# Determine coverage across the 2000 bp upstream region
bedtools coverage -a "${UPSTREAM_REGIONS_BED}" -b "${INPUT_BAM}" -S -split > "${OUTPUT_COVERAGE_FILE}"

View on GitHub

A TPM value was determined for this upstream region and compared to the TPM of the adjacent downstream gene.

Salmon (Inferred with models/gemini-2.5-flash) v1.10.0 GitHub

$ Bash example

# Install Salmon (if not already installed)
# conda install -c bioconda salmon

# Define variables
GENOME_VERSION="GRCh38" # Example: Human genome version
TRANSCRIPTOME_FASTA="gencode.v38.transcripts.fa.gz" # Example: GENCODE v38 transcriptome FASTA
SALMON_INDEX_DIR="salmon_index_${GENOME_VERSION}"
READ1="sample_R1.fastq.gz"
READ2="sample_R2.fastq.gz" # Optional, if paired-end
OUTPUT_DIR="salmon_quant_output"

# Download reference transcriptome (example commands - replace with actual paths if already available)
# mkdir -p references
# cd references
# wget -c ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_38/gencode.v38.transcripts.fa.gz
# cd ..

# 1. Build Salmon index (run once for a given transcriptome)
salmon index -t "${TRANSCRIPTOME_FASTA}" -i "${SALMON_INDEX_DIR}" --gencode

# 2. Quantify expression (determine TPM values)
# For paired-end reads:
salmon quant -i "${SALMON_INDEX_DIR}" -l A \
             -1 "${READ1}" -2 "${READ2}" \
             -p 8 --validateMappings -o "${OUTPUT_DIR}"

# For single-end reads (uncomment and use if applicable):
# salmon quant -i "${SALMON_INDEX_DIR}" -l A \
#              -r "${READ1}" \
#              -p 8 --validateMappings -o "${OUTPUT_DIR}"

# The output directory will contain 'quant.sf' which includes TPM values for transcripts.
# Further analysis (e.g., aggregating transcript TPMs to gene TPMs, 
# identifying "upstream regions" and "downstream genes" based on genomic coordinates 
# from an annotation file like GTF, and comparing their TPMs) would be performed 
# using custom scripts (e.g., R or Python) parsing the 'quant.sf' file and the annotation.

View on GitHub

If a ratio of 20% of the signal (determined by TPM) was present in the upstream region, and a breadth of 70% of that region was covered, that gene was determined to have significant enough upstream transcription so as to not be a reliable functional coding product, and was removed from subsequent analysis.

Custom Script (Inferred with models/gemini-2.5-flash) vN/A (Inferred with models/gemini-2.5-flash)

$ Bash example

#!/bin/bash

# --- Configuration ---
TPM_RATIO_THRESHOLD=0.20 # 20% of signal (TPM) in upstream region
COVERAGE_BREADTH_THRESHOLD=0.70 # 70% breadth of coverage in upstream region

# --- Input Files (placeholders - replace with actual paths) ---
# These files would typically be generated by previous steps in the pipeline.
# For example:
# - gene_body_tpm.tsv: Gene-level TPM quantification (e.g., from Salmon, Kallisto, RSEM)
#   Format: gene_id<tab>TPM_value
# - upstream_region_tpm.tsv: TPM quantification specifically for the upstream region of each gene.
#   This would require defining upstream regions (e.g., using bedtools flank on gene annotations),
#   quantifying reads in these regions (e.g., featureCounts), and then normalizing to TPM.
#   Format: gene_id<tab>TPM_value
# - upstream_coverage_breadth.tsv: Fraction of the upstream region covered by transcription.
#   This could be calculated using bedtools coverage with aligned RNA-seq reads (BAM) and upstream region BED files.
#   Format: gene_id<tab>fraction_covered

# --- Output Files ---
FILTERED_GENES_LIST="genes_removed_due_to_upstream_transcription.tsv"
RELIABLE_GENES_LIST="reliable_functional_coding_products.tsv"

# --- Placeholder for generating dummy input files for demonstration ---
# In a real pipeline, these would be actual output from previous steps.
echo -e "geneA\t100.0" > gene_body_tpm.tsv
echo -e "geneB\t20.0" >> gene_body_tpm.tsv
echo -e "geneC\t50.0" >> gene_body_tpm.tsv
echo -e "geneD\t10.0" >> gene_body_tpm.tsv
echo -e "geneE\t75.0" >> gene_body_tpm.tsv

echo -e "geneA\t30.0" > upstream_region_tpm.tsv # Ratio 0.30
echo -e "geneB\t5.0" >> upstream_region_tpm.tsv   # Ratio 0.25
echo -e "geneC\t8.0" >> upstream_region_tpm.tsv   # Ratio 0.16
echo -e "geneD\t3.0" >> upstream_region_tpm.tsv   # Ratio 0.30
echo -e "geneE\t10.0" >> upstream_region_tpm.tsv  # Ratio 0.13

echo -e "geneA\t0.85" > upstream_coverage_breadth.tsv # Covered
echo -e "geneB\t0.60" >> upstream_coverage_breadth.tsv # Not covered
echo -e "geneC\t0.90" >> upstream_coverage_breadth.tsv # Covered
echo -e "geneD\t0.75" >> upstream_coverage_breadth.tsv # Covered
echo -e "geneE\t0.65" >> upstream_coverage_breadth.tsv # Not covered

# --- Core Filtering Logic (using awk for custom script) ---
echo "Applying filtering logic..."
awk -v tpm_ratio_thresh="${TPM_RATIO_THRESHOLD}" -v cov_breadth_thresh="${COVERAGE_BREADTH_THRESHOLD}" '
BEGIN { OFS="\t" }
FILENAME=="gene_body_tpm.tsv" {
    gene_body_tpm[$1] = $2;
}
FILENAME=="upstream_region_tpm.tsv" {
    upstream_tpm[$1] = $2;
}
FILENAME=="upstream_coverage_breadth.tsv" {
    coverage_breadth[$1] = $2;
}
END {
    for (gene_id in gene_body_tpm) {
        # Check if upstream data exists for this gene
        if (gene_id in upstream_tpm && gene_id in coverage_breadth) {
            # Calculate TPM ratio: (TPM in upstream region) / (TPM in gene body)
            ratio = upstream_tpm[gene_id] / gene_body_tpm[gene_id];

            # Apply filtering conditions:
            # 1. Upstream TPM ratio >= threshold (20%)
            # 2. Upstream coverage breadth >= threshold (70%)
            if (ratio >= tpm_ratio_thresh && coverage_breadth[gene_id] >= cov_breadth_thresh) {
                print gene_id, "REMOVED", "UpstreamTPMRatio="ratio, "CoverageBreadth="coverage_breadth[gene_id] >> "'"${FILTERED_GENES_LIST}"'" ;
            } else {
                print gene_id, "KEPT", "UpstreamTPMRatio="ratio, "CoverageBreadth="coverage_breadth[gene_id] >> "'"${RELIABLE_GENES_LIST}"'" ;
            }
        } else {
            # If upstream data is missing, assume it is not problematic and keep the gene
            print gene_id, "KEPT", "NoUpstreamData" >> "'"${RELIABLE_GENES_LIST}"'" ;
        }
    }
}' gene_body_tpm.tsv upstream_region_tpm.tsv upstream_coverage_breadth.tsv

echo "Filtering complete. Genes to remove are in ${FILTERED_GENES_LIST}"
echo "Reliable genes are in ${RELIABLE_GENES_LIST}"

# Clean up dummy files (remove this in a real pipeline)
rm gene_body_tpm.tsv upstream_region_tpm.tsv upstream_coverage_breadth.tsv

Genes that showed this pattern in either the input or IP condition were both removed from that IP analysis.

Custom Filtering Script (Inferred with models/gemini-2.5-flash) vN/A (Custom) (Inferred with models/gemini-2.5-flash)

$ Bash example

# This script conceptually demonstrates the removal of genes based on a defined pattern
# in either the input or IP condition, as described. The exact "pattern" (e.g., high input signal,
# low IP signal, specific statistical criteria) and the input data format would be specific
# to the experimental design and upstream analysis.

# For demonstration, let's assume we have a file 'gene_quantification_data.tsv'
# that contains gene identifiers along with their quantified values (e.g., counts, RPKM, TPM)
# for both input and IP conditions.
# Example format (tab-separated):
# GeneID    Input_Value    IP_Value

# Let's create a dummy file for demonstration purposes:
echo -e "GeneA\t1000\t5000" > gene_quantification_data.tsv
echo -e "GeneB\t5000\t100" >> gene_quantification_data.tsv # High input, low IP - candidate for removal
echo -e "GeneC\t50\t20" >> gene_quantification_data.tsv     # Low IP - candidate for removal
echo -e "GeneD\t200\t8000" >> gene_quantification_data.tsv
echo -e "GeneE\t10000\t12000" >> gene_quantification_data.tsv # High input, but also high IP - might be kept or removed based on specific criteria

# Define example thresholds for the "pattern" to be removed:
# - Remove if Input_Value is above a certain threshold (e.g., highly abundant background)
# - OR Remove if IP_Value is below a certain threshold (e.g., not enriched enough)
# These thresholds are illustrative and would be determined empirically based on the specific assay.
INPUT_VALUE_THRESHOLD_FOR_REMOVAL=4000 # Genes with input value > 4000 might be considered problematic background
IP_VALUE_THRESHOLD_FOR_REMOVAL=50     # Genes with IP value < 50 might be considered not enriched

# Perform filtering using awk.
# Assuming columns are: 1:GeneID, 2:Input_Value, 3:IP_Value
# We want to KEEP genes where:
# (Input_Value <= INPUT_VALUE_THRESHOLD_FOR_REMOVAL) AND (IP_Value >= IP_VALUE_THRESHOLD_FOR_REMOVAL)
# The following awk command filters out genes matching the removal pattern.
# It prints lines where the input value is NOT above the input threshold AND the IP value is NOT below the IP threshold.
awk -v input_thresh_remove="$INPUT_VALUE_THRESHOLD_FOR_REMOVAL" \
    -v ip_thresh_remove="$IP_VALUE_THRESHOLD_FOR_REMOVAL" \
    'NR==1 {print; next} # Assuming a header line, print it and skip to next record
     $2 <= input_thresh_remove && $3 >= ip_thresh_remove {print}' \
    gene_quantification_data.tsv > filtered_gene_analysis.tsv

echo "Filtered genes saved to filtered_gene_analysis.tsv"

# Reference datasets: The gene quantification data would typically be derived from
# alignment to a reference genome (e.g., hg38, mm10) and annotation (e.g., GENCODE, Ensembl).
# The specific reference used would depend on the species and experimental design
# of the upstream quantification step.

Tools Used

STAR DESeq2

Raw Source Text

All reads were aligned to the human genome hg38 primary assembly using STAR.
featureCounts version 2.0.2 was used to annotate reads, using the flag --minOverlap 20 and a custom GTF file derived from gencode v34
Differential expression was calculated using DESeq2. Cut-offs used to determine significant differential expression were baseMean > 50 and padj <=0.01
Genes which had substantial upstream intergenic reads (similar to DoGs) were removed from DE tables. Upstream intergenic regions were extracted using bedtools flank -l 2000 -r 0 -s. Intervening upstream genes that had any overlap with this region were removed with bedtools subtract. Reads that did not align to intronic or exonic regions were extracted from bam files with fgrep. Coverage across the 2000 bp upstream region was determined with bedtools coverage -S -split. A TPM value was determined for this upstream region and compared to the TPM of the adjacent downstream gene. If a ratio of 20% of the signal (determined by TPM) was present in the upstream region, and a breadth of 70% of that region was covered, that gene was determined to have significant enough upstream transcription so as to not be a reliable functional coding product, and was removed from subsequent analysis. Genes that showed this pattern in either the input or IP condition were both removed from that IP analysis.
Genome_build: hg38
Supplementary_files_format_and_content: comma separated files with output of DESeq2; normalized abundance measurements for all genes when comparing RIP condition to total input condition

← Back to Analysis