Bioinformatics Tools Installation Workshop
A practical 60-minute hands-on demonstration for installing and verifying essential bioinformatics tools
Get Started
Workshop Overview
What You'll Learn
This intensive workshop guides a group of 25 bioinformatics scientists through installing the essential computational tools needed for modern sequence analysis. You'll gain hands-on experience with remote access, version control, package management, and specialized bioinformatics software.
Each tool installation includes immediate verification steps to ensure everything works correctly. While detailed usage will be covered in future sessions, you'll leave with a fully functional bioinformatics toolkit ready for production work.
Session Details
  • Duration: 60 minutes of focused installation
  • Format: Live demonstration with verification
  • Audience: Bioinformatics scientists and researchers
  • Goal: Complete toolkit installation and validation
  • Outcome: Production-ready bioinformatics environment
Installation Approach
01
Why We Need It
Understand the purpose and context for each tool in your bioinformatics workflow
02
How to Install
Follow step-by-step live demonstrations with exact commands and configuration
03
How to Verify
Confirm successful installation with validation commands and expected output
04
What It Does
Learn the core functionality and role in your analysis pipeline
Module 1: Remote Access Foundation
Remote access tools form the foundation of HPC cluster computing. These utilities enable you to connect to high-performance computing resources from your local workstation, execute commands as if you were physically at the server, and transfer data securely across networks.
SSH and SCP are industry-standard protocols that provide encrypted communication channels, ensuring your data and credentials remain protected during transmission. Most Linux and macOS systems include these tools by default, while Windows users can access them through Git Bash or Windows Subsystem for Linux.

Time allocation: 5 minutes total for both SSH and SCP setup and verification
SSH - Secure Shell Connection
Why SSH Matters
SSH enables secure remote terminal access to your HPC cluster from any location. Instead of physically accessing the server, you can execute commands, run analyses, and manage files from your laptop through an encrypted connection.
This tool is essential for bioinformatics work because computationally intensive tasks run on powerful HPC systems rather than local machines.
Installation & Verification
# Check if SSH exists
which ssh
ssh -V

# Expected output on Linux/macOS
OpenSSH_8.9p1

# On Windows: install Git Bash or WSL
Connecting to HPC
# Connect to your cluster
ssh username@hpc

# Now you're on the remote system
# Type commands as if local
SCP - Secure Copy Protocol
1
Download from HPC
scp username@hpc:/home/username/file.txt ~/Downloads/
Transfer results and data from the cluster to your local machine for analysis or sharing
2
Upload to HPC
scp ~/myfile.fastq username@hpc:/home/username/data/
Send raw sequencing data, scripts, and reference files from your laptop to the cluster
SCP comes bundled with SSH, requiring no separate installation. It provides secure, encrypted file transfer between your local system and remote servers. Think of it as network-enabled drag-and-drop; add the -p flag when you also need to preserve file modification times and permission modes during transfer.
Module 2: Version Control with Git
Version control is fundamental to reproducible bioinformatics research. Git tracks changes to your analysis scripts, enables collaboration with colleagues, and provides access to thousands of public bioinformatics tools hosted on platforms like GitHub and GitLab.
This module covers Git installation, initial configuration, and cloning repositories—the essential skills needed to download and manage bioinformatics software packages. Understanding Git also prepares you for contributing to open-source bioinformatics projects and maintaining your own analysis pipelines.

Time allocation: 8 minutes for Git setup and repository cloning demonstration
Git Installation & Configuration
Installation
# Check existing installation
which git
git --version

# CentOS/RHEL installation
sudo yum install git -y

# Ubuntu/Debian installation
sudo apt install git -y

# Conda installation
mamba install -c conda-forge git
First-Time Setup
# Configure identity
git config --global user.name "Your Name"
git config --global user.email "you@example.com"

# Verify configuration
git config --global --list

# Expected output
user.name=Your Name
user.email=you@example.com
Configuration is a one-time process that identifies you as the author of any commits or contributions. This identity information becomes part of the project history and enables proper attribution in collaborative environments.
Git Clone - Downloading Repositories
Clone Repository
git clone https://github.com/DerrickWood/kraken2.git
Downloads the complete project including source code, documentation, and version history
Navigate to Project
cd kraken2
ls -la
Enter the downloaded directory and view all files including hidden configuration
Read Documentation
cat README.md
less INSTALL.md
Review installation instructions and usage guidelines provided by the developers
Git clone is your gateway to thousands of bioinformatics tools. Rather than manually downloading files or navigating complex websites, a single command retrieves everything you need. This approach ensures you always get the latest stable version and complete documentation.
Module 3: Conda & Mamba Package Management
Package managers revolutionize software installation in bioinformatics by automatically handling dependencies, versions, and compatibility. Conda and its faster alternative Mamba eliminate the tedious manual compilation that traditionally consumed hours of setup time.
These tools create isolated environments for different projects, preventing conflicts between incompatible tool versions. This isolation ensures reproducibility—critical for scientific research—by allowing you to specify and recreate exact software versions used in any analysis.
This module verifies your existing Conda/Mamba installation and introduces YAML environment files, the key to reproducible bioinformatics environments.

Time allocation: 7 minutes covering verification and environment creation from YAML templates
Verify Conda & Mamba Installation
Check Your System
# Locate Conda and Mamba
which conda
which mamba

# Expected output
/home/apps/miniforge3/bin/conda
/home/apps/miniforge3/bin/mamba

# Check versions
conda --version
mamba --version

# Expected output
conda 25.9.1
mamba 2.3.0
Why Mamba?
Mamba is a drop-in replacement for Conda that uses C++ for dependency resolution, making it 10-20 times faster than Conda for complex bioinformatics environments.
All conda commands work with mamba—just replace "conda" with "mamba" in any command for significantly improved performance.
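The drop-in equivalence can be sketched in one small snippet. The package pin below is purely illustrative, --dry-run resolves dependencies without installing, and the call is guarded so it only runs where mamba is actually on PATH:

```shell
# Hypothetical package pin, shown only to illustrate the identical syntax.
PKG="samtools=1.18"

# These two are interchangeable; mamba simply resolves faster:
#   conda install -c bioconda samtools=1.18
#   mamba install -c bioconda samtools=1.18
if command -v mamba >/dev/null 2>&1; then
  mamba install -c bioconda "$PKG" --dry-run   # resolve only, install nothing
fi
```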
Understanding YAML Environment Files
Environment Name
name: WIC
Identifies your environment for activation and management
Software Channels
channels:
  - conda-forge
  - bioconda
Repositories where packages are downloaded from
Tool Dependencies
dependencies:
  - fastqc=0.12.1
  - samtools=1.18
  - kraken2=2.1.6
Exact versions of tools to install
YAML files define complete computational environments in a human-readable format. They specify every tool and its exact version, ensuring that you, your collaborators, and reviewers can recreate identical analysis environments months or years later. This is the foundation of reproducible bioinformatics research.
# View template environment
cat /shared/envs/templates/WIC_optimized.yml
Create Environment from YAML
01
Create Environment
mamba env create -f /shared/envs/templates/WIC_optimized.yml
Installation takes 5-10 minutes as Mamba downloads and configures 20+ bioinformatics tools
02
Activate Environment
mamba activate WIC
Switch to your new environment, making all installed tools available in your PATH
03
Verify Installation
mamba list
which fastqc kraken2 samtools
Confirm all tools installed correctly with proper versions and locations
Environment creation handles all dependency resolution automatically. Mamba determines compatible versions, downloads binaries, and configures paths—eliminating the compilation and troubleshooting that traditionally consumed days of setup time.
Module 4: SRA Toolkit Installation
The Sequence Read Archive (SRA) at NCBI hosts over 20 million public sequencing datasets, representing petabytes of genomic data from researchers worldwide. The SRA Toolkit provides command-line utilities for searching, downloading, and converting this data into standard FASTQ format for analysis.
Access to public sequencing data enables meta-analyses, validation studies, and exploratory research without generating new sequencing data. Whether you're investigating novel organisms, validating published findings, or performing comparative genomics, SRA Toolkit is your gateway to this vast resource.

Time allocation: 8 minutes for SRA Toolkit installation and configuration
Install SRA Toolkit
Installation
# Activate environment
mamba activate WIC

# Install SRA Toolkit
mamba install -c bioconda sra-tools

# This installs multiple utilities:
# - fastq-dump
# - fasterq-dump
# - prefetch
# - sam-dump
Verification
# Check installation
which fastq-dump
which fasterq-dump

# Verify version
fastq-dump --version

# Expected output
fastq-dump : 3.1.1
The toolkit includes several utilities with different performance characteristics. The newer fasterq-dump converts runs to FASTQ significantly faster through multi-threaded processing, making it the preferred tool for large datasets. The original fastq-dump remains available for compatibility with older workflows.
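A typical download-and-convert flow can be sketched as follows. SRR390728 is a hypothetical stand-in accession (substitute a run ID from your own study), and the commands are guarded so they only execute where the toolkit is installed:

```shell
# Hypothetical accession; replace with a run ID relevant to your project.
ACC=SRR390728

if command -v prefetch >/dev/null 2>&1; then
  prefetch "$ACC"                  # download the compressed .sra archive
  fasterq-dump "$ACC" \
      --threads 4 \
      --outdir fastq/              # multi-threaded conversion to FASTQ
fi
```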
Configure SRA Toolkit
Create Cache Directory
mkdir -p ~/.ncbi
Establishes configuration folder in your home directory
Write Configuration
# Unquoted EOF lets the shell expand $USER into your actual username
cat > ~/.ncbi/user-settings.mkfg << EOF
/repository/user/main = "/home/$USER/sra_cache"
EOF
Specifies where downloaded files are temporarily stored
Benefits
Prevents re-downloading identical files, saves disk space, and improves download efficiency
This one-time configuration optimizes SRA Toolkit performance by implementing local caching. When you download sequencing runs, the toolkit stores them in your cache directory. If you need the same data later, it retrieves the local copy instead of re-downloading, saving bandwidth and time.
Module 5: Sequence Analysis Tools
Modern sequence analysis requires specialized tools optimized for handling massive genomic datasets efficiently. This module covers five essential utilities that form the core of most bioinformatics pipelines: sequence statistics, taxonomic classification, visualization, assembly quality assessment, and read quality control.
Each tool addresses a specific analytical challenge. Seqkit provides lightning-fast sequence manipulation, Kraken2 identifies organisms in metagenomic samples, Krona creates interactive visualizations, QUAST evaluates assembly quality, and FastQC assesses raw read quality. Together, these tools enable comprehensive sequence analysis from raw data to final results.
The WIC environment includes all these tools pre-configured and ready to use, eliminating individual installation headaches.

Time allocation: 15 minutes covering five critical sequence analysis tools
Seqkit - Sequence Manipulation
Installation Status
# Verify in WIC environment
mamba activate WIC
which seqkit
seqkit version

# Expected output
seqkit v2.6.1
Performance Advantage
Seqkit processes FASTA/FASTQ files orders of magnitude faster than ad-hoc awk or Python scripts, thanks to optimized parallel processing and efficient memory management.
Core Capabilities
  • Calculate sequence statistics (count, length distribution, GC content)
  • Convert between FASTA and FASTQ formats seamlessly
  • Filter sequences by length, quality score, or pattern
  • Remove duplicate sequences efficiently
  • Extract subsequences by coordinates or IDs
  • Reverse complement sequences
  • Sample random subsets for testing
Seqkit's speed makes it practical for routine quality checks on multi-gigabyte sequencing files. Operations that would take hours with traditional Unix tools complete in seconds, enabling rapid iteration during pipeline development.
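A first quality check usually starts with seqkit stats. The input file name below is hypothetical, and the call is guarded so it only runs where seqkit is installed:

```shell
# Hypothetical input file; any FASTA/FASTQ (optionally gzipped) works.
READS=reads.fastq.gz

if command -v seqkit >/dev/null 2>&1; then
  seqkit stats -a "$READS"   # -a adds extended stats such as N50 and Q20/Q30
fi
```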
Kraken2 - Taxonomic Classification
Installation
mamba activate WIC
which kraken2
kraken2 --version
Pre-installed in WIC environment
Database Location
echo $KRAKEN2_DEFAULT_DB
ls -lh /shared/databases/kraken2/
Reference database exceeds 100 GB
Classification Process
Matches k-mers from reads against database, assigns taxonomy with confidence scores
Kraken2 enables rapid taxonomic profiling of metagenomic samples by comparing sequencing reads against a comprehensive reference database of microbial genomes. It identifies which bacteria, viruses, or other organisms are present in your sample and their relative abundances.
Common applications include microbiome analysis, contamination detection in sequencing libraries, pathogen identification in clinical samples, and quality control for genome assemblies. The tool processes millions of reads per minute, making it practical for routine screening.
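A typical paired-end classification run can be sketched as below. The read file names are hypothetical, and the "standard" subdirectory under the shared database path is an assumption (use whatever your site actually provides); the command is guarded so it only executes where kraken2 is on PATH:

```shell
# Assumed database location; confirm against your site's layout.
DB=/shared/databases/kraken2/standard

if command -v kraken2 >/dev/null 2>&1; then
  kraken2 --db "$DB" --threads 8 \
      --report sample.k2.report \
      --output sample.k2.out \
      --paired sample_R1.fastq.gz sample_R2.fastq.gz
fi
```

The --report file holds the per-taxon summary you will usually inspect first; the --output file records the per-read assignments.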
Krona - Interactive Taxonomic Visualization
Purpose & Installation
Krona transforms taxonomic classification results from Kraken2 into stunning interactive HTML visualizations. Instead of reviewing thousands of lines of text output, you explore a hierarchical sunburst chart where each ring represents a taxonomic level.
# Verify installation
mamba activate WIC
which ktImportTaxonomy

# Expected location
/home/apps/miniforge3/envs/WIC/bin/ktImportTaxonomy
Interactive Exploration
Click any segment to zoom into that taxonomic level and explore its children
Proportional Display
Segment size represents relative abundance, making dominant taxa immediately visible
Publication Ready
Generate high-quality figures suitable for papers and presentations
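A common pattern, sketched with hypothetical file names, converts Kraken2's per-read output into a Krona chart. The -q 2 -t 3 options tell ktImportTaxonomy that query IDs sit in column 2 and taxon IDs in column 3 of Kraken2 output; the call is guarded so it only runs where Krona is installed:

```shell
# Hypothetical input (Kraken2 per-read output) and output file names.
OUT=sample_krona.html

if command -v ktImportTaxonomy >/dev/null 2>&1; then
  ktImportTaxonomy -q 2 -t 3 sample.k2.out -o "$OUT"
fi
```

Open the resulting HTML file in any browser to explore the interactive sunburst chart.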
QUAST - Assembly Quality Assessment
Installation
mamba activate WIC
which quast.py
quast.py --version
Verify QUAST is available in WIC environment
Key Metrics
  • N50 and L50 statistics
  • Total assembly length
  • GC content percentage
  • Number of contigs
  • Largest contig size
Comparative Analysis
Evaluate multiple assemblies side-by-side to identify the best assembly strategy
QUAST (Quality Assessment Tool for Genome Assemblies) provides comprehensive evaluation of genome assembly quality without requiring a reference genome. It calculates standard metrics that enable comparison between different assembly algorithms, parameter sets, or sequencing technologies.
The tool generates detailed HTML reports with visualizations, making it easy to identify problematic assemblies and optimize your assembly pipeline. Use QUAST whenever you perform de novo genome assembly to ensure your results meet quality standards.
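Comparative evaluation is as simple as passing several assemblies at once. The assembly file names below are hypothetical, and the call is guarded so it only runs where QUAST is installed:

```shell
# Hypothetical assemblies from two different assemblers.
OUTDIR=quast_results

if command -v quast.py >/dev/null 2>&1; then
  quast.py assembly_spades.fasta assembly_megahit.fasta \
      -o "$OUTDIR" --threads 4   # one report comparing both assemblies
fi
```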
FastQC - Read Quality Control
Installation & Verification
mamba activate WIC
which fastqc
fastqc --version

# Expected output
FastQC v0.12.1
What FastQC Analyzes
  • Per-base sequence quality scores
  • Per-sequence quality distribution
  • Per-base sequence content
  • GC content distribution
  • Sequence length distribution
  • Duplicate sequence levels
  • Adapter content detection
  • Overrepresented sequences
Output & Interpretation
FastQC generates interactive HTML reports with color-coded quality assessments. Green indicators show passing metrics, orange flags warn about potential issues, and red marks highlight serious problems requiring attention.
These reports guide downstream decisions about quality filtering, trimming, and whether sequencing runs meet project requirements. FastQC is typically the first analysis performed on any new sequencing dataset.
The tool processes FASTQ files from all major sequencing platforms including Illumina, PacBio, and Oxford Nanopore.
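Running FastQC on a new dataset looks like the sketch below. The read file names are hypothetical, and the call is guarded so it only executes where FastQC is installed:

```shell
# Collect all reports in one directory for later aggregation.
mkdir -p qc_reports

if command -v fastqc >/dev/null 2>&1; then
  # Hypothetical paired-end files; FastQC accepts any number of inputs.
  fastqc sample_R1.fastq.gz sample_R2.fastq.gz -o qc_reports/
fi
```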
Module 6: Additional Essential Tools
This module introduces three powerful utilities that enhance your quality control and preprocessing capabilities. MultiQC aggregates reports across samples, while Trimmomatic and Fastp provide complementary approaches to read trimming and quality filtering.
These tools bridge the gap between raw sequencing data and analysis-ready datasets. MultiQC transforms individual sample reports into comprehensive dashboards, enabling rapid identification of batch effects or problematic samples. The trimming tools remove technical artifacts and low-quality regions that could compromise downstream analyses.

Time allocation: 10 minutes covering report aggregation and read preprocessing tools
MultiQC - Unified QC Reporting
1
Multiple FastQC Reports
Each sample generates individual HTML files that must be reviewed separately
2
MultiQC Aggregation
Scans directories, identifies all QC reports, combines them intelligently
3
Unified Dashboard
Single interactive HTML with plots comparing all samples side-by-side
# Verify installation
mamba activate WIC
which multiqc
multiqc --version

# Expected output
multiqc, version 1.15
MultiQC revolutionizes quality control review for projects with dozens or hundreds of samples. Instead of opening individual reports, you examine one comprehensive dashboard that highlights outliers and trends across your entire dataset. This dramatically reduces QC time while improving consistency.
The tool supports over 100 bioinformatics tools beyond FastQC, including aligners, variant callers, and quantification software, making it the standard for generating publication-quality QC summaries.
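Aggregation is a single command pointed at the directory holding your per-sample reports. Directory and report names below are hypothetical, and the call is guarded so it only runs where MultiQC is installed:

```shell
# Directory assumed to contain FastQC (and other tool) outputs.
mkdir -p qc_reports

if command -v multiqc >/dev/null 2>&1; then
  multiqc qc_reports/ -o multiqc_out -n project_qc   # one combined dashboard
fi
```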
Trimmomatic - Robust Read Trimming
Installation Check
mamba activate WIC
which trimmomatic
trimmomatic -version 2>&1

# Expected output
0.39
Trimming Operations
  • ILLUMINACLIP: Remove adapter sequences
  • SLIDINGWINDOW: Trim when average quality drops
  • LEADING/TRAILING: Cut low-quality bases from ends
  • MINLEN: Discard short reads
When to Use Trimmomatic
Trimmomatic has been the standard read preprocessing tool for over a decade, with extensive documentation and published validation. Use it when you need fine-grained control over trimming parameters or when following established protocols that specify Trimmomatic.
The tool handles both single-end and paired-end reads, maintaining read pairing throughout processing. It supports custom adapter sequences and provides detailed logging of how many reads were affected by each filtering step.
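A representative paired-end run combining the operations listed above looks like this. The input file names are hypothetical, and the adapter file path varies by installation (treat TruSeq3-PE.fa as a placeholder); the command is guarded so it only runs where Trimmomatic is installed:

```shell
# Placeholder adapter file; locate the copy shipped with your install.
ADAPTERS=TruSeq3-PE.fa

if command -v trimmomatic >/dev/null 2>&1; then
  trimmomatic PE -phred33 \
      sample_R1.fastq.gz sample_R2.fastq.gz \
      out_R1_paired.fq.gz out_R1_unpaired.fq.gz \
      out_R2_paired.fq.gz out_R2_unpaired.fq.gz \
      ILLUMINACLIP:"$ADAPTERS":2:30:10 \
      SLIDINGWINDOW:4:20 LEADING:3 TRAILING:3 MINLEN:36
fi
```

Note the four outputs: paired and unpaired survivors for each read direction, keeping mate pairing intact downstream.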
Fastp - Modern All-in-One QC
Speed Advantage
Fastp processes reads 2-5x faster than Trimmomatic through multi-threaded C++ implementation
Automatic Detection
Identifies and removes adapters automatically without requiring adapter sequence files
Integrated QC
Generates before-and-after quality reports in single run, eliminating separate FastQC step
Default Parameters
Works well out-of-the-box with sensible defaults, reducing parameter optimization time
# Verify installation
mamba activate WIC
which fastp
fastp --version

# Expected output
fastp 0.23.4
Fastp represents the modern approach to read preprocessing, combining quality control, filtering, and trimming in a single efficient tool. For most applications, Fastp provides superior convenience and speed compared to traditional Trimmomatic + FastQC workflows.
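The equivalent all-in-one run with Fastp is considerably shorter. Input and output names are hypothetical, adapters are detected automatically, and the call is guarded so it only runs where fastp is installed:

```shell
# Hypothetical paired-end inputs; no adapter file needed.
REPORT=fastp_report.html

if command -v fastp >/dev/null 2>&1; then
  fastp -i sample_R1.fastq.gz -I sample_R2.fastq.gz \
        -o trimmed_R1.fastq.gz -O trimmed_R2.fastq.gz \
        --html "$REPORT" --json fastp_report.json   # before/after QC built in
fi
```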
Module 7: Local Setup - Windows Subsystem for Linux
Windows Subsystem for Linux (WSL) transforms Windows laptops into full-featured bioinformatics workstations by running genuine Linux distributions directly within Windows 10 or 11. This eliminates the need for dual-booting or virtual machines while providing complete access to Linux command-line tools.
For bioinformatics researchers using Windows laptops, WSL is revolutionary. Install Ubuntu, Debian, or other distributions from the Microsoft Store, then use the same commands and tools as Linux-based HPC systems. Your Windows files remain accessible, and you can run Windows and Linux applications simultaneously.
This module walks through WSL installation and initial setup, preparing your laptop for local tool installation and testing before deploying analyses on HPC clusters.

Time allocation: 12 minutes covering WSL installation and Conda setup within WSL
WSL Installation Requirements
1
Check Windows Version
systeminfo | findstr /I "OS"
Requires Windows 10 version 2004+ or Windows 11. Older versions need manual WSL1 installation.
2
Enable Virtualization
Access BIOS/UEFI settings during boot and enable Intel VT-x or AMD-V virtualization technology
3
Administrator Access
WSL installation requires administrator privileges. Right-click PowerShell and select "Run as administrator"
4
Disk Space
Allocate at least 20 GB free space for Linux distribution and bioinformatics tools
Modern Windows versions simplify WSL installation to a single command. Microsoft continuously improves WSL performance, with WSL2 offering near-native Linux performance through a lightweight virtual machine architecture.
Install WSL via PowerShell
Installation Command
# Open PowerShell as Administrator

# Check current OS info
systeminfo | findstr /I "OS"

# Install WSL (includes Ubuntu)
wsl --install

# System will prompt for restart
# Click "Restart Now"
What Gets Installed
  • WSL2 core components
  • Linux kernel update
  • Ubuntu distribution (default)
  • Windows Terminal integration
First Boot Setup
After restart, Ubuntu launches automatically. The first boot takes 2-3 minutes as it initializes the Linux filesystem and installs base packages.
You'll be prompted to create a Linux username and password. This account is separate from your Windows account and has sudo privileges within the Linux environment.
Important: Remember this password—you'll need it for installing software and administrative tasks within Linux.
Access and Use WSL
01
Launch Ubuntu
Click "Ubuntu" from the Start menu or type "wsl" in any PowerShell/Command Prompt window
02
Linux Terminal Opens
You're now in a full Linux bash shell running on your Windows system
03
Access Windows Files
cd /mnt/c/Users/YourName/Documents
Windows drives mount under /mnt/, enabling easy file access
04
Install Tools
Use standard Linux package managers (apt, conda, mamba) to install bioinformatics software
WSL provides bidirectional file access—Linux can read Windows files under /mnt/, and Windows Explorer can access Linux files via \\wsl$ network path. This integration enables flexible workflows where you edit scripts in Windows text editors while running analyses in Linux.
Install Miniconda in WSL
Download & Install
# Download Miniconda installer
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh

# Run installation script
bash Miniconda3-latest-Linux-x86_64.sh

# Follow prompts:
# - Press Enter to review license
# - Type "yes" to accept
# - Press Enter for default location
# - Type "yes" to initialize conda
Activate Installation
# Close and reopen terminal
# OR source the configuration
source ~/.bashrc

# Verify installation
which conda
conda --version

# Expected output
conda 25.9.1
With Conda installed in WSL, your Windows laptop becomes a complete bioinformatics workstation. Create environments, install tools, test analyses locally before deploying to HPC. The entire toolkit demonstrated in this workshop is now available on your personal machine.
Environment Management Best Practices
One Environment Per Project
Create separate environments for different analyses to prevent version conflicts and maintain reproducibility
Export Environments
conda env export > environment.yml
Save exact package versions used in published analyses for perfect reproducibility
Clean Unused Environments
conda env list
conda env remove -n oldenv
Remove outdated environments to free disk space and reduce clutter
Regular Updates
mamba update --all
Keep tools current to benefit from bug fixes and performance improvements
Complete Installation Checklist
1
Remote Access
  • SSH connection verified
  • SCP file transfer tested
  • HPC login successful
2
Version Control
  • Git installed and configured
  • Repository cloning working
  • User identity set globally
3
Package Management
  • Conda/Mamba verified
  • WIC environment created
  • Environment activation working
4
Data Access
  • SRA Toolkit installed
  • Cache directory configured
  • Download utilities verified
5
Analysis Tools
  • Seqkit, Kraken2, Krona ready
  • QUAST and FastQC available
  • MultiQC, Trimmomatic, Fastp installed
6
Local Development
  • WSL installed (Windows users)
  • Conda available in WSL
  • Tools tested locally
Next Steps & Resources
Upcoming Workshops
1
Week 2: Quality Control Deep Dive
Interpreting FastQC reports, running MultiQC, establishing QC thresholds
2
Week 3: Read Preprocessing
Hands-on trimming with Fastp, adapter removal, quality filtering strategies
3
Week 4: Taxonomic Analysis
Kraken2 workflows, database selection, Krona visualization techniques
Documentation Links
  • SRA Toolkit: NCBI documentation and tutorials
  • Conda: Managing environments guide
  • Seqkit: Command reference and examples
  • Kraken2: Database building and optimization
  • FastQC: Interpreting quality metrics
  • WSL: Microsoft official documentation
Support Resources
Post questions to the workshop Slack channel, attend weekly office hours on Fridays 3-4pm, or email the bioinformatics core for one-on-one troubleshooting.
Workshop Complete - You're Ready!
Congratulations! You've successfully installed and verified a complete bioinformatics toolkit. Your system is now configured for production sequence analysis, from raw data download through quality control, preprocessing, and taxonomic classification.
9
Core Tools
Installed and verified across your environment
20+
Dependencies
Automatically managed by Conda/Mamba
3
Environments
Configured for different analysis workflows
Practice using these tools on sample datasets before the next workshop session. Experiment with commands, explore documentation, and familiarize yourself with each tool's options. The more comfortable you become with the basics now, the more you'll gain from upcoming deep-dive sessions.