☰ Menu

      Fundamentals of Scientific Computing

Home
Introduction and Lectures
Intro to the Workshop and Core
Schedule
Command Line Interface
CLI
Challenge & Homework Solutions
BashCrawl
R
R
Exercise solutions
Python
Python
Installations
Make and CMake
Apptainer
Blast Primer
Support
Zoom
Slack
Cheat Sheets
Software and Links
Scripts
Github page
Report Errors
Biocore website

Introduction to Python

Why Python

Python

https://businessoverbroadway.com/2019/01/13/programming-languages-most-used-and-recommended-by-data-scientists/

https://www.tiobe.com/tiobe-index/


Background

What is a programming language and why do we need it?

Speaking to a computer in its native language is tedious and complicated. A programming language is a way for humans to describe a set of operations to a computer in a more abstract and understandable way. A helper program then translates our description of the operations into a set of instructions (machine code) for the computer to carry out.

Some day we may develop a programming language that allows us to communicate our instructions to the computer in our native language (Alexa, turn on the TV). Except for simple cases, this option doesn’t exist yet, largely because human languages are complicated and instructions can be difficult to understand (even for other humans).

In order for the helper program to work properly, we need to use a concise language:

Specifically in Python:

PythonInterpreter


A brief history of Python


Interesting features of Python


Base Python and the extensive package ecosystem


Project Jupyter

Jupyter


Disclaimer

Learning all the nuances of python takes a long time! Our goal here is to introduce you to as many concepts as possible but if you are serious about mastering python you will need to apply yourself beyond this introduction.

Installation

We are going to use a python distribution platform, Anaconda. It was designed to meet the demand of Data Sciences and AI projects. It can be installed on all three operating systems and has 45 million users as of 2024. It includes over 300 packages, offers jupyter Notebooks and jupyter Lab and includes conda, the package and environment manager. It makes installing a lot of python packages very easy. Please follow the instructions below to install Anaconda. Instructions for all platforms can be found at https://www.anaconda.com/docs/getting-started/anaconda/install

After Anaconda is successfully installed, please open a new Command Line interface to be able to use the installed libraries. If you have chosen to ask the installer not to add configuration to your initialize conda every time you open a new shell, you will have to do it manualy when you need to access any program that Anaconda has installed. Using the following commands in the new Command Line window will accomplish this task.

source <PATH_TO_CONDA>/bin/activate
conda init –all

The way to launch a jupyter lab session in command line is to use the following command.

jupyter lab --no-browser


Once this command is run, you will see a message similar to the following to provide urls for your web browser to start the interactive session.

To access the server, open this file in a browser:
    file:///Users/jli/Library/Jupyter/runtime/jpserver-86015-open.html
Or copy and paste one of these URLs:
    http://localhost:8892/lab?token=99e0c0136bcee32edfc8ce37c673f5dc1940bb2a307fc3d3
    http://127.0.0.1:8892/lab?token=99e0c0136bcee32edfc8ce37c673f5dc1940bb2a307fc3d3

Hello World!

[Input:]

print("Hello World!")

[Output:]

Hello World!

Functions and how to find help

As in any other programming language, a function is a predefined set of operations. In order to find the information on what parameters a function requires, one has 2 options:

[Input:]

help(print)

[Output:]

Help on built-in function print in module builtins:

print(*args, sep=' ', end='\n', file=None, flush=False)
    Prints the values to a stream, or to sys.stdout by default.

    sep
      string inserted between values, default a space.
    end
      string appended after the last value, default a newline.
    file
      a file-like object (stream); defaults to the current sys.stdout.
    flush
      whether to forcibly flush the stream.


*args means that the function print can accept any number of arguments.

Basic Data Types

Built in data types

Data type Functions
Text/Characterstr()
Numericint(), float(), complex()
Sequencelist()/[], tuple()/(), range()
Mappingdict()/{}
Setset()/{}, frozenset()
Booleanbool()
Binarybytes(), bytearray(), memoryview()
NoneNone

Integers, Floating-point numbers, booleans, strings.

Language Python R
Integer data int() as.integer(), integer()
Float data float() numeric()
Logical data bool(), [True, False] as.logical(), logical(), (TRUE, FALSE)
Character data str() as.character(), character()

Integer

[Input:]

a = int(5)

[Input:]

a

[Output:]

5

[Input:]

b = 5

[Input:]

b

[Output:]

5

[Input:]

type(a)

[Output:]

int

[Input:]

type(b)

[Output:]

int

Arithmetic operators

if flow

Exercise

Creat different data types (integers, floats, booleans and strings) and perform some operations.

Sequence data

Data type Usage Characteristics
Lists Store multiple items in a single variable Ordered, mutable, allow duplicate values
Tuples Store multiple items in a single variable Ordered, immutable, allow duplicate values
Range Create an immutable sequence of numbers Immutable, a range of integers

Immutable means that an object’s state or value cannot be changed after its creation. Let’s take a look at one example to understand what immutable means in python. Here we will create a variable using range() function.

Range

[Input:]

range_example = range(6)

[Input:]

range_example

[Output:]

range(0, 6)

[Input:]

range_example[0]

[Output:]

0

[Input:]

range_example[0] = 3

[Output:]

Error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[12], line 1
----> 1 range_example[0] = 3

TypeError: 'range' object does not support item assignment

[Input:]

range_example[5]

[Output:]

5

[Input:]

range_example[6] = 8

[Output:]

Error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[14], line 1
----> 1 range_example[6] = 8

TypeError: 'range' object does not support item assignment

Range function is usually used in a for loop. For example,

[Input:]

protein_seq = "MTKAAVGLVKNRAWGIPSDF"
protein_len = len(protein_seq)
for i in range(protein_len):
    print("The ", i+1, "th amino acid is: ", protein_seq[i])

[Output:]

The  1 th amino acid is:  M
The  2 th amino acid is:  T
The  3 th amino acid is:  K
The  4 th amino acid is:  A
The  5 th amino acid is:  A
The  6 th amino acid is:  V
The  7 th amino acid is:  G
The  8 th amino acid is:  L
The  9 th amino acid is:  V
The  10 th amino acid is:  K
The  11 th amino acid is:  N
The  12 th amino acid is:  R
The  13 th amino acid is:  A
The  14 th amino acid is:  W
The  15 th amino acid is:  G
The  16 th amino acid is:  I
The  17 th amino acid is:  P
The  18 th amino acid is:  S
The  19 th amino acid is:  D
The  20 th amino acid is:  F

List

A list can be created by using the function list() and by using [].

[Input:]

list_example = ["NDUFAF7", "AGMAT", "TOP1MT", "IARS2", "MTFP1", "SLC25A51", "PRORP", "SLC25A52", "ENDOG"]

[Input:]

list_example

[Output:]

['NDUFAF7',
 'AGMAT',
 'TOP1MT',
 'IARS2',
 'MTFP1',
 'SLC25A51',
 'PRORP',
 'SLC25A52',
 'ENDOG']

[Input:]

len(list_example)

[Output:]

9

[Input:]

type(list_example)

[Output:]

list

List is mutable, which means that one may modify the values of a list after its creation.

Function Operation
append Add an element at the end of the list
extend Add multiple elements at the end of the list
insert Add an element at a specific position
remove Remove the first occurrence of an element
pop Removes an element at a specific position or the last element if no index is specified
del Delete object

For example, let’s add an element to list_example

[Input:]

list_example.append("CKMT2")
list_example

[Output:]

['NDUFAF7',
 'AGMAT',
 'TOP1MT',
 'IARS2',
 'MTFP1',
 'SLC25A51',
 'PRORP',
 'SLC25A52',
 'ENDOG',
 'CKMT2']

We see a new syntax, object.function(). This is the way to call a method that is bound to an object in python. A method is a function that is defined specifically for an object of a class in python. In order to know what methods are bound to an object, we can use the following operation:

[Input:]

methods = [method_name for method_name in dir(list_example) if callable(getattr(list_example, method_name)) and not method_name.startswith("__")]
methods

[Output:]

['append',
 'clear',
 'copy',
 'count',
 'extend',
 'index',
 'insert',
 'pop',
 'remove',
 'reverse',
 'sort']

Exercise

Let’s modify list_example by using the functions listed above, add element(s), delete element(s), …

Mapping data - Dictionaries

Dictionaries in python store data in key:value pairs.

Syntax:
{
    Key1: value1,
    Key2: value2,
    …,
    KeyN: valueN,
}

[Input:]

dict_example = {
    "gene_name": "CKMT2",
    "gene_biotype": "protein_coding",
    "n_transcripts": 32,
    "n_orthologues": 217,
    "n_paralogues": 4,
    "ensembl_id": "ENSG00000131730",
    "discription": "creatine kinase, mitochondrial 2",
    "loc": "Chromosome 5: 81,233,320-81,266,399",
    "strand": "+",}
dict_example

[Output:]

{'gene_name': 'CKMT2',
 'gene_biotype': 'protein_coding',
 'n_transcripts': 32,
 'n_orthologues': 217,
 'n_paralogues': 4,
 'ensembl_id': 'ENSG00000131730',
 'discription': 'creatine kinase, mitochondrial 2',
 'loc': 'Chromosome 5: 81,233,320-81,266,399',
 'strand': '+'}

Access items in the dictionary

[Input:]

dict_example["gene_name"]

[Output:]

'CKMT2'

But how do one know what keys there are in a dictionary object?

[Input:]

methods = [method_name for method_name in dir(dict_example) if callable(getattr(dict_example, method_name))
           and not method_name.startswith("__")]
methods

[Output:]

['clear',
 'copy',
 'fromkeys',
 'get',
 'items',
 'keys',
 'pop',
 'popitem',
 'setdefault',
 'update',
 'values']

[Input:]

dict_example.keys()

[Output:]

dict_keys(['gene_name', 'gene_biotype', 'n_transcripts', 'n_orthologues', 'n_paralogues', 'ensembl_id', 'discription', 'loc', 'strand'])

How does one extract the values of more than one key. For example, let’s try to extract values of some keys, gene_name, gene_biotype, and n_transcripts. This can be done easily using a for loop.

[Input:]

for key in ["gene_name", "gene_biotype", "n_transcripts"]:
    print(f"{key}: ", dict_example.get(key))

[Output:]

gene_name:  CKMT2
gene_biotype:  protein_coding
n_transcripts:  32

In another example, we are going to update the dictionary with an additional piece of information, using the method update.

[Input:]

dict_example.update({"pathway": "Mitochondria disease pathway"})
dict_example
dict_example.update(pathway_members = 10)
dict_example

[Output:]

{'gene_name': 'CKMT2',
 'gene_biotype': 'protein_coding',
 'n_transcripts': 32,
 'n_orthologues': 217,
 'n_paralogues': 4,
 'ensembl_id': 'ENSG00000131730',
 'discription': 'creatine kinase, mitochondrial 2',
 'loc': 'Chromosome 5: 81,233,320-81,266,399',
 'strand': '+',
 'pathway': 'Mitochondria disease pathway',
 'pathway_members': 10}

Because the values can be any python data type, one may create nested dictionaries.

[Input:]

nested_dict = {
    "ENSG00000131730": {
        "gene_name": "CKMT2",
        "gene_biotype": "protein_coding",
        "n_transcripts": 32,
        "n_orthologues": 217,
        "n_paralogues": 4,
        "ensembl_id": "ENSG00000131730",
        "discription": "creatine kinase, mitochondrial 2",
        "loc": "Chromosome 5: 81,233,320-81,266,399",
        "strand": "+",},
    "ENSG00000003509": {
        "gene_name": "NDUFAF7",
        "gene_biotype": "protein_coding",
        "n_transcripts": 20,
        "n_orthologues": 210,
        "n_paralogues": 0,
        "ensembl_id": "ENSG00000003509",
        "discription": "NADH:ubiquinone oxidoreductase complex assembly factor 7",
        "loc": "Chromosome 2: 37,231,631-37,253,403 ",
        "strand": "+",},
}

[Input:]

nested_dict

[Output:]

{'ENSG00000131730': {'gene_name': 'CKMT2',
  'gene_biotype': 'protein_coding',
  'n_transcripts': 32,
  'n_orthologues': 217,
  'n_paralogues': 4,
  'ensembl_id': 'ENSG00000131730',
  'discription': 'creatine kinase, mitochondrial 2',
  'loc': 'Chromosome 5: 81,233,320-81,266,399',
  'strand': '+'},
 'ENSG00000003509': {'gene_name': 'NDUFAF7',
  'gene_biotype': 'protein_coding',
  'n_transcripts': 20,
  'n_orthologues': 210,
  'n_paralogues': 0,
  'ensembl_id': 'ENSG00000003509',
  'discription': 'NADH:ubiquinone oxidoreductase complex assembly factor 7',
  'loc': 'Chromosome 2: 37,231,631-37,253,403 ',
  'strand': '+'}}

[Input:]

nested_dict["ENSG00000131730"]["discription"]

[Output:]

'creatine kinase, mitochondrial 2'

Keys() function can be used to list all the keys in a dictionary.

[Input:]

nested_dict.keys()

[Output:]

dict_keys(['ENSG00000131730', 'ENSG00000003509'])

[Input:]

nested_dict['ENSG00000131730'].keys()

[Output:]

dict_keys(['gene_name', 'gene_biotype', 'n_transcripts', 'n_orthologues', 'n_paralogues', 'ensembl_id', 'discription', 'loc', 'strand'])

Dictionaries can be created using dict() function.

[Input:]

gene_names = ["CKMT2", "NDUFAF7", "AGMAT"]
gene_biotypes = ["protein_coding"] * 3
n_transcripts = [32, 20, 4]
n_orthologues = [217, 210, 195]
n_paralogues = [4, 0, 2]
ensembl_IDs = ["ENSG00000131730", "ENSG00000003509", "ENSG00000116771"]
descriptions = ["creatine kinase, mitochondrial 2", "NADH:ubiquinone oxidoreductase complex assembly factor 7",
                "agmatinase (putative)"]
locus = ["Chromosome 5: 81,233,320-81,266,399", "Chromosome 2: 37,231,631-37,253,403", "Chromosome 1: 15,571,699-15,585,078"]
strands = ["+", "+", "-"]
list_values = zip(gene_names, gene_biotypes, n_transcripts, n_orthologues,
                  n_paralogues, ensembl_IDs, descriptions, locus, strands)
list_keys = ["gene_name", "gene_biotype", "n_transcripts", "n_orthologues",
             "n_paralogues", "ensembl_ID", "Description", "loc", "strand"]
list_dict = [dict(zip(list_keys, value)) for value in list_values]
list_dict

[Output:]

[{'gene_name': 'CKMT2',
  'gene_biotype': 'protein_coding',
  'n_transcripts': 32,
  'n_orthologues': 217,
  'n_paralogues': 4,
  'ensembl_ID': 'ENSG00000131730',
  'Description': 'creatine kinase, mitochondrial 2',
  'loc': 'Chromosome 5: 81,233,320-81,266,399',
  'strand': '+'},
 {'gene_name': 'NDUFAF7',
  'gene_biotype': 'protein_coding',
  'n_transcripts': 20,
  'n_orthologues': 210,
  'n_paralogues': 0,
  'ensembl_ID': 'ENSG00000003509',
  'Description': 'NADH:ubiquinone oxidoreductase complex assembly factor 7',
  'loc': 'Chromosome 2: 37,231,631-37,253,403',
  'strand': '+'},
 {'gene_name': 'AGMAT',
  'gene_biotype': 'protein_coding',
  'n_transcripts': 4,
  'n_orthologues': 195,
  'n_paralogues': 2,
  'ensembl_ID': 'ENSG00000116771',
  'Description': 'agmatinase (putative)',
  'loc': 'Chromosome 1: 15,571,699-15,585,078',
  'strand': '-'}]

Exercise

Please apply the methods available for a dictionary object to get familiar with this data type.

List comprehension

List comprehension is a very useful method available in python. It creates a new list by performing a pre-defined set of operations on each element of an existing list.

New_list = [expression for item in iterable if condition == True]

[Input:]

list_values = zip(gene_names, gene_biotypes, n_transcripts, n_orthologues,
                  n_paralogues, ensembl_IDs, descriptions, locus, strands)
list_dict = [dict(zip(list_keys, value)) for value in list_values]
list_dict

[Output:]

[{'gene_name': 'CKMT2',
  'gene_biotype': 'protein_coding',
  'n_transcripts': 32,
  'n_orthologues': 217,
  'n_paralogues': 4,
  'ensembl_ID': 'ENSG00000131730',
  'Description': 'creatine kinase, mitochondrial 2',
  'loc': 'Chromosome 5: 81,233,320-81,266,399',
  'strand': '+'},
 {'gene_name': 'NDUFAF7',
  'gene_biotype': 'protein_coding',
  'n_transcripts': 20,
  'n_orthologues': 210,
  'n_paralogues': 0,
  'ensembl_ID': 'ENSG00000003509',
  'Description': 'NADH:ubiquinone oxidoreductase complex assembly factor 7',
  'loc': 'Chromosome 2: 37,231,631-37,253,403',
  'strand': '+'},
 {'gene_name': 'AGMAT',
  'gene_biotype': 'protein_coding',
  'n_transcripts': 4,
  'n_orthologues': 195,
  'n_paralogues': 2,
  'ensembl_ID': 'ENSG00000116771',
  'Description': 'agmatinase (putative)',
  'loc': 'Chromosome 1: 15,571,699-15,585,078',
  'strand': '-'}]

Let’s dissect the operation

[dict(zip(list_keys, value)) for value in list_values]

Exercise

Create a numeric iterable object and perform the same operation on each element of the object without using a for loop.

Challenge

File handling

Python offers general-purpose file handling that offers efficient ways to deal with very large data. The biggest advantage that python has comparing to R is the ability to read file line by line to reduce memory usage for large volume data.

For example, we are going to read a small example of a genome annotation (gtf) file and parse the information using what we have learned so far.

First let’s download a small gtf file using command line.

wget https://raw.githubusercontent.com/ucdavis-bioinformatics-training/2025-December-Fundamentals-of-Scientific-Computing/main/base/GRCh38.ensembl112.4k.gtf

[Input:]

## Initialize the dictionary that will hold the parsed information
annotation = {}

with open("GRCh38.ensembl112.4k.gtf", "r") as f:

    # iterate through the file line-by-line
    for line in f:
        
        # split each line with tab as the delimiter
        fields = line.strip().split("\t")

        # initialize a dictionary to hold the attribute info
        attributes = {}

        # only parse the lines with "gene" in the 3rd column
        if len(fields) == 9 and fields[2] == "gene":
 
            attr_pairs = [attr.strip().split(" ") for attr in fields[8].strip().split(";")]
            
            for pair in attr_pairs[:-1]:
                key = pair[0]
                value = pair[1].strip('"')
                attributes[key] = value

                # extract gene id information as the key for each gene record
                if key == "gene_id":
                    annotation_key = value

            # create a dictionary using the gene information extracted above
            feature_info = {
            "seqname": fields[0],
            "source": fields[1],
            "start": fields[3],
            "end": fields[4],
            "strand": fields[6],
            "attributes": attributes
            }
            
            annotation[annotation_key] = feature_info


with open("annotation.tsv", "w") as outfile:
    print(annotation, file = outfile, sep = "\n")


Visualization

Data visualization is one essential step in data analysis. There are many libraries that can be used to generate visualizations in python. The following is a short list to get you started.

Matplotlib

We are going to use the same data set used in Wednesday’s R session for some examples of visualization in python.

[Input:]

# load python modules
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# read in the birth weight data and the miRNA expression data using the url or from your local copy
bw = pd.read_csv("https://raw.githubusercontent.com/ucdavis-bioinformatics-training/2022_February_Introduction_to_R_for_Bioinformatics/main/birthweight.csv")
mir = pd.read_csv("https://raw.githubusercontent.com/ucdavis-bioinformatics-training/2022_February_Introduction_to_R_for_Bioinformatics/main/miRNA.csv")

[Input:]

display(bw.head(8))

[Output:]

     ID birth.date     location  length  birthweight  head.circumference  \
0  1107  1/25/1967      General      52         3.23                  36   
1   697   2/6/1967  Silver Hill      48         3.03                  35   
2  1683  2/14/1967  Silver Hill      53         3.35                  33   
3    27   3/9/1967  Silver Hill      53         3.55                  37   
4  1522  3/13/1967     Memorial      50         2.74                  33   
5   569  3/23/1967     Memorial      50         2.51                  35   
6   365  4/23/1967     Memorial      52         3.53                  37   
7   808   5/5/1967  Silver Hill      48         2.92                  33   

   weeks.gestation smoker  maternal.age  maternal.cigarettes  maternal.height  \
0               38     no            31                    0              164   
1               39     no            27                    0              162   
2               41     no            27                    0              164   
3               41    yes            37                   25              161   
4               39    yes            21                   17              156   
5               39    yes            22                    7              159   
6               40    yes            26                   25              170   
7               34     no            26                    0              167   

   maternal.prepregnant.weight  paternal.age  paternal.education  \
0                           57           NaN                 NaN   
1                           62          27.0                14.0   
2                           62          37.0                14.0   
3                           66          46.0                 NaN   
4                           53          24.0                12.0   
5                           52          23.0                14.0   
6                           62          30.0                10.0   
7                           64          25.0                12.0   

   paternal.cigarettes  paternal.height  low.birthweight  geriatric.pregnancy  
0                  NaN              NaN                0                False  
1                  0.0            178.0                0                False  
2                  0.0            170.0                0                False  
3                  0.0            175.0                0                 True  
4                  7.0            179.0                0                False  
5                 25.0              NaN                1                False  
6                 25.0            181.0                0                False  
7                 25.0            175.0                0                False  

[Input:]

display(mir.head(8))

[Output:]

  Unnamed: 0  sample.27  sample.1522  sample.569  sample.365  sample.1369  \
0     miR-16         46           56          47          54           56   
1     miR-21         52           43          40          35           59   
2   miR-146a         98           97          87          96           84   
3    miR-182         53           45          63          41           46   

   sample.1023  sample.1272  sample.1262  sample.575  ...  sample.1360  \
0           59           49           55          62  ...           70   
1           47           42           45          55  ...           57   
2           96           88           97          96  ...          111   
3           50           49           50          62  ...           46   

   sample.1058  sample.755  sample.462  sample.1088  sample.553  sample.1191  \
0           77          56          65           42          63           66   
1           55          46          58           54          54           48   
2          124         101         101          107         106          102   
3           56          50          60           63          60           50   

   sample.1313  sample.1600  sample.1187  
0           64           50           57  
1           47           44           46  
2          104          111           86  
3           42           67           43  

[4 rows x 43 columns]

It looks that the miRNA expression table uses the miRNA names as row names. We can tell the read_csv function to use the first column as the row names.

[Input:]

mir = pd.read_csv("https://raw.githubusercontent.com/ucdavis-bioinformatics-training/2022_February_Introduction_to_R_for_Bioinformatics/main/miRNA.csv", index_col = 0)
display(mir.head(8))

[Output:]

          sample.27  sample.1522  sample.569  sample.365  sample.1369  \
miR-16           46           56          47          54           56   
miR-21           52           43          40          35           59   
miR-146a         98           97          87          96           84   
miR-182          53           45          63          41           46   

          sample.1023  sample.1272  sample.1262  sample.575  sample.792  ...  \
miR-16             59           49           55          62          63  ...   
miR-21             47           42           45          55          45  ...   
miR-146a           96           88           97          96         104  ...   
miR-182            50           49           50          62          51  ...   

          sample.1360  sample.1058  sample.755  sample.462  sample.1088  \
miR-16             70           77          56          65           42   
miR-21             57           55          46          58           54   
miR-146a          111          124         101         101          107   
miR-182            46           56          50          60           63   

          sample.553  sample.1191  sample.1313  sample.1600  sample.1187  
miR-16            63           66           64           50           57  
miR-21            54           48           47           44           46  
miR-146a         106          102          104          111           86  
miR-182           60           50           42           67           43  

[4 rows x 42 columns]

[Input:]

mir_trans = mir.T
display(mir_trans.head(8))

[Output:]

             miR-16  miR-21  miR-146a  miR-182
sample.27        46      52        98       53
sample.1522      56      43        97       45
sample.569       47      40        87       63
sample.365       54      35        96       41
sample.1369      56      59        84       46
sample.1023      59      47        96       50
sample.1272      49      42        88       49
sample.1262      55      45        97       50

In order to merge the two dataframes, we will insert a column into the transposed miRNA expression dataframe with the sample IDs. First, let’s check what data type does the ID column is in the birth weight dataframe.

[Input:]

bw.dtypes

[Output:]

ID                               int64
birth.date                      object
location                        object
length                           int64
birthweight                    float64
head.circumference               int64
weeks.gestation                  int64
smoker                          object
maternal.age                     int64
maternal.cigarettes              int64
maternal.height                  int64
maternal.prepregnant.weight      int64
paternal.age                   float64
paternal.education             float64
paternal.cigarettes            float64
paternal.height                float64
low.birthweight                  int64
geriatric.pregnancy               bool
dtype: object

[Input:]

mir_trans.insert(loc = 0, column = "ID", value = [int(INDEX.split(".")[1]) for INDEX in mir_trans.index])
display(mir_trans.head(8))

[Output:]

               ID  miR-16  miR-21  miR-146a  miR-182
sample.27      27      46      52        98       53
sample.1522  1522      56      43        97       45
sample.569    569      47      40        87       63
sample.365    365      54      35        96       41
sample.1369  1369      56      59        84       46
sample.1023  1023      59      47        96       50
sample.1272  1272      49      42        88       49
sample.1262  1262      55      45        97       50

[Input:]

mir_trans.dtypes

[Output:]

ID          int64
miR-16      int64
miR-21      int64
miR-146a    int64
miR-182     int64
dtype: object

[Input:]

# merge the two dataframes
data = pd.merge(bw, mir_trans, on = "ID", how = "inner")
display(data.head(8))

[Output:]

     ID birth.date     location  length  birthweight  head.circumference  \
0  1107  1/25/1967      General      52         3.23                  36   
1   697   2/6/1967  Silver Hill      48         3.03                  35   
2  1683  2/14/1967  Silver Hill      53         3.35                  33   
3    27   3/9/1967  Silver Hill      53         3.55                  37   
4  1522  3/13/1967     Memorial      50         2.74                  33   
5   569  3/23/1967     Memorial      50         2.51                  35   
6   365  4/23/1967     Memorial      52         3.53                  37   
7   808   5/5/1967  Silver Hill      48         2.92                  33   

   weeks.gestation smoker  maternal.age  maternal.cigarettes  ...  \
0               38     no            31                    0  ...   
1               39     no            27                    0  ...   
2               41     no            27                    0  ...   
3               41    yes            37                   25  ...   
4               39    yes            21                   17  ...   
5               39    yes            22                    7  ...   
6               40    yes            26                   25  ...   
7               34     no            26                    0  ...   

   paternal.age  paternal.education  paternal.cigarettes  paternal.height  \
0           NaN                 NaN                  NaN              NaN   
1          27.0                14.0                  0.0            178.0   
2          37.0                14.0                  0.0            170.0   
3          46.0                 NaN                  0.0            175.0   
4          24.0                12.0                  7.0            179.0   
5          23.0                14.0                 25.0              NaN   
6          30.0                10.0                 25.0            181.0   
7          25.0                12.0                 25.0            175.0   

   low.birthweight  geriatric.pregnancy  miR-16  miR-21  miR-146a  miR-182  
0                0                False      57      49       116       48  
1                0                False      68      47        98       57  
2                0                False      49      48        98       55  
3                0                 True      46      52        98       53  
4                0                False      56      43        97       45  
5                1                False      47      40        87       63  
6                0                False      54      35        96       41  
7                0                False      59      56       101       74  

[8 rows x 22 columns]

[Input:]

# enter matplotlib mode
%matplotlib

# Let's do some simple plot
data.boxplot(rot = 90)

[Output:]

Using matplotlib backend: module://matplotlib_inline.backend_inline
<Axes: >

output image


Let’s use Seaborn library to generate some visualizations

Let’s take a look at the data distribution

[Input:]

# data distribution plots can be generated using displot() function
sns.displot(data, x = "birthweight")

[Output:]

<seaborn.axisgrid.FacetGrid at 0x320f2d940>

output image


[Input:]

# One may define a different bin size from the default
sns.displot(data, x = "birthweight", hue = "location", multiple = "stack")

[Output:]

<seaborn.axisgrid.FacetGrid at 0x321309a90>

output image


Let’s take a look at how to plot the relationship between some variables.

[Input:]

sns.relplot(data, x = "birthweight", y = "head.circumference", hue = "smoker")

[Output:]

<seaborn.axisgrid.FacetGrid at 0x3213aa0d0>

output image


Up until now, we have been using the default theme for the visualization from matplotlib. Let’s switch to Seaborn’s default theme to see if there is any difference.

[Input:]

sns.set_theme()

[Input:]

sns.displot(data, x = "birthweight", hue = "location", multiple = "stack")

[Output:]

<seaborn.axisgrid.FacetGrid at 0x321441310>

output image


[Input:]

sns.relplot(data, x = "birthweight", y = "head.circumference", hue = "smoker")

[Output:]

<seaborn.axisgrid.FacetGrid at 0x321501e50>

output image


Faceted visualization

[Input:]

tidy_data = pd.melt(data[["ID", "smoker", "miR-16", "miR-21", "miR-146a", "miR-182"]], id_vars = ["ID", "smoker"])
display(tidy_data.head(8))

[Output:]

     ID smoker variable  value
0  1107     no   miR-16     57
1   697     no   miR-16     68
2  1683     no   miR-16     49
3    27    yes   miR-16     46
4  1522    yes   miR-16     56
5   569    yes   miR-16     47
6   365    yes   miR-16     54
7   808     no   miR-16     59

[Input:]

sns.catplot(tidy_data, x = "smoker", y = "value", col = "variable", hue = "smoker", kind = "violin")

[Output:]

<seaborn.axisgrid.FacetGrid at 0x3230087d0>

output image


[Input:]

g = sns.catplot(data=tidy_data, x='smoker', y='value', col='variable', hue='smoker',
                kind='violin', col_wrap=4, height=6, aspect=1.2)

# Remove individual x-axis labels from each subplot
for ax in g.axes.flat:
    ax.set_xlabel('')
    ax.tick_params(axis='x', labelsize=16)  # x-axis tick labels
    ax.tick_params(axis='y', labelsize=16)  # y-axis tick labels

    # Remove the default title
    title_text = ax.get_title()
    ax.set_title('')
    
    # Add title inside the plot area
    ax.text(0.5, 0.95, title_text, 
            transform=ax.transAxes,  # Use axes coordinates (0-1)
            fontsize=16, 
            fontweight='bold',
            verticalalignment='top',
            horizontalalignment='center',
            bbox=dict(boxstyle='round', facecolor='white', alpha=0.7))  # Optional background box


# Add a single shared x-axis label
g.fig.text(0.5, 0.01, 'Mother smoking Status', ha='center', fontsize=18, fontweight='bold')

# Adjust layout to make room for the shared label
plt.subplots_adjust(bottom=0.12)

[Output:]

output image


Final note

The materials for Python in this workshop is meant to introduce you to some concepts/syntax in Python and is far from a comprehensive guide. It is important to remember that gaining a deep understanding of the fundamentals of programming, such as how to breakdown a complex problem, recognize patterns, and design elegant solutions is the only way to become independent in programming.