Creating Images with PyQRCode

Mass generation of QR codes with Python.

This is a script for taking a list of URLs from a spreadsheet and generating a captioned QR code for each entry.

Specifically, the script reads the ‘LongURLs‘ input file, shortens the URLs, creates QR Codes, adds captions, and saves each QR code as a .PNG image file.

We shorten the URLs to reduce the complexity of the QR code, which makes it less likely to become unreadable from printing imperfections and dirt smudges.

We’ll use:

1. Numpy
2. Pandas
3. PyQRCode
4. pyshorteners
5. PIL
6. PyPNG

We load our URLs and IDs (captions) using the LongURLs template.

LongURLs Template

Next, we run the script and our QR codes will be output as PNG files in the same directory as our script.

Email links (such as “”) can be used as input URLs, but you’ll need to disable the ValueError:’Please enter a valid url’ that pyshorteners will raise.

import numpy as np
import pyqrcode
import pandas as pd
from pyshorteners import Shortener
from PIL import ImageFont
from PIL import Image
from PIL import ImageDraw

DF = pd.DataFrame(pd.read_excel(r'C:\Users\Craig\Documents\Python Scripts\LongURLs.xlsx',

ShortURL=np.array(LongURL, dtype='str')

for i in range(0,len(LongURL)):
    code.png(ID[i] + '.png', scale=6, module_color=[0,0,0,128],quiet_zone=7) 

    #Adds caption[i] + '.png')
    font = ImageFont.truetype("ariblk.ttf", 20)
    draw.text((xcor,245),str(ID[i]),font=font)[i]) + '.png')


With pyshorteners, we have the option of using a bunch of different URL shorteners – in this case we used TinyURL.  See the pyshorteners github for a full list.

The font of your caption can be adjusted by taking the desired font’s .tff file (found in Control Panel > Appearance and Personalization > Fonts), copying it into the same folder as your script, and updating line 25.

You might need to adjust the “xcor” value (based on the length of your IDs) to get your caption centered under the QR image.  If your ID lengths are all different, consider adding a few lines of code to detect the ID length and update “xcor” dynamically.

Finding Words with PyPDF2

Find all instances of words in a PDF with Python’s PyPDF2 library.

This is a script for finding all instances of a given search word (or multiple search words) in a PDF.

For our example, we’ll be using a PDF of Romeo and Juliet.  In this case, our search terms are “Romeo” and “Juliet” (search is not case-sensitive).

import PyPDF2
import re




with open('FoundWordsList.csv', 'w') as f:
    f.write('{0},{1}\n'.format("Sheet Number", "Search Word"))
    for word in searchwords:
        for page in range(number_of_pages):
            words_start_pos[page]=[dwg.start() for dwg in re.finditer(word, pages_text[page].lower())]
            words[page]=[pages_text[page][value:value+len(word)] for value in words_start_pos[page]]
        for page in words:
            for i in range(0,len(words[page])):
               if str(words[page][i]) != 'nan':
                    f.write('{0},{1}\n'.format(page+1, words[page][i]))
                    print(page, words[page][i])

We run the script and get an output that shows each instance of each search word and the associated PDF page number:

This script can be used for a variety of other applications by updating the file path (line 4) and the search terms (line 12).

A few ideas for modification include:

  • Frequency counts of words in books/lyrics (ATS has an awesome frequency count graph generator)
  • Finding reference drawing numbers in a document
  • Identify search terms by prefixes rather than whole words
  • Identifying sheets that need to be updated
  • Using glob to iterate through multiple files

How else would you modify this script?  Let me know!

Thanks for reading!

Special thanks to these sources:

Automate the Boring Stuff with Python
ritesh_shrv on Stack Overflow

Automated Email with Python

Automated email notifications and task tracking system.

This article explains how to use a Python script in conjunction with a simple Action Tracking spreadsheet to create an automated email notifications and task-tracking system for your team.

To begin, let’s setup our “ActionTracker” spreadsheet as shown below:


We can use the expression “=IF(ISBLANK(H2)=TRUE,”Active”,”Closed”)” in our Status column to acknowledge when a date has been entered in the “Completion Date” column.  This will help our script later on.

The “Days Open” column can be calculated using “=IF(ISBLANK(H2)=FALSE,H2-F2,TODAY()-F2)”.  As your list grows, be sure to drag down your formulas.

It can be helpful to apply conditional formatting here in order to see which items are “Open” and late, so that we know which items we expect to send notifications about.  This can be accomplished by the expression shown below, but it is not a necessary step.  Again, remember to update your applicable range as your list grows.


On our “Email” tab, we’ll list our unique assignees by name and add their email addresses (separated by a comma and a space) in column B.


In order to minimize errors, we can apply a Data Validation rule to our “Assignee” column on the “ActionList” tab.  We’ll select all of the unique names on our “Email” tab as the Source validation criteria.  New emails can easily be added to this list, however, we must update our Source range.


Here’s a download link for the ActionTracker template.

Next, we’ll use the following Python script to send automated email notifications to our team for any Open actions that are open for more than three days.  The three day threshold can be easily adjusted in line #50 of the script below.

Note: In order to allow the script to access your gmail account, make sure that your less secure app access settings are currently turned on.

import smtplib
import pandas as pd
import sys
from tkinter import Tk
from tkinter import filedialog
from email.mime.text import MIMEText
from email.mime.multipart import MIMEMultipart

#Define email login information.
from_email="" #Replace with your email address.  Must be gmail.
pw="_____" #Replace with gmail password.

#Select file to use.
                                    filetypes=[('.xlsx files','*.xlsx'),
                                               ('All files','*.*')],
if filepath=="." or filepath=="":

#Import ActionTracker
ActionTracker = pd.DataFrame(pd.read_excel(filepath,sheetname='ActionList',

#Import Email Addresses by name

#Establish connection to gmail server, login.
server = smtplib.SMTP('',587)
server.login(from_email, pw)


#Initialize a list of submittals with invalid email addresses

#Send emails to late Action Tracker assignees
for i in range(0,ActionTracker_maxrow):
    if status[i]=='Active' and DaysOpen[i]>3:
        print('Active Line Item #'+str(LineItem[i])+': '+person[i])
        msg=MIMEText("Action Tracker Line Item #" + str(LineItem[i]) + " has been open for " +
                     str(DaysOpen[i]) + " days.\n\n" + str(action[i]) +
                     "\n\nPlease take action.",_charset="UTF-8")
        msg['Subject']="Open Action #" + str(LineItem[i])
            server.sendmail(from_email, msg['TO'].split(","),
        except smtplib.SMTPRecipientsRefused:
            print('Line Item #' + str(LineItem[i]) + 'has an invalid email address.')

if len(invalid_addresses) != 0:
    for i in range(0,len(invalid_addresses)):
        if len(invalid_addresses)==1:
            msg=MIMEText(str(invalid_addresses) +
            " has an invalid email address associated with the responsible party.",
            msg=MIMEText(str(invalid_addresses) +
                             " have invalid email addresses associated with the responsible parties.",
        msg['Subject']='Invalid Email Addresses'
        server.sendmail(from_email, msg['TO'].split(","),
    except smtplib.SMTPRecipientsRefused:
        print('Invalid Email Address notification email failed.')



And that’s it.  Full automation can be achieved by hard-coding in the file location and using Windows Task Scheduler to execute the Python script.

Finding Correlations

Script for normalizing and finding correlations across variables in a numeric dataset.  Data can be analyzed as a whole or split into ‘n’ many subsets.  When split, normalizations are calculated and correlations are found for each subset.

Input is read from a .csv file with any number of columns (as shown below).  Each column must have the same number of samples.  Script assumes there are headers in the first row.


import numpy as np

#Divides a list (or np.array) into N equal parts.
def slice_list(input, size):
    input_size = len(input)
    slice_size = input_size // size
    remain = input_size % size
    result = []
    iterator = iter(input)
    for i in range(size):
        for j in range(slice_size):
        if remain:
            remain -= 1
    return result

#Functions below are from Data Science From Scratch by Joel Grus
def mean(x):
    return sum(x)/len(x)

def de_mean(x):
    return [x_i-x_bar for x_i in x]

def dot(v,w):
    return sum(v_i*w_i for v_i, w_i in zip(v,w))

def sum_of_squares(v):
    return dot(v,v)

def variance(x):
    return sum_of_squares(deviations)/(n-1)

def standard_deviation(x):
    return np.sqrt(variance(x))  

def covariance(x,y):
    return dot(de_mean(x),de_mean(y))/(n-1)

def correlation(x,y):
    if stdev_x >0 and stdev_y>0:
        return covariance(x,y)/stdev_x/stdev_y
        return 0

#Read data from CSV

#Determine number of samples & variables

#Define number of samples (and start/end points) in full time interval

#Define number of intervals to split data into
subnorm_test=np.zeros((full_sample_end, number_of_allvars+1))

#Slice variable lists
for dvar in range(0,number_of_allvars):
    for sublist in range(0,n):


#Normalize each sublist.
for var in range(0, number_of_allvars):
    for n_i in range(0,n):
        for x in range(0,sublength):


#Check for correlation between each variable
for n_i in range(0,n):
    for i in range(0,number_of_allvars):
        for j in range(0,number_of_allvars):

#Writes to CSV
np.savetxt(r'C:\Users\Craig\Documents\GitHub\normalized\sublists_normalized.csv',subnorm_test, delimiter=",") 

print(var_sub_correlation, 'variable correlation matrix')

Filenames to CSV

Python script for writing all filenames in the selected folder to ‘Filenames.csv’.

import glob
from tkinter import Tk
from tkinter import filedialog

def FileList(filepath):
    return glob.glob(str(filepath) + '*')


if folderpath_user.endswith('\\'):
    folderpath=str(folderpath_user)+ '\\'
with open('Filenames.csv', 'w') as f:
    for file in FileList(folderpath):

Unique Values from Spreadsheet

Python script for extracting all unique values from each tab of a spreadsheet.
Logs results in Output.csv in the same directory as .py file.

import pandas as pd
import xlrd
from tkinter import Tk
from tkinter import filedialog
def UniqueValuesFromSpreadsheet(WorkbookPath,sheet_name):
     df=pd.DataFrame(pd.read_excel(WorkbookPath,sheetname=sheet_name, header=None,index_col=False))   
     return df2

for i in range(0,NumSheets):
    Results[SheetNames[i]] = UniqueValuesFromSpreadsheet(filename,i)
    print(SheetNames[i], UniqueValuesFromSpreadsheet(filename,i))
with open('Output.csv', 'w') as f:
    f.write('{0},{1}\n'.format("Sheet Name", "Value"))    
    for key in Results:
        for i in range(0,len(Results[key])):
           if str(Results[key][i]) != 'nan':
                f.write('{0},{1}\n'.format(key, Results[key][i]))               
                print(key, Results[key][i])

Data Tools Overview

Data Visualization, Business Intelligence, and Data Science Tools

All descriptions and information shown below were harvested from official software websites, GitHub, Wikipedia, DataCamp, and other websites as listed.

  1. Tableau
    • Data Visualization/Dashboarding Tool
    • Very easy to quickly create graphs, filters, add trendlines, slicers/filters.
    • “Business Intelligence and Analytics”
    • Tableau Public (Free) for Open Community/Non-Commercial use.
      • Connect to a Server
        • OData
        • Web Data Connector
          • Collect data from “virtually any site that publishes data in JSON, XML, or HTML”
          • Use Tableau Web Data Connector Software Development Kit (SDK) to build connectors using Javascript and HTML.
            • SDK includes templates, docs, examples
      • Connect to a File
        • Excel
        • Text
        • Access
        • Statistical Files
          • SAS, SPSS, R
    • Tableau Desktop ($) Professional
    • Connects to pretty much any data source.
  2. (Apache) Hadoop (Free)
    • Open-source software framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware (wikipedia)
    • “High Volume Data Flows > MapReduce Process > Consume Results”
    • Can scale very well from a single server to a thousands of machines “each offering local computation and storage” (e-commerce, mobile data type scale)
    • Examples of Use (Gigaom)
      • Satellite Image Processing
      • Fraud Detection
      • IT Security – “Identify malware and cyber-attack patterns”
    • Several Hadoop-related projects listed on
      • MapReduce
        • “programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster.” (wikipedia)
        • “Distributed Processing Framework”
      • Pig
        • “A high-level data-flow language and execution framework for parallel computation” (From Hadoop site)
        • “high-level platform for creating MapReduce programs with Hadoop” (Wikipedia)
        • Easily program parallel analysis and more complex data flow sequences (Paraphrased from Hadoop site)
        • “Scripting”
  3. Python (Free) (library selection from DataCamp)
    • Sci-kit learn
      • “Machine Learning in Python” (
      • “Simple and efficient tools for data mining and data analysis”
      • Performs:
        • Classification
        • Regression
        • Clustering
        • Dimensionality reduction
        • Model Selection
        • Preprocessing
    • NumPy
      • Fundamental package for scientific computing with Python. It contains among other things: (
      • a powerful N-dimensional array object
      • sophisticated (broadcasting) functions
      • tools for integrating C/C++ and Fortran code
      • useful linear algebra, Fourier transform, and random number capabilities
    • Pandas
      • High-performance, easy-to-use data structures and data analysis tools for the Python programming language. (
    • SciPy
      • Collection of numerical algorithms and domain-specific toolboxes, including signal processing, optimization, statistics and much more. (
    • Matplotlib
    • Statsmodels
      • Explore data, estimate statistical models, and perform statistical tests. An extensive list of descriptive statistics, statistical tests, plotting functions, and result statistics are available for different types of data and each estimator. (
        • Linear regression models
        • Generalized linear models
        • Discrete choice models
        • Robust linear models
        • Many models and functions for time series analysis
        • Nonparametric estimators
        • A collection of datasets for examples
        • A wide range of statistical tests
        • Input-output tools for producing tables in a number of formats (Text, LaTex, HTML) and for reading Stata files into NumPy and Pandas.
        • Plotting functions
        • Extensive unit tests to ensure correctness of results
    • Seaborn
      • Library for making attractive and informative statistical graphics in Python. It is built on top of matplotlib and tightly integrated with the PyData stack, including support for numpy and pandas data structures and statistical routines from scipy and statsmodels. (
      • Several built-in themes that improve on the default matplotlib aesthetics
      • Tools for choosing color palettes to make beautiful plots that reveal patterns in your data
      • Functions for visualizing univariate and bivariate distributions or for comparing them between subsets of data
      • Tools that fit and visualize linear regression models for different kinds of independent and dependent variables
      • Functions that visualize matrices of data and use clustering algorithms to discover structure in those matrices
      • A function to plot statistical timeseries data with flexible estimation and representation of uncertainty around the estimate
      • High-level abstractions for structuring grids of plots that let you easily build complex visualizations
    • Bokeh
      • Bokeh is a Python interactive visualization library that targets modern web browsers for presentation. Its goal is to provide elegant, concise construction of novel graphics in the style of D3.js, and to extend this capability with high-performance interactivity over very large or streaming datasets. Bokeh can help anyone who would like to quickly and easily create interactive plots, dashboards, and data applications. (
    • Pygal
    • SciPy Stack (
      • Python (2.x >= 2.6 or 3.x >= 3.2)
      • NumPy (>= 1.6)
      • SciPy library (>= 0.10)
      • Matplotlib (>= 1.1)
        • dateutil
        • pytz
        • Support for at least one backend
      • IPython (>= 0.13)
        • pyzmq
        • tornado
      • pandas (>= 0.8)
      • Sympy (>= 0.7)
      • nose (>= 1.1)
  4. R (Free) (library selection from (DataCamp))
    • Data.table
      • Extension of Data.frame
      • Fast aggregation of large data (e.g. 100GB in RAM), fast ordered joins, fast add/modify/delete of columns by group using no copies at all, list columns and a fast file reader (fread). Offers a natural and flexible syntax, for faster development. (
    • Dplyr
      • A Grammar of Data Manipulation
      • A fast, consistent tool for working with data frame like objects, both in memory and out of memory. (
    • Plyr
      • Tools for Splitting, Applying and Combining Data
      • A set of tools that solves a common set of problems: you need to break a big problem down into manageable pieces, operate on each piece and then put all the pieces back together. For example, you might want to fit a model to each spatial location or time point in your study, summarize data by panels or collapse high-dimensional arrays to simpler summary statistics. (
    • Stringr
      • Simple, Consistent Wrappers for Common String Operations
      • A consistent, simple and easy to use set of wrappers around the fantastic ‘stringi’ package. All function and argument names (and positions) are consistent, all functions deal with “NA”‘s and zero length vectors in the same way, and the output from one function is easy to feed into the input of another. (
    • Zoo
      • S3 Infrastructure for Regular and Irregular Time Series (Z’s Ordered Observations)
      • An S3 class with methods for totally ordered indexed observations. It is particularly aimed at irregular time series of numeric vectors/matrices and factors. zoo’s key design goals are independence of a particular index/date/time class and consistency with ts and base R by providing methods to extend standard generics. (
    • Ggvis
      • Interactive Grammar of Graphics
      • An implementation of an interactive grammar of graphics, taking the best parts of ‘ggplot2’, combining them with the reactive framework from ‘shiny’ and web graphics from ‘vega’. (
    • Lattice
      • Trellis Graphics for R
      • A powerful and elegant high-level data visualization system inspired by Trellis graphics, with an emphasis on multivariate data. Lattice is sufficient for typical graphics needs, and is also flexible enough to handle most nonstandard requirements. (
    • Ggplot2
      • An Implementation of the Grammar of Graphics
      • It combines the advantages of both base and lattice graphics: conditioning and shared axes are handled automatically, and you can still build up a plot step by step from multiple data sources. It also implements a sophisticated multidimensional conditioning system and a consistent interface to map data to aesthetic attributes. (
    • Caret
      • Classification and Regression Testing
      • Misc functions for training and plotting classification and regression models. (
    • RevoScaleR
      • Proprietary R package from Revolution Analytics.
      • Revolution Analytics is a commercial distribution of R.
  5. Power BI (Free, $)
    • “suite of business analytics tools to analyze data and share insights.”
    • Looks like mostly data visualization
    • Extremely interactive/dynamic dashboards.
    • Connects to pretty much everything
    • Cross Platform
  6. SAS ($)
    • “Statistical Analysis Software is a software suite developed by SAS Institute for advanced analytics, multivariate analyses, business intelligence, data management, and predictive analytics” (Wikipedia)
    • SAS University Edition (Free)
    • Local
    • AWS
  7. Google Analytics
    • For Enterprise ($)
    • For Small Businesses
    • Google Analytics (Free)
      • “Google Analytics helps you analyze visitor traffic and paint a complete picture of your audience and their needs. Track the routes people take to reach you and the devices they use to get there with reporting tools like Traffic Sources. Learn what people are looking for and what they like with In-Page Analytics. Then tailor your marketing and site content for maximum impact.” (
    • Data Collection & Management
      • Data collection and management with Google Analytics provides a single, accurate view of the customer that can be customized to your needs and shared across the organization.
    • Data Consolidation
      • Google delivers integrated solutions that preserve data integrity, reduce friction, and seamlessly connect disparate data sources.
    • Data Analytics & Reporting
      • Reports can be segmented and filtered to reflect the needs of your business. Real-time views let you know which new content is popular, how much traffic today’s new promotion is driving to your site, and which tweets and blog posts draw the best results.
    • Data Activation
      • Make smarter marketing decisions. Google Analytics allows you to seamlessly activate your data to improve marketing campaigns and experiment with new channels and content.
    • Tag Manager (Free)
      • Google Tag Manager lets you launch new tags any time with a few clicks, so you never miss a measurement or marketing opportunity.
    • Analytics Academy
      • Learn analytics with free online courses
      • Take lessons from Google measurement experts
      • Join the Google Analytics learning community
      • Test your knowledge
      • For Mobile Apps
    • Google Analytics API

General Notes

  1. “Data scientists that use primarily opensource tools earned a higher median salary (130K) than those using proprietary tools (90K).” – Datacamp
  2. Big Data Landscape 2016
  3. For further Data Science reading, I highly recommend Joel Grus’ Data Science from Scratch.