Transformation Matrices for Robotic Arms

Python functions for computing the basic homogeneous transformation matrices used in serial manipulator kinematics.

# -*- coding: utf-8 -*-
"""
Functions for calculating Basic Transformation Matrices in 3D space.
"""
from math import cos, radians, sin
from numpy import matrix

def rotate(axis, theta, angular_units='radians'):
    '''Compute Basic Homogeneous Transform Matrix for
    rotation of "theta" about specified axis.'''
    #Verify string arguments are lowercase
    axis=axis.lower()
    angular_units=angular_units.lower()
    #Convert to radians if necessary
    if angular_units=='degrees':
        theta=radians(theta)
    elif angular_units=='radians':
        pass
    else:
        raise Exception('Unknown angular units.  Please use radians or degrees.')
    #Select appropriate basic homogeneous matrix
    if axis=='x':
        rotation_matrix=matrix([[1, 0, 0, 0],
                               [0, cos(theta), -sin(theta), 0],
                               [0, sin(theta), cos(theta), 0],
                               [0, 0, 0, 1]])
    elif axis=='y':
        rotation_matrix=matrix([[cos(theta), 0, sin(theta), 0],
                               [0, 1, 0, 0],
                               [-sin(theta), 0, cos(theta), 0],
                               [0, 0, 0, 1]])  
    elif axis=='z':
        rotation_matrix=matrix([[cos(theta), -sin(theta), 0, 0],
                               [sin(theta), cos(theta), 0, 0],
                               [0, 0, 1, 0],
                               [0, 0, 0, 1]])
    else:
        raise Exception('Unknown axis of rotation.  Please use x, y, or z.')
    return rotation_matrix

def translate(axis, d):
    '''Calculate Basic Homogeneous Transform Matrix for
    translation of "d" along specified axis.'''   
    #Verify axis is lowercase
    axis=axis.lower()
    #Select appropriate basic homogeneous matrix
    if axis=='x':
        translation_matrix=matrix([[1, 0, 0, d],
                                  [0, 1, 0, 0],
                                  [0, 0, 1, 0],
                                  [0, 0, 0, 1]])
    elif axis=='y':
        translation_matrix=matrix([[1, 0, 0, 0],
                                  [0, 1, 0, d],
                                  [0, 0, 1, 0],
                                  [0, 0, 0, 1]])
    elif axis=='z':
        translation_matrix=matrix([[1, 0, 0, 0],
                                  [0, 1, 0, 0],
                                  [0, 0, 1, d],
                                  [0, 0, 0, 1]])
    else:
        raise Exception('Unknown axis of translation.  Please use x, y, or z.')
    return translation_matrix

if __name__=='__main__':
    #Calculate arbitrary homogeneous transformation matrix for CF0 to CF3
    H0_1=rotate('x', 10, 'degrees')*translate('y', 50)
    H1_2=rotate('y', 30, 'degrees')*translate('z', 10)
    H2_3=rotate('z', -20, 'degrees')*translate('z', 10)
    H0_3=H0_1*H1_2*H2_3
    print(H0_3)
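
Since H0_3 is a 4x4 homogeneous transform, the position of the CF3 origin expressed in CF0 can be read directly from the last column.  A quick sketch reusing the variables from the script above:

#Position of the CF3 origin in CF0: first three rows of the last column
position_0_3=H0_3[0:3, 3]
print(position_0_3)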

Also available on GitHub.

Time-lapse Camera with Raspberry Pi

Building a Time-lapse Camera with Raspberry Pi.

Recently I built a time-lapse camera with a Raspberry Pi.  Here’s how:

Bill of Materials

  1. Raspberry Pi 3
  2. Power Supply
  3. Camera Mount
    • This ended up having a slightly different mounting hole pattern than the Arducam.
  4. Arducam Camera

During initial setup, you’ll also want to have an HDMI Cable, Keyboard, Mouse, and Monitor for your Pi.

Hardware

  1. Fasten the Arducam to the camera mount.
  2. Connect the Arducam ribbon cable to the Pi’s CSI port.
  3. Download the python code.
    • Update start time, end time, and sleep interval as desired.
  4. (Optional) Update rc.local as mentioned below.

Code

from time import sleep
from picamera import PiCamera
from datetime import datetime

MORNING_START_HOUR=7
EVENING_END_HOUR=19

def day_or_night(current_time):
    hour=current_time.hour
    if hour>=MORNING_START_HOUR and hour<EVENING_END_HOUR:
        return 'day'
    else:
        return 'night'
    
def take_picture():
    camera.start_preview()
    sleep(2)
    now=datetime.now().strftime("%Y-%m-%d-%H-%M")
    label='timelapse_' + now + '.jpg'
    camera.capture('/home/pi/Pictures/'+label)
    camera.stop_preview()
    print('Image captured at '+now)
    
if __name__=='__main__':
    camera = PiCamera()

    while True:
        now=datetime.now()
        if day_or_night(now)=='day':
            try:
                take_picture()
            except:
                #Keep the loop alive even if a single capture fails
                pass
        sleep(900) #15 minutes

Also on GitHub.  I was able to get the code to execute upon startup by updating the Pi’s rc.local file.  I followed the rc.local method shown here.

The images are saved in /home/pi/Pictures/ on the Pi.  I used ImageMagick to create the GIF of the plant shown above.

Future Improvements

  • Saving the files to Google Drive to avoid file storage limitations.  This would also let you view the images without disturbing the camera system.  It looks like this article points us in the right direction.
  • Utilizing a portable power supply.

Building a Superflight Controller

Building a controller for Superflight with Arduino, PySerial, and a Wii Nunchuk.


A few months ago, I downloaded Superflight on Steam.  It’s an awesome game.

I thought it might be fun to play with a joystick, but I didn’t have one… so I hacked one together with an Arduino Mega, an old Wii Nunchuk, and PySerial.  The controller works by using the Arduino as an interface between the Nunchuk and the computer (via USB), which allows our Python code to read & interpret the Nunchuk data and simulate keystrokes in the game.  The entire hardware configuration and most of the code I needed was already generously available from Gabriel Bianconi’s Makezine article and Chris Kiehl’s Virtual Keystroke project.

The only real hardware change I made was the Arduino pin locations.  For me on an Arduino Mega 2560, this was SDA: Pin 20 and SCL: Pin 21.

In the Arduino code from Bianconi’s article, I modified the Baud rate from 19200 to 9600.  This seemed to be more stable for me, but I’m not sure if it was entirely necessary.  Regardless of what rate you select, make sure the Baud rate matches in the Arduino and Python code.

I pruned a lot of the Nunchuk gyroscopic readings out of Bianconi’s Python code.  Then I added the VK_CODE dictionary (partial) and “press” function from Chris Kiehl’s project, which takes advantage of the win32api to simulate keystrokes on a Windows machine.  Finally, I modified some of the existing logic to “press” the arrow keys when the Wii joystick was moved in the corresponding direction.  My python code ended up looking like this:

"""
Building a Superflight Controller with a Wii Nunchuk

Note: Must run with Python 2.
"""

# Import the required libraries for this script
import string, time, serial, win32api, win32con

#Dictionary to hold key names and VK values
VK_CODE = {'left_arrow':0x25,
           'up_arrow':0x26,
           'right_arrow':0x27,
           'down_arrow':0x28,}

#press keys
def press(*args):
    '''
    one press, one release.
    accepts as many arguments as you want. e.g. press('left_arrow', 'a','b').
    '''
    for i in args:
        win32api.keybd_event(VK_CODE[i], 0,0,0)
        time.sleep(.001)
        win32api.keybd_event(VK_CODE[i],0 ,win32con.KEYEVENTF_KEYUP ,0)

# The port to which your Arduino board is connected
port = 'COM3'

# Invert y-axis (True/False)
invertY = False

# The cursor speed
cursorSpeed = 20

# The baudrate of the Arduino program
baudrate = 9600

# Variables indicating whether the mouse buttons are pressed or not
leftDown = False
rightDown = False

# Variables indicating the center position (no movement) of the controller
midAnalogY = 130
midAnalogX = 125

if port == 'arduino_port':
    print('Please set up the Arduino port.')
    while 1:
        time.sleep(1)

# Connect to the serial port
ser = serial.Serial(port, baudrate, timeout = 1)

# Wait 5s for things to stabilize
time.sleep(5)

# While the serial port is open
while ser.isOpen():

    # Read one line
    line = ser.readline()

    # Strip the ending (\r\n)
    line = string.strip(line, '\r\n')

    # Split the string into an array containing the data from the Wii Nunchuk
    line = string.split(line, ' ')

    print(line)

    # Set variables for each of the values
    analogX = int(line[0])
    analogY = int(line[1])
    zButton = int(line[5])

    threshold=25

    # If the analog stick is pushed past the threshold, press the matching arrow key
    if((analogY-midAnalogY)>threshold):
        press('up_arrow')
    elif((analogY-midAnalogY)<-threshold):
        press('down_arrow')

    if((analogX-midAnalogX)>threshold):
        press('right_arrow')
    elif((analogX-midAnalogX)<-threshold):
        press('left_arrow')

# After the program is over, close the serial port connection
ser.close()

To summarize, the overall process looks something like this:

  1. Connect the Wii Nunchuk to the Arduino as shown in the Makezine article.  Make sure you wire the Nunchuk to your SDA and SCL pins – these might be different from what’s shown in the article depending on what Arduino model you have.
  2. Connect the Arduino to your computer through USB and upload Bianconi’s Arduino sketches.  Take note of what baud rate you’re using.
  3. Save the python code (shown above) to your local machine.  Update the baud rate as needed – make sure it’s the same as what is listed in your Arduino code.  Make sure you have all python library dependencies installed.
  4. Open a terminal.  cd to whatever directory you saved the python code to.  Run the .py file using Python 2.  If you attempt to run it with Python 3, it won’t work, since the script relies on Python 2’s string module functions.
  5. Open Superflight and have fun.

A few closing thoughts:

  • The overall setup is still a little bit unstable.  The python code seems to crash after a few minutes.  A few parameters to troubleshoot with are the threshold variable, the sleep duration, and the baud rate.
  • A definite improvement would be to make the Nunchuk trigger buttons work in the menu for a more complete controller.  But the keyboard still works.
  • Another big improvement would be to re-write the python code to use a variable rate of virtual button-pressing based on how far from the origin the controller is (see the sketch after this list).
  • For a more long-term hardware design, we could design & 3D-print an enclosure that houses an ATtiny which runs the Arduino code.  From the outside, it would just look like a Nunchuk-to-USB cable.
  • Maybe it would have been more interesting to use the Wii Gyro data instead of the joystick?
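
Here’s a rough sketch of that variable-rate idea (the helper name and scaling constants are hypothetical, not part of the original controller code):

#Hypothetical helper: map how far the stick is pushed to a number of key presses
def presses_for_displacement(displacement, threshold=25, max_displacement=100, max_presses=5):
    magnitude = abs(displacement)
    if magnitude <= threshold:
        return 0
    #Scale linearly from 1 press at the threshold to max_presses at full deflection
    fraction = min(float(magnitude - threshold) / (max_displacement - threshold), 1.0)
    return 1 + int(fraction * (max_presses - 1))

#Example usage inside the main loop:
#for _ in range(presses_for_displacement(analogX - midAnalogX)):
#    press('right_arrow')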


Web Scraping for Engineers

Scrape 3D models from McMaster-Carr with Python & Selenium.

Here is a script for fetching 3D models from McMaster-Carr using Selenium.

Make sure Chromedriver is in the same directory as your .py file.  The 3D models will be downloaded to your default Downloads directory.

# -*- coding: utf-8 -*-
"""
Scrape 3D models from McMaster-Carr.

Requirements: Chromedriver.exe is in the same folder as this script.
"""
from selenium import webdriver
import time

test_part_numbers=['98173A200', '7529K105', '93250A440']

def fetch_model(part_numbers, delay=3):
    if type(part_numbers) is str:
        part_numbers=[part_numbers]
    
    #Start browser
    options = webdriver.ChromeOptions()
    driver = webdriver.Chrome(chrome_options=options)
    
    #For each part number
    for part_number in part_numbers:
        driver.get('https://www.mcmaster.com/#' + str(part_number) + '/')
        #Pause for page to load
        time.sleep(delay)    
        #Find and Click submit button
        try:
            try:
                submit_button=driver.find_element_by_class_name("button-save--sidebyside")
            except:
                submit_button=driver.find_element_by_class_name("button-save--stacked")
            finally:
                submit_button.click()
        except:
            print('No button found or other error occurred')
        finally:
            time.sleep(delay)
            
    driver.close()
    driver.quit()
    
fetch_model(test_part_numbers)

Find and Zip All .py Files

Create a Zip folder including all .py files on your computer.

This is a program for finding all .py files in a directory and zipping them into a single folder.  We can substitute any file extension (.doc, .xlsx, .dxf, etc.) that we’d like to search for, but in our example we’ll use .py

To find all of the files in a given directory, we’ll import and use functionality from the FolderHierarchy script.  Be sure to save this file in the same directory as the script below.

Next, we’ll filter our FolderHierarchy results to include just the .py files.  Then we simply loop through the filtered DataFrame and add each file to a zip folder using the zipfile module.

# -*- coding: utf-8 -*-
"""
Purpose: Copy all files of a given filetype within a folder, zip them,
         and save.

Input: 1. A folderpath to search in.
       2. File type/File Extension to search for.
Output: A zip file containing all of the files with the specified file extension
        that were found in the specified input folder.
"""
import pandas as pd
import zipfile
import FolderHierarchy

#Define what filetype to search for
extension=".py"

#The 'FolderHierarchy' script (saved in the same directory) runs on import
results=FolderHierarchy.all_levels

#Filtering for just files with the defined file extension
found_files=pd.concat([results[level][results[level]['Path'].str.contains(extension,
                                  regex=False)] for level in results])

#Copy and zip all of the files found.
new_zip=zipfile.ZipFile('all_'+extension+ '_files'+'.zip',mode='w')

#Writing to zip (https://pymotw.com/2/zipfile/)
for file in found_files['Path']:
    new_zip.write(file,arcname=file.split('\\')[-1])
new_zip.close()

print('Found '+str(len(found_files))+' '+extension+' files.')

Generating Math Tests with Python

Auto-Generate Unique Tests

This is a script for generating a bunch of unique math tests from a “Test Template” and a spreadsheet containing test inputs and problem solutions.

In our Test Template we set the layout of our test and define our test problems. Our test problems will have variable placeholders (TestID, QuestionID, VarA, etc.) that we will replace with data from our “Test Data” spreadsheet.

In our Excel file, we randomly generate values for the A, B, and C variables (using the =RANDBETWEEN() function) and clearly identify which Question, Equation, and Test ID they correspond to. We also calculate solutions in the Excel file using the input data and the equation listed for each entry.

Next, we can run our script. It depends on the docx (Note: pip install python-docx), docx2txt, re, pandas, and tkinter libraries.  Dialogs will pop up prompting you for the Test Template and Test Data files.

"""
Creates unique test documents with data
taken from a DataFrame (which is populated from an excel file).

Input: Test Template (Word Document).  Test Data (Excel File)
Output: 20 Unique Tests (Test Data)
"""
#Import modules
import docx
import docx2txt
import pandas as pd
import re
from tkinter import Tk
from tkinter import filedialog

Tk().withdraw()

#Define "Test" template
template_file=filedialog.askopenfilename(title="Please select Word template")
testdata_file=filedialog.askopenfilename(title="Please select Test Data spreadsheet")

#Read file data
template_text=docx2txt.process(template_file)
testdata=pd.read_excel(testdata_file)

#Produce 20 unique tests
for i in range(20):
    new_text=template_text
    #Add data for 10 unique questions
    for j in range(10):
        #Define replacement dictionary
        #http://stackoverflow.com/questions/6116978/python-replace-multiple-strings
        rep={'QuestionID':str(testdata['Question'][i+j*20]),
             'VarA':str(testdata['VarA'][i+j*20]),
             'VarB':str(testdata['VarB'][i+j*20]),
             'VarC':str(testdata['VarC'][i+j*20])}
        rep=dict((re.escape(k),v) for k, v in rep.items())
        pattern=re.compile("|".join(rep.keys()))
    
        if j==0:
            new_text=pattern.sub(lambda m: rep[re.escape(m.group(0))],template_text,count=4)
            new_text=new_text.replace('TestID','Test #' + str(i+1))
        else:
            new_text=pattern.sub(lambda m: rep[re.escape(m.group(0))],new_text,count=4)
            
    #Create and save new test document
    test_doc=docx.Document()
    test_doc.add_paragraph(new_text)
    test_doc.save(r'C:\Users\Craig\Documents\Python Scripts\Test #'+str(i+1)+'.docx')

After the files have been selected, the script reads the Test Template text and loads the Test Data into a DataFrame. We then loop through the Test Data and produce 20 unique test documents by substituting the placeholder variables with values from the Test Data spreadsheet. Each test document is clearly labeled and we can use our original Test Data as our answer key.

Thanks to Andrew Clark for his code for replacing multiple text strings.

Reading & Writing Excel Data with Python

Using pandas to read/write data in Excel.

In this post we’re going to explore how easy it is to read and write data in Excel using Python.  There are a few different ways to do this; we’re going to use pandas.  The pandas DataFrame is the main data structure that we’ll be working with.

Reading

The sample Excel data we’ll be using is available on Tableau’s Community page.

To load a single sheet of the Excel file into Python, we’ll use the read_excel function:

import pandas as pd
sales_data=pd.read_excel(r'C:\Users\Craig\Downloads\Sample - Superstore Sales (Excel).xls')

This loads one tab of the spreadsheet (.xls, .xlsx, or .xlsm) into a DataFrame.

In fact, if we don’t want to download the Excel file locally, we can load it into Python directly from the URL:

sales_data_fromURL=pd.read_excel('https://community.tableau.com/servlet/JiveServlet/downloadBody/1236-102-1-1149/Sample%20-%20Superstore%20Sales%20(Excel).xls')

Note that we can load specific sheets (sheetname), grab specific columns (parse_cols), and handle N/A values (na_values) by using the optional keyword arguments.
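
For example (a sketch using the same older pandas keyword arguments as this post; the sheet name and column range are just assumptions):

sales_subset=pd.read_excel(r'C:\Users\Craig\Downloads\Sample - Superstore Sales (Excel).xls',
                           sheetname='Orders',   #load a specific sheet (assumed name)
                           parse_cols='A:D',     #grab only columns A through D
                           na_values=['NA'])     #treat these strings as N/A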

To load all of the sheets/tabs within an Excel file into Python, we can set sheetname=None:

sales_data_all=pd.read_excel(r'C:\Users\Craig\Downloads\Sample - Superstore Sales (Excel).xls', sheetname=None)

This will return a dictionary of DataFrames – one for each sheet.
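
Individual sheets can then be pulled out of the dictionary by name (again, the sheet name here is just an assumption):

orders=sales_data_all['Orders']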

Writing

Writing existing Python data to an Excel file is just as straightforward.  If our data is already a DataFrame, we can call its to_excel('filename.xlsx') method.  If not, we can just convert the data into a DataFrame and then call to_excel.

import pandas as pd
import numpy as np
df=pd.DataFrame(np.random.randn(50,50))
df.to_excel('MyDataFrame.xlsx')

This will work for .xls, .xlsx, and .xlsm files.  Pandas also offers writer functions such as to_csv, to_sql, to_html, and a few others.
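
For example, writing the same DataFrame out to a CSV file is a one-liner:

df.to_csv('MyDataFrame.csv')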

To write data on multiple sheets, we can use the pd.ExcelWriter function as shown in the pandas documentation:

with pd.ExcelWriter('filename.xlsx') as writer:
    df1.to_excel(writer, sheet_name='Sheet1')
    df2.to_excel(writer, sheet_name='Sheet2')

Quick Data Grabs

Try experimenting with the

pd.read_clipboard()
df.to_clipboard()

functions to quickly transfer data from Excel to Python and vice versa.  (Note that to_clipboard is a DataFrame method, not a top-level pandas function.)

Thank you, pandas, for creating and maintaining excellent documentation.

Finding Words with PyPDF2

Find all instances of words in a PDF with Python’s PyPDF2 library.

This is a script for finding all instances of a given search word (or multiple search words) in a PDF.

For our example, we’ll be using a PDF of Romeo and Juliet.  In this case, our search terms are “Romeo” and “Juliet” (search is not case-sensitive).

import PyPDF2
import re

pdfFileObj=open(r'C:\Users\Craig\RomeoAndJuliet.pdf',mode='rb')
pdfReader=PyPDF2.PdfFileReader(pdfFileObj)
number_of_pages=pdfReader.numPages

pages_text=[]
words_start_pos={}
words={}

searchwords=['romeo','juliet']

with open('FoundWordsList.csv', 'w') as f:
    f.write('{0},{1}\n'.format("Sheet Number", "Search Word"))
    for word in searchwords:
        for page in range(number_of_pages):
            print(page)
            pages_text.append(pdfReader.getPage(page).extractText())
            words_start_pos[page]=[dwg.start() for dwg in re.finditer(word, pages_text[page].lower())]
            words[page]=[pages_text[page][value:value+len(word)] for value in words_start_pos[page]]
        for page in words:
            for i in range(0,len(words[page])):
               if str(words[page][i]) != 'nan':
                    f.write('{0},{1}\n'.format(page+1, words[page][i]))
                    print(page, words[page][i])

We run the script and get a CSV output (FoundWordsList.csv) listing each instance of each search word and the associated PDF page number.

This script can be used for a variety of other applications by updating the file path (line 4) and the search terms (line 12).

A few ideas for modification include:

  • Frequency counts of words in books/lyrics (ATS has an awesome frequency count graph generator)
  • Finding reference drawing numbers in a document
  • Identifying search terms by prefixes rather than whole words
  • Identifying sheets that need to be updated
  • Using glob to iterate through multiple files (see the sketch after this list)
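
For instance, the glob idea might look something like this (a sketch only; find_words is a hypothetical wrapper around the script above, and the folder path is made up):

import glob

searchwords=['romeo','juliet']
for pdf_path in glob.glob(r'C:\Users\Craig\PDFs\*.pdf'):
    find_words(pdf_path, searchwords)  #hypothetical wrapper around the PyPDF2 logic above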

How else would you modify this script?  Let me know!

Thanks for reading!

Special thanks to these sources:

Automate the Boring Stuff with Python
ritesh_shrv on Stack Overflow

Automated Email with Python

Automated email notifications and task tracking system.

This article explains how to use a Python script in conjunction with a simple Action Tracking spreadsheet to create an automated email notification and task-tracking system for your team.

To begin, let’s setup our “ActionTracker” spreadsheet as shown below:

[Image: ActionTracker spreadsheet overview]

We can use the expression =IF(ISBLANK(H2)=TRUE,"Active","Closed") in our Status column to flag whether a date has been entered in the “Completion Date” column.  This will help our script later on.

The “Days Open” column can be calculated using “=IF(ISBLANK(H2)=FALSE,H2-F2,TODAY()-F2)”.  As your list grows, be sure to drag down your formulas.

It can be helpful to apply conditional formatting here in order to see which items are “Open” and late, so that we know which items we expect to send notifications about.  This can be accomplished by the expression shown below, but it is not a necessary step.  Again, remember to update your applicable range as your list grows.

[Image: conditional formatting rule]

On our “Email” tab, we’ll list our unique assignees by name and add their email addresses (separated by a comma and a space) in column B.

[Image: Email tab with assignee names and addresses]

In order to minimize errors, we can apply a Data Validation rule to the “Assignee” column on the “ActionList” tab.  We’ll select all of the unique names on our “Email” tab as the Source validation criteria.  New emails can easily be added to this list; however, we must then update our Source range.

[Image: Data Validation setup]

Here’s a download link for the ActionTracker template.

Next, we’ll use the following Python script to send automated email notifications to our team for any actions that have been open for more than three days.  The three-day threshold can easily be adjusted in line #50 of the script below.

Note: In order to allow the script to access your gmail account, make sure that your less secure app access settings are currently turned on.

import smtplib
import pandas as pd
import sys
from tkinter import Tk
from tkinter import filedialog
from email.mime.text import MIMEText
from email.mime.multipart import MIMEMultipart

#Define email login information.
from_email="_____@gmail.com" #Replace with your email address.  Must be gmail.
pw="_____" #Replace with gmail password.

#Select file to use.
Tk().withdraw()
filepath=filedialog.askopenfilename(defaultextension='*.xlsx',
                                    filetypes=[('.xlsx files','*.xlsx'),
                                               ('All files','*.*')],
                                    initialdir=r'C:\Users')
if filepath=="." or filepath=="":
    sys.exit(0)

#Import ActionTracker
ActionTracker = pd.DataFrame(pd.read_excel(filepath,sheetname='ActionList',
                                           parse_cols='A:E'))
ActionTracker_maxrow=len(ActionTracker)
status=ActionTracker.iloc[:,0]
LineItem=ActionTracker.iloc[:,1]
action=ActionTracker.iloc[:,2]
person=ActionTracker.iloc[:,3]
DaysOpen=ActionTracker.iloc[:,4]

#Import Email Addresses by name
EmailList=pd.DataFrame(pd.read_excel(filepath,sheetname='Email',index_col=0,
                                     parse_cols='A:B'))

#Establish connection to gmail server, login.
server = smtplib.SMTP('smtp.gmail.com',587)
server.starttls()
server.login(from_email, pw)

msg=MIMEMultipart('alternative')
msg.set_charset('utf8')
msg['FROM']=from_email

#Initialize a list of submittals with invalid email addresses
invalid_addresses=[]

#Send emails to late Action Tracker assignees
for i in range(0,ActionTracker_maxrow):
    if status[i]=='Active' and DaysOpen[i]>3:
        print('Active Line Item #'+str(LineItem[i])+': '+person[i])
        msg=MIMEText("Action Tracker Line Item #" + str(LineItem[i]) + " has been open for " +
                     str(DaysOpen[i]) + " days.\n\n" + str(action[i]) +
                     "\n\nPlease take action.",_charset="UTF-8")
        msg['Subject']="Open Action #" + str(LineItem[i])
        msg['TO']=str(EmailList.iloc[EmailList.index.get_loc(person[i]),0])
        try:
            server.sendmail(from_email, msg['TO'].split(","),
                            msg.as_string())
        except smtplib.SMTPRecipientsRefused:
            invalid_addresses.append(LineItem[i])
            print('Line Item #' + str(LineItem[i]) + ' has an invalid email address.')

if len(invalid_addresses) != 0:
    for i in range(0,len(invalid_addresses)):
        invalid_addresses[i]=invalid_addresses[i].strip('\xa0')
    try:
        if len(invalid_addresses)==1:
            msg=MIMEText(str(invalid_addresses) +
            " has an invalid email address associated with the responsible party.",
            _charset="UTF-8")
        else:
            msg=MIMEText(str(invalid_addresses) +
                             " have invalid email addresses associated with the responsible parties.",
                             _charset="UTF-8")
        msg['Subject']='Invalid Email Addresses'
        msg['TO']=str(from_email)
        server.sendmail(from_email, msg['TO'].split(","),
                        msg.as_string())
    except smtplib.SMTPRecipientsRefused:
        print('Invalid Email Address notification email failed.')

server.quit()


And that’s it.  Full automation can be achieved by hard-coding the file location and using Windows Task Scheduler to execute the Python script.
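
For the scheduled version, the tkinter file-dialog block can simply be replaced with a fixed path (the location below is just an example):

filepath=r'C:\Users\Craig\Documents\ActionTracker.xlsx'  #example hard-coded location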

Finding Correlations

Script for normalizing and finding correlations across variables in a numeric dataset.  Data can be analyzed as a whole or split into n subsets.  When split, normalizations are calculated and correlations are found separately for each subset.

Input is read from a .csv file with any number of columns (as shown below).  Each column must have the same number of samples.  The script assumes there are headers in the first row.

[Image: sample input CSV]
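
Within each subset, every variable is min-max normalized, i.e. scaled to the 0-1 range using that subset’s own minimum and maximum.  A standalone sketch of the same formula the script applies:

def min_max_normalize(values):
    '''Scale a list of numbers to the 0-1 range (assumes the values are not all identical).'''
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]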

import numpy as np

#Divides a list (or np.array) into N equal parts.
#http://stackoverflow.com/questions/4119070/how-to-divide-a-list-into-n-equal-parts-python
def slice_list(input, size):
    input_size = len(input)
    slice_size = input_size // size
    remain = input_size % size
    result = []
    iterator = iter(input)
    for i in range(size):
        result.append([])
        for j in range(slice_size):
            result[i].append(iterator.__next__())
        if remain:
            result[i].append(iterator.__next__())
            remain -= 1
    return result

#Functions below are from Data Science From Scratch by Joel Grus
def mean(x):
    return sum(x)/len(x)

def de_mean(x):
    x_bar=mean(x)
    return [x_i-x_bar for x_i in x]

def dot(v,w):
    return sum(v_i*w_i for v_i, w_i in zip(v,w))

def sum_of_squares(v):
    return dot(v,v)

def variance(x):
    n=len(x)
    deviations=de_mean(x)
    return sum_of_squares(deviations)/(n-1)

def standard_deviation(x):
    return np.sqrt(variance(x))  

def covariance(x,y):
    n=len(x)
    return dot(de_mean(x),de_mean(y))/(n-1)

def correlation(x,y):
    stdev_x=standard_deviation(x)
    stdev_y=standard_deviation(y)
    if stdev_x >0 and stdev_y>0:
        return covariance(x,y)/stdev_x/stdev_y
    else:
        return 0

#Read data from CSV
input_data=np.array(np.genfromtxt(r'C:\Users\Craig\Documents\GitHub\normalized\VariableTimeIntervalInput.csv',delimiter=",",skip_header=1))
var_headers=np.genfromtxt(r'C:\Users\Craig\Documents\GitHub\normalized\VariableTimeIntervalInput.csv',delimiter=",",dtype=str,max_rows=1)

#Determine number of samples & variables
number_of_samples=len(input_data[0:,0])
number_of_allvars=len(input_data[0,0:])

#Define number of samples (and start/end points) in full time interval
full_sample=number_of_samples
full_sample_start=0
full_sample_end=number_of_samples

#Define number of intervals to split data into
n=2
dvar_sublists={}
max_sublists=np.zeros((number_of_allvars,n))
min_sublists=np.zeros((number_of_allvars,n))
subnorm_test=np.zeros((full_sample_end, number_of_allvars+1))

#Slice variable lists
for dvar in range(0,number_of_allvars):
    dvar_sublists[dvar]=slice_list(input_data[:,dvar],n)
    for sublist in range(0,n):
        max_sublists[dvar,sublist]=np.max(dvar_sublists[dvar][sublist])
        min_sublists[dvar,sublist]=np.min(dvar_sublists[dvar][sublist])

var_interval_sublists=max_sublists-min_sublists

#Normalize each sublist.
for var in range(0, number_of_allvars):
    x_count=0
    for n_i in range(0,n):
        sublength=len(dvar_sublists[var][n_i])
        for x in range(0,sublength):
            subnorm_test[x_count,var]=(dvar_sublists[var][n_i][x]-min_sublists[var,n_i])/var_interval_sublists[var,n_i]
            subnorm_test[x_count,number_of_allvars]=n_i  #store the sublist index in the last column
            x_count+=1

var_sub_correlation=np.zeros((n,number_of_allvars,number_of_allvars),float)

#Check for correlation between each pair of variables within each sublist
for n_i in range(0,n):
    #Row offset of sublist n_i within the normalized array
    start=sum(len(dvar_sublists[0][k]) for k in range(n_i))
    end=start+len(dvar_sublists[0][n_i])
    for i in range(0,number_of_allvars):
        for j in range(0,number_of_allvars):
            var_sub_correlation[n_i,i,j]=correlation(subnorm_test[start:end,i],subnorm_test[start:end,j])

#Writes to CSV
np.savetxt(r'C:\Users\Craig\Documents\GitHub\normalized\sublists_normalized.csv',subnorm_test, delimiter=",") 

print(var_sub_correlation, 'variable correlation matrix')