## Finding Correlations

Script for normalizing and finding correlations across variables in a numeric dataset.  Data can be analyzed as a whole or split into n subsets.  When split, each subset is normalized separately (min-max scaling to the range 0 to 1) and correlations are computed within each subset.

Input is read from a .csv file with any number of columns (as shown below).  Each column must have the same number of samples.  Script assumes there are headers in the first row.
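The script below refers to an `input_data` array. A minimal sketch of how that array might be loaded, assuming a comma-delimited, all-numeric file with a single header row; the helper name `load_numeric_csv` and the in-memory sample are hypothetical and not part of the original script:

```python
import io

import numpy as np


def load_numeric_csv(source):
    """Load a headered, all-numeric CSV into a 2-D float array (rows = samples)."""
    # skip_header=1 drops the header row; cells that cannot be parsed become nan.
    return np.genfromtxt(source, delimiter=",", skip_header=1)


# Illustrative in-memory file standing in for a real .csv path.
sample_csv = io.StringIO("a,b,c\n1,10,5\n2,20,3\n3,30,8\n4,40,1\n")
input_data = load_numeric_csv(sample_csv)
print(input_data.shape)
```

Passing a file path instead of the `StringIO` object should work the same way, since `np.genfromtxt` accepts either.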

```python
import numpy as np

# Divides a list (or np.array) into `size` nearly equal parts.
# http://stackoverflow.com/questions/4119070/how-to-divide-a-list-into-n-equal-parts-python
def slice_list(values, size):
    input_size = len(values)
    slice_size = input_size // size
    remain = input_size % size
    result = []
    iterator = iter(values)
    for i in range(size):
        result.append([])
        for j in range(slice_size):
            result[i].append(next(iterator))
        # Hand out leftover items one apiece to the earliest slices.
        if remain:
            result[i].append(next(iterator))
            remain -= 1
    return result

# Functions below are from Data Science From Scratch by Joel Grus
def mean(x):
    return sum(x) / len(x)

def de_mean(x):
    # Shift x so that its mean is zero.
    x_bar = mean(x)
    return [x_i - x_bar for x_i in x]

def dot(v, w):
    return sum(v_i * w_i for v_i, w_i in zip(v, w))

def sum_of_squares(v):
    return dot(v, v)

def variance(x):
    # Sample variance (divides by n - 1).
    n = len(x)
    deviations = de_mean(x)
    return sum_of_squares(deviations) / (n - 1)

def standard_deviation(x):
    return np.sqrt(variance(x))

def covariance(x, y):
    n = len(x)
    return dot(de_mean(x), de_mean(y)) / (n - 1)

def correlation(x, y):
    stdev_x = standard_deviation(x)
    stdev_y = standard_deviation(y)
    if stdev_x > 0 and stdev_y > 0:
        return covariance(x, y) / stdev_x / stdev_y
    else:
        # No variation in at least one variable, so report zero correlation.
        return 0

# Determine number of samples & variables.
# `input_data` must already hold the .csv contents as a 2-D NumPy array
# (rows = samples, columns = variables, header row excluded).
number_of_samples, number_of_allvars = input_data.shape

# Define number of samples (and start/end points) in full time interval
full_sample = number_of_samples
full_sample_start = 0
full_sample_end = number_of_samples

# Define number of intervals to split data into
n = 2
dvar_sublists = {}
max_sublists = np.zeros((number_of_allvars, n))
min_sublists = np.zeros((number_of_allvars, n))
# One column per variable, plus a final column that records the subset index.
subnorm_test = np.zeros((full_sample_end, number_of_allvars + 1))

# Slice variable lists and record each sublist's max and min
for dvar in range(number_of_allvars):
    dvar_sublists[dvar] = slice_list(input_data[:, dvar], n)
    for sublist in range(n):
        max_sublists[dvar, sublist] = np.max(dvar_sublists[dvar][sublist])
        min_sublists[dvar, sublist] = np.min(dvar_sublists[dvar][sublist])

# Range (max - min) of each variable within each sublist
var_interval_sublists = max_sublists - min_sublists

# Normalize each sublist (min-max scaling to [0, 1]).
for var in range(number_of_allvars):
    x_count = 0
    for n_i in range(n):
        sublength = len(dvar_sublists[var][n_i])
        for x in range(sublength):
            subnorm_test[x_count, var] = (dvar_sublists[var][n_i][x] - min_sublists[var, n_i]) / var_interval_sublists[var, n_i]
            # The last column records which subset the row belongs to.
            subnorm_test[x_count, number_of_allvars] = n_i
            x_count += 1

var_sub_correlation = np.zeros((n, number_of_allvars, number_of_allvars), float)

# Check for correlation between each pair of variables, one subset at a time.
start = 0
for n_i in range(n):
    # Every column has the same sample count, so subset n_i covers the same rows for all variables.
    end = start + len(dvar_sublists[0][n_i])
    for i in range(number_of_allvars):
        for j in range(number_of_allvars):
            var_sub_correlation[n_i, i, j] = correlation(subnorm_test[start:end, i], subnorm_test[start:end, j])
    start = end

#Writes to CSV
np.savetxt(r'C:\Users\Craig\Documents\GitHub\normalized\sublists_normalized.csv',subnorm_test, delimiter=",")

print(var_sub_correlation, 'variable correlation matrix')
```
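As a cross-check on the loops above, the same per-subset min-max scaling and correlation can be sketched with NumPy built-ins. This is an illustrative alternative, not the original author's method: the function name `subset_correlations` and the random demo data are hypothetical, and it relies on `np.array_split` giving the extra rows to the earlier parts, as `slice_list` does. Unlike `correlation` above, `np.corrcoef` yields `nan` rather than 0 for a constant column.

```python
import numpy as np


def subset_correlations(data, n_subsets):
    """Min-max scale each column within each row-wise subset, then correlate columns."""
    n_vars = data.shape[1]
    result = np.zeros((n_subsets, n_vars, n_vars))
    for k, chunk in enumerate(np.array_split(data, n_subsets, axis=0)):
        col_min = chunk.min(axis=0)
        col_range = chunk.max(axis=0) - col_min
        # Broadcasting scales every column to [0, 1]; a constant column would divide by zero.
        scaled = (chunk - col_min) / col_range
        # rowvar=False treats columns as variables, matching the script's layout.
        result[k] = np.corrcoef(scaled, rowvar=False)
    return result


# Hypothetical demo data: 10 samples of 3 variables.
rng = np.random.default_rng(0)
demo = rng.normal(size=(10, 3))
corr = subset_correlations(demo, 2)
print(corr.shape)
```

Because Pearson correlation is unchanged by a positive linear rescaling, the matrices computed from the normalized values should match those computed from the raw values of each subset.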

## Data Tools Overview

Data Visualization, Business Intelligence, and Data Science Tools

All descriptions and information shown below were harvested from official software websites, GitHub, Wikipedia, DataCamp, and other websites as listed.

1. Tableau
• Data Visualization/Dashboarding Tool
• Makes it easy to quickly create graphs, add trendlines, and apply slicers/filters.
• Tableau Public (Free) for Open Community/Non-Commercial use.
  • Connect to a Server
    • OData
    • Web Data Connector
      • Collect data from “virtually any site that publishes data in JSON, XML, or HTML”
      • Use Tableau Web Data Connector Software Development Kit (SDK) to build connectors using JavaScript and HTML.
        • SDK includes templates, docs, examples
  • Connect to a File
    • Excel
    • Text
    • Access
    • Statistical Files
      • SAS, SPSS, R
• Tableau Desktop (\$) Professional
  • Connects to pretty much any data source.
2. Hadoop
• Open-source software framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware (Wikipedia)
• “High Volume Data Flows > MapReduce Process > Consume Results”
• Can scale very well from a single server to thousands of machines “each offering local computation and storage” (e-commerce, mobile data type scale)
• Examples of Use (Gigaom)
  • Satellite Image Processing
  • Fraud Detection
  • IT Security – “Identify malware and cyber-attack patterns”
• MapReduce
  • “programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster.” (Wikipedia)
  • “Distributed Processing Framework”
• Pig
  • “A high-level data-flow language and execution framework for parallel computation” (From Hadoop site)
  • “high-level platform for creating MapReduce programs with Hadoop” (Wikipedia)
  • Easily program parallel analysis and more complex data flow sequences (Paraphrased from Hadoop site)
  • “Scripting”
3. Python (Free) (library selection from DataCamp)
• scikit-learn
  • “Machine Learning in Python” (http://scikit-learn.org/stable/)
  • “Simple and efficient tools for data mining and data analysis”
  • Performs:
    • Classification
    • Regression
    • Clustering
    • Dimensionality reduction
    • Model Selection
    • Preprocessing
• NumPy
  • Fundamental package for scientific computing with Python. It contains among other things: (http://www.numpy.org/)
    • a powerful N-dimensional array object
    • tools for integrating C/C++ and Fortran code
    • useful linear algebra, Fourier transform, and random number capabilities
• Pandas
  • High-performance, easy-to-use data structures and data analysis tools for the Python programming language. (http://pandas.pydata.org/)
• SciPy
  • Collection of numerical algorithms and domain-specific toolboxes, including signal processing, optimization, statistics and much more. (www.scipy.org)
• Matplotlib
• Statsmodels
  • Explore data, estimate statistical models, and perform statistical tests. An extensive list of descriptive statistics, statistical tests, plotting functions, and result statistics are available for different types of data and each estimator. (statsmodels.sourceforge.net)
    • Linear regression models
    • Generalized linear models
    • Discrete choice models
    • Robust linear models
    • Many models and functions for time series analysis
    • Nonparametric estimators
    • A collection of datasets for examples
    • A wide range of statistical tests
    • Input-output tools for producing tables in a number of formats (Text, LaTeX, HTML) and for reading Stata files into NumPy and Pandas.
    • Plotting functions
    • Extensive unit tests to ensure correctness of results
• Seaborn
  • Library for making attractive and informative statistical graphics in Python. It is built on top of matplotlib and tightly integrated with the PyData stack, including support for numpy and pandas data structures and statistical routines from scipy and statsmodels. (pypi.python.org/pypi/seaborn)
    • Several built-in themes that improve on the default matplotlib aesthetics
    • Tools for choosing color palettes to make beautiful plots that reveal patterns in your data
    • Functions for visualizing univariate and bivariate distributions or for comparing them between subsets of data
    • Tools that fit and visualize linear regression models for different kinds of independent and dependent variables
    • Functions that visualize matrices of data and use clustering algorithms to discover structure in those matrices
    • A function to plot statistical timeseries data with flexible estimation and representation of uncertainty around the estimate
    • High-level abstractions for structuring grids of plots that let you easily build complex visualizations
• Bokeh
  • Bokeh is a Python interactive visualization library that targets modern web browsers for presentation. Its goal is to provide elegant, concise construction of novel graphics in the style of D3.js, and to extend this capability with high-performance interactivity over very large or streaming datasets. Bokeh can help anyone who would like to quickly and easily create interactive plots, dashboards, and data applications. (http://bokeh.pydata.org/en/latest/)
• Pygal
• SciPy Stack (https://www.scipy.org/stackspec.html)
  • Python (2.x >= 2.6 or 3.x >= 3.2)
  • NumPy (>= 1.6)
  • SciPy library (>= 0.10)
  • Matplotlib (>= 1.1)
    • dateutil
    • pytz
    • Support for at least one backend
  • IPython (>= 0.13)
    • pyzmq
  • pandas (>= 0.8)
  • Sympy (>= 0.7)
  • nose (>= 1.1)
4. R (Free) (library selection from DataCamp)
• data.table
  • Extension of data.frame
  • Fast aggregation of large data (e.g. 100GB in RAM), fast ordered joins, fast add/modify/delete of columns by group using no copies at all, list columns and a fast file reader (fread). Offers a natural and flexible syntax, for faster development. (cran.r-project.org)
• dplyr
  • A Grammar of Data Manipulation
  • A fast, consistent tool for working with data frame like objects, both in memory and out of memory. (cran.r-project.org)
• plyr
  • Tools for Splitting, Applying and Combining Data
  • A set of tools that solves a common set of problems: you need to break a big problem down into manageable pieces, operate on each piece and then put all the pieces back together. For example, you might want to fit a model to each spatial location or time point in your study, summarize data by panels or collapse high-dimensional arrays to simpler summary statistics. (cran.r-project.org)
• stringr
  • Simple, Consistent Wrappers for Common String Operations
  • A consistent, simple and easy to use set of wrappers around the fantastic ‘stringi’ package. All function and argument names (and positions) are consistent, all functions deal with “NA”’s and zero length vectors in the same way, and the output from one function is easy to feed into the input of another. (cran.r-project.org)
• zoo
  • S3 Infrastructure for Regular and Irregular Time Series (Z’s Ordered Observations)
  • An S3 class with methods for totally ordered indexed observations. It is particularly aimed at irregular time series of numeric vectors/matrices and factors. zoo’s key design goals are independence of a particular index/date/time class and consistency with ts and base R by providing methods to extend standard generics. (cran.r-project.org)
• ggvis
  • Interactive Grammar of Graphics
  • An implementation of an interactive grammar of graphics, taking the best parts of ‘ggplot2’, combining them with the reactive framework from ‘shiny’ and web graphics from ‘vega’. (cran.r-project.org)
• lattice
  • Trellis Graphics for R
  • A powerful and elegant high-level data visualization system inspired by Trellis graphics, with an emphasis on multivariate data. Lattice is sufficient for typical graphics needs, and is also flexible enough to handle most nonstandard requirements. (cran.r-project.org)
• ggplot2
  • An Implementation of the Grammar of Graphics
  • It combines the advantages of both base and lattice graphics: conditioning and shared axes are handled automatically, and you can still build up a plot step by step from multiple data sources. It also implements a sophisticated multidimensional conditioning system and a consistent interface to map data to aesthetic attributes. (cran.r-project.org)
• caret
  • Classification and Regression Training
  • Misc functions for training and plotting classification and regression models. (cran.r-project.org)
• RevoScaleR
  • Proprietary R package from Revolution Analytics.
  • Revolution Analytics offers a commercial distribution of R.
5. Power BI (Free, \$)
• “suite of business analytics tools to analyze data and share insights.”
• Focused mostly on data visualization
• Highly interactive, dynamic dashboards.
• Connects to a wide range of data sources
• Cross Platform
6. SAS (\$)
• “Statistical Analysis Software is a software suite developed by SAS Institute for advanced analytics, multivariate analyses, business intelligence, data management, and predictive analytics” (Wikipedia)
• SAS University Edition (Free)
  • Local
  • AWS
• For Enterprise (\$)
7. Google Analytics
• “Google Analytics helps you analyze visitor traffic and paint a complete picture of your audience and their needs. Track the routes people take to reach you and the devices they use to get there with reporting tools like Traffic Sources. Learn what people are looking for and what they like with In-Page Analytics. Then tailor your marketing and site content for maximum impact.” (google.com/analytics)
• Data Collection & Management
  • Data collection and management with Google Analytics provides a single, accurate view of the customer that can be customized to your needs and shared across the organization.
• Data Consolidation
  • Google delivers integrated solutions that preserve data integrity, reduce friction, and seamlessly connect disparate data sources.
• Data Analytics & Reporting
  • Reports can be segmented and filtered to reflect the needs of your business. Real-time views let you know which new content is popular, how much traffic today’s new promotion is driving to your site, and which tweets and blog posts draw the best results.
• Data Activation
  • Make smarter marketing decisions. Google Analytics allows you to seamlessly activate your data to improve marketing campaigns and experiment with new channels and content.
• Tag Manager (Free)
  • Google Tag Manager lets you launch new tags any time with a few clicks, so you never miss a measurement or marketing opportunity.