Data Science – Clockwork Robotics

Data Visualization, Business Intelligence, and Data Science Tools

All descriptions and information shown below were harvested from official software websites, GitHub, Wikipedia, DataCamp, and other websites as listed.

Tableau
- Data Visualization/Dashboarding Tool
- Very easy to quickly create graphs, add trendlines, and apply slicers/filters.
- “Business Intelligence and Analytics”
- Tableau Public (Free) for Open Community/Non-Commercial use.
  - Connect to a Server
    - OData
    - Web Data Connector
      - Collect data from “virtually any site that publishes data in JSON, XML, or HTML”
      - Use Tableau Web Data Connector Software Development Kit (SDK) to build connectors using JavaScript and HTML.
        - SDK includes templates, docs, examples
  - Connect to a File
    - Excel
    - Text
    - Access
    - Statistical Files
      - SAS, SPSS, R
- Tableau Desktop ($) Professional
- Connects to pretty much any data source.
(Apache) Hadoop (Free)
- Open-source software framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware (Wikipedia)
- “High Volume Data Flows > MapReduce Process > Consume Results”
- http://www.ebizq.net/blogs/enterprise/images/mapreduce_hadoop.png
- Can scale very well from a single server to thousands of machines, “each offering local computation and storage” (e-commerce or mobile-data scale)
- Examples of Use (Gigaom)
  - Satellite Image Processing
  - Fraud Detection
  - IT Security – “Identify malware and cyber-attack patterns”
- Several Hadoop-related projects listed on Hadoop.apache.org
  - MapReduce
    - “programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster.” (Wikipedia)
    - “Distributed Processing Framework”
  - Pig
    - “A high-level data-flow language and execution framework for parallel computation” (From Hadoop site)
    - “high-level platform for creating MapReduce programs with Hadoop” (Wikipedia)
    - Easily program parallel analysis and more complex data flow sequences (Paraphrased from Hadoop site)
    - “Scripting”
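
The MapReduce model quoted above can be illustrated with a word count. This is a minimal single-process sketch in plain Python, not the Hadoop API: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) and the sample documents are hypothetical. On a real cluster, the map and reduce steps would run in parallel across many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: collapse each key's list of values into one result (a sum)."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents):
    """Run map over every document, then shuffle and reduce the pairs."""
    pairs = []
    for document in documents:
        pairs.extend(map_phase(document))
    return reduce_phase(shuffle_phase(pairs))


if __name__ == "__main__":
    # Hypothetical input standing in for a "high volume data flow".
    docs = ["big data big clusters", "data flows in", "big results out"]
    print(word_count(docs))
```

Each phase only sees its own inputs, which is what lets a framework like Hadoop distribute the map calls and the per-key reductions independently.
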
Python (Free) (library selection from DataCamp)
- scikit-learn
  - “Machine Learning in Python” (http://scikit-learn.org/stable/)
  - “Simple and efficient tools for data mining and data analysis”
  - Performs:
    - Classification
    - Regression
    - Clustering
    - Dimensionality reduction
    - Model Selection
    - Preprocessing
- NumPy
  - Fundamental package for scientific computing with Python. It contains, among other things: (http://www.numpy.org/)
    - a powerful N-dimensional array object
    - sophisticated (broadcasting) functions
    - tools for integrating C/C++ and Fortran code
    - useful linear algebra, Fourier transform, and random number capabilities
- Pandas
  - High-performance, easy-to-use data structures and data analysis tools for the Python programming language. (http://pandas.pydata.org/)
- SciPy
  - Collection of numerical algorithms and domain-specific toolboxes, including signal processing, optimization, statistics and much more. (www.scipy.org)
- Matplotlib
  - A mature and popular plotting package that provides publication-quality 2D plotting as well as rudimentary 3D plotting (www.scipy.org, http://matplotlib.org/)
- Statsmodels
  - Explore data, estimate statistical models, and perform statistical tests. An extensive list of descriptive statistics, statistical tests, plotting functions, and result statistics are available for different types of data and each estimator. (statsmodels.sourceforge.net)
    - Linear regression models
    - Generalized linear models
    - Discrete choice models
    - Robust linear models
    - Many models and functions for time series analysis
    - Nonparametric estimators
    - A collection of datasets for examples
    - A wide range of statistical tests
    - Input-output tools for producing tables in a number of formats (Text, LaTeX, HTML) and for reading Stata files into NumPy and Pandas.
    - Plotting functions
    - Extensive unit tests to ensure correctness of results
- Seaborn
  - Library for making attractive and informative statistical graphics in Python. It is built on top of matplotlib and tightly integrated with the PyData stack, including support for numpy and pandas data structures and statistical routines from scipy and statsmodels. (pypi.python.org/pypi/seaborn)
  - Several built-in themes that improve on the default matplotlib aesthetics
  - Tools for choosing color palettes to make beautiful plots that reveal patterns in your data
  - Functions for visualizing univariate and bivariate distributions or for comparing them between subsets of data
  - Tools that fit and visualize linear regression models for different kinds of independent and dependent variables
  - Functions that visualize matrices of data and use clustering algorithms to discover structure in those matrices
  - A function to plot statistical timeseries data with flexible estimation and representation of uncertainty around the estimate
  - High-level abstractions for structuring grids of plots that let you easily build complex visualizations
- Bokeh
  - Bokeh is a Python interactive visualization library that targets modern web browsers for presentation. Its goal is to provide elegant, concise construction of novel graphics in the style of D3.js, and to extend this capability with high-performance interactivity over very large or streaming datasets. Bokeh can help anyone who would like to quickly and easily create interactive plots, dashboards, and data applications. (http://bokeh.pydata.org/en/latest/)
- Pygal
  - Sexy python charting (http://www.pygal.org/en/latest/#)
  - Dynamic SVG charting library (github)
  - Pygal is a Python framework which can be used to programmatically generate graphs (.svg images) on the server-side of a web-application. (http://share.elijahcaine.me/howToPygal.pdf)
- SciPy Stack (https://www.scipy.org/stackspec.html)
  - Python (2.x >= 2.6 or 3.x >= 3.2)
  - NumPy (>= 1.6)
  - SciPy library (>= 0.10)
  - Matplotlib (>= 1.1)
    - dateutil
    - pytz
    - Support for at least one backend
  - IPython (>= 0.13)
    - pyzmq
    - tornado
  - pandas (>= 0.8)
  - Sympy (>= 0.7)
  - nose (>= 1.1)
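
The minimum versions in the SciPy Stack list above can be checked against a local installation. The following is a rough sketch using only the standard library, assuming Python 3.8 or later for `importlib.metadata`. The helper names (`parse_version`, `meets_minimum`, `check_stack`) are hypothetical, the package names are assumed to match their names on PyPI, and the version parsing is deliberately simplified (it reads only the leading numeric components).

```python
import re
from importlib import metadata  # in the standard library since Python 3.8

# Minimum versions copied from the SciPy Stack list above.
STACK_MINIMUMS = {
    "numpy": "1.6",
    "scipy": "0.10",
    "matplotlib": "1.1",
    "ipython": "0.13",
    "pandas": "0.8",
    "sympy": "0.7",
    "nose": "1.1",
}


def parse_version(text):
    """Return the leading numeric components of a version string as ints."""
    match = re.match(r"\d+(?:\.\d+)*", text or "")
    if match is None:
        return ()
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(installed, minimum):
    """True if the installed version is at least the minimum version."""
    have, need = parse_version(installed), parse_version(minimum)
    width = max(len(have), len(need))
    # Pad with zeros so that "1" and "1.0" compare as equal.
    have += (0,) * (width - len(have))
    need += (0,) * (width - len(need))
    return have >= need


def check_stack(minimums=STACK_MINIMUMS):
    """Map each package name to 'ok', 'too old', or 'missing'."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        report[name] = "ok" if meets_minimum(installed, minimum) else "too old"
    return report


if __name__ == "__main__":
    for package, status in check_stack().items():
        print(f"{package}: {status}")
```

Comparing tuples of integers rather than raw strings matters here: as strings, "0.9" sorts after "0.10", which would give the wrong answer for the SciPy library minimum.
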
R (Free) (library selection from DataCamp)
- Data.table
  - Extension of data.frame
  - Fast aggregation of large data (e.g. 100GB in RAM), fast ordered joins, fast add/modify/delete of columns by group using no copies at all, list columns and a fast file reader (fread). Offers a natural and flexible syntax, for faster development. (cran.r-project.org)
- Dplyr
  - A Grammar of Data Manipulation
  - A fast, consistent tool for working with data frame like objects, both in memory and out of memory. (cran.r-project.org)
- Plyr
  - Tools for Splitting, Applying and Combining Data
  - A set of tools that solves a common set of problems: you need to break a big problem down into manageable pieces, operate on each piece and then put all the pieces back together. For example, you might want to fit a model to each spatial location or time point in your study, summarize data by panels or collapse high-dimensional arrays to simpler summary statistics. (cran.r-project.org)
- Stringr
  - Simple, Consistent Wrappers for Common String Operations
  - A consistent, simple and easy to use set of wrappers around the fantastic ‘stringi’ package. All function and argument names (and positions) are consistent, all functions deal with “NA”’s and zero length vectors in the same way, and the output from one function is easy to feed into the input of another. (cran.r-project.org)
- Zoo
  - S3 Infrastructure for Regular and Irregular Time Series (Z’s Ordered Observations)
  - An S3 class with methods for totally ordered indexed observations. It is particularly aimed at irregular time series of numeric vectors/matrices and factors. zoo’s key design goals are independence of a particular index/date/time class and consistency with ts and base R by providing methods to extend standard generics. (cran.r-project.org)
- Ggvis
  - Interactive Grammar of Graphics
  - An implementation of an interactive grammar of graphics, taking the best parts of ‘ggplot2’, combining them with the reactive framework from ‘shiny’ and web graphics from ‘vega’. (cran.r-project.org)
- Lattice
  - Trellis Graphics for R
  - A powerful and elegant high-level data visualization system inspired by Trellis graphics, with an emphasis on multivariate data. Lattice is sufficient for typical graphics needs, and is also flexible enough to handle most nonstandard requirements. (cran.r-project.org)
- Ggplot2
  - An Implementation of the Grammar of Graphics
  - It combines the advantages of both base and lattice graphics: conditioning and shared axes are handled automatically, and you can still build up a plot step by step from multiple data sources. It also implements a sophisticated multidimensional conditioning system and a consistent interface to map data to aesthetic attributes. (cran.r-project.org)
- Caret
  - Classification and Regression Training
  - Misc functions for training and plotting classification and regression models. (cran.r-project.org)
- RevoScaleR
  - Proprietary R package from Revolution Analytics.
  - Revolution Analytics provides a commercial distribution of R.
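
The split-apply-combine strategy described under Plyr above (break a problem into pieces, operate on each piece, put the pieces back together) can be sketched in a few lines. This illustration is written in plain Python with the standard library rather than in R, and it is not the plyr API: the function names and the sample readings are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def split(records, key):
    """Split: bucket the records by the value of one field."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)
    return groups


def apply_to_groups(groups, func):
    """Apply: run the same function on each group independently."""
    return {name: func(rows) for name, rows in groups.items()}


def combine(results, key, value_name):
    """Combine: reassemble per-group results into one table, sorted by group."""
    return [{key: name, value_name: value} for name, value in sorted(results.items())]


def summarize(records, key, field):
    """Mean of `field` for each distinct value of `key`."""
    groups = split(records, key)
    results = apply_to_groups(groups, lambda rows: mean(row[field] for row in rows))
    return combine(results, key, "mean_" + field)


if __name__ == "__main__":
    # Hypothetical data: one summary per spatial location.
    readings = [
        {"site": "north", "temp": 10.0},
        {"site": "south", "temp": 20.0},
        {"site": "north", "temp": 14.0},
        {"site": "south", "temp": 22.0},
    ]
    print(summarize(readings, "site", "temp"))
```

Swapping the lambda for a model-fitting function gives the "fit a model to each spatial location" case mentioned in the Plyr description; the split and combine steps stay the same.
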
Power BI (Free, $)
- “suite of business analytics tools to analyze data and share insights.”
- Appears to focus mostly on data visualization
- Extremely interactive/dynamic dashboards.
- Connects to pretty much everything
- Cross Platform
SAS ($)
- “Statistical Analysis Software is a software suite developed by SAS Institute for advanced analytics, multivariate analyses, business intelligence, data management, and predictive analytics” (Wikipedia)
- SAS University Edition (Free)
- Local
- AWS
Google Analytics
- For Enterprise ($)
- For Small Businesses
- Google Analytics (Free)
  - “Google Analytics helps you analyze visitor traffic and paint a complete picture of your audience and their needs. Track the routes people take to reach you and the devices they use to get there with reporting tools like Traffic Sources. Learn what people are looking for and what they like with In-Page Analytics. Then tailor your marketing and site content for maximum impact.” (google.com/analytics)
- Data Collection & Management
  - Data collection and management with Google Analytics provides a single, accurate view of the customer that can be customized to your needs and shared across the organization.
- Data Consolidation
  - Google delivers integrated solutions that preserve data integrity, reduce friction, and seamlessly connect disparate data sources.
- Data Analytics & Reporting
  - Reports can be segmented and filtered to reflect the needs of your business. Real-time views let you know which new content is popular, how much traffic today’s new promotion is driving to your site, and which tweets and blog posts draw the best results.
- Data Activation
  - Make smarter marketing decisions. Google Analytics allows you to seamlessly activate your data to improve marketing campaigns and experiment with new channels and content.
- Tag Manager (Free)
  - Google Tag Manager lets you launch new tags any time with a few clicks, so you never miss a measurement or marketing opportunity.
- Analytics Academy
  - Learn analytics with free online courses
  - Take lessons from Google measurement experts
  - Join the Google Analytics learning community
  - Test your knowledge
  - For Mobile Apps
- Google Analytics API

General Notes

“Data scientists that use primarily opensource tools earned a higher median salary (130K) than those using proprietary tools (90K).” – DataCamp
Big Data Landscape 2016
For further Data Science reading, I highly recommend Joel Grus’ Data Science from Scratch.
