Data Science is growing. It’s been called the “sexiest job of the 21st century”, and is attracting a flood of new entrants. But what does a data scientist do? And does your company actually need one?
These days we hear a lot about Artificial Neural Networks. Leading companies, from Facebook to Google to Zillow, use them throughout their core products. But what are Neural Networks? And when should you use one?
We analyzed data fragmentation among the Alexa Top 1M domains over the past 3 years. A large fraction used at least one external user or marketing data source, and that fraction is growing at 2.88x year over year.
After several months building on Apache Spark, here are some lessons we learned about the benefits of DataFrames vs. RDDs, and several situations in which the RDD API may still be preferable.
In this post, we review common applications of Machine Learning and the differences between its two subtypes, Supervised and Unsupervised Machine Learning.
Often we want to predict discrete outcomes in our data. Can an email be designated as spam or not spam? Was a transaction fraudulent or valid?
One of the most common questions we ask of our data is how to estimate the value of something. How many items will we sell next month? How much does it cost to produce them? How much revenue will we make over the year?
One of the most common analyses we perform is to look for patterns in data. What market segments can we divide our customers into? How do we find clusters of individuals in a network of users?
Machine Learning can often be a black box. To gain actionable insights, it’s helpful to know how a variable influences a model. Here we outline 5 ways to assess a feature’s importance in affecting the probability of an outcome.
When importing data into your data warehouse, you will almost certainly encounter data quality errors at many steps of the ETL pipeline. How do you catch these errors proactively, and ensure data quality in your data warehouse?
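
As a rough illustration of catching errors proactively, here is a minimal sketch of row-level checks that could run before a load step. The function name `validate_rows`, the field names, and the sample rows are all hypothetical, chosen only for this example; a real pipeline would likely have many more rules.

```python
def validate_rows(rows, required, unique_key):
    """Return a list of (row_index, problem) tuples for rows that fail basic checks."""
    problems = []
    seen = set()
    for i, row in enumerate(rows):
        # Flag required fields that are absent, None, or empty strings.
        for field in required:
            if row.get(field) in (None, ""):
                problems.append((i, f"missing {field}"))
        # Flag repeated values of the key that is supposed to be unique.
        key = row.get(unique_key)
        if key is not None:
            if key in seen:
                problems.append((i, f"duplicate {unique_key}"))
            seen.add(key)
    return problems


# Hypothetical sample data: one missing amount, one repeated order_id.
rows = [
    {"order_id": 1, "amount": 19.99},
    {"order_id": 2, "amount": None},
    {"order_id": 1, "amount": 5.00},
]
problems = validate_rows(rows, required=["order_id", "amount"], unique_key="order_id")
for index, problem in problems:
    print(f"row {index}: {problem}")
```

Returning the problems as data, rather than raising on the first failure, lets the pipeline decide whether to quarantine bad rows or stop the load.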
A seemingly good machine learning model may still be wrong. We’ll show how to diagnose such problems by assessing bias vs. variance and precision vs. recall, and present some solutions for these scenarios.
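
For readers new to the precision vs. recall trade-off, here is a small sketch that computes both from binary labels using the standard definitions (precision = TP / (TP + FP), recall = TP / (TP + FN)). The function name and the sample labels are made up for illustration.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    # Guard against division by zero when there are no predicted or actual positives.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical labels: 3 true positives, 1 false positive, 1 false negative.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
precision, recall = precision_recall(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

A model can score well on one of these and poorly on the other, which is why overall accuracy alone can make a flawed model look good.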
Customer intelligence requires segmenting customers by their company’s properties, such as web traffic, app performance, technology adoption, ad spend, and company size. But how do you identify companies with these properties?