In this blog series we’ll be experimenting with the most interesting blends of data and tools. Whether it’s mixing traditional sources with modern data lakes, open-source DevOps on the cloud with protected internal legacy tools, SQL with NoSQL, web-wisdom-of-the-crowd with in-house handwritten notes, or IoT sensor data with idle chatting, we’re curious to find out: will they blend? Want to find out what happens when IBM Watson meets Google News, Hadoop Hive meets Excel, R meets Python, or MS Word meets MongoDB?
Follow us here and send us your ideas for the next data blending challenge you’d like to see at willtheyblend@knime.com.
Today: Hadoop Hive meets Excel. Your flight is boarding now
The Challenge
Today’s challenge is weather-based, and it’s something we’ve all experienced ourselves while traveling. How are flight departures at US airports impacted by changing weather patterns? What role do weather and temperature fluctuations play in delaying flights?
We’re sure the big data geeks at the big airlines have their own stash of secret predictive algorithms, but we can also try to figure this out ourselves. To do that, we first need to combine weather information with flight departure data.
On the one hand, we have a whole archive of US flights over the years, something on the order of millions of records, which we have saved on a big data platform, such as Hadoop Hive. On the other hand, we have daily US weather information, downloadable from https://www.ncdc.noaa.gov/cdo-web/datasets/ in the form of Excel files. So, a Hadoop parallel platform on one side and traditional Excel spreadsheets on the other. Will they blend?
Topic. Exploring correlations between flight delays and weather variables.
Challenge. Blend data from Hadoop Hive and Excel files.
Access Mode. Connection to Hive with in-database processing and Excel file reading.
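To make the access mode a little more concrete, here is a minimal sketch of the idea in Python. It is not the workflow used in this series: an in-memory SQLite database stands in for Hive, a small dictionary stands in for the rows read from the weather Excel file, and the table name, column names, and all values are hypothetical toy data. The point it illustrates is that the aggregation runs inside the database, so only one row per day has to leave it before the blend with the weather data.

```python
import sqlite3
from math import sqrt

# Toy stand-in for the flight archive on Hive (hypothetical schema and values).
flights = [
    ("2007-01-01", "ORD", 5), ("2007-01-01", "ORD", 15),
    ("2007-01-02", "ORD", 30), ("2007-01-02", "ORD", 50),
    ("2007-01-03", "ORD", 0), ("2007-01-03", "ORD", 10),
]

# Toy stand-in for rows read from the weather Excel file:
# date -> daily precipitation (hypothetical values).
weather = {"2007-01-01": 2.0, "2007-01-02": 12.0, "2007-01-03": 0.0}


def daily_avg_delay(conn):
    """Aggregate inside the database so only one row per day is fetched."""
    query = """
        SELECT flight_date, AVG(dep_delay)
        FROM flights
        GROUP BY flight_date
        ORDER BY flight_date
    """
    return conn.execute(query).fetchall()


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# SQLite in memory plays the role of the big data platform here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flights (flight_date TEXT, origin TEXT, dep_delay REAL)")
conn.executemany("INSERT INTO flights VALUES (?, ?, ?)", flights)

# Blend: join the per-day aggregates with the weather rows on the date.
daily = daily_avg_delay(conn)
blended = [(day, delay, weather[day]) for day, delay in daily if day in weather]

# Correlation between average departure delay and precipitation.
r = pearson([row[1] for row in blended], [row[2] for row in blended])
print(blended)
print(round(r, 3))
```

With a real Hive connection the same pattern would apply: push the GROUP BY to the cluster, fetch the small aggregated result, and join it locally with the spreadsheet data.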