Data Exploration in #66DaysOfData with KNIME

heather.fyson, Mon, 09/20/2021 - 10:00

A roadmap to deepen your knowledge of data exploration techniques within the #66daysofdata challenge

Here at KNIME, we’ve set out a #66daysofdata roadmap. Follow it to learn about data preparation, blending, and visualization. Every day, @DMR_Rosaria (on Twitter) and Roberto Cadili (on LinkedIn) will share a task with links to further resources.

The idea is to spend around 5-10 minutes on a specific data science project each day for 66 days and share your progress on your favorite social media platform with #66daysofdata. Ken Jee is the original instigator of #66daysofdata. Why 66 days? Because that's the average time it takes us to get practiced at doing something. In this case, data science with KNIME.

New to KNIME?

Our tool of choice for the project is KNIME Analytics Platform (v. 4.4.1). It’s open source and accessible to anyone. Download it and start using it right away.

KNIME Forum for your questions along the way

If you have questions about using KNIME, or any of the data science techniques in our roadmap, head over to the KNIME Forum. Use this thread specifically for #66daysofdata questions. There’s a really active community on the Forum, happy to help answer your questions!

If you don’t already have a Forum account, it’s easy to set one up on the Login page.

Share Your Progress on KNIME Hub and Social Media

Store your work on the KNIME Hub, and share your progress by posting an impression of your workflow or visualization on social media (e.g., Twitter, LinkedIn, etc.) with the hashtags #KNIME and #66daysofdata.

Celebrate Your Project with a Digital Badge

After completing the challenge, send the link to your work on the KNIME Hub to blog@knime.com, and earn a celebratory digital badge of the #66daysofdata challenge with KNIME that you will be able to share on social media.

Explore the Datasets—Musicians, Tracks, Danceability

The core of this project relies on three Spotify datasets freely available on Kaggle (sign in to download them). The Kaggle descriptions don't provide much information about the different columns, so you can check out an overview of column names and descriptions for each dataset on this article page.

The tracks.csv dataset contains about 600k tracks from the period 1900-2021 and is described by 20 columns
The artist-uris.csv dataset contains data on roughly 81k artists and is described by 2 columns (header names are not provided)
The artist.csv dataset is very similar to the tracks.csv dataset but also includes a popularity metric for the artists.

Ready for the Challenge?

Use KNIME Analytics Platform to start the #66daysofdata challenge

Download KNIME

The roadmap is split up into twelve sections including one bonus section. Click the section header to hop straight there.

1. Import Data

First things first: Learn how to import data and then specifically investigate options for importing text data.

2. Descriptive Statistics

Delve into basic statistics measures and find out things like average popularity of songs, highest danceability, plus the ever-popular topic of "missing values".

3. Histograms

Investigate how to build a histogram for features such as loudness, tempo, and energy. And learn about components and uploading workflows to the KNIME Hub.

4. Date Standardization

Dates always seem to come in very different formats. Learn how to standardize fields, filter rows, append, transform, and extract date&time data.

5. Ungrouping and Aggregations

Let's extract the number of tracks for each artist and learn how to ungroup, aggregate, and join in the process.

6. Plots and Charts - Univariate Analysis

Who are the 20 most prolific artists of all time? Find out and visualize the answer! Explore how to add color and build composite views, use pivoting, and add interactivity.

7. Plots and Charts - Multivariate Analysis

To investigate relationships among features we need multivariate types of visualization. Dive into sunburst charts and scatter plots, and learn about rule engines and loops.

8. Plots and Charts - Time Plots

It's flow variable time! Try plotting annual track numbers in different types of plots, including stacked area charts and bar charts for evolution over time. And then add a bit of JavaScript!

9. Plots and Charts - Control

This is all about Guided Analytics and Widgets, and exploring how to build interactive dashboards.

10. Covariance and Correlation

The focus here is on adding a little statistical flavor to your analysis, to improve understanding of the data and enable sounder identification of relationships between input features.

11. Text Visualization

Science isn't all about numbers: there are also texts, images, networks, and more. Explore here how to visualize your text analysis.

12. Graph Visualization (Bonus)

The bonus topic! Moving into more advanced visualizations, here you can explore visualizing interactions among users. Here: Tweets!

1. Import Data

The first part of the project focuses on the first basic step: importing the data into KNIME Analytics Platform. A few documentation sources are provided to download and understand the dataset, to install and explore KNIME Analytics Platform, and then to investigate options for importing text data.

Day 1. Download three Spotify datasets from Kaggle (sign in to download them). Open and investigate the content of each file. For this project, we will use the files: tracks.csv, artist-uris.csv and artist.csv.

tracks.csv
artist-uris.csv
artist.csv

Day 2. Learn more about KNIME Analytics Platform. Recommended readings and videos:

Why KNIME?, Low Code for Advanced Data Science
Codeless Data Science with KNIME, Low Code for Advanced Data Science
A Friendly Introduction to KNIME Analytics Platform, Analytics Vidhya
Creating and Productionizing Data Science, KNIMETV

Day 3. Download KNIME Analytics Platform and install it on your machine in a folder where you have read and write permissions. After installation, start KNIME Analytics Platform, set the workspace (the default workspace is OK) and get familiar with the workbench. Create an account on the KNIME Forum. The same account will also work for the KNIME Hub. Recommended videos:

How to install KNIME Analytics Platform, KNIMETV
The KNIME Workbench, KNIMETV
How to install KNIME Extensions, KNIMETV
How to import and export KNIME workflows, KNIMETV

Day 4. In the KNIME Explorer panel, under LOCAL, create a new folder (Workflow group) to host your work. Then, create an empty workflow to start the project. Create a “Data” folder under //Data and copy the Spotify datasets in it. After refreshing the KNIME Explorer (right-click on LOCAL and select Refresh), you should see the folder Data with all copied files. Recommended videos:

What is a Node, What is a Workflow, KNIMETV
Creating New Workflows and Workflow Groups, KNIMETV

Day 5. Read file tracks.csv of the Spotify dataset. You can:

Use a File Reader node, KNIME Hub
Use a File Reader (Complex) node, KNIME Hub
Use a CSV Reader node, KNIME Hub

Drag&drop the file from KNIME Explorer onto the workflow editor and see what node is created. Tip: it might be necessary to considerably increase the value of “Limit data rows scanned”.

Once the data has been read, open the output table to investigate the results.

Notice that many song titles are not pure ASCII: make sure to choose the right character encoding option in the reader node. Compare the performance of the different nodes in terms of flexibility and speed. Then read files artist-uris.csv and artists.csv with your preferred node. Recommended documents and videos:

How to Create, Configure, Select, Execute a Node, KNIMETV
KNIME File Handling Guide: Reader Nodes, KNIME Documentation
Reading Files in KNIME, KNIMETV
The CSV Reader node, KNIMETV
Data Structures in KNIME, KNIMETV
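
Why the encoding choice matters can also be seen outside KNIME. The following is a minimal Python sketch, not part of the KNIME workflow: the two sample rows are made up and only mimic a track title with non-ASCII characters, and the helper read_rows is a hypothetical name introduced for illustration.

```python
import csv
import io

# Hypothetical sample: two made-up rows, one with a non-ASCII title, stored as UTF-8 bytes.
raw_bytes = "id,name\n1,Für Elise\n2,Plain Title\n".encode("utf-8")

def read_rows(data: bytes, encoding: str) -> list:
    """Decode raw bytes with the given encoding, then parse the text as CSV."""
    text = data.decode(encoding)
    return list(csv.DictReader(io.StringIO(text)))

utf8_rows = read_rows(raw_bytes, "utf-8")      # matching encoding: title is intact
latin1_rows = read_rows(raw_bytes, "latin-1")  # mismatched encoding: title is garbled

print(utf8_rows[0]["name"])
print(latin1_rows[0]["name"])
```

Pure ASCII titles survive either choice, which is why a wrong encoding setting can go unnoticed until you inspect titles with accents or other special characters.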

Day 6. Investigate the concept of relative paths in reader nodes. Use a relative path instead of an absolute path in your reader nodes. Did you use a workspace or a workflow reference?

KNIME File Handling Guide: Path Syntax, KNIME Documentation
KNIME File Handling Guide: Standard File Systems, KNIME Documentation

Day 7. Comment the nodes and write a general annotation with title and description of what the project is about. Learn more about data exploration.

Annotations and Comments, KNIMETV
Data Visualization for Data Exploration, t.b.a.

2. Descriptive Statistics

Before diving into the visualization, let’s start with some basic descriptive statistics.

Day 8. Learn about basic statistics measures: mean, median, mode, variance, standard deviation, range, and quantiles. What is the difference between average and mean? And between variance and standard deviation?

Descriptive Statistics, Towards Data Science
Difference between average and mean, Cuemath
An Introduction to descriptive statistics, Scribbr
Quantile: definition and how to find them in easy steps, Statistics How To
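
As a companion to the readings, here is a small Python sketch of the measures listed for Day 8, using only the standard library. The popularity values are made up for illustration and are not taken from tracks.csv; the sketch uses the population variants of variance and standard deviation, which is an arbitrary choice.

```python
import statistics

# Made-up popularity scores, for illustration only (not taken from tracks.csv).
popularity = [10, 20, 20, 30, 40, 50, 60]

mean = statistics.mean(popularity)                  # arithmetic mean
median = statistics.median(popularity)              # middle value of the sorted data
mode = statistics.mode(popularity)                  # most frequent value
variance = statistics.pvariance(popularity)         # population variance
std_dev = statistics.pstdev(popularity)             # square root of the variance
value_range = max(popularity) - min(popularity)     # max minus min
quartiles = statistics.quantiles(popularity, n=4)   # the three quartile cut points

print(mean, median, mode, variance, std_dev, value_range, quartiles)
```

The standard deviation is the square root of the variance, so it is expressed in the same unit as the data, while the variance is in squared units.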

Day 9. Learn about more complex statistics measures: skewness and kurtosis.

Descriptive Statistics, Towards Data Science
Probability distributions, Towards Data Science (only if you do not know what a probability is)
Shape of data: Skewness and Kurtosis, Analytics Vidhya
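
To make the two shape measures concrete, the sketch below computes them from central moments in plain Python. It assumes the population (biased) definitions and reports excess kurtosis (kurtosis minus 3); other tools may use sample-corrected formulas and return slightly different numbers. The data and function names are made up for illustration.

```python
def central_moment(values, k):
    """k-th central moment: the average of (value - mean) ** k."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** k for v in values) / len(values)

def skewness(values):
    """Third standardized moment: 0 for symmetric data, > 0 for a long right tail."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Fourth standardized moment minus 3 (so a normal distribution scores 0)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3

symmetric = [1, 2, 3, 4, 5]        # made-up, perfectly symmetric
right_skewed = [1, 1, 1, 2, 10]    # made-up, one large value on the right

print(skewness(symmetric), skewness(right_skewed))
print(excess_kurtosis(symmetric))
```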

Day 10. Learn about the Data Explorer node to calculate the basic descriptive statistics measures of the dataset in tracks.csv.

Data Explorer: Interactive univariate data exploration, KNIMETV

Day 11. Explore the interactive view of the Data Explorer node. What’s the average popularity of the songs in the dataset? The highest danceability? How many missing values are there in the feature “key”? Investigate the effect on the output data of excluding a column in the interactive view.

Data Explorer: Interactive univariate data exploration, KNIMETV

Day 12. Investigate the difference between zeros, missing values, infinity, and NaN. In KNIME Analytics Platform, missing values are represented by a red question mark.

Inf, NaN, and Null, Wiki Analytica
Missing Value Handling, KNIME Hub
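
The distinction between these special values can be sketched in a few lines of Python. Here None stands in for a missing value; this is only an analogy for KNIME's missing cells, not a description of how KNIME stores them.

```python
import math

zero = 0.0                      # a real measurement that happens to be zero
missing = None                  # stand-in for a missing value (no measurement at all)
infinity = float("inf")         # larger than any finite number
not_a_number = float("nan")     # result of an undefined numeric operation

values = [3.0, zero, missing, infinity, not_a_number]

# Drop missing values first, then keep only finite numbers for statistics.
present = [v for v in values if v is not None]
finite = [v for v in present if math.isfinite(v)]

print(not_a_number == not_a_number)  # False: NaN is not equal to anything, even itself
print(finite)
```

Zero counts toward a mean, a missing value does not, and infinity or NaN can silently propagate through calculations unless filtered out.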

3. Histograms

Did you notice the histograms in the last column on the right of the interactive view of the Data Explorer node? In this part of the project we will deepen our knowledge of histograms.

Fig. 1. Spotify’s track feature distribution visualized in histograms.

Day 13. Learn more about Histograms.

Histograms, Math is Fun
Histograms, Laerd Statistics
A complete guide to histograms, Data Tutorials

Day 14. Introduce a Histogram node into the workflow. Build the histogram of feature “loudness” for occurrences on quantile bins first and on a fixed number of bins later. Why doesn't it make sense to build a histogram of occurrences on quantile bins?

Comparison charts - Histogram and Bar Chart, KNIME Hub
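
One way to explore the Day 14 question is to count occurrences per bin for both binning strategies. The Python sketch below uses made-up loudness values and a hypothetical helper, count_per_bin; it is an illustration of the idea, not a reproduction of what the Histogram node does internally.

```python
import statistics

# Made-up loudness values in dB, for illustration only.
loudness = [-30.0, -22.0, -18.0, -15.0, -12.0, -11.0,
            -10.0, -9.0, -8.0, -7.0, -6.0, -5.0]

def count_per_bin(values, edges):
    """Count values in [edges[i], edges[i+1]); the last bin also includes its right edge."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            if edges[i] <= v < edges[i + 1] or (is_last and v == edges[-1]):
                counts[i] += 1
                break
    return counts

low, high = min(loudness), max(loudness)

# Strategy 1: a fixed number of equal-width bins.
n_bins = 4
width = (high - low) / n_bins
fixed_edges = [low + i * width for i in range(n_bins + 1)]
fixed_counts = count_per_bin(loudness, fixed_edges)

# Strategy 2: quantile bins, with edges placed at the quartiles.
cuts = statistics.quantiles(loudness, n=4, method="inclusive")
quantile_edges = [low] + cuts + [high]
quantile_counts = count_per_bin(loudness, quantile_edges)

print(fixed_counts)
print(quantile_counts)
```

Quantile bins are constructed so that each bin holds roughly the same number of rows, so the bar heights of an occurrence histogram carry almost no information; equal-width bins let the counts reveal the shape of the distribution.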

Day 15. Build Histograms with Histogram nodes for the other features as well, like liveness, valence, instrumentalness, danceability, tempo, energy, speechiness, acousticness, popularity, key, loudness, and duration.

Day 16. Learn what a metanode and a component are. Create a component with all Histogram nodes, execute it and open its composite view.

Component Configuration, KNIMETV
KNIME Components Guide, KNIME Documentation
Metanode or component. What is the difference?, KNIME Blog

Day 17. Arrange the Histogram nodes in the component composite view using the Layout button in the toolbar. Add a title to the component view via a Text Output Widget node. Reshape the layout of the composite view.

Layout button, KNIME Documentation

Day 18. Upload your current workflow onto the KNIME Hub; that is, copy your workflow onto your public space in the My-KNIME-Hub folder in the KNIME Explorer. Then, open the workflow on the KNIME Hub. Explore the KNIME Hub for other contributions to #66daysofdata. Is yours there?

The KNIME Hub, KNIMETV
The EXAMPLES server and the KNIME Hub, KNIMETV
The KNIME Hub - Share and Collaborate, KNIME Blog
Introducing More KNIME Hub Features, KNIME Blog

4. Date Standardization

The task of this part is date standardization. Let’s fix the release_date field. Some dates have the format yyyy-MM-dd (year-month-day), while others report only the year or the year-month of the track release. Let’s standardize this field. We want to build a metanode that adds -01 or -01-01 where the day and/or month are missing, according to the format yyyy-MM-dd.

Day 19. Investigate the String Manipulation node and its functions, especially length(), replace(), and join().

Data Manipulation: Numbers, Strings, and Rules, KNIMETV

Day 20. Investigate row filtering (Row Filter) and row splitting (Row Splitter) nodes. Investigate how to keep and exclude rows, how to filter based on patterns or on numerical ranges or on missing values. Split the original dataset so that all rows with only the year in release_date are on one side and all other rows are on the other side. Repeat for rows with only the year-month in release_date.

What is Row Filtering?, KNIMETV
ETL with KNIME: Row Filter with Pattern Matching, KNIMETV
ETL with KNIME: Row Filter based on Numerical Intervals and Missing Values, KNIMETV
ETL with KNIME: Row Filter based on RowID, KNIMETV
ETL with KNIME: Advanced Row Filtering, KNIMETV
ETL with KNIME: Advanced Row Filter for Special Data Types, KNIMETV
All Row Filters of KNIME, t.b.a.

Day 21. Append “-01” in release_date where needed to always have dates in format yyyy-MM-dd and reassemble all pieces together via the Concatenate node.

ETL with KNIME. What is Concatenation, KNIMETV
ETL with KNIME. The Concatenate node, KNIMETV
ETL with KNIME. Concatenate up to 4 datasets, KNIMETV

Day 22. Transform all values in the release_date column into Date&Time objects. Check that no release_date values are missing.

Date&Time Integration, KNIME Blog
Explore using Date&Time formats …, KNIME Blog
String to Date&Time node, KNIMETV

Day 23. Extract year from release_date. Wrap up all nodes from this section into a metanode.

Extract Date&Time Fields, KNIMETV
Metanodes to clean up workflows, KNIMETV
Metanodes or Components?, KNIME Blog
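
The logic of this section (pad incomplete dates, convert to a date type, extract the year) can be sketched in Python as follows. The function name and the sample values are made up for illustration, and the sketch assumes that a four-character value is a year and a seven-character value is a year-month.

```python
from datetime import date

def standardize_release_date(value: str) -> str:
    """Pad 'yyyy' or 'yyyy-MM' strings to the full 'yyyy-MM-dd' format."""
    if len(value) == 4:       # assumed: only the year is present
        return value + "-01-01"
    if len(value) == 7:       # assumed: year and month are present
        return value + "-01"
    return value              # assumed to be complete already

# Made-up examples of the three formats.
raw_dates = ["1985", "1999-07", "2020-03-15"]

standardized = [standardize_release_date(d) for d in raw_dates]
parsed = [date.fromisoformat(d) for d in standardized]   # string -> date object
years = [d.year for d in parsed]                         # extract the year field

print(standardized)
print(years)
```

In the KNIME workflow the same branching is done by splitting rows by string length, appending the missing parts, and concatenating the branches again.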

5. Ungrouping and Aggregations

The task of this part is ungrouping and aggregation. We want to join the tracks.csv dataset with the artist features from the files artist-uris.csv and artist.csv, and extract the number of tracks for each artist in the dataset.

Day 24. The id_artists column in the tracks dataset is a String, including one or more artists. To assign each track to each artist (if more than one), we need to split them and create a new row for each artist. Investigate the Cell Splitter node, especially the output (set, list, new columns). Investigate the Collection type for columns. Using the Ungroup node, disaggregate the dataset so that for each row there is only one artist and the corresponding track.

Collection Cookbook, t.b.a.
The Ungroup node, t.b.a.
Working With Collections - Collection Types, KNIME Hub
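
The split-then-ungroup idea can be sketched in Python. The rows below are made up, and the bracketed list-in-a-string format of id_artists is an assumption for illustration; check the actual column in your file before relying on it.

```python
import ast

# Made-up rows; the "['id1', 'id2']" string format is an assumption.
tracks = [
    {"track": "Song A", "id_artists": "['a1']"},
    {"track": "Song B", "id_artists": "['a1', 'a2']"},
]

ungrouped = []
for row in tracks:
    artist_ids = ast.literal_eval(row["id_artists"])   # string -> Python list (a collection)
    for artist_id in artist_ids:                       # one output row per artist
        ungrouped.append({"track": row["track"], "id_artist": artist_id})

for row in ungrouped:
    print(row)
```

A track with two artists becomes two rows, which is what makes a per-artist track count possible later on.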

Day 25. Read the artist file artist-uris.csv and extract artist ids. Read the second artist file artist.csv, and use the Transformation tab to retain only the columns with artist name and artist popularity information. Convert the popularity column to integer.

Double to Int node, KNIME Hub

Day 26. Join the data for each artist from the artist-uris.csv and the artist.csv. Join the resulting data table with the disaggregated tracks dataset (output table of the Ungroup node).

ETL with KNIME. What is a Join operation?, KNIMETV
ETL with KNIME. The Joiner node - Part I, KNIMETV
ETL with KNIME. The Joiner node - Part II, KNIMETV

Day 27. Study the GroupBy node for all aggregations. Study how to build groups of data and what metrics can be calculated on them. In the joined table with artists and tracks, use the GroupBy node to count the number of tracks for each artist plus artist popularity and their period of activity from the first release date to the last release date. Using a second GroupBy node, count again the number of tracks for each artist across all years.

What is data aggregation?, KNIMETV
Basic Aggregations with GroupBy node, KNIMETV
Advanced Aggregations with GroupBy node, KNIMETV
ETL with KNIME. The Column Filter node, KNIMETV
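
The join and the first aggregation of this section can be sketched in plain Python. All rows, ids, and names below are made up; the sketch assumes an inner join on the artist id, which is only one of the join modes you can choose.

```python
# Made-up, already ungrouped track rows (one artist per row).
track_rows = [
    {"id_artist": "a1", "track": "Song A", "year": 1990},
    {"id_artist": "a1", "track": "Song B", "year": 1995},
    {"id_artist": "a1", "track": "Song C", "year": 2001},
    {"id_artist": "a2", "track": "Song D", "year": 2010},
    {"id_artist": "a3", "track": "Song E", "year": 2000},  # no matching artist row
]

# Made-up artist table keyed by artist id.
artists = {
    "a1": {"name": "Artist One", "popularity": 70},
    "a2": {"name": "Artist Two", "popularity": 55},
}

# Inner join: keep only track rows whose artist id has a match.
joined = [{**row, **artists[row["id_artist"]]}
          for row in track_rows if row["id_artist"] in artists]

# Group by artist: count tracks, keep popularity, find first and last release year.
summary = {}
for row in joined:
    entry = summary.setdefault(row["name"], {
        "tracks": 0,
        "popularity": row["popularity"],
        "first_year": row["year"],
        "last_year": row["year"],
    })
    entry["tracks"] += 1
    entry["first_year"] = min(entry["first_year"], row["year"])
    entry["last_year"] = max(entry["last_year"], row["year"])

print(summary)
```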

6. Plots and Charts - Univariate Analysis

Some more visualizations! Let’s build a composite view over the top 20 most prolific artists of all time … We mean … of the whole dataset!

Day 28. Using the output table of the first GroupBy node (the aggregated artist, track and popularity data), select the top k (k = 20) most prolific artists ever, that is, the ones with the highest track total count. Sort by track count and rename columns with meaningful names.

Top k Selector node, t.b.a.
How to sort data with KNIME, YouTube (NickyDee)
How to rename columns in your dataset …, YouTube (NickyDee)
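
Selecting the top k rows by a count is a short operation in most languages. A minimal Python sketch, with made-up artist names and track counts and k = 2 to keep the output small:

```python
from collections import Counter

# Made-up track counts per artist, for illustration only.
track_counts = Counter({
    "Artist One": 120,
    "Artist Two": 300,
    "Artist Three": 45,
    "Artist Four": 210,
})

k = 2
top_k = track_counts.most_common(k)   # sorted by count, highest first

print(top_k)
```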

Day 29. Learn more about univariate, bivariate, and multivariate visualizations.

Visual Data Exploration in Three Steps, KNIME Blog
Data Explorer. Interactive Univariate data exploration, KNIMETV
Scatter Plot. Interactive bivariate visual exploration, KNIMETV

Day 30. Put the top k most prolific artists in an interactive Table View. Color rows by artist.

Color Manager node, KNIME Hub
Data Visualization and Interactive Data Exploration with KNIME, KNIMETV
Table View node with colored rows, t.b.a.

Day 31. Build a composite view that for all selected rows in the table shows the artist detail (name, # of tracks, popularity, period of activity from: to:) in a Tile View on the right.

Components composite views, KNIME Documentation
Visualizing clusters via dendrogram, heatmap, tile view, and CSS styles, KNIMETV
Table View and Tile View node, t.b.a.

Day 32. Display the number of tracks by artist in a Pie Chart. Make sure that the same color code by artist is used as in the table and tile view.

Data Visualization and Interactive Data Exploration with KNIME, KNIMETV
How to create KNIME Charts: Pie charts and bar charts, YouTube (Yoda Learning Academy)

Day 33. Display the number of tracks by artist in a monochromatic Bar Chart.

How to create KNIME Charts: Pie charts and bar charts, YouTube (Yoda Learning Academy)
Bar Chart Examples, KNIME Hub

Day 34. Display the number of tracks by artist in a Bar Chart with the same color scheme used for the table. Transform the data to keep color mapping and synchronous interactivity for all items in the composite view. Learn more about the Pivoting node.

The Pivoting node, KNIMETV
Custom Bar Chart Colors for each Bin, KNIME Hub
How to Assign Colors to Bars in a Bar Chart - Three Shades of Green, KNIME Blog
Assigning colours to Bar Charts, t.b.a.

Day 35. Investigate interactivity in the composite view. Learn more about composite views becoming dashboards both locally and on the KNIME WebPortal.

How to Create an Interactive Dashboard in Three Steps with KNIME Analytics Platform, Low Code for Advanced Data Science
The KNIME WebPortal, KNIMETV
KNIME WebPortal User Guide, KNIME Documentation

7. Plots and Charts - Multivariate Analysis

Pie charts and bar charts show and compare aggregated values. To investigate relationships among features though we need to use multivariate types of visualization. The most commonly used visualization of this kind is surely the scatter plot. Since those visualizations do not show aggregated values like the bar and pie charts, but potentially all data points in your data set, often sampling is required.

Day 36. Using the joined tracks and artist features table, create a track popularity class (high, medium, low) with the Rule Engine node, perform different sampling strategies and display percentages for each popularity class and strategy (original data, with random sampling, and with stratified sampling) in a table view.

Data Manipulation: Numbers, Strings, and Rules, KNIMETV
Sampling strategies, t.b.a.
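
The Day 36 task combines rule-based classification with sampling. The Python sketch below illustrates the idea on made-up popularity scores; the thresholds 30 and 60 and all function names are hypothetical choices for illustration, not values prescribed by the roadmap.

```python
import random
from collections import Counter, defaultdict

def popularity_class(popularity: int) -> str:
    """Rule-style classification; the thresholds 30 and 60 are hypothetical."""
    if popularity >= 60:
        return "high"
    if popularity >= 30:
        return "medium"
    return "low"

rng = random.Random(42)  # fixed seed so the sketch is repeatable
# Made-up popularity scores standing in for the joined tracks table.
rows = [(p, popularity_class(p)) for p in (rng.randint(0, 100) for _ in range(1000))]

def class_shares(sample):
    """Fraction of rows in each popularity class."""
    counts = Counter(label for _, label in sample)
    return {label: counts[label] / len(sample) for label in ("high", "medium", "low")}

def stratified_sample(data, fraction, rng):
    """Draw the same fraction from every class, preserving class proportions."""
    by_class = defaultdict(list)
    for row in data:
        by_class[row[1]].append(row)
    sample = []
    for members in by_class.values():
        sample.extend(rng.sample(members, round(len(members) * fraction)))
    return sample

random_sample = rng.sample(rows, 100)
strat_sample = stratified_sample(rows, 0.1, rng)

for name, sample in [("original", rows), ("random", random_sample), ("stratified", strat_sample)]:
    print(name, class_shares(sample))
```

Stratified sampling keeps the class percentages close to those of the original data by construction, while plain random sampling can drift, especially for small samples or rare classes.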

Day 37. Scatter Plot is probably the most common way to visualize and investigate relationships among pairs of features. Let’s learn more about scatter plots, how to implement them in KNIME, and how to carry out a visual exploration of the data. Check various pairs of features. Which feature creates a pattern with which feature? Try “loudness” vs “explicit” and see if there is some form of correlation. Add a Table View to visualize details of selected points only in the scatter plot.

A Complete Guide to Scatter Plots, Data Tutorials
Scatter Plot: Interactive Bivariate Visual Exploration, KNIMETV

Day 38. With the data table created using stratified sampling, let’s prepare the data to use the Sunburst Chart to visualize the feature proportions that lead to high popularity. The Sunburst Chart node requires nominal values, so the numerical columns must be binned or, if already binned but still numerical, converted to strings.

Binning data, YouTube (Caleb Curry)
Binning nodes: Auto-binner and Numeric Binner, t.b.a.
Math Formula node in Data Manipulation: Numbers, Strings, and Rules, KNIMETV

Day 39. Binned buckets might not have a distinctive name. Let’s loop through the binned columns to change the bin names to + .

What is a loop, KNIMETV
How to build a generic loop, KNIMETV
Loop Commands, KNIMETV
The Group Loop Start node, KNIMETV
Loop End nodes, KNIMETV
KNIME Flow Control Guide, KNIME documentation

Day 40. We can now apply the Sunburst Chart to visualize the proportion of each feature to reach high popularity.

Sunburst, From Data to Viz
Three steps to build an interactive board, KNIMETV
Data Visualization and Interactive Exploration with KNIME, KNIMETV

Day 41. We now want to apply a Heatmap to all numerical features. Since the Heatmap visualizes numerical values with a color gradient built on the [min, max] interval, for better visualization we need normalized features first.

Normalization Techniques at a Glance, Google Data Prep
KNIME Normalize and Denormalize Data, YouTube (NickyDee)
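
Min-max normalization rescales every feature to the same [0, 1] interval, so that one color gradient is comparable across columns. A minimal Python sketch with made-up tempo and loudness values:

```python
def min_max_normalize(values):
    """Rescale values linearly so that the minimum maps to 0 and the maximum to 1."""
    low, high = min(values), max(values)
    if high == low:                      # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]

# Made-up values on very different scales.
tempo = [60.0, 90.0, 120.0, 180.0]       # beats per minute
loudness = [-20.0, -10.0, -5.0, 0.0]     # dB

print(min_max_normalize(tempo))
print(min_max_normalize(loudness))
```

Without this step, a feature with a large numeric range such as tempo would dominate the color scale and features with small ranges would all look the same.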

Day 42. Let’s apply a Heatmap to all numerical normalized features.

A Complete Guide to Heatmaps, Data Tutorials
Heatmap node, t.b.a.

Day 43. Visualize all feature contributions to popularity classes via a Parallel Coordinates Plot.

Parallel coordinates plots, From Data to Viz blog
Data Visualization and Interactive data exploration with KNIME, KNIMETV
Parallel Coordinates Plot node, t.b.a.

Day 44. A full component should be dedicated to Box Plots. Learn more about Box Plots & Conditional Box Plots. With the data table created using stratified sampling, you can box plot single features one by one or you can box plot multiple features all together. In this last case, you must pay attention to the different ranges. You could normalize of course, but then you lose the interpretability of the box plot.

Understanding Box Plots, Towards Data Science
Box Plot examples, KNIME Hub
Box Plots and Conditional Box Plots, t.b.a.
Conditional Box Plot, KNIME Hub
Four Techniques for outlier detection, KNIME Blog

8. Plots and Charts - Time Plots

Some more charts and plots. We use this section to also introduce the concept of Flow Variables.

Day 45. Learn more about Flow Variables.

Flow Variables, KNIMETV
Flow Variables: From Data to Variables, KNIMETV
KNIME Flow Control Guide, KNIME Documentation
KNIME Analytics: Flow Variables, the Red Line, t.b.a.

Day 46. Return to the output table of the second GroupBy node where we counted the number of tracks for each artist across all years. In a selected time window, find out the maximum number of years of activity for an artist and extract those artists that have been active for that maximum number of years. Wrap it into a component. Flow Variables here might be helpful.

Sharing components, KNIME Documentation
Table Column to Variable node, t.b.a.

Day 47. Make the previous component parametric by adding a configuration window where you can select the time window. You can further parameterize your configuration window by allowing the top k artists to be extracted (i.e., artists with the largest number of tracks in the selected time window).

Component Configurations, KNIMETV
Custom components configuration dialogs, KNIME Documentation

Day 48. Plot yearly number of tracks by artist in a Line Plot. Add color for each line, i.e. for each artist. Make the plot subtitle parametric for the selected time window with the help of the String Manipulation (Variable) node. Inspect lines one by one and all together. Find out the artists who have been most consistently active across the years of your time window.

Line plot in KNIME Introduction. Part 15. Visualizations, YouTube (Scott McLeod)
Time Plots: Line Plot, t.b.a.

Day 49. Plot yearly number of tracks by artist in a Stacked Area Chart. Add color for each area, i.e. for each artist. Make the plot subtitle parametric for the selected time window. Explore interactivity, especially how to add and remove areas for artists.

What is a stacked area chart, From Data to Viz
Data Visualization & Interactive Data Exploration with KNIME, KNIMETV
Text Stream Visualization, KNIME Blog

Day 50. Create a Bar Chart to inspect evolution over time. Visualize the number of tracks by artist over years in a bar chart. Add color for each bar, i.e. for each artist. Make the plot subtitle parametric for the selected time window. Inspect bars one by one, in small groups, and all together. Find out the most prolific artist in a specific year of your time window.

Day 51. Let’s conclude this part with some free JavaScript code. Let’s investigate the Generic JavaScript View node. Do not worry, you do not need to code. Just explore the KNIME Hub for workflows and components based on the Generic JavaScript View node. Drag&drop the Animated Bar Chart component from the KNIME Hub into your workflow and study, for example, the evolution of artists and track count throughout the years. Wrap all the time plots in a component and investigate selections of year and artists in the composite view.

KNIME JavaScript Views, KNIME Hub

Example components and visualizations based on free JavaScript code:

FIFA World Cup, KNIME Hub
Animated Bar Chart, KNIME Hub
Epidemiological data from Zika virus, KNIME Hub
Relations in data with scatter and 3-D scatter, KNIME Hub

9. Plots and Charts - Control

Day 52. Learn about Guided Analytics and Widget nodes

Widget nodes in KNIME Component Guide, KNIME Documentation
Widget vs. Configuration nodes, t.b.a.
Principles of Guided Analytics, KNIME Blog

Day 53. Let’s continue using the output table of the second GroupBy node where we counted the number of tracks for each artist across all years. Build a guided analytics sequence with a component including a Widget-based selection framework to select the time window and extract the top k artists. Next, build a second component to filter artists of choice and a third component to build time plot views (i.e., line plot, bar chart, stacked area chart, etc.) to visualize the number of tracks by artist over a selected time period (e.g., 1970-1980).

Integer Widget node, KNIME Hub
Column Filter Widget node, KNIME Hub
Value Selection Widget node, KNIME Hub
Widget nodes, t.b.a.

Day 54. Using the top k artists extracted from the time window selection used in the first component, investigate the usage of the Interactive Range Slider Filter Widget node in conjunction with the Scatter Plot or the Stacked Area Chart node. Build a component with a scatter plot or stacked area chart and control the number of points via the Interactive Range Slider Filter Widget node, for example by controlling the size of the year interval. Alternatively, use the data table created with stratified sampling in section 7 to build a component with an interactive multivariate scatter plot and table view in conjunction with the Interactive Range Slider Filter Widget node to control for the year interval.

Interactive Range Slider Filter Widget node, KNIME Hub
Interactive Filter Widget nodes, t.b.a.

Day 55. Investigate the usage of the Refresh Button Widget node. After the component that uses Widget nodes to select the time window (e.g., 1970-1980) and extracts the top k artists, build a dashboard with a Widget-based framework to select the top k artists, a stacked area chart, a line plot, a bar chart, and a Refresh button.

Eight Data App Designs with the New Refresh Button, KNIME Blog
Example workflow collection using the Refresh Button Widget node, KNIME Hub

Day 56. Combine the Refresh Button Widget node with the Interactive Range Slider Filter Widget node and with a stacked area chart to visualize extracted top k artists in a selected time window within a dashboard. Watch how the Refresh Button Widget node could work on the KNIME WebPortal.

Twitter analysis Data App, KNIMETV
Machine learning Data App, KNIMETV

10. Covariance and Correlation

Bar charts, pie charts, and time plots are definitely great ways to visually explore a dataset and gain valuable insights. However, sometimes adding a little statistical flavor to the analysis improves our understanding of the data and, more importantly, enables a sounder identification of relationships between the input features.

Day 57. Learn about covariance, linear correlation, and rank correlation. Focus on the similarities and differences between them, as well as on their strengths and weaknesses.

5 Things You Should Know About Covariance, Towards Data Science
How to measure the relationship between variables, Towards Data Science
Why correlation does not imply causation?, Towards Data Science

(Bonus task) Work out the math behind covariance and linear correlation; it will help you understand their differences and similarities. Pick two numeric input features of your choice (e.g., “loudness” and “popularity”), select 5 observations for each feature, and compute the covariance and the linear correlation manually. You can double-check your results with an online calculator.
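
If you prefer to check your manual calculation with a few lines of code rather than an online calculator, the following Python sketch spells out both formulas. The five loudness and popularity observations are made up, and the sketch assumes the sample covariance (dividing by n - 1); a calculator using the population version (dividing by n) will show a different covariance but the same correlation.

```python
import math

# Five made-up observations per feature, for illustration only.
loudness = [-12.0, -10.0, -8.0, -6.0, -4.0]
popularity = [20.0, 35.0, 30.0, 50.0, 65.0]

def mean(values):
    return sum(values) / len(values)

def covariance(x, y):
    """Sample covariance: average product of deviations, dividing by n - 1."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

def linear_correlation(x, y):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))

cov = covariance(loudness, popularity)
corr = linear_correlation(loudness, popularity)

print(cov, corr)
```

The covariance depends on the units of both features, while the correlation is unit-free and always lies between -1 and 1.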

Day 58. Compute the covariance for all pairs of the following features: “energy”, “loudness”, “danceability”, “valence” and “popularity” in the tracks.csv dataset using the GroupBy node. Build a covariance matrix both with unnormalized and normalized (z-score normalization) features separately.

Aggregations, Aggregations, Aggregations! — Part II (section: “1. Statistical Aggregations: Covariance vs. Correlation”), KNIME Blog
GroupBy node for statistical aggregations, KNIME Hub
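
The effect of z-score normalization on covariance can be previewed on a tiny example. The Python sketch below uses made-up energy and loudness values and hand-written helper functions; it is an illustration of the arithmetic, not of the GroupBy node's implementation.

```python
import statistics

# Made-up observations, for illustration only.
energy = [0.2, 0.4, 0.5, 0.7, 0.9]
loudness = [-15.0, -11.0, -9.0, -6.0, -4.0]

def sample_covariance(x, y):
    """Sample covariance (divides by n - 1)."""
    mx, my = statistics.mean(x), statistics.mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

def z_score(values):
    """Subtract the mean and divide by the sample standard deviation."""
    m = statistics.mean(values)
    s = statistics.stdev(values)
    return [(v - m) / s for v in values]

raw_cov = sample_covariance(energy, loudness)
normalized_cov = sample_covariance(z_score(energy), z_score(loudness))
correlation = raw_cov / (statistics.stdev(energy) * statistics.stdev(loudness))

print(raw_cov, normalized_cov, correlation)
```

After z-score normalization every feature has variance 1, so the diagonal of the normalized covariance matrix is all ones and the off-diagonal values become directly comparable across feature pairs.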

Day 59. Compute the linear correlation for all pairs of meaningful numeric features in the track dataset using the Linear Correlation node. Inspect the correlation matrix view, and the results of the three output ports. What are the top four positively correlated feature pairs? Compare the values in the linear correlation matrix with those of the normalized covariance matrix. What do you observe?

Linear Correlation node, KNIME Hub

Day 60. Compute the rank correlation for all pairs of meaningful numeric features in the track dataset using the Rank Correlation node. Inspect the correlation matrix view, and the results of the three output ports. What are the top four positively correlated feature pairs? Do you observe similarities with the linear correlation matrix? If yes, why do you think that’s the case?

Rank Correlation node, KNIME Hub
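
Rank correlation replaces each value by its rank before measuring association. The Python sketch below implements Spearman's rank correlation, one common rank correlation measure, as the linear correlation of the ranks; the data and helper names are made up, and the Rank Correlation node may offer other measures as well.

```python
import math

def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson(x, y):
    """Linear (Pearson) correlation."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    numerator = sum((a - mx) * (b - my) for a, b in zip(x, y))
    denominator = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return numerator / denominator

def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))

# Made-up monotonic but non-linear relationship: y = x ** 3.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.0, 8.0, 27.0, 64.0, 125.0]

print(pearson(x, y), spearman(x, y))
```

For a relationship that is monotonic but not linear, the rank correlation reaches 1 while the linear correlation stays below it; when the relationship is close to linear, the two matrices look alike.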

Day 61. Visualize in a component composite view the unnormalized and normalized covariance matrices side-by-side both using a Heatmap node and a Table View node. Inspect in a different component composite view the linear correlation for the top four positively correlated features using the Scatter Plot node for each pair.

11. Text Visualization

There are not only numbers in data science! Besides numerical data, we have to deal with texts, images, networks, and even more diverse data types. This section covers a few items on text visualization.

Day 62. Get familiar with the KNIME Text Processing Extension, the Document object, the Term object and all text processing operations.

KNIME Text Processing Extension, KNIME Hub
From Data Collection to Text Mining Interpretation, KNIME Blog

Day 63. Read the IMDb-sample.csv file from the KNIME Hub. This dataset collects 2000 movie reviews written by users and contains sentiment annotation (i.e., positive or negative) for each review. Convert all texts to Documents and perform some basic text cleaning.

Common Steps in a Text Mining Project, KNIMETV
All you need to know about text preprocessing for NLP and Machine Learning, KDNuggets
Strings to Document and Tika Parser nodes, KNIMETV

Day 64. Using the file MPQA-OpinionCorpus-PositiveList.csv for the list of positive words in the English language and MPQA-OpinionCorpus-NegativeList.csv for the list of negative words in the English language (from the KNIME Hub or the latest version from the MPQA site), tag words in the texts as positive or negative.

Document Tagging: Introduction, KNIMETV
Domain and Custom Tagging, KNIMETV
Dictionary Tagger node, KNIME Hub

Day 65. Let’s transform each text into a Bag of Words and let’s calculate all Term Frequencies.

How important are the words in your text data? Tf-Idf answers…, Towards Data Science
Bag of Words, YouTube (Quantopian)
Bag of Words and Frequencies, KNIMETV
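
What a Bag of Words and term frequencies look like can be sketched in a few lines of Python. The two reviews are made up (they are not from IMDb-sample.csv), and the tokenizer is a deliberately naive assumption: lowercase everything and keep runs of letters and apostrophes.

```python
import re
from collections import Counter

# Made-up reviews, for illustration only.
reviews = [
    "A great movie, great cast.",
    "Boring plot and a boring ending.",
]

def tokenize(text):
    """Naive tokenizer: lowercase, then keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

# Bag of Words: one term-count table per document (absolute term frequency).
bags = [Counter(tokenize(review)) for review in reviews]

def relative_tf(bag):
    """Relative term frequency: term count divided by the document's total term count."""
    total = sum(bag.values())
    return {term: count / total for term, count in bag.items()}

print(bags[0])
print(relative_tf(bags[0]))
```

These per-term frequencies are the numbers a word cloud typically maps to font size.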

Day 66. We have the list of words, we have their frequencies, let’s visualize them in a word cloud.

What are word clouds?, BoostLabs Blog
Word cloud with Additional Visualizations, KNIMETV
Tag Cloud, KNIME Hub

12. Graph Visualization (Bonus)

As a bonus, let’s investigate something more complicated: how to visualize interactions among users of a community, like the retweeting patterns around a hashtag on Twitter.

Bonus. First, let’s learn what a Network Graph (or sometimes just a graph) is.

Network Diagram, From Data to Viz
Graph Theory, Wikipedia

Bonus. Retrieve tweets through the Twitter API around a given hashtag, like #knime. Alternatively, you can download a TwitterData.table around #knime for the time window July 26th - August 3rd, 2021.

Confirm that you are a robot, Low Code for Advanced Data Science
Twitter API Connectors Extension, KNIME Hub
KNIME Twitter nodes, t.b.a.
Twitter meets PostgreSQL, KNIME Blog

Bonus. Shape your Twitter data as an adjacency matrix:

The #KNIME Connection. Where are you?, KNIME Blog
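
An adjacency matrix has one row and one column per user, and each cell counts the interactions from the row user to the column user. A minimal Python sketch with made-up user names and retweets; the direction chosen here (retweeter to original author) is an assumption for illustration.

```python
# Made-up retweet pairs: (retweeter, original author).
retweets = [
    ("alice", "bob"),
    ("carol", "bob"),
    ("alice", "bob"),
    ("bob", "carol"),
]

# One matrix index per user, in alphabetical order.
users = sorted({user for pair in retweets for user in pair})
index = {user: i for i, user in enumerate(users)}

# matrix[i][j] = number of times user i retweeted user j.
matrix = [[0] * len(users) for _ in users]
for retweeter, author in retweets:
    matrix[index[retweeter]][index[author]] += 1

print(users)
for row in matrix:
    print(row)
```

Each non-zero cell becomes an edge in the network graph, and the cell value can serve as the edge weight.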

Bonus. The KNIME Network Mining Extension deals with graphs and network objects. You need to create a network of interactions among Twitter users before visualizing it. Create the network diagram from the adjacency matrix of Twitter interactions built in the previous step.

KNIME Network Mining Extension, KNIME Hub
Social Network Analysis, t.b.a.

Bonus. Visualize the network object of Twitter user interactions with a graph.

Data Visualization & Interactive Data Exploration with KNIME, KNIMETV
Data Visualization in KNIME, YouTube
Network Viewer node, KNIME Hub

Bonus. Another way of visualizing interactions is the chord diagram. Let’s learn what a chord diagram is and how to build one in KNIME using the Generic JavaScript View node.

Chord diagram, From Data to Viz
