A roadmap to deepen your knowledge of data exploration techniques within the #66daysofdata challenge
Here at KNIME, we’ve set out a #66daysofdata roadmap. Follow it to learn about data preparation, blending, and visualization. Every day, @DMR_Rosaria (on Twitter) and Roberto Cadili (on LinkedIn) will share a task with links to further resources.
The idea is to spend around 5-10 minutes on a specific data science project each day for 66 days and share your progress on your favorite social media platform with #66daysofdata. Ken Jee is the original instigator of #66daysofdata. Why 66 days? Because that's the average time it takes us to get practiced at doing something. In this case, data science with KNIME.
New to KNIME?
Our tool of choice for the project is KNIME Analytics Platform (v. 4.4.1). It’s open source and accessible to anyone. Download it and start using it right away.
KNIME Forum for your questions along the way
If you have questions about using KNIME, or any of the data science techniques in our roadmap, head over to the KNIME Forum. Use this thread especially for the #66daysofdata. There’s a really active community on the Forum, happy to help answer your questions!
If you don’t already have a Forum account, it’s easy to set one up on the Login page.
Share Your Progress on KNIME Hub and Social Media
Store your work on the KNIME Hub, and share your progress by posting an impression of your workflow or visualization on social media (e.g., Twitter, LinkedIn, etc.) with the hashtags #KNIME and #66daysofdata.
Celebrate Your Project with a Digital Badge
After completing the challenge, send the link to your work on the KNIME Hub to blog@knime.com, and earn a celebratory digital badge of the #66daysofdata challenge with KNIME that you will be able to share on social media.
Explore the Datasets—Musicians, Tracks, Danceability
The core of this project relies on three Spotify datasets freely available on Kaggle (sign in to download them). The Kaggle descriptions don't provide much information about the different columns, but you can check out an overview of column names and descriptions for each dataset on this article page.
The tracks.csv dataset contains about 600k tracks from the period 1900-2021 and is described by 20 columns.
The artist-uris.csv dataset contains data on roughly 81k artists and is described by 2 columns (header names are not provided).
The artist.csv dataset is very similar to the tracks.csv dataset but also includes a popularity metric for the artists.
Ready for the Challenge?
The roadmap is split up into twelve sections including one bonus section. Click the section header to hop straight there.
1. Import Data: First things first: Learn how to import data and then specifically investigate options for importing text data.
2. Descriptive Statistics: Delve into basic statistics measures and find out things like average popularity of songs, highest danceability, plus the ever-popular topic of "missing values".
3. Histograms: Investigate how to build a histogram for features such as loudness, tempo, and energy. And learn about components and uploading workflows to the KNIME Hub.
4. Date Standardization: Dates always seem to come in very different formats. Learn how to standardize fields, filter rows, append, transform, and extract date&time data.
5. Ungrouping and Aggregations: Let's extract the number of tracks for each artist and learn how to ungroup, aggregate, and join in the process.
6. Plots and Charts - Univariate Analysis: Who are the 20 most prolific artists of all time? And visualize it! Explore how to add color and build composite views, use pivoting, and add interactivity.
7. Plots and Charts - Multivariate Analysis: To investigate relationships among features we need multivariate types of visualization. Dive into sunburst charts, scatter plots, learn about rule engines, and loops.
8. Plots and Charts - Time Plots: It's flow variable time! Try plotting annual track numbers in different types of plots, including stacked area charts and bar charts for evolution over time. And then add a bit of JavaScript!
9. Plots and Charts - Control: This is all about Guided Analytics and Widgets, and exploring how to build interactive dashboards.
10. Covariance and Correlation: The focus here is on adding a little statistical flavor to your analysis, to improve understanding of the data and enable sounder identification of relationships between input features.
11. Text Visualization: Science isn't all about numbers! But texts, images, networks, and more. Explore here how to visualize your text analysis.
12. Graph Visualization (Bonus): The bonus topic! Moving into more advanced visualizations, here you can explore visualizing interactions among users. Here: Tweets!
1. Import Data
The first part of the project focuses on the first basic step: importing the data into KNIME Analytics Platform. A few documentation sources are provided to download and understand the dataset, to install and explore KNIME Analytics Platform, and then to investigate options for importing text data.
Day 1. Download three Spotify datasets from Kaggle (sign in to download them). Open and investigate the content of each file. For this project, we will use the files: tracks.csv, artist-uris.csv and artist.csv.
Day 2. Learn more about KNIME Analytics Platform. Recommended readings and videos:
Why KNIME?, Low Code for Advanced Data Science
Codeless Data Science with KNIME, Low Code for Advanced Data Science
A Friendly Introduction to KNIME Analytics Platform, Analytics Vidhya
Day 3. Download KNIME Analytics Platform and install it on your machine in a folder where you have read and write permission. After installation, start KNIME Analytics Platform, set the workspace (the default workspace is OK) and get familiar with the workbench. Create an account on the KNIME Forum. The same account will also work for the KNIME Hub. Recommended videos:
How to install KNIME Analytics Platform, KNIMETV
The KNIME Workbench, KNIMETV
How to install KNIME Extensions, KNIMETV
Day 4. In the KNIME Explorer panel, under LOCAL, create a new folder (Workflow group) to host your work. Then, create an empty workflow to start the project. Create a “Data” folder under
What is a Node, What is a Workflow, KNIMETV
Day 5. Read file tracks.csv of the Spotify dataset. You can:
Use a File Reader node, KNIME Hub
Use a File Reader (Complex) node, KNIME Hub
Use a CSV Reader node, KNIME Hub
Drag&drop the file from KNIME Explorer onto the workflow editor and see what node is created. Tip: it might be necessary to considerably increase the count of “Limit data rows scanned”.
Once the data has been read, open the output table to investigate the results.
Notice that many song titles are not pure ASCII: make sure to choose the right character encoding option in the reader node. Compare the performance of the different nodes in terms of flexibility and speed. Then read files artist-uris.csv and artists.csv with your preferred node. Recommended documents and videos:
How to Create, Configure, Select, Execute a Node, KNIMETV
KNIME File Handling Guide: Reader Nodes, KNIME Documentation
Reading Files in KNIME, KNIMETV
The CSV Reader node, KNIMETV
Data Structures in KNIME, KNIMETV
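To see why the encoding option of Day 5 matters, here is a minimal Python sketch outside KNIME. It is only an illustration: the file name tracks_sample.csv and the two song titles are made up, and the real tracks.csv may need a different encoding than the UTF-8 assumed here.

```python
import csv
import os
import tempfile

# Hypothetical sample rows with non-ASCII song titles (not taken from tracks.csv).
rows = [["id", "name"], ["1", "Señorita"], ["2", "Für Elise"]]

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "tracks_sample.csv")

    # Write the sample file as UTF-8.
    with open(path, "w", encoding="utf-8", newline="") as handle:
        csv.writer(handle).writerows(rows)

    # Reading with the matching encoding returns the titles unchanged.
    with open(path, encoding="utf-8", newline="") as handle:
        utf8_names = [row["name"] for row in csv.DictReader(handle)]

    # Reading the same bytes with a different encoding garbles the titles.
    with open(path, encoding="latin-1", newline="") as handle:
        latin1_names = [row["name"] for row in csv.DictReader(handle)]

print(utf8_names)
print(latin1_names)
```

The same effect shows up in a reader node when the selected encoding does not match the file.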
Day 6. Investigate the concept of relative paths in reader nodes. Use a relative path instead of an absolute path in your reader nodes. Did you use a workspace or a workflow reference?
KNIME File Handling Guide: Path Syntax, KNIME Documentation
KNIME File Handling Guide: Standard File Systems, KNIME Documentation
Day 7. Comment the nodes and write a general annotation with title and description of what the project is about. Learn more about data exploration.
Annotations and Comments, KNIMETV
Data Visualization for Data Exploration, t.b.a.
2. Descriptive Statistics
Before diving into the visualization, let’s start with some basic descriptive statistics.
Day 8. Learn about basic statistics measures: mean, median, mode, variance, standard deviation, range, and quantiles. What is the difference between average and mean? And between variance and standard deviation?
Descriptive Statistics, Towards Data Science
Difference between average and mean, Cuemath
An Introduction to descriptive statistics, Scribbr
Quantile: definition and how to find them in easy steps, Statistics How To
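If you want to check the Day 8 measures by hand before using KNIME, a small Python sketch with the standard statistics module can help. The popularity values below are invented for illustration and the population versions of variance and standard deviation are an arbitrary choice.

```python
import statistics

# Hypothetical popularity scores for a handful of tracks (not real data).
popularity = [12, 45, 45, 60, 73, 80, 91]

mean = statistics.mean(popularity)                  # arithmetic mean
median = statistics.median(popularity)              # middle value of the sorted data
mode = statistics.mode(popularity)                  # most frequent value
variance = statistics.pvariance(popularity)         # population variance
std_dev = statistics.pstdev(popularity)             # square root of the variance
value_range = max(popularity) - min(popularity)     # max minus min
quartiles = statistics.quantiles(popularity, n=4)   # three cut points

print(mean, median, mode, variance, std_dev, value_range, quartiles)
```

Note how the standard deviation is expressed in the same unit as the data, while the variance is in squared units.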
Day 9. Learn about more complex statistics measures: skewness and kurtosis.
Descriptive Statistics, Towards Data Science
Probability distributions, Towards Data Science (only if you do not know what a probability is)
Shape of data: Skewness and Kurtosis, Analytics Vidhya
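Skewness and kurtosis from Day 9 can be written in terms of central moments. The sketch below uses the simple moment-based (population) definitions; other tools may apply bias corrections, so small differences are to be expected. The sample lists are made up.

```python
def central_moment(values, order):
    """Average of the deviations from the mean raised to the given power."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** order for v in values) / len(values)

def skewness(values):
    """Third standardized moment: 0 for symmetric data, > 0 for a long right tail."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Fourth standardized moment minus 3, so a normal distribution scores 0."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 2, 10]

print(skewness(symmetric), skewness(right_skewed))
print(excess_kurtosis(symmetric))
```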
Day 10. Learn about the Data Explorer node to calculate the basic descriptive statistics measures of the dataset in tracks.csv.
Data Explorer: Interactive univariate data exploration, KNIMETV
Day 11. Explore the interactive view of the Data Explorer node. What’s the average popularity of the songs in the dataset? The highest danceability? How many missing values in the feature “key”? Investigate the effect on the output data of excluding a column in the interactive view.
Data Explorer: Interactive univariate data exploration, KNIMETV
Day 12. Investigate the difference between zeros, missing values, infinity, and NaN. In KNIME Analytics Platform, missing values are represented by a red question mark.
Inf, NaN, and Null, Wiki Analytica
Missing Value Handling, KNIME Hub
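The Day 12 distinction can also be tried out in plain Python. This is only an analogy: None stands in here for a KNIME missing value, and the small list of values is hypothetical.

```python
import math

zero = 0.0                    # a real, known value
missing = None                # no value recorded at all
not_a_number = float("nan")   # result of an undefined numeric operation
infinity = float("inf")       # larger than any finite number

print(zero == 0)                      # zero is an ordinary number
print(missing is None)                # missing is the absence of a value
print(not_a_number == not_a_number)   # NaN is not even equal to itself
print(math.isnan(not_a_number), math.isinf(infinity))

# Statistics should skip missing values and NaN, but keep zeros.
values = [3.0, None, 0.0, float("nan"), 5.0]
clean = [v for v in values if v is not None and not math.isnan(v)]
mean_clean = sum(clean) / len(clean)
print(clean, mean_clean)
```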
3. Histograms
Did you notice the histograms in the last column on the right of the interactive view of the Data Explorer node? In this part of the project we will deepen our knowledge of histograms.
Day 13. Learn more about Histograms.
Histograms, Math is Fun
Histograms, Laerd Statistics
A complete guide to histograms, Data Tutorials
Day 14. Introduce a Histogram node into the workflow. Build the histogram of feature “loudness” for occurrences on quantile bins first and on a fixed number of bins later. Why doesn't it make sense to build a histogram of occurrences on quantile bins?
Comparison charts - Histogram and Bar Chart, KNIME Hub
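A small Python sketch may help with the Day 14 question. It counts occurrences in quantile bins and in equal-width bins for the same data. The loudness values are invented, and the binning helper is a hypothetical illustration, not how the Histogram node is implemented.

```python
import statistics
from bisect import bisect_right

# Hypothetical loudness values in dB, sorted for readability.
loudness = [-30.0, -22.0, -18.0, -15.0, -12.0, -11.0,
            -10.0, -9.5, -9.0, -8.5, -8.0, -7.0]

def count_per_bin(values, inner_edges):
    """Count values per bin; inner_edges are the sorted boundaries between bins."""
    counts = [0] * (len(inner_edges) + 1)
    for value in values:
        counts[bisect_right(inner_edges, value)] += 1
    return counts

# Four quantile bins: the boundaries are the quartiles of the data.
quantile_edges = statistics.quantiles(loudness, n=4)
quantile_counts = count_per_bin(loudness, quantile_edges)

# Four equal-width bins between the minimum and the maximum.
low, high = min(loudness), max(loudness)
width = (high - low) / 4
fixed_edges = [low + width * i for i in range(1, 4)]
fixed_counts = count_per_bin(loudness, fixed_edges)

print(quantile_counts)
print(fixed_counts)
```

In this sample, the quantile bins all contain the same number of values, so their bars say nothing about the shape of the distribution, while the equal-width bins reveal where the values concentrate.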
Day 15. Build Histograms with Histogram nodes for the other features as well, like liveness, valence, instrumentalness, danceability, tempo, energy, speechiness, acousticness, popularity, key, loudness, and duration.
Day 16. Learn what a metanode and a component are. Create a component with all Histogram nodes, execute it and open its composite view.
Component Configuration, KNIMETV
KNIME Components Guide, KNIME Documentation
Day 17. Arrange the Histogram nodes in the component composite view using the Layout button in the toolbar. Add a title to the component view via a Text Output Widget node. Reshape the layout of the composite view.
Layout button, KNIME Documentation
Day 18. Upload your current workflow onto the KNIME Hub; that is, copy your workflow onto your public space in the My-KNIME-Hub folder in the KNIME Explorer. Then, open the workflow on the KNIME Hub. Explore the KNIME Hub for other contributions to #66daysofdata. Is yours there?
The KNIME Hub, KNIMETV
The EXAMPLES server and the KNIME Hub, KNIMETV
The KNIME Hub - Share and Collaborate, KNIME Blog
Introducing More KNIME Hub Features, KNIME Blog
4. Date Standardization
The task of this part is date standardization. Let’s fix the release_date field. Some dates have format yyyy-MM-dd (
Day 19. Investigate the String Manipulation node and its functions, especially length(), replace(), and join().
Day 20. Investigate row filtering (Row Filter) and row splitting (Row Splitter) nodes. Investigate how to keep and exclude rows, how to filter based on patterns or on numerical ranges or on missing values. Separate rows in: original dataset to have all rows with only
What is Row Filtering?, KNIMETV
ETL with KNIME: Row Filter with Pattern Matching, KNIMETV
ETL with KNIME: Row Filter based on Numerical Intervals and Missing Values, KNIMETV
ETL with KNIME: Row Filter based on RowID, KNIMETV
ETL with KNIME: Advanced Row Filtering, KNIMETV
ETL with KNIME: Advanced Row Filter for Special Data Types, KNIMETV
All Row Filters of KNIME, t.b.a.
Day 21. Append “-01” in release_date where needed to always have dates in format yyyy-MM-dd and reassemble all pieces together via the Concatenate node.
ETL with KNIME. What is Concatenation, KNIMETV
ETL with KNIME. The Concatenate node, KNIMETV
Day 22. Transform all values in release_date column into Date&Time object. Check that no release_date values are missing.
Date&Time Integration, KNIME Blog
Explore using Date&Time formats …, KNIME Blog
String to Date&Time node, KNIMETV
Day 23. Extract year from release_date. Wrap up all nodes from this section into a metanode.
Extract Date&Time Fields, KNIMETV
Metanodes to clean up workflows, KNIMETV
Metanodes or Components?, KNIME Blog
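The logic of Days 21 to 23 can be sketched in a few lines of Python. This is an illustration under an assumption: it supposes that the shorter release_date values look like yyyy-MM or yyyy, and the three sample strings are made up. The function name standardize_release_date is hypothetical.

```python
from datetime import date

def standardize_release_date(raw):
    """Append "-01" for each missing part so the result is always yyyy-MM-dd."""
    parts = raw.strip().split("-")
    while len(parts) < 3:
        parts.append("01")
    return "-".join(parts)

# Hypothetical raw values in three assumed formats.
raw_dates = ["1922-02-22", "1985-06", "2001"]

standardized = [standardize_release_date(raw) for raw in raw_dates]
parsed = [date.fromisoformat(text) for text in standardized]  # string to date object
years = [d.year for d in parsed]                              # extract the year field

print(standardized)
print(years)
```

In the workflow, the same three steps map to string manipulation, string to Date&Time conversion, and field extraction.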
5. Ungrouping and Aggregations
The task of this part is ungrouping and aggregation. We want to join the tracks.csv dataset with the artist features from the files artist-uris.csv and artist.csv, and extract the number of tracks for each artist in the dataset.
Day 24. The id_artists column in the tracks dataset is a String, including one or more artists. To assign each track to each artist (if more than one), we need to split them and create a new row for each artist. Investigate the Cell Splitter node, especially the output (set, list, new columns). Investigate the Collection type for columns. Using the Ungroup node, disaggregate the dataset so that for each row there is only one artist and the corresponding track.
Collection Cookbook, t.b.a.
The Ungroup node, t.b.a.
Working With Collections - Collection Types, KNIME Hub
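To picture what splitting and ungrouping do in Day 24, here is a minimal Python sketch. The track names, the artist ids, and the comma-separated layout of id_artists are assumptions for illustration; the actual column may use a different delimiter or brackets.

```python
# Hypothetical tracks with one or more artist ids in a single string.
tracks = [
    {"track": "Song A", "id_artists": "a1"},
    {"track": "Song B", "id_artists": "a1, a2"},
    {"track": "Song C", "id_artists": "a2, a3, a1"},
]

# Split the string into a list (cell splitting), then emit one row
# per list element (ungrouping).
ungrouped = [
    {"track": row["track"], "id_artist": artist_id.strip()}
    for row in tracks
    for artist_id in row["id_artists"].split(",")
]

for row in ungrouped:
    print(row)
```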
Day 25. Read the artist file artist-uris.csv and extract artist ids. Read the second artist file artist.csv, and use the Transformation tab to retain only the columns with artist name and artist popularity information. Convert the popularity column to integer.
Double to Int node, KNIME Hub
Day 26. Join the data for each artist from the artist-uris.csv and the artist.csv. Join the resulting data table with the disaggregated tracks dataset (output table of the Ungroup node).
ETL with KNIME. What is a Join operation?, KNIMETV
Day 27. Study the GroupBy node for all aggregations. Study how to build groups of data and what metrics can be calculated on them. In the joined table with artists and tracks, use the GroupBy node to count the number of tracks for each artist plus artist popularity and their period of activity from the first release date to the last release date. Using a second GroupBy node, count again the number of tracks for each artist across all years.
What is data aggregation?, KNIMETV
Basic Aggregations with GroupBy node, KNIMETV
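The first aggregation of Day 27 (track count plus first and last release year per artist) can be sketched in plain Python as follows. The rows are invented and the field names are hypothetical; the GroupBy node offers many more aggregation methods.

```python
# Hypothetical ungrouped rows: (artist id, track name, release year).
rows = [
    ("a1", "Song A", 1990),
    ("a1", "Song B", 1995),
    ("a2", "Song B", 1995),
    ("a2", "Song C", 2003),
    ("a1", "Song C", 2003),
    ("a3", "Song C", 2003),
]

summary = {}
for artist, _track, year in rows:
    # One group per artist, created the first time the artist appears.
    stats = summary.setdefault(
        artist, {"tracks": 0, "first_year": year, "last_year": year}
    )
    stats["tracks"] += 1                                   # count aggregation
    stats["first_year"] = min(stats["first_year"], year)   # minimum aggregation
    stats["last_year"] = max(stats["last_year"], year)     # maximum aggregation

for artist, stats in sorted(summary.items()):
    print(artist, stats)
```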
6. Plots and Charts - Univariate Analysis
Some more visualizations! Let’s build a composite view of the top 20 most prolific artists of all time … We mean … of the whole dataset!
Day 28. Using the output table of the first GroupBy node (the aggregated artist, track and popularity data), select the top k (k = 20) most prolific artists ever, that is, the ones with the highest track total count. Sort by track count and rename columns with meaningful names.
Top k Selector node, t.b.a.
How to sort data with KNIME, YouTube (NickyDee)
How to rename columns in your dataset …, YouTube (NickyDee)
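Selecting the top k rows, as in Day 28, amounts to sorting by track count and keeping the first k entries. A minimal Python sketch with made-up artists and counts, and k = 2 instead of 20 to keep the output short:

```python
import heapq

# Hypothetical track counts per artist (not real data).
track_counts = {"Artist A": 120, "Artist B": 340, "Artist C": 75, "Artist D": 210}

k = 2
# nlargest returns the k entries with the highest count, already sorted descending.
top_k = heapq.nlargest(k, track_counts.items(), key=lambda item: item[1])

print(top_k)
```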
Day 29. Learn more about univariate, bivariate, and multivariate visualizations.
Visual Data Exploration in Three Steps, KNIME Blog
Data Explorer. Interactive Univariate data exploration, KNIMETV
Scatter Plot. Interactive bivariate visual exploration, KNIMETV
Day 30. Put the top k most prolific artists in an interactive Table View. Color rows by artist.
Color Manager node, KNIME Hub
Data Visualization and Interactive Data Exploration with KNIME, KNIMETV
Table View node with colored rows, t.b.a.
Day 31. Build a composite view that for all selected rows in the table shows the artist detail (name, # of tracks, popularity, period of activity from: to:) in a Tile View on the right.
Components composite views, KNIME Documentation
Visualizing clusters via dendrogram, heatmap, tile view, and CSS styles, KNIMETV
Table View and Tile View node, t.b.a.
Day 32. Display the number of tracks by artist in a Pie Chart. Make sure that the same color code by artist is used as in the table and tile view.
Data Visualization and Interactive Data Exploration with KNIME, KNIMETV
How to create KNIME Charts: Pie charts and bar charts, YouTube (Yoda Learning Academy)
Day 33. Display the number of tracks by artist in a monochromatic Bar Chart.
How to create KNIME Charts: Pie charts and bar charts, YouTube (Yoda Learning Academy)
Bar Chart Examples, KNIME Hub
Day 34. Display the number of tracks by artist in a Bar Chart with the same color scheme used for the table. Transform the data to keep color mapping and synchronous interactivity for all items in the composite view. Learn more about the Pivoting node.
The Pivoting node, KNIMETV
Custom Bar Chart Colors for each Bin, KNIME Hub
How to Assign Colors to Bars in a Bar Chart - Three Shades of Green, KNIME Blog
Assigning colours to Bar Charts, t.b.a.
Day 35. Investigate interactivity in the composite view. Learn more about composite views becoming dashboards both locally and on the KNIME WebPortal.
How to Create an Interactive Dashboard in Three Steps with KNIME Analytics Platform, Low Code for Advanced Data Science
The KNIME WebPortal, KNIMETV
KNIME WebPortal User Guide, KNIME Documentation
7. Plots and Charts - Multivariate Analysis
Pie charts and bar charts show and compare aggregated values. To investigate relationships among features, though, we need to use multivariate types of visualization. The most commonly used visualization of this kind is surely the scatter plot. Since those visualizations do not show aggregated values like the bar and pie charts, but potentially all data points in your data set, sampling is often required.
Day 36. Using the joined tracks and artist features table, create a track popularity class (high, medium, low) with the Rule Engine node, perform different sampling strategies and display percentages for each popularity class and strategy (original data, with random sampling, and with stratified sampling) in a table view.
Data Manipulation: Numbers, Strings, and Rules, KNIMETV
Sampling strategies, t.b.a.
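The idea behind Day 36 can be sketched in Python: a rule assigns a class, and the class percentages are compared across sampling strategies. The thresholds 30 and 60, the synthetic popularity values, and the helper names are all assumptions made for this illustration.

```python
import random
from collections import Counter, defaultdict

def popularity_class(popularity):
    """Rule-style mapping; the thresholds 30 and 60 are hypothetical."""
    if popularity >= 60:
        return "high"
    if popularity >= 30:
        return "medium"
    return "low"

def class_shares(values):
    """Fraction of values falling into each popularity class."""
    counts = Counter(popularity_class(v) for v in values)
    return {label: counts[label] / len(values) for label in ("low", "medium", "high")}

def stratified_sample(values, fraction, rng):
    """Sample the same fraction from every class, preserving class proportions."""
    groups = defaultdict(list)
    for value in values:
        groups[popularity_class(value)].append(value)
    sample = []
    for members in groups.values():
        sample.extend(rng.sample(members, round(len(members) * fraction)))
    return sample

rng = random.Random(0)
popularity = list(range(100)) * 10            # synthetic data, 1000 values

random_sample = rng.sample(popularity, 100)   # plain random sampling
stratified = stratified_sample(popularity, 0.1, rng)

original_shares = class_shares(popularity)
random_shares = class_shares(random_sample)
stratified_shares = class_shares(stratified)

print(original_shares)
print(random_shares)
print(stratified_shares)
```

Stratified sampling reproduces the original class proportions, while plain random sampling only approximates them.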
Day 37. Scatter Plot is probably the most common way to visualize and investigate relationships among pairs of features. Let’s learn more about scatter plots, how to implement them in KNIME, and how to carry out a visual exploration of the data. Check various pairs of features. Which feature creates a pattern with which feature? Try “loudness” vs “explicit” and see if there is some form of correlation. Add a Table View to visualize details of selected points only in the scatter plot.
A Complete Guide to Scatter Plots, Data Tutorials
Scatter Plot: Interactive Bivariate Visual Exploration, KNIMETV
Day 38. With the data table created using stratified sampling, let’s prepare the data to use the Sunburst Chart to visualize the feature proportions that lead to high popularity. The Sunburst Chart node requires nominal values, so the numerical columns must be binned or, if already binned but still numerical, converted to strings.
Binning data, YouTube (Caleb Curry)
Binning nodes: Auto-binner and Numeric Binner, t.b.a.
Math Formula node in Data Manipulation: Numbers, Strings, and Rules, KNIMETV
Day 39. Binned buckets might not have a distinctive name. Let’s loop through the binned columns to change the bin names to
What is a loop, KNIMETV
How to build a generic loop, KNIMETV
Loop Commands, KNIMETV
The Group Loop Start node, KNIMETV
Loop End nodes, KNIMETV
KNIME Flow Control Guide, KNIME Documentation
Day 40. We can now apply the Sunburst Chart to visualize the proportion of each feature to reach high popularity.
Sunburst, From Data to Viz
Three steps to build an interactive board, KNIMETV
Data Visualization and Interactive Exploration with KNIME, KNIMETV
Day 41. We now want to apply a Heatmap to all numerical features. Since the Heatmap visualizes numerical values with a color gradient built on the [min, max] interval, for better visualization we need normalized features first.
Normalization Techniques at a Glance, Google Data Prep
KNIME Normalize and Denormalize Data, YouTube (NickyDee)
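Min-max normalization, one common choice for the Day 41 step, rescales every feature to the [0, 1] interval so that all features share the same color gradient. A minimal Python sketch with invented tempo values; the function name is hypothetical.

```python
def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    low, high = min(values), max(values)
    if high == low:
        # A constant column carries no contrast; map everything to 0.
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]

# Hypothetical tempo values in beats per minute.
tempo = [60.0, 90.0, 120.0, 180.0]
normalized_tempo = min_max_normalize(tempo)

print(normalized_tempo)
```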
Day 42. Let’s apply a Heatmap to all numerical normalized features.
A Complete Guide to Heatmaps, Data Tutorials
Heatmap node, t.b.a.
Day 43. Visualize all feature contributions to popularity classes via a Parallel Coordinates Plot.
Parallel coordinates plots, From Data to Viz blog
Data Visualization and Interactive data exploration with KNIME, KNIMETV
Parallel Coordinates Plot node, t.b.a.
Day 44. A full component should be dedicated to Box Plots. Learn more about Box Plots & Conditional Box Plots. With the data table created using stratified sampling, you can box plot single features one by one or you can box plot multiple features all together. In this last case, you must pay attention to the different ranges. You could normalize of course, but then you lose the interpretability of the box plot.
Understanding Box Plots, Towards Data Science
Box Plot examples, KNIME Hub
Box Plots and Conditional Box Plots, t.b.a.
Conditional Box Plot, KNIME Hub
Four Techniques for outlier detection, KNIME Blog
8. Plots and Charts - Time Plots
Some more charts and plots. We use this section to also introduce the concept of Flow Variables.
Day 45. Learn more about Flow Variables.
Flow Variables, KNIMETV
KNIME Flow Control Guide, KNIME Documentation
KNIME Analytics: Flow Variables, the Red Line, t.b.a.
Day 46. Return to the output table of the second GroupBy node where we counted the number of tracks for each artist across all years. In a selected time window (
Sharing components, KNIME Documentation
Table Column to Variable node, t.b.a.
Day 47. Make the previous component parametric by adding a configuration window where you can select the time window (
Component Configurations, KNIMETV
Custom components configuration dialogs, KNIME Documentation
Day 48. Plot yearly number of tracks by artist in a Line Plot. Add color for each line, i.e. for each artist. Make the plot subtitle parametric for the selected time window with the help of the String Manipulation (Variable) node. Inspect lines one by one and all together. Find out the artists who have been most consistently active across the years of your time window.
Line plot in KNIME Introduction. Part 15. Visualizations, YouTube (Scott McLeod)
Time Plots: Line Plot, t.b.a.
Day 49. Plot yearly number of tracks by artist in a Stacked Area Chart. Add color for each area, i.e. for each artist. Make the plot subtitle parametric for the selected time window. Explore interactivity, especially how to add and remove areas for artists.
What is a stacked area chart, From Data to Viz
Data Visualization & Interactive Data Exploration with KNIME, KNIMETV
Text Stream Visualization, KNIME Blog
Day 50. Create a Bar Chart to inspect evolution over time. Visualize the number of tracks by artist over years in a bar chart. Add color for each bar, i.e. for each artist. Make the plot subtitle parametric for the selected time window. Inspect bars one by one, in small groups, and all together. Find out the most prolific artist in a specific year of your time window.
Day 51. Let’s conclude this part with some free JavaScript code. Let’s investigate the Generic JavaScript View node. Do not worry, you do not need to code. Just explore the KNIME Hub for workflows and components based on the Generic JavaScript View node. Drag&drop the Animated Bar Chart component from the KNIME Hub into your workflow and study, for example, the evolution of artists and track count throughout the years. Wrap all the time plots in a component and investigate selections of year and artists in the composite view.
KNIME JavaScript Views, KNIME Hub
Example components and visualizations based on free JavaScript code:
FIFA World Cup, KNIME Hub
Animated Bar Chart, KNIME Hub
Epidemiological data from Zika virus, KNIME Hub
9. Plots and Charts - Control
Day 52. Learn about Guided Analytics and Widget nodes
Widget nodes in KNIME Component Guide, KNIME Documentation
Widget vs. Configuration nodes, t.b.a.
Principles of Guided Analytics, KNIME Blog
Day 53. Let’s continue using the output table of the second GroupBy node where we counted the number of tracks for each artist across all years. Build a guided analytics sequence with a component including a Widget-based selection framework to select the
Integer Widget node, KNIME Hub
Column Filter Widget node, KNIME Hub
Value Selection Widget node, KNIME Hub
Widget nodes, t.b.a.
Day 54. Using the top k artists extracted from the time window selection used in the first component, investigate the usage of the Interactive Range Slider Filter Widget node in conjunction with the Scatter Plot or the Stacked Area Chart node. Build a component with a scatter plot or stacked area chart and control the number of points via the Interactive Range Slider Filter Widget node, for example by controlling the size of the year interval. Alternatively, use the data table created with stratified sampling in section 7 to build a component with an interactive multivariate scatter plot and table view in conjunction with the Interactive Range Slider Filter Widget node to control for the year interval.
Interactive Range Slider Filter Widget node, KNIME Hub
Interactive Filter Widget nodes, t.b.a.
Day 55. Investigate the usage of the Refresh Button Widget node. After the component that uses Widget nodes to select the time window (e.g., 1970-1980) and extracts the top k artists, build a dashboard with a Widget-based framework to select the top k artists, a stacked area chart, a line plot, a bar chart, and a Refresh button.
Eight Data App Designs with the New Refresh Button, KNIME Blog
Example workflow collection using the Refresh Button Widget node, KNIME Hub
Day 56. Combine the Refresh button node with the Interactive Range Slider Filter Widget node and with a stacked area chart to visualize extracted top k artists in a selected time window within a dashboard. Watch how the Refresh Button Widget node could work on the KNIME WebPortal.
Twitter analysis Data App, KNIMETV
Machine learning Data App, KNIMETV
10. Covariance and Correlation
Creating bar charts, pie charts, or time plots is definitely a great way to visually explore a dataset and gain precious insights. However, sometimes adding a tiny statistical flavor to the analysis improves our understanding of the data and, more importantly, enables a sounder identification of relationships between the input features.
Day 57. Learn about covariance, linear correlation, and rank correlation. Focus on the similarities and differences between them, as well as on their strengths and weaknesses.
5 Things You Should Know About Covariance, Towards Data Science
How to measure the relationship between variables, Towards Data Science
Why correlation does not imply causation?, Towards Data Science
(Bonus task) Work out the math behind covariance and linear correlation: it will help you understand their differences and similarities. Pick two numeric input features of your choice (e.g., “loudness” and “popularity”), select 5 observations for each feature, and compute the covariance and the linear correlation manually. You can double-check your results with an online calculator.
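As a way to double-check the manual computation, here is a minimal Python sketch of the sample covariance and the Pearson (linear) correlation. The five loudness and popularity observations are invented, and the sample (n - 1) version of the covariance is an assumption; a calculator that divides by n will give a different covariance but the same correlation.

```python
import math

# Five hypothetical observations per feature (not taken from tracks.csv).
loudness = [-12.0, -10.0, -8.0, -6.0, -4.0]
popularity = [20.0, 35.0, 30.0, 50.0, 65.0]

def mean(values):
    return sum(values) / len(values)

def sample_covariance(x, y):
    """Average product of the deviations from the means, divided by n - 1."""
    mean_x, mean_y = mean(x), mean(y)
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (len(x) - 1)

def pearson_correlation(x, y):
    """Covariance divided by the product of the two standard deviations."""
    return sample_covariance(x, y) / math.sqrt(
        sample_covariance(x, x) * sample_covariance(y, y)
    )

covariance = sample_covariance(loudness, popularity)
correlation = pearson_correlation(loudness, popularity)

# Rescaling one feature changes the covariance but not the correlation.
rescaled = [10 * v for v in loudness]
covariance_rescaled = sample_covariance(rescaled, popularity)
correlation_rescaled = pearson_correlation(rescaled, popularity)

print(covariance, correlation)
print(covariance_rescaled, correlation_rescaled)
```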
Day 58. Compute the covariance for all pairs of the following features: “energy”, “loudness”, “danceability”, “valence” and “popularity” in the tracks.csv dataset using the GroupBy node. Build a covariance matrix both with unnormalized and normalized (z-score normalization) features separately.
Aggregations, Aggregations, Aggregations! — Part II (section: “1. Statistical Aggregations: Covariance vs. Correlation”), KNIME Blog
GroupBy node for statistical aggregations, KNIME Hub
Day 59. Compute the linear correlation for all pairs of meaningful numeric features in the track dataset using the Linear Correlation node. Inspect the correlation matrix view, and the results of the three output ports. What are the top four positively correlated feature pairs? Compare the values in the linear correlation matrix with those of the normalized covariance matrix. What do you observe?
Linear Correlation node, KNIME Hub
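For the comparison asked in Day 59, the following Python sketch computes the covariance of two z-score normalized features next to their linear correlation. The energy and loudness values are made up, and the z-scores use the sample standard deviation (n - 1), which is an assumption about the normalization.

```python
import math

def mean(values):
    return sum(values) / len(values)

def sample_covariance(x, y):
    mean_x, mean_y = mean(x), mean(y)
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (len(x) - 1)

def z_scores(values):
    """Subtract the mean and divide by the sample standard deviation."""
    center = mean(values)
    spread = math.sqrt(sample_covariance(values, values))
    return [(v - center) / spread for v in values]

# Hypothetical observations (not taken from tracks.csv).
energy = [0.2, 0.4, 0.5, 0.7, 0.9]
loudness = [-14.0, -11.0, -9.0, -6.0, -5.0]

raw_covariance = sample_covariance(energy, loudness)
normalized_covariance = sample_covariance(z_scores(energy), z_scores(loudness))
correlation = raw_covariance / math.sqrt(
    sample_covariance(energy, energy) * sample_covariance(loudness, loudness)
)

print(raw_covariance)
print(normalized_covariance, correlation)
```

With this normalization, the covariance of the z-scores and the linear correlation agree up to rounding.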
Day 60. Compute the rank correlation for all pairs of meaningful numeric features in the track dataset using the Rank Correlation node. Inspect the correlation matrix view, and the results of the three output ports. What are the top four positively correlated feature pairs? Do you observe similarities with the linear correlation matrix? If yes, why do you think that’s the case?
Rank Correlation node, KNIME Hub
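One common rank correlation measure is Spearman's, which is the linear correlation computed on ranks instead of raw values. The sketch below assumes that measure and uses invented data; the Rank Correlation node also offers other measures.

```python
import math

def average_ranks(values):
    """Rank values from 1 upwards; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks

def pearson(x, y):
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    return cross / math.sqrt(spread_x * spread_y)

def spearman(x, y):
    """Pearson correlation applied to the ranks of the values."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical monotonic but non-linear relationship.
x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 125]

print(pearson(x, y))    # below 1, the relationship is not a straight line
print(spearman(x, y))   # the ordering of the two features is identical
```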
Day 61. Visualize in a component composite view the unnormalized and normalized covariance matrices side-by-side both using a Heatmap node and a Table View node. Inspect in a different component composite view the linear correlation for the top four positively correlated features using the Scatter Plot node for each pair.
11. Text Visualization
There are not only numbers in data science! Besides numerical data, we have to deal with texts, images, networks, and even more diverse data types. This section covers a few items on text visualization.
Day 62. Get familiar with the KNIME Text Processing Extension, the Document object, the Term object and all text processing operations.
KNIME Text Processing Extension, KNIME Hub
From Data Collection to Text Mining Interpretation, KNIME Blog
Day 63. Read the IMDb-sample.csv file from the KNIME Hub. This dataset collects 2000 movie reviews written by users and contains sentiment annotation (i.e., positive or negative) for each review. Convert all texts to Documents and perform some basic text cleaning.
Common Steps in a Text Mining Project, KNIMETV
All you need to know about text preprocessing for NLP and Machine Learning, KDNuggets
Day 64. Using the file MPQA-OpinionCorpus-PositiveList.csv for the list of positive words in the English language and MPQA-OpinionCorpus-NegativeList.csv for the list of negative words in the English language (from the KNIME Hub or the latest version from the MPQA site), tag words in the texts as positive or negative.
Document Tagging: Introduction, KNIMETV
Domain and Custom Tagging, KNIMETV
Dictionary Tagger node, KNIME Hub
Day 65. Let’s transform each text into a Bag of Words and let’s calculate all Term Frequencies.
How important are the words in your text data? Tf-Idf answers…, Towards Data Science
Bag of Words, YouTube (Quantopian)
Bag of Words and Frequencies, KNIMETV
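A bag of words is simply a count of each term in a text, and the relative term frequency divides that count by the total number of terms. Here is a minimal Python sketch of Day 65. The review sentence is invented and the tokenization (lowercase, letters and apostrophes only) is a simplifying assumption.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, split it into terms, and count each term."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def relative_term_frequency(bag):
    """Term count divided by the total number of terms in the document."""
    total = sum(bag.values())
    return {term: count / total for term, count in bag.items()}

# Hypothetical review text (not taken from IMDb-sample.csv).
review = "A great movie with a great cast and a weak ending"

bag = bag_of_words(review)
frequencies = relative_term_frequency(bag)

print(bag.most_common(3))
print(frequencies["great"])
```

These frequencies are what a word cloud typically uses to size each term.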
Day 66. We have the list of words, we have their frequencies, let’s visualize them in a word cloud.
What are word clouds?, BoostLabs Blog
Word cloud with Additional Visualizations, KNIMETV
Tag Cloud, KNIME Hub
12. Graph Visualization (Bonus)
As a bonus, let’s investigate something more complicated: how to visualize interactions among users of a community, like the retweeting patterns around a hashtag on Twitter.
Bonus. First, let’s learn what a Network Graph (or sometimes just a graph) is.
Network Diagram, From Data to Viz
Graph Theory, Wikipedia
Bonus. Retrieve tweets through the Twitter API around a given hashtag, like #knime. Alternatively, you can download a TwitterData.table around #knime for the time window July 26th - August 3rd, 2021.
Confirm that you are a robot, Low Code for Advanced Data Science
Twitter API Connectors Extension, KNIME Hub
KNIME Twitter nodes, t.b.a.
Twitter meets PostgreSQL, KNIME Blog
Bonus. Shape your Twitter data as an adjacency matrix:
The #KNIME Connection. Where are you?, KNIME Blog
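An adjacency matrix has one row and one column per user, and each cell counts the interactions from the row user to the column user. A minimal Python sketch with made-up user names and retweet pairs, assuming each interaction is a (retweeter, original author) pair:

```python
# Hypothetical interactions: (user who retweets, user being retweeted).
retweets = [
    ("alice", "bob"),
    ("carol", "bob"),
    ("alice", "bob"),
    ("bob", "carol"),
]

# One row and one column per user, in alphabetical order.
users = sorted({user for pair in retweets for user in pair})
index = {user: position for position, user in enumerate(users)}

matrix = [[0] * len(users) for _ in users]
for retweeter, author in retweets:
    matrix[index[retweeter]][index[author]] += 1

print(users)
for user, row in zip(users, matrix):
    print(user, row)
```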
Bonus. The KNIME Network Mining Extension deals with graphs and network objects. You need to create a network of interactions among Twitter users before visualizing it. Create the network diagram from the adjacency matrix of Twitter interactions built in the previous step.
KNIME Network Mining Extension, KNIME Hub
Social Network Analysis, t.b.a.
Bonus. Visualize the network object of Twitter user interactions with a graph.
Data Visualization & Interactive Data Exploration with KNIME, KNIMETV
Data Visualization in KNIME, YouTube
Network Viewer node, KNIME Hub
Bonus. Another way of visualizing interactions is the chord diagram. Let’s learn what a chord diagram is and how to build one in KNIME using the Generic JavaScript View node.
Chord diagram, From Data to Viz
