So, what lies ‘under the hood’ of the Kapiche analytical platform? What does the ‘engine’ consist of? What is its capacity? What are its performance parameters? How does it methodologically compare with the alternatives?
At the core of Kapiche’s approach to unstructured data analytics is a unique Topic Modelling algorithm and data management platform developed by the Kapiche technology team. It is this platform which provides the foundation for the Kapiche suite of analytical tools used to analyse both structured and unstructured data.
Topic Modelling can be defined as a form of text mining or a type of ‘natural language processing’, natural language being the language we use in everyday communication, whatever that language may be. It’s a way of identifying patterns and relationships in a set of unstructured data.
In identifying these patterns and relationships, Kapiche is able to cluster words, across the entire dataset, to form ‘topics’. In fact, Topic Modelling has been formally described as “a method for finding and tracing clusters of words (called “topics” in shorthand) in large bodies of texts.”
In creating these clusters of words or topics, the frequency of all words in the documents and the co-occurrence between all words are considered. The novel aspect of the Kapiche Topic Modelling algorithm is its ability to determine which co-occurrence relationships are important rather than just noise, and hence which words can be considered influential. For example, the word dog might influence the presence of a host of other terms like kennel, fleas, bark, and lead in a document. The measure of individual word frequency and influence forms the basis of our Topic Modelling algorithm.
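To make the idea of frequency and co-occurrence concrete, the Python sketch below counts how many documents each word appears in and how many documents each pair of words shares. It then scores pairs with pointwise mutual information (PMI), a standard association measure used here purely as a stand-in. It is not the Kapiche algorithm, whose details are not given in this article, and the toy documents and function names are illustrative assumptions.

```python
import math
from collections import Counter
from itertools import combinations

def cooccurrence_stats(documents):
    """Count, per word and per unordered word pair, how many documents contain it."""
    word_df = Counter()
    pair_df = Counter()
    for text in documents:
        # Use each distinct word once per document, in sorted order,
        # so every pair is stored with a consistent (a, b) ordering.
        words = sorted(set(text.lower().split()))
        word_df.update(words)
        pair_df.update(combinations(words, 2))
    return word_df, pair_df

def pmi(word_a, word_b, word_df, pair_df, n_docs):
    """Document-level pointwise mutual information (stand-in measure, an assumption)."""
    pair = tuple(sorted((word_a, word_b)))
    joint = pair_df[pair]
    if joint == 0:
        return float("-inf")  # the two words never share a document
    return math.log2(joint * n_docs / (word_df[word_a] * word_df[word_b]))

# Toy corpus (hypothetical data).
docs = [
    "the dog slept in the kennel",
    "the dog has fleas",
    "the dog began to bark",
    "the cat slept in the sun",
]
word_df, pair_df = cooccurrence_stats(docs)

print(pmi("dog", "kennel", word_df, pair_df, len(docs)))  # positive: informative pairing
print(pmi("dog", "the", word_df, pair_df, len(docs)))     # zero: 'the' appears everywhere
```

In this toy corpus the word the co-occurs with dog more often than kennel does, yet it scores no higher than chance because it appears in every document. That is one simple way of separating an important co-occurrence from noise.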
There are no limits to the types of documents Kapiche can ingest. Our products support a range of file formats (such as CSV, XLSX, DOCX, PDF, and TXT), and you aren’t restricted to files. Voice recordings, say from a call centre, focus groups, video, or other presentation technology, can also be readily converted to text for analysis. Integrating with on-premise third-party database technologies is also popular, and we aren’t just talking about unstructured data. The Kapiche suite of analytical tools will use any structure present in your data to add further information to its outputs. Interrogating a Topic Model using demographic and time-based information often leads to much richer insights and a deeper understanding of your data.
Simplifying the complex structure of natural language by ignoring syntax and grammar and focusing on the frequency of words within documents might sound like a very simplistic approach to creating understanding from a vast amount of textual data. Instead of a properly ordered, grammatically correct sentence, the Kapiche approach slices and dices text into clusters of topics, frequency counts and statistical probability measures. From that perspective, real understanding of what is being communicated, no matter what the content, can be gained. Ignoring syntax and grammar also means our approach is inherently multi-lingual provided we can identify word boundaries in the data.
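As an illustration of what ignoring syntax and grammar means in practice, the following Python sketch reduces a sentence to a ‘bag of words’: a count of each word with the order discarded. The function name and the tokenising rule are assumptions made for this example, not a description of Kapiche’s internals. The regular expression relies on spaces and punctuation separating words, so it would not find word boundaries in languages written without them.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, keep runs of letters as words, and count them."""
    # [^\W\d_]+ matches one or more letters in any script (no digits, no underscore).
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(tokens)

a = bag_of_words("The dog chased the cat.")
b = bag_of_words("The cat chased the dog!")

# Word order and punctuation are discarded, so both sentences give the same counts.
print(a == b)

# The same function works unchanged on other space-delimited languages.
print(bag_of_words("Le chien dort dans la niche"))
```

The two English sentences mean different things but produce identical counts, which is exactly the information this style of analysis gives up in exchange for simplicity and language independence.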
One way to think about how the process of Topic Modelling works is to imagine working through a body of text with a set of highlighter pens. Reading through the text, you might use a different colour for the key words of each topic in the text. When you’re done, you cut out the highlighted terms and group them by colour to form topics.
In Kapiche’s case the computer, not you the user armed with highlighter pens, identifies the occurrence of topics and their subordinate influencers. In Kapiche parlance these topic influencers are known as “Key Terms”. It is the statistical identification of topics and their interplay with key terms that facilitates understanding. Furthermore, the visualisation tools provided in the Kapiche suite of analytical tools open up further dimensions of understanding, such as sentiment identification. Further applications of the analytical results, such as predictive analysis, are also available.
Is Kapiche the only analysis tool to employ Topic Modelling?
Far from it. Approaches such as Latent Dirichlet Allocation (LDA), much favoured by academics as a focus of research, and the earlier, more commercially adopted Latent Semantic Analysis (LSA) have a number of significant supporters. But these approaches do have fundamental flaws when cast into the realities of commercial application. Not the least of these flaws is the need to estimate results, such as the number of topics, before commencing analysis. A more significant issue for these Topic Modelling approaches is the set-up, training and high-level interpretive skills that they require.
Topic Modelling approaches are comparatively recent arrivals in the commercial market. Much of the marketplace still relies on products that compare the contents of dictionary and thesaurus data sets with the contents of the data set to be analysed. Such approaches require that these dictionaries and thesauri be kept up to date with modern idioms such as slang; that discipline-specific vocabularies, for example for pharmaceuticals or engineering, be generated; and that a whole new set of dictionaries and thesauri be made available for each foreign language to which the product is to be applied. Additionally, these approaches require even greater set-up, training and high-level interpretive skills than LSA or similarly based products.
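The maintenance burden of dictionary-based products can be seen in a minimal Python sketch. The lexicon, category names and function below are hypothetical and do not represent any particular vendor’s product. Text is assigned to a category only when it contains a word that someone has already added to the lexicon.

```python
# Hypothetical hand-built lexicon: category -> words that signal it.
LEXICON = {
    "billing": {"invoice", "charge", "refund"},
    "delivery": {"shipping", "courier", "late"},
}

def tag_with_lexicon(text, lexicon=LEXICON):
    """Return the sorted categories whose word lists overlap the text."""
    words = set(text.lower().split())
    return sorted(cat for cat, terms in lexicon.items() if words & terms)

print(tag_with_lexicon("my refund is late"))        # matches both categories
print(tag_with_lexicon("the app keeps glitching"))  # no match: 'glitching' is not listed
```

The second comment is about a real problem, but it goes untagged until someone notices the gap and extends the lexicon, and the whole exercise must be repeated for each new domain and language.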
Gartner recognised that the set-up, training and high-level interpretive skills required by all marketplace products were a significant barrier to the commercial growth of unstructured data analytics. There were simply not enough highly skilled analysts to meet the labour market demands of these products. Gartner saw that the creation of “Citizen Analysts”, equipped with commercial tools that are simple to set up and use, was the answer. Kapiche simultaneously recognised that this was one of a number of key issues hampering organisational adoption.
In response, Kapiche created a suite of analytical tools that is fast, accurate and, in every respect, easy to use. It is for this reason that Gartner awarded Kapiche one of its four 2015 “Cool Vendor Awards”.