Topic Modeling

Unveil the relevant topics in your texts

Topic Modeling is a widely used unsupervised learning task that identifies the hidden thematic structure in a collection of documents. The main goal of this text-mining technique is to find relevant topics that help organize, search, or understand large amounts of unstructured text data. Topic Models assume that any document can be explained as its own mixture of topics, where each topic is a group of co-occurring terms, each with its own probability. BigML can find the topics in small text fragments such as short descriptions, tweets, or emails, as well as in larger collections of documents such as articles, blog posts, or entire books.
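The mixture assumption can be illustrated with a small sketch. The topic names, terms, and probabilities below are made up for illustration and do not come from any BigML model; the code only shows how a term's probability in a document follows from the document's topic weights and each topic's term probabilities.

```python
# Hypothetical toy topics: each maps terms to probabilities that sum to 1.
topics = {
    "machine_learning": {"model": 0.5, "data": 0.3, "training": 0.2},
    "stock_market": {"price": 0.5, "shares": 0.3, "data": 0.2},
}

# Hypothetical document: a mixture of topics whose weights sum to 1.
doc_mixture = {"machine_learning": 0.75, "stock_market": 0.25}


def term_probability(term, mixture, topics):
    """P(term | document) = sum over topics of P(topic | doc) * P(term | topic)."""
    return sum(
        weight * topics[name].get(term, 0.0) for name, weight in mixture.items()
    )


if __name__ == "__main__":
    for term in ("model", "data", "price"):
        print(term, round(term_probability(term, doc_mixture, topics), 4))
```

Note that a term such as "data" can belong to more than one topic, so its probability in the document collects a contribution from each of them.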

Applications of Topic Modeling

Digitization has vastly increased the amount of written material available to businesses and consumers, to the point where handling it manually is no longer feasible. Topic Models help organize, analyze, and understand the insights hidden in any large collection of unstructured text data. Beyond general text mining, Topic Models are useful for detecting instructive structure in genetics, bioinformatics, network analysis, information retrieval, collaborative filtering, and content recommendation, and for assessing document similarity, among other uses. In addition, the extracted topics often prove effective as new input features for training other types of models, e.g., classification, regression, cluster analysis, or anomaly detection.

Best-in-class algorithm

BigML Topic Models are an optimized implementation of Latent Dirichlet Allocation (LDA), one of the best-known probabilistic unsupervised learning methods for determining the topics underlying a collection of documents, whether they are small text fragments like tweets or long articles, papers, or books. BigML Topic Models currently support text in seven languages: English, Spanish, Catalan, French, Portuguese, German, and Dutch.
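To give a feel for the model behind Latent Dirichlet Allocation, here is a minimal sketch of its generative story for one document, using only the Python standard library. It is not BigML's implementation: the vocabulary, the topic-term probabilities, and the prior `alpha` are hypothetical, and the topics are taken as given rather than learned. Training a Topic Model amounts to reversing this process, that is, inferring the topics from observed documents.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topic_term_probs, vocabulary, alpha, length, rng):
    """Generate one document following the LDA generative story."""
    # 1. Draw the document's topic proportions from a Dirichlet prior.
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(length):
        # 2. Pick a topic for this word position according to theta.
        k = rng.choices(range(len(theta)), weights=theta)[0]
        # 3. Pick a term from that topic's term distribution.
        words.append(rng.choices(vocabulary, weights=topic_term_probs[k])[0])
    return theta, words


if __name__ == "__main__":
    # Hypothetical vocabulary and two hypothetical topics over it.
    vocabulary = ["model", "data", "training", "price", "shares"]
    topic_term_probs = [
        [0.5, 0.3, 0.2, 0.0, 0.0],  # a "machine learning"-like topic
        [0.0, 0.2, 0.0, 0.5, 0.3],  # a "stock market"-like topic
    ]
    theta, words = generate_document(
        topic_term_probs, vocabulary, alpha=[0.5, 0.5], length=10, rng=random.Random(0)
    )
    print(theta, words)
```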

Highly interpretable results

BigML provides two original visualizations to help you inspect your Topic Models: the Topic Map and the Term Chart. The Topic Map displays your topics as circles positioned by thematic closeness and gives you an overview of each topic's importance within the training dataset. Because a topic is defined by a group of terms with different probabilities, the Term Chart lets you inspect the most prominent terms in each topic, ranked by probability.

Real-time or customizable Topic Distributions

Once you create a Topic Model, you can use it to compute Topic Distributions for new documents that the model has not seen before. For example, a new document may be 70% about "Machine Learning", 20% about "stock market", and 10% about "startups". A Topic Distribution makes this prediction for a single data instance, while a Batch Topic Distribution does the same for many instances at once. In both cases, BigML returns a set of probabilities for each data instance (one probability per topic) that indicate the relative relevance of each topic for that instance.
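As a rough illustration of what a per-document set of topic probabilities looks like, the sketch below estimates topic proportions for a new document while holding the topics fixed. It uses a simple expectation-maximization loop over mixture weights, which is an assumption made for illustration and not necessarily how BigML computes Topic Distributions. The topics, terms, and probabilities are hypothetical.

```python
def infer_topic_distribution(tokens, topic_term_probs, iterations=50):
    """Estimate topic proportions for one document with topics held fixed."""
    names = list(topic_term_probs)
    # Keep only tokens that at least one topic assigns a non-zero probability.
    known = [
        t for t in tokens
        if any(topic_term_probs[n].get(t, 0.0) > 0 for n in names)
    ]
    weights = {n: 1.0 / len(names) for n in names}
    if not known:
        return weights  # nothing to learn from: fall back to uniform
    for _ in range(iterations):
        totals = {n: 0.0 for n in names}
        for token in known:
            # E-step: how responsible is each topic for this token?
            scores = {
                n: weights[n] * topic_term_probs[n].get(token, 0.0) for n in names
            }
            norm = sum(scores.values())
            for n in names:
                totals[n] += scores[n] / norm
        # M-step: new proportions are the average responsibilities.
        weights = {n: totals[n] / len(known) for n in names}
    return weights


if __name__ == "__main__":
    # Hypothetical topics, each a term -> probability mapping.
    topics = {
        "machine_learning": {"model": 0.5, "data": 0.3, "training": 0.2},
        "stock_market": {"price": 0.5, "shares": 0.3, "data": 0.2},
        "startups": {"founder": 0.6, "funding": 0.4},
    }
    tokens = ["model", "training", "data", "model", "price", "funding"]
    distribution = infer_topic_distribution(tokens, topics)
    for name, probability in sorted(distribution.items(), key=lambda kv: -kv[1]):
        print(f"{name}: {probability:.2f}")
```

The result contains one probability per topic, and the probabilities sum to 1, which matches the shape of the output described above.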

Fully programmable Topic Modeling

In addition to the point-and-click mode on the BigML Dashboard, you can create, configure, update, and use your Topic Models programmatically via the BigML API and bindings. You can use BigML with Python, Node.js, Java, Swift, C#, or other languages. By using the API and bindings, you can easily embed Topic Models in your applications to analyze and discover thematic patterns in your document collections at scale. Topic Models are also supported by WhizzML, our domain-specific language for automating Machine Learning workflows, implementing high-level Machine Learning algorithms, and sharing them with others.
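As a rough idea of the programmatic workflow, the sketch below uses the BigML Python bindings to go from a CSV file to a Topic Distribution for one new document. It assumes the `bigml` package is installed and that credentials are available in the `BIGML_USERNAME` and `BIGML_API_KEY` environment variables. The function name, the CSV path, and the `text` field name are hypothetical, so check the bindings documentation for the exact arguments and response structure before relying on it.

```python
def topic_distribution_for(text, csv_path, text_field="text"):
    """Train a Topic Model from a CSV and score one new document.

    Requires network access and BigML credentials; nothing runs until called.
    """
    # Imported lazily so this sketch can be loaded without the bindings installed.
    from bigml.api import BigML

    api = BigML()  # reads BIGML_USERNAME and BIGML_API_KEY from the environment

    source = api.create_source(csv_path)
    api.ok(source)  # wait until the resource is finished

    dataset = api.create_dataset(source)
    api.ok(dataset)

    topic_model = api.create_topic_model(dataset)
    api.ok(topic_model)

    # Input data is keyed by field name; "text" is a hypothetical column name.
    return api.create_topic_distribution(topic_model, {text_field: text})
```

A call such as `topic_distribution_for("Shares rallied after the funding round", "articles.csv")` would return the Topic Distribution resource created by the API, with `articles.csv` standing in for your own data.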

Automatic data pre-processing

Unstructured text data usually contains errors and noise, so it needs to be cleaned and pre-processed before Topic Modeling. During Topic Model creation, BigML automatically performs tedious pre-processing tasks such as removing stop words, stemming (reducing each word to its root form), and extracting compound names via bigrams.
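To make these steps concrete, here is a deliberately simplified sketch of stop-word removal, stemming, and bigram extraction in plain Python. It is not BigML's pipeline: the stop-word list is a tiny hypothetical sample, the stemmer only strips a few English suffixes, and every pair of consecutive remaining stems is emitted as a bigram, whereas a real system would typically keep only pairs that occur often enough to count as compound names.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}


def crude_stem(word):
    """Strip a few common English suffixes; a stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, tokenize, drop stop words, stem, and append bigrams."""
    tokens = re.findall(r"[a-z]+", text.lower())
    stems = [crude_stem(t) for t in tokens if t not in STOP_WORDS]
    # Bigrams are built from consecutive stems that survived stop-word removal.
    bigrams = [f"{a} {b}" for a, b in zip(stems, stems[1:])]
    return stems + bigrams


if __name__ == "__main__":
    print(preprocess("The models are training in New York"))
    # ['model', 'train', 'new', 'york', 'model train', 'train new', 'new york']
```

The bigram "new york" shows why compound names matter: as separate terms, "new" and "york" carry much less meaning than the pair.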
