diff --git a/docs/faq.md b/docs/faq.md
index 8aed0875..da4bbd81 100644
--- a/docs/faq.md
+++ b/docs/faq.md
@@ -186,7 +186,7 @@ but a category of outliers.
 ## **I have too many topics, how do I decrease them?**
 If you have a large dataset, then it is possible to generate thousands of topics. Especially with large datasets, there is a good chance they contain many small topics. In practice, you might want a few hundred topics at most to interpret them nicely.
 
-There are a few ways of increasing the number of generated topics:
+There are a few ways of decreasing the number of generated topics:
 
 * First, we can set the `min_topic_size` in the BERTopic initialization much higher (e.g., 300) to make sure that those small clusters will not be generated. This is an HDBSCAN parameter that specifies the minimum number of documents needed in a cluster. More documents in a cluster mean fewer topics will be generated.
 
@@ -310,4 +310,4 @@ No. By using document embeddings there is typically no need to preprocess the da
 are important in understanding the general topic of the document. Although this holds in 99% of cases, if you
 have data that contains a lot of noise, for example, HTML-tags, then it would be best to remove them. HTML-tags
 typically do not contribute to the meaning of a document and should therefore be removed. However, if you apply
-topic modeling to HTML-code to extract topics of code, then it becomes important.
\ No newline at end of file
+topic modeling to HTML-code to extract topics of code, then it becomes important.