fix issue 4322, enable lda summary output #5260

frank-dong-ms-zz · 2020-06-26T05:01:30Z

fix issue 4322: #4322

output topic summary to model file when OutputTopicWordSummary is set to true
open public API from lda transformer to get topic summary

antoniovs1029 · 2020-06-26T05:56:10Z

Awesome! This was a much asked feature, it's great that you've fixed this 😄 I always assumed that our implementation of Lda held the topics in a compressed hard-to-read manner, and I wasn't aware that "GetLdaDetails()" returned a readable version of the topics already! Good to know 👍

antoniovs1029 · 2020-06-26T06:05:47Z

src/Microsoft.ML.Transforms/Text/LdaTransform.cs

+        public ModelParameters GetLdaDetails(int topicIndex)
        {
-            Contracts.Assert(0 <= iinfo && iinfo < _ldas.Length);
+            Contracts.Assert(0 <= topicIndex && topicIndex < _ldas.Length);

-            var ldaState = _ldas[iinfo];
-            var mapping = _columnMappings[iinfo];
+            var ldaState = _ldas[topicIndex];
+            var mapping = _columnMappings[topicIndex];

            return ldaState.GetLdaSummary(mapping);


The way I understand it GetLdaDetails(i) will give you the LdaSummary for column i not the topic i (because each column will have several topics). So I think that renaming iinfo to topicIndex and the documentation you added is somewhat misguiding. #Resolved

You are right! I confused this with topic index, thanks

In reply to: 445984832 [](ancestors = 445984832)

codecov · 2020-06-26T06:11:24Z

Codecov Report

Merging #5260 into master will decrease coverage by 0.12%.
The diff coverage is 74.60%.

@@            Coverage Diff             @@
##           master    #5260      +/-   ##
==========================================
- Coverage   73.62%   73.50%   -0.13%     
==========================================
  Files        1022     1014       -8     
  Lines      190056   188738    -1318     
  Branches    20444    20340     -104     
==========================================
- Hits       139930   138732    -1198     
+ Misses      44595    44485     -110     
+ Partials     5531     5521      -10

Flag	Coverage Δ
#Debug	`73.50% <74.60%> (-0.13%)`	⬇️
#production	`69.31% <71.73%> (-0.06%)`	⬇️
#test	`87.42% <82.35%> (-0.20%)`	⬇️

Impacted Files	Coverage Δ
...osoft.ML.Functional.Tests/IntrospectiveTraining.cs	`98.73% <62.50%> (-1.27%)`	⬇️
src/Microsoft.ML.Transforms/Text/LdaTransform.cs	`92.14% <71.73%> (+6.95%)`	⬆️
...icrosoft.ML.TestFramework/DataPipe/TestDataPipe.cs	`91.06% <100.00%> (+0.06%)`	⬆️
src/Microsoft.ML.EntryPoints/TrainTestSplit.cs	`76.92% <0.00%> (-17.53%)`	⬇️
...rc/Microsoft.ML.LightGbm/LightGbmRankingTrainer.cs	`79.20% <0.00%> (-8.80%)`	⬇️
src/Microsoft.ML.CodeGenerator/Utils.cs	`54.35% <0.00%> (-4.85%)`	⬇️
...soft.ML.Data/DataLoadSave/DataOperationsCatalog.cs	`81.81% <0.00%> (-3.69%)`	⬇️
src/Microsoft.ML.AutoML/Utils/SplitUtil.cs	`73.68% <0.00%> (-2.64%)`	⬇️
src/Microsoft.ML.AutoML/API/RankingExperiment.cs	`58.06% <0.00%> (-2.55%)`	⬇️
test/Microsoft.ML.AutoML.Tests/AutoFitTests.cs	`90.30% <0.00%> (-1.97%)`	⬇️
... and 28 more

antoniovs1029 · 2020-06-26T06:18:23Z

src/Microsoft.ML.Transforms/Text/LdaTransform.cs

+        {
+            var summary = GetLdaDetails(i);
+
+            ctx.SaveTextStream(WordTopicModelFilename, writer =>


I know that the WordTopicModelFilename was unused in the codebase before your changes on this PR, and it was a legacy variable coming from TLC (as commented here), but I'm not sure this is the best way to use it.

The potential problem I see is that for every _columns[i] that we save, we'll call this method (if _columns[i].OutputTopicWordSummary == true, as seen in your code above), and all of them will try to save their WordSummary to the same WordTopicModelFilename.

So if a given LdaTransformer has more than 1 column with OutputTopicWordSummary enabled, then 2 or more columns will try to output to the same file.

Looking at the implementation of ctx.SaveTextStream I'm not sure if it will throw if we try to write twice the same file, or if it will overwrite them. If it does, this would be a problem. If it doesn't throw, there's still the problem that we will have the output topics of all the columns in one single file, and no way to know which file is of which column.

So adding text to identify where the output of one column begins would be useful. Another sugestion would be to actually use one file per column (for example, something like ""word_topic_summary-col{i}.txt" might be sufficient.

(PS: I don't know what was the behavior back on TLC, but maybe there we couldn't have 1 Lda transformer acting on several columns). #Resolved

sounds good, I will add column option name (which is output column name from options) as part of summary text file

In reply to: 445988583 [](ancestors = 445988583)

antoniovs1029 · 2020-06-26T06:19:16Z

src/Microsoft.ML.Transforms/Text/LdaTransform.cs

+                    {
+                        foreach (var wordScore in wordScores)
+                        {
+                            writer.WriteLine($"Top[{topId}]: {wordScore.Word}\t{wordScore.Score}");


nit: I think it might be clearer to use $"Topic[{topId}] since "Top" might be somewhat misguiding. The same applies below. #Resolved

antoniovs1029 · 2020-06-26T06:23:31Z

src/Microsoft.ML.Transforms/Text/LdaTransform.cs

@@ -141,6 +141,9 @@ internal sealed class Column : OneToOneColumn
            [Argument(ArgumentType.AtMostOnce, HelpText = "Reset the random number generator for each document", ShortName = "reset")]
            public bool? ResetRandomGenerator;

+            [Argument(ArgumentType.AtMostOnce, HelpText = "Whether to output the topic-word summary in text format", ShortName = "summary")]


nit: Suggestion, change the HelpText of this argument and the one on internal Options.OutputTopicWordSummary to be "Whether to output the topic-word summary in text format when saving the model to disk"`, else it isn't understood that users will actually need to save the model to disk in order to get that output. (see microsoft/NimbusML#405 where the user actually expected to see the output on the console).

Also update the docs elsewhere where you've added this option. #Resolved

sounds good, thanks

In reply to: 445990136 [](ancestors = 445990136)

antoniovs1029 · 2020-06-26T18:29:18Z

test/Microsoft.ML.TestFramework/DataPipe/TestDataPipe.cs

+
+            // topic summary text file saved inside the model.zip file.
+            string name = TestName + ".zip";
+            string modelPath = GetOutputPath("SavePipe", name);
+            using (var file = Env.OpenInputFile(modelPath))
+            using (var strm = file.OpenReadStream())
+            using (var zip = new ZipArchive(strm, ZipArchiveMode.Read))
+            {


(appended to code to allow for threaded comment)

Can you please also add a test to test your PR on ML.NET's API instead of MAML?
I think modifying a test such as InspectLdaModelParameters would be enough. Notice that in particular that test was waiting for this issue to be fixed in order to test the output summary:

machinelearning/test/Microsoft.ML.Functional.Tests/IntrospectiveTraining.cs

Lines 190 to 191 in 45a16dc

// Get the trained LDA model.

// TODO #2197: Get the topics and summaries from the model.

Thanks.! #Resolved

Great idea, I will test both MAML and from transformer. Actually test from MAML and check summary was how TLC originally test topic summary.

In reply to: 446344620 [](ancestors = 446344620)

antoniovs1029 · 2020-07-01T06:13:32Z

test/Microsoft.ML.Functional.Tests/IntrospectiveTraining.cs

+                .Append(mlContext.Transforms.Text.LatentDirichletAllocation("Features", "SentimentBag", 
+                numberOfTopics: numTopics, maximumNumberOfIterations: 10));


(appended to code to allow for threaded discussion)
Should we also add the outputWordSummary option to the ML.NET public API found here on the TextCatalog?:

https://github.com/frank-dong-ms/machinelearning/blob/aa75df98f549741893deb3a042a5dd287a56ab99/src/Microsoft.ML.Transforms/Text/TextCatalog.cs#L597-L614 #Resolved

That will be break public API, I think we should avoid break public API as much as possible. User still can get summary via the public interface "GetLdaDetails" from lda transformer as this test shows

In reply to: 448136744 [](ancestors = 448136744)

You're right. I don't think it's a good idea to add this option to this public API. No need for a new API only for this.

In reply to: 448522335 [](ancestors = 448522335,448512997,448136744)

antoniovs1029 · 2020-07-01T06:15:37Z

src/Microsoft.ML.Transforms/Text/LdaTransform.cs

-        [BestFriend]
-        internal ModelParameters GetLdaDetails(int iinfo)
+        /// <summary>
+        /// Method to provide details about the topic discovered by LightLDA


nit: "about the topics discovered by LightLDA" #Resolved

antoniovs1029 · 2020-07-01T06:19:36Z

src/Microsoft.ML.Transforms/Text/LdaTransform.cs

+        /// <param name="columnIndex">index of column</param>
+        /// <returns></returns>
+        public ModelParameters GetLdaDetails(int columnIndex)
        {
-            Contracts.Assert(0 <= iinfo && iinfo < _ldas.Length);
+            Contracts.Assert(0 <= columnIndex && columnIndex < _ldas.Length);


suggestion: So I don't know how usable this API would be for users, specially because users might not be sure of what index corresponds to the column they want. Is it possible to have public ModelParameters GetLdaDetails(string columnName) instead? (throw an exception if the provided name can't be matched with the names found in _columns) #Resolved

The order is same order as user define column options pair in below API. In most cases this should be 0 as user is using lda by input column, output column pair.
Maybe I can rename this as "index of column options pair" to make it clear

/// <include file='doc.xml' path='doc/members/member[@name="LightLDA"]/*' /> /// <param name="env">The environment.</param> /// <param name="columns">Describes the parameters of the LDA process for each column pair.</param> internal LatentDirichletAllocationEstimator(IHostEnvironment env, params ColumnOptions[] columns) { Contracts.CheckValue(env, nameof(env)); _host = env.Register(nameof(LatentDirichletAllocationEstimator)); _columns = columns.ToImmutableArray(); }

In reply to: 448138812 [](ancestors = 448138812)

I see. I think rephrasing the docs of GetLdaDetails to be "Index of the column options pair" would be better.

In reply to: 448520738 [](ancestors = 448520738,448138812)

…to issue-4322

UlrichSuess · 2020-10-15T12:57:08Z

Hi everyone,

great to have topic summary available. And it IS available directly after training.

BUT:

Where can "OutputTopicWordSummary" actually be set to true to trigger saving the details together with the model? Probably I'm missing something, I'm not really firm with the framework yet. However:

LatentDirichletAllocationEstimator.Defaults is internal (and the value is const)
ColumnOptions is internal
there is no argument for it in TextCatalog.LatentDirichletAllocation() (and you seem to want to keep it that way in order not to break the public API)

Thanks
Uli (using 1.5.2)

enable lda summary output

a41290c

frank-dong-ms-zz requested a review from a team as a code owner June 26, 2020 05:01

frank-dong-ms-zz requested review from harishsk and antoniovs1029 June 26, 2020 06:04

antoniovs1029 reviewed Jun 26, 2020

View reviewed changes

frank-dong-ms added 2 commits June 27, 2020 00:04

take comments

5d7bc1c

fix test

aa75df9

antoniovs1029 reviewed Jul 1, 2020

View reviewed changes

antoniovs1029 approved these changes Jul 1, 2020

View reviewed changes

frank-dong-ms-zz merged commit b7d0063 into dotnet:master Jul 1, 2020

frank-dong-ms added 2 commits July 3, 2020 12:09

take comments

cd7fc3e

Merge branch 'master' of https://github.com/dotnet/machinelearning in…

2a7c2d9

…to issue-4322

frank-dong-ms-zz deleted the issue-4322 branch July 8, 2020 02:50

ghost locked as resolved and limited conversation to collaborators Mar 18, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix issue 4322, enable lda summary output #5260

fix issue 4322, enable lda summary output #5260

frank-dong-ms-zz commented Jun 26, 2020

antoniovs1029 commented Jun 26, 2020 •

edited

Loading

antoniovs1029 Jun 26, 2020 •

edited by frank-dong-ms-zz

Loading

frank-dong-ms-zz Jun 26, 2020

codecov bot commented Jun 26, 2020 •

edited

Loading

antoniovs1029 Jun 26, 2020 •

edited by frank-dong-ms-zz

Loading

frank-dong-ms-zz Jun 26, 2020 •

edited

Loading

antoniovs1029 Jun 26, 2020 •

edited by frank-dong-ms-zz

Loading

antoniovs1029 Jun 26, 2020 •

edited by frank-dong-ms-zz

Loading

frank-dong-ms-zz Jun 26, 2020

antoniovs1029 Jun 26, 2020 •

edited by frank-dong-ms-zz

Loading

frank-dong-ms-zz Jun 26, 2020

antoniovs1029 Jul 1, 2020 •

edited by frank-dong-ms-zz

Loading

frank-dong-ms-zz Jul 1, 2020 •

edited

Loading

antoniovs1029 Jul 1, 2020

antoniovs1029 Jul 1, 2020 •

edited by frank-dong-ms-zz

Loading

antoniovs1029 Jul 1, 2020 •

edited by frank-dong-ms-zz

Loading

frank-dong-ms-zz Jul 1, 2020

antoniovs1029 Jul 1, 2020

UlrichSuess commented Oct 15, 2020

	// Get the trained LDA model.
	// TODO #2197: Get the topics and summaries from the model.

		.Append(mlContext.Transforms.Text.LatentDirichletAllocation("Features", "SentimentBag",
		numberOfTopics: numTopics, maximumNumberOfIterations: 10));

fix issue 4322, enable lda summary output #5260

fix issue 4322, enable lda summary output #5260

Conversation

frank-dong-ms-zz commented Jun 26, 2020

antoniovs1029 commented Jun 26, 2020 • edited Loading

antoniovs1029 Jun 26, 2020 • edited by frank-dong-ms-zz Loading

Choose a reason for hiding this comment

frank-dong-ms-zz Jun 26, 2020

Choose a reason for hiding this comment

codecov bot commented Jun 26, 2020 • edited Loading

Codecov Report

antoniovs1029 Jun 26, 2020 • edited by frank-dong-ms-zz Loading

Choose a reason for hiding this comment

frank-dong-ms-zz Jun 26, 2020 • edited Loading

Choose a reason for hiding this comment

antoniovs1029 Jun 26, 2020 • edited by frank-dong-ms-zz Loading

Choose a reason for hiding this comment

antoniovs1029 Jun 26, 2020 • edited by frank-dong-ms-zz Loading

Choose a reason for hiding this comment

frank-dong-ms-zz Jun 26, 2020

Choose a reason for hiding this comment

antoniovs1029 Jun 26, 2020 • edited by frank-dong-ms-zz Loading

Choose a reason for hiding this comment

frank-dong-ms-zz Jun 26, 2020

Choose a reason for hiding this comment

antoniovs1029 Jul 1, 2020 • edited by frank-dong-ms-zz Loading

Choose a reason for hiding this comment

frank-dong-ms-zz Jul 1, 2020 • edited Loading

Choose a reason for hiding this comment

antoniovs1029 Jul 1, 2020

Choose a reason for hiding this comment

antoniovs1029 Jul 1, 2020 • edited by frank-dong-ms-zz Loading

Choose a reason for hiding this comment

antoniovs1029 Jul 1, 2020 • edited by frank-dong-ms-zz Loading

Choose a reason for hiding this comment

frank-dong-ms-zz Jul 1, 2020

Choose a reason for hiding this comment

antoniovs1029 Jul 1, 2020

Choose a reason for hiding this comment

UlrichSuess commented Oct 15, 2020

antoniovs1029 commented Jun 26, 2020 •

edited

Loading

antoniovs1029 Jun 26, 2020 •

edited by frank-dong-ms-zz

Loading

codecov bot commented Jun 26, 2020 •

edited

Loading

antoniovs1029 Jun 26, 2020 •

edited by frank-dong-ms-zz

Loading

frank-dong-ms-zz Jun 26, 2020 •

edited

Loading

antoniovs1029 Jun 26, 2020 •

edited by frank-dong-ms-zz

Loading

antoniovs1029 Jun 26, 2020 •

edited by frank-dong-ms-zz

Loading

antoniovs1029 Jun 26, 2020 •

edited by frank-dong-ms-zz

Loading

antoniovs1029 Jul 1, 2020 •

edited by frank-dong-ms-zz

Loading

frank-dong-ms-zz Jul 1, 2020 •

edited

Loading

antoniovs1029 Jul 1, 2020 •

edited by frank-dong-ms-zz

Loading

antoniovs1029 Jul 1, 2020 •

edited by frank-dong-ms-zz

Loading