Question about the inference pipeline #3
Same question...
Thanks for the detailed description. I also found the same problem, but I think the main information leak happens at line 301 in srvp_generate(), which is also called directly by the forward() function.

To verify this, I modified the original code so that no posterior information is used during inference. It can be seen that once the information leakage is removed, the performance of StretchBEV-P drops significantly (much worse than the baseline). Does this further show that the results of StretchBEV-P are not convincing? Or is there something wrong with my modifications and implementation?
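For concreteness, the change I made is roughly of the following kind. This is a minimal sketch with made-up names (prior_net, posterior_net, update_state, receptive_field), not the repository's actual code; it only illustrates replacing posterior samples with prior samples in the state rollout:

```python
import torch


def rollout(state, frame_feats, prior_net, posterior_net, update_state,
            receptive_field, n_steps, use_posterior=True):
    """Roll the latent state forward; all callables are hypothetical stand-ins."""
    states = [state]
    for t in range(1, n_steps):
        p_mu, p_logvar = prior_net(state)
        if use_posterior and t < receptive_field:
            # Original behaviour questioned here: the posterior is computed
            # from encoded future observations (and, in StretchBEV-P, from GT
            # labels), so label-derived information enters the rollout.
            mu, logvar = posterior_net(state, frame_feats[t])
        else:
            # Modified behaviour: sample from the prior alone, so nothing
            # derived from GT labels can leak into inference.
            mu, logvar = p_mu, p_logvar
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterized sample
        state = update_state(state, z)
        states.append(state)
    return torch.stack(states)
```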
Hello,

As mentioned in the paper, we use the posterior distribution in the conditioning frames (t < receptive_field) to update the state variables. In the prediction phase (t >= receptive_field), we use the prior distribution. In StretchBEV-P, the posterior is sampled from both future state information extracted from images and the GT labels. The prior distribution, on the other hand, is sampled from the current state variable. The evaluation results only cover the prediction phase, which is predicted with the prior distribution alone, without the posterior.

You are right: if we use only the prior distribution, the performance drops. However, we proposed StretchBEV for this purpose. In StretchBEV, the posterior distribution only uses future state information extracted from images, not the GT labels. The code does not include that version; however, we will release it as soon as possible.

Feel free to ask if you have further questions.
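To make the distinction concrete, the difference between the two models' posteriors is only in their inputs. A rough sketch (hypothetical class and argument names, not our actual code):

```python
import torch
import torch.nn as nn


class GaussianPosterior(nn.Module):
    """q(z_t | .) as a diagonal Gaussian; the input differs between the two models."""

    def __init__(self, in_dim, z_dim, use_labels):
        super().__init__()
        self.use_labels = use_labels  # True ~ StretchBEV-P, False ~ StretchBEV
        self.net = nn.Linear(in_dim, 2 * z_dim)

    def forward(self, image_feat, label_feat=None):
        x = image_feat
        if self.use_labels:
            # StretchBEV-P: future image features AND GT-label features.
            x = torch.cat([image_feat, label_feat], dim=-1)
        # StretchBEV: future image features only, no GT labels.
        mu, logvar = self.net(x).chunk(2, dim=-1)
        return mu, logvar


# Example instantiations (feature sizes are arbitrary):
posterior_p = GaussianPosterior(in_dim=64 + 16, z_dim=32, use_labels=True)
posterior = GaussianPosterior(in_dim=64, z_dim=32, use_labels=False)
```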
Thank you @kaanakan for your response and continued updates.

Doubts about StretchBEV-P:
Personally, I don't think the GT labels should be included in any phase of the inference process, whether in the conditioning frames or the prediction phase. The state variables you mention are carried all the way through to the end of the prediction; in other words, the GT information obtained in the conditioning frames is also passed, in some form, into the computation of the prediction phase. That is where the information leak happens. In addition, from my understanding, the correct way to use the posterior is this: during training, make the prior distribution produced by the model as close to the posterior as possible; during inference, compute the prior distribution with the well-trained model (at which point the prior is assumed to be close enough to the posterior), without any posterior input at all. As my earlier experiments demonstrated, the way your posterior distribution is used makes the model rely more on the posterior input than on the features perceived from the images during inference. Please correct me if there is something wrong with my reasoning. Thanks.
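Concretely, the usage I have in mind is the standard variational one, sketched below with generic Gaussian heads (all names are hypothetical, not your code): the posterior appears only in the training loss, and inference samples from the prior alone.

```python
import torch


def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ), summed over the latent dim."""
    return 0.5 * (
        logvar_p - logvar_q
        + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp()
        - 1.0
    ).sum(dim=-1)


def sample_latent(state, obs_feat, prior_net, posterior_net, training):
    mu_p, logvar_p = prior_net(state)
    if training:
        # Training: sample from the posterior, and penalise the gap so the
        # prior learns to match it (the usual variational objective).
        mu_q, logvar_q = posterior_net(state, obs_feat)
        z = mu_q + torch.randn_like(mu_q) * (0.5 * logvar_q).exp()
        return z, gaussian_kl(mu_q, logvar_q, mu_p, logvar_p)
    # Inference: the posterior (and anything it reads, e.g. GT labels) is
    # never touched; sample from the learned prior only.
    z = mu_p + torch.randn_like(mu_p) * (0.5 * logvar_p).exp()
    return z, None
```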
This expression is tricky. Based on this idea, could one claim that the validity of an end-to-end framework can still be verified by evaluating the model only on the prediction phase, even if the conditioning terms or the hidden states carry GT information? If so, then it would also be acceptable to omit the perception module entirely and feed the GT labels of the conditioning frames directly into the prediction phase. But that clearly does not meet the requirements of an end-to-end framework and is therefore not comparable with other end-to-end approaches.

Doubts about StretchBEV:
Does this also suggest that the performance improvement of StretchBEV-P is due to information leakage?
I fully believe in the correctness and feasibility of StretchBEV and agree that it can serve as a benchmark. But unfortunately, the performance of the model itself is far inferior to the FIERY baseline, and even with pre-training it can barely beat FIERY. So what are the benefits of StretchBEV? Or does this indicate that the SRVP structure is suited to video prediction rather than to BEV prediction? In conclusion, my questions are summarized in the two doubts above.
Thank you again for your explanation, and I look forward to your more substantive reply!
Hello,

Thank you for your questions. The term

For the second question, StretchBEV performs on par with, or a little worse than, FIERY. However, there are two essential points. First, the most important aspect of our work is diversity. Both of our models produce much more diverse results, demonstrated both quantitatively and qualitatively; we believe one of the key strengths of both models is that they can generate much more diverse outcomes for a given input. Second, StretchBEV can make use of pre-training. You can use a pre-trained backbone with StretchBEV to train on unlabeled data for better performance. In this work we only use the nuScenes dataset for unsupervised pre-training, but one could use any autonomous driving dataset to first pre-train the model in an unsupervised way; the model can then be trained with supervision for better performance.

Best.
Hi,
I evaluated the pretrained model provided in this repo, whose results are similar to those of the StretchBEV-P model in the paper.
However, it seems that the ground-truth labels for the history and current timeframes are involved in the evaluation process. In the inference_srvp_generate() function, hx_z contains features generated from the label input.
Under this setting, the comparison with FIERY in the paper seems unfair, and the extra data usage in the evaluation does not match the caption of Table 1 in the paper.
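Schematically, the data flow I am pointing at looks like the sketch below (all names and sizes are stand-ins for illustration, not the repository's code):

```python
import torch
import torch.nn as nn

image_encoder = nn.Linear(64, 32)  # stand-in for the camera/BEV feature encoder
label_encoder = nn.Linear(16, 32)  # stand-in for the GT-label encoder

image_feat = image_encoder(torch.randn(1, 64))
label_feat = label_encoder(torch.randn(1, 16))  # derived from GT labels

# If the state fed to the generation loop is built from both, then every
# later prediction step that reads this state also reads information that
# originated in the GT labels.
hx_z = torch.cat([image_feat, label_feat], dim=-1)
```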