Investigation of Partial Image Classification Methods

Ziwen Dong; Ijazul Haq; Shan Huang; Jin Y. Du

Authors

Ziwen Dong, Guangdong Janus Biotechnology Co., Ltd.
Ijazul Haq, Guangdong CAS Angels Biotechnology Co., Ltd.
Shan Huang, Guangdong Janus Biotechnology Co., Ltd.
Jin Y. Du, Guangdong CAS Angels Biotechnology Co., Ltd.

Keywords:

Image Classification, Large Vision Models, Computer Vision, GPT, Generative AI, Machine Learning

Abstract

Recognizing an object from only partial information is a task that humans perform every day. In this study, we explore how accurately several computer algorithms, including traditional methods and Large Vision Models (LVMs), perform image classification on a novel dataset comprising 10 animal classes. The traditional methods are ResNet and Transformer models, while the LVMs are GPT-4, Claude, Gemini, LLaVA, Qwen, and CLIP (Contrastive Language-Image Pre-training). The dataset consists of 16K manually cropped images, which makes it a distinctive test of the models' ability to recognize objects from incomplete information. The results show substantial variation in model performance. Swin Transformer achieves the best accuracy, outperforming even humans. In contrast, LVMs underperform humans in the zero-shot setting but benefit from few-shot preparation.

