    Please use this identifier to cite or link to this item: http://ccur.lib.ccu.edu.tw/handle/A095B0000Q/337

    Title: 考量既定資源限制之卷積神經網路硬體設計取捨;Trade-offs and Optimization Strategies for Resource-Limited Convolutional Neural Network Hardware Design
    Authors: 陳昱丞;CHEN, YU-CHENG
    Contributors: 電機工程研究所 (Graduate Institute of Electrical Engineering)
    Keywords: 加速器;Roofline模型;卷積神經網路;FPGA;設計權衡;Accelerator;Roofline Model;Convolutional Neural Network;FPGA;Design Trade-offs
    Date: 2016
    Issue Date: 2019-07-17
    Publisher: 電機工程研究所 (Graduate Institute of Electrical Engineering)
    Abstract: In recent years, driven by advances in deep learning, convolutional neural networks (CNNs) have been widely applied to image recognition. Because their computations are complex, hardware implementation is often required to meet the performance demands of certain applications. Since FPGAs offer high performance, reconfigurability, and rapid development, a variety of FPGA-based hardware accelerators have been proposed. However, the design space of such accelerators is large: without careful analysis, an accelerator may fail to fully utilize either the logic resources or the memory bandwidth, and thus fall short of its best attainable performance. Furthermore, every FPGA platform has different resource constraints, so porting an existing design to another FPGA may run into insufficient resources or ineffective resource usage. We therefore optimize the CNN design with techniques such as loop tiling, and quantify the computational throughput, the required memory bandwidth, and the resource utilization. Then, with the help of the Roofline model [11], we identify the design point with the best performance and the lowest FPGA resource requirements. As a result, before porting to a new FPGA platform, the design can be adjusted in advance according to the new platform's resource constraints; and if the FPGA hosts other hardware besides the CNN, the designer can also decide how many resources to allocate to that other hardware and adapt the CNN design accordingly.
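The loop-tiling optimization mentioned in the abstract can be sketched as a software model of a convolution layer. This is only an illustrative sketch, not the thesis's actual hardware design; the tile-size names Tm/Tn/Tr/Tc follow common CNN-accelerator notation and are assumptions, not parameters taken from the thesis.

```python
import numpy as np

def conv_naive(inp, w):
    # inp: (N, R+K-1, C+K-1) input feature maps; w: (M, N, K, K) kernels.
    M, N, K, _ = w.shape
    R = inp.shape[1] - K + 1
    C = inp.shape[2] - K + 1
    out = np.zeros((M, R, C))
    for m in range(M):                  # output feature maps
        for n in range(N):              # input feature maps
            for r in range(R):          # output rows
                for c in range(C):      # output columns
                    for i in range(K):
                        for j in range(K):
                            out[m, r, c] += w[m, n, i, j] * inp[n, r + i, c + j]
    return out

def conv_tiled(inp, w, Tm=2, Tn=2, Tr=4, Tc=4):
    # Same computation, but with the m/n/r/c loops tiled so that only a
    # Tm x Tn x Tr x Tc working set must reside in on-chip buffers at once.
    # The tile sizes determine both resource usage and off-chip traffic,
    # which is exactly the trade-off the abstract quantifies.
    M, N, K, _ = w.shape
    R = inp.shape[1] - K + 1
    C = inp.shape[2] - K + 1
    out = np.zeros((M, R, C))
    for m0 in range(0, M, Tm):
        for n0 in range(0, N, Tn):
            for r0 in range(0, R, Tr):
                for c0 in range(0, C, Tc):
                    # Inner loops walk one tile; in hardware these are the
                    # loops unrolled into parallel multiply-accumulate units.
                    for m in range(m0, min(m0 + Tm, M)):
                        for n in range(n0, min(n0 + Tn, N)):
                            for r in range(r0, min(r0 + Tr, R)):
                                for c in range(c0, min(c0 + Tc, C)):
                                    for i in range(K):
                                        for j in range(K):
                                            out[m, r, c] += w[m, n, i, j] * inp[n, r + i, c + j]
    return out
```

Both functions compute the same result; tiling only reorders the iteration space, which is why it can be applied freely to match a given platform's buffer and bandwidth limits.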
    Recently, the convolutional neural network (CNN) has been widely used in deep learning for many challenging tasks, such as image recognition. Because of its complicated calculations, a CNN often needs to be implemented on an FPGA, GPU, or ASIC to meet the performance requirement. Among these realization alternatives, the FPGA is credited with high performance, reconfigurability, and short development time. Consequently, FPGA-based CNN accelerators deserve good optimization strategies in order to achieve high performance under logic, memory, and I/O bandwidth constraints. In this regard, we propose to use loop tiling and subsequently to calculate the throughput, memory bandwidth, and resource usage, all under the Roofline model. As such, we can easily find trade-offs among various design parameters. Moreover, the proposed methodology can be quickly adapted to other platforms for the same purpose of prototyping CNN accelerators on FPGAs.
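The Roofline model used above can be summarized in one line: attainable performance is the minimum of the platform's computational roof and the product of a design's computation-to-communication ratio with the bandwidth roof. A minimal sketch follows; the platform numbers are hypothetical, chosen only to illustrate how tiling choices move a design between the bandwidth-bound and compute-bound regions.

```python
def attainable_gflops(ctc_ratio, peak_gflops, peak_bandwidth_gbs):
    """Roofline model: performance is capped by compute or by memory traffic.

    ctc_ratio: computation-to-communication ratio (FLOPs per byte moved
    off-chip), which the loop-tiling parameters directly determine.
    """
    return min(peak_gflops, ctc_ratio * peak_bandwidth_gbs)

# Hypothetical platform roofs, for illustration only:
peak = 100.0  # GFLOP/s computational roof
bw = 4.0      # GB/s off-chip bandwidth roof

# A tiling with few FLOPs per byte is bandwidth-bound (5 * 4 = 20 GFLOP/s)...
low = attainable_gflops(5.0, peak, bw)
# ...while a higher-CTC tiling reaches the computational roof (100 GFLOP/s).
high = attainable_gflops(40.0, peak, bw)
```

Plotting attainable GFLOP/s against the CTC ratio for every legal tiling gives the design-space picture from which the best-performing, lowest-resource point is selected.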
    Appears in Collections: [Graduate Institute of Electrical Engineering] Theses and Dissertations

    All items in CCUR are protected by copyright, with all rights reserved.
