Though scene text detection in natural images has been studied for years, existing scene text detection methods are not directly applicable to manga, because text in manga exhibits large variations and different contextual information. In this thesis, we propose a deep-network-based approach to detecting text in manga. The primary stages of text detection, including region proposal, feature extraction, and classification/regression, are integrated into a single deep network. We further improve the approach by incorporating a spatial transformer network, whose purpose is to spatially deform the features so that classification becomes more accurate. By learning how manga text is spatially transformed and warping the feature maps accordingly, the spatial transformer enables the classification network to correctly classify difficult or severely deformed text, and thus advances the performance of the detector. Evaluation results show that our approach outperforms existing methods by a large margin, making it the state of the art in manga text detection.
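
The spatial transformer network mentioned above follows the general design of Jaderberg et al. (2015): a small localization network predicts a geometric transform, which is then used to resample the feature map before classification. Below is a minimal PyTorch-style sketch of such a module, assuming an affine transform; the class name, layer sizes, and channel counts are illustrative assumptions and are not taken from the thesis.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SpatialTransformer(nn.Module):
        # Hypothetical spatial transformer module: predicts an affine
        # transform from the input feature map and warps the map with it,
        # so the downstream classifier sees spatially rectified features.
        def __init__(self, in_channels):
            super().__init__()
            # Localization network: regresses the 6 affine parameters.
            self.localization = nn.Sequential(
                nn.Conv2d(in_channels, 32, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                nn.AdaptiveAvgPool2d(4),
                nn.Flatten(),
                nn.Linear(32 * 4 * 4, 6),
            )
            # Start as the identity transform so early training is stable.
            self.localization[-1].weight.data.zero_()
            self.localization[-1].bias.data.copy_(
                torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float))

        def forward(self, x):
            theta = self.localization(x).view(-1, 2, 3)  # affine parameters
            grid = F.affine_grid(theta, x.size(), align_corners=False)
            return F.grid_sample(x, grid, align_corners=False)

    # Example: rectify a batch of feature maps before classification.
    stn = SpatialTransformer(in_channels=256)
    features = torch.randn(8, 256, 32, 32)
    warped = stn(features)  # same shape, spatially deformed features

Initializing the localization layer to the identity transform is the standard design choice for such modules: the transformer initially passes features through unchanged and only learns to deform them when doing so reduces the classification loss.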