Continuous Sign Language Recognition (CSLR) is a typical weakly supervised task that aims to convert a sign video into a gloss sequence. However, sign videos lack clear segmentation points between gestures, so the temporal boundaries of each gloss are difficult to obtain. Existing CSLR models usually extract gesture-wise features with a receptive field of a single temporal granularity, which causes inconsistent segmentation and local ambiguity and becomes a bottleneck for the entire model. This paper proposes a Purification and Multi Temporal Semantic Network (PMTSNet) to handle the local consistency and context dependency problems. Specifically, the proposed model first extracts frame-wise features of sign language videos using 2D convolutions and then captures gesture-wise features from video segments of different temporal granularities. The gesture-wise features are fed into a BiLSTM, which models context dependencies to produce gloss-wise features. An attention-based purification module then selectively combines fine-grained gesture information and coarse-grained gloss information to obtain features with richer semantics. Finally, the model is trained with a multi-knowledge-distillation connectionist temporal classification (CTC) loss, which further improves performance. Experimental results on the RWTH-PHOENIX-Weather-2014 dataset show that the proposed model outperforms state-of-the-art methods.
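
To make the described pipeline concrete, the following is a minimal PyTorch sketch of the data flow (frame-wise 2D features, multi-granularity temporal convolutions, BiLSTM context modeling, attention-based fusion, and a plain CTC loss). All module names, kernel sizes, and dimensions are illustrative assumptions, not the authors' released implementation, and the multi-knowledge-distillation terms of the actual training loss are omitted.

```python
import torch
import torch.nn as nn


class MultiTemporalPurificationSketch(nn.Module):
    """Frame features -> multi-granularity gesture features -> BiLSTM gloss
    features -> attention-based fusion ("purification") -> gloss logits."""

    def __init__(self, feat_dim=512, hidden_dim=512, num_glosses=1296):
        super().__init__()
        # Frame-wise spatial features; a real model would use a 2D CNN backbone
        # (e.g., a ResNet applied per frame). A single conv stands in here.
        self.frame_net = nn.Sequential(
            nn.Conv2d(3, feat_dim, kernel_size=7, stride=4, padding=3),
            nn.AdaptiveAvgPool2d(1),
        )
        # Gesture-wise features from several temporal granularities:
        # parallel 1D convolutions with different kernel sizes.
        self.temporal_branches = nn.ModuleList([
            nn.Conv1d(feat_dim, feat_dim, kernel_size=k, padding=k // 2)
            for k in (3, 5, 7)
        ])
        self.merge = nn.Conv1d(feat_dim * 3, feat_dim, kernel_size=1)
        # Gloss-wise features from context modeling.
        self.bilstm = nn.LSTM(feat_dim, hidden_dim // 2, bidirectional=True,
                              batch_first=True)
        # Attention-based purification: a gate deciding how much fine-grained
        # gesture vs. coarse-grained gloss information to keep per time step.
        self.gate = nn.Sequential(nn.Linear(feat_dim + hidden_dim, 1), nn.Sigmoid())
        self.classifier = nn.Linear(hidden_dim, num_glosses)

    def forward(self, video):                                  # (B, T, 3, H, W)
        b, t = video.shape[:2]
        frames = self.frame_net(video.flatten(0, 1))           # (B*T, C, 1, 1)
        frames = frames.flatten(1).view(b, t, -1)              # (B, T, C)
        x = frames.transpose(1, 2)                             # (B, C, T)
        gesture = torch.cat([branch(x) for branch in self.temporal_branches], dim=1)
        gesture = self.merge(gesture).transpose(1, 2)          # (B, T, C)
        gloss, _ = self.bilstm(gesture)                        # (B, T, H)
        alpha = self.gate(torch.cat([gesture, gloss], dim=-1)) # (B, T, 1)
        fused = alpha * gesture + (1 - alpha) * gloss          # purified features
        return self.classifier(fused)                          # (B, T, num_glosses)


# Usage with a plain CTC loss; the paper's multi-knowledge-distillation CTC
# loss would add auxiliary distillation terms on top of this objective.
model = MultiTemporalPurificationSketch()
logits = model(torch.randn(2, 16, 3, 112, 112))                # (B, T, num_glosses)
log_probs = logits.log_softmax(-1).transpose(0, 1)             # (T, B, num_glosses)
targets = torch.randint(1, 1296, (2, 5))                       # dummy gloss labels
loss = nn.CTCLoss(blank=0)(log_probs, targets,
                           input_lengths=torch.full((2,), 16),
                           target_lengths=torch.full((2,), 5))
```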