Category: GuanGuan collection rules (关关采集规则)
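Here is a GuanGuan (关关) collection rule I wrote myself for Jieqi (杰奇) 1.7, working as of November 2019. While studying novel sites recently, I ran into the problem that sites served over https cannot be collected, and I have not solved that yet. So I wrote a collection rule for an http site instead and am sharing it here. Copy the code below and save it as an .xml file, and that is all.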
<RuleConfigInfo xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<RuleVersion>
<FilterPattern/>
<Method>Match</Method>
<Options>None</Options>
<Pattern>爱书荒网 http://www.ishuhuang.com/</Pattern>
<RegexName>RuleVersion</RegexName>
</RuleVersion>
<RuleID>
<FilterPattern/>
<Method>Match</Method>
<Options>None</Options>
<Pattern>101</Pattern>
<RegexName>RuleID</RegexName>
</RuleID>
<GetSiteName>
<FilterPattern/>
<Method>Match</Method>
<Options>None</Options>
<Pattern>VIP中文</Pattern>
<RegexName>GetSiteName</RegexName>
</GetSiteName>
<GetSiteCharset>
<FilterPattern/>
<Method>Match</Method>
<Options>None</Options>
<Pattern>UTF-8</Pattern>
<RegexName>GetSiteCharset</RegexName>
</GetSiteCharset>
<GetSiteUrl>
<FilterPattern/>
<Method>Match</Method>
<Options>None</Options>
<Pattern>http://www.022003.com/</Pattern>
<RegexName>GetSiteUrl</RegexName>
</GetSiteUrl>
<NovelSearchUrl>
<FilterPattern/>
<Method>Match</Method>
<Options>None</Options>
<Pattern/>
<RegexName>NovelSearchUrl</RegexName>
</NovelSearchUrl>
<NovelSearchData>
<FilterPattern/>
<Method>Match</Method>
<Options>None</Options>
<Pattern/>
<RegexName>NovelSearchData</RegexName>
</NovelSearchData>
<NovelSearch_GetNovelKey>
<FilterPattern/>
<Method>Match</Method>
<Options>Singleline</Options>
<Pattern/>
<RegexName>NovelSearch_GetNovelKey</RegexName>
</NovelSearch_GetNovelKey>
<NovelListUrl>
<FilterPattern/>
<Method>Match</Method>
<Options>IgnoreCase</Options>
<Pattern>http://www.022003.com/</Pattern>
<RegexName>NovelListUrl</RegexName>
</NovelListUrl>
<NovelList_GetNovelKey>
<FilterPattern/>
<Method>Match</Method>
<Options>IgnoreCase</Options>
<Pattern>
&lt;span class="s2"&gt;&lt;a href="http://www.022003.com/\d*_(\d*)/" &gt;(.+?)&lt;/a&gt;&lt;/span&gt;&lt;span class="s3"&gt;
</Pattern>
<RegexName>NovelList_GetNovelKey</RegexName>
</NovelList_GetNovelKey>
<NovelUrl>
<FilterPattern/>
<Method>Match</Method>
<Options>Singleline</Options>
<Pattern>http://www.022003.com/{NovelKey/1000}_{NovelKey}/</Pattern>
<RegexName>NovelUrl</RegexName>
</NovelUrl>
<NovelName>
<FilterPattern/>
<Method>Match</Method>
<Options>None</Options>
<Pattern>&lt;title&gt;(.+?)最新章节列</Pattern>
<RegexName>NovelName</RegexName>
</NovelName>
<NovelErr>
<FilterPattern/>
<Method>Match</Method>
<Options>None</Options>
<Pattern>对不起,该文章不存在</Pattern>
<RegexName>NovelErr</RegexName>
</NovelErr>
<NovelAuthor>
<FilterPattern/>
<Method>Match</Method>
<Options>None</Options>
<Pattern>&lt;p&gt;作 者:(.+?)&lt;/p&gt;</Pattern>
<RegexName>NovelAuthor</RegexName>
</NovelAuthor>
<LagerSort>
<FilterPattern/>
<Method>Match</Method>
<Options>None</Options>
<Pattern>&lt;a href="/"&gt;VIP中文&lt;/a&gt; &gt; &lt;a href=".+?"&gt;(.+?)&lt;/a&gt;</Pattern>
<RegexName>LagerSort</RegexName>
</LagerSort>
<SmallSort>
<FilterPattern/>
<Method>Match</Method>
<Options>None</Options>
<Pattern>&lt;a href="/"&gt;VIP中文&lt;/a&gt; &gt; &lt;a href=".+?"&gt;(.+?)&lt;/a&gt;</Pattern>
<RegexName>SmallSort</RegexName>
</SmallSort>
<NovelIntro>
<FilterPattern>&lt;p&gt;|&lt;/p&gt;|&lt;/div&gt; 推荐地址:http://.+?/.+?/</FilterPattern>
<Method>Match</Method>
<Options>None</Options>
<Pattern>&lt;div id="intro"&gt;((.|\n)+?)&lt;/div&gt;</Pattern>
<RegexName>NovelIntro</RegexName>
</NovelIntro>
<NovelKeyword>
<FilterPattern/>
<Method>Match</Method>
<Options>None</Options>
<Pattern>&lt;meta name="keywords" content="(.+?)" /&gt;</Pattern>
<RegexName>NovelKeyword</RegexName>
</NovelKeyword>
<NovelDegree>
<FilterPattern/>
<Method>Match</Method>
<Options>None</Options>
<Pattern>&lt;b&gt;文章状态:&lt;/b&gt;(.+?)中&lt;/td&gt;|&lt;b&gt;文章状态:&lt;/b&gt;已(.+?)&lt;/td&gt;</Pattern>
<RegexName>NovelDegree</RegexName>
</NovelDegree>
<NovelCover>
<FilterPattern/>
<Method>Match</Method>
<Options>None</Options>
<Pattern>&lt;div id="fmimg"&gt;&lt;img src="(.+?)"</Pattern>
<RegexName>NovelCover</RegexName>
</NovelCover>
<NovelDefaultCoverUrl>
<FilterPattern/>
<Method>Match</Method>
<Options>None</Options>
<Pattern>nocover.jpg</Pattern>
<RegexName>NovelDefaultCoverUrl</RegexName>
</NovelDefaultCoverUrl>
<NovelInfo_GetNovelPubKey>
<FilterPattern/>
<Method>Match</Method>
<Options>None</Options>
<Pattern/>
<RegexName>NovelInfo_GetNovelPubKey</RegexName>
</NovelInfo_GetNovelPubKey>
<PubCookies>
<FilterPattern/>
<Method>Match</Method>
<Options>None</Options>
<Pattern/>
<RegexName>PubCookies</RegexName>
</PubCookies>
<PubIndexUrl>
<FilterPattern/>
<Method>Match</Method>
<Options>None</Options>
<Pattern>{NovelPubKey}</Pattern>
<RegexName>PubIndexUrl</RegexName>
</PubIndexUrl>
<PubIndexErr>
<FilterPattern/>
<Method>Match</Method>
<Options>None</Options>
<Pattern>无法找到该页</Pattern>
<RegexName>PubIndexErr</RegexName>
</PubIndexErr>
<PubVolumeContent>
<FilterPattern/>
<Method>Match</Method>
<Options>None</Options>
<Pattern/>
<RegexName>PubVolumeContent</RegexName>
</PubVolumeContent>
<PubVolumeSplit>
<FilterPattern/>
<Method>Spilt</Method>
<Options>None</Options>
<Pattern>&lt;dl&gt;</Pattern>
<RegexName>PubVolumeSplit</RegexName>
</PubVolumeSplit>
<PubVolumeName>
<FilterPattern/>
<Method>Match</Method>
<Options>Singleline</Options>
<Pattern>&lt;/dl&gt;</Pattern>
<RegexName>PubVolumeName</RegexName>
</PubVolumeName>
<PubChapterName>
<FilterPattern>【|】|(|)|、|:|!|。|\.|\s* 网友上传章节</FilterPattern>
<Method>Match</Method>
<Options>None</Options>
<Pattern>&lt;dd&gt;&lt;a href=".+?"&gt;(.+?)&lt;/a&gt;&lt;/dd&gt;</Pattern>
<RegexName>PubChapterName</RegexName>
</PubChapterName>
<PubChapter_GetChapterKey>
<FilterPattern/>
<Method>Match</Method>
<Options>None</Options>
<Pattern>&lt;dd&gt;&lt;a href="/\d*_\d*/(\d*).html"&gt;.+?&lt;/a&gt;&lt;/dd&gt;</Pattern>
<RegexName>PubChapter_GetChapterKey</RegexName>
</PubChapter_GetChapterKey>
<PubContentUrl>
<FilterPattern/>
<Method>Match</Method>
<Options>None</Options>
<Pattern>{ChapterKey}.html</Pattern>
<RegexName>PubContentUrl</RegexName>
</PubContentUrl>
<PubContentErr>
<FilterPattern/>
<Method>Match</Method>
<Options>None</Options>
<Pattern>您访问的页面可能暂时未更新、已更名或已经删除,请稍后访问或马上点此举报:</Pattern>
<RegexName>PubContentErr</RegexName>
</PubContentErr>
<PubContent_GetTextKey>
<FilterPattern/>
<Method>Match</Method>
<Options>None</Options>
<Pattern/>
<RegexName>PubContent_GetTextKey</RegexName>
</PubContent_GetTextKey>
<PubTextUrl>
<FilterPattern/>
<Method>Match</Method>
<Options>None</Options>
<Pattern/>
<RegexName>PubTextUrl</RegexName>
</PubTextUrl>
<PubContentText>
<FilterPattern>
022003.com♂ishuhuang.com VIP中文♂爱书荒网 http://www..+?&lt;br /&gt; 思%路%客siluke*info更新最快的♂爱书荒网 妙书屋♂爱书荒网
</FilterPattern>
<Method>Match</Method>
<Options>None</Options>
<Pattern>&lt;div id="content"&gt;((.|\n)+?)&lt;/div&gt;</Pattern>
<RegexName>PubContentText</RegexName>
</PubContentText>
<PubContentChapterName>
<FilterPattern/>
<Method>Match</Method>
<Options>None</Options>
<Pattern/>
<RegexName/>
</PubContentChapterName>
<PubContentImages>
<FilterPattern/>
<Method>Match</Method>
<Options>None</Options>
<Pattern/>
<RegexName/>
</PubContentImages>
<PubContentReplace>
<FilterPattern/>
<Method>Match</Method>
<Options>None</Options>
<Pattern/>
<RegexName/>
</PubContentReplace>
<PubContentPageArea>
<FilterPattern/>
<Method>Match</Method>
<Options>None</Options>
<Pattern/>
<RegexName/>
</PubContentPageArea>
<PubContentPage>
<FilterPattern/>
<Method>Match</Method>
<Options>None</Options>
<Pattern/>
<RegexName/>
</PubContentPage>
</RuleConfigInfo>
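To make it easier to see how the key-related rules fit together, here is a small Python sketch that replays three of them outside the collector: NovelList_GetNovelKey, NovelUrl and PubChapter_GetChapterKey. It is only an approximation under a few assumptions of mine: that `{NovelKey/1000}` means integer division of the novel key by 1000, that the PubContentUrl value `{ChapterKey}.html` is resolved relative to the novel URL, and that the HTML fragments look like the made-up samples below. The helper name `expand_novel_url` is hypothetical and not part of the collector.

```python
import re

# Patterns copied from the rule file (NovelList_GetNovelKey uses IgnoreCase).
NOVEL_KEY_RE = re.compile(
    r'<span class="s2"><a href="http://www.022003.com/\d*_(\d*)/" >(.+?)</a></span><span class="s3">',
    re.IGNORECASE,
)
CHAPTER_KEY_RE = re.compile(r'<dd><a href="/\d*_\d*/(\d*).html">.+?</a></dd>')

# Template copied from the NovelUrl rule.
NOVEL_URL_TEMPLATE = "http://www.022003.com/{NovelKey/1000}_{NovelKey}/"


def expand_novel_url(template: str, novel_key: int) -> str:
    """Fill in the template, ASSUMING {NovelKey/1000} is integer division."""
    url = template.replace("{NovelKey/1000}", str(novel_key // 1000))
    return url.replace("{NovelKey}", str(novel_key))


# Hypothetical page fragments, invented for illustration only.
sample_list_html = (
    '<span class="s2"><a href="http://www.022003.com/12_12345/" >'
    'Sample Novel</a></span><span class="s3">'
)
sample_index_html = '<dd><a href="/12_12345/678901.html">Chapter 1</a></dd>'

# Step 1: pull the novel key and name out of the list page.
novel_match = NOVEL_KEY_RE.search(sample_list_html)
novel_key = int(novel_match.group(1))
novel_name = novel_match.group(2)

# Step 2: rebuild the novel URL from the key.
novel_url = expand_novel_url(NOVEL_URL_TEMPLATE, novel_key)

# Step 3: pull a chapter key out of the index page and build the chapter URL
# (ASSUMPTION: "{ChapterKey}.html" is relative to the novel URL).
chapter_key = CHAPTER_KEY_RE.search(sample_index_html).group(1)
chapter_url = novel_url + "{ChapterKey}.html".replace("{ChapterKey}", chapter_key)

print(novel_key, novel_name)
print(novel_url)
print(chapter_url)
```

If the rebuilt novel URL does not match the link that actually appears on the list page, the `{NovelKey/1000}` part of the NovelUrl rule is the first thing to check.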