1.å¦ä½ç¨Pythonåç¬è«
å¦ä½ç¨Pythonåç¬è«
1ï¼é¦å ä½ è¦æç½ç¬è«ææ ·å·¥ä½ã
æ³è±¡ä½ æ¯ä¸åªèèï¼ç°å¨ä½ 被æ¾å°äºäºèâç½âä¸ãé£ä¹ï¼ä½ éè¦æææçç½é¡µé½çä¸éãæä¹åå¢ï¼æ²¡é®é¢åï¼ä½ å°±é便ä»æ个å°æ¹å¼å§ï¼æ¯å¦è¯´äººæ°æ¥æ¥çé¦é¡µï¼è¿ä¸ªå«initial pagesï¼ç¨$表示å§ã
å¨äººæ°æ¥æ¥çé¦é¡µï¼ä½ çå°é£ä¸ªé¡µé¢å¼åçåç§é¾æ¥ãäºæ¯ä½ å¾å¼å¿å°ä»ç¬å°äºâå½å æ°é»âé£ä¸ªé¡µé¢ã太好äºï¼è¿æ ·ä½ 就已ç»ç¬å®äºä¿©é¡µé¢ï¼é¦é¡µåå½å æ°é»ï¼ï¼æä¸ä¸ç¨ç®¡ç¬ä¸æ¥ç页é¢æä¹å¤ççï¼ä½ å°±æ³è±¡ä½ æè¿ä¸ªé¡µé¢å®å®æ´æ´ææäºä¸ªhtmlæ¾å°äºä½ 身ä¸ã
çªç¶ä½ åç°ï¼ å¨å½å æ°é»è¿ä¸ªé¡µé¢ä¸ï¼æä¸ä¸ªé¾æ¥é¾åâé¦é¡µâãä½ä¸ºä¸åªèªæçèèï¼ä½ è¯å®ç¥éä½ ä¸ç¨ç¬åå»çå§ï¼å ä¸ºä½ å·²ç»çè¿äºåãæ以ï¼ä½ éè¦ç¨ä½ çèåï¼åä¸ä½ å·²ç»çè¿ç页é¢å°åãè¿æ ·ï¼æ¯æ¬¡çå°ä¸ä¸ªå¯è½éè¦ç¬çæ°é¾æ¥ï¼ä½ å°±å æ¥æ¥ä½ èåéæ¯ä¸æ¯å·²ç»å»è¿è¿ä¸ªé¡µé¢å°åãå¦æå»è¿ï¼é£å°±å«å»äºã
好çï¼ç论ä¸å¦æææç页é¢å¯ä»¥ä»initial pageè¾¾å°çè¯ï¼é£ä¹å¯ä»¥è¯æä½ ä¸å®å¯ä»¥ç¬å®ææçç½é¡µã
é£ä¹å¨pythonéæä¹å®ç°å¢ï¼
å¾ç®å
import Queue
initial_page = "åå§å页"
url_queue = Queue.Queue()
seen = set()
seen.insert(initial_page)
url_queue.put(initial_page)
while(True): #ä¸ç´è¿è¡ç´å°æµ·æ¯ç³ç
if url_queue.size()>0:
current_url = url_queue.get() #æ¿åºéä¾ä¸ç¬¬ä¸ä¸ªçurl
store(current_url) #æè¿ä¸ªurl代表çç½é¡µåå¨å¥½
for next_url in extract_urls(current_url): #æåæè¿ä¸ªurléé¾åçurl
if next_url not in seen:
seen.put(next_url)
url_queue.put(next_url)
else:
break
åå¾å·²ç»å¾ä¼ªä»£ç äºã
ææçç¬è«çbackboneé½å¨è¿éï¼ä¸é¢åæä¸ä¸ä¸ºä»ä¹ç¬è«äºå®ä¸æ¯ä¸ªé常å¤æçä¸è¥¿ââæç´¢å¼æå ¬å¸é常æä¸æ´ä¸ªå¢éæ¥ç»´æ¤åå¼åã
2ï¼æç
å¦æä½ ç´æ¥å å·¥ä¸ä¸ä¸é¢ç代ç ç´æ¥è¿è¡çè¯ï¼ä½ éè¦ä¸æ´å¹´æè½ç¬ä¸æ´ä¸ªè±ç£çå 容ãæ´å«è¯´Googleè¿æ ·çæç´¢å¼æéè¦ç¬ä¸å ¨ç½çå 容äºã
é®é¢åºå¨åªå¢ï¼éè¦ç¬çç½é¡µå®å¨å¤ªå¤å¤ªå¤äºï¼èä¸é¢ç代ç å¤ªæ ¢å¤ªæ ¢äºã设æ³å ¨ç½æN个ç½ç«ï¼é£ä¹åæä¸ä¸å¤éçå¤æ度就æ¯N*log(N)ï¼å 为ææç½é¡µè¦éåä¸æ¬¡ï¼èæ¯æ¬¡å¤éç¨setçè¯éè¦log(N)çå¤æ度ãOKï¼OKï¼æç¥épythonçsetå®ç°æ¯hashââä¸è¿è¿æ ·è¿æ¯å¤ªæ ¢äºï¼è³å°å å使ç¨æçä¸é«ã
é常çå¤éåæ³æ¯ææ ·å¢ï¼Bloom Filter. ç®å讲å®ä»ç¶æ¯ä¸ç§hashçæ¹æ³ï¼ä½æ¯å®çç¹ç¹æ¯ï¼å®å¯ä»¥ä½¿ç¨åºå®çå åï¼ä¸éurlçæ°éèå¢é¿ï¼ä»¥O(1)çæçå¤å®urlæ¯å¦å·²ç»å¨setä¸ãå¯æ天ä¸æ²¡æç½åçåé¤ï¼å®çå¯ä¸é®é¢å¨äºï¼å¦æè¿ä¸ªurlä¸å¨setä¸ï¼BFå¯ä»¥%ç¡®å®è¿ä¸ªurl没æçè¿ãä½æ¯å¦æè¿ä¸ªurlå¨setä¸ï¼å®ä¼åè¯ä½ ï¼è¿ä¸ªurlåºè¯¥å·²ç»åºç°è¿ï¼ä¸è¿ææ2%çä¸ç¡®å®æ§ã注æè¿éçä¸ç¡®å®æ§å¨ä½ åé çå å足å¤å¤§çæ¶åï¼å¯ä»¥åå¾å¾å°å¾å°ãä¸ä¸ªç®åçæç¨:Bloom Filters by Example
注æå°è¿ä¸ªç¹ç¹ï¼urlå¦æ被çè¿ï¼é£ä¹å¯è½ä»¥å°æ¦çéå¤çä¸çï¼æ²¡å ³ç³»ï¼å¤ççä¸ä¼ç´¯æ»ï¼ãä½æ¯å¦æ没被çè¿ï¼ä¸å®ä¼è¢«çä¸ä¸ï¼è¿ä¸ªå¾éè¦ï¼ä¸ç¶æ们就è¦æ¼æä¸äºç½é¡µäºï¼ï¼ã [IMPORTANT: æ¤æ®µæé®é¢ï¼è¯·ææ¶ç¥è¿]
好ï¼ç°å¨å·²ç»æ¥è¿å¤çå¤éæå¿«çæ¹æ³äºãå¦å¤ä¸ä¸ªç¶é¢ââä½ åªæä¸å°æºå¨ãä¸ç®¡ä½ ç带宽æå¤å¤§ï¼åªè¦ä½ çæºå¨ä¸è½½ç½é¡µçé度æ¯ç¶é¢çè¯ï¼é£ä¹ä½ åªæå å¿«è¿ä¸ªé度ãç¨ä¸å°æºåä¸å¤çè¯ââç¨å¾å¤å°å§ï¼å½ç¶ï¼æ们å设æ¯å°æºåé½å·²ç»è¿äºæ大çæçââ使ç¨å¤çº¿ç¨ï¼pythonçè¯ï¼å¤è¿ç¨å§ï¼ã
3ï¼é群åæå
ç¬åè±ç£çæ¶åï¼ææ»å ±ç¨äºå¤å°æºå¨æ¼å¤ä¸åå°è¿è¡äºä¸ä¸ªæãæ³è±¡å¦æåªç¨ä¸å°æºåä½ å°±å¾è¿è¡ä¸ªæäº...
é£ä¹ï¼åè®¾ä½ ç°å¨æå°æºå¨å¯ä»¥ç¨ï¼æä¹ç¨pythonå®ç°ä¸ä¸ªåå¸å¼çç¬åç®æ³å¢ï¼
æ们æè¿å°ä¸çå°è¿ç®è½åè¾å°çæºå¨å«ä½slaveï¼å¦å¤ä¸å°è¾å¤§çæºå¨å«ä½masterï¼é£ä¹å顾ä¸é¢ä»£ç ä¸çurl_queueï¼å¦ææ们è½æè¿ä¸ªqueueæ¾å°è¿å°masteræºå¨ä¸ï¼ææçslaveé½å¯ä»¥éè¿ç½ç»è·masterèéï¼æ¯å½ä¸ä¸ªslaveå®æä¸è½½ä¸ä¸ªç½é¡µï¼å°±åmaster请æ±ä¸ä¸ªæ°çç½é¡µæ¥æåãèæ¯æ¬¡slaveæ°æå°ä¸ä¸ªç½é¡µï¼å°±æè¿ä¸ªç½é¡µä¸ææçé¾æ¥éå°masterçqueueéå»ãåæ ·ï¼bloom filterä¹æ¾å°masterä¸ï¼ä½æ¯ç°å¨masteråªåéç¡®å®æ²¡æ被访é®è¿çurlç»slaveãBloom Filteræ¾å°masterçå åéï¼è被访é®è¿çurlæ¾å°è¿è¡å¨masterä¸çRediséï¼è¿æ ·ä¿è¯æææä½é½æ¯O(1)ãï¼è³å°å¹³ææ¯O(1)ï¼Redisç访é®æçè§:LINSERT â Redis)
èèå¦ä½ç¨pythonå®ç°ï¼
å¨åå°slaveä¸è£ 好scrapyï¼é£ä¹åå°æºåå°±åæäºä¸å°ææåè½åçslaveï¼å¨masterä¸è£ 好Redisårqç¨ä½åå¸å¼éåã
代ç äºæ¯åæ
#slave.py
current_url = request_from_master()
to_send = []
for next_url in extract_urls(current_url):
to_send.append(next_url)
store(current_url);
send_to_master(to_send)
#master.py
distributed_queue = DistributedQueue()
bf = BloomFilter()
initial_pages = "www.renmingribao.com"
while(True):
if request == 'GET':
if distributed_queue.size()>0:
send(distributed_queue.get())
else:
break
elif request == 'POST':
bf.put(request.url)
好çï¼å ¶å®ä½ è½æ³å°ï¼æ人已ç»ç»ä½ å好äºä½ éè¦çï¼darkrho/scrapy-redis 日报日报dj乐网站源码· GitHub
4ï¼å±æååå¤ç
è½ç¶ä¸é¢ç¨å¾å¤âç®åâï¼ä½æ¯çæ£è¦å®ç°ä¸ä¸ªåä¸è§æ¨¡å¯ç¨çç¬è«å¹¶ä¸æ¯ä¸ä»¶å®¹æçäºãä¸é¢ç代ç ç¨æ¥ç¬ä¸ä¸ªæ´ä½çç½ç«å ä¹æ²¡æ太大çé®é¢ã
ä½æ¯å¦æéå ä¸ä½ éè¦è¿äºåç»å¤çï¼æ¯å¦
ææå°åå¨ï¼æ°æ®åºåºè¯¥ææ ·å®æï¼
ææå°å¤éï¼è¿éæç½é¡µå¤éï¼å±å¯ä¸æ³æ人æ°æ¥æ¥åæè¢å®ç大æ°æ¥æ¥é½ç¬ä¸éï¼
ææå°ä¿¡æ¯æ½åï¼æ¯å¦æä¹æ ·æ½ååºç½é¡µä¸ææçå°åæ½ååºæ¥ï¼âæé³åºå¥è¿è·¯ä¸åéâï¼ï¼æç´¢å¼æé常ä¸éè¦åå¨ææçä¿¡æ¯ï¼æ¯å¦å¾çæåæ¥å¹²å...
åæ¶æ´æ°ï¼é¢æµè¿ä¸ªç½é¡µå¤ä¹ ä¼æ´æ°ä¸æ¬¡ï¼
å¦ä½ ææ³ï¼è¿éæ¯ä¸ä¸ªç¹é½å¯ä»¥ä¾å¾å¤ç 究è åæ°å¹´çç 究ãè½ç¶å¦æ¤ï¼
âè·¯æ¼«æ¼«å ¶ä¿®è¿å ®,å¾å°ä¸ä¸èæ±ç´¢âã
æ以ï¼ä¸è¦é®æä¹å ¥é¨ï¼ç´æ¥ä¸è·¯å°±å¥½äºï¼ï¼