
Upstage AI Lab 2기 [Day005] Python AI/Data Analysis Course Part 2. Crawling Practice


December 15, 2023 (Friday) Day_005

 

Day_005 live lecture:

Python AI/Data Analysis Course (FastCampus, instructor 김인섭)

 

Crawling Practice

 

Crawling starters

# install Selenium and webdriver-manager
!pip install selenium
!pip install webdriver-manager

from selenium import webdriver
from selenium.webdriver.common.by import By
import time
import pandas as pd

browser = webdriver.Chrome()
url = ''  # put the target URL here
browser.get(url)
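
Note that webdriver-manager gets installed here but never used; Selenium 4 normally resolves a matching ChromeDriver on its own. If that auto-resolution ever fails, wiring webdriver-manager in explicitly looks like this (standard usage of its API):

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# download a chromedriver matching the installed Chrome and hand it to Selenium
browser = webdriver.Chrome(service=Service(ChromeDriverManager().install()))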

 

 

Practice 1) Crawling Google News (keyword: 인공지능 서비스, i.e. "AI services")

url = 'https://www.google.com/search?q=%EC%9D%B8%EA%B3%B5%EC%A7%80%EB%8A%A5+%EC%84%9C%EB%B9%84%EC%8A%A4&sca_esv=591053097&biw=540&bih=583&tbm=nws&sxsrf=AM9HkKn1Z22t-83ifbrCENn4FdFlHCRrAg%3A1702604018326&ei=8qx7ZfvEE8r7wAPG-pCgAQ&ved=0ahUKEwj7hdKdppCDAxXKPXAKHUY9BBQQ4dUDCA0&uact=5&oq=%EC%9D%B8%EA%B3%B5%EC%A7%80%EB%8A%A5+%EC%84%9C%EB%B9%84%EC%8A%A4&gs_lp=Egxnd3Mtd2l6LW5ld3MiFuyduOqzteyngOuKpSDshJzruYTsiqQyBRAAGIAEMgUQABiABDIFEAAYgAQyBRAAGIAEMgUQABiABDIFEAAYgAQyBRAAGIAEMgUQABiABDIFEAAYgAQyBRAAGIAESLQOUM0DWOYMcAJ4AJABAZgBjwGgAbQHqgEDMC44uAEDyAEA-AEBwgILEAAYgAQYsQMYgwHCAgoQABiABBiKBRhDiAYB&sclient=gws-wiz-news'
browser.get(url)

 

# single data
# news outlet
browser.find_element(By.CLASS_NAME, 'MgUUmf').text
# news title
browser.find_element(By.CLASS_NAME, 'n0jPhd').text
# news snippet
browser.find_element(By.CLASS_NAME, 'GI74Re').text
# posted time
browser.find_element(By.CLASS_NAME, 'iRPxbe').find_element(By.CLASS_NAME, 'OSrXXb').text
# article link
browser.find_element(By.CLASS_NAME, 'SoaBEf').find_element(By.TAG_NAME, 'a').get_attribute('href')
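
These class names (MgUUmf, n0jPhd, ...) are auto-generated by Google and change whenever the markup is updated, and the results render dynamically, so a bare find_element can fire before the page is ready. An explicit wait (standard Selenium API) is more robust; a minimal sketch:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait up to 10 seconds for the first result card to appear before reading it
wait = WebDriverWait(browser, 10)
card = wait.until(EC.presence_of_element_located((By.CLASS_NAME, 'SoaBEf')))
print(card.find_element(By.CLASS_NAME, 'n0jPhd').text)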

 

# container
container = browser.find_element(By.CLASS_NAME, 'SoaBEf')

 

# multi data
containers = browser.find_elements(By.CLASS_NAME, 'SoaBEf')

data_list = []
for container in containers:
    agency = container.find_element(By.CLASS_NAME, 'MgUUmf').text
    title = container.find_element(By.CLASS_NAME, 'n0jPhd').text
    content = container.find_element(By.CLASS_NAME, 'GI74Re').text
    # named 'posted' (not 'time') so the time module imported above is not shadowed
    posted = container.find_element(By.CLASS_NAME, 'iRPxbe').find_element(By.CLASS_NAME, 'OSrXXb').text
    link = container.find_element(By.TAG_NAME, 'a').get_attribute('href')

    data = {'언론사' : agency,
            '뉴스제목' : title,
            '뉴스내용' : content,
            '작성시간' : posted,
            '기사링크' : link}
    data_list.append(data)

# save as csv
df = pd.DataFrame(data_list)
df.to_csv('google_news.csv', encoding = 'utf-8-sig')
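
One thing to note: to_csv also writes the DataFrame's integer index as an extra first column by default. If the file should contain only the collected fields, pass index=False:

df.to_csv('google_news.csv', encoding = 'utf-8-sig', index = False)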

 

 

Practice 2) Crawling DBPIA

# search papers for a keyword typed in by the user
word = input("Enter a keyword: ")
url = f'https://www.dbpia.co.kr/search/topSearch?searchOption=all&query={word}'
browser.get(url)
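
A keyword containing Hangul or spaces usually still works because the browser encodes it on the fly, but it is safer to percent-encode it yourself before building the query string, using the standard library:

from urllib.parse import quote

word = input("Enter a keyword: ")
url = f'https://www.dbpia.co.kr/search/topSearch?searchOption=all&query={quote(word)}'
browser.get(url)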

 

thesis_list = browser.find_elements(By.CLASS_NAME, 'thesis__summary')

data_list = []

for thesis in thesis_list:
    title = thesis.find_element(By.CLASS_NAME, 'thesis__tit').text
    conf = thesis.find_element(By.CLASS_NAME, 'nodeIprd').text
    count = thesis.find_element(By.CLASS_NAME, 'thesis__useCount').text.split(' ')[1]
    
    data = {
        '논문 제목' : title,
        '학회' : conf,
        '이용수' : count
    }
    
    print(f"논문 제목 : {title}\n학회 : {conf}\n이용수 : {count}\n")
    data_list.append(data)

df = pd.DataFrame(data_list)
df.to_csv('DBPIA.csv', encoding = 'utf-8-sig')

 

Crawling multiple pages

tip: XPath

# XPath
# page 1 button: //*[@id="pageList"]/a[1]
# page 2 button: //*[@id="pageList"]/a[2]
# page 3 button: //*[@id="pageList"]/a[3]

 

n = int(input('Enter the number of pages to crawl: '))
data_list = []
for i in range(1, n+1):
    # the page buttons differ only by their index inside #pageList
    xpath = f'//*[@id="pageList"]/a[{i}]'
    browser.find_element(By.XPATH, xpath).click()
    time.sleep(3)  # crude wait for the new page to render
    
    thesis_list = browser.find_elements(By.CLASS_NAME, 'thesis__summary')
    
    for thesis in thesis_list:
        title = thesis.find_element(By.CLASS_NAME, 'thesis__tit').text
        conf = thesis.find_element(By.CLASS_NAME, 'nodeIprd').text
        count = thesis.find_element(By.CLASS_NAME, 'thesis__useCount').text.split(' ')[1]
    
        data = {
            '논문 제목' : title,
            '학회' : conf,
            '이용수' : count
        }
        data_list.append(data)
    time.sleep(3)
# check how many rows were collected
len(data_list)

# save as CSV
df = pd.DataFrame(data_list)
df.to_csv('DBPIA_tillpage3.csv', encoding = 'utf-8-sig')

 

Practice 3) Crawling table data
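
For table data, one common approach is to let pandas parse the HTML directly instead of walking the cells with Selenium. A minimal sketch, assuming the target page (the URL below is a placeholder) renders a plain HTML <table> and lxml is installed:

from io import StringIO
import pandas as pd

url = ''  # placeholder: a page that contains an HTML <table>
browser.get(url)

# read_html returns one DataFrame per <table> found in the page source
tables = pd.read_html(StringIO(browser.page_source))
df = tables[0]
df.to_csv('table_data.csv', encoding = 'utf-8-sig')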

 

 

 

Practice 4) Crawling the YES24 bestseller list

tip: grab a parent element first, then call find_elements on that parent to collect the containers

browser = webdriver.Chrome()
url = 'https://www.yes24.com/Product/category/bestseller?CategoryNumber=001&sumgb=06'
browser.get(url)

parent = browser.find_element(By.ID, 'yesBestList')
containers = parent.find_elements(By.CLASS_NAME, 'itemUnit')

data_list = []

for container in containers:
    title = container.find_element(By.CLASS_NAME, 'gd_name').text
    # 'link', not 'url', so the page URL above is not overwritten
    link = container.find_element(By.CLASS_NAME, 'gd_name').get_attribute('href')
    auth = container.find_element(By.CLASS_NAME, 'info_auth').text
    pub = container.find_element(By.CLASS_NAME, 'info_pub').text
    sales = container.find_element(By.CLASS_NAME, 'saleNum').text.split(' ')[1]

    print(f"{title}\n{auth}\n{pub}\n{sales}\n")

    data = {
        "제목" : title,
        "링크" : link,
        "저자" : auth,
        "출판사" : pub,
        "판매지수" : sales
    }

    data_list.append(data)

df = pd.DataFrame(data_list)
df.to_csv('yes24.csv', encoding = 'utf-8-sig')
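
The 판매지수 value comes back as a string, typically with thousands separators (e.g. '126,000'), so sorting or aggregating on it needs a cast first. A small cleanup sketch:

# strip the commas and cast the sales index to int
df['판매지수'] = df['판매지수'].str.replace(',', '').astype(int)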

 

 

Practice 5) Automating SRT ticket reservation
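
In outline, the automation is: log in, fill in the trip, search, then poll the results until a seat opens up and click reserve. A rough skeleton only; every locator below is a hypothetical placeholder, not SRT's actual markup:

# all IDs/classes below are hypothetical placeholders, not the real SRT page
browser.get('https://etk.srail.kr/main.do')  # SRT site (verify the current URL)

# 1) log in
browser.find_element(By.ID, 'member_id').send_keys('my_id')          # placeholder ID
browser.find_element(By.ID, 'member_pw').send_keys('my_password')    # placeholder ID
browser.find_element(By.ID, 'login_btn').click()                     # placeholder ID

# 2) fill in the trip and search
browser.find_element(By.ID, 'dep_station').send_keys('수서')          # placeholder ID
browser.find_element(By.ID, 'arr_station').send_keys('부산')          # placeholder ID
browser.find_element(By.ID, 'search_btn').click()                    # placeholder ID

# 3) poll until a reservable seat shows up, then click it
while True:
    seats = browser.find_elements(By.CLASS_NAME, 'reserve_possible') # placeholder class
    if seats:
        seats[0].click()
        break
    time.sleep(3)  # wait between retries so the server isn't hammered
    browser.refresh()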

 

 

Extra) Creating a Slack bot

import requests
import json

url = ''  # Slack incoming webhook URL goes here
headers = {"Content-type" : "application/json"}
payload = {'text' : 'success'}
response = requests.post(url, data = json.dumps(payload), headers = headers)

print(response)
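
Incidentally, requests can handle the JSON serialization and the Content-Type header by itself through the json= keyword, so the same call can be shortened:

response = requests.post(url, json = payload)
print(response.status_code)  # 200 means the webhook accepted the message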

note.

requests : used to access sites, make API requests, etc.

json : converts a dict -> JSON string