Python (Using Public Data, Combining and Sorting Two Numeric Strings, URL-HTML, beautifulsoup4, Using requests, Crawling and robots.txt, HTML Tags, HTML with beautifulsoup4, beautifulsoup4 Practice)

Walker_ 2024. 3. 7. 17:32

1. Using public data (airport terminal)

import csv
import pprint

import requests
import xmltodict

# Store the service key (API authentication key)
service_key : str = '1wMGYoH1onj8LIYDjyTfyuVPZLQc6F31PLdZjBj6jxjEi5P5suF4F9tGV2d38RvWOUj0tpiv6%2FOmN0NsBd93gg%3D%3D'

# Build the request URL
url = 'http://apis.data.go.kr/B551177/BusInformation/getBusInfo'
param = f'?serviceKey={service_key}&type=xml&area=6&numOfRows=30'

# Send the request
response = requests.get(url+param)

# Keep the response body as text
xml_data = response.text

# Convert the XML text into a Python dictionary
dict_data = xmltodict.parse(xml_data)
print(dict_data)

# pprint prints the same data in a more readable layout
pprint.pprint(dict_data)

# 2. Save the data

# List that collects the extracted records
item_list : list[dict] = list()
# CSV header names, also used as the dictionary keys
# (bus number, bus class, adult fare, weekday timetable, weekend timetable)
name_list : list[str] = ['버스번호', '버스등급', '성인요금', '평일시간표', '주말시간표']

# Merge two timetable strings
def sort_str(string1 : str, string2 : str) -> str:
    # The source values are strings: join the two, strip spaces, then split into a list
    temp_list : list[str] = (string1 + ', ' + string2).replace(' ','').split(",")
    temp_list = list(set(temp_list)) # remove duplicates
    temp_list.sort() # sort
    return str(temp_list)[1:-1].replace("'",'')

# Write the collected records to a CSV file (the ./output directory must already exist)
def output_csv(filename : str):
    with open(f'./output/{filename}', 'w', newline='', encoding='UTF-8') as file:
        dict_writer = csv.DictWriter(file, name_list)
        dict_writer.writeheader()
        for data in item_list:
            dict_writer.writerow(data)

# Extract the needed fields and store them in the list (only routes that include Daegu)
# The path below follows the structure of the parsed response
for item in dict_data['response']['body']['items']['item']:
    if '대구' in item['busnumber']: # run only when the bus number (route) field contains '대구' (Daegu)
        new_item: dict = dict() # create a dictionary for this record
        new_item[name_list[0]] = item['busnumber']
        new_item[name_list[1]] = item['busclass']
        new_item[name_list[2]] = item['adultfare']

        # Merge the two weekday timetable strings with sort_str() defined above
        new_item[name_list[3]] = sort_str(item['t1wdayt'], item['t2wdayt'])

        # Merge the two weekend timetable strings the same way
        new_item[name_list[4]] = sort_str(item['t1wt'], item['t2wt'])

        item_list.append(new_item)

# Print the result in a readable layout with pprint
pprint.pprint(item_list)
print('run OK')

# 2. Save the data to a CSV file
output_csv('air_bus_daegu.csv')
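
One thing to watch for in the loop above: xmltodict returns a single dictionary rather than a list when an element occurs only once, so a response with exactly one item would make the for loop walk over keys instead of records. A minimal sketch of a guard follows. The name as_list is a hypothetical helper, and the sample dictionaries are made-up stand-ins for parsed responses, not real API output.

```python
from typing import Any


def as_list(value: Any) -> list:
    """Wrap a lone value in a list; return lists unchanged; map None to []."""
    if value is None:
        return []
    if isinstance(value, list):
        return value
    return [value]


# Made-up stand-ins for what a parsed response might contain
single = {'items': {'item': {'busnumber': 'A'}}}
many = {'items': {'item': [{'busnumber': 'A'}, {'busnumber': 'B'}]}}

for parsed in (single, many):
    for item in as_list(parsed['items']['item']):
        print(item['busnumber'])
```

With this guard the loop body stays the same whether the API returns one record or many.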

2. Combining two strings and sorting the result

# Join two strings and sort the combined values

t1wdayt: str = '0620, 0750, 0900, 1030, 1140, 1520, 1700, 1940, 2100, 2350'
t2wdayt: str = '0600, 0730, 0840, 1010, 1120, 1500, 1640, 1920, 2040, 2330'

# 1. Join the strings
tmp : str = t1wdayt + ', ' + t2wdayt
print(f'{tmp}')
# 0620, 0750, 0900, 1030, 1140, 1520, 1700, 1940, 2100, 2350, 0600, 0730, 0840, 1010, 1120, 1500, 1640, 1920, 2040, 2330
# -> the values are not sorted

# 2. Convert to a list so the values can be sorted
tmp_list: list[str] = tmp.split(',')
print(tmp_list)
# ['0620', ' 0750', ' 0900', ' 1030', ' 1140', ' 1520', ' 1700', ' 1940', ' 2100', ' 2350', ' 0600', ' 0730', ' 0840', ' 1010', ' 1120', ' 1500', ' 1640', ' 1920', ' 2040', ' 2330']
# -> some elements still carry a leading space

# Strip the spaces first, then split into a list
tmp_list: list[str] = tmp.replace(' ', '').split(',')
print(tmp_list)
# ['0620', '0750', '0900', '1030', '1140', '1520', '1700', '1940', '2100', '2350', '0600', '0730', '0840', '1010', '1120', '1500', '1640', '1920', '2040', '2330']

# 3. Remove duplicates, then sort
tmp_list = list(set(tmp_list)) # remove duplicates
tmp_list.sort()
print(tmp_list)
# ['0600', '0620', '0730', '0750', '0840', '0900', '1010', '1030', '1120', '1140', '1500', '1520', '1640', '1700', '1920', '1940', '2040', '2100', '2330', '2350']

# 4. Convert back to a string, e.g. 0600, 0620, 0730
tmp = str(tmp_list)[1:-1].replace("'", '')
print(tmp)
# 0600, 0620, 0730, 0750, 0840, 0900, 1010, 1030, 1120, 1140, 1500, 1520, 1640, 1700, 1920, 1940, 2040, 2100, 2330, 2350
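
The four steps above can also be folded into one small function. This is only a sketch: merge_times is a hypothetical name, and the two short input strings are shortened sample values. Sorting the strings directly works here because every time is a zero-padded four-digit value.

```python
def merge_times(first: str, second: str) -> str:
    """Merge two comma-separated time strings, drop duplicates, and sort."""
    # Split on commas, trim each piece, and let the set remove duplicates
    times = {t.strip() for t in f'{first},{second}'.split(',') if t.strip()}
    # join() builds the final string without the bracket and quote cleanup
    return ', '.join(sorted(times))


t1 = '0620, 0750, 0900'
t2 = '0600, 0750, 0840'
print(merge_times(t1, t2))  # 0600, 0620, 0750, 0840, 0900
```

Using ', '.join() avoids converting the list with str() and then removing the brackets and quotes by hand.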

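3. Fetching HTML from a URL

- When crawling, there are two main ways to fetch HTML from a URL

- 1) Use urllib, a module from the standard library

- 2) Use requests, a third-party module

- Option 2 is usually the more convenient choice: for example, it returns a response object for a missing page instead of raising an error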

import urllib.request as request

import requests

# Successful request
url = "https://www.python.org/"
code = request.urlopen(url)
print(code)

# Failed request: urlopen raises an error when the page does not exist
# url = "https://www.python.org/1"
# code = request.urlopen(url)
# print(code)

url = "https://www.python.org/"
response = requests.get(url)
print(response) # <Response [200]>: the request succeeded

# requests raises no error for a missing page; it returns Response [404]
url = "https://www.python.org/1"
response = requests.get(url)
print(response) # <Response [404]>: the page could not be found

# Status codes: sent by the server to the client
# 1XX : the request was received and processing continues
# 2XX : the request was carried out successfully
# 3XX : further action (a redirect) is needed to complete the request
# 4XX : the client's request is invalid
# 5XX : an error occurred on the server
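
The status code classes listed above can be turned into a small lookup. The sketch below uses only the standard library's http.HTTPStatus; describe_status is a hypothetical helper name, and the category labels are shorthand chosen for this example.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a short description of an HTTP status code and its class."""
    classes = {
        1: 'informational',
        2: 'success',
        3: 'redirection',
        4: 'client error',
        5: 'server error',
    }
    category = classes.get(code // 100, 'unknown')
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # HTTPStatus only knows registered codes
        phrase = 'Unregistered code'
    return f'{code} {phrase} ({category})'


print(describe_status(200))  # 200 OK (success)
print(describe_status(404))  # 404 Not Found (client error)
```

In a crawler, the same integer is available as response.status_code on a requests response.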

4. beautifulsoup4

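- An external package that parses markup and extracts only the parts you need

- Install: Settings > Python Interpreter > search for beautifulsoup4 > install version 4.10.0

- Every request to a server carries the browser's information (User-Agent)

- The server uses this information to tell whether the visitor is a bot or a regular user

- Some sites block requests whose browser information does not look like a regular user's

- With requests, the header information can be changed so that the request looks like it comes from a regular browser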

import requests
from bs4 import BeautifulSoup as bs

# Check the header information
url = 'https://planet-trade.kr/header_info.php'

# 1. When connecting through requests, the User-Agent identifies the requests module
# The server can use this value to detect crawling
response = requests.get(url)
soup = bs(response.text, 'html.parser')
print(soup)
# 접속 IP : 58.149.46.252 (connecting IP)
# 접속 정보 : python-requests/2.31.0 (connection info)

# 2. requests lets you change the header information
request_headers = {
    'User-Agent' : ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36'),
    'Referer' : '',
}
resp = requests.get(url, headers=request_headers)
soup = bs(resp.text, 'html.parser')
print(soup)
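
The header a request would carry can also be inspected without sending anything over the network. The sketch below deliberately uses the built-in urllib.request.Request rather than requests, only because it allows an offline check; the URL and the shortened User-Agent string are placeholders.

```python
from urllib.request import Request

browser_ua = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'  # shortened placeholder string
req = Request('https://example.com/', headers={'User-Agent': browser_ua})

# urllib stores header names capitalized, so the key becomes 'User-agent'
print(req.get_header('User-agent'))
print(req.has_header('User-agent'))  # True
```

Because a misspelled header name is sent as an ordinary custom header without any error, printing the prepared headers like this is a cheap way to confirm the key is what was intended.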

5. Using requests

import requests

# Using requests

url = 'https://www.naver.com/'
response = requests.get(url) # get() or post() fetches the HTML

html = response.text # the text attribute of the response holds the HTML
print(html) # prints the HTML source

headers = response.headers # the headers attribute holds the response headers
print(headers)

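6. Crawling and robots.txt

- Not all crawling is illegal

- However, crawling a site without permission and against the operator's wishes can be illegal

- The operator's wishes are stated in robots.txt

- Allow permits crawling; Disallow forbids it

- e.g. https://developers.google.com/search/docs/advanced/robots/intro?hl=ko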

https://searchadvisor.naver.com/guide/seo-basic-robots
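
The Allow and Disallow rules described above can be checked from Python with the standard library's urllib.robotparser. The sketch below parses a made-up rule set offline; the rules, the crawler name my-crawler, and the example.com URLs are all assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content for illustration
rules = """\
User-agent: *
Disallow: /private/
Allow: /
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)  # parse the rule lines instead of downloading a file

print(parser.can_fetch('my-crawler', 'https://example.com/index.html'))         # True
print(parser.can_fetch('my-crawler', 'https://example.com/private/data.html'))  # False
```

For a live site, RobotFileParser also offers set_url() and read() to download the real robots.txt before calling can_fetch().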

7. HTML tags

- Tags to refer to when crawling

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>This is title</title>
</head>
<body>
    <h1>This is heading1 text</h1> <!-- heading -->
    <h2>This is heading2 text</h2> <!-- heading -->
    <h3>This is heading3 text</h3> <!-- heading -->
    <p>This is a paragraph.</p> <!-- paragraph -->
    This is plain text.<br/> <!-- line break -->
    <b>This is bold text.</b><br/> <!-- bold -->
    <i>This is Italic text.</i><br/> <!-- italic -->
    <s>This is strike text.</s><br/> <!-- strikethrough -->
    <ol> <!-- ordered list -->
        <li>the first ordered list</li> <!-- list item -->
        <li>the second ordered list</li> <!-- list item -->
        <li>the third ordered list</li> <!-- list item -->
    </ol>
    <ul> <!-- unordered list -->
        <li>unordered list</li> <!-- list item -->
        <li>unordered list</li> <!-- list item -->
        <li>unordered list</li> <!-- list item -->
    </ul>
    <div>Short for Division; used to divide the layout.</div>
    <table border=1> <!-- table -->
        <tr><!-- table row -->
            <th>table header 1</th> <!-- table header -->
            <th>table header 2</th> <!-- table header -->
            <th>table header 3</th> <!-- table header -->
        </tr>
        <tr>
            <td>table data 4</td> <!-- table data -->
            <td>table data 5</td>
            <td>table data 6</td>
        </tr>
        <tr>
            <td>table data 7</td>
            <td>table data 8</td>
            <td>table data 9</td>
        </tr>
    </table>
    <br/>
    <a href="https://www.python.org">Visit Python homepage!<br/> <!-- hyperlink -->
        <img src="https://www.python.org/static/img/python-logo.png"/></a> <!-- image -->
</body>
</html>

8. Using beautifulsoup4 with HTML

- beautifulsoup4 : an external package that parses markup and extracts only the parts you need

from bs4 import BeautifulSoup as bs

with open('./7_HTML.html','r', encoding='UTF-8') as file:
    html = file.read()

soup = bs(html, 'html.parser') # html.parser : parses the HTML into a BeautifulSoup object that is easy to work with

print(type(soup)) # <class 'bs4.BeautifulSoup'>
print(soup) # prints the HTML

print(soup.find('title').text) # document title
print(soup.find('div').text) # text of the div tag
print(soup.find('h1').text.strip())
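
The same calls can be tried without a file on disk by parsing an inline string. This is a minimal sketch with a made-up HTML snippet; it also shows that find() returns None when no tag matches, which is worth checking before reading .text.

```python
from bs4 import BeautifulSoup

# Made-up HTML snippet for illustration
html = """
<html>
  <head><title>This is title</title></head>
  <body>
    <h1> This is heading1 text </h1>
    <a href="https://www.python.org">Visit Python homepage!</a>
  </body>
</html>
"""
soup = BeautifulSoup(html, 'html.parser')

print(soup.find('title').text)       # This is title
print(soup.find('h1').text.strip())  # This is heading1 text
print(soup.find('a').get('href'))    # https://www.python.org
print(soup.find('table'))            # None, because the snippet has no table tag
```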

9. beautifulsoup4 practice

import pprint

import requests
from bs4 import BeautifulSoup as bs

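# beautifulsoup practice: using the find() method

# 1) find()
# Returns only the first matching tag (a single value) as a Tag object, or None when nothing matches
# Typically used when only one such tag exists; if there are several, only the first is returned

# Wikipedia page for '대구광역시' (Daegu Metropolitan City)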
url = 'https://ko.wikipedia.org/wiki/%EB%8C%80%EA%B5%AC%EA%B4%91%EC%97%AD%EC%8B%9C'
resp = requests.get(url)
soup = bs(resp.text, 'html.parser')

first_img = soup.find('img') # the first img tag in the document
print(type(first_img))
print(first_img)

target_img = soup.find(name='img', attrs = {'alt': 'Deadongyeonjido (Gyujangek) 17-02.jpg'})
print(target_img)

# 2) Using find_all()
# Returns every matching tag, collected in a list-like ResultSet

# Fetch the boxed news headlines from the Naver Sports page

url = 'https://sports.news.naver.com/index.nhn'
response = requests.get(url)
soup = bs(response.text, 'html.parser')

# today_list = soup.find('ul',{'class': 'today_list'}).find_all('strong', {'class': 'title'})
today_list = soup.find('ul',{'class':'today_list'})
# print(today_list)

today_list_title = today_list.find_all('strong', {'class': 'title'})
pprint.pprint(today_list_title)

for title in today_list_title:
    print(title.text.strip()) # strip() removes whitespace from both ends

# 3) Using find_all()

# Daum News
url = 'https://news.daum.net/'
response = requests.get(url)
soup = bs(response.text, 'html.parser')

# Print the number of a tags
print('1. Number of a tags')
print(len(soup.find_all('a')))
print()

# Print only the first 20 a tags
# print('2. First 20 a tags')
# for news in soup.find_all('a')[:20]:
#     print(news.text)

# Print the links of the first 5 a tags
print('3. Links of the first 5 a tags')
for i in soup.find_all('a')[:5]:
    print(i.attrs['href']) # raises KeyError if the tag has no href
    print(i.get('href')) # returns None instead when href is missing
print("=" * 20)

# Print the tags with a specific class attribute
# print('4. Tags with a specific class attribute')
# print(soup.find_all('div', {'class': 'item_issue'}))
# print("=" * 20)

# 4. Save the links to a text file
print('5. Save the links to a text file')
with open('./output/links.txt', 'w', encoding='utf-8') as file: # create the file for writing; with closes it automatically
    for i in soup.find_all('div', {'class': 'item_issue'}):
        file.write(i.find('a').get('href') + '\n')

# Exercise: use with. Extract the news titles. File name: news_title.txt
# Number each line, e.g. 1. title~

# 5. Save the titles to a file
with open('./output/news_title.txt', 'w', encoding='utf-8') as file:
    for i, news in enumerate(soup.find_all('div', {'class':'item_issue'}), start=1):
        file.write(f'{i}. {news.find_all("a")[1].text.strip()}\n')
        print('*' * 20)

# 6. Using find_all()

# Naver News: fetch the most-viewed headlines per press outlet from the right-hand section of IT/Science

url = 'https://news.naver.com/section/105'
response = requests.get(url)
soup = bs(response.text, 'html.parser')
# print(soup)

section_list = soup.find('ul', {'class': 'ranking_list'})
# print(section_list)
# Iterate over the li children only (assumes each entry is an li element);
# iterating the tag directly also yields text nodes, which have no find_all()
for section in section_list.find_all('li', recursive=False):
    news_list = section.find_all('p', {'class': 'rl_txt'})
    for i in news_list:
        print(i.text)
    print()


# news_list = section_list.find_all('p', {'class': 'rl_txt'})
# pprint.pprint(news_list)

# for news in news_list:
#     print(news.text)
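
Live pages change their markup often, so the find() and find_all() pattern is easier to practice against a fixed snippet. The sketch below is an assumption-laden example: the HTML is made up, and it only borrows the class names today_list and title from the code above. It guards against find() returning None and numbers each line from 1.

```python
from bs4 import BeautifulSoup

# Made-up HTML snippet for illustration
html = """
<ul class="today_list">
  <li><a href="/news/1"><strong class="title"> First headline </strong></a></li>
  <li><a href="/news/2"><strong class="title"> Second headline </strong></a></li>
  <li><strong class="title">No link here</strong></li>
</ul>
"""
soup = BeautifulSoup(html, 'html.parser')

today_list = soup.find('ul', {'class': 'today_list'})
lines = []
if today_list is not None:  # find() returns None when nothing matches
    for number, li in enumerate(today_list.find_all('li'), start=1):
        title = li.find('strong', {'class': 'title'}).text.strip()
        link = li.find('a')
        href = link.get('href') if link is not None else '(no link)'
        lines.append(f'{number}. {title} -> {href}')

print('\n'.join(lines))
```

Collecting the lines in a list first keeps the parsing separate from the output step, so the same list can be printed or written to a file.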

These are notes from my own study, so some parts may be incomplete.

Please fill in any gaps with additional material.

Thank you for reading :)
