Web Scraping with Python BeautifulSoup

1

I'm just a beginner in Python.

I'm trying to scrape data from a website and managed to write the code below.

However, I'm not sure how to move forward, because I can't get the href attributes that would let me open each listing and collect its data. I also don't know HTML tags very well, so I suspect I've identified the tags incorrectly.

Here is my code:

import requests 
from bs4 import BeautifulSoup

urls = []
for i in range(1,5):
    pages = "https://directory.singaporefintech.org/?p={0}&category=0&zoom=15&is_mile=0&directory_radius=0&view=list&hide_searchbox=0&hide_nav=0&hide_nav_views=0&hide_pager=0&featured_only=0&feature=1&perpage=20&sort=random".format(i)
    urls.append(pages)

Data = []
for info in urls:
    page = requests.get(info)
    soup = BeautifulSoup(page.content, 'html.parser')
    links = soup.find_all('a', attrs ={'class' :'sabai-directory-title'})
    hrefs = [link['href'] for link in links]

The code above leaves hrefs as an empty list. Any help would be highly appreciated!

Thanks!

Tags:
python-3.x
web-scraping
beautifulsoup

3 Answers

0

You can extract the links with a CSS selector. The selector div.sabai-directory-title a finds every <a> tag nested inside a <div> with the class sabai-directory-title (I changed the URL, because yours gave me error pages):

from bs4 import BeautifulSoup
import requests
from pprint import pprint

r = requests.get('https://directory.singaporefintech.org/')
soup = BeautifulSoup(r.text, 'lxml')

hrefs = [a['href'] for a in soup.select('div.sabai-directory-title a')]

pprint(hrefs)

This prints:

['https://directory.singaporefintech.org/directory/listing/silent-eight',
 'https://directory.singaporefintech.org/directory/listing/incomlend',
 'https://directory.singaporefintech.org/directory/listing/bizgrow',
 'https://directory.singaporefintech.org/directory/listing/makerscut',
 'https://directory.singaporefintech.org/directory/listing/soho-fintech',
 'https://directory.singaporefintech.org/directory/listing/dxmarkets',
 'https://directory.singaporefintech.org/directory/listing/fundrevo',
 'https://directory.singaporefintech.org/directory/listing/money4money',
 'https://directory.singaporefintech.org/directory/listing/onelyst',
 'https://directory.singaporefintech.org/directory/listing/hearti-lab',
 'https://directory.singaporefintech.org/directory/listing/samurai-fintech-singapore-pte-ltd',
 'https://directory.singaporefintech.org/directory/listing/ceo-1',
 'https://directory.singaporefintech.org/directory/listing/arcadier',
 'https://directory.singaporefintech.org/directory/listing/plmp-fintech-pte-ltd',
 'https://directory.singaporefintech.org/directory/listing/cash-in-asia',
 'https://directory.singaporefintech.org/directory/listing/grc-systems',
 'https://directory.singaporefintech.org/directory/listing/sendexpense',
 'https://directory.singaporefintech.org/directory/listing/jinjerjade',
 'https://directory.singaporefintech.org/directory/listing/hatcher',
 'https://directory.singaporefintech.org/directory/listing/fintech-consortium']
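
To see why the question's lookup came back empty while the descendant selector works, here is a minimal offline sketch. The HTML string and the example.com URLs are made up for illustration; they only mimic the structure described in this answer, where the class sits on the <div> and not on the <a> inside it.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the class is on the wrapping <div>, not on the <a>.
html = """
<div class="sabai-directory-title">
  <a href="https://example.com/listing/one">One</a>
</div>
<div class="sabai-directory-title">
  <a href="https://example.com/listing/two">Two</a>
</div>
<a href="https://example.com/about">About</a>
"""

soup = BeautifulSoup(html, "html.parser")

# Searches for <a class="sabai-directory-title">, which this markup does not contain.
by_anchor_class = soup.find_all("a", attrs={"class": "sabai-directory-title"})

# Descendant selector: every <a> nested inside a div carrying that class.
by_descendant = [a["href"] for a in soup.select("div.sabai-directory-title a")]

print(len(by_anchor_class))
print(by_descendant)
```

The unrelated "About" link is skipped because it is not inside a matching <div>.
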
0

Hi, I made a few changes to the code:

import requests
from bs4 import BeautifulSoup
from pprint import pprint

# Note: this requests the same front page four times, so the links below repeat.
urls = []
for i in range(1, 5):
    pages = "https://directory.singaporefintech.org"
    urls.append(pages)

Data = []
for info in urls:
    page = requests.get(info)
    soup = BeautifulSoup(page.content, 'html.parser')
    # The class is on the <div> that wraps each link, not on the <a> itself.
    links = soup.find_all('div', attrs={'class': 'sabai-directory-title'})
    for link in links:
        # Keep each href as str; .encode('ascii') would produce bytes on Python 3.
        Data.extend([a['href'] for a in link.find_all('a', href=True) if a.text])
pprint(Data)

Output:

     ['https://directory.singaporefintech.org/directory/listing/silent-eight',
     'https://directory.singaporefintech.org/directory/listing/moolahsense',
     'https://directory.singaporefintech.org/directory/listing/myfinb',
     'https://directory.singaporefintech.org/directory/listing/wefinance',
     'https://directory.singaporefintech.org/directory/listing/quber',
     'https://directory.singaporefintech.org/directory/listing/ayondo-asia-pte-ltd',
     'https://directory.singaporefintech.org/directory/listing/ceo-1',
     'https://directory.singaporefintech.org/directory/listing/acekards',
     'https://directory.singaporefintech.org/directory/listing/paper-ink-pte-ltd',
     'https://directory.singaporefintech.org/directory/listing/alpha-payments-cloud',
     'https://directory.singaporefintech.org/directory/listing/samurai-fintech-singapore-pte-ltd',
     'https://directory.singaporefintech.org/directory/listing/corris-asset-management-pte-ltd',
     'https://directory.singaporefintech.org/directory/listing/fundmylife',
     'https://directory.singaporefintech.org/directory/listing/mooments',
     'https://directory.singaporefintech.org/directory/listing/venture-capital-network-pte-ltd',
     'https://directory.singaporefintech.org/directory/listing/junotele_',
     'https://directory.singaporefintech.org/directory/listing/mobilecover',
     'https://directory.singaporefintech.org/directory/listing/cherrypay',
     'https://directory.singaporefintech.org/directory/listing/toast',
     'https://directory.singaporefintech.org/directory/listing/cashdab',
     'https://directory.singaporefintech.org/directory/listing/silent-eight',
     'https://directory.singaporefintech.org/directory/listing/moolahsense',
     'https://directory.singaporefintech.org/directory/listing/myfinb',
     'https://directory.singaporefintech.org/directory/listing/wefinance',
     'https://directory.singaporefintech.org/directory/listing/quber',
     'https://directory.singaporefintech.org/directory/listing/ayondo-asia-pte-ltd',
     'https://directory.singaporefintech.org/directory/listing/ceo-1',
     'https://directory.singaporefintech.org/directory/listing/acekards',
     'https://directory.singaporefintech.org/directory/listing/paper-ink-pte-ltd',
     'https://directory.singaporefintech.org/directory/listing/alpha-payments-cloud',
     'https://directory.singaporefintech.org/directory/listing/samurai-fintech-singapore-pte-ltd',
     'https://directory.singaporefintech.org/directory/listing/corris-asset-management-pte-ltd',
     'https://directory.singaporefintech.org/directory/listing/fundmylife',
     'https://directory.singaporefintech.org/directory/listing/mooments',
     'https://directory.singaporefintech.org/directory/listing/venture-capital-network-pte-ltd',
     'https://directory.singaporefintech.org/directory/listing/junotele_',
     'https://directory.singaporefintech.org/directory/listing/mobilecover',
     'https://directory.singaporefintech.org/directory/listing/cherrypay',
     'https://directory.singaporefintech.org/directory/listing/toast',
     'https://directory.singaporefintech.org/directory/listing/cashdab',
     'https://directory.singaporefintech.org/directory/listing/silent-eight',
     'https://directory.singaporefintech.org/directory/listing/moolahsense',
     'https://directory.singaporefintech.org/directory/listing/myfinb',
     'https://directory.singaporefintech.org/directory/listing/wefinance',
     'https://directory.singaporefintech.org/directory/listing/quber',
     'https://directory.singaporefintech.org/directory/listing/ayondo-asia-pte-ltd',
     'https://directory.singaporefintech.org/directory/listing/ceo-1',
     'https://directory.singaporefintech.org/directory/listing/acekards',
     'https://directory.singaporefintech.org/directory/listing/paper-ink-pte-ltd',
     'https://directory.singaporefintech.org/directory/listing/alpha-payments-cloud',
     'https://directory.singaporefintech.org/directory/listing/samurai-fintech-singapore-pte-ltd',
     'https://directory.singaporefintech.org/directory/listing/corris-asset-management-pte-ltd',
     'https://directory.singaporefintech.org/directory/listing/fundmylife',
     'https://directory.singaporefintech.org/directory/listing/mooments',
     'https://directory.singaporefintech.org/directory/listing/venture-capital-network-pte-ltd',
     'https://directory.singaporefintech.org/directory/listing/junotele_',
     'https://directory.singaporefintech.org/directory/listing/mobilecover',
     'https://directory.singaporefintech.org/directory/listing/cherrypay',
     'https://directory.singaporefintech.org/directory/listing/toast',
     'https://directory.singaporefintech.org/directory/listing/cashdab',
     'https://directory.singaporefintech.org/directory/listing/silent-eight',
     'https://directory.singaporefintech.org/directory/listing/moolahsense',
     'https://directory.singaporefintech.org/directory/listing/myfinb',
     'https://directory.singaporefintech.org/directory/listing/wefinance',
     'https://directory.singaporefintech.org/directory/listing/quber',
     'https://directory.singaporefintech.org/directory/listing/ayondo-asia-pte-ltd',
     'https://directory.singaporefintech.org/directory/listing/ceo-1',
     'https://directory.singaporefintech.org/directory/listing/acekards',
     'https://directory.singaporefintech.org/directory/listing/paper-ink-pte-ltd',
     'https://directory.singaporefintech.org/directory/listing/alpha-payments-cloud',
     'https://directory.singaporefintech.org/directory/listing/samurai-fintech-singapore-pte-ltd',
     'https://directory.singaporefintech.org/directory/listing/corris-asset-management-pte-ltd',
     'https://directory.singaporefintech.org/directory/listing/fundmylife',
     'https://directory.singaporefintech.org/directory/listing/mooments',
     'https://directory.singaporefintech.org/directory/listing/venture-capital-network-pte-ltd',
     'https://directory.singaporefintech.org/directory/listing/junotele_',
     'https://directory.singaporefintech.org/directory/listing/mobilecover',
     'https://directory.singaporefintech.org/directory/listing/cherrypay',
     'https://directory.singaporefintech.org/directory/listing/toast',
     'https://directory.singaporefintech.org/directory/listing/cashdab']

This is the output you were expecting.

Hope this helps!

  • 0
    Yes, it gives the result you mentioned, but how can I loop it to get the information from the different listings on the different pages? Sorry, but I'm badly stuck on this.
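
One possible way to do the looping asked about in this comment is sketched below. It is not a tested recipe for this site. It assumes the directory honours the `p` and `perpage` query parameters taken from the question's URL, which another answer reports as returning error pages. The helper names (`page_url`, `listing_links`, `fetch`, `crawl`) are invented for this sketch, and reading the `<title>` of each listing is just a placeholder for whatever fields you actually need.

```python
from urllib.parse import urlencode, urljoin

import requests
from bs4 import BeautifulSoup

BASE_URL = "https://directory.singaporefintech.org/"


def page_url(page_number, per_page=20):
    # Assumption: `p` selects the result page, as in the question's URL.
    query = urlencode({"p": page_number, "view": "list", "perpage": per_page})
    return BASE_URL + "?" + query


def listing_links(html, base_url=BASE_URL):
    """Return the unique listing URLs found on one result page."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for a in soup.select("div.sabai-directory-title a[href]"):
        url = urljoin(base_url, a["href"])  # also resolves relative hrefs
        if url not in links:
            links.append(url)
    return links


def fetch(url):
    """Download a page and return its HTML, raising on HTTP errors."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text


def crawl(page_numbers, get_html=fetch):
    """Visit each result page, then each listing; map listing URL to page title."""
    results = {}
    for number in page_numbers:
        for link in listing_links(get_html(page_url(number))):
            if link in results:
                continue  # listing already visited via an earlier page
            detail = BeautifulSoup(get_html(link), "html.parser")
            results[link] = detail.title.get_text(strip=True) if detail.title else ""
    return results
```

Calling `crawl(range(1, 5))` would then request four result pages and every listing linked from them. Passing a different `get_html` callable makes it possible to try the parsing logic on saved HTML without touching the network.
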
0

The code is fine; the class you are looking for simply doesn't exist on those pages. For example, I replaced the class sabai-directory-title with comment-reply-link after inspecting https://directory.singaporefintech.org/hello-world/?category=0&zoom=15&is_mile=0&directory_radius=0&view=list&hide_searchbox=0&hide_nav=0&hide_nav_views=0&hide_pager=0&featured_only=0&feature=1&perpage=20&sort=random and got results once I added print statements.
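
A quick way to check which classes the links on a page actually carry, before settling on a selector, is to count them. The sketch below works on a made-up HTML string; the helper name `anchor_class_counts` and the sample markup are assumptions for illustration only.

```python
from collections import Counter

from bs4 import BeautifulSoup


def anchor_class_counts(html):
    """Count how often each CSS class appears on <a href=...> tags."""
    soup = BeautifulSoup(html, "html.parser")
    counts = Counter()
    for a in soup.find_all("a", href=True):
        # BeautifulSoup exposes the class attribute as a list of class names.
        counts.update(a.get("class", []))
    return counts


# Hypothetical page fragment, not copied from the real site.
sample = """
<div class="sabai-directory-title">
  <a href="/directory/listing/x" class="sabai-entity-permalink sabai-entity-type-content">X</a>
</div>
<a href="#reply-1" class="comment-reply-link">Reply</a>
<a href="#reply-2" class="comment-reply-link">Reply</a>
"""

counts = anchor_class_counts(sample)
print(counts.most_common())
# A count of zero means no <a> tag carries that class.
print(counts["sabai-directory-title"])
```

If the class you are filtering on shows a count of zero, it is either attached to a different tag or absent from the HTML that requests received.
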

  • 0
    I apologize, but I'm not very good at identifying tags. I inspected the element and found that the hrefs I need to click to open a particular listing are under a div tag with the class sabai-directory-title. The HTML tag is below; please suggest a solution:
  • 0
    <a href="directory.singaporefintech.org/directory/listing/amaas" title="AMaaS" class="sabai-entity-permalink sabai-entity-id-43 sabai-entity-type-content sabai-entity-bundle-name-directory-listing sabai-entity-bundle-type-directory-listing">AMaaS</a>
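
For a tag like the one quoted in this comment, the link itself carries several classes, and BeautifulSoup matches an element if any one of them equals the class you ask for. The sketch below uses a shortened copy of that tag wrapped in the div the comment describes. The https:// scheme in the href and the trimmed class list are assumptions made so the example is self-contained.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the tag from the comment above.
html = """
<div class="sabai-directory-title">
  <a href="https://directory.singaporefintech.org/directory/listing/amaas"
     title="AMaaS"
     class="sabai-entity-permalink sabai-entity-id-43 sabai-entity-type-content">AMaaS</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# class_ matches when any single class on the element equals the given name.
link = soup.find("a", class_="sabai-entity-permalink")

# The equivalent CSS selector form.
same = soup.select_one("a.sabai-entity-permalink")

print(link["href"])
print(link["title"], link.get_text(strip=True))
```

Either lookup returns the same element, so the choice between `find` and `select_one` is mostly a matter of taste.
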