    Hongji Knowledge (3): Scraping Web Content with Regular Expressions, a Worked Example

    2023.06.27 | admin | 177 views

    Share interests, spread happiness, broaden horizons, and leave something good behind.

    Dear reader,

    This is LearningYard Academy!

    Today's post is a worked example of scraping web content with regular expressions.

    Thank you for visiting!

    This post takes about 5 minutes to read.

    Over the holiday I learned how to use Python to scrape text from web pages in bulk, and today I would like to share it with you. If anything is unclear, feel free to send me a private message and I will do my best to answer.

    This example extracts the text with regular expressions. They are only one of several ways to scrape text from a page; XPath and BeautifulSoup are alternatives. The target is the average housing price for each city. The first step in any scraping job is to collect the set of target pages. I did this with a for loop: work out the pattern that the page URLs share, generate each URL from it, and put the URLs in a list.
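
    The URL-collection step could look roughly like the sketch below. The original post does not show the real site or its URL pattern, so `BASE_URL` and the city slugs in `CITIES` are hypothetical placeholders.

```python
# Hypothetical URL pattern and city slugs; substitute the real site's values.
BASE_URL = "https://example.com/house-price/{city}/"
CITIES = ["beijing", "shanghai", "guangzhou"]

urls = []
for city in CITIES:
    # Each page URL differs only in the city slug, so fill it into the pattern.
    urls.append(BASE_URL.format(city=city))

print(urls)
```

    If the pages are numbered rather than named, the same loop works with `range()` in place of the list of slugs.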

    Next comes the standard request step. Write a request header so that the request looks like it comes from a browser, call the get function, store the response in a variable named resp, and read the response body as text so that it is easy to work with.
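
    A minimal sketch of the request step, assuming the get function mentioned above is `requests.get` from the third-party requests library. The User-Agent string, the `timeout` value, and the helper name `fetch_text` are illustrative choices, not taken from the original code.

```python
import requests

# A browser-like User-Agent; the exact string is an illustrative assumption.
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/114.0 Safari/537.36"
    )
}


def fetch_text(url, timeout=10):
    """Request one page and return its body decoded as text."""
    resp = requests.get(url, headers=HEADERS, timeout=timeout)
    # Fail loudly on 4xx/5xx instead of parsing an error page.
    resp.raise_for_status()
    return resp.text
```

    Calling `fetch_text(url)` for each URL in the list gives one string per page, ready for matching.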

    The next step is to run a regular expression match against this text. Matching means pulling out of the text every piece that fits the pattern you wrote. The advantage of regular expressions is that content with a distinctive format is easy to extract. The disadvantages are that filtering is slow on pages with a large amount of text, and that regular expressions are harder to write than the other methods.
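
    To make the idea of a pattern concrete, here is a small sketch. The HTML fragment and the `price` class name are made up for illustration; the real page's markup is not shown in the original post and will differ.

```python
import re

# Made-up HTML fragment standing in for a downloaded page.
text = (
    '<li><span class="name">Area A</span>'
    '<span class="price">52300</span> yuan/m2</li>'
)

# Capture the digits (with an optional decimal part) inside the price span.
PRICE_PATTERN = re.compile(r'<span class="price">(\d+(?:\.\d+)?)</span>')

match = PRICE_PATTERN.search(text)
if match:
    # group(1) is the text captured by the parentheses, still a string.
    print(match.group(1))
```

    The fixed text around the parentheses is what makes the price easy to pick out, which is why this approach suits content with a distinctive format.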

    I used the re.findall function, which finds every match and returns them all in a list. The next step is to work on that list: first cap the number of items at 100, and put those 100 items in a list named data1.
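
    The `re.findall` call and the 100-item cap can be sketched as follows. The pattern and the synthetic page text are assumptions for illustration; only `re.findall` and the name `data1` come from the post.

```python
import re

# Hypothetical pattern; the real one depends on the page's markup.
PRICE_PATTERN = re.compile(r'<span class="price">(\d+(?:\.\d+)?)</span>')

# Synthetic page text with 120 price entries, standing in for resp.text.
text = "".join(f'<span class="price">{10000 + i}</span>' for i in range(120))

# With one capture group, findall returns a list of the captured strings.
prices = PRICE_PATTERN.findall(text)

# Keep at most the first 100 matches, as the post describes.
data1 = prices[:100]

print(len(prices), len(data1))
```

    Slicing with `[:100]` is safe even when fewer than 100 matches are found; it simply returns all of them.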

    The matched items in the list are strings, so they must be converted to floating-point numbers before they can be used in a calculation. Here the calculation is the mean of the 100 values. This is where a programming language shows its strength: it can process data in bulk.
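
    The conversion and the mean can be sketched like this. The three sample strings are made up and stand in for the 100 scraped values.

```python
# Sample strings in the form re.findall returns them (made-up values).
data1 = ["52300", "48750.5", "61020"]

# Convert each numeric string to float so that arithmetic works.
values = [float(item) for item in data1]

# Arithmetic mean; this assumes the list is not empty.
average = sum(values) / len(values)

print(average)
```

    If a page might yield no matches, check that `values` is non-empty before dividing.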

    The last step is to pair the two sets of data, sort the pairs by value, combine them into a dictionary, and put the result in a table. This requires the pandas library, and the output is a table with two columns of data. That completes the whole workflow: scraping numeric data from web pages, capping the number of items, calculating, sorting, and writing the result to a table.
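
    A possible sketch of this final step is below. The city names, the averages, the column names, and the descending sort order are all assumptions; the post only says that the data are sorted by value and placed in a two-column table with pandas.

```python
import pandas as pd

# Sample values only; the city names and averages here are made up.
cities = ["City A", "City B", "City C"]
averages = [54023.5, 31200.0, 47810.25]

# Pair each city with its average and sort by the average, highest first.
pairs = sorted(zip(cities, averages), key=lambda pair: pair[1], reverse=True)

# Combine into a dictionary of columns, then build the two-column table.
table = pd.DataFrame({
    "city": [city for city, _ in pairs],
    "average_price": [avg for _, avg in pairs],
})

print(table)
```

    From here, `table.to_csv("prices.csv", index=False)` would save the table to a file; the file name is a placeholder.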

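    Copyright notice

    This article represents the author's views only.
    Published with the author's authorization; do not reproduce without permission.

    Tags: regular expressions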