requests_html
AsyncHTMLSession(loop=None, workers=None, mock_browser=True, *args, **kwargs)
Bases: BaseSession
An async consumable session.
Set or create an event loop and a thread pool.
Parameters:
- loop (Optional[AbstractEventLoop], default: None) – Asyncio loop to use.
- workers (Optional[int], default: None) – Number of threads to use for executing async calls. If not passed, it defaults to the number of processors on the machine multiplied by 5.
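The default described for workers can be sketched as follows. This is a minimal illustration using only the standard library, not the library's own code: resolve_workers is a hypothetical helper name, and falling back to 1 when the processor count cannot be determined is an assumption.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from typing import Optional


def resolve_workers(workers: Optional[int] = None) -> int:
    """Return the thread count to use (hypothetical helper, not part of requests_html)."""
    if workers is not None:
        return workers
    # Assumption: os.cpu_count() may return None, so fall back to 1 processor.
    return (os.cpu_count() or 1) * 5


# Build a thread pool sized by the rule above, then release it.
pool = ThreadPoolExecutor(max_workers=resolve_workers())
pool.shutdown()
```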
close()
async
If a browser was created, close it first.
request(*args, **kwargs)
Wraps the original request function in a partial and runs it in a thread.
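The general pattern of binding arguments with functools.partial and handing the call to a thread pool might look like the sketch below. It is not the library's implementation: blocking_request and request_in_thread are hypothetical names, and the stand-in function performs no network I/O.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor
from functools import partial


def blocking_request(method, url, timeout=None):
    # Hypothetical stand-in for a blocking request function (no network access).
    return f"{method} {url} timeout={timeout}"


async def request_in_thread(executor, *args, **kwargs):
    # Bind the arguments first, because run_in_executor only forwards
    # positional arguments to the callable.
    loop = asyncio.get_running_loop()
    func = partial(blocking_request, *args, **kwargs)
    return await loop.run_in_executor(executor, func)


async def main():
    with ThreadPoolExecutor(max_workers=2) as pool:
        return await request_in_thread(pool, "GET", "https://example.org", timeout=5)


result = asyncio.run(main())
```

The partial is needed because keyword arguments cannot be passed through run_in_executor directly.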
run(*coros)
Pass in all the coroutines you want to run; each one is wrapped in a task, run, and awaited. Returns a list of all results, in the same order the coroutines were passed in.
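The ordering guarantee can be illustrated with plain asyncio. This is a sketch, not the library's code: run_all and fetch_after are hypothetical names, and it assumes each argument is a zero-argument callable that returns a coroutine.

```python
import asyncio


async def fetch_after(delay: float, label: str) -> str:
    # Hypothetical coroutine that finishes after `delay` seconds.
    await asyncio.sleep(delay)
    return label


def run_all(*coro_funcs):
    """Wrap each coroutine in a task, run them all, and return results in call order."""
    async def _gather():
        tasks = [asyncio.ensure_future(func()) for func in coro_funcs]
        # gather() returns results in the order the awaitables were passed,
        # regardless of which task finishes first.
        return await asyncio.gather(*tasks)

    return asyncio.run(_gather())


results = run_all(
    lambda: fetch_after(0.02, "slow"),
    lambda: fetch_after(0.0, "fast"),
)
```

Although the second task finishes first, its result stays in second position.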
BaseParser(*, element, default_encoding=None, html=None, url)
A basic HTML/Element Parser, for Humans.
Parameters:
- element (Any) – The element from which to base the parsing upon.
- default_encoding (_DefaultEncoding, default: None) – Which encoding to default to.
- html (_HTML, default: None) – HTML from which to base the parsing upon (optional).
- url (_URL) – The URL from which the HTML originated, used for absolute_links.
absolute_links: _Links
property
All found links on the page, in absolute form (learn more: https://www.navegabem.com/absolute-or-relative-links.html).
base_url: _URL
property
The base URL for the page. Supports the <base> tag (learn more: https://www.w3schools.com/tags/tag_base.asp).
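To picture how a base URL turns as-is links into absolute ones, here is a minimal sketch. It assumes resolution behaves like the standard library's urllib.parse.urljoin, which this reference does not confirm; make_absolute and the example URLs are hypothetical.

```python
from urllib.parse import urljoin


def make_absolute(link: str, base_url: str) -> str:
    # Assumption: relative links resolve against the page's base URL
    # the way urljoin does; already-absolute links pass through unchanged.
    return urljoin(base_url, link)


base = "https://example.org/docs/index.html"
links = ["/about", "guide.html", "https://other.example/x"]
absolute = [make_absolute(link, base) for link in links]
```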
encoding: _Encoding
property
writable
The encoding string to be used, extracted from the HTML and HTMLResponse headers.
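One plausible way to derive an encoding from those two sources is sketched below. The precedence (HTML meta tag first, then the Content-Type header, then a default) is an assumption rather than documented behaviour, and every function name here is hypothetical.

```python
import re
from email.message import Message
from typing import Optional

# Matches both <meta charset="..."> and the http-equiv Content-Type form.
META_CHARSET = re.compile(rb'<meta[^>]+charset=["\']?([\w-]+)', re.IGNORECASE)


def charset_from_html(raw_html: bytes) -> Optional[str]:
    match = META_CHARSET.search(raw_html)
    return match.group(1).decode("ascii") if match else None


def charset_from_header(content_type: str) -> Optional[str]:
    # email.message.Message parses the charset parameter and lowercases it.
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def guess_encoding(raw_html: bytes, content_type: str, default: str = "utf-8") -> str:
    # Assumed precedence: document declaration, then header, then default.
    return charset_from_html(raw_html) or charset_from_header(content_type) or default
```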
full_text: _Text
property
The full text content (including links) of the Element or HTML.
html: _BaseHTML
property
writable
Unicode representation of the HTML content (learn more: http://www.diveintopython3.net/strings.html).
links: _Links
property
All found links on the page, in as-is form.
lxml: HtmlElement
property
lxml (http://lxml.de) representation of the Element or HTML.
pq: PyQuery
property
PyQuery (https://pythonhosted.org/pyquery/) representation of the Element or HTML.
raw_html: _RawHTML
property
writable
Bytes representation of the HTML content (learn more: http://www.diveintopython3.net/strings.html).
text: _Text
property
The text content of the Element or HTML.
find(selector='*', *, containing=None, clean=False, first=False, _encoding=None)
Given a CSS Selector, returns a list of Element objects or a single one.
Parameters:
- selector (str, default: '*') – CSS Selector to use.
- clean (bool, default: False) – Whether or not to sanitize the found HTML of <script> and <style> tags.
- containing (_Containing, default: None) – If specified, only return elements that contain the provided text.
- first (bool, default: False) – Whether or not to return just the first result.
- _encoding (str, default: None) – The encoding format. Defaults to None.
Example CSS Selectors:
- a
- a.someClass
- a#someID
- a[target=_blank]
See W3Schools' CSS Selectors Reference (https://www.w3schools.com/cssref/css_selectors.asp) for more details.
If first is True, only returns the first Element found.
search(template)
Search the Element for the given Parse template.
Parameters:
- template (str) – The Parse template to use.
search_all(template)
Search the Element (multiple times) for the given Parse template.
Parameters:
- template (str) – The Parse template to use.
xpath(selector, *, clean=False, first=False, _encoding=None)
Given an XPath selector, returns a list of Element objects or a single one.
Parameters:
- selector (str) – XPath Selector to use.
- clean (bool, default: False) – Whether or not to sanitize the found HTML of <script> and <style> tags.
- first (bool, default: False) – Whether or not to return just the first result.
- _encoding (str, default: None) – The encoding format.
If a sub-selector is specified (e.g. //a/@href), a simple list of results is returned.
See W3Schools' XPath Examples (https://www.w3schools.com/xml/xpath_examples.asp) for more details.
If first is True, only returns the first Element found.
BaseSession(mock_browser=True, verify=True, browser_args=['--no-sandbox'])
Bases: Session
A consumable session, for cookie persistence and connection pooling, amongst other things.
response_hook(response, **kwargs)
Changes the response encoding and replaces the response with an HTMLResponse.
Element(*, element, url, default_encoding=None)
Bases: BaseParser
An element of HTML.
Parameters:
- element (Any) – The element from which to base the parsing upon.
- url (_URL) – The URL from which the HTML originated, used for absolute_links.
- default_encoding (_DefaultEncoding, default: None) – Which encoding to default to. Defaults to None.
attrs: _Attrs
property
Returns a dictionary of the attributes of the Element (learn more: https://www.w3schools.com/tags/ref_attributes.asp).
HTML(*, session=None, url=DEFAULT_URL, html, default_encoding=DEFAULT_ENCODING, async_=False)
Bases: BaseParser
An HTML document, ready for parsing.
Parameters:
- url (str, default: DEFAULT_URL) – The URL from which the HTML originated, used for absolute_links.
- html (_HTML) – HTML from which to base the parsing upon (optional).
- default_encoding (str, default: DEFAULT_ENCODING) – Which encoding to default to.
arender(retries=8, script=None, wait=0.2, scrolldown=False, sleep=0, reload=True, timeout=8.0, keep_page=False, cookies=[{}], send_cookies_session=False)
async
Async version of render. Takes the same parameters.
next(fetch=False, next_symbol=None)
Attempts to find the next page, if there is one. If fetch is True, returns the HTML object of the next page. If fetch is False (the default), simply returns the next URL.
Parameters:
- fetch (bool, default: False) – Dictates whether to fetch the next page or return the next URL.
- next_symbol (_NextSymbol, default: None) – If specified, only fetch elements containing this text value.
render(retries=8, script=None, wait=0.2, scrolldown=False, sleep=0, reload=True, timeout=8.0, keep_page=False, cookies=[{}], send_cookies_session=False)
Reloads the response in Chromium, and replaces HTML content with an updated version, with JavaScript executed.
Parameters:
- retries (int, default: 8) – The number of times to retry loading the page in Chromium.
- script (str, default: None) – JavaScript to execute upon page load (optional).
- wait (float, default: 0.2) – The number of seconds to wait before loading the page, preventing timeouts (optional).
- scrolldown (bool, default: False) – Integer, if provided, of how many times to page down.
- sleep (int, default: 0) – Integer, if provided, of how many seconds to sleep after initial render.
- reload (bool, default: True) – If False, content will not be loaded from the browser, but will be provided from memory.
- timeout (Union[float, int], default: 8.0) – Specifies a timeout for the render.
- keep_page (bool, default: False) – If True, allows you to interact with the browser page through r.html.page.
- send_cookies_session (bool, default: False) – If True, send HTMLSession.cookies (converted).
- cookies (list, default: [{}]) – If not empty, send these cookies.
If scrolldown is specified, the page will scroll down the specified number of times, after sleeping the specified amount of time (e.g. scrolldown=10, sleep=1).
If just sleep is provided, the rendering will wait n seconds before returning.
If script is specified, it will execute the provided JavaScript at runtime. Example:
script = """
    () => {
        return {
            width: document.documentElement.clientWidth,
            height: document.documentElement.clientHeight,
            deviceScaleFactor: window.devicePixelRatio,
        }
    }
"""
Returns the return value of the executed script, if any is provided:
>>> r.html.render(script=script)
{'width': 800, 'height': 600, 'deviceScaleFactor': 1}
Note: this method requires that you have run playwright install.
HTMLResponse(session)
Bases: Response
An HTML-enabled requests.Response object.
Effectively the same, but with an intelligent .html property added.
HTMLSession(**kwargs)
Bases: BaseSession
close()
If a browser was created, close it first.
user_agent(style=None)
Returns an apparently legitimate User-Agent string, unless one of a specific style is requested. Defaults to a Chrome-style User-Agent.