Extract Text from PDF API

Data Extraction

#extract #pdf #text

ApyHub

50 atoms

Base tier

About

This utility API extracts the text content of a PDF file. It can extract all text from a PDF document, optionally including style and preserving paragraphs.

Extracting text from a PDF file makes the text searchable, which is particularly useful when you need to find specific information in the file quickly. It can also make the content more accessible to people with disabilities, such as visual impairments, because extracted text can be converted into a format that assistive technologies such as screen readers handle more easily.

Try out the text extractor API in the API playground and see how this free online tool can become your PDF text extractor, helping you save time and reduce manual text extraction and exporting through a simple API call.

API Documentation

upload file: extracted data

POST

https://api.apyhub.com/extract/text/pdf-file

Request example

curl --location --request POST 'https://api.apyhub.com/extract/text/pdf-file' \
--header 'apy-token: {{token}}' \
--header 'Content-Type: multipart/form-data' \
--form 'file=@"test.pdf"'

This method accepts a PDF file and returns the extracted text as a string. It is the most straightforward way to use the service: submit a PDF file and receive the extracted text in the response.

Method: POST

Content Type: multipart/form-data
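For readers calling the endpoint from code rather than curl, here is a minimal Python sketch of the same multipart/form-data POST using only the standard library. The function names `build_upload_request` and `send` are hypothetical helpers, not part of any ApyHub SDK, and the `application/pdf` part type is an assumption about what the service accepts.

```python
import uuid
import urllib.request

API_URL = "https://api.apyhub.com/extract/text/pdf-file"


def build_upload_request(token: str, file_name: str, file_bytes: bytes) -> urllib.request.Request:
    """Build (but do not send) a multipart/form-data POST carrying one PDF."""
    boundary = uuid.uuid4().hex  # random separator between form parts
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        f'Content-Disposition: form-data; name="file"; filename="{file_name}"\r\n'.encode(),
        b"Content-Type: application/pdf\r\n\r\n",  # assumed part type
        file_bytes,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    headers = {
        "apy-token": token,
        "Content-Type": f"multipart/form-data; boundary={boundary}",
    }
    return urllib.request.Request(API_URL, data=body, headers=headers, method="POST")


def send(request: urllib.request.Request) -> str:
    """Send the request and return the raw response body as text."""
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

To use it, read the PDF with `open("test.pdf", "rb")`, pass the bytes and a valid token to `build_upload_request`, and hand the result to `send`.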

Request Body

Attribute	Type	Mandatory	Description
file	File	Yes	The source PDF file.
preserve_paragraphs	Boolean	No	If `true`, paragraphs are preserved in the response. Defaults to `false`.
start_page	Integer	No	The starting page number for text extraction. Default is `1`, can range from `1` to the last page number. For example, to start from page 2, set `start_page` to `2`.
end_page	Integer	No	The ending page number for text extraction. Default is the last page number, can range from `1` to the last page number. For example, to end at page 5, set `end_page` to `5`.
starting_x_coordinate	Integer	No	Distance from the left edge (x-coordinate) to start extraction. Can range from `0` to `100`, default is `0`. For example, to start extraction 20% from the left, set `starting_x_coordinate` to `20`.
starting_y_coordinate	Integer	No	Distance from the top edge (y-coordinate) to start extraction. Can range from `0` to `100`, default is `0`. For example, to start extraction 20% below the top edge, set `starting_y_coordinate` to `20`.
ending_x_coordinate	Integer	No	Defines the width of the extraction area, starting from the starting_x_coordinate. Must be greater than starting_x_coordinate. Can range from `0` to `100`, with a default of `100`. For example, set `ending_x_coordinate` to `50` to extract text up to 50% of the page width from the left edge.
ending_y_coordinate	Integer	No	Defines the height of the extraction area, starting from the starting_y_coordinate. Must be greater than starting_y_coordinate. Can range from `0` to `100`, with a default of `100`. For example, set `ending_y_coordinate` to `50` to extract text up to 50% of the page height from the top edge.
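The ranges and ordering rules in the table above can be checked on the client before a request is sent. The sketch below is one way to do that; `build_form_fields` is a hypothetical helper, and serializing the boolean as the strings `true` and `false` is an assumption about how the form field is read. The upper bound on page numbers is not checked because the page count is not known until the file is processed.

```python
from typing import Dict, Optional


def _check_percent(name: str, value: int) -> None:
    """Coordinates are documented as ranging from 0 to 100."""
    if not 0 <= value <= 100:
        raise ValueError(f"{name} must be between 0 and 100, got {value}")


def build_form_fields(
    preserve_paragraphs: bool = False,
    start_page: int = 1,
    end_page: Optional[int] = None,  # None means "use the API default (last page)"
    starting_x_coordinate: int = 0,
    starting_y_coordinate: int = 0,
    ending_x_coordinate: int = 100,
    ending_y_coordinate: int = 100,
) -> Dict[str, str]:
    """Validate the optional parameters and return them as string form fields."""
    if start_page < 1:
        raise ValueError("start_page must be at least 1")
    if end_page is not None and end_page < 1:
        raise ValueError("end_page must be at least 1")

    coords = {
        "starting_x_coordinate": starting_x_coordinate,
        "starting_y_coordinate": starting_y_coordinate,
        "ending_x_coordinate": ending_x_coordinate,
        "ending_y_coordinate": ending_y_coordinate,
    }
    for name, value in coords.items():
        _check_percent(name, value)

    # The table requires each ending coordinate to exceed its starting one.
    if ending_x_coordinate <= starting_x_coordinate:
        raise ValueError("ending_x_coordinate must be greater than starting_x_coordinate")
    if ending_y_coordinate <= starting_y_coordinate:
        raise ValueError("ending_y_coordinate must be greater than starting_y_coordinate")

    fields = {"preserve_paragraphs": "true" if preserve_paragraphs else "false"}
    fields["start_page"] = str(start_page)
    if end_page is not None:
        fields["end_page"] = str(end_page)
    fields.update({name: str(value) for name, value in coords.items()})
    return fields
```

For example, `build_form_fields(start_page=2, end_page=5, starting_x_coordinate=20, ending_x_coordinate=50)` describes pages 2 to 5, restricted to the band between 20% and 50% of the page width.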

Sample Response

{
    "data": "A consequuntur voluptatem ut mollitia voluptatem. Lorem ipsum dolor sit amet. Aut aspernatur quibusdam hic amet quas nam internos consequatur et ipsam repellendus ut galisum obcaecati. In repudiandae ipsa aut dolor asperiores et sint voluptatem aut voluptates quia. Nam numquam enim ut dolorum sunt Et doloremque cum temporibus debitis aut labore delectus et odit tempore in autem odio. Aut impedit fugit ut voluptate rerum sit culpa voluptatem ab galisum perferendis qui dignissimos minima est galisum perferendis et consequatur dolorem. Eum velit Quis ut nobis consequuntur Quo aperiam est fugiat quas ut fugiat dolorem ut voluptatum exercitationem aut quia temporibus. Rem rerum voluptas et quis voluptatum et veniam temporibus et tempore galisum et odit error est animi consequuntur. Et magni deleniti et molestiae corporis eos possimus itaque ut placeat quis hic fuga ratione qui fugiat porro aut praesentium repellat."
}
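The sample response carries the extracted text in a single `data` string. A small sketch of reading that field in Python follows; `extract_text_from_response` is a hypothetical helper name, and the strict type check is a defensive choice rather than documented behavior.

```python
import json


def extract_text_from_response(body: str) -> str:
    """Return the extracted text from a successful response body."""
    payload = json.loads(body)
    data = payload.get("data") if isinstance(payload, dict) else None
    if not isinstance(data, str):
        raise ValueError("response does not contain a string 'data' field")
    return data
```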

HTTP Response Codes

The method may return one of the following HTTP status codes:

Status Code	Description
200	The request was successful.
400	Request is invalid or the file is not accessible.
401	Required authentication information is either missing or not valid for the resource.
500	There was an error in processing this request.
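A client can turn the status codes in this table into readable diagnostics. The sketch below copies the descriptions from the table; `describe_status` is a hypothetical helper, and the fallback text for codes not listed here is an assumption.

```python
from typing import Tuple

# Descriptions taken from the status code table.
STATUS_DESCRIPTIONS = {
    200: "The request was successful.",
    400: "Request is invalid or the file is not accessible.",
    401: "Required authentication information is either missing or not valid for the resource.",
    500: "There was an error in processing this request.",
}


def describe_status(status: int) -> Tuple[bool, str]:
    """Return (succeeded, description) for an HTTP status code."""
    description = STATUS_DESCRIPTIONS.get(status, f"Undocumented status code {status}.")
    return status == 200, description
```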

Authentication

All API requests to ApyHub services must be authenticated. Token and basic authentication are currently supported. To generate credentials or view existing ones, open your workspace settings (on the left side of the navbar) and go to "API Keys".

Points to note:

Credential secrets are generated on the fly and are not stored in plain text, so save each secret somewhere safe when you generate a credential.
Use the apy-token as the header parameter to pass the token.
Use the Authorization header to send the basic authentication credentials.
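The two header styles described above can be sketched in Python as follows. Both function names are hypothetical. The basic authentication helper assumes the standard HTTP Basic scheme (base64 of `username:password`), and which credential values play the roles of username and password is not stated here, so treat those parameter names as placeholders.

```python
import base64
from typing import Dict


def token_headers(token: str) -> Dict[str, str]:
    """Token authentication: pass the token in the apy-token header."""
    return {"apy-token": token}


def basic_auth_headers(username: str, password: str) -> Dict[str, str]:
    """Basic authentication via the Authorization header (standard HTTP Basic assumed)."""
    raw = f"{username}:{password}".encode("utf-8")
    return {"Authorization": "Basic " + base64.b64encode(raw).decode("ascii")}
```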

Error codes

{
  "error": {
    "code": 105,
    "message": "Invalid URL"
  }
}

101 - Missing parameters

102 - Invalid JSON

103 - Invalid input

104 - Invalid file

105 - Invalid URL

109 - Invalid input format

110 - Server error
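An error body of the shape shown in this section can be decoded and matched against the listed codes. The sketch below is a minimal example; `parse_error` is a hypothetical helper, and falling back to the local table when the body has no `message` is a design choice, not documented behavior.

```python
import json
from typing import Optional, Tuple

# Codes and names copied from the list above.
ERROR_CODES = {
    101: "Missing parameters",
    102: "Invalid JSON",
    103: "Invalid input",
    104: "Invalid file",
    105: "Invalid URL",
    109: "Invalid input format",
    110: "Server error",
}


def parse_error(body: str) -> Tuple[Optional[int], str]:
    """Return (code, message) from an error response body."""
    payload = json.loads(body)
    error = payload.get("error", {}) if isinstance(payload, dict) else {}
    code = error.get("code")
    # Prefer the server's message; otherwise look the code up locally.
    message = error.get("message") or ERROR_CODES.get(code, "Unknown error")
    return code, message
```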