Learn to parse an HTML Page on Android with JSoup

When you build Android applications, you may have to parse HTML data or HTML pages fetched from the Web. One of the best-known solutions for doing that in Java is the JSoup library. As the official JSoup website puts it: “It is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jquery-like methods.”

JSoup can be used in Android applications, and in this tutorial we are going to see how to parse an HTML page on Android with it. The tutorial is also available as a video on YouTube.

First, add the JSoup dependency to your Gradle build file:


compile 'org.jsoup:jsoup:1.10.1'

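For our example, we are going to download the content of SSaurel’s Blog and display all the links on its main page. To download the content of a website, JSoup offers the connect method followed by the get method. The get method works synchronously: it blocks the calling thread until the page has been downloaded, and Android does not allow network calls on the main thread. We should therefore call these methods in a separate Thread. Because the application accesses the network, it also needs the android.permission.INTERNET permission declared in its AndroidManifest.xml. Our application will have a simple layout with a Button to launch the download of the website and a TextView to display the links.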

The layout looks like this:


<?xml version="1.0" encoding="utf-8"?>
<RelativeLayout xmlns:android="http://schemas.android.com/apk/res/android"
  xmlns:tools="http://schemas.android.com/tools"
  android:id="@+id/activity_main"
  android:layout_width="match_parent"
  android:layout_height="match_parent"
  android:paddingBottom="@dimen/activity_vertical_margin"
  android:paddingLeft="@dimen/activity_horizontal_margin"
  android:paddingRight="@dimen/activity_horizontal_margin"
  android:paddingTop="@dimen/activity_vertical_margin"
  tools:context="com.ssaurel.jsouptut.MainActivity">

  <Button
    android:id="@+id/getBtn"
    android:layout_width="wrap_content"
    android:layout_height="wrap_content"
    android:text="Get website"
    android:layout_marginTop="50dp"
    android:layout_centerHorizontal="true"/>

  <TextView
    android:id="@+id/result"
    android:layout_width="wrap_content"
    android:layout_height="wrap_content"
    android:text="Result ..."
    android:layout_centerHorizontal="true"
    android:layout_marginTop="30dp"
    android:layout_below="@id/getBtn"
    android:textSize="17sp"/>
</RelativeLayout>

In the main Activity of the application, we get references to the Button and the TextView from our layout. Then, we set a click listener on the Button to start the download of the website when the user clicks it.

In the getWebsite() method, we create a new Thread to download the content of the website. We use the static connect() method of the Jsoup class to set up a connection to the website, then we call the get() method to download and parse the content. The get() call returns a Document instance. We call the select() method of this instance with the CSS query a[href] to get all the links in the content. This query returns an Elements instance, and finally we just have to iterate over the elements it contains to collect the address and text of each link for display.

At the end of our background Thread, we refresh the UI with the links retrieved from the website. This refresh is wrapped in a runOnUiThread call because UI elements may only be updated from the main (UI) thread, not from a separate thread.
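The hand-off between the two threads can be tried outside Android with plain Java. The following is only a minimal sketch under stated assumptions: the single-thread executor named "ui" stands in for Android's main thread, FakeTextView stands in for the real TextView, and the class and method names are hypothetical, not part of Android or JSoup.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

class HandOffSketch {

  // Hypothetical stand-in for the TextView: remembers its text and the
  // name of the thread that last updated it.
  static final class FakeTextView {
    private volatile String text = "Result ...";
    private volatile String updatedOn = null;

    void setText(String value) {
      text = value;
      updatedOn = Thread.currentThread().getName();
    }

    String getText() { return text; }
    String getUpdatedOn() { return updatedOn; }
  }

  // Runs the blocking work on a worker thread, then posts the result to the
  // stand-in UI thread, which is the role runOnUiThread plays on Android.
  static FakeTextView run(final Supplier<String> blockingWork) {
    final FakeTextView result = new FakeTextView();
    final ExecutorService uiThread =
        Executors.newSingleThreadExecutor(r -> new Thread(r, "ui"));
    final CountDownLatch done = new CountDownLatch(1);

    new Thread(() -> {
      final String text = blockingWork.get(); // slow work stays off the "ui" thread
      uiThread.execute(() -> {                // only the "ui" thread touches the view
        result.setText(text);
        done.countDown();
      });
    }, "worker").start();

    try {
      // Wait for the update so the caller can inspect the view (demo only).
      done.await(5, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    uiThread.shutdown();
    return result;
  }

  public static void main(String[] args) {
    FakeTextView view = run(() -> "Link : http://www.ssaurel.com/blog");
    System.out.println(view.getText() + " (set on thread: " + view.getUpdatedOn() + ")");
  }
}
```

In a real Activity there is no need to wait on a latch: the worker thread simply ends after handing the update to the UI thread.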

The code of the MainActivity looks like this:


package com.ssaurel.jsouptut;

import android.os.Bundle;
import android.support.v7.app.AppCompatActivity;
import android.view.View;
import android.widget.Button;
import android.widget.TextView;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;

public class MainActivity extends AppCompatActivity {

  private Button getBtn;
  private TextView result;

  @Override
  protected void onCreate(Bundle savedInstanceState) {
    super.onCreate(savedInstanceState);
    setContentView(R.layout.activity_main);
    result = (TextView) findViewById(R.id.result);
    getBtn = (Button) findViewById(R.id.getBtn);
    getBtn.setOnClickListener(new View.OnClickListener() {
      @Override
      public void onClick(View view) {
        getWebsite();
      }
    });
  }

  private void getWebsite() {
    new Thread(new Runnable() {
      @Override
      public void run() {
        final StringBuilder builder = new StringBuilder();

        try {
          Document doc = Jsoup.connect("http://www.ssaurel.com/blog").get();
          String title = doc.title();
          Elements links = doc.select("a[href]");

          // The page title comes first, then one entry per link.
          builder.append(title).append("\n");

          for (Element link : links) {
            builder.append("\n").append("Link : ").append(link.attr("href"))
                .append("\n").append("Text : ").append(link.text());
          }
        } catch (IOException e) {
          // Network or HTTP failure: show the error message instead of the links.
          builder.append("Error : ").append(e.getMessage()).append("\n");
        }

        runOnUiThread(new Runnable() {
          @Override
          public void run() {
            result.setText(builder.toString());
          }
        });
      }
    }).start();
  }
}
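One detail worth knowing: link.attr("href") returns the attribute exactly as written in the page, so relative links such as /about come back unresolved. JSoup can resolve them itself with link.absUrl("href"). The JDK-only sketch below merely illustrates what that resolution amounts to, using java.net.URI; the class name, the sample hrefs, and the trailing slash added to the blog address are assumptions made for the example.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class AbsoluteLinks {

  // Resolves each href against the address of the page it was found on.
  // URI.create throws IllegalArgumentException for malformed input
  // (for example an href containing spaces), so real pages may need a guard.
  static List<String> resolveAll(String pageUrl, List<String> hrefs) {
    URI base = URI.create(pageUrl);
    List<String> absolute = new ArrayList<>();
    for (String href : hrefs) {
      absolute.add(base.resolve(href).toString());
    }
    return absolute;
  }

  public static void main(String[] args) {
    // Hypothetical hrefs: root-relative, page-relative, and already absolute.
    List<String> hrefs = Arrays.asList("/about", "post/1", "https://example.com/x");
    for (String url : resolveAll("http://www.ssaurel.com/blog/", hrefs)) {
      System.out.println(url);
    }
    // prints:
    // http://www.ssaurel.com/about
    // http://www.ssaurel.com/blog/post/1
    // https://example.com/x
  }
}
```

In the MainActivity above, the equivalent change would be to append link.absUrl("href") instead of link.attr("href").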

The last step is to run the application and enjoy the final result: all the links of SSaurel’s blog displayed on the screen.
