Browsertrix on a Raspberry Pi
Browsertrix runs on a Pi! If you have a few around you can use them to help scrape sites for SUCHO!
Here is what you’ll need:
- 1 (or more) Raspberry Pi – only the Pi 3 and Pi 4 work for this project!
- 1 (or more) raspberry pi power source
- 1 keyboard
- 1 screen & HDMI cable
- Wifi or LAN with internet access of course
- 1 (or more) SD card 64GB+
- A PC, Mac or other machine to flash the SD card
- The Raspberry Pi Imager software: https://www.raspberrypi.com/software/.
You may also need:
- An hdmi adapter for your pi
- A USB adapter for your pi
- A card reader adapter for your PC/Mac
OPTIONAL:
Another tool that I use is piTunnel (https://www.pitunnel.com/), which gives you remote access to the pi’s command line through a web interface. It’s not free, though. You can also SSH into your machine if you’d like.
Setting up the Pi:
Step 1 – flash the SD Card
- Insert your SD card into your PC/MAC
- Open the Raspberry Pi imager software
- Click “choose OS”
- Click the second option from the list “Raspberry Pi OS (other)”
- Click “Raspberry Pi OS Lite (64-bit)”
- Click “Choose storage”
- Click the row that shows the SD card that you inserted in the first step of this section
- Click “write”
- The software will warn you about writing to the card, just double check that you selected your SD card. Make a correction if necessary, otherwise continue
- Once the flash is complete you can remove the SD card
Step 2 – Setup the Pi
Powering up
- Insert the flashed SD card into the pi
- Attach a keyboard to one of the usb ports
- Connect a screen using the HDMI port
- Insert the power adapter
- If your power adapter has a switch, just make sure to flip the switch on
Configuring the Pi
- When prompted, enter the default username “pi”
- Note – the prompt appears towards the end of the bootup process, but it sometimes gets buried in the bootup text. If the bootup seems to have stalled, the prompt is probably mixed in with that text somewhere. To advance, just type “pi” and hit enter/return
- When prompted, enter the default password “raspberry”
- You should now have access to the command line. If not, review the steps above and make sure you didn’t miss a step
- Enter the following command and press enter/return to access the pi settings GUI:
sudo raspi-config
Localization Options
- From the raspi-config main menu choose “5 Localisation Options”
- Select “L1 Locale”
- The locale is set to the default “en_GB.UTF-8 UTF-8”. If this is your locale you can skip the rest of the locale steps and go straight to the timezone; if not, you’ll likely want to complete them.
- Navigate to your locale and press the space bar to select it. You can deselect the default with the space bar as well. I’m in the US, so I’m using “en_US.UTF-8 UTF-8”
- Once you have your locale selected press enter/return
- In the next screen select your locale from the list to set it as the default and press enter/return. The system will then set the locale – it takes maybe 30 seconds
- Go back into “5 Localisation Options”
- Select “L2 Timezone”
- Select the appropriate geographic area and press enter/return
- Select the appropriate timezone and press enter/return
- Go back into “5 Localisation Options”
- Select “L3 Keyboard”
- Select a keyboard from the list (In the US I always use the default “Generic 105-key PC (intl.)” option)
- Select a proper layout.
- IMPORTANT steps for US folks:
- select “other”
- Select “English (US)”
- Select “English (US)” again
- Select “The default for the keyboard layout”
- Select “No Compose Key”
- This will map the US keyboard character set properly. Missing this step will make the command line more difficult to use
System Options
- Navigate to and select “1 System Options”
- Select “S3 password”
- Press enter/return to advance to the password screen when prompted
- In the bottom left of the screen you’ll be prompted to enter a new password, simply enter any secure password of your choosing so that the pi doesn’t have a default password anymore. You’ll also be asked to re-enter the password
- Once the password is set successfully you will see a confirmation message which you can close by pressing enter/return
- Select “1 System Options” again
- Select “S1 Wireless LAN” to setup wifi
- Choose your country from the list
- Press enter/return to advance through the prompt
- Enter the SSID of the wireless network that you’d like to use and press enter/return
- Enter the wifi password if necessary. If it is an open network, simply press enter/return to leave the field blank
- OPTIONAL - autologin to command line on boot (if you don’t want to login after every boot/reboot):
- Select “1 System Options”
- Select “S5 Boot / Auto Login”
- Select “B2 Console Autologin”
Expanding the filesystem
- From the raspi-config main menu choose “Advanced Options”
- Select “A1 Expand Filesystem” and press enter/return – this will make sure you can use the entire SD card
Review other options (optional)
At this point the base system is set up, but there are other options in the menu to choose from. I won’t go over all of them, but there may be other settings that you’d like to switch on, like SSH or fan controls. You can always come back and update them later on as well.
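If you are setting up several Pis, some of the menu choices above can also be applied from the command line using raspi-config’s non-interactive mode. The following is only a sketch under assumptions: `apply_pi_settings` is a hypothetical helper name, and the locale, timezone and country values are examples for a US-based setup that you should change to match your own.

```shell
#!/bin/sh
# apply_pi_settings is a hypothetical helper. The values below are examples
# (assumptions) for a US-based setup; change them before use.
apply_pi_settings() {
    sudo raspi-config nonint do_change_locale en_US.UTF-8
    sudo raspi-config nonint do_change_timezone America/New_York
    sudo raspi-config nonint do_wifi_country US
    sudo raspi-config nonint do_expand_rootfs
}

# Only apply the settings on a machine that actually has raspi-config.
if command -v raspi-config >/dev/null 2>&1; then
    apply_pi_settings
else
    echo "raspi-config not found: run this on the Pi itself"
fi
```

The password, wifi network and keyboard layout are left to the menu steps above, and a reboot is still needed afterwards.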
Reboot
- From the main menu of the raspi-config press the left/right arrows on your keyboard to select “finish” and press enter/return
- When prompted choose “yes” to reboot the pi
- If you accidentally press “no”, type the following into the command line and press enter/return:
sudo reboot
Using the command line
Once your system is rebooted you should be brought right to the command line if you turned on autologin. Otherwise, log in with the username “pi” and the password you set earlier. If neither works, review the pi setup section and make sure you didn’t miss anything.
Update and Upgrade
- Enter the following command and press enter/return to check for any updates:
sudo apt-get update
If prompted, type Y and press enter/return to continue
- Enter the following command and press enter/return to get any upgrades:
sudo apt-get upgrade
If prompted, type Y and press enter/return to install any available upgrades
Installing Docker
- Enter the following command and press enter/return:
sudo apt-get install docker.io
If prompted, type Y and press enter/return to confirm the installation
- Enter the following command and press enter/return:
sudo apt-get install docker-compose
If prompted, type Y and press enter/return to confirm the installation
- If prompted about a kernel update:
- press enter/return to close the message
- leave the settings as-is and press enter/return
- Type the following command and press enter/return to reboot the system:
sudo reboot
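Once the Pi is back up, a quick check along these lines can confirm that both packages are available on the command line. This is a minimal sketch; `check_tool` is a hypothetical helper name used only for illustration.

```shell
#!/bin/sh
# check_tool is a hypothetical helper: it reports whether a command is on the PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: MISSING"
        return 1
    fi
}

# The two packages installed in this section provide these commands.
for tool in docker docker-compose; do
    check_tool "$tool" || echo "  re-run the install step for $tool"
done
```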
Installing Browsertrix
- Enter the following command and press enter/return:
sudo service docker start
- Enter the following command and press enter/return:
sudo docker pull webrecorder/browsertrix-crawler
This may take a little while to download and decompress
Creating a configuration YAML file
- Enter the following command and press enter/return:
sudo nano crawl-config.yaml
- Follow the steps outlined at https://www.sucho.org/browsertrix in the section “Creating a configuration YAML file” to populate your YAML file.
- Note – indentation in the “seeds” area is necessary and you must use spaces, not tabs
- To save your YAML file press ctrl+x
- When prompted to save press Y
- When prompted to name your file, just make sure it’s named “crawl-config.yaml” and press enter/return
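If you would rather not type the file by hand in nano, a file with the same shape can be written from the command line. The sketch below rests on assumptions: the seed URL is a placeholder, and the file name crawl-config.example.yaml is chosen here so that it does not overwrite your real config. Use the settings from the SUCHO page for an actual crawl. It also checks for tab characters, since the indentation must use spaces.

```shell
#!/bin/sh
# Write a placeholder config. https://example.com/ is NOT a real crawl target,
# and the file name is an assumption chosen to avoid clobbering crawl-config.yaml.
cat > crawl-config.example.yaml <<'EOF'
seeds:
  - url: https://example.com/
EOF

# YAML indentation must use spaces, so flag any tab characters in the file.
tab=$(printf '\t')
if grep -q "$tab" crawl-config.example.yaml; then
    echo "tabs found: replace them with spaces"
else
    echo "no tabs found"
fi
```

The same tab check can be pointed at crawl-config.yaml after editing it in nano.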
Running Browsertrix
- If you just booted up the machine, make sure that you start the docker service:
sudo service docker start
- Enter the following command (all one line) and press enter/return:
sudo docker run -v $PWD/crawl-config.yaml:/app/crawl-config.yaml -v $PWD/crawls:/crawls/ webrecorder/browsertrix-crawler crawl --config /app/crawl-config.yaml --text --generateWACZ
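The docker command above mounts crawl-config.yaml and a crawls folder from the current directory, so it helps to confirm you are in the right place before starting a long crawl. A minimal sketch, assuming you run it from the directory that holds your config file; `preflight` is a hypothetical helper name.

```shell
#!/bin/sh
# preflight is a hypothetical helper: it checks the directory that will be
# mounted into the container before you start the crawl.
preflight() {
    dir="$1"
    if [ ! -f "$dir/crawl-config.yaml" ]; then
        echo "missing: $dir/crawl-config.yaml"
        return 1
    fi
    # The docker command mounts this folder to hold the crawl output.
    mkdir -p "$dir/crawls"
    echo "ready: $dir"
}

preflight "$PWD" || echo "create crawl-config.yaml here before running the crawl"
```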
Exiting Browsertrix
To exit Browsertrix simply press ctrl+c
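After the crawl ends, the output is in the crawls folder that was mounted in the docker command. Here is a sketch for locating any WACZ files. The exact sub-folder layout may vary, so it searches the whole folder rather than assuming a path; `list_wacz` is a hypothetical helper name.

```shell
#!/bin/sh
# list_wacz is a hypothetical helper: print every .wacz file under a folder.
list_wacz() {
    find "$1" -type f -name '*.wacz' 2>/dev/null
}

# Files written by the container may be owned by root; use sudo if needed.
list_wacz "$PWD/crawls" || echo "could not read $PWD/crawls"
```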
HAPPY SCRAPING!