Finding stuff on the web is fun, especially when you scrape data from the command line.
In this tutorial, you'll use your terminal to get data from Reddit, specifically free stuff posted on subreddits like r/Udemy, where users share the latest coupons. The script is customizable, so you can apply it to any other subreddit.
Minimal example
Let's take it step by step and see first how to scrape Reddit. Do you really need API credentials to get this info? In fact, you just need to know the endpoint you want and then use a command-line utility such as curl or wget to fetch data from it.
curl -sA 'udemy subreddit scraper' 'https://www.reddit.com/r/udemy/top.json?t=month'
That line returns the top posts of the last month from the udemy subreddit as JSON. curl is used here with two options: -s to run in silent mode while downloading, and -A to set the user agent to the string that follows.
Getting the titles of subreddit posts
If you want to see the hierarchy of that JSON, pipe the output into the jq program. To get the titles of those top Reddit posts, read the value of the title key, which lives inside each data object; each of those data objects is an element of the children array under the top-level data key.
curl -sA 'udemy subreddit scraper' 'https://www.reddit.com/r/udemy/top.json?t=month' | jq '.data.children[].data.title'
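If you want to try that jq path without hitting the network, you can feed jq a hand-written stand-in for the listing structure Reddit returns (the sample JSON below is made up, kept to just the keys we care about):

```shell
# Minimal stand-in for Reddit's listing shape: data -> children[] -> data -> title
echo '{"data":{"children":[{"data":{"title":"Free Udemy Course"}},{"data":{"title":"Another post"}}]}}' \
  | jq '.data.children[].data.title'
# prints:
# "Free Udemy Course"
# "Another post"
```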
Getting free stuff
Let's see how to filter the titles by the keyword Free. This assumes that free stuff only appears when a free Udemy course is stated explicitly with the word Free in the title:
$ curl -sA 'udemy subreddit scraper' 'https://www.reddit.com/r/udemy/top.json?t=month' | jq '.data.children[].data | select(.title|test("Free")).title'
"List of 40+ Free & Some Best Selling Discounted Tuesday, June 22, 2021"
"Free Udemy Course - 4 July 2021"
Here, we used select to keep only the children whose title matches, piping the title into the test function, which matches the regex Free anywhere in the string.
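You can check the matching behavior of test on its own with a few made-up titles (no network needed, since jq happily reads raw JSON strings from stdin):

```shell
# test("Free") is case-sensitive by default: only the exact substring "Free" matches
printf '%s\n' '"Free course"' '"FREE course"' '"free course"' \
  | jq 'select(test("Free"))'
# prints only:
# "Free course"
```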
But this still has a problem: we don't get any result whose title contains FREE (capitalized) or any other variation of the word. For example, free, FRee, FREe, and FREE aren't returned by that command. We can fix that by adding the case-insensitive flag "i" to the test function:
$ curl -sA 'udemy subreddit scraper' 'https://www.reddit.com/r/udemy/top.json?t=month' | jq '.data.children[].data | select(.title|test("free"; "i")).title'
"18 FREE Programming courses on Udemy! 3 days only!"
"List of 40+ Free & Some Best Selling Discounted Tuesday, June 22, 2021"
"Free Udemy Course - 4 July 2021"
"[FREE] Video Editing Courses - Adobe Premiere, After Effects, Davinci Resolve, Photoshop"
Now we get more results: any case variation of the word free in a title is matched.
Now, let’s customize it and put it in a bash script:
#!/bin/bash
SUBREDDIT="$1"
curl -sA 'subreddit reader' \
  "https://www.reddit.com/r/${SUBREDDIT}/top.json?t=month" \
  | jq '.data.children[].data | select(.title|test("free"; "i")).title'
We now pass the subreddit name as an argument. Save that bash script as e.g. top_free.sh, then run it with the name of the subreddit you want to scrape the free stuff from:
$ chmod u+x top_free.sh # make the script executable
$ ./top_free.sh udemy # replace udemy with whatever subreddit you want
"18 FREE Programming courses on Udemy! 3 days only!"
"List of 40+ Free & Some Best Selling Discounted Tuesday, June 22, 2021"
"Free Udemy Course - 4 July 2021"
"[FREE] Video Editing Courses - Adobe Premiere, After Effects, Davinci Resolve, Photoshop"
More customized command line
Let's make the bash script even more customizable by turning the query keyword into a parameter instead of hard-coding 'free':
#!/bin/bash
SUBREDDIT="$1"
QUERY="${2:-free}"
curl -sA 'subreddit reader' \
  "https://www.reddit.com/r/${SUBREDDIT}/top.json?t=month" \
  | jq --arg q "$QUERY" '.data.children[].data | select(.title|test($q; "i")) | .title'
The parameter expansion ${2:-free} means the query comes from the second argument; if that argument is unset or empty, the query defaults to the word free. Passing the query to jq with --arg avoids interpolating shell input into the jq program.
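The ${2:-free} expansion is easy to check in isolation. Here's a small sketch (the demo function is just for illustration, not part of the script):

```shell
# ${2:-free}: use the second argument if set and non-empty, otherwise default to "free"
demo() {
  QUERY="${2:-free}"
  echo "subreddit=$1 query=$QUERY"
}

demo udemy            # prints: subreddit=udemy query=free
demo udemy courses    # prints: subreddit=udemy query=courses
```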
Let's save this bash script as top_stuff.sh and look for titles that contain the keyword courses:
$ chmod u+x top_stuff.sh
$ ./top_stuff.sh udemy courses
"18 FREE Programming courses on Udemy! 3 days only!"
"13 Udemy (100% off Coupons) Programming Courses [Limited Time]"
"Udemy 15 (100% off Coupons) Programming Courses [Limited Time]"
"A tendency in coding courses I find really annoying"
"[FREE] Video Editing Courses - Adobe Premiere, After Effects, Davinci Resolve, Photoshop"
Getting both titles and URLs
Now, we have the desired titles. Why don’t we return the URL of each title so that we can click on it and explore that subreddit post and comments?
#!/bin/bash
SUBREDDIT="$1"
QUERY="${2:-free}"
curl -sA 'subreddit reader' \
  "https://www.reddit.com/r/${SUBREDDIT}/top.json?t=month" \
  | jq --arg q "$QUERY" '.data.children[].data | select(.title|test($q; "i")) | {title, url} | .[]'
Here, we emit both the title and url of each child using object construction ({title, url}), as described in the jq documentation; the trailing .[] then outputs the values of that object.
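You can see object construction and .[] at work on a single hand-written post object (the sample title and URL below are made up):

```shell
# {title, url} picks those two keys into a new object; .[] emits its values in order
echo '{"title":"Free Udemy Course","url":"https://example.com/post"}' \
  | jq '{title, url} | .[]'
# prints:
# "Free Udemy Course"
# "https://example.com/post"
```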
When we run that command again, we get:
"18 FREE Programming courses on Udemy! 3 days only!"
"https://www.reddit.com/r/Udemy/comments/ogxrrp/18_free_programming_courses_on_udemy_3_days_only/"
"Free Udemy Course - 4 July 2021"
"https://www.reddit.com/r/Udemy/comments/od89ex/free_udemy_course_4_july_2021/"
"Free Unity + AWS DynamoDB Course! 3 days only!"
"https://www.reddit.com/r/Udemy/comments/oq5iog/free_unity_aws_dynamodb_course_3_days_only/"
"[FREE] Video Editing Courses - Adobe Premiere, After Effects, Davinci Resolve, Photoshop"
"https://www.reddit.com/r/Udemy/comments/ohd0at/free_video_editing_courses_adobe_premiere_after/"
Final thoughts
In this tutorial, we've seen how to scrape the top monthly free stuff from any subreddit, retrieving each post's title and URL so you can view the post and join the community discussion if you want.
We also generalized the script to search for any keyword you want in the subreddit you desire.
Please let me know if you have any further questions, or if you want more scraping posts, just comment below!
Enjoy!
You might be interested in this tutorial, in which I also used jq.