Write a Python script.
Download https://www.genecards.org/cgi-bin/cardlisttxt.pl and store it in a flat file.
The GeneCards database currently contains 270,168 GeneCards
Parse the first 10 genes from each series (1A9N_Q-ZZZ3) at https://www.genecards.org/cgi-bin/carddisp.pl?gene=GENE NAME (replace GENE NAME with the gene symbol).
If a series has fewer than 10 genes, parse all of them.
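The selection step can be sketched as below. The task does not define "series", so this sketch assumes a series is the set of gene symbols sharing the same first character. The function name first_n_per_series and the sample list are hypothetical and only illustrate the idea.

```python
def first_n_per_series(symbols, n=10):
    """Group gene symbols by first character and keep at most n per group.

    Assumption: a "series" is every symbol that starts with the same
    character. Adjust the key if the assignment means something else.
    """
    groups = {}
    for symbol in symbols:
        symbol = symbol.strip()
        if not symbol:
            continue  # skip blank lines from the flat file
        key = symbol[0].upper()
        bucket = groups.setdefault(key, [])
        if len(bucket) < n:
            bucket.append(symbol)
    return groups


# Hypothetical sample; the real input would come from the stored gene list.
sample = ["1A9N_Q", "A1BG", "A1CF", "A2M", "ZZZ3"]
selected = first_n_per_series(sample, n=2)
```

A series with fewer than n symbols is returned in full, which covers the "parse all" case.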
Extract the Genomic Locations for GENE NAME and store them in a separate file for each gene you parse. For example:
Open https://www.genecards.org/cgi-bin/carddisp.pl?gene=A1BG
Scrape the section “Genomic Locations for A1BG Gene”; you will see:
Genomic Locations for A1BG Gene
chr19:58,345,178-58,353,492(GRCh38/hg38)
Size:8,315 bases
Orientation:Minus strand
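Once the location line has been scraped, it can be split into its parts with a regular expression. This is a minimal sketch that assumes the text has the shape shown in the example above. The names LOCATION_RE and parse_location are hypothetical.

```python
import re

# Matches strings such as "chr19:58,345,178-58,353,492(GRCh38/hg38)".
LOCATION_RE = re.compile(
    r"(?P<chrom>chr\w+):(?P<start>[\d,]+)-(?P<end>[\d,]+)\s*\((?P<assembly>[^)]+)\)"
)


def parse_location(text):
    """Return the parts of a genomic location string, or None if no match."""
    match = LOCATION_RE.search(text)
    if match is None:
        return None
    # Coordinates are printed with thousands separators; strip them first.
    start = int(match.group("start").replace(",", ""))
    end = int(match.group("end").replace(",", ""))
    return {
        "chrom": match.group("chrom"),
        "start": start,
        "end": end,
        "assembly": match.group("assembly"),
        "size": end - start + 1,  # inclusive range, in bases
    }


location = parse_location("chr19:58,345,178-58,353,492(GRCh38/hg38)")
```

For the A1BG example the computed size is 8,315 bases, which agrees with the Size line on the page.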
Store the scraped output in a file and render it in HTML so that it looks the way it does on GeneCards.
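The storing and rendering step could look roughly like the sketch below. It assumes the three values have already been scraped as plain strings. The function names render_location_html and save_gene_files, and the one-file-per-gene naming scheme, are hypothetical. The markup is a plain approximation, not a copy of the GeneCards page styling.

```python
import html
from pathlib import Path


def render_location_html(gene, location, size, orientation):
    """Build a small HTML page that mirrors the text layout of the section."""
    rows = [location, f"Size:{size}", f"Orientation:{orientation}"]
    body = "\n".join(f"    <p>{html.escape(row)}</p>" for row in rows)
    return (
        "<!DOCTYPE html>\n"
        "<html>\n  <body>\n"
        f"    <h2>Genomic Locations for {html.escape(gene)} Gene</h2>\n"
        f"{body}\n"
        "  </body>\n</html>\n"
    )


def save_gene_files(gene, location, size, orientation, out_dir="."):
    """Write GENE.txt and GENE.html for one gene into out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    text = (
        f"Genomic Locations for {gene} Gene\n"
        f"{location}\nSize:{size}\nOrientation:{orientation}\n"
    )
    (out / f"{gene}.txt").write_text(text, encoding="utf-8")
    page = render_location_html(gene, location, size, orientation)
    (out / f"{gene}.html").write_text(page, encoding="utf-8")


page = render_location_html(
    "A1BG", "chr19:58,345,178-58,353,492(GRCh38/hg38)", "8,315 bases", "Minus strand"
)
```

Calling save_gene_files("A1BG", ...) with the same arguments would write A1BG.txt and A1BG.html into the chosen directory.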
SOLUTION TO THE ABOVE QUESTION
SOLUTION CODE
import requests
import html

# Define a function that requests the GeneCards gene list.
def gene_card_request():
    # The gene list is served at https://www.genecards.org/cgi-bin/cardlisttxt.pl
    url_to_request = 'https://www.genecards.org/cgi-bin/cardlisttxt.pl'
    # Send a browser-like User-Agent with the request.
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
    r = requests.get(url_to_request, headers=headers, timeout=30)
    # Raise an exception on HTTP errors (for example 403) instead of returning an error page.
    r.raise_for_status()
    # Convert HTML entities such as &amp; back to plain characters.
    gene_card_html = html.unescape(r.text)
    return gene_card_html

print(gene_card_request())
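The script above only prints the response, while the task asks for the list to be stored in a flat file. A minimal sketch of that step follows. The helper save_flat_file and the file name genecards_list.txt are hypothetical, and the demo uses stand-in text, since the exact layout of the downloaded list is not shown here.

```python
import tempfile
from pathlib import Path


def save_flat_file(text, path):
    """Write text to path as UTF-8 and return the number of non-empty lines."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")
    return sum(1 for line in text.splitlines() if line.strip())


# Demo with stand-in text written to a temporary directory. In the script
# above, the text would be the string returned by gene_card_request().
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "genecards_list.txt"
    count = save_flat_file("A1BG\nA1CF\n", target)
    saved = target.read_text(encoding="utf-8")
```

In the real script the call would be something like save_flat_file(gene_card_request(), "genecards_list.txt").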